I’m doing a small experiment to see whether models “know when they know” on T20 international cricket scorecards (source: cricsheet.org). The idea is to test models on publicly available data they likely saw during training, and see whether they hallucinate or admit they don't know.
Setup: Each question is about a single T20 international match. The model must return an answer (a number or a choice from the given options) or `no_answer`.
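Roughly, each question is posed like this (a simplified sketch, not the exact prompt or helper names in the repo):

```python
from openai import OpenAI

client = OpenAI()

def ask(model: str, question: str, options: list[str] | None = None) -> str:
    """Pose one scorecard question; the model answers or returns 'no_answer'."""
    prompt = question
    if options:
        prompt += "\nOptions: " + ", ".join(options)
    prompt += "\nReply with the answer only, or 'no_answer' if you are not sure."
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```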
Results (N=100 per model; metric definitions are sketched after the list):
- gpt-4o-search-preview • Answer rate: 0.96 • Accuracy: 0.88 • Accuracy (answered): 0.91 • Hallucination (answered): 0.09 • Wrong/100: 9
- gpt-5 • Answer rate: 0.35 • Accuracy: 0.27 • Accuracy (answered): 0.77 • Hallucination (answered): 0.23 • Wrong/100: 8
- gpt-4o-mini • Answer rate: 0.37 • Accuracy: 0.14 • Accuracy (answered): 0.38 • Hallucination (answered): 0.62 • Wrong/100: 23
- gpt-5-mini • Answer rate: 0.05 • Accuracy: 0.02 • Accuracy (answered): 0.40 • Hallucination (answered): 0.60 • Wrong/100: 3
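For clarity, this is roughly how the metrics above are derived from per-question results (a sketch; the repo may structure it differently, and `results` here holds (predicted, gold) answer pairs):

```python
def summarize(results: list[tuple[str, str]]) -> dict[str, float]:
    """Compute answer rate, accuracy, and hallucination metrics for one model."""
    n = len(results)
    answered = [(p, g) for p, g in results if p != "no_answer"]
    correct = sum(p == g for p, g in answered)
    wrong = len(answered) - correct
    return {
        "answer_rate": len(answered) / n,          # fraction of questions answered
        "accuracy": correct / n,                   # correct over all questions
        "accuracy_answered": correct / len(answered) if answered else 0.0,
        "hallucination_answered": wrong / len(answered) if answered else 0.0,
        "wrong_per_100": 100 * wrong / n,          # wrong answers per 100 questions
    }
```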
Note: most of the remaining “errors” for the search-enabled model are obscure or disputed cases where public sources disagree.
Takeaway: for domains where models have likely seen some of the data, abstention plus RAG seems preferable to a larger model that has broader coverage but a worse hallucination rate.
Code/Data: https://github.com/jobswithgpt/llmcriceval