I’m doing a small experiment to see whether models “know when they know” on T20 international cricket scorecards (source: cricsheet.org). The idea is to test models on publicly available data they likely saw during training, and see whether they hallucinate or admit they don't know.
Setup: Each question is about a single T20 international match. The model must return an answer (a number or a choice from the given options) or `no_answer`.
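Roughly, each question is posed like this (a simplified sketch, not the exact prompt or helper names in the repo):

```python
from openai import OpenAI

client = OpenAI()

def ask(model: str, question: str, options: list[str] | None = None) -> str:
    """Pose one scorecard question; the model answers or returns 'no_answer'."""
    prompt = question
    if options:
        prompt += "\nOptions: " + ", ".join(options)
    prompt += "\nReply with the answer only, or 'no_answer' if you are not sure."
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```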
Results (N=100 per model; metric definitions are sketched after the list):
- gpt-4o-search-preview • Answer rate: 0.96 • Accuracy: 0.88 • Accuracy (answered): 0.91 • Hallucination (answered): 0.09 • Wrong/100: 9
- gpt-5 • Answer rate: 0.35 • Accuracy: 0.27 • Accuracy (answered): 0.77 • Hallucination (answered): 0.23 • Wrong/100: 8
- gpt-4o-mini • Answer rate: 0.37 • Accuracy: 0.14 • Accuracy (answered): 0.38 • Hallucination (answered): 0.62 • Wrong/100: 23
- gpt-5-mini • Answer rate: 0.05 • Accuracy: 0.02 • Accuracy (answered): 0.40 • Hallucination (answered): 0.60 • Wrong/100: 3
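For clarity, this is roughly how the metrics above are derived from per-question results (a sketch; the repo may structure it differently, and `results` here holds (predicted, gold) answer pairs):

```python
def summarize(results: list[tuple[str, str]]) -> dict[str, float]:
    """Compute answer rate, accuracy, and hallucination metrics for one model."""
    n = len(results)
    answered = [(p, g) for p, g in results if p != "no_answer"]
    correct = sum(p == g for p, g in answered)
    wrong = len(answered) - correct
    return {
        "answer_rate": len(answered) / n,          # fraction of questions answered
        "accuracy": correct / n,                   # correct over all questions
        "accuracy_answered": correct / len(answered) if answered else 0.0,
        "hallucination_answered": wrong / len(answered) if answered else 0.0,
        "wrong_per_100": 100 * wrong / n,          # wrong answers per 100 questions
    }
```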
Note: most of the remaining “errors” for the search-enabled model are obscure or disputed cases where public sources disagree.
Takeaway: for domains where models have likely seen some of the data, abstention plus RAG seems preferable to a larger model that has broader coverage but a worse hallucination rate.
Code/Data: https://github.com/jobswithgpt/llmcriceval