High rate of LLM (GPT5) hallucinations in dense stats domains (cricket)

3•sp1982•2h ago
Disclaimer: I am not an ML researcher, so the terms are informal/wonky. Apologies!

I’m doing a small experiment to see whether models “know when they know” on T20 international cricket scorecards (source: cricsheet.com). The idea is to test models on publicly available data they likely saw during training, and see whether they hallucinate or admit they don't know.

Setup: Each question is from a single T20 match. Model must return an answer (numeric or choice from options) or `no_answer`.
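
A minimal sketch of how a single response could be graded under this setup. The function name `grade`, the `tol` parameter, and the exact matching rules are assumptions made for illustration; the actual grading logic lives in the repo and may differ.

```python
def grade(response, truth, tol=0.0):
    """Return 'no_answer', 'correct', or 'wrong' for one model response.

    Hypothetical helper: abstentions are detected by the literal string
    `no_answer`, numeric answers are compared within `tol`, and anything
    else is compared as a case-insensitive choice label.
    """
    text = response.strip().lower()
    if text == "no_answer":
        return "no_answer"
    try:
        # Numeric question: both sides must parse as numbers.
        return "correct" if abs(float(text) - float(truth)) <= tol else "wrong"
    except ValueError:
        # Multiple-choice question: compare the option text.
        return "correct" if text == str(truth).strip().lower() else "wrong"
```

Keeping the abstention as its own outcome, rather than scoring it as wrong, is what lets the answered-only metrics be computed separately.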

Results (N=100 per model):

- gpt-4o-search-preview • Answer rate: 0.96 • Accuracy: 0.88 • Accuracy (answered): 0.91 • Hallucination (answered): 0.09 • Wrong/100: 9

- gpt-5 • Answer rate: 0.35 • Accuracy: 0.27 • Accuracy (answered): 0.77 • Hallucination (answered): 0.23 • Wrong/100: 8

- gpt-4o-mini • Answer rate: 0.37 • Accuracy: 0.14 • Accuracy (answered): 0.38 • Hallucination (answered): 0.62 • Wrong/100: 23

- gpt-5-mini • Answer rate: 0.05 • Accuracy: 0.02 • Accuracy (answered): 0.40 • Hallucination (answered): 0.60 • Wrong/100: 3
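
For readers wondering how the columns relate, here is a rough sketch of the arithmetic. The metric definitions are inferred from the numbers above and are an assumption, not taken from the repo; `summarize` and the outcome labels are hypothetical names.

```python
from collections import Counter


def summarize(outcomes):
    """outcomes: one of 'correct', 'wrong', 'no_answer' per question."""
    counts = Counter(outcomes)
    n = len(outcomes)
    answered = counts["correct"] + counts["wrong"]
    return {
        # Share of questions where the model did not abstain.
        "answer_rate": answered / n,
        # Correct answers over all questions, abstentions included.
        "accuracy": counts["correct"] / n,
        # Correct and wrong answers over answered questions only.
        "accuracy_answered": counts["correct"] / answered if answered else 0.0,
        "hallucination_answered": counts["wrong"] / answered if answered else 0.0,
        # Wrong answers normalized to 100 questions.
        "wrong_per_100": 100 * counts["wrong"] / n,
    }


# Counts implied by the gpt-5 row: 27 correct, 8 wrong, 65 abstentions.
gpt5 = summarize(["correct"] * 27 + ["wrong"] * 8 + ["no_answer"] * 65)
```

Under these assumed definitions, a model that abstains a lot can post a low wrong-per-100 count while still having a high hallucination rate on the questions it does answer, which is the pattern in the gpt-5-mini row.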

Note: most remaining “errors” with search are obscure/disputed cases where public sources disagree.

It seems that for domains where models might have seen some of the data during training, it’s better to rely on abstention + RAG than on a larger model with more coverage but a worse hallucination rate.

Code/Data: https://github.com/jobswithgpt/llmcriceval

Comments

whinvik•2h ago
Is this exercise done to determine what the model can produce from its training data or is the data shown again to the model?
sp1982•1h ago
From training data.
