
Scientific Insolvency in GPQA and HLE: A forensic audit reveals 58% error rate

https://zenodo.org/records/18293568
2•jopsammy•1h ago

Comments

jopsammy•1h ago
Author here.

I am an independent researcher (originally med background, moved to CS/Physics). I spent the last few weeks manually grading GPQA-Diamond and Humanity's Last Exam (HLE) because my experimental models (DeepSeek-Overclock) were deriving "wrong" answers that looked logically sound.

I conducted a forensic audit of the datasets. I suspect these benchmarks are currently "gaslighting" foundation models.

*Findings:*

* GPQA-Diamond: Inherent error lower bound *26.8%*.
* HLE (Sampled): Inherent error lower bound *~58%*.

Visual Summary of Error Rates: https://i.postimg.cc/nV5hskX2/image1.png

The most shocking finding is in *HLE*, which appears to be riddled with OCR errors from hand-written content, rather than actual "hard" problems. I reverse-engineered these errors by treating the standard answers as "cryptographic hashes" to find the original intended questions.
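To make the "answer as hash" idea concrete, here is a minimal sketch of the kind of search it implies: enumerate plausible readings of the ambiguous tokens, recompute the answer for each, and keep the readings that reproduce the gold answer. This is an illustration only, not the paper's verification code; `matching_readings`, its parameters, and the toy formula are all hypothetical.

```python
import math
from itertools import product
from typing import Callable, Iterable, Mapping


def matching_readings(
    solve: Callable[..., float],
    candidates: Mapping[str, Iterable[float]],
    gold: float,
    rel_tol: float = 1e-2,
) -> list[dict[str, float]]:
    """Return every combination of candidate readings whose computed
    answer matches the gold answer within a relative tolerance."""
    names = list(candidates)
    hits = []
    for values in product(*(candidates[name] for name in names)):
        reading = dict(zip(names, values))
        try:
            answer = solve(**reading)
        except (ValueError, ZeroDivisionError, OverflowError):
            # This reading is not even computable, so skip it.
            continue
        if math.isclose(answer, gold, rel_tol=rel_tol):
            hits.append(reading)
    return hits


if __name__ == "__main__":
    # Toy stand-in for a real problem: answer = a * b + 1.
    # Suppose the transcription leaves two tokens ambiguous.
    toy_solve = lambda a, b: a * b + 1
    print(matching_readings(toy_solve, {"a": [2, 4], "b": [3, 5]}, gold=13))
```

If exactly one reading survives, the gold answer pins down the intended question; if several survive, the answer alone does not settle which transcription was meant.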

*Exhibit A: The "Phantom Parameter" (Physics)*
In a lattice adsorption problem (`66fecb...`), the question text is broken. Working backward from the "Gold Answer" (4.61), I found that it corresponds to a specific physical setup in which the digit `4` was misread as `k` and a strikethrough was interpreted as a deletion.
*See the forensic reconstruction:* https://i.postimg.cc/nhfV2hY9/image2.png

*Exhibit B: The Visual Counterfeit (Math)*
In a complex projective space problem, the benchmark penalizes the correct formula, most likely because the transcriber misread `(n+1)(n+1)` (Rank × Dimension) as `(n+1)^(n+1)` in slanted handwriting.
*See the visual comparison:* https://i.postimg.cc/6TJKMMZR/image3.png
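For a sense of how quickly the two readings diverge, a small sketch that evaluates both expressions quoted above for small `n`. The function names are illustrative and not from the paper.

```python
def product_reading(n: int) -> int:
    """(n+1)(n+1): the reading argued above to be the intended one."""
    return (n + 1) * (n + 1)


def power_reading(n: int) -> int:
    """(n+1)^(n+1): the reading attributed above to a transcription slip."""
    return (n + 1) ** (n + 1)


def divergence_table(max_n: int) -> list[tuple[int, int, int]]:
    """Rows of (n, product reading, power reading) for n = 1..max_n."""
    return [(n, product_reading(n), power_reading(n)) for n in range(1, max_n + 1)]


if __name__ == "__main__":
    for n, prod, power in divergence_table(4):
        print(n, prod, power)
    # 1 4 4
    # 2 9 27
    # 3 16 256
    # 4 25 3125
```

The two readings coincide at `n = 1`, so a spot check at that value would not catch the slip; from `n = 2` onward they separate rapidly.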

*Conclusion:* Because of these errors, valid reasoning from models is being assigned a zero score. We are seemingly optimizing for typo-compatibility, not intelligence.

Full PDF is on Zenodo (linked above). Verification code (~139 scripts) will be open-sourced once I sanitize the repo (having some git access issues atm). Happy to answer questions.

cmrx64•1h ago
this feels a bit like a bombshell given the other recent works on emergent misalignment. how long have we been lying to models?
jopsammy•1h ago
This is a deeply unsettling thought. I hope everyone can see this work. We truly have no idea how many resources have been wasted here.

Polar weather on Jupiter and Saturn hints at the planets' interior details

https://news.mit.edu/2026/polar-weather-jupiter-saturn-hints-planets-interior-details-0119
1•el_duderino•2m ago•0 comments

Show HN: Unfault – A CLI and LSP for code orientation

https://unfault.dev
1•sylvain-h•3m ago•1 comment

How to Build the Life You Want: 3 Takeaways

https://www.mindbodydad.com/mind/build-the-life-you-want
1•Olshansky•3m ago•0 comments

Apple vs. the AI Hype Cycle

https://ericlamb.substack.com/p/apple-vs-the-ai-hype-cycle
1•ericlamb89•5m ago•0 comments

Amazon Ion

https://amazon-ion.github.io/ion-docs/
2•tosh•7m ago•0 comments

You shouldn't trust data collected on MTurk

https://osf.io/preprints/psyarxiv/zs6pk_v1
1•speckx•7m ago•0 comments

Banana Pro – Nano Banana Pro 4K AI Image Generator

https://www.banana-pro.com
1•amierhan•8m ago•0 comments

Show HN: I created Wiz, personal AI agent with Claude Code

https://thoughts.jock.pl/p/wiz-personal-ai-agent-claude-code-2026
1•joozio•10m ago•0 comments

The Zen of Reticulum

https://github.com/markqvist/Reticulum/blob/master/Zen%20of%20Reticulum.md
3•mikece•12m ago•0 comments

Trump Shares Map of US Including Greenland, Canada, Venezuela

https://www.newsweek.com/trump-shares-map-of-us-including-greenland-canada-venezuela-11384438
4•djkivi•13m ago•0 comments

Huge amounts of extra land needed for RFK Jr's meat-heavy diet guidelines

https://www.theguardian.com/environment/2026/jan/20/rfk-jr-trump-meat-diet-guidelines-land
1•ndsipa_pomu•16m ago•0 comments

Show HN: Tycostream – turn Materialize views into real-time GraphQL APIs

https://github.com/tycoworks/tycostream
1•chrisanderson85•16m ago•0 comments

Going to write 1.000.000 lines of code for community projects

https://onemillionlines.com/
1•websku•16m ago•1 comment

Why Your European Business Is Probably Breaking GDPR Law

https://blog.please-open.it/posts/cloud-act-gdpr/
2•mathieupassenau•17m ago•1 comment

How Greenland keeps its eye on independence [pdf]

https://isonomiaquarterly.com/wp-content/uploads/2025/11/iq-3.4-zellen-greenland.pdf
1•brandonlc•17m ago•0 comments

Special Address by President von Der Leyen at the World Economic Forum

https://ec.europa.eu/commission/presscorner/detail/en/speech_26_150
2•armcat•23m ago•0 comments

Concurrent Validity of 16 Commercial Photoplethysmographic Heart Rate Monitors

https://www.mdpi.com/2076-3417/16/1/126
2•PaulHoule•25m ago•0 comments

Creatures in Higher Dimensions [video]

https://www.youtube.com/watch?v=349r0xJFGNw
1•surprisetalk•25m ago•0 comments

Snow Simulation Toy

https://potch.me/2026/snow-simulation-toy.html
1•surprisetalk•25m ago•0 comments

Show HN: Coni – Trust-first Claude Cowork-style agent with permission prompts

https://github.com/coni-ai/coni
1•lime66•25m ago•2 comments

Uca High School Nationals

https://x.com/ucanhscc
1•notgoodme•26m ago•0 comments

A Frustrating Adventure Trying to Design a Logo with AI

https://www.georgesaines.com/blog/2026/1/19/a-frustrating-adventure-trying-to-design-a-logo-with-ai
2•gsaines•26m ago•0 comments

Australian Decacorns

https://www.sohum.com/australian-decacorns/
1•Sohum•27m ago•0 comments

Blogs.hn

https://blogs.hn
1•surprisetalk•27m ago•1 comment

A Light from the Periphery

https://aeon.co/essays/why-satyendra-nath-bose-was-more-than-einsteins-sidekick
1•rifish•28m ago•0 comments

Technological dependence on American software and cloud services

https://www.cigref.fr/technological-dependence-on-american-software-and-cloud-services-an-assessm...
2•DyslexicAtheist•29m ago•0 comments

The 12,000-Year Solar Cycle and other Space Weather – Stefan Burns [video]

https://www.youtube.com/watch?v=HxsIZ4vVImo
1•keepamovin•29m ago•0 comments

Show HN: See how any HN user's AI opinions have evolved over time

https://hnai.vercel.app/
1•skydiver7373•30m ago•0 comments

Nearly all Epstein files still unreleased a month after Congress deadline

https://www.theguardian.com/us-news/2026/jan/19/jeffrey-epstein-files-unreleased-trump-doj
5•treadump•30m ago•0 comments

Reader Scores and Commenting

https://pitchfork.com/news/a-new-era-for-pitchfork-introducing-reader-scores-and-commenting/
1•pentagrama•33m ago•0 comments