I am an independent researcher (originally med background, moved to CS/Physics). I spent the last few weeks manually grading GPQA-Diamond and Humanity's Last Exam (HLE) because my experimental models (DeepSeek-Overclock) were deriving "wrong" answers that looked logically sound.
I conducted a forensic audit of the datasets. I suspect these benchmarks are currently "gaslighting" foundation models.
*Findings:*
* GPQA-Diamond: inherent error lower bound of *26.8%*.
* HLE (sampled): inherent error lower bound of *~58%*.
Visual summary of error rates: https://i.postimg.cc/nV5hskX2/image1.png
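For anyone wondering what a "lower bound" means here: the bound comes from manually auditing a sample and asking how low the true error rate could plausibly be given what the audit found. Below is a minimal sketch of one way such a bound could be computed (a plain one-sided Clopper-Pearson interval); the counts are placeholders, not my audit numbers, and the real scripts will be in the repo.

```python
# Sketch: one-sided lower confidence bound on the fraction of flawed items,
# from a manually audited sample (exact binomial / Clopper-Pearson).
# The counts below are PLACEHOLDERS, not the actual audit numbers.
from scipy.stats import beta

def error_rate_lower_bound(flawed: int, audited: int, confidence: float = 0.95) -> float:
    """Lower bound on the true error rate given `flawed` bad items among `audited` checked items."""
    if flawed == 0:
        return 0.0
    return float(beta.ppf(1 - confidence, flawed, audited - flawed + 1))

# Example with made-up counts: 60 flawed items found in a 200-item audit.
print(error_rate_lower_bound(flawed=60, audited=200))  # ~0.25 with these placeholder counts
```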
The most shocking finding is in *HLE*, which appears to be riddled with OCR errors from handwritten source material; much of its apparent difficulty comes from transcription damage rather than genuinely "hard" problems. I reverse-engineered these errors by treating the gold answers like cryptographic hashes: searching for the original intended question whose solution reproduces the stored answer.
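To make the "hash" analogy concrete, the search looks roughly like this (a simplified sketch, not the real pipeline; the confusion table and the `solve` callable are stand-ins):

```python
# Sketch of the "gold answer as a hash" idea: enumerate plausible readings of a
# garbled question, solve each one, and keep the readings whose computed answer
# reproduces the official gold answer. The confusion table and `solve` are
# placeholders, not the actual scripts.
from itertools import product
from typing import Callable, Iterator

# Illustrative OCR confusions for handwritten sources (not an exhaustive table).
CONFUSIONS = {"k": ["k", "4"], "l": ["l", "1"], "O": ["O", "0"]}

def candidate_readings(question: str) -> Iterator[str]:
    """Yield every variant of `question` obtained by swapping confusable characters."""
    positions = [i for i, ch in enumerate(question) if ch in CONFUSIONS]
    for combo in product(*(CONFUSIONS[question[i]] for i in positions)):
        chars = list(question)
        for i, ch in zip(positions, combo):
            chars[i] = ch
        yield "".join(chars)

def invert_gold_answer(question: str, gold: float,
                       solve: Callable[[str], float], tol: float = 1e-2) -> list[str]:
    """Return the candidate readings whose solution matches the stored gold answer."""
    matches = []
    for reading in candidate_readings(question):
        try:
            if abs(solve(reading) - gold) < tol:
                matches.append(reading)
        except Exception:
            continue  # this reading is unsolvable as stated
    return matches
```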
*Exhibit A: The "Phantom Parameter" (Physics)*
In a lattice adsorption problem (`66fecb...`), the question text is garbled. I reverse-engineered the "Gold Answer" (4.61) and found it corresponds to a specific physical setup in which the handwritten digit `4` was misread as `k` and a strikethrough was interpreted as a deletion.
*See the forensic reconstruction:*
https://i.postimg.cc/nhfV2hY9/image2.png
*Exhibit B: The Visual Counterfeit (Math)*
In a complex projective space problem, the benchmark penalizes the correct formula because the transcriber likely misread `(n+1)(n+1)` (Rank × Dimension) as `(n+1)^(n+1)` due to slanted handwriting.
*See the visual comparison:*
https://i.postimg.cc/6TJKMMZR/image3.png
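For a sense of how far apart the two readings are numerically (n is just the problem's dimension parameter, plugged in for illustration):

```python
# How far apart the two readings are for small n (illustrative values only).
for n in (1, 2, 3, 4):
    product_reading = (n + 1) * (n + 1)   # (n+1)(n+1): rank x dimension
    power_reading = (n + 1) ** (n + 1)    # (n+1)^(n+1): the suspected misreading
    print(n, product_reading, power_reading)
# n=3: 16 vs 256 -- a grader keyed to one value will never accept the other.
```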
*Conclusion:*
Because of these errors, valid reasoning from models is being assigned a zero score. We are seemingly optimizing for typo-compatibility, not intelligence.
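The mechanism is mundane: graders do exact (or near-exact) matching against the stored gold answer, so a correct solution to the *intended* question scores zero whenever the stored pair is corrupted. A minimal illustration (this grader is a generic stand-in, not HLE's actual harness):

```python
# Minimal stand-in for an exact-match grader (not any benchmark's real harness).
def exact_match_score(model_answer: str, gold_answer: str) -> int:
    return int(model_answer.strip() == gold_answer.strip())

# Exhibit B scenario: the model answers the intended question correctly,
# but the stored gold answer encodes the misread formula.
print(exact_match_score("(n+1)(n+1)", "(n+1)^(n+1)"))  # 0 -- valid reasoning, zero score
```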
Full PDF is on Zenodo (linked above). Verification code (~139 scripts) will be open-sourced once I sanitize the repo (having some git access issues atm). Happy to answer questions.
cmrx64•1h ago
this feels a bit like a bombshell given the other recent works on emergent misalignment. how long have we been lying to models?
jopsammy•1h ago
This is a deeply unsettling thought. I hope everyone can see this work. We truly have no idea how many resources have been wasted here.