Scientific Insolvency in GPQA and HLE: A forensic audit reveals 58% error rate

https://zenodo.org/records/18293568
3•jopsammy•2w ago

Comments

jopsammy•2w ago
Author here.

I am an independent researcher (originally med background, moved to CS/Physics). I spent the last few weeks manually grading GPQA-Diamond and Humanity's Last Exam (HLE) because my experimental models (DeepSeek-Overclock) were deriving "wrong" answers that looked logically sound.

I conducted a forensic audit of the datasets. I suspect these benchmarks are currently "gaslighting" foundation models.

*Findings:*

* GPQA-Diamond: Inherent error lower bound *26.8%*.
* HLE (Sampled): Inherent error lower bound *~58%*.

Visual Summary of Error Rates: https://i.postimg.cc/nV5hskX2/image1.png

The most shocking finding is in *HLE*, which appears to be riddled with OCR errors from hand-written content, rather than actual "hard" problems. I reverse-engineered these errors by treating the standard answers as "cryptographic hashes" to find the original intended questions.
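The "answer as hash" idea can be illustrated with a minimal sketch. This is not the audit's verification code: `toy_formula`, the candidate range, and the gold value 17.0 are all hypothetical stand-ins chosen for illustration. The point is only the shape of the search: enumerate plausible readings of a garbled symbol and keep the ones that reproduce the published answer.

```python
import math
from typing import Callable, Iterable


def matching_readings(
    formula: Callable[[float], float],
    candidates: Iterable[float],
    gold: float,
    rel_tol: float = 1e-3,
) -> list[float]:
    """Return the candidate values whose formula output reproduces the gold answer."""
    return [c for c in candidates if math.isclose(formula(c), gold, rel_tol=rel_tol)]


def toy_formula(p: float) -> float:
    """Hypothetical stand-in for the real physics; NOT the lattice adsorption model."""
    return p ** 2 + 1


if __name__ == "__main__":
    # Suppose a garbled symbol could be any single digit 0-9 (an assumption).
    hits = matching_readings(toy_formula, range(10), gold=17.0)
    print(hits)  # [4]
```

If exactly one reading survives, that is evidence (not proof) of what the original question intended. Several survivors would mean the gold answer alone does not pin down the question.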

*Exhibit A: The "Phantom Parameter" (Physics)*
In a lattice adsorption problem (`66fecb...`), the text is broken. I successfully reverse-engineered the "Gold Answer" (4.61) and found it corresponds to a specific physical setup where the text digit `4` was misread as `k`, and a strikethrough was interpreted as a deletion.
*See the forensic reconstruction:* https://i.postimg.cc/nhfV2hY9/image2.png

*Exhibit B: The Visual Counterfeit (Math)*
In a complex projective space problem, the benchmark penalizes the correct formula because the transcriber likely misread `(n+1)(n+1)` (Rank × Dimension) as `(n+1)^(n+1)` due to slanted handwriting.
*See the visual comparison:* https://i.postimg.cc/6TJKMMZR/image3.png
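To show how far apart the two readings in Exhibit B are, here is a small sketch that evaluates both for small `n`. The function names are mine, not from the audit. The two expressions agree only at `n = 0` and `n = 1`, so a grader keyed to the exponent reading would reject the product reading for every `n >= 2`.

```python
def product_reading(n: int) -> int:
    """(n+1)(n+1): the Rank x Dimension reading."""
    return (n + 1) * (n + 1)


def power_reading(n: int) -> int:
    """(n+1)^(n+1): the reading attributed to the transcription error."""
    return (n + 1) ** (n + 1)


def disagreements(max_n: int) -> list[int]:
    """Values of n in [0, max_n] where the two readings give different results."""
    return [n for n in range(max_n + 1) if product_reading(n) != power_reading(n)]


if __name__ == "__main__":
    for n in range(5):
        print(n, product_reading(n), power_reading(n))
    print(disagreements(4))  # [2, 3, 4]
```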

*Conclusion:* Because of these errors, valid reasoning from models is being assigned a zero score. We are seemingly optimizing for typo-compatibility, not intelligence.

Full PDF is on Zenodo (linked above). Verification code (~139 scripts) will be open-sourced once I sanitize the repo (having some git access issues atm). Happy to answer questions.

cmrx64•2w ago
this feels a bit like a bombshell given the other recent works on emergent misalignment. how long have we been lying to models?
jopsammy•2w ago
This is a deeply unsettling thought. I hope everyone can see this work. We truly have no idea how many resources have been wasted here.