
Soft Contamination Means Benchmarks Test Shallow Generalization

https://arxiv.org/abs/2602.12413
1•cjbarber•1h ago

Comments

cjbarber•1h ago
From the author on X (https://x.com/g_leech_/status/2023384135201349633); everything below quotes the tweet thread:

New paper on a long-shot I've been obsessed with for a year:

How much are AI reasoning gains confounded by expanding the training corpus 10,000x? How much LLM performance is down to "local" generalisation (pattern-matching to hard-to-detect semantically equivalent training data)?

tl;dr

- The OLMo 3 training corpus contains exact duplicates of 50% of the ZebraLogic test set.

- We embed the corpus to find semantic duplicates of test data in the wild. 78% of the CodeForces test set had >=1 semantic duplicate.

- The semantic duplicate rate is maybe >4 in 10,000.

* at least 50% and at least 78%, that is.

arxiv.org/pdf/2602.12413

Imagine you're head of training at OpenAI, and you want your benchmark scores to be meaningful (i.e. to estimate OOD performance).

You have a hard task ahead of you! Your models have seen so much, memorisation is so easy - as is local generalisation (noisy pattern-matching).

What can you do? Well, obviously you take every benchmark you're going to test on and try to "decontaminate" your training corpus (remove test data from the training data).

By default this is just one level above string matching ("n-gram matching" - if sentences overlap in (say) a 13-token window, remove them from the training corpus).
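For concreteness, here's a minimal sketch of that default filter in Python. The whitespace tokenisation and the function names are my simplifying assumptions; real pipelines use the model's tokenizer and hash n-grams to scale:

```python
# Illustrative n-gram decontamination: drop any training doc that
# shares a 13-token window with any test doc. Whitespace tokenisation
# is a simplification; production pipelines tokenise properly and hash.

def ngrams(text, n=13):
    """All contiguous n-token windows of a whitespace-tokenised text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs, test_docs, n=13):
    """Keep only training docs sharing no n-gram with any test doc."""
    test_grams = set()
    for doc in test_docs:
        test_grams |= ngrams(doc, n)
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(test_grams)]
```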

But you're actually trying, so you also translate the test sets and delete translations of test from train.

But! Every piece of test data has an arbitrary number of logical equivalents and neighbours (like how `x + y = 10` is the same problem as `2x + 2y = 20`). And LLMs are amazing at semantic search, so maybe this inflates benchmark scores.

The cutting-edge tech for detecting these "semantic" duplicates is... an LLM. But you simply can't do 100T x 1M calls. There's not enough compute in the world (yet).

So you do what you can - maybe you

- categorise the entire corpus & do intense search inside relevant partitions (e.g. maths > number theory > ...)

- embed the whole corpus & look for things really close to test data (sketched just below)

- train a wee 300M filter model & do what you can with that
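A sketch of the embedding route from that list, assuming you already have unit-normalised embeddings from some encoder. At corpus scale you'd use approximate nearest-neighbour search (e.g. FAISS) rather than a dense matrix product, and the 0.9 threshold is a made-up illustration:

```python
import numpy as np

def semantic_duplicate_hits(train_vecs, test_vecs, threshold=0.9):
    """Return (train_idx, test_idx, cosine) for every pair above threshold.

    train_vecs: (n_train, d) unit-normalised embeddings
    test_vecs:  (n_test, d) unit-normalised embeddings
    """
    sims = train_vecs @ test_vecs.T  # cosine similarity, given unit norms
    hits = np.argwhere(sims >= threshold)
    return [(int(i), int(j), float(sims[i, j])) for i, j in hits]
```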

How much does this process catch? How many semantic duplicates of test data slip through? And what's the impact on final benchmark scores?

We don't know. This (finally) is where our paper comes in:

We experiment on OLMo 3, one of the only really good models with open training data. Since we have its entire training corpus, we can exhaustively check for real "natural" duplicates and finetune it to estimate their impact. We embed the entire Dolma Instruct corpus.

Firstly, we were surprised by how ineffective n-gram decontamination was at catching exact duplicates - 70% of harder tasks had a match. But the spurious performance gain wasn't so large: at most +4pp.

Secondly, every single MBPP test example and 78% of CodeForces have semantic duplicates.

Thirdly, we generated 10k synthetic duplicates of MuSR, ZebraLogic, and MBPP problems and finetuned on them (a sketch of the idea follows the results below).

- MuSR +22pp. Semantic duplicates as strong as exact

- ZebraLogic +12pp. Exact much stronger

- MBPP +17pp. Exact stronger
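A hypothetical sketch of the synthetic-duplicate idea, not the paper's actual pipeline - the prompt wording and the `generate` callable are stand-ins for whatever LLM you have on hand:

```python
# Hypothetical: ask an LLM to rewrite each test problem into a logically
# equivalent variant with minimal surface overlap, then finetune on the
# rewrites and measure the benchmark gain.
REWRITE_PROMPT = (
    "Rewrite the following problem so it is logically equivalent but "
    "shares as little surface wording as possible:\n\n{problem}"
)

def make_semantic_duplicates(problems, generate):
    """`generate` is any text-in/text-out LLM call (a stand-in here)."""
    return [generate(REWRITE_PROMPT.format(problem=p)) for p in problems]
```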

Fourthly, we guess that 4 in 10,000 training datapoints are a strong semantic duplicate of a given benchmark datapoint (where "strong" just means "obvious to Gemini").
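To see what that rate implies, a back-of-envelope in Python (the corpus size is a hypothetical placeholder, not a number from the paper):

```python
duplicate_rate = 4 / 10_000   # P(random training item strongly duplicates a given test item)
corpus_size = 1_000_000       # hypothetical number of training examples
print(duplicate_rate * corpus_size)  # -> 400.0 expected strong duplicates per test item
```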

So: n-gram decontamination is not enough even for the easy (exact) stuff, semantic duplicates are at least a moderately big deal, and this probably transfers to frontier models to some degree. The above are probably underestimates too (since our detection pipeline was cheapo).

Data contamination is a huge field. Here's how we're new

This is preliminary work on a shoestring - we didn't get at the big questions yet ("what share of benchmark gains come from interpolation over a hidden training corpus?", "does this even matter?")

And local generalisation across very different strings is anyway pretty miraculous.

The grand aim of this research programme is to decompose benchmark gains / apparent AI progress into 4 estimates:

1. benchmaxxing (memorising exact duplicates)

2. usemaxxing (RLing narrow capabilities)

3. hidden interpolation / local generalisation

4. OOD generalisation

We have a lot of ideas! If you're interested in funding this, grab me at gavin@arbresearch.com

Nearly all of the real work was done by Ari Spiesberger, Juan_VaGu, Nicky Pochinkov, Tomas Gavenciak, peligrietzer, and NandiSchoots.

And of course this work wouldn't be possible without allen_ai and natolambert working in public and enabling actually scientific evals.

Meta patents AI that takes over a dead person's account to keep posting and chatting

https://www.dexerto.com/entertainment/meta-patents-ai-that-takes-over-a-dead-persons-account-to-k...
1•madihaa•52s ago•0 comments

How We Secure Builds with fs-verity

https://substack.bomfather.dev/p/how-we-secure-builds-with-fs-verity
2•snaveen•1m ago•0 comments

Show HN: Peak Finder – Role-playing an optimizer

https://releaser.itch.io/peak-finder
1•npc0•2m ago•0 comments

Ask HN: What code repository inspires you?

1•mixto•2m ago•0 comments

Eigengrau

https://en.wikipedia.org/wiki/Eigengrau
1•thunderbong•4m ago•0 comments

Wikipedia at 25: A Wake-Up Call

https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2026-01-15/Special_report
1•tonymet•4m ago•0 comments

Show HN: WowAI.pet – Generate cinematic videos from blurry pet photos

https://wowai.pet/
1•zy5a59•7m ago•0 comments

The S-Tier Rust Web Framework and the Priest Who Created It

https://www.youtube.com/watch?v=X_VWAhVhmhc
1•J_Shelby_J•10m ago•0 comments

Show HN: I gave OpenClaw 79 tools. It runs businesses now

https://www.accordio.ai/
1•deduxer•10m ago•0 comments

Solve SF

https://solvesf.com/
1•rmason•12m ago•1 comments

The Online Community Trilemma

https://pluralistic.net/2026/02/16/fast-good-cheap/
1•hn_acker•12m ago•0 comments

X.org Server's "Master" Branch Now Closed with Cleaned Up State on "Main"

https://www.phoronix.com/news/X.Org-Server-On-Main
2•rbanffy•18m ago•0 comments

Factional Drift: How online discussion clusters into factions

https://idiallo.com/blog/factional-drift-online
1•foxfired•20m ago•0 comments

Forge: Scalable Agent RL Framework and Algorithm

https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm
2•jxmorris12•22m ago•0 comments

An experiment and results on CO2 build up in N95 masks [pdf]

https://kylebenzle.com/CO2.pdf
2•hilliardfarmer•25m ago•1 comments

Ask HN: Why can't I reply to "Who wants to be hired?"?

1•ddevnyc•27m ago•4 comments

Natilus Secures Funding to Progress BWB Airliner Plans – Aviation Week Network

https://aviationweek.com/air-transport/aircraft-propulsion/natilus-secures-funding-progress-bwb-a...
1•rbanffy•28m ago•0 comments

Get ready for new Macs and iPads: Apple announces Special Experience on March 4

https://arstechnica.com/apple/2026/02/get-ready-for-new-macs-and-ipads-apple-announces-special-ex...
2•rbanffy•29m ago•0 comments

Humans will probably be able to think in thousands of abstractions

https://gist.github.com/stickynotememo/85beaa362faf67840233df42283316fe
1•stickynotememo•29m ago•0 comments

Dynamic Dialects – ultrasound imaging of the tongue across different UK dialects

https://dynamicdialects.ac.uk/accent-map/
1•camtarn•31m ago•0 comments

Dialectical Bootstrapping

https://www.votito.com/methods/dialectical-bootstrapping/
1•adzicg•32m ago•1 comments

Can consciousness ever be understood – this side of death?

https://www.nature.com/articles/d41586-026-00448-5
1•bookofjoe•32m ago•1 comments

sabotage-linux.neocities.org/blog/about/

https://sabotage-linux.neocities.org/blog/about/
2•1vuio0pswjnm7•33m ago•1 comments

Denver considers kicking out Flock – but still using cameras

https://www.denverpost.com/2026/02/16/denver-flock-cameras-license-plates-other-bidders/
3•therobots927•33m ago•0 comments

PointsCard – The Stripe for local business loyalty

https://pointscard.app
1•getsignl•36m ago•1 comments

Running NanoClaw in a Docker Shell Sandbox

https://www.docker.com/blog/run-nanoclaw-in-docker-shell-sandboxes/
22•four_fifths•39m ago•0 comments

Free Models Router – OpenRouter

https://openrouter.ai/openrouter/free
3•twapi•40m ago•0 comments

Fair Use Blocks Privacy-Motivated Copyright Lawsuit–MCM vs. Perry

https://blog.ericgoldman.org/archives/2026/02/fair-use-blocks-privacy-motivated-copyright-lawsuit...
2•hn_acker•40m ago•0 comments

Route every OpenClaw request to the cheapest Claude model that can handle it

https://github.com/iblai/iblai-openclaw-router
1•tentativeuser•41m ago•1 comments

First ever inhalable gene therapy for cancer gets fast-tracked by FDA

https://www.newscientist.com/article/2515185-first-ever-inhalable-gene-therapy-for-cancer-gets-fa...
3•birriel•41m ago•0 comments