Why SWE-bench Verified no longer measures frontier coding capabilities

https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/

60•kmdupree•2h ago

Comments

w4yai•1h ago

I don't understand these websites which force translation to my native language.

I mean, it's fine as it's useful for many people, but where is the button for disabling it ? Or why is it enabled by default ?

"codage de pointe" sounds so weird and cringe in French.

Toutouxc•1h ago

Same for apps and games. I understand English just fine, no need to switch to your shitty Google-translate localization just because my iPhone or PlayStation is set to my native language.

LukaD•1h ago

Does your browser request French via an Accept-Language header perhaps? What really infuriates me is when sites don’t respect that header and give you a translation based on IP location.

embedding-shape•57m ago

Regardless if it does or not, users should be able to manually override what language the website is in, at least be able to read the native one, regardless of what the original language was, what headers you send and where geodatabases think your IP is from.

w4yai•40m ago

Correct answer! What a bad UX

1a527dd5•1h ago

This feels very much like "we are now moving the goal posts".

neversupervised•1h ago

But this is the good kind of goalpost moving

iLoveOncall•52m ago

Only if you didn't read the article.

They're saying they need to move on from it because the benchmark is flawed (without bringing in proof) and that's why they can't hit 100%.

It's not a "our models are so good that the benchmark is too easy" thing.

f33d5173•50m ago

> without bringing in proof

Did we read the same article?

embedding-shape•47m ago

I feel like they're quite open about why they think the benchmark doesn't work anymore:

> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.

MattRix•22m ago

How can you say “without bringing in proof” when there is literally proof in the article?

MattRix•22m ago

Only if you didn’t read the article…

neversupervised•1h ago

Terminal Bench is the future

embedding-shape•56m ago

First, you might want to say why you think so, otherwise this is just borderline spam. Secondly, when your praise things (without motivation or reasoning even), and you've contributed to that specific thing, please say that up front instead of just praising the thing, again it makes it look like spam otherwise.

vintagedave•51m ago

> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts in improving on this in the initial creation of SWE-bench Verified.

Is this saying a quarter* of the questions and answers were wrong, this whole time?!

If so, how was this ever, in any way, a valid measurement?

And what was the process for creating this benchmark and how did it end up with such an extraordinarily poor set of data? (There is a description later of how, which seems to be a high standard and I struggle to understand how it aligns with the other results they discuss.) Kudos to them for highlighting the issues, but I am left with questions.

[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!

motoboi•46m ago

It’s saying that 16% of the problems have well, problems.

vintagedave•42m ago

You're right - I did not apply the math. (I won't edit, in order to let the parent comment still make sense, and thankyou for the correction.)

So not one in four, but one in six problems have problems.

That is extraordinarily high and the point still stands: is this truly saying a [large proportion] of the questions and answers were wrong, this whole time, and if so how was it ever a valid measurement?

embedding-shape•42m ago

> Is this saying a quarter of the questions and answers were wrong, this whole time?!

No, they're saying 59.4% of the 27.6% subset had flawed test cases I think.

> If so, how was this ever, in any way, a valid measurement?

Benchmarks essentially aren't, for practical concerns anyways. They don't represent your use case, and they don't represent any and all use cases, they're valid for measuring exactly what's included in the benchmarks, nothing more and nothing less.

I don't understand the ecosystems obsession with using public benchmarks, they hardly ever tell you anything of value. Ok, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5, does that mean it'll be 50% better for what you're using it for? Very unlikely.

I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real, actual cases where a LLM went wrong and I had to adjust the prompt, and over time I've built up a suite.

Most of the times when a new update comes out to a model, it moves maybe 2-3% in my own benchmarks, meanwhile they tout 30-40% increase or something ridiculous in public benchmarks, and we're supposed to believe the models' training data isn't contaminated...

sillysaurusx•37m ago

Imagenet is one of the most popular datasets on the planet. Turns out, a significant fraction of its images are mislabeled. In the limit case the model would have to fit towards wrong answers to get higher than a certain percentage.

The answer is “it works because ML wants to work.” It’s surprising how far you can get with something flawed. It’s also why such huge breakthroughs are possible by noting flaws others haven’t.

embedding-shape•34m ago

> It’s also why such huge breakthroughs are possible by noting flaws others haven’t.

I do these sort of breakthroughs at home all the time! My wife would say the computer is doing something strange, and instead of just randomly clicking around, I read the error messages slowly and out loud, then follow what they say. Anyone can do this, yet it seems like a magical ability every time you employ it to help people.

adityamwagh•48m ago

> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.

No shit, Sherlock!

djoldman•48m ago

> We have incorporated these findings into our recent evaluation efforts. In the last months we’ve chosen to report results from the public split of SWE-Bench Pro. We recommend other model developers do the same. SWE-bench Pro is not perfect, but empirically seems to suffer less from contamination issues.

https://arxiv.org/pdf/2509.16941

Jcampuzano2•43m ago

Its pretty clear that any benchmark that comes out will be outdated and exist within the training data with short measure. There will always be an incentive to optimize specifically for these benchmarks even if just for marketing material. Sure there is a training cutoff, but its usually only 3-6 months off of the public release dates.

The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed to not already be in the training data, and not borrow anything from previous benchmarks.

In this regard I don't think any benchmark that was created before a given model is released should ever be considered valid or representative of model performance. The potential financial gain for including the data just to be able to market a minor improvement is too swaying. With that in mind they should honestly just stop including benchmarks altogether in marketing material

Let the model speak for itself and let the community decide, but of course that will never slide with corporate types with so much money on the line.

mnky9800n•35m ago

This is why I made Zork bench. Zork, the text adventure game, is in the training data for LLMs. It’s also deterministic. Therefore it should be easy for an LLM to play and complete. Yet they don’t. Understanding why is the goal of Zork bench.

https://github.com/mnky9800n/zork-bench

kqr•27m ago

I have worked on similar problems. See e.g. [1].

The LLMs I have tested have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year old with amazing vocabulary.

[1]: https://entropicthoughts.com/updated-llm-benchmark

(more descriptions available in earlier evaluations referenced from there)

WarmWash•21m ago

The open models only give the SOTA models a run for their money on gameable benchmarks. On the semi-private ARC-AGI 2 sets they do absolutely awfully (<10% while SOTA is at ~80%)

It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.

jvuygbbkuurx•29m ago

I think the solution is a bunch of private trusted benchmarks, and averaging their announced results.

MattRix•27m ago

They mention this in the article. This is why private (non public) benchmark tasks that have been made from scratch are necessary.

WarmWash•26m ago

An easy way to make coding benchmarks viable again is to initialize the models with 200k of distracting or unrelated tokens in their context. Or even just run the tests sequentially in the same context and see how far the model gets before it unwinds.

These benchmarks are always greenfield, but people want a model that can deal with a rotted context.

cyanydeez•22m ago

a good benchmark would probably porting a selected repo to another language. then clear context notes, and have it port it back.

as long as theres a test framework, you could gauge success deterministically.

Jimmc414•42m ago

Goodhart’s Law in reverse, what can’t be gamed gets rejected.

varispeed•42m ago

Issue with these benchmark also is that they measure a model you are unlikely going to be routed to. My experience with Anthropic is that despite using Opus 4.6 and 4.7, most of the time the performance is matching low B parameter Qwen. I think there should be a way to verify what model is actually being used to process prompts - that should be independently verified. At the moment it is so bad, you have to ask verification question to the model in form of a non-trivial problem. If it solves it, then there is a chance you actually get Opus and not an impostor and so you can continue the session instead of restarting it hoping you get routed correctly. But that does not help if model is replaced with cheaper one mid session. I've got so much work lost because of these shenanigans.

alansaber•29m ago

I'm sure some inference providers don't, but most intentionally obfuscate this data. They have the full trace logs- my impression is that they don't share them because it's their competitive advantage, and it's easier for a competitor to distil their model if they did.

gpm•40m ago

Curiously Opus 4.7 claims to have a 87.6% pass rate and Mythos claims to have a 93.9% pass rate... leading to the conclusion that it's actually possible to "solve" the problems that OpenAI claims are incorrect.

2ndorderthought•36m ago

Or that opus and mythos are training on the data somehow such that there solutions are incorrectly right. Or that openai is lying/wrong. Or that all of these companies are cheating so much it doesn't really matter and never did.

MattRix•24m ago

The problem isn’t that the tasks are impossible to solve, it’s that they’re underspecified and/or impossible to solve consistently (ex. because a test is expecting the solution function to have a specific name that wasn’t specified in the task itself).

So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?

gpm•12m ago

We actually know that a "100% pass rate" is trivially possible: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

But if that hypothesis is the explanation the interesting part is Opus 4.7 (but not 4.6) seems to be doing the same.

gruez•6m ago

>Mythos figuring out how to cheat at the benchmark strikes me as much more likely.

Isn't this trivially detectable with a bit of human auditing? Surely they must be vaguely checking for this? It sounds far more likely their solution are designed to pass the incorrect tests. That might be considered bad in a SWE context, but it's not exactly cheating either. It might even be considered a good thing, eg. in the context of backwards compatibility.

[1] https://learn.microsoft.com/en-us/troubleshoot/microsoft-365...

ripvanwinkle•35m ago

>>In our analysis we found that all frontier models we tested were able to reproduce the original, human-written bug fix used as the ground-truth reference, known as the gold patch, or verbatim problem statement specifics for certain tasks, indicating that all of them have seen at least some of the problems and solutions during training

this statement alone seems to invalidate the SWE-bench tests

threepts•34m ago

Why don't they ask their premier model to generate a bench for them?

Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy.

Leaderboard: https://arcprize.org/leaderboard

(Most premier models don't even pass 5 percent.)

alansaber•32m ago

Very (reasoning) heavy benchmarks do seem like the way to go, being the hardest to game.

falcor84•29m ago

They focus on minimizing the number of moves and don't allow any harness whatsoever, putting the bar extremely high. The current top verified contender (Claude Opus 4.6) is at only 0.45%. But with how new it is, I expect a lot of improvement in the next generation of models.

threepts•5m ago

Optimal for judging actual reasoning ability rather than an LLM's ability to regurgitate knowledge from a necropost on HN/Reddit/Twitter from 2018.

retinaros•29m ago

it never did

gertlabs•20m ago

A better benchmark needs to be objectively scored, have multi-disciplinary, breadth, and be scalable (no single correct answer).

That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.

DeathArrow•14m ago

So we need to generate benchmarks after the models finish training. Or we need to keep the solutions to the benchmark problems as closed source.

kqr•7m ago

It was never that great, it seems. For all of 2025 there was virtually no improvement in the rate at which models produced quality code. They only got better at passing automated tests.

https://entropicthoughts.com/no-swe-bench-improvement

Why Lilly's Weight Loss Pill Isn't a Peptide

Clay PCB Tutorial

We Traced U.S. Government Gold to a Drug Cartel

Officially Retired from Emacs

AI Water Use Distractions and Lessons for California

Ask HN: What is the utility of DSA in the age of LLMs?

Prompting doesn't work Software does

After three months on Linux, I don't miss Windows at all

RSME: A Reactive Stability Mutation Encryption

Superpowers is indeed game changer

Elon Musk's legal battle with OpenAI and Sam Altman will head to trial

Expat 2.8.0 released, includes security fixes

Keep Your Identity Small (2009)

MinIO repository is now archived

Name in Landsat

Google just solved agent identity. For Google Cloud

AI and Alignment

Threat actor uses Microsoft Teams to deploy new "Snow" malware

Vibecoding apps? track and patch issues in deployment using patchly

Corpus Christi plans to declare a 'water emergency.' What does that mean?

New text generator built by OpenAI considered too dangerous to release (2019)

Legendary Qualcomm, Apple, and Nuvia alumni form new CPU startup – Nuvacore

Show HN: Knock-Knock v2 – Visualizing bot attacks in multi-protocol Technicolor

Show HN: Lambda ERP – Open-source ERP you can run through chat

Lost Pixel is joining Figma and sunsetting the OSS product

Entre Lineas

Use LangChain with Codex (ChatGPT) Plus/Pro

Bohrdom on Steam, Game by Cole Allen

Write SaaS apps where users control where their data is stored

Free Textbook on Engineering Thermodynamics