They're saying they need to move on from it because the benchmark is flawed (without offering proof) and that's why they can't hit 100%.
It's not a "our models are so good that the benchmark is too easy" thing.
Did we read the same article?
> We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests.
> This means that improvements on SWE-bench Verified no longer reflect meaningful improvements in models’ real-world software development abilities. Instead, they increasingly reflect how much the model was exposed to the benchmark at training time.
Is this saying a quarter* of the questions and answers were wrong, this whole time?!
If so, how was this ever, in any way, a valid measurement?
And what was the process for creating this benchmark and how did it end up with such an extraordinarily poor set of data? (There is a description later of how, which seems to be a high standard and I struggle to understand how it aligns with the other results they discuss.) Kudos to them for highlighting the issues, but I am left with questions.
[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!
So not one in four, but one in six problems have problems.
That is extraordinarily high and the point still stands: is this truly saying a [large proportion] of the questions and answers were wrong, this whole time, and if so how was it ever a valid measurement?
No, they're saying 59.4% of the 27.6% subset had flawed test cases, I think.
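To make that arithmetic concrete (the two percentages are the ones quoted from the article above), the overall flawed fraction comes out to roughly one in six:

```python
# Figures quoted from the article: 27.6% of tasks were in the examined
# subset, and 59.4% of that subset had flawed test cases.
flagged_subset = 0.276
flawed_within_subset = 0.594

flawed_overall = flagged_subset * flawed_within_subset
print(f"{flawed_overall:.1%}")  # 16.4%, i.e. about one in six
```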
> If so, how was this ever, in any way, a valid measurement?
Benchmarks essentially never are, at least for practical purposes. They don't represent your use case, and they don't represent any and all use cases; they're valid for measuring exactly what's included in the benchmark, nothing more and nothing less.
I don't understand the ecosystem's obsession with using public benchmarks; they hardly ever tell you anything of value. OK, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5 — does that mean it'll be 50% better for what you're using it for? Very unlikely.
I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real, actual cases where an LLM went wrong and I had to adjust the prompt, and over time I've built up a suite.
Most of the time when an update to a model comes out, it moves maybe 2-3% on my own benchmarks, meanwhile they tout a 30-40% increase or something ridiculous on public benchmarks, and we're supposed to believe the models' training data isn't contaminated...
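A private suite like the one described doesn't need much machinery. Here's a minimal sketch of what such a harness might look like; `ask_llm` is a placeholder for whatever client you use, and the two toy cases are hypothetical examples, not the commenter's actual tests:

```python
# Minimal private-benchmark harness: a list of (prompt, checker) cases
# scored against any callable that maps a prompt to a model answer.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # True if the answer is acceptable


def run_suite(ask_llm: Callable[[str], str], cases: list[Case]) -> float:
    """Return the fraction of cases the model passes."""
    passed = sum(1 for c in cases if c.check(ask_llm(c.prompt)))
    return passed / len(cases)


# Hypothetical cases, e.g. distilled from real failures as the comment suggests.
cases = [
    Case("Return only the word YES.", lambda a: a.strip() == "YES"),
    Case("What is 17 * 23? Answer with the number only.",
         lambda a: a.strip() == "391"),
]

# score = run_suite(my_client, cases)  # track this number across model updates
```

Because the prompts never leave your machine, scores can't be inflated by training-data contamination the way public benchmark scores can.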
The answer is “it works because ML wants to work.” It’s surprising how far you can get with something flawed. It’s also why such huge breakthroughs are possible by noting flaws others haven’t.
I make this sort of breakthrough at home all the time! My wife will say the computer is doing something strange, and instead of just randomly clicking around, I read the error messages slowly and out loud, then follow what they say. Anyone can do this, yet it seems like a magical ability every time you employ it to help people.
No shit, Sherlock!
The problem with coding benchmarks then becomes creating novel benchmarks that are guaranteed not to be in the training data already, and that don't borrow anything from previous benchmarks.
In this regard, I don't think any benchmark created before a given model's release should ever be considered valid or representative of model performance. The potential financial gain from including the data just to be able to market a minor improvement is too tempting. With that in mind, they should honestly stop including benchmarks in marketing material altogether.
Let the model speak for itself and let the community decide, but of course that will never fly with corporate types with so much money on the line.
The LLMs I have tested have terrible world models and intuitions for how actions change the environment. They're also not great at discerning and pursuing the right goals. They're like an infinitely patient five-year-old with an amazing vocabulary.
[1]: https://entropicthoughts.com/updated-llm-benchmark
(more descriptions available in earlier evaluations referenced from there)
It might be too expensive, but I would be interested in the benchmarks for the current crop of SOTA models.
These benchmarks are always greenfield, but people want a model that can deal with a rotted context.
As long as there's a test framework, you could gauge success deterministically.
So maybe Anthropic runs Mythos through the benchmark 10000 times and takes the highest score, who knows?
Anthropic p-hacking the benchmark strikes me as cheating, and somewhat unlikely. Mythos figuring out how to cheat at the benchmark strikes me as much more likely.
But if that hypothesis is the explanation the interesting part is Opus 4.7 (but not 4.6) seems to be doing the same.
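The "run it many times and keep the best score" concern is easy to quantify. If a single run solves a task independently with probability p, the chance that at least one of n runs solves it is 1 − (1 − p)^n, which is why best-of-N reporting inflates scores so dramatically (this is a generic illustration, not a claim about what any lab actually does):

```python
# Probability that at least one of n independent runs passes,
# given a per-run pass probability p. Shows why best-of-N
# score reporting amounts to p-hacking the benchmark.
def pass_at_n(p: float, n: int) -> float:
    return 1 - (1 - p) ** n


print(pass_at_n(0.30, 1))   # 0.30: one run, honest score
print(pass_at_n(0.30, 10))  # ~0.97: ten runs, keep the best
```

With only ten retries, a model that passes 30% of the time looks like it passes 97% of the time; 10,000 retries would saturate essentially any task with nonzero per-run success probability.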
Isn't this trivially detectable with a bit of human auditing? Surely they must be vaguely checking for this? It sounds far more likely that their solutions are designed to pass the incorrect tests. That might be considered bad in an SWE context, but it's not exactly cheating either. It might even be considered a good thing, e.g. in the context of backwards compatibility.
This statement alone seems to invalidate the SWE-bench tests.
Jokes aside, a benchmark I look forward to is ARC-AGI-3. I tried out their human simulation, and it feels very reasoning heavy.
Leaderboard: https://arcprize.org/leaderboard
(Most premier models don't even pass 5 percent.)
That's what we designed at https://gertlabs.com. We put a lot of thought into it, and kept it mostly (not fully) related to problem solving through coding.
w4yai•1h ago
I mean, it's fine as it's useful for many people, but where is the button for disabling it? Or why is it enabled by default?
"codage de pointe" sounds so weird and cringe in French.