That seems more accurate than the huge scores the other ones get
Like, seriously, how come all these agents are beating Claude Code? In practice, they are shitty and not even close. Yes. I tried them.
https://www.oracle.com/news/announcement/blog/oracle-cloud-c...
Wall Street is currently heavily punishing any company that misses its quarter; it even punished NVIDIA after it beat its quarter.
Oracle had an earnings miss in the current quarter!
Their current REALITY is ~$15B quarterly revenue (with cloud infra at ~$3B) and only ~$12B in near-term deferred backlog, and deferred backlog is NOT revenue. To justify the valuation, OCI would have to go from ~$18B in FY26 to ~$140B by FY30. That is an insane promise of +$120B in 4 years, back-loaded into year 3 or year 4. :-))
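Just as a back-of-the-envelope sketch, using only the figures quoted above, here's what that growth story implies as a compound annual rate (Python):

```python
# Implied compound annual growth rate for OCI revenue,
# using the figures quoted in the comment (~$18B FY26 -> ~$140B FY30).
fy26_oci = 18e9   # ~$18B
fy30_oci = 140e9  # ~$140B
years = 4

cagr = (fy30_oci / fy26_oci) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.0%}")  # roughly 67% per year, sustained for four straight years
```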
Capex needs to hit ~$35B next year just to chase GPUs and power, and if they miss one actual quarter, the story implodes. The supposedly rational, efficient market is paying nearly $1T today for back-loaded hopes.
It's completely bubble math. As if anybody, including Oracle AND their customers, has ANY idea what their capex will be in 4 years.
Complete and total bubble.
It's easy to publish "$NEWMODEL received an X% bump in SWE-Bench Verified!!!!".
Proper research means interrogating the traces, like these researchers did (the Gist shows Claude 4 Sonnet): https://gist.github.com/jacobkahn/bd77c69d34040a9e9b10d56baa...
Commentary: https://x.com/bwasti/status/1963288443452051582, https://x.com/tmkadamcz/status/1963996138044096969
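For anyone who wants to do the same kind of digging, here's a minimal sketch of trace interrogation, assuming the trajectories are available as plain-text logs (the directory name and command patterns are illustrative, not the researchers' actual tooling):

```python
import re
from pathlib import Path

# Commands in an agent trace that suggest it peeked at repo history it shouldn't have.
SUSPICIOUS = [
    r"git log",          # browsing commit history for the real fix
    r"git show",         # inspecting a specific (possibly future) commit
    r"git diff .*HEAD",  # diffing against later repo states
]

def flag_trace(path: Path) -> list[str]:
    text = path.read_text(errors="ignore")
    return [pat for pat in SUSPICIOUS if re.search(pat, text)]

for trace in Path("trajectories").glob("*.log"):  # hypothetical layout
    hits = flag_trace(trace)
    if hits:
        print(f"{trace.name}: review manually ({', '.join(hits)})")
```

A regex scan like this only surfaces candidates; the actual judgment still has to come from reading the flagged traces by hand.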
Claude benchmarks poorly but vibes well. Gemini benchmarks well and vibes well. Grok benchmarks well but vibes poorly.
(Yes, I know, you are gushing with anecdotes; the vibes are simply the approximate shade of gray born from countless black-and-white remarks.)
This issue affected a tiny fraction of existing agents in a tiny fraction of their runs, and we've now issued a fix.
This is a natural part of running a benchmark; I'm sure tiny things like this will keep getting discovered and we'll keep fixing them. This doesn't change the overall picture or trends at all.
Edit: That said, I’m willing to believe based on the information in the thread that this most likely only affects a tiny fraction of runs.
You're all extremely clever and I can't seem to understand how you missed thinking about such a simple edge case. It's like building a chroot and then allowing `cd ..` to break out of it. What other maybe extremely basic edge cases were missed?
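To make the analogy concrete, here's a hypothetical sketch (not the benchmark's actual sandboxing code) of the kind of naive containment check that looks safe until someone passes `..`:

```python
import os

SANDBOX = "/srv/sandbox"

def naive_is_inside(path: str) -> bool:
    # Looks plausible, but "/srv/sandbox/../etc/passwd" passes this check.
    return path.startswith(SANDBOX)

def robust_is_inside(path: str) -> bool:
    # Resolve ".." and symlinks before comparing against the sandbox root.
    resolved = os.path.realpath(path)
    return resolved == SANDBOX or resolved.startswith(SANDBOX + os.sep)

print(naive_is_inside("/srv/sandbox/../etc/passwd"))   # True  -- the escape
print(robust_is_inside("/srv/sandbox/../etc/passwd"))  # False
```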
> This doesn't change the overall picture or trends at all.
Outsiders without financial benefits from the current AI hype might have a different picture. And I'm a bit fed up with AI and its fake productivity promises enshittifying nearly all user-facing software that my clients and I use, bundled with hefty price hikes from Microsoft and the like to pay for their "investments".
The test environment contains the answers to the questions.
How can we ever perform this sort of faux-neutral agentic evaluation in an environment where we want agents to have access to the sum total of knowledge (which will necessarily include being able to learn about the evaluation being conducted and its expectations)?
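One partial mitigation, assuming the leak is via repo state rather than the model's weights, is to scrub each task environment so nothing newer than the task's base commit stays reachable. A sketch (`repo_dir` and `base_commit` are placeholders; this obviously can't remove anything the model already memorized during training):

```python
import subprocess

def git(repo_dir: str, *args: str, capture: bool = False):
    return subprocess.run(["git", *args], cwd=repo_dir, check=True,
                          capture_output=capture, text=True)

def scrub_repo(repo_dir: str, base_commit: str) -> None:
    """Reset a task repo so later, answer-bearing commits are unreachable."""
    git(repo_dir, "checkout", "--detach", base_commit)
    # Delete every branch/tag/remote ref that could point at future commits.
    refs = git(repo_dir, "for-each-ref", "--format=%(refname)", capture=True).stdout.split()
    for ref in refs:
        git(repo_dir, "update-ref", "-d", ref)
    git(repo_dir, "reflog", "expire", "--expire=now", "--all")
    git(repo_dir, "gc", "--prune=now")
```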
We relatively quickly identified that the test set was taken directly from the training set, but the claim had already been advertised, so it was more difficult to retract... if it ever was; I left shortly after.
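That kind of overlap is usually trivial to detect before the number is published. A minimal, hypothetical sketch, assuming the items can be reduced to text (the normalization is deliberately crude):

```python
import hashlib

def fingerprint(text: str) -> str:
    # Crude normalization: lowercase, collapse whitespace, then hash.
    canon = " ".join(text.lower().split())
    return hashlib.sha256(canon.encode()).hexdigest()

def overlap(train: list[str], test: list[str]) -> float:
    train_fp = {fingerprint(t) for t in train}
    hits = sum(fingerprint(t) in train_fp for t in test)
    return hits / len(test) if test else 0.0

print(overlap(["fix the parser bug in foo.py"],
              ["Fix the  parser bug in foo.py"]))  # 1.0 -- exact duplicate after normalization
```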
The incentives are not aligned with accurate reporting.
piskov•2h ago
https://arxiv.org/html/2506.12286v3
stefan_•2h ago
I don't get it. Who is so opposed to doing the bare minimum of manual work and checking what these models are doing? At least back in the day, grad students doing an easy meta-paper understood it meant doing some repetitive manual work. Now we get benchmarks from hype vendors who think they can use the thing they are benchmarking to .. mark the bench.
jsheard•2h ago
Seems on-brand for an LLM-related thing to claim that it has verified something without actually checking.
yorwba•1h ago
Data contamination stemming from the fact that it's based on already-solved problems in public repositories is a different issue that cannot be addressed by verifying the benchmark questions harder, but only by putting stricter limits on the model under test.
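Concretely, "stricter limits" mostly means filtering by date: only score a model on problems created after its training cutoff. A sketch (the field names and cutoff date are made up for illustration):

```python
from datetime import date

# Hypothetical training cutoff for the model under test.
MODEL_CUTOFF = date(2024, 4, 1)

problems = [
    {"id": "repo-123", "issue_created": date(2023, 11, 5)},
    {"id": "repo-456", "issue_created": date(2024, 9, 20)},
]

# Keep only problems the model cannot have seen (as issue or merged fix) during training.
uncontaminated = [p for p in problems if p["issue_created"] > MODEL_CUTOFF]
print([p["id"] for p in uncontaminated])  # ['repo-456']
```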
sebzim4500•1h ago
It says nothing about data contamination, which would depend on the model and would not be the fault of the benchmark.
blibble•10m ago
I doubt any of the AI company employees are encouraged to go looking for cheating
fine_tune•2h ago
So kinda neat to see this paper!
[0]https://github.blog/news-insights/octoverse/octoverse-2024/#...
yieldcrv•1h ago
I don't see that contradicting your assumption