(Given that IQuestLab published their SWE-Bench Verified trajectory data, I want to be charitable and assume a genuine oversight rather than "benchmaxxing"; it's probably an easy thing to miss if you're new to benchmarking.)
https://www.reddit.com/r/LocalLLaMA/comments/1q1ura1/iquestl...
If you run SWE-bench evals, just make sure to use the most up-to-date code from our repo and the updated Docker images.
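For reference, rerunning against the current harness looks roughly like this. A minimal sketch assuming a local checkout of the princeton-nlp/SWE-bench repo; the predictions path, run id, and worker count are placeholders, so check the repo README for the exact flags before relying on it:

    # Sketch: pull the latest harness code, then rerun the evaluation.
    # Flags follow the SWE-bench README; verify against the repo before use.
    import subprocess

    subprocess.run(["git", "pull"], cwd="SWE-bench", check=True)
    subprocess.run(
        [
            "python", "-m", "swebench.harness.run_evaluation",
            "--dataset_name", "princeton-nlp/SWE-bench_Verified",
            "--predictions_path", "preds.jsonl",  # your model's patches (placeholder)
            "--max_workers", "8",                 # placeholder
            "--run_id", "rerun-latest",           # placeholder
        ],
        cwd="SWE-bench",
        check=True,
    )

The harness manages the per-instance Docker images itself, so an up-to-date checkout should pick up the updated image specs.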
I don't doubt that it's an oversight, but it does say something about the researchers that they didn't look at a single output, where they would have immediately caught this.
Claude spits that out very regularly at the end of an answer when it's clearly out of its depth and wants to steer the discussion away from that blind spot.
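It only takes a few lines to spot that kind of tell in a published trajectory dump. A rough sketch, where the directory layout and the signature phrases are invented for illustration, not the actual IQuestLab data format:

    # Sketch: scan trajectory files for phrases characteristic of another
    # vendor's model. Paths and phrases are hypothetical placeholders.
    import pathlib

    SIGNATURES = ["I'm Claude", "made by Anthropic"]

    for path in pathlib.Path("trajectories").glob("*.json"):
        text = path.read_text(errors="ignore")
        for sig in SIGNATURES:
            if sig in text:
                print(f"{path.name}: contains {sig!r}")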
Not living in China, I'm not too concerned about the Chinese government.
But yes, sadly it looks like the agent cheated during the eval.
cadamsdotcom•1mo ago
That said, Sonnet 4.5 isn't new, and there have been loads of innovations recently.
Exciting to see open models nipping at the heels of the big end of town. Let’s see what shakes out over the coming days.
stingraycharles•1mo ago
They’re focused almost entirely on benchmarks. I think Grok is doing the same thing. I wonder if people could figure out a type of benchmark that cannot be optimized for, like having multiple models compete against each other in something.
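Head-to-head matchups are roughly what LMArena does, scored with an Elo-style rating. A minimal sketch of the update rule; the K-factor of 32 and the starting ratings are just conventional defaults:

    # Elo update after one head-to-head comparison between models A and B.
    def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
        # Expected score of A given the current rating gap.
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        score_a = 1.0 if a_won else 0.0
        new_a = r_a + k * (score_a - expected_a)
        new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
        return new_a, new_b

    # Two equally rated models: the winner gains 16 points, the loser drops 16.
    print(elo_update(1500.0, 1500.0, a_won=True))  # (1516.0, 1484.0)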
rubslopes•1mo ago
FYI, I use CC for Anthropic models and OpenCode for everything else.