https://news.ycombinator.com/item?id=44833929, my comment https://news.ycombinator.com/item?id=44835939
Another option might be the LiveBench approach, where new tests are released on a regular basis.
I could understand focusing on a niche business use case, but coding is a main focus of the foundation models themselves.
I think the next step is getting an official "checked" mark from the SWE bench team.
The best submission on swe-bench-multilingual is Claude 3.7 Sonnet, which solves ~43% of the issues in the dataset.
gronky_•12h ago
71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take Refact, for example, which is at #2 with 74.4%: they built a ~2k-line framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It's pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again, so it's effectively multiple attempts per problem.
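That failure-analysis loop might look roughly like the sketch below (hypothetical function signatures and helper names, not Refact's actual code; the agents and test runner are passed in as plain callables):

    from typing import Callable, Optional

    def solve_with_debug(
        main_agent: Callable,    # hypothetical: (issue, hints) -> (patch, logs)
        debug_agent: Callable,   # hypothetical: (issue, patch, logs) -> hints
        run_tests: Callable,     # hypothetical: patch -> bool
        issue: str,
        max_rounds: int = 3,
    ) -> Optional[str]:
        """Main agent attempts a fix; on failure a debug agent turns the
        failure into hints and the main agent tries again."""
        hints = None
        patch = None
        for _ in range(max_rounds):
            patch, logs = main_agent(issue, hints)
            if run_tests(patch):             # effectively multiple attempts per problem
                return patch
            hints = debug_agent(issue, patch, logs)
        return patch                         # best effort after max_rounds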
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
whymauri•9h ago
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...
It's up to your retrieval system/model to selectively hunt for relevant context (a rough sketch of that hunting is below). Here are a few critiques of the benchy:
https://x.com/brhydon/status/1953648884309536958
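For a sense of what "hunting for relevant context" means in practice, here is a deliberately naive sketch: plain keyword overlap over repo files. Everything here is made up for illustration; real systems use stronger retrieval (BM25, embeddings, or agentic search):

    import os
    import re

    def retrieve_context(repo_dir: str, issue_text: str, top_k: int = 5) -> list:
        """Rank repository files by crude keyword overlap with the issue text."""
        keywords = set(re.findall(r"[a-zA-Z_]{4,}", issue_text.lower()))
        scored = []
        for root, _, files in os.walk(repo_dir):
            for name in files:
                if not name.endswith(".py"):
                    continue
                path = os.path.join(root, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        text = f.read().lower()
                except OSError:
                    continue
                score = sum(text.count(k) for k in keywords)
                scored.append((score, path))
        scored.sort(reverse=True)
        return [path for _, path in scored[:top_k]]   # candidate files to show the model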
thinkingtoilet•11h ago
https://en.wikipedia.org/wiki/Goodhart%27s_law
jasonjmcghee•8h ago
But let's say a group uses it as a metric as part of CI and each new idea / feature they create runs against SWE bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidate datasets for fine-tuning, maybe they're choosing between checkpoints.
This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.
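A toy simulation of that selection effect: even if every candidate is equally good, habitually picking the best scorer on the same fixed benchmark makes the "winner" look better than it really is (the numbers below are invented purely for illustration):

    import random

    def benchmark_score(true_skill: float, n_problems: int = 500) -> float:
        """Noisy benchmark run: each problem is solved with probability true_skill."""
        solved = sum(random.random() < true_skill for _ in range(n_problems))
        return solved / n_problems

    random.seed(0)
    true_skill = 0.50          # every candidate is identical in reality
    n_candidates = 20          # checkpoints / datasets / tweaks evaluated in CI
    best = max(benchmark_score(true_skill) for _ in range(n_candidates))
    print(f"true rate: {true_skill:.0%}, best observed of {n_candidates}: {best:.1%}")
    # The selected "best" beats 50% purely by chance -- that gap is overfitting to the benchmark.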
gronky_•11h ago
Building multiple attempts into your agent is stretching the rules, even if it's technically acceptable.
gronky_•10h ago
def make_pass_at_1_agent(agent, n):  # "@" isn't a legal character in a Python identifier
    return lambda task: next((result for result in (agent(task) for _ in range(n)) if result), None)  # n hidden retries, scored as one attempt
gronky_•9h ago
If they are running their production product as is, then of course whatever is built into the product is fine.
terminalshort•8h ago
Think of the agent like an employee. If he delivers the code within the expected time and to the expected quality standards, his process of getting there means almost nothing. Do I care if he tried 4 different approaches along the way and threw out the first 3? Not a bit.
radarsat1•9h ago
It's interesting to think about what the trade-offs are. Assuming the system can properly classify a task as easy or hard (big "if" but I guess there are ways), there is nonetheless more to think about, depending on your pricing plan.
For subscription pricing, I guess you don't really care which model runs and in fact it's hard to find a reason to ever run the smaller model, so choosing between the models is more in the provider's interests for cost efficiency.
But for pay-per-use pricing, if you have a bigger model that can get the answer right 80% of the time, and a smaller model that can handle smaller changes and get things right 60% of the time but correct its mistakes, then the system should try to run the smaller model on as many tasks as possible to save you money. But if it ends up having to make a lot of corrections, maybe you end up needing more total requests than with the larger model. In that case it may actually be cheaper to run the larger model, if it takes fewer requests.
So I wonder how that kind of trade-off could be effectively calculated. I guess if you can figure out when "retries" happen you can count them and do some statistics on which model is more likely to work out in fewer shots. It's pretty complicated though, when you start to think about it in detail.
I do wonder if even having BOTH the smaller and bigger model make hypotheses, and try the smaller model's idea first, then if it fails, try the bigger model's idea, might be the way to go.
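One way to reason about it is expected cost per solved task under a few routing policies. A back-of-the-envelope sketch (the success rates and prices are made up, and it assumes retries are independent, which in practice they usually aren't):

    # Hypothetical numbers: per-attempt success rate and price per request.
    P_BIG, COST_BIG = 0.80, 10.0
    P_SMALL, COST_SMALL = 0.60, 2.0

    def cost_retry_until_success(p: float, cost: float) -> float:
        """Expected spend if the same model retries until it succeeds (1/p attempts on average)."""
        return cost / p

    def cost_cascade(max_small_tries: int = 3) -> float:
        """Try the small model up to max_small_tries, then fall back to the big model."""
        expected, p_stuck = 0.0, 1.0
        for _ in range(max_small_tries):
            expected += p_stuck * COST_SMALL    # pay for this small-model attempt
            p_stuck *= 1 - P_SMALL              # probability we are still stuck afterwards
        return expected + p_stuck * cost_retry_until_success(P_BIG, COST_BIG)

    print("big only:  ", cost_retry_until_success(P_BIG, COST_BIG))
    print("small only:", cost_retry_until_success(P_SMALL, COST_SMALL))
    print("cascade:   ", cost_cascade())

With these made-up numbers the small model comes out cheapest, but if its corrections stop being independent (it keeps failing the same task the same way), the 1/p assumption breaks and the big model can win, which is exactly the trade-off described above.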
ai-christianson•10h ago
I.e. the agent cannot even know which tests are failing.
It has to fix the issue based only on the issue text, and it has to fix it in the specific way that the unit tests, which it cannot see, expect.
For this reason I find the benchmark a little disconnected from the reality of software engineering.
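Roughly, the grading step being described looks like the sketch below: the agent only ever saw the issue text, while the harness applies its patch and runs tests the agent was never shown. This is simplified (the real harness runs inside per-task environments) and the helper names are hypothetical:

    import subprocess

    def grade_submission(repo_dir: str, model_patch: str, hidden_tests: list) -> bool:
        """Apply the agent's patch, then run the tests the agent never saw."""
        subprocess.run(["git", "apply", "-"], input=model_patch, text=True,
                       cwd=repo_dir, check=True)
        result = subprocess.run(["python", "-m", "pytest", *hidden_tests], cwd=repo_dir)
        return result.returncode == 0    # must fix the issue the exact way the hidden tests expect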