New #1 open-source AI Agent on SWE-bench Verified

https://refact.ai/blog/2025/open-source-sota-on-swe-bench-verified-refact-ai/

28•laxyz•8mo ago

Comments

laxyz•8mo ago

The full pipeline used for SWE-bench Verified is open-source: https://github.com/smallcloudai/refact-bench

amarcheschi•8mo ago

I think the title doesn't make it clear that the results are obtained with closed models

nateburke•8mo ago

Am I correct in understanding that SWE-bench is limited to Python?

babushkaboi•8mo ago

Yeah, they're all Python at the moment.

simonw•8mo ago

The core benchmark is only Python, but there is also SWE-bench Multimodal which uses JavaScript: https://arxiv.org/abs/2410.03859

And the new SWE-bench Multilingual (released a couple of weeks ago) which covers 9 programming languages - C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby and Rust: https://www.swebench.com/multilingual.html

brrrrrm•8mo ago

Open-source use of closed source models?

NicuCalcea•8mo ago

Looks like they support self-hosted models: https://docs.refact.ai/supported-models/#self-hosted-version

MukundMohanK•8mo ago

Between last April and now, SWE-bench scores have gone up from 25% to 70%.

Sure, they're being overfitted to the dataset. But with most models performing similarly across even the hardest third-party benchmarks (think FrontierMath back in Nov versus now), we're closer than ever to a specialisation shift.

Hard to say at what %, but once code reviews get better, it's likely 2025 is the last year SWE is a sought-after job, in terms of both demand and supply.

candiddevmike•8mo ago

SWE bench scores, like a lot of other metrics for LLMs, are pretty divorced from reality IMO. It's a lot like only learning to pass tests vs actual understanding.

Once GenAI companies stop hiring SWEs, I'll believe the doomers.

MukundMohanK•8mo ago

Reality is here whether we like it or not - https://fred.stlouisfed.org/graph/?g=1DEP0

hackeman300•8mo ago

Surely there are no other macroeconomic factors that could have played a role in this decline too

harshitaneja•8mo ago

I help hire for a few clients as well as for my own small organization. We are already seeing the impact of these tools on our hiring. For the same responsibilities and tasks we already require fewer resources. For clients with less complex problems we are able to manage similar work with 60% of the resources planned. And that's when most of our work is mathematical modelling, heuristics, constraint programming and such. However, I don't foresee us reaching, at least in the next few years, a scenario where we don't hire developers, although most hiring has shifted to senior developers only.

dingnuts•8mo ago

being able to do more things with fewer resources (which lowers costs) always increases demand enough to make up for the reduction of labor caused by the automation

Analogy: when the chainsaw was invented, we didn't stop having lumberjacks, they just learned to use chainsaws

grammarxcore•8mo ago

> Many samples have an issue description that is underspecified, leading to ambiguity on what the problem is and how it should be solved.

OpenAI apparently tuned _basic discovery and refinement_ out of the tests so I don’t think this is a benchmark of anything useful. It can’t replace a human but can possibly make a human more productive.

https://openai.com/index/introducing-swe-bench-verified/

predkambrij•8mo ago

I would like to know why this post got flagged. Is it misleading, or dangerous software? If it's truly #1 open-source on SWE-bench that's quite impressive.
