Many SWE-bench-Passing PRs would not be merged

https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/

84•mustaphah•2h ago

Comments

love2read•1h ago

Edit: Never mind

refulgentis•1h ago

Well, no: one of the first things it says is that reviewers were blind to human vs. AI.

yorwba•57m ago

The comment you're replying to is talking about a hypothetical scenario.

In any case, the blinding didn't stop Reviewer #2 from calling out obvious AI slop. (Figure 5)

collabs•41m ago

I feel like I don't have the context for this conversation. If slop is obvious as slop, I feel like we should block it.

If you look at the comment in question, it just says what the code following it does. It doesn't matter whether a human or a machine wrote it. It is useless. It is actually worse than useless, because if someone needs to change the code, they now need to change two things. So in that sense, you just made twice the work for anyone who touches the code after you, and for what benefit?
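To make that concrete, here is a made-up sketch (not the code from the post; the function names and the discount example are hypothetical). The first comment only restates the line below it, so any change to the formula means editing both. The second records something the code cannot say by itself.

```python
def apply_discount(price: float, rate: float) -> float:
    # Multiply price by (1 - rate) and return the result.
    # ^ Redundant: repeats the code, and must be edited whenever the formula changes.
    return price * (1 - rate)


def apply_discount_documented(price: float, rate: float) -> float:
    # rate is a fraction (0.25 means 25%), not a percentage.
    # ^ Useful: states an assumption a reader cannot infer from the expression.
    return price * (1 - rate)
```

Both functions behave identically; only the second comment survives a refactor without needing a rewrite.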

zozbot234•29m ago

The point is that AI models do these kinds of things all the time. They're not really all that smart or intelligent, they just replicate patterns or boilerplate and then iterate until it sort of appears to work properly.

spartanatreyu•23m ago

> appears to work

That "appears" is doing a lot of heavy lifting.

The code working isn't what's being selected for.

The code looking convincing IS what is being selected for.

That distinction is massive.

nubg•41m ago

> mid-2024 agents

Is this a post about AI archeology?

varispeed•24m ago

Do these benchmarks make any sense? I tried a few local models that seem to score well on SWE-bench, but the results were pure rubbish. (For instance, MiniMax-M2.5 at 128GB from unslothed: completely unusable.)

languid-photic•6m ago

makes sense! we wrote something yesterday about the weaknesses of test-based evals like swe-bench [1]

they are definitely useful but they miss the things that are hard to encode in tests, like spec/intent alignment, scope creep, adherence to codebase patterns, team preferences (risk tolerance, etc)

and those factors are really important. which means that test-evals should be relied upon more as weak/directional priors than as definitive measures of real-world usefulness

[1] https://voratiq.com/blog/test-evals-are-not-enough/
