The tasks are pulled from real merged PRs in vLLM and SGLang, so there's a known-good human solution for each one. Agents get the full codebase, the issue description, and a test harness. Pretty generous setup.
What we didn't expect: the agents are genuinely good at diagnosing the problem. They read the code, find the bottleneck, describe the right fix. But then the generated code has subtle bugs. Off-by-one in kernel indexing, wrong tensor shapes, missing synchronization barriers. The kind of stuff that passes a code review at first glance but segfaults under load.
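To make the failure mode concrete, here's a hypothetical illustration (not taken from any benchmark transcript) of the off-by-one flavor we kept seeing: a chunked loop that looks correct at a glance but silently drops the tail when the length isn't a multiple of the chunk size.

```python
def chunked_sum_buggy(xs, chunk=4):
    total = 0
    # Bug: floor division drops the final partial chunk whenever
    # len(xs) is not a multiple of `chunk` -- easy to miss in review.
    for i in range(len(xs) // chunk):
        total += sum(xs[i * chunk:(i + 1) * chunk])
    return total

def chunked_sum_fixed(xs, chunk=4):
    total = 0
    # Fix: ceil-divide so the partial tail chunk is included.
    for i in range((len(xs) + chunk - 1) // chunk):
        total += sum(xs[i * chunk:(i + 1) * chunk])
    return total

xs = list(range(10))          # 10 elements, chunk=4 -> a 2-element tail
print(chunked_sum_buggy(xs))  # 28: elements 8 and 9 silently dropped
print(chunked_sum_fixed(xs))  # 45
```

With typical test inputs (lengths that happen to divide evenly) both versions agree, which is exactly why this class of bug survives a quick review and only shows up on real workloads.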
The other weird result: agent rankings completely invert between codebases. Claude Code is the best performer on vLLM (46%) but the worst on SGLang (27%). TRAE with GPT-5 is the opposite pattern. Same underlying models, different agent scaffolding. It suggests the scaffolding around the model matters at least as much as the model itself.
We also tried three open-source models. None produced a single working optimization. One of them (MiniMax-M2.1) got stuck in a loop printing "I need to actually use the tools now" 2,412 times without ever making a tool call.
The benchmark, all agent transcripts, and evaluation code are open: https://ayushnangia.github.io/iso-bench-website/
Curious what others think — the scaffolding result in particular feels underexplored.
PaulHoule•1h ago
One reason the teams I was on did not invent models that good in the 2010s was that we didn't want to give them credit for Lucky Wins.