frontpage.

Weird System Prompt Artefacts

https://blog.nilenso.com/blog/2026/02/12/weird-system-prompt-artefacts/
1•sriharis•1m ago•0 comments

Making WebAssembly a first-class language on the Web

https://hacks.mozilla.org/2026/02/making-webassembly-a-first-class-language-on-the-web/
1•mikece•1m ago•0 comments

Disrupting the Gridtide Global Cyber Espionage Campaign

https://cloud.google.com/blog/topics/threat-intelligence/disrupting-gridtide-global-espionage-cam...
1•jonah•1m ago•0 comments

A Decade of Docker Containers

https://anil.recoil.org/papers/2026-decade-docker
1•mariuz•1m ago•0 comments

Will vibe coding end like the maker movement?

https://read.technically.dev/p/vibe-coding-and-the-maker-movement
1•itunpredictable•1m ago•0 comments

Feature Platforms: The Underrated Infrastructure Layer Behind Fast ML Teams

https://blog.nilenso.com/blog/2026/02/19/feature-platforms/
1•sriharis•2m ago•0 comments

From Tahoe bugs to app review delays, the Apple developer experience is fraying

https://keydiscussions.com/2026/02/26/from-tahoe-bugs-to-long-app-review-wait-times-even-app-proc...
1•spenvo•2m ago•0 comments

The Government Just Made It Harder to See What Spy Tech It Buys

https://www.404media.co/the-government-just-made-it-harder-to-see-what-spy-tech-it-buys/
1•cdrnsf•2m ago•0 comments

iRobot Went Bankrupt. Its Product Scores Explain Why

https://www.criticaster.com/blog/irobot-bankrupt-scores-explain-why
1•gghootch•2m ago•0 comments

The Agentic Simul: What 500 PRs in two months taught me

https://tobeva.com/articles/simul/
1•pbw•2m ago•0 comments

Schrödinger Color Theory Completed After 100 Years

https://www.sciencedaily.com/releases/2026/02/260222092302.htm
1•EventH-•3m ago•0 comments

Linux Heterogeneous Memory Management (HMM)

https://www.kernel.org/doc/html/latest/mm/hmm.html
3•teleforce•3m ago•0 comments

CVE-2026-2006 – PostgreSQL Out-of-cycle release

https://wiki.postgresql.org/wiki/2026-02_Regression_Fixes
1•krembo•4m ago•0 comments

I don't need AI to build me a new app. I need it to make Jira bearable

1•niel_hu•4m ago•0 comments

Show HN: Cifer, zero-key custody using threshold cryptography

https://cifer-security.com
1•mikflex•4m ago•0 comments

British Citizenship Applications by US Nationals Hit Record High

https://www.bloomberg.com/news/articles/2026-02-26/british-citizenship-applications-by-us-nationa...
2•helsinkiandrew•4m ago•0 comments

A New Era of Databases: Lakebase

https://www.databricks.com/blog/what-is-a-lakebase
1•mastabadtomm•5m ago•0 comments

Show HN: NotBuiltYet – Open-source library of civilisation problems worth solving

https://shivankar-madaan.github.io/notbuiltyet/
2•mrxlimitless•5m ago•0 comments

Show HN: Ryvos – Autonomous AI assistant in Rust (15MB RAM, 50 tools, 16 providers)

https://ryvos.dev
1•aayush-mishraaa•6m ago•0 comments

Nano Banana 2: Google's latest AI image generation model

https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/
4•davidbarker•6m ago•1 comment

EHR API Explorer

https://explorer.usecobalt.com/
1•bryanmillstein•7m ago•1 comment

Matrix Inverse Roots with Fixed-Budget GEMM Kernels

https://jiha-kim.github.io/posts/fast-matrix-inverse-roots/
1•ibobev•7m ago•0 comments

Partial Truth vs. Explicit Failure: Designing Honest System Responses

https://www.sandordargo.com/blog/2026/02/25/partial-truth-vs-explicit-failure
1•ibobev•8m ago•0 comments

Memory or mood? Probiotic capsules and powders may affect the brain differently

https://medicalxpress.com/news/2026-02-memory-mood-probiotic-capsules-powders.html
1•PaulHoule•9m ago•0 comments

Linux Foundation's report reveals contributing to open source offers a 2x-5x ROI

https://thenewstack.io/roi-open-source-contribution/
1•CrankyBear•9m ago•0 comments

Speculations Concerning the First Ultraintelligent Machine (1964) [pdf]

https://languagelog.ldc.upenn.edu/myl/Good1964.pdf
3•ZeljkoS•9m ago•0 comments

Rule of Three (Computer Programming)

https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)
3•thunderbong•10m ago•0 comments

Introduction to Data-Centric Query Compilation

https://duckul.us/blog/data-centric-query-compilation
1•duckulus•10m ago•1 comment

I started a software research company

https://notes.eatonphil.com/2026-02-25-i-started-a-company.html
1•ibobev•10m ago•0 comments

Show HN: I built a minimal distributed tracer from scratch to understand better

https://github.com/td-02/tracelm
1•taeshdas•11m ago•1 comment

Show HN: Coding agents find the right GPU bottleneck 70% of the time, fix it 30%

https://ayushnangia.github.io/iso-bench-website/
2•ayushnangia16•1h ago
One of the authors. Some things that surprised us while running these experiments:

The tasks are pulled from real merged PRs in vLLM and SGLang, so there's a known-good human solution for each one. Agents get the full codebase, the issue description, and a test harness. Pretty generous setup.
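A minimal sketch of what that kind of harness could look like (the function names and return shape here are my assumptions, not the benchmark's actual evaluation code): a candidate patch only counts as a working optimization if it matches the known-good baseline's outputs and also runs faster.

```python
import time

def evaluate_patch(baseline_fn, patched_fn, inputs, rel_tol=1e-6):
    """Hypothetical harness: a patch counts as a working optimization
    only if it matches the baseline's outputs AND runs faster."""
    # Correctness: compare against the known-good baseline on every input.
    for inp in inputs:
        expected, got = baseline_fn(inp), patched_fn(inp)
        if abs(expected - got) > rel_tol * max(1.0, abs(expected)):
            return {"correct": False, "speedup": None}

    # Performance: wall-clock both versions over the same workload.
    t0 = time.perf_counter()
    for inp in inputs:
        baseline_fn(inp)
    t_base = time.perf_counter() - t0

    t0 = time.perf_counter()
    for inp in inputs:
        patched_fn(inp)
    t_patch = time.perf_counter() - t0

    return {"correct": True, "speedup": t_base / t_patch}
```

For instance, `evaluate_patch(lambda n: sum(range(n)), lambda n: n * (n - 1) // 2, [10_000] * 20)` reports a correct patch with a large speedup, while a patch that returns wrong values is rejected before timing ever runs.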

What we didn't expect: the agents are genuinely good at diagnosing the problem. They read the code, find the bottleneck, describe the right fix. But then the generated code has subtle bugs. Off-by-one in kernel indexing, wrong tensor shapes, missing synchronization barriers. The kind of stuff that passes a code review at first glance but segfaults under load.
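A toy illustration of that failure mode (plain Python, not code from the benchmark): a blocked reduction whose loop bound uses floor division and silently drops the final partial block, so it passes every test whose input size happens to be a multiple of the block size.

```python
def blocked_sum_buggy(x, block=4):
    # Off-by-one-style bug: floor division drops the final partial block,
    # so the result is only correct when len(x) is a multiple of `block`.
    total = 0
    for i in range(len(x) // block):          # should be ceil division
        total += sum(x[i * block:(i + 1) * block])
    return total

def blocked_sum_fixed(x, block=4):
    # Correct version: ceil division covers the tail block too.
    total = 0
    for i in range((len(x) + block - 1) // block):
        total += sum(x[i * block:(i + 1) * block])
    return total

x = list(range(10))               # 10 is not a multiple of 4
print(blocked_sum_buggy(x))       # 28 -- silently drops elements 8 and 9
print(blocked_sum_fixed(x))       # 45 -- matches sum(x)
```

On sizes divisible by the block, the two versions agree, which is exactly why this class of bug can survive a quick review and a green test suite.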

The other weird result: agent rankings completely invert between codebases. Claude Code is the best performer on vLLM (46%) but the worst on SGLang (27%); TRAE with GPT-5 shows the opposite pattern. Same underlying models, different agent scaffolding. It suggests the scaffolding around the model matters at least as much as the model itself.

We also tried three open-source models. None produced a single working optimization. One of them (MiniMax-M2.1) got stuck in a loop printing "I need to actually use the tools now" 2,412 times without ever making a tool call.

The benchmark, all agent transcripts, and evaluation code are open: https://ayushnangia.github.io/iso-bench-website/

Curious what others think; the scaffolding result in particular feels underexplored.

Comments

PaulHoule•1h ago
Those "Lucky Wins" are a big part of the LLM success or "looks like success" story.

One reason the teams I was on did not invent models that good in the 2010s was that we didn't want to give them credit for Lucky Wins.