Show HN: Coding agents find the right GPU bottleneck 70% of the time, fix it 30%

https://ayushnangia.github.io/iso-bench-website/
2•ayushnangia16•1h ago
One of the authors. Some things that surprised us while running these experiments:

The tasks are pulled from real merged PRs in vLLM and SGLang, so there's a known-good human solution for each one. Agents get the full codebase, the issue description, and a test harness. Pretty generous setup.

What we didn't expect: the agents are genuinely good at diagnosing the problem. They read the code, find the bottleneck, describe the right fix. But then the generated code has subtle bugs. Off-by-one in kernel indexing, wrong tensor shapes, missing synchronization barriers. The kind of stuff that passes a code review at first glance but segfaults under load.
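
To make the "off-by-one in kernel indexing" failure mode concrete, here is a minimal CPU-side sketch in Python. It is not taken from the benchmark, vLLM, or SGLang; every name in it (`launch`, `double_kernel`, `buggy_grid`, `fixed_grid`, `covered`) is hypothetical. It assumes a simple 1-D launch where the block count is computed from the data size, and shows how floor division silently skips the final partial block whenever the size is not a multiple of the block width.

```python
def launch(n, block, num_blocks, kernel):
    """Simulate a 1-D GPU launch on the CPU: call kernel once per (block, thread)."""
    out = [0] * n
    for b in range(num_blocks):
        for t in range(block):
            # Global index, computed the way blockIdx * blockDim + threadIdx would be.
            kernel(b * block + t, n, out)
    return out


def double_kernel(i, n, out):
    # Bounds guard: threads past the end of the data do nothing.
    if i < n:
        out[i] = 2 * i


def buggy_grid(n, block):
    # Floor division: drops the last partial block when n % block != 0.
    return n // block


def fixed_grid(n, block):
    # Ceiling division: always launches enough blocks to cover the tail.
    return (n + block - 1) // block


def covered(n, block, grid_fn):
    """Return True if every element was written with the expected value."""
    out = launch(n, block, grid_fn(n, block), double_kernel)
    return out == [2 * i for i in range(n)]


if __name__ == "__main__":
    print(covered(256, 128, buggy_grid))  # size is a multiple of the block width
    print(covered(300, 128, buggy_grid))  # tail elements 256..299 are never written
    print(covered(300, 128, fixed_grid))
```

The buggy version is correct on every "round" input size, which is roughly why this class of bug can survive a quick review and only show up on real workloads.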

The other weird result: agent rankings completely invert between codebases. Claude Code is the best performer on vLLM (46%) but the worst on SGLang (27%). TRAE with GPT-5 is the opposite pattern. Same underlying models, different agent scaffolding. It suggests the scaffolding around the model matters at least as much as the model itself.

We also tried three open-source models. None produced a single working optimization. One of them (MiniMax-M2.1) got stuck in a loop printing "I need to actually use the tools now" 2,412 times without ever making a tool call.

The benchmark, all agent transcripts, and evaluation code are open: https://ayushnangia.github.io/iso-bench-website/

Curious what others think about the scaffolding result in particular; it feels underexplored.

Comments

PaulHoule•1h ago
Those "Lucky Wins" are a big part of the LLM success or "looks like success" story.

One reason the teams I was on did not invent models that good in the 2010s was that we didn't want to give them credit for Lucky Wins.

Show HN: Terminal Phone – E2EE Walkie Talkie from the Command Line

https://gitlab.com/here_forawhile/terminalphone
194•smalltorch•5h ago•50 comments

Show HN: Agent Swarm – Multi-agent self-learning teams (OSS)

https://github.com/desplega-ai/agent-swarm
61•tarasyarema•4h ago•39 comments

Show HN: NotBuiltYet – Open-source library of civilisation problems worth solving

https://shivankar-madaan.github.io/notbuiltyet/
2•mrxlimitless•20m ago•0 comments

Show HN: Gonzales – Self-hosted internet speed monitor with Home Assistant

https://github.com/akustikrausch/gonzales
2•janiskl93•26m ago•0 comments

Show HN: I'm building TaskWeave, a task orchestrator

https://github.com/spicyPoke/TaskWeave
2•spicypoke•27m ago•0 comments

Show HN: Modern Reimplementation of the Speck Molecule Renderer

https://github.com/vangelov/modern-speck
19•vlad_angelov•4d ago•2 comments

Show HN: Respectify – A comment moderator that teaches people to argue better

https://respectify.org/
201•vintagedave•1d ago•197 comments

Show HN: I built a 50ms SPF record and Shadow IT scanner

https://spf1.com
2•bwoud•1h ago•3 comments

Show HN: Riverse – persistent AI memory that grows with you, no RAG

https://github.com/wangjiake/JKRiver
2•collenjk•3h ago•0 comments

Show HN: A real-time strategy game that AI agents can play

https://llmskirmish.com/
209•__cayenne__•1d ago•75 comments

Show HN: I ported Tree-sitter to Go

https://github.com/odvcencio/gotreesitter
212•odvcencio•21h ago•100 comments

Show HN: Clocksimulator.com – A minimalist, distraction-free analog clock

https://www.clocksimulator.com/
120•user_timo•1d ago•93 comments

Show HN: OpenSwarm – Multi‑Agent Claude CLI Orchestrator for Linear/GitHub

https://github.com/Intrect-io/OpenSwarm
33•unohee•14h ago•19 comments

Show HN: Django Control Room – All Your Tools Inside the Django Admin

https://github.com/yassi/dj-control-room
128•yassi_dev•1d ago•53 comments

Show HN: One grammar, 18 YAML parsers – a Futamura projector in Common Lisp

https://github.com/johnagrillo62/yaml-project
4•johnagrillo62•4h ago•1 comment

Show HN: I built this toolbox with AI – never wrote a line myself

https://tool.hikun.me/ko
2•harrykoreanlee•4h ago•0 comments

Show HN: Parallel rsync launcher with fancy progress bars

https://github.com/overflowy/parallel-rsync
2•overflowy•5h ago•0 comments

Show HN: PyMOL-RS – Rust reimplementation of PyMOL with modern rendering

https://github.com/zmactep/pymol-rs/releases/tag/v0.1.0
4•zmactep•5h ago•1 comment

Show HN: I built an AI that turns emailed PDFs into ledger entries in 60s

https://baguno.app
2•lakma•6h ago•1 comment

Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

https://github.com/Zyora-Dev/zse
57•zyoralabs•15h ago•7 comments

Show HN: Unix for the Commodore 64? Open Source

https://github.com/ascarola/c64ux/releases/tag/v0.7
14•ascarola•15h ago•5 comments

Show HN: Codex builds a working NES Emulator in one hour

https://github.com/kaonashi-tyc/codex-nes-emulator
6•zi2zi-jit•6h ago•4 comments

Show HN: Moonshine Open-Weights STT models – higher accuracy than WhisperLargev3

https://github.com/moonshine-ai/moonshine
311•petewarden•1d ago•74 comments

Show HN: Sgai – Goal-driven multi-agent software dev (GOAL.md → working code)

https://github.com/sandgardenhq/sgai
34•sandgardenhq•23h ago•19 comments

Show HN: Scheme-langserver – Digest incomplete code with static analysis

https://github.com/ufo5260987423/scheme-langserver
50•ufo5260987423•2d ago•2 comments

Show HN: PgDog – Scale Postgres without changing the app

https://github.com/pgdogdev/pgdog
321•levkk•3d ago•61 comments

Show HN: enveil – hide your .env secrets from prAIng eyes

https://github.com/GreatScott/enveil
199•parkaboy•2d ago•129 comments

Show HN: Emdash – Open-source agentic development environment

https://github.com/generalaction/emdash
201•onecommit•1d ago•71 comments

Show HN: Skillscape – Engineering skills matrix without the spreadsheet

https://www.skillscape.dev/
2•danielyefet•9h ago•0 comments