
Hey HN! We built Agent Runner, a model-agnostic, open-source agent harness that runs the same prompt against two anonymized coding agents in parallel sandboxes. Each agent can make tool calls, edit multiple files, and self-correct through iterative reasoning. You pick the better result, and that preference becomes the ground truth for the leaderboard.

Why we built it

Traditional benchmarks often fall short for modern agentic systems: they rely on static tasks and measure only final outputs. Real coding agents, by contrast, modify multiple files across a repo, respond to user re-prompts, use tool calls, and recover from partial failures.

What Agent Runner does

- You ask it to build anything.
- Agent Runner kicks off two generations from different LLM providers (OpenAI, Anthropic, Google, xAI, Mistral, Kimi, and more), each in its own sandbox.
- The anonymized models make tool calls, edit multiple files, and respond to re-prompts.
- You pick your favorite, and this preference powers the benchmark.
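To make the flow above concrete, here is a minimal sketch in Python. It is not Agent Runner's actual code: the stub agents, the function names (`run_battle`, `record_vote`, `leaderboard`), and the plain win-rate ranking are all assumptions made for illustration, and the real leaderboard may aggregate preferences differently.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def make_stub_agent(style):
    """Hypothetical stand-in for a real sandboxed coding agent."""
    def agent(prompt):
        return f"({style}) draft for: {prompt}"
    return agent


# Assumed registry of agents; the names are placeholders, not real models.
AGENTS = {
    "model-a": make_stub_agent("terse"),
    "model-b": make_stub_agent("verbose"),
    "model-c": make_stub_agent("tested"),
}


def run_battle(prompt, rng):
    """Run one prompt on two randomly chosen agents in parallel.

    Returns a dict keyed by anonymous labels. The voter should only be
    shown the labels and outputs; the real names stay hidden until the
    vote has been recorded.
    """
    first, second = rng.sample(sorted(AGENTS), 2)
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(AGENTS[name], prompt) for name in (first, second)]
        outputs = [future.result() for future in futures]
    return {"Agent 1": (first, outputs[0]), "Agent 2": (second, outputs[1])}


def record_vote(tally, battle, winner_label):
    """Credit both agents with a battle and the winner with a win."""
    for label, (name, _output) in battle.items():
        tally[name]["battles"] += 1
        if label == winner_label:
            tally[name]["wins"] += 1


def leaderboard(tally):
    """Rank agents by win rate (a simple assumed aggregation)."""
    rows = [(name, s["wins"] / s["battles"]) for name, s in tally.items() if s["battles"]]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    rng = random.Random()
    tally = defaultdict(lambda: {"wins": 0, "battles": 0})
    battle = run_battle("build a todo app", rng)
    for label, (_name, output) in battle.items():
        print(label, "->", output)
    record_vote(tally, battle, "Agent 1")  # pretend the user preferred Agent 1
    print(leaderboard(tally))
```

Win rate is used here only because it is the simplest way to turn pairwise votes into a ranking; a rating system that accounts for opponent strength would be a natural alternative.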

Because different providers handle tool calls, prompts, and execution semantics differently, we worked with each provider to ensure configurations reflect intended behavior. These provider-specific setups remain private, but Agent Runner itself is open-source.

How to try it

- Kick off Agent Runner at https://www.designarena.ai/agentarena
- Repo at https://github.com/Design-Arena/agent-runner
- Use it as a CLI tool: https://pypi.org/project/agent-runner/

  pip install agent-runner
  agentrunner run "create a nextjs replica of Discord"

We hope this provides a provider-agnostic, framework-agnostic, realistic benchmark for state-of-the-art coding agents.

Video demo: https://youtu.be/rdtiuCHatjs

Show HN: Agent Runner – open-source agent harness to benchmark real coding