We recently ran an experiment to answer a simple question:
Does coordinating multiple AI agents as a team actually help with real software engineering tasks, compared to a single strong agent?
To test this, we evaluated our system on SWE-bench Verified. The benchmark consists of real GitHub issues that require understanding codebases, modifying multiple files, running tests, and iterating.
Instead of treating software engineering as a single-agent patch generation problem, we model it as an organizational process.
Our system uses a team of agents with explicit roles:
* manager: plans work, assigns tasks, integrates results
* researcher: explores the codebase, issue history, and constraints
* engineer: implements fixes in isolated environments
* reviewer: inspects changes, requests revisions, validates results
There is no fixed pipeline and no predefined number of steps. Agents communicate via structured artifacts (plans, diffs, reviews) and produce real GitHub pull requests with full history.
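To make the coordination model concrete, here is a minimal hypothetical sketch in Python. The names below (Agent, Plan, Diff, Review, resolve_issue) are illustrative and are not the platform's actual API; where the real agents call models and tools, the stubs here just return placeholders.

```python
from dataclasses import dataclass, field

# Illustrative artifact types; these are not the platform's actual schemas.
@dataclass
class Plan:
    issue_id: str
    steps: list[str]

@dataclass
class Diff:
    patch: str
    files_changed: list[str] = field(default_factory=list)

@dataclass
class Review:
    approved: bool
    comments: list[str] = field(default_factory=list)


class Agent:
    """Hypothetical role wrapper: in the real system each method would call a model."""
    def __init__(self, role: str, model: str):
        self.role, self.model = role, model

    def research(self, issue_id: str) -> str:
        return f"notes on {issue_id} gathered by {self.model}"

    def plan(self, issue_id: str, findings: str) -> Plan:
        return Plan(issue_id, steps=[f"address: {findings}"])

    def implement(self, plan: Plan) -> Diff:
        return Diff(patch=f"--- placeholder fix for {plan.issue_id} ---")

    def review(self, diff: Diff) -> Review:
        return Review(approved=bool(diff.patch))


def resolve_issue(issue_id, manager, researcher, engineer, reviewer, max_rounds=5):
    """Illustrative coordination loop: the reviewer's verdict, not a fixed
    pipeline, decides whether the engineer revises or the work is accepted."""
    findings = researcher.research(issue_id)
    plan = manager.plan(issue_id, findings)
    for _ in range(max_rounds):
        diff = engineer.implement(plan)
        verdict = reviewer.review(diff)
        if verdict.approved:
            return diff  # in the real system, this is where a PR would be opened
        # manager folds reviewer feedback back into an updated plan
        plan = manager.plan(issue_id, findings + "; " + "; ".join(verdict.comments))
    return None


if __name__ == "__main__":
    result = resolve_issue(
        "EXAMPLE-ISSUE-1",
        manager=Agent("manager", "gpt-5"),
        researcher=Agent("researcher", "gpt-5"),
        engineer=Agent("engineer", "gpt-5-codex"),
        reviewer=Agent("reviewer", "gpt-5-codex"),
    )
    print(result)
```

The point of the sketch is the shape of the loop: work stops when the reviewer approves, not after a predefined number of steps.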
For evaluation, we compared three setups on SWE-bench Verified:
* single-agent baseline: GPT-5 (medium reasoning) + shell
* agent team (ours): GPT-5 (manager, researcher) + GPT-5 Codex (engineer, reviewer), both medium reasoning
* stronger single-model reference: GPT-5.2 (high reasoning)
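For illustration, the three setups come down to a role-to-model mapping. The snippet below is a hypothetical sketch, not the platform's actual configuration format, and the model identifier strings are placeholders.

```python
# Hypothetical role-to-model assignments for the three evaluated setups.
SETUPS = {
    "single_agent_baseline": {
        "agent": {"model": "gpt-5", "reasoning": "medium", "tools": ["shell"]},
    },
    "agent_team": {
        "manager":    {"model": "gpt-5",       "reasoning": "medium"},
        "researcher": {"model": "gpt-5",       "reasoning": "medium"},
        "engineer":   {"model": "gpt-5-codex", "reasoning": "medium"},
        "reviewer":   {"model": "gpt-5-codex", "reasoning": "medium"},
    },
    "single_agent_reference": {
        "agent": {"model": "gpt-5.2", "reasoning": "high"},
    },
}
```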
Results:
* the agent team resolves ~7% more issues than the single-agent GPT-5 (medium reasoning) baseline
* despite using medium reasoning models, the agent team resolves ~0.5% more issues than a single GPT-5.2 (high reasoning) agent
Beyond resolution rate, the main benefits are cleaner responsibility boundaries, context isolation, easier debugging, and the ability to use different models for different roles.
Code + trajectories are open source: https://github.com/agynio/platform
Paper with methodology and results: https://arxiv.org/abs/2602.01465
Would love to hear your thoughts.