Hi HN,

We recently ran an experiment to answer a simple question:

Does coordinating multiple AI agents as a team actually help with real software engineering tasks, compared to a single strong agent?

To test this, we evaluated our system on SWE-bench Verified. The benchmark consists of real GitHub issues that require understanding codebases, modifying multiple files, running tests, and iterating.

Instead of treating software engineering as a single-agent patch generation problem, we model it as an organizational process.

Our system uses a team of agents with explicit roles:

* manager: plans work, assigns tasks, integrates results
* researcher: explores the codebase, issue history, and constraints
* engineer: implements fixes in isolated environments
* reviewer: inspects changes, requests revisions, validates results

There is no fixed pipeline and no predefined number of steps. Agents communicate via structured artifacts (plans, diffs, reviews) and produce real GitHub pull requests with full history.
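To make the "structured artifacts" idea concrete, here is a minimal sketch of how roles and artifacts could be modeled. It is not the code from our repo: `Role`, `Artifact`, `Workspace`, `needs_revision`, and the "changes requested" convention are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    MANAGER = "manager"
    RESEARCHER = "researcher"
    ENGINEER = "engineer"
    REVIEWER = "reviewer"


@dataclass
class Artifact:
    """One structured message exchanged between agents (hypothetical shape)."""
    kind: str      # "plan", "diff", or "review"
    author: Role
    body: str


@dataclass
class Workspace:
    """Shared, append-only log of artifacts for a single issue."""
    artifacts: List[Artifact] = field(default_factory=list)

    def post(self, kind: str, author: Role, body: str) -> Artifact:
        artifact = Artifact(kind, author, body)
        self.artifacts.append(artifact)
        return artifact

    def latest(self, kind: str) -> Optional[Artifact]:
        # Walk backwards so the most recent artifact of this kind wins.
        for artifact in reversed(self.artifacts):
            if artifact.kind == kind:
                return artifact
        return None


def needs_revision(ws: Workspace) -> bool:
    """True if the most recent review asks the engineer for changes.

    Assumption: reviews that request changes start with "changes requested".
    """
    review = ws.latest("review")
    return review is not None and review.body.startswith("changes requested")
```

Because the next step is decided from the latest artifacts rather than from a hard-coded sequence, a loop built on something like `needs_revision` can run as many review rounds as an issue needs.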

For evaluation, we compared three setups on SWE-bench Verified:

* single-agent baseline: GPT-5 (medium reasoning) + shell
* agent team (ours): GPT-5 (manager, researcher) + GPT-5 Codex (engineer, reviewer), both medium reasoning
* stronger single-model reference: GPT-5.2 (high reasoning)
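As a rough illustration of the agent-team setup, the role-to-model assignment can be thought of as a small lookup table. This is a sketch only: `ROLE_MODELS` and `model_for` are hypothetical names, and the strings are the labels used in this post, not API model identifiers.

```python
# Role -> (model label, reasoning effort), as described for the agent team.
# Labels are taken from the post; they are not real API identifiers.
ROLE_MODELS = {
    "manager": ("GPT-5", "medium"),
    "researcher": ("GPT-5", "medium"),
    "engineer": ("GPT-5 Codex", "medium"),
    "reviewer": ("GPT-5 Codex", "medium"),
}


def model_for(role: str) -> tuple:
    """Return the (model label, reasoning effort) pair assigned to a role."""
    try:
        return ROLE_MODELS[role]
    except KeyError:
        raise ValueError(f"unknown role: {role!r}") from None
```

Keeping the assignment in one place is what makes it cheap to swap a different model into a single role without touching the others.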

Results:

* the agent team resolves ~7% more issues than the single-agent GPT-5 medium reasoning baseline
* despite using medium reasoning models, the agent team shows ~0.5% better quality than a single GPT-5.2 (high reasoning) agent

Beyond resolution rate, the main benefits are cleaner responsibility boundaries, context isolation, easier debugging, and the ability to use different models for different roles.

Code + trajectories are open source: https://github.com/agynio/platform

Paper with methodology and results: https://arxiv.org/abs/2602.01465

Would love to hear your thoughts.
