We recently ran an experiment to answer a simple question:
Does coordinating multiple AI agents as a team actually help with real software engineering tasks, compared to a single strong agent?
To test this, we evaluated our system on SWE-bench Verified. The benchmark consists of real GitHub issues that require understanding codebases, modifying multiple files, running tests, and iterating.
Instead of treating software engineering as a single-agent patch generation problem, we model it as an organizational process.
Our system uses a team of agents with explicit roles:
* manager: plans work, assigns tasks, integrates results
* researcher: explores the codebase, issue history, and constraints
* engineer: implements fixes in isolated environments
* reviewer: inspects changes, requests revisions, validates results
There is no fixed pipeline and no predefined number of steps. Agents communicate via structured artifacts (plans, diffs, reviews) and produce real GitHub pull requests with full history.
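To make the coordination model concrete, here is a minimal hypothetical sketch in Python. The names below (Agent, Plan, Diff, Review, resolve_issue) are illustrative and are not the platform's actual API; where the real agents call models and tools, the stubs here just return placeholders.

```python
from dataclasses import dataclass, field

# Illustrative artifact types; these are not the platform's actual schemas.
@dataclass
class Plan:
    issue_id: str
    steps: list[str]

@dataclass
class Diff:
    patch: str
    files_changed: list[str] = field(default_factory=list)

@dataclass
class Review:
    approved: bool
    comments: list[str] = field(default_factory=list)


class Agent:
    """Hypothetical role wrapper: in the real system each method would call a model."""
    def __init__(self, role: str, model: str):
        self.role, self.model = role, model

    def research(self, issue_id: str) -> str:
        return f"notes on {issue_id} gathered by {self.model}"

    def plan(self, issue_id: str, findings: str) -> Plan:
        return Plan(issue_id, steps=[f"address: {findings}"])

    def implement(self, plan: Plan) -> Diff:
        return Diff(patch=f"--- placeholder fix for {plan.issue_id} ---")

    def review(self, diff: Diff) -> Review:
        return Review(approved=bool(diff.patch))


def resolve_issue(issue_id, manager, researcher, engineer, reviewer, max_rounds=5):
    """Illustrative coordination loop: the reviewer's verdict, not a fixed
    pipeline, decides whether the engineer revises or the work is accepted."""
    findings = researcher.research(issue_id)
    plan = manager.plan(issue_id, findings)
    for _ in range(max_rounds):
        diff = engineer.implement(plan)
        verdict = reviewer.review(diff)
        if verdict.approved:
            return diff  # in the real system, this is where a PR would be opened
        # manager folds reviewer feedback back into an updated plan
        plan = manager.plan(issue_id, findings + "; " + "; ".join(verdict.comments))
    return None


if __name__ == "__main__":
    result = resolve_issue(
        "EXAMPLE-ISSUE-1",
        manager=Agent("manager", "gpt-5"),
        researcher=Agent("researcher", "gpt-5"),
        engineer=Agent("engineer", "gpt-5-codex"),
        reviewer=Agent("reviewer", "gpt-5-codex"),
    )
    print(result)
```

The point of the sketch is the shape of the loop: work stops when the reviewer approves, not after a predefined number of steps.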
For evaluation, we compared three setups on SWE-bench Verified:
* single-agent baseline: GPT-5 (medium reasoning) + shell
* agent team (ours): GPT-5 (manager, researcher) + GPT-5 Codex (engineer, reviewer), both medium reasoning
* stronger single-model reference: GPT-5.2 (high reasoning)
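For illustration, the three setups come down to a role-to-model mapping. The snippet below is a hypothetical sketch, not the platform's actual configuration format, and the model identifier strings are placeholders.

```python
# Hypothetical role-to-model assignments for the three evaluated setups.
SETUPS = {
    "single_agent_baseline": {
        "agent": {"model": "gpt-5", "reasoning": "medium", "tools": ["shell"]},
    },
    "agent_team": {
        "manager":    {"model": "gpt-5",       "reasoning": "medium"},
        "researcher": {"model": "gpt-5",       "reasoning": "medium"},
        "engineer":   {"model": "gpt-5-codex", "reasoning": "medium"},
        "reviewer":   {"model": "gpt-5-codex", "reasoning": "medium"},
    },
    "single_agent_reference": {
        "agent": {"model": "gpt-5.2", "reasoning": "high"},
    },
}
```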
Results:
* the agent team resolves ~7% more issues than the single-agent GPT-5 (medium reasoning) baseline
* despite using medium reasoning models, the agent team resolves ~0.5% more issues than a single GPT-5.2 (high reasoning) agent
Beyond resolution rate, the main benefits are cleaner responsibility boundaries, context isolation, easier debugging, and the ability to use different models for different roles.
Code + trajectories are open source: https://github.com/agynio/platform
Paper with methodology and results: https://arxiv.org/abs/2602.01465
Would love to hear your thoughts.