My RL-trained multi-agent coding model, Orca-Agent-v0.1-14B, scored 167% higher (relative) than its base model on Stanford's TerminalBench. I've open-sourced everything.
*What I did:*
- I trained a 14B orchestrator model to better coordinate explorer & coder subagents (the subagents are exposed to the orchestrator as tool calls; see the sketch after this list)
- Scaled to 32x H100s that were pushed to their limits across 4 bare-metal nodes
- Scaled to 256 Docker environments rolling out simultaneously, automatically distributed across the cluster
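For concreteness, here's a minimal sketch of what "subagents as tool calls" can look like: the orchestrator sees two tools, and each tool call is routed to the corresponding subagent. The tool names, schemas, and dispatch function below are my illustration, not the actual Orca-Agent implementation.

```python
# Illustrative sketch only: tool names, schemas, and dispatch are assumptions,
# not the actual Orca-Agent code.
import json

# Tool definitions the orchestrator model would see (OpenAI-style schemas).
SUBAGENT_TOOLS = [
    {
        "name": "explorer",
        "description": "Read-only subagent: inspects the repo/terminal and reports findings.",
        "parameters": {
            "type": "object",
            "properties": {"instruction": {"type": "string"}},
            "required": ["instruction"],
        },
    },
    {
        "name": "coder",
        "description": "Editing subagent: applies the code changes described in the instruction.",
        "parameters": {
            "type": "object",
            "properties": {"instruction": {"type": "string"}},
            "required": ["instruction"],
        },
    },
]

def dispatch_tool_call(call: dict, run_explorer, run_coder) -> str:
    """Route one orchestrator tool call to the matching subagent and return its report."""
    args = call["arguments"]
    if isinstance(args, str):  # tool arguments often arrive as a JSON string
        args = json.loads(args)
    if call["name"] == "explorer":
        return run_explorer(args["instruction"])
    if call["name"] == "coder":
        return run_coder(args["instruction"])
    raise ValueError(f"Unknown tool: {call['name']}")
```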
*Key results:*
- Qwen3-14B jumped from *7% → 18.25%* on TerminalBench after training
- Model now within striking distance of Qwen3-Coder-480B (19.7%)
- Training was stable with smooth entropy decrease and healthy gradient norms
*Training approach:*
Reward design (and my biggest learning): keep it simple - *just unit tests*. Every "smart" reward signal I tried to craft led to policy collapse. A sketch of this kind of reward is below.
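To make that concrete, here's a minimal sketch of a unit-test-only reward, assuming the task's tests are run inside its Docker container via `docker exec`. The binary pass/fail scoring, the `pytest` command, and the timeout are my assumptions for illustration, not the exact scheme used in training.

```python
# Sketch of a pure unit-test reward. Assumes the rollout's Docker container is
# still running and that the task ships a test command (pytest here is a guess).
import subprocess

def unit_test_reward(container_id: str, test_cmd: str = "pytest -q", timeout_s: int = 600) -> float:
    """Return 1.0 if the task's unit tests pass inside the container, else 0.0."""
    try:
        result = subprocess.run(
            ["docker", "exec", container_id, "bash", "-lc", test_cmd],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat hung test runs as failures
    return 1.0 if result.returncode == 0 else 0.0
```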
Curriculum learning:
- Stage-1: Tasks where base model succeeded 1-2/3 times (41 tasks)
- Stage-2: Tasks where the Stage-1 model succeeded 1-4/5 times (task selection sketched below)
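A sketch of how that success-rate filtering could be implemented. The attempt counts and thresholds mirror the two stages above; the function signature, rollout interface, and sampling loop are assumptions for illustration.

```python
# Illustrative curriculum filter: keep tasks the current policy solves
# sometimes but not always, based on repeated rollouts.
from typing import Callable, Iterable

def select_curriculum_tasks(
    tasks: Iterable[str],
    rollout_fn: Callable[[str], bool],  # runs one episode, True if unit tests pass
    attempts: int,
    min_successes: int,
    max_successes: int,
) -> list[str]:
    selected = []
    for task in tasks:
        successes = sum(rollout_fn(task) for _ in range(attempts))
        if min_successes <= successes <= max_successes:
            selected.append(task)
    return selected

# Stage-1: tasks the base model solved 1-2 times out of 3 attempts.
# stage1 = select_curriculum_tasks(all_tasks, base_model_rollout, 3, 1, 2)
# Stage-2: tasks the Stage-1 model solved 1-4 times out of 5 attempts.
# stage2 = select_curriculum_tasks(all_tasks, stage1_model_rollout, 5, 1, 4)
```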
Dataset: Used synthetically generated RL environments and unit tests
*More details:*
I've added lots more detail in the repo linked to this submission, including training code, model weights, and datasets.
Huge thanks to:
- Tara for providing the compute
- Prime Intellect team for building prime-rl and dealing with my endless questions
- Alex Dimakis for the conversation that sparked training the orchestrator model
Thanks for reading!
Dan
(Evaluated on the excellent TerminalBench benchmark by Stanford & Laude Institute)