My RL-trained multi-agent coding model, Orca-Agent-v0.1-14B, scored 167% higher (relative) than its base model on Stanford's TerminalBench. I've open-sourced everything.
*What I did:*
- I trained a 14B orchestrator model to better coordinate explorer & coder subagents (the subagents are exposed to the orchestrator as tool calls; see the sketch after this list)
- Scaled to 32x H100s that were pushed to their limits across 4 bare-metal nodes
- Scaled to 256 Docker environments rolling out simultaneously, automatically distributed across the cluster
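For concreteness, here's a minimal sketch of what "subagents as tool calls" can look like: the orchestrator sees two tools, and each tool call is routed to the corresponding subagent. The tool names, schemas, and dispatch function below are my illustration, not the actual Orca-Agent implementation.

```python
# Illustrative sketch only: tool names, schemas, and dispatch are assumptions,
# not the actual Orca-Agent code.
import json

# Tool definitions the orchestrator model would see (OpenAI-style schemas).
SUBAGENT_TOOLS = [
    {
        "name": "explorer",
        "description": "Read-only subagent: inspects the repo/terminal and reports findings.",
        "parameters": {
            "type": "object",
            "properties": {"instruction": {"type": "string"}},
            "required": ["instruction"],
        },
    },
    {
        "name": "coder",
        "description": "Editing subagent: applies the code changes described in the instruction.",
        "parameters": {
            "type": "object",
            "properties": {"instruction": {"type": "string"}},
            "required": ["instruction"],
        },
    },
]

def dispatch_tool_call(call: dict, run_explorer, run_coder) -> str:
    """Route one orchestrator tool call to the matching subagent and return its report."""
    args = call["arguments"]
    if isinstance(args, str):  # tool arguments often arrive as a JSON string
        args = json.loads(args)
    if call["name"] == "explorer":
        return run_explorer(args["instruction"])
    if call["name"] == "coder":
        return run_coder(args["instruction"])
    raise ValueError(f"Unknown tool: {call['name']}")
```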
*Key results:*
- Qwen3-14B jumped from *7% → 18.25%* on TerminalBench after training
- Model now within striking distance of Qwen3-Coder-480B (19.7%)
- Training was stable with smooth entropy decrease and healthy gradient norms
*Training approach:*
Reward design (and my biggest learning): keep it simple - *just unit tests*. Every "smart" reward signal I tried to craft led to policy collapse. A sketch of this kind of reward is below.
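To make that concrete, here's a minimal sketch of a unit-test-only reward, assuming the task's tests are run inside its Docker container via `docker exec`. The binary pass/fail scoring, the `pytest` command, and the timeout are my assumptions for illustration, not the exact scheme used in training.

```python
# Sketch of a pure unit-test reward. Assumes the rollout's Docker container is
# still running and that the task ships a test command (pytest here is a guess).
import subprocess

def unit_test_reward(container_id: str, test_cmd: str = "pytest -q", timeout_s: int = 600) -> float:
    """Return 1.0 if the task's unit tests pass inside the container, else 0.0."""
    try:
        result = subprocess.run(
            ["docker", "exec", container_id, "bash", "-lc", test_cmd],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat hung test runs as failures
    return 1.0 if result.returncode == 0 else 0.0
```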
Curriculum learning:
- Stage-1: Tasks where base model succeeded 1-2/3 times (41 tasks)
- Stage-2: Tasks where the Stage-1 model succeeded 1-4/5 times (task selection sketched below)
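A sketch of how that success-rate filtering could be implemented. The attempt counts and thresholds mirror the two stages above; the function signature, rollout interface, and sampling loop are assumptions for illustration.

```python
# Illustrative curriculum filter: keep tasks the current policy solves
# sometimes but not always, based on repeated rollouts.
from typing import Callable, Iterable

def select_curriculum_tasks(
    tasks: Iterable[str],
    rollout_fn: Callable[[str], bool],  # runs one episode, True if unit tests pass
    attempts: int,
    min_successes: int,
    max_successes: int,
) -> list[str]:
    selected = []
    for task in tasks:
        successes = sum(rollout_fn(task) for _ in range(attempts))
        if min_successes <= successes <= max_successes:
            selected.append(task)
    return selected

# Stage-1: tasks the base model solved 1-2 times out of 3 attempts.
# stage1 = select_curriculum_tasks(all_tasks, base_model_rollout, 3, 1, 2)
# Stage-2: tasks the Stage-1 model solved 1-4 times out of 5 attempts.
# stage2 = select_curriculum_tasks(all_tasks, stage1_model_rollout, 5, 1, 4)
```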
Dataset: Used synthetically generated RL environments and unit tests
*More details:*
I've added lots more detail in the repo linked to this submission, including training code, model weights, and datasets.
Huge thanks to:
- Tara for providing the compute
- Prime Intellect team for building prime-rl and dealing with my endless questions
- Alex Dimakis for the conversation that sparked training the orchestrator model
Thanks for reading!
Dan
(Evaluated on the excellent TerminalBench benchmark by Stanford & Laude Institute)