Show HN: Terminal-Bench-RL: Training long-horizon terminal agents with RL

https://github.com/Danau5tin/terminal-bench-rl
106•Danau5tin•16h ago
After training a calculator agent via RL, I really wanted to go bigger! So I built RL infrastructure for training long-horizon terminal/coding agents that scales from 2x A100s to 32x H100s (~$1M worth of compute!). Without any training, my 32B agent hit #19 on the Terminal-Bench leaderboard, beating Stanford's Terminus-Qwen3-235B-A22! With training... well, too expensive, but I bet the results would be good!

*What I did*:

- Created a Claude Code-inspired agent (system msg + tools)

- Built Docker-isolated GRPO training where each rollout gets its own container

- Developed a multi-agent synthetic data pipeline to generate & validate training data with Opus-4

- Implemented a hybrid reward signal of unit test verifiers & a behavioural LLM judge.
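To make the hybrid reward idea concrete, here is a minimal sketch of one way to blend a unit-test pass rate with an LLM-judge score. It is not taken from the repo: the `RolloutScores` type, the `hybrid_reward` function, the 0-to-1 judge scale, and the 50/50 default weighting are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RolloutScores:
    """Scores for one rollout (hypothetical structure, not the repo's)."""
    tests_passed: int
    tests_total: int
    judge_score: float  # assumed to be on a 0.0-1.0 scale


def hybrid_reward(scores: RolloutScores, test_weight: float = 0.5) -> float:
    """Blend the unit-test pass rate with the behavioural judge score.

    test_weight is an illustrative knob, not a value from the project.
    """
    if scores.tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0.0 <= test_weight <= 1.0:
        raise ValueError("test_weight must be in [0, 1]")
    pass_rate = scores.tests_passed / scores.tests_total
    # Clamp the judge score so an out-of-range reply cannot dominate.
    judge = min(max(scores.judge_score, 0.0), 1.0)
    return test_weight * pass_rate + (1.0 - test_weight) * judge
```

For example, `hybrid_reward(RolloutScores(3, 4, 0.5))` blends a 75% pass rate with a judge score of 0.5. The actual weighting and scoring used in the repo may well differ.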

*Key results*:

- My untrained Qwen3-32B agent achieved 13.75% on Terminal-Bench (#19, beats Stanford's Qwen3-235B MoE)

- I verified that training runs stably on 32x H100s distributed across 4 bare-metal nodes

- I created a mini-eval framework for LLM-judge performance. Sonnet-4 won.

- ~£30-50k is needed for a full training run of 1000 epochs (I could only afford testing)

*Technical details*:

- The synthetic dataset ranges from easy to extremely hard tasks. An example hard task's prompt:

"I found this mystery program at `/app/program` and I'm completely stumped. It's a stripped binary, so I have no idea what it does or how to run it properly. The program seems to expect some specific input and then produces an output, but I can't figure out what kind of input it needs. Could you help me figure out what this program requires?"

- Simple config presets allow training to run on multiple hardware setups with minimal effort.

- GRPO used with 16 rollouts per task, up to 32k tokens per rollout.

- Agent uses XML/YAML format to structure tool calls
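For readers new to GRPO, the sketch below shows the group-relative advantage step applied to the rewards of one task's rollouts (16 per task in the setup above). It follows the commonly described formulation, where each reward is normalized by the group's mean and standard deviation. The function name and the `eps` value are assumptions, and the normalization used by the training framework underneath may differ in detail.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group.

    rewards: one scalar reward per rollout of the same task.
    eps: small constant (illustrative) that avoids division by zero
         when every rollout in the group gets the same reward.
    """
    if not rewards:
        raise ValueError("rewards must not be empty")
    mean = statistics.fmean(rewards)
    # Population standard deviation over the group; some implementations
    # use the sample standard deviation instead.
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# 16 rollouts for one task: 4 succeed (reward 1.0), 12 fail (reward 0.0).
advantages = group_relative_advantages([1.0] * 4 + [0.0] * 12)
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one. If all 16 rollouts score the same, every advantage is zero and the task contributes no gradient signal.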

*More details*:

My GitHub repos open-source it all (agent, data, code) and have way more technical details if you are interested:

- Terminal Agent RL repo

- Multi-agent synthetic data pipeline repo

I thought I would share this because I believe long-horizon RL is going to change everybody's lives, and so I feel it is important (and super fun!) for us all to share knowledge around this area, and also to enjoy exploring what is possible.

Thanks for reading!

Dan

(Built using the rLLM RL framework, which was brilliant to work with, and evaluated on and inspired by the great Terminal-Bench benchmark)

Comments

rboyd•15h ago
Great work! There should be a way for entities to crowdfund model training. Can a model like this be partially evaluated during training and save compute through early stopping?

What are the best papers/resources on sota long-horizon RL?

Thanks.

thomasfromcdnjs•15h ago
How much did you spend?
tjungblut•15h ago
If you are curious, like me, about how the actual reinforcement learning happens: it uses verl [1] underneath. The paper "HybridFlow: A Flexible and Efficient RLHF Framework" [2] explains it really well.

[1] https://github.com/volcengine/verl

[2] https://arxiv.org/abs/2409.19256v2

OtherShrezzing•15h ago
That you've spent in the low-thousands (by the looks of it), and managed to beat GPT4.1 is an amazing insight into the moat of the big AI labs.
bravesoul2•15h ago
Wow, amazing! Amazing that a "one person band" can do this much. It crosses many skillsets.
erdaltoprak•14h ago
This is incredible work
enigma101•14h ago
Did you consider a kickstarter to overcome the gpu poorness??? 30 to 50 should be doable
anorwell•14h ago
Some of the comments so far seem to be misunderstanding this submission. As I understand it:

1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.

2. The author has built an RL system, but it has not been used for anything due to cost limitations.

So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).

[1] https://www.tbench.ai/leaderboard

esafak•13h ago
It looks like the submission has two aspects that are being conflated.

1. Tooling for training a terminal agent.

2. An agent that was _not_ trained with this tooling but prompt engineered. I could not find the author's discussion on this point.

TarasBob•2h ago
I'm willing to help fund this if the creator is interested. I sent him an email.
