Show HN: Terminal-Bench-RL: Training Long-Horizon Terminal Agents with RL

https://github.com/Danau5tin/terminal-bench-rl
100•Danau5tin•10h ago
After training a calculator agent via RL, I really wanted to go bigger! So I built RL infrastructure for training long-horizon terminal/coding agents that scales from 2x A100s to 32x H100s (~$1M worth of compute!). Without any training, my 32B agent hit #19 on the Terminal-Bench leaderboard, beating Stanford's Terminus-Qwen3-235B-A22! With training... well, that was too expensive, but I bet the results would be good!

*What I did*:

- Created a Claude Code-inspired agent (system msg + tools)

- Built Docker-isolated GRPO training where each rollout gets its own container

- Developed a multi-agent synthetic data pipeline to generate & validate training data with Opus-4

- Implemented a hybrid reward signal of unit test verifiers & a behavioural LLM judge.
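
A minimal sketch of how a hybrid reward like the one in the last bullet could be combined, assuming the unit test verifiers report a pass count and the LLM judge returns a score in [0, 1]. The function name `hybrid_reward`, its parameters, and the default 50/50 weighting are illustrative assumptions, not values taken from the repo:

```python
def hybrid_reward(tests_passed: int, tests_total: int,
                  judge_score: float, test_weight: float = 0.5) -> float:
    """Blend a unit-test pass rate with an LLM-judge score into one scalar.

    All names and the default weight are hypothetical; the real pipeline
    may weight or shape these signals differently.
    """
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    if not 0.0 <= test_weight <= 1.0:
        raise ValueError("test_weight must be in [0, 1]")

    # Fraction of verifier tests that passed for this rollout.
    test_score = tests_passed / tests_total
    # Convex combination keeps the final reward in [0, 1].
    return test_weight * test_score + (1.0 - test_weight) * judge_score


# Example: 3 of 4 tests pass and the judge rates the behaviour 0.5.
reward = hybrid_reward(tests_passed=3, tests_total=4, judge_score=0.5)
```

Keeping the result in [0, 1] makes rewards comparable across tasks with different numbers of tests.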

*Key results*:

- My untrained Qwen3-32B agent achieved 13.75% on Terminal-Bench (#19, beats Stanford's Qwen3-235B MoE)

- I verified that training runs stably on 32x H100s distributed across 4 bare metal nodes

- I created a mini-eval framework for LLM-judge performance. Sonnet-4 won.

- ~£30-50k needed for a full training run of 1000 epochs (I could only afford testing)

*Technical details*:

- The synthetic dataset ranges from easy to extremely hard tasks. An example hard task's prompt:

"I found this mystery program at `/app/program` and I'm completely stumped. It's a stripped binary, so I have no idea what it does or how to run it properly. The program seems to expect some specific input and then produces an output, but I can't figure out what kind of input it needs. Could you help me figure out what this program requires?"

- Simple config presets allow training to run on multiple hardware setups with minimal effort.

- GRPO used with 16 rollouts per task, up to 32k tokens per rollout.

- Agent uses XML/YAML format to structure tool calls
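
For readers new to GRPO, here is a rough sketch of the group-relative advantage step, using the group size of 16 rollouts per task mentioned above. It assumes the common formulation where each rollout's reward is normalized by the mean and standard deviation of its own group; the function name, the `eps` constant, the use of population standard deviation, and the example rewards are assumptions for illustration, not details from the repo:

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against the other rollouts of the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One task, 16 rollouts: 4 succeed (reward 1.0), 12 fail (reward 0.0).
# These reward values are made up for the example.
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)
```

Successful rollouts end up with positive advantages and failed ones with negative advantages, and a group where every rollout scores the same contributes no learning signal.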

*More details*:

My GitHub repos open source it all (agent, data, code) and have way more technical details if you are interested:

- Terminal Agent RL repo

- Multi-agent synthetic data pipeline repo

I thought I would share this because I believe long-horizon RL is going to change everybody's lives, so I feel it is important (and super fun!) for us all to share knowledge around this area and enjoy exploring what is possible.

Thanks for reading!

Dan

(Built using the rLLM RL framework, which was brilliant to work with, and evaluated on and inspired by the great Terminal-Bench benchmark)

Comments

rboyd•9h ago
Great work! There should be a way for entities to crowdfund model training. Can a model like this be partially evaluated during training time and save through early stopping?

What are the best papers/resources on sota long-horizon RL?

Thanks.

thomasfromcdnjs•9h ago
How much did you spend?

tjungblut•9h ago
If you are curious, like me, about how the actual reinforcement learning happens: it uses verl [1] underneath. The paper "HybridFlow: A Flexible and Efficient RLHF Framework" [2] explains it really well.

[1] https://github.com/volcengine/verl

[2] https://arxiv.org/abs/2409.19256v2

OtherShrezzing•9h ago
That you've spent in the low-thousands (by the looks of it), and managed to beat GPT4.1 is an amazing insight into the moat of the big AI labs.

bravesoul2•9h ago
Wow amazing! Amazing a "one person band" can do this much. It crosses many skill sets.

erdaltoprak•8h ago
This is incredible work

enigma101•8h ago
Did you consider a kickstarter to overcome the gpu poorness??? 30 to 50 should be doable

anorwell•8h ago
Some of the comments so far seem to be misunderstanding this submission. As I understand it:

1. Custom scaffolding (system prompt and tools) using Qwen3-32B achieved 13.75% on Terminal-Bench. No training was involved.

2. The author has built an RL system, but it has not been used for anything due to cost limitations.

So there's actually no result related to training here. It is well known that the scaffolding used can have a large impact on benchmark outcomes (the Terminal-Bench leaderboard also demonstrates this [1]).

[1] https://www.tbench.ai/leaderboard

esafak•7h ago
It looks like the submission has two aspects that are being conflated.

1. Tooling for training a terminal agent.

2. An agent that was _not_ trained with this tooling but prompt engineered. I could not find the author's discussion on this point.
