Terminal-Bench: a benchmark for AI agents in terminal environments

https://www.tbench.ai/
17•mikemerrill•8mo ago

Comments

mikemerrill•8mo ago
We (an open community of AI researchers at Stanford, Anthropic, UW, and more) just released Terminal-Bench, a new open-source framework for evaluating how well AI agents perform in terminal environments. Given how much we all use the terminal and how many new AI terminal assistants are emerging, we wanted to create a rigorous way to test their capabilities.

What we found: The best commercial agents (using models like GPT-4, Claude, Gemini) score less than 20% on our benchmark tasks. Even with their impressive capabilities, these agents struggle with:
- Chaining multiple terminal commands together
- Reasoning over long command outputs
- Acting independently within sensible limits
- Executing tasks safely

What's in Terminal-Bench:
- Docker-containerized environments for consistent testing
- Hand-crafted tasks covering data science, networking, security, and more
- Human-verified solutions and test cases
- Support for different integration methods

Want to get involved? We're looking for contributors to help expand the benchmark with new challenging tasks. If you've got scenarios where current AI agents fail in the terminal, we'd love to include them!

Check out our website: https://tbench.ai
Join our Discord: https://discord.gg/6xWPKhGDbA

What terminal tasks do you wish AI agents could handle better?

kristopolous•8mo ago
Their terminus approach is pretty similar to my "dui mode" here: https://github.com/day50-dev/llmehelp ... it's still not great except for basic investigations. I think a hybrid approach would be better.

There's certainly some litellm hacking that will improve things. I'm absolutely convinced of that. The proxy is pretty hard to use though. I keep making glacial progress on it.

mikemerrill•8mo ago
Agreed that it's not the best way to use agents right now (they still need supervision) but I think in the coming year(s) we'll reach a point where they'll be good enough to run on their own (see Codex).

If you're interested in working on this, we'd love to see new contributors in the Discord: https://discord.gg/6xWPKhGDbA
