We (an open community of AI researchers at Stanford, Anthropic, UW, and more) just released Terminal-Bench, a new open-source framework for evaluating how well AI agents perform in terminal environments. Given how much we all use the terminal and how many new AI terminal assistants are emerging, we wanted to create a rigorous way to test their capabilities.
What we found: The best commercial agents (built on models like GPT-4, Claude, and Gemini) solve fewer than 20% of our benchmark tasks. Even with their impressive capabilities, these agents struggle with:
- Chaining multiple terminal commands together (see the toy sketch after this list)
- Reasoning over long command outputs
- Acting independently within sensible limits
- Executing tasks safely
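To give a feel for the failure mode, here's a toy example (not an actual benchmark task; paths and commands are illustrative) of the kind of multi-step terminal work involved: run a command, reason over its output, then chain a dependent follow-up.

```python
# Toy example (not a benchmark task): a three-step terminal workflow
# where each step depends on reasoning over the previous step's output.
import subprocess

# Step 1: run a command that can produce a long output.
logs = subprocess.run(
    ["find", "/var/log", "-name", "*.log"],
    capture_output=True, text=True,
).stdout.splitlines()

# Step 2: reason over that output -- here, just pick the first match.
target = logs[0] if logs else None

# Step 3: chain a follow-up command that depends on step 2's result.
if target:
    print(subprocess.run(
        ["tail", "-n", "5", target],
        capture_output=True, text=True,
    ).stdout)
```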
What's in Terminal-Bench:
- Docker-containerized environments for consistent testing (sketched after this list)
- Hand-crafted tasks covering data science, networking, security, and more
- Human-verified solutions and test cases
- Support for different integration methods
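To make the setup concrete, here's a minimal sketch of the containerized-check idea. This is not our actual harness; the image, commands, and test are illustrative, and it assumes the Docker Python SDK is installed.

```python
# Minimal sketch of the containerized-check idea (illustrative names;
# not the real harness). Requires the docker Python SDK and a running daemon.
import docker

client = docker.from_env()

# Fresh container per task, so every run starts from an identical state.
container = client.containers.run(
    "ubuntu:22.04", command="sleep infinity", detach=True
)
try:
    # An agent (or the human-verified solution) acts in the container...
    container.exec_run("sh -c 'echo hello > /tmp/out.txt'")

    # ...and a human-written test then verifies the resulting state.
    exit_code, output = container.exec_run("cat /tmp/out.txt")
    assert exit_code == 0 and output.decode().strip() == "hello"
finally:
    container.remove(force=True)
```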
Want to get involved? We're looking for contributors to help expand the benchmark with new challenging tasks. If you've got scenarios where current AI agents fail in the terminal, we'd love to include them!
Check out our website: https://tbench.ai
Join our Discord: https://discord.gg/6xWPKhGDbA

What terminal tasks do you wish AI agents could handle better?
kristopolous•2h ago
Their terminus approach is pretty similar to my "dui mode" here: https://github.com/day50-dev/llmehelp ... it's still not great except for basic investigations. I think a hybrid approach would be better.
There's certainly some litellm hacking that will improve things; I'm absolutely convinced of that. The proxy is pretty hard to use, though. I keep making glacial progress on it.
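For context, the basic litellm call itself is simple enough (the model name below is just a placeholder); it's the proxy/routing layer on top of this that gets hairy:

```python
# Provider-agnostic completion via litellm; model name is a placeholder.
import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this terminal output: ..."}],
)
print(response.choices[0].message.content)
```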