We (an open community of AI researchers at Stanford, Anthropic, UW, and more) just released Terminal-Bench, a new open-source framework for evaluating how well AI agents perform in terminal environments. Given how much we all use the terminal and how many new AI terminal assistants are emerging, we wanted to create a rigorous way to test their capabilities.
What we found: The best commercial agents (built on models like GPT-4, Claude, and Gemini) solve fewer than 20% of our benchmark tasks. Even with their impressive capabilities, these agents struggle with:
- Chaining multiple terminal commands together (see the toy sketch after this list)
- Reasoning over long command outputs
- Acting independently within sensible limits
- Executing tasks safely
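To give a feel for the failure mode, here's a toy example (not an actual benchmark task; paths and commands are illustrative) of the kind of multi-step terminal work involved: run a command, reason over its output, then chain a dependent follow-up.

```python
# Toy example (not a benchmark task): a three-step terminal workflow
# where each step depends on reasoning over the previous step's output.
import subprocess

# Step 1: run a command that can produce a long output.
logs = subprocess.run(
    ["find", "/var/log", "-name", "*.log"],
    capture_output=True, text=True,
).stdout.splitlines()

# Step 2: reason over that output -- here, just pick the first match.
target = logs[0] if logs else None

# Step 3: chain a follow-up command that depends on step 2's result.
if target:
    print(subprocess.run(
        ["tail", "-n", "5", target],
        capture_output=True, text=True,
    ).stdout)
```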
What's in Terminal-Bench:
- Docker-containerized environments for consistent testing (sketched after this list)
- Hand-crafted tasks covering data science, networking, security, and more
- Human-verified solutions and test cases
- Support for different integration methods
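To make the setup concrete, here's a minimal sketch of the containerized-check idea. This is not our actual harness; the image, commands, and test are illustrative, and it assumes the Docker Python SDK is installed.

```python
# Minimal sketch of the containerized-check idea (illustrative names;
# not the real harness). Requires the docker Python SDK and a running daemon.
import docker

client = docker.from_env()

# Fresh container per task, so every run starts from an identical state.
container = client.containers.run(
    "ubuntu:22.04", command="sleep infinity", detach=True
)
try:
    # An agent (or the human-verified solution) acts in the container...
    container.exec_run("sh -c 'echo hello > /tmp/out.txt'")

    # ...and a human-written test then verifies the resulting state.
    exit_code, output = container.exec_run("cat /tmp/out.txt")
    assert exit_code == 0 and output.decode().strip() == "hello"
finally:
    container.remove(force=True)
```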
Want to get involved? We're looking for contributors to help expand the benchmark with new challenging tasks. If you've got scenarios where current AI agents fail in the terminal, we'd love to include them!
Check out our website: https://tbench.ai
Join our Discord: https://discord.gg/6xWPKhGDbA

What terminal tasks do you wish AI agents could handle better?
kristopolous•2h ago
Their terminus approach is pretty similar to my "dui mode" here: https://github.com/day50-dev/llmehelp ... it's still not great except for basic investigations. I think a hybrid approach would be better.
There's certainly some litellm hacking that will improve things; I'm absolutely convinced of that. The proxy is pretty hard to use, though. I keep making glacial progress on it.
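For context, the basic litellm call itself is simple enough (the model name below is just a placeholder); it's the proxy/routing layer on top of this that gets hairy:

```python
# Provider-agnostic completion via litellm; model name is a placeholder.
import litellm

response = litellm.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this terminal output: ..."}],
)
print(response.choices[0].message.content)
```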