There's certainly some litellm hacking that will improve things. I'm absolutely convinced of that. The proxy is pretty hard to use though. I keep making glacial progress on it.
If you're interested in working on this, we'd love to see new contributors in the Discord: https://discord.gg/6xWPKhGDbA
mikemerrill•8mo ago
What we found: The best commercial agents (using models like GPT-4, Claude, Gemini) score less than 20% on our benchmark tasks. Even with their impressive capabilities, these agents struggle with:
- Chaining multiple terminal commands together
- Reasoning over long command outputs
- Acting independently within sensible limits
- Executing tasks safely
What's in Terminal-Bench:
- Docker-containerized environments for consistent testing
- Hand-crafted tasks covering data science, networking, security, and more
- Human-verified solutions and test cases
- Support for different integration methods
Want to get involved? We're looking for contributors to help expand the benchmark with new challenging tasks. If you've got scenarios where current AI agents fail in the terminal, we'd love to include them!
Check out our website: https://tbench.ai
Join our Discord: https://discord.gg/6xWPKhGDbA
What terminal tasks do you wish AI agents could handle better?