τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. Best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval — it's reasoning over complex, interlinked policies and executing the right actions in the right order.
τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio — accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): agent mishears a name or email during authentication, and everything downstream fails.
We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.
Code and leaderboard are open — we'd welcome community submissions and feedback.
Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...
sohamray19•1h ago
Also brought up questions about how multimodal models handle knowledge and context rot, and it seems like an open question so far.