Show HN: τ³-Bench is out – can agents handle complex docs and live calls?

8•victorbarres•2h ago
τ-Bench is an open benchmark for evaluating AI agents on grounded, multi-turn customer service tasks with verifiable outcomes. It's been great to see the community adopt it since launch — this is now the third iteration. With τ³-Bench, we're extending it to two new settings: knowledge-intensive retrieval and full-duplex voice.

τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. Best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval — it's reasoning over complex, interlinked policies and executing the right actions in the right order.
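A rough back-of-the-envelope reading of those two numbers, as a sketch only: it treats the approximate ~25% and ~40% figures as exact and assumes the shortfall from 100% splits cleanly into a retrieval part and a remainder, which is a simplification and not an analysis from the paper. The function name `gap_breakdown` is made up for illustration.

```python
# Approximate success rates quoted above, treated as exact for illustration.
RETRIEVAL_SETTING = 0.25    # agent has to find the right documents itself
ORACLE_DOCS_SETTING = 0.40  # agent is handed the exact documents it needs


def gap_breakdown(with_retrieval, with_oracle_docs):
    """Split the shortfall from 100% into a retrieval part and a remainder."""
    total_gap = 1.0 - with_retrieval
    retrieval_gap = with_oracle_docs - with_retrieval  # closed by perfect retrieval
    remaining_gap = 1.0 - with_oracle_docs             # still open with perfect retrieval
    return {
        "retrieval_share": retrieval_gap / total_gap,
        "remaining_share": remaining_gap / total_gap,
    }


breakdown = gap_breakdown(RETRIEVAL_SETTING, ORACLE_DOCS_SETTING)
print(f"retrieval: {breakdown['retrieval_share']:.0%}")
print(f"everything else: {breakdown['remaining_share']:.0%}")
```

Under those assumptions, perfect retrieval closes about a fifth of the gap, and the rest is left to reasoning over the policies and executing the actions.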

τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio: accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): the agent mishears a name or email during authentication, and everything downstream fails.
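To make that failure pattern concrete, here is a minimal sketch of why one misheard character can sink a whole task. Everything in it is hypothetical: the customer record, the exact-match lookup, and the `authenticate` and `handle_request` helpers are invented for illustration and are not taken from the benchmark code.

```python
# Hypothetical customer database; the record is made up for illustration.
CUSTOMERS = {
    "mia.chen@example.com": {"name": "Mia Chen", "order": "A-1001"},
}


def authenticate(heard_email):
    """Exact-match lookup on the transcribed email, as a strict identity check might do."""
    return CUSTOMERS.get(heard_email.strip().lower())


def handle_request(heard_email):
    """Authenticate first; every later step depends on that result."""
    customer = authenticate(heard_email)
    if customer is None:
        return "auth_failed"  # nothing downstream can run without an identity
    return f"looked_up:{customer['order']}"


print(handle_request("mia.chen@example.com"))  # heard correctly
print(handle_request("mia.chan@example.com"))  # one misheard letter
```

The second call differs by a single letter, yet the task cannot proceed past authentication.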

We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.

Code and leaderboard are open — we'd welcome community submissions and feedback.

Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...

Comments

sohamray19•1h ago
Was talking to some AI labs yesterday who use their own voicified version of tau bench that is half-duplex with clean audio. Hopefully we can move to tau voice for more representative environments.

Also brought up questions about how multimodal models handle knowledge and context rot, and it seems like an open question so far.