Show HN: τ³-Bench is out – can agents handle complex docs and live calls?

8•victorbarres•1h ago

τ-Bench is an open benchmark for evaluating AI agents on grounded, multi-turn customer service tasks with verifiable outcomes. It's been great to see the community adopt it since launch — this is now the third iteration. With τ³-Bench, we're extending it to two new settings: knowledge-intensive retrieval and full-duplex voice.

τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. The best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval; it's reasoning over complex, interlinked policies and executing the right actions in the right order.

τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio (accents, background noise, interruptions, compressed phone lines). Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): the agent mishears a name or email during authentication, and everything downstream fails.

We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.

Code and leaderboard are open — we'd welcome community submissions and feedback.

Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...

Comments

sohamray19•1h ago

Was talking to some AI labs yesterday that use their own voicified version of tau bench, which is half-duplex with clean audio. Hopefully we can move to tau voice for more representative environments.

Also brought up questions about how multimodal models handle knowledge and context rot, and it seems like an open question so far.
