Show HN: τ³-Bench is out – can agents handle complex docs and live calls?

8•victorbarres•1h ago
τ-Bench is an open benchmark for evaluating AI agents on grounded, multi-turn customer service tasks with verifiable outcomes. It's been great to see the community adopt it since launch — this is now the third iteration. With τ³-Bench, we're extending it to two new settings: knowledge-intensive retrieval and full-duplex voice.

τ-Knowledge: agents must navigate ~700 interconnected policy documents to complete multi-step tasks. Best frontier model (GPT-5.2, high reasoning) hits ~25%. The surprising part: even when you hand the model the exact documents it needs, performance only reaches ~40%. We found that the bottleneck isn't retrieval — it's reasoning over complex, interlinked policies and executing the right actions in the right order.

τ-Voice: same grounded tasks, but over live full-duplex voice with realistic audio: accents, background noise, interruptions, compressed phone lines. Voice agents score 31–51% in clean audio conditions and 26–38% in realistic ones. A consistent failure pattern across providers (OpenAI, Gemini, xAI): the agent mishears a name or email during authentication, and everything downstream fails.
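To make the authentication failure pattern concrete, here is a minimal sketch of how a single misheard character can sink a whole task. It is not taken from the τ-Bench code: the `USERS` store, the `Episode` class, `run_cancel_order`, and the end-state check in `passed` are all hypothetical, and it assumes authentication is an exact-match lookup and that a run only counts if the final state matches the expected one.

```python
from dataclasses import dataclass, field

# Hypothetical user store; names and fields are illustrative only.
USERS = {"maria.lopez@example.com": {"user_id": "u_17", "orders": ["o_901"]}}


@dataclass
class Episode:
    """Minimal stand-in for one task run, not the benchmark's real data model."""
    heard_email: str  # what the agent transcribed from the caller's audio
    state: dict = field(default_factory=lambda: {"o_901": "active"})


def run_cancel_order(ep: Episode, order_id: str) -> dict:
    """Authenticate by exact email match, then cancel the order."""
    user = USERS.get(ep.heard_email)  # one wrong character means no match
    if user is None or order_id not in user["orders"]:
        return ep.state  # no authenticated user, so no action is taken
    ep.state[order_id] = "cancelled"
    return ep.state


def passed(final_state: dict, expected_state: dict) -> bool:
    """Assumed all-or-nothing check: the end state must match exactly."""
    return final_state == expected_state


expected = {"o_901": "cancelled"}
clean = run_cancel_order(Episode("maria.lopez@example.com"), "o_901")
noisy = run_cancel_order(Episode("mario.lopez@example.com"), "o_901")
print(passed(clean, expected), passed(noisy, expected))
```

Under these assumptions the agent in the noisy run could handle every later step correctly and still fail, because nothing after the lookup ever executes.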

We also incorporated 75+ task fixes to the original airline, retail, and telecom domains — many based on community audits and PRs (including contributions from Amazon and Anthropic). We believe a benchmark is only as good as its maintenance, and we're grateful for the community's help improving it.

Code and leaderboard are open — we'd welcome community submissions and feedback.

Blog post (papers, code, leaderboard): https://sierra.ai/blog/bench-advancing-agent-benchmarking-to...

Comments

sohamray19•1h ago
Was talking to some AI labs yesterday who use their own voicified version of tau-bench that is half-duplex with clean audio. Hopefully we can move to τ-Voice for more representative environments.

Also brought up questions about how multimodal models handle knowledge and context rot, and it seems like an open question so far.
