Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%

https://quesma.com/blog/tau2-benchmark-improving-results-smaller-models/

58•blndrt•1h ago

Comments

barrkel•1h ago

Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.

BrunoDCDO•1h ago

I wonder if it would be possible to improve even further on the benchmark by simply showing Claude the current hardest problems and asking it to improve the prompt, without including any specifics related to those problems.
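
A rough sketch of what such a loop could look like, in Python. Everything here is an assumption for illustration: `refine_prompt`, `evaluate`, and `rewrite` are hypothetical names, not part of tau2-bench or any model API. `evaluate` stands in for a benchmark run that returns a pass rate per task, and `rewrite` stands in for a call to whichever model does the rewriting.

```python
from typing import Callable, Dict


def refine_prompt(
    prompt: str,
    evaluate: Callable[[str], Dict[str, float]],  # hypothetical: prompt -> {task_id: pass rate}
    rewrite: Callable[[str], str],                # hypothetical: rewrite request -> new prompt
    rounds: int = 3,
    hardest_k: int = 5,
) -> str:
    """Repeatedly ask a rewriter model to improve a prompt, keeping only improvements."""

    def average(scores: Dict[str, float]) -> float:
        return sum(scores.values()) / len(scores)

    best_prompt = prompt
    best_scores = evaluate(best_prompt)
    for _ in range(rounds):
        # Pick the tasks with the lowest pass rate under the current best prompt.
        hardest = sorted(best_scores, key=lambda task: best_scores[task])[:hardest_k]
        request = (
            "Improve the agent prompt below so it handles tasks like the ones listed. "
            "Do not include any task-specific details in the new prompt.\n\n"
            f"Hardest tasks: {hardest}\n\nPROMPT:\n{best_prompt}"
        )
        candidate = rewrite(request)
        candidate_scores = evaluate(candidate)
        # Keep the candidate only if the average pass rate actually went up.
        if average(candidate_scores) > average(best_scores):
            best_prompt, best_scores = candidate, candidate_scores
    return best_prompt
```

Passing `evaluate` and `rewrite` in as plain callables keeps the loop independent of any particular benchmark harness or model provider, so it can be exercised with stubs before any real model call is made.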

csoham•1h ago

Really interesting. What did the original prompt look like? Perhaps the original prompt was not that good? I feel like the changes Claude suggested (except a couple, maybe) are already pretty well-known prompt engineering practices.

blndrt•30m ago

Thank you for the feedback!

In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...

Of course, these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.

In case someone is not familiar with the framework's methodology, I've written a separate article covering it (with some of my thoughts): https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...

jari_mustonen•1h ago

Here is a summary of the key improvements made:

1. Structure & Flow

    - Decision Trees: Clear branching logic with ├── and └── notation

    - Sequential Steps: Numbered, ordered procedures instead of scattered explanations

    - Prerequisites: Explicit dependency checks before proceeding

2. AI Agent Optimizations

    - Tool Call Clarity: Exact function names and parameters

    - Binary Decisions: Clear yes/no conditions instead of ambiguous language

    - Error Handling: Specific failure conditions and next steps

    - Verification Steps: "Recheck" instructions after each fix

3. Cognitive Load Reduction

    - Reference Tables: Quick lookup for tools and purposes

    - Pattern Recognition: Common issue combinations and their solutions

    - Critical Reminders: Common AI mistakes section to prevent errors

4. Actionable Language

    - Removed verbose explanations mixed with instructions

    - Consolidated multiple documents' logic into single workflows 

    - Used imperative commands: "Check X", "If Y then Z"

    - Added immediate verification steps
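
To make the "Decision Trees" and "imperative commands" points concrete, here is a minimal Python sketch that renders a nested workflow into the ├── / └── notation mentioned above. It is an illustration only: the `Step` class, the `render` function, and the tool names `toggle_data()` and `toggle_roaming()` are hypothetical and are not taken from the article's actual prompts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One node of a troubleshooting workflow (hypothetical structure)."""
    instruction: str
    children: List["Step"] = field(default_factory=list)


def render(step: Step, prefix: str = "", is_last: bool = True, is_root: bool = True) -> List[str]:
    """Return the workflow as lines of text using tree branch markers."""
    if is_root:
        lines = [step.instruction]
        child_prefix = ""
    else:
        connector = "└── " if is_last else "├── "
        lines = [prefix + connector + step.instruction]
        # Continue the vertical bar only while siblings remain below this node.
        child_prefix = prefix + ("    " if is_last else "│   ")
    for index, child in enumerate(step.children):
        last_child = index == len(step.children) - 1
        lines.extend(render(child, child_prefix, last_child, is_root=False))
    return lines


# Hypothetical telecom-style workflow with binary conditions and recheck steps.
workflow = Step("Check whether mobile data is enabled", [
    Step("If NO: call toggle_data(), then recheck"),
    Step("If YES: check roaming status", [
        Step("If roaming is OFF: call toggle_roaming(), then recheck"),
        Step("Otherwise: transfer to a human agent"),
    ]),
])

lines = render(workflow)
prompt_text = "\n".join(lines)
print(prompt_text)
```

Keeping the workflow as data and generating the text from it means the branching logic can be reviewed or tested separately from the wording that ends up in the prompt.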

dlojudice•56m ago

I wish they had published what prompt was given to Claude to improve GPT-5-mini's performance, as well as a before and after comparison of a prompt that underwent this transformation.

blndrt•25m ago

Thanks for the feedback, appreciate it! It makes a lot of sense. I'll update the article with links to the actual prompts. Initially I thought these would be too lengthy for the article and no one would care, but it seems people are really interested in them. Of course I'd be happy to share the details.

moralestapia•34m ago

No before/after prompt.

Into the trash it goes.

CuriouslyC•22m ago

This sort of stuff is well-trodden ground. If this seems exciting to you, check out DSPy.

amelius•12m ago

My take: we have no clue how this works, and the performance could just as easily drop tomorrow.

grej•11m ago

DSPy was ahead of its time and is still underutilized.
