In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...
Of course these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.
In case someone is not familiar with the framework's methodology, I've written a separate article covering it (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...
1. Structure & Flow
- Decision Trees: Clear branching logic with ├── and └── notation
- Sequential Steps: Numbered, ordered procedures instead of scattered explanations
- Prerequisites: Explicit dependency checks before proceeding
2. AI Agent Optimizations
- Tool Call Clarity: Exact function names and parameters
- Binary Decisions: Clear yes/no conditions instead of ambiguous language
- Error Handling: Specific failure conditions and next steps
- Verification Steps: "Recheck" instructions after each fix
3. Cognitive Load Reduction
- Reference Tables: Quick lookup for tools and purposes
- Pattern Recognition: Common issue combinations and their solutions
- Critical Reminders: Common AI mistakes section to prevent errors
4. Actionable Language
- Removed verbose explanations mixed with instructions
- Consolidated multiple documents' logic into single workflows
- Used imperative commands: "Check X", "If Y then Z"
- Added immediate verification steps
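To make the conventions above concrete, here is a hypothetical policy fragment in that style. The tool names (`get_customer`, `resume_line`, `get_line_status`) are invented for illustration, not taken from the tau2-bench data files:

```python
# Hypothetical policy fragment illustrating the conventions above:
# decision-tree notation, numbered steps, exact tool names (invented),
# binary conditions, and an explicit recheck/verification step.
POLICY_FRAGMENT = """\
1. Prerequisite: call get_customer(customer_id) before any other tool.
2. Is the line active? (yes/no)
   ├── yes -> go to step 3
   └── no  -> call resume_line(line_id), then recheck step 2
3. Apply the fix, then call get_line_status(line_id) to verify.
"""
```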
Into the trash it goes.
Definitely interesting, thank you!
Is Claude rewriting the generic instructions once, or is it rewriting the core task statement each time? If the latter, I'm not sure how you prevent information leakage: Claude might easily be "solving" some of the tasks and inserting subtle hints about the approach. This result would be very interesting if it holds after rewriting only the generic instructions, even if the performance boost is lower.
It definitely makes sense that improving formatting and clarity would help these smaller models' performance, but I'm wondering whether gpt5-mini is already smart enough to handle that reformatting itself: rewriting the prompt before handing it off to another instance of itself.
Overall an awesome article!