Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%

https://quesma.com/blog/tau2-benchmark-improving-results-smaller-models/
73•blndrt•2h ago

Comments

barrkel•1h ago
Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.
BrunoDCDO•1h ago
I wonder if it would be possible to improve even further on the benchmark by simply showing Claude the current hardest problems and asking it to improve the prompt without including any specifics related to the problems
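
A rough sketch of what such a loop could look like. Nothing here comes from the article or from tau2-bench: `evaluate` and `rewrite` are hypothetical callables standing in for a benchmark run and a call to the stronger model, and the loop keeps a candidate prompt only when it passes more tasks than the current best.

```python
from typing import Callable, List


def refine_prompt(
    prompt: str,
    tasks: List[str],
    evaluate: Callable[[str, str], bool],      # hypothetical: (prompt, task) -> passed?
    rewrite: Callable[[str, List[str]], str],  # hypothetical: (prompt, failed tasks) -> new prompt
    rounds: int = 3,
) -> str:
    """Iteratively ask a rewriter to improve a prompt, keeping only improvements."""
    best_prompt = prompt
    best_score = sum(evaluate(best_prompt, t) for t in tasks)
    for _ in range(rounds):
        # Collect the tasks the current best prompt still fails.
        failed = [t for t in tasks if not evaluate(best_prompt, t)]
        if not failed:
            break
        candidate = rewrite(best_prompt, failed)
        score = sum(evaluate(candidate, t) for t in tasks)
        # Accept the candidate only if it passes strictly more tasks.
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt
```

Scoring candidates on the same tasks the rewriter has seen invites overfitting, so a real experiment would presumably want a held-out split for the final number.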
csoham•1h ago
Really interesting. What did the original prompt look like? Perhaps the original prompt was not that good? I feel like the changes Claude suggested (except maybe a couple) are already well-known prompt engineering practices.
blndrt•1h ago
Thank you for the feedback!

In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...

Of course, these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.

In case someone is not familiar with the framework's methodology, I wrote a separate article covering that (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...

jari_mustonen•1h ago
Here is the summary of key improvements made:

1. Structure & Flow

    - Decision Trees: Clear branching logic with ├── and └── notation

    - Sequential Steps: Numbered, ordered procedures instead of scattered explanations

    - Prerequisites: Explicit dependency checks before proceeding

2. AI Agent Optimizations

    - Tool Call Clarity: Exact function names and parameters

    - Binary Decisions: Clear yes/no conditions instead of ambiguous language

    - Error Handling: Specific failure conditions and next steps

    - Verification Steps: "Recheck" instructions after each fix

3. Cognitive Load Reduction

    - Reference Tables: Quick lookup for tools and purposes

    - Pattern Recognition: Common issue combinations and their solutions

    - Critical Reminders: Common AI mistakes section to prevent errors

4. Actionable Language

    - Removed verbose explanations mixed with instructions

    - Consolidated multiple documents' logic into single workflows

    - Used imperative commands: "Check X", "If Y then Z"

    - Added immediate verification steps
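
As a rough illustration of the ├── / └── decision-tree notation listed above (not the actual rewritten prompt, which isn't quoted in this thread), here is a minimal Python sketch that renders a nested dict as such a tree. The `render_tree` helper and the placeholder steps are made up for illustration.

```python
from typing import Any, Dict, List


def render_tree(tree: Dict[str, Any], prefix: str = "") -> List[str]:
    """Render a nested dict of step labels as lines using ├── and └── notation."""
    lines: List[str] = []
    items = list(tree.items())
    for i, (label, children) in enumerate(items):
        last = i == len(items) - 1
        # The last sibling gets the corner glyph, the others get a tee.
        lines.append(prefix + ("└── " if last else "├── ") + label)
        # Children are indented; a vertical bar continues non-final branches.
        lines.extend(render_tree(children, prefix + ("    " if last else "│   ")))
    return lines


# Placeholder steps only; these are not taken from the benchmark's policy files.
policy = {
    "Check X": {
        "If Y then Z": {},
        "Otherwise: next step": {},
    },
    "Recheck after the fix": {},
}

print("\n".join(render_tree(policy)))
```

Keeping the policy as data and generating the tree text from it is one way to keep the branching unambiguous while the policy changes.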
brendoelfrendo•14m ago
Wait, are we about to reinvent programming from first principles?
dlojudice•1h ago
I wish they had published what prompt was given to Claude to improve GPT-5-mini's performance, as well as a before and after comparison of a prompt that underwent this transformation.
blndrt•1h ago
Thanks for the feedback, appreciate it! It makes a lot of sense. I'll update the article with links to the actual prompts. Initially I thought these would be too lengthy for the article and no one would care, but it seems people are really interested in them. Of course, I'd be happy to share the details.
moralestapia•1h ago
No before/after prompt.

Into the trash it goes.

CuriouslyC•1h ago
This sort of stuff is well-trodden ground; if this seems exciting to you, check out DSPy.
bigwheels•24m ago
https://dspy.ai/tutorials/tool_use/

Definitely interesting, thank you!

amelius•56m ago
My take: we have no clue how this works, and the performance could just as well be down tomorrow.
grej•55m ago
DSPy was ahead of its time and still underutilized.
tibbar•25m ago
> Removed verbose explanations mixed with instructions

Is Claude rewriting generic instructions once, or is it rewriting the core task statement each time? If the latter, I'm not sure how you prevent information leakage: Claude might easily be "solving" some of the tasks and inserting subtle hints on the approach. I think this result is very interesting if it holds after rewriting only the generic instructions, even if the performance boost is lower.
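
One crude way to probe for that kind of leakage, as a sketch only: flag words that the rewrite introduced and that also occur in the task statements. The `leaked_terms` helper, the word-length cutoff, and the example strings below are all invented for illustration; a word-overlap check like this would miss paraphrased hints and can flag innocent vocabulary.

```python
import re
from typing import Iterable, List, Set


def _words(text: str, min_len: int) -> Set[str]:
    """Lowercased alphanumeric tokens of at least min_len characters."""
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if len(w) >= min_len}


def leaked_terms(
    original: str,
    rewritten: str,
    task_statements: Iterable[str],
    min_len: int = 4,  # assumption: skip short function words such as "is" or "the"
) -> List[str]:
    """Words added by the rewrite that also appear in some task statement."""
    added = _words(rewritten, min_len) - _words(original, min_len)
    task_words: Set[str] = set()
    for task in task_statements:
        task_words |= _words(task, min_len)
    return sorted(added & task_words)


# Made-up example strings, not taken from the benchmark.
print(leaked_terms(
    "Check the line status.",
    "Check the line status. If roaming is off, enable roaming.",
    ["User abroad cannot use data; roaming is disabled."],
))
```

An empty result would not prove the rewrite is leak-free; it only rules out the most literal kind of overlap.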

caminanteblanco•9m ago
The only problem is that having Claude rewrite the prompt seems to negate some of the efficiency and latency benefits of using mini. For system prompts this obviously doesn't matter, but for continuous user interaction it feels unworkable.

It definitely makes sense that improving formatting and clarity for these smaller models would really help with performance, but I'm wondering if GPT-5-mini is already smart enough to handle that reformatting and can rewrite the prompt itself before handing it off to another instance of itself.

Overall an awesome article!
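
For the system-prompt case, the rewrite cost can be paid once and reused. A minimal sketch, assuming a hypothetical `rewrite` callable that wraps the expensive model call; the `RewriteCache` class is illustrative and not part of any library mentioned in the thread.

```python
import hashlib
from typing import Callable, Dict


class RewriteCache:
    """Cache rewritten system prompts so each distinct prompt is rewritten once."""

    def __init__(self, rewrite: Callable[[str], str]) -> None:
        self._rewrite = rewrite          # hypothetical: expensive call to a larger model
        self._cache: Dict[str, str] = {}
        self.calls = 0                   # number of real rewrites, for inspection

    def get(self, system_prompt: str) -> str:
        # Key on a hash of the text so any edit to the prompt triggers a fresh rewrite.
        key = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._rewrite(system_prompt)
        return self._cache[key]
```

This does nothing for per-message rewriting of user input, which is the case described above as unworkable; there the rewrite latency lands on every turn.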
