Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%

https://quesma.com/blog/tau2-benchmark-improving-results-smaller-models/
58•blndrt•1h ago

Comments

barrkel•1h ago
Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.
BrunoDCDO•1h ago
I wonder if the benchmark score could be improved even further by simply showing Claude the current hardest problems and asking it to improve the prompt, without including any problem-specific details.
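
A minimal sketch of that loop, assuming two hypothetical callables that are not part of tau2-bench or any real library: `evaluate(prompt)` returns a score plus a list of generic failure summaries, and `rewrite(request)` sends a request to a stronger model and returns its proposed prompt. Only failure patterns, not task details, are passed to the rewriter.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signatures, chosen for illustration only.
Evaluate = Callable[[str], Tuple[float, List[str]]]
Rewrite = Callable[[str], str]


def build_rewrite_request(current_prompt: str, failure_summaries: Sequence[str]) -> str:
    """Assemble a meta-prompt that describes failure patterns but no task specifics."""
    bullets = "\n".join(f"- {summary}" for summary in failure_summaries)
    return (
        "Rewrite the agent prompt below so a smaller model follows it more reliably.\n"
        "Do not mention any specific task or test case.\n\n"
        f"Observed failure patterns:\n{bullets}\n\n"
        f"Current prompt:\n{current_prompt}"
    )


def improve_prompt(
    prompt: str, evaluate: Evaluate, rewrite: Rewrite, rounds: int = 3
) -> Tuple[str, float]:
    """Keep a rewritten prompt only if it scores higher than the best so far."""
    best_prompt = prompt
    best_score, failures = evaluate(best_prompt)
    for _ in range(rounds):
        if not failures:
            break  # nothing left to learn from
        candidate = rewrite(build_rewrite_request(best_prompt, failures))
        score, candidate_failures = evaluate(candidate)
        if score > best_score:
            best_prompt, best_score, failures = candidate, score, candidate_failures
    return best_prompt, best_score
```

Scoring each candidate on tasks that were not used to produce the failure summaries would help show whether a gain generalizes or only fits the benchmark.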
csoham•1h ago
Really interesting. What did the original prompt look like? Perhaps the original prompt was not that good? I feel like the changes Claude suggested (except a couple, maybe) are already well-known prompt engineering practices.
blndrt•30m ago
Thank you for the feedback!

In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...

Of course, these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.

In case someone is not familiar with the framework's methodology, I've written a separate article covering it (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...

jari_mustonen•1h ago
Here is the summary of key improvements made:

1. Structure & Flow

    - Decision Trees: Clear branching logic with ├── and └── notation
    - Sequential Steps: Numbered, ordered procedures instead of scattered explanations
    - Prerequisites: Explicit dependency checks before proceeding

2. AI Agent Optimizations

    - Tool Call Clarity: Exact function names and parameters
    - Binary Decisions: Clear yes/no conditions instead of ambiguous language
    - Error Handling: Specific failure conditions and next steps
    - Verification Steps: "Recheck" instructions after each fix

3. Cognitive Load Reduction

    - Reference Tables: Quick lookup for tools and purposes
    - Pattern Recognition: Common issue combinations and their solutions
    - Critical Reminders: Common AI mistakes section to prevent errors

4. Actionable Language

    - Removed verbose explanations mixed with instructions
    - Consolidated multiple documents' logic into single workflows
    - Used imperative commands: "Check X", "If Y then Z"
    - Added immediate verification steps
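
To make the decision-tree point concrete, here is a small sketch that renders nested yes/no checks into the ├── / └── notation mentioned above. The `Node` class, the `render` helper, and the tool names `toggle_data()` and `toggle_airplane_mode()` are made up for illustration; they are not taken from the benchmark or the article.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One step in the policy: an imperative check or action, plus its branches."""
    text: str
    children: List["Node"] = field(default_factory=list)


def render(node: Node, prefix: str = "", is_last: bool = True, is_root: bool = True) -> List[str]:
    """Return the tree as lines of text using box-drawing connectors."""
    if is_root:
        lines = [node.text]
        child_prefix = ""
    else:
        connector = "└── " if is_last else "├── "
        lines = [prefix + connector + node.text]
        # Continue the vertical bar only while more siblings follow.
        child_prefix = prefix + ("    " if is_last else "│   ")
    for index, child in enumerate(node.children):
        last_child = index == len(node.children) - 1
        lines.extend(render(child, child_prefix, last_child, False))
    return lines


# Hypothetical troubleshooting policy with binary conditions and recheck steps.
tree = Node("Check: is mobile data enabled?", [
    Node("NO: call toggle_data(), then recheck"),
    Node("YES: Check: is airplane mode on?", [
        Node("YES: call toggle_airplane_mode(), then recheck"),
        Node("NO: escalate to a human agent"),
    ]),
])

print("\n".join(render(tree)))
```

Keeping the policy as data and rendering it into the prompt makes each branch an explicit binary condition, and the same structure could be reused to generate a numbered procedure instead.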
dlojudice•56m ago
I wish they had published what prompt was given to Claude to improve GPT-5-mini's performance, as well as a before and after comparison of a prompt that underwent this transformation.
blndrt•25m ago
Thanks for the feedback, appreciate it! It makes a lot of sense. I'll update the article with links to the actual prompts. Initially I thought these would be too lengthy for the article and no one would care, but it seems people are really interested in them. Of course, I'd be happy to share the details.
moralestapia•34m ago
No before/after prompt.

Into the trash it goes.

CuriouslyC•22m ago
This sort of stuff is well-trodden ground; if it seems exciting to you, check out DSPy.
amelius•12m ago
My take: we have no clue how this works, and the performance could just as well drop tomorrow.
grej•11m ago
DSPy was ahead of its time and still underutilized.
