
We Mourn Our Craft

https://nolanlawson.com/2026/02/07/we-mourn-our-craft/
186•ColinWright•1h ago•172 comments

I Write Games in C (yes, C)

https://jonathanwhiting.com/writing/blog/games_in_c/
22•valyala•2h ago•6 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
124•AlexeyBrin•7h ago•24 comments

SectorC: A C Compiler in 512 bytes

https://xorvoid.com/sectorc.html
17•valyala•2h ago•1 comment

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
65•vinhnx•5h ago•9 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
833•klaussilveira•22h ago•250 comments

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

https://www.forbes.com/sites/mikestunson/2026/02/05/us-jobs-disappear-at-fastest-january-pace-sin...
155•alephnerd•2h ago•106 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
119•1vuio0pswjnm7•8h ago•149 comments

Al Lowe on model trains, funny deaths and working with Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
57•thelok•4h ago•8 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
1061•xnx•1d ago•613 comments

Reinforcement Learning from Human Feedback

https://rlhfbook.com/
80•onurkanbkrc•7h ago•5 comments

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

https://www.hpcwire.com/off-the-wire/brookhaven-labs-rhic-concludes-25-year-run-with-final-collis...
4•gnufx•57m ago•1 comment

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
489•theblazehen•3d ago•177 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
212•jesperordrup•12h ago•73 comments

France's homegrown open source online office suite

https://github.com/suitenumerique
567•nar001•6h ago•259 comments

Coding agents have replaced every framework I used

https://blog.alaindichiappari.dev/p/software-engineering-is-back
226•alainrk•6h ago•354 comments

A Fresh Look at IBM 3270 Information Display System

https://www.rs-online.com/designspark/a-fresh-look-at-ibm-3270-information-display-system
40•rbanffy•4d ago•7 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
10•momciloo•2h ago•0 comments

History and Timeline of the Proco Rat Pedal (2021)

https://web.archive.org/web/20211030011207/https://thejhsshow.com/articles/history-and-timeline-o...
19•brudgers•5d ago•4 comments

Selection Rather Than Prediction

https://voratiq.com/blog/selection-rather-than-prediction/
8•languid-photic•3d ago•1 comment

72M Points of Interest

https://tech.marksblogg.com/overture-places-pois.html
29•marklit•5d ago•3 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
114•videotopia•4d ago•33 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
77•speckx•4d ago•82 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
275•isitcontent•22h ago•38 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
201•limoce•4d ago•112 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
288•dmpetrov•22h ago•155 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
22•sandGorgon•2d ago•12 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
557•todsacerdoti•1d ago•269 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
155•matheusalmeida•2d ago•48 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
427•ostacke•1d ago•111 comments

Tau² benchmark: How a prompt rewrite boosted GPT-5-mini by 22%

https://quesma.com/blog/tau2-benchmark-improving-results-smaller-models/
197•blndrt•4mo ago

Comments

barrkel•4mo ago
Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.
BrunoDCDO•4mo ago
I wonder if it would be possible to improve even further on the benchmark by simply showing Claude the current hardest problems and asking it to improve the prompt without including any specifics related to the problems
blndrt•4mo ago
I think there's a chance we could squeeze a better benchmark score, although there's a risk of overfitting which I wanted to avoid.

The simplest test would be to make previously “unreachable” tasks succeed through obvious prompt tweaks — like reordering instructions or emphasizing key parts.

That said, my methodology intentionally avoided exposing the model to actual tasks. Instead, I focused on the domain as a whole: refining the instructions so a smaller model could understand and act reliably.

csoham•4mo ago
Really interesting. What did the original prompt look like? Perhaps it was not that good to begin with? I feel like the changes Claude suggested (except a couple, maybe) are already well-known prompt engineering practices.
blndrt•4mo ago
Thank you for the feedback!

In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...

Of course these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.

In case someone is not familiar with the framework methodology, I've written a separate article covering it (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...

jari_mustonen•4mo ago
Here is a summary of the key improvements made:

1. Structure & Flow

    - Decision Trees: Clear branching logic with ├── and └── notation

    - Sequential Steps: Numbered, ordered procedures instead of scattered explanations

    - Prerequisites: Explicit dependency checks before proceeding
2. AI Agent Optimizations

    - Tool Call Clarity: Exact function names and parameters

    - Binary Decisions: Clear yes/no conditions instead of ambiguous language

    - Error Handling: Specific failure conditions and next steps

    - Verification Steps: "Recheck" instructions after each fix
3. Cognitive Load Reduction

    - Reference Tables: Quick lookup for tools and purposes

    - Pattern Recognition: Common issue combinations and their solutions

    - Critical Reminders: Common AI mistakes section to prevent errors
4. Actionable Language

    - Removed verbose explanations mixed with instructions

    - Consolidated multiple documents' logic into single workflows 

    - Used imperative commands: "Check X", "If Y then Z"

    - Added immediate verification steps
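For illustration, a hypothetical policy fragment rewritten in this style (not taken from the actual Tau² materials) might look like:

```
Step 3: Diagnose data connectivity
├── airplane_mode enabled?
│   ├── Yes → call set_airplane_mode(enabled=false), then recheck
│   └── No  → continue
└── line status == "suspended"?
    ├── Yes → call resume_line(line_id), then recheck
    └── No  → escalate to a human agent
```

Tool names like set_airplane_mode and resume_line are made up here; the point is the binary branching and the explicit recheck after each fix.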
brendoelfrendo•4mo ago
Wait, are we about to reinvent programming from first principles?
ranie93•4mo ago
Seemingly it's always been on a scale between directly editing 1s and 0s and drafting legislation. Compile times may vary.
whateveracct•4mo ago
I'd say it's more "programming with extra steps"
inerte•4mo ago
Maybe one day we will all be using https://shakespearelang.com/
measurablefunc•4mo ago
This is more like reinvention by trying everything which doesn't work first. It's the dual of first principles.
ivape•4mo ago
In other words, just like programming, we’re writing better instructions. In this case, we’re asking it to think out loud more clearly. It’s almost like whiteboard interview prep.

It’s quite amazing because it means programming is fully entering the natural language phase of the timeline.

If you aren’t a solid clear writer, you may not make it in the brave new world.

idiotsecant•4mo ago
The computers of the future will be operated by shamans making incantations more than technicians writing code.
Yoric•4mo ago
Of the future?

We already have people praying to the machine gods, so I guess your future is next week?

mhuffman•4mo ago
>If you aren’t a solid clear writer, you may not make it in the brave new world.

Have you not heard of all the AI startups that can turn a 3-word thought into very clearly written prose to be lovingly poured into the waiting mouth of your AI agent?

johnrob•4mo ago
Isn’t programming the clearest form of writing? Perhaps it’s the non programmers that need to “catch up”.
mejutoco•4mo ago
We are still in the pigsty compared to math
Yoric•4mo ago
I'd have to disagree. We're much less ambiguous than math.

In fact, according to theory, we're writing executable proofs.

lgas•4mo ago
> Isn’t programming the clearest form of writing?

Not the way most people do it.

ashtonshears•4mo ago
If programming were the clearest form of writing, then why do educators frequently use pseudocode to make programming clearer?
Dilettante_•4mo ago
The same reason kids are first taught the Bohr model of the atom. It is less clear and precise, but thereby also less complex.

"100 baskets of apples" is easier to hold in your head than "23 baskets of red, small-ish apples, 12 of large red, 6 of any size green...", but by no means does it permit a clearer view of the Truth.

Different usage of the word "clear".

Dilettante_•4mo ago
Addendum: 'Clear' as in the opposite of ambiguous, not 'clear' as in the opposite of confusing.
pjot•4mo ago
I’ve found myself writing code intending to write prompts for writing better code.

Soon enough I'm sure we'll start to see programming languages that are geared towards interacting with LLMs.

LeoPanthera•4mo ago
Finally a use for Lojban!

https://en.wikipedia.org/wiki/Lojban

beefnugs•4mo ago
Great! A diviner has vibe-exposed the arcane magic word knowledge on the steps to ultimate knowledgeplasty! Come let us get together to share more trial-and-error wordsmithery, Together we will someday have ultimate power!

If the model creators themselves aren't sharing this magic-word bullshittery, then why is anyone spending time on this? It is just going to change with every model release.

dlojudice•4mo ago
I wish they had published what prompt was given to Claude to improve GPT-5-mini's performance, as well as a before and after comparison of a prompt that underwent this transformation.
blndrt•4mo ago
Thanks for the feedback, appreciate it! It makes a lot of sense - I'll update the article with links to the actual prompts. Initially I thought they would be too lengthy for the article and no one would care, but it seems people are really interested. Of course I'd be happy to share the details.
seunosewa•4mo ago
I checked and also couldn't find the prompt.
blndrt•4mo ago
I published an update - you should be able to find that information at the end of the post.

Should be available now, although it might take a while for the CDN to propagate.

alejoar•4mo ago
Thanks for sharing!
quinncom•4mo ago
I see that you've added links to a pull request that shows the previous and final optimized prompts. However, the OP was asking for the prompt you gave to Claude to assist you in optimizing your prompt. Would you mind sharing that one? (That way nobody has to reverse engineer the instructions from the diff you provided.)
moralestapia•4mo ago
No before/after prompt.

Into the trash it goes.

CuriouslyC•4mo ago
This sort of stuff is trodden ground, if this seems exciting to you check out DSPy.
bigwheels•4mo ago
https://dspy.ai/tutorials/tool_use/

Definitely interesting, thank you!

mccoyb•4mo ago
Many of the "look at what I did programming LLMs" blog posts on Hacker News have been developed and put out in academic papers and groups. The posts which gain traction here seem to be perennially behind the state of the art.
amelius•4mo ago
My take: we have no clue how this works and the performance can be down tomorrow just as well.
lubesGordi•4mo ago
My hypothesis: the length of the prompt shrank, yet it maintained the same amount of information.
grej•4mo ago
DSPy was ahead of its time and still underutilized.
behnamoh•4mo ago
Can you point me to any resources on DSPy that don't make it look like magic though? It used to be all the hype for a while and then everyone moved on from it.
tibbar•4mo ago
> Removed verbose explanations mixed with instructions

Is Claude rewriting generic instructions once, or is it rewriting the core task statement each time? If the latter, I'm not sure how you prevent information leakage: Claude might easily be "solving" some of the tasks and inserting subtle hints on the approach. I think this result is very interesting if it holds after rewriting only the generic instructions, even if the performance boost is lower.

blndrt•4mo ago
I only had Claude rewrite the domain policies and generic instructions, not the individual task statements. I updated the blog with a link showing the exact changes.

So no leakage — it wasn’t solving or hinting at any of the specific test cases, since none of the tasks were ever exposed to it.

caminanteblanco•4mo ago
The only problem is that having Claude rewrite the prompt negates some of the efficiency and latency benefits of using mini. For system prompts this obviously doesn't matter, but for actual continuous user interaction, it feels unworkable.

It definitely makes sense that improving formatting and clarity for these smaller models would really help with performance, but I'm wondering if GPT-5-mini is already smart enough to handle that reformatting, and could rewrite the prompt itself before handing it off to another instance of itself.

Overall an awesome article!

blndrt•4mo ago
Thank you!

Great point. Indeed, my methodology was to treat the prompt refactoring as a one-off task, so I didn't worry much about cost/latency.

As for having GPT-5-mini do the rewriting — that’s a really interesting idea. I think the biggest challenge is avoiding cognitive overload. The Tau² agent policies are pretty complex: it’s easy to grasp the overall task, but the detailed rules for each user case aren’t always obvious.

I'm not sure how easy it is to actually overload GPT-5-mini, so that's definitely worth exploring.
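The self-rewrite idea discussed in this thread could be sketched roughly like this (hypothetical helper names; `call_model` stands in for a real LLM API call, stubbed here so the plumbing can run offline):

```python
# Sketch of a two-pass "rewrite, then answer" pipeline: a first call
# restructures the raw policy prompt, a second call answers using it.
# `call_model` is a stand-in for a real LLM API call (hypothetical).

def rewrite_then_answer(policy: str, user_msg: str, call_model):
    """Pass 1 reformats the policy; pass 2 answers with the result."""
    rewrite_instruction = (
        "Rewrite the following agent policy as numbered steps with "
        "explicit yes/no conditions. Do not add or remove rules.\n\n"
    )
    structured_policy = call_model(rewrite_instruction + policy)
    return call_model(f"{structured_policy}\n\nUser: {user_msg}")

# A fake model lets us exercise the plumbing without any API:
def fake_model(prompt: str) -> str:
    if prompt.startswith("Rewrite"):
        return "1. Check account status. 2. If active, reset the line."
    return "ACK: " + prompt.splitlines()[0]

print(rewrite_then_answer("Agents may reset lines for active accounts.",
                          "My data is down", fake_model))
# → ACK: 1. Check account status. 2. If active, reset the line.
```

In a real setup both calls would hit GPT-5-mini, so the rewrite pass adds one extra round-trip of cost and latency per conversation unless the structured policy is cached.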

doctorpangloss•4mo ago
You would also be interested in DSPy...
thanhhaimai•4mo ago
This is the PR with the changes in case people missed it:

https://github.com/mieciu/tau2-bench/pull/1/files

blndrt•4mo ago
Thanks! I also updated the post with the link on the website.
nitwit005•4mo ago
That seems so strongly directed, that it feels like an attempt to reproduce a classic chat bot.
catlifeonmars•4mo ago
Can one customer get the model to return the bill details for another customer?
init_test123•4mo ago
Have you tried to use gpt-5 with high thinking to rewrite the prompt? why claude for this vs some other model?
blndrt•4mo ago
Yeah, that part I actually didn't overthink - I knew I needed strong reasoning, so I just grabbed Opus, which is my personal go-to for such tasks, and stuck with it, as I wanted to avoid too many moving parts.

It would be interesting to compare both the benchmark results and the way other models approach the whole refactoring process!

simianwords•4mo ago
Rewriting prompts doesn't come without costs. The cost here is that different prompts work for different contexts and aren't generalisable. The rewritten prompt here will not work well for other cases like medical or social advice.

I think this rewriting of prompts technique is the reason "reasoning" models perform well - they know exactly how to rewrite the prompts for a context.

FWIW I don't trust these benchmarks fully, because a huge bump like this is not expected - I would expect OpenAI to have optimised enough not to leave such gaps open.

dangoodmanUT•4mo ago
Doesn't saying "check -> action" suggest you're taking _away_ the agentic capabilities, and optimizing for the benchmark, meaning it's no longer a good benchmark for agentic capabilities?

That's like being able to see the test before taking it

blndrt•4mo ago
Great point! However, I’d ask the following: isn't faithfully following nuanced instructions an _agentic capability_ by itself?

If a model only performs well once the rules are clarified, that’s still revealing something important about its agency: it’s brittle when policies are ambiguous, but much stronger when they’re structured.

I agree with you that there’s a fine line between genuinely helping the model 'understand' the task and just 'teaching to the test'.

That said, Tau² is framed as a very specific use case — and we showed it can be solved more reliably. At the end of the day, that means we now have an agent built on a cheaper, faster model that still performs its job with higher reliability.

roger_•4mo ago
Copilot in VSCode seems to do something similar in the form of todo lists.
tedsanders•4mo ago
>GPT-5 showed significant improvement only in one benchmark domain - which is Telecom. The other ones have been somehow overlooked during model presentation - therefore we won’t bother about them either.

I work at OpenAI and you can partly blame me for our emphasis on Telecom. While we no doubt highlight the evals that make us look good, let me defend why the emphasis on Telecom isn't unprincipled cherry picking.

Telecom was made after Retail and Airline, and fixes some of their problems. In Retail and Airline, the model is graded against a ground truth reference solution. Grading against a reference solution makes grading easier, but has the downside that valid alternative solutions can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why Airline and Retail scores stopped climbing with the latest generations of models and are stuck around 60% / 80%. I'd bet you $100 that a superintelligence would probably plateau around here too, as getting 100% requires perfect guessing of which valid solution is written as the reference solution.

In Telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which may be achieved via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling and some other things too. So Telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% due to brittle grading and other issues.

Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that Telecom is much better than Airline/Retail for measuring tool use.

Incidentally, another thing to keep in mind when critically looking at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, which results in very poor scores. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if your tasks trigger a quirk not present in the eval).

Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982
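The grading distinction described above can be illustrated with a toy sketch (hypothetical actions and states, not the real tau2-bench harness):

```python
# Toy contrast between the two grading styles: reference-solution matching
# fails any valid alternative path, while outcome-state grading accepts it.

REFERENCE_ACTIONS = ["check_status", "reset_line", "confirm"]
GOAL_STATE = {"line_active": True}

def grade_by_reference(actions):
    return actions == REFERENCE_ACTIONS          # exact trajectory match

def grade_by_outcome(final_state):
    return all(final_state.get(k) == v for k, v in GOAL_STATE.items())

# An alternative-but-valid solution: different steps, same end state.
alt_actions = ["check_status", "toggle_airplane_mode", "reset_line", "confirm"]
alt_state = {"line_active": True, "airplane_mode": False}

print(grade_by_reference(alt_actions))  # → False: scored 0 despite success
print(grade_by_outcome(alt_state))      # → True: credit for reaching the goal
```

Under reference matching, the alternative trajectory scores 0 even though it reaches the goal; outcome-state grading gives it credit, which is why Telecom yields the cleaner signal.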

blndrt•4mo ago
Haha, I guess my little sarcasm just earned us a masterclass! Thanks a lot for sharing your insights — really appreciate it!
fallmonkey•4mo ago
Appreciate the response! I noticed the same when I ran tau2 myself on gpt-5 and 4.1: gpt-5 is really good at looking at tool results and interleaving them with thinking, while 4.1/o3 struggles to decide on the proper next tool to use even with thinking. To some extent, gpt-5 is just too good at figuring out the right tool to use in one go. Amazing progress.
DoctorOetker•4mo ago
This sounds very vague, what does scoring good at Telecom mean?

Can we get some (hypothetical) examples of ground truths?

For example for the Airline domain, what kind of facts are these ground truth facts? All the airports, the passenger lines between them, etc? Or does it mean detailed knowledge of the airplane manuals for pilots, maintenance, ...?

sublimefire•4mo ago
My experience as well.

Prompt changes affect output substantially (just look up arxiv); the difficult part is finding an optimal structure that yields the best results. It is a bit expensive to do a lot of testing on your own, so for now it all boils down to feel and experience. Then you mix in tool calls, other agent calls, client functions, and this gets terribly hard to evaluate.

I am still puzzled how distance between policies can have an effect on the output. And how a simple retry fixes everything.

thesehands•4mo ago
This is very much what DSPy aims to address. Learning the incantations necessary to prompt well can be replaced by an algorithmic loop and labelled example cases.
wigglefruit•4mo ago
I feel like eventually we’ll get LLMs that will act like compilers do now. So they will take a prompt and turn it into an optimized prompt for a bigger LLM.
ErikBjare•4mo ago
Would be curious to run this through DSPy/GEPA and see if it can squeeze even further performance by optimizing the prompt
lonefrog06•4mo ago
I have read somewhere that XML prompting could also help to remove ambiguities and increase success rates for agents. Did you hear about that, and would it be a good idea? Christophe from France
blndrt•4mo ago
Salut Christophe! Yes, I've come across the concept :) In fact, I think what we did with the ├── and └── notation is already a step in that direction (at least concept-wise), as it also imposes a specific structure on the instructions. But going all the way to XML seems worth exploring too!
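For illustration, the same kind of rule expressed with XML-style tags (a hypothetical fragment, not from the benchmark) might read:

```xml
<policy step="3" name="diagnose_data_connectivity">
  <check condition="airplane_mode == enabled">
    <then>call set_airplane_mode(enabled=false), then recheck</then>
    <else>continue to the next check</else>
  </check>
</policy>
```

Whether the tags help beyond the tree notation is exactly the kind of thing worth benchmarking.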
lonefrog06•4mo ago
thanks for your reply