
LLMs work best when the user defines their acceptance criteria first

https://blog.katanaquant.com/p/your-llm-doesnt-write-correct-code
95•dnw•2h ago

Comments

marginalia_nu•2h ago
I tried to make Claude Code, Sonnet 4.6, write a program that draws a fleur-de-lis.

No exaggeration: it floundered for an hour before the result started to look right.

It's really not good at tasks it has not seen before.

tartoran•1h ago
Have you tried describing to Claude what it is? The more detail, the better the result. At some point it does become easier to just do it yourself.
vdfs•1h ago
Most people just forget to tell it "make it quick" and "make no mistake"
mekael•1h ago
I’m unable to determine if you’re missing /s or not.
tartoran•1h ago
That's kind of foolish IMO. How can an open-ended, generic, and terse request satisfy whatever specific thing the user has in mind?
marginalia_nu•1h ago
It knows what it is, it's a very well known symbol. But translating that knowledge to code is something else.

Interesting shortcoming, really shows how weak the reasoning is.

cat_plus_plus•1h ago
Try writing the code from a description without looking at the picture or the generated graphics. A visual LLM, prompted to find the coordinates of different features and match them with lines/curves, might do better.
comex•1h ago
LLMs are really bad at anything visual, as demonstrated by pelicans riding bicycles, or Claude Plays Pokémon.

Opus would probably do better though.

tartoran•1h ago
How could they be any good at visuals? They are trained on text after all.
msephton•1h ago
Shapes can be described as text or mathematical formulas.
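A minimal illustration of "a shape as a formula": sample a five-pointed star parametrically and emit it as an SVG polygon. This is a generic sketch, not from the thread; all dimensions are arbitrary.

```python
import math

# A shape expressed as math rather than pixels: generate the vertices of
# a five-pointed star by alternating between an outer and inner radius.
def star_points(cx, cy, r_outer, r_inner, n=5):
    pts = []
    for i in range(2 * n):
        r = r_outer if i % 2 == 0 else r_inner
        angle = math.pi / n * i - math.pi / 2  # start pointing up
        pts.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return pts

points = " ".join(f"{x:.1f},{y:.1f}" for x, y in star_points(100, 100, 90, 35))
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
    f'<polygon points="{points}"/></svg>'
)
print(svg)
```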
comex•1h ago
Supposedly the frontier LLMs are multimodal and trained on images as well, though I don't know how much that helps for tasks that don't use the native image input/output support.

Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:

https://simonwillison.net/tags/pelican-riding-a-bicycle/

But they're still not very good.

tartoran•1h ago
I have to admit I'm seeing this for the first time, and I'm somewhat impressed by the results; I even think they will get better with more training, why not... But are these multimodal LLMs still LLMs, though? I mean, they're still LLMs, but with a sidecar that does other things, and the image training takes place outside the LLM, so in a way the LLM still doesn't "know" anything about these images; it's just generating them on the fly upon request.
boxedemp•7m ago
Maybe we should drop one of the L's
tempest_•1h ago
An SVG is just text.
astrange•1h ago
Claude is multimodal and can see images, though it's not good at thinking in them.
jshmrsn•1h ago
Considering that a fleur-de-lis involves somewhat intricate curves, I think I'd be pretty happy with myself if I could get that task done in an hour.

Given a harness that allows the model to validate the result of its program visually, and given the models are capable of using this harness to self correct (which isn't yet consistently true), then you're in a situation where in that hour you are free to do some other work.

A dishwasher might take 3 hours to do what a human could do in 30 minutes, but it's still very useful, because the machine's labor is cheaper than human labor.

marginalia_nu•1h ago
I didn't provide any constraints on how to draw it.

TBH I would have just rendered a font glyph, or failing that, grabbed an image.

Drawing it with vector graphics programmatically is very hard, but a decent programmer would and should push back on that.
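The "just render a font glyph" shortcut can be sketched in a few lines of Python: emit an SVG whose only element is the Unicode FLEUR-DE-LIS character (U+269C) and let the viewer's font do the drawing. The filename and sizes here are arbitrary.

```python
# The glyph shortcut: instead of drawing the intricate curves by hand,
# lean on a font that already contains the symbol.
GLYPH = "\u269C"  # ⚜ FLEUR-DE-LIS

svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200">'
    f'<text x="100" y="150" font-size="160" text-anchor="middle">{GLYPH}</text>'
    '</svg>'
)

with open("fleur_de_lis.svg", "w", encoding="utf-8") as f:
    f.write(svg)
```

The catch, of course, is that the rendering depends entirely on whichever font the viewer resolves for U+269C.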

zeroxfe•1h ago
> TBH I would have just rendered a font glyph, or failing that, grabbed an image.

If an LLM did that, people would be all up in arms about it cheating. :-)

For all its flaws, we seem to hold LLMs up to an unreasonably high bar.

marginalia_nu•1h ago
That's the job description for a good programmer though. Question assumptions and requirements, and then find the simplest solution that does the job.

Just about anyone can eventually come up with a hideously convoluted HeraldicImageryEngineImplFactory<FleurDeLis>.

ehnto•1h ago
Even with well understood languages, if there isn't much in the public domain for the framework you're using, it's not really that helpful. You know you're at the edges of its knowledge when you can see the exact forum posts you were looking at showing up verbatim in its responses.

I think some industries with mostly proprietary code will be a bit disappointing to use AI within.

internet2000•1h ago
I got Opus 4.6 to one shot it, took 5-ish mins. "Write me a python program that outputs an svg of a fleur-de-lis. Use freely available images to double check your work."

It basically just re-created the fleur-de-lis from the Wikipedia article, which I'm not sure proves anything beyond "you have to know how to use LLMs".

robertcope•3m ago
Same, I used Sonnet 4.6 with the prompt, "Write a simple program that displays a fleur-de-lis. Python is a good language for this." Took five or six minutes, but it wrote a nice Python Tk app that did exactly what it was supposed to.
flerchin•2h ago
Yes, plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep requirements hidden.
g947o•1h ago
Attributing these to "hidden requirements" is a slippery slope.

My own experience using Claude Code and similar tools tells me that "hidden requirements" could include:

* Make sure DESIGN.md is up to date

* Write/update tests after changing source, and make sure they pass

* Add integration test, not only unit tests that mock everything

* Don't refactor code that is unrelated to the current task

...

These are not even project- or language-specific instructions. They are usually considered common sense / good practice in software engineering, yet I sometimes had to almost beg coding agents to follow them. (Do you want to know how many times I've had to emphasize "don't use `any`" in a TypeScript codebase?)

People should just admit it's a limitation of these coding tools, and we can still have a meaningful discussion.

flerchin•1h ago
Yeah I agree generally that the most banal things must be specified, but I do think that a single sentence in the prompt "Performance should be equivalent" would likely have yielded better results.
lukeify•1h ago
Most humans also write plausible code.
tartoran•1h ago
LLMs piggyback on human knowledge encoded in all the texts they were trained on without understanding what they're doing.

Humans would execute that code and validate it. From "plausible" it becomes: hey, it does this, and this is what I want. LLMs skip that part; they really have no understanding other than the statistical patterns they infer from their training, and they really don't need any for what they are.

owlninja•1h ago
They probably at least look at the docs?
stevenhuang•24m ago
LLMs can execute code and validate it too so the assertions you've made in your argument are incorrect.

What a shame your human reasoning and "true understanding" led you astray here.

FrankWilhoit•1h ago
Enterprise customers don't buy correct code, they buy plausible code.
kibwen•1h ago
Enterprise customers don't buy plausible code, they buy the promise of plausible code as sold by the hucksters in the sales department.
marginalia_nu•1h ago
I think SolarWinds would have preferred correct code back in 2020.
qup•1h ago
Okay, but what did they buy?
marginalia_nu•1h ago
Code, from their employees.
2god3•59m ago
They're not buying code.

They are buying a service. As long as the service 'works' they do not care about the other stuff. But they will hold you liable when things go wrong.

The only caveat is highly regulated stuff, where they actually care very much.

cat_plus_plus•1h ago
That's very impressive. Your LLM actually wrote correct code for a full relational database on the first try; sure, it takes 2.5 seconds to insert 100 rows, but it stores them correctly and select is pretty fast. How many humans can do this without a week of debugging? I would suggest you install some profiling tools and ask it to find and address hotspots. How long, and how many people, did it take SQLite to get to where it is?
bluefirebrand•1h ago
I could "write" this code the same way, it's easy

Just copy and paste from an open source relational db repo

Easy. And more accurate!

snoob2021•1h ago
It is a Rust reimplementation of SQLite. Not exactly just "copy and paste"
cat_plus_plus•1h ago
The actual task is usually to build something that looks like a dozen different open source repos combined, but taking just the necessary parts for the task at hand and adding glue / custom code for the exact thing being built. While I could do it, an LLM is much faster at it, and most importantly I would not enjoy the task.
comex•1h ago
Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):

https://news.ycombinator.com/item?id=47176209

mmaunder•1h ago
But my AI didn't do what your AI did.

A cherry-picked AI fail for upvotes. Which you'll get plenty of here and on Reddit from those too lazy to go and take a look for themselves.

Using Codex or Claude to write and optimize high performance code is a game changer. Try optimizing cuda using nsys, for example. It’ll blow your lazy little brain.

oofbey•1h ago
It’s easy to get AI to write bad code. Turns out you still need coding skills to get AI to write good code. But those who have figured it out can crank out working systems at a shocking pace.
serious_angel•1h ago
I am sorry for asking, but... is there guide even on how to "figure it out"? Otherwise, how are you so sure about it?
mmaunder•1h ago
That's actually a great question. Truth be told the best way right now is to grab Codex CLI or Claude CLI (I strongly prefer Codex, but Claude has its fans), and just start. Immediately. Then go hard for a few months and you'll develop the skills you need.

A few tips for a quickstart:

Give yourself permission to play.

Understand basic concepts like context window, compaction, tokens, chain of thought and reasoning, and so on. Use AI to teach you this stuff, and read every blog post OpenAI and Anthropic put out and research what you don't understand.

Pick a hard coding problem in Python or Typescript and take a leap of faith and ask the agent to code it for you.

My favorite phrase when planning is: "Don't change anything. Just tell me.". Save this as a tmux shortcut and use it at the end of every prompt when planning something out.

Use markdown .md docs to create a planning doc and keep chatting to the agent about it and have it update the plan until you're super happy, always using the magic phrase "Don't change anything. Just tell me." (I should get myself a patent on that little number. Best trick I know)

Every time you see an anti-AI post, just move on. It's lazy people making lazy assumptions. Approach agentic coding with a sense of love, excitement, optimism, and take massive leaps of faith and you'll be very very surprised at what you find.

Best of luck Serious Angel.

2god3•1h ago
You're not really answering the question are you?

Your answer is to play with it. Cool. But why can't you and others put together a proper guide, lol? It can't be that hard.

Go ahead and do it - it'll challenge the Anti-AI posters you are referencing. I and others want to see that debate.

mmaunder•56m ago
Ah - I know! Seriously I know. There's such a bad need for this right now. The problem is that the folks who are great at agentic coding are coding their asses off 16 to 20 hours a day and don't have a minute they want to spend on writing guides because of the opportunity cost.

One of the rare resources I found recently was the OpenClaw guys interview on Lex. He drops a few bangers that are really valuable and will save you having to spend a long time figuring it out.

Also there's a very strong disincentive for anyone to write right now because we're competing against the noise and the slop in the space. So best to just shut the fuck up and create as fast as we can, and let the outcome speak for itself. You're going to see a lot more products like OpenClaw where the pace of innovation is rapid, and the author freely admits that they're coding agentically and not writing a single line.

I think the advantage that Peter has (openclaw author) is that he has enough money and success to not give a fuck about what people say re him writing purely agentically, so he's been very open about it which has been great for others who are considering doing the same.

But if you have a software engineering career or are a public figure with something to lose, you tend to STFU if you're doing pure agentic coding on a project.

But that'll change. Probably over the next few months. OpenClaw broke the ice.

appcustodian2•51m ago
Don't worry we'll all be taking the Claude certification courses soon enough
pornel•1h ago
When a new technology emerges we typically see some people who embrace it and "figure it out".

Electronic synthesisers went from "it's a piano, but expensive and sounds worse" to every weird preset creating a whole new genre of electronic music.

So it seems plausible, like Claude's code, that our complaints about unmaintainable code are from trying to use it like a piano, and the rave kids will find a better use for it.

appcustodian2•1h ago
How do you figure anything out? You go use it, a lot.
wmeredith•25m ago
Right here: https://codemanship.wordpress.com/2025/10/30/the-ai-ready-so...

This series of articles is gold.

Unsurprisingly, writing good software with AI follows the same principles as writing it without AI. Keep scopes small. Ship, refactor, optimize, and write tests as you go.

mmaunder•1h ago
Agreed 100%. I'd add that it's the knowledge of architecture and scaling that you got from writing all that good code, shipping it, and then having to scale it. It gives you the vocabulary and broad and deep knowledge base to innovate at lightning speeds and shocking levels of complexity.
kccqzy•1h ago
Yeah right. A LLM in the hands of a junior engineer produces a lot of code that looks like they are written by juniors. A LLM in the hands of a senior engineer produces code that looks like they are written by seniors. The difference is the quality of the prompt, as well as the human judgement to reject the LLM code and follow-up prompts to tell the LLM what to write instead.
2god3•1h ago
Lol what. The difference is that the senior... is a senior. Ask yourself what characteristics comprise a senior vs a junior...

You're glossing over so much stuff. Moreover, how does the Junior grow and become the senior with those characteristics, if their starting point is LLMs?

mmaunder•1h ago
I kind of agree. But I'd adjust that to say that in both cases you get good looking code. In the hands of a junior you get crappy architecture decisions and complete failure to manage complexity which results in the inevitable reddit "they degraded the model" post. In the hands of seniors you get well managed complexity, targeted features, scalable high performance architecture, and good base technology choices.
gzread•1h ago
Early LLMs would do better at a task if you prefixed the task with "You are an expert [task doer]"
serious_angel•1h ago
Holy gracious sakes... Of course... Thank you... thank you... dear katanaquant, from the depths... of my heart... There's still belief in accountability... in fun... in value... in effort... in purpose... in human... in art...

Related:

- <http://archive.today/2026.03.07-020941/https://lr0.org/blog/...> (I'm not consulting an LLM...)

- <https://web.archive.org/web/20241021113145/https://slopwatch...>

pornel•1h ago
Their default solution is to keep digging. It has a compounding effect of generating more and more code.

If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

stingraycharles•1h ago
> If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

Nevermind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.

vannevar•1h ago
I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.
bryanrasmussen•1h ago
maybe there should be an LLM trained on a corpus of a deletions and cleanup of code.
unlikelytomato•55m ago
This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.
esafak•48m ago
I have run into this too. Some of it is because models lack the big picture; so called agentic search (aka grep) is myopic.
marginalia_nu•48m ago
My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

skybrian•1h ago
You can ask an LLM to write benchmarks and to make the code faster. It will find and fix simple performance issues - the low-hanging fruit. If you want it to do better, you can give it better tools and more guidance.

It's probably a good idea to improve your test suite first, to preserve correctness.
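A sketch of the kind of benchmark you might ask for first, so that "make it faster" has a measurable target. The two implementations here are generic stand-ins (not from the article), with a correctness check before any timing comparison.

```python
import timeit

# Two implementations of the same task, so the benchmark has something
# to compare. In practice these would be the LLM's before/after versions.
def join_naive(parts):
    out = ""
    for p in parts:
        out += p           # repeated concatenation
    return out

def join_fast(parts):
    return "".join(parts)  # single pass

parts = ["x"] * 10_000

# Correctness guard first: a faster wrong answer is worthless.
assert join_naive(parts) == join_fast(parts)

t_naive = timeit.timeit(lambda: join_naive(parts), number=100)
t_fast = timeit.timeit(lambda: join_fast(parts), number=100)
print(f"naive: {t_naive:.4f}s  fast: {t_fast:.4f}s")
```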

jqpabc123•1h ago
LLMs have no idea what "correct" means.

Anything they happen to get "correct" is the result of probability applied to their large training database.

Being wrong will always be not only possible but likely any time you ask for something that is not well represented in its training data. The user has no way to know if this is the case, so they are basically flying blind and hoping for the best.

Relying on an LLM for anything "serious" is a liability issue waiting to happen.

tonypapousek•1h ago
It's a shame the bulk of that training data is likely 2010s blogspam that was poor quality to begin with.
2god3•1h ago
But isn't that a reflection of reality?

If you've made a significant investment in human capital, you're even more likely to protect it now and prevent posting valuable stuff on the web.

2god3•1h ago
Aye. I wish more conversations would be more of this nature - in that we should start with basic propositions - e.g. the thing does not 'know' or 'understand' what correct is.
LarsDu88•1h ago
This is about to change very soon. Unlike many other domains (such as greenfield scientific discovery), most coding problems for which we can write tests and benchmarks are "verifiable domains".

This means an LLM can autogenerate millions of code problem prompts, attempt millions of solutions (both working and non-working), and, from the working solutions, penalize answers that have poor performance. The resulting synthetic dataset can then be used as a finetuning dataset.

There are now reinforcement finetuning techniques that have not been incorporated into the existing slate of LLMs that will enable finetuning them for both plausibility AND performance with a lot of gray area (like readability, conciseness, etc) in between.

What we are observing now is just the tip of a very large iceberg.
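A toy sketch of what such a "verifiable domain" reward could look like: execute a candidate solution, gate it on test results, and penalize wall-clock time. All names and the scoring formula are made up for illustration; real pipelines sandbox execution and score far more carefully.

```python
import time

def score_candidate(solution_src, tests, time_budget=0.1):
    """Toy reward: 0.0 if the candidate crashes or fails any test,
    otherwise 1.0 minus a penalty proportional to elapsed time."""
    namespace = {}
    try:
        exec(solution_src, namespace)           # run the candidate's definitions
        start = time.perf_counter()
        ok = all(t(namespace) for t in tests)   # correctness check
        elapsed = time.perf_counter() - start
    except Exception:
        return 0.0
    if not ok:
        return 0.0
    return max(0.0, 1.0 - elapsed / time_budget)

fast = (
    "def fib(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n): a, b = b, a + b\n"
    "    return a"
)
slow = "def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)"
tests = [lambda ns: ns["fib"](20) == 6765]

print(score_candidate(fast, tests), score_candidate(slow, tests))
```

Run over millions of generated problems, correct-and-fast solutions end up preferred in the synthetic finetuning set, which is the mechanism the comment describes.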

2god3•52m ago
Lets suppose whatever you say is true.

If Im the govt, Id be foaming at the mouth - those projects that used to require enormous funding now will supposedly require much less.

Hmmm, what to do? Oh I know. Lets invest in Digital ID-like projects. Fun.

ontouchstart•1h ago
I made a comment in another thread about my acceptance criteria

https://news.ycombinator.com/item?id=47280645

It is more about LLMs helping me understand the problem than giving me over engineered cookie cutter solutions.

graphememes•55m ago
bad input > bad output

idk what to say, just because it's Rust doesn't mean it's performant, or that you asked for it to be performant.

yes, llms can produce bad code, they can also produce good code, just like people

codethief•53m ago
> Your LLM Doesn't Write Correct Code. It Writes Plausible Code.

I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.

In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.

raw_anon_1111•50m ago
The difference for me recently

Write a lambda that takes an S3 PUT event and inserts the rows of a comma separated file into a Postgres database.

Naive implementation: download the file from S3 and do a bulk insert. It would have taken 20 minutes, and it's what Claude did at first.

I had to tell it to use the AWS extension for Postgres that loads a file directly from S3 into a table. It took 20 seconds.

I treat coding agents like junior developers.
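For reference, the S3-direct path described here is what the RDS/Aurora `aws_s3` extension provides via `aws_s3.table_import_from_s3`. A rough sketch, shown as a generated SQL string so nothing needs a live database; the table, bucket, and region names are made up.

```python
# Builds the aws_s3 import call a Lambda handler would execute via its
# Postgres client on each S3 PUT event, instead of downloading the CSV
# and inserting rows itself. All identifiers here are illustrative.
def build_import_sql(table, bucket, key, region):
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"'{table}', '', '(FORMAT csv)', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'))"
    )

sql = build_import_sql("events", "my-bucket", "uploads/rows.csv", "us-east-1")
print(sql)
```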

svpyk•33m ago
Unlike junior developers, LLMs can take detailed instructions and produce outstanding results on the first shot a good number of times.
