Why LLMs Can't Build Software

https://zed.dev/blog/why-llms-cant-build-software

70•srid•2h ago

Comments

9cb14c1ec0•31m ago

> what they cannot do is maintain clear mental models

The more I use Claude Code, the more frustrated I get with this aspect. I'm not sure that a generic text-based LLM can properly solve this.

cmrdporcupine•17m ago

Honestly it forces you -- rightfully -- to step back and be the one doing the planning.

You can let it do the grunt coding, and a lot of the low-level analysis and testing, but you absolutely need to be the one in charge of the design.

It frankly gives me more time to think about the bigger picture within the amount of time I have to work on a task, and I like that side of things.

There's definitely room for a massive amount of improvement in how the tool presents changes and suggestions to the user. It needs to be far more interactive.

dlivingston•12m ago

Reminds me of how Google's Genie 3 can only run for a ~minute before losing its internal state [0].

My gut feeling is that this problem won't be solved until some new architecture is invented, on the scale of the transformer, which allows for short-term context, long-term context, and self-modulation of model weights (to mimic "learning"). (Disclaimer: hobbyist with no formal training in machine learning.)

[0]: https://news.ycombinator.com/item?id=44798166

empath75•28m ago

It's good at micro, but not macro. I think that will eventually change with smarter engineering around it, larger context windows, etc. Never underestimate how much code engineers will write to avoid writing code.

pmdr•16m ago

> It's good at micro, but not macro.

That's what I've found as well. Start describing or writing a function, include the whole file for context and it'll do its job. Give it a whole codebase and it will just wander in the woods burning tokens for ten minutes trying to solve dependencies.

usrbinbash•27m ago

> We don't just keep adding more words to our context window, because it would drive us mad.

That, and we also don't only focus on the textual description of a problem when we encounter a problem. We don't see the debugger output and go "how do I make this bad output go away?!?". Oh, I am getting an authentication error. Well, maybe I should just delete the token check for that code path...problem solved?!

No. Problem very much not-solved. In fact, problem very much very bigger big problem now, and [Grug][1] find himself reaching for club again.

Software engineers are able to step back, think about the whole thing, and determine the root cause of a problem. I am getting an auth error...ok, what happens when the token is verified...oh, look, the problem is not the authentication at all...in fact there is no error! The test was simply bad and tried to call a higher privilege function as a lower privilege user. So, test needs to be fixed. And also, even though it isn't per-se an error, the response for that function should maybe differentiate between "401 because you didn't authenticate" and "401 because your privileges are too low".

[1]: https://grugbrain.dev
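
A minimal sketch of the distinction suggested at the end of the comment above, assuming the common HTTP convention of 401 for "not authenticated" and 403 for "authenticated but not privileged enough" (the comment itself uses 401 for both). The `User` type, the `PRIVILEGE_LEVELS` table, and `authorize` are hypothetical names for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical privilege ranking; a real system would define its own.
PRIVILEGE_LEVELS = {"user": 1, "admin": 2}


@dataclass
class User:
    name: str
    role: str


def authorize(user: Optional[User], required_role: str) -> int:
    """Return an HTTP status code for a call to a protected function."""
    if user is None:
        # No valid token was presented: the caller is not authenticated.
        return 401
    if PRIVILEGE_LEVELS[user.role] < PRIVILEGE_LEVELS[required_role]:
        # The caller is known, but their privileges are too low.
        return 403
    return 200
```

With distinct codes, a test that calls a higher-privilege function as a lower-privilege user fails with a status that points at the privilege mismatch rather than at the token check.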

trod1234•12m ago

Isn't the 401 for LLMs the same single undecidable token? Doesn't this basically go to the undecidable nature of math in CS?

Put another way, you have an Excel roster corresponding to people with accounts, where some need to have their accounts shut down, but you only have their first and last names as identifiers, and the pool is sufficiently large that there is more than one person for a given name.

You can't shut down all accounts with a given name, and there is no unique identifier. How do you solve this?

You have to ask and be given that unique identifier that differentiates between the undecidable. Without that, even the person can't do the task.

The person can make guesses, but those guesses are just hallucinations with a significant probability of a bad, repeated outcome.

At a core level I don't think these types of issues are going to be solved.
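
The roster example above can be sketched in a few lines, assuming (hypothetically) that each account record carries an internal id while the shutdown roster lists only names. The function name and the sample data are made up for illustration.

```python
from collections import defaultdict


def resolve_by_name(accounts, names_to_close):
    """Split requested names into uniquely resolvable and unresolvable ones."""
    by_name = defaultdict(list)
    for account_id, first, last in accounts:
        by_name[(first, last)].append(account_id)

    resolved, ambiguous = {}, {}
    for name in names_to_close:
        matches = by_name.get(name, [])
        if len(matches) == 1:
            resolved[name] = matches[0]
        else:
            # Zero or several matches: the name alone cannot decide.
            ambiguous[name] = matches
    return resolved, ambiguous


# Hypothetical data: two different people share the name Ann Lee.
accounts = [(1, "Ann", "Lee"), (2, "Ann", "Lee"), (3, "Bob", "Roy")]
resolved, ambiguous = resolve_by_name(accounts, [("Ann", "Lee"), ("Bob", "Roy")])
```

The code can report which names are ambiguous, but choosing between accounts 1 and 2 still requires someone to supply the missing identifier.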

livid-neuro•12m ago

The first cars broke down all the time. They had a limited range. There wasn't a vast supply of parts for them. There wasn't a vast industry of experts who could work on them. There wasn't a vast network of fuel stations to provide energy for them. The horse was a proven method.

What an LLM cannot do today is almost irrelevant in the tide of change upon the industry. The fact that an LLM cannot do something today doesn't mean that, with improvements, it cannot do it tomorrow.

skydhash•7m ago

When the first cars broke down, people were not saying: One day, we’ll go to the moon with one of these.

LLMs may get better, but they will not be what people are clamoring for them to be.

jedimastert•6m ago

> The first cars broke down all the time. They had a limited range. There wasn't a vast supply of parts for them. There wasn't a vast industry of experts who could work on them.

I mean, there was and then there wasn't. All of those things are shrinking fast because we handed over control to people who care more about profits than customers because we got too comfy and too cheap, and now right to repair is screwed.

Honestly, I see LLM-driven development as a threat to open source and right to repair, among a litany of other things.

skydhash•10m ago

Programmers are mostly translating business rules to the very formal process execution of the computer world. And you need to know both what the rules mean and how the computer works (or at least how the abstracted version you’re working with works). The translation is messy at first, which is why you need to revise it again and again. Especially when later rules come along challenging all the assumptions you’ve made or even contradicting themselves.

Even translations between human languages (which allow for ambiguity) can be messy. Imagine if the target language is for a system that will do exactly as told unless someone has qualified those actions as bad.

jmclnx•25m ago

I am not a fan of today's concept of "AI", but to be fair, building today's software is not for the faint of heart; very few people get it right on try 1.

Years ago I gave up compiling these large applications altogether. I compiled Firefox via FreeBSD's (v8.x) ports system, and that alone was a nightmare.

I cannot imagine what it would be like to compile GNOME3 or KDE or LibreOffice. Emacs is the largest thing I compile now.

anotherhue•22m ago

I suggest trying Nix. Because it is reproducible, those nasty compilation demons get solved once and for all. (And usually by someone else.)

trod1234•4m ago

The problem with Nix is that it's claimed to be reproducible, but the proof isn't really there because of the existence of collisions.

While a collision hasn't yet been found for SHA-256, by the pigeonhole principle they exist, and the computer will not be able to decide between the two packages, leading to system-level failure with errors that have no link to the cause.

These things, generally speaking, contain properties of mathematical chaos that no admin would ever approach or touch because it's unmaintainable. Non-deterministic problems are the most costly problems because troubleshooting, which is based on properties of determinism, doesn't work.
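
For scale, the pigeonhole argument above can be paired with the standard birthday bound on accidental collisions. This is a rough sketch that assumes hash outputs behave like uniformly random 256-bit values; the figure of one billion distinct packages is a hypothetical input, not a measurement of any real package set.

```python
def collision_probability_upper_bound(n_items: int, hash_bits: int = 256) -> float:
    """Birthday (union) bound: P(any collision) <= n(n-1)/2 / 2**hash_bits."""
    pairs = n_items * (n_items - 1) // 2  # number of distinct pairs of items
    return pairs / 2 ** hash_bits


# Hypothetical: one billion distinct packages hashed with a 256-bit hash.
p = collision_probability_upper_bound(10 ** 9)
```

The bound covers only accidental collisions under the uniformity assumption; it says nothing about deliberately constructed ones.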

emilecantin•24m ago

Yeah, I think it's pretty clear to a lot of people that LLMs aren't at the "build me Facebook, but for dogs" stage yet. I've had relatively good success with more targeted tasks, like "Add a modal that does this, take this existing modal as an example for code style". I also break my problem down into smaller chunks, and give them one by one to the LLM. It seems to work much better that way.

trod1234•21m ago

I think most people trying to touch on this topic don't consider this byline alongside other similar bylines, like "Why LLMs can't recognize themselves looping", "Why LLMs can't express intent", or "Why LLMs can't recognize truth/falsity, or confidence levels of what they know vs. don't know". With a little thought, these other bylines basically equate to Computer Science halting problems, or the undecidable nature of mathematics.

Taken a step further, recognizing this makes the investment in such a moonshot pipedream (overcoming these inherent problems in a deterministic way) recklessly negligent.

nromiun•18m ago

This is exactly why I have tried and failed to use LLMs for any real project. For toy examples they are fine, but for anything larger they introduce many small obvious bugs.

The worst thing is that you can't point those bugs out to the LLM. It will prefer to rewrite the whole code instead. With new bugs of course. So you are back to square one.

saghm•17m ago

> Context omission: Models are bad at finding omitted context.

> Recency bias: They suffer a strong recency bias in the context window.

> Hallucination: They commonly hallucinate details that should not be there.

To be fair, those are all issues that most human engineers I've worked with (including myself!) have struggled with to various degrees, even if we don't refer to them the same way. I don't know about the rest of you, but I've certainly had times where I found out that an important nuance of a design was overlooked until well into the process of developing something, forgot a crucial detail that I learned months ago that would have helped me debug something much faster if I had remembered it from the start, or accidentally made an assumption about how something worked (or misremembered it) and ended up with buggy code as a result. I've mostly gotten pretty positive feedback about my work over the course of my career, so if I "can't build software", I have to worry about the companies that have been employing me and my coworkers who have praised my work output over the years. Then again, I think "humans can't build software reliably" is probably a mostly correct statement, so maybe the lesson here is that software is hard in general.

Nickersf•16m ago

I think they're another tool in the toolbox not a new workshop. You have to build a good strategy around LLM usage when developing software. I think people are naturally noticing that and adapting.

generalizations•15m ago

These LLM discussions really need everyone to mention what LLM they're actually using.

> AI is awesome for coding! [Opus 4]

> No AI sucks for coding and it messed everything up! [4o]

Would really clear the air. People seem to be evaluating the dumbest models (apparently because they don't know any better?) and then deciding the whole AI thing just doesn't work.

omnicognate•11m ago

What the article says is as true of Opus 4 as any other LLM.

troupo•5m ago

> These LLM discussions really need everyone to mention what LLM they're actually using.

They need to mention significantly more than that: https://dmitriid.com/everything-around-llms-is-still-magical...

--- start quote ---

Do we know which projects people work on? No

Do we know which codebases (greenfield, mature, proprietary etc.) people work on? No

Do we know the level of expertise the people have? No.

Is the expertise in the same domain, codebase, language that they apply LLMs to? We don't know.

How much additional work did they have reviewing, fixing, deploying, finishing etc.? We don't know.

--- end quote ---

And that's just the tip of the iceberg. And that is an iceberg before we hit another one: that we're trying to blindly reverse engineer a non-deterministic blackbox inside a provider's blackbox

Transfinity•15m ago

> LLMs get endlessly confused: they assume the code they wrote actually works; when tests fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.

I feel personally described by this statement. At least on a bad day, or if I'm phoning it in. Not sure if that says anything about AI - maybe just that the whole "mental models" part is quite hard.

apples_oranges•10m ago

It means something is not understood. Could be the product, the code in question, or computers in general. 90% of coders seem to be lacking foundational knowledge imho. Not trying to hate on anyone, but when you have the basics down, you can usually see quickly where the problem is, or at least must be.

JimDabell•11m ago

LLMs can’t build software because we are expecting them to hear a few sentences, then immediately start coding until there’s a prototype. When they get something wrong, they have a huge amount of spaghetti to wade through. There’s little to no opportunity to iterate at a higher level before writing code.

If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?

We can dramatically improve the output of LLM software development by using all those processes and tools that help engineering teams avoid these problems:

https://jim.dabell.name/articles/2025/08/08/autonomous-softw...

lordnacho•6m ago

I think I agree with the idea that LLMs are good at the junior level stuff.

What's happened for me recently is I've started to revisit the idea that typing speed doesn't matter.

This is an age-old thing: most people don't think it really matters how fast you can type. I suppose the steelman is, most people think it doesn't really matter how fast you can get the edits to your code that you want. With modern tools, you're not typing out all the code anyway, and there are all sorts of non-AI ways to get your code looking the way you want. And that doesn't matter; the real work of the engineer is the architecture of how the whole program functions. Typing things faster doesn't get you to the goal faster, since finding the overall design is the limiting thing.

But I've been using Claude for a while now, and I'm starting to see the real benefit: you no longer need to concentrate to rework the code.

It used to be burdensome to do certain things. For instance, I decided to add an enum value, and now I have to address all the places where it matches on that enum. This wasn't intellectually hard in the old world, you just got the compiler to tell you where the problems were, and you added a little section for your new value to do whatever it needed, in all the places it appeared.
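
A small sketch of the chore described above. The Python interpreter does not check match exhaustiveness the way the compiler in the comment does, so this emulates the check at runtime; the `Status` enum and the two lookup tables are hypothetical stand-ins for "all the places where it matches on that enum".

```python
from enum import Enum


class Status(Enum):  # hypothetical enum, for illustration only
    ACTIVE = "active"
    SUSPENDED = "suspended"
    ARCHIVED = "archived"  # the newly added value


# Each table stands in for one place in the codebase that switches on Status.
LABELS = {Status.ACTIVE: "Active", Status.SUSPENDED: "Suspended"}
COLORS = {Status.ACTIVE: "green", Status.SUSPENDED: "red", Status.ARCHIVED: "gray"}


def missing_cases(table):
    """Return the enum members that a dispatch table does not handle yet."""
    return [member for member in Status if member not in table]
```

Here `missing_cases(LABELS)` reports that `Status.ARCHIVED` is still unhandled, while `COLORS` is already complete. Repeating that fix at every such site is the mechanical work the comment describes handing off.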

But you had to do this carefully, otherwise you would just cause more compile/error cycles. Little things like forgetting a semicolon will eat a cycle, and old tools would just tell you the error was there, not fix it for you.

LLMs fix it for you. Now you can just tell Claude to change all the code in a loop until it compiles. You can have multiple agents working on your code, fixing little things in many places, while you sit on HN and muse about it. Or perhaps spend the time considering what direction the code needs to go.

The big thing however is that when you're no longer held up by little compile errors, you can do more things. I had a whole laundry list of things I wanted to change about my codebase, and Claude did them all. Nothing on the business level of "what does this system do" but plenty of little tasks that previously would take a junior guy all day to do. With the ability to change large amounts of code quickly, I'm able to develop the architecture a lot faster.

It's also a motivation thing: I feel bogged down when I'm just fixing compile errors, so I prioritize what to spend my time on if I am doing traditional programming. Now I can just do the whole laundry list, because I'm not the guy doing it.

revskill•6m ago

They can read and mind the error, then figure out the best way to resolve it. That is the best part about LLMs. No human can do it better than an LLM. But they are not mind readers. That is where things fall apart.

chollida1•3m ago

Most of this might be true for LLMs, but years of investing experience have created a mental model of looking for the tech or company that sucks and yet keeps growing.

People complained endlessly about the internet in the early to mid 90s: it was slow and static, most sites had under construction signs on them, and your phone modem would just randomly disconnect. The internet did suck in a lot of ways and yet people kept using it.

Twitter sucked in the mid 2000s, we saw the fail whale weekly and yet people continued to use it for breaking news.

Electric cars sucked, no charging, low distance, expensive and yet no matter how much people complain about them they kept getting better.

Phones sucked, pre 3G was slow, there wasn't much you could use them for before app stores and the cameras were potato quality and yet people kept using them while they improved.

Always look for the technology that sucks and yet people keep using it because it provides value. LLMs aren't great at a lot of tasks and yet no matter how much people complain about them, they keep getting used and keep improving through constant iteration.

LLMs may not be able to build software today, but they are 10x better than where they were in 2022 when we first started using ChatGPT. It's pretty reasonable to assume that in 5 years they will be able to do these types of development tasks.
