A human expert needs to identify the need for software, decide what the software should do, figure out what's feasible to deliver, build the first version (AI can help a bunch here), evaluate what they've built, show it to users, talk to them about whether it's fit for purpose, iterate based on their feedback, deploy and communicate the value of the software, and manage its existence and continued evolution in the future.
Some of that stuff can be handled by non-developer humans working with LLMs, but a human expert who understands code will be able to do this stuff a whole lot more effectively.
I guess the big question is whether experienced product management types can pick up enough coding literacy to work like this without programmers, or whether programmers can pick up enough PM skills to work without PMs.
My money is on both roles continuing to exist and benefit from each other, in a partnership that produces results a lot faster because the previously slow "writing the code" part is a lot faster than it used to be.
AI video is an incredible tool, but it can't make movies.
It's almost as if all of these models are an exoskeleton for people that already know what they're doing. But you still need an expert in the loop.
To me this appears to be a very time-dependent assertion. 5 years ago, AI couldn't generate a good movie frame. 2 years ago, AI couldn't generate a good shot, but now in 2025, AI can generate a not-too-shabby scene. If capabilities continue improving at this rate (e.g. as they have with AI being able to generate full musical albums), I wouldn't bet against AI being able to generate a decent feature film in the next decade. It might take longer until it's the sort of thing we'd present at festivals, but I just don't see a clear barrier any more.
Looking at it from another perspective, if an AI driven task currently requires "an expert in the loop" to navigate things by offering the appropriate prompts, evaluating and iterating on the AI generated content, then there's nothing clear to stop us from training the next generation of AI to include that expert's competency.
Taking it into full extrapolation mode, the thing that current-generation AIs really don't have is the human experience that leads to a creative drive, but once we have robotic agents among us, these would arguably be able to start gathering "experiences" that they could then mine to write and produce "their own" stories.
Humans are sharply declining in this ability at the same time. Most of what Hollywood churns out now is superhero slop, forced-diversity spin-offs, awful remakes of classics, and awkward comebacks for yesteryear's leading men.
I know it's not a movie but I could've happily watched "Nothing, Forever" for the rest of my life. That was creative, chaotic, hilarious, and wildly entertaining.
Meanwhile I watched the human-created War Of The Worlds (2025) last weekend... The less said, the better.
I'd argue that they can't, at least on a short timeframe. Not because LLMs can't generate a program or product that works, but that there needs to be enough understanding of how the implementation works to fix any complex issues that come up.
One experience I had is that I had tried to generate a MITM HTTPS proxy that uses Netty using Claude, and while it generated a pile of code that looked good on the surface, it didn't actually work. Not knowing enough about Netty, I wasn't able to debug why it didn't work and trying to fix it with the LLM didn't help either.
Maybe PMs can pick up enough knowledge over time to be able to implement products that can scale, but by that time they'd effectively be a software engineer, minus the writing code part.
If all juniors are using AI, or even worse, no juniors are ever hired, I'm not sure how we can produce those seniors at the scale we currently do. Which isn't even that large a scale.
Just this past weekend, I designed and wrote code (in TypeScript) that I don't think LLMs can even come close to writing in years. I have a subscription to a frontier LLM, but lately I find myself using it like 25% of the time.
At a certain level the software architecture problems I'm solving, drawing upon decades of understanding about maintainable, performant, and verifiable design of data structures and types and algorithms, are things LLMs cannot even begin to grasp.
At that point, I find that attempting to use an LLM to even draft an initial solution is a waste of time. At best I can use it for initial brainstorming.
The people saying LLMs can code are hard for me to understand. They are good for simple bash scripts and complex refactoring and drafting basic code idioms and that's about it.
And even for these tasks the amount of hand-holding I need to do is substantial. At least Gemini Pro/CLI seems good at one-shot performance, before its context gets poisoned.
"Take X and Y I've written before, some documentation for Z, an example W from that repo, now smash them together and build the thing I need"
Just today, I spent an hour documenting a function that performs a set of complex scientific simulations. Defined the function input structure, the outputs, and put a bunch of references in the body to function calls it would use.
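To make that concrete, the kind of scaffolding described here might look roughly like this (a hypothetical TypeScript sketch, not the actual code; every name below is invented):

    /**
     * Run a batch of diffusion simulations on a 2D grid and report
     * summary statistics per time step.
     *
     * Requirements for the model:
     * - Must be deterministic for a given `seed`.
     * - Grid setup, the per-step update, and the statistics must live in
     *   separate functions (initializeGrid, stepGrid, summarizeGrid)
     *   so each can be unit tested on its own.
     */
    interface SimulationInput {
      gridSize: { width: number; height: number };
      timeSteps: number;
      diffusionCoefficient: number;
      seed: number;
    }

    interface SimulationOutput {
      perStepMeans: number[]; // mean concentration at each time step
      finalGrid: number[][];  // grid state after the final step
    }

    function runSimulation(input: SimulationInput): SimulationOutput {
      // TODO (for the model): build the grid with initializeGrid(input),
      // advance it with stepGrid(...) for input.timeSteps iterations,
      // and collect statistics with summarizeGrid(...).
      throw new Error("not yet implemented");
    }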
I then spent 15 minutes explaining to the free version of ChatGPT what the function needs to do both in scientific terms and in computer architecture terms (e.g. what needed to be separated out for unit tests). Then it asked me to answer ~15 questions it had (most were yes/no, it took about 5 min), then it output around 700 lines of code.
It took me about 5 minutes to get it working, since it had a few typos. It ran.
Then I spent another 15 minutes laying out all the categories of unit tests and sanity tests I wanted it to write. It produced ~1500 lines of tests. It took me half an hour to read through them all, adjusting some edge cases that didn't make sense to me and adjusting the code accordingly. And a couple cases where it was testing the right part of the code, but had made valiant but wrong guesses as to what the scientifically correct answer would be. All the tests then passed.
All in all, a little over two hours. And it ran perfectly. In contrast, writing the code and tests myself entirely by hand would have taken at least a couple of entire days.
So when you say they're good for those simple things you list and "that's about it", I couldn't disagree more. In fact, I find myself relying on them more and more for the hardest scientific and algorithmic programming, when I provide the design and the code is relatively self-contained and tests can ensure correctness. I do the thinking, it does the coding.
So that's... math. A very well defined problem, defined very well. Any decent programmer should be able to produce working software from that, and it's great that ChatGPT was able to help you get it done much faster than you could have done it yourself. That's also the kind of project that's very well suited for unit testing, because again: math. Functions with well defined inputs, outputs, and no side-effects.
Only a tiny subset of software development projects are like that though.
Right: the majority of software development is things like "build a REST API for these three database tables" or "build a contact form with these four fields" or "write unit tests for this new function" or "update my YAML CI configuration to run this extra command".
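To put a size on the first item in that list, a minimal sketch of that kind of endpoint (assuming Express and better-sqlite3, with an invented contacts table) is only a few dozen lines:

    import express from "express";
    import Database from "better-sqlite3";

    // Hypothetical example: one table of an imagined three-table schema.
    const db = new Database("app.db");
    const app = express();
    app.use(express.json());

    // List all contacts.
    app.get("/contacts", (_req, res) => {
      res.json(db.prepare("SELECT * FROM contacts").all());
    });

    // Create a contact and return it with its new id.
    app.post("/contacts", (req, res) => {
      const { name, email } = req.body;
      const info = db
        .prepare("INSERT INTO contacts (name, email) VALUES (?, ?)")
        .run(name, email);
      res.status(201).json({ id: info.lastInsertRowid, name, email });
    });

    app.listen(3000);

Repeat for the other two tables and you have the whole task; nothing in it requires deep design judgment.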
The example you gave sounds like the problem is deterministic, even if composed of many moving parts. That's one way of looking at complexity.
When I talk about complex problems I'm not just talking about intricate problems. I'm talking about problems where the "problem" is design, not just implementing a design, and that is where LLMs struggle a lot.
Example: I want to design a strongly typed fluent API interface to some functionality. Even knowing how to shape the fluent interface so that it is powerful, intuitive, well/strongly typed, and maintainable is a deep art.
The intuitive design constraints that I'm designing under would be hard to even explain to an LLM.
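To give a toy illustration of the kind of thing meant (a hedged TypeScript sketch with invented names; the real design problem is far richer than this): encode the builder's state in the types so that only valid call orders compile.

    // Typestate-style fluent builder: the phantom `_stage` field lets the
    // compiler distinguish Query<"empty"> from Query<"ready">, so .run()
    // is only callable once a table and at least one column are chosen.
    type Stage = "empty" | "hasTable" | "ready";

    class Query<S extends Stage> {
      private readonly _stage?: S; // phantom marker, never assigned

      private constructor(
        private readonly table: string | undefined,
        private readonly columns: readonly string[],
      ) {}

      static create(): Query<"empty"> {
        return new Query(undefined, []) as Query<"empty">;
      }

      from(this: Query<"empty">, table: string): Query<"hasTable"> {
        return new Query(table, this.columns) as Query<"hasTable">;
      }

      select(this: Query<"hasTable"> | Query<"ready">, column: string): Query<"ready"> {
        return new Query(this.table, [...this.columns, column]) as Query<"ready">;
      }

      run(this: Query<"ready">): string {
        return `SELECT ${this.columns.join(", ")} FROM ${this.table}`;
      }
    }

    // Compiles: Query.create().from("users").select("id").select("name").run()
    // Compiler errors: Query.create().select("id"), or calling .run() too early

Getting even this toy right takes judgment about variance, inference, and error messages; scaling it to a real API surface is where the deep art comes in.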
It is a lot faster at typing than I am.
In my experience implementing algorithms from a good comprehensive description and keeping track of data models is where they shine the most.
Anything less is setting it up for failure...
One reason I know an LLM can't come close to my design is this: I've written something that works (what a typical senior engineer might write), but this is not enough. I have evaluated it critically (drawing on my experience with long-lived software), rewritten it again to better meet the targets above, and repeated this process several times. I don't know what would make an LLM go: now that kind of works, but is this the most intuitive, well-typed, and maintainable design there could be?
My previous design required looping through all known resources asking "can actor X perform action Y on this?". The new design gets to generate a very complex but thoroughly tested SQL query instead.
Applying that new design and updating the hundreds of related tests would have taken me weeks. I got it done in two days.
Here's a diff that captures most of the work: https://github.com/simonw/datasette/compare/e951f7e81f038e43...
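For readers who don't want to dig through the diff, the general shape of that shift (with entirely invented table and column names, not the actual Datasette schema) is roughly this:

    // Before: one permission check per resource.
    //   for (const resource of allResources) {
    //     if (await canPerform(actor, "view", resource)) visible.push(resource);
    //   }
    //
    // After: a single (complex, but heavily tested) query that returns only
    // the resources the actor is allowed to act on.
    const visibleResourcesSql = `
      SELECT r.id, r.name
      FROM resources AS r
      LEFT JOIN permission_rules AS p
        ON p.resource_id = r.id
       AND p.actor_id    = :actor_id
       AND p.action      = :action
      WHERE COALESCE(p.allow, r.default_allow) = 1
    `;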
E.g. just updating Bootstrap to Angular Bootstrap: it didn't carry over how I placed the dropdowns (basically using dropdown-end), so everything was out of view on desktop and mobile.
It forgot the Transloco I used everywhere and just used default English (happens a lot).
It suggested code that fixed one bug (expression property recursion), but then LINQ to SQL was broken.
Upgrading to Angular 17 in an ASP.NET Core app: I knew it used Vite now, but it also required a browser folder to deploy. 20 changes down the road, I noticed something in my UI wasn't updated in dev (fast commits for my side project, I don't build locally); it wasn't deploying anything related to Angular anymore...
I had two files named ApplicationDbContext and it took the one from the wrong monolith module.
It sometimes adds files in the wrong directory, e.g. some modules were made with feature folders.
It sometimes forgets to update my Ocelot gateway, or updates the compressed version. ...
Note: I documented my architecture in e.g. Cline, but I use multiple agents to experiment with.
Tldr: it's an expert beginner programmer.
I'm beginning to suspect a lot of my great experiences with coding agents come from the fact that they can run tests to confirm they haven't broken anything.
that's like, 90% of the code people are writing
This works well for humans too, but custom analysers are abstract and not many devs know how to write them, so they are mostly provided by library authors. However, being able to generate them via LLMs makes them so much more accessible, and IMHO is a game changer for enforcing an architecture.
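As a concrete (if language-swapped) illustration: in the TypeScript ecosystem the closest analogue to a custom analyzer is a custom ESLint rule, and an architecture-enforcing one can be tiny. A hedged sketch (hypothetical layering rule, invented paths):

    import type { Rule } from "eslint";

    // Forbid files under src/ui/ from importing anything under src/db/,
    // i.e. enforce "the UI layer must not touch persistence directly".
    const noDbFromUi: Rule.RuleModule = {
      meta: {
        type: "problem",
        messages: {
          noDbFromUi: "UI code must not import the persistence layer directly.",
        },
        schema: [],
      },
      create(context) {
        return {
          ImportDeclaration(node) {
            const importer = context.getFilename();     // file doing the import
            const imported = String(node.source.value); // module being imported
            if (importer.includes("/src/ui/") && imported.includes("/db/")) {
              context.report({ node, messageId: "noDbFromUi" });
            }
          },
        };
      },
    };

    export default noDbFromUi;

Once a rule like this exists, every agent run (and every human PR) gets the architecture check for free at lint time.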
I've been exploring this direction a lot lately, and it feels very promising.
I also want C# semantics even more closely integrated with the LLM. I'm imagining a stronger version of Structured Model Outputs that knows all the valid tokens that could be generated following a "." (including instance methods, extension properties, etc.) and prevents invalid code from even being generated in the first place, rather than needing a roundtrip through a Roslyn analyzer or the compiler to feed more text back to the model. (Perhaps there's some leeway to allow calls to not-yet-written methods to be generated.) Or maybe this idea is just a crutch I'm inventing for current frontier models and future models will be smart enough that they don't need it?
I have written many program analyses (though never any for C#; I’ll have to check it out), and my experience is that they are quite challenging to write. Many are research-level CS, so well outside the skill set of your average vibe coder. I’m wondering if you have some insight about LLM generated code that has not occurred to me…
I have a strong opinion that AI will boost the importance of people with “special knowledge” more than anyone else regardless of role. So engineers with deep knowledge of a system or PMs with deep knowledge of a domain.
In a lot of ways I think that will lead to stronger delivery teams. As a designer—the best performing teams I've been on have individuals with a core competency, but a lot of overlap in other areas. Product managers with strong engineering instincts, engineers with strong design instincts, etc. When there is less ambiguity in communication, teams deliver better software.
Longer-term I'm unsure. Maybe there is some sort of fusion into all-purpose product people able to do everything?
I have a few scattered thoughts here but I think you’re caught up on how things are done now.
A human expert in a field is the customer.
Do you think, say, gpt5 pro can’t talk to them about a problem and what’s reasonable to try and build in software?
It can build a thing, with tests, run stuff and return to a user.
It can take feedback (talking to people is one of the major things LLMs have solved).
They can iterate (see: Codex), deploy, and they can absolutely write copy.
What do you really think in this list they can’t do?
For simplicity reduce it to a relatively basic crud app. We know that they can make these over several steps. We know they can manage the ui pretty well, do incremental work etc. What’s missing?
I think something huge here is that some of the software engineering roles and management become exceptionally fast and cheap. That means you don’t need to have as many users to be worthwhile writing code to solve a problem. Entirely personal software becomes economically viable. I don’t need to communicate value for the problem my app has solved because it’s solved it for me.
Frankly most of the “AI can’t ever do my thing” comments come across as the same as “nobody can estimate my tasks they’re so unique” we see every time something comes up about planning. Most business relevant SE isn’t complex logically, interestingly unique or frankly hard. It’s just a different language to speak.
Disclaimer: a client of mine is working on making software simpler to build and I’m looking at the AI side, but I have these views regardless.
You'll get the occasional high agency non-technical customer who decides to learn how to get these things done with LLMs but they'll be a pretty rare breed.
I know that right now few want to sit in front of claude code, but it's just not that big of a leap to move this up a layer. Workflows do this even without the models getting better.
The one key point is that I am keenly aware of what I can and cannot do. With these new superpowers, I often catch myself doing too much, and I end up doing a lot more rewrites than a real engineer would. But I can see Dunning Kruger playing out everywhere when people say they can vibe code an entire product.
I’m sure it’ll improve over time but it won’t be nearly as easy as making ai good at coding.
A while ago I discovered that Claude, left to its own devices, has been doing the LLM equivalent of Ctrl-C/Ctrl-V for almost every component it's created in an ever growing .NET/React/Typescript side project for months on end.
It was legitimately baffling seeing the degree to which it had avoided reusing literally any shared code in favor of updating the exact same thing in 19 places every time a color needed to be tweaked or something. The craziest example was a pretty central dashboard view with navigation tabs in a sidebar where it had been maintaining two almost identical implementations just to display a slightly different tab structure for logged in vs logged out users.
I've now been directing it to de-spaghetti things when I spot good opportunities and added more best practices to CLAUDE.md (with mixed results) so things are gradually getting more manageable, but it really shook my confidence in its ability to architect, well, anything on its own without micromanagement.
Yes, they're bad now, but they'll get better in a year.
If the generative ability is good enough for small snippets of code, it's good enough for larger software that's better organized. Maybe the models don't have enough of the right kind of training data, or the agents don't have the right reasoning algorithms. But it is there.
If we’re simply measuring model benchmarks, I don’t know if they’re much better than a few years ago… but if we’re looking at how applicable the tools are, I would say we’re leaps and bounds beyond where we were.
Candidly, it's awful. There are countless situations where it would be faster for me to edit the file directly (CSS, I'm looking at you!).
With that said, I've been surprised at how far the coding agents are able to go[0], and a lot less surprised about where I need to step in.
Things that seem to help:
1. Always create a plan/debug markdown file
2. Prompt the agent to ask questions/present multiple solutions
3. Use git more than normal (squash ugly commits on merge)
Planning is key to avoid half-brained solutions, but having "specs" for debug is almost more important. The LLM will happily dive down a path of editing as few files as possible to fix the bug/error/etc. This, unchecked, can often lead to very messy code.
Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over the how something is built.
I now basically commit every time a plan or debug step is complete. I've tried having the LLM control git, but I feel that it eats into the context a bit too much. Ideally a 3rd party "agent" would handle this.
The last thing I'll mention is that Claude Code (Sonnet 4.5) is still very token-happy, in that it eagerly goes above and beyond when not always necessary. Codex (gpt-5-codex) on the other hand, does exactly what you ask, almost to a fault. For both cases, this is where planning up-front is super useful.
[0]Caveat: the projects are either Typescript web apps or Rust utilities, can't speak to performance on other languages/domains.
Noting your caveat but I’m doing this with Python and your experience is very different from mine.
The "it's awful" admission is due to the "don't look at code" aspect of this exercise.
For real work, my split is more like 80% LLM/20% non-LLM, and I read all the code. It's much faster!
Always create a plan/debug markdown file
Very much necessary, especially with Claude I find. It auto-compacts so often (Sonnet 4.5) and it instantly goes AWOL-stupid after that. I then make it re-read the markdown file, so we can actually continue without it forgetting about 90% of what we just did/talked about.
Prompt the agent to ask questions/present multiple solutions
I find that only helps marginally. They all output so much text it's not even funny, and that's with one "solution". I don't get how people can stand reading all that nonsense they spew, especially Claude. Everything is insta-ready to deploy, problem solved, root cause found, go hit the big red button that might destroy the earth in a mushroom cloud. I learned real fast to only skim what it says and ignore all that crap. (I never tried to "change its personality" for real. I did try to tell it to always use the scientific method and prove its assumptions, but just like a junior dev it never does, and it just tells me stupid things it believes to be true that I then have to question.) Again, just like a junior dev, but it's my junior dev that's always on and available when I have time, and it does things while I do other stuff. And instead of me having to ask the junior after an hour or two what rabbit hole it went down and get them out of there, Claude and Codex usually visually ping the terminal before I even have time to notice. That's for when I don't have full-time focus on what I'm trying to do with the agents, which is why I do like using them.
The times when I am fully attentive, they're just soooo slow. And many many times I could do what they're doing faster or just as fast but without spending extra money and "environment". I've been trying to "only use AI agents for coding" for like a month or two now to see its positives and limitations and form my own opinion(s).
Prompting the agent to ask questions/present multiple solutions allows me to stay "in control" over the how something is built.
I find Claude's "Plan mode" is actually ideal. I just enable it and I don't have to tell it anything. While Codex "breaks out" from time to time and just starts coding even when I just ask it a question. If these machines ever take over, there's probably some record of me swearing at them and I will get a hitman on me. Unlike junior devs, I have no qualms about telling a model that it again ignored everything I told it. Ideally a 3rd party "agent" would handle this.
With sub-agents you can. Simple git interactions are perfect for sub-agents because not much can get lost in translation in the interface between the main agent and the sub-agent. Then again, I'm not sure how you lose that much context. I'd rather use a sub-agent for things like running the tests and linter on the whole project in the final steps, which spew a lot of unnecessary output.
Personally, I had a rather bad set of experiences with it controlling git without oversight, so I do that myself, since doing it myself is less taxing than approving everything it wants to do (I automatically allow Claude certain commands that are read-only, for investigations and reviewing things).
I hear this so much. It's almost like people think code quality is unrelated to how well the product works. As though you can have 1 without the other.
If your code quality is bad, your product will be bad. It may be good enough for a demo right now, but that doesn't mean it really "works".
Why? Modern hardware is so powerful that it allows for extremely inefficient code; even if some code runs a thousand times slower because it's badly programmed, it will still be so fast that it seems instant.
For the rest of the stuff, it has no relevance to the user of the software what the code is doing inside the chip, as long as the inputs and outputs function as they should. The user wants to give input and receive output; nothing else has any significance at all for her.
But that's just a small piece of the puzzle. I agree that the user only cares about what the product does and not how the product works, but the what is always related to how, even if that relationship is imperceptible to the user. A product with terrible code quality will have more frequent and longer outages (because debugging is harder), and it will take longer for new features to be added (because adding things is harder). The user will care about these things.
Could be because programming involves:
1. Long chains of logical reasoning, and
2. Applying abstract principles in practice (in this case, "best practices" of software engineering).
I think LLMs are currently bad at both of these things. They may well be among the things LLMs are worst at atm.
Also, there should be a big asterisk next to "can write code". LLMs do often produce correct code of some size and of certain kinds, but they can also fail at that too frequently.
I've generally found the quality of the .NET code to be quite good. It trips up sometimes when linters ping it for rules not normally enforced, but it does the job reasonably well.
The front-end JavaScript, though? It's both an absolute genius and a complete menace at the same time. It'll write reams of code to get things just right, but with no regard for human maintainability.
I lost an entire session to the fact that it cheerfully did:
npm install fabric
npm install -D @types/fabric
Now that might look fine, but a human would have realised that the typings package describes a completely different, outdated API; the package was last updated 6 years ago. Claude, however, didn't realise this, and wrote a ton of code that would pass unit tests but fail the type check. It'd run the type checker, rewrite it all to pass the type checker, only for it to now fail the unit tests.
Eventually it semi-gave up typing and did loads of (fabric as any) all over the place, so now it just gave runtime exceptions instead.
I intervened when I realised what it was doing, and found the root cause of its problems.
It was a complete blindspot because it just trusted both the library and the typechecker.
So yeah, if you want to snipe a vibe coder, suggest installing fabricjs with typings!
Improving this is what everyone's looking into now. Even larger models, context windows, adding reasoning, or something else might improve this one day.
For example, you can pull the library code to your working environment and install the coding agent there as well. Then you can ask them to read specific files, or even all files in the library. I believe (according to my personal experience) this would significantly decrease the possibility of hallucinating.
The human brain learns through mistakes, repetition, breaking down complex problems into simpler parts, and reimagining ideas. The hippocampus naturally discards memories that aren't strongly reinforced, so if you rely solely on AI, you're simply not going to remember much.
Any company claiming they've replaced engineers with AI has done so in an attempt to cover up the real reasons they've gotten rid of a few engineers. "AI automating our work" sounds much better to investors than "We overhired and have to downsize".
In some ways, this seems backwards. Once you have a demo that does the right thing, you have a spec, of sorts, for what's supposed to happen. Automated tooling that takes you from demo to production ready ought to be possible. That's a well-understood task. In restricted domains, such as CRUD apps, it might be automated without "AI".
To really get the most out of it though, you still need to have solid knowledge in your own field.
The difference is what we used to call the "ilities": Reliability, habitability, understandability, maintainability, securability, scalability, etc.
None of these things are about the primary function of the code, i.e. "it seems to work." In coding, "it seems to work" is good enough. In software engineering, it isn't.
The next step would be to have a model running continuously on a project with inputs from monitoring services, test coverage, product analytics, etc. Such an agent, powered by a sufficient model, could be considered an effective software engineer.
We’re not there today, but it doesn’t seem that far off.
I wonder if that same non-technical person that built the MVP with GenAI and requires a (human) technical assistance today, will need it tomorrow as well. Will the tooling be mature enough and lower the barrier enough for anyone to have a complete understanding about software engineering (monitoring services, test coverage, product analytics)?
I've played around with agent-only code bases (where I don't code at all), and had an agent hooked up to server logs, which would create an issue when it encountered errors; then an agent would fix the tickets, push to prod, check deployment statuses, etc. It worked well enough to see that this could easily become the future. (I also had Claude/Codex code that whole setup.)
Just for semantic nitpicking, I've zero-shotted heaps of small "software" projects that I use and then throw away. It doesn't count as a SaaS product, but I would still call it software.
An inevitable comment: "But I've seen AI code! So it must be able to build software"
An automated system that determines whether a system is correct (whatever that means) is harder to build than the coding agents themselves.
What time frame counts as "not that far off" to you?
If you tried to bet me that the market for talented software engineers would collapse within the next 10 years, I'd take it no question. 25 years, I think my odds are still better than yours. 50 years, I might not take the bet.