That's what I've found as well. Start describing or writing a function, include the whole file for context and it'll do its job. Give it a whole codebase and it will just wander in the woods burning tokens for ten minutes trying to solve dependencies.
That, and we don't focus only on the textual description of a problem when we encounter one. We don't see the debugger output and go "how do I make this bad output go away?!?". Oh, I am getting an authentication error. Well, maybe I should just delete the token check for that code path...problem solved?!
No. Problem very much not-solved. In fact, problem very much very bigger big problem now, and [Grug][1] find himself reaching for club again.
Software engineers are able to step back, think about the whole thing, and determine the root cause of a problem. I am getting an auth error...ok, what happens when the token is verified...oh, look, the problem is not the authentication at all...in fact there is no error! The test was simply bad and tried to call a higher-privilege function as a lower-privilege user. So, the test needs to be fixed. And also, even though it isn't per se an error, the response for that function should maybe differentiate between "401 because you didn't authenticate" and "401 because your privileges are too low".
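For what it's worth, a minimal sketch of that split (hypothetical names, no particular framework; in HTTP the conventional pairing is 401 for "not authenticated" and 403 for "authenticated but not allowed"):

    # Minimal sketch with a hypothetical user object; not any specific framework's API.
    def check_access(user, required_role):
        if user is None:
            return 401  # no valid credentials presented at all
        if required_role not in user.roles:
            return 403  # authenticated, but privileges too low
        return 200      # authenticated and authorized

With that split in place, the bad test above is easier to spot: the lower-privilege caller gets a distinct "forbidden" response instead of a generic auth failure.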
Put another way: you have an Excel roster of people with accounts, some of whom need their accounts shut down, but you only have first and last names as identifiers, and the pool is large enough that more than one person shares a given name.
You can't shut down all accounts with a given name, and there is no unique identifier. How do you solve this?
You have to ask for, and be given, the unique identifier that differentiates the otherwise indistinguishable. Without that, even a person can't do the task.
The person can make guesses, but those guesses are just hallucinations with a significant probability of a bad outcome on repeat.
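A minimal sketch of that dead end, with made-up data; all any process (human or LLM) can do is flag the ambiguity and escalate:

    from collections import defaultdict

    # Hypothetical roster rows: (first, last, account_id). The shutdown list only has names.
    roster = [("Ana", "Silva", "u101"), ("Ana", "Silva", "u202"), ("Bo", "Chen", "u303")]
    to_shut_down = {("Ana", "Silva"), ("Bo", "Chen")}

    accounts_by_name = defaultdict(list)
    for first, last, account_id in roster:
        accounts_by_name[(first, last)].append(account_id)

    for name in to_shut_down:
        matches = accounts_by_name[name]
        if len(matches) == 1:
            print("shut down", matches[0])
        else:
            print("ambiguous, need a unique identifier for", name, matches)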
At a core level I don't think these types of issues are going to be solved. Quite a lot of people would be unable to solve this and would struggle with this example (when not given the answer, or hinted at the solution in the framing of the task; i.e. when they just have a list of names and are told to do an impossible task).
What an LLM cannot do today is almost irrelevant in the tide of change upon the industry. With continued improvement, what it cannot do today it may well do tomorrow.
LLMs may get better, but they will not be what people are clamoring for them to be.
I mean, there was and then there wasn't. All of those things are shrinking fast because we handed over control to people who care more about profits than customers because we got too comfy and too cheap, and now right to repair is screwed.
Honestly, I see LLM-driven development as a threat to open source and right to repair, among a litany of other things.
LLMs are not like this. The fundamental way they operate, the core of their design is faulty. They don't understand rules or knowledge. They can't, despite marketing, really reason. They can't learn with each interaction. They don't understand what they write.
All they do is spit out the most likely text to follow some other text based on probability. For casual discussion about well-written topics, that's more than good enough. But for unique problems in a non-English language, it struggles. It always will. It doesn't matter how big you make the model.
They're great for writing boilerplate that has been written a million times with different variations - which can save programmers a LOT of time. The moment you hand them anything more complex it's asking for disaster.
Modern coding AI models are not just probability-crunching transformers. They haven't been just that for some time. In current coding models the transformer bit is just one part of what is really an expert system. The complete package includes things like highly curated training data, specialized tokenizers, pre- and post-training regimens, guardrails, optimized system prompts, etc., all tuned to coding. Put it all together and you get one-shot performance on generating the type of code that was unthinkable even a year ago.
The point is that the entire expert system is getting better at a rapid pace and the probability bit is just one part of it. The complexity frontier for code generation keeps moving and there's still a lot of low hanging fruit to be had in pushing it forward.
> They're great for writing boilerplate that has been written a million times with different variations
That's >90% of all code in the wild. Probably more. We have three quarters of a century of code in our history, so there is very little that's original anymore. Maybe original to the human coder fresh out of school, but the models have all this history to draw upon. So if the models produce the boilerplate reliably, then human toil in writing if/then statements is at an end. Kind of like how - barring the occasional mad genius [0] - the vast majority of coders don't write assembly to create a website anymore.
[0] https://asm32.info/index.cgi?page=content/0_MiniMagAsm/index...
This is lipstick on a pig. All those methods are impressive, but ultimately workarounds for an idea that is fundamentally unsuitable for programming.
>That's >90% of all code in the wild. Probably more.
Maybe, but not 90% of time spent on programming. Boilerplate is easy. It's the 20%/80% rule in action.
I don't deny these tools can be useful and save time - but they can't be left to their own devices. They need to be tightly controlled and given narrow scopes, with heavy oversight by an SME who knows what the code is supposed to be doing. "Design W module with X interface designed to do Y in Z way", keeping it as small as possible and reviewing it to hell and back. And keeping it accountable by making tests yourself. Never let it test itself, it simply cannot be trusted to do so.
LLMs are incredibly good at writing something that looks reasonable, but is complete nonsense. That's horrible from a code maintenance perspective.
And even with all that, they still produce garbage way too often. If we continue the "car" analogy, the car would crash randomly sometimes when you leave the driveway, and sometimes it would just drive into the house. So you add all kinds of fancy bumpers to the car and guard rails to the roads, and the car still runs off the road way too often.
Said like a true software person. I'm to understand that computer people are looking at LLMs from the wrong end of the telescope; and that from a neuroscience perspective, there's a growing consensus among neuroscientists that the brain is fundamentally a token predictor, and that it works on exactly the same principles as LLMs. The only difference between a brain and an LLM may be the size of its memory, and what kind and quality of data it's trained on.
Hahahahahaha.
Oh god, you're serious.
Sure, let's just completely ignore all the other types of processing that the brain does. Sensory input processing, emotional regulation, social behavior, spatial reasoning, long and short term planning, the complex communication and feedback between every part of the body - even down to the gut microbiome.
The brain (human or otherwise) is incredibly complex and we've barely scraped the surface of how it works. It's not just neurons (which are themselves complex), it's interactions between thousands of types of cells performing multiple functions each. It will likely be hundreds of years before we get a full grasp on how it truly works - if we ever do at all.
We can reasonably speak about certain fundamental limitations of LLMs without those being claims about what AI may ever do.
I would agree they fundamentally lack models of the current task and that it is not very likely that continually growing the context will solve that problem, since it hasn't already. That doesn't mean there won't someday be an AI that has a model much as we humans do. But I'm fairly confident it won't be an LLM. It may have an LLM as a component but the AI component won't be primarily an LLM. It'll be something else.
The sooner people stop worrying about which label best fits LLMs, the sooner they can find the things LLMs absolutely excel at and improve their own workflows.
Stop fighting the future. It's not replacing anyone right now. Later? Maybe. But right now the developers and users fully embracing it are experiencing productivity boosts unseen previously.
Language is what people use it as.
This is the kind of thing that I disagree with. Over the last 75 years we’ve seen enormous productivity gains.
You think that LLMs are a bigger productivity boost than moving from physically rewiring computers to using punch cards, from running programs as batch processes with printed output to getting immediate output, from programming in assembly to higher level languages, or even just moving from enterprise Java to Rails?
Skepticism isn't the same thing as fighting the future.
I will call something AGI when it can reliably solve novel problems it hasn't been pre-trained on. That's my goal post and I haven't moved it.
I have the complete opposite feeling. The layman understanding of the term "AI" is AGI, a term that only needs to exist because researchers and businessmen hype their latest creations as AI.
The goalposts for AI don't move; the definition isn't precise, but we know it when we see it.
AI, to the layman, is Skynet/Terminator, Asimov's robots, Data, etc.
The goalpost-moving you're seeing is when something the tech bubble calls AI escapes the tech bubble and everyone else looks at it and says: no, that's not AI.
The problem is that everything that comes out of the research efforts toward AI, the tech industry calls AI, despite it not achieving that goal by the common understanding of the term. LLMs were/are a hopeful AI candidate but, as of today, they aren't AI by that standard, which doesn't stop OpenAI from trying to raise money using the term.
EDIT - I see now. sorry.
For all intents and purposes of the public, AI == LLM. End of story. Doesn't matter what developers say.
The premise that an AI needs to do Y "as we do" to be good at X because humans use Y to be good at X needs closer examination. This presumption seems to be omnipresent in these conversations and I find it so strange. Alpha Zero doesn't model chess "the way we do".
Neural networks are necessary but not sufficient. LLMs are necessary but not sufficient.
I have no doubt that there are multiple (perhaps thousands? more?) of LLM-like subsystems in our brains. They appear to be a necessary part of creating useful intelligence. My pet theory is that LLMs are used for associative memory purposes. They help generate new ideas and make predictions. They extract information buried in other memory. Clearly there is another system on top that tests, refines, and organizes the output. And probably does many more things we haven't even thought to name yet.
Alternatively, the goalposts keep being moved.
"Every critique of AI assumes to some degree that contemporary implementations will not, or cannot, be improved upon.
Lemma: any statement about AI which uses the word "never" to preclude some feature from future realization is false.
Lemma: contemporary implementations have almost always already been improved upon, but are unevenly distributed."
And with fusion, we already have a working prototype (the Sun). And if we could just scale our tech up enough, maybe we’d have usable fusion.
That is too reductive and simply not true. Contemporary critiques of AI include that they waste precious resources (such as water and energy) and accelerate bad environmental and societal outcomes (such as climate change, the spread of misinformation, loss of expertise), among others. Critiques go far beyond “hur dur, LLM can’t code good”, and those problems are both serious and urgent. Keep sweeping critiques under the rug because “they’ll be solved in the next five years” (eternally away) and it may be too late. Critiques have to take into account the now and the very real repercussions already happening.
Dismissing a concern with “LLMs/AI can’t do it today but they will probably be able to do it tomorrow” isn’t all that useful or helpful when “tomorrow” in this context could just as easily be “two months from now” or “50 years from now”.
A crucial ingredient might be missing.
Even translations between human languages (which allows for ambiguity) can be messy. Imagine if the target language is for a system that will exactly do as told unless someone has qualified those actions as bad.
When the employer's business isn't shipping software, engineers have no option other than to actually learn the business as well.
Agree strongly, and I think this is basically what the article is saying as well about keeping a mental model of requirements/code behavior. We kind of already knew this was the hard part. How many times have you heard that once you get past junior level, the hard part is not writing the code, it's knowing what code to write? This realization is practically a rite of passage.
Which raises the question of what the software engineering job looks like in the future. It definitely depends on how good the AI is. In the most simplistic case, AI can do all the coding right now and all you need is a task issue. And frankly, probably a user-written (or at least reviewed, but probably written) test. You could make the issue and test upfront, farm out the PR to an agent, and manually approve when you see it passed the test case you wrote.
In that case you are basically PM and QA. You are not even forming the prompt, just detailing the requirements.
But as the tech improves, can all tasks fit into that model? Not design/architecture tasks - or at least not without a different task completion model than the one described above. The window will probably grow, but it's hard to imagine that it will handle all pure coding tasks. Even for large tasks that theoretically can fit into that model, you are going to have to do a lot of thinking and testing and prototyping to figure out the requirements and test cases. In theory you could apply the same task/test process, but that seems like it would be too much structure and indirection to actually be helpful compared to knowing how to code.
Those rules are also very fuzzy and only get defined more formally by the coding process.
An earlier effort at AI was based on rules and the C. Forgy RETE algorithm. Soooo, rules have been tried??
Rules engines were traditionally written in Prolog or Lisp, back during the AI wave when they were cool.
Forgy was Charles Forgy.
For a "rules engine", there was also IBM's YES/L1.
Coincidentally, I encountered the author's work for the first time only a couple of days ago as a podcast guest; he vouches for the "Dirty Code" approach while straw-manning Uncle Bob's general principles of balancing terseness/efficiency with ergonomics and readability (in most, but not all, cases).
I guess this stuff sells t-shirts and mugs /rant
Have you read Uncle Bob? There's no need to strawman: Bob's examples in Clean Code are absolutely nuts.
Here's a nice writeup that includes one of Bob's examples verbatim in case you've forgotten: https://qntm.org/clean
Here's another: https://gerlacdt.github.io/blog/posts/clean_code/
> THINK they are big brained developers many, many more, and more even definitely probably maybe not like this, many sour face (such is internet)
> (note: grug once think big brained but learn hard way)
Grug is both the high and low end of the Bell curve.
Instead, my brain parses code into something like an AST which then is represented as a spatial graph. I model the program as a logical structure instead of a textual one. When you look past the language, you can work on the program. The two are utterly disjoint.
I think LLMs fail at software because they're focused on text and can't build a mental model of the program logic. It takes a huge amount of effort and brainpower to truly architect something and understand large swathes of the system. LLMs just don't have that type of abstract reasoning.
It's funny that everyone says that "LLMs" have plateaued, yet the base models have caught up with early attempts to build harnesses with the things I've mentioned above. They now match or exceed the previous generation software glue, with just "tools", even with limited ones like just "terminal".
- Whenever you address a failing test, always bring your component mental model into the context.
Paste that into your Claude prompt and see if you get better results. You'll even be able to read and correct the LLM's mental model.
so current LLMs might not quite be human level, but I'd have to see a bigger model fail before I'd conclude that it can't do $X.
Years ago I gave up compiling these large applications altogether. I compiled Firefox via FreeBSD's (v8.x) ports system, and that alone was a nightmare.
I cannot imagine what it would be like to compile GNOME3 or KDE or Libreoffice. Emacs is the largest thing I compile now.
While a collision hasn't yet been found for a SHA256 package on Nix, by the pigeonhole principle they exist, and the computer will not be able to decide between the two packages in such a collision leading to system level failure, with errors that have no link to cause (due to the properties involved, and longstanding CS problems in computation).
These things generally speaking contain properties of mathematical chaos which is a state that is inherently unknowable/unpredictable that no admin would ever approach or touch because its unmaintainable. The normally tightly coupled error handling code is no longer tightly coupled because it requires matching a determinable state (CS computation problems, halting/decidability).
Non-deterministic failure domains are the most costly problems to solve because troubleshooting which leverages properties of determinism, won't work.
This leaves you only a strategy of guess and check; which requires intimate knowledge of the entire system stack without abstractions present.
A cursory look at a nix system would also show you that the package name, version and derivation sha are all concatenated together.
> A cursory look at a nix system would show ... <three things concatenated together>
This doesn't negate or refute the pigeonhole principle. In following pigeonhole there is some likelihood that a collision will exist, and that probability trends to 1 given sufficient iterations (time).
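For concreteness, a minimal sketch of what that likelihood works out to under the standard birthday-bound approximation (the billion-derivation count is just an assumed figure):

    N = 2 ** 256                 # possible SHA-256 digests
    n = 10 ** 9                  # assume a billion distinct derivations ever hashed

    p = n * (n - 1) / (2 * N)    # approximate collision probability while p is small
    print(p)                     # ~4.3e-60

The probability does trend upward with more inputs, but under this approximation it only becomes non-negligible after on the order of 2^128 distinct hashes.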
The only argument you have is a measure of likelihood and probability, which is a streetlight-effect cognitive bias or intelligence trap. There's a video on YouTube discussing these types of traps, a TED talk from an ex-CIA officer.
Likelihood and probability are heavily influenced by the priors they measure, and without perfect knowledge (which no one has today) those priors may deviate significantly, or be indeterminable.
Imagine for a second that a general method for rapidly predicting collisions, regardless of algorithm, is discovered and released; which may not be far off given current advances with quantum computing.
All the time and money cumulatively spent on Nix becomes wasted cost, and you are left in a position of complete compromise, suddenly and without a sound pivot available at comparable cost.
With respect, if you can't differentiate basic a priori reasoned logic from AI, I would question your perceptual skills and whether they are degrading. There is a growing body of evidence that exposure to AI may cause such degradation, as appears to be the case with doctors and their diagnostic performance after AI use in various studies (1).
1: https://time.com/7309274/ai-lancet-study-artificial-intellig...
Taken to a next step, recognizing this makes the investment in such a moonshot pipedream (overcoming these inherent problems in a deterministic way), recklessly negligent.
> Recency bias: They suffer a strong recency bias in the context window.
> Hallucination: They commonly hallucinate details that should not be there.
To be fair, those are all issues that most human engineers I've worked with (including myself!) have struggled with to various degrees, even if we don't refer to them the same way. I don't know about the rest of you, but I've certainly had times where I found out that an important nuance of a design was overlooked until well into the process of developing something, forgot a crucial detail that I learned months ago that would have helped me debug something much faster than if I had remembered it from the start, or accidentally made an assumption about how something worked (or misremembered it) and ended up with buggy code as a result. I've mostly gotten pretty positive feedback about my work over the course of my career, so if I "can't build software", I have to worry about the companies that have been employing me and my coworkers who have praised my work output over the years. Then again, I think "humans can't build software reliably" is probably a mostly correct statement, so maybe the lesson here is that software is hard in general.
> AI is awesome for coding! [Opus 4]
> No AI sucks for coding and it messed everything up! [4o]
Would really clear the air. People seem to be evaluating the dumbest models (apparently because they don't know any better?) and then deciding the whole AI thing just doesn't work.
They need to mention significantly more than that: https://dmitriid.com/everything-around-llms-is-still-magical...
--- start quote ---
Do we know which projects people work on? No
Do we know which codebases (greenfield, mature, proprietary etc.) people work on? No
Do we know the level of expertise the people have? No.
Is the expertise in the same domain, codebase, language that they apply LLMs to? We don't know.
How much additional work did they have reviewing, fixing, deploying, finishing etc.? We don't know.
--- end quote ---
And that's just the tip of the iceberg. And that is an iceberg before we hit another one: that we're trying to blindly reverse-engineer a non-deterministic black box inside a provider's black box.
It happens on many topics related to software engineering.
The web developer is replying to the embedded developer who is replying to the architect-that-doesn't-code who is replying to someone with 2 years of experience who is replying to someone working at Google who is replying to someone working at a midsize B2B German company with 4 customers. And on and on.
Context is always omitted and we're all talking about different things ignoring the day to day reality of our interlocutors.
I feel personally described by this statement. At least on a bad day, or if I'm phoning it in. Not sure if that says anything about AI - maybe just that the whole "mental models" part is quite hard.
If we put human engineering teams in the same situation, we’d expect them to do a terrible job, so why do we expect LLMs to do any better?
We can dramatically improve the output of LLM software development by using all those processes and tools that help engineering teams avoid these problems:
https://jim.dabell.name/articles/2025/08/08/autonomous-softw...
Granted, it needs careful planning for CLAUDE.md, and all issues and feature requests need a lot of in-depth specifics, but it all works. So I am not 100% convinced by this piece. I'd say it's definitely not easy to get coding agents to manage and write software effectively, and especially hard to do so in existing projects, but my experience has been across that entire spectrum. I have been sorely disappointed in coding agents and even abandoned a bunch of projects and dozens of pull requests, but I have also seen them work.
you can check out that project here: https://github.com/julep-ai/steadytext/
What's happened for me recently is I've started to revisit the idea that typing speed doesn't matter.
This is an age-old thing; most people don't think it really matters how fast you can type. I suppose the steelman is: most people think it doesn't really matter how fast you can get the edits to your code that you want. With modern tools, you're not typing out all the code anyway, and there are all sorts of non-AI ways to get your code looking the way you want. And that doesn't matter either: the real work of the engineer is the architecture of how the whole program functions. Typing things faster doesn't get you to the goal faster, since finding the overall design is the limiting thing.
But I've been using Claude for a while now, and I'm starting to see the real benefit: you no longer need to concentrate to rework the code.
It used to be burdensome to do certain things. For instance, I decided to add an enum value, and now I have to address all the places where it matches on that enum. This wasn't intellectually hard in the old world, you just got the compiler to tell you where the problems were, and you added a little section for your new value to do whatever it needed, in all the places it appeared.
But you had to do this carefully, otherwise you would just cause more compile/error cycles. Little things like forgetting a semicolon will eat a cycle, and old tools would just tell you the error was there, not fix it for you.
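A minimal sketch of that kind of mechanical change in Python (hypothetical Status enum; here the "compiler" role is played by a type checker such as mypy, and typing.assert_never needs Python 3.11+):

    from enum import Enum
    from typing import assert_never

    class Status(Enum):
        ACTIVE = "active"
        SUSPENDED = "suspended"
        ARCHIVED = "archived"   # the newly added value

    def describe(status: Status) -> str:
        match status:
            case Status.ACTIVE:
                return "account is live"
            case Status.SUSPENDED:
                return "account is on hold"
            case _:
                # until a case for ARCHIVED is added, the type checker flags
                # every match like this one as non-exhaustive
                assert_never(status)

Each individual fix is trivial; the toil is repeating it at every match site and babysitting the compile/error cycle.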
LLMs fix it for you. Now you can just tell Claude to change all the code in a loop until it compiles. You can have multiple agents working on your code, fixing little things in many places, while you sit on HN and muse about it. Or perhaps spend the time considering what direction the code needs to go.
The big thing however is that when you're no longer held up by little compile errors, you can do more things. I had a whole laundry list of things I wanted to change about my codebase, and Claude did them all. Nothing on the business level of "what does this system do" but plenty of little tasks that previously would take a junior guy all day to do. With the ability to change large amounts of code quickly, I'm able to develop the architecture a lot faster.
It's also a motivation thing: I feel bogged down when I'm just fixing compile errors, so I prioritize what to spend my time on if I am doing traditional programming. Now I can just do the whole laundry list, because I'm not the guy doing it.
I always have a whole bunch of things I want to change in the codebase I'm working on, and the bottleneck is review, not me changing that code.
LLMs also help you test.
Almost every quality that software has is designed in from a higher abstraction level. Almost nothing is put there by fixing error after error.
People complained endlessly about the internet in the early to mid 90s: it was slow, static, most sites had under-construction signs on them, your phone modem would just randomly disconnect. The internet did suck in a lot of ways, and yet people kept using it.
Twitter sucked in the mid 2000s, we saw the fail whale weekly and yet people continued to use it for breaking news.
Electric cars sucked, no charging, low distance, expensive and yet no matter how much people complain about them they kept getting better.
Phones sucked, pre 3G was slow, there wasn't much you could use them for before app stores and the cameras were potato quality and yet people kept using them while they improved.
Always look for the technology that sucks and yet people keep using it because it provides value. LLMs aren't great at a lot of tasks, and yet no matter how much people complain about them, they keep getting used and keep improving through constant iteration.
LLM"s amy not be able to build software today, but they are 10x better than where they were in 2022 when we first started using chatgpt. Its pretty reasonable to assume in 5 years they will be able to do these types of development tasks.
Most of the vibe shift I think I've seen in the past few months toward using LLMs in the context of coding has come from improvements in dataset curation and UX, not fundamentally better tech.
That doesn't seem unexpected. Any technological leap seems to happen in sigmoid-like steps. When a fruitful approach is discovered we run with it until diminishing returns set in. Often enough a new approach opens doors to other approaches that build on it. It takes time to discover the next step in the chain, but when we do we get a new sigmoid-like leap. Etc...
I.e. combining new approaches around old school "AI" with GenAI. That's probably not exactly what he's trying to do but maybe somewhere in the ball park.
Thing is breakthroughs are always X years away (50 for fusion power for example).
The only example he gave that actually was kind of a big deal was mobile phones, where capacitive touchscreens really did catapult the technology forward. But it is not like cellphones weren't already super useful, profitable, and getting better over time before capacitive touchscreens were introduced.
Maybe broadband to the internet also qualifies.
A (bad) analogy would be that I can pretty easily tell the difference between a cat and an ape, and the differences in capability are blatantly obvious - but the improvement when going from IQ 70 to Einstein are much harder to assess and arguably not that useful for most tasks.
I tend to find that when I switch to a new model, it doesn't seem any better, but then at some point after using it for a few weeks I'll try to use the older model again and be quite surprised at how much worse it is.
Uhhh, no?
In the past month we've had:
- LLMs (3 different models) getting gold at IMO
- gold at IoI
- beat 9/10 human developers at AtCoder heuristics (optimisation problems), with the single human who actually beat the machine saying he was exhausted and that next year it'll probably be over.
- agentic workflows that actually work, and that work for 30-90 minute sessions while staying coherent and actually finishing tasks.
- 4-6x reduction in price for top tier (SotA?) models. oAI's "best" model now costs 10$/MTok, while retaining 90+% of the performance of their previous SotA models that were 40-60$/MTok.
- several "harnesses" being released by every model provider. Claude code seems to remain the best, but alternatives are popping off everywhere - geminicli, opencoder, qwencli (forked, but still), etc.
- opensource models that are getting close to SotA, again. Being 6-12months behind (depending on who you ask), opensource and cheap to run (~2$/MTok on some providers).
I don't see the plateauing in capabilities. LLMs are plateauing only on benchmarks, where the number can only go up so far before it becomes useless. IMO regular benchmarks have become useless. MMLU & co are cute, but agentic whatever is what matters. And those capabilities have only improved. And will continue to improve, with better data, better signals, better training recipes.
Why do you think every model provider is heavily subsidising coding right now? They all want that sweet sweet data & signals, so they can improve their models.
Go open the OpenAI API playground and give GPT3 and GPT5 the same prompt to make a reasonably basic game in JavaScript to your specification and watch GPT 3 struggle and GPT 5 one-shot it.
Me, I agree with the author of the article. It's possible that the technology will eventually get there, but it doesn't seem to be there now. And I prefer to make decisions based on present-day reality instead of just assuming that the future I want is the future I'll get.
Ha;) Yes, when you provide examples to prove your point they are, by definition, selective:)
You are free to develop your own mental models of what technology and companies to invest in. I was only trying to share my 20 years of experience with investing to show why you shouldn't discard current technology because of its current limits.
Engineering decisions, which is closer to what TFA is talking about, tend to have to be a lot more focused on the here & now. You can make bets on future R&D developments (e.g, the Apollo program), but that's a game best played when you also have some control over R&D budgeting and direction (e.g, the Apollo program), and when you don't have much other choice (e.g, the Apollo program).
Specifically, to me the limitation of LLMs is discovering new knowledge and being able to reason about information they haven't seen before. LLMs still fail at things like counting the number of b's in the word blueberry or not getting distracted by inserting random cat facts in word problems (both issues I've seen appear in the last month)
I don't mean that to say they're a useless tool, I'm just not into the breathless hype.
A lot of what you described as "sucking" was not seen as sucking at the time. Nobody complained about phones being slow because nobody expected to use phones the way we do today. The internet was slow and less stable, but nobody complained about not being able to stream 4K movies because nobody expected to. This is anachronistic.
The fact that we can see how some things improved in X or Y manner does not mean that LLMs will improve the way you think they will. Maybe we invent a different technology that does a better job. After all, it was not that dial-up itself became faster, and I don't think there were fanatics saying that dial-up technology would give us 1Gbps speeds. The problem with AI is that because scaling up compute has provided breakthroughs, some think that with more compute and some technical tricks we can solve all the current problems. I don't think anybody can say that we cannot invent a technology that overcomes these limits, but whether LLMs are that technology, one that can just keep scaling, has been under doubt. The last year or so has seen a lot of refinement and broadening of applications, but nothing like a breakthrough.
Has VR really improved 10x? I lost touch after the HTC Vive and heard about Valve Index but I was under the impression that even the best that Apple has on offer is 2x at most.
This is a big rewrite of history. Phones took off because before mobile phones the only way to reach a person was to call when they were at home or their office. People were unreachable for timespans that now seem quaint. Texting brought this into async. The "potato" cameras were the advent of people always having a camera with them.
People using the Nokia 3210 were very much not anticipating when their phones would get good, they were already a killer app. That they improved was icing on the cake.
It always bugs me whenever I hear someone defend some new tech (blockchain, LLMs, NFTs) by comparing it with phones or the internet or whatever. People did not need to be convinced to use cell phones or the internet. While there were absolutely some naysayers, the utility and usefulness of these technologies was very obvious by the time they became available to consumers.
But also, there's survivorship bias at play here. There are countless promising technologies that never saw widespread adoption. And any given new technology is far more likely to end up as a failure than it is to become "the next iPhone" or "the new internet."
In short, you should sell your technology based on what it can do right now, instead of what it might do in the future. If your tech doesn't provide utility right now, then it should be developed for longer before you start charging money for it. And while there's certainly some use for LLMs, a lot of the current use cases being pushed (google "AI overviews", shitty AI art, AIs writing out emails) aren't particularly useful.
For example, it would be wrong for me to say that "hyperloop got a ton of hype and investments, and it failed. Therefore LLMs, which are also getting a ton of hype and investments, will also fail." Hyperloop and LLMs are fundamentally different technologies, and the failure of hyperloop is a poor indicator of whether LLMs will ultimately succeed.
Which isn't to say we can't make comparisons to previous successes or failures. But those comparisons shouldn't be your main argument for the viability of a new technology.
My main argument for the viability of the technology is that it's useful today. Even if it doesn't improve from here, my job as a coder has already been changed.
FWIW - 3D printing has come a long way, and I personally have a 3D printer. But the idea that it was going to completely disrupt manufacturing is simply not true. There are known limitations (how the heck are you going to get a wood polymer squeezed through a metal tip?) and those limitations are physics, not technical ones.
They haven't continued to see massive adoption and improvement despite the flaws people point out.
They had initial success at printing basic plastic pieces but have failed to print in other materials like metal as you correctly point out, so these wouldn't pass my screening as they currently sit.
We can expect them to be better in 5 years, but your last assertion doesn't follow. We can't assert with any certainty that they will be able to specifically solve the problems laid out in the article. It might just not be a thing LLMs are good at, and we'll need new breakthroughs that may or may not appear.
And NFTs had a lot of loud detractors.
And everyone complained about a million other solutions that did not go anywhere.
Still, a bunch of investors made a lot of money on VR and very much so on NFT. Investments being good is not an indicator of anything being useful.
And NFTs were always perceived as a scam, same as the breathless blockchain nonsense.
LLMs have many many issues, but I think they stick out as different to the other examples.
All these things are not black boxes and they are mostly deterministic. Based on the inputs, you EXACTLY know what to expect as output.
That's not the case with LLMs, how they are trained and how they work internally.
We certainly get a better understanding on how to adjust the inputs so we get a desired output. But that's far from guaranteed at the same level as the examples you mentioned.
That's a fundamental problem with LLMs. And you can see that in how industry actors are building solutions around that problem. Reasoning (chain-of-thought) is basically a band-aid to narrow a decision tree, because the LLM does not really "reason" about anything. And the results only get better with more training data. We literally have to brute-force useful results by throwing more compute and memory at the problem (and destroying the environment and climate by doing so).
The stagnation of recent model releases does not look good for this technology.
So consider your analogy, that the internet was always useful, but it was javascript that caused the actual titanic shift in the software industry. Even though the core internet backbone didn't radically improve as fast as you imagine it would have. Javascript was hacked together as a toy scripting language meant to make pages more interactive, but turns out, it was the key piece in unlocking that 10x value of the already existing internet.
Agents and the explosion of all these little context services are where I see the same thing happening here. Right now they are buggy, and mostly experimental toys. However, they are unlocking that 10x value.
Improvements in model performance seem to be approaching the peak rather than demonstrating exponential gains. Is the quote above where we land in the end?
I find Sonnet frequently loses the plot, but Opus can usually handle it (with sufficient clarity in prompting).
I don't want a chat window.
I want AI workflows as part of my IDE, like Visual Studio, IntelliJ, and Android Studio are finally going after.
I want voice controlled actions on my native language.
Knowledge across everything on the project for doing code refactorings, static analysis with AI feedback loop, generating UI based out of handwritten sketches, programming on the go using handwriting, source control commit messages out of code changes,...
One thing I've found that helps is using "Red-Green-Refactor" language. We're in RED phase: the test should fail. We're in GREEN phase: make this test pass with minimal code. We're in REFACTOR phase: improve the code without breaking tests.
This helps the LLM understand the TDD mental model rather than just seeing "broken code" that needs fixing.
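A minimal illustration of what that framing might look like in a prompt (the wording and file names are purely illustrative):

    Phase: RED. Write one failing test for empty-input handling in tests/test_parser.py.
    Do not modify src/.

    Phase: GREEN. Make test_empty_input pass with the smallest change to src/parser.py
    you can. Do not touch the tests.

    Phase: REFACTOR. Clean up naming and duplication in src/parser.py. All tests must
    stay green.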
Perhaps good for someone just getting their feet wet with these computational objects, but not resolving or explaining things in a clear way, or highlighting trends in research and engineering that might point towards ways forward.
You also have a technical writing no no where you cite a rather precise and specific study with a paraphrase to support your claims … analogous to saying “Godel’s incompleteness theorem means _something something_ about the nature of consciousness”.
A phrase like: “Unfortunately, for now, they cannot (beyond a certain complexity) actually understand what is going on” referencing a precise study … is ambiguous and shoddy technical writing — what exactly does the author mean here? It’s vague.
I think it is even worse here because _the original study_ provides task-specific notions of complexity (a critique of the original study! Won’t different representations lead to different complexity scaling behavior? Of course! That’s what software engineering is all about: I need to think at different levels to control my exposure to complexity)
I wonder is this not just a proxy for intelligence?
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over. This is exactly the opposite of what I am looking for. Software engineers test their work as they go. When tests fail, they can check in with their mental model to decide whether to fix the code or the tests, or just to gather more data before making a decision. When they get frustrated, they can reach for help by talking things through. And although sometimes they do delete it all and start over, they do so with a clearer understanding of the problem.
My experience is based on using Cline with Anthropic Sonnet 3.7 doing TDD on Rails, and it is very different. I instruct the model to write tests before any code and it does. It works in small enough chunks that I can review each one. When tests fail, it tends to reason very well about why and fixes the appropriate place. It is very common for the LLM to consult more code as it goes to learn more.
It's certainly not perfect but it works about as well, if not better, than a human junior engineer. Sometimes it can't solve a bug, but human junior engineers get in the same situation too.
OTOH i tried building a native Windows Application using Direct2D in Rust and it was a disaster.
I wish people could be a bit more open about what they build.
I would say for the last 6 months, 95% of the code for my chat app (https://github.com/gitsense/chat) was AI generated (98% human architected). I believe what I created in the last 6 months was far from trivial. One of the features that AI helped a lot with, was the AI Search Assistant feature. You can learn more about it here https://github.com/gitsense/chat/blob/main/packages/chat/wid...
As a debugging partner, LLMs are invaluable. I could easily load all the backend search code into context and have it trace a query and create a context bundle with just the affected files. Once I had that, I would use my tool to filter the context to just those files and then chat with the LLM to figure out what went wrong or why the search was slow.
I very much agree with the author of the blog post about why LLMs can't really build software. AI is an industry game changer as it can truly 3x to 4x senior developers in my opinion. I should also note that I spend about $2 a day on LLM API calls (99% to Gemini 2.5 Flash) and I probably have to read 200+ LLM generated messages a day and reply back in great detail about 5 times a day (think of an email instead of chat message).
Note: The demo I have in the README hasn't been set up, as I am still in the process of finalizing things for release, but the NPM install instructions should work.
It does not work so well for problems it has not seen before. At that point you need to explain the problem and instruct the solution. So at that point, you're just acting as a mentor instead of using your capacity to just implement the solution yourself.
My whole team has really bought into the "claude-code" way of doing side tasks that have been on the backlog for years, think like simple refactors, or secondary analytic systems. Basically any well-trodden path that is mostly constrained by time that none of us are given, are perfect for these agents right now.
Personally I'm enjoying the ability to highlight a section of code and ask the LLM to explain it to me like I'm 5, or look for any potential race conditions. For those archaic, fragile monolithic blocks of code that stick around long after the original engineers have left, it's magical to use the LLM to wrap my head around them.
I haven't found it can write these things any better though, and that is the key here. It's not very good at creating new things that aren't commonly seen. It also has a code style that is quite different than what already exists. So when it does inject code, often times it has to be rewritten to fit the style around it. Already, I'm hearing whispers of people say things like "code written for the AI to read." That's where my eyes roll because the payoff for the extra mental bandwidth doesn't seem worth it right now.
I haven't tried it with Rails myself (haven't touched Ruby in years, to be honest), but it doesn't surprise me that it would work well there. Ruby on Rails programming culture is remarkably consistent about how to do things. I would guess that means that the LLM is able to derive a somewhat (for lack of a better word) saner model from its training data.
By contrast, what it does with Python can get pretty messy pretty quickly. One of the biggest problems I've had with it is that it tends to use a random hodgepodge of different Python coding idioms. That makes TDD particularly challenging because you'll get tests that are well designed for code that's engineered to follow one pattern of changes, written against a SUT that follows conventions that lead to a completely different pattern of changes. The result is horribly brittle tests that repeatedly break for spurious reasons.
And then iterating on it gets pretty wild, too. My favorite behavior is when the real defect is "oops I forgot to sort the results of the query" and the suggested solution is "rip out SqlAlchemy and replace it with Django."
R code is even worse; even getting it to produce code that follows a spec in the first place can be a challenge.
It’s still early days, but we are learning that as with software written exclusively by humans, the more specific the specifications are, the more likely the result will be as you intended.
Vibing I often let it explain the implemented business logic (instead of reading the code directly) and judge that.
However, I agree with the main thesis (that they can't do it on their own). Also related to this: the whole idea that "we will easily fix memory next" will turn out like "we can fix vision in one summer" did. It's 30 years later, much improved, but still not fixed. Memory is hard.
The first project is a C++ embedded device. The second is a sophisticated Django-based UI front end for a hardware device (so, python interacting with hardware and various JS libraries handling most of the front end).
So far I am deeper into the Django project than the C++ embedded project.
It's interesting.
I had already hand-coded a conceptual version of the UI just to play with UI and interaction ideas. I handed this to Cursor as well as a very detailed specification for the entire project, including directory structure, libraries, where to use what and why, etc. In other words, exactly what I would provide a contractor or company if I were to outsource this project. I also told it to take a first stab at the front end based on the hand-coded version I plopped into a temporary project directory.
And then I channeled Jean-Luc Picard and said "Engage!".
The first iteration took a few minutes. It was surprisingly functional and complete. Yet, of course, it had problems. For example, it failed to separate various screens into separate independent Django apps. It failed to separate the one big beautiful CSS and JS files into independent app-specific CSS and JS files. In general, it ignored separation of concerns and just made it all work. This is the kind of thing you might expect from a junior programmer/fresh grad.
Achieving separation of concerns and eliminating other undesirable cross-pollination of code took some effort. LLMs don't really understand. They simulate understanding very well but, at the end of the day, I don't think we are there. They tend to get stuck and make dumb mistakes.
The process to get to something that is now close to a release candidate entailed an interesting combination of manual editing and "molding" of the code base with short, precise and scope-limited instructions for Cursor. For my workflow I am finding that limiting what I ask AI to do delivers better results. Go too wide and it can be in a range between unpredictable and frustrating.
Speaking of frustrations, one of the most mind-numbing things it does every so often is also in a range, between completely destroying prior work or selectively eliminating or modifying functionality that used to work. This is why limiting the scope, for me, has been a much better path. If I tell it to do something in app A, there's a reasonable probability that it isn't going to mess with and damage the work done in app B.
This issue means that testing becomes far more important in this workflow, because, on every iteration, you have no idea what functionality may have been altered or damaged. It will also go nuts and do things you never asked it to do. For example, I was in the process of redoing the UI for one of the apps. For some reason it decided it was a good idea to change the UI for one of the other apps, remove all controls, and replace them with controls it thought were appropriate or relevant (which wasn't even remotely the case). And, no, I did not ask it to touch anything other than the app we were working on.
Note: For those not familiar with Django, think of an app as a page with mostly self-contained functionality. Apps (pages) can share data with each other through various means, but, for the most part, the idea is that they are designed as independent units that can be plucked out of a project and plugged into another (in theory).
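For readers picturing the structure, a rough sketch of the separation being described (app names invented), following Django's usual per-app convention:

    project/
        settings.py, urls.py              # project-level wiring only
        dashboard/                        # one self-contained app
            views.py, urls.py, models.py
            templates/dashboard/
            static/dashboard/dashboard.css, dashboard.js
        calibration/                      # another app with its own templates and static files
            views.py, urls.py, models.py
            templates/calibration/
            static/calibration/calibration.css, calibration.js

The failure mode described above is everything landing in one app with a single global CSS/JS bundle: it works, but it resists later extraction.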
The other thing I've been doing is using ChatGPT and Cursor simultaneously. While Cursor is working I work with ChatGPT on the browser to plan the next steps, evaluate options (libraries, implementation, etc.) and even create quick stand-alone single file HTML tests I can run without having to plug into the Django project to test ideas. I like this very much. It works well for me. It allows me to explore ideas and options in the context of an OpenAI project and test things without the potential to confuse Cursor. I have been trying to limit Cursor to being a programmer, rather than having long exploratory conversations.
Based on this experience, one thing is very clear to me: If you don't know what you are doing, you are screwed. While the OpenAI demo where they have v5 develop a French language teaching app is cool and great, I cannot see people who don't know how to code producing anything that would be safe to bet the farm on. The code can be great and it can also be horrific. It can be well designed and it can be something that would cause you to fail your final exams in a software engineering course. There's great variability and you have to get your hands in there, understand and edit code by hand as part of the process.
Overall, I do like what I am seeing. Anyone who has done non-trivial projects in Django knows that there's a lot of busy boilerplate typing that is just a pain in the ass. With Cursor, that evaporates and you can focus on where the real value lies: The problem you are trying to solve.
I jump into the embedded C++ project next week. I've already done some of it, but I'm in that mental space 100% next week. Looking forward to new discoveries.
The other reality is simple: This is the worst this will ever be. And it is already pretty good.
The more I use claude code, the more frustrated I get with this aspect. I'm not sure that a generic text-based LLM can properly solve this.
You can let it do the grunt coding, and a lot of the low level analysis and testing, but you absolutely need to be the one in charge on the design.
It frankly gives me more time to think about the bigger picture within the amount of time I have to work on a task, and I like that side of things.
There's definitely room for a massive amount of improvement in how the tool presents changes and suggestions to the user. It needs to be far more interactive.
My experience with prompting LLMs for codegen is really not much different from my experience with querying search engines - you have to understand how to ‘speak the language’ of the corpus being searched, in order to find the results you’re looking for.
I keep saying it and no one really listens: AI really is advanced autocomplete. It's not reasoning or thinking. You will use the tool better if you understand what it can't do. It can write individual functions pretty well, stringing a bunch of them together? not so much.
It's a good tool when you use it within its limitations.
My gut feeling is that this problem won't be solved until some new architecture is invented, on the scale of the transformer, which allows for short-term context, long-term context, and self-modulation of model weights (to mimic "learning"). (Disclaimer: hobbyist with no formal training in machine learning.)
[0]: https://news.ycombinator.com/item?id=44798166
LLM techniques allow us to extract rules from text and other data. But those data are not representative of a coherent system. The result itself is incoherent and lacks anything that wasn't part of the data. And that's normal.
It’s the same as having a mathematical function. Every point that it maps to is meaningful; everything else may as well not exist.
That and other tricks have only made me slightly less frustrated, though.
> LLMs get endlessly confused: they assume the code they wrote actually works; when test fail, they are left guessing as to whether to fix the code or the tests; and when it gets frustrating, they just delete the whole lot and start over.
I see this constantly with mediocre developers. Flailing, trying different things, copy-pasting from StackOverflow without understanding, ultimately deciding the compiler must have a bug, or cosmic rays are flipping bits.
I feel this way because at my company our interns on a gap year from their comp sci degree don't blame the compiler or cosmic-ray bit flips, or blindly copy from Stack Overflow.
They're incentivized and encouraged to learn and absolutely choose to do so. The same goes for seniors.
If you say 'I've been learning about X for ticket Y' in the standup people basically applaud it, managers like us training ourselves to be better.
Sure managers may want to see a brief summary or a write-up applicable to our department if you aren't putting code down for a few days, but that's the only friction.