I think it suffers from performance anxiety...
----
The only solution I have found is to - rewrite the prompt from scratch, change the context myself, and then clear any "history or memories" and then try again.
I have even gone so far as to open nested folders in separate windows to "lock in" scope better.
As soon as I see the agent say "Wait, that doesnt make sense, let me review the code again" its cooked
It’s REAL FUCKING TEMPTING to say ”hey Claude, go do this thing that would take me hours and you seconds” because he will happily, and it’ll kinda work. But one way or another you are going to put those hours in.
It’s like programming… is proof of work.
Vibe coding though, super deceptive!
All LLMs degrade in quality as soon as you go beyond one user message and one assistant response. If you're looking for accuracy and highest possible quality, you need to constantly redo the conversations from scratch, never go beyond one user message.
If the LLM gets it wrong in their first response, instead of saying "No, what I meant was...", you need to edit your first response, and re-generate, otherwise the conversation becomes "poisoned" almost immediately, and every token generated after that will suffer.
I'm not sure there's much to learn here, besides it's kinda fun, since no real human was forced to suffer through this exercise on the implementor side.
Which describes a lot of outsourced development. And we all know how well that works
It's not hard, just different.
Yes.
How useful is the comparison with the worst human results? Which are often due to process rather than the people involved.
You can improve processes and teach the humans. The junior will become a senior, in time. If the processes and the company are bad, what's the point of using such a context to compare human and AI outputs? The context is too random and unpredictable. Even if you find out AI or some humans are better in such a bad context, what of it? The priority would be to improve the process first for best gains.
I don't mean the code producers, I mean the enterprise itself is not intelligent yet it (the enterprise) is described as developing the software. And it behaves exactly like this, right down to deeply enjoying inflicting bad development/software metrics (aka BD/SM) on itself, inevitably resulting in:
https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...
telling it to do better without any feedback obviously is going to go nowhere fast.
For example, in a functional-style codebase, they will try to rewrite everything to a class. I have to adjust the prompt to list things that I'm not interested in. And some inexperienced people are trying to write better code by learning from such changes of LLMs...
Another problem is that they try to add handling of different cases that are never present in my data. I have to mention that there is no need to update handling to be more generalized. For example, my code handles PNG files, and they add JPG handling that never happens.
Your point stands uncontested by me, but I just wanted to mention that humans have that bias too.
Random link (has the Nature study link): https://blog.benchsci.com/this-newly-proven-human-bias-cause...
"Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.
"Hey claude, anything I could do to improve Y?", and it'll struggle beyond the basics that a linter might suggest.
It suggested enthusiastically a library for <work domain> and it was all "Recommended" about it, but when I pointed out that the library had been considered and rejected because <issue>, it understood and wrote up why that library suffered from that issue and why it was therefore unsuitable.
There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.
That may well change, so I don't want to embed that thought too deeply into my own priors, because the LLM space seems to evolve rapidly. I wouldn't want to find myself blind to the progress because I write it off from a class of problems.
But right now, the best way to help an LLM is have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.
Neither of those are things I follow, and either way design is better informed by the specific problems that need to be solved rather than by such general, prescriptive principles.
I follow WET principles (write everything twice at least) because the abstraction penalty is huge, both in terms of performance and design, a bad abstraction causes all subsequent content to be made much slower. Which I can't afford as a small developer.
Same with most other "clean code" principles. My codebase is ~70K LoC right now, and I can keep most of it in my head. I used to try to make more functional, more isolated and encapsulated code, but it was hard to work with and most importantly, hard to modify. I replaced most of it with global variables, shit works so much better.
I do use partial classes pretty heavily though - helps LLMs not go batshit insane from context overload whenever they try to read "the entire file".
Models sometimes try to institute these clean code practices but it almost always just makes things worse.
I think, if you're writing code where you know the entire code base, a lot of the clean principles seem less important, but once you get someone who doesn't, and that can be you coming back to the project in three months, suddenly they have value.
When I was younger, writing Python rather than Rust, I used to go out of my way to make everything DRY, DRY, DRY everywhere from the outset. Class-based views in Django come to mind.
Today, I just write code, and after it's working I go back and clean things up where applicable. Not because I'm "following a principle", but because it's what makes sense in that specific instance.
what the person you replied to had claude do is relatively simple and structured, but to that person what claude did is "automagic".
People already vastly overestimate AI's capabilities. This contributes to that.
Very easy to write it off when it spins out on the open-ended problems, without seeing just how effective it can be once you zoom in.
Of course, zooming in that far gives back some of the promised gains.
Edit: typo
The love/hate flame war continues because the LLM companies aren't selling you on this. The hype is all about "this tech will enable non-experts to do things they couldn't do before" not "this tech will help already existing experts with their specific niche," hence the disconnect between the sales hype and reality.
If OpenAI, Anthropic, Google, etc. were all honest and tempered their own hype and misleading marketing, I doubt there would even be a flame war. The marketing hype is "this will replace employees" without the required fine print of "this tool still needs to be operated by an expert in the field and not your average non technical manager."
As we speak, my macOS menubar has an iStat Menus replacement, a Wispr Flow replacement (global hotkey for speech-to-text), and a logs visualizer for the `blocky` dns filtering program -- all of which I built without reading code aside from where I was curious.
It was so vibe-coded that there was no reason to use SwiftUI nor set them up in Xcode -- just AppKit Swift files compiled into macOS apps when I nix rebuild.
The only effort it required was the energy to QA the LLM's progress and tell it where to improve, maybe click and drag a screenshot into claude code chat if I'm feeling excessive.
Where do my 20 years of software dev experience fit into this except beyond imparting my aesthetic preferences?
In fact, insisting that you write code yourself is becoming a liability in an interesting way: you're going to make trade-offs for DX that the LLM doesn't have to make, like when you use Python or Electron when the LLM can bypass those abstractions that only exist for human brains.
Isn't that the point they are making?
Vibe-coding is a claude code <-> QA loop on the end result that anyone can do (the non-experts in his claim).
An example of a cycle looks like "now add an Options tab that let's me customize the global hotkey" where I'm only an end-user.
Once again, where do my 20 years of software experience come up in a process where I don't even read code?
I would hazard a guess that your knowledge lead to better prompts, better approach... heck even understanding how to build a status bar menu on Mac OS is slightly expert knowledge.
You are illustrating the GP's point, not negating it.
You're imagining that I'm giving Claude technical advice, but that is the point I'm trying to make: I am not.
This is what "vibe-coding" tries to specify.
I am only giving Claude UX feedback from using the app it makes. "Add a dropdown that lets me change the girth".
Now, I do have a natural taste for UX as a software user, and through that I can drive Claude to make a pretty good app. But my software engineering skills are not utilized... except for that one time I told Claude to use an AGDT because I fancy them.
Your 20 years is assisting you in ways you don't know; you're so experienced you don't know what it means to be inexperienced anymore. Now, it's true you probably don't need 20 years to do what you did, but you need some experience. Its not that the task you posed to the LLM is trivial for everyone due to the LLM, its that its trivial for you because you have 20 years experience. For people with experience, the LLM makes moderate tasks trivial, hard tasks moderate, and impossible tasks technically doable.
For example, my MS students can vibe code a UI, but they can't vibe code a complete bytecode compiler. They can use AI to assist them, but it's not a trivial task at all, they will have to spend a lot of time on it, and if they don't have the background knowledge they will end up mired.
Your mom wouldn't vibe-code software that she wants not because she's not a software engineer, but because she doesn't engage with software as a user at the level where she cares to do that.
Consider these two vibe-coded examples of waybar apps in r/omarchy where the OP admits he has zero software experience:
- Weather app: https://www.reddit.com/r/waybar/comments/1p6rv12/an_update_t...
- Activity monitor app: https://www.reddit.com/r/omarchy/comments/1p3hpfq/another_on...
That is a direct refutation of OP's claim. LLM enabled a non-expert to build something they couldn't before.
Unless you too think there exists a necessary expertise in coming up with these prompts:
- "I want a menubar app that shows me the current weather"
- "Now make it show weather in my current location"
- "Color the temperatures based on hot vs cold"
- "It's broken please find out why"
Is "menubar" too much expertise for you? I just asked claude "what is that bar at the top of my screen with all the icons" and it told me that it's macOS' menubar.
"Where do my 20 years of software dev experience fit into this except beyond imparting my aesthetic preferences?"
Anyway, I think you kind of unintentionally proved my point. These two examples are pretty trivial as far as software goes, and it enabled someone with a little technical experience to implement them where before they couldn't have.
They work well because:
a) the full implementation for these apps don't even fill up the AI context window. It's easy to keep the LLM on task.
b) it's a tutorial style-app that people often write as "babby's first UI widget", so there are thousands of examples of exactly this kind of thing online; therefore the LLM has little trouble summoning the correct code in its entirety.
But still, someone with zero technical experience is going to be immediately thwarted by the prompts you provided.
Take the first one "I want a menubar app that shows me the current weather".
https://chatgpt.com/share/693b20ac-dcec-8001-8ca8-50c612b074...
ChatGPT response: "Nice — here's a ready-to-run macOS menubar app you can drop into Xcode..."
She's already out of her depth by word 11. You expect your mom to use Xcode? Mine certainly can't. Even I have trouble with Xcode and I use it for work. Almost every single word in that response would need to be explained to her, it might as well be a foreign language.
Now, the LLM could help explain it to her, and that's what's great about them. But by the time she knows enough to actually find the original response actionable, she would have gained... knowledge and experience enough to operate it just to the level of writing that particular weather app. Though having done that, it's still unreasonable to now believe she could then use the LLM to write a bytecode compiler, because other people who have a Ph.D. in CS can. The LLM doesn't level the playing field, it's still lopsided toward the Ph.D.s / senior devs with 20 years exp.
Which is a prompt that someone with experience would write. Your average, non-technical person isn't going to prompt something like that, they are going to say "make it so I can change the settings" or something else super vague and struggle. We all know how difficult it is to define software requirements.
Just because an LLM wrote the actual code doesn't mean your prompts weren't more effective because of your experience and expertise in building software.
Sit someone down in front of an LLM with zero development or UI experience at all and they will get very different results. Chances are they won't even specify "macOS menu bar app" in the prompt and the LLM will end up trying to make them a webapp.
Your vibe coding experience just proves my initial point, that these tools are useful for those who already have experience and can lean on that to craft effective prompts. Someone non-technical isn't going to make effective use of an LLM to make software.
The LLM prompt space is an ND space where you can start at any point, and then the LLM carves a path through the space for so many tokens using the instructions you provided, until it stops and asks for another direction. This frames LLM prompt coding as a sort of navigation task.
The problem is difficult because at every decision point, there's an infinite number of things you could say that could lead to better or worse results in the future.
Think of a robot going down the sidewalk. It controls itself autonomously, but it stops at every intersection and asks "where to next boss?" You can tell it either to cross the street, or drive directly into traffic, or do any number of other things that could cause it to get closer to its destination, further away, or even to obliterate itself.
In the concrete world, it's easy to direct this robot, and to direct it such that it avoids bad outcomes, and to see that it's achieving good outcomes -- it's physically getting closer to the destination.
But when prompting in an abstract sense, its hard to see where the robot is going unless you're an expert in that abstract field. As an expert, you know the right way to go is across the street. As a novice, you might tell the LLM to just drive into traffic, and it will happily oblige.
The other problem is feedback. When you direct the physical robot to drive into traffic, you witness its demise, its fate is catastrophic, and if you didn't realize it before, you'd see the danger then. The robot also becomes incapacitated, and it can't report falsely about its continued progress.
But in the abstract case, the LLM isn't obliterated, it continues to report on progress that isn't real, and as a non expert, you can't tell its been flattened into a pancake. The whole output chain is now completely and thoroughly off the rails, but you can't see the smoldering ruins of your navigation instructions because it's told you "Exactly, you're absolutely right!"
Your original claim:
> The hype is all about "this tech will enable non-experts to do things they couldn't do before"
Are you saying that a prompt like "make a macOS weather app for me" and "make an options menu that lets me set my location" are only something an expert can do?
I need to know what you think their expertise is in.
Else Visual Basic and Dreamweaver would have killed software engineering in the 90s.
Also, I didn't make them. A clanker did. I can see this topic brings out the claws. Honestly I used to have the same reaction, and in a large way I still hate it.
I'm not sure you're interacting with single claim I've made so far.
claude2() {
claude "$(claude "Generate a prompt and TODO list that works towards this goal: <goal>$*</goal>" -p)"
}
$ claude2 pls give ranked ideas for make code betterOne under-discussed lever that senior / principal engineers can pull is the ability to write linters & analyzers that will stop junior engineers ( or LLMs ) from doing something stupid that's specific to your domain.
Let's say you don't want people to make async calls while owning a particular global resource, it only takes a few minutes to write an analyzer that will prevent anyone from doing so.
Avoid hours of back-and-forth over code review by encoding your preferences and taste into your build pipeline and stop it at source.
I am phenomenally productive this way, I am happier at my job, and its quality of work is extremely high as long as I occasionally have it stop and self-review it's progress against the style principles articulated in its AGENTS.md file. (As it tends to forget a lot of rules like DRY)
Some tasks I do enjoy coding. Once in the flow it can be quite relaxing.
But mostly I enjoy the problem solving part: coming up with the right algorithm, a nice architecture , the proper set of metrics to analyze etc
There is enough work for all of us to be handsomely paid while having fun doing it :) Just find what you like, and work with others who like other stuff, and you'll get through even the worst of problems.
For me the fun comes not from the action of typing stuff with my sausage fingers and seeing characters end up on the screen, but basically everything before that and after that. So if I can make "translate what's in my head into source on disk something can run" faster, that's a win in my book, but not if the quality degrades too much, so tight control over it still not having to use my fingers to actually type.
Having said that I used to be deep into coding and back then I am quite sure that I would hate AI coding for me. I think for me it comes down to – when I was learning about coding and stretching my personal knowledge in the area, the coding part was the fun part because I was learning. Now that I am past that part I really just want to solve problems, and coding is the means to that end. AI is now freeing because where I would have been reluctant to start a project, I am more likely to give it a go.
I think it is similar to when I used to play games a lot. When I would play a game where you would discover new items regularly, I would go at it hard and heavy up until the point where I determined there was either no new items to be found or it was just "more of the same". When I got to that point it was like a switch would flip and I would lose interest in the game almost immediately.
Most are not paid for results, they're paid for time at desk and regular responsibilities such as making commits, delivering status updates, code reviews, etc. - the daily activities of work are monitored more closely than the output. Most ESOP grant such little equity that working harder could never observably drive an increase in its value. Getting a project done faster just means another project to begin sooner.
Naturally workers will begin to prefer the motions of the work they find satisfying more than the result it has for the business's bottom line, from which they're alienated.
Wow. I've read a lot of hacker news this past decade, but I've never seen this articulated so well before. You really lifted the veil for me here. I see this everywhere, people thinking the work is the point, but I haven't been able to crystallize my thoughts about it like you did just now.
Im on the side of only enjoy coding to solve problems and i skipped software engineering and coding for work explicitly because i did not want to participate in that dynamic of being removed from the problems. instead i went into business analytics, and now that AI is gaining traction I am able to do more of what I love - improving processes and automation - without ever really needing to "pay dues" doing grunt work I never cared to be skilled at in the first place unless it was necessary.
Sometimes you can, sometimes you have to break the problem apart and get the LLM to do each bit separately, sometimes the LLM goes funny and you need to solve it yourself.
Customers don't want you wasting money doing by hand what can be automated, nor do they want you ripping them off by blindly handing over unchecked LLM output when it can't be automated.
ultimately i wonder how long people will need devs at all if you can all prompt your wishes
some will be kept to fix the occasional hallucination and that's it
getting things solved entirely feels very very numbing to me
even when gemini or chatgpt solves it well, and even beyond what i'd imagine.. i feel a sense of loss
It's typically been productive to care about the how, because it leads to better maintainability and a better ability to adapt or pivot to new problems. I suppose that's getting less true by the minute, though.
Sometimes, you strike gold, so there's that.
> You've really hit the crux of the problem and why so many people have differing opinions about AI coding.
Part of it perhaps, but there's also a huge variation in model output. I've been getting some surprisingly bad generations from ChatGPT recently, though I'm not sure if that's ChatGPT getting worse or me getting used to a much higher quality of code from Claude Code which seems to test itself before saying "done". I have no idea if my opinion will flip again now 5.2 is out.
And some people are bad communicators, an important skill for LLMs, though few will recognise it because everyone knows what they themselves meant by whatever words they use.
And some people are bad planners, likewise an important skill for breaking apart big tasks that LLMs can't do into small ones they can do.
Many engineers walk a path where they start out very focussed on programming details, language choice, and elegant or clever solutions. But if you're in the game long enough, and especially if you're working in medium-to-large engineering orgs on big customer-facing projects, you usually kind of move on from it. Early in my career I learned half a dozen programming languages and prided myself on various arcane arts like metaprogramming tricks. But after a while you learn that one person's clever solution is another person's maintainability nightmare, and maybe being as boring and predictable and direct as possible in the code (if slightly more verbose) would have been better. I've maintained some systems written by very brilliant programmers who were just being too clever by half.
You also come to realize that coding skills and language choice don't matter as much as you thought, and the big issues in engineering are 1) are you solving the right problem to begin with 2) people/communication/team dynamics 3) systems architecture, in that order of importance.
And also, programming just gets a little repetitive after a while. Like you say, after a decade or so, it feels a bit like "more of the same." That goes especially for most of the programming most of us are doing most of the time in our day jobs. We don't write a lot of fancy algorithms, maybe once in a blue moon and even then you're usually better off with a library. We do CRUD apps and cookie-cutter React pages and so on and so on.
If AI coding agents fall into your lap once you've reached that particular variation of a mature stage in your engineering career, you probably welcome them as a huge time saver and a means to solve problems you care about faster. After a decade, I still love engineering, but there aren't may coding tasks I particularly relish diving into. I can usually vaguely picture the shape of the solution in my head out the gate, and actually sitting down and doing it feels rather a bore and just a lot of typing and details. Which is why it's so nice when I can send Claude to do it instead, and review the results to see if they match what I had in mind.
Don't get me wrong. I still love programming if there's just the right kind of compelling puzzle to solve (rarer and rarer these days), and I still pride myself on being able to do it well. Come the holidays I will be working through Advent of Code with no AI assistance whatsoever, just me and vim. But when January roles around and the day job returns I'll be having Claude do all the heavy lifting once again.
Claude writing code gets the same output if not better in about 1/10 of the time.
That's where you realize that the writing code bits are just one small part of the overall picture. One that I realize I could do without.
Same for sql, do you really context switch between sql and other code that frequently?
Everyone should stop using bash, especially if you have a scripting language you can use already.
For example, I often find Python has very mature and comprehensive packages for a specific need I have, but it is a poor language for the larger project (I also just hate writing Python). So I'll often put the component behind a http server and communicate that way. Or in other cases I've used Rust for working with WASAPI and win32 which has some good crates for it, but the ecosystem is a lot less mature elsewhere.
I used to prefer reinventing the wheel in the primary project language, but I wasted so much time doing that. The tradeoff is the project structure gets a lot more complicated, but it's also a lot faster to iterate.
Plus your usual html/css/js on the frontend and something else on the backend, plus SQL.
I absolutely can attest to what parent is saying, I have been developing software in Python for nearly a decade now and I still routinely look up the /basics/.
LLM's have been a complete gamechanger to me, being able to reduce the friction of "ok let me google what I need in a very roundabout way my memory spit it out" to a fast and often inline llm lookup.
I said notetaking, but it's more about building your own index. In $WORK projects, I mostly use the browser bookmarks, the ticket system, the PR description and commits to contextually note things. In personal projects, I have an org-mode file (or a basic text file) and a lot of TODO comments.
Or I can farm that stuff to an LLM, stay in my flow, and iterate at a speed that feels good.
I have over a decade of experience, I do this stuff daily, I don't think I can write a 10 line bash/python/js script without looking up the docs at least a couple times.
I understand exactly what I need to write, but exact form eludes my brain, so this Levenshtein-distance-on-drugs machine that can parse my rambling + surrounding context into valid syntax for what I need right at that time is invaluable and I would even go as far as saying life changing.
I understand and hold high level concepts alright, I know where stuff is in my codebase, I understand how it all works down to very low levels, but the minutea of development is very hard due to how my memory works (and has always worked).
But figuring out what is the correct way in this particular language is the issue.
Now I can get the assistant to do it, look at it and go "yep, that's how you iterate over an array of strings".
You may get possibilities, but not for what you asked for.
I love to prototype various approaches. Sometimes I just want to see which one feels like the most natural fit. The LLM can do this in a tenth of the time I can, and I just need to get a general idea of how each approach would feel in practice.
This sentence alone is a huge red flag in my books. Either you know the problem domain and can argue about which solution is better and why. Or you don't and what you're doing are experiment to learn the domain.
There's a reason the field is called Software Engineering and not Software Art. Words like "feels" does not belongs. It would be like saying which bridge design feels like the most natural fit for the load. Or which material feels like the most natural fit for a break system.
Software development is nowhere near advanced enough for this to be true. Even basic questions like "should this project be built in Go, Python, or Rust?" or "should this project be modeled using OOP and domain-driven design, event-sourcing, or purely functional programming?" are decided largely by the personal preferences of whoever the first developer is.
I really don't think this is true. What was the demonstrated impact of writing Terraform in Go rather than Rust? Would writing Terraform in Rust have resulted in a better product? Would rewriting it now result in a better product? Even among engineers with 15 years experience you're going to get differing answers on this.
That's tautologically true, yes, but your claim was
> Either you know the problem domain and can argue about which solution is better and why. Or you don't and what you're doing are experiment to learn the domain.
So, assuming the domain of infrastructure-at-code is mostly known now which is a fair statement -- which is a better choice, Go or Rust, and why? Remember, this is objective fact, not art, so no personal preferences are allowed.
A solution may be Terraform, another is Ansible,… To implement that solution, you need a programming language, but by then you’re solving accidental complexity, not the essential one attached to the domain. You may be solving, implementation speed, hiring costs, code safety,… but you’re not solving IaC.
Maybe I'm lucky, but I've never encountered this situation. It has been mostly about what tradeoffs I'm willing to make. Libraries are more line of codes added to the project, thus they are liabilities. Including one is always a bad decision, so I only do so because the alternative is worse. Having to choose between two is more like between Scylla and Charybdis (known tradeoffs) than deciding to go left or right in a maze (mystery outcome).
Generally, you are correct that having multiple libraries to choose among is concerning, but it really depends. Mostly it's stylistic choices and it can be hard to tell how it integrates before trying.
In my experience most LLMs are going to answer this with some form of "Absolutely!" and then propose a square-peg-into-a-round-hole way to do it that is likely suboptimal vs using a different library that is far more suited to your problem if you didn't guess the right fit library to begin with.
The sycophancy problem is still very real even when the topic is entirely technical.
Gemini is (in my experience) the least likely to lead you astray in these situations but its still a significant problem even there.
if you ask a human this the answer can also often be "yes [if we torture the library]", because software development is magic and magic is the realm of imagination.
much better prompt: "is this library designed to solve this problem" or "how can we solve this problem? i am considering using this library to do so, is that realistic?"
Please don't say you commit AI-generated stuff without checking it first?
Disagree. Claude makes the same garbage worthless comments as a Freshman CS student. Things like:
// Frobbing the bazz
res = util.frob(bazz);
Or
// If bif is True here then blorg
if (bif){ blorg; }
Like wow, so insightful
And it will ceaselessly try to auto complete your comments with utter nonsense that is mostly grammatically correct.
The most success I have had is using claude to help with Spring Boot annotations and config processing (Because documentation is just not direct enough IMO) and to rubber duck debug with, where claude just barely edges out the rubber duck.
I don't want LLMs, AI, and eventually Robots to take over the fun stuff. I want them to do the mundane, physical tasks like laundry and dishes, leave me to the fun creative stuff.
But as we progress right now, the hype machine is pushing AI to take over art, photography, video, coding, etc. All the stuff I would rather be doing. Where's my house cleaning robot?
Of course this is a bit too black&white. There can still be a creative human being introducing nuance and differences, trying to get the automated tools to do things different in the details or some aspects. Question is, losing all those creative jobs (in absolute numbers of people doing them), what will we as society, or we as humanity become? What's the ETA on UBI, so that we can reap the benefits of what we automated away, instead of filling the pockets of a few?
On the other hand, if e.g. I need a web interface to do something, the only way I can enjoy myself is by designing my own web framework, which is pretty time-consuming, and then I still need to figure out how to make collapsible sections in CSS and blerghhh. Claude can do that in a few seconds. It's a delightful moment of "oh, thank god, I don't have to do this crap anymore."
There are many coding tasks that are just tedium, including 99% of frontend development and over half of backend development. I think it's fine to throw that stuff to AI. It still leaves a lot of fun on the table.
Some famous sculptors had an atelier full of students that helped them with mundane tasks, like carving out a basic shape from a block of stone.
When the basic shape was done, the master came and did the rest. You may want to have the physical exercise of doing the work yourself, but maybe someone sometimes likes to do the fine work and leave the crude one to the AI.
I don't :) Before I had IDE templates and Intellisense. Now I can just get any agentic AI to do it for me in 60 seconds and I can get to the actual work.
I am so tired of this analogy. Have the people who say this never worked with a junior dev before? If you treat your junior devs as brainless code monkeys who only exist to type out your brilliant senior developer designs and architectures instead of, you know, human beings capable of solving problems, 1) you're wasting your time, because a less experienced dev is still capable of solving problems independently, 2) the juniors working under you will hate it because they get no autonomy, and 3) the juniors working under you will stay junior because they have no opportunity to learn--which means you've failed at one of your most important tasks as a senior developer, which is mentorship.
When I was a junior, that's how it was for me. The senior gave me something that was structured and architected and asked me to handle smaller tasks that were beneath them.
Giving juniors full autonomy is a great way to end up with an unmaintainable mess that is a nightmare to work with without substancial refactoring. I know this because I have made a career out of fixing exactly this mistake.
> Giving juniors full autonomy is a great way to end up with an unmaintainable mess that is a nightmare to work with without substancial refactoring.
Nobody is suggesting they get full autonomy to cowboy code and push unreviewed changes to prod. Everything they build should be getting reviewed by their peers and seniors. But they need opportunities to explore and make mistakes and get feedback.
It's an entirely different world in small businesses that aren't primarily tech.
I'm asking because I legitimately have not figured out an answer to this problem.
Seriously, long term thinking went out the window long time ago, didn't it?
It is definitely a problem for the company. How is it a problem for the senior dev at any point?
What incentive do they have to aid the company at the expense of their own *long term* career prospects?
Also, I'm pretty sure junior devs can use directing a LLM to learn from mistakes faster. Let them play. Soon enough they're going to be better than all of us anyway. The same way widespread access to strong chess computers raised the bar at chess clubs.
I'm probably a pretty shitty developer by HN standards but I generally have to build a prototype to fully understand and explore problem and iterate designs and LLMs have been pretty good for me as trainers for learning things I'm not familiar with. I do have a certain skill set, but the non-domain stuff can be really slow and tedious work. I can recognize "good enough" and "clean" and I think the next generation can use that model very well to be become native with how to succeed with these tools.
Let me put it this way: people don't have to be hired by the best companies to gain experience using best practices anymore.
Unfortunately the bar is being raised on us. If you can't hang with the new order you are out of a job. I promise I was one of the holdouts who resisted this the most. It's probably why I got laid off last spring.
Thankfully, as of this last summer, agentic dev started to really get good, and my opinion made a complete 180. I used the off time to knock out a personal project in a month or two's worth of time, that would have taken me a year+ the old way. I leveraged that experience to get me where I am now.
I don't think we'll be out of jobs. Maybe temporarily. But those jobs come back. The energy and money drain that LLMs are, are just not sustainable.
I mean, it's cool that you got the project knocked out in a month or two, but if you'd sit down now without an LLM and try to measure the quality of that codebase, would you be 100% content? Speed is not always a good metric. Sure, 1 -2 months for a project is nice, but isn't especially a personal project more about the fun of doing the project and learning something from it and sharpening your skills?
However if I just say “I have this goal, implement a solution”, chances are that unless it is a very common task, it will come up with a subpar/incomplete implementation.
What’s funny to me is that complexity has inverted for some tasks: it can ace a 1000 lines ML model for a general task I give it, yet will completely fail to come up with a proper solution for a 2D geometric problem that mostly has high school level maths that can be solved in 100 lines
I asked Claude to fix a pet peeve of mine, spawning a second process inside an existing Wine session (pretty hard if you use umu, since it runs in a user namespace). I asked Claude to write me a python server to spawn another process to pass through a file handler "in Proton", and it proceeded a long loop of trying to find a way to launch into an existing wine session from Linux with tons of environment variables that didn't exist.
Then I specified "server to run in Wine using Windows Python" and it got more things right. Except it tried to use named pipes for IPC. Which, surprise surprise, doesn't work to talk to the Linux piece. Only after I specified "local TCP socket" it started to go right. Had I written all those technical constraints and made the design decisions in the first message it'd have been a one-hit success.
This is true, as for "Open Ended" I use Beads with Claude code, I ask it to identify things based on criteria (even if its open ended) then I ask it to make tasks, then when its done I ask it to research and ask clarifying questions for those tasks. This works really well.
While this is true in my experience, the opposite is not true. LLMs are very good at helping me go through a structure processing of thinking about architectural and structural design and then help build a corresponding specification.
More specifically the "idea honing" part of this proposed process works REALLY well: https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/
This: Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let’s do this iteratively and dig into every relevant detail. Remember, only one question at a time.
thats called job security!
Back in the day, we would just do this with a search engine.
until works { try again }
The stuff is getting so cheap and so fast... a sufficient increment in quantity can produce a phase change in quality.
My sentiment was "that's obviously a weird non-intended hack" but I wanted to test quickly, and well ... it worked. Later, reading the man-pages I aknowledged the fact that I needed to declare specific flags for gcc in place of the gpt advised solution.
I think these kind of value based judgements are hard to emulate for LLMs, it's hard for them to identifiate a single source as the most authoritative source in a sea of lesser authoritative (but numerous) sources.
If I had a pdf printout of a table, the workflow i used to have to use to get that back into a table data structure to use for automation was hard (annoying). dedicated OCR tools with limitations on inputs, multiple models in that tool for the different ways the paper the table was on might be formatted. it took hours for a new input format
now i can take a photo of something with my phone and get a data table in like 30 seconds.
people seem so desperate to outsource their thinking to these models and operating at the limits of their capability, but i have been having a blast using it to cut through so much tedium that werent unsolved problems but required enough specialized tooling and custom config to be left alone unless you really had to
this fits into what youre saying with using it to do the grunt work i find boring i suppose, but feels a little bit more than that - like it has opened a lot of doors to spaces that had grunt work that wasnt worth doing for the end result previously but now it is
In short: they are great for simple manual data entry, and I use them extensively for that. But they need to be supervised manually. They aren't a solution, but a tool to make humans more productive. Tasks that would have taken hours, and I just never did, now take minutes.
Static analysis has the opposite problem - very structured, deterministic, but limited to predefined patterns and overwhelms you in false positives.
The sweet spot seems to be to give structure to what the LLM should look for, rather than letting it roam free on an open-ended "review this" prompt.
We built Autofix Bot[1] around this idea.
[1] https://autofix.bot (disclosure: founder)
Claude is for getting shit done, it's not at its best at long research tasks.
The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?
Then, prompt the same llm to be a "critical" reviewer with the same code multiple times. How much does that average critical grade change?
A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a major positive signal for quality.
I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical whereas gpt-5.1 is directionally the same while tightening the screws.
You could also interpret these results to be a proxy for obsequiousness.
Edit: One major part of the eval i left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give itself an A, and consistently?
It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
I created the original plan with a very specific ask - create an abstraction to remove some tight coupling. Small problem that had a big surface area. The planning/brainstorming was great and I like the plan we came up with.
I then tried to use a prompt like OP's to improve it (as I said, large surface area so I wanted to review it) - "Please review PLAN_DOC.md - is it a comprehensive plan for this project?". I'd run it -> get feedback -> give it back to Claude to improve the plan.
I (naively perhaps) expected this process to converge to a "perfect plan". At this point I think of it more like a probability tree where there's a chance of improving the plan, but a non-zero chance of getting off the rails. And once you go off the rails, you only veer further and further from the truth.
There are certainly problems where "throwing compute" at it and continuing to iterate with an LLM will work great. I would expect those to have firm success criteria. Providing definitions of quality would significantly improve the output here as well (or decrease the probability of going off the rails I suppose). Otherwise Claude will confuse quality like we see here.
Shout out OP for sharing their work and moving us forward.
> ..oh and the app still works, there's no new features, and just a few new bugs.
The logger library which Claude created is actually pretty simple, highly approachable code, with utilities for logging the timings of async code and the ability to emit automatic performance warnings.
I have been using LogTape (https://logtape.org) for JavaScript logging, and the inherited, category-focused logging with different sinks has been pretty great.
Ultrathink. You're a principal engineer. Do not ask me any
questions. We need to improve the quality of this codebase.
Implement improvements to codebase quality.
I'm a little disappointed that Claude didn't eventually decide to start removing all of the cruft it had added to improve the quality that way instead."I spent 200 days in the woods"
"I Google translated this 200 times"
"I hit myself with this golf club 200 times"
Is this really what hacker news is for now?
It still requires an exhausting amount of thought and energy to make the LLM go in the direction I want, which is to say in a direction which considers the code which is outside the current context window.
I suspect that we will not solve the context window problem for a long time. But we will see a tremendous growth in “on demand tooling” for things which do fit into a context window and for which we can let the AI “do whatever it wants.”
For me, my work product needs to conform to existing design standards and I can’t figure out how to get Claude to not just wire up its own button styles.
But it’s remarkable how—despite all of the nonsense—these tools remain an irreplaceable part of my work life.
I think LLMs are still at the 'advanced autocomplete' stage, where the most productive way to use them is to have a human in the loop.
In this, accuracy of following instructions, and short feedback time is much more important than semi-decent behavior over long-horizon tasks.
Then, I ask it to execute each phase from the doc one at a time. I review all the code it writes or sometimes just write it myself. When it is done it updates the plan with what was accomplished and what needs to be done next.
This has worked for me because:
- it forces the planning part to happen before coding. A lot of Claude’s “wtf” moments can be caught in this phase before it write a ton of gobbledygook code that I then have to clean up
- the code is written in small chunks, usually one or two functions at a time. It’s small enough that I can review all the code and understand before I click accept. There’s no blindly accepting junk code.
- the only context is the planning doc. Claude captures everything it needs there, and it’s able to pick right up from a new chat and keep working.
- it helps my distraction-prone brain make plans and keep track of what I was doing. Even without Claude writing any code, this alone is a huge productivity boost for me. It’s like have a magic notebook that keeps track of where I was in my projects so I can pick them up again easily.
Tangible examples like this seem like a useful way to show some of the limitations.
Removing code, renaming files, condensing, and other edits is mostly a post-training stuff, supervised learning behavior. You have armies of developers across the world making 17 to 35 dollars an hour solving tasks step by step which are then basically used to generate prompt/responses pairs of desired behavior for a lot of common development situations, adding desired output for things like tool calling, which is needed for things like deleting code.
A typical human working on post-training dataset generation task would involve a scenario like: given this Dockerfile for a python application, when we try to run pytest it fails with exception foo not found. The human will notice that package foo is not installed, change the requirements.txt file and write this down, then he will try pip install, and notice that the foo package requires a certain native library to be installed. The final output of this will be a response with the appropriate tool calls in a structured format.
Given that the amount of unsupervised learning is way bigger than the amount spent on fine-tuning for most models, it is not surprise that given any ambiguous situation, the model will default to what it knows best.
More post-training will usually improve this, but the quality of the human generated dataset probably will be the upper bound of the output quality, not to mention the risk of overfitting if the foundation model labs embrace SFT too enthusiastically.
what does this even mean? could you expand on it
In my experience, Claude can actually clean up a repo rather nicely if you ask it to (1) shrink source code size (LOC or total bytes), (2) reduce dependencies, and (3) maintain integration tests.
LLMs are incapable of reducing entropy in a code base
I've always had this nagging feeling, but I think this really captures the essence of it succintly.
There is no consensus on what constitutes a high quality codebase.
Said differently - even if you asked 200 humans to do this same exercise, you would get 200 different outputs.
To extend what may seem like a [prima facie] insane, stupid, or foolhardy idea: Why not send the output of /dev/urandom into /bin/bash? Or even /proc/mem? It probably won't do anything particularly interesting. It will probably just break things and burn power.
And so? It's just a computer; its scope is limited.
1) Run multiple code analysis tools over it and have the LLM aggregate it with suggestions
2) ask the LLM to list potential improvements open ended question and pick by hand which I want
And usually repeat the process with a completely different model (ie diff company trained it)
Any more and yeah they end up going in circles
It is not unlike people, the difference being that if you ask someone the same thing 200 times, he will probably going to tell you to go fuck yourself, or, if unable to, turn to malicious compliance. These AIs will always be diligent. Or, a human may use the opportunity to educate himself, but again, LLMs don't learn by doing, they have a distinct training phase that involves ingesting pretty much everything humanity has produced, your little conversation will not have a significant effect, if at all.
const x = new NewClass();
assert.ok(x instanceof NewClass);
So I am not at all surprised about Claude adding 5x tests, most of which are useless.It's going to be fun to look back at this and see how much slop these coding agents created.
I worked in a C# codebase with Result responses all over the place, and it just really complicated every use case all around. Combined with Promises (TS) it's worse still.
I suspect SLOC growth wouldn't be quite as dramatic but things like converting everything to Rust's error handling approach could easily happen.
"...oh and the app still works, there's no new features, and just a few new bugs."
Nobody thinks that doing 200 improvement passes on functioning code base is a good idea. The prompt tells the model that it is a principal engineer, then contradicts that role the imperative "We need to improve the quality of this codebase". Determining when code needs to be improved is a responsibility for the principal engineer but the prompt doesn't tell the model that it can decide the code is good enough. I think we would see a different behavior if the prompt was changed to "Inspect the codebase, determine if we can do anything to improve code quality, then immediately implement it." If the model is smart enough, this will increasingly result in passes where the agent decides there is nothing left to do.
In my experience with CC I get great results where I make an open ended question about a large module and instruct it to come back to me with suggestions. Claude generates 5-10 suggestions and ranks them by impact. It's very low-effort from the developer's perspective and it can generate some good ideas.
As for this experiment: What does quality even mean? Most human devs will have different opinions on it. If you would ask 200 different devs (Claude starts from 0 after each iteration) to do the same, I have doubts the code would look much better.
I am also wondering what would happen if Claude would have an option to just walk away from the code if its "good enough". For each problem most human devs run cost->benefit equation in their head, only worthy ideas are realized. Claude does not do it, the code writing cost is very low on his site and the prompt does not allow any graceful exit :)
Given with what I've seen from Claude 4.5 Opus, I suspect the following test would be interesting: attempt to have Claude Code + Haiku/Sonnet/Opus implement and benchmark an algorithm with:
- no CLAUDE.md file
- a basic CLAUDE.md file
- an overly nuanced CLAUDE.md file
And then both test the algorithm speed and number of turns it takes to hit that algorithm speed.
Fucking yikes dude. When's the last time it took you 4500 lines per screen, 9000 including the JSON data in the repo????? This is already absolute insanity.
I bet I could do this entire app in easily less than half, probably less than a tenth, of that.
Or I should say, they kept hiring the humans who needed something to do, and basically did what this AI did.
In my experience, adding this kind of instruction to the context window causes SOTA coding models to actually undertake that kind of optimization while development carries on. You can also periodically chuck your entire codebase into Gemini-3 (with its massive context window) and ask it to write a refactoring plan; then, pass that refactoring plan back into your day-to-day coding environment such as Cursor or Codex and get it to take a few turns working away at the plan.
As with human coders, if you let them run wild "improving" things without specifically instructing them to also pay attention to bloat, bloat is precisely what you will get.
> Some of them are really unnecessary and could be replaced with off the shelf solution
Lots of people would regard this as a good thing. Surely the LLM can't guess which kind you are.
For instance - it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.
I've worked on writing some as a data scientist, and I have gotten the basic claude output to be much better; it makes some saner decisions, it validates and circles back to fix fits, etc.
Claude has a bias to add lines of code to a project, rather than make it more concise. Consequently, each refactoring pass becomes more difficult to untangle, and harder to improve.
Ideally, in this experiment, only the first few passes would result in changes - mostly shrinking the project size, and from then on, Claude would change nothing - just a like a very good programmer.
This is the biggest problem with developing with Claude, by far. Anthropic should laser focus on fixing it.
> Tons of tests got added, but some tests that mattered the most (maestro e2e tests that validated the app still works) were forgotten.
I've seen many LLM proponents often cite the number of tests as a positive signal.
This smells, to me, like people who tout lines of code.
When you are counting tests in the thousands I think its a negative signal.
You should be writing property based tests rather than 'assert x=1', 'assert x=2', 'assert x=-1' and on and on.
If LLMs are incapable of acknowledging that then add it to the long list of 'failure modes'.
written-beyond•2d ago
I disagree, it's very useful even in languages that have exception throwing conventions. It's good enough for the return type for Promise.allSettled api.
The problem is when I don't have the result type I end up approximating it anyway through other ways. For a quick project I'd stick with exceptions but depending on my codebase I usually use the Go style ok, err tuple (it's usually clunkier in ts though) or a rust style result type ok err enum.