Your job is to solve customer problems. Their problems may only be solvable with code that is proven to work, but it is equally likely (I dare say even more likely) that their problem isn't best solved with code at all, or even solved with code that doesn't work properly but works well enough.
From the post and the example he links, the point is that if you don't at least look at the running code, you don't know that it works.
In my opinion the point is actually well illustrated by Chris's talk here:
https://v5.chriskrycho.com/elsewhere/seeing-like-a-programme...
(summary of the relevant section if you're not going to click)
>>>
In the talk "Seeing Like a Programmer," Chris Krycho quotes the conductor and composer Eímear Noone, who said:
> "The score is potential energy. It's the potential for music to happen, but it's not the music."
He uses this quote to illustrate the distinction between "software as artifact" (the code/score) and "software as system" (the running application/music). His point is that the code itself is just a static artifact—"potential energy"—and the actual "software" only really exists when that code is executed and running in the real world.
Your tests run the code. You know it works. I know the article is trying to say that testing is not comprehensive enough, but my experience disagrees. But I also recognize that testing is not well understood (quite likely the least understood aspect of computer science!) — and if you don't have a good understanding you can get caught not testing the right things or not testing what you think you are. I would argue that you would be better off using your time to learn how to write great tests instead of using it to manually test your code, but to each their own.
What is more likely to happen is not understanding the customer needs well enough, leaving it impossible to write tests that align with what the software needs to do. Software development can break down very quickly here. However, manual testing does not help. You can't know what to manually test without understanding the problem either. However, as before, your job is not to deliver proven code. Your job is to solve customer problems. When you realize that, it becomes much less likely that you write tests that are not in line with the solution you need.
Outside in testing is great but I typically do automated outside in testing and only manual at the end. The loop process of testing needs to be repeatable and fast, manual is too slow
I've lost count of the number of times I've skipped it because the automated test passed and then found there was some dumb but obvious bug that I missed, instantly exposed when I actually exercised the feature myself.
There's a lot of pedantry here trying to argue that there exists some feature which doesn't need to be "manually" tested, and I think the definition of "manual" can be pushed around a lot. Is running a program that prints "OK" a manual test or not? Is running the program and seeing that it now outputs "grue" rather than "bleen" manual? Does verifying the arithmetic against an Excel spreadsheet count?
There are programs that almost can't be manual, and programs that almost have to be manual. I remember when working on PIN pad integration we looked into getting a robot to push the buttons on the pad - for security reasons there's no way of injecting input automatically.
What really matters is getting as close to a realistic end user scenario as possible.
[1] As far as I can tell. If there are good solutions for this too, I'd love to learn.
Unit testing, whether manual or automated, typically catches about 30% of bugs.
End to end testing and visual inspection of code are both closer to 70% of bugs.
I vibe code a lot of stuff for myself, mostly for viewing data, when I don’t really need to care how it works. I’m coming around to the idea that outside of some specific circumstances where everyone has agreed they don’t need to care about or understand the code, team vibe coding is a bad practice.
If I’m paying an engineer, it’s for their work, unless explicitly agreed otherwise.
I think vibe coding is soon going to be seen the same way as “research” where you engage an offshore team (common e.g. in consulting) to give you a rundown on some topic and get back the first five google search results. Everyone knows how to do that, if it’s what they wanted they wouldn’t be hiring someone to do it.
Is anyone else seeing this in their orgs? I'm not...
You could intuitively think it's just a difference of degree, but it's more akin to a difference of kind. Same for a nuke vs a spear, both are weapons, no one argues they're similar enough that we can treat them the same way
LLMs can't do this.
Your code is unambiguously better than any LLM code if you can comment a link to the stackoverflow post you copied it from.
If you are sufficiently motivated to appear more "productive" than your coworkers, you can force them to review thousands of lines of incorrect AI slop code while you sit back and mess around with your chatbots.
Your coworkers no longer have enough time to work on their in-progress PRs, so you can dominate the development team in terms of LOC shipped.
Understand that sociopaths are skilled at navigating social and bureaucratic environments. A sociopath who ships the most LOC will get the promotion every single time.
I think this is largely an issue that can be solved culturally within a team, we just unfortunately only have so much input on how other teams work. It doesn't help either when their manager doesn't seem to care about the feedback... Corporate politics are fun.
the idea that LLMs make developers more productive is delusional.
Probs fine when you are still in the exploration phase of a startup, scary once you get to some kind of stability
Hell, for my hobby projects, I try to keep individual commits under 50-100 lines of code.
Developers aren't hired to write code that's never run (at least in my opinion). We're also responsible for running the code/keeping it running.
And if it was repeated... Well I would probably get fired...
https://github.com/WireGuard/wireguard-android/pull/82 https://github.com/WireGuard/wireguard-android/pull/80
In that first one, the double pasted AI retort in the last comment is pretty wild. In both of these, look at the actual "files changed" tab for the wtf.
I recently reviewed a PR that I suspect is AI generated. It added a function that doesn't appear to be called from anywhere.
It's shit because AI is absolutely not on the level of a good developer yet. So it changes the expectation. If a PR is not AI generated then there is a reasonable expectation that a vaguely competent human has actually thought about it. If it's AI generated then the expectation is that they didn't really think about it at all and are just hoping the AI got it right (which it very often doesn't). It's rude because you're essentially pawning off work that the author should have done to the reviewer.
Obviously not everyone dumps raw AI generated code straight into a PR, so I don't have any problem with using AI in general. But if I can tell that your code is AI generated (as you easily can in the cases you linked), then you've definitely done it wrong.
My eyes were wide open when 2 jobs ago, they said they would be blocking all personal web browsing from work computers. Multiple Software Devs were unhappy because they were using their work laptop for booking flights, dealing with their kids schools stuff and other personal things. They did not have personal computer at all.
People do what they think they will be rewarded for. When you think your job is to write a lot of code then LLMs are great. When you need quality code you start to ask if LLMs are better or not?
Unfortunately, this person is vibe coding completely, and even the PR process is painful: * The coding agent reverts previously applied feedback * Coding agent not following standards throughout the code base * Coding agent re-inventing solutions that already exist * PR feedback is being responded to with agent output * 50k line PRs that required a 10-20 line change * Lack of testing (though there are some automated tests, but their validations are slim/lacking) * Bad error handling/flow handling
(By my organization, I meant my company - this person doesn't report to me or in my tree).
But LLMs don't really perform well enough on our codebase to allow you to generate things that even appear to work. And I'm the most junior member of my team at 37 years of age, hired in 2019.
I really tried to follow the mandate from on high to use Copilot, but the Agent mode can't even write code that compiles with the tools available to it.
Luckily I hooked it up to gptel so I can at least ask it quick questions about big functions I don't want to read in emacs.
Edit: I'm an idiot ignore me.
It has emdashes because my blog turns " - " into an emdash here: https://github.com/simonw/simonwillisonblog/blob/06e931b397f...
If we are accepting LLM generated code, we should accept LLM generated content as long as it is "proof read" :)
Just a wild thought, nothing serious.
New to me, but I'm on board.
We already delegate accountability to non-humans all the time: - CI systems block merges - monitoring systems page people - test suites gate different things
In practice accountability is enforced by systems, not humans.. humans are defintiely "blamed" after the fact, but the day-to-day control loop is automated.
As agents get better at running code, inspecting ui state, correlating logs, screenshots, etc they're starting to operationally be "accountable" and preventing bad changes from shipping and producing evidence when something goes wrong .
At some point humans role shifts from "i personally verify this works" to "i trust this verification system and am accountable for configuring it correctly".
Thats still responsibility, but kind of different from whats described here. Taken to a logical extreme, the arguement here would suggest that CI shouldn't replace manual release checklists
Human collaboration works on trust.
Part of trust is accountability and consequences. If I get caught embezzling money from my employer I can lose my job, harm my professional reputation and even go to jail. There are stakes!
I computer system has no stakes, and cannot take accountability for its actions. This drastically limits what it makes sense to outsource to that system.
A lot of this comes down to my work on prompt injection. LLMs are fundamentally gullible: an email assistant might respond to an email asking for the latest sales figures by replying with the latest (confidential) sales figures.
If my human assistant does that I can reprimand or fire them. What am I meant to do with an LLM agent?
if you put them (without humans) in a forrest they would not survive and evolve (they are not viable systems alone); they are not taking action without the setup & maintenance (& accountability) of people
Perhaps an unstated and important takeaway here is that junior developers should not be permitted to use an LLMs for the same reason they should not hire people: they have not demonstrated enough skill mastery and judgement to be trusted with the decision to outsource their labor. Delegating to a vendor is a decision made by high-level stakeholders, with the ability to monitor the vendor performance, and replace the vendor with alternatives if that performance is unsatisfactory. Allowing junior developers to use LLM is allowing them to delegate responsibility without any visibility or ability to set boundaries on what can be delegated. Also important: you cannot delegate personal growth, and by permitting junior engineers to use an LLM that is what you are trying to do.
Accountability is about what happens if and when something goes wrong. The moon landings were controlled with computer assistance, but Nixon preparing a speech for what happened in the event of lethal failure is accountability. Note that accountability does not of itself imply any particular form or detail of control, just that a social structure of accountability links outcome to responsible person.
From there, I include explicit steps for how to test, including manual testing, and unit test/E2E test commands. If it's something visual, I try to include at least a screenshot, or sometimes even a brief screen capture demonstrating the feature.
Really go out of your way to make the reviewer's life easier. One benefit of doing all of this is that in most cases, the reviewer won't need to reach out to ask simple questions. This also helps to enable more asynchronous workflows, or distributed teams in different time zones.
The Devs went in kicking and screaming. As an SRE it seemed like for SDEs, writing a description of the change, explaining the problem the code is solving, testing methodology, etc is harder than actually coding. Ironically AI is proving that this theory was right all along.
One can also point QA or consultants to a ticket for documentation purposes or timeline details.
To be fair, copilot review is actually alright at catching these sorts of things. It remains a nice courtesy to extend to your reviewer.
Not to say you shouldn't write descriptions, I will keep doing it because it's my job. But a lot of people just don't care enough or are too distracted to read them.
Well, I'm sure you can guess what happened after that - within the same file even
This is ~mandatory for me. Even if what I am working on is non-visual. I will take a screenshot of a new type in my IDE and put a red box around it. This conveys the focus of my attention and other important aspects of the work effort.
I would go a step further: we need to deliver code that belongs. This means following existing patterns and conventions in the codebase. Without explicit instruction, LLMs are really bad at this, and it's one of the things that make it incredibly obvious to reviews that a given piece of code has been generated by AI.
I also see AI coding tools violate "Chesterton's Fence" (and the pre-Chesterton's Fence, not sure what that is called, the idea being that code is necessary otherwise it shouldn't be in the source).
They used to be. They have become quite good at it, even without instruction. Impressively so.
But it does require that the humans who laid the foundation also followed consistent patterns and conventions. If there is deviation to be found, the LLM will see it and be forced to choose which direction to go, and that's when things quickly fall off the rails. LLMs are not (yet) good at that.
Garbage in, garbage out, as they say.
Agents love to cheat. That's an issue I don't see a horizon for change.
Here's Opus 4.5 trying to cheat its way out of properly implementing compatibility and cross-platform, despite the clear requirements:
https://gist.github.com/alganet/8531b935f53d842db98157e1b8c0...
> Should popen handles work with fgets/fread/fwrite? PHP supports this. Option A: Create a minimal pipe_io_stream device / Option B: Store FILE* in io_private with a flag / Option C: Only support pclose, require explicit stream wrapper for reads.
If I asked for compatibility, why give me options that won't fully achieve it?
It actually tried to "break check" my knowledge about the interpreter (test me if I knew enough to catch it), and proposed shortcuts all the way through the chat.
I don't want to have to pepper my chats with variations on "don't cheat". I mean, I can do it, but it seems like boilerplate.
I wish I had some similar testing-related chats to share. Agents do that all the time.
This is the major blocker right now for AI-assisted automated verification, and one of the reasons why this isn't well developed beyond general directions (give it screenshots, make it run the command, etc).
My approach to coding agents is to prepare a spec at the start, as complete as possible, and develop a beefy battery of tests as we make progress. Yesterday there was a story "I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours". They had 9000+ tests. That was the secret juice.
So the future of AI coding as I see it ... it will be better than pre-2020, we will learn to spec and plan good tests, and the tests are actually our contract the code does what is supposed to do. You can throw away the code and keep the specs and tests and regenerate any time.
For UIs I do a different trick - live diagnostic tests - I ask the agent to write tests that run in the app itself, check consistencies, constraints and expected behaviors. Having the app running in its natural state makes it easier to test, you can have complex constraints encoded in your diagnostics.
> They had 9000+ tests.
They were most probably also written by AI, there's no other (human) way. The way I see it we're putting turtles upon turtles hoping that everything will stick together, somehow.
Yes. They came from the existing project being ported, which was also AI-written.
Behind that is a smaller number of larger integration tests, and the even longer running regression tests that are run every release but not on every commit.
Yes, from the same author, in fact.
That's really not a great development for us. If our main point is now reduced to accountability over the result with barely any involvement in the implementation - that's very little moat and doesn't command a high salary. Either we provide real value or we don't ...and from that essay I think it's not totally clear what the value is - it seems like every QA, junior SWE or even product manager can now do the job of prompting and checking the output.
Experienced software engineers have such a huge edge over everyone else with this stuff.
If your product manager doesn't understand what a CORS header is good luck having them produce a change that requires cross-domain fetch() call... and first they'll have to know what a "cross-doman fetch() call" means.
And sure they could ask an LLM about that, but they still need the vocabulary and domain knowledge to get to that question.
That’s the thing. People exposing such rude behavior usually are not, or haven’t been in a looong time…
As for the local testing part not being performed, this is a slippery slope I’m fighting everyday: more and more cloud based services and platforms are used to deploy software to run with specific shenanigans and running it locally requires some kind of deep craft and understanding. Vendor lock-in is coming back in style (e.g. Databricks)
The best solution I have for that is staging environments, ideally including isolated-from-production environments you can run automated tests against.
A colleague was working on an important subsystem and would ask Djikstra for a review when he thought it was ready. Djikstra would have to stop what he was doing, analyze the code, and would find a grievous error or edge case. He would point it out to the colleague who would then get back to work. The colleague would submit his code for review again and this could carry on enough times that Djikstra got annoyed.
Djikstra proposed a solution. His colleague would have to submit with his code some form of proof or argument as to why it was correct and ready to merge. That way Djikstra could save time by only having to review the argument and not all of the code.
There’s a way of looking at LLM output as Djikstra’s colleague. It puts a lot of burden on the human using this tool to review all of the code. I like Doctorow’s mental model of a reverse centaur. The LLM cannot reason and so won’t provide you with a sound argument. It can probably tell you what it did and summarize the code changes it made… but it can’t decide to merge those changes. It needs a human, the bottom half of the centaur, to do the last bit of work here. Because that’s all we’re doing when we let these tools do most of the work for us: we’re here to take the blame.
And all it takes is an implementation of what we’re trying to build already, every open source library ever, all of SO, a GW of power from a methane power plant, an Olympic pool of water and all of your time reviewing the code it generates.
At the end of the day it’s on you to prove why your changes and contributions should be merged. That’s a lot of work! But there’s no shortcuts. Luckily you can reason while the LLMs struggle with that so use it while you can when choosing to use such tools.
Manual and automatic testing are still both required, but you must explicitly ensure that security considerations are included in those tests.
The LLM doesn't care. Caring is YOUR job.
That is part of it, yes, but there are many others, such as ensuring that the new code is easy to understand and maintain by humans, makes the right tradeoffs, is reasonably efficient and secure, doesn't introduce a lot of technical debt, and so on.
These are things that LLMs often don't get right, and junior engineers need guidance with and mentoring from more experienced engineers to properly learn. Otherwise software that "works" today, will be much more difficult to make "work" tomorrow.
That's why I refuse to take part in it. But I'm an old-world craftsman by now, and I understand nobody wants to pay for working, well-thought-out code any more. They don't want a Chesterfield; they want plywood and glue.
I’m experimenting with how to get these into a PR, and the “gh” CLI tool is helpful.
Does anyone have a recipe to get a coding agent to record video of webflows?
It's even worse than that: non-junior devs are doing it as well.
Vote with your wallet.
Developers have created too many layers of abstraction and indirection to do their jobs. We're burning a ton of energy managing state management frameworks, that are many layers of indirection away from the computations that are salient to users.
All those DSLs, config syntaxes, layers of boilerplate waste a huge amount of electricity, when end users want to draw geometric shapes.
So a non-dev generates a mess, but in a way so do devs with Django and Elixir, RoR, Terraform. When really end of the day it's matrix math against memory and sync of that state to the display.
From a hardware engineers perspective, the mess of devs and non-devs is the same abstract mess of electrical states that have nothing to do with the goal. All those frameworks can be generalized into a handful of electrical patterns, saving a ton of electricity.
Started career in late 90s designing boards for telecom companies network backbones.
Trying to create a secure, reliable and scalable system that enables many people to work on one code base, share their code around with others and at the end of the day coordinate this dance of electrons across multiple computers, that's where all of these 'useless' layers of abstraction become absolutely necessary.
I know exactly what those layers of abstraction are used for. Why so many? Jobs making layers of abstraction.
But all of them are dev friendly means of modeling memory states for the CPU to watch and transform just so. They can all be compressed into a generic and generalized set of mathematical functions ridding ourselves of the various parser rules to manage each bespoke syntax inherent to each DSL, layers of framework.
Boilerplate comes when your language doesn't have affordances, you get around this with /abstraction/ which leads to DSLs (Domain Specific Languages).
Matrix math is generally done on more than raw bits provided by digital circuits. Simple things like numbers require some amount of abstraction and indirection (pointers to memory addresses that begin arrays).
My point is yes, we've gotten ourselves in a complicated tar pit, but it's not because there wasn't a simpler solution lower in the stack.
I don’t know about you, but I get paychecks twice a month for doing things included in my job description.
Now we have nightly builds that nobody checks the result of and we’re finding out about bugs weeks later. Big company btw
Pretending it’s just the kids and young people doing the bad thing makes the outrage easier to sell to adults.
This might be unpopular, but that is seeming more like an opportunity if we want to continue allowing AI to generate code.
One of the annoying things engineers have to deal with is stopping whatever they're doing and doing a review. Obviously this gets worse if more total code is being produced.
We could eliminate that interruption by having someone doing more thorough code reviews, full-time. Someone who is not being bound by sprint deadlines and tempted to gloss over reviews to get back to their own work. Someone who has time to pull down the branch and actually run the code and lightly test things from an engineer's perspective so QA doesn't hit super obvious issues. They can also be the gatekeeper for code quality and PR quality.
I would have thought that reviewing PRs and doing it well is in the job description. You latter mention "someone" a few times - who that someone might be?
This is not the first time somebody had that idea.
If i was CTO I would not be happy to hear my engineers are spending lots of time re-writing and testing code written by product managers. Big nope.
Overall, this hits the nail on the head about not delivering broken code and providing automated tests. Thanks for putting your thoughts on paper.
The title doesn't go far enough - slop (AI or otherwise) can work and pass all the tests, and still be slop.
If it doesn't even work you're absolutely wasting my time with it.
Testing only “proves” correctness for the specific state, environment, configuration, and inputs the code was tested with. In practice that only tests a tiny portion of possible circumstances, and omits all kinds of edge and non-edge cases.
Claude, etc, works best with good tests that verify the system works. And so the code becomes in some ways the tests rather than the code that does the thing. If you're responsible for the thing, then 90% of your responsibility moves to verifying behavior and giving agents feedback.
Therefore you must verify it works as intended in the real world. This means not shipping code and hoping for the best, but checking that it actually does the right thing in production. And on top of that, you have to verify that it hasn't caused a regression in something else in production.
You could try to do that with tests, but tests aren't always feasible. Therefore it's important to design fail-safes into your code that ALERT YOU to unexpected or erroneous conditions. It needs to do more than just log an error to some logging system you never check - you must actually be notified of it, and you should consider it a flaw in your work, like a defective pair of Nikes on an assembly line. Some kind of plumbing must exist to take these error logs (or metrics, traces, whatever) and send it to you. Otherwise you end up producing a defective product, but never know it, because there's nothing in place to tell you its flaws.
Every single day I run into somebody's broken webapp or mobile app. Not only do the authors have no idea (either because they aren't notified of the errors, or don't care about them), there is no way for me to even e-mail the devs to tell them. I try to go through customer support, a chat agent, anything, and even they don't have a way to send in bug reports. They've insulated themselves from the knowledge of their own failures.
Who popped this balloon? I know I need to change my employer, but it's not so easy. And I'm not sure another employer is going to be any better.
This only happens because the software industry has fallen into the Religion of Speed. I see it constantly: justified corner-cutting, rushing shit out the door, and always loading up another feature/project/whatever with absolutely zero self-awareness. AI is just an amplifier for bad behavior that was already causing chaos.
What's not being said here but should be: discipline matters. It's part of being a professional and always precedes someone who can ship code that "just works."
[1] https://ia.net/*
That's not something I've seen or been able to achieve in most of my professional work.
Strong disagree here, your job is to deliver solutions that help the business solve a problem. In _most_ cases that means delivering code that you should be able to confidently prove satisfies the requirements like the OP mentioned, but I think this is an important nitpick distinction I didn't understand until later on in my career.
I guess to me, it's either the case that LLMs are just another tool, in which case the already existing teachings of best practice should cover them (and therefore the tone and some content of this article is unnecessary) or they're something totally new, in which case maybe some of the already existing teachings apply, but maybe not because it's so different that the old incentives can't reasonably take hold. Maybe we should focus a little bit more attention on that.
The article mentions rudeness, shifting burdens, wasting people's time, dereliction. Really loaded stuff and not a framing that I find necessary. The average person is just trying to get by, not topple a social contract. For that, look upwards.
A lot of people using LLMs seem not to have understood that you can't expect them to write code that works without testing it first!
If that wasn't clearly a problem I wouldn't have felt the need to write this.
daedrdev•2h ago