Verification debt: the hidden cost of AI-generated code

https://fazy.medium.com/agentic-coding-ais-adolescence-b0d13452f981

31•xfz•1h ago

Comments

Kerrick•1h ago

> It gets 50% more pull requests, 50% more documentation, 50% more design proposals

Perhaps this will finally force the pendulum to swing back towards continuous integration (the practice now aliased trunk-based development to disambiguate it from the build server). If we're really lucky, it may even swing the pendulum back to favoring working software over comprehensive documentation, but maybe that's hoping too much. :-)

maxdo•1h ago

Code is a fully disposable way to generate custom logic.

Hand-crafted, scalable code will be a very rare phenomenon.

There will be a clear distinction between the two.

hnthrow0287345•1h ago

This still seems like technical debt to me. It's just debt with a much higher compounding interest rate and/or shorter due date. Credit cards vs. traditional loans or mortgages.

>And six months later you discover you’ve built exactly what the spec said — and nothing the customer actually wanted.

That's not a developer problem, it's a PM/business problem. Your PM or equivalent should be neck deep in finding out what to build. Some developers like doing that (likely for free) but they can't spend as much time on it as a PM because they have other responsibilities, so they are likely not as good at it.

If you are building POCs (and everyone understands it's a POC), then AI is actually better getting those built as long as you clean it up afterwards. Having something to interact with is still way better than passively staring at designs or mockup slides.

Developers being able to spend less time on code that is helpful but likely to be thrown away is a good thing IMO.

lowsong•39m ago

> AI is actually better getting those built as long as you clean it up afterwards

I've never seen a quick PoC get cleaned up. Not once.

I'm sure it happens sometimes, but it's very rare in the industry. The reality is that a PoC usually becomes "good enough" and gets moved into production with only the most perfunctory of cleanup.

somewhereoutth•27m ago

There is nothing as permanent as a temporary solution!

johngossman•58m ago

This verification problem is general.

As an experiment, I had Claude Cowork write a history book. I chose as subject a biography of Paolo Sarpi, a Venetian thinker most active in the early 17th century. I chose the subject because I know something about him, but am far from expert, because many of the sources are in Italian, in which I am a beginner, and because many of the sources are behind paywalls, which does not mean the AIs haven't been trained on them.

I prompted it to cite and footnote all sources, avoid plagiarism and AI-style writing. After 5 hours, it was finished (amusingly, it generated JavaScript and emitted a DOCX). And then I read the book. There was still a lingering jauntiness and breathlessness ("Paolo Sarpi was a pivotal figure in European history!") but various online checkers did not detect AI writing or plagiarism. I spot checked the footnotes and dates. But clearly this was a huge job, especially since I couldn't see behind the paywalls (if I worked for a Uni I probably could).

Finally, I used Gemini Deep Research to confirm the historical facts and that all the cited sources exist. Gemini thought it was all good.

But how do I know Gemini didn't hallucinate the same things Claude did?

Definitely an incredible research tool. If I were actually writing such a book, this would be a big start. But verification would still be a huge effort.

apical_dendrite•32m ago

I used Gemini to look up a relative with a connection to a famous event. The relative himself is obscure, but I have some of his writings and I've heard his story from other relatives. Gemini fabricated a completely false narrative about my relative that was much more exciting than what actually happened. I spent a bunch of time looking at the sources that Gemini supplied trying to verify things and although the sources were real, the story Gemini came up with was completely made up.

johngossman•28m ago

Yup. I've had Gemini create fake citations to papers. I've also had it hallucinate the contents of paywalled papers, so I know I can't trust anything it writes, though I am getting better at using it recursively to verify things.

gowld•6m ago

Before AI, the smartest human still had to pass the paywall to access paywalled content.

AI has exacerbated the Internet's "content must be free or else does not exist" trend.

It's just not interesting to challenge an AI to write professional research content without giving it access to research content. Without access, it's just going to paraphrase what's already available.

VanTodi•52m ago

I've come to the point where I think generated code is no better than a random package I install. Did I skip reading it all and just accept what was promised? Yes. Can it bite me in the butt somewhere down the road? Probably, but I currently at least have more doubt about the generated code than about a random package I picked up somewhere on git whose readme I only partly read.

somewhereoutth•23m ago

However a random [but well established] package will have been used many many times, thus will have been verified in the wild, and likely will have a bug tracker, updates, and perhaps even a community of people who care about that particular code. No comparison really.

apical_dendrite•51m ago

My company recently hired a contractor. He submits multi-thousand-line PRs every day, far faster than I can review them. This would maybe be OK if I could trust his output, but I can't. When I ask him really basic questions about the system, he either doesn't know or he gets it wrong.

This week, I asked for some simple scripts that would let someone load data in a local or staging environment, so that the system could be tested in various configurations. He submitted a PR with 3800 lines of shell scripts. We do not have any significant shell scripts anywhere else in our codebase. I spent several hours reviewing it with him - maybe more time than he spent writing it.

His PR had tons and tons of end-to-end tests of the system that didn't actually test anything - some said they were validating state, but passed if a GET request returned a 200. There were a few tests that called a create API. The tests would pass if the API returned an ID of the created object. But they would ALSO pass if the API didn't return an ID.

I was trying to be a good teacher, so I kept asking questions like "why did you make this decision", etc, to try to have a conversation about the design choices and it was very clear that he was just making up bullshit rationalizations - he hadn't made any decisions at all.

There was one particularly nonsensical test suite - it said it was testing X but included API calls that had nothing to do with X. I was trying to figure out how he had come up with that, and then I realized - I had given him a Postman export with some example API requests, and in one of the API requests I had gotten lazy and modified the request to test something but hadn't modified the name in Postman. So the LLM had assumed that the request was related to the old name and used it when generating a test suite, even though these things had nothing to do with each other. He had probably never actually read the output so he had no idea that it made no sense.
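
A minimal sketch of the create-API tests described above, in Python. The `create_widget_*` functions and the response shape are hypothetical stand-ins, not the real API; the only point is the difference between a check that can fail and one that cannot.

```python
# Hypothetical stand-ins for a create API: one healthy, one broken.
def create_widget_ok(name):
    return {"status": 200, "body": {"id": "w-1", "name": name}}


def create_widget_broken(name):
    # Bug: the server answers 200 but never returns an ID.
    return {"status": 200, "body": {"name": name}}


def vacuous_check(create):
    """Mirrors the tests in the PR: passes with or without an ID."""
    resp = create("demo")
    if resp["body"].get("id"):
        return True  # "object was created"
    return True      # falls through to success anyway


def real_check(create):
    """Passes only if the response has a non-empty ID and echoes the input."""
    resp = create("demo")
    body = resp["body"]
    return (
        resp["status"] == 200
        and bool(body.get("id"))
        and body.get("name") == "demo"
    )
```

A quick way to review a test like this is to run it against a deliberately broken implementation: if it still passes, it is not testing anything.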

When he was first hired, I asked him to refactor a core part of the system to improve code quality (get rid of previous LLM slop). He submitted a 2000+ line PR within a day or so. He's getting frustrated because I haven't reviewed it and he has other 2000+ line PRs waiting on review. I asked him some questions about how this part of the system was invoked and how it returned data to the rest of the system, and he couldn't answer. At that point I tried to explain why I am reluctant to let him commit his refactor of a core part of the system when he can't even explain the basic functionality of that component.

lpnam0201•25m ago

Do you think he used AI to generate that much code without ever understanding or having a look at the code? Why was he hired?

apical_dendrite•22m ago

Yes, because he can't answer basic questions about the code.

He was hired because we needed a contractor quickly and he and his company represented to us that he was a lot more experienced than he actually is.

afro88•6m ago

Will you get rid of him? It sounds like he's wasting a lot of your time

scuff3d•14m ago

This sums up the inherent friction between hype and reality really well.

CEOs and hype men want you to believe that LLMs can replace everyone. In 6 months you can give them the keys to the kingdom and they'll do a better job running your company than you did. No more devs. No more QA. No more pesky employees who need crazy stuff like sleep, and food, and time off to be a human.

Then of course we run face first into reality. You give the tool to an idiot (or a generally well meaning person not paying enough attention) and you end up with 2k PRs that are batshit insane, production databases deleted, malicious code downloaded and executed on just machines, email archives deleted, and entire production infrastructure systems blown away. Then the hype men come back around and go "well yeah, it's not the tool's fault, you still need an expert at the wheel, even though you were told you don't".

LLMs can do amazing things, and I think there's a lot of opportunities to improve software products if used correctly, but reality does not line up with the hype, and it never will.

gowld•4m ago

Why are you paying someone who isn't doing the job you hired someone to do?

Why are you acting like you work for the contractor, instead of the contractor working for you?

Why are you teaching a contractor anything? That's a violation of labor law. You are treating a contractor like an employee.

bryanlarsen•35m ago

Verification is the bottleneck now, so we have to adjust our tooling and processes to make verification as easy as possible.

When you submit a PR, verifiability should be top of mind. Use those magic AI tools to make the PR as easy to verify as possible. Chunk your PR into palatable chunks. Document and comment to aid verification. Add tests that are easy for the reviewer to read, test and tweak. Etc.
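
One way to make tests easy for a reviewer to read and tweak is to keep the cases in a plain table, separate from the logic that runs them. A small sketch in Python; `parse_duration` is a hypothetical function under review, made up for illustration, and the cases are assumptions too.

```python
def parse_duration(text):
    """Hypothetical function under review: '90s', '5m', '2h' -> seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    value, unit = text[:-1], text[-1:]
    if unit not in units or not value.isdigit():
        raise ValueError(f"bad duration: {text!r}")
    return int(value) * units[unit]


# The reviewer only needs to read (and extend) these two tables.
CASES = [
    ("90s", 90),
    ("5m", 300),
    ("2h", 7200),
    ("0s", 0),
]
BAD_INPUTS = ["", "5", "m", "-5m", "5d"]


def run_cases():
    """Return a list of human-readable failures; empty means all passed."""
    failures = []
    for text, expected in CASES:
        got = parse_duration(text)
        if got != expected:
            failures.append(f"{text!r}: expected {expected}, got {got}")
    for text in BAD_INPUTS:
        try:
            parse_duration(text)
            failures.append(f"{text!r}: expected ValueError")
        except ValueError:
            pass
    return failures
```

Adding an edge case is then a one-line change to a table, which is much cheaper to review than another generated test function.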

gowld•9m ago

Just prompt the AI to verify the software.

ironman1478•34m ago

Verification has always been hard and always ignored, in software more than other industries. This is not specific to AI generated code.

I currently work in a software field that has a large numerical component and verifying that the system is implemented correctly and stable takes much longer than actually implementing it. It should have been like that when I used to work in a more software-y role, but people were much more cavalier then and it bit that company in the butt often. This isn't new, but it is being amplified.

fishtoaster•21m ago

Figuring out how to trust AI-written code faster is the project of software engineering for the next few years, IMO.

We'll need to figure out the techniques and strategies that let us merge AI code sight unseen. Some ideas that have already started floating around:

- Include the spec for the change in your PR and only bother reviewing that, on the assumption that the AI faithfully executed it

- Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis

- Get better ai-based review: greptile and bugbot and half a dozen others

- Lean into your observability tooling so that AIs can fix your production bugs so fast they don't even matter.

None of these seem fully sufficient right now, but it's such a new problem that I suspect we'll be figuring this out for the next few years at least. Maybe one of these becomes the silver bullet or maybe it's just a bunch of lead bullets.

But anyone who's able to ship AI code without human review (and without their codebase collapsing) will run circles around the rest.
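
The second idea above, leaning on deterministic checks, amounts to a merge gate that no PR gets to skip. A rough sketch in Python using only the standard library; the check commands here are placeholders that just invoke the Python interpreter, and a real project would substitute its own test, lint, and static-analysis commands.

```python
import subprocess
import sys


def run_gate(checks):
    """Run each (name, argv) check; return the names of those that failed."""
    failed = []
    for name, argv in checks:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(name)
    return failed


# Placeholder commands so the sketch runs anywhere Python does.
EXAMPLE_CHECKS = [
    ("always-passes", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("always-fails", [sys.executable, "-c", "raise SystemExit(1)"]),
]
```

A PR would be mergeable only when `run_gate` returns an empty list. This catches nothing the individual checks miss, so it complements review of the spec rather than replacing it.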

orsorna•16m ago

>Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis

It's wild that the gamut of PRs being zipped around don't even do these. You would run such validations as a human...

gjsman-1000•15m ago

Do you know what happens to every industry when they get too fast and slapdash?

Regulation.

It happened with plumbing. Electricians. Civil engineers. Bridge construction. Haircutting. Emergency response. Legal work. Tech is perhaps the least regulated industry in the world. Cutting someone’s hair requires a license, operating a commercial kitchen requires a license, holding the SSN of 100K people does not.

If AI is fast and cheap, some big client will use it in a stupid manner. Tons of people can and will be hurt afterward. Regulation will follow. AI means we can either go faster, or focus on ironing out every last bug with the time saved, and politicians will focus on the latter instead of allowing a mortgage meltdown in the prime credit market. Everyone stays employed while the bar goes higher.

chromaton•16m ago

Historically, the cycle has been requirements -> code -> test, but with coding becoming much faster, the bottlenecks have changed. That's one of the reasons I've been working on Spark Runner to help automate testing for web apps: https://github.com/simonarthur/spark-runner

bensyverson•6m ago

It comes down to trust. I was not able to trust GPT 4.1 or Sonnet 3.5 with anything other than short, well-specified tasks. If I let them go too long (e.g. in long Cursor sessions), they would lose the plot and start thrashing.

With better models and harnesses (e.g. Claude Code), I can now trust the AI more than I would trust a junior developer in the past.

I still review Claude's plans before it begins, and I try out its code after it finishes. I do catch errors on both ends, which is why I haven't taken myself out of the loop yet. But we're getting there.

Most of the time, the way I "verify" the code is behavioral: does it do what it's supposed to do? Have I tried sufficient edge cases during QA to pressure-test it? Do we have good test coverage to prevent regressions and check critical calculations? That's about as far as I ever took human code verification. If anything, I have more confidence in my codebases now.
