frontpage.

AlphaFace: High Fidelity and Real-Time Face Swapper Robust to Facial Pose

https://arxiv.org/abs/2601.16429
1•PaulHoule•1m ago•0 comments

Scientists discover “levitating” time crystals that you can hold in your hand

https://www.nyu.edu/about/news-publications/news/2026/february/scientists-discover--levitating--t...
1•hhs•3m ago•0 comments

Rammstein – Deutschland (C64 Cover, Real SID, 8-bit – 2019) [video]

https://www.youtube.com/watch?v=3VReIuv1GFo
1•erickhill•3m ago•0 comments

Tell HN: Yet Another Round of Zendesk Spam

1•Philpax•3m ago•0 comments

Postgres Message Queue (PGMQ)

https://github.com/pgmq/pgmq
1•Lwrless•7m ago•0 comments

Show HN: Django-rclone: Database and media backups for Django, powered by rclone

https://github.com/kjnez/django-rclone
1•cui•10m ago•1 comments

NY lawmakers proposed statewide data center moratorium

https://www.niagara-gazette.com/news/local_news/ny-lawmakers-proposed-statewide-data-center-morat...
1•geox•11m ago•0 comments

OpenClaw AI chatbots are running amok – these scientists are listening in

https://www.nature.com/articles/d41586-026-00370-w
2•EA-3167•12m ago•0 comments

Show HN: AI agent forgets user preferences every session. This fixes it

https://www.pref0.com/
4•fliellerjulian•14m ago•0 comments

Introduce the Vouch/Denouncement Contribution Model

https://github.com/ghostty-org/ghostty/pull/10559
2•DustinEchoes•16m ago•0 comments

Show HN: SSHcode – Always-On Claude Code/OpenCode over Tailscale and Hetzner

https://github.com/sultanvaliyev/sshcode
1•sultanvaliyev•16m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/microsoft-appointed-a-quality-czar-he-has-no-direct-reports-and-no-b...
2•RickJWagner•18m ago•0 comments

Multi-agent coordination on Claude Code: 8 production pain points and patterns

https://gist.github.com/sigalovskinick/6cc1cef061f76b7edd198e0ebc863397
1•nikolasi•18m ago•0 comments

Washington Post CEO Will Lewis Steps Down After Stormy Tenure

https://www.nytimes.com/2026/02/07/technology/washington-post-will-lewis.html
7•jbegley•19m ago•1 comments

DevXT – Building the Future with AI That Acts

https://devxt.com
2•superpecmuscles•20m ago•4 comments

A Minimal OpenClaw Built with the OpenCode SDK

https://github.com/CefBoud/MonClaw
1•cefboud•20m ago•0 comments

The silent death of Good Code

https://amit.prasad.me/blog/rip-good-code
3•amitprasad•20m ago•0 comments

The Internal Negotiation You Have When Your Heart Rate Gets Uncomfortable

https://www.vo2maxpro.com/blog/internal-negotiation-heart-rate
1•GoodluckH•22m ago•0 comments

Show HN: Glance – Fast CSV inspection for the terminal (SIMD-accelerated)

https://github.com/AveryClapp/glance
2•AveryClapp•23m ago•0 comments

Busy for the Next Fifty to Sixty Bud

https://pestlemortar.substack.com/p/busy-for-the-next-fifty-to-sixty-had-all-my-money-in-bitcoin-...
1•mithradiumn•23m ago•0 comments

Imperative

https://pestlemortar.substack.com/p/imperative
1•mithradiumn•24m ago•0 comments

Show HN: I decomposed 87 tasks to find where AI agents structurally collapse

https://github.com/XxCotHGxX/Instruction_Entropy
2•XxCotHGxX•28m ago•1 comments

I went back to Linux and it was a mistake

https://www.theverge.com/report/875077/linux-was-a-mistake
3•timpera•29m ago•1 comments

Octrafic – open-source AI-assisted API testing from the CLI

https://github.com/Octrafic/octrafic-cli
1•mbadyl•31m ago•1 comments

US Accuses China of Secret Nuclear Testing

https://www.reuters.com/world/china/trump-has-been-clear-wanting-new-nuclear-arms-control-treaty-...
3•jandrewrogers•31m ago•2 comments

Peacock. A New Programming Language

2•hashhooshy•36m ago•1 comments

A postcard arrived: 'If you're reading this I'm dead, and I really liked you'

https://www.washingtonpost.com/lifestyle/2026/02/07/postcard-death-teacher-glickman/
4•bookofjoe•37m ago•1 comments

What to know about the software selloff

https://www.morningstar.com/markets/what-know-about-software-stock-selloff
2•RickJWagner•41m ago•0 comments

Show HN: Syntux – generative UI for websites, not agents

https://www.getsyntux.com/
3•Goose78•42m ago•0 comments

Microsoft appointed a quality czar. He has no direct reports and no budget

https://jpcaparas.medium.com/ab75cef97954
2•birdculture•42m ago•0 comments

Why agents do not write most of our code – A reality check

https://octomind.dev/blog/why-agents-do-not-write-most-of-our-code-a-reality-check
75•birdculture•2mo ago

Comments

another_twist•2mo ago
Eerily maps to my experience almost word for word. I had Codex write a chunk of code step by step, with guidance and whatnot. I had to spend days cleaning up the mess.
vidarh•2mo ago
My experience is that if AI creates the mess, AI should clean it up, and it usually can if you put it in a suitable agent loop: one that does a review, hands off small, well-defined cleanup steps to an agent, and runs the test suites.

If you review the first-stage output from the AI manually, you're wasting time.

You still need to review the final outputs, but reviewing the initial output is like demanding that a developer hand over code they've just barely got working and then pointing out all of its issues without giving them a chance to clean it up first. It's not helpful to anyone unless your time costs the business less than the AI's time.
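
Roughly this shape of loop, as a sketch (run_agent() and the pytest call are stand-ins for whatever agent and test runner you actually use, not any particular tool's API):

    import subprocess

    def tests_pass() -> bool:
        # Green tests gate every cleanup step.
        return subprocess.run(["pytest", "-q"]).returncode == 0

    def run_agent(prompt: str) -> str:
        # Placeholder: invoke your coding agent (CLI or API), let it edit
        # the working tree, and return its textual summary.
        raise NotImplementedError

    def cleanup_loop(max_rounds: int = 3) -> None:
        for _ in range(max_rounds):
            # Ask for a review that yields small, well-defined cleanup tasks.
            review = run_agent("Review the latest changes and list small, "
                               "independent cleanup tasks, one per line.")
            tasks = [t.strip() for t in review.splitlines() if t.strip()]
            if not tasks:
                break  # the reviewer thinks no further refinement is needed
            # Hand each task to the agent separately, gated by the test suite.
            for task in tasks:
                run_agent(f"Apply exactly this cleanup and nothing else: {task}")
                if not tests_pass():
                    run_agent("The test suite fails; fix or revert your last change.")
        # Only now does a human review the final diff.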

Zardoz84•2mo ago
AI reviewing code generated by AI is a recipe for disaster.
vidarh•2mo ago
That's categorically not true, as long as there's a human reviewer at the end of the chain. It can usually continue to deliver actual improvements over several iterations (just like a human would).

That does not mean you can get away with not reviewing it. But you can, with substantial benefit, defer reviewing it until an AI review pass thinks the code doesn't need further refinement. It probably still does need refinement despite the AI's say-so (and sometimes it needs throwing away), but in my experience it's also highly likely to need less, and to take less time to review.

mprast•2mo ago
great stuff; I've had almost exactly the same experience. I think blow-by-blow writeups like this are a sorely needed antidote to the hype
reaslonik•2mo ago
One thing I find constantly causes pain for users is the assumption that any of these models are thinking, when in reality they're completing a sentence. This might seem like a nitpick at first, but it's a huge deal in practice: if you ask a language model to evaluate whether a solution is right, it's not evaluating the solution, it's giving you a statistically likely next sentence in which both yes and no are fairly common. If you tell it it's wrong, the likely next sentence is something affirming that, but it doesn't really make a difference.

The only way to use a tool like this is to give it a problem that fits in context, evaluate the solution it spits out at you, and re-roll if it wasn't correct. Don't tell a language model to think, because it can't and won't; that's just a far less efficient way of re-rolling the solution.
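
A sketch of what re-rolling means here (generate() and the correctness check are placeholders; the point is that each attempt is a fresh generation judged by something deterministic, not a follow-up argument in the same chat):

    import random

    def generate(prompt: str, seed: int) -> str:
        # Placeholder: one independent completion from whatever model you use.
        raise NotImplementedError

    def solution_is_correct(code: str) -> bool:
        # A deterministic check you trust (tests, type checker, compiler),
        # never the model's own opinion of its answer.
        raise NotImplementedError

    def solve(prompt: str, max_rolls: int = 5) -> str | None:
        for _ in range(max_rolls):
            candidate = generate(prompt, seed=random.randrange(2**32))
            if solution_is_correct(candidate):
                return candidate
        return None  # give up rather than asking the model to "think harder"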

giuscri•2mo ago
but it’s also true that the next sentence is generated by evaluating the whole conversation including the proposed solution.

my mental model is that the llm learned to predict what another person would say just by looking at that solution.

so it’s really telling whether the solution is likely (likely!) to be right or wrong

ben_w•2mo ago
Slight quibble, but the reinforcement learning from human feedback means they're trained (somewhat) on what the specific human asking the question is likely to consider right or wrong.

This is both why they're sycophantic, and also why they're better than just median internet comments.

But this is only a slight quibble, because what you say is also somewhat true, and why they have such a hard time saying "I don't know".

giuscri•2mo ago
idk… maybe we'll find out the reason is that on the internet no one ends a conversation saying "i don't know" :D
ben_w•2mo ago
That's my point :)
pietz•2mo ago
Can you go into a bit more detail why the two approaches are so different in your opinion?

I don't think I agree and I want to understand this argument better.

ismailmaj•2mo ago
I'm guessing the argument is that LLMs get worse on problems they haven't seen before, so you may assume they think for problems that are commonly discussed on the internet or seen on GitHub, but once you step out of that zone, you get plausible but logically false results.

That, or it's a reductive fallacy. In either case I'm not convinced; IMO they are just not smart enough (either due to a lack of complexity in the architecture or to training that didn't help them generalize reasoning patterns).

nijave•2mo ago
They regurgitate what they're trained on so they're largely consensus based. However, the consensus can be frequently wrong--especially when the information is outdated

Someone with the ability to "think" should be able to separate oft repeated fiction from fact

nijave•2mo ago
>The only way to use a tool like this is to give a problem that fits context

Or give context to the model which fits the problem. That's more of an art than a science at this point it seems

I think the people with better success are those who are better at writing prompts, but that's non-trivial.

sunir•2mo ago
You’re right and wrong at the same time. A quantum superposition of validity.

The word "thinking" is doing too much work in your argument, but arguably "assume it's thinking" is not doing enough.

The models do compute and can reduce entropy; however, they don't work the way we presume, because we assume every intelligence is human, or more accurately, the same as our own mind.

To see the algorithm for what it is, you can make it work through a logical set of steps from input to output but it requires multiple passes. The models use a heuristic pattern matching approach to reasoning instead of a computational one like symbolic logic.

While the algorithms are computed, the virtual space in which the input is transformed into the output is not computational.

The models remain incredible and remarkable but they are incomplete.

Further, there is a huge garbage-in, garbage-out problem, as the input to the model often lacks enough information to decide on the next transformation to the code base. That's part of the illusion of conversationality that tricks us into thinking the algorithm is like a human.

AI has always had human reactions like this. Eliza was surprisingly effective, right?

It may be that average humans are not capable of interacting with an AI reliably because the illusion is overwhelming for instinctive reasons.

As engineers we should try to accurately assess and measure what is actually happening so we can predict and reason about how the models fit into systems.

stray•2mo ago
I get that a submarine can't swim.

I'm just not so sure of the importance of the difference between swimming and whatever the word is for how a submarine moves.

If it looks like thinking and quacks like thinking...

netdevphoenix•2mo ago
This has always been true. The difference is that now more people are admitting it. While you could argue that LLMs have junior-level capabilities, they definitely do not have junior-level self-reflection or self-awareness or self-anything. They fundamentally don't learn, where learning means being significantly less likely to fail at a given class of task after being taught about it. The same goes for even just the ability to ask for help: these agents choose to generate unusable code over stopping and asking for help or guidance, which implies that they are unable to tell their own limits, skill-wise, knowledge-wise, etc.

Frankly, I have been highly concerned seeing all the transformer hype in here when the gains people claim cannot be reliably replicated everywhere.

The financial incentives to make transformer tech work as it is being sold (even when it might not be cost-effective) need close attention, because to me it looks a bit too much like blockchain or big data.

graphememes•2mo ago
Every time I read a post about this, none of the prompts are shared, and when I do see the actual commands and how the AI is being driven, I realize that the person driving is not experienced at it. AIs will make a best attempt; you can see this by looking at the reasoning/thinking output. Additionally, the temperature is usually pretty moderate (4.5-8), so you'll see heavy "creative liberties" taken. You need to account for that: you have to show it the right and the wrong way to do things. I don't usually use agents or AI for things that are one-offs but not copy-and-paste, or for deep-thinking/critical tasks that require human thought, where AI wouldn't be able to do it.

For all the other, trivial things, I can delegate those out to it and expect junior results when I give it sub-optimal guidance; however, through nominal and/or extensive guidance I can get adequate to near-perfect results.

Another dimension that really matters here is the actual model used; not every model is the same.

Also, if the AI does something wrong, have it assess why things went wrong, revert back to the previous checkpoint and integrate that into the plan.

You're driving; you are, ultimately, in control. Learn to drive. It's a tool: it can be adjusted, you can modify the output, you can revert, and you can also just not use it. But if you do actually learn how to use it, you'll find it can speed up your process. It is not a cure-all, though; it's good in certain situations, just like a hammer.
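
A rough sketch of two of those knobs, using the OpenAI Python client purely for illustration (any provider with a system prompt and a temperature parameter works the same way; the model name and prompts are made up, and the 0.2 is on OpenAI's 0-2 scale):

    from openai import OpenAI

    client = OpenAI()

    SYSTEM = """You are editing our codebase.
    Right way: small diffs, follow existing patterns, add a test with each change.
    Wrong way: rewriting whole files, pulling in new dependencies unprompted."""

    resp = client.chat.completions.create(
        model="gpt-4o",        # whichever model you actually use
        temperature=0.2,       # lower temperature, fewer "creative liberties"
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": "Add pagination to the /users endpoint."},
        ],
    )
    print(resp.choices[0].message.content)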

davidclark•2mo ago
On the other hand, when people who claim success with AI share their prompts, I see all the same misses and flaws that keep me from fully buying in. For the person though, it seems like they gloss over these errors and claim wild success. Their prompts never actually seem that different from the ones that fail me as well.

It seems like “you’re not doing it correctly” is just a rationalization to protect the pro-AI person’s established opinion.

_boffin_•2mo ago
Nobody does it correctly, AI or not.

It's about breaking the problem down into epics, tasks, and acceptance criteria that get reviewed. Review the written code and adjust as needed.

Tests… a lot of tests.
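
A minimal sketch of what that breakdown can look like in practice (the dataclasses and the pytest invocation are illustrative only, not any particular tool):

    from dataclasses import dataclass, field
    import subprocess

    @dataclass
    class Task:
        description: str
        acceptance_tests: list[str]   # pytest node IDs that must pass

    @dataclass
    class Epic:
        title: str
        tasks: list[Task] = field(default_factory=list)

    def task_done(task: Task) -> bool:
        # A task counts as done only when its acceptance tests pass.
        return subprocess.run(["pytest", "-q", *task.acceptance_tests]).returncode == 0

    checkout = Epic("Checkout flow", [
        Task("Create POST /orders endpoint",
             ["tests/test_orders.py::test_create_order"]),
        Task("Reject orders with an empty cart",
             ["tests/test_orders.py::test_empty_cart_rejected"]),
    ])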

raflueder•2mo ago
I had a similar experience a couple of months ago where I decided to give it a go and "vibe code" a small TUI to get a feel for the workflow.

I used Claude Code, and while the end result works (kinda), I noticed I was less satisfied with the process. More importantly, I now had to review "someone else's" code instead of writing it myself; I had no idea of the internal workings of the application, and it felt like starting at day one on a new codebase. It shifted my way of working from thinking/writing into reviewing/giving feedback, which for me personally is way less mentally stimulating and rewarding.

There were def. some "a-ha" moments where CC came up with certain suggestions I wouldn't have thought of myself, but those were only a small fraction of the total output, and there's def. a dopamine hit from seeing all that code being spit out so fast.

Used as a prototyping tool to quickly test an idea, it seems like a good fit, but there should be better tooling around taking that prototype, splitting it into manageable parts, and sharing the reasoning behind it, so I can then rework it with the necessary understanding to move it forward.

For now I've decided to stick to code completion, writing unit tests, commit messages, refactoring short snippets, and CHANGELOG updates; it does fairly well on all of those small, very focused tasks, and the time saved on them ends up being a net positive.

mnky9800n•2mo ago
> Used as a prototyping tool to quickly test an idea seems to be a good use case but there should be better tooling around taking that prototype, splitting it into manageable parts, sharing the reasoning behind it so I can then rework it so I have the necessary understanding to move it forward.

This would be amazing. I think Claude Code is a great prototyping tool, but I agree, you don't really learn your codebase. But I think that is okay for a prototype if you just want to see whether the idea works at all. Then you can restart, as you say, with some scaffolding to implement it better.

pramodbiligiri•2mo ago
Great article.

One thing I was wondering after looking at the list of items in the “Cursor agent produced a coding plan” image: do folks actually make such lists when developing a feature without AI assistants?

That list has items like "Create API endpoints for …", "Write tests …". If you're working on a feature that's within a single codebase and doesn't involve dependencies on other systems or teams, isn't that a lot of ceremony for what you'll eventually end up doing anyway (and are only likely to miss through oversight)?

I see a downside to such lists, because when I see a dozen items lined up like that… who knows whether they’re all the right ones for the feature at hand? Or whether the feature needs some other change entirely, or whether you’ve figured out the right order to do them in?

Where I’ve seen such fine-grained lists have value is for task timeline estimation, but rarely for the actual implementation.