Over-editing refers to a model modifying code beyond what is necessary

https://nrehiew.github.io/blog/minimal_editing/

135•pella•1h ago

Comments

whinvik•1h ago

Yeah I have always felt GPT 5.4 does too much. It is amazing at following instructions precisely but it convinces itself to do a bit too much.

I am surprised Gemini 3.1 Pro is so high up there. I have never managed to make it work reliably so maybe there's some metric not being covered here.

eterm•1h ago

It's funny, because the wisdom that was often taught ( but essentially never practiced ) was "Refactor as you go".

The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.

In practice, it was seldom done, and here we have LLMs actually doing it, and we're realising the drawbacks.

hyperpape•1h ago

That's a real question, maybe the changes are useful, though I think I'd like to see some examples. I do not trust cognitive complexity metrics, but it is a little interesting that the changes seem to reliably increase cognitive complexity.

ramesh31•1h ago

>The idea being that if you're working in an area, you should refactor and tidy it up and clean up "tech debt" while there.

This is horrible practice, and very typical junior behavior that needs to be corrected against. Unless you wrote it, Chesterton's Fence applies; you need to think deeply for a long time about why that code exists as it does, and that's not part of your current task. Nothing worse than dealing with a 1000 line PR opened for a small UI fix because the code needed to be "cleaned up".

cassianoleal•51m ago

That is the flip side of what you're arguing against, and is also very typical junior behaviour that needs to be corrected against.

Tech debt needs to be dealt with when it makes sense. Many times it will be right there and then as you're approaching the code to do something else. Other times it should be tackled later with more thought. The latter case is frequently a symptom of the absence of the former.

In Extreme Programming, that's called the Boy Scouting Rule.

https://furqanramzan.github.io/clean-code-guidelines/princip...

traderj0e•26m ago

The Boy Scout "leave it better than you found it" is a good rule to follow. All code has its breaking points, so when you're adding a new feature and find that the existing code doesn't support it without hacks, it probably needs a refactor.

localhoster•55m ago

So I think there's some more nuance than that. A lot of the time, the abstraction is solid enough for you to work with that code area, i.e. tracking down some bug or extending a functionality. But sometimes you find yourself at a crossroads: either hack around the existing implementation, or rethink it. With LLMs, how do you even rethink it? Does it even matter to rethink it? And anyhow, those decisions are hidden away from you.

traderj0e•12m ago

It's only hidden if you don't read the code. Even if you don't, at some point you'll notice the LLM starting to struggle.

bluefirebrand•51m ago

There is a pretty substantial difference between "making changes" and "refactoring"

If LLMs are doing sensible and necessary refactors as they go then great

I have basically zero confidence that is actually the case though

hirako2000•48m ago

When the model writes new code doing the same thing as existing logic that's not a refactor.

At times even when a function is right there doing exactly what's needed.

Worse, when it modifies a function that exists, supposedly maintaining its behavior, but breaks for other use cases. Good try I guess.

Worst. Changing state across classes not realising the side effect. Deadlock, or plain bugs.

aerhardt•48m ago

When they decide to touch something as they go, they often don't improve it. Not what I would call "refactoring" but rather a yank of the slot machine's arm.

raincole•43m ago

Really? I've never heard it's considered wise to put refactoring and new features (or bugfixes) in the same commit. Everyone I know, at every place I've seen, considers it bad: anywhere from harmful to grounds for outright rejection in code review.

"Refactor-as-you-go" means to refactor right after you add features / fix bugs, not like what the agent does in this article.

jstanley•1h ago

Conversely, I often find coding agents privileging the existing code when they could do a much better job if they changed it to suit the new requirement.

I guess it comes down to how ossified you want your existing code to be.

If it's a big production application that's been running for decades then you probably want the minimum possible change.

If you're just experimenting with stuff and the project didn't exist at all 3 days ago then you want the agent to make it better rather than leave it alone.

Probably they just need to learn to calibrate themselves better to the project context.

_pastel•19m ago

The tradeoff is highly contextual; it's not a tradeoff an agent can always make by inspecting the project itself.

Even within the same project, for a given PR, there are some parts of the codebase I want to modify freely and some that I want fixed to reduce the diff and testing scope.

I try to explain up-front to the agent how aggressively they can modify the existing code and which parts, but I've had mixed success; usually they bias towards a minimal diff even if that means duplication or abusing some abstractions. If anyone has had better success, I'd love to hear your approach.

anonu•1h ago

Here, the author means the agent over-edits code. But agents also do "too much": as in they touch multiple files, run tests, do deployments, run smoke tests, etc... And all of this gets abstracted away. On one hand, it's incredible. But on the other hand I have deep anxiety over this:

1. I have no real understanding of what is actually happening under the hood. The ease of just accepting a prompt to run some script the agent has assembled is too enticing. But, I've already wiped a DB or two just because the agent thought it was the right thing to do. I've also caught it sending my AWS credentials to deployment targets when it should never do that.

2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fall back to the "crutch" of using AI.

Barbing•1h ago

Must consider ourselves lucky for having the intuition to notice skill stagnation and atrophy.

Only helps if we listen to it :) which is fun b/c it means staying sharp which is inherently rewarding

ok_dad•1h ago

Why are you letting the LLM drive? Don't turn on auto-approve, approve every command the agent runs. Don't let it make design or architecture decisions, you choose how it is built and you TELL that clanker what's what! No joke, if you treat the AI like a tool then you'll get more mileage out of it. You won't get 10x gains, but you will still understand the code.

giraffe_lady•51m ago

I agree with this too. I decided on constraints for myself around these tools and I give my complete focus & attention to every prompt, often stopping for minutes to figure things through and make decisions myself. Reviewing every line they produce. I'm a senior dev with a lot of experience with pair programming and code review, and I treat its output just as I would those tasks.

It has about doubled my development pace. An absolutely incredible gain in a vacuum, though tiny compared to what people seem to manage without these self-constraints. But in exchange, my understanding of the code is as comprehensive as if I had paired on it, or merged a direct report's branch into a project I was responsible for. A reasonable enough tradeoff, for me.

soiltype•26m ago

Personally I've found "carefully review every move it makes" to be an extremely unpleasant and difficult workflow. The effort needed to parse every action is immense, but there's a complete absence of creative engagement - no chance of flow state. Just the worst kind of work which I've been unable to sustain, unfortunately. At this point I mostly still do work by hand.

JoshTriplett•15m ago

I agree, but I also think that giving the LLM free rein is also extremely unpleasant and difficult. And you still need to review the resulting code.

andoando•23m ago

Because it's SO much faster not to have to do all that. I think 10x is no joke, and if you're doing MVP, it's just not worth the mental effort.

wahnfrieden•16m ago

It’s terribly slow

applfanboysbgon•10m ago

This is significantly slower than just writing the code yourself.

onlyrealcuzzo•1h ago

> 2. I've learned nothing. So the cognitive load of doing it myself, even assembling a simple docker command, is just too high. Thus, I repeatedly fall back to the "crutch" of using AI.

I'm not trying to be offensive, so with all due respect... this sounds like a "you" problem. (And I've been there, too)

You can ask the LLMs: how do I run this, how do I know this is working, etc etc.

Sure... if you really know nothing or you put close to zero effort into critically thinking about what they give you, you can be fooled by their answers and mistake complete irrelevance or bullshit for evidence that something works, or that it is suitably tested to prove that it works, etc.

You can ask 2 or 3 other LLMs: check their work, is this conclusive, can you find any bugs, etc etc.

But you don't sound like you know nothing. You sound like you're rushing to get things done, cutting corners, and you're getting rushed results.

What do you expect?

Their work is cheap. They can pump out $50k+ worth of features in a $200/mo subscription with minimal baby-sitting. Be EAGER to reject their work. Send it back to them over and over again to do it right, for architectural reviews, to check for correctness, performance, etc.

They are not expensive people with feelings you need to consider in review, that might quit and be hard to replace. Don't let them cut corners. For whatever reason, they are EAGER to cut corners no matter how much you tell them not to.

raincole•59m ago

It never ceases to scare me how they just run python code I didn't write via:

> python <<'EOF'

> ${code the agent wrote on the spot}

> EOF

I mean, yeah, in theory it's just as dangerous as running arbitrary shell commands, which the agent is already doing anyway, but still...

harikb•54m ago

On the credentials point. Here is what I find.

Day 1: Carefully handles the creds, gives me a lecture (without asking) about why .env should be in .gitignore and why I should edit .env and not hand over the creds to it.

Day 2: I ask for a repeat, has lost track of that skill or setting, frantically searches my entire disk, reads .env including many other files, understands that it is holding a token, manually creates curl commands to test the token and then comes back with some result.

It is like it is a security expert on Day 1 and an absolutely mediocre intern on Day 2

eterm•34m ago

I found the same, it was super careful handling the environment variable until it hit an API error, and I caught in its thinking "Let me check the token is actually set correctly" and it just echoed the token out.

( This was low-stakes test creds anyway which I was testing with thankfully. )

I never pass creds via env or anything else it can access now.

My approach now is to get it to write me linqpad scripts, which has a utility function to get creds out of a user-encrypted share, or prompts if it's not in the store.

This works well, but requires me to run the scripts and guide it.

Ultimately, fully autonomous isn't compatible with secrets. Otherwise, if it really wanted to inspect it, then it could just redirect the request to an echo service.

The only real way is to deal with it the same way we deal with insider threat.

A proxy layer / secondary auth, which injects the real credentials. Then give Claude its own user within that auth system, so it owns those creds. Now responsibility can be delegated to it without exposing the original credentials.

That's a lot of work when you're just exploring an API or DB or similar.
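A rough sketch of what that proxy layer could look like, with the caveat that everything here is hypothetical (the `CredentialProxy` class, the `Request` type, the host name and the token strings are made up for illustration, and nothing is actually sent over the network). The agent only ever holds its own token; the proxy swaps in the real secret at forwarding time and refuses any host other than the intended one, which is what blocks the "redirect to an echo service" trick:

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Request:
    """Hypothetical stand-in for an outgoing HTTP request."""
    url: str
    headers: dict = field(default_factory=dict)


class CredentialProxy:
    """Swaps an agent-scoped token for the real secret (illustrative only)."""

    def __init__(self, real_secret: str, allowed_host: str, agent_tokens: set):
        self._real_secret = real_secret
        self._allowed_host = allowed_host
        self._agent_tokens = set(agent_tokens)

    def revoke(self, agent_token: str) -> None:
        # Cutting off the agent does not require rotating the real secret.
        self._agent_tokens.discard(agent_token)

    def rewrite(self, request: Request) -> Request:
        # Only forward to the intended upstream, never to an arbitrary host.
        if urlparse(request.url).hostname != self._allowed_host:
            raise PermissionError("host not allowed")
        token = request.headers.get("Authorization", "").removeprefix("Bearer ")
        if token not in self._agent_tokens:
            raise PermissionError("unknown or revoked agent token")
        # Inject the real credential; the agent never sees this value.
        headers = {**request.headers,
                   "Authorization": f"Bearer {self._real_secret}"}
        return Request(request.url, headers)
```

The proxy has to run somewhere the agent cannot read (a separate user or machine), otherwise the real secret is just one `cat` away again.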

cortesoft•44m ago

While I share some of the feelings about 'not understanding what is actually happening under the hood', I can't help but think about how this feeling is the exact same response that programmers had when compilers were invented:

https://vivekhaldar.com/articles/when-compilers-were-the--ai...

We are completely comfortable now letting the compilers do their thing, and never seem to worry that we "don't know what is actually happening under the hood".

I am not saying these situations are exactly analogous, but I am saying that I don't think we can know yet if this will be one of those things that we stop worrying about or it will be a serious concern for a while.

ManuelKiessling•38m ago

(I’m saying this as someone who uses AI for coding a lot and mostly loves it) Yeah, but is that really the same? Compilers work deterministically: if it works once, it will work always. LLMs are a different story for now.

betenoire•25m ago

Said another way, compilers are a translation of existing formal code. Compilers don't add features, they don't create algorithms (unrolling, etc., notwithstanding), they are another expression of the same encoded solution.

LLMs are nothing like that

cortesoft•7m ago

LLMs are just translating text into output, too, and are running on deterministic computers like every other bit of code we run. They aren't magic.

It is just the scope that makes it appear non-deterministic to a human looking at it, and it is large enough to be impossible for a human to follow the entire deterministic chain, but that doesn't mean it isn't in the end a function that translates input data into output data in a deterministic way.

cortesoft•10m ago

LLMs are deterministic, too. I know there is randomness in choosing tokens, but that randomness is derived from a random seed that can be repeated.
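A toy sketch of the seed argument, assuming a made-up three-token vocabulary and made-up probabilities (a real model computes the distribution from its logits at every step; `VOCAB`, `PROBS` and `sample_tokens` are hypothetical names):

```python
import random

# Hypothetical next-token distribution, fixed here for illustration.
VOCAB = ["foo", "bar", "baz"]
PROBS = [0.5, 0.3, 0.2]


def sample_tokens(seed: int, n: int = 5) -> list:
    """Sample n tokens from the toy distribution with a seeded private RNG."""
    rng = random.Random(seed)  # same seed -> same sequence of draws
    return rng.choices(VOCAB, weights=PROBS, k=n)


first_run = sample_tokens(seed=42)
second_run = sample_tokens(seed=42)
print(first_run == second_run)
```

This only shows that sampling is repeatable when the seed and the distribution are held fixed. It says nothing about whether a hosted model lets you pin the seed, or whether the arithmetic that produces the distribution is bit-for-bit reproducible across runs.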

mnkypete•35m ago

Except that compilers are (at least to a large degree) deterministic. It's complexity that you don't need to worry about. You don't need to review the generated assembly. You absolutely need to review AI generated code.

cortesoft•11m ago

At the end of the day, LLMs are also deterministic. They are running on computers just like all software, and if you have all the same data and random seeds, and you give the same prompt to the same LLM, you will get back the exact same response.

mathieudombrock•31m ago

A major difference is that _someone_ knew what was going on (compiler devs).

cortesoft•13m ago

That is an interesting difference, I agree.

Although, while the compiler devs might know what was going on in the compiler, they wouldn't know what the compiler was doing with that particular bit of code that the FORTRAN developer was writing. They couldn't possibly foresee every possible code path that a developer might traverse with the code they wrote. In some ways, you could say LLMs are like that, too; the LLM developers know how the LLM code works, but they don't know the end result with all the training data and what it will do based on that.

In addition, to the end developer writing FORTRAN it was a black box either way. Sure, someone else knows how the compiler works, but not the developer.

msteffen•31m ago

I think about this a lot, though one paragraph from that article:

> Many assembly programmers were accustomed to having intimate control over memory and CPU instructions. Surrendering this control to a compiler felt risky. There was a sentiment of, if I don’t code it down to the metal, how can I trust what’s happening? In some cases, this was about efficiency. In other cases, it was about debuggability and understanding programming behavior. However, as compilers matured, they began providing diagnostic output and listings that actually improved understanding.

I would 100% use LLMs more and more aggressively if they were more transparent. All my reservations come from times when I prompt “change this one thing” and it rewrites my db schema for some reason, or adds a comment that is actively wrong in several ways. I also think I have a decent working understanding of the assembly my code compiles to, and do occasionally use https://godbolt.org/. Of course, I didn’t start out that way, but I also don’t really have any objections to teenagers vibe-coding games, I just think at some point you have to look under the hood if you’re serious.

cortesoft•18m ago

> I would 100% use LLMs more and more aggressively if they were more transparent. All my reservations come from times when I prompt “change this one thing” and it rewrites my db schema for some reason, or adds a comment that is actively wrong in several ways.

Isn't that what git is for, though? Just have your LLM work in a branch, and then you will have a clear record of all the changes it made when you review the pull request.

nextaccountic•4m ago

The difference is that compilers are supposed to be deterministic, and low-level-inclined people often investigate compiler bugs (especially performance bugs) and can pinpoint some deterministic code that triggered it. Fix the underlying code and it stops misbehaving with high assurance

A non deterministic compiler is probably defective and in any case much less useful

lucasgerads•9m ago

I usually try to review all the code written by Claude. And also let Claude review all the code that I write. So, usually I have some understanding of what is going on. And Claude definitely sometimes makes "unconventional" decisions. But if you are working on a large code base with other team members (some of whom may already have left the company), there are also large parts of the code that one doesn't understand and that are abstracted away.

itopaloglu83•1h ago

I always described it as over-complicating the code, but doing too much is a better diagnosis.

slopinthebag•1h ago

I think the industry has leaned waaay too far into completely autonomous agents. Of course there are reasons why corporations would want to completely replace their engineers with fully autonomous coding agents, but for those of us who actually work developing software, why would we want less and less autonomy? Especially since it alienates us from our codebases, requiring more effort in the future to gain an understanding of what is happening.

I think we should move to semi-autonomous steerable agents, with manual and powerful context management. Our tools should graduate from simple chat threads to something more akin to the way we approach our work naturally. And a big benefit of this is that we won't need expensive locked down SOTA models to do this, the open models are more than powerful enough for pennies on the dollar.

NitpickLawyer•56m ago

I'm hearing this more and more, we need new UX that is better suited for the LLM meta. But none that I've seen so far have really got it, yet.

grttww•52m ago

When you steer a car, there isn’t this degree of probability about the output.

How do you emulate that with LLMs? I suppose the objective is to get variance down to the point it’s barely noticeable. But not sure it’ll get to that place based on accumulating more data and re-training models.

lo1tuma•57m ago

I’m not sure if I share the author's opinion. When I was hand-writing code I also followed the boy-scout rule and did smaller refactorings along the line.

exitb•57m ago

As mentioned in the article, prompting for minimal changes does help. I find GPT models to be very steerable, but it doesn’t mean much when you take your hands off the wheel. These types of issues should be solved at the planning stage.

Almured•56m ago

I feel ambivalent about it. In most cases, I fully agree with the overdoing assessment and then having to spend 30min correcting and fixing. But I also agree with the fact that sometimes the system is missing out on more comprehensive changes (context limitations I suppose)! I am starting to be very strict when coding with these tools but still not quite getting the level of control I would like to see

lopsotronic•55m ago

When asked to show their development-test path in the form of a design document or test document, I've also noticed variance between the document generated and what the chain-of-thought thingy shows during the process.

The version it puts down into documents is not the thing it was actually doing. It's a little anxiety-inducing. I go back to review the code with big microscopes.

"Reproducibility" is still pretty important for those trapped in the basements of aerospace and defense companies. No one wants the Lying Machine to jump into the cockpit quite yet. Soon, though.

We have managed to convince the Overlords that some teensy non-agentic local models - sourced in good old America and running local - aren't going to All Your Base their Internets. So, baby steps.

aerhardt•51m ago

I'm building a website in Astro and today I've been scaffolding localization. I asked Codex 5.4 x-high to follow the official guidelines for localization and from that perspective the implementation was good. But then it decides to re-write the copy and layout of all pages. They were placeholders, but still?

Codex also has a tendency to apply unwanted styles everywhere.

I see similar tendencies in backend and data work, but I somehow find it easier to control there.

I'm pretty much all in on AI coding, but I still don't know how to give these things large units of work, and I still feel like I have to read everything but throwaway code.

jasonjmcghee•45m ago

I never use xhigh due to overthinking. I find high nearly always works better.

Purely anecdotal.

magicalhippo•33m ago

You can steer it though. When I see it going off the reservation I steer it back. I also commit often, just about after every prompt cycle, so I can easily revert and pick up the ball in a fresh context.

But yeah, I saw a suggestion about adding a long-lived agent that would keep track of salient points (so kinda memory) but also monitor current progress by main agent in relation to the "memory" and give the main agent commands when it detects that the current code clashes with previous instructions or commands. Would be interesting to see if it would help.

traderj0e•24m ago

They also don't understand how exceptions work. They'll try-catch everything, print the error, and continue. If I see a big diff, I know it just added 10 try-catches in random parts of my codebase.
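To make the pattern concrete, a small sketch with hypothetical function names (`load_config_swallowing` and `load_config_strict` are made up for illustration, with JSON parsing standing in for whatever the real code does):

```python
import json


def load_config_swallowing(text: str) -> dict:
    """The pattern described above: catch everything, print, continue."""
    try:
        return json.loads(text)
    except Exception as exc:
        print(f"error: {exc}")
        # The caller can no longer tell "parse failed" from "empty config".
        return {}


def load_config_strict(text: str) -> dict:
    """Let the parse error propagate so the caller decides how to handle it."""
    return json.loads(text)
```

With malformed input the first version quietly hands back `{}` and the program carries on with a config that was never loaded, while the second raises `json.JSONDecodeError` at the point of failure.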

pilgrim0•46m ago

Like others mentioned, letting the agent touch the code makes learning difficult and induces anxiety. By introducing doubt it actually increases the burden of revision, negating the fast apparent progress. The way I found around this is to use LLMs for designing and auditing, not programming per se. Even more so because it’s terrible at keeping the coding style. Call it a skill issue, but I’m happier treating it as a lousy assistant rather than as a dependable peer.

pyrolistical•43m ago

I attempt to solve most agent problems by treating them as a dumb human.

In this case I would ask for smaller changes and justify every change. Have it look back upon these changes and have it ask itself are they truly justified or can it be simplified.

graybeardhacker•35m ago

I use Claude Code every day and have for as long as it has been available. I use git add -p to ensure I'm only adding what is needed. I review all code changes and make sure I understand every change. I prompt Claude to never change only whitespace. I ask it to be sure to make the minimal changes to fix a bug.

Too many people are treating the tools as a complete replacement for a developer. When you are typing a text to someone and Google changes a word you misspelled to a completely different word and changes the whole meaning of the text message do you shrug and send it anyway? If so, maybe LLMs aren't for you.

dbvn•26m ago

Don't forget the non-stop unnecessary comments

Flavius•24m ago

Token bonanza! Inference sellers love this simple trick.

tim-projects•25m ago

> The model fixes the bug but half the function has been rewritten.

The solution to this is to use quality gates that loop back and check the work.

I'm currently building a tool with gates and a diff regression check. I haven't seen these problems for a while now.

https://github.com/tim-projects/hammer
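For a sense of what a diff-size gate might check, here is a minimal sketch using Python's standard `difflib`. This is not how the linked tool works (its internals aren't described here); `changed_line_count`, `within_budget` and the sample snippets are hypothetical:

```python
import difflib


def changed_line_count(before: str, after: str) -> int:
    """Count lines removed from `before` plus lines added in `after`."""
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def within_budget(before: str, after: str, max_changed_lines: int) -> bool:
    """Gate: accept the edit only if it stays under the line budget."""
    return changed_line_count(before, after) <= max_changed_lines


BEFORE = "def f(x):\n    y = x + 1\n    return y * 2\n"
MINIMAL = "def f(x):\n    y = x + 2\n    return y * 2\n"
REWRITE = "def f(value):\n    result = (value + 2) * 2\n    return result\n"

print(within_budget(BEFORE, MINIMAL, max_changed_lines=2))  # one line replaced
print(within_budget(BEFORE, REWRITE, max_changed_lines=2))  # whole body rewritten
```

A line budget is a blunt instrument, since a legitimate fix can need a large diff, so a gate like this is probably better at flagging an edit for a second look than at rejecting it outright.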

Isolated_Routes•15m ago

I think building something really well with AI takes a lot of work. You can certainly ask it to do things and it will comply, and produce something pretty good. But you don't know what you don't know, especially when it speaks to you authoritatively. So checking its work from many different angles and making sure it's precise can be a challenge. Will be interesting to see how all of this iterates over time.

deepfriedbits•10m ago

I agree 100%. At the same time, I feel like this piece, and our comments on it are snapshots in time because of the rate of advancement in the industry. These coding models are already significantly better than they were even nine months ago.

I can't help but read complaints about the capabilities of AI – and I'm certainly not accusing you of complaining about AI, just a general thought – and think "Yet" to myself every time.

Isolated_Routes•2m ago

Exactly! I completely agree. I think figuring out how to use this new tool will develop into a bit of an art form, which we will race to keep up with.

simonw•15m ago

I've not seen over-editing in Claude Code or Codex in quite a while, so I was interested to see the prompts being used for this study.

I think they're in here, last edited 8 months ago: https://github.com/nreHieW/fyp/blob/5a4023e4d1f287ac73a616b5...

jollyllama•7m ago

It's called code churn. Generally, LLMs make code churn.

ricardorivaldo•1m ago

duplicated ? https://news.ycombinator.com/item?id=47866913
