Scaling LLMs to Larger Codebases

https://blog.kierangill.xyz/oversight-and-guidance

307•kierangill•1mo ago

Comments

rootnod3•1mo ago

Or why you shouldn't....

CuriouslyC•1mo ago

STAN'd to the top.

Decent article but it feels like a linkedin rehashing of stuff the people at the edge have already known for a while.

Aurornis•1mo ago

> but it feels like a linkedin rehashing of stuff the people at the edge have already known for a while.

You're not wrong, but it bears repeating to newcomers.

The average LLM user I encounter is still just hammering questions into the prompt and getting frustrated when the LLM makes the same mistakes over and over again.

Aurornis•1mo ago

> Making a prompt library useful requires iteration. Every time the LLM is slightly off target, ask yourself, "What could've been clarified?" Then, add that answer back into the prompt library.

I'm far from an LLM power user, but this is the single highest ROI practice I've been using.

You have to actually observe what the LLM is trying to do each time. Simply smashing enter over and over again or setting it to auto-accept everything will just burn tokens. Instead, see where it gets stuck and add a short note to CLAUDE.md or equivalent. Break it out into sub-files to open for different types of work if the context file gets large.
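A minimal Python sketch of that habit, assuming the context file is a plain Markdown bullet list. The `add_note` helper and the example note are hypothetical, not part of Claude Code or any other tool.

```python
from pathlib import Path


def add_note(context_file: Path, note: str) -> bool:
    """Append a one-line note to a context file unless it is already there.

    Returns True if the note was added, False if it was a duplicate.
    """
    note = note.strip()
    existing = context_file.read_text(encoding="utf-8") if context_file.exists() else ""
    # Compare against existing bullet lines so the same lesson is not recorded twice.
    known = {line.lstrip("- ").strip() for line in existing.splitlines()}
    if note in known:
        return False
    with context_file.open("a", encoding="utf-8") as fh:
        # Make sure the new bullet starts on its own line.
        if existing and not existing.endswith("\n"):
            fh.write("\n")
        fh.write(f"- {note}\n")
    return True
```

Calling something like `add_note(Path("CLAUDE.md"), "Run tests with make test")` after each stuck session keeps the file growing one short, deduplicated line at a time.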

Letting the LLM churn and experiment for every single task will make your token quota evaporate before your eyes. Updating the context file constantly is some extra work for you, but it pays off.

My primary use case for LLMs is exploring code bases and giving me summaries of which files to open, tracing execution paths through functions, and handing me the info I need. It also helps a lot to add some instructions for how to deliver useful results for specific types of questions.

CPLX•1mo ago

I'm with you on that, but I have to say I have been doing that aggressively, and it's pretty easy for Claude Code at least to ignore the prompts, commands, Markdown files, README, architecture docs, etc.

I feel like I spend quite a bit of time telling the thing to look at information it already knows. And I'm talking about when I HAVE actually created various documents to use and prompts.

As a specific example, it regularly just doesn't reference CLAUDE.md and it seems pretty random as to when it decides to drop that out of context. That's including right at session start when it should have it fresh.

Aurornis•1mo ago

> and it's pretty easy for Claude Code at least to ignore the prompts, commands, Markdown files, README, architecture docs, etc.

I would agree with that!

I've been experimenting with having Claude re-write those documents itself. It can take simple directives and turn them into hierarchical Markdown lists that have multiple bullet points. It's annoying and overly verbose for humans to read, but the repetition and structure seem to help the LLM.

I also interrupt it and tell it to refer back to CLAUDE.md if it gets too off track.

Like I said, though, I'm not really an LLM power user. I'd be interested to hear tips from others with more time on these tools.

zarp•1mo ago

> it seems pretty random as to when it decides to drop that out of context

Overcoming this kind of nondeterministic behavior around creating/following/modifying instructions is the biggest thing I wish I could solve with my LLM workflows. It seems like you might be able to do this through a system of Claude Code hooks, but I've struggled with finding a good UX for maintaining a growing and ever-changing collection of hooks.

Are there any tools or harnesses that attempt to address this and allow you to "force" inject dynamic rules as context?

lkjdsklf•1mo ago

Wouldn't it be great if we had some kind of deterministic language to precisely and concisely tell a computer what to do

oblio•1mo ago

Yeah, but that's hard and boring.

chairmansteve•1mo ago

Like Java or Python?

kierangill•1mo ago

Agreed here. A key theme, which isn’t terribly explicit in this post, is that your codebase is your context.

I’ve found that when my agent flies off the rails, it’s due to an underlying weakness in the construction of my program. The organization of the codebase doesn’t implicitly encode the “map”. Writing a prompt library helps to overcome this weakness, but I’ve found that the most enduring guidance comes from updating the codebase itself to be more discoverable.

fragmede•1mo ago

> my agent flies off the rails

Which, I've had it delete the entire project including .git out of "shame", so my claude doesn't get permission to run rm anymore.

Codex has fewer levers but it's deleted my entire project twice now.

(Play with fire, you're gonna get burnt.)

CPLX•1mo ago

Wait, what? Can you please describe this shame incident?

Also, I have extremely frequent commits and version control syncs to GitHub and so on as part of the process (including when it's working on documents or things that aren't code) as a way to counteract this.

Although I suppose a sufficiently devious AI can get around those, it seems to not have been a problem.

ewoodrich•1mo ago

Not OP, and haven't had it flat out rm the entire .git, but I have had Claude get flustered and pull a "Wait, no! what was I thinking? that idea doesn't work at all here, I need to revert that attempt and try something else..."

.. and then ran a fatally flawed "git checkout" command that wiped out all unstaged changes, which it immediately realized and after flailing around for five minutes trying to undo eventually came back saying "yeah uh so sorry, but... here's the thing..."

fragmede•1mo ago

Basically that, but the entire project directory got wiped out, not just .git/. Backups are your friend (Arq gets my vote), as well as committing often and pushing branches to the remote server that aren't supposed to get reviewed, just so you have a recent off-machine copy. Claude has a way to deny rm and unlink and you can find other various protections, up to actually sandboxing your yolo session in a VM.

For Claude Chrome, I highly recommend using a separate profile. I also blocked my bank.com (not just via /etc/hosts, but as this message is going to get harvested for training data, I unfortunately won't say what it is here. Email me if you really have to know - and promise you'll not just turn around and tell the whole Internet to AI) out of extra paranoia. Better paranoid and not got, than getting got, imo.

My rm interdiction script (which is far from 100%). https://gist.github.com/fragmede/96f35225c29cf8790f10b1668b8...
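To illustrate the general idea (this is not the contents of the gist above), here is a rough Python sketch of a checker that flags shell command lines invoking `rm`-like programs. The `BLOCKED` and `WRAPPERS` lists and the `is_blocked` name are assumptions made for the example, and like any such filter it is far from complete: it misses things like `FOO=1 rm x` or `find -delete`.

```python
import os
import shlex

# Programs treated as destructive in this sketch (an assumed, incomplete list).
BLOCKED = {"rm", "unlink", "rmdir"}
# Words that merely wrap the real program, which follows them.
WRAPPERS = {"sudo", "command", "exec", "nohup"}
# Tokens after which a new command begins.
SEPARATORS = {";", "&&", "||", "|", "&", "("}


def is_blocked(command: str) -> bool:
    """Return True if any command in the line runs a blocked program."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    try:
        tokens = list(lexer)
    except ValueError:
        # Unbalanced quotes: refuse rather than guess.
        return True
    at_command_start = True
    for token in tokens:
        if token in SEPARATORS:
            at_command_start = True
        elif at_command_start:
            if token in WRAPPERS:
                continue  # the real program is the next word
            # Treat /bin/rm and rm as the same program.
            if os.path.basename(token) in BLOCKED:
                return True
            at_command_start = False
    return False
```

A check like this only lowers the odds of an accident; backups and off-machine pushes remain the actual safety net.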

candiddevmike•1mo ago

Because, in my experience/conspiracy theory, the model providers are trying to make the models function better without having to have these kinds of workarounds. And so there's a disconnect where folks are adding more explicit instructions and the models are being trained to effectively ignore them under the guise of using their innate intuition/better learning/mixture of experts.

JonathanFly•1mo ago

> Every time the LLM is slightly off target, ask yourself, "What could've been clarified?"

Better than that, ask the LLM. Better than that, have the LLM ask itself. You do still have to make sure it doesn't go off the rails, but the LLM itself wrote this to help answer the question:

### Pattern 10: Student Pattern (Fresh Eyes)

*Concept:* Have a sub-agent read documentation/code/prompts "as a newcomer" to find gaps, contradictions, and confusion points that experts miss.

*Why it works:* Developers write with implicit knowledge they don't realize is missing. A "student" perspective catches assumptions, undefined terms, and inconsistencies.

*Example prompt:*
```
Task: "Student Pattern Review

Pretend you are a NEW AI agent who has never seen this codebase. Read these docs as if encountering them for the first time:
1. CLAUDE.md
2. SUB_AGENT_QUICK_START.md

Then answer from a fresh perspective:

## Confusion Points
- What was confusing or unclear on first read?
- What terms are used without explanation?

## Contradictions
- Where do docs disagree with each other?
- What's inconsistent?

## Missing Information
- What would a new agent need to know that isn't covered?

## Recommendations
- Concrete edits to improve clarity

Be honest and critical. Include file:line references."
```

*Use cases:* Before finalizing new documentation, evaluating prompts for future Agents.

mym1990•1mo ago

It's kind of crazy that the knee-jerk reaction to failing to one-shot your prompt is to abandon the whole thing because you think the tool sucks. It very well might, but it could also be user error or a number of other things. There wouldn't be a good night's sleep in sight if I knew an LLM was running rampant all over production code in an effort to "scale it".

t_tsonev•1mo ago

I'm okay with writing developer docs in the form of agent instructions, those are useful for humans too. If they start to get oddly specific or sound mental, then it's obviously the tool at fault.

zeroonetwothree•1mo ago

There’s always a trade off in terms of alternative approaches. So I don’t think it’s “crazy” that if one fails you switch to a different one. Sure, sometimes persistence can pay off, but not always.

Like if I go to a restaurant for the first time and the item I order is bad, could I go back and try something else? Perhaps, but I could also go somewhere else.

smallerize•1mo ago

This highlights a missing feature of LLM tooling, which is asking questions of the user. I've been experimenting with Gemini in VS Code, and it just fills in missing information by guessing and then runs off writing paragraphs of design and a bunch of code changes that could have been avoided by asking for clarification at the beginning.

skolos•1mo ago

Claude Code regularly asks me questions - I like how Anthropic implemented this.

rockbruno•1mo ago

Yeah I experienced this yesterday and it was really cool. It really only happened once though.

hobofan•1mo ago

So does Cursor in the Plan mode.

zvorygin•1mo ago

Append “First ask clarifying questions” to your prompt.

pteetor•1mo ago

For complicated prompts, I always add this:

"Before you start, please ask me any questions you have about this so I can give you more context. Be extremely comprehensive."

(I got the idea from a Medium article[1].) The LLM will, indeed, stop and ask good questions. It often notices what I've overlooked. Works very well for me!

[1] https://medium.com/@jordan_gibbs/the-most-important-chatgpt-...

tharkun__•1mo ago

So like most junior to mid level devs ;)

Claude does have this specific interface for asking questions now. I've only had it choose to ask me questions on its own very few times though. But I did have it ask clarifying questions before that interface was even a thing, when I specifically asked it to ask me clarifying questions.

Again, like a junior dev. And like a junior dev, it can also help to ask it to ask / check what it's doing "mid-way", i.e. watch what it's doing and stop it when it's running down some rabbit hole you know is not gonna yield results.

CPLX•1mo ago

You'd have to make it do that. Here's a cut and paste I keep open on my desktop, I just paste it back in every time things seem to drift:

> Before you proceed, read the local and global Claude.md files and make sure you understand how we work together. Make sure you never proceed beyond your own understanding.

> Always consult the user anytime you reach a judgment call rather than just proceeding. Anytime you encounter unexpected behavior or errors, always pause and consider the situation. Rather than going in circles, ask the user for help; they are always there and available.

> And always work from understanding; never make assumptions or guess. Never come up with field names, method names, or framework ideas without just going and doing the research. Always look at the code first, search online for documentation, and find the answer to things. Never skip that step and guess when you do not know the answer for certain.

And then the Claude.md file has a much more clearly written out explanation of how we work together and how it's a consultative process where every major judgment call should be prompted to the user, and every single completed task should be tested and also asked for user confirmation that it's doing what it's supposed to do. It tends to work pretty well so far.

andrewmutz•1mo ago

The issues raised in this article are why I think highly-opinionated frameworks will lead to higher developer productivity when using AI-assisted coding.

You may not like all the opinions of the framework, but the LLM knows them and you don’t need to write up any guidelines for it.

christophilus•1mo ago

Yep. I ran an experiment this morning building the same app in Go, Rust, Bun, Ruby (Rails), Elixir (Phoenix), and C# (ASP whatever). Rails was a done deal almost right away. Bun took a lot of guidance, but I liked the result. The rest was a lot more work with so-so results — even Phoenix, surprisingly.

I liked the Rust solution a lot, but it had 200+ dependencies vs Bun’s 5 and Rails’ 20ish (iirc). Rust feels like it inherited the NPM “pull in a thousand dependencies per problem” philosophy, which is a real shame.

some-guy•1mo ago

I can vouch for this as someone who works in a 1.6 million line codebase, where there are constant deviations and inconsistent patterns. LLMs have been almost completely useless on it other than for small functions or files.

vivin•1mo ago

You can't get away from the engineering part of software engineering even if you are using LLMs. I have been using Claude Opus 4.5, and it's the best out of the models I have tried. I find that I can get Claude to work well if I already know the steps I need to do beforehand, and I can get it to do all of the boring stuff. So it's a series of very focused and directed one-shot prompts that it largely gets correct, because I'm not giving it a huge task, or something open-ended.

Knowing how you would implement the solution beforehand is a huge help, because then you can just tell the LLM to do the boring/tedious bits.

teaearlgraycold•1mo ago

They’re good for getting you from A to B. But you need to know A (current state of the code) and how to get to B (desired end state). They’re fast typists, not automated engineers.

ericmcer•1mo ago

seriously, I stopped agent mode altogether. I hit it with very specific requests like: write a function that takes an array of X and returns y.

It almost never fails and usually does it in a neat way, plus it's ~50 lines of code so I can copy and paste confidently. Letting the agent just go wild on my code has always been a PITA for me.

vivin•1mo ago

I've used agent mode, but I tell it not to go hog wild and to not do anything other than what I have instructed it to do. Also, sometimes I will tell it not to change the code, and to go over its changes with me first, before I tell it that it can make the changes.

I feel the same way as you in general -- I don't trust it to go and just make changes all over the codebase. I've seen it do some really dumb stuff before because it doesn't really understand the context properly.

tschellenbach•1mo ago

I wrote this forever ago in AI terms :) https://getstream.io/blog/cursor-ai-large-projects/

But the summary here is that with the right guidance, AI currently crushes it on large codebases.

mstank•1mo ago

As the models have progressively improved (able to handle more complex code bases, longer files, etc) I’ve started using this simple framework on repeat which seems to work pretty well at one-shotting complex fixes or new features.

[Research] ask the agent to explain current functionality as a way to load the right files into context.

[Plan] ask the agent to brainstorm the best-practice way to implement a new feature or refactor. Brainstorm seems to be a keyword that triggers a better questioning loop for the agent. Ask it to write a detailed implementation plan to an md file.

[clear] completely clear the context of the agent (better results than just compacting the conversation).

[execute plan] ask the agent to review the specific plan again; sometimes it will ask additional questions, which repeats the planning phase. This loads only the plan into context. Then have it implement the plan.

[review & test] clear the context again and ask it to review the plan to make sure everything was implemented. This is where I add any unit or integration tests if needed. Also run test suites, type checks, lint, etc.

With this loop I’ve often had it run for 20-30 minutes straight and end up with usable results. It’s become a game of context management and creating a solid testing feedback loop instead of trying to purely one-shot issues.
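The loop above can be sketched in Python as plain data plus a tiny driver. Everything here is hypothetical: `Phase`, `build_phases`, `run_workflow`, the prompt wording, and especially `new_session`, which stands in for however your agent tool starts a fresh context and accepts a prompt. It only shows where context gets reset, not a real integration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Phase:
    name: str
    prompt: str
    fresh_context: bool  # True: start a new session instead of continuing


def build_phases(feature: str, plan_path: str) -> List[Phase]:
    """Research and plan share one context; execute and review each start clean."""
    return [
        Phase("research", f"Explain how the code related to {feature} currently works.", True),
        Phase("plan", f"Brainstorm the best-practice way to implement {feature}, "
                      f"then write a detailed implementation plan to {plan_path}.", False),
        Phase("execute", f"Review the plan in {plan_path}, ask any remaining questions, "
                         f"then implement it.", True),
        Phase("review", f"Check the code against {plan_path}, confirm everything was "
                        f"implemented, then run the tests, type checks, and linters.", True),
    ]


def run_workflow(
    phases: List[Phase],
    new_session: Callable[[], Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Send each phase's prompt, opening a fresh session where the phase asks for one."""
    transcript = []
    send = None
    for phase in phases:
        if phase.fresh_context or send is None:
            send = new_session()  # hypothetical: returns a prompt -> reply callable
        transcript.append((phase.name, send(phase.prompt)))
    return transcript
```

Keeping the plan in a file is what makes the context resets cheap: each fresh session only needs the path.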

AlexB138•1mo ago

This is essentially my exact workflow. I also keep the plan markdown files around in the repo to refer agents back to when adding new features. I have found it to be a really effective loop, and a great way to reprime context when returning to features.

mstank•1mo ago

Exactly this. I clear the old plans every few weeks.

For really big features or plans I’ll ask the agent to create Linear issue tickets to track progress for each phase over multiple sessions. The only MCP I have loaded is usually Linear, but I'm looking for a good way to transition it to a skill.

AlexB138•1mo ago

Ah, that's a great idea. I've just been having the agent add a Progress section to the plan files and checking things off as we work.

doublerebel•1mo ago

I like Linearis as a CLI/skill interface to Linear, its help and json output are built well for use with Agents.

JamesSwift•1mo ago

In general, for anything with an API it's as simple as saying "find the auth token at ~/.config/foo.json". It mostly knows the REST endpoints and can figure out the rest.

redrove•1mo ago

I use an Obsidian MCP to essentially keep a database of plans, or versions sometimes that I can just fire off.

mstank•1mo ago

Why eat up the context with an MCP when a ./docs/plans folder does the same?

redrove•1mo ago

Flexibility and deeper Obsidian integration.

prmph•1mo ago

Nothing will really work when the models fail at the most basic of reasoning challenges.

I've had models do the complete opposite of what I've put in the plan and guidelines. I've had them go re-read the exact sentences, and still see them come to the opposite conclusion, and my instructions are nothing complex at all.

I used to think one could build a workflow and process around LLMs that extract good value from them consistently, but I'm now not so sure.

I notice that sometimes the model will be in a good state, and do a long chain of edits of good quality. The problem is, it's still a crap-shoot how to get them into a good state.

alienbaby•1mo ago

I'm curious what kind of situations you are seeing where the model consistently does the opposite of your intention even though the instructions were not complex. Do you have any examples?

avereveard•1mo ago

Mostly Gemini 3 Pro. When I ask it to investigate a bug and provide fixing options (I do this mostly so I can see whether the model loaded the right context for large tasks), Gemini immediately starts fixing things and I just can't trust it.

Codex and Claude give a nice report and if I see they're not considering this or that I can tell em.

saxenaabhi•1mo ago

fyi that happened to me with codex.

but, why is it a big issue? if it does something bad, just reset the worktree and try again with a different model/agent? They are dirt cheap at 20/m and I have 4 subscriptions (Claude, Codex, Cursor, Zed).

prmph•1mo ago

The issue is that if it's struggling sometimes with basic instruction following, it's likely to be making insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.

The thing about good abstractions is that you should be able to trust in a composable way. The simpler or more low-level the building blocks, the more reliable you should expect them to be. In LLMs you can't really make this assumption.

saxenaabhi•1mo ago

I'm not sure you can make that assumption even when a human wrote that code. LLMs are competing with humans not with some abstraction.

> The issue is that if it's struggling sometimes with basic instruction following, it's likely to be making insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.

Yes, that's why we review all code even when written by humans.

avereveard•1mo ago

Same, I have multiple subscriptions and layer them. I use Haiku to plan and send a queue of tasks to Codex and Gemini, whose command lines can be scripted.

The issue to me is that I have no idea of what the code looks like and have to have a reliable first layer model that can summarize current codebase state so I can decide whether the next mutation moves the project forward or reduces technical debt. I can delegate much more that way, while Gemini's "do first" approach tends to result in many dead ends that I have to unravel.

hu3•1mo ago

Check context size.

LLMs become increasingly error-prone as their memory fills up. Just like humans.

In VSCode Copilot you can keep track of how many tokens the LLM is dealing with in realtime with "Chat Debug".

When it reaches 90k tokens I should expect degraded intelligence and brace for a possible forced summarization.

Sometimes I just stop LLMs and continue the work in a new session.
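A rough Python sketch of that rule of thumb, for setups where no token counter is available. The 90k soft limit comes from the number above; the 4-characters-per-token estimate and the 80% warning threshold are assumptions, and `estimate_tokens` / `context_status` are hypothetical helpers, not any tool's API. A real tokenizer would give better numbers.

```python
from typing import Iterable, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; ~4 characters per token is only a heuristic."""
    return int(len(text) / chars_per_token)


def context_status(messages: Iterable[str], soft_limit: int = 90_000) -> Tuple[int, str]:
    """Return (estimated tokens used, advice) for a running conversation."""
    used = sum(estimate_tokens(message) for message in messages)
    if used >= soft_limit:
        return used, "restart"  # continue the work in a new session
    if used >= 0.8 * soft_limit:
        return used, "warn"     # getting close; wrap up the current task
    return used, "ok"
```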

mstank•1mo ago

In my experience this was an issue 6-8 months ago. Ever since Sonnet 4 I haven’t had any issues with instruction following.

Biggest step-change has been being able to one-shot file refactors (using the planning framework I mentioned above). 6 months ago refactoring was a very delicate dance and now it feels like it’s pretty much streamlined.

ewoodrich•1mo ago

I recently ran into two baffling, what felt like GPT 3.5 era completely backwards misinterpretations of an unambiguous sentence once each in Codex and CC/Sonnet a few days apart in completely different scenarios (both very early in the context window). And to be fair, they were notable partially as an "exception that proves the rule" where it was surprising to see but OP's example can definitely still happen in my experience.

I was prepared to go back to my original message and spot an obvious-in-hindsight grey area/phrasing issue on my part as the root cause but there was nothing in the request itself that was unclear or problematic, nor was it buried deep within a laundry list of individual requests in a single message. Of course, the CLI agents did all sorts of scanning through the codebase/self debate/etc in between the request and the first code output. I'm used to how modern models/agents get tripped up by now so this was an unusually clear cut failure to encounter from the latest large commercial reasoning models.

In both instances, literally just restating the exact same request with "No, the request was: [original wording]" was all it took to steer them back and didn't become a concerning pattern. But with the unpredictability of how the CLI agents decide to traverse a repo and ingest large amounts of distracting code/docs it seems much too over confident to believe that random, bizarre LLM "reasoning" failures won't still occur from time to time in regular usage even as models improve given their inherent limitations.

(If I were bending over backwards to be charitable/anthropomorphize, it would be the human failure mode of "I understood exactly what I was asked for and what I needed to do, but then somehow did the exact opposite, haha oops brain fart!" but personally I'm not willing to extend that much forgiveness/tolerance to a failure from a commercial tool I pay for...)

PeterFBell•1mo ago

It's complicated. Firstly, don't love that this happens. But the fact you're not willing to provide tolerance to a commercial tool that costs maybe a few hundred bucks a month but are willing to do so for a human who probably costs thousands of bucks a month is revealing of a double standard we're all navigating.

It's like the fallout when a Waymo kills a "beloved neighborhood cat". I'm not against cats, and I'm deeply saddened at the loss of any life, but if it's true that (comparable) mile for mile, Waymos reduce deaths and injuries, that is a good thing - even if they don't reduce them to zero.

And to be clear, I often feel the same way - but I am wondering why and whether it's appropriate!

prmph•1mo ago

For me I was just pointing out some interesting and noteworthy failure modes.

And it matters. If the models struggle sometimes with basic instruction following, they can quite possibly make insidious mistakes in large complex tasks that you might not have the wherewithal or time to review.

ewoodrich•1mo ago

I mean, we typically architect systems depending on humans around an assumption of human fallibility. But when it comes to automation, randomly still doing the exact opposite even if somewhat rare is problematic and limits where and at what scale it can be safely deployed without needing ongoing human supervision.

For a coding tool it’s not as problematic as hopefully you vet the output to some degree but it still means I don’t feel comfortable using them as expansively (like the mythical personal assistant doing my banking and replying to emails, etc) as they might otherwise be used with more predictable failure modes.

I’m perfectly comfortable with Waymo on the other hand, but that would probably change if I knew they were driven by even the newest and fanciest LLMs as [toddler identified | action: avoid toddler] -> turns towards toddler is a fundamentally different sort of problem.

asim•1mo ago

I don't do any of that. I find with GitHub Copilot and Claude Sonnet 4.5 if I'm clear enough about the what and where it'll sort things out pretty well, and then there's only reiteration of code styling or reuse of functionality. At that point it has enough context to keep going. The only time I might clear that whole thing is if I'm working on an entirely new feature where the context is too large and it gets stuck in summarising the history. Otherwise it's good. But this is in Codespaces. I find the Tasks feature much harder. Almost a write-off when trying to do something big. Twice I've had it go off on some strange tangent and build the most absurd thing. You really need to keep your eyes on it.

hyperadvanced•1mo ago

Same. I find that if I can piecemeal explain the desired functionality and work as I would pairing with another engineer that it’s totally possible to go from “make me a simple wheel with spokes” to “okay now let’s add a better frame and brakes” with relatively little planning, other than what I’d already do when researching the codebase to implement a new feature

asim•1mo ago

It's quite interesting because it makes me wonder how we make it efficient and predictable. The human language is just too verbose. There must be some DSL, some more refined way to get to the output we need. I don't know whether it means you actually just need to provide examples or something else. But you know code is very binary, do this do that. LLMs are really just too verbose even in this format right now. That higher layer really needs a language. I mean I get it. It's understanding human language and converting it to code. Very clever. But I think we can do better.

hu3•1mo ago

Yeah I found that for daily work, current models like Sonnet/Opus 4.5, Gemini 3.0 Pro (and even Flash) work really well without planning as long as I divide and conquer larger tasks into smaller ones. Just like I would do if I was programming myself.

For planning large tasks like "setup playwright tests in this project with some demo tests" I spend some time chatting with Gemini 3 or Opus 4.5 to figure out the most idiomatic easy-wins and possible pitfalls. Like: separate database for playwright tests. Separate users in playwright tests. Skipping login flow for most tests. And so on.

I suspect that devs who use a formal-plan-first approach tend to tackle larger tasks and even vibe code large features at a time.

mbreese•1mo ago

I’ve had some luck with giving the LLM an overview of what I want the final version to do, but then asking it to perform smaller chunks. This is how I’d approach it myself — I know where I’m trying to go, and will implement smaller chunks at a time. I’ll also sometimes ask it to skip certain functionality - leaving a placeholder and saying we’ll get back to it later.

godzillafarts•1mo ago

This is effectively what I'm doing, inspired by HumanLayer's Advanced Context Engineering guidelines: https://github.com/humanlayer/advanced-context-engineering-f...

We've taken those prompts, tweaked them to be more relevant to us and our stack, and have pulled them in as custom commands that can be executed in Claude Code, i.e. `/research_codebase`, `/create_plan`, and `/implement_plan`.

It's working exceptionally well for me; it helps that I'm very meticulous about reviewing the output and correcting it during the research and planning phase. Aside from a few use cases with mixed results, it hasn't really taken off throughout our team unfortunately.

zeroCalories•1mo ago

I agree this can work okay, but once I find myself doing this much handholding I would prefer to drive the process myself. Coordinating 4 agents and guiding them along really makes you appreciate the mythical-man-month on the scale of hours.

dfsegoat•1mo ago

Highly recommend using agent based hooks for things like `[review & test]`.

At a basic level, they work akin to git-hooks, but they fire up a whole new context whenever certain events trigger (E.g. another agent finishes implementing changes) - and that hook instance is independent of the implementation context (which is great, as for the review case it is a semi-independent reviewer).

jarjoura•1mo ago

As of Dec 2025, Sonnet/Opus and GPTCodex are both trained and most good agent tools (ie. opencode, claude-code, codex) have prompts to fire off subagents during an exploration (use the word explore) and you should be able to Research without needing the extra steps of writing plans and resetting context. I'd save that expense unless you need some huge multi-step verifiable plan implemented.

The biggest gotcha I found is that these LLMs love to assume that code is C/Python but just in your favorite language of choice. Instead of considering that something should be written encapsulated into an object to maintain state, it will instead write 5 functions, passing the state as parameters between each function. It will also consistently ignore most of the code around it, even if it could benefit from reading it to know what specifically could be reused. So you end up with copy-pasta code, and unstructured copy-pasta at best.
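To make the contrast concrete, here is a small Python sketch of the two shapes being described, using a made-up shopping cart (all names are hypothetical). The first half threads state through free functions; the second keeps the same state inside an object. Both behave identically here; which one fits depends on the conventions of the surrounding codebase.

```python
from dataclasses import dataclass, field
from typing import Dict


# Style 1: free functions, with the state passed in and returned each time.
def add_item(cart: Dict[str, int], sku: str, qty: int) -> Dict[str, int]:
    updated = dict(cart)  # copy so the caller's dict is not mutated
    updated[sku] = updated.get(sku, 0) + qty
    return updated


def total_items(cart: Dict[str, int]) -> int:
    return sum(cart.values())


# Style 2: the same state encapsulated in an object that owns it.
@dataclass
class Cart:
    items: Dict[str, int] = field(default_factory=dict)

    def add(self, sku: str, qty: int) -> None:
        self.items[sku] = self.items.get(sku, 0) + qty

    def total_items(self) -> int:
        return sum(self.items.values())
```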

The other gotcha is that claude usually ignores CLAUDE.md. So for me, I first prompt it to read it and then I prompt it to next explore. Then, with those two rules, it usually does a good job following my request to fix, or add a new feature, or whatever, all within a single context. These recent agents do a much better job of throwing away useless context.

I do think the older models and agents get better results when writing things to a plan document, but I've noticed recent opus and sonnet usually end up just writing the same code to the plan document anyway. That usually ends up confusing itself because it can't connect it to the code around the changes as easily.

indigodaddy•1mo ago

Interesting, for me they almost always assume/write TS.

coldtea•1mo ago

>Instead of considering that something should be written encapsulated into an object to maintain state, it will instead write 5 functions, passing the state as parameters between each function.

Sounds very functional, testable, and clean. Sign me up.

the_sleaze_•1mo ago

I know this is tongue in cheek, but writing functional code in an object oriented language, or even worse just taking a giant procedural trail of tears and spreading it across a few files like a roomba through a pile of dog doo is ... well.. a code smell at best.

I have a user prompt saved called clean code to make a pass through the changes and remove unused, DRY and refactor - literally the high points of uncle bob's Clean Code. It works shockingly well at taking AI code and making it somewhat maintainable.

boredtofears•1mo ago

Care to share the prompt? Sounds useful!

the_sleaze_•1mo ago

Sure. Please improve it and come back around to let me know.

https://gist.github.com/prostko/5cf33aba05680b722017fdc0937f...

zx8080•1mo ago

Does its output follow Uncle Bob's "no comments needed" principle?

KptMarchewa•1mo ago

>I know this is tongue in cheek, but writing functional code in an object oriented language, or even worse just taking a giant procedural trail of tears and spreading it across a few files like a roomba through a pile of dog doo is ... well.. a code smell at best.

After forcing myself over years to apply various OOP principles using multiple languages, I believe OOP has truly been the worst thing to happen to me personally as an engineer. Now, I believe what you actually see is just an "aesthetics" issue, moreover it's purely learned aesthetics.

coldtea•1mo ago

Not so much tongue in cheek, but a little on the light side, sure.

I'd argue writing functional code in C++ (which is multi-paradigm anyway), or Java, or Typescript is fine!

nextaccountic•1mo ago

> As of Dec 2025, Sonnet/Opus and GPTCodex are both trained and most good agent tools (ie. opencode, claude-code, codex) have prompts to fire off subagents during an exploration (use the word explore) and you should be able to Research without needing the extra steps of writing plans and resetting context. I'd save that expense unless you need some huge multi-step verifiable plan implemented.

Does the UI clearly show what portion was done by a subagent?

master_crab•1mo ago

The UI (terminal) in Claude code will tell you if it has launched a subagent to research a particular file or problem. But it will not be highlighted for you, simply displayed in its record of prompts and actions.

mstank•1mo ago

If you use the vscode extension you can click to view the sub-agent prompts and see all tool calls.

xnorswap•1mo ago

Yes it will, this is almost verbatim (redacted product) claude-code output from my current session:

   ● I'll explore the codebase to understand the current <redacted> architecture, testing patterns, and integration points. This will help me formulate effective strategies for reducing QA burden.

   ● 3 Explore agents finished (ctrl+o to expand)
      ├─ Explore <redacted> architecture · 57 tool uses · 60.0k tokens
      │  ⎿  Done
      ├─ Explore current testing approach · 29 tool uses · 51.7k tokens
      │  ⎿  Done
      └─ Explore API integration patterns · 44 tool uses · 71.7k tokens
         ⎿  Done

During agent execution, it also shows what each sub-agent is up to. In ctrl+o mode it'll show the prompts it passed to each sub-agent.

dboreham•1mo ago

AI can be an FP absolutist too.

je42•1mo ago

If claude ignores your claude.md, you can force it to read the file, for example by using settings to cat it at every session start.

zingar•1mo ago

I’m uneasy having an agent implement several pages of plan and then write tests and results only at the end of all that. It feels like getting a CS student to write and follow a plan to do something they haven’t worked on before.

It’ll report, “Numbers changed in step 6a therefore it worked” [forgetting the pivotal role of step 2 which failed and as a result the agent should have taken step 6b, not 6a].

Or “there is conclusive evidence that X is present and therefore we were successful” [X is discussed in the plan as the reason why action is NEEDED, not as success criteria].

I _think_ that what is going wrong is context overload, and my remedy is to have the agent update every step of the plan with results immediately after action and before moving on to action on the next step.

When things seem off I can then clear context and have the agent review results step by step to debug its own work: “review step 2 of the results. Are the stated results consistent with final conclusions? Quote lines from the results verbatim as evidence.”
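One way to picture "record results per step before moving on" is a plan structure that refuses to skip ahead. This is only a sketch under assumptions: `Step`, `Plan`, `record`, and `pending` are hypothetical names, not part of any agent tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    name: str
    action: str
    result: Optional[str] = None  # filled in right after the action runs


@dataclass
class Plan:
    steps: List[Step] = field(default_factory=list)

    def record(self, name: str, result: str) -> None:
        """Attach a result to a step; refuse if an earlier step has no result."""
        for step in self.steps:
            if step.name == name:
                step.result = result
                return
            if step.result is None:
                raise RuntimeError(f"step {step.name!r} has no recorded result yet")
        raise KeyError(name)

    def pending(self) -> Optional[Step]:
        """Return the first step without a result, or None when all are recorded."""
        for step in self.steps:
            if step.result is None:
                return step
        return None
```

Because each result is written down next to its step, a later review can compare the final conclusion against what step 2 actually reported rather than against what the agent remembers.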

layer8•1mo ago

This is a bit like agile versus waterfall.

zingar•1mo ago

100%, the reason I thought of this is constantly telling developers to break their work down into smaller pieces so that they can focus and the customer sees value sooner.

dboreham•1mo ago

One of the things I like about LLM coding is that I don't need to become a psychologist in order to persuade other humans to approach their work in a manner I'd prefer.

uoaei•1mo ago

What is the current state of LCMs (large code models)? I.e. models that operate on the AST and not on text tokens.

pron•1mo ago

> Here's a LLM literacy dipstick: ask a peer engineer to read some code they're unfamiliar with. Do they understand it? ... No? Then the LLM won't either.

Of course, but the problem is the converse: There are too many situations where a peer engineer will know what to do but the agent won't. This means that it requires more work to make a codebase understandable to a human than it does to make it understandable to an agent.

> Moving more implementation feedback from human to computer helps us improve the chance of one-shotting... Think of these as bumper rails. You can increase the likelihood of an LLM reaching the bowling pins by making it impossible to land in the gutter.

Sort of, but this is also a little similar to claiming that P = NP. Having an efficient way to reliably check whether a solution is correct is not at all the same as having a reliable way to find a solution. It's the theory of computation that tells us that it probably isn't. The likelihood may well be higher yet still not high enough. Even though theoretically NP problems are strictly easier than EXPTIME ones, in practice, in many situations (though not all) they are equally intractable.
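The gap between checking and finding can be seen in a toy subset-sum example (a standard NP-complete problem). This is only an illustration of the asymmetry, with hypothetical function names `check` and `find`; it says nothing about how any coding agent searches.

```python
from itertools import combinations


def check(numbers, target, candidate):
    """Verify a proposed solution. Cheap: polynomial in the input size."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False  # candidate uses a number that is not available
        pool.remove(x)
    return sum(candidate) == target


def find(numbers, target):
    """Brute-force search: may try up to 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None
```

Having `check` makes a wrong answer easy to reject, but it does not make `find` any faster.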

In fact, we can put the claim to the test: there are languages, like ATS and Idris, that make almost any property provable and checkable. These languages let the programmer (human or machine) position the "bumper rails" so precisely as to ensure we hit the target. We can ask the agent to write the code, write the proof of correctness, and check it. We'd still need to check that the correctness property is the right one, but if the claim is correct, coding agents should be best at writing code, accompanied by correctness proofs, in ATS or Idris. Are they?

Obviously, mileage may vary depending on the task and the domain, but if it's true that coding models will get significantly better, then the best course of action may well be, in many cases, to just wait until they do rather than spend a lot of effort working around their current limitations, effort that will be wasted if and when capabilities improve. And that's the big question: are we in for a long haul where agent capabilities remain roughly where they are today or not?

victorbjorklund•1mo ago

Biggest change to my workflow has been to break down projects to smaller parts using libraries. So where I in the past would put everything in the same code base I now break down stuff that can be separate to its own libraries (like wrapping an external API). That way the AI only needs to read the docs for the library instead of having to read all the code when working on features that use the API.
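As a rough sketch of what such a wrapper library might look like: the weather service, `WeatherClient`, `Forecast`, and the `/forecast/<city>` path below are all hypothetical, and the transport is injected so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A transport is any callable that takes a path and returns decoded JSON.
Transport = Callable[[str], Dict]


@dataclass(frozen=True)
class Forecast:
    city: str
    temp_c: float


class WeatherClient:
    """The whole public surface of the library: one method, one return type."""

    def __init__(self, transport: Transport):
        self._transport = transport

    def forecast(self, city: str) -> Forecast:
        raw = self._transport(f"/forecast/{city}")
        # Normalise the external payload into the library's own type.
        return Forecast(city=raw["city"], temp_c=float(raw["temp_c"]))
```

Code that uses the API then only needs the docstring and the `Forecast` type in context, not the request and parsing details.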

EastLondonCoder•1mo ago

I’ve ended up with a workflow that lines up pretty closely with the guidance/oversight framing in the article, but with one extra separation that’s been critical for me.

I’m working on a fairly messy ingestion pipeline (Instagram exports → thumbnails → grouped “posts” → frontend rendering). The data is inconsistent, partially undocumented, and correctness is only visible once you actually look at the rendered output. That makes it a bad fit for naïve one-shotting.

What’s worked is splitting responsibility very explicitly:

• Human (me): judge correctness against reality. I look at the data, the UI, and say things like “these six media files must collapse into one post”, “stories should not appear in this mode”, “timestamps are wrong”. This part is non-negotiably human.

• LLM as planner/architect: translate those judgments into invariants and constraints (“group by export container, never flatten before grouping”, “IG mode must only consider media/posts/*”, “fallback must never yield empty output”). This model is reasoning about structure, not typing code.

• LLM as implementor (Codex-style): receives a very boring, very explicit prompt derived from the plan. Exact files, exact functions, no interpretation, no design freedom. Its job is mechanical execution.

Crucially, I don’t ask the same model to both decide what should change and how to change it. When I do, rework explodes, especially in pipelines where the ground truth lives outside the code (real data + rendered output).

This also mirrors something the article hints at but doesn’t fully spell out: the codebase isn’t just context, it’s a contract. Once the planner layer encodes the rules, the implementor can one-shot surprisingly large changes because it’s no longer guessing intent.

The challenges are mostly around discipline:

• You have to resist letting the implementor improvise.

• You have to keep plans small and concrete.

• You still need guardrails (build-time checks, sanity logs) because mistakes are silent otherwise.

But when it works, it scales much better than long conversational prompts. It feels less like “pair programming with an AI” and more like supervising a very fast, very literal junior engineer who never gets tired, which, in practice, is exactly what these tools are good at.
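The quoted invariants could be turned into a guardrail along these lines. This is a sketch under assumptions: the dict keys `"container"` and `"path"` and the functions `group_posts` and `check_invariants` are made up here, not taken from the actual pipeline.

```python
from collections import defaultdict


def group_posts(media):
    """Group media items into posts by export container.

    `media` is a list of dicts with hypothetical keys "container" and "path".
    """
    posts = defaultdict(list)
    for item in media:
        posts[item["container"]].append(item["path"])
    return dict(posts)


def check_invariants(media, posts):
    """Fail loudly instead of letting a bad grouping pass silently."""
    if media and not posts:
        raise AssertionError("fallback must never yield empty output")
    grouped = sum(len(paths) for paths in posts.values())
    if grouped != len(media):
        raise AssertionError("every media item must land in exactly one post")
    for paths in posts.values():
        for path in paths:
            if not path.startswith("media/posts/"):
                raise AssertionError(f"IG mode must only consider media/posts/*: {path}")
```

Running `check_invariants` after every implementor change is one way to surface mistakes that would otherwise only show up in the rendered output.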

tracker1•1mo ago

Just over the weekend, I decided to shell out for the top tier Claude Code to give it a try... definitely an improvement over the year I spent with GitHub Copilot enabled on my personal projects (mostly an annoyance more than a help that I eventually disabled altogether).

I've seen some impressive output so far, and have a couple friends that have been using AI generation a lot... I'm trying to create a couple legacy (BBS tech related, in Rust) applications to see how they land. So far mostly planning and structure beyond the time I've spent in contemplation. I'm not sure I can justify the expense long term, but wanting to experience the fuss a bit more to have at least a better awareness.

dmofp•1mo ago

I have a somewhat different take on this (somewhat captured in the post linked below).

IMO, the best way to raise the floor of LLM performance in codebases is by building meaning into the code base itself ala DDD. If your codebase is hard to understand and grok for a human, it will be the same for an LLM. If your codebase is unstructured and has no definable patterns, it will be harder for an LLM to use.

You can try to overcome this with even more tooling and more workflows but IMO, it is throwing good money after bad. It is ironic and maybe unpopular, but it turns out LLMs prove that all the folks yapping about language and meaning (re: DDD) were right.

DDD & the Simplicity Gospel:

https://oluatte.com/posts/domain-driven-design-simplicity-go...

dj_gitmo•1mo ago

Great post. I work on two large codebases. One is structured much like the example from the post, and the other is a mess. LLMs fare much better at understanding the organized code.

__MatrixMan__•1mo ago

I'm interested to see where we'll land re: organizing larger codebases to accommodate agents.

I've been having a lot of fun taking my larger projects and decomposing them into directed graphs where the nodes are nix flakes. If I launch claude code in a flake devshell it has access to only those tools, and it sees the flake.nix and assumes that the project is bounded by the CWD even though it's actually much larger, so its context is small and it doesn't get overwhelmed.

Inputs/outputs are a nice language agnostic mechanism for coordinating between flakes (just gotta remember to `nix flake update --update-input` when you want updated outputs from an adjacent flake). Then I can have them write feature requests for each other and help each other test fixtures and features. I also like watching them debate over a design, they get lazy and assume the other "team" will do the work, but eventually settle on something reasonable.

I've been running with the idea for a few weeks, maybe it's dumb, but I'd be surprised if this kind of rethinking didn't eventually yield a radical shift in how we organize code, even if the details look nothing like what I've come up with. Somehow we gotta get good at partitioning context so we can avoid the worst parts of the exponential increase in token volume that comes from submitting the entire chat session history just to get the next response.

quinnjh•1mo ago

yeah this is an interesting approach, both for the context-partitioning but also for reproducibility and dependency pinning. i was toying with this before needing to run with just docker on a project. would be nice to find a tool that streamlines some of this

__MatrixMan__•1mo ago

Re: dependency pinning, I put together a little write-up about that: https://gist.github.com/MatrixManAtYrService/6eaf50373448c0b...

You can use it as an alternative to `git bisect` where you're only bisecting the history of a single subflake. I imagine writing a new test that indicates the presence of an old bug, and then going back in time to see when the bug was reintroduced. With git bisect, going back in time means your new test goes away too.

salty_frog•1mo ago

I'd be keen to read/hear more about the experiment you've been undertaking, as I too have been thinking about the impact on the design/architecture/organising of software.

The focus mainly seems to be on enhancing existing workflows to produce code we currently expect - often you hear it's like a junior dev.

The type of rethinking you outlined could have code organised in such a way that a junior dev would never be able to extend it but our 'junior dev' LLM can iterate through changes easily.

I care more about the properties of software, e.g. testable, extendable, secure, than how it is organised.

Gets me to think of questions like

- what is the correlation between how code is organised vs its properties?

- what is the optimal organisation of code to facilitate llms to modify and extend software?

__MatrixMan__•1mo ago

It's not even a POC at this point, just a readme and a sandbox for testing it while I work on it. But you might find the readme interesting:

https://github.com/MatrixManAtYrService/poag

I'm especially pleased with how explicit it makes the inner dependency graph. Today I'm tinkering with pact (https://docs.pact.io/). I like that I'm forced to add the pact contracts generated during consumer testing as flake outputs (so they can then be inputs to whichever flake does provider testing). It's potentially a bit more work than it would be under other schemes, but it also makes the directionality of the dependency into a first class citizen and not an implementation detail. Otherwise it would be easy to forget which batch of tests depends on artifacts generated by the other.

I suppose there's things like Bazel for that sort of thing also but I don't think you can drop an agent into a bazel... thingy... and expect it to feel at home.

lnx01•1mo ago

LLMs are so good at telling me about things I know little to nothing about, but when I ask about things I have expert knowledge on they consistently fail, hallucinate, and confidently lie...

dmoy•1mo ago

Feels like https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect

marcosdumay•1mo ago

It was a very clear sarcastic reference to it.

But it's still not completely right. LLMs are actually great to tell you about things you know little about. You just have to take names, ideas, and references from it, not facts.

(And that makes agentic coding almost useless, by the way.)

llmslave2•1mo ago

I think you end up asking it basic questions about stuff you know little about, but much more complex/difficult questions for stuff you're already an expert in.

christophilus•1mo ago

I’ve found that they vary a huge amount based on the subject matter. In my case, I have noticed the opposite of what you observed. They know a lot about the web space (which I’ve been in for around 25 years), but are pretty bad (though not useless) at esoteric languages such as Hare.

drcxd•1mo ago

Obviously, since the training material for such esoteric languages is scarce. (That's why they are esoteric!) So by definition, LLM will never be good at esoteric languages.

throw-12-16•1mo ago

You just don't know enough to identify the bullshit when you aren't an expert in that domain.

justsid•1mo ago

That’s the joke

blauditore•1mo ago

It's like people are rediscovering the most basic principles: e.g. that documentation ("prompt library") is useful, or that well-organized code leads to higher velocity in development.

hu3•1mo ago

if that's what it takes for more people to write tests, then so be it

laser9•1mo ago

> This is the garbage in, garbage out principle in action. The utility of a model is bottlenecked by its inputs. The more garbage you have, the more likely hallucinations will occur.

Good read, but I wouldn't fully extend the garbage in, garbage out principle to LLMs. These massive LLMs are trained on internet-scale data, which includes a significant amount of garbage, and still do pretty well. Hallucinations are due more to missing or misleading context than to noise alone. Tech-debt-heavy code bases, though unstructured, still provide information-rich context.

patcon•1mo ago

Is it not the case that "production level code" coming out of these processes makes the whole system of coder-plus-machine weaker?

I find it to be a good thing that the code must be read in order to be production-grade, because that implies the coder must keep learning.

I worry about the collapse in knowledge pipeline when there is very little benefit to overseeing the process...

I say that as a bad coder who can and has done SO MUCH MORE with llm agents. So I'm not writing this as someone who has an ideal of coding that is being eroded. I'm just entering the realm of "what elite coding can do" with LLMs, but I worry for what the realm will lose, even as I'm just arriving

spullara•1mo ago

Using AugmentCode's Context Engine you can get this either through their VSCode/JetBrains plugins, their Auggie command line coding agent or by registering their MCP server with your local coding agent like Claude Code. It works far better than painstakingly stuffing your own context manually or having your agent use grep/lsp/etc to try and find what it needs.

avree•1mo ago

Why do none of these ever touch on token optimization? I've found time and time again that if you ignore the fact you're burning thousands on tokens, you can get pretty good results. Things like prompt libraries and context.md files tend to just burn more tokens per call.

hobofan•1mo ago

> Aside: Why are LLMs good at greenfield?

I have the complete opposite experience, where once some patterns already exist 2-3 times in the codebase, the LLMs start accurately replicating them instead of trying to solve everything as one-off solutions.

> You can’t be inconsistent if there are no existing patterns.

"Consistency" shouldn't be equated to "good". If that's your only metric for quality and you don't apply any taste, you'll quickly end up with an unmaintainable hodgepodge of second-grade libraries if you let an LLM do its thing in a greenfield project.

throw-12-16•1mo ago

There is no way this is economical.

Burn through your token limit in agent mode just to thrash around a few more times trying to identify where the agent "misunderstood" the prompt.

The only time LLMs work as coding agents for me is with tightly scoped prompts and a small isolated context.

Just throwing an entire codebase into an LLM in an agentic loop seems like a fool's errand.

eurekin•1mo ago

If you're interested in the large codebase... The best I found so far are extended context models. Using the newest Nemotron3 nano, you can put 1M tokens (about 3-ish megabytes of text) of pure code dump (I use repomix --style markdown) into the context and ask around. That's been one of the biggest wow moments I had with LLMs so far. Much better experience than any RAG I used.

Simplita•1mo ago

One thing that helped us as codebases grew was separating decision-making from execution. Let the model reason about intent and scope, but keep execution deterministic and constrained. It reduced drift and made failures much easier to debug once context got large.

ColinEberhardt•1mo ago

“When an LLM can generate a working high-quality implementation in a single try, that is called one-shotting. This is the most efficient form of LLM programming.”

This is a good article, but misses one of the most important advances this year - the agentic loop.

There are always going to be limits to how much code a model can one-shot. Giving it the ability to verify its changes and iterate massively increases its ability to write sizeable chunks of verified and working code.
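Stripped to its core, a generate-verify-iterate loop might look like the sketch below. `agentic_loop`, `generate`, and `verify` are hypothetical stand-ins for a model call and a test run; real agent tools do considerably more than this.

```python
def agentic_loop(generate, verify, max_attempts=3):
    """Generate a candidate, verify it, and feed failures back until it passes.

    `generate(feedback)` returns a candidate; `verify(candidate)` returns
    an (ok, message) pair. Both are supplied by the caller.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate, attempt
        feedback = message  # the next attempt sees why this one failed
    raise RuntimeError(f"no passing candidate after {max_attempts} attempts: {feedback}")
```

One-shotting is the case where this loop returns on the first attempt; the verifier is what lets the other cases recover.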

jukkat•1mo ago

Put every detail into CLAUDE.md and after a while CC starts to forget/ignore what it’s been told.

I’d like to see dynamic task-specific context building. Write a prompt and the model starts to collect relevant instructions.

Also a review loop to check that instructions were followed.
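A very naive version of task-specific context building could pick instruction notes by keyword overlap with the prompt. This is a sketch only: `select_instructions` and the note format (dicts with `"tags"` and `"text"`) are assumptions, and a real system would likely need something smarter than word matching.

```python
import re


def select_instructions(prompt, library, limit=2):
    """Return the notes whose tags overlap most with the words in the prompt."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    scored = []
    for note in library:
        overlap = len(words & set(note["tags"]))
        if overlap:
            scored.append((overlap, note["text"]))
    # Stable sort: notes with equal overlap keep their library order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:limit]]
```

Only the selected notes would go into the context, instead of every detail in one large instructions file.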

Ayanonymous•1mo ago

I'm still learning about how LLMs can be used in coding, but this article helped me understand the importance of giving clear instructions and not relying too much on automation. The point about developers still needing to guide the model really makes sense. Thanks for sharing this!

eddywebs•1mo ago

How about adding MCP support to large code bases to provide RAG-based context to LLMs? I've been playing with this idea with some good results.
