- Better structured codebases - we need hierarchical codebases with minimal depth, maximal orthogonality and reasonable width. Think microservices.
- Better documentation - most code documentations are not built to handle updates. We need a proper graph structure with few sources of truth that get propagated downstream. Again, some optimal sort of hierarchy is crucial here.
At this point, I really don't think that we necessarily need better agents.
Set up your codebase optimally, spin up 5-10 instances of gpt-5-codex-high for each issue/feature/refactor (pick the best result according to some criteria), and your life will go smoothly.
This was my journey: I vibe-coded an Electron app and ended up with a terrible monolithic architecture, and mostly badly written code. Then, I took the app's architecture docs and spent a lot of my time shouting "MAKE THIS ARCHITECTURE MORE ORTHOGONAL, SOLID, KISS, DRY" to gpt-5-pro, and ended up with a 1500+ liner monster doc.
I'm now turning this into a Tauri app and following the new architecture to a T. I would say that it has a pretty clean structure with multiple microservices.
Now, new features are gated based on the architecture doc, so I'm always maintaining a single source of truth that serves as the main context for any new discussions/features. Also, each microservice has its own README file(s) which are updated with each code change.
The vibe coded main invoice generator script then does the calendar calculations to figure out the pay cycle and examines existing invoices in the invoice directory to determine the next invoice number (the invoice number is in the file name, so it doesn't need to open the files). When it is done with the calculations, it uses the template command to generate the final invoice.
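Roughly, the shape of it is something like this sketch (the directory layout, file naming, and `template` flags here are made up for illustration, not the actual script):

```python
import re
import subprocess
from datetime import date, timedelta
from pathlib import Path

INVOICE_DIR = Path("invoices")                    # assumed location
FILENAME_RE = re.compile(r"invoice-(\d+)\.pdf$")  # invoice number lives in the file name

def pay_cycle(today: date) -> tuple[date, date]:
    # Assume a simple calendar-month pay cycle for illustration.
    start = today.replace(day=1)
    next_month = (start + timedelta(days=32)).replace(day=1)
    return start, next_month - timedelta(days=1)

def next_invoice_number() -> int:
    # No file is opened: the number is parsed straight out of the file names.
    numbers = [
        int(m.group(1))
        for p in INVOICE_DIR.glob("*.pdf")
        if (m := FILENAME_RE.search(p.name))
    ]
    return max(numbers, default=0) + 1

if __name__ == "__main__":
    start, end = pay_cycle(date.today())
    # Hand the final rendering off to the separate template command.
    subprocess.run(
        ["template", "--number", str(next_invoice_number()),
         "--period", f"{start}..{end}"],
        check=True,
    )
```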
This is a very small example, but I do think that clearly defined modules/microservices/libraries are a good way to only put the relevant work context into the limited context window.
It also happens to be more human-friendly, I think?
Microservices should already be a last resort, for when you've either: a) hit technical scale that necessitates it, or b) hit organizational complexity that necessitates it.
Opting to introduce them sooner will almost certainly increase the complexity of your codebase prematurely (already a hallmark of LLM development).
> Better documentation
If this means reasoning as to why decisions are made then yes. If this means explaining the code then no - code is the best documentation. English is nowhere near as good at describing how to interface with computers.
Given how long GPT-5 Codex has been out, there’s no way you’ve followed these practices for a reasonable enough time to consider them definitive (2 years at the least, likely much longer).
Agreed, but how else are you going to scale mostly AI written code? Relying mostly on AI agents gives you that organizational complexity.
> Given how long GPT-5 Codex has been out, there’s no way you’ve followed these practices for a reasonable enough time to consider them definitive
Yeah, fair. Codex has been out for less than 2 weeks at this point. I was relying on gpt-5 in August and opus before that.
But in my experience a microservice architecture is orders of magnitude more complex to build and understand than a monolith.
If you, with the help of an LLM, struggle to keep a monolith organized, I am positive you will find it even harder to build microservices.
Good luck in your journey, I hope you learn a ton!
We summarize context and remember summarizations of it.
Maybe we need to do this with the LLM. Chain of thought sort of does this, but it's not deliberate. The system prompt needs to mark this as a deliberate task: building summaries and notes of the entire code base, so that this summarized context, with its gotchas and key aspects, can become part of permanent context the same way ChatGPT remembers aspects of you.
The summaries can even be sectioned off and have different levels of access. So if the LLM wants to drill down into a subfolder, it looks at the general summary and then at another summary for that subfolder. It doesn't need to pull the full summary into context.
Imagine a hierarchy of system notes and summaries. The LLM decides where to go and what code to read while having access to the specific notes it left previously when going through the code. Like the code itself, it never reads it all; it just accesses the sections of summaries that go along with the code. It's sort of like code comments.
We also need to program it to update the notes every time it changes the program. And when you change the program without consulting the AI, the AI needs to update the notes on every commit based on your changes.
The LLM needs a system prompt that tells it to act like us and remember things like us. We do not memorize and examine full context of anything when we dive into code.
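A rough sketch of what I mean, assuming one hypothetical `LLM_NOTES.md` summary file per directory (the file name and layout are just my assumption):

```python
from pathlib import Path

NOTE_NAME = "LLM_NOTES.md"  # hypothetical per-directory summary file

def notes_for(file_path: str, repo_root: str = ".") -> list[str]:
    # Walk from the repo root down to the file's directory, collecting the short
    # summaries along the way. The agent reads these few notes instead of the
    # full source files, drilling deeper only when it actually needs to.
    root = Path(repo_root).resolve()
    target = (root / file_path).resolve().parent
    rel_parts = target.relative_to(root).parts
    dirs = [root] + [root.joinpath(*rel_parts[: i + 1]) for i in range(len(rel_parts))]
    return [
        (d / NOTE_NAME).read_text()
        for d in dirs
        if (d / NOTE_NAME).exists()
    ]

def dirs_needing_fresh_notes(changed_files: list[str]) -> set[Path]:
    # Run from a post-commit hook: every directory touched by a commit, whether
    # the change was AI-written or human-written, gets its note regenerated by
    # the agent so the summaries never drift from the code.
    return {Path(f).parent for f in changed_files}
```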
yes, and if you're an engineering manager you retain _out of date_ summarizations, often materially out of date.
This is something humans don't actually do. We aren't aware of every change, and we don't have updated documentation of every change, so the LLM will be doing better in this regard.
Did you read the entirety of what I wrote? Please read.
Say the AI left a 5-line summary of a 300-line piece of code. You as a human update that code. What I am saying specifically is this: when you make the change, the AI sees it and updates the summary. So the AI needs to be interacting with every code change, whether or not you used it to vibe code.
The next time the AI needs to know what this function does, it doesn't need to read the entire 300-line function. It reads the 5-line summary, puts it in the context window, and moves on with chain of thought. Understand?
This is what shrinks the context. Humans don’t have unlimited context either. We have vague fuzzy memories of aspects of the code and these “notes” effectively make coding agents do the same thing.
- - kae3g
A good senior engineer has a ton in their head after 6+ months in a codebase. You can spend a lot of time trying to equip Claude Code with the equivalent in the form of CLAUDE.MD, references to docs, etc., but it's a lot of work, and it's not clear that the agents even use it well (yet).
We do take notes, we summarize our writings, that's a process. But the brain does not follow that primitive process to "scale".
Why would you bother with all these summaries if you can just read and remember the code perfectly?
I think this might be a good leap for agents, the ability to not just review a doc in its current state, but to keep in context/understanding the full evolution of a document.
I'll stop ya right there. I've spent the past few weeks fixing bugs in a big multi-tier app (which is what any production software is these days). My output per bug is always one commit, often one line.
Claude is an occasional help, nothing more. Certainly not generating the commit for me!
Claude is able to create entire PRs for me that are clean, well written, and maintainable.
Can it fail spectacularly? Yes, and it does sometimes. Can it be given good instructions and produce results that feel like magic? Also yes.
In a way that is still helpful, especially if the act of putting the prompt together brought you to the solution organically.
Beyond that, 'clean', 'well written' and 'maintainable' are all relative terms here. In a low quality, mega legacy codebase, the results are gonna be dogshit without an intense amount of steering.
You have to be willing to accept "close-ish and good enough" to what you'd write yourself. I would say that most of the time I spend with Claude is to get from its initial try to "close-ish and good enough". If I was working on tiny changes of just a few lines, it would definitely be faster just to write them myself. It's the hundreds of lines of boilerplate, logging, error handling, etc. that makes the trade-off close to worth it.
I gave up building agents as soon as I figured they would never scale beyond the context constraint. The increase in memory and compute costs needed to grow the context size of these things isn't linear.
Having more context but still being unable to focus effectively on the latest task is the real problem.
People have tried to expand context windows by reducing the O(n^2) attention mechanism to something more sparse and it tends to perform very poorly. It will take a fundamental architectural change.
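For intuition on where the quadratic cost comes from, here's a toy numpy illustration (mine, and nothing like how production attention kernels are actually written):

```python
import numpy as np

n, d = 4096, 64                    # context length, per-head dimension
Q = np.random.randn(n, d)
K = np.random.randn(n, d)

# Dense attention scores form an n x n matrix: ~16.8M entries at n=4096.
# Doubling the context quadruples this, which is the O(n^2) wall being discussed.
scores = Q @ K.T / np.sqrt(d)
print(scores.shape)                # (4096, 4096)
```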
That is, humans usually don't store exactly what was written in a sentence five paragraphs ago, but rather the concept or idea conveyed. If we need details we go back and reread or similar.
And when we write or talk, we form first an overall thought about what to say, then we break it into pieces and order the pieces somewhat logically, before finally forming words that make up sentences for each piece.
From what I can see there's work on this, like this[1] and this[2] more recent paper. Again not an expert so can't comment on the quality of the references, just some I found.
That is what transformers attention does in the first place, so you would just be stacking two transformers.
I know there are classes of problems that LLMs can't natively handle (like doing math, even simple addition... or spatial reasoning; I would assume time's in there too). There are ways they can hack around this, like writing code that performs the math.
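E.g. the usual workaround looks something like this sketch (the dates are made up, purely to show the arithmetic being offloaded to code):

```python
from datetime import date

# Instead of "reasoning" about the dates token by token, the model emits a
# snippet like this and reads back the exact answer.
released = date(2025, 9, 15)    # made-up dates, purely for illustration
today = date(2025, 10, 1)
print((today - released).days)  # 16
```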
But how would you do that for chronological reasoning? Because that would help with compacting context to know what to remember and what not.
Coding agents choke on our big C++ code-base pretty spectacularly if asked to reference large files.
I have multiple things I'd love LLMs to attempt to do, but the context window is stopping me.
In fact, I've found LLMs are reasonable at the simple task of refactoring a large file into smaller components, with documentation on what each portion does, even if they can't get the full context immediately. Doing this then helps the LLM later. I'm also of the opinion that we should be making codebases LLM-compatible. So if it happens, I direct the LLM that way for 10 minutes and then get back to the actual task once the codebase is in a more reasonable state.
I could see in C++ it getting smarter about first checking the .h files or just grepping for function documentation, before actually trying to pull out parts of the file.
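Something along these lines, say (just a sketch of the idea, not an existing tool), would let it pull a declaration and its doc comment out of the headers before ever opening the big implementation files:

```python
import re
from pathlib import Path

def find_declaration(symbol: str, include_dir: str = "include") -> list[str]:
    # Scan headers for the symbol's declaration plus the comment block above it,
    # so only a few lines ever need to go into the model's context.
    hits = []
    for header in Path(include_dir).rglob("*.h"):
        lines = header.read_text(errors="ignore").splitlines()
        for i, line in enumerate(lines):
            if re.search(rf"\b{re.escape(symbol)}\s*\(", line):
                start = i
                while start > 0 and lines[start - 1].lstrip().startswith(("//", "/*", "*")):
                    start -= 1
                hits.append(f"{header}:{i + 1}\n" + "\n".join(lines[start : i + 1]))
    return hits
```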
You load the thing up with relevant context and pray that it guides the generation path to the part of the model that represents the information you want, and pray that the path of tokens through the model outputs what you want.
That's why they have a tendency to go ahead and do things you tell them not to do.
also IDK about you but I hate how much praying has become part of the state of the art here. I didn't get into this career to be a fucking tech priest for the machine god. I will never like these models until they are predictable, which means I will never like them.
Are we still calling it intelligence?
"Its not the x value that's the problem, its the y value".
You're right, it's not "raw intelligence" that's the bottleneck, because there's none of that in there. The truth is no tweak to any parameter is ever going to make the LLM capable of programming. Just like an exponential curve is always going to outgrow a linear one. You can't tweak the parameters out of that fundamental truth.
The context pipeline is a major problem in other fields as well, not just programming. In healthcare, the next billion-dollar startup will likely be the one that cracks the personal health pipeline, enabling people to chat with GPT-6 PRO while seamlessly bringing their entire lifetime of health context into every conversation.
Because you will always need a specialist to drive these tools. You need someone who understands the landscape of software - what's possible, what's not possible, how to select and evaluate the right approach to solve a problem, how to turn messy human needs into unambiguous requirements, how to verify that the produced software actually works.
Provided software developers can grow their field of experience to cover QA and aspects of product management - and learn to effectively use this new breed of coding agents - they'll be just fine.
Eg. "Refactor this large file into meaningful smaller components where appropriate and add code documentation on what each small component is intended to achieve." The LLM can usually handle this well (with some oversight of course). I also have instructions to document each change and why in code in the LLMs instructions.md
If the LLM does create a regression i also ask the LLM to add code documentation in the code to avoid future regressions, "Important: do not do X here as it will break Y" which again seems to help since the LLM will see that next time right there in the portion of code where it's important.
None of this verbosity in the code itself is harmful to human readers either which is nice. The end result is the codebase becomes much easier for LLMs to work with.
I suspect LLM compatibility may be a metric we measure codebases by in the future, as we learn more and more about how to work with them. Right now LLMs themselves often produce code with poor LLM compatibility, but by adding some more documentation in the code itself they can do much better.
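To make the guard comments concrete, they end up looking something like this (a made-up example, not from my codebase):

```python
def format_row(row: dict) -> str:
    return ", ".join(f"{key}={value}" for key, value in row.items())

def export_report(rows: list[dict]) -> list[str]:
    # Important: do not sort `rows` in place here, as it will break downstream
    # pagination that relies on the original insertion order. (This is exactly
    # the kind of note the agent is told to leave after fixing a regression.)
    return [format_row(row) for row in rows]
```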
I started writing a solution, but to be honest I probably need the help of someone who's more experienced.
Although to be honest, I'm sure someone with VC money is already working on this.
If you want your agent to be really good at working with dates in a functional way or know how to deal with the metric system (as examples), then you need to train on those problems, probably using RFT. The other challenge is that even if you have this problem set in a testable fashion, running it at scale is hard. Some benchmarks have 20k+ test cases and can take well over an hour to run. If you ran each test case sequentially it would take over 2 years to complete.
Right now the only company I'm aware of that lets you do that at scale is runloop (disclaimer, I work there).
However, the limitation can be masked using layering techniques where the output of one agent is fed as input to another, using consensus for verification or other techniques to the nth degree to minimize errors. But this is a bit like the story of the boy with his finger in the dike. Yes, you can spawn as many boys as you like, but there is an associated cost that keeps growing and won't narrow down.
It has nothing to do with contexts or window of focus or any other human centric metric. This is what the architecture is supposed to do and it does so perfectly.
Yeah this is the really big one - kind of buried the lede a little there :)
Understanding product and business requirements traditionally means communicating (either via docs and specs or directly with humans) with a bunch of people. One of the differences between a junior and senior is being able to read between the lines of a github or jira issue and know that more information needs to be teased out from… somewhere (most likely someone).
I’ve noticed that when working with AI lately I often explicitly tell them “if you need more information or context ask me before writing code”, or variations thereof. Because LLMs, like less experienced engineers, tend to think the only task is to start writing code immediately.
It will get solved though, there’s no magic in it, and LLMs are well equipped by design to communicate!
https://github.com/foolsgoldtoshi-star/foolsgoldtoshi-star-p...
_ _ kae3g
The ICPC is a short (5 hours) timed contest with multiple problems, in which contestants are not allowed to use the internet.
The reason most don't get a perfect score isn't because the tasks themselves are unreasonably difficult, but because they're difficult enough that 5 hours isn't a lot of time to solve so many problems. Additionally, they often require a decent amount of math / comp-sci knowledge, so if you don't have the necessary knowledge you probably won't be able to complete it.
So to get a good score you need lots of math & comp-sci knowledge + you need to be a really quick coder.
Basically the contest is perfect for LLMs because they have a ton of math and comp-sci knowledge, they can spit out code at superhuman speeds, and the problems themselves are fairly small (they take a human maybe 15 minutes to an hour to complete).
Who knows, maybe OP is right and LLMs are smart enough to be super human coders if they just had the right context, but I don't think this example proves their point well at all. These are exactly the types of problems you would expect a supercharged auto-complete would excel at.
As these tools make it possible for a single person to do more, it will become increasingly likely that society will be exposed to greater risks than that single person's (or small company's) assets can cover.
These tools already accelerate development enough that those people who direct the tools can no longer state with credibility that they've personally reviewed the code/behavior with reasonable coverage.
It'll take over-extensions of the capability of these tools, of course, before society really notices, but it remains my belief that until the tools themselves can be held liable for the quality of their output, responsibility will become the ultimate bottleneck for their development.
And who writes the tests?
Humans gatekeep, especially in the tech industry, and that is exactly what will limit us improving AI over time. It will only be when we turn over its choices to it that we move beyond all this bullshit.
You need to have the right things in the context, irrelevant stuff is not just wasteful, it is increasingly likely to cause errors. It has been shown a few times that as the context window grows, performance drops.
Heretical I know, but I find that thinking like a human goes a long way to working with AI.
Let's take the example of large migrations. You're not going to load the whole codebase in your brain and figure out what changes to make and then vomit them out into a huge PR. You're going to do it bit by bit, looking up relevant files, making changes to logically-related bits of code, and putting out a PR for each changelist.
This is exactly what tools should do as well. At $PAST_JOB my team built a tool based on OpenRewrite (LLMs were just coming up) for large-scale multi-repo migrations, and the centerpiece was our internal codesearch tool. Migrations were expressed as a codesearch query + codemod "recipe"; you can imagine how that worked.
That would be the best way to use AI for large-scale changes as well. Find the right snippets of code (and documentation!), load each one into the context of an agent in multiple independent tasks.
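In sketch form (with `codesearch` and `run_agent_task` as placeholders for whatever internal tooling exists), the pattern would be roughly:

```python
from dataclasses import dataclass

@dataclass
class Match:
    repo: str
    path: str
    snippet: str

def codesearch(query: str) -> list[Match]:
    ...  # placeholder: an internal code search service, Sourcegraph, ripgrep, etc.

def run_agent_task(prompt: str, files: list[str]) -> str:
    ...  # placeholder: one small-context agent invocation returning a patch

def migrate(query: str, recipe: str) -> list[str]:
    # One independent, narrowly scoped task per match, instead of asking a single
    # agent to hold the whole migration in one context window.
    patches = []
    for match in codesearch(query):
        prompt = (
            f"Apply this migration recipe:\n{recipe}\n\n"
            f"Only touch {match.path}. Relevant snippet:\n{match.snippet}"
        )
        patches.append(run_agent_task(prompt, files=[match.path]))
    return patches
```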
Caveat: as I understand it, this was the premise of SourceGraph's earliest forays into AI-assisted coding, but I recall one of their engineers mentioning that this turned out to be much trickier than expected. (This was a year+ back, so eons ago in LLM progress time.)
Just hypothesizing here, but it may have been that the LSIF format does not provide sufficient context. Another company in this space is Moderne (the creators of OpenRewrite) that have a much more comprehensive view of the codebase, and I hear they're having better success with large LLM-based migrations.
Like, how feasible is it for a mid-size corporation to use a technique like LoRA, mentioned by GP, to "teach" (say, for example) Kimi K2 about a large C++ codebase so that individual engineers don't need to learn the black art of "context engineering" and can just ask it questions?
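For what it's worth, the mechanics with something like Hugging Face's `peft` would look roughly like the sketch below; the model name, target module names, and data pipeline are all assumptions, and whether this actually beats context engineering on a private C++ codebase is exactly the open question:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "moonshotai/Kimi-K2-Instruct"   # illustrative; far beyond most mid-size hardware
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Only the small adapter matrices get trained; target module names vary by architecture.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Treat the codebase itself as the training corpus (a big simplification).
dataset = load_dataset("text", data_files={"train": "our_codebase/**/*.cpp"})["train"]
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebase-lora", per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```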