Please god, no, never do this. For one thing, why would you not commit the generated source code when storage is essentially free? That seems insane for multiple reasons.
> When models inevitably improve, you could connect the latest version and regenerate the entire codebase with enhanced capability.
How would you know if the code was better or worse if it was never committed? How do you audit for security vulnerabilities or debug with no source code?
You would not do this because: unlike programming languages, natural languages are ambiguous and thus inadequate to fully specify software.
> this assumes models can achieve strict prompt adherence
What does strict adherence to an ambiguous prompt even mean? It’s like those people asking Babbage if his machine would give the right answer when given the wrong figures. I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a proposition.
LLMs can be configured to produce deterministic output too.
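A minimal sketch of what that configuration looks like, assuming the OpenAI Python SDK; note the API documents the seed parameter as best-effort, so this gets you repeatable output most of the time rather than a hard guarantee:

```python
# Illustrative only: temperature=0 removes sampling randomness, seed requests
# best-effort reproducibility across runs (not a hard guarantee).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # model choice here is illustrative
    messages=[{"role": "user", "content": "Implement a token-bucket rate limiter."}],
    temperature=0,
    seed=42,
)
print(response.choices[0].message.content)
```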
You generate code with LLMs. You write tests for this code, either using LLMs or on your own. You of course commit your actual code: it is required to actually run the program, after all. However you also save the entire prompt chain somewhere. Then (as stated in the article), when a much better model comes along, you re-run that chain, presumably with prompting like "create this project, focusing on efficiency" or "create this project in Rust" or "create this project, focusing on readability of the code". Then you run the tests against the new codebase and if the suite passes you carry on, with a much improved codebase. The theoretical benefit of this over just giving your previously generated code to the LLM and saying "improve the readability" is that the newer (better) LLM is not burdened by the context of the "worse" decisions made by the previous LLM.
Obviously it's not actually that simple, as tests don't catch everything (though with fuzz testing and complete coverage and such they can catch most issues), but we programmers often treat them as if they do, so it might still be a worthwhile endeavor.
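A minimal sketch of that regenerate-and-gate-on-tests loop, purely illustrative: it assumes the prompt chain is saved as ordered markdown files in prompts/, that the model is asked to emit the project into generated/, and that the existing suite runs via pytest. All names are hypothetical.

```python
# Illustrative only: replay a saved prompt chain against a newer model and keep
# the result only if the existing test suite still passes.
import pathlib
import subprocess

from openai import OpenAI

client = OpenAI()

def replay_prompt_chain(model: str, extra_instruction: str) -> str:
    """Re-run the saved prompts, e.g. with 'create this project, focusing on readability'."""
    messages = [{"role": "system", "content": extra_instruction}]
    for prompt_file in sorted(pathlib.Path("prompts").glob("*.md")):
        messages.append({"role": "user", "content": prompt_file.read_text()})
    reply = client.chat.completions.create(model=model, messages=messages)
    return reply.choices[0].message.content  # in practice an agent would write files here

def regenerated_code_passes_tests() -> bool:
    """The gate: carry on with the new codebase only if the old suite is green."""
    return subprocess.run(["pytest", "-q"], cwd="generated").returncode == 0
```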
Not to mention it also performs web searches, web fetching, etc., which would also make it non-deterministic.
One of the things we learned very quickly was that having generated source code in the same repository as actual source code was not sustainable. The nature of reviewing changes is just too different between them.
Another thing we learned very quickly was that attempting to generate code, then modify the result is not sustainable; nor is aiming for a 100% generated code base. The end result of that was that we had to significantly rearchitect the project for us to essentially inject manually crafted code into arbitrary places in the generated code.
Another thing we learned is that any change in the code generator needs to have a feature flag, because someone was relying on the old behavior.
Keeping a separate repository with the prompts or other commands is fine, but not committing the generated code at all I find questionable at best.
Tbh this all sounds very familiar and like classic data management/admin systems for regular businesses. The only difference is that the data is code and the admins are the engineers themselves so the temptation to "just" change things in place is too great. But I suspect it doesn't scale and is hard to manage etc.
But given this is about LLMs, which people tend to run with temperature > 0, this is unlikely to be true. So I'd really urge anyone to actually store the results (somewhere, maybe not in SCM specifically), as otherwise you won't have any idea what the code was in the future.
Compilers are deterministic. Given the same input you always get the same output so there's no reason to store the output. If you don't get the same output we call it a compiler bug!
LLMs do not work this way.
(Aside: Am I the only one who feels that the entire AI industry is predicated on replacing only development positions? We're looking at, what, 100bn invested, with almost no reduction in customers' operating costs unless the customer has developers.)
Nobody commits the compiled code; this is the direction we are moving in: high-level source code is the new assembly.
Regenerated code might behave differently, have different bugs (worst case), or not work at all (best case).
Prompts are, in a sense, what higher-level programming languages were to assembly. Sure, there is a crucial difference, which is reproducibility. I could try to write down my thoughts on why I think it won't be so problematic in the long run. I could be wrong, of course.
I run https://pollinations.ai, which serves over 4 million monthly active users quite reliably. It is mostly coded with AI. For about a year there has been no significant human commit. You can check the codebase. It's messy, but not more messy than my codebases were pre-LLMs.
I think prompts + tests in code will be the medium-term solution. Humans will spend more time testing different architecture ideas and being involved in reviewing, and in larger changes that involve significant changes to the tests.
How does anyone using AI like this have confidence that they aren't unintentionally plagiarizing code and violating the terms of whatever license it was released under?
For random personal projects I don't see it mattering that much. But if a large corp is releasing code like this, one would hope they've done some due diligence that they haven't just stolen the code from some similar repo on GitHub, laundered through an LLM.
The only relevant section in the readme doesn't mention checking similar projects or libraries for common code:
> Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
Companies are satisfied with the indemnity provided by Microsoft.
They don’t and no one cares
The best way I can explain the experience of working with an LLM agent right now is that it is like if every API in the world had a magic "examples" generator that always included whatever it was you were trying to do (so long as what you were trying to do was within the obvious remit of the library).
Another way to phrase this is LLM-as-compiler and Python (or whatever) as an intermediate compiler artefact.
Finally, a true 6th generation programming language!
I've considered building a toy of this with really aggressive modularisation of the output code (e.g. Python) and a query-based caching system, so that each module of code output only changes when the relevant part of the prompt or upstream modules change (the generated code would be committed to source control like a lockfile).
I think that (plus some sort of WASM-encapsulated execution environment) would be one of the best ways to write one-off things like scripts, which don't need to incrementally get better and more robust over time in the way that ordinary code does.
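A toy sketch of the caching part (every name below is hypothetical): each generated module is keyed by a hash of its own prompt plus its upstream modules' keys, and the name-to-key map is committed like a lockfile, so a module is only regenerated when something it depends on changes.

```python
# Illustrative only: lockfile-style cache keys for generated modules.
import hashlib
import json
import pathlib

LOCKFILE = pathlib.Path("genlock.json")  # committed to source control like a lockfile

def module_key(prompt_text: str, upstream_keys: list[str]) -> str:
    """Cache key: this module's prompt plus the keys of every upstream module."""
    h = hashlib.sha256(prompt_text.encode())
    for key in upstream_keys:
        h.update(key.encode())
    return h.hexdigest()

def needs_regeneration(name: str, key: str) -> bool:
    """Only call the model again when the relevant prompt or an upstream module changed."""
    lock = json.loads(LOCKFILE.read_text()) if LOCKFILE.exists() else {}
    return lock.get(name) != key

def record(name: str, key: str) -> None:
    """Update the lockfile entry after a module is regenerated."""
    lock = json.loads(LOCKFILE.read_text()) if LOCKFILE.exists() else {}
    lock[name] = key
    LOCKFILE.write_text(json.dumps(lock, indent=2))
```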
Karpathy already said English is the new programming language.
The reason he keeps adjusting the prompts is because he knows how to program. He knows what it should look like.
It just blurs the line between engineer and tool.
Later: updated to clarify that kentonv didn't write this article.
> this tool is improving itself, learning from every interaction
which seems to indicate a fundamental misunderstanding of how modern LLMs work: the 'improving' happens when humans train/refine existing models offline to create new models, and the 'learning' is just filling the context window with more stuff, not enhancement of the actual model. It will forget everything if you drop the context, and as the context grows it can 'forget' things it previously 'learned'.
Essentially, the field will get frozen in a state where existing senior engineers can utilize AI to outship traditional senior-junior teams, even as junior engineers fail to secure employment.
I don’t think anything in this article counters this argument
It will simply reduce the number of opportunities to learn (and not just for juniors), by virtue of companies' beancounters concluding that "two for one" (several juniors) doesn't return the same as "buy one get one free" (existing staff + AI license).
I dread the day we all "learn from AI". The social interaction part of learning is just as important as the content of it, really, especially when you're young; none of that comes across yet in the pure "1:1 interaction" with AI.
Programming has become more of a "social game" in the last 15 years or so. AI is a new superpower for people like me, bringing balance to the Force.
But as of now, it's senior engineers who really know what they're doing who can spot the errors in AI code.
I realise you meant it as “the engineer and their tool blend together”, but I read it like a funny insult: “that guy likes to think of himself as an engineer, but he’s a complete tool”.
Maybe journalists and bloggers angling for attention do it; prompt engineers are too aware of the limitations of prompting to do that.
Now consider your reasonable instinct not to accuse other people, coupled with the possibility of setting AI loose with “write a positive article about AI where you have some paragraphs about the current limitations based on this link. write like you are just following the evidence.” Meanwhile we are supposed to sit here and weigh every word.
This reminds me to write a prompt for a blog post: how AI could be used to make a personal-looking tech guy who meditates and runs websites. (Do we have the technology? Yes we do.)
Em-dash baby.
There are better clues, like the kind of vague pretentious babble bad marketers use to make their products and ideas seem more profound than they are. It’s a type of bad writing which looks grandiose but is ultimately meaningless and that LLMs heavily pick up on.
Furthermore, on macOS there are simple key combinations (e.g. with ⌥) to make all sort of smart punctuation even if you don’t have the feature enabled by default, and on iOS you can long press on a key (such as the hyphen) to see alternates.
The majority of people may not use correct punctuation marks, but enough do that assuming a single character immediately means they used an LLM is just plain wrong. I have never used an LLM to write a blog post, internet comment, or anything of the sort, and I have used smart punctuation in all my writing for over a decade. Same with plenty of other HN commenters, journalists, writers, editors, and on and on. You don’t need to be a literal machine to care about correct character use.
I used a combination of OpenAI's online Codex and Claude Sonnet 4 in VSCode agent mode. It was nice that Codex was more automated and had an environment it could work in, but its thought-logs are terrible. Iteration was also slow because it takes a while for it to spin the environment up. And while you _can_ have multiple requests running at once, it usually doesn't make sense for a single, somewhat small project.
Sonnet 4's thoughts were much more coherent, and it was fun to watch it work and figure out problems. But there's something broken in VSCode right now that makes its ability to read console output inconsistent, which made things difficult.
The biggest issue I ran into is that both are set up to seek out and read only small parts of the code. While they're generally good at getting enough context, it does cause some degradation in quality. A frequent issue was replication of CSS styling between the Rust side of things (which creates all of the HTML elements) and the style.css side of things. Like it would be working on the Rust code and forget to check style.css, so it would just manually insert styles on the Rust side even though those elements were already styled on the style.css side.
Codex is also _terrible_ at formatting and will frequently muck things up, so it's mandatory to pair it with an autoformatter and instructions to run it. Even with that, Codex will often say that it ran the formatter but didn't actually run it (or ran it somewhere in the middle instead of at the end), so its pull requests fail CI. Sonnet never seemed to have this issue and just used the prevailing style it saw in the files.
Now, when I say "almost 100% AI", it's maybe 99% because I did have to step in and do some edits myself for things that both failed at. In particular neither can see the actual game running, so they'd make weird mistakes with the design. (Yes, Sonnet in VS Code can see attached images, and potentially can see the DOM of vscode's built in browser, but the vision of all SOTA models is ass so it's effectively useless). I also stepped in once to do one major refactor. The AIs had decided on a very strange, messy, and buggy interpreter implementation at first.
So every single run will result in different non-reproducible implementation with unique bugs requiring manual expert interventions. How is this better?
https://www.trk7.com/blog/ai-agents-for-coding-promise-vs-re...
This has been my experience as well: always run the CLI tool in the bottom pane of an IDE and not in a standalone terminal.
Take note - there is no limit. Every feature you or the AI can prompt can be generated.
Imagine if you were immortal and given unlimited storage. Imagine what you could create.
That’s a prompt away.
Even now you’re still restricting your thinking to the old ways.
No, it is not possible to prompt every feature, and I suspect people who believe LLMs can accurately program anything in any language are frankly not solving any truly novel or interesting problems, because if they were they’d see the obvious cracks.
Currently, it's 6 prompts away in which 5 of those are me guiding the LLM to output the answer that I already have in mind.
This only works if the model and its context are immutable. None of us really control the models we use, so I'd be sceptical about reproducing the artifacts later.
- Human reviewed: Code guidelines and prompt templates are essentially dev tool infra-as-code and need review
- Discarded: Individual prompt commands I write, and the implementation plan progress files the AI writes, both get trashed, and are even part of my .gitignore. They were kept by Cloudflare, but we don't keep these.
- Unreviewed: Claude Code does not do RAG in the usual sense, so it is on us to create guides for how we do things like use big frameworks. They are basically indexes for speeding up AI with less grepping + hallucinating across memory compactions. The AI reads and writes these, and we largely stay out of it.
There are weird cases I am still trying to figure out. Ex:
- feature impl might start with an AI coming up with the product spec, so having that maintained as the AI progresses and committed in is a potentially useful artifact
- knowing how prompt templates get used is helpful for their automated maintenance.
It is not my intention to hurt your feelings, but it sounds like you and/or the LLM are not really good at the job. Looking at programmer salaries and LLM energy costs, this appears to be a very, very VERY expensive OAuth library.
Again: Not my intention to hurt any feelings, but the numbers really are shockingly bad.
SupremumLimit•10h ago
Dylan16807•10h ago
deadbabe•10h ago
dingnuts•10h ago
tptacek•8h ago
keybored•3h ago
To have their minds changed drastically, sure..
dwaltrip•7h ago
sitkack•9h ago
ChatGPT came out in Nov 2022. Attention Was All There Was in 2017, so we were already 5 years in the past, or had 5 years of research to catch up on. And from 2022 to now... papers and research have been increasing exponentially. Even if SOTA models were frozen, we would still have years of research to apply and optimize in various ways.
BoorishBears•7h ago
Lately I spend all day post-training models for my product, and I want to say 99% of the research specific to LLMs doesn't reproduce and/or matter once you actually dig in.
We're getting exponentially more papers on the topics and they're getting worse on average.
Every day there's a new paper claiming an X% gain by post-training some ancient 8B parameter model and comparing it to a bunch of other ancient models after they've overfitted on the public dataset of a given benchmark and given the model a best of 5.
And benchmarks won't ever show it, but even ChatGPT 3.5-Turbo has better general world knowledge than a lot of models people consider "frontier" models today, because post-training makes it easy to cover up those gaps with very impressive one-prompt outputs and strong benchmark scores.
-
It feels like things are getting stuck in a local maximum: we are making forward progress, the models are useful and getting more useful, but the future people are envisioning requires reaching a completely different goal post that I'm not at all convinced we're making exponential progress towards.
There may be an exponential number of techniques claiming to be groundbreaking, but what has actually unlocked new capabilities that can't just as easily be attributed to how much more focused post-training has become on coding and math?
Test-time compute feels like the only one, and we're already seeing the cracks form in terms of its effect on hallucinations; there's also a clear ceiling on the performance the current iteration unlocks, as all these models are converging on pretty similar performance after just a few model releases.
rxtexit•1h ago
We are just back on track.
I just read Oracular Programming: A Modular Foundation for Building LLM-Enabled Software the other day.
We don't even have a new paradigm yet. I would be shocked if, in 10 years, I don't look back at this time of writing a prompt into a chatbot and then pasting the code into an IDE as completely comical.
The most shocking thing to me is we are right back on track to what I would have expected in 2000 for 2025. In 2019 those expectations seemed like science fiction delusions after nothing happening for so long.
a2128•4h ago
Sevii•10h ago
greyadept•9h ago
dymk•9h ago
thuuuomas•9h ago
BoorishBears•7h ago
It should be obvious to anyone with a cursory interest in model training: you can't trust benchmarks unless they're fully private black boxes.
If you can get even a hint of the shape of the questions on a benchmark, it's trivial to synthesize massive amounts of data that help you beat the benchmark. And given the nature of funding right now, you're almost silly not to do it: it's not cheating, it's "demonstrably improving your performance at the downstream task"
tptacek•8h ago
kiitos•6h ago
If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong; the agent has no way to figure out that it's wrong; it's not as if the LLM also produces tests that identify the hallucinated code as wrong. The only way this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong.
You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed?? I have no idea how you came to this belief but it certainly doesn't match my experience.
tptacek•6h ago
An agent lints and compiles code. The LLM is stochastic and unreliable. The agent is ~200 lines of Python code that checks the exit code of the compiler and relays it back to the LLM. You can easily fool an LLM. You can't fool the compiler.
I didn't say anything about whether code needs to be reviewed line-by-line by humans. I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible. But none of it includes hallucinated API calls.
Also, from where did this "you seem to have a fundamental belief" stuff come from? You had like 35 words to go on.
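For readers who haven't seen one of these loops, here is a minimal sketch of the compile-and-relay idea described above (not anyone's actual agent). It assumes a Rust crate checked with `cargo check`, the OpenAI Python SDK, and a prompt that asks the model to reply with the full contents of src/lib.rs and nothing else.

```python
# Illustrative only: the agent just runs the compiler and relays its exit
# code / errors back to the model until the code checks clean.
import pathlib
import subprocess

from openai import OpenAI

client = OpenAI()

def agent_loop(messages: list[dict], max_rounds: int = 5) -> bool:
    """Generate code, compile it, and relay compiler errors back until it checks clean."""
    for _ in range(max_rounds):
        reply = client.chat.completions.create(model="gpt-4o", messages=messages)
        code = reply.choices[0].message.content
        pathlib.Path("src/lib.rs").write_text(code)
        result = subprocess.run(["cargo", "check"], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # compiles clean: hallucinated API calls don't survive this gate
        # The model can be fooled; the compiler can't. Feed its errors back and retry.
        messages.append({"role": "assistant", "content": code})
        messages.append({"role": "user", "content": f"cargo check failed:\n{result.stderr}"})
    return False
```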
someothherguyy•4h ago
fragmede•2h ago
kiitos•1h ago
kiitos•3h ago
The LLM can easily hallucinate code that will satisfy the agent and the compiler but will still fail the actual intent of the user.
> I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible.
Indeed most code that LLMs generate compiles clean and is nevertheless horrible! I'm happy that you recognize this truth, but the fact that you review that LLM-generated code line-by-line makes you an extraordinary exception vs. the normal user, who generates LLM code and absolutely does not review it line-by-line.
> But none of [the LLM generated code] includes hallucinated API calls.
Hallucinated API calls are just one of many many possible kinds of hallucinated code that an LLM can generate, by no means does "hallucinated code" describe only "hallucinated API calls" -- !
saagarjha•2h ago
BoorishBears•6h ago
If you want a model that's getting better at helping you as a tool (which, for the record, I do), then you'd say that in the last 3 months things got better, between Gemini's long-context performance, the return of Claude Opus, etc.
But if your goal post is replacing SWEs entirely... then it's not hard to argue we definitely didn't overcome any new foundational issues in the last 3 months, and not too many were solved in the last 3 years even.
In the last year the only real foundational breakthrough would be RL-based reasoning with test-time compute delivering real results, but what that does to hallucinations, plus even Deepseek catching up with just a few months of post-training, shows that in its current form the technique doesn't completely blow up the barriers that were standing in the way, as people were originally touting it would.
Overall models are getting better at things we can trivially post-train and synthesize examples for, but it doesn't feel like we're breaking unsolved problems at a substantially accelerated rate (yet.)
atomlib•4h ago
groby_b•10h ago
LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.
It's not "techno-utopian determinism". It's a clearly visible trajectory.
Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.
The observation that strict prompt adherence plus prompt archival could shift how we program is both true, and it's a phenomenon we observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.
There's definitely valid criticism of the passage, and it's overly optimistic - in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all correct. That's both a more useful criticism, and not tied to LLM improvements at all.
double0jimb0•9h ago
sumedh•8h ago