> The moment you point it at a real, existing codebase - even a small one - everything falls apart.
Not my experience. It excels in existing codebases too. I often ask it "I have this bug. Why?" and it almost always figures it out and fixes it. Huge codebase.
Codex user, not Claude Code.
Anyways, I think one area where Codex and Claude Code fall short is that they don't test the changes they made by actually using the app.
In this case, the LLM should ideally render the page in a real browser and actually click the buttons to verify. Best if the LLM tests it before the changes and then after, so it can confirm the behavior is the same. Maybe it should take a screenshot before the change, take another after, and compare them.
I asked why Codex and Claude don't do this here: https://news.ycombinator.com/item?id=46792066
EDIT: Sorry, just noticed you said "real browser". Haven't tried this but Playwright gets you a long way down the road.
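For what it's worth, a minimal sketch of that before/after screenshot idea using Playwright's Python sync API might look like this (the URL and the button selector are placeholders, not from the original post):

    # Sketch only: assumes `pip install playwright` and `playwright install chromium`.
    from playwright.sync_api import sync_playwright

    def snapshot(page, path):
        page.goto("http://localhost:8000/")  # placeholder: your dev server
        page.click("#submit")                # placeholder: the button under test
        page.screenshot(path=path, full_page=True)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        snapshot(page, "before.png")
        # ... agent applies its change, dev server reloads ...
        snapshot(page, "after.png")
        browser.close()

Comparing the two images pixel-for-pixel is brittle (timestamps, animations), so in practice you'd diff with a tolerance or assert on specific DOM state instead.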
Why don't you prove it?
1. Find an old, large codebase on Codeberg (avoiding the octopus for obvious reasons)
2. Video stream the session and make the LLM convo public
3. Ask your LLM to remove jQuery from the codebase and push regular commits to a public remote branch
Then we will be able to judge if the evidence stands
And I don't remove jQuery every day. Maybe the OP is right that Opus 4.6 sucks at removing jQuery. I don't know. I've never asked an AI to do it.
> The moment you point it at a real, existing codebase - even a small one - everything falls apart.
This statement is absolutely not true based on my experience. Codex has been amazing for me at existing codebases.

That's funny. I feel like it's the opposite. Claiming that Opus 4.6 or GPT 5.3 fails as soon as you point them at an existing codebase, big or small, is a much more extraordinary claim.
I honestly think that size and age alone are sufficient to lead these tools into failure cases.
Is your AI PR publicly available on GitHub?
Some things AI does well, many things may not be worth the effort entailed, and some it downright sucks at and may even be harmful. The question is: will the curve ever shift to where it's useful most of the time?
If coding agents can't test the code as they're editing it they're no different from pasting your entire codebase into ChatGPT and crossing your fingers.
At one point you mention it hadn't run "npm test" - did it run that once you directly told it to?
I start every one of my coding agent sessions with "run uv run pytest" purely to confirm that it can run the tests and seed the idea with it that tests exist and matter to me.
Your post ends with a screenshot showing you debating a C# syntax thing with the bot. I recommend telling it "write code that demonstrates if this works or not" in cases like that.
> If coding agents can't test the code as they're editing it they're no different from pasting your entire codebase into ChatGPT and crossing your fingers.
Out of curiosity, how do you get Claude Code or Codex to actually do this? I asked this question here before.

Most importantly, all of my Python projects use a pyproject.toml file with this pattern:
[dependency-groups]
dev = ["pytest"]
Which means I can tell the agent: Run "uv run pytest"
And it will run the tests - without first needing to set up a virtual environment or install dependencies or anything like that. I wrote more about that pattern here: https://til.simonwillison.net/uv/dependency-groups

For more complex test suites I'll give it more detailed instructions.
For testing web apps I used to tell it "use playwright" or "use playwright Python".
I'm currently experimenting with my own simple CLI browser automation tool. This means I can tell it:
Run "uvx rodney --help" and then use
rodney to test this change
The --help output tells it everything it needs to use the tool - here's that document in the repo: https://github.com/simonw/rodney/blob/10b2a6c81f9f3fb36ce4d1...

I've recently started having the bots "manually" test changes with a new tool I built called Showboat. It's less than a week old but it's so far been working really well: https://simonwillison.net/2026/Feb/10/showboat-and-rodney/
jQuery: It's Going Absolutely Nowhere™
My experience with 4.6 has been that it gobbles up tokens like crazy, but it's pretty smart otherwise. Even the latest LLMs need a lot of context to know what they're working on and which versions to target, plus access to some MCP like Context7 to get up-to-date documentation (especially for js/ts).
My non-tech friends have a tendency to talk to AI like a person and then complain about the quality of the answers, and I always tell them: ask your question, with one or two follow-ups max, then start a new conversation. Also, provide as much relevant context as possible to get the best answer, even if it seems obvious. I'd expect a SWE to already be aware of this stuff.
I've been able to find obscure edge cases thanks to Claude and I've also had it produce code that does the opposite of what I asked even with a clear prompt, but that's the nature of LLMs.
I'm a little baffled by this post. The author claims to have "Wrote a comprehensive CLAUDE.md with detailed instructions" and yet didn't have "run the tests" anywhere? I realize this post is going to be a playground for bashing AI, but I just wish the prompt was published, or even better, if it's open source, let other people try. Seems like the perfect case to throw Claude Code in a wiggum loop overnight.
Because we're still paying for Brendan Eich's mistakes 30 years later (though Brendan isn't, apparently), and even an LLM trained on an unfathomably-large corpus of code by experts at hundreds of millions of dollars of expense can't unscrew it. What, like, even is a language's standard library, man?
> The moment you point it at a real, existing codebase - even a small one - everything falls apart
That's not been my experience with running Claude to create production code. Plan mode is absolutely your friend, as is tuning your memory files and prompts. You'll need to do code reviews as before, and when it makes changes that you don't like (like patching in unit tests), you need to correct it.
Also, we use hexagonal architecture, so there are clean patterns for it to gather context from. FWIW, I work in Python, not JS, so when Claude was trained on it, there weren't twenty wildly different flavor-of-the-week-fifteen-years-ago frameworks and libraries to confuse it.
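To illustrate why that helps (a minimal sketch with invented names, not code from that team): in a hexagonal layout the domain core only knows about a port, and adapters implement it, so an agent always has a predictable place to look for each concern:

    from typing import Protocol

    class OrderRepository(Protocol):  # port: the only thing the core knows about
        def save(self, order_id: str, total: float) -> None: ...

    class PlaceOrder:  # domain core: pure logic, no framework imports
        def __init__(self, repo: OrderRepository) -> None:
            self.repo = repo

        def execute(self, order_id: str, total: float) -> None:
            if total <= 0:
                raise ValueError("total must be positive")
            self.repo.save(order_id, total)

    class PostgresOrderRepository:  # adapter: the one obvious home for SQL
        def save(self, order_id: str, total: float) -> None:
            ...  # real DB access would live here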
If JS sucks to write as a human, it will suck even more to write as an LLM.
Which tools did it use?
(Please oh please can we have a Show HN AI. I'm not interested in people's weekend vibe-coded app to replace X popular tool. I want to check out cool projects where people invested their passion and time.)
Either way, OP is holding it wrong and vague hypebro comments like yours don't help either. Be specific.
Here's an example: I told Claude 4.5 Opus to go through our DB migration files and the ORM model definitions and point out any DB indexes we might be missing based on how the data is being accessed. It did so, ingested all the controllers as well and a short while later presented me with a list of missing indexes, ordered by importance and listing why each index would speed up reads and how to test the gains.
Now, I have no way of knowing how exhaustive the analysis was, but the suggestions it gave were helpful, Claude did not recommend over-indexing, and considered read vs write performance.
The equivalent work would have taken me a day, Claude gave me something helpful in a matter of minutes.
Now, I for one could not handle the information stream of 20 such analyses coming in. I can't even handle 2 large feature PRs in parallel. This is where I ask for more specifics.
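As a concrete illustration of the kind of finding that analysis produces (all names here are invented, not from that codebase): a column that controllers filter on constantly but that carries no index:

    # Hypothetical example: model, hot query path, and the suggested index.
    from sqlalchemy import Column, DateTime, Integer, String, Index
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class User(Base):
        __tablename__ = "users"
        id = Column(Integer, primary_key=True)
        email = Column(String, unique=True)   # unique constraint already indexes this
        last_login_at = Column(DateTime)      # filtered in controllers, but unindexed

    # Controller access pattern that motivates the index:
    #   SELECT * FROM users WHERE last_login_at >= :cutoff ORDER BY last_login_at
    # Without an index this is a full table scan; with it, a range scan.
    ix = Index("ix_users_last_login_at", User.last_login_at)

    # Verify the gain before committing, e.g. in Postgres:
    #   EXPLAIN ANALYZE SELECT * FROM users
    #   WHERE last_login_at >= now() - interval '30 days';

The read-vs-write tradeoff the comment mentions is visible right here: every login now pays one extra index write in exchange for faster reads.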
https://steve-yegge.medium.com/gas-town-emergency-user-manua...
It's not too late to jump on the Cocaine-Driven Development Orchestrated by LLMs train.