When I investigated I found the docs and implementation are completely out of sync, but the implementation doesn’t work anyway. Then I went poking on GitHub and found a vibed fix diff that changed the behavior in a totally new direction (it did not update the documentation).
Seems like everyone over there is vibing and no one is rationalizing the whole.
Claude Code creator literally brags about running 10 agents in parallel 24/7. It doesn't just seems like it, they confirmed like it is the most positive thing ever.
Full disclosure - I am a heavy codex user and I review and understand every line of code. I manually fight spurious tests it tries to add by pointing a similar one already exists and we can get coverage with +1 LOC vs +50. It's exhausting, but personal productivity is still way up.
I think the future is bright because training / fine-tuning taste, dialing down agentic frameworks, introducing adversarial agents, and increasing model context windows all seem attainable and stackable.
I can’t understand how people would run agents 24/7. The agent is producing mediocre code and is bottlenecked on my review & fixes. I think I’m only marginally faster than I was without LLMs.
And specifically: Lots of checks for impossible error conditions - often then supplying an incorrect "default value" in the case of those error conditions which would result in completely wrong behavior that would be really hard to debug if a future change ever makes those branches actually reachable.
I don’t know where the LLMs are picking up this paranoid tendency to handle every single error case. It’s worth knowing about the error cases, but it requires a lot more knowledge and reasoning about the current state of the program to think about how they should be handled. Not something you can figure out just by looking at a snippet.
In particular writing tests that do nothing, writing tests and then skipping them to resolve test failures, and everybody's favorite: writing a test that greps the source code for a string (which is just insane, how did it get this idea?)
That is not an uncommon occurrence in human-written code as well :-\
Automation is great for scaling work but it also we can break more things faster when things go wrong.
(Paraphrasing because I don’t have the original comment link.)
Then again, the google home page was broken on FF on Android for how long?
You can assert that something you want to happen is actually happening
How do you assert all the things it shouldn't be doing? They're endless. And AI WILL mess up
It's enough if you're actively reviewing the code in depth.. but if you're vibe coding? Good luck
Doesn't mean it's not a useful tool - if you read and think about the output you can keep it in check. But the "100% of my contributions to Claude Code were written by Claude Code" claim by the creator makes me doubt this is being done.
Edit: And 3 minutes later it is back...
Similarly, Human-in-the-loop utilization of AI/ML tooling in software development is expected and in fact encouraged.
Any IP that is monetizable and requires significant transformation will continue to see humans-in-the-loop.
Weak hiring in the tech industry is for other reasons (macro changes, crappy/overpriced "talent", foreign subsidies, demanding remote work).
AI+Competent Developer paid $300k TC > Competent Developer paid $400k TC >>> AI+Average Developer paid $30k TC >> Average Developer paid $40k TC >>>>> Average Developer paid $200k TC
Other AI agents, I guess. Call Claude in to clean up code written by Gemini, then ChatGPT to clean up the bugs introduced by Claude, then start the cycle over again.
If the code is cheap (and it certainly is), then tossing it out and replacing it can also be cheap.
Shaping of a codebase is the name of the game - this has always been, and still, is difficult. Build something, add to it, refactor, abstraction doesn’t sit right, refactor, semantics change, refactor, etc, etc.
I’m surprised at how so few seem to get this. Working enterprise code, many codebases 10-20 years old could just as well have been produced by LLMs.
We’ve never been good at paying debt and you kind of need a bit of OCD to keep a code base in check. LLM exacerbates the lack of continuous moulding as iterations can be massive and quick.
The amount of times I have to "yell" at the llm for adding #[allow] statements to silence the linter instead of fixing the code is crazy and when I point it out they go "Oops, you caught me, let me fix it the proper way".
So the tests don't necessarily make them produce proper code.
So I have a different experience with Claude Code, but I'm not trying to say you're holding it wrong, just adding a data point, and then, maybe I got lucky.
And this is not tied to the LLMs. It's that to EVERYTHING we do. There are limits everywhere.
And for humans the context window might be smaller, but at least we have developed methods of abstracting different context windows, by making libraries.
Now, as a trade-off of trying to go super-fast, changes need to be made in response to your current prompts, and there is no time validate behavior in cases you haven't considered.
And regardless of whether you have abstractions in libraries, or whether you have inlined code everywhere, you're gonna have issues.
With libraries changes in behavior are going to impact code in places you don't want, but also, you don't necessarily know, as you haven't tested all paths.
With inlined code everywhere you're probably going to miss instances, or code goes on to live its own life and you lose track of it.
They built a skyscraper while shifting out foundational pieces. And now a part of the skyscraper is on the foundation of your backyard shed.
The degradation is palpable.
I have been using vscode github copilot chat with mostly the claude opus 4.5 model. The underlying code for vscode github copilot chat has turned to shit. It will continuously make mistakes no matter what for 20 minutes. This morning I was researching Claude Code and pricing thinking about switching however this post sounds like it has turned to shit also. I don't mind spending $300-$500 a month for a tool that was a month ago accomplishing in a day what would take me 3-4 days to code. However, the days since the last update have been shit.
Clearly the AI companies can't afford to run these models at profit. Do I buy puts?
It's screwing up even in very simple rebases. I got a bug where a value wasn't being retrieved correctly, and Claude's solution was to create an endpoint and use an HTTP GET from within the same back-end! Now it feels worse than Sonnet.
All the engineers I asked today have said the same thing. Something is not right.
A model or new model version X is released, everyone is really impressed.
3 months later, "Did they nerf X?"
It's been this way since the original chatGPT release.
The answer is typically no, it's just your expectations have risen. What was previously mind-blowing improvement is now expected, and any mis-steps feel amplified.
What we need is an open and independent way of testing LLMs and stricter regulation on the disclosure of a product change when it is paid under a subscription or prepaid plan.
I mean, that's part of the problem: as far as I know, no claim of "this model has gotten worse since release!" has ever been validated by benchmarks. Obviously benchmarking models is an extremely hard problem, and you can try and make the case that the regressions aren't being captured by the benchmarks somehow, but until we have a repeatable benchmark which shows the regression, none of these companies are going to give you a refund based on your vibes.
Unfortunately, it's paywalled most of the historical data since I last looked at it, but interesting that opus has dipped below sonnet on overall performance.
This is not the same thing as a "omg vibes are off", it's reproducible, I am using the same prompts and files, and getting way worse results than any other model.
It has a habit of trusting documentation over the actual code itself, causing no end of trouble.
Check your claude.md files (both local and ~user ) too, there could be something lurking there.
Or maybe it has horribly regressed, but that hasn't been my experience, certainly not back to Sonnet levels of needing constant babysitting.
An upcoming IPO increases pressure to make financials look prettier.
In fact as my prompts and documents get better it seems it does increasingly better.
Still, it can't replace a human, I really need to correct it at all, and if I try to one shot a feature I always end up spending more time refactoring it few days later.
Still, it's a huge boost to productivity, but the time it can take over without detailed info and oversight is far away.
However when I try to log in via CLI it takes me to a webpage with an “Authorize” button. Clicking the button does nothing. An error is logged to the console but nothing displays in the UI.
We reached out to support who have not helped.
Not a great first impression
For the claude.ai UI, I've never had a single deep research properly transition (and I've done probably 50 or so) to its finished state. I just know to refresh the page after ~10mins to make the report show up.
Just a pro sub - not max.
Most of the time it gives me a heads up that I'm at 90% but a lot of the times it just failed, no warning, and I assumed it was I hit max.
Sometimes, poor old Claude wants to go on holiday and that is a problem?!?
---
> Just my own observation that the same pattern has occurred at least 3 times now:
> release a model; overhype it; provide max compute; sell it as the new baseline
> this attracts a new wave of users to show exponential growth & bring in the next round of VC funding (they only care about MAU going up, couldn’t care less about existing paying users)
> slowly degrade the model and reduce inference
> when users start complaining, initially ignore them entirely then start gaslighting and make official statements denying any degradation
> then frame it as a tiny minority of users experiencing issues then, when pressure grows, blame it on an “accidentally” misconfigured servers that “unintentionally” reduced quality (which coincidentally happened to save the company tonnes of $).
Businesses like google were already a step in the wrong direction in terms of customer service, but the new wave of AI companies seem to have decided their only relation to clients is collecting their money.
Unclear costs, no support, gaslighting customers when a model is degraded, incoming rug pulls..
I cancelled my subscription.
I just signed up as a paying customer, only to find that Claude is totally unusable for my purposes at the moment. There's also no support (shocker), despite their claims that you'll be E-mailed by the support team if you file a report.
I recently put a little money on the API for my personal account. I seem to burn more tokens on my personal account than my day job, in spite of using AI for 4x as long at work, and I’m trying to figure out why.
This is the new status quo for software ... changing and breaking beneath your feet like sand.
measurablefunc•1h ago