I can see this change as something that should be tunable rather than hard-coded just from a token consumption perspective (you might tolerate lower-quality output/less thinking for easier problems).
Today another thing started happening: phrases like "I've been burning too many tokens" or "this has taken too many turns". Ironically, overriding these takes even more custom-instruction tokens.
Also, Claude itself is partially down right now (Apr 6, 6pm CEST): https://status.claude.com/
For example, I wanted to get VNC working with PopOS Cosmic and it'll be like "ah, it's ok, we'll just install sway and that'll work!"
Their status page shows everything is okay.
Second! In CLAUDE.md, I have a full section NOT to ever do this, and how to ACTUALLY fix something.
This has helped enormously.
However I'm not sure how to best prompt against that behavior without influencing it towards swinging the other way and looking for the most intentionally overengineered solutions instead...
If it's really far off the mark, revert back to where you originally sent the prompt and try to steer it more. If it's only starting to hesitate, you can usually correct it without starting over.
Repeatedly, too. I had to make the server reference sources read-only because I got tired of copying them back over.
A trivial example: whenever CC suggests doing more than one thing in a planning mode, just have it focus on each task and subtask separately, bounding each one by a commit. Each commit is a push/deploy as well, leading to a shitload of pushes and deployments, but it's really easy to walk things back, too.
Of course they do say that you should review/test everything the tool creates, but in most contexts, it's sort of added as an afterthought.
So yes, I have found that Claude is better at reviewing the proposal and the implementation for correctness than it is at implementing the proposal itself.
Maybe we're being A/B tested.
Along with Claude Max, I have a ChatGPT Pro plan, and I find it a life-saver for catching all the silliness Opus spits out.
Unable to start session. The authentication server returned an error (500). You can try again.
And less so if you read [1] or similar assessments. I, too, believe that every token is subsidized heavily. From whatever angle you look at it.
Thus quality/token/whatever rug pulls are inevitable, eventually. This is just another one.
Just now I had a bug where a 90 degree image rotation in a crate I wrote was implemented wrong.
I told Claude to find & fix it. It found the broken function, but then went on to fix all of its call sites (inserting two atomic operations at each, i.e. the opposite of DRY) instead of fixing the root cause: the wrong function itself.
And yes, that would not have happened a few months ago.
This was on Opus 4.6 with effort high on a pretty fresh context. Go figure.
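A minimal sketch of the failure mode described above, with hypothetical names and an assumed coordinate convention (the real crate's code is not shown here): the root-cause fix is one line in the rotation function itself, not compensating code at every call site.

```python
def rotate90_cw(x, y, width, height):
    # Rotate pixel coordinates 90 degrees clockwise: a point (x, y)
    # in a width x height image lands at (height - 1 - y, x) in the
    # rotated (height x width) image. The hypothetical bug had these
    # axes swapped; fixing this one line is the root-cause fix, as
    # opposed to patching every call site to undo the wrong result.
    return height - 1 - y, x

# Sanity check: four 90-degree rotations must return the original point.
x, y = 3, 7
w, h = 10, 20
for _ in range(4):
    x, y = rotate90_cw(x, y, w, h)
    w, h = h, w  # image dimensions swap with each rotation
assert (x, y) == (3, 7)
```

A property test like the four-rotation round trip above is the kind of check that catches the bug at its source rather than at the call sites.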
Also, everyone has a different workflow. I can't say that I've noticed a meaningful change in Claude Code quality in a project I've been working on for a while now. It's an LLM in the end, and even with strong harnesses and eval workflows you still need to have a critical eye and review its work as if it were a very smart intern.
Another commenter here mentioned they also haven't noticed any meaningful degradation in Claude quality, possibly because they frontload the planning work and break the work down into more digestible pieces, which is something I do as well and have benefited greatly from.
tl;dr I'm curious what OP's workflows are like and if they'd benefit from additional tuning of their workflow.
The agent has a set of scripts that are well tested, but instead it chooses to write a new bespoke script every time it needs to do something; as a result it rewrites the same bugs over and over, plus unique new bugs each time.
I've lost track of the number of times it's started a task by building its own tools. I remind it that it already has a tool for that exact task, and it proceeds to build its own tools anyway.
This wasn't happening 2 months ago.
This is the whole point of AI. It's a black box that they can completely control.
And I hope we will eventually reach a point where models become "good enough" for certain tasks, and we won't have to replace them every 6 months.
(That would be similar to the evolution of other technologies like personal computers and smartphones.)
The constraints of (b) limit them from raising the price, so that means meeting (a) by making it worse, and maybe eventually doing a price discrimination play with premium tiers that are faster and smarter for 10x the cost. But anything done now that erodes the market's trust in their delivery makes that eventual premium tier a harder sell.
And idk about the pricing thing. Right now I waste multiple dollars on a 40 minute response that is useless. Why would I ever use this product?
I do wonder how much all the engineering put into these coding tools may actually in some cases degrade coding performance relative to simpler instructions and terminal access. Not to mention that the monthly subscription pricing structure incentivizes building the harness to reduce token use. How much of that token efficiency is to the benefit of the user? Someone needs to be doing research comparing e.g. Claude Code vs generic code assist via API access with some minimal tooling and instructions.
I tend to agree about the legacy workarounds being actively harmful though. I tried out Zed agent for a while and I was SHOCKED at how bad its edit tool is compared to the search-and-replace tool in pi. I didn't find a single frontier model capable of using it reliably. By forking, it completely decouples models' thinking from their edits and then erases the evidence from their context. Agents ended up believing that a less capable subagent was making editing mistakes.
Well, according to this story, instructions refined by trial and error over months might be good for one LLM on Tuesday, and then be bad for the same LLM on Wednesday.
That doesn’t mean you personally are required to, but some people do and your interaction with the system of social trust determines how much of that remains opaque to you.
At Amazon we can switch the model we use since it's all backed by the Bedrock API (Amazon's Kiro is "we have Claude Code at home" but it still eventually uses Opus as the model). I suppose this means the issue isn't confined to just Claude Code. I switched back to Opus 4.5 but I guess that won't be served forever.
Instead, orchestrate all agents visibly together, even when there is hierarchy. Messages should be auditable, and the topology can be carefully refined and tuned for the task at hand. Other tools are significantly better at being this layer (e.g. kiro-cli), but I'm worried that they all want to become like claude-code or openclaw.
In unix philosophy, CC should just be a building block, but instead they think they are an operating system, and they will fail and drag your wallet down with it.
I'm purely arguing on technical basis, "person" may fall into either of those camps of philosophy.
Compare that to just creating a project and chatting with it, which solves nearly everything I have thrown at it so far.
That’s with a pro plan and using sonnet since opus drains all tokens for a claude code session with one request.
Something worse than a bad model is an inconsistent model. One can't gauge to what extent to trust the output, even for the simplest instructions, hence everything must be reviewed with intensity which is exhausting. I jumped on Max because it was worth it but I guess I'll have to cancel this garbage.
I've basically stopped using it because I have to be so hands on now.
A month later, I literally cannot get them to iterate or improve on it. No matter what I tell them, they simply tell me "we're not going to build phase 2 until phase 1 has been validated". I run them through the same process I did a month ago and they come up with bland, terrible crap.
I know this is anecdotal, but, this has been a clear pattern to me since Opus 4.6 came out. I feel like I'm working with Sonnet again.
I'm not trying to discredit your experience and maybe it really is something wrong with the model.
But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.
Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.
But I'm optimistic that this will gradually improve in time.
The codebase itself is architected and documented to be LLM-friendly, and claude.md provides a very strong harness for how to do things.
As architect Claude is abysmal, but when you give it an existing software pattern it merely needs to extend, it’s so good it still gives me probably something like 5x feature velocity boost.
Plus when doing large refactorings, it forgets far fewer things than me.
Inventing new architecture is as hard as ever and it's not a great help there - unless you can point it to some well documented pattern and tell it ”do it like that please”.
Yeah, that's a different problem to the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.
Brownfield? Not so much.
I feel that we look for patterns to the point of being superstitious. (ML would call it overfitting.)
Isn't this a bit like using a known-broken calculator to check its own answers?
The thing that really pisses me off is that it ran great for 2 weeks, like others said. I had gotten the annual Pro plan, and it went to shit after that.
Bait and switch at its finest.
Don't forget the 10x token cost cache eviction penalty you pay for resuming the session later.
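Back-of-the-envelope numbers for that penalty, assuming a typical prompt-caching scheme where cached reads are billed at roughly a tenth of the base input rate (the prices below are illustrative, not Anthropic's actual rate card):

```python
base_input = 3.00               # $/M input tokens (illustrative)
cache_read = base_input * 0.1   # cached context re-read at ~0.1x base
context_tokens = 150_000        # accumulated session context

warm = context_tokens * cache_read / 1_000_000  # cache still hot
cold = context_tokens * base_input / 1_000_000  # cache evicted
print(f"warm resume: ${warm:.3f}, cold resume: ${cold:.3f}, "
      f"penalty: {cold / warm:.0f}x")
```

Under these assumptions, resuming after eviction re-bills the whole context at the full input price, hence the 10x.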
(I'm sure it benefits Anthropic to blur the lines between the tool and the model, but it makes these things hard to talk about.)
On 18,000+ prompts.
Not sure the data says what they think it says.
A bit ironic to use the tool that can't think to write up your report on said tool. That, and this issue[1], demonstrate the extent to which folks have become over-reliant on LLMs: their review process let so many defects through that they now have to stop work and comb over everything they've shipped in the past 1.5 months! This is the future.
[1] https://github.com/anthropics/claude-code/issues/42796#issue...
Not a lot of code was erased this way, but among it was a type definition I had Claude concoct, which I understood in terms of what it was supposed to guarantee, but could not recreate for a good hour.
Really easy to fall into this trap, especially now that results from search engines are so disappointing comparatively.
For certain work, we'll have to let go of this desire.
If you limit yourself to whatever you can recreate, then you are effectively limiting the work you can produce to what you know.
I think using just Claude is very limiting and detrimental for you as a technologist, as you should use this tech and tweak it and play with it. They want to be like Apple: shut up and give us your money.
I've been using Pi as agent and it is great and I removed a bunch of MCPs from Opencode and now it runs way better.
Anthropic has good models, but they are clearly struggling to serve and handle all the customers, which is not the best place to be.
I think as a technologist, I would love a client with huge codebase. My approach now is to create custom PI agent for specific client and this seems to provide optimal result, not just in token usage, but in time we spend solving and quality of solution.
Get another engine as a backup; you will be happier.
- expletives per message: 2.1x
- messages with expletives: 2.2x
- expletives per word: 4.4x(!)
- messages >50% ALL CAPS: 2.5x
Either the model has degraded, or my patience has.
Huh?
** ** ** ** implement ** ** ** ** no ** ** ** ** ** mistakes
If you're so convinced the models keep getting worse, build or crowdfund your own tracker.
Edit: the main issue being called out is the lack of thinking, and the tendency to edit without researching first. Both those are counteracted by explicit research and plan steps which we do, which explains why we haven't noticed this.
I'm regularly switching back to 4.5 and preferring it. I'm not excited for when it gets sunset later this year if 4.6 isn't fixed or superseded by then.
It is a matter of paradigm.
Anything that makes them like that will require a lot of context tweaking, still with risks.
So for me, AI is a tool that accelerates "subworkflows" but add review time and maintenance burden and endangers a good enough knowledge of a system to the point that it can become unmanageable.
Also, code is a liability. That is what they do the most: generate lots and lots of code.
So IMHO, and unless something changes a lot, good LLMs will have relatively bounded areas where they perform reasonably, and outside of those, all bets are off.
AI is 'creative enough' - whether we call it 'synthetic creativity' or whatever, it definitely can explore enough combinations and permutations that it's suitably novel. Maybe it won't produce 'deeply original works' - but it'll be good enough 99.99% of the time.
The reliability issue is real.
It may not be solvable at the level of LLM.
Right now everything is LLM-driven; maybe in a few years it will be more agentically driven, where the LLM is used as 'compute' and we can pave over the 'unreliability'.
For example, the AI is really good when it has a lot of context and can identify a narrow issue.
It gets bad during action and context-rot.
We can overcome a lot of this with a lot more token usage.
Imagine a situation where we use 1000x more tokens, and we have 2 layers of abstraction running the LLMs.
We're running 64K computers today, things change with 1G of RAM.
But yes - limitations will remain.
Constantly worrying, "is this a superset? Is this a superset?" Is exhausting. Just use the damn tool, stop arguing about if this LLM can get all possible out of distribution things that you would care about or whatever. If it sucks, don't make excuses for it, it sucks. We don't give Einstein a pass for saying dumb shit either, and the LLM ain't no Einstein
If there's one thing to learn from philosophy, it's that asking the question often smuggles in the answer. Ask "is it possible to make an unconstrained deity?" And you get arguments about God.
it's a tool like everything else we've gotten before, but admittedly a much more major one
but "creativity" must come from either its training data (already widely known) or from the prompts (i.e. mostly human sources)
Been having this feeling that things have got worse recently but didn't think it could be model related.
The most frustrating aspect recently (I have learned and accepted that Claude produces bad code and probably always did, mea culpa) is the non-compliance. Claude is racing away doing its own thing, fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.
The stuff about token consumption is also interesting. Minimax/Composer have this habit of extensive thinking, and it is said to be their strength, but it seems to come at the price of huge output token consumption. Compared against non-thinking models there is a quality gap, but imo, given that the eventual code quality even with huge thinking/token consumption is not so great... it doesn't feel like a huge gap.
If you take Sonnet's $5 output tokens and compare with QwenCoder non-thinking at under $0.5 (and remember the effective gap is probably larger than 10x, because Sonnet will spend more tokens "thinking")... is the gap in code quality that large? Imo, not really.
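The arithmetic behind that comparison, using the per-token prices above plus an assumed thinking-token multiplier (the 3x overhead and task size are guesses for illustration):

```python
sonnet_price = 5.00   # $/M output tokens
qwen_price = 0.50     # $/M output tokens

task_tokens = 20_000        # assumed visible output for one coding task
thinking_multiplier = 3     # assumed hidden-reasoning overhead for Sonnet

sonnet_cost = task_tokens * thinking_multiplier * sonnet_price / 1_000_000
qwen_cost = task_tokens * qwen_price / 1_000_000
print(f"Sonnet: ${sonnet_cost:.2f}  Qwen: ${qwen_cost:.2f}  "
      f"ratio: {sonnet_cost / qwen_cost:.0f}x")
```

Even with a modest thinking multiplier, the effective price gap under these assumptions lands well past the nominal 10x.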
Have been a subscriber since December 2024 but looking elsewhere now. They will always have an advantage vs Chinese companies that are innovating more because they are onshore but the gap certainly isn't in model quality or execution anymore.
maybe they tried to give it the characteristics of motivated junior developers
The frustrating part is that it's not a workflow _or_ model issue, but a silently-introduced limitation of the subscription plan. They switched thinking to be variable by load, redacted the thinking so no one could notice, and then have been running it at ~1/10th the thinking depth nearly 24/7 for a month. That's with max effort on, adaptive thinking disabled, high max thinking tokens, etc etc. Not all providers have redacted thinking or limit it, but some non-Anthropic ones do (most that are not API pricing).

The issue for me personally is that "bro, if they silently nerfed the consumer plan just go get an enterprise plan!" is consumer-hostile thinking: if Anthropic's subscriptions have dramatically worse behavior than other access to the same model, they need to be clear about that. Today there is zero indication from Anthropic that the limitation exists, the redaction was a deliberate feature intended to hide it from the impacted customers, and the community is gaslighting itself with "write a better prompt" or "break everything into tiny tasks and watch it like a hawk, same as you would a local 27B model" or "works for me <in some unmentioned configuration>" - sucks :/
I knew I should have been alerted when Anthropic gave out €200 free API usage. Evidently they know.
Not saying this problem doesn't exist, but if the model is so bad at complex tasks, how can we take a ticket written by it seriously? Or did this author use ChatGPT to write it? (That'd be quite some ironic value, admittedly.)
Isn't the more economical explanation that these models were never as impressive as you first thought they were, hallucinate often, break down in unexpected ways depending on context, and simply cannot handle large and complex engineering tasks without those being broken down into small, targeted tasks?
An "economical explanation" is actually that Anthropic subscriptions are heavily subsidized and after a while they realized that they need to make Claude be more stingy with thinking tokens. So they modified the instructions and this is the result.
Should I switch back to API pricing? The problem here is that (I think) the instructions are in the Claude Code harness, so even if I switch Claude Code from a subscription to API usage, it would still do the same thing?
StanAngeloff•4h ago
I was wondering if anyone else is also experiencing this? I have personally found that I have to add more and more CLAUDE.md guide rails, and my CLAUDE.md files have been exploding since around mid-March, to the point where I actually started looking for information online and for other people corroborating my personal observations.
This GH issue report sounds very plausible, but as with anything AI-generated (the issue itself appears to be largely AI assisted) it’s kind of hard to know for sure if it is accurate or completely made up. _Correlation does not imply causation_ and all that. Speaking personally, findings match my own circumstances where I’ve seen noticeable degradation in Opus outputs and thinking.
EDIT: The Claude Code Opus 4.6 Performance Tracker[1] is reporting Nominal.
[1]: https://marginlab.ai/trackers/claude-code/
jgrahamc•3h ago
StanAngeloff•3h ago
Another thing that worked like magic prior to Feb/Mar was how likely Claude was to load a skill whenever it deduced that a skill might be useful. I personally use [superpowers][1] a lot, and I've noticed that I have to be very explicit when I want a specific skill to be used - to the point that I have to reference the skill by name.
[1]: https://github.com/obra/superpowers
Larrikin•1h ago
StanAngeloff•52m ago
This was a first for me with Sonnet. It completely veered off the prompt it was given (review a design document) and instead came out with a verbose suggestion to do a mechanical search-and-replace to use this newly fabricated function name - which it even spelled incorrectly. I had to Google numey to make sure Sonnet wasn't outsmarting me.
sixothree•38m ago
loloquwowndueo•1h ago
I told it to implement the server side one, it said ok, I tabbed away for a while, came to find the js implementation, checking the log Claude said “on second thought I think I’ll do the client side version instead”.
Rarely do I throw an expletive bomb at Claude - this was one such time.
sixothree•37m ago
loloquwowndueo•27m ago
It’s always “you’re using the tool wrong, need to tweak this knob or that yadda yadda”.
denimnerd42•1h ago
matheusmoreira•1h ago
Avicebron•1h ago
matheusmoreira•1h ago
> When thinking is deep, the model resolves contradictions internally before producing output.
> When thinking is shallow, contradictions surface in the output as visible self-corrections: "oh wait", "actually,", "let me reconsider", "hmm, actually", "no wait."
Yeah, THIS is something that I've seen happen a lot. Sometimes even on Opus with max effort.
StanAngeloff•1h ago
I wonder if this is even more exaggerated now through Easter, as everyone's got a bit of extra time to sit down and <play> with Claude. That might be pushing capacity over the limit - I just don't know enough about how Anthropic provisions and manages capacity to know if that could be a factor. However, quality has gotten really bad over the holiday.
mikkupikku•1h ago
StanAngeloff•1h ago
fxtentacle•4m ago
Also, it's probably very easy to spot such benchmarks and lock-in full thinking just for them. Some ISPs do the same where your internet speed magically resets to normal as soon as you open speedtest.net ...