Again goes back to the "intern" analogy people like to make.
Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.
I'm using Zed and Claude Code as my harnesses.
However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after OSS projects like opencode.
I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.
That said, there is now much better competition with Codex, so there's only so much rope they have now.
I never understood why people cheered for Anthropic then when they happily work together with Palantir.
They don't actually pay the bill or see it.
But if a tool is better, it's better.
If you have a good product, you are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value ratio went down. But Opus 4.5 was noticeably better and only came out in November.
There was no price increase at that time, so for the same money we get better models. Opus 4.6 again feels noticeably better, though.
Also moving fastish means having more/better models faster.
I do know plenty of people, though, who use opencode or pi and openrouter and switch models a lot more often.
Idiots keep throwing money at real-time enshittification and "I am changing the terms. Pray I do not change them further".
And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' idiots.
And Anthropic has proved that they will pay for less and less. So, why not fuck them over and make more company money?
We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?
Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.
I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.
Claude caveman in the system prompt confirmed?
Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?
Notably missing from the postmortem
Apparently they are using another version internally.
The other thing, when Anthropic turns on lazy Claude... (I want to coin here the term Claudez for the version of Claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...
YES... DO IT... FRICKING MACHINE..
There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.
One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user b) dumbs down output for that specific prompt
Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.
Enough that the prompt is different at a token-level, but not enough that the meaning changes.
It would be very difficult for them to catch that, especially if the prompts were not made public.
Run the variations enough times per day, and you'd get some statistical significance.
I guess the fuzzy part is judging the output.
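A minimal sketch of the variation idea above, not anyone's actual eval suite: generate token-level variants of one prompt from a template, then compare pass counts from two days with a two-proportion z-test. The function names, the template, and the slot values are all made up for illustration, and deciding whether a single run "passed" is the fuzzy part that is left out here.

```python
import itertools
import math


def prompt_variants(template, slots):
    """Yield every combination of slot values filled into the template.

    `slots` maps a placeholder name to a list of interchangeable wordings,
    so each variant differs at the token level but keeps the same meaning.
    """
    keys = sorted(slots)
    for combo in itertools.product(*(slots[k] for k in keys)):
        yield template.format(**dict(zip(keys, combo)))


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p-value) for the difference between two pass rates."""
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both samples passed (or failed) every run: no measurable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    for prompt in prompt_variants(
        "{verb} the file {name} and summarize it.",
        {"verb": ["Read", "Open"], "name": ["notes.txt"]},
    ):
        print(prompt)
    # Hypothetical counts: 90/100 passes last week versus 60/100 today.
    print(two_proportion_z(90, 100, 60, 100))
```

With a few hundred runs per day a drop of ten points or so should show up as a small p-value, while day-to-day noise on a handful of runs will not.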
> Next steps are to run `cat /path/to/file` to see what the contents are
Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).
That and "Auto" mode really are grinding my gears recently. Now, after a planning session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related since the times I've let it run on "Auto" mode are when it gives up/gets stuck more often.
Just the other day it was in Auto mode (by accident) and I told it:
> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.
And it got stuck in some loop/dead-end, telling me I should do it and that it didn't want to run commands out on a "Shared Dev server" (even though I had specifically told it that this was not a shared server).
The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.
If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a max plan? Or $100 per 1M output tokens (playing Numberwang here, but the point stands).
If I have to guess, they are trying to get their balance sheet in order for an IPO, and they basically have 3 ways of achieving that:
1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that
2. Dumb the models down (basically decreasing their cost per token)
3. Send fewer tokens (i.e. capping thinking budgets aggressively).
2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin for each.
"That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."
"The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."
"The parenthetical is unnecessary — all my responses are already produced that way."
However I'm not doing anything of the sort and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that are somehow additional to its normal guidance, and for whatever reason it can't differentiate between those and my questions.
Paying by the token while token usage is totally opaque is a super convenient money-printing machine.
If that demand even slows down in the slightest the whole bubble collapses.
Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.
What I notice: after 300k there's some slight quality drop, but I just make sure to compact before that threshold.
Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.
I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.
A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of the time. Remove a sentence and it will solve the problem 70% of the time.
It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.
Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
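Roughly what such a harness could look like, as a sketch under stated assumptions: `run_once` stands in for "run the agent on the task and grade the result" (a hypothetical callable returning True or False), and the regression check uses a Wilson score interval, which is my choice here rather than anything the labs have said they use.

```python
import math


def pass_rate(run_once, trials):
    """Run the task `trials` times and return (passes, trials)."""
    passes = sum(1 for _ in range(trials) if run_once())
    return passes, trials


def wilson_interval(passes, trials, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 is about 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def regressed(baseline_rate, passes, trials):
    """Flag a regression only when the whole interval sits below the baseline."""
    _, high = wilson_interval(passes, trials)
    return high < baseline_rate


if __name__ == "__main__":
    # Hypothetical numbers: the old prompt solved the task 90% of the time,
    # the edited prompt passed 70 of 100 runs.
    print(wilson_interval(70, 100))
    print(regressed(0.90, 70, 100))
```

The point of the interval is exactly the one made above: a single failed run says nothing, but a pass rate whose upper bound falls below the historical rate is a quantified regression.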
The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?
It's interesting to look at how many bugs are filed on the various coding agent repos. Hard to say how many are real / unique, but quantities feel very high and not hard to run into real bugs rapidly as a user as you use various features and slash commands.
Somehow, three times makes me not feel confident on this response.
Also, if this is all true and correct, how the heck do they validate quality before shipping anything?
Shipping software without quality is a pretty easy job even without AI. Just saying....
This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.
The default thinking level seems more forgivable, but with the churn in system prompts, I'll need to figure out how to intentionally choose a refresh cycle.
1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.
2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.
Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.
The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
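To put rough numbers on that corner case, here is a back-of-the-envelope sketch. The multipliers (cache reads at 0.1x and cache writes at 1.25x the base input price) are illustrative assumptions rather than quoted pricing, and `turn_cost` is a hypothetical helper, not part of any real API.

```python
def turn_cost(context_tokens, new_tokens, cache_warm,
              read_mult=0.1, write_mult=1.25):
    """Cost of one turn, measured in base-input-token units.

    Warm cache: the existing context is read from cache and only the new
    message is written. Cold cache (e.g. after a long idle): everything,
    old context plus new message, is written to cache again.
    """
    if cache_warm:
        return context_tokens * read_mult + new_tokens * write_mult
    return (context_tokens + new_tokens) * write_mult


if __name__ == "__main__":
    # The extreme case from above: 900k tokens of context, a 1k-token message.
    warm = turn_cost(900_000, 1_000, cache_warm=True)
    cold = turn_cost(900_000, 1_000, cache_warm=False)
    print(f"warm: {warm:,.0f}  cold: {cold:,.0f}  ratio: {cold / warm:.1f}x")
```

Under these assumed multipliers the first message after the idle gap costs on the order of ten times a normal turn, which is why a single resume can eat a visible chunk of a rate limit.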
We tried a few different approaches to improve this UX:
1. Educating users on X/social
2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)
3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.
Hope this is helpful. Happy to answer any questions you have.
Thank you.
I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.
Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.
They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.
All the while all the official channels refused to acknowledge any problems.
Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.
It's a little concerning that it's number 1 in your list.
Two questions if you see this:
1) if this isn't best practice, what is the best way to preserve highly specific contexts?
2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?
The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!
No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X
I feel like that is a choice best left up to users.
i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"
I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.
For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.
Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?
Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.
I would not suspect quantization before I would suspect harness changes.
Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.
I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.
Frontier LLMs still suck a lot, you can't afford planned degradation yet.
Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.
It makes sense, in a way. It means the subscription deal is something along the lines of a fixed / predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quota consumption), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or us if, again, we’re feeling generous) / makes the deal sustainable for them.
It’s a trade-off
For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.
Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.
A reminder: your vibe-coded slop required peak 68GB of RAM, and you had to hire actual engineers to fix it.
The AI hype is dying, at least outside the silicon valley bubble which hackernews is very much a part of.
That and all the dogfooding by slop coding their user facing application(s).
I don't know, their desktop app felt really laggy and even switching Code sessions took a few seconds of nothing happening. Since the latest redesign, however, it's way better, snappy and just more usable in most respects.
I just think that we notice the negative things that are disruptive more. Even with the desktop app, the remaining flaws jump out: for example, how the Chat / Cowork / Code modes only show the label for the currently selected mode and the others are icons (that aren't very big), a colleague literally didn't notice that those modes are in the desktop app (or at least that that's where you switch to them).
Anecdotally OpenAI is trying to get into our enterprise tooth and nail, and have offered unlimited tokens until summer.
Gave GPT5.4 a try because of this and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort the last 30 days I’ve barely seen it make any mistakes.
At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.
Until Opus 4.7 - this is the first time I rolled back to a previous model.
Personality-wise it’s the worst of AI, “it’s not x, it’s y”, strong short sentences, in general a bullshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.
I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.
A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.
I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.
I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...
I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.
So then, there must have been an explicit internal guidance/policy that allowed this tradeoff to happen.
Did they fix just the bug or the deeper policy issue?
translation: we ignored this and our various vibe coders were busy gaslighting everyone saying this could not be happening
The artificial creation of demand is also a concerning sign.
Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of the doubt here.
We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).
People complain so much, and the conspiracy theories are tiring.
And it also tells us why we shouldn’t use their harness anyway: they constantly fiddle with it in ways that can seriously impact outcomes without even a warning.
At the same time, personally I find prioritizing quality over quantity of output to be a better personal strategy. Ten partially buggy features really aren't as good as three quality ones.
Many of these things have bitten me too. Firing off a request that is slow because it got kicked out of cache, with zero cache hits, makes everything way more expensive, so it makes sense they would do this. I tried skipping tool calls and thinking as well and it made the agent much stupider. These all seem like natural things to try. Pity.
Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT) ? So this is just the normal weekly reset? Except now my reset says it will come Saturday? This is super-confusing!
Why should we ever trust what they say again or trust that they won’t be rug-pulling again once this blows over?
Do researchers know the correlation between various aspects of a prompt and the response?
An LLM, to me at least, appears to be a wildly random function that is difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system arrived at its output. This doesn't appear to be the case for LLMs, where inputs and outputs are arbitrary text.
Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example of JSON structure with ```, adding a newline or wording I used wildly changed accuracy.
My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in lost $$$ tokens and the response is "we ... sorta messed up but just a little bit here and there and it added up to a big mess up" bro get fuckin real.
But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.
At least tell users when the system prompt has changed.
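A minimal sketch of what "versioned alongside the model" could mean in practice, assuming nothing about how any vendor actually does it: fingerprint the (model, system prompt) pair and surface the fingerprint to users, so a change is visible even when the contents are not. The model id and prompts below are placeholders.

```python
import hashlib


def prompt_fingerprint(model_id, system_prompt):
    """Short, stable identifier for one (model, system prompt) combination."""
    digest = hashlib.sha256()
    digest.update(model_id.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    digest.update(system_prompt.encode("utf-8"))
    return digest.hexdigest()[:12]


def changed_since(last_seen, model_id, system_prompt):
    """True when the current combination differs from the one last seen."""
    return prompt_fingerprint(model_id, system_prompt) != last_seen


if __name__ == "__main__":
    # Placeholder values, purely for illustration.
    before = prompt_fingerprint("example-model-1", "You are a coding assistant.")
    print(before)
    print(changed_since(before, "example-model-1", "You are a terse coding assistant."))
```

Showing that fingerprint in a status line or session log would tell users that the prompt changed without revealing what changed.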
It's very clear that there's money or influence exchanging hands behind the scenes with certain content creators, The Information, and OpenAI.
lua plugins WIP
jryio•1h ago
2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)
3. System prompt to make Claude less verbose reducing coding quality (4 days - better)
All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.
Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.
However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.
Doing this proactively would certainly match expectations for a fast-moving product like this.
Philpax•1h ago
They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).
However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.
jryio•1h ago
Sure, they didn't change the GPUs they're running on, or the quantization, but if valuable information is removed, leading to models performing worse, performance was degraded.
In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.
Eridrus•1h ago
To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.
fn-mote•1h ago
This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!
Seems like a very basic software engineering error that would be caught by normal unit testing.
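For illustration only, here is a toy model of that behaviour and the kind of unit test that would pin it down. This is not Anthropic's code: the `Session` class, the one-hour constant, and the method names are all invented, and the real system is certainly more involved.

```python
IDLE_LIMIT_SECONDS = 3600  # assumed idle threshold, taken from the one-hour figure above


class Session:
    """Toy session that drops old thinking once after a long idle gap."""

    def __init__(self):
        self.thinking = []
        self.last_active = 0.0

    def send(self, now, new_thinking):
        # Prune only when THIS turn follows a long idle gap. Because
        # last_active is refreshed below, later turns are not pruned again.
        if now - self.last_active > IDLE_LIMIT_SECONDS:
            self.thinking.clear()
        self.thinking.append(new_thinking)
        self.last_active = now


def test_prune_happens_once():
    session = Session()
    session.send(0, "a")
    session.send(10, "b")
    session.send(5000, "c")       # resumed after more than an hour idle
    assert session.thinking == ["c"]
    session.send(5010, "d")       # normal turn right after the resume
    assert session.thinking == ["c", "d"]


if __name__ == "__main__":
    test_prune_happens_once()
    print("ok")
```

A version with a sticky "was idle" flag would fail the second assertion, which is the regression described above: thinking from turns after the resume would keep getting dropped.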