An update on recent Claude Code quality reports

https://www.anthropic.com/engineering/april-23-postmortem

277•mfiguiere•1h ago

Comments

jryio•1h ago

1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)

2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

3. System prompt to make Claude less verbose reduced coding quality (4 days - better)

All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.

Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.

However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.

Doing this proactively would certainly match expectations for a fast-moving product like this.

sroussey•1h ago

None of these problems equate to degrading model performance. Completely different team. Degraded CC harness, sure.

qingcharles•1h ago

Sure, but it gives the impression of degraded model performance. Especially when the interface is still saying the model is operating on "high", the same as it did yesterday, yet it is in "medium" -- it just looks like the model got hobbled.

sroussey•1h ago

Oh, absolutely. Though changes in how the model is used are eminently more fixable than the model itself.

johnmaguire•58m ago

Yes, but for many users, CC is the product. Especially since I'm not allowed(?) to use my own harness with my sub.

Philpax•1h ago

> Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.

They're not gaslighting anyone here: they're very clear that the model itself, as in Opus 4.7, was not degraded in any way (i.e. if you take them at their word, they do not drop to lower quantisations of Claude during peak load).

However, the infrastructure around it - Claude Code, etc - is very much subject to change, and I agree that they should manage these changes better and ensure that they are well-communicated.

jryio•1h ago

Model performance at inference in a data center vs. stripping thinking tokens are effectively the same.

Sure they didn't change the GPUs they're running, or the quantization, but if valuable information is removed leading to models performing worse, performance was degraded.

In the same way uptime doesn't care about the incident cause... if you're down you're down no one cares that it was 'technically DNS'.

sroussey•1h ago

I thought these days thinking tokens sent by the model (as opposed to used internally) were just for the user's benefit. When you send the convo back you have to strip the thinking stuff for next turn. Or is that just local models?

aszen•1h ago

Claude code is not infra, the model is the infra. They changed settings to make their models faster and probably cheaper to run too. Honestly with adaptive thinking it no longer matters what model it is if you can dynamically make it do less or more work.

Eridrus•1h ago

To be fair to Anthropic, they did not intentionally degrade performance.

To take the opposite side, this is the quality of software you get atm when your org is all in on vibe coding everything.

fn-mote•1h ago

> 2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

This one was egregious: after a one hour user pause, apparently they cleared the cache and then continued to apply “forgetting” for the rest of the session after the resume!

Seems like a very basic software engineering error that would be caught by normal unit testing.
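To make the "every turn instead of just once" failure concrete, here is a minimal sketch of that class of bug and the kind of regression test that would flag it. Everything here (the `Session` class, the `was_idle` flag, the turn functions) is hypothetical and only illustrates the pattern; it is not Anthropic's actual code.

```python
from dataclasses import dataclass, field

IDLE_LIMIT_S = 3600  # one hour, as described in the postmortem


@dataclass
class Session:
    messages: list = field(default_factory=list)  # (kind, text) tuples
    last_active: float = 0.0
    was_idle: bool = False  # hypothetical sticky flag used by the buggy path


def strip_thinking(messages):
    """Drop every thinking block, keep everything else."""
    return [m for m in messages if m[0] != "thinking"]


def buggy_turn(s, now, new_msgs):
    if now - s.last_active > IDLE_LIMIT_S:
        s.was_idle = True
    if s.was_idle:  # never reset, so this fires on every later turn too
        s.messages = strip_thinking(s.messages)
    s.messages.extend(new_msgs)
    s.last_active = now


def fixed_turn(s, now, new_msgs):
    if now - s.last_active > IDLE_LIMIT_S:
        s.messages = strip_thinking(s.messages)  # one-time clear on resume
    s.messages.extend(new_msgs)
    s.last_active = now


def thinking_left(turn_fn):
    """Run three turns (the second one after a long idle gap) and count thinking blocks."""
    s = Session()
    for i, now in enumerate([0, 4000, 4010]):
        turn_fn(s, now, [("thinking", f"t{i}"), ("answer", f"a{i}")])
    return sum(1 for kind, _ in s.messages if kind == "thinking")
```

In this toy setup the fixed version keeps the thinking produced after the resume, while the buggy one has thrown it away again by the next turn, which is what a test asserting on `thinking_left` would catch.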

bearjaws•1h ago

The issue making Claude just not do any work was infuriating to say the least. I already ran at medium thinking level so was never impacted, but having to constantly go "okay now do X like you said" was annoying.

Again goes back to the "intern" analogy people like to make.

Robdel12•1h ago

Wow, bad enough for them to actually publish something and not cryptic tweets from employees.

Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.

mannanj•1h ago

so who do you trust and go to? (NotClearlySo)OpenAI?

simlevesque•1h ago

I went with MiniMax. The token plans are over what I currently need, 4500 messages per 5h, 45000 messages per week for 40$. I can run multiple agents and they don't think for 5-10 minutes like Sonnet did. Also I can finally see the thinking process while Anthropic chose to hide it all from me.

I'm using Zed and Claude Code as my harnesses.

Robdel12•1h ago

At the moment, yeah. If Google ever figures out how to build an agentic model, I would use them as well.

However you feel about OpenAI, at least their harness is actually open source and they don’t send lawyers after oss projects like opencode

bensyverson•1h ago

Anecdotally, I know many people who have supplemented Claude with Codex, and are experimenting with models such as GLM 5.1, Kimi, Qwen, etc.

carlgreene•1h ago

I "subconsciously" moved to codex back in mid Feb from CC and it's been so freaking awesome. I don't think it's as good at UI, but man is it thorough and able to gather the right context to find solutions.

I use "subconsciously" in quotes because I don't remember exactly why I did it, but it aligns with the degradation of their service so it feels like that probably has something to do with it even though I didn't realize it at the time.

snissn•58m ago

it's been frustrating how bad it is at UI. I'm starting to test out using their image2 for UI and then handing it to codex to build out the images into code and I'm impressed and relieved so far

GenerWork•29m ago

Anthropic definitely takes the cake when it comes to UI related activities (pulling in and properly applying Figma elements, understanding UI related prompts and properly executing on it, etc), and I say this as a designer with a personal Codex subscription.

irthomasthomas•52m ago

I like chutes because they always use the full weights, and prompts are encrypted with TEE.

saghm•1h ago

The A/B testing is by far the most objectionable thing from them so far in my opinion, if only because of how terrible it would be for something like that to be standard for subscriptions. I'd argue that it's not even A/B testing of pricing but silently giving a subset of users an entirely different product than they signed up for; it would be like if 2% of Netflix customers had full-screen ads pop up and cover the videos randomly throughout a show. Historically the only thing stopping companies from extraordinarily user-hostile decisions has been public outcry, but limiting it to a small subset of users seems like it's intentionally designed to try to limit the PR consequences.

lifthrasiir•47m ago

The best possible situation that I can imagine is that Anthropic just wanted to measure how much value Claude Code has for Pro users and didn't mean to change the plan itself (so those users would get CC as a "bonus"), but that alone is already questionable to start with.

Alifatisk•1h ago

It’s incredible how forgiving you guys are with Anthropic and their errors. Especially considering you pay a high price for their service and receive lower quality than expected.

mlinsey•1h ago

The consumer surplus is quite high. Even with the regressions in this postmortem, performance was above the models last fall, when I was gladly paying for my subscription and thought it was net saving me time.

That said, there is now much better competition with Codex, so there's only so much rope they have now.

ed_elliott_asc•1h ago

I pay for 20x max and get so much more value out of it than I pay.

lukasus•1h ago

At the time you wrote your comment there were 4 other comments and all of them very negative towards Anthropic and the blog post in question here. How did you reach this conclusion?

lukan•1h ago

Confused as well, I rather supposed Anthropic had some standing for saying no to Trump and being declared a national security threat, but the anger they got and people leaving to OpenAI again, who gladly said yes to autonomous killing AI, did astonish me a bit. And I also had weird things happening with my usage limits and was not happy about it. But it is still very useful to me - and I only pay for the pro plan.

sunaookami•1h ago

>I rather supposed Anthropic had some standing for saying no to Trump and being declared a national security threat

I never understood why people cheered for Anthropic then when they happily work together with Palantir.

unselect5917•1h ago

HN glazes anthropic every single time I see it come up. This is as obvious as HN's political bias.

tempest_•1h ago

A lot of people are provided their access through work.

They don't actually pay the bill or see it.

fastball•1h ago

What high price? I pay $200/m for an insane number of tokens.

Avicebron•1h ago

It's still night and day the difference in quality between chatgpt5.4 and opus 4.7. Heck even on Perplexity where 5.4 is included in Pro vs 4.7 which is behind the max plan or whatever, I will pick sonnet 4.6 over the 5.4 offering and it's consistently better. I don't love Anthropic, I don't have illusions about them as a business.

But if a tool is better, it's better.

wahnfrieden•1h ago

You aren’t getting the 5.4 experience for code if you’re not using it in the Codex harness

OsrsNeedsf2P•1h ago

Look at any criticism of Mythos. Some members on HN are defending it tooth and nail, despite it not being released

jgbuddy•1h ago

Anthropic actually not so bad. Anthropic models code good, usually. Price not so high compared to time to do it by self.

AntiUSAbah•1h ago

Because it is still good though.

If you have a good product, you are more understanding. And getting worse doesn't mean it's no longer valuable, only that the price/value factor went down. But Opus 4.5 was significantly better and only came out in November.

There was no price increase at that time so for the same money we get better models. Opus 4.6 again feels significantly better though.

Also moving fastish means having more/better models faster.

I do know plenty of people though who use opencode or pi and openrouter and switch models a lot more often.

saghm•1h ago

At least personally, it feels like the choices are the one that's okay with being used for mass surveillance and autonomous weapons targeting, the one that's on track to get acquired by the AI company that dragged its feet in getting around to stopping people from making child porn with it, the one that nobody seems to use from Google, and the one that everyone complains about but also still seems to be using because it at least sometimes works well. At this point I've opted out of personal LLM coding by canceling my subscription (although my employer still has subscriptions and wants us to keep using them, so I'll presumably keep using Claude there) but if I had to pick one to spend my own money on I'd still go with Claude.

scblock•1h ago

A valid choice, a moral choice, is none of the above.

scottyah•1h ago

It's fairly small issues for an amazing product, and the company is just a few years old and growing rapidly. Also, they are leading a powerful technological revolution and their competitors are known to have multiple straight up evil tendencies. A little degradation is not an issue.

mystraline•1h ago

Exactly. They've done now like 6 rug-pulls.

Idiots keep throwing money at real-time enshittification and "I am changing the terms. Pray I do not change them further".

And yes, I am absolutely calling people who keep getting screwed and paying for more 'service' idiots.

And Anthropic has proved that they will pay for less and less. So, why not fuck them over and make more company money?

oytis•1h ago

Remember Louis CK talking about Wi-Fi on an airplane? People are dealing with highly experimental technology here

arnvald•1h ago

What's the alternative? Are you suggesting other LLM providers don't charge a high price? Or that they don't make mistakes? Or that they provide better quality?

We're talking about dynamically developed products, something that most people would have considered impossible just 5 years ago. A non-deterministic product that's very hard to test. Yes, Anthropic makes mistakes, models can get worse over time, their ToS change often. But again, is Gemini/GPT/Grok a better alternative?

operatingthetan•1h ago

I don't think Anthropic has to inform their customers of every change they make, but they should have with this one.

timmg•37m ago

> It’s incredible how forgiving you guys are with Anthropic and their errors.

Ironically, I was thinking the exact opposite. This is bleeding edge stuff and they keep pushing new models and new features. I would expect issues.

I was surprised at how much complaining there is -- especially coming from people who have probably built and launched a lot of stuff and know how easy it is to make mistakes.

foota•1h ago

> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.

Claude caveman in the system prompt confirmed?

awesome_dude•1h ago

I've recently been introduced to that plugin, love it for humour

WhitneyLand•1h ago

Did they not address how adaptive thinking has played into all of this?

teaearlgraycold•1h ago

> On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6.

Is it just me or does this seem kind of shocking? Such a severe bug affecting millions of users with a non-trivial effect on the context window that should be readily evident to anyone looking at the analytics. Makes me wonder if this is the result of Anthropic's vibe-coding culture. No one's actually looking at the product, its code, or its outputs?

manmal•1h ago

I think that would also have busted cache all the time, and uncached requests consume usage limits rapidly.

nrki•1h ago

> we refunded all affected customers

Notably missing from the postmortem

chermi•1h ago

It's really hard to understand. There needs to be really loud batman sign in the sky type signals from some hero third party calling out objective product degradation. Do they use cc internally? If so do they use a different version? This should've been almost as loud a break as service just going down altogether, yet it took 2 weeks to fix?!

poly2it•54m ago

> ... we’ll ensure that a larger share of internal staff use the exact public build of Claude Code (as opposed to the version we use to test new features) ...

Apparently they are using another version internally.

ayhanfuat•1h ago

Reading the "Going forward" section I see that they have zero understanding of the main complaints.

Kiro•1h ago

How so?

ayhanfuat•1h ago

They feel they're in a position to make important trade-off decisions on behalf of the user. "It's just slightly worse, I'll sneak this change in" is not something to be tolerated, whether it actually turns out to be much worse or not. Their adaptive thinking mess has caused a ton of work for me. I know a lot of people are saying Codex is actually better now. I don't agree but I'm switching to it because it's much more reliable.

operatingthetan•1h ago

I agree, but these LLM products are all black-boxes so we need to demand more accountability from them.

dainiusse•1h ago

Corporate bs begins...

xlayn•1h ago

If anthropic is doing this as a result of "optimizations" they need to stop doing that and raise the price. The other thing, there should be a way to test a model and validate that the model is answering exactly the same each time. I have experienced twice... when a new model is going to come out... the quality of the top dog one starts going down... and bam.. the new model is so good.... like the previous one 3 months ago.

The other thing, when anthropic turns on lazy claude... (I want to coin here the term Claudez for the version of claude that's lazy.. Claude zzZZzz = Claudez) that thing is terrible... you ask the model for something... and it's like... oh yes, that will probably depend on memory bandwidth... do you want me to search that?...

YES... DO IT... FRICKING MACHINE..

Keeeeeeeks•1h ago

https://marginlab.ai/ (no affiliation)

There are a number of projects working on evals that can check how 'smart' a model is, but the methodology is tricky.

One would want to run the exact same prompt, every day, at different times of the day, but if the eval prompt(s) are complex, the frontier lab could have a 'meta-cognitive' layer that looks for repetitive prompts, and either: a) feeds the model a pre-written output to give to the user b) dumbs down output for that specific prompt

Both cases defeat the purpose in different ways, and make a consistent gauge difficult. And it would make sense for them to do that since you're 'wasting' compute compared to the new prompts others are writing.

hex4def6•7m ago

I think you could alter the prompt in subtle ways; a period goes to an ellipsis, extra commas, synonyms, occasional double-spaces, etc.

Enough that the prompt is different at a token-level, but not enough that the meaning changes.

It would be very difficult for them to catch that, especially if the prompts were not made public.

Run the variations enough times per day, and you'd get some statistical significance.

I guess the fuzzy part is judging the output.
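A rough sketch of that perturbation idea, assuming a tiny hand-written synonym table and a non-empty prompt. The `SYNONYMS` table and the `perturb` function are made up for illustration; a real eval would need a much richer set of meaning-preserving rewrites and, as noted, a way to judge the outputs.

```python
import random

# Hypothetical synonym table; a real one would be far larger and vetted by hand.
SYNONYMS = {"quick": "fast"}


def perturb(prompt, rng):
    """Return a variant of a non-empty prompt that differs only in surface form."""
    # Swap some words for synonyms (exact token matches only).
    tokens = [SYNONYMS.get(t, t) if rng.random() < 0.5 else t for t in prompt.split()]
    # Occasionally use a double space between words.
    text = tokens[0]
    for t in tokens[1:]:
        sep = "  " if rng.random() < 0.1 else " "
        text += sep + t
    # Sometimes turn a single trailing period into an ellipsis.
    if text.endswith(".") and not text.endswith("..") and rng.random() < 0.5:
        text = text[:-1] + "..."
    return text


rng = random.Random(0)  # seeded so a given run of variants is reproducible
variants = [perturb("Write a quick function to sum a list.", rng) for _ in range(20)]
```

Each variant tokenizes differently but asks for the same thing, so exact-match detection of a repeated eval prompt would not trigger on it.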

dgellow•1h ago

I would love if agents would act way more like tools/machines and NOT try to act as if they were humans

joshstrange•49m ago

It's incredibly frustrating when I've spelled out in CLAUDE.md that it should SSH to my dev server to investigate things I ask it to and it regularly stops working with a message of something like:

> Next steps are to run `cat /path/to/file` to see what the contents are

Makes me want to pull my hair out. I've specifically told you to go do all the read-only operations you want out on this dev server yet it keeps forgetting and asking me to do something it can do just fine (proven by it doing it after I "remind" it).

That and "Auto" mode really are grinding my gears recently. Now, after a Planning session my only option is to use Auto mode and I have to manually change it back to "Dangerously skip permissions". I think these are related since the times I've let it run on "Auto" mode are when it gives up/gets stuck more often.

Just the other day it was in Auto mode (by accident) and I told it:

> SSH out to this dev server, run `service my_service_name restart` and make sure there are no orphans (I was working on a new service and the start/stop scripts). If there are orphans, clean them up, make more changes to the start/stop scripts, and try again.

And it got stuck in some loop/dead-end, telling me I should do it and that it didn't want to run commands out on a "Shared Dev server" (even though I had specifically told it that this was not a shared server).

The fact that Auto mode burns more tokens _and_ is so dumb is really a kick in the pants.

marcyb5st•47m ago

Apart from Anthropic nobody knows how much the average user costs them. However the consensus is "much more than that".

If they have to raise prices to stop hemorrhaging money, would you be willing to pay 1000 bucks a month for a max plan? Or 100$ per 1M output tokens (playing numberWang here, but the point stands)?

If I have to guess they are trying to get balance sheet in order for an IPO and they basically have 3 ways of achieving that:

1. Raising prices like you said, but the user drop could be catastrophic for the IPO itself and so they won't do that

2. Dumb the models down (basically decreasing their cost per token)

3. Send fewer tokens (i.e. capping thinking budgets aggressively).

2 and 3 are palatable because, even if they annoy the technical crowd, investors still see a big number of active users with a positive margin for each.

everdrive•1h ago

I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.

   "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

   "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

   "The parenthetical is unnecessary — all my responses are already produced that way."

However I'm not doing anything of the sort and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines that are somehow more additional than its normal guidance, and for whatever reason it can't differentiate between those and my questions.

LatencyKills•1h ago

I have a set of stop hook scripts that I use to force Claude to run tests whenever it makes a code change. Since 4.7 dropped, Claude still executes the scripts, but will periodically ignore the rules. If I ask why, I get a "I didn't think it was necessary" response.

dawnerd•1h ago

I see that with openai too, lots of responding to itself. Seems like a convenient way for them to churn tokens.

OtomotO•1h ago

This, so much this!

Pay by token(s) while token usage is totally opaque is a super convenient money printing machine.

y1n0•1h ago

None of these companies have compute to spare. It’s not in their interest to use more tokens than necessary.

boringg•58m ago

Not true - they absolutely want to goose demand as they continue to burn investor dollars and deploy infra at scale.

If that demand even slows down in the slightest the whole bubble collapses.

Growth + Demand >> efficiency or $ spend at their current stage. Efficiency is a mature company/industry game.

dawnerd•54m ago

That doesn’t mean they also can’t be wasteful. Fact is, Claude and gpt have way too much internal thinking about their system prompts than is needed. Every step they mention something around making sure they do xyz and not doing whatever. Why does it need to say things to itself like “great I have a plan now!” - that’s pure waste.

malfist•36m ago

Are you saying these companies don't want to sell more product to us? Because that's the logical extension of your argument.

grey-area•27m ago

A simpler explanation (esp. given the code we've seen from claude), is that they are vibecoding their own tools and moving fast and breaking things with predictably sloppy results.

rafram•1h ago

Check that you’re running the latest version.

gs17•40m ago

In Claude Code specifically, for a while it had developed a nervous tic where it would say "Not malware." before every bit of code. Likely a similar issue where it keeps talking to a system/tool prompt.

setnone•1h ago

Good on them for resolving all three issues, but is it any good again?

alxndr13•59m ago

for me at least, yes. just wrote it to coworkers this afternoon. Behaves way more "stable" in terms of quality and i don't have the feeling of the model getting way worse after 100k tokens of context or so.

What i notice: after 300k there's some slight quality drop, but i just make sure to compact before that threshold.

dataviz1000•1h ago

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.

Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.

I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.

A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of the time. Remove a sentence and it will solve the problem 70% of the time.

It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.

Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
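A minimal sketch of what such a harness could look like, assuming each agent run can be reduced to a pass/fail check. The `simulated_agent` function is a stand-in with a known success probability (a real harness would call the agent and verify its output), and the two-proportion z-score is one common way to compare pass rates, not necessarily what any lab uses.

```python
import math
import random


def pass_rate(run_once, trials):
    """Fraction of trials in which run_once() reports success."""
    return sum(1 for _ in range(trials) if run_once()) / trials


def two_proportion_z(p_a, n_a, p_b, n_b):
    """z-score for the difference between two observed pass rates."""
    pooled = (p_a * n_a + p_b * n_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (p_a - p_b) / se


def simulated_agent(p_success, rng):
    """Hypothetical stand-in for 'run the agent on the task and check the result'."""
    return lambda: rng.random() < p_success


rng = random.Random(0)
baseline = pass_rate(simulated_agent(0.9, rng), 200)   # prompt before the change
candidate = pass_rate(simulated_agent(0.7, rng), 200)  # prompt after the change
z = two_proportion_z(baseline, 200, candidate, 200)
```

With exactly 90% vs. 70% observed over 200 runs each, the z-score works out to 5.0, far past the usual 1.96 threshold. With only 30 runs each, the same gap gives roughly 1.94, which is why the number of runs per variant matters.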

arjie•22m ago

The word is not co-opted. A harness is just supportive scaffolding to run something. A test harness is scaffolding to run tests against software, a fuzz harness is scaffolding to run a fuzzer against the software, and so on. I've seen it being used in this manner many times over the past 15 years. It's the device that wraps your software so you can run it repeatedly with modifications of parameters, source code, or test condition.

thesz•11m ago

To have some confidence in consistency of results (p-value), one has to start from a cohort of around 30, if I remember correctly. This is a 1.5 orders of magnitude increase in the computing power needed to find (absence of) consistent changes of agent's behavior.

natdempk•1h ago

As an end-user, I feel like they're kind of over-cooking and under-describing the features and behavior of what is a tool at the end of the day. Today the models are in a place where the context management, reasoning effort, etc. all needs to be very stable to work well.

The thing about session resumption changing the context of a session by truncating thinking is a surprise to me, I don't think that's even documented behavior anywhere?

It's interesting to look at how many bugs are filed on the various coding agent repos. Hard to say how many are real / unique, but quantities feel very high and not hard to run into real bugs rapidly as a user as you use various features and slash commands.

2001zhaozhao•1h ago

How about just not change the harness abruptly in the first place? Make new system prompt changes "experimental" first so you can gather feedback.

motbus3•1h ago

I had similar experience just before 4.5 and before 4.6 were released.

Somehow, three times makes me not feel confident on this response.

Also, if this is all true and correct, how the heck do they validate quality before shipping anything?

Shipping software without quality is a pretty easy job even without AI. Just saying....

MillionOClock•1h ago

I see the Claude team wanted to make it less verbose, but that's actually something that bothered me since updating to Claude 4.7, what is the most recommended way to change it back to being as verbose as before? This is probably a matter of preference but I have a harder time with compact explanations and lists of points and that was originally one of the things I preferred with Claude.

einrealist•1h ago

Is 'refactoring Markdown files' already a thing?

ireadmevs•1h ago

Read Claude’s skill to create other skills and you’ll see that this ship has already sailed

https://skills.sh/anthropics/skills/skill-creator

6keZbCECT2uB•1h ago

"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"

This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.

The default thinking level seems more forgivable, but with the churn in system prompts I'll need to figure out how to intentionally choose a refresh cycle.

seizethecheese•59m ago

It's also a bit of a fishy explanation for purging tokens older than an hour. This happens to also be their cache limit. I doubt it is incidental that this change would also dramatically drop their cost.

cma•44m ago

They moved it to 5m around the same timeframe though: https://www.reddit.com/r/ClaudeAI/comments/1sk3m12/followup_...

tadfisher•45m ago

It astounds me that a company valued in the hundreds-of-billions-of-dollars has written this. One of the following must be true:

1. They actually believed latency reduction was worth compromising output quality for sessions that have already been long idle. Moreover, they thought doing so was better than showing a loading indicator or some other means of communicating to the user that context is being loaded.

2. What I suspect actually happened: they wanted to cost-reduce idle sessions to the bare minimum, and "latency" is a convenient-enough excuse to pass muster in a blog post explaining a resulting bug.

retinaros•39m ago

they just vibecoded a fix and didn't think about the tradeoff they were making and their always yes-man of a model just went with it

bcherny•36m ago

Hey, Boris from the Claude Code team here.

Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.

The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.
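As a back-of-the-envelope sketch of the behavior described above (all but the latest message served from cache, or everything rewritten after a long idle), assuming a one-hour cache lifetime. The `cache_split` function is illustrative only; how reads and writes actually count against rate limits is not stated here.

```python
CACHE_TTL_S = 3600  # assumed one-hour cache lifetime, per the description above


def cache_split(message_tokens, idle_seconds):
    """Return (tokens_read_from_cache, tokens_written_to_cache) for one turn.

    message_tokens holds the token count of each message in the conversation,
    with the newest message last.
    """
    if idle_seconds > CACHE_TTL_S:
        # Full cache miss: every message has to be written again.
        return 0, sum(message_tokens)
    # Warm cache: everything but the latest message is a cache hit.
    return sum(message_tokens[:-1]), message_tokens[-1]


conversation = [300_000, 300_000, 300_000, 500]  # ~900k of history plus a short new prompt
warm = cache_split(conversation, idle_seconds=60)    # (900000, 500)
cold = cache_split(conversation, idle_seconds=4000)  # (0, 900500)
```

The same short follow-up prompt goes from a 500-token write to a 900,500-token write once the idle gap passes the cache lifetime.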

We tried a few different approaches to improve this UX:

1. Educating users on X/social

2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.

Hope this is helpful. Happy to answer any questions you have.

fidrelity•30m ago

Just wanted to say I appreciate your responses here. Engaging so directly with a highly critical audience is a minefield that you're navigating well.

Thank you.

shimman•24m ago

Very easy to do when you stand to make tens of millions when your employer IPOs. Let's not maybe give too much praise and employ some critical thinking here.

simplify•18m ago

What is the purpose of this mindset? Should we encourage typical corporate coldness instead?

hgoel•5m ago

Is "employ some critical thinking" supposed to involve being an annoying uptight cynic?

qsort•21m ago

I agree with this.

I'm writing this message even though I don't have much to add because it's often the case on HN that criticism is vocal and appreciation is silent and I'd like to balance out the sentiment.

Anthropic has fumbled on many fronts lately but engaging honestly like this is the right thing to do. I trust you'll get back on track.

troupo•20m ago

> Engaging so directly with a highly critical audience is a minefield that you're navigating well.

They spent two months literally gaslighting this "critical audience" that this could not be happening and literally blaming users for using their vibe-coded slop exactly as advertised.

All the while all the official channels refused to acknowledge any problems.

Now the dissatisfaction and subscription cancellations have reached a point where they finally had to do something.

gverrilla•21m ago

I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?

isaacdl•19m ago

Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.

It's a little concerning that it's number 1 in your list.

ceuk•18m ago

Is having massive sessions which sit idle for hours (or days) at a time considered unusual? That's a really, really common scenario for me.

Two questions if you see this:

1) if this isn't best practice, what is the best way to preserve highly specific contexts?

2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?

iidsample•14m ago

We at UT-Austin have done some academic work to handle the same challenge. Will be curious if serving engines could be modified. https://arxiv.org/abs/2412.16434

The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!

troupo•14m ago

> We tried a few different approaches to improve this UX: 1. Educating users on X/social

No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X

dbeardsl•10m ago

I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs or reduce quality. Both are surprising and disappointing.

I feel like that is a choice best left up to users.

i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"

btown•6m ago

Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?

I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.

For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.

Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?

lukebechtel•1h ago

Some people seem to be suggesting these are coverups for quantization...

Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.

I would not suspect quantization before I would suspect harness changes.

lifthrasiir•1h ago

Is it just me, or has the reset cycle of usage limits been randomly updated? I originally had the reset point at around 00:00 UTC tomorrow and it was somehow delayed to 10:00 UTC tomorrow, regardless of when I started to use Claude in this cycle. My friends also reported seemingly random delays, as much as ~40 hours, with no apparent reason. Is this another bug on top of other bugs? :-S

someone4958923•58m ago

"This isn’t the experience users should expect from Claude Code. As of April 23, we’re resetting usage limits for all subscribers."

lifthrasiir•54m ago

I know that. I'm saying that the cycle reset is not what it used to be (starting at the very first usage) or what it might be (retaining the cycle reset timing).

jongleberry•39m ago

it seems to be the same cycle for everyone now, not based on first usage. I saw a reddit thread on this from someone who had multiple accounts that all had the same cycles

jpcompartir•1h ago

Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.

Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.

I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.

bcherny•55m ago

Boris from the Claude Code team here. We agree, and will be spending the next few weeks increasing our investment in polish, quality, and reliability. Please keep the feedback coming.

pkos98•51m ago

Sure, I've cancelled my Max 20 subscription because you guys prioritize cutting your costs/increasing token efficiency over model performance. I use expensive frontier labs to get the absolute best performance, else I'd use an Open Source/Chinese one.

Frontier LLMs still suck a lot, you can't afford planned degradation yet.

szmarczak•46m ago

Why ban third party wrappers? All of this could've been sidestepped had you not banned them.

ElFitz•30m ago

Because then they lose vertical integration and the extra ability it grants to tune settings to reduce costs / token use / response time for subscription users.

Or improve performance and efficiency, if we’re generous and give them the benefit of the doubt.

It makes sense, in a way. It means the subscription deal is something along the lines of fixed / predictable price in exchange for Anthropic controlling usage patterns, scheduling, throttling (quotas consumptions), defaults, and effective workload shape (system prompt, caching) in whatever way best optimises the system for them (or us if, again, we’re feeling generous) / makes the deal sustainable for them.

It’s a trade-off

batshit_beaver•39m ago

> investment in polish, quality, and reliability

For there to be any trust in the above, the tool needs to behave predictably day to day. It shouldn't be possible to open your laptop and find that Claude suddenly has an IQ 50 points lower than yesterday. I'm not sure how you can achieve predictability while keeping inference costs in check and messing with quantization, prompts, etc on the backend.

Maybe a better approach might be to version both the models and the system prompts, but frequently adjust the pricing of a given combination based on token efficiency, to encourage users to switch to cheaper modes on their own. Let users choose how much they pay for given quality of output though.

a-dub•25m ago

hm. ml people love static evals and such, but have you considered approaches that typically appear in saas? (slow rollouts, org/user constrained testing pools with staged rollouts, real-world feedback from actual usage data (where privacy policy permits))
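
For readers unfamiliar with the staged-rollout idea mentioned above, here is a minimal sketch of deterministic percentage bucketing, a common way SaaS products expose a change to a fixed slice of users. The function names and the feature key are hypothetical; nothing here reflects how Anthropic actually ships Claude Code changes.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; the same inputs always give the same bucket.
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets for the feature."""
    return rollout_bucket(user_id, feature) < percent


if __name__ == "__main__":
    # Hypothetical feature key; ramp 5% -> 25% -> 100% while watching feedback.
    for pct in (5, 25, 100):
        print(pct, is_enabled("user-123", "medium-effort-default", pct))
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the change at 5% keeps seeing it at 25%, which makes reports easier to correlate with the rollout.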

troupo•16m ago

And you didn't invest anything in polish, quality and reliability before... why? Because for any questions people have you reply something like "I have Claude working on this right now" and have no idea what's happening in the code?

A reminder: your vibe-coded slop required peak 68GB of RAM, and you had to hire actual engineers to fix it.

jpcompartir•7m ago

Thanks, I have a lot of trust in and admiration for the team & respect for the work you guys have done and continue to do.

OtomotO•54m ago

I guess it's a bit of desperation to find a sustainable business model.

The AI hype is dying, at least outside the silicon valley bubble which hackernews is very much a part of.

That and all the dogfooding by slop coding their user facing application(s).

spaniard89277•54m ago

Given the price I don't really think they're the best option. They're sloppy and competitors are catching up. I'm having the same results with other models, and very close with Kimi, which is waaay cheaper.

KronisLV•44m ago

> It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.

I don't know, their desktop app felt really laggy and even switching Code sessions took a few seconds of nothing happening. Since the latest redesign, however, it's way better, snappy and just more usable in most respects.

I just think that we notice the negative things that are disruptive more. Even with the desktop app, the remaining flaws jump out: for example, how the Chat / Cowork / Code modes only show the label for the currently selected mode and the others are icons (that aren't very big), a colleague literally didn't notice that those modes are in the desktop app (or at least that that's where you switch to them).

podnami•1h ago

They lost me at Opus 4.7

Anecdotally OpenAI is trying to get into our enterprise tooth and nail, and have offered unlimited tokens until summer.

Gave GPT5.4 a try because of this and honestly I don’t know if we are getting some extra treatment, but running it at extra high effort the last 30 days I’ve barely seen it make any mistakes.

At some points even the reasoning traces brought a smile to my face as it preemptively followed things that I had forgotten to instruct it about but were critical to get a specific part of our data integrity 100% correct.

vorticalbox•54m ago

extra high burns tokens i find. i run 5.4 on medium for 90% of the tasks and high if i see medium struggling, and it's very focused and makes minimal changes.

dsco•51m ago

Yeah but it also then strikes the perfect balance between being meticulous and pragmatic. Also it pushes back much more often than other models in that mode.

dsco•53m ago

Same here. I feel like all of these shenanigans could be because Anthropic are compute constrained, forcing them to take reckless risks around reducing it.

enraged_camel•43m ago

I find that it is better at thinking broadly and at a high level, on tasks that are tangential to coding like UX flows, product management and planning of complex implementations. I have yet to see it perform better than either Opus 4.6 or 4.7 though.

cube2222•41m ago

I’ve never been one to complain about new models, and also didn’t experience most of the issues folks were citing about Claude Code over the last couple months. I’ve been using it since release, happy with almost each new update.

Until Opus 4.7 - this is the first time I rolled back to a previous model.

Personality-wise it’s the worst of AI, “it’s not x, it’s y”, strong short sentences, in general a bullshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.

I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.

rishabhaiover•59m ago

Boris gaslighted us with all the quality related incidents for weeks not acknowledging these problems.

bityard•57m ago

My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.

A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...

gilrain•29m ago

> My hypothesis is that some of this a perceived quality drop due to "luck of the draw" where it comes to the non-deterministic nature of [LLM] output.

I think you must have learned that they’re more nondeterministic than you had thought, but then wrongly connected your new understanding to the recent model degradation. Note: they’ve been nondeterministic the whole time, while the widely-reported degradation is recent.

systemvoltage•56m ago

Interesting. All 3 seem like they’re obviously going to impact quality, e.g., reducing the effort from high to medium.

So then, there must have been an explicit internal guidance/policy that allowed this tradeoff to happen.

Did they fix just the bug or the deeper policy issue?

troupo•54m ago

> they were challenging to distinguish from normal variation in user feedback at first

translation: we ignored this and our various vibe coders were busy gaslighting everyone saying this could not be happening

munk-a•52m ago

It's also important to realize that Anthropic has recently struck several deals with PE firms to use their software. So Anthropic pays the PE firm which forces their managed firms to subscribe to Anthropic.

The artificial creation of demand is also a concerning sign.

bauerd•52m ago

>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode

Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of the doubt here.

bcherny•33m ago

Hey, Boris from the team here.

We did both -- we did a number of UI iterations (eg. improving thinking loading states, making it more clear how many tokens are being downloaded, etc.). But we also reduced the default effort level after evals and dogfooding. The latter was not the right decision, so we rolled it back after finding that UX iterations were insufficient (people didn't understand to use /effort to increase intelligence, and often stuck with the default -- we should have anticipated this).

davidfstr•52m ago

Good on Anthropic for giving an update & token refund, given the recent rumors of an inexplicable drop in quality. I applaud the transparency.

scuderiaseb•19m ago

Opus 4.7 was released a week ago, at which point all limits were reset, so this was very beneficial to them because basically everyone's weekly limit was about to be reset anyway.

petervandijck•51m ago

I have noticed a clear increase in smarts with 4.7. What a great model!

People complain so much, and the conspiracy theories are tiring.

KronisLV•49m ago

This reads like good news! They probably still lost a bunch of users due to the negative public sentiment and not responding quickly enough, but at least they addressed it with a good bit of transparency.

0gs•42m ago

wow resetting everyone's usage meter is great. i was so close to finally hitting my weekly limit for once though

ElFitz•34m ago

Now we know why Anthropic banned the use of subscriptions with other agent harnesses: they partially rely on the Claude Code cli to control token usage through various settings.

And it also tells us why we shouldn’t use their harness anyway: they constantly fiddle with it in ways that can seriously impact outcomes without even a warning.

VadimPR•32m ago

Appreciate the honesty from the team.

At the same time, personally I find prioritizing quality over quantity of output to be a better personal strategy. Ten partially buggy features really aren't as good as three quality ones.

nickdothutton•30m ago

I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?

arjie•28m ago

Useful update. Would be useful to me to switch to a nightly / release cycle but I can see why they don't: they want to be able to move fast and it's not like I'm going to churn over these errors. I can only imagine that the benchmark runs are prohibitively expensive or slow or not using their standard harness because that would be a good smoke test on a weekly cadence. At the least, they'd know the trade-offs they're making.

Many of these things have bitten me too. Firing off a request that is slow because it's been kicked out of cache and gets zero cache hits makes everything way more expensive, so it makes sense they would do this. I tried skipping tool calls and thinking as well and it made the agent much stupider. These all seem like natural things to try. Pity.

ctoth•26m ago

> As of April 23, we’re resetting usage limits for all subscribers.

Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT.) So this is just the normal weekly reset? Except now my reset says it will come Saturday? This is super-confusing!

walthamstow•19m ago

The weekly reset point is different per account. I think something to do with first sign-up date. Mine is on a Tuesday.

schpet•5m ago

mine was originally on sunday, then got moved to thursday (which i disliked), and it is still on thursday. so them resetting my weekly limit on the same day it was scheduled to reset feels like a joke.

puppystench•24m ago

The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.

walthamstow•20m ago

So we weren't going mad then!

yuvrajmalgat•20m ago

ohh

hajile•17m ago

My takeaway is that they knew they were changing a bunch of stuff while their reps were gaslighting us in the comments here.

Why should we ever trust what they say again or trust that they won’t be rug-pulling again once this blows over?

jameson•17m ago

> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"

Do researchers know the correlation between various aspects of a prompt and the response?

An LLM, to me at least, appears to be a wildly random function that's difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system arrived at its output. This doesn't appear to be the case for LLMs, where inputs and outputs are arbitrary text.

Anecdotally, I had a difficult time working with open source models at a social media firm, and something as simple as wrapping the example JSON structure in triple backticks, adding a newline, or changing my wording wildly changed accuracy.

whalesalad•15m ago

I genuinely don't understand what they have been trying to achieve. All of these incremental "improvements" have ... not improved anything, and have had the opposite effect.

My trust is gone. When day-to-day updates do nothing but cause hundreds of dollars in lost $$$ tokens and the response is "we ... sorta messed up but just a little bit here and there and it added up to a big mess up" bro get fuckin real.

pxc•15m ago

One of Anthropic's ostensible ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality whether they're worried about grandiose/catastrophic predictions or not.

But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.

cedws•13m ago

>On April 16, we added a system prompt instruction to reduce verbosity

In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.

At least tell users when the system prompt has changed.
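
One way to picture "tell users when the system prompt has changed" is a short fingerprint over the model identifier and prompt text that a client could compare between sessions. This is only a sketch under assumptions: the function names are hypothetical, the prompt strings are placeholders, and no such field is known to be exposed by Claude Code.

```python
import hashlib
from typing import Optional


def prompt_fingerprint(model_id: str, system_prompt: str) -> str:
    """Short, stable fingerprint of a (model, system prompt) combination."""
    payload = f"{model_id}\n{system_prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def prompt_changed(previous: Optional[str], current: str) -> bool:
    """True only when an earlier fingerprint exists and differs from the current one."""
    return previous is not None and previous != current


if __name__ == "__main__":
    # Placeholder model name and prompt text, for illustration only.
    old = prompt_fingerprint("example-model", "You are a coding assistant.")
    new = prompt_fingerprint("example-model", "You are a coding assistant. Be concise.")
    if prompt_changed(old, new):
        print(f"System prompt changed: {old} -> {new}")
```

A fingerprint like this reveals that something changed without revealing what changed, which matches the "I don't need to know what changed, just that it did" request made earlier in the thread.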

arkariarn•8m ago

I see some anthropic claude code people are reading the comments. A day or two ago I watched a video by theo t3.gg on whether claude got dumber. Even though he was really harsh on anthropic and said some mean stuff, I thought some of the points he raised about claude code were quite apt, especially when it comes to the harness bloat. I really hope the new features stop now and there is a real hard push for polish and optimization. Otherwise I think a lot of people will start exploring less bloated, more optimized alternatives. Focus on making the harness better and less token consuming.

https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh

whalesalad•7m ago

literally just `git reset --hard <random hash from 3 months ago>` would fix this

lanthissa•4m ago

never ever forget theo's gpt 5 hype video and then him having to walk it back.

it's very clear that there's money or influence exchanging hands behind the scenes with certain content creators, the information, and openai.

tontinton•6m ago

or you can use a non vibe designed efficient Rust TUI coding agent made by yours truly, all my coworkers use it too :) called https://maki.sh!

lua plugins WIP

jruz•5m ago

Too late bro, switched to Codex I’m done with your bullshit.
