Claude Code has made almost half a billion so far[1] (>$500M in ARR, and it's only about 9 months old), and 30% of all users were impacted at least once by the first routing bug alone. Scary stuff.
Their post-mortem is basically "evaluations are hard, we relied on vibe checking, now we're going to do even more frequent vibe checking". I believe it was indeed unintentional, but in a future where investors' money won't come down from the skies, serving distilled models will be very tempting. And currently they can't be held to any SLA; it's just vibes. I wonder how enterprise vendors are going to deal with this going forward: you can't just have quality degrade with neither client nor vendor even able to really prove it.
[1][https://www.anthropic.com/news/anthropic-raises-series-f-at-...]
The blog explains what issues they had and how they fixed them. This is good enough.
It's a material difference in the product, not just "a bug."
Matches my experience. I use CC through our enterprise Vertex AI account and never noticed any degradation.
In general it seems like these bugs, while serious, were substantially less prevalent than anecdotal online reports would have you believe. We're really talking about a ~1-2 week window where most issues were concentrated, with a relatively small percentage of total requests and total users impacted.
> Approximately 30% of Claude Code users had at least one message routed to the wrong server type, resulting in degraded responses.
> However, some users were affected more severely, as our routing is "sticky". This meant that once a request was served by the incorrect server, subsequent follow-ups were likely to be served by the same incorrect server.
30% of Claude Code users getting a degraded response is a huge bug.
I would have appreciated if they had released the full distribution of impact though.
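To see why stickiness amplifies a mis-route, here's a minimal sketch of session-keyed routing. The pool names and hashing scheme are my assumptions for illustration, not Anthropic's actual design:

```python
import hashlib

POOLS = ["short-context", "long-context"]  # hypothetical server pools

def route(session_id: str, mis_route: bool = False) -> str:
    # Sticky routing: the same session hashes to the same pool every time,
    # so one wrong assignment persists across all follow-up requests.
    idx = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(POOLS)
    if mis_route:  # simulate the routing bug flipping the pool choice
        idx = 1 - idx
    return POOLS[idx]

# Every follow-up in a mis-routed session lands on the same wrong pool.
print([route("session-42", mis_route=True) for _ in range(3)])
```

Under this scheme a single bad assignment isn't one degraded response; it's every response for the rest of the session, which matches the "some users were affected more severely" wording.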
How many users forget they have a sub? How many get a sub through work and don't use it often?
I'd bet a large number tbh based on other subscription services.
imho there's a big market gap for companies that are truly honest with customers instead of corporate gaslighting
Layered in aggrandizing. You host a service, people give you money.
My criticism is it's 'puffy'. The 'scope and complexity' for a public postmortem is 'customer-facing'. Otherwise it's a tree/forest scenario.
One might say 'the lady doth protest too much'; this should be routine. It is, elsewhere: see Cloud, Web Hosting, PBX. Pick your decade.
Can anyone explain to a layperson how this sort of thing is even possible for an LLM?
For normal code, of course stupid bugs happen all the time. You accidentally introduce an off-by-one error in a conditional, for example, or add an extra `goto fail`.
But LLMs aren't written by humans! Models are trained by automated programs over a period of many months across unfathomably massive data centers.
How would a human introduce a bug like the one described in TFA?
[1] Here is an example of two common approaches: https://www.reddit.com/r/AIDungeon/comments/1eppgyq/can_some...
I've honestly gotten the best results in creative writing by ignoring top_k/top_p and simply tuning temperature. Restricting output to only common words makes everything feel generic. But DeepSeek constantly breaks into Chinese/gibberish/ZALGO! when I push temperature to 1.14.
This isn't related to the "recent issues" but I feel like it's useful advice for anyone trying out AI story creation.
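For anyone unfamiliar with what these knobs actually do, here's a minimal sketch of temperature plus top-p (nucleus) sampling over a toy vocabulary. The logits and values are made up for illustration:

```python
# Toy temperature + top-p sampling: temperature rescales logits before
# softmax (<1 sharpens, >1 flattens, which is why high values go to gibberish);
# top-p keeps only the smallest set of tokens whose cumulative mass reaches p.
import math, random

def sample(logits, temperature=1.0, top_p=1.0, rng=random.Random(0)):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: sort by probability, keep until cumulative >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept set and draw one token index.
    norm = sum(probs[i] for i in kept)
    r = rng.random() * norm
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.5, -1.0]
print(sample(logits, temperature=0.7, top_p=0.9))  # 1 with this fixed seed
```

With top_p near 1.0 and only temperature in play, the tail of the distribution stays reachable, which is the trade-off described above: more variety, but occasional garbage at high temperatures.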
I know I'll probably get pushback on this, but it left a sour taste in my mouth when I paid for a $200 sub that felt like it was less useful than ChatGPT Plus ($20) at times.
Or to summarize: [south park "we're sorry" gif]
Interesting, this implies that the 1M-context servers perform worse at low context. Perhaps this is due to some KV-cache compression, eviction, or sparse-attention scheme being applied on these 1M-context servers?
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
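Reading the quoted guidance, the recommended factor appears to be the ratio of your target context to the model's native window. The native window of 262,144 tokens below is my inference from the quoted example (524,288 tokens → factor 2.0), not something the quote states directly:

```python
# Hedged sketch: derive a static YaRN rope_scaling factor from a target
# context length. NATIVE_CONTEXT is an assumption consistent with the
# quoted example (524,288 tokens -> factor 2.0), not a documented value.
NATIVE_CONTEXT = 262_144

def yarn_factor(target_context: int) -> float:
    # Static YaRN applies this factor regardless of actual input length,
    # which is why short prompts can pay a quality cost on long-context servers.
    return max(1.0, target_context / NATIVE_CONTEXT)

print(yarn_factor(524_288))    # 2.0, matching the quoted recommendation
print(yarn_factor(1_048_576))  # 4.0 for a 1M-token window
```

The key point for the parent comment: because the factor is static, a short prompt served by a long-context server is scaled as if it were long, which would explain degraded short-context quality.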
Calling the platforms A, B, and C might help give us the insight we're missing, letting us spot incongruous behaviors faster than trying to aggregate more generalized feedback.
> Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback.
Ok, makes sense, and glad to hear it.
> It remains particularly helpful for users to continue to send us their feedback directly. You can use the /bug command in Claude Code
Ok, makes sense. I'd expect a human can then see the context in that case, although I hope it's still made very explicit to the end user (I'm not a Claude Code user, so I can't comment).
> or you can use the "thumbs down" button in the Claude apps to do so
This is pretty concerning. I can’t imagine the average person equates hitting this button with forfeiting their privacy.
When you click "thumbs down" you get the message "Submitting this report will send the entire current conversation to Anthropic for future improvements to our models." before you submit the report, I'd consider that pretty explicit.
[1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
moatmoat•2h ago
In Aug–Sep 2025, Claude users saw degraded output quality due to infrastructure bugs, not intentional changes.
The Three Issues

1. *Context window routing error* - Short-context requests were sometimes routed to long-context servers.
2. *Output corruption* - TPU misconfigurations led to weird outputs (wrong language, syntax errors).
3. *Approximate top-k miscompilation* - A compiler bug in the TPU/XLA stack corrupted token-probability selection.

Why It Was Hard to Detect

- Bugs were subtle, intermittent, and platform-dependent.
- Benchmarks missed these degradations.
- Privacy/safety rules limited access to real user data for debugging.

Fixes and Next Steps

- More sensitive, continuous evals on production.
- Better tools to debug user feedback safely.
- Stronger validation of routing, output correctness, and token selection.
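To illustrate why a miscompiled approximate top-k is so hard to catch: a variant that only occasionally drops a high-probability token changes outputs on some requests while barely moving aggregate benchmarks. A toy sketch — the "buggy" block-max variant is invented for illustration, not the actual XLA bug:

```python
import heapq

def exact_top_k(logits, k):
    # Indices of the k largest logits.
    return set(heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i]))

def buggy_approx_top_k(logits, k):
    # Invented bug for illustration: scans in fixed-size blocks and keeps
    # only each block's maximum, so when two top tokens share a block,
    # one is silently dropped from the candidate set.
    block = 4
    candidates = []
    for start in range(0, len(logits), block):
        chunk = range(start, min(start + block, len(logits)))
        candidates.append(max(chunk, key=lambda i: logits[i]))
    return set(heapq.nlargest(k, candidates, key=lambda i: logits[i]))

logits = [0.1, 9.0, 8.5, 0.2, 3.0, 0.3, 0.1, 0.2]
print(exact_top_k(logits, 2))         # {1, 2}: the two best tokens
print(buggy_approx_top_k(logits, 2))  # {1, 4}: index 2 was silently dropped
```

Most inputs (where the top tokens land in different blocks) behave identically, which is exactly the "subtle, intermittent" failure mode described above.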
sebastiennight•1h ago
Do their ToS really limit access to user data (prompt/response)? I don't remember seeing anything to that effect in their terms.