Claude Code has made almost half a billion so far[1] (>$500M in ARR, and it's only about 9 months old), and 30% of all users were impacted at least once by the first routing bug alone. Scary stuff.
Their post-mortem is basically "evaluations are hard, we relied on vibe checking, now we're going to do even more frequent vibe checking". I believe it was indeed unintentional, but in a future where investors' money won't come down from the skies, serving distilled models will be very tempting. And currently they can't be held to any SLA; it's just vibes. I wonder how enterprise vendors are going to deal with this going forward: you can't just have quality degrade with neither client nor vendor even able to really prove it.
[1][https://www.anthropic.com/news/anthropic-raises-series-f-at-...]
The blog explains what issues they had and how they fixed them. This is good enough.
It's a material difference in the product, not just "a bug."
Matches my experience. I use CC through our enterprise Vertex AI account and never noticed any degradation.
In general it seems like these bugs, while serious, were substantially less prevalent than anecdotal online reports would have you believe. We're really talking about a ~1-2 week window where most issues were concentrated, with a relatively small percentage of total requests and total users impacted.
> Approximately 30% of Claude Code users had at least one message routed to the wrong server type, resulting in degraded responses.
> However, some users were affected more severely, as our routing is "sticky". This meant that once a request was served by the incorrect server, subsequent follow-ups were likely to be served by the same incorrect server.
30% of Claude Code users getting a degraded response is a huge bug.
I would have appreciated if they had released the full distribution of impact though.
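To see why stickiness amplifies a mis-route, here's a minimal sketch of session-keyed routing. The pool names and hashing scheme are my assumptions for illustration, not Anthropic's actual design:

```python
import hashlib

POOLS = ["short-context", "long-context"]  # hypothetical server pools

def route(session_id: str, mis_route: bool = False) -> str:
    # Sticky routing: the same session hashes to the same pool every time,
    # so one wrong assignment persists across all follow-up requests.
    idx = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % len(POOLS)
    if mis_route:  # simulate the routing bug flipping the pool choice
        idx = 1 - idx
    return POOLS[idx]

# Every follow-up in a mis-routed session lands on the same wrong pool.
print([route("session-42", mis_route=True) for _ in range(3)])
```

Under this scheme a single bad assignment isn't one degraded response; it's every response for the rest of the session, which matches the "some users were affected more severely" wording.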
How many users forget they have a sub? How many get a sub through work and don't use it often?
I'd bet a large number tbh based on other subscription services.
imho there's a big market gap for companies that are truly honest with customers instead of corporate gaslighting
Layered in aggrandizing. You host a service, people give you money.
My criticism is it's 'puffy'. The 'scope and complexity' for a public postmortem is 'customer-facing'. Otherwise it's a tree/forest scenario.
One might say 'the lady doth protest too much'; this should be routine. It is, elsewhere: see Cloud, Web Hosting, PBX. Pick your decade.
Can anyone explain to a layperson how this sort of thing is even possible for an LLM?
For normal code, of course stupid bugs happen all the time. You accidentally introduce an off-by-one error in a conditional, for example, or add an extra `goto fail`.
But LLMs aren't written by humans! Models are trained by automated programs over a period of many months across unfathomably massive data centers.
How would a human introduce a bug like the one described in TFA?
[1] Here is an example of two common approaches: https://www.reddit.com/r/AIDungeon/comments/1eppgyq/can_some...
I've honestly gotten the best results in creative writing by ignoring top_k/top_p and simply tuning temperature. Restricting output to only common words makes everything feel generic. But DeepSeek constantly breaks into Chinese/gibberish/ZALGO! when I push temperature to 1.14.
This isn't related to the "recent issues" but I feel like it's useful advice for anyone trying out AI story creation.
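For anyone unfamiliar with what these knobs actually do, here's a minimal sketch of temperature plus top-p (nucleus) sampling over a toy vocabulary. The logits and values are made up for illustration:

```python
# Toy temperature + top-p sampling: temperature rescales logits before
# softmax (<1 sharpens, >1 flattens, which is why high values go to gibberish);
# top-p keeps only the smallest set of tokens whose cumulative mass reaches p.
import math, random

def sample(logits, temperature=1.0, top_p=1.0, rng=random.Random(0)):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: sort by probability, keep until cumulative >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the kept set and draw one token index.
    norm = sum(probs[i] for i in kept)
    r = rng.random() * norm
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

logits = [2.0, 1.0, 0.5, -1.0]
print(sample(logits, temperature=0.7, top_p=0.9))  # 1 with this fixed seed
```

With top_p near 1.0 and only temperature in play, the tail of the distribution stays reachable, which is the trade-off described above: more variety, but occasional garbage at high temperatures.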
I know I'll probably get pushback on this, but it left a sour taste in my mouth when I paid for a $200 sub that felt like it was less useful than ChatGPT Plus ($20) at times.
Or to summarize: [south park "we're sorry" gif]
Interesting, this implies that the 1M-context servers perform worse at low context. Perhaps this is due to some KV-cache compression, eviction, or sparse-attention scheme being applied on these 1M-context servers?
> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.
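Reading the quoted guidance, the recommended factor appears to be the ratio of your target context to the model's native window. The native window of 262,144 tokens below is my inference from the quoted example (524,288 tokens → factor 2.0), not something the quote states directly:

```python
# Hedged sketch: derive a static YaRN rope_scaling factor from a target
# context length. NATIVE_CONTEXT is an assumption consistent with the
# quoted example (524,288 tokens -> factor 2.0), not a documented value.
NATIVE_CONTEXT = 262_144

def yarn_factor(target_context: int) -> float:
    # Static YaRN applies this factor regardless of actual input length,
    # which is why short prompts can pay a quality cost on long-context servers.
    return max(1.0, target_context / NATIVE_CONTEXT)

print(yarn_factor(524_288))    # 2.0, matching the quoted recommendation
print(yarn_factor(1_048_576))  # 4.0 for a 1M-token window
```

The key point for the parent comment: because the factor is static, a short prompt served by a long-context server is scaled as if it were long, which would explain degraded short-context quality.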
Calling the platforms A, B, and C might help give us the insight we're missing, letting us spot incongruous behaviors faster than trying to aggregate more generalized feedback.
> Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback.
Ok, makes sense, and glad to hear it.
> It remains particularly helpful for users to continue to send us their feedback directly. You can use the /bug command in Claude Code
Ok, makes sense. I'd expect a human can then see the context in that case, although I hope it's still made very explicit to the end user (I'm not a Claude Code user, so I can't comment).
> or you can use the "thumbs down" button in the Claude apps to do so
This is pretty concerning. I can’t imagine the average person equates hitting this button with forfeiting their privacy.
When you click "thumbs down" you get the message "Submitting this report will send the entire current conversation to Anthropic for future improvements to our models." before you submit the report, I'd consider that pretty explicit.
[1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
moatmoat•2h ago
In Aug–Sep 2025, Claude users saw degraded output quality due to infrastructure bugs, not intentional changes.
The Three Issues

1. *Context window routing error* - Short-context requests were sometimes routed to long-context servers.
2. *Output corruption* - TPU misconfigurations led to weird outputs (wrong language, syntax errors).
3. *Approximate top-k miscompilation* - A compiler bug in the TPU/XLA stack corrupted token-probability selection.

Why It Was Hard to Detect

- Bugs were subtle, intermittent, and platform-dependent.
- Benchmarks missed these degradations.
- Privacy/safety rules limited access to real user data for debugging.

Fixes and Next Steps

- More sensitive, continuous evals on production.
- Better tools to debug user feedback safely.
- Stronger validation of routing, output correctness, and token selection.
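To illustrate why a miscompiled approximate top-k is so hard to catch: a variant that only occasionally drops a high-probability token changes outputs on some requests while barely moving aggregate benchmarks. A toy sketch — the "buggy" block-max variant is invented for illustration, not the actual XLA bug:

```python
import heapq

def exact_top_k(logits, k):
    # Indices of the k largest logits.
    return set(heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i]))

def buggy_approx_top_k(logits, k):
    # Invented bug for illustration: scans in fixed-size blocks and keeps
    # only each block's maximum, so when two top tokens share a block,
    # one is silently dropped from the candidate set.
    block = 4
    candidates = []
    for start in range(0, len(logits), block):
        chunk = range(start, min(start + block, len(logits)))
        candidates.append(max(chunk, key=lambda i: logits[i]))
    return set(heapq.nlargest(k, candidates, key=lambda i: logits[i]))

logits = [0.1, 9.0, 8.5, 0.2, 3.0, 0.3, 0.1, 0.2]
print(exact_top_k(logits, 2))         # {1, 2}: the two best tokens
print(buggy_approx_top_k(logits, 2))  # {1, 4}: index 2 was silently dropped
```

Most inputs (where the top tokens land in different blocks) behave identically, which is exactly the "subtle, intermittent" failure mode described above.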
sebastiennight•1h ago
Do their ToS really limit access to user data (prompt/response)? I don't remember seeing anything to that effect in their terms.