
A postmortem of three recent issues

https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
104•moatmoat•2h ago

Comments

moatmoat•2h ago
TL;DR — Anthropic Postmortem of Three Recent Issues

In Aug–Sep 2025, Claude users saw degraded output quality due to infrastructure bugs, not intentional changes.

The Three Issues

1. *Context window routing error* - Short-context requests were sometimes routed to long-context servers. The bug started small and worsened after a load-balancing change.
2. *Output corruption* - A TPU misconfiguration produced weird outputs (wrong language, syntax errors); a runtime optimization wrongly boosted improbable tokens.
3. *Approximate top-k miscompilation* - A compiler bug in the TPU/XLA stack corrupted token-probability selection, occasionally dropping the true top token.

Why It Was Hard to Detect

- Bugs were subtle, intermittent, and platform-dependent.
- Benchmarks missed these degradations.
- Privacy/safety rules limited access to real user data for debugging.

Fixes and Next Steps

- More sensitive, continuous evals on production.
- Better tools to debug user feedback safely.
- Stronger validation of routing, output correctness, and token selection.

sebastiennight•1h ago
> Privacy/safety rules limited access to real user data for debugging.

Do their ToS really limit access to user data (prompt/response)? I don't remember seeing anything to that effect in their terms.

mcintyre1994•1h ago
I’d imagine they have a lot of internal controls, even if ultimately someone at the company can read the data within their terms. It makes sense that the teams debugging stuff wouldn’t have this access immediately.
favorited•1h ago
I know that when you submit a thumbs up/down rating for a response, you need to opt-in to the whole chat conversation being shared with Anthropic.
stellalo•1h ago
Title should be fixed: it’s about Claude models in general, not Claude Code
deepdarkforest•1h ago
Wow. Sneaky. As far as I can tell, they don't even state the rate of impact for the XLA bug, which affected everyone, not just Claude Code users. Very vague. Interesting.

Claude Code has made almost half a billion so far[1] (>$500M in ARR, and it's about 9 months old), and 30% of all users were impacted at least once just from the first routing bug. Scary stuff.

Their postmortem is basically "evaluations are hard, we relied on vibe checking, now we are going to have even more frequent vibe checking". I believe it was indeed unintentional, but in a future where investors' money won't come down from the skies, serving distilled models will be very tempting. And you cannot be held liable to any SLA currently; it's just vibes. I wonder how enterprise vendors are going to deal with this going forward: you cannot just degrade quality without the client or vendor even being able to really prove it.

[1][https://www.anthropic.com/news/anthropic-raises-series-f-at-...]

extr•58m ago
Is your contention that paying for a service entitles you to zero bugs, ever?
deepdarkforest•45m ago
Of course not! But usually, you can quantify metrics for quality, like uptime, lost transactions, response time, throughput etc. Then you can have accountability, and remediate. Even for other bugs, you can often reproduce and show clearly the impact. But in this case, other than internal benchmarks, you cannot really prove it. There is no accountability yet
_zoltan_•18m ago
Why would they publish the data you seek? I would not publish it either.

The blog explains what issues they had and how they fixed them. This is good enough.

flutas•42m ago
If you paid for a streaming service and the HD option only worked for a random subset of users, and not you, would you complain?

It's a material difference in the product, not just "a bug."

dylan604•32m ago
I'd honestly blame my ISP for traffic shaping my connection as a first assumption, and not immediately blame the streaming platform.
extr•1h ago
> Incorrect routing affected less than 0.0004% of requests on Google Cloud's Vertex AI between August 27 and September 16.

Matches my experience. I use CC through our enterprise Vertex AI account and never noticed any degradation.

In general it seems like these bugs, while serious, were substantially less prevalent than anecdotal online reports would have you believe. We are really talking about a ~1-2 week window here where most issues were concentrated, a relatively small percentage of total requests and total users impacted.

ispeaknumbers•59m ago
I'm not sure if you can claim these were "less prevalent than anecdotal online reports". From their article:

> Approximately 30% of Claude Code users had at least one message routed to the wrong server type, resulting in degraded responses.

> However, some users were affected more severely, as our routing is "sticky". This meant that once a request was served by the incorrect server, subsequent follow-ups were likely to be served by the same incorrect server.

30% of Claude Code users getting a degraded response is a huge bug.

extr•50m ago
I don't know about you, but my feed is filled with people claiming that they are surely quantizing the model, that Anthropic is purposefully degrading things to save money, etc. 70% of users were not impacted; 30% had at least one message degraded. One message is basically nothing.

I would have appreciated if they had released the full distribution of impact though.

flutas•40m ago
That 30% is of ALL users, not just users who made a request; important to note the weasel wording there.

How many users forget they have a sub? How many get a sub through work and don't use it often?

I'd bet a large number tbh based on other subscription services.

smca•34m ago
(I work at Anthropic) It's 30% of all CC users that made a request during that period. We've updated the post to be clearer.
extr•32m ago
That's a pretty cynical read. My personal impression is that Anthropic has a high level of integrity as an organization. Believe what you want, I'm inclined to give them the benefit of the doubt here and move on.
thousand_nights•37m ago
I don't trust companies anymore, because every time there's a worldwide outage they use softspeak like "we're observing elevated error rates for a small subset of users", hours after some CTO approves changing the status page.

imho there's a big market gap for companies that are truly honest with customers instead of corporate gaslighting

bravetraveler•1h ago
> We don't typically share this level of technical detail about our infrastructure, but the scope and complexity of these issues justified a more comprehensive explanation.

Layered in aggrandizing. You host a service, people give you money.

levocardia•51m ago
No, what that statement means is "we know that if we just say 'we weren't downgrading performance to save money', you won't believe us, so here is a deep dive on the actual reason it happened"
bravetraveler•44m ago
They can still do the deep dive; that would be absolutely convincing. They likely did; I got distracted before I could finish reading it [by work, unfortunately: an incident of our own].

My criticism is it's 'puffy'. The 'scope and complexity' for a public postmortem is 'customer-facing'. Otherwise it's a tree/forest scenario.

One might say 'the lady doth protest too much'; this should be routine. It is, elsewhere: see Cloud, Web Hosting, PBX. Pick your decade.

OGEnthusiast•1h ago
Seems like Claude is using TPUs a lot more than I thought. For some reason I thought 90%+ of their capacity was from AWS.
Wowfunhappy•56m ago
> On August 25, we deployed a misconfiguration to the Claude API TPU servers that caused an error during token generation. An issue caused by a runtime performance optimization occasionally assigned a high probability to tokens that should rarely be produced given the context, for example producing Thai or Chinese characters in response to English prompts, or producing obvious syntax errors in code. A small subset of users that asked a question in English might have seen "สวัสดี" in the middle of the response, for example.

Can anyone explain to a layperson how this sort of thing is even possible for an LLM?

For normal code, of course stupid bugs happen all the time. You accidentally introduce an off-by-one error in a conditional, for example, or add an extra `goto fail`.

But LLMs aren't written by humans! Models are trained by automated programs over a period of many months across unfathomably massive data centers.

How would a human introduce a bug like the one described in TFA?

ashdksnndck•55m ago
There are many layers of human-written code in between you and the weights.
Voloskaya•51m ago
LLMs are still executed by code written by humans. In this case, the model ultimately gives you a probability distribution over each of the ~200k tokens in the vocabulary. It's then up to you to decide how you want to sample the next token: you could, for example, always sample the most likely token, or, to make the output more creative, sample randomly from the top-k tokens. To make it efficient, this top-k sampling is written in XLA and compiled to run directly as a kernel. There was a bug in that kernel, which presumably led to tokens outside of the top-k window being selected from time to time.
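A minimal sketch of that top-k sampling step, in plain Python rather than a compiled XLA kernel (the function name and the toy logits are hypothetical, just to show the invariant the kernel bug broke):

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Sample a token id from among the k highest-logit tokens only."""
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-k:]  # ids of the k largest logits
    mx = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - mx) for i in top]  # softmax over the top-k only
    return rng.choices(top, weights=weights)[0]

rng = random.Random(0)
logits = [9.0, 8.5, 8.0, -4.0, -5.0]  # tokens 3 and 4 should be near-impossible
samples = {sample_top_k(logits, 3, rng) for _ in range(1000)}
assert samples <= {0, 1, 2}  # a correct kernel never selects outside the top-k window
```

The final assertion is exactly what the miscompiled kernel presumably violated: tokens that should have been filtered out of the window occasionally got sampled.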
Centigonal•46m ago
LLMs produce a probability distribution for what the next token might be. How you pick the actual word that is printed next from that probability distribution is by using a sampling approach[1]. If your sampling approach is "select the next word randomly from among the top 4 possibilities" and you flip a > sign, you could end up with the behavior described in the OP.

[1] Here is an example of two common approaches: https://www.reddit.com/r/AIDungeon/comments/1eppgyq/can_some...
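As a toy illustration of that failure mode (hypothetical code, nothing like the actual kernel), flipping one comparison turns "the k most likely tokens" into "the k least likely":

```python
def pick_candidates(probs, k, flipped=False):
    """Return the k candidate token ids to sample from. Flipping the sort
    direction (a single comparison) selects the *least* likely tokens instead."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=not flipped)
    return order[:k]

probs = [0.60, 0.25, 0.10, 0.04, 0.01]
print(pick_candidates(probs, 4))                # [0, 1, 2, 3]: plausible words
print(pick_candidates(probs, 4, flipped=True))  # [4, 3, 2, 1]: the garbage tail
```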

jjmarr•35m ago
The next word can also be selected with weighted randomization and "temperature" is used to control how much weight lower probability tokens get.

I've honestly received the best results in creative writing by ignoring top_k/top_p and simply tuning temperature. Restricting my output to only common words causes everything to feel generic. But Deepseek constantly breaks into Chinese/gibberish/ZALGO! when I go to 1.14.

This isn't related to the "recent issues" but I feel like it's useful advice for anyone trying out AI story creation.
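Mechanically, temperature just divides the logits before the softmax; a sketch with toy logits (hypothetical values, assuming plain softmax sampling):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before the softmax: higher temperature
    flattens the distribution, giving lower-probability tokens more weight."""
    scaled = [l / temperature for l in logits]
    mx = max(scaled)
    exps = [math.exp(s - mx) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]
cool = softmax_with_temperature(logits, 0.7)   # sharper: the top token dominates
hot = softmax_with_temperature(logits, 1.14)   # flatter: the tail gains mass
assert cool[0] > hot[0] and hot[2] > cool[2]
```

This is why pushing temperature past 1.0 eventually surfaces tokens (other scripts, gibberish) that were always in the distribution but almost never sampled.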

flutas•56m ago
And yet no offers of credits to make things right for the users, for what was essentially degraded performance of what you paid for.

I know I'll probably get push back on this, but it left a sour taste in my mouth when I paid for a $200 sub that felt like it was less useful than ChatGPT Plus ($20) at times.

Or to summarize: [south park "we're sorry" gif]

cyanf•37m ago
> On August 29, a routine load balancing change unintentionally increased the number of short-context requests routed to the 1M context servers. At the worst impacted hour on August 31, 16% of Sonnet 4 requests were affected.

Interesting, this implies that the 1M context servers perform worse at low context. Perhaps this is due to some KV cache compression, eviction, or sparse attention scheme being applied on these 1M context servers?

kiratp•9m ago
This is due to RoPE scaling.

> All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required. It is also recommended to modify the factor as needed. For example, if the typical context length for your application is 524,288 tokens, it would be better to set factor as 2.0.

https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Thinking
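A rough sketch of the mechanism: this is plain linear position-interpolation scaling, much simpler than YaRN, and the function is hypothetical, but it shows why a statically baked-in factor hurts short prompts. Every rotary frequency is slowed down for every request, including ones the model was trained to see at full speed:

```python
def rope_inv_freq(dim, base=10000.0, factor=1.0):
    """Inverse rotary frequencies per dimension pair; a static scaling
    factor stretches every wavelength, regardless of prompt length."""
    return [1.0 / (factor * base ** (2 * i / dim)) for i in range(dim // 2)]

normal = rope_inv_freq(8)
stretched = rope_inv_freq(8, factor=4.0)  # tuned for long context
# A token at position p is rotated by p * inv_freq, so with factor=4 every
# rotation runs 4x slower than what the model saw during training.
assert all(abs(s - n / 4.0) < 1e-12 for s, n in zip(stretched, normal))
```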

behnamoh•33m ago
Reminder that Anthropic is the only AI company that has never released any open-source/weight models.
arduanika•30m ago
Sure, but don't you feel safer that way?
behnamoh•20m ago
of course, who wants an open-source Sonnet 3... /s
zer00eyz•20m ago
If you are going to run a non-deterministic system on three very different hardware platforms, doesn't it behoove you to tell your users where their experience is coming from?

Calling the platforms A, B and C might help provide us the insight we're missing to spot incongruous behaviors faster than trying to aggregate more generalized feedback.

HoyaSaxa•8m ago
I’m pretty surprised that Anthropic can directly impact the infra for AWS Bedrock as this article suggests. That goes against AWS's commitments. I'm sure the same is true for Google Vertex, but I haven't dug into it from a compliance perspective before.

> Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback.

Ok makes sense and glad to hear

> It remains particularly helpful for users to continue to send us their feedback directly. You can use the /bug command in Claude Code

Ok, makes sense. I'd expect that a human can then see the context in that case, although I hope that is still made very explicit to the end user (I'm not a Claude Code user, so I cannot comment).

> or you can use the "thumbs down" button in the Claude apps to do so

This is pretty concerning. I can’t imagine the average person equates hitting this button with forfeiting their privacy.

_da_•3m ago
> This is pretty concerning. I can’t imagine the average person equates hitting this button with forfeiting their privacy.

When you click "thumbs down" you get the message "Submitting this report will send the entire current conversation to Anthropic for future improvements to our models." before you submit the report, I'd consider that pretty explicit.

vlovich123•2m ago
Figuring out how to make their LLM serving deterministic might also help them track this kind of thing down. There was a recent post about how the received wisdom, which kept attributing non-determinism to floating-point associativity, actually overlooked the real reasons for it [1].

[1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
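The floating-point effect itself is easy to demonstrate (the linked post's argument is that this alone doesn't explain serving non-determinism, since per-request reduction order can be held fixed):

```python
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # the big terms cancel first, so the 1.0 survives
right = a + (b + c)  # 1.0 is absorbed into -1e16 and lost to rounding

print(left, right)   # 1.0 0.0
```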
