Cloudflare outage on November 18, 2025 post mortem

https://blog.cloudflare.com/18-november-2025-outage/

227•eastdakota•1h ago

Related: Cloudflare Global Network experiencing issues - https://news.ycombinator.com/item?id=45963780 - Nov 2025 (1580 comments)

Comments

nawgz•1h ago

> a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system ... to keep [that] system up to date with ever changing threats

> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail

A configuration error can cause internet-scale outages. What an era we live in

Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?

I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!

jmclnx•1h ago

I have to wonder if AI was involved with the change.

norskeld•47m ago

I don't think this is the case with CloudFlare, but for every recent GitHub outage or performance issue... oh boy, I blame the clankers!

mewpmewp2•57m ago

It would have been caught only in stage if there was similar amount of data in the database. If stage has 2x less data it would have never occurred there. Not super clear how easy it would have been to keep stage database exactly as production database in terms of quantity and similarity of data etc.

I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.

Aeolun•44m ago

> I think it's quite rare for any company to have exact similar scale and size of storage in stage as in prod.

We’re like a millionth the size of cloudflare and we have automated tests for all (sort of) queries to see what would happen with 20x more data.

Mostly to catch performance regressions, but it would work to catch these issues too.

I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.

mewpmewp2•40m ago

But now consider how much extra data Cloudflare at its size would have to have just for staging, doubling or more their costs to have stage exactly as production. They would have to simulate similar amount of requests on top of themselves constantly since presumably they have 100s or 1000s of deployments per day.

In this case it seems the database table in question seemed modest in size (the features for ML) so naively thinking they could have kept stage features always in sync with prod at the very least, but could be they didn't consider that 55 rows vs 60 rows or similar could be a breaking point given a certain specific bug.

It is much easier to test with 20x data if you don't have the amount of data cloudflare probably handles.

Aeolun•20m ago

That just means it takes longer to test. It may not be possible to do it in a reasonable timeframe with the volumes involved, but if you already have 100k servers running to serve 25M requests per second, maybe briefly booting up another 100k isn’t going to be the end of the world?

Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.

norskeld•52m ago

This wild `unwrap()` kinda took me aback as well. Someone really believed in themselves writing this. :)

Jach•41m ago

They only recently rewrote their core in Rust (https://blog.cloudflare.com/20-percent-internet-upgrade/) -- given the newness of the system and things like "Over 100 engineers have worked on FL2, and we have over 130 modules" I won't be surprised for further similar incidents.

binarymax•1h ago

28M 500 errors/sec for several hours from a single provider. Must be a new record.

No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.

captainkrtek•1h ago

Something like a major telco going out, for example the AT&T 1990 outage of long distance calling:

> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.

> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.

> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.

https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...

adventured•1h ago

> No other time in history has one single company been responsible for so much commerce and traffic.

AWS very likely has Cloudflare beat in commerce responsibility. Amazon is equal to ~2.3% of US GDP by itself.

manquer•44m ago

Absolute volume maybe[1], as relative % of global digital communication traffic, the era of early telegraph probably has it beat.

In the pre digital era, East India Company dwarfs every other company in any metric like commerce controlled, global shipping, communication traffic, private army size, %GDP , % of workforce employed by considerable margins.

The default was large consolidated organization throughout history, like say Bell Labs, or Standard Oil before that and so on, only for a brief periods we have enjoyed benefits of true capitalism.

[1] Although I suspect either AWS or MS/Azure recent down-times in the last couple of years are likely higher

nullbyte808•43m ago

Yes, all(most) eggs should not be in one basket. Perfect opportunity to setup a service that checks cloudflare then switches a site's DNS to akami as a backup.

0xbadcafebee•1h ago

So, to recap:

  - Their database permissions changed unexpectedly (??)
  - This caused a 'feature file' to be changed in an unusual way (?!)
     - Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
  - Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
     - They hit an internal application memory limit and that just... crashed the app
  - The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
  - After fixing it, they were vulnerable to a thundering herd problem
  - Customers who were not using bot rules were not affected; CloudFlare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots

In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.

From an operations perspective, it would appear they didn't test this on a non-production system mimicing production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll-back when a newly deployed app started crashing.

tptacek•57m ago

People jump to say things like "where's the rollback" and, like, probably yeah, but keep in mind that speculative rollback features (that is: rollbacks built before you've experienced the real error modes of the system) are themselves sources of sometimes-metastable distributed system failures. None of this is easy.

paulddraper•16m ago

Looks like you have the perfect window to disrupt them with a superior product.

rawgabbit•1h ago

     > The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:

  SELECT
  name,
  type
  FROM system.columns
  WHERE
  table =        'http_requests_features'
  order by name;

    Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.

zzzeek•1h ago

> Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system.

And here is the query they used ** (OK, so it's not exactly):

     SELECT * from feature JOIN permissions on feature.feature_type_id = permissions.feature_type_id

someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using much less queries (but i do have an idea).

more edits: OK apparently it's described later in the post as a query against clickhouse's table metadata table, and because users were granted access to an additional database that was actually the backing store to the one they normally worked with, some row level security type of thing doubled up the rows. Not sure why querying system.columns is part of a production level query though, seems overly dynamic.

captainkrtek•1h ago

I believe they mentioned ClickHouse

SerCe•1h ago

As always, kudos for releasing a post mortem in less than 24 hours after the outage, very few tech organisations are capable of doing this.

bayesnet•1h ago

And a well-written one at that. Compared to the AWS port-mortem this could be literature.

yen223•41m ago

I'm curious about how their internal policies work such that they are allowed to publish a post mortem this quickly, and with this much transparency.

Any other large-ish company, there would be layers of "stakeholders" that will slow this process down. They will almost always never allow code to be published.

thesh4d0w•36m ago

The person who posted both this blog article and the hacker news post, is Matthew Prince, one of highly technical billionaire founders of cloudflare. I'm sure if he wants something to happen, it happens.

tom1337•31m ago

I mean the CEO posted the post-mortem so there aren't that many layers of stakeholders above. For other post-mortems by engineers, Matthew once said that the engineering team is running the blog and that he wouldn't event know how to veto even if he wanted [0]

[0] https://news.ycombinator.com/item?id=45588305

madeofpalk•26m ago

From what I've observed, it depends on whether you're an "engineering company" or not.

eastdakota•18m ago

Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career have seemed like wastes but days like today prove useful. I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I’m currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member whose clarity of writing I’ve always admired. He came over. Brought his son (“to show that work isn’t always fun”). Our Chief Legal Officer (Doug) happened to be in town. He came over too. The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places we weren’t clear. At some point John ordered sushi but from a place with limited delivery selection options, and I’m allergic to shellfish, so I ordered a burrito. The team continued to flesh out what happened. As we’d write we’d discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back. A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate. I sent a draft to Michelle, who’s in SF. The technical teams gave it a once over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did. That was the process.

anurag•8m ago

Appreciate the extra transparency on the process.

gucci-on-fleek•1h ago

> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.

As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.

> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions

Also appreciate the honesty here.

> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]

> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.

Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)

Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.

eastdakota•49m ago

Because we initially thought it was an attack. And then when we figured it out we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot) of machines worldwide to get them to flush their bad files.

tptacek•43m ago

Richard Cook #18 (and #10) strikes again!

https://how.complexsystems.fail/#18

It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like are you tabletopping this scenario, are teams building out runbooks for how to quickly resolve this, what's the balancing test for "this needs a functional change to how our distributed systems work" vs. "instead of layering additional complexity on, we should just have a process for quickly and maybe even speculatively restoring this part of the system to a known good state in an outage".

tetec1•40m ago

Yeah, I can imagine that this insertion was some high-pressure job.

gucci-on-fleek•36m ago

Thanks for the explanation! This definitely reminds me of CrowdStrike outages last year:

- A product depends on frequent configuration updates to defend against attackers.

- A bad data file is pushed into production.

- The system is unable to easily/automatically recover from bad data files.

(The CrowdStrike outages were quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours)

EvanAnderson•1h ago

It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.

The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.

The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.

tptacek•59m ago

I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).

EvanAnderson•56m ago

That's why I likened it Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I felt better of and removed.)

Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.

eastdakota•48m ago

That’s correct.

tptacek•47m ago

Is it actually consul-template? (I have post-consul-template stress disorder).

Aeolun•51m ago

I’m fairly certain it will be after they read this thread. It doesn’t feel like they don’t want, or are incapable of improving?

navigate8310•45m ago

I'm amazed that they are not using any simulator of some sort and pushing changes directly to production.

tristan-morris•1h ago

Why call .unwrap() in a function which returns Result<_,_>?

For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.

sayrer•58m ago

Yes, can't have .unwrap() in production code (it's ok in tests)

orphea•47m ago

Like goto, unwrap is just a tool that has its use cases. No need to make a boogeyman out of it.

gishh•40m ago

To be fair, if you’re not “this tall” you really shouldn’t consider using goto in a c program. Most people aren’t that tall.

fwjafwasd•6m ago

panicans should be using .expect() in production

keyle•37m ago

unwrap itself isn't the problem...

tptacek•58m ago

Probably because this case was something more akin to an assert than an error check.

stefan_•52m ago

You are saying this would not have happened in a C release build where asserts define to nothing?

Wonder why these old grey beards chose to go with that.

tptacek•50m ago

I am one of those old grey beards (or at least, I got started shipping C code in the 1990s), and I'd leave asserts in prod serverside code given the choice; better that than a totally unpredictable error path.

ashishb•43m ago

> You are saying this would not have happened in a C release build where asserts define to nothing?

Afaik, Go and Java are the only languages that make you pause and explicitly deal with these exceptions.

tristan-morris•40m ago

And rust, but they chose to panic on the error condition. Wild.

tristan-morris•42m ago

Oh absolutely, that's how it would have been treated.

Surely a unwrap_or_default() would have been a much better fit--if fetching features fails, continue processing with an empty set of rules vs stop world.

marcusb•38m ago

Rust has debug asserts for that. Using expect with a comment about why the condition should not/can't ever happen is idiomatic for cases where you never expect an Err.

This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.

thundergolfer•30m ago

Fly writes a lot of Rust, do you allow `unwrap()` in your production environment? At Modal we only allow `expect("...")` and the message should follow the recommended message style[1].

I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.

1. https://doc.rust-lang.org/std/option/enum.Option.html#recomm...

tptacek•19m ago

After The Great If-Let Outage Of 2024, we audited all our code for that if-let/rwlock problem, changed a bunch of code, and immediately added a watchdog for deadlocks. The audit had ~no payoff; the watchdog very definitely did.

I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure than an errant `unwrap` wouldn't produce stable failure modes.

But like, the `unwrap` thing is all programmers here have to latch on to, and there's a psychological self-soothing instinct we all have to seize onto some root cause with a clear fix (or, better yet for dopaminergia, an opportunity to dunk).

A thing I really feel in threads like this is that I'd instinctively have avoided including the detail about an `unwrap` call --- I'd have worded that part more ambiguously --- knowing (because I have a pathological affinity for this community) that this is exactly how HN would react. Maybe ironically, Prince's writing is a little better for not having dodged that bullet.

koakuma-chan•48m ago

Why is there a 200 limit on appending names?

nickmonad•12m ago

Limits in systems like these are generally good. They mention the reasoning around it explicitly. It just seems like the handling of that limit is what failed and was missed in review.

otterley•58m ago

> work has already begun on how we will harden them against failures like this in the future. In particular we are:

> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input

> Enabling more global kill switches for features

> Eliminating the ability for core dumps or other error reports to overwhelm system resources

> Reviewing failure modes for error conditions across all core proxy modules

Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?

This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.

Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.

nikcub•35m ago

They require the bot management config to update and propagate quickly in order to respond to attacks - but this seems like a case where updating a since instance first would have seen the panic and stopped the deploy.

I wonder why clickhouse is used to store the feature flags here, as it has it's own duplication footguns[0] which could have also easily lead to a query blowing up 2/3x in size. oltp/sqlite seems more suited, but i'm sure they have their reasons

[0] https://clickhouse.com/docs/guides/developer/deduplication

HumanOstrich•8m ago

I don't think sqlite would come close to their requirements for permissions or resilience, to name a couple. It's not the solution for every database issue.

Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.

mewpmewp2•34m ago

It seems they had this continous rollout for the config service, but the services consuming this were affected even by small percentage of these config providers being faulty, since they were auto updating every few minutes their configs. And it seems there is a reason for these updating so fast, presumably having to react to threat actors quickly.

otterley•29m ago

It's in everyone's interest to mitigate threats as quickly as possible. But it's of even greater interest that a core global network infrastructure service provider not DOS a significant proportion of the Internet by propagating a bad configuration too quickly. The key here is to balance responsiveness against safety, and I'm not sure they struck the right balance here. I'm just glad that the impact wasn't as long and as severe as it could have been.

tptacek•16m ago

This isn't really "configuration" so much as it is "durable state" within the context of this system.

otterley•10m ago

In my 30 years of reliability engineering, I've come to learn that this is a distinction without a difference.

People think of configuration updates (or state updates, call them what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will allow changes like these into production unattended -- even ones who wouldn't dare let a single line of code go live without being subject to the full CI/CD process.

HumanOstrich•3m ago

They narrowed down the actual problem to some Rust code in the Bot Management system that enforced a hard limit on the number of configuration items by returning an error, but the caller was just blindly unwrapping it. Memory safety didn't save them either.

Scaevolus•16m ago

Global configuration is useful for low response times to attacks, but you need to have very good ways to know when a global config push is bad and to be able to rollback quickly.

In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.

Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.

lukan•58m ago

"Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page."

Unfortunately they do not share, what caused the status page to went down as well. (Does this happen often? Otherwise a big coincidence it seems)

Aeolun•53m ago

I mean, that would require a postmortem from statuspage.io right? Is that a service operated by cloudflare?

eastdakota•51m ago

We don’t know. Suspect it may just have been a big uptick in load and a failure of its underlying infrastructure to scale up.

dnw•43m ago

Yes, probably a bunch of automated bots decided to check the status page when they saw failures in production.

reassess_blind•18m ago

The status page is hosted on AWS Cloudfront, right? It sure looks like Cloudfront was overwhelmed by the traffic spike, which is a bit concerning. Hope we'll see a post from their side.

notatoad•42m ago

it seems like a good chance that despite thinking their status page was completely independent of cloudfront, enough of the internet is dependent on cloudfront now that they're simply wrong about the status page's independence.

paulddraper•19m ago

Quite possibly it was due to high traffic.

IDK Atlassian Statuspage clientele, but it's possible Cloudflare is much larger than usual.

vsgherzi•57m ago

Why does cloudflare allow unwraps in their code? I would've assumed they'd have clippy lints stopping that sort of thing. Why not just match with { ok(value) => {}, Err(error) => {} } the function already has a Result type.

At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").

The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.

If anyone at cloudflare is here please let me in that codebase :)

waterTanuki•39m ago

Not a cloudflare employee but I do write a lot of Rust. The amount of things that can go wrong with any code that needs to make a network call is staggeringly high. unwrap() is normal during development phase but there are a number of times I leave an expect() for production because sometimes there's no way to move forward.

vsgherzi•25m ago

I'm in a similar boat, at the very leas an expect can give hits to what happened. However this can also be problematic if your a library developer. Sometimes rust is expected to never panic especially in situations like WASM. This is a major problem for companies like Amazon Prime Video since they run in a WASM context for their TV APP. Any panic crashes everything. Personally I usually just either create a custom error type (preferred) or erase it away with Dyn Box Error (no other option). Random unwraps and expects haunt my dreams.

ed_mercer•57m ago

Wow. 26M/s 5xx error HTTP status codes over a span of roughly two hours. That's roughly 187 billion HTTP errors that interrupted people (and systems)!

moralestapia•57m ago

No publicity is bad publicity.

Best post mortem I've read in a while, this thing will be studied for years.

A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as other have already pointed out, that's a very unsafe use of Rust, should've never made it to production.

sigmar•56m ago

Wow. What a post mortem. Rather than Monday morning quarterbacking how many ways this could have been prevented, I'd love to hear people sound-off on things that unexpectedly broke. I, for one, did not realize logging in to porkbun to edit DNS settings would become impossible with a cloudflare meltdown

Svenskunganka•49m ago

Thanks for the post mortem! It sounds like the core issue stems from the Result::unwrap-call in FL2, crashing it. The follow-up steps mentions "Reviewing failure modes for error conditions across all core proxy modules", but that sounds like another Result::unwrap or Result::expect-call sneaking past code review in a new or existing FL2 module has the possibility to trigger this again in the future. Or even if a third-party crate that one of the modules depends on panics.

Wouldn't it be good to also isolate panics in individual FL2 modules so that a panic in one module (in this case, bot management) only affects availability of that module and not the entirety of FL2?

ojosilva•47m ago

This is the multi-million dollar .unwrap() story. In a critical path of infrastructure serving a significant chunk of the internet, calling .unwrap() on a Result means you're saying "this can never fail, and if it does, crash the thread immediately."The Rust compiler forced them to acknowledge this could fail (that's what Result is for), but they explicitly chose to panic instead of handle it gracefully. This is textbook "parse, don't validate" anti-pattern.

I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.

wrs•25m ago

It seems people have a blind spot for unwrap, perhaps because it's so often used in example code. In production code an unwrap or expect should be reviewed exactly like a panic.

It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.

arccy•21m ago

if you make it easy to be lazy and panic vs properly handling the error, you've designed a poor language

otterley•16m ago

https://en.wikipedia.org/wiki/Crash-only_software

nine_k•14m ago

Works when you have the Erlang system that does graceful handing for you: reporting, restarting.

yoyohello13•15m ago

So… basically every language ever?

Except maybe Haskell.

dkersten•12m ago

And Gleam

SchwKatze•11m ago

Unwrap isn't a synonym for laziness, it's just like an assertion, when you do unwrap() you're saying the Result should NEVER fail, and if does, it should abort the whole process. What was wrong was the developer assumption, not the use of unwrap.

HumanOstrich•12m ago

But it's memory safe!!!! How could bug happen?

trengrj•45m ago

Classic combination of errors:

Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).

A Crowdstrike style config update that affects all nodes but obviously isn't tested in any QA or staged rollout strategy beforehand (the application panicking straight away with this new file basically proves this).

Finally an error with bot management config files should probably disable bot management vs crash the core proxy.

I'm interested here why they even decided to name Clickhouse as this error could have been caused by any other database. I can see though the replicas updating causing flip / flopping of results would have been really frustrating for incident responders.

tptacek•38m ago

Right but also this is a pretty common pattern in distributed systems that publish from databases (really any large central source of truth); it might be like the problem in systems like this. When you're lucky the corner cases are obvious; in the big one we experienced last year, a new row in our database tripped an if-let/mutex deadlock, which our system dutifully (and very quickly) propagated across our entire network.

The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.

(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)

nullbyte808•45m ago

I thought it was an internal mess-up. I thought an employee screwed a file up. Old methods are sometimes better than new. AI fails us again!

ksajadi•44m ago

May I just say that Matthew Prince is the CEO of Cloudflare and a lawyer by training (and a very nice guy overall). The quality of this postmortem is great but the fact that it is from him makes one respect the company even more.

dzonga•42m ago

> thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

I don't use Rust, but a lot of Rust people say if it compiles it runs.

Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.

end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.

there's no bad language - just occassional hiccups from us users who use those tools.

dzonga•39m ago

other people might say - why use unsafe rust - but we don't know the conditions of what the original code shipped under. why the pr was approved.

could have been tight deadline, managerial pressure or just the occasional slip up.

tptacek•36m ago

What people are saying is that idiomatic prod rust doesn't use unwrap/expect (both of which panic on the "exceptional" arm of the value) --- instead you "match" on the value and kick the can up a layer on the call chain.

olivia-banks•16m ago

What happens to it up the callstack? Say they propagated it up the stack with `?`. It has to get handled somewhere. If you don't introduce any logic to handle the duplicate databases, what else are you going to do when the types don't match up besides `unwrap`ing, or maybe emitting a slightly better error message? You could maybe ignore that module's error for that request, but if it was a service more critical than bot mitigation you'd still have the same symptom of getting 500'd.

tptacek•14m ago

Yeah, see, that's what I mean.

jryio•31m ago

You misunderstand what Rust’s guarantees are. Rust has never promised to solve or protect programmers from logical or poor programming. In fact, no such language can do that, not even Haskell.

Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.

Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)

This is the classic example of when something fails, the failure cause over indexes on - while under indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.

I postulate that whatever the cost in millions or hundreds of millions of dollars by this Cloudflare outage, it has paid for more than by the savings of safe memory access.

See: https://en.wikipedia.org/wiki/Survivorship_bias

metaltyphoon•26m ago

> Well Rust won't save you from the usual programming mistake

This is not a Rust problem. Someone consciously chose to NOT handle an error, possibly thinking "this will never happen". Then someone else conconciouly reviewed (I hope so) a PR with an unwrap() and let it slide.

rvz•25m ago

Great write up.

This is the first significant outage that has involved Rust code, and as you can see the .unwrap is known to carry the risk of a panic and should never be used on production code.

chatmasta•22m ago

Wow, crazy disproportional drop in the stock price… good buying opportunity for $NET.

nanankcornering•19m ago

Matt, Looking forward in regaining Elon's and his team trust to use CF again.

RagingCactus•15m ago

Lots of people here are (perhaps rightfully) pointing to the unwrap() call being an issue. That might be true, but to me the fact that a reasonably "clean" panic at a defined line of code was not quickly picked up in any error monitoring system sounds just as important to investigate.

Assuming something similar to Sentry would be in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well defined clean crashes should in theory also stand out against all the random errors that start occuring all over the system as it begins to go down, precisely because it's always failing at the exact same point.

slyall•13m ago

Ironically just now I got a Cloudflare "Error code 524" page because blog.cloudflare.com was down

jijji•12m ago

this is where change management really shines because in a change management environment this would have been prevented by a backout procedure and it would never have been rolled out to production before going into QA, with peer review happening before that... I don't know if they lack change management but it's definitely something to think about

yoyohello13•5m ago

People really like to hate on Rust for some reason. This wasn’t a Rust problem, no language would have saved them from this kind of issue. In fact, the compiler would have warned that this was a possible issue.

testemailfordg2•4m ago

"Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero."

This simply means, the exception handling quality of your new FL2 is non-existent and is not at par / code logic wise similar to FL.

I hope it was not because of AI driven efficiency gains.