That may be. What's not accounted for there is the immense cost of running a dev org on those terms. It radically limits the pool of engineers you can hire (to those who understand this and are willing to work this way), and it radically slows deployment.
Cloudflare may well need to transition to this sort of engineering culture, but there is no doubt that they would not be in the position they are in today if they had started with it -- they would have been too slow to capture the market.
I think critiques that have actionable plans for real dev teams are likely to be more useful than what, to me, reads as a sort of complaint from an ivory tower. Culture matters, shipping speed matters, quality matters, team DNA matters. That's what makes this stuff hard (and interesting!)
In a database, you wouldn't solve this with a DISTINCT or a LIMIT, would you? You would make the schema guarantee uniqueness.
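To make that concrete, here's a generic SQL sketch (the table and column names are made up for illustration): uniqueness enforced in the schema versus a DISTINCT that only hides duplicates after they've already gotten in.

    -- Schema-level guarantee: duplicate (db, table, column) rows cannot exist.
    CREATE TABLE feature_columns (
        db_name     TEXT NOT NULL,
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        column_type TEXT NOT NULL,
        PRIMARY KEY (db_name, table_name, column_name)
    );

    -- Query-level band-aid: DISTINCT masks duplicates instead of preventing them.
    SELECT DISTINCT column_name, column_type
    FROM feature_columns
    WHERE table_name = 'my_features_table';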
And yes, that wouldn't deal with cross-database queries. But the solution here is just to filter by database name; the rest is table design.
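For anyone following along, a rough sketch of what that filter looks like in ClickHouse-style SQL against the system.columns metadata table (the feature-table and database names here are placeholders, not the actual query): without a database filter, the query returns matching rows from every database the account can read, so granting access to another database silently duplicates the results.

    -- Unscoped: pulls columns from every database the account can see,
    -- so a new permission grant can double the rows returned.
    SELECT name, type
    FROM system.columns
    WHERE table = 'my_features_table'
    ORDER BY name;

    -- Scoped: filtering by database name keeps the result stable under new grants.
    SELECT name, type
    FROM system.columns
    WHERE table = 'my_features_table'
      AND database = 'default'
    ORDER BY name;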
* The deployment should have followed the blue/green pattern, limiting the blast radius of a bad change to a subset of nodes.
* In general, a company this foundational to internet connectivity should not follow the "move fast, break things" approach. They did not have an overwhelming reason to hurry and take risks. This has burned a lot of trust, regardless of the nature of the actual bug.
mikece•16m ago
Have any of the post-mortems addressed whether any of the code that led to Cloudflare's outage was generated by AI?