You were way off here.
What otherwise tends to happen, in my experience, is that the initial effort turns up some deficiencies, only some of which are the major ones, and subsequent effort is spent looking mainly in that same area, never uncovering the major deficiencies that were not initially discovered.
Cloudflare is probably one of the best "voices" in the industry when it comes to post-mortems and root cause analysis.
Reading Cloudflare's description of the problem, this is something I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.
The reason it's so hard to test for is that all tests would have shown there was no problem. This wasn't a code update, it was a config update. Without really extensive performance tests (which, when done well, take a long time!), there really wasn't a way to know that a change that appeared safe wasn't.
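For concreteness, here is the kind of pre-deploy sanity check one could bolt on after the fact (a minimal sketch with invented file names and limits, not Cloudflare's actual pipeline): it refuses to propagate a generated file once it exceeds what the consumer is known to handle.

    // Hypothetical pre-deploy guard for a generated config/feature file.
    // File name and limits are illustrative, not Cloudflare's real values.
    use std::fs;

    const MAX_FEATURES: usize = 200;     // assumed hard cap in the consumer
    const MAX_FILE_BYTES: u64 = 1 << 20; // assumed 1 MiB ceiling

    fn validate_feature_file(path: &str) -> Result<(), String> {
        let meta = fs::metadata(path).map_err(|e| e.to_string())?;
        if meta.len() > MAX_FILE_BYTES {
            return Err(format!("{path} is {} bytes, over the {MAX_FILE_BYTES}-byte cap", meta.len()));
        }
        let contents = fs::read_to_string(path).map_err(|e| e.to_string())?;
        // One feature per non-empty line in this toy format.
        let features = contents.lines().filter(|l| !l.trim().is_empty()).count();
        if features > MAX_FEATURES {
            return Err(format!("{path} has {features} features, over the cap of {MAX_FEATURES}"));
        }
        Ok(())
    }

    fn main() {
        match validate_feature_file("features.conf") {
            Ok(()) => println!("feature file within limits, safe to propagate"),
            Err(e) => {
                eprintln!("refusing to deploy: {e}");
                std::process::exit(1);
            }
        }
    }

Of course, a check like this only exists once someone has realised the limit is there to be hit, which is exactly the hindsight problem.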
I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.
Now, if you want to see a sloppy outage you look at the Crowdstrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.
Oh okay, well I guess the outage wasn't a real issue then
If the analysis has not uncovered the feedback problems (even with large effort, or without it), my argument is that a better method is needed.
The author asks for a deep, system-theoretic analysis... immediately after the incident. That's just not how reality works.
When the house is on fire, you put it out and write a quick "the wiring was bad" report so everyone calms down. You don't write a PhD thesis on electrical engineering standards within 24 hours. The deep feedback-loop analysis happens weeks later, usually internally.
Thinking the consideration of feedback loops requires "deep" analysis is, I suspect, part of the problem! The insufficient feedback shows up at a very shallow level.
Do you have an open source alternative, especially one you are donating to?
Because I would like to see Anubis and other alternatives thrive much more than closed ones like Cloudflare.
Anubis is a bot firewall not a CDN.
I don't remember telling anyone to trust the reviews?
I think it is healthy to try alternatives to Cloudflare and then come to your own decision.
DigitalOcean is owned by Wall Street.
Only Hetzner is a good alternative CDN.
Yet it is an available alternative to Cloudflare that is not on Wall Street (a public company).
If you want to do this 100% yourself there is Apache Traffic Control.
https://github.com/apache/trafficcontrol
> Anubis is a bot firewall not a CDN.
For now. If we support alternatives they can grow into an open source CDN.
You realize that to run a CDN you have to buy massive amounts of bandwidth and computers? DIY here betrays a misunderstanding of what it takes to be DoS resistant and also what it takes for a CDN to actually deliver a performance benefit.
Customers on the enterprise plan can either use Anubis's Managed CDN or host Anubis themselves via an enterprise license!
They can directly receive tech support from the creator of Anubis (as long as they pay for the enterprise plan).
I don't see a problem with this and it can turn Anubis from "a piece of software" into a CDN.
I think the ultimate judgement must come from whether we will stay with Cloudflare now that we have seen how bad it can get. One could also say that this level of outage hasn't happened in many years, and they are now freshly frightened by it happening again so expect things to get tightened up (probably using different questions than this blog post proposes).
As for what this blog post could have been: maybe a page out of how these ideas were actively used by the author at e.g. Tradera or Loop54.
This would be preferable, of course. Unfortunately both organisations were rather secretive about their technical and social deficiencies and I don't want to be the one to air them out like that.
We mustn't assume that Cloudflare isn't undertaking this process just because we're not an audience to it.
It's extremely easy, and correspondingly valueless, to ask all kinds of "hard questions" about a system 24h after it had a huge incident. The hard part is doing this appropriately for every part of the system before something happens, while maintaining the other equally rightful goals of the organizations (such as cost-efficiency, product experience, performance, etc.). There's little evidence that suggests Cloudflare isn't doing that, and their track record is definitely good for their scale.
Some never get out of this phase though.
Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them. Cloudflare, etc publish SLAs and compensation schedules in case those SLAs are missed. Businesses chose to accept those SLAs and take on that risk.
If Cloudflare/etc signed a contract promising a certain SLA (with penalties) and then chose to not pay out those penalties then there would be reasons to ask questions, but nothing suggests they’re not holding up their side of the deal - you will absolutely get compensated (in the form of a refund on your bill) in case of an outage.
The issue is that businesses accept this deal and then scream when it goes wrong, yet are unwilling to pay for a solution that does not fail in this way. Those solutions exist - you absolutely can build systems that are reliable and/or fail in a predictable and testable manner; it’s simply more expensive and requires more skill than just slapping a few SaaSes and CNCF projects together. But it is possible - look at the uptime of card networks, stock exchanges, or airplane avionics. The truth is that businesses don’t want to pay for it (and neither do their end-customers - they will bitch about outages, but will immediately run the other way if you ask them to pony up for a more reliable system - and the ones that don’t, already run such a system and were unaffected by the recent outages).
> Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them
Could you not say this about any supplier relationship? No: in this case, we all know the root of the outage is Cloudflare, so it absolutely makes sense to blame Cloudflare, and not their customers.
Absolutely, yes. Where's your backup plan for when Visa doesn't behave as you expect? It's okay to not have one, but it's also your fault for not having one, and that is the sole reason that the lemonade stand went down.
I don’t have (nor have to have) such a plan; I offer X service with Y guarantees, paying out Z dollars if I don’t hold up my part of the bargain. In this hypothetical situation, if Visa signs up I’d assume they wanted to host their marketing website or some low-hanging fruit; it’s not my job to check what they’re using it for (in fact it would be preferable for me not to check, as I’d be seeing unencrypted card numbers and PII otherwise).
That aside, I think the example is good. It's a bit like priority inversion in scheduling. With no agreement from the lemonade seller they've suddenly changed greatly in terms of their criticality to some value creation chain.
Same with Cloudflare. If you run your site on Cloudflare, you are responsible for any downtime caused to your site by Cloudflare.
What we can blame Cloudflare for is having so many customers that a Cloudflare outage has outsized impact compared to the more uncorrelated outages we would have if sites were distributed among many smaller providers. But that's not quite the same as blaming any individual site's downtime on Cloudflare.
Not always. If the farm sells packs of poisoned bacon to the supermarket, we blame the farm.
It's more about if the website/supermarket can reasonably do the QA.
In fact, I'd say... airplane avionics are not what you should be looking at. Boeing's 787? Reboot every 51 days or risk the pilots getting wrong airspeed indicators! No, I'm not joking [1], and it's not the first time either [2], and it's not just Boeing [3].
[1] https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
[2] https://www.theregister.com/2015/05/01/787_software_bug_can_...
[3] https://www.theregister.com/2019/07/25/a350_power_cycle_soft...
If this is documented then fair enough - airlines don’t have to buy airplanes that need rebooting every 51 days, they can vote with their wallets and Boeing is welcome to fix it. If not documented, I hope regulators enforced penalties high enough to force Boeing to get their stuff together.
Either way, the uptime of avionics (and redundancies - including the unreliable airspeed checklists) are much higher than anything conventional software “engineering” has been putting out the past decade.
If some dingus widget provider has a downtime, no one dies. If an airplane loses power mid flight due to a software error... that's not guaranteed any more.
This puts it in the same maintenance bucket as parts that need regular replacement.
To use the “part needing regular replacement” analogy - the bug could be fixed for an extra $Y. But everything in engineering is a trade off. Choosing not to fix the bug might bring costs down to something competitive with the market.
Plus, like any part that could be made more reliable for extra cost, there might be no benefit due to service schedules. Say you could design a part that is 60% worn at service time vs a cheaper part that’s 90% worn - if they will both need replacing at the same time anyway, you’d go for the lower cost design.
Eg. if some service task requires resetting the avionics every 40 days, that’s less than 51 days so there may as well be no bug.
I don't love piling on, but it still shocks me that people write without first reading.
With that said, I would also like to know how it took them ~2 hours to see the error. That's a long, long time.
So during ingress there’s not an async call to the bot management service which intercepts the request before it’s outbound to origin - it’s literally a Lua script (or rust module in fl2) that runs on ingress inline as part of handling the request. Thus there’s no timeout or other concerns with the management service failing to assign a bot score.
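To make that distinction concrete, here is a toy sketch (invented types and names, nothing to do with fl2's actual API) of bot scoring as an inline step of the request path: a module failure fails the request itself, rather than a side call timing out and letting the request through unscored.

    // Illustrative only: bot scoring as an inline step of request handling,
    // not an async side call with a timeout. All names here are made up.
    struct Request {
        path: String,
        user_agent: String,
    }

    struct Response {
        status: u16,
        body: String,
    }

    // The module runs as part of the request path itself. If it errors out,
    // the request errors out; there is no timeout that lets the request
    // continue without a score.
    fn bot_score(req: &Request) -> Result<u8, String> {
        if req.user_agent.is_empty() {
            return Err("feature data unavailable".to_string());
        }
        Ok(if req.user_agent.contains("curl") { 5 } else { 80 })
    }

    fn handle(req: &Request) -> Response {
        match bot_score(req) {
            Ok(score) if score < 30 => Response { status: 403, body: "blocked as likely bot".into() },
            Ok(_) => Response { status: 200, body: format!("proxied {}", req.path) },
            // An inline failure surfaces as a 5xx on the request itself.
            Err(e) => Response { status: 500, body: format!("bot module error: {e}") },
        }
    }

    fn main() {
        let req = Request { path: "/".into(), user_agent: "Mozilla/5.0".into() };
        let resp = handle(&req);
        println!("{} {}", resp.status, resp.body);
    }

The trade-off is visible in the Err arm: inline means there is no partial-degradation path unless one is written explicitly.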
There are better questions but to me the ones posed don’t seem particularly interesting.
For example, "more global kill switches for features" is good, but would "only" have shaved 30 % off the time of recovery (if reading the timeline charitably). Being able to identify the broken component faster would have shaved 30–70 % off the time of recovery depending on how fast identification could happen – even with no improvements to the kill switch situation.