You were way off here.
What otherwise tends to happen, in my experience, is that the initial effort turns up some deficiencies, only some of which are the major ones, and subsequent effort is spent looking mainly in that same area, never uncovering the major deficiencies that were not initially discovered.
Cloudflare is probably one of the best "voices" in the industry when it comes to post-mortems and root cause analysis.
Reading Cloudflare's description of the problem, this is something I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.
The reason it's so hard to test for is that all tests would have shown there was no problem. This wasn't a code update, it was a config update. Without really extensive performance tests (which, when done well, take a long time!), there really wasn't a way to know that a change that appeared safe wasn't.
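For concreteness, here is the kind of pre-deploy sanity check one could bolt on after the fact (a minimal sketch with invented file names and limits, not Cloudflare's actual pipeline): it refuses to propagate a generated file once it exceeds what the consumer is known to handle.

    // Hypothetical pre-deploy guard for a generated config/feature file.
    // File name and limits are illustrative, not Cloudflare's real values.
    use std::fs;

    const MAX_FEATURES: usize = 200;     // assumed hard cap in the consumer
    const MAX_FILE_BYTES: u64 = 1 << 20; // assumed 1 MiB ceiling

    fn validate_feature_file(path: &str) -> Result<(), String> {
        let meta = fs::metadata(path).map_err(|e| e.to_string())?;
        if meta.len() > MAX_FILE_BYTES {
            return Err(format!("{path} is {} bytes, over the {MAX_FILE_BYTES}-byte cap", meta.len()));
        }
        let contents = fs::read_to_string(path).map_err(|e| e.to_string())?;
        // One feature per non-empty line in this toy format.
        let features = contents.lines().filter(|l| !l.trim().is_empty()).count();
        if features > MAX_FEATURES {
            return Err(format!("{path} has {features} features, over the cap of {MAX_FEATURES}"));
        }
        Ok(())
    }

    fn main() {
        match validate_feature_file("features.conf") {
            Ok(()) => println!("feature file within limits, safe to propagate"),
            Err(e) => {
                eprintln!("refusing to deploy: {e}");
                std::process::exit(1);
            }
        }
    }

Of course, a check like this only exists once someone has realised the limit is there to be hit, which is exactly the hindsight problem.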
I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.
Now, if you want to see a sloppy outage you look at the Crowdstrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.
Oh okay, well I guess the outage wasn't a real issue then
If the analysis has not uncovered the feedback problems (even with large effort, or without it), my argument is that a better method is needed.
The author asks for a deep, system-theoretic analysis... immediately after the incident. That's just not how reality works.
When the house is on fire, you put it out and write a quick "the wiring was bad" report so everyone calms down. You don't write a PhD thesis on electrical engineering standards within 24 hours. The deep feedback-loop analysis happens weeks later, usually internally.
Thinking the consideration of feedback loops requires "deep" analysis is, I suspect, part of the problem! The insufficient feedback shows up at a very shallow level.
Do you have an open source alternative, especially one you are donating to?
Because I would like to see Anubis and other alternatives thrive much more than closed ones like Cloudflare.
Anubis is a bot firewall not a CDN.
I don't remember telling anyone to trust the reviews?
I think it is healthy to try alternatives to Cloudflare and then come to your own decision.
DigitalOcean is owned by Wall Street.
Only Hetzner is a good alternative CDN.
Yet it is an available alternative to Cloudflare that is not on Wall Street (a public company).
If you want to do this 100% yourself there is Apache Traffic Control.
https://github.com/apache/trafficcontrol
> Anubis is a bot firewall not a CDN.
For now. If we support alternatives they can grow into an open source CDN.
You realize that to run a CDN you have to buy massive amounts of bandwidth and computers? DIY here betrays a misunderstanding of what it takes to be DoS resistant and also what it takes for a CDN to actually deliver a performance benefit.
Customers on the enterprise plan can either use Anubis's Managed CDN or host Anubis themselves via an enterprise license!
They can directly receive tech support from the creator of Anubis (as long as they pay for the enterprise plan).
I don't see a problem with this and it can turn Anubis from "a piece of software" into a CDN.
I think the ultimate judgement must come from whether we will stay with Cloudflare now that we have seen how bad it can get. One could also say that this level of outage hasn't happened in many years, and they are now freshly frightened by it happening again so expect things to get tightened up (probably using different questions than this blog post proposes).
As for what this blog post could have been: maybe a page out of how these ideas were actively used by the author at e.g. Tradera or Loop54.
This would be preferable, of course. Unfortunately both organisations were rather secretive about their technical and social deficiencies and I don't want to be the one to air them out like that.
We mustn't assume that Cloudflare isn't undertaking this process just because we're not an audience to it.
It's extremely easy, and correspondingly valueless, to ask all kinds of "hard questions" about a system 24h after it had a huge incident. The hard part is doing this appropriately for every part of the system before something happens, while maintaining the other equally rightful goals of the organizations (such as cost-efficiency, product experience, performance, etc.). There's little evidence that suggests Cloudflare isn't doing that, and their track record is definitely good for their scale.
Some never get out of this phase though.
Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them. Cloudflare, etc publish SLAs and compensation schedules in case those SLAs are missed. Businesses chose to accept those SLAs and take on that risk.
If Cloudflare/etc signed a contract promising a certain SLA (with penalties) and then chose to not pay out those penalties then there would be reasons to ask questions, but nothing suggests they’re not holding up their side of the deal - you will absolutely get compensated (in the form of a refund on your bill) in case of an outage.
The issue is that businesses accept this deal and then scream when it goes wrong, yet are unwilling to pay for a solution that does not fail in this way. Those solutions exist - you absolutely can build systems that are reliable and/or fail in a predictable and testable manner; it’s simply more expensive and requires more skill than just slapping a few SaaSes and CNCF projects together. But it is possible - look at the uptime of card networks, stock exchanges, or airplane avionics. The truth is that businesses don’t want to pay for it (and neither do their end-customers - they will bitch about outages, but will immediately run the other way if you ask them to pony up for a more reliable system - and the ones that don’t, already run such a system and were unaffected by the recent outages).
> Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them
Could you not say this about any supplier relationship? No: in this case, we all know the root of the outage is Cloudflare, so it absolutely makes sense to blame Cloudflare, and not their customers.
Absolutely, yes. Where's your backup plan for when Visa doesn't behave as you expect? It's okay to not have one, but it's also your fault for not having one, and that is the sole reason that the lemonade stand went down.
I don’t have (nor have to have) such a plan; I offer X service with Y guarantees, paying out Z dollars if I don’t hold up my part of the bargain. In this hypothetical situation, if Visa signs up I’d assume they wanted to host their marketing website or some low-hanging fruit; it’s not my job to check what they’re using it for (in fact it would be preferable for me not to check, as I’d be seeing unencrypted card numbers and PII otherwise).
That aside, I think the example is good. It's a bit like priority inversion in scheduling. With no agreement from the lemonade seller they've suddenly changed greatly in terms of their criticality to some value creation chain.
Same with Cloudflare. If you run your site on Cloudflare, you are responsible for any downtime caused to your site by Cloudflare.
What we can blame Cloudflare for is having so many customers that a Cloudflare outage has outsized impact compared to the more uncorrelated outages we would have if sites were distributed among many smaller providers. But that's not quite the same as blaming any individual site's downtime on Cloudflare.
Not always. If the farm sells packs of poisoned bacon to the supermarket, we blame the farm.
It's more about if the website/supermarket can reasonably do the QA.
In fact, I'd say... airplane avionics are not what you should be looking at. Boeing's 787? Reboot every 51 days or risk the pilots getting wrong airspeed indicators! No, I'm not joking [1], and it's not the first time either [2], and it's not just Boeing [3].
[1] https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
[2] https://www.theregister.com/2015/05/01/787_software_bug_can_...
[3] https://www.theregister.com/2019/07/25/a350_power_cycle_soft...
If this is documented then fair enough - airlines don’t have to buy airplanes that need rebooting every 51 days, they can vote with their wallets and Boeing is welcome to fix it. If not documented, I hope regulators enforced penalties high enough to force Boeing to get their stuff together.
Either way, the uptime of avionics (and redundancies - including the unreliable airspeed checklists) are much higher than anything conventional software “engineering” has been putting out the past decade.
If some dingus widget provider has a downtime, no one dies. If an airplane loses power mid flight due to a software error... that's not guaranteed any more.
This puts it in the same maintenance bucket as parts that need regular replacement.
To use the “part needing regular replacement” analogy - the bug could be fixed for an extra $Y. But everything in engineering is a trade off. Choosing not to fix the bug might bring costs down to something competitive with the market.
Plus, like any part that could be made more reliable for extra cost, there might be no benefit due to service schedules. Say you could design a part that is 60% worn at service time vs a cheaper part that’s 90% worn - if they will both need replacing at the same time anyway, you’d go for the lower cost design.
Eg. if some service task requires resetting the avionics every 40 days, that’s less than 51 days so there may as well be no bug.
I don't love piling on, but it still shocks me that people write without first reading.
With that said, I would also like to know how it took them ~2 hours to see the error. That's a long, long time.
So during ingress there’s not an async call to the bot management service which intercepts the request before it’s outbound to origin - it’s literally a Lua script (or rust module in fl2) that runs on ingress inline as part of handling the request. Thus there’s no timeout or other concerns with the management service failing to assign a bot score.
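To make that distinction concrete, here is a toy sketch (invented types and names, nothing to do with fl2's actual API) of bot scoring as an inline step of the request path: a module failure fails the request itself, rather than a side call timing out and letting the request through unscored.

    // Illustrative only: bot scoring as an inline step of request handling,
    // not an async side call with a timeout. All names here are made up.
    struct Request {
        path: String,
        user_agent: String,
    }

    struct Response {
        status: u16,
        body: String,
    }

    // The module runs as part of the request path itself. If it errors out,
    // the request errors out; there is no timeout that lets the request
    // continue without a score.
    fn bot_score(req: &Request) -> Result<u8, String> {
        if req.user_agent.is_empty() {
            return Err("feature data unavailable".to_string());
        }
        Ok(if req.user_agent.contains("curl") { 5 } else { 80 })
    }

    fn handle(req: &Request) -> Response {
        match bot_score(req) {
            Ok(score) if score < 30 => Response { status: 403, body: "blocked as likely bot".into() },
            Ok(_) => Response { status: 200, body: format!("proxied {}", req.path) },
            // An inline failure surfaces as a 5xx on the request itself.
            Err(e) => Response { status: 500, body: format!("bot module error: {e}") },
        }
    }

    fn main() {
        let req = Request { path: "/".into(), user_agent: "Mozilla/5.0".into() };
        let resp = handle(&req);
        println!("{} {}", resp.status, resp.body);
    }

The trade-off is visible in the Err arm: inline means there is no partial-degradation path unless one is written explicitly.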
There are better questions but to me the ones posed don’t seem particularly interesting.
For example, "more global kill switches for features" is good, but would "only" have shaved 30 % off the time of recovery (if reading the timeline charitably). Being able to identify the broken component faster would have shaved 30–70 % off the time of recovery depending on how fast identification could happen – even with no improvements to the kill switch situation.