Cloudflare is probably one of the best "voices" in the industry when it comes to post-mortems and root cause analysis.
Reading Cloudflare's description of the problem, this is something that I could easily see my own company missing. A file got too big, which tanked performance enough to bring everything down. That's a VERY hard thing to test for, especially since this appears to have been a configuration file and a regular update.
The reason it's so hard to test for is that all tests would show there's no problem. This wasn't a code update, it was a config update. Without really extensive performance tests (which, when done well, take a long time!) there wasn't really a way to know that a change that appeared safe wasn't.
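Not Cloudflare's process, just an illustration of the kind of cheap guard that can catch this class of failure without a full performance-test suite: a pre-deploy sanity check that a "routine" generated config still looks like previous ones. All names, paths, and limits below are made up.

    use std::fs;

    // Reject a generated config artifact that looks wildly different from what
    // previous deploys shipped (size, entry count) before pushing it to the edge.
    // The limits here are placeholders, not real Cloudflare values.
    fn validate_config(path: &str, max_bytes: u64, max_entries: usize) -> Result<(), String> {
        let meta = fs::metadata(path).map_err(|e| e.to_string())?;
        if meta.len() > max_bytes {
            return Err(format!("config is {} bytes, limit is {}", meta.len(), max_bytes));
        }

        let contents = fs::read_to_string(path).map_err(|e| e.to_string())?;
        let entries = contents.lines().count();
        if entries > max_entries {
            return Err(format!("config has {} entries, limit is {}", entries, max_entries));
        }
        Ok(())
    }

    fn main() {
        // Hypothetical usage in a deploy pipeline: fail the rollout instead of pushing.
        if let Err(e) = validate_config("bot_features.conf", 5 * 1024 * 1024, 200) {
            eprintln!("refusing to deploy: {e}");
            std::process::exit(1);
        }
    }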
I personally give Cloudflare a huge pass for this. I don't think this happened due to any sloppiness on their part.
Now, if you want to see a sloppy outage, look at the CrowdStrike outage from a few years back that bricked basically everything. That is what sheer incompetence looks like.
If the analysis has not uncovered the feedback problems (whether a large effort went into it or not), my argument is that a better method is needed.
Anubis is a bot firewall not a CDN.
I don't remember telling anyone to trust the reviews?
I think it is healthy to try alternatives to Cloudflare and then come to your own decision.
Yet it is an available alternative to Cloudflare that is not on Wall Street (i.e., not a public company).
If you want to do this 100% yourself there is Apache Traffic Control.
https://github.com/apache/trafficcontrol
> Anubis is a bot firewall not a CDN.
For now. If we support alternatives they can grow into an open source CDN.
You realize that to run a CDN you have to buy massive amounts of bandwidth and computers? DIY here betrays a misunderstanding of what it takes to be DoS resistant, and also of what it takes for a CDN to actually deliver a performance benefit.
Customers on the enterprise plan can either use Anubis's Managed CDN or host Anubis themselves via an enterprise license!
They can receive tech support directly from the creator of Anubis (as long as they pay for the enterprise plan).
I don't see a problem with this and it can turn Anubis from "a piece of software" into a CDN.
I think the ultimate judgement must come from whether we will stay with Cloudflare now that we have seen how bad it can get. One could also say that this level of outage hasn't happened in many years, and they are now freshly frightened of it happening again, so expect things to get tightened up (probably using different questions than this blog post proposes).
As for what this blog post could have been: maybe an account of how these ideas were actively used by the author at e.g. Tradera or Loop54.
This would be preferable, of course. Unfortunately both organisations were rather secretive about their technical and social deficiencies and I don't want to be the one to air them out like that.
We mustn't assume that Cloudflare isn't undertaking this process just because we're not an audience to it.
It's extremely easy, and correspondingly valueless, to ask all kinds of "hard questions" about a system 24h after it had a huge incident. The hard part is doing this appropriately for every part of the system before something happens, while maintaining the organization's other equally legitimate goals (such as cost-efficiency, product experience, performance, etc.). There's little evidence to suggest Cloudflare isn't doing that, and their track record is definitely good for their scale.
Some never get out of this phase though.
Ultimately end-users don't have a relationship with any of those companies. They have relationships with businesses that chose to rely on them. Cloudflare et al. publish SLAs and compensation schedules in case those SLAs are missed. Businesses chose to accept those SLAs and take on that risk.
If Cloudflare et al. signed a contract promising a certain SLA (with penalties) and then chose not to pay out those penalties, there would be reason to ask questions, but nothing suggests they're not holding up their side of the deal - you will absolutely get compensated (in the form of a refund on your bill) in case of an outage.
The issue is that businesses accept this deal and then scream when it goes wrong, yet are unwilling to pay for a solution that does not fail in this way. Those solutions exist - you absolutely can build systems that are reliable and/or fail in a predictable and testable manner; it's simply more expensive and requires more skill than just slapping a few SaaSes and CNCF projects together. But it is possible - look at the uptime of card networks, stock exchanges, or airplane avionics. It's just more expensive, and the truth is that businesses don't want to pay for it (and neither do their end-customers - they will bitch about outages, but will immediately run the other way if you ask them to pony up for a more reliable system; the ones that don't run away already run such a system and were unaffected by the recent outages).
> Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them
Could you not say this about any supplier relationship? No, in this case we all know the root of the outage is Cloudflare, so it absolutely makes sense to blame Cloudflare, and not their customers.
Absolutely, yes. Where's your backup plan for when Visa doesn't behave as you expect? It's okay to not have one, but it's also your fault for not having one, and that is the sole reason that the lemonade stand went down.
I don't have (nor have to have) such a plan. I offer X service with Y guarantees, paying out Z dollars if I don't hold up my part of the bargain. In this hypothetical situation, if Visa signed up I'd assume they wanted to host their marketing website or some similarly low-hanging fruit; it's not my job to check what they're using it for (in fact it would be preferable for me not to check, as I'd be seeing unencrypted card numbers and PII otherwise).
Same with Cloudflare. If you run your site on Cloudflare, you are responsible for any downtime caused to your site by Cloudflare.
What we can blame Cloudflare for is having so many customers that a Cloudflare outage has an outsized impact compared to the more uncorrelated outages we would have if sites were distributed among many smaller providers. But that's not quite the same as blaming any individual site's downtime on Cloudflare.
Not always. If the farm sells packs of poisoned bacon to the supermarket, we blame the farm.
It's more about whether the website/supermarket can reasonably do the QA.
In fact, I'd say... airplane avionics are not what you should be looking at. Boeing's 787? Reboot every 51 days or risk the pilots getting wrong airspeed indicators! No, I'm not joking [1], and it's not the first time either [2], and it's not just Boeing [3].
[1] https://www.theregister.com/2020/04/02/boeing_787_power_cycl...
[2] https://www.theregister.com/2015/05/01/787_software_bug_can_...
[3] https://www.theregister.com/2019/07/25/a350_power_cycle_soft...
If this is documented then fair enough - airlines don't have to buy airplanes that need rebooting every 51 days; they can vote with their wallets and Boeing is welcome to fix it. If not documented, I hope regulators enforced penalties high enough to force Boeing to get their stuff together.
Either way, the uptime of avionics (and its redundancies - including the unreliable-airspeed checklists) is much higher than anything conventional software "engineering" has been putting out the past decade.
I don't love piling on, but it still shocks me that people write without first reading.
With that said, I would also like to know how it took them ~2 hours to see the error. That's a long, long time.
So during ingress there isn't an async call to the bot management service that intercepts the request before it's sent outbound to the origin - it's literally a Lua script (or a Rust module in fl2) that runs inline on ingress as part of handling the request. Thus there's no timeout or other concern about the management service failing to assign a bot score.
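A minimal sketch of what "runs inline" implies, with invented types rather than Cloudflare's actual fl2 code: the scorer is an ordinary function call in the request path, so there is no network hop, no timeout, and no degraded "skip scoring" branch to fall back to when it errors.

    // Invented types for illustration; not Cloudflare's actual proxy code.
    struct Request;
    struct BotConfig { feature_count: usize, max_features: usize }

    // In-process scoring against a preloaded feature file. If the file is
    // malformed or oversized, this fails - and because it's called inline,
    // the whole request fails with it.
    fn score_bot(_req: &Request, cfg: &BotConfig) -> Result<u8, String> {
        if cfg.feature_count > cfg.max_features {
            return Err("feature file exceeds configured limit".into());
        }
        Ok(30) // placeholder score
    }

    // The ingress handler calls the scorer synchronously: no call to a separate
    // service, hence no timeout to trip and no "score unavailable" path.
    fn handle_ingress(req: Request, cfg: &BotConfig) -> Result<(), String> {
        let _score = score_bot(&req, cfg)?;
        // ...attach score and forward to origin...
        Ok(())
    }

    fn main() {
        let cfg = BotConfig { feature_count: 250, max_features: 200 };
        // With an oversized feature file, every request through this path errors out.
        println!("{:?}", handle_ingress(Request, &cfg));
    }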
There are better questions but to me the ones posed don’t seem particularly interesting.