If you can build a system with redundancy to continue working even if Cloudflare is unavailable then you should, but most years that's going to be a waste of time.
I think you'd be better off spending the time building good relationships with your customers and users so that in the event of an outage that's beyond your control they trust your business and continue to be happy customers when you're back up and running.
In general I think people are overreacting to the Cloudflare outage, and most of these types of articles aren't really thought all the way through.
Also the conclusion on Jurassic Park is wrong. Hammond "spared no expense" yet Nedry was a single point of failure? Seems like they spared at least some expense in the IT department
Even if they did "spare no expense" they could have wound up in the same situation. I see this a lot: "it would be better if only we spent more money," but the only thing causally related to increasing expense is increased withdrawals from the bank account. Spending more money doesn't guarantee a better outcome; see US public schools, for example.
edit: coming back to this. Was the Cloudflare outage really caused by reading a file that was over 200 lines when the process can only handle a max of 200? That's a good example: I'm sure Cloudflare spared no expense in that part of their infrastructure, yet here they are (or were).
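If the description above is right, the failure shape is roughly the following. This is a minimal hypothetical sketch, not Cloudflare's actual code: a hard-coded limit the input "should never" exceed, enforced by crashing the process instead of degrading gracefully.

```rust
// Hypothetical sketch of that failure shape (not Cloudflare's real source).
const MAX_FEATURES: usize = 200; // assumed ceiling, sized for "normal" files

fn load_features(lines: Vec<String>) -> Vec<String> {
    // The moment the upstream file grows past the limit, every process that
    // loads it goes down with it.
    assert!(
        lines.len() <= MAX_FEATURES,
        "feature file has {} entries, limit is {}",
        lines.len(),
        MAX_FEATURES
    );
    lines
}

fn main() {
    // 201 entries: one past the limit is enough, no matter how much was spent
    // on the rest of the stack.
    let lines: Vec<String> = (0..=MAX_FEATURES).map(|i| format!("feature_{i}")).collect();
    let _ = load_features(lines); // panics here
}
```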
Almost everyone developing software spares some expense. It's maybe the main argument you can make for why it's engineering vs not. It's a cost-benefit tradeoff.
Cloudflare isn't doing e.g. super expensive formally verified software up and down its whole stack, practically nobody does that.
Nedry is a contractor and presumably reasonably paid, but the job would be awful regardless of how much money he'd make. He's alone, the rest of the on-site employees regard his technical wizardry with deep suspicion and his work habits with disgust. His boss, Hammond, can't stop himself from interfering. Dodgson gives him a way out, in the form of a large cash payment at an airport.
In the book Hammond is a truly ruthless businessman and it makes quite decent satire of the character. When he says things like 'no expense spared' or 'I like kids' it's more like it's coming from an Elon Musk on a stim bender than the warm and aloof Hammond of the movie. In the book, when Hammond comes under pressure he reacts with rage, like when they've realised that Nedry is gone and that Arnold will have to go through the source code himself. At that point Hammond is screaming expletives at his employees, who calmly respond that he instead should go to the cafeteria and get a coffee.
It should also be added that, according to the book, the reason the park fails is not that it has a single point of failure, but that it is a complex system and inherently uncontrollable. To some extent this shows in the Malcolm character in the movie as well, but they do very little with it beyond having him deliver a few one-liners and the chaos talk with the water drops early on.
But I don't choose Cloudflare either, because it's too complicated and I don't need that. So I choose the simplest possible thing with as little complexity as possible (for me, that was BunnyCDN). If it goes down, it's usually obvious why. And I didn't rely on anything special about it, so I can move away painlessly.
That'd be very inefficient usage of compute. Memory access now has network latency, cache locality doesn't exist, processes don't work. You're basically subverting how computers fundamentally work today. There's no benefit.
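To put rough numbers on that (the usual ballpark latency figures; approximate and very hardware-dependent):

```rust
// Order-of-magnitude latencies showing what treating a network hop like a
// memory access costs you. Figures are rough approximations, not measurements.
fn main() {
    let main_memory_ns = 100.0;             // ~100 ns for a main-memory access
    let same_datacenter_rtt_ns = 500_000.0; // ~0.5 ms round trip within a datacenter

    let slowdown = (same_datacenter_rtt_ns / main_memory_ns) as u64;
    println!("a network round trip is roughly {slowdown}x a local memory access");
}
```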
I know Kubernetes and containers have everyone thinking servers don't matter, but we should have less virtualization, not more. Redundancy and virtualization are not the same thing.
And then when your startup goes down we lose 3/3rds of our operating capacity!
---
There are certain kinds of errors and failures that aren't worth protecting against, because the costs (and consequences) of protecting against them outweigh those of just accepting that things fail from time to time.
It's easy to forget that services used to go down all the time in the 1990s and early 2000s. In this case, we still have super-impressive resiliency with modern cloud hosting.
IMO: The best way to improve the situation is for the cloud hosts to take their lessons learned and improve themselves, and for us (their customers) to vote with our feet if/when a cloud provider has problems.
If your shit breaks and everyone else's shit is still working that's a problem.
yeah sure, if your business is one of the 500 startups on HN creating inane shit like a notes app or a calendar, but outages can affect genuine companies that people rely on
It may even be a rational decision to take the downtime if the cost of avoiding it exceeds the expected cost of an eventual downtime, but that's a business decision that requires some serious thought.
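A back-of-envelope version of that decision, with every number invented purely for illustration:

```rust
// Toy expected-cost comparison; all figures below are assumptions, not data
// from the thread.
fn main() {
    let redundancy_cost_per_year = 120_000.0; // extra infra + engineering time
    let outages_per_year = 0.5;               // expected serious provider outages
    let outage_duration_hours = 5.0;          // e.g. a headline-grabbing incident
    let loss_per_hour = 10_000.0;             // revenue + goodwill, business-specific

    let expected_outage_cost = outages_per_year * outage_duration_hours * loss_per_hour;
    println!("expected yearly outage cost: ${expected_outage_cost}");
    println!("cost of avoiding it:         ${redundancy_cost_per_year}");
    // With these made-up numbers ($25,000 vs $120,000), eating the occasional
    // outage is the rational call; different inputs flip the answer.
}
```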
that's at the root of all infrastructure decisions, not just web app tech stacks but even something like utility service. I think it gets lost on a lot of technology people because we love to work on big technical things. No one wants a boring answer like a couple webservers and postgres with a backup in a different datacenter when there's a wall of knobs and switches to play with at the hyperscalers.
If I'm building something that allows my customers to do X, then yes I will own the software that allows my customers to do X. Makes sense.
> They’ll craft artisanal monitoring solutions while their actual business logic—the thing customers pay for—runs on someone else’s computer.
So instead I should build an artisanal hosting solution on my own hardware that I purchase and maintain? I could drop proxmox on them and go from there, or K8s, or even just bare metal and systemd scripts.
But my business isn't about any of those things, it's about X. How does owning and running my own hardware get me closer to delivering on X?
In a modern tech business that's everything from the frontend to the database though, including all the bits to keep that running at scale. That's too much for most companies to handle when they're starting and scaling. You'll need to compromise on that value early on, and you'll probably persuade yourself that it's tech debt you'll pay off later. But you won't, because you can't, and that will lead you to dislike the system you built.
It's much simpler and more motivating to accept that any modern tech business has to rely on third parties, and the fact that you pay them money means they probably won't screw it up. It has to be an accepted risk or you'll be paralysed by having too much to do.
There would very typically be a large overlap here.
Probably very few companies should build and run their own CDN and internet-scale firewall, for example. It doesn't have to be Cloudflare, but there aren't any providers that will have zero outages (a homegrown one is likely to be orders of magnitude worse and more expensive).
I fear this is easy to misconstrue.
For example, I was at a company that, as I learned how everything worked, realized that we were spending $20k / month for cloud services to basically process about as much real-time data as a CD player processes.
I joked that we should be able to run our entire product on a single server running in the office. (Then I pointed out that this was a joke and that running in the cloud gave us amazing redundancy that we didn't have to implement ourselves.) My point was to show that our architecture was massively bloated and overengineered for what we were doing. (IE, the cost of serialization to send messages was more than the actual processing that was happening. The cost was both money, and the fact that we were spending more time working on messaging than the actual product.)
BUT: there are many times when we could easily say, "this would be so much easier if we had our own server in the office." And, if we misconstrue the above quote, we could convince ourselves to run our own server in the office.
Very few times should you manage the actual hardware yourself.
But often a cloud is overly complex for what you need. 10 years ago we left MS Azure and started leasing dedicated hardware in OVH. Our costs were cut by 90%, our performance tripled, and our reliability improved. We did have to take on some effort to make our systems portable with ansible and containers, but we greatly simplified our vendor stack.
I am never confused about why something goes down, and I have confidence that I could stand it up with another vendor without rewriting anything.
If I can't own it, it should be as simple and commoditized as possible. Most clouds are not that.
If I use Cloudflare, it will also go down, but probably for less time, and someone else has to be up at 2am fixing it.
> Build what delivers your value.
Like Hershey builds grocery stores?
Like Budweiser builds bars?
This can’t be serious.
We live in a society.
We do this by owning everything we can, and using simple vendors for what we can't.
A 5 hour outage when headlines say "the internet is broken" may well be preferable to a 5 minute outage related to your far simpler and far more resilient setup