If you can build a system with redundancy to continue working even if Cloudflare is unavailable then you should, but most years that's going to be a waste of time.
I think you'd be better off spending the time building good relationships with your customers and users so that in the event of an outage that's beyond your control they trust your business and continue to be happy customers when you're back up and running.
In general I think people are overreacting to the Cloudflare outage, and most of these types of articles aren't really thought all the way through.
Also the conclusion on Jurassic Park is wrong. Hammond "spared no expense" yet Nedry was a single point of failure? Seems like they spared at least some expense in the IT department
Even if they did "spare no expense" they could have wound up in the same situation. I see this a lot: "it would be better if only we spent more money," but the only thing causally related to increasing expense is increased withdrawals from the bank account. Spending more money doesn't guarantee a better outcome; see US public schools, for example.
edit: coming back to this. Was the Cloudflare outage really caused by reading a file that was over 200 lines when the process can only handle a max of 200? That's a good example: I'm sure Cloudflare spared no expense in that part of their infrastructure, yet here they are (or were).
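If the description above is right, the failure shape is roughly the following. This is a minimal hypothetical sketch, not Cloudflare's actual code: a hard-coded limit the input "should never" exceed, enforced by crashing the process instead of degrading gracefully.

```rust
// Hypothetical sketch of that failure shape (not Cloudflare's real source).
const MAX_FEATURES: usize = 200; // assumed ceiling, sized for "normal" files

fn load_features(lines: Vec<String>) -> Vec<String> {
    // The moment the upstream file grows past the limit, every process that
    // loads it goes down with it.
    assert!(
        lines.len() <= MAX_FEATURES,
        "feature file has {} entries, limit is {}",
        lines.len(),
        MAX_FEATURES
    );
    lines
}

fn main() {
    // 201 entries: one past the limit is enough, no matter how much was spent
    // on the rest of the stack.
    let lines: Vec<String> = (0..=MAX_FEATURES).map(|i| format!("feature_{i}")).collect();
    let _ = load_features(lines); // panics here
}
```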
Almost everyone developing software spares some expense. It's maybe the main argument you can make for why it's engineering vs not. It's a cost-benefit tradeoff.
Cloudflare isn't doing e.g. super expensive formally verified software up and down its whole stack, practically nobody does that.
Nedry is a contractor and presumably reasonably paid, but the job would be awful regardless of how much money he'd make. He's alone, the rest of the on-site employees regard his technical wizardry with deep suspicion and his work habits with disgust. His boss, Hammond, can't stop himself from interfering. Dodgson gives him a way out, in the form of a large cash payment at an airport.
In the book Hammond is a truly ruthless businessman and it makes quite decent satire of the character. When he says things like 'no expense spared' or 'I like kids' it's more like it's coming from an Elon Musk on a stim bender than the warm and aloof Hammond of the movie. In the book, when Hammond comes under pressure he reacts with rage, like when they've realised that Nedry is gone and that Arnold will have to go through the source code himself. At that point Hammond is screaming expletives at his employees, who calmly respond that he instead should go to the cafeteria and get a coffee.
It should also be added that, according to the book, the reason the park fails is not that it has a single point of failure, but that it is a complex system and inherently uncontrollable. To some extent this shows in the Malcolm character in the movie as well, but they do very little with it beyond having him deliver a few one-liners and the chaos talk with the water drops early on.
But I don't choose Cloudflare either, because it's too complicated and I don't need that. So I choose the simplest possible thing with as little complexity as possible (for me, that was BunnyCDN). If it goes down, it's usually obvious why. And I didn't rely on anything special about it, so I can move away painlessly.
That'd be very inefficient usage of compute. Memory access now has network latency, cache locality doesn't exist, processes don't work. You're basically subverting how computers fundamentally work today. There's no benefit.
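To put rough numbers on that (the usual ballpark latency figures; approximate and very hardware-dependent):

```rust
// Order-of-magnitude latencies showing what treating a network hop like a
// memory access costs you. Figures are rough approximations, not measurements.
fn main() {
    let main_memory_ns = 100.0;             // ~100 ns for a main-memory access
    let same_datacenter_rtt_ns = 500_000.0; // ~0.5 ms round trip within a datacenter

    let slowdown = (same_datacenter_rtt_ns / main_memory_ns) as u64;
    println!("a network round trip is roughly {slowdown}x a local memory access");
}
```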
I know Kubernetes and containers have everyone thinking servers don't matter, but we should have less virtualization, not more. Redundancy and virtualization are not the same thing.
And then when your startup goes down we lose 3/3rds of our operating capacity!
---
There are certain kinds of errors and failures that aren't worth protecting against, because the costs (and consequences) of protecting against them outweigh those of just accepting that things fail from time to time.
It's easy to forget that services used to go down all the time in the 1990s and early 2000s. In this case, we still have super-impressive resiliency with modern cloud hosting.
IMO: The best way to improve the situation is for the cloud hosts to take their lessons learned and improve themselves, and for us (their customers) to vote with our feet if/when a cloud provider has problems.
If your shit breaks and everyone else's shit is still working that's a problem.
yeah sure, if your business is one of the 500 startups on HN creating inane shit like a notes app or a calendar, but outages can affect genuine companies that people rely on
It may even be a rational decision to take the downtime if the cost of avoiding it exceeds the expected cost of an eventual downtime, but that's a business decision that requires some serious thought.
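A back-of-envelope version of that decision, with every number invented purely for illustration:

```rust
// Toy expected-cost comparison; all figures below are assumptions, not data
// from the thread.
fn main() {
    let redundancy_cost_per_year = 120_000.0; // extra infra + engineering time
    let outages_per_year = 0.5;               // expected serious provider outages
    let outage_duration_hours = 5.0;          // e.g. a headline-grabbing incident
    let loss_per_hour = 10_000.0;             // revenue + goodwill, business-specific

    let expected_outage_cost = outages_per_year * outage_duration_hours * loss_per_hour;
    println!("expected yearly outage cost: ${expected_outage_cost}");
    println!("cost of avoiding it:         ${redundancy_cost_per_year}");
    // With these made-up numbers ($25,000 vs $120,000), eating the occasional
    // outage is the rational call; different inputs flip the answer.
}
```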
that's at the root of all infrastructure decisions, not just web app tech stacks but even something like utility service. I think it gets lost on a lot of technology people because we love to work on big technical things. No one wants a boring answer like a couple webservers and postgres with a backup in a different datacenter when there's a wall of knobs and switches to play with at the hyperscalers.
If I'm building something that allows my customers to do X, then yes I will own the software that allows my customers to do X. Makes sense.
> They’ll craft artisanal monitoring solutions while their actual business logic—the thing customers pay for—runs on someone else’s computer.
So instead I should build an artisanal hosting solution on my own hardware that I purchase and maintain? I could drop proxmox on them and go from there, or K8s, or even just bare metal and systemd scripts.
But my business isn't about any of those things, it's about X. How does owning and running my own hardware get me closer to delivering on X?
In a modern tech business that's everything from the frontend to the database though, including all the bits to keep that running at scale. That's too much for most companies to handle when they're starting and scaling. You'll need to compromise on that value early on, and you'll probably persuade yourself that it's tech debt you'll pay off later. But you won't, because you can't, and that will lead you to dislike the system you built.
It's much simpler and more motivating to accept that any modern tech business has to rely on third parties, and the fact that you pay them money means they probably won't screw it up. It has to be an accepted risk or you'll be paralysed by having too much to do.
There would very typically be a large overlap here.
Probably very few companies should build and run their own CDN and internet-scale firewall, for example. It doesn't have to be Cloudflare, but there aren't any providers that will have zero outages (a homegrown one is likely to be orders of magnitude worse and more expensive).
I fear this is easy to misconstrue.
For example, I was at a company that, as I learned how everything worked, realized that we were spending $20k / month for cloud services to basically process about as much real-time data as a CD player processes.
I joked that we should be able to run our entire product on a single server running in the office. (Then I pointed out that this was a joke and that running in the cloud gave us amazing redundancy that we didn't have to implement ourselves.) My point was to show that our architecture was massively bloated and overengineered for what we were doing. (IE, the cost of serialization to send messages was more than the actual processing that was happening. The cost was both money, and the fact that we were spending more time working on messaging than the actual product.)
BUT: there are many times when we could easily say, "this would be so much easier if we had our own server in the office." And, if we misconstrue the above quote, we could convince ourselves to run our own server in the office.
Very few times should you manage the actual hardware yourself.
But often a cloud is overly complex for what you need. 10 years ago we left MS Azure and started leasing dedicated hardware in OVH. Our costs were cut by 90%, our performance tripled, and our reliability improved. We did have to take on some effort to make our systems portable with ansible and containers, but we greatly simplified our vendor stack.
I am never confused about why something goes down, and I have confidence that I could stand it up with another vendor without rewriting anything.
If I can't own it, it should be as simple and commoditized as possible. Most clouds are not that.
If I use Cloudflare, it will also go down, but probably for less time, and someone else has to be up at 2am fixing it.
> Build what delivers your value.
Like Hershey builds grocery stores?
Like Budweiser builds bars?
This can’t be serious.
We live in a society.
We do this by owning everything we can, and using simple vendors for what we can't.
A 5 hour outage when headlines say "the internet is broken" may well be preferable to a 5 minute outage related to your far simpler and far more resilient setup