jiggawatts•37m ago
This is a low-value article that reads like it is AI-generated even if it's not.
Almost every instance of downtime I’ve experienced in the cloud was due to a
global outage of some sort that no amount of regional redundancy could fix.
Regional redundancy is typically twice as expensive at small scales and decidedly non-trivial to implement because… where do you put your data? At most one region can have low-latency access, all others have to deal with either eventual consistency OR very high latencies! What happens during a network partition? That’s… “fun” to deal with!
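To put rough numbers on the latency point, here's a back-of-the-envelope sketch (the RTT figures are illustrative assumptions, not vendor numbers):

    # Write latency with synchronous vs asynchronous cross-region replication.
    # All numbers below are illustrative assumptions.
    LOCAL_COMMIT_MS = 2        # single-region, zone-redundant commit
    CROSS_REGION_RTT_MS = 140  # e.g. a transatlantic round trip

    # Synchronous replication: every write waits on the far region.
    sync_write_ms = LOCAL_COMMIT_MS + CROSS_REGION_RTT_MS

    # Asynchronous replication: writes stay fast, but the replica lags,
    # so a failover can lose the most recent writes (eventual consistency).
    async_write_ms = LOCAL_COMMIT_MS
    replica_lag_ms = CROSS_REGION_RTT_MS  # at best; usually worse under load

    print(sync_write_ms, async_write_ms, replica_lag_ms)  # 142 2 140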
Most groups would benefit far more from simply having seamless DevOps deploys and fast rollback.
Neither is available by default in most cloud platforms; you have to build it from fiddly little pieces like off-brand LEGO.
Proprietary pieces with no local dev experience such as syntax validation or emulators.
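By "fast rollback" I mean the kind of thing you end up scripting yourself. A minimal sketch, assuming ECS and hypothetical cluster/service names (the same idea applies to any orchestrator):

    # Roll an ECS service back to its previous task-definition revision.
    # "prod" and "web" are hypothetical names; error handling omitted.
    import boto3

    ecs = boto3.client("ecs")
    svc = ecs.describe_services(cluster="prod", services=["web"])["services"][0]
    family = svc["taskDefinition"].rsplit("/", 1)[1].rsplit(":", 1)[0]

    # Revisions come back newest-first; [0] is current, [1] the previous one.
    revisions = ecs.list_task_definitions(familyPrefix=family, sort="DESC")
    previous = revisions["taskDefinitionArns"][1]

    ecs.update_service(cluster="prod", service="web", taskDefinition=previous)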
toast0•13m ago
Certainly, the big cloud outages tend to be global ones, and some regional outages cascade into global outages.
But it's pretty common for a major event to happen in a single region. Datacenter fires and/or flooding happen from time to time. Extreme weather can happen. Automatic transfer switches fail from time to time. Fiber cuts happen.
Not everyone needs regional redundancy, and it does add costs, but I don't think it should be dismissed easily.

If you're all in on cloudiness, you could have as little as an alternate-region replica of your data and your VM images, and be ready to bring things up manually in another region if you need to. Run some tests once or twice a year to confirm your plan works, and to estimate how long it takes to restore service in the event of a regional outage. A few minutes to put up an outage page and an hour or three to restore service is probably fine...

Automatic regional failover gets tricky with data consistency and split brain, as you mentioned; hopefully you don't need to do it often.
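Concretely, the "as little as" version might look like this (a rough sketch assuming AWS, with hypothetical names and IDs; Azure and GCP have equivalents):

    # Keep a standby region warm enough for manual failover:
    # copy the golden AMI and replicate the data bucket.
    # All identifiers below are hypothetical.
    import boto3

    PRIMARY, STANDBY = "us-east-1", "us-west-2"

    # 1. Copy the VM image to the standby region.
    ec2 = boto3.client("ec2", region_name=STANDBY)
    ec2.copy_image(Name="app-golden", SourceImageId="ami-0123456789abcdef0",
                   SourceRegion=PRIMARY)

    # 2. Continuously replicate object data to a bucket in the standby region
    #    (both buckets must have versioning enabled).
    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket="app-data-primary",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/replication",
            "Rules": [{
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Priority": 1,
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::app-data-standby"},
            }],
        },
    )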
jiggawatts•9m ago
> But it's pretty common for a major event to happen in a single region.
It's actually pretty rare these days because all major clouds use zone-redundancy and hence their core services are robust to the loss of any single building. Even during the recent Iberian power outages the local cloud sites mostly (entirely?) stayed up.
The outages I've experienced over the last decade(!) were: Global certificate expiry (Azure), Crowdstrike (Windows everywhere), IAM services down globally (AWS), core inter-region router misconfiguration (customer-wide).
None would have been avoided by having more replicas in more places. All of our production systems are already zone-redundant, which is either the default or "just a checkbox" in most clouds.
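On AWS, for instance, the RDS flavour of that checkbox is literally one parameter (a sketch; the identifiers are hypothetical):

    # Zone-redundant managed Postgres: MultiAZ keeps a synchronous standby
    # in another availability zone. Identifiers are hypothetical.
    import boto3

    rds = boto3.client("rds")
    rds.create_db_instance(
        DBInstanceIdentifier="app-db",
        Engine="postgres",
        DBInstanceClass="db.m6g.large",
        AllocatedStorage=100,
        MasterUsername="app",
        ManageMasterUserPassword=True,
        MultiAZ=True,  # the "checkbox": synchronous standby in a second AZ
    )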
This article adds no value to the discussion: it states a problem that isn't that big a deal for most people, and then doesn't provide any useful solutions for the few for whom it is a big deal.
The problem is either easy to solve -- tick the checkbox for zone-redundancy -- or very difficult to solve -- make your app's data globally replicated -- and the article just says "you should do it" without further elaboration.
That's of no value to anyone.