jiggawatts•37m ago
This is a low-value article that reads like it is AI-generated even if it's not.
Almost every instance of downtime I’ve experienced in the cloud was due to a
global outage of some sort that no amount of regional redundancy could fix.
Regional redundancy is typically twice as expensive at small scales and decidedly non-trivial to implement because… where do you put your data? At most one region can have low-latency access, all others have to deal with either eventual consistency OR very high latencies! What happens during a network partition? That’s… “fun” to deal with!
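To put rough numbers on the latency point, here's a back-of-the-envelope sketch (the RTT figures are illustrative assumptions, not vendor numbers):

    # Write latency with synchronous vs asynchronous cross-region replication.
    # All numbers below are illustrative assumptions.
    LOCAL_COMMIT_MS = 2        # single-region, zone-redundant commit
    CROSS_REGION_RTT_MS = 140  # e.g. a transatlantic round trip

    # Synchronous replication: every write waits on the far region.
    sync_write_ms = LOCAL_COMMIT_MS + CROSS_REGION_RTT_MS

    # Asynchronous replication: writes stay fast, but the replica lags,
    # so a failover can lose the most recent writes (eventual consistency).
    async_write_ms = LOCAL_COMMIT_MS
    replica_lag_ms = CROSS_REGION_RTT_MS  # at best; usually worse under load

    print(sync_write_ms, async_write_ms, replica_lag_ms)  # 142 2 140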
Most groups would benefit far more from simply having seamless DevOps deploys and fast rollback.
Neither is available by default in most cloud platforms; you have to build it from fiddly little pieces like off-brand LEGO.
Proprietary pieces with no local dev experience such as syntax validation or emulators.
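By "fast rollback" I mean the kind of thing you end up scripting yourself. A minimal sketch, assuming ECS and hypothetical cluster/service names (the same idea applies to any orchestrator):

    # Roll an ECS service back to its previous task-definition revision.
    # "prod" and "web" are hypothetical names; error handling omitted.
    import boto3

    ecs = boto3.client("ecs")
    svc = ecs.describe_services(cluster="prod", services=["web"])["services"][0]
    family = svc["taskDefinition"].rsplit("/", 1)[1].rsplit(":", 1)[0]

    # Revisions come back newest-first; [0] is current, [1] the previous one.
    revisions = ecs.list_task_definitions(familyPrefix=family, sort="DESC")
    previous = revisions["taskDefinitionArns"][1]

    ecs.update_service(cluster="prod", service="web", taskDefinition=previous)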
toast0•13m ago
Certainly, the big cloud outages tend to be global ones, and some regional outages cascade into global outages.
But it's pretty common for a major event to happen in a single region. Datacenter fires and/or flooding happen from time to time. Extreme weather can happen. Automatic transfer switches fail from time to time. Fiber cuts happen.
Not everyone needs regional redundancy, and it does add costs, but I don't think it should be dismissed easily.

If you're all in on cloudiness, you could have as little as an alternate-region replica of your data and your VM images, and be ready to bring things up manually in another region if you need to. Run some tests once or twice a year to confirm your plan works, and to estimate how long it takes to restore service in the event of a regional outage. A few minutes to put up an outage page and an hour or three to restore service is probably fine...

Automatic regional failover gets tricky with data consistency and split brain, as you mentioned; hopefully you don't need to do it often.
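Concretely, the "as little as" version might look like this (a rough sketch assuming AWS, with hypothetical names and IDs; Azure and GCP have equivalents):

    # Keep a standby region warm enough for manual failover:
    # copy the golden AMI and replicate the data bucket.
    # All identifiers below are hypothetical.
    import boto3

    PRIMARY, STANDBY = "us-east-1", "us-west-2"

    # 1. Copy the VM image to the standby region.
    ec2 = boto3.client("ec2", region_name=STANDBY)
    ec2.copy_image(Name="app-golden", SourceImageId="ami-0123456789abcdef0",
                   SourceRegion=PRIMARY)

    # 2. Continuously replicate object data to a bucket in the standby region
    #    (both buckets must have versioning enabled).
    s3 = boto3.client("s3")
    s3.put_bucket_replication(
        Bucket="app-data-primary",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/replication",
            "Rules": [{
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "Priority": 1,
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": "arn:aws:s3:::app-data-standby"},
            }],
        },
    )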
jiggawatts•9m ago
> But it's pretty common for a major event to happen in a single region.
It's actually pretty rare these days because all major clouds use zone-redundancy and hence their core services are robust to the loss of any single building. Even during the recent Iberian power outages the local cloud sites mostly (entirely?) stayed up.
The outages I've experienced over the last decade(!) were: Global certificate expiry (Azure), Crowdstrike (Windows everywhere), IAM services down globally (AWS), core inter-region router misconfiguration (customer-wide).
None would have been avoided by having more replicas in more places. All of our production systems are already zone-redundant, which is either the default or "just a checkbox" in most clouds.
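On AWS, for instance, the RDS flavour of that checkbox is literally one parameter (a sketch; the identifiers are hypothetical):

    # Zone-redundant managed Postgres: MultiAZ keeps a synchronous standby
    # in another availability zone. Identifiers are hypothetical.
    import boto3

    rds = boto3.client("rds")
    rds.create_db_instance(
        DBInstanceIdentifier="app-db",
        Engine="postgres",
        DBInstanceClass="db.m6g.large",
        AllocatedStorage=100,
        MasterUsername="app",
        ManageMasterUserPassword=True,
        MultiAZ=True,  # the "checkbox": synchronous standby in a second AZ
    )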
This article adds no value to the discussion: it states a problem that isn't that big a deal for most people, and then doesn't provide any useful solutions for the few for whom it is a big deal.
The problem is either easy to solve -- tick the checkbox for zone-redundancy -- or very difficult to solve -- make your app's data globally replicated -- and the article just says "you should do it" without further elaboration.
That's of no value to anyone.