With this kind of architecture, this sort of problem is just bound to happen.
During my time at AWS, region independence was a must, and some services could keep operating without degrading, at least for a while, even when core dependencies were unavailable. Think of losing S3.
After that window, the service would keep operating, but with a degraded experience.
I am stunned that this level of isolation is not common in GCP.
> Do they re-implement all the code in every region?
Everyone does.
The difference is that AWS very strongly ensures that regions are independent failure domains, while the GCP architecture is global, with all the pros and cons that implies. For example, GCP has a truly global load balancer, while AWS cannot, since everything at AWS is regional at its core.
So it's actually really hard to accidentally make cross-region calls, if you're working inside the AWS infrastructure. The call has to happen over the public Internet, and you need a special approval for that.
Deployments also happen gradually, typically only a few regions at a time. There's an internal tool that allows things to be gradually rolled out and automatically rolled back if monitoring detects that something is off.
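A rough sketch of what that kind of staged rollout with automatic rollback could look like; the wave groupings, health check, and function names below are made up for illustration, not AWS's actual tooling:

    import time

    # Hypothetical wave ordering: start with a canary region, expand only if earlier waves stay healthy.
    WAVES = [
        ["us-west-2"],
        ["eu-west-1", "ap-southeast-2"],
        ["us-east-1", "eu-central-1", "ap-northeast-1"],
    ]

    def deploy(region, version):
        print(f"deploying {version} to {region}")   # placeholder for the per-region deployment call

    def rollback(region):
        print(f"rolling back {region}")             # placeholder for the per-region rollback call

    def healthy(region):
        return True                                 # placeholder for the monitoring signal gating the rollout

    def rollout(version, bake_seconds=600):
        touched = []
        for wave in WAVES:
            for region in wave:
                deploy(region, version)
                touched.append(region)
            time.sleep(bake_seconds)                # let metrics accumulate before judging the wave
            if not all(healthy(r) for r in wave):
                for region in reversed(touched):    # automatic rollback of everything touched so far
                    rollback(region)
                raise RuntimeError(f"rollout of {version} aborted after wave {wave}")

    rollout("build-1234", bake_seconds=1)           # short bake time just for the sketch

The key property is that a bad build only ever reaches a few regions before monitoring stops it.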
Generally GCP wants regionality, but because it offers so many higher-level inter-region features, some kind of a global layer is basically inevitable.
Amazon calls this "static stability".
In this outage, my service (on GCP) had static stability, which was great. However, some other similar services failed and we got more load; we couldn't start additional instances to handle it because of the outage, so we ended up with overloaded servers and poor service quality.
Mayhaps we could have adjusted load across regions to manage instance load, but that's not something we normally do.
Even in the mini incident report they were going through extreme linguistic gymnastics trying to claim they are regional. Describing the service that caused the outage, which is responsible for global quota enforcement and is configured using a data store that replicates data globally in near real time, with apparently no option to delay replication, they said:
> Service Control is a regional service that has a regional datastore that it reads quota and policy information from. This datastore metadata gets replicated almost instantly globally to manage quota policies for Google Cloud and our customers.
Not only would AWS call this a global service, the whole concept of global quotas would not fly at AWS.
Of course, bugs in the orchestrator could cause outages, but ideally that piece is a pretty simple "loop over regions and call each regional API update method with the same arguments".
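Something along these lines, where the orchestrator holds no global state of its own and simply fans the same change out to independent regional control planes (the region list and the update_quota API are illustrative, not any real GCP or AWS interface):

    REGIONS = ["us-east-1", "us-west-2", "eu-west-1", "ap-southeast-2"]   # illustrative list

    def update_quota(region, customer_id, quota):
        # Placeholder for the regional API's update method; each region owns its own datastore.
        print(f"{region}: set quota for {customer_id} to {quota}")

    def apply_everywhere(customer_id, quota):
        # Deliberately dumb: same arguments, one region at a time, so a bad value
        # can be caught and stopped before it has reached every region.
        for region in REGIONS:
            try:
                update_quota(region, customer_id, quota)
            except Exception as exc:
                print(f"stopping rollout: {region} rejected the update: {exc}")
                break

    apply_everywhere("cust-42", {"api_calls_per_minute": 1000})

The point is that a bad quota value then degrades one region at a time instead of everywhere at once.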
One could hope that they'd realize that whatever red tape they've been putting up so far hasn't helped, and so more of it probably won't either.
If what you're doing isn't having an effect you need to do something different, not just more.
They are right, of course, but the way things are, the obvious needs to be said sometimes.
Acknowledging that one still has risks and that luck plays a factor is important.
- Their alerts were not durable. The outage took out the alert system, so humans were just eyeballing dashboards during the outage.
- The cloud marketplace service was affected by the Cloudflare outage and there was nothing they could do.
- Tiered storage was down and disk usage went above normal levels, but there was no anomaly detection and no alerts. It survived because t0 storage was massively overprovisioned.
- They took pride in using well-known industry designs like cell-based architecture, redundancy, multi-AZ... ChatGPT would be able to give me a better list.
And I don't get why they had to roast Crowdstrike at the end. I mean, the Crowdstrike incident was really amateur stuff, like, the absolute lowest bar I can think of.
RadiozRadioz•3h ago
When I read "major outage for a large part of the internet was just another normal day for Redpanda Cloud customers", I expected a brave tale of RedPanda SREs valiantly fixing things, or some cool automatic failover tech. What I got instead was: Google told RedPanda there was an issue, RedPanda had a look and their service was unaffected, nothing needed failing over, then someone at RedPanda wrote an article bragging about their triple-nine uptime & fault tolerance.
I get it, an SRE is doing well if you don't notice them, but the only real preventative measure I saw here that directly helped with this issue is that they overprovision disk space. Which I'd be alarmed if they didn't do.
echelon•9m ago
Haha, we used to joke that's how many nines our customer-facing Ruby on Rails services had compared against our resilient five nines payments systems. Our heavy infra handled billions in daily payment volume and couldn't go down.
With the Ruby teams, we often playfully quipped, "which nines are those?", implying the leading digit itself wasn't a nine.
sokoloff•4m ago