The fact that there was an outage is not unexpected… it happens… but all the stumbling and the length of time it took to get things under control was concerning.
Tech departments running around with their hair on fire / always looking busy isn't a look that builds trust.
Some good information in the comments as well.
The horrible us-east-1 S3 outage of 2017[1] was around 5 hours.
The more you have, the faster the backlog grows during an outage, so it takes longer to work through it all once the system comes back online.
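Back-of-the-envelope (all numbers invented for illustration): the backlog only drains at the margin between post-recovery capacity and the still-arriving traffic, so modest headroom means a long catch-up tail.

    # Toy arithmetic for the catch-up tail after an outage; numbers are made up.
    arrival_rate = 10_000     # requests/s still arriving while things are down
    drain_rate = 12_000       # requests/s the recovered system can actually process
    outage_seconds = 3_600    # one hour of accumulated backlog

    backlog = arrival_rate * outage_seconds
    catch_up = backlog / (drain_rate - arrival_rate)   # only the headroom eats the backlog
    print(f"{catch_up / 3600:.1f} hours to drain")     # -> 5.0 hours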
A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.
I think an extended outage is what exposes the shortcuts. If you have 100 systems, and one or two of them can't start quickly from zero but are required for everything else to run smoothly, you're going to have a longer outage. How would you deal with that? You'd uniformly subject every team to start-from-zero testing. I suspect, though, that many teams are staring down a scaling bottleneck, or at least were for much of Amazon's life, so scaling questions (how do we handle 10x usage growth in the next year and a half, which soft spots will break first) trump cold-start testing. Then a cold-start event arrives, the last one having been 5 years ago, and 1 or 2 of your 100 teams fall over, and it takes multiple hours, all hands on deck, to get them started again.
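A start-from-zero drill doesn't have to be elaborate. Rough sketch of the idea (the service name, paths, and time budget here are made up, not anything Amazon actually runs): wipe the warm state, restart the service, and fail the drill if it can't come up healthy within a budget.

    # Hypothetical cold-start drill: wipe warm state, restart, fail if the
    # service can't reach healthy within a budget. Names/paths are invented.
    import subprocess, time, urllib.request

    COLD_START_BUDGET_S = 300   # arbitrary drill SLO

    subprocess.run(["systemctl", "stop", "myservice"], check=True)
    subprocess.run(["rm", "-rf", "/var/lib/myservice/cache"], check=True)  # no warm data
    subprocess.run(["systemctl", "start", "myservice"], check=True)

    deadline = time.monotonic() + COLD_START_BUDGET_S
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen("http://localhost:8080/healthz", timeout=2) as r:
                if r.status == 200:
                    print("cold start OK")
                    break
        except OSError:
            pass  # not up yet
        time.sleep(5)
    else:
        raise SystemExit("cold start blew the budget -- this is the team that extends the next outage")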
us-east-1 feels like a single point of failure for half the internet. People really need to look into multi-region, maybe even use AI like Zast.ai for intelligent failover so we're not all dead in the water when N. Virginia sneezes.
ggm•2d ago
Not dogfooding is the software-and-hardware equivalent of the electricity network's "black start": you never want to be there, but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets of a bigger generator, which you spin up to get the turbine turning until the steam takes up the load and the real generator is able to get volts onto the wire.
Pumped hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity-driven and there's less to go wrong, but once we're in "line up the holes in the Swiss cheese" territory, you can always have "for want of a nail" issues with any mechanism. The Honda generator can have a hole in its petrol tank; the turbine at the pumped-hydro plant can be out for maintenance.
nijave•4h ago
Having such a gobsmackingly massive single region seems to be working against AWS.
easton•52m ago
The services that fail on AWS's side are usually multi-zone failures anyway, so maybe it wouldn't help.
I've often wondered why they don't make a secret region just for themselves to isolate the control plane, but maybe they think dogfooding us-east-1 is important (it probably wouldn't help much anyway; a meteor could just as easily strike us-secret-1).
mcswell•3h ago
Are you saying it's different on land-based steam power plants? Why?
brendoelfrendo•2h ago
If I had to guess, I'd assume that the generators on your destroyer were gas turbine generators that use some kind of motor as part of their startup process to get the turbine spinning. It's entirely possible that there was an electric motor in there to facilitate the generator's "black-start," but it may have been powered by a battery rather than a smaller generator.
pas•3h ago
But what would make sense is for upstream services to return their load (queue depths, or things like PSI from newer kernels) to downstream ones as part of the API responses, so that if shit's hitting the fan the downstream ones become more patient and slow down. (And if the backlog is getting cleaned up, the downstream services can speed back up.)
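A sketch of what that could look like on the caller side (the X-Upstream-Load header, the thresholds, and the URL are invented here, not any real AWS mechanism): the upstream reports its own load in every response, and the caller scales its pacing off the last value it saw.

    # Sketch of load-signal backpressure: caller slows down when the upstream
    # says it's loaded, speeds back up as the load value drops. All names invented.
    import time, urllib.request

    class PatientClient:
        def __init__(self, base_url):
            self.base_url = base_url
            self.delay = 0.0   # extra pause before each call; grows when upstream reports load

        def get(self, path):
            time.sleep(self.delay)
            with urllib.request.urlopen(self.base_url + path, timeout=5) as resp:
                body = resp.read()
                # Upstream reports its own load (queue depth, PSI, ...) normalized to 0..1.
                load = float(resp.headers.get("X-Upstream-Load", "0"))
            if load > 0.8:
                self.delay = min(max(self.delay, 0.1) * 2, 10.0)   # struggling: be more patient
            else:
                self.delay /= 2                                    # cleaning up: speed back up
            return body

    # client = PatientClient("http://catalog.internal:8080")
    # print(client.get("/items/42"))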
otterley•1h ago
I haven’t seen anyone really solve this problem at scale with maybe the exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and it isn’t universally applicable to arbitrary workloads.