The fact that there was an outage is not unexpected… it happens… but all the stumbling and the length of time it took to get things under control was concerning.
Tech departments running around with their hair on fire / always looking busy isn't a look that builds trust.
Some good information in the comments as well.
The horrible us-east-1 S3 outage of 2017[1] was around 5 hours.
The more you have, the faster the backlog grows during an outage, so it takes longer to work through it all once the system comes back online.
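Back-of-the-envelope (all numbers invented for illustration): the backlog only drains at the margin between post-recovery capacity and the still-arriving traffic, so modest headroom means a long catch-up tail.

    # Toy arithmetic for the catch-up tail after an outage; numbers are made up.
    arrival_rate = 10_000     # requests/s still arriving while things are down
    drain_rate = 12_000       # requests/s the recovered system can actually process
    outage_seconds = 3_600    # one hour of accumulated backlog

    backlog = arrival_rate * outage_seconds
    catch_up = backlog / (drain_rate - arrival_rate)   # only the headroom eats the backlog
    print(f"{catch_up / 3600:.1f} hours to drain")     # -> 5.0 hours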
A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.
I think an extended outage is what exposes the shortcuts. If you have 100 systems, and one or two of them can't start quickly from zero but are required for everything else to run smoothly, you're going to have a longer outage. How would you deal with that? You'd uniformly subject every team to start-from-zero testing. I suspect, though, that many teams are staring down a scaling bottleneck, or at least were for much of Amazon's life, so scaling questions (how do we handle 10x usage growth in the next year and a half, which soft spots will break first) trump cold-start testing. Then a cold-start event arrives, the last one having been 5 years ago, and 1 or 2 of your 100 teams fall over, and it takes multiple hours, all hands on deck, to get them started again.
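A start-from-zero drill doesn't have to be elaborate. Rough sketch of the idea (the service name, paths, and time budget here are made up, not anything Amazon actually runs): wipe the warm state, restart the service, and fail the drill if it can't come up healthy within a budget.

    # Hypothetical cold-start drill: wipe warm state, restart, fail if the
    # service can't reach healthy within a budget. Names/paths are invented.
    import subprocess, time, urllib.request

    COLD_START_BUDGET_S = 300   # arbitrary drill SLO

    subprocess.run(["systemctl", "stop", "myservice"], check=True)
    subprocess.run(["rm", "-rf", "/var/lib/myservice/cache"], check=True)  # no warm data
    subprocess.run(["systemctl", "start", "myservice"], check=True)

    deadline = time.monotonic() + COLD_START_BUDGET_S
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen("http://localhost:8080/healthz", timeout=2) as r:
                if r.status == 200:
                    print("cold start OK")
                    break
        except OSError:
            pass  # not up yet
        time.sleep(5)
    else:
        raise SystemExit("cold start blew the budget -- this is the team that extends the next outage")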
us-east-1 feels like a single point of failure for half the internet. People really need to look into multi-region, maybe even use AI like Zast.ai for intelligent failover so we're not all dead in the water when N. Virginia sneezes.
ggm•2d ago
Not dogfooding is the software-and-hardware equivalent of the electricity network's "black start": you never want to be there, but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets of a bigger generator, which you spin up to get the turbine turning until the steam takes up the load and the real generator is able to get volts onto the wire.
Pumped hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity-driven and there's less to go wrong, but once we're in "line up the holes in the Swiss cheese" territory, you can always have "for want of a nail" issues with any mechanism. The Honda generator can have a hole in its petrol tank; the turbine at the pumped-hydro plant can be out for maintenance.
nijave•4h ago
Having such a gobsmackingly massive single region seems to be working against AWS.
easton•52m ago
The services that fail on AWS's side are usually multi-zone failures anyway, so maybe it wouldn't help.
I've often wondered why they don't make a secret region just for themselves to isolate the control plane, but maybe they think dogfooding us-east-1 is important (it probably wouldn't help much anyway; a meteor could just as easily strike us-secret-1).
mcswell•3h ago
Are you saying it's different on land-based steam power plants? Why?
brendoelfrendo•2h ago
If I had to guess, I'd assume that the generators on your destroyer were gas turbine generators that use some kind of motor as part of their startup process to get the turbine spinning. It's entirely possible that there was an electric motor in there to facilitate the generator's "black-start," but it may have been powered by a battery rather than a smaller generator.
pas•3h ago
But what would make sense is for upstream services to return their load (queue depths, or things like PSI from newer kernels) to downstream ones as part of the API responses, so that if shit's hitting the fan the downstream ones become more patient and slow down. (And if the backlog is getting cleaned up, the downstream services can speed back up.)
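A sketch of what that could look like on the caller side (the X-Upstream-Load header, the thresholds, and the URL are invented here, not any real AWS mechanism): the upstream reports its own load in every response, and the caller scales its pacing off the last value it saw.

    # Sketch of load-signal backpressure: caller slows down when the upstream
    # says it's loaded, speeds back up as the load value drops. All names invented.
    import time, urllib.request

    class PatientClient:
        def __init__(self, base_url):
            self.base_url = base_url
            self.delay = 0.0   # extra pause before each call; grows when upstream reports load

        def get(self, path):
            time.sleep(self.delay)
            with urllib.request.urlopen(self.base_url + path, timeout=5) as resp:
                body = resp.read()
                # Upstream reports its own load (queue depth, PSI, ...) normalized to 0..1.
                load = float(resp.headers.get("X-Upstream-Load", "0"))
            if load > 0.8:
                self.delay = min(max(self.delay, 0.1) * 2, 10.0)   # struggling: be more patient
            else:
                self.delay /= 2                                    # cleaning up: speed back up
            return body

    # client = PatientClient("http://catalog.internal:8080")
    # print(client.get("/items/42"))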
otterley•1h ago
I haven’t seen anyone really solve this problem at scale with maybe the exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and it isn’t universally applicable to arbitrary workloads.