
More than DNS: Learnings from the 14 hour AWS outage

https://thundergolfer.com/blog/aws-us-east-1-outage-oct20
79•birdculture•2d ago

Comments

ggm•2d ago
Aside from not dogfooding, what would have reduced the impact? Because "don't have a bug" is... well, it's the difference between desire and reality.

Not dogfooding is the software and hardware equivalent of the electricity network's "black start": you never want to be there, but somewhere in the system you need a Honda petrol generator, which is enough to excite the electromagnets on a bigger generator, which you spin up to start the turbine spinning until the steam takes up load and the real generator is able to get volts onto the wire.

Pumped hydro is inside the machine. It's often held out as the black-start mechanism because it's gravity-driven and there's less to go wrong, but if we're in 'line up the holes in the Swiss cheese' territory, you can always have 'for want of a nail' issues with any mechanism. The Honda generator can have a hole in its petrol tank; the turbine at the pumped-hydro plant can be out for maintenance.

nijave•4h ago
A cap on region size could have helped. Region isolation didn't fail here, so splitting us-east-1 into a us-east-3, 4, and 5 would have meant a smaller impact.

Having such a gobsmackingly massive singular region seems to be working against AWS.

elchananHaas•4h ago
DynamoDB is working on going cellular, which should help. Some parts are already cellular, and others like DNS are in progress. https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
pas•4h ago
There's already some virtualization going on. (I heard that what one account sees as us-east-1a might be us-east-1c for another, to spread the load. Though obviously it's still too big.)
easton•52m ago
They used to do that (so that everyone picking 1a wouldn’t actually route traffic to the same AZ), but at some point they made it so all new accounts had the same AZ labels to stop confusing people.

The services that fail on AWS’ side are usually multi zone failures anyway, so maybe it wouldn’t help.

I’ve often wondered why they don’t make a secret region just for themselves to isolate the control plane, but maybe they think dogfooding us-east-1 is important (it probably wouldn’t help much anyway; a meteor could just as easily strike us-secret-1).

otterley•1h ago
us-east-2 already exists and wasn’t impacted. And the choice of where to deploy is yours!
mcswell•3h ago
(Almost) irrelevant question. You wrote "...a bigger generator, which you spin up to start the turbine spinning until the steam takes up load..." I once served on a steam-powered US Navy guided missile destroyer. In addition to the main engines, we had four steam turbine electrical generators. There was no need--indeed, no mechanism--for spinning any of these turbines electrically; they all started up simply by sending them steam. (To be sure, you'd better ensure that the lube oil was flowing.)

Are you saying it's different on land-based steam power plants? Why?

brendoelfrendo•2h ago
Most (maybe all?) large grid-scale generators use electromagnets to produce the magnetic field they need to generate electricity. These magnets require electricity to create that field, so you need a small generator to kickstart your big generator's magnets in order to start producing power. There are other concerns, too; depending on the nature of the plant, there may be other machinery that requires electricity before the plant can operate. It doesn't take much startup energy to open the gate on a hydroelectric dam, but I don't think anyone is shoveling enough coal to cold-start a coal plant without a conveyor belt.

If I had to guess, I'd assume that the generators on your destroyer were gas turbine generators that use some kind of motor as part of their startup process to get the turbine spinning. It's entirely possible that there was an electric motor in there to facilitate the generator's "black-start," but it may have been powered by a battery rather than a smaller generator.

pas•3h ago
The post argues for "control theory" and slowing down changes. (Which... sure, maybe, but it will slow down convergence, or complicate things if some classes of actions are faster than others.)

But what would make sense is for upstream services to return their load (queue depths, or things like PSI from newer kernels) to downstream ones as part of their API responses, so that if something's going wrong the downstream ones become more patient and slow down. (And if things are getting cleaned up, the downstream services can speed back up.)
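Roughly, with made-up names and thresholds (the real signal could be queue depth, PSI, or anything else the upstream chooses to expose), that feedback loop could look like this:

    import random
    import time

    # Hypothetical load-feedback loop: the upstream attaches a load score to every
    # response, and callers use it to pace themselves. All names and thresholds here
    # are invented for illustration.

    def upstream_handle(request, queue_depth, queue_capacity=100):
        """Serve a request and report current load alongside the response."""
        load = min(1.0, queue_depth / queue_capacity)  # 0.0 = idle, 1.0 = saturated
        return {"body": f"handled {request}", "x-load": load}

    def downstream_client(requests, base_delay=0.01, max_delay=0.2):
        """Send requests, backing off while the reported load is high."""
        delay = base_delay
        for req in requests:
            queue_depth = random.randint(0, 100)  # stand-in for the upstream's real queue
            resp = upstream_handle(req, queue_depth)
            if resp["x-load"] > 0.8:               # upstream is hurting: be more patient
                delay = min(max_delay, delay * 2)
            elif resp["x-load"] < 0.3:             # upstream has recovered: speed back up
                delay = max(base_delay, delay / 2)
            time.sleep(delay)

    downstream_client(range(10))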

otterley•1h ago
I’ve been thinking about this problem for decades. Load feedback is a wonderful idea, but extremely difficult to put into practice. Every service has a different and unique architecture; and even within a single service, different requests can consume significantly different resources. This makes it difficult, if not impossible, to provide a single quantitative number in response to the question “what request rate can I send you?” (It’s a corollary of the Anna Karenina principle.) It also requires tight coupling between the load balancer and the backends, which has problems of its own.

I haven’t seen anyone really solve this problem at scale, with the possible exception of YouTube’s mechanism (https://research.google/pubs/load-is-not-what-you-should-bal...), but that’s specific to them and it isn’t universally applicable to arbitrary workloads.

JCM9•6h ago
Legitimate question as to whether the talent exodus from AWS is starting to take its toll. I’m talking about all the senior, long-tenured folks jumping ship for greener pastures, not the layoffs this week, which mostly didn’t touch AWS (folks are saying that will happen in future rounds).

The fact that there was an outage is not unexpected… it happens… but all the stumbling and the length of time it took to get things under control was concerning.

j45•5h ago
Hope not. Smooth tech that just runs is like the Maytag man.

A tech department running around with its hair on fire / always looking busy isn't one that builds trust.

zorpner•5h ago
Corey Quinn wrote an interesting article addressing that question: https://www.theregister.com/2025/10/20/aws_outage_amazon_bra...

Some good information in the comments as well.

Nextgrid•5h ago
If you average it out over the last decade, do we really have more outages now than before? Any complex system with lots of moving parts is bound to fail every so often.
thundergolfer•4h ago
It's the length of the outage that's striking. AWS us-east-1 has had a few serious outages in the last ~decade, but IIRC none took anywhere near 14 hours to resolve.

The horrible us-east-1 S3 outage of 2017[1] was around 5 hours.

1. https://aws.amazon.com/message/41926/

Nextgrid•4h ago
Couldn’t this be explained by natural growth of the amount of cloud resources/data under management?

The more you have, the faster the backlog grows in case of an outage, so you need longer to process it all once the system comes back online.
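Back-of-the-envelope, with invented numbers: the backlog piles up at the normal arrival rate for the length of the outage, and drains only at whatever headroom the recovery path has left, so if workload grows faster than recovery capacity, drain time stretches:

    # Sketch of that argument (all numbers invented). Work accumulates at the normal
    # arrival rate during the outage and drains at the leftover headroom afterwards:
    # drain_time = backlog / (recovery - arrival).

    def drain_hours(arrival_per_hour, recovery_per_hour, outage_hours):
        backlog = arrival_per_hour * outage_hours
        headroom = recovery_per_hour - arrival_per_hour
        if headroom <= 0:
            return float("inf")  # the backlog never drains
        return backlog / headroom

    # Same 3-hour incident, but arrivals have grown 4x while recovery capacity grew ~3x:
    print(drain_hours(arrival_per_hour=1_000, recovery_per_hour=1_500, outage_hours=3))  # 6.0
    print(drain_hours(arrival_per_hour=4_000, recovery_per_hour=5_000, outage_hours=3))  # 12.0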

JCM9•4h ago
Not really. The issue was the time it took to correctly diagnose the problem, and then the cascading failures that resulted, which triggered more lengthy troubleshooting. Rightly or wrongly, it plays into the “the folks who knew best how all this works have left the building” vibe. Folks inside AWS say that’s not entirely inaccurate.
SketchySeaBeast•5h ago
I can't imagine a more uncomfortable place to try to troubleshoot all this than a hotel lobby surrounded by a dozen coworkers.
thundergolfer•4h ago
It wasn't too bad! The annoying bit was that the offsite schedule was delayed for hours for the other ~40 people not working on the issue.
tptacek•4h ago
Good to see an analysis emphasizing the metastable failure mode in EC2, rather than getting bogged down by the DNS/Dynamo issue. The Dynamo issue, from their timeline, looks like it got fixed relatively quickly, unlike EC2, which needed a fairly elaborate SCRAM and recovery process that took many hours to execute.

A faster, better-tested "restart all the droplet managers from a known reasonable state" process is probably more important than finding all the Dynamo race conditions.
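A drill like that is mostly process, but the shape of it is simple enough to sketch. Purely illustrative, with invented names, batch sizes, and snapshot step; not how EC2's droplet workflow managers actually work:

    import time

    # Illustrative staged "restart from a known good state" drill; every name and
    # number here is made up.

    def restore_from_snapshot(manager_id):
        """Stand-in for rebuilding a manager's view of its hosts from an authoritative store."""
        return {"manager": manager_id, "state": "known-good"}

    def staged_restart(manager_ids, batch_size=5, settle_seconds=30):
        """Restart managers in small batches so recovering nodes don't stampede the
        dependencies (DNS, databases, lease services) that only just came back."""
        restarted = []
        for i in range(0, len(manager_ids), batch_size):
            for mid in manager_ids[i:i + batch_size]:
                restarted.append(restore_from_snapshot(mid))
            time.sleep(settle_seconds)  # let each batch re-establish its leases first
        return restarted

    staged_restart([f"droplet-manager-{n}" for n in range(20)], batch_size=5, settle_seconds=1)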

thundergolfer•3h ago
I was motivated by your back-and-forth in the original AWS summary to go and write this post :)
tptacek•3h ago
It's good, and I love that you brought the Google SRE stuff into it.
fizlebit•2h ago
It looks from the public writeup like the system programming the DNS servers didn't acquire a lease to prevent concurrent access to the same record set (a toy sketch of that kind of lease is below). I'd love to see the internal details on that COE.

I think an extended outage exposes the shortcuts. If you have 100 systems, and one or two can't start quickly from zero, and they're required to get things back to running smoothly, then you're going to have a longer outage. How would you deal with that? You'd uniformly subject your teams to start-from-zero testing. I suspect, though, that many teams are staring down a scaling bottleneck, or at least were for much of Amazon's life, so scaling questions (how do we handle 10x usage growth in the next year and a half, and which soft spots will break?) trump cold-start testing. Then you get a cold-start event, with the last one 5 years ago, and 1 or 2 of your 100 teams fall over, and it takes multiple hours, all hands on deck, to get them started.
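The lease idea from the first paragraph, as a toy sketch: the in-memory dict stands in for whatever coordination service a real DNS enactor would use (which would do the check-and-set atomically), and every name and TTL here is invented:

    import time
    import uuid

    # Toy lease guarding a DNS record set against concurrent writers.

    leases = {}  # record_set -> {"owner": token, "expires": timestamp}

    def acquire_lease(record_set, ttl_seconds=30):
        """Take the lease only if nobody else holds an unexpired one."""
        now = time.time()
        current = leases.get(record_set)
        if current and current["expires"] > now:
            return None  # another writer is still programming this record set
        token = str(uuid.uuid4())
        leases[record_set] = {"owner": token, "expires": now + ttl_seconds}
        return token

    def apply_plan(record_set, token, plan):
        """Refuse to write unless we still hold the lease (guards against stale writers)."""
        current = leases.get(record_set)
        if not current or current["owner"] != token or current["expires"] < time.time():
            raise RuntimeError("lost the lease; another writer owns this record set")
        print(f"applying {plan} to {record_set}")

    token = acquire_lease("dynamodb.us-east-1.example")
    if token:
        apply_plan("dynamodb.us-east-1.example", token, "plan-42")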

zastai0day•33m ago
Man, AWS is at it again. That big outage on Oct 20 was rough, and now (Oct 29) it looks like things are shaky again. SMH.

us-east-1 feels like a single point of failure for half the internet. People really need to look into multi-region, maybe even use AI like Zast.ai for intelligent failover so we're not all dead in the water when N. Virginia sneezes.

binary132•10m ago
That sounds like engineering work and expense without a dollar sign attached to it, so maybe it’ll happen after all the product work (i.e., never).

Uv is the best thing to happen to the Python ecosystem in a decade

https://emily.space/posts/251023-uv
1200•todsacerdoti•8h ago•687 comments

China has added forest the size of Texas since 1990

https://e360.yale.edu/digest/china-new-forest-report
384•Brajeshwar•1d ago•236 comments

Tell HN: Azure outage

662•tartieret•10h ago•633 comments

IRCd service written in awk

https://example.fi/blog/ircd.html
14•pabs3•29m ago•2 comments

Minecraft removing obfuscation in Java Edition

https://www.minecraft.net/en-us/article/removing-obfuscation-in-java-edition
575•SteveHawk27•10h ago•197 comments

Raspberry Pi Pico Bit-Bangs 100 Mbit/s Ethernet

https://www.elektormagazine.com/news/rp2350-bit-bangs-100-mbit-ethernet
70•chaosprint•3h ago•14 comments

OS/2 Warp, PowerPC Edition

https://www.os2museum.com/wp/os2-history/os2-warp-powerpc-edition/
29•TMWNN•3h ago•11 comments

Dithering – Part 1

https://visualrambling.space/dithering-part-1/
223•Bogdanp•8h ago•48 comments

AWS to bare metal two years later: Answering your questions about leaving AWS

https://oneuptime.com/blog/post/2025-10-29-aws-to-bare-metal-two-years-later/view
626•ndhandala•15h ago•430 comments

How the U.S. National Science Foundation Enabled Software-Defined Networking

https://cacm.acm.org/federal-funding-of-academic-research/how-the-u-s-national-science-foundation...
57•zdw•5h ago•15 comments

AOL to be sold to Bending Spoons for $1.5B

https://www.axios.com/2025/10/29/aol-bending-spoons-deal
192•jmsflknr•10h ago•170 comments

Kafka is Fast – I'll use Postgres

https://topicpartition.io/blog/postgres-pubsub-queue-benchmarks
311•enether•12h ago•248 comments

A century of reforestation helped keep the eastern US cool

https://news.agu.org/press-release/a-century-of-reforestation-helped-keep-the-eastern-us-cool/
89•softwaredoug•3h ago•10 comments

Tailscale Peer Relays

https://tailscale.com/blog/peer-relays-beta
258•seemaze•10h ago•71 comments

Crunchyroll is destroying its subtitles

https://daiz.moe/crunchyroll-is-destroying-its-subtitles-for-no-good-reason/
174•Daiz•3h ago•58 comments

OpenAI’s promise to stay in California helped clear the path for its IPO

https://www.wsj.com/tech/ai/openais-promise-to-stay-in-california-helped-clear-the-path-for-its-i...
155•badprobe•9h ago•210 comments

Board: New game console recognizes physical pieces, with an open SDK

https://board.fun/
147•nicoles•23h ago•56 comments

The Internet runs on free and open source software and so does the DNS

https://www.icann.org/en/blogs/details/the-internet-runs-on-free-and-open-source-softwareand-so-d...
111•ChrisArchitect•8h ago•7 comments

GLP-1 therapeutics: Their emerging role in alcohol and substance use disorders

https://academic.oup.com/jes/article/9/11/bvaf141/8277723?login=false
156•PaulHoule•2d ago•67 comments

How to Obsessively Tune WezTerm

https://rashil2000.me/blogs/tune-wezterm
79•todsacerdoti•7h ago•47 comments

Keep Android Open

http://keepandroidopen.org/
2342•LorenDB•22h ago•748 comments

Meta and TikTok are obstructing researchers' access to data, EU commission rules

https://www.science.org/content/article/meta-and-tiktok-are-obstructing-researchers-access-data-e...
147•anigbrowl•4h ago•67 comments

Responses from LLMs are not facts

https://stopcitingai.com/
148•xd1936•5h ago•100 comments

Using Atomic State to Improve React Performance in Deeply Nested Component Trees

https://runharbor.com/blog/2025-10-26-improving-deeply-nested-react-render-performance-with-jotai...
4•18nleung•3d ago•0 comments

More than DNS: Learnings from the 14 hour AWS outage

https://thundergolfer.com/blog/aws-us-east-1-outage-oct20
79•birdculture•2d ago•25 comments

Upwave (YC S12) is hiring software engineers

https://www.upwave.com/job/8228849002/
1•ckelly•10h ago

Composer: Building a fast frontier model with RL

https://cursor.com/blog/composer
179•leerob•10h ago•133 comments

How blocks are chained in a blockchain

https://www.johndcook.com/blog/2025/10/27/blockchain/
50•tapanjk•2d ago•21 comments

Extropic is building thermodynamic computing hardware

https://extropic.ai/
97•vyrotek•8h ago•70 comments

Tailscale Services

https://tailscale.com/blog/services-beta
126•xd1936•1d ago•28 comments