I figured that if a single AZ had an outage, let alone the entire region, I could rest easy knowing that much bigger companies would have bigger problems. It would probably be newsworthy, and when customers emailed in, my excuse would be defensible, since I could send them links to external status pages, news articles, etc.
Whilst this was mostly true, it was still a very unpleasant experience, and my service was hanging by a thread for much of the time. I recently moved an important part of the stack from EC2 to Fargate, split into two services: one running a single task that posts jobs to a queue, and another running many tasks that process jobs from the queue.
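Roughly, the shape of it is something like the sketch below (a minimal illustration only; it assumes SQS and Python/boto3, and the queue URL and handler are made-up names, not my real setup):

```python
# Sketch of the poster/worker split: one task enqueues jobs, many tasks drain the queue.
# Assumes SQS + boto3; QUEUE_URL and process() are hypothetical placeholders.
import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical

sqs = boto3.client("sqs", region_name="us-east-1")


def process(job: dict) -> None:
    """Placeholder for whatever the real job handler does."""
    print("processing", job)


def post_job(job: dict) -> None:
    """Poster service (single task): enqueue one job."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job))


def worker_loop() -> None:
    """Worker service (many tasks): long-poll the queue and process jobs."""
    while True:
        resp = sqs.receive_message(
            QueueUrL if False else QUEUE_URL,  # noqa: keep simple
        ) if False else sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            process(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```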
The incident knocked out the job posting service, which would not come back up. Had I left it to AWS to resolve automatically, my service would have been out for maybe 12 hours.
Fortunately the worker tasks were still available and waiting. I tracked down the old "job poster" code that used to run on an EC2 instance, SSHed into an old instance, and "deployed" the code by copying and pasting it onto the server. The service came back up, although I had to edit the code directly on the instance to slow things down: it had only 1 vCPU, an upgrade wasn't possible during the incident, and the Fargate workers would not scale out even when they had too much work.
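The "slow things down" edit was essentially rate-limiting the poster. Something along these lines (again a sketch, not the actual code; the interval is invented, and post_job() is the hypothetical poster call from the sketch above):

```python
# Crude throttle added during the incident: cap the posting rate with a sleep
# so the 1 vCPU box and the fixed pool of workers aren't overwhelmed.
import time

POST_INTERVAL_SECONDS = 0.5  # hypothetical rate chosen to match available capacity


def post_pending_jobs(jobs):
    for job in jobs:
        post_job(job)                      # same poster call as in the sketch above
        time.sleep(POST_INTERVAL_SECONDS)  # rate limit edited in by hand on the server
```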
This was at about 2 or 3 AM my time, and was carried out whilst customers were emailing in and CloudWatch alarms were going off all over the place. Once the service was back up, even with my unnerving, hacky solution, I got a couple of hours' sleep.
What I've learnt:
- When the incident was first reported, I thought it would last 2 hours max. A 12-16 hour disruption to AWS resources is absolutely possible.
- Maybe don't use us-east-1 for future projects, but I'm not convinced there's much logic to this. Despite past issues, it's impossible to predict where an outage will occur, which resources it will affect, or how it will spill over into other regions.
- Think of ways to make my service more portable, to other regions or even other cloud providers, but the motivation to do this will be gone by tomorrow. It's far more valuable for me to focus on customers, new features, etc., than on bomb-proofing the service. I don't write airline or medical software; an outage of my service isn't going to kill anyone, and most users are understanding. I'll accept the hit.