AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

https://faun.dev/c/news/devopslinks/aws-outage-a-single-cloud-region-shouldnt-take-down-the-world-but-it-did/

256•eon01•3h ago

Comments

skywhopper•3h ago

There are plenty of ways to address this risk. But the companies impacted would have to be willing to invest in the extra operational cost and complexity. They aren’t.

randomtoast•3h ago

Thing is us-east-1 the primary region for many services of AWS. DynamoDB is a very central offering used by many service. And the issue that has happend is very common[^1].

I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[^2].

[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...

[2]: https://xkcd.com/2347/

pjmlp•3h ago

It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.

mrbungie•2h ago

> It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality,

Well, inter-region DR/HA is a expensive thing to ensure (whether on salaries, infra or both), specially when you are in AWS.

esafak•2h ago

Does AWS follow its own Well-Architected Framework!?

spyspy•2h ago

Eh, the "best practices" that would've prevented this aren't trivial to implement and are definitely far beyond what most engineering teams are capable of, in my experience. It depends on your risk profile. When we had cloud outages at the freemium game company I worked at, we just shrugged and waited for the systems to come back online - nobody dying because they couldn't play a word puzzle. But I've also had management come down and ask what it would take to prevent issues like that from happening again, and then pretend they never asked once it was clear how much engineering effort it would take. I've yet to meet a product manager that would shred their entire roadmap for 6-18 months just to get at an extra 9 of reliability, but I also don't work in industries where that's super important.

pjmlp•13m ago

Indeed, yet one would expect AWS to lead by example, including all of those that are only using a single region.

Thaxll•1h ago

Best practice does not include plan for when AWS going down. Netflix does not plan for it and they have a very strong eng org.

pjmlp•14m ago

It was only one region.

mcphage•3h ago

It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single point of failures.

Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

Trasmatta•2h ago

The irony is that true resilience is very complex, and complexity can be a major source of outages in and of itself

lanstin•51m ago

I have enjoyed this paper on such dynamics: https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...

It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of folding of manifolds. Now the systems are higher dinemsional and the advice more practical/heuristic.

mschuster91•2h ago

> Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.

Doesn't help either. us-east-1 hosts the internal control plane of AWS and a bunch of stuff is only available in us-east-1 at all - most importantly, Cloudfront, AWS ACM for Cloudfront and parts of IAM.

And the last is the one true big problem. When IAM has a sniffle, everything else collapses because literally everything else depends on IAM. If I were to guess IAM probably handles millions if not billions of requests a second because every action on every AWS service causes at least one request to IAM.

TheNewsIsHere•2h ago

The last re:Invent presentation I saw from one of the principals working on IAM quoted 500 million requests per second. I expect that’s because IAM also underpins everything inside AWS, too.

mschuster91•2h ago

IAM, hands down, is one of the most amazing pieces of technology there is.

The sheer volume is one thing, but... IAM's policy engine, that's another thing. Up to 5000 different roles per account, dozens of policies that can have an effect on any given user entity and on top of that you can also create IAM policies that blanket affect all entities (or only a filtered subset) in an account, and each policy definition can be what, 10 kB or so, in size. Filters can include multiple wildcards everywhere so you can't go for a fast-path in an in-memory index, and they can run variables with on-demand evaluation as well.

And all of that is reachable not on an account-specific endpoint that could get sharded from a shared instance should the load of one account become too expensive, no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context cues to have an audit and debug trail.

To achieve all that in a service quality that allows for less than 10 seconds worth of time before a change in an IAM policy becomes effective and milliseconds of call time is nothing short of amazing.

quesera•1h ago

No harshness intended, but I don't see the magic.

IAM is solid, but is it any more special than any other distributed AuthN+AuthZ service?

lanstin•47m ago

Scale is a feature. 500M per second in practice is impressive.

UltraSane•39m ago

The scale, speed, and uptime of AWS IAM is pretty special.

foinker•2h ago

No shot that happens until an outage breaks at least an entire workday in the US timezones. The only complaint I personally heard was from someone who couldn't load reddit on the train to work.

geodel•1h ago

Well by the time it really happens for a whole day Amazon leadership will be brazen enough to say "OK, enough of this my site is down, we will call back once systems are up so don't bother for a while". Also maybe responsible human engineers would fired by then and AI can be infinitely patient while working through insolvable issues.

2d8a875f-39a2-4•2h ago

> we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive.

You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.

Cthulhu_•55m ago

Stocks stopped being indicative of anything decades ago though.

agos•2h ago

when did we have resilience?

BoredPositron•2h ago

Cold War was pretty good in terms of resilience.

JCM9•3h ago

US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.

Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”

helsinkiandrew•3h ago

>US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions

I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.

I haven’t had to do this for several years but that was my experience a few years ago on an outage - obviously it depends on the services you’re using.

You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late

cmiles8•2h ago

Well that sounds like exactly the sort of thing that shouldn’t happen when there’s an issue given the usual response is to spin things up elsewhere, especially on lower priority services where instant failover isn’t needed.

sgarland•2h ago

It depends on the outage. There was one a year or two ago (I think? They run together) that impacted EC2 such that as long as you weren’t trying to scale, or issue any commands, your service would continue to operate. The EKS clusters at my job at the time kept chugging along, but had Karptenter tried to schedule more nodes, we’d have had a bad time.

bpicolo•2h ago

Static stability is a very valuable infra attribute. You should definitely consider how statically stable your services are in architecting them

yencabulator•9m ago

Meanwhile, AWS has always marketed itself as "elastic". Not being able to start new VMs in the morning to handle the daytime load will wreck many sites.

Yeul•2h ago

Internet was supposed to be a communication network if the East Coast was nuked.

What it turned into was Daedalus from Deus Ex lol.

t_sawyer•2h ago

Yeah because Amazon engineers are hypocrites. They want you to spend extra money for region failover and multi-az deploys but they don't do it themselves.

ajkjk•2h ago

They absolutely do do it themselves..

falcor84•2h ago

What do you mean? Obviously, as TFA shows and as others here pointed-out, AWS relies globally on services that are fully-dependent on us-east-1, so they aren't fully multi-region.

ajkjk•2h ago

The claim was that that they're total hypocrites aren't multi region at all. That's totally false, the amount of redundancy in aws is staggering. But there are foundational parts which, I guess, have been too difficult to do that for (or perhaps they are redundant but the redundancy failed in this case? I dunno)

t_sawyer•2h ago

There's multiple single points of failure for their entire cloud in us-east-1.

I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.

ajkjk•2h ago

That's absurd. It's hypocritical to describe best practices as best practices because you haven't perfectly implemented them? Either they're best practice or they aren't. The customers have the option of risking non-redundancy also, you know.

t_sawyer•1h ago

Yes it's hypocritical to push customers to pay you more money with best practices for uptime when you yourself don't follow them and your choices to not follow them actually make the best practices you pushed your customers to pay you more money for not fully work.

Hey! Pay us more money so when us-east-1 goes down you're not down (actually you'll still go down because us-east-1 is a single point of failure even for our other regions).

falcor84•2h ago

That's a good point, but I'd just s/Amazon engineers/AWS leadership/ , as I'm pretty sure that there's a few layers of management removed between the engineers on the ground at AWS, those who deprioritise any longer-term resilience work needed (which is a very strategic decisioN), and those those who are in charge of external comms/education about best practices for AWS customers.

ajsnigrutin•2h ago

Luckily, those people are the ones that will be getting all the phonecalls from angry customers here. If you're selling resilience and selling twice the service (so your company can still run if one location fails), and it still failed, well... phones will be ringing.

ZeroCool2u•2h ago

They can't even bother to enable billing services in GovCloud regions.

bravetraveler•2h ago

Call me crazy, because this is, perhaps it's their "Room 641a". The purpose of a system is what it does, no point arguing 'should' against reality, etc.

They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.

voxadam•2h ago

> perhaps it's their "Room 641a".

For the uninitiated: https://en.wikipedia.org/wiki/Room_641A

jf•1h ago

Interesting. Langley isn’t that far away

nevir•1h ago

It's really not that nefarious.

IAD datacenters have forever been the place where Amazon software developers implement services first (well before AWS was a thing).

Multi-AZ support often comes second (more than you think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.

And then other services depend on those services, and may also fall into the same trap.

...and so much of the tech/architectural debt gets concentrated into a single region.

bravetraveler•1h ago

Right, like I said: crazy. Anything production with certain other clouds must be multi-AZ. Both reinforced by culture and technical constraints. Sometimes BCDR/contract audits [zones chosen by a third party at random].

nevir•57m ago

It sure is a blast when they decide to cut off (or simulate the loss of) a whole DC just to see what breaks, I bet :)

bravetraveler•54m ago

The disconnect case was simple: breakage was as expected. The island was lost until we drew it on the map again. Things got really interesting when it was a full power-down and back on.

Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.

Anon1096•1h ago

It's possible that you really could endure any zone failure. But I take these claims people make all the time with a grain of salt, unless you're working on AWS scale (basically just 3 companies) and have actually run for years and seen every kind of failure mode claiming to be higher availability is not something that's able to be accurately evaluated.

(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)

bravetraveler•1h ago

Yes, equivalent. Did endure, repeatedly. Demonstrated to auditors to maintain compliance. They would pick the zone to cut off. We couldn't bias the test. Literal clockwork.

I'll let people guess for the sport of it, here's the hint: there were at least 30 of them comprised of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.

pinkmuffinere•49m ago

Just letting you know how this response looks to other people -- Anon1096 raises legitimate objections, and their post seems very measured in their concerns, not even directly criticizing you. But your response here is very defensive, and a bit snarky. Really I don't think you even respond directly to their concerns, they say they'd want to see scale equivalent to AWS because that's the best way to see the wide variety of failure modes, but you mostly emphasize the auditors, which is good but not a replacement for the massive real load and issues that come along with it. It feels miscalibrated to Anon's comment. As a result, I actually trust you less. If you can respond to Anon's comment without being quite as sassy, I think you'd convince more people.

bravetraveler•46m ago

I appreciate the feedback, truly. Defensive and snarky are both fair, though I'm not trying to convince. The business and practices exist, today.

At risk of more snark: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.

whatever1•45m ago

There are shared resources in different regions. Electricity. Cables. Common systems for coordination.

Your experiment proves nothing. Anyone can pull it off.

bravetraveler•45m ago

The sites were chosen specifically to be more than 50 miles apart, it proved plenty.

whatever1•43m ago

I am the CEO of your company. I forgot to pay the electricity bill. How is the multi-region resilience going?

bravetraveler•42m ago

Fine, the tab increments. Get back to hyping or something, this is not your job.

whatever1•40m ago

I doubt it should be yours if this is how you think about resilience.

bravetraveler•39m ago

Your vote has been tallied

quickthrowman•31m ago

If your accounts payable can’t pay the electric bill on time, you’ve got bigger problems.

icedchai•15m ago

If you go far up enough the pyramid, there is always a single point of failure. Also, it's unlikely that 1) all regions have the same power company, 2) all of them are on the same payment schedule, 3) all of them would actually shut off a major customer at the same time without warning, so, in your specific example, things are probably fine.

jayd16•37m ago

You were in a position to actually cut off production zones with live traffic at Amazon scale and test the recovery?

bravetraveler•35m ago

Yes, it was something we would do to maintain certain contracts. Sounds crazy, isn't: they used a significant portion of the capacity, anyway. They brought the auditors.

Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.

edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.

stronglikedan•50m ago

Was that competitor priced competitively with AWS? I think of the project management triangle here - good, fast, or cheap - pick two. AWS would be fast and cheap.

bravetraveler•48m ago

Yes, good point. Pricing is a bit higher. As another reply pointed out: there's ~three that work on the same scale. This was one, another hint I guess: it's mostly B2B. Normal people don't typically go there.

gchamonlive•2h ago

Been a while since I last suffered from AWS arbitrary complexity, but afaik you can only associate certificates to cloudfront if they are generated in us-east-1, so it's undoubtedly a single point of failure for all CDN if this is still the case.

kokanee•2h ago

I worked at AMZN for a bit and the complexity is not exactly arbitrary; it's political. Engineers and managers are highly incentivized to make technical decisions based on how they affect inter-team dependencies and the related corporate dynamics. It's all about review time.

sharpy•2h ago

I have seen one promo docket get rejected for doing work that is not complex enough... I thought the problem was challenging, and the simple solution brilliant, but the tech assessor disagreed. I mean once you see there is a simple solution to a problem, it looks like the problem is simple...

bdbdkdksk•1h ago

I had a job interview like this recently: "what's the most technically complex problem you've ever worked on?"

The stuff I'm proudest of solved a problem and made money but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer "what's the thing you've designed with the most parts"

arethuza•1h ago

I was once very unpopular with a team of developers when I pointed out a complete solution to what they had decided was an "interesting" problem - my solution didn't involve any code being written.

SoftTalker•1h ago

I suppose it depends on what you are interviewing for but questions like that I assume are asked more to see how you answer than the specifics of what you say.

Most web jobs are not technically complex. They use standard software stacks in standard ways. If they didn't, average developers (or LLMs) would not be able to write code for them.

mboerwink•1h ago

I think this could still be a very useful question for an interviewer. If I were hiring for a position working on a complex system, I would want to know what level of complexity a prospect was comfortable dealing with.

gchamonlive•1h ago

That's what arbitrary means to me, but sure, I see no problem calling it political too

AtlasBarfed•1h ago

Forced attrition rears its head again

xbar•2h ago

This set of facts comes to light every 3-5 years when US-East-1 has another failure. Clearly they could have architected their way out of this blast radius problem by now, but they do not.

Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?

firesteelrain•2h ago

It’s probably because there is a lot of tech debt plus look at where it is - Virgina. It shouldn’t take much of imagination to figure out why that is strategic

dsr_•1h ago

They could put a failover site in Colorado or Seattle or Atlanta, handling just their infrastructure. It's not like the NSA wouldn't be able to backhaul from those places.

knotimpressed•1h ago

You mean the surveillance angle as reason for it being in Virginia?

AtlasBarfed•1h ago

What is the motivation of an effective Monopoly to do anything?

I mean look at their console. Their console application is pretty subpar.

cyberax•44m ago

AWS _had_ architected away from single-region failure modes. There are only a few services that are us-east-1 only in AWS (IAM and Route53, mostly), and even they are designed with static stability so that their control plane failure doesn't take down systems.

It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.

MichaelZuo•38m ago

The parent seems to be implying there is something in us-east-1 that could take down all the various regions?

api•2h ago

My contention for a long time has been that cloud is full of single points of failure (and nightmarish security hazards) that are just hidden from the customer.

"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"

The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.

The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.

raw_anon_1111•1h ago

You act as if that is a bug not a feature. As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself. Besides none of your customers are going to blame you if every other major site is down.

unethical_ban•32m ago

As someone who hypothetically runs a critical service, I would rather my service be up than down.

raw_anon_1111•14m ago

And you have never had downtime? If your data center went down - then what?

tredre3•13m ago

> As hypothetically someone who is responsible for my site staying up, I would much rather blame AWS than myself.

That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.

But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The down being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here and as a profession I wish we'd take more accountability.

aurareturn•1h ago

Or it is a matter of efficiency. If 1 million companies design and maintain their servers, there would be 1 million (or more) incidents like these. Same issues. Same fixes. Not so efficient.

SoftTalker•1h ago

It might be worse in terms of total downtime but it likely would be much less noticable as it woould be scattered individual outages not everyone at the same time.

qaq•1h ago

even if us-east-1 was a normal region there is not enough spare capacity to take up all the workloads from us-east-1 in other regions so t's a moot point

nevir•1h ago

It also doesn't help that most companies using AWS aren't remotely close to multi-region support, and that us-east-1 is likely the most populated region.

einrealist•1h ago

It sounds like they want to avoid split-brain scenarios as much as possible while sacrificing resilience. For things like DNS, this is probably unavoidable. So, not all the responsibility can be placed on AWS. If my application relies on receipts (such as an airline ticket), I should make sure I have an offline version stored on my phone so that I can still check in for my flight. But I can accept not to be able to access Reddit or order at McDonalds with my phone. And always having cash at hand is a given, although I almost always pay with my phone nowadays.

I hope they release a good root cause analysis report.

masfuerte•58m ago

Amazon are planning to launch the EU Sovereign Cloud by the end of the year. They claim it will be completely independent. It may be possible then to have genuine resiliency on AWS. We'll see.

louthy•43m ago

Then it will be eu-east-1 taking down the EU

samcat116•27m ago

This is the difference between “partitions” and “regions”. Partitions have fully separate IAM, DNS names, etc. This is how there are things like US Gov Cloud, the Chinese AWS cloud, and now the EU sovereign cloud

belter•48m ago

> Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.

Well it did for me today...Dont use us-east-1 explicitly just other regions and I had no outage today...( I get the point about the skeletons in the closet of us-east-1 ...maybe the power plug goes via Bezos wood desk? )

thayne•38m ago

There are hints at in their documentation. For example ACM certs for cloudfront and KMS keys for route53 DNSSEC have to be in the us-east1 region.

helsinkiandrew•3h ago

> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.

Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup

martypitt•3h ago

A bit meta - but, what is faun.dev? I visited their site - it looks like a very very slow (possibly because of the current outage?), ad-funded Reddit / HN clone?

But, in it's sidebar of "Trending technologies", it lists "Ansible" and "Jenkins" .. which while are both great, I doubt are trending currently.

Curious what this is?

Maxion•2h ago

OP is the creator of faun.dev. Seems to just be yet another tech news site.

darkwater•1h ago

OP who has more submissions than comments. And all the submissions are for either this faun.dev or thechief.io

sofixa•2h ago

> "Ansible" and "Jenkins" .. which while are both great

I would strongly argue that there is nothing great about Jenkins. It's an unholy mess of mouldy spaghetti that can sometimes be used to do achieve a goal, but is generally terrible at everything. Shit to use, shit to maintain, shit to secure. It was the best solution because of a lack of competition 20 years ago, but hasn't been relevant or anywhere near the top 50 since any competition appeared.

The fact that to this very day, nearing the end of 2025, they still don't support JWT identities for runs is embarrassing. Same goes for VMware vSphere.

lunias•1h ago

The design immediately weirded me out, felt strange. Where are they sourcing this information? Is this an AI summary of the BBC live news feed linked in "Further Reading"?

add-sub-mul-div•55m ago

I'd guess self-promoted slop, which is becoming the norm here.

GoatInGrey•41m ago

It's a "vibe-engineered" app. I find it both sad and hilarious just how quickly one can find slop with these.

https://i.ibb.co/Lzgf34mb/Screenshot-20251020-080828.png

Also, this is the exact CSS style that Claude uses whenever I have it program web elements (typically bookmarklet UIs).

fsto•3h ago

Ironically, the HTTP request to this article timed out twice before a successful response.

aeon_ai•2h ago

It's not DNS

There's no way it's DNS

It was DNS

ajross•1h ago

It's "DNS" because the problem is that at the very top of the abstraction hierarchy in any system is a bit of manual configuration.

As it happens, that naturally maps to the bootstrapping process on hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.

But it's the inevitability of the manual process that's the issue here, not the technology. We're at a spot now where the rest of the system reliability is so good that the only things that bring it down are the spots where human beings make mistakes on the tiny handful of places where human operation is (inevitably!) required.

allarm•40m ago

> hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.

Unless DNS configuration propagates over DHCP?

ajross•3m ago

DHCP can only tell you who the local DNS server is. That's not what's failed, nor what needs human configuration.

At the top of the stack someone needs to say "This is the cluster that controls boot storage", "This is the IP to ask for auth tokens", etc... You can automatically configure almost everything but there still has to be some way to get started.

jamesbelchamber•2h ago

This website just seems to be an auto-generated list of "things" with a catchy title:

> 5000 Reddit users reported a certain number of problems shortly after a specific time.

> 400000 A certain number of reports were made in the UK alone in two hours.

ktosobcy•2h ago

Uhm... E(U)ropean sovereigny (and in general spreading the hosting as much as possbile) needed ASAP…

BirAdam•2h ago

Well, except for a lot of business leaders saying that they don't care if it's Amazon that goes down, because "the rest of the internet will be down too."

Dumb argument imho, but that's how many of them think ime.

tjwebbnorfolk•2h ago

because... EU clouds don't break?

https://news.ycombinator.com/item?id=43749178

Cthulhu_•55m ago

Nah, because European services should not be affected by a failure in the US. Whatever systems they have running in us-east-1 should have failovers in all major regions. Today it's an outage in Virginia, tomorrow it could be an attack on undersea cables (which I'm confident are mined and ready to be severed at this point by multiple parties).

bilekas•2h ago

These things happen when profits are the measure everything. Change your provider, but if their number doesn't go up, they wont be reliable.

So your complaints matter nothing because "number go up".

I remember the good old days of everyone starting a hosting company. We never should have left.

g-b-r•2h ago

Can we not promote this AI-generated "article" on that banners-ridden site?

Previous discussions:

https://news.ycombinator.com/item?id=45640754

https://news.ycombinator.com/item?id=45640772

https://news.ycombinator.com/item?id=45640827

https://news.ycombinator.com/item?id=45640838

https://news.ycombinator.com/item?id=45641143

mrbluecoat•2h ago

> due to an "operational issue" related to DNS

Always DNS..

rose-knuckle17•1h ago

aws had an outage. Many companies were impacted. Headlines around the world blame AWS. the real news is how easy it is to identify companies that have put cost management ahead of service resiliency.

Lots of orgs operating wholly in AWS and sometimes only within us-east-1 had no operational problems last night. Some that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).

Overall, those companies that had operational problems likely wouldn't have invested in resiliancy expenses in any other deployment strategy either. It could have happened to them in Azure, GCP or even a home rolled datacenter.

aurareturn•1h ago

Redundancy is insanely expensive especially for SaaS companies where the biggest cost is cloud.

Are customers willing to pay companies for that redundancy? I think not. Once every few years outage for 3 hours is fine for non critical services.

skopje•8m ago

>> Redundancy is insanely expensive especially for SaaS companies

That right there means the business model is fucked to begin with. If you can't have a resilient service, then you should not be offering that service. Period. Solution: we were fine before the cloud, just a little slower. No problem going back to that for some things. Not everything has to be just in time at lowest possible cost.

dehrmann•35m ago

Three nines might be good enough when you're Fornite. Probably not when you're Robinhood.

LightBug1•1h ago

Remember when the "internet will just route around a network problem"?

FFS ...

bgwalter•1h ago

Probably related:

https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...

"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."

Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.

renegade-otter•1h ago

If we see more of this, it would not be crazy to assume that all this compelling of engineers to "use AI" and the flood of Looks Good To Me code is coming home.

Cthulhu_•57m ago

Big if, major outages like this aren't unheard of, and so far, fairly uncommon. Definitely hit harder than their SLAs promise though. I hope they do an honest postmortem, but I doubt they would blame AI even if it was somehow involved. Not to mention you can't blame AI unless you go completely hands-off - but that's like blaming an outsourcing partner, which also never happens.

nivekney•1h ago

Wait a second, Snapchat impacted AGAIN? It was impacted during the last GCP outage.

menomatter•1h ago

What are the design best practices and industry standards for building on-premise fallback capabilities for critical infrastructure? Say for health care/banking ..etc

skopje•6m ago

A relative of mine lived and worked in the US for Oppenheimer Funds in the 1990's and they had their own datacenters all over the US, multiple redundancy for weather or war. But every millionaire feels entitled to be a billionaire now, so all of that cost was rolled into a single point of cloud failure.

Reminds me of a great Onion tagline:

"Plowshare hastily beaten back into sword."

thinkindie•52m ago

Today’s reminder: multi-region is so hard even AWS can’t get it right.

rkharsan64•50m ago

This website is just AI slop, the real reporting is in the BBC page linked at the end.

Photos and numbers seem to be stolen straight from it.

GoatInGrey•35m ago

For anyone willing to take the time and provide BBC with your personal contact and address information, you can file a "complaint" at: https://www.bbc.co.uk/contact/complaints

draxil•50m ago

I was just about to post that it didn't affect us (heavy AWS users, in eu-west-1). Buut, I stopped myself because that was just massively tempting fate :)

mannyv•49m ago

This is why we use us-east-2.

1970-01-01•35m ago

Someone, somewhere, had to report that doorbells went down because the very big cloud did not stay up.

I think we're doing the 21st century wrong.

Johnny555•10m ago

My Ring doorbell works just fine without an internet connection (or during a cloud outage). The video storage and app notifications are another matter, but the doorbell itself continues to ring when someone pushes the button.

bicepjai•19m ago

Is this the outage that took Medium down ?

BERT Is Just a Single Text Diffusion Step

Commodore 64 Ultimate

DeepSeek OCR

Space Elevator

Servo v0.0.1 Released

Matrix Conference 2025 Highlights

How to stop Linux threads cleanly

Docker Systems Status: Full Service Disruption

Anthropic and Cursor Spend This Much on Amazon Web Services

Modeling Others' Minds as Code

Entire Linux Network stack diagram (2024)

Show HN: Playwright Skill for Claude Code – Less context than playwright-MCP

How to Enter a City Like a King

Pointer Pointer (2012)

AWS Multiple Services Down in us-east-1

The Peach meme: On CRTs, pixels and signal quality (again)

Forth: The programming language that writes itself

State-based vs Signal-based rendering

Qt Group Buys IAR Systems Group

AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

Optimizing writes to OLAP using buffers (ClickHouse, Redpanda, MooseStack)

Fractal Imaginary Cubes

Novo Nordisk's Canadian Mistake

Major AWS Outage Happening

Introduction to reverse-engineering vintage synth firmware

Duke Nukem: Zero Hour N64 ROM Reverse-Engineering Project Hits 100%

Give Your Metrics an Expiry Date

Gleam OTP – Fault Tolerant Multicore Programs with Actors

Airliner hit by possible space debris

Major AWS outage takes down Fortnite, Alexa, Snapchat, and more

BERT Is Just a Single Text Diffusion Step

Commodore 64 Ultimate

DeepSeek OCR

Space Elevator

Servo v0.0.1 Released

Matrix Conference 2025 Highlights

How to stop Linux threads cleanly

Docker Systems Status: Full Service Disruption

Anthropic and Cursor Spend This Much on Amazon Web Services

Modeling Others' Minds as Code

Entire Linux Network stack diagram (2024)

Show HN: Playwright Skill for Claude Code – Less context than playwright-MCP

How to Enter a City Like a King

Pointer Pointer (2012)

AWS Multiple Services Down in us-east-1

The Peach meme: On CRTs, pixels and signal quality (again)

Forth: The programming language that writes itself

State-based vs Signal-based rendering

Qt Group Buys IAR Systems Group

AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

Optimizing writes to OLAP using buffers (ClickHouse, Redpanda, MooseStack)

Fractal Imaginary Cubes

Novo Nordisk's Canadian Mistake

Major AWS Outage Happening

Introduction to reverse-engineering vintage synth firmware

Duke Nukem: Zero Hour N64 ROM Reverse-Engineering Project Hits 100%

Give Your Metrics an Expiry Date

Gleam OTP – Fault Tolerant Multicore Programs with Actors

Airliner hit by possible space debris

Major AWS outage takes down Fortnite, Alexa, Snapchat, and more

AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

Comments