Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.
Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.
Anyway, eventually Google was no longer happy with supporting this, and the corporate way of breaking up is "hey, we're gonna charge you 10x what you paid last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more.)
So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)
Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P
GCP is behind in market share, but has the incredible cheat advantage of just not being Amazon. Most retailers won't touch Amazon services with a ten foot pole, so the choice is GCP or Azure. Azure is way more painful for FOSS stacks, so GCP has its own area with only limited competition.
Honestly as a (very small) shareholder in Amazon, they should spin off AWS as a separate company. The Amazon brand is holding AWS back.
Big monopolists do not unlock more stock market value, they hoard it and stifle it.
However, I have seen many people flee from GCP because Google lacks customer focus, is quick to kill services, seems not to care about external users, and because people plain don't trust Google with their code, data or reputation.
0: https://chrpopov.medium.com/scaling-cloud-infrastructure-5c6...
1: https://eng.snap.com/monolith-to-multicloud-microservices-sn...
The general idea being that you're losing money due to opportunity cost.
Personally, I think you're better off not laying people off and having them work on the less (but still) profitable stuff. But I'm not in charge.
If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."
Show up at a meeting where a whole bunch of people appear to have wet themselves, and we’ll all agree not to mention it ever again…
They measure uptime using averages of "any part of the chain is even marginally working".
People, however, experience downtime as "any part of the chain is degraded".
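A toy sketch of that gap (made-up numbers, not how any provider actually computes its SLA): score each component by whether it responded at all and average, versus score each minute by whether the whole chain was fully healthy.

    # Toy illustration of "averaged component uptime" vs. "whole chain healthy".
    # Status per minute: 1.0 = healthy, 0.5 = degraded, 0.0 = down (made-up data).
    chain = {
        "dns":      [1.0, 1.0, 0.5, 1.0, 1.0, 1.0],
        "api":      [1.0, 0.5, 0.5, 1.0, 1.0, 1.0],
        "database": [1.0, 1.0, 1.0, 0.5, 1.0, 1.0],
    }
    minutes = len(next(iter(chain.values())))

    # Provider-style view: a component counts as "up" if it responded at all (> 0),
    # and the per-component percentages are averaged.
    provider_uptime = sum(
        sum(1 for s in statuses if s > 0) / minutes for statuses in chain.values()
    ) / len(chain)

    # User view: a minute only counts if every component in the chain was fully healthy.
    user_uptime = sum(
        1 for minute in zip(*chain.values()) if all(s == 1.0 for s in minute)
    ) / minutes

    print(f"provider-style uptime:   {provider_uptime:.0%}")  # 100%
    print(f"user-experienced uptime: {user_uptime:.0%}")      # 50%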
They make lawyers happy, and they stop intelligence services from accessing the associated resources.
For example, no one would even consider accessing data from a European region without the right paperwork.
But yeah, that's pretty hard and there are other reasons customers might want to explicitly choose the region.
Yes, within the same region. Doing stuff cross-region takes a little bit more effort and cost, so many skip it.
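For a concrete flavour of that "little bit more effort and cost": replicating even just an S3 bucket into a second region needs a second bucket, versioning on both sides, an IAM role, and a replication rule. A boto3 sketch with placeholder bucket and role names, not a drop-in config:

    import boto3

    s3 = boto3.client("s3")

    # Both buckets must have versioning enabled before replication can be configured.
    for bucket in ("my-app-data", "my-app-data-dr"):  # placeholder names
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )

    # Replication rule: copy every new object into the DR bucket in the other region.
    # The role must allow reading from the source and replicating into the destination.
    s3.put_bucket_replication(
        Bucket="my-app-data",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication",  # placeholder
            "Rules": [
                {
                    "ID": "replicate-to-dr-region",
                    "Priority": 1,
                    "Filter": {},  # empty filter = whole bucket
                    "Status": "Enabled",
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {"Bucket": "arn:aws:s3:::my-app-data-dr"},
                }
            ],
        },
    )

And that's only the data layer; failing compute and traffic over to the other region is where most of the extra effort and cost actually lands.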
It's ridiculous how everything is being stored in the cloud, even simple timers. It's long past time to move functionality back on-device, which would come with the advantage of making it easier to disconnect from big tech's capitalist surveillance state as well.
Assuming we're talking about hosting things for Internet users: my fiber internet connection has gone down multiple times, though it was restored relatively quickly. My power has gone out several times in the last year, with one storm knocking it out for nearly 24 hrs. I was asleep when it went out and didn't start the generator until it had been out for 3-4 hours already, far longer than my UPSes could hold up. I've had to do maintenance and updates, both physical and software.
All of those things contribute to downtime significantly higher than I see with my stuff running on Linode, Fly.io or AWS.
I run Proxmox and K3s at home and it makes things far more reliable, but it’s also extra overhead for me to maintain.
Most or all of those things could be mitigated at home, but at what cost?
If you had /two/ houses, in separate towns, you'd have better luck. Or, if you had cell as a backup.
Or: if you don't care about it being down for 12 hours.
These are the issues I've run into that have caused downtime in the last few years:
- 1x power outage: if I had set up automatic restart on power restore, it probably would have been down for 30-60 minutes; it ended up being a few hours (as I had to manually press the power button lol). Probably the longest non-self-inflicted issue.
- Twitch bot library issues: Just typical library bugs. Unrelated to self-hosting.
- IP changes: My IP actually barely ever changes, but I should set up DDNS (something like the sketch after this list). Fixable with self-hosting (but requires some amount of effort).
- Running out of disk space: Would be nice to be able to just increase it.
- Prooooooobably an internet outage or two, now that I think about it? Not enough that it's been a serious concern, though, as I can't think of a time that's actually happened. (Or I have a bad memory!)
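The DDNS item above is only a few lines of glue. A minimal sketch, assuming a Cloudflare-managed zone, with the zone ID, record ID and API token as placeholders (other DNS providers expose similar APIs):

    import requests

    # Placeholders: fill in your own zone/record IDs and an API token
    # scoped to DNS edits for that zone.
    ZONE_ID = "your-zone-id"
    RECORD_ID = "your-record-id"
    TOKEN = "your-api-token"
    HOSTNAME = "home.example.com"

    def update_ddns() -> None:
        # Ask a public "what is my IP" service for the current address.
        ip = requests.get("https://api.ipify.org", timeout=10).text.strip()

        # Point the A record at whatever the residential IP is right now.
        resp = requests.put(
            f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"type": "A", "name": HOSTNAME, "content": ip, "ttl": 300},
            timeout=10,
        )
        resp.raise_for_status()

    if __name__ == "__main__":
        update_ddns()  # run from cron every few minutes

Run it on a schedule and the hostname follows the residential IP whenever it changes.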
I think that's actually about it. I rely fairly heavily on my VPN+personal cloud as all my notes, todos, etc are synced through it (Joplin + Nextcloud), so I do notice and pay a lot of attention to any downtime, but this is pretty much all that's ever happened. It's remarkable how stable software/hardware can be. I'm sure I'll eventually have some hardware failure (actually, I upgraded my CPU 1-2 years ago because it turns out the Ryzen 1700 I was using before has some kind of extremely-infrequent issue with Linux that was causing crashes a couple times a month), but it's really nice.
To be clear, though, for an actual business project, I don't think this would be a good idea, mainly due to concerns around residential vs commercial IPs, arbitrary IPs connecting to your local network, etc that I don't fully pay attention to.
So blame humans even if an AI wrote some bad code.
Disagree; a human might be the cause/trigger, but the fault is pretty much always systemic. A whole lot of things have to happen for that last person to cause the problem.
Edit: and, more importantly, who governed the system, i.e. made decisions about maintenance, staffing, training, processes and so on.
"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."
https://health.aws.amazon.com/health/status