Snapchat, Ring, Roblox, Fortnite and more go down in huge internet outage: Latest updates https://www.the-independent.com/tech/snapchat-roblox-duoling...
To see more (from the first link): https://downdetector.com
It's still missing the one that earned me a phone call from a client.
If the storage service in your own datacenter goes down, how much keeps running?
The scale here is so large that they don't know the complete dependency tree until teams check in on what is out or not, growing this list. Of course, most of it is automated, but getting onto 'Affected Services' is not.
Edit: I can log in to one of the AWS accounts (I have a few different ones for different companies), but my personal one, which has a ".edu" email, is not logging in.
Actually, many companies are de facto forced to do that, for various reasons.
It means that in order to be certified you have to use providers that are in turn certified, or you will have to prove that you have all of your ducks in a row. That goes way beyond certain levels of redundancy, to the point that most companies just give up and use a cloud solution, because they have enough headaches just getting their internal processes aligned with various certification requirements.
Medical, banking, insurance, to name just a few, are heavily regulated, and to suggest that it 'just means certain levels of redundancy' is a very uninformed take.
What's more likely, for medical at least, is that if you make your own app, your customers will want to install it into their own AWS/Azure instance, and so you have to support that.
Everything depends on DNS....
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
>circa 2025: grayed out on Hacker News
United we fall.
What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades, if we're just going to do mainframe development anyway?
dynamodb.us-east-1.api.aws
> Signal is experiencing technical difficulties. We are working hard to restore service as quickly as possible.
Edit: Up and running again. Seriously, this thing already runs on 3 servers: a primary + backup, and a secondary in another datacenter/provider at Netcup. DNS with another anycast DNS provider called ClouDNS. Everything still way cheaper than AWS. The database is already replicated for reads. And I could switch to sharding if necessary. I can easily scale to 5, 7, whatever dedicated servers. But I do not have to right now. The primary is at 1% (sic!) load.
There really is no magic behind this. And you have to write your application in a distributable way anyway; you need to understand the concepts of statelessness, write-locking, etc. with AWS as well.
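To make that concrete, here's a minimal sketch of the read/write split described above, assuming a Postgres-style primary/replica pair (hostnames, credentials, and schema are made up):

    # Route writes to the primary and reads to the replica; DSNs are placeholders.
    import psycopg2

    PRIMARY_DSN = "host=primary.example.net dbname=app user=app"  # takes all writes
    REPLICA_DSN = "host=replica.example.net dbname=app user=app"  # read-only copy

    def get_conn(readonly: bool):
        """Read-only work goes to the replica, everything else to the primary."""
        return psycopg2.connect(REPLICA_DSN if readonly else PRIMARY_DSN)

    # Write path: must hit the primary, which owns the write locks.
    with get_conn(readonly=False) as conn, conn.cursor() as cur:
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (10, 42))

    # Read path: served by the replica, keeping the primary near-idle.
    with get_conn(readonly=True) as conn, conn.cursor() as cur:
        cur.execute("SELECT balance FROM accounts WHERE id = %s", (42,))
        print(cur.fetchone())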
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation: fewer than 10 companies now make up the bulk of the internet.
The problem here isn't caused by companies choosing one cloud provider over another. It's the economies of scale leading us to a few large companies in any sector.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
And secondly, how often do you create that backup and are you willing to lose the writes since the last backup?
That backup is absolutely something people should have, but I doubt those are ever used to bring a service back up. That would be a monumental failure of your hosting provider (colo/cloud/whatever)
No, but if some Amazon flunky decides to kill your account to protect the Amazon brand then you will at least survive, even if you'll lose some data.
Not companies; the protocols are decentralized, and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was (and is) a radical concept. We've lost a lot, unfortunately.
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
May I introduce you to our Lord and Slavemaster CGNAT?
Be it a company or a state, concentration of power that exceeds what its purpose needs to function by a large margin is always a sure way to spread corruption, create feedback loops and single points of failure, and buy everyone a ticket to some dystopian reality, with a level of certainty that beats anything an SLA will ever give us.
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
Make sure you let your investors know.
Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.
I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.
Again, of course there are exceptions, but advising people in general that they should think about what happens if AWS goes offline for good seems like poor engineering to me. It’s like designing every bridge in your country to handle a tomahawk missile strike.
I guess the reason why people are not doing it is because it hasn't been demonstrated it's worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think of having an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers suffices.
Also, I must add I am heavily influenced by a comment by Adrian Cockcroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least), so I should have probably reached for the salt dispenser.
You might as well say the entire NY + DC metro loses power and "never comes back up." What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
Of course there are cases where multi-cloud makes sense, but they are in the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS to go offline forever.
That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.
But it is true for almost everyone, as most people will trigger it after 30 minutes or more, which, plus the time to execute DR, is often going to be far less than the AWS resolution time.
Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.
This describes, what, under 1% of companies out there?
For most companies the cost of being multi-region is much more than just accepting the occasional outage.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (which affect most downstream AWS services) multiple times a year. Rarely the same root cause.
Not very many people realize that there are some services that still run only in us-east-1.
AWS (over-)reliance is insane...
Why people keep using "nation-state" term incorrectly in HN comments is beyond me...
It's like if you ran your cloud on an old Dell box in your closet while your parent company is offering to directly host it in AWS for free.
Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.
Anything on a metro-sized level is pretty much unheard of, and will be treated as seriously as a plane crash. It can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.
Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.
The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.
Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages last year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?
Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.
Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.
This is why "risk assessments" are a thing.
Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".
You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.
In fact running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, and each with its own costs attached. As the meme goes "one does not simply just run something on AWS"
I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management; and that's especially true when compared to TSOs.
That doesn't mean nobody knows what they do in our industry or that shit never hits the fan elsewhere, but I would argue that it's an outlier behaviour, whereas it's the norm in more secure industries.
> As the meme goes "one does not simply just run something on AWS"
The meme has currency for a reason, unfortunately.
---
That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.
Imagine if the cloud supplier was actually as important as the electricity supplier.
But since you mention it, there are instances of this and provisions for getting back up and running:
* https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...
Catastrophic data loss or lack of disaster recovery kills companies. AWS outages do not.
And regardless, electric service all over the world goes down for minutes or hours all the time.
What are those?
The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.
During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.
Ironically, our observability provider went down.
I presume this means you must not be working for a company running anything at scale on AWS.
But thinking the AWS SLA is guaranteed forever, and that everyone should put all their eggs in it because "everyone does it," is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.
Nothing is forever. Not the Roman empire, not the Inca empire, not China's dynasties, not US geopolitical supremacy. It's not a question of if but when. It doesn't need to come with a lot of suffering, but if we don't systematically organize for a humanity that spreads well-being for everyone in a systematically resilient way, we will get there through far more tragic consequences when this or that single point of failure finally falls.
One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.
And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.
Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more is "the internet is broken today".
Oh god, this. At my company, we found a bug recently with rds.describe_events, which we needed to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically not show up if you were filtering by instance name. Their recommended fix was to pull in all events for the past N minutes, and do client-side filtering.
This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.
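For anyone hitting the same thing, the suggested workaround looks roughly like this; a sketch only, with the instance name and time window made up:

    # Pull all recent RDS events and filter client-side, instead of trusting
    # the server-side SourceIdentifier filter. Instance name and window are placeholders.
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    def recent_events_for(instance_id: str, minutes: int = 30):
        events = []
        paginator = rds.get_paginator("describe_events")
        # Deliberately no SourceIdentifier filter here, per the workaround.
        for page in paginator.paginate(SourceType="db-instance", Duration=minutes):
            events.extend(page["Events"])
        # Client-side filtering by instance name.
        return [e for e in events if e.get("SourceIdentifier") == instance_id]

    for e in recent_events_for("my-writer-instance"):
        print(e["Date"], e["Message"])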
1. they are idiots
2. they do it on purpose and they think you are an idiot
For me, it just means that the moment you integrate with any API, you are basically their bitch (unless you implement one from every competitor in the market, at which point you can just as well do it yourself).

Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.
It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.
IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.
You know that's not true; us-east-1's last one was 2 years ago? But other services have bad days, and foundational ones drag others along.
At this point, being in any other region cuts your disaster exposure dramatically
AWS US-East 1 has many outages. Anything significant should account for that.
If your resilience plan is to trust a third party, that means you don't really care about going down, doesn't it?
Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.
What about if your account gets deleted? Or compromised and all your instances/services deleted?
I think the idea is to be able to have things continue running on not-AWS.
Step 2 is multi-AZ
Step 3 is multi-region
Step 4 is multi-cloud.
Each company can work on its next step, but most will not have positive EROI going from 2 to 3+
Lesson here is that your approach will depend on your industry and peers. Every market will have its own philosophy and requirements here.
But a large enough group of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.
And FWIW, "AWS is down"... only one region (out of 36) of AWS is down.
You can do the multi-region failover, though that's still possibly overkill for most.
Second, preparing for the disappearance of AWS is even more silly. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to a post-apocalyptic world where it's only cockroaches?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
That's the main topic that's been going through my mind lately, if you replace "my website" with "Wikimedia movement".
We need a far better social, juridical, and technical architecture regarding resilience, as hostile agendas are on the rise at all levels against sourced, trackable, global volunteer-community knowledge bases.
Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.
For small and medium-sized companies it's not easy to perform accurate due diligence.
Need to keep eyes peeled at all levels of the organization, as many of these enter through day-to-day…
Resilient systems work autonomously and can synchronize - but don't need to synchronize.
* Git is resilient.
* Native E-Mail clients - with local storage enabled - are somewhat resilient.
* A local package repository is - somewhat resilient.
* A local file-sharing app (not Warp/ Magic-Wormhole -> needs relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff should work locally and only optionally use the internet. But the web, that's the fragile, centralized, and weak point currently, and seems to be what you're referring to rather than the internet.
Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".
https://www.bbc.com/news/technology-57707530
That's because people trust and hope blindly. They believe IT is for saving money? It isn't. They coupled their cash registers to an American cloud service. They couldn't even take cash payments.
It usually gets worse when no outages happen for some time, because that increases blind trust.
Yes the Internet has stayed stable.
The Web, as defined by a bunch of servers running complex software, probably much less so.
Just the fact that it must necessarily be more complex means that it has more failure modes...
Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.
And partially working, or indicating that it works (when it doesn't), is usually even worse.
For you as a Mexican the end result is the same: AWS went away. And considering there already is a list of countries that cannot use AWS, GitHub, and a bunch of other "essential" services, it's not hard to imagine that that list might grow in the future.
I've actually had that.
It's likely that, as with many organizations, this scenario isn't something Reddit are well prepared for in terms of correct error messaging.
I believe the 2021 one was especially horrific because it affected their DNS service (Route 53) and the outage made writes to that service impossible. This made failovers not work, etc., so their prescribed multi-region setups didn't work.
But in the end, some things will have to synchronize their writes somewhere, right? So for DNS I could see how that ends up in a single region.
AWS is bound by the same rules as everyone else in the end... The only thing they have going for them is that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.
Every AWS "global" service, be it IAM, STS, CloudFormation, CloudFront, Route 53, or Organizations, has deep ties to control systems originally built only in us-east-1 (N. Virginia).
That's poor design, after all these years. They've had time to fix this.
Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication, maybe Route 53 for DNS or failover, and CloudFormation or ECS for orchestration...
If any of those choke because us-east-1’s internal control systems are degraded, you’re fucked. That’s not true regional independence.
Assuming you cannot alter requirements or SLAs, I could see how their technical solutions are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem
The control plane must know the truth about your account and that truth must be globally consistent. That’s where the trouble starts I guess.
I think my old-school system admin ethos is just different than theirs. It's not a who's wrong or right, just a difference in opinions on how it should be done I guess.
The ISP I work for requires us to design in a way that no single DC is a point of failure. It's just a difference in design methods, and I have to remember that the DC I work in is used completely differently than AWS's.
In the end, however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy), but they're just expensive and they don't look good on quarterly slides. It just takes the almighty dollar to implement, and that goes against big business: if it "works," it works, I guess.
AWS's model scales to millions of accounts because it hides complexity, sure, but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things can continue on, but then that's AWS causing complexity in your code. They are just shifting who is shouldering that little-known issue.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
one of the main points of cloud computing is scaling up and down frequently
Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd like to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-proven DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
>We recommend customers continue to retry any failed requests.
Both of which seem to pop up in post mortems for these widespread outages.
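Blind retries are part of how these incidents snowball; the usual mitigation is capped exponential backoff with jitter, roughly like this (all numbers are arbitrary):

    # Retry with capped exponential backoff and full jitter, so clients don't
    # hammer an already-degraded endpoint in lockstep. Numbers are arbitrary.
    import random
    import time

    def call_with_backoff(fn, max_attempts=5, base=0.2, cap=10.0):
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Sleep a random amount up to an exponentially growing cap.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))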
> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.
IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized; if you ever wondered why the signature key derivation has so many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
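For the curious, that's visible in the derivation itself: the date, region, and service are hashed into the signing key step by step, so a derived key is only valid for that one region and service. A sketch of the documented chain, with placeholder inputs:

    # SigV4 signing-key derivation as documented: region and service are baked
    # into the key. Secret, date, region, and service values are placeholders.
    import hashlib
    import hmac

    def _hmac(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    def derive_signing_key(secret: str, date: str, region: str, service: str) -> bytes:
        k_date = _hmac(("AWS4" + secret).encode("utf-8"), date)  # scope to the day
        k_region = _hmac(k_date, region)                         # scope to the region
        k_service = _hmac(k_region, service)                     # scope to the service
        return _hmac(k_service, "aws4_request")                  # terminal marker

    key = derive_signing_key("wJalr...EXAMPLEKEY", "20251020", "eu-west-1", "dynamodb")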
I think most sysadmins don't plan for an AWS outage. And economically it makes sense.
But it makes me wonder, is sysadmin a lost art?
I dunno, let me ask chatgpt. Hmmm, it said yes.
Yes. 15-20 years ago when I was still working on network-adjacent stuff I witnessed the shift to the devops movement.
To be clear, the fact that devops don't plan for AWS failures isn't an indication that they lack the sysadmin gene. Sysadmins will tell you very similar "X can never go down" or "not worth having a backup for service Y".
But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.
"Thankfully" nowadays containers and application hosting abstracts a lot of it back away. So today I'd be willing to say that devops are sufficient for small to medium companies (and dare I say more efficient?).
Signal was also down.
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
> There is one IAM control plane for all commercial AWS Regions, which is located in the US East (N. Virginia) Region. The IAM system then propagates configuration changes to the IAM data planes in every enabled AWS Region. The IAM data plane is essentially a read-only replica of the IAM control plane configuration data.
and I believe some global services (like certificate manager, etc.) also depend on the us-east-1 region
https://docs.aws.amazon.com/IAM/latest/UserGuide/disaster-re...
I recently ran into an issue where some Bedrock functionality was available in us-east-1 but not one of the other US regions.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around it for a day or 2, it's almost certainly easier and cheaper than building a multi-cloud complexity hellscape or dragging it all back on-prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.
I'm getting old, but this was the 1980's, not the 1800's.
In other words, to agree with your point about resilience:
A lot of the time even some really janky fallbacks will be enough.
But to somewhat disagree with your apparent support for AWS: While it is true this attitude means you can deal with AWS falling over now and again, it also strips away one of the main reasons people tend to give me for why they're in AWS in the first place - namely a belief in buying peace of mind and less devops complexity (a belief I'd argue is pure fiction, but that's a separate issue). If you accept that you in fact can survive just fine without absurd levels of uptime, you also gain a lot more flexibility in which options are viable to you.
The cost of maintaining a flawless eject button is indeed high, but so is the cost of picking a provider based on the notion that you don't need one if you're with them out of a misplaced belief in the availability they can provide, rather than based on how cost effectively they can deliver what you actually need.
(The counter joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)
Our applications and databases must have ultra-high availability. It can be achieved with applications and data platforms hosted in different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik Replicate
- HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.
See, the sales team from Google flew an executive out to the NBA Finals, the Azure sales team flew another executive out to the NFL Super Bowl, and the AWS team flew yet another executive out to the Wimbledon finals. And that's how you end up with a multi-cloud strategy.
I couldn't care less about having more vendor dinners when I know I am promising a falsehood that is extremely expensive and likely going to cost me my job or my credibility at some point.
Common Cause Failures and false redundancy are just all over the place.
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
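Not the setup described above, but for a flavor of the DNS layer: Route 53 failover routing pairs a primary and a secondary record with a health check, so the secondary answer is served when the primary's check fails. All zone IDs, names, addresses, and health-check IDs below are made up:

    # Sketch of DNS-layer failover with Route 53 failover routing. The primary
    # record is answered only while its health check passes; otherwise the
    # secondary is returned. Zone ID, names, IPs, and health-check ID are made up.
    import boto3

    r53 = boto3.client("route53")

    def upsert(name, ip, role, health_check_id=None):
        rrset = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": role.lower(),
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:
            rrset["HealthCheckId"] = health_check_id
        r53.change_resource_record_sets(
            HostedZoneId="Z0EXAMPLE",
            ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": rrset}]},
        )

    upsert("api.example.com.", "203.0.113.10", "PRIMARY", health_check_id="hc-primary-id")
    upsert("api.example.com.", "198.51.100.20", "SECONDARY")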