frontpage.

BERT Is Just a Single Text Diffusion Step

https://nathan.rs/posts/roberta-diffusion/
113•nathan-barry•1h ago•10 comments

Commodore 64 Ultimate

https://www.commodore.net/product-page/commodore-64-ultimate-basic-beige-batch1
46•guerrilla•1h ago•14 comments

DeepSeek OCR

https://github.com/deepseek-ai/DeepSeek-OCR
644•pierre•9h ago•165 comments

Space Elevator

https://neal.fun/space-elevator/
1015•kaonwarb•11h ago•216 comments

Servo v0.0.1 Released

https://github.com/servo/servo
238•undeveloper•3h ago•56 comments

Matrix Conference 2025 Highlights

https://element.io/blog/the-matrix-conference-a-seminal-moment-for-matrix/
88•Arathorn•3h ago•46 comments

How to stop Linux threads cleanly

https://mazzo.li/posts/stopping-linux-threads.html
26•signa11•5d ago•3 comments

Docker Systems Status: Full Service Disruption

https://www.dockerstatus.com/pages/incident/533c6539221ae15e3f000031/68f5e1c741c825463df7486c
260•l2dy•8h ago•102 comments

Anthropic and Cursor Spend This Much on Amazon Web Services

https://www.wheresyoured.at/costs/
45•isoprophlex•51m ago•16 comments

Modeling Others' Minds as Code

https://arxiv.org/abs/2510.01272
27•PaulHoule•2h ago•8 comments

Entire Linux Network stack diagram (2024)

https://zenodo.org/records/14179366
461•hhutw•12h ago•39 comments

Show HN: Playwright Skill for Claude Code – Less context than playwright-MCP

https://github.com/lackeyjb/playwright-skill
58•syntax-sherlock•3h ago•22 comments

How to Enter a City Like a King

https://worldhistory.substack.com/p/how-to-enter-a-city-like-a-king
34•crescit_eundo•1w ago•12 comments

Pointer Pointer (2012)

https://pointerpointer.com
177•surprisetalk•1w ago•19 comments

AWS Multiple Services Down in us-east-1

https://health.aws.amazon.com/health/status?ts=20251020
670•kondro•8h ago•264 comments

The Peach meme: On CRTs, pixels and signal quality (again)

https://www.datagubbe.se/crt2/
39•zdw•1w ago•11 comments

Forth: The programming language that writes itself

https://ratfactor.com/forth/the_programming_language_that_writes_itself.html
265•suioir•15h ago•116 comments

State-based vs Signal-based rendering

https://jovidecroock.com/blog/state-vs-signals/
40•mfbx9da4•6h ago•34 comments

Qt Group Buys IAR Systems Group

https://www.qt.io/stock/qt-completes-the-recommended-public-cash-offer-to-the-shareholders-of-iar...
18•shrimp-chimp•3h ago•4 comments

AWS Outage: A Single Cloud Region Shouldn't Take Down the World. But It Did

https://faun.dev/c/news/devopslinks/aws-outage-a-single-cloud-region-shouldnt-take-down-the-world...
257•eon01•3h ago•138 comments

Optimizing writes to OLAP using buffers (ClickHouse, Redpanda, MooseStack)

https://www.fiveonefour.com/blog/optimizing-writes-to-olap-using-buffers
19•oatsandsugar•5d ago•7 comments

Fractal Imaginary Cubes

https://www.i.h.kyoto-u.ac.jp/users/tsuiki/icube/fractal/index-e.html
34•strstr•1w ago•3 comments

Novo Nordisk's Canadian Mistake

https://www.science.org/content/blog-post/novo-nordisk-s-canadian-mistake
396•jbm•19h ago•207 comments

Major AWS Outage Happening

https://old.reddit.com/r/aws/comments/1obd3lx/dynamodb_down_useast1/
1018•vvoyer•8h ago•528 comments

Introduction to reverse-engineering vintage synth firmware

https://ajxs.me/blog/Introduction_to_Reverse-Engineering_Vintage_Synth_Firmware.html
146•jmillikin•13h ago•22 comments

Duke Nukem: Zero Hour N64 ROM Reverse-Engineering Project Hits 100%

https://github.com/Gillou68310/DukeNukemZeroHour
209•birdculture•19h ago•89 comments

Give Your Metrics an Expiry Date

https://adrianhoward.com/posts/give-your-metrics-an-expiry-date/
57•adrianhoward•5d ago•18 comments

Gleam OTP – Fault Tolerant Multicore Programs with Actors

https://github.com/gleam-lang/otp
165•TheWiggles•17h ago•70 comments

Airliner hit by possible space debris

https://avbrief.com/united-max-hit-by-falling-object-at-36000-feet/
372•d_silin•22h ago•196 comments

Major AWS outage takes down Fortnite, Alexa, Snapchat, and more

https://www.theverge.com/news/802486/aws-outage-alexa-fortnite-snapchat-offline
200•codebolt•7h ago•79 comments

AWS Multiple Services Down in us-east-1

https://health.aws.amazon.com/health/status?ts=20251020
670•kondro•8h ago

Comments

atymic•8h ago
https://news.ycombinator.com/item?id=45640754
empressplay•7h ago
Can't check out on Amazon.com.au, gives error page
kondro•7h ago
This link works fine from Australia for me.
askonomm•7h ago
Docker is also down.
1659447091•7h ago
Also:

Snapchat, Ring, Roblox, Fortnite and more go down in huge internet outage: Latest updates https://www.the-independent.com/tech/snapchat-roblox-duoling...

To see more (from the first link): https://downdetector.com

kalleboo•7h ago
It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.

It's still missing the one that earned me a phone call from a client.

zenexer•6h ago
It's seemingly everything. SES was the first one that I noticed, but from what I can tell, all services are impacted.
hvb2•5h ago
In AWS, if you take out one of DynamoDB, S3 or Lambda, you're going to be in a world of pain. Any architecture will likely use those somewhere, including all the other services built on top.

If your storage service goes down in your own datacenter, how much remains running?

goatking•1m ago
Agreed, but you can put EC2 on that list as well
mlrtime•5h ago
When these major issues come up, all they have is symptoms, not causes. Maybe not until the Dynamo on-call comes on and says it's down; then everyone at least knows the reason for their team's outage.

The scale here is so large that they don't know the complete dependency tree until teams check in on what is out or not, growing this list. Of course most of it is automated, but getting onto 'Affected Services' is not.

DataDaemon•7h ago
But but this is a cloud, it should exist in the cloud.
whatsupdog•7h ago
I can't log in to my AWS account. And the "my account" page on the regular Amazon website is blank on Firefox, but opens in Chrome.

Edit: I can log in to one of the AWS accounts (I have a few different ones for different companies), but my personal one, which has a ".edu" email, is not logging in.

gramakri2•7h ago
npm registry also down
thomas_witt•7h ago
Seems to be really only in us-east-1, DynamoDB is performing fine in production on eu-central-1.
glemmaPaul•7h ago
LOL. Make one DB service a central point of failure, charge gold for small compute instances, rage about needing Multi-AZ, and push the costs onto the developer/organization. But now it fails at the region level, so are we going to need multi-country setups for simple small applications?
philipallstar•6h ago
Just don't buy it if you don't want it. No one is forced to buy this stuff.
benterix•6h ago
> No one is forced to buy this stuff.

Actually, many companies are de facto forced to do that, for various reasons.

philipallstar•6h ago
How so?
jacquesm•5h ago
Certification, for one. Governments will mandate 'x, y and/or z' and only the big providers are able to deliver.
mlrtime•5h ago
That is not the same as mandating AWS, it just means certain levels of redundancy. There are no requirements to be in the cloud.
jacquesm•4h ago
No, that's not what it means.

It means that in order to be certified you have to use providers that in turn are certified or you will have to prove that you have all of your ducks in a row and that goes way beyond certain levels of redundancy, to the point that most companies just give up and use a cloud solution because they have enough headaches just getting their internal processes aligned with various certification requirements.

Medical, banking and insurance, to name just a few, are heavily regulated, and to suggest that it 'just means certain levels of redundancy' is a very uninformed take.

philipallstar•2h ago
It is definitely not true that only big companies can do this. It is true that every regulation added adds to the power of big companies, which explains some regulation, but it is definitely possible to do a lot of things yourself and evidence that you've done it.

What's more likely, for medical at least, is that if you make your own app, your customers will want to install it into their AWS/Azure instance, and so you have to support them.

63stack•5h ago
Security/compliance theater for one
philipallstar•5h ago
That's not a company being forced to, though?
aembleton•5h ago
It is if they want to win contracts
philipallstar•4h ago
I don't think that's true. I think a company can choose to outsource that stuff to a cloud provider or not, but they can still choose.
DrScientist•6h ago
According to their status page the fault was in DNS lookup of the Dynamo services.

Everything depends on DNS....

mlrtime•5h ago
Dynamo had an outage last year if I recall correctly.
xtracto•2h ago
Lol ... of course it's DNS's fault again.
Hamuko•6h ago
I thought it was a pretty well-known issue that the rest of AWS depends on us-east-1 working. Basically any other AWS region can get hit by a meteor without bringing down everything else – except us-east-1.
yellow_lead•5h ago
But it seems like only us-east-1 is down today, is that right?
dikei•5h ago
Some global services have their control plane located only in `us-east-1`, without which they become read-only at best, or even fail outright.

https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...

KettleLaugh•6h ago
We may be distributed, but we die united...
glemmaPaul•5h ago
AWS Communist Cloud
nyrp•4h ago
>circa 2005: Score:5, Funny on Slashdot

>circa 2025: grayed out on Hacker News

hangsi•5h ago
Divided we stand,

United we fall.

XCSme•6h ago
Yeah, noticed from Zoom: https://www.zoomstatus.com/incidents/yy70hmbp61r9
igleria•6h ago
funny that even if we have our app running fine in AWS europe, we are affected as developers because of npm/docker/etc being down. oh well.
dijit•6h ago
AWS has made the internet into a single point of failure.

What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades, if we're just going to do mainframe development anyway?

voidUpdate•5h ago
To be fair, there is another point of failure, Cloudflare. It seems like half the internet goes down when Cloudflare has one of their moments
polaris64•6h ago
It looks like DNS has been restored: dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189
miyuru•5h ago
I wonder if the new endpoint was affected as well.

dynamodb.us-east-1.api.aws
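
A quick way to keep an eye on this from a script, as a minimal sketch using only the Python standard library (the two hostnames are the DynamoDB endpoints mentioned above; everything else is illustrative):

    # Resolve both DynamoDB endpoints and print the answers, or the DNS error.
    import socket

    for host in ("dynamodb.us-east-1.amazonaws.com", "dynamodb.us-east-1.api.aws"):
        try:
            addrs = sorted({info[4][0] for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)})
            print(f"{host} -> {', '.join(addrs)}")
        except socket.gaierror as exc:
            print(f"{host} -> DNS lookup failed: {exc}")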

killingtime74•6h ago
Signal is down for me
miduil•6h ago
Yes. https://status.signal.org/

    >  Signal is experiencing technical difficulties. We are working hard to restore service as quickly as possible.
Edit: Up and running again.
gbalduzzi•6h ago
Twilio is down worldwide: https://status.twilio.com/
croemer•6h ago
Coinbase down as well
croemer•6h ago
Related thread: https://news.ycombinator.com/item?id=45640772
martinheidegger•6h ago
> Designed to provide 99.999% durability and 99.999% availability

Still designed, not implemented.
gritzko•6h ago
idiocracy_window_view.jpg
jpfromlondon•6h ago
This will always be a risk when sharecropping.
Aldipower•6h ago
My minor 2000-user web app hosted on Hetzner still works, FYI. :-P
mlrtime•5h ago
But how are you going to web scale it!? /s
Aldipower•4h ago
Web scale? It is a _web_ app, so it is already web scaled, hehe.

Seriously, this thing already runs on 3 servers: a primary + backup and a secondary in another datacenter/provider at Netcup. DNS is with another AnycastDNS provider called ClouDNS. Everything is still way cheaper than AWS. The database is already replicated for reads. And I could switch to sharding if necessary. I can easily scale to 5, 7, whatever dedicated servers. But I do not have to right now. The primary is at 1% (sic!) load.

There really is no magic behind this. And you have to write your application in a distributable way anyway; you need to understand the concepts of statelessness, write-locking, etc. with AWS too.
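
The read-replica part of a setup like this doesn't need much machinery either. A minimal sketch, assuming Postgres with psycopg2 and purely hypothetical hostnames and credentials:

    # Writes go to the primary, reads to a replica, with a fallback to the primary.
    # All connection details are placeholders; a real app would pool connections.
    import psycopg2

    PRIMARY_DSN = "host=primary.example.net dbname=app user=app password=secret"
    REPLICA_DSN = "host=replica.example.net dbname=app user=app password=secret"

    def run_write(sql, params=()):
        with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
            cur.execute(sql, params)  # committed when the 'with' block exits

    def run_read(sql, params=()):
        try:
            conn = psycopg2.connect(REPLICA_DSN)
        except psycopg2.OperationalError:
            conn = psycopg2.connect(PRIMARY_DSN)  # replica down: read from the primary
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()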

aembleton•5h ago
Right up until the DNS fails
Aldipower•4h ago
I am using ClouDNS. That is an AnycastDNS provider. My hopes are that they are more reliable. But yeah, it is still DNS and it will fail. ;-)
throw-10-13•6h ago
this is why you avoid us-east-1
BaudouinVH•6h ago
canva.com was down until a few minutes ago.
tosh•5h ago
SES and signal seem to work again
jacquesm•5h ago
Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'

Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.

hvb2•5h ago
> The internet got its main strengths from the fact that it was completely decentralized.

Decentralized in terms of many companies making up the internet. Yes, we've seen heavy consolidation, with fewer than 10 companies now making up the bulk of the internet.

The problem here isn't caused by companies choosing one cloud provider over another. It's the economies of scale leading us to a few large companies in any sector.

jacquesm•5h ago
I think one reason is that people are just bad at statistics. Chance of materialization * impact = small. Sure. Over a short enough time that's true for any kind of risk. But companies tend to live for years, decades even and sometimes longer than that. If we're going to put all of those precious eggs in one basket, as long as the basket is substantially stronger than the eggs we're fine, right? Until the day someone drops the basket. And over a long enough time span all risks eventually materialize. So we're playing this game, and usually we come out ahead.

But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.

hvb2•5h ago
Absolutely, but the cost of perfection (100% uptime in this case) is infinite.

As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
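
For a DynamoDB-backed service that failover can be fairly small, assuming the table is a global table replicated to a second region. A minimal sketch with boto3; the table name, key and region list are illustrative:

    # Try the home region first, then fall back to a replica region.
    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    REGIONS = ["us-east-1", "us-west-2"]  # home region first, replica second

    def get_item(table_name, key):
        last_error = None
        for region in REGIONS:
            table = boto3.resource("dynamodb", region_name=region).Table(table_name)
            try:
                return table.get_item(Key=key).get("Item")
            except (ClientError, EndpointConnectionError) as exc:
                last_error = exc  # this region is unhealthy, try the next one
        raise last_error

    # item = get_item("orders", {"order_id": "1234"})  # hypothetical table and key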

jacquesm•5h ago
Often simply the lack of a backup outside of the main cloud account.
hvb2•5h ago
Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?

And secondly, how often do you create that backup and are you willing to lose the writes since the last backup?

That backup is absolutely something people should have, but I doubt those are ever used to bring a service back up. That would be a monumental failure of your hosting provider (colo/cloud/whatever)

jacquesm•4h ago
> Sure, but on a typical outage how likely is it that you'll have that all up and running before the outage is resolved?

Not, but if some Amazon flunky decides to kill your account to protect the Amazon brand then you will at least survive, even if you'll lose some data.

lentil_soup•5h ago
> Decentralized in terms of many companies making up the internet

Not companies; the protocols are decentralized, and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests, which was/is a radical concept. We've lost a lot, unfortunately.

hvb2•5h ago
No we've not lost that at all. Nobody prevents you from doing that.

We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.

IlikeKitties•5h ago
> No we've not lost that at all. Nobody prevents you from doing that.

May I introduce you to our Lord and Slavemaster CGNAT?

yupyupyups•4h ago
That depends on who your ISP is.
otterley•1h ago
There’s more than one way to get a server on the Internet. You can pay a local data center to put your machine in one of their racks.
psychoslave•3h ago
Well, that is exactly what resilient distributed networks are about. Not so much the technical details we implement them through, but the social relationships and the balance of political decision power.

Be it a company or a state, a concentration of power that exceeds what its purpose needs to function by a large margin is always a sure way to spread corruption, create feedback loops around single points of failure, and buy everyone a ticket to some dystopian reality with a level of certainty that beats anything an SLA will ever give us.

padjo•5h ago
Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years. I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.
jacquesm•5h ago
Thank you for illustrating my point. You didn't even bother to read the second paragraph.
shawabawa3•5h ago
> Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.

my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt

jacquesm•5h ago
Is that also your contingency plan for 'user uploads objectionable content and alerts Amazon to get your account shut down'?

Make sure you let your investors know.

padjo•4h ago
If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.
jacquesm•4h ago
> If your mitigation for that risk is to have an elaborate plan to move to a different cloud provider, where the same problem can just happen again, then you’re doing an awful job of risk management.

Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.

matsemann•4h ago
"Is that also your contingency plan if unrelated X happens", and "make sure your investors know" are also not exactly good faith or without snark, mind you.

I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.

jacquesm•3h ago
The point is, many companies do need those nines and they count on AWS to deliver and there is no backup plan if they don't. And that's the thing I take issue with, AWS is not so reliable that you no longer need backups.
padjo•2h ago
My experience is that very few companies actually need those 9s. A company might say they need them, but if you dig in it turns out the impact on the business of dropping a 9 (or two) is far less than the cost of developing and maintaining an elaborate multi-cloud backup plan that will both actually work when needed and be fast enough to maintain the desired availability.

Again, of course there are exceptions, but advising people in general that they should think about what happens if AWS goes offline for good seems like poor engineering to me. It’s like designing every bridge in your country to handle a tomahawk missile strike.

chanux•5h ago
I get you. I am with you. But isn't money/resources always a constraint when it comes to having a solid backup solution?

I guess the reason people are not doing it is that it hasn't been demonstrated to be worth it, yet!

I've got to admit though, whenever I hear about having a backup plan I think of an apples-to-apples copy elsewhere, which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers suffices.

Also, I must add I am heavily influenced by a comment by Adrian Cockcroft on why going multi-cloud isn't worth it. He worked for AWS (at the time at least), so I should probably have reached for the salt dispenser.

mlrtime•5h ago
We all read it... AWS not coming back up is your point on not having a backup plan?

You might as well say the entire NY + DC metro loses power and "never comes back up". What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.

I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that.)

antihero•5h ago
My website running on an old laptop in my cupboard is doing just fine.
whatevaa•5h ago
When your laptop dies it's gonna be a pretty long outage too.
antihero•1h ago
I will find another one
api•4h ago
I have this theory of something I call “importance radiation.”

An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.

Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.

jacquesm•4h ago
That's a great concept. It explains a lot, actually!
davedx•5h ago
> I can almost guarantee that whatever plans you have won’t get you fully operational faster than just waiting for AWS to fix it.

Absurd claim.

Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.

afro88•5h ago
> If your company is in anything finance-adjacent or critical infrastructure

GP said:

> most companies

Most companies aren't finance-adjacent or critical infrastructure

padjo•4h ago
It’s not absurd, I’ve seen it happen. Company executes on their DR plan due to AWS outage, AWS is back before DR is complete, DR has to be aborted, service is down longer than if they’d just waited.

Of course there are cases where multi-cloud makes sense, but they are in the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS to go offline forever.

philipallstar•4h ago
> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing

That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.

But it is true for almost everyone, as most people will trigger it after 30 minutes or more, which, plus the time to execute DR, is often going to be far less than the AWS resolution time.

Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.

ants_everywhere•3h ago
I thought we were talking about an AWS outage, not just the outage of a single region? A single region can go out for many reasons, including but not limited to war.
malfist•3h ago
I worked for a Fortune 500; twice a year we practiced our "catastrophe outage" plan. The target SLA for recovering from a major cloud provider outage was 48 hours.
kelnos•23m ago
> If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.

This describes, what, under 1% of companies out there?

For most companies the cost of being multi-region is much more than just accepting with the occasional outage.

mlrtime•5h ago
>Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (that affect most downstream aws services) multiple times a year. Usually never the same root cause.

Not very many people realize that there are some services that still run only in us-east-1.

energy123•5h ago
It happens 2-3x a year during peacetime. Tail events are not homogeneously distributed across time.
mlrtime•5h ago
I don't understand, peacetime?
mbreese•4h ago
Peacetime = When not actively under a sustained attack by a nation-state actor. The implication being, if you expect there to be a “wartime”, you should also expect AWS cloud outages to be more frequent during a wartime.
mrits•4h ago
It makes a lot more sense if they had a typo of peak
smaudet•3h ago
Don't forget stuff like natural disasters and power failures...or just a very adventurous squirrel.

AWS (over-)reliance is insane...

lII1lIlI11ll•2h ago
What about being actively attacked by a multinational state or an empire? Does that count or not?

Why people keep using the term "nation-state" incorrectly in HN comments is beyond me...

phpnode•1h ago
It sounds more technical than “country” and is therefore better
kelipso•39m ago
To me it sounds more like saying regime instead of government, gives off a sense of distance and danger.
waisbrot•16m ago
I think people generally mean "state", but in the US-centric HN community that word is ambiguous and will generally be interpreted the wrong way. Maybe "sovereign state" would work?
mbreese•13m ago
It could be a multinational state actor, but the term nation-state is the most commonly used, regardless of accuracy. You can argue over whether or not the term itself is accurate, but you still understood the meaning.
__MatrixMan__•3h ago
It's a different kind of outage when the government disconnects you from the internet. Happens all the time, just not yet in the US.
vrc•4h ago
Well technically AWS has never failed in wartime.
joelthelion•4h ago
Call it the aws holiday. Most other companies will be down anyway. It's very likely that your company can afford to be down for a few hours, too.
chii•3h ago
imagine if the electricity supplier took that stance.
ahoka•3h ago
Isn't that basically Texas?
SecretDreams•3h ago
Texas is like if you ran your cloud entirely in SharePoint.
jofzar•3h ago
Let's not insult SharePoint like that.

It's like if you ran your cloud on an old Dell box in your closet while your parent company is offering to directly host it in AWS for free.

sgarland•2h ago
Also, every time your cloud went down, the parent company begged you to reconsider, explaining that all they need you to do is remove the disturbingly large cobwebs so they can migrate it. You tell them that to do so would violate your strongly-held beliefs, and when they stare at you in bewilderment, you yell “FREEDOM!” while rolling armadillos at them like they’re bowling balls.
malfist•3h ago
But that is the stance for a lot of electrical utilities. Sometimes weather or a car wreck takes out power, and since it's too expensive to have spares everywhere, sometimes you have to wait a few hours for a spare to be brought in.
yuliyp•2h ago
No, that's not the stance for electrical utilities (at least in most developed countries, including the US): the vast majority of weather events cause localized outages. The grid as a whole has redundancies built in; distribution to residential (and some industrial) customers does not. The grid expects failures of some power plants, transmission lines, etc. and can adapt with reserve power or, in very rare cases, with partial degradation (i.e. rolling blackouts). It doesn't go down fully.
JoyfulTurkey•1h ago
Spain and Portugal had a massive power outage this spring, no?
formerly_proven•1h ago
Yeah, and it has a 30 page Wikipedia article with 161 sources (https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...). Does that seem like a common occurrence?
crote•58s ago
> Sometimes weather or a car wreck takes out power

Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.

Anything on a metro-sized level is pretty much unheard of, and will be treated as serious as a plane crash. They can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.

Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.

awillen•3h ago
That's the wrong analogy though. We're not talking about the supplier - I'm sure Amazon is doing its damnedest to make sure that AWS isn't going down.

The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.

hnlmorg•3h ago
Utility companies do not have redundancy for every part of their infrastructure either. Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages last year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?

Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.

Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.

This is why "risk assessments" are a thing.

pyrale•2h ago
> Hence why severe weather or other unexpected failures can cause loss of power, internet or even running water.

Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".

hnlmorg•2h ago
> Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".

You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.

In fact running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, and each with its own costs attached. As the meme goes "one does not simply just run something on AWS"

pyrale•58m ago
> nobody of any competency runs stuff on AWS complacently.

I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management; and that's especially true when compared to TSOs.

That doesn't mean nobody knows what they do in our industry or that shit never hits the fan elsewhere, but I would argue that it's an outlier behaviour, whereas it's the norm in more secure industries.

> As the meme goes "one does not simply just run something on AWS"

The meme has currency for a reason, unfortunately.

---

That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.

throw0101d•2h ago
> imagine if the electricity supplier too that stance.

Imagine if the cloud supplier was actually as important as the electricity supplier.

But since you mention it, there are instances of this and provisions for getting back up and running:

* https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...

* https://en.wikipedia.org/wiki/Northeast_blackout_of_2003

DiggyJohnson•43m ago
The electric grid is much more important than most private sector software projects by an order of magnitude.

Catastrophic data loss or lack of disaster recovery kills companies. AWS outages do not.

umeshunni•27m ago
What if the electricity grid depends on some AWS service?
quaintdev•6m ago
That would be circular dependency.
kelnos•27m ago
Fortunately nearly all services running on AWS aren't as important as the electric utility, so this argument is not particularly relevant.

And regardless, electric service all over the world goes down for minutes or hours all the time.

yla92•2h ago
> there are some services that still run only in us-east-1.

What are those ?

cyberax•23m ago
> Not very many people realize that there are some services that still run only in us-east-1.

The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.

During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.

Ironically, our observability provider went down.

lucideer•4h ago
> to the tune of a few hours every 5-10 years

I presume this means you must not be working for a company running anything at scale on AWS.

skywhopper•4h ago
That is the vast majority of customers on AWS.
lucideer•4h ago
Ha ha, fair, fair.
sreekanth850•4h ago
Depends on how serious you are with SLAs.
kelseydh•4h ago
It seems like this can be mostly avoided by not using us-east-1.
psychoslave•4h ago
This is planning the future based on the best things in the past. Not completely irrational, and if you can't afford a plan B, okayish.

But thinking the AWS SLA is guaranteed forever, and that everyone should put all their eggs in it because "everyone does it", is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And actually AWS should not necessarily be plan A.

Nothing is forever. Not the Roman empire, not the Inca empire, not the Chinese dynasties, not US geopolitical supremacy. It's not a question of if but when. It doesn't need to come through a lot of suffering, but if we don't systematically organise for a humanity which spreads well-being for everyone in a systematically resilient way, we will get there through a lot more tragic consequences when this or that single point of failure finally falls.

Waterluvian•3h ago
Without having a well-defined risk profile that they’re designing to satisfy, everyone’s just kind of shooting from the hip with their opinions on what’s too much or too little.
AndrewThrowaway•2h ago
It is like discussing zombie apocalypse. People who are invested in bunkers will hardly understand those who are just choosing death over living in those bunkers for a month longer.
hnlmorg•2h ago
Exactly this!

One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.

And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.

DiffEq•3h ago
Maybe; but Parler had no plan and are now nothing... because AWS decided to shut them off. Always have a good plan...
nucleardog•3h ago
> Planning for an AWS outage is a complete waste of time and energy for most companies. Yes it does happen but very rarely to the tune of a few hours every 5-10 years.

Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more as "the internet is broken today".

YouAreWRONGtoo•3h ago
More like 2-3 times per year and this is not counting smaller outages or simply APIs that don't do what they document.
sgarland•2h ago
> APIs that don’t do what they document

Oh god, this. At my company, we found a bug recently with rds.describe_events, which we needed to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically not show up if you were filtering by instance name. Their recommended fix was to pull in all events for the past N minutes, and do client-side filtering.

This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.
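
The client-side workaround they suggested ends up looking roughly like this. A sketch with boto3; the instance identifier and the 30-minute window are illustrative, not the actual values involved:

    # Fetch all recent RDS events and filter client-side, instead of passing
    # SourceIdentifier to the API (which was intermittently dropping events).
    import boto3

    rds = boto3.client("rds", region_name="us-east-1")
    instance_id = "myapp-writer-instance"  # hypothetical instance name

    events = []
    paginator = rds.get_paginator("describe_events")
    for page in paginator.paginate(SourceType="db-instance", Duration=30):  # last 30 minutes
        events.extend(e for e in page["Events"] if e.get("SourceIdentifier") == instance_id)

    for event in events:
        print(event["Date"], event["Message"])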

YouAreWRONGtoo•3m ago
While I don't know your specific case, I have seen it happen often enough that there are only two possibilities left:

  1. they are idiots 
  2. they do it on purpose and they think you are an idiot
For me, it just means that the moment you integrate with any API, you are basically their bitch (unless you implement one from every competitor in the market, at which point you can just as well do it yourself).
kxrm•3h ago
Completely agree, but I think companies need to be aware of the AWS risks with third parties as well. Many services were unable to communicate with customers.

Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.

snowwrestler•3h ago
I would take the opposite view, the little AWS outages are an opportunity to test your disaster recovery plan, which is worth doing even if it takes a little time.

It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.

SteveNuts•1h ago
> Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.

Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.

IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.

zaphirplane•3h ago
> tune of a few hours every 5-10 years

You know that's not true; us-east-1's last one was 2 years ago. But other services have bad days, and foundational ones drag others along.

Esophagus4•2h ago
It’s even worse than that - us-east-1 is so overloaded, and they have roughly 5+ outages per year on different services. They don’t publish outage numbers so it’s hard to tell.

At this point, being in any other region cuts your disaster exposure dramatically

coffeebeqn•29m ago
We don't deploy to us-east-1, but still, so many of our API partners and 3rd-party services were down that a large chunk of our service was effectively down. Including stuff like many dev tools.
Spooky23•3h ago
Sure, if your blog or whatever goes down, who cares. But otherwise you should be thinking about disaster planning and resilience.

AWS US-East 1 has many outages. Anything significant should account for that.

lumost•2h ago
This depends on the scale of the company. A fully functional DR plan probably costs 10% of the infra spend + people time for operationalization. For most small/medium businesses it's a waste to plan for a once-per-3-10-year event. If you're a large or legacy firm the above costs are trivial, and in some cases it may become a fiduciary risk not to take it seriously.
jacquesm•2h ago
And if you're in a regulated industry it might even be a hard requirement.
pyrale•2h ago
What if AWS dumps you because your country/company didn't please the commander in chief enough?

If your resilience plan is to trust a third party, that means you don't really care about going down, doesn't it?

Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.

temperceve•2h ago
Depends on the business. For 99% of them this is for sure the right answer.
throw0101d•2h ago
> Planning for an AWS outage […]

What about if your account gets deleted? Or compromised and all your instances/services deleted?

I think the idea is to be able to have things continue running on not-AWS.

maerF0x0•1h ago
Using AWS instead of a server in the closet is step 1.

Step 2 is multi-AZ

Step 3 is multi-region

Step 4 is multi-cloud.

Each company can work on its next step, but most will not have positive EROI going from 2 to 3+.
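
Even at step 4, the application-side piece can stay small if each deployment exposes a health endpoint. A minimal sketch using only the standard library; both base URLs and the /health path are placeholders:

    # Probe each deployment and use the first one that answers healthily.
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://app.aws-deployment.example.com",
        "https://app.other-cloud.example.com",
    ]

    def pick_endpoint(timeout=2):
        for base in ENDPOINTS:
            try:
                with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
                    if resp.status == 200:
                        return base
            except (urllib.error.URLError, TimeoutError):
                continue  # unhealthy or unreachable, try the next deployment
        raise RuntimeError("no healthy endpoint available")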

joshuat•21m ago
Multi-cloud is a hole in which you can burn money and not much more
dangoldin•1h ago
I worked at an adtech company where we invested a bit in HA across AZ + regions. Lo and behold there was an AWS outage and we stayed up. Too bad our customers didn't and we still took the revenue hit.

Lesson here is that your approach will depend on your industry and peers. Every market will have its own philosophy and requirements here.

delfinom•1h ago
In before a meteor strike takes out an AWS region and they can't restore data.
coffeebeqn•37m ago
We started that planning process at my previous company after one such outage but it became clear very quickly that the costs of such resilience would be 2-3x hosting costs in perpetuity and who knows how many manhours. Being down for an hour was a lot more palatable to everyone
indoordin0saur•24m ago
Been doing this for about 8 years and I've worked through a serious AWS disruption at least 5 times in that time.
raincole•5h ago
Most companies just aren't important enough to worry about "AWS never come back up." Planning for this case is just like planning for a terrorist blowing up your entire office. If you're the Pentagon sure you'd better have a plan for that. But most companies are not the Pentagon.
Frieren•4h ago
> Most companies just aren't important enough to worry about "AWS never come back up."

But a large enough number of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium-sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.

We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.

raincole•4h ago
Well, I agree. I kinda think the AI apocalypse would not be Skynet killing us, but malware pushed onto all the Teslas, causing a million crashes tomorrow morning.
swader999•4h ago
Battery fires.
coffeebeqn•24m ago
Many have a hard dependency on AWS && Google && Microsoft!
paulddraper•2h ago
Exactly.

And FWIW, "AWS is down"....only one region (out of 36) of AWS is down.

You can do the multi-region failover, though that's still possibly overkill for most.

anal_reactor•4h ago
First, planning for an AWS outage is pointless. Unless you provide a service of national security or something, your customers are going to understand that when there's a global internet outage your service doesn't work either. The cost of maintaining a working failover across multiple cloud providers is just too high compared to the potential benefits. It's astonishing that so few engineers understand that maintaining a technically beautiful solution costs time and money, which might not make a justified business case.

Second, preparing for the disappearance of AWS is even more silly. The chance that it will happen is orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.

Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to a post-apocalyptic world where it's only cockroaches?

jacquesm•4h ago
> Let me ask you: how do you prepare your website for the complete collapse of western society?

How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?

> Second, preparing for the disappearance of AWS is even more silly.

What's silly is not thinking ahead.

psychoslave•3h ago
>Let me ask you: how do you prepare your website for the complete collapse of western society?

That's the main topic going through my mind lately, if you replace "my website" with "the Wikimedia movement".

We need a far better social, juridical and technical architecture for resilience, as hostile agendas are on the rise at all levels against sourced, trackable, global volunteer community knowledge bases.

csomar•4h ago
> Now imagine for a bit that it will never come back up.

Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.

pmontra•4h ago
In the case of a customer of mine, the AWS outage manifested itself as Twilio failing to deliver SMSes. The fallback plan has been disabling the rotation of our two SMS providers and sending all messages with the remaining one. But what if the other one had something on AWS too? Or maybe both of them have something else vital on Azure, or Google Cloud, which will fail next week and stop our service. Who knows?

For small and medium-sized companies it's not easy to perform accurate due diligence.
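
The rotation-with-fallback described above can be sketched in a few lines; the provider send functions here are hypothetical stand-ins for the real SDK calls (Twilio and whichever second provider is in use):

    # Round-robin between two SMS providers; if one fails, retry on the other.
    import itertools

    def send_via_provider_a(to, body):
        raise NotImplementedError  # placeholder for the real Twilio call

    def send_via_provider_b(to, body):
        raise NotImplementedError  # placeholder for the second provider's call

    PROVIDERS = [send_via_provider_a, send_via_provider_b]
    _rotation = itertools.cycle(range(len(PROVIDERS)))

    def send_sms(to, body):
        first = next(_rotation)  # round-robin starting point
        for offset in range(len(PROVIDERS)):
            provider = PROVIDERS[(first + offset) % len(PROVIDERS)]
            try:
                return provider(to, body)
            except Exception:
                continue  # provider outage: fall through to the next one
        raise RuntimeError("all SMS providers failed")

Disabling the rotation is then just pinning the starting index instead of cycling it.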

freetanga•4h ago
Additionally, I find that most hyperscalers are trying to lock you in by tailoring industry-standard services with custom features, which end up putting down roots and making a multi-vendor setup or a lift-and-shift problematic.

Need to keep eyes peeled at all levels of the organization, as many of these enter through day-to-day work…

jacquesm•4h ago
Yes, they're really good at that. This is just 'embrace and extend'. We all know the third.
invalidusernam3•4h ago
What if the fall-back also never comes back up?
ho_schi•4h ago
The internet is weak infrastructure, relying on a few big cables and data centers. And through AWS and Cloudflare it has become worse? Was it ever true that the internet is resilient? I doubt it.

Resilient systems work autonomously and can synchronize - but don't need to synchronize.

    * Git is resilient.
    * Native E-Mail clients - with local storage enabled - are somewhat resilient.
    * A local package repository is - somewhat resilient.
    * A local file-sharing app (not Warp/ Magic-Wormhole -> needs relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff should work locally and only optionally use the internet.
CaptainOfCoit•3h ago
The internet seems resilient enough for all intents and purposes, we haven't had a global internet-wide catastrophe impacting the entire internet as far as I know, but we have gotten close to it sometimes (thanks BGP).

But the web, that's the fragile, centralized and weak point currently, and seems to be what you're referring to rather.

Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".

jacquesm•2h ago
You are absolutely correct but this distinction is getting less and less important, everything is using APIs nowadays, including lots of stuff that is utterly invisible until it goes down.
ho_schi•2h ago
Sweden and the “Coop” disaster:

https://www.bbc.com/news/technology-57707530

That's because people trust and hope blindly. They believe IT is for saving money? It isn't. They coupled their cash registers to an American cloud service. They couldn't even accept cash payments.

It usually gets worse when no outages happen for a while, because that increases blind trust.

CaptainOfCoit•2h ago
That a Swedish supermarket gets hit by a ransomware attack doesn't prove/disprove the overall stability of the internet, nor the fragility of the web.
smaudet•3h ago
If you take into account the distinction between "the web" and "the internet", as others have mentioned:

Yes the Internet has stayed stable.

The Web, as defined by a bunch of servers running complex software, probably much less so.

Just the fact that it must necessarily be more complex means that it has more failure modes...

bombcar•2h ago
The Internet was much more resilient when it was just that - an internetwork of connected networks; each of which could and did operate autonomously.

Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.

And partially working, or indicating that it works (when it doesn't), is usually even worse.

rco8786•4h ago
If AWS goes down unexpectedly and never comes back up it's much more likely that we're in the middle of some enormous global conflict where day to day survival takes priority over making your app work than AWS just deciding to abandon their cloud business on a whim.
apexalpha•4h ago
Or Trump decided your country does not deserve it.
bombcar•2h ago
Or Bezos.
dr-smooth•55m ago
Or Bezos selling his soul to the Orange Devil and kicking you off when the Conman-in-chief puts the squeeze on some other aspect of Bezos' business empire
CaptainOfCoit•3h ago
Can also be much easier than that. Say you live in Mexico, hosting servers with AWS in the US because you have US customers. But suddenly the government decides to place sanctions on Mexico, and US entities are no longer allowed to do business with Mexicans, so all Mexican AWS accounts get shut down.

For you as a Mexican the end results is the same, AWS went away, and considering there already is a list of countries that cannot use AWS, GitHub and a bunch of other "essential" services, it's not hard to imagine that that list might grow in the future.

Keyframe•3h ago
At least we've got github steady with our code and IaaC, right? Right?!
vahid4m•3h ago
I don't think it's worth it, but let's say I did it: what if others that I depend on don't do it? I still won't be fully functional, and only one of us will have spent a bunch of money.
bschne•3h ago
I find this hard to judge in the abstract, but I'm not quite convinced the situation for the modal company today is worse than their answer to "what if your colo rack catches fire" would have been twenty years ago.
jacquesm•3h ago
> "what if your colo rack catches fire"

I've actually had that.

https://www.webmasterworld.com/webmaster/3663978.htm

bschne•2h ago
I used to work at an SME that ran ~everything on its own colo'd hardware, and while it never got this bad, there were a couple instances of the CTO driving over to the dc because the oob access to some hung up server wasn't working anymore. Fun times...
pluto_modadic•1m ago
oh hey, I've bricked a server remotely and had to drive 45 minutes to the DC to get badged in and reboot things :)
rglover•1h ago
It would behoove a lot of devs to learn the basics of Linux sysadmin and how to set up a basic deployment on a VPS. Once you understand that, you'll realize how much of "modern infra" is really just a mix of over-reliance on AWS and throwing compute at underperforming code. Our addiction to complexity (and to burning money on the illusion of infinite stability) is already strangling us and will continue to do so.
kbar13•37m ago
the correct answer for those companies is "we have it on the roadmap but for right now accept the risk"
saltyoldman•1m ago
Contrast this with the top post.
nextaccountic•5h ago
Is this why reddit is down? (https://www.redditstatus.com/ still says it is up but with degraded infrastructure)
krowek•5h ago
Shameless from them to make it look like it's a user problem. It was loading fine for me one hour ago, now I refresh the page and their message states I'm doing too many requests and should chill out (1 request per hour is too many for you?)
anal_reactor•4h ago
I remember that I made a website and then I got a report that it doesn't work on newest Safari. Obviously, Safari would crash with a message blaming the website. Bro, no website should ever make your shitty browser outright crash.
balder1991•2h ago
Actually I’m just thinking that knowledge about how to crash Safari is valuable.
etothet•4h ago
Never ascribe to malice that which is adequately explained by incompetence.

It’s likely that, like many organizations, this scenario isn’t something Reddit are well prepared for in terms of correct error messaging.

kaptainscarlet•5h ago
I got a rate limit error which didn't make sense since it was my first time opening reddit in hours.
ryanchants•56m ago
Could be a bunch of reddit bots on AWS are now catching back up as AWS recovers and spiking hits to reddit
TrackerFF•5h ago
Lots of outage happening in Norway, too. So I'm guessing it is a global thing.
weberer•5h ago
Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.
xodice•5h ago
Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC, but for fuck's sake, they've had HOW LONG to finish AWS and make us-east-1 not be tied to keeping AWS up.
hvb2•5h ago
First, not all outages are created equal, so you cannot compare them like that.

I believe the 2021 one was especially horrific because it affected their DNS service (Route 53) and made writes to that service impossible. This made failovers not work, etcetera, so their prescribed multi-region setups didn't work.

But in the end, some things will have to synchronize their writes somewhere, right? So for DNS I can see how that ends up in a single region.

AWS is bound by the same rules as everyone else in the end... The only thing they have going for them is that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.

xodice•5h ago
If AWS fully decentralized its control planes, they'd essentially be duplicating the cost structure of running multiple independent clouds, and I understand that is why they don't. However, as long as AWS relies on us-east-1 to function, they have not achieved what they claim, in my view. A single point of failure for IAM? Nah, no thanks.

Every AWS "global" service, be it IAM, STS, CloudFormation, CloudFront, Route 53, or Organizations, has deep ties to control systems originally built only in us-east-1/N. VA.

That's poor design, after all these years. They've had time to fix this.

Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication, and maybe Route 53 for DNS or failover, or CloudFormation or ECS for orchestration...

If any of those choke because us-east-1’s internal control systems are degraded, you’re fucked. That’s not true regional independence.
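
One concrete mitigation on the caller side is that STS does have regional endpoints, so credentials can be minted without going through the global, us-east-1-backed endpoint (botocore can also be switched over via AWS_STS_REGIONAL_ENDPOINTS=regional). A minimal sketch with boto3; the region and role ARN are illustrative:

    # Call STS through a regional endpoint instead of the global one, so a
    # us-east-1 control-plane problem is less likely to block AssumeRole.
    import boto3

    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )

    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789012:role/app-role",  # hypothetical role
        RoleSessionName="regional-endpoint-demo",
    )["Credentials"]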

hvb2•4h ago
You can only decentralize your control plane if you don't have conflicting requirements?

Assuming you cannot alter requirements or SLAs, I could see how their technical solutions are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem

xodice•4h ago
In the narrow distributed-systems sense? Yes, however those requirements are self-imposed. AWS chose strong global consistency for IAM and billing... they could loosen it at enormous expense.

The control plane must know the truth about your account and that truth must be globally consistent. That’s where the trouble starts I guess.

I think my old-school sysadmin ethos is just different from theirs. It's not about who's wrong or right, just a difference of opinion on how it should be done, I guess.

The ISP I work for requires us to design in a way that no single DC becomes a single point of failure. It's just a difference in design methods, and I have to remember the DC I work in is used completely differently than AWS.

In the end, however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy), but they're expensive and don't look good on quarterly slides. It just takes the almighty dollar to implement, and that goes against big business; if it "works", it works, I guess.

AWS's model scales to millions of accounts because it hides complexity, sure, but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things continue on, but then that's AWS causing complexity in your code. They are just shifting who shoulders that little-known issue.

shinycode•5h ago
It's that time of year when we discover which AWS clients don't have fallback plans.
hubertzhang•5h ago
I cannot pull images from docker hub.
tonypapousek•4h ago
Looks like they’re nearly done fixing it.

> Oct 20 3:35 AM PDT

> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.

chibea•4h ago
It's a bit funny that they say "most service operations are succeeding normally now" when, in fact, you cannot yet launch or terminate new EC2 instances, which is basically the defining feature of the cloud...
rswail•4h ago
Only in that region; other regions are able to launch EC2 and ECS/EKS without a problem.
jamwil•1h ago
Is that material to a conversation about service uptime of existing resources, though? Are there customers out there that are churning through the full lifecycle of ephemeral EC2 instances as part of their day-to-day?
shawabawa3•1h ago
Any company of non-trivial scale will surely launch EC2 nodes during the day.

one of the main points of cloud computing is scaling up and down frequently

jamwil•54m ago
I understand scaling. I’m saying there is a difference in severity of several orders of magnitude between “the computers are down” and “we can’t add additional computers”.
archon810•22m ago
Except it just broke again.
shkkmo•20m ago
Still not fixed and may have gotten worse...
testemailfordg2•4h ago
Seems like we need more antitrust cases against AWS, or we need to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.
arielcostas•3h ago
But they aren't abusing their market power, are they? I mean, they are too big and should definitely be regulated but I don't think you can argue they are much of a monopoly when others, at the very least Google, Microsoft, Oracle, Cloudflare (depending on the specific services you want) and smaller providers can offer you the same service and many times with better pricing. Same way we need to regulate companies like Cloudflare essentially being a MITM for ~20% of internet websites, per their 2024 report.
chibea•4h ago
One main problem that we observed was that big parts of their IAM/auth setup were overloaded or down, which led to all kinds of cascading problems. It sounds as if DynamoDB was reported to be a root cause, so is IAM dependent on Dynamo internally?

Of course, such a large control-plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd want to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-proven DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?

cowsandmilk•4h ago
Many AWS customers have bad retry policies that will overload other systems as part of their retries. DynamoDB being down will cause them to overload IAM.
joncrane•2h ago
Which is interesting because per their health dashboard,

>We recommend customers continue to retry any failed requests.

veltas•1h ago
They can't exactly change existing widespread practice, so they have to be ready for that kind of retry load.
otterley•13m ago
They should continue to retry but with exponential backoff and jitter. Not in a busy loop!
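Something like this shape, if it helps anyone; the names and numbers are made up, and the real AWS SDKs already ship configurable retry modes that do this for you:

    import random
    import time

    def call_with_backoff(fn, max_attempts=8, base=0.2, cap=20.0):
        """Retry fn() with capped exponential backoff and full jitter,
        rather than hammering an already-degraded service in a tight loop."""
        for attempt in range(max_attempts):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the exponential cap.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))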
wwdmaxwell•4h ago
I think Amazon uses an internal platform called Dynamo as a KV store. It's different from DynamoDB, so I'm thinking the outage could be either a DNS routing issue or some kind of node deployment problem.

Both of which seem to crop up in post-mortems for these widespread outages.

oofbey•28m ago
They said the root cause was DNS for DynamoDB. Inside AWS, relying on DynamoDB is highly encouraged, so it's not surprising that a failure there would cascade broadly. The fact that EC2 instance launching is affected is surprising. Loops in the service dependency graph are known to be a bad idea.
cyberax•7m ago
When I worked at AWS several years ago, IAM was not dependent on Dynamo. It might have changed, but I highly doubt this. Maybe some kind of network issue with high-traffic services?

> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.

IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized; if you ever wondered why the signature key derivation has many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
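For anyone who hasn't looked at it, the derivation chains HMACs over date, region, and service, so a key derived for one region is useless in another. A sketch of the documented steps (not production signing code):

    import hashlib
    import hmac

    def sigv4_signing_key(secret_key, date, region, service):
        """Derive the SigV4 signing key; the region is baked in, which is
        what lets request signing stay regional."""
        def h(key, msg):
            return hmac.new(key, msg.encode(), hashlib.sha256).digest()

        k_date = h(("AWS4" + secret_key).encode(), date)  # e.g. "20251020"
        k_region = h(k_date, region)                      # e.g. "eu-west-1"
        k_service = h(k_region, service)                  # e.g. "dynamodb"
        return h(k_service, "aws4_request")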

kuon•3h ago
I realize that my basement servers have better uptime than AWS this year!

I think most sysadmins don't plan for an AWS outage. And economically that makes sense.

But it makes me wonder, is sysadmin a lost art?

TheCraiggers•3h ago
> But it makes me wonder, is sysadmin a lost art?

I dunno, let me ask chatgpt. Hmmm, it said yes.

tripplyons•1h ago
ChatGPT often says yes to both a question and its inverse. People like to hear yes more than no.
ninininino•40m ago
You missed their point. They were making a joke about over-reliance on AI.
tredre3•29m ago
> But it makes me wonder, is sysadmin a lost art?

Yes. 15-20 years ago, when I was still working on network-adjacent stuff, I witnessed the shift to the devops movement.

To be clear, the fact that devops don't plan for AWS failures isn't an indication that they lack the sysadmin gene. Sysadmins will tell you very similar things: "X can never go down" or "it's not worth having a backup for service Y".

But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.

"Thankfully" nowadays containers and application hosting abstracts a lot of it back away. So today I'd be willing to say that devops are sufficient for small to medium companies (and dare I say more efficient?).

archon810•21m ago
That's not very surprising. At this point you could just as well say that your microwave has better uptime; compared to all of Amazon's cloud services and infrastructure, the complexity gap is roughly the same.
nla•3h ago
I still don't know why anyone would use AWS hosting.
nemo44x•3h ago
Someone's got a case of the Mondays.
karel-3d•3h ago
Slack was down, so I thought I will send message to my coworkers on Signal.

Signal was also down.

runako•3h ago
Even though us-east-1 is the region geographically closest to me, I always choose another region as default due to us-east-1 (seemingly) being more prone to these outages.

Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.

joncrane•2h ago
What services are only available in us-east-1?
tom1337•2h ago
IAM control plane for example:

> There is one IAM control plane for all commercial AWS Regions, which is located in the US East (N. Virginia) Region. The IAM system then propagates configuration changes to the IAM data planes in every enabled AWS Region. The IAM data plane is essentially a read-only replica of the IAM control plane configuration data.

and I believe some global services (like certificate manager, etc.) also depend on the us-east-1 region

https://docs.aws.amazon.com/IAM/latest/UserGuide/disaster-re...

tomchuk•2h ago
IAM, Cloudfront, Route53, ACM, Billing...
nijave•36m ago
parts of S3 (although maybe that's better after that major outage years ago)
runako•1h ago
In addition to those listed in sibling comments, new services often roll out in us-east-1 before being made available in other regions.

I recently ran into an issue where some Bedrock functionality was available in us-east-1 but not one of the other US regions.

sinpor1•2h ago
Its influence is so great that it caused half of the internet to stop working properly.
d_burfoot•2h ago
I think AWS should use, and provide as an offering to big customers, a Chaos Monkey tool that randomly brings down specific services in specific AZs. Example: DynamoDB is down in us-east-1b. IAM is down in us-west-2a.

Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
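Teams can also fake a crude version of this at the client layer today; a toy sketch (the wrapper name and failure rate are invented for illustration):

    import random

    class ChaosClient:
        """Wrap a service client and randomly inject failures, to simulate
        e.g. 'DynamoDB is down in us-east-1b' in a staging stack."""

        def __init__(self, client, failure_rate=0.05):
            self._client = client
            self._failure_rate = failure_rate

        def __getattr__(self, name):
            real = getattr(self._client, name)
            if not callable(real):
                return real

            def wrapped(*args, **kwargs):
                if random.random() < self._failure_rate:
                    raise ConnectionError(f"chaos: injected failure in {name}")
                return real(*args, **kwargs)

            return wrapped

    # table = ChaosClient(boto3.client("dynamodb"), failure_rate=0.1)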

davidrupp•2m ago
AWS Fault Injection Service: https://docs.aws.amazon.com/fis/latest/userguide/what-is.htm...
bob1029•2h ago
One thing has become quite clear to me over the years. Much of the thinking around uptime of information systems has become hyperbolic and self-serving.

There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government, which seems largely unaffected by the recent loss of an entire building full of machines with no backups.

I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.

The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around it for a day or two, it's almost certainly easier and cheaper than building a multi-cloud complexity hellscape or dragging it all back on-prem.

Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.

vidarh•2h ago
> The takeaway I always have from these events is that you should engineer your business to be resilient

An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.

I'm getting old, but this was the 1980's, not the 1800's.

In other words, to agree with your point about resilience:

A lot of the time even some really janky fallbacks will be enough.

But to somewhat disagree with your apparent support for AWS: While it is true this attitude means you can deal with AWS falling over now and again, it also strips away one of the main reasons people tend to give me for why they're in AWS in the first place - namely a belief in buying peace of mind and less devops complexity (a belief I'd argue is pure fiction, but that's a separate issue). If you accept that you in fact can survive just fine without absurd levels of uptime, you also gain a lot more flexibility in which options are viable to you.

The cost of maintaining a flawless eject button is indeed high, but so is the cost of picking a provider based on the notion that you don't need one if you're with them out of a misplaced belief in the availability they can provide, rather than based on how cost effectively they can deliver what you actually need.

jezzamon•55m ago
Tech companies, and in particular ad-driven companies, keep a very close eye on their metrics and can fairly accurately measure the cost of an outage in real dollars
DanHulton•2h ago
I forget where I read it originally, but I strongly feel that AWS should offer a `us-chaos-1` region, where every 3-4 days, one or two services blow up. Host your staging stack there and you build real resiliency over time.

(The counter-joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)

avi_vallarapu•1h ago
This is why it is important to plan disaster recovery and also to plan multi-cloud architectures.

Our applications and databases must have ultra-high availability. That can be achieved with applications and data platforms hosted in different regions for failover.

Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.

- Qlik Replicate
- HexaRocket

and some more.

Or rather implement native replication solutions available with data platforms.

ibejoeb•1h ago
This is just a silly anecdote, but every time a cloud provider blips, I'm reminded. The worst architecture I've ever encountered was a system that was distributed across AWS, Azure, and GCP. Whenever any one of them had a problem, the system went down. It also cost 3x more than it should.
manishsharan•46m ago
You mean a multi-cloud strategy! You wanna know how you got here?

See, the sales team from Google flew an executive out to the NBA Finals, the Azure sales team flew another executive out to the NFL Super Bowl, and the AWS team flew yet another executive out to the Wimbledon finals. And that's how you end up with a multi-cloud strategy.

kevstev•7m ago
Eh, businesses want to stay resilient to a single vendor going down. My least favorite question in interviews this past year was around multi-cloud, because IMHO it just isn't worth it: the increased complexity, trying to match services like-for-like across different clouds when they aren't always really the same, and then the ongoing cost of chaos-monkeying and testing that this all actually works, especially in the face of a partial outage like this versus something "easy" like a complete loss of network connectivity... But that is almost certainly not what CEOs want to hear (and CEOs are mostly who I am dealing with here, going for VPE- or CTO-level jobs).

I couldn't care less about having more vendor dinners when I know I am promising a falsehood that is extremely expensive and likely going to cost me my job or my credibility at some point.

spyspy•42m ago
I've seen the exact same thing at multiple companies. The teams were always so proud of themselves for being "multi-cloud" and managers rewarded them for their nonsense. They also got constant kudos for their heroic firefighting whenever the system went down, which it did constantly. Watching actually good engineers get overlooked because their systems were rock-solid while those characters got all the praise for designing an unadulterated piece of shit was one of the main reasons I left those companies.
AnimeLife•40m ago
Looks like very few get it right. A good system would have a few minutes of blip when one cloud provider goes down, which is a massive win compared to outages like this.
avs733•21m ago
I'll bet there are a large number of systems that depend on multiple cloud platforms being up without even knowing it. They run on AWS, but rely on a tool from someone else that runs on GCP or Azure, and they haven't tested what happens if that tool goes down...

Common Cause Failures and false redundancy are just all over the place.

esskay•43m ago
Er...They appear to have just gone down again.
lexandstuff•13m ago
Yep. I don't think they ever fully recovered, and the status page is still reporting a lot of issues.
TriangleEdge•13m ago
Severity - Degraded...

https://health.aws.amazon.com/health/status

https://downdetector.com/

mlhpdx•11m ago
Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).

My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.

The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).

Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.

Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
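To give a flavor of the "multiple layers of failover" part, here's the client-side analogue of what the DNS/routing layers do; the endpoints are hypothetical and this is a simplified sketch, not the actual setup:

    import urllib.request

    # Hypothetical per-region endpoints, ordered by preference.
    ENDPOINTS = [
        "https://api.us-west-2.example.com/health",
        "https://api.eu-west-1.example.com/health",
        "https://api.us-east-1.example.com/health",
    ]

    def first_healthy(endpoints=ENDPOINTS, timeout=2):
        """Return the first region whose health check answers 200."""
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    if resp.status == 200:
                        return url
            except OSError:
                continue  # region unreachable or erroring, try the next one
        raise RuntimeError("no region reachable")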