I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure.
Well, inter-region DR/HA is an expensive thing to ensure (whether in salaries, infra, or both), especially when you are in AWS.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of foldings of manifolds. Now the systems are higher dimensional and the advice more practical/heuristic.
Doesn't help either. us-east-1 hosts the internal control plane of AWS, and a bunch of stuff is only available in us-east-1 at all - most importantly CloudFront, AWS ACM for CloudFront, and parts of IAM.
And the last is the one true big problem. When IAM has a sniffle, everything else collapses because literally everything else depends on IAM. If I were to guess IAM probably handles millions if not billions of requests a second because every action on every AWS service causes at least one request to IAM.
The sheer volume is one thing, but IAM's policy engine is another. Up to 5,000 roles per account, dozens of policies that can affect any given user entity, and on top of that you can create IAM policies that blanket-affect all entities (or only a filtered subset) in an account, with each policy definition allowed to be what, 10 kB or so in size. Filters can include multiple wildcards everywhere, so you can't take a fast path through an in-memory index, and they can use variables with on-demand evaluation as well.
And all of that is reachable not on an account-specific endpoint that could get sharded from a shared instance should the load of one account become too expensive, no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context cues to have an audit and debug trail.
To deliver all that at a service quality where an IAM policy change takes effect in under 10 seconds, with millisecond call times, is nothing short of amazing.
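For intuition, here's a toy sketch in Python (emphatically not how IAM is actually implemented) of why wildcards plus request-time variables like ${aws:username} resist a precomputed index: each candidate statement may need variable substitution and glob matching against the concrete request. The policy statements below are made up.

    import fnmatch

    def expand_variables(pattern, context):
        """Substitute ${key} policy variables (e.g. ${aws:username}) at evaluation time."""
        for key, value in context.items():
            pattern = pattern.replace("${" + key + "}", value)
        return pattern

    def is_allowed(statements, action, resource, context):
        """Deny-overrides: any matching explicit Deny wins; the default is implicit deny."""
        allowed = False
        for stmt in statements:
            actions = stmt.get("Action", [])
            resources = [expand_variables(r, context) for r in stmt.get("Resource", [])]
            if any(fnmatch.fnmatch(action, a) for a in actions) and \
               any(fnmatch.fnmatch(resource, r) for r in resources):
                if stmt["Effect"] == "Deny":
                    return False
                allowed = True
        return allowed

    # Made-up statements for illustration only.
    statements = [
        {"Effect": "Allow", "Action": ["s3:*"],
         "Resource": ["arn:aws:s3:::team-${aws:username}/*"]},
        {"Effect": "Deny", "Action": ["s3:DeleteObject"],
         "Resource": ["arn:aws:s3:::team-*/prod/*"]},
    ]

    ctx = {"aws:username": "alice"}
    print(is_allowed(statements, "s3:GetObject",
                     "arn:aws:s3:::team-alice/reports/q3.csv", ctx))   # True
    print(is_allowed(statements, "s3:DeleteObject",
                     "arn:aws:s3:::team-alice/prod/data.csv", ctx))    # False

Even this toy version has to walk every statement per request; the real engine does something like it across thousands of roles and policies, with auditing on top.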
IAM is solid, but is it any more special than any other distributed AuthN+AuthZ service?
You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.
I haven't had to do this for several years, but that was my experience during an outage a few years ago - obviously it depends on the services you're using.
You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late
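To make that concrete: the usual answer is failover that is fully provisioned before anything breaks, so no control-plane calls are needed mid-incident. A rough boto3 sketch of pre-created Route 53 failover records (the zone ID, domain, IPs, and health-check ID are placeholders):

    import boto3

    route53 = boto3.client("route53")

    def upsert_failover_record(zone_id, name, role, ip, set_id, health_check_id=None):
        """Create/refresh a PRIMARY or SECONDARY failover A record ahead of time."""
        record = {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_id,
            "Failover": role,                  # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        if health_check_id:                    # the primary needs a health check to fail away from
            record["HealthCheckId"] = health_check_id
        route53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
        )

    # Both records exist before any incident; the Route 53 data plane flips traffic
    # by itself when the primary health check fails, with no API calls during the outage.
    upsert_failover_record("Z0000000EXAMPLE", "app.example.com.", "PRIMARY",
                           "198.51.100.10", "use1-primary", health_check_id="hc-placeholder")
    upsert_failover_record("Z0000000EXAMPLE", "app.example.com.", "SECONDARY",
                           "203.0.113.10", "usw2-standby")

The point is that the standby capacity and the records pointing at it already exist; all that happens during the outage is the health check failing over, which doesn't depend on you reaching the console or the APIs.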
What it turned into was Daedalus from Deus Ex lol.
I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.
Hey! Pay us more money so when us-east-1 goes down you're not down (actually you'll still go down because us-east-1 is a single point of failure even for our other regions).
They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.
For the uninitiated: https://en.wikipedia.org/wiki/Room_641A
IAD datacenters have forever been the place where Amazon software developers implement services first (well before AWS was a thing).
Multi-AZ support often comes second (more often than you'd think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.
And then other services depend on those services, and may also fall into the same trap.
...and so much of the tech/architectural debt gets concentrated into a single region.
Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
I'll let people guess for the sport of it, here's the hint: there were at least 30 of them comprised of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.
At risk of more snark: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.
Your experiment proves nothing. Anyone can pull it off.
Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.
edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.
The stuff I'm proudest of solved a problem and made money, but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer, "What's the thing you've designed with the most parts?"
Most web jobs are not technically complex. They use standard software stacks in standard ways. If they didn't, average developers (or LLMs) would not be able to write code for them.
Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?
I mean look at their console. Their console application is pretty subpar.
It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.
"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"
The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.
The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.
That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.
But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The downtime being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here, and as a profession I wish we'd take more accountability.
I hope they release a good root cause analysis report.
Well, it did for me today... Don't use us-east-1 explicitly, just other regions - I had no outage today. (I get the point about the skeletons in us-east-1's closet... maybe the power plug goes via Bezos' wood desk?)
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
But, in its sidebar of "Trending technologies", it lists "Ansible" and "Jenkins"... which, while both great, I doubt are trending currently.
Curious what this is?
I would strongly argue that there is nothing great about Jenkins. It's an unholy mess of mouldy spaghetti that can sometimes be used to achieve a goal, but is generally terrible at everything. Shit to use, shit to maintain, shit to secure. It was the best solution because of a lack of competition 20 years ago, but hasn't been relevant or anywhere near the top 50 since any competition appeared.
The fact that to this very day, nearing the end of 2025, they still don't support JWT identities for runs is embarrassing. Same goes for VMware vSphere.
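For anyone unfamiliar, "JWT identities for runs" means the OIDC-style workload-identity pattern: the CI system signs a short-lived token per run, and the job exchanges it for temporary cloud credentials instead of storing long-lived secrets. Roughly, on the consuming side (the role ARN and token path below are placeholders):

    import boto3

    def credentials_for_this_run(token_path, role_arn, session_name):
        """Exchange the per-run JWT for short-lived AWS credentials (no stored secrets)."""
        with open(token_path) as f:
            run_jwt = f.read().strip()         # minted by the CI system for this run only
        sts = boto3.client("sts")              # this call needs no pre-existing credentials
        resp = sts.assume_role_with_web_identity(
            RoleArn=role_arn,
            RoleSessionName=session_name,
            WebIdentityToken=run_jwt,
            DurationSeconds=900,               # credentials expire shortly after the run
        )
        return resp["Credentials"]             # AccessKeyId / SecretAccessKey / SessionToken

    creds = credentials_for_this_run(
        "/var/run/ci/oidc-token",              # hypothetical path where the runner drops the token
        "arn:aws:iam::123456789012:role/ci-deploy",
        "build-1234",
    )

Several other CI systems support this pattern natively, which is why its absence stands out.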
https://i.ibb.co/Lzgf34mb/Screenshot-20251020-080828.png
Also, this is the exact CSS style that Claude uses whenever I have it program web elements (typically bookmarklet UIs).
There's no way it's DNS
It was DNS
As it happens, that naturally maps to the bootstrapping problem: hardware needs to know how to find the external services it depends on, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.
But it's the inevitability of the manual process that's the issue here, not the technology. We're at a spot now where the rest of the system reliability is so good that the only things that bring it down are the spots where human beings make mistakes on the tiny handful of places where human operation is (inevitably!) required.
Unless DNS configuration propagates over DHCP?
At the top of the stack someone needs to say "This is the cluster that controls boot storage", "This is the IP to ask for auth tokens", etc... You can automatically configure almost everything but there still has to be some way to get started.
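A toy way to picture that seed layer (all names and addresses below are invented): even if DHCP hands out the resolvers, the DHCP server's own list is hand-maintained somewhere, so the manual layer moves up a level rather than disappearing.

    import json
    import socket

    # Hand-maintained seed: the handful of pointers that can't be discovered,
    # because discovery itself (DNS, auth, service registry) isn't up yet.
    SEED = json.loads("""
    {
      "resolvers":           ["10.0.0.2", "10.0.0.3"],
      "boot_storage":        "10.0.10.5",
      "auth_token_endpoint": "10.0.20.8"
    }
    """)

    def locate(service_name, seed_key):
        """Prefer dynamic discovery (DNS), fall back to the seed while bootstrapping."""
        try:
            return socket.gethostbyname(service_name)   # normal path once DNS is reachable
        except OSError:
            return SEED[seed_key]                       # the manual, top-of-stack fallback

    print(locate("auth.internal.example", "auth_token_endpoint"))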
> 5,000 Reddit users reported problems shortly after a specific time.
> 400,000 reports were made in the UK alone within two hours.
Dumb argument imho, but that's how many of them think ime.
So your complaints matter nothing because "number go up".
I remember the good old days of everyone starting a hosting company. We never should have left.
Previous discussions:
https://news.ycombinator.com/item?id=45640754
https://news.ycombinator.com/item?id=45640772
https://news.ycombinator.com/item?id=45640827
Always DNS..
Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, the companies that had operational problems likely wouldn't have invested in resiliency in any other deployment strategy either. It could have happened to them in Azure, GCP, or even a home-rolled datacenter.
Are customers willing to pay companies for that redundancy? I think not. A three-hour outage once every few years is fine for non-critical services.
That right there means the business model is fucked to begin with. If you can't have a resilient service, then you should not be offering that service. Period. Solution: we were fine before the cloud, just a little slower. No problem going back to that for some things. Not everything has to be just in time at lowest possible cost.
FFS ...
https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
Reminds me of a great Onion tagline:
"Plowshare hastily beaten back into sword."
Photos and numbers seem to be stolen straight from it.
I think we're doing the 21st century wrong.