Appears to have happened within the last 10-15 minutes.
We're having fun figuring out how to communicate amongst colleagues now! It's when it's gone that you realise your dependence.
As a degraded-state fallback, email is what we're using now (we have our clients configured to encrypt with PGP by default; we use it for any internal email, and also when the customer has PGP, so everyone knows how to use it).
If you seriously have no external low-dependency fallback, please at least document this fact now for the Big Postmortem.
Including fabricating new RAM?
My experience with self-hosting has been that, at least when you keep the services independent, downtime is not more common than in hosted environments, and you always know what's going on. Customising solutions, or working around trouble, is a benefit you don't get when the service provider is significantly bigger than you are. It has pros and cons and also depends on the product (e.g. email delivery is harder than Mattermost message delivery, or you might need a certain service only once a year or so), but if you have the personnel capacity and a continuous need, I find hosting things yourself to be the best solution in general.
Unsolicited tip: next time, if you use macOS, just open the terminal and run `caffeinate -imdsu`.
I assume Linux/Windows have something similar built in (and if not built in, something that's easily available). For Windows, I know the PowerToys suite of nifty tools (officially provided by Microsoft) has an Awake util, but that's just one of many similar options.
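For Windows specifically, a rough Python sketch of the same idea, using the Win32 SetThreadExecutionState call via ctypes (the flag values are the documented Win32 constants; the wrapper script itself is just an illustration, not a recommendation over PowerToys Awake):

    import ctypes
    import time

    # Win32 execution-state flags (documented constants); Windows-only.
    ES_CONTINUOUS = 0x80000000
    ES_SYSTEM_REQUIRED = 0x00000001
    ES_DISPLAY_REQUIRED = 0x00000002

    # Ask Windows to keep the system and display awake while this runs.
    ctypes.windll.kernel32.SetThreadExecutionState(
        ES_CONTINUOUS | ES_SYSTEM_REQUIRED | ES_DISPLAY_REQUIRED
    )
    try:
        while True:
            time.sleep(60)
    finally:
        # Restore normal power management on exit.
        ctypes.windll.kernel32.SetThreadExecutionState(ES_CONTINUOUS)

On Linux with systemd, systemd-inhibit does the equivalent without any code.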
The cost of re-designing and re-implementing applications to synchronize data shipping to remote regions and only spinning up remote region resources as needed is even larger for these organizations.
And this is how we end up with these massive cloud footprints not much different from running fleets of VMs. Just about the most expensive way to use the cloud hyperscalers.
Most non-tech-industry organizations cannot face the brutal reality that properly, really leveraging hyperscalers involves a period of time, often counted in decades for Fortune-scale footprints, where they're spending 3-5 times more on selected areas than peers doing those areas the old way, in order to migrate to mostly spot-instance-resident, scale-to-zero, elastic, containerized services with excellent developer and operational troubleshooting ergonomics.
Except Google Spanner, I’m told, but AWS doesn’t have an answer for that yet AFAIK.
Admitting to that here?
In civilised jurisdictions that should be criminal.
Using cryptography to avoid accountability is wrong. Drug dealing and sex work, OK, but in other businesses? Sounds very crooked to me
When Slack was down we used... google... google mail? chat. When you go to gmail there is actually a chat app on the left.
I.e. some bottle-necks in new code appearing only _after_ you've deployed there, which is of course too late.
It didn't help that some services had their deploy trains (pipelines in amazon lingo) of ~3 weeks, with us-east-1 being the last one.
I bet the situation hasn't changed much since.
oof, so you're saying this outage could be caused by a change merged 3 weeks ago?
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
> Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1.
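If your traffic goes through an SDK rather than curl, a similar stopgap (same caveats as above: the IP may change at any time, so remove this once DNS recovers) is to pin resolution at the socket layer so TLS still sees the real hostname. A minimal Python sketch, assuming boto3/requests clients in the same process:

    import socket

    # IP taken from the comment above; it may change at any time.
    PINNED = {"dynamodb.us-east-1.amazonaws.com": "3.218.182.212"}

    _orig_getaddrinfo = socket.getaddrinfo

    def _pinned_getaddrinfo(host, *args, **kwargs):
        # Substitute the pinned address only for hosts we explicitly list.
        return _orig_getaddrinfo(PINNED.get(host, host), *args, **kwargs)

    socket.getaddrinfo = _pinned_getaddrinfo

Clients created after this resolve the endpoint to the pinned IP while still presenting the original hostname for SNI and certificate validation, which is the same trick curl --resolve does.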
Booting builder
/usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac
#1 [internal] booting buildkit
#1 pulling image moby/buildkit:buildx-stable-1
#1 pulling image moby/buildkit:buildx-stable-1 9.6s done
#1 ERROR: received unexpected HTTP status: 500 Internal Server Error
------
 > [internal] booting buildkit:
------
ERROR: received unexpected HTTP status: 500 Internal Server Error
A `;)` is normally understood to mean the author isn't entirely serious, and is making light of something or other.
Perhaps you American downvoters were on call and woke up with a fright, and perhaps too much time to browse Hacker News. ;)
2. Trusted brand
It still baffles me how we ended up in this situation where you can almost hear people's disapproval over the internet when you say AWS / Cloud isn't needed and you're throwing money away for no reason.
The key is that you need to understand no provider will actually put their ass on the line and compensate you for anything beyond their own profit margin, and plan accordingly.
For most companies, doing nothing is absolutely fine, they just need to plan for and accept the occasional downtime. Every company CEO wants to feel like their thing is mission-critical but the truth is that despite everything being down the whole thing will be forgotten in a week.
For those that actually do need guaranteed uptime, they need to build it themselves using a mixture of providers and test it regularly. They should be responsible for it themselves, because the providers will not. The stuff that is actually mission-critical already does that, which is why it didn't go down.
Cloud providers just never adopted that and the "ha, sucks to be you" mentality they have became the norm.
> Following my ill-defined "Tarsnap doesn't have an SLA but I'll give people credits for outages when it seems fair" policy, I credited everyone's Tarsnap accounts with 50% of a month's storage costs.
So in this case the downtime was roughly 26 hours, and the refund was for 50% of a month, so that's more than a 1-1 downtime refund.
IIRC it takes WAY too many managers to approve the dashboard being anything other than green.
It's not a reflection of reality nor is it automated.
Which is actually totally fine for the vast majority of things, otherwise there would be actual commercial pressures to make sure systems are resilient to such outages.
But there is only so much a cloud provider can guarantee within a region or whatever unit of isolation they offer.
you might be thinking of durability for s3 which is 11 nines, and i've never heard of anyone losing an object yet
Delivery of those nines is not a priority. Not for the cloud provider - because they can just lie their way out of it by not updating their status page - and even when they don't, they merely have to forego some of their insane profit margin for a couple hours in compensation. No provider will actually put their ass on the line and offer you anything beyond their own profit margin.
This is not an issue for most cloud clients either because they keep putting up with it (lying on the status page wouldn't be a thing if clients cared) - the unspoken truth is that nobody cares that your "growth & engagement" thing is down for an hour or so, so nobody makes anything more than a ceremonial stink about it (chances are, the thing goes down/misbehaves regularly anyway every time the new JS vibecoder or "AI employee" deploys something, regardless of cloud reliability).
Things where nines actually matter will generally invest in self-managed disaster recovery plans that are regularly tested. This also means it will generally be built differently and far away from your typical "cloud native" dumpster fire. Depending on how many nines you actually need (aka what's the cost of not meeting that target - which directly controls how much budget you have to ensure you always meet it), you might be building something closer to aircraft avionics with the same development practices, testing and rigor.
Because I don't see the business pressure to do so? If problems happen they can 1) lie on the status page and hope nothing happens, and 2) if they can't get away with lying, their downside is limited to a few hours of profit margin.
(which is not really a dig at AWS because no hosting provider will put their business on the line for you... it's more of a dig at people who claim AWS is some uptime unicorn while in reality they're nowhere near better than your usual hosting provider to justify their 1000x markup)
It's great if they're doing their best anyway, but I don't see it as anything more than "best effort", because nothing bad would happen even if they didn't do a good job at it.
Turns out the default URL was hardcoded to use the us-east interface, and just going to workspaces and then editing your URL to point at the local region got everyone working again.
Unless you mean nothing is working for you at the moment.
* Give the computers a rest, they probably need it. Heck, maybe the Internet should just shut down in the evening so everyone can go to bed (ignoring those pesky timezone differences)
* Free chaos engineering at the cloud provider region scale, except you didn't opt in to this one or know about it in advance, making it extra effective
* Quickly map out which of the things you use have a dependency on a single AWS region, with no capability to change or re-route
If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.
If you're down for 5 hours a year but this affected other companies too, it's not your fault
From AWS to Crowdstrike - system resilience and uptime isn't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.
When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done, it's such a complex outcome that even the amazing boffins at aws/google/microsoft/cloudflare/etc can't cope.
If the CEO is down at a different time than the CEO's buddies then it's that Dave/Charlie/Bertie/Alice can't cope and it's the CTO's fault for not outsourcing it.
As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been whenever the owner and CTO are separate.
After that process comes the BS and PR step, where reality is spun into a cotton candy that makes the leader look good no matter what.
Yes.
What is important is having a Contractual SLA that is defensible. Acts of God are defensible. And now major cloud infrastructure outages are too.
AWS outages: almost never happens, you should have been more prepared for when it does
If you say it’s Microsoft then it’s just unavoidable.
Still, it would make a bit of sense if you can find a place in your code where crossing a region hurts less, to move some of your services to a different region.
While your business partners will understand that you’re down while they’re down, will your customers? You called yesterday to say their order was ready, and now they can’t pick it up?
But there are some people on Reddit who think we are all wrong but won't say anything more. So... whatever.
Nothing in the outage history really stands out as "this is the first time we tried this and oops" except for us-east-1.
It's always possible for things to succeed at a smaller scale and fail at full scale, but again none of them really stand out as that to me. Or at least, not any in the last ten years. I'm allowing that anything older than that is on the far side of substantial process changes and isn't representative anymore.
That means Cursor is down, can't login.
It is kinda cool that the worst aws outages are still within a single region and not global.
But I think what wasn't well considered was the async effect - if something is gone for 5 minutes, maybe it will be just fine, but when things are properly asynchronous, the workflows that have piled up during that time become a problem in themselves. Worst case, they turn into poison pills which then break the system again.
surely you mean:
> I find it interesting that AWS services appear to be so tightly integrated that when there's an issue THAT BECOMES VISIBLE TO ME in a region, it affects most or all services.
AWS has stuff failing alllllllll the time, it's not very surprising that many of the outages that become visible to you involve multi-system failures - lots of other ones don't become visible!
Interesting point that banks actually tolerate a lot more eventual consistency than most software that just uses a billing backend ever does.
Stuff like 503-ing a SaaS request because the billing system was down and you couldn't check for limits could absolutely be locally cached, and eventual consistency would hurt very little. Unless your cost is quite high, I would much prefer to keep the API up and deal with the over-usage later.
If you check out with a credit card, even if everything looked good then, the seller might not see the money for days or might never receive it at all.
Back before AWS provided transparency into AZ assignments, it was pretty common to use latency measurements to try and infer relative locality and mappings of AZs available to an account.
No landing page explaining services are down, just scary error pages. I thought my account was compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east-1 must be up. Everything else works flawlessly but payments are a no-go.
Btw, most parts of amazon.de are working fine, but I can't load profiles, and can't log in.
Other things seem to be working fine.
Or rather
Ensure your single point of failure risk is appropriate for your business. I don't have full resilience for my company's AS going down, but we do have limited DR capability. Same with the loss of a major city or two.
I'm not 100% confident in a Thames Barrier flood situation, as I suspect some of our providers don't have the resilience levels we do, but we'd still be able to provide some minimal capability.
Impacting all banking services with a red status error. Oddly enough, only their direct deposits are functioning without issues.
Damn, this is really bad.
Looking forward to the postmortem.
“Perplexity is down right now,” Perplexity CEO Aravind Srinivas said on X. “The root cause is an AWS issue. We’re working on resolving it.”
What he should have said, IMHO, is "The root cause is that Perplexity fully depends on AWS."
I wonder if they're actually working on resolving that, or that they're just waiting for AWS to come back up.
"But you can't do webscale uptime on your own"
Sure. I suspect even a single pi with auto-updates on has less downtime.
This outage seems really to be DynamoDB related, so the blast radius in services affected is going to be big. Seems they're still triaging.
Also we use Docker Hub, NPM and a bunch of other services that are hosted by their vendors on us-east-1 so even non AWS customers often can't avoid the blast radius of us-east-1 (though the NPM issue mostly affects devs updating/adding dependencies, our CI builds use our internal mirror)
At least when us-east is down, everything is down.
> We have identified the underlying issue with one of our cloud service providers.
Granted, they are not as drunk on LLM as Google and Microsoft. So, at least we can say this outage had not been vibe-coded (yet).
The way things are today I'm thankful the coffee machine still works without AWS.
TV rights is one of their main revenue sources, and it's expected to always go up, so they see "piracy" as a fundamental threat. IMO, it's a fundamental misunderstanding on their side, because people "pirating" usually don't have a choice - either there is no option for them to pay for the content (e.g. UK's 3pm blackout), or it's too expensive and/or spread out. People in the UK have to pay 3-4 different subscriptions to access all local games.
The best solution, by far, is what France's Ligue 1 just did (out of necessity though, nobody was paying them what they wanted for the rights after the previous debacles). Ligue 1+ streaming service, owned and operated by them which you can get access through a variety of different ways (regular old TV paid channel, on Amazon Prime, on DAZN, via Bein Sport), whichever suits you the best. Same acceptable price for all games.
The problem is that leagues miss out on billions of dollars of revenue when they do this AND they also have to maintain the streaming service which is way outside their technical wheelhouse.
MLS also has a pretty straightforward streaming service through AppleTV which I also enjoy.
What I find weird is that people complain (at least in the case of the MLS deal) that it's a BAD thing, that somehow having an easily accessible service that you just pay for and get access to, without a contract or cable, is diminishing the popularity / discoverability of the product?
TBH, I have a hard time believing statements like this because if the revenue difference was really there, they'd make the switch.
If there's one thing I'll give credit to US sports leagues for, it's knowing how to make money.
Most leagues DO sell their rights to other big companies to have them handle it however they see fit for a large annual fee.
MLB does it partially: some games are shown through cable TV (there are so many games a year that only a small portion is actually aired nationally), and the rest are done via regional sports networks (RSNs) that aren't shown nationally. In order to make some money out of this situation, MLB created MLBtv, which lets you watch all games as long as they are not nationally aired or played by a local team serviced by an RSN. Recently there have been changes because one of the biggest conglomerates of RSNs has gone bankrupt, forcing MLB to buy them out, and MLB is trying to negotiate a new national cable package with the big telecoms. I believe ESPN has negotiated with MLB to buy out MLBtv but details are scarce.
MLS is a smaller league, and Apple bought exclusive streaming rights for 10 years for some ungodly amount of money. The NFL and NBA also have some streaming options, but I am less knowledgeable about them; I assume it's similar to MLBtv, where there are too many games to broadcast so you can just watch them with a subscription to their service.
At the end of the day these massive deals are the biggest source of revenue for the leagues, and the more ways they can divide up the pie among different companies, the more money they can extract in total. Just looking at the number of contracts for the US alone is overwhelming.
[1]https://en.wikipedia.org/wiki/Sports_broadcasting_contracts_...
My website is down :(
(EDIT: website is back up, hooray)
This is why distributed systems is an extremely important discipline.
Hell, maybe making today's tech workplace more about getting work done instead of the series of ritualistic performances that the average tech workday has degenerated to might help too.
Ergo, your conclusion doesn't follow from your initial statements, because interviews and workplaces are both far more broken than most people, even people in the tech industry, would think.
Many companies on Vercel don't think to have a strategy to be resilient to these outages.
I rarely see Google, Ably and others serious about distributed systems being down.
But that's the job of Vercel and it looks like they did a pretty good job. They rerouted away from the broken region.
Serious engineering teams that care about distributed systems and multi region deployments don't think like this.
Weird that case creation uses the same region as the case you'd like to create for.
Maybe all that got canned after the acquisition?
- https://status.twilio.com/
- https://www.intercomstatus.com/us-hosting
I want the web ca. 2001 back, please.
Amazon’s Ring to partner with Flock: https://news.ycombinator.com/item?id=45614713
so there is no free coffee time???? lmao
So blame humans even if an AI wrote some bad code.
Disagree, a human might be the cause/trigger, but the fault is pretty much always systemic. A whole lot of things have to happen for that last person to cause the problem.
Edit: and more importantly, who governed the system, i.e. made decisions about maintenance, staffing, training, processes and so on
"Oct 20 3:35 AM PDT The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution."
(Useless service status pages are incredibly annoying)
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).
Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com
Snapchat, Ring, Roblox, Fortnite and more go down in huge internet outage: Latest updates https://www.the-independent.com/tech/snapchat-roblox-duoling...
To see more (from the first link): https://downdetector.com
Very big day for an engineering team indeed. Can't vibe code your way out of this issue...
Exactly. This time, some LLM providers are also down and can't help vibe coders on this issue.
Most miserable working years I have had. It's wild how normalized working on weekends and evenings becomes in teams with oncall.
But it's not normal. Our users not being able to shitpost is simply not worth my weekend or evening.
And outside of Google you don't even get paid for oncall at most big tech companies! Company losing millions of dollars an hour, but somehow not willing to pay me a dime to jump in at 3AM? Looks like it's not my problem!
What the redacted?
There's a similar cutout for management, which is how companies like GameStop squeeze their retail managers. They just don't give enough payroll hours for regular employees, so the salaried (but poorly paid) manager has to cover all of the gaps.
It is completely normal for staff to have to work 24/7 for critical services.
Plumbing, HVAC, power plant engineers, doctors, nurses, hospital support staff, taxi drivers, system and network engineers - these people keep our modern world alive, all day, every day. Weekends, midnights, holidays, every hour of every day someone is AT WORK to make sure our society functions.
Not only is it normal, it is essential and required.
It’s ok that you don’t like having to work nights or weekends or holidays. But some people absolutely have to. Be thankful there are EMTs and surgeons and power and network engineers working instead of being with their families on holidays or in the wee hours of the night.
I'm glad there are people willing to do oncall. Especially for critical services.
But the software engineering profession as a whole would benefit from negotiating concessions for oncall. We have normalized work interfering with life so the company can squeeze a couple extra millions from ads. And for what?
Nontrivial amount of ad revenue lost? Not my problem if the company can't pay me to mitigate.
Interestingly, when I worked on analytics around bugs we found that often (in the ads space), there actually wasn't an impact when advertisers were unable to create ads, as they just created all of them when the interface started working again.
Now, if it had been the ad serving or pacing mechanisms then it would've been a lot of money, but not all outages are created equal.
Some can tolerate downtime. Many can’t.
But to parent's points: if you call a plumber or HVAC tech at 3am, you'll pay for the privilege.
And doctors and nurses have shifts/rotas. At some tech places, you are expected to do your day job plus on-call. For no overtime pay. "Salaried" in the US or something like that.
Edit: On-call is not always disclosed. When it is, it's often understated. And finally, you can never predict being re-orged into a team with oncall.
I agree employees should still have the balls to say "no" but to imply there's no wrongdoing here on companies' parts and that it's totally okay for them to take advantage of employees like this is a bit strange.
Especially for employees that don't know to ask this question (new grads) or can't say "no" as easily (new grads or H1Bs.)
This is plainly bad regulation: the market at large discovered the marginal price of oncall is zero, but it's rather obviously skewed in the employer's favor.
If you or anyone else are doing on-call for no additional pay, precisely nobody is forcing you to do that. Renegotiate, or switch jobs. It was either disclosed up front or you missed your chance to say “sorry, no” when asked to do additional work without additional pay. This is not a problem with on call but a problem with spineless people-pleasers.
Every business will ask you for a better deal for them. If you say “sure” to everything you’re naturally going to lose out. It’s a mistake to do so, obviously.
An employee’s lack of boundaries is not an employer’s fault.
> It is completely normal for staff to have to work 24/7 for critical services.
> Not only is it normal, it is essential and required.
Now you come with the weak "you don't have to take the job" and this gem:
> An employee’s lack of boundaries is not an employer’s fault.
As if there isn't a power imbalance, or employers always disclose everything and never change their mind. But of course, let's blame those entitled employees!
I believe the rules varied based on jurisdiction, and I think some had worse deals, and some even better. But I was happy with our setup in Norway.
Tbh I do not think we would have had, what we had if it wasn't for the local laws and regulations. Sometimes worker friendly laws can be nice.
Having two sites cover the pager is common, but even then you only have 16 working hours at best and somebody has to take the pager early/late.
Pour one out for the customer service teams of affected businesses instead
There are certainly organizations for which that cost is lower than the overall damage of services being down due to an AWS fault, but tomorrow we will hear from CTOs of smaller orgs as well.
This will hold until the next time AWS has a major outage, rinse and repeat.
When everything is some varying degree of broken at all times, being responsible for a brief uptick in the background brokenness isn't the drama you think it is.
It would be different if the systems I worked on were true life-and-death (ATC/Emergency Services etc), but in reality the blast radius from my fucking up somewhere is monetary, and even at the biggest company I worked for it was constrained (while 100+K per hour from an outage sounds horrific, in reality the vast majority of that was made up when the service was back online; people still needed to order the thing in the end).
I have three words for you: cascading systems failure
I feel bad for the people impacted by the outage. But at the same time there's a part of me that says we need a cataclysmic event to shake the C-Suite out of their current mindset of laying off all of their workers to replace them with AI, the cheapest people they can find in India, or in some cases with nothing at all, in order to maximize current quarter EPS.
It's still missing the one that earned me a phone call from a client.
If in your own datacenter your storage service goes down, how much remains running?
The scale here is so large they don't know the complete dependency tree until teams check-in on what is out or not, growing this list. Of course most of it is automated, but getting on 'Affected Services' is not.
I found that out about Plex during an outage too.
Other hosting services like Vercel, package managers like npm, even the docker registries are down because of it.
Edit: I can login into one of the AWS accounts (I have a few different ones for different companies), but my personal which has a ".edu" email is not logging in.
[1] https://bitbucket.status.atlassian.com/incidents/p20f40pt1rg...
(Be interesting to see how many events currently going to DynamoDB are actually outage information.)
See https://en.wikipedia.org/wiki/Thundering_herd_problem
In short, if it's all on the same schedule you'll end up with surges of requests followed by lulls. You want that evened out to reduce stress on the server end.
It's also polite to external services but at the scale of something like AWS that's not a concern for most.
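A minimal sketch of the usual client-side fix, capped exponential backoff with full jitter (the function and parameter names here are made up for illustration):

    import random
    import time

    def call_with_backoff(request, max_attempts=8, base=0.5, cap=60.0):
        """Retry `request` with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return request()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount within the capped
                # exponential window so clients don't retry in lock-step.
                window = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, window))

The random sleep is what spreads the retries out; a fixed backoff alone just moves the synchronized spike later.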
Heh
If you ran into a problem that an instant retry can't fix, chances are you will be waiting so long that your own customer doesn't care anymore.
time left = time so far
But as you note prior knowledge will enable a better guess.
> I visited the Berlin Wall. People at the time wondered how long the Wall might last. Was it a temporary aberration, or a permanent fixture of modern Europe? Standing at the Wall in 1969, I made the following argument, using the Copernican principle. I said, Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here. My visit is random in time. So if I divide the Wall’s total history, from the beginning to the end, into four quarters, and I’m located randomly somewhere in there, there’s a fifty-percent chance that I’m in the middle two quarters—that means, not in the first quarter and not in the fourth quarter.
> Let’s suppose that I’m at the beginning of that middle fifty percent. In that case, one-quarter of the Wall’s ultimate history has passed, and there are three-quarters left in the future. In that case, the future’s three times as long as the past. On the other hand, if I’m at the other end, then three-quarters have happened already, and there’s one-quarter left in the future. In that case, the future is one-third as long as the past.
https://www.newyorker.com/magazine/1999/07/12/how-to-predict...
What this "time-wise Copernican principle" gives you is a guarantee that, if you apply this logic every time you have no other knowledge and have to guess, you will get the least mean error over all of your guesses. For some events, you'll guess that they'll end in 5 minutes, and they actually end 50 years later. For others, you'll guess they'll take another 50 years and they actually end 5 minutes later. Add these two up, and overall you get 0 - you won't have either a bias to overestimating, nor to underestimating.
But this doesn't actually give you any insight into how long the event will actually last. For a single event, with no other knowledge, the probability that it will end after 1 minute is equal to the probability that it will end after the same duration that it has lasted so far, and it is equal to the probability that it will end after a billion years. There is nothing at all that you can say about the probability of an event ending from pure mathematics like this - you need event-specific knowledge to draw any conclusions.
So while this Copernican principle sounds very deep and insightful, it is actually just a pretty trite mathematical observation.
They probably had a great skit about the revolt of the Earls against William the Conquerer.
So no, you're not very likely to be right at all. Now sure, if you guess "50 years" for every event, your average error rate will be even worse, across all possible events. But it is absolutely not true that it's more likely that SNL will last for another 50 years as it is that it will last for another 10 years. They are all exactly as likely, given the information we have today.
> Well, there’s nothing special about the timing of my visit. I’m just travelling—you know, Europe on five dollars a day—and I’m observing the Wall because it happens to be here.
It's relatively unlikely that you'd visit the Berlin Wall shortly after it's erected or shortly before it falls, and quite likely that you'd visit it somewhere in the middle.
Well 1/3 of the examples you gave were right.
- In 1969 (8 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1972 (8x4/3=11 years) and 1993 (8x4=32 years)
- In 1989 (28 years after the wall was erected): You'd calculate that there's a 50% chance that the wall will fall between 1998 (28x4/3=37 years) and 2073 (28x4=112 years)
- In 1961 (when the wall was, say, 6 months old): You'd calculate that there's a 50% chance that the wall will fall between 1961 (0.5x4/3=0.667 years) and 1963 (0.5x4=2 years)
I found doing the math helped to point out how wide of a range the estimate provides. And 50% of the times you use this estimation method; your estimate will correctly be within this estimated range. It's also worth pointing out that, if your visit was at a random moment between 1961 and 1989, there's only a 3.6% chance that you visited in the final year of its 28 year span, and 1.8% chance that you visited in the first 6 months.
It's important to flag that the principle is not trite, and it is useful.
There's been a misunderstanding of the distribution after the measurement of "time taken so far" (illuminated in the other thread), which has led to this incorrect conclusion.
To bring the core clarification from the other thread here:
The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the estimate `time_left=time_so_far` is useful.
The most likely time to fail is always "right now", i.e. this is the part of the curve with the greatest height.
However, the average expected future lifetime increases as a thing ages, because survival is evidence of robustness.
Both of these statements are true and are derived from:
P(survival) = t_obs / (t_obs + t_more)
There is no contradiction.
This is a completely different argument that relies on various real-world assumptions, and has nothing to do with the Copernican principle, which is an abstract mathematical concept. And I actually think this does make sense, for many common categories of processes.
However, even this estimate is quite flawed, and many real-world processes that intuitively seem to follow it, don't. For example, looking at an individual animal, it sounds kinda right to say "if it survived this long, it means it's robust, so I should expect it will survive more". In reality, the lifetime of most animals is a bimodal distribution - they either die very young, because of glaring genetic defects or simply because they're small, fragile, and inexperienced; or they die at some common age that is species-dependent. For example, a human that survived to 20 years of age has about the same chance of reaching 80 as one that survived to 60 years of age. And an alien who has no idea how long humans live and tries to apply this method may think "I met this human when they're 80 years old - so they'll probably live to be around 160".
I don't think this is correct; as in, something that has been there for, say, hundreds of years has a higher probability of still being there in a hundred years than something that has been there for a month.
Edit: I should add that, more specifically, this is a property of the uniform distribution: it applies to any event for which EndsAfter(t) is uniformly distributed over all t > 0.
Also, the worst thing you can get from this logic is to think that it is actually most likely that the future duration equals the past duration. This is very much false, and it can mislead you if you think it's true. In fact, with no other insight, all future durations are equally likely for any particular event.
The better thing to do is to get some event-specific knowledge, rather than trying to reason from a priori logic. That will easily beat this method of estimation.
> The better thing to do is to get some event-specific knowledge, rather than trying to reason from a priori logic
True, and all the posts above have acknowledged this.
This is exactly what I don't think is right. This particular outage has the same a priori chance of being back in 20 minutes, in one hour, in 30 hours, in two weeks, etc.
The distribution is uniform before you get the measurement of time taken already. But once you get that measurement, it's no longer uniform. There's a decaying curve whose shape is defined by the time taken so far. Such that the statement above is correct, and the estimate `time_left=time_so_far` is useful.
If P(1 more minute | 1 minute so far) = x, then why would P(1 more minute | 2 minutes so far) < x?
Of course, P(it will last for 2 minutes total | 2 minutes elapsed) = 0, but that can only increase the probabilities of any subsequent duration, not decrease them.
If: P(1 more minute | 1 minute so far) = x
Then: P(1 more minute | 2 minutes so far) > x
The curve is:
P(survival) = t_obs / (t_obs + t_more)
(t_obs is time observed to have survived, t_more how long to survive)
Case 1 (x): It has lasted 1 minute (t_obs=1). The probability of it lasting 1 more minute is: 1 / (1 + 1) = 1/2 = 50%
Case 2: It has lasted 2 minutes (t_obs=2). The probability of it lasting 1 more minute is: 2 / (2 + 1) = 2/3 ≈ 67%
I.e. the curve is a decaying curve, but the shape / height of it changes based on t_obs.
That gets to the whole point of this, which is that the length of time something has survived is useful / provides some information on how long it is likely to survive.
Where are you getting this formula from? Either way, it doesn't have the property we were originally discussing - the claim that the best estimate of the duration of an event is double its current age. That is, by this formula, the probability of surviving the next millisecond is P(1 more millisecond | t_obs) = t_obs / (t_obs + 1ms) ~= 1 for any t_obs >> 1ms, so the chance of collapsing in the next millisecond is tiny - but under this curve no later millisecond is ever more likely than the next one. So by this logic, the best estimate for how much longer an event will take is that it will end right away.
The formula I've found that appears to summarize the original "Copernican argument" for duration is more complex - for 50% confidence, it would say:
P(t_more in [1/3 t_obs, 3t_obs]) = 50%
That is, given that we have a 50% chance to be experiencing the middle part of an event, we should expect its future life to be between one third and three times its past life. Of course, this can be turned on its head: we're also 50% likely to be experiencing the extreme ends of an event, so by the same logic we can also say that P(t_more = 0 [we're at the very end] or t_more = +inf [we're at the very beginning and it could last forever]) is also 50%. So the chance t_more > t_obs is equal to the chance it's any other value. So we have precisely 0 information.
The bottom line is that you can't get more information out of a uniform distribution. If we assume all future durations have the same probability, then they have the same probability, and we can't predict anything useful about them. We can play word games, like this 50% CI thing, but it's just that - word games, not actual insight.
It's not a uniform distribution after the first measurement, t_obs. That enables us to update the distribution, and it becomes a decaying one.
I think you mistakenly believe the distribution is still uniform after that measurement.
The best guess, that it will last for as long as it already survived for, is actually the "median" of that distribution. The median isn't the highest point on the probability curve, but the point where half the area under the curve is before it, and half the area under the curve is after it.
And the above equation is consistent with that.
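Spelling that out with the survival curve quoted above (in TeX notation; t_med is the median of the remaining time):

    P(T_{\mathrm{more}} \ge t \mid t_{\mathrm{obs}}) = \frac{t_{\mathrm{obs}}}{t_{\mathrm{obs}} + t},
    \qquad
    \frac{t_{\mathrm{obs}}}{t_{\mathrm{obs}} + t_{\mathrm{med}}} = \frac{1}{2}
    \;\Longrightarrow\;
    t_{\mathrm{med}} = t_{\mathrm{obs}}.

So half the remaining-time probability lies below t_obs and half above it, even though the density itself is highest at t = 0 - which is how "the most likely time to fail is right now" and "expect it to last about as long again" fit together.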
The cumulative distribution actually ends up pretty exponential which (I think) means that if you estimate the amount of time left in the outage as the mean of all outages that are longer than the current outage, you end up with a flat value that's around 8 hours, if I've done my maths right.
Not a statistician so I'm sure I've committed some statistical crimes there!
Unfortunately I can't find an easy way to upload images of the charts I've made right now, but you can tinker with my data:
cause,outage_start,outage_duration,incident_duration
Cell management system bug,2024-07-30T21:45:00.000000+0000,0.2861111111111111,1.4951388888888888
Latent software defect,2023-06-13T18:49:00.000000+0000,0.08055555555555555,0.15833333333333333
Automated scaling activity,2021-12-07T15:30:00.000000+0000,0.2861111111111111,0.3736111111111111
Network device operating system bug,2021-09-01T22:30:00.000000+0000,0.2583333333333333,0.2583333333333333
Thread count exceeded limit,2020-11-25T13:15:00.000000+0000,0.7138888888888889,0.7194444444444444
Datacenter cooling system failure,2019-08-23T03:36:00.000000+0000,0.24583333333333332,0.24583333333333332
Configuration error removed setting,2018-11-21T23:19:00.000000+0000,0.058333333333333334,0.058333333333333334
Command input error,2017-02-28T17:37:00.000000+0000,0.17847222222222223,0.17847222222222223
Utility power failure,2016-06-05T05:25:00.000000+0000,0.3993055555555555,0.3993055555555555
Network disruption triggering bug,2015-09-20T09:19:00.000000+0000,0.20208333333333334,0.20208333333333334
Transformer failure,2014-08-07T17:41:00.000000+0000,0.13055555555555556,3.4055555555555554
Power loss to servers,2014-06-14T04:16:00.000000+0000,0.08333333333333333,0.17638888888888887
Utility power loss,2013-12-18T06:05:00.000000+0000,0.07013888888888889,0.11388888888888889
Maintenance process error,2012-12-24T20:24:00.000000+0000,0.8270833333333333,0.9868055555555555
Memory leak in agent,2012-10-22T17:00:00.000000+0000,0.26041666666666663,0.4930555555555555
Electrical storm causing failures,2012-06-30T02:24:00.000000+0000,0.20902777777777776,0.25416666666666665
Network configuration change error,2011-04-21T07:47:00.000000+0000,1.4881944444444444,3.592361111111111

Seems to be really limited to us-east-1 (https://health.aws.amazon.com/health/status). I think they host a lot of console and backend stuff there.
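For anyone who wants to poke at the outage data a few comments up, a rough Python sketch of the "mean of all longer outages" estimate (assuming the CSV is saved to a file, and assuming the duration columns are fractions of a day, which is how I read them):

    import csv

    with open("aws_outages.csv") as f:
        rows = list(csv.DictReader(f))

    # incident_duration looks like fractions of a day -> convert to hours
    durations_h = sorted(float(r["incident_duration"]) * 24 for r in rows)

    def time_left_estimate(hours_so_far):
        # Mean duration of all recorded outages longer than the current one,
        # minus the time already elapsed (one reading of the estimate above).
        longer = [d for d in durations_h if d > hours_so_far]
        if not longer:
            return float("nan")
        return sum(longer) / len(longer) - hours_so_far

    for h in (1, 3, 6, 12):
        print(f"{h}h elapsed -> ~{time_left_estimate(h):.1f}h left")

Whether that comes out flat around 8 hours depends on which duration column you use and how you read "time left", so treat it as a starting point rather than a check of the claim.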
Obviously there's pros and cons. One of the pros being that you're so much more resilient to what goes on around you.
But if you look at open source projects, many are close to perfectly vertically integrated.
There's also a big big difference between relying on someone's code and relying on someone's machines. You can vendor code - you, however, rely on particular machines being up and connected to the internet. Machines you don't own and you aren't allowed to audit.
You said "Many businesses ARE fully vertically integrated." so why name one that is close to fully vertically integrated, just name one of the many others that are fully vertically integrated. I don't really care about discussing things which prove my point instead of your point as if they prove your point.
> open source projects, many are close to perfectly vertically integrated
Comparing code to services seems odd; not sure how GitLab the software compares to GitLab the service, for example. Code is just code; a service requires servers to run on, etc. GitLab the software can't have uptime because it's just code. It can only have an uptime once someone starts running it, at which point you can't attribute everything to the software anymore, as the people running it have a great deal of responsibility for how well it runs. And even then, even if GitLab the software were "close to perfectly vertically integrated" (like if they used no OS, as if anyone would ever want that), the GitLab service would still need many things from other suppliers to operate.
And again, "close to perfectly vertically integrated" is not "perfectly vertically integrated".
If you are wrong, and in fact nothing in our modern world is fully vertically integrated as I said, then it's best to just admit that and move on from that and continue discussing reality.
You're arguing semantics because you know that's the only way you'll feel right. Bye.
What I said: "No business is fully integrated."
A lot of AWS services under the hood depend on others, and especially us-east-1 is often used for things that require strong consistency like AWS console logins/etc (where you absolutely don't want a changed password or revoked session to remain valid in other regions because of eventual consistency).
even internally, Amazon's dependency graph became visually+logically incomprehensible a long time ago
Looks like AWS detonated six sticks of dynamite under a house of cards...
In the coming hours/days we'll find out if AWS still have significant single points of failure in that region, or if _so many companies_ are just not bothering to build in redundancy to mitigate regional outages.
I'm looking forward to the RCA!
Actually, many companies are de facto forced to do that, for various reasons.
It means that in order to be certified you have to use providers that in turn are certified or you will have to prove that you have all of your ducks in a row and that goes way beyond certain levels of redundancy, to the point that most companies just give up and use a cloud solution because they have enough headaches just getting their internal processes aligned with various certification requirements.
Medical, banking and insurance, to name just a few, are heavily regulated, and to suggest that it 'just means certain levels of redundancy' is a very uninformed take.
What's more likely for medical, at least, is that if you make your own app, your customers will want to install it into their AWS/Azure instance, and so you have to support them.
Everything depends on DNS....
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
>circa 2025: grayed out on Hacker News
United we fall.
There's no way it's DNS
It was DNS
Even the error message itself is wrong whenever that one appears.
"Too many requests. Your request has been rate limited, please take a break for a couple minutes and try again."
it seems they found your comment
LOL
Snapchat heavily used Google AppEngine to scale. This was basically a magical Java runtime that would 'hot path split' the monolithic service into lambda-like worker pools. Pretty crazy, but it worked well.
Snapchat leaned very heavily on this though and basically let Google build the tech that allowed them to scale up instead of dealing with that problem internally. At one point, Snap was >70% of all GCP usage. And this was almost all concentrated on ONE Java service. Nuts stuff.
Anyway, eventually Google was no longer happy with supporting this, and the corporate way of breaking up is "hey we're gonna charge you 10x what you paid last year for this, kay?" (I don't know if it was actually 10x. It was just a LOT more)
So began the migration towards Kubernetes and AWS EKS. Snap was one of the pilot customers for EKS before it was generally available, iirc. (I helped work on this migration in 2018/2019)
Now, 6+ years later, I don't think Snap heavily uses GCP for traffic unless they migrated back. And this outage basically confirms that :P
GCP is behind in market share, but has the incredible cheat advantage of just not being Amazon. Most retailers won't touch Amazon services with a ten foot pole, so the choice is GCP or Azure. Azure is way more painful for FOSS stacks, so GCP has its own area with only limited competition.
Honestly as a (very small) shareholder in Amazon, they should spin off AWS as a separate company. The Amazon brand is holding AWS back.
Big monopolists do not unlock more stock market value, they hoard it and stifle it.
However I have seen many people flee from GCP because: Google lacks customer focus, Google is free about killing services, Google seems to not care about external users, people plain don’t trust Google with their code, data or reputation.
0: https://chrpopov.medium.com/scaling-cloud-infrastructure-5c6...
1: https://eng.snap.com/monolith-to-multicloud-microservices-sn...
The general idea being that you'd be losing money due to opportunity cost.
Personally, I think you're better off just not laying people off and having them work on the less (but still) profitable stuff. But I'm not in charge.
What's the point of all the auto-healing node-graph systems that were designed in the 70s and refined over decades: if we're just going to do mainframe development anyway?
That company is very concerning, and not because of an outage. In fact, I wish one day we have a full Cloudflare outage and the entire net goes dark, so it finally sinks in how much control this one f'ing company has over information in our so-called free society.
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area." also "And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
It reminds me of that viral clip from a beauty pageant where the contestant went on a geographical ramble while the question was about US education.
A lot of businesses have all their workflows depending on their data on airtable.
dynamodb.us-east-1.api.aws
It's always DNS...
In any case, in order for this to happen, someone would have to collect reliable data (not all big cloud providers like to publish precise data; usually they downplay the outages and use weasel words like "some customers... in some regions... might have experienced" just not to admit they had an outage) and present stats comparing the availability of Hetzner Cloud vs the big three.
I only got €100.000 limited to a year, then a 20% discount on spend in the next year.
(I say "only" because that certainly would be a sweeter pill, €100.000 in "free" credits is enough to make you get hooked, because you can really feel the free-ness in the moment).
Simultaneously too confused to be able to make their own UX choices, but smart enough to understand the backend of your infrastructure enough to know why it doesn't work and excuses you for it.
I liked your point though!
In technically sophisticated organizations, this disconnect simply floats to higher levels (e.g. CEO vs. CTO rather than middle manager vs. engineer).
Unless you lose a significant amount of money per minute of downtime, there is no incentive to go multicloud.
And multicloud has its own issues.
In the end, you live with the fact that your service might be down a day or two per year.
This is hilarious. In the 90s we used to have services which ran on machines in cupboards which would go down because the cleaner would unplug them. Even then a day or two per year would be unacceptable.
Seems like large cloud providers, including AWS, are down quite regularly in comparison, and at such a scale that everything breaks for everyone involved.
If I am affected, I want everyone to be affected, from a messaging perspective
Take the hit of being down once every 10 years compared to being up for the remaining 9 that others are down.
Hosting on second- or even third-tier providers allows you to overprovision and have much better redundancy, provided your solution is architected from the ground up in a vendor agnostic way. Hetzner is dirt cheap, and there are countless cheap and reliable providers spread around the globe (Europe in my case) to host a fleet of stateless containers that never fail simultaneously.
Stateful services are much more difficult, but replication and failover is not rocket science. 30 minutes of downtime or 30 seconds of data loss rarely kill businesses. On the contrary, unrealistic RTOs and RPOs are, in my experience, more dangerous, either as increased complexity or as vendor lock-in.
Customers don't expect 100% availability and no one offers such SLAs. But for most businesses, 99.95% is perfectly acceptable, and it is not difficult to have less than 4h/year of downtime.
This is alongside "live" reporting on the Israel/Gaza conflict as well as news about Epstein and the Louvre heist.
This is mainstream news.
Perhaps some parts of the migration haven't been completed, or there is still a central database in us-east-1
“Amazon Web Services (AWS) is Amazon’s internet based cloud service connecting businesses to people using their apps or online platforms.”
Uh.. yeah.
“Amazon Web Services (AWS) is a cloud computing platform that provides on-demand access to computing power, storage, databases, and other IT resources over the internet, allowing businesses to scale and pay only for what they use.”
This part:
> access to computing power, storage, databases, and other IT resources
could be simplified to: access to computer servers.

Most people who know little about computers can still imagine a giant mainframe they saw in a movie with a bunch of blinking lights. Not so different, visually, from a modern data center.
It's the same as having a computer room but in someone else's datacentre.
> An Amazon Web Services outage is causing major disruptions around the world. The service provides remote computing services to many governments, universities and companies, including The Boston Globe.
> On DownDetector, a website that tracks online outages, users reported issues with Snapchat, Roblox, Fortnite online broker Robinhood, the McDonald’s app and many other services.
There’s a ton of momentum associated with the prior dominance, but between the big misses on AI, a general slow pace of innovation on core services, and a steady stream of top leadership and engineers moving elsewhere they’re looking quite vulnerable.
As much as I might not like AWS, I think they’ll remain #1 for the foreseeable future. Despite the reasons the guy listed.
The improvements to core services at AWS haven't really happened at the same pace post-covid as they did prior, but that could also have something to do with the overall maturity of the ecosystem.
Although it's also largely the case that other cloud providers have also realized that it's hard for them to compete against the core competency of other companies, whereas they'd still be selling the infrastructure the above services are run on.
I'm not sure what feature they're really missing, but my favorite is the way they handle AWS Fargate. The other cloud providers have similar offerings but I find Fargate to have almost no limitations when compared to the others.
It means no longer being hungry. Then you start making mistakes. You stop innovating. And then you slowly lose whatever kind of edge you had, but you don't realize that you're losing it until it's gone
https://en.wikipedia.org/wiki/2021_Facebook_outage#Impact
Somewhat related tip of the day, don't host your status page as a subdomain off the main site. Ideally host it with a different provider entirely
Your margin is my opportunity indeed.
Hetzner has the better web interface and supposedly better uptime, but I've had no problems with either. Web interface not necessary at all either when using only ssh and paying directly.
I think I am more distributed than most of the AWS folks and it is still way cheaper.
Comments like this are so exaggerated that they risk moving the goodwill needle back to where it was before. Hetzner offers no service that is similar to DynamoDB, IAM or Lambda. If you are going to praise Hetzner as a valid alternative during a DynamoDB outage caused by DNS configuration, you would need to a) argue that Hetzner is a better option regarding DNS outages, b) Hetzner is a preferable option for those who use serverless offers.
I say this as a long-time Hetzner user. Hetzner is indeed cheaper, but don't pretend that Hetzner lets you click your way into a highly-available NoSQL data store. You need a non-trivial amount of your own work to develop, deploy, and maintain such a service.
The key thing you should ask yourself: do you need DynamoDB or Lambda? Like "need need" or "my resume needs Lambda".
If you read the message you're replying to, you will notice that I singled out IAM, Lambda, and DynamoDB because those services were affected by the outage.
If Hetzner is pushed as a better or even relevant alternative, you need to be able to explain exactly what you are hoping to say to Lambda/IAM/DynamoDB users to convince them that they would do better if they used Hetzner instead.
Making up conspiracy theories over CVs doesn't cut it. Either you know something about the topic and are actually able to support this idea, or you're an eternal September admission whose only contribution is noise and memes.
What is it?
Plenty of heavy-traffic, high-redundancy applications exist without the need for AWS's (or any other cloud provider's) overpriced "bespoke" systems.
At some point in the scaling journey bare metal might be the right choice, but I get the feeling a lot of people here trivialize it.
Database services such as DynamoDB support a few backup strategies out of the box, including continuous backups. You just need to flip a switch and never bother about it again.
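For what it's worth, the "flip a switch" part really is one API call. A minimal boto3 sketch (the table name is a placeholder):

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Enable continuous backups (point-in-time recovery) for a table.
    dynamodb.update_continuous_backups(
        TableName="my-table",  # placeholder
        PointInTimeRecoverySpecification={"PointInTimeRecoveryEnabled": True},
    )

The equivalent is a single checkbox in the console or one setting in Terraform/CloudFormation.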
> Has worked for me for 20 years at various companies, side projects, startups …
That's perfectly fine. There are still developers who don't even use version control at all. Some old habits die hard, even when the whole world has moved on.
If you're doing it yourself, learn Ansible, you'll do it once and be set forever.
You do not need "managed" database services. A managed database is no different from apt install postgresql followed by a scheduled backup.
It genuinely is trivial, people seem to have this impression theres some sort of unique special sauce going on at AWS when there really isn't.
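For concreteness, the single-node version being described is roughly the following sketch (database name, backup path, and schedule are placeholders; it assumes /backups exists and gets shipped off the box). The HA/replication pieces discussed below are what it leaves out:

    sudo apt install postgresql
    sudo -u postgres createdb myapp
    # nightly logical backup via cron.d (the \% escaping is required in crontab syntax)
    echo '0 3 * * * postgres pg_dump -Fc myapp > /backups/myapp-$(date +\%F).dump' | sudo tee /etc/cron.d/pg-backup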
It's the HA part, especially with a high-volume DB that's challenging.
Managed databases are a lot more than apt install postgresql.
> You do not need "managed" database services. A managed database is no different from apt install postgresql followed by a scheduled backup.
Genuinely no disrespect, but these statements really make it seem like you have limited experience building an HA scalable system. And no, you don't need to be Netflix or Amazon to build software at scale, or require high availability.
And almost all of them need a database, a load balancer, maybe some sort of cache. AWS has got you covered.
Maybe some of them need some async periodic reporting tasks. Or to store massive files or datasets and do analysis on them. Or transcode video. Or transform images. Or run another type of database for a third party piece of software. Or run a queue for something. Or capture logs or metrics.
And on and on and and on. AWS has got you covered.
This is Excel all over again. "Excel is too complex and has too many features, nobody needs more than 20% of Excel. It's just that everyone needs a different 20%".
I think a few people who claim to be in devops could do with learning the basics of how things like Ansible can help them. There are a fair few people who seem to be under the impression that AWS is the only, and the best, option, which, unless you're FAANG, is rarely the case.
Load balancing is trivial unless you get into global multicast LBs, but AWS have you covered there too.
(/s, obviously)
I think you don't understand the scenario you are commenting on. I'll explain why.
It's irrelevant whether you believe you can imagine another way to do something, or that you believe it's "insanely easy" to do it yourself. What matters is that others can make that assessment themselves, and what you are failing to understand is that when they do, their conclusion is that the easiest way by far to deploy and maintain those services is AWS.
And it isn't even close.
You mention load balancing and caching. The likes of AWS allow you to set up a global deployment of those services with a couple of clicks. In AWS it's a basic configuration change. And if you don't want it any more, you just tear everything down with a couple of clicks as well.
Why do you think a third of all the internet runs on AWS? Do you think every single cloud engineer in the world is unable to exercise any form of critical thinking? Do you think there's a conspiracy out there to force AWS to rule the world?
I think you don't even understand the issue you are commenting on. It's irrelevant if you are Netflix or some guy playing with a tutorial. One of the key traits of serverless offerings is how they eliminate the need to manage and maintain a service, or even worry about whether you have enough computational resources. You click a button to provision everything, you configure your clients to consume that service, and you are done.
If you stop to think about the amount of work you need to invest just to arrive at a point where you can actually point a client at a service, you'll see where the value of serverless offerings lies.
Ironically, it's the likes of Netflix who can put together a case against using serverless offerings. They can afford to have their own teams managing their own platform services at the service levels they are willing to pay for. For everyone else, unless you are in the business of managing and tuning databases or you are heavily motivated to save pocket change on a cloud provider bill, the decision process is neither that clear-cut nor does it favour running your own services.
You will in both cases need specialized people.
The idea you can click your way to a highly available, production configured anything in AWS - especially involving Dynamo, IAM and Lambda - is something I've only heard from people who've done AWS quickstarts but never run anything at scale in AWS.
Of course nobody else offers AWS products, but people use AWS for their solutions to compute problems and it can be easy to forget virtually all other providers offer solutions to all the same problems.
With some services I'd agree with you, but DynamoDB and Lambda are easily two of their 'simplest' to configure and understand services, and two of the ones that scale the easiest. IAM roles can be decently complicated, but that's really up to the user. If it's just 'let the Lambda talk to the table' it's simple enough.
S3/SQS/Lambda/DynamoDB are the services that I'd consider the 'barebones' of the cloud. If you don't have all those, you're not a cloud provider, you're just another server vendor.
Not if you want to build something production ready. Even a simple thing like, say, static IP ingress for the Lambda is very complicated. The only AWS way you can do this is by using Global Accelerator -> Application Load Balancer -> VPC Endpoint -> API Gateway -> Lambda.
There are so many limits on everything that it is very hard to run production workloads without painful time wasted re-architecting around them, and the support teams are close to useless for raising any limits.
Just in the last few months, I have hit limits on CloudFormation stack size, ALB rules, API gateway custom domains, Parameter Store size limits and on and on.
That is not even touching on the laughably basic tooling both SAM and CDK provide for local development if you want to work with Lambda.
Sure, Firecracker is great, the cold starts are not bad, and there isn't anybody even close on the cloud. Azure Functions is unspeakably horrible, Cloud Run is just meh. Most open source stacks are either super complex like Knative or just quite hard to get to the same cold start performance.
We are stuck with AWS Lambda with nothing better, yes, but oh so many times I have come close to just giving up and migrating to Knative despite the complexity and performance hit.
Explain exactly what scenario you believe requires you to provide a lambda behind a static IP.
In the meantime, I recommend you learn how to invoke a lambda, because a static IP is something that is extremely hard to justify.
When you are working with enterprise customers or integration partners (it doesn't even have to be regulated sectors like finance or healthcare), these are basic asks you cannot get away from.
People want to be able to whitelist your egress and ingress IPs or pin certificates. It is not up to me to comment on the efficacy of these rules.
I don't make the rules of the infosec world, I just follow them.
Alright, if that's what you're going with then you can just follow an AWS tutorial:
https://docs.aws.amazon.com/lambda/latest/dg/configuration-v...
Provision an elastic IP to give yourself a static address, set up the NAT gateway to handle traffic, and plug the lambda into the NAT gateway.
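A rough sketch of those steps with the AWS CLI (all IDs are placeholders, and this gives you a static egress IP, not ingress):

    # static egress IP for a Lambda that runs inside a VPC
    aws ec2 allocate-address --domain vpc                  # note the returned AllocationId
    aws ec2 create-nat-gateway --subnet-id subnet-public123 --allocation-id eipalloc-abc123
    aws ec2 create-route --route-table-id rtb-private123 \
        --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-0abc123
    aws lambda update-function-configuration --function-name my-fn \
        --vpc-config SubnetIds=subnet-private123,SecurityGroupIds=sg-0abc123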
Do you think this qualifies as very complicated?
The only part we are swapping out is `GA -> ALB -> VPC` for `IG -> Router -> NAT -> VPC`.
Is it any simpler? Doesn't seem like it is to me.
Going the NAT route also means you need intermediate networking skills to handle a routing table (albeit a simple one); half the developers today have never used iptables or know what chaining rules are.
---
I am surprised at the amount of pushback on a simple point which should be painfully obvious.
AWS (Azure/GCP are no different) has become overly complex, with no first-class support for higher-order abstractions, and framework efforts like SAM or even CDK don't seem to be getting much love at all in the last 4-5 years.
Just because they offer and sell all these components independently doesn't mean they shouldn't invest in and provide higher-order abstractions for people with neither the bandwidth nor the luxury to be a full-time "Cloud Architect".
There is a reason why today Vercel, Render, Railway and others are popular despite mostly sitting on top of AWS.
On Vercel the same feature would be[1] quite simple. They use the exact solution you suggest on top of an AWS NAT gateway, but the difference is that I don't have to know about or manage it; that is handled by the large professional engineering team with networking experience at Vercel.
There is no reason AWS could not have built Vercel like features on top of their offerings or do so now.
At some point small to midsize developers will avoid direct AWS, either by choosing to set up Hetzner/OVH bare machines, or with a bit more budget colo with Oxide[3], or more likely by just sticking to Vercel- and Railway-like platforms.
I don't know how that will impact AWS; we will all still use them. However, a ton of small customers paying close to rack rate is definitely much, much higher margin than what Vercel pays AWS for the same workload.
--
[1] https://docs.aws.amazon.com/prescriptive-guidance/latest/pat...
[2] https://vercel.com/docs/connectivity/static-ips
[3] Would be rare, obviously only if they have the skill and experience to do so.
>>Gives a specific edge case about static IPs and doing a serverless API backed by lambda.
The most naive solution you'd use on any non-cloud vendor, just a proxy with a static IP that routes traffic wherever it needs to go, would also work on AWS.
So if you think AWS's solution sucks why not just go with that? What you described doesn't even sound complicated when you think of the networking magic behind the scenes that will take place if you ever do scale to 1 million tps.
Don't know what you think that should mean, but for me it means:
1. Declarative IaC in either CF or Terraform
2. Fully automated recovery that can meet RTO/RPO objectives
3. Being able to do blue/green, percentage-based, or other rollouts
Sure, I can write Ansible scripts, have custom EC2 images running HAProxy and multiple nginx load balancers in HA as you suggest, or host all that on EKS or a dozen other "easier" solutions.
At that point, why bother with Lambda? What is the point of being cloud native and serverless if you have to literally put a few VMs/pods in front of it to handle all traffic? Might as well host the app runtime too.
> doesn't even sound complicated.
Because you need a full-time resource who is an AWS architect, keeps up with release notes, documentation or training, and constantly works to scale your application - because every single component has a dozen quotas/limits and you will hit them - it is complicated.
If you spend a few million a year on AWS, then spending 300k on an engineer to just do AWS is perhaps feasible.
If you spend a few hundred thousand on AWS as part of a mix of workloads, it is not easy or simple.
The engineering of AWS, impressive as it may be, has nothing to do with the products being offered. There is a reason why Pulumi, SST or AWS SAM itself exist.
Sadly SAM is so limited I had to rewrite everything in CDK within a couple of months. CDK is better, but I am finding that I have to monkey-patch around CDK's limits with SDK code now; while possible, the SDK code will not generate CloudFormation templates.
I think your inexperience is showing, if that's what you mean by "production-ready". You're making a storm in a teacup over features that you automatically onboard if you go through an intro tutorial, and "production-ready" typically means way more than a basic run-of-the-mill CI/CD pipeline.
As is most often the case, the most vocal online criticism comes from those who have the least knowledge and experience of the topic they are railing against, and their complaints mainly boil down to criticising their own inexperience and ignorance. There are plenty of things to criticize AWS for, such as cost and vendor lock-in, but being unable and unwilling to learn how to use basic services is not one of them.
We agree, but also, I feel like you're missing my point: "let the Lambda talk to the table" is what quickstarts produce. To make a Lambda talk to a table at scale in production, you'll want to set up your alerting and monitoring to notify you when you're getting close to your service limits.
If you're not hitting service limits/quotas, you're not running even close to running at scale.
I'll bite. Explain exactly what work you think you need to do to get your pick of service running on Hetzner with equivalent fault-tolerance to, say, a DynamoDB Global Table created with the defaults.
Maybe not click, but Scylla’s install script [0] doesn’t seem overly complicated.
0: https://docs.scylladb.com/manual/stable/getting-started/inst...
It's a server! What in the world is your friend doing running a single disk???
At a bare minimum they should have been running a mirror.
(Interesting that an anecdote like the above got downvoted.)
experts almost universally judge newbies harshly, as if the newbies should already know all of the mistakes to avoid. things like this are how you learn what mistakes to avoid.
"hindsight is 20/20" means nothing to a lot of people, unfortunately.
Sure, if you configure offsite backups you can guard against this stuff, but with anything in life, you get what you pay for.
The truth is no one under the age of 35 is able to configure a webserver any more, apparently. Especially now that static site generators are in vogue and you don't even need to worry about php-fpm.
I would say tech workers rather than "people" as they are the ones needing to interact with it the most
Aside from Teams and Outlook Web, I really don't interact with Microsoft at all, haven't done since the days of XP. I'm sure there is integration on our corporate backends with things like active directory, but personally I don't have to deal with that.
Teams is fine for person-person instant messaging and video calls. I find it terrible for most other functions, but fortunately I don't have to use it for anything other than instant messaging and video calls. The linux version of teams still works.
I still hold out a healthy suspicion of them from their behaviour when I started in the industry. I find it amusing the Microsoft fanboys of the 2000s with their "only needs to work in IE6" and "Silverlight is the future" are still having to maintain obsolete machines to access their obsolete systems.
Meanwhile the stuff I wrote to be platform-agnostic 20 years ago is still in daily use, still delivering business benefit, with the only update being a change from "<object" to "<video" on one internal system when flash retired.
As long as the illusion that AWS/clouds are the only way to do things continues, their investment will keep being valuable and they will keep getting paid for (over?)engineering solutions based on such technologies.
The second that illusion breaks down, they become no better than any typical Linux sysadmin, or teenager ricing their Archlinux setup in their homelab.
One thing to note is that there were some scheduled maintenances where we needed to react.
Going forward I expect American companies to follow this European vibe, it's like the opposite of enshitification.
Why do you expect American companies to follow it then? >:)
Admittedly they're getting fewer and fewer, but they exist.
The same is also true in GCP. So as much as I prefer GCP from a technical standpoint, the truth is: just because you can't see it doesn't mean it goes away.
(There may still be some core IAM dependencies in USE1, but I haven’t heard of any.)
We'll know when (if) some honest RCAs come out that pinpoint the issue.
the Billing part of the console in eu-west-2 was down though, presumably because that uses us-east-1 dynamodb, but route53 doesn't.
R53 seems to use Dynamo to keep track of the syncing of the DNS across the name servers, because while the record was there and resolving, the change set was stuck in PENDING.
After DynamoDB came back up, R53's API started working.
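For reference, that sync state is the one reported against the change ID, e.g. (ID is a placeholder):

    aws route53 get-change --id /change/C0123456789EXAMPLE
    # "Status": "PENDING" -> the change was accepted but not yet propagated
    # "Status": "INSYNC"  -> applied to all Route 53 authoritative name servers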
That might be datacenter dependent of course, since our root servers and cloud services are all hosted in Europe, but I really never understood why Hetzner is said to be less reliable.
I could find one or two downvoted or heavily criticized comments, but I can find more people mentioning the opposite.
It's just a single data point, but for me that's a pretty good record.
It's not because Hetzner is miraculously better at infrastructure, it's because physical servers are way simpler than the extremely complex software and networking systems that AWS provides.
Or, rather, it's your fault when the complex software and networking systems you deployed on top of those physical servers go wrong (:
5 mins seems unrealistic unless you’re spending time somewhere else to keep up to speed with version releases, upgrades, etc.
I’m not using k8s personally but the moment I moved from traditional infrastructure (chef server + VMs) to containers (Portainer) my level of effort went down by like 10x.
K8s solves only one problem - the problem of organizational structure scaling. For example, when your Ops team and your Dev team have different product deadlines and different budgets. At this point you will need the insanity of k8s.
So Hetzner is OK for the overly complex as well, if you wish to do so.
When those pink slips come in, we’ll just go somewhere else and do the same thing!
actually, why do people block ICMP? I remember in 1997-1998 there were some Cisco ICMP vulnerabilities and people started blocking ICMP then and mostly never stopped, and I never understood why. ICMP is so valuable for troubleshooting in certain situations.
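If the concern is ping floods rather than ICMP itself, rate-limiting echo requests is a friendlier middle ground than dropping everything; a rough iptables sketch (limits are arbitrary):

    iptables -A INPUT -p icmp --icmp-type echo-request -m limit --limit 5/second --limit-burst 20 -j ACCEPT
    iptables -A INPUT -p icmp --icmp-type echo-request -j DROP
    # leave other ICMP types (e.g. fragmentation-needed) alone, or PMTUD breaks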
I've rarely actually seen that advice anywhere, more so 20 years ago than now but people are still clearly getting it from circles I don't run in.
All sorts of monitoring gets flipped.
And no, there generally aren't brief outages in normal servers unless you did it.
I did have someone accidentally shut down one of the servers once though.
This was extra painful, because I wasn't using one of the OSes blessed by Hetzner, so it requires a remote install. Remote installs require a system that can run their Java web plugin, and that has a stable and fast enough connection to not time out. The only way I have reliably gotten them to work is by having an ancient Linux VM that was also running in Hetzner, and had the oldest Firefox version I could find that still supported Java in the browser.
My fault for trying to use what they provide in a way that is outside their intended use, and props to them for letting me do it anyway.
I learned a long time ago that servers should be an output of your declarative server management configuration, not something that is the source of any configuration state. In other words, you should have a system where you can recreate all your servers at any time.
In your case, I would indeed consider starting with one of the OS base installs that they provide. Much as I dislike the Linux distribution I'm using now, it is quite popular, so I can treat it as a common denominator that my ansible can start from.
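As a rough sketch of that workflow (inventory, playbook, and host names are placeholders), after letting Hetzner lay down the stock image you just re-converge the host:

    # dry-run against the freshly reinstalled host first
    ansible-playbook -i inventories/prod site.yml --limit web1 --check --diff
    # then converge it to the declared state
    ansible-playbook -i inventories/prod site.yml --limit web1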
> 99.99% uptime infra significantly cheaper than the cloud.
I guess that's another person that has never actually worked in the domain (SRE/admin) but still wants to talk with confidence on the topic.
Why do I say that? Because 99.99% is frickin easy
That's almost one full hour of complete downtime per year.
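The arithmetic, for anyone who wants to check (a year is about 525,960 minutes):

    99.99%   -> 525,960 min x 0.0001   ≈ 52.6 minutes of allowed downtime per year
    99.999%  -> 525,960 min x 0.00001  ≈ 5.3 minutes
    99.9999% -> 525,960 min x 0.000001 ≈ 32 seconds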
It only gets hard in the 99.9999+ range ... And you rarely meet that range with cloud providers either as requests still fail for some reason, like random 503 when a container is decommissioned or similar
Aws/cloud has similar outages too, but more redundancy and automatic failover/migrations that are transparent to customers happen. You don't have to worry about DDOS and many other admin burdens either.
YMMV, I'm just saying sometimes Aws makes sense, other times Hetzner does.
That's not necessarily ironic. Seems like you are suffering from recency bias.
It’s always DNS.
But DNS was designed in the 80s! It's actually a minor miracle it works as well as it does
/s
Systems often start with minimal dependencies, and then over time you add a dependency on X for a limited use case as a convenience. Then over time, since it's already being used it gets added to other use cases until you eventually find out that it's a critical dependency.
That's a major way your DNS stops working.
I suppose it's possible DNS broke health checks but it seems more likely to be the other way around imo
One particular “dns” issue that caused an outage was actually a bug in software that monitors healthchecks.
It would actively monitor all servers for a particular service (by updating itself based on what was deployed) and update dns based on those checks.
So when the health check monitors failed, servers would get removed from dns within a few milliseconds.
Bug gets deployed to health check service. All of a sudden users can’t resolve dns names because everything is marked as unhealthy and removed from dns.
So not really a “dns” issue, but it looks like one to users
> Signal is experiencing technical difficulties. We are working hard to restore service as quickly as possible.
Edit: Up and running again.
There's no way it's DNS
It was DNS
Probably makes sense to add "relies on AWS" to the criteria we're using to evaluate 3rd-party services.
1) GDPR is never enforced other than token fines based on technicalities. The vast majority of the cookie banners you see around are not compliant, so if the regulation was actually enforced they'd be the first to go... and it would be much easier to go after those (they are visible) rather than audit every company's internal codebases to check if they're sending data to a US-based provider.
2) you could technically build a service that relies on a US-based provider while not sending them any personal data or data that can be correlated with personal data.
>other than token fines based on technicalities. Result!
You can't consider a regulation being enforced if everyone gets away with publishing evidence of their non-compliance on their website in a very obnoxious manner.
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt by an actual fire, a statistically far less frequent event than an AWS outage). OK, that's a huge screw up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
Just because today's implementation has 4 9s that doesn't mean tomorrow's will...
If you are on AWS and AWS goes down, that's covered in the news as a bunch of billion dollar companies were also down. Customer probably gives you a pass.
Exactly - I've had clients say, "We'll pay for hot standbys in the same region, but not in another region. If an entire AWS region goes down, it'll be in the news, and our customers will understand, because we won't be their only service provider that goes down, and our clients might even be down themselves."
Show up at a meeting where a whole bunch of people appear to have wet themselves, and we’ll all agree not to mention it ever again…
They measure uptime using averages of "if any part of a chain is even marginally working".
People experience downtime however as "if any part of a chain is degraded".
Seriously, this thing already runs on 3 servers: a primary + backup, and a secondary in another datacenter/provider at Netcup. DNS is with another anycast DNS provider called ClouDNS. Everything is still way cheaper than AWS. The database is already replicated for reads, and I could switch to sharding if necessary. I can easily scale to 5, 7, whatever dedicated servers. But I do not have to right now. The primary is at 1% (sic!) load.
There really is no magic behind this. And you have to write your application in a distributable way anyway, you need to understand the concepts of stateless, write-locking, etc. also with AWS.
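For anyone curious what the read replication amounts to in practice with PostgreSQL, standing up an extra streaming replica is roughly this (host, user, and data directory are placeholders):

    # on the new standby: clone the primary and write the standby config
    pg_basebackup -h primary.example.com -U replicator -D /var/lib/postgresql/16/main -R -P
    systemctl start postgresql
    # reads can then go to the standby; writes still go to the primary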
Of course if they had on-site staff it wouldn't be an issue (worst case, just walk down to the transmitter hut and use the transmitter's aux input, which is there specifically for backup operations like this), but consolidation and enshittification of broadcast media mean there's probably nobody physically present.
It impacts AWS internally too. For example rather ironically it looks like the outage took out AWS’s support systems so folks couldn’t contact support to get help.
Unfortunately it’s not as simple as just deploying in multiple regions with some failover load balancing.
So moving stuff out of us-east-1 absolutely does help
They make lawyers happy and they stop intelligence services from accessing the associated resources.
For example, no one would even consider accessing data from a European region without the right paperwork.
But yeah, that's pretty hard and there are other reasons customers might want to explicitly choose the region.
Yes, within the same region. Doing stuff cross-region takes a little bit more effort and cost, so many skip it.
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
Decentralized in terms of many companies making up the internet. Yes we've seen heavy consolidation in now having less than 10 companies make up the bulk of the internet.
The problem here isn't caused by companies choosing one cloud provider over the other. It's the economies of scale leading us to a few large companies in any sector.
But trust me, 10 seconds after this outage is solved everybody will have forgotten about the possibility.
As long as the outages are rare enough and you automatically fail over to a different region, what's the problem?
And secondly, how often do you create that backup and are you willing to lose the writes since the last backup?
That backup is absolutely something people should have, but I doubt those are ever used to bring a service back up. That would be a monumental failure of your hosting provider (colo/cloud/whatever)
No, but if some Amazon flunky decides to kill your account to protect the Amazon brand then you will at least survive, even if you'll lose some data.
Not companies, the protocols are decentralized and at some point it was mostly non-companies. Anyone can hook up a computer and start serving requests which was/is a radical concept, we've lost a lot, unfortunately
We have put more and more services on fewer and fewer vendors. But that's the consolidation and cost point.
May I introduce you to our Lord and Slavemaster CGNAT?
Be it a company or a state, concentration of power that exceeds what is needed for its purpose to function, by a large margin, is always a sure way to spread corruption, create feedback loops around single points of failure, and buy everyone a ticket to some dystopian reality with a level of certainty that beats anything an SLA will ever give us.
Even connectivity has its points of failure. I've touched with my own hands fiber runs that, with a few quick snips from a wire cutter, could bring sizable portions of the Internet offline. Granted, that was a long time ago, so those points of failure may no longer exist.
my business contingency plan for "AWS shuts down and never comes back up" is to go bankrupt
Make sure you let your investors know.
Where did I say that? If I didn't say it: could you please argue in good faith. Thank you.
I get your point, but most companies don't need Y nines of uptime, heck, many should probably not even use AWS, k8s, serverless or whatever complicated tech gives them all these problems at all, and could do with something far simpler.
Again, of course there are exceptions, but advising people in general that they should think about what happens if AWS goes offline for good seems like poor engineering to me. It’s like designing every bridge in your country to handle a tomahawk missile strike.
I guess the reason why people are not doing it is because it hasn't been demonstrated it's worth it, yet!
I've got to admit though, whenever I hear about having a backup plan I think having apples to apples copy elsewhere which is probably not wise/viable anyway. Perhaps having just enough to reach out to the service users/customers suffice.
Also I must add I am heavily influenced by a comment by Adrian Cockroft on why going multi cloud isn't worth it. He worked for AWS (at the time at least) so I should have probably reached to the salt dispenser.
You might as well say the entire NY + DC metro loses power and "never comes back up". What is the plan around that? The person replying is correct: most companies do not have an actionable plan for AWS never coming back up.
I worked at a medium-large company and was responsible for reviewing the infrastructure BCP plan. It stated that AWS going down was a risk, and if it happens we wait for it to come back up. (In a lot more words than that).
An old mini PC running a home media server and a Minecraft server will run flawlessly forever. Put something of any business importance on it and it will fail tomorrow.
Related, I’m sure, is the fact that things like furnaces and water heaters will die on holidays.
Absurd claim.
Multi-region and multi-cloud are strategies that have existed almost as long as AWS. If your company is in anything finance-adjacent or critical infrastructure then you need to be able to cope with a single region of AWS failing.
GP said:
> most companies
Most companies aren't finance-adjacent or critical infrastructure
Of course there are cases where multi-cloud makes sense, but they are in the minority. The absurd claim is that most companies should worry about cloud outages and plan for AWS going offline forever.
That still fits in with "almost guarantee". It's not as though it's true for everyone, e.g. people who might trigger DR after 10 minutes of downtime, and have it up and running within 30 more minutes.
But it is true for almost everyone, as most people will trigger it after 30 minutes or more, which, plus the time to execute DR, is often going to be far less than the AWS resolution time.
Best of all would be just multi-everything services from the start, and us-east-1 is just another node, but that's expensive and tricky with state.
This describes, what, under 1% of companies out there?
For most companies the cost of being multi-region is much more than just accepting with the occasional outage.
It is rare, but it happens at LEAST 2-3x a year. AWS us-east-1 has a major incident affecting multiple services (that affect most downstream aws services) multiple times a year. Usually never the same root cause.
Not very many people realize that there are some services that still run only in us-east-1.
AWS (over-)reliance is insane...
Why people keep using "nation-state" term incorrectly in HN comments is beyond me...
You almost never see the definition you are referring used except in the context of explicit comparison of different bases and compositions of states, and in practice there is very close to zero ambiguity which sense is meant, and complaining about it is the same kind of misguided prescriptivism as (also popular on HN) complaining about the transitive use of "begs the question" because it has a different sense than the intransitive use.
It's like if you ran your cloud on an old Dell box in your closet while your parent company is offering to host it directly in AWS for free.
Not really? Most of the infrastructure is quite resilient and the rare outage is usually limited to a street or two, with restoration time mainly determined by the time it takes the electricians to reach the incident site. For any given address that's maybe a few hours per decade - with the most likely cause being planned maintenance. That's not a "spares are too expensive" issue, that's a "giving every home two fully independent power feeds is silly" issue.
Anything on a metro-sized level is pretty much unheard of, and will be treated as serious as a plane crash. They can essentially only be caused by systemic failure on multiple levels, as the grid is configured to survive multiple independent failures at the same time.
Comparing that to the AWS world: individual servers going down is inevitable and shouldn't come as a surprise. Everyone has redundancies, and an engineer accidentally yanking the power cables of an entire rack shouldn't even be noticeable to any customers. But an entire service going down across an entire availability zone? That should be virtually impossible, and having it happen regularly is a bit of a red flag.
(Of course that's still not the same as a big boy grid failure (Texas ice storm-sized) which are the things that utilities are meant to actively prevent ever happening.)
The right analogy is to imagine if businesses that used electricity took that stance, and they basically all do. If you're a hospital or some other business where a power outage is life or death, you plan by having backup generators. But if you're the overwhelming majority of businesses, you do absolutely nothing to ensure that you have power during a power outage, and it's fine.
it is fine because the electricity supplier is so good today that people don't see it going down as a risk.
Look at south africa's electricity supplier for a different scenario.
Texas has had statewide power outages. Spain and Portugal suffered near-nationwide power outages last year. Many US states are heavily reliant on the same single source for water. And remember the discussions on here about Europe's reliance on Russian gas?
Then you have the XKCD sketch about how most software products are reliant on at least one piece of open source software that is maintained by a single person as a hobby.
Nobody likes a single point of failure but often the costs associated with mitigating that are much greater than the risks of having that point of failure.
This is why "risk assessments" are a thing.
Not all utility companies have the same policies, but all have a resiliency plan to avoid blackout that is a bit more serious than "Just run it on AWS".
You're arguing as if "run it on AWS" was a decision that didn't undergo the same kinds of risk assessment. As someone who's had to complete such processes (and in some companies, even define them), I can assure you that nobody of any competency runs stuff on AWS complacently.
In fact running stuff with resilience in AWS isn't even as simple as "just running it in AWS". There's a whole plethora of things to consider, and each with its own costs attached. As the meme goes "one does not simply just run something on AWS"
I agree with this. My point is simply that we, as an industry, are not a very competent bunch when it comes to risk management ; and that's especially true when compared to TSOs.
That doesn't mean nobody knows what they do in our industry or that shit never hits the fan elsewhere, but I would argue that it's an outlier behaviour, whereas it's the norm in more secure industries.
> As the meme goes "one does not simply just run something on AWS"
The meme has currency for a reason, unfortunately.
---
That being said, my original point was that utilities losing clients after a storm isn't the consequence of bad (or no) risk assessment ; it's the consequence of them setting up acceptable loss thresholds depending on the likelihood of an event happening, and making sure that the network as a whole can respect these SLOs while strictly respecting safety criteria.
This is particularly true when Amazon hand out credits like candy. So you just need to moan to your AWS account manager about the service interruption and you’ll be covered.
Imagine if the cloud supplier was actually as important as the electricity supplier.
But since you mention it, there are instances of this and provisions for getting back up and running:
* https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...
As someone who lives in Ontario, Canada, I got hit by the 2003 grid outage, which is once in >20 years. Seems like a fairly good uptime to me.
(Each electrical grid can perhaps be considered analogous to a separate cloud provider. Or perhaps, in US/CA, regions:
* https://en.wikipedia.org/wiki/North_American_Electric_Reliab...
)
Catastrophic data loss or lack of disaster recovery kills companies. AWS outages do not.
1. https://en.wikipedia.org/wiki/Black_start 2. https://en.wikipedia.org/wiki/2025_Iberian_Peninsula_blackou...
And regardless, electric service all over the world goes down for minutes or hours all the time.
What are those ?
The only ones that you're likely to encounter are IAM, Route53, and the billing console. The billing console outage for a few hours is hardly a problem. IAM and Route53 are statically stable and designed to be mostly stand-alone. They are working fine right now, btw.
During this outage, my infrastructure on AWS is working just fine, simply because it's outside of us-east-1.
Ironically, our observability provider went down.
I presume this means you must not be working for a company running anything at scale on AWS.
But thinking the AWS SLA is guaranteed forever, and that everyone should put all their eggs in it because "everyone does it", is neither wise nor safe. Those who can afford it, and there are many businesses like that out there, should have a plan B. And AWS should not necessarily be plan A.
Nothing is forever. Not the Roman empire, not the Inca empire, not the Chinese dynasties, not US geopolitical supremacy. It's not a question of if but when. It doesn't need to happen through a lot of suffering, but if we don't systematically organise for a humanity which spreads well-being for everyone in a systematically resilient way, we will only make it through with a lot more tragic consequences when this or that single point of failure finally falls.
One of my projects is entirely hosted on S3. I don't care enough if it becomes unavailable for a few hours to justify paying to distribute it to GCP et al.
And actually for most companies, the cost of multi-cloud is greater than the benefits. Particularly when those larger entities can just bitch to their AWS account manager to get a few grand refunded as credits.
Not only that, but as you're seeing with this and the last few dozen outages... when us-east-1 goes down, a solid chunk of what many consumers consider the "internet" goes down. It's perceived less as "app C is down" and more is "the internet is broken today".
If my site's offline at the same time as the BBC has front page articles about how AWS is down and it's broken half the internet... it makes it _really_ easy for me to avoid blame without actually addressing the problem.
I don't need to deflect blame from my customers. Chances are they've already run into several other broken services today, they've seen news articles about it, and all from third parties. By the time they notice my service is down, they probably won't even bother asking me about it.
I can definitely see this encouraging more centralization, yes.
Oh god, this. At my company, we found a bug recently with rds.describe_events, which we needed to read binlog information after a B/G cutover. The bug, which AWS support “could not see the details of,” was that events would non-deterministically not show up if you were filtering by instance name. Their recommended fix was to pull in all events for the past N minutes, and do client-side filtering.
This was on top of the other bug I had found earlier, which was that despite the docs stating that you can use a B/G as a filter - a logical choice when querying for information directly related to the B/G you just cut over - doing so returns an empty set. Also, you can’t use a cluster (again, despite docs stating otherwise), you have to use the new cluster’s writer instance.
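For anyone who hits the same thing, the client-side-filtering workaround looks roughly like this (instance name is a placeholder):

    # pull everything from the last 30 minutes and filter locally,
    # instead of relying on server-side source filtering
    aws rds describe-events --duration 30 --source-type db-instance \
        --query "Events[?SourceIdentifier=='my-new-writer-1']"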
1. they are idiots
2. they do it on purpose and they think you are an idiot
For me, it just means that the moment you integrate with any API, you are basically their bitch (unless you implement one from every competitor in the market, at which point you can just as well do it yourself).
Hosting your services on AWS while having a status page on AWS during an AWS outage is an easily avoidable problem.
It’s not hard to imagine events that would keep AWS dark for a long period of time, especially if you’re just in one region. The outage today was in us-east-1. Companies affected by it might want to consider at least geographic diversity, if not supplier diversity.
Sure, it's worth considering, but for most companies it's not going to be worth the engineering effort to architect cross-cloud services. The complexity is NOT linear.
IMO most shops should focus on testing backups (which should be at least cross-cloud, potentially on-prem of some sort) to make sure their data integrity is solid. Your data can't be recreated, everything else can be rebuilt even if it takes a long time.
You know that's not true; us-east-1's last one was 2 years ago. But other services have bad days, and foundational ones drag others along.
At this point, being in any other region cuts your disaster exposure dramatically
AWS US-East 1 has many outages. Anything significant should account for that.
If your resilience plan is to trust a third party, that means you don't really care about going down, doesn't it?
Besides that, as the above poster said, the issue with top tier cloud providers (or cloudflare, or google, etc) is not just that you rely on them, it is that enough people rely on them that you may suffer even if you don't.
What about if your account gets deleted? Or compromised and all your instances/services deleted?
I think the idea is to be able to have things continue running on not-AWS.
"Permanent AWS outage" includes someone pressing the wrong button in the AWS console and deleting something important or things like a hack or ransomware attack corrupting your data, as well as your account being banned or whatever. While it does include AWS itself going down in a big way, it's extremely unlikely that it won't come back, but if you cover other possibilities, that will probably be covered too.
Step 2 is multi-AZ
Step 3 is multi-region
Step 4 is multi-cloud.
Each company can work on its next step, but most will not have positive EROI going from 2 to 3+.
The lesson here is that your approach will depend on your industry and peers. Every market will have its own philosophy and requirements here.
https://aws.amazon.com/blogs/industries/o2-telefonica-moves-...
A few hours could be a problem.
Not to mention it creates a valuable single point of failure for a hostile attack.
But a large enough set of "not too big to fail" companies becomes a too-big-to-fail event. Too many medium sized companies have a total dependency on AWS or Google or Microsoft, if not several at the same time.
We live in an increasingly fragile society, one step closer to critical failure, because big tech is not regulated in the same way as other infrastructure.
And FWIW, "AWS is down"....only one region (out of 36) of AWS is down.
You can do the multi-region failover, though that's still possibly overkill for most.
Second, preparing for the disappearance of AWS is even more silly. The chance that it will happen are orders of magnitude smaller than the chance that the cost of preparing for such an event will kill your business.
Let me ask you: how do you prepare your website for the complete collapse of western society? Will you be able to adapt your business model to post-apocalyptic world where it's only cockroaches?
How did we go from 'you could lose your AWS account' to 'complete collapse of western society'? Do websites even matter in that context?
> Second, preparing for the disappearance of AWS is even more silly.
What's silly is not thinking ahead.
That's the main topic that has been going through my mind lately, if you replace "my website" with "the Wikimedia movement".
We need a far better social, juridical and technical architecture for resilience, as hostile agendas are on the rise at all levels against sourced, trackable, global volunteer community knowledge bases.
Given the current geopolitical circumstances, that's not a far fetched scenario. Especially for us-east-1; or anything in the D.C. metro area.
For small and medium sized companies it's not easy to perform accurate due diligence.
You need to keep eyes peeled at all levels of the organization, as many of these dependencies enter through day-to-day work…
Resilient systems work autonomously and can synchronize - but don't need to synchronize.
* Git is resilient.
* Native E-Mail clients - with local storage enabled - are somewhat resilient.
* A local package repository is - somewhat resilient.
* A local file-sharing app (not Warp/ Magic-Wormhole -> needs relay) is resilient if it uses only local WiFi or Bluetooth.
We're building weak infrastructure. A lot of stuff should work locally and only optionally use the internet.
But the web, that's the fragile, centralized and weak point currently, and it seems to be what you're referring to rather than the internet.
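Git is a good example of that "works locally, syncs when it can" model; a minimal sketch of a LAN mirror you can keep working against (remotes and paths are placeholders):

    # keep a full local mirror on a LAN box
    git clone --mirror git@github.com:example/repo.git /srv/git/repo.git
    # developers work against the LAN mirror while the outside world is down
    git clone ssh://lanbox/srv/git/repo.git
    # when connectivity returns, push everything back upstream
    cd /srv/git/repo.git && git push --mirror git@github.com:example/repo.git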
Maybe nitpicky, but I feel like it's important to distinguish between "the web" and "the internet".
https://www.bbc.com/news/technology-57707530
That's because people trust and hope blindly. They believe IT is for saving money? It isn't. They coupled their cash registers to an American cloud service. They couldn't even pay in cash.
It usually gets worse when no outages happen for some time, because that increases blind trust.
The word "seems" is doing a lot of heavy lifting there.
Yes the Internet has stayed stable.
The Web, as defined by a bunch of servers running complex software, probably much less so.
Just the fact that it must necessarily be more complex means that it has more failure modes...
Now we have computers that shit themselves if DNS isn’t working, let alone LANs that can operate disconnected from the Internet as a whole.
And partially working, or indicating that it works (when it doesn't), is usually even worse.
For you as a Mexican the end result is the same: AWS went away. And considering there already is a list of countries that cannot use AWS, GitHub and a bunch of other "essential" services, it's not hard to imagine that list growing in the future.
I've actually had that.
Proper hardware (Sun, Cisco) had a serial management interface (ideally "lights-out") which could be used to remedy many kinds of failures. Plus a terminal server with a dial-in modem on a POTS line (or adequate fakery), in case the drama took out IP routing.
Then came Linux on x86, and it took waytoomanyyears for the hardware vendors to outgrow the platform's microsoft local-only deployment model. Aside from Dell and maybe Supermicro, I'm not sure if they ever worked it out.
Then came the cloud. Ironically, all of our systems are up and happy today, but services that rely on partner integrations are down. The only good thing about this is that it's not me running around trying to get it fixed. :)
There is no reason to have such brittle infra.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
It really is a single point of failure for the majority of the Internet.
Why would a third-party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another"
The reason third-party things are in the critical path is because most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.
It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.
If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.
E.g., a hospital could keep recent patient data on-site and sync it up with the central cloud service as and when that service becomes available. Not all systems need to be linked in real time. Sometimes it makes sense to create buffers.
But the downside is that syncing things asynchronously creates complexity that itself can be the cause of outages or worse data corruption.
I guess it's a decision that can only be made on a case by case basis.
i bet only 1-2% of AI startups are running their own models and the rest are just bouncing off OpenAI, Azure, or some other API.
Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.
"Some kind of dependency" is fine and unavoidable, but well-architected systems don't have hard downtime just because someone somewhere you have no control over fucked up.
Timing.
If Amazon has peaked then they will not be worth much. Shares go down. Even in rising markets shares of failing companies go down...
Mind tho, Amazon has so much mind share they will need to fail harder to fail totally...
In fairness, that's been my experience with everyone except OpenAI and Anthropic where I only occasionally come out underwhelmed
Really I think AWS does a fairly poor job bringing new services to market and it takes a while for them to mature. They excel much more in the stability of their core/old services--especially the "serverless" variety like S3, SQS, Lambda, EC2-ish, RDS-ish (well, today notwithstanding)
I found this summary:
https://fortune.com/2025/07/31/amazon-aws-ai-andy-jassy-earn...
And the transcript (there’s an annoying modal obscuring a bit of the page, but it’s still readable):
https://seekingalpha.com/article/4807281-amazon-com-inc-amzn...
(search for the word “tough”)
Anyway, I actually loved my first time at AWS. Which is why I went back. My second stint wasn't too bad, but I probably wouldn't go back, unless they offered me a lot more than what I get paid, but that is unlikely.
Maybe those who have been around longer have seen this before, but it's the first time for me.
This isn't, and never was, true. I've done setups in the past where monitoring happened "multi cloud", with multiple dedicated servers as well. It was broad enough that you could actually see where things broke.
Was quite some time ago so I don't have the data, but AWS never came out on top.
It actually matched largely with what netcraft.com put out. Not sure if they still do that and release those things to the public.
The best advice I can give to any org in AWS is to get out of us-east-1. If you use a service whose management layer is based there, make sure you have break-glass processes in place or, better yet, diversify to other services entirely to reduce/eliminate single points of failure.
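One concrete piece of that break-glass prep, assuming you use Route 53: the DNS data plane keeps answering queries even when the us-east-1-homed control plane is struggling, so pre-provision your health-checked failover records instead of planning to create them mid-incident. A sketch (zone ID, names, and health check ID are placeholders):

    aws route53 change-resource-record-sets --hosted-zone-id Z0123456789EXAMPLE --change-batch '{
      "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
          "Name": "api.example.com",
          "Type": "CNAME",
          "SetIdentifier": "use2-primary",
          "Failover": "PRIMARY",
          "TTL": 60,
          "ResourceRecords": [{"Value": "api-use2.example.com"}],
          "HealthCheckId": "11111111-2222-3333-4444-555555555555"
        }
      }]
    }'

You'd pair that with a matching SECONDARY record pointing at the standby region, created well before you ever need it.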
This is not a new issue caused by improper investment, it's always been this way.
It's both the oldest and largest (most ec2 hosts, most objects in s3, etc) AWS region, and due to those things it's the region most likely to encounter an edge case in prod.
I'd suggest to ++double the cost. Compare:
++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy
double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per c precedence rules -> manager goes ballistic -> you blithely recount the history of c precedence in a long monotone style -> job returns EINVAL -> beers = 0
This is all very, very hand-wavey. And if one says "golly gee, all our config is too cloud specific to do multi-cloud" then you've figured out why cloud blows and that there is no inherent reason not to have API standards for certain mature cloud services like serverless functions, VMs and networks.
Edit to add: I know how grossly simplified this is, and that most places have massively complex systems.
Yes, mostly.
Yep. Although it's just anecdata, it's what we do where I work - haven't had a slightest issue in years.
It's amazing how few problems we have. Honestly, I don't think we have to worry about configuration issues as often as people who rely on the cloud.
If you have an app that experiences 1000x demand spikes at unpredictable times then sure, go with the cloud. But there are a lot of companies that would be better off if they seriously considered their options before choosing the cloud for everything.
If an internal "AWS team" then this translates to "I am comfortable using this tool, and am uninterested in having to learn an entirely new stack."
If you have to diversify your cloud workloads give your devops team more money to do so.
It’s likely that, like many organizations, this scenario isn’t something Reddit are well prepared for in terms of correct error messaging.
There aren't any communities on Reddit with that name. Double-check the community name or start a new community.
I believe the 2021 one was especially horrific because it affected their DNS service (Route 53) and the outage made writes to that service impossible. This made failovers not work, etc., so their prescribed multi-region setups didn't work.
But in the end, some things will have to synchronize their writes somewhere, right? So for DNS I could see how that ends up in a single region.
AWS is bound by the same rules as everyone else in the end... The only thing they have going for them is that they have a lot of money to make certain services resilient, but I'm not aware of a single system that's resilient to everything.
Every AWS "global" service, be it IAM, STS, CloudFormation, CloudFront, Route 53, or Organizations, has deep ties to control systems originally built only in us-east-1/N. Virginia.
That's poor design, after all these years. They've had time to fix this.
Until AWS fully decouples the control plane from us-east-1, the entire platform has a global dependency. Even if your data plane is fine, you still rely on IAM and STS for authentication, and maybe Route 53 for DNS or failover, CloudFormation or ECS for orchestration...
If any of those choke because us-east-1’s internal control systems are degraded, you’re fucked. That’s not true regional independence.
Assuming you cannot alter requirements or SLAs, I could see how their technical solutions are limited. It's possible, just not without breaking their promises. At that point it's no longer a technical problem
The control plane must know the truth about your account and that truth must be globally consistent. That’s where the trouble starts I guess.
I think my old-school system admin ethos is just different than theirs. It's not a who's wrong or right, just a difference in opinions on how it should be done I guess.
The ISP I work for requires us to design in a way that no single DC can be a point of failure. It's just a difference in design methods, and I have to remember that the DCs I work in are used completely differently than AWS's.
In the end, however, I know solutions for this exist (federated ledgers, CRDT-based control planes, regional autonomy); they're just expensive and they don't look good on quarterly slides. It just takes the almighty dollar to implement, and that goes against big business: if it "works", it works, I guess.
AWS's model scales to millions of accounts because it hides complexity, sure, but the same philosophy that enables that scale prevents true decentralization. That is shit. I guess people can architect as if us-east-1 can disappear so that things continue on, but then that's AWS pushing complexity into your code. They are just shifting who shoulders that little-known issue.
"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
There'll be much admin moaning
And servers not glowing
and the NOC crew in tears
It's the most blunderful time of the year ♫
Amazon is burning off and driving away the technical talent and knowledge, knowing the vendor lock-in will keep bringing in the sweet money. You will see more salespeople hovering around your C-suites and executives, while you face even worse technical support that doesn't seem to know what it is talking about, let alone fix the support issue you expect to be fixed easily.
Mark my words: if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built it and knew those intricacies have been driven away by RTO and moves to hubs. Eventually those services that all others (and AWS services themselves) heavily depend on may be more fragile than the public knows.
However, as may be apparent just from that small set, it is not exactly something technical people often feel comfortable with doing. It is why at least in some organizations you get the friction of a business type interfacing with technical people in varying ways, but also not really getting along because they don’t understand each other and often there are barriers of openness.
When a company moves from being engineering/technical driven to being sales/profit/stock-price/shareholder-satisfaction driven, cutting (technical) corners, once impossible, now becomes the de facto practice. If you push the L7s/L8s, who would definitely stop or veto circular dependencies, out of the discussion room and replace them with sir-yes-sir people, you've successfully created short-term KPI wins for the lofty chairs, but with a burning fuse of catastrophic failures to come.
WILL see? We've been seeing this since 2019.
"And so, a quiet suspicion starts to circulate: where have the senior AWS engineers who've been to this dance before gone? And the answer increasingly is that they've left the building — taking decades of hard-won institutional knowledge about how AWS's systems work at scale right along with them."
...
"AWS has given increasing levels of detail, as is their tradition, when outages strike, and as new information comes to light. Reading through it, one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow. To be clear: I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time."
....
"This is a tipping point moment. Increasingly, it seems that the talent who understood the deep failure modes is gone. The new, leaner, presumably less expensive teams lack the institutional knowledge needed to, if not prevent these outages in the first place, significantly reduce the time to detection and recovery. "
...
"I want to be very clear on one last point. This isn't about the technology being old. It's about the people maintaining it being new. If I had to guess what happens next, the market will forgive AWS this time, but the pattern will continue."
Ladies and gentlemen, it's about time to learn reshoring in the IT world as well. Owning nothing and renting everything means extreme fragility.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
one of the main points of cloud computing is scaling up and down frequently
`us-east-1` is unfortunately special in some ways but not in ways that should affect well-designed serving systems in other regions.
It seems like the outage is only affecting one region, so AWS is likely falling back to others. I’m sure parts of the site are down, but the main sites are resilient.
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc. so you probably like to use the battle proof DB infrastructure you already have in place. Does that mean you will end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
>We recommend customers continue to retry any failed requests.
There's no such thing as infinite scalability. Even the most elastic services are not infinitely elastic. When resources are short, you either have to rely on your customers to retry nicely, or you have to shed load during overload scenarios to protect goodput (which will deny service to some). For a high demand service, overload is most likely during the first few hours after recovery.
See e.g., https://d1.awsstatic.com/builderslibrary/pdfs/Resilience-les...
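For illustration, a minimal retry sketch in Python along those lines: capped exponential backoff with full jitter, in the spirit of AWS's published guidance (the delays, attempt cap, and the `call` parameter are placeholder choices, not anything prescribed by the linked paper):

    import random
    import time

    def retry(call, max_attempts=5, base=0.5, cap=30.0):
        """Retry `call` with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the capped exponential
                # delay, so recovering services aren't hit in synchronized waves.
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))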
Yes your customers may well implement stupidly aggressive retries, but that shouldn't break your stuff, they should just start getting 429s?
Both of which seem to crop up in post-mortems for these widespread outages.
DynamoDB is not going to set up its own DNS service or its own Route 53.
Maybe DynamoDB should have had tooling that tested DNS edits before sending them to Route 53, or Route 53 should have tooling to validate changes before accepting them. I'm sure smart people at AWS are yelling at each other about it right now.
> Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum.
IAM is replicated, so each region has its own read-only IAM cache. AWS SigV4 is also designed to be regionalized, if you ever wondered why the signature key derivation has many steps, that's exactly why ( https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_s... ).
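A sketch of that key-derivation chain, following the linked docs (example values in the comments are illustrative): the date, region, and service are folded into the signing key, so a regional endpoint can validate requests without ever holding the long-term secret.

    import hashlib
    import hmac

    def hmac_sha256(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    def derive_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
        k_date = hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date)  # e.g. "20251020"
        k_region = hmac_sha256(k_date, region)                             # e.g. "us-east-1"
        k_service = hmac_sha256(k_region, service)                         # e.g. "dynamodb"
        return hmac_sha256(k_service, "aws4_request")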
It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there was a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another one. Of course, it’s never foolproof.
The very large 2017 AWS outage originated in s3. Maybe you're thinking about a different event?
https://aws.amazon.com/message/5467D2/
I imagine this was impossible in 2017 because of actions taken after the 2015 incident
If you're talking about this part:
> Initially, we were unable to add capacity to the metadata service because it was under such high load, preventing us from successfully making the requisite administrative requests.
It isn't about spinning up ec2 instances or provisioning hardware. It is about logically adding the capacity to the system. The metadata service is a storage service, so adding capacity necessitates data movement. There are a lot of things that need to happen to add capacity while maintaining data correctness and availability (mind at this point, it was still trying to fulfill all requests)
I think most sysadmins don't plan for an AWS outage. And economically it makes sense.
But it makes me wonder, is sysadmin a lost art?
I dunno, let me ask chatgpt. Hmmm, it said yes.
Yes. 15-20 years ago when I was still working on network-adjacent stuff I witnessed the shift to the devops movement.
To be clear, the fact that devops don't plan for AWS failures isn't an indication that they lack the sysadmin gene. Sysadmins will tell you very similar things: "X can never go down" or "it's not worth having a backup for service Y".
But deep down devops are developers who just want to get their thing running, so they'll google/serveroverflow their way into production without any desire to learn the intricacies of the underlying system. So when something breaks, they're SOL.
"Thankfully" nowadays containers and application hosting abstracts a lot of it back away. So today I'd be willing to say that devops are sufficient for small to medium companies (and dare I say more efficient?).
Depends on the devops team. I have worked with so many devops engineers who came from network engineering, sysadmin, or SecOps backgrounds. They all bring a different perspective and set of priorities.
… but we should not compare them to self-hosting because hosting that much data and compute is complicated.
The emperor has no clothes.
I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[1].
[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...
Signal was also down.
Well, inter-region DR/HA is an expensive thing to ensure (whether in salaries, infra, or both), especially when you are in AWS.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
It is kind of the child of what used to be called Catastrophe Theory, which in low dimensions is essentially a classification of foldings of manifolds. Now the systems are higher dimensional and the advice more practical/heuristic.
Doesn't help either. us-east-1 hosts the internal control plane of AWS and a bunch of stuff is only available in us-east-1 at all - most importantly, Cloudfront, AWS ACM for Cloudfront and parts of IAM.
And the last is the one true big problem. When IAM has a sniffle, everything else collapses because literally everything else depends on IAM. If I were to guess IAM probably handles millions if not billions of requests a second because every action on every AWS service causes at least one request to IAM.
The sheer volume is one thing, but... IAM's policy engine, that's another thing. Up to 5000 different roles per account, dozens of policies that can have an effect on any given user entity and on top of that you can also create IAM policies that blanket affect all entities (or only a filtered subset) in an account, and each policy definition can be what, 10 kB or so, in size. Filters can include multiple wildcards everywhere so you can't go for a fast-path in an in-memory index, and they can run variables with on-demand evaluation as well.
And all of that is reachable not on an account-specific endpoint that could get sharded from a shared instance should the load of one account become too expensive, no, it's a global (and region-shared) endpoint. And if that weren't enough, all calls are shipped off to CloudTrail's event log, always, with full context cues to have an audit and debug trail.
To achieve all that at a service quality that allows less than 10 seconds before a change to an IAM policy becomes effective, and millisecond call times, is nothing short of amazing.
IAM is solid, but is it any more special than any other distributed AuthN+AuthZ service?
The scale, speed, and uptime, are downstream from the simplicity.
It's good solid work, I guess I read "amazing" as something surprising or superlative.
(The simple, solid, reliable services should absolutely get more love! Just wasn't sure if I was missing something about IAM.)
The admin UX is ... awkward and incomplete at best. I think the admin UI makes the service appear more complex than it is.
The JSON representation makes it look complicated, but with the data compiled down into a proper processable format, IAM is just a KVS and a simple rules engine.
Not much more complicated than nginx serving static files, honestly.
(Caveat: none of the above is literally simple, but it's what we do every day and -- unless I'm still missing it -- not especially amazing, comparatively).
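To make the "KVS plus a simple rules engine" framing concrete, here's a toy sketch in Python. The policies are hypothetical and this is nothing like the real evaluator (which also handles principals, conditions, resource policies, and so on), but it shows the core semantics: explicit deny wins, otherwise an explicit allow, otherwise implicit deny.

    from fnmatch import fnmatch

    # Hypothetical identity policies attached to a caller.
    POLICIES = [
        {"Effect": "Allow", "Action": "s3:Get*",      "Resource": "arn:aws:s3:::my-bucket/*"},
        {"Effect": "Deny",  "Action": "s3:GetObject", "Resource": "arn:aws:s3:::my-bucket/secret/*"},
    ]

    def is_allowed(action: str, resource: str) -> bool:
        allowed = False
        for p in POLICIES:
            if fnmatch(action, p["Action"]) and fnmatch(resource, p["Resource"]):
                if p["Effect"] == "Deny":
                    return False     # an explicit deny always wins
                allowed = True       # remember the explicit allow
        return allowed               # no matching allow means implicit deny

    print(is_allowed("s3:GetObject", "arn:aws:s3:::my-bucket/public/a.txt"))    # True
    print(is_allowed("s3:GetObject", "arn:aws:s3:::my-bucket/secret/key.pem"))  # False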
You are right. But alas, a peek at the AMZN stock ticker suggests that the market doesn't really value resilience that much.
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
I thought that if us-east-1 goes down you might not be able to administer (or bring up new services) in other zones, but if you have services running that can take over from us-east-1, you can maintain your app/website etc.
I haven’t had to do this for several years but that was my experience a few years ago on an outage - obviously it depends on the services you’re using.
You can’t start cloning things to other zones after us-east-1 is down - you’ve left it too late
What it turned into was Daedalus from Deus Ex lol.
I think it's hypocritical for them to push customers to double or triple their spend in AWS when they themselves have single points of failure on a single region.
Hey! Pay us more money so when us-east-1 goes down you're not down (actually you'll still go down because us-east-1 is a single point of failure even for our other regions).
They've been charging a premium for, and marketing, "Availability" for decades at this point. I worked for a competitor and made a better product: it could endure any of the zones failing.
For the uninitiated: https://en.wikipedia.org/wiki/Room_641A
IAD datacenters have forever been the place where Amazon software developers implement services first (well before AWS was a thing).
Multi-AZ support often comes second (more than you think; Amazon is a pragmatic company), and not every service is easy to make TRULY multi-AZ.
And then other services depend on those services, and may also fall into the same trap.
...and so much of the tech/architectural debt gets concentrated into a single region.
Were the docs/tooling up to date? Tough bet. Much easier to fix BGP or whatever.
(I'm assuming by zone you mean the equivalent of an AWS region, with multiple connected datacenters)
I'll let people guess for the sport of it, here's the hint: there were at least 30 of them comprised of Real Datacenters. Thanks for the doubt, though. Implied or otherwise.
At risk of more snark [well-intentioned]: Clouds aren't the Death Star, they don't have to have an exhaust port. It's fair the first one does... for a while.
No kidding! The customers differ, business/finance/governments, but the volume [systems/time/effort] was comparable to Amazon. The people involved in audits were consumed practically for a whole quarter, if memory serves. Not necessarily for testing itself: first, planning, sharing the plan, then dreading the plan.
Anyway, I don't miss doing this at all. Didn't mean to imply mitigation is trivial, just feasible :) 'AWS scale' is all the more reason to do business continuity/disaster recovery testing! I guess I find it being surprising, surprising.
Competitors have an easier time avoiding the creation of a Gordian Knot with their services... when they aren't making a new one every week. There are significant degrees to PaaS, a little focus [not bound to a promotion packet] goes a long way.
Your experiment proves nothing. Anyone can pull it off.
A Cloud with multiple regions, or zones for that matter, that depend on one is a poorly designed Cloud; mine didn't, AWS does. So, let's revisit what brought 'whatever1', here:
> Your experiment proves nothing. Anyone can pull it off.
Amazon didn't, we did. Hmm.
The goal posts were fine: bomb the AZ of your choice, I don't care. The Cloud [that isn't AWS, in the case of 'us-east-1'] will still work.
Yes, everything has a weakness. Not every weakness is comparable to 'us-east-1'. Ours was billing/IAM. Guess what? They lived in several places with effective and routinely exercised redundancy. No single zone held this much influence. Service? Yes, that's why they span zones.
Said in the absolute kindest way: please fuck off. I have nothing to prove or, worse, sell. The businesses have done enough.
Not that "forgot to pay" is going to result in a cut off - that doesn't happen with the multi-megawatt supplies from multiple suppliers that go into a dedicated data centre. It's far more likely that the receivers will have taken over and will pay the bill by that point.
How’s not paying your AWS bill going for you?
Real People would notice/care, but financially, it didn't matter. Contract said the edge had to be lost for a moment/restored. I've played both Incident Manager and SRE in this routine.
edit: Less often we'd do a more thorough test: power loss/full recovery. We'd disconnect more regularly given the simplicity.
There can be other valid usecases than your own.
Price is beside my original point: Amazon has enjoyed decades for arbitrage. This sounds more accusatory than intended: the 'us-south-1' problem exists because it's allowed/chosen. Created in 2006!
Now, to retract that a bit: I could see technical debt/culture making this state of affairs practical, if not inevitable. Correct? No, if I was Papa Bezos I'd be incredibly upset my Supercomputer is so hamstrung. I think even the warehouses were impacted!
The real differentiator was policy/procedure. Nobody was allowed to create a service or integration with this kind of blast area. Design principles, to say the least. Fault zones and availability zones exist for a reason beyond capacity, after all.
The stuff I'm proudest of solved a problem and made money but it wasn't complicated for the sake of being complicated. It's like asking a mechanical engineer "what's the thing you've designed with the most parts"
Most web jobs are not technically complex. They use standard software stacks in standard ways. If they didn't, average developers (or LLMs) would not be able to write code for them.
I.e. a complicated but required system is fine (I had to implement a consensus algorithm for a good reason).
A complicated but unrequired system is bad (I built a docs platform for us that requires a 30-step build process, but yeah, MkDocs would do the same thing).
I really like it when people can pick out hidden complexity, though. "DNS" or "network routing" or "Kubernetes" or etc are great answers to me, assuming they've done something meaningful with them. The value is self-evident, and they're almost certainly more complex than anything most of us have worked on. I think there's a lot of value to being able to pick out that a task was simple because of leveraging something complex.
Why would they keep a large set of centralized, core traffic services in Virginia for decades despite it being a bad design?
I mean look at their console. Their console application is pretty subpar.
It's the rest of the world that has not. For a long time companies just ran everything in us-east-1 (e.g. Heroku), without even having an option to switch to another region.
You eventually get services that need to be global. IAM and DNS are such examples, they have to have a global endpoint because they apply to the global entities. AWS users are not regionalized, an AWS user can use the same key/role to access resources in multiple regions.
There's one for China, one for the AWS government cloud, and there are also various private clouds (like the one hosting the CIA data). You can check their list in the JSON metadata that is used to build the AWS clients (e.g. https://github.com/aws/aws-sdk-go-v2/blob/1a7301b01cbf7e74e4... ).
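If you have the Python SDK installed, you can poke at that same bundled metadata locally; a quick sketch (the printed values depend on your botocore version):

    import botocore.session

    session = botocore.session.get_session()
    # Partitions shipped with the SDK, typically something like
    # ['aws', 'aws-cn', 'aws-us-gov', 'aws-iso', 'aws-iso-b'].
    print(session.get_available_partitions())
    # Regions a given service is offered in, per partition.
    print(session.get_available_regions("dynamodb", partition_name="aws-cn"))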
"We can't run things on just a box! That's a single point of failure. We're moving to cloud!"
The difference is that when the cloud goes down you can shift the blame to them, not you, and fixing it is their problem.
The corporate world is full of stuff like this. A huge role of consultants like McKinsey is to provide complicated reports and presentations backing the ideas that the CEO or other board members want to pursue. That way if things don't work out they can blame McKinsey.
You don't have one data center with critical services. You know lots of companies are still not in the cloud, and they manage their own datacenters, and they have 2-3 of them. There are cost, support, availability and regulatory reasons not to be in the cloud for many parties.
That's a very human sentiment, and I share it. That's why I don't swap my car wheels myself, I don't want to feel responsible if one comes loose on the highway and I cause an accident.
But at the same time it's also appalling how low the bar has gotten. We're still the ones deciding that one cloud is enough. The outage being "their" fault really shouldn't excuse that fact. Most services aren't important enough to need redundancy. But if yours is, and it goes down because you decided that one provider is enough, then your provider isn't solely at fault here, and as a profession I wish we'd take more accountability.
But I’ve led enough cloud implementations where I discuss the cost and complexity between - multi-AZ (it’s almost free so why not), multi region , and theoretically multi cloud (never came up in my experience) and then cold, warm and hot standby, RTO and RPO, etc
And for the most part, most businesses are fine with just multi-AZ as long as their data can survive catastrophe.
I hope they release a good root cause analysis report.
It was only when stuff started breaking that all this crap about “well actually stuff still relies on us-east-1” starts coming out.
Well it did for me today... Don't use us-east-1 explicitly, just other regions, and I had no outage today... (I get the point about the skeletons in the closet of us-east-1... maybe the power plug goes via Bezos' wooden desk?)
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
> There is one IAM control plane for all commercial AWS Regions, which is located in the US East (N. Virginia) Region. The IAM system then propagates configuration changes to the IAM data planes in every enabled AWS Region. The IAM data plane is essentially a read-only replica of the IAM control plane configuration data.
and I believe some global services (like certificate manager, etc.) also depend on the us-east-1 region
https://docs.aws.amazon.com/IAM/latest/UserGuide/disaster-re...
I recently ran into an issue where some Bedrock functionality was available in us-east-1 but not one of the other US regions.
AWS has been steering people to us-east-2 for a while. For example, traffic between us-east-1 and us-east-2 has the same cost as inter-AZ traffic within us-east-1.
There's no way it's DNS
It was DNS
As it happens, that naturally maps to the bootstrapping process on hardware needing to know how to find the external services it needs, which is what "DNS" is for. So "DNS" ends up being the top level of manual configuration.
But it's the inevitability of the manual process that's the issue here, not the technology. We're at a spot now where the rest of the system reliability is so good that the only things that bring it down are the spots where human beings make mistakes on the tiny handful of places where human operation is (inevitably!) required.
Unless DNS configuration propagates over DHCP?
At the top of the stack someone needs to say "This is the cluster that controls boot storage", "This is the IP to ask for auth tokens", etc... You can automatically configure almost everything but there still has to be some way to get started.
> 5000 Reddit users reported a certain number of problems shortly after a specific time.
> 400000 A certain number of reports were made in the UK alone in two hours.
Dumb argument imho, but that's how many of them think ime.
Also, lots of the bad guy boogeymen countries have legal and technical methods to do this without property damage. Just blackhole a bunch of routes.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
An enduring image that stays with me was when I was a child and the local supermarket lost electricity. Within seconds the people working the tills had pulled out hand cranks by which the tills could be operated.
I'm getting old, but this was the 1980's, not the 1800's.
In other words, to agree with your point about resilience:
A lot of the time even some really janky fallbacks will be enough.
But to somewhat disagree with your apparent support for AWS: While it is true this attitude means you can deal with AWS falling over now and again, it also strips away one of the main reasons people tend to give me for why they're in AWS in the first place - namely a belief in buying peace of mind and less devops complexity (a belief I'd argue is pure fiction, but that's a separate issue). If you accept that you in fact can survive just fine without absurd levels of uptime, you also gain a lot more flexibility in which options are viable to you.
The cost of maintaining a flawless eject button is indeed high, but so is the cost of picking a provider based on the notion that you don't need one if you're with them out of a misplaced belief in the availability they can provide, rather than based on how cost effectively they can deliver what you actually need.
We have become much more reliant on digital tech (those hand cranked tills were prob not digital even when the electricity was on), and much less resilient to outages of such tech I think.
What did they do with the frozen food section? Was all that inventory lost?
So your complaints matter nothing because "number go up".
I remember the good old days of everyone starting a hosting company. We never should have left.
And everybody starting a hosting company is definitely a profit driven activity.
Absolutely, nobody was doing it out of charity, but there is more diversity in the market and thus more innovation and then the market decides. Right now we have 3 major providers, and that makes up the lion's share. That's consolidation of a service. I believe that's not good for the market or the internet as a whole.
(The counter joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)
The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
i don't think any method of auth was working for accessing the AWS console
I remember seeing a meme for a cover of "Meta Data Center Simulator 2021" where hands were holding an angle grinder with rows of server racks in the background.
"Meta Data Center Simulator 2021: As Real As It Gets (TM)"
Core services teams had backup communication systems in place prior to that though. IIRC it was a private IRC on separate infra specifically for that type of scenario.
In the mid-2000s most of the conference call traffic started leaving copper T1s and going onto fiber and/or SIP switches managed by Level3, Global Crossing, Qwest, etc. Those companies combined over time into Century Link which was then rebranded Lumen.
As of last October, Lumen is now starting to integrate more closely with AWS, managing their network with AWS's AI: https://convergedigest.com/lumen-expands-fiber-network-to-su...
"Oh what a tangled web we weave..."
add a bunch of other pointless sci-fi and evil villain lair tropes in as well...
Still have my "my other datacenter is made of razorblades and hate" sticker. \o/
Flame chemistry is weird. Halogenated fire suppression agents work by making Hydrogen (!) out of free radicals.
https://www.nist.gov/system/files/documents/el/fire_research...
On that day, the VP showed up and told the security staff, "just open all the doors!". So they did. If you knew where the datacenter was, you could just walk in and mess with eBay servers. But since we were still a small ops team, we pretty much knew everyone who was supposed to be there. So security was basically "does someone else recognize you?".
/those were the days
The early oughts were a different time.
just make sure the zone based door lock/unlock system isn't on AWS ;)
I actually can't think of a more secure protocol. Doesn't scale, though.
The thieves had access to the office building but not the server room. They realized the server room shared a wall with a room that they did have access to, so they just used a sawzall to make an additional entrance.
Hooked from that moment! The series got progressively more ridiculous but what a start!
There is a video from the lock pick lawyer where he receives a padlock in the mail with so much tape that it takes him whole minutes to unpack.
Concrete is nice, other options are piles of soil or brick in front of the door. There probably is a sweet spot where enough concrete slows down an excavator and enough bricks mixed in the soil slows down the shovel. Extra points if there is no place nearby to dump the rubble.
Classic.
In my first job I worked on ATM software, and we had a big basement room full of ATMs for test purposes. The part the money is stored in is a modified safe, usually with a traditional dial lock. On the inside of one of them I saw the instructions on how to change the combination. The final instruction was: "Write down the combination and store it safely", then printed in bold: "Not inside the safe!"
awesome !
If you just wanted recovery keys that were secure from being used in an ordinary way you can use Shamir to split the key over a couple hard copies stored in safety deposit boxes a couple different locations.
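A minimal sketch of that idea in Python (educational only, not audited crypto; a real deployment should use a vetted Shamir implementation):

    import random

    PRIME = 2**127 - 1  # a Mersenne prime comfortably larger than the secret below

    def split(secret: int, n: int, k: int):
        """Split `secret` into n shares; any k of them reconstruct it."""
        rng = random.SystemRandom()
        coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
        def f(x):
            acc = 0
            for c in reversed(coeffs):       # Horner evaluation of the polynomial
                acc = (acc * x + c) % PRIME
            return acc
        return [(x, f(x)) for x in range(1, n + 1)]

    def combine(shares):
        """Lagrange interpolation at x = 0 over the prime field."""
        secret = 0
        for i, (xi, yi) in enumerate(shares):
            num, den = 1, 1
            for j, (xj, _) in enumerate(shares):
                if i != j:
                    num = (num * -xj) % PRIME
                    den = (den * (xi - xj)) % PRIME
            secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
        return secret

    shares = split(123456789, n=5, k=3)      # e.g. one share per safety deposit box
    assert combine(shares[:3]) == 123456789  # any 3 of the 5 recover the secret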
Management was not happy, but I didn’t get in trouble for it. And yes, it was awesome. Surprisingly easy, especially since the fire extinguisher was literally right next to it.
Nothing says ‘go ahead, destroy that shit’ like money going up in smoke if you don’t.
P.S. don’t park in front of fire hydrants, because they will have a shit eating grin on their face when they destroy your car- ahem - clear the obstacle - when they need to use it to stop a fire.
Thankfully none of my 10 Gbps wavelengths were impacted. Oh did I appreciate my aversion to >= layer 2 services in my transport network!
There's also the insistence that Rogers employees should use Rogers services. Paying for every Rogers employee to have a Bell cell phone would not sit well with their executives.
That the risk assessments of the changes being made to the router configuration were incorrect also contributed to the outage.
Is it possible to have it in multiple regions? Last I checked, it only accepted one region. You needed to remove it first if you wanted to move it.
I’m unaware of any common and popular distributed IDAM that is reliable
There's also "identity orchestration" tools like Strata that let you use multiple IdPs in multiple clouds, but then your new weakest link is the orchestration platform.
Curious. Is your solution active-active or active-passive? We've implemented multi-region active-passive CIAM/IAM in our hosted solution[0]. We've found that meets needs of many of our clients.
I'm only aware of one CIAM solution that seems to have active-active: Ory. And even then I think they shard the user data[1].
0: https://fusionauth.io/docs/get-started/run-in-the-cloud/disa...
1: https://www.ory.com/blog/global-identity-and-access-manageme... is the only doc I've found and it's a bit vague, tbh.
Ory’s setup is indeed true multi-region active-active; not just sharded or active-passive failover. Each region runs a full stack capable of handling both read and write operations, with global data consistency and locality guarantees.
We’ll soon publish a case study with a customer that uses this setup that goes deeper into how Ory handles multi-region deployments in production (latency, data residency, and HA patterns). It’ll include some of the technical details missing from that earlier blog post you linked. Keep an eye out!
There are also some details mentioned here: https://www.ory.com/blog/personal-data-storage
Other clouds, lmao. Same requirements, not the same mistakes. Source: worked for several, one a direct competitor.
That's some nice manager-deactivating jargon.
Who watches the watchers.
We learned that lesson by having to do emergency failovers and having some problems. :)
The usability of AWS is so poor.
Always DNS..
Lots of orgs operating wholly in AWS and sometimes only within us-east-1 had no operational problems last night. Some that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, those companies that had operational problems likely wouldn't have invested in resiliency expenses in any other deployment strategy either. It could have happened to them in Azure, GCP, or even a home-rolled datacenter.
Are customers willing to pay companies for that redundancy? I think not. Once every few years outage for 3 hours is fine for non critical services.
That right there means the business model is fucked to begin with. If you can't have a resilient service, then you should not be offering that service. Period. Solution: we were fine before the cloud, just a little slower. No problem going back to that for some things. Not everything has to be just in time at lowest possible cost.
IMO, going multi AZ or multi-cloud adds a good amount of complexity.
TBH I don't care if last.fm doesn't work for 8 hours a year, that isn't a big deal. My bank? Yeah that should work.
Servers are easy. I’m sure most companies already have servers that can be spun up. Things related to data are not.
And no, data replication or load balancing is not easy, nor cheap.
* Automated.
* Scoped to business-critical services, typically not including many of the 3rd-party services.
* Uses data replication, which is a feature in any modern cloud.
* Load balancing, by DNS basically for free, or a real LB somewhere on the edge (see the Route 53 sketch after this comment).
If you fail at this you probably fail at disaster recovery too or any good practice on how to run things in the cloud. Most likely because of very poor architecture.
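For the DNS point above, a hedged boto3 sketch of what failover-by-DNS can look like with Route 53 health checks (the zone ID, record names, and health check ID are placeholders):

    import boto3

    r53 = boto3.client("route53")

    def upsert_failover_record(zone_id, name, role, target, set_id, health_check_id=None):
        record = {
            "Name": name,
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Failover": role,                       # "PRIMARY" or "SECONDARY"
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        r53.change_resource_record_sets(
            HostedZoneId=zone_id,
            ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
        )

    # Primary answer points at one region, secondary at another; Route 53 serves
    # the secondary when the primary's health check fails.
    upsert_failover_record("Z0000000EXAMPLE", "app.example.com.", "PRIMARY",
                           "app.us-east-1.example.com", "use1",
                           health_check_id="00000000-0000-0000-0000-000000000000")
    upsert_failover_record("Z0000000EXAMPLE", "app.example.com.", "SECONDARY",
                           "app.us-west-2.example.com", "usw2")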
It's easy to replicate multi-tenancy for RDS since it's built in. But it's not cheap. It's double, triple the price.
Entire regions go down
Don't pay for intra-az traffic friends
Our applications and databases must have ultra-high availability. It can be achieved with applications and data platforms hosted in different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik Replicate
- HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.
FFS ...
It's ridiculous how everything is being stored in the cloud, even simple timers. It's past high time to move functionality back on-device, which would come with the advantage of making it easier to de-connect from big tech's capitalist surveillance state as well.
I half-seriously like to say things like, "I'm excited for a time when we have powerful enough computers to actually run applications on them instead of being limited to only thin clients." Only problem is most of the younger people don't get the reference anymore, so it's mainly the olds that get it
See, the sales team from Google flew out an executive to the NBA Finals, the Azure sales team flew out another executive to the NFL Super Bowl, and the AWS team flew out yet another executive to the Wimbledon finals. And that's how you end up with a multi-cloud strategy.
I couldn't care less about having more vendor dinners when I know I am promising a falsehood that is extremely expensive and likely going to cost me my job or my credibility at some point.
That is the computing business. There is no actual accountability, just ass covering
Sure you can abstract everything away, but you can also just not use vendor-flavored services. The more bespoke stuff you use the more lock in risk.
But if you are in a "cloud forward" AWS mandated org, a holder of AWS certifications, alphabet soup expert... thats not a problem you are trying to solve. Arguably the lock in becomes a feature.
I'd counter that past a certain scale, certainly the scale of a firm that used to & could run its own datacenter.. it's probably your responsibility to not use those services.
Sure it's easier, but if you decide feature X requires AWS service Y that has no GCP/Azure/ORCL equivalent.. it seems unwise.
Just from a business perspective, you are making yourself hostage to a vendor on pricing.
If you're some startup trying to find traction, or a small shop with an IT department of 5.. then by all means, use whatever cloud and get locked in for now.
But if you are a big bank, car maker, whatever.. it seems grossly irresponsible.
On the east coast we are already approaching an entire business day being down today. Gonna need a decade without an outage to get all those 9s back. And not to be catastrophic but.. what if AWS had an outage like this that lasted.. 3 days? A week?
The fact that the industry collectively shrugs our shoulders and allows increasing amounts of our tech stacks to be single-vendor hostage is crazy.
Well, nobody is going to get blamed for this one except people at Amazon. Socially, this is treated as a tornado. You have to be certain that you can beat AWS in terms of reliability for doing anything about this to be good for your career.
Most of my on-prem days, you had more frequent but smaller failures of a database, caching service, task runner, storage, message bus, DNS, whatever.. but not all at once. Depending on how entrenched your organization is, some of these AWS outages are like having a full datacenter power down.
Might as well just log off for the day and hope for better in the morning. That assumes you could login, which some of my ex-US colleagues could not for half the day, despite our desktops being on-prem. Someone forgot about the AWS 2FA dependency..
It's actually probably not your responsibility, it's the responsibility of some leader 5 levels up who has his head in the clouds (literally).
It's a hard problem to connect practical experience and perspectives with high-level decision-making past a certain scale.
1) If you try to optimize in the beginning, you tend to fall into the over-optimization/engineering camp;
2) If you just let things go organically, you tend to fall into the big messy camp;
So the ideal way is to examine things from time to time and re-architect once the need arises. But few companies can afford that, unfortunately.
Common Cause Failures and false redundancy are just all over the place.
https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
Assuming we’re talking about hosting things for Internet users. My fiber internet connection has gone down multiple times, though it was relatively quickly restored. My power has gone out several times in the last year, with one storm having it out for nearly 24 hours. I was asleep when it went out and I didn’t start the generator until it had been out for 3-4 hours already, far longer than my UPSes could hold up. I’ve had to do maintenance and updates, both physical and software.
All of those things contribute to a downtime significantly higher than I see with my stuff running on Linode, Fly.io or AWS.
I run Proxmox and K3s at home and it makes things far more reliable, but it’s also extra overhead for me to maintain.
Most or all of those things could be mitigated at home, but at what cost?
If you had /two/ houses, in separate towns, you'd have better luck. Or, if you had cell as a backup.
Or: if you don't care about it being down for 12 hours.
These are the issues I've ran into that have caused downtime in the last few years:
- 1x power outage: if I had set up restart on power, it probably would have been down for 30-60 minutes; it ended up being a few hours (as I had to manually press the power button lol). Probably the longest non-self-inflicted issue.
- Twitch bot library issues: Just typical library bugs. Unrelated to self-hosting.
- IP changes: My IP actually barely ever changes, but I should set up DDNS. Fixable with self-hosting (but requires some amount of effort; a minimal updater sketch follows this comment).
- Running out of disk space: Would be nice to be able to just increase it.
- Prooooooobably an internet outage or two, now that I think about it? Not enough that it's been a serious concern, though, as I can't think of a time that's actually happened. (Or I have a bad memory!)
I think that's actually about it. I rely fairly heavily on my VPN+personal cloud as all my notes, todos, etc are synced through it (Joplin + Nextcloud), so I do notice and pay a lot of attention to any downtime, but this is pretty much all that's ever happened. It's remarkable how stable software/hardware can be. I'm sure I'll eventually have some hardware failure (actually, I upgraded my CPU 1-2 years ago because it turns out the Ryzen 1700 I was using before has some kind of extremely-infrequent issue with Linux that was causing crashes a couple times a month), but it's really nice.
To be clear, though, for an actual business project, I don't think this would be a good idea, mainly due to concerns around residential vs commercial IPs, arbitrary IPs connecting to your local network, etc that I don't fully pay attention to.
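Re the DDNS item above, a minimal updater sketch, assuming a Cloudflare-managed zone (the zone/record IDs, token, and hostname are placeholders, and api.ipify.org is just one of many "what's my IP" services):

    import json
    import urllib.request

    CF_API = "https://api.cloudflare.com/client/v4"
    ZONE_ID, RECORD_ID = "zone-id-here", "record-id-here"   # placeholders
    TOKEN, HOSTNAME = "api-token-here", "home.example.com"  # placeholders

    def public_ip() -> str:
        return urllib.request.urlopen("https://api.ipify.org").read().decode().strip()

    def update_record(ip: str) -> None:
        body = json.dumps({"type": "A", "name": HOSTNAME, "content": ip, "ttl": 300}).encode()
        req = urllib.request.Request(
            f"{CF_API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
            data=body,
            method="PUT",
            headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)

    # Run this from cron every few minutes (add a cached-IP check if you want to
    # avoid redundant updates when the address hasn't changed).
    if __name__ == "__main__":
        update_record(public_ip())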
Lost data, revenue, etc.
I'm not talking about AWS but whoever's downstream.
Is it like 100M, like 1B?
Reminds me of a great Onion tagline:
"Plowshare hastily beaten back into sword."
The retrospective will be very interesting reading!
(Obviously the category of outages caused by many restored systems "thundering" at once to get back up is known, so that'd be my guess, but the details are always good reading either way).
I think we're doing the 21st century wrong.
I think we're doing the 16th century wrong.
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
This seems like such a low bar for 2025, but here we are.
I prefer the API Gateway model where I can create regional endpoints and sew them together in DNS.
The hardest part is that our customers' resources aren't always available in multiple regions. When they are we fall back to a region where they exist that is next closest (by latency, courtesy of https://www.cloudping.co/).
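A rough sketch of that "next closest by latency" fallback, probing regional endpoints directly rather than using cloudping's published numbers (the endpoint list is just an example):

    import time
    import urllib.error
    import urllib.request

    CANDIDATES = {
        "us-east-2": "https://dynamodb.us-east-2.amazonaws.com",
        "us-west-2": "https://dynamodb.us-west-2.amazonaws.com",
        "eu-west-1": "https://dynamodb.eu-west-1.amazonaws.com",
    }

    def probe(url: str, timeout: float = 2.0):
        """Round-trip time to the endpoint, or None if it is unreachable."""
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=timeout)
        except urllib.error.HTTPError:
            pass          # an HTTP error response still proves the endpoint is up
        except Exception:
            return None   # DNS failure, timeout, connection reset, ...
        return time.monotonic() - start

    def next_closest_region():
        timings = {r: probe(u) for r, u in CANDIDATES.items()}
        live = {r: t for r, t in timings.items() if t is not None}
        return min(live, key=live.get) if live else None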
So you’re minimally hydrating everyone’s data everywhere so that you can have some failover. Seems smart and a good middle ground to maximize HA. I’m curious what your retention window for the failover data redundancy is. Days/weeks? Or just a fifo with total data cap?
We’ve gone to great lengths to minimize the amount of information we hold. We don’t even collect an email address upon sign-up, just the information passed to us by AWS Marketplace, which is very minimal (the account number is basically all we use).
There is some S3 replication as well in the CI/CD pipeline, but that doesn't impact our customers directly. If we'd seen errors there it would have meant manually taking Virginia out of the pipeline so we could deploy everywhere else.
Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens - the Ever Given blocking the Suez Canal, for example, let alone something like Covid.
However increasingly what should be minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.
This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.
On top of that, the attitude the entire sector has is also bad. People think IT can fail once or twice a year and it's not a problem. If that attitude reaches truly important systems it will lead to major civil problems. Any civilisation is 3 good meals away from anarchy.
There's no profit motive to avoid this, companies don't care about being offline for the day, as long as all their mates are also offline.
Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.
Somewhat common. Comes from the US military in WW2.
Even the acronym is fucked.
My favorite by a large margin...
https://en.wikipedia.org/wiki/List_of_military_slang_terms#F...
Not to be confused with "Foobar" which apparently originated at MIT: https://en.wikipedia.org/wiki/Foobar
TIL, an interesting footnote about "foo" there:
'During the United States v. Microsoft Corp. trial, evidence was presented that Microsoft had tried to use the Web Services Interoperability organization (WS-I) as a means to stifle competition, including e-mails in which top executives including Bill Gates and Steve Ballmer referred to the WS-I using the codename "foo".[13]'
There are documented uses of FUBAR back into the '40s.
I.e. lots of folks that weren't expected to work today and/or trying to round them up to work the problem.
All the schools in the area have days off for Indian Holidays since so many would be out of school otherwise.
There are 153k Amazon employees based in India according to LinkedIn.
My main beef with that team was that we worked on too many stories in parallel, so information on brand-new work was siloed. Everyone caught up after a bit, but coverage for stuff we had only just demoed, or hadn't demoed yet, was spotty.
If I was up at 1 am it was because I had insomnia and figured out exactly what the problem was and it was faster to fix it than to explain. Or if I wake up really early and the problem is still not fixed.
In my experience, the teams at AWS are pretty diverse, reflecting the diversity in the area. Even if a lot of the Indian employees are taking the day off, there should be plenty of other employees to back them up. A culturally diverse employee base should mitigate against this sort of problem.
If it does turn out that the outage was prolonged due to one or two key engineers being unreachable for the holiday, that's an indictment of AWS for allowing these single points of failure to occur, not for hiring Indians.
That number had dropped to 1,190 by 4:22 AM Pacific (7:22 AM Eastern).
However, that number is back up with a vengeance. 9,230 reports as of 9:32 AM Pacific (12:32 Eastern).
Part of that could be explained by more people making reports as the U.S. west coast awoke. But I also have a feeling that they aren't yet on top of the problem.
Our company decided years ago to use any region other than us-east-1.
Of course, that doesn't help with services that are 'global', which usually means us-east-1.
I would think a lot of clients would want that.
On AWS's side, I think us-east-1 is legacy infrastructure because it was the first region, and things have to be made replicable.
For others on AWS who aren't AWS themselves: because AWS outbound data transfer is exorbitantly expensive. I'm building on AWS, and AWS's outbound data transfer costs are a primary design consideration for potential distribution/replication of services.
And yes, AWS' rates are highway robbery. If you assume $1500/mo for a 10 Gbps port from a transit provider, you're looking at $0.0005/GB with a saturated link. At a 25% utilization factor, still only $0.002/GB. AWS is almost 50 times that. And I guarantee AWS gets a far better rate for transit than list price, so their profit margin must be through the roof.
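A back-of-the-envelope check of that math (assuming a 30-day month, decimal GB, and AWS internet egress at roughly $0.09/GB):

    port_cost = 1500.0                  # USD/month for a 10 Gbps transit port
    seconds = 30 * 24 * 3600            # one month
    saturated_gb = 10 / 8 * seconds     # GB moved at 100% utilization (~3.24M GB)

    print(port_cost / saturated_gb)                    # ~0.00046 USD/GB saturated
    print(port_cost / (saturated_gb * 0.25))           # ~0.0019 USD/GB at 25% utilization
    print(0.09 / (port_cost / (saturated_gb * 0.25)))  # AWS egress is roughly 48x that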
Which makes sense, but even their rates for traffic between AWS regions are still exorbitant. $0.10/GB for transfer to the rest of the Internet somewhat discourages integration of non-Amazon services (though you can still easily integrate with any service where most of your bandwidth is inbound to AWS), but their rates for bandwidth between regions are still in the $0.01-0.02/GB range, which discourages replication and cross-region services.
If their inter-region bandwidth pricing was substantially lower, it'd be much easier to build replicated, highly available services atop AWS. As it is, the current pricing encourages keeping everything within a region, which works for some kinds of services but not others.
This aligns with their “you should be in multiple AZs” sales strategy, because self-hosted and third-party services can’t replicate data between AZs without expensive bandwidth costs, while their own managed services (ElastiCache, RDS, etc) can offer replication between zones for free.
In general it seems like Europe has the most internet of anywhere - other places generally pay to connect to Europe, Europe doesn't pay to connect to them.
The other concerns could have to do with the impact of failover to the backup regions.
"Changes that I make are not always immediately visible": - "...As a service that is accessed through computers in data centers around the world, IAM uses a distributed computing model called eventual consistency. Any changes that you make in IAM (or other AWS services), including attribute-based access control (ABAC) tags, take time to become visible from all possible endpoints. Some delay results from the time it takes to send data from server to server, replication zone to replication zone, and Region to Region. IAM also uses caching to improve performance, but in some cases this can add time. The change might not be visible until the previously cached data times out...
...You must design your global applications to account for these potential delays. Ensure that they work as expected, even when a change made in one location is not instantly visible at another. Such changes include creating or updating users, groups, roles, or policies. We recommend that you do not include such IAM changes in the critical, high availability code paths of your application. Instead, make IAM changes in a separate initialization or setup routine that you run less frequently. Also, be sure to verify that the changes have been propagated before production workflows depend on them..."
https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoo...
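A small boto3 sketch of the "verify that the changes have been propagated" advice above (the role name and trust-policy file are placeholders; note the waiter only confirms visibility from the endpoint you're calling, not from every region):

    import boto3

    iam = boto3.client("iam")

    iam.create_role(
        RoleName="example-app-role",
        AssumeRolePolicyDocument=open("trust-policy.json").read(),
    )

    # Poll until the new role is visible before any production workflow tries to use it.
    iam.get_waiter("role_exists").wait(RoleName="example-app-role")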
us-east-1 is so the government to slurp up all the data. /tin-foil hat
Also, who’s using email-based OTP?
At this point, is there any cloud provider that doesn't have these problems? (GCP is a non-starter because a false-positive YouTube TOS violation get you locked out of GCP[1]).
[1]: https://9to5google.com/2021/02/26/stadia-port-of-terraria-ca...
It is an extremely fundamental level of incompetence at Google. One should "figure out" the viability of placing all of one's eggs in the basket of such an incompetent partner. They screwed the authentication issue up and, this is no slippery slope argument, that means they could be screwing other things up (such as being able to contact a human for support, which is what the Terraria developer also had issues with).
And looked at from the perspective of an individual company, as a customer of AWS, the occasional outage is usually an acceptable part of doing business.
However, today we’ve seen a failure that has wiped out a huge number of companies used by hundreds of millions - maybe billions - of people, and obviously a huge number of companies globally all at the same time. AWS has something like 30% of the infra market so you can imagine, and most people reading this will to some extent have experienced, the scale of disruption.
And the reality is that whilst bigger companies, like Zoom, are getting a lot of the attention here, we have no idea what other critical and/or life and death services might have been impacted. As an example that many of us would be familiar with, how many houses have been successfully burgled today because Ring has been down for around 8 out of the last 15 hours (at least as I measure it)?
I don’t think that’s OK, and I question the wisdom of companies choosing AWS as their default infra and hosting provider. It simply doesn’t seem to be very responsible to be in the same pond as so many others.
Were I a legislator I would now be casting a somewhat baleful eye at AWS as a potentially dangerous monopoly, and see what I might be able to do to force organisations to choose from amongst a much larger pool of potential infra providers and platforms, and I would be doing that because these kinds of incidents will only become more serious as time goes on.
It's the same thing here. Do you think other providers are better? If people moved to other providers, things would still go down, more likely than not it would be more downtime in aggregate, just spread out so you wouldn't notice as much.
At least this way, everyone knows why it's down, our industry has developed best practices for dealing with these kinds of outages, and AWS can apply their expertise to keeping all their customers running as long as possible.
More importantly you appear to have misunderstood the scenario I’m trying to avoid, which is the precise situation we’ve seen in the past 24 hours where a very large proportion of internet services go down all at the same time precisely because they’re all using the same provider.
And then finally the usual outcome of increased competition is to improve the quality of products and services.
I am very aware of the WWII bomber story, because it’s very heavily cited in corporate circles nowadays, but I don’t see that it has anything to do with what I was talking about.
AWS is chosen because it’s an acceptable default that’s unlikely to be heavily challenged either by corporate leadership or by those on the production side because it’s good CV fodder. It’s the “nobody gets fired for buying IBM” of the early mid-21st century. That doesn’t make it the best choice though: just the easiest.
And viewed at a level above the individual organisation - or, perhaps from the view of users who were faced with failures across multiple or many products and services from diverse companies and organisations - as with today (yesterday!) we can see it’s not the best choice.
Reality is, though, that you shouldn't put all your eggs in the same basket. And it was indeed the case before the cloud. One service going down would have never had this cascade effect.
I am not even saying "build your own DC", but we barely have resiliency if we all rely on the same DC. That's just dumb.
That is the point, though: correlated outages are worse than uncorrelated outages. If one payment provider has an outage, choose another card or another store and you can still buy your goods. If all are down, no one can buy anything[1]. If a small region has a power blackout, all surrounding regions can provide emergency support. If the whole country has a blackout, all emergency responders are bound locally.
[1] Except with cash – might be worth keeping a stash handy for such purposes.
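A rough back-of-the-envelope sketch of why the correlation matters so much (illustrative numbers only, not a model of any real provider): with two independent providers that are each 99.9% available, the chance of both being down at the same moment is about one in a million, whereas if both sit on the same provider they go down together with the full 0.1% probability.

    # Probability that *everything* is down at once,
    # assuming each service is available 99.9% of the time.
    availability = 0.999
    p_down = 1 - availability

    # Independent providers: both must fail at the same moment.
    p_all_down_independent = p_down ** 2    # ~0.0001%

    # Fully correlated (everyone on the same provider): one failure takes out both.
    p_all_down_correlated = p_down          # 0.1%

    print(f"independent: {p_all_down_independent:.6%}")
    print(f"correlated:  {p_all_down_correlated:.6%}")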
The internet was originally intended to be decentralised. That decentralisation begets resilience.
That’s exactly the opposite of what we saw with this outage. AWS has give or take 30% of the infra market, including many nationally or globally well known companies… which meant the outage caused huge global disruption of services that many, many people and organisations use on a day to day basis.
Choosing AWS, squinted at through a somewhat particular pair of operational and financial spectacles, can often make sense. Certainly it’s a default cloud option in many orgs, and always in contention to be considered by everyone else.
But my contention is that at a higher level than individual orgs - at a societal level - that does not make sense. And it’s just not OK for government and business to be disrupted on a global scale because one provider had a problem. Hence my comment on legislators.
It is super weird to me that, apparently, that’s an unorthodox and unreasonable viewpoint.
But you’ve described it very elegantly: 99.99% (or pick the number of 9s you want) uptime with uncorrelated outages is way better than that same uptime with correlated, and particularly heavily correlated, outages.
That homogeneity is a systemic risk that we all bear, of course. It feels like systemic risks often arise that way, as an emergent result from many individual decisions each choosing a path that truly is in their own best interests.
And at this point I’m looking at the problem and thinking, “how do we do that other than by legislating?”
Because left to their own devices a concerningly large number of people across many, many organisations simply follow the herd.
In the midst of a degrading global security situation I would have thought it would be obvious why that’s a bad idea.
I'm not sure a lot of companies are really looking at the costs of multi-region resiliency and hot failovers vs being down for 6 hours every year or so and writing that check.
1. The main one: it's the cheapest region, so when people select where to run their services they pick it because "why pay more?"
2. It's the default. Many tutorials and articles online show it in the examples, many deployment and other devops tools use it as a default value.
3. Related to n.2. AI models generate cloud configs and code examples with it unless asked otherwise.
4. Its location makes it Europe-friendly, too. If you have a small service and you'd like to capture a European and North American audience from a single location, us-east-1 is a very good choice.
5. Many Amazon features are available in that region first and then spread out to other locations.
6. It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks. In hybrid cloud scenarios, where you want to connect bits of your infrastructure running on AWS and on some physical hardware by a set of dedicated fiber optic lines, us-east-1 is the place to do it.
7. Yes, for AWS deployments it's an experimental location that has higher risks of downtime compared to other regions, but in practice when a sizable part of us-east-1 is down other AWS services across the world tend to go down, too (along with half of the internet). So, is it really that risky to run over there, relatively speaking?
It's the world's default hosting location, and today's outages show it.
In every SKU I've ever looked at / priced out, all of the AWS NA regions have ~equal pricing. What's cheaper specifically in us-east-1?
> Europe-friendly
Why not us-east-2?
> Many Amazon features are available in that region first and then spread out to other locations.
Well, yeah, that's why it breaks. Using not-us-east-1 is like using an LTS OS release: you don't get the newest hotness, but it's much more stable as a "build it and leave it alone" target.
> It's also a region where other cloud providers and hosting companies offer their services. Often there's space available in a data center not far from AWS-running racks.
This is a better argument, but in practice, it's very niche — 2-5ms of speed-of-light delay doesn't matter to anyone but HFT folks; anyone else can be in a DC one state away with a pre-arranged tier1-bypassing direct interconnect, and do fine. (This is why OVH is listed on https://www.cloudinfrastructuremap.com/ despite being a smaller provider: their DCs have such interconnects.)
For that matter, if you want "low-latency to North America and Europe, and high-throughput lowish-latency peering to many other providers" — why not Montreal [ca-central-1]? Quebec might sound "too far north", but from the fiber-path perspective of anywhere else in NA or Europe, it's essentially interchangeable with Virginia.
Just go to the EC2 pricing page and change from us-east-1 to us-west-1
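If you'd rather not click through the pricing pages region by region, something like this sketch can pull on-demand prices from the AWS Pricing API for comparison (assuming boto3 credentials are configured; m5.large and the exact filter values are just illustrative, check them against the Pricing API docs):

    import json
    import boto3

    # The Pricing API itself is only served from a couple of regions, us-east-1 among them.
    pricing = boto3.client("pricing", region_name="us-east-1")

    def on_demand_price(location, instance_type="m5.large"):
        """Return the on-demand USD/hour price for a Linux instance in the given location."""
        resp = pricing.get_products(
            ServiceCode="AmazonEC2",
            Filters=[
                {"Type": "TERM_MATCH", "Field": "location", "Value": location},
                {"Type": "TERM_MATCH", "Field": "instanceType", "Value": instance_type},
                {"Type": "TERM_MATCH", "Field": "operatingSystem", "Value": "Linux"},
                {"Type": "TERM_MATCH", "Field": "tenancy", "Value": "Shared"},
                {"Type": "TERM_MATCH", "Field": "preInstalledSw", "Value": "NA"},
                {"Type": "TERM_MATCH", "Field": "capacitystatus", "Value": "Used"},
            ],
            MaxResults=1,
        )
        product = json.loads(resp["PriceList"][0])
        term = next(iter(product["terms"]["OnDemand"].values()))
        dimension = next(iter(term["priceDimensions"].values()))
        return dimension["pricePerUnit"]["USD"]

    for location in ["US East (N. Virginia)", "US East (Ohio)", "US West (N. California)"]:
        print(location, on_demand_price(location))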
Incentivize the best behaviors.
Or is there a perspective I don't see?
Most people (myself included) only choose it because it's the cheapest. If multiple regions were the same price then there'd be less impact if one goes down.
A negligible cost difference shouldn't matter when your apps are unstable due to the region being problematic.
Agreed, but a sizable cohort of people don't have the foresight or incentives to think past their nose, and just click the cheapest option.
So it's on Amazon to incentivize what's best.
This is the biggest one, isn't it? I thought Route 53 isn't even available in any other region.
Our stuff is all in us-east-1, ops was a total shitshow today (mostly because many 3rd party services besides aws were down/slow), but our prod service was largely "ok", a total of <5% of customers were significantly impacted because existing instances got to keep running.
I think we got a bit lucky, but no actual SLAs were violated. I tagged the postmortem as Low impact despite the stress this caused internally.
We definitely learnt something here about both our software and our 3rd party dependencies.
Amazon says service is now just "degraded" and recovering, but searching for products on Amazon.com still does not work for me. https://health.aws.amazon.com/health/status
When this is fixed, I am very interested in seeing recorded spend for Sunday and Monday.
365 days * 24 h * 0.0001 is roughly 0.9 hours (about 53 minutes) of allowed downtime per year, so with an outage this long it already lost the 99.99% status.
[0] Fraction is ~ 1
From reading the EC2 SLA I don't think this is covered. https://aws.amazon.com/compute/sla/
The reason is that the Instance-Level SLA only applies when "your Single EC2 Instance has no external connectivity." Instances that were already created kept working, so this isn't covered, and the SLA doesn't cover creation of new instances at all.
The refund they give you isn’t going to dent lost revenue.
The duration of the outage in relation to that uptime is (8 h / 33602 h) * 100% = 0.024%, so the uptime is 99.976%, slightly worse than 99.99%, but clearly better than 99.90%.
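To make the arithmetic explicit, the yearly downtime budget for n nines is just 8760 h × 10^-n; a quick sketch (the 33602-hour window is taken from the comment above):

    HOURS_PER_YEAR = 365 * 24  # 8760

    for nines in (2, 3, 4, 5):
        availability = 1 - 10 ** -nines            # 0.99, 0.999, 0.9999, 0.99999
        budget_hours = HOURS_PER_YEAR * 10 ** -nines
        print(f"{availability:.5f} -> {budget_hours * 60:8.1f} minutes/year allowed")

    # An ~8 hour outage over a ~33602 hour window:
    print(f"uptime over window: {100 * (1 - 8 / 33602):.3f}%")   # ~99.976%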
They used to be five nines, and people used to say that it wasn't worth their while to prepare for an outage. With less than four nines, the perception might shift, but likely not enough to induce a mass migration to outage-resistant designs.
When your SLA holds within a joke SLA window, you know you goofed.
"Five nines, but you didn't say which nines. 89.9999...", etc.
Our only impact was some atlassian tools.
If the server didn't work, the tool to measure it didn't work either! Genius
February 28, 2017. S3 went down and took down a good portion of AWS and the Internet in general. For almost the entire time that it was down, the AWS status page showed green because the up/down metrics were hosted on... you guessed it... S3.
- 2008 - https://news.ycombinator.com/item?id=116445
- 2010 - https://news.ycombinator.com/item?id=1396191
- 2015 - https://news.ycombinator.com/item?id=10033172
- 2017 - https://news.ycombinator.com/item?id=13755673 (Postmortem: https://news.ycombinator.com/item?id=13775667)
Maybe they should start using real software instead of mathematicians' toy langs
Sadly while I still use that tool a couple of jobs/companies later - I no longer recommend it because it migrated to AWS a few years back.
(For now, my out-of-AWS monitoring tool is a bunch of cron jobs running on a collections of various inexpensive vpses and my and other dev's home machines.)
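In case it's useful, a check like that can be as small as a stdlib script run from cron on a box outside your main provider (the URL and timeout are placeholders); cron mails any output, so the script only prints when something looks wrong:

    #!/usr/bin/env python3
    # Minimal external health check.
    # Run from crontab, e.g.: */5 * * * * /usr/local/bin/check_site.py
    import sys
    import urllib.request

    URL = "https://status.example.com/healthz"   # placeholder endpoint
    TIMEOUT_SECONDS = 10

    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECONDS) as resp:
            resp.read()   # any HTTP error status raises before we get here
    except Exception as exc:  # DNS failure, timeout, connection refused, TLS error, 5xx, ...
        print(f"health check failed: {exc!r} for {URL}")
        sys.exit(1)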
(No disrespect to Nagios, I'm sure a competently managed installation is capable of being way better than what I had to put up with.)
I can’t explain why Saucelabs was the most grating one, but it was. I think it’s because they routinely experienced 100% down for 1% of customers, and we were in that one percent about twice a year. <long string of swears omitted>
I don't think anyone would quote availability as availability across every region they're in, would they?
While this is their most important region, there are a lot of clients that are probably unaffected if they're not in us-east-1.
They COULD be affected even if they don't have anything there, because of the AWS services relying on it. I'm just saying that most customers that are multi-region should have failed their east region out and be just humming along.
I don't think this is true anymore. In the early days, bad enough outages in us-east-1 would bring down everything because some metadata / control plane stuff was there - I remember getting affected while in other regions - but it's been many years since this has happened.
Today, for example, no issues. I just avoid us-east-1 and everyone else should too. It's their worst region by far in terms of reliability because they launch all the new stuff there and are always messing it up.
I do not envy anyone working on this problem today.
We were more honest, and it probably cost us at least once in not getting business.
If you as a customer ask for 5 9s per month, with a service credit of 10% of at-risk fees for missing it, on a deal where my GM is 30%, I can just amortise that cost and bake it into my fee.
A lot of these are second order dependencies like Astronomer, Atlassian, Confluent, Snowflake, Datadog, etc... the joys of using hosted solutions to everything.
When the NAS shit the bed, we lost half of production and all our run books. And we didn’t have autoscaling yet. Wouldn’t for another 2 years.
Our group is a bunch of people that has no problem getting angry and raising voices. The whole team was so volcanically angry that it got real quiet for several days. Like everyone knew if anyone unclenched that there would be assault charges.
Rest and vest CEOs
He got a lot of impossible shit done as COO.
They do need a more product minded person though. If Jobs was still around we’d have smart jewelry by now. And the Apple Watch would be thin af.
If it doesn’t stop, that means it has a battery backup. But you can still make life more bearable. Switch off all your breakers (you probably have a master breaker for this), then open up the alarm box and either pull the battery or - if it’s non-removable - take the box off the wall, put it in a sealed container, and put the sealed container somewhere… else. Somewhere you can’t hear it or can barely hear it until the battery runs down.
Meanwhile you can turn the power back on but make sure you’ve taped the bare ends of the alarm power cable, or otherwise electrically insulated them, until you’re able to reinstall it.
I would honestly do your box option. Stuff it in there with some pillows and leave it in the shed for a while.
https://health.aws.amazon.com/health/status?path=open-issues
The closest to their identification of a root cause seems to be this one:
"Oct 20 8:43 AM PDT We have narrowed down the source of the network connectivity issues that impacted AWS Services. The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers. We are throttling requests for new EC2 instance launches to aid recovery and actively working on mitigations."
By the way, Twilio is also down, so all those login SMS verification codes aren’t being delivered right now.
I remember a meme years ago about Nestle. It was something like: GO ON, BOYCOTT US - I BET YOU CAN’T - WE MAKE EVERYTHING.
Same meme would work for Aws today.
Not really, there are enough alternatives.
And it’s not like there aren’t other brands of chocolate either…
It's fairly difficult to avoid single points of failure completely, and if you do it's likely your suppliers and customers haven't managed to.
It's about what your acceptable risk level is.
AWS us-east-1 fails constantly, it has terrible uptime, and you should expect it to go down. A cyberattack that destroyed AWS's entire infrastructure would be less likely. BGP hijacks across multiple AWS nodes are quite plausible though, but that can be mitigated to an extent with direct connects.
Sadly it seems people in charge of critical infrastructure don't even bother thinking about these things, because next quarter's numbers are more important.
I can avoid London as a single point of failure, but the loss of Docklands would cause so much damage to the UK's infrastructure I can't confidently predict that my servers in Manchester connected to peering points such as IXman will be able to reach my customer in Norwich. I'm not even sure how much international connectivity I could rely on. In theory Starlink will continue to work, but in practice I'm not confident.
When we had power issues in Washington DC a couple of months ago, three of our four independent ISPs failed, as they all had undeclared dependencies on active equipment in the area. That wasn't even a major outage, just a local substation failure. The one circuit which survived was clearly just fibre from our (UPS/generator backed) equipment room to a data centre towards Baltimore (not Ashburn).
Until this happens: a single region in a cascading failure, and your SaaS is single-region.
> They’re not wrong though. If AWS goes down, EVERYTHING goes down to some degree. Your app, your competitor’s apps, your clients’ chat apps. You’re kinda off the hook.
They made their own bigger problems by all crowding into the same single region.
Imagine a beach with ice cream vendors. You'd think it would be optimal for two vendors to split it, one taking the north half and the other the south. However, because each wants to steal some of the other vendor's customers, you end up with both ice cream stands in the center.
So too with outages. Safety / loss of blame in numbers.
I mean, I agree, but what you're saying is: where else are you gonna host it? If you host it yourself and then that turns out to be the issue and you go down, that's entirely on you while 99% of the internet still works.
But if AWS goes down, let's say 50% of the internet goes down.
So, in essence, nobody blames a particular team/person, just as the parent comment said that nobody gets fired for picking IBM.
Although, I still think the worrying idea is such massive centralization of servers that we have a single switch which can turn half the internet off. So I am a bit worried about the centralization side of things.
From my perspective, multiple unrelated websites quit working at the same time. I would rather have had one website down, and the rest working, than for me to be completely hamstrung because so many services are down simultaneously.
Recall is on AWS.
Everyone using Recall for meeting recordings is down.
In some domains, a single SaaS dominates, and if that SaaS sits on AWS, it doesn't matter that AWS has 35% market share: the SaaS that dominates 80% of the domain is on AWS, so the effect is wider than just AWS's market share.
We're on GCP, but we have various SaaS vendors on AWS so any of the services that rely on AWS are gone.
Many chat/meeting services also run on AWS Chime so even if you're not on AWS, if a vendor uses Chime, that service is down.
And this comes at a time with regulations like DORA and BaFin tightening things - managing these boxes becomes less effort than maintaining compliance across vendors.
Note, I'm not affiliated with any of these companies.
For example, Purestorage has put a lot of work into their solution, and for a decent chunk of cash you get a system that slots right into VMware, offers iSCSI for other infrastructure providers, offers a CSI plugin for containers, and speaks S3. Integration with a few systems like OpenShift has been simplified as well.
This continues. You can get ingress/egress/network monitoring compliance from Calico slotting in as a CNI plugin, some systems managing supply chain security, ... Something like Nutanix is an entirely integrated solution you rack and then you have a container orchestration with storage and all of the cool things.
Cost is not really that much a factor in this market. Outsourcing regulatory requirements and liability to vendors is great.
At the end of the day most of us aren't working on super critical things. No one is dying because they can't purchase X item online or use Y SaaS. And, more importantly, customers are _not_ willing to pay the extra for you to host your backend in multiple regions/providers.
In my contracts (for my personal company) I call out the single-point-of-failure very clearly and I've never had anyone balk. If they did I'd offer them resiliency (for a price) and I have no doubt that they would opt to "roll the dice" instead of pay.
Lastly, it's near-impossible to verify what all your vendors are using, so even if you manage to get everything resilient it only takes one chink in the armor to bring it all down (See: us-east-1 and the various AWS services that rely on it even if you don't host anything in us-east-1 directly).
I'm not trying to downplay this, pretend it doesn't matter, or anything like that. Just trying to point out that most people don't care because no one seems to care (or want to pay for it). I wish that was different (I wish a lot of things were different) but wishing doesn't pay my bills and so if customers don't want to pay for resiliency then this is what they get and I'm at peace with that.
US$70 billion in spend on aggregating data back then; this number has only increased.
https://journals.sagepub.com/doi/pdf/10.1177/205395171454186...
2) People who thought that just having stuff "in the cloud" meant that it was automatically spread across regions. Hint: it's not; you have to deploy it in different regions and architect/maintain around that (see the sketch after this list).
3) Accounting.
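On point 2, a rough sketch of what "architect around it" can mean at the client level, assuming the data actually is replicated to the second region (e.g. a DynamoDB global table; the table name, key, and region choices are made up for illustration):

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    REGIONS = ["us-east-1", "us-west-2"]   # primary first, then failover

    def get_item_with_failover(table_name, key):
        """Try the primary region, fall back to the secondary if it errors out."""
        last_error = None
        for region in REGIONS:
            try:
                dynamodb = boto3.client("dynamodb", region_name=region)
                return dynamodb.get_item(TableName=table_name, Key=key)
            except (ClientError, EndpointConnectionError) as err:
                last_error = err   # note it and try the next region
        raise last_error

    item = get_item_with_failover("orders", {"order_id": {"S": "12345"}})

None of this helps, of course, if the data only lives in one region to begin with; that's the "architect/maintain around that" part.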
us-east-1 is more a legacy of being the first region; by virtue of being the default region for the longest time, most customers built on it.
So as a result, everyone keeps production in us-east-1. =)
And they give you a much better developer experience...
Sigh
Then things got worse. At 9:13 AM PT it sounds like they’re back to troubleshooting.
Honestly sounds like AWS doesn’t even really know what’s going on. Not good.
1. When AWS deploys changes, they run through a pipeline which pushes the change to regions one at a time. Most services start with us-east-1 first.
2. us-east-1 is MASSIVE and considerably larger than the next largest region. There's no public numbers but I wouldn't be surprised if it was 50% of their global capacity. An outage in any other region never hits the news.
This is true.
> Most services start with us-east-1 first.
This is absolutely false. Almost every service will FINISH with the largest and most impactful regions.
In general:
You don't deploy to the largest region first because of the large blast radius.
You may not want to deploy to the largest region last because then if there's an issue that only shows up at that scale you may need to roll every single region back (divergent code across regions is generally avoided as much as possible).
A middle ground is to deploy to the largest region second or third.
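A toy illustration of that ordering (region names, wave sizes, and bake times are invented and have nothing to do with any real AWS pipeline): a tiny canary region goes first, the biggest region goes in an early-but-not-first wave, and there's bake time between waves.

    import time

    # Hypothetical rollout plan: small canary wave first, then the big region
    # second so scale problems surface before the change is everywhere.
    WAVES = [
        ["eu-north-1"],                    # small canary region
        ["us-east-1"],                     # largest region, early but not first
        ["us-west-2", "eu-west-1"],
        ["ap-southeast-2", "sa-east-1"],   # ...and so on
    ]
    BAKE_TIME_SECONDS = 6 * 60 * 60        # let metrics settle between waves

    def deploy(region):
        print(f"deploying to {region}")    # stand-in for the real deploy step

    for wave in WAVES:
        for region in wave:
            deploy(region)
        time.sleep(BAKE_TIME_SECONDS)      # roll back instead if alarms fire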
> When AWS deploys updates to its services, deployments to Availability Zones in the same Region are separated in time to prevent correlated failure.
https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.
My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.
Is this well known/documented? I don't have anything on AWS but previously worked for a company that used it fairly heavily. We had everything in EU regions and I never saw any indication/warning that we had a dependency on us-east-1. But I assume we probably did based on the blast radius of today's outage.
See: https://docs.aws.amazon.com/whitepapers/latest/aws-fault-iso...
“In the aws partition, the IAM service’s control plane is in the us-east-1 Region, with isolated data planes in each Region of the partition.“
Also, intra-region, many of the services use each other, and not in a manner where the public can discern the dependency map.
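One concrete example of trimming that hidden dependency: the legacy global STS endpoint (sts.amazonaws.com) has historically been served out of us-east-1, so pinning clients to a regional endpoint keeps token issuance inside the region you actually run in. A sketch with boto3 (region chosen arbitrarily):

    import boto3

    # Explicitly use the regional STS endpoint instead of the legacy global one.
    sts = boto3.client(
        "sts",
        region_name="eu-west-1",
        endpoint_url="https://sts.eu-west-1.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])

(The same thing can be done fleet-wide with the AWS_STS_REGIONAL_ENDPOINTS=regional setting rather than hard-coding an endpoint URL.)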
> The Amazon retail site seems available
Some parts of amazon.com seem to be affected by the outage (e.g. product search: https://x.com/wongmjane/status/1980318933925392719)
I would also guess the testing is incomplete. Alexa+ is a slow roll out so they can improve precision/recall on the intents with actual customers. Alexa+ is less deterministic than the previous model was wrt intents
One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.
I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.
https://health.aws.amazon.com/health/status?path=service-his...
Ah yes, the great AWS us-east-1 outage.
Half the internet’s on fire, engineers haven’t slept in 18 hours, and every self-styled “resilience thought leader” is already posting:
“This is why you need multi-cloud, powered by our patented observability synergy platform™.”
Shut up, Greg.
Your SaaS product doesn’t fix DNS, you're simply adding another dashboard to watch the world burn in higher definition.
If your first reaction to a widespread outage is “time to drive engagement,” you're working in tragedy tourism. Bet your kids are super proud.
Meanwhile, the real heroes are the SREs duct-taping Route 53 with pure caffeine and spite.
https://www.linkedin.com/posts/coquinn_aws-useast1-cloudcomp...
Yes I know it’s sad…
I have clients and I’ve heard “even Amazon is down, we can be down” more than once.
AWS makes their SLAs & uptime rates very clear, along with explicit warnings about building failover / business continuity.
Most of the questions on the AWS CSA exam are related to resiliency.
Look, we've all gone the lazy route and done this before. As usual, the problem exists between the keyboard and the chair.
If they don’t obfuscate the downtime (they will, of course), this outage would put them at, what, two nines? That’s very much out of their SLA.
People also keep talking about it as if its one region, but there are reports in this thread of internal dependencies inside AWS which are affecting unrelated regions with various services. (r53 updates for example)
Just like reading medication side effects, they are letting you know that downtime is possible, albeit unlikely.
All of the documentation and training programs explain the consequence of single-region deployments.
The outage was a mistake. Let's hope it doesn't indicate a trend. I'm not defending AWS. I'm trying to help people translate the incident into a real lesson about how to proceed.
You don't have control over the outage, but you do have control over how your app is built to respond to a similar outage in the future.
....AWS!
CNBC is supposed to inform users about this stuff, but they know less than nothing about it. That's why they were the most excited about the "Metaverse" and telling everyone to get on board (with what?) or get left behind.
The market is all about perception of value. That's why Musk can tweet a meme and double a stock's price; it's not based in anything real.
I was under the impression that having multiple availability zones guarantees high availability.
It seems this is not the case.
Economic efficiency and technical complexity are both, separately and together, enemies of resilience
https://www.bbc.com/news/live/c5y8k7k6v1rt?post=asset%3Ad902...
Of course in a sane world you'd have an internal fallback for when cloud connectivity fails but I'm sure someone looked at the cost and said "eh, what's the worst that could happen?"
Humans have built-in redundancy for a reason.
• Laying off top US engineering earners.
• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.
• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.
• Terminating highly skilled engineering contractors everywhere else.
• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).
This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.
Source: I was there.
Will we see mass exits from their service? Who knows. My money says no though.
How many companies can just ride the "but it's not our fault" to buy time with customers until it's fixed?
"It's been on the dev teams list for a while"
"Welp....."
Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.
How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.
To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.
I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix-n-match services of ANY cloud provider, and workloads failover seamlessly across clouds/on-prem environments.
Feel free to get in touch!
The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.
If I'm a mid to large size company built on DynamoDB, I'd be questioning if it's really worth the risk given this 12+ hour outage.
I'd rather build upon open source tooling on bare metal instances and control my own destiny, than hope that Amazon doesn't break things as they scale to serve a database to host the entire internet.
For big companies, it's probably a cost savings too.
For any sized company, moving away from big clouds back onto traditional VPS or bare-metal offerings will lead to cost savings.
It's not difficult, it's just that we engineers chose convenience and delegated uptime to someone else.
[1] - https://usetrmnl.com
Not just AWS, but Cloudflare and others too. Would be interesting to review them clinically.
> “The Machine,” they exclaimed, “feeds us and clothes us and houses us; through it we speak to one another, through it we see one another, in it we have our being. The Machine is the friend of ideas and the enemy of superstition: the Machine is omnipotent, eternal; blessed is the Machine.”
..
> "she spoke with some petulance to the Committee of the Mending Apparatus. They replied, as before, that the defect would be set right shortly. “Shortly! At once!” she retorted"
..
> "there came a day when, without the slightest warning, without any previous hint of feebleness, the entire communication-system broke down, all over the world, and the world, as they understood it, ended."
You're gonna hear mostly complaints in this thread, but simple, resilient, single-region architecture is still reliable as hell in AWS, even in the worst region.
Resolves to nothing.
Alternatively, perhaps their DNS service stopped responding to queries or even removed itself from BGP. It's possible for us mere mortals to tell which of these is the case.
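For what it's worth, the two cases do look different from the outside: an answer saying the name doesn't exist versus resolvers not answering at all. A quick sketch using the third-party dnspython package (the resolver IP and record name are just examples):

    import dns.exception
    import dns.resolver   # third-party: pip install dnspython

    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["8.8.8.8"]          # ask a specific public resolver
    name = "dynamodb.us-east-1.amazonaws.com"

    try:
        answer = resolver.resolve(name, "A", lifetime=5)
        print([rr.to_text() for rr in answer])  # the name resolves
    except dns.resolver.NXDOMAIN:
        print("name does not exist")            # the zone answered: no such name
    except dns.resolver.NoAnswer:
        print("name exists but has no A record")
    except dns.exception.Timeout:
        print("resolver did not answer at all") # servers unreachable / not responding

Checking whether the prefix is still announced in BGP is a separate exercise (public looking glasses work for that), but at least the DNS side is easy to pin down.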