But now, people are quite happy to put their app in a Docker container and outsource all design and architecture decisions pertaining to data storage and performance.
And with that, the likes of ECS, Dynamo, Redshift, etc. are a somewhat reasonable answer. It's much easier to offer a distinct proposition around that state of affairs than around, say, a market based solely on EC2-esque VMs.
What I did not like, but absolutely expected, was this lurch towards near enough standardising on one specific vendor's model. We're in quite a strange place at the moment, where AWS-specific knowledge might actually have a slightly higher value than traditional DevOps skills for many organisations.
Felt like this all happened both at the speed of light, and in slow motion, at the same time.
Containers allow me to outsource host management. I gladly spend far less time troubleshooting cloud-init, SSH, process managers, and logging/metrics agents.
Exactly.
And sure, you can use S3/Dynamo/Aurora from an EC2 box, but what would be the point of that? Just get the app running in a container, and we can look into infrastructure later.
It's a very common refrain. That's why I believe Docker is strongly linked to the development of these proprietary, cloud-based models of computing, which place containerisation at the heart of an ecosystem that bastardises the classic idea of a 'server'.
The existence of S3 is one good result of this. IAM, on the other hand, can die in a dumpster fire. Though it won't...
Before Docker you had things like Heroku and Amazon Elastic Beanstalk, with a much greater degree of lock-in than Docker.
ECS and its analogues on the other cloud providers have very little lock-in. You should be able to deploy your container to any provider or your own VM. I don't see what Dynamo and data storage have to do with that. If we were all on EC2 with no other services, you'd still have to figure out how to move your data somewhere else.
Like I truly don't understand your argument here.
It's also more resilient, because I can trash a container and load up a new one with low overhead (a rough sketch follows below). I can't really do that with a full machine. It also adds some security by sandboxing.
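As an illustration of that trash-and-replace workflow, here's a minimal sketch using the Docker SDK for Python; the image name, container name, and port mapping are placeholders, not anything from the comment above.

```python
# A rough sketch of "trash it and load up a new one" with the Docker SDK
# for Python (pip install docker). Names and ports are placeholders.
import docker
from docker.errors import NotFound

client = docker.from_env()

def replace_container(name: str, image: str) -> None:
    """Remove any existing container with this name and start a clean one."""
    try:
        old = client.containers.get(name)
        old.remove(force=True)  # force=True stops a running container first
    except NotFound:
        pass  # nothing to trash; first boot
    client.containers.run(image, name=name, detach=True, ports={"8080/tcp": 8080})

replace_container("myapp", "myapp:latest")
```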
This does lead to laziness by programmers, accelerated by myopic management. "It works", except when it doesn't. It's easy to say you just need to restart the container rather than figure out the actual issue.
But I'm not sure what that has to do with the cloud. You'd do the same thing self-hosting, and probably save money too. Though I'm frequently confused why people don't do both: self-host and host in the cloud. That's how you create resilience. Though you also need to fix problems, rather than just restart, to be resilient.
I feel like our industry wants to move fast but without direction. It's like we know velocity matters, but since it's easier to read the speedometer, we pretend speed and velocity are the same thing. So 'fast and slow' makes sense: fast by the magnitude of the vector, slow if you're measuring how quickly we make progress in the intended direction.
I assume most organizations, both small and large, just host on whatever provider they know or that costs them the least. If you have budget maybe you deploy to multiple providers for redundancy? But that increases cost and complexity.
Who’s going to bother with colo given the cost / complexity? Who’s going to run a server from their office given ISP restrictions and downtime fears?
What is the realistic antidote here?
To fix it, test your failback procedures. For everything else, there's nothing to fix, it's working by design.
My CI was down for 2 hours this morning, despite not even being on AWS. We have a set of credentials on that host that we call assumeRole with and push to an S3 bucket, which has a Lambda that duplicates objects to buckets in other regions. All our IAM calls were failing due to this outage, even though we have 0 items deployed in us-east-1 (we're European).
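If the failing calls were STS assumeRole requests against the default global endpoint, that would explain it: sts.amazonaws.com is served out of us-east-1. One known mitigation is pinning STS to a regional endpoint; a minimal sketch with boto3, where the role ARN and regions are placeholders:

```python
# sts.amazonaws.com, the default global endpoint, is hosted in us-east-1.
# Pinning STS traffic to a regional endpoint avoids that dependency for
# AssumeRole calls. Role ARN and regions below are placeholders.
import boto3

sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)

creds = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ci-deploy",  # placeholder ARN
    RoleSessionName="ci-deploy",
)["Credentials"]

s3 = boto3.client(
    "s3",
    region_name="eu-west-1",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```

This wouldn't help with IAM control-plane writes, which genuinely are global and anchored in us-east-1.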
One thing AWS should do is provide an easier way to detect these hidden dependencies. You can do it with CloudTrail if you know how (filter operations by region and check that none hit us-east-1; a sketch follows below), but a more explicit service would be nice.
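A minimal sketch of that CloudTrail approach with boto3, assuming default credentials; the helper name and the 24-hour window are my choices, and lookup_events only covers recent management events, so this is a heuristic rather than a complete audit.

```python
# Heuristic scan: which services has this account been calling in us-east-1?
# lookup_events only returns recent management events, so treat the output
# as a starting point, not a complete dependency audit.
import json
from datetime import datetime, timedelta, timezone

import boto3

def us_east_1_dependencies(hours: int = 24) -> dict[str, int]:
    """Count recent management events recorded against us-east-1, by service."""
    ct = boto3.client("cloudtrail", region_name="us-east-1")
    start = datetime.now(timezone.utc) - timedelta(hours=hours)
    counts: dict[str, int] = {}
    for page in ct.get_paginator("lookup_events").paginate(StartTime=start):
        for event in page["Events"]:
            detail = json.loads(event["CloudTrailEvent"])
            if detail.get("awsRegion") == "us-east-1":
                src = detail["eventSource"]
                counts[src] = counts.get(src, 0) + 1
    return counts

if __name__ == "__main__":
    for source, n in sorted(us_east_1_dependencies().items(), key=lambda kv: -kv[1]):
        print(f"{n:6d}  {source}")
```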
For most of what I run these days, I'd rather just have someone else run and administer my database. Same with my load balancers. And my Kubernetes cluster. I don't really care if there is an outage every 2 years.
And colo and datacenters aren't immune to going down.
Did multi cloud redundancy end up being too expensive? Tech didn't line up enough? No good business case?
The elastic cloud story that never was? https://www.slideshare.net/slideshow/pets-vs-cattle-the-elas...
What happened?
Simplicity is linked to uptime, and a single-cloud setup is a simpler solution.
For large companies, it's mostly cost savings. It's easier to negotiate a good discount at N million than at N/2 million.
Besides that no-one ever got fired for picking AWS ;)
If there's an issue with relying only on AWS it has not been expressed in this outage.
Should we similarly cap, say, front-end frameworks on market penetration/growth? Is React too big to fail? Do we need to force some of its users to use something else?
You have to build it in. That takes time, money, and training. Do you do failovers? Do they work? What is your backup situation? What is your list of work items to do during the failover? How long does it take? Do you even HAVE a failover plan? Can your services handle being in 'split brain'? Do you have specialty services that can only run in one place?
The unfortunate reality is that this planning all too often happens too late.
This is the situation for my company, which started with the intent of being platform-agnostic, but things quickly got much less complex once it was clear the entire potential client pool was using the same cloud. People with buckets holding large amounts of data are not going to be able to convince the bean counters that it's worth paying that storage bill to multiple vendors.
Because it rarely is. Occasional downtime is just a cost of doing business. It is, or should be, rare enough that you just take it as it comes instead of trying to build redundancy. We don't build tunnels everywhere as a backup for surface roads on snowy days. We just cancel school and work for the day and make up for it later. Do some important things get impacted? Sure, but most things aren't as mission-critical as we make them out to be. The press coverage of an AWS outage makes it easy to shrug it off and point fingers.
Additionally, moving your load to a different cloud can be challenging while one is down. It ends up being a lot of work that pays off for a few hours a year. For a lot of applications, it's better to just suffer the downtime and spend money on other things.
Multi-cloud redundancy is like Java being a solution to platform independence.
There is no panacea. The reason many people use these is because it’s easy and hard to find people that know other clouds and their quirks.
If you are just one company whose goal is to maximize uptime without bringing in the complexity of multi-cloud, relying on AWS is reasonable. You probably won't get better uptime using something else, you'll only be down at different times than most others, which in most cases is actually worse.
But on the other hand, maybe I hang around too many tech people to be able to empathise with the other point of view.
Also, some outages affect real life, like airlines, but tech news overstates others, like Facebook. It turns out that FB and IG can be totally broken for a whole day, the world will keep spinning, and they won't even lose users.
If you removed AWS/GCP/Azure/etc and just had 100 small providers scattered all over, the result would be hundreds of outages throughout the year, as opposed to one big outage every other year [in one region]. AWS is already way more reliable than any other provider.
The real problem here is that companies that use AWS are morons who don't know how to architect/build infrastructure properly.
If it's important, it should be built right, regardless of who the provider is. A software building code would mandate how companies could use infrastructure (AWS or any provider) so that important services would not go down when one service or region goes down.
This is the basic concept behind things like the electrical code. It doesn't matter how great a public utility is; if your business is wired up so badly that a stiff breeze sets it on fire, just switching utilities isn't gonna help. And some utilities do occasionally have problems that persist down their lines to the customers, so customers need to set up equipment to protect against those failures. Whole-house surge protectors, lightning arresters, EMP shields, etc are necessary so that a rare event doesn't fry expensive customer equipment.
(If most companies liked using cloud providers in parallel, they’d already be doing it today between AWS, Azure, and GCP.)
(which is not true in reality if you have ordinary customers).
Companies are using the higher-level "PaaS" suite of services from AWS, such as DynamoDB, Redshift, etc., and not just the lower-level "IaaS", such as basic EC2 instances or pure containers. It's the same "lock-in" situation with the higher-level services from MS Azure and Google Cloud.
For those dependent on high-level services, migrating to a VPS like Hetzner or to self-hosting is not possible unless they re-invent the AWS stack by installing and babysitting a bunch of open-source software. It's going to be a lot more involved than just installing a PostgreSQL instance on a VPS.
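The flip side is that the lowest-level pieces are the most portable. The S3 API in particular is widely cloned (MinIO, Ceph RGW, and others speak it), so object-storage code can often be repointed with a single parameter. A sketch with boto3, where the endpoint and credentials are placeholders for a hypothetical self-hosted MinIO box:

```python
# The S3 API is widely cloned, so the object-storage layer can often be
# repointed at self-hosted infrastructure with one endpoint change.
# Endpoint and credentials below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.internal:9000",  # hypothetical MinIO endpoint
    aws_access_key_id="minio-user",             # placeholder credential
    aws_secret_access_key="minio-secret",       # placeholder credential
)
s3.put_object(Bucket="backups", Key="db.dump", Body=b"...")
```

There is no equivalent one-line move for DynamoDB or Redshift, which is exactly the lock-in being described.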
Yes, and you can't escape that by outsourcing it. The complexity is still there, and it will still bite you when your outsourcer fails to manage it.
> We continue to investigate the root cause for the network connectivity issues that are impacting AWS services such as DynamoDB, SQS, and Amazon Connect in the US-EAST-1 Region. We have identified that the issue originated from within the EC2 internal network.
So, kinda? Some global services depend on us-east-1...
> Global services or features that rely on US-EAST-1 endpoints such as IAM updates and DynamoDB Global tables may also be experiencing issues.
Basically, you know it's going to be a bumpy day when us-east-1 has an issue, because your ability to run across regions depends on what the issue is and what the impact is.
Overall not that bad for us, but if you had more high-level service dependencies, there would have been impact.
If not, I look forward to the next single-point-of-failure outage. And the next. And the next.
Which was effectively the only region
Not that I disagree with you, but maybe not for the reasons you say (:
As someone who used to work on the inside: us-east-1 has the biggest pile of legacy workarounds for internal AWS issues, it has a variety of legacy API behaviours that don't exist in other regions, and because everyone picks it as the default, it has significantly more pressure on contested resources (e.g. things like spot instance pools).
Plus since it's the default in all the tooling, if you ever decide to go multi-region, you'll find tons of things break right away.
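us-east-1 creeps in as the default in a lot of sample configs and tooling, so one small defensive habit is to make region selection explicit and fail fast when it's missing. A sketch with boto3; the helper name is mine, not a standard API:

```python
# Require an explicit region rather than letting a us-east-1 default slip
# through sample configs and tooling. Helper name is illustrative only.
import os

import boto3

def make_client(service: str, region: str | None = None):
    """Build a boto3 client only when a region has been chosen explicitly."""
    region = region or os.environ.get("AWS_REGION")
    if not region:
        raise RuntimeError(
            f"No region configured for {service}; refusing to fall back to us-east-1"
        )
    return boto3.client(service, region_name=region)

# Usage: make_client("sqs", "eu-west-1"), or export AWS_REGION first.
```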
Or you might use BART to come to work and you got stuck: https://www.kqed.org/news/12060687/bart-resumes-service-but-...
Wish I'd already had this link in my back pocket. Our industry needs to take its job, as a whole, much more seriously.
https://health.aws.amazon.com/health/status?path=service-his...
Always worth taking sentences that use “the Cloud” or “the Internet” and try replacing those phrases with “A shed in Virginia” to see how they hold up. “Our service is fully based in a shed in Virginia”; “All my files are in a shed in Virginia”; “A shed in Virginia was designed to survive a nuclear war”, etc.
In a surprise to literally no one, that happening on the last Friday before the Christmas break got the "we need to secure the main comms cabinet" item I'd been asking about for months to the top of the list. (That cabinet held the backup server and the main WAN ingress, and was in a separate building on the other side of the site.)
Still one of my favourite "outages", because I got to my desk, turned my PC on, no network; walked across the landing into the main office, opened the comms cabinet, plugged it back in, and had it "resolved" before the MD got to my desk.
You're not entirely wrong, but you're being hyperbolic too. I'm actually curious how old you are / how long you've worked in tech, because I started out pre-cloud and things weren't nearly as bad or as limited as you suggest.
First, on-prem servers are not the only alternative to "cloud." Many businesses, including the ones I worked for, did co-location. The companies owned their own bare metal servers but rented a rack in a data centre, and certain things, like network admin, were entirely outsourced to the data centre / hosting company.
You could also rent managed bare metal servers (you still can). This means you can pretty much outsource your entire IT department while still not using cloud services: you've got bare metal servers, and someone you're paying at the hosting company handles security updates and troubleshooting. You don't get things like auto-scaling or serverless or other cloud features, but you also don't have to worry about Jim tripping over the power cable.
There are also still virtual servers, which are basically VMs running on a server that hosts multiple clients.
All of this is to say that the choice is not just "cloud" or "box in a closet." It's "cloud" plus a ton of different server options: owned, rented, co-located, on-prem, dedicated, virtual, managed vs. unmanaged (outsourced IT vs. admin your own), and the list goes on and on.
100% of our failures with this machine over 2 years have been "carpet cleaners unplugged the machine". Last year we had nobody in the office (due to carpet cleaning). This year we sent someone in straight after the cleaning to fix it.
https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-i...
Meanwhile, everyone that spends actual time in these areas:
- Knows that running an operation at AWS scale is difficult, and that any armchair criticism from 'experts' is exactly that. Actions speak louder than words.
- Understands that the cost of actually accounting for this kind of scenario is incredibly high relative to the benefit in most cases
- Knows that genuinely 'critical' services (i.e. health) should be designed to account for this, and that every other 'serious' issue, such as 'I can't log in to Fortnite', just shows what the price and effort of actually making that work would be, versus how much it costs affected companies when it happens
- Knows how much time national newspapers spend talking about the importance of multi-region/multi-cloud redundancy: zero, until the one day it happens, and then it's old news
- Is just curious as to just what exactly happened from a technical perspective
This isn't to say that good blameless post-mortem shouldn't happen to figure out process and technical issues, but the armchair criticism with no actual followup? All noise, no signal.
I couldn't even name another provider except maybe Hetzner
The time spent on that (let alone cost, for companies with large amounts of data) far outweighs the cost when a single region has an issue of today's scope. And you said it yourself, it's a 'small benefit'. Small benefits sound like exactly the things not worth spending time or money on.
For as much as many companies have had issues today, the daily reality is that these same companies haven't been having issues all the rest of the time (or this wouldn't have felt so shocking) and are likely to be okay with an outage of this scope (plus, everyone's too busy making noise about the issues to be working normally).
NO. From their own reports, AWS is clearly too centralized and too dependent on a specific region (us-east-1) and a specific service (DynamoDB). This has been observed for well over 10 years. Why do they stay with this centralized architecture? Cloud services need much higher standards than the average corporation. Just look at how they took down 2000+ services for many hours.
And to be clear, I'm not at all arguing for the monopolisation of cloud providers, only stating that it's easy to point from far away and say 'This is bad' while simultaneously not doing anything to understand the cost and make that change that you say is important, because it's actually costly (in many dimensions) to do.
We always used to talk a lot about minimising blast radius and there’s been enough time, and enough scale, to fix it.
Nevertheless the Guardian’s choice to label self-promoting policy wonks as “experts” is a cringe-inducing reminder that journalists don’t know anything about anything.
> Dr Corinne Cath-Speth, the head of digital at human rights organisation Article 19
Dr. Cath-Speth has a PhD in cultural anthropology
> Cori Crider, the executive director of the Future of Technology Institute
A lawyer
> Madeline Carr, professor of global politics and cybersecurity at University College London
A professor. Her bio doesn't say what her degree is in, but she mostly seems to publish in political science and international relations
So, not a single technical expert. Not anyone who has ever run a hosting service before or even worked for one. Just people who write papers and sit around waiting for journalists to call them for quotes.
The headline is misleading because when there is news about experts saying something about technology, one would naturally think that they are at least somewhat technical experts. Instead the "expert" is the director of the "Big Tech is Bad Institute" who says that "Big Tech is Bad". And their qualification of being an expert is solely that they are director of the "Big Tech is Bad Institute".
What are they going to say that’s useful for making concrete technical decisions?
They can advise on how to write contracts for dealing with these situations after the fact, I suppose.
Yeah, that's completely fair. My angle was more that firstly this doesn't come across as an opinion that needs the expert in question, and secondly this is yet another case of 'Talk is cheap, show me the code', particularly when quotes in the article include "We urgently need diversification in cloud computing."
I feel like the 'We' is doing an awful lot of heavy lifting and there's no mention of the costs of taking on such a task.
Additionally, and awkwardly, it's possible to be both a monopoly in the space but also technically a more stable solution, making the cost for competitors or people willing to use competitors doubly high.
If your margins are so tight that 16 hours of downtime will bankrupt you then I think either: a) I have no idea how to run a business; or b) you have no idea how to run a business. I'm also biased because I love highly fault-tolerant, geo-redundant, durable systems much more than "good enough for this KPI".
On a serious note, resiliency takes effort and investment no matter where you host your content.
I sincerely hope that the base functionality of these doorbells (i.e., triggering the ringing of the bell within the home) is preserved in the event of an internet outage.
Because I think it's very much the same way as it is with Cloudflare - while the large vendors aren't always openly hostile, we can just smile and hope that they don't get too keen on reminding us that they're holding us hostage.
I don't see that changing anytime soon. I've personally also used Hetzner, Contabo, Scaleway, Vultr, DigitalOcean, Time4VPS and some other platforms, but when people couple their setups to CF/AWS/GCP/Azure, typically that coupling is hard to get rid of and doing so is hard to justify.
Lots of businesses will be completely forgotten as having had an outage today, because all of their customers were dealing with their own outages, and with outages in dozens of other providers.
Obviously, that doesn't fly for everyone.
They went and bought a bunch of literal servers and installed them in a datacenter, 90 miles away from our offices, and this is where all our applications ran for the remainder of that company's existence (about 6 more years). For the whole time I was at that company, we had somewhat more, and usually more lengthy, outages than the average startup. The only difference is that when some piece of networking gear took a crap, or a disk failed, or whatever, our guys had to diagnose and resolve it (Their karma, I guess, since this was their idea).
Anyway, I do think it would be good if at least so-called 'tech companies' were a little less obsessed with outsourcing everything -- even easy things -- to AWS, GCP, and Azure. I feel that way mainly for cost reasons, as many of these services are wildly overpriced. But we also shouldn't kid ourselves by ignoring the advantages of operating at the scale those guys do. They can afford to have multiple absolute wizards available around the clock who make sure that when a problem happens, it's not the kind of "S-show" we had at my old company, where we're all in a Slack room or Zoom or whatever, just guessing at what to try, for half an hour before we can figure out what the actual issue is.
It's someone else's problem.
The real issue is that business pricks will cut costs and single-homing in a single availability zone will be the only workable solution.
On top of that, infrastructure ops are seen as a nuisance who get in the way of the sexy stuff like shipping your latest code changes now. If you complicate the ops pipeline that gets in the way of sexy dev work. So fuck that just ship lol!
"October 17, 2025, was my last day at Amazon Web Services... CloudFront is a CDN, a content delivery network, or, simply put, a large distributed cache for your cat photos. And a very successful one. Something like 30% of all internet traffic goes through CloudFront in one way or another. Pretty cool, huh? In practice, this means that with any change, you have a chance of crashing 30% of the internet."
No. No it's not. But tech enthusiasts on HN and Reddit love it.
(Another 30% runs through Cloudflare.)