The long-term app model in the market is shifting much more towards buying services vs renting infrastructure. This is where the AWS case falls apart, with folks now buying PlanetScale vs RDS, buying Databricks over the mess that AWS puts forward for data lakes, working with model providers directly vs the headaches of Bedrock. The real long-term threat is that AWS continues to whiff on all the other stuff and gets reduced to a boring rent-a-server shop that market forces will drive to very low margins.
Yes, a lot of those 3rd party services will run on AWS, but the future looks like folks renting servers from AWS at 7% gross margin and selling their value-add service on top at 60% gross margin.
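To put rough numbers on that split, here is a back-of-the-envelope sketch in Python. It assumes the AWS bill is the service's only cost of goods sold, which is a simplification, and the function name is made up for illustration.

```python
def gross_profit_split(aws_bill: float,
                       aws_margin: float = 0.07,
                       service_margin: float = 0.60):
    """Split gross profit between AWS and a service built on top of it.

    Assumption: the AWS bill is the service's entire cost of goods sold.
    Returns (aws_gross_profit, service_revenue, service_gross_profit).
    """
    aws_profit = aws_bill * aws_margin
    # A 60% gross margin means COGS is 40% of revenue, so revenue = COGS / 0.4
    service_revenue = aws_bill / (1 - service_margin)
    service_profit = service_revenue - aws_bill
    return aws_profit, service_revenue, service_profit


aws_profit, revenue, service_profit = gross_profit_split(100.0)
print(f"AWS keeps ~${aws_profit:.0f}, the service keeps ~${service_profit:.0f} "
      f"on ~${revenue:.0f} of revenue")
```

Under those assumptions, for every $100 billed by AWS, AWS keeps about $7 of gross profit while the service on top keeps about $150.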
What people forget about the OVH or Hetzner comparison is that the entry servers they are known for (think the Advance line with OVH or AX with Hetzner) come with some drawbacks.
The OVH Advance line for example comes without ECC memory, in a server, that might host databases. It's a disaster waiting to happen. There is no option to add ECC memory with the Advance line, so you have to use Scale or High Grade servers, which are far from "affordable".
Hetzner by default comes with a single PSU and a single uplink. Yes, if nothing happens this is probably fine, but if you need a reliable private network or 10G this will cost extra.
But imo, systems like these (like the ones handling bank transactions) should have a degree of resiliency to this kind of failure, as any hw or sw problem can cause something similar.
For a startup with one rack in each of two data centers, it’s probably fine. You’ll end up testing failover a bit more, but you’ll need that if you scale anyway.
If it’s for some back office thing that will never have any load, and must not permanently fail (eg payroll), maybe just slap it on an EC2 VM and enable off-site backup / ransomware protection.
reference : https://news.ycombinator.com/item?id=38294569
> We have 730+ days with 99.993% measured availability and we also escaped AWS region wide downtime that happened a week ago.
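For a sense of scale, the quoted figure can be turned into a downtime budget. A quick Python sketch, assuming the 99.993% applies to exactly 730 days (the function name is just for illustration):

```python
def downtime_minutes(availability_pct: float, days: float) -> float:
    """Total minutes of downtime implied by an availability percentage over a period."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# 99.993% over 730 days works out to a bit over an hour of downtime in total
print(f"{downtime_minutes(99.993, 730):.1f} minutes")
```

That is roughly 74 minutes of downtime spread over two years.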
This is a very nice brag. Given they are using CloudFlare for their DDoS-protected ingress, there is that dependency, but in that case I can 100% agree that DNS and ingress can absolutely be a full time job. Running some microservices and a database absolutely is not. If your teams are constantly monitoring and adjusting them, such as scaling, then the problem is the design. Not the hosting.
Unless you're a small company serving up billions of heavy requests an hour, I would put money on the bet AWS is overcharging you.
Technically? Totally doable. But the owners prefer renting in the cloud over the people-related issues of hiring.
You don't need to hire dedicated people full time. It could even be outsourced and then a small contract for maintenance.
It's the same argument you could say for "accounting persons", or "HR persons" - "We are a software organisation!" - Personally I don't buy the argument.
Yeah, those people we outsourced to happen to work at AWS.
Pay some "devops" folks, then underfund them and give them a mandate of all ops but with fewer people, and also you need to manage the constant churn of AWS services and then deal with normal outages and dumb dev things.
These people exist, but we have far more stupid "admins" around here
When you are not in the infrastructure business (I work in retail at the moment), the public cloud is the sane way to go (which is sad, but anyway)
I’ve been working at a place for a long time and we have our own data centers. Recently there has been a push to move to the public cloud and we were told to go through AWS training. It seems like the first thing AWS does in its training is spend a considerable amount of time on selling their model. As an employee who works in infrastructure, hearing Amazon sell so hard that the company doesn’t need me anymore is not exactly inspiring.
After that section they seem to spend a considerable amount of time on how to control costs. These are things no one really thinks about currently, as we manage our own infra. If I want to spin up a VM and write a bunch of data to it, no one really cares. The capacity already exists and is paid for, adding a VM here or there is inconsequential. In AWS I assume we’ll eventually need to have a business justification for every instance we stand up. Some servers I run today have value, but it would be impossible to financially justify in any real terms when running in AWS where everything has a very real cost assigned to it. What I do is too detached from profit generation, and the money we could save is mostly theoretical, until something happens. I don’t know how this will play out, but I’m not excited for it.
The AWS mandatory training I did in the past was 100% marketing of their own solutions, and tests are even designed to make you memorize their entire product line.
The first two levels are not designed for engineers: they're designed for "internal salespeople". Even Product Managers were taking the certification, so they would be able to recommend AWS products to their teams.
I do not miss that crap
Who does, then? Even with automatic updates, one can assume some level of maintenance is required for long-term deployments.
Don’t get me wrong, I love running stuff bare metal for my side projects, but scaling is difficult without any ops.
It wasn't as simple as that then, and it's still not as simple as that now.
It’s become polarised (as everything seems to).
I’ve specced bare metal, I’ve specced AWS; which one is used is entirely a matter of the problem/costs and relative trade-offs.
That is all it is.
Clients that use cloud consistently end up spending more on devops resources, because their setups tend to be vastly more complex and involve more people.
The biggest ops teams I worked alongside were always dedicated to running AWS setups. The slowest too were dedicated to AWS. Proportionally, I mean, of course.
People here are comparing the worst possible of Bare Metal with "hosting my startup on AWS".
I wish I could come up with some kind of formalization of this issue. I think it has something to do with communication explosions across multiple people.
Don't make perfect the enemy of the good.
Right, doesn't that include figuring out the right and best way of running it, regardless if it runs on client machines or deployed on servers?
At least I take "software engineering" to mean the full end-to-end process, from "Figure out the right thing to build" to "runs great wherever it's meant to run". I'm not a monkey that builds software on my machine and then hands it off to some deployment engineer who doesn't understand what they're deploying. If I'm building server software, part of my job is ensuring it's deployed in the right environment and runs perfectly there too.
With AWS I think this tradeoff is very weak in most cases: the tasks that you are paying AWS for are relatively cheap in time-of-people-in-your-org, and AWS also takes up a significant amount of that time with new tasks as well. Of the organisations I'm personally aware of, the ones who hosted on-prem spent less money on their compute and had smaller teams managing it, with more effective results than those who were cloud-based (to various degrees of egregiousness, from 'well, I can kinda see how it's worth it because they're growing quickly' to 'holy shit they're setting money on fire and compromising their product because they can't just buy some used tower PCs and plug them in in a closet in the office').
It's also that the requirements vary a lot, discussions here on HN often seem to assume that you need HA and lots of scaling options. That isn't universally true.
This applies only if you had an extra customer that pays the difference. Basically the argument only holds if you can’t take on more customers because upkeep of the infrastructure takes too much time, or if you need to hire an extra person who costs more than the AWS bill difference.
Funny how our perceptions differ. I seem to mostly see people saying all you need is a cheap Hetzner instance and postgres to solve all technical problems. We clearly all have different working environments and requirements. That's why I roll my eyes at the suggestions in threads I see of going all in on colo. My last two major cloud migrations were due to colo facilities shutting down. They were getting kicked out and had a deadline. In one of the cases, the company I was working with was the second largest client at the colo, but when the largest client decided to pull out, the owners decided the economics of running the datacenter didn't make sense to them anymore. Switching colo facilities when you have a few servers isn't a big deal. It's annoying but manageable. When you have hundreds to thousands of servers, it becomes a major operational risk and is enormously disruptive to business as usual.
("Shall we make the app very resilient to failure? Yes running on multiple regions makes the AWS bill bigger but you'll get much fewer outages, look at all this technobabble that proves it")
And of course AWS lock-in services are priced to look cheaper compared to their overpricing of standard stuff[1] - if you just spend the engineering effort and IaC coding effort to move onto them, this "savings" can be put to more AWS cloud engineering effort which again makes your cloud eng org bigger and more important.
[1] (For example moving your app off containers to Lambda, or the db off PostgreSQL to DynamoDB etc)
I don't think it is easy. I see most organizations struggle with the fact that everything is throttled in the cloud. CPU, storage, network. Tenants often discover large amounts of activity they were previously unaware of, that contributes to the usage and cost. And there may be individuals or teams creating new usages that are grossly impacting their allocation. Did you know there is a setting in MS SQL Server that impacts performance by an order of magnitude when sending/receiving data from the Cloud to your on-premises servers? It's the default in the ORM generated settings.
Then you can start adding in the Cloud value, such as incomprehensible networking diagrams that are probably non-compliant in some way (guess which ones!), and security? What is it?
Sounds interesting, which setting is that?
MARS isn't strictly needed for most things. Some features that require it are ORM (EF) proxies and lazy loading. If you need MARS, there are third party "accelerators" that work around this madness.
"MARS Acceleration significantly improves the performance of connections that use the Multiple Active Result Sets (MARS) connection option."
https://documentation.nitrosphere.com/resources/release-note...
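If you don't need MARS, turning it off is a one-keyword change in the connection string. `MultipleActiveResultSets` is the SqlClient connection string keyword; everything else below (helper name, server and database names) is made up for illustration. A minimal Python sketch that rewrites such a string, assuming simple `key=value;` pairs with no quoted semicolons inside values:

```python
def disable_mars(conn_str: str) -> str:
    """Return the connection string with MultipleActiveResultSets forced to False.

    Assumption: values contain no quoted ';' characters, so a plain split is enough.
    """
    parts = []
    for part in conn_str.split(";"):
        if not part.strip():
            continue  # skip empty segments such as a trailing ';'
        key, _, _value = part.partition("=")
        if key.strip().lower() == "multipleactiveresultsets":
            continue  # drop any existing MARS setting
        parts.append(part.strip())
    parts.append("MultipleActiveResultSets=False")
    return ";".join(parts)


# Hypothetical server and database names
example = "Server=db.example.internal;Database=app;MultipleActiveResultSets=True;Encrypt=True"
print(disable_mars(example))
```

Whether this is safe depends on the app: as noted above, EF proxies and lazy loading may rely on MARS, so test before flipping it.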
> Did you know there is a setting in MS SQL Server that impacts performance by an order of magnitude when sending/receiving data from the Cloud to your on-premises servers? It's the default in the ORM generated settings
As a Computer Science dude and former C64/Amiga coder in senior management of a large international bank, I saw first hand how costs balloon simply because the bank recreates and replicates its bare metal environment in the cloud.
So increasing costs while nothing changed. Imagine that: fixed resources, no test environments, because virtualisation was out of the equation in the cloud due to policies and SDLC processes. And it goes on: releases on automation? Nope, request per email and attached scan of a paper document as sign-off.
Of course you can buy a Ferrari and use it as a farm tractor. I bet it is possible with a little modification here and there.
Another fact is that lock-in plays a huge role. Once you are in it, no matter what you subscribe to, magically everything suddenly slows down a bit. Since I am a guy who uses a time tracker to test and monitor apps, I could easily draw a line even without utilizing my math background: enforced throttling.
There is a difference between 100, 300 and 500ms for SaaS websites - people without prior knowledge of perceptual psychology feel it but cannot put their finger on it. But since we are in the cloud, suddenly a cloud manager will offer you a speed upgrade - just catered for your needs! Here, have a free trial period of 3 months and experience the difference for your business!
I am a bit opinionated here and really suppose that cloud metrics analysed the bank's traffic and service usage to willingly slow it down in a way only professionals could find out. Have you been promised lightning fast in the first place? No, that's not what the contract says. We fed you with it, but a "normal" speed was agreed upon. It is like getting a Porsche as a rental car for free when you take your VW Beetle to the dealer for a checkup. Hooked, of course. A car is a car after all. How to boil a frog? Slowly.
Of course there will be more sales, and this is the Achilles' heel of every business and of indifferent customers - easy prey.
It is a vicious cycle, almost like taxation. You cannot hide from it, no escape and it is always on the rise.
Many a company was stuck with a datacenter unit that was unresponsive to the company's needs, and people migrated to AWS to avoid dealing with them. This straight out happened in front of my eyes multiple times. At the same time, you also end up in AWS, or even within AWS, using tools that are extremely expensive, because the cost-benefit analysis for the individuals making the decision, who often don't know very much other than what they use right now, is just wrong for the company. The executive on top is often either not much of a technologist or 20 years out of date, so they have no way to discern the quality of their staff. Technical disagreements? They might only know who they like to hang out with, but that's where it ends.
So for path dependent reasons, companies end up making a lot of decisions that in retrospect seem very poor. In startups it often just kills the company. Just don't assume the error is always in one direction.
I'd like to +1 here - it's an understated risk if you've got datacenter-scale workloads. But! You can host a lot of compute on a couple racks nowadays, so IMHO it's a problem only if you're too successful and get complacent. In the datacenter, creative destruction is a must and crucially finance must be made to understand this, or they'll give you budget targets which can only mean ossification.
Like can’t we just give the data center org more money and they can over provision hardware. Or can we not have them use that extra money to rent servers from OVH/Hetzner during the discovery phase to keep things going while we are waiting on things to get sized or arrive?
Or just use Hetzner for major performance at low cost... Their APIs and stuff make it look like it's your datacenter.
If you hire people that are not responsive to your needs, then, sure, that is a problem that will be a problem irrespective of what their pet stack is.
In a large company I worked at, the Ops team that had the keys to AWS was taking literal months to push things to the cloud, causing problems with bonuses and promotions. Security measures were not in place so there were cyberattacks. Passwords of critical services lapsed because they were not paying attention.
At some point it got so bad that the entire team was demoted, lost privileges, and contractors had to jump in. The CTO was almost fired.
It took months to recover and even to get to an acceptable state, because nothing was really documented.
Looking back at various hiring decisions at various levels of organizations, this is probably the single biggest mistake I've made multiple times: hiring specific people using specific technology because we were specifically using that.
You'll end up with a team unwilling to change, because "you hired me for this, even if it's best for the business with something else, this is what I do".
Once I and the organizations shifted our mindset to hiring people who are more flexible, even if they have expertise in one or two specific technologies, they won't put their head in the sand whenever changes come up, and everything became a lot easier.
I'll also tend to look closely at whether people have "gotten stuck" specialising in a single stack. It won't make me turn them down, but it will make me ask extra questions to determine how open they are to alternatives when suitable.
A modern server can be power cycled remotely, can be reinstalled remotely over networked media, can have its console streamed remotely, can have fans etc. checked remotely without access to the OS it's running etc. It's not very different from managing a cloud - any reasonable server hardware has management boards. Even if you rent space in a colo, most of the time you don't need to set foot there other than for an initial setup (and you can rent people to do that too).
But for most people, bare metal will tend to mean renting bare metal servers already configured anyway.
When the first thing you then tend to do is to deploy a container runtime and an orchestrator, you're effectively usually left with something more or less (depending on your needs) like a private cloud.
As for "buying ahead of time", most managed server providers and some colo operators also offer cloud services, so that even if you don't want to deal with a multi-provider setup, you can still generally scale into cloud instances as needed if your provider can't bring new hardware up fast enough (but many managed server providers can do that in less than a day too).
I never think about buying ahead of time. It hasn't been a thing I've had to worry about for a decade or more.
All of this was already possible 20 years ago, with iLO and DRAC cards.
And, let's face it - aren't you already overprovisioning on the cloud because you can't risk your users waiting 1-2 minutes until your new nodes and pods get up? So basically the 'autoscaling' of cloud has always been a myth.
Also there's a mindset difference - if I gave you a server with 32 cores you wouldn't design a microservice system on it, would you? After all there's nowhere to scale to.
But with AWS, you're sold the story of infinite compute you can just expect to be there, but you'll quickly find out just how stingy they can get with giving you more hardware automatically to scale to.
I don't dislike AWS, but I feel this promise of false abundance has driven the growth in complexity and resource use of the backend.
Reality tends to be you hit a bottleneck you have a hard time optimizing away - the more complex your architecture, the harder it is, then you can stew.
This is key.
Most people never scale to a size where they hit that limit, and in most organisations where that happens, someone else has to deal with it, and so most developers are totally unaware of just how fictional the "infinite scalability" actually is.
Yet it gets touted as a critical advantage.
At the same time, most developers have never ever tried to manage modern server hardware, and seem to think it is somewhat like managing the hardware they're using at home.
I kinda feel like this argument could be used against programming in essentially any language. Your company, or you yourself, likely chose to develop using (whatever language it is) because that's what you knew and what your developers knew. Maybe it would have been some percentage more efficient to use another language, but then you and everyone else has to learn it.
It's the same with the cloud vs bare metal, though at least in the cloud, if you're using the right services, if someone asked you tomorrow to scale 100x you likely could during the workday.
And generally speaking if your problem is at a scale where baremetal is trivial to implement, it's likely we're only talking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.
I've never seen a cloud setup where that was true.
For starters: Most cloud providers will impose limits on you that often mean going 100x would involve pleading with account managers to have limits lifted and/or scrounging up a new, previously untested, combination of instance sizes.
But secondly, you'll tend to run into unknown bottlenecks long before that.
And so, in fact, if that is a thing you actually want to be able to do, you need to actually test it.
But it's also generally not a real problem. I more often come across the opposite: Customers who've gotten hit with a crazy bill because of a problem rather than real use.
But it's also easy enough to set up a hybrid setup that will spin up cloud instances if/when you have a genuine need to be able to scale up faster than you can provision new bare metal instances. You'll typically run an orchestrator and run everything in containers on a bare metal setup too, so typically it only requires having an auto-scaling group scaled down to 0, and warm it up if load nears critical level on your bare metal environment, and then flip a switch in your load balancer to start directing traffic there. It's not a complicated thing to do.
Now, incidentally, your bare metal setup is even cheaper because you can get away with a higher load factor when you can scale into cloud to take spikes.
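The decision logic for that kind of burst setup can be very small. A rough Python sketch of the "warm up the scaled-to-0 group when bare metal nears critical load" rule; the 80% threshold, the per-instance capacity, and the function name are all assumptions for illustration, not anything from a specific provider's API:

```python
import math


def cloud_instances_needed(current_load: float,
                           bare_metal_capacity: float,
                           critical_fraction: float = 0.8,
                           instance_capacity: float = 100.0) -> int:
    """How many cloud instances to warm up so bare metal stays under the critical level.

    Units are hypothetical (e.g. requests/second). Returns 0 while the load is
    below the threshold, i.e. the auto-scaling group stays scaled down to 0.
    """
    threshold = bare_metal_capacity * critical_fraction
    overflow = current_load - threshold
    if overflow <= 0:
        return 0
    # Round up: a partially used instance still has to be provisioned
    return math.ceil(overflow / instance_capacity)


for load in (700, 810, 950):
    print(load, cloud_instances_needed(load, bare_metal_capacity=1000))
```

In practice you'd feed the result into whatever sets the desired size of the group, and only flip the load balancer once the instances pass health checks.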
> And generally speaking if your problem is at a scale where baremetal is trivial to implement, it's likely we're only talking about a few hundred dollars a month being 'wasted' in AWS. Which is nothing to most companies, especially when they'd have to consider developer/devops time.
Generally speaking, I only rarely work on systems that cost less than tens of thousands per month, and what I consistently see with my customers is that the higher the cost, the bigger the bare-metal advantage tends to be, as it allows you to readily amortise the initial setup costs of more streamlined/advanced setups. The few places where cloud wins on cost are the very smallest systems, typically <$5k/month.
"The right services" is I think doing a lot of work here. Which services specifically are you thinking of?
- S3? sure, 100x, 1000x, whatever, it doesn't care about your scale at all (your bill is another matter).
- Lambdas? On their own sure you can scale arbitrarily, but they don't really do anything unless they're connected to other stuff both upstream and downstream. Can those services manage 100x the load?
- Managed K8s? Managed DBs? EC2 instances? Really anything where you need to think about networking? Nope, you are not scaling this 100x without a LOT of planning and prep work.
You're not getting a 100x increase in instances without justifying it to your account manager, anyway, long before you figure out how to get it to work.
EC2 has limits on the number of instances you can request, and it certainly won't let you 100x unless you've done it before and already gone through the hassle to get them to raise your limits.
On top of that, it is not unusual to hit availability issues with less common instance types. Been there, done that, had to provision several different instance types to get enough.
Let me go on a tangent about trains. In Spain, before you board a high-speed train you need to go through a full security check, like at an airport. In all other EU countries you just show up and board, but in Spain there's the security check. The problem is that even though the security check is expensive, inefficient theatre, just in case something does blow up, nobody wants to be the politician that removed the security check. There will be no reward for a politician that makes life marginally easier for lots of people, but there will be severe punishment for a politician that is involved in a potential terrorist attack, even if the chance of that happening is ridiculously small.
This is exactly why so many companies love to be balls deep into AWS ecosystem, even if it's expensive.
Just for curiosity's sake, did any other EU countries have any recent terrorist attacks involving bombs on trains in the capital, or is Spain so far alone with this experience?
Edit: Also, after looking it up, it seems like London did add temporary security scanners at some locations in the wake of those bombings, although they weren't permanent.
Russia is the only other European country besides Spain that after train bombings added permanent security scanners. Belgium, France and a bunch of other countries have had train bombings, but none of them added permanent scanners like Spain or Russia did.
You can pay for EC2+EBS+network costs, or you can have a fancy cloud native solution where you pay for Lambda, ALBs, CloudWatch, Metrics, Secrets Manager (things you would assume they'd just give you: if you eat at a restaurant, you don't expect to pay for the parking, the toilet, or rent for the table and seats).
So cloud billing is its own science and art - and in most orgs devs don't even know how much the stuff they're building costs, until finance people start complaining about the monthly bills.
Because it was mostly fine at first, but later we had some close calls when there were changes that needed to be made on the servers. By the time we managed to mess up our hand managed incremental restart process, we had several layers of cache and so accidentally wiping one didn’t murder our backend, but did throw enough alerts to cause a P2. And because we were doing manual bucketing of caches instead of consistent hashing we hit the OOMKiller a couple times while dialing in.
But at this point it was difficult to move back to managed.
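For anyone who hasn't hit this: the point of consistent hashing is that adding or removing a cache node only moves the keys that land on that node, instead of reshuffling nearly everything. A minimal Python sketch, assuming "manual bucketing" behaves like plain modulo bucketing (that is my assumption, as are the node names and vnode count):

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # md5 is used only for its even spread, not for security
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed on the ring at `vnodes` pseudo-random points
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"key-{i}" for i in range(10_000)]
before = HashRing(["cache-a", "cache-b", "cache-c"])
after = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])

# Keys whose owner changes when a fourth node is added
moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
# Same change with naive modulo bucketing (3 buckets -> 4 buckets)
modulo_moved = sum(_hash(k) % 3 != _hash(k) % 4 for k in keys)
print(f"consistent hashing moved {moved} keys, modulo bucketing moved {modulo_moved}")
```

With the ring, every key that moves goes to the new node, so the existing caches keep most of their contents and memory use on each box changes gradually rather than all at once.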
This feels closest to digital ocean’s business model.
Engineering managers are promised cost savings on the HR level. Corporate finance managers are promised an OpEx for CapEx trade-off, so the books look better immediately. Cloud engineers are embarking on their AWS journey of certification, being promised an uptick to their salaries. It’s a win/win for everyone, in isolation, a local optimum for everyone, but the organization now has to pay way more than it, hypothetically, would have been paying for bare metal ops. And hypothetical arguments are futile.
And it lends itself well to overengineering and the microservices cargo cult. Your company ends up with a system distributed around the globe across multiple AZs per region of business operations, striving to shave off those 100ms latency off your clients’ RTT. But it’s outgrown your comprehension, and it’s slow anyway, and you can’t scale up because it’s expensive. And instead of having one problem, you now have 99 and your bill is one.
So it is not like one can dazzle decision makers with any logic or hard data. They are just announcing the decision while calling it a robust discussion over pros and cons of on-prem vs cloud placement.
It’s really disturbing how the human factor controls decision making in corporations.
For my peace of mind, I chose a sane path - if the company as an entity decides to do AWS, I will do my best to meet its goals. I’ve got all Professional and Specialty certs. It’s the human nature. No purpose in tilting at windmills.
The consequence of running a database poorly is lost data.
At the end of the day they're all just processes on a machine somewhere, none of it is particularly difficult, but storing, protecting, and traversing state is pretty much _the_ job and I can't really see how you'd think ingress and DNS would be more work than the datastores done right.
Now with AWS, I have a SaaS that makes 6 figures and the AWS bill is <$1000 a month. I'm entirely capable of doing this on-prem, but the vast majority of the bill is s3 state, so what we're actually talking about is me being on-call for an object store and a database, and the potential consequences of doing so.
With all that said, there's definitely a price point and staffing point where I will consider doing that, and I'm pretty down for the whole on-prem movement generally.
That's the sweet spot for AWS customers. Not so much for AWS.
The key thing for AWS is trying to get you locked in by "helping you" depend on services that are hard to replicate elsewhere, so that if your costs grow to a point where moving elsewhere is worth it, it's hard for you to do so.
The biggest difficulty in eating into AWS market share is that believing it is cheap has become religion.
Perfect example - MSK. The brokers are config locked at certain partition counts, even if your CPU is 5%. But their MSK replicator is capped on topic count. So now I have to work around topic counts at the cluster level, and partition counts at the broker level. Neither of which are inherent limits in the underlying technologies (kafka and mirrormaker)
It's a way to "commoditize" engineers. You can run on premise or mixed infra better and cheaper, but only if you know what you are doing. This requires experienced guys and doesn't work with new grads hired by big cons and sold as "cloud experts".
(I do that for people; my AWS using customers consistently end up needing more help)
but somehow that is never a problem.
When everyone is suffering because AWS is having its bi-yearly 8 hour outage, the CTO isn't blamed, bonus all round, and maybe the AWS sales team takes him for an apology lunch
When the CTO is up for 1500 days straight then has a 2 hour downtime when nobody else does, the CTO is blamed, no bonus, and more likely to get fired
IBM is older, and it's incredibly well documented how mainframes are more expensive to run than normal servers.
this is the main advantage of cloud, no one cares if the site/service/app is down as long as it's someone else's fault and responsibility.
If not, you're at the mercy of others.
I'm not. It seems to be happening a lot. Any time a topic about not using AWS comes up here or on Reddit, there is a sudden surge of people appearing out of nowhere shouting down anyone who suggests other options. It's honestly starting to feel like paid shilling.
The cloud stuff is extremely expensive and doesn't work any better than our existing solutions. Like a commentator said below, it's insidious as your entire organization later becomes dependent on that. If you buy a cloud solution, you're also stuck with the vendor deciding to double the cost of the product once you're locked in.
The GPU stuff is annoying as all of our needs are fine with normal CPU workloads today. There are no performance issues, so again...what's the point? Well... somebody wants to play with GPUs I guess.
AWS/Azure/GCP is great, but like any tool or platform you need to do some financial/process engineering to make an optimal choice. For small companies, time to market is often key, hence AWS.
Once you’re a little bigger, you may develop frameworks to operate efficiently. I have apps that I run in a data center because they’d cost 10-20x at a cloud provider. Conversely, I have apps that get more favorable licensing terms in AWS that I run there, even though the compute is slower and less efficient.
You also have people who treat AWS with the old “nobody gets fired for buying IBM” mentality.
I imagine a lot of people who use Linux/AWS now started out with bare metal Microsoft/VMWare/Oracle type of environments where AWS services seemed like a massive breath of fresh air.
Having an ability to spin up a server or a vm when you need it without having to ask a single question is very liberating. Sometimes such elasticity is exactly what's needed. OTOH other people's servers aren't always the wise choice, but you have to know both environments to make the right choice, and nowadays I feel most people don't really know anything about bare metal.
Luckily, Amazon is far from the only VM provider out there, so this discussion doesn't need to be polarized between "AWS everything" and "on-premise everything". You can rent VMs elsewhere for a fraction of the cost. There are many places that will rent you bare metal servers by the hour, just as if they were VMs. You can even mix VMs and bare metal servers in the same datacenter.
If anything it enables a hybrid environment
Look at what Amazon/Google/Microsoft does. If you told me you advocate running your own power plants, I'd eyeroll. But... if you're as large a power consumer as a hyper-scaler, totally different story. Google and Microsoft are investing in lighting up old nuclear plants.
Talos OS looks really interesting. But I also need the storage parts, networking parts, etc.
Beyond public cloud being bad for the planet, I also hate that it drains companies of money, centralizes everyone's risk, and helps to entrench Amazon as yet another tech oligarchic fiefdom. For most people, these things just don't matter apparently.
Similar here, I think. I got into Computer Science because I liked software... the way it was. Now I truly think that most software completely sucks.
The thing is that it has grown so much since then, that most developers come from a different angle.
Do what works best for your situation.
Migrating to lower cost options thereafter when scaling is prudent, but you "build one to throw away", as it were.
the companies selling Cloud are also massive IT giants with unlimited compute resources and extensive online marketing operations.
like of fucking course they're using shillbots, they run the backend shillbot infrastructure.
they literally have LLM chatbot agents as an offering, and it's trivially easy to create fake users and repost / retweet last week's comments to create realistic-looking accounts, which then shill hard for whatever their goals are.
If you can't build a positive business case then it's not the correct move. Cash is king. Sadly.
A lot of the discussion here is that the cost of the in-house team is less than people think.
For instance: at a former gig, we used a service in the EU that handled weekends, holidays and night time issues and escalated to our team as needed. It was pretty cheap, approximately $10K monthly fee for availability and hourly rate when there were any issues to be resolved. There were a few mornings I had an email with a post-mortem report and an invoice for a hundred euros or so. We came pretty close to 5 9's uptime but we didn't have to worry about SLA's or anything.
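To put "5 9's" in perspective, here is a minimal sketch that converts an availability percentage into a yearly downtime budget. It assumes a 365-day year, and the function name is purely illustrative.

```python
# Rough conversion from an availability percentage to allowed downtime.
# Assumes a 365-day year; the function name is illustrative only.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the downtime budget per year, in seconds."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("two nines", 99.0), ("three nines", 99.9),
                       ("five nines", 99.999)]:
        minutes = allowed_downtime_seconds(pct) / 60
        print(f"{label} ({pct}%): about {minutes:.1f} minutes per year")
```

Under these assumptions five nines leaves a little over five minutes of downtime per year, while two nines leaves roughly 3.65 days.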
Well, well, they have a whole team doing "devops administration" on AWS and require extra people. So not having the money for an in-house team ... no AWS for you.
I've worked for 2 large-ish firms in the past 3 years. One huge telco, one "medium" telco (still 100s of people). BOTH had a team just for AWS IAM administration. Only for that one thing, because that was company-wide (and was regularly demonstrated to be a single point of failure). And they had AWS administrator teams, yes teams, for every department (even HR had one, though in the medium telco all management had a shared team, but the networking and development departments still had their own AWS teams, who, btw, also did IAM. The company-wide IAM team maintained an AWS IAM and some solution they'd bought that also worked for their Windows domain and ticketing system (I hate you IBM remedy), and equipment ordering portal and ...)
AND there were "devops" positions on every development team, and on the network engineering team, and even a small one for the building "technics" team.
Oh and they both had an internal cluster on top of AWS, part on-premise, part rented DC space, which did at least half the compute work (but presumably a lot less of the weird edge-cases), that one ran the company services that are just insane on AWS like any kind of video.
they sell "you don't need a team"... which is true in your prototype and mvp phase. and you know when you grow you will have an ops team and maybe move out.
but in the very long middle time... you will be supporting clients and sla etc, and will end up paying both aws AND an ops team without even realizing.
We should coin the term "Cloud Learned Helplessness"
Having done consulting in this space for a decade, and worked with containerised systems since before AWS existed, my experience is that managing an AWS system is consistently more expensive and that in fact the devops cost is part of what makes AWS an expensive option.
That said, I've seen real world scenarios where complexity is up the wazoo and an opex cost focus means you're hiring under skilled staff to manage offerings built on components with low sticker prices. Throw in a bit of the old NIH mindset (DIY all the things!) and it's large blast radii with expensive service credits being dished out to customers regularly. On a human factors front your team will be seeing countless middle of the night conference calls.
While I'm not 100% happy with the AWS/Azure/GCP world, the reality is that on-prem skillsets are becoming rarer and more specialist. Hiring good people can be either really expensive or a bit of a unicorn hunt.
Most AWS-only Ops engineers I know are making bank and in high demand, and Ops teams are always HUGE in terms of headcount outside of startups.
The "AWS is cheaper" thing is the biggest grift in our industry.
You can easily get your service up by asking claude code or whatever to just do it
It produces aws yaml that’s better than many devops people I’ve worked with. In other words, it absolutely should not be trusted with trivial tasks, but you could easily blow $100K’s per year for worse.
So I think for developers that have deep experience with systems LLMs are great -- I did a huge migration in a few weeks that probably would have taken many months or even half a year before. But I worry that people that don't really know what's going on will end up with a horrible mess of infra code.
The time it takes to have a script ready has decreased dramatically in the last 3 years. The number of problems when deploying the first time has also increased in the same period.
The difference between the ones who actually know what they're doing and the ones who don't is whether they will refactor and test.
After being fully in the cloud for some time, we’re moving to hybrid solutions. Upper management is happy with the costs and the cloud engineers have new toys.
2. niche, bespoke domain primarily occupied by companies looking to cut costs
At the same time, the incredible complexity of the software infrastructure is making specialists more and more useless. To the point that almost every successful specialist out there is just some disguised generalist that decided to focus their presentation in a single area.
What?
I throw up in my mouth every time I see "full stack" in a job listing.
We got rid of roles... DBA's, QA teams, Sysadmins, then front and back end. Full Stack is the "webmaster" of the modern era. It might mean front and back end, it might mean sysadmin and DBA as well.
> We got rid of roles... DBA's, QA teams, Sysadmins, then front and back end.
On a first approximation, those roles were all wrong. If your people don't wear many of those hats at the same time, they won't be able to create software.
But yeah, we did get rid of roles. And still require people to be specialized to the point it's close to impossible to match the requirements of a random job.
As mentioned below, never labeled "full stack", never plan on it. "Generalist" is what my actual title became back in the mid 2000s. My career has been all over the place... the key is being stubborn when confronted with challenges and being able to scale up (mentally and sometimes physically) to meet the needs, when needed. And chill out when it's not.
It's not even limited to sysadmins, or in tech. How do you know whether a mechanic is very good, or iffy? Is a financial advisor giving you good advice, or basically robbing you? It's not as if many companies are going to hire 4 business units worth of on prem admins, and then decide which one does better after running for 3 years, or something empirical like that. You might be the poor sob that hires the very expensive, yet incompetent and out of date specialist, whose only remaining good skill is selling confidence to employers.
Of course but unless I misunderstood what you meant to say, you don't escape that by buying from AWS. It's just that instead of "sysadmin specialists" you need "AWS specialists".
If you want to outsource the job then you need to go up at least 1 more layer of abstraction (and likely an order of magnitude in price) and buy fully managed services.
The most frustrating part of hyperscalers is that it's so easy to make mistakes. Active tracking of your bill is a must, but the data is 24-48h late in some cases. So a single engineer can cause 5-figure regrettable spend very quickly.
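As a back-of-the-envelope illustration of how reporting lag turns into 5-figure spend, here is a small sketch. The hourly cost is a made-up number and the function names are hypothetical; only the 24-48h lag comes from the comment above.

```python
def unseen_spend(hourly_cost_usd: float, reporting_lag_hours: float) -> float:
    """Spend that accrues before it first shows up in billing data."""
    if hourly_cost_usd < 0 or reporting_lag_hours < 0:
        raise ValueError("inputs must be non-negative")
    return hourly_cost_usd * reporting_lag_hours


def hourly_rate_for(target_spend_usd: float, reporting_lag_hours: float) -> float:
    """Hourly burn rate needed to reach a given spend within the lag window."""
    if reporting_lag_hours <= 0:
        raise ValueError("lag must be positive")
    return target_spend_usd / reporting_lag_hours


if __name__ == "__main__":
    # Hypothetical: a forgotten fleet costing $250/hour, noticed after 48 hours.
    print(unseen_spend(250, 48))
    # Burn rate that reaches $10,000 before a 48-hour-late report shows it.
    print(round(hourly_rate_for(10_000, 48), 2))
```

With these assumed numbers, a bit over $200 per hour of accidental resources is enough to cross $10,000 before the bill reflects it.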
It's easier than ever to do this but people are doing it less and less.
Much easier to find. Even more, they are skills much easier to learn for existing engineers. What's better, they are fundamental skills that will never lose their value as those systems are what everything else is built on.
helm repo add mariadb-operator https://mariadb-operator.github.io/mariadb-operator
helm install mariadb-operator mariadb-operator/mariadb-operator
And then you can just provision a MariaDB "kind", i.e. you kubectl apply with something specifying database name, maximum memory, type of high availability (single primary or multimaster) and secret reference and there you go: new database, ready to be plugged into other pods.
I see so many series B+ companies running DB and storage without a care in the world.
Running mysqldump to a usb disk in the office once a day is pretty cheap.
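A minimal sketch of what such a daily dump could look like, meant to be run from cron or a systemd timer. The backup directory `/mnt/usb-backup`, the file naming, and the function names are assumptions for illustration; credentials are assumed to come from a MySQL option file such as `~/.my.cnf`.

```python
import subprocess
from datetime import date
from pathlib import Path


def build_dump_command(backup_dir: Path, day: date) -> list[str]:
    """Build a mysqldump command that writes one dated file per day."""
    target = backup_dir / f"all-databases-{day.isoformat()}.sql"
    return [
        "mysqldump",
        # Consistent snapshot without locking tables (InnoDB tables only).
        "--single-transaction",
        "--all-databases",
        f"--result-file={target}",
    ]


def run_daily_dump(backup_dir: Path) -> None:
    """Run the dump; fail loudly if the disk is not mounted or the dump fails."""
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"backup directory missing: {backup_dir}")
    subprocess.run(build_dump_command(backup_dir, date.today()), check=True)


if __name__ == "__main__":
    # Hypothetical mount point for the USB disk.
    run_daily_dump(Path("/mnt/usb-backup"))
```

Checking that the directory exists first avoids silently writing dumps to the root filesystem when the USB disk is not mounted. A dump that has never been restored is not really a backup, so a periodic test restore is worth adding.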
I expect most, if not 99%, of all businesses can cope with a hardware failure and the associated downtime while restoring to a different server, judging from the impact of the recent AWS outage and the collective shrug in response. With a proper RAID setup, data loss should be quite rare; if more is required, a primary + secondary setup with a manual failover isn't hard.
Basic async logical replication in MySQL/MariaDB is extremely easy to set up, literally just a few commands to type.
Ditto for doing failover manually the rare times it is needed. Sure, you'll have a few minutes of downtime until a human can respond to the "db is down" alert and initiates failover, but that's tolerable for many small to medium sized businesses with relatively small databases.
That approach was extremely common ~10-15 years ago, and online businesses didn't have much worse availability than they do today.
Moving away is an existential issue for them - this is why there's such pushback. A huge % of new developer and devops generation doesn't know anything about deploying software on bare metal or even other clouds and they're terrified about being unemployed.
I see more comments in favor than pushing back.
The problem I have with these stories is the confirmation bias that comes with them. Going self-hosted or on-premises does make sense in some carefully selected use cases, but I have dozens of stories of startup teams spinning their wheels with self-hosting strategies that turn into a big waste of time and headcount that they should have been using to grow their businesses instead.
The shared theme of all of the failure stories is missing the true cost of self-hosting: The hours spent getting the servers just right, managing the hosting, debating the best way to run things, and dealing with little issues add up but are easily lost in the noise if you’re not looking closely. Everyone goes through a honeymoon phase where the servers arrive and your software is up and running and you’re busy patting yourselves on the back about how you’re saving money. The real test comes 12 months later when the person who last set up the servers has left for a new job and the team is trying to do forensics to understand why the documentation they wrote doesn’t actually match what’s happening on the servers, or your project managers look back at the sprints and realize that the average time spent on self-hosting related tasks and ideas has added up to a lot more than anyone would have guessed.
Those stories aren’t shared as often. When they are, they’re not upvoted. A lot of people in my local startup scene have sheepish stories about how they finally threw in the towel on self-hosting and went to AWS and got back to focusing on their core product. Few people are writing blog posts about that because it’s not a story people want to hear. We like the heroic stories where someone sets up some servers and everything just works perfectly and there are no downsides.
You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
What the modern software business seems to have lost is the understanding that ops and dev are two different universes. DevOps was a reaction to the fact that even outsourcing ops to AWS doesn’t entirely solve all of your ops problems and the role is absolutely no substitute for a systems administrator. Having someone that helps derive the requirements for your infrastructure, then designs it, builds it, backs it up, maintains it, troubleshoots it, monitors performance, determines appropriate redundancy, etc. etc. etc. and then tells the developers how to work with it is the missing link. Hit-by-a-bus documentation, support and update procedures, security incident response… these are all problems we solved a long time ago, but sort of forgot about moving everything to cloud architecture.
This is a fascinating take, if you ask me, treating them as separate is the whole problem!
The point of being an engineer is to solve real world problems, not to live inside your own little specialist world.
Obviously there's a lot to be said for being really good at a specialized set of skills, but that's only relevant to the part where you're actually solving problems.
DevOps, conceptually, goes back to the 90s. I was using the term in 2001. If memory serves, AWS didn't really start to take off until the mid/late aughts, or at least not until they launched S3.
DevOps was a reaction to the software lifecycle problem and didn't have anything to do with AWS. If anything it's the other way around: AWS and cloud hosting gained popularity in part due to DevOps culture.
This is revisionist history. DevOps was a reaction to the fact that many/most software development organizations had a clear separation between "developers" and "sysadmins". Developers' responsibility ended when they compiled an EXE/JAR file/whatever, then they tossed it over the fence to the sysadmins who were responsible for running it. DevOps was the realization that, huh, software works better when the people responsible for building the software ("Dev") are also the same people responsible for keeping it running ("Ops").
Jump to my first "enterprise" job and suddenly I can't fix things anymore. I have to submit tickets to other teams to look at why the thing I built isn't running as expected. That, to me, was pure insanity. The sysadmins knew fuck all about my app and as far as I was concerned barely knew how to admin systems. I knew a lot more in my 20's after all. But the friction of not running what I wrote was absolutely real and one of the main killers of productivity versus my startup days.
I also have seen this from most of the "enterprise" companies that do "DevOps" when really they just mean they have a sysadmin team who uses modern tools and IaC. The same exact friction and issues exist between dev and ops as before DevOps days. Those companies are explicitly doing DevOps wrong. When you look at the troubleshooting steps during an incident, it's identical. Bring in the devs and the ops team so we can figure out what's going on. I do think startups are more likely to get DevOps right because they aren't trying to force it on the only mental model they seem to be able to understand.
I've also found that dev teams who run and maintain their own stacks are better about automatic failure recovery and overall more reliable solutions. Whether that's due to better alignment between the app code and the app stack during development or because the dev team is now the first call when things aren't working I'm not entirely sure. Likely a mix of both.
Funnily enough, the article even affirms this, though most people seemed to have skimmed over it (or not read it at all).
> Cloud-first was the right call for our first five years. Bare metal became the right call once our compute footprint, data gravity, and independence requirements stabilised.
Unless you've got uncommon data egress requirements, if you're worried about optimizing cloud spend instead of growing your business in the first 5 years you're almost certainly focusing on the wrong problem.
> You really need to weigh the tradeoffs, but many people are not equipped to do that. They just think their chosen solution will be perfect and the other side will be the bad one.
This too. Most of the massive AWS savings articles in the past few days have been from companies that do a massive amount of data egress i.e. video transfer, or in this case log data. If your product is sending out multiple terabytes of data monthly, hosting everything on AWS is certainly not the right choice. If your product is a typical n-tier webapp with database, web servers, load balancer, and some static assets, you're going to be wasting tons of time reinventing the wheel when you can spin up everything with redundancy & backups on AWS (or GCP, or Azure) in 30 minutes.
Personally, I would never self-host some B2C or B2B application if you have less than 50 - 100 techies in a healthy org. You can get just too much from a few VMs and/or a few dedicated servers at like Hetzner, OVH, or AWS managed services. At least for the average web rest thingy with a DB and some file storage. I'm sure it's possible to find counter-examples.
On the other hand, we are about 120 devs at work now, couple thousand B2B customers, 10 Platform Ops, 7 HW & DC Ops. I guess we have more ops-people than a startup may have people. Once we get rid of VMWare licensing, our colos are ridiculously cheap when amortized across 5 years compared to AWS or cloud hosting. Once EOL, they'll also reduce cloud-costs on cheaper providers for test systems and provide spontaneous failover and disaster recovery tests.
We're now also getting good cross-team scaling processes going and at this point the big barriers are actually getting enough power and cooling, not buying/racking/maintaining systems. That will be a big price tag next year, but we've not paid that money to AWS the last two years, so it's fine.
As I keep saying internally, self-hosting is like buying a 40 ton excavator, like Large Marge or a 40 ton truck. If you have enough stuff to utilize a 40 ton truck, it's good. If you need to move food around in an urban environment, or need to move an organ transplant between hospitals, a 40 ton truck tends to be rather inefficient and very expensive to maintain and run.
Almost every bare metal success story paints a rosy picture of perfect hardware (which thankfully is often the case), or basic hard failures which are easily dealt with. Disk replacement or swapping 1u compute nodes is expected and you probably have spares on hand. But it's a special feeling to debug the more critical parts that likely don't have idle spares just sitting around. The RAID controller that corrupts its memory, reboots, and rolls back to its previous known-good state. The network equipment that locks up with no explanation. Critical components that worked flawlessly for months or years, then shit the bed, but reboot cleanly.
Of course everyone built a secure management vlan and has remote serial consoles hooked up to all such devices right? Right? Oh good, they captured some garbled symbols. The vendor's first tier of support will surely not be outsourced offshore or read from a script, and will have a quick answer that explains and fixes everything. Right?
The cloud isn't always the right choice, but if you can make it work, it sure is nice to not deal with entire categories of problems when using it.
The problems here are no different than using SaaS anywhere else in a business. You can also run all your sales tracking through Excel; it's just that once you have more than a few people doing sales, that becomes a major bottleneck, the same way that not having an easier-to-manage infrastructure system does.
We planned a migration to move from 4 on-demand instances to one on-prem machine and we guessed we’d save $1000/mo, our builds would be faster and we’d have fewer failures due to capacity issues. We even had a spare workstation and a rack in the office, so the capex was 0.
I plugged the machine into the rack and no internet connectivity. Put in an IT ticket which took 2 days for a reply, only to be told that this was an unauthorised machine and needed to be imaged by IT. The back and forth took 4 weeks, multiple meetings and multiple approvals. My guess is that 4 people spent probably 10 hours arguing whether we should do this or not.
On AWS I can write a python script and have a running windows instance in 15 minutes.
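For readers who have not done this, a script of that kind might look roughly like the sketch below. It uses boto3's `run_instances` call; the AMI ID, region, and instance type are placeholders, and it assumes AWS credentials are already configured in the environment.

```python
def build_run_instances_params(ami_id: str, instance_type: str = "t3.medium") -> dict:
    """Parameters for launching exactly one instance from the given AMI."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch_instance(ami_id: str, region: str) -> str:
    """Launch one instance and return its instance ID."""
    import boto3  # third-party AWS SDK; imported here so the helper above has no dependencies

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**build_run_instances_params(ami_id))
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    # "ami-EXAMPLE" is a placeholder, not a real image; substitute a Windows AMI ID.
    print(launch_instance("ami-EXAMPLE", region="eu-west-1"))
```

A real setup would normally also pass a key pair, security group, and subnet; they are left out here to keep the sketch short.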
But as an example: It took about 3 months to provision an AWS server in a recent company I consulted for due to their own bureaucracy and ineptitude of the Ops team.
On the other hand, when I needed a few CI servers for a startup I worked at, I just collected them from the Apple Store during lunch hour.
Now this above is what people are "forgetting" and don't want to listen to.
But because AWS isn't in the office, it's fine. We could probably use Hetzner or OVH, but then we have to go through procurement which is as much of a hassle as going through IT.
Long term yes you can save money rolling your own.
But with cloud you can get something up and running within maybe a few days, sometimes even faster. Often with built in scalability.
This is a much easier sell to the non-tech (i.e., money) people.
If the project continues, the path of least resistance is often to just continue with the cloud solution. At a certain point, there will be so much tech debt that any long-term savings from traditional on-premises, co-location or managed hosting are vastly outweighed by the cost of migration.
They've never used tools like Ansible (or Anaconda) or been in situations where they couldn't destroy the container and start afresh instantly.
$100-$300 on AWS -> $35/mo for DO + CF. Coincidentally, AWS had an outage soon after, which was avoided thanks to the move.
I have used DO for both clients and myself, and have not had any huge problems with them.
Basic rationalization. People will go to extraordinary lengths to justify and defend the choices they made. It's a defense mechanism: if they spent millions on AWS they are not going to sit idly while HN discusses saving hundreds of thousands with everyone nodding and agreeing. It's important for their own sanity to defend the choice they made.
Easy to push back against what is now the unknown (bare metal), when the layers extending bare metal to cloud service have become better and better, as well as more accessible.
Bigger question: When did people forget that doing that is much easier than AWS...
As far as they are concerned AWS is taking care of computing AND hiring for them.
I've never worked anywhere where at least some sort of power holder wouldn't instantly go to consultants or outsourcing rather than in-house, because they believe that if you work for the company you must be incompetent, dumb or below average. If you don't work for them you must be exceptional.
Same, this trend towards "AWS all the things" has really amazed me.
We've all mocked small companies copying big companies by trying to make their app super-duper scalable from the very start. After all, everyone thinks they are the next Google, despite their 5 total users right now.
But this is really the opposite. AWS is phenomenal for the startup that would readily trade high opex for lower capex. Servers aren't the cheapest things in the world to buy and they depreciate. It makes total sense for startups to start this way.
But why are big companies, with an actual budget for staff, copying the behavior of their favorite startups?
This is an issue for several companies that start small and within 5 years they find the need to expand abroad. Be it for data sovereignty or so, which is becoming more important than ever in the last 10 years.
Duplicating a region is "a few clicks away" on AWS. This is what the provider enables you to do.
This and a lot of other things. And for such things, yes, you gotta pay.
But if you're in a growth/startup phase it doesn't make much sense to spend engineering time on this, not that multi-region setups in AWS are one button either. Once you're past that and paying AWS a million per week or so I think it can make sense to offload expensive services to your own hardware.
1. You are making a product with 3 friends on evenings and you want to ship asap without having the capacity to invest in and set up infrastructure. 2. You are a huge corporation with tens of thousands of employees and hardware needs that you simply cannot source yourself easily or sort out the colocation of the hardware.
Everyone else - get a dozen second-hand servers, shove them in a rack in a data center and you will own the hardware and everything associated with it at half the price of what you'd be paying AWS in a year.
https://en.wikipedia.org/wiki/Bare_metal
Edit: For clarity, wikipedia does also have pages with other meanings of "bare metal", including "bare metal server". The above link is what you get if you just look up "bare metal".
I do aim to be some combination of clear, accurate and succinct, but I very often seem to end up in these HN pissing matches so I suppose I'm doing something wrong. Possibly the mistake is just commenting on HN in itself.
I'm not sure what you did, but when you go to that Wikipedia article, it redirects to "Bare Machine", and the article contents is about "Bare Machine". Clicking the link you have sends you to https://en.wikipedia.org/wiki/Bare_machine
So it seems like you almost intentionally shared the article that redirects, instead of linking to the proper page?
Can we stop this now? Please?
Fix it then, if you think it's incorrect. Otherwise, link to https://en.wikipedia.org/wiki/Bare_metal_(disambiguation) like any normal and charitable commentator would do.
> Can we stop this now? Please?
Sure, feel free to stop at any point you want to.
I thought my link made the point a bit better. I think maybe you've misunderstood something about how Wikipedia works, or about what I'm saying, or something. Which is OK, but maybe you could try to be a bit more polite about it? Or charitable, to use your own word?
Edit: In case this part isn't obvious, Wikipedia redirects are managed by Wikipedia editors, just like the rest of Wikipedia. Where the redirect goes is as much an indication of the collective will of Wikipedia editors as eg. a disambiguation page. I don't decide where a request for the "bare metal" page goes, that's Wikipedia.
Edit2: Unless you're suggesting I edited the redirect page? The redirect looks to have been created in 2013, and hasn't been changed since.
Things today are different. As cloud service providers have grown to become dominant, they now offer a vast, complicated tangle of services, microservices, control panels, etc., at prices that can spiral out of control if you are not constantly on top of them, making bare metal cheaper for many use cases.
That was never the case for AWS, the point was never "We're cheap" but "We let you scale faster for a premium".
I first came across cloud services around 2010-2011 I think, when the company I worked at at the time started growing and we needed something better than shared hosting. AWS was brought up as a "fresh but expensive" alternative, and the CTO managed to convince the management that we needed AWS even if it was expensive, because it'll be a lot easier to tear up/down servers as we need it. Bandwidth costs I think was the most expensive part of the package, at least back then.
When I look at what performance per $ you get with AWS et al today, it looks the same: incredibly expensive for the performance you (don't) get. You're better off with dedicated instances unless your team is lacking the basic skills of server management, or until the company has really grown and dealing with the infrastructure keeps being difficult; then hire a dedicated person and let them make the calls for what's next.
Being able to start small from a $1/mth bill without any fixed cost overheads is incredibly powerful for small startups.
If I wanted to store bytes in a DC it would cost $10k/mth by the time I was paying colo/ servers/ disks before I stored my first byte. Sure there wouldn't be any incremental costs for the second byte but that's a steep jump. S3 would have cost me $0.02. Being able to try technology and prove concepts at the product development stage is very powerful and why AWS became not just a vendor but a _technology partner_ for many companies.
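The fixed-versus-metered trade-off can be made concrete with a small break-even sketch. It takes the $10k/mth figure from the comment and assumes, purely for illustration, that the $0.02 is a price per GB per month; egress, request charges, and the capacity limit of the colo setup are all ignored.

```python
def metered_monthly_cost(stored_gb: float, price_per_gb_month_usd: float) -> float:
    """Monthly bill for pay-per-GB storage with no fixed fee."""
    return stored_gb * price_per_gb_month_usd


def break_even_gb(fixed_monthly_usd: float, price_per_gb_month_usd: float) -> float:
    """Stored volume at which the metered bill equals the fixed monthly cost."""
    if price_per_gb_month_usd <= 0:
        raise ValueError("price per GB must be positive")
    return fixed_monthly_usd / price_per_gb_month_usd


if __name__ == "__main__":
    # $10,000/month fixed (from the comment) vs an assumed $0.02 per GB-month.
    gb = break_even_gb(10_000, 0.02)
    print(f"break-even at about {gb / 1000:.0f} TB stored")
```

Under these assumptions the metered option stays cheaper until roughly 500 TB is stored, which is why the fixed-cost jump hurts most at the prototype stage.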
Yes, no doubt about it. Initially AWS was mostly sold as "You never know when you might want to scale fast, imagine being featured in a newspaper and your servers can't handle the load, you need cloud for that!" to growing startups, and in that context it kind of makes sense, pay extra but at least be online.
But initially when you're small, or later when you're big and established, other things make more sense. But yes, I agree that if you need to aggressively be able to scale up or down, cloud resources make sense to use for that, in addition to your base infrastructure.
Actually, it was more like "Scale faster, easier, more reliably, with proven hardware and software infrastructure, operated by a proven organization, at a price point that is competitive with the investment you'd have to make to get comparable hardware, software, and organizational infrastructure." But that was then. Today, things are different. Cloud services have become giant hairballs of complexity, with plenty of shoot-yourself-in-the-foot-by-default traps, at prices that can quickly spiral out of control if you're not on top of them.
AWS needs to stop trying to have a half-arsed solution to every possible use case and instead focus on doing a few basic things really well.
[0] https://aws.amazon.com/certification/certified-solutions-arc...
AWS was mostly spared from yesterday’s big cuts but have been told to “watch this space” in the new year after re:Invent.
A lot of newer stuff that actually scales (so Lightsail doesn't count) is entangled with "security", "observability" and "network" services. So if you just want to run EC2 + RDS today, you also have to deal with VPC, Subnets, IAM, KMS, CloudWatch, CloudTrail, etc.
Since security and logs are not optional, you have very limited choice.
Having that many required additional services means lots of hidden charges, complexity and problems. And you need a team if you're not doing small-scale stuff.
Not sure what Amazon plans to do when the m6 hardware starts wearing out.
Azure is ... a different story...
Yes, EC2 might seem to be only 2.5 times the cost of storage... Except that, even if you buy the high speed storage, it's going to be 10x - 100x slower than bare metal. Which then means you can buy much slower drives, if you wanted to, and save a shit ton of money.
> Our workload is 24/7 steady. We were already at >90% reservation coverage; there was no idle burst capacity to “right size” away. If we had the kind of bursty compute profile many commenters referenced, the choice would be different.
Which TBH applies to many, many places, even if they are not aware of it.
In any case, not everyone needs five nines, and usually it's just much easier to bring down a platform due to some bug in your own software rather than the core infrastructure going down at a rack level.
It's probably the main reason why they were able to get away with this and why their application does not need scalability. I see they themselves are only offering two 9s of uptime.
> The Equinix Metal service will be sunset on June 30, 2026.
Our .NET Apps are still deployed as Docker Compose Apps which we use GitHub Actions and Kamal [1] to deploy. Most Apps use SQLite + Litestream with real-time replication to R2, but have switched to a local PostgreSQL for our Latest App with regular backups to R2.
Thanks to AI that can walk you through any hurdle and create whatever deployment, backup and automation scripts you need, it's never been easier to self-host.
This really is the crux of the matter in my opinion, at least for applications (databases and so on is in my opinion more nuanced). I've only worked at one place where using cloud functions made sense (keeping it somewhat vague here): data ingestion from stations that could be EXTREMELY bursty. Usually we got data from the stations at roughly midnight every day, nothing a regular server couldn't handle, but occasionally a station would come back online after weeks or new stations got connected etc which produced incredible load for a very short amount of time when we fetched, parsed and handled each packet. Instead of queuing things for ages we could instead just horizontally scale it out to handle the pressure.
Compared to all the other things that can and will go wrong, this risk seems pretty small, but I have no data to back that up.
Given the rates of fires in DCs, you'd rather need to be quite unlucky for it to happen to you.
Is there a simple safe setup that we can run on an Ubuntu server?
We self-host the Postgres db with frequent backups to s3 but just in case the site takes off, we need an affordable reliable solution.
Does anyone here run their own db servers? Any advice?
Backups, security, upgrades etc
Info noted
Setting up a DB isn't hard, using an LLM to ask questions will guide you to the right places. I'm always talking with Gemini because I switched from Ubuntu to Fedora 42 server and things are slightly different here and there.
But, different server hosts offer DB-ready OS's so all you have to do is load the OS on the server and you'll be ready to go.
The joy of Linux is getting everything _just right_ and so much _just right_ that you can launch a second server and set it up that way _just right_ within minutes.
Gee, how hard is it to find SE experts in that particular combination of available ops tools? While in AWS every AWS certified engineer would speak the same language, the DIY approach surely suffers from the lack of "one way" to do things. Swap Flux for Argo, for example (assuming the post is talking about that Flux and not another tool with the same name), and you have an almost completely different gitops workflow. How do they manage to settle on a specific set of tools?
What are the major differences?
However, to steelman AWS use: many businesses are STILL running mainframes. Many run terrible setups like Access as a production database. In 2025 there are large companies with no CICD platforms or IAC, and some companies where even VC is still a new concept or a dark art. So not every company is in a position to actually hire competent system administrators and system engineers to set up some bare metal machines and configure Ceph, much less Hadoop or Kubernetes. So AWS lets these companies just buy these capabilities while forcing the software stack to modernize.
That company had its own data center, tape archives, etc. It had been running largely the same way continuously since the 90s. When I left for a better job, the company had split into two camps: the old curmudgeonly on-prem activists and the over-optimistic cloud native AWS/GCP certified evangelists with no real experience in the cloud (because they worked at a company with no cloud presence). I'm humble enough to admit that I was part of the second camp and I didn't know shit, I was cargo culting.
This migration is still not complete as far as I'm aware. Hopefully the teams that resisted this long and never left for the cloud get to settle in for another decade of on-prem superiority lol.
The story will be different for every business because every business has different needs.
Given the answer to "How much did migration and ongoing ops really cost?" it seems like they had an incredibly simple infrastructure on AWS, and it was really easy to move out. If you use a wider range of services, the cost savings are much more likely to cancel themselves out.
Assuming this is indeed all they used, this was admittedly nonsense, they were essentially using cloud-based bare-metal.
Very much this.
Small team in a large company who has an enterprise agreement (discount) with a cloud provider? The cloud can be very empowering, in that teams who own their infra in the cloud can make changes that benefit the product in a fraction of the time it would take to work those changes through the org on prem. This depends on having a team that has enough of an understanding of database, network and systems administration to own their infrastructure. If you have more than one team like this, it also pays to have a central cloud enablement team who provides common config and controls to make sure teams have room to work without accidentally overrunning a budget or creating a potential security vulnerability.
Startup who wants to be able to scale? You can start in the cloud without tying yourself to the cloud or a provider if you are really careful. Or, at least design your system architecture in such a way that you can migrate in the future if/when it makes sense.
In 2010 the most you could get in one server was 64 Xeon cores: 8 sockets with a maximum of 8 cores per socket. And that is ignoring NUMA issues. Today you can get 256 cores per socket, each at least twice as fast. What used to be 64 servers could now be fitted into 1. And by 2030 it would be closer to a 100 to 1 ratio. Not to mention that server software has gotten a lot faster since 2010: PHP, Python, Ruby, Java, ASP or even Perl. If we added everything up, I wouldn't be surprised if we were at a 200 or 300 to 1 ratio compared to 2010.
I am pretty sure there is some version of Oxide in the pipeline that will catch up to the latest Zen CPU cores. If one server isn't enough, a few Oxide racks should cover the usage of 99% of Internet companies.
However, I do get the point about cost-premium and more importantly vendor-risk that's paid when using managed services.
We are hosted on Cloudflare Workers, which is very cheap, but to mitigate the vendor risk we have also set up replicas of our API servers on bunny.net and render.com.
For example, we couldn't offer free GeoIP downloads[0] if we were charged the outrageous $0.09 / GB, and the same is true for companies serving AI models or game assets.
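To put that rate in perspective, a back-of-the-envelope sketch. It assumes a flat $0.09/GB with no free allowance or volume tiers, decimal terabytes, and made-up monthly volumes.

```python
EGRESS_USD_PER_GB = 0.09  # the per-GB rate quoted above


def egress_cost_usd(terabytes: float, usd_per_gb: float = EGRESS_USD_PER_GB) -> float:
    """Cost of sending `terabytes` out, using decimal units (1 TB = 1000 GB)."""
    return terabytes * 1000 * usd_per_gb


for tb in (1, 10, 100):  # hypothetical monthly download volumes
    print(f"{tb:>4} TB/month -> ${egress_cost_usd(tb):,.0f}/month")
```

At that rate the bill grows linearly with popularity, which is the problem for anything that gives large downloads away for free.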
But what makes me almost sick is how slow the cloud is. From network-attached disks to overcrowded CPUs, everything is so slooooow.
My experience is that the cloud is a good thing between 0-10,000 $ / month. But you should seriously consider renting bare-metal servers or owning your own after that. You can "over-provision" as much as you want when you get 10-20x (real numbers) the performance for 25% of the price.
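Spelled out, the ratio in that last sentence looks like this (a trivial sketch; the function name is just for illustration):

```python
def perf_per_dollar_gain(speedup: float, price_fraction: float) -> float:
    """How many times more performance each dollar buys."""
    return speedup / price_fraction


low = perf_per_dollar_gain(10, 0.25)   # 10x the performance at 25% of the price
high = perf_per_dollar_gain(20, 0.25)  # 20x the performance at 25% of the price
print(f"{low:.0f}x to {high:.0f}x more performance per dollar")
```

With that much headroom per dollar, generous over-provisioning still comes out well ahead.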
It always makes sense to compare to back of the envelope bare metal numbers before rearchitecting your stack to work around some dumb cloud performance issue.
I really like how people throw around these baseless accusations.
S3 is one of the cheapest storage solutions ever created. The last 10 years I have migrated roughly 10-20PB worth of data to AWS S3 and it resulted in significant cost saving every single time.
If you do not know how to use cloud computing then yes, AWS can be really expensive.
The real cost of self-hosting, in my direct experience with multiple startup teams trying it, is the endless small tasks, decisions, debates, and little changes that add up over time to more overhead than anyone would have expected. Everyone thinks it’s going to be as simple as having the colo put the boxes in the rack and then doing some SSH stuff, then you’re free of those AWS bills. In my experience it’s a Pandora’s box of tiny little tasks, decisions, debates, and “one more thing” small changes and overhauls that add up to a drain on the team after the honeymoon period is over.
If you’re a stable business with engineers sitting idle that could be the right choice. For most startups who just need to get a product out there and get customers, pulling limited headcount away from the core product to save pennies (relatively speaking) on a potential AWS bill can be a trap.
Free? No, it's not free. It only costs less engineering time than AWS.
If those 20PB are deep archive, the S3 Glacier bill comes out to around $235k/year, which also seems ludicrous: it does not cost six figures a year to maintain your own tape archive. That's the equivalent of a full-time sysadmin (~$150k/year) plus $100k in hardware amortization/overhead.
The real advantage of S3 here is flexibility and ease-of-use. It's trivial to migrate objects between storage classes, and trivial to get efficient access to any S3 object anywhere in the world. Avoiding the headache of rolling this functionality yourself could well be worth $3.6M/year, but if this flexibility is not necessary, I doubt S3 is cheaper in any sense of the word.
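A rough sketch of the arithmetic behind those figures. The Deep Archive rate below is an assumed list price (about $0.00099 per GB-month) and may not match your region or current pricing; decimal units are assumed throughout.

```python
GB_PER_PB = 1_000_000  # decimal units (assumption)


def annual_storage_cost_usd(petabytes: float, usd_per_gb_month: float) -> float:
    """Yearly storage bill for a constant volume at a flat per-GB-month rate."""
    return petabytes * GB_PER_PB * usd_per_gb_month * 12


def implied_rate_usd_per_gb_month(annual_usd: float, petabytes: float) -> float:
    """Back out the per-GB-month rate that a yearly bill implies."""
    return annual_usd / 12 / (petabytes * GB_PER_PB)


DEEP_ARCHIVE_RATE = 0.00099  # assumed list price, USD per GB-month

print(f"20 PB in deep archive: ${annual_storage_cost_usd(20, DEEP_ARCHIVE_RATE):,.0f}/year")
print(f"Rate implied by $3.6M/year: ${implied_rate_usd_per_gb_month(3_600_000, 20):.4f}/GB-month")
```

Under those assumptions the deep archive number lands within a couple of percent of the ~$235k quoted above, and the $3.6M figure works out to a blended rate of about a cent and a half per GB-month. Neither includes request, retrieval, or transfer charges.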
I suspect the only way you could have 20PB is if you have metrics you don't aggregate or keep ancient logs (why do you need to know your auth service had a transient timeout a year ago?)
As soon as you start talking about any kind of serious data storage and data transfer the costs start piling up like crazy.
Like in my mind, the cost curve should flatten out over time. But that just doesn't seem to be the reality.
I think as AWS grows and changes the curve of the target audience is changing too. The value proposition is "You can get Cloud service without having a dedicated Cloud team," but there are caveats:
- AWS is complicated enough that you will still need a team to integrate against it. The abstractions are not free and the ones that are leaky will bite you without dedicated systems engineers to specialize in making it work with your company's goals.
- For small companies with little compute need, AWS is a good option. Beyond a certain scale... It is worth noting that big companies build their own datacenters, they don't rely on someone else's Cloud. Amazon, Google, and Microsoft don't run on each other.
- Recently, the cost model has likely changed. If a company pokes its head up and runs the numbers, there are, uh, quite a few engineers with deep knowledge of how to build a scalable cloud infrastructure available to hire now for some reason. In fact, a savvy company keeping its ear to the ground can probably snap up some high-tier talent very soon (https://www.reuters.com/business/world-at-work/amazon-target...).
It really depends on where your company's risk and cost models are. Running on someone else's cloud just isn't the only option.
I just don't see it. Given the nature of the services they offer it's just too risky not to use as much managed stuff with SLAs as possible. k8s alone is a very complicated control plane + a freaking database that is hard to keep happy if it's not completely static. In a prior life I went very deep on k8s, including self managing clusters and it's just too fragile, I literally had to contribute patches to etcd and I'm not a db engineer. I kept reading the post and seeing future failure point after future failure point.
The other aspect is there doesn't seem to be an honest assessment of the tradeoffs. It's all peaches and cream, no downsides, no tradeoffs, no risk assessment etc.
And let’s be very real here: if your cloud service goes down for a few hours because you screwed something up, or because AWS deployed some bad DNS rules again, the world moves on. At the end of the day, nobody gives a shit.
AWS truly does let you focus on your business logic and abstracts a TON of undifferentiated work and well beyond the low hanging fruit of system updates and load balancing.
I guess put another way, providing a SaaS you need to have an SLA, those SLAs flow from SLO and SLIs and ultimately a risk profile of your hw and sw. The risk of a bad HBA alone probably means a day of downtime if you don't do things perfectly. AWS has bad HBAs, CPUs, memory, disks etc all day long every day and it's not even a blip for customers, never mind downtime. And if you don't model bad HBAs in your SLAs then your board is going to be pissed when that outage inevitably happens.
Now if you don't have SLAs and you like sysops, networkops, clusterops, dbops work then sure, YOLO.
Microk8s doesn’t use etcd (they have their own, simpler thing), which seems like a good tradeoff at single rack scale: https://benbrougher.tech/posts/microk8s-6-months-later/
The article’s deployment has a spare rack in a second DC and they do a monthly cutover to AWS in case the colo provider has a two site issue.
Spending time on that would make me sleep much better than hardening a deployment of etcd running inside a single point of failure.
What other problems do you see with the article? (Their monthly time estimates seem too low to me - they’re all 10x better than I’ve seen for well-run public cloud infrastructure that is comparable to their setup).
(1) Massive expansion of budget (100 - 1000x) to support empire building. Instead of one minimum-wage sysadmin with 2 high-availability, maxed-out servers for 20K - 40K (and 4-hour response time from Dell/HPE), you can have 100M multi-cloud Kubernetes + Lambda + a mix-and-match of various locked-in cloud services (DB, etc.). And you can have a large army of SRE/DevOps. You get power and influence as a VP of Cloud this and that and 300 - 1000 people reporting to you.
(2) OpEx instead of CapEx
(3) All leaders are completely clueless about hiring the right people in tech. They hire their incompetent buddies who hire their cronies. Data centers can run at scale with 5-10 good people. However, they hire 3000 horrible, incompetent, and toxic people, and they build lots of paperwork, bureaucracy, and approvals around it. Before AWS, it was VMware's internal cloud that ran most companies. Getting bare metal or a VM would take months to years, and many, many meetings and escalations. With AWS, "here is my credit card, pls gimme 2 VMs" is the biggest feature.
In contrast, you could throw a stone into a bush and hit an AWS guy.
He told me that before they started doing that, there were incidents like teams writing entire modules they didn't know already existed - now there were 2 pieces of code doing basically the same thing, that were just incompatible enough to not be possible to merge them.
One time, in on prem, we had a custom setup with a machine running half the services we used, including a reverse proxy using haproxy with some custom Lua scripts for routing, a fileserver using lighttpd, some docker compose stuff, a stateless query thingy running on nodejs, etc.
We needed to change something, and the guy who wrote it left a year ago and we had to reverse engineer the stuff he did (some of it was quite questionable).
We weren't entirely successful and had to rewrite some stuff. I'm not saying the way it was done wasn't clever or cost efficient, but damn, if it had been done on AWS, I probably would've known where to look for stuff (and so would most of my colleagues).
Sure, if you're only growing <30% YoY and already paying several millions for cloud, and bandwidth/storage are a large fraction of that, by staying in the cloud you're proving your incompetence as an engineering org.
The internet assures me there are loads of these underemployed Unix/networking experts just sitting around waiting to set up your infrastructure. But in my experience, these people are actually really difficult to hire, and not at all cheap. (Possibly the sharp ones have 'sold out' and gone the SRE route and are now one of those '3000' people.)
So I wonder if there's a certain amount of wishful thinking on both sides here, like "I wish a 'clueful' company would hire me to be their head sysadmin...", while companies who have tried to do this on the cheap usually just have terrible ops. ("Whoops, the backups haven't worked in 2 years...")
I get crap recruiters in my inbox and LinkedIn every other week with the worst offers to go back to on-site bare metal admin. 30% less pay, on-site requirements, and it's a contracted position?
I need that Futurama "oh you're serious, let me laugh harder" gif
If companies want to whine that good Linux datacenter ops doesn't exist anymore, laugh in their faces.
Someone please explain to me why this matters. I'd think that expenditures are expenditures, and that if the outright purchase of hardware would see an RoI compared to renting it in the cloud in under a year, it'd be a no-brainer to just buy the hardware.
AWS makes the life of finance and leadership a lot easier because they spend a lot of money justifying their superiority in ways that you don't have to think too hard to use and be taken seriously. They're to CTOs what think tanks and lobbyists are to lawmakers.
"No one got fired for buying IBM" for the new era.
There is a lot of truth in AWS propaganda, they're great for many things. But some of it is built on lies, cost being one, performance another.
I wish you started out by telling me how many customers you have to serve, how many transactions they generate, how much I/O there is.
Eventually, I realized that it was because the devs wanted to put "AWS" on their resumes. I wondered how long it would take management to catch on that they were being used as a place to spruce up your resume before moving on to catch bigger fish.
But not long after, I realized that the management was doing the same thing. "Led a team migration to AWS" looked good on their resume, also, and they also intended to move on/up. Shortly after I left, the place got bought and the building it was in is empty now.
I wonder, now that Amazon is having layoffs and Big Tech generally is not as many people's target employer, will "migrated off of AWS to in-house servers" be what devs (and management) want on their resume?
And then discussions on how to move forward are held between people that only know AWS and people who want to use other stuff, but only one side is transparent about it.
> You do not have the appetite to build a platform team comfortable with Kubernetes, Ceph, observability, and incident response.
Has work been using AWS wrong? Other than Ceph, all those things add up to onerous half time jobs for rotating software engineers.
Before gp3 came out, working around EBS price/performance terribleness was also on the list.
The biggest downside I see? We had to sign a 3 year contract with the colocation facility up front, and any time we want to change something they want a new commitment. On AWS you don't commit to spending until after you've got it working, and even then it's your choice.
You don't get to an overcomplicated AWS madness without having a few engineers already pushing complexity.
And an overcomplicated setup also means it needs maintenance. There are no personnel savings there.
You could get manual failover with a single writer replicated managed Postgres setup and a warm VM.
That’s on the order of a thousand a month for a medium workload. It’s probably a 10x markup vs buying the servers, but it doesn’t matter if it saves an employee.
I understand that with AWS you cannot do that, as it is often seen as opex.
I guess that's a good enough motivation to move out of AWS at scale.
Doesn't make me want to be an Equinix customer when they just randomly shut down critical hosting services.
I'm pretty sure that it's just the post-merger name for Packet which was an incredible provider that even had BYO IP with an anycast community. Really a shame that it went away, it was a solid alternative to both AWS and bare metal and prices were pretty good.
There's a missing middle between ultra expensive/weird cloud and cheap junk servers that I would really love to see get filled.
Good datacenters have redundant and physically separated power and communication from different providers.
Also, in case something catastrophic happens at one datacenter, the author mentions they are peered to another datacenter in a different country, as another layer of redundancy. Cloudflare handles their ingress, so such a catastrophic event would be unlikely to be noticed by their customers.
I don't know how much time they spend configuring/dealing with Kubernetes, but I bet it's a large chunk of the 24 engineer-hours per quarter. But this is not a required expense: "EKS had an extra $1,260/month control-plane fee". Running EKS adds a massive IAM policy maintenance overhead, whereas a non-EKS (EC2 w/ golden AMIs) setup results in drastically simpler IAM policies.
NAT gateways are ~$50 a month, plus data transfer. Setting up a gateway VPC endpoint to S3 will avoid having to pay transfer charges to S3.
They were at 90% reservation capacity, so they should be using reservations for greater savings and in fact, running stable workloads with reservations is something that AWS excels at. Reservation means that you will be able to terminate and re-launch instances even when there's a spike in demand from other users--your instance capacity is guaranteed.
Running the basics on VMs also effectively avoids vendor lock-in. Every cloud provider supports VMs with a RedHat clone, VPCs, load balancing, networked storage, access controls, object storage and a fixed size fleet with auto-relaunch on instance failure.
With a consistent workload, they would have very likely escaped the downtime from AWS a week ago as well, because, as per AWS, "existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event".
With Terraform and automation for building launchable images, you can stand up a cluster quickly in any region with secure networking, including in a separate AWS account, in the same region, for the sake of testing.
With AWS, you can set up automatic EBS backups of all your data to snapshots trivially, and even send them to a 3rd locked-down account, so they can't be accidentally wiped.
Am I just naive? How is an uptime SaaS product saving over a million a year on managed colo vs AWS? Was every API route in its own EC2 instance?
AWS is expensive, sure, but over a million dollars a year? For this product specifically?
I got some clarification from their earlier posts and it looks like they were intentionally avoiding any AWS platform features:
>Our goal was to avoid reliance on AWS or any proprietary cloud technology.
>When we were utilizing AWS, our setup consisted of a 28-node managed Kubernetes cluster. Each of these nodes was an m7a EC2 instance. With block storage and network fees included, our monthly bills amounted to $38,000+. This brought our annual expenditure to over $456,000+.
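Breaking the quoted bill down per node (a quick sketch using only the numbers in the quote, with the "+" ignored):

```python
NODES = 28                 # m7a nodes in the managed Kubernetes cluster
MONTHLY_BILL_USD = 38_000  # quoted as "$38,000+", storage and network included

annual_bill_usd = MONTHLY_BILL_USD * 12
per_node_month_usd = MONTHLY_BILL_USD / NODES

print(f"Annual: ${annual_bill_usd:,}")
print(f"Per node: ${per_node_month_usd:,.0f}/month, all-in")
```

The per-node figure is the useful one to hold up against a dedicated server quote, as long as the storage and bandwidth you would need on the other side are counted too.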
I just think if you are going to deploy on AWS and then treat AWS like managed colo, your bill is going to be high. I understand how that seems unfair, but AWS isn't really in the business of selling virtual machines. If you sit down and ask yourself how you got here, it just seems like you committed yourself to wasting money. If I knew I just needed some Linux boxes from the start, there are better choices than AWS.
There are huge advantages of scale to computer operations in a few areas:
- facility: the capital and running cost of a purpose-built datacenter is far cheaper per rack than putting machines in existing office-class buildings, as long as it's a reasonable size - ours is ~1000 racks, but you might get decent scale at a quarter of that. (also one fat network pipe instead of a bunch of slow ones)
- purchasing: unlike consumer PCs, low-volume prices for major vendor servers are wildly inflated, and you don't get decent prices until you buy quite a few of them.
- operations: people come in integer units, and (assuming your salary ranges are bounded) are only competent in a small number of technical areas each. Whether you have one machine or 1000s you need someone who can handle each technology your deployment depends on, from Kubernetes to network ops; multiply 4x for those requiring 24/7 coverage, or accept long response times for off-hours failures.
That last one is probably the kicker. To keep salary costs below 50% of your total, assuming US pay rates and 5-year depreciation since machines aren't getting faster as quickly as they used to, you probably need to be running tens of millions of dollars in hardware.
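A minimal sketch of that break-even, with made-up inputs: 3 specialties times 4 people each for 24/7 coverage, a hypothetical $200k fully loaded cost per person, and a model that counts only salaries and hardware depreciation (facility, power, and network are left out).

```python
def min_hardware_capex_usd(
    headcount: int,
    loaded_salary_usd: float,
    depreciation_years: float = 5,
    max_salary_share: float = 0.5,
) -> float:
    """Smallest hardware purchase that keeps salaries at or below the given
    share of yearly cost, counting only salaries and depreciation."""
    annual_salaries = headcount * loaded_salary_usd
    # salary_share = S / (S + H) <= max_share  =>  H >= S * (1 - max_share) / max_share
    min_annual_hardware = annual_salaries * (1 - max_salary_share) / max_salary_share
    return min_annual_hardware * depreciation_years


# Hypothetical staffing: 3 specialties x 4 people each for round-the-clock coverage.
capex = min_hardware_capex_usd(headcount=3 * 4, loaded_salary_usd=200_000)
print(f"Hardware needed to hold salaries at 50% of cost: ${capex:,.0f}")
```

Even this stripped-down version lands in eight figures, and adding more specialties or higher pay pushes it further into the tens of millions.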
Note that a tiny deployment of a few machines in a tech company is an exception, since you have existing technical staff who can run them in their spare time. (and you have other interesting work for them to do, so recruiting and retention isn't the same problem as if their only job was to babysit a micro-deployment)
That's why it can be simultaneously true that (a) profit margins on AWS-like services are very high, and (b) AWS is cheaper than running your own machines for a large number of companies.
Just want to confirm what I am reading. You are talking about ~1000 racks as the facility size, not what a typical university requires.
I thought there would be a greater unbundling to AWS or to cheaper providers but it seems like a good-sized portion of the market is just going back to managing their own hardware.
I’m sorry but I don’t believe this for one second.
And unfortunately that makes me distrust the entirety of the article.