Ummm is that plateauing with us in the room?
The advantage of renting vs. owning is that you can always get the latest gen, which brings you newer capabilities (e.g. fp8, fp4) and cheaper prices on current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets I'd make.
Cloud excels for bursty or unpredictable workloads where quickly scaling up and down can save you money.
PS: BX cable instead of conduit for the electrical looks cringe.
One thing to keep in mind is that the GPU depreciation curve (over the last 5 years at least) is a little steeper than a 3-year write-off. Current estimates are that the residual value plunges dramatically around the third year. For a top-tier H100 that drop kicks in around year 3, and for less capable cards like the A100 the depreciation is even worse.
https://www.silicondata.com/use-cases/h100-gpu-depreciation/
This is not factoring in the cost of labor. Labor at SF wages is dreadfully expensive; if your data center is right across the border in Tijuana, on the other hand...
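To make the depreciation point concrete, here's a rough back-of-the-envelope sketch in Python. Every number in it (purchase price, yearly decay rates) is an illustrative assumption, not a figure from the linked report:

    # Rough residual-value sketch for a GPU purchase.
    # All numbers are illustrative assumptions, not real market data.

    def residual_value(purchase_price, yearly_decay):
        """Estimated residual value at the end of each year."""
        value = purchase_price
        out = []
        for decay in yearly_decay:
            value *= (1 - decay)
            out.append(round(value))
        return out

    # Assume a steeper-than-linear curve: modest loss in years 1-2,
    # then a sharp drop around year 3 as newer generations land.
    h100_price = 30_000                        # assumed street price, USD
    decay_per_year = [0.20, 0.25, 0.40, 0.40]  # assumed yearly value loss

    print(residual_value(h100_price, decay_per_year))
    # -> [24000, 18000, 10800, 6480], i.e. roughly a third of the
    #    purchase price left by the end of year 3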
[0] - https://azure-int.microsoft.com/en-us/pricing/tco/calculator...
Their "assumption" for hardware purchase prices seems way off compared to what we buy from Dell or HP.
It's interesting that the "IT labour" cost they estimate is $140k for DIY, and $120k for Azure.
Their saving is 5 times more than what we spend...
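If you want to sanity-check the calculator, a tiny script with your own vendor quotes and labour numbers goes a long way. The figures below are placeholders, not our actual Dell/HP pricing:

    # 3-year TCO sanity check with your own numbers instead of the
    # calculator's defaults. All figures are placeholder assumptions.
    YEARS = 3

    def tco(hardware, yearly_labour, yearly_other):
        return hardware + YEARS * (yearly_labour + yearly_other)

    diy   = tco(hardware=120_000, yearly_labour=140_000, yearly_other=30_000)
    azure = tco(hardware=0,       yearly_labour=120_000, yearly_other=180_000)

    print(f"DIY:   {diy:,}")    # 630,000
    print(f"Azure: {azure:,}")  # 900,000

The point isn't the output, it's that swapping in real purchase prices and real labour costs can flip the calculator's conclusion.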
No, low isn't good per se. I worked in a datacenter where the humidity dropped below 40% in winter, and RAM was failing all over the place. Low humidity causes static electricity.
It is much cheaper to use external air for cooling if you can.
Also, this is where cutting corners does result in lower cost, which was the OP's point to begin with. It just means you won't get as good a datacenter as the people who tune this stuff all day and have decades of experience.
/s
You can see quite clearly here that there are so many steps to take. Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
It’s never about “is the expected cost of on-premises less than cloud”, it’s about the risk-adjusted costs.
Once you’ve spread your risk across not only your main product but also your infrastructure, things get hard.
I would be wary of a smallish company building its own Jira in-house in a similar way.
Capex needs work. A couple of years, at least.
If you are willing to put in the work, your mundane computer is always better than the shiny one you don't own.
>Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
Yes, but one differentiating factor is always price, and you don't want to lose all your margins to some infrastructure provider.
Think of a ~5000 employee startup. Two scenarios:
1. If they win the market, they capture something like ~60% margin.
2. If that doesn't happen, they just lose: the VC funding runs out and then they're gone.
In this dynamic, infrastructure costs don't change the bottom line of profitability, but the risk involved in rolling out their own infrastructure can threaten the main product's existence itself.
I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
It's the other way around. How do you think all those businesses moved to the cloud in the first place?
There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.
People obsess about hardware, but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage, so the cost to optimize is the people cost, not the hosting bill; the hosting cost is usually a rounding error on the staffing cost. On top of that, the amount of responsibility increases as soon as you own the hardware: you need to service it, monitor it, replace it when it fails, make sure the fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem, and it has a non-zero cost.
The right mindset is to think of hosting cost in FTEs (the fully loaded cost of a full-time employee for a year). If it's below 1 (true for most startups until they are well into scale-up territory), you are doing great. Most of the optimizations you could make are going to cost you actual FTEs spent doing that work, and 1 FTE pays for quite a bit of hosting: think 10K per month in AWS cost, while a good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us; it's not worth spending any amount of time on for me. I literally have more valuable things to do.
This flips when the hosting alone starts costing multiple FTEs per month. At that point you probably have an additional 5-10 FTEs of staffing cost just to babysit all of that anyway, so now you can talk about trading off some hosting FTEs for a modest amount of extra staffing FTEs and making net gains.
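A minimal version of that heuristic, with an assumed fully loaded FTE cost (adjust to your market):

    # Hosting cost expressed in FTEs, as a rough decision heuristic.
    # The FTE cost is an assumption; plug in your own.
    FTE_COST_PER_MONTH = 12_000  # assumed fully loaded ops/dev cost, USD

    def hosting_in_ftes(monthly_hosting_cost):
        return monthly_hosting_cost / FTE_COST_PER_MONTH

    for monthly in (1_000, 10_000, 60_000):
        print(f"${monthly:>6}/mo hosting ~ {hosting_in_ftes(monthly):.2f} FTEs")

    # ~0.08, 0.83 and 5.00 FTEs respectively; per the argument above,
    # optimization only starts to pay off in the multi-FTE range.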
It's typically going to cost significantly less; it can make a lot of sense for small companies, especially.
Once they are up and running that employee is spending at most a few hours a month on them. Maybe even a few hours every six months.
OTOH you are specifically ignoring that you'll need mostly the same amount of time from a cloud-trained person if you're all-in on AWS.
I expect the marginal cost of one employee over the other is zero.
"An error occurred: API rate limit already exceeded for installation ID 73591946."
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues myself. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many Slurm jobs on good servers, cloud computing is very expensive, and owning the hardware can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
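As a sketch of that break-even math (all the prices here are made-up assumptions, not quotes):

    # Months until an owned GPU server beats renting equivalent cloud capacity.
    # All figures are illustrative assumptions.
    server_capex  = 250_000  # assumed 8-GPU server, fully built
    monthly_opex  = 6_000    # assumed power, colo, amortized labour
    cloud_monthly = 40_000   # assumed equivalent reserved cloud capacity

    breakeven_months = server_capex / (cloud_monthly - monthly_opex)
    print(f"Break-even after ~{breakeven_months:.1f} months")  # ~7.4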
Another thing between is colocation, where you put hardware you own in a managed data center. It’s a bit old fashioned, but it may make sense in some cases.
I can also mention that research HPC clusters may be worth considering. In research, we have some of the world’s fastest computers at a fraction of the cost of cloud computing. It’s great as long as you don’t mind not being root and having to use Slurm.
I don’t know about the USA, but in Norway you can run your private company’s Slurm AI workloads on research HPC clusters, though you will pay quite a bit more than universities and research institutions do. You can also set up research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.
It also makes sense for governments (including those in the EU) which should think about this and have the compute in house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.
Except now I have nightmares that the USA will invoke the Patriot Act and force Microsoft to hand over all the data in its European data centers, and then we have to migrate everything to a local cloud provider. Argh...
[1] https://www.techradar.com/news/remember-the-ovhcloud-data-ce...
[2] https://blocksandfiles.com/wp-content/uploads/2023/03/ovhclo...
Something very similar happened at work. The water valve monitoring wasn't up yet, and fire didn't respond because reasons. A huge amount of water flooded in over a 3-day weekend. Total loss.
Why build one when you can have two at twice the price?
But if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs) would still be cheaper than their estimated $25M cloud costs.
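Spelled out (the $5M and $25M come from the thread above; the 2.5x uplift for a second site plus extra networking is my assumption):

    # Redundant on-prem vs. the quoted cloud cost.
    single_dc   = 5_000_000
    cloud_quote = 25_000_000
    uplift      = 2.5  # assumed: second site plus extra networking

    redundant_dcs = single_dc * uplift   # 12,500,000
    print(redundant_dcs < cloud_quote)   # True
    print(cloud_quote - redundant_dcs)   # -> 12500000.0 of headroom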
You do however need to plan for $1M+ per year in OPEX, because good SREs ain't cheap (and neither are the hardware folks building and maintaining the machines).