Ummm is that plateauing with us in the room?
The advantage of renting vs. owning is that you can always get the latest gen, which brings you newer capabilities (e.g. fp8, fp4) and cheaper prices on current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets I'd make.
Cloud excels for bursty or unpredictable workloads where quickly scaling up and down can save you money.
PS: BX cable instead of conduit for the electrical looks cringe.
One thing to keep in mind is that the GPU depreciation curve (over the last 5 years at least) is a little steeper than a 3-year write-off. Current estimates are that the residual value plunges dramatically around the third year. For a top-tier H100 that drop kicks in around year 3, and for less capable cards like the A100 the depreciation is even worse.
https://www.silicondata.com/use-cases/h100-gpu-depreciation/
This is not factoring in the cost of labor. Labor at SF wages is dreadfully expensive; if your data center is right across the border in Tijuana, on the other hand...
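To make the depreciation point concrete, here's a rough back-of-the-envelope sketch in Python. Every number in it (purchase price, yearly decay rates) is an illustrative assumption, not a figure from the linked report:

    # Rough residual-value sketch for a GPU purchase.
    # All numbers are illustrative assumptions, not real market data.

    def residual_value(purchase_price, yearly_decay):
        """Estimated residual value at the end of each year."""
        value = purchase_price
        out = []
        for decay in yearly_decay:
            value *= (1 - decay)
            out.append(round(value))
        return out

    # Assume a steeper-than-linear curve: modest loss in years 1-2,
    # then a sharp drop around year 3 as newer generations land.
    h100_price = 30_000                        # assumed street price, USD
    decay_per_year = [0.20, 0.25, 0.40, 0.40]  # assumed yearly value loss

    print(residual_value(h100_price, decay_per_year))
    # -> [24000, 18000, 10800, 6480], i.e. roughly a third of the
    #    purchase price left by the end of year 3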
[0] - https://azure-int.microsoft.com/en-us/pricing/tco/calculator...
Their "assumption" for hardware purchase prices seems way off compared to what we buy from Dell or HP.
It's interesting that the "IT labour" cost they estimate is $140k for DIY, and $120k for Azure.
Their saving is 5 times more than what we spend...
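If you want to sanity-check the calculator, a tiny script with your own vendor quotes and labour numbers goes a long way. The figures below are placeholders, not our actual Dell/HP pricing:

    # 3-year TCO sanity check with your own numbers instead of the
    # calculator's defaults. All figures are placeholder assumptions.
    YEARS = 3

    def tco(hardware, yearly_labour, yearly_other):
        return hardware + YEARS * (yearly_labour + yearly_other)

    diy   = tco(hardware=120_000, yearly_labour=140_000, yearly_other=30_000)
    azure = tco(hardware=0,       yearly_labour=120_000, yearly_other=180_000)

    print(f"DIY:   {diy:,}")    # 630,000
    print(f"Azure: {azure:,}")  # 900,000

The point isn't the output, it's that swapping in real purchase prices and real labour costs can flip the calculator's conclusion.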
No, low isn't good per se. I worked in a datacenter where the humidity dropped below 40% in winter, and RAM was failing all over the place. Low humidity causes static electricity.
It is much cheaper to use external air for cooling if you can.
Also, this is where cutting corners does result in lower cost, which was the OP's point to begin with. It just means you won't get as good a datacenter as the people who tune this stuff all day and have decades of experience.
/s
You can see quite clearly here that there are so many steps to take. Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
It’s never about “is the expected cost of on-premises less than cloud”, it’s about the risk-adjusted costs.
Once you’ve spread your risk across not only your main product but also your infrastructure, things get hard.
I would be wary of a smallish company building its own Jira in-house in a similar way.
Capex needs work. A couple of years, at least.
If you are willing to put in the work, your mundane computer is always better than the shiny one you don't own.
>Now a good company would concentrate risk on their differentiating factor or the specific part they have competitive advantage in.
Yes, but one differentiating factor is always price, and you don't want to lose all your margins to some infrastructure provider.
Think of a ~5000 employee startup. Two scenarios:
1. If they win the market, they capture something like ~60% margin.
2. If that doesn't happen, they just lose: the VC funding runs out and then they're gone.
In this dynamic, infrastructure costs don't change the bottom line of profitability, but the risk involved in rolling out their own infrastructure can threaten the main product's existence itself.
I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.
It's the other way around. How do you think all those businesses moved to the cloud in the first place?
There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.
People obsess about hardware, but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage, so the cost to optimize is the people cost, not the hosting bill; the hosting cost is usually a rounding error on the staffing cost. On top of that, the amount of responsibility increases as soon as you own the hardware: you need to service it, monitor it, replace it when it fails, make sure the fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem, and it has a non-zero cost.
The right mindset is to think of hosting cost in FTEs (the fully loaded cost of a full-time employee for a year). If it's below 1 (true for most startups until they are well into scale-up territory), you are doing great. Most of the optimizations you could make are going to cost you actual FTEs spent doing that work, and 1 FTE pays for quite a bit of hosting: think 10K per month in AWS cost, while a good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us; it's not worth spending any amount of time on for me. I literally have more valuable things to do.
This flips when the hosting alone starts costing multiple FTEs per month. At that point you probably have an additional 5-10 FTEs of staffing cost just to babysit all of that anyway, so now you can talk about trading off some hosting FTEs for a modest amount of extra staffing FTEs and making net gains.
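A minimal version of that heuristic, with an assumed fully loaded FTE cost (adjust to your market):

    # Hosting cost expressed in FTEs, as a rough decision heuristic.
    # The FTE cost is an assumption; plug in your own.
    FTE_COST_PER_MONTH = 12_000  # assumed fully loaded ops/dev cost, USD

    def hosting_in_ftes(monthly_hosting_cost):
        return monthly_hosting_cost / FTE_COST_PER_MONTH

    for monthly in (1_000, 10_000, 60_000):
        print(f"${monthly:>6}/mo hosting ~ {hosting_in_ftes(monthly):.2f} FTEs")

    # ~0.08, 0.83 and 5.00 FTEs respectively; per the argument above,
    # optimization only starts to pay off in the multi-FTE range.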
It's typically going to cost significantly less; it can make a lot of sense for small companies, especially.
Once they are up and running that employee is spending at most a few hours a month on them. Maybe even a few hours every six months.
OTOH you are specifically ignoring that you'll need mostly the same amount of time from a cloud-trained person if you're all-in on AWS.
I expect the marginal cost of one employee over the other is zero.
"An error occurred: API rate limit already exceeded for installation ID 73591946."
For critical infrastructure, I would rather pay a competent cloud provider than be responsible for reliability issues myself. Maintaining one server room in the headquarters is something, but two server rooms in different locations, with resilient power and network, is a bit too much effort IMHO.
For running many Slurm jobs on good servers, cloud computing is very expensive, and owning the hardware can pay for itself in a matter of months. And who cares if the server room is a total loss after a while; worst case you write some more YAML and Terraform and deploy a temporary replacement in the cloud.
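As a sketch of that break-even math (all the prices here are made-up assumptions, not quotes):

    # Months until an owned GPU server beats renting equivalent cloud capacity.
    # All figures are illustrative assumptions.
    server_capex  = 250_000  # assumed 8-GPU server, fully built
    monthly_opex  = 6_000    # assumed power, colo, amortized labour
    cloud_monthly = 40_000   # assumed equivalent reserved cloud capacity

    breakeven_months = server_capex / (cloud_monthly - monthly_opex)
    print(f"Break-even after ~{breakeven_months:.1f} months")  # ~7.4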
Another thing between is colocation, where you put hardware you own in a managed data center. It’s a bit old fashioned, but it may make sense in some cases.
I can also mention that research HPC clusters may be worth considering. In research, we have some of the world’s fastest computers at a fraction of the cost of cloud computing. It’s great as long as you don’t mind not being root and having to use Slurm.
I don’t know about the USA, but in Norway you can run your private company’s Slurm AI workloads on research HPC clusters, though you will pay quite a bit more than universities and research institutions do. You can also set up research projects together with universities or research institutions, and everyone will be happy if your business benefits a lot from the collaboration.
It's very expensive and only makes sense if you really need infrastructure sovereignty. It makes more sense if you're profitable in the tens of millions after raising hundreds of millions.
It also makes sense for governments (including those in the EU) which should think about this and have the compute in house and disconnected from the internet if they are serious about infrastructure sovereignty, rather than depending on US-based providers such as AWS.
Except now I have nightmares that the USA will invoke the Patriot Act and force Microsoft to hand over all the data in its European data centers, and then we have to migrate everything to a local cloud provider. Argh...
[1] https://www.techradar.com/news/remember-the-ovhcloud-data-ce...
[2] https://blocksandfiles.com/wp-content/uploads/2023/03/ovhclo...
Something very similar happened at work. The water valve monitoring wasn't up yet, and fire didn't respond because reasons. A huge amount of water flooded in over a 3-day weekend. Total loss.
Why build one when you can have two at twice the price?
But if you're building a datacenter for $5M, spending $10-15M for redundant datacenters (even with extra networking costs) would still be cheaper than their estimated $25M cloud costs.
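Spelled out (the $5M and $25M come from the thread above; the 2.5x uplift for a second site plus extra networking is my assumption):

    # Redundant on-prem vs. the quoted cloud cost.
    single_dc   = 5_000_000
    cloud_quote = 25_000_000
    uplift      = 2.5  # assumed: second site plus extra networking

    redundant_dcs = single_dc * uplift   # 12,500,000
    print(redundant_dcs < cloud_quote)   # True
    print(cloud_quote - redundant_dcs)   # -> 12500000.0 of headroom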
You do however need to plan for $1M+ per year in OPEX, because good SREs ain't cheap (and neither are the hardware folks building and maintaining the machines).