Don't rent the cloud, own instead

https://blog.comma.ai/datacenter/

91•Torq_boi•2h ago

Comments

sys42590•59m ago

It would be interesting to hear their contingency plan for any kind of disaster (most commonly a fire) that hits their data center.

sschueller•54m ago

Yep, does anyone remember the OVH fire[1][2]?

[1] https://www.techradar.com/news/remember-the-ovhcloud-data-ce...

[2] https://blocksandfiles.com/wp-content/uploads/2023/03/ovhclo...

instagib•51m ago

Flooding from a burst frozen pipe, a false sprinkler trigger, or many other causes.

Something very similar happened at work. Water valve monitoring wasn’t up yet. Fire didn’t respond because reasons. A huge amount of water flooded in over a 3-day weekend. Total loss.

twelvechairs•46m ago

There's only one solution to this problem, and it's 2 data centres in some way or form.

mbreese•23m ago

What's the line from Contact?

Why build one when you can have two at twice the price?

But if you're building a datacenter for $5M, spending $10-15M on redundant datacenters (even with extra networking costs) would still be cheaper than their estimated $25M cloud costs.
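As a rough sketch of the arithmetic in the comment above: the dollar figures are the ones quoted in this thread, while the function name is hypothetical. The comparison assumes the cloud estimate and the build cost cover the same period, and it leaves out staffing and other operating costs.

```python
# Figures quoted in the thread, in millions of USD.
REDUNDANT_LOW, REDUNDANT_HIGH = 10.0, 15.0  # two datacenters, incl. extra networking
CLOUD_ESTIMATE = 25.0                       # the estimated cloud cost


def savings_vs_cloud(build_cost, cloud_cost=CLOUD_ESTIMATE):
    """Return how much cheaper (in $M) building is than the cloud estimate.

    Hypothetical helper: ignores OPEX such as staff and power.
    """
    return cloud_cost - build_cost


# Worst case uses the high build estimate, best case the low one.
savings_range = (savings_vs_cloud(REDUNDANT_HIGH), savings_vs_cloud(REDUNDANT_LOW))
print(f"Redundant build saves between ${savings_range[0]}M and ${savings_range[1]}M")
```

Any recurring operating cost would have to be subtracted from that margin before drawing a conclusion.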

golem14•3m ago

Or build two 2.5MM DCs (if you can parallelize your workload well enough), and in case of disaster you only lose capacity.

You do, however, need to plan for 1MM+ p.a. in OPEX, because good SREs ain’t cheap (nor are the hardware guys building and maintaining machines).

fpoling•21m ago

They use the datacenter for model training, not to serve online users. Presumably, even if it is offline for a week or even a month, it will not be a total disaster as long as they have, for example, offsite tape backups.

langarus•51m ago

This is a great solution for a very specific type of team but I think most companies with consistent GPU workloads will still just rent dedicated servers and call it a day.

hyperbovine•36m ago

I agree, and cloud compute is poised to become even more commoditized in the coming years (gazillion new data centers + AI plateauing + efficiency gains, the writing is on the wall). There’s no way this makes sense for most companies.

NitpickLawyer•26m ago

> AI plateauing

Ummm is that plateauing with us in the room?

The advantage of renting vs. owning is that you can always get the latest gen, and that brings you newer capabilities (e.g. fp8, fp4, etc.) and cheaper prices for current_gen-1. But betting on something plateauing when all the signs point towards the exact opposite is not one of the bets I'd make.

ocdtrekkie•18m ago

It's the opposite. The more consistent your workload the more practical and cost-effective it is to go on-prem.

Cloud excels for bursty or unpredictable workloads where quickly scaling up and down can save you money.

cgsmith•48m ago

I used to colocate a 2U server that I purchased with a local data center. It was a great learning experience for me. I'm curious why a company wouldn't colocate their own hardware? Proximity isn't an issue when you can have the datacenter perform physical tasks. Bravo to the comma team regardless. It'll be a great learning experience and make each person on their team better.

P.S. BX cable instead of conduit for electrical looks cringe.

comrade1234•45m ago

15 years ago or so, a spreadsheet was floating around where you could enter server costs, compute power, etc., and it would tell you when you would break even by buying instead of going with AWS. I think it was leaked from Amazon, because it was always three years to break even, even as hardware changed over time.
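A minimal sketch of the kind of break-even calculation such a spreadsheet might perform. This is not the spreadsheet itself: the function name, parameters, and example figures are all hypothetical, and the model assumes flat monthly costs with no depreciation or resale value.

```python
def break_even_months(hardware_cost, cloud_monthly, onprem_monthly):
    """Months until buying hardware becomes cheaper than renting.

    hardware_cost:  one-off purchase price (assumed figure)
    cloud_monthly:  recurring cloud bill for the same workload
    onprem_monthly: recurring cost of running owned hardware
                    (power, colo fees, staff share)
    Returns None when on-prem never catches up.
    """
    monthly_saving = cloud_monthly - onprem_monthly
    if monthly_saving <= 0:
        return None  # owning costs as much or more per month, so no break-even
    return hardware_cost / monthly_saving


# Hypothetical numbers chosen to land on the three-year mark from the comment.
months = break_even_months(hardware_cost=360_000, cloud_monthly=12_000, onprem_monthly=2_000)
print(months)
```

With these made-up inputs the purchase pays for itself after 36 months; shifting either monthly figure moves the break-even point accordingly.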

Onavo•39m ago

Well, somebody should recreate it. I smell a potential startup idea somewhere. There's a ton of "cloud cost optimizer" software, but most of it involves tweaking AWS knobs and taking a cut of the savings. A startup that could offload non-critical services from AWS to colo and traditional bare metal hosting like Hetzner has a strong future.

One thing to keep in mind is that the curve for GPU depreciation (in the last 5 years at least) is a little steeper than 3 years. Current estimates are that the value of the hardware plunges dramatically around the third year. For a top-tier H100, depreciation kicks in around the 3rd year, but the source below mentions that for less capable cards like the A100 the depreciation is even worse.

https://www.silicondata.com/use-cases/h100-gpu-depreciation/

Now, this is not factoring in the cost of labor. Labor at SF wages is dreadfully expensive. If your data center is right across the border in Tijuana, on the other hand...

TonyStr•24m ago

Azure provides their own "Total Cost of Ownership" calculator for this purpose [0]. Notably, this makes you estimate peripheral costs such as cost of having a server administrator, electricity, etc.

[0] - https://azure-int.microsoft.com/en-us/pricing/tco/calculator...

hbogert•44m ago

Datacenters need cool dry air? <45%

No, low isn't good per se. I worked in a datacenter which in winter had less than 40% humidity, and RAM was failing all over the place. Low humidity causes static electricity.

mbreese•29m ago

Low is good if you are also adding more humidity back in. If you want to maintain 45-50% (guessing), then you would want <45% environmental humidity so that you can raise it to the level you want. You're right about avoiding static, but you'd still want to try to keep it somewhat consistent.

It is much cheaper to use external air for cooling if you can.

Semaphor•43m ago

In case anyone from comma.ai reads this: the "CTO @ comma.ai" link at the end is broken; it’s relative instead of absolute.

croisillon•18m ago

no because it's on premise you see? you don't need to access the world wide web, just their server

simianwords•35m ago

The reason companies don’t go with on premises even if cloud is way more expensive is because of the risk involved in on premises.

You can see quite clearly here that there are so many steps to take. Now, a good company would concentrate risk on their differentiating factor or the specific part they have a competitive advantage in.

It’s never about “is the expected cost in on premises less than cloud”, it’s about the risk adjusted costs.

Once you’ve spread risk not only on your main product but also on your infrastructure, it becomes hard.

I would be wary of a smallish company building their own Jira in house in a similar way.

d1sxeyes•30m ago

It’s also opex vs capex, which is a battle opex wins most of the time.

simianwords•26m ago

I think it wins because opex is seen as stable recurring cost and capex is seen as the money you put in your primary differentiation for long term gains.

d1sxeyes•23m ago

True, but for a lot of companies “our servers are on-prem” is not a primary differentiator.

TonyStr•22m ago

Capex may also require you to take out loans

danpalmer•22m ago

> Cloud companies generally make onboarding very easy, and offboarding very difficult.

I reckon most on-prem deployments have significantly worse offboarding than the cloud providers. As a cloud provider you can win business by having something for offboarding, but internally you'd never get buy-in to spend on a backup plan if you decide to move to the cloud.

intalentive•13m ago

I like Hotz’s style: simply and straightforwardly attempting the difficult and complex. I always get the impression: “You don’t need to be too fancy or clever. You don’t need permission or credentials. You just need to go out and do the thing. What are you waiting for?”

jillesvangurp•8m ago

At scale (like comma.ai), it's probably cheaper. But until then it's a long term cost optimization with really high upfront capital expenditure and risk. Which means it doesn't make much sense for the majority of startup companies until they become late stage and their hosting cost actually becomes a big cost burden.

There are in between solutions. Renting bare metal instead of renting virtual machines can be quite nice. I've done that via Hetzner some years ago. You pay just about the same but you get a lot more performance for the same money. This is great if you actually need that performance.

People obsess about hardware but there's also the software side to consider. For smaller companies, operations/devops people are usually more expensive than the resources they manage. The cost to optimize is that cost. The hosting cost usually is a rounding error on the staffing cost. And on top of that the amount of responsibilities increases as soon as you own the hardware. You need to service it, monitor it, replace it when it fails, make sure those fans don't get jammed by dust puppies, deal with outages when they happen, etc. All the stuff that you pay cloud providers to do for you now becomes your problem. And it has a non zero cost.

The right mindset for hosting cost is to think of it in FTEs (full time employee cost for a year). If it's below 1 (most startups until they are well into scale up territory), you are doing great. Most of the optimizations you are going to get are going to cost you in actual FTEs spent doing that work. 1 FTE pays for quite a bit of hosting. Think 10K per month in AWS cost. A good ops person/developer is more expensive than that. My company runs at about 1K per month (GCP and misc managed services). It would be the wrong thing to optimize for us. It's not worth spending any amount of time on for me. I literally have more valuable things to do.

This flips when you start getting into multiple FTEs' worth of cost for just the hosting. At that point you probably have additional cost measured in 5-10 FTEs in staffing anyway to babysit all of that. So now you can talk about trading off some hosting FTEs for a modest amount of extra staffing FTEs and make net gains.
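The FTE rule of thumb above can be sketched in a few lines of Python. The default FTE cost is an assumption derived from the comment's "10K per month in AWS cost" figure (about 120K per year); the function name and threshold are hypothetical.

```python
def hosting_in_ftes(monthly_hosting_cost, fte_annual_cost=120_000):
    """Express a yearly hosting bill as a fraction of one full-time employee.

    fte_annual_cost is an assumed figure (10K/month * 12), taken from the
    comment's rough equivalence between one FTE and a 10K/month AWS bill.
    """
    return monthly_hosting_cost * 12 / fte_annual_cost


def worth_optimizing(monthly_hosting_cost, threshold_ftes=1.0):
    """Heuristic from the comment: below about one FTE, leave hosting alone."""
    return hosting_in_ftes(monthly_hosting_cost) > threshold_ftes


small = hosting_in_ftes(1_000)   # the commenter's ~1K/month bill
large = hosting_in_ftes(50_000)  # hypothetical larger bill
print(small, worth_optimizing(1_000))
print(large, worth_optimizing(50_000))
```

A bill well under one FTE, like the 1K per month example, falls below the threshold, which matches the commenter's conclusion that optimizing it is not worth the time.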
