I don't see the point in self-hosting unless you deploy a GPU in your own datacenter where you really have control. But that usually costs more for most use cases.
GPU prices also drop with larger purchases, so that's the scaling logic.
Not wanting to send tons of private data to a company whose foundation is exploiting data it didn't have permission to use?
LLM pricing is pretty intense if you're using anything beyond an 8B model, at least that's what I'm noticing on OpenRouter. Three or four calls can approach $1 with bigger models, and certainly with frontier ones.
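For a rough sense of the arithmetic, here's a back-of-envelope sketch in Python; the prices and token counts are illustrative assumptions, not OpenRouter's actual rates:

    # Back-of-envelope cost per LLM call.
    # Prices and token counts are illustrative assumptions, not real OpenRouter rates.

    def cost_per_call(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
        """USD cost of one call given per-million-token prices."""
        return (input_tokens / 1e6) * usd_per_m_input + (output_tokens / 1e6) * usd_per_m_output

    # A long-context call to a frontier-class model (assumed $10 / $30 per million tokens):
    print(cost_per_call(20_000, 2_000, 10.0, 30.0))   # ~$0.26 -> 3-4 such calls approach $1

    # The same call to a small 8B-class model (assumed $0.05 / $0.10 per million tokens):
    print(cost_per_call(20_000, 2_000, 0.05, 0.10))   # ~$0.0012

With long prompts and frontier pricing the per-call cost adds up fast, while the same traffic on a small model is close to free.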
Their A100 80GB goes for more than what I pay to rent H100s: if you want any hope of saving money versus the major providers, getting the cheapest hourly rentals possible is the only way.
I think people vastly underestimate how much companies like OpenAI can do for inference efficiency with large nodes, large batch sizes, and hyper-optimized inference stacks.
If you mean the serverless GPU offering, typically you set a cap for how many requests a single instance is meant to serve. Past that cap they'll spin up more instances.
But if you mean rentals, scaling is on you. With LLM inference there's a regime where the model responses will slow down on a per-user basis while overall throughput goes up, but eventually you'll run out of headroom and need more servers.
Another reason why, generally speaking, it's hard to compete with major providers on cost-effectiveness.
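To make that regime concrete, here's a toy model in Python; the throughput curve, speeds, and the 15 tok/s floor are all made-up assumptions, not measurements of any real stack:

    # Toy model: as concurrency grows, total throughput rises but per-user
    # generation speed falls, until it drops below an acceptable floor and
    # you need another server. All numbers are assumptions for illustration.

    def per_user_tok_s(concurrency, single_user_tok_s=60.0, efficiency=0.7):
        # Assume aggregate throughput scales sub-linearly (~concurrency**efficiency),
        # so per-user speed decays as more requests share the server.
        return single_user_tok_s * concurrency ** (efficiency - 1)

    def max_concurrency(min_acceptable_tok_s=15.0):
        """Largest concurrency whose per-user speed still meets the floor."""
        c = 1
        while per_user_tok_s(c + 1) >= min_acceptable_tok_s:
            c += 1
        return c

    cap = max_concurrency()
    print(cap, per_user_tok_s(cap), cap * per_user_tok_s(cap))
    # With these assumptions: ~100 concurrent users at ~15 tok/s each,
    # ~1500 tok/s aggregate -- far more total throughput than one user
    # at 60 tok/s, but each individual user is slower.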
Thank you, this is what I wanted to know.
typically you set a cap for how many requests a single instance is meant to serve
If this is on us, then we'd have to make sure whatever caps we set beat API providers. I don't know how easy that cap is to figure out.
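One way to sanity-check it is a break-even comparison like the sketch below; every number (rental price, API rate, throughput) is an assumption to swap for your own:

    # Rough break-even check: at what utilisation does a rented GPU beat an API?
    # All figures below are assumptions; plug in your own.

    RENTAL_USD_PER_HOUR = 2.50          # assumed hourly price for one GPU instance
    API_USD_PER_M_OUTPUT_TOKENS = 0.60  # assumed API price per million output tokens
    SERVER_TOK_PER_SEC = 1500           # assumed aggregate throughput at your concurrency cap

    tokens_per_hour = SERVER_TOK_PER_SEC * 3600
    api_cost_per_hour = tokens_per_hour / 1e6 * API_USD_PER_M_OUTPUT_TOKENS

    print(f"self-hosted: ${RENTAL_USD_PER_HOUR:.2f}/h, API equivalent: ${api_cost_per_hour:.2f}/h")
    # With these assumptions the rental only wins if the box actually stays
    # busy near its cap; at low utilisation the per-token API is cheaper.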
Runpod outsources much of their infrastructure to small players that own GPUs. They have recently added some requirements on security and reliability (e.g. some level of security audit such as SOC 2, hosting in a real DC, a locked rack), but fundamentally they are leaning on small shops that slap some GPUs in a server at a colocation facility. That would personally make me nervous about running any sensitive workloads there.
My impression is that Cerebrium either owns their own GPU servers or they're outsourcing to one of the big players. They certainly don't have the "partner program" advertised on their site like Runpod does.
- Runpod is one of the cheapest, but that comes at the price of reliability (critical for businesses).
- We have better cold start performance, with something special launching soon here.
- Iterating on your application using CPUs/GPUs in the cloud takes just 2–10 seconds, compared to several minutes with Runpod due to Docker push/pull.
- We allow you to deploy in multiple regions globally for lower latency and data residency compliance.
- We provide a lot of software abstractions (fire-and-forget jobs, websockets, batching, etc.), whereas Runpod just deploys your Docker image.
- SOC 2 and GDPR compliant.
With all that being said, we are working on optimisations to bring down pricing.
klabb3•3h ago
Not disagreeing, but this is quite an expression.
Incipient•4h ago
It gives you flexibility if the provider isn't keeping pace with the market and it prevents the provider from jacking prices relative to its competitors.
Vendor lock-in is awful. Hypothetically, imagine how stuffed you'd be if your core virtualisation provider jacked prices 500%! You'd be really hurting.
...ohwait.