We experimented with a different approach.
Instead of pinning one model to one GPU, we:

• Stage model weights on fast local disk
• Load models into GPU memory only when requested
• Keep a small working set resident
• Evict inactive models aggressively
• Route everything through a single OpenAI-compatible endpoint
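Conceptually, the load-on-demand / evict loop is just an LRU policy over GPU-resident models. Here is a minimal sketch of that pattern, not the actual runtime; the class name, the `max_resident` knob, and the cache path are illustrative assumptions:

```python
# Minimal sketch of the load-on-demand / evict pattern described above.
# Not the actual implementation; names and the working-set heuristic are illustrative.
from collections import OrderedDict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class ModelCache:
    """Keep a small working set of models in GPU memory, evicting the
    least recently used one when the working set is full."""

    def __init__(self, max_resident: int = 3, cache_dir: str = "/fast-local-disk/models"):
        self.max_resident = max_resident   # size of the GPU-resident working set (assumed)
        self.cache_dir = cache_dir         # weights staged on fast local disk (assumed path)
        self.resident: "OrderedDict[str, tuple]" = OrderedDict()  # model_id -> (model, tokenizer)

    def get(self, model_id: str):
        # Hit: move to the back of the LRU order and reuse the warm copy.
        if model_id in self.resident:
            self.resident.move_to_end(model_id)
            return self.resident[model_id]

        # Miss: evict least recently used models until there is room,
        # then restore the requested model from local disk.
        while len(self.resident) >= self.max_resident:
            _, (evicted_model, _) = self.resident.popitem(last=False)
            del evicted_model              # drop the reference
            torch.cuda.empty_cache()       # release VRAM eagerly

        tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=self.cache_dir)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, cache_dir=self.cache_dir, torch_dtype=torch.float16
        ).to("cuda")
        self.resident[model_id] = (model, tokenizer)
        return self.resident[model_id]
```

A real runtime would also need per-model memory accounting, request queuing while a restore is in flight, and concurrency control, but the core eviction policy stays this simple.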
In our recent test setup (2×A6000, 48GB each), we made ~60 Hugging Face text models available for activation. Only a few are resident in VRAM at any given time; the rest are restored when needed.
Cold starts still exist. Larger models take seconds to restore. But because we avoid warm pools and dedicated GPUs per model, overall utilization improves significantly for light workloads.
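Since everything routes through a single OpenAI-compatible endpoint, client usage would look roughly like the sketch below; the base URL, API key, and model names are placeholders, not the real demo endpoint:

```python
# Hypothetical client usage against an OpenAI-compatible endpoint.
# The base URL, API key, and model IDs are placeholder assumptions.
from openai import OpenAI

client = OpenAI(base_url="https://your-inferx-host:8443/v1", api_key="not-needed")

# The first request to a model that is not resident pays the restore cost;
# subsequent requests hit the warm, GPU-resident copy.
for model_id in ["mistralai/Mistral-7B-Instruct-v0.2", "Qwen/Qwen2.5-7B-Instruct"]:
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": "Say hello in one sentence."}],
    )
    print(model_id, "->", resp.choices[0].message.content)
```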
Short demo here: https://m.youtube.com/watch?v=IL7mBoRLHZk
Live demo to play with: https://inferx.net:8443/demo/
If anyone here is running multi-model inference and wants to benchmark this approach with their own models, I’m happy to provide temporary access for testing.
verdverm•2h ago
Do you have something people can actually try? If not, Show HN is not appropriate; please see the first sentence here:
https://news.ycombinator.com/showhn.html
pveldandi•2h ago
You’re right that HN expects something runnable. We’re spinning up a public endpoint so people can test with their own models directly instead of requesting access. I’ll share it shortly. Thank you for the suggestion.
verdverm•1h ago
Really, it reads like it was written by someone new to startups / b2b copy. Welcome to first contact with users; time to iterate or pivot.
I would focus on design, aesthetics, and copy. Don't put any more effort into building until you have a message that resonates.
pveldandi•52m ago
If that’s not a problem space you care about, that’s totally fair. But for teams juggling many models with uneven traffic, that’s where the economics start to matter.
verdverm•29m ago
Are you going to make this open source? That's the modus operandi around AI and the usual path to adoption for those outside Big AI (where branding is already strong).
verdverm•38m ago
There is nothing to try or play with; it's just content.
verdverm•1h ago
In other words, how many middlemen do you think your TAM is?
You go on to say this is great for light workloads, because obviously at scale we run models very differently.
So who is this for in the end?
pveldandi•1h ago
This isn’t for single-model apps running steady traffic at high utilization. If you’re saturating GPUs 24/7, you’ll architect very differently.
This is for teams that…
• Serve many models with uneven traffic
• Run per-customer fine-tunes
• Offer model marketplaces
• Do evaluation / experimentation at scale
• Have spiky workloads
• Don’t want idle GPU burn between requests
A lot of SaaS AI products fall into that category. They aren’t OpenAI-scale. They’re running dozens of models with unpredictable demand.
Lambda exists because not every workload is steady state. Same idea here.
verdverm•1h ago
How do you know this? What are the numbers like?
> Lambda exists because not every workload is steady state
Vertex AI has all these models via API or hosting the same way. The same features are already available with my current cloud provider (traffic scaling, fine-tunes, all of the frontier and leading OSS models).
pveldandi•59m ago
What we focus on is the runtime layer underneath. You can run us behind Cloud Run or inside your existing GCP setup. The difference is at the GPU utilization level when you’re serving many models with uneven demand.
If your workload is steady and high volume on a small set of models, the standard cloud stack works well. If you’re juggling dozens of models with spiky traffic, the economics start to look very different.
As an example, some teams are currently testing us inside their existing GCP environments rather than replacing anything. The idea isn’t to swap out Cloud Run or Vertex, but to improve runtime efficiency underneath when serving many models with uneven demand.
verdverm•56m ago
I don't see anything you do that they don't already do for me. I suggest you do a deep dive on their offering as there seem to be gaps in your understanding of what features they have
> economics start to look very different
You need to put numbers to this. Comparing against API calls at per-token pricing is a required comparison imo, because that is the more popular alternative to model hot-swapping for spiky or heterogeneous workloads.
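For concreteness, the comparison being asked for has roughly the following shape; every number below is a placeholder assumption, not a measurement or a quoted price:

```python
# Back-of-envelope shape of the self-hosted vs per-token-API comparison.
# Every figure here is a placeholder assumption, not real data from any provider.
gpu_cost_per_hour = 1.50          # assumed hourly rate for one GPU
tokens_per_second = 1500          # assumed aggregate throughput while the GPU is busy
api_price_per_1m_tokens = 0.50    # assumed per-token API price, USD per 1M tokens

for utilization in (0.02, 0.10, 0.50):   # fraction of the hour the GPU actually serves traffic
    tokens_served = tokens_per_second * 3600 * utilization
    self_hosted_cost_per_1m = gpu_cost_per_hour / (tokens_served / 1_000_000)
    print(f"utilization {utilization:>4.0%}: "
          f"self-hosted ${self_hosted_cost_per_1m:.2f}/1M tokens "
          f"vs API ${api_price_per_1m_tokens:.2f}/1M tokens")
```

Multiplexing many low-traffic models onto shared GPUs is essentially an attempt to push that utilization figure up; whether it actually beats per-token API pricing depends on real restore latencies and traffic shapes, which is the data being requested here.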