> the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand — to say nothing of aggregate utilization. This is true even of sophisticated players, like the former Banana serverless GPU platform, which operated at an aggregate utilization of around 20%.
Saw this sort of thing at my last job. Was very frustrating pointing this out to people only for them to respond with ¯\_(ツ)_/¯. I posted a much less tactful article (read: rant) than the one by Modal, but I think it still touches on a lot of the little things you need to consider when deploying AI models: https://thelisowe.substack.com/p/you-suck-at-deploying-ai-mo...
From a capital expenditure perspective, you're still effectively renting the CPU you bought: the opportunity cost of leaving it idle is the rent.
What people do have some sense of is that there's real, ascribable value in keeping capacity in reserve, versus discovering you don't have it when you need it.
That’s not a concern with typical GPU workloads, which are batch/throughput-oriented.
One of the reasons why inference providers sell batch discounts.
And yes, people back in the day were actively concerned if their CPUs ever hit 100% - it never made sense then, and it doesn't make sense now.
Peaking at 90% over the monitoring interval does not mean you can fit 10% more load without compromises. It does not mean your CPU is oversized.
Non-zero concern is correct.
My CPU meters read 100% every day during setup time.
You're seriously telling people that a CPU working at 90% is processing faster than a CPU working at 100%?
Context switching, memory access, and the ability of the problem to be computed in parallel - sure, but CPU usage as the defining metric - seriously stop smoking that stuff.
If your CPU sits at 100% for several seconds then your task is almost certainly bottlenecked by the CPU part of the time. Therefore, if you use a faster CPU your task will get done faster.
So if we keep everything else the same, same hardware, same workload, same CPU design, only changing the clock speed of the CPU, then the CPU that reads 90% must be a faster CPU. Therefore it will reduce the bottlenecking, and your task will get done faster.
For the CPU upgrade to not matter, you'd have to have a task that is never bottlenecked by the CPU. That is very unlikely to be the case if your CPU is reading 100% for several seconds.
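To put rough numbers on it, here's a toy Amdahl's-law sketch with made-up fractions (not anyone's real workload):

```python
# Toy Amdahl's-law sketch with made-up numbers: a faster clock only speeds up
# the fraction of the task that is actually CPU-bound.
def overall_speedup(cpu_bound_fraction: float, clock_ratio: float) -> float:
    return 1 / ((1 - cpu_bound_fraction) + cpu_bound_fraction / clock_ratio)

# A task pegged at 100% is CPU-bound most of the time, so a 20% faster clock
# shows up almost directly in wall-clock time; a mostly-idle task barely moves.
print(overall_speedup(cpu_bound_fraction=0.9, clock_ratio=1.2))  # ~1.18x
print(overall_speedup(cpu_bound_fraction=0.1, clock_ratio=1.2))  # ~1.02x
```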
Edit to reply:
> Ok, so your grand solution to this is - faster CPUs process faster than slower CPUs. Wow. Who would have thunk it.
You were the one saying that a fast CPU is "oversized". So I explained in detail why avoiding 100% does not make your CPU oversized.
Yes it's obvious, glad you agree now.
> 100% of a CPU is faster than 90% of that same CPU.
It has more throughput but for most types of software it now has worse latency.
If you care about latency, then you don't want to increase throughput by pegging your CPU at 100%. You want to increase throughput by getting more/better CPUs and keeping them at a lower percent.
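A toy single-server queueing model (made-up service rate, assuming roughly random arrivals) shows why: throughput keeps rising as you push toward 100%, but average latency blows up.

```python
# Toy M/M/1 queue: average latency = 1 / (service_rate - arrival_rate),
# which explodes as utilization approaches 100%.
def avg_latency_ms(service_rate: float, utilization: float) -> float:
    arrival_rate = service_rate * utilization
    return 1000 / (service_rate - arrival_rate)

service_rate = 100.0  # requests/sec one core can handle (made-up number)
for u in (0.5, 0.9, 0.99):
    print(f"utilization {u:.0%}: avg latency {avg_latency_ms(service_rate, u):.0f} ms")
# 50% -> 20 ms, 90% -> 100 ms, 99% -> 1000 ms
```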
(Thanks for writing this btw!)
In the few months since I originally wrote this, I've come to an even greater appreciation of just how hard it is to maximize utilization of the Tensor Cores. It's a lot more than just kernel parameter tuning and using a few parallel programming tricks (parallel reduce, unrolling). It really borks your CUDA code -- you need warp specialization, you need to break warp uniformity, you need to work with explicit asynchrony. Hoping to write about this for the Modal blog/GPU Glossary soon!
I also spent a bit of time working with ncu/"NSight Compute". I'd probably include a bit about it in the section on how to improve your MFU if I rewrote the article today. But tl;dr use the profiling tool, Luke! And a good way to learn is to watch NVIDIA's GTC talks.
That said, I've also noticed even more cases where GPU kernel utilization is well below target. I think (and Horace He has argued) that that comes in part from optimized GEMMs running so fast on Tensor Cores that host overhead becomes the bottleneck (classic Amdahl). This unfortunately means more host logic needs to be compiled -- either graph-compiled as in torch.compile or moved into a compiled language.
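For the torch.compile route, the change itself is pretty mechanical; a minimal sketch with a made-up toy model (requires a CUDA GPU and PyTorch 2.x):

```python
import torch

# Made-up toy model: every eager-mode op pays host-side Python/dispatch cost,
# and when the GEMMs run fast on Tensor Cores that host overhead dominates.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().half()

x = torch.randn(8, 4096, device="cuda", dtype=torch.half)

# Graph-compiling the host logic removes per-op overhead between kernel
# launches, so the GPU spends less time waiting on the CPU.
compiled = torch.compile(model)

with torch.no_grad():
    y = compiled(x)
```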
The tricky thing here is that many GPU workloads saturate at least one of the resources on the GPU -- arithmetic throughput, memory bandwidth, thread slots, registers -- and so there's typically resource contention that leads to lowered throughput/increased latency for all parties.
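A roofline-style back-of-the-envelope check makes "which resource saturates first" concrete; the peak numbers below are rough H100 SXM figures, so treat them as assumptions:

```python
# Compare a kernel's arithmetic intensity (FLOPs per byte moved) against the
# machine balance to guess whether it hits the compute or bandwidth roof first.
PEAK_BF16_FLOPS = 989e12        # ~H100 SXM dense BF16 Tensor Core throughput
PEAK_HBM_BYTES_PER_S = 3.35e12  # ~H100 SXM HBM3 bandwidth

MACHINE_BALANCE = PEAK_BF16_FLOPS / PEAK_HBM_BYTES_PER_S  # ~295 FLOPs/byte

def bound(flops: float, bytes_moved: float) -> str:
    return "compute-bound" if flops / bytes_moved > MACHINE_BALANCE else "memory-bound"

n = 8192
# Square bf16 GEMM: 2*n^3 FLOPs, three n*n matrices of 2-byte elements.
print(bound(2 * n**3, 3 * n * n * 2))  # compute-bound
# Elementwise add over the same matrices: ~1 FLOP per element.
print(bound(n * n, 3 * n * n * 2))     # memory-bound
```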
And in a cloud (esp serverless/auto-scaling) computing context, the variety of GPU SKUs means you can often more easily right-size your workload onto whole replicas (on our platform, from one T4 up to 8 H100s per replica).
Or am I naive and my knowledge is outdated? I am genuinely curious what people see and what providers are capable of in 2025.
I haven’t tried it but if that’s the case it’s a game changer
It might help to remember that the training process is essentially a game of "how fast can we shove data into these GPUs", and having a GPU sit idle because the data can't get into it fast enough is a challenge people have been tackling since at least the P100 series. This has resulted in improvements on the GPUs as well as all the hardware around them. Getting data into the chips is one of the most efficient processes at this point.
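Most of that plumbing is exposed directly in the frameworks these days; a minimal PyTorch-flavored sketch of keeping the GPU fed (dataset and sizes are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset; the interesting part is the loader settings that keep the GPU fed.
dataset = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,      # prepare batches on CPU workers in parallel
    pin_memory=True,    # page-locked host buffers allow fast async host-to-device copies
    prefetch_factor=4,  # keep several batches queued ahead of the GPU
)

for features, labels in loader:
    # non_blocking copies overlap the transfer with compute on the previous batch
    features = features.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # ... forward/backward would go here ...
```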
I imagine in the real world that model usage follows a Zipfian distribution, i.e., a small number of models (<10) represent 95% of the machines for inference workloads. And, for those machines, you can just load the weights off of your ~40 Gbit Ethernet connection since they're never cycling.
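Quick arithmetic on that (checkpoint size and link speed are assumptions): pulling ~70 GB of weights over a 40 Gbit/s link takes on the order of 15 seconds, which amortizes to nothing for a replica that never cycles.

```python
# Rough transfer-time arithmetic, assuming a ~70 GB checkpoint and a 40 Gbit/s NIC.
weights_bytes = 70e9
link_bytes_per_sec = 40e9 / 8  # 40 Gbit/s ~= 5 GB/s

print(f"{weights_bytes / link_bytes_per_sec:.0f} s")  # ~14 s
```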
But for that last 5%, I feel like that's where it becomes important. If I'm running a weird, custom model and I want Lambda-like billing... what's the stack? Is the market big enough that people care? (And do most people just use LoRAs, which are much easier to hot swap?)
Training I imagine is a totally different ballpark because you're constantly checkpointing, transferring data at each step, etc, versus inference. That's a world I know a lot less about though!
It allocates a range of quants for a model across N devices, using DFS to find the ideal allocation for the given set of models. Ideal here meaning the most tokens per second and the least time to initialize the allocation. I keep track of memory capacity, PCIe bandwidth, and link bandwidth (including NVLink).
I intend to serve this behind an API using llamacpp, so you can send a request to the API and it will fetch the model to fulfill the request, or create a new allocation to accommodate it. Sort of like llama swap, but explicitly with the goal of enabling as many LLMs as you need to run on your hardware.
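The rough shape of the search looks something like this (a heavily simplified sketch, not the actual code; the real thing also weighs PCIe/NVLink bandwidth and init time):

```python
from dataclasses import dataclass

# Simplified DFS sketch: choose one quant per model and a device to host it,
# subject to per-device memory, maximizing a proxy score for tokens/sec.
@dataclass
class Quant:
    name: str
    mem_gb: float
    score: float  # proxy for expected tokens/sec

def best_allocation(models: list[list[Quant]], capacity_gb: list[float]):
    best = {"score": -1.0, "placement": None}

    def dfs(i, free, placement, score):
        if i == len(models):
            if score > best["score"]:
                best["score"], best["placement"] = score, list(placement)
            return
        for quant in models[i]:
            for dev, avail in enumerate(free):
                if quant.mem_gb <= avail:
                    free[dev] -= quant.mem_gb
                    placement.append((dev, quant))
                    dfs(i + 1, free, placement, score + quant.score)
                    placement.pop()
                    free[dev] += quant.mem_gb

    dfs(0, list(capacity_gb), [], 0.0)
    return best["placement"], best["score"]
```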
Anyways, I just bring this up because I’m curious if anyone else has done something like this? Or if it’s a problem worth solving? My dream is to take it out of my bedroom server and run it on something like Modal.
You can check us out at https://yeet.cx
Here's an overview of our GPU-specific solution.
> Graphics Processing Units, or GPUs, are the hottest mathematical co-processor since the FM synthesis chips that shaped the sounds of the 1990s
That dating seems a bit off, since FM was more of an 80s thing. Even their linked comment says
> Throughout the 90s FM was old-hat. Nobody wanted to hear those woody clangy sounds of the 80s anymore.
FM synthesis has remained in use for specific applications ever since, but the zeitgeist of the 90s (and its modern postmodern retreads like vaporwave) is arguably digital sampling.
Will replace with the lore-accurate "late 1900s".