> the majority of organizations achieve less than 70% GPU Allocation Utilization when running at peak demand — to say nothing of aggregate utilization. This is true even of sophisticated players, like the former Banana serverless GPU platform, which operated at an aggregate utilization of around 20%.
Saw this sort of thing at my last job. Was very frustrating pointing this out to people only for them to respond with ¯\_(ツ)_/¯. I posted a much less tactful article (read: rant) than the one by Modal, but I think it still touches on a lot of the little things you need to consider when deploying AI models: https://thelisowe.substack.com/p/you-suck-at-deploying-ai-mo...
From a capital expenditure perspective, you're still effectively renting the CPU you bought: the opportunity cost of leaving it idle is the rent.
What people do have some sense of is that there's real, ascribable value in keeping capacity in reserve, versus discovering you don't have it when you need it.
That’s not a concern with typical GPU workloads, which are batch/throughput-oriented.
One of the reasons why inference providers sell batch discounts.
And yes, people back in the day were actively concerned if their CPUs ever hit 100% - it never made sense then, and it doesn't make sense now.
Peaking at 90% over the monitoring interval does not mean you can fit 10% more load without compromises. It does not mean your CPU is oversized.
Non-zero concern is correct.
My CPU meters read 100% every day during setup time.
You're seriously telling people that a CPU working at 90% is processing faster than a CPU working at 100%?
Context switching, memory access, and the ability of the problem to be computed in parallel - sure, but CPU usage as the defining metric - seriously stop smoking that stuff.
If your CPU sits at 100% for several seconds then your task is almost certainly bottlenecked by the CPU part of the time. Therefore, if you use a faster CPU your task will get done faster.
So if we keep everything else the same, same hardware, same workload, same CPU design, only changing the clock speed of the CPU, then the CPU that reads 90% must be a faster CPU. Therefore it will reduce the bottlenecking, and your task will get done faster.
For the CPU upgrade to not matter, you'd have to have a task that is never bottlenecked by the CPU. That is very unlikely to be the case if your CPU is reading 100% for several seconds.
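To put rough numbers on it, here's a toy Amdahl's-law sketch with made-up fractions (not anyone's real workload):

```python
# Toy Amdahl's-law sketch with made-up numbers: a faster clock only speeds up
# the fraction of the task that is actually CPU-bound.
def overall_speedup(cpu_bound_fraction: float, clock_ratio: float) -> float:
    return 1 / ((1 - cpu_bound_fraction) + cpu_bound_fraction / clock_ratio)

# A task pegged at 100% is CPU-bound most of the time, so a 20% faster clock
# shows up almost directly in wall-clock time; a mostly-idle task barely moves.
print(overall_speedup(cpu_bound_fraction=0.9, clock_ratio=1.2))  # ~1.18x
print(overall_speedup(cpu_bound_fraction=0.1, clock_ratio=1.2))  # ~1.02x
```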
Edit to reply:
> Ok, so your grand solution to this is - faster CPUs process faster than slower CPUs. Wow. Who would have thunk it.
You were the one saying that a fast CPU is "oversized". So I explained in detail why avoiding 100% does not make your CPU oversized.
Yes it's obvious, glad you agree now.
> 100% of a CPU is faster than 90% of that same CPU.
It has more throughput but for most types of software it now has worse latency.
If you care about latency, then you don't want to increase throughput by pegging your CPU at 100%. You want to increase throughput by getting more/better CPUs and keeping them at a lower percent.
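A toy single-server queueing model (made-up service rate, assuming roughly random arrivals) shows why: throughput keeps rising as you push toward 100%, but average latency blows up.

```python
# Toy M/M/1 queue: average latency = 1 / (service_rate - arrival_rate),
# which explodes as utilization approaches 100%.
def avg_latency_ms(service_rate: float, utilization: float) -> float:
    arrival_rate = service_rate * utilization
    return 1000 / (service_rate - arrival_rate)

service_rate = 100.0  # requests/sec one core can handle (made-up number)
for u in (0.5, 0.9, 0.99):
    print(f"utilization {u:.0%}: avg latency {avg_latency_ms(service_rate, u):.0f} ms")
# 50% -> 20 ms, 90% -> 100 ms, 99% -> 1000 ms
```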
(Thanks for writing this btw!)
In the few months since I originally wrote this, I've come to an even greater appreciation of just how hard it is to maximize utilization of the Tensor Cores. It's a lot more than just kernel parameter tuning and using a few parallel programming tricks (parallel reduce, unrolling). It really borks your CUDA code -- you need warp specialization, you need to break warp uniformity, you need to work with explicit asynchrony. Hoping to write about this for the Modal blog/GPU Glossary soon!
I also spent a bit of time working with ncu/"NSight Compute". I'd probably include a bit about it in the section on how to improve your MFU if I rewrote the article today. But tl;dr use the profiling tool, Luke! And a good way to learn is to watch NVIDIA's GTC talks.
That said, I've also noticed even more cases where GPU kernel utilization is well below target. I think (and Horace He has argued) that that comes in part from optimized GEMMs running so fast on Tensor Cores that host overhead becomes the bottleneck (classic Amdahl). This unfortunately means more host logic needs to be compiled -- either graph-compiled as in torch.compile or moved into a compiled language.
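For the torch.compile route, the change itself is pretty mechanical; a minimal sketch with a made-up toy model (requires a CUDA GPU and PyTorch 2.x):

```python
import torch

# Made-up toy model: every eager-mode op pays host-side Python/dispatch cost,
# and when the GEMMs run fast on Tensor Cores that host overhead dominates.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
).cuda().half()

x = torch.randn(8, 4096, device="cuda", dtype=torch.half)

# Graph-compiling the host logic removes per-op overhead between kernel
# launches, so the GPU spends less time waiting on the CPU.
compiled = torch.compile(model)

with torch.no_grad():
    y = compiled(x)
```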
The tricky thing here is that many GPU workloads saturate at least one of the resources on the GPU -- arithmetic throughput, memory bandwidth, thread slots, registers -- and so there's typically resource contention that leads to lowered throughput/increased latency for all parties.
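A roofline-style back-of-the-envelope check makes "which resource saturates first" concrete; the peak numbers below are rough H100 SXM figures, so treat them as assumptions:

```python
# Compare a kernel's arithmetic intensity (FLOPs per byte moved) against the
# machine balance to guess whether it hits the compute or bandwidth roof first.
PEAK_BF16_FLOPS = 989e12        # ~H100 SXM dense BF16 Tensor Core throughput
PEAK_HBM_BYTES_PER_S = 3.35e12  # ~H100 SXM HBM3 bandwidth

MACHINE_BALANCE = PEAK_BF16_FLOPS / PEAK_HBM_BYTES_PER_S  # ~295 FLOPs/byte

def bound(flops: float, bytes_moved: float) -> str:
    return "compute-bound" if flops / bytes_moved > MACHINE_BALANCE else "memory-bound"

n = 8192
# Square bf16 GEMM: 2*n^3 FLOPs, three n*n matrices of 2-byte elements.
print(bound(2 * n**3, 3 * n * n * 2))  # compute-bound
# Elementwise add over the same matrices: ~1 FLOP per element.
print(bound(n * n, 3 * n * n * 2))     # memory-bound
```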
And in a cloud (esp serverless/auto-scaling) computing context, the variety of GPU SKUs means you can often more easily right-size your workload onto whole replicas (on our platform, from one T4 up to 8 H100s per replica).
Or am I naive and my knowledge is outdated? I am genuinely curious what people see and what providers are capable of in 2025.
I haven’t tried it but if that’s the case it’s a game changer
It might help to remember that the training process is essentially a game of "how fast can we shove data into these GPUs", and having a GPU sit idle because the data can't get into it fast enough is a challenge people have been tackling since at least the P100 series. This has resulted in improvements on the GPUs as well as all the hardware around them. Getting data into the chips is one of the most efficient processes at this point.
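Most of that plumbing is exposed directly in the frameworks these days; a minimal PyTorch-flavored sketch of keeping the GPU fed (dataset and sizes are made up):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Made-up dataset; the interesting part is the loader settings that keep the GPU fed.
dataset = TensorDataset(torch.randn(10_000, 1024), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,      # prepare batches on CPU workers in parallel
    pin_memory=True,    # page-locked host buffers allow fast async host-to-device copies
    prefetch_factor=4,  # keep several batches queued ahead of the GPU
)

for features, labels in loader:
    # non_blocking copies overlap the transfer with compute on the previous batch
    features = features.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # ... forward/backward would go here ...
```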
I imagine in the real world that model usage follows a Zipfian distribution, i.e., a small number of models (<10) represent 95% of the machines for inference workloads. And, for those machines, you can just load the weights off of your ~40 Gbit Ethernet connection since they're never cycling.
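Quick arithmetic on that (checkpoint size and link speed are assumptions): pulling ~70 GB of weights over a 40 Gbit/s link takes on the order of 15 seconds, which amortizes to nothing for a replica that never cycles.

```python
# Rough transfer-time arithmetic, assuming a ~70 GB checkpoint and a 40 Gbit/s NIC.
weights_bytes = 70e9
link_bytes_per_sec = 40e9 / 8  # 40 Gbit/s ~= 5 GB/s

print(f"{weights_bytes / link_bytes_per_sec:.0f} s")  # ~14 s
```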
But for that last 5%, I feel like that's where it becomes important. If I'm running a weird, custom model and I want Lambda-like billing... what's the stack? Is the market big enough that people care? (And do most people just use LoRAs, which are much easier to hot swap?)
Training I imagine is a totally different ballpark because you're constantly checkpointing, transferring data at each step, etc, versus inference. That's a world I know a lot less about though!
It allocates a range of quants for a model across N devices, using DFS to find the ideal allocation for the given set of models. Ideal here meaning the most tokens per second and the least time to initialize the allocation. I keep track of memory capacity, PCIe bandwidth, and link bandwidth (including NVLink).
I intend to serve this behind an API using llamacpp, so you can send a request to the API and it will fetch the model to fulfill the request, or create a new allocation to accommodate it. Sort of like llama swap, but explicitly with the goal of enabling as many LLMs as you need to run on your hardware.
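The rough shape of the search looks something like this (a heavily simplified sketch, not the actual code; the real thing also weighs PCIe/NVLink bandwidth and init time):

```python
from dataclasses import dataclass

# Simplified DFS sketch: choose one quant per model and a device to host it,
# subject to per-device memory, maximizing a proxy score for tokens/sec.
@dataclass
class Quant:
    name: str
    mem_gb: float
    score: float  # proxy for expected tokens/sec

def best_allocation(models: list[list[Quant]], capacity_gb: list[float]):
    best = {"score": -1.0, "placement": None}

    def dfs(i, free, placement, score):
        if i == len(models):
            if score > best["score"]:
                best["score"], best["placement"] = score, list(placement)
            return
        for quant in models[i]:
            for dev, avail in enumerate(free):
                if quant.mem_gb <= avail:
                    free[dev] -= quant.mem_gb
                    placement.append((dev, quant))
                    dfs(i + 1, free, placement, score + quant.score)
                    placement.pop()
                    free[dev] += quant.mem_gb

    dfs(0, list(capacity_gb), [], 0.0)
    return best["placement"], best["score"]
```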
Anyways, I just bring this up because I’m curious if anyone else has done something like this? Or if it’s a problem worth solving? My dream is to take it out of my bedroom server and run it on something like Modal.
You can check us out at https://yeet.cx
Here's an overview of our GPU-specific solution.
> Graphics Processing Units, or GPUs, are the hottest mathematical co-processor since the FM synthesis chips that shaped the sounds of the 1990s
That dating seems a bit off, since FM was more of an 80s thing. Even their linked comment says
> Throughout the 90s FM was old-hat. Nobody wanted to hear those woody clangy sounds of the 80s anymore.
FM synthesis has remained in use for specific applications ever since, but the zeitgeist of the 90s (and its modern postmodern retreads like vaporwave) is arguably digital sampling.
Will replace with the lore-accurate "late 1900s".