This is true, but I've never heard anyone refer to this as a GPU bubble before.
I think most people hear "GPU bubble" and think of a financial bubble of some kind.
Better term, anyone?
Very odd, but perhaps more familiar to graphics programmers? I will say I'd probably call it a stall, which is exactly what the Vulkan docs call it moments later, so :shrug:
This appears to be different than the recent "Speculative Pipeline Decoding" paper: https://arxiv.org/abs/2605.30852
blueblazin•49m ago
To work in LLM training/inference you’re expected to know this stuff but to know this stuff you need to be working in the space.
rjzzleep•30m ago
First, where do you know exactly what the optimal VRAM assignment per model, per context size is, which seems to be currently based purely on experience and second how do you make sure that only that amount is available to your infra/containers, which is being handled by DRA and stuff like https://project-hami.io
While only tangentially related to the blog post here. The title is picked in such a way that I couldn't help, but put the shameless plug here. When he wrote popping the bubble, I thought we're talking about devices and reducing NVIDIA dependency, but this seems very focused on Cuda.
Disclaimer: I work with Dynamia.ai, the founders of which created HAMi.
radq•23m ago