This is the result of that question itself plus some weekend vibecoding (the linked library repository is in the readme as well). It seems to work, even on consumer GPUs, though it should work better on professional ones.
Nice work. PCI-P2P (GPU-Direct (tm)) is such great stuff. Cool to see!
One write-up suggested it is theoretically possible to modify a piece of SGLang's routing layer to support JIT predict-ahead expert swaps from Gen5 NVMe storage straight into GPU memory.
I'm hoping that proves true. The setup relies on NVIDIA Dynamo, so NIXL primitives are available to support that.
Curious if anyone's tried this already.
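Rough sketch of what that predict-ahead swap could look like in plain PyTorch (this is not the Dynamo/NIXL path; the pinned-host staging buffer stands in for a real NVMe-to-VRAM transfer, and the file/buffer names are made up):

```python
import torch

# Side stream so the expert copies overlap with compute on the default stream.
copy_stream = torch.cuda.Stream()

def prefetch_experts(expert_files, staging_bufs, vram_bufs):
    """Start async copies for the experts we predict the router will pick next.

    expert_files: paths to per-expert weight tensors on NVMe (assumed layout)
    staging_bufs: pre-allocated pinned host tensors matching the expert shapes
    vram_bufs:    pre-allocated destination tensors already resident on the GPU
    """
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        for path, stage, dst in zip(expert_files, staging_bufs, vram_bufs):
            stage.copy_(torch.load(path, mmap=True))  # NVMe -> pinned host
            dst.copy_(stage, non_blocking=True)        # pinned host -> VRAM
        done.record(copy_stream)
    return done

# evt = prefetch_experts(predicted_files, pinned_bufs, gpu_bufs)
# y = layer(x)                                  # compute layer L meanwhile
# torch.cuda.current_stream().wait_event(evt)   # only stalls if the copies ran late
```

The win only shows up if the prediction is made early enough and is mostly right; a mispredict degenerates into a synchronous load.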
I've also wondered why the routers aren't trained to be serially consistent, so you could predict layers to swap into VRAM a few layers ahead and maximize the available bandwidth.
1) This is basically the intention of several recent MoE models: keep particular generally useful experts hot in VRAM.
2) Unless you can swap layers in faster than you consume them, there is no point in predicting layers (what does that even mean, really? Did you mean predicting experts?).
It seems that at the moment the best you can do is keep the experts and layers more likely to be used for a given query in VRAM and offload the rest, but that is workload-dependent.
Unless you're handling that in some kind of fancy way, you'll be holding up the batch while waiting on host memory, which will kill your throughput (see the rough numbers sketched below).
It makes much more sense for non-batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.
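To put rough numbers on point 2, here is a back-of-envelope check with made-up figures (expert dimensions, link speed, and per-layer compute time are all assumptions, not measurements):

```python
# Will streaming one layer's cold experts finish before the GPU is done with the
# previous layer? All figures below are illustrative assumptions.
bytes_per_expert  = 2 * 3 * 2048 * 1408   # fp16 gate/up/down projections, made-up dims (~17 MB)
experts_per_layer = 8                     # assumed number of routed experts to fetch
link_bytes_per_s  = 12e9                  # ~12 GB/s effective NVMe/PCIe bandwidth, assumed
layer_compute_ms  = 2.0                   # assumed GPU time per layer

swap_ms = bytes_per_expert * experts_per_layer / link_bytes_per_s * 1e3
print(f"swap-in ~{swap_ms:.0f} ms vs ~{layer_compute_ms:.0f} ms of compute -> "
      f"{'prefetch hides it' if swap_ms <= layer_compute_ms else 'the batch stalls'}")
```

With these numbers the copy takes ~12 ms against ~2 ms of compute, so the batch stalls exactly as described above; predicting ahead only helps once the per-layer transfer fits inside the per-layer compute window.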
tyfon•1h ago
But 5 seconds/token is quite slow, yeah. I guess this is for low-RAM machines? I'm pretty sure my 5950X with 128 GB of RAM can run this faster on the CPU, with some layers / prefill on the 3060 GPU I have.
I also see that they claim the process is compute-bound at 2 seconds/token, but that doesn't seem correct with a 3090?
tgrowazay•41m ago
DDR4 tops out at about 27 GB/s.
DDR5 can do around 40 GB/s.
So for a 70B model at 8-bit quant, where every token has to stream all ~70 GB of weights, you will get around 0.3-0.5 tokens per second using RAM alone.
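For reference, that estimate is just bandwidth divided by the bytes you have to read per token, since bandwidth-bound decode streams every weight once per token:

```python
# tokens/s ~= memory bandwidth / model size for bandwidth-bound decode
model_bytes = 70e9                         # 70B params at 8-bit quant ~= 70 GB
for name, bw in [("DDR4", 27e9), ("DDR5", 40e9)]:
    print(f"{name}: ~{bw / model_bytes:.2f} tokens/s")   # ~0.39 and ~0.57
```

which lands in the same ballpark as the 0.3-0.5 figure above.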