This is the result of that question and some weekend vibecoding (the linked library repository is in the readme as well). It seems to work, even on consumer GPUs, though it should work better on professional ones.
Nice work. PCIe P2P (GPUDirect™) is such great stuff. Cool to see!
One writeup suggested it's theoretically possible to modify a piece of SGLang's routing layer to support JIT predict-ahead expert swaps from Gen5 NVMe storage straight into GPU memory.
I'm hoping that proves true. The setup relies on NVIDIA Dynamo, so NIXL primitives are available to support that.
Curious if anyone's tried this already.
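For anyone curious what that could look like, here's a minimal sketch of the predict-ahead idea (not from the actual repo; `load_expert_to_gpu` is a hypothetical stand-in for whatever NVMe-to-GPU primitive you'd use, e.g. a NIXL transfer or a cuFile read):

    # Hypothetical sketch of "predict-ahead expert swaps" for an MoE router.
    # load_expert_to_gpu is a stand-in for the actual NVMe -> GPU transfer
    # primitive (a NIXL transfer, a cuFile read, ...); it is NOT a real
    # SGLang or Dynamo API.
    from collections import OrderedDict

    class ExpertPrefetcher:
        def __init__(self, capacity, load_expert_to_gpu):
            self.capacity = capacity        # max experts resident in VRAM
            self.load = load_expert_to_gpu  # NVMe -> GPU copy (assumed async-capable)
            self.resident = OrderedDict()   # expert_id -> GPU buffer, in LRU order

        def prefetch(self, predicted_expert_ids):
            # Called by the router with its guess for the *next* token's
            # experts, so transfers overlap with the current token's compute.
            for eid in predicted_expert_ids:
                if eid in self.resident:
                    self.resident.move_to_end(eid)     # refresh LRU position
                    continue
                if len(self.resident) >= self.capacity:
                    self.resident.popitem(last=False)  # evict least recently used
                self.resident[eid] = self.load(eid)

The win only materializes if the router's prediction lands before the next token's forward pass needs the expert; otherwise you're just serializing NVMe reads into the decode loop.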
randomtoast•1h ago
tyfon•42m ago
But 5 seconds/token is quite slow, yeah. I guess this is for low-RAM machines? I'm pretty sure my 5950X with 128 GB RAM can run this faster on the CPU, with some layers / prefill on the 3060 GPU I have.
I also see that they claim the process is compute-bound at 2 seconds/token, but that doesn't seem right for a 3090?
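Quick sanity check on the compute-bound claim, assuming a dense ~70B model and ~35 TFLOPS of dense FP16 on a 3090 (both rough assumed numbers, not measured):

    # ~2 FLOPs per parameter per decoded token is the usual rule of thumb.
    params = 70e9                       # dense ~70B model (assumed)
    flops_per_token = 2 * params        # forward-pass FLOPs per token
    gpu_flops = 35e12                   # RTX 3090, dense FP16 (approx.)
    print(flops_per_token / gpu_flops)  # ~0.004 s of pure compute per token

Pure compute is on the order of milliseconds per token, so at 2 s/token the bottleneck is almost certainly data movement, not the GPU's ALUs.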
tgrowazay•12m ago
DDR4 tops out around 27 GB/s.
DDR5 can do around 40 GB/s.
So for a 70B model at 8-bit quant, you'll get around 0.3-0.5 tokens per second using RAM alone.
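The arithmetic, assuming every decoded token has to stream the full ~70 GB of weights from RAM (bandwidth numbers from above):

    # tokens/s upper bound ~= memory bandwidth / bytes read per token
    model_bytes = 70e9                                    # 70B params at 8-bit ~= 70 GB
    for name, bw in [("DDR4", 27e9), ("DDR5", 40e9)]:     # GB/s figures from above
        print(name, round(bw / model_bytes, 2), "tok/s")  # DDR4 ~0.39, DDR5 ~0.57

Those are ceilings; real-world overhead pulls you down into the quoted 0.3-0.5 range.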
Wuzado•14m ago