I benchmarked ROLV against dense cuBLAS on the actual Llama 4 Maverick MoE expert FFN layer (up_proj, 16384×5120, bfloat16) pulled directly from HuggingFace (model-00001-of-00084.safetensors).
Numbers (Batch=512, 1000 iters, NVIDIA B200):
Tokens/s: 369K (cuBLAS) → 7.66M (ROLV) — 20.7x faster
TFLOPS (effective): 62 → 1,285 — 20.7x
Time to First Token: 64.8ms → 0.37ms — 177x faster
Energy: 232J → 43J — 81.5% savings
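For anyone checking the arithmetic, the effective-TFLOPS figures follow directly from the tokens/s numbers and the dense FLOP count of the 16384×5120 up_proj. This is a minimal sketch; the convention of 2·K·N multiply-add FLOPs per token is my assumption about how "effective" was computed, not taken from the benchmark code:

```python
# Effective TFLOPS: count FLOPs as if the full dense multiply were done,
# then divide by wall-clock time (expressed here via tokens/s).
K, N = 5120, 16384            # up_proj: in_features x out_features
FLOPS_PER_TOKEN = 2 * K * N   # one multiply + one add per weight element

def effective_tflops(tokens_per_s: float) -> float:
    return tokens_per_s * FLOPS_PER_TOKEN / 1e12

print(effective_tflops(369e3))   # cuBLAS baseline, ~62
print(effective_tflops(7.66e6))  # ROLV, ~1285
```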
ROLV exploits structured sparsity in MoE expert weights to skip large blocks of computation entirely, while producing canonically equivalent output (hash-verified). The TFLOPS figure is "effective": it counts the FLOPs of the full dense multiply, so the 1,285 TFLOPS doesn't violate hardware peak; it reflects how much work was avoided.
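ROLV's kernel isn't public, so this is purely an illustrative sketch of the general technique (skip any weight block that is entirely zero, and the result is still exactly the dense product). All names and the block size here are hypothetical, and a real kernel would precompute the block mask and run on the GPU:

```python
import numpy as np

def block_sparse_matvec(W, x, block=4):
    """Compute W @ x, skipping weight blocks that are entirely zero.

    Illustrative only: shows how skipped work still yields the
    canonically equivalent dense result.
    """
    n_rows, n_cols = W.shape
    y = np.zeros(n_rows, dtype=W.dtype)
    skipped = 0
    for i in range(0, n_rows, block):
        for j in range(0, n_cols, block):
            blk = W[i:i + block, j:j + block]
            if not blk.any():          # whole block is zero: skip it
                skipped += 1
                continue
            y[i:i + block] += blk @ x[j:j + block]
    return y, skipped

# Toy structured-sparse matrix: half of its 4x4 blocks are zero.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)).astype(np.float32)
W[:4, 4:] = 0.0
W[4:, :4] = 0.0
x = rng.standard_normal(8).astype(np.float32)

y, skipped = block_sparse_matvec(W, x)
assert np.allclose(y, W @ x, atol=1e-5)  # same output, less work done
```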
The TTFT speedup (177x) is especially relevant for interactive inference: MoE models spend a huge fraction of first-token latency in these expert projections, and collapsing that from 65ms to 0.4ms per layer changes what's possible for real-time applications.
Output was verified with norm hashes at both ends (baseline and ROLV) plus a canonical equivalence check. The weights are real, not synthetic.
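For reference, a norm-hash check of the kind described could look like the following. This is a hypothetical sketch (the actual validation kit may implement it differently): hash a rounded L2 norm of each output so that matching results produce matching digests.

```python
import hashlib
import math

def norm_hash(values, sig_digits: int = 6) -> str:
    """SHA-256 digest of the rounded L2 norm: equal norms -> equal digests."""
    norm = math.sqrt(math.fsum(v * v for v in values))
    return hashlib.sha256(f"{norm:.{sig_digits}e}".encode()).hexdigest()

baseline = [0.5, -1.25, 3.0]
candidate = [0.5, -1.25, 3.0]   # e.g. the ROLV output, here identical
assert norm_hash(baseline) == norm_hash(candidate)
```

A norm hash is a cheap necessary condition, not a proof of elementwise equality; that is presumably why a separate canonical check is run as well.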
Setup: PyTorch 2.8.0+cu128, CUDA 12.8, Python 3.12, NVIDIA B200.
Comments
heggenhougen•2h ago
Happy to answer questions. Quick note on methodology: the TFLOPS figure is effective (computed as if doing the full dense multiply) — ROLV doesn't violate hardware peak, it avoids work entirely via structured sparsity. Weights are pulled directly from HuggingFace, output verified with norm hashes and a canonical check. If you want to run a baseline on your own hardware, there's a validation kit at rolv.ai.