Why this matters: NVIDIA's TRT-LLM explicitly blocks desktop Blackwell from FP4 — the error literally says "FP4 Gemm not supported before Blackwell, nor GeForce Blackwell." The RTX 5090, PRO 6000, and DGX Spark all use SM120 — same FP4 tensor cores as the B100/B200 datacenter chips (SM100). The lock is artificial product segmentation, not a hardware limitation.
CUTLASS 4.2+ already ships SM120 FP4 kernels. They're compiled into vLLM. The problem is purely dispatch logic — Python-level capability checks that only recognize SM100, not SM120.
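The shape of the fix is roughly the following. This is a hypothetical sketch, not vLLM's actual code — the real check lives inside vLLM's quantization backend selection and the actual PR may look different:

```python
# Hypothetical sketch of the dispatch fix (names are illustrative,
# not vLLM's real internals). vLLM gates FP4 on compute capability;
# SM120 just needs to be accepted alongside SM100.

def cutlass_fp4_supported(capability: tuple[int, int]) -> bool:
    """Return True if this GPU generation has FP4 tensor cores.

    `capability` is (major, minor), the format returned by
    torch.cuda.get_device_capability().
    """
    sm = capability[0] * 10 + capability[1]
    # Before the fix: `return sm == 100` -- datacenter Blackwell only.
    # After: also accept SM120 (GeForce / workstation Blackwell).
    return sm in (100, 120)

print(cutlass_fp4_supported((12, 0)))  # SM120: RTX 5090 / PRO 6000
print(cutlass_fp4_supported((9, 0)))   # SM90: Hopper, no FP4 tensor cores
```

The point is that nothing below this check has to change: the SM120 CUTLASS kernels are already compiled in.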
Setup (vLLM 0.17.0, stable pip install):
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --port 8100 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --compilation-config '{"cudagraph_mode": "piecewise"}'
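Once the server reports ready, it speaks the standard OpenAI-compatible API on the port given above. A quick smoke test just POSTs a completion request (prompt and max_tokens here are arbitrary; the model field must match --model exactly):

```python
# Build the request body for the vLLM OpenAI-compatible endpoint.
# Send it with requests/curl once the server is up; this snippet
# only constructs and prints it.
import json

payload = {
    "model": "Sehyo/Qwen3.5-122B-A10B-NVFP4",  # must match --model
    "prompt": "Explain NVFP4 in one sentence.",
    "max_tokens": 128,
}
print("POST http://localhost:8100/v1/completions")
print(json.dumps(payload, indent=2))
```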
Key gotchas: (1) Do NOT pass the --quantization flag; the model uses the compressed-tensors format and vLLM auto-detects it. (2) Full CUDA graphs OOM; use piecewise mode (31 tok/s vs 12 tok/s eager). (3) Python 3.14 breaks numba; stick with 3.13.
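Gotcha (1) works because the quantization method is declared in the checkpoint's own config.json, which vLLM reads at load time. A minimal sketch of what the auto-detection keys off — the exact keys follow the Hugging Face quantization_config convention, so treat the field names as illustrative:

```python
# Sketch of why --quantization must be omitted: the method comes
# from the checkpoint's config.json, not from a CLI flag.
# Field names follow the HF quantization_config convention (illustrative).
import json

def detect_quant_method(config: dict) -> "str | None":
    qc = config.get("quantization_config")
    return qc.get("quant_method") if qc else None

# Minimal stand-in for an NVFP4 compressed-tensors config.json:
cfg = {"quantization_config": {"quant_method": "compressed-tensors"}}
print(detect_quant_method(cfg))  # compressed-tensors
print(detect_quant_method({}))   # None -> unquantized model
```

Passing --quantization with a conflicting value fights this auto-detection instead of helping it.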
Results: 31 tok/s on 1 GPU, vs 54 tok/s on 2 GPUs with Q8_0 llama.cpp. Half the hardware, ~60% of the speed, ~98% of the quality.
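Worth normalizing those numbers per card: the single-GPU FP4 setup actually delivers more throughput per GPU than the dual-GPU Q8_0 setup.

```python
# Per-GPU throughput from the numbers above.
fp4_single = 31 / 1  # vLLM NVFP4, one GPU
q8_dual = 54 / 2     # llama.cpp Q8_0, two GPUs
print(f"FP4, 1 GPU:  {fp4_single:.0f} tok/s per GPU")
print(f"Q8_0, 2 GPUs: {q8_dual:.0f} tok/s per GPU")
print(f"total-speed ratio: {31 / 54:.0%}")
```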
The broader point: SM120 and SM100 share the same FP4 tensor core architecture. CUTLASS has the kernels. The frameworks just need to route SM120 to them. A 122B MoE model on a single desktop GPU at 31 tok/s was datacenter-only six months ago.
Relevant issues: vLLM #33416, SGLang #18954, CUTLASS #2800. We're submitting a PR (~10 lines of Python).