I finally got some useful signal on what throughput you can expect from the DGX Station machines. A bit of news broke at the AI Engineer conference today.
Would have preferred Kimi 2.7 Code numbers, but 2.5 was what I could get.
Model: Kimi 2.5, 1.1T params
Throughput: 40-50 tok/s total output across all users
Notes: NVIDIA rep number; about 595GB of model weights; we still need the benchmark conditions

Model: Nemotron Ultra, 550B
Throughput: ~35 tok/s at concurrency 1; scales to 4-5 concurrent users
Notes: NVIDIA rep number; useful because it includes a concurrency claim

Model: GLM-5.2-REAP, 504B
Throughput: ~60 tok/s
Notes: Public 0xSero number from AI Engineer; Alec Fong says an earlier GLM NVFP4 attempt was ~25 tok/s; still missing exact quant, prefill, context, and memory residency/concurrency details
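For a rough sanity check on the Kimi 2.5 numbers, here is a minimal Python sketch of two back-of-envelope calculations: the average bits per parameter implied by ~595GB of weights for 1.1T params, and what 40-50 tok/s of total output would mean per user. The assumptions are mine, not NVIDIA's: GB is taken as 10^9 bytes, the aggregate is assumed to split evenly across users, and the user counts are hypothetical since no concurrency figure was given for Kimi.

```python
def bits_per_param(weight_bytes: float, n_params: float) -> float:
    """Average storage per parameter, in bits."""
    return weight_bytes * 8 / n_params


def per_user_tok_s(total_tok_s: float, users: int) -> float:
    """Even split of aggregate output throughput across users (an assumption)."""
    if users < 1:
        raise ValueError("users must be >= 1")
    return total_tok_s / users


# Kimi 2.5: ~595GB of weights, 1.1T params (GB assumed to be 1e9 bytes).
kimi_bits = bits_per_param(595e9, 1.1e12)
print(f"Kimi 2.5: ~{kimi_bits:.2f} bits/param")

# Hypothetical user counts; the quoted 40-50 tok/s is the total across all users.
for users in (1, 2, 4):
    low = per_user_tok_s(40, users)
    high = per_user_tok_s(50, users)
    print(f"{users} user(s): {low:.1f}-{high:.1f} tok/s each")
```

This works out to a little over 4 bits per parameter, which would be consistent with a roughly 4-bit weight format, though the exact quant was not stated. The per-user figures only hold if throughput divides evenly, which is one more reason the benchmark conditions matter.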
I also learned a lot about what it costs and when it's shipping.
connorturland•1h ago
Full writeup at the link