Running gpt-oss-120b on an RTX 5090 with 2/3 of the experts offloaded to system RAM (which has less than half of the Spark's memory bandwidth), my machine gets ~4100 tps prefill and ~40 tps decode.
Your spreadsheet shows the Spark getting ~94 tps prefill and ~11 tps decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar, or the Spark a touch faster.
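For context on why decode should be close: decode is memory-bandwidth-bound, so you can roofline it. A minimal sketch; the bandwidth figures and the per-token active-weight size below are my own assumptions for illustration, not measurements from the thread:

```python
# Roofline estimate of decode throughput: each generated token has to
# stream the active weights from memory, so tok/s <= bandwidth / bytes.
# These are ceilings; real numbers land below them due to overheads.

def decode_tps(bandwidth_gbs: float, active_bytes_gb: float) -> float:
    """Upper bound on decode tokens/s for a bandwidth-bound model."""
    return bandwidth_gbs / active_bytes_gb

# gpt-oss-120b is MoE: only ~5B params are active per token; at ~4.25
# bits/weight (mxfp4-ish) that's roughly 2.7 GB read per token (assumed).
ACTIVE_GB = 2.7

print("Spark-class (~273 GB/s): ", round(decode_tps(273, ACTIVE_GB)))   # ~100 ceiling
print("5090 VRAM (~1792 GB/s): ", round(decode_tps(1792, ACTIVE_GB)))   # ~660 ceiling
print("DDR5 sysRAM (~100 GB/s):", round(decode_tps(100, ACTIVE_GB)))    # ~37 ceiling
```

On these assumptions, a 5090 setup bottlenecked on system RAM and a Spark reading from unified memory should land in the same ballpark, which is the point being made above.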
The only thing that might be interesting about this DGX Spark is that its prefill manages to be faster due to better compute. I haven't compared the numbers yet, but they are included in the article.
1. Virtually every model that you'd run was developed on Nvidia gear and will run on the Spark.
2. The Spark has fast-as-hell interconnects: the sort of interconnects you'd want in an actual AI data center. You can use more than one Spark at the same time, with RDMA, and actually start to figure out how and why things work the way they do. You can do a lot with 200 Gb of interconnect (see the sketch below).
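A rough back-of-the-envelope of what a 200 Gb/s link buys you. The link speed is from the comment above; the payload sizes are my own assumptions, picked only to show orders of magnitude:

```python
# Ideal wire time to move model state between two Sparks over a
# 200 Gb/s link. Ignores protocol overhead and latency; RDMA gets
# you close to wire speed in practice.

LINK_GBPS = 200                          # link speed from the comment
link_bytes_per_s = LINK_GBPS / 8 * 1e9   # = 25 GB/s

def transfer_ms(num_bytes: float) -> float:
    """Zero-overhead wire time for a payload of num_bytes, in ms."""
    return num_bytes / link_bytes_per_s * 1e3

# Assumed payloads -- illustrative, not measured:
payloads = [
    ("one token's fp16 activations (8 KB)", 4096 * 2),
    ("512 MB chunk of KV cache",            512 * 1024**2),
    ("half of a ~60 GB quantized model",    30 * 1024**3),
]
for name, size in payloads:
    print(f"{name}: {transfer_ms(size):.3f} ms")
```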
rajatgupta314•53m ago
For example, looking at blk.0.attn_k.weight (among other layers), it's q8_0:
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...
Looking at the same weight on Ollama, it's BF16:
https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
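If you want to check this yourself locally, here's a minimal sketch using the `gguf` Python package that ships with llama.cpp; the file path is a placeholder for whichever .gguf you've downloaded:

```python
# Dump per-tensor quantization types from a GGUF file, e.g. to see
# whether blk.0.attn_k.weight is q8_0 in one build and BF16 in another.
from gguf import GGUFReader

# Placeholder path -- point this at a downloaded .gguf file.
reader = GGUFReader("gpt-oss-20b.gguf")
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum (F32, BF16, Q8_0, ...)
    print(f"{tensor.name:40s} {tensor.tensor_type.name:8s} {tensor.shape}")
```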