Running gpt-oss-120b with an RTX 5090 and 2/3 of the experts offloaded to system RAM (which has less than half the memory bandwidth of this thing), my machine gets ~4100 tps prefill and ~40 tps decode.
Your spreadsheet shows the Spark getting ~94 tps prefill and ~11 tps decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar, or the Spark a touch faster.
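Not my exact setup, but roughly this is the shape of it with llama-cpp-python (the filename and layer split below are just placeholders; llama.cpp's CLI also has finer-grained per-tensor overrides for keeping only the MoE expert weights in system RAM):

    # Rough sketch: run a big MoE GGUF with part of the model on the GPU and
    # the rest in system RAM. Values are illustrative, not my actual config.
    from llama_cpp import Llama

    llm = Llama(
        model_path="gpt-oss-120b-mxfp4.gguf",  # placeholder path
        n_gpu_layers=12,   # whatever fits in 32 GB of VRAM; the rest stays in RAM
        n_ctx=8192,
    )
    out = llm("Hello", max_tokens=32)
    print(out["choices"][0]["text"])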
The only thing that might be interesting about this DGX Spark is that its prefill manages to be faster due to better compute. I haven't compared the numbers yet, but they are included in the article.
tl;dr it gets absolutely smashed by Strix Halo, at half the price.
1. Virtually every model that you'd run was developed on Nvidia gear and will run on Spark.
2. Spark has fast-as-hell interconnects: the sort of interconnects that one would want to use in an actual AI DC, so you can use more than one Spark at the same time, use RDMA, and actually start to figure out how and why things work the way they do. You can do a lot with 200 Gb of interconnect.
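For scale (back-of-envelope, my numbers not the spec sheet's):

    # Just a unit conversion: a 200 Gb/s link expressed in GB/s.
    link_gbit_per_s = 200
    print(f"~{link_gbit_per_s / 8:.0f} GB/s per direction, before protocol overhead")

Not memory-bandwidth territory, but plenty to make two-box RDMA experiments realistic.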
$1,295.00
https://www.balticnetworks.com/products/mikrotik-crs812-ddq-...
This is insanely slow given its 200+ GB/s memory bandwidth. As a comparison, I've tested GPT-OSS 120B on Strix Halo and it obtains ~420 tps prefill and >40 tps decode.
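Back-of-envelope on the decode ceiling, assuming gpt-oss-120b's ~5.1B active parameters per token and roughly 4.5 bits/weight effective (MXFP4 experts plus the higher-precision tensors averaged together; both assumptions, not measurements):

    # Decode roofline: memory bandwidth / bytes read per generated token.
    active_params = 5.1e9            # assumed active params per token
    bits_per_param = 4.5             # assumed effective quantization
    bytes_per_token = active_params * bits_per_param / 8   # ~2.9 GB

    for name, bw in [("DGX Spark", 273e9), ("Strix Halo", 256e9)]:
        print(f"{name}: ~{bw / bytes_per_token:.0f} tok/s upper bound")

That puts both boxes around a ~90-95 tok/s theoretical ceiling, so ~40 tps is believable with overheads, while ~11 tps looks like a software problem rather than a hardware one.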
It would be interesting to swap out Ollama for LM Studio and use their built-in MLX support and see the difference.
Could I write code that runs on Spark and effortlessly run it on a big GB300 system with no code changes?
It's designed to be a local dev machine for Nvidia server products. It has the same software and hardware stack as enterprise Nvidia hardware. That's what it is designed for.
Wait for M5 series Macs for good value local inferencing. I think the M5 Pro/Max are going to be very good values.
I am still amazed at how many companies buy a ton of DGX boxes and are then surprised that Nvidia does not have any Kubernetes-native platform for training and inference across all the DGX machines. The Run.ai acquisition did not change anything: all the work of integrating with distributed training frameworks like Ray, or scalable inference platforms like KServe/vLLM, is still left to the user.
[1] (Updated) NVIDIA Jetson AGX Thor Developer Kit to Launch in Mid-August with 2070 TFLOPS AI Performance, Priced at $3499:
https://linuxgizmos.com/updated-nvidia-jetson-agx-thor-devel...
[2] AAEON Announces BOXER-8741AI with NVIDIA Jetson Thor T5000 Module:
https://linuxgizmos.com/aaeon-announces-boxer-8741ai-with-nv...
No doubt that’s present here too somehow.
Gotta cut off something important so you’ll spend more on the next more expensive product.
I somehow expected the Spark to be the 'God in a Box' moment for local AI, but it feels like they went for trying to sell multiple units instead.
I'd be more tempted by a 2nd-hand 128GB M2 Ultra at ~800GB/s, but the prices here are still high, and I'm not sure the Spark is going to convince people to part with those, unless we see some RAM-gluttonous M5 boxes soon. An easy way for Apple to catch up again.
I guess the next one I'm looking out for is the Orange Pi AI Studio Pro. It should have 192GB of RAM, so it's able to run Qwen3 235B, and even though it's DDR4, it has nearly double the bandwidth of the Spark.
Admittedly I'm not a huge fan of Debian; I'd likely end up going Arch on this one.
>Also, if you're in the U.S.,
I'm not.
> I'd much rather stick with nVidia that has an ecosystem (even Apple for that matter), than touch a system like this off of Alibaba.
I get that. Realistically I'm waiting for Medusa Halo, some affordable datacenter card, something.
a) What is the noise level? In a box that small, it should be immense, no?
b) how many frames do we get in Q3A at max. resolution and will it be able to run Crysis? ;-) LOL (SCNR)
DGX Spark
pp - 1723.07/s
tg - 38.55/s
Ryzen AI Max+ 395
pp - 711.67/s
tg - 40.25/s
Is it worth the money?
(photo for reference: https://www.wwt.com/api-new/attachments/5f033e355091b0008017...)
SethTro•3mo ago
CamperBob2•3mo ago
threeducks•3mo ago
For inference, the DGX Spark does not look like a good choice, as there are cheaper alternatives with better performance.
CamperBob2•3mo ago
Then there's the Mac Studio, which outdoes them in all respects except FP8 and FP4 support. As someone on Reddit put it: https://old.reddit.com/r/LocalLLaMA/comments/1n0xoji/why_can...
altspace•3mo ago
KeplerBoy•3mo ago
The DGX seems vastly more capable.
nialse•3mo ago
newman314•3mo ago
yvbbrjdr•3mo ago
ggerganov•3mo ago
yvbbrjdr•3mo ago
alecco•3mo ago
Or you can just ask the Ollama people about the Ollama problems. Ollama is (or was) just a Go wrapper around llama.cpp.
ilc•3mo ago
__mharrison__•3mo ago
xs83•3mo ago
Eggpants•3mo ago
https://www.ebay.com/sch/i.html?_nkw=mac+studio+m3+ultra+512...
rajatgupta314•3mo ago
For example, looking at blk.0.attn_k.weight, it's Q8_0 (as are other layers):
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...
The same weight on Ollama is BF16:
https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
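If you want to verify this locally instead of through the web viewers, something along these lines with the gguf Python package works (the path is a placeholder for wherever the GGUF blob lives on disk):

    # Sketch: print the quantization type of one tensor in a local GGUF file.
    # pip install gguf
    from gguf import GGUFReader

    reader = GGUFReader("path/to/gpt-oss-20b.gguf")  # placeholder path
    for t in reader.tensors:
        if t.name == "blk.0.attn_k.weight":
            print(t.name, t.tensor_type.name, t.shape)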
xs83•3mo ago
So 38.54 t/s on 120B? Have you tested filling the context too?
ggerganov•3mo ago
nialse•3mo ago
moondev•3mo ago
zackangelo•3mo ago
moondev•3mo ago
ddelnano•3mo ago
pjmlp•3mo ago
People who keep pushing Apple gear tend to forget that Apple has decided that what the industry considers industry standards, proprietary or not, won't be made available on their hardware.
Even if Metal is actually a cool API to program for.
thom•3mo ago
omneity•3mo ago
NewsaHackO•3mo ago
pjmlp•3mo ago
It's called a de facto standard, which you can check in your favourite dictionary.
EnPissant•3mo ago
sandworm101•3mo ago
EnPissant•3mo ago
adrian_b•3mo ago
Still, a PC with a 5090 will in many cases give much better bang for the buck, except when limited by the slower speed of its main memory.
The greater bandwidth available when accessing the entire 128 GB of memory is the only advantage of the NVIDIA DGX, while a cheaper PC with a discrete GPU has a faster GPU, a faster CPU and faster local GPU memory.
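Rough public-spec numbers for the bandwidths involved (approximate peak figures; real-world is lower, but the ordering is what matters for decode speed):

    # Approximate peak memory bandwidths, GB/s.
    bandwidth_gbs = {
        "RTX 5090 GDDR7 (32 GB)": 1792,
        "DGX Spark unified LPDDR5X (128 GB)": 273,
        "Desktop dual-channel DDR5-6000": 96,
    }
    for part, bw in bandwidth_gbs.items():
        print(f"{part}: ~{bw} GB/s")

So the 5090 box is in a different league as long as the model fits in its 32 GB of VRAM, and falls behind the Spark's unified memory once a large model spills into DDR5.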
bilekas•3mo ago
Xss3•3mo ago
Things are changing rapidly, and there is a not-insignificant chance that it'll seem like a big waste of money within 12 months.
eadwu•3mo ago
If SOCAMM2 is used, it will still probably top out somewhere around 512-768 GB/s of bandwidth, unless LPDDR6X/LPDDR7X or SOCAMM2 is that much better; SOCAMM on the DGX Station is just 384 GB/s with LPDDR5X.
The form factor will stay neutered for the near future, but it will probably retain the highest compute for its size.
The only way there will be a difference is if Intel or AMD put their foot on the gas; that buys maybe 2-3 years, with another 2 years on top, and unless they already have something cooking, it isn't going to happen.
Xss3•3mo ago
Maybe a company is working on something totally different in secret that we can't even imagine. The amount of money being thrown into this space at the moment is enormous.
metadat•3mo ago
Tepix•3mo ago
To me it seems like you're paying more than twice the price mostly for CUDA compatibility.
Tepix•3mo ago