Architecturally, the DGX Spark has a far better cache setup to feed the GPU, and offers NVLink support.
But yeah, this should have been further up.
Further down, in the exploded view it says "Blackwell GPU 1PetaFLOP FP4 AI Compute"
Then further down in the spec chart they get less specific again with "Tensor Performance^1 1 PFLOP" and "^1" says "1 Theoretical FP4 TOPS using the sparsity feature."
Also, if you click "Reserve Now" the second line below that redundant "Reserve Now" button says "1 PFLOPS of FP4 AI performance"
I mean I'll give you that they could be more clear and that it's not cool to just hype up on FP4 performance, but they aren't exactly hiding the context like they did during GTC. I wouldn't call this "disingenuous"
I think lots of children are going to be very disappointed running their blas benchmarks on Christmas morning and seeing barely tens of teraflops.
(For reference, see how much lower the still-optimistic numbers are for the H200 once you use realistic datatypes:
https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200... )
Even AMD got that memo and is mostly advertising their 8bit/block fp16 performance on their GPUs and NPUs, even though the NPUs support 4 bit INT with sparsity, which would 4x the quoted numbers if they used Nvidia's marketing FLOPs.
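As a rough sketch of that deflation, assuming the usual 2x-for-sparsity and 2x-per-precision-step scaling of recent tensor cores (an approximation, not a measurement):

    # Back-of-the-envelope: deflate a marketing "1 PFLOP FP4 sparse" figure
    # into dense numbers at more realistic datatypes. Assumes 2x from sparsity
    # and 2x per precision step (FP4 -> FP8 -> FP16).
    def dense_tflops(fp4_sparse_tflops):
        fp4_dense = fp4_sparse_tflops / 2      # remove the sparsity doubling
        return {
            "FP4 dense": fp4_dense,
            "FP8 dense": fp4_dense / 2,
            "FP16 dense": fp4_dense / 4,
        }

    print(dense_tflops(1000))  # DGX Spark's headline 1 PFLOP FP4 sparse
    # {'FP4 dense': 500.0, 'FP8 dense': 250.0, 'FP16 dense': 125.0}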
Yeah, it’s miles better than WiFi. But if there was anything I’d think might benefit from Thunderbolt, this would’ve been it.
The ability to transfer large models or datasets that way just seems like it would be much faster and a real win for some customers.
Why would you ever want a DGX Spark to talk to a “normal PC” at 40+ Gbps speeds anyways? The normal PC has nothing that interesting to share with it.
But, yes, the DGX Spark does have four USB4 ports which support 40Gbps each, the same as Thunderbolt 4. I still don’t see any use case for connecting one of those to a normal PC.
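For a sense of scale, here's a rough transfer-time estimate at the theoretical line rates discussed in this thread; the 100 GB payload is just an illustrative figure, and real throughput will be lower once protocol overhead and SSD speed get involved:

    # Time to move a 100 GB model/dataset at various theoretical link rates.
    SIZE_GB = 100
    links_gbps = {
        "Wi-Fi (optimistic)": 2,
        "10 GbE": 10,
        "USB4 / Thunderbolt 4": 40,
        "ConnectX-7 (200 GbE)": 200,
    }
    for name, gbps in links_gbps.items():
        seconds = SIZE_GB * 8 / gbps
        print(f"{name:>22}: {seconds:7.1f} s")
    # 400 s over Wi-Fi, 20 s over USB4, 4 s over 200 GbE (all best-case)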
ASUS Ascent GX10 - 1TB $2,999
MSI EdgeXpert MS-C931 - 4TB $3,999
The 1TB/4TB seems to be the size of the included NVMe SSD. The Reserve Now page also lists:
NVIDIA DGX Spark Bundle
2 NVIDIA DGX Spark Units - 4TB with Connecting Cable $8,049
The DGX Spark specs list an NVIDIA ConnectX-7 Smart NIC, rated at 200GbE, to connect to another DGX Spark for about double the amount of memory for models.
Their prompt processing speeds are absolutely abysmal: if you're trying to tinker from time to time, a GPU like a 5090 or renting GPUs is a much better option.
If you're just trying to prep for impending mainstream AI applications, few will be targeting this form factor: it's both too strong compared to mainstream hardware, and way too weak compared to dedicated AI-focused accelerators.
-
I'll admit I'm taking a less nuanced take than some would prefer, but I'm also trying to be direct: this is not ever going to be a better option than a 5090.
Their prompt processing speeds are absolutely abysmal
They are not. This is Blackwell with Tensor cores. Bandwidth is the problem here.
I've run inference workloads on a GH200, which is an entire H100 attached to an ARM processor, and the moment offloading is involved, speeds tank to Mac Mini-like levels, which is similarly mostly a toy when it comes to AI.
Not entirely sure how your ARM statement matters here. This is unified memory.
The limiting factor is going to be the VRAM on the 5090, but nvidia intentionally makes trying to break the 32GB barrier extremely painful - they want companies to buy their $20,000 GPUs to run inference for larger models.
Then the RTX Pro 6000 for running a little bit larger models (96GB VRAM, but only ~15-20% more perf than a 5090).
Some suggest Apple Silicon only for running larger models on a budget because of the unified memory, but the performance won't compare.
?? this seems more than a little disingenuous...
ASUS and NVIDIA told us that their GB10 platforms are expected to use up to 170W.
[edit] The PSU is 240W, so that'd place an upper limit on power draw, unless they upgrade it.
These seem to be highly experimental boards, even though they are super powerful for their form factor.
Ryzen AI Max 395+, ~120 tops (fp8?), 128GB RAM, $1999
Nvidia DGX Spark, ~1000 tops fp4, 128GB RAM, $3999
Mac Studio max spec, ~120 tflops (fp16?), 512GB RAM, 3x bandwidth, $9499
The DGX Spark appears to potentially offer the most tokens per second, but it's less useful/valuable as an everyday PC.
Mac Studio max spec, ~120 tflops (fp16?), 384GB RAM, 3x bandwidth, $9499
512GB.
DGX has 256GB/s bandwidth so it wouldn't offer the most tokens/s.
Using an M3 Ultra I think the performance is pretty remarkable for inference and concerns about prompt processing being slow in particular are greatly exaggerated.
Maybe the advantage of the DGX Spark will be for training or fine tuning.
Also notably, Strix Halo and DGX Spark are both ~275 GB/s memory bandwidth. Not always, but in many machine learning cases it feels like that's going to be the limiting factor.
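A crude roofline-style sketch of why: token generation roughly requires streaming all active weights once per token, so bandwidth sets a hard ceiling. The 40 GB weight figure below is a hypothetical ~70-80B model at 4-bit, and the estimate ignores KV-cache traffic and MoE sparsity:

    # Upper bound on decode tokens/s if every generated token streams the
    # full set of weights from memory once.
    def max_tokens_per_s(bandwidth_gb_s, weights_gb):
        return bandwidth_gb_s / weights_gb

    print(max_tokens_per_s(275, 40))   # Strix Halo / DGX Spark class: ~6.9 tok/s ceiling
    print(max_tokens_per_s(1800, 40))  # RTX 5090-class bandwidth (if it fit): ~45 tok/s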
Just got my Framework PC last week. It's easy to set up to run LLMs locally - you have to use Fedora 42, though, because it has the latest drivers. It was super easy to get qwen3-coder-30b (8-bit quant) running in LMStudio at 36 tok/sec.
I'm looking at getting a Framework Desktop and would like to know what kind of performance gain I'd see over my current hardware. So far I only have a "feel" for its performance from running Ollama and OpenWebUI, but no hard numbers.
qwen-code (cli) gives like 2k requests per day for free (and is fantastic), so unless you have a very specific use case, buying a system for local LLM use is not a good use of funds.
If you're in the market for a desktop PC anyway, and just want to tinker with LLMs, then the AMD systems are a fair value IMO, plus the drivers are open source so everything just works out of the box (with Vulkan, anyway).
Yeah, this is why I bought it: to tinker with LLMs (and some more experimental ML algorithms like differential logic and bitnets). But it can also compile LLVM in a little under 7 minutes, and, though I didn't time it, it builds the riscv gcc toolchain very quickly as well. My current (soon to be previous) dev box took about an hour to compile LLVM (if it didn't fail linking due to running out of memory), so doing any kind of LLVM development or making changes to binutils was quite tedious.
With my 5070 Ti + 2080 Ti I have Qwen 3 Coder 30B Q4_K_M running entirely on the GPUs with 16k context. Not great for larger code bases, but not nothing either.
Asking it to summarize llama-model-loader.cpp, which is about ~12k tokens, the TTFT is ~13 seconds and generation speed is about 55 tok/sec.
So yeah, for local stuff it's pick any two of large models, long contexts and decent speed.
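Backing the implied rates out of those numbers (rough, and assuming TTFT is dominated by prefill; the 500-token answer length is just an example):

    # Implied prompt-processing rate and end-to-end time from the figures above.
    prompt_tokens = 12_000
    ttft_s = 13.0
    gen_tok_s = 55.0

    prefill_tok_s = prompt_tokens / ttft_s   # ~920 tok/s of prompt processing
    answer_tokens = 500                      # hypothetical answer length
    total_s = ttft_s + answer_tokens / gen_tok_s
    print(f"prefill ~ {prefill_tok_s:.0f} tok/s, ~{total_s:.0f} s end-to-end for a 500-token answer")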
I find Qwen 3 Coder to be quite usable, I get around 20TPS on my AMD AI 350 system, as long as the net-new context isn't too big.
Need a big case tho or go bitcoin miner style.
Not seriously thinking about it, just playing around.
I currently use the GPU in my server for n8n and Home Assistant with small-ish tooling models that fit in my 8GB VRAM.
TTFT is pretty poor right now, I get 10+ seconds for the longer inputs from HA, n8n isn't too bad unless I'm asking it to handle a largish input, but that one is less time sensitive as it's running things on schedules rather than when I need output.
Ideally I'd like to get Assistant responses in HA to under about 2s if possible.
I'm also looking for a new desktop at some point, but I don't want to use the same hardware: the inference GPU is in a server that's always on, running "infrastructure" (Kubernetes, various pieces of software, NAS functionality, etc). But I've always built desktops from components, since I was a wee child when a 1.44MB floppy was an upgrade, so a part of me is reluctant to switch to a mini-PC for that.
I might be convinced to get a Framework Desktop, though, if it'll do for Steam gaming on Linux, knowing that when I eventually need to upgrade it, it could supplement my server rack and be replaced entirely with a new model on the desktop, given there's very little upgrade path other than replacing the entire mainboard.
No real interest in coding assistants, and running within my home network is an absolute must, which limits capability to "what's the best hardware I can afford?".
512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS
Note: even with all my latest manual compilation tweaks and the latest TheRock ROCm builds, the best I've gotten mamf-finder up to is about 35 TFLOPS, which is still not amazing efficiency (most Nvidia cards are at 70-80%), although a huge improvement over the single-digit TFLOPS you might get out of the box.
If you're not training, your inference speed will largely be limited by available memory bandwidth, so the Spark's token generation will be about the same as the 395's.
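For reference, the efficiency implied by those two figures (theoretical peak from the formula above versus the best mamf-finder result; rough, since no real kernel hits the paper peak):

    # Theoretical dense FP16 peak for a 40-CU RDNA 3.5 part at ~2.9 GHz,
    # versus the ~35 TFLOPS measured with mamf-finder.
    peak_tflops = 512 * 40 * 2.9e9 / 1e12   # 59.392
    measured_tflops = 35.0
    print(f"efficiency ~ {measured_tflops / peak_tflops:.0%}")   # ~59%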
On general utility, I will say that the 16 Zen5 cores are impressive. It beats my 24C EPYC 9274F in single and multithreaded workloads by about 25%.
Strix Halo has the same and I agree it’s overrated.
(Posting this comment in hopes of being corrected and learning something).
Nvidia DGX Spark: 273 GB/s
M4 Max: (up to) 546 GB/s
M3 Ultra: 819 GB/s
RTX 5090: ~1.8 TB/s
RTX PRO 6000 Blackwell: ~1.8 TB/s
Hence the disappointment.
FP4-sparse (TFLOPS) | Price | $/TF4s
5090: 3352 | 1999 | 0.60
Thor: 2070 | 3499 | 1.69
Spark: 1000 | 3999 | 4.00
____________
FP8-dense (TFLOPS) | Price | $/TF8d (4090s have no FP4)
4090 : 661 | 1599 | 2.42
4090 Laptop: 343 | varies | -
____________
Geekbench 6 (compute score) | Price | $/100k
4090: 317800 | 1599 | 503
5090: 387800 | 1999 | 516
M4 Max: 180700 | 1999 | 1106
M3 Ultra: 259700 | 3999 | 1540
____________
Apple NPU TOPS (not GPU-comparable)
M4 Max: 38
M3 Ultra: 36
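The ratio columns are just price divided by throughput; a small sketch reproducing them from the raw figures quoted above (prices in USD):

    # $/TFLOP (FP4 sparse) and $ per 100k Geekbench 6 compute points,
    # recomputed from the raw numbers in the tables above.
    fp4_sparse = {"5090": (3352, 1999), "Thor": (2070, 3499), "Spark": (1000, 3999)}
    gb6 = {"4090": (317800, 1599), "5090": (387800, 1999),
           "M4 Max": (180700, 1999), "M3 Ultra": (259700, 3999)}

    for name, (tflops, price) in fp4_sparse.items():
        print(f"{name}: {price / tflops:.2f} $/TFLOP FP4-sparse")
    for name, (score, price) in gb6.items():
        print(f"{name}: {price / score * 100_000:.0f} $ per 100k GB6 compute")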
Memory chips are a commodity, that I agree. Though HBM is trending towards not being a commodity.
DRAM+mm² BOM will have a different slope than a DRAM-only BOM, but it's still basically linear. Nonlinear pricing is pure market segmentation.
Memory controllers are die area. mm² of die space is a linear BOM cost.
Ignoring the design and platform support that comes with higher-bandwidth memory controllers.
I don't think you know how industry pricing works. Wafers have a price: double the mm², double the price of the chip in the BOM.
I'm explaining why despite memory being a commodity, high memory bandwidth VRAM cost is not cheap.
It's more expensive in a linear way with respect to the BOM.
I think the RTX Pro is probably the best deal right now if you're looking for a GPU dev desktop and don't care about physical size or power consumption.
In fact you're also doing the work Nvidia should have done when they put together their (imho) ridiculously imprecise spec sheet.
There's two models that go by 6000, the RTX Pro 6000 (Blackwell) is the one that's currently relevant.
A40 NVLink was limited though: one-to-one (bridge). I never saw it daisy-chained or NVSwitched (might have just not seen them myself; they may have existed).
2. You should read up on the Gestapo
But it was a different time. Most policies had some connection to the subject at hand.
Policies today are all about brand Trump and brand MAGA.
I call the stack with Mac Studios “MacAIver” because it feels like a duct tape solution, but the Spark equivalent would likely be more elegant.
16 compared to 4. Surely even much faster networking in the Spark would degrade with that many devices?
Biggest problem with Macs is that they don't have dedicated tensor cores in the GPU which makes prompt processing very slow compared to Nvidia and AMD.
https://x.com/liuliu/status/1932158994698932505
https://developer.apple.com/metal/Metal-Shading-Language-Spe...
4090: 24GB RAM
Thor & Spark: 128GB RAM (probably at least 96GB usable by the GPU if they behave similar to the AMD Strix Halo APU)
From other less reliable sources like eBay they are more like £1800.
Well, I'm glad to be wrong on this!
That gives you 250 TOPS of FP8 for the Spark.
Spark: 128 GB LPDDR5x, unified system memory
5090: 32 GB GDDR7
Model sizes (parameter size)
Spark: 200B
5090: 12B (raw)
Information is in the ratio of these numbers. They stay the same.
Please compare the same things: carrots VS carrots, not apples VS eggs.
200B is probably a rough estimate of Q4 + some space for context.
The Spark has 4x the VRAM of a 5090. That's all you need to know from a "how big can it go" perspective.
With 128 GB of unified system memory, developers can experiment, fine-tune, or inference models of up to 200B parameters. Plus, NVIDIA ConnectX™ networking can connect two NVIDIA DGX Spark supercomputers to enable inference on models up to 405B parameters.
You can do it, if you quantize to FP4 — and Nvidia's special variant of FP4, NVFP4, isn't too bad (and it's optimized on Blackwell). Some models are even trained at FP4 these days, like the gpt-oss models. But gigabytes are gigabytes, and you can't squeeze 400GB of FP16 weights into only 128GB (or 256GB) of space.
The datasheet is telling you the truth: you can fit a 200B model. But it's not saying you can do that at FP16 — because you can't. You can only do it at FP4.
If the 200B model were at FP16, marketing could've turned around and claimed the DGX Spark could handle a 400B model (with an 8-bit quant) or an 800B model at some 4-bit quant.
Why would marketing leave such low-hanging fruit on the tree?
They wouldn't.
I assume we can go up to 120B using fp8?
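Back-of-the-envelope for the parameter counts in this thread, counting weights only; the KV cache, runtime, and OS all eat into the remainder, so the practical ceiling sits a bit lower:

    # Approximate weight footprint: parameters x bits-per-parameter / 8.
    def weights_gb(params_billion, bits):
        return params_billion * bits / 8

    print(weights_gb(200, 4))   # 100.0 GB -> 200B @ FP4 fits in 128 GB
    print(weights_gb(405, 4))   # 202.5 GB -> 405B @ FP4 needs two linked Sparks
    print(weights_gb(120, 8))   # 120.0 GB -> 120B @ FP8 is right at the 128 GB edge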
Even if you were to say memory bandwidth was the problem, there is no consumer-grade GPU that can run any SoTA LLM; no matter what, you'd have to settle for a more mediocre model.
Outside of LLMs, 256 GB/s is not as much of an issue and many people have dealt with less bandwidth for real world use cases.
For the newest models unless you quantize the crap out of them, even with a 5090 you’re going to be swapping blocks, which slows things down anyways. At least you’d be able to train on them at full precision with a decent batch size.
That said, I can’t imagine there’s enough of a market there to make it worth it.
The only likely difference with the DGX Spark is that it'll be a more desktop-centered platform. What people will do with it, I'm not sure, but for, say, VR, the DGX Spark is basically the best compute puck for one right now.
$3,999
I'd rather just get an M3 Ultra. Have an M2 Ultra on the desk, and an M3 Ultra sitting on the desk waiting to be opened. Might need to sell it and shell out the cash for the max ram option. Pricey, but seems worthwhile.
Fits into 32GB: 5090
Fits into 64-96GB: Mac Studio
Fits into 128GB: for now the 395+ on $/token/s; Mac Studio if you don't care about $ but don't have unlimited money for an Hxxx
This could be great for models that fit in 128GB where you want the best $/token/s (if it is faster than a 395+).
# This is deprecated, but can still be referenced:
options amdgpu gttsize=122800
# This specifies GTT by number of 4KB pages:
# 31457280 * 4KB / 1024 / 1024 = 120 GiB
options ttm pages_limit=31457280
Shouldn't it be "infer"?
Funnily enough, things like this show that a human probably was involved in the writing. I doubt an LLM would have produced that. I've often thought about how future generations are going to signal that they're human; maybe the way will be human language changing much more rapidly than it has, maybe even mid-sentence.
I'd argue that "inference" has taken on a somewhat distinct new meaning in an LLM context (loosely: running actual tokens through the model), and reverting to the base verb form would make the sentence less clear to me.
Cf. "compute" is a verb for normal people, but for techies it is also "hardware resources used to compute things".
I wonder if this also applies to this DGX Spark. I hope not.
That platform was great for a few months...
For most of the Tegra boards there’s also upstream support. Overall the situation with NVidia BSP is about 10000x better than weird Chinese stuff. In the case of Tegra/Jetson, there’s even detailed first-party documentation about reconstructing the BSP components from source:
https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-327...
I’d assume the decent software support will carry over to DGX.
https://www.jeffgeerling.com/blog/2024/amd-radeon-pro-w7700-...