Architecturally, the DGX Spark has a far better cache setup to feed the GPU, and offers NVLink support.
But yeah, this should have been further up.
Further down, in the exploded view it says "Blackwell GPU 1PetaFLOP FP4 AI Compute"
Then further down in the spec chart they get less specific again with "Tensor Performance^1 1 PFLOP" and "^1" says "1 Theoretical FP4 TOPS using the sparsity feature."
Also, if you click "Reserve Now" the second line below that redundant "Reserve Now" button says "1 PFLOPS of FP4 AI performance"
I mean I'll give you that they could be more clear and that it's not cool to just hype up on FP4 performance, but they aren't exactly hiding the context like they did during GTC. I wouldn't call this "disingenuous"
I think lots of children are going to be very disappointed running their blas benchmarks on Christmas morning and seeing barely tens of teraflops.
(For reference, see how much lower the still-optimistic numbers are for the H200 once you use realistic datatypes:
https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200... )
Even AMD got that memo and is mostly advertising their 8bit/block fp16 performance on their GPUs and NPUs, even though the NPUs support 4 bit INT with sparsity, which would 4x the quoted numbers if they used Nvidia's marketing FLOPs.
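As a rough sketch of that deflation, assuming the usual 2x-for-sparsity and 2x-per-precision-step scaling of recent tensor cores (an approximation, not a measurement):

    # Back-of-the-envelope: deflate a marketing "1 PFLOP FP4 sparse" figure
    # into dense numbers at more realistic datatypes. Assumes 2x from sparsity
    # and 2x per precision step (FP4 -> FP8 -> FP16).
    def dense_tflops(fp4_sparse_tflops):
        fp4_dense = fp4_sparse_tflops / 2      # remove the sparsity doubling
        return {
            "FP4 dense": fp4_dense,
            "FP8 dense": fp4_dense / 2,
            "FP16 dense": fp4_dense / 4,
        }

    print(dense_tflops(1000))  # DGX Spark's headline 1 PFLOP FP4 sparse
    # {'FP4 dense': 500.0, 'FP8 dense': 250.0, 'FP16 dense': 125.0}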
Yeah, it’s miles better than WiFi. But if there was anything I’d think might benefit from Thunderbolt, this would’ve been it.
The ability to transfer large models or datasets that way just seems like it would be much faster and a real win for some customers.
Why would you ever want a DGX Spark to talk to a “normal PC” at 40+ Gbps speeds anyways? The normal PC has nothing that interesting to share with it.
But, yes, the DGX Spark does have four USB4 ports which support 40Gbps each, the same as Thunderbolt 4. I still don’t see any use case for connecting one of those to a normal PC.
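For a sense of scale, here's a rough transfer-time estimate at the theoretical line rates discussed in this thread; the 100 GB payload is just an illustrative figure, and real throughput will be lower once protocol overhead and SSD speed get involved:

    # Time to move a 100 GB model/dataset at various theoretical link rates.
    SIZE_GB = 100
    links_gbps = {
        "Wi-Fi (optimistic)": 2,
        "10 GbE": 10,
        "USB4 / Thunderbolt 4": 40,
        "ConnectX-7 (200 GbE)": 200,
    }
    for name, gbps in links_gbps.items():
        seconds = SIZE_GB * 8 / gbps
        print(f"{name:>22}: {seconds:7.1f} s")
    # 400 s over Wi-Fi, 20 s over USB4, 4 s over 200 GbE (all best-case)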
ASUS Ascent GX10 - 1TB $2,999
MSI EdgeXpert MS-C931 - 4TB $3,999
The 1TB/4TB seems to be the size of the included NVMe SSD. The Reserve Now page also lists:
NVIDIA DGX Spark Bundle
2 NVIDIA DGX Spark Units - 4TB with Connecting Cable $8,049
The DGX Spark specs list an NVIDIA ConnectX-7 Smart NIC, rated at 200GbE, to connect to another DGX Spark for about double the amount of memory for models.
Their prompt processing speeds are absolutely abysmal: if you're trying to tinker from time to time, a GPU like a 5090 or renting GPUs is a much better option.
If you're just trying to prep for impending mainstream AI applications, few will be targeting this form factor: it's both too strong compared to mainstream hardware, and way too weak compared to dedicated AI-focused accelerators.
-
I'll admit I'm taking a less nuanced take than some would prefer, but I'm also trying to be direct: this is not ever going to be a better option than a 5090.
Their prompt processing speeds are absolutely abysmal
They are not. This is Blackwell with Tensor cores. Bandwidth is the problem here.
I've run inference workloads on a GH200, which is an entire H100 attached to an ARM processor, and the moment offloading is involved, speeds tank to Mac Mini-like levels, which is similarly mostly a toy when it comes to AI.
Not entirely sure how your ARM statement matters here. This is unified memory.
The limiting factor is going to be the VRAM on the 5090, but nvidia intentionally makes trying to break the 32GB barrier extremely painful - they want companies to buy their $20,000 GPUs to run inference for larger models.
Then the RTX Pro 6000 for running a little bit larger models (96GB VRAM, but only ~15-20% more perf than a 5090).
Some suggest Apple Silicon only for running larger models on a budget because of the unified memory, but the performance won't compare.
?? this seems more than a little disingenuous...
ASUS and NVIDIA told us that their GB10 platforms are expected to use up to 170W.
[edit] The PSU is 240W, so that'd place an upper limit on power draw, unless they upgrade it.
These seem to be highly experimental boards, even though they are super powerful for their form factor.
Ryzen AI Max 395+, ~120 tops (fp8?), 128GB RAM, $1999
Nvidia DGX Spark, ~1000 tops fp4, 128GB RAM, $3999
Mac Studio max spec, ~120 tflops (fp16?), 512GB RAM, 3x bandwidth, $9499
The DGX Spark appears to potentially offer the most tokens per second, but it's less useful/valuable as an everyday PC.
Mac Studio max spec, ~120 tflops (fp16?), 384GB RAM, 3x bandwidth, $9499
512GB.
DGX has 256GB/s bandwidth so it wouldn't offer the most tokens/s.
Using an M3 Ultra I think the performance is pretty remarkable for inference and concerns about prompt processing being slow in particular are greatly exaggerated.
Maybe the advantage of the DGX Spark will be for training or fine tuning.
Also notably, Strix Halo and DGX Spark are both ~275 GB/s memory bandwidth. Not always, but in many machine learning cases it feels like that's going to be the limiting factor.
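A crude roofline-style sketch of why: token generation roughly requires streaming all active weights once per token, so bandwidth sets a hard ceiling. The 40 GB weight figure below is a hypothetical ~70-80B model at 4-bit, and the estimate ignores KV-cache traffic and MoE sparsity:

    # Upper bound on decode tokens/s if every generated token streams the
    # full set of weights from memory once.
    def max_tokens_per_s(bandwidth_gb_s, weights_gb):
        return bandwidth_gb_s / weights_gb

    print(max_tokens_per_s(275, 40))   # Strix Halo / DGX Spark class: ~6.9 tok/s ceiling
    print(max_tokens_per_s(1800, 40))  # RTX 5090-class bandwidth (if it fit): ~45 tok/s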
Just got my Framework PC last week. It's easy to set up to run LLMs locally - you have to use Fedora 42, though, because it has the latest drivers. It was super easy to get qwen3-coder-30b (8-bit quant) running in LMStudio at 36 tok/sec.
I'm looking at getting a Framework Desktop and would like to know what kind of performance gain I'd see over my current hardware. So far I only have a "feel" for its performance from running Ollama and OpenWebUI, but no hard numbers.
qwen-code (cli) gives like 2k requests per day for free (and is fantastic), so unless you have a very specific use case, buying a system for local LLM use is not a good use of funds.
If you're in the market for a desktop PC anyway, and just want to tinker with LLMs, then the AMD systems are a fair value IMO, plus the drivers are open source so everything just works out of the box (with Vulkan, anyway).
Yeah, this is why I bought it: to tinker with LLMs (and some more experimental ML algorithms like differential logic and bitnets). But it can also compile LLVM in a little under 7 minutes, and, though I didn't time it, it builds the riscv gcc toolchain very quickly as well. My current (soon to be previous) dev box took about an hour to compile LLVM (if it didn't fail linking due to running out of memory), so doing any kind of LLVM development or making changes to binutils was quite tedious.
With my 5070 Ti + 2080 Ti I have Qwen 3 Coder 30B Q4_K_M running entirely on the GPUs with 16k context. Not great for larger code bases, but not nothing either.
Asking it to summarize llama-model-loader.cpp, which is about ~12k tokens, the TTFT is ~13 seconds and generation speed is about 55 tok/sec.
So yeah, for local stuff it's pick any two of large models, long contexts and decent speed.
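Backing the implied rates out of those numbers (rough, and assuming TTFT is dominated by prefill; the 500-token answer length is just an example):

    # Implied prompt-processing rate and end-to-end time from the figures above.
    prompt_tokens = 12_000
    ttft_s = 13.0
    gen_tok_s = 55.0

    prefill_tok_s = prompt_tokens / ttft_s   # ~920 tok/s of prompt processing
    answer_tokens = 500                      # hypothetical answer length
    total_s = ttft_s + answer_tokens / gen_tok_s
    print(f"prefill ~ {prefill_tok_s:.0f} tok/s, ~{total_s:.0f} s end-to-end for a 500-token answer")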
I find Qwen 3 Coder to be quite usable, I get around 20TPS on my AMD AI 350 system, as long as the net-new context isn't too big.
Need a big case tho or go bitcoin miner style.
Not seriously thinking about it, just playing around.
I currently use the GPU in my server for n8n and Home Assistant with small-ish tooling models that fit in my 8GB VRAM.
TTFT is pretty poor right now, I get 10+ seconds for the longer inputs from HA, n8n isn't too bad unless I'm asking it to handle a largish input, but that one is less time sensitive as it's running things on schedules rather than when I need output.
Ideally I'd like to get Assistant responses in HA to under about 2s if possible.
I'm also looking for a new desktop at some point, but I don't want to use the same hardware: the inference GPU is in a server that's always on, running "infrastructure" (Kubernetes, various pieces of software, NAS functionality, etc). But I've always built desktops from components, since I was a wee child when a 1.44MB floppy was an upgrade, so a part of me is reluctant to switch to a mini-PC for that.
I might be convinced to get a Framework Desktop, though, if it'll do for Steam gaming on Linux, knowing that when I eventually need to upgrade it, it could supplement my server rack and be replaced entirely with a new model on the desktop, given there's very little upgrade path other than replacing the entire mainboard.
No real interest in coding assistants, and running within my home network is an absolute must, which limits capability to "what's the best hardware I can afford?".
512 ops/clock/CU * 40 CU * 2.9e9 clock / 1e12 = 59.392 FP16 TFLOPS
Note: even with all my latest manual compilation tweaks and the latest TheRock ROCm builds, the best I've gotten mamf-finder up to is about 35 TFLOPS, which is still not amazing efficiency (most Nvidia cards are at 70-80%), although a huge improvement over the single-digit TFLOPS you might get out of the box.
If you're not training, your inference speed will largely be limited by available memory bandwidth, so the Spark's token generation will be about the same as the 395's.
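For reference, the efficiency implied by those two figures (theoretical peak from the formula above versus the best mamf-finder result; rough, since no real kernel hits the paper peak):

    # Theoretical dense FP16 peak for a 40-CU RDNA 3.5 part at ~2.9 GHz,
    # versus the ~35 TFLOPS measured with mamf-finder.
    peak_tflops = 512 * 40 * 2.9e9 / 1e12   # 59.392
    measured_tflops = 35.0
    print(f"efficiency ~ {measured_tflops / peak_tflops:.0%}")   # ~59%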
On general utility, I will say that the 16 Zen5 cores are impressive. It beats my 24C EPYC 9274F in single and multithreaded workloads by about 25%.
Strix Halo has the same and I agree it’s overrated.
(Posting this comment in hopes of being corrected and learning something).
Nvidia DGX Spark: 273 GB/s
M4 Max: (up to) 546 GB/s
M3 Ultra: 819 GB/s
RTX 5090: ~1.8 TB/s
RTX PRO 6000 Blackwell: ~1.8 TB/s
Hence the disappointment.
FP4-sparse (TFLOPS) | Price | $/TF4s
5090: 3352 | 1999 | 0.60
Thor: 2070 | 3499 | 1.69
Spark: 1000 | 3999 | 4.00
____________
FP8-dense (TFLOPS) | Price | $/TF8d (4090s have no FP4)
4090 : 661 | 1599 | 2.42
4090 Laptop: 343 | varies | -
____________
Geekbench 6 (compute score) | Price | $/100k
4090: 317800 | 1599 | 503
5090: 387800 | 1999 | 516
M4 Max: 180700 | 1999 | 1106
M3 Ultra: 259700 | 3999 | 1540
____________
Apple NPU TOPS (not GPU-comparable)
M4 Max: 38
M3 Ultra: 36
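The ratio columns are just price divided by throughput; a small sketch reproducing them from the raw figures quoted above (prices in USD):

    # $/TFLOP (FP4 sparse) and $ per 100k Geekbench 6 compute points,
    # recomputed from the raw numbers in the tables above.
    fp4_sparse = {"5090": (3352, 1999), "Thor": (2070, 3499), "Spark": (1000, 3999)}
    gb6 = {"4090": (317800, 1599), "5090": (387800, 1999),
           "M4 Max": (180700, 1999), "M3 Ultra": (259700, 3999)}

    for name, (tflops, price) in fp4_sparse.items():
        print(f"{name}: {price / tflops:.2f} $/TFLOP FP4-sparse")
    for name, (score, price) in gb6.items():
        print(f"{name}: {price / score * 100_000:.0f} $ per 100k GB6 compute")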
Memory chips are a commodity, that I agree. Though HBM is trending towards not being a commodity.
DRAM+mm² BOM will have a different slope than a DRAM-only BOM, but it's still basically linear. Nonlinear pricing is pure market segmentation.
Memory controllers are die area. mm² of die space is a linear BOM cost.
Ignoring the design and platform support that comes with higher-bandwidth memory controllers.
I don't think you know how industry pricing works. Wafers have a price: double the mm², double the price of the chip in the BOM.
I'm explaining why despite memory being a commodity, high memory bandwidth VRAM cost is not cheap.
It's more expensive in a linear way with respect to the BOM.
I think the RTX Pro is probably the best deal right now if you're looking for a GPU dev desktop and don't care about physical size or power consumption.
In fact you're also doing the work Nvidia should have done when they put together their (imho) ridiculously imprecise spec sheet.
There's two models that go by 6000, the RTX Pro 6000 (Blackwell) is the one that's currently relevant.
A40 NVLink was limited though: one-to-one (bridge). I never saw it daisy-chained or NVSwitched (might have just not seen them myself; they may have existed).
2. You should read up on the Gestapo
But it was a different time. Most policies had some connection to the subject at hand.
Policies today are all about brand Trump and brand MAGA.
I call the stack with Mac Studios “MacAIver” because it feels like a duct tape solution, but the Spark equivalent would likely be more elegant.
16 compared to 4. Surely even much faster networking in the Spark would degrade with that many devices?
Biggest problem with Macs is that they don't have dedicated tensor cores in the GPU which makes prompt processing very slow compared to Nvidia and AMD.
https://x.com/liuliu/status/1932158994698932505
https://developer.apple.com/metal/Metal-Shading-Language-Spe...
4090: 24GB RAM
Thor & Spark: 128GB RAM (probably at least 96GB usable by the GPU if they behave similar to the AMD Strix Halo APU)
From other less reliable sources like eBay they are more like £1800.
Well, I'm glad to be wrong on this!
That gives you 250 TOPS of FP8 for the Spark.
Spark: 128 GB LPDDR5x, unified system memory
5090: 32 GB GDDR7
Model sizes (parameter size)
Spark: 200B
5090: 12B (raw)
Information is in the ratio of these numbers. They stay the same.
Please compare the same things: carrots VS carrots, not apples VS eggs.
200B is probably a rough estimate of Q4 + some space for context.
The Spark has 4x the VRAM of a 5090. That's all you need to know from a "how big can it go" perspective.
With 128 GB of unified system memory, developers can experiment, fine-tune, or inference models of up to 200B parameters. Plus, NVIDIA ConnectX™ networking can connect two NVIDIA DGX Spark supercomputers to enable inference on models up to 405B parameters.
You can do it, if you quantize to FP4 — and Nvidia's special variant of FP4, NVFP4, isn't too bad (and it's optimized on Blackwell). Some models are even trained at FP4 these days, like the gpt-oss models. But gigabytes are gigabytes, and you can't squeeze 400GB of FP16 weights into only 128GB (or 256GB) of space.
The datasheet is telling you the truth: you can fit a 200B model. But it's not saying you can do that at FP16 — because you can't. You can only do it at FP4.
If the 200B model were at FP16, marketing could've turned around and claimed the DGX Spark could handle a 400B model (with an 8-bit quant) or an 800B model at some 4-bit quant.
Why would marketing leave such low-hanging fruit on the tree?
They wouldn't.
I assume we can go up to 120B using fp8?
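Back-of-the-envelope for the parameter counts in this thread, counting weights only; the KV cache, runtime, and OS all eat into the remainder, so the practical ceiling sits a bit lower:

    # Approximate weight footprint: parameters x bits-per-parameter / 8.
    def weights_gb(params_billion, bits):
        return params_billion * bits / 8

    print(weights_gb(200, 4))   # 100.0 GB -> 200B @ FP4 fits in 128 GB
    print(weights_gb(405, 4))   # 202.5 GB -> 405B @ FP4 needs two linked Sparks
    print(weights_gb(120, 8))   # 120.0 GB -> 120B @ FP8 is right at the 128 GB edge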
Even if you were to say memory bandwidth was the problem, there is no consumer-grade GPU that can run any SoTA LLM; no matter what, you'd have to settle for a more mediocre model.
Outside of LLMs, 256 GB/s is not as much of an issue and many people have dealt with less bandwidth for real world use cases.
For the newest models unless you quantize the crap out of them, even with a 5090 you’re going to be swapping blocks, which slows things down anyways. At least you’d be able to train on them at full precision with a decent batch size.
That said, I can’t imagine there’s enough of a market there to make it worth it.
The only likely difference with the DGX Spark is that it'll be a more desktop-centered platform. What people will do with it, I'm not sure, but for, say, VR, the DGX Spark is basically the best compute puck for one right now.
$3,999
I'd rather just get an M3 Ultra. Have an M2 Ultra on the desk, and an M3 Ultra sitting on the desk waiting to be opened. Might need to sell it and shell out the cash for the max ram option. Pricey, but seems worthwhile.
Fits into 32GB: 5090
Fits into 64-96GB: Mac Studio
Fits into 128GB: for now the 395+ on $/token/s; Mac Studio if you don't care about $ but don't have unlimited money for an Hxxx
This could be great for models that fit in 128GB where you want the best $/token/s (if it is faster than a 395+).
# This is deprecated, but can still be referenced:
options amdgpu gttsize=122800
# This specifies GTT by number of 4KB pages:
# 31457280 * 4KB / 1024 / 1024 = 120 GiB
options ttm pages_limit=31457280
Shouldn't it be "infer"?
Funnily enough, things like this show that a human probably was involved in the writing. I doubt an LLM would have produced that. I've often thought about how future generations are going to signal that they're human; maybe the way will be human language changing much more rapidly than it has, maybe even mid-sentence.
I'd argue that "inference" has taken on a somewhat distinct new meaning in an LLM context (loosely: running actual tokens through the model), and reverting to the base verb form would make the sentence less clear to me.
Cf. "compute" is a verb for normal people, but for techies it is also "hardware resources used to compute things".
I wonder if this also applies to this DGX Spark. I hope not.
That platform was great for a few months...
For most of the Tegra boards there’s also upstream support. Overall the situation with NVidia BSP is about 10000x better than weird Chinese stuff. In the case of Tegra/Jetson, there’s even detailed first-party documentation about reconstructing the BSP components from source:
https://docs.nvidia.com/jetson/archives/l4t-archived/l4t-327...
I’d assume the decent software support will carry over to DGX.
https://www.jeffgeerling.com/blog/2024/amd-radeon-pro-w7700-...