Comparing it against the RTX 4000 SFF Ada (20GB), which is around $1.2k (if you believe the original price on the Nvidia website: https://marketplace.nvidia.com/en-us/enterprise/laptops-work...) and which I have access to on a Hetzner GEX44: I'd ballpark it at 2.5-3x faster than the desktop, except for the tg128 test, where the difference is "minimal" (but I didn't do the math).
Theoretically you can have the best of both worlds if you don't mind running an OCuLink eGPU enclosure.
I keep thinking that the bottleneck has to be CPU RAM, and that for a large model the difference would be minor. For example, with a 100 GB model such as quantised gpt-oss-120B, I imagine that going from 10GB to 24GB of VRAM would scale my t/s like 1/90 -> 1/76, so roughly a 20% advantage? But I can't find much on the high-level scaling math. People either create calculators that oversimplify, or they go too deep into the weeds.
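To put numbers on that intuition, here's a minimal sketch of the scaling model I'm assuming: every weight byte resident on a device gets streamed once per generated token, so the slow CPU-RAM portion dominates. The bandwidth figures are rough placeholders (a 3090-class GPU and dual-channel desktop RAM), and an MoE like gpt-oss-120B only touches its active experts per token, so real numbers will differ:

def tokens_per_sec(model_gb, vram_gb, gpu_bw_gbs=900, cpu_bw_gbs=80):
    # time per token ~= GPU-resident GB / GPU bandwidth + CPU-resident GB / CPU bandwidth
    gpu_gb = min(model_gb, vram_gb)
    cpu_gb = model_gb - gpu_gb
    return 1.0 / (gpu_gb / gpu_bw_gbs + cpu_gb / cpu_bw_gbs)

for vram in (10, 24):
    print(f"{vram} GB VRAM -> {tokens_per_sec(100, vram):.2f} tok/s")
# prints roughly 0.88 vs 1.02 tok/s, i.e. ~16% faster: close to the 1/90 -> 1/76 guess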
I'd like a new AnandTech, please.
Obviously an RTX 5090 with 32GB of VRAM is even better, but they cost around $2000, if you can find one.
What's interesting about this Strix Halo system is that it has 128GB of RAM that is accessible (or mostly accessible) to the CPU/GPU/APU. This means that you can run much larger models on this system than you possibly could on a 3090, or even a 5090. The performance tests tend to show that the Strix Halo's memory bandwidth is a significant bottleneck though. This system might be the most affordable way of running 100GB+ models, but it won't be fast.
The BIOS allows pre-allocating 96 GB max, and I'm not sure if that's the maximum for Windows, but under Linux, you can use `amdttm.pages_limit` and `amdttm.page_pool_size` [1]
[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
htpc@htpc:~% free -h
               total        used        free      shared  buff/cache   available
Mem:           125Gi       123Gi       920Mi        66Mi       1.6Gi       1.4Gi
Swap:           19Gi       4.0Ki        19Gi
[1] https://bpa.st/LZZQ
In 4K pages, for example:
options ttm pages_limit=31457280
options ttm page_pool_size=15728640
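Those two values are just GiB counts converted to 4 KiB pages; a quick sanity check in Python:

GiB = 1024**3
page = 4096                    # 4K pages, as above
print(120 * GiB // page)       # 31457280 -> pages_limit (120 GiB ceiling)
print(60 * GiB // page)        # 15728640 -> page_pool_size (60 GiB pre-allocated)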
This will allow up to 120GB to be allocated and will pre-allocate 60GB (you could pre-allocate none of it or all of it, depending on your needs and fragmentation). I believe `amdgpu.vm_fragment_size=9` (2MiB) is optimal.

That gives us a total TDP of around 150W and 48 GB of VRAM, and we can run Qwen 3 Coder 30B A3B at 4-bit quantization with up to 32k context at around 60-70 t/s with Ollama. I also tried out vLLM, but the performance surprisingly wasn't much better (maybe it would be under a bigger concurrent load). Felt like sharing the data point because of the similarity.
Honestly it's a really good model, even good enough for some basic agentic use (e.g. with Aider, RooCode, and so on). MoE seems to be the way to go for somewhat limited hardware setups.
Of course I'm not recommending L4 cards, since they have a pretty steep price tag. Most consumer cards feel a bit power-hungry, and you'll probably need more than one to fit decent models, though being able to game on the same hardware sounds pretty nice. But speaking of getting more VRAM, the Intel Arc Pro B60 can't come soon enough (if they don't insanely overprice it), especially the 48 GB variant: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t...
I wonder if I could follow that up by buying a 3090 (jumping the price by $1000, plus whatever I plug it into) and comparing the difference. Could be an eye-opening experiment for me.
Here's the write-up of my plan for the cheap machine, if anyone is interested:
https://joeldare.com/my_plan_to_build_an_ai_chat_bot_in_my_b...
See my AI cluster automation setup here: https://github.com/geerlingguy/beowulf-ai-cluster
I was building that over the course of making this video, because it's insane how much manual labor people put into building home AI clusters :D
I saw mixed results but comments suggest very good performance relative to other at-home setups. Can someone summarize?
I've had that using gemini (via windsurf). Doesn't seem to happen with other models. No idea if there's any correlation but it's an interesting failure mode.
I'd recommend rerunning the benchmarks with the sampling methods explicitly configured the same in each tool. It's tempting to benchmark with all the nondeterminism turned off, but I think that's less useful: in practice, for any model you're self-hosting for real work, you'll probably want top-p sampling or something like it, and you want to benchmark the implementation of that too.
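For example (a rough sketch, assuming a local Ollama on its default port and a placeholder model tag), pass the sampler settings explicitly instead of relying on defaults, and mirror them with --temp/--top-p/--top-k/--seed on the llama.cpp side:

import requests

opts = {"temperature": 0.7, "top_p": 0.9, "top_k": 40, "seed": 42}
r = requests.post("http://localhost:11434/api/generate",
                  json={"model": "llama3.1:8b",   # placeholder model tag
                        "prompt": "Write a haiku about benchmarks.",
                        "stream": False,
                        "options": opts})          # explicit sampling config
print(r.json()["response"])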
I've never seen Gemini do this, though; that'd be kinda wild if they shipped something that samples that way. I wonder if Windsurf was sending a different config over the API, or if this was a different bug.
Here are some examples: https://www.reddit.com/r/GeminiAI/comments/1lxqbxa/i_am_actu...
After having that happen, I changed my stuff to check each model's preferred context size and set the context size to match before using it.
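Roughly like this against the Ollama API (a sketch; the model_info key names are what recent Ollama versions report, and the model tag is a placeholder):

import requests

model = "llama3.1:8b"  # placeholder tag
show = requests.post("http://localhost:11434/api/show",
                     json={"model": model}).json()
arch = show["model_info"]["general.architecture"]
max_ctx = show["model_info"][f"{arch}.context_length"]  # model's trained context

requests.post("http://localhost:11434/api/generate",
              json={"model": model, "prompt": "hi", "stream": False,
                    "options": {"num_ctx": min(max_ctx, 32768)}})  # don't exceed it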
I was monitoring the network while running various models, and for all models, the first step was to copy over layers (a few gigabytes to 100 or so GB for the huge models), and that would max out the 5 Gbps connection.
But then while warming up and processing, there were only 5-10 Mbps of traffic, so you could do it over a string and tin cans, almost.
But that's a limitation of the current RPC architecture: it can't really parallelize processing, so as I noted in the post and in my video, it kinda uses resources round-robin style, and for any model that fits on a single node, the whole cluster can only be slower than that single node.
Even 25 Gbit is pretty affordable for home these days; if it scaled 5x better than 5 Gbit, that might be enough to make larger models MUCH more practical.
E.g. if the individual layers of your model fit on one node and the output could be pipelined, so work keeps cascading through the various nodes, it'd do well. But because each next token depends on the token just generated, LLM decoding can't be pipelined that way (toy sketch after the link below). You can see it in this [0] image from the attached blog post of his llama.cpp testing: each node processes a batch of work, passes it off to the next node, then goes idle.
[0] https://www.jeffgeerling.com/sites/default/files/images/fram...
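A toy sketch of that dependency (nothing llama.cpp-specific, just the control flow that forces the round-robin pattern):

import time

def node_forward(node, hidden):
    print(f"node {node} busy, others idle")
    time.sleep(0.05)                  # stand-in for that node's slice of layers
    return hidden

def generate(n_tokens, n_nodes=3):
    token = 0
    for _ in range(n_tokens):
        hidden = token
        for node in range(n_nodes):   # each token walks the nodes strictly in order
            hidden = node_forward(node, hidden)
        token = hidden                # the next token can't start until this returns
    # wall time ~= n_tokens * n_nodes * per-node time: no overlap between nodes

generate(3)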
There are some folks who use this for clustering. Here's a Reddit thread about Mac systems; the top link is to a really not-great hub-and-spoke setup (not everyone has BGP skills, alas). I've linked to that below, 19m35s in. https://www.reddit.com/r/MacStudio/comments/1mc1z0s/anyone_c... https://youtu.be/Ju0ndy2kwlw?t=19m35s
I do hope that CXL 3.1, with its host-to-host capability, makes glueless scale-out easier. It's hyped as being for accelerators and attached memory, but having a much lower-overhead, RDMA-capable fabric in every PCIe+CXL port is very, very alluring. Can't come soon enough! Servers at first, and maybe I'm hopelessly naive here, but I do sort of expect it to show up on consumer hardware too.
The upcoming Zen 6 Epyc was just confirmed to go from 12 to 16 memory channels. That'll be very good to see. The Strix Halo successor, Medusa Halo, is supposed to be 6-channel. (Most of these rumors/leaks are via Moore's Law Is Dead, fwiw.) It's absolutely needed to scale to more cores, but it still seems to fall well short of what AI demands.
I really can't congratulate Apple enough for being deadly serious about memory bandwidth. What is just gobsmacking to me is that no one else has responded, half a decade later. Put the RAM on the package! DDR, not crazy-expensive HBM. The practice of building superchips out of 4x chips, getting scalability that way, feels so obvious too, and is so commendable!
Different end of the spectrum, but Intel's tablet-sized Lakefield had Package-on-Package (PoP) RAM, and it was pretty fast for its day (4266MHz). But it didn't scale up the width like Apple has.
It's hard to watch x86 seem so stuck, so unable to make what feels like such a necessary push toward more memory bandwidth.
I really wish we saw more testing of USB subsystems! With PCIe being so limited, there's such allure to having two USB4 ports! But will they work?
IIRC we saw similarly very low bandwidth on Apple's ARM chips too. This was during the M1 era or so; dunno if things got better with that chip or future ones! Presumably so, or I feel like we'd be hearing about it, but these things can also just stay hidden!
It was really cool back in the Ryzen 1 era seeing their CPUs get some USB on the CPU itself, rather than having to go through the I/O/peripheral hub (southbridge?) with its limited connection to the CPU. There's a great breakout chart here, showing both the 1800X and the various chipsets available: relishable data. https://www.techpowerup.com/cpu-specs/ryzen-7-1800x.c1879
I feel like there have been some recent improvements to USB4/Thunderbolt in the kernel to really ensure all lanes get used, but I'm struggling to find a reference/link. What kernel was this tested against? If nothing else, it'd be great to poke around in debugfs to make sure all the lanes are getting configured. https://www.phoronix.com/news/Linux-6.13-USB-Changes
dijit•6mo ago
I'm struggling to justify the cost of a Threadripper (let alone a Pro!) for a AAA game studio, though.
I wonder who can justify these machines. High-frequency trading? Data science? Shouldn't that be done on servers?
wpm•6mo ago
This is pure market segmentation. If you need that little bit extra, you're forced to compromise or to open your wallet big time, and AMD is betting on the people who "really" need that extra oomph paying up.
toast0•6mo ago
If you need that in a single system, you gotta pay up. Lower-tier SP6 processors are actually pretty reasonably priced; boards are still spendy though.
sliken•6mo ago
Seems like Threadripper is too low-volume to be price-competitive with Epyc, and there's a relatively price-insensitive workstation market out there.
StrangeDoctor•6mo ago
I think the Xeon systems should have worked and that it was actually a motherboard BIOS issue, but I had seen a photo of it running in a Threadripper and prayed I wasn't digging an even deeper hole.
geerlingguy•6mo ago
I found it difficult to install ROCm on Fedora 42 but after upgrading to Rawhide it was easy, so I re-tested everything with ROCm vs Vulkan.
Ollama, for some silly reason, doesn't support Vulkan, even though I've used a fork many times to get full GPU acceleration with it on Pi, Ampere, and even this AMD system... (moral of the story: just stick with llama.cpp).
edwinjones•6mo ago
https://x.com/ollama/status/1952783981000446029
No experimental flag option, no "you can use the fork that works fine, but we don't have capacity to support this," just a hard "no, we think it's unreliable." I guess they just want you to drop them and use llama.cpp.
geerlingguy•6mo ago
ROCm support is not wonderful. It's certainly worse for an end user to deal with than Vulkan, which usually 'just works'.
edwinjones•6mo ago
Considering they created Mantle, you would think it would be the obvious move too.
dagmx•6mo ago
There was a period in between where AMD had basically EOL'd Mantle and Vulkan wasn't even in the works yet.
zozbot234•6mo ago
My conspiracy theory is that it would help if contributors kept the proposed Vulkan compute support up to date with new Ollama versions; no maintainer wants to deal with out-of-date pull requests.
jcastro•6mo ago
Great work on this!
sliken•6mo ago
If you need the memory bandwidth, the Strix Halo looks good; if you're cache-friendly and don't care about using almost double the power, then the 9950X is a better deal.