ROCm GPU Compute Performance With AMD Ryzen AI MAX+ "Strix Halo": https://www.phoronix.com/review/amd-strix-halo-rocm-benchmar...
There are some recent discussions on YouTube about it [1], including one with a senior VP [2].
[0]: https://github.com/ROCm/TheRock [1]: https://www.youtube.com/watch?v=6tASUo7UqNw&t=4551 [2]: https://www.youtube.com/watch?v=0B8JOtS2Tew
[1] https://www.amd.com/en/technologies/cdna.html [2] https://www.amd.com/en/technologies/rdna.html
It's not a rumor. It came straight from an executive: https://www.tomshardware.com/pc-components/cpus/amd-announce...
Maybe Zen 6. AMD already provides the Z-Series specifically for handhelds. The Steam Deck currently uses regular Zen 2 cores with an RDNA 2 iGPU of the sort that shipped with Zen 3+ APUs. Passive cooling would be best, and would probably run well with native Linux ports like Counter-Strike 2. But I'm worried they'll need to use fans again.
One thing I wouldn't mind having on my Deck is FSR4 support, though AMD still hasn't submitted its Vulkan FP8 extension proposal, so enabling it even on desktop Linux requires unofficial Mesa hacks.
2001: Radeon 8000 series
2013: Radeon HD 8000 series
2025: Radeon 8000S series
Today's northbridge (aka the memory controller) is on the CPU. GPUs need a powerful memory controller, and between the CPU and the southbridge/chipset, the most powerful memory controller is the one on the CPU itself.
More generally, there isn't really a place for low-performance integrated graphics any more, and southbridge-style chips are made on old, cheap processes; crippled by poor memory access, they probably wouldn't run any modern desktop well. A second option would be to put a small amount of local memory on the motherboard along with the chipset, which again would be slow and still costly, while losing the normal iGPU advantage of unified CPU and GPU access to the same data (UMA).
I think Strix Halo is up to 40 CUs? Which is more than a 7600 XT (32 CUs) but less than a 7700 XT (54 CUs).
So a bit on the low-end in the scheme of dGPUs. But Strix Halo might be the most powerful iGPU ever made.
edit: there's a new marketing claim from AMD that it beats the M4 in some configurations by 2.6x: https://www.amd.com/en/developer/resources/technical-article... .. but that's against a low-memory M4 Pro configuration. I wonder if there are independently benchmarked numbers of the M4 Max vs the 395 out there.
Of course, the major problem with Strix Halo is the price. I'm just wondering how much the iGPU contributed to the insane price tag compared to the NPU. If AMD can release a similar APU without the (at least on Linux) useless NPU at more accessible pricing (e.g. the 8745HS), they could easily dominate the low-end mobile dGPU market.
https://pbs.twimg.com/media/GsyOHOEW0AAVogO?format=jpg&name=...
SecretDreams•11h ago
Maybe I'm missing something?
rbanffy•10h ago
roenxi•10h ago
Plus, although I can't really swear to understand how these chips work, my read is that this is basically a graphics card that can be configured with 64GB of memory. If I'm not misreading that, it actually sounds quite interesting; even AMD's hopeless compute drivers might be useful for AI work if enough RAM gets thrown into the mix. Although my burn wounds from buying AMD haven't healed yet, so I'll let someone else fund that experiment.
jakogut•8h ago
With Ollama, hardware acceleration doesn't really work through ROCm. ROCm doesn't officially support gfx1150 (Strix Point, RDNA 3.5), though you can override it to fake gfx1151 (Strix Halo, also RDNA 3.5 and UMA), and it works.
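A minimal sketch of the override, in case it helps anyone. HSA_OVERRIDE_GFX_VERSION is ROCm's standard spoofing knob; the "11.5.1" value for gfx1151 is my reading of the gfxNNNN naming scheme, so treat it as an assumption:

    import os
    import subprocess

    # Spoof the ROCm GPU target so kernels built for gfx1151 (Strix Halo)
    # load on the officially unsupported gfx1150 (Strix Point) part.
    env = dict(os.environ, HSA_OVERRIDE_GFX_VERSION="11.5.1")
    subprocess.run(["ollama", "serve"], env=env, check=True)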
I think I got it to work for smaller models that fit entirely into the preallocated VRAM buffer, but my machine only allows for statically allocating up to 16 GB for the GPU, and where's the fun in that? This is a unified memory architecture chip, I want to be able to run 30+ GB models seamlessly.
It turns out, you can. Just build llama.cpp from source with the Vulkan backend enabled. You can use a 2 GB static VRAM allocation, and any additional data spills into GTT, which the driver maps into the GPU's address space seamlessly.
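Roughly what that looks like, as a Python sketch (assumes a checkout of ggml-org/llama.cpp; the -DGGML_VULKAN=ON CMake option selects the Vulkan backend, and the model filename is hypothetical, so double-check against the llama.cpp docs):

    import subprocess

    # Configure llama.cpp with the Vulkan backend instead of ROCm/HIP.
    subprocess.run(["cmake", "-B", "build", "-DGGML_VULKAN=ON"], check=True)
    # Build in release mode with all available cores.
    subprocess.run(["cmake", "--build", "build", "--config", "Release", "-j"],
                   check=True)
    # Run with all layers offloaded; anything beyond the static VRAM
    # carve-out spills into GTT.
    subprocess.run(["build/bin/llama-cli", "-m", "gemma3-27b-q4.gguf",
                    "-ngl", "99", "-p", "Hello"], check=True)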
You can see a benchmark I performed of a small model on GitHub [0], but I've done up to Gemma3 27b (~21 GB) and other large models with decent performance, and Strix Halo is supposed to have 2-3x the memory bandwidth and compute performance. Even 8b models perform well with the GPU in power saving mode, inside ~8W.
Come to think of it, those results might make a good blog post.
[0] https://github.com/ggml-org/llama.cpp/discussions/10879
Search for "HX 370"
SecretDreams•7h ago
With this reasoning, I'd probably argue all modern CPUs and GPUs aren't particularly remarkable/novel. And that could well be fine.
At the end of the day, these benchmarks are all meant to inform on relative performance, price, and power consumption so end users can make informed decisions (imo). The relative comparisons are low-key just as important as the new bench data point.
opencl•9h ago
Phoronix just doesn't do much mobile dGPU testing in general, so there isn't much data to compare with there.
michaellarabel•9h ago
SecretDreams•7h ago
jauntywundrkind•7h ago
Strix Halo as an APU has two very clear advantages. First, I expect power consumption is somewhat better, due to using LPDDR5(x?) and not needing to go over PCIe.
But the real win is that you can get a 64GB or 128GB GPU (well, somewhat less than that)! And there's not really anything stopping 192GB or 256GB builds from happening, now that bigger RAM sizes are finally available. So far, though, all Strix Halo offerings have soldered RAM (non-user-upgradeable, no CAMM2 offerings yet), and no one's doing more than 128GB. But that's still a huge LLM compared to what consumers could run before! Or lots of LLMs loaded and ready to go! We see similar things with the large unified memory on Mac APUs; it's why Mac minis are sometimes popular for LLMs.
Meanwhile Nvidia is charging $20k+ for an A100 GPU with 80GB of RAM. You won't have that level of performance, but you'll be able to fit an even bigger LLM than it can, for 1/10th the price.
There are also a lot of neat applications for DBs or any kind of data-intensive processing, because unified memory means the work can move between CPU and GPU without having to move the data. Normally, to use a GPU, you end up copying data out of main memory, writing it to the GPU*, then reading it there to do the work; here you can skip two-thirds of those read/write steps.
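To put rough numbers on the copy you skip (all figures here are assumptions for illustration):

    # Back-of-envelope: staging a working set onto a dGPU over PCIe
    # versus a UMA iGPU reading the same pages in place.
    working_set_gb = 21          # e.g. a ~21 GB quantized 27b model
    pcie4_x16_gb_s = 32          # rough practical PCIe 4.0 x16 bandwidth
    copy_s = working_set_gb / pcie4_x16_gb_s
    print(f"dGPU: ~{copy_s:.2f} s just to move the data across the bus")
    print("UMA iGPU: ~0 s, the GPU maps the pages the CPU already wrote")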
There's some very awesome potential for doing query processing on the GPU (ex: PG-Strom). Might also be pretty interesting for a GPU-based router, a la PacketShader (2010).
* Note that PCIe p2p-dma / device RDMA / dma-buf has been getting progressively better, and a lot of attention, over the past half decade, such that, say, a NIC can send network data directly to GPU memory, or an NVMe drive can send data directly to the GPU or network, without bouncing through main memory. One recent example of many: https://www.phoronix.com/news/Device-Memory-TCP-TX-Linux-6.1...
Giving an APU actually fast RAM has some cool use cases. I'm excited to see the lines blur in computing like this.
dragonwriter•6h ago
Or, sometime in the next month or so, Nvidia GB10-based mini-PC form factor devices with 128GB (with a high-speed interconnect that allows two to serve as a single 256GB system) from various brands (including direct from Nvidia) for $3000-4000, depending on exact configuration and who is assembling the completed system.