1) Support your graphics cards on linux using kernel drivers that you upstream. All of them. Not just a handful - all the ones you sell from say 18 months ago till today.
2) Make GPU acceleration actually work out of the box for pytorch and tensorflow. Not some special fork, patched version that you “maintain” on your website, the tip of the main branch for both of those libraries should just compile out of the box and give people gpu-accelerated ML.
This is table stakes but it blows my mind that they keep making press releases and promises like this that things are on the roadmap without doing thing one and unfucking the basic dev experience so people can actually use their GPUs for real work.
How it actually is: 1) Some cards work with rocm, some cards work with one of the other variations of BS libraries they have come up with over the years. Some cards work with amdgpu but many only work with proprietary kernel drivers which means if you don’t use precisely one of the distributions and kernel versions that they maintain you are sool.
2) Nothing whatsoever builds out of the box and when you get it to build almost nothing runs gpu accelerated. For me, pytorch requires a special downgrade, a python downgrade and a switch to a fork that AMD supposedly maintain although it doesn’t compile for me and when I managed to beat it into a shape where it compiled it wouldn’t run GPU accelerated even though games use the GPU just fine. I have a GPU that is supposedly current, so they are actively selling it, but can I use it? Can I bollocks. Ollama won’t talk to my GPU even though it supposedly works with ROCm. It only works with ROCm with some graphics cards. Tensorflow similar story when I last tried it although admittedly I didn’t try as hard as pytorch.
Just make your shit work so that people can use it. It really shouldn’t be that hard. The dev experience with NVidia is a million times better.
Because, there are tons of IP and trade secrets involved in driver development and optimization. Sometimes game related, sometimes for patching a rogue application which developers can't or don't fix, etc. etc.
GPU drivers are ought to be easy, but in reality, they are not. The open source drivers are "vanilla" drivers without all these case-dependent patching and optimization. Actually, they really work well out of the box for normal desktop applications. I don't think there are any cards which do (or will) not work with the open source kernel drivers as long as you use a sufficiently recent version.
...and you mention ROCm.
I'm not sure how ROCm's intellectual underpinnings are but, claiming lack of effort is a bit unfair to AMD. Yes, software was never their strong suit, but they're way better when compared to 20 years earlier. They have a proper open source driver which works, and a whole fleet of open source ROCm packages, which is very rigorously CI/CD tested by their maintainers now.
Do not forget that some of the world's most powerful supercomputers run on Instinct cards, and AMD is getting tons of experience from these big players. If you think the underpinnings of GPGPU libraries are easy, I can only say that the reality is very different. The simple things people do with PyTorch and other very high level libraries pull enormous tricks under the hood, and you're really pushing the boundaries of the hardware performance and capability-wise.
NVIDIA is not selling a tray full of switches and GPUs and require OEMs to integrate it as-is for no reason. On the other hand, the same NVIDIA acts very slowly to enable an open source ecosystem.
So, yes, AMD is not in an ideal position right now, but calling them incompetent doesn't help either.
P.S.: The company which fought for a completely open source HDMI 2.1 capable display driver is AMD, not NVIDIA.
ATI started as a much more closed company. Then, they pivoted and started to open their parts. They were hamstrung at the HDCP at one point, and they decided to decouple HDCP block from video accelerators at silicon level to allow open source drivers to access video hardware without leaking/disabling HDCP support. So, they devoted to open what they have, but when you have tons of legacy IP, things doesn't go from 0 to 100 in a day. I want to remind that "game dependent driver optimization" started pre 2000s. This is how rooted these codebases are.
NVIDIA took a different approach, which was being indifferent on the surface, but their hardware became a bit more hostile at every turn towards nouveau. Then they released some specs to allow "hardware enablement" by nouveau, so closed source drivers can be installed, and the card didn't blank-screened at boot.
Then, as they fought with kernel developers, with some hard prodding by Kernel guys and some coercing by RedHat, NVIDIA accepted to completely remove "shim" shenanigans, and moved closed bits of the kernel module to card firmware by revising card architecture. It's important to keep in mind that NVIDIA's open drivers means "an open, bare bones kernel module, a full fledged and signed proprietary firmware which can be used by closed source drivers only and a closed source GLX stack", where in AMD this means "An open source kernel module, standard MESA libraries, and a closed source firmware available to all drivers".
It was in talks with nouveau guys to allow them to use the full powered firmware with clock and power management support, but I don't know where it went.
The CUDA environment is also the same. Yes it works very well, and it's a vast garden, but it's walled and protected by electrified fence and turrets. You're all in, or all out.
Probably because they don't have an open-source driver for linux and they can focus on the proprietary one.
Yet, HDMI forum said that they can't implement an HDMI2.1 capable driver in the open, with some nasty legal letters.
I have a couple of friends who wrote 3D engines from scratch and debugged graphics drivers for their engines for a living. It's a completely different jungle filled with completely different beasts.
I think being able to call glxinfo on an AMD card running with completely open drivers and being able to see extensions from NVIDIA, AMD, SGI, IBM and others is a big win already.
What does the 3d engine look like? Custom made AAA in companies such as EA or Ubisoft?
They can't open source their proprietary drivers even if they wanted to because they don't own all of the IP and their code is full of NDA'd trade secrets. AMD isn't paying two different teams to do the same work because they like wasting money.
I'm aware that 8GB Vram are not enough for most such workloads. But no support at all? That's ridiculous. Let me use the card and fall back to system memory for all I care.
Nvidia, as much as I hate their usually awfully insufficient linux support, has no such restrictions for any of their modern cards, as far as I'm aware.
All the stuff works even if it's not officially supported. It's not that hard to set a single environment variable (HSA_OVERRIDE_GFX_VERSION).
Like literally, everything works, from Vega 56/64 to ryzen 99xx iGPUs.
Also, try nixos. Everything literally works with a single config entry after recent merge of rocm 6.3. I successfully run a zoo of various radeons of different generations.
https://open-iov.org/index.php/GPU_Support#AMD
https://github.com/GPUOpen-LibrariesAndSDKs/MxGPU-Virtualiza...
As "AI" use cases mature, NPU/AIE-ML virtualization will also be needed.
Libvirt on debian. The vCPUs apparently spend a good amount of time in idle (not steal or wait). Getting up to ca 50MB/s streaming to disk over network while unvirtualized basically saturating 2.5GE on the same hardware. Looking at perf the culprit seems to be a spinlock eating up most of cpu time. Can't seem to figure this one out and I've experimented like crazy with CPU pinning, SATA/NVMe/PCIe passthrough varieties, BIOS settings, over- and underprovisioning, memory ballooning, clock sources, every magic kernel parameter that seemed vaguely related when searching, etc.
I haven't seen anything like this on the Intel boxes running the same hypervisor setup.
janpmz•5h ago
Mountain_Skies•4h ago