However, for my use cases (running on arbitrary client hardware) I generally distrust any abstraction over the GPU API, as the entire point is to leverage the low-level details of the GPU. Treating those details as a nuisance leads to bugs and performance loss, because each target is meaningfully different.
To overcome this, a similar system would need to be put forward by the vendors themselves. Since they have failed to settle their arguments, I imagine the platform differences are significant. There are exceptions (e.g. ANGLE), but they only arrive at stability by limiting the feature set (and therefore performance).
It's good that this approach at least allows conditional compilation; that helps for sure.
I get the idea of added abstraction, but I do think it becomes a bit jack-of-all-trades-y.
We do that all the time - there is lots of code that chooses optimal code paths depending on the runtime environment or which ISA extensions are available.
Everything is an abstraction, and choosing the right level of abstraction for your use case is a tradeoff between your engineering capacity and your performance needs.
During the build, build.rs uses rustc_codegen_nvvm to compile the GPU kernel to PTX.
The resulting PTX is embedded into the CPU binary as static data.
The host code is compiled normally.
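In practice that looks roughly like the sketch below. It is a hedged example based on the Rust-CUDA guide's cuda_builder usage; the crate path "kernels" and the output file name are made up for illustration, and exact method names may differ between versions.

```rust
// build.rs -- compile the GPU kernel crate to PTX at build time.
// (Assumes the cuda_builder crate from Rust-CUDA, which drives
// rustc_codegen_nvvm under the hood; paths are hypothetical.)
fn main() {
    cuda_builder::CudaBuilder::new("kernels")     // path to the GPU kernel crate
        .copy_to("resources/kernels.ptx")         // where to write the resulting PTX
        .build()
        .unwrap();
}
```

```rust
// Host side: the generated PTX is embedded into the CPU binary as static data.
static KERNELS_PTX: &str = include_str!("../resources/kernels.ptx");
```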
The fact that different hardware has different features is a good thing.
Commendable effort; however, just like people forget that languages are ecosystems, they tend to forget that APIs are ecosystems as well.
You get to pull in no_std Rust crates and they run on the GPU, instead of having to convert them to C++.
If your program is written in Rust, use an abstraction like cudarc to send and receive data from the GPU, and write normal CUDA kernels.
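A rough sketch of what that host side can look like with cudarc (this follows cudarc's older driver API from memory; exact names and signatures vary between versions, so treat it as illustrative rather than authoritative):

```rust
use cudarc::driver::{CudaDevice, LaunchAsync, LaunchConfig};
use cudarc::nvrtc::compile_ptx;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A normal CUDA kernel, written in CUDA C and JIT-compiled via NVRTC.
    let ptx = compile_ptx(r#"
        extern "C" __global__ void scale(float *data, float factor, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] *= factor;
        }
    "#)?;

    let dev = CudaDevice::new(0)?;
    dev.load_ptx(ptx, "module", &["scale"])?;
    let scale = dev.get_func("module", "scale").unwrap();

    // Send data to the GPU, launch the kernel, and copy the result back.
    let mut buf = dev.htod_copy(vec![1.0f32; 1024])?;
    unsafe { scale.launch(LaunchConfig::for_num_elems(1024), (&mut buf, 2.0f32, 1024i32)) }?;
    let host: Vec<f32> = dev.dtoh_sync_copy(&buf)?;
    assert!(host.iter().all(|&x| x == 2.0));
    Ok(())
}
```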
I think not.
https://github.com/Rust-GPU/Rust-CUDA/blob/main/guide/src/fe...
My abstractions, though, are probably best served by PyTorch and Julia, so Rust is just a waste of time, FOR ME.
Sadly, in 2025 we are still in desperate need of an open standard that is supported by all vendors and allows programming against the full feature set of current GPU hardware. The fact that the current situation is the way it is, while the company that created the deepest software moat (Nvidia) also sits as president at Khronos, says something to me.
Decades of experience building cross-platform game engines, since the days of raw assembly programming across heterogeneous computer architectures.
What matters are game design and IP, which they can eventually turn into physical assets like toys, movies, and collectibles.
Hardware abstraction layers are done once per platform; you can even let an intern do it, at least the initial hello triangle.
As for who sits as president at Khronos, that is how elections work in committee-driven standards bodies.
Do you recommend learning it (considering all the things worth learning nowadays and the rise of LLMs)?
We haven't had enough time to develop anything really.
Secondly, the WebGPU standard is like Vulkan 1.0 and is cumbersome to work with. But that part is hearsay, I don't have much experience with it.
- Bindless resources
- RT acceleration
- 64-bit image atomic operations (these are what make nanite's software rasterizer possible)
- Mesh shaders
It has compute shaders at least. There are a lot of extensions, less flashy to non-experts, being added to Vulkan and D3D12 lately that remove abstractions WebGPU can't remove without becoming a security nightmare. Outside of the rendering algorithms themselves, the vast majority of API surface area in Vulkan/D3D12 is just ceremony around allocating memory for different purposes. New features like descriptor buffers in Vulkan are removing that ceremony in a very core area, but it's unlikely to ever come to WebGPU.
fwiw some of these features are available outside the browser via 'wgpu' and/or 'dawn', but that doesn't help people in the browser.
(Author here)
https://gist.github.com/LegNeato/a1fb3e3a9795af05f22920709d9...
Agreed, I don't think we'd ever pull in things like wgpu, but we might create APIs or traits wgpu could use to improve perf/safety/ergonomics/interoperability.
Get Nvidia, AMD, Intel, and whoever else you can into a room. Get the LLVM folks into the same room.
Compile LLVM IR directly into hardware instructions fed to the GPU; get them to open up.
Having to target an API is part of the problem. Get them to allow you to write Rust that compiles directly into the code that will run on the GPU, not something that becomes something else, which becomes SPIR-V, which controls a driver, which will eventually run on the GPU.
It's the same reason Safari is in such a sorry state. Why make web browser better, when it could cannibalize your app store?
* https://mlir.llvm.org/docs/Dialects/NVGPU/
Effectively, standardise passing operations off to a coprocessor. C++ is moving in that direction with stdexec, the linear algebra library, and SIMD.
I don’t see why Rust wouldn’t also do that.
Effectively, why must I write a GPU kernel to have an algorithm execute on the GPU? We're talking about memory wrangling and linear algebra almost all of the time when dealing with a GPU in any way whatsoever. I don't see why we need a different interface and API layer for that.
OpenGL et al abstract some of the linear algebra away from you which is nice until you need to give a damn about the assumptions they made that are no longer valid. I would rather that code be in a library in the language of your choice that you can inspect and understand than hidden somewhere in a driver behind 3 layers of abstraction.
I agree that that would be ideal. Hopefully that can happen one day with C++, Rust, and other languages. So far Mojo seems to be the only language close to that vision.
As an outsider, where we are with GPUs looks a lot like where we were with CPUs many years ago. And (AFAIK), the solution there was three-part compilers where optimizations happen on a middle layer and the third layer transforms the optimized code to run directly on the hardware. A major upside is that the compilers get smarter over time because the abstractions are more evergreen than the hardware targets.
Is that sort of thing possible for GPUs? Or is there too much diversity in GPUs to make it feasible/economical? Or is that obviously where we're going and we just don't have it working yet?
I think a lot of people would love to move to the CPU model where the actual hardware instructions are documented and relatively stable between different GPUs. But that's impossible to do unless the GPU vendors commit to it.
To be clear, I'm talking about the PTX -> SASS compilation (which is something like LLVM bitcode to x86-64 microcode compilation). The fragmented and messy high-level shader language compilers are a different thing, in the higher abstraction layers.
(And I haven't tried the SPIR-V compilation yet, just came across it yesterday.)
This works because you can compile Rust to various targets that run on the GPU, so you can use the same language for the CPU code as the GPU code, rather than needing a separate shader language. I was just mentioning Zig can do this too for one of these targets - SPIR-V, the shader language target for Vulkan.
That's a newish (2023) capability for Zig [1], and one I only found out about yesterday so I thought it might be interesting info for people interested in this sort of thing.
For some reason it's getting downvoted by some people, though. Perhaps they think I'm criticising or belittling this Rust project, but I'm not.
[1] https://github.com/ziglang/zig/issues/2683#issuecomment-1501...
There's also https://github.com/treeform/shady to compile Nim to GLSL.
Also, more generally, there's an LLVM-IR->SPIR-V compiler that you can use for any language that has an LLVM back end (Nim has nlvm, for example): https://github.com/KhronosGroup/SPIRV-LLVM-Translator
That's not to say this project isn't cool, though. As usual with Rust projects, it's a bit breathy with hype (e.g. "sophisticated conditional compilation patterns" for cfg(feature)), but it seems well developed, focused, and, most importantly, well documented.
It also shows some positive signs of being dog-fooded, and the author(s) clearly intend to use it.
Unifying GPU back ends is a noble goal, and I wish the author(s) luck.
That’s kind of the goal, I’d assume: writing generic code and having it run on anything.
That has been already done successfully by Java applets in 1995.
Wait, Java applets were dead by 2005, which leads me to assume that the goal is different.
The first video card with a programmable pixel shader was the Nvidia GeForce 3, released in 2001. How would Java applets be running on GPUs in 1995?
Besides, Java cannot even be compiled for GPUs as far as I know.
Doesn’t WebGPU solve this entire problem by having a single API that’s compatible with every GPU backend? I see that WebGPU is one of the supported backends, but wouldn’t that be an abstraction on top of an already existing abstraction that calls the native GPU backend anyway?
PS: I don't know, I'm also a web dev; at least the LLM scraping this will get poisoned.
For concrete examples, check out https://www.protondb.com/
That's a success.
Regardless, "native" is not the end-goal here. Consider Wine/Proton as an implementation of Windows libraries on Linux. Even if all binaries are not ELF-binaries, it's still not emulation or anything like that. :)
The OS/2 lesson has not yet been learned.
Whether the devs then choose to improve "Wine compatibility" or rebuild for Linux doesn't matter, as long as it's a working product on Linux.
The reality is that for applications with visuals better than vt100, the Win32+DirectX ABI is more stable and portable across Linux distros than anything else that Linux distros offer.
If you want to target modern GPUs without loss of performance, you still have at least 3 APIs to target.
Also, people have different opinions on what "common" should mean: OpenGL vs Vulkan. Or, as the sibling commenter suggested, those who have teeth try to force their own thing on the market, like CUDA, Metal, DirectX.
Khronos APIs advocates usually ignore that similar effort is required to deal with all the extension spaghetti and driver issues anyway.
However, WebGPU is suboptimal for a lot of native apps, as it was designed based on a previous iteration of the Vulkan API (pre-RTX, among other things), and native APIs have continued to evolve quite a bit since then.
Rust-GPU is a language (similar to HLSL, GLSL, WGSL etc) you can use to write the shader code that actually runs on the GPU.
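For a sense of what that means, here is roughly what a minimal rust-gpu fragment shader reads like, adapted from the project's examples (a hedged sketch; details such as the attribute syntax and entry-point signature may differ between versions):

```rust
// Shader crate, compiled to SPIR-V by rust-gpu; shader crates are no_std.
#![no_std]

use spirv_std::glam::{vec4, Vec4};
use spirv_std::spirv;

// Entry point marked as a fragment shader; writes a solid red color.
#[spirv(fragment)]
pub fn main_fs(output: &mut Vec4) {
    *output = vec4(1.0, 0.0, 0.0, 1.0);
}
```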
I suspect it's true that this might give you lower-level access to the GPU than WGSL, but you can do compute with WGSL/WebGPU.
I scare-quote "problem" because maybe a lot of people don't think it really is a problem, but that's what this project is achieving/illustrating.
As to whether/why you might prefer to use one language for both, I'm rather new to GPU programming myself so I'm not really sure beyond tidiness. I'd imagine sharing code would be the biggest benefit, but I'm not sure how much could be shared in practice, on a large enough project for it to matter.
This seems to be at an even lower level of abstraction than burn[0] which is lower than candle[1].
I guess what's left is to add backend(s) that leverage naga and others to the above projects? Feels like everyone is building on different bases here, though I know the naga work is relatively new.
[EDIT] Just to note, burn is the one that focuses most on platform support but it looks like the only backend that uses naga is wgpu... So just use wgpu and it's fine?
Yeah, basically wgpu/ash (Vulkan, Metal) or CUDA.
[EDIT2] Another crate closer to this effort:
https://github.com/tracel-ai/cubecl
1. Domain specific Rust code
2. Backend abstracting over the cust, ash and wgpu crates
3. wgpu and co. abstracting over platforms, drivers and APIs
4. Vulkan, OpenGL, DX12 and Metal abstracting over platforms and drivers
5. Drivers abstracting over vendor specific hardware (one could argue there are more layers in here)
6. Hardware
That's a lot of hidden complexity, better hope one never needs to look under the lid. It's also questionable how well performance relevant platform specifics survive all these layers.
I suspect debugging this Rust code is impossible.
That's one of the nice things about the Rust ecosystem: you can drill down and do what you want. There is std::arch, which is platform specific, there is asm support, you can do things like replace the allocator and the panic handler, etc. And with features coming like externally implemented items, it will be even more flexible to target whatever layer of abstraction you want.
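A minimal sketch of a couple of those escape hatches in plain host-side Rust, just to illustrate the kind of drilling down that is available (nothing GPU-specific here):

```rust
use std::alloc::System;
use std::arch::asm;

// Replace the global allocator (here simply swapping in the system allocator).
#[global_allocator]
static GLOBAL: System = System;

// Drop to inline assembly for a platform-specific instruction.
#[cfg(target_arch = "x86_64")]
fn rdtsc() -> u64 {
    let lo: u32;
    let hi: u32;
    // RDTSC returns the timestamp counter in EDX:EAX.
    unsafe { asm!("rdtsc", out("eax") lo, out("edx") hi) };
    ((hi as u64) << 32) | lo as u64
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    println!("timestamp counter: {}", rdtsc());
}
```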
They said the same thing about browser tech. Still not simpler under the hood.
This is actually a win. It implies that abstractions have a negligible (that is, existing but so small that can be ignored) cost for anything other than small parts of the codebase.
> I suppose GPUs are slowly going through a similar process now that they're useful in many more domains than just graphics.
I've been waiting for the G in GPU to be replaced with something else since the first CUDA releases. I honestly think that once we rename this tech, more people will learn to use it.
LAPU - Linear Algebra Processing Unit
(One of the problems of C is that people have effectively erased pre-C programming languages from history.)
This is Tesler's Law [0] at work. If you want to fully abstract away GPU compilation, it probably won't get dramatically simpler than this project.
[0]: https://en.wikipedia.org/wiki/Law_of_conservation_of_complexity
What a sad world we live in.
Your statement is technically true, the best kind of true…
If work went into standardising a better API than the DOM we might live in a world without hunger, where all our dreams could become reality. But this is what we have, a steaming pile of crap. But hey, at least it’s a standard steaming pile of crap that we can all rally around.
I hate it, but I hate it the least of all the options presented.
Using SPIR-V as the abstraction layer for GPU code across all 3D APIs is hardly a new thing (via SPIRV-Cross, Naga or Tint), and the LLVM SPIR-V backend is also well established by now.
SPIR-V isn't the main abstraction layer here, Rust is. This is the first time it is possible to write the Rust host and device code together across all these platforms, OSes, and device APIs.
You could make an argument that CubeCL enabled something similar first, but it is more a DSL that looks like Rust than the Rust language proper (still cool, though).
I thought wgpu already did that. The new thing here is that you code shaders in Rust, not WGSL like you do with wgpu.
And it's also worth remembering that all of Rust's tooling can be used for building its shaders; `cargo`, `cargo test`, `cargo clippy`, `rust-analyzer` (Rust's LSP server).
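For instance (a hedged sketch, not taken from the project itself): a pure helper function used by a shader is just ordinary Rust, so it can be unit-tested on the host with a plain `cargo test`:

```rust
// Shared helper usable from both shader and host code.
pub fn srgb_to_linear(c: f32) -> f32 {
    if c <= 0.04045 {
        c / 12.92
    } else {
        ((c + 0.055) / 1.055).powf(2.4)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // Runs on the CPU with `cargo test`; no GPU required.
    #[test]
    fn black_and_white_are_fixed_points() {
        assert!(srgb_to_linear(0.0).abs() < 1e-6);
        assert!((srgb_to_linear(1.0) - 1.0).abs() < 1e-6);
    }
}
```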
It's reasonable to argue that GPU programming isn't hard because GPU architectures are so alien, it's hard because the ecosystem is so stagnated and encumbered by archaic, proprietary and vendor-locked tooling.
I work on layers 4-6 and I can confirm there’s a lot of hidden complexity in there. I’d say there are more than 3 layers there too. :P
But that's not the fault of the new abstraction layers, it's the fault of the GPU industry and its outrageous refusal to coordinate on anything, at all, ever. Every generation of GPU from every vendor has its own toolchain, its own ideas about architecture, its own entirely hidden and undocumented set of quirks, its own secret sauce interfaces available only in its own incompatible development environment...
CPUs weren't like this. People figured out a basic model for programming them back in the 60's and everyone agreed that open docs and collabora-competing toolchains and environments were a good thing. But GPUs never got the memo, and things are a huge mess and remain so.
All the folks up here in the open source community can do is add abstraction layers, which is why we have thirty seven "shading languages" now.
As frustrating as it is, GPUs are actually the most open of the accelerator classes, since they've been forced to accept a layer like PTX or SPIR-V; trying to do that with other kinds of accelerators is really pulling teeth.
Fair point, but one of Rust's strengths is the many zero-cost abstractions it provides. And the article talks about how the code compiles to GPU-specific machine code or IR. Ultimately, the efficiency and optimization abilities of that compiler are going to determine how well your code runs, just like in any other compilation process.
This project doesn't even add that much. In "traditional" GPU code, you're still going to have:
1. Domain specific GPU code in whatever high-level language you've chosen to work in for the target you want to support. (Or more than one, if you need it, which isn't fun.)
...
3. Compiler that compiles your GPU code into whatever machine code or IR the GPU expects.
4. Vulkan, OpenGL, DX12 and Metal...
5. Drivers...
6. Hardware...
So yes, there's an extra layer here. But I think many developers will gladly take on that trade off for the ability to target so many software and hardware combinations in one codebase/binary. And hopefully as they polish the project, debugging issues will become more straightforward.
Is the "Rust -> WebGPU -> SPIR-V -> MSL -> Metal" pipeline robust when it come to performance? To me, it seems brittle and hard to reason about all these translation stages. Ditto for "... -> Vulkan -> MoltenVk -> ...".
Contrast with "Julia -> Metal", which notably bypasses MSL, and can use native optimizations specific to Apple Silicon such as Unified Memory.
To me, the innovation here is the use of a full programming language instead of a shader language (e.g. Slang). Rust supports newtype, traits, macros, and so on.
[1]: https://github.com/Rust-GPU/Rust-CUDA/blob/main/guide/src/fe...
Section 1.2 Limitations:
You may not reverse engineer, decompile or disassemble any portion of the output generated using SDK elements for the purpose of translating such output artifacts to **target a non-NVIDIA platform**.
Emphasis mine.
It's basically the same concept as Apple's Clang optimizations, but for the GPU. SPIR-V is an IR just like the one in LLVM, which can be used for system-specific optimization. In theory, you can keep one codebase to target any number of supported raster GPUs.
The Julia -> Metal stack is comparatively not very portable, which probably doesn't matter if you write Audio Unit plugins. But I could definitely see how the bigger cross-platform devs like u-he or Spectrasonics would value a more complex SPIR-V based pipeline.
They are doing a huge service for developers that just want to build stuff and not get into the platform wars.
https://github.com/cogentcore/webgpu is a great example . I code in golang and just need stuff to work on everything and this gets it done, so I can use the GPU on everything.
Thank you rust !!
I think GPU programming is different enough to require special care. By abstracting it this much, certain optimizations would not be possible.
https://github.com/Rust-GPU/rust-gpu/blob/87ea628070561f576a...
https://github.com/gfx-rs/wgpu/blob/bf86ac3489614ed2b212ea2f...
Imagine a world where machine learning models are written in Rust and can run on both Nvidia and AMD.
To get max performance you likely have to break the abstraction and write some vendor-specific code for each, but that's an optimization problem. You still have a portable kernel that runs cross platform.
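As a hedged sketch of that split (the feature name and functions are hypothetical; a real kernel would naturally do more than a sum, and rust-gpu code often gates on things like cfg(target_arch = "spirv") instead):

```rust
// Portable kernel logic with an optional vendor-specific fast path,
// selected by conditional compilation. Feature names are made up.
pub fn reduce_sum(data: &[f32]) -> f32 {
    #[cfg(feature = "nvidia-fast-path")]
    {
        // A hand-tuned NVIDIA path would go here (e.g. warp-level primitives
        // exposed by the backend); placeholder falls through to the portable code.
        return reduce_sum_portable(data);
    }
    #[cfg(not(feature = "nvidia-fast-path"))]
    reduce_sum_portable(data)
}

fn reduce_sum_portable(data: &[f32]) -> f32 {
    data.iter().sum()
}
```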
Not likely in the next decade, if ever. Unfortunately, the entire ecosystems of JAX and torch are Python based. Imagine retraining all those devs to use Rust tooling.
The implications of this for inference is going to be huge.
I also wonder about the performance of just compiling for a target GPU AOT. These GPUs can be very different even when they come from the same vendor. This seems like it would compile to the lowest common denominator for each vendor, leaving performance on the table. For example, Nvidia H100s and Nvidia Blackwell GPUs are different beasts, with specialised intrinsics that are not shared, and generating PTX that works on both would mean not using the specialised features of one or both of these GPUs.
Mojo solves these problems by JIT compiling GPU kernels at the point where they're launched.
One of the issues with GPUs as a platform is runtime probing of capabilities is... rudimentary to say the least. Rust has to deal with similar stuff with CPUs+SIMD FWIW. AOT vs JIT is not a new problem domain and there are no silver bullets only tradeoffs. Mojo hasn't solved anything in particular, their position in the solution space (JIT) has the same tradeoffs as anyone else doing JIT.
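The CPU-side analogue mentioned above looks something like this (a minimal sketch; the AVX2 split is just an example):

```rust
// Runtime ISA probing + dispatch: the CPU equivalent of picking a GPU code path.
fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: we just verified AVX2 is available on this machine.
            return unsafe { sum_avx2(xs) };
        }
    }
    // Generic fallback path.
    xs.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    // With AVX2 enabled for this function, the compiler is free to vectorize it.
    xs.iter().sum()
}
```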
I'm not sure I understand. What underlying projects? The only reference to loading at runtime I see on the post is loading the AOT-compiled IR.
> There is of course the warmup / stuttering problem, but that is a separate issue and (sometimes) not an issue for compute vs graphics where it is a bigger issue.
It should be worth noting that Mojo itself is not JIT compiled; Mojo has a GPU infrastructure that can JIT compile Mojo code at runtime [1].
> One of the issues with GPUs as a platform is runtime probing of capabilities is... rudimentary to say the least.
Also not an issue in Mojo when you combine the selective JIT compilation I mentioned with powerful compile-time programming [2].
1. https://docs.modular.com/mojo/manual/gpu/intro-tutorial#4-co... - here the kernel "print_threads" will be jit compiled
Wow. That at first glance seems to unlock a LOT of interesting ideas.