ashvardanian•3h ago
The article covers extremely important CUDA warp-level synchronization/exchange primitives, but it's not what is generally called SIMD in CUDA land. Most "CUDA SIMD" intrinsics are designed to process a 32-bit data pack containing 2x 16-bit or 4x 8-bit values (<https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/gro...>). That significantly shrinks their applicability in most domains outside of video and string processing. I've had pretty high hopes for the DPX instructions on Hopper (<https://developer.nvidia.com/blog/boosting-dynamic-programmi...>) and started integrating them into StringZilla last year, but the gains aren't huge.
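For a concrete illustration (an editorial sketch, not part of the comment above), this is roughly what those sub-word intrinsics look like in a kernel. The kernel and buffer names are hypothetical, and __dp4a requires sm_61 or newer:

    // Sketch only: each 32-bit register holds 4x 8-bit lanes, processed at once.
    __global__ void packed_u8_demo(const unsigned *a, const unsigned *b,
                                   unsigned *byte_max, unsigned *dot, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        byte_max[i] = __vmaxu4(a[i], b[i]);   // per-byte unsigned max, 4 lanes at a time
        dot[i]      = __dp4a(a[i], b[i], 0u); // 4x 8-bit dot product accumulated into 32 bits
    }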
winwang•2h ago
Oh wow, TIL, thanks. I usually call stuff like that SWAR, and every now and then I try to think of a way to (fruitfully) use it.
The "SIMD" in this case was just an allusion to warp-wide functions looking like how one might use SIMD in CPU code, as opposed to typical SIMT CUDA.
Also, StringZilla looks amazing -- I just became your 1000th GitHub follower :)
ashvardanian•1h ago
Thanks, appreciate the gesture :)
Traditional SWAR on GPUs is a fascinating topic. I've begun assembling a set of synthetic benchmarks to compare DP4A vs. DPX (<https://github.com/ashvardanian/less_slow.cpp/pull/35>), but it feels incomplete without SWAR. My working hypothesis is that 64-bit SWAR on properly aligned data could be very useful in GPGPU, though FMA/MIN/MAX operations in that PR might not be the clearest showcase of its strengths. Do you have a better example or use case in mind?
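For readers unfamiliar with the idea, here is a minimal sketch of the kind of 64-bit SWAR being discussed (an editorial example with an invented name, not taken from the linked PR): the classic "does this word contain a zero byte?" test, which scans 8 bytes per register and is handy in string processing.

    __device__ __forceinline__ bool has_zero_byte(unsigned long long v) {
        const unsigned long long lo = 0x0101010101010101ULL; // low bit of every byte
        const unsigned long long hi = 0x8080808080808080ULL; // high bit of every byte
        // Nonzero exactly when at least one byte of v is zero.
        return ((v - lo) & ~v & hi) != 0ULL;
    }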
winwang•58m ago
I don't -- unfortunately not too well-versed in this field! But I was a bit fascinated with SWAR after I randomly thought of how to prefix-sum with int multiplication, later finding out that it is indeed an old trick as I suspected (I'm definitely not on this thread btw): https://mastodon.social/@dougall/109913251096277108
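A minimal sketch of that multiply-for-prefix-sum trick (an editorial reconstruction with an invented name, not dougall's exact code): with four u8 lanes packed into a u32, one multiplication leaves the inclusive prefix sums in the byte lanes, provided every partial sum stays below 256.

    __host__ __device__ inline unsigned prefix_sum_4x_u8(unsigned packed) {
        // packed * 0x01010101 == packed + (packed << 8) + (packed << 16) + (packed << 24),
        // so byte k of the product holds lane_0 + ... + lane_k when no carries occur.
        // Example: prefix_sum_4x_u8(0x01020304) == 0x0A090704.
        return packed * 0x01010101u;
    }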
As for 64-bit... well, I mostly avoid using high-end GPUs, but I was under the impression that i64 arithmetic is just emulated.
In fact, I was thinking of using the full warp as a "pipeline" to implement u32 division (mostly as a joke), almost like anti-SWAR. There was some old-ish paper detailing arithmetic latencies on GPUs, and division was roughly 32x the cost of multiplication (...or I could be misremembering).
DennisL123•2h ago
Interesting stuff. Not sure if I read this right that it's 16- and 32-bit integer values that get sorted. If yes, I'd love to see whether the GPU implementation can beat a competitive radix sort implementation on a CPU.
winwang•2h ago
It's 32 32-bit values which get sorted. I don't think a GPU sort would beat a CPU sort at this scale, even if you don't take kernel launch time into account. CPUs are simply too fast for (super-)small data, especially with AVX-512.
But if we're talking about a larger amount of data, that would be a different story, e.g. as part of a normal GPU mergesort.
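To make the "32 32-bit values" concrete, here is a hedged sketch of the usual warp-level bitonic sort built on register exchanges (an editorial example with an invented name, not the article's implementation):

    __device__ unsigned warp_bitonic_sort(unsigned v) {
        const unsigned lane = threadIdx.x & 31;
        for (unsigned k = 2; k <= 32; k <<= 1) {          // bitonic merge sizes
            for (unsigned j = k >> 1; j > 0; j >>= 1) {   // compare-exchange distances
                unsigned partner = __shfl_xor_sync(0xffffffffu, v, j);
                bool ascending = (lane & k) == 0;
                bool keep_min  = ((lane & j) == 0) == ascending;
                v = keep_min ? min(v, partner) : max(v, partner);
            }
        }
        return v;  // lane 0 holds the smallest value, lane 31 the largest
    }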
maeln•1h ago
It is also useful if your data already lives in GPU memory, for example when you need to z-sort a bunch of particles in a 3D renderer's particle system.
exDM69•17m ago
A 32-way GPU sorting algorithm might be just what I need for sorting and deduplicating triangle IDs in a visibility buffer renderer I am working on.
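One hedged way to do the deduplication half after such a warp sort (an editorial sketch with an invented name, under the assumption of one sorted ID per lane): compare each lane against its left neighbour and ballot the unique entries.

    __device__ unsigned warp_unique_mask(unsigned sorted_id) {
        unsigned lane = threadIdx.x & 31;
        unsigned prev = __shfl_up_sync(0xffffffffu, sorted_id, 1);
        bool is_unique = (lane == 0) || (sorted_id != prev);
        return __ballot_sync(0xffffffffu, is_unique);  // bit i set -> lane i holds a unique id
    }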