Why bother? CUB's radix sort is already memory-bound, which is theoretically optimal. But CUB's codebase prioritizes backward compatibility and generality across GPU generations, leaving room for targeted optimizations on modern hardware.
Benchmarks (RTX 5090, 67M elements, libcusort vs CUB):

- int32_t keys: 1.53ms vs 1.62ms (+6%)
- int8_t keys: 0.19ms vs 0.25ms (+32%)
- int32_t pairs: 3.07ms vs 3.10ms (+1%)
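For context, the CUB side of a comparison like this is the library's standard device-level radix sort entry point. The snippet below is a minimal sketch of that baseline call shape; the exact benchmark harness isn't reproduced here, the input size is reduced, and the use of cub::DeviceRadixSort::SortKeys is my assumption about what was timed.

```cuda
// Minimal CUB radix sort baseline of the kind such comparisons typically time.
// Assumption: the benchmark calls cub::DeviceRadixSort::SortKeys (CUB's standard entry point).
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1 << 20;                     // smaller than the 67M in the post, same call shape
    std::vector<int32_t> h(n);
    for (int i = 0; i < n; ++i) h[i] = n - i;  // reverse-sorted input

    int32_t *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(int32_t));
    cudaMalloc(&d_out, n * sizeof(int32_t));
    cudaMemcpy(d_in, h.data(), n * sizeof(int32_t), cudaMemcpyHostToDevice);

    // CUB's two-phase pattern: the first call only queries temp storage size, the second sorts.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h.data(), d_out, n * sizeof(int32_t), cudaMemcpyDeviceToHost);
    printf("first=%d last=%d\n", h.front(), h.back());   // expect 1 and n

    cudaFree(d_temp); cudaFree(d_out); cudaFree(d_in);
    return 0;
}
```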
What's different:

- Implements OneSweep with decoupled lookback
- PDL (Programmatic Dependent Launch) overlaps the histogram with the first sort pass on Hopper+
- Fine-grained PTX cache hints to avoid polluting L2
- CUDA Graph caching drops launch overhead from ~100μs to ~5μs (see the sketch after this list)
- Auto-tuned tile sizes for 48 type combinations
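Of these, the CUDA Graph caching is the easiest to show in isolation. Below is a minimal sketch of the general technique, not libcusort's actual code: a multi-kernel pipeline is captured into a graph once and then replayed with a single cudaGraphLaunch per sort, which is what collapses per-call launch overhead. The pass kernel, the 4-pass loop, and the CUDA 12.x cudaGraphInstantiate signature are illustrative assumptions.

```cuda
// Sketch: cache a multi-kernel pipeline in a CUDA Graph and replay it cheaply.
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for one sorting pass; a real radix sort would launch several distinct kernels.
__global__ void pass(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int* d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the pipeline once. Kernels are recorded, not executed, during capture.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int p = 0; p < 4; ++p)   // e.g. one launch per digit pass
        pass<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12.x signature

    // Replay: one cudaGraphLaunch submits the whole pipeline, so the per-kernel
    // launch cost is paid once at capture/instantiation time, not on every sort.
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    int first;
    cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("data[0]=%d (expected %d)\n", first, 4 * 100);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```

In a real sorter the instantiated graph would presumably be cached and keyed on the sort configuration, so repeated calls skip capture and instantiation entirely.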
The other goal is education. CUB's source is hard to follow. I have tried to make every optimization decision explicit in the comments: not just what the code does, but why the GPU hardware forces that choice.
GitHub: https://github.com/IlyaGrebnov/libcusort
Happy to answer questions about GPU radix sort internals or the specific optimizations.