We’ve been experimenting with NVIDIA’s 2:4 structured sparsity format on Ampere GPUs and found something interesting: consumer RTX cards support the same sparse Tensor Core instructions as datacenter A100s, but NVIDIA’s software stack doesn’t expose them.
SparseFlow is a small open-source project that tries to use these instructions on RTX 30/40 series cards.
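The instruction in question is PTX's `mma.sp` (sparse MMA). Here's a bare smoke test showing its shape — zeroed dummy operands, no fragment loading or metadata packing, so this is not one of the actual kernels, just the instruction in isolation:

```cuda
// Minimal smoke test for the sparse MMA instruction on sm_80 / sm_86.
// Compile with: nvcc -arch=sm_86 mma_sp_smoke.cu
// All operands are zeroed dummies; a real kernel loads A/B fragments
// (e.g. via ldmatrix) and packs 2:4 metadata per the PTX ISA layout.
#include <cstdio>

__global__ void mma_sp_smoke() {
    unsigned a0 = 0, a1 = 0, a2 = 0, a3 = 0;  // compressed A (16x32 tile, 2:4)
    unsigned b0 = 0, b1 = 0, b2 = 0, b3 = 0;  // dense B (32x8 tile)
    unsigned c0 = 0, c1 = 0, d0, d1;          // f16 accumulators (16x8 tile)
    unsigned e  = 0;                          // 2:4 metadata (2-bit indices)
    asm volatile(
        "mma.sp.sync.aligned.m16n8k32.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7,%8,%9}, {%10,%11}, %12, 0x0;\n"
        : "=r"(d0), "=r"(d1)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
          "r"(b0), "r"(b1), "r"(b2), "r"(b3),
          "r"(c0), "r"(c1), "r"(e));
    if (threadIdx.x == 0) printf("d0=%08x d1=%08x\n", d0, d1);
}

int main() {
    mma_sp_smoke<<<1, 32>>>();  // one warp
    return cudaDeviceSynchronize() != cudaSuccess;
}
```

This assembles and runs the same way on an RTX 3090 as on an A100 — which is the whole point.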
On large matrices (for example 4096×4096), we see roughly 2× throughput compared to dense cuBLAS. For smaller matrices, dense kernels are still faster.
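For reference, the dense baseline is nothing exotic. A stripped-down version of the timing loop (the actual harness in the repo covers more sizes and does correctness checks) looks roughly like this:

```cuda
// Dense FP16 GEMM baseline via cuBLAS, timed with CUDA events.
// Compile with: nvcc -arch=sm_86 -lcublas gemm_baseline.cu
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cstdio>

int main() {
    const int n = 4096, iters = 10;
    __half *A, *B, *C;
    cudaMalloc(&A, sizeof(__half) * n * n);
    cudaMalloc(&B, sizeof(__half) * n * n);
    cudaMalloc(&C, sizeof(__half) * n * n);
    cudaMemset(A, 0, sizeof(__half) * n * n);  // contents don't affect timing
    cudaMemset(B, 0, sizeof(__half) * n * n);

    cublasHandle_t h;
    cublasCreate(&h);
    __half alpha = __float2half(1.f), beta = __float2half(0.f);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);  // warm-up
    cudaEventRecord(t0);
    for (int i = 0; i < iters; ++i)
        cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A, n, B, n, &beta, C, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("dense cublasHgemm %dx%d: %.3f ms/iter (%.1f TFLOP/s)\n",
           n, n, ms / iters, 2.0 * n * n * n * iters / (ms * 1e9));
    cublasDestroy(h);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```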
A few things that stood out:
- The hardware support exists but isn’t documented for consumer cards
- Overhead dominates below a certain matrix size, which is where dense kernels win
- This only works with weights already in a 2:4 pattern (a pruning sketch follows this list)
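On that last point: getting weights into the 2:4 pattern is a preprocessing step, separate from SparseFlow itself. The simplest version keeps the two largest-magnitude values in each group of four. A sketch — the helper name here is illustrative, not part of the repo's API:

```cuda
// Naive one-shot 2:4 pruning on the host: keep the two largest-magnitude
// values in each contiguous group of four, zero the other two.
#include <cmath>
#include <cstddef>

void prune_2_4(float* w, size_t n) {  // n must be a multiple of 4
    for (size_t g = 0; g + 4 <= n; g += 4) {
        // track indices of the two smallest |w| in the group
        int lo = 0, lo2 = 1;
        if (std::fabs(w[g + lo]) > std::fabs(w[g + lo2])) {
            int t = lo; lo = lo2; lo2 = t;
        }
        for (int k = 2; k < 4; ++k) {
            float a = std::fabs(w[g + k]);
            if (a < std::fabs(w[g + lo]))       { lo2 = lo; lo = k; }
            else if (a < std::fabs(w[g + lo2])) { lo2 = k; }
        }
        w[g + lo]  = 0.f;  // drop the two smallest...
        w[g + lo2] = 0.f;  // ...leaving a valid 2:4 group
    }
}
```

Models pruned this bluntly usually need fine-tuning to recover accuracy; SparseFlow only handles the inference side once the pattern exists.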
This is early work. The repo includes the CUDA kernels, benchmarks, and notes on where sparsity helps and where it does not.
I’d be interested to hear from others who have experimented with sparse inference or noticed similar hardware vs software gaps.