To benefit from sparsity, you usually need very sparse matrices, some structure imposed on the sparsity pattern, or specialized hardware. None of these applies if you want to run pruned LLMs on consumer devices.
I wanted to see how far you can push it on a GPU and ended up with this.
Blog:
https://www.grizzlytech.dev/blog/macko-spmv
Paper:
https://arxiv.org/abs/2511.13061
Code (example with torch):
https://github.com/vlejd/macko_spmv
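For context on the "very sparse" requirement, here is a rough sketch of a textbook CSR baseline. It is not the MACKO format from the paper, and the function names, the 16-bit values, and the 32-bit indices are assumptions made for illustration only.

```python
# Illustrative baseline only: plain CSR, NOT the MACKO format from the paper.
# All names and the byte widths below are assumptions for this sketch.

def dense_bytes(rows, cols, value_bytes=2):
    """Storage for a dense matrix with 16-bit values (assumed)."""
    return rows * cols * value_bytes


def csr_bytes(rows, cols, density, value_bytes=2, index_bytes=4):
    """Storage for CSR: one value and one column index per nonzero,
    plus rows + 1 row pointers."""
    nnz = int(rows * cols * density)
    return nnz * (value_bytes + index_bytes) + (rows + 1) * index_bytes


def csr_spmv(values, col_idx, row_ptr, x):
    """Reference y = A @ x for a CSR matrix, written for clarity, not speed."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of row r live in values[row_ptr[r]:row_ptr[r + 1]].
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[1, 0, 2],
#      [0, 0, 3]]
y = csr_spmv([1.0, 2.0, 3.0], [0, 2, 2], [0, 2, 3], [1.0, 1.0, 1.0])
```

Under these assumptions each nonzero costs 6 bytes against 2 bytes per dense element, so plain CSR only saves memory below roughly one-third density. At the 50% sparsity typical of pruned LLMs it is larger than the dense matrix, which matters because SpMV is mostly limited by memory traffic.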
jjgreen•2h ago