Show HN: SparseLab – real sparse training (CSR + custom kernel) in PyTorch, CPU-first

https://news.ycombinator.com/from?site=github.com/darshanfofadiya
1•DARSHANFOFADIYA•2h ago

Comments

DARSHANFOFADIYA•2h ago
Most "sparse training" in PyTorch today isn't actually sparse. A binary mask gets multiplied into a dense weight matrix, which means the zeros still consume memory, still move through the cache, and still get multiplied. That's pruning simulation, not sparse computation. SparseLab does the other thing: real sparse storage (a custom Padded-CSR layout), custom NEON kernels, in an nn.Linear-compatible layer.

The premise. It's been known since the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635) and RigL (https://arxiv.org/abs/1911.11134) that most models train competitively with ~10% of their parameters if those parameters are the right ones, chosen dynamically during training. Every year since, researchers have reproduced this in masked-dense simulation, then hit a wall when they want the actual memory savings. PyTorch's torch.sparse_csr isn't designed for training — the backward pass is unimplemented for most ops, and the ones that exist force dense intermediates, which defeats the point. The alternative has been to write your own CSR + SIMD kernels, a six-month detour from whatever you were actually trying to study. SparseLab is that detour, packaged.

Reproduced numbers (M3 Pro, all in the repo's docs/demos/):

- MLP on MNIST at 90% sparsity (10% of params live): 97.45% vs 98.06% dense — 0.61pp gap, 82% memory reduction. Sparse needed 1.8x more epochs to converge.

- 10M-param transformer on Tiny Shakespeare, 70% sparse attention + 90% sparse FFN, 10k steps: inference memory 15.3 MB vs 41.0 MB (37% of dense), 0.055 nats validation loss gap.

- Scaling check at 40M params, 1000 steps (same architecture family, 4x larger): inference memory 55.8 MB vs 150.7 MB dense — exactly 37% of dense again. The ratio held across the scale-up. Per-step slowdown narrowed from 4.6x to 4.1x as kernel time started dominating Python overhead.

- The honest caveat: on CPU we are still 4.1-4.6x slower per step than dense torch.matmul. The dW kernel is most of a step and is unvectorized in v0.1. Memory is the win, not speed.
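For intuition on where a number like the 82% reduction comes from, here is a back-of-envelope plain-CSR footprint, assuming fp32 values and int32 indices. This is not SparseLab's Padded-CSR accounting — padding and index width will shift the exact figure — but it lands in the same ballpark:

```python
def csr_bytes(n_rows, n_cols, density, val_bytes=4, idx_bytes=4):
    """Approximate plain-CSR footprint (not Padded-CSR)."""
    nnz = int(n_rows * n_cols * density)
    values = nnz * val_bytes             # one float per nonzero
    col_idx = nnz * idx_bytes            # one column index per nonzero
    row_ptr = (n_rows + 1) * idx_bytes   # one offset per row, plus one
    return values + col_idx + row_ptr

dense_bytes = 1536 * 384 * 4                       # fp32 dense weight
sparse_bytes = csr_bytes(1536, 384, density=0.10)  # 90% sparsity
# Ratio comes out near 0.20, i.e. roughly an 80% reduction at 90%
# sparsity with 4-byte indices.
```

The rule of thumb: with equal-width values and indices, plain CSR costs about 2× density of the dense footprint, which is why 90% sparsity buys roughly 80%, not 90%, of the memory back.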

Why CPU-first is the angle. A DGX H100 has 640 GB of GPU HBM across 8 cards and costs $200-400K up front. Ten Hetzner AX102 nodes at ~€104/month each give you 1.28 TB of DDR5 — 2x the trainable memory at a fraction of the capital cost, paid monthly. For independent researchers training in the 100M-1B param range, RAM is the binding constraint, not FLOPs. Real sparse storage turns "doesn't fit in HBM" into "fits in DDR5, trains slow, but trains." DDP for wallclock recovery is on the v0.2 roadmap.

API. Install with "pip install sparselab" (wheels for macOS arm64, Linux x86_64, Linux aarch64). One-line swap from nn.Linear:

  import sparselab
  layer = sparselab.SparseLinear(1536, 384, sparsity=0.9)
  algo = sparselab.RigL(sparsity=0.9, drop_fraction=0.3, update_freq=100)
  layer.apply(algo)  # mutates topology during training
The SparsityAlgorithm interface is modeled on Cerebras's API of the same name (https://training-api.cerebras.ai/en/latest/wsc/tutorials/spa...) and credited in the docstrings. v0.1 ships Static, SET, and RigL.
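For readers who haven't met RigL, a pure-Python sketch of the drop/grow step it performs at each topology update: drop the smallest-magnitude live weights, regrow (at zero) the dead connections with the largest gradient magnitude. The function name and flat-list representation here are illustrative only, not SparseLab's actual API:

```python
def rigl_update(weights, grads, live, drop_fraction):
    """One RigL-style topology update over a flat weight list.

    weights, grads: flat lists of floats; live: set of live indices.
    Returns the new set of live indices (same cardinality as before).
    """
    n_drop = int(len(live) * drop_fraction)
    # Drop: live connections with the smallest |weight|.
    dropped = sorted(live, key=lambda i: abs(weights[i]))[:n_drop]
    live = live - set(dropped)
    # Grow: dead connections with the largest |gradient|.
    dead = [i for i in range(len(weights)) if i not in live]
    grown = sorted(dead, key=lambda i: -abs(grads[i]))[:n_drop]
    for i in grown:
        weights[i] = 0.0  # RigL initializes regrown weights at zero
    return live | set(grown)
```

The total number of live parameters is conserved, so the memory budget fixed by `sparsity=` never changes; only which connections hold it does.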

Help wanted. The aim is for SparseLab to become solid scaffolding for sparse-from-scratch work. Four places a contributor can own something real:

1. A new DST algorithm as a PR — Sparse Momentum, Top-KAST, GraNet. SparsityAlgorithm is ~100 lines; a new algorithm is another ~100.

2. CPU perf — dW kernel NEON/AVX-512 vectorization + parallel scheduling is the highest-leverage contribution. The 40M scaling numbers quantify exactly why.

3. CUDA port of the SpMM and rewrite kernels. v0.1 is CPU-only, but the Padded-CSR layout is GPU-friendly.

4. Push the scaling further. We validated the memory ratio at 40M. The 100M+ regime is open territory — if you have CPU cluster time, a GPT-2 small scale-up with a real convergence budget would be the first independent reproduction above author hardware.

Turn messy chats into structured TODOs and notes automatically

https://noteithub.com
1•pardisapporify•3m ago•0 comments

DeepSeek V4 Flash

https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
2•S0y•5m ago•0 comments

DeepSeek 4 Launched

https://deepseek4.hk/
2•mariopt•7m ago•2 comments

In Defense of Blub Studies

https://www.benkuhn.net/blub/
2•jonnonz•8m ago•0 comments

Need Help Please

1•activist_mel•10m ago•0 comments

A quick look at Mythos run on Firefox: too much hype?

https://xark.es/b/mythos-firefox-150
1•leonidasv•13m ago•0 comments

Hello from Berkeley

https://fluoverse.com
2•Panos_moschos•14m ago•1 comments

Anthropic Engineering Postmortem: Claude's 60-Minute Memory Bug

https://www.aiuniverse.news/claudes-memory-lapse-a-bug-erased-its-reasoning-after-an-hour/
1•aiuniversenews•16m ago•0 comments

DeepSeek-V4

https://huggingface.co/collections/deepseek-ai/deepseek-v4
3•meetpateltech•18m ago•0 comments

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
19•cmrdporcupine•18m ago•2 comments

Dear friend, you have built a Kubernetes (2024)

https://www.macchaffee.com/blog/2024/you-have-built-a-kubernetes/
2•Wingy•18m ago•0 comments

The Centrality Fallacy and ACM

https://cacm.acm.org/opinion/the-centrality-fallacy-and-acm/
2•pykq•19m ago•0 comments

DeepSeek-V4 Preview Version is launched

2•lanbin•19m ago•0 comments

OpenInterpretability

https://openinterp.org/
2•caiovicentino•23m ago•0 comments

DeepSeek v4

https://api-docs.deepseek.com/
7•impact_sy•24m ago•0 comments

2026 Ruby on Rails Community Survey

https://railsdeveloper.com/survey/
7•mooreds•25m ago•0 comments

MemCoT: Test-Time Scaling Through Memory-Driven Chain-of-Thought

https://arxiv.org/abs/2604.08216
2•MemTensor•25m ago•1 comments

Claude Opus 4.6 was nerfed prior to release of Opus 4.7

https://twitter.com/levelsio/status/2047387029915271445
1•nomilk•25m ago•0 comments

AI Kills HTML?

https://twitter.com/zan2434/status/2046982383430496444
2•cuttothechase•25m ago•1 comments

What is on my phone in 2026

https://joshblais.com/blog/what-is-on-my-phone-in-2026/
1•colinprince•26m ago•0 comments

I gave an AI persistent memory, self-learning, and earned autonomy

https://github.com/WingedGuardian/GENesis-AGI
1•genesiscogai•27m ago•1 comments

DeepSeek-V4 Technical Report [pdf]

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
7•tianyicui•30m ago•0 comments

Medical Student Created Top Influencer 'Emily Hart' Using AI, Making $ Thousands

https://www.ibtimes.co.uk/ai-generated-influencer-emily-hart-maga-1793120
1•Baljhin•34m ago•3 comments

The System of Context for Production AI

https://www.mezmo.com/aura
1•pranay01•37m ago•0 comments

Nev – keyboard focused GUI and terminal text editor

https://github.com/Nimaoth/Nev
1•archargelod•41m ago•0 comments

I used to genetically engineer plants to increase yield, now I sell garlic online

https://Demeterfamilyfarm.com
1•Hilliard_Ohiooo•42m ago•0 comments

XOXO Festival Archive

https://xoxofest.com/
2•surprisetalk•45m ago•0 comments

Introducing Data Exports

https://socket.dev/blog/introducing-data-exports
1•ilreb•51m ago•0 comments

Show HN: RustNmap

1•greatwallisme•52m ago•0 comments

These are the countries moving to ban social media for children

https://techcrunch.com/2026/04/23/social-media-ban-children-countries-list/
1•evo_9•58m ago•0 comments