Show HN: SparseLab – real sparse training (CSR + custom kernel) in PyTorch, CPU-first

https://news.ycombinator.com/from?site=github.com/darshanfofadiya
1•DARSHANFOFADIYA•2h ago

Comments

DARSHANFOFADIYA•2h ago
Most "sparse training" in PyTorch today isn't actually sparse. A binary mask gets multiplied into a dense weight matrix, which means the zeros still consume memory, still move through the cache, and still get multiplied. That's pruning simulation, not sparse computation. SparseLab does the other thing: real sparse storage (a custom Padded-CSR layout), custom NEON kernels, in an nn.Linear-compatible layer.

The premise. It's been known since the Lottery Ticket Hypothesis (https://arxiv.org/abs/1803.03635) and RigL (https://arxiv.org/abs/1911.11134) that most models train competitively with ~10% of their parameters if those parameters are the right ones, chosen dynamically during training. Every year since, researchers have reproduced this in masked-dense simulation, then hit a wall when they want the actual memory savings. PyTorch's torch.sparse_csr isn't designed for training — the backward pass is unimplemented for most ops, and the ones that exist force dense intermediates, which defeats the point. The alternative has been to write your own CSR + SIMD kernels, a six-month detour from whatever you were actually trying to study. SparseLab is that detour, packaged.

Reproduced numbers (M3 Pro, all in the repo's docs/demos/):

- MLP on MNIST at 90% sparsity (10% of params live): 97.45% vs 98.06% dense — 0.61pp gap, 82% memory reduction. Sparse needed 1.8x more epochs to converge.

- 10M-param transformer on Tiny Shakespeare, 70% sparse attention + 90% sparse FFN, 10k steps: inference memory 15.3 MB vs 41.0 MB (37% of dense), 0.055 nats validation loss gap.

- Scaling check at 40M params, 1000 steps (same architecture family, 4x larger): inference memory 55.8 MB vs 150.7 MB dense — exactly 37% of dense again. The ratio held across the scale-up. Per-step slowdown narrowed from 4.6x to 4.1x as kernel time started dominating Python overhead.

- The honest caveat: on CPU we are still 4.1-4.6x slower per step than dense torch.matmul. The dW kernel is most of a step and is unvectorized in v0.1. Memory is the win, not speed.
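For intuition on where a number like the 82% reduction comes from, here is a back-of-envelope plain-CSR footprint, assuming fp32 values and int32 indices. This is not SparseLab's Padded-CSR accounting — padding and index width will shift the exact figure — but it lands in the same ballpark:

```python
def csr_bytes(n_rows, n_cols, density, val_bytes=4, idx_bytes=4):
    """Approximate plain-CSR footprint (not Padded-CSR)."""
    nnz = int(n_rows * n_cols * density)
    values = nnz * val_bytes             # one float per nonzero
    col_idx = nnz * idx_bytes            # one column index per nonzero
    row_ptr = (n_rows + 1) * idx_bytes   # one offset per row, plus one
    return values + col_idx + row_ptr

dense_bytes = 1536 * 384 * 4                       # fp32 dense weight
sparse_bytes = csr_bytes(1536, 384, density=0.10)  # 90% sparsity
# Ratio comes out near 0.20, i.e. roughly an 80% reduction at 90%
# sparsity with 4-byte indices.
```

The rule of thumb: with equal-width values and indices, plain CSR costs about 2× density of the dense footprint, which is why 90% sparsity buys roughly 80%, not 90%, of the memory back.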

Why CPU-first is the angle. A DGX H100 has 640 GB of GPU HBM across 8 cards and costs $200-400K up front. Ten Hetzner AX102 nodes at ~€104/month each give you 1.28 TB of DDR5 — 2x the trainable memory at a fraction of the capital cost, paid monthly. For independent researchers training in the 100M-1B param range, RAM is the binding constraint, not FLOPs. Real sparse storage turns "doesn't fit in HBM" into "fits in DDR5, trains slow, but trains." DDP for wallclock recovery is on the v0.2 roadmap.

API. Install with "pip install sparselab" (wheels for macOS arm64, Linux x86_64, Linux aarch64). One-line swap from nn.Linear:

  import sparselab
  layer = sparselab.SparseLinear(1536, 384, sparsity=0.9)
  algo = sparselab.RigL(sparsity=0.9, drop_fraction=0.3, update_freq=100)
  layer.apply(algo)  # mutates topology during training
The SparsityAlgorithm interface is modeled on Cerebras's API of the same name (https://training-api.cerebras.ai/en/latest/wsc/tutorials/spa...) and credited in the docstrings. v0.1 ships Static, SET, and RigL.
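For readers who haven't met RigL, a pure-Python sketch of the drop/grow step it performs at each topology update: drop the smallest-magnitude live weights, regrow (at zero) the dead connections with the largest gradient magnitude. The function name and flat-list representation here are illustrative only, not SparseLab's actual API:

```python
def rigl_update(weights, grads, live, drop_fraction):
    """One RigL-style topology update over a flat weight list.

    weights, grads: flat lists of floats; live: set of live indices.
    Returns the new set of live indices (same cardinality as before).
    """
    n_drop = int(len(live) * drop_fraction)
    # Drop: live connections with the smallest |weight|.
    dropped = sorted(live, key=lambda i: abs(weights[i]))[:n_drop]
    live = live - set(dropped)
    # Grow: dead connections with the largest |gradient|.
    dead = [i for i in range(len(weights)) if i not in live]
    grown = sorted(dead, key=lambda i: -abs(grads[i]))[:n_drop]
    for i in grown:
        weights[i] = 0.0  # RigL initializes regrown weights at zero
    return live | set(grown)
```

The total number of live parameters is conserved, so the memory budget fixed by `sparsity=` never changes; only which connections hold it does.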

Help wanted. The aim is for SparseLab to become solid scaffolding for sparse-from-scratch work. Four places a contributor can own something real:

1. A new DST algorithm as a PR — Sparse Momentum, Top-KAST, GraNet. SparsityAlgorithm is ~100 lines; a new algorithm is another ~100.

2. CPU perf — dW kernel NEON/AVX-512 vectorization + parallel scheduling is the highest-leverage contribution. The 40M scaling numbers quantify exactly why.

3. CUDA port of the SpMM and rewrite kernels. v0.1 is CPU-only, but the Padded-CSR layout is GPU-friendly.

4. Push the scaling further. We validated the memory ratio at 40M. The 100M+ regime is open territory — if you have CPU cluster time, a GPT-2 small scale-up with a real convergence budget would be the first independent reproduction above author hardware.

Turn messy chats into structured TODOs and notes automatically

https://noteithub.com
1•pardisapporify•3m ago•0 comments

DeepSeek V4 Flash

https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
2•S0y•5m ago•0 comments

DeepSeek 4 Launched

https://deepseek4.hk/
2•mariopt•7m ago•2 comments

In Defense of Blub Studies

https://www.benkuhn.net/blub/
2•jonnonz•8m ago•0 comments

Need Help Please

1•activist_mel•10m ago•0 comments

A quick look at Mythos run on Firefox: too much hype?

https://xark.es/b/mythos-firefox-150
1•leonidasv•13m ago•0 comments

Hello from Berkeley

https://fluoverse.com
2•Panos_moschos•14m ago•1 comments

Anthropic Engineering Postmortem: Claude's 60-Minute Memory Bug

https://www.aiuniverse.news/claudes-memory-lapse-a-bug-erased-its-reasoning-after-an-hour/
1•aiuniversenews•16m ago•0 comments

DeepSeek-V4

https://huggingface.co/collections/deepseek-ai/deepseek-v4
3•meetpateltech•18m ago•0 comments

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
19•cmrdporcupine•18m ago•2 comments

Dear friend, you have built a Kubernetes (2024)

https://www.macchaffee.com/blog/2024/you-have-built-a-kubernetes/
2•Wingy•18m ago•0 comments

The Centrality Fallacy and ACM

https://cacm.acm.org/opinion/the-centrality-fallacy-and-acm/
2•pykq•19m ago•0 comments

DeepSeek-V4 Preview Version is launched

2•lanbin•19m ago•0 comments

OpenInterpretability

https://openinterp.org/
2•caiovicentino•23m ago•0 comments

DeepSeek v4

https://api-docs.deepseek.com/
7•impact_sy•24m ago•0 comments

2026 Ruby on Rails Community Survey

https://railsdeveloper.com/survey/
7•mooreds•25m ago•0 comments

MemCoT: Test-Time Scaling Through Memory-Driven Chain-of-Thought

https://arxiv.org/abs/2604.08216
2•MemTensor•25m ago•1 comments

Claude Opus 4.6 was nerfed prior to release of Opus 4.7

https://twitter.com/levelsio/status/2047387029915271445
1•nomilk•25m ago•0 comments

AI Kills HTML?

https://twitter.com/zan2434/status/2046982383430496444
2•cuttothechase•25m ago•1 comments

What is on my phone in 2026

https://joshblais.com/blog/what-is-on-my-phone-in-2026/
1•colinprince•26m ago•0 comments

I gave an AI persistent memory, self-learning, and earned autonomy

https://github.com/WingedGuardian/GENesis-AGI
1•genesiscogai•27m ago•1 comments

DeepSeek-V4 Technical Report [pdf]

https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
7•tianyicui•30m ago•0 comments

Medical Student Created Top Influencer 'Emily Hart' Using AI, Making $ Thousands

https://www.ibtimes.co.uk/ai-generated-influencer-emily-hart-maga-1793120
1•Baljhin•34m ago•3 comments

The System of Context for Production AI

https://www.mezmo.com/aura
1•pranay01•37m ago•0 comments

Nev – keyboard focused GUI and terminal text editor

https://github.com/Nimaoth/Nev
1•archargelod•41m ago•0 comments

I used to genetically engineer plants to increase yield, now I sell garlic online

https://Demeterfamilyfarm.com
1•Hilliard_Ohiooo•42m ago•0 comments

XOXO Festival Archive

https://xoxofest.com/
2•surprisetalk•45m ago•0 comments

Introducing Data Exports

https://socket.dev/blog/introducing-data-exports
1•ilreb•51m ago•0 comments

Show HN: RustNmap

1•greatwallisme•52m ago•0 comments

These are the countries moving to ban social media for children

https://techcrunch.com/2026/04/23/social-media-ban-children-countries-list/
1•evo_9•58m ago•0 comments