I'm Danilo. I've been struggling with the limitations of AdamW when fine-tuning LLMs locally. Second-order optimizers (like Shampoo or SOAP) offer significantly better per-step convergence by exploiting Kronecker-factored curvature. The problem? They require O(d^2) memory and O(d^3) compute per layer, which immediately OOMs consumer hardware like a 16GB T4 or an RTX 3090.
I wanted Shampoo-quality preconditioning on my home setup, so I built SCAO (Sparse Curvature-Aware Optimizer).
It's a PyTorch optimizer that acts as a drop-in replacement for AdamW, but it implements a few strict architectural changes to survive on consumer cards:
1. Adaptive Rank Selection: instead of full-rank Kronecker factors, it truncates the eigenspace to retain >=95% of the spectral mass (sketched below).

2. Int8 EMA Quantization: the curvature accumulators are stored in symmetric int8, which yields a 4x memory reduction with zero degradation in perplexity (sketched below).

3. Quantization Stability: standard Shampoo usually crashes at step 1 during 4-bit QLoRA fine-tuning due to SVD ill-conditioning in quantized spaces. SCAO exploits sparse approximations to bypass this.

4. Fused CUDA kernels: I wrote custom kernels to fix an O(k * m^2 * n) complexity bottleneck in the naive projection implementation.
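To make points 1 and 2 concrete, here is a minimal PyTorch sketch of the two ideas as I'd describe them (eigenspace truncation to >=95% spectral mass, and symmetric int8 storage of an EMA accumulator). This is illustrative only; `truncate_factor` and `Int8EMA` are names I'm using here, not SCAO's actual API:

    import torch


    def truncate_factor(factor: torch.Tensor, mass: float = 0.95):
        """Keep the smallest leading eigenspace of a PSD Kronecker factor
        whose eigenvalues cover >= `mass` of the total spectral mass."""
        evals, evecs = torch.linalg.eigh(factor)      # ascending order
        evals, evecs = evals.flip(0), evecs.flip(1)   # re-sort descending
        cum = torch.cumsum(evals.clamp_min(0), dim=0)
        # number of leading eigenvalues needed to reach the mass threshold
        k = int((cum < mass * cum[-1]).sum().item()) + 1
        return evals[:k], evecs[:, :k]                # rank-k approximation


    class Int8EMA:
        """Symmetric int8 storage for an EMA accumulator:
        q = round(x / scale), scale = max|x| / 127 (~4x smaller than fp32)."""

        def __init__(self, shape, beta=0.99, device="cpu"):
            self.beta = beta
            self.q = torch.zeros(shape, dtype=torch.int8, device=device)
            self.scale = torch.tensor(1e-12, device=device)

        def dequant(self) -> torch.Tensor:
            return self.q.float() * self.scale

        def update(self, x: torch.Tensor) -> None:
            # EMA update in fp32, then requantize to symmetric int8
            ema = self.beta * self.dequant() + (1.0 - self.beta) * x
            self.scale = ema.abs().max().clamp_min(1e-12) / 127.0
            self.q = torch.round(ema / self.scale).to(torch.int8)

The actual optimizer obviously has to fold the truncated eigenspace back into the preconditioned update and handle per-parameter state; the sketch is only meant to show where the memory savings come from.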
The Benchmark: I recently ran a head-to-head benchmark on a single T4 (16GB VRAM), fine-tuning Qwen2.5-3B with 4-bit QLoRA at rank 16:

- Shampoo: failed at step 1 (SVD collapse from ill-conditioning).
- SCAO: 100% stability, peak usage of 7.14 GB VRAM, with a smooth loss descent.
It is pip-installable (pip install scao).
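If the constructor really mirrors AdamW's, usage would look something like the sketch below. This is a hypothetical example: the `SCAO` import path and keyword arguments are my assumptions as a drop-in AdamW replacement, so check the README for the actual signature:

    # Hypothetical usage sketch; verify the real API in the repo.
    import torch
    from scao import SCAO  # assumed import path

    model = torch.nn.Linear(1024, 1024).cuda()
    opt = SCAO(model.parameters(), lr=1e-4, weight_decay=0.01)

    for _ in range(10):
        loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()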
I've written a technical report detailing the regret bounds, ablation studies, and scaling laws (published on Zenodo), but I really wanted to get this community's eyes on the CUDA kernels and the PyTorch implementation.
GitHub: https://github.com/whispering3/scao

Technical Report (DOI): https://doi.org/10.5281/zenodo.19870556
I'd love any feedback, code roasts, or questions about the math behind it!