The problem: When training neural networks, things go wrong silently. Your loss explodes at step 47,392. Your gradients vanish in layer 12. Your GPU memory spikes randomly. By the time you notice, you've wasted hours or days of compute.
I got tired of adding print statements, manually checking TensorBoard files, and tracking down training issues after the fact. Existing tools either require cloud accounts (W&B, Neptune) or are too heavyweight for quick experiments (MLflow, TensorBoard for gradient analysis).
What LayerClaw does:
- Automatically tracks gradients, metrics, and system resources during training
- Stores everything locally (SQLite + Parquet, no cloud required)
- Detects anomalies: gradient explosions, NaN/Inf values, loss spikes
- Provides a CLI to compare runs: `tracer compare run1 run2 --metric loss`
- Minimal overhead with async writes (~2-3%); a rough sketch of the storage/writer idea follows this list
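If you're curious what "local-first" storage means concretely, the sketch below shows the general pattern: run metadata in a small SQLite table, per-step metrics batched to Parquet from a background writer thread. Treat it as an illustration of the idea rather than the exact schema or code in the repo; the class, table, and file names here are made up.

```python
# Illustration only: SQLite for run metadata, Parquet for per-step metrics,
# with a background writer thread. Not the actual LayerClaw schema or code.
import queue
import sqlite3
import threading

import pyarrow as pa
import pyarrow.parquet as pq


class LocalRunStore:
    def __init__(self, run_id, db_path="runs.db", metrics_path="metrics.parquet"):
        self.metrics_path = metrics_path
        self.rows = []                         # metrics buffered by the writer thread
        self.q = queue.Queue()
        # Run-level metadata goes in SQLite so runs are easy to list and compare.
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT, status TEXT)")
        self.db.execute("INSERT INTO runs VALUES (?, ?)", (run_id, "running"))
        self.db.commit()
        # A daemon writer thread keeps file I/O off the training loop ("async writes").
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def log(self, step, metrics):
        self.q.put({"step": step, **metrics})  # returns immediately to the caller

    def _drain(self):
        while True:
            row = self.q.get()
            if row is None:                    # sentinel from finish(): flush and stop
                break
            self.rows.append(row)
        if self.rows:
            # Columnar Parquet keeps per-step metrics cheap to scan later.
            cols = {k: [r.get(k) for r in self.rows] for k in self.rows[0]}
            pq.write_table(pa.table(cols), self.metrics_path)

    def finish(self):
        self.q.put(None)
        self.worker.join()
        self.db.execute("UPDATE runs SET status = 'finished'")
        self.db.commit()
```

A real writer would flush periodically rather than only at the end, but the shape of the idea is the same.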
Quick example:

```python
import tracer
import torch

# Initialize (one line)
tracer.init(project="my-project", track_gradients=True)

# Your normal training loop
model = YourModel()
tracer._state.tracer.attach_hooks(model)  # register per-layer gradient hooks

for batch in dataloader:
    loss = train_step(model, batch)
    tracer.log({"loss": loss.item()})
    tracer.step()

tracer.finish()
```
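For those wondering what attaching hooks involves, it's standard PyTorch machinery: register a backward hook on each leaf module and record per-layer gradient norms as `backward()` runs. A simplified illustration (not the exact implementation; `grad_norms` and the function name are placeholders):

```python
# Simplified illustration of per-layer gradient tracking with PyTorch hooks.
import torch.nn as nn

grad_norms = {}  # layer name -> list of gradient norms seen so far

def attach_gradient_hooks(model: nn.Module):
    for name, module in model.named_modules():
        if list(module.children()):
            continue  # hook only leaf modules to avoid double-counting

        def hook(mod, grad_input, grad_output, name=name):
            g = grad_output[0]
            if g is not None:
                grad_norms.setdefault(name, []).append(g.norm().item())

        module.register_full_backward_hook(hook)
```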
Then analyze: `tracer anomalies my-run --auto`
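The anomaly checks themselves are conceptually simple: flag non-finite losses, losses far outside a rolling window, and gradient norms that jump by orders of magnitude between steps. A stripped-down sketch of that logic (thresholds and field names here are placeholders, not the shipped defaults):

```python
# Rough sketch of the kinds of checks an anomaly pass runs over logged history.
import math

def find_anomalies(history, window=20, spike_sigma=4.0, grad_ratio=100.0):
    """history: list of dicts like {"step": int, "loss": float, "grad_norm": float}."""
    anomalies, losses, prev_grad = [], [], None
    for rec in history:
        step, loss, grad = rec["step"], rec["loss"], rec.get("grad_norm")
        if not math.isfinite(loss):
            anomalies.append((step, "NaN/Inf loss"))
        else:
            if len(losses) >= window:
                mean = sum(losses) / len(losses)
                std = (sum((x - mean) ** 2 for x in losses) / len(losses)) ** 0.5
                if std > 0 and abs(loss - mean) > spike_sigma * std:
                    anomalies.append((step, "loss spike"))
            losses = (losses + [loss])[-window:]  # rolling window of recent finite losses
        if grad is not None and prev_grad and grad > grad_ratio * prev_grad:
            anomalies.append((step, "gradient explosion"))
        prev_grad = grad
    return anomalies
```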
What makes it different:
1. Local-first: No sign-ups, no data leaving your machine, no vendor lock-in
2. Designed for debugging: Deep gradient tracking and anomaly detection built-in (not an afterthought)
3. Lightweight: Add 2 lines to your training loop, minimal overhead
4. Works with everything: Vanilla PyTorch, HuggingFace Transformers, PyTorch Lightning (a minimal Lightning callback sketch follows below)
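On the Lightning point above: if you'd rather wire things up yourself, the same `tracer.log`/`tracer.step` calls from the quick example drop straight into a standard Lightning callback. A minimal, illustrative sketch (not necessarily how the shipped integration is structured):

```python
# Illustrative wiring only: the tracer.log / tracer.step calls from the quick
# example, wrapped in a standard PyTorch Lightning callback.
import tracer
from pytorch_lightning import Callback, Trainer

class TracerCallback(Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # `outputs` is whatever training_step returned: commonly a loss tensor
        # or a dict containing a "loss" entry.
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs
        tracer.log({"loss": float(loss)})
        tracer.step()

# trainer = Trainer(callbacks=[TracerCallback()])
```

The HuggingFace Trainer has a similar callback mechanism, so the same pattern applies there.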
Current limitations (v0.1.0):
- CLI-only (web UI planned for v0.2)
- Single-machine training (distributed support coming)
- Early stage - would love feedback on what's most useful

Available now:
- GitHub: https://github.com/layerclaw/layerclaw
*I'm looking for contributors!* I've created several "good first issues" if you'd like to get involved. Areas where I need help:
- Web UI for visualizations
- Distributed training support
- More framework integrations
- Real-time monitoring dashboard
If you've struggled with ML training issues before, I'd love your input on what would be most valuable. PRs welcome, or just star the repo if you find it interesting!
What features would make this indispensable for your workflow?