The problem: When training neural networks, things go wrong silently. Your loss explodes at step 47,392. Your gradients vanish in layer 12. Your GPU memory spikes randomly. By the time you notice, you've wasted hours or days of compute.
I got tired of adding print statements, manually checking TensorBoard files, and tracking down training issues after the fact. Existing tools either require cloud accounts (W&B, Neptune) or are too heavyweight for quick experiments (MLflow, TensorBoard for gradient analysis).
What LayerClaw does:
- Automatically tracks gradients, metrics, and system resources during training
- Stores everything locally (SQLite + Parquet, no cloud required)
- Detects anomalies: gradient explosions, NaN/Inf values, loss spikes
- Provides a CLI to compare runs: `tracer compare run1 run2 --metric loss`
- Minimal overhead with async writes (~2-3%); a rough sketch of the storage/writer idea follows this list
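If you're curious what "local-first" storage means concretely, the sketch below shows the general pattern: run metadata in a small SQLite table, per-step metrics batched to Parquet from a background writer thread. Treat it as an illustration of the idea rather than the exact schema or code in the repo; the class, table, and file names here are made up.

```python
# Illustration only: SQLite for run metadata, Parquet for per-step metrics,
# with a background writer thread. Not the actual LayerClaw schema or code.
import queue
import sqlite3
import threading

import pyarrow as pa
import pyarrow.parquet as pq


class LocalRunStore:
    def __init__(self, run_id, db_path="runs.db", metrics_path="metrics.parquet"):
        self.metrics_path = metrics_path
        self.rows = []                         # metrics buffered by the writer thread
        self.q = queue.Queue()
        # Run-level metadata goes in SQLite so runs are easy to list and compare.
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS runs (run_id TEXT, status TEXT)")
        self.db.execute("INSERT INTO runs VALUES (?, ?)", (run_id, "running"))
        self.db.commit()
        # A daemon writer thread keeps file I/O off the training loop ("async writes").
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def log(self, step, metrics):
        self.q.put({"step": step, **metrics})  # returns immediately to the caller

    def _drain(self):
        while True:
            row = self.q.get()
            if row is None:                    # sentinel from finish(): flush and stop
                break
            self.rows.append(row)
        if self.rows:
            # Columnar Parquet keeps per-step metrics cheap to scan later.
            cols = {k: [r.get(k) for r in self.rows] for k in self.rows[0]}
            pq.write_table(pa.table(cols), self.metrics_path)

    def finish(self):
        self.q.put(None)
        self.worker.join()
        self.db.execute("UPDATE runs SET status = 'finished'")
        self.db.commit()
```

A real writer would flush periodically rather than only at the end, but the shape of the idea is the same.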
Quick example:

```python
import tracer
import torch

# Initialize (one line)
tracer.init(project="my-project", track_gradients=True)

# Your normal training loop
model = YourModel()
tracer._state.tracer.attach_hooks(model)  # register per-layer gradient hooks

for batch in dataloader:
    loss = train_step(model, batch)
    tracer.log({"loss": loss.item()})
    tracer.step()

tracer.finish()
```
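For those wondering what attaching hooks involves, it's standard PyTorch machinery: register a backward hook on each leaf module and record per-layer gradient norms as `backward()` runs. A simplified illustration (not the exact implementation; `grad_norms` and the function name are placeholders):

```python
# Simplified illustration of per-layer gradient tracking with PyTorch hooks.
import torch.nn as nn

grad_norms = {}  # layer name -> list of gradient norms seen so far

def attach_gradient_hooks(model: nn.Module):
    for name, module in model.named_modules():
        if list(module.children()):
            continue  # hook only leaf modules to avoid double-counting

        def hook(mod, grad_input, grad_output, name=name):
            g = grad_output[0]
            if g is not None:
                grad_norms.setdefault(name, []).append(g.norm().item())

        module.register_full_backward_hook(hook)
```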
Then analyze: `tracer anomalies my-run --auto`
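The anomaly checks themselves are conceptually simple: flag non-finite losses, losses far outside a rolling window, and gradient norms that jump by orders of magnitude between steps. A stripped-down sketch of that logic (thresholds and field names here are placeholders, not the shipped defaults):

```python
# Rough sketch of the kinds of checks an anomaly pass runs over logged history.
import math

def find_anomalies(history, window=20, spike_sigma=4.0, grad_ratio=100.0):
    """history: list of dicts like {"step": int, "loss": float, "grad_norm": float}."""
    anomalies, losses, prev_grad = [], [], None
    for rec in history:
        step, loss, grad = rec["step"], rec["loss"], rec.get("grad_norm")
        if not math.isfinite(loss):
            anomalies.append((step, "NaN/Inf loss"))
        else:
            if len(losses) >= window:
                mean = sum(losses) / len(losses)
                std = (sum((x - mean) ** 2 for x in losses) / len(losses)) ** 0.5
                if std > 0 and abs(loss - mean) > spike_sigma * std:
                    anomalies.append((step, "loss spike"))
            losses = (losses + [loss])[-window:]  # rolling window of recent finite losses
        if grad is not None and prev_grad and grad > grad_ratio * prev_grad:
            anomalies.append((step, "gradient explosion"))
        prev_grad = grad
    return anomalies
```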
What makes it different:
1. Local-first: No sign-ups, no data leaving your machine, no vendor lock-in
2. Designed for debugging: Deep gradient tracking and anomaly detection built-in (not an afterthought)
3. Lightweight: Add 2 lines to your training loop, minimal overhead
4. Works with everything: Vanilla PyTorch, HuggingFace Transformers, PyTorch Lightning (a minimal Lightning callback sketch follows below)
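On the Lightning point above: if you'd rather wire things up yourself, the same `tracer.log`/`tracer.step` calls from the quick example drop straight into a standard Lightning callback. A minimal, illustrative sketch (not necessarily how the shipped integration is structured):

```python
# Illustrative wiring only: the tracer.log / tracer.step calls from the quick
# example, wrapped in a standard PyTorch Lightning callback.
import tracer
from pytorch_lightning import Callback, Trainer

class TracerCallback(Callback):
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # `outputs` is whatever training_step returned: commonly a loss tensor
        # or a dict containing a "loss" entry.
        loss = outputs["loss"] if isinstance(outputs, dict) else outputs
        tracer.log({"loss": float(loss)})
        tracer.step()

# trainer = Trainer(callbacks=[TracerCallback()])
```

The HuggingFace Trainer has a similar callback mechanism, so the same pattern applies there.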
Current limitations (v0.1.0):
- CLI-only (web UI planned for v0.2)
- Single-machine training (distributed support coming)
- Early stage - would love feedback on what's most useful

Available now:
- GitHub: https://github.com/layerclaw/layerclaw
*I'm looking for contributors!* I've created several "good first issues" if you'd like to get involved. Areas where I need help:
- Web UI for visualizations
- Distributed training support
- More framework integrations
- Real-time monitoring dashboard
If you've struggled with ML training issues before, I'd love your input on what would be most valuable. PRs welcome, or just star the repo if you find it interesting!
What features would make this indispensable for your workflow?