- step time (CUDA events, no forced GPU sync)
- dataloader time
- GPU memory (including reserved vs. allocated)
The goal is to make it easy to correlate stalls ↔ memory pressure ↔ compute time while the run is happening, rather than only in a post-hoc profile.
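For concreteness, here is a minimal sketch of how these three signals can be read in PyTorch: CUDA events for step time, wall-clock deltas for dataloader waits, and the allocator counters for reserved vs. allocated memory. This is only an illustration of the underlying primitives, not TraceML's internals; the loop and names are invented.

```python
import time
import torch
import torch.nn.functional as F

def train_loop(model, loader, optimizer, device="cuda"):
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)

    fetch_start = time.perf_counter()
    for batch, target in loader:
        # Dataloader time: how long this process blocked waiting for a batch.
        load_ms = (time.perf_counter() - fetch_start) * 1000

        start_evt.record()
        loss = F.cross_entropy(model(batch.to(device)), target.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        end_evt.record()

        # Step time from CUDA events. elapsed_time() requires both events
        # to have completed, so this simple version synchronizes on the end
        # event; an always-on monitor would instead poll end_evt.query()
        # and read the timing a step or two late to avoid the stall.
        end_evt.synchronize()
        step_ms = start_evt.elapsed_time(end_evt)

        # Memory: tensors currently in use vs. what the caching allocator
        # has reserved from the driver (the gap is cache/fragmentation).
        alloc_mib = torch.cuda.memory_allocated() / 2**20
        reserved_mib = torch.cuda.memory_reserved() / 2**20

        print(f"step {step_ms:6.1f} ms | load {load_ms:6.1f} ms | "
              f"alloc {alloc_mib:.0f} MiB / reserved {reserved_mib:.0f} MiB")
        fetch_start = time.perf_counter()
```

The deferred-read variant (poll `query()`, report one step late) is what makes "no forced GPU sync" possible: the timing is collected without ever stalling the stream.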
Usage is intentionally minimal: three hooks (a context manager, a function, and a decorator). Output can be viewed in a local dashboard, the CLI, or a notebook.
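To show what those three shapes look like in plain Python, here are generic stand-ins (these are not TraceML's real hook names; see the repo below for the actual API):

```python
import functools
import time
from contextlib import contextmanager

@contextmanager
def timed_block(name: str):
    """Context-manager hook: time everything inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")

def mark(name: str) -> None:
    """Function hook: record a named point in the run."""
    print(f"mark {name} @ {time.perf_counter():.3f}s")

def timed(fn):
    """Decorator hook: time every call to the wrapped function."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        with timed_block(fn.__qualname__):
            return fn(*args, **kwargs)
    return wrapper

# The same instrumentation via all three shapes:
@timed
def train_step(batch):
    time.sleep(0.01)  # stand-in for real work

with timed_block("epoch"):
    for _ in range(3):
        train_step(None)
    mark("epoch_end")
```

The appeal of this shape is that instrumentation can be added at whatever granularity fits: wrap the whole run, mark single events, or decorate just the hot function.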
Repo: https://github.com/traceopt-ai/traceml/
Feedback I would love:
- What’s the one signal you always wish you had when debugging slow training?
- What should this integrate with (or avoid) to stay low-overhead?