I'm building TraceML (open-source), focused on one practical outcome:
automatically identifying why a PyTorch training run is slow and suggesting what to change.
Instead of raw traces or yet another dashboard, the goal is:
- highlight the top bottleneck(s) (data stalls, GPU waiting, comm/sync, memory/alloc issues)
- keep it lightweight enough to run regularly, not only during deep debugging
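To make "highlight the top bottleneck" concrete, here is a minimal sketch of the kind of classification involved. This is my own illustration, not TraceML's actual API: given per-step seconds spent fetching data, in GPU compute, and in communication/sync, it names the largest slice of the step.

```python
def top_bottleneck(data_s: float, compute_s: float,
                   comm_s: float, step_s: float):
    """Classify the dominant cost in one training step.

    data_s    -- seconds the step spent waiting on the dataloader
    compute_s -- seconds spent in GPU compute
    comm_s    -- seconds spent in collective communication / sync
    step_s    -- total wall-clock seconds for the step
    """
    # Anything unaccounted for is treated as idle/allocator/other overhead.
    idle_s = max(step_s - data_s - compute_s - comm_s, 0.0)
    slices = {
        "data stall": data_s,
        "gpu compute": compute_s,
        "comm/sync": comm_s,
        "idle/other": idle_s,
    }
    name, secs = max(slices.items(), key=lambda kv: kv[1])
    return name, secs / step_s  # label and its fraction of the step
```

For example, a step where 0.6 s of a 1.0 s step goes to the dataloader would be flagged as a data stall at 60% of step time. The real tool would presumably derive these slices from profiler events rather than hand-fed numbers.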
I would love honest feedback from people who run real training workloads:
Would you use something like this, or does it inevitably become another dashboard to maintain?
What would make it clearly useful vs noise?
Repo: https://github.com/traceopt-ai/traceml
Happy to clarify anything.