The problem: Current ML profilers either dump too much data (torch.profiler) or abstract away the details you need. You can't see why your model is actually slow - is it memory bandwidth? Kernel launch overhead? Cache misses?
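For contrast, here's the standard torch.profiler workflow (this uses the real torch.profiler API): you get a long flat table of kernel timings, and correlating those rows back to your model code and the actual bottleneck is left to you:

    import torch
    from torch.profiler import profile, ProfilerActivity

    model = torch.nn.Linear(4096, 4096).cuda()
    x = torch.randn(64, 4096, device="cuda")

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model(x)

    # Prints a long table of kernel timings sorted by CUDA time; figuring
    # out which Python op launched each kernel, and whether it's limited by
    # bandwidth, launch overhead, or cache misses, is still on you.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))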
Our approach: We're reverse engineering GPU execution to trace from Python ops down to PTX instructions. One decorator gives you the full execution graph with the actual bottlenecks highlighted (rough sketch after the details below).
Technical details:

- Traces Python → CUDA kernels → PTX with timing breakdowns
- Shows memory access patterns and bandwidth utilization
- Analyzes kernel occupancy and scheduling
- Works with PyTorch and JAX; TensorFlow support coming
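Here's a rough sketch of the intended workflow. The decorator name below is illustrative, not necessarily the exact kandc API; check the docs linked below for the real entry point:

    import torch
    import kandc  # the profiling library (free beta)

    # Illustrative decorator name (not necessarily the real kandc API):
    # wrapping a function captures everything it launches on the GPU.
    @kandc.trace
    def step(model, x):
        return model(x)

    model = torch.nn.Linear(4096, 4096).cuda()
    out = step(model, torch.randn(64, 4096, device="cuda"))
    # Produces an execution graph (Python op -> CUDA kernel -> PTX) with
    # per-kernel timing, bandwidth utilization, and occupancy attached.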
We used this to optimize Llama inference and found bottlenecks we couldn't see before, which got us a 50%+ speedup: https://www.herdora.com/blog/the-overlooked-gpu
Free beta with 10 hours of profiling: https://keysandcaches.com
GitHub: https://github.com/Herdora/kandc
Docs: https://www.keysandcaches.com/docs
Curious what inference bottlenecks others are hitting that current tools can't diagnose. What's your experience with existing profilers? It would be great to hear thoughts from the community :)