Continuous Nvidia CUDA Profiling in Production

https://www.polarsignals.com/blog/posts/2025/10/22/gpu-profiling

98•brancz•3mo ago

Comments

gnurizen•3mo ago

Author here, would be happy to field any questions or feedback!

sirhcm•3mo ago

Does the profiler read any of the GPU's performance counters? Would be super cool to have an open source tool that can capture the same data nsight compute does.

gnurizen•3mo ago

This profiler is focused on kernel execution but we do scrape high level metrics (https://www.polarsignals.com/blog/posts/2025/06/04/latest-in... which is based on https://github.com/polarsignals/gpu-metrics-agent). What performance counters in particular were you interested in?

sirhcm•3mo ago

Cache hit rate is probably the most immediately useful. Although given that this is for always-on profiling maybe this project isn't as geared towards optimizing kernels as I originally thought? In theory reading the counters should be low overhead though.

porridgeraisin•3mo ago

It depends on what counter.

[ All from my experience on home GPUs, and in lah with 2 nodes with 2 80GB H100 each. Not extensively benchmarked ]

Events like kernel launch, which this profiler reads right now, is a very small overhead (1-2%). Kernel level metrics like DRAM utilisation, cache hit rate, SM occupancy, etc usually give you a 5-10% overhead. If you want to plot a flame graph at a instruction level (mostly useful for learning purposes) then you go off the rails - even 25% overhead I have seen. And finally full traces add tons of overhead but that's pretty much expected - they anyways produce GBs of profiling data.

sirhcm•3mo ago

Occupancy and RAM utilization are available from static analysis. A sampling profiler would also obviously not be suitable for this always-on profiler case. But reading the counters [0] from the GSP should be cheap.

[0] https://en.wikipedia.org/wiki/Hardware_performance_counter

embedding-shape•3mo ago

This "low-overhead always on GPU profiler" seems really cool and useful, but we're not using Kubernetes for anything, and the instructions for how to use it seems to only include Kubernetes. Is there a way of running this without Kubernetes?

gnurizen•3mo ago

Yeah the quickstart guide covers docker, k8s and "raw" binary options:

https://www.parca.dev/docs/quickstart/

knlb•3mo ago

Thanks for the post, this is pretty cool!

I feel like I've seen Cupti have fairly high overhead depending on the cuda version, but I'm not very confident -- did you happen to benchmark different workloads with cupti on/off?

---

If you're taking feature requests: a way to subscribe to -- and get tracebacks for -- cuda context creation would be very useful; I've definitely been surprised by finding processes on the wrong gpu and being easily able to figure out where they came from would be great.

I did a hack by using LD_PRELOAD to subscribe/publish the event, but never really followed through on getting the python stack trace.

gnurizen•3mo ago

CUPTI is kind of a choose your own adventure thing, as you subscribe to more stuff the overhead goes up, this is kind of minimalist profiler that just subscribes to the kernel launches and nothing else. Still to your point depending on kernel launch frequency/granularity it may be higher overhead than some would want in production, we have plans to address that with some probabilistic sampling instead of profiling everything but wanted to get this into folks hands and get some real world feedback first.

Code only says what it does

The success of 'natural language programming'

The Scriptovision Super Micro Script video titler is almost a home computer

Discovering the "original" iPhone from 1995 [video]

Psychometric Comparability of LLM-Based Digital Twins

SidePop – track revenue, costs, and overall business health in one place

The Other Markov's Inequality

The Cascading Effects of Repackaged APIs [pdf]

Lightweight and extensible compatibility layer between dataframe libraries

Haskell for all: Beyond agentic coding

Dorsey's Block cutting up to 10% of staff

Show HN: Freenet Lives – Real-Time Decentralized Apps at Scale [video]

In the AI age, 'slow and steady' doesn't win

Administration won't let student deported to Honduras return

How were the NIST ECDSA curve parameters generated? (2023)

AI, networks and Mechanical Turks (2025)

Goto Considered Awesome [video]

Show HN: I Built a Free AI LinkedIn Carousel Generator

Implementing Auto Tiling with Just 5 Tiles

Open Challange (Get all Universities involved

Apple Tried to Tamper Proof AirTag 2 Speakers – I Broke It [video]

Show HN: Isolating AI-generated code from human code | Vibe as a Code

Show HN: More beautiful and usable Hacker News

Toledo Derailment Rescue [video]

War Department Cuts Ties with Harvard University

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

A Bid-Based NFT Advertising Grid

AI readability score for your documentation

NASA Study: Non-Biologic Processes Don't Explain Mars Organics

I inhaled traffic fumes to find out where air pollution goes in my body

Code only says what it does

The success of 'natural language programming'

The Scriptovision Super Micro Script video titler is almost a home computer

Discovering the "original" iPhone from 1995 [video]

Psychometric Comparability of LLM-Based Digital Twins

SidePop – track revenue, costs, and overall business health in one place

The Other Markov's Inequality

The Cascading Effects of Repackaged APIs [pdf]

Lightweight and extensible compatibility layer between dataframe libraries

Haskell for all: Beyond agentic coding

Dorsey's Block cutting up to 10% of staff

Show HN: Freenet Lives – Real-Time Decentralized Apps at Scale [video]

In the AI age, 'slow and steady' doesn't win

Administration won't let student deported to Honduras return

How were the NIST ECDSA curve parameters generated? (2023)

AI, networks and Mechanical Turks (2025)

Goto Considered Awesome [video]

Show HN: I Built a Free AI LinkedIn Carousel Generator

Implementing Auto Tiling with Just 5 Tiles

Open Challange (Get all Universities involved

Apple Tried to Tamper Proof AirTag 2 Speakers – I Broke It [video]

Show HN: Isolating AI-generated code from human code | Vibe as a Code

Show HN: More beautiful and usable Hacker News

Toledo Derailment Rescue [video]

War Department Cuts Ties with Harvard University

Show HN: LocalGPT – A local-first AI assistant in Rust with persistent memory

A Bid-Based NFT Advertising Grid

AI readability score for your documentation

NASA Study: Non-Biologic Processes Don't Explain Mars Organics

I inhaled traffic fumes to find out where air pollution goes in my body

Continuous Nvidia CUDA Profiling in Production

Comments