Continuous Nvidia CUDA Profiling in Production

https://www.polarsignals.com/blog/posts/2025/10/22/gpu-profiling

66•brancz•1w ago

Comments

gnurizen•1h ago

Author here, would be happy to field any questions or feedback!

sirhcm•1h ago

Does the profiler read any of the GPU's performance counters? Would be super cool to have an open source tool that can capture the same data Nsight Compute does.

gnurizen•38m ago

This profiler is focused on kernel execution, but we do scrape high-level metrics (https://www.polarsignals.com/blog/posts/2025/06/04/latest-in... which is based on https://github.com/polarsignals/gpu-metrics-agent). What performance counters in particular were you interested in?

sirhcm•29m ago

Cache hit rate is probably the most immediately useful. Although given that this is for always-on profiling maybe this project isn't as geared towards optimizing kernels as I originally thought? In theory reading the counters should be low overhead though.

embedding-shape•50m ago

This "low-overhead always-on GPU profiler" seems really cool and useful, but we're not using Kubernetes for anything, and the instructions for how to use it seem to only cover Kubernetes. Is there a way of running this without Kubernetes?

gnurizen•47m ago

Yeah, the quickstart guide covers Docker, k8s, and "raw" binary options:

https://www.parca.dev/docs/quickstart/

knlb•38m ago

Thanks for the post, this is pretty cool!

I feel like I've seen CUPTI have fairly high overhead depending on the CUDA version, but I'm not very confident -- did you happen to benchmark different workloads with CUPTI on/off?

---

If you're taking feature requests: a way to subscribe to -- and get tracebacks for -- CUDA context creation would be very useful. I've definitely been surprised by finding processes on the wrong GPU, and being able to easily figure out where they came from would be great.

I did a hack using LD_PRELOAD to subscribe/publish the event, but never really followed through on getting the Python stack trace.

gnurizen•34m ago

CUPTI is kind of a choose-your-own-adventure thing: as you subscribe to more stuff, the overhead goes up. This is kind of a minimalist profiler that just subscribes to the kernel launches and nothing else. Still, to your point, depending on kernel launch frequency/granularity it may be higher overhead than some would want in production. We have plans to address that with some probabilistic sampling instead of profiling everything, but wanted to get this into folks' hands and get some real-world feedback first.
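
For anyone wondering what "probabilistic sampling" of kernel launches could look like, here is a minimal sketch. It is not the profiler's actual implementation: `LaunchSampler`, `on_launch`, `estimated_launches`, and the kernel name are all hypothetical, and a real profiler would make this decision inside its launch callback rather than in Python. The idea is simply to record each launch with a fixed probability and scale the sampled counts back up by the inverse of that probability.

```python
import random


class LaunchSampler:
    """Bernoulli sampler: records each kernel launch with probability `rate`.

    Hypothetical illustration only; names and structure are assumptions,
    not the API of any real profiler.
    """

    def __init__(self, rate, seed=None):
        if not 0.0 < rate <= 1.0:
            raise ValueError("rate must be in (0, 1]")
        self.rate = rate
        self._rng = random.Random(seed)
        self.sampled = {}  # kernel name -> number of launches actually recorded

    def on_launch(self, kernel_name):
        """Decide whether to record this launch; returns True if it was sampled."""
        if self._rng.random() < self.rate:
            self.sampled[kernel_name] = self.sampled.get(kernel_name, 0) + 1
            return True
        return False

    def estimated_launches(self, kernel_name):
        """Estimate the true launch count by scaling sampled launches by 1/rate."""
        return self.sampled.get(kernel_name, 0) / self.rate


if __name__ == "__main__":
    sampler = LaunchSampler(rate=0.1, seed=0)
    for _ in range(10_000):
        sampler.on_launch("gemm_kernel")  # hypothetical kernel name
    print(sampler.estimated_launches("gemm_kernel"))
```

The trade-off is the usual one: a lower `rate` means less per-launch work but noisier estimates, especially for kernels that launch rarely.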
