A Gentle Introduction to CUDA PTX

https://philipfabianek.com/posts/cuda-ptx-introduction/

59•ashvardanian•4mo ago

Comments

the_panopticon•4mo ago

Very interesting. It sounds like tuning at the PTX level can improve workload efficiency. The DeepSeek folks, for example, write "Specifically, we employ customized PTX (Parallel Thread Execution) instructions" in https://arxiv.org/abs/2412.19437.

shetaye•4mo ago

Agreed! The gulf between pure-C++ CUDA and PTX is getting larger with these optimizations. My understanding is that DeepSeek used PTX instructions that either had no corresponding C++ interface (like `wgmma`, mentioned in the article) or used uncommon combinations of modifiers (`ld.global.nc.L1::no_allocate.L2::256B`).

saagarjha•4mo ago

They didn’t employ custom PTX instructions; they used existing ones in ways they were not designed to be used.

the__alchemist•4mo ago

Is this analogy valid: writing PTX is like writing assembly instead of a higher-level language (C, C++, Rust, etc.) for CPU code? That is, normally the higher-level code compiles to it, but you can optimize by going lower?

For context, as with the situation the article's opening paragraph describes, I generate PTX code regularly, but have no idea what the actual code in the PTX file means!

I'm curious about the forward compatibility the article goes into. I only experience that up to a point: code compiled on CUDA 12 does not seem to work on machines with CUDA 13.

philipfabianek•4mo ago

Indeed, this is one way to think about it. However, PTX is an instruction set for a virtual machine, not the actual hardware. The true, hardware-specific assembly is called SASS (Streaming Assembly) and the PTX code is translated into SASS by the GPU driver (using ptxas) in a final compilation step. Unlike SASS, PTX is (mostly) forward compatible.

I don't know the details of your CUDA 12 vs. 13 issue, but I think it is less about hardware compatibility and more about the software stack. An application linked against CUDA 12 libraries might not work with CUDA 13 libraries.
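To make the assembly analogy concrete, here is a minimal sketch of a CUDA C++ kernel that issues a single PTX instruction through inline `asm`. The kernel name `add_ptx`, the array size, and the launch configuration are illustrative assumptions and are not taken from the article. It shows only the mechanism, not a useful optimization.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical example kernel: adds two ints per thread with one inline PTX instruction.
__global__ void add_ptx(const int* a, const int* b, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int sum;
        // add.s32 is a PTX (virtual ISA) instruction; the "r" constraint
        // binds each operand to a 32-bit register.
        asm volatile("add.s32 %0, %1, %2;" : "=r"(sum) : "r"(a[i]), "r"(b[i]));
        out[i] = sum;
    }
}

int main() {
    const int n = 256;  // illustrative size
    int *a, *b, *out;
    // Managed memory keeps the sketch short; error checking is omitted.
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }

    add_ptx<<<(n + 127) / 128, 128>>>(a, b, out, n);
    cudaDeviceSynchronize();

    // Compare against the same sum computed on the host.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (out[i] != a[i] + b[i]) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(a); cudaFree(b); cudaFree(out);
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as `add_ptx.cu`, `nvcc -ptx add_ptx.cu` writes the generated PTX to a text file, and running `cuobjdump -sass` on the compiled binary shows the hardware-specific SASS for comparison.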

neuroelectron•4mo ago

That's not much different from a modern CPU with an OS on top, where the OS does some of the scheduling, and then the CPU splits the instructions into microinstructions and schedules them again in finer detail (hyperthreading and such). It seems to me there must be a C-level syntax and compiler, so that you're not manually splitting up individual adds and such, that is still capable of optimizing the math effectively. But if that were true, we wouldn't have AAA game studios going to NVIDIA to optimize their game engines for each individual game.

checker659•4mo ago

https://en.wikipedia.org/wiki/Intermediate_representation

saagarjha•4mo ago

It’s really not true anymore that PTX is forward compatible. There’s a subset that is but any of the new interesting interfaces that have been added are not forward compatible and change in each microarchitectural revision. Most of the reason you’d drop down to PTX anyway is to use those; otherwise compilers are fairly good these days and it’s rarely the case you’ll see PTX unless you’re profiling.
