How to Think About GPUs

https://jax-ml.github.io/scaling-book/gpus/

114•alphabetting•1d ago

Comments

porridgeraisin•1d ago

A short addition that pre-volta nvidia GPUs were SIMD like TPUs are, and not SIMT which post-volta nvidia GPUs are.

camel-cdr•22h ago

SIMT is just a programming model for SIMD.

Modern GPUs still are just SIMD with good predication support at ISA level.

porridgeraisin•22h ago

I was referring to this portion of TFA

> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.

adrian_b•2h ago

This flexibility of CUDA is a software facility, which is independent of the hardware implementation.

For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).

achierius•3h ago

That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you have to always be in sync.

camel-cdr•3h ago

I'm not aware of any GPU that implements this.

Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].

Going of the diagram, it looks equivilant to rapidly switching between predicates, not executing two different operations at once.

    if (theradIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
    Z;

The diagram shows how this executes in the following order:

Volta:

    ->|   ->X   ->Y   ->Z|->
    ->|->A   ->B   ->Z   |->

pre Volta:

    ->|      ->X->Y|->Z
    ->|->A->B      |->Z

The SIMD equivilant of pre Volta is:

    vslt mask, vid, 4
    vopA ..., mask
    vopB ..., mask
    vopX ..., ~mask
    vopY ..., ~mask
    vopZ ...

The Volta model is:

    vslt mask, vid, 4
    vopA ..., mask
    vopX ..., ~mask
    vopB ..., mask
    vopY ..., ~mask
    vopZ ...

[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...

[2] https://stackoverflow.com/questions/70987051/independent-thr...

adrian_b•2h ago

"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.

"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").

SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.

What seems to have been added since Volta is some mechanism for fast saving and restoring separate program counters for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order, but those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot execute simultaneously instructions from different CUDA "threads", unless they perform the same operation, which is the same constraint that exists in any SIMD processor.

Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".

What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However dependencies between the CUDA "threads" of a "warp" shall be avoided whenever possible, because they reduce the achievable performance.

aanet•7h ago

Fantastic resource! Thanks for posting it here.

nickysielicki•6h ago

The calculation under “Quiz 2: GPU nodes“ is incorrect, to the best of my knowledge. There aren’t enough ports for each GPU and/or for each switch (less the crossbar connections) to fully realize the 450GB/s that’s theoretically possible, which is why 3.2TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it was 3.6TB/s, this would produce internode bottlenecks in any distributed ring workload.

Shamelessly: I’m open to work if anyone is hiring.

aschleck•4h ago

It's been a while since I thought about this but isn't the reason providers advertise only 3.2tbps because that's the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a Connect-X 7 NIC and those cap out at 400gbps. 8 gpus * 400gbps / gpu = 3.2tbps.

Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.

gregorygoc•4h ago

It’s mind boggling why this resource has not been provided by NVIDIA yet. It reached the point that 3rd parties reverse engineer and summarize NV hardware to a point it becomes an actually useful mental model.

What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.

threeducks•2h ago

With mediocre documentation, NVIDIAs closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse engineer.

akshaydatazip•4h ago

Thanks for the really thorough research on that . Right what I wanted for my morning coffee

physicsguy•3h ago

It’s interesting that nvshmem has taken off in ML because the MPI equivalents were never that satisfactory in the simulation world.

Mind you, I did all long range force stuff which is difficult to work with over multiple nodes at the best of times.

tomhow•3h ago

Discussion of original series:

How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)

tucnak•2h ago

This post is a great illustration why TPU's lend more nicely towards homogenous computing: yes, there's systolic array limitations (not good for sparsity) but all things considering, bandwidth doesn't change as your cluster ever so larger grows. It's a shame Google is not interested in selling this hardware: if they were available, it would open the door to compute-in-network capabilities far beyond what's currently available; by combining non-homogenous topologies involving various FPGA solutions, i.e. with Alveo V80 exposing 4x800G NIC's.

Also: it's a shame Google doesn't talk about how they use TPU's outside of LLM.

tormeh•2h ago

I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC historically this kind of thing has not been a great move.

saagarjha•2h ago

Sure, but you can make money in the field and retire faster than it becomes irrelevant. FWIW none of the ideas here are novel or nontransferable–it's just the specific design that is proprietary. Understanding how to do an AllReduce has been of theoretical interest for decades and will probably remain worth doing far into the future.

Philpax•2h ago

There's more in common with other GPU architectures than there are differences, so a CUDA consultant should be able to pivot if/when the other players are a going concern. It's more about the mindset than the specifics.

dotancohen•1h ago

I've been hearing that for over a decade. I can't even name off hand any CUDA competitors, none of them are likely to gain enough traction to upset CUDA in the coming decade.

Philpax•1h ago

Hence the "if" :-)

ROCm is getting some adoption, especially as some of the world's largest public supercomputers have AMD GPUs.

Some of this is also being solved by working at a different abstraction layer; you can sometimes be ignorant to the hardware you're running on with PyTorch. It's still leaky, but it's something.

physicsguy•27m ago

I still don't see ROCm as that serious a threat, they're still a long way behind in library support.

I used to use ROCFFT as an example, it was missing core functionality that cuFFT has had since like 2008. It looks like they've finally caught up now, but that's one library among many.

WithinReason•1h ago

What's in this article would apply to most other hardware, just with slightly different constants

amelius•1h ago

I mean it is similar to investing time in learning assembly language.

For most IT folks it doesn't make much sense.

qwertox•1h ago

It's a valid point of view, but I don't see the value in sharing it.

There are enough people for who it's worth it, even if just for tinkering, and I'm sure you are aware of that.

It reads a bit like "You shouldn't use it because..."

Learning about Nvidia GPUs will teach you a lot about other GPUs as well, and there are a lot of tutorials about the former, so why not use it if it interests you?

woooooo•25m ago

It's a useful bit of caution to remember transferrable fundamentals, I remember when Oracle wizards were in high demand.

physicsguy•25m ago

It really isn't that hard to pivot. It's worth saying that if you were already writing OpenMP and MPI code then learning CUDA wasn't particularly difficult to get started, and learning to write more performant CUDA code would also help you write faster CPU bound code. It's an evolution of existing models of compute, not a revolution.

AGENTS.md – Open format for guiding coding agents

How to Think About GPUs

Ask HN: Why does the US Visa application website do a port-scan of my network?

How to Draw a Space Invader

Modern CI Is Too Complex and Misdirected

Copilot broke audit logs, but Microsoft won't tell customers

How we exploited CodeRabbit: From simple PR to RCE and write access on 1M repos

Tiny microbe challenges the definition of cellular life

D2 (text to diagram tool) now supports ASCII renders

Type-machine

The Value of Hitting the HN Front Page

How I Made Ruby Faster Than Ruby

Show HN: I've made an easy to extend and flexible JavaScript logger

Gaussian Processes for Machine Learning [pdf]

Fast and observable background job processing for .NET

Emacs as your video-trimming tool

Candle Flame Oscillations as a Clock

Without the futex, it's futile

Rails Charts Using ECharts from Apache

Customizing Lisp REPLs

Intel Foundry Demonstrates First Arm-Based Chip on 18A Node

CRDT: Text Buffer

AnduinOS

Drunken Bishop (2023)

How Figma’s multiplayer technology works (2019)

We’re Not So Special: A new book challenges human exceptionalism

Show HN: OpenAI/reflect – Physical AI Assistant that illuminates your life

Show HN: Hanaco Weather – A poetic weather SNS from the OS Yamato project

Custom telescope mount using harmonic drives and ESP32

Analysis of the GFW's Unconditional Port 443 Block on August 20, 2025

How to Think About GPUs

Comments

AGENTS.md – Open format for guiding coding agents

How to Think About GPUs

Ask HN: Why does the US Visa application website do a port-scan of my network?

How to Draw a Space Invader

Modern CI Is Too Complex and Misdirected

Copilot broke audit logs, but Microsoft won't tell customers

How we exploited CodeRabbit: From simple PR to RCE and write access on 1M repos

Tiny microbe challenges the definition of cellular life

D2 (text to diagram tool) now supports ASCII renders

Type-machine

The Value of Hitting the HN Front Page

How I Made Ruby Faster Than Ruby

Show HN: I've made an easy to extend and flexible JavaScript logger

Gaussian Processes for Machine Learning [pdf]

Fast and observable background job processing for .NET

Emacs as your video-trimming tool

Candle Flame Oscillations as a Clock

Without the futex, it's futile

Rails Charts Using ECharts from Apache

Customizing Lisp REPLs

Intel Foundry Demonstrates First Arm-Based Chip on 18A Node

CRDT: Text Buffer

AnduinOS

Drunken Bishop (2023)

How Figma’s multiplayer technology works (2019)

We’re Not So Special: A new book challenges human exceptionalism

Show HN: OpenAI/reflect – Physical AI Assistant that illuminates your life

Show HN: Hanaco Weather – A poetic weather SNS from the OS Yamato project

Custom telescope mount using harmonic drives and ESP32

Analysis of the GFW's Unconditional Port 443 Block on August 20, 2025