
How to Think About GPUs

https://jax-ml.github.io/scaling-book/gpus/
114•alphabetting•1d ago

Comments

porridgeraisin•1d ago
A short addition: pre-Volta Nvidia GPUs were SIMD, like TPUs are, not SIMT, which post-Volta Nvidia GPUs are.
camel-cdr•22h ago
SIMT is just a programming model for SIMD.

Modern GPUs are still just SIMD with good predication support at the ISA level.
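
For instance, a branch like the one below compiles down to predicated instructions; the SASS-style lowering in the comments is schematic, not actual compiler output:

    __global__ void abs_kernel(float *x) {
        int i = threadIdx.x;     // one lane per "thread"
        if (x[i] < 0.0f)         //   FSETP.LT P0, x[i], 0    (set predicate)
            x[i] = -x[i];        //   @P0 negate-and-store: only lanes where
                                 //   P0 is true have any effect
    }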

porridgeraisin•22h ago
I was referring to this portion of TFA

> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.

adrian_b•2h ago
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.

For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).

achierius•3h ago
That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you have to always be in sync.
camel-cdr•3h ago
I'm not aware of any GPU that implements this.

Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].

Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.

    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
    Z;
The diagram shows how this executes in the following order:

Volta:

    ->|   ->X   ->Y   ->Z|->
    ->|->A   ->B   ->Z   |->
Pre-Volta:

    ->|      ->X->Y|->Z
    ->|->A->B      |->Z
The SIMD equivalent of pre-Volta is:

    vslt mask, vid, 4
    vopA ..., mask
    vopB ..., mask
    vopX ..., ~mask
    vopY ..., ~mask
    vopZ ...
The Volta model is:

    vslt mask, vid, 4
    vopA ..., mask
    vopX ..., ~mask
    vopB ..., mask
    vopY ..., ~mask
    vopZ ...

[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...

[2] https://stackoverflow.com/questions/70987051/independent-thr...

adrian_b•2h ago
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.

"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").

SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.

What seems to have been added since Volta is some mechanism for quickly saving and restoring a separate program counter for each CUDA "thread", in order to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order. But those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot execute instructions from different CUDA "threads" simultaneously unless they perform the same operation, which is the same constraint that exists in any SIMD processor.

Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".

What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However, dependencies between the CUDA "threads" of a "warp" should be avoided whenever possible, because they reduce the achievable performance.
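
To make this concrete, here is a minimal sketch (a toy example of my own, not NVIDIA's documented mechanism) of an intra-warp dependency that terminates on Volta and later because the scheduler can interleave the two branches using the saved per-"thread" program counters; on pre-Volta hardware the warp could spin forever without ever executing lane 0's store:

    #include <cstdio>

    // Lanes 1..31 wait on a flag that lane 0 of the same warp publishes.
    __global__ void intra_warp_dependency(volatile int *flag) {
        if (threadIdx.x == 0)
            *flag = 1;               // producer branch
        else
            while (*flag == 0) { }   // consumer branch, depends on lane 0
    }

    int main() {
        int *flag;
        cudaMalloc(&flag, sizeof(int));
        cudaMemset(flag, 0, sizeof(int));
        intra_warp_dependency<<<1, 32>>>(flag);   // exactly one warp
        cudaDeviceSynchronize();
        printf("warp finished: %s\n",
               cudaGetLastError() == cudaSuccess ? "yes" : "no");
        cudaFree(flag);
        return 0;
    }

Note that even here the two branches never execute simultaneously; the scheduler just alternates between them, which is exactly the "same operation at a time" constraint described above.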

aanet•7h ago
Fantastic resource! Thanks for posting it here.
nickysielicki•6h ago
The calculation under "Quiz 2: GPU nodes" is incorrect, to the best of my knowledge. There aren't enough ports per GPU and/or per switch (less the crossbar connections) to fully realize the 450 GB/s that's theoretically possible, which is why 3.2 TB/s of internode bandwidth is what's offered on all of the major cloud providers and the reference systems. If it were 3.6 TB/s, this would produce internode bottlenecks in any distributed ring workload.

Shamelessly: I’m open to work if anyone is hiring.

aschleck•4h ago
It's been a while since I thought about this, but isn't the reason providers advertise only 3.2 Tbps simply the limit of a single node's connection to the IB network? DGX is spec'ed to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs * 400 Gbps/GPU = 3.2 Tbps.

Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
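
For what it's worth, the two numbers aren't even in the same units. Plugging in the figures from this thread (450 GB/s of NVLink bandwidth per H100, 400 Gb/s per ConnectX-7 NIC, 8 of each per node):

    #include <cstdio>

    int main() {
        double nvlink_GBps = 450.0;  // per-GPU NVLink bandwidth, in bytes/s
        double nic_Gbps    = 400.0;  // per-GPU ConnectX-7 NIC, in bits/s
        printf("intranode: 8 x %.0f GB/s = %.1f TB/s aggregate NVLink\n",
               nvlink_GBps, 8 * nvlink_GBps / 1000);
        printf("internode: 8 x %.0f Gb/s = %.1f Tb/s = %.0f GB/s to the IB fabric\n",
               nic_Gbps, 8 * nic_Gbps / 1000, 8 * nic_Gbps / 8);
        return 0;
    }

So the 3.6 TB/s figure is the aggregate intranode NVLink number, while the advertised 3.2 Tb/s is internode and an order of magnitude smaller in bytes.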

gregorygoc•4h ago
It's mind-boggling that this resource has not been provided by NVIDIA itself. It has reached the point that third parties reverse-engineer and summarize NV hardware until it becomes an actually useful mental model.

What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.

threeducks•2h ago
With mediocre documentation, NVIDIA's closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse-engineer them.
akshaydatazip•4h ago
Thanks for the really thorough research on that. Just what I wanted for my morning coffee.
physicsguy•3h ago
It’s interesting that nvshmem has taken off in ML because the MPI equivalents were never that satisfactory in the simulation world.

Mind you, I did all long-range force stuff, which is difficult to work with over multiple nodes at the best of times.
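
For the unfamiliar: the draw of nvshmem over classic two-sided MPI is GPU-initiated, one-sided communication against a symmetric heap. A minimal sketch of the flavor (standard NVSHMEM calls, but schematic and untested; launch across PEs with nvshmrun):

    #include <nvshmem.h>
    #include <cstdio>

    // One-sided put: write my PE id into the next PE's symmetric buffer.
    // No matching receive is posted on the target, unlike two-sided MPI.
    __global__ void put_to_neighbor(int *sym, int mype, int npes) {
        if (threadIdx.x == 0)
            nvshmem_int_p(sym, mype, (mype + 1) % npes);
    }

    int main() {
        nvshmem_init();
        int mype = nvshmem_my_pe(), npes = nvshmem_n_pes();
        int *sym = (int *)nvshmem_malloc(sizeof(int));  // symmetric allocation
        put_to_neighbor<<<1, 1>>>(sym, mype, npes);
        cudaDeviceSynchronize();
        nvshmem_barrier_all();                          // order the puts
        nvshmem_free(sym);
        nvshmem_finalize();
        return 0;
    }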

tomhow•3h ago
Discussion of original series:

How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)

tucnak•2h ago
This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity), but all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if it were available, it would open the door to compute-in-network capabilities far beyond what's currently possible, by combining non-homogeneous topologies involving various FPGA solutions, e.g. the Alveo V80 exposing 4x800G NICs.

Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.

tormeh•2h ago
I find it very hard to justify investing time into learning something that is neither open source nor available from multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC, historically this kind of thing has not been a great move.
saagarjha•2h ago
Sure, but you can make money in the field and retire before it becomes irrelevant. FWIW, none of the ideas here are novel or nontransferable; it's just the specific design that is proprietary. Understanding how to do an AllReduce has been of theoretical interest for decades and will probably remain worth knowing far into the future.
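
As a concrete example of how transferable the core idea is, here is a ring all-reduce (sum) sketched in plain C++, with each "rank" simulated as a row of a matrix rather than a real GPU (toy code, not any library's API):

    #include <cstdio>
    #include <vector>

    int main() {
        const int R = 4;  // ranks in the ring; rank r starts with r+1 in every chunk
        std::vector<std::vector<double>> data(R, std::vector<double>(R));
        for (int r = 0; r < R; ++r)
            for (int c = 0; c < R; ++c)
                data[r][c] = r + 1;

        // Reduce-scatter: in step s, rank r adds its chunk (r-s) mod R into its
        // right neighbor's copy. After R-1 steps, rank r holds the fully
        // reduced chunk (r+1) mod R.
        for (int s = 0; s < R - 1; ++s)
            for (int r = 0; r < R; ++r) {
                int c = ((r - s) % R + R) % R;
                data[(r + 1) % R][c] += data[r][c];
            }

        // All-gather: pass each finished chunk around the ring R-1 more times.
        for (int s = 0; s < R - 1; ++s)
            for (int r = 0; r < R; ++r) {
                int c = ((r + 1 - s) % R + R) % R;
                data[(r + 1) % R][c] = data[r][c];
            }

        // Every rank should now hold the total 1+2+3+4 = 10 in every chunk.
        for (int r = 0; r < R; ++r) {
            printf("rank %d:", r);
            for (int c = 0; c < R; ++c) printf(" %g", data[r][c]);
            printf("\n");
        }
        return 0;
    }

Each step moves only 1/R of the data, which is where the familiar 2(R-1)/R bandwidth factor comes from, independent of whose interconnect carries it.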
Philpax•2h ago
There's more in common with other GPU architectures than there are differences, so a CUDA consultant should be able to pivot if/when the other players become a going concern. It's more about the mindset than the specifics.
dotancohen•1h ago
I've been hearing that for over a decade. I can't even name any CUDA competitors offhand, and none of them are likely to gain enough traction to upset CUDA in the coming decade.
Philpax•1h ago
Hence the "if" :-)

ROCm is getting some adoption, especially as some of the world's largest public supercomputers have AMD GPUs.

Some of this is also being solved by working at a different abstraction layer; with PyTorch you can sometimes be ignorant of the hardware you're running on. It's still leaky, but it's something.

physicsguy•27m ago
I still don't see ROCm as that serious a threat; they're still a long way behind in library support.

I used to use rocFFT as an example: it was missing core functionality that cuFFT has had since something like 2008. It looks like they've finally caught up now, but that's one library among many.

WithinReason•1h ago
What's in this article would apply to most other hardware, just with slightly different constants.
amelius•1h ago
I mean it is similar to investing time in learning assembly language.

For most IT folks it doesn't make much sense.

qwertox•1h ago
It's a valid point of view, but I don't see the value in sharing it.

There are enough people for whom it's worth it, even if just for tinkering, and I'm sure you are aware of that.

It reads a bit like "You shouldn't use it because..."

Learning about Nvidia GPUs will teach you a lot about other GPUs as well, and there are a lot of tutorials about the former, so why not use it if it interests you?

woooooo•25m ago
It's a useful bit of caution to remember transferable fundamentals; I remember when Oracle wizards were in high demand.
physicsguy•25m ago
It really isn't that hard to pivot. It's worth saying that if you were already writing OpenMP and MPI code, then getting started with CUDA wasn't particularly difficult, and learning to write more performant CUDA code would also help you write faster CPU-bound code. It's an evolution of existing models of compute, not a revolution.
