Shameless plug: I'm open to work if anyone is hiring.
Quiz 2 is confusingly worded but is, iiuc, referring to intranode GPU connections rather than internode networking.
What are the actual incentives at NVIDIA? If it’s all about marketing they’re doing great, but I have some doubts about engineering culture.
Mind you, I did all the long-range force stuff, which is difficult to scale across multiple nodes at the best of times.
How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)
Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.
ROCm is getting some adoption, especially as some of the world's largest public supercomputers have AMD GPUs.
Some of this is also being solved by working at a different abstraction layer; with PyTorch you can sometimes be ignorant of the hardware you're running on. It's still leaky, but it's something.
I used to use rocFFT as an example: it was missing core functionality that cuFFT has had since like 2008. It looks like they've finally caught up now, but that's one library among many.
For most IT folks it doesn't make much sense.
There are enough people for whom it's worth it, even if just for tinkering, and I'm sure you are aware of that.
It reads a bit like "You shouldn't use it because..."
Learning about Nvidia GPUs will teach you a lot about other GPUs as well, and there are a lot of tutorials about the former, so why not use it if it interests you?
camel-cdr•22h ago
Modern GPUs are still just SIMD with good predication support at the ISA level.
porridgeraisin•22h ago
> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.
adrian_b•2h ago
For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
camel-cdr•3h ago
Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].
Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.
The diagram shows the execution order in each case: the interleaved Volta order, the serialized pre-Volta order, the SIMD equivalent of pre-Volta, and the Volta model.

[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...

[2] https://stackoverflow.com/questions/70987051/independent-thr...
adrian_b•2h ago
"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").
SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.
What seems to have been added since Volta is some mechanism for fast saving and restoring of separate program counters for each CUDA "thread", so that data dependencies between distinct CUDA "threads" can be handled by activating the "threads" in the proper order. Those saved per-"thread" program counters cannot become active simultaneously if they have different values, however, so you cannot execute instructions from different CUDA "threads" simultaneously unless they perform the same operation, which is the same constraint that exists in any SIMD processor.
Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".
What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However, dependencies between the CUDA "threads" of a "warp" should still be avoided whenever possible, because they reduce the achievable performance.