It doesn't matter which architecture one studies; even a hypothetical one will do. The last significant application I wrote in assembler was for System/370, some 40 years ago. Yet CPU ISAs of today are not really that different, conceptually.
CPU: true.
GPU: no. It's not even the instructions that are different; I'd suggest studying up on GPU loads/stores.
GPUs have fundamentally changed how loads/stores work. Yes, it's a SIMD load (aka a gather operation), which has been around since the '80s. But the routing of that data includes highly optimized broadcast patterns and/or butterfly routing or crossbars (which allow an arbitrary shuffle within log2(n) stages).
Load(same memory location) across GPU threads (or SIMD lanes) compiles to a single broadcast.
Load(consecutive memory locations) across consecutive SIMD lanes is also efficient (coalesced).
Load(arbitrary locations) is doable but slower; the crossbar will be taxed.
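To make the three cases concrete, here's a small C++ sketch that models the address check a GPU memory unit effectively performs per warp/wavefront. This is purely illustrative (the names `LoadPattern` and `classify` are mine, not any real driver or hardware API): given the address each lane wants to read, it decides whether the access is a broadcast, a coalesced transaction, or a crossbar-taxing gather.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

enum class LoadPattern { Broadcast, Coalesced, Gather };

// Classify the addresses issued by each SIMD lane for one load:
// - all lanes read the same address          -> a single broadcast
// - lane i reads base + i * elem_size        -> one coalesced transaction
// - anything else                            -> a gather (crossbar does real work)
LoadPattern classify(const std::vector<uint64_t>& lane_addr, uint64_t elem_size) {
    if (std::all_of(lane_addr.begin(), lane_addr.end(),
                    [&](uint64_t a) { return a == lane_addr[0]; }))
        return LoadPattern::Broadcast;
    for (size_t i = 1; i < lane_addr.size(); ++i)
        if (lane_addr[i] != lane_addr[0] + i * elem_size)
            return LoadPattern::Gather;
    return LoadPattern::Coalesced;
}
```

Real hardware is more forgiving than this sketch (e.g. coalescing within aligned segments, permutations of consecutive addresses), but the performance ordering is the same: broadcast and coalesced are cheap, arbitrary gathers are not.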
The RDNA architecture slides (a few generations old) have some breadcrumbs: https://gpuopen.com/download/RDNA_Architecture_public.pdf
AMD also publishes its ISAs, but I don't think you'll be able to extract much from a reference-style document: https://gpuopen.com/amd-gpu-architecture-programming-documen...
Books on CUDA/HIP also go into some detail on the underlying architecture. Some slides from NV:
https://gfxcourses.stanford.edu/cs149/fall21content/media/gp...
Edit: I should say that Apple also publishes decent stuff. See the link here and the stuff linked at the bottom of the page. But note that now you're in UMA/TBDR territory; discrete GPUs work considerably differently: https://developer.apple.com/videos/play/wwdc2020/10602/
If anyone has more suggestions, please share.
What has drastically changed is that you cannot do trivial 'cycle counting' anymore.
A couple of useful points it lacks:
* `switch` statements can be lowered in two different ways: using a jump table (an indirect branch, only possible when the case values are dense; requires a highly predictable branch to check the range first), or using a binary search (multiple direct branches). Compilers have heuristics to determine which should be used, but I haven't played with them.
* You may be able to turn an indirect branch into a direct branch using code like the following:
if (function_pointer == expected_function)
expected_function();
else
(*function_pointer)();
* It's generally easy to turn tail recursion into a loop, but it takes effort to design your code to make that possible in the first place. The usual Fibonacci example is a good basic intro; tree-walking is a good piece of homework.
* `cmov` can be harmful (since it has to compute both sides) if the branch is even moderately predictable and/or if the less-likely side has too many instructions. That said, from my tests, compilers are still too hesitant to use `cmov` even for cases where yes, I really know, dammit. OoO CPUs are weird to reason about, but I've found that, due to dependencies between other instructions, there are often some execution ports to spare for the other side of the branch.
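On the tail-recursion point, here's the usual Fibonacci progression as a C++ sketch: the naive form is not tail recursion (work remains after the calls return), the accumulator-passing form is, and once it is, the loop falls out mechanically.

```cpp
#include <cstdint>

// Naive recursion: NOT tail recursion -- the '+' happens
// after both recursive calls return, so each frame must live on.
uint64_t fib_naive(uint64_t n) {
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

// Redesigned with accumulators so the recursive call is the very
// last thing done. A compiler can turn this call into a jump...
uint64_t fib_tail(uint64_t n, uint64_t a = 0, uint64_t b = 1) {
    return n == 0 ? a : fib_tail(n - 1, b, a + b);
}

// ...which is exactly this loop.
uint64_t fib_loop(uint64_t n) {
    uint64_t a = 0, b = 1;
    while (n-- > 0) {
        uint64_t t = a + b;
        a = b;
        b = t;
    }
    return a;
}
```

Note the redesign step in the middle is the part that "takes effort": the accumulators have to carry all the state the pending work would otherwise hold.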
Cool trick with the function pointer comparison!
I've been slowly reading Agner Fog's resources. The microarchitecture manual is incredible, and, pertinently, I find the section on branch prediction algorithms fascinating:
https://web.archive.org/web/20250611003116/https://www.agner...
noone_youknow•2d ago
One minor nit: for the “odd corner case that likely never exists in real code” of taken branches to the next instruction, I can think of at least one example where this is often used: far jumps to the next instruction with a different segment on x86[_64], used to reload CS (e.g. on a mode switch).
I'm aware that's a very specific case, but it's one that very much does exist in real code.
chrisfeilbach•34m ago