and another article about zvdb-abap from ~1.5 years ago.
The fact that the author achieves only a 3 to 6 times speedup on a processor running at a frequency 857 times higher should have led to the conclusion that old optimization tricks are awfully slow on modern architectures.
To be fair, execution pipeline optimization still works the same, but failing to take into account the different layers of cache, the way memory management works, and even how and when actual RAM is queried will only lead to suboptimal code.
haiku2077•4h ago
I know it's not always true on the Nintendo 64, because it shared a single bus between the RAM and "GPU": https://youtu.be/t_rzYnXEQlE?t=94, https://youtu.be/Ca1hHC2EctY?t=827
bayindirh•3h ago
My favorite story is about an embedded processor whose ISA I've forgotten. The gist was that there was a time budget, and a normal memory access alone would consume 90% of it. The trick was to use the obscure DMA engine to pump data into the processor caches asynchronously. That way, moving data cost only ~4% of the same budget, and they beat their performance targets by a large margin.
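A minimal sketch of that double-buffering pattern in C; the dma_start/dma_wait/process names are invented stand-ins for illustration, since real DMA engines are SoC-specific:

    /* Overlap the async DMA copy of block i+1 with computation on block i.
       dma_start/dma_wait are hypothetical stand-ins for a real DMA driver. */
    #include <stdint.h>
    #include <stddef.h>

    #define BLOCK 256

    extern void dma_start(void *dst, const void *src, size_t n); /* assumed: kicks off async copy */
    extern void dma_wait(void);                                  /* assumed: waits for completion */
    extern void process(const uint8_t *buf, size_t n);

    static uint8_t scratch[2][BLOCK]; /* small enough to stay cache/SRAM-resident */

    void stream_process(const uint8_t *src, size_t total) {
        size_t nblocks = total / BLOCK;
        int cur = 0;
        dma_start(scratch[cur], src, BLOCK);  /* prime the first block */
        for (size_t i = 0; i < nblocks; i++) {
            dma_wait();                       /* block i is now in fast memory */
            if (i + 1 < nblocks)              /* kick off the fetch of block i+1... */
                dma_start(scratch[cur ^ 1], src + (i + 1) * BLOCK, BLOCK);
            process(scratch[cur], BLOCK);     /* ...while computing on block i */
            cur ^= 1;                         /* swap the double buffers */
        }
    }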
ay•3h ago
Some pointers I've found useful in that regard, for both:
1. https://www.agner.org/optimize/instruction_tables.pdf - an extremely nice resource on micro architectural impacts of instructions
2. https://llvm.org/docs/CommandGuide/llvm-mca.html - tooling from the LLVM project that lets you see some of these effects in real machine code (a quick usage sketch follows below)
3. https://www.intel.com/content/www/us/en/developer/articles/t... - shows you whether the above matches reality. (Besides the CPU alone, more often than not your bottleneck is actually memory accesses, at least on a first access that wasn’t triggered by a hardware prefetcher or a hint to it. On Linux that means staring at “perf top” results.)
So the answer is, as it so often is: “it depends”.
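A quick way to try #2, assuming clang is available: compile a stand-in kernel to assembly and pipe it into llvm-mca (the -mcpu value here is just an example target):

    /* sum.c - toy kernel to inspect:
       clang -O2 -S -o - sum.c | llvm-mca -mcpu=skylake -timeline */
    #include <stddef.h>

    float sum(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i]; /* llvm-mca will expose the loop-carried dependency on s */
        return s;
    }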
whizzter•2h ago
But yes, you're right. Back when I started with optimizations in the mid 90s, memory _latencies_ were fairly minor compared to complex instructions, so most things that weren't additions (or multiplications, on the Pentium) were faster from a lookup table. Over time, memory latencies grew and grew as clock speeds and other improvements made the distance to physical memory an actual factor, and lookup tables became less useful compared to recomputing things.
Still, there are things today that are expensive enough to compute and whose lookup table is small enough that it doesn't get evicted from cache during the computation, but they're few.
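As a toy illustration of both sides (my example, using the GCC/Clang __builtin_popcount intrinsic): a 256-byte popcount table comfortably stays in L1 and can still win, while recomputing avoids cache pressure entirely; only a benchmark can say which is faster on a given machine:

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t popcnt8[256]; /* 256 bytes: easily cache-resident */

    void init_table(void) {
        for (int i = 0; i < 256; i++)
            popcnt8[i] = (uint8_t)__builtin_popcount(i);
    }

    unsigned popcount_table(const uint8_t *p, size_t n) {
        unsigned s = 0;
        for (size_t i = 0; i < n; i++)
            s += popcnt8[p[i]];  /* one dependent load per byte */
        return s;
    }

    unsigned popcount_compute(const uint8_t *p, size_t n) {
        unsigned s = 0;
        for (size_t i = 0; i < n; i++)
            s += (unsigned)__builtin_popcount(p[i]); /* recompute, no table */
        return s;
    }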
mananaysiempre•2h ago
Maybe on the Z80. Contemporary RAM was quite fast compared to it, by our sad standards.
A table lookup per byte will see you hit a pretty hard limit of about 1 cycle per byte on all x86 CPUs of the last decade. If you’re doing a state machine or a multistage table[1] where the next table index depends on both the next byte and the previous table value, you’ll be lucky to see half that. Outracing your SSD[2] you’re not, with this approach.
If instead you can load a 64-bit chunk (or several!) at a time, you’ll have quite a bit of leeway to do some computation to it before you’re losing to the lookup table, especially considering you’ve got fast shifts and even multiplies (another difference from the Z80). And if you’re doing 128- or 256-bit vectors, you’ve got even more compute budget—but you’re also going to spend a good portion of it just shuffling the right bytes into the right positions. Ultimately, though, one of your best tools there is going to be ... an instruction that does 16 resp. 32 lookups in a 16-entry table at a time[3].
So no, if you want to be fast on longer pieces of data, in-memory tables are not your friend. On smaller data, with just a couple of lookups, they could be[4]. In any case, you need to be thinking about your code’s performance in detail for these things to matter—I can’t think of a situation where “prefer a lookup table” is a useful heuristic. “Consider a lookup table” (then measure), maybe.
[1] https://www.unicode.org/versions/latest/ch05.pdf
[2] https://lemire.me/en/talk/perfsummit2020/
[3] http://0x80.pl/notesen/2008-05-24-sse-popcount.html
[4] https://tia.mat.br/posts/2014/06/23/integer_to_string_conver...
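To make the word-at-a-time point concrete, a toy sketch (mine, not taken from the links above): ASCII validation done per byte through a table versus eight bytes per iteration with a single 64-bit load and mask:

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>
    #include <stddef.h>

    static uint8_t is_ascii_lut[256];

    void init_lut(void) {
        for (int i = 0; i < 256; i++)
            is_ascii_lut[i] = (i < 0x80); /* 1 for 0x00..0x7F */
    }

    bool all_ascii_lut(const uint8_t *p, size_t n) {
        for (size_t i = 0; i < n; i++)
            if (!is_ascii_lut[p[i]]) return false; /* ~1 load per byte at best */
        return true;
    }

    bool all_ascii_swar(const uint8_t *p, size_t n) {
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            uint64_t w;
            memcpy(&w, p + i, 8);              /* one 8-byte load */
            if (w & 0x8080808080808080ULL)     /* any high bit set => non-ASCII */
                return false;
        }
        for (; i < n; i++)                     /* handle the tail bytes */
            if (p[i] & 0x80) return false;
        return true;
    }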
flohofwoe•1h ago
The fastest instructions that didn't access memory (like adding two 8-bit registers) were 4 clock cycles, so there's really not much room to beat a memory access with computation.
Today "it depends", I still use lookup tables in some places in my home computer emulators, but only after benchmarking showed that the table is actually slightly faster.