What’s with the “if (idx == NULLPTR)” block? The loop won’t access an entry outside the list, so this appears to be adding unnecessary instructions and unnecessary divergence. (And maybe even unnecessary dependencies?) Does demonstrating the performance problem depend on having this code in the loop? I hope not, but I’m very curious why it’s there.
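(For reference, a minimal sketch of the kind of loop I mean — NULLPTR, node, and STEPS here are my own placeholders, not the article's actual code:)

    #include <stddef.h>

    #define NULLPTR ((size_t)-1)   /* hypothetical sentinel value */
    #define STEPS   (1 << 20)

    struct node { long value; size_t next; };

    long walk(const struct node *nodes, size_t head)
    {
        long sum = 0;
        size_t idx = head;
        for (size_t i = 0; i < STEPS; i++) {
            if (idx == NULLPTR)    /* the questioned branch: if the list
                                      always has at least STEPS entries,
                                      this never fires */
                break;
            sum += nodes[idx].value;
            idx = nodes[idx].next; /* this load forms the dependency chain */
        }
        return sum;
    }

If the iteration count is bounded by the list length, that branch is never taken, so it only costs instructions, branch-predictor state, and possibly an extra dependency on the loaded index.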
A couple of other tiny nits: the first two graphs should have a Y axis that starts at zero! That won't compromise them in any way; there should be a very compelling reason not to start a ratio graph at zero, and these have no such reason. And I'm curious why the X axis goes up in factors of 8 except for the last two values, which seem strangely arbitrary?
Some experiments I have done with a summation kernel showed a considerable speedup from summing the odd and even values into separate accumulators. Though this only helps when the code doesn't too closely resemble a standard signal-processing kernel; for those, the compiler can often perform this optimization by itself.
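A minimal sketch of the trick, assuming a plain floating-point summation (my own illustration, not the code from those experiments):

    #include <stddef.h>

    /* One accumulator: every addition depends on the previous one,
     * so the adds execute serially, one FP-add latency apart. */
    double sum_single(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Two accumulators (even/odd elements): the two chains are
     * independent, so their add latencies overlap. */
    double sum_split(const double *a, size_t n)
    {
        double s0 = 0.0, s1 = 0.0;
        size_t i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n)
            s0 += a[i];
        return s0 + s1;
    }

With integer sums the compiler is usually allowed to do this reassociation itself; with floating point it generally cannot unless fast-math-style flags permit it, which matches the observation above.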
This is covered in part of my video titled "new computers don't speed up old code".
solarexplorer•1d ago
adrian_b•1d ago
When each memory access depends on the result of the previous one, performance is limited by the latency of the individual accesses. Otherwise the performance is limited by the memory transfer throughput, not by the latency of individual memory accesses.
The article demonstrates the difference between these 2 cases, even if its title could have been better.
Because the latency of memory loads is many times greater than the latency of any other kind of CPU instruction, both for loads from the main memory and for loads from the L3 cache, this effect is more visible in programs with many memory loads, like the examples from the article, than in programs using other instructions with long latencies.
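A minimal sketch of the two cases (my own illustration, not the article's code):

    #include <stddef.h>

    /* Latency-bound: each load's address depends on the previous
     * load's result, so the loads execute strictly one after another
     * and every step pays the full memory latency. */
    size_t chase(const size_t *next, size_t idx, size_t steps)
    {
        for (size_t i = 0; i < steps; i++)
            idx = next[idx];        /* serial dependency chain */
        return idx;
    }

    /* Throughput-bound: the addresses are independent, so the CPU and
     * the hardware prefetcher keep many loads in flight at once and
     * speed is limited by memory bandwidth, not per-load latency. */
    long stream(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];              /* independent loads */
        return s;
    }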
jjtheblunt•1d ago
adrian_b•14h ago
As with normal memory loads, the effect of a page miss varies depending on whether the load is part of a long dependency chain, so that the CPU cannot find other instructions to execute concurrently while the chain is stalled waiting for the load result, or only a few instructions depend on the load, so that the CPU can go ahead executing other parts of the program.
Page misses in the TLB do not cause any new behavior, but the very long latencies corresponding to them exacerbate the effects of long dependency chains. With page misses, even a relatively short dependency chain may not allow the CPU to find enough independent instructions to be executed in order to avoid an execution stall.
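As a hypothetical illustration of that amplification, consider touching one element per 4 KiB page, first as a dependent chain and then independently (array initialization omitted; chase_pages assumes the first long of page k holds the index of the next page to visit):

    #include <stddef.h>

    #define PAGE_LONGS (4096 / sizeof(long))  /* one 4 KiB page per element */

    /* Dependent walk, one element per page: every step can take a TLB
     * miss, and the page walk cannot even start until the previous
     * load has delivered the next index. */
    long chase_pages(const long *a, size_t npages)
    {
        long idx = 0;
        for (size_t i = 0; i < npages; i++)
            idx = a[(size_t)idx * PAGE_LONGS]; /* address depends on prior load */
        return idx;
    }

    /* Independent accesses to the same pages: the addresses are known
     * in advance, so the page walks for successive loads can overlap. */
    long sum_pages(const long *a, size_t npages)
    {
        long s = 0;
        for (size_t i = 0; i < npages; i++)
            s += a[i * PAGE_LONGS];
        return s;
    }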
Certain operating systems choose to lazily load memory pages from an SSD/HDD, or to implement a virtual memory capacity greater than the physical memory capacity. With these there is a different kind of page miss: a miss in the memory currently mapped as valid by the OS, which results in an exception handled by the operating system while the executing program is suspended. There are also mostly obsolete CPUs where a TLB page miss causes an exception instead of being handled by dedicated hardware. In these cases, which I assume are what you refer to by mentioning mmap, it does not matter whether the exception-causing instruction was part of a long dependency chain or not; the slowdown from exception handling is the same.