Intuitively I understand why it is a hard problem. Micro-optimizations have deterministic properties that are simple enough that optimality is all but certain. Macro-optimization heuristics create incidental minor pessimization effects that are buried below the noise floor by major optimization effects on average.
In the middle are optimizations that are too complex to guarantee an effective optimization but too small in effect to offset any incidental pessimization. It is easy to inadvertently make things worse in these cases. Most of the surprising performance variance I see falls into this gap.
It is also where the codegen from different compilers seems to disagree the most, which lends evidence to the idea that the “correct” codegen is far from obvious to a compiler.
Can you give some examples?
Perhaps you mean small granular choices which occur widely throughout the code?
For example: deciding to unroll/inline/vectorize code has an unpredictable effect on performance.
Given a specific program, yes it is measurable and predictable.
A compiler has to deal with arbitrary programs, and in practice with small fragments of arbitrary programs at a time. Even if you had an oracle giving you very precise measurements for the exact fragment the compiler is looking at, those measurements wouldn't hold up in practice once the fragment sits in a different surrounding context.
Consider a small representative example. Suppose the compiler is compiling a function, and it did the best possible job.
Now, inlining this function at two different callsites could have completely different second-order effects, such as enabling further optimizations or changing register pressure. There are partial solutions, of course; people have spent their entire careers on this specific problem. Rest assured that whatever you can come up with after a minute of thought was already being tried in the 1970s.
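A minimal sketch of what that looks like (hypothetical Rust, function names made up, not from the thread): the same tiny function considered for inlining at two call sites, with very different consequences.

    // Assume the compiler has already done a perfect job on this body in isolation.
    fn scale(x: u32, k: u32) -> u32 {
        x.wrapping_mul(k).wrapping_add(1)
    }

    pub fn call_site_a() -> u32 {
        // Inlining here is an easy win: both arguments are constants, so the
        // whole call can fold down to the constant 21 at compile time.
        scale(10, 2)
    }

    pub fn call_site_b(data: &[u32], a: u32, b: u32, c: u32, d: u32) -> u32 {
        // Inlining here is far less clear-cut: the body lands inside a loop
        // that already keeps several values live, so it mostly adds register
        // pressure and code size, and whether that helps or hurts depends on
        // the target and the surrounding code.
        let mut acc = 0u32;
        for &x in data {
            acc = acc.wrapping_add(scale(x, a) ^ b ^ c ^ d);
        }
        acc
    }

A heuristic tuned on one of these call sites can easily make the wrong call on the other.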
There can be competition between optimizations. At a given moment, the preconditions for both optimization O1 and optimization O2 may hold, but if you then apply O1, the precondition for O2 may no longer hold and vice versa.
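A standard concrete illustration of that (the usual equality-saturation motivating example, not something specific to this thread): take (a * 2) / 2 with two rewrites available, and note that applying one first can hide the other.

    // O1, strength reduction:    x * 2        ==>  x << 1
    // O2, mul/div cancellation:  (x * y) / y  ==>  x   (assume the IR may treat this as exact)
    fn original(a: i32) -> i32 { (a * 2) / 2 }
    fn apply_o2_first(a: i32) -> i32 { a }            // the ideal result
    fn apply_o1_first(a: i32) -> i32 { (a << 1) / 2 } // O2 no longer matches; stuck here

Greedily committing to O1 locks the compiler out of the better form, which is exactly the ordering problem described above.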
In such cases, the best solution is to pick the better of O1 and O2.
Cranelift has an optimiser based on that idea: its mid-end uses an e-graph (equality-saturation style), so competing rewrites can coexist instead of one being committed to early, for classic optimisations within an IR.
KazeEmmanuar did a great job analyzing exactly this so we don't have to!
The whole thing reminds me, in a fuzzy way I don't yet fully comprehend, of register-memory architecture.
It’s also a reminder that compilers aren’t magic and optimizers aren’t all-knowing.
People have a tendency to ascribe magical properties to anything they don’t understand. Compilers and optimizers more so.
Measure before and after any optimization. You’d be surprised at what does and doesn’t improve efficiency.
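A minimal sketch of that habit (std-only Rust, hypothetical workload; a real comparison would use a proper benchmark harness, warm-up, and repeated runs):

    use std::hint::black_box;
    use std::time::Instant;

    // Hypothetical hot path you are about to "optimize".
    fn workload(data: &[u64]) -> u64 {
        data.iter().fold(0u64, |acc, &x| acc ^ x.rotate_left(7))
    }

    fn main() {
        let data: Vec<u64> = (0..10_000_000u64).collect();
        let start = Instant::now();
        // black_box keeps the compiler from deleting the unused result.
        black_box(workload(&data));
        println!("took {:?}", start.elapsed());
    }

Run it before and after the change, on the hardware you actually care about, and believe the numbers over your intuition.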
Today’s CPUs are more complex than a human can understand.
Measure twice, cut once.
Only because it used to be true in the past, because someone said it was so during a coffee break, or because it appeared in some computer-related article.
Branch predictors are a lot bigger and more capable now than they used to be. There is every chance that, when the jump-table optimisation was originally tested against the equivalent branch code, it outperformed it, but on more recent CPUs the situation reverses.
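For a concrete feel (hypothetical function, nothing from the thread): a dense match like the one below is the kind of code a compiler may lower either to a jump table or lookup table, or to a short compare-and-branch chain, and which of those wins can differ between an older core and one with a larger, smarter branch predictor.

    // Which lowering is fastest (jump table, in-register lookup table, or a
    // chain of predictable branches) depends on the microarchitecture; the
    // compiler's choice was tuned against whatever CPUs existed at the time.
    pub fn bucket(x: u32) -> u32 {
        match x {
            0 => 10,
            1 => 20,
            2 => 30,
            3 => 40,
            _ => 0,
        }
    }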
shiftingleft•3mo ago
for Zen5 rustc creates the following:
https://rust.godbolt.org/z/hz1eKjnaG
Cold_Miserable•3mo ago
    Struggling proc
        lzcnt   ecx,ecx
        test    cl,cl
        setz    al
        lea     edx,[ecx-2]   ;[0,1,2]
        cmp     rdx,2
        cmovle  eax,ecx
        ret
    Struggling endp
userbinator•3mo ago
We can assume the count has already been done.
Cold_Miserable•3mo ago
    lzcnt ecx,ecx                           ; ecx = leading-zero count
    mov   eax, 00100 0011 0010 0000 0001b   ; table of 4-bit entries packed into a register
    shl   ecx,2                             ; index * 4 bits
    shrx  eax,eax,ecx                       ; shift the selected nibble down
    and   eax,15                            ; mask off everything else
teo_zero•3mo ago