Intel's BOT doesn't seem to deliver meaningfully more performance, and it is closed source.
BOLT could do this, but does not as far as I’m aware.
Most vectorization like this is also probably better done in a compiler middle end. At least in LLVM, the loop vectorizer and especially the SLP vectorizer do a decent job of picking up most of the gains.
You might be able to pick up some additional gains by doing it post-link at the MC level, but writing even an IR-level SLP vectorizer is already quite difficult.
[1] https://www.intel.com/content/www/us/en/support/articles/000...
I could have sworn Intel had their own PLO tool, but I can only find https://github.com/clearlinux/distribution/issues/2996.
It was open source, but has since been deprecated.
I do wonder what this "optimize" step actually entails; does it just replace the binary with one that Intel themselves carefully decompiled and then hand-optimised? If it's a general "decompile-analyse-optimise-recompile" pipeline (perhaps similar to what the https://en.wikipedia.org/wiki/Transmeta_Crusoe did), why restrict it?
refulgentis•1h ago
Wait until they hear about branch predictors.