Usually an abstraction like this means the compiler has to emit generic code first, which is then harder to propagate constraints through and boil down to the same final assembly, since it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case), std::vector methods, or something else like that.
Abstractions are welcome when it doesn't matter; when it matters, there are other ways to write the code that are still standard-compliant C++.
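For concreteness, here's roughly what I mean by the abstracted spellings — these are just my own sketches of an all-zero check using std::all_of and the container's operator==, not code from the post (the 42 is arbitrary):

    #include <algorithm>
    #include <array>

    // Abstracted version using a standard algorithm.
    bool all_zero_algo(const std::array<int, 42>& array) {
        return std::all_of(array.begin(), array.end(),
                           [](int v) { return v == 0; });
    }

    // Abstracted version leaning on the container's operator==.
    bool all_zero_compare(const std::array<int, 42>& array) {
        return array == std::array<int, 42>{};
    }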
This was my function:

    for (auto v : array) {
        if (v != 0)
            return false;
    }
    return true;
clang emits basically the same thing yours does, but gcc really struggles to vectorize it for larger array sizes. Here's gcc for 42 elements:
https://godbolt.org/z/sjz7xd8Gs
and here's clang for 42 elements:
https://godbolt.org/z/frvbhrnEK
Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this, while GCC seems very reluctant to use them. I've even seen strange output where GCC emits SIMD instructions for the first part and then falls back on regular x86 compares for the rest.
Edit: Actually, it looks like it flips for large enough array sizes. At 256 elements, gcc ends up emitting SIMD instructions while clang does plain scalar x86. So strange.
Slightly off-topic, but I like this way to test if memory is all zeroes: https://rusty.ozlabs.org/2015/10/20/ccanmems-memeqzero-itera... (see "epiphany #2" at the bottom of the page). I really wish there were a standard libc function for it.
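For reference, a quick sketch of the "epiphany #2" idea — check the first handful of bytes by hand, then let memcmp compare the buffer against itself shifted by that amount. This is my own adaptation, not a drop-in for the CCAN memeqzero:

    #include <cstddef>
    #include <cstring>

    // Returns true if all len bytes at data are zero.
    bool mem_is_zero(const void* data, std::size_t len) {
        const unsigned char* p = static_cast<const unsigned char*>(data);
        // Check the first up-to-16 bytes one at a time.
        std::size_t head = len < 16 ? len : 16;
        for (std::size_t i = 0; i < head; ++i)
            if (p[i] != 0)
                return false;
        if (len <= 16)
            return true;
        // The head is all zero, so comparing the buffer against itself
        // shifted by 16 proves the rest is zero too; memcmp's tuned inner
        // loop does the heavy lifting.
        return std::memcmp(p, p + 16, len - 16) == 0;
    }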
the_fall•4d ago
The claim that the code is inefficient really isn't well substantiated in this blog post. Sometimes long-winded assembly actually runs faster because of pipelining, register renaming, and other microarchitectural quirks. Other times, a "weird" way of zeroing a register may actually take up less space in memory (xor reg, reg is shorter than mov reg, 0), etc.
rsf•4d ago
I didn't run benchmarks, but in the case of clang writing zeros to memory (which are never used thereafter), there's no way that particular code is optimal.
For the gcc output, it seems unlikely that the three versions are all optimal, given the inconsistent strategies used. In particular, the code that sets the output value to 0 or 1 in the size = 3 version is highly unlikely to be optimal in my opinion. I'd be amazed if it is!
Your point that unintuitive code is sometimes actually optimal is well taken though :)
its_magic•4d ago
https://skanthak.hier-im-netz.de/gcc.html
DannyBee•1h ago
Not to mention that for anything not super battery-limited you can get an M55 running at 800 MHz with a separate 1 GHz NPU, hardware video encoders, etc.
This is before you move into the Rockchip/etc space.
We really just aren't scalar-compute limited in tons of places these days. There are certainly places where we are, but 10-15 years ago missing little scalar optimizations could make very noticeable differences in the performance of lots of apps, and now it just doesn't make much of a difference anymore.
magicalhippo•57m ago
I had a case back in the 2010s where I was trying to optimize a hot loop. The loop involved an integer division by a divisor that was common to all elements, similar to a vector normalization pass. For reasons I don't recall, I couldn't get rid of the division entirely.
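Roughly the shape of the loop, reconstructed from memory (names are made up; this isn't the original code):

    #include <cstddef>

    int divisor;  // set once before the loop; the same for every element

    void scale_down(int* values, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            values[i] /= divisor;  // one integer division per element
    }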
I saw the compiler emitted an "idiv [mem]" instruction, and I thought surely that was suboptimal. So I reproduced the assembly but changed the code slightly so I could have "idiv reg" instead. All it involved was loading the variable into an unused register before the loop and using that inside the loop.
So I benchmarked it and much to my surprise it was a fair bit slower.
I thought it might have been due to loop target alignment, so I spent some time inserting no-ops to align things in various supposedly optimal ways, but it never got as fast. I changed my assembly to mirror what the compiler had spit out and voila, back to the fastest speed again...
Tried to ask around, and someone suggested it had to do with some internal register load/store contention or something along those lines.
At that point I knew I was done optimizing code by writing assembly. Not my cup of tea.