Usually an abstraction like this means the compiler has to emit generic code first, which is then harder to propagate constraints through and boil down to the same final assembly, since it's less similar to the "canonical" version of the code that wouldn't use a magic `==` (in this case), std::vector methods, or something else like that.
Abstractions are welcome when it doesn't matter; when it matters, there are other ways to write the code that are still standard-compliant C++.
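For concreteness, here's roughly what I mean by the abstracted spellings — these are just my own sketches of an all-zero check using std::all_of and the container's operator==, not code from the post (the 42 is arbitrary):

    #include <algorithm>
    #include <array>

    // Abstracted version using a standard algorithm.
    bool all_zero_algo(const std::array<int, 42>& array) {
        return std::all_of(array.begin(), array.end(),
                           [](int v) { return v == 0; });
    }

    // Abstracted version leaning on the container's operator==.
    bool all_zero_compare(const std::array<int, 42>& array) {
        return array == std::array<int, 42>{};
    }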
This was my function:

    for (auto v : array) {
        if (v != 0)
            return false;
    }
    return true;
clang emits basically the same thing yours does, but gcc really struggles to vectorize it for larger array sizes. Here's gcc for 42 elements:
https://godbolt.org/z/sjz7xd8Gs
and here's clang for 42 elements:
https://godbolt.org/z/frvbhrnEK
Very bizarre. Clang pretty readily sees that it can use SIMD instructions and really optimizes this, while GCC seems very reluctant to use them. I've even seen strange output where GCC emits SIMD instructions for the first part and then falls back on regular x86 compares for the rest.
Edit: Actually, it looks like it flips for large enough array sizes. At 256 elements, gcc ends up emitting SIMD instructions while clang does plain scalar x86. So strange.
Slightly off-topic, but I like this way to test if memory is all zeroes: https://rusty.ozlabs.org/2015/10/20/ccanmems-memeqzero-itera... (see "epiphany #2" at the bottom of the page). I really wish there were a standard libc function for it.
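For reference, a quick sketch of the "epiphany #2" idea — check the first handful of bytes by hand, then let memcmp compare the buffer against itself shifted by that amount. This is my own adaptation, not a drop-in for the CCAN memeqzero:

    #include <cstddef>
    #include <cstring>

    // Returns true if all len bytes at data are zero.
    bool mem_is_zero(const void* data, std::size_t len) {
        const unsigned char* p = static_cast<const unsigned char*>(data);
        // Check the first up-to-16 bytes one at a time.
        std::size_t head = len < 16 ? len : 16;
        for (std::size_t i = 0; i < head; ++i)
            if (p[i] != 0)
                return false;
        if (len <= 16)
            return true;
        // The head is all zero, so comparing the buffer against itself
        // shifted by 16 proves the rest is zero too; memcmp's tuned inner
        // loop does the heavy lifting.
        return std::memcmp(p, p + 16, len - 16) == 0;
    }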
the_fall•4d ago
The claim that the code is inefficient really isn't well substantiated in this blog post. Sometimes long-winded assembly actually runs faster because of pipelining, register renaming, and other microarchitectural quirks. Other times, a "weird" way of zeroing a register may actually take up less space in memory (xor reg, reg is shorter than mov reg, 0), etc.
rsf•4d ago
I didn't run benchmarks, but in the case of clang writing zeros to memory (which are never used thereafter), there's no way that particular code is optimal.
For the gcc output, it seems unlikely that the three versions are all optimal, given the inconsistent strategies used. In particular, the code that sets the output value to 0 or 1 in the size = 3 version is highly unlikely to be optimal in my opinion. I'd be amazed if it is!
Your point that unintuitive code is sometimes actually optimal is well taken though :)
its_magic•4d ago
https://skanthak.hier-im-netz.de/gcc.html
DannyBee•1h ago
Not to mention that for anything not super battery-limited you can get an M55 running at 800 MHz with a separate 1 GHz NPU, hardware video encoders, etc.
This is before you move into the Rockchip/etc space.
We really just aren't scalar-compute limited in tons of places these days. There are certainly places where we are, but 10-15 years ago missing little scalar optimizations could make very noticeable differences in the performance of lots of apps, and now it just doesn't make much of a difference anymore.
magicalhippo•57m ago
I had a case back in the 2010s where I was trying to optimize a hot loop. The loop involved an integer division by a divisor that was common to all elements, similar to a vector normalization pass. For reasons I don't recall, I couldn't get rid of the division entirely.
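Roughly the shape of the loop, reconstructed from memory (names are made up; this isn't the original code):

    #include <cstddef>

    int divisor;  // set once before the loop; the same for every element

    void scale_down(int* values, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            values[i] /= divisor;  // one integer division per element
    }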
I saw the compiler emitted an "idiv [mem]" instruction, and I thought surely that was suboptimal. So I reproduced the assembly but changed the code slightly so I could have "idiv reg" instead. All it involved was loading the variable into an unused register before the loop and using that inside the loop.
So I benchmarked it and much to my surprise it was a fair bit slower.
I thought it might have been due to loop target alignment, so I spent some time inserting no-ops to align things in various supposedly optimal ways, but it never got as fast. I changed my assembly to mirror what the compiler had spit out and voila, back to the fastest speed again...
Tried to ask around, and someone suggested it had to do with some internal register load/store contention or something along those lines.
At that point I knew I was done optimizing code by writing assembly. Not my cup of tea.