frontpage.

We Mourn Our Craft

https://nolanlawson.com/2026/02/07/we-mourn-our-craft/
119•ColinWright•1h ago•87 comments

Speed up responses with fast mode

https://code.claude.com/docs/en/fast-mode
22•surprisetalk•1h ago•24 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
121•AlexeyBrin•7h ago•24 comments

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
62•vinhnx•5h ago•7 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
828•klaussilveira•21h ago•249 comments

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

https://www.forbes.com/sites/mikestunson/2026/02/05/us-jobs-disappear-at-fastest-january-pace-sin...
119•alephnerd•2h ago•78 comments

Al Lowe on model trains, funny deaths and working with Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
55•thelok•3h ago•7 comments

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

https://www.hpcwire.com/off-the-wire/brookhaven-labs-rhic-concludes-25-year-run-with-final-collis...
4•gnufx•39m ago•0 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
108•1vuio0pswjnm7•8h ago•138 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
1059•xnx•1d ago•611 comments

Reinforcement Learning from Human Feedback

https://rlhfbook.com/
76•onurkanbkrc•6h ago•5 comments

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
484•theblazehen•2d ago•175 comments

I Write Games in C (yes, C)

https://jonathanwhiting.com/writing/blog/games_in_c/
8•valyala•2h ago•1 comments

SectorC: A C Compiler in 512 bytes

https://xorvoid.com/sectorc.html
9•valyala•2h ago•0 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
209•jesperordrup•12h ago•70 comments

France's homegrown open source online office suite

https://github.com/suitenumerique
558•nar001•6h ago•256 comments

Coding agents have replaced every framework I used

https://blog.alaindichiappari.dev/p/software-engineering-is-back
222•alainrk•6h ago•343 comments

A Fresh Look at IBM 3270 Information Display System

https://www.rs-online.com/designspark/a-fresh-look-at-ibm-3270-information-display-system
36•rbanffy•4d ago•7 comments

Selection Rather Than Prediction

https://voratiq.com/blog/selection-rather-than-prediction/
8•languid-photic•3d ago•1 comments

History and Timeline of the Proco Rat Pedal (2021)

https://web.archive.org/web/20211030011207/https://thejhsshow.com/articles/history-and-timeline-o...
19•brudgers•5d ago•4 comments

72M Points of Interest

https://tech.marksblogg.com/overture-places-pois.html
29•marklit•5d ago•2 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
114•videotopia•4d ago•31 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
76•speckx•4d ago•75 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
6•momciloo•2h ago•0 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
273•isitcontent•22h ago•38 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
201•limoce•4d ago•111 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
22•sandGorgon•2d ago•11 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
286•dmpetrov•22h ago•153 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
155•matheusalmeida•2d ago•48 comments

Software factories and the agentic moment

https://factory.strongdm.ai/
71•mellosouls•4h ago•75 comments

When Compiler Optimizations Hurt Performance

https://nemanjatrifunovic.substack.com/p/when-compiler-optimizations-hurt
91•rbanffy•3mo ago

Comments

userbinator•3mo ago
ARM64 has many registers but I believe the lack of suitably large immediate values and, apparently, compilers that are willing to use them all across functions, puts it at a disadvantage here. Assuming we want the return value in eax and the leading count comes in cl, this can be done branchlessly and data-lessly on x86 as follows:

    mov eax, 0x00043201
    test cl, 8
    setz al
    shl cl, 2
    shr eax, cl
    and eax, 15
Something similar may be possible on ARM64, but I suspect it will definitely be more than 19 bytes ;-)
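
The trick is easier to see in C. A minimal sketch, assuming x already holds the leading-ones count of the first byte (the function name is mine); note that C needs an explicit guard for x >= 8 where the asm leans on x86 masking shift counts to 5 bits:

    #include <assert.h>

    /* Nibble-packed table: count 0 -> 1 byte, 2 -> 2, 3 -> 3, 4 -> 4,
       anything else -> 0 (invalid). 0x00043201 holds those nibbles. */
    static unsigned utf8_len_from_leading_ones(unsigned x)
    {
        if (x >= 8)                  /* the 0xFF case the test/setz pair handles */
            return 0;
        return (0x00043201u >> (x * 4)) & 15;
    }

    int main(void)
    {
        assert(utf8_len_from_leading_ones(0) == 1);  /* ASCII */
        assert(utf8_len_from_leading_ones(1) == 0);  /* continuation byte */
        assert(utf8_len_from_leading_ones(2) == 2);
        assert(utf8_len_from_leading_ones(4) == 4);
        assert(utf8_len_from_leading_ones(8) == 0);  /* 0xFF */
        return 0;
    }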
shiftingleft•3mo ago
Shouldn't your snippet be using lzcnt? I can't see how this would result in the desired lookup.

For Zen 5, rustc creates the following:

  utf8_sequence_length_lookup:
    shl edi, 24
    mov ecx, 274945
    not edi
    lzcnt eax, edi
    shl al, 2
    shrx rax, rcx, rax
    and al, 7
    ret
https://rust.godbolt.org/z/hz1eKjnaG
Cold_Miserable•3mo ago
I can't see any elegant solution.

    Struggling proc
      lzcnt ecx,ecx
      test cl,cl
      setz al
      lea edx,[ecx-2] ;[0,1,2]
      cmp rdx,2
      cmovle eax,ecx
      ret
    Struggling endp

userbinator•3mo ago
> and the leading count comes in cl

We can assume the count has already been done.

shiftingleft•3mo ago
Ah I can't read, thanks :-)
hairtuq•3mo ago
The mapping after we have the leading 1's count can be done in 3 instructions (in 32-bit math) on either x86 or ARM:

    t = 0x2020c6 >> x
    return (t << (t & 31)) >> 27
userbinator•3mo ago
That is extremely clever (how did you come up with it?), but definitely needs more than 3 instructions (you need at least 2 just to shift the constant):

    mov eax, 0x2020c6
    shr eax, cl
    mov cl, al
    shl eax, cl
    shr eax, 27
Yet those 14 bytes definitely beat my first attempt... although I've since found an even shorter solution:

    mov ax, 0x4681
    lea ecx, [ecx+ecx*2]
    shr eax, cl
    and eax, 7
12 bytes.
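
Both constants check out. A quick C sketch (function names mine) comparing the 0x2020c6 trick and the packed 3-bit 0x4681 table against a naive reference, over every possible leading-ones count of a byte:

    #include <stdio.h>

    /* Reference mapping: 0 -> 1, 2..4 -> themselves, everything else invalid. */
    static unsigned ref(unsigned x)
    {
        return x == 0 ? 1 : (x >= 2 && x <= 4) ? x : 0;
    }

    /* hairtuq's trick, in 32-bit unsigned math. */
    static unsigned via_2020c6(unsigned x)
    {
        unsigned t = 0x2020c6u >> x;
        return (t << (t & 31)) >> 27;
    }

    /* The packed 3-bit table behind "mov ax, 0x4681". */
    static unsigned via_4681(unsigned x)
    {
        return (0x4681u >> (3 * x)) & 7;
    }

    int main(void)
    {
        for (unsigned x = 0; x <= 8; x++)   /* all counts a byte allows */
            printf("%u: ref=%u trick=%u table=%u\n",
                   x, ref(x), via_2020c6(x), via_4681(x));
        return 0;
    }

All three agree for every x from 0 to 8.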
Cold_Miserable•3mo ago
I see what you're doing. 13 bytes, 3 cycles. Why does newline not work?

    lzcnt ecx,ecx
    mov eax, 00100 0011 0010 0000 0001b
    shl ecx,2
    shrx eax,eax,ecx
    and eax,15

teo_zero•3mo ago
Clever, but I prefer the parent's proposal:

  return (0x43201 >> x*4) & 7
It's more understandable and the number of instructions is the same.
jandrewrogers•3mo ago
I’ve seen many examples of a similar phenomenon. Modern compilers are impressively good at generating bit-twiddling micro-optimizations from idiomatic code. They are also good at larger-scale structural macro-optimization. However, there is a No Man’s Land for compiler optimization between micro-optimization and macro-optimization where the effectiveness of compiler optimizations is much less reliable.

Intuitively I understand why it is a hard problem. Micro-optimizations have deterministic properties that are simple enough that optimality is all but certain. Macro-optimization heuristics create incidental minor pessimization effects that are buried below the noise floor by major optimization effects on average.

In the middle are optimizations that are too complex to guarantee effective optimization but too small in effect to offset any incidental pessimization. It is easy to inadvertently make things worse in these cases. Most cases of surprising performance variances I see fall into this gap.

It is also where the codegen from different compilers seems to disagree the most, which lends evidence to the idea that the “correct” codegen is far from obvious to a compiler.

algo_trader•3mo ago
> also good at larger scale structural macro-optimization

Can you give some examples?

Perhaps you mean small granular choices which occur widely throughout the code?

viraptor•3mo ago
Automatic unrolling + inlining + constant hoisting (+ vectorisation)? I'd call that macro, but maybe op had something different in mind.
fooker•3mo ago
It's a hard problem because the cost model is incredibly complicated.

For example: deciding to unroll/inline/vectorize code has unpredictable effects on performance.

bell-cot•3mo ago
Or worse - it has perfectly predictable effects on the performance...so long as you know details (core counts, cache sizes, etc.) of the CPU model and dataset which your code will be run on, and the TLB/cache/fabric/memory usage pattern of other jobs which might be running at the same time.
fooker•3mo ago
No, maybe you misunderstand the problem here?

Given a specific program, yes it is measurable and predictable.

A compiler has to deal with arbitrary programs, actually - small fragments of arbitrary programs at a time. Even if you could have an oracle giving you very precise measurements for the program fragment the compiler is dealing with, it won't hold up in practice.

Consider a small representative example. Suppose the compiler is compiling a function, and it did the best possible job.

Now, inlining this function at two callsites could have completely different second-order effects (further optimizations, register pressure, and so on). There are partial solutions, of course; people have spent entire careers on this specific problem. Rest assured that whatever you can come up with after a minute of thought was already being tried in the 1970s.

fancyfredbot•3mo ago
I think OP was agreeing with you.
fooker•3mo ago
My point was - even if you knew everything about the hardware and could measure performance with absolute precision, it is still a hard problem for compilers for many many reasons.
Panzerschrek•3mo ago
I think some people misunderstand how compiler optimizations work. It's not magic that makes each piece of code faster; it's a bunch of deterministic code transformation rules that usually make code faster (across a large set of use cases), but there is no proof that they always do. When writing performant code, one should keep this in mind.
Someone•3mo ago
There are many optimizations that are guaranteed to make code run faster, but even then, that does not mean you should use them.

There can be competition between optimizations. At a given moment, the preconditions for both optimization O1 and optimization O2 may hold, but if you then apply O1, the precondition for O2 may no longer hold and vice versa.

In such cases, the best solution is to pick the better of O1 and O2.

Findecanor•3mo ago
If I understood it right, equality saturation on e-graphs is intended to help solve that problem.

Cranelift has an optimiser based on it, for classic optimisations within an IR.

account42•3mo ago
One could even say that compiler optimizations are a set of deterministic code transformation rules that happen to make SPEC run faster. But programmers are not wrong to expect better from their compilers.
Panzerschrek•3mo ago
It's a simple rule of thumb to avoid branching in performance-critical code. Branchless code is pretty predictable, but once branching takes place, one can no longer reason so easily about code performance, since compiler optimizations and CPU branch prediction come into play.
account42•3mo ago
But reality is not that simple. Branchless with unfortunate data dependencies can be a lot worse than predictable branching code.
Panzerschrek•3mo ago
That's why I said it's a rule of thumb. There are always exceptions where branching makes the code faster.
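
A minimal C sketch of one such case (function names mine). The branchless form never mispredicts, but it puts a compare-and-mask on the critical path into the accumulator; the branchy form is nearly free when the data keeps the branch predictable. Which one wins depends on the data, and on whether the compiler vectorizes either loop:

    #include <stddef.h>
    #include <stdint.h>

    /* Branchy: on mostly-non-negative data the predictor makes this cheap. */
    int64_t sum_pos_branchy(const int32_t *a, size_t n)
    {
        int64_t s = 0;
        for (size_t i = 0; i < n; i++)
            if (a[i] > 0)
                s += a[i];
        return s;
    }

    /* Branchless: no mispredictions ever, but every iteration feeds a
       compare, a negate, and an AND into the addition chain. */
    int64_t sum_pos_branchless(const int32_t *a, size_t n)
    {
        int64_t s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i] & -(int32_t)(a[i] > 0);   /* mask is 0 or all-ones */
        return s;
    }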
president_zippy•3mo ago
If you really want to see this at work at a whole other extreme, try compiling code for an N64 game. It's no surprise that optimizations for modern x86_64 and ARM64 processors with plenty of cache don't generalize well to a MIPS CPU with software-managed cache and abysmal RDRAM latency, but the exact mechanics of it are interesting.

KazeEmmanuar did a great job analyzing exactly this so we don't have to!

https://www.youtube.com/watch?v=Ca1hHC2EctY

MrCheeze•3mo ago
Exactly what I was going to post. Optimizations like loop unrolling slow down the N64 because keeping the code size small is the most important factor. I think even compilers of the time got this wrong, not just modern ones.
president_zippy•3mo ago
The one that really blows me away is how KazeEmmanuar explained the software-controlled cache. Using it well would involve either manually making calls to load/invalidate it or writing a compiler backend that replaces loading data from memory into registers with instructions to load specific memory address ranges into cache.

The whole thing reminds me, in a fuzzy way I don't yet fully comprehend, of register-memory architecture.

YesThatTom2•3mo ago
This is a good reminder that optimization without measurement is stupidity.

It’s also a reminder that compilers aren’t magic and optimizers aren’t all-knowing.

People have a tendency to ascribe magical properties to anything they don’t understand. Compilers and optimizers more so.

Measure before and after any optimization. You’d be surprised at what does and doesn’t improve efficiency.

Today’s CPUs are more complex than a human can understand.

Measure twice, cut once.

pjmlp•3mo ago
Worse is the cargo cult of "write this way because it is faster", without any measurements.

It persists only because it used to be true in the past, because someone said so during a coffee break, or because it appeared in some computer-related article.

superblas•3mo ago
Agreed. As an embedded software engineer new to the field, I've seen senior engineers with over ten years of experience complain when division by a compile-time constant (even a power of two) is used in code instead of multiplication or bit shifting, on our Cortex-M7 MCU running at 600 MHz.
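
For the record, a minimal sketch of what gets argued about (names mine): any optimizing compiler already strength-reduces division by a compile-time power of two, so writing the shift by hand buys nothing.

    #include <stdint.h>

    /* Unsigned divide by a power of two: compiles to a single logical
       shift right (e.g. "lsrs r0, r0, #3" on a Cortex-M). */
    uint32_t udiv8(uint32_t x) { return x / 8; }

    /* Signed divide: still branch-free; the compiler adds a small
       fix-up (effectively +7 for negative inputs) before an arithmetic
       shift so the result rounds toward zero, as C requires. */
    int32_t sdiv8(int32_t x) { return x / 8; }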
PaulKeeble•3mo ago
I suspect that over time, as memory latency, memory bandwidth, and branch prediction have increasingly become the hot aspects of CPUs, optimisations put into compilers before now often end up improving the wrong thing. Keeping memory reads sequential and ahead of the computation is now critical for peak performance, as is having branches correctly predicted. Saving computation time on the CPU's execution ports matters much less, since we mostly struggle to shuttle enough work into them to reach peak performance.

Branch predictors are a lot bigger and more capable than they used to be, and there is every chance that when the jump-table optimisation was tested against the equivalent branch code it outperformed it, but on more recent CPUs the situation reverses.

anon-3988•3mo ago
Since the longest prefix is only 11110 (5 bits), can't you use a 32-entry lookup table indexed by the top 5 bits (shifting off the low 3)?
yuliyp•3mo ago
That's not the problem. The problem is that you're adding a data dependency on the CPU loading the first byte. The branch-based one just "predicts" the number of bytes in the codepoint and can keep executing code past that. In data that's ASCII, relying on the branch predictor to just guess "0" repeatedly turns out to be much faster as you can effectively be processing multiple characters simultaneously.
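
The two shapes being contrasted, as a rough C sketch (names mine; the table version mirrors the rustc output upthread and uses the GCC/Clang builtin __builtin_clz):

    #include <stdint.h>

    /* Branchy: on ASCII text the first branch resolves the same way every
       time, so the CPU speculates past it and starts on later bytes early. */
    static unsigned utf8_len_branchy(uint8_t b)
    {
        if (b < 0x80) return 1;   /* ASCII fast path */
        if (b < 0xC0) return 0;   /* continuation byte: invalid as a lead */
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        if (b < 0xF8) return 4;
        return 0;                 /* 0xF8..0xFF: invalid */
    }

    /* Table: compact and branch-free, but the result is data-dependent on
       the loaded byte, so whatever consumes it waits on the load. */
    static unsigned utf8_len_table(uint8_t b)
    {
        unsigned ones = (unsigned)__builtin_clz(~((uint32_t)b << 24));
        return (unsigned)((UINT64_C(0x43201) >> (ones * 4)) & 15);
    }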
anon-3988•3mo ago
I am pretty sure CPUs can speculatively load as well. When the pipeline sees repeated load instructions, it should be able to dispatch and perform all of them in flight. The nice thing is that there is no execution hazard here, since all of the speculative loads are usable, unlike the 1% chance that the branch fails and the whole pipeline gets flushed.
ky_vulnerable•3mo ago
When I used to code in C++, I went down a long rabbit hole of compiler optimizations. It's way more complicated than we give developers credit for. Coming from C++ to Rust, I was happy to see that `--release` optimizes (almost) everything for you.
adzm•3mo ago
Notably this is on an ARM processor (aarch64), and they don't say specifically which one. I've seen extreme differences in jump table performance between x86/amd64 CPUs. Specifically, I remember seeing a big speedup on similar code with either Haswell or Skylake that I pinned down to a jump table suddenly performing blazing fast compared to the previous architecture. That actually led me to implement the same thing here (avoid generating the jump table) so it would perform better on the earlier processors. This was code that was, basically, stream-parsing varint values, so the code pattern immediately triggered that core memory!