See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table, but for an ELF binary it could, no? Why doesn't it do that?
If you declare them as static, it eliminates the functions and the calls completely: https://aoco.compiler-explorer.com/z/soPqe7eYx
I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.
It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).
That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.
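A minimal sketch of that effect (function names are mine, not from the linked example):

    // both helpers have internal linkage, so the compiler knows every
    // caller; it can inline both calls, constant-fold them, and emit
    // no standalone definition for either function
    static int add_a(int a, int b) { return a + b; }
    static int add_b(int a, int b) { return a + b; }

    int main() {
        // GCC and Clang at -O2 typically compile this whole program
        // down to "return 7", with add_a/add_b gone from the binary
        return add_a(1, 2) + add_b(2, 2);
    }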
I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.
A quick google suggests it's called "identical comdat folding" https://devblogs.microsoft.com/oldnewthing/20161024-00/?p=94...
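The classic trigger (my own illustration, not from the linked post) is template instantiations that happen to compile to byte-identical machine code:

    template <typename T>
    T first(const T* p) { return *p; }

    // first<int> and first<unsigned> produce identical machine code,
    // so a linker doing ICF (MSVC's /OPT:ICF, or --icf=all with
    // gold/lld) can keep one body and point both symbols at it
    template int first<int>(const int*);
    template unsigned first<unsigned>(const unsigned*);

One standard caveat: folded functions end up sharing an address, which can break code that compares function pointers for identity.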
Anything HPC will benefit from thinking about how things map onto hardware (or, in case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of Nsight and Visual Studio (not VS Code) is pretty disappointing.
For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL. In the example above, you can see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication: the sum 1 + 2 + ... + n is computed in closed form, n(n+1)/2, instead of with n additions.
qsort•42m ago
Like with people in general, it depends on what compiler/interpreter we're talking about. I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
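A concrete instance of that (my own example): without aliasing information, the compiler has to assume that a store through one pointer may change what another pointer reads.

    // the compiler must reload *scale on every iteration, because dst
    // might alias scale; C's restrict (spelled __restrict in most C++
    // compilers) is how you "tell it", letting the load be hoisted
    // and the loop vectorized
    void scale_all(float* dst, const float* src, const float* scale, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = src[i] * *scale;
    }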
amiga386•8m ago
So if you use a messy solution, where something that should be a struct operated on with functions is instead just a pile of local variables within a single function, and you use macros operating on those locals instead of inlineable functions operating on structs, you get massively better performance.
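Roughly this contrast, if I follow (all names are mine); on a weak or non-optimizing compiler the macro version never pays for a call or a pointer indirection:

    struct Counter { int value; };
    static void bump(Counter* c, int x) { c->value += x; }  // the "clean" way

    #define BUMP(value, x) ((value) += (x))                 // the "messy" way

    int sum_struct(const int* v, int n) {
        Counter c = {0};
        for (int i = 0; i < n; ++i) bump(&c, v[i]);  // call + indirection
        return c.value;
    }

    int sum_macro(const int* v, int n) {
        int value = 0;                                // bare local
        for (int i = 0; i < n; ++i) BUMP(value, v[i]); // textually inlined
        return value;
    }

A modern optimizer usually compiles both to the same code, which is exactly what makes the trade-off painful on compilers that don't.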
tonyhart7•38m ago
Pareto principle, like always: you don't need the best, just good enough.
Not every company is Google-level anyway.
ErroneousBosh•27m ago
I agree that most people are not writing hand-tuned AVR8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
IshKebab•22m ago
You can mostly avoid thinking about the super low-level integer manipulation stuff, though.
jaccola•19m ago
I work with CUDA kernels a lot for computer vision. I am able to consistently and significantly improve the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc., the previously impenetrable code becomes much clearer and the obvious optimisations reveal themselves.
That's not to say there aren't certain tricks / patterns / gotchas / low-level hardware realities to keep in mind, of course.
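A toy before/after of that kind of cleanup (plain C++ rather than an actual kernel, and every name here is mine):

    // before: loose scalars, magic indexing, everything in one blob
    float shade_before(const float* img, int w, int x, int y,
                       float gr, float gg, float gb) {
        return img[y * w + x] * gr + img[y * w + x] * gg + img[y * w + x] * gb;
    }

    // after: related values grouped, each step named and reusable
    struct Pixel { int x, y; };
    struct Gains { float r, g, b; };

    static float load_pixel(const float* img, int w, Pixel p) {
        return img[p.y * w + p.x];
    }

    static float apply_gains(float v, Gains g) {
        return v * (g.r + g.g + g.b);
    }

    float shade_after(const float* img, int w, Pixel p, Gains g) {
        return apply_gains(load_pixel(img, w, p), g);
    }

Once the state has names and boundaries, things like the redundant loads of img[y * w + x] above become easy to see.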