Strategic optimisation is often basically free if you have domain expertise. It's easy to know that the business wants outcome x and that algorithm y is the right choice, etc., when it's all internal thought processes. Whereas if you don't know enough, you're likely to make decisions that are very expensive to undo.
> There is no way to provide both optimized assembly and equivalent C code and let the compiler use the former in the general case and the latter in special cases.
This is true, but can be seen as a failure of language and tooling. For example, Halide [1] pioneered (AFAIK) the concept of separating algorithm from implementation at the language level. This separation lets you express the algorithm once, and then "schedule" it by specifying parallelism, vectorization, etc. You can provide multiple schedules for one algorithm, which allows you to specialize / make different choices depending on varying factors.
It's a really interesting concept, though maybe limited in practice to DSLs. I'm not sure a general purpose language would be a good fit for this model, but then, for general purpose programs written in general purpose languages, perf optimization at the level TFA discusses is frequently limited to just specific hot sections. Those hot sections could be extracted out into specialized components written in such a DSL.
https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning....
you don't need to go all the way to Halide to do what the article is claiming isn't possible - you can do it just by including a "micro-kernel" in your library and having the code branch to that impl (depending on something at runtime) instead of whatever the C code compiled down to. this is done every single day in every single GPU lib (famously, cublas ships with hundreds/thousands of such ukernels for gemms, depending on shapes).
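A bare sketch of that pattern (all names hypothetical): a hand-tuned micro-kernel shipped next to the portable version, with a cheap runtime check picking between them:

    #include <cstddef>

    // Portable fallback: whatever the compiler makes of the plain loop.
    static void axpy_generic(float a, const float* x, float* y, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) y[i] += a * x[i];
    }

    // Stand-in for a hand-tuned micro-kernel (in a real library this would be
    // intrinsics or assembly, specialized for one shape / one ISA).
    static void axpy_ukernel_8(float a, const float* x, float* y, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8)
            for (std::size_t j = 0; j < 8; ++j) y[i + j] += a * x[i + j];
    }

    // The library entry point branches on a runtime property (here: the shape).
    void axpy(float a, const float* x, float* y, std::size_t n) {
        if (n >= 64 && n % 8 == 0) axpy_ukernel_8(a, x, y, n);
        else                       axpy_generic(a, x, y, n);
    }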
One of my early-career successes was just creating a framework for generating every permutation of perf optimizations for every (log-scaled -- clz is very fast) input size and checking which was best, dropping the results into a lookup table of function pointers to branch on. The university had a large supply of heterogeneous computers, replete with all the normal problems like being able to double floating-point addition throughput on Haswell CPUs by abusing the fmadd instruction, so I made a framework (probably closer to a DSL) for encoding your algorithms in a way that you could analyze perf tradeoffs at compile time and tune your result for the given computer. It's kind of like what ATLAS does for some linear algebra tasks.
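Roughly the shape of that scheme, as I understand it from the description (names hypothetical, and the table below holds placeholder choices rather than real benchmark results):

    #include <array>
    #include <bit>       // std::bit_width (C++20), i.e. a clz under the hood
    #include <cstddef>
    #include <numeric>

    using kernel = double (*)(const double*, std::size_t);

    // Two of the candidate implementations that would be benchmarked offline.
    static double sum_simple(const double* p, std::size_t n) {       // fine for tiny n
        return std::accumulate(p, p + n, 0.0);
    }
    static double sum_pairwise(const double* p, std::size_t n) {     // better for large n
        if (n <= 64) return sum_simple(p, n);
        return sum_pairwise(p, n / 2) + sum_pairwise(p + n / 2, n - n / 2);
    }

    // One slot per possible bit width of n. In the real framework each slot
    // held whichever candidate won the benchmark on the target machine.
    static const std::array<kernel, 65> table = [] {
        std::array<kernel, 65> t{};
        for (std::size_t b = 0; b < t.size(); ++b)
            t[b] = (b < 12) ? sum_simple : sum_pairwise;   // placeholder choices
        return t;
    }();

    double sum(const double* p, std::size_t n) {
        return table[std::bit_width(n)](p, n);             // log-scaled bucket via clz
    }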
Such practices are almost never optimal, but they're pretty easy to implement, and the results are near-optimal for almost all inputs. In the tradeoff between human and computer performance, I think it's a nice option.
This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
Measuring/profiling just means observing the system you want to optimize in a systematic way. You certainly won't be very effective at optimizing anything if you don't observe it.
Theoretical calculations means you've formed a model of what's happening and you're devising a plan to optimize against that model. But of course, a useful model needs to represent the significant aspects of your system (and a good model should exclude most insignificant ones). Failing to observe your system means your model could be bad -- focused on insignificant aspects and missing significant ones -- and you'd never know.
Measuring doesn't mean don't think. Measuring and thinking are two different things. You need to do them both to optimize effectively.
The fact remains that most projects that build small, trivial, modular prototypes first will ultimately know which paths are viable before painting themselves into a corner algorithmically.
Best of luck =3
You do profiling because it's WAY too easy to get obsessed about theoretical problems when a simple measurement will show you the actual problems.
You do the math on the actual problem location, not a method with O(n!) which only gets called with n=3.
You still have to look at the entire call stack when profiling (which means thinking about the overarching algorithm).
If a problem area is so intuitively obvious, why would you introduce the problem in the first place? In reality, performance optimizations are usually needed where you least expect them. Which means that you can't get there intuitively. Hence, the suggestion of using profiling to help track down where the problem is instead.
Pretty much everyone I know will throw down an O(n^2) algorithm or whatever in their first pass and replace it with something more thought out once they have the time to think deeply about it.
If you're fretting about optimization at every stage of development, you're really doing it wrong. This is precisely the type of early optimization that everyone and their dog will tell you is bad. You should focus first on making your program work and laying out the logic and data flows. Optimization does not matter until the program is almost finished.
Bring in a performance expert and their intuition can quickly identify many performance improvements. The tough part for the business is: when and where do you pay for the experience of performant code, and when is it good enough to leave alone?
Imagine:

    function F() {
      for (i = 0; i < 10; i++) { A(); B(); C(); }
    }
If we profile this code, we might find out, e.g. B takes the majority of the time--let's say 90%. So you spend hours, days, weeks, making B 2X faster. Great. Now you removed 45% of execution time. But the loop in the outer function F is just a few instructions, it is not "hot"--it won't show up in profiles except for ones that capture stacks.
If you're just stuck in the weeds optimizing hot functions that show up in profiles, it's possible to completely overlook F. That loop might be completely redundant, causing 10X the workload by repeatedly computing A, B, and C, which may not need to be recomputed at all.
There are bazillions of examples like this. Say you find out that a function is super, super hot. But it's just a simple function. There are calls to it all over the code. You can't make it any faster. Instead you need to figure out how to not call it at all, e.g. by caching or rethinking the whole algorithm.
> How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
This happens more than you think. Understanding how the system works in enough detail and also at a high level to formulate a plan is in short supply. Jumping in and hacking in things, like a cache or something, is surprisingly common.
I don't think I've ever used a profiler that couldn't report you were in F() here. One that only captures your innermost functions really doesn't seem that useful, for exactly the reasons you give.
One of my best examples of this, I had a function reported as still taking 10% of cumulative run time, after I’d tweaked it as much as I could. But I’d set up a benchmark that called a code path a deterministic number of times and this function was getting called twice as much as it should. I found two sibling methods asking the same question and rearranged them so the answer came as an argument and nixed the duplicate call. I reran the benchmark and instead of getting a reduction of 5% (10/2), I got 20%. That was all memory pressure.
The worst memory pressure I ever fixed I saw a 10x improvement by removing one duplicate call. Now, there was a quadratic part of that call but it was a small enough n that I expected 3x and hoped for 4x, and was as shocked as anyone when it went from 30s to 3s with one refactor.
But in the real world, code bases are massive, and it is hard to predict when worlds collide. Most things don't matter until they do. So measuring is the way to go, I believe.
There’s so much noise at that point that even people who would usually catch problems start to miss them.
The usual response to this is, “well, you can turn caching off to do profiling,” but that’s incorrect, because once people know they can get a value from the cache, they stop passing it on the stack. So your function that calls A() three times when it should have called it twice? You now find it’s being called ten times.
And the usual response to that is, “well it’s free now so who cares?” Except it’s not free. Every cache miss now either costs you multiple, or much more complex cache bookkeeping which is more overhead, and every hit resets the MRU data on that entry making it more likely that other elements get evicted.
For instance in NodeJS concurrent fetches for the same resource often go into a promise cache, but now the context of the closure for the promise is captured in the cache, and it doesn’t take much to confuse v8 into keeping a bunch of data in scope that’s not actually reachable. I’ve had to fix that a few times. Hundreds of megabytes in one case because it kept an entire request handler in scope.
But it's also very easy to mislead yourself that way; many "optimizations" might do much less than you think. So you should avoid implementing more complex or harder-to-understand code just because you think it is faster, but otherwise I'd certainly try to write faster code by default in areas I know well enough to judge that.
Profilers fill in gaps in our mental model of code execution but they are not a substitute for it. Computer behavior is largely knowable from first principles, albeit requiring a considerable degree of expertise and detail. For some people in some contexts, the profiler adds little information to what they already understand. I know a few people that do almost all of their optimization work using pencil and paper with great effectiveness and precision. Not something I would recommend for most software engineers but not unreasonable at the limit either.
If you optimize infrequently, or haven't profiled code like this recently, or haven't profiled this specific codebase recently, then your mental model is probably due a refresh.
Once you “find” a candidate change you measure it to see if what you did made things worse and you put it back if it did, or maybe you try it in combination with other changes to see if its value is complementary.
But people fuck up all the time reading the initial telemetry, which is often where I come in. I get tired of hearing people say, “we’ve done all we can, look how flat this chart is,” and hand someone my beer. You won’t find all of the good candidates in the percentage of run time list. That’s not all the data that’s there, and not every change that works needs to even be supported by the initial data. It only needs to be supported by the delta afterward.
> This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
I believe the mantra “intuition doesn’t work, profile your code” is understood to mean “don't rely on your intuition alone; profile your work.” It's a warning that when it comes to performance optimization, intuition is often unreliable, so you need to supplement your mental model with actual data if you want to avoid big mistakes.
I don't know why the author of the blog post is interpreting the mantra literally, as if it claims intuition is obsoleted by profiling.
Profiling to find suboptimal code is perfectly fine. Then you need to figure out how to fix it. Many people don't understand how performance optimization works, so they blindly add caching, improve constant time by invoking more low-level methods, etc. This obviously doesn't work, yet intuitively (to those people, anyway) it should produce good results.
That's why the mantra exists: don't trust your intuition, don't believe it when it says these changes improve performance, instead measure that performance and only apply changes that work. This is also perfectly fine, but this is a double-edged sword, and I've seen people go too far in this direction.
For example, they refuse to make any changes that don't immediately improve performance according to the profiler. If they modify one computation and performance decreases, they abandon that path altogether. They treat optimization as a game with a dense fog of war, and they refuse to apply deductive reasoning (and, of course, intuition) to make changes that, according to the profiler at least, are not immediately rewarding.
Adding caches or switching to lower-level calls is definitely something learned, and I wouldn't call it "intuitive".
What I think you are referring to is that sometimes, simply reading and understanding the code can tell you where the problem really is — still, my experience is that you want to measure the before and after to at least identify the general area you should be looking at, and more often than not, I could figure out what needs optimizing and how without having to get detailed profiles.
I did end up being surprised a few times, but it was mostly due to external code like buggy library implementations that didn't do the things they promised (eg. async library really synchronizing everything with a badly placed mutex).
At the same time, it's wrong to focus on a single case or a single profile (executing one code path in one set of circumstances), but what you are describing simply sounds like bad engineering — the fact that you can misuse the tool does not make the tool bad, it's still the engineer wielding it who's at fault.
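For what it's worth, the "measure the before and after" step doesn't have to be heavyweight. A crude sketch (names made up; a real run would repeat the measurement many times and/or use a profiler):

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // The two versions being compared (placeholders for the real code).
    double current_impl(const std::vector<double>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0);
    }
    double candidate_impl(const std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x;
        return s;
    }

    // Time one call and report milliseconds.
    template <class F>
    double time_ms(F&& f, const std::vector<double>& v) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = f(v);   // keep the result alive
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    int main() {
        std::vector<double> v(10'000'000, 1.0);
        std::printf("current:   %.2f ms\n", time_ms(current_impl, v));
        std::printf("candidate: %.2f ms\n", time_ms(candidate_impl, v));
    }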
Caching and lower level calls are generic solutions that work everywhere, but are also generally the last and worst way to optimise (thus why they need such careful analysis since they so often have the opposite effect).
Better is to optimise the algorithms, where actual profiling is a lesser factor. Not a zero factor of course, as a rule of thumb it’s probably still wise to test your improvements, but if you manage to delete an n^2 loop then you really don’t need a profiler to tell you that you’ve made things better.
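A stock illustration (not from the comment above) of the kind of algorithmic change that needs no profiler to justify: an O(n^2) duplicate scan replaced by a single pass over a hash set.

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    // Before: quadratic nested loop.
    bool has_duplicate_slow(const std::vector<int>& v) {
        for (std::size_t i = 0; i < v.size(); ++i)
            for (std::size_t j = i + 1; j < v.size(); ++j)
                if (v[i] == v[j]) return true;
        return false;
    }

    // After: single pass, one hash lookup per element.
    bool has_duplicate(const std::vector<int>& v) {
        std::unordered_set<int> seen;
        for (int x : v)
            if (!seen.insert(x).second) return true;   // insert fails => already seen
        return false;
    }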
Eg: indexing or partitioning a database table may appear to make things slower if you don't have both a representative amount of data and representative query patterns when you're measuring the change.
You should still measure your changes, but sometimes you need to be careful about measuring them in the right way, and possibly simulating a future context (eg: more scale) before drawing a conclusion.
Intuition about how the context will evolve, and what effect that might have on the tradeoffs of different approaches, is helpful.
It's really just debugging and troubleshooting, but with a different goal in mind.
In my experience, most webapps can fix so many low-hanging performance issues by shaping the API to match how it's used in the client. That removes so much mapping and combining of data all over the place.
The fact that optimization work is sometimes tricky, or requires some pre-thinking, or is even (gasp) counter-intuitive is such a hilarious way to say "this is hard". That's just the table-stakes starting point for so, so, so much work.
Edit: Heck, even deciding whether to prioritize optimization work or feature work is usually a harder problem than the actual optimization work itself.
I've since noticed this many times. Optimizing is like meditation to me. It's very mechanical (measure), with a sprinkle of creative work (once you know what is slow, it's quite obvious how to make it faster, but just challenging enough to be engaging), and it has a very nice tight feedback loop: Something is slow. I make a change. Now it's fast. Next.
Optimizing is my happy place.
Unfortunately I had to give up when I was still 10 times slower than the reference lol
Code cleanup in general is the same for me, but it’s really hard to justify putting much time into that when running your own company solo.
Then there was a system for benchmarking the software as a whole on a wide variety of architectures, including NUMA. With lots of plots and statistics.
Usually you’d eventually end up at a point where the improvements are below the noise floor, or they help on some systems and cause regressions on others. The rule was usually “no regressions”.
VTune for multithreading optimization. Built a fiber-based, lock-free system for efficient scheduling.
I fundamentally disagree. First comes building a mental model of what happens (a kind of analysis stage), and then comparing it to a mental model of how it should or could work, or producing a more efficient algorithm/way of accomplishing the target task.
When people try to brute-force it ("let's try this or that") without having the model in mind, that is frequently a waste, and even when/if it produces some improvement, there is no understanding of or guarantee about which use cases the improvement covers, whether it regresses other use cases, or whether it will still be there after we push new features/fixes/etc.
My father had a PhD in Operations Research/Industrial Engineering.
Grock was a famous Swiss clown and once the highest paid entertainer in Europe.
The non-clown word you’re looking for is grok and Robert Heinlein coined it in 1961.
That doesn't seem to be what this post is talking about. It seems to be talking about well-worn areas where people are trying to improve the state of the art. An example that illustrates it for me is DeepMind's AlphaTensor finding a better way to multiply matrices[0] in 2022. It wasn't a brute-force solution, but the scale of it makes it appear so.
> On 4x4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen’s algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5x5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4x4 algorithm, and separately tweaked Deepmind's 96-step 5x5 algorithm down to 95 steps in mod 2 arithmetic and to 97[24] in normal arithmetic.[25] Some algorithms were completely new: for example, (4, 5, 5) was improved to 76 steps from a baseline of 80 in both normal and mod 2 arithmetic.
This to me shows that direct profiling and observation wouldn't have led to the optimization. Improvements needed a sort-of but not actually brute-force effort of many people trying, but also being clever with their attempts.
[0] https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...
"Even Apple’s LLVM fork lacks scheduling annotations for Apple Silicon. How am I supposed to write efficient code when Apple doesn’t bother to tune their own compiler?"
In addition to its public fork of LLVM, Apple also keeps a very private fork where (one assumes) they keep all the really juicy stuff.
I agree that it is frustrating not to have the data, but don't confuse "I don't have the data" with "no one has the data".
In C++, you can achieve performance using the often-denigrated standard template library, if you pay attention to the documented performance requirements for given function calls. But even there it's not the panacea it could be, because it often provides an amortized cost while handwaving away the access pattern (for example: std::unordered_map is great in algorithmic-cost theory and terrible in memory access patterns).
What's the algorithmic cost (big-O) of this function call? What's the memory footprint of that function call? Worse: is that documented big-O estimated across a contiguous dataset, discontiguous paged datasets, or does it have to dereference pointers? What's the network cost of a given call? It's hard to know if you don't know that `write()` to a socket could incur 40 or 400 bytes across a network, and don't know whether that network is 1mbps, 10mbps, 1gbit, or localhost, etc; and with how much latency.
For example, when I hand-rolled some x86_64 SIMD instructions to analyze some DNA, I found the Intel Intrinsics Guide [0] 1000% helpful because many of the instructions detailed what sort of performance to expect on specific processor architectures (or, at least, in general). If you read the Intel Instruction Set Reference [1], a lot of similar performance information can be found. The performance I achieved was approximately at the theoretical bottleneck between CPU and main RAM; getting better performance would have required a complete algorithm change, which would have been worthy of a Ph.D. paper.
Of course, sometimes even low level hardware can have performance-killing bugs.
[0]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
[1]: https://www.intel.com/content/www/us/en/developer/articles/t...
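Not the DNA kernel itself, just an illustration (with made-up names) of the kind of AVX2 intrinsics that guide documents: counting positions where two byte strings agree, 32 bytes at a time. Assumes AVX2 and C++20, e.g. compile with -mavx2.

    #include <immintrin.h>   // AVX2 intrinsics
    #include <bit>           // std::popcount (C++20)
    #include <cstddef>
    #include <cstdint>

    std::size_t count_matches(const char* a, const char* b, std::size_t n) {
        std::size_t matches = 0, i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
            __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
            __m256i eq = _mm256_cmpeq_epi8(va, vb);   // 0xFF in lanes where bytes match
            matches += std::popcount(static_cast<std::uint32_t>(_mm256_movemask_epi8(eq)));
        }
        for (; i < n; ++i) matches += (a[i] == b[i]); // scalar tail
        return matches;
    }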
For example: do you compress data sent over the network? What level of compression do you use? Changing the data format can affect the optimal compression level, and vice versa: using a higher compression level means you can keep the underlying data simpler. For example, you can replace deserialization with zero-cost casts. But that might mean you spend more memory. Do you have that much memory? If you do, would giving that memory to the database for use as cache be better? And so on.
The individual choices are simple, but they compound and affect the overall performance in unpredictable ways. The only way to be sure you aren't missing something obvious is to check all, or at least most combinations.
For them, the systematic way to optimize goes: profile your code, and then apply domain knowledge of the product to optimize the hotspots with these common techniques:
Don't do repeated work. (If you have an expensive invariant calculation in a loop or function call, move it out, to a higher level of the program where it can be done once.)
Save your work. (Caching and memoization.)
Do less work. (Alter the product requirements to use less computationally-intensive techniques in cases where users won't notice the difference.)
Do work when the user isn't looking. (Move computations to background tasks, apply concurrency, perform async requests.)
If all else fails, call in a performance engineer like the author to micro-optimize your building blocks.
You can often get speedups of 3-6 orders of magnitude by applying these techniques, simply because the original code is so brain-dead. Performance engineers like the author tend to work on code that has already been tightly optimized, and so there is less low-hanging fruit to pick.
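As a concrete illustration of the "save your work" item above, here's a minimal memoization sketch (expensive_lookup is a stand-in, not anything from the comment; not thread-safe):

    #include <cstdint>
    #include <unordered_map>

    // Stand-in for a pure but slow computation.
    std::uint64_t expensive_lookup(std::uint64_t key) {
        std::uint64_t h = key;
        for (int i = 0; i < 1'000'000; ++i) h = h * 6364136223846793005ULL + 1;
        return h;
    }

    // Memoized wrapper: repeated calls with the same key hit the cache.
    std::uint64_t cached_lookup(std::uint64_t key) {
        static std::unordered_map<std::uint64_t, std::uint64_t> cache;
        auto [it, inserted] = cache.try_emplace(key, 0);
        if (inserted) it->second = expensive_lookup(key);
        return it->second;
    }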