One question: how portable are performance benefits from tweaks to memory alignment? Is this something where going beyond rough heuristics (sequential access = good, order-of-magnitude cache sizes, etc.) requires knowing exactly what platform you're targeting?
And yes, once all the usual tricks have been exhausted, the next step is looking at the cache and cache-line sizes of the exact CPU you’re targeting and dividing the workload into units that fit inside the lowest-level cache possible, so the data is always hot. And if you’re into this stuff, then you’re probably aware of cache-oblivious algorithms[0] as well :)
Personally, I've almost never needed to go too far into platform-specific code (except SIMD, of course); doing all the stuff in the post gets you 99% of the way there.
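For anyone who wants to see what "dividing the workload into units that fit inside the cache" can look like, here's a rough sketch using a blocked matrix transpose. The function name `transpose_blocked` and the tile size `kTile` are made up for illustration, and the 32 KiB L1 figure is an assumption about a typical CPU, not something measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tile edge in elements (assumed value, tune per target CPU).
// 32 x 32 floats = 4 KiB per tile, so a source tile plus a destination tile
// should fit comfortably in an assumed 32 KiB L1 data cache.
constexpr std::size_t kTile = 32;

// Transpose an n x n row-major matrix one tile at a time, so each tile's
// reads and writes stay within a small, cache-resident working set.
void transpose_blocked(const std::vector<float>& src,
                       std::vector<float>& dst,
                       std::size_t n) {
    assert(src.size() == n * n && dst.size() == n * n);
    for (std::size_t ib = 0; ib < n; ib += kTile) {
        for (std::size_t jb = 0; jb < n; jb += kTile) {
            // Clamp the tile at the matrix edge when n is not a multiple of kTile.
            const std::size_t i_end = std::min(ib + kTile, n);
            const std::size_t j_end = std::min(jb + kTile, n);
            for (std::size_t i = ib; i < i_end; ++i) {
                for (std::size_t j = jb; j < j_end; ++j) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

The result is identical to a naive double loop; only the traversal order changes. Whether it actually helps (and which tile size is best) depends on the target, so measure before and after.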
And yeah, C# is criminally underrated, I might write a post comparing high-perf code in C++ and C# in the future.
[0]: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
Great article by the way.
>> C# has an awesome situation in here with its support for value types (ref structs), slices (spans), stack allocation, SIMD intrinsics (including AVX512!). You can even go bare-metal and GC-free with bflat.
There's been a really solid effort by the maintainers to improve performance in C#, especially with regard to keeping stuff off the heap. I think it's a fantastic language for doing backends in. It's unfortunate that one of the big language users, Unity, has not yet updated to the modern runtime.
The former, hotspot, is a visualiser for perf data, and it copes fine with truly massive files that made perfetto and similar tools just bog down. It also supports visualising off-CPU profiles ("why is my program slow but not CPU bound?").
The latter, heaptrack, is a tool with very similar UI to hotspot (I think the two tools share some code even) to profile malloc/free (or new/delete). Sometimes the performance issue is as simple as not reusing a buffer but reallocating it over and over inside a loop. And sometimes you wonder where all the memory is going.
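A minimal sketch of the "reallocating a buffer inside a loop" pattern and its fix, in C++. Everything here (`count_a_upper`, `process_realloc`, `process_reuse`) is a hypothetical example, not code from any of the tools mentioned.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-item work: uppercase the input into a scratch buffer,
// then count the 'A' characters in it.
static std::size_t count_a_upper(const std::string& in, std::string& scratch) {
    scratch.clear();  // mainstream implementations keep the capacity here
    for (unsigned char c : in) {
        scratch.push_back(static_cast<char>(std::toupper(c)));
    }
    std::size_t n = 0;
    for (char c : scratch) n += (c == 'A');
    return n;
}

// Slow variant: a new buffer is constructed and destroyed on every iteration,
// so any input longer than the small-string buffer costs a heap allocation.
std::size_t process_realloc(const std::vector<std::string>& inputs) {
    std::size_t total = 0;
    for (const auto& in : inputs) {
        std::string scratch;
        total += count_a_upper(in, scratch);
    }
    return total;
}

// Faster variant: the buffer is hoisted out of the loop and reused, so it
// only reallocates when an input is longer than anything seen so far.
std::size_t process_reuse(const std::vector<std::string>& inputs) {
    std::string scratch;
    std::size_t total = 0;
    for (const auto& in : inputs) {
        total += count_a_upper(in, scratch);
    }
    return total;
}
```

Both variants return the same result; the difference only shows up as allocation counts in a profiler such as heaptrack.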
jmole•2mo ago
It sounds like the “worst case” here is that the program is already optimized.
bofersen•2mo ago
What I wanted to say was that a spiky profile provides a clear path to optimizing a piece of code, whereas a flat profile usually means there are more fundamental issues (inefficient memory management, pointer chasing all over the place, convoluted object system, etc.).
saghm•2mo ago
bofersen•2mo ago
Narishma•2mo ago
lmm•2mo ago
hansvm•2mo ago
It often happens for good reasons. Features get added over time, there are some scars from a mocking framework, simpler (faster) solutions don't quite work because they're supporting X which supports Y which supports Z (dead code, but nobody noticed), people use full datetime handling when they mean to access performance counters, the complexity of the thing means that you blow your branch prediction cache size budget, etc.
The solution is to deeply understand the problem (lots of techniques, but this comment isn't a blog post) and come up with a solution, like a ground-up rewrite of some or all of the offending section.