Bytes before FLOPS: your algorithm is (mostly) fine, your data isn't

https://www.bitsdraumar.is/bytes-before-flops/

31•bofersen•1d ago

Comments

jmole•1d ago

> worst case scenario being the flat profile where program time is roughly evenly distributed

It sounds like the “worst case” here is that the program is already optimized.

bofersen•1d ago

Author here, kinda sorta. I should've been a bit more specific than that. You can have a profile showing a function taking up 99% of the time, but when you dive into it, there's no clear bottleneck. But just because there's no bottleneck, that doesn't mean it's optimized; and vice versa, a well-optimized program can have a bottleneck that's already been cycle-squeezed to hell and back.

What I wanted to say was that a spiky profile provides a clear path to optimizing a piece of code, whereas a flat profile usually means there are more fundamental issues (inefficient memory management, pointer chasing all over the place, convoluted object system, etc.).

saghm•20h ago

It sounds like a flat profile essentially is a local optimum, compared to cases where there's a path "upwards" along a hill to some place more optimal that doesn't require completely changing your strategy.

bofersen•15h ago

That's actually a good observation, yeah. It's often the case that you dig deeper and deeper and find some incomprehensible spaghetti and just say "fuck it, I'll just do what I can here, should be enough".

colonCapitalDee•1d ago

Great article. Can confirm, writing performance focused C# is fun. It's great having the convenience of async, LINQ, and GC for writing non-hot path "control plane" code, then pulling out Vector<T>, Span<T>, and so on for the hot path.

One question, how portable are performance benefits from tweaks to memory alignment? Is this something where going beyond rough heuristics (sequential access = good, order of magnitude cache sizes, etc) requires knowing exactly what platform you're targeting?

bofersen•1d ago

Author here. First of all, thanks for the compliment! It’s tough to get myself to write these days, so any motivation is appreciated.

And yes, once all the usual tricks have been exhausted, the next step is looking at the cache/cache line sizes of the exact CPU you’re targeting and dividing the workload into units that fit inside the (lowest level possible) cache, so it’s always hot. And if you’re into this stuff, then you’re probably aware of cache-oblivious algorithms[0] as well :)

Personally, I almost never had the need to go too far into platform-specific code (except SIMD, of course); doing all the stuff in the post gets you 99% of the way there.

And yeah, C# is criminally underrated, I might write a post comparing high-perf code in C++ and C# in the future.

[0]: https://en.wikipedia.org/wiki/Cache-oblivious_algorithm
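For anyone who wants to see what "dividing the workload into units that fit inside the cache" can look like, here is a minimal sketch using a tiled matrix transpose. The example itself, the function name `transpose_blocked`, and the tile size `kTile` are assumptions picked for illustration, not something taken from the post; the right tile size depends on the cache sizes of the CPU you target.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile edge. A 32x32 tile of floats is 4 KiB, so one source tile
// plus one destination tile fit in a typical 32 KiB L1 data cache.
// This is an assumption: measure and tune it for your actual target.
constexpr std::size_t kTile = 32;

// Transposes a row-major n x n matrix one tile at a time, so the strided
// writes into dst stay within a small, cache-resident region.
void transpose_blocked(const std::vector<float>& src,
                       std::vector<float>& dst,
                       std::size_t n) {
    for (std::size_t i0 = 0; i0 < n; i0 += kTile) {
        for (std::size_t j0 = 0; j0 < n; j0 += kTile) {
            // Clamp the tile bounds so n does not have to be a multiple of kTile.
            const std::size_t i1 = std::min(i0 + kTile, n);
            const std::size_t j1 = std::min(j0 + kTile, n);
            for (std::size_t i = i0; i < i1; ++i) {
                for (std::size_t j = j0; j < j1; ++j) {
                    dst[j * n + i] = src[i * n + j];
                }
            }
        }
    }
}
```

A cache-oblivious version would get a similar effect by recursively splitting the matrix instead of hard-coding `kTile`, which is the appeal when you don't know the target's cache sizes up front.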

VorpalWay•2m ago

To the list of profiling tools I would like to add KDAB Hotspot and KDE Heaptrack.

The former, hotspot, is a visualiser for perf data, and it deals ok with truly massive files that made perfetto and similar just bog down. It also supports visualising off-CPU profiles ("why is my program slow but not CPU bound?").

The latter, heaptrack, is a tool with very similar UI to hotspot (I think the two tools share some code even) to profile malloc/free (or new/delete). Sometimes the performance issue is as simple as not reusing a buffer but reallocating it over and over inside a loop. And sometimes you wonder where all the memory is going.
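A small sketch of the "reallocating a buffer over and over inside a loop" pattern and the usual fix, in case it helps. Both function names and the toy workload (doubling values and summing them) are made up for illustration; the point is only where `scratch` is declared.

```cpp
#include <vector>

// The scratch buffer starts empty on every iteration, so it has to allocate
// (and regrow) again for each batch. This is the kind of churn an allocation
// profiler tends to surface.
long long sum_scaled_naive(const std::vector<std::vector<int>>& batches) {
    long long total = 0;
    for (const auto& batch : batches) {
        std::vector<int> scratch;
        for (int v : batch) scratch.push_back(v * 2);
        for (int v : scratch) total += v;
    }
    return total;
}

// Same result, but the buffer lives outside the loop. clear() drops the
// contents and keeps the capacity, so once the buffer has grown to the
// largest batch size no further allocations are needed.
long long sum_scaled_reused(const std::vector<std::vector<int>>& batches) {
    long long total = 0;
    std::vector<int> scratch;
    for (const auto& batch : batches) {
        scratch.clear();
        for (int v : batch) scratch.push_back(v * 2);
        for (int v : scratch) total += v;
    }
    return total;
}
```

Whether this matters in practice depends on the allocator and the batch sizes, so it is worth confirming with a profiler before and after.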
