A winner tree, or its counterpart the loser tree (for min instead of max), is a very simple binary tree: the bottom layer is all your values (2^N of them), the layer above that holds the larger of each pair of values, the layer above that the larger of each pair of pairs, and so on, until you get to the top of the tree, which contains the largest value. Updating a value is trivial: you overwrite the relevant one at the bottom, and then run exactly log2(N) max operations upward until you hit the root. Inserting and deleting may, of course, be more complicated.
At best you were off by one, but in a performance context you'd want to assign that extra slot to a super-root of ±inf in order to save log2(n) extra range checks per heap-up, no?
I guess it's not as important when you're explicitly willing to pay the piper and run the full log2(n) iterations just for the sake of branch predictability.
Thanks for the algo though; before today I never would've thought about doing a branchless downheap with `i = (i << 1) + (h[i << 1] != h[i])`.
Am I the only one with an HTTPS problem here? Bad domain: `art.mahajana.net` instead of 0x80.pl.
dooglius•5mo ago
simonask•5mo ago
madars•5mo ago
That way, roughly half the nodes are leaves (requiring no work), a quarter are at the second-to-last level (requiring at most 1 comparison/swap), an eighth are at the third-to-last level (requiring at most 2 comparisons/swaps), and so on. Summing up 1·n/4 + 2·n/8 + ... gets you O(n) total complexity.
See https://en.wikipedia.org/wiki/Heapsort?useskin=monobook#Vari...
gotoeleven•5mo ago
simonask•5mo ago
mananaysiempre•5mo ago
camel-cdr•5mo ago
dragontamer•5mo ago
I've seen vectorized SIMD heap implementations. They are a different data structure entirely: each node holds 16 sorted items, and you merge-sort your working set (16 items) against a node, which generalizes the push-down and push-up operations of a heap. (Sort both lists together; the top 16 form the node and stay put; the bottom 16 get pushed down, and you recurse.)
This is very, very similar to a classic heap, but you need that many items per operation so that there's enough work to keep the SIMD lanes busy.
Sorting is, after all, a highly parallelizable operation (bitonic sort, MergePath, etc.) and is a good basis for generalizing single-threaded data structures into multithreaded or SIMD versions.