That is, if you're only ever going to test for membership in the set. If you need metadata then ... you could store that in a packed array and use the population count of the bit-vector before the lookup bit as an index into it. For each word of bits, store the accumulated population count of the words before it to speed up the lookup. Modern CPUs are memory-bound, so I don't think SIMD would help much over using 64-bit words. For 4096 bits / 64, that's 64 extra counts; since the cumulative count can reach 4096 they need 16 bits each, i.e. 128 additional bytes.
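For the curious, a minimal sketch of that layout (names are illustrative, not from the article; assumes C++20 for std::popcount):

    #include <bit>       // std::popcount (C++20)
    #include <cstdint>

    // 4096-bit membership set with a rank index: rank[i] caches the
    // population count of all words before word i, so a lookup does
    // one popcount instead of scanning the whole bit-vector.
    struct RankBitset {
        uint64_t bits[64];   // 4096 membership bits
        uint16_t rank[64];   // cumulative popcount of words 0..i-1

        // Index into the packed metadata array if `pos` is a member, else -1.
        int lookup(unsigned pos) const {
            unsigned word = pos / 64, bit = pos % 64;
            if (!((bits[word] >> bit) & 1)) return -1;
            uint64_t below = bits[word] & ((uint64_t{1} << bit) - 1);
            return rank[word] + std::popcount(below);
        }
    };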
So instead of just comparing the middle value, we'd compare the one at the 1/3 point, and if that turns out to be too low then we compare the value at the 2/3 point.
Unfortunately, although each step leaves only 2/3 as much search space as a binary-search step does (1/3 vs 1/2), we do 3/2 as many comparisons per step (one comparison 50% of the time, two the other 50%), so it averages out to equivalence.
EDIT: See zamadatix reply, it's actually 5/3 as many comparisons because 2/3 of the time you have to do 2.
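For concreteness, the step being discussed looks something like this (a sketch, not the article's code; assumes a sorted int array):

    #include <cstddef>

    // Ternary step: probe the 1/3 point, and only if the target is not
    // below it, probe the 2/3 point. Per the correction above, that's
    // 1 comparison 1/3 of the time and 2 comparisons the other 2/3.
    int ternary_search(const int* a, size_t n, int target) {
        size_t lo = 0, hi = n;                // half-open [lo, hi)
        while (hi - lo > 2) {
            size_t third = (hi - lo) / 3;
            size_t m1 = lo + third;           // 1/3 point
            size_t m2 = lo + 2 * third;       // 2/3 point
            if (target < a[m1])       hi = m1;
            else if (target < a[m2])  { lo = m1; hi = m2; }
            else                      lo = m2;
        }
        for (size_t i = lo; i < hi; ++i)      // final 1-2 candidates
            if (a[i] == target) return (int)i;
        return -1;
    }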
True, but is there some particular reason that you want to minimize the number of comparisons rather than have a faster run time? Daniel doesn't overly emphasize it, but as he mentions in the article: "The net result might generate a few more instructions but the number of instructions is likely not the limiting factor."
The main thing this article shows is that (at least sometimes, on some processors) a quad search is faster than a binary search _despite_ the fact that it performs theoretically unnecessary comparisons. While some computer scientists might scoff, I'd bet heavily that an optimized ternary search could also frequently outperform binary search.
Obviously real-world performance depends on other things as well.
Everyone else lives in the real world and cares about real-world performance, and modern computer scientists know that.
- First third: 1 comparison
- Second third: 2 comparisons
- Third third: 2 comparisons
(1+2+2)/3 = 5/3 average comparisons. I think the gap starts with assuming it's 50% of the time, because it feels like "either you do 1 comparison or 2", but really there's a 1/3 chance it's settled by the 1st comparison and a 2/3 chance it takes 2 comparisons.
This lets us show ternary is worse in total average comparisons, just barely: (5/3) * log_3(n) ≈ 1.052 * log_2(n).
In other words, you end up with fewer levels but doing more comparisons (on average) to get to the end. This is true for all searches of this type (w/ a few general assumptions like the values being searched for are evenly distributed and the cost of the operations is idealized - which is where the main article comes in) where the number of splits is > 2.
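If you want to see the 5/3 fall out empirically, here is a quick sketch that counts probes for the ternary routine above versus a plain binary search, querying every key of a sorted array once:

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<int> a(n);
        for (int i = 0; i < n; ++i) a[i] = i;

        long long bin = 0, ter = 0;
        for (int t = 0; t < n; ++t) {
            for (int lo = 0, hi = n; hi - lo > 1; ) {  // binary
                int mid = lo + (hi - lo) / 2;
                ++bin;
                if (t < a[mid]) hi = mid; else lo = mid;
            }
            for (int lo = 0, hi = n; hi - lo > 2; ) {  // ternary
                int third = (hi - lo) / 3;
                int m1 = lo + third, m2 = lo + 2 * third;
                ++ter;
                if (t < a[m1]) { hi = m1; continue; }
                ++ter;
                if (t < a[m2]) { lo = m1; hi = m2; } else lo = m2;
            }
        }
        // Expect roughly log2(n) = 20 vs (5/3)*log3(n) ≈ 21.0
        printf("binary:  %.2f cmp/query\n", (double)bin / n);
        printf("ternary: %.2f cmp/query\n", (double)ter / n);
    }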
Not for the ternary version of the binary search algorithm, because what you had is just a skewed binary search, not an actual ternary search. Because comparisons are binary by nature, any search algorithm built on comparisons is a type of binary search, and any choice other than the middle element is less efficient in terms of algorithmic complexity, though in some conditions it may be better on real hardware. For an actual ternary search, you need a 3-way comparison as an elementary operation.
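For what it's worth, C++20's <=> gives exactly such a 3-way primitive, where a single probe has three outcomes rather than two (an illustrative sketch, not anyone's benchmark code):

    #include <compare>   // C++20 three-way comparison
    #include <cstddef>

    // With a 3-way compare as the elementary operation, each probe
    // yields less, greater, or equal, so equality ends the search.
    long search3way(const int* a, long lo, long hi, int key) {  // [lo, hi)
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            auto c = key <=> a[mid];
            if (c < 0)      hi = mid;
            else if (c > 0) lo = mid + 1;
            else            return mid;   // third outcome: exact hit
        }
        return -1;
    }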
Where it gets interesting is when you consider "radix efficiency" [1], for which the best choice is 3, the natural number closest to e. And it is relevant to tree search, that is, a ternary tree may be better than a binary tree.
Would there be any value in using SIMD to check the whole cache line that you fetch for exact matches during the narrowing phase, for an early out? (Something like the sketch below.)
In general I love this article, it took what I’ve often wondered about and did a perfect job exploring with useful ablation studies.
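Presumably something like this is what's meant (an illustrative AVX2 sketch, assuming 16 int32 keys per 64-byte cache line and a 32-byte-aligned pointer):

    #include <cstdint>
    #include <immintrin.h>   // AVX2 intrinsics

    // Test all 16 keys in one cache line for an exact match at once;
    // returns the lane index of the match, or -1.
    int match_in_line(const int32_t* line, int32_t key) {
        __m256i k   = _mm256_set1_epi32(key);
        __m256i eq0 = _mm256_cmpeq_epi32(_mm256_load_si256((const __m256i*)line), k);
        __m256i eq1 = _mm256_cmpeq_epi32(_mm256_load_si256((const __m256i*)(line + 8)), k);
        unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq0))
                      | ((unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq1)) << 8);
        return mask ? __builtin_ctz(mask) : -1;   // GCC/Clang builtin
    }

Whether the extra vector work beats simply finishing the scalar narrowing is exactly the kind of thing the article's benchmarks would have to answer.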
Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.
Essentially, this means that all loops are already unrolled from the processor's point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is grabbing data from memory (or from cache in the "warm cache" examples), the real game is getting the processor to issue the requests for the data you will eventually need as far in advance as possible, so you don't have to wait as long for it to arrive.
The difference with quad search (or any split wider than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction), you prefetch all the possible cases but to less depth. This way you are guaranteed to have issued the prefetch you will eventually need, while spending slightly less of your bandwidth budget on data that will never be used on the actual execution path (see the sketch below).
As others are pointing out, "number of comparisons" is an almost useless metric when comparing search algorithms if your goal is predicting real-world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.
Obviously, this isn't changing the big-Oh complexity, but in the "real world", still nice to see a 2-4x speedup.
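The prefetching idea reads roughly like this in code (a sketch using GCC/Clang's __builtin_prefetch; gains only show up when the array is actually out of cache):

    #include <cstddef>

    // At each step, prefetch both possible next midpoints before the
    // branch resolves, so whichever load is actually needed is
    // already in flight by the time we get there.
    size_t binary_search_prefetch(const int* a, size_t n, int key) {
        size_t lo = 0, hi = n;
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            __builtin_prefetch(&a[lo + (mid - lo) / 2]);   // left child's midpoint
            __builtin_prefetch(&a[mid + (hi - mid) / 2]);  // right child's midpoint
            if (key < a[mid]) hi = mid; else lo = mid;
        }
        return lo;
    }

A quad search spends the same budget one level at a time: issue the loads for all of the current step's split points rather than speculating two levels deep down one path.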
As soon as you move away from either (or both) of these assumptions then there are likely to be many tweaks you can make to get better performance.
What the classical algorithms do offer is a very good starting point for developing a more efficient solution once you know more about the specific shape of the data or the quirks/features of a specific CPU.
When you start to get to the pointy end of optimising things, you generally end up looking at how the data is stored and accessed in memory, and whether any changes you make to improve this hurt things further down the line. In a job many, many years ago, I remember someone who spent way too long optimising a specific part of some code, only to find that the overall application ran slower: the optimisations meant that a lot more information needed later on had been evicted from the cache.
(This is probably just another way of stating Rob Pike's 5th rule of programming which was itself a restatement of something by Fred Brooks in _The Mythical Man Month_. Ref: https://www.cs.unc.edu/~stotts/COMP590-059-f24/robsrules.htm...)