That is, if you're only ever going to test for membership in the set. If you need metadata then ... you could store that in a packed array and use the population count of the bit-vector before the lookup bit as an index into it. For each word of bits, store the accumulated population count of the words before it to speed up the lookup. Modern CPUs are memory-bound, so I don't think SIMD would help much over using 64-bit words. For 4096 bits / 64, that's 64 extra counts; since the cumulative count can reach 4096 they need 16 bits each, i.e. 128 additional bytes.
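For the curious, a minimal sketch of that layout (names are illustrative, not from the article; assumes C++20 for std::popcount):

    #include <bit>       // std::popcount (C++20)
    #include <cstdint>

    // 4096-bit membership set with a rank index: rank[i] caches the
    // population count of all words before word i, so a lookup does
    // one popcount instead of scanning the whole bit-vector.
    struct RankBitset {
        uint64_t bits[64];   // 4096 membership bits
        uint16_t rank[64];   // cumulative popcount of words 0..i-1

        // Index into the packed metadata array if `pos` is a member, else -1.
        int lookup(unsigned pos) const {
            unsigned word = pos / 64, bit = pos % 64;
            if (!((bits[word] >> bit) & 1)) return -1;
            uint64_t below = bits[word] & ((uint64_t{1} << bit) - 1);
            return rank[word] + std::popcount(below);
        }
    };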
So instead of just comparing the middle value, we'd compare the one at the 1/3 point, and if that turns out to be too low then we compare the value at the 2/3 point.
Unfortunately, although each step leaves only 2/3 as much search space as a binary-search step does (1/3 vs 1/2), we do 3/2 as many comparisons per step (one comparison 50% of the time, two the other 50%), so it averages out to equivalence.
EDIT: See zamadatix reply, it's actually 5/3 as many comparisons because 2/3 of the time you have to do 2.
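For concreteness, the step being discussed looks something like this (a sketch, not the article's code; assumes a sorted int array):

    #include <cstddef>

    // Ternary step: probe the 1/3 point, and only if the target is not
    // below it, probe the 2/3 point. Per the correction above, that's
    // 1 comparison 1/3 of the time and 2 comparisons the other 2/3.
    int ternary_search(const int* a, size_t n, int target) {
        size_t lo = 0, hi = n;                // half-open [lo, hi)
        while (hi - lo > 2) {
            size_t third = (hi - lo) / 3;
            size_t m1 = lo + third;           // 1/3 point
            size_t m2 = lo + 2 * third;       // 2/3 point
            if (target < a[m1])       hi = m1;
            else if (target < a[m2])  { lo = m1; hi = m2; }
            else                      lo = m2;
        }
        for (size_t i = lo; i < hi; ++i)      // final 1-2 candidates
            if (a[i] == target) return (int)i;
        return -1;
    }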
True, but is there some particular reason that you want to minimize the number of comparisons rather than have a faster run time? Daniel doesn't overly emphasize it, but as he mentions in the article: "The net result might generate a few more instructions but the number of instructions is likely not the limiting factor."
The main thing this article shows is that (at least sometimes, on some processors) a quad search is faster than a binary search _despite_ the fact that it performs theoretically unnecessary comparisons. While some computer scientists might scoff, I'd bet heavily that an optimized ternary search could also frequently outperform binary search.
Obviously real-world performance depends on other things as well.
Everyone else lives in the real world and cares about real-world performance, and modern computer scientists know that.
- First third: 1 comparison
- Second third: 2 comparisons
- Third third: 2 comparisons
(1+2+2)/3 = 5/3 average comparisons. I think the gap starts with assuming it's 50% of the time, because it feels like "either you do 1 comparison or 2", but really there's a 1/3 chance it's settled by the 1st comparison and a 2/3 chance it takes 2 comparisons.
This lets us show ternary is worse in total average comparisons, just barely: (5/3) * log_3(n) ≈ 1.052 * log_2(n).
In other words, you end up with fewer levels but doing more comparisons (on average) to get to the end. This is true for all searches of this type (w/ a few general assumptions like the values being searched for are evenly distributed and the cost of the operations is idealized - which is where the main article comes in) where the number of splits is > 2.
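If you want to see the 5/3 fall out empirically, here is a quick sketch that counts probes for the ternary routine above versus a plain binary search, querying every key of a sorted array once:

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<int> a(n);
        for (int i = 0; i < n; ++i) a[i] = i;

        long long bin = 0, ter = 0;
        for (int t = 0; t < n; ++t) {
            for (int lo = 0, hi = n; hi - lo > 1; ) {  // binary
                int mid = lo + (hi - lo) / 2;
                ++bin;
                if (t < a[mid]) hi = mid; else lo = mid;
            }
            for (int lo = 0, hi = n; hi - lo > 2; ) {  // ternary
                int third = (hi - lo) / 3;
                int m1 = lo + third, m2 = lo + 2 * third;
                ++ter;
                if (t < a[m1]) { hi = m1; continue; }
                ++ter;
                if (t < a[m2]) { lo = m1; hi = m2; } else lo = m2;
            }
        }
        // Expect roughly log2(n) = 20 vs (5/3)*log3(n) ≈ 21.0
        printf("binary:  %.2f cmp/query\n", (double)bin / n);
        printf("ternary: %.2f cmp/query\n", (double)ter / n);
    }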
Not for the ternary version of the binary search algorithm, because what you had is just a skewed binary search, not an actual ternary search. Because comparisons are binary by nature, any search algorithm built on comparisons is a type of binary search, and any choice other than the middle element is less efficient in terms of algorithmic complexity, though in some conditions it may be better on real hardware. For an actual ternary search, you need a 3-way comparison as an elementary operation.
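For what it's worth, C++20's <=> gives exactly such a 3-way primitive, where a single probe has three outcomes rather than two (an illustrative sketch, not anyone's benchmark code):

    #include <compare>   // C++20 three-way comparison
    #include <cstddef>

    // With a 3-way compare as the elementary operation, each probe
    // yields less, greater, or equal, so equality ends the search.
    long search3way(const int* a, long lo, long hi, int key) {  // [lo, hi)
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            auto c = key <=> a[mid];
            if (c < 0)      hi = mid;
            else if (c > 0) lo = mid + 1;
            else            return mid;   // third outcome: exact hit
        }
        return -1;
    }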
Where it gets interesting is when you consider "radix efficiency" [1], for which the best choice is 3, the natural number closest to e. And it is relevant to tree search, that is, a ternary tree may be better than a binary tree.
Would there be any value in using SIMD to check the whole cache line that you fetch for exact matches during the narrowing phase, for an early out? (Something like the sketch below.)
In general I love this article, it took what I’ve often wondered about and did a perfect job exploring with useful ablation studies.
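Presumably something like this is what's meant (an illustrative AVX2 sketch, assuming 16 int32 keys per 64-byte cache line and a 32-byte-aligned pointer):

    #include <cstdint>
    #include <immintrin.h>   // AVX2 intrinsics

    // Test all 16 keys in one cache line for an exact match at once;
    // returns the lane index of the match, or -1.
    int match_in_line(const int32_t* line, int32_t key) {
        __m256i k   = _mm256_set1_epi32(key);
        __m256i eq0 = _mm256_cmpeq_epi32(_mm256_load_si256((const __m256i*)line), k);
        __m256i eq1 = _mm256_cmpeq_epi32(_mm256_load_si256((const __m256i*)(line + 8)), k);
        unsigned mask = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq0))
                      | ((unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(eq1)) << 8);
        return mask ? __builtin_ctz(mask) : -1;   // GCC/Clang builtin
    }

Whether the extra vector work beats simply finishing the scalar narrowing is exactly the kind of thing the article's benchmarks would have to answer.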
Regardless, both kinds of search are O(log N) with different constants. The constants don’t matter so much in algorithms class but in the real world they can matter a lot.
Essentially, this means that all loops are already unrolled from the processor's point of view, minus a tiny bit of overhead for the loop itself that can often be ignored. Since in binary search the main cost is grabbing data from memory (or from cache in the "warm cache" examples), the real game is getting the processor to issue the requests for the data you will eventually need as far in advance as possible, so you don't have to wait as long for it to arrive.
The difference with quad search (or any split wider than binary) is that instead of taking one side of each branch (and thus prefetching deeply in one direction), you prefetch all the possible cases but to less depth. This way you are guaranteed to have issued the prefetch you will eventually need, while spending slightly less of your bandwidth budget on data that will never be used on the actual execution path (see the sketch below).
As others are pointing out, "number of comparisons" is an almost useless metric when comparing search algorithms if your goal is predicting real-world performance. The limiting factor is almost never the number of comparisons you can do. Instead, the potential for speedup depends on making maximal use of memory and cache bandwidth. So yes, you can view this as loop unrolling, but only if you consider how branching on modern processors works under the hood.
Obviously, this isn't changing the big-Oh complexity, but in the "real world", still nice to see a 2-4x speedup.
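The prefetching idea reads roughly like this in code (a sketch using GCC/Clang's __builtin_prefetch; gains only show up when the array is actually out of cache):

    #include <cstddef>

    // At each step, prefetch both possible next midpoints before the
    // branch resolves, so whichever load is actually needed is
    // already in flight by the time we get there.
    size_t binary_search_prefetch(const int* a, size_t n, int key) {
        size_t lo = 0, hi = n;
        while (hi - lo > 1) {
            size_t mid = lo + (hi - lo) / 2;
            __builtin_prefetch(&a[lo + (mid - lo) / 2]);   // left child's midpoint
            __builtin_prefetch(&a[mid + (hi - mid) / 2]);  // right child's midpoint
            if (key < a[mid]) hi = mid; else lo = mid;
        }
        return lo;
    }

A quad search spends the same budget one level at a time: issue the loads for all of the current step's split points rather than speculating two levels deep down one path.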
As soon as you move away from either (or both) of these assumptions then there are likely to be many tweaks you can make to get better performance.
What the classical algorithms do offer is a very good starting point for developing a more efficient solution once you know more about the specific shape of the data or the quirks/features of a specific CPU.
When you start to get to the pointy end of optimising things, you generally end up looking at how the data is stored and accessed in memory, and whether any changes you make to improve this hurt things further down the line. In a job many, many years ago, I remember someone who spent way too long optimising a specific part of some code, only to find that the overall application ran slower: the optimisations meant that a lot more information needed later on had been evicted from the cache.
(This is probably just another way of stating Rob Pike's 5th rule of programming which was itself a restatement of something by Fred Brooks in _The Mythical Man Month_. Ref: https://www.cs.unc.edu/~stotts/COMP590-059-f24/robsrules.htm...)