This looks like an interesting potential alternative to GxHash. GxHash is nice, but sadly I found the AES intrinsics to have somewhat high latency on AMD's Zen 2/3 cores, making it a loss on short strings (the M4 Max sure had me fooled until I measured on our server hw: it has way better AES latency despite retiring more AES operations!).
For latency-optimized hashing, rapidhash is currently the fastest algorithm I know of with consistently good hash quality. Hash quality is outside the cryptographic range, which is expected for a latency-optimized design, but close enough that you can at least make a comparison. Should be pretty portable too. It is what I recommend for general-purpose small-key hashing.
For throughput-optimized hashing, rotohash[0] is currently the fastest algorithm I know of, with single-core throughput saturating CPU cache bandwidth. Hash quality is at the low-end of the cryptographic range, about the same as MD5. While it is built on portable 32-bit primitives, it clearly targets modern microarchitectures — the Zen 5 performance metrics, though not listed, are crazy. This is brand new, part of a research project to efficiently and robustly checksum I/O at current hardware line rates (100s of GB/s).
Any algorithm can be faster if you completely disregard quality. In fairness, the quality bar has risen quickly as we’ve become better at designing these algorithms and finding the defects. State-of-the-art algorithms a decade ago are all considered obviously and hopelessly broken today. Designing a competitive hash function now requires an extremely high degree of skill and expertise. It was much easier to come into it as a hobbyist and produce an acceptable result ten years ago.
Generally my feeling is that at these speeds, designing for branch-predictability for short strings might be more important than absolute throughput.
For bulk hashing there are better functions, but XXH3, GxHash, etc. are not among them, being both slower and weaker than the state-of-the-art. Hash functions that can’t pass quality tests are not serious contenders.
How much better is it than old standbys like hashpjw, FNV1A, or djb's favorite¹, and how? It's apparently 350 lines of code instead of 1–10 lines of code. That seems like an important way that it's worse. And it comes with a copyright license that requires you to add it to your documentation, so forgetting to do that creates a risk of copyright infringement. So, how big is the benefit?
How much faster is it hashing a small key like the string "__umulh"? Is it actually slower? (For those needing an introduction to hash tables, covering why that question makes sense, Chris Wellons is always excellent: https://nullprogram.com/blog/2025/01/19/)
Why does it have an array commented "default secret parameters" in the public GitHub repo? Are we supposed to change those? Do they provide some kind of security, and if so, what kind, against what threat model?
The readme and https://scholar.google.com/scholar?hl=es&as_sdt=0%2C5&q=rapi... are no help here. https://github.com/rurban/smhasher?tab=readme-ov-file#summar... explains some of the questions at greater length but of course not rapidhash's particular answers.
______
¹ while (s[i]) h = h * 33 + s[i++]; about 11 machine-code instructions in its full-fledged form: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
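For anyone who doesn't want to chase the godbolt link, here is a minimal self-contained C version of that loop (the classic djb2-style multiply-by-33 hash, with its traditional 5381 seed):

    #include <stdint.h>

    /* The hash from the footnote: h = h*33 + c, one byte at a time. */
    static uint32_t djb2(const char *s)
    {
        uint32_t h = 5381;                    /* traditional djb2 seed */
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }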
The rest of this comment is kind of ELI5 (I hope!) not because I think @kragen needs that, but because maybe other readers will benefit and I know he leans towards long form text, and @jandrewrogers basically alludes to this stuff in his comment opening this sub-thread. EDIT: as do a few other posters here like at least @curiouscoding https://news.ycombinator.com/item?id=44012740
______
As for benefit - that always depends on "Compared to what??" in your use cases, what CPU you expect things to run on, and so much else. As just a few examples - last I knew, the SMHasher performance tests blocked inlining, which already makes them unrealistic comparisons, esp. with really tiny code footprint ones like your FNV1a example. Your higher level table structure might also be caching the hash code for table resizes/as a comparison prefix for lookups. So, how much you have to hash at all may depend on doing that, on the predictability of table sizes, etc. How effective the comparison prefix idea even is can depend upon how common long prefixes are in your key sets, which may be literally unknown - could be pathnames from `find` with tons of prefix similarity or none at all.
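To sketch that "cache the hash code" idea (the struct and function names below are made up for illustration, not taken from any particular table implementation): the stored hash doubles as a cheap comparison prefix, so most non-matching slots are rejected without ever touching the key bytes, and a resize never needs to re-hash.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical open-addressing slot that caches the full 64-bit hash. */
    typedef struct {
        uint64_t    hash;   /* cached hash code, reused on resize          */
        const char *key;    /* pointer to the key bytes                    */
        size_t      len;
        void       *value;
    } slot_t;

    /* Compare the cached hash (and length) first; only fall back to
       memcmp when those already match. */
    static int slot_matches(const slot_t *s, uint64_t h,
                            const char *key, size_t len)
    {
        return s->hash == h && s->len == len && memcmp(s->key, key, len) == 0;
    }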
Furthermore, the entire culture of presentation of performance in this space is sadly "rate" - the reciprocal of what is easi(er) to reason about, namely "time". Why? "Time is additive". Even the headline here is "GB/s". Cycles/byte or ns/byte is just easier. You want to present a time equation, perHashCost + perByteCost*nBytes, and both unknown coefficients, if inferred from real times on a modern time-sharing OS, probably have significant statistical variation. If there is a highly non-random distribution of kicks/competition/interference, it will usually be easier to see in the measured times. The reciprocal transformation distorts everything. Anyway, a time cost equation (like Knuth v1-3 of days of yore!) also lets you easily compute the break-even string length where the per-byte rate starts to dominate fixed overheads, namely nBytes_breakEven = perHashCost/perByteCost.
It also lets you estimate an answer to your question of "when does this matter". If your strings are typically under 32B (I believe an early-days limit on identifiers for many C compilers, but someone can correct me if that's wrong) and the perByteCost is 0.1 ns/byte (10 GB/s - just a round number), then the break-even point (below which the fixed per-hash cost matters more) is 0.1*32 = 3.2 ns, or roughly 10..16 cpu cycles on most modern deployments. Said the other way, if your perHash overhead is 10..16 cycles and your hash throughput is 10 GB/s, then you had better hope your strings are longer than 32 bytes on average or you've probably made a bad choice.
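As a tiny worked version of that break-even arithmetic (the coefficients are the round numbers above, assumed rather than measured for any particular hash):

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative cost model: t(n) = perHashCost + perByteCost * n. */
        double perHashCost = 3.2;   /* ns of fixed per-call overhead (assumed) */
        double perByteCost = 0.1;   /* ns per byte, i.e. 10 GB/s (assumed)     */

        /* Length at which the per-byte term equals the fixed term. */
        double breakEven = perHashCost / perByteCost;

        printf("t(16B)  = %.1f ns\n", perHashCost + perByteCost * 16);
        printf("t(256B) = %.1f ns\n", perHashCost + perByteCost * 256);
        printf("break-even length = %.0f bytes\n", breakEven);  /* 32 */
        return 0;
    }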
One way in which an equation fails is that it does oversimplify the performance profile even of this simple problem. That profile often has plateaus/cliffs at cache line size and integer size boundaries, but rates fare no better there. Even so, if you are presenting a bunch of hashes, a table sortable by perHash and perByte costs would be the best starting point for most evaluations where the slope is "just an average"/effective value. A plot is best, though.
Even on a fairly wimpy microcontroller the hash I linked above only has about four instructions of startup/finalize overhead, even without inlining: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename... (I modified it slightly because GCC was puzzlingly not reducing my array indexing to pointer arithmetic...)
I continue to think that the world would be less confusing if people didn't go around citing GB/s for hash-table hashes, but rather (perHash±err, perByte±err) on CPU xyz. That cuts right to the heart of the matter, and then folks like jandrewrogers don't have to be vague with "small" to then inspire questions like yours. (Of course, CPUs really vary a lot, too.)
I also think when people measure these fast string hashes for strings > L1 CPU cache they're mostly measuring a melange of IO/memory system not the hash computation - maybe sometimes unwittingly. L1 CPU caches can deliver 100s of GB/s after all. 1 64B cache line per cycle is 64B/0.2ns = 320 GB/s. Intel's Skylake could do that a decade ago. The memory wall can be a real bear.
If you're hashing a 32KiB string you're probably still in L1 cache, but your performance is likely close to this 71GB/s number, which would be about 460ns.
Those byte-at-a-time hashers really are pretty slow unless your strings are very short (like under 3..6 bytes - which, hey, they might be!). Further down there are some plots/graphs with a little more compiler/cpu variety but fewer hash functions. Kind of a long thread, but maybe informative. Anyway, pictures are worth thousands of words and all that.
EDIT: and FWIW, there's even some preliminary analysis of the new rapidhash that TFA is about towards the bottom, but it may be an older, slower version.
The specifics are somewhat hardware specific but the performance crossover point as a function of key size between latency-optimized algorithms and throughput-optimized algorithms is currently in the 400-500 byte range on modern server hardware, assuming state-of-the-art quality.
That's the real rub. Hashed data structures were a revelation in the early 90's when we all started using perl. None of this work really changes any of that calculus. It's not like we're all going to be chatting at the water cooler about "Remember the days when all we had was djb2?! How the world has changed!"
Still interesting, but unless you're directly involved in a heavily benchmarked runtime, it's probably not worth trying to integrate.
Classic algorithms are severely broken for many trivial key sets and therefore are not general purpose. However, they may work great for your specific key set and this is something you can test for if you know the properties of your key set. The point of a general purpose hash function is that you can use it with unknown or unknowable key distributions with some degree of assurance that the distribution of hashes is unlikely to be biased in some way.
Something else to keep in mind is that those old algorithms are really slow. Modern algorithms explicitly take advantage of the parallelism and width of modern CPU cores. Larger internal hash state allows higher throughput and quality but also increases latency and I-cache footprint, so there is a tension in the design.
As I mentioned in another post, if the properties of your key set are well-defined and bounded, it can often make sense to design a hash function that is only high-performance and high-quality for that key set. Most people don’t know how to do this effectively so they mostly stick with general-purpose hash functions, one latency-optimized and one throughput-optimized.
The best cryptographic hash functions, like SHA-256, are very close to a random oracle. Weaker cryptographic hash functions, like MD5, are significantly further away, and stronger non-cryptographic hash functions, like rapidhash, are further still. Most non-cryptographic hash functions are so far away from a random oracle in some of the tests as to be effectively broken. XXH3 is an example of this; anyone can try it for themselves using e.g. SMHasher3. For these “broken” cases, the hashes quite obviously don’t have a random distribution.
There are a couple important points to take away from this:
Hash quality is a function of the key set. If you know the properties of the key set ahead of time, which is often the case, then you can construct a hash function with narrowly optimal quality and performance even though it is totally broken for many other key sets that are out of scope. These almost always perform better than a general purpose hash function. This is a good performance engineering trick.
For a hash function to claim suitability for “general purpose”, it needs to be consistently close to a random oracle across all key sizes and key distributions. If it breaks for one key set, it is broken for many others that you didn’t test. Being closer to a random oracle across tens of thousands of tests does not imply that the hash function is not broken (even SHA-256 may be broken) but it greatly improves the probability that you will never accidentally generate a key set that creates a severely biased hash distribution.
Most non-cryptographic hash functions sacrifice quality to improve latency. Low latency and a small I-cache footprint are very important for small-key hashing performance in real systems. The Pareto frontier of quality+latency is still quite an active area of research. Any hash function that was designed 10 years ago will be very far from that frontier; the state-of-the-art is fast-moving.
There is a similar Pareto frontier for quality+throughput e.g. hashes used as block storage checksums.
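To make the first point above concrete, here is a toy sketch (my own, not taken from any of the hash functions discussed) that assumes the key set consists of 64-bit integer keys that are already reasonably well distributed. A single multiplicative hash is narrowly excellent for that key set and useless as a general-purpose hash:

    #include <stdint.h>
    #include <stddef.h>

    /* Key-set-specific hash: keys are assumed to be 64-bit integers.
       One multiply by an odd 64-bit constant (~2^64/phi) mixes the key
       bits into the high bits of the product. */
    static inline uint64_t int_key_hash(uint64_t key)
    {
        return key * 0x9E3779B97F4A7C15ull;
    }

    /* Bucket index for a power-of-two table: take the top 'bits' bits. */
    static inline size_t bucket_of(uint64_t key, unsigned bits)
    {
        return (size_t)(int_key_hash(key) >> (64 - bits));
    }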
As for why XXH3 is popular, it is incessant self-promotion and marketing. I recently went down a rabbit hole of analyzing quality measures for many hash functions to see how far the state-of-the-art is from the cryptographic hash functions, so the data is pretty fresh. The great news is that non-cryptographic hash quality is now qualitatively far beyond what it was even a few years ago while at the same time improving performance.
what is a hash fn that is more reliable and faster than xxh3?
Impressive!
GxHash is twice as fast and passes all SMHasher tests...
Dylan16807•1mo ago
You'll probably end up fitting entirely inside the reorder buffer plus a sequential stream from memory, with the actual caches almost irrelevant.
Dylan16807•1mo ago
Yes, I'm talking about the data cache.
> But there are ways to make more or less efficient use of the data cache.
How?
You need to touch every byte of input, and nothing should be faster than going right through from start to end.
benwills•1mo ago
This is a minor example, but since you asked...
https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L6432
That's an example of a fair number of accumulators that are stored as XXHash goes through its input buffer.
Many modern hash functions store more state/accumulators than they used to. Previous generations of hash functions would often just have one or two accumulators and run through the data. Many modern hash functions might even store multiple wider SIMD variables for better mixing.
And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
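A stripped-down sketch of that multi-accumulator pattern (this is not XXH3's actual algorithm; the seeds, constant, and final combine below are placeholders purely to illustrate keeping several independent lanes of state while streaming through the buffer):

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Four independent accumulators let the CPU overlap the multiply
       chains; a single accumulator would serialize on its own latency. */
    static uint64_t multi_lane_mix(const uint8_t *p, size_t len)
    {
        uint64_t acc[4] = {1, 2, 3, 4};           /* placeholder seeds    */
        const uint64_t k = 0x9E3779B97F4A7C15ull; /* placeholder constant */

        while (len >= 32) {
            for (int i = 0; i < 4; i++) {
                uint64_t lane;
                memcpy(&lane, p + 8 * i, 8);      /* unaligned-safe read  */
                acc[i] = (acc[i] ^ lane) * k;
            }
            p   += 32;
            len -= 32;
        }
        /* Tail handling and a proper final avalanche are omitted. */
        return acc[0] ^ acc[1] ^ acc[2] ^ acc[3];
    }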
Dylan16807•1mo ago
And there are 150+ registers in the actual chip.
But my argument is more that there isn't really an efficient or inefficient way to use L1. So unless you have an enormous amount of state, the question is moot. And if you have so much state you're spilling to L2, that's not when you worry about good or bad cache use, that's a weird bloat problem.
Retr0id•1mo ago
For example, a common optimisation for CRC implementation involves lookup tables. If your lookup table exceeds L1 then you're going to have a bad time. If your lookup table fits exactly in L1, your benchmarks might look great, while real-world performance suffers (because your other code wants to be making use of L1, too).
I'd imagine rapidhash avoids lookup tables but the question of cache-friendliness is still pertinent.
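For reference, a minimal sketch of the single-table CRC idea mentioned above (a 256-entry, 1 KiB table for the standard reflected CRC-32 polynomial; the faster "slice-by-8" style variants use several KiB of tables, which is where the L1 pressure comes in):

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t crc_table[256];

    /* Build the 1 KiB lookup table for the reflected CRC-32 polynomial. */
    static void crc32_init(void)
    {
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
            crc_table[i] = c;
        }
    }

    /* One table lookup per input byte; every lookup hits the same 1 KiB. */
    static uint32_t crc32(const uint8_t *p, size_t len)
    {
        uint32_t c = 0xFFFFFFFFu;
        while (len--)
            c = crc_table[(c ^ *p++) & 0xFF] ^ (c >> 8);
        return c ^ 0xFFFFFFFFu;
    }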