This looks like an interesting potential alternative to GxHash. GxHash is nice, but I found AES intrinsics to have somewhat high latency on AMD's Zen 2/3 cores, making it a loss on short strings. (The M4 Max had me fooled until I measured on our server hardware; it has much better AES latency despite retiring more AES operations!)
Generally my feeling is that at these speeds, designing for branch-predictability for short strings might be more important than absolute throughput.
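One common trick in that direction (used by wyhash- and XXH3-style designs; this is a generic sketch, not taken from rapidhash itself): cover a whole range of short lengths with the same overlapping loads, so the per-length branching collapses into a single range check. For 1–3 bytes, three byte reads at fixed positions cover every case:

```c
#include <stdint.h>
#include <stddef.h>

// Branch-free read of a 1..3-byte key: first byte, middle byte, last
// byte. For len == 1 all three reads hit the same byte; for len == 2
// the middle and last reads overlap. Either way there is no branch on
// the exact length, only the caller's range check that len is in 1..3.
static uint32_t read_small(const uint8_t *p, size_t len) {
    return ((uint32_t)p[0] << 16)
         | ((uint32_t)p[len >> 1] << 8)
         |  (uint32_t)p[len - 1];
}
```

The same idea extends upward (e.g. two overlapping 4-byte loads for lengths 4–8), which is why these functions tend to branch on coarse length buckets rather than exact lengths.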
For bulk hashing there are better functions, but XXH3, GxHash, etc. are not among them, being both slower and weaker than the state of the art. Hash functions that can't pass quality tests are not serious contenders.
How much better is it than old standbys like hashpjw or djb's favorite¹, and how? It's apparently 350 lines of code instead of 1–10, which seems like an important way it's worse. And it comes with a copyright license that requires adding it to your documentation, so forgetting to do that creates a risk of copyright infringement.
How much faster is it hashing a small key like the string "__umulh"? Is it actually slower? Why does it have an array commented "default secret parameters" in the public GitHub repo? Are we supposed to change those?
The readme and https://scholar.google.com/scholar?hl=es&as_sdt=0%2C5&q=rapi... are no help here.
______
¹ while (s[i]) h = h * 33 + s[i++]; https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename...
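For reference, a self-contained version of that one-liner (this is djb2; the customary seed 5381 is assumed, since the footnote elides initialization):

```c
#include <stdint.h>

// djb2: Bernstein's h = h * 33 + c over each byte of a NUL-terminated
// string. The 5381 starting value is the traditional one; the footnote
// above leaves the initialization out.
static uint32_t djb2(const char *s) {
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (uint8_t)*s++;
    return h;
}
```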
wpollock•7h ago
wmf•6h ago
benwills•4h ago
Dylan16807•4h ago
You'll probably end up fitting entirely inside the reorder buffer plus a sequential stream from memory, with the actual caches almost irrelevant.
benwills•4h ago
Dylan16807•4h ago
Yes, I'm talking about the data cache.
> But there are ways to make more or less efficient use of the data cache.
How?
You need to touch every byte of input, and nothing should be faster than going right through from start to end.
benwills•3h ago
This is a minor example, but since you asked...
https://github.com/Cyan4973/xxHash/blob/dev/xxhash.h#L6432
That's an example of the fair number of accumulators XXHash carries as it runs through its input buffer.
Modern hash functions store more state/accumulators than they used to. Previous generations would often just run through the data with one or two accumulators; many modern functions even keep multiple wide SIMD variables live for better mixing.
And if you're storing enough state that it doesn't fit in your registers, the CPU will put it into the data cache.
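A toy illustration of why the extra accumulators help (the multipliers here are arbitrary odd constants for the sketch, not XXH3's real secret, and the final mix is elided): a single accumulator forces a serial xor-multiply dependency chain, while independent lanes let the CPU overlap the multiplies.

```c
#include <stdint.h>
#include <stddef.h>

// One accumulator: each step depends on the previous one, so
// throughput is bounded by the latency of the multiply chain.
static uint64_t one_acc(const uint64_t *v, size_t n) {
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = (h ^ v[i]) * 0x9E3779B97F4A7C15ULL;
    return h;
}

// Four independent chains: the CPU can keep several multiplies in
// flight at once. n is assumed to be a multiple of 4 for brevity.
static uint64_t four_acc(const uint64_t *v, size_t n) {
    uint64_t a = 0, b = 0, c = 0, d = 0;
    for (size_t i = 0; i < n; i += 4) {
        a = (a ^ v[i])     * 0x9E3779B97F4A7C15ULL;
        b = (b ^ v[i + 1]) * 0xC2B2AE3D27D4EB4FULL;
        c = (c ^ v[i + 2]) * 0x165667B19E3779F9ULL;
        d = (d ^ v[i + 3]) * 0x27D4EB2F165667C5ULL;
    }
    return a ^ b ^ c ^ d; // real functions do a stronger final mix
}
```

All of that state lives comfortably in registers; it's only when a design carries far more accumulators than the architectural register count that spills start hitting the data cache.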
Dylan16807•3h ago
And there are 150+ physical registers in the actual chip.
But my argument is more that there isn't really an efficient or inefficient way to use L1. So unless you have an enormous amount of state, the question is moot. And if you have so much state that you're spilling to L2, that's not a matter of good or bad cache use; that's a weird bloat problem.
Sesse__•1h ago