As an exercise for the reader:
There are exactly 32 symbols in ASCII:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Taking input in a register, uniquely hash them to the range 0-31.
Any other input values may return any number, but must not
trap or exhibit undefined behavior. The obvious approach of
"just make a 256-element array" isn't allowed.
This can be done in few enough cycles that you need to
carefully consider if there's any point to including:
loads (even hitting L1)
branches (even ones that are perfectly predicted)
multiplication (unless just using lea/add/shift)
I found that out-of-order only helps a little; it's
difficult to scatter-gather very wide in so few cycles.
Writing C code mostly works if you can persuade the compiler
not to emit an unwanted branch.
uint8_t classify(uint8_t c) {
    /* Bits 4-6 of c select one byte of the packed 64-bit constant; that byte
       is the offset which maps each run of symbols onto part of 0-31. */
    return c - (0x5F45452B2B210000ULL >> ((c & 0x70) >> 1));
}
The AND-then-shift sequence on the right is annoying and it feels like one should be able to avoid it, but I don't see how. Overall, this is more expensive than it looks: neither a full 64-bit constant nor a variable 64-bit shift is exactly free. So I'm probably missing something here.

func symbolHash(c byte) byte {
    x := (c - 1) >> 5
    y := int(c) + 0x1b150000>>(x<<3)
    return byte(y & 31)
}
But this doesn't generalize - it depends on the input consisting of a couple of runs of consecutive characters which can be made contiguous. (Extra credit: why is "c-1" necessary?)
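(For anyone who wants to double-check the bit tricks: an exhaustive loop over the symbol set is enough. A minimal C sketch using the classify() from above; the Go variant can be verified the same way.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

uint8_t classify(uint8_t c) {
    return c - (0x5F45452B2B210000ULL >> ((c & 0x70) >> 1));
}

int main(void) {
    const char *symbols = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
    uint32_t seen = 0;  /* one bit per output value 0-31 */
    for (size_t i = 0; i < strlen(symbols); i++) {
        uint8_t h = classify((uint8_t)symbols[i]);
        if (h > 31 || (seen & (1u << h))) {
            printf("collision or out-of-range value at '%c'\n", symbols[i]);
            return 1;
        }
        seen |= 1u << h;
    }
    printf("all 32 symbols map to distinct values in 0-31\n");
    return 0;
}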
(hash(c + 4339104) & 7) * 4 + (hash(c + 201375) & 3)

For a generic hash function (e.g. the Murmur hash, or the simple one here [1]):

long hash(long x) {
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    x = (x >>> 16) ^ x;
    return x;
}
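(Spelled out in C, for reference: the hash above uses Java long / >>> arithmetic, which for these small non-negative inputs behaves the same as uint64_t, so a faithful translation plus the combined expression looks roughly like this. Whether those two offsets really give all 32 values exactly once is easy to confirm with a loop like the one earlier in the thread.)

#include <stdint.h>

/* C translation of the Java-style mixer above */
static uint64_t hash64(uint64_t x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    return x;
}

/* Combine two hash calls on offset inputs into a 0-31 index,
   using the offsets quoted above: 3 bits from one call, 2 from the other. */
static unsigned symbol_index(uint8_t c) {
    return (unsigned)((hash64(c + 4339104u) & 7) * 4 + (hash64(c + 201375u) & 3));
}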
As described in the linked paper, the fastest way to find such an MPHF for larger sets is nowadays Consensus-RecSplit.

[1] https://stackoverflow.com/questions/664014/what-integer-hash...
unsigned h = _pext_u32(1264523 * x, 0x1020a01);
unsigned h = (1264523 * x & 0x1020a01) * 134746240 >> 27;
Alternatively:

unsigned h = (1639879 * x & 0x1038040) * 67375104L >> 32 & 31;
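(The middle variant is the usual "gather scattered bits with a multiply" trick; as far as I can tell the constants decompose like this - a commented C sketch of the same expression:)

#include <stdint.h>

/* Mask 0x1020a01 keeps bits 0, 9, 11, 17 and 24 of the mixed value.
   The multiplier 134746240 is (1<<27)|(1<<19)|(1<<12)|(1<<7), chosen so each
   kept bit lands at a distinct position among bits 27-31, with no carries
   from the partial products that fall below bit 27:
       bit  0 + 27 -> 27
       bit  9 + 19 -> 28
       bit 17 + 12 -> 29
       bit 11 + 19 -> 30
       bit 24 +  7 -> 31
   Shifting right by 27 then yields the five gathered bits: a branch-free
   stand-in for _pext_u32, just in a different bit order (which is irrelevant
   for a perfect hash). */
static unsigned gather5(uint32_t x) {
    return ((1264523u * x) & 0x1020a01u) * 134746240u >> 27;
}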
The multiplication by 67375104L can be a usual 32x32 IMUL where the high half goes to edx, though I'm not sure that conveys a benefit over a 64x64 IMUL in serial code these days.

Some PHFs can be pre-compiled, while most need to be deserialized at run-time. I worked on a pre-compiled pthash variant, but got stuck on C++ bugs.
There's a huge overhead in some of them for the ordered variants, to check for false positives.
For small n gperf is still the fastest by far. And it is pre-compiled only.
I'm also not sure what you mean by "check for false positives"... each entry not in the set returns a random value; maybe for you this is a "false positive"?

A typical solution for this is to add a "fingerprint" for each entry (just a hash value per entry) - that way there is still some false-positive probability, which may or may not be acceptable (it is typically considered acceptable if the hash function is cryptographically secure and the fingerprint size is big enough: e.g. 128 bits of the SHA-256 sum). Depending on the data, it may be faster to compare the value itself (e.g. in a parser).
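To make the fingerprint idea concrete, roughly like this (a sketch: mphf(), fingerprint_hash() and NUM_KEYS stand in for whatever MPHF implementation, hash and key set you actually use):

#include <stdbool.h>
#include <stdint.h>

#define NUM_KEYS 1024  /* placeholder size of the key set */

uint32_t mphf(const char *key);             /* placeholder: maps every known key to 0..NUM_KEYS-1 */
uint64_t fingerprint_hash(const char *key); /* placeholder: any decent hash of the whole key */

static uint64_t fingerprints[NUM_KEYS]; /* filled at build time with fingerprint_hash(key) per slot */
static int values[NUM_KEYS];            /* payload associated with each key */

/* Returns true and writes the value if the key is (very probably) in the set.
   An unknown key still lands in some slot, but its fingerprint mismatches
   with overwhelming probability. Storing the original key and comparing it
   directly is the exact-check alternative (e.g. in a parser). */
bool lookup(const char *key, int *out) {
    uint32_t slot = mphf(key);
    if (slot >= NUM_KEYS || fingerprints[slot] != fingerprint_hash(key))
        return false;
    *out = values[slot];
    return true;
}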
When comparing those MPHF query times, the startup time (the deserialization from disk) is 1000x higher than the actual query time. When you compile those data structures, the load time is instant. Also, memory usage is half as much.
Well... let's put it like this: in this survey (where I am one of the co-authors), "parsers" are not mentioned explicitly in the "Applications" section. They are a subset of "Hash Tables and Retrieval". There are many other uses: Approximate Membership, Databases, Bioinformatics, Text Indexing, Natural Language Processing. Yes, parsers are mentioned in "The Birth of Perfect Hashing". Maybe we can conclude that parsers are not the "main" use case nowadays.
> when you compile those data structures, the load time is instant.
Well, in this case I would recommend using a static lookup table in the form of source code. That way, it is available to the compiler at compile time and doesn't need to be loaded and parsed at runtime: it is available as a data structure at runtime. All modern MPHF implementations in this survey support this usage. But, yes, they are not optimized for this use case: you typically don't have millions of keywords in a programming language.
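(To illustrate what "a static lookup table in the form of source code" can look like: a made-up generator output along these lines gets compiled straight into the binary, so there is nothing to deserialize at startup. The names and constants here are purely hypothetical, not the output of any particular tool.)

/* keywords_phf.h - hypothetical generated header */
#include <stdint.h>

/* per-bucket displacement values found during construction */
static const uint16_t PHF_PILOTS[8] = { 3, 0, 11, 7, 2, 9, 0, 5 };

/* keys laid out in slot order, available as plain const data at runtime */
static const char *const PHF_KEYS[16] = {
    "if", "else", "while", "for", "return", "break", "continue", "switch",
    "case", "default", "do", "goto", "struct", "union", "enum", "typedef",
};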
I'm certainly not willing to load the keys and MPHF properties from disk at query time, as they are known in advance and can be compiled to C or C++ code in advance, which leads to an instant load time, as opposed to the costly deserialization times in all your tools.
Your deserialization space overhead is not accounted for, and neither are the storage costs for the false-positive check. It's rather academic, not practical.
And where else would one use a static lookup table to great effect? When I actively followed the SIGPLAN (programming languages) proceedings, one of the papers that really stood out for me was one about making interface/trait based function calls as fast as inheritance by using perfect hashing on all vtable entries.
tmostak•1d ago
You can use perfect hashes not only for the usual suspects of contiguous integer and dictionary-encoded string ranges, but also for use cases like binned numeric and date ranges (epoch seconds binned per year can use a perfect hash range of one bin per year for a very wide range of timestamps), and they can even handle arbitrary expressions if you propagate the ranges correctly.
Obviously you need a good "baseline" hash path to fall back to, but it's surprising how many real-world use cases you can profitably cover with perfect hashing.
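A rough sketch of the per-year binning case (the year range here is a made-up example; the point is that the "hash" is a direct bin computation with one slot per year and no empty or colliding slots for in-range input):

#include <time.h>

#define MIN_YEAR 1970  /* hypothetical lower bound of the binned range */
#define MAX_YEAR 2069
#define NUM_BINS (MAX_YEAR - MIN_YEAR + 1)

/* Map an epoch timestamp to its per-year bin; each year in range gets its own
   slot and no slot is left unused, which is the perfect-hash property.
   Returns -1 for out-of-range input, i.e. the "baseline" fallback path. */
static int year_bin(time_t epoch_seconds) {
    struct tm tm_utc;
    if (gmtime_r(&epoch_seconds, &tm_utc) == NULL)
        return -1;
    int year = tm_utc.tm_year + 1900;
    if (year < MIN_YEAR || year > MAX_YEAR)
        return -1;
    return year - MIN_YEAR;
}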
wahern•10h ago
Last time I tried gperf, 8-10 years ago, it took hours or even days to build a hash function CHD could handle in seconds or less. If someone's idea of perfect hash function construction cost is gperf (at least gperf circa 2015)... welcome to the future.
[1] See my implementation of CHD: https://25thandclement.com/~william/projects/phf.html