As an exercise for the reader:
There are exactly 32 symbols in ASCII:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Taking input in a register, uniquely hash them to the range 0-31.
Any other input values may return any number, but must not
trap or exhibit undefined behavior. The obvious approach of
"just make a 256-element array" isn't allowed.
This can be done in few enough cycles that you need to
carefully consider if there's any point to including:
loads (even hitting L1)
branches (even ones that are perfectly predicted)
multiplication (unless just using lea/add/shift)
I found that out-of-order only helps a little; it's
difficult to scatter-gather very wide in so few cycles.
Writing C code mostly works if you can persuade the compiler
not to emit an unwanted branch.
uint8_t classify(uint8_t c) {
    /* Bits 4-6 of c select one byte of the packed 64-bit constant; that byte
       is the offset which maps each run of symbols onto part of 0-31. */
    return c - (0x5F45452B2B210000ULL >> ((c & 0x70) >> 1));
}
The AND-then-shift sequence on the right is annoying and it feels like one should be able to avoid it, but I don't see how. Overall, this is more expensive than it looks: neither a full 64-bit constant nor a variable 64-bit shift is exactly free. So I'm probably missing something here.

func symbolHash(c byte) byte {
    x := (c - 1) >> 5
    y := int(c) + 0x1b150000>>(x<<3)
    return byte(y & 31)
}
But this doesn't generalize - it depends on the input consisting of a couple of runs of consecutive characters which can be made contiguous. (Extra credit: why is "c-1" necessary?)
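(For anyone who wants to double-check the bit tricks: an exhaustive loop over the symbol set is enough. A minimal C sketch using the classify() from above; the Go variant can be verified the same way.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

uint8_t classify(uint8_t c) {
    return c - (0x5F45452B2B210000ULL >> ((c & 0x70) >> 1));
}

int main(void) {
    const char *symbols = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
    uint32_t seen = 0;  /* one bit per output value 0-31 */
    for (size_t i = 0; i < strlen(symbols); i++) {
        uint8_t h = classify((uint8_t)symbols[i]);
        if (h > 31 || (seen & (1u << h))) {
            printf("collision or out-of-range value at '%c'\n", symbols[i]);
            return 1;
        }
        seen |= 1u << h;
    }
    printf("all 32 symbols map to distinct values in 0-31\n");
    return 0;
}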
(hash(c + 4339104) & 7) * 4 + (hash(c + 201375) & 3)

For a generic hash function (e.g. the Murmur hash, or the simple one here [1]):

long hash(long x) {
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    x = ((x >>> 16) ^ x) * 0x45d9f3b;
    x = (x >>> 16) ^ x;
    return x;
}
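(Spelled out in C, for reference: the hash above uses Java long / >>> arithmetic, which for these small non-negative inputs behaves the same as uint64_t, so a faithful translation plus the combined expression looks roughly like this. Whether those two offsets really give all 32 values exactly once is easy to confirm with a loop like the one earlier in the thread.)

#include <stdint.h>

/* C translation of the Java-style mixer above */
static uint64_t hash64(uint64_t x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    return x;
}

/* Combine two hash calls on offset inputs into a 0-31 index,
   using the offsets quoted above: 3 bits from one call, 2 from the other. */
static unsigned symbol_index(uint8_t c) {
    return (unsigned)((hash64(c + 4339104u) & 7) * 4 + (hash64(c + 201375u) & 3));
}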
As described in the linked paper, the fastest way to find such an MPHF for larger sets is nowadays Consensus-RecSplit.

[1] https://stackoverflow.com/questions/664014/what-integer-hash...
unsigned h = _pext_u32(1264523 * x, 0x1020a01);
unsigned h = (1264523 * x & 0x1020a01) * 134746240 >> 27;
Alternatively:

unsigned h = (1639879 * x & 0x1038040) * 67375104L >> 32 & 31;
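(The middle variant is the usual "gather scattered bits with a multiply" trick; as far as I can tell the constants decompose like this - a commented C sketch of the same expression:)

#include <stdint.h>

/* Mask 0x1020a01 keeps bits 0, 9, 11, 17 and 24 of the mixed value.
   The multiplier 134746240 is (1<<27)|(1<<19)|(1<<12)|(1<<7), chosen so each
   kept bit lands at a distinct position among bits 27-31, with no carries
   from the partial products that fall below bit 27:
       bit  0 + 27 -> 27
       bit  9 + 19 -> 28
       bit 17 + 12 -> 29
       bit 11 + 19 -> 30
       bit 24 +  7 -> 31
   Shifting right by 27 then yields the five gathered bits: a branch-free
   stand-in for _pext_u32, just in a different bit order (which is irrelevant
   for a perfect hash). */
static unsigned gather5(uint32_t x) {
    return ((1264523u * x) & 0x1020a01u) * 134746240u >> 27;
}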
The multiplication by 67375104L can be a usual 32x32 IMUL where the high half goes to edx, though I'm not sure that conveys a benefit over a 64x64 IMUL in serial code these days.

Some PHFs can be pre-compiled, while most need to be deserialized at run-time. I worked on a pre-compiled pthash variant, but got stuck on C++ bugs.
There's a huge overhead in some of them for the ordered variants, to check for false positives.
For small n gperf is still the fastest by far. And it is pre-compiled only.
I'm also not sure what you mean by "check for false positives"... each entry not in the set returns a random value; maybe for you this is a "false positive"?

A typical solution for this is to add a "fingerprint" for each entry (just a hash value per entry) - that way there is still some false-positive probability, which may or may not be acceptable (it is typically considered acceptable if the hash function is cryptographically secure and the fingerprint size is big enough: e.g. 128 bits of the SHA-256 sum). Depending on the data, it may be faster to compare the value itself (e.g. in a parser).
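To make the fingerprint idea concrete, roughly like this (a sketch: mphf(), fingerprint_hash() and NUM_KEYS stand in for whatever MPHF implementation, hash and key set you actually use):

#include <stdbool.h>
#include <stdint.h>

#define NUM_KEYS 1024  /* placeholder size of the key set */

uint32_t mphf(const char *key);             /* placeholder: maps every known key to 0..NUM_KEYS-1 */
uint64_t fingerprint_hash(const char *key); /* placeholder: any decent hash of the whole key */

static uint64_t fingerprints[NUM_KEYS]; /* filled at build time with fingerprint_hash(key) per slot */
static int values[NUM_KEYS];            /* payload associated with each key */

/* Returns true and writes the value if the key is (very probably) in the set.
   An unknown key still lands in some slot, but its fingerprint mismatches
   with overwhelming probability. Storing the original key and comparing it
   directly is the exact-check alternative (e.g. in a parser). */
bool lookup(const char *key, int *out) {
    uint32_t slot = mphf(key);
    if (slot >= NUM_KEYS || fingerprints[slot] != fingerprint_hash(key))
        return false;
    *out = values[slot];
    return true;
}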
When comparing those MPHF query times, the startup time (the deserialization from disk) is 1000x higher than the actual query time. When you compile those data structures, the load time is instant. Also, memory usage is half as much.
Well... let's put it like this: in this survey (where I am one of the co-authors), "parsers" are not mentioned explicitly in the "Applications" section. They are a subset of "Hash Tables and Retrieval". There are many other uses: Approximate Membership, Databases, Bioinformatics, Text Indexing, Natural Language Processing. Yes, parsers are mentioned in "The Birth of Perfect Hashing". Maybe we can conclude that parsers are not the "main" use case nowadays.
> when you compile those data structures, the load time is instant.
Well, in this case I would recommend using a static lookup table in the form of source code. That way, it is available to the compiler at compile time and doesn't need to be loaded and parsed at runtime: it is available as a data structure at runtime. All modern MPHF implementations in this survey support this usage. But, yes, they are not optimized for this use case: you typically don't have millions of keywords in a programming language.
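(To illustrate what "a static lookup table in the form of source code" can look like: a made-up generator output along these lines gets compiled straight into the binary, so there is nothing to deserialize at startup. The names and constants here are purely hypothetical, not the output of any particular tool.)

/* keywords_phf.h - hypothetical generated header */
#include <stdint.h>

/* per-bucket displacement values found during construction */
static const uint16_t PHF_PILOTS[8] = { 3, 0, 11, 7, 2, 9, 0, 5 };

/* keys laid out in slot order, available as plain const data at runtime */
static const char *const PHF_KEYS[16] = {
    "if", "else", "while", "for", "return", "break", "continue", "switch",
    "case", "default", "do", "goto", "struct", "union", "enum", "typedef",
};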
I'm certainly not willing to load the keys and MPHF properties from disk at query time, as they are known in advance and can be compiled to C or C++ code in advance, which leads to an instant load time, as opposed to the costly deserialization times in all your tools.
Your deserialization space overhead is not accounted for, and neither are the storage costs for the false-positive check. It's rather academic, not practical.
And where else would one use a static lookup table to great effect? When I actively followed the SIGPLAN (programming languages) proceedings, one of the papers that really stood out for me was one about making interface/trait based function calls as fast as inheritance by using perfect hashing on all vtable entries.
tmostak•1d ago
You can use perfect hashes not only for the usual suspects of contiguous integer and dictionary-encoded string ranges, but also for use cases like binned numeric and date ranges (epoch seconds binned per year can use a perfect hash range of one bin per year for a very wide range of timestamps), and they can even handle arbitrary expressions if you propagate the ranges correctly.
Obviously you need a good "baseline" hash path to fall back to, but it's surprising how many real-world use cases you can profitably cover with perfect hashing.
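A rough sketch of the per-year binning case (the year range here is a made-up example; the point is that the "hash" is a direct bin computation with one slot per year and no empty or colliding slots for in-range input):

#include <time.h>

#define MIN_YEAR 1970  /* hypothetical lower bound of the binned range */
#define MAX_YEAR 2069
#define NUM_BINS (MAX_YEAR - MIN_YEAR + 1)

/* Map an epoch timestamp to its per-year bin; each year in range gets its own
   slot and no slot is left unused, which is the perfect-hash property.
   Returns -1 for out-of-range input, i.e. the "baseline" fallback path. */
static int year_bin(time_t epoch_seconds) {
    struct tm tm_utc;
    if (gmtime_r(&epoch_seconds, &tm_utc) == NULL)
        return -1;
    int year = tm_utc.tm_year + 1900;
    if (year < MIN_YEAR || year > MAX_YEAR)
        return -1;
    return year - MIN_YEAR;
}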
wahern•10h ago
Last time I tried gperf, 8-10 years ago, it took hours or even days to build a hash function CHD could handle in seconds or less. If someone's idea of perfect hash function construction cost is gperf (at least gperf circa 2015)... welcome to the future.
[1] See my implementation of CHD: https://25thandclement.com/~william/projects/phf.html