frontpage.

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

https://www.wsj.com/tech/corning-fiber-optics-ai-e045ba3b
1•Brajeshwar•32s ago•0 comments

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

https://iocombats.com/blogs/micro-frontends-in-2026
1•ghazikhan205•2m ago•0 comments

Japanese rice is the most expensive in the world

https://www.cnn.com/2026/02/07/travel/this-is-the-worlds-most-expensive-rice-but-what-does-it-tas...
1•mooreds•3m ago•0 comments

These White-Collar Workers Actually Made the Switch to a Trade

https://www.wsj.com/lifestyle/careers/white-collar-mid-career-trades-caca4b5f
1•impish9208•3m ago•1 comments

The Wonder Drug That's Plaguing Sports

https://www.nytimes.com/2026/02/02/us/ostarine-olympics-doping.html
1•mooreds•3m ago•0 comments

Show HN: Which chef knife steels are good? Data from 540 Reddit threads

https://new.knife.day/blog/reddit-steel-sentiment-analysis
1•p-s-v•3m ago•0 comments

Federated Credential Management (FedCM)

https://ciamweekly.substack.com/p/federated-credential-management-fedcm
1•mooreds•4m ago•0 comments

Token-to-Credit Conversion: Avoiding Floating-Point Errors in AI Billing Systems

https://app.writtte.com/read/kZ8Kj6R
1•lasgawe•4m ago•1 comments

The Story of Heroku (2022)

https://leerob.com/heroku
1•tosh•4m ago•0 comments

Obey the Testing Goat

https://www.obeythetestinggoat.com/
1•mkl95•5m ago•0 comments

Claude Opus 4.6 extends LLM Pareto frontier

https://michaelshi.me/pareto/
1•mikeshi42•5m ago•0 comments

Brute Force Colors (2022)

https://arnaud-carre.github.io/2022-12-30-amiga-ham/
1•erickhill•8m ago•0 comments

Google Translate apparently vulnerable to prompt injection

https://www.lesswrong.com/posts/tAh2keDNEEHMXvLvz/prompt-injection-in-google-translate-reveals-ba...
1•julkali•9m ago•0 comments

(Bsky thread) "This turns the maintainer into an unwitting vibe coder"

https://bsky.app/profile/fullmoon.id/post/3meadfaulhk2s
1•todsacerdoti•9m ago•0 comments

Software development is undergoing a Renaissance in front of our eyes

https://twitter.com/gdb/status/2019566641491963946
1•tosh•10m ago•0 comments

Can you beat ensloppification? I made a quiz for Wikipedia's Signs of AI Writing

https://tryward.app/aiquiz
1•bennydog224•11m ago•1 comments

Spec-Driven Design with Kiro: Lessons from Seddle

https://medium.com/@dustin_44710/spec-driven-design-with-kiro-lessons-from-seddle-9320ef18a61f
1•nslog•11m ago•0 comments

Agents need good developer experience too

https://modal.com/blog/agents-devex
1•birdculture•12m ago•0 comments

The Dark Factory

https://twitter.com/i/status/2020161285376082326
1•Ozzie_osman•12m ago•0 comments

Free data transfer out to internet when moving out of AWS (2024)

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/
1•tosh•13m ago•0 comments

Interop 2025: A Year of Convergence

https://webkit.org/blog/17808/interop-2025-review/
1•alwillis•15m ago•0 comments

Prejudice Against Leprosy

https://text.npr.org/g-s1-108321
1•hi41•16m ago•0 comments

Slint: Cross Platform UI Library

https://slint.dev/
1•Palmik•20m ago•0 comments

AI and Education: Generative AI and the Future of Critical Thinking

https://www.youtube.com/watch?v=k7PvscqGD24
1•nyc111•20m ago•0 comments

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•21m ago•0 comments

Moltbook isn't real but it can still hurt you

https://12gramsofcarbon.com/p/tech-things-moltbook-isnt-real-but
1•theahura•24m ago•0 comments

Take Back the Em Dash–and Your Voice

https://spin.atomicobject.com/take-back-em-dash/
1•ingve•25m ago•0 comments

Show HN: 289x speedup over MLP using Spectral Graphs

https://zenodo.org/login/?next=%2Fme%2Fuploads%3Fq%3D%26f%3Dshared_with_me%25253Afalse%26l%3Dlist...
1•andrespi•26m ago•0 comments

Teaching Mathematics

https://www.karlin.mff.cuni.cz/~spurny/doc/articles/arnold.htm
2•samuel246•28m ago•0 comments

3D Printed Microfluidic Multiplexing [video]

https://www.youtube.com/watch?v=VZ2ZcOzLnGg
2•downboots•29m ago•0 comments

Modern Minimal Perfect Hashing: A Survey

https://arxiv.org/abs/2506.06536
89•matt_d•8mo ago

Comments

tmostak•8mo ago
We've made extensive use of perfect hashing in HeavyDB (formerly MapD/OmniSciDB), and it has definitely been a core part of achieving strong group-by and join performance.

You can use perfect hashes not only for the usual suspects of contiguous integer and dictionary-encoded string ranges, but also for use cases like binned numeric and date ranges (epoch seconds binned per year need only one bin per year, so a perfect hash covers a very wide range of timestamps), and they can even handle arbitrary expressions if you propagate the ranges correctly.

Obviously you need a good "baseline" hash path to fall back to, but it's surprising how many real-world use cases you can profitably cover with perfect hashing.
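
To make the binned-timestamp example concrete, a minimal C sketch (illustrative only, not HeavyDB's actual code): when the value range is known, the bin index itself is already a minimal perfect hash, so an aggregation kernel can index a dense table directly and fall back to a general hash table only when the range is too wide or unknown.

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical example: group-by on year(ts) for timestamps known to lie
     * in [2000, 2031]. Every distinct year maps to a unique slot in a dense
     * 32-entry table, with no collision handling needed. */
    #define MIN_YEAR 2000
    #define NUM_BINS 32

    static int year_of_epoch(time_t epoch_seconds) {
        struct tm tm_utc;
        gmtime_r(&epoch_seconds, &tm_utc);   /* POSIX */
        return tm_utc.tm_year + 1900;
    }

    /* Perfect hash: one slot per year in the known range. */
    static unsigned year_bin(time_t epoch_seconds) {
        return (unsigned)(year_of_epoch(epoch_seconds) - MIN_YEAR);  /* 0..NUM_BINS-1 */
    }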

anitil•8mo ago
So in HeavyDB, do you build perfect hashes on the fly for queries? I've only ever seen perfect hashes used at 'build time', when the keys are already known and fixed (like keywords in a compiler).
TheTaytay•8mo ago
I had the same question! I have never heard of runtime perfect hashing. (Admittedly, I haven’t read the paper yet.)
senderista•8mo ago
In the DSA theory literature there is so-called “dynamic perfect hashing” but I don’t think it’s ever been implemented and its use case is served by high-load factor techniques like bucketized cuckoo hashing.
bytehamster•8mo ago
In the appendix of the survey, there are 3 references on dynamic perfect hashing. I think the only actual implementation of a dynamic PHF is a variant of perfect hashing through fingerprinting in the paper "Perfect Hashing for Network Applications". However, that implementation is not fully dynamic and needs to be re-built if the key set changes too much.
wahern•8mo ago
All of those modern algorithms, even relatively older ones like CHD, can find a perfect hash function over millions of keys in less than a second.[1] Periodically rebuilding a function can be more than fast enough, depending on your use case.

The last time I tried gperf, 8-10 years ago, it took hours or even days to build a hash function that CHD could build in seconds or less. If someone's idea of perfect hash function construction cost is gperf (at least gperf circa 2015)... welcome to the future.

[1] See my implementation of CHD: https://25thandclement.com/~william/projects/phf.html

anitil•8mo ago
The article had a reference for being able to compress data in a table (maybe similar in spirit to using a foreign key to a small table). I could also see it being useful in compression dictionaries, but again that's not really a run-time use (and I'm sure I'm not the first to think of it)
o11c•8mo ago
I'm only vaguely aware of how other people do perfect hashing (generators I've used always seem to produce arrays to load from), but dabbled in a purely-arithmetic toy problem recently.

As an exercise for the reader:

  There are exactly 32 symbol (punctuation) characters in ASCII:

    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

  Taking input in a register, uniquely hash them to the range 0-31.
  Any other input values may return any number, but must not
  trap or exhibit undefined behavior. The obvious approach of
  "just make a 256-element array" isn't allowed.

  This can be done in few enough cycles that you need to
  carefully consider if there's any point to including:

    loads (even hitting L1)
    branches (even if it fully predicts when it is taken)
    multiplication (unless just using lea/add/shift)

  I found that out-of-order only helps a little; it's
  difficult to scatter-gather very wide in so few cycles.

  Writing C code mostly works if you can persuade the compiler
  not to emit an unwanted branch.
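
A small C harness (not a solution) for checking a candidate against the exercise: it verifies that the 32 symbols map bijectively onto 0-31. The candidate_hash below is a deliberately wrong placeholder, meant to be swapped out for one of the solutions in the replies.

    #include <stdint.h>
    #include <stdio.h>

    /* Placeholder: substitute one of the hashes proposed below. */
    static unsigned candidate_hash(uint8_t c) {
        return c & 31;   /* not a valid answer: ';' (0x3B) and '[' (0x5B) collide */
    }

    int main(void) {
        const char *symbols = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
        int seen[32] = {0};
        for (const char *p = symbols; *p; p++) {
            unsigned h = candidate_hash((uint8_t)*p);
            if (h > 31 || seen[h]++) {
                printf("collision or out-of-range value at '%c' (slot %u)\n", *p, h);
                return 1;
            }
        }
        printf("perfect: all 32 symbols map to distinct slots 0-31\n");
        return 0;
    }
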
mananaysiempre•8mo ago
I’m tempted to emulate a conventional SIMD solution and go with a small lookup table:

  uint8_t classify(uint8_t c) {
      return c - (0x5F45452B2B210000ULL >> ((c & 0x70) >> 1));
  }
The AND-then-shift sequence on the right is annoying and it feels like one should be able to avoid it, but I don't see how. Overall, this is more expensive than it looks: neither a full 64-bit constant nor a variable 64-bit shift is exactly free. So I'm probably missing something here.
duskwuff•8mo ago
This set of symbols has some interesting properties which allow for this solution:

    func symbolHash(c byte) byte {
        x := (c - 1) >> 5
        y := int(c) + 0x1b150000>>(x<<3)
        return byte(y & 31)
    }
But this doesn't generalize - it depends on the input consisting of a couple runs of consecutive characters which can be made continuous. (Extra credit: why is "c-1" necessary?)
thomasmg•8mo ago
A brute-force approach can quickly find a minimal perfect hash here. E.g. the RecSplit approach can be used for this case: first split the keys into small sections, and then use another hash within each section, or, as in this case, the same one for every section:

    (hash(c + 4339104) & 7) * 4 + (hash(c + 201375) & 3)
Here, hash() can be any generic hash function (e.g. the Murmur hash, or the simple one from [1]):

    long hash(long x) {
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = (x >>> 16) ^ x;
        return x;
    }
As described in the linked paper, the fastest way to find such an MPHF for larger sets is nowadays Consensus-RecSplit.

[1] https://stackoverflow.com/questions/664014/what-integer-hash...
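
A C sketch of how such seed constants can be found by brute force, using the same style of integer hash; it only illustrates the search structure, and the constants it finds will differ from the ones quoted above.

    #include <stdint.h>
    #include <stdio.h>

    /* The 32 ASCII symbols from the exercise. */
    static const char SYMS[] = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    static uint64_t hash64(uint64_t x) {
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = (x >> 16) ^ x;
        return x;
    }

    int main(void) {
        /* Step 1: find a seed s1 so that hash(c + s1) & 7 splits the 32 keys
         * into 8 sections of exactly 4 keys each. */
        uint64_t s1 = 0;
        for (;; s1++) {
            int count[8] = {0}, ok = 1;
            for (int i = 0; i < 32 && ok; i++)
                if (++count[hash64((uint64_t)SYMS[i] + s1) & 7] > 4) ok = 0;
            if (ok) break;
        }
        /* Step 2: find a single seed s2 so that hash(c + s2) & 3 is collision-free
         * within every section; (hash(c+s1)&7)*4 + (hash(c+s2)&3) is then an MPHF.
         * (Sharing one s2 across all sections makes this the slow part.) */
        unsigned section[32];
        for (int i = 0; i < 32; i++)
            section[i] = (unsigned)(hash64((uint64_t)SYMS[i] + s1) & 7);
        uint64_t s2 = 0;
        for (;; s2++) {
            int used[32] = {0}, ok = 1;
            for (int i = 0; i < 32 && ok; i++) {
                unsigned slot = section[i] * 4 + (unsigned)(hash64((uint64_t)SYMS[i] + s2) & 3);
                if (used[slot]++) ok = 0;
            }
            if (ok) break;
        }
        printf("s1 = %llu, s2 = %llu\n", (unsigned long long)s1, (unsigned long long)s2);
        return 0;
    }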

tmyklebu•8mo ago
Two fast (today) instructions:

  unsigned h = _pext_u32(1264523 * x, 0x1020a01);
tmyklebu•8mo ago
Same idea, but without BMI2:

  unsigned h = (1264523 * x & 0x1020a01) * 134746240 >> 27;
Alternatively:

  unsigned h = (1639879 * x & 0x1038040) * 67375104L >> 32 & 31;
The multiplication by 67375104L can be a usual 32x32 IMUL where the high half goes to edx, though I'm not sure that conveys a benefit over a 64x64 IMUL in serial code these days.
mananaysiempre•8mo ago
Where does the constant multiplier come from? Is it just bruteforced, or is there an idea behind it that I can’t see?
tmyklebu•7mo ago
Yeah, that was just a search.

There are 2^32 multipliers; call them m. For a bit in m*x to be useful, it must be zero for 16 symbols and one for 16 symbols. Call those bits "useful bits." Try every multiplier; for each multiplier, compute all the useful bits (usually not many) and try all masks with 5 useful bits.
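
A plain-C transcription of that search (not the original code): it scans multipliers in increasing order and stops at the first (multiplier, mask) pair that separates all 32 symbols; the mask can then be fed to _pext_u32 as above. It finds a valid pair within a few seconds, though not necessarily the constants quoted.

    #include <stdint.h>
    #include <stdio.h>

    static const char SYMS[] = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

    /* Software equivalent of PEXT: gather the bits of v selected by mask. */
    static unsigned extract_bits(uint32_t v, uint32_t mask) {
        unsigned out = 0, pos = 0;
        for (uint32_t m = mask; m; m &= m - 1, pos++)
            if (v & (m & -m)) out |= 1u << pos;
        return out;
    }

    int main(void) {
        for (uint32_t mult = 1; mult != 0; mult++) {
            uint32_t prod[32];
            int ones[32] = {0};              /* how many symbols set each bit of mult*x */
            for (int i = 0; i < 32; i++) {
                prod[i] = mult * (uint32_t)SYMS[i];
                for (int b = 0; b < 32; b++) ones[b] += (prod[i] >> b) & 1;
            }
            int ub[32], k = 0;               /* "useful" bits: set for exactly 16 symbols */
            for (int b = 0; b < 32; b++) if (ones[b] == 16) ub[k++] = b;
            if (k < 5) continue;
            /* Try every mask built from 5 useful bits; accept a bijective one. */
            for (int a = 0; a < k; a++)
            for (int b = a + 1; b < k; b++)
            for (int c = b + 1; c < k; c++)
            for (int d = c + 1; d < k; d++)
            for (int e = d + 1; e < k; e++) {
                uint32_t mask = (1u<<ub[a]) | (1u<<ub[b]) | (1u<<ub[c]) | (1u<<ub[d]) | (1u<<ub[e]);
                uint32_t seen = 0;
                int ok = 1;
                for (int i = 0; i < 32 && ok; i++) {
                    unsigned h = extract_bits(prod[i], mask);
                    if (seen & (1u << h)) ok = 0; else seen |= 1u << h;
                }
                if (ok) {
                    printf("multiplier %u, mask 0x%08x\n", (unsigned)mult, (unsigned)mask);
                    return 0;
                }
            }
        }
        return 1;
    }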

mananaysiempre•7mo ago
> For a bit in m*x to be useful, it must be zero for 16 symbols and one for 16 symbols.

Ahh brilliant, thanks!

vlovich123•8mo ago
Interesting that they don’t cover boomphf which is the fastest MPHF I’ve encountered.
judofyr•8mo ago
Boomph is a Rust re-implementation of BBHash, which is included in the survey (and dominated by three other implementations). AFAIK there's no reason to think it would perform any better than BBHash.
bytehamster•8mo ago
Addition: BBHash, in turn, is a re-implementation of FiPHa (perfect hashing through fingerprinting). There are quite a few re-implementations of FiPHa: BBHash, Boomph, FiPS, FMPH, etc. As shown in the survey, BBHash is by far the slowest. FMPH is much faster, even though it implements exactly the same algorithm. Its paper [1] also compares to Boomph. The beauty of the fingerprinting technique is that it is super simple. That's probably the reason why there are so many implementations of it.

[1] https://dl.acm.org/doi/pdf/10.1145/3596453
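
To show how simple the fingerprinting construction is, a compressed C sketch in the spirit of FiPHa/BBHash (not taken from any of those codebases): keys that land alone in a slot are placed at that level, colliding keys retry at the next, smaller level, and the MPHF value is the rank of the key's slot among all occupied slots. For readability it spends a whole byte per slot and uses naive rank; real implementations use packed bit vectors with rank structures, which is where the space efficiency comes from.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define GAMMA 2                 /* slots per remaining key at each level */
    #define MAX_LEVELS 64

    /* splitmix64-style mixer, re-seeded per level. */
    static uint64_t mix(uint64_t x, uint64_t seed) {
        x += (seed + 1) * 0x9E3779B97F4A7C15ULL;
        x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
        x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
        return x ^ (x >> 31);
    }

    typedef struct {
        uint8_t *bits[MAX_LEVELS];  /* 1 = slot owned by exactly one key */
        size_t   size[MAX_LEVELS];
        int      levels;
    } Fipha;

    static void build(Fipha *f, const uint64_t *keys, size_t n) {
        uint64_t *rem = malloc(n * sizeof *rem);
        memcpy(rem, keys, n * sizeof *rem);
        f->levels = 0;
        for (int l = 0; n > 0 && l < MAX_LEVELS; l++) {
            size_t m = GAMMA * n + 1;
            uint8_t *count = calloc(m, 1);
            for (size_t i = 0; i < n; i++) {
                size_t pos = mix(rem[i], l) % m;
                count[pos] = count[pos] ? 2 : 1;            /* 2 = collision */
            }
            f->bits[l] = calloc(m, 1);
            f->size[l] = m;
            f->levels = l + 1;
            size_t next = 0;
            for (size_t i = 0; i < n; i++) {
                size_t pos = mix(rem[i], l) % m;
                if (count[pos] == 1) f->bits[l][pos] = 1;   /* placed at this level */
                else                 rem[next++] = rem[i];  /* retry at next level */
            }
            free(count);
            n = next;
        }
        free(rem);  /* a real implementation stores leftover keys in a fallback map */
    }

    /* The MPHF value is the rank of the key's slot among all occupied slots. */
    static size_t query(const Fipha *f, uint64_t key) {
        size_t offset = 0;
        for (int l = 0; l < f->levels; l++) {
            size_t pos = mix(key, l) % f->size[l];
            size_t rank = 0;
            for (size_t i = 0; i < pos; i++) rank += f->bits[l][i];   /* naive rank */
            if (f->bits[l][pos]) return offset + rank;
            for (size_t i = pos; i < f->size[l]; i++) rank += f->bits[l][i];
            offset += rank;
        }
        return (size_t)-1;  /* not in the build set (or fell into the fallback) */
    }

    int main(void) {
        uint64_t keys[100];
        for (int i = 0; i < 100; i++) keys[i] = 1000003ULL * (i + 1);
        Fipha f;
        build(&f, keys, 100);
        uint8_t hit[100] = {0};
        for (int i = 0; i < 100; i++) {
            size_t v = query(&f, keys[i]);
            if (v >= 100 || hit[v]++) { puts("not perfect"); return 1; }
        }
        puts("minimal perfect: 100 keys map to distinct values 0..99");
        return 0;
    }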

rurban•8mo ago
They completely ignore the startup time relative to the query time, and it dominates by a factor of 1000.

Some PHFs can be pre-compiled, while most need to be deserialized at run-time. I worked on a pre-compiled pthash variant, but got stuck on C++ bugs.

In some of them, there's a huge overhead for ordered variants, to check for false positives.

For small n gperf is still the fastest by far. And it is pre-compiled only.

thomasmg•8mo ago
From what you describe, I think you have a somewhat special use case: it sounds like you are compiling it. The experiments done in the survey are not: instead, the hash function is used in the form of a data structure, similar to a Bloom filter ("deserialized" in your words). Do you use it for a parser? How many entries do you have? The survey uses millions of entries. "Startup time in the query time": I'm not quite sure what you mean, I'm afraid. Could you describe it?

I'm also not sure what you mean with "check for false positives"... each entry not in the set returns a random value, maybe for you this is a "false positive"? A typical solution for this is to add a "fingerprint" for each entry (just a hash value per entry) - that way there is still some false positive probability, which may or may not be acceptable (it is typically considered acceptable if the hash function is cryptographically secure, and the fingerprint size is big enough: e.g. 128 bits of the SHA-256 sum). Depending on the data, it may be faster to compare the value (eg. in a parser).
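
A short C sketch of the fingerprint idea, with a toy stand-in MPHF so that it is self-contained; in practice the MPHF would be one of the structures from the survey, and the fingerprint width trades space against the residual false-positive rate.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Toy stand-in MPHF: the build keys below are 1000..1999, for which
     * key % 1000 is a minimal perfect hash; like a real MPHF, it also maps
     * foreign keys to some arbitrary slot in 0..999. */
    static uint64_t toy_mphf(uint64_t key) { return key % 1000; }

    static uint64_t fp_hash(uint64_t x) {       /* any decent mixer works */
        x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL; x ^= x >> 33;
        return x;
    }

    typedef struct { uint8_t *fp; uint64_t n; } FilteredMphf;

    /* Build: store an 8-bit fingerprint of the key that owns each slot. */
    static void add_fingerprints(FilteredMphf *s, const uint64_t *keys, uint64_t n) {
        s->n = n;
        s->fp = malloc(n);
        for (uint64_t i = 0; i < n; i++)
            s->fp[toy_mphf(keys[i])] = (uint8_t)fp_hash(keys[i]);
    }

    /* Lookup: a key outside the build set still lands on some slot, but its
     * fingerprint only matches with probability ~1/256. */
    static bool probably_contains(const FilteredMphf *s, uint64_t key) {
        uint64_t slot = toy_mphf(key);
        return slot < s->n && s->fp[slot] == (uint8_t)fp_hash(key);
    }

    int main(void) {
        uint64_t keys[1000];
        for (int i = 0; i < 1000; i++) keys[i] = 1000 + i;
        FilteredMphf s;
        add_fingerprints(&s, keys, 1000);
        int false_pos = 0;
        for (uint64_t k = 100000; k < 200000; k++)   /* 100k keys not in the set */
            false_pos += probably_contains(&s, k);
        printf("in-set key found: %d, false positives: %d / 100000\n",
               probably_contains(&s, 1500), false_pos);
        return 0;
    }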

rurban•8mo ago
Using it like gperf is certainly not a special case. If your keys are fixed and known at compile time, you can also pre-compile the hash function with its data structures, and not be forced to deserialize it at startup.

When comparing those MPHF query times, the startup time (the deserialization from disk) is 1000x higher than the actual query time. When you compile those data structures, the load time is instant. Also, memory usage is twice as low.

thomasmg•8mo ago
> using it like gperf is certainly not a special case.

Well... let's put it like this: in this survey (where I am one of the co-authors), "parsers" are not mentioned explicitly in the "Applications" section. They are a subset of "Hash Tables and Retrieval". There are many other uses: Approximate Membership, Databases, Bioinformatics, Text Indexing, Natural Language Processing. Yes, parsers are mentioned in "The Birth of Perfect Hashing". Maybe we can conclude that parsers are not the "main" use case nowadays.

> when you compile those data structures, the load time is instant.

Well, in this case I would recommend using a static lookup table in the form of source code. That way, it is available to the compiler at compile time, and doesn't need to be loaded and parsed at runtime: it is simply available as a data structure at runtime. All modern MPHF implementations in this survey support this usage. But, yes, they are not optimized for this use case: you typically don't have millions of keywords in a programming language.
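
What such a static lookup table in source form can look like, as a hand-made toy in the spirit of gperf output (not generated by any tool from the survey): the whole table lives in read-only data, so there is nothing to deserialize at startup, and non-keywords that land on an occupied slot are rejected by the string comparison.

    #include <stdio.h>
    #include <string.h>

    enum { TOK_NONE = 0, TOK_IF, TOK_ELSE, TOK_WHILE, TOK_RETURN, TOK_INT };

    /* Perfect (not minimal) hash for 5 keywords: slot = (len + first char) % 7. */
    static const struct { const char *word; int tok; } KW_TABLE[7] = {
        [2] = { "if",     TOK_IF     },   /* (2 + 'i') % 7 == 2 */
        [0] = { "else",   TOK_ELSE   },   /* (4 + 'e') % 7 == 0 */
        [5] = { "while",  TOK_WHILE  },   /* (5 + 'w') % 7 == 5 */
        [1] = { "return", TOK_RETURN },   /* (6 + 'r') % 7 == 1 */
        [3] = { "int",    TOK_INT    },   /* (3 + 'i') % 7 == 3 */
    };

    static int keyword_token(const char *s, size_t len) {
        unsigned slot = (unsigned)((len + (unsigned char)s[0]) % 7);
        const char *w = KW_TABLE[slot].word;
        return (w && strlen(w) == len && memcmp(w, s, len) == 0)
                   ? KW_TABLE[slot].tok : TOK_NONE;
    }

    int main(void) {
        printf("%d %d %d\n", keyword_token("while", 5), keyword_token("int", 3),
               keyword_token("foo", 3));   /* prints the two token ids, then 0 */
        return 0;
    }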

bytehamster•8mo ago
Many modern perfect hash functions are super close to the space lower bound, often having just between 3% and 50% overhead. So your claim that the space consumption is "twice as low" is information-theoretically impossible. With gperf, the space consumption is in the machine code instead of a data structure, but it's definitely not "for free". In fact, I'm pretty sure that the size of the machine code generated by gperf is very far from optimal. The only exception is tiny input sets with just a couple of hundred keys, where gperf probably wins due to lower constant overheads.
rurban•8mo ago
My special use case is e.g. Unicode property checks, which have millions of keys and which gperf cannot handle at that scale. Other cases are integer keys.

I'm certainly not willing to load the keys and MPHF data structures at query time from disk, as they are known in advance and can be compiled to C or C++ code ahead of time, which leads to an instant load time, as opposed to the costly deserialization times in all your tools.

Your deserialization space overhead is not accounted for, and neither are the storage costs for the false-positive check. It's rather academic, not practical.

bytehamster•8mo ago
There is no deserialization time or space overhead. The measurements refer to the deserialized form. They are not loaded from disk.

About false positive checks, I think you misunderstand what a perfect hash function does.

rurban•8mo ago
See, everybody can see how you cheat your benchmarks. It's unrealistic to measure only the perfect hash function, when you discard the cost of the deserialization and false-positive checks.

That's always the case when dealing with you academics.

We don't care how you define your costs when we laugh about that. In reality you have to load the keys and the data structures, and check for existence. You even discard the existence checks by omitting the false-positive checks. Checking against a non-existing key will lead to a hit in your check. Only very rarely do you know in advance whether your key is in the set.

rurban•8mo ago
Also, you miss the space costs for the ordering.
throwaway81523•8mo ago
This is a really good article. The subject area has a lot of new developments in the past few years (wonder why) and the survey discusses them. I always thought of perfect hashing as a niche optimization good for some things and of theoretical interest, but it's apparently more important than I thought.
hinkley•8mo ago
I've always considered this a bit of an academic conversation because, as others have pointed out, the up-front costs are often too much to bear. However, we now have languages that can run functions at build time, so a static lookup table is possible.

And where else would one use a static lookup table to great effect? When I actively followed the SIGPLAN (programming languages) proceedings, one of the papers that really stood out for me was one about making interface/trait based function calls as fast as inheritance by using perfect hashing on all vtable entries.

bytehamster•8mo ago
That sounds super interesting! Do you remember the title or the authors of the paper?
anitil•8mo ago
I've seen these used in compilers for matching keywords - it's usually a small list of words, but it's used so often in tokenising that it can be worth the bother. I've used gperf to (unsuccessfully) improve a key dictionary in Python, but it turns out the default dictionary implementation is pretty good, so the extra hassle wasn't worth it.