I invented a very fast string search algorithm based on Bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms (SEA) 2024 [2]. Code can be found here [3].
[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
However, the performance gain is not surprising at all, since Bloom filters allow the query engine to skip large blocks of data with certainty when those blocks do not contain the target data. False negatives are impossible. False positives do occur, but their rate can be made very small with well-chosen parameters and trade-offs.
With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007) and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-kn/m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
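For the curious, here's a minimal sketch of a per-block filter with those parameters. This is illustrative only - the index derivation below is generic double hashing, not necessarily what we shipped:

    #include <bitset>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <string>

    constexpr std::size_t M = 10007;  // bits per filter
    constexpr int K = 4;              // hash functions
    constexpr int N = 1000;           // records per block

    struct BlockFilter {
        std::bitset<M> bits;

        // Derive K indices from two hashes (Kirsch-Mitzenmacher-style
        // double hashing), purely for illustration.
        static std::size_t index(const std::string& key, int i) {
            std::uint64_t h1 = std::hash<std::string>{}(key);
            std::uint64_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;
            return (h1 + std::uint64_t(i) * h2) % M;
        }
        void add(const std::string& key) {
            for (int i = 0; i < K; ++i) bits.set(index(key, i));
        }
        // false => key is definitely not in the block, skip it entirely;
        // true  => key *may* be in the block, read it and verify.
        bool may_contain(const std::string& key) const {
            for (int i = 0; i < K; ++i)
                if (!bits.test(index(key, i))) return false;
            return true;
        }
    };

    int main() {
        // theoretical false-positive rate: (1 - e^(-K*N/M))^K
        double p = std::pow(1.0 - std::exp(-double(K) * N / M), K);
        std::printf("%.4f\n", p);  // prints 0.0118
    }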
The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like a CPU branch misprediction.
A 30-fold increase in querying speed for just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of about 5 seconds and, for larger queries, reducing a 30-minute wait to about 1 minute.
This was a fairly large project, with roughly 3 million lines of C and C++ code and numerous special-valued constants defined throughout. So instead of a less interesting number like 10000, I chose 10007, so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is record 0, etc.). You could of course do 3 or 4 letters too.
A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done across them.
A search for "bloom" would only load bl.raw, lo.raw, oo.raw and om.raw.
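Roughly, in code (an in-memory sketch; the real thing reads the .raw files from disk, and all names here are made up):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    using Bitmap = std::vector<std::uint8_t>;  // one byte = 8 records

    // One bitmap per 2-letter combination; bit i set means record i
    // contains that bigram. On disk these are the aa.raw, ab.raw, ...
    // files; in memory here for brevity.
    struct BigramIndex {
        std::map<std::string, Bitmap> bitmaps;
        std::size_t num_bytes = 0;  // common length of all bitmaps

        void add_record(std::size_t id, const std::string& text) {
            std::size_t byte = id / 8, bit = id % 8;
            if (byte >= num_bytes) num_bytes = byte + 1;
            for (std::size_t i = 0; i + 1 < text.size(); ++i) {
                Bitmap& b = bitmaps[text.substr(i, 2)];
                if (b.size() < num_bytes) b.resize(num_bytes, 0);
                b[byte] |= std::uint8_t(1u << bit);
            }
        }

        // AND the bitmaps of the query's overlapping bigrams:
        // "bloom" -> bl, lo, oo, om.
        Bitmap candidates(const std::string& query) const {
            Bitmap acc(num_bytes, 0xFF);
            for (std::size_t i = 0; i + 1 < query.size(); ++i) {
                auto it = bitmaps.find(query.substr(i, 2));
                if (it == bitmaps.end())          // bigram never indexed
                    return Bitmap(num_bytes, 0);  // -> no record can match
                for (std::size_t j = 0; j < num_bytes; ++j)
                    acc[j] &= (j < it->second.size()) ? it->second[j] : 0;
            }
            return acc;
        }
    };

Any bit still set after the ANDs marks a candidate record, which still has to be verified, since the bigrams might not be adjacent in the record.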
The index is really hard to update, but adding new records is easy: new records are first added as false positives until there are enough bits to push a full byte onto the end of each raw file.
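A hypothetical sketch of that append path (assuming a fixed a-z alphabet, and appending a zero byte to untouched files so all files keep the same length and stay record-aligned):

    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>

    // Buffer new records until 8 have arrived, then append one byte to
    // every .raw file. Until the flush, queries must treat the buffered
    // records as matches, i.e. deliberate false positives.
    struct PendingByte {
        std::map<std::string, std::uint8_t> bits;  // bigram -> pending bits
        int count = 0;                             // records buffered so far

        void add_record(const std::string& text) {
            for (std::size_t i = 0; i + 1 < text.size(); ++i)
                bits[text.substr(i, 2)] |= std::uint8_t(1u << count);
            if (++count == 8) flush();
        }
        void flush() {
            for (char a = 'a'; a <= 'z'; ++a)
                for (char b = 'a'; b <= 'z'; ++b) {
                    std::string bigram{a, b};
                    std::ofstream f(bigram + ".raw",
                                    std::ios::binary | std::ios::app);
                    auto it = bits.find(bigram);
                    f.put(it == bits.end() ? '\0'
                                           : static_cast<char>(it->second));
                }
            bits.clear();
            count = 0;
        }
    };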
I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Whole words could be treated as if they were unique characters.
Characters (or other bytes) could be combined, like s=z, k=c, x=y or i=e.
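Something like this folding step before computing bigrams would do it, as long as records and queries are folded identically (the mapping is hypothetical):

    #include <string>

    // Fold "equivalent" letters so they share the same .raw files.
    char fold(char c) {
        switch (c) {
            case 'z': return 's';
            case 'c': return 'k';
            case 'y': return 'x';
            case 'e': return 'i';
            default:  return c;
        }
    }

    std::string fold_word(const std::string& s) {
        std::string out;
        out.reserve(s.size());
        for (char c : s) out.push_back(fold(c));
        return out;  // apply to both records and queries
    }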
Calculating which combination is best for a static data set was too hard for me. One could look at the ratio of set bits in a file to see whether a letter combination is worth having.
But it works, and loading a handful of files and ANDing them is amazingly fast.
sanskarix•3h ago
I've seen them save startups real money in caching layers - checking "did we already process this event?" before hitting the database. False positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
The trick is recognizing when false positives don't hurt you. Most engineers learn about them in theory but never find that practical use case where they're actually the right tool. Same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
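For concreteness, the pattern is just this (tiny sketch; the filter is a minimal two-hash stand-in and the DB lookup is a stub):

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // Minimal "ask the filter before the database" sketch.
    struct SeenFilter {
        std::vector<bool> bits = std::vector<bool>(1 << 20);  // ~1 Mbit
        static std::uint64_t h(const std::string& s, std::uint64_t seed) {
            return std::hash<std::string>{}(s) ^ (seed * 0x9e3779b97f4a7c15ULL);
        }
        void add(const std::string& id) {
            bits[h(id, 1) % bits.size()] = true;
            bits[h(id, 2) % bits.size()] = true;
        }
        bool maybe_seen(const std::string& id) const {
            return bits[h(id, 1) % bits.size()] && bits[h(id, 2) % bits.size()];
        }
    };

    bool lookup_in_database(const std::string&) { return true; }  // stand-in

    bool already_processed(SeenFilter& f, const std::string& event_id) {
        if (!f.maybe_seen(event_id))
            return false;                     // definite no: zero DB queries
        return lookup_in_database(event_id);  // maybe: verify here, so the
                                              // rare false positive is harmless
    }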