For example, "check if we might have Mr Smith's data cached by looking in this bloom filter". As Long as the answer is usually right, that's good enough.
I do wonder if in the future we'll use a mini online-trained AI to achieve the same thing: "Ask the AI if we have Mr Smith's data cached."
Rather like you might ask an office clerk: "Is Mr Smith that troublesome customer you were dealing with last week? Do you still have his file on your desk?"
1. https://en.m.wikipedia.org/wiki/Bloom_filters_in_bioinformat...
For blacklists they're potentially more suitable, but since a Bloom filter can also give false positives, it could deny permission to people who should have it. An attacker might also exploit this - by flooding the filter with accounts that deliberately get blacklisted, they could lock out people who should have access.
Obviously this is very use-case specific - it's probably not the best approach to doing permissions if security is paramount.
They can give false positives, incorrectly indicating an element might be in the set when it's not, but never false negatives.
Knowing (fast) if something is not in a dataset can be very useful.
A few non-exhaustive real-world use cases that come to mind (a sketch of the structure itself follows the list):
- Databases: To quickly check if a record might exist before doing a costly disk lookup.
- Spell Checkers: To check if a word might be in the dictionary.
- Spam Filters: To check if an email sender is on a list of known spammers.
- Browser Security: Chrome uses Bloom filters to check if a site might be malicious.
- Password Checker: To check if a password is known to be leaked.
- Web Caches: To check if a URL or resource is definitely not in the cache.
- Distributed Systems: To avoid sending data that another system definitely doesn’t need.
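To make the mechanics behind those use cases concrete, here's a minimal sketch in Go (all names hypothetical; FNV-based double hashing, per Kirsch-Mitzenmacher, is just one convenient choice). Add sets k bits per key; MightContain can return a false positive, never a false negative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a bit array plus k hash functions, derived here from one
// 64-bit FNV hash via double hashing.
type Bloom struct {
	bits []uint64
	m, k uint64
}

func NewBloom(m, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// indexes derives k bit positions from the two halves of one hash.
func (b *Bloom) indexes(key string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32|1 // odd step
	idx := make([]uint64, b.k)
	for i := uint64(0); i < b.k; i++ {
		idx[i] = (h1 + i*h2) % b.m
	}
	return idx
}

// Add sets k bits for the key.
func (b *Bloom) Add(key string) {
	for _, i := range b.indexes(key) {
		b.bits[i/64] |= 1 << (i % 64)
	}
}

// MightContain returns false only if key was definitely never added;
// true means "maybe" (false positives possible, false negatives not).
func (b *Bloom) MightContain(key string) bool {
	for _, i := range b.indexes(key) {
		if b.bits[i/64]&(1<<(i%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloom(1<<16, 4)
	f.Add("mr-smith")
	fmt.Println(f.MightContain("mr-smith")) // true
	fmt.Println(f.MightContain("mr-jones")) // almost certainly false
}
```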
Discussion of how Bloom filters made SQLite much faster.
If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does — the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.
My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search to see what I need to invalidate at the time the group member was removed.
I'm new to using them, so I keep getting excited about new use cases — much like when I was first introduced to Bloom filters.
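One way such an approximator might look, as a hedged sketch in Go (hypothetical names, not the commenter's actual code): each cell keeps the maximum timestamp written to it, and a read takes the minimum over a key's cells, so collisions can only make the bound later (safely conservative for invalidation), never earlier.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// TimeApproximator stores an upper bound on the latest timestamp
// recorded per key. Collisions can only push the answer later, i.e.
// toward over-invalidating caches, never toward serving stale data.
type TimeApproximator struct {
	cells []int64
	k     uint64
}

func NewTimeApproximator(size int, k uint64) *TimeApproximator {
	return &TimeApproximator{cells: make([]int64, size), k: k}
}

// slots derives k cell indexes from one 64-bit hash (double hashing).
func (t *TimeApproximator) slots(key string) []uint64 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32|1 // odd step so indexes differ
	out := make([]uint64, t.k)
	for i := uint64(0); i < t.k; i++ {
		out[i] = (h1 + i*h2) % uint64(len(t.cells))
	}
	return out
}

// Record notes that key changed at ts, e.g. "member X removed from
// group Y". Each touched cell keeps the max timestamp it has seen.
func (t *TimeApproximator) Record(key string, ts int64) {
	for _, s := range t.slots(key) {
		if ts > t.cells[s] {
			t.cells[s] = ts
		}
	}
}

// LatestAtMost returns an upper bound on key's last change time: any
// cached value older than this must be treated as possibly stale.
func (t *TimeApproximator) LatestAtMost(key string) int64 {
	ss := t.slots(key)
	best := t.cells[ss[0]]
	for _, s := range ss[1:] {
		if t.cells[s] < best {
			best = t.cells[s]
		}
	}
	return best
}

func main() {
	a := NewTimeApproximator(1<<16, 3)
	a.Record("groupY:memberX", 1700000000)
	fmt.Println(a.LatestAtMost("groupY:memberX")) // >= 1700000000
}
```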
First-order logic with least fixed point, FO(LFP), captures P. Since P = co-P but we think NP != co-NP, Bloom filters often have value as a small and fast path to co-NP-style questions.
In that case the collisions don't matter, because excluding membership is often more valuable than proving membership, and you have to choose one.
That is why Outlook used one for address autocompletion: even though a hit still requires the expensive call to the server, the set of email addresses not in your address book is far larger.
But it is all horses for courses.
If your need fits in P you have a lot of options, while the options for co-NP are far more rarefied.
Also interesting are "Invertible Bloom Lookup Tables" (IBLT), https://arxiv.org/pdf/1101.2245 , which allow adding and removing data, and allow retrieving the remaining data once only "little" data remains. That means they can be used for error correction: add all the _correct_ data (say, 1 GB of it) into a 1 MB IBLT. Suppose a downloader finds that the checksum doesn't match: they can download that 1 MB IBLT and remove their (invalid) download's data from it. What remains is the delta: the error-correction data. I know there are other ways you could do error correction, but this is very fast, and quite interesting technically.
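A minimal sketch of the idea in Go (hypothetical names; uint64 keys and values for brevity; the paper's version also stores a per-cell checksum to verify a cell really holds exactly one pair before peeling it):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cell holds a count plus XOR-sums of all keys/values hashed into it.
type cell struct {
	count  int
	keySum uint64
	valSum uint64
}

// IBLT with k disjoint regions, so a key's slots never collide
// with each other.
type IBLT struct {
	cells []cell
	k     uint64
}

func NewIBLT(size int, k uint64) *IBLT {
	return &IBLT{cells: make([]cell, size), k: k}
}

func (t *IBLT) slots(key uint64) []uint64 {
	region := uint64(len(t.cells)) / t.k
	out := make([]uint64, t.k)
	for i := uint64(0); i < t.k; i++ {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d:%d", i, key) // per-hash-function salt
		out[i] = i*region + h.Sum64()%region
	}
	return out
}

func (t *IBLT) Insert(key, val uint64) { t.update(key, val, 1) }
func (t *IBLT) Delete(key, val uint64) { t.update(key, val, -1) }

func (t *IBLT) update(key, val uint64, d int) {
	for _, s := range t.slots(key) {
		t.cells[s].count += d
		t.cells[s].keySum ^= key
		t.cells[s].valSum ^= val
	}
}

// List recovers the remaining pairs by repeatedly "peeling" cells
// with count == 1. This succeeds with high probability only while
// few pairs remain -- exactly the error-correction scenario above.
func (t *IBLT) List() map[uint64]uint64 {
	out := map[uint64]uint64{}
	for again := true; again; {
		again = false
		for i := range t.cells {
			if t.cells[i].count == 1 {
				k, v := t.cells[i].keySum, t.cells[i].valSum
				out[k] = v
				t.Delete(k, v)
				again = true
			}
		}
	}
	return out
}

func main() {
	t := NewIBLT(120, 3)
	t.Insert(42, 1000)
	t.Insert(7, 2000)
	fmt.Println(t.List()) // map[7:2000 42:1000]
}
```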
"Hashmap" - maybe a less correct name but it lets to make more correct assumption.
The benchmark alternates between ~1 million different keys to check in the filter, explicitly to account for cache effects.
[That said, on a hot production Bloom filter much can be loaded into caches anyway, so it's not an entirely unrealistic scenario that some of these are cache hits.]
https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...
https://github.com/facebook/rocksdb/blob/main/util/ribbon_al...
I know that with a traditional Bloom filter this would give me a lot of false positives. Is there an alternative that would fare better?
Represent (possibly overlapping, hence trees are tricky) subsets of size 50-100 out of a set of size 1000-2000.
Some amount of false positives is acceptable but not like 50%.
If you have two Bloom filters which use the same set of possible keys and the same filter setup, you can compute the intersection of their represented sets with a bitwise AND operation over the bitsets in these filters.
If you store the bitsets aligned and padded to the SIMD vector size (for example, an array of __m128i in C++ as opposed to the array of uint64 values in the Go example), computing such an intersection is very efficient on modern computers.
I once used that technique to filter 3D objects from a large set using the intersection of 3 range queries, one per coordinate. Compute two Bloom filters filled with the objects returned by the X and Y queries, intersect the filters, then iterate over the objects returned by the remaining Z query, testing against the intersected Bloom filter. Due to the probabilistic nature of the Bloom filter you still need to test objects that pass the filter against the X and Y queries, but with a proper setup the filter should reject the vast majority of them.
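The core operation over the raw filter words is tiny; a hypothetical Go sketch (both inputs must share size and hash setup):

```go
package bloomops

// Intersect over-approximates the set intersection of two Bloom
// filters that share the same size and hash setup. One AND per 64
// bits; with aligned storage the compiler can widen this to SIMD.
func Intersect(a, b []uint64) []uint64 {
	out := make([]uint64, len(a))
	for i := range a {
		out[i] = a[i] & b[i]
	}
	return out
}
```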
https://github.com/FastFilter/fastfilter_cpp/blob/f27873fac4...
Also, this somewhat related Ask HN is a true classic, at least for me: https://news.ycombinator.com/item?id=32186203
For direct replacements, see Xor(+)-Filters: https://arxiv.org/abs/1912.08258 and the subsequent evolution into Binary Fuse Filters: https://arxiv.org/abs/2201.01174
Xor and binary fuse filters require access to the full set of keys at construction time; in this sense, they are immutable. The alternatives typically have a fixed memory usage and a maximal capacity, but they allow more flexibility, such as progressive construction (adding keys one by one).
But these, and xor/binary-fuse filters, are only direct replacements for Bloom filters in some problems. There are still problems for which Bloom filters are a better solution.
Bloom filters have the advantage that the filter produced by adding two elements is the same as the bitwise OR of a pair of filters made by adding each element separately.
Also, the filter produced by bitwise AND, `bloom([x]) & bloom([y])`, cannot have any bits set in it which are not also set by `bloom([x, y])`. There are practical applications of this which are not satisfied by the "optimized" variants. The bitwise OR performs set union (a least upper bound), and the bitwise AND performs set intersection (a greatest lower bound), though the filter produced by `bloom([x, y])` has a higher false positive rate than the filter produced by `bloom([x]) & bloom([y])`.
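That invariant is cheap to check over the raw filter words; a hypothetical Go helper:

```go
package bloomops

// SubsetBits reports whether every bit set in x is also set in y,
// e.g. x = bloom([x0]) & bloom([y0]) and y = bloom([x0, y0]).
func SubsetBits(x, y []uint64) bool {
	for i := range x {
		if x[i]&^y[i] != 0 { // some bit of x is missing from y
			return false
		}
	}
	return true
}
```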