For example, "check if we might have Mr Smith's data cached by looking in this bloom filter". As Long as the answer is usually right, that's good enough.
I do wonder if in the future we'll use a mini online-trained AI to achieve the same thing. "Ask the AI if we have MR Smiths data cached".
Rather like you might ask an office clerk "Is Mr smith that troublesome customer you were dealing with last week? Do you still have his file on your desk?"
1. https://en.m.wikipedia.org/wiki/Bloom_filters_in_bioinformat...
For blacklists, potentially more suitable, but since it can also give false positives, it could deny permission to people who should have it. An attacker might also attack this - by flooding the filter with accounts that deliberately get blacklisted, they could lock out people who should have access.
Obviously this is very use-case specific - it's probably not the best approach to doing permissions if security is paramount.
They can give false positives incorrectly indicating an element might be in the set when it's not, but never false negatives
Knowing (fast) if something is not in a dataset can be very useful.
A few non-exhaustive real world use-cases that come to mind:
- Databases: To quickly check if a record might exist before doing a costly disk lookup.
- Spell Checkers: To check if a word might be in the dictionary.
- Spam Filters: To check if an email sender is on a list of known spammers.
- Browser Security: Chrome uses Bloom filters to check if a site might be malicious.
- Password Checker: To check if a password is known to be leaked.
- Web Caches: To check if a URL or resource is definitely not in the cache.
- Distributed Systems: To avoid sending data that another system definitely doesn’t need.
Discussion of how Bloom filters made SQLite much faster.
If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does — the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.
My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search to see what I need to invalidate at the time the group member was removed.
I'm new to using them, so I keep getting excited about new use cases — much like when I was first introduced to Bloom filters.
"Hashmap" - maybe a less correct name but it lets to make more correct assumption.
The benchmark alternates between ~1 million different keys to check in the filter, explicitly to account for cache effects.
[That said, on a hot production bloom filter, much can be loaded into caches anyway so it's not an entirely un-realistic scenario that some of these are cache hits]
https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...
https://github.com/facebook/rocksdb/blob/main/util/ribbon_al...
I know with a traditional bloom filter this would give me a lot of false positives. Is there an alternative that would fare better ?
noelwelsh•7h ago
f_devd•6h ago
For direct replacements, see Xor(+)-Filters: https://arxiv.org/abs/1912.08258 and the subsequent evolution into Binary Fuse Filters: https://arxiv.org/abs/2201.01174
samuel•6h ago
Xor and binary fuse filters require access to the full set of keys at construction time. In this sense, they are immutable. Alternatives have typically a fixed memory usage and a maximal capacity, but they also allow more flexibility such as progressive construction (adding keys one by one).
virexene•6h ago
f_devd•4h ago
sparkie•4h ago
But these, and xor/binary-fuse-filters are only direct replacements for Bloom filters in some problems. There are still problems for which Bloom filters are a better solution.
Bloom filters have the advantage that the filter produced by adding two elements, is the same as the bitwise-or of a pair of filters made by adding each element separately.
Also, the filter produced by bitwise-and, `bloom([x]) & bloom([y])`, cannot have any bits set in it which are not also set by `bloom([x, y])`. We can assert that There are practical applications that this can provide which are not satisfied by the "optimized" variants. The bitwise-or does set union (or least upper bound), and the bitwise-and does a set intersection (or greatest lower bound), though the filter produced by `bloom([x, y])` has a higher false positive rate than the filter produced by `bloom([x]) & bloom([y])`.f_devd•1h ago
Snawoot•5h ago