I invented a very fast string search algorithm based on Bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms (SEA) 2024 [2]. Code can be found here [3].
[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
However, the performance gain is not surprising at all, since Bloom filters allow the query engine to skip large blocks of data with certainty when those blocks do not contain the target data. False negatives are impossible. False positives do occur, but their rate can be made very small with well-chosen parameters and trade-offs.
With 4 hash functions (k = 4), 10007 bits per Bloom filter (m = 10007) and a new Bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-kn/m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
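For the curious, here's a minimal sketch of a per-block filter with those parameters. This is illustrative only - the index derivation below is generic double hashing, not necessarily what we shipped:

    #include <bitset>
    #include <cmath>
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <string>

    constexpr std::size_t M = 10007;  // bits per filter
    constexpr int K = 4;              // hash functions
    constexpr int N = 1000;           // records per block

    struct BlockFilter {
        std::bitset<M> bits;

        // Derive K indices from two hashes (Kirsch-Mitzenmacher-style
        // double hashing), purely for illustration.
        static std::size_t index(const std::string& key, int i) {
            std::uint64_t h1 = std::hash<std::string>{}(key);
            std::uint64_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;
            return (h1 + std::uint64_t(i) * h2) % M;
        }
        void add(const std::string& key) {
            for (int i = 0; i < K; ++i) bits.set(index(key, i));
        }
        // false => key is definitely not in the block, skip it entirely;
        // true  => key *may* be in the block, read it and verify.
        bool may_contain(const std::string& key) const {
            for (int i = 0; i < K; ++i)
                if (!bits.test(index(key, i))) return false;
            return true;
        }
    };

    int main() {
        // theoretical false-positive rate: (1 - e^(-K*N/M))^K
        double p = std::pow(1.0 - std::exp(-double(K) * N / M), K);
        std::printf("%.4f\n", p);  // prints 0.0118
    }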
The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like a CPU branch misprediction.
A 30-fold increase in querying speed for just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2-minute wait for query results into a wait of about 5 seconds and, for larger queries, reducing a 30-minute wait to about 1 minute.
This was a fairly large project, with roughly 3 million lines of C and C++ code and numerous special-valued constants defined throughout. So instead of a less interesting number like 10000, I chose 10007, so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is record 0, etc.). You could of course do 3 or 4 letters too.
A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done across them.
A search for "bloom" would only load bl.raw, lo.raw, oo.raw and om.raw.
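Roughly, in code (an in-memory sketch; the real thing reads the .raw files from disk, and all names here are made up):

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    using Bitmap = std::vector<std::uint8_t>;  // one byte = 8 records

    // One bitmap per 2-letter combination; bit i set means record i
    // contains that bigram. On disk these are the aa.raw, ab.raw, ...
    // files; in memory here for brevity.
    struct BigramIndex {
        std::map<std::string, Bitmap> bitmaps;
        std::size_t num_bytes = 0;  // common length of all bitmaps

        void add_record(std::size_t id, const std::string& text) {
            std::size_t byte = id / 8, bit = id % 8;
            if (byte >= num_bytes) num_bytes = byte + 1;
            for (std::size_t i = 0; i + 1 < text.size(); ++i) {
                Bitmap& b = bitmaps[text.substr(i, 2)];
                if (b.size() < num_bytes) b.resize(num_bytes, 0);
                b[byte] |= std::uint8_t(1u << bit);
            }
        }

        // AND the bitmaps of the query's overlapping bigrams:
        // "bloom" -> bl, lo, oo, om.
        Bitmap candidates(const std::string& query) const {
            Bitmap acc(num_bytes, 0xFF);
            for (std::size_t i = 0; i + 1 < query.size(); ++i) {
                auto it = bitmaps.find(query.substr(i, 2));
                if (it == bitmaps.end())          // bigram never indexed
                    return Bitmap(num_bytes, 0);  // -> no record can match
                for (std::size_t j = 0; j < num_bytes; ++j)
                    acc[j] &= (j < it->second.size()) ? it->second[j] : 0;
            }
            return acc;
        }
    };

Any bit still set after the ANDs marks a candidate record, which still has to be verified, since the bigrams might not be adjacent in the record.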
The index is really hard to update, but adding new records is easy: new records are first added as false positives until there are enough bits to push a full byte onto the end of each raw file.
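A hypothetical sketch of that append path (assuming a fixed a-z alphabet, and appending a zero byte to untouched files so all files keep the same length and stay record-aligned):

    #include <cstdint>
    #include <fstream>
    #include <map>
    #include <string>

    // Buffer new records until 8 have arrived, then append one byte to
    // every .raw file. Until the flush, queries must treat the buffered
    // records as matches, i.e. deliberate false positives.
    struct PendingByte {
        std::map<std::string, std::uint8_t> bits;  // bigram -> pending bits
        int count = 0;                             // records buffered so far

        void add_record(const std::string& text) {
            for (std::size_t i = 0; i + 1 < text.size(); ++i)
                bits[text.substr(i, 2)] |= std::uint8_t(1u << count);
            if (++count == 8) flush();
        }
        void flush() {
            for (char a = 'a'; a <= 'z'; ++a)
                for (char b = 'a'; b <= 'z'; ++b) {
                    std::string bigram{a, b};
                    std::ofstream f(bigram + ".raw",
                                    std::ios::binary | std::ios::app);
                    auto it = bits.find(bigram);
                    f.put(it == bits.end() ? '\0'
                                           : static_cast<char>(it->second));
                }
            bits.clear();
            count = 0;
        }
    };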
I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Whole words could be treated as if they were unique characters.
Characters (or other bytes) could be combined, like s=z, k=c, x=y or i=e.
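Something like this folding step before computing bigrams would do it, as long as records and queries are folded identically (the mapping is hypothetical):

    #include <string>

    // Fold "equivalent" letters so they share the same .raw files.
    char fold(char c) {
        switch (c) {
            case 'z': return 's';
            case 'c': return 'k';
            case 'y': return 'x';
            case 'e': return 'i';
            default:  return c;
        }
    }

    std::string fold_word(const std::string& s) {
        std::string out;
        out.reserve(s.size());
        for (char c : s) out.push_back(fold(c));
        return out;  // apply to both records and queries
    }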
Calculating which combination is best for a static data set was too hard for me. One could look at the ratio of set bits in a file to see whether a letter combination is worth having.
But it works, and loading a handful of files and ANDing them is amazingly fast.
sanskarix•3h ago
I've seen them save startups real money in caching layers - checking "did we already process this event?" before hitting the database. False positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
The trick is recognizing when false positives don't hurt you. Most engineers learn about them in theory but never find that practical use case where they're actually the right tool. Same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
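For concreteness, the pattern is just this (tiny sketch; the filter is a minimal two-hash stand-in and the DB lookup is a stub):

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <vector>

    // Minimal "ask the filter before the database" sketch.
    struct SeenFilter {
        std::vector<bool> bits = std::vector<bool>(1 << 20);  // ~1 Mbit
        static std::uint64_t h(const std::string& s, std::uint64_t seed) {
            return std::hash<std::string>{}(s) ^ (seed * 0x9e3779b97f4a7c15ULL);
        }
        void add(const std::string& id) {
            bits[h(id, 1) % bits.size()] = true;
            bits[h(id, 2) % bits.size()] = true;
        }
        bool maybe_seen(const std::string& id) const {
            return bits[h(id, 1) % bits.size()] && bits[h(id, 2) % bits.size()];
        }
    };

    bool lookup_in_database(const std::string&) { return true; }  // stand-in

    bool already_processed(SeenFilter& f, const std::string& event_id) {
        if (!f.maybe_seen(event_id))
            return false;                     // definite no: zero DB queries
        return lookup_in_database(event_id);  // maybe: verify here, so the
                                              // rare false positive is harmless
    }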