They're an _amazing_ data structure that, at a fixed size, tracks potential set membership. That means unlike normal b-tree indexes, they don't grow with the number of unique items in the dataset.
This makes them great for "needle in a haystack" search (logs, documents), as implementations like VictoriaMetrics and Bing's BitFunnel show. I've used them in the past, but they've never been center-stage in my projects.
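For intuition, here's a minimal, self-contained sketch of the data structure itself (illustrative only, not the filters BloomSearch ships): a fixed-size bit array plus k hash positions per item, where `Test` can only answer "definitely not present" or "possibly present".

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bloom is a minimal fixed-size bloom filter, for illustration only:
// m bits and k probe positions derived from one 64-bit FNV hash
// (double hashing). Memory never grows, no matter how many items you Add.
type Bloom struct {
	bits []uint64
	m, k uint64
}

func NewBloom(m, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (b *Bloom) hashes(s string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(s))
	sum := h.Sum64()
	h1, h2 := sum&0xffffffff, sum>>32
	if h2 == 0 {
		h2 = 1 // avoid a degenerate second hash
	}
	return h1, h2
}

func (b *Bloom) Add(s string) {
	h1, h2 := b.hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// Test returns false if s is definitely absent, true if it *might* be present.
func (b *Bloom) Test(s string) bool {
	h1, h2 := b.hashes(s)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewBloom(1<<20, 7) // ~128 KiB of bits, regardless of how many items you add
	f.Add("error")
	fmt.Println(f.Test("error"), f.Test("panic")) // true, (almost certainly) false
}
```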
I wanted high cardinality keyword search for ANOTHER project... and, well, down the yak-shaving rabbit hole we go!
BloomSearch brings this into an extensible Go package:
- Very memory efficient via bloom filters and streaming row scans
- DataStore and MetaStore interfaces for any backend (can be same or separate)
- Hierarchical pruning via partitions, minmax indexes, and of course bloom filters
- Search by field, token, or field:token with complex combinators
- Disaggregated storage and compute for unbound ingest and query throughput
And of course, you know I had to make a custom file format ^-^ (FILE_FORMAT.MD)
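To make the DataStore/MetaStore point above concrete, interfaces along these lines would let you plug in any backend. The names and signatures below are my own illustration, not the package's actual definitions (see the repo for those):

```go
package main

import (
	"context"
	"io"
)

// Illustrative only: hypothetical shapes for pluggable storage backends.
// The real interfaces in github.com/danthegoodman1/bloomsearch may differ.

// DataStore holds the data files themselves (e.g. local disk or object storage).
type DataStore interface {
	WriteFile(ctx context.Context, name string, r io.Reader) error
	ReadFile(ctx context.Context, name string) (io.ReadCloser, error)
}

// MetaStore holds the metadata used for pruning (partitions, minmax ranges,
// bloom filters) so queries can skip files without ever opening them.
type MetaStore interface {
	PutFileMeta(ctx context.Context, meta []byte) error
	ListFileMeta(ctx context.Context) ([][]byte, error)
}

// The same backend can implement both, or they can be split, e.g. object
// storage for data and a database for metadata.
func main() {}
```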
BloomSearch is optimized for massive concurrency, arbitrary cardinality and dataset size, and super low memory usage. There's still a lot left on the table in terms of size and performance optimizations, but I'm already super pleased with it. With distributed query processing, I'm targeting >100B rows/s over large datasets.
I'm also excited to replace our big logging bill with ~$0.003/GB log storage, infinite retention, and guilt-free querying :P
SwiftyBug•5h ago
dangoodmanUT•5h ago
- field
- term
- term in field
Each file, and each row group within the file, has 3 bloom filters to handle these queries.
So something like:
{"user": {"name": "John", "tags": [{"type": "user"}, {"role": "admin"}]}}
Gets turned into queryable pairs of:
[{Path: "user.name", Values: ["John"]}, {Path: "user.tags.type", Values: ["user"]}, {Path: "user.tags.role", Values: ["admin"]}]
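A standalone flattener along those lines (my own sketch, not the package's code) could look like this:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PathValues pairs a dotted field path with the values found at that path.
type PathValues struct {
	Path   string
	Values []string
}

// flatten walks arbitrarily nested maps/arrays (as decoded by encoding/json)
// and collects scalar values under their dotted path. Arrays don't add an
// index segment, so "user.tags.type" covers every element of the tags array.
func flatten(prefix string, v any, out map[string][]string) {
	switch t := v.(type) {
	case map[string]any:
		for k, child := range t {
			p := k
			if prefix != "" {
				p = prefix + "." + k
			}
			flatten(p, child, out)
		}
	case []any:
		for _, child := range t {
			flatten(prefix, child, out)
		}
	default:
		out[prefix] = append(out[prefix], fmt.Sprint(t))
	}
}

func main() {
	doc := `{"user": {"name": "John", "tags": [{"type": "user"}, {"role": "admin"}]}}`
	var parsed any
	if err := json.Unmarshal([]byte(doc), &parsed); err != nil {
		panic(err)
	}

	grouped := map[string][]string{}
	flatten("", parsed, grouped)

	var pairs []PathValues
	for path, values := range grouped {
		pairs = append(pairs, PathValues{Path: path, Values: values})
	}
	fmt.Printf("%+v\n", pairs)
	// e.g. [{Path:user.name Values:[John]} {Path:user.tags.type Values:[user]} {Path:user.tags.role Values:[admin]}]
	// (map iteration order is random, so the pair order varies)
}
```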
Then you can search for:
- any record that has "john" in it
- any record that has the "user.tags.type" key
- any record that has "user.tags.type"="user" and "user.tags.role"="admin"
Which bloom filters are used depends on how you build the query, but they test whether a row matching the condition(s) might be in the file/row group.
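Concretely, indexing and pruning with those three filters could look roughly like this; I'm using the bits-and-blooms/bloom package and my own type names purely for illustration, not the package's actual structures:

```go
package main

import (
	"fmt"

	"github.com/bits-and-blooms/bloom/v3" // common bloom filter library, used here only for illustration
)

// RowGroupFilters mirrors the idea of three filters per file / row group:
// one over field paths, one over tokens, one over "field:token" pairs.
type RowGroupFilters struct {
	Fields      *bloom.BloomFilter
	Tokens      *bloom.BloomFilter
	FieldTokens *bloom.BloomFilter
}

func newFilters() *RowGroupFilters {
	return &RowGroupFilters{
		Fields:      bloom.NewWithEstimates(100_000, 0.01),
		Tokens:      bloom.NewWithEstimates(1_000_000, 0.01),
		FieldTokens: bloom.NewWithEstimates(1_000_000, 0.01),
	}
}

// index adds one flattened path/value pair to all three filters.
func (f *RowGroupFilters) index(path, token string) {
	f.Fields.AddString(path)
	f.Tokens.AddString(token)
	f.FieldTokens.AddString(path + ":" + token)
}

// The three query shapes map onto the three filters. "true" means this
// file/row group MIGHT contain a match (scan it); "false" means skip it.
func (f *RowGroupFilters) mightContainToken(token string) bool { return f.Tokens.TestString(token) }
func (f *RowGroupFilters) mightContainField(path string) bool  { return f.Fields.TestString(path) }
func (f *RowGroupFilters) mightContainPair(path, token string) bool {
	return f.FieldTokens.TestString(path + ":" + token)
}

func main() {
	rg := newFilters()
	rg.index("user.tags.type", "user")
	rg.index("user.tags.role", "admin")

	// AND of conditions: only scan the row group if every condition might match.
	scan := rg.mightContainPair("user.tags.type", "user") && rg.mightContainPair("user.tags.role", "admin")
	fmt.Println(scan)                                // true -> scan the row group
	fmt.Println(rg.mightContainToken("nonexistent")) // almost certainly false -> skip
	fmt.Println(rg.mightContainField("user.name"))   // almost certainly false -> skip
}
```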
SwiftyBug•3h ago
panic•3h ago
dangoodmanUT•3h ago
dangoodmanUT•3h ago
The default tokenizer is a whitespace one: https://github.com/danthegoodman1/bloomsearch/blob/148a79967...
So {"name": "John Smith"} is tokenized to [{Path: "name", Values: ["john", "smith"]}], and the bloom filters will store:
- field: "name"
- token: "john"
- token: "smith"
- fieldtoken: "name:john"
- fieldtoken: "name:smith"
The same tokenizer must be used at query time too.
Fuzzy searches and sub-word searches could be supported with custom tokenizers (e.g. trigrams, stemming), but it's more generally targeting "I know some exact subset of the record, I need all records that have it exactly" searches.
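As a standalone sketch of that default behavior (not the linked code): lowercase, split on whitespace, and emit the field, token, and field:token entries that go into the three bloom filters:

```go
package main

import (
	"fmt"
	"strings"
)

// tokenize lowercases a value and splits it on whitespace, mimicking the
// behavior described above.
func tokenize(value string) []string {
	return strings.Fields(strings.ToLower(value))
}

// bloomEntries lists everything that would be inserted into the three
// bloom filters for one field/value pair.
func bloomEntries(path, value string) (fields, tokens, fieldTokens []string) {
	fields = []string{path}
	for _, tok := range tokenize(value) {
		tokens = append(tokens, tok)
		fieldTokens = append(fieldTokens, path+":"+tok)
	}
	return
}

func main() {
	fields, tokens, fieldTokens := bloomEntries("name", "John Smith")
	fmt.Println(fields)      // [name]
	fmt.Println(tokens)      // [john smith]
	fmt.Println(fieldTokens) // [name:john name:smith]
}
```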
bonobocop•3h ago
A way to “handle” partial substrings is to break your input data into tokens (like substrings split on spaces or dashes), and then break your search string up the same way.
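For example (a rough Go sketch, splitting on spaces and dashes), the stored value and the search string go through the same split, and the query becomes an AND over its tokens:

```go
package main

import (
	"fmt"
	"strings"
)

// splitTokens breaks a string on spaces and dashes, lowercased, so
// "bloom-search index" and a query like "Bloom-Search" produce
// comparable tokens.
func splitTokens(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return r == ' ' || r == '-'
	})
}

func main() {
	indexed := splitTokens("bloom-search index") // [bloom search index]
	query := splitTokens("Bloom-Search")         // [bloom search]

	// A record matches if every query token was indexed (AND semantics).
	set := map[string]bool{}
	for _, t := range indexed {
		set[t] = true
	}
	match := true
	for _, t := range query {
		if !set[t] {
			match = false
			break
		}
	}
	fmt.Println(match) // true
}
```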
EGreg•2h ago
dangoodmanUT•1h ago
You can prune based on partitions, then minmax indexes, then bloom filters, before touching any rows. By that point, if all of those checks suggest that the row you're after is in the block, the row group scan covers a very small amount of data.
https://itnext.io/how-do-open-source-solutions-for-logs-work... covers this very well
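A rough sketch of that ordering (the names here are illustrative, not the package's types):

```go
package main

import "fmt"

// FileMeta is an illustrative stand-in for the metadata kept per file /
// row group: a partition key, minmax ranges, and a bloom filter check.
type FileMeta struct {
	Partition string
	MinTime   int64
	MaxTime   int64
	// mightContain stands in for the bloom filter tests described above.
	mightContain func(fieldToken string) bool
}

// shouldScan applies the cheap checks in order; only blocks that survive
// all of them get their rows actually streamed and scanned.
func shouldScan(m FileMeta, partition string, from, to int64, fieldToken string) bool {
	if m.Partition != partition { // 1. partition pruning
		return false
	}
	if m.MaxTime < from || m.MinTime > to { // 2. minmax pruning
		return false
	}
	if !m.mightContain(fieldToken) { // 3. bloom filter pruning
		return false
	}
	return true // 4. fall through to scanning the row group
}

func main() {
	meta := FileMeta{
		Partition:    "2024-06",
		MinTime:      100,
		MaxTime:      200,
		mightContain: func(string) bool { return true },
	}
	fmt.Println(shouldScan(meta, "2024-06", 150, 180, "level:error")) // true -> scan rows
	fmt.Println(shouldScan(meta, "2024-05", 150, 180, "level:error")) // false -> skipped cheaply
}
```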