Two Bits Are Better Than One: making bloom filters 2x more accurate

https://floedb.ai/blog/two-bits-are-better-than-one-making-bloom-filters-2x-more-accurate

42•matheusalmeida•4d ago

Comments

pkoird•1h ago

Clever. My first impression was that surely this saturates the filter too fast, as we're setting more bits at once, but it looks like the maths checks out. It's one of those non-intuitive things that I am glad I learned today.

lemagedurage•1h ago

True, I had the same feeling. The article does base its numbers on 256K elements in a Bloom filter of 2M. After 1M elements, using 2 bits actually increases the false positive rate, but at that point the false positive rate is already higher than 50%.
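A quick way to sanity-check the crossover is the textbook approximation for a classic (unblocked) Bloom filter, FPR ≈ (1 - e^(-kn/m))^k. The sketch below assumes "2M" means 2M bits and that the classic formula is a fair stand-in for the article's layout; both are assumptions, and `classic_fpr` is a name made up for this example.

```python
import math

def classic_fpr(n, m, k):
    """Textbook false positive estimate for a classic Bloom filter.

    n: inserted elements, m: filter size in bits, k: bits set per element.
    This is the standard approximation, not the article's exact model.
    """
    return (1.0 - math.exp(-k * n / m)) ** k

M_BITS = 2 * 1024 * 1024  # assumption: "2M" in the thread means 2M bits

for n in (256 * 1024, 1024 * 1024):
    one = classic_fpr(n, M_BITS, 1)
    two = classic_fpr(n, M_BITS, 2)
    print(f"n={n}: k=1 -> {one:.3f}, k=2 -> {two:.3f}")
```

Under this approximation, two bits per element come out a bit more than 2x better at 256K elements and slightly worse at 1M. The absolute rates from this formula need not match the ones the article's layout produces.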

vlmutolo•59m ago

This article is a little confusing. I think this is a roundabout way to invent the blocked bloom filter with k=2 bits inserted per element.

It seems like the authors wanted to use a single hash for performance (?). Maybe they correctly determined that naive Bloom filters have poor cache locality and reinvented blocked Bloom filters from there.

Overall, I think blocked Bloom filters should be the default most people reach for. They completely solve the cache locality issues (a single cache miss per element lookup), and they cost only about 10–15% more space to do it. I had a simple implementation running at something like 20ns per query with maybe k=9. It would be about 9x that for naive Bloom filters.

There’s some discussion in the article about using a single hash to come up with various indexing locations, but it’s simpler to just think of blocked Bloom filters as:

1. Hash-0 gets you the block index

2. Hash-1 through hash-k get you the bits inside the block

If your implementation slices up a single hash to divide it into multiple smaller hashes, that’s fine.
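The two steps above can be sketched roughly as follows. This is a minimal illustration, not anyone's production code: the class name `BlockedBloom`, the 512-bit block size, and the use of BLAKE2b as the single sliced-up hash are all assumptions made for the example.

```python
import hashlib

BLOCK_BITS = 512  # assumed block size: one 64-byte cache line


class BlockedBloom:
    """Hypothetical blocked Bloom filter: one block per element, k bits in it."""

    def __init__(self, num_blocks, k):
        # A 256-bit digest gives 64 bits for the block index and
        # 192 bits for 9-bit slices, so at most 21 in-block positions.
        assert 1 <= k <= 21
        self.num_blocks = num_blocks
        self.k = k
        self.blocks = [0] * num_blocks  # each int is used as a 512-bit bitmap

    def _locate(self, item):
        digest = hashlib.blake2b(item, digest_size=32).digest()
        h = int.from_bytes(digest, "little")
        # "Hash-0": the low 64 bits pick the block.
        block = (h & 0xFFFFFFFFFFFFFFFF) % self.num_blocks
        h >>= 64
        # "Hash-1 .. hash-k": successive 9-bit slices pick bits in the block.
        mask = 0
        for _ in range(self.k):
            mask |= 1 << (h & (BLOCK_BITS - 1))
            h >>= 9
        return block, mask

    def add(self, item):
        block, mask = self._locate(item)
        self.blocks[block] |= mask

    def contains(self, item):
        block, mask = self._locate(item)
        return self.blocks[block] & mask == mask


bf = BlockedBloom(num_blocks=1024, k=9)
bf.add(b"hello")
print(bf.contains(b"hello"))  # True: inserted items are never reported missing
```

Every lookup touches exactly one entry of `blocks`, which is the property that gives a single cache miss per query in a real implementation with flat memory.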

sakras•17m ago

Yeah, I kind of think the authors didn't conduct a thorough enough literature review here. There are well-known relations between the number of hash functions you use and the FPR, cache-blocking and register-blocking are classic techniques (Cache-, Hash-, and Space-Efficient Bloom Filters by Putze et al.), and there are even ways of generating patterns from only a single hash function that work well (shamelessly shilling my own blogpost on the topic: https://save-buffer.github.io/bloom_filter.html)

I also find the use of atomics to build the filter confusing here. If you're doing a join, you're presumably doing a batch of hashes, so it'd be much more efficient to partition your Bloom filter, lock the partitions, and do a bulk insertion.
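The partition-then-bulk-insert idea could look something like the sketch below. It assumes 64-bit hash values are already computed for the batch, a filter made of 64-bit words with two bits set per element, and contiguous partitions each guarded by its own lock; the names `bulk_insert` and `maybe_contains` and the bit layout are illustrative, not taken from the article.

```python
import threading
from collections import defaultdict

WORD_BITS = 64  # assumed word size


def _word_and_mask(h, num_words):
    # Low bits pick the word; two 6-bit slices from the top pick the bits.
    word = (h & 0xFFFFFFFF) % num_words
    mask = (1 << ((h >> 58) & 63)) | (1 << ((h >> 52) & 63))
    return word, mask


def bulk_insert(words, locks, hashes):
    """Group a batch by partition, then take each partition lock only once."""
    assert len(words) % len(locks) == 0
    per_part = len(words) // len(locks)
    buckets = defaultdict(list)
    for h in hashes:
        word, mask = _word_and_mask(h, len(words))
        buckets[word // per_part].append((word, mask))
    # In a real system each partition would be handed to a different thread;
    # here they run one after another to keep the sketch short.
    for part, entries in buckets.items():
        with locks[part]:
            for word, mask in entries:
                words[word] |= mask


def maybe_contains(words, h):
    word, mask = _word_and_mask(h, len(words))
    return words[word] & mask == mask


words = [0] * 1024
locks = [threading.Lock() for _ in range(8)]
batch = [(i * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF for i in range(1, 5000)]
bulk_insert(words, locks, batch)
print(all(maybe_contains(words, h) for h in batch))  # True
```

The point of the grouping step is that lock traffic scales with the number of partitions per batch rather than with the number of elements, which is the trade being suggested instead of one atomic operation per insert.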

h33t-l4x0r•23m ago

Hmm, Bloom filters seem important. I'm wondering why my CS education never even touched on them and it's tbh triggering my imposter syndrome.

