Extending That XOR Trick to Billions of Rows

https://nochlin.com/blog/extending-that-xor-trick

104•hundredwatt•6mo ago

Comments

dataflow•6mo ago

Before you get too excited, this is a probabilistic algorithm, not a deterministic one. Feels weird to call it an "extension" when you lose an absolute guarantee, but still cool nonetheless.

nyrikki•6mo ago

> Finally, they introduce Invertible Bloom Filters, which add an exact get operation and a probabilistic listing operation.

I haven't spent time digging into the implementation details, but the exact get should allow for verification.

It is not uncommon to use probabilistic methods to reduce search space.

dataflow•6mo ago

I haven't dug into the gory details either, but later they say:

> To fully generalize this into a robust data structure, we need:

> (1) A partitioning scheme that creates recoverable partitions with high probability

> (2) An iterative process that uses recovered values to unlock additional partitions

And they also say:

> With proper sizing (typically m > 1.22d cells), IBFs recover the full symmetric difference with very high probability.

It really doesn't sound like this is both exact and also running in linear time like XOR... right? Perhaps somehow the error is one-sided but then the time bound is probabilistic? If I'm missing something and they're truly maintaining both an absolute guarantee and an absolute time bound, this is mindblowing! But I don't get that impression?

nullc•6mo ago

There is no absolute guarantee. You can have an arbitrarily large multiple and the decode can still fail when a set of entries exist that form a cycle, it just becomes quite unlikely as the overhead goes up.

One of the ways of hybridizing iblt and exact algebraic techniques like the minisketch library I link in my other post is to staple a small algebraic sketch to the iblt. If the iblt is successful you're done, if it gets stuck you use take the recovered elements out of the algebraic sketch and decode that. It's fast to decode the algebraic sketch in spite its O(n^2) behavior because it's small, and it'll always be successful if there are few enough elements (unlike the iblt).

Sadly this still doesn't give a guarantee since you might have more elements in a cycle than the size of the backup, but small cycles are more likely than big ones so there exists a range of sizes where it's more communications efficient than a larger iblt.

hundredwatt•6mo ago

You don't lose absolute guarantees, but the probabilistic nature means the process may fail (in a guaranteed detectable way) in which case you can try again with a larger parameter.

The "bloom filter" name is misleading in regard to this.

nullc•6mo ago

> (in a guaranteed detectable way)

To be pedantic, not guaranteed. The xor of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode. You can make the probability of this as low as you like by using a larger checksum, but the initial HN example was IIRC a list of 32-bit integers, so using a (say) 128 bit checksum to make false decodes 'cryptographically unlikely' would come at a rather high cost since it's a size added to each bucket.

If your sent members are multi-kilobyte contact database entries or something than the overhead required to make false decode impossible would be insignificant.

This limitation also applies somewhat to the alternative algebraic approach in my comments-- an overfull sketch could be falsely decoded--, except the added size needed to make a false decode cryptographically unlikely is very small and goes down relative to the size of the sketch as the sketch grows instead of being linear in the size of the sketch.

I haven't looked at your implementation but it can be useful to have at least 1 bit counter or just make the LSB of your checksum always 1. Doing so prevents falsely decoding an overfull bucket with an even number of members in it, and since the distribution of members to bucket is binomial 2 is an extremely common number for overfull buckets. You can use a counter bigger than 1 bit (and combine it with addition in its ring rather than xor), but the tradeoff vs just having more checksum bits is less obvious.

It's probably an interesting open question about the existence of checksums such that the xor of 2..N valid codewords is unlikely to be a valid codeword... the "always emit 1" function has perfect performance for even values but are there schemes that still contribute distance even in cases were the N isn't completely precluded?

hundredwatt•6mo ago

> To be pedantic, not guaranteed. The xor of multiple elements may erroneously have a passing checksum, resulting in an undetected false decode

The false decodes can be detected. During peeling, deleting a false decode inserts a new element with the opposite sign of count. Later, you decode this second false element and end up with the same element in both the A / B and B / A result sets (as long as decode completes without encountering a cycle).

So, after decode, check for any elements present in both A / B and B / A result sets and remove them.

Beyond that, you can also use the cell position for additional checksum bits in the decode process without increasing the data structure's bit size. i.e., if we attempt to decode the element X from a cell at position m, then one of the h_i(x) hash functions for computing indices should return m.

There's even a paper about a variant of IBFs that has no checksum field at all: https://arxiv.org/abs/2211.03683. It uses the cell position among other techniques.

dark-star•6mo ago

It would be nice if they explained what XOR trick that is. It seems to have something to do with finding missing numbers in a list?

foota•6mo ago

It's a solution to the problem: given a list of n-1 unique integers 1 through n, find the missing integer.

The trick is that when you xor all of the numbers in the list together and then xor that with the xor of 1 through n, the result is the missing number.

JPLeRouzic•6mo ago

I know XOR only in the context of binary numbers. Is this "XOR trick" more general?

nromiun•6mo ago

Every number on computers is converted to binary internally, so yes this works on decimal numbers too.

fc417fc802•6mo ago

> so yes this works on decimal numbers too.

Given that rounding tends to be necessary that seems extremely questionable in practice. Similar to how the equality operator in most (probably all) languages can be used with floating point numbers but in most cases that is a very bad idea.

hyperhello•6mo ago

Integers don’t have to be stored as floating point.

fc417fc802•6mo ago

And yet a great many applications that use numbers with a decimal point will involve rounding and/or inexact matches.

williamdclt•6mo ago

It works on the binary representation so it actually works for any data type, even composite types! It won't resolve pointers/references/aliases of course

nullc•6mo ago

>the xor of 1 through n

XOR[0...x] = (x&1^(x&2)>>1)+x*(~x&1)

Of course, you don't have to do this with xor, you can do it with just addition (mod wordmax) and the well known identity attributed to gauss will support you.

krior•6mo ago

https://florian.github.io//xor-trick/ if you want detailed information and a list of possible applications.

dzaima•6mo ago

A rough sketch for a more direct way to extend the XOR trick to finding more than two differences:

For e.g. 3 differences: instead of a binary xor (i.e. binary-digit-wise sum mod 2), do a binary-digit-wise sum mod 3 (negating one input); a 0 (mod 3) sum result for a given bit means that the bit is the same in all entries, and 1 (mod 3) or 2 (mod 3) mean that you can partition on the bit, resulting in partitions with sizes `sum` and `input_different_element_count - sum`; then you repeat this recursively until they reach containing just 1 difference. (rounding the modulo up to the next power of two instead of odd modulos for the summing is perfectly fine, the final infinite-precision sum is in the range of [0; diffcount] anyway)

Extends trivially to more than 3 differences, and collapses to the basic trick for 2 differences. The accumulator size is O(log(diffcount) * element_size), but the recursive partitioning takes O(n) space or O(diffcount * n) time (plus some logarithm something maybe). Tradeoffs are probably reasonably possible, but the basic hashset approach can reduce its O(n) space requirement at the cost of taking >O(n) time too by partitioning on a hash.

nullc•6mo ago

iblt has low space efficiency for small sets, small elements, and low failure rates (and on that note is only probabilistic in its success).

We implemented https://github.com/bitcoin-core/minisketch which has optimal size efficiency-- N bits of state will always correctly recover when there are N or fewer bits of set difference, even when the set elements are small (like 32 bits, for example).

So for example you can you and I can each have sets of, say, ten thousand 32-bit elements which are identical except for 10 entries, and I can send you a 320 bit (32*10) sketch of my set and from that you can always determine the 10 (or fewer) differences. The same element and difference size with IBLT would likely take thousands of bits to have a low failure rate.

The downside is that the minisketch approach has quadratic decode complexity in the size of the set difference, but this is not a big deal when the number of differences is small by construction or thanks to recursive subdivision.

For cases where the differences are large iBLT eventually wins out-- the two ideas can also be hybridized in a variety of ways. E.g. using minisketch to make multi-element buckets in an iblt analogous to blocked bloom filter or the normal practice with cuckoo filters.

Another related scheme is cpisync which was used for many years by SKS key servers. It has communications efficiency like minisketch, but cubic decode costs.

javcasas•6mo ago

So the XOR initial trick is: use a hash to partition the data into batches so that each batch has up to 1 missing element.

Can't we use this again? I mean:

1. Partition the data so that some batches have up to 1 missing element.

2. Recover the elements where possible with the XOR trick.

3. Pick another hash function, then repeat finding more missing elements.

4. Repeat until no more missing elements.

dzaima•6mo ago

A property of the initial XOR trick for 2 different elements is that it guarantees finding a way to partition in one pass (and with very trivial code; no hashing involved!), which is lost by replacing that with hashing. (the original trick does take two passes - finding the bit to partition on, and doing the actual partitioning, whereas hashing is 1+ε passes, but the first pass in the original is just an xor-fold, and the partitioning really only needs to be a "accumulator ^= (current_val & mask) ? current_val : 0" (other partition is just xoring the results of both passes), both of which can be trivially parallelized and SIMD'd with O(1) extra memory usage)

The approach in my comment achieves guaranteeing finding partitions, and still avoids actual hashing or anything strictly-probabilistic, but does still lose the extreme triviality and mechanical sympathy of the original approach.

hundredwatt•6mo ago

The graph constructed by using bloom filter-style hash functions supports a decoding process called "peeling" where you:

1. Find a batch with 1 missing element 2. Delete that element from its other assigned partitions 3. Repeat, as the modified batches may now be recoverable

This iterative process (surprisingly!) succeeds with very high probability as long as the number of partitions is 1.22x larger than the number of missing elements with k=3 hash functions.

Al Lowe on model trains, funny deaths and working with Disney

Hoot: Scheme on WebAssembly

First Proof

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Reinforcement Learning from Human Feedback

The Waymo World Model

Start all of your commands with a comma (2009)

France's homegrown open source online office suite

Vocal Guide – belt sing without killing yourself

The AI boom is causing shortages everywhere else

Coding agents have replaced every framework I used

Software factories and the agentic moment

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

British drivers over 70 to face eye tests every three years

Making geo joins faster with H3 indexes

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

Ga68, a GNU Algol 68 Compiler

Show HN: I spent 4 years building a UI design tool with only the features I use

An Update on Heroku

Show HN: If you lose your memory, how to regain access to your computer?

What Is Ruliology?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Al Lowe on model trains, funny deaths and working with Disney

Hoot: Scheme on WebAssembly

First Proof

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Reinforcement Learning from Human Feedback

The Waymo World Model

Start all of your commands with a comma (2009)

France's homegrown open source online office suite

Vocal Guide – belt sing without killing yourself

The AI boom is causing shortages everywhere else

Coding agents have replaced every framework I used

Software factories and the agentic moment

A Fresh Look at IBM 3270 Information Display System

72M Points of Interest

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

Learning from context is harder than we thought

Monty: A minimal, secure Python interpreter written in Rust for use by AI

British drivers over 70 to face eye tests every three years

Making geo joins faster with H3 indexes

Hackers (1995) Animated Experience

Sheldon Brown's Bicycle Technical Info

Ga68, a GNU Algol 68 Compiler

Show HN: I spent 4 years building a UI design tool with only the features I use

An Update on Heroku

Show HN: If you lose your memory, how to regain access to your computer?

What Is Ruliology?

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

Extending That XOR Trick to Billions of Rows

Comments