frontpage.

Moltbook isn't real but it can still hurt you

https://12gramsofcarbon.com/p/tech-things-moltbook-isnt-real-but
1•theahura•3m ago•0 comments

Take Back the Em Dash–and Your Voice

https://spin.atomicobject.com/take-back-em-dash/
1•ingve•3m ago•0 comments

Show HN: 289x speedup over MLP using Spectral Graphs

https://zenodo.org/login/?next=%2Fme%2Fuploads%3Fq%3D%26f%3Dshared_with_me%25253Afalse%26l%3Dlist...
1•andrespi•4m ago•0 comments

Teaching Mathematics

https://www.karlin.mff.cuni.cz/~spurny/doc/articles/arnold.htm
1•samuel246•7m ago•0 comments

3D Printed Microfluidic Multiplexing [video]

https://www.youtube.com/watch?v=VZ2ZcOzLnGg
2•downboots•7m ago•0 comments

Abstractions Are in the Eye of the Beholder

https://software.rajivprab.com/2019/08/29/abstractions-are-in-the-eye-of-the-beholder/
2•whack•8m ago•0 comments

Show HN: Routed Attention – 75-99% savings by routing between O(N) and O(N²)

https://zenodo.org/records/18518956
1•MikeBee•8m ago•0 comments

We didn't ask for this internet – Ezra Klein show [video]

https://www.youtube.com/shorts/ve02F0gyfjY
1•softwaredoug•9m ago•0 comments

The Real AI Talent War Is for Plumbers and Electricians

https://www.wired.com/story/why-there-arent-enough-electricians-and-plumbers-to-build-ai-data-cen...
2•geox•11m ago•0 comments

Show HN: MimiClaw, OpenClaw (Clawdbot) on $5 Chips

https://github.com/memovai/mimiclaw
1•ssslvky1•12m ago•0 comments

How I Maintain My Blog in the Age of Agents

https://www.jerpint.io/blog/2026-02-07-how-i-maintain-my-blog-in-the-age-of-agents/
2•jerpint•12m ago•0 comments

The Fall of the Nerds

https://www.noahpinion.blog/p/the-fall-of-the-nerds
1•otoolep•14m ago•0 comments

I'm 15 and built a free tool for reading Greek/Latin texts. Would love feedback

https://the-lexicon-project.netlify.app/
2•breadwithjam•16m ago•0 comments

How close is AI to taking my job?

https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job
1•cjbarber•17m ago•0 comments

You are the reason I am not reviewing this PR

https://github.com/NixOS/nixpkgs/pull/479442
2•midzer•18m ago•1 comment

Show HN: FamilyMemories.video – Turn static old photos into 5s AI videos

https://familymemories.video
1•tareq_•20m ago•0 comments

How Meta Made Linux a Planet-Scale Load Balancer

https://softwarefrontier.substack.com/p/how-meta-turned-the-linux-kernel
1•CortexFlow•20m ago•0 comments

A Turing Test for AI Coding

https://t-cadet.github.io/programming-wisdom/#2026-02-06-a-turing-test-for-ai-coding
2•phi-system•20m ago•0 comments

How to Identify and Eliminate Unused AWS Resources

https://medium.com/@vkelk/how-to-identify-and-eliminate-unused-aws-resources-b0e2040b4de8
3•vkelk•21m ago•0 comments

A2CDVI – HDMI output from the Apple IIc's digital video output connector

https://github.com/MrTechGadget/A2C_DVI_SMD
2•mmoogle•22m ago•0 comments

CLI for Common Playwright Actions

https://github.com/microsoft/playwright-cli
3•saikatsg•23m ago•0 comments

Would you use an e-commerce platform that shares transaction fees with users?

https://moondala.one/
1•HamoodBahzar•24m ago•1 comment

Show HN: SafeClaw – a way to manage multiple Claude Code instances in containers

https://github.com/ykdojo/safeclaw
3•ykdojo•28m ago•0 comments

The Future of the Global Open-Source AI Ecosystem: From DeepSeek to AI+

https://huggingface.co/blog/huggingface/one-year-since-the-deepseek-moment-blog-3
3•gmays•28m ago•0 comments

The Evolution of the Interface

https://www.asktog.com/columns/038MacUITrends.html
2•dhruv3006•30m ago•1 comment

Azure: Virtual network routing appliance overview

https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-routing-appliance-overview
3•mariuz•30m ago•0 comments

Seedance2 – multi-shot AI video generation

https://www.genstory.app/story-template/seedance2-ai-story-generator
2•RyanMu•33m ago•1 comment

Πfs – The Data-Free Filesystem

https://github.com/philipl/pifs
2•ravenical•37m ago•0 comments

Go-busybox: A sandboxable port of busybox for AI agents

https://github.com/rcarmo/go-busybox
3•rcarmo•38m ago•0 comments

Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery [pdf]

https://research.nvidia.com/labs/nemotron/files/NVFP4-QAD-Report.pdf
2•gmays•38m ago•0 comments

People Keep Inventing Prolly Trees

https://www.dolthub.com/blog/2025-06-03-people-keep-inventing-prolly-trees/
191•lifty•7mo ago

Comments

compressedgas•7mo ago
This article does not mention Jumbostore (Kave Eshghi, Mark Lillibridge, Lawrence Wilcock, Guillaume Belrose, and Rycharde Hawkes), which in 2007 used content-defined chunking recursively on the chunk list of a content-defined-chunked file. This is exactly what a Prolly Tree is.
lawlessone•7mo ago
Amazing! All these people reinvented my SuperMegaTree!
aboodman•7mo ago
I was aware of this kind of structure when I coined 'prolly tree'. It's the same thing bup was doing, which I referenced in our design docs:

https://github.com/attic-labs/noms/blob/master/doc/intro.md#...

The reason I thought a new name was warranted is that a prolly tree stores structured data (a sorted set of k/v pairs, like a b-tree), not blob data. And it has the same interface and utility as a b-tree.

Is it a huge difference? No. A pretty minor adaptation of an existing idea. But still different enough to warrant a different name IMO.

compressedgas•7mo ago
My use of "exactly" was an overstatement. The important difference is that internal nodes in a prolly tree contain not only the hashes of the child nodes but also the index keys, as in a B-tree. The divisions at each level, however, are similarly decided by applying a content-defined chunking method to the entire level of the tree.
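
To make this concrete, here is a minimal Python sketch of building one level of such a tree. The boundary rule, the pattern_bits parameter, and the key/hash encodings are illustrative choices for the sketch, not how noms or Dolt actually chunk their nodes.

    import hashlib

    def build_level(entries, pattern_bits=4):
        """entries: sorted list of (key, payload_bytes) pairs.
        Returns the parent level: one (first key, node hash) entry per chunk.
        A boundary depends only on the hash of the entry it falls on, so an
        edit elsewhere in the level cannot move it."""
        parents, chunk = [], []
        for key, payload in entries:
            chunk.append((key, payload))
            h = hashlib.sha256(repr(key).encode() + payload).digest()
            if h[0] < (1 << (8 - pattern_bits)):   # hits ~1 in 2**pattern_bits entries
                parents.append(make_node(chunk))
                chunk = []
        if chunk:                                  # trailing partial chunk
            parents.append(make_node(chunk))
        return parents

    def make_node(chunk):
        data = b"".join(repr(k).encode() + p for k, p in chunk)
        return (chunk[0][0], hashlib.sha256(data).digest())   # index key + child hash

    # Apply the same chunking to each level in turn until a single root remains.
    leaves = [(k, str(k * k).encode()) for k in range(10_000)]
    level = build_level(leaves)
    while len(level) > 1:
        level = build_level(level)
    root_key, root_hash = level[0]

Because a boundary depends only on the entry it lands on, editing one leaf rewrites only that leaf's chunk and the nodes on its path to the root; every other node hash, and therefore most of the stored tree, is reused.
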
ChadNauseam•7mo ago
Haha, this is funny. I've been obsessed with rolling-hash based chunking since I read about it in the dat paper. I didn't realize there was a tree version, but it is a natural extension.

I have a related cryptosystem that I came up with, but is so obvious I'm sure someone else has invented it first. The idea is to back up a file like so: first, do a rolling-hash based chunking, then encrypt each chunk where the key is the hash of that chunk. Then, upload the chunks to the server, along with a file (encrypted by your personal key) that contains the information needed to decrypt each chunk and reassemble them. If multiple users used this strategy, any files they have in common would result in the same chunks being uploaded. This would let the server provider deduplicate those files (saving space), without giving the server provider the ability to read the files. (Unless they already know exactly which file they're looking for, and just want to test whether you're storing it.)

Tangent: why is it that downloading a large file is such a bad experience on the internet? If you lose internet halfway through, the connection is closed and you're just screwed. I don't think it should be a requirement, but it would be nice if there was some protocol understood by browsers and web servers that would be able to break-up and re-assemble a download request into a prolly tree, so I could pick up downloading where I left off, or only download what changed since the last time I downloaded something.
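
The chunk-and-encrypt backup scheme described above is essentially convergent encryption. A minimal sketch of the per-chunk step, assuming the chunks have already come out of a content-defined chunker and using AES-GCM from the third-party cryptography package (a deterministic nonce is acceptable here only because the key is itself derived from the plaintext):

    import hashlib
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def seal_chunk(chunk: bytes):
        """Convergent encryption: identical plaintext chunks yield identical
        ciphertexts, so the server can deduplicate without reading them."""
        key = hashlib.sha256(chunk).digest()               # key = hash of the chunk
        ciphertext = AESGCM(key).encrypt(b"\x00" * 12, chunk, None)
        chunk_id = hashlib.sha256(ciphertext).hexdigest()  # content address for upload
        return chunk_id, key, ciphertext

    # The client uploads the {chunk_id: ciphertext} blobs; the per-file manifest
    # [(chunk_id, key), ...] is encrypted with the user's own key before upload.
    # As the comment notes, the provider can still test whether you hold a
    # specific file it already knows.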

wakawaka28•7mo ago
I think the cost of processing stuff that way would far exceed the cost of downloading the entire file again. You can already resume downloads from a byte offset if the server supports it, and that probably covers 99% of the cases where you would actually want to resume a download of a single file. Partial updates are rarely possible for large files anyway, as they are often compressed. If the host wants partial updates to make sense, they could serve the file over rsync.
nicoburns•7mo ago
Bittorrent is the protocol you're looking for. Unfortunately not widely adopted for the use cases you are talking about.
theLiminator•7mo ago
Sounds similar to IPFS.
Retr0id•7mo ago
> If you lose internet halfway through, the connection is closed and you're just screwed. [...] it would be nice if there was some protocol understood by browsers and web servers

HTTP Range Requests solve this without any clever logic, if mutually supported.

motorest•7mo ago
> HTTP Range Requests solve this without any clever logic, if mutually supported.

Understated comment in the thread.

The very first search hit on Google is none other than Mozilla's page on ranged requests.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...

Here's the leading summary from that page.

> An HTTP Range request asks the server to send parts of a resource back to a client. Range requests are useful for various clients, including media players that support random access, data tools that require only part of a large file, and download managers that let users pause and resume a download.

Here's an RFC:

https://datatracker.ietf.org/doc/html/rfc7233
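
For completeness, a minimal resumable download on top of Range requests, sketched in Python with the requests package (the URL, path, and chunk size below are placeholders):

    import os
    import requests

    def resume_download(url, path, chunk_size=1 << 20):
        have = os.path.getsize(path) if os.path.exists(path) else 0
        headers = {"Range": f"bytes={have}-"} if have else {}
        with requests.get(url, headers=headers, stream=True, timeout=30) as r:
            r.raise_for_status()
            if have and r.status_code != 206:   # server ignored the Range header
                have = 0                        # fall back to a full download
            with open(path, "ab" if have else "wb") as f:
                for part in r.iter_content(chunk_size):
                    f.write(part)

    # resume_download("https://example.com/big.iso", "big.iso")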

RainyDayTmrw•7mo ago
AES-GCM-SIV[1] does something similar to your per chunk derived key, except that AES-GCM-SIV expects the key to be user-provided, and the IV is synthetic - hence Synthetic IV mode.

What's your threat model? This has "interesting"[3] properties. For example, given a file, the provider can figure out who has the file. Or, given a file, an arbitrary user can figure out if some other user already has the file. Users may even be able to "teleport" files to each other, like the infamous Dropbox Dropship[2].

I suspect the reasons no one has tried this are many-fold: (1) Most providers want to store plaintext. Those few providers who don't want to store plaintext, whether for secrecy or deniability reasons, also don't want to store anything else correlatable, either. (2) Space is cheap. (3) Providers like being able to charge for space. Since providers sell space at a markup, they almost want you to use more space, not less.

[1]: https://en.wikipedia.org/wiki/AES-GCM-SIV [2]: https://en.wikipedia.org/wiki/Dropship_(software) [3]: "Interesting" is not a word you want associated with your cryptography usage, to say the least.

1vuio0pswjnm7•7mo ago
"Tangent: why is it that downloading a large file is such a bad experience on the internet?"

This comment could only come from someone who never downloaded large files from the internet in the 1990s.

Feels like heaven to me downloading today.

Watching video from YouTube, Facebook, etc., if accessed via those websites with their JavaScript running, usually uses the Range header. Some people refer to the "break-up and re-assembly" as "progressive download".

HelloNurse•7mo ago
Adding tangent to tangent, I recently experienced an unexpected modern counterpart of a 1990s large download: deleting about 120K emails from a GMail folder, then purging them for real by "emptying" the GMail "trash bin".

The first phase was severely asynchronous, with a popup mentioning "the next few minutes", which turned out to be hours. Manually refreshing the page showed a cringeworthy deletion rate of about 500 messages per minute.

But at least it worked; the second phase was more special, with plenty of arbitrary stopping and outright lies. After repeated purging attempts I finally got an empty bin achievement page on my phone but I found over 50K messages in the trash on my computer the next day, where every attempt to empty the trash showed a very slow progress dialog that reported completion but actually deleted only about 4K messages.

I don't expect many JavaScript card castles of the complexity of GMail message handling to be tested on large jobs; at least old FTP and web servers were designed with high load and large files in mind.

zokier•7mo ago
Video streaming usually uses something like DASH/HLS and is a fair bit more complicated than Range headers. Notably, this means that downloading the video requires reversing the streaming format and gluing the segments together.
1vuio0pswjnm7•7mo ago
In recent times, large video files could often be downloaded in the popular browsers by changing a URL path parameter like "r=1234567" to "r=0". I have downloaded many large videos that way.

DASH is used sometimes, but not on the majority of videos I encounter. Of course this can change over time. The point is that downloading large files today, e.g., from YouTube, Facebook, etc., has been relatively fast and easy compared with the 90s, when speeds were slower and interruptions more common, even though these websites might be changing how they serve these files behind the scenes and software developers gravitate toward complexity.

Commercial "streaming", e.g., ESPN, etc., might be intentionally difficult to download and might involve "reversing" and "glueing" but that is not what I'm describing.

vanderZwan•7mo ago
> the dat paper

What's the name of the paper you're alluding to? I'm not familiar with it and it sounds interesting

aboodman•7mo ago
https://github.com/dat-ecosystem-archive/whitepaper/blob/mas...
vanderZwan•7mo ago
Thank you!
layer8•7mo ago
> This would let the server provider deduplicate those files (saving space), without giving the server provider the ability to read the files.

This gives the service provider the ability to see who is storing the same files, however, which can be sensitive information. Moreover, once they know/decrypt a file for one user, they know that file for all users.

rakoo•7mo ago
It does sound similar to ideas in Tahoe-LAFS: https://tahoe-lafs.readthedocs.io/en/latest/architecture.htm...

Which has already thought about attacks on the scheme you described: https://tahoe-lafs.org/hacktahoelafs/drew_perttula.html

ChadNauseam•7mo ago
Wow, the idea of adding a key to create groups where only someone inside the group could carry out the attack is awesome.
iamwil•7mo ago
Anyone know if editing a prolly tree requires reconstructing the entire tree from the leaves again? All the examples I've ever seen in the wild reconstruct from the bottom up. Presumably, you can leave the untouched leaves intact and only reconstruct the parent nodes whose hashes have changed due to the changed leaves. I ended up doing an implementation of this, and wondered if it's of any interest or value to others?
aboodman•7mo ago
I am confused by this question. Both noms and dolt, and presumably most other prolly tree implementations, do what you propose. If they didn't, inserts would be terribly slow.
iamwil•7mo ago
That’s what I figured, and answers my question.
gritzko•7mo ago
Prolly Trees are Merkle-fied B-trees, essentially.

I am working on related things[r], using Merkle-fied LSM trees. Ink&Switch do things that very closely resemble a Merkle-fied LSM[h], although they are not exactly LSM. I would not be surprised if someone else is doing something similar in parallel. The tricks are very similar to Prollies, but with LSM instead of B-trees.

That reminds me of my younger years, when I "invented" the Causal Tree[c] data structure. It was later reinvented as RGA (Replicated Growable Array [a]), Timestamped Insertion Tree and, I believe, YATA. All seem to be variations of a very, very old revision control data structure named "weave"[w].

Recently I improved CT to the degree that warranted a new algorithm name (DISCONT [d]). Fundamentally the same, but much cheaper. Probably, we should see all these "inventions" as improvements. All the Computer Science basics seem to have been invented in the 70s, or the 80s at the latest.

[w]: https://docs.rs/weave/latest/weave/

[r]: https://github.com/gritzko/librdx

[d]: https://github.com/gritzko/go-rdx/blob/main/DISCOUNT.md

[h]: https://www.inkandswitch.com/keyhive/notebook/05/

[c]: https://dl.acm.org/doi/10.1145/1832772.1832777

[a]: https://pages.lip6.fr/Marc.Shapiro/papers/RR-7687.pdf links to the authors of RGA

lifty•7mo ago
This is the first time I've heard about librdx and Chotki, very cool projects. I skimmed over both projects but I haven't seen much written about conflict resolution. Does it mean that last write wins based on version vectors?
gritzko•7mo ago
I am a (co)author. Conflict resolution is CRDT in all cases.

FIRST (Float Int Reference String Term): Last-Write-Wins based on the timestamp,

PLEX:

- Tuples: per-entry LWW or recursive,

- Linear: DISCONT (CT/RGA type),

- Eulerian: per-key LWW or recursive,

- Multiplexed (version vectors, counters): per-author LWW or recursive.

zombot•7mo ago
What the hell does "probabilistically balanced" mean?
moomin•7mo ago
It means the balancing is content-dependent: the tree is normally balanced, but certain edge-case inputs may result in sub-optimal behaviour.
zombot•7mo ago
What is probabilistic about that? It sounds deterministic.
mcherm•7mo ago
The probabilistic part is the specific behavior of the hash function (including the salt you chose). Choosing a different hash function (or a different salt) would result in a different breakdown into chunks.

In principle, if your data were specially crafted to exploit the specific hash function (and salt) you could get an aberrant case like 1 million entries in a single b-tree node or a million b-tree nodes with just one entry. But unless you intentionally exploit the hash function the chance of this is vanishingly small.
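
Put numerically: if each entry ends a chunk with probability p (say p = 2^-k, by testing k bits of its hash), chunk sizes follow a geometric distribution, so

    E[chunk size] = 1/p = 2^k
    P(chunk size > n) = (1 - p)^n

A node ten times the target size therefore shows up with probability about (1 - p)^(10/p) ≈ e^-10 (roughly 5 in 100,000), and the degenerate cases (one giant node, or all single-entry nodes) effectively require input crafted against the specific hash and salt, as described above.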

moomin•7mo ago
It's a term of art. We say the same thing about quicksort being "usually" O(n log n)
judofyr•7mo ago
The opposite of probabilistic is not deterministic in this context. This is not about «drawing a random number», but rather that balancing is dependent on the input data. «With high probability» here means «majority of the possible input data leads to a balanced structure».

If it was not probabilistic then the balancing would be guaranteed in all cases. This typically means that it somehow stores balancing information somewhere so that it can detect when something is unbalanced and repair it. In this data structure we’re just hashing the content without really caring about the current balance and then it turns out that for most inputs it will be fine.

stonemetal12•7mo ago
Like quicksort is O(N log N) on average but can degrade to O(N^2) in the worst case. The tree is balanced on average, but can degrade to far from balanced in the worst case.
timsehn•7mo ago
We actually wrote another blog about this because I had the same question. I am the CEO of DoltHub so I can have my engineers write stuff to explain it to me :-)

https://www.dolthub.com/blog/2025-06-26-prolly-tree-balance/

wmanley•7mo ago
Here's apenwarr's description of the same data structure from 2009: https://apenwarr.ca/log/20091004 .

Here's a post to the git mailing list from Martin Uecker describing the same from 2005: https://lore.kernel.org/git/20050416173702.GA12605@macavity/ . From the tone of the email it sounds like he didn't consider the idea new at that point:

> The chunk boundaries should be determined deterministically from local properties of the data. Use a rolling checksum over some small window and split the file it it hits a special value (0). This is what the rsyncable patch to zlib does.

He calls it a merkle hash tree.

Edit: here's one that's one day earlier from C. Scott Ananian: https://lore.kernel.org/git/Pine.LNX.4.61.0504151232160.2763...

> We already have the rsync algorithm which can scan through a file and efficiently tell which existing chunks match (portions of) it, using a rolling checksum. (Here's a refresher: http://samba.anu.edu.au/rsync/tech_report/node2.html ). Why not treat the 'chunk' as the fundamental unit, and compose files from chunks?
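
The rolling checksum those emails refer to is the rsync weak checksum. Here is a sketch of using it to pick chunk boundaries; the window size and mask are arbitrary illustrative values (mask 0x1FFF gives an expected chunk of about 8 KiB):

    import os

    def boundaries(data, window=64, mask=0x1FFF):
        """Slide an rsync-style weak checksum over the data and cut a chunk
        wherever its low bits are zero. Both sums update in O(1) per byte."""
        if len(data) < window:
            return []
        a = sum(data[:window])                                   # plain sum of the window
        b = sum((window - i) * data[i] for i in range(window))   # position-weighted sum
        cuts = [window] if (b & mask) == 0 else []
        for i in range(window, len(data)):
            out_byte, in_byte = data[i - window], data[i]
            a += in_byte - out_byte            # slide the window by one byte
            b += a - window * out_byte
            if (b & mask) == 0:
                cuts.append(i + 1)             # chunk ends after byte i
        return cuts

    print(len(boundaries(os.urandom(1 << 20))), "boundaries in 1 MiB")   # ~128 expected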

elric•7mo ago
How do people find specialised data structures that they aren't already aware of? Stumbling across random blog posts and reading the odd book on data structures can't be the optimal way.

Is there a way to search for a structure by properties? E.g. O(1) lookups, O(log(n)) inserts or better, navigates like a tree (just making this up), etc?

Loranubi•7mo ago
I was trying to catalog them at some point in a reasonably structured way: https://github.com/Dobatymo/data-algos/ https://github.com/Dobatymo/data-algos/blob/master/data-stru... But it's a lot of work and I haven't updated it for a while.
hiAndrewQuinn•7mo ago
My understanding is you basically just bash your head against the problem for long enough, and simultaneously have enough of a grounding in the fundamentals, that you just start to come up with it as the obvious next thing. In other words, there's no trick to it, just expertise, hard work, and an eye for what's relevant and what's irrelevant in the problem.
donatj•7mo ago
It would probably help if they had a Wikipedia page. Someone who actually understands what they are should get on that.

Googling "Prolly Trees", there's not much and this article is one of the top results.

inetknght•7mo ago
> Sometimes an invention is not widely known because its creator doesn't realize they've created something novel: the design seemed obvious and intuitive to them.

I never went to high school or college or anything.

I can't tell you how many times I come up with something, only to discover years later that someone else came up with the same idea later (or sometimes earlier), branded it, and marketed it.

jerf•7mo ago
I kind of like data structures, because when you study them, and take them apart, and understand their pieces, rather than seeing the world as Prolly Trees here and Binary Trees there and Bloom Filters over there, you see a whole bunch of little tricks you can use, and when you collect a reasonably large bag of those tricks you can put them together in all sorts of ways.

It's almost a pity computers are as fast as they are and these tricks are so rarely needed, because having "arrays" and "maps/dicts/associative arrays/whatever" solves so many problems so much faster than we need anyhow. I don't get to pull out the bag of tricks very often. But then again, when I do, it's because it's a life saver and the difference between success and failure, so maybe it all balances out.

shadowgovt•7mo ago
It's fun to go look at the actual implementation of those primitives in languages and libraries and see how complex they can be under the hood. Some of them incorporate two or three algorithms and switch them on the fly based on profiling the incoming data. I remember being startled to learn that Cocoa's "NSString" supports ropes, caching of transforms, and I think even some translation primitives, all quietly switching on and off as needed.

Developers have less control over the particulars (and may miss optimization opportunities when they could make guarantees about the shape of the problem), but it benefits the common case.

guywithahat•7mo ago
One thing an exec said at my old job that I liked was "research is generally ~5-10 years behind industry", and I see that still seems to be the case for prolly trees.