B-trees and database indexes (2024)

https://planetscale.com/blog/btrees-and-database-indexes

124•tosh•5d ago

Comments

game_the0ry•5d ago

This has been post before, but planetscale also has a great sql for developers course:

https://planetscale.com/learn/courses/mysql-for-developers

bddicken•5d ago

Oh hey, I wrote this! Happy to chat more about the article here. Databases are kinda my thing.

mamcx•5d ago

I remember this article for when I was researching for https://spacetimedb.com/. The interactivity is very cool, BTW!

One neat realization is that a database is in fact more about indexes than the actual raw tables (all things interesting work under this assumption), to the point that implementing the engine you get the impression that everything start with "CREATE INDEX" than "CREATE TABLE". This includes sequential scans, where as visualized in your article show that lay the data sequentially is in fact a form of index.

Now, I have the dream of make a engine more into this vision...

amarant•5d ago

Thanks for writing this! The visualisations really drive a better understanding than pure text does, and it's quite clear that you have a better understanding of what database do under the hood than I do.

As such, I have a question for you: contrary to your article, I've always been taught that random primary keys are better than sequential ones. The reason for this, I was told, was to avoid "hotspots". I guess it only really applies once sharding comes into play, and perhaps also only if your primary key is your sharding key, but I think that's a pretty common setup.

I'm not really sure how to formulate a concrete question here, I guess I would like to hear your thoughts on any tradeoffs on sequential Vs random keys in sharded setups? Is there a case there random keys are valid, or have I been taught nonsense?

bddicken•5d ago

B+trees combined with sequential IDs are great for writes. This is because we are essentially just appending new rows to the "linked list" at the bottom level of the tree. We can also keep a high fill % if we know there isn't a lot of data churn.

If you're sharding based purely on sequential ID ranges, then yes this is a problem. Its better practice to shard based on a hash of your ID, so sequential id assignments turn into non-sequential shard keys, keeping things evenly distributed.

amarant•5d ago

Oh wow, that's a super simple solution, and I can immediately see how this gets you the best of both worlds!

And since it's only used for speedy lookup we can even use a fast, cheap and non-secure hashing algorithm, so it's really a low-cost operation!

Thanks! This was really one of those aha-moments where I feel kinda stupid to not have thought of it myself!

bddicken•5d ago

I've also written about sharding.

https://planetscale.com/blog/database-sharding

amarant•5d ago

Thanks! Another great article! It strikes me that modulo sharding on a sequential id would probably work rather well, but it was not mentioned in this article. Is there a reason I'm not seeing that this is bad? I guess resharding might be problematic, as you can't easily split a shard in two without rewriting every shard if you do that...

evil-olive•5d ago

> I guess resharding might be problematic

yes, that's the crux of the problem. when you have a sharded database, typically you want to be able to add (and/or remove) shards easily and non-disruptively.

for example - your database is currently sharded across N nodes, and it's overloaded due to increased traffic, so you want to increase it to N+1 nodes (or N+M nodes, which can add complexity in some cases)

if adding a shard causes a significant increase in load on the database, that's usually a non-starter for a production workload, because at the time you want to do it, the database is already overloaded

you can read about this in the original Dynamo paper [0] from almost 20 years ago - consistent hashing is used to select 3 of the N nodes to host a given key. when node N+1 is added, it joins the cluster in such a way that it will "take over" hosting 1/Nth of the data, from each of the N nodes - meaning that a) the joining process places a relatively small load on each of those N nodes and b) once the node is fully joined, it reduces overall load evenly across all N nodes.

0: https://www.allthingsdistributed.com/2007/10/amazons_dynamo....

cogman10•5d ago

For our DBs (which are often unsharded), we've found the best performance using the user account ID as the first part of the cluster key and then a sequential id for whatever the record is as the second.

It's not as good as just a sequential ID at keeping the fragmentation and data movement down. However, it does ultimately lead to the best write performance for us because the user data ends up likely still appending to an empty page. It allows for more concurrent writes to the same table because they aren't all fighting over that end page.

UUIDv4 is madness.

traderj0e•5d ago

Spanner in particular wants random primary keys. But there are sharded DBMSes that still use sequential PKs, like Citus. There are also some use cases for semi-sequential PKs like uuid7.

bddicken•5d ago

What about spanner specifically benefits from random ids over sequential ones?

traderj0e•4d ago

I'm not an expert on Spanner, supposedly it's due to hotspotting. Your data is partitioned by primary key, and if you make that sequential, all new writes will hit the same master. https://docs.cloud.google.com/spanner/docs/schema-and-data-m... explicitly recommends a uuid4 or some other options

That's another thing, some say to use uuid7 for sharded DBs, but this is a serious counterexample.

jiveturkey•5d ago

> MySQL, arguably the world's most popular database management system,

traderj0e•5d ago

Is there a problem with that?

shawn_w•5d ago

Not the original commenter, but I thought sqlite had that title.

Retr0id•5d ago

sqlite is arguably not really a DBMS, just a DB

somat•5d ago

The database is the file, the management system is the api to use the file.

traderj0e•4d ago

It's technically a DBMS, but I can see why you wouldn't compare it to MySQL.

jiveturkey•3d ago

arguably is doing a lot of work

bddicken•5d ago

It may not have the popularity it once did, but MySQL still powers a huge % of the internet.

threatofrain•5d ago

Also curious to hear what people think of Bf-tree.

  https://vldb.org/pvldb/vol17/p3442-hao.pdf
  https://github.com/microsoft/bf-tree

bddicken•5d ago

I've read this paper and it's a neat idea. It hasn't been introduced into popular oss databases like postgres and mysql, and my understanding is it has some drawbacks for real prod use vs ths simplistic benchmarks presented in the paper.

Would love to know if anyones built something using it outside of academic testing.

infogulch•4d ago

Another interesting tree filesystem data structure is the Bε-tree ("b epsilon tree"), which also tries to bridge the gap between small writes and the large pages of modern drives. The first paper/talk from 2015 has a fun name "BetrFS: A Right-Optimized Write-Optimized File System" and they published a few dozen times until 2022. https://www.betrfs.org/

ozgrakkurt•3d ago

The idea seems like it should work but it is questionable what cases will it be better in.

B+tree and LSM-tree are very developed and are kind of optimal. They are also fairly easy to beat for a given specific use case.

I guess they have a concrete case that has benefitted from this design or this was an attempt at doing that. Would be interesting to read about that specific case they had. I just skimmed the paper, so I'm sorry if they explained it in the middle somewhere.

Also I tried some other databases that claim to be better than rocksdb but it just is miles better than other databases when I needed large scale (couple billions of 32byte keys mapped to 8byte values).

I tried MDBX(LMDB), sled (also claimed read AND write optimized).

Tried sharding and all configuration options with both.

Reading papers about database research unfortunately feels like reading LLM output because I have to sift through a lot of fluff, and I have to know exactly that the thing is about and the surrounding ideas. I am not super knowledgeable in this field so this might be just a skill issue, but I would recommend seeing it this way.

This paper also writes about variable sized pages so it might be relevant to understanding what the trade-offs might be.

https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

Also another thing I highly recommend is to always judge by hardware limits vs db measurement instead of looking at graphs in paper.

If something is doing 1GB/s write on an ssd that can do 7GB/s than it is bad at writes. It doesn't matter if it looks cool on a graph. This is kind of a crude way of seeing it but it is at least reliable.

whartung•5d ago

I keep hearing about the downside of B(+)-Trees for DBs, that they have issues for certain scenarios, but I've never seen a simple, detailed list about them, what they are, and the scenarios they perform badly in.

Retr0id•5d ago

For pure write throughput, LSM trees tend to beat btrees.

bddicken•5d ago

It's really just a matter of tradeoffs. B-trees are great, but are better suited for high read % and medium/low write volume. In the opposite case, things like LSMs are typically better suited.

If you want a comprehensive resource, I'd recommend reading either Designing Data Intensive Applications (Kleppman) or Database Internals (Petrov). Both have chapters on B-trees and LSMs.

daneel_w•5d ago

See my comment in the main thread for an example. In a worst case scenario, some data is simply too "frizzy" to index/search efficiently and with good performance in a B-tree.

faangguyindia•5d ago

If your application is write intensive LSM is better than Btree.

But you'd rarely need it. We mostly have write intensive counters. We just write to redis first then aggregate and write to postgres.

This reduces number of writes we need in postgres a lot

daneel_w•5d ago

"The deeper the tree, the slower it is to look up elements. Thus, we want shallow trees for our databases!"

With composite indices in InnoDB it's even more important to keep the tree streamlined and let it fan out according to data cardinality: https://news.ycombinator.com/item?id=34404641

traderj0e•5d ago

I've known for a long time that you usually want b-tree in Postgres/MySQL, but never understood too well how those actually work. This is the best explanation so far.

Also, for some reason there have been lots of HN articles incorrectly advising people to use uuid4 or v7 PKs with Postgres. Somehow this is the first time I've seen one say to just use serial.

bddicken•5d ago

Simple sequential IDs are great. If you want UUID, v7 is the way to go since it maintains sequential ordering.

jwpapi•5d ago

Does all of that apply to Postgresql as well or only Mysql?

sgarland•5d ago

Both, assuming you’re ever going to index it - both use a form of a B+tree for their base indices.

If it’s just being stored in the table, it doesn’t matter, but also if it doesn’t matter, just use v7.

traderj0e•3d ago

I haven't explored hash indexes enough. That might be one thing that differs between Postgres and MySQL, because for a long time Postgres didn't have a good story for those.

omcnoe•5d ago

There are subtle gotchas around sequential UUID compared to serial depending on where you generate the UUIDs. You can kinda only get hard sequential guarantee if you are generating them at write time on DB host itself.

But, for both Serial & db-gen’d sequential UUID you can still encounter transaction commit order surprises. I think software relying on sequential records should use some mechanism other than Id/PK to determine it. I’ve personally encountered extremely subtle bugs related to transaction commit order and sequential Id assumptions multiple times.

sgarland•5d ago

> just use serial

Ideally you use IDENTITY with Postgres, but the end result is the same, yes.

evil-olive•5d ago

> incorrectly advising people to use uuid4 or v7 PKs with Postgres

random UUIDs vs time-based UUIDs vs sequential integers has too many trade-offs and subtleties to call one of the options "incorrect" like you're doing here.

just as one example, any "just use serial everywhere" recommendation should mention the German tank problem [0] and its possible modern-day implications.

for example, if you're running a online shopping website, sequential order IDs means that anyone who places two orders is able to infer how many orders your website is processing over time. business people usually don't like leaking that information to competitors. telling them the technical justification of "it saves 8 bytes per order" is unlikely to sway them.

0: https://en.wikipedia.org/wiki/German_tank_problem

jim33442•5d ago

PK isn't the same as public ID, even though you could make them the same. Normally you have a uuid4 or whatever as the public one to look up, but all the internal joins etc use the serial PKs.

evil-olive•4d ago

> Normally you have a uuid4 or whatever as the public one to look up, but all the internal joins etc use the serial PKs.

what? that's possible, but it's the worst of both worlds. I've certainly never encountered a system where that's the "normal" practice.

the usual reason people avoid UUIDv4 primary keys is that it causes writes to be distributed across the entire B-tree, whereas sequential (or UUIDv7) concentrates them.

but if you then add a "alternate primary key" you're just re-creating the problem - the B-tree for that unique index will have its writes distributed at random.

if you need a UUID PK...just use it as the PK.

traderj0e•4d ago

The problem isn't so much the writes, it's the reads. Every time you join tables, you're using a PK 2-4x the size it needs to be, and at least that much slower. Even filtering on a secondary index may involve an internal lookup via PK to the main table. It doesn't take long to start noticing the performance difference.

Since you'd have a secondary index for the public UUID, yes that one index suffers from the random-writes issue still, but it takes a lot of volume to notice. If it ever is a big deal, you can use a separate KV store for it. But if you picked UUID as the PK, it's harder to get away from it.

traderj0e•3d ago

Well that and all the tables you have that don't need a customer-facing ID at all, in which case you also benefit from quicker writes using serial

omcnoe•5d ago

DB perf considerations aside, a lot of software pattern around idempotency/safe retries/horiz-scaling/distributed systems are super awkward with a serial pk because you don’t have any kind of unambiguous unique record identifier until after the DB write succeeds.

DB itself is “distributed” in that it’s running outside the services own memory in 99% of cases, in complex systems the actual DB write may be buried under multiple layers of service indirection across multiple hosts. Trying to design that correctly while also dealing with pre-write/post-write split on record id is a nightmare.

traderj0e•3d ago

DB sequence will give you a unique ID before the transaction succeeds. If the transaction fails, there's just a gap in the IDs.

If some service that doesn't interact with the DB wants to define its own IDs, sure, but even then whatever writes to the DB can always remap that to serial IDs. I know there are use cases where that still doesn't make sense and you really need UUID PKs, but it's not the norm.

hybirdss•5d ago

interactive viz on this kind of topic is just unfair compared to text

viccis•5d ago

A B+ tree with deletion was one of the most difficult algorithms I had to do back in college. You'd hit edge cases after billions of insertions...

photochemsyn•5d ago

Sqlite’s btree is available here:

https://github.com/sqlite/sqlite/blob/master/src/btree.c

I always thought this was too complicated to every really understand how it worked, especially the lock policy, but now with LLMs (assisted with sqlite’s very comprehensive comment policy) even a relative neophyte can start to understand how it all works together. Also the intro to the file is worth reading today:

* 2004 April 6 * * The author disclaims copyright to this source code. In place of * a legal notice, here is a blessing: * * May you do good and not evil. * May you find forgiveness for yourself and forgive others. * May you share freely, never taking more than you give. * ************************************* * This file implements an external (disk-based) database using BTrees. * See the header comment on "btreeInt.h" for additional information. * Including a description of file format and an overview of operation. */

kuharich•5d ago

Past comments: https://news.ycombinator.com/item?id=41489832

Archive of Byte magazine, starting with issue #1 in 1975

SPEAKE(a)R: Turn Speakers to Microphones for Fun and Profit [pdf] (2017)

Game devs explain the tricks involved with letting you pause a game

The seven programming ur-languages (2022)

Shader Lab, like Photoshop but for shaders

What are skiplists good for?

NIST scientists create 'any wavelength' lasers

College instructor turns to typewriters to curb AI-written work

Anonymous request-token comparisons from Opus 4.6 and Opus 4.7

The electromechanical angle computer inside the B-52 bomber's star tracker

Ask HN: How did you land your first projects as a solo engineer/consultant?

Changes in the system prompt between Claude Opus 4.6 and 4.7

Why Japan has such good railways

The world in which IPv6 was a good design

Updating Gun Rocket through 10 years of Unity Engine

It's cool to care (2025)

Binary Dependencies: Identifying the Hidden Packages We All Depend On

State of Kdenlive

Show HN: Prompt-to-Excalidraw demo with Gemma 4 E2B in the browser (3.1GB)

Modern Common Lisp with FSet

Migrating from DigitalOcean to Hetzner

Binary GCD

Keep Pushing: We Get 10 More Days to Reform Section 702

Optimizing Ruby Path Methods

Metatextual Literacy

Dizzying Spiral Staircase with Single Guardrail Once Led to Top of Eiffel Tower

Airline worker arrested after sharing photos of bomb damage in WhatsApp group

Sumida Aquarium Posts 2026 Penguin Relationship Chart, with Drama and Breakups

Zero-Copy GPU Inference from WebAssembly on Apple Silicon

Thoughts and feelings around Claude Design

Archive of Byte magazine, starting with issue #1 in 1975

SPEAKE(a)R: Turn Speakers to Microphones for Fun and Profit [pdf] (2017)

Game devs explain the tricks involved with letting you pause a game

The seven programming ur-languages (2022)

Shader Lab, like Photoshop but for shaders

What are skiplists good for?

NIST scientists create 'any wavelength' lasers

College instructor turns to typewriters to curb AI-written work

Anonymous request-token comparisons from Opus 4.6 and Opus 4.7

The electromechanical angle computer inside the B-52 bomber's star tracker

Ask HN: How did you land your first projects as a solo engineer/consultant?

Changes in the system prompt between Claude Opus 4.6 and 4.7

Why Japan has such good railways

The world in which IPv6 was a good design

Updating Gun Rocket through 10 years of Unity Engine

It's cool to care (2025)

Binary Dependencies: Identifying the Hidden Packages We All Depend On

State of Kdenlive

Show HN: Prompt-to-Excalidraw demo with Gemma 4 E2B in the browser (3.1GB)

Modern Common Lisp with FSet

Migrating from DigitalOcean to Hetzner

Binary GCD

Keep Pushing: We Get 10 More Days to Reform Section 702

Optimizing Ruby Path Methods

Metatextual Literacy

Dizzying Spiral Staircase with Single Guardrail Once Led to Top of Eiffel Tower

Airline worker arrested after sharing photos of bomb damage in WhatsApp group

Sumida Aquarium Posts 2026 Penguin Relationship Chart, with Drama and Breakups

Zero-Copy GPU Inference from WebAssembly on Apple Silicon

Thoughts and feelings around Claude Design

B-trees and database indexes (2024)

Comments