Show HN: Turbolite – a SQLite VFS serving sub-250ms cold JOIN queries from S3

https://github.com/russellromney/turbolite

44•russellthehippo•1h ago

I built a SQLite VFS in Rust that serves cold queries directly from S3 with sub-second performance, and often much faster.

It’s called turbolite. It is experimental, buggy, and may corrupt data. I would not trust it with anything important yet.

I wanted to explore whether object storage has gotten fast enough to support embedded databases over cloud storage. Filesystems reward tiny random reads and in-place mutation. S3 rewards fewer requests, bigger transfers, immutable objects, and aggressively parallel operations where bandwidth is often the real constraint. This was explicitly inspired by turbopuffer’s ground-up S3-native design. https://turbopuffer.com/blog/turbopuffer

The use case I had in mind is lots of mostly-cold SQLite databases (database-per-tenant, database-per-session, or database-per-user architectures) where keeping a separate attached volume for each inactive database feels wasteful. turbolite assumes a single write source and is aimed much more at “many databases with bursty cold reads” than “one hot database.”

Instead of doing naive page-at-a-time reads from a raw SQLite file, turbolite introspects SQLite B-trees, stores related pages together in compressed page groups, and keeps a manifest that is the source of truth for where every page lives. Cache misses use seekable zstd frames and S3 range GETs for search queries, so fetching one needed page does not require downloading an entire object.
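
To make the manifest idea concrete, here is a minimal sketch of a page-to-frame lookup that produces an S3 range request. This is not turbolite's actual code or data layout: `FrameLocation`, `MANIFEST`, `range_request_for_page`, the object keys, and the byte sizes are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameLocation:
    object_key: str  # S3 object that holds the page group (hypothetical key)
    offset: int      # byte offset of the compressed frame within the object
    length: int      # compressed frame length in bytes


# Hypothetical manifest: SQLite page number -> frame that contains it.
# Several pages may share one frame.
MANIFEST = {
    1: FrameLocation("groups/interior-0", 0, 1800),
    2: FrameLocation("groups/users-0", 0, 5200),
    3: FrameLocation("groups/users-0", 0, 5200),
    4: FrameLocation("groups/users-0", 5200, 4900),
}


def range_request_for_page(page_no):
    """Return (object_key, Range header value) for the frame holding page_no."""
    loc = MANIFEST[page_no]
    # HTTP byte ranges are inclusive at both ends.
    last_byte = loc.offset + loc.length - 1
    return loc.object_key, f"bytes={loc.offset}-{last_byte}"
```

The returned value would go into the `Range` header of a GET on that object, so only the one compressed frame is transferred and decompressed rather than the whole page group.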

At query time, turbolite can also pass storage operations from the query plan down to the VFS to frontrun downloads for indexes and large scans in the order they will be accessed.
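
The post does not say how the plan reaches the VFS. One way to see the kind of information SQLite has up front is `EXPLAIN QUERY PLAN`, sketched below with Python's standard `sqlite3` module. The helper `planned_btrees` and the example schema are assumptions made for illustration, and the regexes only cover simple `SEARCH`/`SCAN` plan lines.

```python
import re
import sqlite3


def planned_btrees(conn, sql):
    """Return (table, index_or_None) pairs in the order the plan lists them."""
    touched = []
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        detail = row[3]  # human-readable plan step
        # Older SQLite versions print "SEARCH TABLE t", newer ones "SEARCH t".
        table = re.match(r"(?:SEARCH|SCAN)\s+(?:TABLE\s+)?(\w+)", detail)
        if not table:
            continue
        index = re.search(r"USING (?:COVERING )?INDEX (\w+)", detail)
        touched.append((table.group(1), index.group(1) if index else None))
    return touched


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts(id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    CREATE INDEX idx_posts_user ON posts(user_id);
""")
plan = planned_btrees(conn, "SELECT title FROM posts WHERE user_id = 7")
```

A storage layer that received a list like `plan` before the first page read could start fetching the page groups for those indexes and tables in that order.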

You can tune how aggressively turbolite prefetches. For point queries and small joins, it can stay conservative and avoid prefetching whole tables. For scans, it can get much more aggressive.

It also groups pages by page type in S3. Interior B-tree pages are bundled separately and loaded eagerly. Index pages prefetch aggressively. Data pages are stored by table. The goal is to make cold point queries and joins decent, while making scans less awful than naive remote paging would.

On a 1M-row / 1.5GB benchmark on EC2 + S3 Express, starting from an empty cache, I’m seeing results like sub-100ms cold point lookups, sub-200ms cold 5-join profile queries, and sub-600ms scans. It’s somewhat slower on normal S3/Tigris.

Current limitations are pretty straightforward: it’s single-writer only, and it is still very much a systems experiment rather than production infrastructure.

I’d love feedback from people who’ve worked on SQLite-over-network, storage engines, VFSes, or object-storage-backed databases. I’m especially interested in whether the B-tree-aware grouping / manifest / seekable-range-GET direction feels like the right one to keep pushing.

Comments

russellthehippo•1h ago

A bit more color on what I found interesting building this:

The motivating question for me was less “can SQLite read over the network?” and more “what assumptions break once the storage layer is object storage instead of a filesystem?”

The biggest conceptual shift was around *layout*.

What felt most wrong in naive designs was that SQLite page numbers are not laid out in a way that matches how you want to fetch data remotely. If an index is scattered across many unrelated page ranges, then “prefetch nearby pages” is kind of a fake optimization. Nearby in the file is not the same thing as relevant to the query.

That pushed me toward B-tree-aware grouping. Once the storage layer starts understanding which table or index a page belongs to, a lot of other things get cleaner: more targeted prefetch, better scan behavior, less random fetching, and much saner request economics.
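
A toy comparison of the two layouts, as a sketch under made-up assumptions: 12 pages, a hypothetical index `idx_email` whose pages are interleaved with those of a `users` table, and 4 pages per stored group. The function names are illustrative and not taken from turbolite.

```python
from collections import defaultdict


def chunk(pages, size):
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def group_by_file_order(page_to_tree, size):
    """Naive layout: consecutive page numbers share a group."""
    return chunk(sorted(page_to_tree), size)


def group_by_btree(page_to_tree, size):
    """B-tree-aware layout: only pages of the same table/index share a group."""
    by_tree = defaultdict(list)
    for page, tree in sorted(page_to_tree.items()):
        by_tree[tree].append(page)
    groups = []
    for tree in sorted(by_tree):
        groups.extend(chunk(by_tree[tree], size))
    return groups


def groups_touched(groups, wanted):
    """How many groups (roughly, remote fetches) cover the wanted pages."""
    return sum(1 for g in groups if wanted & set(g))


# Hypothetical file: every third page belongs to the index.
page_to_tree = {p: ("idx_email" if p % 3 == 2 else "users") for p in range(1, 13)}
index_pages = {p for p, t in page_to_tree.items() if t == "idx_email"}

naive = groups_touched(group_by_file_order(page_to_tree, 4), index_pages)
aware = groups_touched(group_by_btree(page_to_tree, 4), index_pages)
```

In this toy layout, reading the whole index touches every file-order group but only one B-tree-aware group, which is the sense in which "nearby in the file" and "relevant to the query" diverge.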

Another thing that became much more important than I expected is that *different page types matter a lot*. Interior B-tree pages are tiny in footprint but disproportionately important, because basically every query traverses them. That changed how I thought about the system: much less as “a database file” and much more as “different classes of pages with very different value on the critical path.”

The query-plan-aware “frontrun” part came from the same instinct. Reactive prefetch is fine, but SQLite often already knows a lot about what it is about to touch. If the storage layer can see enough of that early, it can start warming the right structures before the first miss fully cascades. That’s still pretty experimental, but it was one of the more fun parts of the project.

A few things I learned building this:

1. *Cold point reads and small joins seem more plausible than I expected.* Not local-disk fast, obviously, but plausible for the “many mostly-cold DBs” niche.

2. *The real enemy is request count more than raw bytes.* Once I leaned harder into grouping and prefetch by tree, the design got much more coherent.

3. *Scans are still where reality bites.* They got much less bad, but they are still the place where remote object storage most clearly reminds you that it is not a local SSD.

4. *The storage backend is super important.* Different storage backends (S3, S3 Express, Tigris) have very different round-trip latencies, and that latency is the single most important factor in deciding how to tune prefetching.
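
Points 2 and 4 can be illustrated with a back-of-the-envelope cost model: total time is roughly sequential round trips times round-trip latency, plus bytes divided by bandwidth. The 20 ms latency and 100 MB/s bandwidth below are invented for illustration and are not measurements of any backend.

```python
def fetch_time_ms(sequential_round_trips, rtt_ms, total_bytes, bandwidth_mb_s):
    """Rough model: latency cost of each dependent request plus transfer time."""
    transfer_ms = total_bytes / (bandwidth_mb_s * 1_000_000) * 1000
    return sequential_round_trips * rtt_ms + transfer_ms


PAGE = 4096          # bytes per page (assumed page size)
PAGES = 64           # pages the query needs
RTT_MS = 20          # assumed round-trip latency
BANDWIDTH_MB_S = 100 # assumed throughput

# 64 dependent single-page requests vs. one request for a 64-page group.
page_at_a_time = fetch_time_ms(PAGES, RTT_MS, PAGES * PAGE, BANDWIDTH_MB_S)
one_group = fetch_time_ms(1, RTT_MS, PAGES * PAGE, BANDWIDTH_MB_S)
```

Both cases move the same 256 KiB, so the transfer term is identical; the difference comes entirely from the number of round trips. The same model suggests why a higher-latency backend would favor larger groups and more aggressive prefetch.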

Anyway, happy to talk about the architecture, the benchmark setup, what broke, or why I chose this shape instead of raw-file range GETs / replication-first approaches / etc.

hgo•28m ago

I really appreciate this post. Freely and humbly sharing real insights from an interesting project. I almost feel like I got a significant chunk of the reward for your investment into this project just by reading.

Thank you for sharing.

russellthehippo•13m ago

Thanks for your kind words!

russellthehippo•1h ago

Also I want to acknowledge the other projects in adjacent parts of this space — raw SQLite range-request VFSes, Litestream/LiteFS-style replication approaches, libSQL/Turso, Neon, mvsqlite, etc. I took a lot of inspiration from them, thanks!

michaeljelly•52m ago

Really cool

alex_hirner•49m ago

What are your thoughts on eviction? Specifically, how easy would it be to add some basic policy?

russellthehippo•36m ago

Great question. I have some eviction functions in the Rust library; I don’t expose them through the extension/VFS yet. The open question is less “can I evict?” and more “when should eviction fire?”: via user action, via policy, or both.

The obvious policy-driven versions are things like:

- when cache size crosses a limit

- on checkpoint

- every N writes (kind of like autocheckpoint)

- after some idle / age threshold

My instinct is that for the workload I care about, the best answer is probably hybrid. The VFS should have a tier-aware policy internally that users can configure with separate policies for interior/index/data pages. But the user/application may still be in the best position to say “this tenant/session DB is cold now, evict aggressively.”
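
A minimal sketch of what a tier-aware, size-triggered policy could look like, assuming data pages are the cheapest to drop and interior pages the most valuable to keep. `CachedGroup`, `TIER_EVICTION_ORDER`, and `pick_evictions` are hypothetical names and are not part of turbolite's library.

```python
from dataclasses import dataclass

# Assumed priority: evict data groups first, interior groups last.
TIER_EVICTION_ORDER = ["data", "index", "interior"]


@dataclass
class CachedGroup:
    key: str            # cache key of the page group
    tier: str           # "data", "index", or "interior"
    size: int           # bytes held in the local cache
    last_access: float  # timestamp of the most recent read


def pick_evictions(groups, limit_bytes):
    """Return keys to evict so the cache fits in limit_bytes.

    Groups are evicted tier by tier, least recently used first within a tier.
    """
    total = sum(g.size for g in groups)
    ordered = sorted(
        groups,
        key=lambda g: (TIER_EVICTION_ORDER.index(g.tier), g.last_access),
    )
    victims = []
    for g in ordered:
        if total <= limit_bytes:
            break
        victims.append(g.key)
        total -= g.size
    return victims


cache = [
    CachedGroup("interior-0", "interior", 10, last_access=1.0),
    CachedGroup("idx-0", "index", 20, last_access=2.0),
    CachedGroup("data-a", "data", 30, last_access=3.0),
    CachedGroup("data-b", "data", 30, last_access=1.0),
]
victims = pick_evictions(cache, limit_bytes=50)
```

An application-driven "this tenant is cold" signal could reuse the same function by calling it with a limit of 0 for that database.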

carlsverre•47m ago

You might be interested in taking a look at Graft (https://graft.rs/). I have been iterating in this space for the last year, and have learned a lot about it. Graft has a slightly different set of goals, one of which is to keep writes fast and small and optimize for partial replication. That said, Graft shares several design decisions, including the use of framed ZStd compression to store pages.

I do like the B-tree aware grouping idea. This seems like a useful optimization for larger scan-style workloads. It helps eliminate the need to vacuum as much.

Have you considered doing other kinds of optimizations? Empty pages, free pages, etc.

russellthehippo•28m ago

Very cool, thanks. I hadn’t seen Graft before, but that sounds pretty adjacent in a lot of interesting ways. I’ll look at the repo and see what I can apply.

I've tried out all sorts of optimizations - for free pages, I've considered leaving empty space in each S3 object and serving those as free pages to get efficient writes without shuffling pages too much. My current bias has been to over-store a little if it keeps the read path simpler, since the main goal so far has been making cold reads plausible rather than maximizing space efficiency. Especially because free pages compress well.

I have two related roadmap items: hole-punching and LSM-like writing. For local non-HDD storage, we can evict empty pages automatically by releasing empty page space back to the OS. For writes, LSM is best because it groups related things together, which is what we need, but that would mean doing a lot of rewriting on checkpoint. So both of these feel a little premature to optimize for vs. other things.

inferense•11m ago

very cool!
