We also went from like 6 seconds to 50ms. Huge speedup.
Reference
https://dev.mysql.com/doc/refman/9.4/en/index-condition-push...
ICP in MySQL (which can be built on top of ref/eq_ref, but isn't part of the normal index lookup per se) is a fairly weird concept where the storage engine is told to evaluate certain predicates on its own, without returning the row to the executor. This is done to a) reduce the number of round-trips (function calls) from the executor down into the storage engine, and b) because InnoDB's secondary indexes need an extra storage round-trip to return the row (secondary indexes don't point at the row you want; they contain the primary key, and you then have to look up the actual row by PK), so if you can filter out the row early, you can skip the primary-key lookup entirely.
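A minimal illustration (the table and data are made up; it mirrors the zipcode/lastname example from the MySQL manual): with ICP, InnoDB checks the LIKE against the secondary index entries themselves and only fetches the full row for entries that pass.

```sql
-- Hypothetical schema with a composite secondary index.
CREATE TABLE people (
  id       INT PRIMARY KEY,
  zipcode  VARCHAR(10),
  lastname VARCHAR(64),
  address  VARCHAR(255),
  KEY k_zip_last (zipcode, lastname)
);

-- Only zipcode = '95054' can seed the index range scan; the LIKE can't.
-- With ICP, InnoDB evaluates the LIKE on the index entries and only does the
-- PK lookup for matches, so EXPLAIN shows "Using index condition" in Extra.
EXPLAIN
SELECT * FROM people
WHERE zipcode = '95054'
  AND lastname LIKE '%son%';
```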
Just an endless sequence of misbehavior, and we wave it off with "rows work well for specific lookups, columns for aggregations", yet here is all the other stuff that is unreasonably slow.
It's an example of old things being new again maybe. Or reinventing the wheel because the wheel wasn't known to them.
Yes, I know, nobody wants to pay that tax or make that guy richer, but databases like Oracle have had JPPD (join predicate pushdown) for a long time. It's just something the database does, and the optimizer chooses whether to do it or not depending on whether it's the best thing to do.
[1] https://www.cockroachlabs.com/blog/building-cost-based-sql-o...
“We filtered first instead of reading an entire table from disk and performing a lookup”
Where both OLAP and OLTP DBMSs would benefit.
To your point, it’s clear certain workloads lend themselves to OLAP and columnar storage much better, but “an endless sequence of misbehavior” seems a bit harsh.
Recent example: I have 1 GB of data in total across tables. A query takes 20 minutes. Obvious quadratic/cubic-or-even-worse behavior.
I disable nested loop joins and it's 4 seconds. Still slow, but I don't want to spend time figuring out why it's slower than reading 1 GB of data and pipelining the computation so that it's just 1 second, or even faster given the beefy NVMe where the files are stored (ignoring that I actually have good indices and the surface area of the query is probably 10 MB, not 1 GB).
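For anyone curious what that workaround looks like in practice, here's a sketch with made-up table names, toggling the planner setting only for the current session:

```sql
-- Baseline plan (illustrative query/tables).
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, u.email
FROM orders o JOIN users u ON u.id = o.user_id
WHERE o.created_at >= now() - interval '7 days';

-- Session-local toggle: nested-loop joins become (almost) off the table.
SET enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, u.email
FROM orders o JOIN users u ON u.id = o.user_id
WHERE o.created_at >= now() - interval '7 days';

-- Put the planner back to its defaults when done experimenting.
RESET enable_nestloop;
```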
How can the strategy be slower than downloading 1GB of data and gluing it together in Python?
Something is just off with the level of abstraction, with the query planner relying on weird stats. The whole system, outside of its transactional guarantees, just sucks.
Another example: materializing a CTE reduces execution time from 2 seconds to 50 ms, because that somehow hints to the query planner that the result of the CTE is small.
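For reference, the hint in question (Postgres 12+ syntax; tables are illustrative): AS MATERIALIZED forces the CTE to be computed once and acts as an optimization fence, so the outer query plans against its small result instead of inlining it.

```sql
WITH recent_users AS MATERIALIZED (
    -- Computed once; the planner treats its (small) result as a black box.
    SELECT id FROM users WHERE last_seen >= now() - interval '1 day'
)
SELECT o.*
FROM orders o
JOIN recent_users r ON r.id = o.user_id;
```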
So even PostgreSQL is filled with these endless riddles of misbehavior, even though PhDs boast about who knows what in the query optimizer and will make an effort to belittle my criticism by repeating "1 row" vs "agg all rows" as if I'm in elementary school and don't know how to use both OLTP and OLAP systems.
Unlike column DBs, where I know it's some nice fused group-by/map/reduce behavior, where I avoid joins like the plague, and where there's no query planner, stats maintenance, indices, or other mumbo-jumbo that does not do anything at all most of the time.
Most of my workloads are extremely tiny and I am familiar with how to structure schemas for OLTP and OLAP and I just dislike how most relational databases work.
I feel your pain. I've been through all stages of grief with `enable_nestloop`. I've arrived at acceptance. Sometimes you just need to redo your query approach. Usually by the time I get the planner to behave, I've ended up with something that's expressed more simply to boot.
Postgres implements the ancient Volcano model from the 1980s, but there's been a ton of query optimization research since then, especially from the database groups at TU Munich, the University of Washington, and Carnegie Mellon. Systems like HyPer and Umbra (both from TUM) have state-of-the-art query planners that would eat Postgres' lunch. Lots of work on making planners smarter about rearranging joins into better orders, improving cache locality and buffer management, and so on.
Unfortunately, changing an old Volcano planner and applying newer techniques would probably be a huge endeavor.
Naive example: let's say you had a query that was a table scan and it computed the average age of all users in the users table. If you insert a new row into the users table and then rerun the query, you'd expect another table scan, so the cost grows as the table grows. In a traditional setup, you might cache this query and only update it "every once in a while." Instead, ReadySet can decompose this query into an "incremental program", run it, and cache the result -- and then when it sees the insert into the table, it incrementally updates the underlying cache. That means the second run would actually be fast, and the cost to update the cache is proportional only to the underlying change, not to the size of the table.
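This isn't how ReadySet implements it, but the incremental idea itself can be sketched in plain SQL: keep a running (sum, count) that a trigger bumps per inserted row, so reads never rescan the table.

```sql
-- One-row summary maintained incrementally (Postgres syntax, illustrative;
-- assumes age is NOT NULL and only inserts happen, for simplicity).
CREATE TABLE users_age_summary (total BIGINT NOT NULL, cnt BIGINT NOT NULL);
INSERT INTO users_age_summary
SELECT COALESCE(SUM(age), 0), COUNT(*) FROM users;

CREATE FUNCTION bump_age_summary() RETURNS trigger AS $$
BEGIN
    -- Cost is proportional to the change (one row), not the table size.
    UPDATE users_age_summary SET total = total + NEW.age, cnt = cnt + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_age_summary_ins
AFTER INSERT ON users
FOR EACH ROW EXECUTE FUNCTION bump_age_summary();

-- Reading the "cached" aggregate is now O(1) instead of a table scan.
SELECT total::numeric / NULLIF(cnt, 0) AS avg_age FROM users_age_summary;
```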
ReadySet is derived from research on Noria, whose paper is here: https://www.usenix.org/conference/osdi18/presentation/gjengs...
Push down filters to read the least data possible.
Or, know your data and be able to tell the query engine which kind of join strategy you would like (hash vs push down)
There is a real need to be able to take key queries and say, "don't change the way you run this query." Most databases offer this. Unfortunately, PostgreSQL doesn't. There are ways to force the join (e.g. using a series of queries with explicit temporary tables), but they all create overhead. And the result is that a PostgreSQL-backed website will sometimes see a good query plan flip to a bad one, and then have problems. Just because it is Tuesday.
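The temp-table workaround looks roughly like this (table names and the email value are illustrative): materialize the selective side by hand, then join against it, trading plan stability for extra writes.

```sql
-- Step 1: pin down the small, selective set yourself.
CREATE TEMP TABLE tmp_target_users AS
SELECT id FROM users WHERE email = 'someone@example.com';

-- Step 2: give the planner real row counts for the temp table.
ANALYZE tmp_target_users;

-- Step 3: the outer query now joins against a known-small input,
-- which keeps the plan from flipping as the big tables' stats drift.
SELECT o.*
FROM orders o
JOIN tmp_target_users t ON t.id = o.user_id
WHERE o.status = 'shipped';
```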
We've hit this with MSSQL too. Suddenly production is down because for whatever reason MSSQL decided to forget its good plan and instead table scan, and then continue to reuse that cached table-scanning plan.
For one specific query that MSSQL likes to do this with at a certain customer, we've so far just added the minutes since the start of the year as a dummy column while we work on more pressing issues. Very blunt, yet it works.
But here they are deciding between pushing down "o.status==shipped" and "u.email==address@" on both sides in parallel and then joining (which they already did), versus first doing "u.email==address@" and then pushing down "u.id==o.user_id".
This is a judgment call. Their planner is pretty dumb to not know which one is better, but “push down as much as possible” doesn't cut it: you need to actually decide what to push down and why.
I'd be very surprised if virtually every RDBMS doesn't do this already.
Would be curious if any RocksDB knowledgeable people have a different analysis.
I would disagree with this assumption for two reasons. First, theoretically, a file system is a key-value store, and basically all databases run on file systems, so it stands to reason that any optimization Postgres does can be achieved as an abstraction over a key-value store with a good API, because Postgres already did it.
Second, less theoretically, this has already been done by CockroachDB, which currently stores data in Pebble (CRDB’s Go rewrite of RocksDB) and previously used RocksDB itself, and by TiDB, which stores its data in TiKV.
A thin wrapper over a KV store will only be able to use optimizations provided by the KV store, but if your wrapper is thick enough to include abstractions like adding multiple tables or inserting values into multiple cells in multiple tables atomically, then you can build arbitrary indices into the abstraction.
I wouldn’t tend to call a KV store a bad database engine because I don’t think of it as a database engine at all. It might technically be one under the academic definition of a database engine, but I mostly see it being used as a building block in a more complicated database engine.
At every place I’ve worked, this is a solved problem: Hive+Spark, just keep everything sharded across a ton of machines.
It’s cheaper to pay for a Hive cluster that does dumb queries than to pay for expensive DB licenses, data engineers building arbitrary indices, etc… just throw compute at the problem, who cares. 1 TB of RAM/flash is so cheap these days.
Even working on the worlds “biggest platforms” a daily partition of user data is like 2TB.
You’re telling me an F500 can’t buy a 5-machine/40 TB cluster for like $40k and basically be set?
“Just dump it in Hadoop” became an anti-pattern, and everyone yearned for databases and clean data and not dealing with internal IT and the cluster “admins”.
For the first decade I wanted nothing to do with DBs aside from as places to store data. One day I saw a few things that made a massive difference, and then I went wild on learning how to speed things up. It’s fantastic, and because few devs know this stuff well, it becomes a superpower. You wouldn’t believe what you can squeeze out of modern SQL DBs and hardware, without touching any kind of optimised solutions. Which I love too, but that’s a different post.
Maybe ask the DBAs a few questions and see if that triggers any interest for you. Look at query plans and how many rows are processed for a query. How many columns. What is being locked. Can you remove locks when you’re just running a query, and how much does that speed things up? There are queries for all sorts of metrics, e.g. which indexes are huge but never used (see the sketch below). The DB can often suggest indexes, but don’t just add the suggestions blindly; use them as a starting point to reason about your own. Try to get down to low-millisecond queries for really frequent stuff, because it’ll make them fast and means less time locking the DB, less RAM, less temp table storage.
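For example, the "huge but never used" question can be answered in Postgres straight from the stats views (interpret with care; the counters reset along with the stats):

```sql
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0                       -- never scanned since stats were last reset
ORDER BY pg_relation_size(indexrelid) DESC;
```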
All my other skills have aged. Fundamental database knowledge lasts.
I totally lose respect for senior engineers who can’t write SQL to find even simple answers to questions.
It’s never bad to have another arrow in your quiver
Does your hot-path query have a covering index? Is it seekable? Are your predicates SARG-able?
Seriously, the biggest bang for your buck is to understand SARG-ability and to understand indexes in your database really well.
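A small sketch of both ideas (SQL Server-flavored syntax; the table is illustrative): the predicate has to expose the bare column for an index seek, and an INCLUDE-ing index lets the query be answered without touching the base table at all.

```sql
-- Non-sargable: the function call on the column hides it from the index, so this scans.
SELECT order_id, total FROM orders WHERE YEAR(created_at) = 2024;

-- Sargable rewrite: a bare column compared against a range can be seeked.
SELECT order_id, total FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';

-- Covering index: the sargable query above is answered entirely from the index
-- (seek on created_at, with order_id/total carried along), no lookups back to the table.
CREATE INDEX ix_orders_created_at ON orders (created_at) INCLUDE (order_id, total);
```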
There are two distinct job functions.
1) DBAs - maintain the OS and DB software/hardware. DBs are complex beasts; you want experts setting up hardware or deploying into the clown in a smarter way (virtual machines in the clown will cost you a fortune and your sanity).
2) Database developers - specialists in writing SQL.
The two functions work together but are distinct.
Nowadays, due to private equity and Stanford MBAs, we have junior engineers doing it all, plus “DevOps”.
It’s an absolute circus.
IMO, this is why we end up with all these crazy DB startups - routing around the damage.
RDBMSs on modern hardware are insanely fast and powerful.
Some folks pointed out that index pushdowns and join optimizations aren’t novel. That’s fair. In a traditional database engine, pushdowns and access path selection are standard. But Readyset isn’t a conventional engine.
When you create a materialized view in Readyset, the query is compiled into a dataflow graph designed for high-throughput incremental updates, not per-request planning. We receive changes from the upstream database’s replication stream and propagate deltas through the graph. Reads typically hit the cache directly at sub-ms latency.
But when a key hasn’t yet been materialized, we perform what we call an upquery -- a one-off pull from the base tables (stored in RocksDB) to hydrate the missing result. Since we don’t re-plan queries on each request, the structure of that upquery, including filter pushdowns and join execution, is precompiled into the dataflow.
Straddled joins, where filtering is required on both sides of the join, are especially tricky in this model. Without smarter pushdown, we were overfetching data and doing unnecessary join work. This optimization pushes composite filters into both sides of the join to reduce RocksDB scans and hash table size.
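For concreteness, the shape in question looks roughly like this (names are illustrative), with one filter on each input of the join; pushing both below the join means each side is trimmed before any rows are fed into the join, instead of joining first and filtering the result.

```sql
SELECT o.id, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.email  = 'someone@example.com'   -- filter on the users side
  AND o.status = 'shipped';              -- filter on the orders side
```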
It’s a well-known idea in the context of traditional databases, but making it work in a static, incrementally maintained dataflow system is what makes it unique here.
Happy to go deeper if folks are curious. Appreciate the thoughtful feedback.
It could just be that they translated from their original language to English and got that as a byproduct. Many such cases.
I for one appreciate such comments, given the guidelines to avoid submission summaries.
You're free to opine on the reason for downvotes too. This metacomment, however, is more noise than signal.
One line summaries of comprehensible articles can get downvoted because they don't add value beyond what's already very clear from the article.
And the fact that multiple people upvoted my comment at a minimum suggests others also believe it to be a possible explanation.
I have no idea why you've chosen this particular hill to die on, when neither of us stands to profit from this protracted exchange.
Either way, I didn't even downvote the OP so you're beeping at the wrong human