We also went from like 6 seconds to 50ms. Huge speedup.
Reference
https://dev.mysql.com/doc/refman/9.4/en/index-condition-push...
ICP in MySQL (which can be built on top of ref/eq_ref, but isn't part of the normal index lookup per se) is a fairly weird concept where the storage engine is told to evaluate certain predicates on its own, without returning the row to the executor. This is done to a) reduce the number of round-trips (function calls) from the executor down into the storage engine, and b) because InnoDB's secondary indexes need an extra storage round-trip to return the row (secondary indexes don't point at the row you want; they contain the primary key, and you then have to look up the actual row by PK), so if you can filter out the row early, you can skip the primary-key lookup entirely.
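A minimal illustration (the table and data are made up; it mirrors the zipcode/lastname example from the MySQL manual): with ICP, InnoDB checks the LIKE against the secondary index entries themselves and only fetches the full row for entries that pass.

```sql
-- Hypothetical schema with a composite secondary index.
CREATE TABLE people (
  id       INT PRIMARY KEY,
  zipcode  VARCHAR(10),
  lastname VARCHAR(64),
  address  VARCHAR(255),
  KEY k_zip_last (zipcode, lastname)
);

-- Only zipcode = '95054' can seed the index range scan; the LIKE can't.
-- With ICP, InnoDB evaluates the LIKE on the index entries and only does the
-- PK lookup for matches, so EXPLAIN shows "Using index condition" in Extra.
EXPLAIN
SELECT * FROM people
WHERE zipcode = '95054'
  AND lastname LIKE '%son%';
```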
Just an endless sequence of misbehavior, and we wave it off with "rows work well for specific lookups, columns for aggregations", yet here is all the other stuff that is unreasonably slow.
It's an example of old things being new again maybe. Or reinventing the wheel because the wheel wasn't known to them.
Yes, I know, nobody wants to pay that tax or make that guy richer, but databases like Oracle have had JPPD (join predicate pushdown) for a long time. It's just something the database does, and the optimizer chooses whether to do it or not depending on whether it's the best thing to do.
[1] https://www.cockroachlabs.com/blog/building-cost-based-sql-o...
“We filtered first instead of reading an entire table from disk and performing a lookup”
Where both OLAP and OLTP DBMSs would benefit.
To your point, it’s clear certain workloads lend themselves to OLAP and columnar storage much better, but “an endless sequence of misbehavior” seems a bit harsh.
Recent example: I have 1 GB of data in total across tables. A query takes 20 minutes. Obvious quadratic/cubic-or-even-worse behavior.
I disable nested loop joins and it's 4 seconds. Still slow, but I don't want to spend time figuring out why it's slower than reading 1 GB of data and pipelining the computation so that it's just 1 second, or even faster given the beefy NVMe where the files are stored (ignoring that I actually have good indices and the surface area of the query is probably 10 MB, not 1 GB).
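For anyone curious what that workaround looks like in practice, here's a sketch with made-up table names, toggling the planner setting only for the current session:

```sql
-- Baseline plan (illustrative query/tables).
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, u.email
FROM orders o JOIN users u ON u.id = o.user_id
WHERE o.created_at >= now() - interval '7 days';

-- Session-local toggle: nested-loop joins become (almost) off the table.
SET enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, u.email
FROM orders o JOIN users u ON u.id = o.user_id
WHERE o.created_at >= now() - interval '7 days';

-- Put the planner back to its defaults when done experimenting.
RESET enable_nestloop;
```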
How can the strategy be slower than downloading 1GB of data and gluing it together in Python?
Something is just off with the level of abstraction, with the query planner relying on weird stats. The whole system, outside of its transactional guarantees, just sucks.
Another example: materializing a CTE reduces execution time from 2 seconds to 50 ms, because that somehow hints to the query planner that the result of the CTE is small.
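For reference, the hint in question (Postgres 12+ syntax; tables are illustrative): AS MATERIALIZED forces the CTE to be computed once and acts as an optimization fence, so the outer query plans against its small result instead of inlining it.

```sql
WITH recent_users AS MATERIALIZED (
    -- Computed once; the planner treats its (small) result as a black box.
    SELECT id FROM users WHERE last_seen >= now() - interval '1 day'
)
SELECT o.*
FROM orders o
JOIN recent_users r ON r.id = o.user_id;
```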
So even PostgreSQL is filled with these endless riddles of misbehavior, even though PhDs boast about who knows what in the query optimizer and will make an effort to belittle my criticism by repeating "1 row" vs "agg all rows" as if I'm in elementary school and don't know how to use both OLTP and OLAP systems.
Unlike column DBs, where I know it's some nice fused group-by/map/reduce behavior, where I avoid joins like the plague, and where there's no query planner, stats maintenance, indices, or other mumbo-jumbo that does not do anything at all most of the time.
Most of my workloads are extremely tiny and I am familiar with how to structure schemas for OLTP and OLAP and I just dislike how most relational databases work.
I feel your pain. I've been through all stages of grief with `enable_nestloop`. I've arrived at acceptance. Sometimes you just need to redo your query approach. Usually by the time I get the planner to behave, I've ended up with something that's expressed more simply to boot.
Postgres implements the ancient Volcano model from the 1980s, but there's been a ton of query optimization research since then, especially from the database groups at TU Munich, the University of Washington, and Carnegie Mellon. Systems like HyPer and Umbra (both from TUM) have state-of-the-art query planners that would eat Postgres' lunch. Lots of work on making planners smarter about rearranging joins into better orders, improving cache locality and buffer management, and so on.
Unfortunately, changing an old Volcano planner and applying newer techniques would probably be a huge endeavor.
Naive example: let's say you had a query that was a table scan and it computed the average age of all users in the users table. If you insert a new row into the users table and then rerun the query, you'd expect another table scan, so the cost grows as the table grows. In a traditional setup, you might cache this query and only update it "every once in a while." Instead, ReadySet can decompose this query into an "incremental program", run it, and cache the result -- and then when it sees the insert into the table, it incrementally updates the underlying cache. That means the second run would actually be fast, and the cost to update the cache is proportional only to the underlying change, not to the size of the table.
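This isn't how ReadySet implements it, but the incremental idea itself can be sketched in plain SQL: keep a running (sum, count) that a trigger bumps per inserted row, so reads never rescan the table.

```sql
-- One-row summary maintained incrementally (Postgres syntax, illustrative;
-- assumes age is NOT NULL and only inserts happen, for simplicity).
CREATE TABLE users_age_summary (total BIGINT NOT NULL, cnt BIGINT NOT NULL);
INSERT INTO users_age_summary
SELECT COALESCE(SUM(age), 0), COUNT(*) FROM users;

CREATE FUNCTION bump_age_summary() RETURNS trigger AS $$
BEGIN
    -- Cost is proportional to the change (one row), not the table size.
    UPDATE users_age_summary SET total = total + NEW.age, cnt = cnt + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_age_summary_ins
AFTER INSERT ON users
FOR EACH ROW EXECUTE FUNCTION bump_age_summary();

-- Reading the "cached" aggregate is now O(1) instead of a table scan.
SELECT total::numeric / NULLIF(cnt, 0) AS avg_age FROM users_age_summary;
```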
ReadySet is derived from research on Noria, whose paper is here: https://www.usenix.org/conference/osdi18/presentation/gjengs...
Push down filters to read the least data possible.
Or, know your data and be able to tell the query engine which kind of join strategy you would like (hash vs push down)
There is a real need to be able to take key queries and say, "don't change the way you run this query." Most databases offer this. Unfortunately, PostgreSQL doesn't. There are ways to force the join (e.g. using a series of queries with explicit temporary tables), but they all create overhead. And the result is that a PostgreSQL-backed website will sometimes see a good query plan flip to a bad one, and then have problems. Just because it is Tuesday.
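The temp-table workaround looks roughly like this (table names and the email value are illustrative): materialize the selective side by hand, then join against it, trading plan stability for extra writes.

```sql
-- Step 1: pin down the small, selective set yourself.
CREATE TEMP TABLE tmp_target_users AS
SELECT id FROM users WHERE email = 'someone@example.com';

-- Step 2: give the planner real row counts for the temp table.
ANALYZE tmp_target_users;

-- Step 3: the outer query now joins against a known-small input,
-- which keeps the plan from flipping as the big tables' stats drift.
SELECT o.*
FROM orders o
JOIN tmp_target_users t ON t.id = o.user_id
WHERE o.status = 'shipped';
```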
We've hit this with MSSQL too. Suddenly production is down because for whatever reason MSSQL decided to forget its good plan and instead table scan, and then continue to reuse that cached table-scanning plan.
For one specific query that MSSQL likes to do this with at a certain customer, we've so far just added the minutes since the start of the year as a dummy column while we work on more pressing issues. Very blunt, yet it works.
But here they are deciding between pushing down "o.status==shipped" and "u.email==address@" on both sides in parallel and then joining (which they already did), versus first doing "u.email==address@" and then pushing down "u.id==o.user_id".
This is a judgment call. Their planner is pretty dumb to not know which one is better, but “push down as much as possible” doesn't cut it: you need to actually decide what to push down and why.
I'd be very surprised if virtually every RDBMS doesn't do this already.
Would be curious if any RocksDB knowledgeable people have a different analysis.
I would disagree with this assumption for two reasons. First, theoretically, a file system is a key-value store, and basically all databases run on file systems, so it stands to reason that any optimization Postgres does can be achieved as an abstraction over a key-value store with a good API, because Postgres already did it.
Second, less theoretically, this has already been done by CockroachDB, which currently stores data in Pebble (CRDB’s Go rewrite of RocksDB) and previously used RocksDB itself, and by TiDB, which stores its data in TiKV.
A thin wrapper over a KV store will only be able to use optimizations provided by the KV store, but if your wrapper is thick enough to include abstractions like adding multiple tables or inserting values into multiple cells in multiple tables atomically, then you can build arbitrary indices into the abstraction.
I wouldn’t tend to call a KV store a bad database engine because I don’t think of it as a database engine at all. It might technically be one under the academic definition of a database engine, but I mostly see it being used as a building block in a more complicated database engine.
At every place I’ve worked, this is a solved problem: Hive+Spark, just keep everything sharded across a ton of machines.
It’s cheaper to pay for a Hive cluster that does dumb queries than to pay for expensive DB licenses, data engineers building arbitrary indices, etc… just throw compute at the problem, who cares. 1 TB of RAM/flash is so cheap these days.
Even working on the worlds “biggest platforms” a daily partition of user data is like 2TB.
You’re telling me an F500 can’t buy a 5-machine/40 TB cluster for like $40k and basically be set?
“Just dump it in Hadoop” became an anti-pattern, and everyone yearned for databases and clean data and not dealing with internal IT and the cluster “admins”.
For the first decade I wanted nothing to do with DBs aside from as places to store data. One day I saw a few things that made a massive difference, and then I went wild on learning how to speed things up. It’s fantastic, and because few devs know this stuff well, it becomes a superpower. You wouldn’t believe what you can squeeze out of modern SQL DBs and hardware, without touching any kind of optimised solutions. Which I love too, but that’s a different post.
Maybe ask the DBAs a few questions and see if that triggers any interest for you. Look at query plans and how many rows are processed for a query. How many columns. What is being locked. Can you remove locks when you’re just running a query, and how much does that speed things up? There are queries for all sorts of metrics, e.g. which indexes are huge but never used (see the sketch below). The DB can often suggest indexes, but don’t just add the suggestions blindly; use them as a starting point to reason about your own. Try to get down to low-millisecond queries for really frequent stuff, because it’ll make them fast and means less time locking the DB, less RAM, less temp table storage.
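For example, the "huge but never used" question can be answered in Postgres straight from the stats views (interpret with care; the counters reset along with the stats):

```sql
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0                       -- never scanned since stats were last reset
ORDER BY pg_relation_size(indexrelid) DESC;
```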
All my other skills have aged. Fundamental database knowledge lasts.
I totally lose respect for senior engineers who can’t write SQL to find even simple answers to questions.
It’s never bad to have another arrow in your quiver
Does your hot-path query have a covering index? Is it seekable? Are your predicates SARG-able?
Seriously, the biggest bang for your buck is to understand SARG-ability and to understand indexes in your database really well.
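A small sketch of both ideas (SQL Server-flavored syntax; the table is illustrative): the predicate has to expose the bare column for an index seek, and an INCLUDE-ing index lets the query be answered without touching the base table at all.

```sql
-- Non-sargable: the function call on the column hides it from the index, so this scans.
SELECT order_id, total FROM orders WHERE YEAR(created_at) = 2024;

-- Sargable rewrite: a bare column compared against a range can be seeked.
SELECT order_id, total FROM orders
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';

-- Covering index: the sargable query above is answered entirely from the index
-- (seek on created_at, with order_id/total carried along), no lookups back to the table.
CREATE INDEX ix_orders_created_at ON orders (created_at) INCLUDE (order_id, total);
```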
There are two distinct job functions.
1) DBAs - maintain the OS and DB software/hardware. DBs are complex beasts; you want experts setting up hardware or deploying into the clown in a smarter way (virtual machines in the clown will cost you a fortune and your sanity).
2) Database developers - specialists in writing SQL.
The two functions work together but are distinct.
Nowadays, due to private equity and Stanford MBAs, we have junior engineers doing it all, plus “DevOps”.
It’s an absolute circus.
IMO, this is why we end up with all these crazy DB startups - routing around the damage.
RDBMSs on modern hardware are insanely fast and powerful.
Some folks pointed out that index pushdowns and join optimizations aren’t novel. That’s fair. In a traditional database engine, pushdowns and access path selection are standard. But Readyset isn’t a conventional engine.
When you create a materialized view in Readyset, the query is compiled into a dataflow graph designed for high-throughput incremental updates, not per-request planning. We receive changes from the upstream database’s replication stream and propagate deltas through the graph. Reads typically hit the cache directly at sub-ms latency.
But when a key hasn’t yet been materialized, we perform what we call an upquery -- a one-off pull from the base tables (stored in RocksDB) to hydrate the missing result. Since we don’t re-plan queries on each request, the structure of that upquery, including filter pushdowns and join execution, is precompiled into the dataflow.
Straddled joins, where filtering is required on both sides of the join, are especially tricky in this model. Without smarter pushdown, we were overfetching data and doing unnecessary join work. This optimization pushes composite filters into both sides of the join to reduce RocksDB scans and hash table size.
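For concreteness, the shape in question looks roughly like this (names are illustrative), with one filter on each input of the join; pushing both below the join means each side is trimmed before any rows are fed into the join, instead of joining first and filtering the result.

```sql
SELECT o.id, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.email  = 'someone@example.com'   -- filter on the users side
  AND o.status = 'shipped';              -- filter on the orders side
```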
It’s a well-known idea in the context of traditional databases, but making it work in a static, incrementally maintained dataflow system is what makes it unique here.
Happy to go deeper if folks are curious. Appreciate the thoughtful feedback.
It could just be that they translated from their original language to English and got that as a byproduct. Many such cases.
I for one appreciate such comments, given the guidelines to avoid submission summaries.
You're free to opine on the reason for downvotes too. This metacomment, however, is more noise than signal.
One line summaries of comprehensible articles can get downvoted because they don't add value beyond what's already very clear from the article.
And the fact that multiple people upvoted my comment at a minimum suggests others also believe it to be a possible explanation.
I have no idea why you've chosen this particular hill to die on, when neither of us stands to profit from this protracted exchange.
Either way, I didn't even downvote the OP so you're beeping at the wrong human