I was thinking of using Citus for this, but possibly DuckDB is a better way to do it. Citus comes with a lot more out of the box, but DuckDB could be a good stepping stone.
I think this opens up a lot of interesting possibilities, such as more powerful analytics notebooks like marimo (https://marimo.io/) … and that’s just one example of many.
Depends on your definition of medium-sized, but for tables of hundreds of thousands of rows and ~30 columns, these tools are fast enough to run queries instantly or near-instantly, even on laptop CPUs.
The first is simply that it's fast - for example, DuckDB has one of the best CSV readers around, and it's parallelised.
Next, engines like DuckDB are optimised for aggregate analysis, where a single query processes a lot of rows (often a significant % of all rows). That means a full scan is not necessarily as big a problem as it first appears. It's not like a transactional database, where you often need to quickly locate and update a single row out of millions.
In addition, engines like DuckDB support predicate pushdown: if your data is stored in parquet format, you don't need to scan every row, because parquet files hold metadata (min/max values per row group) about the values they contain, so whole chunks can be skipped.
Finally, parquet is a columnar format, so a query only scans the columns it actually touches, rather than processing whole rows when you may only be interested in one or two columns.
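A minimal sketch of how those last two points combine (the file and column names here are made up): DuckDB reads only the two columns the query touches, and the per-row-group min/max metadata in the parquet files lets it skip chunks that can't contain matching dates:

    -- hypothetical file/column names
    SELECT event_date, sum(amount)
    FROM 'events/*.parquet'
    WHERE event_date >= DATE '2024-01-01'  -- pushed down to the parquet reader
    GROUP BY event_date;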
Related question, curious as to your experience with DuckLake if you've used it. I'm currently setting up S3 + Iceberg + DuckDB for my company (a startup) and was wondering what to pick between Iceberg and DuckLake.
"We're moving towards a simpler world where most tabular data can be processed on a single large machine1 and the era of clusters is coming to an end for all but the largest datasets."
That claim becomes very debatable. Depending on how you want to pivot/scale/augment your data, even datasets that seemingly "fit" on large boxes will quickly OOM you.
The author also has another article where they claim that:
"SQL should be the first option considered for new data engineering work. It’s robust, fast, future-proof and testable. With a bit of care, it’s clear and readable." (over polars/pandas etc)
This does not map to my experience at all, outside of the realm of nicely parsed datasets that don't require too much complicated analysis or augmentation.
From my experience, the data modelling side is still overwhelmingly in SQL. The ingestion side is definitely mostly Python/Scala though.
But the problem is the ecosystem hasn't standardised on any of them, and it's annoying to have to rewrite pipelines from one dataframe API to another.
I also agree you're gonna hit OOM if your data is massive, but my guess is the vast majority of tabular data people process is <10GB, and that'll generally process fine on a single large machine. Certainly in my experience it's common to see Spark being used on datasets that are nowhere near big enough to need it. DuckDB is gaining traction, but a lot of people still seem unaware how quickly you can process multiple GB of data on a laptop nowadays.
I guess my overall position is it's a good idea to think about using DuckDB first, because often it'll do the job quickly and easily. There are a whole host of scenarios where it's inappropriate, but it's a good place to start.
-- Support for .parquet, .json, .csv (note: Spotify listening history comes in multiple .json files, something fun to play with).
-- Support for glob reading, like: select * from 'tsa20*.csv' - so you can read hundreds of files (any type of file!) as if they were one file.
-- If the files don't have the same schema, union_by_name is amazing (see the sketch after this list).
-- The .csv parser is amazing; it auto-assigns types well.
-- It's small! The WebAssembly version is 2 MB; the CLI is 16 MB.
-- Because it is small you can add DuckDB directly to your product, like Malloy has done: https://www.malloydata.dev/ - I think of Malloy as a technical person's alternative to PowerBI and Tableau, but it uses a semantic model that helps AI write amazing queries on your data. Edit: Malloy makes SQL 10x easier to write because of its semantic nature. Malloy transpiles to SQL, like TypeScript transpiles to JavaScript.
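A minimal sketch of the glob + union_by_name combination from the list above (the file names are hypothetical; Spotify's actual export names may differ):

    -- read many CSVs as one table, aligning columns by header name
    -- rather than position, so files with drifted schemas still work
    SELECT *
    FROM read_csv('tsa20*.csv', union_by_name = true);

    -- same idea for a JSON export split across files;
    -- read_json_auto infers the schema
    SELECT * FROM read_json_auto('StreamingHistory*.json');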
Doing that in Postgres takes some time, and even a simple count(*) takes a lot of time (with all columns indexed) - Postgres's MVCC means it still has to check row visibility against the table, so indexes don't help much with count(*).
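For contrast, a minimal sketch (hypothetical file name): for parquet input, DuckDB can answer a bare count(*) from the row-group metadata alone, without scanning the rows:

    SELECT count(*) FROM 'events.parquet';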
I'm still waiting for Excel to load the file.