“Ducklake DuckDB extension” really rolls off the tongue /s.
Of course, the performance will depend on the database.
https://quesma.com/blog-detail/apache-iceberg-practical-limi...
Even Snowflake was using FoundationDB for metadata, whereas Iceberg attempts to use blob storage even for the metadata layer.
* number of snapshots
* frequent large schema changes
* lots of small files/row level updates
* lots of stats
The last one IIRC used to be pretty bad especially with larger schemas.
Most engines have ways to help with this - compaction, snapshot expiration, etc. - though it can still be up to the user. S3 Tables is supposed to do some of this for you.
If metadata is below 1-5MB it’s really not an issue. Your commit rate is effectively limited by the size of your metadata and the number of writers you have.
I’ve written scripts to fix 1GB+ metadata files in production. Usually it was pruning snapshots without deleting files (relying on bucket policy to later clean things up) or removing old schema versions.
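For reference (not DuckLake-specific), the usual knobs in Iceberg are exposed as Spark procedures. A rough sketch, with the catalog and table names made up:

    -- Compact many small data files into larger ones (hypothetical catalog/table names)
    CALL my_catalog.system.rewrite_data_files(table => 'db.events');

    -- Expire old snapshots so the metadata stops accumulating; tune the cutoff to your retention needs
    CALL my_catalog.system.expire_snapshots(
      table       => 'db.events',
      older_than  => TIMESTAMP '2025-01-01 00:00:00',
      retain_last => 10
    );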
They support syncing to Iceberg by writing the manifest and metadata files on demand, and they already have read support for Iceberg. They just fixed Iceberg's core issues, but it's not a direct competitor: you can use DuckLake along with Iceberg in a very nice, bidirectional way.
(disclaimer: I work at DuckDB Labs)
> About the COPY statement, it means we can drop Parquet files ourselves in the blob storage ?
Dropping the Parquet files into the blob storage yourself will not work: you have to COPY them through DuckLake so that the catalog database is updated with the required catalog and metadata information.
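A minimal sketch of what that looks like, assuming an attached DuckLake catalog named `lake`, made-up table and file names, and S3 credentials already configured:

    -- Load the batch through DuckLake so the catalog records the new file and its stats
    COPY lake.events FROM 's3://my-bucket/staging/batch_0142.parquet' (FORMAT parquet);

    -- equivalently:
    INSERT INTO lake.events
    SELECT * FROM read_parquet('s3://my-bucket/staging/batch_0142.parquet');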
If the thesis here is that most datasets are small, fair enough - but then why use a lake instead of big postgres, yknow?
Even if it's in the TB-range, we're at the point where high-spec laptops can handle it (my own benchmarking: https://ibis-project.org/posts/1tbc/). When I tried to go up to 10TB TPC-H queries on large cloud VMs I did hit some malloc (or other memory) issues, but that was a while ago and I imagine DuckDB can fly past that these days too. Single-node definitely has limits, but it's hard to see how 99%+ of organizations really need distributed computing in 2025.
Yes, there must be some use cases where you need all the data loaded up and addressable seamlessly across a cluster, but those are rare and typically FAANG-class problems.
It's currently only DuckDB-specific because the initial implementation that supports this new catalog is a DuckDB extension.
They don't focus on solving the scalability problem in the metadata layer; you might need to scale your PostgreSQL independently as you have many DuckDB compute nodes running on the edge.
> They solve query parallelism by allowing you to perform computations on the edge, enabling horizontal scaling the compute layer.
Hmm, I don't understand this one. How do you horizontally scale a query that scans all data to do `select count(*), col from huge_table group by col`, for example? In a traditional map-reduce engine, that turns into parallel execution over chunks of data, which get later merged for the final result. In DuckDB, doesn't that necessarily get done by a single node which has to inspect every row all by itself?
A single node can scan through several gigabytes of data per second. When the column data is compressed through various algorithms, this means billions of rows / sec.
You can scale horizontally if you have many concurrent queries pretty well, that’s what I was referring to.
~40y ago, the data warehouse was invented: an overnight ETL process would collect data from smaller DBs into a central DB (the data warehouse).
~15y ago, the data lake (i.e. Hadoop) emerged to address scaling and other things. Same idea, but ELT instead of ETL: less focus on schema, collect the data into S3 and transform it later.
It seems like this will just work out of the box, and just scale up to very reasonable data sizes. And the work from the DuckDB folks is typically excellent. It's clear they understand this space. Excited to try it out!
(Disclosure, I am a developer of marimo.)
datapains has some good stuff to get Trino running, and you can get a Hive metastore running with some hacking. I dorked around with it, got the Iceberg connector working with Trino, and saw how it all works. I load data into a dumb Hive table with a Trino table pointed at it and then insert-from-select ... into Iceberg.
If the duck guys have some simple-to-get-running stuff, they could probably start to eat everyone else's lunch.
However, this mostly reads to me as a thought experiment:
> what if the backend service of an Iceberg catalog was just a SQL database?
The manifesto says that maintaining a data lake catalog is easier, which I agree with in theory. s3-files-as-information-schema presents real challenges!
But, what I most want to know is what's the end-user benefit?
What does someone get with this if they're already using Apache Polaris or Lakekeeper as their Iceberg REST catalog?
it adds for users the following features to a data lake:
- multi-statement & multi-table transactions
- SQL views
- delta queries
- encryption
- low latency: no S3 metadata
- inlining: store small inserts in-catalog
and more!
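For example, the first item is just plain SQL against the attached catalog (the table names here are made up):

    -- Either both tables see the new data, or neither does
    BEGIN TRANSACTION;
    INSERT INTO lake.orders      SELECT * FROM staging_orders;
    INSERT INTO lake.order_items SELECT * FROM staging_order_items;
    COMMIT;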
In a format like Iceberg, compaction (deleting the build-up of many small add/delete files) means you lose the ability to time travel to those earlier states.
With DuckLake's ability to refer to parts of Parquet files, we can preserve the ability to time travel even after deleting the old Parquet files.
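If I understand the docs correctly, time travel then looks roughly like this from DuckDB; the AT clause and the version/timestamp values here are my assumptions, not verified:

    -- Query the table as of an older snapshot / point in time (illustrative only)
    SELECT count(*) FROM lake.events AT (VERSION => 42);
    SELECT count(*) FROM lake.events AT (TIMESTAMP => TIMESTAMP '2025-05-01 00:00:00');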
With DuckLake you can build a petabyte-scale warehouse with multiple reader and writer instances, all transactional, on your S3 and your EC2 instances.
MotherDuck has limitations like only one writer instance. Read replicas can be 1 minute behind (not transactional).
Having different instances concurrently writing to different tables is not possible.
Ducklake gives proper separation of compute and storage with a transactional metadata layer.
Same question for presto, trino, dremio, snowflake, bigquery, etc.
These I'd argue are not the natural use cases for a data lake, especially a design which uses multiple writers to a given table.
The main difference is to store metadata and stats also directly in SQL databases, which makes perfect sense for smaller-scale data. In fact we were doing something similar in https://github.com/Mooncake-Labs/pg_mooncake: metadata is stored in pg tables and only periodically flushed to actual formats like Iceberg.
Wouldn't be surprised if 5 years down the line we see DuckDBaaS execs offering fully paid trips to DuckDBForce or smth like that.
would be nice to have some tutorial/use-cases in the doc :)
However, I constantly run into problems with concurrent writes. I have a cloud function triggered every x minutes to pull data from an API, and that's fine.
But if I need to run a backfill, I risk that process running at the same time as the timer-triggered function, especially if I load my backfill queue with hundreds of runs that need to be pulled and they start saturating the workers in the cloud function.
However, in Iceberg there are Copy-On-Write and Merge-On-Read strategies for dealing with updates.
(I work at DuckDB Labs.)
That way you could use another query engine while still using DuckDB to handle the data mutation.
I have an application which ought to be a near-perfect match for Parquet. I have a source of timestamped data (basically a time series, except that the intervals might not be evenly spaced -- think log files). A row is a timestamp and a bunch of other columns, and all the columns have data types that Parquet handles just fine [0]. The data accumulates, and it's written out in batches, and the batches all have civilized sizes. The data is naturally partitioned on some partition column, and there is only one writer for each value of the partition column. So far, so good -- the operation of writing a batch is a single file creation or create call to any object store. The partition column maps to the de-facto sort-of-standard Hive partitioning scheme.
Except that the data is (obviously) also partitioned on the timestamp -- each batch covers a non-overlapping range of timestamps. And Hive partitioning can't represent this. So none of the otherwise excellent query tools can naturally import the data unless I engage in a gross hack:
I could also partition on a silly column like "date". This involves aligning batches to date boundaries and also makes queries uglier.
I could just write the files and import "*.parquet". This kills performance and costs lots of money.
I could use Iceberg or Delta Lake or whatever for the sole benefit that their client tools can handle ranged partitions. Gee thanks. I don't actually need any of the other complexity.
It would IMO be really really nice if everyone could come up with a directory-name or filename scheme for ranged partitioning.
[0] My other peeve is that a Parquet row and an Arrow row and a Thrift message and a protobuf message, etc., are almost but not quite the same thing. It would be awesome if there were a companion binary format for a single Parquet row or a stream of rows so that tools could cooperate more easily on producing the data that eventually gets written into Parquet files.
Sure, metadata in the Parquet file handles this, but a query planner has to read that metadata, whereas a sensible way to stick the metadata in the file path would allow avoiding reading the file at all.
If I have an S3 bucket containing five years worth of Parquet files, each covering a few days worth of rows, and I tell my favorite query tool (DuckDB, etc) about that bucket, then the tool will need to do a partial read (which is multiple operations, I think, since it will need to find the footer and then read the footer) of ~500 files just to find out which ones contain the data of interest. A good query plan would be to do a single list operation on the bucket to find the file names and then to read the file or files needed to answer my query.
Iceberg and Delta Lake (I think -- I haven't actually tried it) can do this, but plain Parquet plus Hive partitioning can't, and I'm not aware of any other lightweight scheme that is well supported that can do it. My personal little query tool (which predates Parquet) can do it just fine by the simple expedient of reading directory names.
In theory, going off of the schema diagram they have, all your files are listed in `data_file`, the timestamp range for that file would be in `file_column_stats`, and that information could be used to decide what files to _actually_ read based on your query.
Whether DuckDB's query engine takes advantage of this is a different story, but even if it doesn't yet, it should be possible to do so eventually.
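Conceptually the pruning query is something like the following; the table names come from the schema diagram, but the column names are my guesses:

    -- Find only the files whose ts range overlaps the query window
    SELECT f.path
    FROM data_file f
    JOIN file_column_stats s ON s.data_file_id = f.data_file_id
    WHERE s.column_name = 'ts'
      AND s.min_value <= '2025-05-27 23:59:59'
      AND s.max_value >= '2025-05-27 00:00:00';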
Just beware that one issue you can have is the limit of row groups per file (2^15).
That is the kind of metadata that is useful to push up, into something like DuckLake.
Someone could buck the trend and extend Hive’s scheme to support ranges.
Now your query engine might require you to specify the partitions or range of partitions you want to query on; you absolutely can use `datepartition >= a AND datepartition < b` in your query. Iceberg seems to fix that and just lets you use the timestamp; presumably the metadata is smart enough to exclude the partitions you don't care about.
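Concretely, with DuckDB's hive_partitioning option that looks something like this (the paths and column names are hypothetical):

    SELECT *
    FROM read_parquet('s3://bucket/logs/datepartition=*/*.parquet', hive_partitioning = true)
    WHERE datepartition >= '2025-05-01' AND datepartition < '2025-05-08'        -- prunes whole files
      AND ts >= TIMESTAMP '2025-05-03 00:00:00'
      AND ts <  TIMESTAMP '2025-05-04 00:00:00';                                -- filters rows within them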
There’s nothing conceptually or algorithmically difficult about what I want to do. All that’s needed is to encode a range of times into the path of a segment. But Hive didn’t do this, and everyone implemented Hive’s naming scheme, and that’s the status quo now.
Do you need to put a REST API in front of this DuckDB instance to make it an Iceberg catalog?
One thing I noticed is that the `table_stats` and `column_stats` tables aren't snapshot versioned. What are these used for and isn't that going to hurt timetravel queries (`SELECT COUNT(*) FROM tbl WHERE snapshot_id=<old_id>` as a simple example)?
You can check out our sandbox at https://yeet.cx/play
I don’t think DuckLake itself will succeed in getting adopted beyond DuckDB, but I would not be surprised if over time the catalog just absorbs the metadata, and the original Iceberg format fades into history as a transitional form.
Existing Lakehouse systems like Iceberg store crucial table information (like schema and file lists) as many small "metadata files" in cloud object storage (like S3). Accessing these files requires numerous network calls, making operations like query planning and updating tables inefficient and prone to conflicts. DuckLake solves this by putting all that metadata into a fast, transactional SQL database, using a single query to get what's needed, which is much quicker and more reliable.
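Roughly how that looks from the DuckDB side, per the announcement; the Postgres connection string and bucket path below are placeholders:

    INSTALL ducklake;
    LOAD ducklake;
    -- metadata lives in Postgres, data files live on S3
    ATTACH 'ducklake:postgres:dbname=lake host=catalog.internal' AS lake
      (DATA_PATH 's3://my-bucket/lake/');
    SELECT count(*) FROM lake.events;  -- planning is one catalog query; data reads hit S3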
One question for me: let's say I want to start using this today, and at work we are running Snowflake. I get that each analytics person would have to run DuckDB + this extension on their local machine and point it to the blob store and the database running the DuckLake catalog - for now that would be, say, a VM running DuckDB. When I run the actual query, where does the computation happen? And what if I want a lot of computation?
Is the current solution to host a huge DuckDB VM that everyone SSHes into to run their queries, or how does that part work?
And yes, to get more compute, you'd want to spin up a VM.
What's cool is you can do both (run locally for small stuff, run on VM for heavy workloads).