“Ducklake DuckDB extension” really rolls off the tongue /s.
Of course, the performance will depend on the database.
https://quesma.com/blog-detail/apache-iceberg-practical-limi...
Even Snowflake was using FoundationDB for metadata, whereas Iceberg attempts to use blob storage even for the metadata layer.
* number of snapshots
* frequent large schema changes
* lots of small files/row level updates
* lots of stats
The last one IIRC used to be pretty bad especially with larger schemas.
Most engines have ways to help with this - compaction, snapshot expiration, etc. - though it can still be up to the user. S3 Tables is supposed to do some of this for you.
If metadata is below 1-5MB it’s really not an issue. Your commit rate is effectively limited by the size of your metadata and the number of writers you have.
I’ve written scripts to fix 1GB+ metadata files in production. Usually it was pruning snapshots without deleting files (relying on bucket policy to later clean things up) or removing old schema versions.
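For reference (not DuckLake-specific), the usual knobs in Iceberg are exposed as Spark procedures. A rough sketch, with the catalog and table names made up:

    -- Compact many small data files into larger ones (hypothetical catalog/table names)
    CALL my_catalog.system.rewrite_data_files(table => 'db.events');

    -- Expire old snapshots so the metadata stops accumulating; tune the cutoff to your retention needs
    CALL my_catalog.system.expire_snapshots(
      table       => 'db.events',
      older_than  => TIMESTAMP '2025-01-01 00:00:00',
      retain_last => 10
    );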
They support syncing to Iceberg by writing the manifest and metadata files on demand, and they already have read support for Iceberg. They just fixed Iceberg's core issues, but it's not a direct competitor: you can use DuckLake along with Iceberg in a very nice, bidirectional way.
(disclaimer: I work at DuckDB Labs)
> About the COPY statement, it means we can drop Parquet files ourselves in the blob storage ?
Dropping the Parquet files into the blob storage yourself will not work: you have to COPY them through DuckLake so that the catalog database is updated with the required catalog and metadata information.
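A minimal sketch of what that looks like, assuming an attached DuckLake catalog named `lake`, made-up table and file names, and S3 credentials already configured:

    -- Load the batch through DuckLake so the catalog records the new file and its stats
    COPY lake.events FROM 's3://my-bucket/staging/batch_0142.parquet' (FORMAT parquet);

    -- equivalently:
    INSERT INTO lake.events
    SELECT * FROM read_parquet('s3://my-bucket/staging/batch_0142.parquet');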
If the thesis here is that most datasets are small, fair enough - but then why use a lake instead of big postgres, yknow?
Even if it's in the TB-range, we're at the point where high-spec laptops can handle it (my own benchmarking: https://ibis-project.org/posts/1tbc/). When I tried to go up to 10TB TPC-H queries on large cloud VMs I did hit some malloc (or other memory) issues, but that was a while ago and I imagine DuckDB can fly past that these days too. Single-node definitely has limits, but it's hard to see how 99%+ of organizations really need distributed computing in 2025.
Yes, there must be some use cases where you need all the data loaded up and addressable seamlessly across a cluster, but those are rare and typically FAANG-class problems.
It's currently only DuckDB-specific because the initial implementation that supports this new catalog is a DuckDB extension.
They don't focus on solving the scalability problem in the metadata layer; you might need to scale your PostgreSQL independently as you have many DuckDB compute nodes running on the edge.
> They solve query parallelism by allowing you to perform computations on the edge, enabling horizontal scaling the compute layer.
Hmm, I don't understand this one. How do you horizontally scale a query that scans all data to do `select count(*), col from huge_table group by col`, for example? In a traditional map-reduce engine, that turns into parallel execution over chunks of data, which get later merged for the final result. In DuckDB, doesn't that necessarily get done by a single node which has to inspect every row all by itself?
A single node can scan through several gigabytes of data per second. When the column data is compressed through various algorithms, this means billions of rows / sec.
You can scale horizontally if you have many concurrent queries pretty well, that’s what I was referring to.
~40y ago, the data warehouse was invented: an overnight ETL process would collect data from smaller DBs into a central DB (the data warehouse).
~15y ago, the data lake (i.e. Hadoop) emerged to address scaling and other things. Same idea, but ELT instead of ETL: less focus on schema, collect the data into S3 and transform it later.
It seems like this will just work out of the box, and just scale up to very reasonable data sizes. And the work from the DuckDB folks is typically excellent. It's clear they understand this space. Excited to try it out!
(Disclosure, I am a developer of marimo.)
datapains has some good stuff to get Trino running, and you can get a Hive metastore running with some hacking. I dorked around with it, got the Iceberg connector working with Trino, and saw how it all works. I load data into a dumb Hive table with a Trino table pointed at it and then insert-from-select ... into Iceberg.
If the duck guys have some simple-to-get-running stuff, they could probably start to eat everyone else's lunch.
However, this mostly reads to me as a thought experiment:
> what if the backend service of an Iceberg catalog was just a SQL database?
The manifesto says that maintaining a data lake catalog is easier, which I agree with in theory. s3-files-as-information-schema presents real challenges!
But, what I most want to know is what's the end-user benefit?
What does someone get with this if they're already using Apache Polaris or Lakekeeper as their Iceberg REST catalog?
it adds for users the following features to a data lake:
- multi-statement & multi-table transactions
- SQL views
- delta queries
- encryption
- low latency: no S3 metadata
- inlining: store small inserts in-catalog
and more!
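For example, the first item is just plain SQL against the attached catalog (the table names here are made up):

    -- Either both tables see the new data, or neither does
    BEGIN TRANSACTION;
    INSERT INTO lake.orders      SELECT * FROM staging_orders;
    INSERT INTO lake.order_items SELECT * FROM staging_order_items;
    COMMIT;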
In a format like Iceberg, compaction (deleting the build-up of many small add/delete files) means you lose the ability to time travel to those earlier states.
With DuckLake's ability to refer to parts of Parquet files, we can preserve the ability to time travel even after deleting the old Parquet files.
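If I understand the docs correctly, time travel then looks roughly like this from DuckDB; the AT clause and the version/timestamp values here are my assumptions, not verified:

    -- Query the table as of an older snapshot / point in time (illustrative only)
    SELECT count(*) FROM lake.events AT (VERSION => 42);
    SELECT count(*) FROM lake.events AT (TIMESTAMP => TIMESTAMP '2025-05-01 00:00:00');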
With DuckLake you can build a petabyte-scale warehouse with multiple reader and writer instances, all transactional, on your S3 and your EC2 instances.
MotherDuck has limitations like only one writer instance. Read replicas can be 1 minute behind (not transactional).
Having different instances concurrently writing to different tables is not possible.
Ducklake gives proper separation of compute and storage with a transactional metadata layer.
Same question for presto, trino, dremio, snowflake, bigquery, etc.
These I'd argue are not the natural use cases for a data lake, especially a design which uses multiple writers to a given table.
The main difference is to store metadata and stats also directly in SQL databases, which makes perfect sense for smaller-scale data. In fact we were doing something similar in https://github.com/Mooncake-Labs/pg_mooncake: metadata is stored in pg tables and only periodically flushed to actual formats like Iceberg.
Wouldn't be surprised if 5 years down the line we see DuckDBaaS execs offering fully paid trips to DuckDBForce or smth like that.
would be nice to have some tutorial/use-cases in the doc :)
However, I constantly run into problems with concurrent writes. I have a cloud function triggered every x minutes to pull data from an API, and that's fine.
But if I need to run a backfill, I risk that process running at the same time as the timer-triggered function, especially if I load my backfill queue with hundreds of runs that need to be pulled and they start saturating the workers in the cloud function.
However, in Iceberg there are Copy-On-Write and Merge-On-Read strategies for dealing with updates.
(I work at DuckDB Labs.)
That way you could use another query engine while still using DuckDB to handle the data mutation.
I have an application which ought to be a near-perfect match for Parquet. I have a source of timestamped data (basically a time series, except that the intervals might not be evenly spaced -- think log files). A row is a timestamp and a bunch of other columns, and all the columns have data types that Parquet handles just fine [0]. The data accumulates, and it's written out in batches, and the batches all have civilized sizes. The data is naturally partitioned on some partition column, and there is only one writer for each value of the partition column. So far, so good -- the operation of writing a batch is a single file creation or create call to any object store. The partition column maps to the de-facto sort-of-standard Hive partitioning scheme.
Except that the data is (obviously) also partitioned on the timestamp -- each batch covers a non-overlapping range of timestamps. And Hive partitioning can't represent this. So none of the otherwise excellent query tools can naturally import the data unless I engage in a gross hack:
I could also partition on a silly column like "date". This involves aligning batches to date boundaries and also makes queries uglier.
I could just write the files and import "*.parquet". This kills performance and costs lots of money.
I could use Iceberg or Delta Lake or whatever for the sole benefit that their client tools can handle ranged partitions. Gee thanks. I don't actually need any of the other complexity.
It would IMO be really really nice if everyone could come up with a directory-name or filename scheme for ranged partitioning.
[0] My other peeve is that a Parquet row and an Arrow row and a Thrift message and a protobuf message, etc., are almost but not quite the same thing. It would be awesome if there were a companion binary format for a single Parquet row or a stream of rows so that tools could cooperate more easily on producing the data that eventually gets written into Parquet files.
Sure, metadata in the Parquet file handles this, but a query planner has to read that metadata, whereas a sensible way to stick the metadata in the file path would allow avoiding reading the file at all.
If I have an S3 bucket containing five years worth of Parquet files, each covering a few days worth of rows, and I tell my favorite query tool (DuckDB, etc) about that bucket, then the tool will need to do a partial read (which is multiple operations, I think, since it will need to find the footer and then read the footer) of ~500 files just to find out which ones contain the data of interest. A good query plan would be to do a single list operation on the bucket to find the file names and then to read the file or files needed to answer my query.
Iceberg and Delta Lake (I think -- I haven't actually tried it) can do this, but plain Parquet plus Hive partitioning can't, and I'm not aware of any other lightweight scheme that is well supported that can do it. My personal little query tool (which predates Parquet) can do it just fine by the simple expedient of reading directory names.
In theory, going off of the schema diagram they have, all your files are listed in `data_file`, the timestamp range for that file would be in `file_column_stats`, and that information could be used to decide what files to _actually_ read based on your query.
Whether DuckDB's query engine takes advantage of this is a different story, but even if it doesn't yet, it should be possible to do so eventually.
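Conceptually the pruning query is something like the following; the table names come from the schema diagram, but the column names are my guesses:

    -- Find only the files whose ts range overlaps the query window
    SELECT f.path
    FROM data_file f
    JOIN file_column_stats s ON s.data_file_id = f.data_file_id
    WHERE s.column_name = 'ts'
      AND s.min_value <= '2025-05-27 23:59:59'
      AND s.max_value >= '2025-05-27 00:00:00';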
Just beware that one issue you can have is the limit of row groups per file (2^15).
That is the kind of metadata that is useful to push up, into something like DuckLake.
Someone could buck the trend and extend Hive’s scheme to support ranges.
Now your query engine might require you to specify the partitions or range of partitions you want to query on; you absolutely can use `datepartition >= a AND datepartition < b` in your query. Iceberg seems to fix that and just lets you use the timestamp; presumably the metadata is smart enough to exclude the partitions you don't care about.
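Concretely, with DuckDB's hive_partitioning option that looks something like this (the paths and column names are hypothetical):

    SELECT *
    FROM read_parquet('s3://bucket/logs/datepartition=*/*.parquet', hive_partitioning = true)
    WHERE datepartition >= '2025-05-01' AND datepartition < '2025-05-08'        -- prunes whole files
      AND ts >= TIMESTAMP '2025-05-03 00:00:00'
      AND ts <  TIMESTAMP '2025-05-04 00:00:00';                                -- filters rows within them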
There’s nothing conceptually or algorithmically difficult about what I want to do. All that’s needed is to encode a range of times into the path of a segment. But Hive didn’t do this, and everyone implemented Hive’s naming scheme, and that’s the status quo now.
Do you need to put a REST API in front of this DuckDB instance to make it an Iceberg catalog?
One thing I noticed is that the `table_stats` and `column_stats` tables aren't snapshot versioned. What are these used for and isn't that going to hurt timetravel queries (`SELECT COUNT(*) FROM tbl WHERE snapshot_id=<old_id>` as a simple example)?
You can check out our sandbox at https://yeet.cx/play
I don’t think DuckLake itself will succeed in getting adopted beyond DuckDB, but I would not be surprised if over time the catalog just absorbs the metadata, and the original Iceberg format fades into history as a transitional form.
Existing Lakehouse systems like Iceberg store crucial table information (like schema and file lists) as many small "metadata files" in cloud object storage (like S3). Accessing these files requires numerous network calls, making operations like query planning and updating tables inefficient and prone to conflicts. DuckLake solves this by putting all that metadata into a fast, transactional SQL database, using a single query to get what's needed, which is much quicker and more reliable.
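Roughly how that looks from the DuckDB side, per the announcement; the Postgres connection string and bucket path below are placeholders:

    INSTALL ducklake;
    LOAD ducklake;
    -- metadata lives in Postgres, data files live on S3
    ATTACH 'ducklake:postgres:dbname=lake host=catalog.internal' AS lake
      (DATA_PATH 's3://my-bucket/lake/');
    SELECT count(*) FROM lake.events;  -- planning is one catalog query; data reads hit S3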
One question for me: let's say I want to start using this today, and at work we are running Snowflake. I get that each analytics person would have to run DuckDB + this extension on their local machine and point it to the blob store and the database running the DuckLake catalog - for now that would be, say, a VM running DuckDB. When I run the actual query, where does the computation happen? And what if I want a lot of computation?
Is the current solution to host a huge DuckDB VM that everyone SSHes into to run their queries, or how does that part work?
And yes, to get more compute, you'd want to spin up a VM.
What's cool is you can do both (run locally for small stuff, run on VM for heavy workloads).