Pg_lake: Postgres with Iceberg and data lake access

https://github.com/Snowflake-Labs/pg_lake

168•plaur782•3h ago

Comments

ozgune•3h ago

This is huge!

When people ask me what’s missing in the Postgres market, I used to tell them “open source Snowflake.”

Crunchy’s Postgres extension is by far the most ahead solution in the market.

Huge congrats to Snowflake and the Crunchy team on open sourcing this.

ayhanfuat•2h ago

With S3 Table Buckets, Cloudflare R2 Data Catalog and now this, Iceberg seems to be winning.

dkdcio•2h ago

I was going to ask if you could then put DuckDB over Postgres for the OLAP query engine -- looks like that's already what it does! very interesting development in the data lake space alongside DuckLake and things

pgguru•2h ago

You create foreign tables in postgres using either the pg_lake_table wrapper or pg_lake_iceberg.

Once those tables exist, queries against them are able to either push down entirely to the remote tables and uses a Custom Scan to execute and pull results back into postgres, or we transform/extract the pieces that can be executed remotely using a FDW and then treat it as a tuple source.

In both cases, the user does not need to know any of the details and just runs queries inside postgres as they always have.

spenczar5•1h ago

I think I don't understand postgres enough, so forgive this naive question, but what does pushing down to the remote tables mean? Does it allow parallelism? If I query a very large iceberg table, will this system fan the work out to multiple duckdb executors and gather the results back in?

pgguru•1h ago

In any query engine you can execute the same query in different ways. The more restrictions that you can apply on the DuckDB side the less data you need to return to Postgres.

For instance, you could compute a `SELECT COUNT(*) FROM mytable WHERE first_name = 'David'` by querying all the rows from `mytable` on the DuckDB side, returning all the rows, and letting Postgres itself count the number of results, but this is extremely inefficient, since that same value can be computed remotely.

In a simple query like this with well-defined semantics that match between Postgres and DuckDB, you can run the query entirely on the remote side, just using Postgres as a go-between.

Not all functions and operators work in the same way between the two systems, so you cannot just push things down unconditionally; `pg_lake` does some analysis to see what can run on the DuckDB side and what needs to stick around on the Postgres side.

There is only a single "executor" from the perspective of pg_lake, but the pgduck_server embeds a multi-threaded duckdb instance.

How DuckDB executes the portion of the query it gets is up to it; it often will involve parallelism, and it can use metadata about the files it is querying to speed up its own processing without even needing to visit every file. For instance, it can look at the `first_name` in the incoming query and just skip any files which do not have a min_value/max_value that would contain that.

spenczar5•1h ago

Thanks for the detailed answer!

I use DuckDB today to query Iceberg tables. In some particularly gnarly queries (huge DISTINCTs, big sorts, even just selects that touch extremely heavy columns) I have sometimes run out of memory in that DuckDB instance.

I run on hosts without much memory because they are cheap, and easy to launch, giving me isolated query parallism, which is hard to achieve on a single giant host.

To the extent that its possible, I dream of being able to spread those gnarly OOMing queries across multiple hosts; perhaps the DISTINCTs can be merged for example. But this seems like a pretty complicated system that needs to be deeply aware of Iceberg partitioning ("hidden" in pg_lake's language), right?

Is there some component in the postgres world that can help here? I am happy to continue over email, if you prefer, by the way.

pgguru•51m ago

Well, dealing with large analytics queries will always perform better with larger amounts of memory... :D You can perhaps tune things to perform based on the amount of system memory (IME 80% is what DuckDB targets if not otherwise configured). Your proposed system does sounds like it introduces quite a bit of complexity that would be better served just by using hosts with more memory.

As far as Iceberg is concerned, DuckDB has its own implementation, but we do not use that; pg_lake has its own iceberg implementation. The partitioning is "hidden" because it is separated out from the schema definition itself and can be changed gradually without the query engine needing to care about the details of how things are partitioning at read time. (For writes, we respect the latest partitioning spec and always write according to that.)

beoberha•2h ago

Curious why pgduck_server is a totally separate process?

dkdcio•2h ago

from the README:

> This separation also avoids the threading and memory-safety limitations that would arise from embedding DuckDB directly inside the Postgres process, which is designed around process isolation rather than multi-threaded execution. Moreover, it lets us interact with the query engine directly by connecting to it using standard Postgres clients.

beoberha•1h ago

Thanks! Didn’t scroll down far enough

rmnclmnt•2h ago

The README explains it:

pgguru•2h ago

What has been pointed out from the README; also:

- Separation of concerns, since with a single external process we can share object store caches without complicated locking dances between multiple processes. - Memory limits are easier to reason about with a single external process. - Postgres backends end up being more robust, as you can restart the pgduck_server process separately.

dharbin•2h ago

Why would Snowflake develop and release this? Doesn't this cannibalize their main product?

kentm•2h ago

It's not going to scale as well as Snowflake, but it gets you into an Iceberg ecosystem which Snowflake can ingest and process at scale. Analytical data systems are typically trending to heterogenous compute with a shared storage backend -- you have large, autoscaling systems to process the raw data down to something that is usable by a smaller, cheaper query engine supporting UIs/services.

hobs•2h ago

But if you are used to this type of compute per dollar what on earth would make you want to move to Snowflake?

kentm•1h ago

Different parts of the analytical stack have different performance requirements and characteristics. Maybe none of your stack needs it and so you never need Snowflake at all.

More likely, you don't need Snowflake to process queries from your BI tools (Mode, Tableau, Superset, etc), but you do need it to prepare data for those BI tools. Its entirely possible that you have hundreds of terabytes, if not petabytes, of input data that you want to pare down to < 1 TB datasets for querying, and Snowflake can chew through those datasets. There's also third party integrations and things like ML tooling that you need to consider.

You shouldn't really consider analytical systems the same as a database backing a service. Analytical systems are designed to funnel large datasets that cover the entire business (cross cutting services and any sharding you've done) into subsequently smaller datasets that are cheaper and faster to query. And you may be using different compute engines for different parts of these pipelines; there's a good chance you're not using only Snowflake but Snowflake and a bunch of different tools.

barrrrald•1h ago

One thing I admire about Snowflake is a real commitment to self-cannibalization. They were super out front with Iceberg even though it could disrupt them, because that's what customers were asking for and they're willing to bet they'll figure out how to make money in that new world

Video of their SVP of Product talking about it here: https://youtu.be/PERZMGLhnF8?si=DjS_OgbNeDpvLA04&t=1195

qaq•26m ago

Have you interacted with Snowflake teams much? We are using external iceberg tables with snowflake. Every interaction pretty much boils down to you really should not be using iceberg you should be using snowflake for storage. It's also pretty obvious some things are strategically not implemented to push you very strongly in that direction.

999900000999•25m ago

It'll probably be really difficult to set up.

If it's anything like super base, your question the existence of God when trying to get it to work properly.

You pay them to make it work right.

mberning•2h ago

Does anyone know how access control works to the underlying s3 objects? I didn’t see anything regarding grants in the docs.

pgguru•2h ago

Hi, one of the developers here. You define credentials that can access the S3 buckets and use those as DuckDB secrets, usually in an init script for pgduck_server. (You can see some examples of this in the testing framework.)

I'll see if we can improve the docs or highlight that part better, if it is already documented—we did move some things around prior to release.

onderkalaci•2h ago

Maybe this could help: https://github.com/Snowflake-Labs/pg_lake?tab=readme-ov-file...

mberning•2h ago

Interesting. I am working on a project to integrate access management to iceberg/parquet files for sagemaker. Controlling what users logged into sagemaker studio have access to in s3. It’s fine using static policies for mvp, but eventually it needs to be dynamic and integrated into enterprise iam tools. Those tools generally have great support for managing sql grants. Not so much for s3 bucket policies.

pgguru•2h ago

DuckDB secrets management supports custom IAM roles and the like; at this point we are basically treating the pgduck_server external system as a black box.

For the postgres grants themselves, we provide privs to allow read/write to the remote tables, which is done via granting the `pg_lake_read`, `pg_lake_write` or `pg_lake_read_write` grants. This is a blanket all-or-nothing grant, however, so would need some design work/patching to support per-relation grants, say.

(You could probably get away with making roles in postgres that have the appropriate read/write grant, then only granting those specific roles to a given relation, so it's probably doable though a little clunky at the moment.)

mslot•1h ago

There are Postgres roles for read/write access to the S3 object that DuckDB has access to. Those roles can create tables from specific files or at specific locations, and can then assign more fine-grained privileges to other Postgres roles (e.g. read access on a specific view or table).

chaps•2h ago

I love postgres and have created my own "data lake" sorta systems -- what would this add to my workflows?

gajus•2h ago

Man, we are living in the golden era of PostgreSQL.

anentropic•2h ago

When Snowflake bought Crunchy Data I was hoping they were going to offer a managed version of this

It's great that I can run this locally in a Docker container, I'd love to be able to run a managed instance on AWS billed through our existing Snowflake account

oulipo2•1h ago

Interesting! How does it compare with ducklake?

mslot•1h ago

You could say

With DuckLake, the query frontend and query engine are DuckDB, and Postgres is used as a catalog in the background.

With pg_lake, the query frontend and catalog are Postgres, and DuckDB is used as a query engine in the background.

Of course, they also use different table formats (though similar in data layer) with different pros and cons, and the query frontends differ in significant ways.

An interesting thing about pg_lake is that it is effectively standalone, no external catalog required. You can point Spark et al. directly to Postgres with pg_lake by using the Iceberg JDBC driver.

inglor•1h ago

This is really nice though looking at the code - a lot of the postgres types are missing as well a lot of the newer parquet logical types - but this is a great start and a nice use of FDW.

inglor•1h ago

Also, any planned support for more catalogs?

pgguru•1h ago

Hi, what types are you expecting to see that aren't supported? I believe we had support for most/all builtin postgres types.

iamcreasy•1h ago

Very cool! Was there any inherent limitation with postgresql or its extension system that forced pg_lake to use duckdb as query engine?

mslot•1h ago

I gave a talk on that at Data Council, then still discussing the pg_lake extensions as part if Crunchy Data Warehouse.

https://youtu.be/HZArjlMB6W4?si=BWEfGjMaeVytW8M1

Also, nicer recording from POSETTE: https://youtu.be/tpq4nfEoioE?si=Qkmj8o990vkeRkUa

It comes down to the trade-offs made by operational and analytical query engines being fundamentally different at every level.

pgguru•1h ago

DuckDB provided a lot of infrastructure for reading/writing parquet files and other common formats here. It also was inherently multi-threaded and supported being embedded in a larger program (similar to sqllite), so made it a good basis for something that could work outside of the traditional process model of Postgres.

Additionally, the postgres extension system supports most of the current project, so wouldn't say it was forced in this case, it was a design decision. :)

spenczar5•1h ago

Very cool. One question that comes up for me is whether pg_lake expects to control the Iceberg metadata, or whether it can be used purely as a read layer. If I make schema updates and partition changes to iceberg directly, without going through pg_lake, will pg_lake's catalog correctly reflect things right away?

pgguru•1h ago

We have some level of external iceberg table read-only support, but it is limited at the moment. See this example/caveat: https://github.com/Snowflake-Labs/pg_lake/blob/main/docs/fil...

mslot•1h ago

You can use it as a read layer for for specific metadata JSON URL or a table in a REST catalog. The latter got merged quite recently, not yet in docs.

boshomi•1h ago

Why not just use Ducklake?[1] That reduces complexity[2] since only DuckDB and PostgreSQL with pg_duckdb are required.

[1] https://ducklake.select/

[2] DuckLake - The SQL-Powered Lakehouse Format for the Rest of Us by Prof. Hannes Mühleisen: https://www.youtube.com/watch?v=YQEUkFWa69o

pgguru•51m ago

Boils down to design decisions; see: https://news.ycombinator.com/item?id=45813631

mslot•40m ago

DuckLake is pretty cool, and we obviously love everything the DuckDB is doing. It's what made pg_lake possible, and what motivated part of our team to step away from Microsoft/Citus.

DuckLake can do things that pg_lake cannot do with Iceberg, and DuckDB can do things Postgres absolutely can't (e.g. query data frames). On the other hand, Postgres can do a lot of things that DuckDB cannot do. For instance, it can handle >100k single row inserts/sec.

Transactions don't come for free. Embedding the engine in the catalog rather than the catalog in the engine enables transactions across analytical and operational tables. That way you can do a very high rate of writes in a heap table, and transactionally move data into an Iceberg table.

Postgres also has a more natural persistence & continuous processing story, so you can set up pg_cron jobs and use PL/pgSQL (with heap tables for bookkeeping) to do orchestration.

There's also the interoperability aspect of Iceberg being supported by other query engines.

jabr•26m ago

How does this compare to https://www.mooncake.dev/pgmooncake? It seems there are several projects like this now, with each taking a slightly different approach optimized for different use cases?

j_kao•24m ago

FYI the mooncake team was acquired by Databricks so it's basically vendors trying to compete on features now :)

darth_avocado•1h ago

This is so cool! We have files in Iceberg that we then move data to/from to a PG db using a custom utility. It always felt more like a workaround that didn’t fully use the capabilities of both the technologies. Can’t wait to try this out.

fridder•1h ago

I love this. There are definitely shops where the data is a bit too much for postgres but something like Snowflake would be overkill. Wish this was around a couple years ago lol

drchaim•49m ago

More integrations are great. Anyway, the "this is awesome" moment (for me) will be when you could mix row- and column-oriented tables in Postgres, a bit like Timescale but native Postgres and well done. Hopefully one day.

pgguru•37m ago

Hypertables definitely had the arrays columns auto-expanding with the custom node type. Not sure what else it would look like for what you describe.

That said, don't sleep on the "this is awesome" parts in this project... my personal favorite is the automatic schema detection:

``` CREATE TABLE my_iceberg_table () USING iceberg WITH (definition_from = 's3://bucket/source_data.parquet'); ```

harisund1990•14m ago

This is cool to see! Looks like a compete against pg_mooncake which Databricks acquired. But how is this different from pg_duckdb?

We're open-sourcing the successor of Jupyter notebook

Pg_lake: Postgres with Iceberg and data lake access

Show HN: A CSS-Only Terrain Generator

Man spent 200 days building a solar-powered explorer yacht that can run forever

NoLongerEvil-Thermostat – Nest Generation 1 and 2 Firmware

Launch HN: Plexe (YC X25) – Build production-grade ML models from prompts

Codemaps: Understand Code, Before You Vibe It

What is a manifold?

Exploring a space-based, scalable AI infrastructure system design

Normalize Identifying Corporate Devices in Your Software

Recovering videos from my Sony camera that I stupidly deleted

Optimizing Datalog for the GPU

The Rust Foundation Maintainers Fund

This Day in 1988, the Morris worm infected 10% of the Internet within 24 hours

Chaining FFmpeg with a Browser Agent

How devtools map minified JS code back to your TypeScript source code

Bloom filters are good for search that does not scale

My Truck Desk

Customize Nano Text Editor

Whole Earth Index

Tell HN: X is opening any tweet link in a webview whether you press it or not

The 512KB Club

Aisuru botnet shifts from DDoS to residential proxies

Things you can do with diodes

AI's Dial-Up Era

You can't cURL a Border

When stick figures fought

Show HN: I built a local-first daily planner for iOS

Tenacity – a multi-track audio editor/recorder

Data breach at major Swedish software supplier impacts 1.5M

We're open-sourcing the successor of Jupyter notebook

Pg_lake: Postgres with Iceberg and data lake access

Show HN: A CSS-Only Terrain Generator

Man spent 200 days building a solar-powered explorer yacht that can run forever

NoLongerEvil-Thermostat – Nest Generation 1 and 2 Firmware

Launch HN: Plexe (YC X25) – Build production-grade ML models from prompts

Codemaps: Understand Code, Before You Vibe It

What is a manifold?

Exploring a space-based, scalable AI infrastructure system design

Normalize Identifying Corporate Devices in Your Software

Recovering videos from my Sony camera that I stupidly deleted

Optimizing Datalog for the GPU

The Rust Foundation Maintainers Fund

This Day in 1988, the Morris worm infected 10% of the Internet within 24 hours

Chaining FFmpeg with a Browser Agent

How devtools map minified JS code back to your TypeScript source code

Bloom filters are good for search that does not scale

My Truck Desk

Customize Nano Text Editor

Whole Earth Index

Tell HN: X is opening any tweet link in a webview whether you press it or not

The 512KB Club

Aisuru botnet shifts from DDoS to residential proxies

Things you can do with diodes

AI's Dial-Up Era

You can't cURL a Border

When stick figures fought

Show HN: I built a local-first daily planner for iOS

Tenacity – a multi-track audio editor/recorder

Data breach at major Swedish software supplier impacts 1.5M

Pg_lake: Postgres with Iceberg and data lake access

Comments