It's kinda funny to not mention that Databricks acquired Tabular, the Iceberg company, for a billion dollars: https://www.databricks.com/company/newsroom/press-releases/d...
It's becoming clear that merge trees and compaction need to be addressed next, after delete vectors brought them onstage.
Vertica will actually look up the equality keys in a relevant projection if it exists, and then use the column values in the matching rows to equality-delete from the other projections; it's fairly good at avoiding table scans.
As an industry, we keep forgetting these things and reinventing the wheel because there is more money to be made squeezing enterprises than providing widely available sustainable software for a fair price, and then losing mindshare to the next generation of tools and eventually getting sold for parts. It's a sad dynamic.
Apache Iceberg as mature? I mean, there's a lot of activity around it, but I remember a year ago the Rust library didn't even have write capabilities. And it's not like the library is a client and there's an Iceberg server: the library literally is the whole product, interacting with the files in S3.
I'm not even sure if I'm joking. :)
They do work, but they have some sharp and rough edges.
> In streaming CDC scenarios, however, you’d need to query Iceberg for the location on every delete: introducing random reads, latency, and drastically lowering throughput under high concurrency. On large tables, real-time performance is essentially impossible.
Let's consider the actual situation. There's a Postgres table that fits on whatever Postgres server is in use. It gets mirrored to Iceberg. Postgres is a full-fledged relational database and has indexes and such. Iceberg is not, although it can be scanned much faster than Postgres and queried by fancy Big Data tools (which, I agree, are really cool!). And, notably, there is no index mapping Postgres rows to Iceberg row positions.
But why isn't there? CDC is inherently stateful -- unless someone is going to build Merkle trees or similar to allow efficiently diffing table states (which would be awesome), the CDC process needs to keep enough state to know where it is. Maybe this is O(1) in current implementations. But why not keep the entire mapping from Postgres rows to Iceberg positions? The Postgres database table is about N rows times however wide a row is, and it fits on a Postgres server. The mapping needed would be about the size of a single index on the table. Why not store it somewhere? Updates to it will be faster than updates to the source Postgres table, so it will keep up. Is the problem that this is awkward to do in a "serverless" manner?
For extra fun, someone could rig up Postgres (via an extension or just some clever tables) so that the mapping is stored in Postgres itself. It would be, roughly, one small table with CDC state and one moderate size table per mirrored table storing the row position mapping. It could be on the same server instance or a different one.
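To make the idea concrete, here is a rough sketch of such a mapping, not how any existing CDC tool works. Everything in it is hypothetical: the `PositionIndex` class, the `row_pos` table, the `(op, pk, value)` event shape, and the file names. An in-memory SQLite table stands in for the "moderate size table" that would live in Postgres, and each batch is assumed to be written to one new data file.

```python
import sqlite3


class PositionIndex:
    """Hypothetical map from primary key to (data file, row position)."""

    def __init__(self):
        # In-memory SQLite is only a stand-in for a table kept in Postgres.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE row_pos ("
            "pk TEXT PRIMARY KEY, data_file TEXT NOT NULL, pos INTEGER NOT NULL)"
        )

    def record_insert(self, pk, data_file, pos):
        # Remember where the live version of this row was written.
        self.db.execute(
            "INSERT OR REPLACE INTO row_pos VALUES (?, ?, ?)", (pk, data_file, pos)
        )

    def resolve_delete(self, pk):
        """Return (data_file, pos) of the live row and forget it, or None."""
        row = self.db.execute(
            "SELECT data_file, pos FROM row_pos WHERE pk = ?", (pk,)
        ).fetchone()
        if row is None:
            return None
        self.db.execute("DELETE FROM row_pos WHERE pk = ?", (pk,))
        return row


def apply_changes(index, changes, data_file):
    """Turn one batch of CDC events into appended rows plus positional deletes.

    An update is treated as a delete of the old position followed by an
    insert into the data file being written for this batch.
    """
    appended, positional_deletes = [], []
    for op, pk, value in changes:
        if op in ("delete", "update"):
            target = index.resolve_delete(pk)
            if target is not None:
                positional_deletes.append(target)
        if op in ("insert", "update"):
            index.record_insert(pk, data_file, len(appended))
            appended.append((pk, value))
    return appended, positional_deletes
```

The point of the sketch is that a delete becomes a single indexed lookup against the mapping instead of a query against Iceberg, at the cost of keeping roughly one extra index's worth of state.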
postgres is for relational data, ok
CDC is meant to capture changes and process the changes only (in isolation from all previous changes), not to recover the snapshot of the original table by reimplementing the logic inside postgres of merge-on-read
iceberg is columnar storage for large historical data for analytics, it's not meant for relational data, and certainly not for realtime
it looks like they need to use a time-series oriented db, like timescale, influxdb, etc
With a query engine that supports federation, we can write SELECT * FROM PG_Table or SELECT * FROM Iceberg_File just the same.
datadrivenangel•6h ago
UltraSane•1h ago
sakesun•1h ago
UltraSane•1h ago
Whenever I implement CDC I always try to implement some kind of integrity check that runs at some reasonable interval that can detect and try to fix any discrepancies. CDC is by far the most efficient form of data synchronization in most situations.
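A minimal sketch of what such a periodic check could look like, assuming both sides can be read as lists of dicts keyed by a primary key column (the `find_discrepancies` and `row_digest` names and the `id` key are made up for illustration, and a real check would likely hash in chunks rather than load whole tables):

```python
import hashlib


def row_digest(row):
    # Hash a canonical rendering of the row; repr() is good enough for a sketch.
    return hashlib.sha256(repr(sorted(row.items())).encode("utf-8")).hexdigest()


def find_discrepancies(source_rows, target_rows, key="id"):
    """Compare source and target by primary key and per-row digest."""
    src = {r[key]: row_digest(r) for r in source_rows}
    dst = {r[key]: row_digest(r) for r in target_rows}
    return {
        # Rows the CDC pipeline never delivered.
        "missing_in_target": sorted(src.keys() - dst.keys()),
        # Rows whose delete was lost.
        "extra_in_target": sorted(dst.keys() - src.keys()),
        # Rows present on both sides with different contents.
        "mismatched": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

The three buckets map onto the repair actions: re-copy the missing rows, delete the extras, and overwrite the mismatched ones from the source.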
sakesun•1h ago