[0] https://javadoc.io/doc/org.apache.iceberg/iceberg-api/latest...
(Would be genuinely excited if the answer is yes.)
A “standard” getting semi-monthly updates via random Databricks-affiliated GitHub accounts doesn’t really fit that bill.
Look at something like this:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#wr...
Ouch.
I’ve always disliked this approach. It conflates two things: the value to put in preexisting rows and the default going forward. I often want to add a column, backfill it, and not have a default.
Fortunately, the Iceberg spec at least got this right under the hood. There’s “initial-default”, which is the value implicitly inserted in rows that predate the addition of the column, and there’s “write-default”, which is the default for new rows.
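For anyone curious, those defaults live on the individual field entries in the schema JSON. A rough sketch of the backfill-but-no-default case (field id, name, and value are invented here, so check the actual spec for the exact layout):

    {
      "id": 7,
      "name": "region",
      "required": true,
      "type": "string",
      "initial-default": "unknown"
    }

As I read the spec, rows written before the column existed read back "unknown", and because there is no "write-default", new writes still have to supply a value explicitly, which is exactly the add-a-column-and-backfill case described above.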
It seems quite possible that there will be only three or four libraries that can write to Iceberg (Java, Python, Rust, maybe Go), while the rest offer read access at best. And those language choices will both shape and be shaped by the languages that developers use to write applications that manage Iceberg data.
Of course I haven't seen any implementations supporting these yet.
So far only Variant is supported in Spark, and with 1.10 Spark will support the nanosecond timestamp and unknown types, I believe.
https://lists.apache.org/thread/gd5smyln3v6k4b790t5d1vy4483m...
The way they implemented this seems really useful for any database.
https://cloud.google.com/bigquery/docs/iceberg-tables#limita...
You're right — our current implementation in BigLake doesn't have full feature parity with the V3 spec yet. We're actively working on it.
The key context is that the V3 spec is brand new, having been finalized only about two months ago. The official Apache Iceberg release that incorporates all these V3 features isn't even out yet. So, you'll find that the entire ecosystem, including major vendors, is in a similar position of implementing the new spec.
The purpose of our blog post was to celebrate this huge milestone for the open-source community and to share a technical deep-dive on why these new capabilities are so important.
The entire concept of data lakes seems odd to me, as a DBRE. If you want performant OLAP, then get an OLAP DB. If you want temporality, have a created_at column and filter. If the problem is that you need to ingest petabytes of data, fix your source: your OLTP schema probably sucks and is causing massive storage amplification.
[0]: https://database-doctor.com/posts/iceberg-is-wrong-2.html