You can throw away a table and recreate it in minutes, or go the other way: edit the underlying data and the table will adapt.
I am so used to this that I worry about losing this flexibility with Iceberg.
Maybe a mix is the way to go.
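To make that concrete: with a classic Hive-style external table, the table is just metadata over files, so dropping and recreating it is cheap and the data survives. A minimal PySpark sketch (the table name, schema, and location are all made up):

```python
# Minimal sketch of the drop-and-recreate workflow with a Hive-style
# external table (table name, schema, and location are hypothetical).
# Dropping the table only removes catalog metadata; the files under
# LOCATION survive, so the table can be redefined over the same data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Throw the table away: only the catalog entry is deleted.
spark.sql("DROP TABLE IF EXISTS events")

# Recreate it over the same files, possibly with a revised schema.
spark.sql("""
    CREATE EXTERNAL TABLE events (
        user_id BIGINT,
        ts      TIMESTAMP,
        payload STRING
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/events/'
""")
```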
TFA is very well written, by the way. From my perspective, Iceberg is Hive tables 2.0: it solves a lot of the Hive-related problems, but not all of the generic database problems. So all new features are positive for me.
But my only gripe is: is the added complexity worth it?
If you have a use case like the ones the author describes, maybe use an in-memory cloud database with tiered storage, or a plain RDBMS. Iceberg (and similar formats) work great for the use cases for which they're designed.
A multi-writer architecture isn't proven scalable just because a single writer doesn't cause it to fall over.
I have caused issues by using 500 concurrent writers on embarrassingly parallel workloads. I have watched people choose sharding schemes to accommodate Iceberg's metadata throughput, NOT the natural/logical sharding of the underlying data.
Last I half-knew (so check me), Spark may have done some funky stuff to work around the Iceberg shortcomings. That is useless if you're not using Spark. If scalability of the architecture requires a funky client in one language and a cooperative backend, we might as well be sticking HDF5 on Lustre. HDF5 on Lustre never fell over for me in the 1000+ embarrassingly parallel concurrent writer use case (massive HPC turbulence restart files with 32K concurrent writers, per https://ieeexplore.ieee.org/abstract/document/6799149).
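To illustrate why writer count matters: Iceberg commits optimistically against a single table-metadata pointer, so writers that lose a commit race must refresh and retry. Here's a toy Python simulation of that protocol shape (my own simplification, not Iceberg code) showing how total commit attempts blow up as concurrent writers increase:

```python
# Toy simulation of optimistic commits against a single table-metadata
# pointer, roughly the shape of Iceberg's commit protocol (my own
# simplification, not Iceberg code; real retries can often reuse data
# files, so they are cheaper than full rewrites).
import random

def total_commit_attempts(n_writers: int) -> int:
    version = 0                     # current table metadata version
    read = [0] * n_writers          # version each pending writer last read
    attempts = 0
    while read:
        i = random.randrange(len(read))   # some writer tries to commit
        attempts += 1
        if read[i] == version:
            version += 1                  # compare-and-swap wins
            read.pop(i)
        else:
            read[i] = version             # conflict: refresh metadata, retry
    return attempts

for n in (1, 10, 100, 500):
    print(n, total_commit_attempts(n))    # attempts grow superlinearly
```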
You could achieve 100M database inserts per second with D4M and Accumulo more than a decade ago, back in 2014 [1], and object storage was not necessary for that exercise.
Someone needs to come up with lakehouse systems based on D4M; it's long overdue.
D4M is also based on sound mathematics, not unlike the venerable SQL [2].
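For anyone unfamiliar, D4M's core object is the associative array: a sparse matrix keyed by strings, manipulated with linear-algebra operations over a semiring. A toy Python sketch of the idea (my own illustration, not the actual D4M API):

```python
# Toy sketch of a D4M-style associative array (my own, not the actual
# D4M API): a sparse matrix keyed by string pairs, with element-wise
# addition and matrix multiplication, i.e. semiring linear algebra.
from collections import defaultdict

class Assoc:
    def __init__(self, entries=None):
        self.data = dict(entries or {})   # {(row, col): value}

    def __add__(self, other):
        # Union of entries; values on shared keys are summed.
        out = dict(self.data)
        for k, v in other.data.items():
            out[k] = out.get(k, 0) + v
        return Assoc(out)

    def __matmul__(self, other):
        # Sparse matrix multiply over string-keyed dimensions.
        out = defaultdict(int)
        for (r, k1), v1 in self.data.items():
            for (k2, c), v2 in other.data.items():
                if k1 == k2:
                    out[(r, c)] += v1 * v2
        return Assoc(dict(out))

A = Assoc({("alice", "bought:book"): 1, ("bob", "bought:book"): 1})
B = Assoc({("bought:book", "genre:scifi"): 1})
print((A @ B).data)   # links people to genres through purchases
```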
[1] Achieving 100M database inserts per second using Apache Accumulo and D4M (2017 - 46 comments):
https://news.ycombinator.com/item?id=13465141
[2] Mathematics of Big Data: Spreadsheets, Databases, Matrices, and Graphs:
https://mitpress.mit.edu/9780262038393/mathematics-of-big-da...
ozgrakkurt•6mo ago
It is very basic compared to a database, and even when you go into the details of databases there are many things that aren't the absolute best technical choice.
You could criticize Parquet in a similar way if you go through the spec, but because it is open and so popular, people are going to use it no matter what.
If you need more performance, efficiency, simplicity, etc., just don't use Parquet internally, but convert between your format and Parquet at the boundaries.
Or you can build on top of Parquet with external indices, keeping metadata in memory and having a separate WAL for consistency.
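A hedged sketch of what that could look like, using pyarrow for the Parquet I/O (file names and layout are invented; the index is a plain dict rebuilt from the WAL on restart):

```python
# Sketch of the idea: Parquet holds the data, a plain dict is the
# external in-memory index, and an append-only file is the WAL.
# File names and layout are invented for illustration.
import json, os
import pyarrow as pa
import pyarrow.parquet as pq

WAL_PATH = "table.wal"
index = {}   # key -> parquet file containing that key

def append(batch: dict, path: str) -> None:
    # 1. Log intent to the WAL and fsync, so a crash can be replayed.
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps({"file": path, "keys": batch["key"]}) + "\n")
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Write the immutable Parquet file.
    pq.write_table(pa.table(batch), path)
    # 3. Publish to the in-memory index (rebuilt from the WAL on restart).
    for k in batch["key"]:
        index[k] = path

append({"key": ["a", "b"], "value": [1, 2]}, "part-000.parquet")
print(index)   # {'a': 'part-000.parquet', 'b': 'part-000.parquet'}
```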
Similarly, it should be possible to build on top of the Iceberg spec to create something like an efficient DB server.
It is unlikely that something usable for so many use cases will also be the technically purest and most sensible option.
dkdcio•6mo ago
People don't choose tech based on technical purity, but they often do choose based on simplicity & ease of use.
lsuresh•6mo ago