The technique described in the article seems to use this key-value pair to store pointers to the additional metadata (in this case a distinct-value index) embedded in the file. Note that arbitrary binary data can be embedded in the Parquet file between data pages: this is perfectly valid, since all Parquet readers rely on the exact offsets to the data pages specified in the footer.
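To make the key-value part concrete, here is a minimal pyarrow sketch (the key name `embedded_index_offset` and its value are placeholders I made up; the actual index bytes the article writes before the footer are not shown):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table whose footer carries an application-specific key/value
# entry. In the article's scheme such an entry would point at the byte offset
# of the embedded index; the key and offset here are made-up placeholders.
table = pa.table({"nation": ["Singapore", "France", "Brazil"]})
table = table.replace_schema_metadata({"embedded_index_offset": "12345"})
pq.write_table(table, "data.parquet")

# The key/value pairs are stored in the footer alongside the schema metadata.
print(pq.read_metadata("data.parquet").metadata)
# e.g. {b'embedded_index_offset': b'12345', b'ARROW:schema': b'...'}
```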
This means that DataFusion does not need to specify how the metadata is interpreted: that is already well specified as part of the Parquet file format itself. DataFusion is an independent project -- a query execution engine for OLAP / columnar data that can take in SQL statements, build query plans, optimize them, and execute them. It is an embeddable runtime with numerous ways for the host program to extend it. Parquet is one of the file formats DataFusion supports, because it is among the most popular ways of storing columnar data in object stores like S3.
Note that a Parquet reader needs to be aware of the extra metadata in order to exploit it. If it isn't, nothing changes: as long as we embed only supplementary information like indexes or Bloom filters, a reader can keep working with the columnar data in the Parquet file exactly as before; it just won't be able to take advantage of the additional metadata.
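To the point about unaware readers, a stock reader just follows the page offsets in the footer and returns the column data untouched; continuing the sketch above (same placeholder key):

```python
import pyarrow.parquet as pq

# A reader that knows nothing about the embedded index follows the data page
# offsets recorded in the footer and returns the columnar data as usual.
table = pq.read_table("data.parquet")
print(table.column("nation"))

# The unknown footer key is still present, but nothing obliges a reader to use it.
kv = pq.read_metadata("data.parquet").metadata or {}
print(kv.get(b"embedded_index_offset"))  # b'12345', ignored by readers that don't know the key
```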
I work on a database engine that uses Parquet as its storage file format, and we make liberal use of the custom metadata area for things specific to our product that any other Parquet reader would just ignore.
jasim•5h ago
One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog sketches, etc.).
This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.
They insert arbitrary data between the data pages and the footer, which other readers will ignore but which query engines like DataFusion can exploit. Concretely, they embed a new index into the .parquet file and use it to improve query performance.
In this specific instance, they add an index containing all the distinct values of a column, then extend the DataFusion query engine to exploit it, so a query like `WHERE nation = 'Singapore'` can use the index to determine whether the value exists in that .parquet file at all, without scanning the data pages (scanning is already partly optimized by Parquet's built-in min/max statistics, which avoid reading the entire dataset when the value falls outside a row group's range).
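Roughly, the pruning decision looks like this; a toy sketch with made-up file names and an in-memory stand-in for the embedded distinct-value index, not DataFusion's actual API:

```python
# Toy stand-in for a per-file distinct-value index; in the article the index
# is read back out of each .parquet file's own footer area.
distinct_index = {
    "part-0.parquet": {"Singapore", "Malaysia"},
    "part-1.parquet": {"France", "Brazil"},
}

def files_to_scan(column_value: str) -> list[str]:
    # Only files whose distinct-value set contains the literal from
    # `WHERE nation = '<value>'` need their data pages scanned at all.
    return [path for path, values in distinct_index.items() if column_value in values]

print(files_to_scan("Singapore"))  # ['part-0.parquet']
print(files_to_scan("Kenya"))      # [] -> every file pruned, no data pages touched
```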
Also in general this is a really good deep dive into columnar data storage.
dmvinson•1h ago
This solution seems clever overall, and finding a way to bolt on features of the latest-and-greatest new hotness without breaking backwards compatibility is a testament to the DataFusion team. Supporting legacy systems is crucial work, even if things need a ground-up rewrite periodically.