The technique described in the article seems to use this key-value pair to store pointers to the additional metadata (in this case a distinct-value index) embedded in the file. Note that arbitrary binary data can be embedded in the Parquet file between data pages: this is perfectly valid, since all Parquet readers rely on the exact offsets to the data pages specified in the footer.
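To make the key-value part concrete, here is a minimal pyarrow sketch (the key name `embedded_index_offset` and its value are placeholders I made up; the actual index bytes the article writes before the footer are not shown):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table whose footer carries an application-specific key/value
# entry. In the article's scheme such an entry would point at the byte offset
# of the embedded index; the key and offset here are made-up placeholders.
table = pa.table({"nation": ["Singapore", "France", "Brazil"]})
table = table.replace_schema_metadata({"embedded_index_offset": "12345"})
pq.write_table(table, "data.parquet")

# The key/value pairs are stored in the footer alongside the schema metadata.
print(pq.read_metadata("data.parquet").metadata)
# e.g. {b'embedded_index_offset': b'12345', b'ARROW:schema': b'...'}
```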
This means that DataFusion does not need to specify how the metadata is interpreted: that is already well specified as part of the Parquet file format itself. DataFusion is an independent project -- a query execution engine for OLAP / columnar data that can take in SQL statements, build query plans, optimize them, and execute them. It is an embeddable runtime with numerous ways for the host program to extend it. Parquet is one of the file formats DataFusion supports, because it is among the most popular ways of storing columnar data in object stores like S3.
Note that a Parquet reader needs to be aware of the extra metadata in order to exploit it. If it isn't, nothing changes: as long as we embed only supplementary information like indexes or Bloom filters, a reader can keep working with the columnar data in the Parquet file exactly as before; it just won't be able to take advantage of the additional metadata.
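To the point about unaware readers, a stock reader just follows the page offsets in the footer and returns the column data untouched; continuing the sketch above (same placeholder key):

```python
import pyarrow.parquet as pq

# A reader that knows nothing about the embedded index follows the data page
# offsets recorded in the footer and returns the columnar data as usual.
table = pq.read_table("data.parquet")
print(table.column("nation"))

# The unknown footer key is still present, but nothing obliges a reader to use it.
kv = pq.read_metadata("data.parquet").metadata or {}
print(kv.get(b"embedded_index_offset"))  # b'12345', ignored by readers that don't know the key
```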
I work on a database engine that uses Parquet as its storage file format, and we make liberal use of the custom metadata area for things specific to our product that any other Parquet reader would just ignore.
jasim•5h ago
One of the arguments is that there is no standardized way to extend Parquet with new kinds of metadata (like statistical summaries, HyperLogLog sketches, etc.).
This post was written by the DataFusion folks, who have shown a clever way to do this without breaking backward compatibility with existing readers.
They insert arbitrary data between the data pages and the footer, which other readers will ignore but which query engines like DataFusion can exploit. Concretely, they embed a new index into the .parquet file and use it to improve query performance.
In this specific instance, they add an index containing all the distinct values of a column, then extend the DataFusion query engine to exploit it, so a query like `WHERE nation = 'Singapore'` can use the index to determine whether the value exists in that .parquet file at all, without scanning the data pages (scanning is already partly optimized by Parquet's built-in min/max statistics, which avoid reading the entire dataset when the value falls outside a row group's range).
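Roughly, the pruning decision looks like this; a toy sketch with made-up file names and an in-memory stand-in for the embedded distinct-value index, not DataFusion's actual API:

```python
# Toy stand-in for a per-file distinct-value index; in the article the index
# is read back out of each .parquet file's own footer area.
distinct_index = {
    "part-0.parquet": {"Singapore", "Malaysia"},
    "part-1.parquet": {"France", "Brazil"},
}

def files_to_scan(column_value: str) -> list[str]:
    # Only files whose distinct-value set contains the literal from
    # `WHERE nation = '<value>'` need their data pages scanned at all.
    return [path for path, values in distinct_index.items() if column_value in values]

print(files_to_scan("Singapore"))  # ['part-0.parquet']
print(files_to_scan("Kenya"))      # [] -> every file pruned, no data pages touched
```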
Also in general this is a really good deep dive into columnar data storage.
dmvinson•1h ago
This solution seems clever overall, and finding a way to bolt on features of the latest-and-greatest new hotness without breaking backwards compatibility is a testament to the DataFusion team. Supporting legacy systems is crucial work, even if things need a ground-up rewrite periodically.