Embedding user-defined indexes in Apache Parquet Files

Posted on: Mon 14 July 2025 by Qi Zhu, Jigao Luo, and Andrew Lamb

It’s a common misconception that Apache Parquet files are limited to basic Min/Max/Null Count statistics and Bloom filters, and that adding more advanced indexes requires changing the specification or creating a new file format. In fact, footer metadata and offset-based addressing already provide everything needed to embed user-defined index structures within Parquet files without breaking compatibility with other Parquet readers.
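The layout idea can be sketched with a toy container that uses only the Python standard library. This is not a real Parquet writer: the footer here is JSON rather than Thrift, and the key names `distinct_index_offset` and `distinct_index_length` are hypothetical. What it keeps from Parquet is the overall shape: a `PAR1` magic at both ends, a footer at the end of the file preceded by its 4-byte little-endian length, and readers that locate everything through offsets recorded in that footer.

```python
import io
import json
import struct

MAGIC = b"PAR1"  # real Parquet magic; everything else below is simplified


def write_file(data_pages: bytes, distinct_values: list[str]) -> bytes:
    """Write data, then custom index bytes, then a footer pointing at both."""
    buf = io.BytesIO()
    buf.write(MAGIC)

    data_offset = buf.tell()
    buf.write(data_pages)

    # The custom index sits after the data and before the footer.
    index_bytes = "\n".join(distinct_values).encode("utf-8")
    index_offset = buf.tell()
    buf.write(index_bytes)

    footer = {
        "data": {"offset": data_offset, "length": len(data_pages)},
        # Hypothetical key names; unknown keys are simply skipped by
        # readers that do not understand them.
        "key_value_metadata": {
            "distinct_index_offset": str(index_offset),
            "distinct_index_length": str(len(index_bytes)),
        },
    }
    footer_bytes = json.dumps(footer).encode("utf-8")
    buf.write(footer_bytes)
    buf.write(struct.pack("<I", len(footer_bytes)))  # footer length, little-endian
    buf.write(MAGIC)
    return buf.getvalue()


def read_footer(file_bytes: bytes) -> dict:
    """Locate the footer from the end of the file, as Parquet readers do."""
    assert file_bytes[:4] == MAGIC and file_bytes[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    footer_start = len(file_bytes) - 8 - footer_len
    return json.loads(file_bytes[footer_start:footer_start + footer_len])


def read_data_only(file_bytes: bytes) -> bytes:
    """A reader unaware of the index: follows the data offset, ignores the rest."""
    data = read_footer(file_bytes)["data"]
    return file_bytes[data["offset"]:data["offset"] + data["length"]]


def read_distinct_index(file_bytes: bytes) -> set[str]:
    """An index-aware reader: finds the index bytes via the key/value metadata."""
    kv = read_footer(file_bytes)["key_value_metadata"]
    offset = int(kv["distinct_index_offset"])
    length = int(kv["distinct_index_length"])
    return set(file_bytes[offset:offset + length].decode("utf-8").split("\n"))


file_bytes = write_file(b"row-data", ["foo", "bar"])
```

Because the index-unaware reader only ever seeks to offsets named in the footer, the extra bytes between the data and the footer never enter its read path, which is the property that keeps such files readable by other tools.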

The resulting Parquet files remain fully compatible with other tools such as DuckDB and Spark, which simply ignore the unknown index bytes and key/value metadata.

A key benefit of a distinct value index is accurate filtering without requiring the column to be sorted, unlike min/max-based pruning, which is most effective when data is ordered.

5: For information about rewriting files to optimize for specific queries, such as resorting, repartitioning, and tuning data page and row group sizes, see XiangpengHao/liquid-cache#227 and the conversation between JigaoLuo and XiangpengHao.
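To make the contrast between distinct value filtering and min/max pruning concrete, here is a small sketch in plain Python. The file names and column values are made up for illustration, and the `FileStats` type is hypothetical rather than part of any Parquet library. Both files hold unsorted values whose min/max ranges cover the target, so range pruning cannot skip either one, while the distinct value set rules out the file that does not contain it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary: min/max statistics plus a distinct set."""
    name: str
    min_value: str
    max_value: str
    distinct: frozenset


def build_stats(name: str, values: list[str]) -> FileStats:
    return FileStats(name, min(values), max(values), frozenset(values))


def may_match_min_max(stats: FileStats, target: str) -> bool:
    # Range check: can only rule a file out when target falls outside [min, max].
    return stats.min_value <= target <= stats.max_value


def may_match_distinct(stats: FileStats, target: str) -> bool:
    # Exact membership check: independent of how the column is ordered.
    return target in stats.distinct


# Illustrative, unsorted column values for two made-up files.
files = [
    build_stats("a.parquet", ["apple", "zebra", "mango"]),
    build_stats("b.parquet", ["banana", "yak", "kiwi"]),
]

target = "kiwi"
minmax_candidates = [f.name for f in files if may_match_min_max(f, target)]
distinct_candidates = [f.name for f in files if may_match_distinct(f, target)]
```

Here `minmax_candidates` keeps both files while `distinct_candidates` keeps only `b.parquet`. If the data were sorted and partitioned so that each file covered a narrow range, min/max pruning would do much better, which is why it depends on ordering.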
