Using Parquet's Bloom Filters
In this article, we explore when and how to use Bloom filters in Parquet, their impact on written Parquet files, and measure their effectiveness when dealing with large quantities of high-cardinality data.
Choosing an NDV that corresponds closely with the actual cardinality of the column incurs a hefty storage cost (2 MB per row group) but does not increase pruning efficiency, and therefore does not appear to be necessary.

Discussion/Key Observations

- The impact on file size is the same as in the previous experiment, consistent with the theoretical result that the Bloom filter size is not related to the cardinality of the data it is filtering.
- DataFusion successfully prunes all non-matching row groups at NDV 1,000, adding only ~2 KB of overhead per row group.
- The pruning efficiency saturates at a lower NDV (1,000) than in the previous experiment (7,500).

When using Parquet, Bloom filters can provide substantial query performance gains at larger data volumes, albeit with additional storage costs.
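To make the relationship between the configured NDV and the per-row-group storage cost concrete, the following is a minimal sketch of how a split block Bloom filter could be sized from an NDV and a target false positive probability (FPP). It uses the sizing formula given in the Parquet format's Bloom filter documentation, as we understand it. The function name `sbbf_size_bytes`, the power-of-two rounding, the 32-byte and 128 MiB bounds, and the FPP of 0.01 are assumptions for illustration, and a given writer implementation may size its filters differently.

```python
import math

# Assumed bounds on the filter bitset; real writers may use other limits.
MIN_BYTES = 32
MAX_BYTES = 128 * 1024 * 1024


def sbbf_size_bytes(ndv: int, fpp: float) -> int:
    """Estimate the size in bytes of a split block Bloom filter.

    ndv: number of distinct values the filter is configured for.
    fpp: target false positive probability, strictly between 0 and 1.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")

    # Each value sets 8 bits inside one block, hence the factor 8
    # and the 8th root of the false positive probability.
    num_bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    num_bytes = int(num_bits) // 8

    # Clamp to the assumed bounds, then round up to a power of two.
    num_bytes = min(max(num_bytes, MIN_BYTES), MAX_BYTES)
    return 1 << (num_bytes - 1).bit_length()


if __name__ == "__main__":
    # The FPP of 0.01 is an assumption, not a value stated in the article.
    for ndv in (1_000, 7_500, 1_000_000):
        print(f"NDV {ndv:>9,}: {sbbf_size_bytes(ndv, 0.01):>9,} bytes")
```

Under these assumptions, an NDV of 1,000 yields a 2,048-byte filter and an NDV of 1,000,000 yields a 2 MiB filter, in line with the per-row-group overheads quoted above. The size depends only on the configured NDV and FPP, not on the data actually written, which is why the file-size impact is the same across experiments.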