Improving Parquet Dedupe on Hugging Face Hub
The Xet team at Hugging Face is working on improving the efficiency of the Hub's storage architecture to make it easier and quicker for users to store and update data and models. Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates. Our default storage algorithm uses byte-level Content-Defined Chunking (CDC), which generally dedupes well over insertions and deletions, but the Parquet layout poses some challenges for it.