Improving Parquet Dedupe on Hugging Face Hub



The Xet team at Hugging Face is working on improving the efficiency of the Hub's storage architecture to make it easier and quicker for users to store and update data and models. Most Parquet files are bulk exports from various data analysis pipelines or databases, often appearing as full snapshots rather than incremental updates. Our default storage algorithm uses byte-level Content-Defined Chunking (CDC), which generally dedupes well over insertions and deletions, but the Parquet layout brings some challenges.
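To make the idea of content-defined chunking concrete, here is a minimal Python sketch. It is not the Hub's actual implementation: the gear-style rolling hash, the chunk-size parameters, and the helper names `cdc_chunks` and `chunk_hashes` are all assumptions chosen for illustration. It shows why cutting on content rather than at fixed offsets lets most chunks survive an insertion.

```python
import hashlib
import random

# Hypothetical gear table: one pseudo-random 64-bit value per byte value.
_gear_rng = random.Random(0)
GEAR = [_gear_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_bits: int = 6, min_size: int = 16, max_size: int = 512):
    """Split data into chunks whose boundaries depend on the bytes themselves.

    Illustrative parameters only. The low `avg_bits` bits of the rolling hash
    depend on just the last `avg_bits` bytes, which keeps the sketch simple.
    """
    cut_mask = (1 << avg_bits) - 1
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Gear-style rolling hash: shift out old bytes, mix in the new one.
        h = ((h << 1) + GEAR[b]) & MASK64
        size = i - start + 1
        # Cut where the hash matches the pattern, or when the chunk gets too long.
        if (size >= min_size and (h & cut_mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_hashes(data: bytes) -> set:
    """Return the set of SHA-256 digests of the chunks, as a dedupe index would."""
    return {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}


# Usage: insert a few bytes in the middle and count how many chunks are reused.
data_rng = random.Random(1)
base = bytes(data_rng.getrandbits(8) for _ in range(20000))
edited = base[:10000] + b"inserted row" + base[10000:]

old, new = chunk_hashes(base), chunk_hashes(edited)
print(f"{len(old & new)} of {len(new)} chunks in the edited file already exist")
```

With fixed-size blocks, every block after the insertion point would shift and change. Here, boundaries after the edit usually realign within a chunk or two, so only the chunks around the insertion are new. The challenge the paragraph above alludes to is that Parquet's layout can break this assumption at the byte level.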
