I spent 5 hours learning how ClickHouse built their internal data warehouse
19 data sources and a total of 470 TB of compressed data.
Built to handle large-scale data, ClickHouse excels in OLAP scenarios, delivering fast query execution even on massive datasets. Because Airflow jobs/DAGs can retry multiple times for the same data interval, the team uses ReplicatedReplacingMergeTree tables to make the pipeline idempotent: a retried run replaces the rows for that interval instead of duplicating them. However, this approach became unsustainable as they added more data sources, developed complex business metrics, and served a growing number of internal stakeholders.
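To make the idempotency pattern concrete, here is a minimal sketch of such a table. The table name, columns, and Keeper path are hypothetical, not ClickHouse's actual schema; the point is that ReplacingMergeTree deduplicates rows sharing the same sorting key during background merges, so re-inserting an interval's data on an Airflow retry doesn't accumulate duplicates.

```sql
-- Hypothetical table; illustrates the pattern, not ClickHouse's real schema.
CREATE TABLE analytics.events
(
    event_date  Date,
    event_id    String,
    payload     String,
    inserted_at DateTime DEFAULT now()  -- version column: newest row wins on merge
)
ENGINE = ReplicatedReplacingMergeTree(
    '/clickhouse/tables/{shard}/analytics.events',  -- Keeper path for replication
    '{replica}',                                    -- replica name macro
    inserted_at                                     -- version column used for deduplication
)
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, event_id);  -- rows with the same sorting key are collapsed on merge
```

Deduplication happens at merge time, so a query that must see exactly one row per key before merges complete can use `SELECT ... FINAL` (at some read cost). Either way, a retried load for the same interval converges to the same result rather than double counting.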