Scaling our observability platform by embracing wide events and replacing OTel

Read how we scaled our observability platform from 19PB to 100PB and 500 trillion rows by replacing OpenTelemetry with a native ClickHouse-to-ClickHouse pipeline, embracing wide events and cutting CPU usage by 90%.

That kind of scale forced a series of architectural changes, new tools, and hard-earned lessons that we felt were worth sharing, not least that OpenTelemetry (OTel) isn't always a panacea for observability (though we still love it), and that sometimes custom pipelines are essential. While our total volume has grown more than 5x, the breakdown reveals a deliberate shift in strategy: today, the vast majority of our data comes from "SysEx", a new purpose-built exporter we developed to handle high-throughput, high-fidelity system logs from ClickHouse itself.

This cross-layer visibility transforms debugging from guesswork into precise root cause analysis: if we see unusual egress traffic, we can immediately identify whether it comes from expensive cross-region queries, backup operations, or unexpected replication, which makes troubleshooting far more efficient for the support team.
