Behind the scenes: Redpanda Cloud's response to the GCP outage

On June 12, GCP went down. Here’s how we responded at Redpanda Cloud and what it taught us about safety and reliability.

When this mathematical fact is acknowledged, safety and reliability measures are put in place, such as closing feedback control loops, phasing change rollouts, shedding load, applying backpressure, randomizing retries, and defining incident response processes. GCP’s seemingly innocuous automated quota update triggered a butterfly effect that no human could have predicted, affecting several companies, some of them known for their impressive engineering culture and considered internet pillars for their long-standing availability record. Had we kept our entire observability stack on that service, we would have lost all our fleet-wide log searching capabilities, forcing us to fail over to another vendor at a far higher cost given our scale.
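
To make one of the measures listed above concrete, here is a minimal sketch of randomized retries in Python, using exponential backoff with "full jitter". It is not Redpanda's or GCP's implementation; the function name `jittered_delays` and the parameters `base`, `cap`, and `rng` are assumptions chosen for illustration.

```python
import random


def jittered_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Hypothetical helper: each delay is drawn uniformly between 0 and an
    exponentially growing ceiling, so that many clients retrying after
    the same failure do not all come back at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on every attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the delays a caller would sleep between five retries.
    for i, delay in enumerate(jittered_delays(5)):
        print(f"retry {i + 1}: wait {delay:.3f}s")
```

A caller would sleep for each delay in turn before retrying the failed request. Capping the ceiling keeps the worst-case wait bounded during a long outage.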
