
Behind the scenes: Redpanda Cloud's response to the GCP outage


On June 12, GCP went down. Here’s how we responded at Redpanda Cloud and what it taught us about safety and reliability.

When this mathematical fact is acknowledged, safety and reliability measures are put in place, such as closing feedback control loops, phasing change rollouts, shedding load, applying backpressure, randomizing retries, and defining incident response processes, among others. GCP’s seemingly innocuous automated quota update triggered a butterfly effect that no human could have predicted, affecting several companies, some known for their impressive engineering culture and considered internet pillars for their long-standing availability record. Had we kept our entire observability stack on that service, we would have lost all our fleet-wide log searching capabilities, forcing us to fail over to another vendor with far greater cost ramifications given our scale.
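As an illustration of one of the measures listed above, randomizing retries, here is a minimal Python sketch of exponential backoff with "full jitter". It is not taken from Redpanda's or GCP's code. The names `jittered_delay` and `call_with_retries` and the default values for `base` and `cap` are assumptions made for this example.

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    Hypothetical helper: drawing the delay at random spreads out clients
    that failed at the same moment, so they do not all retry in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(operation, attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random):
    """Call operation() up to `attempts` times, sleeping a jittered delay
    between failures. The last failure is re-raised to the caller."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the error instead of sleeping again.
            if attempt == attempts - 1:
                raise
            sleep(jittered_delay(attempt, base, cap, rng))
```

The `sleep` and `rng` parameters are injectable only so the behavior can be exercised without real waiting. A production version would typically also retry only on errors known to be transient, rather than on every exception.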


