The Evolution of SRE at Google


Billions of people around the world use Google’s products every day, and they count on those products to work reliably. Behind the scenes, Google’s services have increased dramatically in scale over the last 25 years — and failures have become rarer even as the scale has grown.

We have adopted the STAMP (System-Theoretic Accident Model and Processes) framework, developed by Professor Nancy Leveson at MIT, which shifts the focus from preventing individual component failures to understanding and managing complex system interactions. STAMP offers a robust framework for understanding and mitigating risks in complex socio-technical systems, and it shows that control theory principles remain relevant and adaptable as technology changes. Rather than treating complexity as a bug, SRE teams at Google use control theory and STAMP-based methods such as STPA and CAST to take a more comprehensive and proactive approach to reliability: instead of only reacting to failures, they design safer systems from the ground up.
