The Evolution of SRE at Google
Billions of people around the world use Google’s products every day, and they count on those products to work reliably. Behind the scenes, Google’s services have increased dramatically in scale over the last 25 years — and failures have become rarer even as the scale has grown.
We have adopted the STAMP (System-Theoretic Accident Model and Processes) framework, developed by Professor Nancy Leveson at MIT, which shifts the focus from preventing individual component failures to understanding and managing complex system interactions. STAMP applies control theory principles to understanding and mitigating risk in complex socio-technical systems. Rather than treating complexity as a bug, SRE teams at Google use control theory and STAMP-based methods such as STPA (System-Theoretic Process Analysis) and CAST (Causal Analysis based on Systems Theory) to take a more comprehensive and proactive approach to reliability: moving beyond reacting to failures to designing safer systems from the ground up.