We built a self-healing system to survive a concurrency bug at Netflix
Our CPUs were dying, the bug was temporarily unfixable, and we had no viable path forward. Here's how we managed to survive.
The idea was that client teams would write their own endpoint scripts and deploy them on the server, enabling more flexible and efficient interactions.
After looking at a bunch of graphs and some, you guessed it, JVM thread dumps, we discovered that our CPU usage was growing because of a concurrency bug in an internal library used by the client scripts that ran in our cluster.
It’s about building high-volume systems, thriving through sociotechnical challenges, managing operational chaos, and surviving gruesome production incidents.