Get the latest tech news

We built a self-healing system to survive a concurrency bug at Netflix


Our CPUs were dying, the bug was temporarily un-fixable, and we had no viable path forward. Here's how we managed to survive.

The idea was that client teams would write their own endpoint scripts to deploy on the server in order to enable more flexible and efficient interactions. After looking at a bunch of graphs and some, you guessed it, JVM thread dumps - we discovered that our CPU usage was growing due to a concurrency bug in an internal library used by the client scripts that ran in our cluster. It’s about building high-volume systems, thriving through sociotechnical challenges, managing operational chaos, and surviving gruesome production incidents.

Get the Android app

Or read this on Hacker News

Read more on:

Photo of Netflix

Netflix

Photo of self

self

Photo of healing system

healing system

Related news:

News photo

Docker Compose Isn't Enough

News photo

Waymo and Zoox Bring Self-Driving Cabs to More Cities

News photo

Immersive Gamebox teams up with Netflix on Floor is Lava experience