Quick takes on the recent OpenAI public incident write-up
OpenAI recently published a public write-up of an incident they had on December 11, and it contains lots of good details! Here are some of my off-the-cuff observations:
Saturation
With thousa…
The impact was specific to clusters exceeding a certain size, and our DNS cache on each node delayed visible failures long enough for the rollout to continue. We identified the issue within minutes and immediately spun up multiple workstreams to explore different ways to bring our clusters back online quickly:
Scaling up Kubernetes API servers: Increased available resources to handle pending requests, allowing us to apply the fix.
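To make the DNS-cache point in the quote concrete, here is a minimal sketch of how a per-node cache with a time-to-live (TTL) can hide an upstream resolver outage until cached entries expire. It is a toy simulation, not OpenAI's setup: the class `TtlDnsCache`, the function `simulate`, the hostname `svc.internal`, the address, and all timings are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ResolverDown(Exception):
    """Raised by the (simulated) upstream resolver when it cannot answer."""


@dataclass
class TtlDnsCache:
    """Hypothetical per-node cache: serves stored answers until their TTL expires."""
    resolve_upstream: Callable[[str], str]
    ttl_seconds: float
    now: Callable[[], float]
    _entries: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def lookup(self, name: str) -> str:
        entry = self._entries.get(name)
        if entry is not None and entry[1] > self.now():
            return entry[0]  # cache hit: upstream is never consulted
        address = self.resolve_upstream(name)  # may raise ResolverDown
        self._entries[name] = (address, self.now() + self.ttl_seconds)
        return address


def simulate(ttl_seconds: float = 20.0, outage_at: float = 5.0,
             horizon: int = 40, step: int = 5) -> List[Tuple[int, str]]:
    """Look up one name every `step` seconds; upstream fails from `outage_at` on."""
    clock = {"t": 0.0}  # fake clock so the run is deterministic

    def upstream(name: str) -> str:
        if clock["t"] >= outage_at:
            raise ResolverDown(name)
        return "10.0.0.1"  # assumed address, illustration only

    cache = TtlDnsCache(upstream, ttl_seconds, lambda: clock["t"])
    timeline = []
    for t in range(0, horizon + 1, step):
        clock["t"] = float(t)
        try:
            cache.lookup("svc.internal")
            timeline.append((t, "ok"))
        except ResolverDown:
            timeline.append((t, "fail"))
    return timeline


if __name__ == "__main__":
    for t, status in simulate():
        print(f"t={t:>2}s lookup: {status}")
```

With these assumed values, the upstream resolver breaks at t=5s, but lookups keep succeeding from the cache until the entry expires at t=20s. A rollout that only watches for lookup failures would see nothing wrong during that window, which is the kind of delay the quote describes.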