
Quick takes on the recent OpenAI public incident write-up – Surfing Complexity

OpenAI's telemetry deployment saturated their Kubernetes API servers, which cascaded through DNS to break service discovery—a textbook example of how reliability improvements can paradoxically cause outages and why saturation is the failure mode you can't test your way out of.


• Saturation (resource exhaustion) is an extremely common failure mode that passes all functional tests but only manifests under production load—OpenAI joins Cloudflare, Rogers, and Slack in this pattern
• The failure chain was surprising: new telemetry service → k8s API overload → DNS failure → service-to-service communication breakdown, showing how "looking at the arrows, not just the boxes" reveals complex interactions
• DNS caching spread the impact over time, making the breaking change harder to detect during rollout and harder to correlate with the cause after the fact
• The failure mode broke the remediation tools themselves ("I destroyed my tools with my tools")—they couldn't fix the k8s API because they needed the k8s API to deploy the fix
• The telemetry service was deployed specifically to improve reliability, exemplifying the author's conjecture that reliable systems often fail due to subsystems designed to make them more reliable

OpenAI's December 11 incident provides a masterclass in how complex systems fail through unexpected interactions rather than simple component failures. A new telemetry service deployment—ironically intended to improve observability and reliability—generated massive load on Kubernetes API servers across large clusters. This saturated the API servers, taking down the control plane. The surprising part: this control plane failure cascaded to break DNS-based service discovery, ultimately preventing running services from communicating with each other. The coupling mechanism was DNS, creating an unexpected dependency between the k8s control plane and the data plane.
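The coupling mechanism can be made concrete with a toy model. This is a hypothetical sketch, not OpenAI's actual architecture: the class names, TTL value, and addresses are all invented for illustration. The point it shows is the one from the write-up: pods resolve peers through cluster DNS, cluster DNS refreshes its records from the Kubernetes API server, and cached records keep working after the API server saturates, right up until their TTL expires.

```python
# Toy model (hypothetical; illustrative only). Pods resolve peer services
# through cluster DNS, whose records are repopulated from the Kubernetes
# API server. A saturated API server doesn't break resolution immediately:
# cached records keep answering until their TTL runs out.

TTL = 30  # seconds a cached record stays valid (illustrative value)

class ClusterDNS:
    def __init__(self):
        self.cache = {}            # service name -> (address, expiry time)
        self.api_server_up = True  # flips False when the API saturates

    def refresh_from_api(self, name, now):
        """Simulate repopulating a record from the k8s API server."""
        if not self.api_server_up:
            raise LookupError(f"cannot refresh {name}: API server saturated")
        self.cache[name] = (f"10.0.0.{hash(name) % 250}", now + TTL)

    def resolve(self, name, now):
        addr, expiry = self.cache.get(name, (None, -1))
        if now < expiry:
            return addr                      # cache hit: works with API down
        self.refresh_from_api(name, now)     # miss or expired: needs the API
        return self.cache[name][0]

dns = ClusterDNS()
dns.resolve("payments", now=0)    # populate the cache while API is healthy
dns.api_server_up = False         # telemetry rollout saturates the API
print(dns.resolve("payments", now=10))  # still resolves from cache
try:
    dns.resolve("payments", now=40)     # TTL expired: discovery breaks
except LookupError as err:
    print(err)
```

Note that the control-plane failure at `now=0` through `now=29` is invisible to the data plane; the dependency only surfaces once caches expire, which is exactly the delayed coupling the incident exposed.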

The incident illustrates several recurring patterns in distributed systems failures. First, saturation (resource exhaustion) is notoriously difficult to prevent through testing because the system can be functionally correct—passing all tests in staging—while still failing under production-scale load. OpenAI's engineers did validate resource utilization on the clusters, but didn't assess the impact on the Kubernetes API servers specifically. Second, DNS caching created a temporal smearing effect: the impact was delayed and spread out over time, allowing the rollout to continue before problems became visible and making it harder to correlate the change with the failure. Third, the failure mode broke the remediation tools themselves—they couldn't access the Kubernetes control plane to fix the problem because the control plane was what had failed. The engineers had to pursue multiple parallel strategies (scaling down cluster size, blocking network access, scaling up API servers) to eventually restore enough control to remove the offending service, essentially gambling on interventions under uncertainty.
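The first pattern, saturation passing functional tests, has a simple queueing-theory illustration. The sketch below uses the standard M/M/1 mean-latency formula with made-up request rates (the numbers are not OpenAI's): a service that is functionally correct and comfortably fast at staging load degrades non-linearly as arrival rate approaches capacity, so correctness tests at low load reveal nothing.

```python
# Why saturation evades functional testing (illustrative numbers only).
# For an M/M/1 queue, mean time in system is 1/(mu - lambda): latency
# diverges as the arrival rate approaches the service rate, even though
# every individual request is handled correctly.

def mean_latency(arrival_rate, service_rate):
    """Mean time in system (seconds) for an M/M/1 queue."""
    if arrival_rate >= service_rate:
        return float("inf")        # saturated: queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

SERVICE_RATE = 1000.0              # req/s one API server handles (made up)
for load in (100, 500, 900, 990, 1000):
    latency_ms = mean_latency(load, SERVICE_RATE) * 1000
    print(f"{load:>5} req/s -> {latency_ms:.1f} ms")
```

The jump from 900 to 990 req/s multiplies latency tenfold, and 1000 req/s is unbounded: the system is fine, fine, fine, then gone, which is why validating "resource utilization" without load-testing the API servers themselves missed the cliff.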

The broader lesson is about looking at system interactions, not just individual components. The telemetry service itself worked fine; the k8s API servers were functioning correctly; DNS was operating as designed. The failure emerged from their interaction under specific conditions (cluster size, load patterns) that only existed in production. This exemplifies why reliable systems fail: the very subsystems designed to improve reliability (observability, telemetry) can become sources of failure through complex, unexpected interactions.