The Canva outage: another tale of saturation and resilience – Surfing Complexity
Canva's outage shows how a perfectly fine deployment triggered cascading failures when latent issues collided—and why automated systems designed to help (load balancers, autoscalers) made things worse until human operators improvised a solution by repurposing Cloudflare's firewall as a load shedder.
TLDR
• A stale Cloudflare routing rule stalled CDN asset downloads, synchronizing 270K+ waiting clients into an accidental "barrier" that, once released, produced a thundering herd of 1.5M req/s, 3x normal peak
• The API gateway couldn't handle it due to a known-but-unfixed telemetry library bug that caused excessive thread locking, reducing throughput at the worst possible moment
• Load balancers concentrated traffic on "healthy" nodes, pushing them over the edge faster than autoscalers could provision new capacity, while the OOM killer terminated overloaded tasks
• Engineers had to block all traffic at the CDN layer, let the system stabilize, then incrementally ramp up by region—adapting the system's behavior in ways it wasn't originally designed for
• The incident illustrates "decompensation" (system can't keep pace with cascading failures) and why resilience requires human adaptive capacity, not just robust automation
In Detail
Lorin Hochstein dissects Canva's December 2024 outage as a textbook example of how systems fail when pushed outside their "competence envelope"—the range of conditions they're designed to handle. The trigger wasn't a bug but a deployment: when Canva released new JavaScript files, a stale Cloudflare routing rule sent traffic between Singapore and Northern Virginia over the public internet instead of their private backbone. The resulting packet loss created massive latency, causing 270K+ clients to queue up waiting for the CDN to fetch assets from origin. When the transfer finally completed 20 minutes later, all those synchronized clients simultaneously hit Canva's API gateway with 3x normal peak load. The gateway couldn't cope because of a performance regression where calls to a telemetry library were taking locks on the event loop—a known issue with a fix literally queued for deployment that day.
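The accidental "barrier" is worth seeing concretely: stalled downloads don't spread client requests out over time, they pile clients up behind a single blocking condition and release them all at once. The following toy sketch (not Canva's code; the numbers and names are illustrative) models clients waiting on a shared event and measures how tightly their requests cluster when it fires:

```python
import threading
import time

def thundering_herd(num_clients: int) -> float:
    """Toy model: clients stall on a shared event (the accidental
    'barrier' of a blocked CDN fetch), then all fire at once.
    Returns the spread in seconds between first and last request."""
    asset_ready = threading.Event()   # stands in for the stalled CDN asset
    arrival_times = []
    lock = threading.Lock()

    def client():
        asset_ready.wait()            # every client queues up here
        with lock:
            arrival_times.append(time.monotonic())

    threads = [threading.Thread(target=client) for _ in range(num_clients)]
    for t in threads:
        t.start()
    time.sleep(0.1)                   # the 20-minute stall, compressed
    asset_ready.set()                 # transfer completes: herd released
    for t in threads:
        t.join()
    return max(arrival_times) - min(arrival_times)

# All clients arrive within a fraction of a second of each other
# instead of spreading naturally over the stall window.
print(f"arrival spread: {thundering_herd(50):.3f}s")
```

The key property is that the load spike's size is proportional to how long the barrier held, which is why a 20-minute stall translated into 3x normal peak on release.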
What makes this incident fascinating is how automated systems designed to improve reliability actually accelerated the failure. The load balancer redirected traffic away from unhealthy tasks to healthy ones, creating a positive feedback loop that pushed more tasks over the edge. The autoscaler tried to provision new capacity, but the Linux OOM killer terminated overloaded containers faster than new ones could spin up—a phenomenon resilience engineer David Woods calls "decompensation," where the system can't keep pace with cascading failures. Hochstein emphasizes that all these systems were functionally correct; the issues were performance problems that only manifested under extreme load, making them nearly impossible to catch in testing.
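The positive feedback loop can be captured in a few lines: each time a task is marked unhealthy or OOM-killed, the balancer redistributes its share onto the survivors, raising their per-task load and making the next failure more likely. A minimal simulation, with hypothetical capacity numbers rather than Canva's real ones:

```python
def cascade(num_tasks: int, capacity_per_task: float, total_load: float) -> list[int]:
    """Toy model of the load-balancer feedback loop: traffic meant for
    failed tasks is concentrated on the survivors. Returns the count of
    live tasks after each failure. (Illustrative numbers only.)"""
    alive = num_tasks
    history = [alive]
    while alive > 0:
        per_task_load = total_load / alive
        if per_task_load <= capacity_per_task:
            break                     # survivors can absorb the load
        alive -= 1                    # one more task overloaded and killed
        history.append(alive)
    return history

# 100 tasks, each handling 10 units, facing 1500 units (3x a 500-unit peak):
# every removal concentrates load further, and the whole fleet collapses.
print(cascade(100, 10.0, 1500.0)[-1])   # 0 tasks left

# At 900 units the same fleet is stable: no task ever exceeds capacity.
print(cascade(100, 10.0, 900.0))        # [100]
```

Note the threshold behavior: below aggregate capacity nothing fails, while any sustained excess drives the fleet to zero, which is exactly why autoscaling that adds capacity slower than the kill rate cannot arrest the collapse.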
The recovery required human improvisation. Engineers blocked all traffic using Cloudflare's firewall rules (repurposing a security feature as a load shedder), let the system stabilize, then carefully ramped traffic back up by region. This illustrates Hochstein's core thesis: resilience isn't just about building robust systems that handle expected scenarios—it's about adaptive capacity, the ability to reconfigure system behavior when facing the unexpected. Automation can't do this; it requires operators with deep system knowledge who can improvise solutions in the moment. The piece challenges the "automate everything" mindset, arguing that the more you improve at catching functional bugs, the more your incidents will be these harder-to-detect performance issues that demand human judgment to resolve.
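The recovery pattern itself, block everything, stabilize, then admit traffic region by region, is a general load-shedding shape. A sketch of that gating logic, assuming hypothetical region names (Canva's actual mechanism was Cloudflare firewall rules at the edge, not application code):

```python
# Hypothetical region identifiers for illustration.
REGIONS = ["ap-southeast", "us-east", "eu-west"]

class TrafficGate:
    """Sketch of 'block everything, then admit regions one at a time'."""

    def __init__(self) -> None:
        self.allowed: set[str] = set()   # start fully blocked

    def admit(self, region: str) -> None:
        self.allowed.add(region)

    def accepts(self, region: str) -> bool:
        return region in self.allowed

gate = TrafficGate()
# Full block: with zero inbound traffic, queued work drains and the
# overloaded tasks recover.
assert not any(gate.accepts(r) for r in REGIONS)

for region in REGIONS:                   # incremental regional ramp-up
    gate.admit(region)
    # ...watch saturation and error rates before admitting the next region...

print(sorted(gate.allowed))
```

The reason this works where automation failed is in the loop body's comment: each admission step is a human judgment call based on observed system health, not a pre-programmed policy.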