Cloudflare outage on November 18, 2025
A database permissions change caused Cloudflare's Bot Management feature file to more than double in size, exceeding a hardcoded limit and crashing their global network for roughly 6 hours—their worst outage since 2019.
TLDR
• A ClickHouse database permissions update caused metadata queries to return duplicate rows, ballooning a Bot Management feature file from ~60 to >200 features
• The proxy code had a hardcoded 200-feature limit for memory preallocation—when exceeded, it panicked with unhandled errors, returning 5xx responses globally
• Intermittent failures (good/bad files generated every 5 minutes) initially made the team suspect a DDoS attack, especially when their status page coincidentally went down
• The fix: stopped bad file generation, manually inserted a known-good file, and force-restarted the core proxy—traffic recovered after 6 hours
• Root cause was treating internal config files as trusted input rather than validating them like user input, plus lack of graceful error handling for limit violations
In Detail
On November 18, 2025, Cloudflare experienced its worst outage since 2019 when a database permissions change triggered a cascade of failures across their global network. The incident began at 11:05 UTC, when engineers changed ClickHouse database permissions to improve security, allowing users to see metadata for the underlying tables in addition to the distributed tables. This seemingly benign change had an unexpected consequence: a metadata query that filtered by table name but not by database now returned columns from the underlying "r0" database alongside those from "default", effectively doubling the result set.
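The duplication mechanism can be sketched in a few lines. This is a hypothetical model, not Cloudflare's actual query or schema: a metadata table holds the same columns under two databases, and a lookup that constrains only the table name silently doubles once the second database becomes visible.

```rust
// Hypothetical stand-in for ClickHouse's column-metadata table.
#[derive(Clone, Debug, PartialEq)]
struct ColumnMeta {
    database: &'static str,
    table: &'static str,
    name: &'static str,
}

// After the permissions change, the same table's columns are visible
// under both the "default" and the underlying "r0" database.
fn system_columns() -> Vec<ColumnMeta> {
    let mut rows = Vec::new();
    for db in ["default", "r0"] {
        for col in ["feature_a", "feature_b", "feature_c"] {
            rows.push(ColumnMeta { database: db, table: "bot_features", name: col });
        }
    }
    rows
}

// The fragile query shape: filter by table name only. Correct while a
// single database was visible; returns duplicates once "r0" appears.
fn feature_names_unfiltered() -> Vec<&'static str> {
    system_columns()
        .into_iter()
        .filter(|c| c.table == "bot_features")
        .map(|c| c.name)
        .collect()
}

// The defensive shape: constrain the database explicitly.
fn feature_names_filtered() -> Vec<&'static str> {
    system_columns()
        .into_iter()
        .filter(|c| c.database == "default" && c.table == "bot_features")
        .map(|c| c.name)
        .collect()
}
```

The same query text was "correct" before and after the permissions change; only the set of visible rows changed underneath it.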
This mattered because the Bot Management system used such a query to generate a "feature file" every five minutes—a configuration file containing machine learning features used to score bot traffic. The duplicate rows caused the file to balloon from ~60 features to over 200. The problem: Cloudflare's FL2 proxy had a hardcoded limit of 200 features for memory preallocation. When the oversized file was distributed globally, the Rust code hit this limit and panicked with an unhandled error, causing 5xx responses for any traffic touching the bots module.
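The failure mode described above, a preallocation bound enforced by an unhandled error, can be sketched as follows. The types and names here are illustrative, not the FL2 proxy's real code; the point is the difference between `unwrap()` on the error path and falling back to a last-known-good configuration.

```rust
// Illustrative preallocation bound, mirroring the proxy's 200-feature limit.
const FEATURE_LIMIT: usize = 200;

#[derive(Debug, PartialEq)]
enum LoadError {
    TooManyFeatures { got: usize, limit: usize },
}

// Loading a feature file into a table preallocated to the limit.
fn load_features(names: &[String]) -> Result<Vec<String>, LoadError> {
    if names.len() > FEATURE_LIMIT {
        return Err(LoadError::TooManyFeatures { got: names.len(), limit: FEATURE_LIMIT });
    }
    let mut table = Vec::with_capacity(FEATURE_LIMIT);
    table.extend_from_slice(names);
    Ok(table)
}

// What the outage behaviour amounted to: unwrap() on the error path,
// turning an oversized config file into a process-wide panic.
fn load_or_panic(names: &[String]) -> Vec<String> {
    load_features(names).unwrap()
}

// A graceful alternative: reject the oversized file and keep serving
// with the last-known-good feature set.
fn load_or_keep_last(names: &[String], last_good: Vec<String>) -> Vec<String> {
    load_features(names).unwrap_or(last_good)
}
```

With the graceful variant, the bad file would have degraded bot scoring rather than taken down every request passing through the module.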
The debugging process was complicated by intermittent failures—because the ClickHouse cluster was being gradually updated, the query sometimes hit updated nodes (generating bad files) and sometimes hit old nodes (generating good files). This created a pattern of the network failing and recovering every five minutes, which led the team to initially suspect a DDoS attack, especially when their external status page coincidentally went down. The team ultimately identified the root cause, stopped the bad file generation, manually inserted a known-good file, and force-restarted the proxy. Full recovery took until 17:06 UTC.
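The alternating failure pattern falls out of the rollout mechanics directly. A toy model (illustrative numbers, not Cloudflare's actual counts): each 5-minute generation run lands on one ClickHouse node, and only updated nodes expose the duplicate metadata that inflates the file past the limit.

```rust
const FEATURE_LIMIT: usize = 200;

// Illustrative counts: an old node yields a normal file, an updated
// node yields one inflated past the proxy's limit by duplicate rows.
fn generated_feature_count(node_is_updated: bool) -> usize {
    if node_is_updated { 240 } else { 60 }
}

// Whether a given 5-minute run produces a file the proxy can load.
fn run_is_healthy(node_is_updated: bool) -> bool {
    generated_feature_count(node_is_updated) <= FEATURE_LIMIT
}

// A sequence of runs hitting a mix of node versions during the rollout
// produces the observed fail/recover oscillation.
fn simulate(runs: &[bool]) -> Vec<bool> {
    runs.iter().map(|&updated| run_is_healthy(updated)).collect()
}
```

A fleet-wide oscillation with a clean 5-minute period looks much more like an external attack than like a config pipeline, which is part of why the DDoS theory held for as long as it did.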
The post-mortem reveals several systemic failures: treating internally-generated config files as trusted rather than validated input, using hardcoded limits without graceful degradation, and having error handling that consumed excessive CPU when triggered at scale. Cloudflare's remediation plan includes hardening config file ingestion, adding global kill switches for features, preventing error reporting from overwhelming resources, and reviewing failure modes across all proxy modules. The transparency of this analysis—including showing the exact Rust code that panicked—demonstrates how complex distributed systems fail not from attacks or hardware issues, but from subtle interactions between components that seem unrelated.
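The "treat internal config like user input" remediation can be sketched as a validation gate run before a generated file is published to the fleet. All names here are hypothetical; notably, a duplicate-name check alone would have caught this specific bad file at generation time.

```rust
use std::collections::HashSet;

// Hypothetical feature-file representation.
struct FeatureFile {
    features: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum ValidationError {
    Empty,
    TooLarge(usize),
    DuplicateName(String),
}

// Validate a generated file against invariants before distribution,
// instead of trusting it because it came from an internal pipeline.
fn validate(file: &FeatureFile, limit: usize) -> Result<(), ValidationError> {
    if file.features.is_empty() {
        return Err(ValidationError::Empty);
    }
    if file.features.len() > limit {
        return Err(ValidationError::TooLarge(file.features.len()));
    }
    let mut seen = HashSet::new();
    for name in &file.features {
        if !seen.insert(name) {
            return Err(ValidationError::DuplicateName(name.clone()));
        }
    }
    Ok(())
}
```

Rejecting the file at the generator keeps the blast radius to one pipeline; rejecting it only at the consumer, as the panic effectively did, puts every proxy instance in the failure path.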