
When Imperfect Systems are Good, Actually: Bluesky's Lossy Timelines · Jaz's Blog

Bluesky deliberately drops timeline writes for users following too many people—cutting P99 latency by 96% by recognizing that perfect consistency is wasted on users who can't consume their feed anyway.

· infrastructure

• Define "reasonable user behavior" as a design constraint: users following >2000 people can't realistically consume their full timeline, so perfect consistency is wasted effort
• Hot shards emerge when hyperactive timelines (users following 100k+ others) slow down tens of thousands of colocated users on the same database shard
• Probabilistic write dropping via loss_factor = min(reasonable_limit/num_follows, 1) caps fanout work per user—someone following 8000 people gets 75% of writes dropped
• Tail latencies multiply brutally: fanning out to 2M followers with 15ms P99 writes can stack up to 5+ minute fanout times for a single post
• Results: P99 fanout duration dropped from 5-10 minutes to <10 seconds, hot shards eliminated entirely
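The tail-latency stacking in the bullets above can be sketched with back-of-envelope arithmetic. The batch size here is an assumed illustrative number, not Bluesky's actual configuration; the point is that sequential batches at P99 write latency add up linearly.

```python
# Hypothetical fanout math: 2M followers written in sequential batches,
# each batch hitting the shard's 15ms P99 write latency.
followers = 2_000_000
batch_size = 100        # assumed writes per batch (illustrative only)
p99_write_ms = 15       # per-batch P99 write latency

batches = followers // batch_size            # 20,000 sequential batches
worst_case_s = batches * p99_write_ms / 1000 # latencies stack additively

print(f"{batches} batches -> {worst_case_s:.0f} s "
      f"(~{worst_case_s / 60:.0f} min)")     # on the order of 5 minutes
```

Even with generous batching, a single viral post from a 2M-follower account can pin a fanout worker for minutes, which is exactly the regime the article describes.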

The core thesis is that embracing imperfection for edge-case users enables massive scalability improvements without degrading normal user experience. Bluesky's timeline fanout works by inserting post references into followers' timeline tables—but when users follow hundreds of thousands of others, their timeline becomes a "hot shard" that gets written to constantly, slowing down the tens of thousands of other users sharing that database shard.

The author introduces "lossy timelines" as the solution: define a reasonable_limit (2000 follows) and calculate a loss_factor for each user. Users following 4000 people get a loss factor of 0.5, meaning 50% of writes to their timeline are probabilistically dropped. This caps the expected fanout work per user at the equivalent of the reasonable limit. The key insight is that users following 100k+ people can't possibly consume their full timeline anyway; they're beyond the threshold of reasonable behavior, so perfect consistency is wasted computational effort. To avoid millions of extra database reads during fanout, the set of high-follow accounts is cached in Redis and refreshed in memory every 30 seconds.
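The loss-factor mechanism described above can be sketched in a few lines. This is a minimal illustration of the technique, not Bluesky's actual (Go) implementation; the function names are hypothetical.

```python
import random

REASONABLE_LIMIT = 2000  # follow count beyond which timelines go lossy


def loss_factor(num_follows: int) -> float:
    """Fraction of timeline writes this user keeps (1.0 = lossless)."""
    return min(REASONABLE_LIMIT / num_follows, 1.0)


def should_write(num_follows: int, rng=random) -> bool:
    """Probabilistically decide whether to fan a post out to this timeline."""
    return rng.random() < loss_factor(num_follows)


# A user following 8000 people keeps only 25% of writes;
# anyone under the limit keeps everything.
print(loss_factor(8000))  # 0.25
print(loss_factor(1500))  # 1.0
```

Because the drop is decided independently per write, no coordination or extra reads are needed at fanout time; in expectation, each heavy follower costs the shard no more than a user at the reasonable limit.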

The results were dramatic: P99 fanout latency dropped 96%, jobs that previously took 5-10 minutes for large accounts now complete in under 10 seconds, and hot shards were eliminated entirely. This is a masterclass in identifying where consistency doesn't matter to users and trading it for massive performance gains. The broader lesson: "reasonable limits" aren't just for rate limiting—they're a fundamental design tool for building scalable systems.