"Stop Rate Limiting! Capacity Management Done Right" by Jon Moore - YouTube
Rate limiting APIs is fundamentally broken—limit concurrent requests instead and use TCP-style congestion control to adaptively discover backend capacity, enabling fair allocation without wasting resources.
TLDR
• Rate limiting (requests/second) fails during backend overload because it doesn't provide backpressure—queues grow infinitely and latency explodes
• Little's Law (N = X × R) reveals that limiting concurrent requests behaves identically to rate limiting under normal conditions but naturally throttles during overload
• Borrowing TCP's AIMD algorithm (additive increase, multiplicative decrease), proxies can adaptively discover backend capacity by probing with +1 connection/second and backing off to 75% on errors
• Multiple proxy instances converge to fair allocation without central coordination—configure the same quotas everywhere and AIMD handles the distribution
• Throughput ≠ capacity: by Little's Law, a backend with 7 workers and 2s latency delivers only 3.5 req/s of throughput even though its capacity is 7 concurrent requests; you still need to bound queue depth to avoid timeouts
In Detail
Moore demonstrates through live demos and queueing theory why the standard practice of rate limiting APIs is fundamentally flawed. When a backend service experiences elevated latency (say, doubling from 1s to 2s), rate-limited clients continue sending their quota of requests, causing infinite queue buildup at the backend. Using Little's Law (concurrent requests = throughput × latency), he shows that a service with 7 worker threads and 2s latency can only handle 3.5 req/s—but a rate-limited client sending 5 req/s will cause unbounded queuing and eventual system collapse.
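The arithmetic above falls directly out of Little's Law. A minimal sketch of the numbers from the talk (variable names are my own):

```python
# Little's Law: N = X * R (concurrent requests = throughput * latency).
# A backend with 7 worker threads at 2s per request can sustain at most
# N / R = 7 / 2 = 3.5 req/s.

workers = 7          # concurrent requests the backend can serve (N)
latency_s = 2.0      # service time per request (R)
capacity = workers / latency_s           # X = N / R
print(capacity)                          # 3.5 req/s

# A rate-limited client fixed at 5 req/s keeps sending regardless of
# latency, so the backlog grows by 1.5 requests every second, unbounded.
client_rate = 5.0
backlog_growth_per_s = client_rate - capacity
print(backlog_growth_per_s)              # 1.5
```

The point is that the rate limit never feeds the backend's elevated latency back to the client, so nothing slows the client down.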
The solution is to limit concurrent requests instead of request rate. Under normal conditions, this behaves identically to rate limiting (5 concurrent requests with 1s latency = 5 req/s). But during overload, it automatically provides backpressure: if latency doubles, throughput is cut in half to maintain the concurrency limit. The key insight is that throughput and capacity are different variables in Little's Law—adding worker threads increases capacity even if latency is elevated.
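One minimal way to implement a concurrency limit on the client or proxy side is a counting semaphore; this sketch (names are illustrative, not from the talk) shows how backpressure emerges for free: callers block when all in-flight slots are taken, so throughput automatically becomes limit / latency.

```python
import threading

# Cap in-flight requests instead of requests per second. With 5 slots and
# 1s latency the client does ~5 req/s; if backend latency doubles to 2s,
# the same 5 slots yield only ~2.5 req/s -- throughput halves on its own.
MAX_IN_FLIGHT = 5
slots = threading.Semaphore(MAX_IN_FLIGHT)

def call_backend(send_request):
    # Block until a slot frees up: this waiting is the backpressure
    # that a pure rate limiter never provides.
    with slots:
        return send_request()

# Usage: wrap whatever actually performs the request.
result = call_backend(lambda: "ok")
```

A bounded semaphore acquire with a timeout (rather than blocking forever) is a common refinement, turning overload into fast failures instead of client-side queueing.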
Moore then introduces an adaptive algorithm borrowed from TCP congestion control: increase allowed connections by 1/second when things are healthy, back off to 75% of current estimate when encountering timeouts or errors. This lets proxies dynamically discover backend capacity without hardcoded limits. He extends this with "fair share" allocation using quota percentages rather than absolute numbers—if client A is entitled to 25% and client B to 75%, they can borrow unused quota from each other based on the proxy's current capacity estimate. The AIMD properties ensure that multiple proxy instances converge to fair allocation without any central coordination—just configure the same percentages everywhere and the math works out.
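The AIMD loop described above can be sketched in a few lines; the class name and method hooks here are my own, but the constants (+1 per healthy second, back off to 75%) are the ones from the talk.

```python
# AIMD (additive increase, multiplicative decrease) capacity estimate:
# probe upward by +1 concurrent request per healthy second, and drop to
# 75% of the current estimate on a timeout or error.

class AimdLimit:
    def __init__(self, initial=1, floor=1):
        self.limit = float(initial)
        self.floor = floor  # never back off below this many requests

    def on_healthy_tick(self):
        # Additive increase: gently probe for spare backend capacity.
        self.limit += 1

    def on_error(self):
        # Multiplicative decrease: back off to 75% on timeout/error.
        self.limit = max(self.floor, self.limit * 0.75)

    def allowed(self):
        return int(self.limit)

limiter = AimdLimit(initial=10)
limiter.on_healthy_tick()   # estimate: 10 -> 11
limiter.on_error()          # estimate: 11 -> 8.25
print(limiter.allowed())    # 8
```

Fair-share quotas then become a multiplication rather than a fixed number: a client entitled to 25% is allowed `0.25 * limiter.allowed()` concurrent requests against this proxy's current estimate, which is why identical percentage configs on independent proxies converge without coordination.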