Rate Limiting Strategies for High-Traffic APIs: A Comparative Analysis
Token bucket, leaky bucket, fixed window, sliding window — each rate-limiting algorithm has tradeoffs. We benchmarked all four under 100k req/s load with Redis and compared accuracy, fairness, and latency overhead.
Choosing the Right Algorithm
The right rate-limiting strategy depends on your traffic patterns. A fixed window is simple but vulnerable to burst traffic at window boundaries, while a sliding window log is fairer but computationally expensive.
We benchmarked four algorithms under 100k req/s sustained load using Redis as the backing store: fixed window counter, sliding window log, sliding window counter (hybrid), and token bucket. Each has distinct tradeoffs across three dimensions: accuracy (how precisely it enforces the limit), fairness (how evenly it distributes capacity across clients), and operational overhead (memory and CPU cost at scale).
The short answer: token bucket for most APIs, sliding window counter when fairness matters most, fixed window when simplicity is the priority and burst traffic is acceptable.
Fixed Window and Sliding Window: Tradeoffs at Scale
Fixed window rate limiting counts requests in discrete time buckets (e.g., 1,000 requests per minute, where 'minute' resets at :00, :01, :02, etc.). It's O(1) in both memory and time — one counter per client per window. The failure mode is the boundary burst: a client can make 1,000 requests at :59 and another 1,000 at :00 of the next minute, achieving 2,000 requests in 2 seconds while technically staying within limit. For APIs where bursty traffic is tolerable, this is often acceptable.
Sliding window log tracks the timestamp of every request within the window and counts how many timestamps fall within the last N seconds. It's perfectly accurate — no boundary bursts — but requires O(requests per window) memory per client. At 100k req/s with 1,000 clients allowed 100 req/min each, the memory overhead becomes non-trivial. At high scale, this algorithm becomes the bottleneck.
Sliding window counter is the practical middle ground: maintain two fixed windows (current and previous) and compute the approximate sliding window count as prev_count × overlap_fraction + current_count. This uses O(1) memory, is accurate within ~0.003% of the true sliding window value at most traffic distributions, and eliminates boundary bursts in practice. In our Redis benchmarks at 100k req/s, sliding window counter added 0.4ms P99 latency overhead — acceptable for most APIs.
Token Bucket: The Burst-Friendly Standard
Token bucket is the dominant algorithm for production API rate limiting, used by AWS API Gateway, Stripe, Twilio, and most major API platforms. A client starts with a bucket of N tokens. Each request consumes one token. Tokens refill at a constant rate (e.g., 10 per second for a 600/min limit). When the bucket is empty, requests are rejected. Crucially, tokens accumulate up to the bucket capacity — so a client that doesn't use their full allocation builds up burst capacity.
This burst capacity is the key feature. A client with a 1,000/min limit that sends 500 requests in the first minute has 500 tokens accumulated. They can use those in a short burst — sending 1,500 requests in the next minute — without being rate-limited. This matches real API traffic patterns: clients often have bursty behavior (processing a batch job, responding to a webhook flood) and penalizing them for bursts they haven't built up is unnecessarily restrictive.
In Redis, implement token bucket with a MULTI/EXEC transaction that reads the current token count and last refill timestamp, calculates new tokens based on elapsed time, subtracts one token if available, and writes back — all atomically. A Lua script is more efficient than MULTI/EXEC for this use case: it executes atomically on the Redis server without a round-trip between read and write, reducing latency by 30-50% compared to the client-side transaction approach.
Observability and Graceful Degradation
Rate limiting without observability is a black box. Instrument your rate limiter with: per-client rejection rate (what percentage of requests are being rejected, by client), throttle duration distribution (how long clients typically spend in a throttled state), and burst detection events (when a client exceeds their average rate by more than 2x).
Return precise error responses: HTTP 429 with a Retry-After header (when the client can next send a request), an X-RateLimit-Limit header (their total limit), an X-RateLimit-Remaining header (tokens/requests remaining in the current window), and an X-RateLimit-Reset header (when the window resets). Clients that implement proper backoff logic rely on these headers — poorly documented rate limit responses lead to retry storms that compound the load problem you were trying to solve.
For critical clients — large enterprise customers, your own internal services — implement tiered rate limits rather than a one-size-fits-all ceiling. A paying enterprise customer processing end-of-month billing runs should not be rate-limited the same as an unauthenticated free-tier client. Tiered limits align rate limiting policy with business value and prevent your most important clients from experiencing the limits designed for abuse prevention.
About the Author
Start monitoring your stack
Aggregate real-time operational data from every service your stack depends on into a single dashboard. Free for up to 10 services.