Module 2: Scalability, Load Balancing & API Design

Why Rate Limiting Exists

Every API has a capacity ceiling. Whether that ceiling is set by CPU, memory, database connections, or downstream dependencies, there is a point beyond which additional requests degrade the experience for everyone. Rate limiting enforces that ceiling explicitly rather than letting the system discover it through failure.

Rate limiting serves three purposes. First, it protects backend services from being overwhelmed by a single client (intentionally or accidentally). Second, it ensures fair access across clients. Third, it provides a defense layer against abuse, including brute-force attacks, credential stuffing, and scraping.

The core question in rate limiting is simple: given a stream of incoming requests, how do you decide which ones to allow and which to reject? The answer depends on which algorithm you use.

Fixed Window Counter

The simplest approach. Divide time into fixed intervals (say, 1-minute windows). Maintain a counter per client per window. Increment on each request. If the counter exceeds the limit, reject.
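The counter-per-window idea can be sketched in a few lines. This is an illustrative in-process version (the class name and `now` parameter, used for deterministic testing, are ours, not from any library):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each fixed time window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)  # (client_id, window_index) -> count

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        window_index = int(now // self.window)  # which fixed window this request falls in
        key = (client_id, window_index)
        if self.counters[key] >= self.limit:
            return False  # limit reached for this window
        self.counters[key] += 1
        return True
```

Note that nothing ties the end of one window to the start of the next, which is exactly where the boundary burst problem below comes from.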

The problem is boundary bursts. A client can send 100 requests at 12:00:59 and another 100 at 12:01:00. Both are within their respective windows, but the server just received 200 requests in 2 seconds. The fixed window sees two compliant windows. The server sees a spike.

Boundary burst problem: Fixed window counters allow up to 2x the intended rate at window boundaries, because requests at the end of one window and the start of the next are counted separately but arrive nearly simultaneously.

Sliding Window Log

Instead of fixed intervals, keep a log of timestamps for each request. When a new request arrives, remove all timestamps older than the window size (e.g., older than 60 seconds). Count the remaining entries. If the count is at or above the limit, reject.

This eliminates the boundary burst problem entirely. The window slides with each request, so there is no artificial boundary to exploit. The cost is memory: you must store every request timestamp within the window. For high-volume APIs, this becomes expensive quickly.
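A minimal sketch of the timestamp-log approach (again with an illustrative class name and an injectable `now` for testing):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLogLimiter:
    """Exact sliding-window limiter: stores one timestamp per request."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # client_id -> timestamps within the window

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        log = self.logs[client_id]
        # Evict timestamps that have fallen out of the sliding window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The `deque` makes eviction from the old end cheap, but the memory cost remains one entry per in-window request, which is the scaling problem noted above.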

Sliding Window Counter

A hybrid that approximates the sliding window without storing individual timestamps. It keeps counters for the current and previous fixed windows, then weights them based on how far into the current window you are.

If the previous window had 80 requests and the current window (30 seconds in, out of 60) has 40 requests, the estimated count is: 80 x 0.5 (remaining fraction of previous window) + 40 = 80. This gives you a reasonable approximation of a true sliding window at the memory cost of just two counters per client.
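The weighted-estimate arithmetic above translates directly into code. A sketch (class name and test hook are illustrative):

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Approximate sliding window using two counters per client."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (client_id, window_index) -> count

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        idx = int(now // self.window)
        elapsed = (now % self.window) / self.window  # fraction of current window elapsed
        prev = self.counts[(client_id, idx - 1)]
        curr = self.counts[(client_id, idx)]
        # Weight the previous window by how much of it still overlaps
        # the sliding window (the remaining fraction).
        estimated = prev * (1 - elapsed) + curr
        if estimated >= self.limit:
            return False
        self.counts[(client_id, idx)] += 1
        return True
```

With the numbers from the worked example (previous window 80, current window 40, halfway through), the estimate comes out to 80, matching the text.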

Token Bucket

The token bucket algorithm works by analogy. A bucket holds tokens up to a maximum capacity. Tokens are added at a fixed rate (the refill rate). Each request consumes one token. If the bucket is empty, the request is rejected.

The key property: the bucket can accumulate tokens during idle periods, up to its maximum capacity. This allows controlled bursts. A client that has been quiet for a while can briefly exceed the steady-state rate, up to the bucket's capacity, before being throttled back to the refill rate.

```mermaid
graph TD
    Refill["Token Refill<br/>(fixed rate, e.g., 10/sec)"] --> Bucket["Token Bucket<br/>(max capacity: 100)"]
    Request["Incoming Request"] --> Check{"Tokens > 0?"}
    Bucket --> Check
    Check -->|Yes| Allow["Allow Request<br/>(remove 1 token)"]
    Check -->|No| Reject["Reject (429)"]
```

In practice, you do not run a background thread to add tokens. Instead, when a request arrives, you calculate how many tokens should have been added since the last request based on elapsed time. This is called "lazy refill" and is how most implementations work.
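A lazy-refill token bucket fits in one small class. This is a sketch, not a production implementation; the `now` parameter exists so the clock can be injected for testing:

```python
import time

class TokenBucket:
    """Token bucket with lazy refill: tokens are credited on each request
    based on elapsed time, rather than by a background thread."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity        # max tokens the bucket can hold
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start full
        self.last_refill = time.time() if now is None else now

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Lazy refill: credit tokens for the time elapsed since the last request,
        # capped at the bucket's capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Because the bucket starts full and refills up to `capacity`, an idle client can burst up to `capacity` requests before being throttled to `refill_rate`, exactly the behavior described above.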

Leaky Bucket

The leaky bucket is the inverse. Requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are dropped.

Where the token bucket allows bursts up to the bucket capacity, the leaky bucket enforces a strictly uniform output rate. Bursts are absorbed by the queue rather than passed through. This is useful when your downstream dependency requires a constant request rate and cannot tolerate spikes at all.
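A common simplification implements the leaky bucket as a meter: instead of a literal queue of requests, a level that rises on each admitted request and drains at the leak rate. A sketch under that assumption (names are illustrative):

```python
import time

class LeakyBucket:
    """Leaky bucket as a meter: the level drains at a fixed rate;
    requests that would overflow the bucket are dropped."""

    def __init__(self, capacity, leak_rate, now=None):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.level = 0.0            # current queue depth
        self.last_leak = time.time() if now is None else now

    def allow(self, now=None):
        now = time.time() if now is None else now
        # Drain the bucket at the fixed leak rate for the elapsed time.
        elapsed = now - self.last_leak
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_leak = now
        if self.level + 1 > self.capacity:
            return False  # bucket full: drop the request
        self.level += 1
        return True
```

A queue-based variant would instead hold requests and release them at the leak rate, which adds latency but guarantees the smooth output rate downstream.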

Algorithm Comparison

| Algorithm | Burst Handling | Memory | Accuracy | Best For |
| --- | --- | --- | --- | --- |
| Fixed Window | Allows 2x burst at boundaries | Very low (1 counter per client) | Low | Simple use cases, low-stakes limits |
| Sliding Window Log | No boundary burst | High (stores all timestamps) | Exact | Low-volume APIs needing precision |
| Sliding Window Counter | Minimal boundary burst | Low (2 counters per client) | Approximate | High-volume APIs, good default choice |
| Token Bucket | Controlled bursts allowed | Low (token count + timestamp) | Exact for steady state | APIs where occasional bursts are acceptable |
| Leaky Bucket | Smooths all bursts | Low (queue size) | Exact output rate | Protecting rate-sensitive downstream services |

Distributed Rate Limiting with Redis

A rate limiter running in a single process is straightforward. The challenge appears when your API runs on multiple servers behind a load balancer. Each server sees only a fraction of the total traffic. A per-server limit of 100 requests per minute, across 10 servers, means the actual limit is 1,000, not 100.

The standard solution is a centralized counter in Redis. Redis operations like INCR are atomic and fast (sub-millisecond). A typical pattern for a fixed window limiter:

  1. Construct a key like ratelimit:{user_id}:{window_timestamp}.
  2. INCR the key. If the result is 1 (first request in this window), set a TTL equal to the window size.
  3. If the result exceeds the limit, reject with HTTP 429.
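The three steps above can be sketched against a redis-py-style client (i.e., one exposing `incr(key)` and `expire(key, seconds)`). To keep the sketch self-contained and runnable without a server, a minimal in-memory stand-in is included; in production you would pass a real Redis client instead:

```python
import time

def check_rate_limit(client, user_id, limit, window_seconds, now=None):
    """Fixed-window limiter following the INCR + EXPIRE pattern.
    Returns True if the request is allowed."""
    now = time.time() if now is None else now
    window = int(now // window_seconds)
    key = f"ratelimit:{user_id}:{window}"
    count = client.incr(key)  # atomic increment in real Redis
    if count == 1:
        # First request in this window: set a TTL so the key expires.
        # Note: INCR and EXPIRE are two separate commands here, which is
        # exactly the race condition discussed in the next section.
        client.expire(key, window_seconds)
    return count <= limit

class FakeRedis:
    """Minimal in-memory stand-in so the sketch runs without a server.
    (TTL behavior is omitted; keys never expire in this stub.)"""
    def __init__(self):
        self.store = {}
    def incr(self, key):
        self.store[key] = self.store.get(key, 0) + 1
        return self.store[key]
    def expire(self, key, seconds):
        pass
```

The key schema (`ratelimit:{user_id}:{window}`) means a new key per user per window, and the TTL keeps old windows from accumulating in Redis.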

Race Conditions

The INCR-then-check pattern looks simple, but the TTL step hides a race condition: INCR and EXPIRE are two separate commands, so other requests (or a failure) can interleave between them. In the worst case, the process crashes after the INCR but before the EXPIRE. The key then persists forever with no expiry, the counter never resets, and once it passes the limit the client is permanently blocked.

The fix is to use a Lua script executed via Redis EVAL. Lua scripts in Redis are atomic: the entire script runs without interruption. This guarantees that the increment, the limit check, and the TTL set all happen as a single operation.

Atomic operations in Redis: Any time your rate limiting logic requires multiple Redis commands that depend on each other (read, decide, write), wrap them in a Lua script. This eliminates race conditions without requiring distributed locks, which would add latency and complexity.

What Happens at Rejection

When a request exceeds the limit, the standard response is HTTP 429 (Too Many Requests). Good rate limiters also include response headers that help clients self-regulate:

  • X-RateLimit-Limit — the total number of requests allowed in the window.
  • X-RateLimit-Remaining — how many requests the client has left.
  • X-RateLimit-Reset — when the current window resets (typically a Unix timestamp).
  • Retry-After — on a 429, how many seconds the client should wait before retrying.

These headers turn the rate limiter from a blunt wall into a communication channel. Well-behaved clients use them to back off gracefully rather than hammering the endpoint and collecting 429s.
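Building the rejection response is mechanical once the limiter exposes its state. A sketch using the common `X-RateLimit-*` naming convention (exact header names vary between APIs; the function name is ours):

```python
import math

def rejection_headers(limit, remaining, reset_epoch, now):
    """Rate-limit headers for a 429 response.

    reset_epoch: Unix timestamp when the current window resets.
    now: current Unix timestamp.
    """
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
        # Seconds until the client may retry, rounded up, never negative.
        "Retry-After": str(max(0, math.ceil(reset_epoch - now))),
    }
```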

Where to Place the Rate Limiter

Rate limiting can happen at multiple layers: the API gateway, a reverse proxy (Nginx, Envoy), a middleware layer in your application, or a dedicated service. The closer to the edge, the earlier you reject bad traffic and the less load reaches your application servers. Most architectures place the primary rate limiter at the API gateway or reverse proxy, with a secondary application-level limiter for business-logic-specific rules (e.g., "max 5 password attempts per hour").

Assignment

Design a rate limiter for a REST API with the following requirements:

  • Limit: 100 requests per minute per user.
  • The API runs on 8 servers behind a load balancer.
  • Users authenticate via API key in the request header.

Answer the following:

  1. Which algorithm do you choose, and why? Consider whether you need to allow bursts or enforce a smooth rate.
  2. What happens when a user sends request #101? Specify the HTTP status code, response headers, and the behavior the client should implement.
  3. What data structure and storage system do you use? Describe the key schema and the operations performed on each request.
  4. Where in the request lifecycle do you place the rate limiter? At the gateway, in a middleware, or as a separate service? Justify.
  5. What failure mode do you choose if Redis becomes unreachable: fail open (allow all) or fail closed (reject all)? What are the risks of each?