hibranwar

Module 2: Scalability, Load Balancing & API Design

Systems Thinking × System Design · 10 sessions · ~50 min read


The Scaling Problem

Your system is under load. Response times are climbing. Users are complaining. You need more capacity. The question is not whether to scale, but how.

There are two fundamental directions: make the existing machine bigger (vertical scaling), or add more machines (horizontal scaling). Each direction carries different costs, constraints, and architectural consequences. Most production systems end up using both.

Vertical Scaling (Scale Up)

Vertical scaling means increasing the capacity of a single machine by adding more CPU, RAM, storage, or faster disks. The application code does not change. You simply give it a more powerful host.

This is the simplest form of scaling. If your database server is running at 80% CPU, you migrate it to an instance with more cores. If your application is running out of memory, you add RAM. The application itself does not know or care that the underlying hardware changed.

Vertical scaling is appealing because it requires zero architectural changes. A single-threaded application that cannot run across multiple machines will still benefit from a faster CPU. A database that stores everything on one disk will still benefit from more RAM for caching.

But vertical scaling has a hard ceiling. There is a largest machine you can buy. As of 2026, the largest AWS EC2 instance (u-24tb1.metal) offers 448 vCPUs and 24 TB of RAM. That is the wall. If your workload outgrows that machine, vertical scaling cannot help you. And long before you hit that wall, the cost curve becomes punishing. Doubling the CPU count of a cloud instance rarely doubles the price. It often triples or quadruples it.

There is also the downtime problem. Resizing a machine typically requires stopping it, changing the instance type, and restarting. For a database, this can mean minutes of unavailability. For a stateless web server behind a load balancer, it is less painful but still disruptive.

Horizontal Scaling (Scale Out)

Horizontal scaling means adding more machines to share the workload. Instead of one powerful server, you run many smaller servers behind a load balancer. Each server handles a portion of the traffic.

Horizontal scaling has no hard ceiling. Need more capacity? Add another server. Need even more? Add ten. Cloud platforms make this trivial with auto-scaling groups that add or remove instances based on metrics like CPU utilization or request count.

The catch is that horizontal scaling demands distributed architecture. Your application must be designed to run as multiple independent instances. This means:

  • No local state. If a user's session data is stored in memory on Server A, and their next request goes to Server B, the session is lost. State must live in a shared store (Redis, a database), or users must be pinned to a single server with sticky sessions.
  • Shared-nothing design. Each instance should be self-sufficient. No writing to local disk that other instances need to read.
  • Coordination overhead. Distributed locks, cache invalidation, and data consistency become real problems once you have multiple writers.
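The "no local state" rule above can be made concrete with a minimal sketch. The `SessionStore` class stands in for a shared store such as Redis (in production you would use a Redis client, not a dict); all class and method names here are illustrative, not from any particular framework.

```python
class SessionStore:
    """Shared session store reachable by every server instance.
    Stands in for Redis or a database in this sketch."""
    def __init__(self):
        self._data = {}

    def set(self, session_id, value):
        self._data[session_id] = value

    def get(self, session_id):
        return self._data.get(session_id)


class Server:
    """A stateless app server: all session reads and writes go to
    the shared store, never to instance-local memory."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_login(self, session_id, user):
        self.store.set(session_id, {"user": user})

    def handle_request(self, session_id):
        session = self.store.get(session_id)
        if session is None:
            return f"{self.name}: no session"
        return f"{self.name}: hello {session['user']}"


store = SessionStore()
a, b = Server("A", store), Server("B", store)
a.handle_login("s1", "alice")   # login request lands on Server A
print(b.handle_request("s1"))   # next request hits Server B -> session still found
```

If each `Server` kept its own dict instead of sharing `store`, the second request would find no session, which is exactly the failure mode described in the first bullet.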

Horizontal scaling also introduces operational complexity. You now need load balancers, health checks, deployment strategies for rolling updates, and monitoring across multiple instances. The system has more moving parts.

Diagonal Scaling (The Hybrid)

In practice, most production systems use both strategies. This is sometimes called diagonal scaling.

A common pattern: the application tier scales horizontally behind a load balancer, while the primary database scales vertically on the largest feasible instance. The database is hard to distribute (joins, transactions, consistency), so you give it the biggest machine you can afford. The application servers are stateless and easy to replicate, so you add more of them.

As the database eventually outgrows vertical limits, you introduce read replicas (horizontal reads) while keeping a single primary for writes. This is diagonal scaling in action: vertical where distribution is hard, horizontal where it is natural.

```mermaid
graph TB
  subgraph Vertical Scaling
    V1[Server] -->|"Add CPU, RAM"| V2["Bigger Server"]
    V2 -->|"Add more"| V3["Biggest Server (ceiling)"]
  end
  subgraph Horizontal Scaling
    LB[Load Balancer] --> H1[Server 1]
    LB --> H2[Server 2]
    LB --> H3[Server 3]
    LB --> H4["Server N..."]
  end
```

Comparison Table

| Dimension | Vertical Scaling | Horizontal Scaling |
| --- | --- | --- |
| Mechanism | Bigger machine (more CPU, RAM, disk) | More machines behind a load balancer |
| Capacity ceiling | Hard limit (largest available instance) | No theoretical limit |
| Cost curve | Superlinear (2x capacity costs 3-4x) | Roughly linear (2x capacity costs ~2x) |
| Code changes | None required | Must handle distributed state, sessions, coordination |
| Downtime during scaling | Usually required (instance resize) | Zero downtime (add/remove instances) |
| Fault tolerance | Single point of failure | Survives individual instance failures |
| State management | Simple (everything on one machine) | Complex (shared stores, distributed locks) |
| Operational complexity | Low (one machine to manage) | High (load balancers, health checks, deployments) |
| Best for | Databases, legacy apps, quick fixes | Stateless services, web servers, microservices |

Systems Thinking Lens

Scaling decisions create feedback loops. Vertical scaling is a balancing loop with a fixed limit: you add resources, performance improves, but the ceiling does not move. Eventually the gap between demand and capacity closes again, and you have nowhere to go.

Horizontal scaling is a reinforcing loop in terms of capacity (more machines, more throughput) but also a reinforcing loop in terms of complexity (more machines, more coordination problems, more debugging surface). The systems thinker asks: which loop dominates at our current scale? If you have 3 servers, the coordination overhead is minimal. At 300 servers, it is a major engineering concern.

The leverage point is often not "which direction to scale" but "what to make stateless." Every component you can make stateless becomes trivially horizontally scalable. The real work is in the architecture, not the infrastructure.

Further Reading

  • AWS Well-Architected Framework, Horizontal Scaling Concept. AWS's definition and recommendations for horizontal scaling.
  • AWS Database Blog, Scaling Your Amazon RDS Instance Vertically and Horizontally. Practical walkthrough of both strategies applied to managed databases.
  • Martin Kleppmann, Designing Data-Intensive Applications (O'Reilly, 2017), Chapter 1. Covers scalability concepts including vertical and horizontal approaches with real-world tradeoffs.
  • ProsperOps, Horizontal Scaling vs. Vertical Scaling: A Side-by-Side Comparison. Clear visual comparison with cloud cost implications.

Assignment

Your company runs a PostgreSQL database on a single server. It is at 80% CPU utilization during peak hours and growing 10% per month. You have two options:

  1. Vertical: Upgrade to an instance with 2x the CPU cores. Cost is 3x your current monthly bill. No code changes needed. Can be done this weekend with 15 minutes of downtime.
  2. Horizontal: Add read replicas and split read traffic from write traffic. Cost is 2x your current monthly bill. Requires code changes to route read queries to replicas. Estimated 3 weeks of development and testing.

You have a 6-month runway before the database hits critical load. Which option do you choose, and why? Consider: what happens at month 7? What does each option buy you in terms of future scaling? Is there a diagonal approach that combines both?

What Load Balancers Do

A load balancer sits between clients and a pool of servers. It receives incoming requests and forwards each one to a server that can handle it. The goals are straightforward: distribute traffic evenly, detect and route around failed servers, and allow the backend to scale without clients needing to know about it.

But not all load balancers work the same way. The two major categories, defined by where they operate in the network stack, are Layer 4 (transport) and Layer 7 (application). In AWS terminology, these map to the Network Load Balancer (NLB) and the Application Load Balancer (ALB).

The OSI Model Context

To understand the difference, you need to know two layers of the OSI model:

  • Layer 4 (Transport): Deals with TCP and UDP. At this layer, the load balancer sees source IP, destination IP, source port, and destination port. It does not inspect the content of the request.
  • Layer 7 (Application): Deals with HTTP, HTTPS, WebSocket, and other application protocols. At this layer, the load balancer can read HTTP headers, URL paths, cookies, and request bodies.

The layer at which a load balancer operates determines what it can see and what routing decisions it can make.

```mermaid
graph TB
  Client[Client] --> L7["Layer 7 (ALB): reads HTTP headers, paths, cookies"]
  Client --> L4["Layer 4 (NLB): reads TCP/UDP, IP + port only"]
  L7 -->|"/api/*"| Backend1[API Servers]
  L7 -->|"/static/*"| Backend2[Static Servers]
  L4 -->|"TCP forward"| Backend3[Server Pool]
```

Application Load Balancer (ALB)

ALB operates at Layer 7. It understands HTTP and HTTPS, can inspect request content, and makes routing decisions based on URL paths, hostnames, HTTP headers, query strings, and source IP.

ALB is the default choice for web applications. It supports:

  • Path-based routing: Send /api/* to one target group and /images/* to another.
  • Host-based routing: Route api.example.com to backend services and www.example.com to frontend servers.
  • SSL/TLS termination: The ALB decrypts HTTPS traffic, inspects the request, and forwards plain HTTP to backend servers. This offloads CPU-intensive TLS processing from your application servers.
  • WebSocket support: ALB natively handles WebSocket upgrades.
  • Authentication integration: ALB can authenticate users via OpenID Connect providers (Cognito, Okta, etc.) before requests reach your backend.
  • Sticky sessions: Route a user's requests to the same target using cookies.

The tradeoff is latency. Because the ALB reads and parses the full HTTP request before making a routing decision, it adds processing time. For most web applications, this overhead is negligible (single-digit milliseconds). For ultra-low-latency workloads, it matters.

Network Load Balancer (NLB)

NLB operates at Layer 4. It routes TCP and UDP traffic based on IP addresses and port numbers without inspecting the application-layer content. It is designed for extreme throughput and ultra-low latency.

NLB is built for raw performance. It supports:

  • Millions of requests per second with latencies in the microsecond range.
  • Static IP addresses: Each NLB gets one static IP per Availability Zone. This is critical for clients that need to whitelist specific IPs (firewalls, DNS records, partner integrations).
  • Source IP preservation: The original client IP is visible to backend servers without needing X-Forwarded-For headers.
  • TLS passthrough: NLB can forward encrypted traffic directly to backend servers without decrypting it. Useful when end-to-end encryption is required.
  • AWS PrivateLink support: NLB is the only load balancer type that works with VPC endpoint services for private connectivity.
  • Non-HTTP protocols: Any TCP or UDP workload. Game servers, IoT brokers, custom binary protocols.

The tradeoff is intelligence. NLB cannot route based on URL paths, HTTP headers, or cookies. It sees packets, not requests. If you need content-aware routing, NLB cannot do it alone.

Comparison Table

| Dimension | ALB (Layer 7) | NLB (Layer 4) |
| --- | --- | --- |
| OSI Layer | Layer 7 (Application) | Layer 4 (Transport) |
| Protocols | HTTP, HTTPS, WebSocket, gRPC | TCP, UDP, TLS |
| Routing intelligence | Path, host, header, query string, source IP | IP address and port only |
| SSL/TLS handling | Terminates and re-encrypts (offloading) | Passthrough or terminate |
| Static IP | No (DNS-based, IP can change) | Yes (one per AZ) |
| Latency | Low (single-digit ms) | Ultra-low (sub-ms) |
| Throughput | High | Extreme (millions of requests/sec) |
| Source IP preservation | Via X-Forwarded-For header | Native (client IP visible directly) |
| Sticky sessions | Cookie-based | Source IP-based |
| PrivateLink | Not supported | Supported |
| Cost model | Per hour + LCU (capacity units) | Per hour + NLCU (capacity units) |
| Best for | Web apps, REST APIs, microservices | IoT, gaming, real-time, non-HTTP protocols |

When to Use Which

The decision usually comes down to what the load balancer needs to understand about the traffic.

If your load balancer needs to inspect HTTP content to make routing decisions (path-based routing, host-based routing, header inspection), use ALB. If it just needs to forward packets as fast as possible, use NLB.

A common production pattern combines both: an NLB as the external entry point (for static IPs and PrivateLink) that forwards to an ALB for content-based routing. This gives you the best of both layers, at the cost of additional hops and complexity.

Systems Thinking Lens

The choice of load balancer is not a local decision. It propagates through the system. Choosing ALB means your backend servers do not need to handle TLS termination, saving CPU. But it also means you depend on ALB's connection limits and request processing latency. Choosing NLB means your servers see the real client IP natively, but your application code must handle routing logic that ALB would have done for you.

Every capability you move into the load balancer is a capability your application no longer has to implement, but it is also processing the load balancer must now do on every request, and a dependency on its behavior. This is a tradeoff, not a best practice.

Further Reading

  • AWS, Application, Network, and Gateway Load Balancing. Official comparison of all AWS load balancer types.
  • AWS, Elastic Load Balancing Features. Feature matrix for ALB, NLB, and GLB.
  • Cloudcraft, ALB vs NLB: Which AWS Load Balancer Fits Your Needs?. Practical decision guide with architecture diagrams.
  • Wikipedia, OSI Model. Reference for the network layer model that defines L4 and L7.

Assignment

For each scenario below, choose the appropriate load balancer type (ALB or NLB) and explain your reasoning:

  1. A REST API where /api/v1/* routes to the backend service cluster and /static/* routes to a CDN origin server. The team wants SSL offloading so backend servers do not handle TLS.
  2. An IoT platform that accepts 1 million persistent TCP connections from embedded sensors. Each sensor sends a 64-byte payload every 30 seconds. Partners need to whitelist specific IP addresses in their firewalls.
  3. A microservices application with 12 services. The team wants the load balancer to handle SSL termination for all HTTPS traffic and authenticate users via an OIDC provider before requests reach any backend service.

For each answer, identify what would go wrong if you chose the other load balancer type instead.

The Algorithm Decides

A load balancer receives a request. It has five healthy servers behind it. Which one gets the request?

That decision is made by a load balancing algorithm. The algorithm you choose determines how traffic is distributed, how well your system handles heterogeneous servers, and how it behaves under uneven load. The right algorithm depends on your traffic pattern, your server fleet, and what assumptions you can safely make.

Round-Robin

Round-robin distributes requests sequentially across all servers in a fixed rotation. Server 1, then Server 2, then Server 3, then back to Server 1. Every server gets the same number of requests.

Round-robin is the simplest algorithm and the default in most load balancers, including NGINX. It works well when all servers have identical hardware specs and all requests take roughly the same time to process.

It fails when either of those assumptions breaks. If one server is slower than the others, it still receives the same number of requests, creating a bottleneck. If some requests are expensive (a complex database query) and others are cheap (serving a cached response), round-robin ignores this completely. It distributes requests evenly, not load evenly.
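As a minimal sketch, round-robin is just a fixed rotation over the server list; Python's `itertools.cycle` expresses it directly (server names here are illustrative):

```python
from itertools import cycle

# Round-robin: requests walk the server list in a fixed rotation.
servers = ["server-1", "server-2", "server-3"]
rotation = cycle(servers)

assignments = [next(rotation) for _ in range(7)]
print(assignments)
# ['server-1', 'server-2', 'server-3', 'server-1', 'server-2', 'server-3', 'server-1']
```

Note that the rotation is oblivious to how long each request takes: a server stuck on a slow query still receives its next turn on schedule.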

Weighted Round-Robin

Weighted round-robin assigns each server a weight proportional to its capacity. A server with weight 3 receives three requests for every one request sent to a server with weight 1.

This solves the heterogeneous server problem. If Server A has 8 CPU cores and Server B has 4, you give Server A a weight of 2 and Server B a weight of 1. The load balancer sends twice as many requests to Server A.

The weights are static. You configure them once and they do not change unless you update the configuration. This means weighted round-robin cannot adapt to runtime conditions like one server running a memory-intensive background job.

```mermaid
graph LR
  LB[Load Balancer] -->|"Requests 1, 2, 3"| A["Server A (weight: 3)"]
  LB -->|"Request 4"| B["Server B (weight: 1)"]
  LB -->|"Requests 5, 6"| C["Server C (weight: 2)"]
  LB -->|"Requests 7, 8, 9"| A
  LB -->|"Request 10"| B
```

In this example, for every 6 requests, Server A gets 3, Server C gets 2, and Server B gets 1. The distribution matches their configured capacity.
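One way to implement this is the "smooth" weighted round-robin used by NGINX, which interleaves picks instead of sending a burst to the heaviest server. The sketch below is a simplified version of that algorithm; the server names and weights match the diagram above.

```python
def smooth_wrr(servers, n):
    """NGINX-style smooth weighted round-robin.

    servers: dict mapping server name -> static weight.
    Returns a list of n picks. Each round, every server's current
    weight grows by its static weight; the highest current weight
    wins and is penalized by the total, which spreads picks out.
    """
    current = {name: 0 for name in servers}
    total = sum(servers.values())
    picks = []
    for _ in range(n):
        for name, weight in servers.items():
            current[name] += weight
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


picks = smooth_wrr({"A": 3, "B": 1, "C": 2}, 6)
print(picks)  # ['A', 'C', 'A', 'B', 'C', 'A'] -- A gets 3, C gets 2, B gets 1
```

Over every 6 requests the counts match the configured weights exactly, but no server ever receives two consecutive bursts.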

Least Connections

Least connections sends each new request to the server with the fewest active connections at that moment. It is a dynamic algorithm that adapts to real-time load.

This is the right choice when requests have highly variable processing times. A server handling a long-running request will have that connection counted against it, so new requests will be routed elsewhere. Servers that finish requests quickly will naturally receive more.

Least connections is particularly effective for WebSocket applications, database connection pools, and any workload where some requests hold connections open much longer than others.

The weakness is that it still assumes all servers have equal capacity. A small server with 2 active connections and a large server with 2 active connections are treated identically, even though the large server could handle 10 more. This is solved by weighted least connections, which divides the connection count by the server's weight before comparing.
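The weighted variant is a one-line change to the selection rule: compare connections scaled by capacity rather than raw counts. A minimal sketch (the `Backend` class and its fields are illustrative, not from any real load balancer):

```python
class Backend:
    """Illustrative backend record: a name, a capacity weight,
    and a live count of open connections."""
    def __init__(self, name, weight=1):
        self.name = name
        self.weight = weight
        self.active = 0


def pick_least_connections(backends):
    """Weighted least connections: route to the backend with the
    fewest active connections *per unit of capacity*."""
    return min(backends, key=lambda b: b.active / b.weight)


small = Backend("small", weight=1)
big = Backend("big", weight=2)
small.active, big.active = 2, 2   # same raw connection count...
chosen = pick_least_connections([small, big])
print(chosen.name)  # "big" -- same count, but twice the capacity
```

With plain (unweighted) least connections the two servers would be tied; the weight breaks the tie in favor of the machine with headroom.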

IP Hash

IP hash computes a hash of the client's IP address and uses the result to deterministically select a server. The same client IP always maps to the same server.

IP hash provides session affinity without cookies or shared session stores. If a user's session data is stored in memory on Server B, IP hash ensures that user's requests always go to Server B.

The tradeoff is uneven distribution. IP hash does not guarantee equal traffic across servers. A single corporate office behind a NAT gateway might have thousands of users sharing one IP address, all routed to the same server. If a server goes down, all clients mapped to it are remapped, which invalidates their sessions.

IP hash is also disrupted by adding or removing servers. When the number of servers changes, the hash function redistributes most clients to different servers. Consistent hashing (covered in Session 3.6) solves this by minimizing remapping when the server pool changes.
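Both properties — determinism and mass remapping — are easy to see in a sketch. The hash-and-modulo scheme below is illustrative (real balancers vary in which hash they use); `hashlib` rather than Python's builtin `hash()` keeps the mapping stable across process restarts.

```python
import hashlib

def ip_hash(client_ip, servers):
    """Deterministically map a client IP to a server.
    Stable hash (md5) so the mapping survives restarts."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]


servers = ["s1", "s2", "s3"]

# Determinism: the same IP always lands on the same server.
print(ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers))  # True

# Remapping: removing one server changes len(servers), so the modulo
# shifts and most clients land somewhere new.
moved = sum(
    ip_hash(f"10.0.0.{i}", servers) != ip_hash(f"10.0.0.{i}", servers[:-1])
    for i in range(100)
)
print(moved)  # typically around two thirds of the 100 clients move
```

That remapping cost is exactly what consistent hashing is designed to avoid.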

Random

Random selection picks a server at random for each request. With a large number of requests, the distribution approaches uniform.

Random selection is rarely the primary algorithm, but it appears in two important contexts. First, as a tiebreaker when other algorithms produce equal candidates. Second, in the power of two random choices algorithm: pick two servers at random, then send the request to whichever has fewer connections. This achieves near-optimal load distribution with minimal coordination, and it is used in systems like Envoy proxy.
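A sketch of the power of two random choices, under simplifying assumptions (connections accumulate and never close, and each backend is a plain dict):

```python
import random

def power_of_two_choices(backends, rng=random):
    """Sample two distinct backends at random; route to the one
    with fewer active connections."""
    a, b = rng.sample(backends, 2)
    return a if a["active"] <= b["active"] else b


backends = [{"name": f"s{i}", "active": 0} for i in range(10)]
rng = random.Random(42)  # seeded so the demo is repeatable

for _ in range(1000):
    chosen = power_of_two_choices(backends, rng)
    chosen["active"] += 1

counts = sorted(b["active"] for b in backends)
print(counts)  # clustered tightly around 100 per server
```

Purely random assignment of 1000 requests to 10 servers would show noticeable imbalance; adding the single "compare two" step collapses that spread almost entirely, which is why the technique is so popular.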

Comparison Table

| Algorithm | How It Works | Best For | Fails When |
| --- | --- | --- | --- |
| Round-robin | Sequential rotation through server list | Identical servers, uniform request cost | Servers differ in capacity or requests differ in cost |
| Weighted round-robin | Round-robin proportional to configured weights | Heterogeneous server fleet with known capacity ratios | Runtime conditions change (weights are static) |
| Least connections | Route to server with fewest active connections | Variable request durations, long-lived connections | Servers have unequal capacity (use weighted variant) |
| IP hash | Hash client IP to deterministically select server | Session affinity without shared session store | Uneven IP distribution (NAT), server pool changes |
| Random | Pick a server at random | Tiebreaking, "power of two choices" variant | Small request volumes (randomness needs large N to converge) |

Choosing an Algorithm

Start with these questions:

  1. Are all servers identical? If yes, round-robin or least connections. If no, use a weighted variant.
  2. Do requests vary in processing time? If yes, least connections outperforms round-robin. If no, round-robin is simpler and sufficient.
  3. Do you need session affinity? If yes, IP hash or sticky sessions at the load balancer level. If no, connection-based algorithms are more efficient.
  4. Is the server pool stable? If servers are frequently added and removed (autoscaling), IP hash will cause frequent remapping. Least connections adapts naturally.

Systems Thinking Lens

The load balancing algorithm is a control mechanism in a feedback loop. The system measures some signal (connection count, server order, client IP) and uses it to make a routing decision. The quality of that signal determines the quality of the decision.

Round-robin uses no signal at all. It is open-loop control: distribute evenly and hope for the best. Least connections closes the loop by measuring actual server state. IP hash closes a different loop: ensuring client-server affinity.

When you see a system with uneven load despite a load balancer, the first question is: what signal is the algorithm using, and does that signal actually correlate with server load? If the algorithm is round-robin and one server is overloaded, the algorithm literally cannot see the problem. Switching to least connections adds the feedback the system needs to self-correct.

Further Reading

  • NGINX, Using nginx as HTTP Load Balancer. Official documentation covering round-robin, least-connected, and IP hash configuration.
  • NGINX, HTTP Load Balancing. Extended guide covering weighted methods, health checks, and session persistence.
  • AWS, What Is Load Balancing?. Overview of load balancing concepts and algorithm types.
  • Sam Rose, Load Balancing. Interactive visual explanation of load balancing algorithms with animated simulations.
  • Wikipedia, Load Balancing (Computing). Comprehensive reference covering all major algorithms and their formal properties.

Assignment

You have three servers behind a load balancer:

  • Server A: 8 cores, 32 GB RAM
  • Server B: 8 cores, 32 GB RAM
  • Server C: 16 cores, 64 GB RAM (twice the capacity of A or B)

Answer the following:

  1. Which load balancing algorithm would you choose for this fleet? What weights would you assign?
  2. If you used plain round-robin instead, what would happen to Server A and Server B during peak traffic? What would happen to Server C?
  3. Now suppose requests vary wildly in cost: some take 10ms and others take 5 seconds. Does your answer to question 1 change? If so, what algorithm would you use instead, and why?

The Front Door Problem

As your system grows from one service to many, clients face a problem. Which service handles authentication? Which one enforces rate limits? Where does SSL terminate? If you have 12 microservices, do clients need to know about all 12 endpoints?

The answer is no. You place a single component at the front door that handles cross-cutting concerns and routes requests to the right backend. This component is the API Gateway.

What an API Gateway Does

An API Gateway is a server that acts as the single entry point for all client requests. It receives API calls, applies policies (authentication, rate limiting, transformation), routes them to the appropriate backend service, and returns the response.

The API Gateway pattern was formalized by Chris Richardson on microservices.io and has become a standard component in microservices architectures. Major implementations include AWS API Gateway, Kong, Apigee, and Azure API Management.

An API Gateway typically handles five responsibilities:

1. Request Routing

The gateway maps incoming requests to backend services. A request to /users/123 goes to the User Service. A request to /orders/456 goes to the Order Service. The client sends everything to one hostname, and the gateway figures out where it goes.

This decouples clients from the internal service topology. You can split, merge, or relocate backend services without changing any client code. The gateway absorbs the complexity.
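At its core the routing table is a mapping from path prefixes to services, with the longest matching prefix winning. A minimal sketch (the paths and service names are illustrative):

```python
# Illustrative route table, as a gateway might hold it.
ROUTES = {
    "/users":  "user-service",
    "/orders": "order-service",
    "/static": "static-service",
}

def route(path):
    """Return the backend service for a request path, or None if
    no route matches (the gateway would answer 404)."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return None
    return ROUTES[max(matches, key=len)]  # longest prefix wins


print(route("/users/123"))   # user-service
print(route("/orders/456"))  # order-service
print(route("/health"))      # None
```

Splitting or relocating a backend service is then a one-line change to `ROUTES`, invisible to every client.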

2. Authentication and Authorization

Instead of every service implementing its own authentication logic, the gateway verifies identity once at the edge. It validates JWT tokens, checks API keys, or integrates with identity providers (OAuth 2.0, OpenID Connect). The backend services receive pre-authenticated requests with the user identity attached as a header.

This centralizes security logic. One place to patch, one place to audit, one place to update when the authentication scheme changes.

3. Rate Limiting

The gateway enforces rate limits before requests reach backend services. A free-tier client gets 100 requests per minute. An enterprise client gets 10,000. Requests that exceed the limit receive a 429 Too Many Requests response without consuming any backend resources.

Rate limiting at the gateway is more efficient than rate limiting at individual services because it protects all services from abuse through a single checkpoint.
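A common way to implement per-client limits is a token bucket: tokens refill at a steady rate up to a cap, and each request spends one. The sketch below is one simple in-process version; a real gateway keeps one bucket per client key in a shared store so all gateway instances agree.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refill at `rate` tokens/second
    up to `capacity`; each allowed request spends one token."""
    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.now = now
        self.last = now()

    def allow(self):
        t = self.now()
        # Refill based on elapsed time, clamped to the bucket cap.
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller answers 429 Too Many Requests


# 100 requests/minute for a free-tier client:
bucket = TokenBucket(rate=100 / 60, capacity=100)
results = [bucket.allow() for _ in range(105)]
print(results.count(True))  # ~100 allowed, the burst beyond that rejected
```

The `capacity` parameter doubles as the allowed burst size: a client can spend 100 requests instantly, then is throttled to the steady refill rate.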

4. Request and Response Transformation

The gateway can modify requests and responses in transit. Common transformations include:

  • Adding headers (correlation IDs, authentication context)
  • Rewriting URL paths (versioning: /v2/users maps to the same service as /v1/users but with a different backend path)
  • Aggregating responses from multiple services into a single response
  • Protocol translation (accepting REST from clients but calling gRPC services internally)

5. Monitoring and Logging

Because every request passes through the gateway, it is the natural place to collect metrics. Request counts, latency distributions, error rates, and traffic patterns per service, per client, per endpoint. This data feeds into dashboards, alerts, and capacity planning.

API Gateway vs. Load Balancer

API gateways and load balancers are complementary, not competing. They solve different problems and typically coexist in the same architecture.

| Responsibility | API Gateway | Load Balancer |
| --- | --- | --- |
| Primary purpose | API management and policy enforcement | Traffic distribution across server instances |
| OSI layer | Layer 7 (application) | Layer 4 or Layer 7 |
| Routing logic | API-aware: paths, versions, client identity | Server-aware: health, capacity, connection count |
| Authentication | Yes (JWT, API keys, OAuth) | No (or limited to ALB OIDC) |
| Rate limiting | Yes (per client, per endpoint) | No |
| Request transformation | Yes (headers, body, protocol) | No |
| Health checks | Usually delegates to LB | Yes (active and passive) |
| SSL termination | Yes | Yes (ALB and NLB) |
| Scaling backend | Not its job | Primary job |

The short version: the API Gateway decides what to do with the request. The load balancer decides which server handles it.

Request Flow

In a typical production setup, the request passes through multiple components in sequence. Each one applies a specific transformation or routing decision.

```mermaid
graph LR
  C[Client] --> GW["API Gateway: auth, rate limit, transform, route"]
  GW --> LB["Load Balancer: distribute to healthy instance"]
  LB --> S1["Service Instance 1"]
  LB --> S2["Service Instance 2"]
  LB --> S3["Service Instance 3"]
```

The client sends a request to the API Gateway. The gateway validates the API key, checks the rate limit, determines which backend service should handle the request, and forwards it. The request then hits a load balancer (one per service or a shared one with path-based routing), which selects a healthy instance of that service. The instance processes the request and sends the response back through the same chain.

In some architectures, the API Gateway itself sits behind a load balancer (or an NLB for static IPs), and a CDN sits in front of everything for cached content. The full chain can look like this:

```mermaid
graph LR
  C[Client] --> CDN[CDN]
  CDN --> NLB["NLB (static IP)"]
  NLB --> GW["API Gateway"]
  GW --> ALB["ALB (path routing)"]
  ALB --> SVC["Service Instances"]
```

Each hop adds latency. Each component adds operational cost. The systems thinker asks: does each component in this chain justify its existence? If your API Gateway already does path-based routing, do you also need an ALB? If your CDN already terminates SSL, does the gateway need to do it again?

Common API Gateway Products

| Product | Type | Notable Strength |
| --- | --- | --- |
| AWS API Gateway | Managed (serverless) | Native Lambda integration, usage plans, no infrastructure to manage |
| Kong | Open-source / Enterprise | Plugin ecosystem, runs on NGINX, highly extensible |
| Apigee | Managed (Google Cloud) | API analytics, developer portal, monetization |
| Azure API Management | Managed (Azure) | Policy engine, multi-region, developer portal |
| Envoy + custom control plane | Self-managed | Service mesh integration, fine-grained traffic control |

The Gateway Bloat Problem

A warning from Microsoft's microservices architecture guide: as teams add features to the API Gateway, it tends to grow into a monolith of its own. Every team wants their custom routing rule, their special header, their exception to the rate limit.

The solution is the Backends for Frontends (BFF) pattern: instead of one gateway for all clients, you deploy separate gateways for the mobile app, the web app, and the admin dashboard. Each gateway is tailored to its client's needs and maintained by the team that owns that client. This prevents the single gateway from becoming a coordination bottleneck across teams.

Systems Thinking Lens

The API Gateway is a leverage point in the system. Because every request passes through it, a small change at the gateway has a large effect across all services. Adding authentication at the gateway secures every service at once. A misconfigured rate limit at the gateway blocks every client at once.

This is high leverage, which also means high risk. The gateway is a single point of failure for the entire API surface. If it goes down, everything goes down. This is why production gateways are themselves deployed behind load balancers, across multiple availability zones, with health checks and automatic failover.

The feedback loop is clear: more services lead to more routing rules in the gateway, which leads to more complexity, which leads to more risk of misconfiguration, which leads to more outages. The balancing force is decomposition: splitting into multiple gateways or using infrastructure-as-code to keep configuration auditable and version-controlled.

Further Reading

  • Chris Richardson, Pattern: API Gateway / Backends for Frontends. The canonical pattern definition from microservices.io.
  • AWS, API Gateway Pattern. AWS Prescriptive Guidance on implementing the pattern in cloud architectures.
  • Microsoft, The API Gateway Pattern vs. Direct Client-to-Microservice Communication. Thorough comparison with architecture diagrams and tradeoff analysis.
  • Kong, API Gateway vs Load Balancer. Practical comparison of when to use each and how they complement each other.
  • Microsoft, API Gateways in Microservices. Azure Architecture Center guide covering gateway design considerations.

Assignment

Draw the complete request flow for the following scenario. Label every component and describe what each one does to the request as it passes through.

Scenario: A mobile app sends a POST /api/v2/orders request to create a new order. The system has the following components:

  • A CDN (for static assets, not relevant for this POST request)
  • An NLB providing a static IP entry point
  • An API Gateway handling authentication, rate limiting, and version routing
  • An ALB distributing traffic across Order Service instances
  • Three instances of the Order Service

For each component in the chain, answer:

  1. What does this component check or modify on the request?
  2. Under what condition would this component reject the request (return an error) instead of forwarding it?
  3. If this component were removed, what would break or degrade?

Why Autoscaling Exists

In Session 2.1, we established that horizontal scaling adds more machines to handle more load. But who decides when to add machines, and when to remove them? Manual scaling requires an engineer to watch dashboards and click buttons. That does not work at 3 AM on a Saturday.

Autoscaling automates this decision. It monitors metrics, compares them to thresholds, and adjusts capacity without human intervention. Done well, it keeps costs low during quiet hours and keeps users happy during spikes. Done poorly, it oscillates wildly, scales too late, or never scales down.

Autoscaling is a control system that adjusts compute capacity in response to observed or predicted demand. It is a feedback loop: measure load, compare to target, adjust capacity, re-measure.

Reactive Autoscaling

Reactive autoscaling responds to what has already happened. A metric crosses a threshold, and the system adds or removes instances. There are two common approaches.

Target Tracking

You specify a target value for a metric. The autoscaler continuously adjusts capacity to keep that metric near the target. If you set "average CPU utilization = 50%," the system adds instances when CPU rises above 50% and removes them when it drops below.

This works like a thermostat. You set the temperature you want. The system figures out when to turn the heater on and off. You do not specify the exact conditions for action. You specify the desired outcome.

AWS Auto Scaling supports target tracking on predefined metrics like ASGAverageCPUUtilization, ALBRequestCountPerTarget, and custom CloudWatch metrics.
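The thermostat analogy can be made concrete. The sketch below (illustrative; the function name and rounding are mine, not AWS's published algorithm) shows the proportional rule at the heart of target tracking: desired capacity scales with the ratio of the observed metric to the target.

```python
import math

def target_tracking_capacity(current_capacity: int, metric_value: float, target: float) -> int:
    """Proportional capacity rule behind target tracking (illustrative sketch).

    If average CPU is 75% against a 50% target, a 10-instance group needs
    roughly 10 * 75 / 50 = 15 instances to bring average CPU back to target.
    """
    if metric_value <= 0:
        return current_capacity  # no signal, no change
    return max(1, math.ceil(current_capacity * metric_value / target))

# 10 instances at 75% CPU against a 50% target -> scale out to 15
print(target_tracking_capacity(10, 75.0, 50.0))  # 15
# 10 instances at 30% CPU -> scale in to 6
print(target_tracking_capacity(10, 30.0, 50.0))  # 6
```

Note that the same rule scales in both directions; you specify the target, not the conditions for action.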

Step Scaling

Step scaling gives you more control. You define CloudWatch alarms with thresholds, and for each alarm, you specify how many instances to add or remove. The "steps" refer to graduated responses based on how far the metric has breached the threshold.

For example:

  • CPU 50-70%: add 1 instance
  • CPU 70-90%: add 3 instances
  • CPU above 90%: add 5 instances

Step scaling is more predictable than target tracking when you understand your traffic patterns well enough to define appropriate thresholds. But it requires more configuration and more tuning.
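The steps above can be written as a plain decision function. In a real deployment these thresholds live in CloudWatch alarms and step adjustments, not application code; this sketch just makes the graduated logic explicit.

```python
def step_scaling_adjustment(cpu: float) -> int:
    """Map CPU utilization to an instance delta, mirroring the example steps."""
    if cpu >= 90:
        return 5  # severe breach: add 5 instances
    if cpu >= 70:
        return 3  # moderate breach: add 3
    if cpu >= 50:
        return 1  # mild breach: add 1
    return 0      # below the alarm threshold: no scale-out

print(step_scaling_adjustment(65))  # 1
print(step_scaling_adjustment(95))  # 5
```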

Predictive Autoscaling

Reactive autoscaling has a fundamental problem: it responds after load arrives. If your application takes 3 minutes to boot a new instance, that is 3 minutes of degraded service every time a spike hits.

Predictive autoscaling analyzes historical traffic patterns to forecast future load. It pre-provisions capacity before demand arrives. AWS Predictive Scaling, for example, analyzes up to 14 days of historical CloudWatch data to identify recurring patterns. It generates a 48-hour capacity forecast, updated every 6 hours.

The key requirement: your traffic must have recurring, predictable patterns. A news site with unpredictable viral spikes will not benefit much. An enterprise SaaS product where usage peaks at 9 AM on weekdays and drops at 6 PM is a perfect candidate.

Reactive vs. Predictive Autoscaling

| Dimension | Reactive | Predictive |
|---|---|---|
| Trigger | Current metric breaches threshold | Forecasted metric will breach threshold |
| Response time | Minutes (metric detection + instance boot) | Proactive (capacity ready before spike) |
| Best for | Unpredictable, bursty traffic | Recurring, periodic traffic patterns |
| Configuration | Thresholds and step sizes | Minimal (ML model trains itself) |
| Risk | Slow response to sudden spikes | Wasted capacity if patterns shift |
| Cost efficiency | Pay only when needed (with lag) | May over-provision slightly |
| Can combine? | Yes: catches surges the forecast missed | Yes: provides the known baseline |

In practice, the best approach is to layer both. Predictive scaling handles the known daily pattern. Reactive scaling catches anything the forecast missed.

Warm-Up Delays: Feedback Loop Delays in Disguise

In Module 0, we learned that delays in feedback loops cause oscillation. Autoscaling is a textbook example.

When the autoscaler launches a new instance, that instance is not immediately useful. It needs to boot the OS, start the application, load configuration, populate caches, and establish database connections. Google's SRE team notes that some of their services are dramatically less efficient when caches are cold, because the majority of requests are normally served from cache.

During this warm-up period, the new instance either handles no traffic or handles it poorly. The autoscaler's metric (say, average CPU) may still look high because the new instance is not yet contributing. So it launches another instance. And another.

This is classic overshoot caused by delay in a feedback loop. The action (launching instances) has been taken, but the effect (reduced load) has not appeared yet. The system keeps acting.

graph LR
  A[Load increases] --> B[Metric breaches threshold]
  B --> C[Autoscaler launches instances]
  C --> D["Warm-up delay (boot, cache fill)"]
  D --> E[New instances absorb load]
  E --> F[Metric drops]
  F --> G[Cooldown prevents premature scale-in]
  G -->|Load changes again| A

AWS addresses this with an estimated instance warm-up time setting. During this period, the autoscaler does not count the new instance's metrics toward the aggregate. This prevents the overshoot.

Cooldown Periods

Cooldown is the mirror of warm-up. After a scaling action completes, the autoscaler pauses before evaluating metrics again. This prevents rapid oscillation: scale up, metric drops, scale down, metric rises, scale up again.

Without cooldown, you get flapping. The system adds and removes instances in rapid cycles, never reaching a stable state. Each cycle wastes time and money on instance launches that get terminated before they are useful.

The default cooldown in AWS is 300 seconds (5 minutes). Target tracking policies manage cooldown automatically, which is one reason they are often preferred over step scaling for most workloads.

Warm-up prevents overshoot by excluding new instances from metric calculations until they are ready. Cooldown prevents oscillation by pausing scaling decisions after each action. Both are mechanisms to handle delay in the autoscaling feedback loop.
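A toy simulation makes both mechanisms visible. This is an illustrative model, not any provider's actual control loop: instances younger than the warm-up period are excluded from the capacity metric, and a cooldown suppresses repeat launches while the first new instance warms up.

```python
class ToyAutoscaler:
    """Toy model of the autoscaling feedback loop (illustrative only).

    Warm-up: instances younger than warmup_ticks are excluded from the
    metric. Cooldown: after a launch, no new scaling decision is made
    for cooldown_ticks.
    """

    def __init__(self, warmup_ticks=3, cooldown_ticks=5, target_load=70):
        self.ages = [100]  # one long-running, fully warm instance
        self.warmup_ticks = warmup_ticks
        self.cooldown_ticks = cooldown_ticks
        self.target_load = target_load
        self.last_launch = None

    def tick(self, now, total_load):
        self.ages = [age + 1 for age in self.ages]
        warm = [age for age in self.ages if age >= self.warmup_ticks]
        avg_load = total_load / len(warm)  # cold instances excluded from the metric
        cooling = (self.last_launch is not None
                   and now - self.last_launch < self.cooldown_ticks)
        if avg_load > self.target_load and not cooling:
            self.ages.append(0)  # launch; it warms up over the next ticks
            self.last_launch = now

scaler = ToyAutoscaler()
for t in range(7):
    scaler.tick(t, total_load=130)

# One launch at t=0; cooldown suppresses further launches while the new
# instance warms up, and once it is warm the average load falls below target.
print(len(scaler.ages))  # 2
```

Remove the cooldown check and the loop launches a new instance every tick until the first one warms up: exactly the overshoot described above.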

Autoscaling is a Balancing Loop

Recall from Session 0.5 that balancing feedback loops seek equilibrium. Autoscaling is precisely this: a balancing loop that tries to keep a metric (CPU, request count, latency) near a target value. The structure is: gap between actual and desired state triggers corrective action, which closes the gap.

Understanding autoscaling as a feedback loop, rather than just a cloud feature, lets you reason about its behavior. If your system oscillates, look for delays. If it overshoots, increase warm-up time. If it never stabilizes, check whether your metric actually reflects the bottleneck you care about.

Further Reading

  • AWS, Target Tracking Scaling Policies for Amazon EC2 Auto Scaling. Official documentation with configuration details and best practices.
  • AWS, How Predictive Scaling Works. Explains the ML model, forecast generation, and when predictive scaling is a good fit.
  • Google, Addressing Cascading Failures, Site Reliability Engineering. Discusses warm-up, cache cold starts, and gradual load ramping after scaling events.
  • AWS, Step and Simple Scaling Policies. Detailed guide on step adjustments, breach sizes, and cooldown configuration.

Assignment

Your application experiences a 5x traffic spike every day at 9:00 AM. Your reactive autoscaler takes 3 minutes to detect the spike, launch new instances, and complete warm-up. During those 3 minutes, users see errors and degraded performance. This happens every single day.

  1. Identify the type of feedback loop delay causing the problem. Is it detection delay, provisioning delay, warm-up delay, or a combination?
  2. Propose a solution using predictive autoscaling. What data would the predictive model need? How far in advance should it pre-provision?
  3. Propose a hybrid solution that combines predictive and reactive scaling. What does each layer handle?
  4. Could you solve this without predictive scaling at all? Consider scheduled scaling actions. What are the tradeoffs?

Why Estimate Before You Design?

Before choosing databases, designing APIs, or drawing architecture diagrams, you need to know how big the problem is. A system serving 1,000 users per day and one serving 100 million users per day require fundamentally different architectures. The first might run on a single server. The second needs distributed storage, caching layers, load balancers, and CDNs.

Capacity estimation, often called "back-of-the-envelope" calculation, is the discipline of converting business requirements into infrastructure numbers. You start with users and end with servers, storage, and bandwidth.

Capacity estimation is the process of translating user-facing metrics (DAU, actions per user) into infrastructure metrics (QPS, storage, bandwidth, memory) using simple arithmetic. The goal is order-of-magnitude accuracy, not precision.

The Estimation Pipeline

Every capacity estimate follows the same pipeline. Start with users. Derive queries. Derive storage. Derive bandwidth. Each step feeds the next.

graph LR
  A[DAU / MAU] --> B[Actions per user]
  B --> C[QPS]
  C --> D[Peak QPS]
  D --> E[Storage]
  D --> F[Bandwidth]
  D --> G[Memory / Cache]

Key Formulas

These five formulas cover the vast majority of capacity estimation problems.

1. Queries Per Second (QPS)

QPS = DAU x actions_per_user / 86,400

There are 86,400 seconds in a day. This gives the average QPS across a 24-hour period. Calculate read QPS and write QPS separately, since they usually differ substantially (50x in the worked example below).

2. Peak QPS

Peak QPS = Average QPS x peak_multiplier

Traffic is never evenly distributed. A common rule of thumb: peak is 2x average for daily patterns, and up to 10x for viral or breaking-news events. Design your system to handle peak, not average.

3. Storage

Daily storage = write_QPS x object_size x 86,400
Yearly storage = daily_storage x 365 x replication_factor

Always account for replication. If you replicate data 3x (common for durability), your storage requirement triples. Add 20-30% overhead for metadata, indexes, and logs.

4. Bandwidth

Bandwidth (bytes/sec) = QPS x average_response_size
Bandwidth (Mbps) = bandwidth_bytes x 8 / 1,000,000

Calculate ingress (data coming in) and egress (data going out) separately. For read-heavy systems, egress dominates.

5. Cache Memory

Cache size = daily_read_requests x average_object_size x cache_ratio

A common heuristic: cache the top 20% of hot data to serve 80% of reads (the Pareto principle applied to caching). So cache_ratio is often 0.2.
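The five formulas fit in a few lines of Python. The function names are mine; the arithmetic mirrors the formulas above. (In a real estimate you would do this in your head with rounded numbers, per the advice below, but the code is a useful cross-check.)

```python
SECONDS_PER_DAY = 86_400

def avg_qps(dau, actions_per_user):
    """Formula 1: average queries per second."""
    return dau * actions_per_user / SECONDS_PER_DAY

def peak_qps(average, peak_multiplier=2.0):
    """Formula 2: peak QPS (2x for daily patterns, up to 10x for viral spikes)."""
    return average * peak_multiplier

def yearly_storage_bytes(write_qps, object_size_bytes, replication_factor=3):
    """Formula 3: yearly storage including replication."""
    daily = write_qps * object_size_bytes * SECONDS_PER_DAY
    return daily * 365 * replication_factor

def bandwidth_mbps(qps, avg_response_size_bytes):
    """Formula 4: bandwidth in megabits per second."""
    return qps * avg_response_size_bytes * 8 / 1_000_000

def cache_bytes(daily_read_requests, avg_object_size_bytes, cache_ratio=0.2):
    """Formula 5: cache sized to the hot 20% of data."""
    return daily_read_requests * avg_object_size_bytes * cache_ratio

# Cross-check against the photo-sharing example below:
# 50M DAU, 1 upload per user per day -> ~580 write QPS
print(round(avg_qps(50_000_000, 1)))  # 579
```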

Reference Numbers

You need a mental library of common sizes to make estimates quickly. These approximations, popularized by Jeff Dean's "Numbers Every Programmer Should Know," form the basis of quick estimation.

| Item | Approximate Size |
|---|---|
| 1 ASCII character | 1 byte |
| 1 Unicode character (UTF-8, common) | 2-4 bytes |
| A UUID / GUID | 16 bytes (binary), 36 bytes (string) |
| A Unix timestamp | 4-8 bytes |
| A tweet (280 chars + metadata) | ~1 KB |
| A typical JSON API response | 1-10 KB |
| A compressed image (thumbnail) | 10-50 KB |
| A compressed image (web quality) | 100-500 KB |
| 1 minute of compressed video (720p) | ~5 MB |
| 1 million (10^6) | ~1 MB if 1 byte each |
| 1 billion (10^9) | ~1 GB if 1 byte each |

Round aggressively. Use powers of 10. The goal is to land within the right order of magnitude, not to compute exact numbers. Saying "about 10 TB" is useful. Saying "9.7 TB" implies false precision.

Worked Example: A Photo-Sharing Service

Requirements: 50 million DAU. Each user uploads 1 photo per day (average 200 KB) and views 50 photos per day.

Step 1: QPS

Write QPS = 50M x 1 / 86,400 ≈ 580 writes/sec
Read QPS  = 50M x 50 / 86,400 ≈ 29,000 reads/sec
Peak read QPS = 29,000 x 2 = ~58,000 reads/sec

Step 2: Storage

Daily new photos     = 50M x 200 KB = 10 TB/day
Yearly (no replication) = 10 TB x 365 = 3.65 PB/year
With 3x replication  = ~11 PB/year

That is a lot of storage. This tells you immediately: you need object storage (like S3), not a relational database, for the photo data itself.

Step 3: Bandwidth

Ingress = 580 x 200 KB = 116 MB/sec ≈ 928 Mbps
Egress  = 29,000 x 200 KB = 5.8 GB/sec ≈ 46 Gbps

The egress number is enormous. This immediately points you toward CDN caching for read traffic. Without a CDN, the origin server bandwidth bill alone would be prohibitive.

Step 4: Cache

Daily read volume = 50M x 50 x 200 KB = 500 TB
Cache 20% of hot data = 100 TB

100 TB of cache is impractical in RAM. This tells you that you need a tiered caching strategy: a smaller in-memory cache (Redis) for metadata, and CDN edge caching for the actual image bytes.

Common Mistakes

  • Forgetting replication. If your database replicates 3x, your storage triples. Always ask: how many copies?
  • Using average instead of peak. Systems fail at peak, not at average. Always compute peak QPS.
  • Ignoring metadata. A 200 KB image might have 1 KB of metadata (user ID, timestamp, tags, location). For 50M uploads per day, that is 50 GB/day of metadata alone.
  • False precision. Do not use a calculator. Round to powers of 10. The estimate should take 5 minutes, not 30.
  • Skipping the "so what." Every number should lead to an architectural decision. 11 PB/year means object storage. 46 Gbps egress means CDN. If the number does not change a decision, you did not need to compute it.

Further Reading

  • Alex Xu, Back-of-the-Envelope Estimation, System Design Interview. The most widely referenced walkthrough of estimation for interviews.
  • Jeff Dean, Latency Numbers Every Programmer Should Know. The canonical reference for order-of-magnitude latency and size numbers.
  • Colin Scott, Interactive Latency Numbers. An interactive, updated version of Dean's original numbers with historical trends.
  • More Numbers Every Awesome Programmer Must Know, High Scalability. Extends Dean's list with network, disk, and distributed system numbers.

Assignment

You are designing a Twitter-like service with the following requirements:

  • 100 million DAU
  • Each user posts 2 tweets per day (average tweet: 280 bytes of text, no media)
  • Each user reads 200 tweets per day
  • Store tweets for 5 years
  • 3x replication

Calculate the following:

  1. Write QPS (average and peak at 2x)
  2. Read QPS (average and peak at 2x)
  3. Daily new storage (text only, before replication)
  4. Total storage for 5 years (with replication and 30% metadata overhead)
  5. Based on these numbers, what architectural decisions would you make? Should tweets live in a relational database or a NoSQL store? Do you need a CDN? How much cache would you provision?

What Makes an API "RESTful"?

REST (Representational State Transfer) is an architectural style, not a protocol. Roy Fielding defined it in his 2000 doctoral dissertation. The core idea: treat everything as a resource, identify resources with URLs, and manipulate them with a small, fixed set of HTTP methods.

A RESTful API organizes endpoints around nouns (resources), not verbs (actions). You do not create an endpoint called /createUser. You create a resource at /users and use the HTTP method POST to create a new one.

Resource-oriented design: Every entity in your system (user, order, bookmark, comment) is a resource with a unique URL. The HTTP method tells the server what to do with it. The URL tells the server which resource you mean.

HTTP Methods, Idempotency, and Safety

HTTP defines a small set of methods. Each has specific semantics that clients and intermediaries (proxies, caches, load balancers) rely on.

| Method | Purpose | Idempotent? | Safe? | Request Body? |
|---|---|---|---|---|
| GET | Retrieve a resource | Yes | Yes | No |
| POST | Create a new resource | No | No | Yes |
| PUT | Replace a resource entirely | Yes | No | Yes |
| PATCH | Partially update a resource | No* | No | Yes |
| DELETE | Remove a resource | Yes | No | Optional |

*PATCH is not guaranteed to be idempotent, though it can be implemented that way.

Idempotent means calling the same request multiple times produces the same server state as calling it once. PUT /users/42 with the same body always results in the same user record, no matter how many times you call it. POST /users creates a new user each time, so it is not idempotent.

Safe means the method does not modify server state. GET is safe. DELETE is not.

Idempotency matters because networks are unreliable. If a client sends a PUT request and the connection drops before it receives the response, it can safely retry. The server ends up in the same state. If the same thing happens with a non-idempotent POST, the retry might create a duplicate resource.

PUT vs. POST

The distinction is often misunderstood. POST means "create a new resource; the server assigns the ID." PUT means "place this resource at this exact URL." If the resource exists, PUT replaces it. If it does not, PUT creates it at the specified URL.

POST /bookmarks        → Server creates bookmark, assigns ID 789
PUT  /bookmarks/789    → Client specifies the exact resource to create or replace
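A minimal in-memory sketch makes the difference concrete (the store and method names are illustrative, not a framework API):

```python
import itertools

class BookmarkStore:
    """Contrasts POST (server assigns the ID) with PUT (client names the URL)."""

    def __init__(self):
        self.bookmarks = {}
        self._ids = itertools.count(1)  # server-side ID sequence

    def post(self, body):
        """POST /bookmarks: every call creates a new resource. Not idempotent."""
        new_id = next(self._ids)
        self.bookmarks[new_id] = body
        return new_id

    def put(self, bookmark_id, body):
        """PUT /bookmarks/{id}: create-or-replace at a known URL. Idempotent."""
        self.bookmarks[bookmark_id] = body

store = BookmarkStore()
store.post({"url": "https://example.com"})
store.post({"url": "https://example.com"})      # retry after a dropped response: duplicate!
store.put(789, {"url": "https://example.com"})
store.put(789, {"url": "https://example.com"})  # retry: same state, no duplicate
print(len(store.bookmarks))  # 3 (two POST duplicates plus one PUT resource)
```

This is why the retry problem described above pushes real APIs toward idempotency keys for POST: the key gives the server a way to recognize a retried create.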

Status Codes

HTTP status codes communicate what happened. Use them correctly. Clients, monitoring tools, and retry logic all depend on them.

| Range | Meaning | Common Codes |
|---|---|---|
| 2xx | Success | 200 OK, 201 Created, 204 No Content |
| 3xx | Redirection | 301 Moved Permanently, 304 Not Modified |
| 4xx | Client error | 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found, 409 Conflict, 429 Too Many Requests |
| 5xx | Server error | 500 Internal Server Error, 502 Bad Gateway, 503 Service Unavailable |

A few rules: return 201 (not 200) when creating a resource. Return 204 when a DELETE succeeds and there is no body to return. Return 409 Conflict when the client tries to create something that already exists. Never return 200 with an error message in the body. That defeats the purpose of status codes.

API Versioning

APIs evolve. Fields get added, renamed, or removed. You need a strategy for changing the API without breaking existing clients.

Three common approaches:

  • URL path versioning: /v1/users, /v2/users. Simple, visible, easy to route. Most widely used.
  • Header versioning: Accept: application/vnd.myapi.v2+json. Keeps URLs clean but is harder to test in a browser.
  • Query parameter: /users?version=2. Simple but pollutes the query string.

URL path versioning is the most common choice for public APIs because it is explicit and requires no special client configuration. Use it unless you have a specific reason not to.

Rate Limiting

Every public API needs rate limiting. Without it, a single misbehaving client (or attacker) can consume all your server resources.

Rate limiting caps the number of requests a client can make in a time window. Common implementations use a token bucket or sliding window algorithm. When a client exceeds the limit, the server returns 429 Too Many Requests with a Retry-After header.

Rate limits are typically defined per API key or per IP address, and expressed as requests per second or per minute. Session 2.9 covers rate limiting algorithms in depth.

Pagination: Cursor vs. Offset

Any endpoint that returns a list of resources needs pagination. Returning 10 million bookmarks in a single response is not an option. Two approaches dominate.

Offset-Based Pagination

GET /bookmarks?offset=20&limit=10

The server skips the first 20 records and returns the next 10. Simple to implement. The client just increments the offset by the page size.

The problem: offset pagination breaks with mutable data. If a new bookmark is inserted while the client is paginating, records shift. The client might see duplicates or skip items. It also performs poorly at large offsets because the database still has to scan and discard all skipped rows.

Cursor-Based Pagination

GET /bookmarks?cursor=eyJpZCI6MTAwfQ&limit=10

The cursor is an opaque token (usually an encoded record ID or timestamp) pointing to a specific position in the dataset. The server returns records after that position. The response includes a next_cursor for the client to use in the next request.

Cursor pagination is stable even when data changes between requests. It is also efficient because the database can use an index to jump directly to the cursor position instead of scanning from the beginning.

sequenceDiagram
  participant C as Client
  participant S as Server
  C->>S: GET /bookmarks?limit=10
  S-->>C: 200 OK {data: [...10 items], next_cursor: "abc123"}
  C->>S: GET /bookmarks?cursor=abc123&limit=10
  S-->>C: 200 OK {data: [...10 items], next_cursor: "def456"}
  C->>S: GET /bookmarks?cursor=def456&limit=10
  S-->>C: 200 OK {data: [...3 items], next_cursor: null}
  Note over C,S: next_cursor is null, no more pages

The tradeoff is that cursor pagination does not support "jump to page 5." The client can only move forward (or backward, if you provide a previous cursor). For most API use cases, this is acceptable. For user-facing interfaces where page numbers matter, offset is sometimes still preferred despite its limitations.
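A minimal sketch of a cursor-paginated endpoint, assuming an opaque base64 cursor over the last-seen ID (the dataset, names, and cursor format are illustrative):

```python
import base64
import json

# Toy dataset standing in for a bookmarks table (23 rows, ids 1..23)
ITEMS = [{"id": i, "title": f"bookmark-{i}"} for i in range(1, 24)]

def encode_cursor(last_id):
    """Wrap the last-seen ID in an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps({"id": last_id}).encode()).decode()

def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor))["id"]

def list_bookmarks(cursor=None, limit=10):
    """Return items strictly after the cursor position.

    With an index on id, a real database seeks here directly
    (WHERE id > :after ORDER BY id LIMIT :n) instead of scanning
    and discarding skipped rows the way offset pagination does.
    """
    after_id = decode_cursor(cursor) if cursor else 0
    page = [item for item in ITEMS if item["id"] > after_id][:limit]
    next_cursor = encode_cursor(page[-1]["id"]) if len(page) == limit else None
    return {"data": page, "next_cursor": next_cursor}

page1 = list_bookmarks()
page2 = list_bookmarks(cursor=page1["next_cursor"])
page3 = list_bookmarks(cursor=page2["next_cursor"])
print(len(page3["data"]), page3["next_cursor"])  # 3 None
```

Because the cursor records a position rather than a count, inserting a new bookmark mid-pagination cannot shift the remaining pages.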

Designing Good Endpoints

A few conventions that make APIs predictable and easy to use:

  • Use plural nouns for collections: /bookmarks, not /bookmark.
  • Use nested routes for relationships: /users/42/bookmarks for bookmarks belonging to user 42.
  • Use query parameters for filtering and sorting: /bookmarks?tag=design&sort=created_at.
  • Return the created resource on POST: Include the full object (with server-assigned fields like ID and timestamps) in the 201 response.
  • Use consistent error format: Every error response should have the same structure, such as {"error": {"code": "NOT_FOUND", "message": "Bookmark 999 does not exist"}}.

Further Reading

  • Roy Fielding, Chapter 5: Representational State Transfer (REST), Doctoral Dissertation (2000). The original definition of REST.
  • ByteByteGo, The Art of REST API Design: Idempotency, Pagination, and Security. Practical overview of idempotency keys, pagination strategies, and security headers.
  • API Design for System Design Interviews, Hello Interview. Concise reference for designing APIs under interview conditions.
  • Pagination Best Practices in REST API Design, Speakeasy. Detailed comparison of offset, cursor, and keyset pagination with code examples.
  • Swiss Federal Railways, RESTful Best Practices. A well-structured enterprise API design guide covering naming, versioning, and error handling.

Assignment

Design the API for a bookmarking service. Users can save URLs with tags, list their bookmarks, delete bookmarks, and search by tag or keyword. For each operation, specify:

  1. The HTTP method and endpoint URL
  2. The request body (if any), as JSON
  3. The response body, as JSON
  4. The HTTP status code for success

Operations to design:

  • Create a bookmark: Save a URL with a title, description, and list of tags.
  • List bookmarks: Return the current user's bookmarks, paginated. Choose cursor or offset pagination and justify your choice.
  • Delete a bookmark: Remove a bookmark by its ID.
  • Search bookmarks: Find bookmarks matching a keyword or tag. Should this be a separate endpoint or a query parameter on the list endpoint? Explain your reasoning.

Bonus: How would you make the "create bookmark" operation idempotent? What would you use as the idempotency key?

Three Models of Compute

When your application needs to scale, the first decision is not "how many servers?" but "what kind of compute unit?" The three dominant models today are virtual machines (VMs), containers, and serverless functions. Each makes a different tradeoff between control, speed, and operational burden.

Understanding these tradeoffs is not optional. Picking the wrong compute model for a workload can cost you 10x more than it should, or leave your system unable to respond to traffic spikes at all.

Virtual Machines

A virtual machine runs a full guest operating system on top of a hypervisor. The hypervisor (VMware ESXi, KVM, Hyper-V, or a cloud provider's custom solution) abstracts the physical hardware and allocates CPU, memory, and storage to each VM.

Because each VM boots a complete OS, provisioning takes minutes. A new EC2 instance on AWS typically needs 30 to 90 seconds to become reachable, and that is after the AMI is already built. Building the image itself can take 10 to 20 minutes.

VM isolation model: Each VM has its own kernel, filesystem, and network stack. This provides strong security boundaries, since a compromised process inside one VM cannot access another VM's memory or filesystem. The cost is resource duplication: every VM runs its own copy of the OS, consuming RAM and disk that the application itself does not need.

VMs shine when you need full OS control, must run legacy software that assumes a dedicated machine, or require strong tenant isolation (as in multi-tenant SaaS where each customer gets their own VM). They are the slowest to scale but the most flexible in what they can run.

Containers

Containers share the host OS kernel. Instead of virtualizing hardware, they use kernel features (namespaces and cgroups on Linux) to isolate processes, filesystems, and network interfaces. A container image contains only the application and its dependencies, not an entire OS.

This makes containers dramatically lighter. A typical container image is 50 to 500 MB, compared to 2 to 20 GB for a VM image. Startup time drops from minutes to seconds, often under 5 seconds for a well-built image.

The tradeoff is weaker isolation. All containers on a host share the same kernel. A kernel vulnerability can potentially affect every container on that host. This matters less in environments where you control all the workloads, and more in multi-tenant platforms where untrusted code runs side by side.

Containers introduce orchestration complexity. Running one container is simple. Running hundreds across a cluster requires a scheduler like Kubernetes, which manages placement, networking, health checks, scaling, and rolling deployments. Kubernetes itself is a significant operational commitment. Most teams underestimate the learning curve and the ongoing maintenance cost.

Serverless Functions

Serverless (AWS Lambda, Google Cloud Functions, Azure Functions) takes the abstraction one step further. You deploy a function. The provider handles everything else: provisioning, scaling, patching, and deprovisioning. You pay only for the time your function executes, measured in milliseconds.

The zero-idle-cost model is the key advantage. If your function receives zero requests, you pay zero. This makes serverless ideal for workloads with unpredictable or spiky traffic patterns.

The primary penalty is cold starts. When a function has not been invoked recently, the provider must allocate a container, load the runtime, load your code, and initialize dependencies before handling the request. This can add 100ms to several seconds of latency, depending on the runtime and package size. Java and .NET functions suffer worse cold starts than Python or Node.js due to larger runtime initialization overhead.

Vendor lock-in: Serverless functions are tightly coupled to the provider's event system, IAM model, and runtime environment. Moving a Lambda function to Google Cloud Functions is not a redeployment. It is a rewrite. This coupling extends to surrounding services: API Gateway, Step Functions, EventBridge, and other provider-specific glue that your function depends on.

The Compute Spectrum

These three models form a spectrum from maximum control to maximum abstraction. As you move from VMs to serverless, you give up control and gain operational simplicity and scaling speed.

graph LR
  VM["Virtual Machines<br/>Full OS, minutes to start<br/>Maximum control"] --> Container["Containers<br/>Shared kernel, seconds to start<br/>Balanced control"]
  Container --> Serverless["Serverless Functions<br/>Managed runtime, ms to scale<br/>Minimum control"]
  style VM fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
  style Container fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
  style Serverless fill:#2a2a2a,stroke:#c8a882,color:#ede9e3

The spectrum is not a ranking. "More abstract" does not mean "better." It means different tradeoffs. The right choice depends on the workload.

Comparison Table

| Dimension | Virtual Machines | Containers | Serverless |
|---|---|---|---|
| Startup time | 30s to minutes | 1 to 10 seconds | 100ms to seconds (cold start) |
| Cost model | Pay per hour/reserved, always running | Pay for cluster capacity, always running | Pay per invocation and duration |
| Scaling speed | Slow (minutes via autoscaling groups) | Medium (seconds via HPA/KEDA) | Fast (automatic, per-request) |
| Isolation | Strong (separate kernels) | Moderate (shared kernel, namespaces) | Strong (provider-managed micro-VMs) |
| Ops complexity | High (OS patching, config management) | High (Kubernetes, networking, observability) | Low (provider manages infrastructure) |
| Vendor lock-in | Low (standard OS images are portable) | Low to medium (OCI images are portable, orchestration may not be) | High (runtime, triggers, IAM all provider-specific) |
| Best for | Legacy apps, strong isolation needs, long-running processes | Microservices, CI/CD pipelines, steady-state APIs | Event-driven tasks, variable traffic, glue code |

Scaling Behavior in Practice

The difference in scaling behavior becomes clear under sudden load. Suppose your application receives 10x normal traffic for 15 minutes.

With VMs, your autoscaling group detects high CPU, launches new instances, waits for health checks, and registers them with the load balancer. By the time the new capacity is online (3 to 5 minutes), the spike may already be subsiding. You overshoot on the way up and undershoot on the way down.

With containers on Kubernetes, the Horizontal Pod Autoscaler detects high utilization and schedules new pods. If nodes have spare capacity, pods start in seconds. If the cluster needs new nodes, you are back to VM-speed scaling for those nodes. This is why teams pre-provision extra node capacity or use tools like Karpenter to speed up node provisioning.

With serverless, each incoming request can spawn a new execution environment if needed. Scaling is nearly instant and precisely matched to demand. But at scale, you may hit concurrency limits (AWS Lambda defaults to 1,000 concurrent executions per region), and you must explicitly request increases.

The Hybrid Reality

Most production systems use more than one model. A common pattern: containers for the core API (steady traffic, predictable load), serverless for event-driven workers (image processing on upload, webhook handling), and VMs for stateful workloads (databases, legacy systems).

The goal is not to pick one model for everything. It is to match each workload to the compute model whose tradeoffs best fit that workload's characteristics.

Further Reading

  • Understand the differences in VM vs. container vs. serverless (TechTarget). Clear side-by-side comparison with decision guidance.
  • Serverless vs Containers: Which is best for your needs? (DigitalOcean). Practical breakdown of when to use each model.
  • Serverless vs. Containers (Datadog). Includes observability considerations that most comparisons skip.
  • Serverless vs Containers vs VMs: The Architect's Decision (Developers.dev). Enterprise-focused decision framework.

Assignment

You are designing infrastructure for three workloads. For each, choose the compute model (VM, container, or serverless) and justify your choice in 2 to 3 sentences. Consider startup time, cost efficiency, scaling needs, and operational constraints.

  1. Always-on REST API with steady traffic. The API serves 500 requests per second consistently, 24/7. Latency must stay under 50ms at p99.
  2. Image thumbnail generation triggered on upload. Users upload between 10 and 10,000 images per hour depending on time of day. Each thumbnail takes 2 to 5 seconds to generate.
  3. ML model serving for product recommendations. The model is 2 GB in memory, requires GPU access, and serves predictions with a 200ms SLA.

For each choice, also identify the main risk of your chosen model and describe one mitigation strategy.

Why Rate Limiting Exists

Every API has a capacity ceiling. Whether that ceiling is set by CPU, memory, database connections, or downstream dependencies, there is a point beyond which additional requests degrade the experience for everyone. Rate limiting enforces that ceiling explicitly rather than letting the system discover it through failure.

Rate limiting serves three purposes. First, it protects backend services from being overwhelmed by a single client (intentionally or accidentally). Second, it ensures fair access across clients. Third, it provides a defense layer against abuse, including brute-force attacks, credential stuffing, and scraping.

The core question in rate limiting is simple: given a stream of incoming requests, how do you decide which ones to allow and which to reject? The answer depends on which algorithm you use.

Fixed Window Counter

The simplest approach. Divide time into fixed intervals (say, 1-minute windows). Maintain a counter per client per window. Increment on each request. If the counter exceeds the limit, reject.

The problem is boundary bursts. A client can send 100 requests at 12:00:59 and another 100 at 12:01:00. Both are within their respective windows, but the server just received 200 requests in 2 seconds. The fixed window sees two compliant windows. The server sees a spike.

Boundary burst problem: Fixed window counters allow up to 2x the intended rate at window boundaries, because requests at the end of one window and the start of the next are counted separately but arrive nearly simultaneously.
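A minimal in-memory sketch of the fixed window counter (single-process only; a production limiter would keep these counters in shared storage, and the class name and API here are illustrative):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed-window counter: one counter per (client, window) pair."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = defaultdict(int)  # (client_id, window_start) -> count

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        # Bucket the timestamp into a fixed window, e.g. 0-59s, 60-119s, ...
        window_start = int(now // self.window) * self.window
        key = (client_id, window_start)
        if self.counters[key] >= self.limit:
            return False
        self.counters[key] += 1
        return True
```

Note how the boundary burst falls out of the design: with a limit of 2 per minute, requests at t=59 and t=60 land in different windows, so a client can legally fit 4 requests into 2 seconds.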

Sliding Window Log

Instead of fixed intervals, keep a log of timestamps for each request. When a new request arrives, remove all timestamps older than the window size (e.g., older than 60 seconds). Count the remaining entries. If the count is at or above the limit, reject.

This eliminates the boundary burst problem entirely. The window slides with each request, so there is no artificial boundary to exploit. The cost is memory: you must store every request timestamp within the window. For high-volume APIs, this becomes expensive quickly.
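The same idea as a sketch, again in-memory and illustrative; the deque per client is the "log" the text describes:

```python
import time
from collections import deque, defaultdict

class SlidingWindowLogLimiter:
    """Stores a timestamp per request; exact, but memory grows with traffic."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id, now=None):
        now = time.time() if now is None else now
        log = self.logs[client_id]
        # Evict timestamps that have slid out of the window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```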

Sliding Window Counter

A hybrid that approximates the sliding window without storing individual timestamps. It keeps counters for the current and previous fixed windows, then weights them based on how far into the current window you are.

If the previous window had 80 requests and you are 30 seconds into the current 60-second window with 40 requests so far, the estimated count is 80 × 0.5 + 40 = 80, where 0.5 is the fraction of the previous window that still overlaps the sliding window. This gives you a reasonable approximation of a true sliding window at the memory cost of just two counters per client.
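The weighting step is a one-liner; this sketch reproduces the arithmetic from the example above (function name is illustrative):

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_fraction):
    """Estimate requests in the sliding window: weight the previous
    window by the fraction of it still overlapping, add the current count."""
    return prev_count * (1 - elapsed_fraction) + curr_count

# 80 requests last window, 40 so far, 30s into a 60s window:
estimate = sliding_window_estimate(80, 40, 30 / 60)  # -> 80.0
```

A limiter would then reject when `estimate >= limit`.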

Token Bucket

The token bucket algorithm works by analogy. A bucket holds tokens up to a maximum capacity. Tokens are added at a fixed rate (the refill rate). Each request consumes one token. If the bucket is empty, the request is rejected.

The key property: the bucket can accumulate tokens during idle periods, up to its maximum capacity. This allows controlled bursts. A client that has been quiet for a while can briefly exceed the steady-state rate, up to the bucket's capacity, before being throttled back to the refill rate.

```mermaid
graph TD
    Refill["Token Refill<br>(fixed rate, e.g., 10/sec)"] --> Bucket["Token Bucket<br>(max capacity: 100)"]
    Request["Incoming Request"] --> Check{"Tokens > 0?"}
    Check -->|Yes| Allow["Allow Request<br>(remove 1 token)"]
    Check -->|No| Reject["Reject (429)"]
    Bucket --> Check
    style Refill fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style Bucket fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Request fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Check fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Allow fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style Reject fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
```

In practice, you do not run a background thread to add tokens. Instead, when a request arrives, you calculate how many tokens should have been added since the last request based on elapsed time. This is called "lazy refill" and is how most implementations work.
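A sketch of the lazy-refill approach (in-memory, single client; the class shape is illustrative, not a reference implementation):

```python
import time

class TokenBucket:
    """Token bucket with lazy refill: tokens are topped up on each
    request based on elapsed time, never by a background thread."""

    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity        # max tokens (allowed burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Lazy refill: credit tokens for the time elapsed, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Storing just a token count and a last-refill timestamp is what makes this algorithm so cheap per client.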

Leaky Bucket

The leaky bucket is the inverse. Requests enter a queue (the bucket). The queue drains at a fixed rate. If the queue is full, new requests are dropped.

Where the token bucket allows bursts up to the bucket capacity, the leaky bucket enforces a strictly uniform output rate. Bursts are absorbed by the queue rather than passed through. This is useful when your downstream dependency requires a constant request rate and cannot tolerate spikes at all.
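A sketch of the leaky bucket as a counter that drains over time (here "allowed" means the request is queued for forwarding at the drain rate; names and shape are illustrative):

```python
class LeakyBucket:
    """Requests join a queue that drains at a fixed rate; requests
    arriving when the queue is full are dropped."""

    def __init__(self, capacity, leak_rate, now=0.0):
        self.capacity = capacity    # max queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.queue_len = 0.0
        self.last_leak = now

    def allow(self, now):
        # Drain the queue for the elapsed interval.
        elapsed = now - self.last_leak
        self.queue_len = max(0.0, self.queue_len - elapsed * self.leak_rate)
        self.last_leak = now
        if self.queue_len >= self.capacity:
            return False  # bucket full: drop the request
        self.queue_len += 1
        return True
```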

Algorithm Comparison

| Algorithm | Burst Handling | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Fixed Window | Allows 2x burst at boundaries | Very low (1 counter per client) | Low | Simple use cases, low-stakes limits |
| Sliding Window Log | No boundary burst | High (stores all timestamps) | Exact | Low-volume APIs needing precision |
| Sliding Window Counter | Minimal boundary burst | Low (2 counters per client) | Approximate | High-volume APIs, good default choice |
| Token Bucket | Controlled bursts allowed | Low (token count + timestamp) | Exact for steady state | APIs where occasional bursts are acceptable |
| Leaky Bucket | Smooths all bursts | Low (queue size) | Exact output rate | Protecting rate-sensitive downstream services |

Distributed Rate Limiting with Redis

A rate limiter running in a single process is straightforward. The challenge appears when your API runs on multiple servers behind a load balancer. Each server sees only a fraction of the total traffic. A per-server limit of 100 requests per minute, across 10 servers, means the actual limit is 1,000, not 100.

The standard solution is a centralized counter in Redis. Redis operations like INCR are atomic and fast (sub-millisecond). A typical pattern for a fixed window limiter:

  1. Construct a key like ratelimit:{user_id}:{window_timestamp}.
  2. INCR the key. If the result is 1 (first request in this window), set a TTL equal to the window size.
  3. If the result exceeds the limit, reject with HTTP 429.

Race Conditions

The INCR-then-check pattern looks simple, but the TTL step is fragile. INCR and EXPIRE are two separate commands, so they do not execute atomically as a pair. If the process crashes after the INCR but before the EXPIRE, the key persists forever with no expiry: the counter never resets, and the client stays blocked until someone deletes the key by hand.

The fix is to use a Lua script executed via Redis EVAL. Lua scripts in Redis are atomic: the entire script runs without interruption. This guarantees that the increment, the limit check, and the TTL set all happen as a single operation.

Atomic operations in Redis: Any time your rate limiting logic requires multiple Redis commands that depend on each other (read, decide, write), wrap them in a Lua script. This eliminates race conditions without requiring distributed locks, which would add latency and complexity.
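A sketch of what such a script might look like, with the key schema from the pattern above. The Lua source and the commented redis-py call are assumptions for illustration (not executed here); the pure-Python function below only models the script's logic against a dict standing in for Redis:

```python
# Lua script: increment, set TTL on the first hit, and check the limit,
# all inside one atomic EVAL.
RATE_LIMIT_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
    return 0
end
return 1
"""

# With a redis-py client r, this would be invoked roughly as:
#   allowed = r.eval(RATE_LIMIT_LUA, 1,
#                    f"ratelimit:{user_id}:{window_ts}", limit, window_size)

def fixed_window_reference(store, key, limit):
    """Pure-Python model of the script's decision (TTL omitted:
    the dict stand-in has no expiry)."""
    count = store.get(key, 0) + 1
    store[key] = count
    return count <= limit
```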

What Happens at Rejection

When a request exceeds the limit, the standard response is HTTP 429 (Too Many Requests). Good rate limiters also include response headers that help clients self-regulate:

  • X-RateLimit-Limit: the maximum allowed requests in the window.
  • X-RateLimit-Remaining: how many requests remain in the current window.
  • X-RateLimit-Reset: the Unix timestamp when the window resets.
  • Retry-After: how many seconds to wait before retrying.

These headers turn the rate limiter from a blunt wall into a communication channel. Well-behaved clients use them to back off gracefully rather than hammering the endpoint and collecting 429s.
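As a sketch, a 429 response's headers can be assembled from the limiter's state like this (the helper name is illustrative; note the X-RateLimit-* headers are a widely used convention rather than a formal standard):

```python
import time

def rate_limit_headers(limit, remaining, reset_epoch, now=None):
    """Build conventional rate-limit headers for a response."""
    now = time.time() if now is None else now
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
        # Seconds until the window resets; clients should wait this long.
        "Retry-After": str(max(0, int(reset_epoch - now))),
    }
```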

Where to Place the Rate Limiter

Rate limiting can happen at multiple layers: the API gateway, a reverse proxy (Nginx, Envoy), a middleware layer in your application, or a dedicated service. The closer to the edge, the earlier you reject bad traffic and the less load reaches your application servers. Most architectures place the primary rate limiter at the API gateway or reverse proxy, with a secondary application-level limiter for business-logic-specific rules (e.g., "max 5 password attempts per hour").

Further Reading

  • Build 5 Rate Limiters with Redis (Redis Official). Hands-on tutorial implementing all major algorithms with Redis and Lua scripts.
  • Rate Limiting Algorithms Explained with Code (AlgoMaster). Clear code examples for each algorithm with complexity analysis.
  • How to Build a Distributed Rate Limiting System Using Redis and Lua Scripts (freeCodeCamp). Covers the distributed case in detail, including race conditions.
  • Rate Limiting Algorithms: System Design (GeeksforGeeks). Algorithm-by-algorithm breakdown with diagrams.

Assignment

Design a rate limiter for a REST API with the following requirements:

  • Limit: 100 requests per minute per user.
  • The API runs on 8 servers behind a load balancer.
  • Users authenticate via API key in the request header.

Answer the following:

  1. Which algorithm do you choose, and why? Consider whether you need to allow bursts or enforce a smooth rate.
  2. What happens when a user sends request #101? Specify the HTTP status code, response headers, and the behavior the client should implement.
  3. What data structure and storage system do you use? Describe the key schema and the operations performed on each request.
  4. Where in the request lifecycle do you place the rate limiter? At the gateway, in a middleware, or as a separate service? Justify.
  5. What failure mode do you choose if Redis becomes unreachable: fail open (allow all) or fail closed (reject all)? What are the risks of each?

CDN as Geographic Horizontal Scaling

A Content Delivery Network is, at its core, a system for putting copies of your content closer to your users. Instead of every request traveling to a single origin server (which might be thousands of kilometers away), requests are served by edge servers distributed across dozens or hundreds of locations worldwide.

This is horizontal scaling applied geographically. Rather than making one server faster, you add more servers in more places. The result is lower latency, reduced origin load, and better availability during traffic spikes.

The physics are straightforward. Light in fiber travels at roughly 200,000 km/s. A round trip from Jakarta to a server in Virginia (about 16,000 km each way) takes at least 160ms just for the speed of light, before any processing happens. Put an edge server in Singapore (900 km away), and that round trip drops to under 10ms. For a page that requires 20 round trips to fully load, this difference is enormous.

How a CDN Request Works

When a user requests a resource, DNS resolves the domain to the nearest CDN edge server (using anycast routing or geo-DNS). The edge server checks its local cache. If the content is there and still valid, it serves it directly. This is a cache hit. If the content is missing or expired, the edge server fetches it from the origin, caches it, and serves it. This is a cache miss.

```mermaid
graph TD
    User["User (Jakarta)"] --> DNS["DNS Resolution"]
    DNS --> Edge["CDN Edge Server<br>(Singapore)"]
    Edge --> CacheCheck{"Cache Hit?"}
    CacheCheck -->|Yes| Serve["Serve from Edge<br>(~10ms)"]
    CacheCheck -->|No| Shield["Origin Shield<br>(if configured)"]
    Shield --> ShieldCheck{"Shield<br>Cache Hit?"}
    ShieldCheck -->|Yes| ServeShield["Serve from Shield"]
    ShieldCheck -->|No| Origin["Origin Server"]
    Origin --> Shield
    ServeShield --> Edge
    Shield --> Edge
    Edge --> User
    style User fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style DNS fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Edge fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style CacheCheck fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style Serve fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style Shield fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style ShieldCheck fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
    style ServeShield fill:#2a2a2a,stroke:#6b8f71,color:#ede9e3
    style Origin fill:#2a2a2a,stroke:#c8a882,color:#ede9e3
```

Cache-Control Headers

The origin server controls caching behavior through the Cache-Control HTTP header. This header tells both browsers and CDN edge servers how long to cache a response and under what conditions.

The most important directives:

  • max-age=3600: Cache this response for 3,600 seconds (1 hour). After that, it is considered stale.
  • s-maxage=86400: Like max-age, but applies only to shared caches (CDNs). Overrides max-age for edge servers while letting browsers use a different TTL.
  • no-cache: The cache may store the response, but must revalidate with the origin before serving it. This does not mean "do not cache." It means "always check first."
  • no-store: Do not cache this response at all. Used for sensitive data like banking pages or personal health records.
  • stale-while-revalidate=60: If the cached response is stale, serve it anyway while fetching a fresh copy in the background. The "60" means this behavior is allowed for up to 60 seconds past expiry.
  • stale-if-error=300: If the origin is unreachable, serve stale content for up to 300 seconds rather than returning an error.

stale-while-revalidate is one of the most valuable CDN directives. It eliminates the latency spike that occurs when cached content expires. Instead of making the first user after expiry wait for a full origin fetch, it serves stale content instantly and refreshes in the background. For most content, slightly stale is far better than slow.
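Pulling the directives together, here is a sketch of how they might combine per content type. The specific TTL values are illustrative, not prescriptions; tune them to your own staleness tolerance:

```python
# Example Cache-Control values per content type (illustrative).
CACHE_POLICIES = {
    # URL changes when content changes, so cache "forever".
    "hashed static asset": "public, max-age=31536000, immutable",
    # Short browser TTL, longer CDN TTL, mask expiry with background refresh.
    "public HTML page": "public, max-age=60, s-maxage=300, "
                        "stale-while-revalidate=60, stale-if-error=300",
    # Personalized and sensitive: never cache.
    "user dashboard": "private, no-store",
}
```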

Cache Invalidation Strategies

Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. In CDN operations, cache invalidation is the act of removing or refreshing content before its TTL expires.

| Strategy | How It Works | Speed | Cost | Best For |
|---|---|---|---|---|
| TTL Expiry | Content expires naturally based on Cache-Control headers | No action needed | Zero | Content that can tolerate staleness (images, CSS, JS with hashed filenames) |
| Purge | Explicitly delete a specific URL from all edge caches | Seconds to minutes | API call per URL | Urgent corrections, legal takedowns, breaking news |
| Purge by Tag/Surrogate Key | Tag responses with categories, purge all content matching a tag | Seconds | One API call per tag | CMS content updates (purge all "product-123" pages) |
| Stale-While-Revalidate | Serve stale, refresh in background | Transparent | Zero (built into response) | Content where freshness matters but not to the millisecond |
| Versioned URLs | Change the URL when content changes (e.g., style.a1b2c3.css) | Instant (new URL, no cache to invalidate) | Build pipeline change | Static assets (CSS, JS, images) |

Versioned URLs deserve special attention. If you hash the file contents into the filename (bundle.8f3a2c.js), you can set max-age=31536000 (one year) because the URL itself changes whenever the content changes. There is nothing to invalidate. This is the most reliable caching strategy for static assets and the one you should use by default.
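The hashing step is typically done by a build tool (webpack, Vite, and similar bundlers do it automatically), but the idea fits in a few lines; this sketch derives a content-hashed filename (function name is illustrative):

```python
import hashlib
from pathlib import Path

def versioned_name(path):
    """Derive a content-hashed filename, e.g. bundle.js -> bundle.8f3a2c71.js.
    Because the name changes whenever the bytes change, the asset can be
    served with Cache-Control: max-age=31536000, immutable."""
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()[:8]
    return f"{p.stem}.{digest}{p.suffix}"
```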

Origin Shielding

Without origin shielding, every edge server that experiences a cache miss fetches directly from the origin. If you have 200 edge locations and a popular asset expires simultaneously, the origin receives 200 concurrent requests for the same file. This is called a "thundering herd" or "cache stampede."

Origin shielding adds an intermediate cache layer. All edge servers route their cache misses through a single (or small number of) shield servers. The shield checks its own cache first. If it has the content, it serves all the edge servers. If not, only the shield fetches from the origin. This collapses 200 origin requests into 1.

Origin shielding reduces origin load by acting as a funnel between edge servers and the origin. It is especially valuable for origins with limited capacity (shared hosting, legacy systems) or content that expires frequently across many edge locations simultaneously.

Push vs. Pull CDN

In a pull CDN, the edge server fetches content from the origin on demand (on first request or after cache expiry). This is the default model used by Cloudflare, Fastly, and most CDN providers. You do not upload anything to the CDN. It pulls what it needs, when it needs it.

In a push CDN, you upload content to the CDN's storage explicitly, and the CDN serves directly from that storage rather than proxying to a separate origin. Pushing assets into an object store that the CDN fronts (for example, S3 behind AWS CloudFront) approximates this model: you control exactly what content is available and when, even though CloudFront itself still pulls from the bucket on a cache miss.

Pull CDNs are simpler to operate: point DNS, configure caching headers, and you are done. Push CDNs give you more control and are better for large static assets (video, software downloads) where you want to guarantee the content is available on every edge before users request it.

What to Cache and What Not To

Not everything belongs in a CDN cache. The general rule: cache anything that is the same for all users, and do not cache anything that is personalized or sensitive.

  • Cache: images, CSS, JavaScript, fonts, public HTML pages, API responses that are identical across users (product catalog, search results for common queries).
  • Do not cache: authenticated API responses, user dashboards, shopping carts, anything containing personal data, POST/PUT/DELETE responses.
  • Cache with care: semi-personalized content (e.g., localized pages). Use the Vary header to cache different versions for different Accept-Language or Accept-Encoding values.

Measuring CDN Performance

The key metric is cache hit ratio: the percentage of requests served from edge cache without touching the origin. A well-configured CDN should achieve 85% or higher for a typical content site. If your ratio is below 60%, something is wrong: either too many unique URLs, too-short TTLs, or headers that prevent caching.

Monitor X-Cache or equivalent headers in your CDN's responses. They tell you whether each request was a hit, miss, or stale revalidation. Use this data to identify caching gaps and optimize your header configuration.

Further Reading

  • Cache-Control header (MDN Web Docs). The definitive reference for all Cache-Control directives and their behavior.
  • CDN Reference Architecture (Cloudflare). Detailed architectural overview of how a production CDN works, including tiered caching and origin shielding.
  • Understanding Stale-While-Revalidate (DebugBear). Practical guide to the stale-while-revalidate directive with real-world performance impact.
  • Lifetime and Revalidation (Fastly). How a major CDN provider implements stale content serving and revalidation.
  • Serve stale content (Google Cloud CDN). Google's implementation of stale-while-revalidate and stale-if-error.

Assignment

You are deploying a web application with the following user distribution:

  • 60% Indonesia
  • 30% Malaysia
  • 10% global (scattered)

Your origin server is in Singapore. The application serves a mix of static assets (images, CSS, JS), public product pages, and authenticated user dashboards.

Design the CDN strategy:

  1. Edge node placement. Where would you place edge nodes (or choose PoP locations from a provider like Cloudflare or AWS CloudFront)? Justify each location based on the user distribution.
  2. Caching rules. For each content type below, specify the Cache-Control header you would set and explain why:
    • Static assets (CSS, JS with hashed filenames)
    • Product images uploaded by sellers
    • Public product listing pages
    • Authenticated user dashboard
    • API response for product search
  3. Origin shielding. Would you enable origin shielding? If yes, where would you place the shield server and why?
  4. Invalidation plan. A seller updates a product image. Describe step by step how the new image reaches users. Which invalidation strategy do you use?
© Ibrahim Anwar · Bogor, West Java
This work is licensed under CC BY 4.0