Module 2: Scalability, Load Balancing & API Design

The Scaling Problem

Your system is under load. Response times are climbing. Users are complaining. You need more capacity. The question is not whether to scale, but how.

There are two fundamental directions: make the existing machine bigger (vertical scaling), or add more machines (horizontal scaling). Each direction carries different costs, constraints, and architectural consequences. Most production systems end up using both.

Vertical Scaling (Scale Up)

Vertical scaling means increasing the capacity of a single machine by adding more CPU, RAM, storage, or faster disks. The application code does not change. You simply give it a more powerful host.

This is the simplest form of scaling. If your database server is running at 80% CPU, you migrate it to an instance with more cores. If your application is running out of memory, you add RAM. The application itself does not know or care that the underlying hardware changed.

Vertical scaling is appealing because it requires zero architectural changes. A single-threaded application that cannot run across multiple machines will still benefit from a faster CPU. A database that stores everything on one disk will still benefit from more RAM for caching.

But vertical scaling has a hard ceiling. There is a largest machine you can buy. As of 2026, the largest AWS EC2 instance (u-24tb1.metal) offers 448 vCPUs and 24 TB of RAM. That is the wall. If your workload outgrows that machine, vertical scaling cannot help you. And long before you hit that wall, the cost curve becomes punishing: doubling the CPU count of a cloud instance rarely just doubles the price; it often triples or quadruples it.

There is also the downtime problem. Resizing a machine typically requires stopping it, changing the instance type, and restarting. For a database, this can mean minutes of unavailability. For a stateless web server behind a load balancer, it is less painful but still disruptive.

Horizontal Scaling (Scale Out)

Horizontal scaling means adding more machines to share the workload. Instead of one powerful server, you run many smaller servers behind a load balancer. Each server handles a portion of the traffic.

Horizontal scaling has no hard ceiling. Need more capacity? Add another server. Need even more? Add ten. Cloud platforms make this trivial with auto-scaling groups that add or remove instances based on metrics like CPU utilization or request count.
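The sizing logic behind such an auto-scaling policy can be sketched as target tracking: pick a target utilization and resize the fleet so the average lands near it. A minimal sketch (the function name and the instance cap are illustrative, not any cloud provider's API):

```python
import math

def desired_instances(current: int, avg_cpu: float, target_cpu: float,
                      min_n: int = 1, max_n: int = 10) -> int:
    """Target tracking: size the fleet so average CPU approaches target_cpu.

    If 4 instances average 90% CPU against a 60% target, total demand is
    4 * 90 = 360 "CPU points", so we need ceil(360 / 60) = 6 instances.
    """
    needed = math.ceil(current * avg_cpu / target_cpu)
    return max(min_n, min(max_n, needed))

print(desired_instances(4, 90.0, 60.0))  # scale out: 4 -> 6
print(desired_instances(6, 30.0, 60.0))  # scale in:  6 -> 3
```

Real auto-scaling groups add cooldown periods and smoothing on top of this core calculation, so the fleet does not thrash when load oscillates.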

The catch is that horizontal scaling demands distributed architecture. Your application must be designed to run as multiple independent instances. This means:

  - No local state: any instance must be able to serve any request, so sessions and uploaded files move to shared external stores.
  - A shared data layer: instances coordinate through a common database or cache rather than in-process memory.
  - Coordination for singleton work: scheduled jobs and background tasks need distributed locks or a queue, or they will run once per instance.

Horizontal scaling also introduces operational complexity. You now need load balancers, health checks, deployment strategies for rolling updates, and monitoring across multiple instances. The system has more moving parts.
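The load balancer's core job, distributing requests across instances while skipping unhealthy ones, can be sketched in a few lines. A minimal round-robin sketch (the class and backend names are illustrative):

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin load balancer: rotate through healthy backends."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)   # health check failed

    def mark_up(self, backend):
        self.healthy.add(backend)       # backend recovered

    def next_backend(self):
        # Skip unhealthy backends; give up after one full rotation.
        for _ in range(len(self.backends)):
            b = next(self._cycle)
            if b in self.healthy:
                return b
        raise RuntimeError("no healthy backends")

lb = RoundRobinBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")
print([lb.next_backend() for _ in range(4)])  # app-2 is skipped
```

Production load balancers layer on active health probes, connection draining, and weighted or least-connections algorithms, but the rotate-and-skip loop above is the essential shape.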

Diagonal Scaling (The Hybrid)

In practice, most production systems use both strategies. This is sometimes called diagonal scaling.

A common pattern: the application tier scales horizontally behind a load balancer, while the primary database scales vertically on the largest feasible instance. The database is hard to distribute (joins, transactions, consistency), so you give it the biggest machine you can afford. The application servers are stateless and easy to replicate, so you add more of them.

As the database eventually outgrows vertical limits, you introduce read replicas (horizontal reads) while keeping a single primary for writes. This is diagonal scaling in action: vertical where distribution is hard, horizontal where it is natural.
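The read/write split described above amounts to a small routing decision in the data access layer. A minimal sketch (host names are illustrative; a real router must also account for replication lag and read-your-writes consistency):

```python
import random

class ReadWriteRouter:
    """Diagonal scaling sketch: one write primary, N read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def route(self, sql: str) -> str:
        # Writes (and anything ambiguous) must go to the primary;
        # plain SELECTs can be spread across the replicas.
        verb = sql.lstrip().split()[0].upper()
        if verb == "SELECT" and self.replicas:
            return random.choice(self.replicas)
        return self.primary

router = ReadWriteRouter("pg-primary", ["pg-replica-1", "pg-replica-2"])
print(router.route("INSERT INTO users VALUES (1)"))  # always pg-primary
print(router.route("SELECT * FROM users"))           # one of the replicas
```

Note the asymmetry: reads scale by adding replicas, but every write still lands on the single primary, which is why the primary stays on the biggest feasible machine.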

```mermaid
graph TB
    subgraph "Vertical Scaling"
        V1[Server] -->|"Add CPU, RAM"| V2["Bigger Server"]
        V2 -->|"Add more"| V3["Biggest Server<br/>(ceiling)"]
    end
    subgraph "Horizontal Scaling"
        LB[Load Balancer] --> H1[Server 1]
        LB --> H2[Server 2]
        LB --> H3[Server 3]
        LB --> H4["Server N..."]
    end
```

Comparison Table

| Dimension | Vertical Scaling | Horizontal Scaling |
| --- | --- | --- |
| Mechanism | Bigger machine (more CPU, RAM, disk) | More machines behind a load balancer |
| Capacity ceiling | Hard limit (largest available instance) | No theoretical limit |
| Cost curve | Superlinear (2x capacity costs 3-4x) | Roughly linear (2x capacity costs ~2x) |
| Code changes | None required | Must handle distributed state, sessions, coordination |
| Downtime during scaling | Usually required (instance resize) | Zero downtime (add/remove instances) |
| Fault tolerance | Single point of failure | Survives individual instance failures |
| State management | Simple (everything on one machine) | Complex (shared stores, distributed locks) |
| Operational complexity | Low (one machine to manage) | High (load balancers, health checks, deployments) |
| Best for | Databases, legacy apps, quick fixes | Stateless services, web servers, microservices |

Systems Thinking Lens

Scaling decisions create feedback loops. Vertical scaling is a balancing loop with a fixed limit: you add resources, performance improves, but the ceiling does not move. Eventually the gap between demand and capacity closes again, and you have nowhere to go.

Horizontal scaling is a reinforcing loop in terms of capacity (more machines, more throughput) but also a reinforcing loop in terms of complexity (more machines, more coordination problems, more debugging surface). The systems thinker asks: which loop dominates at our current scale? If you have 3 servers, the coordination overhead is minimal. At 300 servers, it is a major engineering concern.

The leverage point is often not "which direction to scale" but "what to make stateless." Every component you can make stateless becomes trivially horizontally scalable. The real work is in the architecture, not the infrastructure.
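The difference between a stateful and a stateless instance can be made concrete with session handling. A minimal sketch, where the in-memory dict stands in for an external store such as Redis (all names here are illustrative):

```python
# Stateful (scales poorly): the session lives in this process's memory.
# If the load balancer sends the next request to another instance, the
# session is invisible there, and it dies when this instance is replaced.
local_sessions = {}

def login_stateful(token, user_id):
    local_sessions[token] = user_id

# Stateless (scales out trivially): the instance keeps nothing between
# requests; state lives in a shared external store that every replica sees.
class SharedSessionStore:
    """Stand-in for an external store like Redis or Memcached."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

store = SharedSessionStore()

def login_stateless(token, user_id):
    store.put(token, user_id)     # state leaves the instance

def handle_request(token):
    return store.get(token)       # works on whichever replica runs it
```

Once every request can be served by any replica, adding capacity really is just adding servers, which is the property the rest of this module depends on.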

Further Reading

Assignment

Your company runs a PostgreSQL database on a single server. It is at 80% CPU utilization during peak hours and growing 10% per month. You have two options:

  1. Vertical: Upgrade to an instance with 2x the CPU cores. Cost is 3x your current monthly bill. No code changes needed. Can be done this weekend with 15 minutes of downtime.
  2. Horizontal: Add read replicas and split read traffic from write traffic. Cost is 2x your current monthly bill. Requires code changes to route read queries to replicas. Estimated 3 weeks of development and testing.

You have a 6-month runway before the database hits critical load. Which option do you choose, and why? Consider: what happens at month 7? What does each option buy you in terms of future scaling? Is there a diagonal approach that combines both?