Course → Module 2: Scalability, Load Balancing & API Design

Why Estimate Before You Design?

Before choosing databases, designing APIs, or drawing architecture diagrams, you need to know how big the problem is. A system serving 1,000 users per day and one serving 100 million users per day require fundamentally different architectures. The first might run on a single server. The second needs distributed storage, caching layers, load balancers, and CDNs.

Capacity estimation, often called "back-of-the-envelope" calculation, is the discipline of converting business requirements into infrastructure numbers. You start with users and end with servers, storage, and bandwidth.

Capacity estimation is the process of translating user-facing metrics (DAU, actions per user) into infrastructure metrics (QPS, storage, bandwidth, memory) using simple arithmetic. The goal is order-of-magnitude accuracy, not precision.

The Estimation Pipeline

Every capacity estimate follows the same pipeline. Start with users. Derive queries. Derive storage. Derive bandwidth. Each step feeds the next.

graph LR
  A[DAU / MAU] --> B[Actions per user]
  B --> C[QPS]
  C --> D[Peak QPS]
  D --> E[Storage]
  D --> F[Bandwidth]
  D --> G[Memory / Cache]

Key Formulas

These five formulas cover the vast majority of capacity estimation problems.

1. Queries Per Second (QPS)

QPS = DAU x actions_per_user / 86,400

There are 86,400 seconds in a day, so this gives the average QPS across a 24-hour period. Calculate read QPS and write QPS separately; their volumes often differ by one or two orders of magnitude.
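As a quick sketch, the formula is plain arithmetic; here is a minimal Python helper (the function name is illustrative):

```python
SECONDS_PER_DAY = 86_400

def average_qps(dau: int, actions_per_user: float) -> float:
    """Average queries per second across a 24-hour day."""
    return dau * actions_per_user / SECONDS_PER_DAY

# 50M users doing 10 reads each per day -> roughly 5,800 reads/sec
reads = average_qps(50_000_000, 10)
```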

2. Peak QPS

Peak QPS = Average QPS x peak_multiplier

Traffic is never evenly distributed. A common rule of thumb: peak is 2x average for daily patterns, and up to 10x for viral or breaking-news events. Design your system to handle peak, not average.
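A sketch of the multiplier applied in Python (the 2x and 10x figures are the rules of thumb above, not measured values):

```python
def peak_qps(avg_qps: float, multiplier: float = 2.0) -> float:
    """Scale average QPS by a traffic-shape multiplier:
    ~2x for a normal daily cycle, up to ~10x for viral events."""
    return avg_qps * multiplier

daily_peak = peak_qps(29_000)       # normal daily peak
viral_peak = peak_qps(29_000, 10)   # viral / breaking-news spike
```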

3. Storage

Daily storage = write_QPS x object_size x 86,400
Yearly storage = daily_storage x 365 x replication_factor

Always account for replication. If you replicate data 3x (common for durability), your storage requirement triples. Add 20-30% overhead for metadata, indexes, and logs.
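Putting replication and overhead into one helper, as a sketch (the 3x replication and 25% overhead defaults are the heuristics above):

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365

def yearly_storage_bytes(write_qps: float, object_bytes: int,
                         replication: int = 3, overhead: float = 0.25) -> float:
    """Daily writes x object size, scaled to a year, with replication
    and a fudge factor for metadata, indexes, and logs."""
    daily = write_qps * object_bytes * SECONDS_PER_DAY
    return daily * DAYS_PER_YEAR * replication * (1 + overhead)

# 580 writes/sec of 200 KB photos -> roughly 14 PB/year all-in
total = yearly_storage_bytes(580, 200_000)
```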

4. Bandwidth

Bandwidth (bytes/sec) = QPS x average_response_size
Bandwidth (Mbps) = bandwidth_bytes x 8 / 1,000,000

Calculate ingress (data coming in) and egress (data going out) separately. For read-heavy systems, egress dominates.
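A sketch of the conversion in Python, using the photo-service numbers from the worked example below as inputs:

```python
def bandwidth_mbps(qps: float, avg_response_bytes: int) -> float:
    """QPS x response size, converted from bytes/sec to megabits/sec."""
    return qps * avg_response_bytes * 8 / 1_000_000

ingress = bandwidth_mbps(580, 200_000)      # uploads coming in: ~930 Mbps
egress = bandwidth_mbps(29_000, 200_000)    # reads going out: ~46,000 Mbps
```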

5. Cache Memory

Cache size = daily_read_requests x average_object_size x cache_ratio

A common heuristic: cache the top 20% of hot data to serve 80% of reads (the Pareto principle applied to caching). So cache_ratio is often 0.2.
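As a sketch (the 0.2 default encodes the Pareto heuristic above; numbers are from the photo-service example below):

```python
def cache_size_bytes(daily_reads: float, avg_object_bytes: int,
                     cache_ratio: float = 0.2) -> float:
    """Memory needed to hold the hot slice (by default the top 20%)
    of a day's read volume."""
    return daily_reads * avg_object_bytes * cache_ratio

# 2.5 billion reads/day of 200 KB objects -> 100 TB of hot data
hot = cache_size_bytes(2_500_000_000, 200_000)
```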

Reference Numbers

You need a mental library of common sizes to make estimates quickly. These approximations serve the same role for data sizes that Jeff Dean's well-known "Latency Numbers Every Programmer Should Know" list serves for latencies.

Item                                      Approximate Size
1 ASCII character                         1 byte
1 Unicode character (UTF-8, non-ASCII)    2-4 bytes
A UUID / GUID                             16 bytes (binary), 36 bytes (string)
A Unix timestamp                          4-8 bytes
A tweet (280 chars + metadata)            ~1 KB
A typical JSON API response               1-10 KB
A compressed image (thumbnail)            10-50 KB
A compressed image (web quality)          100-500 KB
1 minute of compressed video (720p)       ~5 MB
1 million (10^6) items                    ~1 MB at 1 byte each
1 billion (10^9) items                    ~1 GB at 1 byte each

Round aggressively. Use powers of 10. The goal is to land within the right order of magnitude, not to compute exact numbers. Saying "about 10 TB" is useful. Saying "9.7 TB" implies false precision.

Worked Example: A Photo-Sharing Service

Requirements: 50 million DAU. Each user uploads 1 photo per day (average 200 KB) and views 50 photos per day.

Step 1: QPS

Write QPS = 50M x 1 / 86,400 ≈ 580 writes/sec
Read QPS  = 50M x 50 / 86,400 ≈ 29,000 reads/sec
Peak read QPS = 29,000 x 2 = ~58,000 reads/sec

Step 2: Storage

Daily new photos     = 50M x 200 KB = 10 TB/day
Yearly (no replication) = 10 TB x 365 = 3.65 PB/year
With 3x replication  = ~11 PB/year

That is a lot of storage. This tells you immediately: you need object storage (like S3), not a relational database, for the photo data itself.

Step 3: Bandwidth

Ingress = 580 x 200 KB = 116 MB/sec ≈ 928 Mbps
Egress  = 29,000 x 200 KB = 5.8 GB/sec ≈ 46 Gbps

The egress number is enormous. This immediately points you toward CDN caching for read traffic. Without a CDN, the origin server bandwidth bill alone would be prohibitive.

Step 4: Cache

Daily read volume = 50M x 50 x 200 KB = 500 TB
Cache 20% of hot data = 100 TB

100 TB of cache is impractical in RAM. This tells you that you need a tiered caching strategy: a smaller in-memory cache (Redis) for metadata, and CDN edge caching for the actual image bytes.
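All four steps can be double-checked in a few lines of Python; this is a sketch, with every constant taken from the requirements above (small differences from the rounded figures in the text are expected):

```python
SECONDS_PER_DAY = 86_400
DAU = 50_000_000
PHOTO_BYTES = 200_000        # 200 KB average photo
UPLOADS_PER_USER = 1
VIEWS_PER_USER = 50

# Step 1: QPS
write_qps = DAU * UPLOADS_PER_USER / SECONDS_PER_DAY   # ~580
read_qps = DAU * VIEWS_PER_USER / SECONDS_PER_DAY      # ~29,000
peak_read_qps = read_qps * 2                           # ~58,000

# Step 2: Storage
daily_storage = DAU * UPLOADS_PER_USER * PHOTO_BYTES   # 10 TB/day
yearly_replicated = daily_storage * 365 * 3            # ~11 PB/year

# Step 3: Bandwidth
ingress_mbps = write_qps * PHOTO_BYTES * 8 / 1e6       # ~930 Mbps
egress_gbps = read_qps * PHOTO_BYTES * 8 / 1e9         # ~46 Gbps

# Step 4: Cache (top 20% of a day's read volume)
cache_bytes = DAU * VIEWS_PER_USER * PHOTO_BYTES * 0.2 # 100 TB hot set
```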

Assignment

You are designing a Twitter-like service with the following requirements:

  • 100 million DAU
  • Each user posts 2 tweets per day (average tweet: 280 bytes of text, no media)
  • Each user reads 200 tweets per day
  • Store tweets for 5 years
  • 3x replication

Calculate the following:

  1. Write QPS (average and peak at 2x)
  2. Read QPS (average and peak at 2x)
  3. Daily new storage (text only, before replication)
  4. Total storage for 5 years (with replication and 30% metadata overhead)
  5. Based on these numbers, what architectural decisions would you make? Should tweets live in a relational database or a NoSQL store? Do you need a CDN? How much cache would you provision?