Capacity Estimation
Session 2.6 · ~5 min read
Why Estimate Before You Design?
Before choosing databases, designing APIs, or drawing architecture diagrams, you need to know how big the problem is. A system serving 1,000 users per day and one serving 100 million users per day require fundamentally different architectures. The first might run on a single server. The second needs distributed storage, caching layers, load balancers, and CDNs.
Capacity estimation, often called "back-of-the-envelope" calculation, is the discipline of converting business requirements into infrastructure numbers. You start with users and end with servers, storage, and bandwidth.
Capacity estimation is the process of translating user-facing metrics (DAU, actions per user) into infrastructure metrics (QPS, storage, bandwidth, memory) using simple arithmetic. The goal is order-of-magnitude accuracy, not precision.
The Estimation Pipeline
Every capacity estimate follows the same pipeline. Start with users. Derive queries. Derive storage. Derive bandwidth. Each step feeds the next.
Key Formulas
These five formulas cover the vast majority of capacity estimation problems.
1. Queries Per Second (QPS)
QPS = DAU x actions_per_user / 86,400
There are 86,400 seconds in a day, so this gives the average QPS across a 24-hour period. Calculate read QPS and write QPS separately — they usually stress different parts of the system.
2. Peak QPS
Peak QPS = Average QPS x peak_multiplier
Traffic is never evenly distributed. A common rule of thumb: peak is 2x average for daily patterns, and up to 10x for viral or breaking-news events. Design your system to handle peak, not average.
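The two formulas above take a few lines of Python. The 10M-DAU numbers below are illustrative, not taken from the text:

```python
SECONDS_PER_DAY = 86_400

def average_qps(dau: int, actions_per_user: float) -> float:
    """Average queries per second across a 24-hour period."""
    return dau * actions_per_user / SECONDS_PER_DAY

def peak_qps(avg: float, peak_multiplier: float = 2.0) -> float:
    """Scale average QPS by a peak multiplier: ~2x for daily
    patterns, up to ~10x for viral events."""
    return avg * peak_multiplier

# Example: 10M DAU, 5 actions each -> roughly 580 average QPS,
# roughly 1,200 at a 2x peak.
avg = average_qps(10_000_000, 5)
peak = peak_qps(avg)
```

Note that the output is a rough rate, so round it aggressively before quoting it.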
3. Storage
Daily storage = write_QPS x object_size x 86,400
Yearly storage = daily_storage x 365 x replication_factor
Always account for replication. If you replicate data 3x (common for durability), your storage requirement triples. Add 20-30% overhead for metadata, indexes, and logs.
4. Bandwidth
Bandwidth (bytes/sec) = QPS x average_response_size
Bandwidth (Mbps) = bandwidth_bytes x 8 / 1,000,000
Calculate ingress (data coming in) and egress (data going out) separately. For read-heavy systems, egress dominates.
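The bandwidth conversion is a common source of factor-of-8 mistakes (bytes vs. bits), so it is worth encoding once:

```python
def bandwidth_bytes_per_sec(qps: float, avg_response_bytes: int) -> float:
    """Raw throughput: requests per second times bytes per response."""
    return qps * avg_response_bytes

def to_mbps(bytes_per_sec: float) -> float:
    """Convert bytes/sec to megabits/sec: multiply by 8 bits/byte,
    divide by 1,000,000 bits/megabit."""
    return bytes_per_sec * 8 / 1_000_000

# Example: 580 QPS of 200 KB responses -> 116 MB/sec -> 928 Mbps.
egress = bandwidth_bytes_per_sec(580, 200_000)
```

Run the same calculation twice — once with write QPS for ingress, once with read QPS for egress.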
5. Cache Memory
Cache size = daily_read_requests x average_object_size x cache_ratio
A common heuristic: cache the top 20% of hot data to serve 80% of reads (the Pareto principle applied to caching). So cache_ratio is often 0.2.
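The cache formula as stated sizes the cache against daily read volume, not distinct objects, which is a deliberately conservative simplification. A minimal sketch:

```python
def cache_size_bytes(daily_read_requests: int,
                     avg_object_bytes: int,
                     cache_ratio: float = 0.2) -> float:
    """Cache capacity needed to hold the hot fraction of daily reads.
    cache_ratio defaults to 0.2 per the 80/20 heuristic."""
    return daily_read_requests * avg_object_bytes * cache_ratio

# Example: 2.5B reads/day of 200 KB objects -> 100 TB of hot data.
hot_set = cache_size_bytes(2_500_000_000, 200_000)
```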
Reference Numbers
You need a mental library of common sizes to make estimates quickly. These approximations, popularized by Jeff Dean's "Numbers Every Programmer Should Know," form the basis of quick estimation.
| Item | Approximate Size |
|---|---|
| 1 ASCII character | 1 byte |
| 1 non-ASCII Unicode character (UTF-8) | 2-4 bytes |
| A UUID / GUID | 16 bytes (binary), 36 bytes (string) |
| A Unix timestamp | 4-8 bytes |
| A tweet (280 chars + metadata) | ~1 KB |
| A typical JSON API response | 1-10 KB |
| A compressed image (thumbnail) | 10-50 KB |
| A compressed image (web quality) | 100-500 KB |
| 1 minute of compressed video (720p) | ~5 MB |
| 1 million (10^6) | ~1 MB if 1 byte each |
| 1 billion (10^9) | ~1 GB if 1 byte each |
Round aggressively. Use powers of 10. The goal is to land within the right order of magnitude, not to compute exact numbers. Saying "about 10 TB" is useful. Saying "9.7 TB" implies false precision.
Worked Example: A Photo-Sharing Service
Requirements: 50 million DAU. Each user uploads 1 photo per day (average 200 KB) and views 50 photos per day.
Step 1: QPS
Write QPS = 50M x 1 / 86,400 ≈ 580 writes/sec
Read QPS = 50M x 50 / 86,400 ≈ 29,000 reads/sec
Peak read QPS = 29,000 x 2 = ~58,000 reads/sec
Step 2: Storage
Daily new photos = 50M x 200 KB = 10 TB/day
Yearly (no replication) = 10 TB x 365 = 3.65 PB/year
With 3x replication = ~11 PB/year
That is a lot of storage. This tells you immediately: you need object storage (like S3), not a relational database, for the photo data itself.
Step 3: Bandwidth
Ingress = 580 x 200 KB = 116 MB/sec ≈ 928 Mbps
Egress = 29,000 x 200 KB = 5.8 GB/sec ≈ 46 Gbps
The egress number is enormous. This immediately points you toward CDN caching for read traffic. Without a CDN, the origin server bandwidth bill alone would be prohibitive.
Step 4: Cache
Daily read volume = 50M x 50 x 200 KB = 500 TB
Cache 20% of hot data = 100 TB
100 TB of cache is impractical in RAM. This tells you that you need a tiered caching strategy: a smaller in-memory cache (Redis) for metadata, and CDN edge caching for the actual image bytes.
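The whole walkthrough fits in a short script, which is a handy way to double-check the arithmetic (numbers are the photo-sharing requirements above; exact values differ slightly from the rounded figures in the text):

```python
# Photo-sharing estimate: 50M DAU, 1 upload + 50 views/day, 200 KB/photo.
DAU = 50_000_000
SECONDS_PER_DAY = 86_400
PHOTO_BYTES = 200_000  # 200 KB

# Step 1: QPS
write_qps = DAU * 1 / SECONDS_PER_DAY       # ~580 writes/sec
read_qps = DAU * 50 / SECONDS_PER_DAY       # ~29,000 reads/sec
peak_read_qps = read_qps * 2                # ~58,000 reads/sec

# Step 2: Storage
daily_storage = DAU * PHOTO_BYTES           # 10 TB/day
yearly_replicated = daily_storage * 365 * 3  # ~11 PB/year

# Step 3: Bandwidth (x8 converts bytes/sec to bits/sec)
ingress_mbps = write_qps * PHOTO_BYTES * 8 / 1e6  # just under 1 Gbps
egress_gbps = read_qps * PHOTO_BYTES * 8 / 1e9    # ~46 Gbps

# Step 4: Cache (top 20% of daily read volume)
cache_bytes = DAU * 50 * PHOTO_BYTES * 0.2  # 100 TB of hot data
```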
Common Mistakes
- Forgetting replication. If your database replicates 3x, your storage triples. Always ask: how many copies?
- Using average instead of peak. Systems fail at peak, not at average. Always compute peak QPS.
- Ignoring metadata. A 200 KB image might have 1 KB of metadata (user ID, timestamp, tags, location). For 50M uploads per day, that is 50 GB/day of metadata alone.
- False precision. Do not use a calculator. Round to powers of 10. The estimate should take 5 minutes, not 30.
- Skipping the "so what." Every number should lead to an architectural decision. 11 PB/year means object storage. 46 Gbps egress means CDN. If the number does not change a decision, you did not need to compute it.
Further Reading
- Alex Xu, Back-of-the-Envelope Estimation, System Design Interview. The most widely referenced walkthrough of estimation for interviews.
- Jeff Dean, Latency Numbers Every Programmer Should Know. The canonical reference for order-of-magnitude latency and size numbers.
- Colin Scott, Interactive Latency Numbers. An interactive, updated version of Dean's original numbers with historical trends.
- More Numbers Every Awesome Programmer Must Know, High Scalability. Extends Dean's list with network, disk, and distributed system numbers.
Assignment
You are designing a Twitter-like service with the following requirements:
- 100 million DAU
- Each user posts 2 tweets per day (average tweet: 280 bytes of text, no media)
- Each user reads 200 tweets per day
- Store tweets for 5 years
- 3x replication
Calculate the following:
- Write QPS (average and peak at 2x)
- Read QPS (average and peak at 2x)
- Daily new storage (text only, before replication)
- Total storage for 5 years (with replication and 30% metadata overhead)
- Based on these numbers, what architectural decisions would you make? Should tweets live in a relational database or a NoSQL store? Do you need a CDN? How much cache would you provision?