Step 2: Estimate Scale & Bottlenecks
Session 6.2 · ~5 min read
Why Estimation Matters
After you understand the requirements, the next question is: how big is this system? The answer determines everything that follows. A system serving 1,000 users per day and a system serving 500 million users per day have almost nothing in common architecturally. One runs on a single server. The other requires distributed databases, sharding, caching layers, and CDN infrastructure across multiple continents.
Back-of-the-envelope estimation is the bridge between requirements and architecture. You take the user-level numbers (daily active users, actions per user) and convert them into system-level numbers (queries per second, storage per day, bandwidth). These numbers tell you which components are necessary and which are overkill.
Estimation is not about being precise. It is about knowing which order of magnitude you are in. The difference between 100 QPS and 10,000 QPS is not a tuning problem. It is an architecture problem.
The Estimation Cascade
Every estimation follows the same cascade. You start with users and work your way down to infrastructure numbers.
```mermaid
flowchart LR
    A["DAU<br/>500M"] --> B["Actions/user/day<br/>40 messages"]
    B --> C["Write QPS<br/>~231K"]
    C --> D["Daily Storage<br/>~2 TB"]
    D --> E["Yearly Storage<br/>~730 TB"]
    E --> F["With Replication<br/>~2.2 PB"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#c8a882,color:#ede9e3
    style C fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#222221,stroke:#6b8f71,color:#ede9e3
    style E fill:#191918,stroke:#c47a5a,color:#ede9e3
    style F fill:#191918,stroke:#c47a5a,color:#ede9e3
```
Key Estimation Formulas
Memorize these formulas. They appear in nearly every system design problem.
| Metric | Formula | Notes |
|---|---|---|
| Write QPS | DAU × actions/user/day ÷ 86,400 | 86,400 = seconds in a day |
| Peak QPS | Average QPS × 2 to 5 | Peak factor depends on the application. Social media peaks at evening hours. |
| Read QPS | Write QPS × read/write ratio | Most systems are read-heavy. A 10:1 ratio is common. |
| Daily Storage | Write QPS × 86,400 × avg object size | Or simply: DAU × actions/day × avg object size |
| Yearly Storage | Daily storage × 365 | Multiply by replication factor (typically 3x) |
| Bandwidth | QPS × avg object size | Separate ingress (writes) from egress (reads) |
| Memory (Cache) | Daily reads × avg object size × cache ratio | Cache ratio is typically 20% of daily data (80/20 rule) |
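As a sketch, the formulas in the table translate directly into a few helper functions. The function names and default factors below are illustrative choices, not from any library:

```python
# Back-of-the-envelope estimation helpers mirroring the formulas above.
# Names and default factors are illustrative, not standard APIs.

SECONDS_PER_DAY = 86_400

def write_qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average write QPS: DAU x actions/user/day / 86,400."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

def peak_qps(avg_qps: float, peak_factor: float = 3.0) -> float:
    """Peak QPS: average QPS x a peak factor (2x-5x is typical)."""
    return avg_qps * peak_factor

def read_qps(write_qps_value: float, read_write_ratio: float = 10.0) -> float:
    """Read QPS: write QPS x the read/write ratio (10:1 is common)."""
    return write_qps_value * read_write_ratio

def daily_storage_bytes(dau: int, actions_per_day: float, avg_object_bytes: int) -> float:
    """Daily storage: DAU x actions/day x average object size."""
    return dau * actions_per_day * avg_object_bytes

def yearly_storage_bytes(daily_bytes: float, replication: int = 3) -> float:
    """Yearly storage: daily storage x 365 x replication factor."""
    return daily_bytes * 365 * replication
```

Plugging in the WhatsApp inputs from the next section reproduces the numbers in the cascade.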
Worked Example: WhatsApp Messaging
Let us walk through a full estimation for WhatsApp text messaging. These are the assumed inputs from Session 6.1.
Given:
- 500 million DAU
- Each user sends 40 messages per day (average)
- Average message size: 100 bytes (text content + metadata)
Step 1: Write QPS.
Total messages per day = 500,000,000 × 40 = 20,000,000,000 (20 billion).
Write QPS = 20,000,000,000 ÷ 86,400 ≈ 231,481 writes/second.
Peak QPS (3x factor) ≈ 694,444 writes/second.
Step 2: Daily storage.
Daily storage = 20,000,000,000 × 100 bytes = 2 × 10¹² bytes = 2 TB/day.
Step 3: Yearly storage.
Yearly storage = 2 TB × 365 = 730 TB/year.
With 3x replication = 730 × 3 = 2,190 TB ≈ 2.2 PB/year.
Step 4: Bandwidth.
Ingress bandwidth = 231,481 × 100 bytes ≈ 23 MB/s (incoming messages).
Egress bandwidth depends on fan-out. If each message is read once (1:1 chat), egress ≈ ingress ≈ 23 MB/s. For group chats, multiply by the average group size.
Step 5: Cache memory.
If we cache 20% of daily messages: 0.2 × 2 TB = 400 GB of cache. This fits comfortably in a Redis cluster.
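The whole worked example can be reproduced in a few lines. This sketch uses decimal units throughout (1 TB = 10¹² bytes), which is the convention for quick interview math:

```python
# Sketch of the full WhatsApp estimation cascade, using decimal units.
dau = 500_000_000
messages_per_user_per_day = 40
message_bytes = 100
seconds_per_day = 86_400

messages_per_day = dau * messages_per_user_per_day   # 20 billion messages/day
avg_write_qps = messages_per_day / seconds_per_day   # ~231K writes/second
peak_write_qps = avg_write_qps * 3                   # ~694K at a 3x peak factor

daily_bytes = messages_per_day * message_bytes       # 2e12 bytes = 2 TB/day
yearly_tb = daily_bytes * 365 / 1e12                 # 730 TB/year
replicated_pb = yearly_tb * 3 / 1000                 # ~2.2 PB with 3x replication

ingress_mb_per_s = avg_write_qps * message_bytes / 1e6  # ~23 MB/s ingress
cache_gb = 0.2 * daily_bytes / 1e9                      # 400 GB (80/20 rule)
```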
Visualizing the Cascade
If you plotted these quantities on a single chart, you would need a logarithmic scale, and that is exactly the point. These numbers span wildly different orders of magnitude: QPS is in the hundreds of thousands, while storage is in the terabytes and petabytes. Your job is not to compute them to three decimal places. Your job is to know whether you are dealing with gigabytes or petabytes, because the architecture for each is fundamentally different.
Common Estimation Shortcuts
A few useful approximations to keep in your head during interviews:
| Fact | Value |
|---|---|
| Seconds in a day | ~86,400 (round to ~100,000 for quick math) |
| Seconds in a year | ~31.5 million |
| 1 million requests/day | ~12 QPS |
| 1 byte of ASCII text | 1 character |
| 1 KB | ~1,000 characters, or a short paragraph |
| 1 MB | ~1 high-res photo, or 1 minute of compressed audio |
| 1 GB | ~1,000 high-res photos |
| Typical DB read latency (SSD) | ~1 ms |
| Typical cache read latency | ~0.1 ms (100 μs) |
| Network round-trip (same region) | ~1 ms |
| Network round-trip (cross-continent) | ~100-200 ms |
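The first shortcuts in the table combine into the most common interview conversion: requests per day to QPS. A minimal sketch comparing exact division against the rounded 100,000-seconds-per-day trick:

```python
# "1 million requests/day is ~12 QPS": exact vs. quick mental math.
requests_per_day = 1_000_000

exact_qps = requests_per_day / 86_400   # ~11.57 QPS
quick_qps = requests_per_day / 100_000  # 10 QPS -- same order of magnitude
```

The rounded figure is off by about 15%, which is irrelevant when the question is whether you need one server or one thousand.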
What Interviewers Are Looking For
The interviewer does not expect you to produce exact numbers. They want to see three things.
Structured reasoning. You follow a repeatable process: users to actions to QPS to storage. You do not guess.
Order-of-magnitude awareness. When you say "about 200,000 QPS," the interviewer knows you understand this is a large-scale system that needs distributed infrastructure. If you said "200 QPS" with the same inputs, that signals a fundamental miscalculation.
Connection to architecture. The numbers should inform your design. "At 700K peak QPS, a single database will not handle this. We need sharding." That is the sentence that turns estimation from a math exercise into an engineering decision.
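That last sentence can be made concrete with one more division. The per-node write capacity below (10,000 writes/second) is an assumed figure for illustration, not a benchmark; real capacity depends on hardware, schema, and workload:

```python
# Turning a QPS estimate into a shard count.
# 10,000 writes/s per node is an assumed capacity, not a measured benchmark.
peak_write_qps = 694_444
writes_per_node = 10_000

# Ceiling division: we need enough shards to cover the peak.
shards_needed = -(-peak_write_qps // writes_per_node)  # ~70 shards
```

Whether the answer is 70 or 50 does not matter; what matters is that it is clearly not 1, which is the estimate's whole contribution to the design.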
Further Reading
- System Design Interview, Vol. 1 by Alex Xu, Chapter 2: Back-of-the-envelope Estimation
- High Scalability: Google Pro Tip, Use Back-of-the-Envelope Calculations
- Latency Numbers Every Programmer Should Know (interactive visualization)
- System Design Primer: Back-of-the-envelope Calculations
Assignment
Using the WhatsApp parameters from this session (500M DAU, 40 messages/user/day, 100 bytes/message), calculate the following:
- Write QPS (average and peak at 3x)
- Daily storage in terabytes
- Yearly storage with 3x replication in petabytes
- Cache memory needed if you cache 20% of daily messages
Show your work. Then answer: at these numbers, can you use a single database server? Why or why not?