Step 2: Estimate Scale & Bottlenecks
Session 6.2 · ~5 min read
Why Estimation Matters
After you understand the requirements, the next question is: how big is this system? The answer determines everything that follows. A system serving 1,000 users per day and a system serving 500 million users per day have almost nothing in common architecturally. One runs on a single server. The other requires distributed databases, sharding, caching layers, and CDN infrastructure across multiple continents.
Back-of-the-envelope estimation is the bridge between requirements and architecture. You take the user-level numbers (daily active users, actions per user) and convert them into system-level numbers (queries per second, storage per day, bandwidth). These numbers tell you which components are necessary and which are overkill.
Estimation is not about being precise. It is about knowing which order of magnitude you are in. The difference between 100 QPS and 10,000 QPS is not a tuning problem. It is an architecture problem.
The Estimation Cascade
Every estimation follows the same cascade. You start with users and work your way down to infrastructure numbers.
```mermaid
flowchart LR
    A["DAU<br/>500M"] --> B["Actions/user/day<br/>40 messages"]
    B --> C["Write QPS<br/>~231K"]
    C --> D["Daily Storage<br/>~2 TB"]
    D --> E["Yearly Storage<br/>~730 TB"]
    E --> F["With Replication<br/>~2.2 PB"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#c8a882,color:#ede9e3
    style C fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#222221,stroke:#6b8f71,color:#ede9e3
    style E fill:#191918,stroke:#c47a5a,color:#ede9e3
    style F fill:#191918,stroke:#c47a5a,color:#ede9e3
```
Key Estimation Formulas
Memorize these formulas. They appear in nearly every system design problem.
| Metric | Formula | Notes |
|---|---|---|
| Write QPS | DAU × actions/user/day ÷ 86,400 | 86,400 = seconds in a day |
| Peak QPS | Average QPS × 2 to 5 | Peak factor depends on the application. Social media peaks at evening hours. |
| Read QPS | Write QPS × read/write ratio | Most systems are read-heavy. A 10:1 ratio is common. |
| Daily Storage | Write QPS × 86,400 × avg object size | Or simply: DAU × actions/day × avg object size |
| Yearly Storage | Daily storage × 365 | Multiply by replication factor (typically 3x) |
| Bandwidth | QPS × avg object size | Separate ingress (writes) from egress (reads) |
| Memory (Cache) | Daily reads × avg object size × cache ratio | Cache ratio is typically 20% of daily data (80/20 rule) |
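As a sketch, the formulas in the table translate directly into a few helper functions. The function names and default factors below are illustrative choices, not from any library:

```python
# Back-of-the-envelope estimation helpers mirroring the formulas above.
# Names and default factors are illustrative, not standard APIs.

SECONDS_PER_DAY = 86_400

def write_qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average write QPS: DAU x actions/user/day / 86,400."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY

def peak_qps(avg_qps: float, peak_factor: float = 3.0) -> float:
    """Peak QPS: average QPS x a peak factor (2x-5x is typical)."""
    return avg_qps * peak_factor

def read_qps(write_qps_value: float, read_write_ratio: float = 10.0) -> float:
    """Read QPS: write QPS x the read/write ratio (10:1 is common)."""
    return write_qps_value * read_write_ratio

def daily_storage_bytes(dau: int, actions_per_day: float, avg_object_bytes: int) -> float:
    """Daily storage: DAU x actions/day x average object size."""
    return dau * actions_per_day * avg_object_bytes

def yearly_storage_bytes(daily_bytes: float, replication: int = 3) -> float:
    """Yearly storage: daily storage x 365 x replication factor."""
    return daily_bytes * 365 * replication
```

Plugging in the WhatsApp inputs from the next section reproduces the numbers in the cascade.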
Worked Example: WhatsApp Messaging
Let us walk through a full estimation for WhatsApp text messaging. These are the assumed inputs from Session 6.1.
Given:
- 500 million DAU
- Each user sends 40 messages per day (average)
- Average message size: 100 bytes (text content + metadata)
Step 1: Write QPS.
Total messages per day = 500,000,000 × 40 = 20,000,000,000 (20 billion).
Write QPS = 20,000,000,000 ÷ 86,400 ≈ 231,481 writes/second.
Peak QPS (3x factor) ≈ 694,444 writes/second.
Step 2: Daily storage.
Daily storage = 20,000,000,000 × 100 bytes = 2 × 10¹² bytes = 2 TB/day.
Step 3: Yearly storage.
Yearly storage = 2 TB × 365 = 730 TB/year.
With 3x replication = 730 × 3 = 2,190 TB ≈ 2.2 PB/year.
Step 4: Bandwidth.
Ingress bandwidth = 231,481 × 100 bytes ≈ 23 MB/s (incoming messages).
Egress bandwidth depends on fan-out. If each message is read once (1:1 chat), egress ≈ ingress ≈ 23 MB/s. For group chats, multiply by the average group size.
Step 5: Cache memory.
If we cache 20% of daily messages: 0.2 × 2 TB = 400 GB of cache. This fits comfortably in a Redis cluster.
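The whole worked example can be reproduced in a few lines. This sketch uses decimal units throughout (1 TB = 10¹² bytes), which is the convention for quick interview math:

```python
# Sketch of the full WhatsApp estimation cascade, using decimal units.
dau = 500_000_000
messages_per_user_per_day = 40
message_bytes = 100
seconds_per_day = 86_400

messages_per_day = dau * messages_per_user_per_day   # 20 billion messages/day
avg_write_qps = messages_per_day / seconds_per_day   # ~231K writes/second
peak_write_qps = avg_write_qps * 3                   # ~694K at a 3x peak factor

daily_bytes = messages_per_day * message_bytes       # 2e12 bytes = 2 TB/day
yearly_tb = daily_bytes * 365 / 1e12                 # 730 TB/year
replicated_pb = yearly_tb * 3 / 1000                 # ~2.2 PB with 3x replication

ingress_mb_per_s = avg_write_qps * message_bytes / 1e6  # ~23 MB/s ingress
cache_gb = 0.2 * daily_bytes / 1e9                      # 400 GB (80/20 rule)
```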
Visualizing the Cascade
If you plotted these quantities on a single chart, you would need a logarithmic scale, and that is exactly the point. These numbers span wildly different orders of magnitude: QPS is in the hundreds of thousands, while storage is in the terabytes and petabytes. Your job is not to compute them to three decimal places. Your job is to know whether you are dealing with gigabytes or petabytes, because the architecture for each is fundamentally different.
Common Estimation Shortcuts
A few useful approximations to keep in your head during interviews:
| Fact | Value |
|---|---|
| Seconds in a day | ~86,400 (round to ~100,000 for quick math) |
| Seconds in a year | ~31.5 million |
| 1 million requests/day | ~12 QPS |
| 1 byte of ASCII text | 1 character |
| 1 KB | ~1,000 characters, or a short paragraph |
| 1 MB | ~1 high-res photo, or 1 minute of compressed audio |
| 1 GB | ~1,000 high-res photos |
| Typical DB read latency (SSD) | ~1 ms |
| Typical cache read latency | ~0.1 ms (100 μs) |
| Network round-trip (same region) | ~1 ms |
| Network round-trip (cross-continent) | ~100-200 ms |
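The first shortcuts in the table combine into the most common interview conversion: requests per day to QPS. A minimal sketch comparing exact division against the rounded 100,000-seconds-per-day trick:

```python
# "1 million requests/day is ~12 QPS": exact vs. quick mental math.
requests_per_day = 1_000_000

exact_qps = requests_per_day / 86_400   # ~11.57 QPS
quick_qps = requests_per_day / 100_000  # 10 QPS -- same order of magnitude
```

The rounded figure is off by about 15%, which is irrelevant when the question is whether you need one server or one thousand.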
What Interviewers Are Looking For
The interviewer does not expect you to produce exact numbers. They want to see three things.
Structured reasoning. You follow a repeatable process: users to actions to QPS to storage. You do not guess.
Order-of-magnitude awareness. When you say "about 200,000 QPS," the interviewer knows you understand this is a large-scale system that needs distributed infrastructure. If you said "200 QPS" with the same inputs, that signals a fundamental miscalculation.
Connection to architecture. The numbers should inform your design. "At 700K peak QPS, a single database will not handle this. We need sharding." That is the sentence that turns estimation from a math exercise into an engineering decision.
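That last sentence can be made concrete with one more division. The per-node write capacity below (10,000 writes/second) is an assumed figure for illustration, not a benchmark; real capacity depends on hardware, schema, and workload:

```python
# Turning a QPS estimate into a shard count.
# 10,000 writes/s per node is an assumed capacity, not a measured benchmark.
peak_write_qps = 694_444
writes_per_node = 10_000

# Ceiling division: we need enough shards to cover the peak.
shards_needed = -(-peak_write_qps // writes_per_node)  # ~70 shards
```

Whether the answer is 70 or 50 does not matter; what matters is that it is clearly not 1, which is the estimate's whole contribution to the design.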
Further Reading
- System Design Interview, Vol. 1 by Alex Xu, Chapter 2: Back-of-the-envelope Estimation
- High Scalability: Google Pro Tip, Use Back-of-the-Envelope Calculations
- Latency Numbers Every Programmer Should Know (interactive visualization)
- System Design Primer: Back-of-the-envelope Calculations
Assignment
Using the WhatsApp parameters from this session (500M DAU, 40 messages/user/day, 100 bytes/message), calculate the following:
- Write QPS (average and peak at 3x)
- Daily storage in terabytes
- Yearly storage with 3x replication in petabytes
- Cache memory needed if you cache 20% of daily messages
Show your work. Then answer: at these numbers, can you use a single database server? Why or why not?