
Module 6: System Design Interview Framework

Systems Thinking × System Design · 5 sessions


Why Requirements Come First

Most candidates, when given a system design prompt, start drawing boxes within 30 seconds. Load balancer here, database there, maybe a cache for good measure. Five minutes later, the interviewer asks: "Does this system need to support group messaging?" The candidate pauses. The entire diagram is wrong.

The first five minutes of a system design interview are the most valuable. Not because you produce a diagram, but because you produce clarity. Every minute spent understanding the problem saves eight minutes of wasted design. The ratio is not metaphorical. If you design for the wrong requirements, you will either backtrack painfully or deliver a system that solves the wrong problem.

The best system designers spend 20% of their time on requirements and 80% on the right problem. Everyone else does the opposite.

The Two Categories of Requirements

System requirements split into two categories, and confusing them is a common failure mode.

Functional requirements describe what the system does. These are features visible to the user. "Users can send text messages." "Users can create group chats with up to 256 members." "Messages are delivered in order within a conversation." If you removed a functional requirement, the user would notice immediately because a feature is missing.

Non-functional requirements describe how the system behaves. These are qualities of the system that users experience but do not directly request. Latency under 200ms. 99.99% availability. End-to-end encryption. If you violated a non-functional requirement, the system would still technically work, but it would be slow, unreliable, or insecure.

The distinction matters because functional requirements determine your architecture (what services you build), while non-functional requirements determine your infrastructure (how you build and deploy them).

The Requirement-Gathering Process

Requirement gathering in an interview is not random questioning. It follows a structured flow. Start broad, then narrow. Confirm the scope, then dig into specifics.

```mermaid
flowchart TD
    A["Restate the problem<br/>in your own words"] --> B["Identify core use cases<br/>(the 2-3 must-haves)"]
    B --> C["Ask about users<br/>Who? How many? Where?"]
    C --> D["Clarify functional requirements<br/>What does the system do?"]
    D --> E["Clarify non-functional requirements<br/>Latency, availability, consistency?"]
    E --> F["Identify constraints<br/>Budget, timeline, existing infra?"]
    F --> G["Confirm scope with interviewer<br/>What is in/out?"]
    G --> H["Summarize and proceed<br/>to estimation"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#c8a882,color:#ede9e3
    style C fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#222221,stroke:#6b8f71,color:#ede9e3
    style E fill:#191918,stroke:#c47a5a,color:#ede9e3
    style F fill:#191918,stroke:#c47a5a,color:#ede9e3
    style G fill:#191918,stroke:#8a8478,color:#ede9e3
    style H fill:#191918,stroke:#8a8478,color:#ede9e3
```

Notice the structure. You do not jump straight to "How many requests per second?" That question only makes sense after you know what kind of requests the system handles. You also do not start listing features without confirming users and use cases.

The Categorized Checklist

Use this table as a mental checklist during the requirement-gathering phase. Not every question applies to every problem, but scanning through the categories ensures you do not miss a critical dimension.

| Category | Questions to Ask | Why It Matters |
|---|---|---|
| Functional | What are the core features? Who are the actors? What actions can each actor perform? What data flows in and out? | Determines the services you build and the APIs you expose |
| Non-functional | What latency is acceptable? What availability target (99.9%? 99.99%)? Strong or eventual consistency? Any compliance requirements (GDPR, HIPAA)? | Drives infrastructure choices: replication, caching, regional deployment |
| Scale | How many users (DAU/MAU)? Read-heavy or write-heavy? Peak traffic vs. average? Data growth rate? | Determines whether you need sharding, caching, CDN, or async processing |
| Constraints | Existing tech stack? Budget limitations? Team size? Must it run on a specific cloud? Mobile, web, or both? | Eliminates options early and keeps the design realistic |
| Edge Cases | What happens during failures? How do we handle duplicate submissions? What about network partitions? Offline support? | Separates a good design from a great one. Interviewers probe these. |

The Art of Scoping

A system design interview is typically 45 to 60 minutes. You cannot design an entire production system in that time. Scoping is the skill of deciding what to include and what to defer.

Here is the principle: focus on the core flow that makes the system unique. For a chat application, the core flow is sending and receiving messages. Group management, profile pictures, and story features are secondary. For a URL shortener, the core flow is shortening a URL and redirecting to the original. Analytics dashboards and custom aliases are secondary.

Explicitly state what you are including and what you are excluding. "I will focus on one-to-one messaging and message delivery guarantees. I will defer group messaging, media sharing, and presence indicators unless we have time." This shows the interviewer that you understand prioritization, not that you are avoiding complexity.

Common Mistakes in Requirement Gathering

Assuming instead of asking. "I assume we need to handle 1 billion users." Why assume? Ask. The interviewer might say 10 million, and that changes your entire architecture.

Asking too many questions. Spending 15 minutes on requirements in a 45-minute interview leaves too little time for design. Aim for 5 minutes. Be focused, not exhaustive.

Ignoring non-functional requirements entirely. Many candidates list features but never ask about latency, consistency, or availability. These are the requirements that determine your technical decisions.

Not summarizing. After gathering requirements, state them back to the interviewer. "So we are designing a one-to-one chat system for 500 million DAU, with sub-200ms message delivery, 99.99% availability, and end-to-end encryption. I will focus on message send/receive and delivery guarantees. Does that sound right?" This alignment check prevents you from designing the wrong system.

Worked Example: "Design WhatsApp"

The interviewer says: "Design WhatsApp." Here is how the first five minutes should go.

Restate: "WhatsApp is a real-time messaging application. I want to make sure I design the right subset. Let me ask a few questions."

Users and scale: "How many daily active users should I design for? Are we targeting global users across multiple regions?"

Core features: "Should I focus on one-to-one messaging only, or also group chats? Do we need media sharing (images, video), or just text? What about read receipts and online status?"

Non-functional: "What is the acceptable message delivery latency? Should messages be encrypted end-to-end? Do we guarantee message ordering? If a user is offline, do we store messages for later delivery?"

Constraints: "Should this work on both mobile and web? Do we need to support very low-bandwidth connections?"

Scope confirmation: "Based on your answers, I will focus on one-to-one text messaging with offline delivery and end-to-end encryption for 500 million DAU. I will defer group messaging, media, and stories."

Total time: about four minutes. The interviewer now knows you understand the problem, and you have a clear scope for the remaining 40 minutes.

Further Reading

  • System Design Interview, Vol. 1 by Alex Xu, Chapter 3: A Framework for System Design Interviews
  • Designing Data-Intensive Applications by Martin Kleppmann, Chapter 1: Reliability, Scalability, and Maintainability
  • Gergely Orosz: System Design Interview Guide Review
  • The System Design Primer by Donne Martin (GitHub)

Assignment

You are asked: "Design WhatsApp." Before drawing a single box, write 10 clarifying questions. Organize them into the five categories from the checklist table above (functional, non-functional, scale, constraints, edge cases). For each question, write one sentence explaining why the answer would change your design.

Why Estimation Matters

After you understand the requirements, the next question is: how big is this system? The answer determines everything that follows. A system serving 1,000 users per day and a system serving 500 million users per day have almost nothing in common architecturally. One runs on a single server. The other requires distributed databases, sharding, caching layers, and CDN infrastructure across multiple continents.

Back-of-the-envelope estimation is the bridge between requirements and architecture. You take the user-level numbers (daily active users, actions per user) and convert them into system-level numbers (queries per second, storage per day, bandwidth). These numbers tell you which components are necessary and which are overkill.

Estimation is not about being precise. It is about knowing which order of magnitude you are in. The difference between 100 QPS and 10,000 QPS is not a tuning problem. It is an architecture problem.

The Estimation Cascade

Every estimation follows the same cascade. You start with users and work your way down to infrastructure numbers.

```mermaid
flowchart LR
    A["DAU<br/>500M"] --> B["Actions/user/day<br/>40 messages"]
    B --> C["Write QPS<br/>~231K"]
    C --> D["Daily Storage<br/>~2 TB"]
    D --> E["Yearly Storage<br/>~730 TB"]
    E --> F["With Replication<br/>~2 PB"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#c8a882,color:#ede9e3
    style C fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#222221,stroke:#6b8f71,color:#ede9e3
    style E fill:#191918,stroke:#c47a5a,color:#ede9e3
    style F fill:#191918,stroke:#c47a5a,color:#ede9e3
```

Key Estimation Formulas

Memorize these formulas. They appear in nearly every system design problem.

| Metric | Formula | Notes |
|---|---|---|
| Write QPS | DAU × actions/user/day ÷ 86,400 | 86,400 = seconds in a day |
| Peak QPS | Average QPS × 2 to 5 | Peak factor depends on the application. Social media peaks at evening hours. |
| Read QPS | Write QPS × read/write ratio | Most systems are read-heavy. A 10:1 ratio is common. |
| Daily Storage | Write QPS × 86,400 × avg object size | Or simply: DAU × actions/day × avg object size |
| Yearly Storage | Daily storage × 365 | Multiply by replication factor (typically 3x) |
| Bandwidth | QPS × avg object size | Separate ingress (writes) from egress (reads) |
| Memory (Cache) | Daily reads × avg object size × cache ratio | Cache ratio is typically 20% of daily data (80/20 rule) |

Worked Example: WhatsApp Messaging

Let us walk through a full estimation for WhatsApp text messaging. These are the assumed inputs from Session 6.1.

Given:

  • 500 million DAU
  • Each user sends 40 messages per day (average)
  • Average message size: 100 bytes (text content + metadata)

Step 1: Write QPS.

Total messages per day = 500,000,000 × 40 = 20,000,000,000 (20 billion).

Write QPS = 20,000,000,000 ÷ 86,400 ≈ 231,481 writes/second.

Peak QPS (3x factor) ≈ 694,444 writes/second.

Step 2: Daily storage.

Daily storage = 20,000,000,000 × 100 bytes = 2,000,000,000,000 bytes = 2 TB/day.

Step 3: Yearly storage.

Yearly storage = 2 TB/day × 365 ≈ 730 TB/year.

With 3x replication = 730 × 3 = 2,190 TB ≈ 2 PB/year.

Step 4: Bandwidth.

Ingress bandwidth = 231,481 × 100 bytes ≈ 23 MB/s (incoming messages).

Egress bandwidth depends on fan-out. If each message is read once (1:1 chat), egress ≈ ingress ≈ 23 MB/s. For group messages, multiply by average group size.

Step 5: Cache memory.

If we cache 20% of daily messages: 0.2 × 2 TB ≈ 400 GB of cache. This fits comfortably in a Redis cluster.
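The five steps above compress into a short script. This is a minimal sketch: the constants are the session's stated assumptions, and the variable names are illustrative.

```python
# Estimation cascade for the WhatsApp example, using the inputs
# assumed in this session (500M DAU, 40 msgs/user/day, 100 bytes/msg).

DAU = 500_000_000
MSGS_PER_USER = 40
MSG_SIZE_BYTES = 100
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3
REPLICATION = 3
CACHE_RATIO = 0.2

messages_per_day = DAU * MSGS_PER_USER                 # 20 billion
write_qps = messages_per_day / SECONDS_PER_DAY         # ~231K writes/s
peak_qps = write_qps * PEAK_FACTOR                     # ~694K writes/s
daily_storage = messages_per_day * MSG_SIZE_BYTES      # 2e12 bytes ~ 2 TB
yearly_storage = daily_storage * 365 * REPLICATION     # ~2 PB with 3x
cache_bytes = daily_storage * CACHE_RATIO              # ~400 GB
ingress_mb_s = write_qps * MSG_SIZE_BYTES / 1e6        # ~23 MB/s

print(f"write QPS:      {write_qps:,.0f} (peak {peak_qps:,.0f})")
print(f"daily storage:  {daily_storage / 1e12:.1f} TB")
print(f"yearly storage: {yearly_storage / 1e15:.2f} PB (3x replication)")
print(f"cache memory:   {cache_bytes / 1e9:.0f} GB")
print(f"ingress:        {ingress_mb_s:.0f} MB/s")
```

Change the three inputs at the top and the whole cascade recomputes, which is exactly how you should sanity-check your own estimates.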

Visualizing the Cascade

These numbers span wildly different orders of magnitude, and that is exactly the point: plot them on a chart and only a logarithmic scale makes them fit. QPS is in the hundreds of thousands. Storage is in the terabytes and petabytes. Your job is not to compute these to three decimal places. Your job is to know whether you are dealing with gigabytes or petabytes, because the architecture for each is fundamentally different.

Common Estimation Shortcuts

A few useful approximations to keep in your head during interviews:

| Fact | Value |
|---|---|
| Seconds in a day | ~86,400 (round to ~100,000 for quick math) |
| Seconds in a year | ~31.5 million |
| 1 million requests/day | ~12 QPS |
| 1 byte of ASCII text | 1 character |
| 1 KB | ~1,000 characters, or a short paragraph |
| 1 MB | ~1 high-res photo, or 1 minute of compressed audio |
| 1 GB | ~1,000 high-res photos |
| Typical DB read latency (SSD) | ~1 ms |
| Typical cache read latency | ~0.1 ms (100 μs) |
| Network round-trip (same region) | ~1 ms |
| Network round-trip (cross-continent) | ~100-200 ms |
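The "1 million requests/day is about 12 QPS" shortcut is worth internalizing. A tiny sketch comparing the exact divisor with the rounded one (both function names are illustrative):

```python
# Requests/day -> QPS, exact vs. the quick-math rounding above.

def qps_exact(requests_per_day: float) -> float:
    return requests_per_day / 86_400      # exact seconds in a day

def qps_rough(requests_per_day: float) -> float:
    # Rounding a day to 100,000 seconds underestimates QPS by ~14%,
    # which is fine at order-of-magnitude precision.
    return requests_per_day / 100_000

print(qps_exact(1_000_000))   # ~11.6 -> "1M/day is about 12 QPS"
print(qps_rough(1_000_000))   # 10.0
```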

What Interviewers Are Looking For

The interviewer does not expect you to produce exact numbers. They want to see three things.

Structured reasoning. You follow a repeatable process: users to actions to QPS to storage. You do not guess.

Order-of-magnitude awareness. When you say "about 200,000 QPS," the interviewer knows you understand this is a large-scale system that needs distributed infrastructure. If you said "200 QPS" with the same inputs, that signals a fundamental miscalculation.

Connection to architecture. The numbers should inform your design. "At 700K peak QPS, a single database will not handle this. We need sharding." That is the sentence that turns estimation from a math exercise into an engineering decision.

Further Reading

  • System Design Interview, Vol. 1 by Alex Xu, Chapter 2: Back-of-the-envelope Estimation
  • High Scalability: Google Pro Tip, Use Back-of-the-Envelope Calculations
  • Latency Numbers Every Programmer Should Know (interactive visualization)
  • System Design Primer: Back-of-the-envelope Calculations

Assignment

Using the WhatsApp parameters from this session (500M DAU, 40 messages/user/day, 100 bytes/message), calculate the following:

  1. Write QPS (average and peak at 3x)
  2. Daily storage in terabytes
  3. Yearly storage with 3x replication in petabytes
  4. Cache memory needed if you cache 20% of daily messages

Show your work. Then answer: at these numbers, can you use a single database server? Why or why not?

From Numbers to Architecture

You have your requirements from Session 6.1. You have your scale estimates from Session 6.2. Now you translate both into a diagram. The high-level design (HLD) is the skeleton of your system. Every box represents a component. Every arrow represents data flow. Every component exists because a requirement or a scale constraint demands it.

The HLD is not a wish list of technologies. It is a structural argument. Each component answers a question: what problem does it solve, and why can the adjacent components not solve that problem on their own?

Every box in your diagram should answer: what problem does this solve that the adjacent box cannot? If you cannot answer that question, the box does not belong.

The Standard Components

Most large-scale systems share a common set of building blocks. You do not include all of them in every design. You include the ones your requirements demand.

| Component | Purpose | Include When |
|---|---|---|
| Client (Mobile/Web) | User interface, input/output | Always. Every system has users. |
| CDN | Serve static assets close to users | Media-heavy systems, global user base |
| Load Balancer | Distribute traffic across servers | More than one application server (almost always) |
| API Gateway | Rate limiting, auth, routing, protocol translation | Microservices architecture, public APIs |
| Application Servers | Business logic | Always. This is your service tier. |
| Cache | Reduce database load, improve latency | Read-heavy systems, hot data patterns |
| Database | Persistent storage | Always. Data must survive restarts. |
| Message Queue | Decouple producers from consumers, handle spikes | Async workflows, write spikes, cross-service communication |
| Object Storage | Store large files (images, video, backups) | Media uploads, file sharing, backups |
| Notification Service | Push notifications, email, SMS | Systems that need to alert offline users |

High-Level Design: Chat Application

Building on the WhatsApp example from Sessions 6.1 and 6.2, here is the HLD for a one-to-one chat system handling 500M DAU.

```mermaid
flowchart TB
    subgraph Clients
        M["Mobile App"]
        W["Web App"]
    end
    subgraph Edge
        LB["Load Balancer"]
    end
    subgraph Services
        CS["Chat Service<br/>(WebSocket)"]
        PS["Presence Service"]
        NS["Notification Service"]
        US["User Service"]
    end
    subgraph Storage
        MQ["Message Queue<br/>(Kafka)"]
        DB["Message Store<br/>(Cassandra)"]
        Cache["Session Cache<br/>(Redis)"]
        UDB["User DB<br/>(PostgreSQL)"]
    end
    M & W -->|"WebSocket"| LB
    LB --> CS
    CS -->|"Publish message"| MQ
    MQ -->|"Persist"| DB
    CS -->|"Check online?"| PS
    PS --> Cache
    CS -->|"User offline"| NS
    CS -->|"Auth/profile"| US
    US --> UDB
    style M fill:#222221,stroke:#c8a882,color:#ede9e3
    style W fill:#222221,stroke:#c8a882,color:#ede9e3
    style LB fill:#222221,stroke:#6b8f71,color:#ede9e3
    style CS fill:#191918,stroke:#c8a882,color:#ede9e3
    style PS fill:#191918,stroke:#6b8f71,color:#ede9e3
    style NS fill:#191918,stroke:#c47a5a,color:#ede9e3
    style US fill:#191918,stroke:#8a8478,color:#ede9e3
    style MQ fill:#222221,stroke:#c47a5a,color:#ede9e3
    style DB fill:#222221,stroke:#c8a882,color:#ede9e3
    style Cache fill:#222221,stroke:#6b8f71,color:#ede9e3
    style UDB fill:#222221,stroke:#8a8478,color:#ede9e3
```

Justifying Each Component

Every box in the diagram above has a reason.

WebSocket connections through Load Balancer. Chat requires real-time bidirectional communication. HTTP polling would generate 500M+ unnecessary requests per minute. WebSockets maintain a persistent connection. The load balancer distributes these connections across Chat Service instances.

Chat Service. Handles the core message flow: receive message from sender, determine if recipient is online, deliver or queue for later delivery. This is the heart of the system.

Message Queue (Kafka). At 231K writes/second, writing directly to the database from the Chat Service would create a bottleneck. The queue absorbs write spikes and decouples message ingestion from persistence. If the database is slow for a few seconds, messages queue up instead of being dropped.
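The decoupling argument can be made concrete with a toy model: the producer (chat service) appends at full speed while the consumer (database writer) drains at its own pace, and a burst leaves a backlog rather than dropped messages. A minimal sketch, not Kafka itself:

```python
from collections import deque

# Toy queue between message ingestion and persistence.
queue = deque()
persisted = []

def produce(msg):
    queue.append(msg)            # always fast: append to the log

def consume(batch_size):
    # The slow side (the database writer) drains at its own pace.
    for _ in range(min(batch_size, len(queue))):
        persisted.append(queue.popleft())

# Burst of 10 messages while the DB can only persist 3 this tick.
for i in range(10):
    produce(f"msg-{i}")
consume(3)

assert len(persisted) == 3       # the DB is behind...
assert len(queue) == 7           # ...but the backlog is safe, not lost
```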

Message Store (Cassandra). At nearly 2 PB/year with 3x replication, you need a database designed for high write throughput and horizontal scaling. Cassandra handles this well. We chose it over PostgreSQL for messages because relational features (joins, transactions) are not needed for message storage.

Session Cache (Redis). The Presence Service needs to know instantly whether a user is online and which Chat Service instance holds their WebSocket connection. Redis provides sub-millisecond lookups for this mapping.

Notification Service. When the recipient is offline, the message must still be delivered eventually. The Notification Service handles push notifications (APNs, FCM) to wake the recipient's device.

User Service and PostgreSQL. User profiles, authentication, and contact lists are relational data with low write volume. PostgreSQL is the right fit: ACID transactions for account operations, and the scale is manageable (user profile reads are cacheable).

Defining API Contracts

After the HLD diagram, define the key APIs. You do not need every endpoint, just the ones that serve the core flow.

Send message (WebSocket frame):

```json
{
  "action": "send_message",
  "to": "user_id_456",
  "content": "Hello",
  "timestamp": 1711929600,
  "client_msg_id": "uuid-abc-123"
}
```

Receive message (WebSocket frame):

```json
{
  "action": "new_message",
  "from": "user_id_123",
  "content": "Hello",
  "timestamp": 1711929600,
  "msg_id": "server-generated-id",
  "client_msg_id": "uuid-abc-123"
}
```

Fetch message history (REST):

```
GET /api/v1/conversations/{conversation_id}/messages?before={msg_id}&limit=50
```

Notice the client_msg_id in the send message payload. This is an idempotency key. If the client sends the same message twice (due to a network retry), the server can deduplicate using this ID. This small detail shows the interviewer you think about real-world reliability.
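Server-side deduplication on that key is a few lines. A sketch under loose assumptions: a real system would keep seen IDs in a store with a TTL (e.g. Redis) rather than an in-process set, and the function name is illustrative.

```python
# Idempotent message handling keyed on client_msg_id.
seen_client_ids = set()
stored_messages = []

def handle_send(frame: dict) -> str:
    """Return a server msg_id; a retry carrying the same
    client_msg_id is acknowledged but not stored twice."""
    cid = frame["client_msg_id"]
    if cid in seen_client_ids:
        return f"dup-of-{cid}"           # ack the retry, store nothing
    seen_client_ids.add(cid)
    msg_id = f"srv-{len(stored_messages)}"
    stored_messages.append({**frame, "msg_id": msg_id})
    return msg_id

frame = {"action": "send_message", "to": "user_id_456",
         "content": "Hello", "client_msg_id": "uuid-abc-123"}
first = handle_send(frame)
retry = handle_send(frame)               # network retry: same client_msg_id
assert len(stored_messages) == 1         # stored exactly once
```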

HLD Presentation Tips

Draw top to bottom or left to right. Clients at the top, storage at the bottom. Data flows downward. This is the convention interviewers expect.

Label the arrows. An arrow without a label is ambiguous. Does data flow from service A to service B via HTTP, gRPC, or a message queue? Label it.

Start simple, then elaborate. Draw the minimal HLD first (client, server, database). Then add components one by one as you explain why each is needed. This builds a narrative. It is far more effective than presenting a complex diagram all at once and then trying to explain it.

Separate the read path from the write path. In many systems, reads and writes follow different paths through the architecture. Making this explicit shows depth. For the chat system: the write path goes Client to Chat Service to Kafka to Cassandra. The read path for message history goes Client to Chat Service to Cassandra (or Cache).

Further Reading

  • System Design Interview, Vol. 1 by Alex Xu, Chapter 12: Design a Chat System
  • Facebook Engineering: Building Mobile-First Infrastructure for Messenger
  • InfoQ: The WhatsApp Architecture Facebook Bought for $19 Billion
  • Martin Fowler: Richardson Maturity Model (API design levels)

Assignment

Draw the high-level design for WhatsApp based on your requirements from Session 6.1 and your estimates from Session 6.2. Your diagram should include:

  1. All components from client to storage
  2. Labeled arrows showing data flow and protocols
  3. A one-sentence justification for each component
  4. At least two API contracts for the core flow (send message, receive message, or fetch history)

Compare your diagram with the one in this session. What did you include that we did not? What did you leave out? Both differences are worth examining.

Where Tradeoffs Become Concrete

The high-level design from Session 6.3 shows the structure. Now you fill in the substance. Every box in that diagram represents a technology decision. Which database? Which cache? Which message queue? Which compute model? Each choice comes with tradeoffs, and the interviewer expects you to articulate them.

This is the step that separates candidates who can draw diagrams from candidates who can build systems. A diagram with "Database" written in a box is a sketch. A diagram with "Cassandra, because we need high write throughput with tunable consistency across regions" is engineering.

A tech decision without a stated tradeoff is an opinion. A tech decision with a stated tradeoff is engineering.

The Decision Framework

For each technology decision, use a three-part structure:

  1. State the requirement that drives this decision. ("We need 700K peak writes/second with tunable consistency.")
  2. Name 2-3 viable options and their key differences.
  3. Pick one and state the tradeoff. What do you gain? What do you give up?

This structure takes 30 seconds per decision and demonstrates engineering judgment every time.

Decision Map

Here are the five major technology decisions for most system designs, mapped to options and tradeoffs.

| Decision | Options | Key Tradeoff |
|---|---|---|
| Primary Database | PostgreSQL, MySQL, Cassandra, DynamoDB, MongoDB | Relational integrity vs. horizontal write scalability |
| Cache | Redis, Memcached | Feature richness (Redis: data structures, pub/sub, persistence) vs. simplicity and memory efficiency (Memcached) |
| Message Queue | Kafka, RabbitMQ, SQS, Pulsar | Throughput and durability (Kafka) vs. routing flexibility (RabbitMQ) vs. managed simplicity (SQS) |
| Compute Model | VMs (EC2), Containers (ECS/K8s), Serverless (Lambda) | Control and predictability (VMs) vs. density and portability (containers) vs. zero-ops and pay-per-use (serverless) |
| Monitoring | Prometheus + Grafana, Datadog, CloudWatch, ELK | Cost and control (self-hosted) vs. ease and breadth (managed SaaS) |

Database Selection Decision Tree

Database selection is the most consequential decision in most system designs. This decision tree provides a structured approach.

```mermaid
flowchart TD
    A["What is your data model?"] -->|"Relational<br/>(joins, transactions)"| B["Write volume?"]
    A -->|"Key-value or<br/>wide-column"| C["Scale requirement?"]
    A -->|"Document<br/>(flexible schema)"| D["MongoDB / DynamoDB"]
    A -->|"Graph<br/>(relationships)"| E["Neo4j / Neptune"]
    B -->|"< 50K writes/sec"| F["PostgreSQL / MySQL"]
    B -->|"> 50K writes/sec"| G["CockroachDB / TiDB<br/>(distributed SQL)"]
    C -->|"Single region"| H["Redis / DynamoDB"]
    C -->|"Multi-region,<br/>high throughput"| I["Cassandra / ScyllaDB"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#6b8f71,color:#ede9e3
    style C fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#191918,stroke:#c47a5a,color:#ede9e3
    style E fill:#191918,stroke:#c47a5a,color:#ede9e3
    style F fill:#191918,stroke:#c8a882,color:#ede9e3
    style G fill:#191918,stroke:#c8a882,color:#ede9e3
    style H fill:#191918,stroke:#8a8478,color:#ede9e3
    style I fill:#191918,stroke:#8a8478,color:#ede9e3
```

This tree is a starting point, not a rulebook. Real decisions involve multiple factors simultaneously. But walking through this tree in an interview shows that you approach database selection systematically rather than by habit or preference.

Applying Decisions to WhatsApp

Let us apply the framework to the WhatsApp chat system from Session 6.3.

Message Store: Cassandra. The requirement is 231K average writes/second (700K peak) with horizontal scalability. Messages are written once and read sequentially within a conversation. There are no cross-conversation joins. Cassandra's wide-column model fits perfectly: the partition key is the conversation ID, and messages are sorted by timestamp within the partition. The tradeoff: we give up ACID transactions and ad-hoc query flexibility. We accept this because message storage has a simple, predictable access pattern.
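The access pattern described above can be modeled in a few lines: one partition per conversation, rows kept in timestamp order, and history fetched as a time-ordered scan within a single partition. This is an in-memory sketch of the layout, not a Cassandra driver; the function names are illustrative.

```python
from collections import defaultdict
import bisect

# conversation_id -> list of (timestamp, message), kept sorted,
# mirroring partition key + clustering column ordering.
messages = defaultdict(list)

def write_message(conversation_id, ts, msg):
    bisect.insort(messages[conversation_id], (ts, msg))

def fetch_history(conversation_id, before_ts, limit=50):
    # Time-ordered scan within a single partition: the predictable
    # access pattern that makes the wide-column model a good fit.
    rows = messages[conversation_id]
    idx = bisect.bisect_left(rows, (before_ts, ""))
    return rows[max(0, idx - limit):idx]

write_message("conv-1", 100, "hi")
write_message("conv-1", 200, "hello")
write_message("conv-1", 300, "bye")
print(fetch_history("conv-1", before_ts=250))   # [(100, 'hi'), (200, 'hello')]
```

Note what is absent: no joins, no cross-conversation queries, which is precisely the argument for trading away relational features.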

User Database: PostgreSQL. User profiles, authentication data, and contact lists involve relational data with referential integrity needs. Write volume is low (new users, profile updates). PostgreSQL handles this easily on a single primary with read replicas. The tradeoff: limited horizontal write scaling. We accept this because user write volume is orders of magnitude lower than message volume.

Cache: Redis. We need to store WebSocket session mappings (which server holds which user's connection) and presence data (online/offline). Redis provides sub-millisecond lookups, supports TTL for automatic session expiry, and offers pub/sub for broadcasting presence changes. The tradeoff: data is in memory, so cost scales with data size. We accept this because session data is small per user (a few hundred bytes).
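The session map is simple enough to sketch: user ID to chat-server instance, with a TTL so stale entries expire on their own. This toy model stands in for Redis (which would do the same with SET + TTL and GET); all names are illustrative.

```python
import time

# user_id -> (server, expires_at), with lazy expiry like a Redis TTL.
sessions = {}

def set_session(user_id, server, ttl_seconds=30.0):
    sessions[user_id] = (server, time.monotonic() + ttl_seconds)

def get_server(user_id):
    entry = sessions.get(user_id)
    if entry is None:
        return None
    server, expires_at = entry
    if time.monotonic() >= expires_at:
        del sessions[user_id]        # expired: treat the user as offline
        return None
    return server

set_session("user_id_123", "chat-7", ttl_seconds=30.0)
assert get_server("user_id_123") == "chat-7"   # online, routed to chat-7
assert get_server("user_id_999") is None       # unknown / offline user
```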

Message Queue: Kafka. At 700K peak writes/second, we need a queue that handles sustained high throughput without data loss. Kafka's append-only log provides exactly this. Messages are persisted to disk and replicated across brokers. The tradeoff: Kafka is operationally complex (ZooKeeper/KRaft management, partition rebalancing). We accept this because the alternative (writing directly to Cassandra from the Chat Service) would couple ingestion speed to database write speed, creating a bottleneck during spikes.

Compute: Containers on Kubernetes. The Chat Service maintains WebSocket connections, which are long-lived and stateful. Serverless (Lambda) has a 15-minute execution limit and cold start latency, making it unsuitable for persistent connections. VMs work but waste resources during low-traffic hours. Containers on Kubernetes give us fine-grained autoscaling: scale up Chat Service pods during peak hours, scale down at night. The tradeoff: Kubernetes is operationally heavy. We accept this because at 500M DAU, the team size justifies the infrastructure investment.

The Anti-Pattern: Technology-First Design

A common mistake is choosing technologies before understanding requirements. "I will use Kafka because it is industry standard." But if your system handles 100 writes/second, Kafka is overkill. A simple RabbitMQ or even a PostgreSQL-backed job queue would suffice. Starting with technology leads to over-engineering. Starting with requirements leads to appropriate engineering.

Similarly, do not choose a technology just because you know it well. If the interviewer asks "Why Cassandra?" and your answer is "I have used it before," that is not engineering. A better answer: "Because we need high write throughput with predictable latency at scale, and our access pattern is partition-key lookups with time-ordered scans. Cassandra is optimized for exactly this pattern."

Monitoring: The Forgotten Decision

Most candidates skip monitoring entirely. This is a missed opportunity. Mentioning monitoring shows the interviewer you think about production operations, not just architecture diagrams.

For the WhatsApp system, the critical metrics to monitor are:

  • Message delivery latency (p50, p95, p99): is the core user experience degrading?
  • Kafka consumer lag: are messages piling up faster than they are being persisted?
  • WebSocket connection count per server: is the load balanced evenly?
  • Cassandra write latency: is the database keeping up with the write volume?
  • Cache hit ratio: is the cache effective, or are most requests falling through to the database?

You do not need to design the monitoring system. Just mentioning these metrics and why they matter demonstrates operational maturity.

Further Reading

  • Designing Data-Intensive Applications by Martin Kleppmann, Chapters 2-3: Data Models and Storage Engines
  • Apache Cassandra Data Modeling Documentation
  • AWS Well-Architected Framework, Technology Choices pillar
  • Redis FAQ: When to Use Redis

Assignment

For the WhatsApp high-level design from Session 6.3, make the following four technology decisions. For each one, write exactly three sentences:

  1. Database for messages: state the requirement, name your choice, state the tradeoff.
  2. Cache: state the requirement, name your choice, state the tradeoff.
  3. Message queue: state the requirement, name your choice, state the tradeoff.
  4. Compute model: state the requirement, name your choice, state the tradeoff.

Then list three metrics you would monitor in production, and explain what each metric tells you about system health.

Patterns as Named Tradeoffs

Design patterns are not solutions you apply to problems. They are tradeoffs that someone named. Every pattern solves a specific category of problem by accepting a specific category of cost. Using a pattern without understanding the cost is cargo-culting. Understanding the cost and choosing deliberately is engineering.

This session is a reference card. Six patterns that appear frequently in system design interviews, each with a clear description, a use case, and a tradeoff. Bookmark this page. Return to it before interviews.

Patterns are not solutions. They are named tradeoffs. Knowing the pattern means knowing what you gain and what you pay. Everything else is trivia.

Pattern Reference Table

| Pattern | One-Line Description | When to Use | Main Tradeoff |
|---|---|---|---|
| Read Replica | Replicate a database to separate read traffic from write traffic | Read-heavy workloads where a single DB server cannot handle the read QPS | Replication lag means reads may return stale data (eventual consistency) |
| CQRS | Use separate models (and often separate stores) for reads and writes | Systems where read and write patterns differ fundamentally in shape or scale | Increased system complexity. Two models must be kept in sync. |
| Event Sourcing | Store state as a sequence of immutable events instead of current-state snapshots | Audit trails, undo/redo, temporal queries, financial systems | Event store grows unbounded. Rebuilding current state from events is expensive. |
| Saga | Coordinate multi-service transactions through a sequence of local transactions with compensating actions | Distributed transactions across microservices where 2PC is too slow or unavailable | No atomicity guarantee. Partial failures require compensating transactions (rollback logic). |
| Strangler Fig | Gradually replace a legacy system by routing traffic to new components one feature at a time | Migrating from monolith to microservices, or replacing any legacy system incrementally | Dual-system overhead during migration. Routing logic adds complexity. |
| Cell-Based Architecture | Partition the system into independent, self-contained cells that each serve a subset of users | Extreme reliability requirements. Blast radius containment. | Cross-cell operations are expensive. Data locality must be carefully managed. |

Pattern 1: Read Replica

The simplest scaling pattern. Your primary database handles all writes. One or more replica databases handle reads. The primary replicates data to replicas asynchronously (or synchronously, at the cost of write latency).

When to reach for it: your application is read-heavy (common: 90% reads, 10% writes), and the primary database is overloaded by read queries. Adding read replicas lets you scale reads horizontally without changing your application architecture significantly.

The cost: replication lag. A write to the primary may take 10ms to 500ms (or more, under load) to appear on a replica. During that window, a read from the replica returns stale data. For many applications (social media feeds, product catalogs), this is acceptable. For others (bank balances, inventory counts), it is not.
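Replication lag can be made concrete with a toy model. The sketch below is illustrative only (the class name and lag mechanism are invented for this example, not a real database API): writes land on the primary immediately, but the replica only sees them after a configurable delay, so a read issued inside that window returns stale data.

```python
import time

class ReplicatedStore:
    """Toy primary/replica pair that models asynchronous replication lag."""

    def __init__(self, replication_lag_s=0.2):
        self._primary = {}   # authoritative copy, receives all writes
        self._replica = {}   # eventually-consistent copy, serves all reads
        self._pending = []   # (apply_at, key, value) not yet replicated
        self._lag = replication_lag_s

    def write(self, key, value):
        """All writes go to the primary; replication happens later."""
        self._primary[key] = value
        self._pending.append((time.monotonic() + self._lag, key, value))

    def _replicate(self):
        """Apply every pending write whose lag window has elapsed."""
        now = time.monotonic()
        still_pending = []
        for apply_at, key, value in self._pending:
            if apply_at <= now:
                self._replica[key] = value
            else:
                still_pending.append((apply_at, key, value))
        self._pending = still_pending

    def read(self, key):
        """Reads hit the replica and may return stale data."""
        self._replicate()
        return self._replica.get(key)

store = ReplicatedStore(replication_lag_s=0.1)
store.write("balance", 500)
print(store.read("balance"))   # likely None: the write has not replicated yet
time.sleep(0.15)
print(store.read("balance"))   # 500: the lag window has elapsed
```

For the bank-balance case, a common mitigation is read-your-writes routing: send a user's reads to the primary for a short window after that user writes.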

Pattern 2: CQRS (Command Query Responsibility Segregation)

CQRS takes the read replica idea further. Instead of replicating the same data model, you maintain two separate models: one optimized for writes (commands) and one optimized for reads (queries). The write model might be a normalized relational database. The read model might be a denormalized document store, a search index, or a materialized view.

```mermaid
flowchart LR
  subgraph "Write Side (Commands)"
    C["Client"] -->|"POST /orders"| WS["Write Service"]
    WS --> WDB["Write DB<br/>(normalized)"]
    WDB -->|"Events"| EB["Event Bus"]
  end
  subgraph "Read Side (Queries)"
    EB -->|"Project"| RP["Read Projector"]
    RP --> RDB["Read DB<br/>(denormalized)"]
    RDB --> RS["Read Service"]
    RS -->|"GET /orders"| C2["Client"]
  end
```

The write side accepts commands, validates them, and stores the result in the write database. An event is emitted to an event bus. The read side consumes events, projects them into a read-optimized format, and stores that in the read database. Clients query the read side for data.

The cost: you now maintain two data stores, a projection process, and the event bus between them. If the projector fails or falls behind, the read model becomes stale. Debugging inconsistencies between the two models requires tracing through the event pipeline. This complexity is justified only when the read and write patterns are fundamentally different, for example, when writes are transactional and relational, but reads need full-text search across denormalized documents.
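The projection step is the heart of CQRS. A minimal sketch, with invented event shapes and field names (a real system would consume from a durable event bus and make projection idempotent): the projector folds write-side events into a denormalized view keyed by customer, so a read is a single lookup instead of a join.

```python
# Write side emits events; the projector folds them into a denormalized
# read model keyed by customer, so "orders for customer X" is one lookup.
events = [
    {"type": "OrderPlaced", "order_id": 1, "customer": "alice", "total": 40},
    {"type": "OrderPlaced", "order_id": 2, "customer": "bob",   "total": 15},
    {"type": "OrderShipped", "order_id": 1},
]

read_model = {}     # customer -> list of denormalized order views
orders_by_id = {}   # helper index so later events can find earlier orders

def project(event):
    """Apply one event to the read model (idempotence omitted for brevity)."""
    if event["type"] == "OrderPlaced":
        view = {"order_id": event["order_id"],
                "total": event["total"],
                "status": "placed"}
        orders_by_id[event["order_id"]] = view
        read_model.setdefault(event["customer"], []).append(view)
    elif event["type"] == "OrderShipped":
        orders_by_id[event["order_id"]]["status"] = "shipped"

for e in events:
    project(e)

print(read_model["alice"])  # [{'order_id': 1, 'total': 40, 'status': 'shipped'}]
```

Note that the read model is disposable: if it is ever corrupted or its schema changes, you can rebuild it by replaying the event stream from the start.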

Pattern 3: Event Sourcing

Traditional databases store current state. If a user changes their name from "Alice" to "Bob," the database overwrites "Alice" with "Bob." The history is lost.

Event Sourcing stores every change as an immutable event. Instead of storing "name = Bob," you store two events: "NameSet: Alice" and "NameChanged: Bob." The current state is derived by replaying all events in order.

```mermaid
flowchart TD
  E1["Event 1: AccountCreated<br/>balance = 0"] --> E2["Event 2: Deposited<br/>amount = 500"]
  E2 --> E3["Event 3: Withdrawn<br/>amount = 200"]
  E3 --> E4["Event 4: Deposited<br/>amount = 100"]
  E4 --> S["Current State:<br/>balance = 400"]
```

The advantages: complete audit trail, ability to reconstruct state at any point in time, natural fit for CQRS (events feed the read projector). Financial systems, healthcare records, and collaborative editors benefit greatly from event sourcing.

The cost: the event store grows without bound. After years, replaying millions of events to compute current state is impractical. You need snapshots (periodic state captures) to make rebuilds feasible. Schema evolution of events is also tricky: when you add a new field to an event, you must handle old events that lack that field.
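Both replay and snapshots fit in a few lines. This is a sketch under simplifying assumptions (events are plain tuples, the snapshot is hardcoded rather than persisted): the same fold function computes current state from the full history or from a snapshot plus the tail.

```python
def apply(balance, event):
    """Fold one event into the current balance."""
    kind, amount = event
    if kind == "Deposited":
        return balance + amount
    if kind == "Withdrawn":
        return balance - amount
    return balance  # AccountCreated leaves the balance at its initial value

events = [("AccountCreated", 0), ("Deposited", 500),
          ("Withdrawn", 200), ("Deposited", 100)]

# Full replay: fold every event from the beginning of time.
balance = 0
for e in events:
    balance = apply(balance, e)
print(balance)  # 400, matching the diagram above

# Snapshot: persist (event_index, state) periodically, then replay only
# the events recorded after the snapshot was taken.
snapshot = (2, 500)   # state was 500 after the first two events
idx, balance = snapshot
for e in events[idx:]:
    balance = apply(balance, e)
print(balance)  # 400 again, but only two events were replayed
```

The snapshot is an optimization, never the source of truth: if the fold logic changes, you discard snapshots and replay from the events.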

Pattern 4: Saga

In a monolith, you wrap multiple operations in a database transaction. Either all succeed or all roll back. In a microservices architecture, a single business operation (e.g., "place an order") may span three services: Order Service, Payment Service, and Inventory Service. You cannot use a database transaction across services.

A Saga breaks the distributed transaction into a sequence of local transactions. Each service performs its local transaction and publishes an event. If any step fails, the Saga executes compensating transactions to undo the previous steps.

Example: Place Order Saga. Step 1: Order Service creates order (status: pending). Step 2: Payment Service charges the customer. Step 3: Inventory Service reserves stock. If Payment fails, the compensating action is: Order Service cancels the order. If Inventory fails after payment succeeds, the compensating actions are: Payment Service refunds the charge, then Order Service cancels the order.

The cost: you must design and implement compensating transactions for every step. Some operations are difficult to compensate (you cannot "unsend" an email). Sagas provide eventual consistency but not atomicity. During execution, intermediate states are visible (the order exists but is not yet paid). Your application must handle these intermediate states gracefully.
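The orchestration logic itself is small; the hard part is writing correct compensations. A minimal orchestrator sketch (function names and the log-based services are invented for illustration): each step is an (action, compensate) pair, and a failure triggers the completed compensations in reverse order.

```python
def run_saga(steps):
    """Run (action, compensate) pairs in order; on failure, undo in reverse."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()   # compensating transactions, newest first
            return "rolled back"
    return "committed"

log = []

def fail_inventory():
    raise RuntimeError("out of stock")   # Step 3 fails after payment succeeded

place_order_saga = [
    (lambda: log.append("order created"),   lambda: log.append("order cancelled")),
    (lambda: log.append("payment charged"), lambda: log.append("payment refunded")),
    (fail_inventory,                        lambda: None),
]

print(run_saga(place_order_saga))  # rolled back
print(log)  # order created, payment charged, payment refunded, order cancelled
```

This mirrors the example above: inventory fails last, so the refund runs before the cancellation. In production the orchestrator must also persist its own progress, so a crash mid-saga can resume or compensate on restart.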

Pattern 5: Strangler Fig

Named after the strangler fig tree that grows around a host tree and eventually replaces it, this pattern is for migrations. You place a routing layer (often an API gateway or reverse proxy) in front of the legacy system. For each feature you migrate, you route that traffic to the new system. The legacy system continues to serve everything else. Over time, more and more traffic goes to the new system until the legacy system handles nothing and can be decommissioned.

The cost: during migration, you operate two systems simultaneously. You need a routing layer that understands which features are migrated and which are not. Data may need to be synchronized between old and new systems. The migration can take months or years. But the alternative, rewriting the entire system and switching over in one shot, carries far higher risk.
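The routing layer can start as something very simple. A sketch with invented path prefixes and backend names: the set of migrated features grows over the course of the migration, and everything not in it falls through to the legacy system.

```python
# Routing layer in front of a migration: requests for migrated features go to
# the new system; everything else falls through to the legacy system.
MIGRATED_PREFIXES = {"/search", "/profile"}   # grows as migration proceeds

def route(path):
    """Return which backend should serve this request path."""
    if any(path.startswith(prefix) for prefix in MIGRATED_PREFIXES):
        return "new-system"
    return "legacy-system"

print(route("/search?q=figs"))   # new-system
print(route("/checkout"))        # legacy-system
```

In practice this lives in an API gateway or reverse proxy config rather than application code, but the shape is the same: a routing table that shrinks the legacy system's territory one prefix at a time.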

Pattern 6: Cell-Based Architecture

Cell-Based Architecture partitions the entire system into independent, self-contained units called cells. Each cell serves a subset of users (typically assigned by user ID or region). Each cell has its own complete stack: load balancer, application servers, database, cache. Cells share nothing.

The primary benefit is blast radius containment. If a cell fails (bad deployment, database corruption, overload), only the users assigned to that cell are affected. The other cells continue operating normally. This is how AWS structures some of its own services internally.

The cost: cross-cell operations are expensive and complex. If User A (in Cell 1) sends a message to User B (in Cell 3), the request must cross cell boundaries. Data must be carefully partitioned so that most operations stay within a single cell. You also pay the operational overhead of managing many identical but independent infrastructure stacks.
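Cell assignment is usually a stable hash of the partition key. A sketch (cell names and the modulo scheme are illustrative; real systems often use a mapping service so cells can be added without mass reshuffling): every router must map the same user to the same cell, which rules out Python's per-process randomized `hash()`.

```python
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

def cell_for_user(user_id):
    """Stable assignment: hash the user ID, pick a cell by modulo.

    A stable hash (not the builtin hash(), which is randomized per process)
    is required so every router agrees on the mapping across restarts.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(CELLS)
    return CELLS[index]

print(cell_for_user("alice"))   # always the same cell for alice
print(cell_for_user("alice") == cell_for_user("alice"))  # True: deterministic
```

A message from User A to User B then routes by comparing `cell_for_user(a)` and `cell_for_user(b)`: same cell, handle locally; different cells, cross the boundary and pay the cost described above.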

Choosing the Right Pattern

In an interview, you will not use all six patterns in a single design. Most designs use one or two. The skill is knowing which pattern fits the problem at hand. Here is a quick decision guide:

  • Reads overwhelming your database? Start with Read Replicas. If reads and writes have fundamentally different shapes, consider CQRS.
  • Need a complete audit trail or temporal queries? Event Sourcing.
  • Multi-service transaction that must eventually be consistent? Saga.
  • Migrating a legacy system incrementally? Strangler Fig.
  • Extreme availability requirements with blast radius isolation? Cell-Based Architecture.

Further Reading

  • Martin Fowler: CQRS
  • Martin Fowler: Event Sourcing
  • Microservices.io: Saga Pattern by Chris Richardson
  • Martin Fowler: Strangler Fig Application
  • AWS: What is Cell-Based Architecture

Assignment

Create your own cheat sheet. For each of the six patterns in this session, write in your own words:

  1. A one-line description (no more than 15 words)
  2. When you would use it (one specific scenario)
  3. The main tradeoff (what you gain vs. what you pay)

Do not copy from the table above. Rewriting in your own words is how you internalize the material. If you cannot explain a pattern without looking at the table, you do not understand it yet.

© Ibrahim Anwar · Bogor, West Java
This work is licensed under CC BY 4.0