Module 6: System Design Interview Framework

Where Tradeoffs Become Concrete

The high-level design from Session 6.3 shows the structure. Now you fill in the substance. Every box in that diagram represents a technology decision. Which database? Which cache? Which message queue? Which compute model? Each choice comes with tradeoffs, and the interviewer expects you to articulate them.

This is the step that separates candidates who can draw diagrams from candidates who can build systems. A diagram with "Database" written in a box is a sketch. A diagram with "Cassandra, because we need high write throughput with tunable consistency across regions" is engineering.

A tech decision without a stated tradeoff is an opinion. A tech decision with a stated tradeoff is engineering.

The Decision Framework

For each technology decision, use a three-part structure:

  1. State the requirement that drives this decision. ("We need 700K peak writes/second with tunable consistency.")
  2. Name 2-3 viable options and their key differences.
  3. Pick one and state the tradeoff. What do you gain? What do you give up?

This structure takes 30 seconds per decision and demonstrates engineering judgment every time.
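The three-part structure can be captured as a simple record. A minimal Python sketch (the class and field names are my own, not part of the course material):

```python
from dataclasses import dataclass

@dataclass
class TechDecision:
    """One technology decision, stated in the three-part interview structure."""
    requirement: str     # 1. the requirement that drives the decision
    options: list[str]   # 2. the 2-3 viable options considered
    choice: str          # 3. the option selected
    tradeoff: str        #    what we give up by choosing it

    def summary(self) -> str:
        return (f"Requirement: {self.requirement}. "
                f"Options: {', '.join(self.options)}. "
                f"Choice: {self.choice}. Tradeoff: {self.tradeoff}.")

message_store = TechDecision(
    requirement="700K peak writes/second with tunable consistency",
    options=["PostgreSQL", "Cassandra", "DynamoDB"],
    choice="Cassandra",
    tradeoff="no ACID transactions or ad-hoc queries",
)
print(message_store.summary())
```

Filling in one of these records per box in your diagram is the whole exercise; the `summary()` string is roughly the 30-second statement you would say out loud.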

Decision Map

Here are the five major technology decisions for most system designs, mapped to options and tradeoffs.

| Decision | Options | Key Tradeoff |
| --- | --- | --- |
| Primary Database | PostgreSQL, MySQL, Cassandra, DynamoDB, MongoDB | Relational integrity vs. horizontal write scalability |
| Cache | Redis, Memcached | Feature richness (Redis: data structures, pub/sub, persistence) vs. simplicity and memory efficiency (Memcached) |
| Message Queue | Kafka, RabbitMQ, SQS, Pulsar | Throughput and durability (Kafka) vs. routing flexibility (RabbitMQ) vs. managed simplicity (SQS) |
| Compute Model | VMs (EC2), Containers (ECS/K8s), Serverless (Lambda) | Control and predictability (VMs) vs. density and portability (containers) vs. zero-ops and pay-per-use (serverless) |
| Monitoring | Prometheus + Grafana, Datadog, CloudWatch, ELK | Cost and control (self-hosted) vs. ease and breadth (managed SaaS) |

Database Selection Decision Tree

Database selection is the most consequential decision in most system designs. This decision tree provides a structured approach.

```mermaid
flowchart TD
    A["What is your data model?"] -->|"Relational (joins, transactions)"| B["Write volume?"]
    A -->|"Key-value or wide-column"| C["Scale requirement?"]
    A -->|"Document (flexible schema)"| D["MongoDB / DynamoDB"]
    A -->|"Graph (relationships)"| E["Neo4j / Neptune"]
    B -->|"< 50K writes/sec"| F["PostgreSQL / MySQL"]
    B -->|"> 50K writes/sec"| G["CockroachDB / TiDB (distributed SQL)"]
    C -->|"Single region"| H["Redis / DynamoDB"]
    C -->|"Multi-region, high throughput"| I["Cassandra / ScyllaDB"]
    style A fill:#222221,stroke:#c8a882,color:#ede9e3
    style B fill:#222221,stroke:#6b8f71,color:#ede9e3
    style C fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#191918,stroke:#c47a5a,color:#ede9e3
    style E fill:#191918,stroke:#c47a5a,color:#ede9e3
    style F fill:#191918,stroke:#c8a882,color:#ede9e3
    style G fill:#191918,stroke:#c8a882,color:#ede9e3
    style H fill:#191918,stroke:#8a8478,color:#ede9e3
    style I fill:#191918,stroke:#8a8478,color:#ede9e3
```

This tree is a starting point, not a rulebook. Real decisions involve multiple factors simultaneously. But walking through this tree in an interview shows that you approach database selection systematically rather than by habit or preference.
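The branches above can also be written down as a function. A sketch that mirrors the tree directly (the thresholds and database names come straight from the diagram):

```python
def choose_database(data_model: str, writes_per_sec: int = 0,
                    multi_region: bool = False) -> str:
    """Walk the database decision tree. Returns a suggested database family."""
    if data_model == "relational":
        if writes_per_sec < 50_000:
            return "PostgreSQL / MySQL"
        return "CockroachDB / TiDB (distributed SQL)"
    if data_model == "key-value":  # key-value or wide-column
        if multi_region:
            return "Cassandra / ScyllaDB"
        return "Redis / DynamoDB"
    if data_model == "document":
        return "MongoDB / DynamoDB"
    if data_model == "graph":
        return "Neo4j / Neptune"
    raise ValueError(f"unknown data model: {data_model}")

# WhatsApp's message store: wide-column access, multi-region, high throughput
print(choose_database("key-value", multi_region=True))  # Cassandra / ScyllaDB
```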

Applying Decisions to WhatsApp

Let us apply the framework to the WhatsApp chat system from Session 6.3.

Message Store: Cassandra. The requirement is 231K average writes/second (700K peak) with horizontal scalability. Messages are written once and read sequentially within a conversation. There are no cross-conversation joins. Cassandra's wide-column model fits perfectly: the partition key is the conversation ID, and messages are sorted by timestamp within the partition. The tradeoff: we give up ACID transactions and ad-hoc query flexibility. We accept this because message storage has a simple, predictable access pattern.
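The access pattern is easier to see in miniature. This is not real Cassandra, just a toy in-memory model of its wide-column shape: writes append under a partition key (the conversation ID), and reads scan one partition in timestamp order.

```python
from collections import defaultdict

# partition key (conversation ID) -> rows sorted by clustering key (timestamp)
message_store: dict[str, list[tuple[int, str]]] = defaultdict(list)

def write_message(conversation_id: str, ts: int, body: str) -> None:
    # Single-partition append: the write path Cassandra is optimized for.
    message_store[conversation_id].append((ts, body))

def read_conversation(conversation_id: str) -> list[str]:
    # Sequential read within one partition, ordered by timestamp.
    return [body for ts, body in sorted(message_store[conversation_id])]

write_message("conv-42", 2, "see you at 8")
write_message("conv-42", 1, "dinner tonight?")
write_message("conv-99", 1, "unrelated chat")
print(read_conversation("conv-42"))  # ['dinner tonight?', 'see you at 8']
```

Note what is absent: no joins, no cross-partition queries. That absence is exactly why giving up ACID transactions costs so little here.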

User Database: PostgreSQL. User profiles, authentication data, and contact lists involve relational data with referential integrity needs. Write volume is low (new users, profile updates). PostgreSQL handles this easily on a single primary with read replicas. The tradeoff: limited horizontal write scaling. We accept this because user write volume is orders of magnitude lower than message volume.
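Here is the referential-integrity requirement in miniature, using Python's bundled sqlite3 as a stand-in for PostgreSQL (the table and column names are illustrative, not from the course):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite needs this enabled explicitly
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE contacts (
        user_id    INTEGER NOT NULL REFERENCES users(id),
        contact_id INTEGER NOT NULL REFERENCES users(id),
        PRIMARY KEY (user_id, contact_id)
    )""")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")
conn.execute("INSERT INTO contacts VALUES (1, 2)")  # alice knows bob: accepted

try:
    conn.execute("INSERT INTO contacts VALUES (1, 999)")  # no such user
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # the database enforces the invariant for us
```

This is the guarantee you trade away with Cassandra and keep with PostgreSQL: a contact list can never point at a user who does not exist.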

Cache: Redis. We need to store WebSocket session mappings (which server holds which user's connection) and presence data (online/offline). Redis provides sub-millisecond lookups, supports TTL for automatic session expiry, and offers pub/sub for broadcasting presence changes. The tradeoff: data is in memory, so cost scales with data size. We accept this because session data is small per user (a few hundred bytes).
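What the session mapping does can be sketched in plain Python. This is a toy stand-in for Redis SET-with-expiry plus GET, with an injectable clock so the expiry is visible; the key scheme is an assumption:

```python
import time

class SessionMap:
    """Toy model of Redis-style key -> value with TTL."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time)

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazy expiry on access, as Redis also does
            return None
        return value

# Which chat server holds user 123's WebSocket connection?
now = [0.0]
sessions = SessionMap(clock=lambda: now[0])
sessions.set("ws:user:123", "chat-server-7", ttl_seconds=3600)
print(sessions.get("ws:user:123"))  # chat-server-7
now[0] += 3601                      # an hour passes with no heartbeat refresh
print(sessions.get("ws:user:123"))  # None: the stale session expired on its own
```

The TTL is doing real work: if a Chat Service instance crashes without cleaning up, its session entries disappear on their own instead of routing messages to a dead server.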

Message Queue: Kafka. At 700K peak writes/second, we need a queue that handles sustained high throughput without data loss. Kafka's append-only log provides exactly this. Messages are persisted to disk and replicated across brokers. The tradeoff: Kafka is operationally complex (ZooKeeper/KRaft management, partition rebalancing). We accept this because the alternative (writing directly to Cassandra from the Chat Service) would couple ingestion speed to database write speed, creating a bottleneck during spikes.
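The decoupling argument in miniature: a buffer between a bursty producer and a fixed-rate consumer absorbs the spike instead of pushing back on ingestion. A toy sketch with an in-memory deque (Kafka itself is a durable, replicated on-disk log, not an in-memory queue; the numbers are illustrative):

```python
from collections import deque

buffer: deque[str] = deque()   # stand-in for the Kafka topic
stored: list[str] = []         # stand-in for Cassandra

DB_WRITES_PER_TICK = 3         # the database's fixed write capacity per tick

def ingest(messages: list[str]) -> None:
    buffer.extend(messages)    # ingestion never blocks on the database

def drain_tick() -> None:
    for _ in range(min(DB_WRITES_PER_TICK, len(buffer))):
        stored.append(buffer.popleft())

ingest([f"msg-{i}" for i in range(10)])  # a spike: 10 messages at once
for _ in range(4):                       # four ticks at 3 writes/tick
    drain_tick()
print(len(stored), len(buffer))  # 10 0: the spike drained without loss
```

Without the buffer, the spike would have to be rejected or would stall the Chat Service; with it, ingestion speed and database write speed are independent.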

Compute: Containers on Kubernetes. The Chat Service maintains WebSocket connections, which are long-lived and stateful. Serverless (Lambda) has a 15-minute execution limit and cold start latency, making it unsuitable for persistent connections. VMs work but waste resources during low-traffic hours. Containers on Kubernetes give us fine-grained autoscaling: scale up Chat Service pods during peak hours, scale down at night. The tradeoff: Kubernetes is operationally heavy. We accept this because at 500M DAU, the team size justifies the infrastructure investment.
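The "fine-grained autoscaling" claim has concrete arithmetic behind it: Kubernetes' Horizontal Pod Autoscaler scales on the documented rule desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A sketch applying it to connection counts (the per-pod target of 50K connections is an assumption, not from the course):

```python
from math import ceil

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """The HPA scaling rule: desired = ceil(current * metric / target)."""
    return ceil(current_replicas * (current_metric / target_metric))

# Assume each Chat Service pod targets 50K WebSocket connections.
TARGET_CONNECTIONS_PER_POD = 50_000

# Peak hour: 20 pods averaging 80K connections each -> scale up.
print(desired_replicas(20, 80_000, TARGET_CONNECTIONS_PER_POD))  # 32

# Overnight: 32 pods averaging 10K connections each -> scale down.
print(desired_replicas(32, 10_000, TARGET_CONNECTIONS_PER_POD))  # 7
```

This is the resource elasticity VMs lack at this granularity: capacity tracks the connection count hour by hour instead of being provisioned for the peak.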

The Anti-Pattern: Technology-First Design

A common mistake is choosing technologies before understanding requirements. "I will use Kafka because it is industry standard." But if your system handles 100 writes/second, Kafka is overkill. A simple RabbitMQ or even a PostgreSQL-backed job queue would suffice. Starting with technology leads to over-engineering. Starting with requirements leads to appropriate engineering.

Similarly, do not choose a technology just because you know it well. If the interviewer asks "Why Cassandra?" and your answer is "I have used it before," that is not engineering. A better answer: "Because we need high write throughput with predictable latency at scale, and our access pattern is partition-key lookups with time-ordered scans. Cassandra is optimized for exactly this pattern."

Monitoring: The Forgotten Decision

Most candidates skip monitoring entirely. This is a missed opportunity. Mentioning monitoring shows the interviewer you think about production operations, not just architecture diagrams.

For the WhatsApp system, the critical metrics to monitor are:

  1. End-to-end message delivery latency (p50 and p99): the user-facing health signal. A rising p99 means messages are arriving late even while averages look fine.
  2. Kafka consumer lag: how far persistence trails ingestion. Growing lag means the Cassandra writers are falling behind a traffic spike.
  3. WebSocket connections per Chat Service instance: drives autoscaling and exposes uneven load across servers.
  4. Message send error rate: failed writes or deliveries, broken down by cause, catch regressions before users report them.

You do not need to design the monitoring system. Just mentioning these metrics and why they matter demonstrates operational maturity.

Assignment

For the WhatsApp high-level design from Session 6.3, make the following four technology decisions. For each one, write exactly three sentences:

  1. Database for messages: state the requirement, name your choice, state the tradeoff.
  2. Cache: state the requirement, name your choice, state the tradeoff.
  3. Message queue: state the requirement, name your choice, state the tradeoff.
  4. Compute model: state the requirement, name your choice, state the tradeoff.

Then list three metrics you would monitor in production, and explain what each metric tells you about system health.