
Module 7: Real-World Case Studies I

Systems Thinking × System Design · 8 sessions


Why URL Shorteners Matter

A URL shortener converts a long URL like https://example.com/articles/2026/04/very-long-slug-that-nobody-wants-to-type into something like https://short.ly/a3Bx9k. That is the entire product. The simplicity is deceptive.

Bitly processes over 10 billion clicks per month and creates roughly 256 million short links monthly. The read-to-write ratio is roughly 40:1 at Bitly's scale, and can reach 100:1 or higher for popular link shorteners. This asymmetry is the defining characteristic of the system and drives every architectural decision.

Key insight: The simplest system design problem teaches the most fundamental lesson: reads dominate writes in almost every system. A URL shortener makes this ratio visible and unavoidable.

Capacity Estimation

Before drawing any diagrams, pin down the numbers. Assume a service handling 100 million new URLs per month, with a 100:1 read-to-write ratio. Store URLs for 5 years.

| Metric | Calculation | Result |
|---|---|---|
| New URLs / month | Given | 100M |
| New URLs / second | 100M / (30 × 86,400) | ~40 writes/sec |
| Redirects / second | 40 × 100 | ~4,000 reads/sec |
| Total URLs over 5 years | 100M × 60 months | 6 billion |
| Total storage (avg 500 bytes/URL) | 6B × 500 bytes | ~3 TB |
| Short key length needed | 62^7 ≈ 3.5 trillion | 7 chars (base62) |

A 7-character base62 key gives 3.5 trillion possible combinations. For 6 billion URLs, that is a collision probability so low you can ignore it with a proper generation strategy. But "so low you can ignore it" only holds if your generation strategy is actually proper. That is where the real design starts.
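The estimation table can be reproduced as a quick script, useful for re-running the numbers with different assumptions (as the assignment below asks you to do):

```python
# Back-of-envelope capacity estimation for the URL shortener.
# All inputs mirror the assumptions stated in the text.

new_urls_per_month = 100_000_000
read_write_ratio = 100
retention_months = 5 * 12
avg_url_size_bytes = 500

writes_per_sec = new_urls_per_month / (30 * 86_400)   # ~40
reads_per_sec = writes_per_sec * read_write_ratio     # ~4,000
total_urls = new_urls_per_month * retention_months    # 6 billion
storage_bytes = total_urls * avg_url_size_bytes       # ~3 TB
keyspace = 62 ** 7                                    # ~3.5 trillion

print(f"writes/sec: {writes_per_sec:.0f}")
print(f"reads/sec:  {reads_per_sec:.0f}")
print(f"total URLs: {total_urls:,}")
print(f"storage:    {storage_bytes / 1e12:.1f} TB")
print(f"keyspace:   {keyspace:,}")
```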

Short Key Generation: Three Approaches

There are three common strategies, each with distinct tradeoffs.

1. Random generation with collision check

Generate a random 7-character string. Check the database. If it already exists, generate another. Simple and stateless. The problem: as the keyspace fills, collision rate increases. At 6 billion keys out of 3.5 trillion, the probability of a collision on any single attempt is about 0.17%. That means roughly 1 in 600 writes needs a retry. Manageable, but not free.
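A minimal sketch of this strategy, with a plain dict standing in for the key-value store (the names `random_key` and `shorten` are illustrative, not from any library):

```python
import secrets
import string

BASE62 = string.ascii_letters + string.digits  # 62 characters

def random_key(length=7):
    """Generate a random base62 key of the given length."""
    return "".join(secrets.choice(BASE62) for _ in range(length))

def shorten(long_url, db, max_attempts=5):
    """Store long_url under a fresh random key, retrying on collision.

    `db` stands in for the key-value store: any dict-like object.
    At a ~0.17% collision rate, 5 attempts is effectively always enough.
    """
    for _ in range(max_attempts):
        key = random_key()
        if key not in db:          # the collision check
            db[key] = long_url
            return key
    raise RuntimeError("keyspace too full: repeated collisions")
```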

2. Hash-based (MD5/SHA256 truncation)

Hash the long URL. Take the first 7 characters of the base62-encoded hash. Deterministic: the same input always produces the same output. The problem: different URLs can produce the same 7-character prefix. You still need a collision check, and you have lost the randomness that distributes keys evenly.

3. Counter-based with base62 encoding

Use a global auto-incrementing counter. Encode the counter value in base62. Counter 1 becomes 0000001. Counter 62 becomes 0000010. No collisions by definition. The problem: sequential keys are predictable. Anyone can guess the next URL. And you need a distributed counter that does not duplicate values across servers.

The counter approach is the most reliable at scale. Predictability can be mitigated by adding a simple shuffle or XOR step before encoding. The distributed counter problem is solved by pre-allocating ranges: Server A gets IDs 1-1,000,000, Server B gets 1,000,001-2,000,000. No coordination is needed until a range is exhausted.
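A sketch of the encoding and the range pre-allocation (the `RangeCounter` class and its names are hypothetical; a real service would persist `next_id` and fetch fresh ranges from a coordinator):

```python
BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62_encode(n, width=7):
    """Encode a non-negative counter value as a zero-padded base62 string."""
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits)).rjust(width, "0")

class RangeCounter:
    """Per-server counter handing out keys from a pre-allocated ID range."""
    def __init__(self, start, end):
        self.next_id, self.end = start, end

    def next_key(self):
        if self.next_id > self.end:
            raise RuntimeError("range exhausted: request a new one")
        key = base62_encode(self.next_id)
        self.next_id += 1
        return key
```

With this digit alphabet, counter 1 encodes to `0000001` and counter 62 to `0000010`, matching the examples above.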

```mermaid
flowchart TD
    A[Generate Short Key] --> B{Which Strategy?}
    B -->|Random| C[Generate random 7-char string]
    C --> D{Exists in DB?}
    D -->|Yes| C
    D -->|No| E[Store mapping]
    B -->|Hash| F[Hash long URL]
    F --> G[Take first 7 chars base62]
    G --> H{Exists in DB?}
    H -->|Yes| I[Append counter, re-hash]
    H -->|No| E
    B -->|Counter| J[Get next ID from counter service]
    J --> K[Base62 encode]
    K --> E
```

High-Level Design

The system has two flows: URL creation and URL redirection. Both go through a load balancer to a stateless application layer, which talks to a cache and a database.

```mermaid
flowchart LR
    subgraph Write Path
        C1[Client] -->|POST /shorten| LB1[Load Balancer]
        LB1 --> APP1[App Server]
        APP1 --> KG[Key Generator]
        APP1 --> DB[(Database)]
    end
    subgraph Read Path
        C2[Client] -->|GET /a3Bx9k| LB2[Load Balancer]
        LB2 --> APP2[App Server]
        APP2 --> CACHE[(Cache)]
        CACHE -->|miss| DB2[(Database)]
        APP2 -->|301/302| C2
    end
```

The database is a key-value store. The key is the short code. The value is the original URL plus metadata (creation date, expiration, user ID, click count). DynamoDB, Cassandra, or even a sharded MySQL table all work here. The access pattern is simple: point lookups by key. No range queries. No joins. This is where key-value stores shine.

Caching: The 80/20 Rule in Action

At a 100:1 read-to-write ratio, caching is not optional. It is the primary scaling mechanism. A small percentage of shortened URLs receive a disproportionate share of traffic. A viral tweet with a Bitly link might get millions of clicks in hours, while most links are clicked fewer than 10 times total.

Place a Redis or Memcached layer between the application servers and the database. Use an LRU (Least Recently Used) eviction policy. With even a modest cache, you can absorb 80-90% of read traffic without hitting the database.

Cache sizing math: if 20% of URLs account for 80% of traffic, and we have 6 billion total URLs, we need to cache roughly 1.2 billion entries. At 500 bytes each, that is 600 GB. Large, but well within what a Redis cluster can handle.

The 301 vs. 302 Decision

When a user clicks a short URL, the server responds with an HTTP redirect. The choice between 301 (Moved Permanently) and 302 (Found / Temporary Redirect) has real consequences.

A 301 tells the browser to cache the redirect. Next time the user clicks the same short URL, the browser redirects directly without contacting your server. This reduces server load but means you lose visibility into click analytics. You cannot count how many times the link was clicked because repeat visits never reach your servers.

A 302 tells the browser this redirect is temporary. Every click hits your server. You get accurate analytics, but at the cost of higher server load.

Bitly uses 301 for performance. Analytics are supplemented by JavaScript tracking on the destination page. Most URL shorteners with analytics features use 302 to ensure every click is counted.

Read vs. Write Traffic Distribution

For every URL created, the system handles dozens to hundreds of redirect requests. This asymmetry is why caching and read optimization dominate the architecture.

Create and Redirect Flows

The two core operations in detail:

```mermaid
sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant App as App Server
    participant KG as Key Generator
    participant DB as Database
    participant Cache as Cache (Redis)
    Note over C,Cache: URL Creation Flow
    C->>LB: POST /api/shorten {long_url}
    LB->>App: Forward request
    App->>KG: Request next short key
    KG-->>App: "a3Bx9k"
    App->>DB: INSERT (a3Bx9k, long_url, metadata)
    DB-->>App: OK
    App->>Cache: SET a3Bx9k = long_url
    App-->>C: 201 {short_url: "https://short.ly/a3Bx9k"}
    Note over C,Cache: URL Redirect Flow
    C->>LB: GET /a3Bx9k
    LB->>App: Forward request
    App->>Cache: GET a3Bx9k
    Cache-->>App: long_url (cache hit)
    App-->>C: 302 Redirect to long_url
```

On a cache miss during redirect, the app server falls through to the database, fetches the mapping, populates the cache, and then redirects. The next request for the same key hits the cache.
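This cache-aside read path is short enough to sketch directly; `cache` and `db` are dict-likes standing in for Redis and the key-value store:

```python
def redirect(short_key, cache, db):
    """Cache-aside lookup for the redirect path.

    Returns the long URL, or None for unknown keys.
    """
    long_url = cache.get(short_key)
    if long_url is not None:
        return long_url               # cache hit: no DB round trip
    long_url = db.get(short_key)      # cache miss: fall through to DB
    if long_url is not None:
        cache[short_key] = long_url   # populate so the next read hits
    return long_url
```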

Further Reading

  • Designing URL Shortener Systems: From TinyURL to Bitly Scale, DEV Community
  • System Design: Scalable URL Shortener Service like TinyURL, Sandeep Verma
  • URL Shortener System Design, GeeksforGeeks
  • Bitly, Wikipedia

Assignment

Design a URL shortener that handles 500 million new URLs per month and stores them for 10 years.

  1. Redo the capacity estimation table with these new numbers. How many characters do you need in your short key? How much total storage?
  2. Choose a key generation strategy. Justify why it handles your scale without collisions.
  3. One of your short URLs goes viral and receives 1 million hits per day. Walk through exactly what happens at each layer (load balancer, app server, cache, database). Where is the bottleneck?
  4. Draw a complete high-level design diagram showing all components, including the cache layer, database replicas, and the key generation service.

The Problem: Personalized Timelines at Scale

When a user opens Twitter (now X), they see a feed of tweets from people they follow, ranked and ordered. This looks simple. It is not. At peak, Twitter handled 300,000 timeline read requests per second while ingesting around 4,600 new tweets per second (12,000 at peak). The system must assemble a personalized timeline for each of those 300K requests in under 200 milliseconds.

The core design question is deceptively simple: when User A posts a tweet, how does it appear in User B's feed? There are exactly two strategies, and Twitter has tried both.

Strategy 1: Fan-Out on Read (Pull Model)

When User B opens their timeline, the system queries all accounts B follows, fetches their recent tweets, merges them, ranks them, and returns the result. Every timeline request triggers a fresh computation.

```mermaid
sequenceDiagram
    participant U as User B (Reader)
    participant API as API Server
    participant FS as Follow Service
    participant TS as Tweet Store
    participant R as Ranker
    U->>API: GET /timeline
    API->>FS: Get accounts B follows
    FS-->>API: [User A, User C, User D, ...]
    API->>TS: Fetch recent tweets from each
    TS-->>API: Tweets (unsorted)
    API->>R: Rank and merge
    R-->>API: Sorted timeline
    API-->>U: Timeline response
```

This approach is simple and storage-efficient. You store each tweet once. But the read path is expensive. If User B follows 500 accounts, that is 500 lookups plus a merge-sort on every single request. At 300K requests per second, this does not scale.
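The read-time merge can be sketched in a few lines; everything here is an illustrative stand-in (dicts for the follow service and tweet store), and ranking is reduced to reverse-chronological order:

```python
import heapq

def pull_timeline(user, follows, tweets_by_author, limit=20):
    """Fan-out-on-read: merge each followee's tweets at request time.

    `tweets_by_author` maps author -> list of (timestamp, text).
    """
    candidates = []
    for author in follows.get(user, []):
        candidates.extend(tweets_by_author.get(author, []))
    # nlargest is the merge-sort step paid on every single request
    return heapq.nlargest(limit, candidates)
```

The cost structure is visible in the loop: one lookup per followed account, then a sort over everything, on every request.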

Strategy 2: Fan-Out on Write (Push Model)

When User A posts a tweet, the system immediately writes a copy of that tweet's ID into the timeline cache of every follower. User B's timeline is pre-computed and sitting in memory (Redis) before B even opens the app.

```mermaid
sequenceDiagram
    participant A as User A (Writer)
    participant API as API Server
    participant FS as Fan-out Service
    participant TL as Timeline Cache (Redis)
    participant B as User B (Reader)
    A->>API: POST /tweet
    API->>FS: Fan-out to all followers
    FS->>TL: Write tweet ref to Follower 1 timeline
    FS->>TL: Write tweet ref to Follower 2 timeline
    FS->>TL: Write tweet ref to Follower N timeline
    Note over FS,TL: N writes per tweet
    B->>API: GET /timeline
    API->>TL: Read pre-computed timeline
    TL-->>API: Sorted tweet refs
    API-->>B: Timeline response
```

This is what Twitter actually adopted. Each user's home timeline is stored as a Redis list of tweet IDs, replicated across 3 machines in a datacenter. The read path becomes a single Redis lookup. Fast. Predictable. Easy to cache.

The cost is write amplification. User A posts one tweet. If A has 5,000 followers, that single tweet generates 5,000 timeline writes. Twitter reported that this amplification inflated their 4,600 tweets/sec into 345,000 timeline writes per second.
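The write path and its amplification can be sketched with dicts standing in for Redis lists; the `cap` on timeline length is an assumption (bounded timelines keep memory predictable), not a documented Twitter value:

```python
def post_tweet(author, tweet_id, followers, timelines, cap=800):
    """Fan-out-on-write: push the tweet ID to every follower's timeline.

    `timelines` maps user -> list of tweet IDs, newest first, trimmed
    to `cap` entries the way a bounded Redis list would be.
    Returns the number of timeline writes (the amplification factor).
    """
    writes = 0
    for follower in followers.get(author, []):
        timeline = timelines.setdefault(follower, [])
        timeline.insert(0, tweet_id)
        del timeline[cap:]            # LTRIM-style bound on length
        writes += 1
    return writes
```

One call, one tweet, N writes: the return value is exactly the amplification the text describes.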

The Celebrity Problem

Fan-out on write works beautifully until a celebrity posts. A user with 10 million followers generates 10 million timeline writes for a single tweet. At Twitter's scale, multiple celebrities tweet every second. The fan-out queue backs up. Timelines go stale. Users complain their feed is delayed.

This is not a theoretical problem. Twitter engineers confirmed that tweets from high-follower accounts could take more than 5 seconds to propagate to all followers. For breaking news from a journalist with millions of followers, that delay is unacceptable.

| Strategy | Read Cost | Write Cost | Latency | Storage | Best For |
|---|---|---|---|---|---|
| Fan-out on Read | High (N queries per request) | Low (1 write) | High read latency | Low | Users with many followers |
| Fan-out on Write | Low (1 cache read) | High (N writes per tweet) | Low read latency | High | Users with few followers |
| Hybrid | Medium | Medium | Low read latency | Medium | Production systems at scale |

Key insight: Fan-out on write trades storage for latency. Fan-out on read trades latency for storage. The celebrity problem forces you to do both.

The Hybrid Approach

Twitter's actual production system uses a hybrid. Regular users (say, under 10,000 followers) use fan-out on write. Their tweets are pushed to follower timelines immediately. High-follower accounts are flagged. Their tweets are not fanned out. Instead, when User B requests their timeline, the system reads B's pre-computed timeline from Redis, then fetches recent tweets from any high-follower accounts B follows, merges them in, and returns the result.
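A sketch of the hybrid read path, with dicts standing in for the Redis timelines and the tweet store (all names here are illustrative):

```python
import heapq

def hybrid_timeline(user, precomputed, follows, celebrities,
                    celebrity_tweets, limit=20):
    """Hybrid read path: pre-computed timeline plus live celebrity merge.

    `precomputed` maps user -> list of (timestamp, tweet_id) pushed at
    write time; `celebrity_tweets` maps celebrity -> same shape, fetched
    at read time because their posts are never fanned out.
    """
    merged = list(precomputed.get(user, []))
    for account in follows.get(user, []):
        if account in celebrities:
            merged.extend(celebrity_tweets.get(account, []))
    return heapq.nlargest(limit, merged)
```

The expensive merge now happens only over the handful of celebrities a user follows, not over all 500 followed accounts.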

```mermaid
flowchart TD
    A[User A posts tweet] --> B{Followers > threshold?}
    B -->|No: Regular user| C[Fan-out on Write]
    C --> D[Write tweet ref to each follower timeline in Redis]
    B -->|Yes: Celebrity| E[Store tweet only]
    E --> F[No fan-out]
    G[User B requests timeline] --> H[Read pre-computed timeline from Redis]
    H --> I[Fetch recent tweets from celebrities B follows]
    I --> J[Merge and rank]
    J --> K[Return timeline to User B]
```

The threshold is not fixed. Twitter uses a dynamic cutoff based on follower count and tweet frequency. An account with 5 million followers who tweets once a day is treated differently than one with 500,000 followers who tweets 50 times a day. The fan-out cost depends on both.

Cursor-Based Pagination

Timelines are paginated. The naive approach is offset-based: "give me tweets 20-40." This breaks when new tweets arrive between page requests. Tweet 20 on page 2 might be tweet 21 by the time the user scrolls. Items get duplicated or skipped.

Cursor-based pagination solves this. Instead of "give me page 2," the client says "give me 20 tweets older than tweet ID 1892347." The cursor is a pointer to the last item seen. New tweets arriving at the top of the timeline do not shift the cursor position. The user sees a consistent stream regardless of how fast new content arrives.
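A minimal sketch over an in-memory timeline (IDs descending, newest first), showing why a freshly arrived tweet at the top does not shift the page:

```python
def page_older_than(timeline, cursor, page_size=20):
    """Cursor-based pagination over a timeline sorted newest-first.

    `cursor` is the last tweet ID the client has seen (None for the
    first page). Returns the page and the cursor to send next.
    """
    if cursor is None:
        page = timeline[:page_size]
    else:
        page = [t for t in timeline if t < cursor][:page_size]
    next_cursor = page[-1] if page else None
    return page, next_cursor
```

A production system would express the same filter as an indexed query (`WHERE id < cursor ORDER BY id DESC LIMIT 20`) rather than a list scan.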

Storage Architecture

The timeline itself lives in Redis for fast reads. But the tweets themselves are stored in a persistent datastore. Twitter uses a MySQL cluster (sharded by user ID) for tweet storage. The Redis timeline contains only tweet IDs, not full tweet objects. When the client requests a timeline, the system reads the list of IDs from Redis, then hydrates them by fetching the full tweet data from the persistent store (with another cache layer in front).

This separation matters. Redis holds the ordering and personalization. The persistent store holds the source of truth. If a Redis node fails, timelines can be recomputed from the tweet store. Slow, but recoverable.

Further Reading

  • The Architecture Twitter Uses to Deal with 150M Active Users, High Scalability
  • Real-time Tweet Delivery Architecture at Twitter, HdM Stuttgart CS Blog
  • Twitter's Tough Architectural Decision, Denny Sam
  • System Design Primer: Twitter, Donne Martin

Assignment

A user with 10 million followers posts a tweet. If you used pure fan-out on write, that single tweet would generate 10 million timeline writes.

  1. Calculate how long this fan-out would take if each write takes 1ms and you have 100 fan-out workers running in parallel.
  2. Design a hybrid system. Define your threshold for "celebrity" accounts. What data do you use to set this threshold?
  3. Draw the sequence diagram for a timeline request where User B follows 3 regular users and 2 celebrities. Show every system interaction.
  4. Your Redis cluster holding timelines loses a node. 5% of users have lost their pre-computed timelines. How do you recover? What is the user experience during recovery?

The Scale of Video

Over 500 hours of video are uploaded to YouTube every minute. That is 720,000 hours of new content per day. Netflix streams to over 260 million subscribers across 190 countries. At peak hours, Netflix alone accounts for roughly 15% of all downstream internet traffic in North America.

These numbers reveal something important about the architecture. Video is not a request-response problem. It is a pipeline problem. Raw footage enters one end, and optimized streams exit the other, tailored to thousands of different device and network combinations. Every step in between is a system design decision.

Key insight: Video streaming is a storage problem disguised as a networking problem. The real complexity is not in delivering bytes. It is in preparing the right bytes for every possible viewer.

The Full Pipeline

From the moment a creator clicks "upload" to the moment a viewer presses play, the video passes through five major stages: upload, transcoding, storage, distribution, and playback.

```mermaid
flowchart LR
    A[Creator] -->|Chunked Upload| B[Upload Service]
    B --> C[Object Storage - Raw]
    C --> D[Transcoding Pipeline]
    D --> E[Multiple Resolutions + Codecs]
    E --> F[Object Storage - Processed]
    F --> G[CDN Edge Servers]
    G --> H[Viewer Device]
    B --> I[Metadata Service]
    I --> J[(Metadata DB)]
    J --> K[Search / Recommendation]
```

Two parallel paths exist from the start. The video binary goes into object storage and the transcoding pipeline. The metadata (title, description, tags, thumbnail) goes into a relational database that feeds search and recommendation systems. These two paths are independent and should be treated as separate services.

Step 1: Chunked Upload

A raw 4K video file can easily be 10-50 GB. Uploading a 50 GB file as a single HTTP request is unreliable. Network interruptions, timeouts, and browser limits all conspire against you. The solution is chunked upload.

The client splits the file into chunks (typically 5-25 MB each). Each chunk is uploaded independently with a chunk index. The server reassembles them. If a chunk fails, only that chunk is retried. YouTube uses a resumable upload protocol that tracks which chunks have been received and allows the upload to continue from the last successful chunk after any interruption.

The upload target is object storage (S3, Google Cloud Storage). The raw file is stored as-is. This is the "source of truth" for the video. Everything downstream is a derived artifact.
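The chunking step itself is simple arithmetic; a sketch of the client-side split (the `chunk_plan` name is illustrative, and a real resumable protocol would also track per-chunk upload state server-side):

```python
def chunk_plan(file_size_bytes, chunk_size_bytes=10 * 1024 * 1024):
    """Split an upload into fixed-size chunks, as in a resumable upload.

    Returns (index, offset, length) triples; the server reassembles by
    index and the client retries only the chunks that failed.
    """
    chunks = []
    offset = 0
    index = 0
    while offset < file_size_bytes:
        length = min(chunk_size_bytes, file_size_bytes - offset)
        chunks.append((index, offset, length))
        offset += length
        index += 1
    return chunks
```

A 30 GB file at 10 MB per chunk yields 3,072 chunks, each independently retryable.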

Step 2: Transcoding

Raw uploaded video is not suitable for streaming. It may be in a codec the viewer's device cannot decode. It is almost certainly too large for mobile networks. Transcoding converts the raw video into multiple versions optimized for different devices and bandwidth conditions.

Netflix runs its transcoding pipeline on EC2 instances in AWS, processing petabytes of video data daily. Each video is encoded into multiple resolution and bitrate combinations called a "bitrate ladder."

Video Encoding Profiles

| Resolution | Bitrate (video) | Storage per hour | Target Use Case |
|---|---|---|---|
| 240p | 300 Kbps | ~135 MB | 2G/3G mobile, extreme low bandwidth |
| 360p | 700 Kbps | ~315 MB | Low-end mobile |
| 480p | 1.5 Mbps | ~675 MB | Standard mobile, slow Wi-Fi |
| 720p | 3 Mbps | ~1.35 GB | Tablets, standard desktop |
| 1080p | 6 Mbps | ~2.7 GB | HD desktop, smart TVs |
| 4K (2160p) | 16 Mbps | ~7.2 GB | 4K TVs, high-bandwidth connections |

A single hour of uploaded video, transcoded into all six profiles, requires roughly 12.4 GB of storage. With 500 hours uploaded to YouTube every minute, that is over 6 TB of new processed video per minute, or roughly 8.9 PB of new transcoded content per day.

Netflix takes this further with per-title encoding. Instead of using a fixed bitrate ladder, they analyze each video's complexity and generate a custom ladder. A slow dialogue scene needs less bitrate at 1080p than an action sequence. This optimization reduced Netflix's bandwidth requirements by roughly 50% without perceptible quality loss.

Step 3: Adaptive Bitrate Streaming

The viewer's bandwidth is not constant. It fluctuates as they move between Wi-Fi and cellular, as network congestion rises and falls, as other devices on the same network start or stop streaming. Adaptive Bitrate (ABR) streaming handles this dynamically.

The transcoded video is split into small segments, typically 2-10 seconds long. Each segment exists at every quality level. A manifest file (HLS uses .m3u8, DASH uses .mpd) lists all available segments and their quality levels. The video player on the client device monitors its download speed and buffer level, then requests the appropriate quality for each segment.

```mermaid
sequenceDiagram
    participant P as Video Player
    participant CDN as CDN Edge
    participant S as Origin Storage
    P->>CDN: Request manifest (.m3u8)
    CDN-->>P: Manifest with all quality levels
    Note over P: Bandwidth: 5 Mbps
    P->>CDN: Request segment 1 at 1080p
    CDN-->>P: Segment 1 (1080p)
    Note over P: Bandwidth drops to 1 Mbps
    P->>CDN: Request segment 2 at 480p
    CDN-->>P: Segment 2 (480p)
    Note over P: Bandwidth recovers to 8 Mbps
    P->>CDN: Request segment 3 at 1080p
    CDN-->>P: Segment 3 (1080p)
    Note over P,CDN: Quality adjusts per segment
```

The quality switch happens at segment boundaries. If each segment is 4 seconds long, the player can adjust quality every 4 seconds. Shorter segments allow faster adaptation but increase the number of HTTP requests and the manifest file size. Longer segments are more efficient but slower to adapt.
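The per-segment quality decision can be sketched as a throughput-based rule. The 0.8 safety factor here is an illustrative assumption; real ABR controllers also weigh buffer level and switch history:

```python
# Assumed bitrate ladder (Kbps), mirroring the encoding profile table.
LADDER = [("240p", 300), ("360p", 700), ("480p", 1500),
          ("720p", 3000), ("1080p", 6000), ("4K", 16000)]

def pick_quality(measured_kbps, safety=0.8):
    """Pick the highest rung whose bitrate fits within the measured
    bandwidth, discounted by a safety factor so the buffer can grow."""
    budget = measured_kbps * safety
    choice = LADDER[0][0]             # never go below the lowest rung
    for name, kbps in LADDER:
        if kbps <= budget:
            choice = name
    return choice
```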

Step 4: CDN Delivery

Serving video from a single origin datacenter does not work at global scale. A viewer in Jakarta requesting video from a server in Virginia would experience high latency and compete for bandwidth across undersea cables. Content Delivery Networks solve this by caching video segments at edge servers close to viewers.

Netflix operates its own CDN called Open Connect. Netflix embeds custom hardware appliances (called Open Connect Appliances, or OCAs) directly inside ISP networks. During off-peak hours, popular content is pre-positioned on these devices. When a subscriber presses play, the video is served from a box inside their own ISP's network, often within the same city.

YouTube uses Google's global network and edge caches. The principle is the same: move the data closer to the viewer. Popular videos are cached at many edge locations. Long-tail content (rarely watched videos) may only exist at a regional cache or the origin. The first viewer in a region pays the latency cost, and subsequent viewers benefit from the cache.

Metadata and the Relational Layer

While the video binary lives in object storage and CDNs, everything else about the video lives in a relational database. Title, description, upload date, view count, like count, comments, channel information, tags, categories, subtitles, and content moderation flags. This metadata drives search, recommendations, and the entire user interface.

YouTube uses a combination of MySQL (Vitess, their sharding middleware) and Bigtable for different metadata needs. The key insight: video content and video metadata are separate systems with separate scaling characteristics. Content is write-once-read-many with massive bandwidth needs. Metadata is read-heavy with complex query patterns (search, filter, sort, aggregate).

```mermaid
flowchart TD
    subgraph Content Path
        RAW[Raw Video - S3/GCS] --> TC[Transcoding]
        TC --> PROC[Processed Segments - S3/GCS]
        PROC --> CDN[CDN Edge Cache]
        CDN --> VIEWER[Viewer]
    end
    subgraph Metadata Path
        UP[Upload Metadata] --> MDB[(MySQL / Vitess)]
        MDB --> SEARCH[Search Index - Elasticsearch]
        MDB --> REC[Recommendation Engine]
        MDB --> API[API for UI]
        API --> VIEWER
    end
```

Further Reading

  • High Quality Video Encoding at Scale, Netflix Tech Blog
  • 2025 YouTube Statistics: Global Overview and Key Trends, Teleprompter.com
  • Video Encoding and Quality Research, Netflix Research
  • Adaptive Bitrate Streaming: What It Is and How ABR Works, Dacast

Assignment

A creator uploads a 4K, 1-hour video (raw file size: 30 GB). A viewer in a rural area with a 3G connection (500 Kbps average) wants to watch it. Walk through every step:

  1. How is the 30 GB file uploaded reliably? Calculate the number of chunks at 10 MB each and the time to upload on a 50 Mbps connection.
  2. How many transcoded versions are created? What is the total storage for all versions of this 1-hour video?
  3. The rural viewer presses play. Which resolution does the ABR algorithm select? What happens when their bandwidth fluctuates between 300 Kbps and 800 Kbps?
  4. The video is brand new and has not been cached at any edge server near the viewer. Describe the cache miss flow and what the viewer experiences on the first segment vs. subsequent segments.

What Makes Chat Different

A chat application looks simple on the surface: User A types a message, User B receives it. But the engineering challenges are unique. Unlike a web application where the client initiates every interaction, a chat system must push messages to clients the moment they arrive. The server cannot wait for the client to ask. It must deliver proactively.

WhatsApp handles approximately 100 billion messages per day across 2 billion users, with only about 50 engineers. That efficiency comes from making sharp architectural choices and sticking with them.

Key insight: A chat system is a delivery guarantee problem. The hard part is not sending the message. It is knowing whether the message arrived.

Connection Model: WebSockets

HTTP is request-response. The client asks, the server answers. For chat, you need a persistent, bidirectional channel. WebSockets provide exactly this. After an initial HTTP handshake, the connection upgrades to a full-duplex TCP connection. Both sides can send data at any time without waiting for the other.

WhatsApp's original architecture used persistent TCP connections managed by Erlang processes. Each connected device maintains one process on a frontend server. No connection pooling. No multiplexing. One connection, one process. This sounds wasteful until you realize that Erlang processes are extremely lightweight (roughly 2 KB each) and the BEAM VM can handle millions of them on a single machine.

For a system design interview, WebSockets are the standard answer. The connection is established when the app opens and maintained as long as the user is active. A heartbeat mechanism detects dropped connections.

High-Level Architecture

```mermaid
flowchart TD
    A1[User A - Mobile] <-->|WebSocket| CS1[Chat Server 1]
    B1[User B - Mobile] <-->|WebSocket| CS2[Chat Server 2]
    C1[User C - Desktop] <-->|WebSocket| CS1
    CS1 <--> MQ[Message Queue / Pub-Sub]
    CS2 <--> MQ
    CS1 --> DB[(Message Store - Cassandra)]
    CS2 --> DB
    CS1 --> PS[Presence Service]
    CS2 --> PS
    PS --> CACHE[(Redis - Online Status)]
    MQ --> PN[Push Notification Service]
    PN --> APNS[APNs / FCM]
```

Each chat server maintains WebSocket connections to a set of users. When User A sends a message to User B, Chat Server 1 publishes the message to a message queue. Chat Server 2 (where User B is connected) consumes the message and pushes it down User B's WebSocket. If User B is offline, the message is stored and a push notification is sent instead.

Message Delivery: Online Path

When both users are online and connected to the system, message delivery is straightforward but involves multiple guarantee checks.

```mermaid
sequenceDiagram
    participant A as User A
    participant CS1 as Chat Server 1
    participant MQ as Message Queue
    participant CS2 as Chat Server 2
    participant B as User B
    participant DB as Message Store
    A->>CS1: Send message (msg_id: 42)
    CS1->>DB: Persist message
    DB-->>CS1: ACK
    CS1-->>A: Sent ✓
    CS1->>MQ: Publish (recipient: B, msg_id: 42)
    MQ->>CS2: Deliver to B's server
    CS2->>B: Push message via WebSocket
    B-->>CS2: Delivered ACK
    CS2->>MQ: ACK consumed
    CS2->>CS1: Delivery receipt
    CS1->>A: Delivered ✓✓
```

Notice the two checkmarks. The first (sent) confirms the server received and stored the message. The second (delivered) confirms the recipient's device received it. WhatsApp adds a third state: read, triggered when the recipient actually opens the conversation. Each state is a separate acknowledgment flowing back through the system.

Message Delivery: Offline Path

The harder case. User B is not connected. Their phone is off, out of range, or the app is backgrounded. The message must still arrive eventually. This is store-and-forward.

When the message queue tries to deliver to Chat Server 2 and finds no active connection for User B, it triggers two actions. First, the message is already persisted in the message store (Cassandra), tagged as undelivered. Second, a push notification is sent through APNs (Apple) or FCM (Google) to wake the device and alert the user.

When User B comes back online and establishes a WebSocket connection, the chat server queries the message store for all undelivered messages for User B, sorted by timestamp, and pushes them down the connection. Once User B's device acknowledges receipt, the messages are marked as delivered.
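The reconnect-time replay reduces to one filtered, sorted query. A sketch with a list of dicts standing in for the Cassandra table (field names are illustrative):

```python
def flush_undelivered(user, message_store):
    """Replay store-and-forward messages when a user reconnects.

    `message_store` is a list of dicts standing in for the message
    table: {"to", "ts", "body", "delivered"}. Returns message bodies
    in timestamp order and marks them delivered, as if the device ACKed.
    """
    pending = sorted(
        (m for m in message_store if m["to"] == user and not m["delivered"]),
        key=lambda m: m["ts"],
    )
    for m in pending:        # in reality, only after the device ACKs
        m["delivered"] = True
    return [m["body"] for m in pending]
```

In Cassandra this filter maps naturally onto a partition keyed by recipient with a clustering column on timestamp.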

Storage Decisions

| Data Type | Store | Why |
|---|---|---|
| Messages (text) | Cassandra | Write-heavy, time-ordered, partitioned by conversation ID. Cassandra's LSM-tree storage is optimized for sequential writes. |
| Media (images, video, voice) | Object storage (S3) | Large binary blobs. Store the file in S3, store the S3 URL in the message record. |
| User profiles | MySQL / PostgreSQL | Relational data with consistency needs. Read-heavy, low write volume. |
| Online/presence status | Redis | Ephemeral data. Needs fast reads and writes. TTL-based expiration handles disconnects automatically. |
| Group membership | MySQL / PostgreSQL | Relational. Groups have members, admins, settings. Consistency matters. |
| Undelivered message queue | Cassandra (same table, filtered) | Query: "all messages for user B where delivered = false, ordered by timestamp." |

Cassandra is the dominant choice for message storage because of its write performance and natural time-series partitioning. Messages within a conversation are stored in a single partition, ordered by timestamp. Reading a conversation is a single sequential scan of one partition. WhatsApp's engineering team has spoken about running Cassandra clusters handling 230,000 writes per second.

Group Chat: The Fan-Out Problem Returns

One-to-one chat is point-to-point delivery. Group chat is one-to-many. When User A sends a message to a group of 100 members, the system must deliver 99 copies (everyone except the sender). This is the same fan-out problem from Session 7.2, but at message-level granularity.

The approach: the chat server publishes the message to a group topic in the message queue. Each member's chat server subscribes to that topic. When the message arrives, each server pushes it to the connected members. Offline members get store-and-forward treatment as before.

WhatsApp caps groups at 1,024 members. This is not an arbitrary limit. It is a fan-out constraint. A message to a 1,024-member group generates 1,023 delivery operations. At WhatsApp's message volume, larger groups would create unsustainable write amplification.

End-to-End Encryption

WhatsApp uses the Signal Protocol for end-to-end encryption. The key principle: the server never sees plaintext messages. Messages are encrypted on the sender's device and decrypted only on the recipient's device. The server stores and forwards ciphertext.

The protocol uses X3DH (Extended Triple Diffie-Hellman) to establish a shared secret between two devices, and the Double Ratchet algorithm to generate a new encryption key for every single message. Even if an attacker compromises one message key, they cannot decrypt past or future messages. This property is called forward secrecy.

For system design purposes, encryption adds two constraints. First, the server cannot index or search message content (it is ciphertext). Features like server-side search require separate encrypted indexes. Second, multi-device support is complex because each device has its own key pair. Sending a message to a user with 3 devices means encrypting the message 3 times, once with each device's public key.

Presence and Typing Indicators

The "online" indicator and "typing..." status are presence features. They are ephemeral, high-frequency, and tolerance for staleness is high. Nobody cares if the "online" status is 5 seconds stale.

Presence is stored in Redis with a TTL. When a user's WebSocket sends a heartbeat, the TTL is refreshed. When the heartbeat stops (disconnect), the key expires automatically. Typing indicators are even more transient. They are sent as fire-and-forget messages through the WebSocket. No persistence. No retry. If the indicator is lost, the worst case is the recipient does not see "typing..." for a few seconds.
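The TTL pattern is small enough to sketch without Redis; this in-memory stand-in mimics SETEX-plus-heartbeat semantics (the `Presence` class is illustrative, and `now` is injectable for clarity):

```python
import time

class Presence:
    """TTL-based presence tracking, mimicking Redis SETEX + heartbeats.

    Each heartbeat refreshes the expiry; a user whose heartbeats stop
    simply ages out, with no explicit disconnect handling needed.
    """
    def __init__(self, ttl_seconds=10):
        self.ttl = ttl_seconds
        self.expiry = {}              # user -> absolute expiry time

    def heartbeat(self, user, now=None):
        now = time.monotonic() if now is None else now
        self.expiry[user] = now + self.ttl

    def is_online(self, user, now=None):
        now = time.monotonic() if now is None else now
        return self.expiry.get(user, 0.0) > now
```

In Redis the same effect is `SET presence:<user> 1 EX 10` on every heartbeat: expiry is the disconnect detector.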

Further Reading

  • How WhatsApp Handles 100 Billion Messages Per Day (ByteByteGo)
  • 8 Reasons Why WhatsApp Was Able to Support 50 Billion Messages a Day with Only 32 Engineers (System Design One)
  • Why Signal Protocol is Well-Designed (Praetorian)
  • Design WhatsApp System Design Interview (AlgoMaster)

Assignment

User A sends a message to User B. User B's phone is off. Walk through the complete journey:

  1. Where is the message stored after User A sends it? What acknowledgment does User A see?
  2. How does the system know User B is offline? What specific mechanism detects this?
  3. What happens when User B turns on their phone and opens the app? Describe every step from WebSocket establishment to message display.
  4. How does User A know the message was delivered? Trace the delivery receipt back through every component.
  5. Now add end-to-end encryption. At which points in the flow is the message plaintext vs. ciphertext? Where are the encryption and decryption operations?

The Core Problem

A ride-hailing platform connects riders who need a car with drivers who have one. That sentence sounds simple. The engineering behind it is not. At Uber's scale, the system handles roughly 31 million trips per day, with 2 million online drivers sending GPS coordinates every 4 seconds. That produces around 500,000 location writes per second at peak.

The fundamental challenge is real-time spatial matching: given a rider's location, find the nearest available drivers within seconds. This is a search problem over constantly moving data points on a spherical surface.

Ride-hailing is a real-time matching problem built on top of a geospatial indexing problem. The matching algorithm is only as good as the spatial index that feeds it.

Geospatial Indexing

You cannot query "find all drivers within 2 km" by computing the distance from the rider to every single driver. With 2 million online drivers, that brute-force approach would take too long and consume too many resources. Instead, the system partitions the Earth's surface into a grid of cells, assigns each driver to a cell based on their current coordinates, and only searches nearby cells when a match request arrives.

Three indexing systems dominate this space.

| Method | Grid Shape | Hierarchy | Edge Distortion | Used By |
|---|---|---|---|---|
| Geohash | Rectangles | Z-order curve, prefix-based | High near poles and cell boundaries | Redis GEO, Elasticsearch |
| Google S2 | Quadrilaterals (on sphere) | Hilbert curve on cube faces | Low (projects onto sphere directly) | Google Maps, early Uber |
| Uber H3 | Hexagons | Icosahedron-based, 16 resolutions | Minimal (equidistant neighbors) | Uber (post-2018), spatial analytics |

Geohash is the simplest. It encodes latitude and longitude into a single string by interleaving bits. Longer strings mean smaller cells. The problem: two points on opposite sides of a cell boundary can have completely different geohash prefixes, making boundary queries tricky.
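The bit-interleaving idea can be sketched in a few lines. This is a simplified version of geohash's core encoding: it quantizes each axis, interleaves the bits (longitude first, as geohash does), and returns an integer cell id, skipping geohash's base32 string step.

```python
def cell_id(lat: float, lng: float, bits_per_axis: int = 16) -> int:
    """Quantize latitude and longitude to bits_per_axis bits each,
    then interleave the bits into one integer. Simplified sketch of
    geohash's core idea, without the base32 string encoding."""
    lat_q = min(int((lat + 90.0) / 180.0 * (1 << bits_per_axis)),
                (1 << bits_per_axis) - 1)
    lng_q = min(int((lng + 180.0) / 360.0 * (1 << bits_per_axis)),
                (1 << bits_per_axis) - 1)
    out = 0
    for i in reversed(range(bits_per_axis)):
        out = (out << 1) | ((lng_q >> i) & 1)   # longitude bit
        out = (out << 1) | ((lat_q >> i) & 1)   # latitude bit
    return out
```

Because high-order bits come first, nearby points share a long common prefix of the id, which is exactly the locality property geohash exploits, and exactly what breaks at cell boundaries.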

Google S2 projects the Earth onto a cube, then uses a Hilbert curve to map 2D regions to 1D ranges. This preserves spatial locality better than geohash. Uber originally used S2 for their dispatch system.

Uber later developed H3, a hexagonal grid. Hexagons have a useful property: every neighbor is equidistant from the center. With squares or rectangles, diagonal neighbors are farther away than edge neighbors. In a city where people are constantly moving in all directions, hexagons minimize the quantization error. H3 supports 16 resolution levels, from continent-sized hexagons down to roughly 1 square meter.

graph TB
  subgraph "Geospatial Index"
    A["Earth Surface"] --> B["Divide into H3 Hexagonal Cells"]
    B --> C["Resolution 7: ~5 km²"]
    B --> D["Resolution 9: ~0.1 km²"]
    B --> E["Resolution 12: ~300 m²"]
  end
  subgraph "Driver Location Store"
    F["Driver sends GPS (lat, lng, timestamp)"] --> G["Compute H3 cell ID at resolution 9"]
    G --> H["Write to in-memory store, key: cell_id → driver_list"]
  end
  subgraph "Match Query"
    I["Rider requests ride"] --> J["Compute rider's H3 cell"]
    J --> K["k-ring: get neighboring cells"]
    K --> L["Fetch drivers from all cells"]
    L --> M["Rank by distance + ETA"]
  end

The "k-ring" operation is critical. Given a cell, it returns all cells within k hops. For finding nearby drivers, you start with k=1 (the 6 immediate neighbors plus the center cell), fetch all drivers in those 7 cells, and expand the ring if you need more candidates.
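The k-ring can be computed directly on axial hexagon coordinates. This is a toy model of the operation (H3 itself works on indexed cell ids, not raw axial coordinates), using the standard bounded-range trick for hexagonal disks:

```python
def k_ring(center: tuple[int, int], k: int) -> set[tuple[int, int]]:
    """All hexagonal cells within k hops of center, in axial
    coordinates. Illustrates the k-ring idea only; not H3's API."""
    q, r = center
    cells = set()
    for dq in range(-k, k + 1):
        # The dr bounds clip the square range down to a hexagonal disk.
        for dr in range(max(-k, -dq - k), min(k, -dq + k) + 1):
            cells.add((q + dq, r + dr))
    return cells
```

As the text says, k=1 yields 7 cells (center plus 6 neighbors); each increment of k adds a ring of 6k cells, so the candidate set grows as 1 + 3k(k+1).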

High-Level Architecture

graph LR
  R["Rider App"] -->|"Request ride"| GW["API Gateway"]
  D["Driver App"] -->|"GPS every 4s"| GW
  GW --> LS["Location Service (in-memory, sharded)"]
  GW --> MS["Matching Service"]
  GW --> TS["Trip Service"]
  MS -->|"Query nearby drivers"| LS
  MS -->|"Create trip"| TS
  TS --> DB["Trip DB (PostgreSQL)"]
  TS --> Q["Event Queue (Kafka)"]
  Q --> PS["Pricing Service"]
  Q --> NS["Notification Service"]
  Q --> AN["Analytics"]
  D -->|"Accept/Reject"| GW

The Location Service is the hottest component. It must absorb 500K writes/sec and serve spatial queries with low latency. This service stores driver locations in memory, sharded by geographic region. It does not use a traditional database for the live location data. Persistent storage (for trip history, analytics) comes later via Kafka.

The Matching Service receives a ride request, queries the Location Service for nearby available drivers, ranks them by estimated time of arrival, and dispatches a request to the best candidate. If the driver declines, the system moves to the next candidate.

Trip State Machine

Every trip follows a strict lifecycle. State machines prevent invalid transitions (you cannot complete a trip that was never started) and make the system easier to reason about during failures.

stateDiagram-v2
  [*] --> Requested : Rider requests
  Requested --> Matching : Find drivers
  Matching --> DriverAssigned : Driver accepts
  Matching --> Cancelled : No driver / timeout
  DriverAssigned --> EnRoute : Driver heading to pickup
  EnRoute --> Arrived : Driver at pickup
  Arrived --> InProgress : Trip started
  InProgress --> Completed : Arrived at destination
  InProgress --> Cancelled : Rider cancels mid-trip
  Completed --> [*]
  Cancelled --> [*]

Each state transition triggers events. "DriverAssigned" triggers a push notification to the rider. "InProgress" starts the fare meter. "Completed" triggers payment processing and receipt generation. The state machine is the backbone that coordinates all downstream services.

Write Load: The Numbers

Consider the scale calculation for driver location updates.

With 50,000 drivers updating every 3 seconds: 50,000 / 3 = ~16,667 writes per second. At Uber's global scale of 2 million drivers updating every 4 seconds: 2,000,000 / 4 = 500,000 writes per second. No single relational database handles this. The solution is an in-memory store (like Redis or a custom service), sharded by H3 cell ranges, with asynchronous persistence to durable storage.

Surge Pricing as a Reinforcing Loop

Surge pricing connects directly to the reinforcing loops from Session 0.4. When demand exceeds supply in a region, prices increase. Higher prices attract more drivers to that region (supply increases). Higher prices also discourage some riders (demand decreases). This is a balancing mechanism layered on top of a reinforcing signal.

But there is a reinforcing loop hiding underneath. When surge pricing activates, drivers from neighboring regions migrate toward the surge zone. This creates undersupply in those neighboring regions, which triggers surge pricing there, which causes further driver migration. The surge can propagate outward like a wave. Uber mitigates this with predictive positioning, incentivizing drivers to stay in areas where demand is expected to rise.

Further Reading

  • H3: Uber's Hexagonal Hierarchical Spatial Index (Uber Engineering Blog)
  • Geospatial Indexing Explained: A Comparison of Geohash, S2, and H3 (Ben Feifke)
  • How Uber Finds Nearby Drivers at 1 Million Requests per Second (Ajit Singh)
  • How Uber Serves over 150 Million Reads per Second from Integrated Cache (ByteByteGo)

Assignment

Your city has 50,000 active drivers. Each driver sends a GPS update every 3 seconds.

  1. Calculate the write QPS for the location service.
  2. A relational database like PostgreSQL maxes out at roughly 10,000-30,000 writes/sec on a single node. Can it handle this load? What alternatives would you use?
  3. A rider requests a ride. Design the flow to find the 5 nearest available drivers. Which geospatial index would you use and why? What is the k-ring radius you would start with?
  4. Draw a causal loop diagram showing how surge pricing propagates across neighboring regions. Identify the reinforcing and balancing loops.

The Core Problem

An e-commerce platform at Amazon's scale sells hundreds of millions of products across dozens of countries. On Prime Day 2023, Amazon processed over 375 million items in a single day. The catalog alone contains over 350 million SKUs. Behind the simple act of clicking "Buy Now" sits a distributed transaction that must coordinate inventory, payment, shipping, and notifications across multiple services, all while ensuring that two customers never buy the last unit of the same item.

The hardest problem in e-commerce is not showing products. It is selling the last one exactly once. Everything in the checkout pipeline exists to guarantee this property under concurrent load.

Service Decomposition

A monolithic e-commerce application hits a wall quickly. The search team's deployment schedule differs from the payment team's. The inventory service has different scaling needs than the recommendation engine. Microservices let each team own their domain.

| Service | Responsibility | Data Store | Scaling Pattern |
|---|---|---|---|
| Catalog | Product metadata, descriptions, images, categories | Elasticsearch + DynamoDB | Read replicas, CDN for images |
| Cart | User's current selections, quantities, saved items | Redis (session) + DynamoDB (persistent) | Horizontal, stateless |
| Inventory | Stock counts per SKU per warehouse | PostgreSQL (strong consistency) | Sharded by warehouse region |
| Order | Order lifecycle, status tracking | PostgreSQL | Sharded by user ID |
| Payment | Charge authorization, capture, refund | PostgreSQL (ACID required) | Vertical + limited horizontal |
| Shipping | Carrier selection, label generation, tracking | DynamoDB | Event-driven, async |
| Recommendation | "Customers also bought", personalized feeds | Redis (precomputed) + ML model store | Read-heavy, cache-first |

High-Level Architecture

graph TB
  Client["Client (Web / Mobile)"] --> CDN["CDN (Images, Static)"]
  Client --> GW["API Gateway"]
  GW --> CS["Catalog Service"]
  GW --> CT["Cart Service"]
  GW --> OS["Order Service (Saga Orchestrator)"]
  GW --> RS["Recommendation Service"]
  CS --> ES["Elasticsearch"]
  CS --> DDB1["DynamoDB (Catalog)"]
  CT --> RD["Redis (Cart)"]
  OS --> INV["Inventory Service"]
  OS --> PAY["Payment Service"]
  OS --> SHP["Shipping Service"]
  INV --> PG1["PostgreSQL (Inventory)"]
  PAY --> PG2["PostgreSQL (Payment)"]
  OS --> KF["Kafka (Order Events)"]
  KF --> NF["Notification Service"]
  KF --> AN["Analytics"]

The Checkout Flow: Inventory Locking

The critical moment is checkout. When a user clicks "Place Order," the system must reserve inventory, authorize payment, and create the order. If any step fails, the preceding steps must be rolled back. This is the Saga pattern (covered in Session 6.5).

Inventory is modeled with three numbers per SKU per warehouse: on_hand (physical stock), reserved (claimed by pending orders), and available (on_hand minus reserved). This prevents overselling without locking the entire row for the duration of a checkout.

sequenceDiagram
  participant U as User
  participant OS as Order Service
  participant INV as Inventory Service
  participant PAY as Payment Service
  participant SHP as Shipping Service
  participant KF as Kafka
  U->>OS: Place Order
  OS->>INV: Reserve inventory (SKU, qty)
  Note over INV: UPDATE inventory<br/>SET reserved = reserved + qty<br/>WHERE available >= qty
  INV-->>OS: Reserved (reservation_id)
  OS->>PAY: Authorize payment
  PAY-->>OS: Authorized (auth_id)
  OS->>OS: Create order record (status: confirmed)
  OS->>KF: OrderConfirmed event
  KF->>INV: Deduct inventory (convert reservation to sale)
  KF->>SHP: Create shipment
  KF->>U: Order confirmation email
  Note over OS: If payment fails:
  OS->>INV: Release reservation
  OS->>U: Payment declined

The reservation step uses an optimistic approach. The SQL WHERE available >= qty clause acts as a guard. If two users try to buy the last item simultaneously, only one reservation will succeed because the second query will see available = 0 after the first transaction commits. The loser gets an "out of stock" response.

This is optimistic locking in practice. No explicit lock is held. Instead, the database's transactional guarantees ensure that the constraint check and the update happen atomically.
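The guard clause can be demonstrated end to end against SQLite (an in-memory stand-in for the real inventory database; the column names `on_hand` and `reserved` follow the model above, with `available` expressed as their difference):

```python
import sqlite3

# SQLite stand-in for the inventory table described above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, on_hand INT, reserved INT)")
db.execute("INSERT INTO inventory VALUES ('sku-1', 1, 0)")
db.commit()

def reserve(sku: str, qty: int) -> bool:
    """Reserve qty units. The WHERE clause is the guard: the update
    only matches while available (on_hand - reserved) covers qty."""
    cur = db.execute(
        "UPDATE inventory SET reserved = reserved + ? "
        "WHERE sku = ? AND on_hand - reserved >= ?",
        (qty, sku, qty),
    )
    db.commit()
    return cur.rowcount == 1   # 0 rows affected means the guard failed

# Two buyers race for the last unit: only the first reservation lands.
winner = reserve("sku-1", 1)
loser = reserve("sku-1", 1)
```

The application never reads-then-writes in two steps; the check and the increment are one atomic statement, which is what makes the race safe.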

Payment: Exactly-Once Semantics

Payment is the one operation where "at least once" delivery can cost you real money. Charging a customer twice is worse than not charging them at all. The system must guarantee exactly-once processing.

The standard approach uses an idempotency key. The Order Service generates a unique key for each payment attempt and sends it with the charge request. The Payment Service stores this key. If the same key arrives again (due to a retry after a network timeout), the Payment Service returns the original result without processing a second charge.
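The replay behavior can be sketched with a minimal in-memory payment service (class and method names are invented for illustration; a real implementation would persist the key-to-result map durably, not in a dict):

```python
class PaymentService:
    """Sketch of the idempotency-key pattern: the first charge for a
    key is processed; any replay returns the stored result unchanged."""

    def __init__(self):
        self._results = {}       # idempotency_key -> original result
        self.charges_made = 0    # counts actual provider calls

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]   # replay: no new charge
        self.charges_made += 1                      # real charge would happen here
        result = {"status": "charged", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result

svc = PaymentService()
first = svc.charge("order-42-attempt-1", 1999)
retry = svc.charge("order-42-attempt-1", 1999)   # retry after a network timeout
```

The caller can retry freely after a timeout: either the original charge never happened (so this attempt is the first), or it did and the stored result comes back, never a second charge.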

On the provider side, payment finality relies on the webhook callback from the payment processor (Stripe, Adyen), not the synchronous API response. The synchronous response confirms the request was received. The webhook confirms the money moved. The Order Service should not mark an order as "paid" until the webhook arrives.

Search-Optimized Catalog

Product search is read-heavy. Users browse, filter, and search far more than they buy. The catalog uses Elasticsearch for full-text search and faceted filtering (brand, price range, rating, category). The source of truth for product data lives in DynamoDB. Changes propagate to Elasticsearch via a change data capture (CDC) stream.

This separation lets the search index be optimized for read patterns (denormalized, pre-aggregated facet counts) while the primary store maintains normalized data integrity.

Recommendation Engine

The "Customers who bought X also bought Y" feature is a collaborative filtering problem. At Amazon's scale, recommendations are precomputed offline using batch processing (Spark or equivalent) and stored in a fast key-value store (Redis, DynamoDB). When a user views a product, the service looks up precomputed recommendations by product ID. Real-time signals (what the user just clicked) are blended in at serving time via a lightweight model.

The key insight: recommendations are not computed on the fly. They are a read path. The heavy computation happens in a daily or hourly batch job.

Further Reading

  • Architecting a Highly Available Microservices-Based Ecommerce Site (AWS Architecture Blog)
  • Amazon System Design (CodeKarle)
  • Build a Scalable E-commerce Platform: System Design Overview (DZone)
  • Saga Pattern (Microservices.io, Chris Richardson)

Assignment

Two users click "Buy Now" on the last unit of a product at the same time.

  1. Design the checkout flow so that exactly one user succeeds and the other receives an "out of stock" message. Write the SQL for the inventory reservation step. Explain why it prevents overselling.
  2. What happens if the payment authorization succeeds but the Payment Service crashes before returning the response to the Order Service? How does the Order Service know whether to retry or roll back?
  3. Describe the locking strategy. Is this pessimistic or optimistic locking? What are the tradeoffs of each in a high-traffic checkout scenario?
  4. Draw a sequence diagram for the full saga, including the compensation (rollback) path when shipping fails after payment succeeds.

The Core Problem

A notification system delivers messages to users across multiple channels: push notifications, email, SMS, and in-app messages. That sounds like a simple "send message" API. In practice, it is a delivery pipeline that must handle user preferences, rate limiting, priority ordering, retry logic, and channel-specific failure modes, all at scale.

When Shopify runs a flash sale and needs to notify 10 million users within an hour, the system must process roughly 2,800 notifications per second sustained, with bursts much higher. Each notification might fan out to multiple channels (push and email). The pipeline must not collapse under this load, and it must not spam users who opted out of marketing messages.

A notification system is a delivery pipeline with user preferences as the routing table. The preferences determine which channel, the priority determines when, and the rate limiter determines how often.

Notification Channels Compared

| Channel | Latency (P95) | Cost per Message | Reliability | Best For |
|---|---|---|---|---|
| Push (APNs/FCM) | < 5 seconds | Free (platform fee only) | Medium (device must be online) | Time-sensitive alerts, engagement |
| Email (SES/SendGrid) | < 30 seconds | $0.0001 - $0.001 | High (store-and-forward) | Receipts, marketing, detailed content |
| SMS (Twilio/SNS) | < 10 seconds | $0.005 - $0.05 | High (carrier delivery) | OTP, critical alerts, non-app users |
| In-App | < 1 second (if connected) | Free | Low (user must be in app) | Contextual updates, badges |

SMS is the most expensive by a large margin. At scale, SMS can cost $10,000 per day or more. A well-designed system uses SMS only for critical messages (OTPs, security alerts) and routes everything else through push or email. The user preference table is the first place to check, and cost is the second.

High-Level Architecture

graph TB
  subgraph "Producers"
    P1["Order Service"]
    P2["Marketing Service"]
    P3["Security Service"]
  end
  P1 -->|"order_confirmed"| API["Notification API"]
  P2 -->|"flash_sale"| API
  P3 -->|"login_alert"| API
  API --> VAL["Validation & Deduplication"]
  VAL --> PREF["Preference Lookup"]
  PREF --> PRI["Priority Router"]
  PRI --> HQ["High Priority Queue (OTP, security)"]
  PRI --> MQ["Medium Priority Queue (transactional)"]
  PRI --> LQ["Low Priority Queue (marketing)"]
  HQ --> PW["Push Worker"]
  HQ --> SW["SMS Worker"]
  MQ --> PW
  MQ --> EW["Email Worker"]
  LQ --> EW
  LQ --> PW
  PW --> APNS["APNs / FCM"]
  EW --> SES["SES / SendGrid"]
  SW --> TWI["Twilio / SNS"]
  PW --> DLQ["Dead Letter Queue (failed deliveries)"]
  EW --> DLQ
  SW --> DLQ

The architecture separates concerns into stages. Producers submit notification requests. The API validates and deduplicates. The Preference Lookup determines which channels the user has enabled. The Priority Router assigns the notification to the correct queue. Workers consume from queues and deliver via third-party providers. Failed deliveries land in a Dead Letter Queue for retry or investigation.

Delivery Flow with Retry

sequenceDiagram
  participant P as Producer
  participant API as Notification API
  participant DB as Notification DB
  participant Q as Priority Queue
  participant W as Push Worker
  participant FCM as FCM (Google)
  participant DLQ as Dead Letter Queue
  P->>API: Send notification (user_id, type, payload)
  API->>API: Deduplicate (idempotency key)
  API->>DB: Store notification (status: pending)
  API->>Q: Enqueue (priority: high)
  W->>Q: Poll message
  W->>FCM: Deliver push notification
  FCM-->>W: 200 OK
  W->>DB: Update status: delivered
  Note over W,FCM: If FCM returns 5xx or timeout:
  W->>Q: Re-enqueue with backoff (attempt 2/5)
  Note over W,Q: Exponential backoff: 1s, 5s, 30s, 120s, 600s
  W->>Q: Poll again after delay
  W->>FCM: Retry delivery
  FCM-->>W: 200 OK
  W->>DB: Update status: delivered
  Note over W,DLQ: If all 5 attempts fail:
  W->>DLQ: Move to Dead Letter Queue
  W->>DB: Update status: failed

The retry strategy uses exponential backoff with jitter. Without jitter, all failed notifications retry at the same instant, creating a thundering herd that overwhelms the downstream provider again. Adding random jitter (say, plus or minus 20% of the backoff interval) spreads retries over time.
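A sketch of the jittered schedule, using the delays named above and the ±20% suggestion from the text:

```python
import random

BASE_DELAYS = (1, 5, 30, 120, 600)   # the 1s/5s/30s/120s/600s schedule above

def backoff_schedule(jitter: float = 0.2) -> list[float]:
    """Each retry delay perturbed by up to +/- jitter (20% here), so
    a batch of simultaneous failures spreads its retries over time
    instead of hammering the provider again in lockstep."""
    return [d * (1 + random.uniform(-jitter, jitter)) for d in BASE_DELAYS]
```

Every failed notification computes its own schedule, so even ten thousand messages that failed in the same second retry at ten thousand slightly different moments.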

Rate Limiting

Rate limiting operates at two levels. Per-user rate limiting prevents notification fatigue: no user should receive more than N push notifications per hour, regardless of how many services want to notify them. Per-channel rate limiting respects provider quotas: APNs and FCM impose rate limits, and exceeding them results in dropped messages or temporary bans.

A windowed counter in Redis works well for per-user limits. The key is rate:{user_id}:{channel}:{hour}, incremented on each send. (Strictly, one key per hour is a fixed window; a true sliding window would also weight the previous hour's count.) If the counter exceeds the threshold, the notification is either downgraded to a lower-priority channel or deferred to the next window.
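The counter scheme can be sketched with an in-memory stand-in for Redis (class name invented; in production the increment and check would be a single atomic Redis operation):

```python
from collections import defaultdict

class WindowedRateLimiter:
    """In-memory stand-in for the Redis counter described above:
    one counter per (user, channel, hour bucket), checked on send."""

    def __init__(self, limit_per_hour: int):
        self.limit = limit_per_hour
        self._counts = defaultdict(int)

    def allow(self, user_id: str, channel: str, now_epoch: int) -> bool:
        key = (user_id, channel, now_epoch // 3600)  # rate:{user}:{channel}:{hour}
        if self._counts[key] >= self.limit:
            return False                             # caller defers or downgrades
        self._counts[key] += 1
        return True
```

With real Redis, the hour bucket in the key means old counters expire naturally (via a TTL on the key) instead of needing cleanup.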

User Preferences as the Routing Table

Each user has a preference record that determines what they receive and how. A simplified model:

{
  "user_id": "u_12345",
  "channels": {
    "push": true,
    "email": true,
    "sms": false
  },
  "categories": {
    "order_updates": ["push", "email"],
    "marketing": ["email"],
    "security": ["push", "sms", "email"]
  },
  "quiet_hours": { "start": "23:00", "end": "07:00", "timezone": "Asia/Jakarta" }
}

When the Notification API receives a security alert, it looks up the user's preferences, finds that security notifications go to push, SMS, and email, and fans out to all three channels. A marketing notification only goes to email. During quiet hours, non-critical notifications are deferred until the window opens.
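The lookup can be sketched against the preference record shown above (minus quiet hours). One assumption is made explicit here: the security category fans out to all three channels despite the user's global sms opt-out, so the sketch treats critical categories as bypassing per-channel flags. That bypass is a product policy choice, not a given.

```python
CRITICAL_CATEGORIES = {"security"}   # assumed policy: critical alerts
                                     # bypass per-channel opt-outs

def resolve_channels(prefs: dict, category: str) -> list[str]:
    """Route a notification using the preference record as the
    routing table (the JSON shape shown above)."""
    wanted = prefs["categories"].get(category, [])
    if category in CRITICAL_CATEGORIES:
        return list(wanted)
    return [c for c in wanted if prefs["channels"].get(c, False)]

prefs = {
    "channels": {"push": True, "email": True, "sms": False},
    "categories": {
        "order_updates": ["push", "email"],
        "marketing": ["email"],
        "security": ["push", "sms", "email"],
    },
}
```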

This preference lookup is the single most important step in the pipeline. Without it, the system is a spam cannon. With it, the system respects user intent.

Handling Downstream Failures

What happens when the push notification service (APNs or FCM) goes down entirely? The system needs a fallback strategy. If push delivery fails after all retries, the system can automatically escalate to email (cheaper and more reliable for store-and-forward delivery). For critical notifications like OTPs, the fallback chain might be: push, then SMS, then email. The fallback logic lives in the worker, not the producer. Producers should not need to know about delivery infrastructure.
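The worker-side fallback chain can be sketched as a loop over channels, with the delivery function injected so the producer stays ignorant of delivery infrastructure (all names here are invented for illustration):

```python
from typing import Callable, Optional

def deliver_with_fallback(send: Callable[[str, str], bool], message: str,
                          chain=("push", "sms", "email")) -> Optional[str]:
    """Try each channel in order until one accepts the message; return
    the channel used, or None if every channel failed. The default
    chain is the OTP escalation order from the text."""
    for channel in chain:
        if send(channel, message):
            return channel
    return None

# Simulate the push provider being down:
def push_is_down(channel: str, message: str) -> bool:
    return channel != "push"
```

Because the chain lives in the worker, changing the escalation policy (or adding a new channel) never touches the producing services.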

Further Reading

  • Notification System Design: Architecture and Best Practices (MagicBell Engineering)
  • Design a Scalable Notification Service (AlgoMaster)
  • How to Design a Notification System: A Complete Guide (System Design Handbook)
  • Designing a Notification System: Push, Email, and SMS at Scale (DEV Community)

Assignment

A flash sale starts in 1 hour. You need to notify 10 million users.

  1. Calculate the sustained throughput required if you want all notifications delivered within 30 minutes. How many worker instances do you need if each worker processes 500 notifications per second?
  2. Design the delivery pipeline. What queue system do you use? How do you partition the work across workers?
  3. The push notification provider (FCM) goes down 10 minutes before the sale. What is your fallback plan? How do you re-route notifications already in the push queue to email?
  4. A user has opted out of marketing notifications but this flash sale is for an item in their wishlist. Should you send it? Design the preference lookup logic that handles this edge case.

The Core Problem

A smart parking system tracks the real-time availability of every spot in a parking facility and allows users to reserve spots before arrival. The physical world imposes constraints that purely digital systems do not have. A parking spot can only hold one car. Sensors can malfunction. A driver might park in a different spot than the one they reserved. The system must reconcile digital state with physical reality continuously.

For a mall with 500 parking spots across 3 floors, the numbers are modest. But the design patterns are the same ones used at airport scale (10,000+ spots) or city-wide smart parking networks (100,000+ spots across hundreds of facilities). The concurrency challenge surfaces even at small scale: weekend peak hours produce bursts of simultaneous booking attempts for the last available spots on a popular floor.

A parking system is a distributed inventory problem with physical constraints. Each spot is a unit of inventory. The physical sensor is the source of truth. The database is the best available estimate.

Data Model

| Entity | Key Fields | Relationships | Notes |
|---|---|---|---|
| ParkingLot | lot_id, name, address, lat, lng, total_spots | Has many Floors | Top-level entity, one per facility |
| Floor | floor_id, lot_id, floor_number, spot_count | Belongs to ParkingLot, has many Spots | Physical grouping for navigation |
| Spot | spot_id, floor_id, spot_number, type, status, version | Belongs to Floor, has many Bookings | type: regular, handicapped, EV; version: for optimistic locking |
| Booking | booking_id, spot_id, user_id, start_time, end_time, status, created_at | Belongs to Spot and User | status: reserved, active, completed, cancelled |
| Payment | payment_id, booking_id, amount, method, status, idempotency_key | Belongs to Booking | Idempotency key prevents double charges |

The version field on the Spot entity is critical. It enables optimistic locking, which we will cover in detail below.

High-Level Architecture

graph TB
  subgraph "Client Layer"
    MA["Mobile App"]
    KI["Kiosk / Display"]
  end
  subgraph "Application Layer"
    GW["API Gateway"]
    BS["Booking Service"]
    AS["Availability Service (in-memory cache)"]
    PS["Payment Service"]
    NS["Notification Service"]
  end
  subgraph "Data Layer"
    PG["PostgreSQL (Bookings, Spots)"]
    RD["Redis (Availability Cache)"]
    KF["Kafka (Events)"]
  end
  subgraph "IoT Layer"
    S1["Floor 1 Sensors"]
    S2["Floor 2 Sensors"]
    S3["Floor 3 Sensors"]
    IGW["IoT Gateway (MQTT Broker)"]
  end
  MA --> GW
  KI --> GW
  GW --> BS
  GW --> AS
  BS --> PG
  BS --> PS
  BS --> KF
  AS --> RD
  KF --> NS
  S1 --> IGW
  S2 --> IGW
  S3 --> IGW
  IGW --> AS
  IGW --> KF

The Availability Service maintains an in-memory count (via Redis) of available spots per floor, updated by both the Booking Service (when a reservation is made) and the IoT Gateway (when sensors detect occupancy changes). This cache enables sub-millisecond availability checks without querying the database.

Concurrent Booking with Optimistic Locking

The central design challenge: two users open the app at the same time, see the same spot listed as available, and both tap "Reserve." Only one should succeed.

Pessimistic locking (SELECT FOR UPDATE) would work but creates contention. During peak hours, many users compete for spots on the same floor. Holding row locks for the duration of a booking transaction (which includes payment) would serialize all bookings and destroy throughput.

Optimistic locking is the better fit. Each Spot row has a version column. The booking flow reads the current version, performs the reservation, and writes back only if the version has not changed.

sequenceDiagram
  participant U1 as User A
  participant U2 as User B
  participant BS as Booking Service
  participant DB as PostgreSQL
  U1->>BS: Reserve Spot #42
  U2->>BS: Reserve Spot #42
  BS->>DB: SELECT status, version FROM spots WHERE spot_id = 42
  Note over DB: Returns status=available, version=5
  BS->>DB: SELECT status, version FROM spots WHERE spot_id = 42
  Note over DB: Returns status=available, version=5
  BS->>DB: UPDATE spots SET status='reserved', version=6 WHERE spot_id=42 AND version=5
  Note over DB: User A: 1 row affected. Success.
  DB-->>BS: 1 row updated
  BS->>DB: UPDATE spots SET status='reserved', version=6 WHERE spot_id=42 AND version=5
  Note over DB: User B: 0 rows affected. Version mismatch.
  DB-->>BS: 0 rows updated
  BS-->>U1: Reservation confirmed
  BS-->>U2: Spot no longer available

The WHERE version = 5 clause is the guard. When User A's update commits, the version changes to 6. User B's update finds no row matching version = 5, so zero rows are affected. The Booking Service interprets zero affected rows as a conflict and returns an error to User B.

This approach requires no explicit locks. The database's row-level atomicity handles the race condition. The tradeoff: under very high contention (dozens of users competing for the same spot), the retry rate increases. In a parking system, this is rarely a problem because contention is spread across hundreds of spots.
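The compare-and-set can be reproduced end to end against SQLite (an in-memory stand-in for PostgreSQL; table and column names follow the data model above):

```python
import sqlite3

# SQLite stand-in for the spots table from the data model above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE spots (spot_id INT PRIMARY KEY, status TEXT, version INT)")
db.execute("INSERT INTO spots VALUES (42, 'available', 5)")
db.commit()

def reserve_spot(spot_id: int, seen_version: int) -> bool:
    """Compare-and-set: the update lands only if the row still carries
    the version this client read. Zero affected rows means another
    booking won the race, and the caller should report a conflict."""
    cur = db.execute(
        "UPDATE spots SET status = 'reserved', version = version + 1 "
        "WHERE spot_id = ? AND version = ? AND status = 'available'",
        (spot_id, seen_version),
    )
    db.commit()
    return cur.rowcount == 1

# Both users read version 5; only the first write wins.
user_a = reserve_spot(42, seen_version=5)
user_b = reserve_spot(42, seen_version=5)
```

No lock is ever held across the read and the write; the version column carries the "did anything change since I looked?" check into the update itself.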

IoT Sensor Integration

Physical sensors (ultrasonic, infrared, or magnetic) detect whether a car is physically present in a spot. These sensors communicate via MQTT to an IoT Gateway, which translates sensor events into system events.

The sensor data serves two purposes. First, it reconciles the digital state with reality. If a spot is marked "reserved" in the database but the sensor shows it has been physically occupied for 15 minutes past the reservation window, the system can auto-complete the booking. If a spot is marked "available" but the sensor shows a car parked there (someone parked without a reservation), the system marks it as occupied.

Second, sensor data provides real-time availability that is more accurate than the booking database alone. A driver might reserve a spot, arrive, and park in a different spot because the reserved one was physically blocked. The sensor layer catches this discrepancy.
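The two reconciliation cases above can be expressed as a small decision table. This is a hedged sketch: the status names, the third rule (freeing a spot when the sensor shows it empty), and the 15-minute window as a parameter are all assumptions for illustration.

```python
from datetime import timedelta

def reconcile(db_status: str, sensor_occupied: bool,
              past_window: timedelta) -> str:
    """Map (database state, sensor state) to a reconciliation action."""
    # Reserved in the DB, car still present well past the window:
    # assume the driver stayed and close out the booking.
    if db_status == "reserved" and sensor_occupied and past_window >= timedelta(minutes=15):
        return "auto_complete_booking"
    # Available in the DB but a car is physically there:
    # someone parked without a reservation.
    if db_status == "available" and sensor_occupied:
        return "mark_occupied"
    # Occupied in the DB but the sensor shows the spot empty (assumed rule).
    if db_status == "occupied" and not sensor_occupied:
        return "mark_available"
    return "no_action"

print(reconcile("available", True, timedelta(0)))                 # mark_occupied
print(reconcile("reserved", True, timedelta(minutes=20)))         # auto_complete_booking
```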

Availability: Cache vs. Database

Users checking availability far outnumber users making bookings. The read-to-write ratio might be 100:1. Querying PostgreSQL for every "how many spots are available on Floor 2?" request would waste database capacity on simple counts.

Redis maintains atomic counters: avail:{lot_id}:{floor_id}. When a booking is confirmed, the counter decrements. When a booking is cancelled or a sensor detects a car leaving, the counter increments. The counter is the first thing the app checks. The database is only queried when the user selects a specific spot to reserve.

A periodic reconciliation job (every 5 minutes) compares Redis counters against the actual database counts and sensor states. If they drift (due to missed events or Redis restarts), the job corrects the cache. This is eventual consistency with a bounded staleness window.
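The counter flow and the reconciliation job can be sketched together. A plain dict stands in for Redis here; in production these would be `INCR`/`DECR` on keys shaped like `avail:{lot_id}:{floor_id}` as described above, and the helper names are hypothetical.

```python
# Dict stands in for Redis. Keys follow the avail:{lot_id}:{floor_id}
# convention from the text; values are available-spot counts.
counters: dict[str, int] = {"avail:lot1:floor2": 180}

def on_booking_confirmed(lot_id: str, floor_id: str) -> None:
    counters[f"avail:{lot_id}:{floor_id}"] -= 1   # DECR in Redis

def on_spot_freed(lot_id: str, floor_id: str) -> None:
    counters[f"avail:{lot_id}:{floor_id}"] += 1   # INCR on cancel / car leaving

def reconcile_counters(db_counts: dict[str, int]) -> list[str]:
    """Periodic job: overwrite any counter that drifted from the DB's truth."""
    corrected = []
    for key, truth in db_counts.items():
        if counters.get(key) != truth:
            counters[key] = truth
            corrected.append(key)
    return corrected

on_booking_confirmed("lot1", "floor2")
print(counters["avail:lot1:floor2"])                    # 179
# Simulate drift (e.g. a missed sensor event) and let the job correct it:
print(reconcile_counters({"avail:lot1:floor2": 177}))   # ['avail:lot1:floor2']
```

The bounded staleness window is the reconciliation interval: between runs, a missed event can leave the counter off by a few, but never for longer than five minutes.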

Payment Integration

Parking payment follows the same patterns as e-commerce (Session 7.6). An idempotency key per booking prevents double charges. For time-based pricing (pay per hour), the system calculates the final amount when the driver exits, not when they enter. The booking record stores the start time. The exit event (sensor detects car leaving, or user taps "End Session") triggers the payment calculation.

Pre-authorization is useful for reservations. The system authorizes a hold on the user's payment method at booking time and captures the actual amount at exit. If the user cancels before arriving, the hold is released.
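Exit-time pricing with an idempotency guard can be sketched in a few lines. The flat hourly rate, ceiling-rounded billing, and the in-memory charge ledger are assumptions for the example; a real system would persist the idempotency key alongside the booking.

```python
from datetime import datetime, timedelta

HOURLY_RATE_CENTS = 300              # assumed flat rate for illustration
_charges: dict[str, int] = {}        # idempotency key -> amount already captured

def capture_payment(booking_id: str, start: datetime, exit_time: datetime) -> int:
    """Compute the final amount at exit. Repeated calls with the same
    booking_id return the original charge instead of charging twice."""
    if booking_id in _charges:       # idempotency guard: no double capture
        return _charges[booking_id]
    seconds = int((exit_time - start).total_seconds())
    hours = -(-seconds // 3600)      # ceiling division: a partial hour bills as a full hour
    amount = hours * HOURLY_RATE_CENTS
    _charges[booking_id] = amount    # record the capture under the idempotency key
    return amount

start = datetime(2026, 4, 1, 9, 0)
print(capture_payment("bk-42", start, start + timedelta(hours=2, minutes=10)))  # 900
print(capture_payment("bk-42", start, start + timedelta(hours=5)))              # 900, replayed
```

The second call returns the recorded 900 even with a different exit time: once captured, a retry (from a client timeout or a duplicate exit event) is a no-op.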

Further Reading

  • How to do distributed locking (Martin Kleppmann)
  • PostgreSQL MVCC and Concurrency Control (PostgreSQL Documentation)
  • MQTT: The Standard for IoT Messaging (mqtt.org)
  • Designing a Parking Lot System (Design Gurus)

Assignment

A shopping mall has 500 parking spots across 3 floors (200, 200, 100). Two users attempt to book the same spot at the same time.

  1. Write the SQL statements for the optimistic locking flow. Include the SELECT (read version), the UPDATE (with version guard), and the INSERT into the bookings table. Wrap it in a transaction.
  2. What happens if the system uses pessimistic locking (SELECT FOR UPDATE) instead? Calculate the maximum throughput if each booking transaction takes 200ms and there are 500 spots. Is this acceptable for peak-hour traffic?
  3. A sensor detects a car in Spot #42, but the database shows Spot #42 as "available" (no booking exists). Design the reconciliation logic. What status should the spot get? Should the system attempt to identify the car's owner?
  4. Design the Redis cache invalidation strategy. When should the counter be updated? What happens if a Redis node crashes and restarts with stale data?
© Ibrahim Anwar · Bogor, West Java
This work is licensed under CC BY 4.0