hibranwar

Module 8: Real-World Case Studies II

Systems Thinking × System Design · 10 sessions


The Problem

The web contains over 100 billion pages. A user types a few words and expects relevant results in under 300 milliseconds. The system must crawl those pages, understand their content, store it in a format optimized for retrieval, and rank results by relevance. Every component operates at a scale where naive approaches collapse.

A search engine is not one system. It is a pipeline of specialized subsystems, each solving a different problem, connected end-to-end.

The Pipeline

At the highest level, a search engine is a five-stage pipeline. Raw web pages enter one end. Ranked results exit the other.

graph LR
  A[URL Frontier] --> B[Crawler]
  B --> C[Parser]
  C --> D[Indexer]
  D --> E[Ranking Engine]
  E --> F[Query Server]
  F --> G[User Results]

Each stage has distinct compute characteristics. Crawling is I/O-bound and network-limited. Parsing is CPU-bound. Indexing is both CPU and storage-intensive. Ranking combines precomputed signals with real-time scoring. Serving demands low latency above all else.

Crawling: Discovery at Scale

The crawler's job is to fetch web pages. It starts with a seed list of known URLs, fetches each page, extracts links from the HTML, and adds new URLs to the frontier. This process runs continuously across thousands of machines.

Three problems dominate crawler design:

  • Politeness. Crawlers must respect robots.txt and rate-limit requests per domain. Hammering a small site with 1,000 requests per second will get you blocked and potentially cause an outage for the site owner. A politeness policy typically enforces one request per domain every few seconds.
  • Deduplication. The same content appears at multiple URLs. Redirect chains, URL parameters, and mirror sites all produce duplicates. The crawler maintains a URL-seen set (often a Bloom filter for memory efficiency) and a content fingerprint store to detect near-duplicates.
  • Frontier prioritization. Not all URLs are equal. A page on a major news site should be re-crawled hourly. A personal blog from 2008 can wait months. The URL frontier is a priority queue that balances freshness requirements against crawl budget.

Key insight: The URL frontier is the brain of the crawler. Its prioritization strategy determines what the search engine knows and how current that knowledge is. A poorly designed frontier wastes crawl budget on low-value pages while high-value content goes stale.
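A minimal sketch of that prioritization idea, assuming a single-machine frontier implemented as a min-heap keyed on each URL's next-due crawl time (real frontiers add per-domain politeness queues and distributed state; the URLs and intervals below are illustrative):

```python
import heapq
import time

class URLFrontier:
    """Toy URL frontier: a min-heap ordered by next-due crawl time.

    High-value pages get short recrawl intervals, so they surface at
    the top of the heap more often.
    """

    def __init__(self):
        self._heap = []

    def add(self, url, recrawl_interval_s, now=None):
        now = now if now is not None else time.time()
        # Priority = the timestamp this URL is next due for a crawl.
        heapq.heappush(self._heap, (now + recrawl_interval_s, url))

    def pop_due(self, now=None):
        """Return the highest-priority URL that is due, or None."""
        now = now if now is not None else time.time()
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

frontier = URLFrontier()
frontier.add("https://news.example/home", recrawl_interval_s=3600, now=0)             # hourly
frontier.add("https://blog.example/2008/post", recrawl_interval_s=86400 * 30, now=0)  # monthly
# At t=7200 the news page is due for a recrawl; the 2008 blog post is not.
```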

The Inverted Index

Once pages are crawled and parsed, the content must be stored in a structure optimized for search. A naive approach (scan every document for every query) is impossibly slow at scale. The solution is the inverted index.

A forward index maps documents to the words they contain. An inverted index flips this: it maps each word to the list of documents containing that word. This list is called a posting list.

graph TD
  subgraph Inverted Index
    T1["Term: 'database'"] --> P1["Doc 4, Doc 17, Doc 203"]
    T2["Term: 'sharding'"] --> P2["Doc 17, Doc 89"]
    T3["Term: 'replication'"] --> P3["Doc 4, Doc 89, Doc 112"]
  end
  subgraph QS["Query: database sharding"]
    Q["Intersect posting lists"] --> R["Doc 17"]
  end

Each entry in a posting list typically stores more than just the document ID. It includes the term frequency (how many times the word appears), the positions of each occurrence (enabling phrase queries), and metadata like whether the term appeared in a title or heading. This additional information supports both ranking and advanced query types like "exact phrase" matching.

A query for "database sharding" becomes an intersection of two posting lists. The engine retrieves the posting list for "database" and the posting list for "sharding," then finds documents that appear in both lists. Sorted posting lists make this intersection efficient using a merge-join approach.
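The merge-join intersection can be sketched in a few lines, using the posting lists from the diagram above (a real engine streams compressed posting lists rather than Python lists):

```python
def intersect(postings_a, postings_b):
    """Merge-join intersection of two sorted posting lists of doc IDs."""
    i, j, out = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            out.append(postings_a[i])   # doc contains both terms
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1                      # advance the list that is behind
        else:
            j += 1
    return out

database = [4, 17, 203]   # posting list for "database"
sharding = [17, 89]       # posting list for "sharding"
result = intersect(database, sharding)   # only Doc 17 contains both
```

Because both lists are sorted, the join runs in a single pass over each list, which is why posting lists are kept sorted on disk.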

Ranking: TF-IDF, BM25, and PageRank

Retrieving documents is not enough. The engine must decide which documents are most relevant. Three algorithms form the foundation of modern ranking.

| Algorithm | What It Measures | Strengths | Limitations |
|---|---|---|---|
| TF-IDF | Term frequency weighted by inverse document frequency | Simple, interpretable, effective baseline | No term saturation; long documents unfairly boosted |
| BM25 | TF-IDF with saturation curve and document length normalization | Industry standard (Elasticsearch, Solr, Lucene default) | Tuning parameters (k1, b) require experimentation |
| PageRank | Link graph authority (how many quality pages link to this one) | Query-independent quality signal; resists keyword stuffing | Expensive to compute; can be gamed with link farms |

TF-IDF scores a document by how often a query term appears in it (term frequency), discounted by how common that term is across all documents (inverse document frequency). The word "the" appears everywhere, so it contributes almost nothing. The word "sharding" appears in far fewer documents, so its presence is a stronger signal.

BM25 improves on TF-IDF with two key additions. First, it applies a saturation function: the tenth occurrence of a term in a document adds less relevance than the first. Second, it normalizes by document length so that a 500-word page and a 50,000-word page are compared fairly. BM25 is the default ranking function in Elasticsearch and Apache Lucene.
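Both additions are visible in the standard BM25 per-term formula, sketched below (k1 controls saturation, b controls length normalization; this is the textbook form, not Lucene's exact implementation, which differs in minor details):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term against one document with BM25.

    tf: term frequency in the document
    doc_freq: number of documents containing the term
    """
    # Rarer terms get a larger inverse-document-frequency weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Saturating term-frequency component with length normalization.
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# Saturation in action: ten occurrences score well under ten times one occurrence.
s1 = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100.0, n_docs=1000, doc_freq=50)
s10 = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100.0, n_docs=1000, doc_freq=50)
```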

PageRank operates on the link graph, not on document content. Each page starts with equal rank. On each iteration, a page distributes its rank equally among the pages it links to. After many iterations, pages linked by many high-quality sources accumulate higher rank. Google's original 1998 paper used PageRank as a query-independent authority signal layered on top of content-based scoring.
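The iteration can be sketched on a toy three-page graph. The damping factor and dangling-node handling are standard additions from the original formulation, simplified here:

```python
def pagerank(links, damping=0.85, iters=50):
    """Iterative PageRank over an adjacency dict {page: [outlinks]}."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Base rank from the random-jump (damping) term.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            if not outs:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / len(pages)
            else:
                # Distribute rank equally among outlinks.
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
ranks = pagerank(graph)   # "b" has the most inbound rank, so it scores highest
```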

Fuzzy Search

Users misspell words. They type "databse" instead of "database." A strict inverted index lookup returns nothing for a misspelled term. Fuzzy search bridges this gap.

The standard approach computes edit distance (Levenshtein distance) between the query term and terms in the index. An edit distance of 1 means one character insertion, deletion, or substitution separates the two strings. "databse" is edit distance 1 from "database."

At scale, computing edit distance against every term in the index is too slow. Practical systems use n-gram indexes (breaking terms into overlapping character sequences) or finite state transducers to efficiently find candidate terms within a given edit distance. Elasticsearch supports fuzzy queries natively using automaton-based matching.
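For reference, the edit-distance computation itself is a short dynamic program. Practical systems avoid running it against every term, as noted above, but it defines what "within edit distance 1" means:

```python
def levenshtein(a, b):
    """Classic DP edit distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))   # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # delete ca
                cur[j - 1] + 1,               # insert cb
                prev[j - 1] + (ca != cb),     # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

# "databse" is one insertion away from "database".
```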

Elasticsearch Sharding

A single-machine inverted index cannot hold the entire web. Elasticsearch distributes the index across a cluster using sharding. Each index is split into primary shards, and each shard is a self-contained Lucene index with its own inverted index.

When a query arrives, a coordinator node fans it out to every shard in parallel. Each shard searches its local inverted index, scores candidates using BM25, and returns its top K results. The coordinator merges these partial results into a global top-K list. This scatter-gather pattern enables horizontal scaling: more shards means more documents indexed, and more replicas means higher read throughput.

Routing determines which shard holds a given document. By default, Elasticsearch hashes the document ID and takes the modulo of the number of primary shards. This ensures even distribution. Replicas of each shard live on different nodes, providing fault tolerance. If a node dies, its replicas on other nodes continue serving queries without interruption.
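The default routing rule amounts to one line. This sketch uses md5 as a stand-in for the murmur3 hash Elasticsearch actually uses, to keep it dependency-free; the modulo is also why the primary shard count is fixed at index creation, since changing it would re-route every document:

```python
import hashlib

def route(doc_id: str, num_primary_shards: int) -> int:
    """Pick the shard for a document by hashing its ID."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return h % num_primary_shards

# The same document ID always lands on the same shard,
# so reads and writes for a document agree on its location.
shard = route("doc-17", 5)
```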

| Concept | Purpose | Trade-off |
|---|---|---|
| Primary Shard | Holds a partition of the index; handles writes | Shard count is fixed at index creation |
| Replica Shard | Copy of primary; serves read traffic | More replicas = more storage, higher read throughput |
| Coordinator Node | Receives query, fans out, merges results | Bottleneck if too many shards (fan-out overhead) |
| Segment Merging | Lucene merges small segments into larger ones | Improves query speed but consumes I/O during merge |

Putting It Together

A complete search engine combines all of these components into a feedback-driven system. The crawler feeds the indexer. The indexer builds inverted indexes distributed across shards. The ranking engine combines content signals (BM25) with authority signals (PageRank) and freshness signals (crawl recency). The query server orchestrates distributed retrieval and returns results under a latency budget.

Each component is a system in itself, with its own scaling challenges, failure modes, and optimization surfaces. The art of search engine design is in how these components interact, not in any single component alone.

Further Reading

  • Elasticsearch from the Bottom Up (Elastic Blog). Deep dive into Lucene internals, inverted indexes, and segment architecture.
  • BM25: The ranking algorithm behind search engines (Arpit Bhayani). Clear walkthrough of the BM25 formula and its parameters.
  • Elasticsearch Deep Dive for System Design Interviews (Hello Interview). Sharding, replication, and distributed query execution.
  • Google Search System Design: A Complete Guide (System Design Handbook). End-to-end architecture of web search at Google scale.

Assignment

A brand new webpage is published on the internet. Trace its journey from completely unknown to appearing in search results. For each stage, answer:

  1. How does the crawler discover the URL? What data structure manages crawl priority?
  2. Once fetched, how is the page's content processed and stored? What data structure enables fast keyword lookup?
  3. When a user searches for a term on that page, how does the system retrieve and rank the result? Name the specific algorithms involved.
  4. If the search index is distributed across 100 shards, describe how the query reaches the right shard and how partial results are merged.

The Problem

A popular artist announces a concert. 50,000 seats. Within seconds of tickets going on sale, 500,000 users hit the "Buy" button simultaneously. The system must ensure that no seat is sold twice, that the experience feels fair, and that payment processing does not corrupt inventory state. Failure at any point means overselling, angry customers, and refund chaos.

Ticketing is fundamentally different from e-commerce. In e-commerce, if one item sells out, there are usually substitutes. In ticketing, every seat is unique. Section 103, Row F, Seat 12 either belongs to one person or it does not. There is no partial fulfillment.

Key insight: Ticketing is a fairness problem disguised as a scaling problem. The hardest part is not handling 500K requests per second. It is ensuring that the person who clicked first actually gets the seat, and that no seat is promised to two people at once.

High-Level Architecture

graph TD
  U[Users] --> LB[Load Balancer]
  LB --> WQ[Virtual Waiting Queue]
  WQ --> API[Booking API]
  API --> SL["Seat Lock Service (Redis)"]
  API --> INV["Inventory Service (In-Memory Cache)"]
  API --> PAY[Payment Service]
  PAY --> DB[(Primary Database)]
  SL --> DB
  INV --> DB
  DB --> NOTIFY[Notification Service]
  NOTIFY --> U

The architecture separates concerns into distinct services. The virtual waiting queue controls admission rate. The seat lock service prevents double-booking. The inventory service tracks availability in memory for fast reads. The payment service handles the financial transaction. The notification service confirms or rejects.

Seat Locking: Optimistic vs. Pessimistic

When a user selects a seat, the system must temporarily reserve it while they complete payment. Two strategies exist for this lock.

| Strategy | How It Works | Best For | Risk |
|---|---|---|---|
| Pessimistic Locking | Lock the seat row in the database immediately when selected. No other transaction can read or modify it until released. | Low concurrency, strong consistency requirements | Lock contention under high load; database becomes bottleneck |
| Optimistic Locking | Allow multiple users to select the same seat. At commit time, check a version number. If it changed, the commit fails and the user must retry. | High read volume, lower write contention | Users see "available" seats that are already taken; poor UX during flash sales |
| Distributed Lock (Redis) | Use Redis SET with NX (set-if-not-exists) and TTL. A Lua script atomically checks availability and sets the lock in one operation. | Flash sales, extreme concurrency | Requires TTL management; lock expiry before payment completes can cause issues |

For flash sale scenarios, the Redis-based distributed lock is the standard choice. The lock is set with a TTL (typically 10 minutes). If the user completes payment within that window, the lock converts to a confirmed booking. If the TTL expires, the seat releases back to inventory automatically.

The critical detail is atomicity. Checking "is this seat available?" and "lock it for me" must happen in a single, uninterruptible operation. Without atomicity, two users can both see the seat as available, both attempt to lock it, and one ends up with a phantom reservation. Redis Lua scripts solve this by executing the check-and-set as one atomic unit on the server side.
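To make the atomicity point concrete, here is an in-process simulation of `SET key value NX EX ttl`. It is only a sketch: a real deployment would call Redis itself (for example redis-py's `r.set(seat, user, nx=True, ex=600)`) so the check-and-set executes atomically on the server:

```python
import time

class SeatLockService:
    """In-process simulation of Redis SET with NX and a TTL.

    The check ("is there a live lock?") and the set ("lock it for me")
    happen in one method call, mimicking the single atomic Redis command.
    """

    def __init__(self):
        self._locks = {}  # seat -> (holder, expiry_timestamp)

    def try_lock(self, seat, user, ttl_s, now=None):
        now = now if now is not None else time.time()
        holder = self._locks.get(seat)
        if holder and holder[1] > now:
            return False                          # NX fails: a live lock exists
        self._locks[seat] = (user, now + ttl_s)   # acquire with expiry
        return True

svc = SeatLockService()
# First user locks the seat; a second attempt 200ms later is rejected;
# after the TTL expires, the seat is available again.
```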

Flash Sale Queue Architecture

When 500,000 users click "Buy" simultaneously, letting all of them hit the booking API directly would overwhelm every downstream service. The solution is a virtual waiting queue that controls admission.

sequenceDiagram
  participant U as 500K Users
  participant Q as Virtual Queue
  participant A as Admission Controller
  participant B as Booking API
  participant R as Redis Lock
  participant P as Payment
  U->>Q: Enter waiting room
  Q-->>U: Position #47,231. Estimated wait: 4 min
  A->>Q: Release next batch (500 users)
  Q->>B: Admitted users proceed
  B->>R: SET seat_103_F_12 NX TTL 600
  R-->>B: OK (locked)
  B-->>U: Seat reserved. Complete payment in 10 min.
  U->>P: Submit payment
  P->>B: Payment confirmed
  B->>R: Convert lock to confirmed
  B-->>U: Booking confirmed

The queue serves two purposes. First, it acts as a buffer, absorbing the initial traffic spike without passing it downstream. Second, it enforces fairness by processing users in the order they arrived. The admission controller releases users in batches sized to match the booking API's throughput capacity.

Users in the queue see their position and an estimated wait time. This transparency is important. People tolerate waiting when they can see progress. They do not tolerate a spinning wheel with no information.

In-Memory Inventory

The primary database is the source of truth for seat ownership. But querying it for every "show me available seats" request is too slow under flash-sale load. The solution is an in-memory inventory cache, typically Redis, that mirrors seat availability.

When a seat is locked, both the Redis lock and the inventory cache are updated. When a lock expires, the cache is updated to show the seat as available again. The database is updated only on confirmed booking, not on lock acquisition. This reduces write pressure on the database to only successful transactions.

The trade-off is eventual consistency. The cache may briefly show a seat as available when it has just been locked, or show it as locked when the lock just expired. For flash sales, this is acceptable. The lock service is the true gatekeeper, not the inventory display.

Payment Idempotency

Network failures during payment create a dangerous scenario. The user's payment goes through, but the confirmation response is lost. The user clicks "Pay" again. Without idempotency, they get charged twice.

Idempotency means that processing the same request multiple times produces the same result as processing it once. The standard implementation assigns a unique idempotency key to each booking attempt. The payment service stores this key with the transaction. If the same key arrives again, the service returns the previous result instead of processing a new charge.
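A minimal sketch of that scheme, with an in-memory dict standing in for the payment service's key store:

```python
class PaymentService:
    """Idempotent charge handler: replays return the stored result."""

    def __init__(self):
        self._results = {}   # idempotency_key -> previous result
        self.charges = 0     # how many real charges were processed

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._results:
            return self._results[idempotency_key]   # replay: no new charge
        self.charges += 1
        result = {"status": "ok", "amount": amount, "charge_id": self.charges}
        self._results[idempotency_key] = result
        return result

pay = PaymentService()
first = pay.charge("booking-42", 120)
retry = pay.charge("booking-42", 120)   # response was lost, user clicked again
# first == retry, and only one real charge was processed
```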

Traffic Pattern: The Spike

Ticketing traffic does not follow normal web patterns. It is dominated by extreme spikes at the moment tickets go on sale, followed by rapid decay.

This spike pattern drives every architectural decision. The virtual queue exists because of this spike. The in-memory inventory exists because the database cannot handle this spike. The Redis lock exists because traditional database locks cannot handle this spike. Every component is shaped by the fact that 90% of all traffic arrives in the first 60 seconds.

Failure Modes

Three failure scenarios require explicit handling:

  • Lock expires before payment completes. The seat releases back to inventory and another user grabs it. The first user's payment must be refunded. The system needs a reconciliation process that detects "payment succeeded but seat was lost" and triggers automatic refunds.
  • Payment service is down. The seat is locked but payment cannot be processed. The lock TTL acts as a safety valve. After 10 minutes, the seat returns to the pool. The user is told to retry.
  • Double submission. The user clicks "Pay" twice. The idempotency key prevents double charging, but the system must also prevent creating two booking records for the same seat.

Further Reading

  • Design a Ticket Booking Site Like Ticketmaster (Hello Interview). Complete system design walkthrough with HLD and deep dives.
  • Design Ticketmaster: A Comprehensive Guide (System Design School). Covers seat locking, queue architecture, and payment flow.
  • Ticketmaster System Design: Step-by-Step Guide (System Design Handbook). Detailed treatment of distributed locking and inventory management.
  • Ticketmaster's System Design (Educative Blog). Estimation, API design, and database schema for ticketing.

Assignment

A popular concert has 50,000 seats. 500,000 users hit the "Buy" button within the first second of tickets going on sale. Design the admission and booking flow.

  1. How do you prevent all 500K requests from reaching the booking API simultaneously? Describe the queue mechanism and batch sizing strategy.
  2. A user selects Section 103, Row F, Seat 12. Another user selects the same seat 200ms later. Walk through exactly what happens at the Redis lock level for both users.
  3. The first user's payment takes 8 minutes. The lock TTL is 10 minutes. What happens if payment takes 12 minutes instead? How does the system handle the resulting state?
  4. Why is optimistic locking a poor choice for flash sale seat reservation? What specific failure mode makes it unsuitable?

The Problem

A user has a 2GB folder synced across three devices: a laptop, a phone, and a desktop. They edit a 100MB presentation on the laptop. The system must sync that change to the other two devices. Uploading the entire 100MB file every time a single slide changes is wasteful. If two devices edit the same file simultaneously, the system must detect and resolve the conflict without losing either person's work.

Cloud storage is a deceptively complex system. The user sees a folder. Behind it is a distributed architecture handling deduplication, chunking, delta synchronization, conflict detection, and metadata management across millions of concurrent users.

High-Level Architecture

graph TD
  C1["Client 1 (Desktop)"] --> SA[Sync Agent]
  C2["Client 2 (Laptop)"] --> SA
  C3["Client 3 (Phone)"] --> SA
  SA --> MS[Metadata Service]
  SA --> BS["Block Storage (S3 / GCS)"]
  SA --> NS["Notification Service (Pub/Sub)"]
  MS --> DB[("Metadata DB: file tree, versions, hashes")]
  NS --> C1
  NS --> C2
  NS --> C3

The architecture separates two fundamentally different concerns. The metadata service tracks the file tree: which files exist, their sizes, version numbers, and block hashes. The block storage service stores the actual file data as content-addressed chunks. This separation is the foundation that makes deduplication and delta sync possible.

Block-Level Chunking

Files are not stored as single objects. Each file is split into fixed-size blocks, typically 4MB each. Each block is hashed using SHA-256 to produce a content fingerprint. The file's metadata record stores the ordered list of block hashes that compose it.

This chunking approach enables three capabilities:

  • Parallel upload/download. Multiple blocks transfer simultaneously instead of waiting for a sequential file transfer.
  • Resumable transfers. If a connection drops mid-upload, only the incomplete block needs to be re-sent, not the entire file.
  • Content-addressed storage. Blocks are stored by their hash, not by filename. This is the foundation for deduplication.
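The chunking and fingerprinting step can be sketched directly. The manifest structure follows the description above; the file content is synthetic:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024   # 4MB, as in the text

def chunk(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a file into fixed-size blocks, each fingerprinted with SHA-256."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [(hashlib.sha256(b).hexdigest(), b) for b in blocks]

# A file's metadata record is just the ordered list of block hashes.
blocks = chunk(b"x" * 10_000_000)            # a 10MB file -> 3 blocks (4MB, 4MB, 2MB)
manifest = [h for h, _ in blocks]
# The first two blocks have identical content, so they share a hash:
# content-addressed storage would keep only one copy of that block.
```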

Block-Level Deduplication

If two users upload the same 4MB block, only one copy is stored. The system checks: does a block with this SHA-256 hash already exist in storage? If yes, skip the upload and just reference the existing block. If no, upload the block and register its hash.

The impact at scale is enormous. Consider a company where 500 employees all have the same 50MB onboarding PDF in their synced folders. Without deduplication, that is 25GB of storage. With block-level dedup, it is 50MB. The other 499 copies are just metadata pointers to the same blocks.

Deduplication also applies within a single user's files. If you copy a folder, the blocks already exist. The system creates new metadata entries pointing to existing blocks. The copy is instant from a storage perspective.

Key insight: Separating metadata from block storage is what makes deduplication work. A file is just an ordered list of block hashes. "Copying" a file means copying a list of hashes. The blocks themselves are immutable, content-addressed objects shared across all users.

Delta Sync

A user edits one slide in a 100MB presentation. The file is composed of 25 blocks (at 4MB each). Only the block containing the modified slide has changed. Delta sync identifies which blocks changed and uploads only those.

sequenceDiagram
  participant C as Client
  participant S as Sync Agent
  participant M as Metadata Service
  participant B as Block Storage
  C->>S: File modified: presentation.pptx
  S->>S: Re-chunk file, compute block hashes
  S->>M: Compare new hashes with stored hashes
  M-->>S: Blocks 1-24 unchanged. Block 17 is new.
  S->>B: Upload only Block 17 (4MB)
  B-->>S: Block stored, hash registered
  S->>M: Update file metadata: block 17 hash updated
  M-->>S: Version incremented
  S->>C: Sync complete

Instead of uploading 100MB, the client uploads 4MB. That is a 96% reduction in bandwidth. For users on slow connections or mobile networks, this difference is the reason the product is usable at all.

The mechanism is an rsync-style algorithm. The client computes a rolling checksum across the modified file to detect whether block boundaries have shifted, then computes the SHA-256 hash of each block and compares it against the previously stored hashes. Only blocks with new hashes are uploaded.
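A simplified version of the hash comparison, omitting the rolling-checksum step (so an insertion that shifts block boundaries would mark every later block as changed; real delta sync handles that case):

```python
def changed_blocks(old_hashes, new_hashes):
    """Return indices of blocks to upload: positions whose hash differs,
    plus any blocks appended past the end of the old manifest."""
    return [i for i, h in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != h]

old = ["h0", "h1", "h2", "h3"]
new = ["h0", "h1", "h2x", "h3"]   # one slide edited inside block 2
to_upload = changed_blocks(old, new)   # only block 2 is re-uploaded
```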

Sync Strategy Comparison

| Strategy | What Transfers | Bandwidth Cost (100MB file, 1 line changed) | Complexity | Use Case |
|---|---|---|---|---|
| Full Sync | Entire file every time | 100 MB | Low | Simple backup, initial upload |
| File-Level Delta | Only modified files (entire file if any change) | 100 MB | Low | Basic sync tools |
| Block-Level Delta | Only modified blocks within modified files | 4 MB (one block) | Medium | Dropbox, Google Drive |
| Byte-Level Delta | Only changed bytes (binary diff) | ~1 KB | High | rsync, specialized tools |

Block-level delta is the sweet spot for cloud storage. It captures most of the savings of byte-level delta with significantly less computational overhead. Computing a binary diff of a 100MB file is expensive. Comparing 25 block hashes is cheap.

Conflict Resolution

Two users edit the same file on different devices while offline. Both devices come online and attempt to sync. The metadata service detects a conflict: both edits are based on the same parent version, but they produce different block hashes.

Three strategies handle this:

  • Last-writer-wins. The most recent edit (by timestamp) becomes the canonical version. The other edit is discarded. Simple but lossy. Used for low-value or auto-generated files.
  • Conflict copy. The system keeps both versions. One becomes "presentation.pptx" and the other becomes "presentation (conflict copy).pptx." The user decides which to keep. This is Dropbox's default approach.
  • Merge. For structured documents (like text files with line-based content), the system attempts an automatic three-way merge using the common ancestor version. Git uses this approach. It works well for code but poorly for binary files.

The conflict copy approach is the safest default. It never loses data. The cost is user inconvenience: someone must manually reconcile the two versions. For most cloud storage products, this trade-off is acceptable because conflicts are rare. Most files are edited by one person at a time.

Notification and Propagation

When the laptop syncs a change, the desktop and phone need to know. The notification service uses a pub/sub pattern. Each device subscribes to a channel for its user's file tree. When the metadata service records a new version, it publishes a notification. Each subscribed device pulls the updated metadata and downloads any new blocks.

For devices that are offline, the notification queues until the device reconnects. On reconnection, the sync agent pulls all pending changes and applies them in version order.

Further Reading

  • Dropbox Architecture Deep Dive: Building a High-Frequency File Storage System at Scale (Chengchang Yu). Detailed walkthrough of Dropbox's chunking, sync, and metadata architecture.
  • Design a Cloud Storage System Like Dropbox: A Step-by-Step Guide (System Design Handbook). Estimation, API design, and component breakdown.
  • The rsync algorithm: Rolling checksum (rsync.samba.org). Original technical report on the rolling checksum that powers delta sync.
  • Rsync Algorithm in System Design (GeeksforGeeks). Accessible explanation of the two-level checksum system and block matching.

Assignment

A user has a 100MB file synced via cloud storage. They change one line in the file. Design the sync flow.

  1. Without delta sync, how much data transfers? With block-level delta sync (4MB blocks), how much? With byte-level delta, approximately how much?
  2. The file is split into 25 blocks. Describe how the client determines which blocks changed. What hash function is used, and why?
  3. Two devices edit the same file while offline. Device A changes block 3. Device B changes block 17. Can the system merge these changes automatically? What if both devices changed block 3?
  4. A company has 1,000 employees. Each has a copy of the same 200MB training video. How much total storage does the system use with and without deduplication?

The Problem

Two users are editing the same document at the same time. User A types "hello" at position 10. User B, at the same moment, deletes characters 5 through 8. Both operations are valid locally. But when they arrive at the server, the positions no longer mean the same thing. User A's "position 10" assumed the original document. After User B's deletion, position 10 refers to a different location in the text.

Without a conflict resolution mechanism, concurrent edits produce garbled text, lost characters, or duplicated content. The core challenge of collaborative editing is not networking or storage. It is making concurrent operations converge to the same result on every device, regardless of the order they arrive.

Key insight: Collaboration is a conflict resolution problem. The algorithm determines whose keystrokes survive and how positions shift when edits overlap. Get it wrong, and users see different versions of the same document.

High-Level Architecture

graph TD
  U1["User A (Browser)"] -->|WebSocket| GW[Gateway Service]
  U2["User B (Browser)"] -->|WebSocket| GW
  U3["User C (Browser)"] -->|WebSocket| GW
  GW --> OT[OT / CRDT Engine]
  OT --> DS[Document Store]
  OT --> PS[Presence Service]
  DS --> DB[("Document DB: snapshots + op log")]
  PS --> GW
  GW -->|Broadcast transformed ops| U1
  GW -->|Broadcast transformed ops| U2
  GW -->|Broadcast transformed ops| U3

Each user connects to the server via a persistent WebSocket. Edits are represented as operations (insert, delete, format) and sent to the server. The OT/CRDT engine transforms these operations to account for concurrency, applies them to the canonical document state, and broadcasts the transformed operations to all other connected users. The presence service tracks cursor positions, selections, and user activity for the colored cursors visible in the editor.

Operational Transformation (OT)

OT is the algorithm Google Docs uses. It works by transforming operations against each other so that they produce consistent results regardless of execution order.

Consider this scenario. The document contains "ABCDEF" (6 characters). User A inserts "X" at position 2. User B deletes the character at position 4 (the letter "D"). Both edits happen simultaneously.

sequenceDiagram
  participant A as User A
  participant S as Server
  participant B as User B
  Note over A,B: Document: "ABCDEF"
  A->>S: Insert "X" at position 2
  B->>S: Delete at position 4
  Note over S: Server receives A first
  S->>S: Apply A: "ABXCDEF"
  S->>S: Transform B against A: B's position 4 shifts to 5 (insertion before position 4)
  S->>S: Apply transformed B: delete at 5, giving "ABXCEF"
  S->>A: Transformed B: delete at position 5
  S->>B: Transformed A: insert "X" at position 2
  Note over A,B: Both see: "ABXCEF"

The transformation logic is straightforward in this case. User A inserted a character before User B's target position. So User B's position must shift right by 1. Without this transformation, User B would delete the character at position 4 in the modified document, which is now "C" instead of "D." The wrong character disappears.
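The position shift can be written as a small transform function. The scenario above counts positions from 1; this sketch uses Python's 0-based indices, so the "D" sits at index 3 and the transformed delete lands at index 4. Full OT defines a transform like this for every pair of operation types:

```python
def transform_delete_against_insert(delete_idx, insert_idx):
    """Shift a concurrent delete to account for an insert (0-based indices)."""
    if insert_idx <= delete_idx:
        return delete_idx + 1   # insert landed at or before the target: shift right
    return delete_idx           # insert landed after: position unaffected

doc = "ABCDEF"
# User A inserts "X" at index 2; User B concurrently deletes the "D" at index 3.
after_a = doc[:2] + "X" + doc[2:]               # apply A: "ABXCDEF"
idx = transform_delete_against_insert(3, 2)     # B's delete at 3 becomes 4
after_b = after_a[:idx] + after_a[idx + 1:]     # the "D" is removed: "ABXCEF"
```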

OT requires a central server to establish a canonical operation order. Every operation carries a revision number. The server maintains a linear history of applied operations. When an operation arrives that was generated against an older revision, the server transforms it against all operations that have been applied since that revision. This is called the transformation pipeline.

CRDTs: The Decentralized Alternative

Conflict-free Replicated Data Types take a fundamentally different approach. Instead of transforming operations, CRDTs embed enough metadata in the data structure itself to guarantee that all replicas converge automatically, without a central server.

In a CRDT-based text editor, every character has a unique, globally ordered identifier. This identifier is not a simple position number. It is a tuple of (site ID, logical clock, fractional position) that establishes a total order across all characters inserted by all users. When User A inserts between characters with IDs 1.0 and 2.0, the new character gets an ID like 1.5. This ID never changes, even as other characters are inserted or deleted around it.

Because every character has a stable, unique ID, operations never conflict. An insert specifies "place this character after ID X." A delete specifies "remove the character with ID Y." These operations commute: the final result is the same regardless of the order they are applied.
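A toy version of the fractional-position idea (real CRDT text types such as Logoot or the structures inside Yjs pair the fraction with a site ID and logical clock to break ties and avoid precision exhaustion; plain floats here are purely illustrative):

```python
def fractional_id(left, right):
    """Midpoint ID between two neighbouring character IDs."""
    return (left + right) / 2

# Characters carry stable IDs instead of shifting positions.
chars = {1.0: "A", 2.0: "B", 3.0: "C"}

new_id = fractional_id(1.0, 2.0)   # insert "X" between A and B: ID 1.5
chars[new_id] = "X"

# The document text is the characters read in ID order.
text = "".join(chars[i] for i in sorted(chars))   # "AXBC"

# A delete names a stable ID, never a position, so concurrent
# inserts elsewhere cannot change which character it removes.
del chars[3.0]
```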

OT vs. CRDT Comparison

| Dimension | Operational Transformation | CRDTs |
|---|---|---|
| Central server required? | Yes. Server establishes canonical order. | No. Replicas converge independently. |
| Complexity | Transform functions for every operation pair. Correctness proofs are notoriously difficult. | Data structure design is complex upfront, but operations are simpler. |
| Bandwidth | Low. Only operations are transmitted. | Higher. Each character carries metadata (IDs, tombstones). |
| Offline support | Difficult. Offline edits require complex rebasing against server history. | Natural. Offline edits merge automatically on reconnect. |
| Undo | Complex. Must invert and re-transform operations. | Complex. Must track causal history per operation. |
| Used by | Google Docs, Microsoft Office Online | Figma, Yjs, Automerge |
| Memory overhead | Low (operation log) | Higher (per-character metadata, tombstones for deleted characters) |

The Delta-Based Document Model

Rather than storing the full document text, collaborative editors typically store the document as a sequence of deltas (operations). The initial document is a snapshot. Every edit after that is a delta applied to the snapshot. To reconstruct the current document, replay all deltas from the last snapshot.

Periodic snapshots compress the history. Instead of replaying 10,000 operations from the beginning of the document, the system saves a snapshot every N operations and only replays from the most recent snapshot. This bounds the cost of opening a document regardless of its edit history.
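The snapshot-plus-deltas model can be sketched as follows. The delta format (insert-only, position plus text) and the tiny snapshot interval are illustrative; real systems compact after hundreds or thousands of operations.

```python
SNAPSHOT_EVERY = 3  # toy value; production systems use far more

class DocumentStore:
    def __init__(self, text=""):
        self.snapshot = text   # last materialized state
        self.deltas = []       # operations since the snapshot

    def apply(self, pos, text):
        self.deltas.append((pos, text))
        if len(self.deltas) >= SNAPSHOT_EVERY:
            # Compact: fold deltas into a new snapshot so opening the
            # document never replays an unbounded history.
            self.snapshot = self.materialize()
            self.deltas = []

    def materialize(self):
        doc = self.snapshot
        for pos, text in self.deltas:
            doc = doc[:pos] + text + doc[pos:]
        return doc

store = DocumentStore("fox")
store.apply(0, "quick ")   # "quick fox"
store.apply(0, "The ")     # "The quick fox"
store.apply(10, "brown ")  # third delta triggers a snapshot
print(store.materialize()) # "The quick brown fox"
```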

WebSocket for Real-Time Sync

Collaborative editing requires bidirectional, low-latency communication. Plain HTTP request-response fits poorly: the server cannot push updates, so clients would have to poll, adding overhead and latency to every edit. WebSocket maintains a persistent, full-duplex connection between the browser and the server.

The server broadcasts transformed operations to all connected clients within milliseconds. Cursor positions, selections, and typing indicators also flow through this channel. The result is the "live" feeling of seeing other users' cursors move in real time.

Connection management adds complexity. If a user's WebSocket drops (network change, laptop sleep), the client must reconnect, fetch any operations it missed, and apply them to its local state. The server maintains a per-document operation log that clients can replay from their last known revision.
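The replay-from-revision mechanism can be sketched with a per-document log where the operation at index i has revision i + 1. The class and method names are invented for illustration.

```python
class DocumentChannel:
    """Server-side per-document log; reconnecting clients replay
    everything after their last known revision."""

    def __init__(self):
        self.log = []  # op at index i has revision i + 1

    def append(self, op):
        self.log.append(op)
        return len(self.log)  # new head revision

    def since(self, revision):
        """Return the ops the client missed, plus the head revision."""
        return self.log[revision:], len(self.log)

channel = DocumentChannel()
channel.append({"insert": "Hello"})
channel.append({"insert": ", world"})

# Client reconnects knowing only revision 1 and catches up.
missed, head = channel.since(1)
assert missed == [{"insert": ", world"}] and head == 2
```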

Offline Mode and Reconciliation

A user edits a document on a plane with no internet. They make 50 edits over two hours. When they land and reconnect, those 50 operations must be integrated with whatever changes other users made during the same period.

With OT, this is expensive. Each offline operation must be transformed against every server operation that happened in the interim. If 200 operations occurred on the server, each of the 50 offline operations must be transformed against all 200, creating a cascade of transformations.

With CRDTs, offline reconciliation is straightforward. The client sends its operations. Because CRDT operations commute (order does not matter), the server simply applies them. No transformation needed. This is the primary reason newer collaboration tools like Figma adopted CRDTs over OT.

Further Reading

  • How Google Docs Uses Operational Transformation for Real-Time Collaboration (DEV Community). Step-by-step walkthrough of OT with examples.
  • I was wrong. CRDTs are the future (Seph Gentle). A former Google Wave engineer's argument for CRDTs over OT, with performance benchmarks.
  • Google Docs Architecture: Real-Time Collaboration with OT vs. CRDTs (SDE Ray). Architectural comparison with diagrams and trade-off analysis.
  • Understanding Real-Time Collaboration with CRDTs (Shambhavi Shandilya, Medium). Accessible introduction to CRDT data structures for text editing.

Assignment

Two users are editing the same document simultaneously. The document currently reads "The quick brown fox." User A inserts "very " at position 4 (before "quick"). User B inserts "old " at position 4 (also before "quick"), at the same time.

  1. Without any conflict resolution, what does each user see after applying both operations locally? Why do they see different results?
  2. With OT, the server receives User A's operation first. How does it transform User B's operation? What does the final document read?
  3. If this system used CRDTs instead, how would the character IDs prevent the conflict? What metadata ensures a consistent ordering?
  4. User A goes offline for 30 minutes and makes 20 edits. Meanwhile, 100 operations occur on the server. Compare the reconciliation cost for OT vs. CRDT when User A reconnects.

The Problem

An auction for a rare guitar ends at 3:00:00 PM. At 2:59:58, a new bid arrives. This bid is valid. But it was timed to prevent other bidders from responding. This is bid sniping, and it undermines the auction's purpose of discovering the true market price through competitive bidding.

An online auction platform must solve several problems simultaneously: strict ordering of bids (the highest valid bid wins, always), protection against last-second sniping, detection of fraudulent bid patterns, and real-time notification to all participants. Each of these has distinct consistency and latency requirements.

Auction Lifecycle

Every auction follows a state machine. The transitions must be strictly controlled because each state change has financial and legal consequences.

stateDiagram-v2
    [*] --> Draft : Seller creates listing
    Draft --> Scheduled : Listing approved
    Scheduled --> Active : Start time reached
    Active --> Active : New bid received
    Active --> Extended : Bid in final window (anti-snipe)
    Extended --> Extended : Another bid during extension
    Active --> Ended : End time reached, no final-window bids
    Extended --> Ended : Extension elapses with no new bids
    Ended --> Sold : Payment confirmed
    Ended --> Unsold : Reserve not met / no bids
    Sold --> [*]
    Unsold --> [*]

The transition from Active to Extended is the anti-sniping mechanism. The transition from Ended to Sold requires payment confirmation within a deadline. If the winning bidder does not pay, the platform may offer the item to the second-highest bidder.

Strict Bid Ordering

The fundamental requirement of an auction is that the highest valid bid wins. In a distributed system, this is harder than it sounds. Two bids arrive at two different servers at nearly the same time. Without a single source of truth for bid order, the system could accept both, or accept the wrong one.

The standard solution is a single-writer pattern per auction. All bids for a given auction route to one designated partition (shard). Within that partition, bids are processed sequentially. There is no concurrent write to the same auction's bid history.

This is a deliberate trade-off. By funneling all bids for one auction through a single writer, you sacrifice write throughput on that auction in exchange for absolute ordering guarantees. For most auctions, this is fine. Even the most popular auctions receive at most a few hundred bids per minute, not thousands per second.

sequenceDiagram
    participant B1 as Bidder 1
    participant B2 as Bidder 2
    participant LB as Load Balancer
    participant BP as Bid Processor (Auction #4821 Partition)
    participant DB as Auction DB
    participant N as Notification Service
    B1->>LB: Bid $450 on Auction #4821
    B2->>LB: Bid $475 on Auction #4821
    LB->>BP: Route both to partition for #4821
    Note over BP: Process sequentially
    BP->>DB: Read current high bid ($400)
    BP->>DB: Validate & accept $450
    BP->>N: Notify all: new high bid $450
    N->>B2: You've been outbid ($450)
    BP->>DB: Read current high bid ($450)
    BP->>DB: Validate & accept $475
    BP->>N: Notify all: new high bid $475
    N->>B1: You've been outbid ($475)

Routing is deterministic. The auction ID is hashed to select the partition. Every bid for auction #4821 goes to the same processor, regardless of which server receives the HTTP request. This guarantees a total order on all bids for that auction.
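Deterministic routing can be sketched in a few lines. The sketch uses `zlib.crc32` because it is stable across processes (unlike Python's randomized built-in `hash()`); the partition count and plain modulo are illustrative, and real systems typically use consistent hashing so partitions can be added without remapping every auction.

```python
import zlib

NUM_PARTITIONS = 16  # illustrative

def partition_for(auction_id: int) -> int:
    """Every server computes the same partition for the same auction,
    so all bids for one auction serialize through one processor."""
    key = str(auction_id).encode()
    return zlib.crc32(key) % NUM_PARTITIONS

# Every bid for auction #4821 lands on the same partition,
# no matter which front-end server received the HTTP request.
assert partition_for(4821) == partition_for(4821)
```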

Anti-Sniping: Time Extension

Bid sniping is placing a bid in the final seconds of an auction, leaving no time for other bidders to respond. On platforms without protection, sniping tools automatically submit bids 1 to 3 seconds before the auction ends.

The standard countermeasure is a time extension. If a bid arrives within the final N minutes (commonly 2 to 5 minutes) of an auction, the auction deadline extends by N minutes from the time of that bid. This extension can repeat: if another bid arrives during the extension, the clock extends again.

The auction ends only when the extension period elapses with no new bids. This recreates the dynamics of a live auction, where bidding continues as long as someone raises their hand.

Time | Event | Auction End Time
2:55:00 PM | Auction scheduled to end at 3:00 PM | 3:00:00 PM
2:59:58 PM | Bid received (within 5-minute window) | 3:04:58 PM (extended +5 min)
3:03:30 PM | Counter-bid received (within window) | 3:08:30 PM (extended again)
3:08:30 PM | No bids during extension | Auction ends. Last bidder wins.

The extension window size is a design decision. Too short (30 seconds) and snipers can still time their bids to prevent response. Too long (15 minutes) and auctions drag on unnecessarily. Most platforms settle on 2 to 5 minutes.
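The extension rule itself is small. A minimal sketch, using plain seconds-of-day floats for timestamps (a production system would use the database clock as the single time authority):

```python
EXTENSION_WINDOW = 5 * 60  # seconds; a design parameter, see above

def record_bid(end_time: float, bid_time: float) -> float:
    """Return the (possibly extended) auction end time."""
    if bid_time > end_time:
        raise ValueError("auction already ended")
    if end_time - bid_time <= EXTENSION_WINDOW:
        # Bid landed inside the final window: the deadline becomes
        # EXTENSION_WINDOW after this bid, not after the old deadline.
        return bid_time + EXTENSION_WINDOW
    return end_time

end = 15 * 3600                       # 3:00:00 PM as seconds-of-day
end = record_bid(end, 15 * 3600 - 2)  # bid at 2:59:58 PM
assert end == 15 * 3600 - 2 + 5 * 60  # 3:04:58 PM, matching the table
```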

Consistency Requirements by Operation

Not every operation in an auction system requires the same consistency guarantees. Over-engineering consistency where it is not needed wastes resources. Under-engineering it where it is needed causes disputes.

Operation | Consistency Requirement | Rationale
Place a bid | Strong (linearizable) | Bid ordering must be globally agreed. No two bids can appear to have different orderings on different nodes.
View current high bid | Eventual (seconds stale is acceptable) | Displaying a bid that is 2 seconds old does not cause financial harm. Bidders will see the updated value shortly.
Auction end determination | Strong (linearizable) | All nodes must agree on whether the auction is still active. A bid accepted after the auction ended is a legal problem.
Outbid notification | At-least-once delivery | Missing a notification is a poor experience. Duplicate notifications are tolerable.
Bid history display | Eventual | The full bid history is informational. Slight delays in display are acceptable.
Payment processing | Exactly-once (idempotent) | Charging a winner twice is unacceptable. Payment must be idempotent.

Counterfeit Bid Detection

Shill bidding is when a seller (or an accomplice) places fake bids on their own item to drive the price up. The winning bidder pays an inflated price. Detection uses pattern analysis:

  • Account linkage. Bids from accounts sharing an IP address, payment method, or device fingerprint with the seller.
  • Behavioral patterns. An account that only bids on one seller's items, always loses, and always pushes the price up by small increments.
  • Timing analysis. Bids that arrive at suspicious intervals, suggesting automated tooling rather than human decision-making.
  • Withdrawal patterns. A bidder who consistently retracts bids just before the auction ends, after having inflated the price.

Detection typically runs as an asynchronous pipeline. Bids are accepted in real time (you cannot block a bid while running a fraud model), but flagged bids trigger review. If fraud is confirmed, the auction is voided and the seller faces penalties.

Outbid Notifications

When a new bid arrives, every previous bidder who is still in the running must be notified. This is a fan-out problem. A popular auction with 500 bidders means each new bid triggers 499 notifications.

Notifications must be delivered reliably but do not need to be synchronous with bid processing. The bid processor publishes an event to a message queue. A notification consumer reads the event, looks up all previous bidders, and sends push notifications, emails, or SMS based on each user's preferences. If delivery fails, the message is retried.

The at-least-once delivery guarantee means a user might receive two "you've been outbid" notifications for the same bid. This is preferable to the alternative: missing the notification entirely and losing the auction without knowing.
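The at-least-once pattern can be sketched with an in-memory queue: an event is removed only after the send is acknowledged, so a failure between send and ack produces a duplicate rather than a loss, and the receiving side deduplicates on a notification ID. The queue, sender, and ID scheme here are stand-ins, not a real message broker API.

```python
import collections

queue = collections.deque()

def publish(event):
    queue.append(event)

def consume(send, max_attempts=5):
    """Pop an event only after an acknowledged send."""
    while queue:
        event = queue[0]
        ok = False
        for _ in range(max_attempts):
            if send(event):
                ok = True
                break
        if not ok:
            break          # leave event queued; a later run retries it
        queue.popleft()

seen, delivered = set(), []

def flaky_send(event, _fail_once=[True]):
    if _fail_once and _fail_once.pop():  # simulate one transient failure
        return False
    if event["id"] not in seen:          # client-side dedupe by ID
        seen.add(event["id"])
        delivered.append(event)
    return True

publish({"id": "bid-475-user-12", "msg": "You've been outbid ($475)"})
consume(flaky_send)
assert len(delivered) == 1 and not queue
```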

Further Reading

  • Design an Online Auction Platform Like eBay (Hello Interview). Full system design walkthrough covering bid ordering, anti-sniping, and fraud detection.
  • Inside eBay's Real-Time Auction System: Bidding Logic, Algorithms & Fraud Prevention (FrugalTesting). Detailed analysis of eBay's proxy bidding, reserve prices, and fraud prevention.
  • Auction Sniping (Wikipedia). Overview of sniping strategies and platform responses across major auction sites.
  • Bid Sniping (eBay Help). eBay's official position on bid sniping and proxy bidding as a countermeasure.

Assignment

An auction for a collectible item is scheduled to end at 3:00:00 PM. The anti-sniping window is 5 minutes. A bid of $500 arrives at 2:59:58 PM.

  1. What is the new auction end time? If a counter-bid arrives at 3:04:00 PM, what happens?
  2. The platform runs on 5 servers across 2 data centers. A bid arrives at Server A in Data Center 1. Another bid (higher) arrives at Server B in Data Center 2, 50ms later. How does the system guarantee that both data centers agree on which bid was first? What pattern ensures this?
  3. A seller creates a second account and bids on their own item to raise the price from $200 to $350. What signals would a fraud detection system look for? Which of those signals could also appear in legitimate bidding?
  4. The auction has 300 active bidders. A new bid triggers 299 outbid notifications. If the notification service is momentarily down, what delivery guarantee prevents missed notifications? What is the user-facing consequence of this guarantee?

The Problem: Infinite Content, Finite Attention

A user follows 500 friends, 200 pages, and 50 groups. In the last 24 hours, those sources produced 3,000 new posts. The user will scroll through maybe 50 before closing the app. The feed ranking system decides which 50 posts appear first, and in what order.

This is not a sorting problem. It is a prediction problem. The system must estimate, for each candidate post, how likely this specific user is to engage with it. Engagement means different things: a like, a comment, a share, a click, or simply spending five seconds reading. Each action has a different weight. A share is worth more than a like. A "hide post" signal is strongly negative.

The original Facebook approach, called EdgeRank (retired around 2013), used a simple formula: affinity × weight × time decay. How close is the user to the author? How valuable is the content type? How fresh is the post? This worked for hundreds of millions of users. It broke at billions.

Key insight: Feed ranking is a real-time multi-objective optimization problem. The system simultaneously maximizes engagement, content diversity, user satisfaction, and platform health, and these objectives often conflict.

Ranking Signals

Modern feed ranking systems use hundreds of signals. They fall into a few categories.

Signal Category | Examples | Weight | Source
Social graph | Friendship strength, interaction frequency, mutual friends | High | Graph DB, interaction logs
Content type preference | User prefers videos over text, images over links | Medium-High | User behavior model
Recency | Post age, time since last refresh | Medium | Post metadata
Engagement velocity | How fast the post is accumulating likes/comments | Medium | Real-time counters
Author history | Author's average engagement rate, post frequency | Medium | Author profile service
Negative signals | Hides, unfollows, "see less" clicks, spam reports | High (negative) | User feedback logs
Content understanding | NLP topic extraction, image classification, link quality | Low-Medium | ML classifiers
Session context | Time of day, device type, network speed, scroll depth | Low | Client telemetry

No single signal dominates. The model learns non-linear combinations. A video post from a close friend at 8 PM on mobile might rank highest. The same video from a page the user barely interacts with, at 3 AM, ranks much lower.

The Ranking Pipeline

Scoring 3,000 candidates with a full neural network for every user request is too expensive. The standard approach is a multi-stage funnel that progressively narrows candidates while increasing model complexity.

graph TD
    A[Candidate Generation<br/>~3,000 posts] --> B[First-Pass Ranking<br/>Lightweight model<br/>~500 posts]
    B --> C[Main Ranking<br/>Deep neural network<br/>~100 posts]
    C --> D[Re-Ranking<br/>Diversity + policy rules<br/>~50 posts]
    D --> E[Final Feed<br/>~20 posts per page]
    F[User Profile] --> B
    F --> C
    G[Social Graph] --> A
    H[Content Features] --> B
    H --> C
    I[Real-time Signals] --> C
    I --> D

Stage 1: Candidate generation. Pull all eligible posts from the user's social graph. This is a fan-out-on-read operation. For each user, query their friends list, followed pages, and joined groups, then fetch recent posts from each. Pre-computed friend lists and post indexes make this fast. Output: ~3,000 candidates.

Stage 2: First-pass ranking. A lightweight model (logistic regression or a shallow neural net) scores each candidate using pre-computed features. This model runs in under 1ms per candidate. It eliminates obvious non-starters: posts in languages the user does not read, content types they never engage with, posts from sources they interact with rarely. Output: ~500 candidates.

Stage 3: Main ranking. A deep neural network scores the remaining candidates using the full feature set, including real-time engagement signals. This is the most compute-intensive stage, typically taking 10-50ms for the full batch. The model predicts multiple outcomes simultaneously: probability of like, comment, share, click, and hide. These predictions are combined into a single score using learned weights. Output: ~100 candidates.

Stage 4: Re-ranking. Policy and diversity rules adjust the final ordering. No more than 3 posts from the same source in the top 20. Boost content types that are underrepresented. Demote engagement bait. Insert ads at designated positions. Output: the final feed page.
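One of those rules, the per-source cap, can be enforced with a greedy pass over the already-scored list, so no ML model re-runs. A minimal sketch (the cap of 3 follows the text; capped posts are demoted to later pages rather than dropped):

```python
import collections

def rerank(scored_posts, page_size=20, per_source_cap=3):
    """scored_posts: list of (score, source, post_id), highest first.
    Greedily fill the page, skipping posts whose source already hit
    the cap; skipped posts remain eligible for later pages."""
    counts = collections.Counter()
    page = []
    for post in scored_posts:
        if len(page) == page_size:
            break
        _, source, _ = post
        if counts[source] < per_source_cap:
            counts[source] += 1
            page.append(post)
    return page

# 30 candidates, the top 5 all from one over-represented source.
posts = [(1.0 - i * 0.01, "page_x" if i < 5 else f"friend_{i}", i)
         for i in range(30)]
top = rerank(posts)
assert len(top) == 20
assert sum(1 for _, s, _ in top if s == "page_x") == 3
```

Diversity boosts and ad insertion follow the same shape: cheap rules applied to an ordering the expensive model already produced.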

High-Level System Architecture

graph LR
    subgraph Client
        APP[Mobile/Web Client]
    end
    subgraph Edge
        CDN[CDN]
        LB[Load Balancer]
    end
    subgraph "Feed Service"
        FS[Feed Assembler]
        CR[Candidate Retriever]
        RK[Ranking Service]
        AB[A/B Config Service]
    end
    subgraph "Data Layer"
        SG[(Social Graph<br/>TAO / Graph DB)]
        PS[(Post Store<br/>Sharded MySQL)]
        FC[(Feature Cache<br/>Redis / Memcached)]
        ML[ML Model Server<br/>GPU Cluster]
        RT[(Real-time Counters<br/>Engagement Stream)]
    end
    APP --> CDN --> LB --> FS
    FS --> CR
    CR --> SG
    CR --> PS
    FS --> RK
    RK --> FC
    RK --> ML
    RK --> RT
    FS --> AB

The Feed Assembler is the orchestrator. It calls the Candidate Retriever to gather posts, passes them to the Ranking Service for scoring, and checks the A/B Config Service to determine which ranking model variant this user sees. The ML Model Server runs on GPUs and serves predictions via gRPC with batched inference to maximize throughput.

Pre-Computation vs. Real-Time

Not everything can be computed at request time. The system splits work between offline pre-computation and online real-time scoring.

Pre-computed (offline): User embeddings (updated hourly), friend strength scores (updated daily), content embeddings (computed at post creation), author quality scores (updated hourly), and content safety classifications (computed at post creation).

Real-time (online): Engagement velocity (likes in last 5 minutes), session context (device, time, scroll position), recency decay, and interaction between user state and post features.

Pre-computed features are stored in a feature store (often Redis or a dedicated system like Feast) and looked up at serving time. This keeps online latency low. The main ranking model combines pre-computed embeddings with real-time signals to produce the final score.

A/B Testing Infrastructure

Feed ranking changes are never deployed blindly. Every change goes through A/B testing. At Facebook's scale, this means running hundreds of experiments simultaneously, each affecting a slice of users.

The A/B testing infrastructure must guarantee several properties. First, experiment isolation: a user in experiment A and experiment B should not have their results contaminated by interaction effects. Second, statistical power: each experiment needs enough users to detect small changes in metrics. Third, guardrail metrics: even if an experiment improves engagement, it must not degrade user satisfaction surveys, session length, or content diversity below a threshold.

The ranking pipeline is parameterized so that different model versions, feature sets, or re-ranking rules can be swapped per user without changing the serving infrastructure. The A/B Config Service tells the Feed Assembler which configuration to use for each request, based on the user's experiment assignments.

The 200ms Budget

From the moment a user opens the app to the moment the first 20 posts render, the total time budget is roughly 200ms (excluding network transit). Here is how it breaks down:

Stage | Latency Budget | Operation
Candidate retrieval | 30-50ms | Query social graph + post indexes
Feature lookup | 20-30ms | Batch fetch from feature store
First-pass ranking | 10-20ms | Lightweight model on ~3,000 posts
Main ranking | 30-50ms | Deep model on ~500 posts (GPU)
Re-ranking | 5-10ms | Diversity and policy rules
Response assembly | 10-20ms | Hydrate posts with media URLs, author info
Total server-side | 105-180ms |

Every millisecond matters. Feature lookups are batched into a single round-trip. The main ranking model uses batched GPU inference (score 500 posts in one forward pass, not 500 individual calls). The response includes pre-fetched media URLs so the client can begin downloading images before the user scrolls to them.

Further Reading

  • How Machine Learning Powers Facebook's News Feed Ranking (Meta Engineering Blog, 2021)
  • How Does News Feed Predict What You Want to See? (Meta Tech Blog)
  • EdgeRank Algorithm: The Original Facebook News Feed (GeeksforGeeks)
  • How Facebook's News Feed Algorithm Works: A Not-So-Deep Dive (DeanLong.io)

Assignment

A user opens their social media app. They follow 400 friends and 100 pages. In the last 24 hours, 2,500 new posts were created by those sources. The app must show the top 20 posts within 200ms of the request hitting the server.

Design the end-to-end flow:

  1. What happens in the first 50ms? What data is fetched and from where?
  2. How does the first-pass ranker reduce 2,500 candidates to 400? What signals does it use?
  3. The main ranking model runs on a GPU cluster. How do you batch 400 candidates into one inference call? What is the expected latency?
  4. The re-ranker must enforce: no more than 2 posts from the same source in the top 20, at least 3 different content types (text, image, video), and one "discover" post from outside the user's graph. How do you implement these rules without re-running the ML model?
  5. Draw a timeline showing all stages and their latency budgets.

What Makes Rental Platforms Hard

A rental platform like Airbnb appears simple on the surface. Hosts list properties. Guests search and book. Money changes hands. But underneath, the system is a calendar problem with a trust layer on top. Every listing has a time dimension that hotel booking systems can simplify (rooms are fungible) but rental platforms cannot (each property is unique).

The core technical challenges are: availability calendar management with concurrent booking prevention, geospatial search combined with date-range filtering, dynamic pricing that responds to supply and demand, and a bi-directional review system that builds trust between strangers.

Key insight: A rental platform is a calendar problem with a trust layer on top. Get the calendar locking wrong and you double-book. Get the trust layer wrong and nobody books at all.

High-Level Architecture

graph TD
    subgraph "Client Layer"
        MA[Mobile App]
        WA[Web App]
    end
    subgraph "API Gateway"
        GW[API Gateway<br/>Auth + Rate Limit]
    end
    subgraph "Core Services"
        LS[Listing Service]
        SS[Search Service]
        BS[Booking Service]
        PS[Pricing Service]
        RS[Review Service]
        MS[Messaging Service]
        PAY[Payment Service]
    end
    subgraph "Data Stores"
        PG[(PostgreSQL<br/>Listings, Users, Bookings)]
        ES[(Elasticsearch<br/>Search Index)]
        RD[(Redis<br/>Calendar Cache)]
        S3[(Object Storage<br/>Photos)]
    end
    MA --> GW
    WA --> GW
    GW --> LS
    GW --> SS
    GW --> BS
    GW --> PS
    GW --> RS
    GW --> MS
    LS --> PG
    LS --> S3
    SS --> ES
    BS --> PG
    BS --> RD
    BS --> PAY
    PS --> PG
    RS --> PG

The search service and booking service have fundamentally different consistency requirements. Search can tolerate slightly stale data (a listing that was booked 30 seconds ago still appearing in results). Booking cannot. A confirmed reservation must be strongly consistent, or two guests end up with the same dates.

The Calendar Problem

Each listing has a calendar. Each day is in one of several states: available, blocked by host, booked, or pending (in checkout flow). The calendar is the single most contested resource in the system.

When a guest initiates a booking for March 10-15, the system must atomically transition those six days from "available" to "pending," process payment, then transition to "booked." If payment fails, the days revert to "available." If two guests attempt to book overlapping dates simultaneously, exactly one must succeed and one must fail gracefully.

sequenceDiagram
    participant G1 as Guest A
    participant G2 as Guest B
    participant API as Booking API
    participant DB as Calendar DB
    participant PAY as Payment
    G1->>API: Book Mar 10-15
    G2->>API: Book Mar 12-18
    API->>DB: SELECT ... FOR UPDATE<br/>WHERE listing=X AND date IN (Mar 10-15)
    Note over DB: Row-level lock acquired<br/>for Mar 10-15
    DB-->>API: All dates available
    API->>DB: UPDATE status='pending'<br/>Mar 10-15
    API->>DB: SELECT ... FOR UPDATE<br/>WHERE listing=X AND date IN (Mar 12-18)
    Note over DB: Blocked on lock<br/>for Mar 12-15
    DB-->>API: Mar 12-15 NOT available
    API-->>G2: Dates unavailable
    API->>PAY: Charge Guest A
    PAY-->>API: Payment success
    API->>DB: UPDATE status='booked'<br/>Mar 10-15
    API-->>G1: Booking confirmed

The key mechanism is SELECT ... FOR UPDATE, which acquires row-level locks on the calendar rows for the requested dates. The second request blocks until the first transaction completes. If the first transaction booked overlapping dates, the second sees them as unavailable and fails cleanly.

An alternative to pessimistic locking is optimistic concurrency control: both requests read, both attempt to write, but only the first write succeeds (checked via a version number or CAS operation). Optimistic locking performs better under low contention. Pessimistic locking is safer for high-demand listings where contention is frequent.
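The pessimistic path can be sketched in-process, with one mutex per listing playing the role of the database's row locks. Everything here is a simplification under stated assumptions: date statuses live in a dict, payment is a callback, and the available-to-pending-to-booked transitions follow the flow described above.

```python
import threading
from datetime import date, timedelta

locks = {}       # listing_id -> lock (stands in for row locks)
day_status = {}  # (listing_id, date) -> "available"/"pending"/"booked"

def days(start, end):
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def book(listing_id, start, end, charge):
    lock = locks.setdefault(listing_id, threading.Lock())
    with lock:  # plays the role of SELECT ... FOR UPDATE
        wanted = list(days(start, end))
        if any(day_status.get((listing_id, d), "available") != "available"
               for d in wanted):
            return False  # overlapping dates taken: fail cleanly
        for d in wanted:
            day_status[(listing_id, d)] = "pending"
    ok = charge()  # payment happens outside the lock
    new_status = "booked" if ok else "available"  # revert on failure
    with lock:
        for d in days(start, end):
            day_status[(listing_id, d)] = new_status
    return ok

a = book("villa-1", date(2025, 3, 10), date(2025, 3, 15), lambda: True)
b = book("villa-1", date(2025, 3, 12), date(2025, 3, 18), lambda: True)
assert a is True and b is False  # exactly one booking succeeds
```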

Search: Multi-Dimensional Queries

A guest searching for a rental does not issue a simple key-value lookup. The query combines multiple dimensions simultaneously.

Dimension | Type | Index Strategy | Example
Location | Geospatial | Geo-hash / R-tree in Elasticsearch | "Within 10km of Ubud center"
Date range | Temporal | Availability bitmap per listing | "Available March 10-15"
Price range | Numeric range | BKD tree in Elasticsearch | "$50-$150 per night"
Guest count | Numeric filter | Inverted index | "Accommodates 4+"
Amenities | Set membership | Inverted index (tags) | "Pool AND WiFi AND Kitchen"
Property type | Categorical | Term filter | "Entire home" vs "Private room"
Rating | Numeric threshold | Range filter | "4.5+ stars"

The hardest dimension is availability. Location, price, and amenities are static properties that can be indexed in Elasticsearch. But availability changes with every booking. You cannot re-index a listing every time a date is booked or released.

The common approach: run the geo + filter query first in Elasticsearch to get a candidate set (maybe 200 listings), then check availability for those 200 listings against the calendar service. This two-phase approach avoids querying calendar state for thousands of irrelevant listings.
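The two-phase shape can be sketched with list comprehensions standing in for the Elasticsearch query and the calendar service. The data shapes are invented for illustration.

```python
# Toy listing data; "booked" holds the dates the calendar has taken.
listings = [
    {"id": 1, "price": 90,  "guests": 4, "booked": {"2025-03-12"}},
    {"id": 2, "price": 120, "guests": 2, "booked": set()},
    {"id": 3, "price": 60,  "guests": 5, "booked": set()},
]

def search(max_price, min_guests, dates):
    # Phase 1: static-attribute filters (one Elasticsearch query in
    # production) produce a small candidate set.
    candidates = [l for l in listings
                  if l["price"] <= max_price and l["guests"] >= min_guests]
    # Phase 2: availability check, only for the survivors, so the
    # calendar service is never queried for irrelevant listings.
    return [l["id"] for l in candidates
            if not l["booked"] & set(dates)]

result = search(max_price=100, min_guests=4,
                dates={"2025-03-10", "2025-03-11", "2025-03-12"})
assert result == [3]  # listing 1 fails availability, listing 2 fails price
```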

Dynamic Pricing: A Reinforcing Loop

Airbnb's Smart Pricing (powered by their Aerosolve ML framework) adjusts nightly rates based on demand signals. This creates a reinforcing feedback loop.

When demand rises (a festival, a holiday, high season), the pricing model increases rates. Higher rates attract more hosts to list or unblock dates. More supply moderates price growth. When demand drops, prices fall. Lower prices attract bargain hunters, which lifts occupancy, which stabilizes host revenue.

The reinforcing loop runs in both directions. In a hot market, rising prices attract more supply, which eventually caps price increases. In a cold market, falling prices stimulate demand, which prevents a death spiral. The pricing model acts as a market-clearing mechanism that balances supply and demand continuously.

Factors in the pricing model include: local comparable listings, day of week, seasonality patterns, lead time (how far in advance the booking is), length of stay, listing-specific demand history, and local events.

Bi-Directional Reviews and Trust

Unlike e-commerce where only buyers review sellers, rental platforms need reviews in both directions. Hosts review guests (were they respectful? did they leave the place clean?) and guests review hosts (was the listing accurate? was check-in smooth?).

To prevent retaliation bias (a host giving a bad review because the guest gave one first), both reviews are submitted blind and revealed simultaneously after both parties submit, or after a 14-day deadline. This is a game theory mechanism: neither party can condition their review on the other's.

Trust signals compound over time. A host with 50 five-star reviews gets Superhost status, which boosts their search ranking. A guest with verified ID and positive reviews gets Instant Book access. These trust signals reduce friction in the booking process and create a reinforcing loop: more trust leads to more bookings leads to more reviews leads to more trust.

Further Reading

  • Learning Market Dynamics for Optimal Pricing (Airbnb Engineering Blog)
  • The Secret of Airbnb's Pricing Algorithm (IEEE Spectrum)
  • Airbnb System Design (System Design Newsletter)
  • Aerosolve: Machine Learning for Humans (Airbnb, GitHub)

Assignment

A host sets their listing as available for March 1 through March 30. Two guests simultaneously attempt to book overlapping date ranges:

  • Guest A: March 10-15
  • Guest B: March 12-18

Design the calendar locking mechanism that ensures exactly one booking succeeds. Your design should address:

  1. What database operation prevents the double-book? Show the SQL or pseudocode.
  2. What happens to the losing guest's request? How does the UI communicate this?
  3. What if Guest A's payment fails after locking the dates? How do you release them?
  4. How long should a "pending" lock last before auto-expiring? What are the tradeoffs of short vs long timeouts?

Draw a sequence diagram showing both the success and failure paths.

From Monolith to 2,200 Microservices

Uber launched in 2010 as a single Python application. One codebase handled everything: rider requests, driver matching, trip tracking, payments, and email notifications. By 2014, this monolith was struggling. Deployments took hours. A bug in the notification module could bring down the entire matching system. Teams stepped on each other's code constantly.

By 2020, Uber had grown to roughly 2,200 microservices. This did not happen overnight, and it did not happen without pain. The migration followed a pattern that Martin Fowler named the Strangler Fig, after the tropical vine that grows around a host tree, gradually replacing it until the original tree is gone. The old system stays alive while new services absorb its responsibilities, one domain at a time.

Key insight: You do not rewrite a monolith. You strangle it. Each extracted service replaces one responsibility of the original system while the monolith continues serving everything else. The monolith shrinks until it disappears.

The Strangler Fig Pattern

The pattern works in three phases for each domain you extract.

Phase 1: Intercept. Place a routing layer (an API gateway or proxy) in front of the monolith. All traffic flows through this layer. Initially, it forwards everything to the monolith unchanged.

Phase 2: Extract. Build the new microservice alongside the monolith. Route a subset of requests to the new service while the monolith handles the rest. Run both in parallel. Compare outputs to verify correctness (shadow mode).

Phase 3: Retire. Once the new service handles 100% of its domain's traffic and has proven stable, remove the corresponding code from the monolith. The monolith is now smaller.

graph TD
    subgraph "Phase 1: Intercept"
        C1[Clients] --> GW1[API Gateway]
        GW1 --> M1[Monolith<br/>All domains]
    end
    subgraph "Phase 2: Extract"
        C2[Clients] --> GW2[API Gateway]
        GW2 -->|trips traffic| NS[New Trips Service]
        GW2 -->|all other traffic| M2[Monolith<br/>Shrinking]
        NS -.->|shadow compare| M2
    end
    subgraph "Phase 3: Retire"
        C3[Clients] --> GW3[API Gateway]
        GW3 --> NS2[Trips Service]
        GW3 --> PS[Payments Service]
        GW3 --> MS[Matching Service]
        GW3 --> M3[Monolith<br/>Minimal]
    end

Repeat this for each domain. The monolith shrinks with every extraction. The gateway grows more routing rules. After many iterations, the monolith contains only legacy code that nobody wants to touch, and eventually that too gets replaced or deleted.
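The routing layer that drives all three phases can be sketched in a few lines. This is a hypothetical illustration, not Uber's gateway: the domain names, rollout percentages, and the hash-bucket scheme are all assumptions.

```python
# Hypothetical strangler-fig routing layer. Domains, percentages, and
# the hashing scheme are illustrative, not Uber's actual gateway.
import hashlib

# Fraction of each domain's traffic routed to the extracted service.
ROLLOUT = {"trips": 0.25, "payments": 0.0}  # payments not yet extracted

def route(domain: str, request_id: str) -> str:
    """Return 'new-service' or 'monolith' for this request.

    Hashing the request ID gives a stable, uniform split: the same
    request always takes the same path, which makes shadow comparison
    and gradual percentage ramps deterministic.
    """
    pct = ROLLOUT.get(domain, 0.0)
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "new-service" if bucket < pct * 100 else "monolith"
```

Raising `ROLLOUT["trips"]` to 1.0 completes Phase 2 for that domain; deleting the monolith branch entirely is Phase 3.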

Domain-Driven Decomposition

The critical question is: how do you decide where to cut? Uber's answer was Domain-Oriented Microservice Architecture (DOMA), published in 2020. Instead of decomposing by technical layer (one service for the database, one for caching, one for business logic), they decomposed by business domain.

A domain is a collection of related microservices that own a business capability. Each domain has a gateway that other domains must use to interact with it. Services within a domain can call each other freely. Cross-domain calls go through the gateway.

Domain | Services Extracted | Owns | Key Data Store
--- | --- | --- | ---
Trips | Trip lifecycle, trip state machine, trip history | Trip creation, status updates, completion | Cassandra (trip events), MySQL (trip metadata)
Matching | Supply positioning, demand prediction, dispatch optimization | Driver-rider pairing algorithm | In-memory geospatial index, Redis
Payments | Fare calculation, payment processing, invoicing, payouts | All money movement | MySQL (ledger), payment gateway integrations
Pricing | Surge pricing, fare estimation, upfront pricing | Price computation for all products | Real-time demand signals, ML model server
Maps | Routing, ETA, geocoding, map tile serving | All location and navigation data | Graph DB (road network), tile cache
Marketplace | Supply-demand balancing, incentives, promotions | Market health and efficiency | Real-time streaming (Kafka), analytics DB

Each domain is owned by a team (or group of teams). The gateway enforces a contract: if the Trips domain changes its internal service structure, other domains do not need to update their code. They still call the Trips gateway with the same API. This is the key benefit. Without gateways, 2,200 services calling each other directly creates a "distributed monolith" where any change requires coordinating with dozens of teams.

Before and After

graph TD
  subgraph "Before: Monolith (2012)"
    MO[Single Python App]
    MO --> DB1[(Single MySQL DB)]
    MO --> MQ1[Single Message Queue]
  end
  subgraph "After: DOMA (2020)"
    AG[API Gateway] --> TG[Trips Gateway]
    AG --> MG[Matching Gateway]
    AG --> PG[Payments Gateway]
    AG --> PRG[Pricing Gateway]
    TG --> T1[Trip Lifecycle]
    TG --> T2[Trip History]
    MG --> M1[Supply Service]
    MG --> M2[Dispatch Service]
    PG --> P1[Fare Calc]
    PG --> P2[Payment Processing]
    PG --> P3[Payout Service]
    PRG --> PR1[Surge Engine]
    PRG --> PR2[Fare Estimator]
  end

What Breaks During Extraction

Extracting a domain from a monolith is not a clean lift-and-shift. Things break in predictable ways.

Shared database. The monolith uses one database. The Trips table has foreign keys to the Users table, which has foreign keys to the Payments table. When you extract Trips into its own service with its own database, those foreign keys disappear. You replace them with API calls. A query that was a single SQL join now becomes two network round-trips. Latency increases. Consistency becomes eventual instead of transactional.

Distributed transactions. In the monolith, "create a trip and charge the rider" is one database transaction. In microservices, it spans two services and two databases. You need a saga pattern: the Trips service creates the trip, the Payments service charges the rider, and if the charge fails, the Trips service compensates by canceling the trip. This compensation logic did not exist in the monolith.
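The saga shape is easier to see in code. This is a toy sketch with both services stubbed as local functions; in production each step would be an API call to a separate service with its own database.

```python
# Toy saga for "create a trip and charge the rider". Both services are
# stubbed as local functions for illustration.

def create_trip(trips_db: dict, trip_id: str) -> None:
    trips_db[trip_id] = "created"

def cancel_trip(trips_db: dict, trip_id: str) -> None:
    trips_db[trip_id] = "canceled"           # compensation action

def charge_rider(fare: float) -> bool:
    return fare <= 100.0                     # stub: charges over $100 fail

def book_trip_saga(trips_db: dict, trip_id: str, fare: float) -> bool:
    create_trip(trips_db, trip_id)           # step 1: Trips service
    if charge_rider(fare):                   # step 2: Payments service
        return True
    cancel_trip(trips_db, trip_id)           # step 2 failed: compensate step 1
    return False
```

Note that the compensation is a new forward action (cancel), not a rollback: the trip briefly existed, and other systems may have observed it. That window of inconsistency is the price of losing the single transaction.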

Observability gap. In the monolith, a stack trace shows the full request path. In microservices, a single user request might touch 15 services. Without distributed tracing (Jaeger, Zipkin), debugging a slow request becomes guesswork. Uber built their own tracing system to handle this.

Deployment complexity. The monolith had one deployment pipeline. Now you have hundreds. Each service has its own CI/CD, its own canary process, its own rollback procedure. Uber invested heavily in internal tooling to manage this. Without that investment, the operational overhead of microservices outweighs the benefits.

Surge Pricing: A Balancing Loop

Surge pricing is one of Uber's most visible (and controversial) features. From a systems thinking perspective, it is a textbook balancing loop with a dampener. We covered balancing loops in Session 0.5. Here is how it applies.

When demand exceeds supply (more ride requests than available drivers), the surge multiplier increases. Higher fares do two things: they reduce demand (some riders decide the trip is not worth 2.5x the price) and they increase supply (drivers see higher fares and drive toward the surge zone). Both effects push the system back toward equilibrium. When supply catches up with demand, the multiplier drops.

Without the dampener, the surge multiplier would overcorrect. Prices spike to 5x, all drivers flood the zone, supply massively exceeds demand, prices crash, drivers leave, demand spikes again. The dampener smooths the multiplier changes: instead of jumping from 1.0x to 3.0x instantly, it ramps up gradually (1.0x, 1.2x, 1.5x, 1.8x...) and ramps down the same way. This prevents the oscillation pattern that plagues undamped feedback loops.
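One common way to implement such a dampener is exponential smoothing. This is a minimal sketch; the smoothing factor and the demand/supply numbers are made up for illustration and are not Uber's actual algorithm.

```python
# Minimal sketch of a damped surge multiplier via exponential smoothing.
# The alpha value and inputs are illustrative, not Uber's real parameters.

def raw_multiplier(demand: int, supply: int) -> float:
    """Undamped target: ratio of ride requests to available drivers."""
    return max(1.0, demand / max(supply, 1))

def damped_step(current: float, target: float, alpha: float = 0.3) -> float:
    """Move only a fraction of the way toward the target each tick,
    so the multiplier ramps instead of jumping."""
    return current + alpha * (target - current)

m = 1.0
for _ in range(5):                # demand is 3x supply for 5 ticks
    m = damped_step(m, raw_multiplier(300, 100))
# m approaches 3.0 gradually: 1.60, 2.02, 2.31, 2.52, 2.66
```

The same smoothing applies on the way down, which is what prevents the spike-crash-spike oscillation described above.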

Further Reading

  • Introducing Domain-Oriented Microservice Architecture (Uber Engineering Blog, 2020)
  • Strangler Fig Application (Martin Fowler)
  • Explore Uber's Microservice Architecture (Edureka on Medium)
  • Uber System Design: A Complete Architectural Deep Dive (Grokking System Design)

Assignment

Uber started as a monolith. You are the architect tasked with extracting three domains into independent microservices: Trips, Payments, and Matching.

For each domain:

  1. What data does the new service own? What tables move out of the shared database?
  2. What cross-domain dependencies exist? (For example, Trips needs to call Payments to charge the rider.) List the API contracts between domains.
  3. What breaks when you remove the shared database? Identify at least one query that was a single SQL join and is now a cross-service call. How does latency change?
  4. Describe one distributed transaction that now requires a saga. What is the compensation action if a step fails?

Draw a before/after diagram showing the monolith's internal modules becoming independent services with defined API boundaries.

Scale of the Problem

An industrial IoT deployment does not look like a web application. A factory floor might have 100,000 temperature, vibration, and pressure sensors, each reporting a reading every 5 seconds. A smart city might have millions of connected devices: traffic sensors, air quality monitors, water meters, street lights. The data is small per reading (a timestamp, a sensor ID, and a numeric value), but the volume is relentless.

At 100,000 sensors reporting every 5 seconds, the ingestion rate is 20,000 readings per second. That is 1.7 billion readings per day. Each reading might be 50-100 bytes, so daily raw data volume is roughly 85-170 GB. After a year, you are looking at 30-60 TB of time-series data, and that is one factory.

Key insight: IoT data pipelines are write-dominated systems. The architecture must sustain continuous, predictable write throughput measured in tens of thousands of inserts per second, unlike web applications where reads vastly outnumber writes.

At 1 reading per 5 seconds per sensor, the formula is straightforward: QPS = sensor_count / interval_seconds. For 100K sensors at 5-second intervals, that is 100,000 / 5 = 20,000 QPS. This is well within the capability of modern time-series databases, but it requires an architecture built for sustained write throughput, not the bursty read patterns of web applications.
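The back-of-envelope numbers above can be checked in a few lines, assuming 50-100 bytes per reading as stated:

```python
# Back-of-envelope check of the ingestion and storage numbers,
# assuming 50-100 bytes per reading.
sensors = 100_000
interval_s = 5

qps = sensors / interval_s                        # 20,000 writes/sec
per_day = qps * 86_400                            # ~1.73 billion readings/day
gb_per_day_low = per_day * 50 / 1e9               # ~86 GB at 50 bytes/reading
gb_per_day_high = per_day * 100 / 1e9             # ~173 GB at 100 bytes/reading
tb_per_year_high = gb_per_day_high * 365 / 1000   # ~63 TB/year upper bound
```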

The IoT Data Pipeline

Data flows from physical sensors through several layers before it reaches a dashboard or triggers an alert. Each layer has a specific role.

graph LR
  S1[Sensor] --> GW[Edge Gateway]
  S2[Sensor] --> GW
  S3[Sensor] --> GW
  GW -->|MQTT| BR[Message Broker<br/>EMQX / Mosquitto]
  BR --> SP[Stream Processor<br/>Kafka / Flink]
  SP --> TS[(Time-Series DB<br/>TimescaleDB / InfluxDB)]
  SP --> AL[Alert Engine<br/>Threshold Rules]
  TS --> DASH[Dashboard<br/>Grafana]
  AL --> NT[Notification<br/>SMS / Email / PagerDuty]

Sensors are constrained devices. They have limited CPU, memory, and battery. They cannot run HTTP servers or maintain persistent TCP connections to cloud endpoints. They speak lightweight protocols.

Edge gateways aggregate data from dozens or hundreds of local sensors. A gateway might be a Raspberry Pi, a ruggedized industrial PC, or a purpose-built IoT hub. It collects readings over local protocols (Bluetooth, Zigbee, Modbus, or local MQTT) and forwards them to the cloud over MQTT or HTTPS.

The message broker (EMQX, HiveMQ, or Mosquitto) receives data from thousands of gateways. MQTT is the dominant protocol because it was designed for unreliable networks and resource-constrained devices. The broker decouples producers (gateways) from consumers (processors).

The stream processor (Kafka Streams, Apache Flink, or a simple consumer) reads from the broker and does two things: writes raw data to the time-series database and evaluates alert rules in real time.

The time-series database stores readings indexed by time and sensor ID. It supports queries like "average temperature in zone 3 over the last hour" or "maximum vibration on machine #42 this week." It handles aggressive compression because sensor data is highly regular: timestamps are evenly spaced, and values change slowly.

Protocol Comparison: MQTT vs HTTP vs CoAP

Property | MQTT | HTTP/REST | CoAP
--- | --- | --- | ---
Transport | TCP | TCP | UDP
Message overhead | 2-byte minimum header | ~200-800 bytes (headers) | 4-byte minimum header
Connection model | Persistent, bidirectional | Request-response, stateless | Request-response, stateless
QoS levels | 0 (at most once), 1 (at least once), 2 (exactly once) | None (application layer) | Confirmable / Non-confirmable
Pub/Sub support | Native (topics) | No (requires SSE or WebSocket) | Observe extension
Power consumption | Low (persistent connection, small packets) | High (connection setup per request) | Very low (UDP, small packets)
Best for | Reliable telemetry over TCP networks | Low-frequency data, cloud integrations | Battery-powered devices on lossy networks

MQTT dominates industrial IoT because of its persistent connection model (no reconnection overhead per message), its built-in QoS levels (you can choose between speed and delivery guarantees), and its minimal header overhead. HTTP works fine for devices that report once per hour, but at 5-second intervals, the per-request overhead adds up quickly.

Edge vs. Cloud Processing

Not all data needs to travel to the cloud. Edge processing handles time-sensitive decisions locally, reducing latency, bandwidth costs, and cloud compute bills.

graph TD
  subgraph "Edge (Local)"
    S[Sensors] --> EG[Edge Gateway]
    EG --> EF[Edge Filtering<br/>Dedup, smoothing]
    EF --> EA[Edge Alerts<br/>Critical thresholds]
    EA -->|ALERT: Temp > 95°C| LOCAL[Local Actuator<br/>Shut down machine]
  end
  subgraph "Cloud"
    EF -->|Aggregated data<br/>1 reading/min| BR2[Message Broker]
    BR2 --> SP2[Stream Processor]
    SP2 --> TS2[(Time-Series DB)]
    SP2 --> ML[ML Anomaly Detection]
    TS2 --> DASH2[Dashboard]
  end
  EA -->|Alert forwarded| BR2

The decision of what to process at the edge and what to send to the cloud depends on three factors.

Latency requirement. If a machine overheats, you need to shut it down in milliseconds, not wait for a round-trip to a cloud server 200ms away. Critical safety thresholds must be evaluated at the edge.

Bandwidth cost. Sending 20,000 raw readings per second to the cloud costs money (especially over cellular networks). The edge can aggregate readings (average over 1 minute instead of sending every 5-second sample), reducing bandwidth by 12x. Only anomalies and aggregates travel to the cloud.

Compute complexity. Simple threshold checks (temperature > 95C) run fine on a gateway. ML-based anomaly detection (this vibration pattern looks like bearing failure) requires more compute and typically runs in the cloud, or on beefier edge servers for critical applications.
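The bandwidth-saving aggregation is simple in code. A minimal sketch, assuming 5-second samples rolled up into 1-minute windows; keeping the per-window maximum alongside the average is one way to make sure short spikes survive the rollup.

```python
# Sketch of edge aggregation: collapse raw 5-second samples into one
# summary record per minute before uploading. Window size is illustrative.

def aggregate(samples: list[float], window: int = 12) -> list[dict]:
    """Group consecutive samples into windows (12 x 5s = 1 minute) and
    emit one summary record per window."""
    out = []
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        out.append({
            "avg": sum(chunk) / len(chunk),
            "max": max(chunk),           # keep the peak so spikes survive
            "n": len(chunk),
        })
    return out

raw = [20.0 + (i % 3) for i in range(24)]   # 2 minutes of 5-second samples
summary = aggregate(raw)                     # 24 samples -> 2 records: 12x fewer uploads
```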

Alert Thresholds as Balancing Loops

Alert systems in IoT follow balancing loop dynamics. When a sensor reading crosses a threshold, the system triggers a corrective action. That action brings the reading back below the threshold, which stops the alert. This is a negative feedback loop.

The challenge is tuning thresholds. Set them too low and you get alert fatigue: operators receive hundreds of alerts per shift, start ignoring them, and miss the real problems. Set them too high and you miss genuine anomalies until they become failures. The threshold acts as the goal in a balancing loop, and the gap between the current reading and the threshold determines the strength of the response.

Sophisticated systems use adaptive thresholds. Instead of a fixed number (alert if temperature > 80C), they use statistical baselines (alert if temperature is more than 3 standard deviations above the rolling 24-hour average for this specific sensor). This accounts for normal variation: a sensor in a warm zone runs hotter than one in a cool zone, and a fixed threshold would either miss anomalies in the warm zone or cry wolf in the cool zone.
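An adaptive threshold of this kind fits in a small class. This is a sketch under stated assumptions: the window size, warm-up count, and sigma floor are all arbitrary illustrative choices.

```python
# Sketch of an adaptive threshold: alert when a reading sits more than
# 3 standard deviations above this sensor's own rolling baseline.
# Window size, warm-up count, and sigma floor are illustrative.
from collections import deque
from statistics import mean, stdev

class AdaptiveThreshold:
    def __init__(self, window: int = 288):   # e.g. 24h of 5-minute aggregates
        self.history = deque(maxlen=window)

    def check(self, reading: float) -> bool:
        """Return True if the reading is anomalously high for this sensor."""
        alarm = False
        if len(self.history) >= 30:          # need a minimal baseline first
            mu, sigma = mean(self.history), stdev(self.history)
            alarm = reading > mu + 3 * max(sigma, 0.1)  # floor avoids sigma=0
        self.history.append(reading)
        return alarm

sensor = AdaptiveThreshold()
for _ in range(50):
    sensor.check(70.0)        # warm zone: baseline settles around 70
hot = sensor.check(75.0)      # a +5 degree spike trips this sensor's threshold
```

The same +5 spike on a sensor whose baseline is 75 would not fire, which is exactly the per-sensor behavior a fixed threshold cannot provide.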

OTA Firmware Updates

Deploying code to 100,000 devices in the field is not like deploying a web server update. You cannot SSH into each device. Many devices have limited storage (no room for two firmware images). Network connectivity is unreliable. A failed update can brick a device that requires a physical truck roll to replace.

Safe OTA (Over-The-Air) update strategies include: A/B partitions (device has two firmware slots, boots from the new one, rolls back to the old if health checks fail), staged rollouts (update 1% of devices, monitor for 24 hours, then 10%, then 100%), and delta updates (send only the binary diff, not the full firmware image, to reduce download size over slow networks).
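Staged rollouts need stable cohorts: a device selected for the 1% canary must remain selected when the rollout widens to 10%. A hypothetical sketch of hash-based cohort selection (device IDs and percentages are illustrative):

```python
# Sketch of staged-rollout cohort selection: hash each device ID into a
# stable 0-99 bucket, then widen the eligible percentage as confidence grows.
import hashlib

def bucket(device_id: str) -> int:
    """Stable bucket per device: the same device always lands in the same
    cohort, so the 1% group is a strict subset of the 10% group."""
    return int(hashlib.sha256(device_id.encode()).hexdigest(), 16) % 100

def eligible(device_id: str, rollout_pct: int) -> bool:
    return bucket(device_id) < rollout_pct

devices = [f"dev-{i}" for i in range(10_000)]
canary = [d for d in devices if eligible(d, 1)]    # ~1% get the update first
wave2 = [d for d in devices if eligible(d, 10)]    # widened after 24h of monitoring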

Further Reading

  • Time-Series Database for IoT: The Missing Piece (EMQX Blog)
  • Patterns for IoT Time Series Data Ingestion with Amazon Timestream (AWS Database Blog)
  • Building Industrial IoT Data Streaming Architecture with MQTT (HiveMQ Blog)
  • MQTT with TimescaleDB for IoT Time-Series Data (EMQX Blog)

Assignment

You are designing a monitoring system for a factory with 100,000 temperature sensors. Each sensor sends a reading every 5 seconds.

  1. Calculate the ingestion QPS. How many writes per second must the time-series database sustain?
  2. Design the pipeline from sensor to dashboard alert. Draw the architecture showing: sensors, edge gateways, MQTT broker, stream processor, time-series database, alert engine, and dashboard.
  3. A sensor reading exceeds 95C. The alert must reach the operator within 500ms. Which parts of your pipeline are on the critical path? Can you meet the 500ms SLA if the data goes through the cloud, or must the alert fire at the edge?
  4. You want to reduce cloud bandwidth costs. Propose an edge aggregation strategy: what data do you aggregate locally, what do you send raw, and what is the resulting reduction in cloud-bound traffic?
  5. How do you handle a sensor that goes offline? What is the difference between "no reading" and "reading = 0"? Design the heartbeat mechanism.

Why This System Is Different

Most system design problems covered in this course involve storing and retrieving data. A generative AI system does something fundamentally different: it produces new content. The user sends a question, and the system generates an answer word by word, drawing on a language model with billions of parameters. The infrastructure requirements are unlike anything in traditional web services.

GPU compute is the bottleneck, not network or disk I/O. A single inference request to a 70-billion-parameter model can consume 40 GB of GPU memory and take 2-10 seconds to generate a full response. You cannot scale this the way you scale a REST API. Adding more CPU cores does nothing. You need more GPUs, and GPUs are expensive, scarce, and power-hungry.

The second challenge is cost. An H100 GPU costs roughly $25,000-30,000. A cluster of 64 GPUs for serving a 70B model costs over $1.5 million in hardware alone, not counting electricity, cooling, and networking. Every wasted GPU-second directly impacts the business. This makes batching, queuing, and rate limiting first-class architectural concerns.

Key insight: A generative AI system is a search system with a language model as the last mile. Most of the architecture (retrieval, caching, rate limiting, load balancing) is familiar. The unfamiliar part is the GPU serving layer and its unique constraints.

RAG Architecture

Retrieval-Augmented Generation (RAG) is the dominant pattern for building AI systems that answer questions about specific documents. Instead of fine-tuning a model on your data (expensive, slow, and hard to update), you retrieve relevant documents at query time and include them as context in the prompt. The model generates an answer grounded in those documents.

graph LR
  Q[User Query] --> EMB[Embedding Model<br/>text-embedding-3-small]
  EMB --> VS[(Vector Database<br/>Pinecone / pgvector)]
  VS --> CTX[Top-K Documents<br/>k=5]
  CTX --> PM[Prompt Assembly<br/>System + Context + Query]
  PM --> LLM[LLM Inference<br/>GPT-4 / Claude / Llama]
  LLM --> R[Streamed Response]
  DOC[Document Corpus] --> CH[Chunker<br/>Split into ~500 token chunks]
  CH --> EMB2[Embedding Model]
  EMB2 --> VS

The pipeline has two phases. The indexing phase (offline) splits documents into chunks, computes an embedding vector for each chunk, and stores those vectors in a vector database. The query phase (online) embeds the user's question, searches for the most similar document chunks, assembles a prompt with those chunks as context, and sends it to the language model for generation.

Chunking strategy matters. Chunks that are too small lose context. Chunks that are too large waste the model's limited context window and dilute relevance. A common approach is 300-500 tokens per chunk with 50-100 token overlap between consecutive chunks. The overlap prevents information from being split across a chunk boundary.
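Overlapping chunking is a short sliding-window loop. A sketch, approximating tokens by whitespace-separated words (a real pipeline would use the model's tokenizer); the size and overlap match the numbers above.

```python
# Sketch of overlapping chunking. Tokens are approximated by whitespace
# words here; a real pipeline would count tokens with the model's tokenizer.

def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    words = text.split()
    step = size - overlap                    # each chunk starts 320 words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):       # last chunk reached the end
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
pieces = chunk(doc)
# 1000 words -> chunks starting at words 0, 320, 640;
# adjacent chunks share 80 words across each boundary
```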

Embedding quality matters. The embedding model converts text into a fixed-length vector (typically 768 or 1536 dimensions). Similar texts produce similar vectors. The quality of retrieval depends entirely on the embedding model's ability to capture semantic meaning. State-of-the-art models like OpenAI's text-embedding-3-small or Cohere's embed-v3 produce embeddings in roughly 5-10ms per text chunk.
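The query phase is then a nearest-neighbor search over those vectors. A toy sketch: the `embed` function here is a deterministic random stand-in purely to exercise the top-k cosine-similarity machinery, where a real system would call an embedding model; a vector database replaces the brute-force `sorted` with an ANN index.

```python
# Toy retrieval: top-k by cosine similarity over a brute-force index.
# embed() is a deterministic random stand-in for a real embedding model.
import math
import random

def embed(text: str, dim: int = 64) -> list[float]:
    rng = random.Random(text)                # deterministic per text (toy only)
    return [rng.gauss(0, 1) for _ in range(dim)]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: str, index: dict[str, list[float]], k: int = 5) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda cid: cosine(qv, index[cid]), reverse=True)
    return ranked[:k]

index = {f"chunk-{i}": embed(f"chunk-{i}") for i in range(100)}
index["match"] = embed("returns are accepted within 30 days")
hits = top_k("returns are accepted within 30 days", index, k=5)  # "match" ranks first
```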

Vector Database Comparison

The vector database stores document embeddings and performs approximate nearest-neighbor (ANN) search. Several options exist with different tradeoffs.

Database | Type | Max Scale | Index Algorithm | Best For
--- | --- | --- | --- | ---
Pinecone | Managed SaaS | Billions of vectors | Proprietary (HNSW-based) | Fastest time-to-production, fully managed
Weaviate | Open source / Cloud | Hundreds of millions | HNSW + hybrid BM25 | Hybrid search (vector + keyword), GraphQL API
pgvector | PostgreSQL extension | Tens of millions | IVFFlat / HNSW | Teams already on PostgreSQL, simpler ops
Qdrant | Open source / Cloud | Hundreds of millions | HNSW with filtering | High recall with complex filters
Milvus | Open source / Cloud | Billions of vectors | Multiple (IVF, HNSW, DiskANN) | Very large scale, cost efficiency
FAISS | Library (not a DB) | Depends on RAM | Multiple (IVF, PQ, HNSW) | Research, prototyping, in-process search

For most production RAG systems, the choice comes down to operational complexity vs. performance. pgvector is the simplest: add an extension to your existing PostgreSQL, no new infrastructure. It works well up to about 5-10 million vectors. Beyond that, Pinecone or Weaviate offer better performance with managed infrastructure. At billion-vector scale, Milvus or Pinecone are the primary options.

High-Level System Architecture

graph TD
  subgraph Client Layer
    WEB[Web Chat UI]
    API_C[API Client]
  end
  subgraph API Layer
    GW[API Gateway<br/>Auth + Rate Limit]
    QU[Request Queue<br/>Redis / SQS]
  end
  subgraph Retrieval Layer
    EMB_S[Embedding Service]
    VDB[(Vector Database)]
    CACHE[(Response Cache<br/>Redis)]
  end
  subgraph Inference Layer
    LB_GPU[GPU Load Balancer]
    GPU1[GPU Worker 1<br/>vLLM / TGI]
    GPU2[GPU Worker 2<br/>vLLM / TGI]
    GPU3[GPU Worker 3<br/>vLLM / TGI]
  end
  subgraph Data Layer
    DOC_S[(Document Store<br/>S3 / PostgreSQL)]
    IDX[Indexing Pipeline<br/>Chunk + Embed + Store]
  end
  WEB --> GW
  API_C --> GW
  GW --> CACHE
  CACHE -->|miss| QU
  QU --> EMB_S
  EMB_S --> VDB
  VDB --> LB_GPU
  LB_GPU --> GPU1
  LB_GPU --> GPU2
  LB_GPU --> GPU3
  GPU1 --> WEB
  DOC_S --> IDX
  IDX --> VDB

The request flow: the API gateway authenticates the request and checks rate limits. It first checks the response cache (identical or near-identical questions often recur in customer support). On a cache miss, the request enters a queue. The embedding service converts the query to a vector, retrieves relevant documents from the vector database, assembles the prompt, and sends it to the GPU inference cluster. The GPU workers stream tokens back to the client as they are generated.

GPU Serving: Batching and Streaming

The GPU inference layer is the most expensive and constrained part of the system. Two techniques make it efficient.

Continuous batching. Naive inference serves one request at a time, which leaves most of the GPU idle: each autoregressive decoding step is memory-bandwidth-bound, so a single sequence cannot saturate the hardware. Static batching helps, but wastes capacity when short requests finish early and their slots sit empty until the longest request in the batch completes. Continuous batching (implemented by vLLM and TGI) schedules at the iteration level: when request A finishes, a new request B takes its slot immediately, without waiting for the entire batch to complete. This increases GPU utilization from 30-40% (naive) to 80-90% (continuous batching).

Token streaming. A 500-token response takes 5-10 seconds to generate fully. Making the user wait 10 seconds for the complete response feels slow. Streaming sends each token to the client as it is generated (via Server-Sent Events or WebSocket). The user sees the response appear word by word, which feels much faster even though total generation time is unchanged.
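The streaming shape is a generator yielding Server-Sent Events frames. A sketch with a stub standing in for the model; the token delay and the `data:` framing are the only moving parts.

```python
# Sketch of server-side token streaming. generate_tokens() is a stub
# standing in for autoregressive decoding on a GPU worker.
import time
from typing import Iterator

def generate_tokens(prompt: str) -> Iterator[str]:
    """Stub LLM: yields one token at a time, like an autoregressive model."""
    for word in f"Echo of: {prompt}".split():
        time.sleep(0.001)            # stand-in for per-token GPU latency
        yield word

def stream_sse(prompt: str) -> Iterator[str]:
    """Wrap each token as a Server-Sent Events frame. The client sees the
    first token after one model step, not after the full generation."""
    for tok in generate_tokens(prompt):
        yield f"data: {tok}\n\n"
    yield "data: [DONE]\n\n"

frames = list(stream_sse("hello world"))
```

A web framework would return this generator as the response body with `Content-Type: text/event-stream`, so frames flush to the client as they are produced.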

The chart shows the fundamental tradeoff. Larger batch sizes reduce per-request cost dramatically (batch of 8 costs ~20% of a single request). But latency increases because requests in a batch share GPU time. For real-time chatbots, batch sizes of 8-16 offer the best balance. For offline batch processing (summarizing 10,000 documents), batch sizes of 32 or higher are appropriate because latency does not matter.

Rate Limiting for Inference Costs

In a traditional web API, rate limiting protects server capacity. In an LLM-serving system, rate limiting also protects your budget. Each request costs real money: a 1,000-token response from GPT-4 costs roughly $0.03-0.06. At 100,000 requests per day, that is $3,000-6,000 per day in inference costs alone.

Rate limiting strategies for AI systems include: per-user token limits (each user gets 50,000 tokens per day), per-request token caps (maximum 4,000 output tokens per response), priority queuing (paid users get dedicated GPU capacity, free users share a pool), and response caching (identical questions served from cache, zero GPU cost).
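The first two strategies compose naturally into one admission check. A sketch using the limits quoted above (50,000 tokens/day, 4,000 tokens/request); the in-memory dict stands in for a shared store like Redis.

```python
# Sketch of a per-user daily token budget plus a per-request output cap.
# Limits match the numbers in the text; the dict stands in for Redis.

DAILY_LIMIT = 50_000        # tokens per user per day
REQUEST_CAP = 4_000         # max output tokens per response

class TokenBudget:
    def __init__(self):
        self.used: dict[str, int] = {}   # reset by a daily job in practice

    def admit(self, user: str, requested_tokens: int) -> int:
        """Return how many output tokens this request may generate (0 = reject)."""
        remaining = DAILY_LIMIT - self.used.get(user, 0)
        grant = max(min(requested_tokens, REQUEST_CAP, remaining), 0)
        self.used[user] = self.used.get(user, 0) + grant
        return grant

budget = TokenBudget()
```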

The response cache is particularly effective for customer support chatbots. If 100 customers ask "what is your return policy?" in one day, the first query hits the GPU. The remaining 99 are served from cache in under 10ms. Even with fuzzy matching (semantic similarity between cached questions above 0.95), cache hit rates of 30-50% are common for support use cases.

Production Considerations

Hallucination detection. The model can generate plausible-sounding answers that are factually wrong. RAG reduces this by grounding responses in retrieved documents, but does not eliminate it. Production systems add a verification step: check whether the generated answer is actually supported by the retrieved context. If the model claims "our return policy is 60 days" but the retrieved document says "30 days," flag it for human review.
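A crude but cheap first line of defense is checking numeric claims against the context, exactly the "60 days vs 30 days" case. This regex sketch only catches numeric contradictions; real verifiers use an LLM judge or an NLI model for full entailment checking.

```python
# Toy grounding check: every number the model asserts must also appear
# in the retrieved context. Catches numeric contradictions only.
import re

def numbers_supported(answer: str, context: str) -> bool:
    claimed = set(re.findall(r"\d+", answer))
    grounded = set(re.findall(r"\d+", context))
    return claimed <= grounded       # each claimed number must appear in context

context = "Electronics may be returned within 30 days of purchase."
ok = numbers_supported("You can return it within 30 days.", context)
flagged = not numbers_supported("You can return it within 60 days.", context)
```

Answers that fail the check are not shown to the user; they are regenerated or routed to human review.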

Document freshness. When documents change, their embeddings must be re-computed and updated in the vector database. A stale index means the system retrieves outdated information. The indexing pipeline should run incrementally: detect changed documents, re-chunk, re-embed, and upsert into the vector store. For fast-moving content (product catalogs, pricing pages), this pipeline runs hourly or on every document change.

Observability. Track retrieval quality (are the top-K documents actually relevant?), generation quality (are users satisfied with answers?), latency (time to first token, total generation time), and cost (tokens consumed per request, GPU utilization). Log the full prompt (query + retrieved context) for debugging bad answers.

Further Reading

  • Towards Understanding Systems Trade-offs in Retrieval-Augmented Generation Model Inference (arXiv, 2024)
  • Best Vector Databases: A Complete Comparison Guide (Firecrawl, 2026)
  • Best Practices in RAG Evaluation (Qdrant Blog)
  • vLLM: Easy, Fast, and Cheap LLM Serving (GitHub)
  • Vector Database Comparison: Pinecone vs Weaviate vs Qdrant vs FAISS vs Milvus vs Chroma (LiquidMetal AI)

Assignment

Design a RAG-powered customer support chatbot for an e-commerce company. The company has 5,000 support articles, 200 FAQ pages, and 50 product manuals (totaling roughly 2 million words).

  1. Draw the high-level architecture. Show where documents are stored, how they are indexed (chunked and embedded), and how a user query flows from input to streamed response.
  2. Choose a vector database. Justify your choice based on the document count, expected query volume (1,000 queries per day initially, scaling to 50,000), and team size (3 engineers).
  3. Calculate the embedding storage requirements. If each chunk is ~400 tokens and you use 1536-dimensional embeddings, how many chunks do you expect? How much vector storage is needed?
  4. A user asks: "Can I return a laptop after 45 days?" The retrieved context says the return policy is 30 days for electronics. The model generates: "Yes, you can return the laptop within 45 days." How do you detect and prevent this hallucination?
  5. Design the caching layer. What do you use as the cache key (exact query match, or semantic similarity)? What is the expected cache hit rate for a support chatbot?
© Ibrahim Anwar · Bogor, West Java
This work is licensed under CC BY 4.0