Course → Module 8: Real-World Case Studies II

The Problem

A popular artist announces a concert. 50,000 seats. Within seconds of tickets going on sale, 500,000 users hit the "Buy" button simultaneously. The system must ensure that no seat is sold twice, that the experience feels fair, and that payment processing does not corrupt inventory state. Failure at any point means overselling, angry customers, and refund chaos.

Ticketing is fundamentally different from e-commerce. In e-commerce, if one item sells out, there are usually substitutes. In ticketing, every seat is unique. Section 103, Row F, Seat 12 either belongs to one person or it does not. There is no partial fulfillment.

Key insight: Ticketing is a fairness problem disguised as a scaling problem. The hardest part is not handling 500K requests per second. It is ensuring that the person who clicked first actually gets the seat, and that no seat is promised to two people at once.

High-Level Architecture

```mermaid
graph TD
    U[Users] --> LB[Load Balancer]
    LB --> WQ[Virtual Waiting Queue]
    WQ --> API[Booking API]
    API --> SL[Seat Lock Service<br/>Redis]
    API --> INV[Inventory Service<br/>In-Memory Cache]
    API --> PAY[Payment Service]
    PAY --> DB[(Primary Database)]
    SL --> DB
    INV --> DB
    DB --> NOTIFY[Notification Service]
    NOTIFY --> U
```

The architecture separates concerns into distinct services. The virtual waiting queue controls the admission rate. The seat lock service prevents double-booking. The inventory service tracks availability in memory for fast reads. The payment service handles the financial transaction. The notification service delivers the confirmation or rejection back to the user.

Seat Locking: Optimistic vs. Pessimistic

When a user selects a seat, the system must temporarily reserve it while they complete payment. Two strategies exist for this lock.

| Strategy | How It Works | Best For | Risk |
| --- | --- | --- | --- |
| Pessimistic locking | Lock the seat row in the database immediately when selected. No other transaction can read or modify it until released. | Low concurrency, strong consistency requirements | Lock contention under high load; the database becomes the bottleneck |
| Optimistic locking | Allow multiple users to select the same seat. At commit time, check a version number; if it changed, the commit fails and the user must retry. | High read volume, lower write contention | Users see "available" seats that are already taken; poor UX during flash sales |
| Distributed lock (Redis) | Use Redis SET with NX (set-if-not-exists) and a TTL. A Lua script atomically checks availability and sets the lock in one operation. | Flash sales, extreme concurrency | Requires TTL management; lock expiry before payment completes can cause issues |

For flash sale scenarios, the Redis-based distributed lock is the standard choice. The lock is set with a TTL (typically 10 minutes). If the user completes payment within that window, the lock converts to a confirmed booking. If the TTL expires, the seat releases back to inventory automatically.

The critical detail is atomicity. Checking "is this seat available?" and "lock it for me" must happen in a single, uninterruptible operation. Without atomicity, two users can both see the seat as available, both attempt to lock it, and one ends up with a phantom reservation. Redis Lua scripts solve this by executing the check-and-set as one atomic unit on the server side.
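The check-and-set semantics can be sketched in plain Python. This is an in-memory simulation of what Redis provides server-side with `SET key value NX EX ttl` (or a Lua script); the class name and `ttl_seconds` default are illustrative, not part of any real API.

```python
import time

class SeatLockService:
    """In-memory stand-in for Redis SET ... NX EX semantics.

    In production the check-and-set runs atomically inside Redis;
    here a single-threaded dict plays the same role for illustration.
    """

    def __init__(self):
        self._locks = {}  # seat_id -> (holder_user_id, expires_at)

    def try_lock(self, seat_id, user_id, ttl_seconds=600, now=None):
        """Atomically check availability and lock. True on success."""
        now = time.monotonic() if now is None else now
        held = self._locks.get(seat_id)
        if held is not None and held[1] > now:
            return False  # seat already locked and lock not yet expired
        # Seat is free (or its lock expired): take it with a fresh TTL.
        self._locks[seat_id] = (user_id, now + ttl_seconds)
        return True

locks = SeatLockService()
assert locks.try_lock("seat_103_F_12", "alice", now=0.0)        # first click wins
assert not locks.try_lock("seat_103_F_12", "bob", now=0.2)      # 200ms later: denied
assert locks.try_lock("seat_103_F_12", "bob", now=601.0)        # TTL expired: free again
```

Because the check and the set happen in one step, there is no window in which two users can both observe the seat as free; this is exactly the property the Lua script guarantees in the real system.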

Flash Sale Queue Architecture

When 500,000 users click "Buy" simultaneously, letting all of them hit the booking API directly would overwhelm every downstream service. The solution is a virtual waiting queue that controls admission.

```mermaid
sequenceDiagram
    participant U as 500K Users
    participant Q as Virtual Queue
    participant A as Admission Controller
    participant B as Booking API
    participant R as Redis Lock
    participant P as Payment
    U->>Q: Enter waiting room
    Q-->>U: Position #47,231. Estimated wait: 4 min
    A->>Q: Release next batch (500 users)
    Q->>B: Admitted users proceed
    B->>R: SET seat_103_F_12 NX TTL 600
    R-->>B: OK (locked)
    B-->>U: Seat reserved. Complete payment in 10 min.
    U->>P: Submit payment
    P->>B: Payment confirmed
    B->>R: Convert lock to confirmed
    B-->>U: Booking confirmed
```

The queue serves two purposes. First, it acts as a buffer, absorbing the initial traffic spike without passing it downstream. Second, it enforces fairness by processing users in the order they arrived. The admission controller releases users in batches sized to match the booking API's throughput capacity.
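The batch-release behavior can be sketched with a FIFO queue. The function name and the batch size of 500 are illustrative; in practice the batch size is tuned to the booking API's measured throughput.

```python
from collections import deque

def admit_next_batch(waiting_queue, batch_size):
    """Release the next batch of users in strict arrival (FIFO) order.

    batch_size is sized to what the booking API can absorb per interval.
    """
    batch = []
    while waiting_queue and len(batch) < batch_size:
        batch.append(waiting_queue.popleft())
    return batch

queue = deque(f"user_{i}" for i in range(1200))  # 1,200 users waiting
first = admit_next_batch(queue, 500)

assert first[0] == "user_0" and first[-1] == "user_499"  # arrival order preserved
assert len(queue) == 700                                  # the rest keep waiting
```

The admission controller repeats this on a timer, so downstream services see a steady, bounded request rate rather than the raw spike.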

Users in the queue see their position and an estimated wait time. This transparency is important. People tolerate waiting when they can see progress. They do not tolerate a spinning wheel with no information.

In-Memory Inventory

The primary database is the source of truth for seat ownership. But querying it for every "show me available seats" request is too slow under flash-sale load. The solution is an in-memory inventory cache, typically Redis, that mirrors seat availability.

When a seat is locked, both the Redis lock and the inventory cache are updated. When a lock expires, the cache is updated to show the seat as available again. The database is updated only on confirmed booking, not on lock acquisition, so only successful transactions generate database writes.

The trade-off is eventual consistency. The cache may briefly show a seat as available when it has just been locked, or show it as locked when the lock just expired. For flash sales, this is acceptable. The lock service is the true gatekeeper, not the inventory display.
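This division of responsibility can be shown in a few lines. The `book_seat` function and `LockStub` class below are illustrative names for a sketch of the idea: the seat map is rendered from the (possibly stale) cache, but the booking path always consults the lock service, which is authoritative.

```python
class LockStub:
    """Minimal stand-in for the seat lock service (TTL omitted for brevity)."""
    def __init__(self):
        self._held = set()

    def try_lock(self, seat_id, user_id):
        if seat_id in self._held:
            return False
        self._held.add(seat_id)
        return True

def book_seat(seat_id, user_id, cache, lock_service):
    # The lock service, not the cache, decides ownership.
    if not lock_service.try_lock(seat_id, user_id):
        return "seat_taken"    # cache said available; the lock says otherwise
    cache[seat_id] = "locked"  # best-effort cache update after the fact
    return "reserved"

cache = {"seat_103_F_12": "available"}   # stale display state
locks = LockStub()
locks.try_lock("seat_103_F_12", "alice") # Alice locked it a moment ago

# Bob's screen still shows "available", but his booking fails safely:
assert book_seat("seat_103_F_12", "bob", cache, locks) == "seat_taken"
```

A stale cache can cost a user a wasted click; only a broken lock could cost a double-sold seat, which is why correctness lives in the lock service alone.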

Payment Idempotency

Network failures during payment create a dangerous scenario. The user's payment goes through, but the confirmation response is lost. The user clicks "Pay" again. Without idempotency, they get charged twice.

Idempotency means that processing the same request multiple times produces the same result as processing it once. The standard implementation assigns a unique idempotency key to each booking attempt. The payment service stores this key with the transaction. If the same key arrives again, the service returns the previous result instead of processing a new charge.
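A minimal sketch of the key-based mechanism, with a hypothetical `charge` method standing in for the real payment processor call:

```python
import uuid

class PaymentService:
    """Sketch of idempotency-key handling (storage is a dict for illustration)."""

    def __init__(self):
        self._results = {}  # idempotency_key -> stored result

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._results:
            # Replay of a request we already processed: return the stored
            # result instead of charging again.
            return self._results[idempotency_key]
        result = {"status": "charged", "amount": amount_cents}
        self._results[idempotency_key] = result
        return result

pay = PaymentService()
key = str(uuid.uuid4())            # one key per booking attempt, not per click

first = pay.charge(key, 12000)
retry = pay.charge(key, 12000)     # network retry after a lost response

assert retry is first              # same stored result; the user is charged once
```

The key must be generated once per booking attempt (e.g., when the seat lock is acquired) and reused on every retry; a fresh key per click would defeat the mechanism.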

Traffic Pattern: The Spike

Ticketing traffic does not follow normal web patterns. It is dominated by extreme spikes at the moment tickets go on sale, followed by rapid decay.

This spike pattern drives every architectural decision. The virtual queue exists because of this spike. The in-memory inventory exists because the database cannot handle this spike. The Redis lock exists because traditional database locks cannot handle this spike. Every component is shaped by the fact that 90% of all traffic arrives in the first 60 seconds.

Failure Modes

Three failure scenarios, each discussed above, require explicit handling:

  1. Lock expiry during payment. The TTL expires while the user is mid-checkout and the seat releases back to inventory. The booking must be re-validated before confirmation, and a payment that completes after expiry must be voided or refunded.
  2. Lost payment confirmation. The charge succeeds but the response never reaches the user, who retries. The idempotency key ensures the retry returns the original result instead of producing a second charge.
  3. Cache divergence. The in-memory inventory briefly disagrees with the lock state. Because the lock service is the gatekeeper, the worst case is a user seeing a stale "available" seat and being rejected at lock time, never a double-sold seat.

Assignment

A popular concert has 50,000 seats. 500,000 users hit the "Buy" button within the first second of tickets going on sale. Design the admission and booking flow.

  1. How do you prevent all 500K requests from reaching the booking API simultaneously? Describe the queue mechanism and batch sizing strategy.
  2. A user selects Section 103, Row F, Seat 12. Another user selects the same seat 200ms later. Walk through exactly what happens at the Redis lock level for both users.
  3. The first user's payment takes 8 minutes. The lock TTL is 10 minutes. What happens if payment takes 12 minutes instead? How does the system handle the resulting state?
  4. Why is optimistic locking a poor choice for flash sale seat reservation? What specific failure mode makes it unsuitable?