The Core Problem

An e-commerce platform at Amazon's scale sells hundreds of millions of products across dozens of countries. During Prime Day 2023, Amazon reported that customers purchased more than 375 million items over the two-day event. The catalog alone contains over 350 million SKUs. Behind the simple act of clicking "Buy Now" sits a distributed transaction that must coordinate inventory, payment, shipping, and notifications across multiple services, all while ensuring that two customers never both buy the last unit of the same item.

The hardest problem in e-commerce is not showing products. It is selling the last one exactly once. Everything in the checkout pipeline exists to guarantee this property under concurrent load.

Service Decomposition

A monolithic e-commerce application hits a wall quickly. The search team's deployment schedule differs from the payment team's. The inventory service has different scaling needs than the recommendation engine. Microservices let each team own their domain.

| Service | Responsibility | Data Store | Scaling Pattern |
|---|---|---|---|
| Catalog | Product metadata, descriptions, images, categories | Elasticsearch + DynamoDB | Read replicas, CDN for images |
| Cart | User's current selections, quantities, saved items | Redis (session) + DynamoDB (persistent) | Horizontal, stateless |
| Inventory | Stock counts per SKU per warehouse | PostgreSQL (strong consistency) | Sharded by warehouse region |
| Order | Order lifecycle, status tracking | PostgreSQL | Sharded by user ID |
| Payment | Charge authorization, capture, refund | PostgreSQL (ACID required) | Vertical + limited horizontal |
| Shipping | Carrier selection, label generation, tracking | DynamoDB | Event-driven, async |
| Recommendation | "Customers also bought", personalized feeds | Redis (precomputed) + ML model store | Read-heavy, cache-first |

High-Level Architecture

graph TB
    Client["Client<br/>(Web / Mobile)"] --> CDN["CDN<br/>(Images, Static)"]
    Client --> GW["API Gateway"]
    GW --> CS["Catalog Service"]
    GW --> CT["Cart Service"]
    GW --> OS["Order Service<br/>(Saga Orchestrator)"]
    GW --> RS["Recommendation<br/>Service"]
    CS --> ES["Elasticsearch"]
    CS --> DDB1["DynamoDB<br/>(Catalog)"]
    CT --> RD["Redis<br/>(Cart)"]
    OS --> INV["Inventory Service"]
    OS --> PAY["Payment Service"]
    OS --> SHP["Shipping Service"]
    INV --> PG1["PostgreSQL<br/>(Inventory)"]
    PAY --> PG2["PostgreSQL<br/>(Payment)"]
    OS --> KF["Kafka<br/>(Order Events)"]
    KF --> NF["Notification Service"]
    KF --> AN["Analytics"]
    style Client fill:#222221,stroke:#6b8f71,color:#ede9e3
    style CDN fill:#222221,stroke:#8a8478,color:#ede9e3
    style GW fill:#222221,stroke:#c8a882,color:#ede9e3
    style CS fill:#222221,stroke:#c47a5a,color:#ede9e3
    style CT fill:#222221,stroke:#c47a5a,color:#ede9e3
    style OS fill:#222221,stroke:#c47a5a,color:#ede9e3
    style RS fill:#222221,stroke:#c47a5a,color:#ede9e3
    style ES fill:#222221,stroke:#8a8478,color:#ede9e3
    style DDB1 fill:#222221,stroke:#8a8478,color:#ede9e3
    style RD fill:#222221,stroke:#8a8478,color:#ede9e3
    style INV fill:#222221,stroke:#6b8f71,color:#ede9e3
    style PAY fill:#222221,stroke:#6b8f71,color:#ede9e3
    style SHP fill:#222221,stroke:#6b8f71,color:#ede9e3
    style PG1 fill:#222221,stroke:#8a8478,color:#ede9e3
    style PG2 fill:#222221,stroke:#8a8478,color:#ede9e3
    style KF fill:#222221,stroke:#c8a882,color:#ede9e3
    style NF fill:#222221,stroke:#6b8f71,color:#ede9e3
    style AN fill:#222221,stroke:#6b8f71,color:#ede9e3

The Checkout Flow: Inventory Locking

The critical moment is checkout. When a user clicks "Place Order," the system must reserve inventory, authorize payment, and create the order. If any step fails, the preceding steps must be rolled back. This is the Saga pattern (covered in Session 6.5).
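The rollback behavior described above can be sketched as a tiny orchestrator: each step pairs a forward action with a compensation, and a failure triggers the compensations of the already-completed steps in reverse order. This is a minimal in-process sketch, not the Order Service's actual implementation; `Saga`, `StepFailed`, and the step names are hypothetical, and a production orchestrator would also persist its progress so it can resume after a crash.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class StepFailed(Exception):
    """Raised by a saga step that cannot complete (hypothetical name)."""


@dataclass
class Saga:
    # Each entry: (step name, forward action, compensating action).
    steps: List[Tuple[str, Callable[[], None], Callable[[], None]]] = field(
        default_factory=list
    )
    log: List[str] = field(default_factory=list)

    def add_step(self, name, action, compensate):
        self.steps.append((name, action, compensate))

    def run(self) -> bool:
        completed = []
        for name, action, compensate in self.steps:
            try:
                action()
            except StepFailed:
                self.log.append(f"failed:{name}")
                # Undo every step that already succeeded, newest first.
                for done_name, undo in reversed(completed):
                    undo()
                    self.log.append(f"compensated:{done_name}")
                return False
            self.log.append(f"done:{name}")
            completed.append((name, compensate))
        return True


def decline_payment():
    # Stand-in for a payment authorization that is rejected.
    raise StepFailed("card declined")


saga = Saga()
saga.add_step("reserve_inventory", lambda: None, lambda: None)
saga.add_step("authorize_payment", decline_payment, lambda: None)
saga.add_step("create_order", lambda: None, lambda: None)
succeeded = saga.run()
```

In this run the payment step fails, so only the inventory reservation is compensated and the order record is never created.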

Inventory is modeled with three numbers per SKU per warehouse: on_hand (physical stock), reserved (claimed by pending orders), and available (on_hand minus reserved). This prevents overselling without locking the entire row for the duration of a checkout.

sequenceDiagram
    participant U as User
    participant OS as Order Service
    participant INV as Inventory Service
    participant PAY as Payment Service
    participant SHP as Shipping Service
    participant KF as Kafka
    U->>OS: Place Order
    OS->>INV: Reserve inventory (SKU, qty)
    Note over INV: UPDATE inventory<br/>SET reserved = reserved + qty<br/>WHERE available >= qty
    INV-->>OS: Reserved (reservation_id)
    OS->>PAY: Authorize payment
    PAY-->>OS: Authorized (auth_id)
    OS->>OS: Create order record (status: confirmed)
    OS->>KF: OrderConfirmed event
    KF->>INV: Deduct inventory (convert reservation to sale)
    KF->>SHP: Create shipment
    KF->>U: Order confirmation email
    Note over OS: If payment fails:
    OS->>INV: Release reservation
    OS->>U: Payment declined

The reservation step uses an optimistic approach. The SQL WHERE available >= qty clause acts as a guard. If two users try to buy the last item simultaneously, only one reservation will succeed because the second query will see available = 0 after the first transaction commits. The loser gets an "out of stock" response.

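This is optimistic locking in practice. No lock is held while the user moves through checkout. The database takes a row-level lock only for the duration of the single UPDATE statement, so the guard check and the increment happen atomically. Because available is derived from on_hand minus reserved, the guard in a real schema is either a generated column or the expression on_hand - reserved >= qty.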
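A minimal sketch of the guarded reservation follows, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The table layout and the `make_db`, `reserve`, and `release` helpers are assumptions for illustration. SQLite serializes writers, so this shows the guard logic rather than true multi-connection concurrency.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # In-memory database with one row per (sku, warehouse); assumed schema.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        " sku TEXT NOT NULL,"
        " warehouse TEXT NOT NULL,"
        " on_hand INTEGER NOT NULL,"
        " reserved INTEGER NOT NULL DEFAULT 0,"
        " PRIMARY KEY (sku, warehouse))"
    )
    return conn


def reserve(conn: sqlite3.Connection, sku: str, warehouse: str, qty: int) -> bool:
    """Try to reserve qty units; return True only if the guard matched."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE inventory SET reserved = reserved + ?"
            " WHERE sku = ? AND warehouse = ?"
            " AND on_hand - reserved >= ?",  # available >= qty
            (qty, sku, warehouse, qty),
        )
    # Zero rows updated means there was not enough available stock.
    return cur.rowcount == 1


def release(conn: sqlite3.Connection, sku: str, warehouse: str, qty: int) -> bool:
    """Compensation: give back a reservation, e.g. after a declined payment."""
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET reserved = reserved - ?"
            " WHERE sku = ? AND warehouse = ? AND reserved >= ?",
            (qty, sku, warehouse, qty),
        )
    return cur.rowcount == 1


conn = make_db()
with conn:
    conn.execute(
        "INSERT INTO inventory (sku, warehouse, on_hand) VALUES (?, ?, ?)",
        ("sku-123", "us-east", 1),
    )
first = reserve(conn, "sku-123", "us-east", 1)   # wins the last unit
second = reserve(conn, "sku-123", "us-east", 1)  # guard fails: out of stock
```

The caller never reads the stock level before writing. The affected-row count of the UPDATE is the only signal, which is what keeps the check and the write from racing.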

Payment: Exactly-Once Semantics

Payment is the one operation where "at least once" delivery can cost you real money. Charging a customer twice is worse than not charging them at all. The system must guarantee exactly-once processing.

The standard approach uses an idempotency key. The Order Service generates a unique key for each payment attempt and sends it with the charge request. The Payment Service stores this key. If the same key arrives again (due to a retry after a network timeout), the Payment Service returns the original result without processing a second charge.
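The idempotency-key check can be sketched with an in-memory store. The `PaymentService` class, `ChargeResult` type, and `charges_made` counter below are hypothetical; a real service would persist the key with a unique constraint in the same transaction as the charge record, so that the check and the insert cannot race.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ChargeResult:
    charge_id: str
    amount_cents: int
    status: str


class PaymentService:
    """In-memory stand-in for a payment service (illustrative only)."""

    def __init__(self) -> None:
        self._results: Dict[str, ChargeResult] = {}
        self.charges_made = 0  # counts real charges, for demonstration

    def charge(self, idempotency_key: str, amount_cents: int) -> ChargeResult:
        # A retry with a known key returns the stored result untouched.
        existing = self._results.get(idempotency_key)
        if existing is not None:
            return existing
        # First time this key is seen: perform the charge and remember it.
        self.charges_made += 1
        result = ChargeResult(
            charge_id=f"ch_{self.charges_made}",
            amount_cents=amount_cents,
            status="authorized",
        )
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = "order-42-attempt-1"  # generated once per payment attempt by the caller
original = service.charge(key, 1999)
retried = service.charge(key, 1999)  # e.g. retry after a network timeout
```

The key is created by the Order Service before the first request is sent, so every retry of that attempt carries the same value.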

On the provider side, payment finality relies on the webhook callback from the payment processor (Stripe, Adyen), not the synchronous API response. The synchronous response confirms the request was received. The webhook confirms the money moved. The Order Service should not mark an order as "paid" until the webhook arrives.
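One way to express this rule is a small order state machine in which only the webhook can move an order to "paid". The sketch below is an assumption-laden illustration: the `OrderStore` class, the event fields (`event_id`, `order_id`, `type`), and the event type strings are hypothetical and do not match any specific processor's payload. Signature verification of the webhook, which real handlers need, is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Order:
    order_id: str
    status: str = "pending_payment"


@dataclass
class OrderStore:
    orders: Dict[str, Order] = field(default_factory=dict)
    seen_events: Set[str] = field(default_factory=set)

    def on_sync_response(self, order_id: str, accepted: bool) -> None:
        # The synchronous response can reject the order, but never marks it paid.
        if not accepted:
            self.orders[order_id].status = "payment_failed"

    def on_webhook(self, event: dict) -> None:
        # Webhooks may be delivered more than once; ignore repeats.
        if event["event_id"] in self.seen_events:
            return
        self.seen_events.add(event["event_id"])
        order = self.orders[event["order_id"]]
        if order.status != "pending_payment":
            return  # already finalized
        if event["type"] == "payment.succeeded":
            order.status = "paid"
        elif event["type"] == "payment.failed":
            order.status = "payment_failed"


store = OrderStore()
store.orders["o-1"] = Order("o-1")
store.on_sync_response("o-1", accepted=True)
status_after_sync = store.orders["o-1"].status
store.on_webhook({"event_id": "e-1", "order_id": "o-1", "type": "payment.succeeded"})
status_after_webhook = store.orders["o-1"].status
```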

Search-Optimized Catalog

Product search is read-heavy. Users browse, filter, and search far more than they buy. The catalog uses Elasticsearch for full-text search and faceted filtering (brand, price range, rating, category). The source of truth for product data lives in DynamoDB. Changes propagate to Elasticsearch via a change data capture (CDC) stream.

This separation lets the search index be optimized for read patterns (denormalized, pre-aggregated facet counts) while the primary store maintains normalized data integrity.
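The propagation step can be sketched as a function that consumes change events and writes flattened documents into the search index. Here a plain dict stands in for Elasticsearch, and the event shape (`op`, `sku`, `new_image`) and the product fields are assumptions for illustration, not the format of any particular CDC stream.

```python
from typing import Any, Dict


def to_search_doc(product: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a normalized product record into a denormalized search document."""
    return {
        "sku": product["sku"],
        "title": product["title"],
        "brand": product["brand"]["name"],  # nested record becomes a flat field
        "category_path": " > ".join(product["categories"]),
        "price_cents": product["price_cents"],
    }


def apply_change(index: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Apply one change event from the primary store to the search index."""
    if event["op"] == "delete":
        index.pop(event["sku"], None)
    else:  # insert or update: overwrite the whole document
        index[event["sku"]] = to_search_doc(event["new_image"])


search_index: Dict[str, Dict[str, Any]] = {}
apply_change(
    search_index,
    {
        "op": "upsert",
        "sku": "sku-123",
        "new_image": {
            "sku": "sku-123",
            "title": "Two-person tent",
            "brand": {"id": "b-9", "name": "Acme"},
            "categories": ["Outdoors", "Camping", "Tents"],
            "price_cents": 12999,
        },
    },
)
```

Overwriting the whole document on every update keeps the handler idempotent, so replaying the same event after a consumer restart leaves the index unchanged.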

Recommendation Engine

The "Customers who bought X also bought Y" feature is a collaborative filtering problem. At Amazon's scale, recommendations are precomputed offline using batch processing (Spark or equivalent) and stored in a fast key-value store (Redis, DynamoDB). When a user views a product, the service looks up precomputed recommendations by product ID. Real-time signals (what the user just clicked) are blended in at serving time via a lightweight model.

The key insight: recommendations are not computed on the fly. They are a read path. The heavy computation happens in a daily or hourly batch job.
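The serving path can be sketched as a lookup followed by a cheap re-rank. In this illustration a dict stands in for the key-value store, and the "lightweight model" is reduced to a fixed score boost for candidates that share a category with something the user just clicked. The `recommend` function, the boost value, and the sample SKUs are all assumptions.

```python
from typing import Dict, List


def recommend(
    product_id: str,
    precomputed: Dict[str, List[str]],
    category_of: Dict[str, str],
    recent_clicks: List[str],
    limit: int = 3,
) -> List[str]:
    """Look up batch-computed recommendations and re-rank with session signals."""
    candidates = precomputed.get(product_id, [])
    recent_categories = {category_of[s] for s in recent_clicks if s in category_of}

    def score(item):
        position, sku = item
        base = 1.0 / (1 + position)  # batch ranking: earlier is better
        boost = 0.6 if category_of.get(sku) in recent_categories else 0.0
        return base + boost

    ranked = sorted(enumerate(candidates), key=score, reverse=True)
    # Do not recommend something the user has just looked at.
    return [sku for _, sku in ranked if sku not in recent_clicks][:limit]


precomputed = {"tent": ["sleeping-bag", "stove", "lantern"]}
category_of = {
    "sleeping-bag": "sleep",
    "stove": "cooking",
    "lantern": "lighting",
    "pan": "cooking",
}
cold = recommend("tent", precomputed, category_of, recent_clicks=[])
warm = recommend("tent", precomputed, category_of, recent_clicks=["pan"])
```

With no session signal the batch order is returned unchanged. After a click on a cooking item, the stove moves ahead of the sleeping bag.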

Assignment

Two users click "Buy Now" on the last unit of a product at the same time.

  1. Design the checkout flow so that exactly one user succeeds and the other receives an "out of stock" message. Write the SQL for the inventory reservation step. Explain why it prevents overselling.
  2. What happens if the payment authorization succeeds but the Payment Service crashes before returning the response to the Order Service? How does the Order Service know whether to retry or roll back?
  3. Describe the locking strategy. Is this pessimistic or optimistic locking? What are the tradeoffs of each in a high-traffic checkout scenario?
  4. Draw a sequence diagram for the full saga, including the compensation (rollback) path when shipping fails after payment succeeds.