Course → Module 5: Distributed Systems & Consensus

Two Ways to Wire Services Together

Every distributed system faces the same question: when Service A needs something from Service B, should A wait for the answer or fire a message and move on? This is the fundamental split between synchronous and event-driven architecture. Neither is universally better. Each creates a different failure profile, a different debugging experience, and a different coupling structure.

Synchronous: Request-Response

In synchronous communication, the caller sends a request and blocks until it receives a response. HTTP REST calls, gRPC, and GraphQL queries are all synchronous by default. The caller knows immediately whether the operation succeeded or failed.

This simplicity has a cost. If Service B is slow, Service A is slow. If Service B is down, Service A fails. Chain five synchronous calls together (A calls B calls C calls D calls E) and your latency is the sum of all five, your availability is the product of all five. Five services at 99.9% availability each give you 99.5% end-to-end availability. That is 4.4 hours of downtime per year instead of 52 minutes.

sequenceDiagram participant Client participant OrderSvc as Order Service participant InvSvc as Inventory Service participant PaySvc as Payment Service Client->>OrderSvc: POST /orders OrderSvc->>InvSvc: Reserve inventory InvSvc-->>OrderSvc: Reserved OK OrderSvc->>PaySvc: Charge payment PaySvc-->>OrderSvc: Payment OK OrderSvc-->>Client: Order confirmed Note over Client,PaySvc: Total latency = sum of all calls
If any service fails, the whole chain fails

Synchronous architectures fail together. Event-driven architectures fail independently. This is the core tradeoff. Tight coupling gives you simplicity and immediate feedback. Loose coupling gives you resilience and independent scaling.

Event-Driven: Fire and React

In event-driven architecture, services communicate by publishing and consuming events. The Order Service does not call the Inventory Service directly. It publishes an "order.created" event. The Inventory Service subscribes to that event and reacts. The Order Service does not wait, does not know, and does not care what the Inventory Service does with the event.

This decoupling changes the failure profile. If the Inventory Service goes down for 10 minutes, events accumulate in the message broker. When the service comes back, it processes the backlog. The Order Service never noticed the outage. No cascading failures. No retry storms.

The cost is complexity. With synchronous calls, you can trace a request through a single HTTP call chain. With events, the flow is scattered across multiple services reacting to multiple event types. Debugging "why did this order not ship?" requires reconstructing the event timeline across services.

Dimension Synchronous Event-Driven
Coupling Tight: caller depends on callee's availability and interface Loose: publisher and subscriber share only the event schema
Latency Sum of all downstream calls Immediate response to client; processing happens asynchronously
Failure handling Cascading: one failure breaks the chain Isolated: failed service catches up from backlog
Traceability Simple: follow the call chain Hard: reconstruct event flows across services
Consistency Immediate (within a transaction or call) Eventual: state converges over time
Scaling Scale every service in the call chain equally Scale each service independently based on its own load
Testing Straightforward: mock downstream calls Requires event replay, integration testing across services
Data ownership Often shared databases or direct queries across services Each service owns its data, reacts to events to stay in sync

The Saga Pattern

In a monolith, you wrap a multi-step operation in a database transaction. Deduct inventory, charge payment, send confirmation. If any step fails, the whole transaction rolls back. In microservices, there is no single database. Each service owns its data. You cannot use a distributed transaction (two-phase commit) at scale because it blocks all participants and does not tolerate network partitions well.

The saga pattern replaces a single transaction with a sequence of local transactions, each in its own service. If a step fails, the saga executes compensating transactions to undo the previous steps. There are two ways to coordinate a saga: choreography and orchestration.

Choreography: Decentralized Coordination

In choreography, each service listens for events and decides what to do next. There is no central coordinator. The Order Service publishes "order.created." The Inventory Service hears it, reserves stock, and publishes "inventory.reserved." The Payment Service hears that, charges the card, and publishes "payment.completed." The Notification Service hears that and sends the confirmation email.

If payment fails, the Payment Service publishes "payment.failed." The Inventory Service hears it and releases the reserved stock (compensating transaction).

sequenceDiagram participant OS as Order Service participant IS as Inventory Service participant PS as Payment Service participant NS as Notification Service OS->>IS: Event: order.created IS->>IS: Reserve stock IS->>PS: Event: inventory.reserved PS->>PS: Charge card alt Payment succeeds PS->>NS: Event: payment.completed NS->>NS: Send confirmation else Payment fails PS->>IS: Event: payment.failed IS->>IS: Release stock (compensate) PS->>OS: Event: payment.failed OS->>OS: Mark order failed end

Choreography is simple for short sagas (2-3 steps). It becomes unmanageable for longer flows because the logic is spread across every participating service. Adding a step means modifying multiple services. Debugging requires tracing events across all participants.

Orchestration: Central Coordinator

In orchestration, a dedicated saga orchestrator controls the flow. It tells each service what to do and waits for the result. If a step fails, the orchestrator triggers compensating transactions in reverse order.

sequenceDiagram participant Orch as Saga Orchestrator participant IS as Inventory Service participant PS as Payment Service participant NS as Notification Service Orch->>IS: Reserve inventory IS-->>Orch: Reserved OK Orch->>PS: Charge payment alt Payment succeeds PS-->>Orch: Payment OK Orch->>NS: Send confirmation NS-->>Orch: Sent OK Orch->>Orch: Mark saga complete else Payment fails PS-->>Orch: Payment FAILED Orch->>IS: Release inventory (compensate) IS-->>Orch: Released OK Orch->>Orch: Mark saga failed end

The orchestrator is the single place where you can see the entire saga flow. It handles retries, timeouts, and compensation in one location. The downside is that the orchestrator becomes a single point of failure and a potential bottleneck. It also creates coupling: the orchestrator must know about every service in the flow.

The Transactional Outbox Pattern

Event-driven systems have a dual-write problem. When the Order Service creates an order, it must (1) write to its database and (2) publish an event. If the database write succeeds but the event publish fails, downstream services never learn about the order. If the event publishes but the database write fails, downstream services act on an order that does not exist.

The outbox pattern solves this. Instead of publishing the event directly, the service writes the event to an "outbox" table in the same database transaction as the business data. A separate process (a poller or a CDC connector like Debezium) reads the outbox table and publishes the events to the message broker. Because the business data and the outbox entry are in the same transaction, they succeed or fail together.

graph LR subgraph Order Service API[API Handler] -->|single transaction| DB[(Database)] API -->|same transaction| OB[Outbox Table] end OB -->|CDC / Poller| Kafka[Message Broker] Kafka --> IS[Inventory Service] Kafka --> PS[Payment Service] style DB fill:#222221,stroke:#c8a882,color:#ede9e3 style OB fill:#222221,stroke:#6b8f71,color:#ede9e3 style Kafka fill:#191918,stroke:#c47a5a,color:#ede9e3

Debezium is the most widely used open-source CDC tool for the outbox pattern. It reads the database's transaction log (PostgreSQL WAL, MySQL binlog) and streams changes to Kafka topics. No polling, no missed events, no dual-write risk.

Systems Thinking Lens

Synchronous architectures have a tight feedback loop: you call, you get a response, you know immediately. This is a reinforcing loop for developer productivity in the short term. But it creates a balancing loop at scale: more services in the chain means more latency and more failure modes, which pushes teams toward event-driven patterns.

Event-driven architectures introduce delay in the feedback loop. Events are processed asynchronously, so the cause (order created) and the effect (email sent) are separated in time. This delay makes debugging harder but makes the system more resilient to individual component failures. The saga pattern is a conscious design of compensating feedback loops: every forward action has a defined reverse action, creating a system that can self-correct when failures occur.

Further Reading

Assignment

Design a saga for the following order flow. Specify whether you would use choreography or orchestration, and justify your choice.

Flow:

  1. Step 1: Deduct inventory for the ordered items.
  2. Step 2: Charge the customer's credit card.
  3. Step 3: Send a confirmation email.

Questions:

  1. What happens if Step 2 (payment) fails after Step 1 (inventory deduction) has already succeeded? Write the compensating transaction.
  2. What happens if Step 3 (email) fails after Steps 1 and 2 succeed? Should you compensate Steps 1 and 2? Why or why not?
  3. How would you use the outbox pattern to ensure the "order.created" event is reliably published after the order is saved to the database?
  4. Draw a sequence diagram (choreography or orchestration) for the happy path and the failure path from question (a).