Course → Module 5: Distributed Systems & Consensus

Getting Distributed Nodes to Agree

Session 5.3 established that distributed systems face unreliable networks and imperfect clocks. Despite these constraints, real systems need nodes to agree on things: which node is the leader, what order operations happened in, and whether a transaction committed. Consensus algorithms are the formal mechanisms for reaching such agreement.

Two algorithms dominate the field: Paxos, devised by Leslie Lamport in 1989 (the paper, "The Part-Time Parliament," was not formally published until 1998), and Raft, published by Diego Ongaro and John Ousterhout in 2014. They solve the same problem with different philosophies.

Raft was designed to be understood. Paxos was designed to be correct. In practice, understandability wins.

What Is Consensus?

A consensus algorithm allows a group of nodes to agree on a single value, even if some nodes fail. The classic requirements are:

  - Agreement: no two nodes decide on different values.
  - Validity: the decided value must have been proposed by some node.
  - Termination: every non-faulty node eventually decides.

In practice, consensus is used for leader election, log replication, distributed locks, and configuration management. Every time your Kubernetes cluster elects a controller, every time etcd agrees on a key-value write, a consensus algorithm is running.

Paxos

Lamport described Paxos using an allegory about a parliament on the Greek island of Paxos. The algorithm has three roles: Proposers suggest values, Acceptors vote on proposals, and Learners learn the decided value. A single node can play multiple roles.

Basic Paxos runs in two phases:

Phase 1 (Prepare): A proposer picks a proposal number n and sends a Prepare(n) message to a majority of acceptors. Each acceptor responds with a promise not to accept any proposal numbered less than n, along with any value it has already accepted.

Phase 2 (Accept): If the proposer receives promises from a majority, it sends an Accept(n, v) message, where v is either the value from the highest-numbered previously accepted proposal, or the proposer's own value if no prior proposals exist. If a majority of acceptors accept, the value is chosen.
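The acceptor's side of these two phases can be sketched in a few lines of Python. This is a toy single-decree model, not any production implementation; the names (`Acceptor`, `run_round`) are illustrative, and real deployments would add persistent storage and real messaging.

```python
# Toy single-decree Paxos: each acceptor tracks the highest proposal
# number it has promised and the last proposal it has accepted.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # number of the last accepted proposal
        self.accepted_value = None  # value of the last accepted proposal

    def prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted proposal so the proposer
            # can adopt its value in Phase 2.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n)

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("reject", self.promised_n)


def run_round(acceptors, n, value):
    """Drive one proposer round against a list of acceptors."""
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None  # no majority of promises: round fails
    # Adopt the value of the highest-numbered previously accepted
    # proposal, if any; otherwise propose our own value.
    prior = max((p for p in granted if p[2] is not None),
                key=lambda p: p[1], default=None)
    chosen = prior[2] if prior else value
    acks = [a.accept(n, chosen) for a in acceptors]
    if sum(1 for r in acks if r[0] == "accepted") > len(acceptors) // 2:
        return chosen
    return None
```

Note the safety-critical detail in `run_round`: a later proposer with a higher number does not get to impose its own value; it must re-propose whatever a majority may already have accepted.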

```mermaid
sequenceDiagram
    participant P as Proposer
    participant A1 as Acceptor 1
    participant A2 as Acceptor 2
    participant A3 as Acceptor 3
    Note over P: Phase 1: Prepare
    P->>A1: Prepare(n=1)
    P->>A2: Prepare(n=1)
    P->>A3: Prepare(n=1)
    A1-->>P: Promise(n=1, no prior value)
    A2-->>P: Promise(n=1, no prior value)
    A3-->>P: Promise(n=1, no prior value)
    Note over P: Phase 2: Accept
    P->>A1: Accept(n=1, v="commit X")
    P->>A2: Accept(n=1, v="commit X")
    P->>A3: Accept(n=1, v="commit X")
    A1-->>P: Accepted
    A2-->>P: Accepted
    A3-->>P: Accepted
    Note over P: Value "commit X" is chosen
```

Paxos guarantees safety: it will never decide on two different values. But it has known liveness issues. Two proposers can repeatedly preempt each other with higher proposal numbers, creating an infinite loop where no value is ever chosen. Multi-Paxos solves this by electing a stable leader, but the optimization adds complexity.

The deeper issue is comprehensibility. Lamport's original paper was notoriously difficult to understand. Multiple follow-up papers attempted to clarify it. Implementations diverge from the specification because engineers cannot verify their code against a formalism they do not fully grasp.

Raft

Ongaro and Ousterhout designed Raft explicitly for understandability. Their 2014 paper opens with the observation that Paxos has become synonymous with consensus, yet few people understand it well enough to implement it correctly. Raft decomposes consensus into three sub-problems: leader election, log replication, and safety.

Leader Election

Every node is in one of three states: Follower, Candidate, or Leader. All nodes start as followers. If a follower does not hear from a leader within a randomized election timeout, it becomes a candidate, increments its term number, votes for itself, and requests votes from other nodes.

```mermaid
stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate: Election timeout expires
    Candidate --> Candidate: Election timeout (split vote)
    Candidate --> Leader: Receives majority votes
    Candidate --> Follower: Discovers higher term
    Leader --> Follower: Discovers higher term
```

A candidate wins the election if it receives votes from a majority of nodes. If two candidates split the vote (no majority), both time out and try again with new term numbers. The randomized timeout makes split votes unlikely to repeat.
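The election mechanics above can be sketched in Python. This is a simulation, not a real Raft node: the network is replaced by direct function calls, and the timeout range follows the 150–300 ms suggestion in the Raft paper. The names `Node`, `request_vote`, and `start_election` are illustrative.

```python
import random

# Toy sketch of Raft leader election: randomized timeouts plus
# one-vote-per-term vote counting.

ELECTION_TIMEOUT_RANGE = (150, 300)  # milliseconds, per the Raft paper

def new_election_timeout():
    """Randomized timeout; randomization makes repeated split votes unlikely."""
    return random.uniform(*ELECTION_TIMEOUT_RANGE)

class Node:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None
        self.state = "follower"

    def request_vote(self, candidate_id, term):
        """Grant at most one vote per term; step down on seeing a higher term."""
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

def start_election(candidate_id, candidate, peers):
    """Candidate increments its term, votes for itself, and solicits votes."""
    candidate.current_term += 1
    candidate.state = "candidate"
    candidate.voted_for = candidate_id
    votes = 1  # its own vote
    for peer in peers:
        if peer.request_vote(candidate_id, candidate.current_term):
            votes += 1
    cluster_size = len(peers) + 1
    if votes > cluster_size // 2:
        candidate.state = "leader"
    return candidate.state
```

Because each node votes at most once per term, two candidates in the same term can split the electorate; neither reaches a majority, both time out, and the next round's randomized timeouts make one of them start first.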

Log Replication

Once elected, the leader accepts client requests and appends them to its log. It then sends AppendEntries RPCs to all followers. When the entry is stored on a majority of servers (the leader plus enough followers), the entry is committed. The leader then notifies the client that the write succeeded.

```mermaid
sequenceDiagram
    participant C as Client
    participant L as Leader
    participant F1 as Follower 1
    participant F2 as Follower 2
    C->>L: Write request
    L->>L: Append to local log
    L->>F1: AppendEntries(entry)
    L->>F2: AppendEntries(entry)
    F1-->>L: Ack
    F2-->>L: Ack
    Note over L: Majority confirmed (3/3)
    L->>L: Commit entry
    L-->>C: Write successful
    L->>F1: Commit notification
    L->>F2: Commit notification
```
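The commit rule itself is a small piece of majority arithmetic. The sketch below (an illustrative helper, not real Raft code) computes the highest log index stored on a majority of servers, given how far each follower has replicated:

```python
# Toy sketch of the leader's commit rule: an entry is committed once it
# is stored on a majority of servers, counting the leader itself.
# (Real Raft adds one more condition: the entry must be from the
# leader's current term before it may be committed by counting.)

def commit_index(match_index, leader_last_index, cluster_size):
    """match_index: highest replicated log index for each follower.
    Returns the highest index stored on a majority of servers."""
    # The leader always holds its own log, so include its last index.
    indexes = sorted(match_index + [leader_last_index], reverse=True)
    majority = cluster_size // 2 + 1
    return indexes[majority - 1]
```

For a 5-node cluster, an entry is safe once the leader and any two followers have it; the two slowest followers never gate the client's latency.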

Safety

Raft guarantees that if a log entry is committed, it will be present in the logs of all future leaders. A node grants its vote only if the candidate's log is at least as up-to-date as its own, so a candidate cannot win without a log at least as up-to-date as every node in some majority. This prevents a node with a stale log from becoming leader and overwriting committed entries.
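The "at least as up-to-date" comparison is precisely defined in the Raft paper's election restriction: compare the terms of the last log entries first, and break ties by log length. A minimal sketch (the function name is illustrative):

```python
# Raft's log-freshness check used when deciding whether to grant a vote:
# a later last-entry term always wins; equal terms fall back to length.

def candidate_log_ok(cand_last_term, cand_last_index,
                     voter_last_term, voter_last_index):
    """True if the voter may grant its vote based on log freshness."""
    if cand_last_term != voter_last_term:
        return cand_last_term > voter_last_term
    return cand_last_index >= voter_last_index
```

Note that term outranks length: a short log from a recent term beats a long log full of stale, uncommitted entries.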

Paxos vs. Raft

| Dimension | Paxos | Raft |
| --- | --- | --- |
| Published | 1989 (Lamport; in print 1998) | 2014 (Ongaro & Ousterhout) |
| Design goal | Correctness proof | Understandability |
| Leader required? | No (basic); yes (Multi-Paxos) | Yes, always |
| Roles | Proposer, Acceptor, Learner | Leader, Follower, Candidate |
| Log ordering | Gaps allowed, filled later | Strictly sequential, no gaps |
| Failure handling | Complex reconfiguration | Leader dies, new election, simple recovery |
| Implementations | Google Chubby, Apache Mesos | etcd, Consul, CockroachDB, TiKV |
| Learning curve | Steep; multiple papers needed | Moderate; one paper suffices |
| Performance | Comparable in practice | Comparable in practice |
| Correctness proof | Original paper provides proof | TLA+ specification by Ongaro |

ZooKeeper: Consensus as a Service

Apache ZooKeeper provides consensus as a reusable service. Instead of embedding Paxos or Raft into every application, you delegate coordination to ZooKeeper. It uses a protocol called ZAB (ZooKeeper Atomic Broadcast), which is similar to Raft in that it uses a leader for ordering.

ZooKeeper provides distributed primitives: locks, barriers, leader election, configuration management, and group membership. Kafka (before KRaft) used ZooKeeper for broker coordination and partition leader election. Hadoop uses it for NameNode high availability. HBase uses it for region server coordination.
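ZooKeeper's leader-election recipe is a good example of how these primitives are built: each participant creates an ephemeral sequential znode, and the one with the smallest sequence number is the leader. A pure-Python simulation of the decision rule (real clients such as kazoo would create actual znodes, and each node would watch its predecessor to avoid a thundering herd):

```python
# Simulation of ZooKeeper's leader-election recipe. Each participant
# creates an ephemeral sequential znode like "candidate_0000000007";
# the participant holding the smallest sequence number is the leader.
# If the leader dies, its ephemeral znode vanishes and the next-lowest
# sequence number takes over.

def elect_leader(znodes):
    """znodes: ephemeral sequential znode names. Smallest sequence wins."""
    return min(znodes, key=lambda name: int(name.rsplit("_", 1)[1]))
```

The sequence numbers are assigned atomically by the ZooKeeper ensemble itself, which is what lets clients agree without any peer-to-peer protocol of their own.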

The trend in recent years is away from external coordination services and toward embedded consensus. Kafka replaced ZooKeeper with its own Raft-based protocol (KRaft). CockroachDB and TiKV embed Raft directly. The operational overhead of managing a ZooKeeper cluster is significant, and teams prefer fewer moving parts.

Quorum Math

Consensus algorithms depend on quorums: a majority of nodes that must agree for an operation to proceed. For a cluster of N nodes, the quorum size is floor(N/2) + 1. The cluster can tolerate floor((N-1)/2) failures.

| Cluster Size (N) | Quorum Size | Max Failures Tolerated |
| --- | --- | --- |
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
| 9 | 5 | 4 |

Five nodes is the most common production configuration. It tolerates two failures, which covers most realistic scenarios (one planned maintenance + one unexpected failure). Seven or nine nodes increase fault tolerance but add latency, because every write must wait for more acknowledgments.
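The arithmetic in the table reduces to two one-liners, which also make it obvious why even-sized clusters are avoided:

```python
# Quorum math for an N-node cluster: quorum = floor(N/2) + 1, and the
# failures tolerated are whatever is left over, floor((N-1)/2).

def quorum(n):
    return n // 2 + 1

def max_failures(n):
    return (n - 1) // 2
```

A 4-node cluster needs a quorum of 3 and still tolerates only 1 failure, the same as a 3-node cluster, so the fourth node adds write latency without adding fault tolerance.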

Further Reading

  - Leslie Lamport, "The Part-Time Parliament" (ACM TOCS, 1998) — the original Paxos paper.
  - Leslie Lamport, "Paxos Made Simple" (2001) — one of the follow-up clarifications.
  - Diego Ongaro and John Ousterhout, "In Search of an Understandable Consensus Algorithm" (USENIX ATC, 2014) — the Raft paper.

Assignment

You have a 5-node Raft cluster. The current leader (Node 1) crashes.

  1. Describe the step-by-step process of a new leader being elected. Include term numbers, vote requests, and the role of election timeouts.
  2. What is the minimum number of nodes that must agree for a new leader to be elected?
  3. What happens if Node 2 and Node 3 both become candidates in the same term? Walk through the split-vote scenario and its resolution.
  4. After the new leader is elected, it discovers that Node 4 has a log entry that was never committed. What does Raft do with this entry?