Consensus Algorithms: Paxos & Raft
Session 5.4 · ~5 min read
Getting Distributed Nodes to Agree
Session 5.3 established that distributed systems face unreliable networks and imperfect clocks. Despite these constraints, real systems need nodes to agree on things: which node is the leader, what order operations happened in, and whether a transaction committed. Consensus algorithms are the formal mechanisms for reaching such agreement.
Two algorithms dominate the field: Paxos, which Leslie Lamport first described in 1989 (his paper "The Part-Time Parliament" was not formally published until 1998), and Raft, published by Diego Ongaro and John Ousterhout in 2014. They solve the same problem with different philosophies.
Raft was designed to be understood. Paxos was designed to be correct. In practice, understandability wins.
What Is Consensus?
A consensus algorithm allows a group of nodes to agree on a single value, even if some nodes fail. The requirements are:
- Agreement: All non-faulty nodes decide on the same value.
- Validity: The decided value was proposed by some node.
- Termination: All non-faulty nodes eventually decide.
- Integrity: Each node decides at most once.
In practice, consensus is used for leader election, log replication, distributed locks, and configuration management. Every time your Kubernetes cluster elects a controller, every time etcd agrees on a key-value write, a consensus algorithm is running.
Paxos
Lamport described Paxos using an allegory about a parliament on the Greek island of Paxos. The algorithm has three roles: Proposers suggest values, Acceptors vote on proposals, and Learners learn the decided value. A single node can play multiple roles.
Basic Paxos runs in two phases:
Phase 1 (Prepare): A proposer picks a proposal number n and sends a Prepare(n) message to the acceptors. Each acceptor that has not already promised a higher-numbered proposal responds with a promise not to accept any proposal numbered less than n, along with the highest-numbered proposal (if any) it has already accepted.
Phase 2 (Accept): If the proposer receives promises from a majority, it sends an Accept(n, v) message, where v is either the value from the highest-numbered previously accepted proposal, or the proposer's own value if no prior proposals exist. If a majority of acceptors accept, the value is chosen.
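The acceptor's side of these two phases fits in a few lines. This is a minimal in-memory sketch for illustration (a real acceptor must persist its promises and accepted values to stable storage before replying); the class and method names are my own, not from any particular implementation.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest proposal number this acceptor promised
        self.accepted_n = None    # number of the last accepted proposal, if any
        self.accepted_v = None    # value of the last accepted proposal, if any

    def on_prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted proposal so the proposer
            # must adopt its value -- this is what makes Paxos safe.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised_n)

    def on_accept(self, n, v):
        """Phase 2: accept unless a higher-numbered prepare arrived meanwhile."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("accepted", n)
        return ("reject", self.promised_n)
```

Note how a later Prepare(2) returns the already-accepted proposal, forcing the new proposer to re-propose the old value rather than its own: that is the safety mechanism in Phase 2's "value from the highest-numbered previously accepted proposal" rule.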
Paxos guarantees safety: it will never decide on two different values. But it has known liveness issues. Two proposers can repeatedly preempt each other with ever-higher proposal numbers, a livelock in which no value is ever chosen. Multi-Paxos avoids this by electing a stable leader, but the optimization adds complexity.
The deeper issue is comprehensibility. Lamport's original paper was notoriously difficult to understand. Multiple follow-up papers attempted to clarify it. Implementations diverge from the specification because engineers cannot verify their code against a formalism they do not fully grasp.
Raft
Ongaro and Ousterhout designed Raft explicitly for understandability. Their 2014 paper opens with the observation that Paxos has become synonymous with consensus, yet few people understand it well enough to implement it correctly. Raft decomposes consensus into three sub-problems: leader election, log replication, and safety.
Leader Election
Every node is in one of three states: Follower, Candidate, or Leader. All nodes start as followers. If a follower does not hear from a leader within a randomized election timeout, it becomes a candidate, increments its term number, votes for itself, and requests votes from other nodes.
A candidate wins the election if it receives votes from a majority of nodes. If two candidates split the vote (no majority), both time out and try again with new term numbers. The randomized timeout makes split votes unlikely to repeat.
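The per-node bookkeeping behind this election protocol can be sketched as follows. This is an illustrative skeleton, not a full Raft node: it assumes some external transport delivers RequestVote messages and vote replies, and the names are invented for this sketch. The 150-300 ms timeout range is the one suggested in the Raft paper.

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # randomized range suggested in the Raft paper

class Node:
    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.state = "follower"   # follower -> candidate -> leader
        self.term = 0
        self.voted_for = None

    def reset_timeout(self):
        # Randomization makes it unlikely that two nodes time out together,
        # which is what prevents repeated split votes.
        return random.uniform(*ELECTION_TIMEOUT_MS)

    def start_election(self):
        """Fires when the election timeout elapses with no leader contact."""
        self.state = "candidate"
        self.term += 1                    # new term for the new election
        self.voted_for = self.node_id     # vote for self
        return {"type": "RequestVote", "term": self.term,
                "candidate": self.node_id}

    def on_votes_counted(self, votes_received):
        """Become leader only on a strict majority of the whole cluster."""
        if votes_received > self.cluster_size // 2:
            self.state = "leader"
        return self.state
```

In a 5-node cluster, `on_votes_counted(3)` promotes the candidate (3 > 5 // 2), while 2 votes, the split-vote case, leaves it a candidate until a fresh timeout triggers a new term.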
Log Replication
Once elected, the leader accepts client requests and appends them to its log. It then sends AppendEntries RPCs to all followers. When the entry is stored on a majority of the cluster (the leader counts toward that majority), the entry is committed, and the leader notifies the client that the write succeeded.
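The leader's commit decision reduces to a small calculation over the replication state it already tracks. A sketch, assuming the leader knows the highest log index stored on each follower (Raft's `matchIndex`); the function name is illustrative:

```python
def commit_index(leader_last_index, follower_match_indexes):
    """Return the highest log index stored on a majority of the cluster.

    leader_last_index: index of the last entry in the leader's own log.
    follower_match_indexes: highest index known replicated on each follower.
    """
    # Sort all known indexes (leader included) from highest to lowest.
    indexes = sorted([leader_last_index] + follower_match_indexes, reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th highest index is present on at least `majority` nodes,
    # so every entry up to it is safe to commit.
    return indexes[majority - 1]
```

For example, in a 5-node cluster where the leader is at index 7 and the followers are at 7, 5, 5, and 3, entries through index 5 are on three nodes, so `commit_index(7, [7, 5, 5, 3])` yields 5.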
Safety
Raft guarantees that if a log entry is committed, it will be present in the logs of all future leaders. Each node grants its vote only if the candidate's log is at least as up-to-date as its own, so any winning candidate's log is at least as up-to-date as the logs of a majority of the cluster. This prevents a node with a stale log from becoming leader and overwriting committed entries.
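The "at least as up-to-date" comparison is precisely defined in the Raft paper: compare the term of the last log entry first, and break ties by log length. A sketch of the check a voter applies before granting its vote (function and parameter names are illustrative):

```python
def candidate_log_ok(cand_last_term, cand_last_index,
                     my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if cand_last_term != my_last_term:
        # A later last term always wins, regardless of log length.
        return cand_last_term > my_last_term
    # Same last term: the longer (or equal-length) log is up-to-date.
    return cand_last_index >= my_last_index
```

A candidate whose last entry is from term 3 beats a voter stuck at term 2 even with a shorter log, while a candidate with the same last term but a shorter log is refused.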
Paxos vs. Raft
| Dimension | Paxos | Raft |
|---|---|---|
| Published | 1989 (Lamport) | 2014 (Ongaro & Ousterhout) |
| Design goal | Correctness proof | Understandability |
| Leader required? | No (basic); yes (Multi-Paxos) | Yes, always |
| Roles | Proposer, Acceptor, Learner | Leader, Follower, Candidate |
| Log ordering | Gaps allowed, filled later | Strictly sequential, no gaps |
| Failure handling | Complex reconfiguration | Leader dies, new election, simple recovery |
| Implementations | Google Chubby, Apache Mesos | etcd, Consul, CockroachDB, TiKV |
| Learning curve | Steep. Multiple papers needed. | Moderate. One paper suffices. |
| Performance | Comparable in practice | Comparable in practice |
| Correctness proof | Original paper provides proof | TLA+ specification by Ongaro |
ZooKeeper: Consensus as a Service
Apache ZooKeeper provides consensus as a reusable service. Instead of embedding Paxos or Raft into every application, you delegate coordination to ZooKeeper. It uses a protocol called ZAB (ZooKeeper Atomic Broadcast), which predates Raft but resembles it in relying on a single leader to order operations.
ZooKeeper provides distributed primitives: locks, barriers, leader election, configuration management, and group membership. Kafka (before KRaft) used ZooKeeper for broker coordination and partition leader election. Hadoop uses it for NameNode high availability. HBase uses it for region server coordination.
The trend in recent years is away from external coordination services and toward embedded consensus. Kafka replaced ZooKeeper with its own Raft-based protocol (KRaft). CockroachDB and TiKV embed Raft directly. The operational overhead of managing a ZooKeeper cluster is significant, and teams prefer fewer moving parts.
Quorum Math
Consensus algorithms depend on quorums: a majority of nodes that must agree for an operation to proceed. For a cluster of N nodes, the quorum size is floor(N/2) + 1. The cluster can tolerate floor((N-1)/2) failures.
| Cluster Size (N) | Quorum Size | Max Failures Tolerated |
|---|---|---|
| 3 | 2 | 1 |
| 5 | 3 | 2 |
| 7 | 4 | 3 |
| 9 | 5 | 4 |
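The table above is just the two formulas evaluated for odd cluster sizes; spelled out in code:

```python
def quorum(n):
    """Minimum number of nodes that must agree: floor(N/2) + 1."""
    return n // 2 + 1

def max_failures(n):
    """Failures the cluster survives while keeping a quorum: floor((N-1)/2)."""
    return (n - 1) // 2

for n in (3, 5, 7, 9):
    print(f"N={n}: quorum={quorum(n)}, tolerates {max_failures(n)} failures")
```

Note that even cluster sizes buy nothing: `quorum(4)` is 3 and `max_failures(4)` is 1, the same fault tolerance as 3 nodes, which is why production clusters use odd sizes.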
Five nodes is the most common production configuration. It tolerates two failures, which covers most realistic scenarios (one planned maintenance + one unexpected failure). Seven or nine nodes increase fault tolerance but add latency, because every write must wait for more acknowledgments.
Further Reading
- Ongaro, D. & Ousterhout, J. (2014). "In Search of an Understandable Consensus Algorithm."
- Lamport, L. (2001). "Paxos Made Simple."
- Raft Visualization — Interactive animation of Raft leader election and log replication.
- Wikipedia: Paxos
- Apache ZooKeeper Internals
Assignment
You have a 5-node Raft cluster. The current leader (Node 1) crashes.
- Describe the step-by-step process of a new leader being elected. Include term numbers, vote requests, and the role of election timeouts.
- What is the minimum number of nodes that must agree for a new leader to be elected?
- What happens if Node 2 and Node 3 both become candidates in the same term? Walk through the split-vote scenario and its resolution.
- After the new leader is elected, it discovers that Node 4 has a log entry that was never committed. What does Raft do with this entry?