Course → Module 5: Distributed Systems & Consensus

Assumptions That Will Betray You

In 1994, L. Peter Deutsch, then at Sun Microsystems, compiled a list of assumptions that developers new to distributed systems commonly make. James Gosling later added an eighth. Together these are known as the Eight Fallacies of Distributed Computing. Every one of them is wrong, and every one of them causes real production failures when believed.

This session covers the fallacies, then goes deeper into three foundational problems of distributed systems: the Two Generals Problem, Byzantine fault tolerance, and logical clocks.

The network is not reliable, latency is not zero, bandwidth is not infinite. Every distributed system that forgets this will be reminded.

The Eight Fallacies

| # | Fallacy | Reality | Real-World Consequence |
|---|---------|---------|------------------------|
| 1 | The network is reliable | Packets get dropped, connections reset, hardware fails | AWS us-east-1 outage (2017): S3 metadata service lost connectivity, cascading across dozens of services |
| 2 | Latency is zero | Cross-region calls add 50-300ms; cross-continent 100-500ms | Microservice chains with 10+ synchronous hops accumulate seconds of latency |
| 3 | Bandwidth is infinite | Network links saturate under load; mobile connections throttle | Video streaming services buffer when CDN capacity spikes during live events |
| 4 | The network is secure | Traffic can be intercepted, modified, or replayed | Man-in-the-middle attacks on unencrypted internal service communication |
| 5 | Topology doesn't change | Nodes join and leave; routes shift; DNS entries update | Auto-scaling adds new instances that the service registry has not yet discovered |
| 6 | There is one administrator | Multiple teams, vendors, and cloud providers manage different parts | Conflicting firewall rules between teams block legitimate traffic in production |
| 7 | Transport cost is zero | Serialization, encryption, compression, and network I/O all consume CPU and time | gRPC protobuf serialization overhead becomes measurable at millions of requests per second |
| 8 | The network is homogeneous | Different hardware, protocols, OS versions, and configurations coexist | IPv4/IPv6 mismatch causes connectivity failures between old and new services |
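Fallacy 2 compounds quickly in synchronous call chains. A back-of-envelope sketch (the hop count and per-hop latency below are assumed illustrative values, not measurements):

```python
# Illustration of fallacy 2: synchronous hops accumulate latency.
# The per-hop figure is an assumed value for a cross-region call.
CROSS_REGION_MS = 75

def chain_latency_ms(num_hops: int, per_hop_ms: float) -> float:
    """Total added latency when each service calls the next synchronously."""
    return num_hops * per_hop_ms

# Ten synchronous hops spend 750 ms on the network alone,
# before any service does useful work.
print(chain_latency_ms(10, CROSS_REGION_MS))
```

This is why deep synchronous chains are usually replaced with parallel fan-out, caching, or asynchronous messaging.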

The Two Generals Problem

The Two Generals Problem, first published by Akkoyunlu, Ekanadham, and Huber in 1975 and later named and popularized by Jim Gray in 1978, is the simplest impossibility result in distributed computing.

Two armies, commanded by separate generals, must agree on a time to attack a city. They can only communicate by sending messengers through enemy territory. Any messenger might be captured. The question: can the generals reach agreement?

```mermaid
sequenceDiagram
    participant G1 as General 1
    participant E as Enemy Territory
    participant G2 as General 2
    G1->>E: Messenger 1: "Attack at dawn"
    Note over E: Messenger might be captured!
    E->>G2: Messenger 1 arrives
    G2->>E: Messenger 2: "Agreed, attack at dawn"
    Note over E: This messenger might also be captured!
    E->>G1: Messenger 2 arrives
    Note over G1: But does G2 know I received the confirmation?
    G1->>E: Messenger 3: "I confirm your confirmation"
    Note over E: And so on, infinitely...
```

The answer is no. No finite number of messages can guarantee agreement when any message might be lost. General 1 sends "attack at dawn." Even if General 2 receives it and sends back a confirmation, General 1 does not know whether the confirmation arrived. General 2 knows this, so General 2 cannot be sure General 1 will attack. The problem recurses infinitely.

This is not a contrived academic puzzle. It is the fundamental reason why distributed systems use timeouts, retries, and idempotency. TCP's three-way handshake does not solve the Two Generals Problem. It works well enough for practical purposes, but it does not provide absolute certainty of agreement.
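Because acknowledgments can always be lost, practical systems retry, and retries mean a request may be delivered more than once. That is why the retried operation must be idempotent. A minimal sketch of server-side deduplication with an idempotency key (all names here are illustrative, not a real API):

```python
import uuid

# Server side: remember results by idempotency key so a retried
# request is applied at most once.
processed: dict[str, str] = {}  # idempotency key -> cached result

def handle_payment(key: str, amount: int) -> str:
    if key in processed:            # duplicate delivery: return cached result
        return processed[key]
    result = f"charged {amount}"    # stand-in for the real side effect
    processed[key] = result
    return result

# Client side: generate one key per logical operation and reuse it
# across retries (e.g. after a timeout, since the ack may have been lost).
key = str(uuid.uuid4())
first = handle_payment(key, 100)
retry = handle_payment(key, 100)    # the retry is safe: same key, same result
print(first == retry, len(processed))
```

The customer is charged once no matter how many times the request is retried, which is the best a system can do given that certainty of delivery is impossible.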

Byzantine Fault Tolerance

The Two Generals Problem assumes honest participants with an unreliable channel. The Byzantine Generals Problem, described by Lamport, Shostak, and Pease in 1982, is harder: some participants may be actively malicious. They may send conflicting information to different nodes, or they may deliberately lie.

A system is Byzantine fault tolerant (BFT) if it can reach correct consensus even when some nodes are faulty or malicious. The classic result is that BFT requires at least 3f + 1 total nodes to tolerate f Byzantine failures. With 3 faulty nodes, you need at least 10 total nodes.
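The 3f + 1 bound is easy to sanity-check in code. A minimal sketch:

```python
def min_nodes(f: int) -> int:
    """Minimum cluster size to tolerate f Byzantine (malicious) nodes: 3f + 1."""
    return 3 * f + 1

def max_tolerated(n: int) -> int:
    """Largest f that a cluster of n nodes can tolerate: floor((n - 1) / 3)."""
    return (n - 1) // 3

print(min_nodes(1))       # tolerating 1 traitor needs 4 nodes
print(min_nodes(3))       # tolerating 3 traitors needs 10 nodes
print(max_tolerated(4))   # 4 nodes tolerate at most 1 traitor
```

Note the cost: tolerating even one Byzantine node requires four replicas, which is one reason non-adversarial systems settle for crash fault tolerance instead.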

Most internal distributed systems (databases, message queues, coordination services) do not implement BFT because they trust their own nodes. Blockchain systems do, because participants are anonymous and untrusted. Bitcoin's proof-of-work and Ethereum's proof-of-stake are both solutions to the Byzantine Generals Problem for open networks.

Logical Clocks

In a single computer, events have a clear order because they share one clock. In a distributed system, each node has its own clock. Physical clocks drift. Network Time Protocol (NTP) synchronization typically achieves accuracy on the order of milliseconds over the public internet, and clocks drift again between synchronizations. You cannot rely on wall-clock time to order events across nodes.

Leslie Lamport solved this in 1978 with logical clocks (Lamport timestamps). Each process maintains a counter. On every local event, increment the counter. When sending a message, attach the counter. When receiving a message, set your counter to max(local, received) + 1.

```mermaid
sequenceDiagram
    participant A as Process A
    participant B as Process B
    participant C as Process C
    Note over A: Event a1 (LC=1)
    A->>B: Message (LC=1)
    Note over B: Event b1 (LC=2)
    Note over B: max(0,1)+1 = 2
    B->>C: Message (LC=2)
    Note over C: Event c1 (LC=3)
    Note over C: max(0,2)+1 = 3
    Note over A: Event a2 (LC=2)
    C->>A: Message (LC=3)
    Note over A: Event a3 (LC=4)
    Note over A: max(2,3)+1 = 4
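The update rule can be sketched directly, replaying the events from the diagram above (a minimal illustration, not a production implementation):

```python
class LamportClock:
    """Lamport logical clock: a single per-process counter."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event: increment the counter."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Attach the current counter to an outgoing message."""
        return self.time

    def receive(self, msg_time: int) -> int:
        """On receive, jump past the sender: max(local, received) + 1."""
        self.time = max(self.time, msg_time) + 1
        return self.time

# Replaying the diagram:
a, b, c = LamportClock(), LamportClock(), LamportClock()
a.tick()                 # a1: LC=1
b.receive(a.send())      # b1: max(0,1)+1 = 2
c.receive(b.send())      # c1: max(0,2)+1 = 3
a.tick()                 # a2: LC=2
a.receive(c.send())      # a3: max(2,3)+1 = 4
print(a.time, b.time, c.time)
```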

Lamport clocks establish a happens-before relationship: if event A happened before event B, then A's timestamp is lower than B's. But the converse is not guaranteed: a lower timestamp does not mean A actually happened before B; the two events might be concurrent (unrelated).

Vector Clocks

Vector clocks extend Lamport clocks to detect concurrency. Instead of a single counter, each process maintains a vector of counters, one per process. When process i has an event, it increments its own entry. When sending, it attaches the full vector. When receiving, it takes the element-wise maximum and increments its own entry.

If vector A is less than or equal to vector B in every position, then A happened before B. If neither is less than or equal to the other, the events are concurrent. This distinction is critical for conflict detection in systems like Amazon's Dynamo.

```mermaid
sequenceDiagram
    participant A as Process A
    participant B as Process B
    Note over A: Event a1<br/>VC=[1,0]
    A->>B: Message VC=[1,0]
    Note over B: Event b1<br/>VC=max([0,0],[1,0])+[0,1]<br/>=[1,1]
    Note over A: Event a2<br/>VC=[2,0]
    Note over B: Event b2<br/>VC=[1,2]
    Note over A,B: a2=[2,0] vs b2=[1,2]<br/>Neither dominates: CONCURRENT
```
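The "neither dominates" comparison can be sketched as follows (a minimal illustration of the element-wise rule):

```python
def vc_leq(a: list[int], b: list[int]) -> bool:
    """True if a <= b in every position: a happened before (or equals) b."""
    return all(x <= y for x, y in zip(a, b))

def compare(a: list[int], b: list[int]) -> str:
    """Classify the causal relationship between two vector clocks."""
    if a == b:
        return "equal"
    if vc_leq(a, b):
        return "a happened before b"
    if vc_leq(b, a):
        return "b happened before a"
    return "concurrent"

# The example from the diagram: neither [2,0] <= [1,2] nor [1,2] <= [2,0].
print(compare([2, 0], [1, 2]))   # concurrent
print(compare([1, 0], [1, 1]))   # a happened before b
```

A Dynamo-style store uses exactly this check: "happened before" means one version can safely overwrite the other, while "concurrent" means there is a conflict the application must resolve.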

Connecting the Concepts

These problems are not independent. The fallacies explain why distributed computing is hard. The Two Generals Problem proves that perfect agreement over unreliable channels is impossible. Byzantine faults show that even agreement itself is insufficient when participants lie. And logical clocks provide the ordering mechanism that physical clocks cannot.

Every distributed system you build or operate sits on top of these constraints. Consensus algorithms (Session 5.4), replication protocols (Session 3.4), and consistency models (Session 5.2) are all engineering responses to these fundamental limitations.

Assignment

List the Eight Fallacies of Distributed Computing. For each fallacy, give a specific, real example of a system or incident where assuming the fallacy caused a failure. Your examples should be concrete: name the system, the year if possible, and what went wrong.

Bonus: For fallacies 1, 2, and 3, describe the engineering pattern commonly used to mitigate each (e.g., retries with backoff, circuit breakers, compression, CDNs).