Distributed Computing Concepts
Session 5.3 · ~5 min read
Assumptions That Will Betray You
In 1994, L. Peter Deutsch compiled a list of seven assumptions that developers new to distributed systems commonly make. James Gosling, also at Sun Microsystems, later added an eighth. Together they are known as the Eight Fallacies of Distributed Computing. Every one of them is wrong, and every one of them causes real production failures when believed.
This session covers the fallacies, then goes deeper into three foundational problems of distributed systems: the Two Generals Problem, Byzantine fault tolerance, and logical clocks.
The network is not reliable, latency is not zero, bandwidth is not infinite. Every distributed system that forgets this will be reminded.
The Eight Fallacies
| # | Fallacy | Reality | Real-World Consequence |
|---|---|---|---|
| 1 | The network is reliable | Packets get dropped, connections reset, hardware fails | AWS S3 outage in us-east-1 (2017): an erroneous command took too much S3 index-subsystem capacity offline, cascading across dozens of dependent services |
| 2 | Latency is zero | Cross-region calls add 50-300ms; cross-continent 100-500ms | Microservice chains with 10+ synchronous hops accumulate seconds of latency |
| 3 | Bandwidth is infinite | Network links saturate under load; mobile connections throttle | Video streaming services buffer when CDN capacity spikes during live events |
| 4 | The network is secure | Traffic can be intercepted, modified, or replayed | Man-in-the-middle attacks on unencrypted internal service communication |
| 5 | Topology doesn't change | Nodes join and leave; routes shift; DNS entries update | Auto-scaling adds new instances that the service registry has not yet discovered |
| 6 | There is one administrator | Multiple teams, vendors, and cloud providers manage different parts | Conflicting firewall rules between teams block legitimate traffic in production |
| 7 | Transport cost is zero | Serialization, encryption, compression, and network I/O all consume CPU and time | gRPC protobuf serialization overhead becomes measurable at millions of requests per second |
| 8 | The network is homogeneous | Different hardware, protocols, OS versions, and configurations coexist | IPv4/IPv6 mismatch causes connectivity failures between old and new services |
The Two Generals Problem
The Two Generals Problem, formalized by Akkoyunlu, Ekanadham, and Huber in 1975 and popularized under that name by Jim Gray in 1978, is the simplest impossibility result in distributed computing.
Two armies, commanded by separate generals, must agree on a time to attack a city. They can only communicate by sending messengers through enemy territory. Any messenger might be captured. The question: can the generals reach agreement?
The answer is no. No finite number of messages can guarantee agreement when any message might be lost. General 1 sends "attack at dawn." Even if General 2 receives it and sends back a confirmation, General 1 does not know whether the confirmation arrived. General 2 knows this, so General 2 cannot be sure General 1 will attack. The problem recurses infinitely.
This is not a contrived academic puzzle. It is the fundamental reason why distributed systems use timeouts, retries, and idempotency. TCP's three-way handshake does not solve the Two Generals Problem. It works well enough for practical purposes, but it does not provide absolute certainty of agreement.
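In practice, systems cope with this uncertainty exactly as the paragraph above describes: retry until acknowledged, and make the operation idempotent so that retries are harmless. A minimal sketch of the pattern (all class and function names here are hypothetical, for illustration only):

```python
import uuid

class Receiver:
    """Deduplicates requests by idempotency key, so retries apply at most once."""
    def __init__(self):
        self.seen = set()       # idempotency keys already processed
        self.applied = []       # effects actually applied

    def handle(self, key, payload):
        if key in self.seen:    # duplicate retry: acknowledge, change nothing
            return "ack"
        self.seen.add(key)
        self.applied.append(payload)
        return "ack"

def send_with_retries(receiver, payload, channel, max_attempts=5):
    """Retry over a lossy channel; the same key is reused on every attempt."""
    key = str(uuid.uuid4())     # one key per logical request
    for _ in range(max_attempts):
        reply = channel(receiver, key, payload)  # reply may be lost (None)
        if reply == "ack":
            return True
    return False  # still uncertain: the request MAY have been applied anyway
```

Note the final comment: even after exhausting retries, the sender cannot distinguish "never delivered" from "delivered but all acknowledgments lost". That residual uncertainty is the Two Generals Problem; the idempotency key only makes it safe to keep trying.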
Byzantine Fault Tolerance
The Two Generals Problem assumes honest participants with an unreliable channel. The Byzantine Generals Problem, described by Lamport, Shostak, and Pease in 1982, is harder: some participants may be actively malicious. They may send conflicting information to different nodes, or they may deliberately lie.
A system is Byzantine fault tolerant (BFT) if it can reach correct consensus even when some nodes are faulty or malicious. The classic result is that BFT requires at least 3f + 1 total nodes to tolerate f Byzantine failures. With 3 faulty nodes, you need at least 10 total nodes.
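The 3f + 1 arithmetic is worth making concrete; a pair of helper functions (hypothetical names, a sketch rather than any standard library API):

```python
def nodes_required(f):
    """Minimum total nodes needed to tolerate f Byzantine failures."""
    return 3 * f + 1

def faults_tolerated(n):
    """Maximum Byzantine failures an n-node cluster can survive."""
    return (n - 1) // 3
```

So a 4-node cluster tolerates one Byzantine node, and tolerating 3 requires 10, matching the figure in the text. Adding nodes one at a time does not help until you cross the next 3f + 1 boundary.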
Most internal distributed systems (databases, message queues, coordination services) do not implement BFT because they trust their own nodes. Blockchain systems do, because participants are anonymous and untrusted. Bitcoin's proof-of-work and Ethereum's proof-of-stake are both solutions to the Byzantine Generals Problem for open networks.
Logical Clocks
In a single computer, events have a clear order because they share one clock. In a distributed system, each node has its own clock. Physical clocks drift. Network Time Protocol (NTP) synchronization typically achieves only millisecond-level accuracy over the public internet. You cannot rely on wall-clock time to order events across nodes.
Leslie Lamport solved this in 1978 with logical clocks (Lamport timestamps). Each process maintains a counter. On every local event, increment the counter. When sending a message, attach the counter. When receiving a message, set your counter to max(local, received) + 1.
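The update rules above translate almost directly into code. A minimal sketch (the class name is assumed; here the send itself counts as a local event, a common formulation):

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1          # every local event ticks the counter
        return self.time

    def send(self):
        self.time += 1          # sending is itself a local event
        return self.time        # this timestamp travels with the message

    def receive(self, msg_time):
        # Jump past both our own history and the sender's
        self.time = max(self.time, msg_time) + 1
        return self.time
```

The `max(local, received) + 1` step is what guarantees that a receive is always timestamped after the send it corresponds to, regardless of how far the two physical clocks have drifted.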
Lamport clocks establish a happens-before relationship: if event A happened before event B, then A has a lower timestamp than B. The converse is not guaranteed: a lower timestamp does not prove that A happened before B; the two events might be concurrent (unrelated).
Vector Clocks
Vector clocks extend Lamport clocks to detect concurrency. Instead of a single counter, each process maintains a vector of counters, one per process. When process i has an event, it increments its own entry. When sending, it attaches the full vector. When receiving, it takes the element-wise maximum and increments its own entry.
If vector A is less than or equal to vector B in every position, and strictly less in at least one, then A happened before B. If neither vector dominates the other, the events are concurrent. This distinction is critical for conflict detection in systems like Amazon's Dynamo.
A worked example with two processes, A and B, where each vector is [A's count, B's count]:

- A sends a message to B; the send bumps A's entry, so the message carries VC = [1, 0].
- B receives it (event b1): the element-wise max of [0, 0] and [1, 0], plus an increment of B's own entry, gives VC = [1, 1].
- A has a local event a2: VC = [2, 0].
- B has a local event b2: VC = [1, 2].
- Comparing a2 = [2, 0] with b2 = [1, 2]: neither vector dominates the other, so the two events are concurrent.
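The vector-clock rules, and the comparison that detects concurrency, can be sketched as follows (names are illustrative, not from any particular library):

```python
class VectorClock:
    def __init__(self, n, i):
        self.vc = [0] * n       # one counter per process
        self.i = i              # this process's own index

    def local_event(self):
        self.vc[self.i] += 1
        return list(self.vc)

    def send(self):
        self.vc[self.i] += 1    # the send is a local event
        return list(self.vc)    # a copy travels with the message

    def receive(self, other):
        # Element-wise max with the incoming vector, then tick our own entry
        self.vc = [max(x, y) for x, y in zip(self.vc, other)]
        self.vc[self.i] += 1
        return list(self.vc)

def compare(a, b):
    """Return 'before', 'after', 'concurrent', or 'equal' for vectors a, b."""
    if a == b:
        return "equal"
    if all(x <= y for x, y in zip(a, b)):
        return "before"         # a happened before b
    if all(x >= y for x, y in zip(a, b)):
        return "after"          # b happened before a
    return "concurrent"         # neither dominates
```

Replaying the two-process example: A's send produces [1, 0], B's receive produces [1, 1], and the later local events a2 = [2, 0] and b2 = [1, 2] compare as concurrent, which is exactly the conflict signal a Dynamo-style store uses to flag divergent replicas.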
Connecting the Concepts
These problems are not independent. The fallacies explain why distributed computing is hard. The Two Generals Problem proves that perfect agreement over unreliable channels is impossible. Byzantine faults show that even agreement itself is insufficient when participants lie. And logical clocks provide the ordering mechanism that physical clocks cannot.
Every distributed system you build or operate sits on top of these constraints. Consensus algorithms (Session 5.4), replication protocols (Session 3.4), and consistency models (Session 5.2) are all engineering responses to these fundamental limitations.
Further Reading
- Wikipedia: Fallacies of Distributed Computing
- Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System."
- Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem."
- Wikipedia: Two Generals' Problem
- Wikipedia: Vector Clock
Assignment
List the Eight Fallacies of Distributed Computing. For each fallacy, give a specific, real example of a system or incident where assuming the fallacy caused a failure. Your examples should be concrete: name the system, the year if possible, and what went wrong.
Bonus: For fallacies 1, 2, and 3, describe the engineering pattern commonly used to mitigate each (e.g., retries with backoff, circuit breakers, compression, CDNs).