Fault Tolerance
Session 4.2 · ~5 min read
Beyond High Availability
Session 4.1 covered high availability: keeping the system running when components fail. Fault tolerance goes further. A fault-tolerant system continues operating correctly even during component failures, with no visible impact on the user. HA reduces downtime. FT eliminates it.
The difference is practical. An HA system might take 30 seconds to fail over to a standby database. During those 30 seconds, requests fail or queue. A fault-tolerant system handles the same failure transparently. The user never notices.
Fault tolerance is achieved through a set of patterns that isolate failures, limit their blast radius, and provide fallback behavior. The four most important patterns are circuit breakers, bulkheads, timeouts, and retries with exponential backoff, complemented by graceful degradation as the fallback when all of them are exhausted.
Circuit Breakers
The circuit breaker pattern is borrowed from electrical engineering. An electrical circuit breaker detects excess current and opens the circuit to prevent damage. A software circuit breaker monitors calls to a downstream service and stops making calls when the failure rate exceeds a threshold.
Without a circuit breaker, a failing downstream service causes cascading failures. Callers wait for responses that never come, consuming threads and connections. Those callers then become slow, causing their own callers to back up. Within seconds, a single failing service can bring down an entire distributed system.
A circuit breaker operates in three states.
Closed: Normal operation. Requests pass through to the downstream service. The breaker counts failures. When the failure count exceeds a threshold (for example, 5 failures in 10 seconds), the breaker trips to the open state.
Open: The breaker rejects all requests immediately without calling the downstream service. This protects both the caller (no wasted threads waiting for timeouts) and the downstream service (no additional load while it is struggling). After a configured timeout (for example, 30 seconds), the breaker transitions to half-open.
Half-open: The breaker allows a single test request through. If it succeeds, the breaker returns to closed. If it fails, the breaker returns to open and resets the timeout. This probing mechanism provides automatic recovery without manual intervention.
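The three-state machine above can be sketched in a few dozen lines. This is a minimal, single-threaded illustration, not a production implementation (real breakers such as those in Resilience4j or Polly add thread safety, metrics, and sliding-window counting); the class name, thresholds, and injectable clock are all illustrative choices:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold=5, window=10.0,
                 open_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # e.g. 5 failures...
        self.window = window                        # ...within 10 seconds
        self.open_timeout = open_timeout            # stay open 30s before probing
        self.clock = clock                          # injectable for testing
        self.state = "closed"
        self.failures = []                          # timestamps of recent failures
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at >= self.open_timeout:
                self.state = "half-open"            # allow one probe through
            else:
                raise RuntimeError("circuit open: request rejected")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure(now)
            raise
        self._on_success()
        return result

    def _on_failure(self, now):
        if self.state == "half-open":
            self.state = "open"                     # probe failed: re-open, reset timer
            self.opened_at = now
            return
        # Keep only failures inside the sliding window, then record this one.
        self.failures = [t for t in self.failures if now - t <= self.window]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
            self.failures.clear()

    def _on_success(self):
        if self.state == "half-open":
            self.state = "closed"                   # probe succeeded: resume normal operation
        self.failures.clear()
```

Note that the caller wraps every downstream call in `breaker.call(...)`; when the breaker is open, the call fails in microseconds instead of tying up a thread for the full timeout.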
Bulkheads
The bulkhead pattern is named after the watertight compartments in a ship's hull. If one compartment floods, the others remain dry. The ship stays afloat.
In software, a bulkhead isolates resources so that a failure in one part of the system does not consume resources needed by other parts. The most common implementation is dedicated thread pools or connection pools per downstream service.
Without bulkheads, all outbound calls share the same thread pool. If Service B becomes slow, calls to Service B consume all available threads. Now calls to Service C and Service D also fail, not because those services are down, but because there are no threads left to call them. The slow service has poisoned the entire system.
With bulkheads, Service B gets its own pool of 20 threads. Service C gets 20. Service D gets 20. When Service B becomes slow and exhausts its 20 threads, calls to C and D are unaffected. The failure is contained.
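One way to sketch this containment is a semaphore per dependency that caps in-flight calls and fails fast when the compartment is full. This is an illustrative simplification (the class and dependency names are invented for this example; real systems more often use dedicated thread pools or connection pools, as described above):

```python
import threading

class Bulkhead:
    """Caps concurrent in-flight calls per downstream dependency.

    When a dependency's slots are exhausted, new calls fail fast
    instead of queuing, so one slow service cannot starve the others.
    """

    def __init__(self, limits):
        # e.g. {"service-b": 20, "service-c": 20, "service-d": 20}
        self._slots = {name: threading.Semaphore(n) for name, n in limits.items()}

    def call(self, dependency, fn, *args, **kwargs):
        sem = self._slots[dependency]
        if not sem.acquire(blocking=False):  # compartment full: reject immediately
            raise RuntimeError(f"bulkhead full for {dependency}")
        try:
            return fn(*args, **kwargs)
        finally:
            sem.release()                    # free the slot even on failure
```

When Service B stalls and holds all 20 of its slots, calls routed to it fail fast, while Service C and Service D keep their own slots and continue unaffected.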
Timeouts
Every network call must have a timeout. Without one, a call to an unresponsive service will wait indefinitely, holding a thread, a connection, and memory the entire time. Timeouts are the simplest fault tolerance mechanism and the one most often neglected.
There are two types. A connection timeout limits how long the client waits to establish a TCP connection. A read timeout (sometimes called socket timeout) limits how long the client waits for a response after the connection is established. Both should be configured explicitly. Default values in most HTTP libraries are far too generous for production use, sometimes 30 seconds or even infinite.
Setting timeouts correctly requires knowing the expected response time of the downstream service. If the p99 latency of a service is 200ms, a read timeout of 1 second is reasonable. A timeout of 30 seconds means a thread is occupied 150 times longer than necessary when the service is failing.
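As a generic sketch, any blocking call can be bounded with a future and a deadline. This is illustrative only: production code should prefer the transport's native connect and read timeouts where the client library exposes them, because the wrapper below abandons the worker thread rather than interrupting the underlying I/O:

```python
import concurrent.futures
import time

# Shared worker pool; a real service would size this per dependency
# (see the bulkhead pattern above).
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn, but give up after `timeout` seconds.

    Caveat: the worker thread keeps running after the timeout fires;
    this only frees the caller, not the underlying connection.
    """
    future = _executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {timeout}s")
```

With a p99 of 200ms, calling `call_with_timeout(fetch_profile, 1.0)` bounds the caller's wait at one second instead of the library default.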
Retries with Exponential Backoff
Transient failures are common in distributed systems. A network blip, a brief garbage collection pause, or a momentary overload can cause a request to fail even when the downstream service is healthy. Retries handle transient failures by simply trying again.
Naive retries are dangerous. If a service is failing under load and all clients immediately retry, the retry traffic adds to the load, making the failure worse. This is called a retry storm.
Exponential backoff solves this by increasing the delay between retries. The first retry waits 100ms. The second waits 200ms. The third waits 400ms. Each subsequent retry doubles the wait time. Adding random jitter (a small random offset to the delay) prevents multiple clients from retrying at exactly the same moment.
Always cap the number of retries and the maximum delay. Three retries with a cap of 5 seconds is a common starting point. After the final retry fails, the client should return an error or invoke a fallback, not retry forever.
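Putting the backoff schedule, jitter, and caps together yields a small helper like the following. The function name and defaults are illustrative; the delays match the 100ms/200ms/400ms schedule described above:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=3, base_delay=0.1, max_delay=5.0):
    """Retry fn on failure, doubling the delay each time and adding jitter.

    Only safe for idempotent operations: a retried non-idempotent request
    (e.g. "charge the card") may execute twice.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                  # retries exhausted: propagate
            delay = min(base_delay * (2 ** attempt), max_delay)  # 0.1, 0.2, 0.4, ...
            delay += random.uniform(0, delay)          # jitter to de-synchronize clients
            time.sleep(delay)
```

The jitter term is what prevents a retry storm: a thousand clients that all failed at the same instant spread their retries across the backoff window instead of hammering the service in lockstep.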
| Pattern | What It Protects Against | How It Works | When to Use |
|---|---|---|---|
| Circuit Breaker | Cascading failures from a down service | Stops calling a failing service after threshold | Any call to an external or downstream service |
| Bulkhead | Resource exhaustion from one slow dependency | Isolates resource pools per dependency | Services with multiple downstream dependencies |
| Timeout | Indefinite waits on unresponsive services | Caps the time a call can take | Every network call, no exceptions |
| Retry + Backoff | Transient failures (network blips, brief overload) | Retries with increasing delay and jitter | Idempotent operations only |
| Graceful Degradation | Partial system failure affecting user experience | Returns reduced functionality instead of errors | Non-critical features with fallback options |
Graceful Degradation
When a dependency fails and retries are exhausted, the system has a choice: return an error or return something less complete but still useful. Graceful degradation chooses the latter.
A product page that cannot reach the recommendation service can still show the product details, price, and reviews. It just omits the "customers also bought" section. A search page that cannot reach the personalization service can fall back to unpersonalized results. A dashboard that cannot reach the analytics backend can show cached data with a "last updated 5 minutes ago" notice.
Graceful degradation requires knowing which parts of a response are essential and which are optional. This distinction must be designed into the system, not improvised during an outage.
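The essential/optional split can be made explicit in code. A hedged sketch of the product-page example, with invented function and field names (the point is the structure: optional sections are wrapped in a fallback, essential ones are not):

```python
def fetch_recommendations(product_id):
    # Hypothetical downstream call; assume the service is down.
    raise ConnectionError("recommendation service unavailable")

def build_product_page(product_id, get_recommendations=fetch_recommendations):
    page = {
        "product_id": product_id,
        "details": "Product details here",  # essential: no fallback, let failure propagate
        "price": 19.99,                     # essential
    }
    try:
        page["also_bought"] = get_recommendations(product_id)
    except Exception:
        page["also_bought"] = []            # optional: degrade by omitting the section
    return page
```

The page renders with or without recommendations; only the optional "customers also bought" section disappears during the outage.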
Systems Thinking Lens
Fault tolerance patterns are balancing feedback loops. The circuit breaker detects rising failures (a signal) and reduces load on the failing service (a response), giving it time to recover. Exponential backoff spreads retry load over time instead of concentrating it. Bulkheads prevent a failure in one subsystem from propagating to others.
Without these loops, distributed systems exhibit a dangerous reinforcing pattern: failure increases load, increased load causes more failures, more failures increase load further. Circuit breakers, bulkheads, and backoff interrupt this spiral. They are the engineering equivalent of the balancing loops discussed in Session 0.5.
Further Reading
- Microsoft: Circuit Breaker Pattern. Detailed description of the pattern with implementation guidance for cloud applications.
- Microsoft: Bulkhead Pattern. How to isolate resources to prevent cascading failures.
- Amazon Builders' Library: Timeouts, Retries, and Backoff with Jitter. AWS's approach to implementing retries safely in distributed systems.
- Martin Fowler: CircuitBreaker. The original description of the circuit breaker pattern for software systems.
Assignment
Consider this scenario: Service A calls Service B to fetch user profile data. Service B is experiencing an outage.
- Without fault tolerance: Describe exactly what happens. How many threads does Service A consume? How long does the user wait? What does the user see?
- Add a timeout of 500ms. What changes? What is the user experience now?
- Add retries with exponential backoff (100ms, 200ms, 400ms, max 3 attempts). What is the total maximum wait time? Is this acceptable for a user-facing request?
- Add a circuit breaker with a threshold of 5 failures in 10 seconds. After the circuit opens, what does Service A return to the user? Design a graceful degradation response for when the profile service is unavailable.
- Draw a sequence diagram showing the full flow with all three patterns active.