Course → Module 4: Reliability, Security & System Resilience

What High Availability Actually Means

A system is highly available when it continues serving requests even after some of its components fail. That sounds obvious. But achieving it requires deliberate architectural choices at every layer: compute, storage, networking, and application logic.

The core principle is straightforward. Any component that, when it fails, brings down the entire system is a single point of failure (SPOF). High availability design is the systematic elimination of SPOFs through redundancy, failover automation, and geographic distribution.

Availability is not a feature. It is an architectural property. You cannot bolt it on after the fact. It must be designed in from the start, at every layer of the stack.

Measuring Availability: The Nines

Availability is expressed as a percentage of uptime over a given period. The industry uses "nines" as shorthand. The difference between 99% and 99.999% sounds small. It is not. The gap between them is the difference between 87.6 hours of downtime per year and 5.26 minutes.

Each additional nine roughly divides the allowed downtime by ten. But the cost and complexity of achieving each nine increases exponentially. Going from 99% to 99.9% might require a load balancer and a second server. Going from 99.99% to 99.999% might require multi-region deployment, automated failover, and extensive chaos testing.

| Availability | Downtime/Year | Downtime/Month | Typical Strategy | Relative Cost |
|---|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Single server with backups | $ |
| 99.9% | 8.76 hours | 43.8 minutes | Load balancer + multiple instances | $$ |
| 99.99% | 52.6 minutes | 4.38 minutes | Multi-AZ with automated failover | $$$ |
| 99.999% | 5.26 minutes | 26.3 seconds | Multi-region active-active | $$$$ |
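The downtime figures in the table follow directly from the availability percentage. A few lines of Python (a small illustrative helper, not part of any course codebase) make the arithmetic explicit:

```python
def downtime_per_year(availability_pct: float) -> float:
    """Return the allowed downtime in minutes per year for a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_per_year(nines):.2f} min/yr")
```

Running this reproduces the table: 99% allows about 5,256 minutes (3.65 days) per year, while 99.999% allows only about 5.26 minutes.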

Identifying Single Points of Failure

Before you can eliminate SPOFs, you need to find them. Walk through every component in your architecture and ask: if this one thing fails, does the whole system go down?

Common SPOFs include: a single database instance with no replica, a single load balancer, a DNS provider with no secondary, a single network path between services, a single deployment region, and even a single person who holds the credentials to production.

The fix for every SPOF follows the same pattern: redundancy. Run two of everything that matters, with automatic failover between them. But redundancy introduces its own complexity. Two database replicas need synchronization. Two load balancers need a virtual IP or DNS-based failover. Every layer of redundancy must be tested regularly to confirm it actually works when needed.
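The redundancy-with-failover pattern can be sketched as a toy client that tries a primary endpoint and falls back to a standby. The endpoint functions here are hypothetical stand-ins for real service calls:

```python
class FailoverClient:
    """Minimal active-passive failover sketch: try the primary, then the standby."""

    def __init__(self, primary, standby):
        self.endpoints = [primary, standby]  # ordered: primary first

    def call(self, request):
        last_error = None
        for endpoint in self.endpoints:
            try:
                return endpoint(request)
            except ConnectionError as exc:
                last_error = exc  # this node is down; try the next one
        raise last_error  # every redundant node failed

# Simulated endpoints: the primary is down, the standby answers.
def primary(request):
    raise ConnectionError("primary unreachable")

def standby(request):
    return f"handled: {request}"

client = FailoverClient(primary, standby)
print(client.call("GET /health"))  # served by the standby
```

Note what this toy omits: in production, failure detection is done by health checks rather than per-request exceptions, and the failover path must be exercised regularly, exactly the testing requirement described above.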

Active-Active vs. Active-Passive

There are two fundamental approaches to redundancy.

Active-passive means one component handles all traffic while the other sits idle, waiting to take over. The passive node is a hot standby. When the active node fails, the passive node is promoted. This approach is simpler to implement but wastes resources during normal operation. Failover also takes time, even if it is automated, because the passive node must detect the failure, assume the active role, and begin accepting traffic.

Active-active means all nodes handle traffic simultaneously. A load balancer distributes requests across them. If one node fails, the others absorb its traffic automatically. There is no failover delay because the remaining nodes are already running. This approach uses resources more efficiently and provides faster recovery, but it is harder to implement, especially for stateful workloads like databases.
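The active-active behavior, every node serving traffic and survivors absorbing a failed node's share, can be sketched with a round-robin pool. Node names and the ejection-on-error policy are illustrative assumptions, not a production load-balancing algorithm:

```python
class ActiveActivePool:
    """Active-active sketch: all nodes serve traffic in round-robin order;
    a node that errors is dropped and the survivors absorb its share."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.turn = 0

    def dispatch(self, request):
        while self.nodes:
            node = self.nodes[self.turn % len(self.nodes)]
            try:
                result = node(request)
                self.turn += 1  # advance the round-robin only on success
                return result
            except ConnectionError:
                self.nodes.remove(node)  # no failover delay: the rest are already live
        raise RuntimeError("no healthy nodes remain")

def make_node(name):
    return lambda request: f"{name} served {request}"

def dead_node(request):
    raise ConnectionError("node down")

pool = ActiveActivePool([make_node("az-a"), dead_node, make_node("az-b")])
print(pool.dispatch("req-1"))  # az-a served req-1
print(pool.dispatch("req-2"))  # dead node ejected mid-request; az-b answers
```

The key contrast with active-passive is visible in `dispatch`: there is no promotion step. The surviving nodes were already serving traffic, so recovery is immediate.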

Multi-AZ Architecture

Cloud providers divide their infrastructure into regions and availability zones (AZs). Each AZ is a physically separate data center with independent power, cooling, and networking. AZs within the same region are connected by high-bandwidth, low-latency links (typically under 2 milliseconds round-trip).

Deploying across multiple AZs protects against facility-level failures: power outages, cooling failures, network cuts, and natural disasters that affect a single data center. Most production workloads that target 99.99% availability or higher use multi-AZ deployment as their foundation.

```mermaid
graph TB
    subgraph Region["AWS Region (us-east-1)"]
        R53["Route 53<br/>DNS + Health Checks"]
        ALB["Application Load Balancer"]
        subgraph AZ1["Availability Zone A"]
            EC2A1["App Server 1"]
            EC2A2["App Server 2"]
            RDSA["RDS Primary"]
        end
        subgraph AZ2["Availability Zone B"]
            EC2B1["App Server 3"]
            EC2B2["App Server 4"]
            RDSB["RDS Standby<br/>(sync replica)"]
        end
        R53 --> ALB
        ALB --> EC2A1
        ALB --> EC2A2
        ALB --> EC2B1
        ALB --> EC2B2
        EC2A1 --> RDSA
        EC2A2 --> RDSA
        EC2B1 --> RDSA
        EC2B2 --> RDSA
        RDSA -- "synchronous<br/>replication" --> RDSB
    end
```

In this architecture, the load balancer distributes traffic across instances in both AZs. The database uses synchronous replication so the standby is always current. If AZ-A fails entirely, the load balancer routes all traffic to AZ-B, and the RDS standby is promoted to primary. The application continues serving requests with minimal interruption.
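The property that makes this failover safe is synchronous replication: a write is not acknowledged to the client until the standby also holds it. A toy model (a deliberate simplification of what a managed database like RDS does internally) shows why the standby can be promoted without data loss:

```python
class SyncReplicatedDB:
    """Synchronous replication sketch: a write commits only after the standby
    has the data, so the standby is always current at failover time."""

    def __init__(self):
        self.primary = {}
        self.standby = {}

    def write(self, key, value):
        self.primary[key] = value
        self.standby[key] = value  # replicate BEFORE acknowledging the client
        return "committed"

    def promote_standby(self):
        # AZ-A is lost: the standby already holds every committed write.
        self.primary, self.standby = self.standby, {}

db = SyncReplicatedDB()
db.write("orders:1", "paid")
db.promote_standby()  # simulated AZ failure: no committed data is lost
print(db.primary["orders:1"])  # paid
```

The trade-off is latency: every write pays the cross-AZ round trip before it can commit, which is why synchronous replication is practical within a region (sub-2 ms links) but rarely across regions.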

Multi-Region for Maximum Resilience

Multi-AZ protects against facility failures. Multi-region protects against catastrophic regional events. A multi-region active-active architecture runs full copies of the application in two or more regions, with traffic routed by DNS (such as Route 53 latency-based or geolocation routing).

Multi-region introduces significant complexity. Data must be replicated across regions, which means choosing between synchronous replication (strong consistency but higher latency) and asynchronous replication (lower latency but risk of data loss during failover). Cross-region data transfer also incurs cost.

Most organizations do not need multi-region for their entire stack. The common approach is multi-AZ for compute and database, with multi-region reserved for the most critical services, such as the authentication layer or payment processing.

Systems Thinking Lens

High availability is a balancing feedback loop. As you add redundancy, availability increases, but so does complexity, cost, and the surface area for configuration errors. There is a point of diminishing returns where each additional nine costs disproportionately more than the last.

The system boundary matters too. Your application might be five-nines available, but if your DNS provider gives you only three nines, that is your actual availability. Availability of a chain of dependencies is the product of their individual availabilities. Two services in series, each at 99.9%, give you 99.8% end-to-end. The weakest link defines the system.
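The series-dependency arithmetic is worth internalizing, and it is the same calculation the assignment below asks for. A quick check in Python:

```python
import math

def chain_availability(*components):
    """End-to-end availability of services in series is the product
    of their individual availabilities."""
    return math.prod(components)

# Two services at 99.9% each, in series:
print(f"{chain_availability(0.999, 0.999):.4%}")  # 99.8001%

# A five-nines app behind a three-nines DNS provider:
print(f"{chain_availability(0.99999, 0.999):.4%}")
```

The second line makes the weakest-link point concrete: the composite sits just below 99.9%, so the five-nines application effort is invisible to users behind a three-nines dependency.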

Assignment

Take any application you have built or currently work on. Draw its architecture, including every component: load balancers, application servers, databases, caches, message queues, DNS, and third-party APIs.

  1. Circle every SPOF. For each one, ask: if this fails, what happens to the user?
  2. Design redundancy for each SPOF you identified. Write down whether you would use active-active or active-passive for each, and why.
  3. Calculate the theoretical availability of your current architecture (multiply the availability of each component in the critical path). Then calculate what it would be after your proposed changes.
  4. What is the most expensive SPOF to fix? Is it worth fixing given the business requirements?