Course → Module 4: Reliability, Security & System Resilience

When High Availability Is Not Enough

High availability (Session 4.1) and fault tolerance (Session 4.2) protect against component failures and transient issues. Disaster recovery addresses a different scale of problem: what happens when an entire region goes offline? When a data center is destroyed by fire, flooding, or extended power failure? When a ransomware attack encrypts your production database?

Disaster recovery (DR) is the plan, the infrastructure, and the process for restoring operations after a catastrophic event. It is not the same as high availability. HA keeps the system running during routine failures. DR brings the system back after extraordinary ones.

RPO and RTO: The Two Numbers That Define Everything

Every DR strategy is governed by two metrics.

Recovery Point Objective (RPO) answers the question: how much data can we afford to lose? If your RPO is 1 hour, you must have a copy of your data that is no more than 1 hour old at any point. If disaster strikes, you lose at most 1 hour of transactions. An RPO of zero means no data loss is acceptable, which requires synchronous replication.

Recovery Time Objective (RTO) answers the question: how long can we be down? If your RTO is 4 hours, the system must be fully operational within 4 hours of the disaster being declared. An RTO of zero means instant failover with no user-visible interruption.

RPO is about data. RTO is about time. Together, they define the contract between your DR capability and your business requirements. Every dollar spent on DR is buying a lower RPO, a lower RTO, or both.
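Both objectives can be checked mechanically after a failover drill. A minimal sketch, with illustrative timestamps and thresholds (none of these values come from the lesson):

```python
from datetime import datetime, timedelta

def meets_objectives(last_replica_write: datetime,
                     disaster_time: datetime,
                     service_restored: datetime,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Check a drill result against RPO (data age) and RTO (downtime)."""
    data_lost = disaster_time - last_replica_write   # transactions not yet replicated
    downtime = service_restored - disaster_time      # time until service resumed
    return {
        "rpo_met": data_lost <= rpo,
        "rto_met": downtime <= rto,
        "data_lost": data_lost,
        "downtime": downtime,
    }

# Hypothetical drill: the replica was 40 seconds behind and recovery took 3 minutes,
# measured against an RPO of 1 minute and an RTO of 5 minutes.
disaster = datetime(2024, 1, 1, 12, 0, 0)
result = meets_objectives(
    last_replica_write=disaster - timedelta(seconds=40),
    disaster_time=disaster,
    service_restored=disaster + timedelta(minutes=3),
    rpo=timedelta(minutes=1),
    rto=timedelta(minutes=5),
)
```

Recording `data_lost` and `downtime` alongside the pass/fail flags gives the drill report the actual margins, not just a binary result.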

The business, not engineering, should set RPO and RTO. Different systems have different tolerances. A marketing website might accept an RTO of 24 hours. A payment processing system might need an RTO under 5 minutes. The DR strategy must match the business requirement, not exceed it. Over-engineering DR is as wasteful as under-engineering it.

The Four DR Strategies

AWS and the broader industry recognize four DR strategies, ordered from cheapest (and slowest to recover) to most expensive (and fastest to recover).

1. Backup and Restore

The simplest strategy. Data is backed up regularly to a separate location (another region, another cloud, or offline storage). When disaster strikes, infrastructure is provisioned from scratch and data is restored from the most recent backup. This is the cheapest option but has the highest RTO (hours to days) and the highest RPO (hours, depending on backup frequency).
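With backup-and-restore, the worst-case RPO is bounded by the backup schedule: a disaster can strike the instant before the next backup completes. A small arithmetic sketch (the backup interval and duration below are hypothetical):

```python
def worst_case_rpo_hours(backup_interval_hours: float,
                         backup_duration_hours: float = 0.0) -> float:
    """Worst case: disaster hits just before a backup finishes, so you lose
    the full interval plus any data written while the last backup was running."""
    return backup_interval_hours + backup_duration_hours

# Nightly backups that take 30 minutes to complete:
print(worst_case_rpo_hours(24, 0.5))  # up to 24.5 hours of data loss
```

This is why tightening RPO under this strategy means backing up more often, which in turn raises storage and transfer costs.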

2. Pilot Light

Core infrastructure components, the absolute minimum needed to run the system, are kept running in the DR region at all times. Typically, this means database replicas and perhaps a minimal application instance. When disaster strikes, the rest of the infrastructure is provisioned and scaled up. RTO is measured in tens of minutes. RPO depends on replication lag, typically minutes.

3. Warm Standby

A scaled-down but fully functional copy of the production environment runs in the DR region at all times. It can handle a fraction of production traffic immediately. When disaster strikes, the standby environment is scaled up to full production capacity. RTO is measured in minutes. RPO is typically seconds, because the standby is continuously replicated.

4. Multi-Site Active-Active

Full production environments run in two or more regions simultaneously, all handling live traffic. There is no failover per se, because both sites are already active. If one region fails, the other absorbs the traffic. RTO is near zero. RPO depends on the replication model but can approach zero with synchronous replication.

```mermaid
graph TB
  subgraph Primary["Primary Region (us-east-1)"]
    PLB["Load Balancer"]
    PAPP1["App Tier<br/>(full capacity)"]
    PDB["Database<br/>(primary)"]
    PLB --> PAPP1
    PAPP1 --> PDB
  end
  subgraph DR["DR Region (us-west-2), Warm Standby"]
    DLB["Load Balancer"]
    DAPP1["App Tier<br/>(reduced capacity)"]
    DDB["Database<br/>(read replica)"]
    DLB --> DAPP1
    DAPP1 --> DDB
  end
  DNS["Route 53<br/>Failover Routing"]
  DNS -- "primary" --> PLB
  DNS -. "failover" .-> DLB
  PDB -- "async replication" --> DDB
```

In normal operation, Route 53 routes all traffic to the primary region. The warm standby runs at reduced capacity, handling only health check traffic and replication. When Route 53 detects the primary region is unhealthy, it routes traffic to the DR region. The standby environment scales up to handle production load. Database failover promotes the replica to primary.
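The routing decision Route 53 makes can be sketched as a simple health-check state machine. This is an illustrative simulation, not the AWS implementation; the endpoint names and the three-failure threshold are assumptions:

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failed checks before failing over (assumed value)

@dataclass
class FailoverRouter:
    primary: str          # e.g. load balancer DNS name in the primary region
    secondary: str        # warm standby in the DR region
    consecutive_failures: int = 0

    def record_health_check(self, primary_healthy: bool) -> str:
        """Return the endpoint traffic should be routed to after this check."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            return self.secondary  # fail over to the DR region
        return self.primary

router = FailoverRouter(primary="lb.us-east-1.example.com",
                        secondary="lb.us-west-2.example.com")
# One healthy check, then three consecutive failures: traffic shifts to DR.
for healthy in [True, False, False, False]:
    target = router.record_health_check(healthy)
print(target)
```

Requiring several consecutive failures before failing over is what keeps a single transient timeout from triggering a full regional failover.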

Cost vs. Recovery Trade-offs

The relationship between cost and recovery capability is not linear. Improving RTO from 24 hours to 4 hours is relatively cheap. Improving from 5 minutes to 30 seconds is extremely expensive.

| Strategy | RPO | RTO | Relative Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Backup & Restore | Hours | Hours to days | $ (lowest) | Low | Non-critical systems, dev/staging |
| Pilot Light | Minutes | Tens of minutes | $$ | Medium | Business applications with moderate RTO |
| Warm Standby | Seconds | Minutes | $$$ | Medium-High | Critical applications needing fast recovery |
| Multi-Site Active-Active | Near zero | Near zero | $$$$ (highest) | High | Mission-critical, zero-downtime requirements |
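The table can be turned into a crude strategy filter. The numeric bounds below are rough readings of the table's qualitative entries ("minutes", "seconds"), not vendor guarantees:

```python
# (strategy, worst-case RPO in seconds, worst-case RTO in seconds) -- rough
# upper bounds inferred from the comparison table, purely illustrative.
STRATEGIES = [
    ("Backup & Restore",         6 * 3600, 48 * 3600),
    ("Pilot Light",              15 * 60,  45 * 60),
    ("Warm Standby",             30,       5 * 60),
    ("Multi-Site Active-Active", 1,        10),
]

def viable_strategies(rpo_seconds: float, rto_seconds: float) -> list[str]:
    """Return strategies whose worst-case RPO/RTO fit the requirement."""
    return [name for name, rpo, rto in STRATEGIES
            if rpo <= rpo_seconds and rto <= rto_seconds]

# A business application tolerating 30 minutes of data loss and 1 hour down:
print(viable_strategies(30 * 60, 3600))
```

The filter returns every strategy that is *sufficient*; per the point above about over-engineering, the right choice is the cheapest one on that list.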

DR Testing: The Part Everyone Skips

A disaster recovery plan that has never been tested is a hypothesis, not a plan. DR testing validates that your failover procedures work, that your team knows how to execute them, and that recovery happens within your target RPO and RTO.

There are levels of testing. A tabletop exercise walks through the DR plan on paper without touching production. A simulation fails over to the DR environment using synthetic traffic. A full failover test routes real production traffic to the DR environment and back. Netflix famously runs Chaos Monkey, which randomly terminates instances in production, and related tools such as Chaos Kong, which simulates the loss of an entire region, to verify resilience continuously.

At minimum, test your DR plan quarterly. After every significant infrastructure change, test again.

Systems Thinking Lens

DR strategy selection is a classic trade-off problem with a reinforcing loop. Cheaper DR strategies reduce cost but increase risk. When a disaster finally occurs and recovery takes too long, the organization invests heavily in DR. Over time, if no disasters occur, the organization reduces investment (the "it hasn't happened so it won't happen" fallacy). This cycle repeats.

The leverage point is making DR cost proportional to the value of the workload. A system that generates $100,000 per hour in revenue justifies a $50,000/month DR investment. One that generates $500/day does not. Tie the DR budget to the cost of downtime, and the trade-off becomes a rational calculation instead of a guess.
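That leverage point reduces to simple arithmetic. A sketch using the figures above; the assumed outage frequency is illustrative:

```python
def dr_budget_is_justified(revenue_per_hour: float,
                           expected_outage_hours_per_year: float,
                           monthly_dr_cost: float) -> bool:
    """Compare annual DR spend against expected annual downtime loss."""
    expected_annual_loss = revenue_per_hour * expected_outage_hours_per_year
    annual_dr_cost = monthly_dr_cost * 12
    return annual_dr_cost < expected_annual_loss

# System earning $100,000/hour, assuming one 8-hour regional outage per year:
# spend $600k/year to avoid an expected $800k loss.
print(dr_budget_is_justified(100_000, 8, 50_000))

# System earning $500/day (about $21/hour): the same spend makes no sense.
print(dr_budget_is_justified(500 / 24, 8, 50_000))
```

A fuller model would weight the outage by its probability and include reputational and contractual costs, but even this rough version turns the DR budget debate into a calculation.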

Assignment

You are designing the DR strategy for a payment processing system. The business requirements are:

  • RPO must be less than 1 minute (no more than 1 minute of transaction data can be lost).
  • RTO must be less than 5 minutes (the system must be processing payments again within 5 minutes).
  1. Which of the four DR strategies meets these requirements? Eliminate the ones that do not and explain why.
  2. For your chosen strategy, describe the specific infrastructure required in the DR region during normal operation.
  3. Estimate the cost difference between your chosen strategy and a simple backup-and-restore approach. Consider compute, storage, data transfer, and database replication costs. Use rough estimates; the important thing is the order of magnitude.
  4. Write a 5-step failover runbook. What happens in the first 30 seconds? The first minute? The first 5 minutes?