
Module 4: Reliability, Security & System Resilience

Systems Thinking × System Design · 9 sessions


What High Availability Actually Means

A system is highly available when it continues serving requests even after some of its components fail. That sounds obvious. But achieving it requires deliberate architectural choices at every layer: compute, storage, networking, and application logic.

The core principle is straightforward. Any component that, when it fails, brings down the entire system is a single point of failure (SPOF). High availability design is the systematic elimination of SPOFs through redundancy, failover automation, and geographic distribution.

Availability is not a feature. It is an architectural property. You cannot bolt it on after the fact. It must be designed in from the start, at every layer of the stack.

Measuring Availability: The Nines

Availability is expressed as a percentage of uptime over a given period. The industry uses "nines" as shorthand. The difference between 99% and 99.999% sounds small. It is not. The gap between them is the difference between 87.6 hours of downtime per year and 5.26 minutes.

Each additional nine roughly divides the allowed downtime by ten. But the cost and complexity of achieving each nine increases exponentially. Going from 99% to 99.9% might require a load balancer and a second server. Going from 99.99% to 99.999% might require multi-region deployment, automated failover, and extensive chaos testing.

| Availability | Downtime/Year | Downtime/Month | Typical Strategy | Relative Cost |
|---|---|---|---|---|
| 99% | 3.65 days | 7.3 hours | Single server with backups | $ |
| 99.9% | 8.76 hours | 43.8 minutes | Load balancer + multiple instances | $$ |
| 99.99% | 52.6 minutes | 4.38 minutes | Multi-AZ with automated failover | $$$ |
| 99.999% | 5.26 minutes | 26.3 seconds | Multi-region active-active | $$$$ |
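The downtime budgets in the table fall out of simple arithmetic. A quick sketch in Python (illustrative only; figures match the table to rounding):

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_per_year_hours(availability_pct: float) -> float:
    """Allowed downtime per year, in hours, for a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}%: {downtime_per_year_hours(nines) * 60:.2f} minutes/year")
```

Each added nine divides the budget by ten: 5256 minutes at 99%, down to about 5.26 minutes at 99.999%.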

Identifying Single Points of Failure

Before you can eliminate SPOFs, you need to find them. Walk through every component in your architecture and ask: if this one thing fails, does the whole system go down?

Common SPOFs include: a single database instance with no replica, a single load balancer, a DNS provider with no secondary, a single network path between services, a single deployment region, and even a single person who holds the credentials to production.

The fix for every SPOF follows the same pattern: redundancy. Run two of everything that matters, with automatic failover between them. But redundancy introduces its own complexity. Two database replicas need synchronization. Two load balancers need a virtual IP or DNS-based failover. Every layer of redundancy must be tested regularly to confirm it actually works when needed.

Active-Active vs. Active-Passive

There are two fundamental approaches to redundancy.

Active-passive means one component handles all traffic while the other sits idle, waiting to take over. The passive node is typically a hot standby: running and kept current, but serving no traffic. When the active node fails, the passive node is promoted. This approach is simpler to implement but wastes resources during normal operation. Failover also takes time, even if it is automated, because the passive node must detect the failure, assume the active role, and begin accepting traffic.

Active-active means all nodes handle traffic simultaneously. A load balancer distributes requests across them. If one node fails, the others absorb its traffic automatically. There is no failover delay because the remaining nodes are already running. This approach uses resources more efficiently and provides faster recovery, but it is harder to implement, especially for stateful workloads like databases.

Multi-AZ Architecture

Cloud providers divide their infrastructure into regions and availability zones (AZs). Each AZ is a physically separate data center with independent power, cooling, and networking. AZs within the same region are connected by high-bandwidth, low-latency links (typically under 2 milliseconds round-trip).

Deploying across multiple AZs protects against facility-level failures: power outages, cooling failures, network cuts, and natural disasters that affect a single data center. Most production workloads that target 99.99% availability or higher use multi-AZ deployment as their foundation.

```mermaid
graph TB
    subgraph Region["AWS Region (us-east-1)"]
        R53["Route 53<br/>DNS + Health Checks"]
        ALB["Application Load Balancer"]
        subgraph AZ1["Availability Zone A"]
            EC2A1["App Server 1"]
            EC2A2["App Server 2"]
            RDSA["RDS Primary"]
        end
        subgraph AZ2["Availability Zone B"]
            EC2B1["App Server 3"]
            EC2B2["App Server 4"]
            RDSB["RDS Standby<br/>(sync replica)"]
        end
        R53 --> ALB
        ALB --> EC2A1
        ALB --> EC2A2
        ALB --> EC2B1
        ALB --> EC2B2
        EC2A1 --> RDSA
        EC2A2 --> RDSA
        EC2B1 --> RDSA
        EC2B2 --> RDSA
        RDSA -- "synchronous<br/>replication" --> RDSB
    end
```

In this architecture, the load balancer distributes traffic across instances in both AZs. The database uses synchronous replication so the standby is always current. If AZ-A fails entirely, the load balancer routes all traffic to AZ-B, and the RDS standby is promoted to primary. The application continues serving requests with minimal interruption.

Multi-Region for Maximum Resilience

Multi-AZ protects against facility failures. Multi-region protects against catastrophic regional events. A multi-region active-active architecture runs full copies of the application in two or more regions, with traffic routed by DNS (such as Route 53 latency-based or geolocation routing).

Multi-region introduces significant complexity. Data must be replicated across regions, which means choosing between synchronous replication (strong consistency but higher latency) and asynchronous replication (lower latency but risk of data loss during failover). Cross-region data transfer also incurs cost.

Most organizations do not need multi-region for their entire stack. The common approach is multi-AZ for compute and database, with multi-region reserved for the most critical services, such as the authentication layer or payment processing.

Systems Thinking Lens

High availability is a balancing feedback loop. As you add redundancy, availability increases, but so does complexity, cost, and the surface area for configuration errors. There is a point of diminishing returns where each additional nine costs disproportionately more than the last.

The system boundary matters too. Your application might be five-nines available, but if your DNS provider gives you only three nines, that is your actual availability. Availability of a chain of dependencies is the product of their individual availabilities. Two services in series, each at 99.9%, give you 99.8% end-to-end. The weakest link defines the system.
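The series rule is worth internalizing. A two-line sketch, assuming component failures are independent (a simplifying assumption):

```python
from functools import reduce

def series_availability(*components: float) -> float:
    """End-to-end availability of dependencies called in series,
    assuming independent failures."""
    return reduce(lambda a, b: a * b, components, 1.0)

# Two services at 99.9% each:
print(f"{series_availability(0.999, 0.999):.4%}")   # 99.8001%
```

Adding a third 99.9% dependency drops the chain to roughly 99.7%; every dependency in the critical path erodes the total.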

Further Reading

  • AWS Well-Architected: Deploy the Workload to Multiple Locations. Official guidance on multi-AZ and multi-region deployment strategies.
  • AWS Whitepaper: Advanced Multi-AZ Resilience Patterns. Deep dive into AZ-level fault isolation and recovery.
  • AWS: High Availability Architectures. Patterns and building blocks for HA on AWS.
  • Google SRE Book: Availability Table. The canonical reference for availability nines and their meaning in practice.

Assignment

Take any application you have built or currently work on. Draw its architecture, including every component: load balancers, application servers, databases, caches, message queues, DNS, and third-party APIs.

  1. Circle every SPOF. For each one, ask: if this fails, what happens to the user?
  2. Design redundancy for each SPOF you identified. Write down whether you would use active-active or active-passive for each, and why.
  3. Calculate the theoretical availability of your current architecture (multiply the availability of each component in the critical path). Then calculate what it would be after your proposed changes.
  4. What is the most expensive SPOF to fix? Is it worth fixing given the business requirements?

Beyond High Availability

Session 4.1 covered high availability: keeping the system running when components fail. Fault tolerance goes further. A fault-tolerant system continues operating correctly even during component failures, with no visible impact on the user. HA reduces downtime. FT eliminates it.

The difference is practical. An HA system might take 30 seconds to fail over to a standby database. During those 30 seconds, requests fail or queue. A fault-tolerant system handles the same failure transparently. The user never notices.

Fault tolerance is achieved through a set of patterns that isolate failures, limit their blast radius, and provide fallback behavior. The four most important patterns are circuit breakers, bulkheads, timeouts, and retries with exponential backoff.

Circuit Breakers

The circuit breaker pattern is borrowed from electrical engineering. An electrical circuit breaker detects excess current and opens the circuit to prevent damage. A software circuit breaker monitors calls to a downstream service and stops making calls when the failure rate exceeds a threshold.

Without a circuit breaker, a failing downstream service causes cascading failures. Callers wait for responses that never come, consuming threads and connections. Those callers then become slow, causing their own callers to back up. Within seconds, a single failing service can bring down an entire distributed system.

A circuit breaker operates in three states.

```mermaid
stateDiagram-v2
    [*] --> Closed
    Closed --> Open : Failure threshold exceeded
    Open --> HalfOpen : Timeout expires
    HalfOpen --> Closed : Test request succeeds
    HalfOpen --> Open : Test request fails
    note right of Closed
        All requests pass through.
        Failures are counted.
    end note
    note right of Open
        All requests fail immediately.
        No calls to downstream service.
        Timer starts.
    end note
    note right of HalfOpen
        One test request allowed.
        If it succeeds, close the circuit.
        If it fails, reopen.
    end note
```

Closed: Normal operation. Requests pass through to the downstream service. The breaker counts failures. When the failure count exceeds a threshold (for example, 5 failures in 10 seconds), the breaker trips to the open state.

Open: The breaker rejects all requests immediately without calling the downstream service. This protects both the caller (no wasted threads waiting for timeouts) and the downstream service (no additional load while it is struggling). After a configured timeout (for example, 30 seconds), the breaker transitions to half-open.

Half-open: The breaker allows a single test request through. If it succeeds, the breaker returns to closed. If it fails, the breaker returns to open and resets the timeout. This probing mechanism provides automatic recovery without manual intervention.
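A minimal version of these three states can be sketched in Python. For brevity this counts consecutive failures rather than failures within a sliding window, is not thread-safe, and uses illustrative thresholds; production systems would reach for a resilience library instead:

```python
import time

class CircuitBreaker:
    """Toy three-state circuit breaker (closed / open / half-open)."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout   # seconds to stay open before probing
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"     # allow one probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success resets the count and closes the circuit (including from half-open)
        self.failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```

The key property: while open, the breaker rejects calls in microseconds instead of letting threads wait out a timeout against a service that is already failing.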

Bulkheads

The bulkhead pattern is named after the watertight compartments in a ship's hull. If one compartment floods, the others remain dry. The ship stays afloat.

In software, a bulkhead isolates resources so that a failure in one part of the system does not consume resources needed by other parts. The most common implementation is dedicated thread pools or connection pools per downstream service.

Without bulkheads, all outbound calls share the same thread pool. If Service B becomes slow, calls to Service B consume all available threads. Now calls to Service C and Service D also fail, not because those services are down, but because there are no threads left to call them. The slow service has poisoned the entire system.

With bulkheads, Service B gets its own pool of 20 threads. Service C gets 20. Service D gets 20. When Service B becomes slow and exhausts its 20 threads, calls to C and D are unaffected. The failure is contained.
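One lightweight way to implement a bulkhead is a semaphore per dependency, capping how many concurrent callers a single downstream service can occupy. A Python sketch (pool sizes are illustrative):

```python
import threading

class Bulkhead:
    """Per-dependency concurrency compartment backed by a semaphore."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Fail fast instead of queueing when the compartment is full, so a
        # slow dependency cannot absorb an unbounded number of callers.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead '{self.name}' is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()

# One compartment per downstream service:
service_b = Bulkhead("service-b", max_concurrent=20)
service_c = Bulkhead("service-c", max_concurrent=20)
```

When Service B stalls and its 20 slots fill, calls through `service_b` are rejected immediately while `service_c` traffic proceeds untouched.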

Timeouts

Every network call must have a timeout. Without one, a call to an unresponsive service will wait indefinitely, holding a thread, a connection, and memory the entire time. Timeouts are the simplest fault tolerance mechanism and the one most often neglected.

There are two types. A connection timeout limits how long the client waits to establish a TCP connection. A read timeout (sometimes called socket timeout) limits how long the client waits for a response after the connection is established. Both should be configured explicitly. Default values in most HTTP libraries are far too generous for production use, sometimes 30 seconds or even infinite.

Setting timeouts correctly requires knowing the expected response time of the downstream service. If the p99 latency of a service is 200ms, a read timeout of 1 second is reasonable. A timeout of 30 seconds means a thread is occupied 150 times longer than necessary when the service is failing.
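At the socket level the two timeouts are set separately. A Python sketch (the limits shown are examples, not recommendations):

```python
import socket

CONNECT_TIMEOUT = 0.5   # max time to establish the TCP connection
READ_TIMEOUT = 1.0      # max time to wait for data once connected

def fetch(host: str, port: int, request: bytes) -> bytes:
    # create_connection applies its timeout to the TCP handshake only
    sock = socket.create_connection((host, port), timeout=CONNECT_TIMEOUT)
    try:
        sock.settimeout(READ_TIMEOUT)   # now governs sends and reads
        sock.sendall(request)
        return sock.recv(4096)          # raises socket.timeout after 1s of silence
    finally:
        sock.close()
```

Higher-level HTTP clients expose the same pair under names like connect timeout and read timeout; set both explicitly rather than trusting defaults.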

Retries with Exponential Backoff

Transient failures are common in distributed systems. A network blip, a brief garbage collection pause, or a momentary overload can cause a request to fail even when the downstream service is healthy. Retries handle transient failures by simply trying again.

Naive retries are dangerous. If a service is failing under load and all clients immediately retry, the retry traffic adds to the load, making the failure worse. This is called a retry storm.

Exponential backoff solves this by increasing the delay between retries. The first retry waits 100ms. The second waits 200ms. The third waits 400ms. Each subsequent retry doubles the wait time. Adding random jitter (a small random offset to the delay) prevents multiple clients from retrying at exactly the same moment.

```mermaid
sequenceDiagram
    participant C as Client
    participant S as Service
    C->>S: Request 1
    S--xC: 503 Service Unavailable
    Note over C: Wait 100ms + jitter
    C->>S: Retry 1
    S--xC: 503 Service Unavailable
    Note over C: Wait 200ms + jitter
    C->>S: Retry 2
    S--xC: 503 Service Unavailable
    Note over C: Wait 400ms + jitter
    C->>S: Retry 3
    S-->>C: 200 OK
```

Always cap the number of retries and the maximum delay. Three retries with a cap of 5 seconds is a common starting point. After the final retry fails, the client should return an error or invoke a fallback, not retry forever.
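A retry helper with exponential backoff and full jitter can be sketched in a few lines; the defaults mirror the numbers above (three retries, 100ms base, 5-second cap) and are starting points, not prescriptions:

```python
import random
import time

def retry_with_backoff(fn, max_attempts: int = 4,
                       base_delay: float = 0.1, max_delay: float = 5.0):
    """Call `fn`, retrying on exceptions with exponential backoff and
    full jitter. max_attempts=4 means one initial try plus three retries.
    Only safe for idempotent operations."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # budget exhausted
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(random.uniform(0, delay))        # full jitter
```

Sampling the sleep uniformly from `[0, delay]` (full jitter) spreads clients out in time, which matters more than the exact backoff curve during a shared outage.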

| Pattern | What It Protects Against | How It Works | When to Use |
|---|---|---|---|
| Circuit Breaker | Cascading failures from a down service | Stops calling a failing service after threshold | Any call to an external or downstream service |
| Bulkhead | Resource exhaustion from one slow dependency | Isolates resource pools per dependency | Services with multiple downstream dependencies |
| Timeout | Indefinite waits on unresponsive services | Caps the time a call can take | Every network call, no exceptions |
| Retry + Backoff | Transient failures (network blips, brief overload) | Retries with increasing delay and jitter | Idempotent operations only |
| Graceful Degradation | Partial system failure affecting user experience | Returns reduced functionality instead of errors | Non-critical features with fallback options |

Graceful Degradation

When a dependency fails and retries are exhausted, the system has a choice: return an error or return something less complete but still useful. Graceful degradation chooses the latter.

A product page that cannot reach the recommendation service can still show the product details, price, and reviews. It just omits the "customers also bought" section. A search page that cannot reach the personalization service can fall back to unpersonalized results. A dashboard that cannot reach the analytics backend can show cached data with a "last updated 5 minutes ago" notice.

Graceful degradation requires knowing which parts of a response are essential and which are optional. This distinction must be designed into the system, not improvised during an outage.
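In code, the essential/optional split can be as simple as letting essential failures propagate while catching optional ones. A Python sketch; `get_product` and `get_recommendations` are hypothetical stand-ins for real service calls:

```python
# Hypothetical service-call stubs, for illustration only:
def get_product(product_id: str) -> dict:
    return {"id": product_id, "name": "Example", "price": 9.99}

def get_recommendations(product_id: str) -> list:
    raise TimeoutError("recommendation service unavailable")

def product_page(product_id: str) -> dict:
    """Essential data must succeed; optional sections degrade to a fallback."""
    page = {"product": get_product(product_id)}   # essential: failures propagate
    try:
        page["also_bought"] = get_recommendations(product_id)
    except Exception:
        page["also_bought"] = []                  # optional: omit, don't error
    return page
```

The page still renders with its essential content even while the recommendation dependency is down; the optional section silently collapses to an empty fallback.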

Systems Thinking Lens

Fault tolerance patterns are balancing feedback loops. The circuit breaker detects rising failures (a signal) and reduces load on the failing service (a response), giving it time to recover. Exponential backoff spreads retry load over time instead of concentrating it. Bulkheads prevent a failure in one subsystem from propagating to others.

Without these loops, distributed systems exhibit a dangerous reinforcing pattern: failure increases load, increased load causes more failures, more failures increase load further. Circuit breakers, bulkheads, and backoff interrupt this spiral. They are the engineering equivalent of the balancing loops discussed in Session 0.5.

Further Reading

  • Microsoft: Circuit Breaker Pattern. Detailed description of the pattern with implementation guidance for cloud applications.
  • Microsoft: Bulkhead Pattern. How to isolate resources to prevent cascading failures.
  • Amazon Builders' Library: Timeouts, Retries, and Backoff with Jitter. AWS's approach to implementing retries safely in distributed systems.
  • Martin Fowler: CircuitBreaker. The original description of the circuit breaker pattern for software systems.

Assignment

Consider this scenario: Service A calls Service B to fetch user profile data. Service B is experiencing an outage.

  1. Without fault tolerance: Describe exactly what happens. How many threads does Service A consume? How long does the user wait? What does the user see?
  2. Add a timeout of 500ms. What changes? What is the user experience now?
  3. Add retries with exponential backoff (100ms, 200ms, 400ms, max 3 attempts). What is the total maximum wait time? Is this acceptable for a user-facing request?
  4. Add a circuit breaker with a threshold of 5 failures in 10 seconds. After the circuit opens, what does Service A return to the user? Design a graceful degradation response for when the profile service is unavailable.
  5. Draw a sequence diagram showing the full flow with all three patterns active.

When High Availability Is Not Enough

High availability (Session 4.1) and fault tolerance (Session 4.2) protect against component failures and transient issues. Disaster recovery addresses a different scale of problem: what happens when an entire region goes offline? When a data center is destroyed by fire, flooding, or extended power failure? When a ransomware attack encrypts your production database?

Disaster recovery (DR) is the plan, the infrastructure, and the process for restoring operations after a catastrophic event. It is not the same as high availability. HA keeps the system running during routine failures. DR brings the system back after extraordinary ones.

RPO and RTO: The Two Numbers That Define Everything

Every DR strategy is governed by two metrics.

Recovery Point Objective (RPO) answers the question: how much data can we afford to lose? If your RPO is 1 hour, you must have a copy of your data that is no more than 1 hour old at any point. If disaster strikes, you lose at most 1 hour of transactions. An RPO of zero means no data loss is acceptable, which requires synchronous replication.

Recovery Time Objective (RTO) answers the question: how long can we be down? If your RTO is 4 hours, the system must be fully operational within 4 hours of the disaster being declared. An RTO of zero means instant failover with no user-visible interruption.

RPO is about data. RTO is about time. Together, they define the contract between your DR capability and your business requirements. Every dollar spent on DR is buying a lower RPO, a lower RTO, or both.

The business, not engineering, should set RPO and RTO. Different systems have different tolerances. A marketing website might accept an RTO of 24 hours. A payment processing system might need an RTO under 5 minutes. The DR strategy must match the business requirement, not exceed it. Over-engineering DR is as wasteful as under-engineering it.

The Four DR Strategies

AWS and the broader industry recognize four DR strategies, ordered from cheapest (and slowest to recover) to most expensive (and fastest to recover).

1. Backup and Restore

The simplest strategy. Data is backed up regularly to a separate location (another region, another cloud, or offline storage). When disaster strikes, infrastructure is provisioned from scratch and data is restored from the most recent backup. This is the cheapest option but has the highest RTO (hours to days) and the highest RPO (hours, depending on backup frequency).

2. Pilot Light

Core infrastructure components, the absolute minimum needed to run the system, are kept running in the DR region at all times. Typically, this means database replicas and perhaps a minimal application instance. When disaster strikes, the rest of the infrastructure is provisioned and scaled up. RTO is measured in tens of minutes. RPO depends on replication lag, typically minutes.

3. Warm Standby

A scaled-down but fully functional copy of the production environment runs in the DR region at all times. It can handle a fraction of production traffic immediately. When disaster strikes, the standby environment is scaled up to full production capacity. RTO is measured in minutes. RPO is typically seconds, because the standby is continuously replicated.

4. Multi-Site Active-Active

Full production environments run in two or more regions simultaneously, all handling live traffic. There is no failover per se, because both sites are already active. If one region fails, the other absorbs the traffic. RTO is near zero. RPO depends on the replication model but can approach zero with synchronous replication.

```mermaid
graph TB
    subgraph Primary["Primary Region (us-east-1)"]
        PLB["Load Balancer"]
        PAPP1["App Tier<br/>(full capacity)"]
        PDB["Database<br/>(primary)"]
        PLB --> PAPP1
        PAPP1 --> PDB
    end
    subgraph DR["DR Region (us-west-2), Warm Standby"]
        DLB["Load Balancer"]
        DAPP1["App Tier<br/>(reduced capacity)"]
        DDB["Database<br/>(read replica)"]
        DLB --> DAPP1
        DAPP1 --> DDB
    end
    DNS["Route 53<br/>Failover Routing"]
    DNS -- "primary" --> PLB
    DNS -. "failover" .-> DLB
    PDB -- "async replication" --> DDB
```

In normal operation, Route 53 routes all traffic to the primary region. The warm standby runs at reduced capacity, handling only health check traffic and replication. When Route 53 detects the primary region is unhealthy, it routes traffic to the DR region. The standby environment scales up to handle production load. Database failover promotes the replica to primary.

Cost vs. Recovery Trade-offs

The relationship between cost and recovery capability is not linear. Improving RTO from 24 hours to 4 hours is relatively cheap. Improving from 5 minutes to 30 seconds is extremely expensive.

| Strategy | RPO | RTO | Relative Cost | Complexity | Best For |
|---|---|---|---|---|---|
| Backup & Restore | Hours | Hours to days | $ (lowest) | Low | Non-critical systems, dev/staging |
| Pilot Light | Minutes | Tens of minutes | $$ | Medium | Business applications with moderate RTO |
| Warm Standby | Seconds | Minutes | $$$ | Medium-High | Critical applications needing fast recovery |
| Multi-Site Active-Active | Near zero | Near zero | $$$$ (highest) | High | Mission-critical, zero-downtime requirements |

DR Testing: The Part Everyone Skips

A disaster recovery plan that has never been tested is a hypothesis, not a plan. DR testing validates that your failover procedures work, that your team knows how to execute them, and that recovery happens within your target RPO and RTO.

There are levels of testing. A tabletop exercise walks through the DR plan on paper without touching production. A simulation fails over to the DR environment using synthetic traffic. A full failover test routes real production traffic to the DR environment and back. Netflix famously runs Chaos Monkey and related tools that randomly terminate instances and even entire regions in production to verify their DR capabilities continuously.

At minimum, test your DR plan quarterly. After every significant infrastructure change, test again.

Systems Thinking Lens

DR strategy selection is a classic trade-off problem with a reinforcing loop. Cheaper DR strategies reduce cost but increase risk. When a disaster finally occurs and recovery takes too long, the organization invests heavily in DR. Over time, if no disasters occur, the organization reduces investment (the "it hasn't happened so it won't happen" fallacy). This cycle repeats.

The leverage point is making DR cost proportional to the value of the workload. A system that generates $100,000 per hour in revenue justifies a $50,000/month DR investment. One that generates $500/day does not. Tie the DR budget to the cost of downtime, and the trade-off becomes a rational calculation instead of a guess.
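That calculation can be made explicit. A back-of-envelope sketch in Python; every number here is an illustrative assumption the business would have to supply:

```python
def dr_is_justified(revenue_per_hour: float,
                    downtime_without_dr: float,
                    downtime_with_dr: float,
                    dr_cost_per_year: float) -> bool:
    """True when the expected annual loss avoided exceeds the DR spend.
    Downtime figures are expected hours of outage per year."""
    loss_avoided = (downtime_without_dr - downtime_with_dr) * revenue_per_hour
    return loss_avoided > dr_cost_per_year

# A system earning $100k/hour, where warm standby cuts expected downtime
# from 8 hours/year to 30 minutes, against a $50k/month DR budget:
print(dr_is_justified(100_000, 8.0, 0.5, 50_000 * 12))   # True: $750k avoided > $600k spent
```

Run the same formula for the $500/day system and the answer flips, which is exactly the point: the DR budget should follow the cost of downtime.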

Further Reading

  • AWS Whitepaper: Disaster Recovery Options in the Cloud. The definitive guide to the four DR strategies on AWS, with architecture diagrams and cost considerations.
  • AWS Architecture Blog: Pilot Light and Warm Standby. Detailed comparison of these two commonly confused strategies.
  • AWS Well-Architected: Use Defined Recovery Strategies. How to select and implement DR strategies within the Well-Architected Framework.
  • Google Cloud: Disaster Recovery Planning Guide. Cloud-agnostic concepts with GCP-specific implementation details.

Assignment

You are designing the DR strategy for a payment processing system. The business requirements are:

  • RPO must be less than 1 minute (no more than 1 minute of transaction data can be lost).
  • RTO must be less than 5 minutes (the system must be processing payments again within 5 minutes).
  1. Which of the four DR strategies meets these requirements? Eliminate the ones that do not and explain why.
  2. For your chosen strategy, describe the specific infrastructure required in the DR region during normal operation.
  3. Estimate the cost difference between your chosen strategy and a simple backup-and-restore approach. Consider compute, storage, data transfer, and database replication costs. Use rough estimates; the important thing is the order of magnitude.
  4. Write a 5-step failover runbook. What happens in the first 30 seconds? The first minute? The first 5 minutes?

The Foundation of Recovery

Disaster recovery strategies (Session 4.3) depend on one thing: having a usable copy of your data. Backups are that copy. Without reliable backups, no DR strategy works. Without tested restores, backups are meaningless.

This session covers the mechanics of backing up data: what types of backups exist, how to schedule them, how to retain them, and most importantly, how to verify that they actually work when you need them.

A backup you haven't tested restoring is not a backup. It's a hope. Until you have successfully restored from a backup, you have no evidence that it will work when it matters.

Three Types of Backups

All backup strategies are built from three fundamental types. Each copies a different amount of data, takes a different amount of time, and requires a different process to restore.

Full Backup

A full backup copies every file, every record, every byte. It is the simplest to understand and the simplest to restore. You need only one backup set to recover completely. The downside is obvious: it takes the longest to create, uses the most storage, and puts the highest load on the source system.

Incremental Backup

An incremental backup copies only the data that has changed since the last backup of any type. Monday's incremental contains changes since Sunday's full. Tuesday's incremental contains only changes since Monday's incremental. Each incremental is small and fast to create.

The trade-off appears at restore time. To restore from Wednesday's state, you need Sunday's full backup, then Monday's incremental, then Tuesday's incremental, then Wednesday's incremental, applied in order. If any backup in the chain is corrupted, the restore fails. The more increments in the chain, the higher the risk and the longer the restore takes.

Differential Backup

A differential backup copies all data that has changed since the last full backup. Monday's differential contains changes since Sunday's full. Tuesday's differential also contains all changes since Sunday's full (including Monday's changes again). Each differential grows larger as the week progresses, but restoring requires only two pieces: the last full backup and the most recent differential.

| Dimension | Full | Incremental | Differential |
|---|---|---|---|
| What it copies | Everything | Changes since last backup (any type) | Changes since last full backup |
| Backup speed | Slowest | Fastest | Medium (grows over time) |
| Storage required | Highest | Lowest per backup | Medium (cumulative growth) |
| Restore speed | Fastest (single file) | Slowest (full + all increments) | Medium (full + one differential) |
| Restore complexity | Low | High (chain dependency) | Low (two files) |
| Risk of chain corruption | None | High (one bad link breaks the chain) | Low |
| System load during backup | High | Low | Medium |
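The restore-time trade-off is easy to see in code. A toy model of the three schemes, where day 0 holds the weekly full backup:

```python
def restore_chain(day: int, scheme: str) -> list:
    """Which backup sets a restore to `day` needs (day 0 = the full backup).
    A toy model of the schemes described above."""
    if day == 0 or scheme == "full":
        return ["full@day0"]
    if scheme == "incremental":
        # the full plus every incremental up to the target day, applied in order
        return ["full@day0"] + [f"incr@day{d}" for d in range(1, day + 1)]
    if scheme == "differential":
        # always exactly two pieces: the full and the latest differential
        return ["full@day0", f"diff@day{day}"]
    raise ValueError(f"unknown scheme: {scheme}")

print(restore_chain(3, "incremental"))
# ['full@day0', 'incr@day1', 'incr@day2', 'incr@day3']
print(restore_chain(3, "differential"))
# ['full@day0', 'diff@day3']
```

The incremental chain grows one link per day, and every link is a corruption risk; the differential chain stays at two pieces regardless of the day.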

Backup Scheduling and Rotation

Most production systems use a combination of backup types on a rotation schedule. The classic pattern is the Grandfather-Father-Son (GFS) scheme: daily incrementals (Son), weekly full backups (Father), and monthly archive copies (Grandfather).

```mermaid
gantt
    title Backup Rotation Schedule (4-Week Cycle)
    dateFormat YYYY-MM-DD
    axisFormat %a %d
    section Week 1
    Full Backup (Weekly) :milestone, w1f, 2026-04-06, 0d
    Incremental :w1i1, 2026-04-07, 1d
    Incremental :w1i2, 2026-04-08, 1d
    Incremental :w1i3, 2026-04-09, 1d
    Incremental :w1i4, 2026-04-10, 1d
    Incremental :w1i5, 2026-04-11, 1d
    Incremental :w1i6, 2026-04-12, 1d
    section Week 2
    Full Backup (Weekly) :milestone, w2f, 2026-04-13, 0d
    Incremental :w2i1, 2026-04-14, 1d
    Incremental :w2i2, 2026-04-15, 1d
    Incremental :w2i3, 2026-04-16, 1d
    Incremental :w2i4, 2026-04-17, 1d
    Incremental :w2i5, 2026-04-18, 1d
    Incremental :w2i6, 2026-04-19, 1d
    section Week 3
    Full Backup (Weekly) :milestone, w3f, 2026-04-20, 0d
    Incremental :w3i1, 2026-04-21, 1d
    Incremental :w3i2, 2026-04-22, 1d
    Incremental :w3i3, 2026-04-23, 1d
    Incremental :w3i4, 2026-04-24, 1d
    Incremental :w3i5, 2026-04-25, 1d
    Incremental :w3i6, 2026-04-26, 1d
    section Week 4
    Full Backup (Weekly) :milestone, w4f, 2026-04-27, 0d
    Incremental :w4i1, 2026-04-28, 1d
    Incremental :w4i2, 2026-04-29, 1d
    Incremental :w4i3, 2026-04-30, 1d
    Monthly Archive :crit, archive, 2026-04-30, 1d
```

The retention policy determines how long each tier is kept. A common policy: daily backups retained for 7 days, weekly backups retained for 4 weeks, monthly archives retained for 12 months, yearly archives retained for 7 years (for compliance). The right retention depends on your regulatory requirements, storage budget, and the likelihood of needing historical data.

Retention Policies

| Tier | Frequency | Retention | Storage Tier | Purpose |
|---|---|---|---|---|
| Daily (Son) | Every night | 7 days | Hot (S3 Standard, local disk) | Recover from recent errors or deletions |
| Weekly (Father) | Every Sunday | 4 weeks | Warm (S3 Infrequent Access) | Recover from issues discovered days later |
| Monthly (Grandfather) | Last day of month | 12 months | Cold (S3 Glacier) | Compliance, audit, historical reference |
| Yearly Archive | Dec 31 | 7 years | Deep Archive (S3 Glacier Deep) | Regulatory compliance (SOX, GDPR, HIPAA) |
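A GFS retention policy reduces to a predicate over backup dates: given a backup's date and today's date, should it still be kept? A simplified Python sketch of the daily/weekly/monthly tiers in the table above (the yearly tier is omitted for brevity):

```python
import datetime as dt

def keep_backup(backup_date: dt.date, today: dt.date) -> bool:
    """GFS-style retention: dailies for 7 days, Sunday weeklies for 4 weeks,
    month-end monthlies for 12 months. A simplified sketch of the table above."""
    age = (today - backup_date).days
    if age <= 7:
        return True                               # Son: any recent daily
    if backup_date.weekday() == 6 and age <= 28:
        return True                               # Father: Sunday weekly
    is_month_end = (backup_date + dt.timedelta(days=1)).day == 1
    if is_month_end and age <= 365:
        return True                               # Grandfather: monthly archive
    return False
```

A pruning job would simply delete every backup for which this returns False; the predicate form makes the policy easy to unit-test before it ever touches real backups.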

Point-in-Time Recovery (PITR)

Traditional backups give you snapshots at fixed intervals. Point-in-time recovery gives you any moment in between. PITR works by combining a base backup with a continuous stream of transaction logs (called write-ahead logs in PostgreSQL, binary logs in MySQL, or redo logs in Oracle).

To restore to 2:47 PM last Tuesday, the system restores the most recent full backup before that time, then replays the transaction log up to exactly 2:47 PM. The result is an exact replica of the database at that specific moment.

PITR is essential for recovering from application bugs that corrupt data. A regular backup might have been taken after the corruption occurred. With PITR, you can restore to the moment just before the bug ran. Most managed database services (Amazon RDS, Aurora, Azure SQL, Cloud SQL) offer PITR with a configurable retention window, typically 1 to 35 days.
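The replay mechanics can be illustrated with a toy model: a base snapshot plus a time-ordered log of writes, replayed up to the target moment. The tuples stand in for real WAL records; this is nothing like a database's actual implementation, only the shape of the idea:

```python
def restore_to(base_snapshot: dict, log: list, target_ts: float) -> dict:
    """Toy point-in-time recovery: start from the base backup, then replay
    the (timestamp, key, value) log up to and including target_ts."""
    state = dict(base_snapshot)
    for ts, key, value in log:
        if ts > target_ts:
            break            # log is time-ordered; stop at the target moment
        state[key] = value
    return state

base = {"balance": 100}
wal = [(1.0, "balance", 150), (2.0, "balance", 0)]   # the t=2.0 write is the "bug"
print(restore_to(base, wal, 1.5))   # {'balance': 150}: the moment just before the bug ran
```

Picking `target_ts` between the last good write and the corrupting one is exactly the recovery a fixed-interval snapshot cannot offer.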

Restoration Testing

The most common failure mode in backup systems is not the backup failing. It is the restore failing. Backups that cannot be restored are worse than no backups at all, because they create a false sense of security.

Reasons restores fail include: backup files are corrupted but no checksum validation was performed, the backup software version on the restore target differs from the source, the restore process requires credentials or configuration that nobody documented, the storage format has changed since the backup was taken, and the backup is incomplete (some files or tables were excluded accidentally).

The only way to catch these problems is to test restores regularly. A good restoration test answers five questions: Can you actually restore the data? Is the restored data complete and consistent? How long does the restore take? Does the restored system function correctly? Can someone other than the person who created the backup perform the restore?
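The first failure above, silent corruption, is the cheapest to guard against. A minimal sketch: record a SHA-256 digest when the backup is written, and refuse to restore from a file whose current digest no longer matches. Function names here are hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large backups fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

def verify_backup(path: Path, recorded_digest: str) -> bool:
    """Compare the current digest against the one recorded at backup time."""
    return sha256_of(path) == recorded_digest

# At backup time: store the digest alongside the backup file.
# Before every restore test: fail fast if the file no longer matches.
```

This does not replace full restoration testing, it only catches the corruption case, but it turns one silent failure mode into a loud one.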

Systems Thinking Lens

Backup strategy involves competing feedback loops. Frequent backups reduce RPO (good) but increase storage costs and system load (bad). Longer retention provides more recovery options (good) but increases storage costs and compliance surface area (bad). The optimal strategy balances these loops based on the value of the data and the cost of losing it.

There is also a dangerous delay in the feedback loop. Backups that are never tested provide no feedback about their quality until a disaster occurs. By then, the feedback arrives too late. Regular restoration testing closes this loop and surfaces problems while there is still time to fix them.

Further Reading

  • AWS: Incremental vs Differential vs Other Backups. Clear comparison of backup types with diagrams showing data flow.
  • Acronis: Incremental vs Differential Backups. Practical guide with performance benchmarks and use-case recommendations.
  • TechTarget: How to Choose the Correct Backup Type. Decision framework for selecting backup strategies based on RTO, RPO, and storage constraints.
  • AWS: Point-in-Time Recovery for Amazon RDS. How PITR works in a managed database service, including retention and restore procedures.

Assignment

Your team runs a production PostgreSQL database with 500 GB of data. Daily backups are taken at midnight and stored in S3. The team has never tested a restore.

  1. Write a 5-step restoration test plan. Include: how you will create a test environment, how you will restore the backup, how you will validate the restored data, how you will measure restore time, and how you will document the results.
  2. What is the scariest possible discovery during the test? List three things that could go wrong, and for each one, describe what it would mean for your recovery capability.
  3. The current RPO is 24 hours (one daily backup). The business now wants RPO under 1 hour. What changes to the backup strategy would you make? Consider PITR, incremental frequency, and cost implications.
  4. Design a retention policy for this database. State how long each tier is kept and justify each choice.

Two Different Questions

Security in software systems starts with two fundamental questions. Authentication (AuthN) asks: who are you? Authorization (AuthZ) asks: what are you allowed to do? These are separate concerns, implemented by separate mechanisms, and they must never be confused.

Authentication verifies identity. When you type your username and password, the system checks whether those credentials match a known user. If they do, the system knows who you are. Authentication is a binary outcome: either the credentials are valid or they are not.

Authorization determines access. Once the system knows who you are, it must decide what you can access. Can you read this document? Can you delete this record? Can you access data belonging to a different tenant? Authorization is contextual and granular. The same authenticated user might have full access in one area and no access in another.

Authentication asks "who are you?" Authorization asks "what can you do?" Never confuse the two. A valid login does not mean unlimited access. A permission check does not verify identity.

OAuth 2.0: Delegated Authorization

OAuth 2.0 is an authorization framework, defined in RFC 6749, that allows a user to grant a third-party application limited access to their resources without sharing their password. When you click "Sign in with Google" on a third-party site, OAuth 2.0 is handling the flow.

The most widely used OAuth 2.0 flow is the Authorization Code Flow. It involves four parties: the user (resource owner), the client application, the authorization server, and the resource server.

sequenceDiagram
  participant U as User (Browser)
  participant C as Client App
  participant AS as Authorization Server
  participant RS as Resource Server
  U->>C: Click "Login with Provider"
  C->>AS: Redirect to /authorize (client_id, redirect_uri, scope, state)
  AS->>U: Show login + consent screen
  U->>AS: Enter credentials, grant consent
  AS->>C: Redirect to callback with authorization code
  C->>AS: POST /token (code, client_id, client_secret)
  AS->>C: Return access_token (+ refresh_token, id_token)
  C->>RS: GET /api/resource (Authorization: Bearer access_token)
  RS->>C: Return protected resource
  C->>U: Display resource

The critical security property of this flow is that the user's credentials are never shared with the client application. The client receives an authorization code, which it exchanges for an access token using a server-to-server call that includes the client's own secret. The access token is scoped: it grants access only to the specific resources the user consented to, not to everything.

For public clients (single-page applications, mobile apps) that cannot securely store a client secret, the Authorization Code Flow is extended with PKCE (Proof Key for Code Exchange, pronounced "pixie"). PKCE prevents authorization code interception attacks by requiring the client to prove it was the one that initiated the flow.
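The PKCE proof is a small, self-contained transform defined in RFC 7636: the client generates a random code_verifier, sends its SHA-256 hash (base64url-encoded, padding stripped) as the code_challenge, and later presents the verifier so the server can recompute the hash. A sketch using only the standard library:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    """Generate a PKCE code_verifier and its S256 code_challenge.

    Per RFC 7636: the verifier is a high-entropy random string; the
    challenge is base64url(SHA-256(verifier)) with padding stripped.
    """
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

# The client sends code_challenge with /authorize, keeps code_verifier
# private, and presents the verifier at /token. The authorization
# server recomputes the hash to prove the same client started the flow.
verifier, challenge = make_pkce_pair()
```

An attacker who intercepts the authorization code cannot redeem it, because they never saw the verifier and cannot invert the hash.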

OpenID Connect: Authentication on Top of OAuth

OAuth 2.0 by itself is an authorization protocol. It tells the client what the user has allowed, but it does not reliably tell the client who the user is. OpenID Connect (OIDC) adds an authentication layer on top of OAuth 2.0.

The key addition is the ID token, a JWT that contains claims about the authenticated user: their unique identifier, email, name, and when the authentication occurred. The client can decode this token to learn who the user is without making additional API calls.

JSON Web Tokens (JWT)

A JWT is a compact, URL-safe token format defined in RFC 7519. It consists of three parts separated by dots: a header (algorithm and token type), a payload (the claims), and a signature.

The payload contains standard claims and custom claims. Standard claims are registered in the IANA JWT Claims Registry.

Claim Name Purpose Example Value
sub Subject Unique identifier for the user "user-12345"
iss Issuer Who issued the token "https://auth.example.com"
aud Audience Intended recipient of the token "api.example.com"
exp Expiration When the token expires (Unix timestamp) 1775145600
iat Issued At When the token was created 1775142000
scope Scope Permissions granted to this token "read:profile write:settings"

The signature ensures integrity. If anyone modifies the payload (for example, changing their user ID to an admin's ID), the signature will not match and the token will be rejected. JWTs can be signed with HMAC (shared secret) or RSA/ECDSA (asymmetric keys). Asymmetric signing is preferred because the resource server only needs the public key to verify tokens, not the private signing key.
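The integrity property is easy to demonstrate with the HMAC (HS256) variant using only the standard library. This is a teaching sketch, not a production JWT library: it skips expiry checks, algorithm-header validation, and everything else a real verifier must do.

```python
import base64
import hashlib
import hmac
import json

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_jwt(payload: dict, secret: bytes) -> str:
    """Produce a compact HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    sig = b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token: str, secret: bytes):
    """Return the claims if the signature is valid, else None."""
    header, body, sig = token.split(".")
    expected = b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # payload or header was modified: reject
    return json.loads(base64.urlsafe_b64decode(body + "=" * (-len(body) % 4)))

token = sign_jwt({"sub": "user-12345", "aud": "api.example.com"}, b"shared-secret")
assert verify_jwt(token, b"shared-secret")["sub"] == "user-12345"
```

Change a single character of the payload segment and verification returns None: the resource server never trusts claims it cannot re-sign.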

Multi-Factor Authentication (MFA)

Passwords alone are weak authentication. They can be guessed, phished, leaked in breaches, or reused across sites. Multi-factor authentication requires the user to prove their identity using two or more independent factors from different categories: something you know (password), something you have (phone, hardware key), and something you are (fingerprint, face).

Common MFA methods include time-based one-time passwords (TOTP, generated by apps like Google Authenticator), SMS codes (less secure due to SIM-swapping attacks), push notifications to an authenticator app, and hardware security keys (FIDO2/WebAuthn, the strongest option). For high-security systems, hardware keys provide the best protection because they are resistant to phishing. The user must physically possess and activate the key.
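TOTP is simple enough to implement from its specification (RFC 6238): HMAC the current 30-second time-step counter with a shared secret, then dynamically truncate the result to a short numeric code. A standard-library sketch:

```python
import hashlib
import hmac
import struct

def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP: HMAC the time-step counter, dynamically truncate."""
    counter = struct.pack(">Q", unix_time // step)   # 8-byte big-endian counter
    mac = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                          # dynamic truncation offset
    code = int.from_bytes(mac[offset:offset + 4], "big") & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: ASCII secret "12345678901234567890" at T=59
# yields 94287082 (8 digits, SHA-1).
assert totp(b"12345678901234567890", 59, digits=8) == "94287082"
```

Both the server and the authenticator app run this same computation; a code is valid only while its 30-second window (plus a small clock-skew allowance) lasts. Note that TOTP is still phishable in real time, which is why hardware keys rank above it.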

RBAC vs. ABAC

Once a user is authenticated, the system must decide what they can access. The two dominant models for this decision are Role-Based Access Control (RBAC) and Attribute-Based Access Control (ABAC).

Dimension RBAC ABAC
Access based on Assigned roles (admin, editor, viewer) Attributes of user, resource, and environment
Granularity Coarse (role-level) Fine-grained (attribute combinations)
Policy example "Editors can update articles" "Users in the EU can access EU data during business hours"
Scalability Role explosion in complex systems Scales well with dynamic policies
Implementation complexity Low to medium High
Audit clarity Easy to audit (who has which role) Harder (policies are rules, not lists)
Best for Applications with clear role hierarchies Applications needing contextual access decisions

RBAC assigns permissions to roles, then assigns roles to users. A user with the "editor" role can edit content. A user with the "admin" role can manage users. RBAC is straightforward and works well when the access model maps cleanly to organizational roles. The problem appears when you need exceptions: "editors can edit articles, but only articles in their department, and only during business hours." Each exception requires a new role, leading to role explosion.

ABAC evaluates access based on attributes of the user (department, clearance level), the resource (classification, owner), the action (read, write, delete), and the environment (time of day, IP address, device type). ABAC policies are rules, not lists. They handle complex access logic without role explosion, but they are harder to implement and harder to reason about.

Many production systems use both. RBAC handles the common cases (admin, editor, viewer). ABAC handles the exceptions and contextual rules that RBAC cannot express cleanly.
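The structural difference between the two models fits in a short sketch. The role names, permission strings, and the ABAC policy below are invented for illustration: RBAC is a set lookup, ABAC is a rule evaluated over attributes.

```python
# RBAC: permissions attach to roles; authorization is a lookup.
ROLE_PERMISSIONS = {
    "admin": {"article:read", "article:update", "user:manage"},
    "editor": {"article:read", "article:update"},
    "viewer": {"article:read"},
}

def rbac_allows(roles: list[str], permission: str) -> bool:
    return any(permission in ROLE_PERMISSIONS.get(r, set()) for r in roles)

def abac_allows(user: dict, resource: dict, env: dict) -> bool:
    """ABAC: the policy is a rule over attributes, not a role lookup.

    Example policy: editors may update articles, but only in their own
    department and only during business hours.
    """
    return (
        "editor" in user["roles"]
        and user["department"] == resource["department"]
        and 9 <= env["hour"] < 17
    )

assert rbac_allows(["editor"], "article:update")
user = {"roles": ["editor"], "department": "news"}
assert abac_allows(user, {"department": "news"}, {"hour": 10})
assert not abac_allows(user, {"department": "sports"}, {"hour": 10})
```

Expressing the department-and-hours rule in pure RBAC would require a role per department per schedule, which is exactly the role explosion described above.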

IAM in Cloud Environments

Cloud providers implement authorization through Identity and Access Management (IAM) services. AWS IAM, Azure RBAC, and Google Cloud IAM all follow the same core model: identities (users, groups, service accounts) are assigned policies that grant or deny specific actions on specific resources.

The principle of least privilege applies everywhere: every identity should have only the permissions it needs to perform its function, and nothing more. This limits the blast radius of a compromised credential. A database backup service should not have permission to delete databases. A read-only monitoring service should not have write access.
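In AWS IAM terms, least privilege looks like a policy scoped to specific actions on specific resources. A minimal sketch for the backup-service example (the bucket name is hypothetical): read-only access to one bucket, and nothing else.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BackupReadOnly",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-backup-bucket",
        "arn:aws:s3:::example-backup-bucket/*"
      ]
    }
  ]
}
```

There is no Deny statement for deletion because there does not need to be: IAM denies by default, so anything not explicitly allowed, including s3:DeleteObject, is already blocked.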

Systems Thinking Lens

Authentication and authorization form a dependency chain. Every downstream service trusts the token issued by the authentication layer. If that layer is compromised, every service that trusts it is compromised. This is a leverage point (Session 0.6) in the negative sense: a single vulnerability at the auth layer cascades through the entire system.

The feedback loop in access control is also worth noting. Overly restrictive permissions generate friction. Users request exceptions. Exceptions accumulate into overly permissive configurations. Security incidents trigger a lockdown. The cycle repeats. Good access control design anticipates the common access patterns and builds them into the model, reducing the pressure for exceptions.

graph LR
  A["Strict Permissions"] --> B["User Friction"]
  B --> C["Exception Requests"]
  C --> D["Permission Creep"]
  D --> E["Security Incident"]
  E --> A

Further Reading

  • RFC 6749: The OAuth 2.0 Authorization Framework. The specification that defines OAuth 2.0 flows, token types, and security considerations.
  • OpenID Connect Core 1.0. The specification for OIDC, including ID token format and standard claims.
  • Auth0: Authorization Code Flow. A practical, well-illustrated guide to implementing the authorization code flow.
  • Okta: OAuth 2.0 and OpenID Connect Overview. Clear explanation of how OAuth 2.0 and OIDC work together.
  • NIST SP 800-162: Guide to ABAC. The formal definition and implementation guidance for attribute-based access control.

Assignment

You are designing the authentication and authorization system for a multi-tenant SaaS application. Tenants are separate companies. Users belong to one tenant and should never access another tenant's data.

  1. Design the login flow. The user opens the app, enters credentials, and receives a token. Draw a sequence diagram showing each step, including the authorization server, the client application, and the API.
  2. Design the JWT payload. What claims do you include to ensure the API can enforce tenant isolation? Write out the exact JSON payload, including standard claims (sub, iss, exp, aud) and any custom claims you need (tenant ID, roles, permissions).
  3. Choose an access control model. Should you use RBAC, ABAC, or a combination? Justify your choice based on the multi-tenant requirement. Provide an example policy for each model.
  4. A user with the "admin" role in Tenant A makes an API call to access data in Tenant B. Describe exactly how the system detects and blocks this request. At which layer does the check happen?

Two Threats, Two Solutions

Data faces two distinct categories of risk depending on where it sits at any given moment. When stored on disk, it is vulnerable to physical theft, unauthorized filesystem access, or a compromised backup. When moving between services over a network, it is vulnerable to interception, modification, or replay. These are fundamentally different attack surfaces, and they require fundamentally different protections.

Encryption at rest protects against physical theft. Encryption in transit protects against eavesdropping. You need both.

Skipping either one creates a gap. Encrypt data in transit but store it in plaintext, and a stolen hard drive exposes everything. Encrypt data at rest but transmit it over HTTP, and anyone on the network path can read it. Security is not a menu where you pick one item. It is a chain, and it breaks at the weakest link.

Encryption at Rest: Three Approaches

Cloud providers, AWS in particular, offer multiple server-side encryption (SSE) options for object storage. Each shifts the boundary of who manages the keys and who performs the encryption. The choice depends on your compliance requirements, operational complexity tolerance, and trust model.

SSE-S3: AWS-Managed Keys

SSE-S3 is the default. AWS generates, manages, and rotates encryption keys on your behalf. Each object is encrypted with a unique key, and that key is itself encrypted with a root key that AWS rotates regularly. You never see or touch any key material. The encryption and decryption happen transparently on S3's side.

This is the simplest option. No configuration beyond enabling it (which is now on by default for all new buckets). No additional cost. The trade-off: you have no control over key policies, no audit trail of key usage, and no ability to revoke access at the key level.

SSE-KMS: AWS KMS-Managed Keys

SSE-KMS uses AWS Key Management Service to manage encryption keys. You can use an AWS-managed KMS key or create your own customer-managed key (CMK). The critical difference from SSE-S3 is visibility and control. Every use of the key is logged in CloudTrail. You can define key policies that restrict which IAM principals can encrypt or decrypt. You can disable or schedule deletion of keys, effectively making data permanently inaccessible.

This matters for compliance. PCI-DSS, HIPAA, and SOC 2 auditors want to see who accessed encryption keys and when. SSE-KMS provides that audit trail. The cost: KMS API calls are not free, and high-throughput workloads can hit request limits.

Client-Side Encryption

With client-side encryption (CSE), you encrypt data before it ever reaches the cloud provider. AWS never sees plaintext data. The encryption and decryption logic lives in your application, using either the AWS Encryption SDK or your own implementation. You can still use KMS to manage the data encryption keys (CSE-KMS), or you can manage keys entirely on your own infrastructure.

This is the highest-trust model. Even a fully compromised AWS account cannot read your data without the encryption keys. It is also the most complex. Your application must handle encryption, key rotation, and the inevitable key management failures. If you lose the keys, the data is gone. No recovery.

Comparison

Dimension SSE-S3 SSE-KMS Client-Side (CSE)
Who encrypts AWS (S3) AWS (S3 + KMS) Your application
Who manages keys AWS (fully managed) AWS KMS (you control policy) You (or KMS for wrapping)
Key audit trail No Yes (CloudTrail) You must implement
Granular key policy No Yes (IAM + key policy) Yes (your responsibility)
Performance impact Negligible KMS API latency + rate limits CPU cost in your app
Cost Free $1/key/month + API calls Your compute + optional KMS
Compliance fit Basic PCI, HIPAA, SOC 2 Maximum (data never exposed)
Complexity Zero Low High

Encryption in Transit: TLS

Transport Layer Security (TLS) is the protocol that makes HTTPS work. It provides three guarantees: confidentiality (nobody can read the data), integrity (nobody can modify the data without detection), and authentication (you are talking to who you think you are talking to).

TLS 1.3, defined in RFC 8446, is the current standard. It is faster and more secure than TLS 1.2. The handshake completes in one round trip instead of two. Older, insecure cipher suites are removed entirely. There is no negotiation of weak algorithms because the protocol simply does not offer them.
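On the client side, enforcing this takes one line on top of the defaults. A sketch using Python's standard ssl module: pin the minimum protocol version to TLS 1.3 so older versions cannot be negotiated, while keeping certificate and hostname verification on.

```python
import ssl

# A client context that refuses anything older than TLS 1.3. With 1.3
# there are no weak cipher suites to configure away; pinning the
# minimum version also removes downgrade to 1.2 and earlier.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_3

# create_default_context() keeps certificate verification and hostname
# checking enabled; this context would be passed to an HTTPS client.
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname
```

A connection attempt to a server that only speaks TLS 1.2 now fails at the handshake instead of silently negotiating down.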

TLS 1.3 Handshake

The handshake establishes a shared secret between client and server without ever transmitting that secret over the wire. It uses Diffie-Hellman key exchange, where both parties contribute a public value and independently compute the same shared key.

sequenceDiagram
  participant C as Client
  participant S as Server
  Note over C,S: 1-RTT Handshake (TLS 1.3)
  C->>S: ClientHello + Key Share + Supported Ciphers
  Note right of S: Select cipher, compute shared secret
  S->>C: ServerHello + Key Share
  Note over C,S: All messages below are encrypted
  S->>C: EncryptedExtensions
  S->>C: Certificate
  S->>C: CertificateVerify
  S->>C: Finished
  C->>S: Finished
  Note over C,S: Application data flows (encrypted)
  C->>S: Application Data
  S->>C: Application Data

The key improvement over TLS 1.2: the client sends its key share in the first message, before knowing which parameters the server will choose. This eliminates one full round trip. For a client 100ms away from the server, that saves 100ms on every new connection. At scale, across millions of connections, this adds up.

TLS 1.3 also supports 0-RTT resumption, where a client that has previously connected can send encrypted application data in its very first message. This is useful for performance but introduces replay risk, so it should only be used for idempotent requests.

mTLS: Mutual Authentication

Standard TLS authenticates only the server. The client verifies the server's certificate, but the server accepts any client. This is fine for public websites, where you want anyone to connect.

In a microservices architecture, you want more. Service A should only accept requests from Service B, not from a compromised pod or a rogue container. Mutual TLS (mTLS) solves this by requiring both sides to present and verify certificates.

In practice, mTLS is usually implemented through a service mesh like Istio or Linkerd. The mesh's sidecar proxies handle certificate issuance, rotation, and verification automatically. Application code does not need to change. The mesh infrastructure guarantees that every service-to-service call is both encrypted and authenticated.

Certificate Management

Certificates expire. If you do not rotate them before expiration, your services go down. This is not a theoretical risk. Major outages at large companies have been caused by expired certificates, including incidents at companies that should know better.

Automated certificate management is essential. Tools like Let's Encrypt (for public-facing certificates) and HashiCorp Vault or AWS Certificate Manager (for internal certificates) handle issuance and renewal without human intervention. The goal: no human should ever need to remember to renew a certificate.

Performance Implications

Encryption is not free. Every encrypted operation costs CPU cycles. But the costs are often smaller than people assume.

AES-256 encryption on modern hardware with AES-NI instructions adds roughly 1-2% CPU overhead for bulk data encryption. The bottleneck is rarely the symmetric encryption itself. It is the key exchange (asymmetric operations) during TLS handshake, the KMS API calls for SSE-KMS, and the additional network round trips.

For most systems, the right approach is to encrypt everything and optimize only where measurements show a real bottleneck. The alternative, leaving data unencrypted "for performance," creates a security debt that compounds over time and eventually comes due in the form of a breach.

Further Reading

  • RFC 8446: The Transport Layer Security (TLS) Protocol Version 1.3 (IETF)
  • Protecting Data with Server-Side Encryption (AWS Documentation)
  • Understanding Amazon S3 Client-Side Encryption Options (AWS Storage Blog)
  • A Detailed Look at RFC 8446 (TLS 1.3) (Cloudflare Blog)
  • Mutual TLS: Securing Microservices in Service Mesh (The New Stack)

Assignment

You are designing a payment processing system. It stores credit card numbers in a database and transmits them between an API gateway, a payment service, and a card processor.

  1. What encryption-at-rest strategy do you choose for the stored card numbers? Why not SSE-S3?
  2. What encryption-in-transit strategy do you use between the API gateway and payment service? Between the payment service and the external card processor?
  3. Who holds the encryption keys? Draw a diagram showing key ownership at each layer.
  4. PCI-DSS requires that you can prove who accessed cardholder data and when. Which encryption option gives you that audit trail?

The Network as Attack Surface

Every network packet that enters your system is a potential attack vector. Network security is the practice of controlling which packets are allowed in, which are allowed out, and what happens to the ones that break the rules. In cloud environments, this means layering multiple controls so that no single misconfiguration exposes your entire system.

Defense in depth means every layer assumes every other layer has been compromised.

This is not paranoia. It is engineering discipline. A security group misconfiguration should not expose your database to the internet. A compromised web server should not give an attacker free movement through your private network. Each layer of defense operates independently, so a failure at one layer does not cascade into a total breach.

Security Groups vs. Network ACLs

AWS provides two distinct mechanisms for controlling network traffic: Security Groups and Network Access Control Lists (NACLs). They operate at different levels, follow different rules, and serve different purposes. Understanding the distinction is essential for designing secure VPC architectures.

Security Groups: Stateful, Instance-Level

A Security Group acts as a virtual firewall attached to an elastic network interface (ENI). It operates at the instance level. When you allow inbound traffic on port 443, the return traffic is automatically allowed. This is what "stateful" means: the firewall remembers the connection and permits its response without an explicit outbound rule.

Security Groups evaluate all rules before deciding. There is no rule ordering. If any rule permits the traffic, it passes. The default behavior is deny-all inbound and allow-all outbound.

NACLs: Stateless, Subnet-Level

A Network ACL is attached to a subnet. Every instance in that subnet is automatically subject to its rules. NACLs are stateless: if you allow inbound traffic on port 443, you must also create an explicit outbound rule for the response traffic. If you forget the outbound rule, the connection breaks silently.

NACLs evaluate rules in order, starting from the lowest rule number. The first matching rule wins, and evaluation stops. Rule ordering matters. A deny rule at number 100 overrides an allow rule at number 200.
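First-match-wins evaluation is worth seeing concretely. A toy model (the rule tuples and function are invented for illustration, not an AWS API): rules are checked in ascending rule-number order, the first match decides, and an implicit deny catches everything else.

```python
import ipaddress

# (rule number, action, source CIDR, port): evaluated lowest number first.
RULES = [
    (100, "deny", "203.0.113.0/24", 443),   # block a known-bad range first
    (200, "allow", "0.0.0.0/0", 443),       # then allow HTTPS broadly
]

def nacl_decision(src_ip: str, port: int) -> str:
    for _num, action, cidr, rule_port in sorted(RULES):
        if port == rule_port and ipaddress.ip_address(src_ip) in ipaddress.ip_network(cidr):
            return action                    # first matching rule wins
    return "deny"                            # implicit deny-all (the "*" rule)

assert nacl_decision("203.0.113.9", 443) == "deny"    # rule 100 fires before 200
assert nacl_decision("198.51.100.7", 443) == "allow"
```

Swap the two rule numbers and the known-bad range would be allowed, because rule 100 (now the broad allow) matches first. This is why NACL rule numbering deserves the same review rigor as the rules themselves.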

Comparison

Dimension Security Group Network ACL
Scope Instance (ENI) level Subnet level
State tracking Stateful (return traffic auto-allowed) Stateless (return traffic needs explicit rule)
Rule evaluation All rules evaluated; any allow wins Rules evaluated in order; first match wins
Default behavior Deny all inbound, allow all outbound Allow all inbound and outbound
Allow/Deny rules Allow rules only Both allow and deny rules
Assignment Must be explicitly assigned to instance Automatically applies to all instances in subnet
Use case Fine-grained, per-service access control Coarse-grained subnet guardrails, IP blocking
Typical role Primary firewall Secondary defense layer

In practice, Security Groups handle most access control. NACLs serve as a safety net: broad rules that block known-bad IP ranges, enforce subnet isolation, or provide a fallback if a Security Group is misconfigured.

IDS vs. IPS

Firewalls filter traffic based on rules you define in advance. But what about attacks that look like legitimate traffic? A SQL injection arrives on port 443, just like a normal HTTPS request. A firewall cannot tell the difference.

Intrusion Detection Systems (IDS) and Intrusion Prevention Systems (IPS) inspect packet contents, not just headers. They look for patterns that match known attack signatures or statistical anomalies that suggest an attack in progress.

The difference between them is the action taken. An IDS detects and alerts. An IPS detects and blocks. An IDS sits on a mirror port or tap, passively observing traffic. An IPS sits inline, actively intercepting and filtering. Their failure modes differ accordingly: an IDS that misses an attack lets it proceed unnoticed, while an IPS that raises a false positive drops legitimate traffic.

The trade-off is clear: IPS gives you automated blocking at the cost of potential false positives. IDS gives you visibility without the risk of accidentally breaking production. Many organizations start with IDS to build confidence in their rules, then switch to IPS once false positive rates are low enough.

Web Application Firewall (WAF)

A WAF operates at Layer 7 (HTTP/HTTPS). It inspects request bodies, headers, query strings, and URL paths for patterns associated with web application attacks: SQL injection, cross-site scripting (XSS), path traversal, and other OWASP Top 10 threats.

AWS WAF, Cloudflare WAF, and similar services let you define rules, use managed rule sets, and create rate-limiting policies. A well-configured WAF blocks the most common automated attacks before they reach your application code.

A WAF is not a replacement for writing secure code. It is a filter that catches the obvious attacks. A determined attacker who understands your application can craft requests that bypass WAF rules. Defense in depth means the WAF catches the 90% of attacks that follow known patterns, and your application code handles the rest through proper input validation and output encoding.

VPC Design for Defense in Depth

A well-designed VPC separates resources into subnets based on their exposure level. The goal: minimize the blast radius of any single compromise. If an attacker gains access to a web server, they should not be able to reach the database directly.

graph TB
  Internet["Internet"] --> IGW["Internet Gateway"]
  subgraph VPC["VPC (10.0.0.0/16)"]
    subgraph Public["Public Subnet (10.0.1.0/24)"]
      ALB["Application Load Balancer"]
      NAT["NAT Gateway"]
    end
    subgraph Private["Private Subnet (10.0.2.0/24)"]
      App1["App Server 1"]
      App2["App Server 2"]
    end
    subgraph Isolated["Isolated Subnet (10.0.3.0/24)"]
      DB1["Primary DB"]
      DB2["Replica DB"]
    end
  end
  IGW --> ALB
  ALB -->|"SG: 443 only"| App1
  ALB -->|"SG: 443 only"| App2
  App1 -->|"SG: 5432 only"| DB1
  App2 -->|"SG: 5432 only"| DB2
  App1 --> NAT
  NAT --> IGW
  style Public fill:#2a2a28,stroke:#6b8f71,color:#ede9e3
  style Private fill:#2a2a28,stroke:#c8a882,color:#ede9e3
  style Isolated fill:#2a2a28,stroke:#c47a5a,color:#ede9e3
  style VPC fill:#191918,stroke:#8a8478,color:#ede9e3

Three Subnet Tiers

Public subnet. Contains resources that need direct internet access: load balancers, NAT gateways, bastion hosts. Has a route to the internet gateway. Resources here get public IP addresses. Security Groups restrict inbound traffic to specific ports (80, 443).

Private subnet. Contains application servers. No direct internet access. Outbound internet traffic (for package updates, API calls) routes through a NAT gateway in the public subnet. Inbound traffic comes only from the load balancer. Security Groups allow traffic only from the ALB's Security Group.

Isolated subnet. Contains databases and other sensitive data stores. No internet access at all, not even through NAT. No route to any internet gateway. The only allowed inbound traffic comes from the application servers' Security Group on the database port. This is the most restricted tier.

Security Controls at Each Boundary

Boundary Controls
Internet to Public Subnet NACL (allow 80/443, deny known-bad IPs), WAF (filter L7 attacks), Security Group on ALB (allow 80/443 from 0.0.0.0/0)
Public to Private Subnet NACL (restrict to ALB source), Security Group on App (allow from ALB SG only)
Private to Isolated Subnet NACL (restrict to App subnet CIDR), Security Group on DB (allow port 5432 from App SG only)
Isolated to Internet No route exists. Blocked by design, not by rules.

Notice the last row. The isolated subnet does not rely on a rule to block internet access. It has no route to the internet gateway. This is structural security: the capability does not exist, so it cannot be misconfigured. This is stronger than any firewall rule.

Further Reading

  • Control Subnet Traffic with Network Access Control Lists (AWS Documentation)
  • VPC with Servers in Private Subnets and NAT (AWS Documentation)
  • Infrastructure Security in Amazon VPC (AWS Documentation)
  • OWASP Top 10: 2021 (OWASP Foundation)

Assignment

Design a VPC for a three-tier web application. Your system has a public-facing load balancer, a fleet of application servers, and a PostgreSQL database cluster.

  1. Draw the VPC with three subnet tiers (public, private, isolated). Label CIDR ranges.
  2. For each boundary between tiers, specify the Security Group rules (protocol, port, source).
  3. Write NACL rules for the private subnet. Remember: NACLs are stateless. What happens if you forget ephemeral port ranges for return traffic?
  4. Where does the WAF sit in this architecture? What does it protect against that Security Groups cannot?
  5. An attacker compromises one application server. What can they reach? What can they not reach? Why?

Why Anti-Patterns Matter More Than Patterns

Security patterns tell you what to do. Security anti-patterns tell you what people actually do in practice, and why those things fail. The gap between the two is where breaches live. Most production security incidents are not caused by sophisticated zero-day exploits. They are caused by known mistakes that teams made under time pressure, then never went back to fix.

This session catalogs the most common security anti-patterns, maps them to the OWASP Top 10, and connects the underlying principle of least privilege back to the systems thinking concept of leverage points from Session 0.6.

Anti-Pattern 1: Secrets in Code

Database passwords, API keys, JWT signing secrets, cloud provider credentials. When they appear in source code, they end up in version control. Once in version control, they are in every developer's local clone, every CI/CD runner, every backup, and potentially every fork if the repo is public.

This is not a hypothetical risk. GitHub's secret scanning program detects millions of leaked credentials per year. Automated bots scrape public repositories and exploit found credentials within minutes of commit.

The fix: Secrets belong in a secrets manager (AWS Secrets Manager, HashiCorp Vault, or at minimum, environment variables injected at runtime). The application code references a secret name, not a secret value. Rotation happens in the secrets manager without code changes.
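A minimal sketch of "reference a name, not a value," assuming the secret is injected at runtime as an environment variable named `DB_PASSWORD` (the variable name and error message are illustrative):

```python
import os

def get_database_password() -> str:
    """Resolve the database password at runtime instead of hardcoding it.

    The environment variable name DB_PASSWORD is an assumption for this
    example; in production the value would be injected by a secrets
    manager (e.g. AWS Secrets Manager or Vault) at deploy time.
    """
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        # Fail fast: a missing secret is a deployment error, not a default.
        raise RuntimeError("DB_PASSWORD not set; refusing to start")
    return password

# Anti-pattern, for contrast: the secret value lives in the source tree.
# DB_PASSWORD = "hunter2"   # ends up in git history, clones, CI logs
```

Rotation then happens in the secrets manager: the next process start picks up the new value with no code change.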

Anti-Pattern 2: Over-Privileged IAM

A developer needs to read from one S3 bucket. The fastest way to unblock them: attach AdministratorAccess. It works. The ticket closes. Six months later, nobody remembers why that service account has full admin rights to every AWS service.

Over-privileged IAM is the most common security debt in cloud environments. It accumulates gradually. Each permission grant solves an immediate problem. Nobody schedules time to review and reduce permissions after the feature ships.

The principle of least privilege is a leverage point. It changes the structure of what's possible, not just what's allowed.

In systems thinking terms (Session 0.6), a leverage point is a place in the system where a small change produces a large effect. Least privilege is exactly that. By restricting what each identity can do, you change the blast radius of every possible failure mode. A compromised service with read-only access to one bucket is a minor incident. The same service with admin access is a catastrophe.

The fix: Start with zero permissions. Add only what is needed, scoped to specific resources. Use IAM Access Analyzer to identify unused permissions. Schedule quarterly permission reviews.
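As an illustration of "start with zero, add only what is needed," here is a sketch in Python that builds a read-only policy scoped to one bucket. The bucket name and the exact action list are assumptions for the example, not a template for any particular environment:

```python
import json

def least_privilege_s3_policy(bucket: str) -> dict:
    """Build an IAM policy document granting read-only access to one bucket.

    The point of the sketch: name specific actions and specific resources,
    never the "*" wildcard for either.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",   # object-level scope, same bucket
            ],
        }],
    }

# Hypothetical bucket name for illustration.
policy = least_privilege_s3_policy("reports-prod")
print(json.dumps(policy, indent=2))
```

Contrast this with `AdministratorAccess`: a compromise of a service holding this policy leaks one bucket's contents, nothing more.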

Anti-Pattern 3: Missing Input Validation

Every piece of data that crosses a trust boundary, whether it arrives from a user, an external API, or a message queue, must be validated before processing. Missing input validation is the root cause behind multiple OWASP Top 10 categories: injection, XSS, and SSRF.

SQL Injection: The Classic

SQL injection occurs when user input is concatenated directly into a SQL query string. The attacker provides input that changes the query's structure, not just its data.

```mermaid
flowchart TD
    A["User submits login form"] --> B["Application builds SQL query"]
    B --> C{"Input validated?"}
    C -->|"No: String concatenation"| D["Query: SELECT * FROM users<br/>WHERE name = '' OR '1'='1' --'"]
    D --> E["Database executes<br/>modified query"]
    E --> F["Returns ALL user records"]
    F --> G["Attacker gains<br/>unauthorized access"]
    C -->|"Yes: Parameterized query"| H["Query: SELECT * FROM users<br/>WHERE name = $1"]
    H --> I["Database treats input<br/>as data, not code"]
    I --> J["Returns 0 records<br/>(no matching user)"]
    style D fill:#3a2020,stroke:#c47a5a,color:#ede9e3
    style G fill:#3a2020,stroke:#c47a5a,color:#ede9e3
    style H fill:#1a2a1a,stroke:#6b8f71,color:#ede9e3
    style J fill:#1a2a1a,stroke:#6b8f71,color:#ede9e3
```

The fix is parameterized queries (prepared statements). The database driver separates the query structure from the data. User input can never alter the query's logic because it is treated as a value, not as SQL code. Every modern database library supports this. There is no valid reason to concatenate user input into SQL strings.
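The contrast fits in a few lines. This sketch uses Python's built-in `sqlite3` driver (the table and payload are illustrative), but every database library exposes the same placeholder mechanism:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

payload = "' OR '1'='1' --"   # classic injection payload

# Anti-pattern: concatenation lets the input rewrite the query itself.
# The quote in the payload closes the string literal, OR '1'='1'
# matches every row, and -- comments out the trailing quote.
unsafe = "SELECT * FROM users WHERE name = '" + payload + "'"
leaked = conn.execute(unsafe).fetchall()
print(leaked)   # every row returned, no valid username needed

# Fix: a parameterized query binds the payload strictly as a value.
safe = conn.execute(
    "SELECT * FROM users WHERE name = ?", (payload,)
).fetchall()
print(safe)     # [] -- no user is literally named ' OR '1'='1' --
```

The placeholder syntax varies by driver (`?` for SQLite, `%s` for psycopg, `$1` for PostgreSQL wire protocol), but the guarantee is identical: bound values can never change the query's structure.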

Cross-Site Scripting (XSS)

XSS occurs when user-supplied data is rendered in a browser without encoding. The attacker injects a script tag (or event handler) that executes in other users' browsers. Stored XSS persists in the database and fires for every user who views the affected page. Reflected XSS arrives via a crafted URL.

The fix: Output encoding appropriate to the context (HTML encoding for HTML content, JavaScript encoding for script contexts, URL encoding for URLs). Content Security Policy (CSP) headers provide a second layer by restricting which scripts the browser will execute.
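Python's standard library covers the two encodings mentioned above; a small sketch, with the comment string standing in for attacker-controlled input:

```python
import html
from urllib.parse import quote

# Hypothetical attacker-supplied comment.
comment = '<script>alert("pwned")</script>'

# HTML-encode when interpolating into HTML content: the browser renders
# the payload as inert text instead of executing it.
safe_html = html.escape(comment)
print(safe_html)

# URL-encode when interpolating into a URL component (safe="" also
# encodes slashes, appropriate for a single path or query segment).
safe_url = quote(comment, safe="")
print(safe_url)
```

The key discipline is matching the encoding to the output context; HTML encoding inside a `<script>` block, for example, does not help.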

OWASP Top 10 (2021): Detection and Remediation

The OWASP Top 10 is the industry standard reference for web application security risks. The 2021 edition reflects data from over 500,000 applications.

| Rank | Category | Detection | Fix |
|------|----------|-----------|-----|
| A01 | Broken Access Control | Penetration testing, automated access control tests | Deny by default, enforce server-side, disable directory listing |
| A02 | Cryptographic Failures | Code review, automated scanning for weak algorithms | Use strong algorithms (AES-256, RSA-2048+), encrypt at rest and in transit |
| A03 | Injection | SAST/DAST tools, code review for string concatenation | Parameterized queries, input validation, ORMs |
| A04 | Insecure Design | Threat modeling, architecture review | Security by design, threat modeling in planning phase |
| A05 | Security Misconfiguration | Configuration audits, cloud security posture tools | Hardened defaults, automated config scanning, remove unused features |
| A06 | Vulnerable Components | SCA tools (Snyk, Dependabot), CVE databases | Dependency scanning in CI/CD, automated updates, remove unused dependencies |
| A07 | Auth Failures | Credential stuffing detection, failed login monitoring | MFA, rate limiting, bcrypt/argon2 for passwords, session management |
| A08 | Integrity Failures | CI/CD pipeline audits, dependency verification | Signed artifacts, verified CI/CD pipelines, integrity checks |
| A09 | Logging Failures | Log audit, incident response drills | Centralized logging, alert on auth events, tamper-proof log storage |
| A10 | SSRF | DAST tools, URL validation testing | Allowlist for outbound URLs, block internal IP ranges, validate URL schemes |

Least Privilege as a System Structure

Most security discussions frame least privilege as a policy: "only grant the minimum permissions needed." This is correct but insufficient. The systems thinking perspective asks a deeper question: what structure makes over-granting permissions easy, and what structure makes it hard?

If your default IAM policy is permissive, every new service starts with too much access. Developers must actively work to reduce permissions. This is a reinforcing loop in the wrong direction: convenience drives more permissive defaults, which normalizes broad access, which makes it harder to justify restricting any single service.

If your default IAM policy is deny-all, every new service starts with zero access. Developers must explicitly request each permission. This is friction, but it is intentional friction. The system structure forces the right behavior. You do not rely on discipline. You rely on design.

This is the leverage point. Changing the default from "allow unless restricted" to "deny unless granted" does not change one permission. It changes every future permission decision. That is structural change, and it is far more durable than any policy document.

Further Reading

  • OWASP Top 10: 2021 (OWASP Foundation)
  • OWASP Top 10 Vulnerabilities (Snyk)
  • IAM Best Practices (AWS Documentation)
  • Secrets Detection and Remediation (GitGuardian)
  • Query Parameterization Cheat Sheet (OWASP)

Assignment

You are given access to a codebase for a web application. Perform a security review targeting three specific anti-patterns.

  1. Secrets audit. Identify 3 places where secrets might be hardcoded. Where would you look? (Hint: configuration files, environment setup scripts, test fixtures, Docker Compose files, CI/CD pipeline definitions.)
  2. Permission audit. Find 3 places where permissions might be over-granted. Consider IAM roles, database user privileges, file system permissions, and API token scopes.
  3. Input validation audit. Find 3 places where user input crosses a trust boundary without validation. Consider form submissions, API parameters, URL path segments, and file uploads.
  4. For each finding, classify the OWASP Top 10 category it falls under and propose a specific fix.

The Problem with "It Works"

A system can be running and still be failing. A request can succeed in 3 seconds when it should take 300 milliseconds. An error can occur once per thousand requests, invisible in aggregate metrics but catastrophic for the affected users. A memory leak can consume 1% more RAM per hour, unnoticed for days until the process crashes at 3 AM.

Without observability, you only know a system is broken when users tell you. By then, the damage is done. Observability is not about collecting data. It is about building the ability to ask new questions about your system's behavior without deploying new code.

Without observability, debugging a distributed system is archaeology.

The Three Pillars

Observability rests on three types of telemetry data. Each answers a different question. None is sufficient alone. Together, they provide the ability to understand why a system is behaving the way it is.

Metrics

Metrics are numeric measurements collected at regular intervals. CPU usage at 72%. Request rate at 1,200 per second. Error rate at 0.3%. P99 latency at 450ms. Metrics are cheap to store, fast to query, and excellent for dashboards and alerts.

Metrics answer the question: "Is something wrong?" They tell you that error rates spiked at 14:32, or that memory usage is trending upward. They do not tell you why. For that, you need logs and traces.

Logs

Logs are timestamped records of discrete events. A user logged in. A database query took 2.3 seconds. An API call returned a 500 error with a stack trace. Logs are detailed, high-volume, and essential for post-incident investigation.

Structured logging means emitting logs as key-value pairs or JSON objects rather than free-form strings. This matters for searchability. {"level":"error","service":"payment","user_id":"u-1234","msg":"charge failed","reason":"card_declined"} is searchable and filterable. Error: charge failed for user u-1234, card was declined requires regex and hope.
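A minimal JSON formatter using Python's standard `logging` module shows the idea; the field names and the hardcoded service name mirror the example above and are illustrative, not a schema recommendation:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object, one line per event."""

    # Context fields we promote from the record if a caller attached
    # them via the `extra=` keyword (names are assumptions for the sketch).
    CONTEXT_FIELDS = ("user_id", "reason", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname.lower(),
            "service": "payment",          # assumed service name
            "msg": record.getMessage(),
        }
        for key in self.CONTEXT_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Emits one filterable JSON line, not a free-form string.
log.error("charge failed", extra={"user_id": "u-1234", "reason": "card_declined"})
```

With every service emitting this shape, "show me all `card_declined` errors for `user_id=u-1234`" is a filter, not a regex hunt.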

Traces

A trace follows a single request as it moves through multiple services. When a user clicks "Place Order," that request might touch the API gateway, the order service, the inventory service, the payment service, and the notification service. A trace connects all of those interactions into a single story with a shared trace ID.

Each service's contribution to the request is a span. Spans have a start time, duration, parent span, and metadata. The full trace is a tree of spans that shows exactly where time was spent and where errors occurred.
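A toy model of that span tree makes the structure concrete; the service names and timings are hypothetical, loosely following the order-placement flow described above:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """Minimal span model: one service's contribution to a request."""
    name: str
    start_ms: int
    duration_ms: int
    children: list["Span"] = field(default_factory=list)

def slowest_path(span: Span) -> list[str]:
    """Follow the longest-running child at each level of the tree:
    a toy answer to 'where did the time go?'."""
    path = [span.name]
    while span.children:
        span = max(span.children, key=lambda s: s.duration_ms)
        path.append(span.name)
    return path

# Hypothetical trace: gateway -> order service -> payment service.
trace = Span("api-gateway", 0, 400, [
    Span("order-service", 20, 350, [
        Span("payment-service", 80, 220),
    ]),
])
print(slowest_path(trace))
```

Real tracing systems store exactly this shape (plus metadata) per request, which is why trace storage is priced per span.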

Pillar Comparison

| Dimension | Metrics | Logs | Traces |
|-----------|---------|------|--------|
| What it captures | Numeric measurements over time | Discrete events with context | Request flow across services |
| Question it answers | "Is something wrong?" | "What exactly happened?" | "Where did time go?" |
| Granularity | Aggregated (per-minute, per-second) | Per-event | Per-request |
| Volume | Low (time series points) | Very high | Medium (sampled) |
| Storage cost | Low | High | Medium |
| Best for | Dashboards, alerting, SLO tracking | Debugging, auditing, compliance | Latency analysis, dependency mapping |
| Common tools | Prometheus, CloudWatch, Datadog | ELK Stack, Loki, CloudWatch Logs | Jaeger, Zipkin, AWS X-Ray |

Distributed Tracing with Correlation IDs

In a monolith, a stack trace shows you the full execution path. In a distributed system, the execution path crosses process and network boundaries. Stack traces stop at each service's boundary. You need a way to stitch the story back together.

A correlation ID (or trace ID) is a unique identifier generated at the entry point of a request and propagated through every service call. Each service includes this ID in its logs and passes it to downstream services via HTTP headers (typically X-Request-ID or the W3C traceparent header).

```mermaid
sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    Note over U,PS: Trace ID: abc-123
    U->>GW: POST /orders
    Note right of GW: Generate trace ID: abc-123<br/>Span A starts
    GW->>OS: Create order<br/>Header: traceparent=abc-123
    Note right of OS: Span B starts (parent: A)
    OS->>PS: Charge card<br/>Header: traceparent=abc-123
    Note right of PS: Span C starts (parent: B)
    PS-->>OS: Payment confirmed
    Note right of PS: Span C ends (220ms)
    OS-->>GW: Order created
    Note right of OS: Span B ends (350ms)
    GW-->>U: 201 Created
    Note right of GW: Span A ends (400ms)
```

With this trace, you can see that the total request took 400ms, the order service spent 350ms, and 220ms of that was waiting for the payment service. If latency increases, you can pinpoint exactly which service is responsible.
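The propagation mechanics are tiny. A sketch of the two operations every service performs, using the `X-Request-ID` header named above (UUIDv4 as the ID format is an assumption of this example; the W3C traceparent format is more structured):

```python
import uuid

def incoming_trace_id(headers: dict) -> str:
    """Reuse the caller's X-Request-ID if present; otherwise this
    service is the entry point and mints a fresh ID."""
    return headers.get("X-Request-ID") or str(uuid.uuid4())

def outgoing_headers(trace_id: str) -> dict:
    """Every downstream call carries the same ID so each service's
    logs and spans can be joined back into one story."""
    return {"X-Request-ID": trace_id}

# Gateway receives an external request with no trace header,
# generates one, and propagates it to the order service:
tid = incoming_trace_id({})
downstream = outgoing_headers(tid)
print(downstream)
```

Every log line each service emits should include the same ID, which is what makes "grep the whole fleet for abc-123" possible.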

Log Volume Distribution

Not all log entries are equally important, but the volume distribution follows a predictable pattern. Most logs are informational. A small fraction represent actual problems. Understanding this distribution helps you design log storage, set retention policies, and configure alerts.

The implication: if you alert on every log entry, you drown in noise. If you only alert on CRITICAL, you miss the ERROR entries that are precursors to outages. Effective alerting uses metrics (error rate exceeding a threshold) rather than individual log entries. Logs are for investigation after the alert fires.
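That policy (alert on rates, investigate with logs) can be sketched as a sliding-window check; the window size and threshold below are illustrative, not recommendations:

```python
from collections import deque

class ErrorRateAlert:
    """Sliding-window error-rate check: fire when the fraction of
    failed requests among the last N observations crosses a threshold.

    A single ERROR log line never fires this; a sustained elevation does.
    """
    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.samples: deque = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, is_error: bool) -> bool:
        """Record one request outcome; return True if the alert should fire."""
        self.samples.append(is_error)
        rate = sum(self.samples) / len(self.samples)
        return rate > self.threshold

alert = ErrorRateAlert(window=100, threshold=0.05)
fired = False
for i in range(100):
    fired = alert.observe(i % 10 == 0)   # simulate a 10% failure rate
print(fired)  # True: 10% sustained exceeds the 5% threshold
```

Real systems do this in the metrics layer (e.g. a PromQL rate expression in Alertmanager) rather than in application code, but the logic is the same.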

The Observability Stack

A complete observability stack needs tools for each pillar, plus a way to correlate between them. The most common open-source combination:

  • Prometheus for metrics. Pull-based collection, time-series database, PromQL query language. Excellent for alerting via Alertmanager.
  • Grafana for visualization. Connects to Prometheus, Elasticsearch, Loki, and 100+ data sources. Dashboards, alerts, and exploration.
  • Jaeger (or Zipkin) for distributed tracing. Collects spans from instrumented services, visualizes trace timelines, identifies latency bottlenecks.
  • ELK Stack (Elasticsearch, Logstash, Kibana) or Loki for logs. Centralized log aggregation, full-text search, filtering.

Commercial alternatives (Datadog, New Relic, Splunk) bundle all three pillars into a single platform. The trade-off is cost versus operational complexity. Running your own Prometheus, Grafana, and ELK cluster requires engineering effort. Paying for a managed platform requires budget.

Structured Logging Best Practices

Every log entry should include, at minimum: timestamp, severity level, service name, trace ID (if available), and a structured message. Beyond that, include any context that will help during debugging.

Do not log sensitive data: passwords, tokens, credit card numbers, personally identifiable information. Log redaction is easier to implement at the point of emission than after the fact. Build it into your logging library, not into your log pipeline.
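A sketch of redaction at the point of emission; the two patterns are illustrative stand-ins for a real, maintained redaction list:

```python
import re

# Illustrative patterns only; a production list would live alongside
# the shared logging library and be reviewed like any other code.
REDACTIONS = [
    # Runs of 13-16 digits with optional separators: card-number-shaped.
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
    # key=value secrets such as password=... or token=...
    (re.compile(r"(password|token)=\S+", re.IGNORECASE), r"\1=[REDACTED]"),
]

def redact(message: str) -> str:
    """Scrub sensitive values before the entry is emitted.

    This runs at the point of logging, not in the pipeline: once a
    secret reaches the log store, it has already leaked."""
    for pattern, replacement in REDACTIONS:
        message = pattern.sub(replacement, message)
    return message

print(redact("retry with password=hunter2"))  # retry with password=[REDACTED]
```

Hooking `redact` into the shared formatter (rather than trusting each call site) is what makes the guarantee structural instead of disciplinary.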

Use consistent field names across services. If one service logs user_id and another logs userId and a third logs uid, correlating across services becomes painful. Agree on a schema and enforce it.

Further Reading

  • The Three Pillars of Observability (O'Reilly, Distributed Systems Observability)
  • Three Pillars of Observability: Logs, Metrics and Traces (IBM)
  • Comparing ELK, Grafana, and Prometheus for Observability (Last9)
  • What Is Observability and How Does It Work? (Datadog)
  • The Three Pillars of Observability (CrowdStrike)

Assignment

A user reports: "The app is slow." You have no observability in place. No metrics, no centralized logs, no traces. All you have is SSH access to the servers.

  1. Describe how you would debug this without observability. What commands would you run? What files would you check? How long would it take?
  2. Now design a minimum viable observability stack. Specify:
    • 3 metrics to collect (what, from where, alert threshold)
    • 2 structured log fields to add to every log entry beyond timestamp and message
    • 1 trace configuration: which request path to instrument first and why
  3. A developer argues that adding tracing will slow down the application. How do you respond? What is the typical performance overhead of distributed tracing with sampling?
© Ibrahim Anwar · Bogor, West Java
This work is licensed under CC BY 4.0