Backup & Recovery
Session 4.4 · ~5 min read
The Foundation of Recovery
Disaster recovery strategies (Session 4.3) depend on one thing: having a usable copy of your data. Backups are that copy. Without reliable backups, no DR strategy works. Without tested restores, backups are meaningless.
This session covers the mechanics of backing up data: what types of backups exist, how to schedule them, how to retain them, and most importantly, how to verify that they actually work when you need them.
A backup you haven't tested restoring is not a backup. It's a hope. Until you have successfully restored from a backup, you have no evidence that it will work when it matters.
Three Types of Backups
All backup strategies are built from three fundamental types. Each copies a different amount of data, takes a different amount of time, and requires a different process to restore.
Full Backup
A full backup copies every file, every record, every byte. It is the simplest to understand and the simplest to restore. You need only one backup set to recover completely. The downside is obvious: it takes the longest to create, uses the most storage, and puts the highest load on the source system.
Incremental Backup
An incremental backup copies only the data that has changed since the last backup of any type. Monday's incremental contains changes since Sunday's full. Tuesday's incremental contains only changes since Monday's incremental. Each incremental is small and fast to create.
The trade-off appears at restore time. To restore from Wednesday's state, you need Sunday's full backup, then Monday's incremental, then Tuesday's incremental, then Wednesday's incremental, applied in order. If any backup in the chain is corrupted, the restore fails. The more increments in the chain, the higher the risk and the longer the restore takes.
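To make the chain dependency concrete, here is a minimal Python sketch of an incremental restore. Each backup is modeled as a dict of changed files; real tools operate on blocks or byte ranges, but the ordering logic is the same.

```python
# Sketch: restoring from a full backup plus an ordered chain of incrementals.
# Backups are modeled as {filename: contents} dicts for illustration.

def restore_incremental(full, incrementals):
    """Apply the full backup, then each incremental in chronological order."""
    state = dict(full)
    for inc in incrementals:   # every link in the chain is required
        state.update(inc)      # later changes overwrite earlier ones
    return state

full_sunday = {"a.txt": "v1", "b.txt": "v1"}
inc_monday  = {"a.txt": "v2"}   # only Monday's changes
inc_tuesday = {"c.txt": "v1"}   # only Tuesday's changes

state = restore_incremental(full_sunday, [inc_monday, inc_tuesday])
# state == {"a.txt": "v2", "b.txt": "v1", "c.txt": "v1"}
```

Note that skipping or corrupting any element of the `incrementals` list silently produces the wrong state: that is the chain-corruption risk in miniature.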
Differential Backup
A differential backup copies all data that has changed since the last full backup. Monday's differential contains changes since Sunday's full. Tuesday's differential also contains all changes since Sunday's full (including Monday's changes again). Each differential grows larger as the week progresses, but restoring requires only two pieces: the last full backup and the most recent differential.
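The two-piece restore can be sketched the same way in Python, modeling backups as dicts of changed files (a simplification; real tools work at the block level):

```python
# Sketch: a differential restore needs only the last full backup and the
# most recent differential, no matter how many days have passed.

def restore_differential(full, latest_differential):
    state = dict(full)
    state.update(latest_differential)   # one update, no chain to walk
    return state

full_sunday  = {"a.txt": "v1", "b.txt": "v1"}
diff_tuesday = {"a.txt": "v2", "c.txt": "v1"}   # ALL changes since Sunday

state = restore_differential(full_sunday, diff_tuesday)
# state == {"a.txt": "v2", "b.txt": "v1", "c.txt": "v1"}
```

Older differentials (Monday's, in this example) can be discarded or ignored at restore time, which is why the chain-corruption risk is low.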
| Dimension | Full | Incremental | Differential |
|---|---|---|---|
| What it copies | Everything | Changes since last backup (any type) | Changes since last full backup |
| Backup speed | Slowest | Fastest | Medium (grows over time) |
| Storage required | Highest | Lowest per backup | Medium (cumulative growth) |
| Restore speed | Fastest (single backup set) | Slowest (full + all increments) | Medium (full + one differential) |
| Restore complexity | Low | High (chain dependency) | Low (two files) |
| Risk of chain corruption | None | High (one bad link breaks the chain) | Low |
| System load during backup | High | Low | Medium |
Backup Scheduling and Rotation
Most production systems use a combination of backup types on a rotation schedule. The classic pattern is the Grandfather-Father-Son (GFS) scheme: daily incrementals (Son), weekly full backups (Father), and monthly archive copies (Grandfather).
The retention policy determines how long each tier is kept. A common policy: daily backups retained for 7 days, weekly backups retained for 4 weeks, monthly archives retained for 12 months, yearly archives retained for 7 years (for compliance). The right retention depends on your regulatory requirements, storage budget, and the likelihood of needing historical data.
Retention Policies
| Tier | Frequency | Retention | Storage Tier | Purpose |
|---|---|---|---|---|
| Daily (Son) | Every night | 7 days | Hot (S3 Standard, local disk) | Recover from recent errors or deletions |
| Weekly (Father) | Every Sunday | 4 weeks | Warm (S3 Infrequent Access) | Recover from issues discovered days later |
| Monthly (Grandfather) | Last day of month | 12 months | Cold (S3 Glacier) | Compliance, audit, historical reference |
| Yearly Archive | Dec 31 | 7 years | Deep Archive (S3 Glacier Deep) | Regulatory compliance (SOX, GDPR, HIPAA) |
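The retention policy in the table reduces to a simple age check per tier. Here is a hedged Python sketch of the pruning logic; a production system would also confirm that a newer restorable backup exists before deleting anything.

```python
# Sketch: expiring backups according to the GFS retention tiers above.
# Cutoffs follow the example policy in this section, not a universal rule.
from datetime import date, timedelta

RETENTION = {
    "daily":   timedelta(days=7),
    "weekly":  timedelta(weeks=4),
    "monthly": timedelta(days=365),
    "yearly":  timedelta(days=7 * 365),
}

def is_expired(tier, taken_on, today):
    """True if a backup in this tier is older than its retention window."""
    return today - taken_on > RETENTION[tier]

today = date(2024, 6, 15)
assert is_expired("daily", date(2024, 6, 1), today)       # 14 days old
assert not is_expired("weekly", date(2024, 6, 1), today)  # within 4 weeks
```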
Point-in-Time Recovery (PITR)
Traditional backups give you snapshots at fixed intervals. Point-in-time recovery gives you any moment in between. PITR works by combining a base backup with a continuous stream of transaction logs (called write-ahead logs in PostgreSQL, binary logs in MySQL, or redo logs in Oracle).
To restore to 2:47 PM last Tuesday, the system restores the most recent full backup before that time, then replays the transaction log up to exactly 2:47 PM. The result is an exact replica of the database at that specific moment.
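The replay logic can be sketched in Python. The transaction log is modeled here as timestamped key/value writes; real databases replay WAL or binlog records, but the cutoff rule is the same.

```python
# Sketch: PITR as "base backup + replay transaction log up to time T".
from datetime import datetime

def restore_to(base_backup, transaction_log, target_time):
    """Replay logged writes, in order, stopping at the target moment."""
    state = dict(base_backup)
    for ts, key, value in transaction_log:
        if ts > target_time:
            break               # stop replaying at the target moment
        state[key] = value
    return state

base = {"balance": 100}
log = [
    (datetime(2024, 6, 4, 14, 30), "balance", 120),
    (datetime(2024, 6, 4, 14, 50), "balance", 0),   # the buggy write
]
state = restore_to(base, log, datetime(2024, 6, 4, 14, 47))
# state == {"balance": 120} -- the moment just before the bug ran
```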
PITR is essential for recovering from application bugs that corrupt data. A regular backup might have been taken after the corruption occurred. With PITR, you can restore to the moment just before the bug ran. Most managed database services (Amazon RDS, Aurora, Azure SQL, Cloud SQL) offer PITR with a configurable retention window, typically 1 to 35 days.
Restoration Testing
The most common failure mode in backup systems is not the backup failing. It is the restore failing. Backups that cannot be restored are worse than no backups at all, because they create a false sense of security.
Reasons restores fail include:
- Backup files are corrupted, but no checksum validation was ever performed.
- The backup software version on the restore target differs from the source.
- The restore process requires credentials or configuration that nobody documented.
- The storage format has changed since the backup was taken.
- The backup is incomplete: some files or tables were excluded accidentally.
The only way to catch these problems is to test restores regularly. A good restoration test answers five questions:
- Can you actually restore the data?
- Is the restored data complete and consistent?
- How long does the restore take?
- Does the restored system function correctly?
- Can someone other than the person who created the backup perform the restore?
Systems Thinking Lens
Backup strategy involves competing feedback loops. Frequent backups reduce RPO (good) but increase storage costs and system load (bad). Longer retention provides more recovery options (good) but increases storage costs and compliance surface area (bad). The optimal strategy balances these loops based on the value of the data and the cost of losing it.
There is also a dangerous delay in the feedback loop. Backups that are never tested provide no feedback about their quality until a disaster occurs. By then, the feedback arrives too late. Regular restoration testing closes this loop and surfaces problems while there is still time to fix them.
Further Reading
- AWS: Incremental vs Differential vs Other Backups. Clear comparison of backup types with diagrams showing data flow.
- Acronis: Incremental vs Differential Backups. Practical guide with performance benchmarks and use-case recommendations.
- TechTarget: How to Choose the Correct Backup Type. Decision framework for selecting backup strategies based on RTO, RPO, and storage constraints.
- AWS: Point-in-Time Recovery for Amazon RDS. How PITR works in a managed database service, including retention and restore procedures.
Assignment
Your team runs a production PostgreSQL database with 500 GB of data. Daily backups are taken at midnight and stored in S3. The team has never tested a restore.
- Write a 5-step restoration test plan. Include: how you will create a test environment, how you will restore the backup, how you will validate the restored data, how you will measure restore time, and how you will document the results.
- What is the scariest possible discovery during the test? List three things that could go wrong, and for each one, describe what it would mean for your recovery capability.
- The current RPO is 24 hours (one daily backup). The business now wants RPO under 1 hour. What changes to the backup strategy would you make? Consider PITR, incremental frequency, and cost implications.
- Design a retention policy for this database. State how long each tier is kept and justify each choice.