Backup & Recovery
Session 4.4 · ~5 min read
The Foundation of Recovery
Disaster recovery strategies (Session 4.3) depend on one thing: having a usable copy of your data. Backups are that copy. Without reliable backups, no DR strategy works. Without tested restores, backups are meaningless.
This session covers the mechanics of backing up data: what types of backups exist, how to schedule them, how to retain them, and most importantly, how to verify that they actually work when you need them.
A backup you haven't tested restoring is not a backup. It's a hope. Until you have successfully restored from a backup, you have no evidence that it will work when it matters.
Three Types of Backups
All backup strategies are built from three fundamental types. Each copies a different amount of data, takes a different amount of time, and requires a different process to restore.
Full Backup
A full backup copies every file, every record, every byte. It is the simplest to understand and the simplest to restore. You need only one backup set to recover completely. The downside is obvious: it takes the longest to create, uses the most storage, and puts the highest load on the source system.
Incremental Backup
An incremental backup copies only the data that has changed since the last backup of any type. Monday's incremental contains changes since Sunday's full. Tuesday's incremental contains only changes since Monday's incremental. Each incremental is small and fast to create.
The trade-off appears at restore time. To restore from Wednesday's state, you need Sunday's full backup, then Monday's incremental, then Tuesday's incremental, then Wednesday's incremental, applied in order. If any backup in the chain is corrupted, the restore fails. The more increments in the chain, the higher the risk and the longer the restore takes.
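To make the chain dependency concrete, here is a minimal Python sketch of an incremental restore. Each backup is modeled as a dict of changed files; real tools operate on blocks or byte ranges, but the ordering logic is the same.

```python
# Sketch: restoring from a full backup plus an ordered chain of incrementals.
# Backups are modeled as {filename: contents} dicts for illustration.

def restore_incremental(full, incrementals):
    """Apply the full backup, then each incremental in chronological order."""
    state = dict(full)
    for inc in incrementals:   # every link in the chain is required
        state.update(inc)      # later changes overwrite earlier ones
    return state

full_sunday = {"a.txt": "v1", "b.txt": "v1"}
inc_monday  = {"a.txt": "v2"}   # only Monday's changes
inc_tuesday = {"c.txt": "v1"}   # only Tuesday's changes

state = restore_incremental(full_sunday, [inc_monday, inc_tuesday])
# state == {"a.txt": "v2", "b.txt": "v1", "c.txt": "v1"}
```

Note that skipping or corrupting any element of the `incrementals` list silently produces the wrong state: that is the chain-corruption risk in miniature.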
Differential Backup
A differential backup copies all data that has changed since the last full backup. Monday's differential contains changes since Sunday's full. Tuesday's differential also contains all changes since Sunday's full (including Monday's changes again). Each differential grows larger as the week progresses, but restoring requires only two pieces: the last full backup and the most recent differential.
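The two-piece restore can be sketched the same way in Python, modeling backups as dicts of changed files (a simplification; real tools work at the block level):

```python
# Sketch: a differential restore needs only the last full backup and the
# most recent differential, no matter how many days have passed.

def restore_differential(full, latest_differential):
    state = dict(full)
    state.update(latest_differential)   # one update, no chain to walk
    return state

full_sunday  = {"a.txt": "v1", "b.txt": "v1"}
diff_tuesday = {"a.txt": "v2", "c.txt": "v1"}   # ALL changes since Sunday

state = restore_differential(full_sunday, diff_tuesday)
# state == {"a.txt": "v2", "b.txt": "v1", "c.txt": "v1"}
```

Older differentials (Monday's, in this example) can be discarded or ignored at restore time, which is why the chain-corruption risk is low.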
| Dimension | Full | Incremental | Differential |
|---|---|---|---|
| What it copies | Everything | Changes since last backup (any type) | Changes since last full backup |
| Backup speed | Slowest | Fastest | Medium (grows over time) |
| Storage required | Highest | Lowest per backup | Medium (cumulative growth) |
| Restore speed | Fastest (single backup set) | Slowest (full + all increments) | Medium (full + one differential) |
| Restore complexity | Low | High (chain dependency) | Low (two files) |
| Risk of chain corruption | None | High (one bad link breaks the chain) | Low |
| System load during backup | High | Low | Medium |
Backup Scheduling and Rotation
Most production systems use a combination of backup types on a rotation schedule. The classic pattern is the Grandfather-Father-Son (GFS) scheme: daily incrementals (Son), weekly full backups (Father), and monthly archive copies (Grandfather).
The retention policy determines how long each tier is kept. A common policy: daily backups retained for 7 days, weekly backups retained for 4 weeks, monthly archives retained for 12 months, yearly archives retained for 7 years (for compliance). The right retention depends on your regulatory requirements, storage budget, and the likelihood of needing historical data.
Retention Policies
| Tier | Frequency | Retention | Storage Tier | Purpose |
|---|---|---|---|---|
| Daily (Son) | Every night | 7 days | Hot (S3 Standard, local disk) | Recover from recent errors or deletions |
| Weekly (Father) | Every Sunday | 4 weeks | Warm (S3 Infrequent Access) | Recover from issues discovered days later |
| Monthly (Grandfather) | Last day of month | 12 months | Cold (S3 Glacier) | Compliance, audit, historical reference |
| Yearly Archive | Dec 31 | 7 years | Deep Archive (S3 Glacier Deep) | Regulatory compliance (SOX, GDPR, HIPAA) |
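The retention policy in the table reduces to a simple age check per tier. Here is a hedged Python sketch of the pruning logic; a production system would also confirm that a newer restorable backup exists before deleting anything.

```python
# Sketch: expiring backups according to the GFS retention tiers above.
# Cutoffs follow the example policy in this section, not a universal rule.
from datetime import date, timedelta

RETENTION = {
    "daily":   timedelta(days=7),
    "weekly":  timedelta(weeks=4),
    "monthly": timedelta(days=365),
    "yearly":  timedelta(days=7 * 365),
}

def is_expired(tier, taken_on, today):
    """True if a backup in this tier is older than its retention window."""
    return today - taken_on > RETENTION[tier]

today = date(2024, 6, 15)
assert is_expired("daily", date(2024, 6, 1), today)       # 14 days old
assert not is_expired("weekly", date(2024, 6, 1), today)  # within 4 weeks
```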
Point-in-Time Recovery (PITR)
Traditional backups give you snapshots at fixed intervals. Point-in-time recovery gives you any moment in between. PITR works by combining a base backup with a continuous stream of transaction logs (called write-ahead logs in PostgreSQL, binary logs in MySQL, or redo logs in Oracle).
To restore to 2:47 PM last Tuesday, the system restores the most recent full backup before that time, then replays the transaction log up to exactly 2:47 PM. The result is an exact replica of the database at that specific moment.
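The replay logic can be sketched in Python. The transaction log is modeled here as timestamped key/value writes; real databases replay WAL or binlog records, but the cutoff rule is the same.

```python
# Sketch: PITR as "base backup + replay transaction log up to time T".
from datetime import datetime

def restore_to(base_backup, transaction_log, target_time):
    """Replay logged writes, in order, stopping at the target moment."""
    state = dict(base_backup)
    for ts, key, value in transaction_log:
        if ts > target_time:
            break               # stop replaying at the target moment
        state[key] = value
    return state

base = {"balance": 100}
log = [
    (datetime(2024, 6, 4, 14, 30), "balance", 120),
    (datetime(2024, 6, 4, 14, 50), "balance", 0),   # the buggy write
]
state = restore_to(base, log, datetime(2024, 6, 4, 14, 47))
# state == {"balance": 120} -- the moment just before the bug ran
```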
PITR is essential for recovering from application bugs that corrupt data. A regular backup might have been taken after the corruption occurred. With PITR, you can restore to the moment just before the bug ran. Most managed database services (Amazon RDS, Aurora, Azure SQL, Cloud SQL) offer PITR with a configurable retention window, typically 1 to 35 days.
Restoration Testing
The most common failure mode in backup systems is not the backup failing. It is the restore failing. Backups that cannot be restored are worse than no backups at all, because they create a false sense of security.
Reasons restores fail include:
- Backup files are corrupted, but no checksum validation was ever performed.
- The backup software version on the restore target differs from the source.
- The restore process requires credentials or configuration that nobody documented.
- The storage format has changed since the backup was taken.
- The backup is incomplete: some files or tables were excluded accidentally.
The only way to catch these problems is to test restores regularly. A good restoration test answers five questions:
- Can you actually restore the data?
- Is the restored data complete and consistent?
- How long does the restore take?
- Does the restored system function correctly?
- Can someone other than the person who created the backup perform the restore?
Systems Thinking Lens
Backup strategy involves competing feedback loops. Frequent backups reduce RPO (good) but increase storage costs and system load (bad). Longer retention provides more recovery options (good) but increases storage costs and compliance surface area (bad). The optimal strategy balances these loops based on the value of the data and the cost of losing it.
There is also a dangerous delay in the feedback loop. Backups that are never tested provide no feedback about their quality until a disaster occurs. By then, the feedback arrives too late. Regular restoration testing closes this loop and surfaces problems while there is still time to fix them.
Further Reading
- AWS: Incremental vs Differential vs Other Backups. Clear comparison of backup types with diagrams showing data flow.
- Acronis: Incremental vs Differential Backups. Practical guide with performance benchmarks and use-case recommendations.
- TechTarget: How to Choose the Correct Backup Type. Decision framework for selecting backup strategies based on RTO, RPO, and storage constraints.
- AWS: Point-in-Time Recovery for Amazon RDS. How PITR works in a managed database service, including retention and restore procedures.
Assignment
Your team runs a production PostgreSQL database with 500 GB of data. Daily backups are taken at midnight and stored in S3. The team has never tested a restore.
- Write a 5-step restoration test plan. Include: how you will create a test environment, how you will restore the backup, how you will validate the restored data, how you will measure restore time, and how you will document the results.
- What is the scariest possible discovery during the test? List three things that could go wrong, and for each one, describe what it would mean for your recovery capability.
- The current RPO is 24 hours (one daily backup). The business now wants RPO under 1 hour. What changes to the backup strategy would you make? Consider PITR, incremental frequency, and cost implications.
- Design a retention policy for this database. State how long each tier is kept and justify each choice.