10.1 Backup Strategy and Planning
Content
- Overview and Objectives
- Data Classification and Criticality Assessment
- Recovery Time and Recovery Point Objectives
- Backup Types: Full, Incremental, and Differential
- Backup Scheduling and Retention Policies
- Disaster Recovery Planning Considerations
- Real-world context
- Common pitfalls
- Recommended reading
- Assessment
Overview and Objectives
Most backup systems fail not because of bad tools, but because nobody asked the right questions before buying storage. Which data matters? How old can a backup be and still be useful? How long can a service be down before it becomes a serious problem? This chapter is about those questions — the decisions that determine whether your backup system actually works when you need it.
The tactical tools come later. Here the focus is on the “why” and “what” that drive all the technical choices: which data to protect, how much loss is acceptable, and what recovery looks like in practice.
Learning Objectives
By completing this section, you will be able to:
- Classify data by criticality and explain why that conversation requires business stakeholders, not just technical teams
- Define RTO and RPO, explain what drives each value, and describe how they translate into infrastructure requirements
- Choose between full, incremental, and differential backup strategies based on storage, speed, and recovery complexity tradeoffs
- Design a retention schedule using the GFS model and explain how retention policy relates to storage capacity planning
- Describe how the 3-2-1-1-0 rule addresses ransomware scenarios and what each component protects against
Data Classification and Criticality Assessment
Not all data is worth the same effort to protect. Treating everything as equally important wastes resources on low-value files and can leave critical systems under-protected. Classification forces you to be explicit about what matters and why.
Start by separating data into a few broad types: operational data that drives daily business processes, compliance data required by regulation, and historical data that’s useful for reference but isn’t time-sensitive to restore. Those categories have very different backup requirements.
The tricky part is that sysadmins usually can’t make classification decisions alone. You might know the architecture, but you probably don’t know which system’s failure would cost the company money within the hour and which would just be annoying. That conversation has to involve the people who run the business.
A typical web hosting company makes a useful example. Customer website files are revenue-generating — they need frequent backups and fast recovery. Billing databases are both operational and compliance data. Email is important but can usually tolerate a longer outage than customer-facing services. Log files are useful for troubleshooting, but they’re low priority in a disaster.
A simple four-tier framework covers most situations:
- Mission-critical: directly generates revenue or serves customers; outages have immediate business impact
- Business-essential: supports important functions; can tolerate short outages without serious consequences
- Important: valuable but can wait longer during recovery
- Standard: logs, reference archives, convenience data; lowest recovery priority
One thing classification often misses: dependencies. A customer database might be mission-critical, but so is the application server that reads it, the config files that define its behavior, and the network path that connects them. You’re not classifying files — you’re classifying functional systems. Protect the whole chain, not just the obvious piece.
Write your classifications down, including the reasoning. Business priorities shift, regulations change, and a year from now the next person on the team needs to understand why decisions were made. Schedule a review at least annually.
Recovery Time and Recovery Point Objectives
RTO and RPO are the two numbers that translate “we need good backups” into actual technical requirements.
Recovery Time Objective (RTO) is the maximum acceptable downtime from the moment something breaks until the system is back in service. It covers everything: restoration, reconfiguration, startup, testing. A 4-hour RTO means the service must be running within 4 hours of the failure, full stop.
Recovery Point Objective (RPO) is how much data you can afford to lose, expressed as time. An RPO of 1 hour means your backups must capture data at least every hour — if a failure happens at 3:50 PM and your last backup was at 3:00 PM, you’ve lost 50 minutes of work. That’s acceptable. An RPO of 24 hours means daily backups are fine.
Short RTO usually requires short RPO too. If a system has to be back in 15 minutes, you can’t be restoring from a backup that’s 8 hours old — the data would be too stale to be useful. Very aggressive targets (minutes, not hours) typically require continuous replication or clustered systems rather than traditional backup/restore workflows.
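The RPO arithmetic above is simple enough to sketch in code. Here is a minimal illustration in Python (the function name and timestamps are ours, not any real tool’s API): given the time of the last completed backup, is the worst-case data loss still within the objective?

```python
from datetime import datetime, timedelta

def rpo_satisfied(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the newest backup is recent enough to meet the RPO.

    The worst-case data loss after a failure at `now` is the age of the
    last completed backup, so that age must not exceed the RPO.
    """
    return (now - last_backup) <= rpo

# The example from the text: failure at 3:50 PM, last backup at 3:00 PM,
# 1-hour RPO. 50 minutes of potential loss is within the objective.
failure = datetime(2024, 6, 1, 15, 50)
last = datetime(2024, 6, 1, 15, 0)
print(rpo_satisfied(last, failure, timedelta(hours=1)))  # True
```

A monitoring check built on this idea is often more valuable than the backup job’s own exit code: it catches the case where the job silently stopped running days ago.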
Setting these values requires understanding what downtime actually costs. How much revenue does the company lose per hour? Are there regulatory penalties for outages? What’s the impact on customers? Those answers come from business stakeholders, not from the server room.
Some examples to calibrate expectations:
- A SaaS project management tool might target an RTO of 1 hour and an RPO of 15 minutes. Customers can’t work; every hour down is churn risk and support tickets.
- A stock trading platform might need RTO and RPO in seconds — outages during market hours mean real financial losses and regulatory scrutiny.
- A manufacturing plant’s production control systems might need a 5-minute RTO, while the HR system can be down for a day with no serious consequences.
The pattern is the same everywhere: more critical systems justify more expensive infrastructure. There’s no universal right answer — just the tradeoff between what downtime costs and what prevention costs.
RTO vs RPO — a common mix-up: RTO measures time to restore service after a failure; RPO measures how much data you can afford to lose (expressed as time since the last backup). A system with an RTO of 4 hours and an RPO of 1 hour must be back online within 4 hours and must have captured data no older than 1 hour before the failure — these are independent targets that each drive different infrastructure decisions.
Aggressive objectives cost money. Very short RTOs and RPOs typically require redundant systems, fast storage, and continuous replication. Part of this work is helping the organization understand the tradeoff between protection cost and outage cost — most systems don’t need the expensive option.
Backup Types: Full, Incremental, and Differential
Three backup types cover most situations. The right choice depends on how much storage you have, how long backups can take, and how quickly you need to restore.
Full backups copy everything, every time. Recovery is simple — you need exactly one backup set. The downside is obvious: they’re slow and expensive on storage. Running daily fulls on a multi-terabyte file server usually isn’t practical.
Incremental backups capture only what changed since the last backup of any type. They’re fast and storage-efficient, which makes daily or even hourly schedules practical. The catch is recovery: to restore from Friday, you need the Sunday full backup plus every incremental from Monday through Friday. If the Wednesday incremental is corrupted or missing, Thursday and Friday are gone too — even though those backups ran fine.
Differential backups capture everything changed since the last full backup, regardless of any incrementals in between. Recovery always needs exactly two things: the last full and the latest differential. Simpler than incrementals, but differentials grow through the week — by Friday, the differential might be almost as large as a full.
The tradeoff table looks like this:
| Type | Storage | Backup speed | Recovery complexity |
|---|---|---|---|
| Full | High | Slow | One set needed |
| Incremental | Low | Fast | Full + every incremental |
| Differential | Medium | Medium | Full + latest differential |
Match the type to your priorities. Fast storage and aggressive RTO targets favor full or differential. Tight storage budgets and acceptable recovery times favor incrementals.
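The recovery-chain difference in the table can be made concrete with a short sketch. This is a hypothetical Python helper with a simplified model (names and tuple format are ours), showing which backup sets a restore needs under each strategy:

```python
def restore_chain(backups, target):
    """Return the backup sets needed to restore to `target`.

    `backups` is a chronological list of (name, kind) tuples, where kind
    is "full", "incr", or "diff". Incrementals need the last full plus
    every incremental after it; differentials need only the last full
    plus themselves.
    """
    upto = backups[: [n for n, _ in backups].index(target) + 1]
    last_full = max(i for i, (_, k) in enumerate(upto) if k == "full")
    _, kind = upto[-1]
    if kind == "full":
        return [upto[-1][0]]
    if kind == "diff":
        return [upto[last_full][0], upto[-1][0]]
    # Incremental: the full plus every incremental since it.
    return [upto[last_full][0]] + [n for n, k in upto[last_full + 1:] if k == "incr"]

week = [("sun", "full"), ("mon", "incr"), ("tue", "incr"),
        ("wed", "incr"), ("thu", "incr"), ("fri", "incr")]
print(restore_chain(week, "fri"))  # ['sun', 'mon', 'tue', 'wed', 'thu', 'fri']
```

Swapping the incrementals for differentials shrinks the Friday chain to just `['sun', 'fri']`, which is the whole argument for differentials in one line.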
Backup Scheduling and Retention Policies
Strategy on paper is just a document until you turn it into a schedule. Scheduling is where you answer: when do backups run, how long are they kept, and what happens when they fail?
The first question is about backup windows — when can backups run without hurting production? Most systems have low-traffic periods, usually overnight. 24/7 operations or global teams complicate this; you may need staggered schedules that respect regional business hours rather than a single global window.
Automate everything. Manual backup processes fail silently — someone forgets, someone’s out sick, someone thinks someone else is doing it. Automation also handles retries, logging, and alerting. A backup job that fails and tells nobody is worse than no backup job at all.
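As a sketch of what “automation handles retries, logging, and alerting” can look like, here is a minimal Python wrapper around a backup command. The command and the alert hook are placeholders for your real tool and notification channel; a production version would live in your scheduler and monitoring stack:

```python
import subprocess
import sys
import time

def run_backup(cmd, attempts=3, delay=60, alert=print):
    """Run a backup command, retrying on failure and alerting when all
    attempts fail. `cmd` is a placeholder for your real backup command;
    `alert` is a placeholder for your real notification hook."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)  # back off before retrying
    alert(f"backup failed after {attempts} attempts: {' '.join(cmd)}")
    return False
```

The important property is the last two lines: a failure that exhausts its retries always produces a notification. That is the difference between this and the silent cron job the paragraph above warns about.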
Retention policy determines how long backups are kept before automatic deletion. The goal is enough history to cover realistic recovery scenarios without runaway storage costs.
The Grandfather-Father-Son (GFS) model is the standard approach. Daily backups (“Son”) are kept for a week or two. Weekly backups (“Father”) are kept for a few months. Monthly backups (“Grandfather”) are kept for a year or more. Most backup tools can apply GFS automatically — you just configure the counts.
How long to keep things depends on the business context. Financial services might require 7 years for certain data types due to regulation. A development environment might only need 30 days. Find out if there are legal retention requirements before designing a policy — adding retention is easy, but retroactively recovering data you deleted isn’t.
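A GFS pruning pass can be sketched in a few lines of Python. This is a simplified illustration, not how any particular tool implements it: we use Sundays as the weekly markers, month-end dates as the monthly markers, and made-up retention counts.

```python
from datetime import date, timedelta

def gfs_keep(backups, daily=14, weekly=8, monthly=12):
    """Given a sorted list of backup dates (newest last), return the set
    to keep under a GFS policy. Counts are illustrative, not advice."""
    keep = set(backups[-daily:])                                 # Son: recent dailies
    sundays = [d for d in backups if d.isoweekday() == 7]        # weekly markers
    keep.update(sundays[-weekly:])                               # Father
    month_ends = [d for d in backups if (d + timedelta(days=1)).day == 1]
    keep.update(month_ends[-monthly:])                           # Grandfather
    return keep

days = [date(2024, 1, 1) + timedelta(days=i) for i in range(120)]
kept = gfs_keep(days)
print(len(days) - len(kept), "of", len(days), "daily backups pruned")
```

Everything not in the returned set is eligible for deletion, which is exactly the decision a backup tool’s GFS setting automates for you.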
Verification is the step that most backup systems skip. Backups can complete successfully and still be unrestorable — corrupted archives, missing dependencies, changed configurations. The only way to know a backup works is to restore from it.
Schedule integrity checks automatically, but also schedule actual restore tests on a calendar. Mission-critical systems should be tested at least monthly. Everything else quarterly. Document the results. Storage capacity needs planning too — as data grows, retention policies can quietly exhaust disk space. Project growth and set alerts before you hit 80% capacity, not after backups start failing at 100%.
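The capacity math is simple enough to sketch as well. A hypothetical projection under compound monthly growth (the numbers and function name are ours): when does usage cross the 80% alert threshold?

```python
def months_until_alert(current_tb, capacity_tb, monthly_growth, alert_at=0.8):
    """Project how many months until storage usage crosses the alert
    threshold, assuming compound monthly growth. A planning sketch,
    not a sizing tool; monthly_growth must be positive."""
    months = 0
    used = current_tb
    while used < capacity_tb * alert_at:
        used *= 1 + monthly_growth
        months += 1
    return months

# 20 TB used of 50 TB, growing 5% per month: about 15 months to 80%.
print(months_until_alert(20, 50, 0.05))  # 15
```

Fifteen months sounds comfortable until you remember that procurement and budget cycles can easily eat half of it. That is why the projection belongs in the design phase, not in the incident postmortem.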
Disaster Recovery Planning Considerations
A backup strategy covers individual systems. Disaster recovery planning covers what happens when the building floods, the ransomware hits everything, or three key people are unavailable simultaneously.
The ransomware case deserves particular attention. Attackers increasingly target backup infrastructure specifically — they know you’ll pay if the backups are gone too. Traditional backup setups with always-connected network storage are vulnerable. The countermeasures are air-gapped backups (physically disconnected storage), immutable repositories (where data can’t be modified or deleted, only appended), and off-site copies.
The 3-2-1 rule is the classic framework: 3 copies of the data, on 2 different types of media, with 1 copy off-site. The updated 3-2-1-1-0 version adds a fourth requirement — 1 offline/air-gapped copy — and a fifth: 0 errors in verification. The extra two numbers address exactly the ransomware scenario. Geographic spread matters for physical disasters; two copies in the same data center don’t help if the data center burns down.
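The 3-2-1-1-0 rule is mechanical enough to express as an audit. A simplified Python sketch with an invented data model (the `media`, `offsite`, and `offline` fields are ours) that reports which requirements a set of copies fails:

```python
def audit_32110(copies, verify_errors):
    """Check a set of backup copies against the 3-2-1-1-0 rule.

    `copies` is a list of dicts with a 'media' string and boolean
    'offsite'/'offline' flags (a simplified model); `verify_errors` is
    the error count from the last verification run.
    """
    failures = []
    if len(copies) < 3:
        failures.append("need at least 3 copies")
    if len({c["media"] for c in copies}) < 2:
        failures.append("need 2 different media types")
    if not any(c["offsite"] for c in copies):
        failures.append("need 1 off-site copy")
    if not any(c["offline"] for c in copies):
        failures.append("need 1 offline/air-gapped copy")
    if verify_errors != 0:
        failures.append("need 0 verification errors")
    return failures

copies = [
    {"media": "disk", "offsite": False, "offline": False},  # primary
    {"media": "disk", "offsite": True,  "offline": False},  # cloud replica
    {"media": "tape", "offsite": True,  "offline": True},   # vaulted tape
]
print(audit_32110(copies, verify_errors=0))  # [] — all checks pass
```

Note that dropping the tape copy would fail three checks at once, which illustrates why the offline copy carries so much of the rule’s ransomware protection.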
During an actual disaster, you can’t restore everything at once. Prioritization means deciding in advance which systems come first so you’re not making that decision at 2 AM under pressure. Mission-critical systems go first; the rest waits until the business is stable.
Documentation needs to survive the disaster that you’re recovering from. If your runbook is only on the system that’s down, it’s not a runbook. Keep procedures somewhere accessible without the primary infrastructure: printed copies, a separate cloud location, a team member’s personal device. Also include contact lists and who has authority to make decisions — emergencies are not the time to figure out the approval chain.
Disaster Recovery as a Service (DRaaS) is worth knowing about. Cloud-based standby infrastructure can significantly reduce the cost of maintaining off-site recovery capability, but verify that the provider’s RTO/RPO commitments actually match your requirements before you need them.
Real-world context
In practice, a lot of backup work is political as much as technical. The hard part often isn’t configuring the tools — it’s getting business stakeholders to tell you what their systems are actually worth, and getting budget approved for the infrastructure that matches their stated recovery requirements.
A common pattern: ask someone “how quickly do you need this system back?” and they’ll say “immediately.” Ask “what does an hour of downtime cost you?” and the number is smaller than they expected. That conversation usually lands on a more realistic RTO, which lands on a more realistic budget.
RTO and RPO values sometimes appear in contracts and SLAs, which changes the stakes — missing them isn’t just an operational inconvenience, it has financial or legal consequences. If you’re working in an environment where backup requirements are contractually defined, make sure your infrastructure can actually deliver. Test it.
Cloud and hybrid environments have made backup strategy more complicated. Data might live on-premises, across several cloud providers, and in SaaS applications that don’t expose raw backups at all. Each source needs a plan, and the plans need to be consistent. This is increasingly standard territory for DevOps and SRE roles.
Common pitfalls
Starting with tools instead of requirements. It’s easy to dive into configuring backup software before establishing what you actually need to protect and why. The result is an elaborate system that protects a lot of data but misses the specific recovery scenarios that matter.
No capacity planning. Backup systems that fit fine today can quietly become impossible in two years as data grows. Model growth before you deploy, and set storage alerts before you’re at 100%.
Never testing restores. A backup job that completes with exit code 0 is not the same as a backup that works. Files can be corrupted, dependencies can change, and procedures can rot. If you haven’t restored from a backup recently, you don’t know if your backup works.
Classification without business input. Sysadmins tend to rate system importance by technical complexity or their own familiarity. The right measure is business impact. A boring-looking billing database might be more critical than an impressive-sounding analytics cluster.
Treating backup storage as untouchable. Backup repositories are a high-value target — they contain large amounts of data in a centralized, often less-monitored location. They need the same security attention as production systems: access controls, encryption, and audit logging.
Recommended reading
- “Modern Data Protection” by W. Curtis Preston (2021) — O’Reilly Media. Covers backup strategy, cloud integration, and ransomware defense. Good on the organizational side — how to get RTO/RPO values out of stakeholders and turn them into requirements.
- NIST Special Publication 800-34 Rev. 1: Contingency Planning Guide for Federal Information Systems (2020) — Federal contingency planning framework. Dry, but thorough — useful when you need a defensible process for BIA and recovery planning.
- “The 3-2-1-1-0 Rule: How Modern Backup Best Practices Evolve” by Veeam (2024) — Short vendor article explaining the extended 3-2-1-1-0 rule and what the extra two requirements address.
- “RPO and RTO: Recovery Objectives Best Practices Guide” by Rubrik (2024) — Covers how to derive RTO and RPO values from business requirements rather than guessing.
- “Enterprise Backup Strategy: Building Resilient Data Protection” by TechTarget (2024) — Industry survey of current practices, useful for getting a sense of what “normal” looks like in different environments.
Assessment
Multiple Choice Questions
Question 1: Which data classification category should include systems that directly generate revenue or serve customers where outages immediately impact business operations?
- a) Business-Essential
- b) Mission-Critical
- c) Important
- d) Standard
Question 2: What does a Recovery Point Objective (RPO) of 4 hours indicate about backup requirements?
- a) Systems must be restored within 4 hours of an outage
- b) Backup systems can tolerate 4 hours of downtime
- c) Data must be backed up at least every 4 hours
- d) Recovery testing should occur every 4 hours
Question 3: In an incremental backup strategy, what is required to restore data from Friday if full backups occur on Sunday?
- a) Only the Friday incremental backup
- b) The Sunday full backup plus the Friday incremental backup
- c) The Sunday full backup plus all incremental backups from Monday through Friday
- d) Only the most recent full backup
Question 4: According to the enhanced 3-2-1-1-0 backup rule, which requirement addresses modern ransomware threats?
- a) Three copies of data
- b) Two different storage media
- c) One off-site backup copy
- d) One offline/air-gapped backup copy
Question 5: What is the primary advantage of differential backups compared to incremental backups?
- a) They require less storage space
- b) They complete faster than incremental backups
- c) They only require two backup sets for complete recovery
- d) They provide more frequent recovery points
Question 6: Which factor should be the PRIMARY consideration when establishing RTO and RPO values?
- a) Available storage capacity
- b) Network bandwidth limitations
- c) Business impact analysis and risk tolerance
- d) Backup software capabilities
Question 7: What is the main disadvantage of using full backups exclusively?
- a) Complex recovery procedures
- b) High storage requirements and long backup duration
- c) Dependency on backup chain integrity
- d) Limited recovery point options
Question 8: In the Grandfather-Father-Son (GFS) retention model, what do “Son” backups typically represent?
- a) Monthly backups retained for extended periods
- b) Weekly backups retained for medium periods
- c) Daily backups retained for short periods
- d) Annual backups for compliance requirements
Short Answer Questions
Question 9: Explain why data classification must involve both technical teams and business stakeholders, and describe two potential consequences of making data classification decisions without adequate business input.
Question 10: Describe how RTO and RPO objectives influence backup system architecture decisions. Provide a specific example showing how aggressive recovery objectives might drive infrastructure requirements.
Question 11: Compare and contrast the recovery complexity between incremental and differential backup strategies. Explain when you would choose each approach based on organizational requirements.