Disaster Recovery
Master disaster recovery strategies: backup and restore, pilot light, warm standby, multi-site active-active, and RTO/RPO planning. This hands-on tutorial focuses on practical implementation of disaster recovery concepts.
Disaster recovery (DR) is the set of plans and tooling that restores service and preserves business continuity after a catastrophic event such as a natural disaster, a cyber attack, or a major system failure.
DR Key Metrics
| Metric | Definition | Target |
|---|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable downtime | Minutes to hours |
| RPO (Recovery Point Objective) | Maximum acceptable data loss | Seconds to hours |
| MTTR (Mean Time to Recovery) | Average time to restore service | As low as possible |
| MTBF (Mean Time Between Failures) | Average time between failures | As high as possible |
RPO Timeline:
────────────────────────────────────────────────►
Last       Backups    Disaster   Restore   Current
Backup     Lost!      Event      Point     Time
│          │          │          │         │
└──────────┴──────────┴──────────┴─────────┘
           ↑ Data Loss Window (RPO)
RTO Timeline:
────────────────────────────────────────────────►
Normal     Disaster   Detection  Recovery  Normal
Ops        Event      Time       Time      Ops
│          │          │          │         │
└──────────┴──────────┴──────────┴─────────┘
           ↑ Downtime Window (RTO)
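To make the two windows concrete, the sketch below measures the achieved RPO and RTO for a single incident from three timestamps and compares them with targets. The `Incident` type, the function names, and the example targets (1 hour RPO, 4 hours RTO) are illustrative assumptions and not part of any AWS API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_backup: datetime  # most recent recoverable copy before the disaster
    disaster: datetime     # moment the outage began
    restored: datetime     # moment normal operations resumed


def achieved_rpo(incident: Incident) -> timedelta:
    """Data loss window: anything written after the last backup is lost."""
    return incident.disaster - incident.last_backup


def achieved_rto(incident: Incident) -> timedelta:
    """Downtime window: detection time plus recovery time."""
    return incident.restored - incident.disaster


def meets_targets(incident: Incident, rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data loss and the downtime stayed within target."""
    return (achieved_rpo(incident) <= rpo_target
            and achieved_rto(incident) <= rto_target)


incident = Incident(
    last_backup=datetime(2024, 1, 15, 9, 30),
    disaster=datetime(2024, 1, 15, 10, 0),
    restored=datetime(2024, 1, 15, 12, 15),
)
print(achieved_rpo(incident))  # 0:30:00
print(achieved_rto(incident))  # 2:15:00
print(meets_targets(incident, timedelta(hours=1), timedelta(hours=4)))  # True
```

The same three timestamps are what a DR drill should record, so the measured values can be compared against the targets agreed with the business.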
DR Strategies
1. Backup and Restore (Cold)
Production          S3/Glacier
┌─────────┐         ┌─────────┐
│   App   │────────>│ Backup  │
│   DB    │  Daily  │  Files  │
└─────────┘         └─────────┘
Disaster Event ──> Restore ──> Back Online
                   (Hours/Days)
Cost: $ RTO: 24+ hours RPO: 24 hours
Best for: Non-critical systems, development environments
# AWS Backup plan
aws backup create-backup-plan \
  --backup-plan '{
    "BackupPlanName": "daily-backup",
    "Rules": [{
      "RuleName": "daily",
      "TargetBackupVaultName": "Default",
      "ScheduleExpression": "cron(0 5 ? * * *)",
      "StartWindowMinutes": 60,
      "CompletionWindowMinutes": 120,
      "Lifecycle": {
        "MoveToColdStorageAfterDays": 30,
        "DeleteAfterDays": 120
      }
    }]
  }'
2. Pilot Light
Production          DR Region (Standby)
┌─────────┐         ┌─────────┐
│   App   │         │   VPC   │ (configured)
│   DB    │────────>│   DB    │ (minimal/replicated)
│         │  Sync   │         │
└─────────┘         └─────────┘
                         │
         Disaster ──────>│ Scale Up
         Event           │
                         ▼
                    ┌─────────┐
                    │  Full   │
                    │   App   │
                    └─────────┘
Cost: $$ RTO: Tens of minutes (application servers must be provisioned first) RPO: Seconds to minutes
Best for: Critical databases, core applications
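# RDS Cross-Region Read Replica
# Note: declared in one stack like this, the replica is created in the same
# region as the primary. For a cross-region replica, deploy DRReplica from a
# stack in the DR region and set SourceDBInstanceIdentifier to the primary's ARN.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DBInstance:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: primary-db
      DBName: mydb
      Engine: postgres
      MasterUsername: dbadmin          # example name
      ManageMasterUserPassword: true   # password is kept in Secrets Manager
      MultiAZ: true
      AllocatedStorage: '100'
      DBInstanceClass: db.t3.large
      BackupRetentionPeriod: 7
  DRReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: dr-replica
      SourceDBInstanceIdentifier: !Ref DBInstance
      Engine: postgres
      DBInstanceClass: db.t3.micro # Smaller, scaled up during DR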
3. Warm Standby
Production          DR Region (Warm)
┌─────────┐         ┌─────────┐
│   App   │────────>│   App   │ (running, scaled down)
│   DB    │  Sync   │   DB    │ (replicated)
│         │         │         │
└─────────┘         └─────────┘
                         │
         Disaster ──────>│ Scale Up
         Event           │ (minutes)
                         ▼
                    ┌─────────┐
                    │  Full   │
                    │   App   │
                    └─────────┘
Cost: $$$ RTO: Minutes RPO: Seconds
Best for: Business-critical systems
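# Terraform warm standby
# alb_dns_name and alb_zone_id are assumed outputs of the application module;
# aws_route53_zone.main is assumed to be defined elsewhere.
module "primary" {
  source         = "./modules/application"
  environment    = "production"
  region         = "us-east-1"
  instance_count = 4
}

module "warm_standby" {
  source           = "./modules/application"
  environment      = "production-dr"
  region           = "us-west-2"
  instance_count   = 1 # Minimal running instances
  database_replica = true
}

# Route53 health checks for failover
# Probe the primary endpoint directly; probing app.example.com would follow
# the failover record itself.
resource "aws_route53_health_check" "primary" {
  fqdn              = module.primary.alb_dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "failover" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "app.example.com"
  type            = "A"
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }

  # An ALB has no fixed IP address, so use an alias instead of `records`
  alias {
    name                   = module.primary.alb_dns_name
    zone_id                = module.primary.alb_zone_id
    evaluate_target_health = true
  }
}

# Failover needs a matching SECONDARY record that points at the standby
resource "aws_route53_record" "failover_secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "secondary"

  failover_routing_policy {
    type = "SECONDARY"
  }

  alias {
    name                   = module.warm_standby.alb_dns_name
    zone_id                = module.warm_standby.alb_zone_id
    evaluate_target_health = true
  }
}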
4. Multi-Site Active-Active
┌─────────────┐         ┌─────────────┐
│  Region 1   │◄───────>│  Region 2   │
│  (Active)   │  Sync   │  (Active)   │
│             │         │             │
│ • Full App  │         │ • Full App  │
│ • Full DB   │         │ • Full DB   │
│ • Full Load │         │ • Full Load │
└─────────────┘         └─────────────┘
       │                       │
       └───────────┬───────────┘
          Global Accelerator
                  or
            Route53 Latency
Cost: $$$$ RTO: Near Zero RPO: Near Zero
Best for: Mission-critical systems, global applications
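As a rough way to compare the four strategies, the sketch below maps an RTO/RPO requirement to the cheapest strategy that could plausibly meet it. The 24-hour figures come from the backup-and-restore numbers above; the 1-minute and 15-minute cut-offs and the function name `choose_strategy` are assumptions chosen for illustration, and real thresholds depend on the workload and budget.

```python
from datetime import timedelta


def choose_strategy(rto: timedelta, rpo: timedelta) -> str:
    """Pick the least expensive strategy for the given targets.

    Thresholds are illustrative assumptions, not AWS guidance.
    """
    if rto < timedelta(minutes=1):
        # Near-zero downtime needs a second region already serving traffic.
        return "multi-site active-active"
    if rto < timedelta(minutes=15):
        # The application must already be running, only scaled down.
        return "warm standby"
    if rto < timedelta(hours=24) or rpo < timedelta(hours=24):
        # Daily backups cannot meet these targets, so data must be
        # replicated continuously even if compute is provisioned later.
        return "pilot light"
    return "backup and restore"


print(choose_strategy(rto=timedelta(seconds=5), rpo=timedelta(seconds=1)))
print(choose_strategy(rto=timedelta(minutes=5), rpo=timedelta(seconds=30)))
print(choose_strategy(rto=timedelta(hours=1), rpo=timedelta(minutes=1)))
print(choose_strategy(rto=timedelta(days=2), rpo=timedelta(hours=24)))
```

Note that RPO alone can rule out backup and restore: a system that tolerates two days of downtime but only one hour of data loss still needs replicated data.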
Backup Strategies
3-2-1 Backup Rule
3 Copies of data
2 Different media types
1 Offsite/cloud copy
Production
    │
    ├──> Primary Storage (local SSD)
    │
    ├──> Secondary Storage (NAS/local backup)
    │
    └──> Tertiary Storage (cloud/offsite)
         (S3, Glacier, Azure Blob)
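The rule is simple enough to check mechanically. The sketch below validates an inventory of backup copies against the three conditions; the `BackupCopy` type and the media labels are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # e.g. "ssd", "nas", "object-storage"
    offsite: bool  # True if stored away from the production site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check for 3 copies, on 2 different media types, with 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.media for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("primary", media="ssd", offsite=False),
    BackupCopy("secondary", media="nas", offsite=False),
    BackupCopy("tertiary", media="object-storage", offsite=True),
]
print(satisfies_3_2_1(copies))      # True
print(satisfies_3_2_1(copies[:2]))  # False: only two copies, none offsite
```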
Snapshot Strategy
# AWS EBS Snapshots
aws ec2 create-snapshot \
  --volume-id vol-1234567890abcdef0 \
  --description "Daily backup $(date +%Y-%m-%d)"

# Automated with Data Lifecycle Manager
aws dlm create-lifecycle-policy \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --description "Daily snapshots" \
  --state ENABLED \
  --policy-details file://policy.json

# Cross-region snapshot copy
# Run the command against the destination region with --region
aws ec2 copy-snapshot \
  --region us-west-2 \
  --source-region us-east-1 \
  --source-snapshot-id snap-1234567890abcdef0 \
  --description "DR copy"
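The Data Lifecycle Manager command above reads its schedule from `policy.json`, which the tutorial does not show. The sketch below generates one plausible version of that file. The field names follow the DLM policy-details structure as I understand it and should be checked against the current AWS documentation; the `Backup=true` tag, the 03:00 UTC start time, and the retention count of 7 are assumptions picked for illustration.

```python
import json


def build_snapshot_policy(tag_key: str, tag_value: str, retain_count: int) -> dict:
    """Build DLM policy details for one daily EBS snapshot schedule."""
    return {
        "ResourceTypes": ["VOLUME"],
        # Only volumes carrying this tag are snapshotted.
        "TargetTags": [{"Key": tag_key, "Value": tag_value}],
        "Schedules": [
            {
                "Name": "DailySnapshots",
                "CreateRule": {
                    "Interval": 24,
                    "IntervalUnit": "HOURS",
                    "Times": ["03:00"],  # UTC
                },
                # Keep the newest N snapshots and delete older ones.
                "RetainRule": {"Count": retain_count},
                "CopyTags": True,
            }
        ],
    }


policy = build_snapshot_policy("Backup", "true", retain_count=7)
with open("policy.json", "w") as handle:
    json.dump(policy, handle, indent=2)
```

With a 24-hour interval this schedule gives an RPO of up to one day for the tagged volumes; shorten `Interval` if the target is tighter.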
Database Backups
# RDS Automated Backups
DBInstance:
  Type: AWS::RDS::DBInstance
  Properties:
    BackupRetentionPeriod: 35
    PreferredBackupWindow: '03:00-04:00'
    CopyTagsToSnapshot: true
    DeletionProtection: true

# Manual snapshot before major changes
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-pre-upgrade-$(date +%Y%m%d)

# Point-in-time recovery
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-restored \
  --restore-time 2024-01-15T10:00:00Z
DR Testing
Chaos Engineering
# AWS Fault Injection Simulator
# Stops the EC2 instances in us-east-1a that carry the example tag Environment=dr-test
aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Test EC2 failure",
    "stopConditions": [
      {
        "source": "none"
      }
    ],
    "targets": {
      "EC2Instances": {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {
          "Environment": "dr-test"
        },
        "filters": [
          {
            "path": "Placement.AvailabilityZone",
            "values": ["us-east-1a"]
          }
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "StopInstance": {
        "actionId": "aws:ec2:stop-instances",
        "targets": {
          "Instances": "EC2Instances"
        }
      }
    },
    "roleArn": "arn:aws:iam::123456789012:role/ExperimentRole"
  }'
# Gremlin (Chaos Engineering Platform)
gremlin attack host cpu --amount 80 --duration 300
gremlin attack host memory --amount 80 --duration 300
gremlin attack container kill --target container-name
DR Drill Checklist
## Monthly DR Drill
### Preparation
- [ ] Notify stakeholders
- [ ] Schedule maintenance window
- [ ] Prepare rollback procedures
### Execution
- [ ] Trigger failover
- [ ] Verify application functionality
- [ ] Check data consistency
- [ ] Verify monitoring/alerts
### Validation
- [ ] RTO met: ___ minutes
- [ ] RPO met: ___ data loss
- [ ] All critical functions working
- [ ] Performance acceptable
### Cleanup
- [ ] Document findings
- [ ] Update runbooks
- [ ] Schedule improvements
- [ ] Failback to primary (if needed)
DR Automation
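# Lambda function for automated failover
# Deploy in the DR region so the boto3 clients act on the DR resources.
import boto3

HOSTED_ZONE_ID = 'Z123456789'   # Route 53 hosted zone that holds app.example.com
DR_ALB_ZONE_ID = 'Z0987654321'  # Placeholder: the DR load balancer's own hosted zone ID


def check_primary_health():
    """Placeholder: return {'status': 'HEALTHY'} or {'status': 'UNHEALTHY'}."""
    raise NotImplementedError


def notify_team(message):
    """Placeholder: forward to your alerting channel; here it only logs."""
    print(message)


def lambda_handler(event, context):
    # Check health of primary region
    health = check_primary_health()
    if health['status'] == 'UNHEALTHY':
        # Promote RDS replica first so the DR database accepts writes
        rds = boto3.client('rds')
        rds.promote_read_replica(DBInstanceIdentifier='dr-replica')

        # Scale up DR environment (assumes the group's MaxSize is at least 4)
        asg = boto3.client('autoscaling')
        asg.update_auto_scaling_group(
            AutoScalingGroupName='dr-asg',
            MinSize=4,
            DesiredCapacity=4
        )

        # Update DNS to point to DR
        route53 = boto3.client('route53')
        route53.change_resource_record_sets(
            HostedZoneId=HOSTED_ZONE_ID,
            ChangeBatch={
                'Changes': [{
                    'Action': 'UPSERT',
                    'ResourceRecordSet': {
                        'Name': 'app.example.com',
                        'Type': 'A',
                        'AliasTarget': {
                            'HostedZoneId': DR_ALB_ZONE_ID,
                            'DNSName': 'dr-alb.amazonaws.com',
                            'EvaluateTargetHealth': True
                        }
                    }
                }]
            }
        )
        notify_team("Failover to DR region completed")
    return {'status': 'success'}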
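The handler above depends on a `check_primary_health()` helper that the tutorial does not spell out. One possible shape, sketched here with only the standard library, is to probe a health endpoint several times and report `UNHEALTHY` only when every attempt fails, mirroring the `failure_threshold = 3` used in the Route53 health check. The URL, the `probe_https` helper, and the `probe` and `attempts` parameters are assumptions introduced for this example.

```python
import urllib.request
from typing import Callable


def probe_https(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False


def check_primary_health(
    probe: Callable[[], bool] = lambda: probe_https("https://app.example.com/health"),
    attempts: int = 3,
) -> dict:
    """Run the probe `attempts` times; UNHEALTHY only if every attempt fails."""
    failures = sum(1 for _ in range(attempts) if not probe())
    status = "UNHEALTHY" if failures == attempts else "HEALTHY"
    return {"status": status, "failures": failures}


# The probe is injectable, so the logic can be exercised without a network.
print(check_primary_health(probe=lambda: False))  # {'status': 'UNHEALTHY', 'failures': 3}
print(check_primary_health(probe=lambda: True))   # {'status': 'HEALTHY', 'failures': 0}
```

Requiring every attempt to fail reduces the chance that one dropped request triggers a failover. A production version would likely space the attempts out and run from outside the primary region.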
Next Steps
Now let's move to real-world DevOps projects to apply what you've learned.