Disaster Recovery

Disaster recovery (DR) is how you restore systems and data, and keep the business running, after a catastrophic event such as a natural disaster, a cyber attack, or a major system failure.

DR Key Metrics

Metric	Definition	Target
RTO (Recovery Time Objective)	Maximum acceptable downtime	Minutes to hours
RPO (Recovery Point Objective)	Maximum acceptable data loss	Seconds to hours
MTTR (Mean Time to Recovery)	Average time to restore service	As low as possible
MTBF (Mean Time Between Failures)	Average time between failures	As high as possible
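MTTR and MTBF can be computed directly from an incident log, and together they give a steady-state availability estimate of MTBF / (MTBF + MTTR). The following is a minimal sketch; the `Incident` record and the function names are illustrative assumptions, not part of any AWS tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime   # when the service went down
    resolved: datetime  # when the service was restored


def mttr(incidents):
    """Mean time to recovery: average outage duration."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mtbf(incidents, period_start, period_end):
    """Mean time between failures: total uptime divided by the failure count."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (period_end - period_start) - downtime
    return uptime / len(incidents)


def availability(mtbf_value, mttr_value):
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_value / (mtbf_value + mttr_value)


# Example: two outages (1 hour and 3 hours) in a 30-day period
incidents = [
    Incident(datetime(2024, 1, 5, 10), datetime(2024, 1, 5, 11)),
    Incident(datetime(2024, 1, 20, 2), datetime(2024, 1, 20, 5)),
]
period = (datetime(2024, 1, 1), datetime(2024, 1, 31))
print(mttr(incidents))                                   # 2:00:00
print(availability(mtbf(incidents, *period), mttr(incidents)))
```

Lowering MTTR or raising MTBF both improve availability, which is why the table targets "as low as possible" for one and "as high as possible" for the other.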

RPO Timeline:
─────────────────────────────────────────►
Last                Disaster        Service
Backup              Event           Restored
  │                   │                │
  └───────────────────┘
   ↑ Data Loss Window (must not exceed RPO)

RTO Timeline:
─────────────────────────────────────────►
Normal   Disaster   Detection   Recovery   Normal
Ops       Event                 Complete    Ops
 │          │           │          │         │
            └───────────┴──────────┘
             ↑ Downtime Window (must not exceed RTO)
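Both windows reduce to simple timestamp arithmetic: data loss runs from the last good backup to the disaster, and downtime runs from the disaster to the moment service is restored. A minimal sketch, with hypothetical function names, that checks an incident against its objectives:

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, disaster):
    """Everything written after the last backup is lost."""
    return disaster - last_backup


def downtime_window(disaster, service_restored):
    """Time during which the service was unavailable."""
    return service_restored - disaster


def meets_objectives(last_backup, disaster, service_restored, rpo, rto):
    """Compare the measured windows against the RPO and RTO targets."""
    return {
        "rpo_met": data_loss_window(last_backup, disaster) <= rpo,
        "rto_met": downtime_window(disaster, service_restored) <= rto,
    }


# Example: nightly backup at 05:00, outage at 14:30, service back at 16:00
result = meets_objectives(
    last_backup=datetime(2024, 1, 15, 5, 0),
    disaster=datetime(2024, 1, 15, 14, 30),
    service_restored=datetime(2024, 1, 15, 16, 0),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=1),
)
print(result)  # {'rpo_met': True, 'rto_met': False}
```

In this example the 9.5 hours of lost data fits a 24-hour RPO, but 90 minutes of downtime misses a 1-hour RTO.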

DR Strategies

1. Backup and Restore (Cold)

Production          S3/Glacier
┌─────────┐         ┌─────────┐
│  App    │────────>│  Backup │
│  DB     │  Daily  │  Files  │
└─────────┘         └─────────┘
                              
Disaster Event ──> Restore ──> Back Online
                   (Hours/Days)

Cost: $       RTO: 24+ hours    RPO: 24 hours

Best for: Non-critical systems, development environments

# AWS Backup plan
aws backup create-backup-plan \
  --backup-plan '{
    "BackupPlanName": "daily-backup",
    "Rules": [{
      "RuleName": "daily",
      "TargetBackupVaultName": "Default",
      "ScheduleExpression": "cron(0 5 ? * * *)",
      "StartWindowMinutes": 60,
      "CompletionWindowMinutes": 120,
      "Lifecycle": {
        "MoveToColdStorageAfterDays": 30,
        "DeleteAfterDays": 120
      }
    }]
  }'
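AWS Backup requires DeleteAfterDays to be at least 90 days greater than MoveToColdStorageAfterDays, which is why the plan above pairs 30 with 120. A small pre-flight check can catch a bad combination before the CLI call fails. This is a sketch only; `lifecycle_errors` is a hypothetical helper, and it validates nothing beyond this one rule.

```python
MIN_COLD_STORAGE_DAYS = 90  # minimum time a backup must stay in cold storage


def lifecycle_errors(plan):
    """Return a list of human-readable problems found in a backup plan dict."""
    errors = []
    for rule in plan.get("Rules", []):
        lifecycle = rule.get("Lifecycle", {})
        cold = lifecycle.get("MoveToColdStorageAfterDays")
        delete = lifecycle.get("DeleteAfterDays")
        if cold is None or delete is None:
            continue  # nothing to cross-check
        if delete < cold + MIN_COLD_STORAGE_DAYS:
            errors.append(
                f"{rule.get('RuleName', '?')}: DeleteAfterDays ({delete}) must be "
                f">= MoveToColdStorageAfterDays ({cold}) + {MIN_COLD_STORAGE_DAYS}"
            )
    return errors


plan = {
    "BackupPlanName": "daily-backup",
    "Rules": [{
        "RuleName": "daily",
        "Lifecycle": {"MoveToColdStorageAfterDays": 30, "DeleteAfterDays": 120},
    }],
}
print(lifecycle_errors(plan))  # []
```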

2. Pilot Light

Production          DR Region (Standby)
┌─────────┐         ┌─────────┐
│  App    │         │  VPC    │  (configured)
│  DB     │────────>│  DB     │  (minimal/replicated)
│         │   Sync  │         │
└─────────┘         └─────────┘
                            │
              Disaster ──────>│ Scale Up
              Event           │
                              ▼
                        ┌─────────┐
                        │  Full   │
                        │  App    │
                        └─────────┘
                        
Cost: $$      RTO: Tens of minutes   RPO: Seconds

Best for: Critical databases, core applications

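# RDS Cross-Region Read Replica
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DBInstance:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: primary-db
      DBName: mydb
      Engine: postgres
      MultiAZ: true
      AllocatedStorage: 100
      DBInstanceClass: db.t3.large
      BackupRetentionPeriod: 7  # Backups must be enabled on a replication source
      # MasterUsername and MasterUserPassword are also required; omitted here

  DRReplica:
    Type: AWS::RDS::DBInstance
    Properties:
      DBInstanceIdentifier: dr-replica
      # A stack creates resources in a single region. For a cross-region
      # replica, deploy this resource in a stack in the DR region and pass
      # the primary's ARN here instead of !Ref.
      SourceDBInstanceIdentifier: !Ref DBInstance
      Engine: postgres
      DBInstanceClass: db.t3.micro  # Smaller, scaled up during DR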

3. Warm Standby

Production          DR Region (Warm)
┌─────────┐         ┌─────────┐
│  App    │────────>│  App    │  (running, scaled down)
│  DB     │   Sync  │  DB     │  (replicated)
│         │         │         │
└─────────┘         └─────────┘
                            │
              Disaster ──────>│ Scale Up
              Event           │ (minutes)
                              ▼
                        ┌─────────┐
                        │  Full   │
                        │  App    │
                        └─────────┘
                        
Cost: $$$     RTO: Minutes      RPO: Seconds

Best for: Business-critical systems

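# Terraform warm standby
module "primary" {
  source = "./modules/application"

  environment    = "production"
  region         = "us-east-1"
  instance_count = 4
}

module "warm_standby" {
  source = "./modules/application"

  environment    = "production-dr"
  region         = "us-west-2"
  instance_count = 1 # Minimal running instances

  database_replica = true
}

# Route53 health check for failover. It probes the primary endpoint directly,
# not the failover name. The alb_dns_name and alb_zone_id module outputs are assumed.
resource "aws_route53_health_check" "primary" {
  fqdn              = module.primary.alb_dns_name
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
}

resource "aws_route53_record" "failover" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  health_check_id = aws_route53_health_check.primary.id
  set_identifier  = "primary"

  # An ALB has no static IP address, so use an alias instead of `records`
  alias {
    name                   = module.primary.alb_dns_name
    zone_id                = module.primary.alb_zone_id
    evaluate_target_health = true
  }
}

# Secondary record, served when the primary health check fails
resource "aws_route53_record" "failover_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"

  alias {
    name                   = module.warm_standby.alb_dns_name
    zone_id                = module.warm_standby.alb_zone_id
    evaluate_target_health = true
  }
}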

4. Multi-Site Active-Active

┌─────────────┐         ┌─────────────┐
│  Region 1   │◄───────>│  Region 2   │
│  (Active)   │   Sync  │  (Active)   │
│             │         │             │
│ • Full App  │         │ • Full App  │
│ • Full DB   │         │ • Full DB   │
│ • Full Load │         │ • Full Load │
└─────────────┘         └─────────────┘
       │                       │
       └───────────┬───────────┘
              Global Accelerator
                   or
              Route53 Latency
              
Cost: $$$$    RTO: Near Zero    RPO: Near Zero

Best for: Mission-critical systems, global applications
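Choosing among the four strategies usually comes down to picking the cheapest one whose typical RTO and RPO fit the requirement. The sketch below encodes that idea. The specific durations are illustrative assumptions chosen to match the rough ranges given for each strategy ("24+ hours", "minutes", "near zero"); real values depend on the workload and should come from DR drills.

```python
from datetime import timedelta
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    cost: str
    rto: timedelta  # assumed typical recovery time
    rpo: timedelta  # assumed typical data loss window


# Ordered from cheapest to most expensive; durations are assumptions
STRATEGIES = [
    Strategy("Backup and Restore", "$", timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", "$$", timedelta(minutes=30), timedelta(minutes=1)),
    Strategy("Warm Standby", "$$$", timedelta(minutes=5), timedelta(minutes=1)),
    Strategy("Multi-Site Active-Active", "$$$$", timedelta(seconds=30), timedelta(seconds=1)),
]


def cheapest_strategy(required_rto, required_rpo) -> Optional[Strategy]:
    """Return the first (cheapest) strategy that satisfies both objectives."""
    for strategy in STRATEGIES:
        if strategy.rto <= required_rto and strategy.rpo <= required_rpo:
            return strategy
    return None  # no listed strategy is fast enough


choice = cheapest_strategy(timedelta(minutes=10), timedelta(minutes=5))
print(choice.name, choice.cost)  # Warm Standby $$$
```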

Backup Strategies

3-2-1 Backup Rule

3    Copies of data
2    Different media types
1    Offsite/cloud copy

Production
    │
    ├──> Primary Storage (local SSD)
    │
    ├──> Secondary Storage (NAS/local backup)
    │
    └──> Tertiary Storage (cloud/offsite)
         (S3, Glacier, Azure Blob)
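The 3-2-1 rule is easy to verify mechanically once each copy is described by where it lives and what it is stored on. A minimal sketch, where the `BackupCopy` record and its field values are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # where the copy lives
    media: str     # storage medium or service type
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies):
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


copies = [
    BackupCopy("primary storage", "local-ssd", offsite=False),
    BackupCopy("secondary storage", "nas", offsite=False),
    BackupCopy("S3 bucket", "object-storage", offsite=True),
]
print(satisfies_3_2_1(copies))      # True
print(satisfies_3_2_1(copies[:2]))  # False: only two copies, none offsite
```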

Snapshot Strategy

# AWS EBS Snapshots
aws ec2 create-snapshot \
  --volume-id vol-1234567890abcdef0 \
  --description "Daily backup $(date +%Y-%m-%d)"

# Automated with Data Lifecycle Manager
aws dlm create-lifecycle-policy \
  --execution-role-arn arn:aws:iam::123456789012:role/AWSDataLifecycleManagerDefaultRole \
  --description "Daily snapshots" \
  --state ENABLED \
  --policy-details file://policy.json

# Cross-region snapshot copy (the request is sent to the destination region)
aws ec2 copy-snapshot \
  --region us-west-2 \
  --source-region us-east-1 \
  --source-snapshot-id snap-1234567890abcdef0 \
  --description "DR copy"
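The policy.json file referenced above is not shown, but a snapshot lifecycle policy typically pairs a schedule with a retention rule such as "keep the most recent N snapshots". The selection logic behind a count-based rule can be sketched as follows; the function name and the tuple layout are assumptions for illustration, and nothing here calls AWS.

```python
from datetime import datetime


def snapshots_to_delete(snapshots, keep_last):
    """Given (snapshot_id, start_time) pairs, return IDs outside the newest keep_last."""
    newest_first = sorted(snapshots, key=lambda snap: snap[1], reverse=True)
    return [snapshot_id for snapshot_id, _ in newest_first[keep_last:]]


snapshots = [
    ("snap-a", datetime(2024, 1, 12)),
    ("snap-b", datetime(2024, 1, 15)),
    ("snap-c", datetime(2024, 1, 13)),
    ("snap-d", datetime(2024, 1, 14)),
]
print(snapshots_to_delete(snapshots, keep_last=2))  # ['snap-c', 'snap-a']
```

Whatever retention count you choose, it bounds how far back you can restore, so it should be at least as long as the time it may take to notice corrupted data.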

Database Backups

# RDS Automated Backups
DBInstance:
  Type: AWS::RDS::DBInstance
  Properties:
    BackupRetentionPeriod: 35
    PreferredBackupWindow: 03:00-04:00
    CopyTagsToSnapshot: true
    DeletionProtection: true

# Manual snapshot before major changes
aws rds create-db-snapshot \
  --db-instance-identifier mydb \
  --db-snapshot-identifier mydb-pre-upgrade-$(date +%Y%m%d)

# Point-in-time recovery
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier mydb \
  --target-db-instance-identifier mydb-restored \
  --restore-time 2024-01-15T10:00:00Z
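Point-in-time recovery only works for timestamps inside the restorable window: no older than the backup retention period and no newer than the instance's latest restorable time (RDS reports this as LatestRestorableTime). A pre-check along these lines avoids a failed restore call during an incident. This is a sketch with hypothetical names, and it approximates the start of the window as "now minus retention", which assumes backups have been enabled for at least that long.

```python
from datetime import datetime, timedelta, timezone


def restore_window(now, retention_days, latest_restorable_time):
    """Return (earliest, latest) timestamps that a restore can target."""
    earliest = now - timedelta(days=retention_days)
    return earliest, latest_restorable_time


def can_restore_to(restore_time, now, retention_days, latest_restorable_time):
    """True if restore_time falls inside the restorable window."""
    earliest, latest = restore_window(now, retention_days, latest_restorable_time)
    return earliest <= restore_time <= latest


now = datetime(2024, 1, 20, 12, 0, tzinfo=timezone.utc)
latest = now - timedelta(minutes=5)  # assumed value of LatestRestorableTime
target = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
print(can_restore_to(target, now, retention_days=7, latest_restorable_time=latest))  # True
```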

DR Testing

Chaos Engineering

# AWS Fault Injection Simulator
# EC2 targets are selected by resourceTags (or resourceArns) and narrowed with filters.
# The Environment=dr-test tag is a placeholder.
aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Test EC2 failure",
    "stopConditions": [
      {
        "source": "none"
      }
    ],
    "targets": {
      "EC2Instances": {
        "resourceType": "aws:ec2:instance",
        "resourceTags": {
          "Environment": "dr-test"
        },
        "filters": [
          {
            "path": "Placement.AvailabilityZone",
            "values": ["us-east-1a"]
          }
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "StopInstance": {
        "actionId": "aws:ec2:stop-instances",
        "targets": {
          "Instances": "EC2Instances"
        }
      }
    },
    "roleArn": "arn:aws:iam::123456789012:role/ExperimentRole"
  }'

# Gremlin (Chaos Engineering Platform)
gremlin attack host cpu --amount 80 --duration 300
gremlin attack host memory --amount 80 --duration 300
gremlin attack container kill --target container-name

DR Drill Checklist

## Monthly DR Drill

### Preparation
- [ ] Notify stakeholders
- [ ] Schedule maintenance window
- [ ] Prepare rollback procedures

### Execution
- [ ] Trigger failover
- [ ] Verify application functionality
- [ ] Check data consistency
- [ ] Verify monitoring/alerts

### Validation
- [ ] RTO met: ___ minutes
- [ ] RPO met: ___ data loss
- [ ] All critical functions working
- [ ] Performance acceptable

### Cleanup
- [ ] Document findings
- [ ] Update runbooks
- [ ] Schedule improvements
- [ ] Failback to primary (if needed)
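The "RTO met" line in the checklist needs a measured number, not an estimate. One way to get it during a drill is to start a timer when failover is triggered and poll a health probe until the service answers again. The sketch below is an assumption-laden illustration: `probe` is any callable you supply that returns True once the application is healthy, and the clock and sleep functions are injectable so the logic can be tested without waiting.

```python
import time


def measure_recovery_seconds(probe, timeout_s, interval_s=5.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True.

    Returns the elapsed seconds at the first healthy result, or None if
    timeout_s passes without one.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


def rto_met(measured_seconds, rto_seconds):
    """A drill passes only if recovery was observed and was fast enough."""
    return measured_seconds is not None and measured_seconds <= rto_seconds
```

Record the measured value in the drill report alongside the target, so trends across drills are visible. The measurement is only as precise as the polling interval.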

DR Automation

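# Lambda function for automated failover
import urllib.request

import boto3

# Placeholder values; replace with your own resources
DR_REGION = 'us-west-2'
PRIMARY_HEALTH_URL = 'https://primary.example.com/health'
DNS_ZONE_ID = 'Z123456789'       # Route 53 hosted zone that holds app.example.com
DR_ALB_ZONE_ID = 'ZDRALBZONEID'  # The DR load balancer's own hosted zone ID
ALERT_TOPIC_ARN = 'arn:aws:sns:us-west-2:123456789012:dr-alerts'


def check_primary_health():
    """Probe the primary region's health endpoint."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            healthy = resp.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        healthy = False
    return {'status': 'HEALTHY' if healthy else 'UNHEALTHY'}


def notify_team(message):
    """Publish a failover notice to an SNS topic."""
    sns = boto3.client('sns', region_name=DR_REGION)
    sns.publish(TopicArn=ALERT_TOPIC_ARN, Message=message)


def lambda_handler(event, context):
    # Check health of primary region
    health = check_primary_health()
    if health['status'] != 'UNHEALTHY':
        return {'status': 'healthy', 'failover': False}

    # Promote RDS replica first so the DR application has a writable database
    rds = boto3.client('rds', region_name=DR_REGION)
    rds.promote_read_replica(DBInstanceIdentifier='dr-replica')

    # Scale up DR environment (assumes the group's MaxSize is already >= 4)
    asg = boto3.client('autoscaling', region_name=DR_REGION)
    asg.update_auto_scaling_group(
        AutoScalingGroupName='dr-asg',
        MinSize=4,
        DesiredCapacity=4
    )

    # Update DNS to point to DR
    route53 = boto3.client('route53')
    route53.change_resource_record_sets(
        HostedZoneId=DNS_ZONE_ID,
        ChangeBatch={
            'Changes': [{
                'Action': 'UPSERT',
                'ResourceRecordSet': {
                    'Name': 'app.example.com',
                    'Type': 'A',
                    'AliasTarget': {
                        # Must be the load balancer's zone ID, not DNS_ZONE_ID
                        'HostedZoneId': DR_ALB_ZONE_ID,
                        'DNSName': 'dr-alb.amazonaws.com',
                        'EvaluateTargetHealth': True
                    }
                }
            }]
        }
    )

    # Promotion and scale-up continue asynchronously after these calls return
    notify_team("Failover to DR region initiated")
    return {'status': 'success', 'failover': True}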
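The handler above acts on a single unhealthy result, so one dropped probe could trigger a full regional failover. A common safeguard, mirroring the failure_threshold = 3 setting on the Route53 health check earlier, is to require several consecutive failures first. The sketch below shows only the counting logic; `FailureCounter` is a hypothetical helper, and because Lambda invocations do not share memory reliably, a real deployment would need to persist the counter in an external store.

```python
class FailureCounter:
    """Signal failover only after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one probe result; return True when failover should trigger."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the streak
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


counter = FailureCounter(threshold=3)
probes = [False, False, True, False, False, False]
print([counter.record(p) for p in probes])
# [False, False, False, False, False, True]
```

The trade-off is detection time: with a threshold of 3 and a probe every 30 seconds, failover starts at least a minute later than it would on the first failure, and that delay counts against the RTO.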

Quiz

What is the difference between RTO and RPO?

RTO is for data loss, RPO is for downtime

RTO is maximum downtime allowed, RPO is maximum data loss acceptable

They are the same thing

RTO is for backup, RPO is for restore

Next Steps

Now let's move to real-world DevOps projects to apply what you've learned.