High Availability

High availability (HA) means a system stays operational and accessible even when individual components fail. It is usually expressed as the percentage of time the system is up over a given period, typically a year.

Availability Tiers

Availability	Downtime/Year	Use Case
99% (Two Nines)	3.65 days	Internal tools
99.9% (Three Nines)	8.76 hours	Business applications
99.99% (Four Nines)	52.6 minutes	Critical systems
99.999% (Five Nines)	5.26 minutes	Financial/medical
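
The downtime figures in the table follow directly from the percentage. As a quick check, here is a minimal sketch that converts each tier into its yearly downtime budget. The function name `downtime_minutes` is made up for this example, and a 365-day year is assumed.

```python
# Convert an availability percentage into the downtime it permits.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes(availability_percent: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime, in minutes, allowed over the period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_minutes * (1 - availability_percent / 100)


for tier in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes(tier)
    print(f"{tier}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Passing a different `period_minutes` (for example `30 * 24 * 60`) gives the budget for a month instead of a year.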

Designing for High Availability

Single Point of Failure ──> Eliminate with Redundancy

Before:                    After:
┌─────────┐               ┌─────────┐ ┌─────────┐
│  App    │               │  App 1  │ │  App 2  │
│ (SPOF)  │     ───>      │ (AZ-a)  │ │ (AZ-b)  │
└─────────┘               └────┬────┘ └────┬────┘
                               │           │
                         ┌─────┴───────────┴─────┐
                         │     Load Balancer     │
                         └───────────────────────┘
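
To put a rough number on what the redundancy above buys: if instance failures are independent, the combined availability of redundant instances is 1 minus the probability that all of them fail together. Real deployments only approximate independence, which is why the instances sit in separate AZs. The sketch below uses illustrative figures (a 99% instance and a 99.99% load balancer), not measured values, and the helper names are hypothetical.

```python
from math import prod


def parallel(*availabilities: float) -> float:
    """Availability of redundant components where any single one suffices."""
    return 1 - prod(1 - a for a in availabilities)


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


single_app = 0.99                            # illustrative figure
two_apps = parallel(single_app, single_app)  # 1 - 0.01 * 0.01
behind_lb = serial(0.9999, two_apps)         # load balancer in front (illustrative)

print(f"one instance:           {single_app:.4%}")
print(f"two redundant instances: {two_apps:.4%}")
print(f"with load balancer:      {behind_lb:.4%}")
```

The last line is a reminder that anything placed in series with the redundant pair, such as the load balancer, caps the overall figure and needs redundancy of its own.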

HA Principles

Eliminate Single Points of Failure (SPOF): Make every component redundant
Fault Isolation: Contain failures so they do not cascade to other components
Graceful Degradation: Serve reduced functionality rather than fail completely
Self-Healing: Detect failures and recover from them automatically
Monitor Everything: Maintain visibility into the health of every component

Scaling Strategies

Vertical Scaling (Scale Up)

Small Instance ──> Large Instance
  2 vCPU             16 vCPU
  4 GB RAM           64 GB RAM

Pros: Simple, no code changes
Cons: Hardware limits, downtime during resize, expensive

Horizontal Scaling (Scale Out)

1 Instance ──> 10 Instances
(2 vCPU)       (20 vCPU total)

Pros: Near-unlimited scale, fault tolerance, cost-effective
Cons: Requires load balancing, stateless design, complexity

Auto Scaling

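# AWS Auto Scaling (CloudFormation)
# SecurityGroup, PublicSubnet1, PublicSubnet2, and TargetGroup are assumed to be
# defined elsewhere in the same template.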
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: web-server
      LaunchTemplateData:
        ImageId: ami-12345678  # placeholder; use a real AMI ID for your region
        InstanceType: t3.medium
        SecurityGroupIds:
          - !Ref SecurityGroup
        UserData:
          Fn::Base64: |
            #!/bin/bash
            yum install -y nginx
            # Start nginx now and on every reboot
            systemctl enable --now nginx

  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:
        - !Ref PublicSubnet1
        - !Ref PublicSubnet2
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 2
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref TargetGroup
      Tags:
        - Key: Name
          Value: web-server
          PropagateAtLaunch: true

  ScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref AutoScalingGroup
      PolicyType: TargetTrackingScaling
      # Seconds before a new instance's metrics count toward the group average
      EstimatedInstanceWarmup: 60
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        # Keep average CPU across the group near 60%
        TargetValue: 60.0
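
Target tracking can be thought of as a proportional rule: if average CPU is 1.5 times the target, the group needs about 1.5 times as many instances. The sketch below shows only that arithmetic, clamped to the MinSize and MaxSize from the template above. It is a simplification for intuition, not the algorithm AWS actually runs, and `desired_capacity` is a hypothetical helper.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Proportional estimate of the instance count that returns the metric to target."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    estimate = math.ceil(current * metric / target)
    # Never go below MinSize or above MaxSize.
    return max(min_size, min(max_size, estimate))


# Two instances at 90% CPU against a 60% target: scale out.
print(desired_capacity(current=2, metric=90.0, target=60.0, min_size=2, max_size=10))
# Eight instances at 95% CPU: the estimate exceeds MaxSize and is clamped.
print(desired_capacity(current=8, metric=95.0, target=60.0, min_size=2, max_size=10))
```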

Multi-AZ and Multi-Region

Multi-AZ Deployment

┌──────────────────────────────────────────────────┐
│                    AWS Region                    │
│    ┌─────────────────┐    ┌─────────────────┐    │
│    │  Availability   │    │  Availability   │    │
│    │     Zone A      │◄──>│     Zone B      │    │
│    │                 │    │                 │    │
│    │ • EC2 instances │    │ • EC2 instances │    │
│    │ • RDS Primary   │───>│ • RDS Standby   │    │
│    │ • ELB Node      │    │ • ELB Node      │    │
│    └─────────────────┘    └─────────────────┘    │
│             ▲                      ▲             │
│             └──────────┬───────────┘             │
│                     ALB/NLB                      │
└──────────────────────────────────────────────────┘

Multi-Region Deployment

┌─────────────┐             ┌─────────────┐
│  us-east-1  │◄───────────>│  us-west-2  │
│  (Primary)  │     RDS     │ (Secondary) │
│             │ Replication │             │
│ • Active    │             │ • Read      │
│ • Writes    │  Route 53   │   Replica   │
│             │  Failover   │             │
└─────────────┘             └─────────────┘
       │                           │
       └─────────────┬─────────────┘
            Global Accelerator
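
The failover decision in the diagram reduces to a simple rule: send traffic to the primary region while its health check passes, otherwise to the secondary. A minimal sketch of that rule follows. The `Endpoint` type and `resolve` function are invented for illustration and do not correspond to any Route 53 or Global Accelerator API.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    region: str
    healthy: bool = True


def resolve(primary: Endpoint, secondary: Endpoint) -> Endpoint:
    """Failover routing: prefer the primary, fall back only when it is unhealthy."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    # Both are failing. Returning the primary is a choice made for this sketch.
    return primary


east = Endpoint("us-east-1")
west = Endpoint("us-west-2")
print(resolve(east, west).region)  # primary is healthy

east.healthy = False
print(resolve(east, west).region)  # primary failed, traffic moves to the secondary
```

Since the secondary in the diagram holds a read replica, a real failover would also need to promote that replica before the region can accept writes.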

Load Balancer Health Checks

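# AWS ALB health check (Terraform); aws_vpc.main is assumed to be defined elsewhere
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    healthy_threshold   = 2   # consecutive successes before a target is healthy
    interval            = 30  # seconds between checks
    matcher             = "200"
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    timeout             = 5   # seconds to wait for a response
    unhealthy_threshold = 2   # consecutive failures before a target is unhealthy
  }

  # Sticky sessions pin a client to one target for a day (86400 seconds).
  # Drop this block if the application is stateless.
  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = true
  }
}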
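
The two thresholds in the health check act as a small state machine: a target changes state only after that many consecutive results disagree with its current state, which stops a single slow response from removing a healthy target. The sketch below models that behavior under simplifying assumptions. A target starts out healthy here, and the `TargetHealth` class is hypothetical rather than part of any AWS SDK.

```python
class TargetHealth:
    """Tracks one target's state using consecutive-result thresholds."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so the streak is broken.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.healthy_threshold if check_passed else self.unhealthy_threshold
            if self._streak >= needed:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


target = TargetHealth()
for result in (True, False, True, False, False, True, True):
    state = "healthy" if target.record(result) else "unhealthy"
    print(f"check {'passed' if result else 'failed'} -> target is {state}")
```

With the settings above (an interval of 30 seconds and an unhealthy threshold of 2), a failed target is taken out of rotation after roughly a minute.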

Database High Availability

RDS Multi-AZ

# CloudFormation; DBPassword is assumed to be a NoEcho template parameter
DBInstance:
  Type: AWS::RDS::DBInstance
  Properties:
    DBName: mydb
    AllocatedStorage: 100
    DBInstanceClass: db.t3.medium
    Engine: postgres
    EngineVersion: '15.4'
    MasterUsername: dbadmin  # "admin" is a reserved word on RDS for PostgreSQL
    MasterUserPassword: !Ref DBPassword
    MultiAZ: true  # synchronous standby in a second AZ with automatic failover
    StorageEncrypted: true
    BackupRetentionPeriod: 7  # days
    PreferredBackupWindow: '03:00-04:00'  # UTC
    PreferredMaintenanceWindow: 'Mon:04:00-Mon:05:00'  # UTC
    AutoMinorVersionUpgrade: true
    DeletionProtection: true

Read Replicas

# Create read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica \
  --source-db-instance-identifier mydb \
  --db-instance-class db.t3.small

# Promote to standalone
aws rds promote-read-replica \
  --db-instance-identifier mydb-replica
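
Read replicas only help if the application sends reads to them. One common approach is to route by statement type, as in this rough sketch. The endpoint names and the `ReplicaRouter` class are placeholders, and the prefix check is deliberately naive: it would send a `SELECT ... FOR UPDATE` to a replica, for instance.

```python
import itertools


class ReplicaRouter:
    """Sends reads to replicas in round-robin order and everything else to the primary."""

    def __init__(self, primary: str, replicas: list[str]):
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql: str) -> str:
        is_read = sql.lstrip().lower().startswith("select")
        if is_read and self._replicas is not None:
            return next(self._replicas)
        # Writes, and all traffic when no replica exists, go to the primary.
        return self.primary


# Hypothetical endpoint names, loosely matching the identifiers in the CLI example.
router = ReplicaRouter("mydb.example.internal", ["mydb-replica.example.internal"])
print(router.endpoint_for("SELECT * FROM orders"))
print(router.endpoint_for("INSERT INTO orders VALUES (1)"))
```

Because replication to read replicas is asynchronous, a read that must see a just-committed write should still go to the primary.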

Health Checks and Circuit Breakers

Application Health Check

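# health.py
from datetime import datetime, timezone

from flask import Flask, jsonify
import psutil

app = Flask(__name__)

class Database:
    """Stand-in for the application's database client."""
    def ping(self):
        # Replace with a real connectivity check, such as running SELECT 1.
        return True

db = Database()

@app.route('/health')
def health():
    # Liveness: the process is up and able to answer requests.
    return jsonify({
        'status': 'healthy',
        'timestamp': datetime.now(timezone.utc).isoformat()
    })

@app.route('/ready')
def ready():
    # Readiness: dependencies such as the database are reachable.
    try:
        db.ping()
        return jsonify({'status': 'ready'})
    except Exception:
        return jsonify({'status': 'not ready'}), 503

@app.route('/metrics')
def metrics():
    # Basic host metrics as percentages
    return jsonify({
        'cpu_percent': psutil.cpu_percent(),
        'memory_percent': psutil.virtual_memory().percent,
        'disk_usage': psutil.disk_usage('/').percent
    })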

Circuit Breaker Pattern

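from circuitbreaker import CircuitBreakerError, circuit
import requests

# Opens after 5 consecutive failures and allows a trial call after 30 seconds.
@circuit(failure_threshold=5, recovery_timeout=30, expected_exception=requests.RequestException)
def call_external_api():
    response = requests.get('https://api.example.com/data', timeout=5)
    response.raise_for_status()  # count HTTP 4xx/5xx responses as failures
    return response.json()

def get_cached_data():
    # Placeholder fallback: return the last known good response or a default.
    return {}

# Usage
try:
    data = call_external_api()
except (CircuitBreakerError, requests.RequestException):
    # Fallback to cache or default
    data = get_cached_data()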
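
To make the pattern itself visible, here is a minimal from-scratch sketch of the three states (closed, open, half-open) using only the standard library. It is not how the `circuitbreaker` package is implemented. The class and exception names are invented for this example, and the injectable `clock` parameter exists only so the timing can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a dependency that is known to be down, which is what keeps a failure from cascading.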

Graceful Shutdown

# Kubernetes graceful shutdown
import signal
import sys
import time
from flask import Flask

app = Flask(__name__)

shutdown_requested = False

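def graceful_shutdown(signum, frame):
    """Handle SIGTERM/SIGINT: reject new work, drain, then exit."""
    global shutdown_requested
    print("Received termination signal. Starting graceful shutdown...")
    shutdown_requested = True

    # A production service would do the following here:
    #   - stop accepting new connections
    #   - finish processing current requests
    #   - close database connections
    #   - flush logs
    # The fixed sleep below is only a stand-in for that drain step. Keep it
    # shorter than the pod's terminationGracePeriodSeconds (30 seconds by
    # default), after which Kubernetes sends SIGKILL.
    time.sleep(5)
    sys.exit(0)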

signal.signal(signal.SIGTERM, graceful_shutdown)
signal.signal(signal.SIGINT, graceful_shutdown)

@app.route('/')
def hello():
    if shutdown_requested:
        return "Shutting down", 503
    return "Hello World"

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
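
A fixed sleep either waits too long or cuts requests off. One alternative is to count in-flight requests and wait until the count reaches zero, with a timeout as a safety net. The sketch below shows that idea with the standard library only. `DrainTracker` is a hypothetical helper, and wiring it into Flask (for example through request hooks) is left out.

```python
import threading


class DrainTracker:
    """Counts in-flight requests so shutdown can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Register a request; returns False once draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished and wake the drainer if it was the last."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop admitting requests; return True if all in-flight work finished in time."""
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

A request handler would call `try_begin()` on entry (answering 503 when it returns False) and `end()` on exit, while the shutdown path calls `drain()` with a timeout shorter than the platform's grace period.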

Quiz

What does 'Five Nines' (99.999%) availability mean?

5 minutes of downtime per month

5.26 minutes of downtime per year

5 hours of downtime per year

5 days of downtime per year

Next Steps

Now let's explore disaster recovery and backup strategies.