High Availability
Learn high availability strategies: load balancing, auto-scaling, multi-AZ deployments, and designing for fault tolerance.
By TechCoder Team
Last updated: 2026-06-02
In a Nutshell
This hands-on tutorial focuses on putting these high availability concepts into practice.
High Availability
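High availability (HA) means designing systems so that they stay operational and accessible when individual components fail. It is usually expressed as the percentage of time the system is up over a given period, typically a year.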
Availability Tiers
| Availability | Downtime/Year | Use Case |
|---|---|---|
| 99% (Two Nines) | 3.65 days | Internal tools |
| 99.9% (Three Nines) | 8.76 hours | Business applications |
| 99.99% (Four Nines) | 52.6 minutes | Critical systems |
| 99.999% (Five Nines) | 5.26 minutes | Financial/medical |
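The downtime figures in the table follow directly from the availability percentage. As a minimal sketch (the function name `downtime_minutes_per_year` is illustrative, and a 365-day year is assumed), the conversion looks like this:

```python
# Convert an availability percentage into the yearly downtime it allows.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR


for tier in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(tier)
    print(f"{tier}% -> {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Each additional nine cuts the downtime allowance by a factor of ten, which is why moving up a tier usually demands a change in architecture rather than tuning.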
Designing for High Availability
Single Point of Failure ──> Add Redundancy

Before:               After:
┌─────────┐           ┌─────────┐  ┌─────────┐
│   App   │           │  App 1  │  │  App 2  │
│ (SPOF)  │   ───>    │ (AZ-a)  │  │ (AZ-b)  │
└─────────┘           └────┬────┘  └────┬────┘
                           │            │
                      ┌────┴────────────┴────┐
                      │    Load Balancer     │
                      └──────────────────────┘
HA Principles
- Eliminate Single Points of Failure (SPOF): Every component should be redundant
- Fault Isolation: Failures shouldn't cascade
- Graceful Degradation: Reduced functionality over complete failure
- Self-Healing: Automatic recovery from failures
- Monitor Everything: Visibility into system health
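The first principle can be made concrete with some arithmetic. The sketch below assumes that components fail independently, which real deployments only approximate (a shared dependency or a bad deploy can take out every replica at once). The function names are illustrative.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(component: float, replicas: int) -> float:
    """Availability when at least one of N identical replicas must be up."""
    return 1 - (1 - component) ** replicas


# One 99% instance versus two 99% instances behind a load balancer
print(f"single instance: {parallel_availability(0.99, 1):.4%}")
print(f"two replicas:    {parallel_availability(0.99, 2):.4%}")

# A request that must pass through two 99.9% services in sequence
print(f"two-step chain:  {serial_availability(0.999, 0.999):.4%}")
```

Under these assumptions, redundancy raises availability while every extra dependency in the request path lowers it, so both adding replicas and shortening dependency chains help.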
Scaling Strategies
Vertical Scaling (Scale Up)
Small Instance ──> Large Instance
2 vCPU             16 vCPU
4 GB RAM           64 GB RAM
- Pros: Simple, no code changes
- Cons: Hardware limits, downtime during resize, expensive
Horizontal Scaling (Scale Out)
1 Instance ──> 10 Instances
(2 vCPU)       (20 vCPU total)
- Pros: Near-unlimited scale, fault tolerance, cost-effective
- Cons: Requires load balancing, stateless design, complexity
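Horizontal scaling relies on something spreading requests across instances and routing around the ones that are down. The following is a minimal round-robin sketch of that idea, not how any particular load balancer is implemented; the class `RoundRobinBalancer` and the backend names are hypothetical.

```python
class RoundRobinBalancer:
    """Hand out backends in rotation, skipping any marked unhealthy."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._next = 0  # index of the next backend to try

    def mark_unhealthy(self, backend):
        self._healthy.discard(backend)

    def mark_healthy(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call
        for _ in range(len(self._backends)):
            backend = self._backends[self._next]
            self._next = (self._next + 1) % len(self._backends)
            if backend in self._healthy:
                return backend
        raise RuntimeError("no healthy backends available")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
balancer.mark_unhealthy("app-2")
print([balancer.pick() for _ in range(4)])
```

Because any instance may receive any request, this only works when instances are stateless or share their state through an external store.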
Auto Scaling
# AWS Auto Scaling (CloudFormation)
# SecurityGroup, PublicSubnet1, PublicSubnet2, and TargetGroup are assumed
# to be defined elsewhere in the same template.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  LaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: web-server
      LaunchTemplateData:
        ImageId: ami-12345678  # Placeholder AMI ID
        InstanceType: t3.medium
        SecurityGroupIds:
          - !Ref SecurityGroup
        UserData:
          Fn::Base64: |
            #!/bin/bash
            yum install -y nginx
            systemctl start nginx
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier:  # Use subnets in different Availability Zones
        - !Ref PublicSubnet1
        - !Ref PublicSubnet2
      LaunchTemplate:
        LaunchTemplateId: !Ref LaunchTemplate
        Version: !GetAtt LaunchTemplate.LatestVersionNumber
      MinSize: 2
      MaxSize: 10
      DesiredCapacity: 2
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref TargetGroup
      Tags:
        - Key: Name
          Value: web-server
          PropagateAtLaunch: true
  ScalingPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref AutoScalingGroup
      PolicyType: TargetTrackingScaling
      # ScaleInCooldown/ScaleOutCooldown belong to Application Auto Scaling
      # policies and are not valid here; EC2 Auto Scaling uses an instance
      # warm-up period instead.
      EstimatedInstanceWarmup: 60
      TargetTrackingConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ASGAverageCPUUtilization
        TargetValue: 60.0
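To build intuition for what a target tracking policy does with `TargetValue: 60.0`, the sketch below uses a simple proportional rule clamped to `MinSize` and `MaxSize`. This is an illustrative approximation, not the algorithm AWS actually runs, which is not reproduced here. The function name and arguments are hypothetical.

```python
import math


def desired_capacity(current_capacity: int, current_metric: float,
                     target_metric: float, min_size: int, max_size: int) -> int:
    """Estimate the instance count that would bring the metric back to target."""
    # Scale capacity in proportion to how far the metric is from its target
    raw = math.ceil(current_capacity * current_metric / target_metric)
    # Never go below MinSize or above MaxSize
    return max(min_size, min(max_size, raw))


# Two instances at 90% average CPU with a 60% target
print(desired_capacity(2, 90.0, 60.0, min_size=2, max_size=10))
# Eight instances at 100% CPU would need more than MaxSize allows
print(desired_capacity(8, 100.0, 60.0, min_size=2, max_size=10))
```

The clamping step is why `MaxSize` deserves care: if it is set too low, the group cannot absorb a traffic spike regardless of the policy.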
Multi-AZ and Multi-Region
Multi-AZ Deployment
┌─────────────────────────────────────────────────────┐
│                     AWS Region                      │
│     ┌─────────────────┐    ┌─────────────────┐      │
│     │  Availability   │    │  Availability   │      │
│     │     Zone A      │◄──>│     Zone B      │      │
│     │                 │    │                 │      │
│     │ • EC2 instances │    │ • EC2 instances │      │
│     │ • RDS Primary   │───>│ • RDS Standby   │      │
│     │ • ELB Node      │    │ • ELB Node      │      │
│     └─────────────────┘    └─────────────────┘      │
│              ▲                      ▲               │
│              └──────────┬───────────┘               │
│                      ALB/NLB                        │
└─────────────────────────────────────────────────────┘
Multi-Region Deployment
┌─────────────┐             ┌─────────────┐
│  us-east-1  │◄───────────>│  us-west-2  │
│  (Primary)  │     RDS     │ (Secondary) │
│             │ Replication │             │
│ • Active    │             │ • Read      │
│ • Writes    │  Route 53   │   Replica   │
│             │  Failover   │             │
└──────┬──────┘             └──────┬──────┘
       │                           │
       └─────────────┬─────────────┘
            Global Accelerator
Load Balancer Health Checks
# AWS ALB Health Check
resource "aws_lb_target_group" "app" {
  name     = "app-tg"
  port     = 80
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    interval            = 30
    matcher             = "200"
    path                = "/health"
    port                = "traffic-port"
    protocol            = "HTTP"
    timeout             = 5
    unhealthy_threshold = 2
  }

  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = true
  }
}
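The `healthy_threshold` and `unhealthy_threshold` settings mean a target only changes state after several consecutive check results that disagree with its current state. A minimal sketch of that logic follows; the class `TargetHealth` is hypothetical, and starting in the unhealthy state is a simplification.

```python
class TargetHealth:
    """Track one target's health using consecutive-result thresholds."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = False  # simplification: new targets start unhealthy
        self._streak = 0      # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if check_passed == self.healthy:
            # Result agrees with the current state, so any streak is broken
            self._streak = 0
            return self.healthy
        self._streak += 1
        threshold = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self._streak >= threshold:
            self.healthy = check_passed
            self._streak = 0
        return self.healthy


target = TargetHealth(healthy_threshold=2, unhealthy_threshold=2)
for result in (True, True, False, True, False, False):
    print(result, "->", "healthy" if target.record(result) else "unhealthy")
```

With an `interval` of 30 seconds and an `unhealthy_threshold` of 2, a failed instance keeps receiving traffic for roughly a minute before it is taken out of rotation. Lower values detect failures faster but make the target group more sensitive to brief glitches.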
Database High Availability
RDS Multi-AZ
# CloudFormation (resource fragment; DBPassword is assumed to be a template parameter)
DBInstance:
  Type: AWS::RDS::DBInstance
  Properties:
    DBName: mydb
    AllocatedStorage: 100
    DBInstanceClass: db.t3.medium
    Engine: postgres
    EngineVersion: '15.4'
    MasterUsername: dbadmin  # "admin" is a reserved word in RDS for PostgreSQL
    MasterUserPassword: !Ref DBPassword
    MultiAZ: true  # Keeps a synchronous standby in a second Availability Zone
    StorageEncrypted: true
    BackupRetentionPeriod: 7
    PreferredBackupWindow: '03:00-04:00'
    PreferredMaintenanceWindow: 'Mon:04:00-Mon:05:00'
    AutoMinorVersionUpgrade: true
    DeletionProtection: true
Read Replicas
# Create read replica
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica \
  --source-db-instance-identifier mydb \
  --db-instance-class db.t3.small

# Promote to standalone
aws rds promote-read-replica \
  --db-instance-identifier mydb-replica
Health Checks and Circuit Breakers
Application Health Check
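# health.py
from datetime import datetime, timezone

from flask import Flask, jsonify
import psutil

app = Flask(__name__)


def check_database():
    # Placeholder: replace with a real connectivity check (for example,
    # running SELECT 1) that raises an exception when the database is down.
    pass


@app.route('/health')
def health():
    # Liveness: the process is running and can serve requests
    return jsonify({
        'status': 'healthy',
        'timestamp': datetime.now(timezone.utc).isoformat()
    })


@app.route('/ready')
def ready():
    # Readiness: check database connectivity
    try:
        check_database()
        return jsonify({'status': 'ready'})
    except Exception:
        return jsonify({'status': 'not ready'}), 503


@app.route('/metrics')
def metrics():
    # Basic host resource usage
    return jsonify({
        'cpu_percent': psutil.cpu_percent(),
        'memory_percent': psutil.virtual_memory().percent,
        'disk_usage': psutil.disk_usage('/').percent
    })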
Circuit Breaker Pattern
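from circuitbreaker import circuit, CircuitBreakerError
import requests


# Open the circuit after 5 consecutive failures; allow a trial call after 30 s
@circuit(failure_threshold=5, recovery_timeout=30, expected_exception=requests.RequestException)
def call_external_api():
    response = requests.get('https://api.example.com/data', timeout=5)
    response.raise_for_status()  # Count HTTP 4xx/5xx responses as failures
    return response.json()


def get_cached_data():
    # Placeholder fallback: return the last known good value or a default
    return {}


# Usage
try:
    data = call_external_api()
except (requests.RequestException, CircuitBreakerError):
    # The call failed or the circuit is open: fall back to cache or default
    data = get_cached_data()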
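To show what happens behind a decorator like `@circuit`, here is a minimal hand-rolled sketch of the three states: closed, open, and half-open. It is not the `circuitbreaker` library's implementation. The names `CircuitBreaker` and `CircuitOpenError` are hypothetical, and the injectable `clock` argument exists only to make the behavior easy to test.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half-open"  # one trial call is allowed through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens immediately; otherwise open at the threshold
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, recovery_timeout=30.0)
print(breaker.call(lambda: "ok"), breaker.state)
```

Failing fast while the circuit is open protects both sides: callers get an immediate error they can handle with a fallback, and the struggling dependency gets time to recover without a flood of retries.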
Graceful Shutdown
# Kubernetes graceful shutdown
# Kubernetes sends SIGTERM to the container, then SIGKILL once the
# termination grace period (30 seconds by default) expires.
import signal
import sys
import time

from flask import Flask

app = Flask(__name__)
shutdown_requested = False


def graceful_shutdown(signum, frame):
    global shutdown_requested
    print("Received termination signal. Starting graceful shutdown...")
    shutdown_requested = True
    # A production service would do the following here:
    # - Stop accepting new connections
    # - Finish processing current requests
    # - Close database connections
    # - Flush logs
    time.sleep(5)  # Allow in-flight requests to complete
    sys.exit(0)


signal.signal(signal.SIGTERM, graceful_shutdown)
signal.signal(signal.SIGINT, graceful_shutdown)


@app.route('/')
def hello():
    if shutdown_requested:
        # Reject new work while draining
        return "Shutting down", 503
    return "Hello World"


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
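The fixed `time.sleep(5)` above is a guess at how long in-flight requests need. One alternative, sketched here with only the standard library, is to count in-flight requests and wait until the count reaches zero or a deadline passes. The class `RequestDrainer` is hypothetical, and wiring it into a real web server is left out.

```python
import threading
from contextlib import contextmanager


class RequestDrainer:
    """Track in-flight requests so shutdown can wait for them to finish."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    @contextmanager
    def track(self):
        """Wrap the handling of one request."""
        with self._cond:
            if self._draining:
                raise RuntimeError("shutting down; not accepting new requests")
            self._in_flight += 1
        try:
            yield
        finally:
            with self._cond:
                self._in_flight -= 1
                self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop admitting requests; return True if all finished before the timeout."""
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


drainer = RequestDrainer()
with drainer.track():
    pass  # handle a request here
print("drained:", drainer.drain(timeout=5.0))
```

A SIGTERM handler could call `drain()` with a timeout shorter than the platform's termination grace period, then close database connections and exit.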
Quiz
Question 1 of 5: What does 'Five Nines' (99.999%) availability mean?
- 5 minutes of downtime per month
- 5.26 minutes of downtime per year
- 5 hours of downtime per year
- 5 days of downtime per year
Next Steps
Now let's explore disaster recovery and backup strategies.