
High Availability & Disaster Recovery

Design systems that survive failures and disasters, and learn to achieve 99.99% availability with Azure HA/DR patterns.

What You’ll Learn

By the end of this chapter, you’ll understand:
  • What High Availability and Disaster Recovery really mean (and why they’re different)
  • How to calculate and achieve specific SLAs (99.9%, 99.99%, 99.999%)
  • What RPO and RTO mean (and why confusing them costs companies millions)
  • How to design systems that survive datacenter failures (Availability Zones)
  • How to design systems that survive regional disasters (Multi-region DR)
  • Real-world DR architectures with actual costs and trade-offs
  • How to test your DR plan (because untested plans always fail)

Introduction: What is High Availability & Disaster Recovery?

Start Here if You’re Completely New

High Availability (HA) = Your app stays online even when things break.
Disaster Recovery (DR) = Your app can recover from catastrophic failures.

Think of it like a restaurant:

High Availability (HA):
  • Problem: One cook gets sick
  • Solution: You have 3 cooks (redundancy)
  • Result: Restaurant stays open ✅
  • Downtime: 0 seconds
Disaster Recovery (DR):
  • Problem: Fire destroys the entire restaurant
  • Solution: You have a second location across town (backup site)
  • Result: Open at backup location in 2 hours ✅
  • Downtime: 2 hours (but you survived!)
Key Difference:
  • HA = Handles small failures (broken VM, network glitch) → Seconds of downtime
  • DR = Handles catastrophic failures (datacenter destroyed, region offline) → Hours of downtime

Why This Matters: The Cost of Downtime

Real-World Disaster Example

Amazon Prime Day Outage (2018)
  • What happened: Website crashed for 63 minutes during biggest sale day
  • Estimated revenue loss: $99 million in 63 minutes
  • Per-minute cost: $1.57 million/minute
  • Per-second cost: $26,000/second
Every second your app is down means money lost, customers lost, and reputation damaged.

More Real Examples:

| Company | Incident | Downtime | Cost | Cause |
|---|---|---|---|---|
| Facebook | Oct 2021 | 6 hours | $60-100M | BGP routing error |
| GitHub | Oct 2018 | 24 hours | Unknown | Network partition |
| Google Cloud | Jun 2019 | 4.5 hours | Unknown | Configuration error |
| British Airways | May 2017 | 3 days | $100M+ | Power surge |
The Pattern: Companies that invest in HA/DR save millions; companies that don't invest lose millions.

Understanding SLAs (Service Level Agreements)

What is an SLA?

SLA = A promise about how much downtime is acceptable.

Think of it like a pizza delivery guarantee:
  • Pizza delivery SLA: “30 minutes or it’s free”
  • Azure VM SLA: “99.9% uptime or we give you credits”

SLA Math Explained (From Scratch)

99.9% uptime sounds amazing, right? Let’s see what it actually means:
100% - 99.9% = 0.1% downtime allowed

0.1% of 1 month (30 days) = ?

Calculation:
30 days = 30 × 24 hours = 720 hours
0.1% of 720 hours = 0.72 hours = 43.2 minutes

Result: 99.9% SLA = 43 minutes of downtime per month is OK ⚠️
Translation Table:

| SLA | Downtime/Month | Downtime/Year | Real-World Impact |
|---|---|---|---|
| 99% | 7.2 hours | 3.65 days | ❌ Unacceptable for production |
| 99.9% | 43.2 minutes | 8.76 hours | ⚠️ OK for internal tools |
| 99.95% | 21.6 minutes | 4.38 hours | ✅ Good for most apps |
| 99.99% | 4.3 minutes | 52.56 minutes | ✅ Great for e-commerce |
| 99.999% | 26 seconds | 5.26 minutes | 🚀 Required for banking |
Example: Your e-commerce site makes $100,000/day
  • 99.9% SLA: 43 minutes downtime/month = $3,000 lost revenue
  • 99.99% SLA: 4.3 minutes downtime/month = $300 lost revenue
  • Cost to upgrade: ~$200/month
  • Savings: $2,700/month → 13.5x ROI ✅
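These numbers are easy to verify yourself. Here is a small Python sketch of the same arithmetic; the 30-day month and the $100,000/day revenue figure are the assumptions from the example above:

```python
# Sketch: allowed downtime and revenue at risk for a given SLA.
# Assumes a 30-day month and $100,000/day revenue (the example above).

def downtime_minutes_per_month(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed per month at a given SLA."""
    return days * 24 * 60 * (1 - sla_percent / 100)

def revenue_at_risk(sla_percent: float, daily_revenue: float) -> float:
    """Revenue lost if the entire downtime budget is consumed."""
    per_minute = daily_revenue / (24 * 60)
    return downtime_minutes_per_month(sla_percent) * per_minute

for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    mins = downtime_minutes_per_month(sla)
    loss = revenue_at_risk(sla, daily_revenue=100_000)
    print(f"{sla:>7}% -> {mins:7.1f} min/month, ~${loss:,.0f} at risk")
```

Running it reproduces the table above: 99.9% allows 43.2 minutes (~$3,000 at risk), 99.99% allows 4.3 minutes (~$300).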

How Azure Achieves High Availability

The Building Blocks (From Smallest to Largest)

1. Physical Server (Single point of failure ❌)

2. Availability Set (Multiple servers in same datacenter ✅)

3. Availability Zone (Multiple datacenters in same region ✅✅)

4. Region Pair (A second region hundreds of kilometers away ✅✅✅)
Let’s understand each one:

1. Single VM (No High Availability)

Your Application

Single VM in Azure

Physical Server #47 in East US Datacenter

What Happens if Physical Server #47 fails?
- Your VM goes down ❌
- Downtime: 10-30 minutes (while Azure moves VM to new server)
- SLA: 99.9% (43 minutes downtime/month)
Real-World Analogy: Running a restaurant with only 1 cook. Cook gets sick = restaurant closes.

2. Availability Set (Same Datacenter, Different Racks)

Your Application (Load Balanced)
  ├── VM 1 → Physical Server #47 (Rack A)
  ├── VM 2 → Physical Server #128 (Rack B)
  └── VM 3 → Physical Server #201 (Rack C)

All in the same datacenter, but different racks (power/network isolation)

What Happens if Physical Server #47 fails?
- VM 1 goes down ❌
- VM 2 and VM 3 still running ✅
- Load balancer routes traffic to healthy VMs
- Downtime: 0 seconds ✅
- SLA: 99.95% (21 minutes downtime/month)
Real-World Analogy: Restaurant with 3 cooks. One cook gets sick = other 2 keep working.

Cost Example:
  • 1 VM (99.9%): $50/month → 43 min downtime
  • 3 VMs in Availability Set (99.95%): $150/month → 21 min downtime
  • Extra cost: $100/month → Saves 22 minutes of downtime

3. Availability Zones (Different Datacenters, Same Region)

East US Region (has 3 Availability Zones)

Zone 1: Datacenter Building A (15 km away)
  └── VM 1

Zone 2: Datacenter Building B (20 km away)
  └── VM 2

Zone 3: Datacenter Building C (25 km away)
  └── VM 3

Each zone has independent:
- Power supply (different power grid)
- Cooling system
- Network connections

What Happens if Entire Datacenter Building A loses power?
- Zone 1 (VM 1) goes down ❌
- Zone 2 (VM 2) still running ✅
- Zone 3 (VM 3) still running ✅
- Downtime: 0 seconds ✅
- SLA: 99.99% (4.3 minutes downtime/month)
Real-World Analogy: Restaurant chain with 3 locations in the same city. One location catches fire = other 2 still serve customers.

Cost Example:
  • 3 VMs in Availability Set (99.95%): $150/month → 21 min downtime
  • 3 VMs across Availability Zones (99.99%): $150/month → 4.3 min downtime
  • Extra cost: $0 (same price!) → Saves 17 minutes of downtime ✅
Why Availability Zones are Better:
  • Protects against datacenter-level disasters (fire, flood, power outage)
  • Same cost as Availability Set
  • Higher SLA (99.99% vs 99.95%)

4. Multi-Region (Disaster Recovery)

Primary Region: East US
  └── 3 VMs across Availability Zones (99.99% SLA)

Secondary Region: West Europe (4,000 km away)
  └── 3 VMs across Availability Zones (standby)

Azure Front Door (Global Load Balancer)
  ├── Route to East US (primary)
  └── Failover to West Europe (if East US fails)

What Happens if Entire East US Region goes offline?
(Hurricane, earthquake, massive network outage)
- East US completely offline ❌
- Front Door automatically routes to West Europe ✅
- Downtime: 2-5 minutes (health-probe detection and traffic rerouting) ✅
- SLA: 99.99%+ (composite SLA)
Real-World Analogy: Restaurant chain with locations in New York and London. A hurricane destroys New York = the London location still serves customers.

Cost Example:
  • Single Region (East US): $150/month → 4.3 min downtime
  • Multi-Region (East US + West Europe): $300/month → 2 min downtime
  • Extra cost: $150/month → Protects against regional disasters
When You Need Multi-Region:
  • ✅ Mission-critical applications (banking, healthcare)
  • ✅ Global user base (low latency everywhere)
  • ✅ Compliance requirements (data residency)
  • ❌ Small internal tools (not worth the cost)

1. Availability SLAs

99%     = 7.2 hours downtime/month
99.9%   = 43.2 minutes downtime/month
99.95%  = 21.6 minutes downtime/month
99.99%  = 4.3 minutes downtime/month
99.999% = 26 seconds downtime/month

Understanding RPO & RTO (From Absolute Zero)

What is RPO and RTO?

These are the TWO most important numbers in disaster recovery. Companies have lost millions by confusing them.

RPO (Recovery Point Objective) = “How much data can we afford to lose?”
RTO (Recovery Time Objective) = “How long can we be offline?”

Real-World Analogy: Writing a Book

Imagine you’re writing a 500-page book on your computer:

Scenario 1: You save every 5 minutes (RPO = 5 minutes)
Time: 2:00 PM → You save your work (page 247)
Time: 2:03 PM → You write 2 more pages (now on page 249)
Time: 2:05 PM → COMPUTER CRASHES! ❌

What happened:
- Last save: 2:00 PM (page 247)
- Crash: 2:05 PM (page 249)
- Lost work: 2 pages (5 minutes of work)

RPO = 5 minutes (you lost 5 minutes of work)
Scenario 2: You save every 1 hour (RPO = 1 hour)
Time: 1:00 PM → You save your work (page 220)
Time: 1:58 PM → You write 29 more pages (now on page 249)
Time: 2:00 PM → COMPUTER CRASHES! ❌

What happened:
- Last save: 1:00 PM (page 220)
- Crash: 2:00 PM (page 249)
- Lost work: 29 pages (1 hour of work)

RPO = 1 hour (you lost 1 hour of work)
RPO = Time between backups = Amount of data you can lose
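In code, the RPO you actually experience in an incident is just the gap between the crash and the last good backup. A minimal Python sketch of Scenario 1 (the date is an arbitrary placeholder):

```python
from datetime import datetime

# Realized RPO = time between the last good backup and the failure.
last_backup = datetime(2024, 6, 1, 14, 0)  # 2:00 PM save
crash_time = datetime(2024, 6, 1, 14, 5)   # 2:05 PM crash

data_loss = crash_time - last_backup
print(f"Work lost: {data_loss}")  # 0:05:00 -> realized RPO = 5 minutes
```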

Now RTO (Recovery Time Objective)

RTO = How long until you’re back to work after a disaster.

Continuing the book analogy:

Scenario 1: Backup laptop ready (RTO = 5 minutes)
Time: 2:00 PM → Computer crashes ❌
Time: 2:01 PM → Grab backup laptop from closet
Time: 2:03 PM → Log into backup laptop
Time: 2:05 PM → Open last saved version (from 2:00 PM)
Time: 2:05 PM → Back to writing! ✅

RTO = 5 minutes (time to get back to work)
Scenario 2: Need to buy new laptop (RTO = 3 days)
Day 1, 2:00 PM → Computer crashes ❌
Day 1, 3:00 PM → Drive to store, store is out of stock
Day 2, 10:00 AM → Order laptop online
Day 3, 4:00 PM → Laptop arrives, install software
Day 3, 6:00 PM → Back to writing! ✅

RTO = 3 days (time to get back to work)
RTO = Time to recover from disaster = How long you’re offline

The Critical Difference (Why Companies Confuse This)

[!WARNING] Common Mistake: Confusing RPO and RTO
RPO = Data Loss (measured in TIME since last backup)
RTO = Downtime (measured in TIME to recover)

You can have DIFFERENT combinations:

Example 1: Low RPO, High RTO
E-commerce Database:
- RPO: 1 minute (backup every minute)
- RTO: 4 hours (takes 4 hours to restore from backup)

Result:
- Data loss: Only 1 minute of orders lost ✅
- Downtime: 4 hours offline ❌
- Lost revenue: $400,000 (at $100,000/hour)
Example 2: High RPO, Low RTO
Analytics Dashboard:
- RPO: 24 hours (backup once daily)
- RTO: 5 minutes (hot standby ready)

Result:
- Data loss: 24 hours of analytics data lost ❌ (but analytics can be regenerated)
- Downtime: 5 minutes offline ✅
- Lost revenue: $0 (dashboard back quickly)
Example 3: Low RPO, Low RTO (Expensive but Best)
Banking System:
- RPO: 0 seconds (continuous replication)
- RTO: 30 seconds (automatic failover)

Result:
- Data loss: 0 transactions lost ✅
- Downtime: 30 seconds offline ✅
- Lost revenue: Minimal
- Cost: High ($$$$)

Real-World RPO/RTO Example: GitLab Database Incident (2017)

The Disaster:
  • GitLab engineer accidentally deleted production database
  • 300 GB of data vanished
What They THOUGHT Their RPO Was: 24 hours (daily backups)
What Their RPO ACTUALLY Was: 6 hours (the daily backups were silently failing; only a staging snapshot from 6 hours earlier worked)

Actual Result:
  • RPO: 6 hours → Lost 6 hours of data (5,000 projects, 5,000 comments, 700 new users)
  • RTO: 18 hours → Took 18 hours to restore from backups
  • Total impact: 6 hours of data lost + 18 hours offline
  • Cost: Immeasurable reputation damage (but they recovered with transparency)
Lesson: Your DR plan is only as good as your last successful restore TEST.

How to Choose Your RPO/RTO

Step 1: Calculate Cost of Downtime
Your e-commerce site makes $100,000/day

Cost per hour = $100,000 ÷ 24 = $4,166/hour
Cost per minute = $4,166 ÷ 60 = $69/minute
Cost per second = $69 ÷ 60 = $1.15/second
Step 2: Calculate Acceptable Loss
Question: "Can we afford to lose 1 hour of orders?"

1 hour of orders = $4,166 in revenue

If RTO = 1 hour:
- Lost revenue: $4,166
- Acceptable? (You decide based on business impact)

If RTO = 5 minutes:
- Lost revenue: $347
- Acceptable? (Much better!)
Step 3: Calculate Cost of DR Solution
Option 1: Daily Backups
- RPO: 24 hours
- RTO: 4 hours
- Cost: $50/month
- Risk: Lose up to $100,000 in orders + 4 hours downtime ($16,664)

Option 2: Continuous Replication + Auto-Failover
- RPO: 0 seconds
- RTO: 2 minutes
- Cost: $500/month
- Risk: Lose 2 minutes of uptime ($138)

Which is better?
- Option 2 costs $450 more per month
- But saves $100,000+ in potential losses
- ROI: 222x return on investment ✅
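You can fold this reasoning into a single comparison: monthly solution cost plus the loss you would eat in one incident. A hedged Python sketch reusing the numbers above (how often incidents actually occur is an assumption you must estimate for your own system):

```python
# Sketch: compare DR options by monthly cost plus per-incident loss.
# Uses the $100,000/day example above; incident frequency is up to you.

DAILY_REVENUE = 100_000
PER_HOUR = DAILY_REVENUE / 24

options = {
    "Daily backups":               {"monthly_cost": 50,  "rpo_hours": 24, "rto_hours": 4},
    "Continuous replication + HA": {"monthly_cost": 500, "rpo_hours": 0,  "rto_hours": 2 / 60},
}

for name, o in options.items():
    # Loss = unrecoverable revenue (RPO window) + revenue while offline (RTO)
    loss_per_incident = (o["rpo_hours"] + o["rto_hours"]) * PER_HOUR
    print(f"{name}: ${o['monthly_cost']}/month, ~${loss_per_incident:,.0f} per incident")
```

One incident wipes out years of the $450/month premium, which is where the 222x figure comes from.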

Decision Tree: Choosing RPO/RTO

Q1: What type of application?

├─ Internal Tool (HR dashboard, wiki)
│  ├─ RPO: 24 hours (daily backup OK)
│  ├─ RTO: 4 hours (users can wait)
│  └─ Solution: Daily backups ($50/month)
│
├─ Customer-Facing App (blog, docs site)
│  ├─ RPO: 1 hour (acceptable data loss)
│  ├─ RTO: 1 hour (acceptable downtime)
│  └─ Solution: Hourly backups + standby VM ($200/month)
│
├─ E-commerce / Revenue-Generating
│  ├─ RPO: 5 minutes (minimal data loss)
│  ├─ RTO: 5 minutes (minimal downtime)
│  └─ Solution: Continuous replication + auto-failover ($500/month)
│
└─ Banking / Healthcare / Critical
   ├─ RPO: 0 seconds (zero data loss)
   ├─ RTO: 30 seconds (near-zero downtime)
   └─ Solution: Synchronous replication + active-active ($2,000/month)
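If you want this decision tree in code, a lookup table is usually enough. A minimal sketch using the tiers and figures above (the tier names are hypothetical labels):

```python
# Sketch: the decision tree above as a simple lookup table.
DR_TIERS = {
    "internal":  {"rpo": "24 hours",  "rto": "4 hours",    "solution": "Daily backups (~$50/month)"},
    "customer":  {"rpo": "1 hour",    "rto": "1 hour",     "solution": "Hourly backups + standby VM (~$200/month)"},
    "ecommerce": {"rpo": "5 minutes", "rto": "5 minutes",  "solution": "Continuous replication + auto-failover (~$500/month)"},
    "critical":  {"rpo": "0 seconds", "rto": "30 seconds", "solution": "Synchronous replication + active-active (~$2,000/month)"},
}

print(DR_TIERS["ecommerce"]["solution"])
```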

2. RPO & RTO

[!WARNING] Gotcha: RPO vs RTO
A common interview trap. RPO (Point) = Data Loss (“How far back do we go?”). RTO (Time) = Downtime (“How long until we are back online?”). You can have a low RPO (zero data loss) but a high RTO (it still took 4 hours to restart).
[!TIP] Jargon Alert: Split Brain
A disaster scenario where two databases both think they are “Primary” and accept writes at the same time, corrupting data. Always use a “Witness” or “Quorum” to prevent this in active-active architectures.
Quick Reference (after reading the detailed explanation above):
RPO (Recovery Point Objective): How much data loss is acceptable?
RTO (Recovery Time Objective): How long to recover?

Example Scenarios

Scenario 1: E-commerce Site
RPO: 5 minutes (transactional data)
RTO: 15 minutes (revenue impact)
Solution: Auto-failover groups, zone-redundant services

Scenario 2: Internal Analytics
RPO: 24 hours (reports can be delayed)
RTO: 4 hours (not business-critical)
Solution: Daily backups, manual recovery

Scenario 3: Banking System
RPO: 0 seconds (no data loss acceptable)
RTO: < 1 minute (critical business)
Solution: Synchronous replication, active-active

Common Mistakes in HA/DR (Learn from Others’ Failures)

Mistake #1: “Stopped” VMs Still Cost Money

The Trap:
Developer thinks: "I'll save money by stopping VMs at night"

Developer clicks "Stop" in Azure Portal
VM Status: "Stopped" ✅

Month-end bill arrives: Still charged $2,000! ❌
What Happened:
  • “Stopped” = the OS shut down, but the VM’s hardware stays allocated
  • Azure keeps billing for the reserved compute capacity
  • Correct action: “Deallocate” (not just “Stop”)
Cost Impact:
  • Stopped (not deallocated) VM: still roughly 80% of full cost
  • Deallocated VM: only pay for storage (~5% of full cost)
Fix:
# WRONG (still costs money)
az vm stop --name myVM --resource-group myRG

# CORRECT (actually saves money)
az vm deallocate --name myVM --resource-group myRG
Savings: $1,600/month per VM ✅

Mistake #2: Untested Backups (The GitLab Disaster)

The Trap:
Company: "We have daily backups, we're safe!"

Reality:
- Backups running for 6 months ✅
- Never tested a restore ❌
- Disaster strikes
- Try to restore... backups are CORRUPTED ❌
- All backups unusable
Real Example: Code Spaces (2014):
  • Hosting company for developers
  • Backups existed but were on same infrastructure
  • Hacker deleted everything (including backups)
  • Company went out of business
  • Customers lost everything
The Fix: Test quarterly
Q1: January → Test restore production database to staging
Q2: April → Test restore VM from backup
Q3: July → Test failover to secondary region
Q4: October → Full disaster recovery drill
Cost of Testing: $500/month (test infrastructure)
Cost of Untested Backups: business bankruptcy ❌

Mistake #3: Ignoring Composite SLAs

The Trap:
Architect: "Our app has 99.99% SLA!"

Reality:
App Service (99.95%) × Azure SQL (99.99%) × Redis (99.9%)
= 99.95% × 99.99% × 99.9%
= 99.84% actual SLA ❌

99.84% = 69 minutes downtime/month (not 4.3 minutes!)
The Math (Explained Simply):
When you chain services, SLAs MULTIPLY:

Service A: 99.9% availability = 99.9% = 0.999
Service B: 99.9% availability = 99.9% = 0.999

Combined: 0.999 × 0.999 = 0.998001 = 99.8%

Translation:
- Promised: 99.9% (43 min downtime)
- Actual: 99.8% (86 min downtime)
- Difference: 2x more downtime! ❌
Real Example:
E-commerce Application Stack:
├── Azure Front Door: 99.99%
├── App Service: 99.95%
├── Azure SQL: 99.99%
├── Redis Cache: 99.9%
└── Blob Storage: 99.99%

Combined SLA:
0.9999 × 0.9995 × 0.9999 × 0.999 × 0.9999 = 0.9982 = 99.82%

Expected downtime: 4.3 min/month
Actual downtime: 77 min/month (18x worse!)
The Fix: Add redundancy
Option 1: Single Region
- Combined SLA: 99.82%
- Downtime: 77 min/month

Option 2: Multi-Region (Active-Passive)
- Each region: 99.82% (failure probability 0.18% = 0.0018)
- Probability both fail at once: 0.0018 × 0.0018 = 0.00032%
- Theoretical combined SLA: 99.9997% (assuming independent failures)
- In practice, failover detection and cutover time dominate, so budget minutes of downtime, not seconds

Improvement: dramatically better availability ✅
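Composite SLAs are easy to get wrong by hand, so it is worth scripting the two rules: chained dependencies multiply availabilities, and independent redundant copies multiply failure probabilities. A sketch using the stack above (and the idealized independence assumption):

```python
import math

def serial_sla(*slas_percent: float) -> float:
    """Chained dependencies: availabilities multiply."""
    return math.prod(s / 100 for s in slas_percent) * 100

def parallel_sla(sla_percent: float, copies: int) -> float:
    """Independent redundant copies: failure probabilities multiply."""
    failure = 1 - sla_percent / 100
    return (1 - failure ** copies) * 100

# Front Door x App Service x SQL x Redis x Blob (the stack above)
single_region = serial_sla(99.99, 99.95, 99.99, 99.9, 99.99)
print(f"Single region: {single_region:.2f}%")                   # ~99.82%
print(f"Two regions:   {parallel_sla(single_region, 2):.4f}%")  # ~99.9997% (theoretical)
```

Remember that the two-region figure ignores failover time, which dominates real-world RTO.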

Mistake #4: Forgetting About Data Gravity

The Trap:
Architect: "We'll replicate to 5 regions for global HA!"

Reality:
- Application (100 MB): Replicates in seconds ✅
- Database (500 GB): Takes 4 hours to replicate ❌
- Failover time: 4+ hours (waiting for data sync) ❌
Data Gravity = Large data is slow to move.

Real Numbers:
Replication Speed: ~50 MB/second (typical Azure inter-region)

Small Database (10 GB):
- Replication time: 200 seconds (3.3 minutes) ✅

Medium Database (500 GB):
- Replication time: 10,000 seconds (2.7 hours) ⚠️

Large Database (10 TB):
- Replication time: 200,000 seconds (55 hours!) ❌
The Fix: Plan for data size
Scenario: E-commerce with 2 TB database

Option 1: Full replication on failover (Bad)
- RTO: 11+ hours (2 TB at ~50 MB/s) ❌

Option 2: Continuous async replication (Good)
- Database always synced (5-10 min lag)
- RTO: 5-10 minutes ✅
- Cost: $800/month (geo-replication)

Option 3: Synchronous replication (Best)
- Database synced in real-time (0 lag)
- RTO: 30 seconds ✅
- Cost: $2,000/month (active-active)
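These replication windows fall straight out of size ÷ throughput, so you can estimate them up front. A sketch using the ~50 MB/s inter-region figure assumed above (measure your own links before trusting it):

```python
# Sketch: naive replication-time estimate (data size / throughput).
# 50 MB/s is the illustrative figure used above, not a guarantee.

def replication_hours(size_gb: float, throughput_mb_per_s: float = 50) -> float:
    return size_gb * 1024 / throughput_mb_per_s / 3600

for size_gb in (10, 500, 2_000, 10_000):
    print(f"{size_gb:>6} GB -> {replication_hours(size_gb):5.1f} hours")
```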

Mistake #5: Active-Active Without Proper Conflict Resolution

The Trap:
Architect: "Let's run database in both regions actively!"

User in US: Updates customer email to "alice@us-mail.example" (Region 1)
User in EU: Updates same customer email to "alice@eu-mail.example" (Region 2)

CONFLICT: Which email is correct? ❌

Split-brain scenario: Data corruption ❌
Real Example: Uber (2016):
  • Active-active setup without proper conflict resolution
  • Network partition between datacenters
  • Both sides accepted writes
  • Data corruption cost hundreds of hours to resolve
The Fix: Choose a conflict-resolution strategy
Strategy 1: Last-Write-Wins (LWW)
- Keep the most recent update (based on timestamp)
- Simple but data loss possible
- Good for: Analytics, non-critical data

Strategy 2: Application-Level Conflict Resolution
- Application decides which update wins
- Complex but no data loss
- Good for: Banking, critical applications

Strategy 3: Avoid Conflicts (Partition Data)
- US customers → US region only
- EU customers → EU region only
- Never have conflicts (single writer per data)
- Good for: Global applications with regional data
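To make Strategy 1 concrete, here is a minimal last-write-wins merge in Python. Everything here is an illustrative sketch (field names and emails are placeholders), and the comments call out the risks that make LWW unsuitable for critical records:

```python
from datetime import datetime, timezone

# Sketch: last-write-wins (LWW) merge for one replicated record.
# Simple, but the losing write is silently discarded -- fine for
# analytics-style data, dangerous for financial records.
# Note: LWW also assumes reasonably synchronized clocks across regions.

def lww_merge(version_a: dict, version_b: dict) -> dict:
    """Keep whichever version carries the later update timestamp."""
    return version_a if version_a["updated_at"] >= version_b["updated_at"] else version_b

us_write = {"email": "alice@us-mail.example",
            "updated_at": datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)}
eu_write = {"email": "alice@eu-mail.example",
            "updated_at": datetime(2024, 6, 1, 12, 0, 5, tzinfo=timezone.utc)}

print(lww_merge(us_write, eu_write)["email"])  # alice@eu-mail.example (later write wins)
```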

3. High Availability Patterns

Pattern 1: Active-Passive

Primary Region (Active)
  - Handles all traffic
  - Replicates to secondary

Secondary Region (Passive)
  - Standby mode
  - Activated on primary failure

Pros: Simple, cost-effective
Cons: Unused capacity, manual failover
Pattern 2: Active-Active

Region 1 (Active)
  - Handles 50% traffic

Region 2 (Active)
  - Handles 50% traffic

Both regions process requests simultaneously

Pros: Maximum availability, no wasted capacity
Cons: Complex (data conflicts), expensive
Pattern 3: Priority-Based Failover

Azure Front Door / Traffic Manager
  ├── Primary: East US (priority 1)
  ├── Secondary: West Europe (priority 2)
  └── Tertiary: Southeast Asia (priority 3)

Automatic failover based on health probes
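The routing logic itself is simple to sketch: walk the endpoints in priority order and return the first healthy one. This is an illustrative Python loop, not Front Door's actual implementation, and the health-probe URLs are hypothetical placeholders:

```python
import urllib.request

# Illustrative priority-based failover (what Front Door / Traffic Manager
# automate for you). Endpoint URLs are hypothetical placeholders.
ENDPOINTS = [
    ("East US",        "https://eastus.example.com/health"),         # priority 1
    ("West Europe",    "https://westeurope.example.com/health"),     # priority 2
    ("Southeast Asia", "https://southeastasia.example.com/health"),  # priority 3
]

def pick_healthy_region() -> str:
    """Return the highest-priority region whose health probe succeeds."""
    for region, url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return region
        except OSError:
            continue  # probe failed or timed out; try the next priority
    raise RuntimeError("All regions unhealthy -- page the on-call engineer")

print(pick_healthy_region())
```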

4. Disaster Recovery Architecture

Multi-Region DR Architecture (with Azure Site Recovery)

Example: E-Commerce Platform

Primary Region: East US
- App Service (zone-redundant)
- Azure SQL (zone-redundant)
- Redis Cache (zone-redundant)
- Front Door (global)

Secondary Region: West US (Passive)
- App Service (scaled to minimum)
- Azure SQL (geo-replica, read-only)
- Redis Cache (geo-replication)

Failover Process:
1. Front Door detects primary unhealthy
2. Routes traffic to secondary (automatic)
3. Promote SQL replica to primary
4. Scale up App Service instances
5. Total failover time: < 5 minutes

5. Backup Strategies

# Enable Azure Backup
az backup protection enable-for-vm \
  --resource-group rg-prod \
  --vault-name RecoveryServicesVault \
  --vm vm-web-01 \
  --policy-name DefaultPolicy

# Retention:
# - Daily: 7 days
# - Weekly: 4 weeks
# - Monthly: 12 months
# - Yearly: 10 years

6. Testing DR Plan

Disaster Recovery Drill Checklist:
✅ Document all runbooks
✅ Test failover quarterly
✅ Measure actual RTO/RPO
✅ Update contact lists
✅ Test backup restores
✅ Verify monitoring alerts
✅ Review and update documentation
✅ Conduct post-drill review


7. Interview Questions

Beginner Level

Q: What is the difference between Availability and Durability?

Answer:
  • Availability: Uptime. Can I access the service right now? (e.g., SLA 99.9%).
  • Durability: Data integrity. Is my data safe from loss? (e.g., 11 nines, 99.999999999%, for Blob Storage). You can have high availability but lose data (corruption), or high durability but be offline.
Q: What is an Availability Zone?

Answer: A physically separate datacenter within the same Azure region, with independent power, cooling, and networking. It protects against datacenter-level failures (fire, power cut).

Intermediate Level

Q: What is the difference between RPO and RTO?

Answer:
  • RPO (Recovery Point Objective): “How much data can we lose?” (Time since last backup).
  • RTO (Recovery Time Objective): “How long can we be down?” (Time to restore service).
Q: Compare Active-Passive and Active-Active multi-region designs.

Answer:
  • Active-Passive: One region handles traffic; the secondary is on standby. Cheaper, but slower failover (RTO > 0).
  • Active-Active: Both regions handle traffic. Complex data sync and expensive, but near-zero-downtime failover (RTO ≈ 0).

Advanced Level

Q: How can you achieve a higher SLA than any single region offers?

Answer: By using Composite SLAs. If you have two regions, each with 99.9% availability, the probability of both failing simultaneously is 0.1% × 0.1% (0.001 × 0.001) = 0.0001%. Total Availability = 100% − 0.0001% = 99.9999%, assuming independent failures. Redundancy increases availability.

8. Key Takeaways

SLA Mathematics

Understand how SLAs compound. Dependencies reduce availability; redundancy increases it.

Zones vs Regions

Use Zones for synchronous HA (High Availability). Use Regions for asynchronous DR (Disaster Recovery).

Data Gravity

Compute is stateless and easy to move. Data is heavy and hard to sync. Focus DR efforts on data replication.

Testing

A backup is useless if you can’t restore it. A DR plan is a hypothesis until tested.

Business Alignment

RPO/RTO are business decisions, not technical ones. They dictate the cost of the solution you design.

Next Steps

Continue to Chapter 14

Master real-world Azure architecture patterns and design principles