
High Availability & Disaster Recovery

Design systems that survive failures and disasters, and learn to achieve 99.99% availability with Azure HA/DR patterns.

What You’ll Learn

By the end of this chapter, you’ll understand:
  • What High Availability and Disaster Recovery really mean (and why they’re different)
  • How to calculate and achieve specific SLAs (99.9%, 99.99%, 99.999%)
  • What RPO and RTO mean (and why confusing them costs companies millions)
  • How to design systems that survive datacenter failures (Availability Zones)
  • How to design systems that survive regional disasters (Multi-region DR)
  • Real-world DR architectures with actual costs and trade-offs
  • How to test your DR plan (because untested plans always fail)

Introduction: What is High Availability & Disaster Recovery?

Start Here if You’re Completely New

High Availability (HA) = Your app stays online even when things break.
Disaster Recovery (DR) = Your app can recover from catastrophic failures.

Think of it like a restaurant:

High Availability (HA):
  • Problem: One cook gets sick
  • Solution: You have 3 cooks (redundancy)
  • Result: Restaurant stays open ✅
  • Downtime: 0 seconds
Disaster Recovery (DR):
  • Problem: Fire destroys the entire restaurant
  • Solution: You have a second location across town (backup site)
  • Result: Open at backup location in 2 hours ✅
  • Downtime: 2 hours (but you survived!)
Key Difference:
  • HA = Handles small failures (broken VM, network glitch) → Seconds of downtime
  • DR = Handles catastrophic failures (datacenter destroyed, region offline) → Hours of downtime

Why This Matters: The Cost of Downtime

Real-World Disaster Example

Amazon Prime Day Outage (2018)
  • What happened: Website crashed for 63 minutes during biggest sale day
  • Estimated revenue loss: $99 million in 63 minutes
  • Per-minute cost: $1.57 million/minute
  • Per-second cost: $26,000/second
Every second your app is down means money lost, customers lost, and reputation damaged.

More Real Examples:

| Company | Incident | Downtime | Cost | Cause |
|---|---|---|---|---|
| Facebook | Oct 2021 | 6 hours | $60-100M | BGP routing error |
| GitHub | Oct 2018 | 24 hours | Unknown | Network partition |
| Google Cloud | Jun 2019 | 4.5 hours | Unknown | Configuration error |
| British Airways | May 2017 | 3 days | $100M+ | Power surge |
The Pattern: Companies that invest in HA/DR save millions; companies that don't invest lose millions.

Understanding SLAs (Service Level Agreements)

What is an SLA?

SLA = A promise about how much downtime is acceptable.

Think of it like a pizza delivery guarantee:
  • Pizza delivery SLA: “30 minutes or it’s free”
  • Azure VM SLA: “99.9% uptime or we give you credits”

SLA Math Explained (From Scratch)

99.9% uptime sounds amazing, right? Let’s see what it actually means:
100% - 99.9% = 0.1% downtime allowed

0.1% of 1 month (30 days) = ?

Calculation:
30 days = 30 × 24 hours = 720 hours
0.1% of 720 hours = 0.72 hours = 43.2 minutes

Result: 99.9% SLA = 43 minutes of downtime per month is OK ⚠️
Translation Table:

| SLA | Downtime/Month | Downtime/Year | Real-World Impact |
|---|---|---|---|
| 99% | 7.2 hours | 3.65 days | ❌ Unacceptable for production |
| 99.9% | 43.2 minutes | 8.76 hours | ⚠️ OK for internal tools |
| 99.95% | 21.6 minutes | 4.38 hours | ✅ Good for most apps |
| 99.99% | 4.3 minutes | 52.56 minutes | ✅ Great for e-commerce |
| 99.999% | 26 seconds | 5.26 minutes | 🚀 Required for banking |
Example: Your e-commerce site makes $100,000/day
  • 99.9% SLA: 43 minutes downtime/month = $3,000 lost revenue
  • 99.99% SLA: 4.3 minutes downtime/month = $300 lost revenue
  • Cost to upgrade: ~$200/month
  • Savings: $2,700/month → 13.5x ROI ✅
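These numbers are easy to verify yourself. Here is a small Python sketch of the same arithmetic; the 30-day month and the $100,000/day revenue figure are the assumptions from the example above:

```python
# Sketch: allowed downtime and revenue at risk for a given SLA.
# Assumes a 30-day month and $100,000/day revenue (the example above).

def downtime_minutes_per_month(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime allowed per month at a given SLA."""
    return days * 24 * 60 * (1 - sla_percent / 100)

def revenue_at_risk(sla_percent: float, daily_revenue: float) -> float:
    """Revenue lost if the entire downtime budget is consumed."""
    per_minute = daily_revenue / (24 * 60)
    return downtime_minutes_per_month(sla_percent) * per_minute

for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    mins = downtime_minutes_per_month(sla)
    loss = revenue_at_risk(sla, daily_revenue=100_000)
    print(f"{sla:>7}% -> {mins:7.1f} min/month, ~${loss:,.0f} at risk")
```

Running it reproduces the table above: 99.9% allows 43.2 minutes (~$3,000 at risk), 99.99% allows 4.3 minutes (~$300).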

How Azure Achieves High Availability

The Building Blocks (From Smallest to Largest)

1. Physical Server (Single point of failure ❌)

2. Availability Set (Multiple servers in same datacenter ✅)

3. Availability Zone (Multiple datacenters in same region ✅✅)

4. Region Pair (A second region hundreds of kilometers away ✅✅✅)
Let’s understand each one:

1. Single VM (No High Availability)

Your Application

Single VM in Azure

Physical Server #47 in East US Datacenter

What Happens if Physical Server #47 fails?
- Your VM goes down ❌
- Downtime: 10-30 minutes (while Azure moves VM to new server)
- SLA: 99.9% (43 minutes downtime/month)
Real-World Analogy: Running a restaurant with only 1 cook. Cook gets sick = restaurant closes.

2. Availability Set (Same Datacenter, Different Racks)

Your Application (Load Balanced)
  ├── VM 1 → Physical Server #47 (Rack A)
  ├── VM 2 → Physical Server #128 (Rack B)
  └── VM 3 → Physical Server #201 (Rack C)

All in the same datacenter, but different racks (power/network isolation)

What Happens if Physical Server #47 fails?
- VM 1 goes down ❌
- VM 2 and VM 3 still running ✅
- Load balancer routes traffic to healthy VMs
- Downtime: 0 seconds ✅
- SLA: 99.95% (21 minutes downtime/month)
Real-World Analogy: Restaurant with 3 cooks. One cook gets sick = other 2 keep working.

Cost Example:
  • 1 VM (99.9%): $50/month → 43 min downtime
  • 3 VMs in Availability Set (99.95%): $150/month → 21 min downtime
  • Extra cost: $100/month → Saves 22 minutes of downtime

3. Availability Zones (Different Datacenters, Same Region)

East US Region (has 3 Availability Zones)

Zone 1: Datacenter Building A (15 km away)
  └── VM 1

Zone 2: Datacenter Building B (20 km away)
  └── VM 2

Zone 3: Datacenter Building C (25 km away)
  └── VM 3

Each zone has independent:
- Power supply (different power grid)
- Cooling system
- Network connections

What Happens if Entire Datacenter Building A loses power?
- Zone 1 (VM 1) goes down ❌
- Zone 2 (VM 2) still running ✅
- Zone 3 (VM 3) still running ✅
- Downtime: 0 seconds ✅
- SLA: 99.99% (4.3 minutes downtime/month)
Real-World Analogy: Restaurant chain with 3 locations in the same city. One location catches fire = other 2 still serve customers.

Cost Example:
  • 3 VMs in Availability Set (99.95%): $150/month → 21 min downtime
  • 3 VMs across Availability Zones (99.99%): $150/month → 4.3 min downtime
  • Extra cost: $0 (same price!) → Saves 17 minutes of downtime ✅
Why Availability Zones are Better:
  • Protects against datacenter-level disasters (fire, flood, power outage)
  • Same cost as Availability Set
  • Higher SLA (99.99% vs 99.95%)

4. Multi-Region (Disaster Recovery)

Primary Region: East US
  └── 3 VMs across Availability Zones (99.99% SLA)

Secondary Region: West Europe (4,000 km away)
  └── 3 VMs across Availability Zones (standby)

Azure Front Door (Global Load Balancer)
  ├── Route to East US (primary)
  └── Failover to West Europe (if East US fails)

What Happens if Entire East US Region goes offline?
(Hurricane, earthquake, massive network outage)
- East US completely offline ❌
- Front Door automatically routes to West Europe ✅
- Downtime: 2-5 minutes (health-probe detection and traffic rerouting) ✅
- SLA: 99.99%+ (composite SLA)
Real-World Analogy: Restaurant chain with locations in New York and London. A hurricane destroys New York = the London location still serves customers.

Cost Example:
  • Single Region (East US): $150/month → 4.3 min downtime
  • Multi-Region (East US + West Europe): $300/month → 2 min downtime
  • Extra cost: $150/month → Protects against regional disasters
When You Need Multi-Region:
  • ✅ Mission-critical applications (banking, healthcare)
  • ✅ Global user base (low latency everywhere)
  • ✅ Compliance requirements (data residency)
  • ❌ Small internal tools (not worth the cost)

1. Availability SLAs

99%     = 7.2 hours downtime/month
99.9%   = 43.2 minutes downtime/month
99.95%  = 21.6 minutes downtime/month
99.99%  = 4.3 minutes downtime/month
99.999% = 26 seconds downtime/month

Understanding RPO & RTO (From Absolute Zero)

What is RPO and RTO?

These are the TWO most important numbers in disaster recovery. Companies have lost millions by confusing them.

RPO (Recovery Point Objective) = “How much data can we afford to lose?”
RTO (Recovery Time Objective) = “How long can we be offline?”

Real-World Analogy: Writing a Book

Imagine you’re writing a 500-page book on your computer:

Scenario 1: You save every 5 minutes (RPO = 5 minutes)
Time: 2:00 PM → You save your work (page 247)
Time: 2:03 PM → You write 2 more pages (now on page 249)
Time: 2:05 PM → COMPUTER CRASHES! ❌

What happened:
- Last save: 2:00 PM (page 247)
- Crash: 2:05 PM (page 249)
- Lost work: 2 pages (5 minutes of work)

RPO = 5 minutes (you lost 5 minutes of work)
Scenario 2: You save every 1 hour (RPO = 1 hour)
Time: 1:00 PM → You save your work (page 220)
Time: 1:58 PM → You write 29 more pages (now on page 249)
Time: 2:00 PM → COMPUTER CRASHES! ❌

What happened:
- Last save: 1:00 PM (page 220)
- Crash: 2:00 PM (page 249)
- Lost work: 29 pages (1 hour of work)

RPO = 1 hour (you lost 1 hour of work)
RPO = Time between backups = Amount of data you can lose
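In code, the RPO you actually experience in an incident is just the gap between the crash and the last good backup. A minimal Python sketch of Scenario 1 (the date is an arbitrary placeholder):

```python
from datetime import datetime

# Realized RPO = time between the last good backup and the failure.
last_backup = datetime(2024, 6, 1, 14, 0)  # 2:00 PM save
crash_time = datetime(2024, 6, 1, 14, 5)   # 2:05 PM crash

data_loss = crash_time - last_backup
print(f"Work lost: {data_loss}")  # 0:05:00 -> realized RPO = 5 minutes
```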

Now RTO (Recovery Time Objective)

RTO = How long until you’re back to work after a disaster.

Continuing the book analogy:

Scenario 1: Backup laptop ready (RTO = 5 minutes)
Time: 2:00 PM → Computer crashes ❌
Time: 2:01 PM → Grab backup laptop from closet
Time: 2:03 PM → Log into backup laptop
Time: 2:05 PM → Open last saved version (from 2:00 PM)
Time: 2:05 PM → Back to writing! ✅

RTO = 5 minutes (time to get back to work)
Scenario 2: Need to buy new laptop (RTO = 3 days)
Day 1, 2:00 PM → Computer crashes ❌
Day 1, 3:00 PM → Drive to store, store is out of stock
Day 2, 10:00 AM → Order laptop online
Day 3, 4:00 PM → Laptop arrives, install software
Day 3, 6:00 PM → Back to writing! ✅

RTO = 3 days (time to get back to work)
RTO = Time to recover from disaster = How long you’re offline

The Critical Difference (Why Companies Confuse This)

[!WARNING] Common Mistake: Confusing RPO and RTO
RPO = Data Loss (measured in TIME since last backup)
RTO = Downtime (measured in TIME to recover)

You can have DIFFERENT combinations:

Example 1: Low RPO, High RTO
E-commerce Database:
- RPO: 1 minute (backup every minute)
- RTO: 4 hours (takes 4 hours to restore from backup)

Result:
- Data loss: Only 1 minute of orders lost ✅
- Downtime: 4 hours offline ❌
- Lost revenue: $400,000 (at $100,000/hour)
Example 2: High RPO, Low RTO
Analytics Dashboard:
- RPO: 24 hours (backup once daily)
- RTO: 5 minutes (hot standby ready)

Result:
- Data loss: 24 hours of analytics data lost ❌ (but analytics can be regenerated)
- Downtime: 5 minutes offline ✅
- Lost revenue: $0 (dashboard back quickly)
Example 3: Low RPO, Low RTO (Expensive but Best)
Banking System:
- RPO: 0 seconds (continuous replication)
- RTO: 30 seconds (automatic failover)

Result:
- Data loss: 0 transactions lost ✅
- Downtime: 30 seconds offline ✅
- Lost revenue: Minimal
- Cost: High ($$$$)

Real-World RPO/RTO Example: GitLab Database Incident (2017)

The Disaster:
  • GitLab engineer accidentally deleted production database
  • 300 GB of data vanished
What They THOUGHT Their RPO Was: 24 hours (daily backups)
What Their RPO ACTUALLY Was: 6 hours (the daily backups were silently failing; only a staging snapshot from 6 hours earlier worked)

Actual Result:
  • RPO: 6 hours → Lost 6 hours of data (5,000 projects, 5,000 comments, 700 new users)
  • RTO: 18 hours → Took 18 hours to restore from backups
  • Total impact: 6 hours of data lost + 18 hours offline
  • Cost: Immeasurable reputation damage (but they recovered with transparency)
Lesson: Your DR plan is only as good as your last successful restore TEST.

How to Choose Your RPO/RTO

Step 1: Calculate Cost of Downtime
Your e-commerce site makes $100,000/day

Cost per hour = $100,000 ÷ 24 = $4,166/hour
Cost per minute = $4,166 ÷ 60 = $69/minute
Cost per second = $69 ÷ 60 = $1.15/second
Step 2: Calculate Acceptable Loss
Question: "Can we afford to lose 1 hour of orders?"

1 hour of orders = $4,166 in revenue

If RTO = 1 hour:
- Lost revenue: $4,166
- Acceptable? (You decide based on business impact)

If RTO = 5 minutes:
- Lost revenue: $347
- Acceptable? (Much better!)
Step 3: Calculate Cost of DR Solution
Option 1: Daily Backups
- RPO: 24 hours
- RTO: 4 hours
- Cost: $50/month
- Risk: Lose up to $100,000 in orders + 4 hours downtime ($16,664)

Option 2: Continuous Replication + Auto-Failover
- RPO: 0 seconds
- RTO: 2 minutes
- Cost: $500/month
- Risk: Lose 2 minutes of uptime ($138)

Which is better?
- Option 2 costs $450 more per month
- But saves $100,000+ in potential losses
- ROI: 222x return on investment ✅
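You can fold this reasoning into a single comparison: monthly solution cost plus the loss you would eat in one incident. A hedged Python sketch reusing the numbers above (how often incidents actually occur is an assumption you must estimate for your own system):

```python
# Sketch: compare DR options by monthly cost plus per-incident loss.
# Uses the $100,000/day example above; incident frequency is up to you.

DAILY_REVENUE = 100_000
PER_HOUR = DAILY_REVENUE / 24

options = {
    "Daily backups":               {"monthly_cost": 50,  "rpo_hours": 24, "rto_hours": 4},
    "Continuous replication + HA": {"monthly_cost": 500, "rpo_hours": 0,  "rto_hours": 2 / 60},
}

for name, o in options.items():
    # Loss = unrecoverable revenue (RPO window) + revenue while offline (RTO)
    loss_per_incident = (o["rpo_hours"] + o["rto_hours"]) * PER_HOUR
    print(f"{name}: ${o['monthly_cost']}/month, ~${loss_per_incident:,.0f} per incident")
```

One incident wipes out years of the $450/month premium, which is where the 222x figure comes from.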

Decision Tree: Choosing RPO/RTO

Q1: What type of application?

├─ Internal Tool (HR dashboard, wiki)
│  ├─ RPO: 24 hours (daily backup OK)
│  ├─ RTO: 4 hours (users can wait)
│  └─ Solution: Daily backups ($50/month)
│
├─ Customer-Facing App (blog, docs site)
│  ├─ RPO: 1 hour (acceptable data loss)
│  ├─ RTO: 1 hour (acceptable downtime)
│  └─ Solution: Hourly backups + standby VM ($200/month)
│
├─ E-commerce / Revenue-Generating
│  ├─ RPO: 5 minutes (minimal data loss)
│  ├─ RTO: 5 minutes (minimal downtime)
│  └─ Solution: Continuous replication + auto-failover ($500/month)
│
└─ Banking / Healthcare / Critical
   ├─ RPO: 0 seconds (zero data loss)
   ├─ RTO: 30 seconds (near-zero downtime)
   └─ Solution: Synchronous replication + active-active ($2,000/month)
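If you want this decision tree in code, a lookup table is usually enough. A minimal sketch using the tiers and figures above (the tier names are hypothetical labels):

```python
# Sketch: the decision tree above as a simple lookup table.
DR_TIERS = {
    "internal":  {"rpo": "24 hours",  "rto": "4 hours",    "solution": "Daily backups (~$50/month)"},
    "customer":  {"rpo": "1 hour",    "rto": "1 hour",     "solution": "Hourly backups + standby VM (~$200/month)"},
    "ecommerce": {"rpo": "5 minutes", "rto": "5 minutes",  "solution": "Continuous replication + auto-failover (~$500/month)"},
    "critical":  {"rpo": "0 seconds", "rto": "30 seconds", "solution": "Synchronous replication + active-active (~$2,000/month)"},
}

print(DR_TIERS["ecommerce"]["solution"])
```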

2. RPO & RTO

[!WARNING] Gotcha: RPO vs RTO
A common interview trap. RPO (Point) = Data Loss (“How far back do we go?”). RTO (Time) = Downtime (“How long until we are back online?”). You can have a low RPO (zero data loss) but a high RTO (it still took 4 hours to restart).
[!TIP] Jargon Alert: Split Brain
A disaster scenario where two databases both think they are “Primary” and accept writes at the same time, corrupting data. Always use a “Witness” or “Quorum” to prevent this in active-active architectures.
Quick Reference (after reading the detailed explanation above):
RPO (Recovery Point Objective): How much data loss is acceptable?
RTO (Recovery Time Objective): How long to recover?

Example Scenarios

Scenario 1: E-commerce Site
RPO: 5 minutes (transactional data)
RTO: 15 minutes (revenue impact)
Solution: Auto-failover groups, zone-redundant services

Scenario 2: Internal Analytics
RPO: 24 hours (reports can be delayed)
RTO: 4 hours (not business-critical)
Solution: Daily backups, manual recovery

Scenario 3: Banking System
RPO: 0 seconds (no data loss acceptable)
RTO: < 1 minute (critical business)
Solution: Synchronous replication, active-active

Common Mistakes in HA/DR (Learn from Others’ Failures)

Mistake #1: “Stopped” VMs Still Cost Money

The Trap:
Developer thinks: "I'll save money by stopping VMs at night"

Developer clicks "Stop" in Azure Portal
VM Status: "Stopped" ✅

Month-end bill arrives: Still charged $2,000! ❌
What Happened:
  • “Stopped” = the OS shut down, but the VM’s hardware stays allocated
  • Azure keeps billing for the reserved compute capacity
  • Correct action: “Deallocate” (not just “Stop”)
Cost Impact:
  • Stopped (not deallocated) VM: still roughly 80% of full cost
  • Deallocated VM: only pay for storage (~5% of full cost)
Fix:
# WRONG (still costs money)
az vm stop --name myVM --resource-group myRG

# CORRECT (actually saves money)
az vm deallocate --name myVM --resource-group myRG
Savings: $1,600/month per VM ✅

Mistake #2: Untested Backups (The GitLab Disaster)

The Trap:
Company: "We have daily backups, we're safe!"

Reality:
- Backups running for 6 months ✅
- Never tested a restore ❌
- Disaster strikes
- Try to restore... backups are CORRUPTED ❌
- All backups unusable
Real Example: Code Spaces (2014):
  • Hosting company for developers
  • Backups existed but were on same infrastructure
  • Hacker deleted everything (including backups)
  • Company went out of business
  • Customers lost everything
The Fix: Test quarterly
Q1: January → Test restore production database to staging
Q2: April → Test restore VM from backup
Q3: July → Test failover to secondary region
Q4: October → Full disaster recovery drill
Cost of Testing: $500/month (test infrastructure)
Cost of Untested Backups: business bankruptcy ❌

Mistake #3: Ignoring Composite SLAs

The Trap:
Architect: "Our app has 99.99% SLA!"

Reality:
App Service (99.95%) × Azure SQL (99.99%) × Redis (99.9%)
= 99.95% × 99.99% × 99.9%
= 99.84% actual SLA ❌

99.84% = 69 minutes downtime/month (not 4.3 minutes!)
The Math (Explained Simply):
When you chain services, SLAs MULTIPLY:

Service A: 99.9% availability = 99.9% = 0.999
Service B: 99.9% availability = 99.9% = 0.999

Combined: 0.999 × 0.999 = 0.998001 = 99.8%

Translation:
- Promised: 99.9% (43 min downtime)
- Actual: 99.8% (86 min downtime)
- Difference: 2x more downtime! ❌
Real Example:
E-commerce Application Stack:
├── Azure Front Door: 99.99%
├── App Service: 99.95%
├── Azure SQL: 99.99%
├── Redis Cache: 99.9%
└── Blob Storage: 99.99%

Combined SLA:
0.9999 × 0.9995 × 0.9999 × 0.999 × 0.9999 = 0.9982 = 99.82%

Expected downtime: 4.3 min/month
Actual downtime: 77 min/month (18x worse!)
The Fix: Add redundancy
Option 1: Single Region
- Combined SLA: 99.82%
- Downtime: 77 min/month

Option 2: Multi-Region (Active-Passive)
- Each region: 99.82% (failure probability 0.18% = 0.0018)
- Probability both fail at once: 0.0018 × 0.0018 = 0.00032%
- Theoretical combined SLA: 99.9997% (assuming independent failures)
- In practice, failover detection and cutover time dominate, so budget minutes of downtime, not seconds

Improvement: dramatically better availability ✅
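Composite SLAs are easy to get wrong by hand, so it is worth scripting the two rules: chained dependencies multiply availabilities, and independent redundant copies multiply failure probabilities. A sketch using the stack above (and the idealized independence assumption):

```python
import math

def serial_sla(*slas_percent: float) -> float:
    """Chained dependencies: availabilities multiply."""
    return math.prod(s / 100 for s in slas_percent) * 100

def parallel_sla(sla_percent: float, copies: int) -> float:
    """Independent redundant copies: failure probabilities multiply."""
    failure = 1 - sla_percent / 100
    return (1 - failure ** copies) * 100

# Front Door x App Service x SQL x Redis x Blob (the stack above)
single_region = serial_sla(99.99, 99.95, 99.99, 99.9, 99.99)
print(f"Single region: {single_region:.2f}%")                   # ~99.82%
print(f"Two regions:   {parallel_sla(single_region, 2):.4f}%")  # ~99.9997% (theoretical)
```

Remember that the two-region figure ignores failover time, which dominates real-world RTO.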

Mistake #4: Forgetting About Data Gravity

The Trap:
Architect: "We'll replicate to 5 regions for global HA!"

Reality:
- Application (100 MB): Replicates in seconds ✅
- Database (500 GB): Takes 4 hours to replicate ❌
- Failover time: 4+ hours (waiting for data sync) ❌
Data Gravity = Large data is slow to move.

Real Numbers:
Replication Speed: ~50 MB/second (typical Azure inter-region)

Small Database (10 GB):
- Replication time: 200 seconds (3.3 minutes) ✅

Medium Database (500 GB):
- Replication time: 10,000 seconds (2.7 hours) ⚠️

Large Database (10 TB):
- Replication time: 200,000 seconds (55 hours!) ❌
The Fix: Plan for data size
Scenario: E-commerce with 2 TB database

Option 1: Full replication on failover (Bad)
- RTO: 11+ hours (2 TB at ~50 MB/s) ❌

Option 2: Continuous async replication (Good)
- Database always synced (5-10 min lag)
- RTO: 5-10 minutes ✅
- Cost: $800/month (geo-replication)

Option 3: Synchronous replication (Best)
- Database synced in real-time (0 lag)
- RTO: 30 seconds ✅
- Cost: $2,000/month (active-active)
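These replication windows fall straight out of size ÷ throughput, so you can estimate them up front. A sketch using the ~50 MB/s inter-region figure assumed above (measure your own links before trusting it):

```python
# Sketch: naive replication-time estimate (data size / throughput).
# 50 MB/s is the illustrative figure used above, not a guarantee.

def replication_hours(size_gb: float, throughput_mb_per_s: float = 50) -> float:
    return size_gb * 1024 / throughput_mb_per_s / 3600

for size_gb in (10, 500, 2_000, 10_000):
    print(f"{size_gb:>6} GB -> {replication_hours(size_gb):5.1f} hours")
```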

Mistake #5: Active-Active Without Proper Conflict Resolution

The Trap:
Architect: "Let's run database in both regions actively!"

User in US: Updates customer email to "alice@us-mail.example" (Region 1)
User in EU: Updates same customer email to "alice@eu-mail.example" (Region 2)

CONFLICT: Which email is correct? ❌

Split-brain scenario: Data corruption ❌
Real Example: Uber (2016):
  • Active-active setup without proper conflict resolution
  • Network partition between datacenters
  • Both sides accepted writes
  • Data corruption cost hundreds of hours to resolve
The Fix: Choose a conflict-resolution strategy
Strategy 1: Last-Write-Wins (LWW)
- Keep the most recent update (based on timestamp)
- Simple but data loss possible
- Good for: Analytics, non-critical data

Strategy 2: Application-Level Conflict Resolution
- Application decides which update wins
- Complex but no data loss
- Good for: Banking, critical applications

Strategy 3: Avoid Conflicts (Partition Data)
- US customers → US region only
- EU customers → EU region only
- Never have conflicts (single writer per data)
- Good for: Global applications with regional data
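To make Strategy 1 concrete, here is a minimal last-write-wins merge in Python. Everything here is an illustrative sketch (field names and emails are placeholders), and the comments call out the risks that make LWW unsuitable for critical records:

```python
from datetime import datetime, timezone

# Sketch: last-write-wins (LWW) merge for one replicated record.
# Simple, but the losing write is silently discarded -- fine for
# analytics-style data, dangerous for financial records.
# Note: LWW also assumes reasonably synchronized clocks across regions.

def lww_merge(version_a: dict, version_b: dict) -> dict:
    """Keep whichever version carries the later update timestamp."""
    return version_a if version_a["updated_at"] >= version_b["updated_at"] else version_b

us_write = {"email": "alice@us-mail.example",
            "updated_at": datetime(2024, 6, 1, 12, 0, 0, tzinfo=timezone.utc)}
eu_write = {"email": "alice@eu-mail.example",
            "updated_at": datetime(2024, 6, 1, 12, 0, 5, tzinfo=timezone.utc)}

print(lww_merge(us_write, eu_write)["email"])  # alice@eu-mail.example (later write wins)
```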

3. High Availability Patterns

Pattern 1: Active-Passive

Primary Region (Active)
  - Handles all traffic
  - Replicates to secondary

Secondary Region (Passive)
  - Standby mode
  - Activated on primary failure

Pros: Simple, cost-effective
Cons: Unused capacity, manual failover
Pattern 2: Active-Active

Region 1 (Active)
  - Handles 50% traffic

Region 2 (Active)
  - Handles 50% traffic

Both regions process requests simultaneously

Pros: Maximum availability, no wasted capacity
Cons: Complex (data conflicts), expensive
Pattern 3: Priority-Based Failover

Azure Front Door / Traffic Manager
  ├── Primary: East US (priority 1)
  ├── Secondary: West Europe (priority 2)
  └── Tertiary: Southeast Asia (priority 3)

Automatic failover based on health probes
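The routing logic itself is simple to sketch: walk the endpoints in priority order and return the first healthy one. This is an illustrative Python loop, not Front Door's actual implementation, and the health-probe URLs are hypothetical placeholders:

```python
import urllib.request

# Illustrative priority-based failover (what Front Door / Traffic Manager
# automate for you). Endpoint URLs are hypothetical placeholders.
ENDPOINTS = [
    ("East US",        "https://eastus.example.com/health"),         # priority 1
    ("West Europe",    "https://westeurope.example.com/health"),     # priority 2
    ("Southeast Asia", "https://southeastasia.example.com/health"),  # priority 3
]

def pick_healthy_region() -> str:
    """Return the highest-priority region whose health probe succeeds."""
    for region, url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return region
        except OSError:
            continue  # probe failed or timed out; try the next priority
    raise RuntimeError("All regions unhealthy -- page the on-call engineer")

print(pick_healthy_region())
```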

4. Disaster Recovery Architecture

Multi-Region DR Architecture (with Azure Site Recovery)

Example: E-Commerce Platform

Primary Region: East US
- App Service (zone-redundant)
- Azure SQL (zone-redundant)
- Redis Cache (zone-redundant)
- Front Door (global)

Secondary Region: West US (Passive)
- App Service (scaled to minimum)
- Azure SQL (geo-replica, read-only)
- Redis Cache (geo-replication)

Failover Process:
1. Front Door detects primary unhealthy
2. Routes traffic to secondary (automatic)
3. Promote SQL replica to primary
4. Scale up App Service instances
5. Total failover time: < 5 minutes

5. Backup Strategies

# Enable Azure Backup
az backup protection enable-for-vm \
  --resource-group rg-prod \
  --vault-name RecoveryServicesVault \
  --vm vm-web-01 \
  --policy-name DefaultPolicy

# Retention:
# - Daily: 7 days
# - Weekly: 4 weeks
# - Monthly: 12 months
# - Yearly: 10 years

6. Testing DR Plan

Disaster Recovery Drill Checklist:
✅ Document all runbooks
✅ Test failover quarterly
✅ Measure actual RTO/RPO
✅ Update contact lists
✅ Test backup restores
✅ Verify monitoring alerts
✅ Review and update documentation
✅ Conduct post-drill review


7. Interview Questions

Beginner Level

Q: What is the difference between Availability and Durability?

Answer:
  • Availability: Uptime. Can I access the service right now? (e.g., SLA 99.9%).
  • Durability: Data integrity. Is my data safe from loss? (e.g., 11 nines, 99.999999999%, for Blob Storage). You can have high availability but lose data (corruption), or high durability but be offline.
Q: What is an Availability Zone?

Answer: A physically separate datacenter within the same Azure region, with independent power, cooling, and networking. It protects against datacenter-level failures (fire, power cut).

Intermediate Level

Q: What is the difference between RPO and RTO?

Answer:
  • RPO (Recovery Point Objective): “How much data can we lose?” (Time since last backup).
  • RTO (Recovery Time Objective): “How long can we be down?” (Time to restore service).
Q: Compare Active-Passive and Active-Active multi-region designs.

Answer:
  • Active-Passive: One region handles traffic; the secondary is on standby. Cheaper, but slower failover (RTO > 0).
  • Active-Active: Both regions handle traffic. Complex data sync and expensive, but near-zero-downtime failover (RTO ≈ 0).

Advanced Level

Q: How can you achieve a higher SLA than any single region offers?

Answer: By using Composite SLAs. If you have two regions, each with 99.9% availability, the probability of both failing simultaneously is 0.1% × 0.1% (0.001 × 0.001) = 0.0001%. Total Availability = 100% − 0.0001% = 99.9999%, assuming independent failures. Redundancy increases availability.

8. Key Takeaways

SLA Mathematics

Understand how SLAs compound. Dependencies reduce availability; redundancy increases it.

Zones vs Regions

Use Zones for synchronous HA (High Availability). Use Regions for asynchronous DR (Disaster Recovery).

Data Gravity

Compute is stateless and easy to move. Data is heavy and hard to sync. Focus DR efforts on data replication.

Testing

A backup is useless if you can’t restore it. A DR plan is a hypothesis until tested.

Business Alignment

RPO/RTO are business decisions, not technical ones. They dictate the cost of the solution you design.

Next Steps

Continue to Chapter 14

Master real-world Azure architecture patterns and design principles