High Availability & Disaster Recovery
Design systems that survive failures and disasters. Learn to achieve 99.99% availability.
What You’ll Learn
By the end of this chapter, you’ll understand:
- What High Availability and Disaster Recovery really mean (and why they’re different)
- How to calculate and achieve specific SLAs (99.9%, 99.99%, 99.999%)
- What RPO and RTO mean (and why confusing them costs companies millions)
- How to design systems that survive datacenter failures (Availability Zones)
- How to design systems that survive regional disasters (Multi-region DR)
- Real-world DR architectures with actual costs and trade-offs
- How to test your DR plan (because untested plans always fail)
Introduction: What is High Availability & Disaster Recovery?
Start Here if You’re Completely New
High Availability (HA) = Your app stays online even when things break
Disaster Recovery (DR) = Your app can recover from catastrophic failures
Think of it like a restaurant:
High Availability (HA):
- Problem: One cook gets sick
- Solution: You have 3 cooks (redundancy)
- Result: Restaurant stays open ✅
- Downtime: 0 seconds
Disaster Recovery (DR):
- Problem: Fire destroys the entire restaurant
- Solution: You have a second location across town (backup site)
- Result: Open at backup location in 2 hours ✅
- Downtime: 2 hours (but you survived!)
Key difference:
- HA = Handles small failures (broken VM, network glitch) → Seconds of downtime
- DR = Handles catastrophic failures (datacenter destroyed, region offline) → Hours of downtime
Why This Matters: The Cost of Downtime
Real-World Disaster Example
Amazon Prime Day Outage (2018)
- What happened: Website crashed for 63 minutes during its biggest sale day
- Revenue loss: an estimated $99 million in 63 minutes
- Per-minute cost: $1.57 million/minute
- Per-second cost: $26,000/second
| Company | Date | Downtime | Cost | Cause |
|---|---|---|---|---|
| Facebook | Oct 2021 | 6 hours | $60-100M | BGP routing error |
| GitHub | Oct 2018 | 24 hours | Unknown | Network partition |
| Google Cloud | Jun 2019 | 4.5 hours | Unknown | Configuration error |
| British Airways | May 2017 | 3 days | $100M+ | Power surge |
Understanding SLAs (Service Level Agreements)
What is an SLA?
SLA = A promise about how much downtime is acceptable
Think of it like a pizza delivery guarantee:
- Pizza delivery SLA: “30 minutes or it’s free”
- Azure VM SLA: “99.9% uptime or we give you credits”
SLA Math Explained (From Scratch)
99.9% uptime sounds amazing, right? Let’s see what it actually means:
| SLA | Downtime/Month | Downtime/Year | Real-World Impact |
|---|---|---|---|
| 99% | 7.2 hours | 3.65 days | ❌ Unacceptable for production |
| 99.9% | 43.2 minutes | 8.76 hours | ⚠️ OK for internal tools |
| 99.95% | 21.6 minutes | 4.38 hours | ✅ Good for most apps |
| 99.99% | 4.3 minutes | 52.56 minutes | ✅ Great for e-commerce |
| 99.999% | 26 seconds | 5.26 minutes | 🚀 Required for banking |
- 99.9% SLA: 43 minutes downtime/month = $3,000 lost revenue
- 99.99% SLA: 4.3 minutes downtime/month = $300 lost revenue
- Cost to upgrade: ~$200/month
- Savings: $2,700/month → 13.5x ROI ✅
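The downtime figures in the SLA table can be reproduced with a few lines of arithmetic. Here is a minimal sketch, assuming a 30-day month and a 365-day year (which is what the table's numbers imply); the function name `downtime_budget_minutes` is illustrative, not part of any Azure tooling.

```python
def downtime_budget_minutes(sla_percent: float, period_hours: float) -> float:
    """Return the downtime (in minutes) that an SLA allows over a period."""
    unavailable_fraction = 1 - sla_percent / 100
    return period_hours * 60 * unavailable_fraction


HOURS_PER_MONTH = 30 * 24   # assumption: 30-day month, as in the table
HOURS_PER_YEAR = 365 * 24   # assumption: non-leap year

for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    monthly = downtime_budget_minutes(sla, HOURS_PER_MONTH)
    yearly = downtime_budget_minutes(sla, HOURS_PER_YEAR)
    print(f"{sla}%: {monthly:.1f} min/month, {yearly:.1f} min/year")
```

Multiplying the monthly minutes by your own cost of downtime per minute gives a rough figure to weigh against the price of the next SLA tier.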
How Azure Achieves High Availability
The Building Blocks (From Smallest to Largest)
1. Single VM (No High Availability)
2. Availability Set (Same Datacenter, Different Racks)
- 1 VM (99.9%): $50/month → 43 min downtime
- 3 VMs in Availability Set (99.95%): $150/month → 21 min downtime
- Extra cost: $100/month → Saves 22 minutes of downtime
3. Availability Zones (Different Datacenters, Same Region)
- 3 VMs in Availability Set (99.95%): $150/month → 21 min downtime
- 3 VMs across Availability Zones (99.99%): $150/month → 4.3 min downtime
- Extra cost: $0 (same price!) → Saves 17 minutes of downtime ✅
- Protects against datacenter-level disasters (fire, flood, power outage)
- Same cost as Availability Set
- Higher SLA (99.99% vs 99.95%)
4. Multi-Region (Disaster Recovery)
- Single Region (East US): $150/month → 4.3 min downtime
- Multi-Region (East US + West Europe): $300/month → 2 min downtime
- Extra cost: $150/month → Protects against regional disasters
- ✅ Mission-critical applications (banking, healthcare)
- ✅ Global user base (low latency everywhere)
- ✅ Compliance requirements (data residency)
- ❌ Small internal tools (not worth the cost)
1. Availability SLAs
- SLA Levels
- Azure Service SLAs
Understanding RPO & RTO (From Absolute Zero)
What is RPO and RTO?
These are the TWO most important numbers in disaster recovery. Companies have lost millions by confusing these.
RPO (Recovery Point Objective) = “How much data can we afford to lose?”
RTO (Recovery Time Objective) = “How long can we be offline?”
Real-World Analogy: Writing a Book
Imagine you’re writing a 500-page book on your computer:
Scenario 1: You save every 5 minutes (RPO = 5 minutes)
Now RTO (Recovery Time Objective)
RTO = How long until you’re back to work after a disaster
Continuing the book analogy:
Scenario 1: Backup laptop ready (RTO = 5 minutes)
The Critical Difference (Why Companies Confuse This)
[!WARNING] Common Mistake: Confusing RPO and RTO
RPO = Data Loss (measured in TIME since last backup)
RTO = Downtime (measured in TIME to recover)
You can have DIFFERENT combinations:
Example 1: Low RPO, High RTO
Real-World RPO/RTO Example: GitLab Database Incident (2017)
The Disaster:
- A GitLab engineer accidentally deleted the production database
- 300 GB of data vanished
- Actual recovery point: 6 hours → Lost 6 hours of data (5,000 projects, 5,000 comments, 700 new users)
- Actual recovery time: 18 hours → Took 18 hours to restore from backups
- Total impact: 6 hours of data lost + 18 hours offline
- Cost: Immeasurable reputation damage (but they recovered with transparency)
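To keep the two numbers apart, it can help to compute them from three timestamps: the last good backup, the failure, and the moment service is restored. The sketch below uses made-up timestamps chosen only to reproduce the 6-hour and 18-hour figures above; the function names and the 1-hour and 4-hour targets are hypothetical, not GitLab's actual objectives.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_good_backup: datetime,
                     failure: datetime,
                     service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (data-loss window, downtime) for one incident."""
    data_loss_window = failure - last_good_backup   # compare against RPO
    downtime = service_restored - failure           # compare against RTO
    return data_loss_window, downtime


def meets_objectives(data_loss: timedelta, downtime: timedelta,
                     rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the RPO and the RTO were met."""
    return data_loss <= rpo and downtime <= rto


# Illustrative timestamps only (not the real incident timeline).
last_backup = datetime(2024, 1, 1, 12, 0)
failure = last_backup + timedelta(hours=6)
restored = failure + timedelta(hours=18)

loss, down = recovery_metrics(last_backup, failure, restored)
print(f"Data lost: {loss}, offline for: {down}")

# Hypothetical targets: RPO of 1 hour, RTO of 4 hours.
print(meets_objectives(loss, down, timedelta(hours=1), timedelta(hours=4)))
```

Note that the two results are independent: more frequent backups shrink the first number, while faster restore procedures shrink the second.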
How to Choose Your RPO/RTO
Step 1: Calculate Cost of Downtime
Decision Tree: Choosing RPO/RTO
2. RPO & RTO
[!WARNING] Gotcha: RPO vs RTO
A common interview trap.
RPO (Point) = Data Loss (How far back do we go?)
RTO (Time) = Downtime (How long until we are back online?)
You can have low RPO (0 data loss) but high RTO (took 4 hours to restart).
[!TIP] Jargon Alert: Split Brain
A disaster scenario where two databases both think they are “Primary” and accept writes at the same time, corrupting data. Always use a “Witness” or “Quorum” to prevent this in active-active architectures.
Quick Reference (after reading the detailed explanation above):
RPO (Recovery Point Objective): How much data loss is acceptable?
RTO (Recovery Time Objective): How long to recover?
Example Scenarios
Common Mistakes in HA/DR (Learn from Others’ Failures)
Mistake #1: “Stopped” VMs Still Cost Money
The Trap:
- “Stopped” = OS shutdown, but VM resources still reserved
- You still pay for compute, just not OS license
- Correct action: “Deallocate” (not just “Stop”)
- Stopped VM: Still ~80% of full cost
- Deallocated VM: Only pay for storage (~5% of full cost)
Mistake #2: Untested Backups (The GitLab Disaster)
The Trap:
- Hosting company for developers
- Backups existed but were on same infrastructure
- Hacker deleted everything (including backups)
- Company went out of business
- Customers lost everything
Mistake #3: Ignoring Composite SLAs
The Trap:
Mistake #4: Forgetting About Data Gravity
The Trap:
Mistake #5: Active-Active Without Proper Conflict Resolution
The Trap:
- Active-active setup without proper conflict resolution
- Network partition between datacenters
- Both sides accepted writes
- Data corruption cost hundreds of hours to resolve
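One common guard against the "both sides accepted writes" failure is a majority quorum: a side may accept writes only if it can reach more than half of all voters. Below is a minimal sketch of that rule, assuming two datacenters plus one witness (three voters); the function name and the voter counts are illustrative and not tied to any specific database product.

```python
def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """True only if this side can reach a strict majority of all voters."""
    return reachable_voters > total_voters // 2


TOTAL_VOTERS = 3  # assumption: datacenter A, datacenter B, and one witness

# Network partition: A can still reach the witness, B is isolated.
side_a_reachable = 2  # A itself + witness
side_b_reachable = 1  # B itself only

print("A may accept writes:", has_quorum(side_a_reachable, TOTAL_VOTERS))
print("B may accept writes:", has_quorum(side_b_reachable, TOTAL_VOTERS))
```

With only two voters and no witness, neither side reaches a strict majority during a partition, which is why an odd number of voters is the usual design choice: it trades a brief loss of write availability on one side for protection against corruption.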
3. High Availability Patterns
1. Active-Passive
2. Active-Active
3. Multi-Region with Traffic Manager
4. Disaster Recovery Architecture
Example: E-Commerce Platform
5. Backup Strategies
- VMs
- Databases
- Storage
6. Testing DR Plan
Disaster Recovery Drill Checklist:
7. Interview Questions
Beginner Level
Q1: What is the difference between Availability and Durability?
Answer:
- Availability: Uptime. Can I access the service right now? (e.g., SLA 99.9%).
- Durability: Data integrity. Is my data safe from loss? (e.g., 11 nines, 99.999999999%, for Blob Storage).
You can have high availability but lose data (corruption), or high durability but be offline.
Q2: What is an Availability Zone?
Answer:
A physically separate location (one or more datacenters) within the same Azure region. It has independent power, cooling, and networking.
Protects against datacenter-level failures (fire, power cut).
Intermediate Level
Q3: Explain RPO and RTO in simple terms
Answer:
- RPO (Recovery Point Objective): “How much data can we lose?” (Time since last backup).
- RTO (Recovery Time Objective): “How long can we be down?” (Time to restore service).
Q4: Active-Passive vs Active-Active?
Answer:
- Active-Passive: One region handles traffic. Secondary is standby. Cheaper, slower failover (RTO > 0).
- Active-Active: Both regions handle traffic. Complex data sync. Expensive. Near-zero downtime failover (RTO ≈ 0).
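The routing difference between the two patterns can be sketched in a few lines. This is a simplified model, assuming each region reports a boolean health status and a priority number; the `Region` class and both function names are hypothetical and do not represent the Traffic Manager API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    priority: int   # lower number = preferred region
    healthy: bool


def route_active_passive(regions: list[Region]) -> Optional[Region]:
    """Send all traffic to the healthy region with the best priority."""
    healthy = [r for r in regions if r.healthy]
    return min(healthy, key=lambda r: r.priority) if healthy else None


def route_active_active(regions: list[Region]) -> list[Region]:
    """Spread traffic across every healthy region."""
    return [r for r in regions if r.healthy]


regions = [
    Region("East US", priority=1, healthy=False),     # simulated outage
    Region("West Europe", priority=2, healthy=True),
]

target = route_active_passive(regions)
print("Active-passive target:", target.name if target else "none")
print("Active-active targets:", [r.name for r in route_active_active(regions)])
```

In the active-passive case the standby only receives traffic after the health check flips, so detection time and DNS or routing propagation both count toward RTO. In the active-active case the surviving region is already serving users, which is why failover is close to instant.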
Advanced Level
Q5: How do you achieve 99.99% SLA not offered by a single service?
Answer:
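By using Composite SLAs.
If you have two regions, each with 99.9% availability, the probability of both failing simultaneously (assuming the failures are independent) is $0.001 \times 0.001 = 0.000001$.
Total Availability = $1 - 0.000001 = 0.999999$, or 99.9999%.
Redundancy increases availability.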
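The arithmetic behind this answer, and its mirror image for chained dependencies, can be sketched as two small functions. Both assume that component failures are independent, which real systems only approximate; the function names and the 99.95% and 99.99% figures in the serial example are illustrative.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availabilities: list[float]) -> float:
    """Any one replica is enough, so unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Two redundant regions at 99.9% each (the case in the answer above).
print(f"Redundant: {redundant_availability([0.999, 0.999]):.6%}")

# Hypothetical chain: an app tier at 99.95% that depends on a database at 99.99%.
print(f"Serial:    {serial_availability([0.9995, 0.9999]):.4%}")
```

The serial result is always lower than the weakest component, and the redundant result is always higher than the strongest replica, which is the whole case for adding redundancy rather than adding dependencies.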
8. Key Takeaways
SLA Mathematics
Understand how SLAs compound. Dependencies reduce availability; redundancy increases it.
Zones vs Regions
Use Zones for synchronous HA (High Availability). Use Regions for asynchronous DR (Disaster Recovery).
Data Gravity
Compute is stateless and easy to move. Data is heavy and hard to sync. Focus DR efforts on data replication.
Testing
A backup is useless if you can’t restore it. A DR plan is a hypothesis until tested.
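A restore drill can be automated in miniature: take a backup, restore it to a separate location, and verify that the restored copy matches the source. The sketch below uses plain file copies as stand-ins for a real backup job and restore procedure; the file names and the `restore_drill` function are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def restore_drill(source: Path, backup: Path, restore_target: Path) -> bool:
    """Back up `source`, restore it elsewhere, and verify the restored copy."""
    shutil.copy2(source, backup)           # stand-in for the real backup job
    shutil.copy2(backup, restore_target)   # stand-in for the real restore step
    return sha256_of(source) == sha256_of(restore_target)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    source = tmp_dir / "orders.db"
    source.write_bytes(b"example production data")
    ok = restore_drill(source, tmp_dir / "orders.bak", tmp_dir / "orders.restored")
    print("restore verified" if ok else "restore FAILED")
```

A real drill would also time the restore and compare it against the RTO, since a backup that restores correctly but too slowly still fails the plan.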
Business Alignment
RPO/RTO are business decisions, not technical ones. They dictate the cost of the resulting solution.
Next Steps
Continue to Chapter 14
Master real-world Azure architecture patterns and design principles