SRE & Production Excellence
Learn how Google's Site Reliability Engineering practices apply to Azure, from setting SLOs to running chaos experiments.

What You'll Learn
By the end of this chapter, you'll understand:
- What SRE is (and how it's different from traditional operations)
- How to measure reliability (SLIs, SLOs, SLAs explained from scratch)
- Error budgets (and how they balance innovation vs stability)
- Production best practices (deployments, rollbacks, monitoring)
- Chaos engineering (intentionally breaking things to build resilience)
- Real-world examples with actual costs and trade-offs
Introduction: What is Site Reliability Engineering (SRE)?
Start Here if You’re Completely New
SRE = Treating operations like a software engineering problem.

Think of it like maintaining a car: traditional operations (the old way) waits for something to break and then fixes it by hand; SRE instruments the car, automates the maintenance, and tracks how reliably it runs. In practice, SRE means:
- Automation: Replace manual tasks with code
- Monitoring: Detect problems before users notice
- Reliability: Measure and improve system uptime
- Balance: Ship features while maintaining stability
Why SRE Matters: The Cost of Unreliability
Real-World Failure Examples
Amazon Prime Day 2018
- Problem: Site crashed for 63 minutes
- Cost: $99 million in lost revenue
- Per-minute cost: $1.57 million/minute
- Cause: No error budget, no chaos testing
- Prevention cost: ~$1 million (SRE team + tools)
- ROI: 99x return on investment ✅
| Company | Incident | Downtime | Cost | SRE Practice Missing |
|---|---|---|---|---|
| Facebook | Oct 2021 | 6 hours | $60-100M | Change management + error budgets |
| GitHub | Oct 2018 | 24 hours | Unknown | Chaos engineering |
| Google Cloud | Jun 2019 | 4.5 hours | Unknown | Gradual rollouts |
| Cloudflare | Jul 2019 | 30 minutes | Unknown | Progressive deployments |
Understanding SLI, SLO, SLA (From Absolute Zero)
These are the THREE most important acronyms in SRE. Let's break them down with a real-world analogy.

The Pizza Delivery Analogy
- SLI (Service Level Indicator) = What You Measure (e.g., how long each delivery actually took)
- SLO (Service Level Objective) = Your internal target (e.g., deliver within 30 minutes, 95% of the time)
- SLA (Service Level Agreement) = The customer promise (e.g., free pizza if delivery takes over 45 minutes)

Software Example: E-Commerce Website
- SLI (Service Level Indicator) = Measurements (e.g., p95 page latency, checkout success rate)
- SLO (Service Level Objective) = Internal targets on those measurements (e.g., p95 latency < 300ms)
- SLA (Service Level Agreement) = The contractual commitment to customers (e.g., 99.9% uptime or credits back)

Error Budgets Explained (From Scratch)
Error Budget = How much downtime you're allowed.

The Simple Math
Error Budget = 100% - SLO. With a 99.9% availability SLO, the budget is 0.1% of a 30-day month, or about 43.2 minutes.
Error Budgets as "Innovation Currency"
Think of the error budget like a bank account:
- High balance = take risks (ship features, run experiments)
- Low balance = spend carefully (slow down, invest in reliability)

Common Mistakes in SRE (Learn from Others)
Mistake #1: Setting SLOs Too High (Perfectionism)
The Trap: Chasing 99.999% when users can't tell it apart from 99.9%. Every extra nine multiplies cost (see the gotcha below) without adding user-visible value.

Mistake #2: No Error Budget Policy (Chaos Deploys)
The Trap: Teams keep deploying even after the budget is exhausted, so reliability never recovers and the SLO becomes meaningless.

Mistake #3: SLO Based on System Metrics (Not User Experience)
The Trap: "CPU < 80%" tells you nothing if checkout is failing. Measure what users actually feel: latency, errors, successful transactions.

Mistake #4: Too Many SLOs (Analysis Paralysis)
The Trap: Dozens of SLOs dilute focus and nobody knows which alert matters. Start with 2-3 that map to critical user journeys.

[!TIP] Jargon Alert: SLI vs SLO vs SLA
SLI (Indicator): What you measure (e.g., latency = 250ms)
SLO (Objective): Your internal goal (e.g., p95 latency < 300ms)
SLA (Agreement): Your customer promise (e.g., 99.9% uptime or money back)
SLO should be tighter than SLA to give you a buffer!
[!WARNING] Gotcha: 99.9% vs 99.99% Availability The difference is NOT 0.09%—it’s 10x more downtime! 99.9% = 43 minutes/month downtime 99.99% = 4.3 minutes/month downtime Achieving 99.99% can cost 10x more. Choose based on business impact, not ego.
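To make this concrete, here's a minimal Python sketch (the helper is ours, not from any SRE library) that converts an availability target into allowed downtime per 30-day month:

```python
# Minimal sketch: convert an availability SLO into allowed downtime.
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at a given availability SLO."""
    total_minutes = days * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min/month")
# 99.0%  -> 432.0 min/month
# 99.9%  -> 43.2 min/month
# 99.99% -> 4.3 min/month  (10x less room than 99.9%!)
```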
1. The SRE Mindset
SRE = What happens when you ask a software engineer to design operations.

Core SRE Principles
Embrace Risk
100% reliability is impossible and wasteful. Find the right balance between reliability and velocity.
Service Level Objectives
Define clear, measurable reliability targets. Make data-driven decisions.
Eliminate Toil
Automate repetitive manual work. Toil is the enemy of scaling.
Monitor Everything
You can’t improve what you don’t measure. Observability is mandatory.
Error Budgets
Use remaining error budget to balance innovation and stability.
Blameless Postmortems
Learn from failures without blaming individuals. Focus on systemic improvements.
2. Service Level Indicators (SLIs)
SLIs are carefully chosen metrics that represent the health of your service from the user's perspective.

The Golden Signals (Google SRE)
- Latency: How long does a request take?
- Traffic: How much demand is hitting the system?
- Errors: What fraction of requests fail?
- Saturation: How "full" is the service (CPU, memory, queue depth)?

Latency
How long does a request take? Good SLI targets:
- p95 latency < 300ms (fast)
- p99 latency < 1000ms (acceptable)
Why percentiles instead of averages (demonstrated in the sketch below):
- Averages hide outliers (1ms + 10,000ms = ~5,000ms average, useless!)
- p95 = 95% of users have this experience or better
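Here's a small self-contained Python sketch (the latency data is synthetic) showing why the average misleads while p95/p99 describe real user experience:

```python
import random

# Synthetic request latencies: mostly fast, plus a few huge outliers.
latencies = [random.uniform(50, 200) for _ in range(990)] + [10_000] * 10
latencies.sort()

def percentile(values, pct):
    # Nearest-rank percentile over a pre-sorted list.
    return values[min(len(values) - 1, int(len(values) * pct / 100))]

avg = sum(latencies) / len(latencies)
print(f"average: {avg:.0f} ms")                         # distorted by 10 outliers
print(f"p95:     {percentile(latencies, 95):.0f} ms")   # what 95% of users see or better
print(f"p99:     {percentile(latencies, 99):.0f} ms")   # the tail catches the outliers
```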
Choosing Good SLIs
Bad SLI examples: raw CPU utilization, memory usage, server count. These are system metrics users never see; a service can sit at 30% CPU while every checkout fails. Good SLIs measure the user experience: request latency, error rate, transaction success.

3. Service Level Objectives (SLOs)
SLOs are internal reliability targets. They should be slightly tighter than your SLA.

SLO Examples
Three typical services and their SLIs & SLOs (sketched in the example below):
- E-Commerce Site
- API Service
- Batch Processing System
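The concrete targets for these services aren't spelled out here, so treat the following Python sketch as illustrative only; every number and key name is an assumption:

```python
# Hypothetical SLO registry; all targets below are illustrative assumptions.
SLOS = {
    "ecommerce-site": {
        "availability": "99.9%",          # ~43.2 min/month error budget
        "latency_p95_ms": 300,            # product and checkout pages
        "checkout_success_rate": "99.5%",
    },
    "api-service": {
        "availability": "99.95%",
        "latency_p99_ms": 1000,
        "max_error_rate": "0.1%",
    },
    "batch-processing": {
        "completion_by": "06:00 UTC",     # daily job freshness target
        "job_success_rate": "99%",
    },
}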
4. Error Budgets
Error Budget = 100% - SLO. If your SLO is 99.9% availability, your error budget is 0.1% = 43.2 minutes/month.

Error Budget Policy
Example decision making: budget remaining → keep shipping features at normal pace; budget exhausted → freeze risky deploys and spend the time on reliability work.

Calculating Error Budget Burn Rate
Burn rate measures how fast you're consuming the budget: the observed error rate divided by the budgeted error rate. A burn rate of 1.0 spends the budget exactly over the window; 3.0 exhausts it three times too fast.
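A minimal Python sketch of the calculation; the traffic numbers and alert threshold are assumptions for illustration:

```python
def burn_rate(failed: int, total: int, slo_percent: float) -> float:
    """How fast the error budget is burning: 1.0 = exactly on budget,
    >1.0 = budget runs out before the window ends."""
    observed_error_rate = failed / total
    budget_error_rate = 1 - slo_percent / 100
    return observed_error_rate / budget_error_rate

# Hypothetical hour of traffic: 36 failures out of 12,000 requests at a 99.9% SLO.
rate = burn_rate(failed=36, total=12_000, slo_percent=99.9)
print(f"burn rate: {rate:.1f}x")   # 3.0x -> a 30-day budget gone in ~10 days
if rate > 2:                       # threshold is an assumption, tune per service
    print("ALERT: page the on-call engineer")
```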
5. Toil Reduction
Toil = Repetitive, manual, automatable work that scales linearly with service growth.

What is Toil?
Characteristics of toil:
- Manual
- Repetitive
- Automatable
- Tactical (no long-term value)
- Scales linearly with service size

✅ Toil examples: manual deployments, hand-operated scaling, grepping logs across servers.
❌ Not toil: novel problem solving, architecture design, responding to new incidents.
Toil Elimination Strategy
Automate the highest-frequency tasks first, then measure the reduction quarterly (see Q3 in the interview section for a worked approach).
6. Chaos Engineering
Chaos Engineering = Intentionally breaking things in production to build confidence.

Chaos Principles
- Build a hypothesis (e.g., “If we kill 30% of pods, requests should still succeed”)
- Define steady state (e.g., “Error rate < 1%, latency p95 < 500ms”)
- Introduce chaos (e.g., kill pods, inject latency, fail database)
- Measure deviation (Did error rate spike? Did latency increase?)
- Learn and improve (Add retries? Implement circuit breaker?)
Azure Chaos Studio
Azure Chaos Studio is Azure's managed fault-injection service: you define experiments against targets such as AKS clusters or VMs and run them under access control.

Chaos Scenarios
1. Pod Failure (AKS)
Hypothesis: Application survives 30% pod failure
Test: Kill 30% of the app's pods and watch the steady-state metrics (a sketch follows below)
Expected: No user-visible errors (replicas take over)
Actual: ?
Learnings: Need faster readiness probes? Increase replica count?
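A sketch of what the test could look like in Python with the official kubernetes client; the namespace, label selector, and health URL are assumptions:

```python
# Minimal pod-kill experiment sketch (pip install kubernetes requests).
import random
import time

import requests
from kubernetes import client, config

NAMESPACE, SELECTOR = "default", "app=shop"          # assumptions
HEALTH_URL = "https://shop.example.com/healthz"      # assumption

config.load_kube_config()
v1 = client.CoreV1Api()

# Introduce chaos: delete ~30% of the matching pods.
pods = v1.list_namespaced_pod(NAMESPACE, label_selector=SELECTOR).items
victims = random.sample(pods, k=max(1, len(pods) * 30 // 100))
for pod in victims:
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)

# Measure deviation from steady state: error rate should stay < 1%.
errors = 0
for _ in range(100):
    try:
        requests.get(HEALTH_URL, timeout=2).raise_for_status()
    except requests.RequestException:
        errors += 1
    time.sleep(0.5)
print(f"error rate during chaos: {errors}%  (hypothesis: < 1%)")
```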
2. Network Latency
Hypothesis: Application handles 500ms network latency
Test: Inject 500ms of latency on calls to a dependency
Expected: Requests time out gracefully, circuit breaker opens
Actual: ?
Learnings: Add timeout? Implement retry with backoff? (See the circuit breaker sketch below.)
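If the learnings point toward a circuit breaker, a minimal Python sketch looks like this; the failure threshold and cooldown are assumptions:

```python
import time

import requests

class CircuitBreaker:
    """Tiny illustrative circuit breaker: open after N consecutive failures,
    allow a retry after a cooldown. Thresholds are assumptions."""

    def __init__(self, max_failures: int = 5, reset_seconds: float = 30):
        self.max_failures, self.reset_seconds = max_failures, reset_seconds
        self.failures, self.opened_at = 0, None

    def call(self, url: str) -> requests.Response:
        if self.opened_at and time.time() - self.opened_at < self.reset_seconds:
            raise RuntimeError("circuit open: failing fast instead of waiting")
        try:
            resp = requests.get(url, timeout=0.4)    # tighter than the 500ms injected latency
            resp.raise_for_status()
        except requests.RequestException:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()         # trip the breaker
            raise
        self.failures, self.opened_at = 0, None      # success closes the circuit
        return resp
```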
3. CPU Stress
Hypothesis: Autoscaling kicks in under CPU load
Test: Apply sustained CPU pressure to the pods
Expected: HPA scales pods from 3 → 10 within 2 minutes
Actual: ?
Learnings: Adjust HPA thresholds? Add more headroom?
4. Database Failover
Hypothesis: App survives database failover
Test: Trigger a failover of the primary database
Expected: Brief spike in errors (< 1 minute), then recovery
Actual: ?
Learnings: Does the connection pool handle failover? Need retry logic? (See the retry sketch below.)
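If the learnings point toward retry logic, here is a minimal Python sketch of retry with exponential backoff and jitter; the attempt count and delays are assumptions:

```python
import random
import time

def with_retry(operation, attempts: int = 5, base_delay: float = 0.2):
    """Retry with exponential backoff + jitter; suited to the brief error
    spike a database failover causes. Parameters are assumptions."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise                                       # budget exhausted, surface the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)                               # 0.2s, 0.4s, 0.8s, 1.6s (+ jitter)

# Usage sketch: wrap any failover-sensitive call.
# row = with_retry(lambda: db.execute("SELECT 1"))
```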
GameDay Exercise
Scenario: Black Friday Simulation

7. Incident Management
Incident Response Lifecycle
Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Critical - Revenue loss | Page immediately | Payment system down, database offline |
| SEV2 | Major - Degraded service | 15 minutes | API latency 10x higher, 10% error rate |
| SEV3 | Minor - Partial impact | 1 hour | Single pod failing, non-critical feature down |
| SEV4 | Cosmetic - No user impact | Next business day | UI typo, log noise |
Blameless Postmortem Template
8. On-Call Best Practices
On-Call Rotation
On-Call Expectations
Response Times:
- SEV1: 5 minutes
- SEV2: 15 minutes
- SEV3: 1 hour
9. Load Testing & Performance Testing
Testing your system under load is critical for understanding scalability limits and preventing outages.

Load Testing vs Performance Testing
| Type | Purpose | When | Tools |
|---|---|---|---|
| Load Testing | Verify system handles expected load | Before launch, before traffic spikes | k6, JMeter, Azure Load Testing |
| Stress Testing | Find breaking point | Capacity planning | k6, Locust |
| Spike Testing | Handle sudden traffic bursts | Black Friday prep | k6, Gatling |
| Soak Testing | Detect memory leaks over time | After deployments | k6 (24+ hours) |
| Performance Testing | Measure response times | Every release | Azure Load Testing |
Azure Load Testing
Azure Load Testing is a fully managed service for generating high-scale load.

Create Load Test
JMeter Test Plan Example
k6 Load Testing
k6 is a modern, developer-friendly load testing tool.

Basic Load Test
Run k6 Test
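The k6 scripts themselves aren't reproduced here. As a Python stand-in with the same shape (virtual users, think time, weighted tasks), a minimal Locust script — Locust appears in the tools table above — might look like this; the host and endpoints are assumptions:

```python
# locustfile.py — run with: locust -f locustfile.py --host https://shop.example.com
from locust import HttpUser, task, between

class ShopUser(HttpUser):
    wait_time = between(1, 3)          # think time between actions (seconds)

    @task(3)                           # browsing is 3x more common than checkout
    def browse_products(self):
        self.client.get("/products")

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"product_id": "demo-123", "qty": 1})
```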
Performance Testing Scenarios
Scenario 1: Black Friday Load Test
Scenario 2: Soak Test (Memory Leak Detection)
Run at a steady, moderate load for 24+ hours and watch (a monitoring sketch follows the list):
- Memory usage (should be flat, not increasing)
- Connection pool size (should stabilize)
- Database connections (no leaks)
- HTTP response times (should not degrade)
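A simple way to watch the first metric from that list during a soak test is to poll the process's resident memory. This Python sketch uses psutil; the PID, interval, and threshold are assumptions:

```python
# Poll the service process's memory during a soak test; a steadily rising
# RSS over hours usually means a leak (pip install psutil).
import time

import psutil

proc = psutil.Process(12345)            # PID of the service under test (assumption)
baseline = proc.memory_info().rss
while True:
    rss = proc.memory_info().rss
    growth = (rss - baseline) / baseline * 100
    print(f"rss={rss / 1024 / 1024:.0f} MiB  (+{growth:.1f}% since start)")
    if growth > 50:                     # flat is healthy; +50% is suspicious
        print("WARNING: possible memory leak")
    time.sleep(300)                     # sample every 5 minutes for 24+ hours
```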
Azure Application Insights Performance Testing
Custom Telemetry for Load Tests
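One way to emit custom load-test telemetry from Python is the Azure Monitor OpenTelemetry distro; the metric name, attributes, and connection string below are placeholders:

```python
# Emit custom metrics to Application Insights
# (pip install azure-monitor-opentelemetry).
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

configure_azure_monitor(connection_string="InstrumentationKey=<your-key>")

meter = metrics.get_meter("load-test")
checkout_latency = meter.create_histogram("checkout_latency_ms")

# Record one data point per request, tagged with the test run for later filtering.
checkout_latency.record(245, {"test_run": "black-friday-dry-run", "endpoint": "/checkout"})
```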
KQL Query: Analyze Load Test Results
Performance Testing Best Practices
1. Realistic Test Data
2. Think Time (Realistic User Behavior)
3. Connection Pooling
4. Gradual Ramp-Up
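Practices 2-4 can be sketched in a few lines of Python; the URL, stage sizes, and hold times are assumptions, and a real load tool runs this far more efficiently:

```python
import concurrent.futures
import random
import time

import requests

session = requests.Session()            # 3. connection pooling: one shared TCP/TLS pool

def user_iteration():
    session.get("https://shop.example.com/products", timeout=5)
    time.sleep(random.uniform(1, 3))    # 2. think time: users pause between clicks

# 4. gradual ramp-up: grow the virtual-user count stage by stage, never 0 -> max.
with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    for stage_users in (10, 50, 100, 200):
        futures = [pool.submit(user_iteration) for _ in range(stage_users)]
        concurrent.futures.wait(futures)
        time.sleep(5)                   # brief pause between stages
```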
Interpreting Load Test Results
Metrics to Track
| Metric | Good | Warning | Critical |
|---|---|---|---|
| P95 Latency | <500ms | 500-1000ms | >1000ms |
| Error Rate | <0.1% | 0.1-1% | >1% |
| Throughput | Meets SLO | 80% of SLO | <80% of SLO |
| CPU Usage | <70% | 70-85% | >85% |
| Memory Usage | Stable | Slow increase | Rapid increase |
Bottleneck Identification
- Database queries (most common) - Add indexes, optimize queries
- External API calls - Add caching, circuit breakers
- CPU-intensive operations - Move to background jobs
- Network latency - Use CDN, co-locate services
- Memory allocation - Reduce allocations, use object pooling
Load Testing Checklist
Before running production load tests:
- Test environment matches production (same VM SKU, database tier, autoscaling rules)
- Disable sampling in Application Insights (100% telemetry during tests)
- Notify team (Slack #on-call: “Load test starting at 10 AM”)
- Monitor dashboards open (Grafana, Application Insights Live Metrics)
- Rollback plan ready (know how to stop the test immediately)
- Test data prepared (realistic user IDs, product IDs, payment methods)
- Success criteria defined (P95 <500ms, error rate <1%, no memory leaks)
- Baseline captured (run test before changes, compare after)
[!WARNING] Gotcha: Testing in Production Load testing in production is risky but sometimes necessary. If you must:
- Use synthetic users (test-user-1, test-user-2) to isolate test data
- Route test traffic to a separate backend pool (blue-green)
- Start with 5% of production load, increase gradually
- Have instant rollback ready (kill switch)
- Never test payment processing or critical user actions in production
10. Production Readiness Checklist
Before launching a new service to production, confirm at minimum: SLOs defined and dashboards live, alerts and runbooks in place, rollback tested, load test passed, on-call rotation staffed.

11. Interview Questions
Beginner Level
Q1: What is the difference between SLI, SLO, and SLA?
Answer:
SLI (Service Level Indicator):
- What you measure
- Example: "p95 latency = 250ms"
SLO (Service Level Objective):
- Your internal target
- Example: "p95 latency < 300ms"
- Guides engineering decisions
SLA (Service Level Agreement):
- Customer promise with consequences
- Example: "99.9% uptime or 10% refund"
- Legal/financial agreement
Q2: What is an error budget?
Answer: Error budget = how much unreliability you can tolerate.
Calculation: 100% - SLO (a 99.9% SLO leaves 0.1%, about 43.2 minutes/month).
Use:
- Budget remaining → ship features fast
- Budget exhausted → focus on reliability
Intermediate Level
Q3: How do you measure and reduce toil?
Answer:
Measure:
- Time audit: Track manual, repetitive tasks
- Calculate percentage of time spent on toil
- Goal: Keep toil < 50% (ideally < 30%)
Reduce:
- Automate high-frequency tasks first
- Examples:
  - Manual deployments → CI/CD
  - Manual scaling → Autoscaling
  - Log searching → Centralized logging
- Measure improvement quarterly
Not toil (keep doing this work by hand):
- Novel problem solving
- Architecture design
- Incident response (new incidents)
Q4: Design an on-call rotation for a team of 6
Answer:
Rotation:
- Primary on-call: 1 week
- Secondary on-call: 1 week
- Each engineer is primary once every 6 weeks
Compensation:
- Primary: $500/week
- Secondary: $250/week
Handoff:
- Written handoff document
- 15-minute sync call
- Slack thread with context
Advanced Level
Q5: Implement an error budget policy
Answer:
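An error budget policy maps the remaining budget to concrete engineering actions. A minimal Python sketch, with thresholds and actions as illustrative assumptions, not a standard:

```python
# Sketch of an error budget policy as executable rules.
def policy(budget_remaining_pct: float) -> str:
    if budget_remaining_pct > 50:
        return "Ship freely: normal release cadence, run chaos experiments."
    if budget_remaining_pct > 20:
        return "Caution: smaller releases, longer canary bake time, review risky changes."
    if budget_remaining_pct > 0:
        return "Freeze risky changes: reliability work is prioritized."
    return "Hard freeze: only fixes that restore the SLO may ship."

print(policy(budget_remaining_pct=35.0))
```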
Q6: Design a chaos engineering program
Answer: Start small and expand. (1) Define steady state for one service (error rate, p95 latency). (2) Run the first experiments in staging: pod kills, latency injection. (3) Graduate validated experiments to production with a kill switch. (4) Schedule regular GameDays like the Black Friday simulation above. (5) Feed every finding into the reliability backlog.
12. Key Takeaways
SLOs Drive Decisions
Define clear SLOs. Use error budgets to balance velocity and reliability.
Eliminate Toil
Automate everything repetitive. Spend time on engineering, not firefighting.
Break Things on Purpose
Chaos engineering builds confidence. Test in production before customers do.
Blameless Culture
Focus on systems, not people. Learn from failures without punishment.
Monitor What Matters
User experience > server metrics. Latency, errors, saturation.
Operational Excellence
Production readiness checklist before launch. On-call with clear expectations.
Next Steps
Back to Course Overview
Return to course overview or continue to the Capstone Project
Continue to Chapter 15
Apply everything you’ve learned in the enterprise e-commerce capstone