
SRE & Production Excellence

Learn how Google’s Site Reliability Engineering practices apply to Azure, from setting SLOs to running chaos experiments.

(Figure: the SRE pyramid)

What You’ll Learn

By the end of this chapter, you’ll understand:
  • What SRE is (and how it’s different from traditional operations)
  • How to measure reliability (SLIs, SLOs, SLAs explained from scratch)
  • Error budgets (and how they balance innovation vs stability)
  • Production best practices (deployments, rollbacks, monitoring)
  • Chaos engineering (intentionally breaking things to build resilience)
  • Real-world examples with actual costs and trade-offs

Introduction: What is Site Reliability Engineering (SRE)?

Start Here if You’re Completely New

SRE = treating operations as a software engineering problem.

Think of it like maintaining a car.

Traditional Operations (Old Way):
Car breaks down

Mechanic manually fixes it

Car breaks down again next month

Mechanic manually fixes it again

Repeat forever... ❌

Problems:
- Reactive (fixing problems after they happen)
- Manual work (mechanic required every time)
- No improvement (same problems repeat)
- Doesn't scale (need more mechanics as fleet grows)
Site Reliability Engineering (New Way):
Car sensor detects problem before failure

Automated system orders replacement part

Technician installs during scheduled maintenance

System analyzes failure pattern

Software patch prevents issue in all cars ✅

Benefits:
- Proactive (preventing problems before they happen)
- Automated (systems fix themselves)
- Continuous improvement (learn from every incident)
- Scales (same team manages 10x more systems)
SRE in Software:
  • Automation: Replace manual tasks with code
  • Monitoring: Detect problems before users notice
  • Reliability: Measure and improve system uptime
  • Balance: Ship features while maintaining stability

Why SRE Matters: The Cost of Unreliability

Real-World Failure Examples

Amazon Prime Day 2018
  • Problem: Site crashed for 63 minutes
  • Cost: $99 million in lost revenue
  • Per-minute cost: $1.57 million/minute
  • Cause: No error budget, no chaos testing
  • Prevention cost: ~$1 million (SRE team + tools)
  • ROI: 99x return on investment ✅
More Examples:
| Company | Incident | Downtime | Cost | SRE Practice Missing |
|---|---|---|---|---|
| Facebook | Oct 2021 | 6 hours | $60-100M | Change management + error budgets |
| GitHub | Oct 2018 | 24 hours | Unknown | Chaos engineering |
| Google Cloud | Jun 2019 | 4.5 hours | Unknown | Gradual rollouts |
| Cloudflare | Jul 2019 | 30 minutes | Unknown | Progressive deployments |
The Pattern: Good SRE practices cost millions. Bad SRE practices cost tens of millions.

Understanding SLI, SLO, SLA (From Absolute Zero)

These are the THREE most important acronyms in SRE. Let’s break them down with a real-world analogy:

The Pizza Delivery Analogy

SLI (Service Level Indicator) = What You Measure
Pizza Delivery Service:
- SLI #1: Delivery time (measured with GPS tracker)
  - Order at 7:00 PM → Delivered at 7:25 PM = 25 minutes

- SLI #2: Pizza temperature (measured with thermometer)
  - Temperature at delivery: 145°F

- SLI #3: Order accuracy (measured by customer)
  - Correct toppings: Yes / No

**SLI = The actual measurements**
SLO (Service Level Objective) = Your Internal Goal
Pizza Company's Goals (not told to customers):
- SLO #1: 95% of pizzas delivered in < 30 minutes
- SLO #2: 99% of pizzas arrive above 140°F
- SLO #3: 99.9% of orders are correct

**SLO = What you promise yourself to achieve**
SLA (Service Level Agreement) = Your Customer Promise
Pizza Company's Guarantee (public promise):
- SLA: "Pizza in 30 minutes or it's free!"

**SLA = What you promise customers (with consequences)**
The Relationship:
SLI (What you measure) → SLO (Your goal) → SLA (Customer promise)

Real Numbers:
- SLI: Last 100 pizzas averaged 28 minutes
- SLO: We want 95% delivered in < 30 min
- SLA: We promise 30 min or it's free

Buffer: the internal SLO (95% on time) limits how often you have to honor the SLA payout (a free pizza)
→ This gives you room for occasional failures ✅

Software Example: E-Commerce Website

SLI (Service Level Indicator) = Measurements
What we measure:
- API response time: 250ms (measured with Application Insights)
- Error rate: 0.05% (measured with logs)
- Uptime: 99.95% (measured with health checks)

These are FACTS (actual measurements)
SLO (Service Level Objective) = Internal Goals
Our internal targets:
- p95 latency < 300ms (95% of requests faster than 300ms)
- Error rate < 0.1% (99.9% success)
- Uptime > 99.9% (43 minutes downtime/month OK)

These guide our engineering decisions
SLA (Service Level Agreement) = Customer Promise
Our public guarantee:
- 99.5% uptime or we give you credits
- (We promise customers 99.5%, but aim for 99.9% internally)

Buffer: 99.9% (SLO) vs 99.5% (SLA) = 0.4% buffer ✅
Why the Buffer?
Month 1:
- Actual uptime: 99.8% ✅ (above SLO 99.9%? No, but above SLA 99.5%? Yes!)
- No customer refunds
- Internal: "We missed our SLO, let's improve"

Month 2:
- Actual uptime: 99.4% ❌ (below SLA 99.5%)
- Customer refunds triggered
- Internal: "Emergency! Stop feature releases!"

The buffer prevents small problems from costing money

Error Budgets Explained (From Scratch)

Error Budget = How much downtime you’re allowed

The Simple Math

SLO: 99.9% uptime

100% - 99.9% = 0.1% allowed downtime

0.1% of 30 days (1 month):
30 days = 43,200 minutes
0.1% of 43,200 = 43.2 minutes

Error Budget: 43.2 minutes/month ✅
What This Means:
You have 43.2 minutes of downtime to "spend" each month

Week 1: 10 minutes downtime
  Remaining: 33.2 minutes (77%)

Week 2: 20 minutes downtime
  Total used: 30 minutes
  Remaining: 13.2 minutes (31%)

Week 3: 15 minutes downtime
  Total used: 45 minutes
  Remaining: -1.8 minutes ❌

STATUS: OVER BUDGET! Stop deploying! ❌
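
The same arithmetic in a quick C# sketch, which also shows how fast the budget shrinks as you add nines (30-day month assumed):

```csharp
using System;

// Allowed downtime ("error budget") per 30-day month for a given availability SLO.
const double minutesPerMonth = 30 * 24 * 60;   // 43,200

foreach (double slo in new[] { 99.0, 99.9, 99.99, 99.999 })
{
    double budgetMinutes = minutesPerMonth * (100 - slo) / 100;
    Console.WriteLine($"{slo,7}% SLO -> {budgetMinutes,6:F1} minutes/month of error budget");
}
// 99%: 432.0   99.9%: 43.2   99.99%: 4.3   99.999%: 0.4
```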

Error Budgets as “Innovation Currency”

Think of error budget like a bank account: High Balance = Take Risks
Error Budget: 43.2 minutes/month
Used so far: 5 minutes (12%)
Remaining: 38.2 minutes (88%)

Decision: ✅ Ship new feature Friday!
Why? We have plenty of budget left
If it breaks, we can afford 30 min of downtime
Low Balance = Play it Safe
Error Budget: 43.2 minutes/month
Used so far: 40 minutes (93%)
Remaining: 3.2 minutes (7%)

Decision: ❌ STOP all deployments!
Why? We're almost out of budget
Focus on stability, not new features
Real-World Example: Netflix
Netflix SLO: 99.99% uptime
Error Budget: 4.3 minutes/month

Policy:
- Budget > 50% → Deploy multiple times per day ✅
- Budget 25-50% → Deploy once per day with extra testing ⚠️
- Budget < 25% → FREEZE deployments, fix reliability 🚨

Result:
- Balances innovation (new features) with stability (uptime)
- Engineers don't argue about "should we ship?" → Check error budget!
- Data-driven decisions, not politics

Common Mistakes in SRE (Learn from Others)

Mistake #1: Setting SLOs Too High (Perfectionism)

The Trap:
Engineer: "Let's aim for 99.999% uptime!" (26 seconds downtime/month)

Cost to achieve 99.999%:
- Multi-region active-active: $50,000/month
- Chaos engineering tools: $5,000/month
- SRE team (5 people): $75,000/month
Total: $130,000/month

Revenue: $10,000/month ❌
Lost money: $120,000/month ❌
The Fix: Match SLO to business value
Internal tool used by 10 employees:
- SLO: 99% (7 hours downtime/month) ✅
- Cost: $500/month
- Why: Employees can wait, not generating revenue

E-commerce site with $1M/month revenue:
- SLO: 99.99% (4 minutes downtime/month) ✅
- Cost: $10,000/month
- Why: $1M ÷ 43,200 min/month = $23/minute lost revenue
- ROI: $10k cost prevents $100k+ losses ✅
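
As a back-of-the-envelope check you can run yourself, here is a minimal C# sketch of the same "match the SLO to the money" reasoning. All figures are illustrative and follow the example above, not benchmarks:

```csharp
using System;

// Illustrative only: compare revenue at risk per minute of downtime
// with the monthly cost of the reliability tier that prevents it.
double monthlyRevenue = 1_000_000;                  // $1M/month e-commerce site
double minutesPerMonth = 30 * 24 * 60;              // 43,200
double revenuePerMinute = monthlyRevenue / minutesPerMonth;           // ≈ $23/minute

double sloTarget = 99.99;                           // chosen tier
double allowedDowntime = minutesPerMonth * (100 - sloTarget) / 100;   // ≈ 4.3 min/month

double reliabilityCost = 10_000;                    // monthly cost of hitting that tier
Console.WriteLine($"Revenue at risk: ${revenuePerMinute:F0} per minute of downtime");
Console.WriteLine($"Allowed downtime at {sloTarget}%: {allowedDowntime:F1} min/month");
Console.WriteLine($"Break-even: ${reliabilityCost:N0}/month pays for itself after " +
                  $"{reliabilityCost / revenuePerMinute:F0} prevented minutes of downtime");
```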

Mistake #2: No Error Budget Policy (Chaos Deploys)

The Trap:
Month 1: 40 minutes downtime (over budget)
→ Team: "Let's ship anyway!" ❌
→ Month 2: 60 minutes downtime (way over budget)
→ CEO: "Why is our site always down?!" ❌
The Fix: Enforce error budget policy
Error Budget Policy:
├─ Budget > 50% → Normal operations (deploy freely)
├─ Budget 25-50% → Caution mode (extra testing required)
└─ Budget < 25% → FREEZE (only emergency fixes)

Month 1: 40 min downtime → Over budget
→ FREEZE enforced ✅
→ Team fixes reliability issues
→ Month 2: Only 5 min downtime
→ Budget restored, resume normal operations ✅

Mistake #3: SLO Based on System Metrics (Not User Experience)

The Trap:
Bad SLO:
- "CPU < 80%" ❌
- "Database connections < 100" ❌
- "Server is running" ❌

Problem: Server can be "running" but users see errors!
The Fix: SLOs based on user experience
Good SLO:
- "95% of page loads < 2 seconds" ✅ (user-facing)
- "99.9% of API calls succeed" ✅ (user-facing)
- "99% of search results in < 500ms" ✅ (user-facing)

Why better? Measures what users actually experience

Mistake #4: Too Many SLOs (Analysis Paralysis)

The Trap:
Team defines 50 SLOs:
- API latency p50, p75, p90, p95, p99, p99.9
- Error rate by endpoint (20 endpoints)
- Database query time by query type (30 types)

Result: Nobody knows which metrics matter ❌
Meetings spent arguing about SLO violations ❌
The Fix: Start with 3-5 critical SLOs
E-Commerce Site (5 SLOs):
1. Availability: 99.9% (most important)
2. p95 page load: < 2 seconds
3. p95 API latency: < 300ms
4. Payment error rate: < 0.01%
5. Search results: < 500ms (p99)

Rule: If it's not in top 5, don't make it an SLO

[!TIP] Jargon Alert: SLI vs SLO vs SLA
  • SLI (Indicator): What you measure (e.g., latency = 250ms)
  • SLO (Objective): Your internal goal (e.g., p95 latency < 300ms)
  • SLA (Agreement): Your customer promise (e.g., 99.9% uptime or money back)
SLO should be tighter than SLA to give you a buffer!
[!WARNING] Gotcha: 99.9% vs 99.99% Availability
The difference is not a mere 0.09 percentage points; it is 10x the allowed downtime.
  • 99.9% = 43 minutes/month downtime
  • 99.99% = 4.3 minutes/month downtime
Achieving 99.99% can cost 10x more. Choose based on business impact, not ego.

1. The SRE Mindset

SRE = What happens when you ask a software engineer to design operations.

Core SRE Principles

Embrace Risk

100% reliability is impossible and wasteful. Find the right balance between reliability and velocity.

Service Level Objectives

Define clear, measurable reliability targets. Make data-driven decisions.

Eliminate Toil

Automate repetitive manual work. Toil is the enemy of scaling.

Monitor Everything

You can’t improve what you don’t measure. Observability is mandatory.

Error Budgets

Use remaining error budget to balance innovation and stability.

Blameless Postmortems

Learn from failures without blaming individuals. Focus on systemic improvements.

2. Service Level Indicators (SLIs)

SLIs are carefully chosen metrics that represent the health of your service from the user’s perspective.

The Golden Signals (Google SRE)

Google’s four golden signals are latency, traffic, errors, and saturation. The example below measures the first one, latency: how long does a request take?
// Azure Application Insights - Latency SLI
requests
| where timestamp > ago(24h)
| where success == true
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99)
    by bin(timestamp, 5m), name
| render timechart
Good SLI:
  • p95 latency < 300ms (fast)
  • p99 latency < 1000ms (acceptable)
Why percentiles?
  • Average hides outliers (1ms + 10,000ms = 5,000ms average, useless!)
  • p95 = 95% of users have this experience or better
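
To make the percentile point concrete, here is a small standalone C# sketch (not tied to Application Insights; the nearest-rank method is one common convention) showing how the average gets distorted by outliers while p50/p95/p99 describe what users actually see:

```csharp
using System;
using System.Linq;

// 98 "normal" requests (100-300 ms) plus two 10,000 ms outliers.
var rng = new Random(42);
double[] durations = Enumerable.Range(0, 98)
    .Select(_ => 100.0 + rng.Next(200))
    .Concat(new[] { 10_000.0, 10_000.0 })
    .ToArray();

double[] sorted = durations.OrderBy(d => d).ToArray();

// Nearest-rank percentile: the value at or above the requested percentile.
double Percentile(double p) => sorted[(int)Math.Ceiling(p / 100.0 * sorted.Length) - 1];

Console.WriteLine($"avg = {durations.Average():F0} ms");   // inflated by the two outliers
Console.WriteLine($"p50 = {Percentile(50):F0} ms");        // typical user experience
Console.WriteLine($"p95 = {Percentile(95):F0} ms");        // still in the normal range
Console.WriteLine($"p99 = {Percentile(99):F0} ms");        // catches the slow tail (10,000 ms)
```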

Choosing Good SLIs

Bad SLI Examples:
❌ "Server is up" (doesn't reflect user experience)
❌ "Average latency" (hides outliers)
❌ "Internal queue depth" (users don't care)
Good SLI Examples:
✅ "95% of API requests complete in < 300ms"
✅ "99.9% of requests return 2xx or 3xx"
✅ "Page load completes in < 2 seconds (p95)"
✅ "Search results returned in < 500ms (p99)"

3. Service Level Objectives (SLOs)

SLOs are internal reliability targets. They should be slightly tighter than your SLA.

SLO Examples

E-Commerce Website - SLIs & SLOs:
Latency:
- SLI: p95 page load time
- SLO: < 2 seconds
- Why: Studies show 2s+ loads = 20% abandonment

Availability:
- SLI: Successful checkout rate
- SLO: 99.9% (43 min downtime/month)
- Why: Every minute down = $10K revenue lost

Errors:
- SLI: Payment processing error rate
- SLO: < 0.01%
- Why: Payment errors damage trust
Public API - SLIs & SLOs:
Latency:
- SLI: p95 API response time
- SLO: < 100ms
- Why: Third-party integrations need fast responses

Availability:
- SLI: Successful API call rate
- SLO: 99.95% (21 min downtime/month)
- Why: Higher than web app (upstream dependency)

Throughput:
- SLI: Requests per second handled
- SLO: 10,000 RPS sustained
- Why: Contract guarantees 5,000 RPS average
Data Pipeline - SLIs & SLOs:
Freshness:
- SLI: Time from data arrival to processing completion
- SLO: < 5 minutes (p95)
- Why: Near real-time reporting requirement

Correctness:
- SLI: Percentage of jobs completing without errors
- SLO: 99.99%
- Why: Data quality is critical

Throughput:
- SLI: Events processed per hour
- SLO: 100 million events/hour
- Why: Peak load during business hours

4. Error Budgets

Error Budget = 100% - SLO. If your SLO is 99.9% availability, your error budget is 0.1% = 43.2 minutes/month.

Error Budget Policy

Example Decision Making:
Month: January
SLO: 99.9% availability (error budget = 0.1% = 43.2 minutes)

Week 1: 5 minutes downtime
  Remaining budget: 38.2 minutes (88%)
  ✅ Status: GREEN - ship new features

Week 2: 15 minutes downtime (cumulative: 20 min)
  Remaining budget: 23.2 minutes (54%)
  ✅ Status: GREEN - continue

Week 3: 20 minutes downtime (cumulative: 40 min)
  Remaining budget: 3.2 minutes (7%)
  🚨 Status: RED - STOP feature releases
  Actions:
    - Cancel Friday deployment
    - Focus on reliability fixes
    - Root cause analysis of incidents
    - Increase monitoring

Week 4: 0 minutes downtime (cumulative: 40 min)
  Remaining budget: 3.2 minutes (7%)
  ⚠️  Status: YELLOW - Careful releases only
  Actions:
    - Small, low-risk changes only
    - Extended bake time
    - Manual approval required
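
A minimal sketch of how this policy could be encoded as a deployment gate. The thresholds mirror the GREEN/YELLOW/RED bands above; the function itself is illustrative, not a standard API:

```csharp
using System;

// Illustrative deployment gate based on remaining error budget.
static string Evaluate(double sloPercent, double downtimeMinutesSoFar, int daysInMonth = 30)
{
    double totalMinutes = daysInMonth * 24 * 60;                       // 43,200 for a 30-day month
    double budgetMinutes = totalMinutes * (100 - sloPercent) / 100;    // 43.2 min at 99.9%
    double remainingFraction = (budgetMinutes - downtimeMinutesSoFar) / budgetMinutes;

    if (remainingFraction > 0.5) return "GREEN - ship freely";
    if (remainingFraction > 0.1) return "YELLOW - low-risk releases only";
    return "RED - freeze feature deployments";
}

Console.WriteLine(Evaluate(99.9, downtimeMinutesSoFar: 20));  // GREEN (~54% of budget left)
Console.WriteLine(Evaluate(99.9, downtimeMinutesSoFar: 40));  // RED   (~7% of budget left)
```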

Calculating Error Budget Burn Rate

// Error budget burn rate (Azure Monitor)
let slo = 99.9; // 99.9% availability SLO
let errorBudget = 100 - slo; // 0.1%
let timeWindow = 30d; // Monthly budget

requests
| where timestamp > ago(timeWindow)
| summarize
    total = count(),
    failed = countif(success == false)
| extend
    actualAvailability = ((total - failed) * 100.0) / total,
    budgetUsed = 100 - actualAvailability,
    budgetRemaining = errorBudget - (100 - actualAvailability),
    burnRate = (100 - actualAvailability) / errorBudget // 1.0 = burning at expected rate
| project
    SLO = slo,
    ActualAvailability = round(actualAvailability, 2),
    ErrorBudget = errorBudget,
    BudgetUsed = round(budgetUsed, 4),
    BudgetRemaining = round(budgetRemaining, 4),
    BurnRate = round(burnRate, 2),
    Status = case(
        burnRate < 0.5, "GREEN - Under budget",
        burnRate < 1.0, "YELLOW - On track",
        burnRate < 2.0, "ORANGE - Over budget",
        "RED - Critical"
    )

5. Toil Reduction

Toil = Repetitive, manual, automatable work that scales linearly with service growth.

What is Toil?

✅ Manually restarting failed services
✅ Provisioning new servers by hand
✅ Copy-pasting database queries for reports
✅ Manually scaling resources up/down
✅ SSH into servers to check logs
✅ Running the same kubectl commands daily
✅ Manually updating configuration files
✅ Responding to the same alerts with same fix
Characteristics:
  • Manual
  • Repetitive
  • Automatable
  • Tactical (no long-term value)
  • Scales linearly with service size

Toil Elimination Strategy

Step 1: Measure Toil

Track time spent on repetitive tasks
Weekly time audit:
- Manual deployments: 8 hours
- Alert triage (false positives): 5 hours
- Log searching: 4 hours
- Manual scaling: 3 hours
- Config updates: 2 hours

Total toil: 22 hours/week (55% of time!)
Step 2: Prioritize by ROI

Automate high-frequency, high-time tasks first
| Task | Frequency | Time | Total/Week | ROI |
|---|---|---|---|---|
| Manual deployments | 10/week | 0.5h | 5h | High |
| False alert triage | 50/week | 0.1h | 5h | High |
| Manual scaling | 20/week | 0.15h | 3h | Medium |
| Config updates | 5/week | 0.4h | 2h | Low |
Step 3: Automate

Examples:
# Before: Manual deployment (30 min)
ssh server
sudo systemctl stop app
git pull
dotnet publish
sudo systemctl start app

# After: CI/CD pipeline (2 min, no human)
git push
# GitHub Actions deploys automatically

# Before: Manual scaling (20 min)
# Check metrics, decide, update config, restart

# After: Autoscaling (0 min, automatic)
az monitor autoscale create \
  --resource-group rg-prod \
  --resource webapp \
  --min-count 2 \
  --max-count 10 \
  --count 2

# Before: Searching logs (15 min per incident)
ssh server
tail -f /var/log/app.log | grep ERROR

# After: Centralized logging (30 seconds)
# KQL query in Application Insights
traces | where severityLevel == 3 | take 100
Step 4: Measure Improvement

Track toil reduction over time
Q1: 55% time spent on toil
Q2: 40% (automated deployments)
Q3: 25% (added autoscaling)
Q4: 15% (improved alerting)

Time reclaimed: 20 hours/week
Used for: Feature development, reliability improvements

6. Chaos Engineering

Chaos Engineering = Intentionally breaking things in production to build confidence.

Chaos Principles

  1. Build a hypothesis (e.g., “If we kill 30% of pods, requests should still succeed”)
  2. Define steady state (e.g., “Error rate < 1%, latency p95 < 500ms”)
  3. Introduce chaos (e.g., kill pods, inject latency, fail database)
  4. Measure deviation (Did error rate spike? Did latency increase?)
  5. Learn and improve (Add retries? Implement circuit breaker?)

Azure Chaos Studio

Setup:
# Prepare the AKS cluster for Chaos Studio's Chaos Mesh faults:
# install Chaos Mesh on the cluster, then enable the cluster as a
# Chaos Studio target (portal or ARM; see the Chaos Studio docs)
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing --create-namespace

# Create chaos experiment
az chaos experiment create \
  --name "pod-failure-experiment" \
  --resource-group rg-chaos \
  --location eastus \
  --identity SystemAssigned \
  --steps '[
    {
      "name": "Kill Random Pods",
      "branches": [
        {
          "name": "Kill 30% of pods",
          "actions": [
            {
              "type": "continuous",
              "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1",
              "duration": "PT5M",
              "parameters": [
                {
                  "key": "jsonSpec",
                  "value": "{\"action\":\"pod-kill\",\"mode\":\"fixed-percent\",\"value\":\"30\"}"
                }
              ],
              "selectorId": "aks-cluster-target"
            }
          ]
        }
      ]
    }
  ]'

# Run experiment
az chaos experiment start \
  --name "pod-failure-experiment" \
  --resource-group rg-chaos

Chaos Scenarios

Hypothesis: Application survives 30% pod failure
Test:
# Using Chaos Mesh
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-experiment
spec:
  action: pod-kill
  mode: fixed-percent
  value: "30"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-api
  scheduler:
    cron: "@every 1h"
Expected: No user-visible errors (replicas take over)
Actual: ?
Learnings: Need faster readiness probes? Increase replica count?
Hypothesis: Application handles 500ms network latency
Test:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-api
  delay:
    latency: "500ms"
    correlation: "100"
  duration: "5m"
Expected: Requests time out gracefully, circuit breaker opens
Actual: ?
Learnings: Add timeout? Implement retry with backoff?
Hypothesis: Autoscaling kicks in under CPU load
Test:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: worker
  stressors:
    cpu:
      workers: 4
      load: 100
  duration: "10m"
Expected: HPA scales pods from 3 → 10 within 2 minutes
Actual: ?
Learnings: Adjust HPA thresholds? Add more headroom?
Hypothesis: App survives database failover
Test:
# Trigger Azure SQL failover
az sql db failover \
  --resource-group rg-prod \
  --server sql-prod \
  --database mydb

# Expected failover time: 30-60 seconds
Expected: Brief spike in errors (< 1 minute), then recovery
Actual: ?
Learnings: Does the connection pool handle failover? Need retry logic?
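
One common answer to the "need retry logic?" question is wrapping database calls in a transient-fault retry with exponential backoff. A minimal hand-rolled sketch, assuming the Microsoft.Data.SqlClient package (in practice you might instead use a library such as Polly or EF Core's built-in retry strategy):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;   // assumed NuGet package: Microsoft.Data.SqlClient

// Minimal transient-fault retry with exponential backoff. Production code would
// also inspect SqlException.Number and retry only known-transient errors.
public static class ResilientSql
{
    public static async Task<int> ExecuteWithRetryAsync(
        string connectionString, string sql, int maxAttempts = 7)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await using var conn = new SqlConnection(connectionString);
                await conn.OpenAsync();
                await using var cmd = new SqlCommand(sql, conn);
                return await cmd.ExecuteNonQueryAsync();
            }
            catch (SqlException) when (attempt < maxAttempts)
            {
                // Back off 1s, 2s, 4s, ... (~63s total across 7 attempts),
                // enough to ride out the 30-60 second failover window above.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}
```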

GameDay Exercise

Scenario: Black Friday Simulation
Timeline:
09:00 - Start load test (5,000 RPS)
09:15 - Kill 50% of web pods
09:20 - Inject 1s latency to payment service
09:25 - Trigger database failover
09:30 - Simulate CDN failure (disable Azure Front Door)
09:35 - End test

Metrics to Track:
- Error rate (target: < 1%)
- p95 latency (target: < 2s)
- Successful checkouts (target: > 99%)
- Revenue lost (target: < $1,000)

Team Roles:
- Incident Commander
- Operations (executes fixes)
- Communications (updates stakeholders)
- Observers (document learnings)

Post-GameDay:
- Blameless postmortem
- Action items with owners
- Update runbooks

7. Incident Management

Incident Response Lifecycle

Detect → Triage (assign severity and an Incident Commander) → Mitigate → Resolve → Learn (blameless postmortem)

Severity Levels

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Critical - Revenue loss | Page immediately | Payment system down, database offline |
| SEV2 | Major - Degraded service | 15 minutes | API latency 10x higher, 10% error rate |
| SEV3 | Minor - Partial impact | 1 hour | Single pod failing, non-critical feature down |
| SEV4 | Cosmetic - No user impact | Next business day | UI typo, log noise |

Blameless Postmortem Template

# Incident Postmortem: Payment Service Outage

**Date**: 2026-01-20
**Duration**: 47 minutes
**Severity**: SEV1
**Impact**: $125K revenue lost, 15K failed transactions

## Timeline

09:42 - Alert fired: Payment API error rate 50%
09:43 - On-call engineer acknowledged
09:45 - Incident Commander assigned
09:47 - Root cause identified: Database connection pool exhausted
09:50 - Mitigation: Restarted API pods
09:55 - Error rate dropped to 5%
10:00 - Mitigation: Increased connection pool size
10:15 - Error rate back to normal (< 0.1%)
10:29 - Incident resolved

## Root Cause

Database connection pool configured for 100 connections.
Traffic spike (Black Friday sale) increased to 5,000 RPS.
Each request held connection for 500ms average.
Required connections: 5,000 * 0.5 = 2,500.

Result: Connection pool exhausted → requests failed.

## What Went Well

✅ Alert fired within 30 seconds
✅ Clear escalation path
✅ Team mobilized quickly
✅ Mitigation identified in < 5 minutes

## What Went Wrong

❌ No capacity planning for Black Friday
❌ Connection pool size not monitored
❌ No autoscaling for connection pool
❌ Runbook didn't cover this scenario

## Action Items

1. [P0] Increase connection pool to 5,000 (@alice, Due: Jan 21)
2. [P0] Add connection pool saturation alert (@bob, Due: Jan 21)
3. [P1] Implement autoscaling for connection pool (@charlie, Due: Jan 25)
4. [P1] Update runbook with connection pool troubleshooting (@david, Due: Jan 27)
5. [P2] Load test before major events (@team, Due: Feb 1)

## Lessons Learned

- Monitor resource saturation, not just errors
- Load test before high-traffic events
- Connection pools are a finite resource
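
Following up on action item 1, the pool limit usually lives in the connection string. A minimal sketch using SqlConnectionStringBuilder; server name, auth mode, and numbers are illustrative, not recommendations:

```csharp
using Microsoft.Data.SqlClient;   // assumed NuGet package: Microsoft.Data.SqlClient

// Sketch of the postmortem fix: size the pool for peak concurrency and keep the
// setting in one visible place. Peak 5,000 RPS x 0.5s average hold time needs
// ~2,500 connections; 5,000 gives roughly 2x headroom.
var builder = new SqlConnectionStringBuilder
{
    DataSource = "sql-prod.database.windows.net",
    InitialCatalog = "mydb",
    Authentication = SqlAuthenticationMethod.ActiveDirectoryManagedIdentity,
    MinPoolSize = 100,
    MaxPoolSize = 5000,
    ConnectTimeout = 15
};

string connectionString = builder.ConnectionString;
```

Pair this with the pool saturation alert from action item 2 so the limit is never discovered the hard way again.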

8. On-Call Best Practices

On-Call Rotation

Team of 6 engineers
Rotation: 1 week on-call, 5 weeks off

Primary On-Call:
- Responds to all alerts
- Pages: 5-10 per week expected
- Compensation: $500/week stipend

Secondary On-Call:
- Backup if primary unreachable
- Takes over for escalations
- Compensation: $250/week stipend

Schedule:
Week 1: Alice (primary), Bob (secondary)
Week 2: Charlie (primary), David (secondary)
Week 3: Eve (primary), Frank (secondary)
...

On-Call Expectations

Response Times:
  • SEV1: 5 minutes
  • SEV2: 15 minutes
  • SEV3: 1 hour
Escalation Path:
1. Primary on-call (5 min)
2. Secondary on-call (10 min)
3. Engineering Manager (15 min)
4. VP Engineering (20 min)
Handoff:
## On-Call Handoff: Alice → Charlie

**Open Incidents**:
- INC-1234: Database latency spikes (SEV3, investigating)

**Ongoing Issues**:
- Redis cache hit rate dropped to 60% (normal: 90%)
- Monitoring this, no action needed yet

**Upcoming Maintenance**:
- SQL failover test scheduled for Wednesday 2am

**Tips**:
- If you see "connection timeout" errors, restart worker pods
- Database runbook: https://wiki.company.com/db-runbook
- Slack #oncall-help if you need guidance

9. Load Testing & Performance Testing

Testing your system under load is critical for understanding scalability limits and preventing outages.

Load Testing vs Performance Testing

| Type | Purpose | When | Tools |
|---|---|---|---|
| Load Testing | Verify system handles expected load | Before launch, before traffic spikes | k6, JMeter, Azure Load Testing |
| Stress Testing | Find breaking point | Capacity planning | k6, Locust |
| Spike Testing | Handle sudden traffic bursts | Black Friday prep | k6, Gatling |
| Soak Testing | Detect memory leaks over time | After deployments | k6 (24+ hours) |
| Performance Testing | Measure response times | Every release | Azure Load Testing |
(Figure: typical k6 load test stages - ramp up, steady load, ramp down)

Azure Load Testing

Azure Load Testing is a fully managed service for generating high-scale load.

Create Load Test

# Create Azure Load Testing resource
az load create \
  --name myloadtest \
  --resource-group rg-prod \
  --location eastus

# Upload JMeter test plan
az load test create \
  --load-test-resource myloadtest \
  --test-id checkout-load-test \
  --display-name "Checkout API Load Test" \
  --description "Test 1000 concurrent users" \
  --test-plan checkout-test.jmx \
  --engine-instances 10

JMeter Test Plan Example

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2">
  <hashTree>
    <TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="Checkout Load Test">
      <elementProp name="TestPlan.user_defined_variables" elementType="Arguments">
        <collectionProp name="Arguments.arguments">
          <elementProp name="TARGET_URL" elementType="Argument">
            <stringProp name="Argument.name">TARGET_URL</stringProp>
            <stringProp name="Argument.value">${__P(TARGET_URL,https://api.example.com)}</stringProp>
          </elementProp>
        </collectionProp>
      </elementProp>
    </TestPlan>
    <hashTree>
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Users">
        <intProp name="ThreadGroup.num_threads">1000</intProp>
        <intProp name="ThreadGroup.ramp_time">60</intProp>
        <longProp name="ThreadGroup.duration">600</longProp>
      </ThreadGroup>
      <hashTree>
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="POST /checkout">
          <stringProp name="HTTPSampler.domain">${TARGET_URL}</stringProp>
          <stringProp name="HTTPSampler.path">/api/checkout</stringProp>
          <stringProp name="HTTPSampler.method">POST</stringProp>
          <boolProp name="HTTPSampler.auto_redirects">true</boolProp>
          <boolProp name="HTTPSampler.use_keepalive">true</boolProp>
        </HTTPSamplerProxy>
      </hashTree>
    </hashTree>
  </hashTree>
</jmeterTestPlan>

k6 Load Testing

k6 is a modern, developer-friendly load testing tool.

Basic Load Test

// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 users
    { duration: '5m', target: 100 },   // Stay at 100 users
    { duration: '2m', target: 200 },   // Spike to 200 users
    { duration: '5m', target: 200 },   // Stay at 200 users
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],   // 95% of requests < 500ms
    http_req_failed: ['rate<0.01'],     // Error rate < 1%
  },
};

export default function () {
  const payload = JSON.stringify({
    userId: `user${__VU}`,
    cartId: `cart${__VU}`,
    items: [{ productId: 'prod123', quantity: 2 }],
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.API_TOKEN}`,
    },
  };

  const res = http.post('https://api.example.com/checkout', payload, params);

  check(res, {
    'status is 200': (r) => r.status === 200,
    'checkout successful': (r) => r.json('success') === true,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1);  // Think time: 1 second between requests
}

Run k6 Test

# Local test
k6 run load-test.js

# Cloud test (distributed load)
k6 cloud load-test.js

# With environment variables
k6 run -e API_TOKEN=$TOKEN load-test.js

# Output to InfluxDB + Grafana
k6 run --out influxdb=http://localhost:8086/k6 load-test.js

Performance Testing Scenarios

Scenario 1: Black Friday Load Test

// black-friday.js
export const options = {
  scenarios: {
    // Normal traffic (1000 RPS)
    normal_traffic: {
      executor: 'constant-arrival-rate',
      rate: 1000,
      timeUnit: '1s',
      duration: '1h',
      preAllocatedVUs: 500,
      maxVUs: 2000,
    },
    // Black Friday spike (10x traffic for 30 min)
    black_friday_spike: {
      executor: 'constant-arrival-rate',
      rate: 10000,
      timeUnit: '1s',
      duration: '30m',
      startTime: '1h',  // Start after normal traffic
      preAllocatedVUs: 5000,
      maxVUs: 20000,
    },
  },
  thresholds: {
    'http_req_duration{scenario:normal_traffic}': ['p(99)<1000'],
    'http_req_duration{scenario:black_friday_spike}': ['p(99)<2000'],
    'http_req_failed': ['rate<0.05'],  // 5% error budget during spike
  },
};

Scenario 2: Soak Test (Memory Leak Detection)

// soak-test.js
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },    // Ramp up
    { duration: '24h', target: 100 },   // Run for 24 hours
    { duration: '5m', target: 0 },      // Ramp down
  ],
};

export default function () {
  http.get('https://api.example.com/products');
  sleep(3);  // 3 seconds think time
}
What to Monitor:
  • Memory usage (should be flat, not increasing)
  • Connection pool size (should stabilize)
  • Database connections (no leaks)
  • HTTP response times (should not degrade)

Azure Application Insights Performance Testing

Custom Telemetry for Load Tests

// Program.cs (minimal hosting model)
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false;  // Disable during load tests
    options.DeveloperMode = false;
});

// LoadTestController.cs
[HttpGet("health")]
public IActionResult Health()
{
    var telemetry = new TelemetryClient();

    // Track custom metric during load test
    telemetry.TrackMetric("DatabaseConnectionPoolSize",
        GetConnectionPoolSize());

    telemetry.TrackMetric("MemoryUsageMB",
        GC.GetTotalMemory(false) / 1024 / 1024);

    return Ok(new { status = "healthy" });
}

KQL Query: Analyze Load Test Results

// Performance during load test
requests
| where timestamp between(datetime(2026-01-21 10:00) .. datetime(2026-01-21 11:00))
| summarize
    P50 = percentile(duration, 50),
    P95 = percentile(duration, 95),
    P99 = percentile(duration, 99),
    RequestRate = count() / 60.0,  // requests per second (1-minute bins)
    FailureRate = countif(success == false) * 100.0 / count()
    by bin(timestamp, 1m)
| project timestamp, P50, P95, P99, RequestRate, FailureRate
| render timechart

Performance Testing Best Practices

1. Realistic Test Data

// Bad: Same user every time
const userId = 'user123';

// Good: Random users from pool
const users = ['user1', 'user2', 'user3', ...];  // 10,000 users
const userId = users[Math.floor(Math.random() * users.length)];

// Better: Unique user per VU
const userId = `user${__VU}`;

2. Think Time (Realistic User Behavior)

export default function () {
  // User lands on homepage
  http.get('https://example.com/');
  sleep(2);  // User reads content

  // User searches for product
  http.get('https://example.com/search?q=laptop');
  sleep(3);  // User browses results

  // User clicks product
  http.get('https://example.com/products/laptop-123');
  sleep(5);  // User reads reviews

  // User adds to cart
  http.post('https://example.com/cart/add', { productId: 'laptop-123' });
  sleep(1);
}

3. Connection Pooling

export const options = {
  batch: 20,  // Send 20 requests in parallel per VU
  batchPerHost: 6,  // Max 6 parallel connections per host (HTTP/1.1)
};

4. Gradual Ramp-Up

// Bad: Instant spike (thundering herd)
export const options = {
  vus: 1000,
  duration: '10m',
};

// Good: Gradual ramp-up
export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '2m', target: 500 },
    { duration: '2m', target: 1000 },
    { duration: '10m', target: 1000 },
  ],
};

Interpreting Load Test Results

Metrics to Track

| Metric | Good | Warning | Critical |
|---|---|---|---|
| P95 Latency | <500ms | 500-1000ms | >1000ms |
| Error Rate | <0.1% | 0.1-1% | >1% |
| Throughput | Meets SLO | 80% of SLO | <80% of SLO |
| CPU Usage | <70% | 70-85% | >85% |
| Memory Usage | Stable | Slow increase | Rapid increase |

Bottleneck Identification

// Find slowest dependencies
dependencies
| where timestamp > ago(1h)
| summarize
    AvgDuration = avg(duration),
    P95Duration = percentile(duration, 95),
    Count = count()
    by target
| order by P95Duration desc
| take 10
Common Bottlenecks:
  1. Database queries (most common) - Add indexes, optimize queries
  2. External API calls - Add caching, circuit breakers
  3. CPU-intensive operations - Move to background jobs
  4. Network latency - Use CDN, co-locate services
  5. Memory allocation - Reduce allocations, use object pooling

Load Testing Checklist

Before running production load tests:
  • Test environment matches production (same VM SKU, database tier, autoscaling rules)
  • Disable sampling in Application Insights (100% telemetry during tests)
  • Notify team (Slack #on-call: “Load test starting at 10 AM”)
  • Monitor dashboards open (Grafana, Application Insights Live Metrics)
  • Rollback plan ready (know how to stop the test immediately)
  • Test data prepared (realistic user IDs, product IDs, payment methods)
  • Success criteria defined (P95 <500ms, error rate <1%, no memory leaks)
  • Baseline captured (run test before changes, compare after)
[!WARNING] Gotcha: Testing in Production Load testing in production is risky but sometimes necessary. If you must:
  • Use synthetic users (test-user-1, test-user-2) to isolate test data
  • Route test traffic to a separate backend pool (blue-green)
  • Start with 5% of production load, increase gradually
  • Have instant rollback ready (kill switch)
  • Never test payment processing or critical user actions in production

10. Production Readiness Checklist

Before launching a new service to production:
✅ Observability
  ✅ Metrics instrumented (Golden Signals)
  ✅ Logs centralized (Application Insights / Log Analytics)
  ✅ Distributed tracing enabled
  ✅ Dashboards created
  ✅ Alerts configured with runbooks

✅ Reliability
  ✅ SLOs defined and measured
  ✅ Error budget policy in place
  ✅ Circuit breakers implemented
  ✅ Retries with exponential backoff
  ✅ Timeouts configured
  ✅ Health checks (liveness & readiness)

✅ Scalability
  ✅ Load tested at 2x peak traffic
  ✅ Autoscaling configured
  ✅ Database connection pooling
  ✅ Caching strategy
  ✅ Rate limiting for external APIs

✅ Security
  ✅ Secrets in Key Vault (not in code)
  ✅ Managed Identity for authentication
  ✅ Network isolation (Private Endpoints)
  ✅ WAF enabled
  ✅ Security scanning in CI/CD

✅ Operational
  ✅ Runbooks documented
  ✅ On-call rotation defined
  ✅ Disaster recovery tested
  ✅ Backup and restore verified
  ✅ Chaos engineering experiments run

✅ Cost
  ✅ Cost estimation completed
  ✅ Budgets and alerts set
  ✅ Autoscaling to reduce waste
  ✅ Reserved Instances for steady load

11. Interview Questions

Beginner Level

Question: What is the difference between SLI, SLO, and SLA?

Answer:

SLI (Service Level Indicator):
  • What you measure
  • Example: “p95 latency = 250ms”
SLO (Service Level Objective):
  • Your internal target
  • Example: “p95 latency < 300ms”
  • Guides engineering decisions
SLA (Service Level Agreement):
  • Customer promise with consequences
  • Example: “99.9% uptime or 10% refund”
  • Legal/financial agreement
Relationship: you measure SLIs, set an SLO as your internal target, and promise a looser SLA externally (e.g., SLO 99.9% vs SLA 99.5%); the measured SLI should stay above the SLO.
Question: What is an error budget, and how do you calculate it?

Answer: Error budget = how much unreliability you can tolerate.

Calculation:
SLO: 99.9% availability
Error budget: 100% - 99.9% = 0.1%

Monthly: 30 days × 24 hours × 60 min = 43,200 min
Error budget: 43,200 × 0.001 = 43.2 minutes downtime allowed
Use:
  • Budget remaining → ship features fast
  • Budget exhausted → focus on reliability

Intermediate Level

Question: How do you measure and reduce toil?

Answer:

Measure:
  1. Time audit: Track manual, repetitive tasks
  2. Calculate percentage of time spent on toil
  3. Goal: Keep toil < 50% (ideally < 30%)
Reduce:
  1. Automate high-frequency tasks first
  2. Examples:
    • Manual deployments → CI/CD
    • Manual scaling → Autoscaling
    • Log searching → Centralized logging
  3. Measure improvement quarterly
Not Toil:
  • Novel problem solving
  • Architecture design
  • Incident response (new incidents)
Question: How would you design an on-call rotation for a team of six engineers?

Answer:

Rotation:
  • Primary on-call: 1 week
  • Secondary on-call: 1 week
  • Rotation: 6 weeks per person
Schedule:
Week 1: Alice (P), Bob (S)
Week 2: Charlie (P), David (S)
Week 3: Eve (P), Frank (S)
Week 4: Bob (P), Alice (S)
Week 5: David (P), Charlie (S)
Week 6: Frank (P), Eve (S)
Compensation:
  • Primary: $500/week
  • Secondary: $250/week
Handoff Process:
  • Written handoff document
  • 15-minute sync call
  • Slack thread with context

Advanced Level

Question: Design an error budget policy for a critical service.

Answer:
Service: E-Commerce Checkout
SLO: 99.95% availability (21.6 min downtime/month)

Error Budget Policy:

GREEN (> 80% budget remaining):
- Ship features daily
- Canary deployments (10% traffic)
- Automated rollouts
- Chaos experiments allowed

YELLOW (30-80% budget remaining):
- Ship features 2x per week
- Extended canary period (24 hours)
- Manual approval for risky changes
- Pause chaos experiments

ORANGE (10-30% budget remaining):
- Feature freeze (critical fixes only)
- Extended testing period
- Manual deployments
- Root cause analysis required

RED (< 10% budget remaining):
- FULL STOP on features
- Emergency fixes only
- Incident review with leadership
- Mandatory postmortems
- Focus: Pay down reliability debt

Reset: Monthly (first day of month)

Escalation:
- Budget hits YELLOW → Notify team
- Budget hits ORANGE → Notify manager
- Budget hits RED → Notify VP Engineering
Question: How would you introduce chaos engineering to an organization new to it?

Answer:
**Phase 1: Foundation (Month 1-2)**
- Set up Azure Chaos Studio
- Define blast radius (start with dev/staging)
- Create baseline metrics dashboard
- Get stakeholder buy-in

**Phase 2: Simple Experiments (Month 3-4)**
- Pod failures (10% → 30% → 50%)
- Network latency (100ms → 500ms → 1s)
- CPU stress (50% → 80% → 100%)
- Run in staging, measure impact

**Phase 3: Production Testing (Month 5-6)**
- Start with off-peak hours
- Small blast radius (single AZ)
- Gradual rollout to full production
- Always have kill switch

**Phase 4: GameDays (Month 7+)**
- Quarterly exercises
- Simulate real outages
- Test incident response
- Cross-team coordination

**Metrics to Track**:
- MTTR (Mean Time To Recovery)
- Blast radius (how many users affected)
- Learnings per experiment
- Action items completed

**Culture**:
- Blameless
- Learning-focused
- Celebrate finding weaknesses
- Share results company-wide

12. Key Takeaways

SLOs Drive Decisions

Define clear SLOs. Use error budgets to balance velocity and reliability.

Eliminate Toil

Automate everything repetitive. Spend time on engineering, not firefighting.

Break Things on Purpose

Chaos engineering builds confidence. Test in production before customers do.

Blameless Culture

Focus on systems, not people. Learn from failures without punishment.

Monitor What Matters

User experience > server metrics. Latency, traffic, errors, saturation.

Operational Excellence

Production readiness checklist before launch. On-call with clear expectations.

Next Steps

Back to Course Overview

Return to course overview or continue to the Capstone Project

Continue to Chapter 15

Apply everything you’ve learned in the enterprise e-commerce capstone