
SRE & Production Excellence

Learn how Google’s Site Reliability Engineering practices apply to Azure, from setting SLOs to running chaos experiments.

(Figure: the SRE pyramid)

What You’ll Learn

By the end of this chapter, you’ll understand:
  • What SRE is (and how it’s different from traditional operations)
  • How to measure reliability (SLIs, SLOs, SLAs explained from scratch)
  • Error budgets (and how they balance innovation vs stability)
  • Production best practices (deployments, rollbacks, monitoring)
  • Chaos engineering (intentionally breaking things to build resilience)
  • Real-world examples with actual costs and trade-offs

Introduction: What is Site Reliability Engineering (SRE)?

Start Here if You’re Completely New

SRE = treating operations as a software engineering problem.

Think of it like maintaining a car.

Traditional Operations (Old Way):
Car breaks down

Mechanic manually fixes it

Car breaks down again next month

Mechanic manually fixes it again

Repeat forever... ❌

Problems:
- Reactive (fixing problems after they happen)
- Manual work (mechanic required every time)
- No improvement (same problems repeat)
- Doesn't scale (need more mechanics as fleet grows)
Site Reliability Engineering (New Way):
Car sensor detects problem before failure

Automated system orders replacement part

Technician installs during scheduled maintenance

System analyzes failure pattern

Software patch prevents issue in all cars ✅

Benefits:
- Proactive (preventing problems before they happen)
- Automated (systems fix themselves)
- Continuous improvement (learn from every incident)
- Scales (same team manages 10x more systems)
SRE in Software:
  • Automation: Replace manual tasks with code
  • Monitoring: Detect problems before users notice
  • Reliability: Measure and improve system uptime
  • Balance: Ship features while maintaining stability

Why SRE Matters: The Cost of Unreliability

Real-World Failure Examples

Amazon Prime Day 2018
  • Problem: Site crashed for 63 minutes
  • Cost: $99 million in lost revenue
  • Per-minute cost: $1.57 million/minute
  • Cause: No error budget, no chaos testing
  • Prevention cost: ~$1 million (SRE team + tools)
  • ROI: 99x return on investment ✅
More Examples:
| Company | Incident | Downtime | Cost | SRE Practice Missing |
|---|---|---|---|---|
| Facebook | Oct 2021 | 6 hours | $60-100M | Change management + error budgets |
| GitHub | Oct 2018 | 24 hours | Unknown | Chaos engineering |
| Google Cloud | Jun 2019 | 4.5 hours | Unknown | Gradual rollouts |
| Cloudflare | Jul 2019 | 30 minutes | Unknown | Progressive deployments |
The Pattern: Good SRE practices cost millions. Bad SRE practices cost tens of millions.

Understanding SLI, SLO, SLA (From Absolute Zero)

These are the THREE most important acronyms in SRE. Let’s break them down with a real-world analogy:

The Pizza Delivery Analogy

SLI (Service Level Indicator) = What You Measure
Pizza Delivery Service:
- SLI #1: Delivery time (measured with GPS tracker)
  - Order at 7:00 PM → Delivered at 7:25 PM = 25 minutes

- SLI #2: Pizza temperature (measured with thermometer)
  - Temperature at delivery: 145°F

- SLI #3: Order accuracy (measured by customer)
  - Correct toppings: Yes / No

**SLI = The actual measurements**
SLO (Service Level Objective) = Your Internal Goal
Pizza Company's Goals (not told to customers):
- SLO #1: 95% of pizzas delivered in < 30 minutes
- SLO #2: 99% of pizzas arrive above 140°F
- SLO #3: 99.9% of orders are correct

**SLO = What you promise yourself to achieve**
SLA (Service Level Agreement) = Your Customer Promise
Pizza Company's Guarantee (public promise):
- SLA: "Pizza in 30 minutes or it's free!"

**SLA = What you promise customers (with consequences)**
The Relationship:
SLI (What you measure) → SLO (Your goal) → SLA (Customer promise)

Real Numbers:
- SLI: Last 100 pizzas averaged 28 minutes
- SLO: We want 95% delivered in < 30 min
- SLA: We promise 30 min or it's free

Buffer: the internal SLO (95% on time) limits how often you have to honor the SLA payout (a free pizza)
→ This gives you room for occasional failures ✅

Software Example: E-Commerce Website

SLI (Service Level Indicator) = Measurements
What we measure:
- API response time: 250ms (measured with Application Insights)
- Error rate: 0.05% (measured with logs)
- Uptime: 99.95% (measured with health checks)

These are FACTS (actual measurements)
SLO (Service Level Objective) = Internal Goals
Our internal targets:
- p95 latency < 300ms (95% of requests faster than 300ms)
- Error rate < 0.1% (99.9% success)
- Uptime > 99.9% (43 minutes downtime/month OK)

These guide our engineering decisions
SLA (Service Level Agreement) = Customer Promise
Our public guarantee:
- 99.5% uptime or we give you credits
- (We promise customers 99.5%, but aim for 99.9% internally)

Buffer: 99.9% (SLO) vs 99.5% (SLA) = 0.4% buffer ✅
Why the Buffer?
Month 1:
- Actual uptime: 99.8% ✅ (above SLO 99.9%? No, but above SLA 99.5%? Yes!)
- No customer refunds
- Internal: "We missed our SLO, let's improve"

Month 2:
- Actual uptime: 99.4% ❌ (below SLA 99.5%)
- Customer refunds triggered
- Internal: "Emergency! Stop feature releases!"

The buffer prevents small problems from costing money

Error Budgets Explained (From Scratch)

Error Budget = How much downtime you’re allowed

The Simple Math

SLO: 99.9% uptime

100% - 99.9% = 0.1% allowed downtime

0.1% of 30 days (1 month):
30 days = 43,200 minutes
0.1% of 43,200 = 43.2 minutes

Error Budget: 43.2 minutes/month ✅
What This Means:
You have 43.2 minutes of downtime to "spend" each month

Week 1: 10 minutes downtime
  Remaining: 33.2 minutes (77%)

Week 2: 20 minutes downtime
  Total used: 30 minutes
  Remaining: 13.2 minutes (31%)

Week 3: 15 minutes downtime
  Total used: 45 minutes
  Remaining: -1.8 minutes ❌

STATUS: OVER BUDGET! Stop deploying! ❌
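
The same arithmetic in a quick C# sketch, which also shows how fast the budget shrinks as you add nines (30-day month assumed):

```csharp
using System;

// Allowed downtime ("error budget") per 30-day month for a given availability SLO.
const double minutesPerMonth = 30 * 24 * 60;   // 43,200

foreach (double slo in new[] { 99.0, 99.9, 99.99, 99.999 })
{
    double budgetMinutes = minutesPerMonth * (100 - slo) / 100;
    Console.WriteLine($"{slo,7}% SLO -> {budgetMinutes,6:F1} minutes/month of error budget");
}
// 99%: 432.0   99.9%: 43.2   99.99%: 4.3   99.999%: 0.4
```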

Error Budgets as “Innovation Currency”

Think of error budget like a bank account: High Balance = Take Risks
Error Budget: 43.2 minutes/month
Used so far: 5 minutes (12%)
Remaining: 38.2 minutes (88%)

Decision: ✅ Ship new feature Friday!
Why? We have plenty of budget left
If it breaks, we can afford 30 min of downtime
Low Balance = Play it Safe
Error Budget: 43.2 minutes/month
Used so far: 40 minutes (93%)
Remaining: 3.2 minutes (7%)

Decision: ❌ STOP all deployments!
Why? We're almost out of budget
Focus on stability, not new features
Real-World Example: Netflix
Netflix SLO: 99.99% uptime
Error Budget: 4.3 minutes/month

Policy:
- Budget > 50% → Deploy multiple times per day ✅
- Budget 25-50% → Deploy once per day with extra testing ⚠️
- Budget < 25% → FREEZE deployments, fix reliability 🚨

Result:
- Balances innovation (new features) with stability (uptime)
- Engineers don't argue about "should we ship?" → Check error budget!
- Data-driven decisions, not politics

Common Mistakes in SRE (Learn from Others)

Mistake #1: Setting SLOs Too High (Perfectionism)

The Trap:
Engineer: "Let's aim for 99.999% uptime!" (26 seconds downtime/month)

Cost to achieve 99.999%:
- Multi-region active-active: $50,000/month
- Chaos engineering tools: $5,000/month
- SRE team (5 people): $75,000/month
Total: $130,000/month

Revenue: $10,000/month ❌
Lost money: $120,000/month ❌
The Fix: Match SLO to business value
Internal tool used by 10 employees:
- SLO: 99% (7 hours downtime/month) ✅
- Cost: $500/month
- Why: Employees can wait, not generating revenue

E-commerce site with $1M/month revenue:
- SLO: 99.99% (4 minutes downtime/month) ✅
- Cost: $10,000/month
- Why: $1M ÷ 43,200 min/month = $23/minute lost revenue
- ROI: $10k cost prevents $100k+ losses ✅
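
As a back-of-the-envelope check you can run yourself, here is a minimal C# sketch of the same "match the SLO to the money" reasoning. All figures are illustrative and follow the example above, not benchmarks:

```csharp
using System;

// Illustrative only: compare revenue at risk per minute of downtime
// with the monthly cost of the reliability tier that prevents it.
double monthlyRevenue = 1_000_000;                  // $1M/month e-commerce site
double minutesPerMonth = 30 * 24 * 60;              // 43,200
double revenuePerMinute = monthlyRevenue / minutesPerMonth;           // ≈ $23/minute

double sloTarget = 99.99;                           // chosen tier
double allowedDowntime = minutesPerMonth * (100 - sloTarget) / 100;   // ≈ 4.3 min/month

double reliabilityCost = 10_000;                    // monthly cost of hitting that tier
Console.WriteLine($"Revenue at risk: ${revenuePerMinute:F0} per minute of downtime");
Console.WriteLine($"Allowed downtime at {sloTarget}%: {allowedDowntime:F1} min/month");
Console.WriteLine($"Break-even: ${reliabilityCost:N0}/month pays for itself after " +
                  $"{reliabilityCost / revenuePerMinute:F0} prevented minutes of downtime");
```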

Mistake #2: No Error Budget Policy (Chaos Deploys)

The Trap:
Month 1: 40 minutes downtime (over budget)
→ Team: "Let's ship anyway!" ❌
→ Month 2: 60 minutes downtime (way over budget)
→ CEO: "Why is our site always down?!" ❌
The Fix: Enforce error budget policy
Error Budget Policy:
├─ Budget > 50% → Normal operations (deploy freely)
├─ Budget 25-50% → Caution mode (extra testing required)
└─ Budget < 25% → FREEZE (only emergency fixes)

Month 1: 40 min downtime → Over budget
→ FREEZE enforced ✅
→ Team fixes reliability issues
→ Month 2: Only 5 min downtime
→ Budget restored, resume normal operations ✅

Mistake #3: SLO Based on System Metrics (Not User Experience)

The Trap:
Bad SLO:
- "CPU < 80%" ❌
- "Database connections < 100" ❌
- "Server is running" ❌

Problem: Server can be "running" but users see errors!
The Fix: SLOs based on user experience
Good SLO:
- "95% of page loads < 2 seconds" ✅ (user-facing)
- "99.9% of API calls succeed" ✅ (user-facing)
- "99% of search results in < 500ms" ✅ (user-facing)

Why better? Measures what users actually experience

Mistake #4: Too Many SLOs (Analysis Paralysis)

The Trap:
Team defines 50 SLOs:
- API latency p50, p75, p90, p95, p99, p99.9
- Error rate by endpoint (20 endpoints)
- Database query time by query type (30 types)

Result: Nobody knows which metrics matter ❌
Meetings spent arguing about SLO violations ❌
The Fix: Start with 3-5 critical SLOs
E-Commerce Site (5 SLOs):
1. Availability: 99.9% (most important)
2. p95 page load: < 2 seconds
3. p95 API latency: < 300ms
4. Payment error rate: < 0.01%
5. Search results: < 500ms (p99)

Rule: If it's not in top 5, don't make it an SLO

[!TIP] Jargon Alert: SLI vs SLO vs SLA
  • SLI (Indicator): What you measure (e.g., latency = 250ms)
  • SLO (Objective): Your internal goal (e.g., p95 latency < 300ms)
  • SLA (Agreement): Your customer promise (e.g., 99.9% uptime or money back)
SLO should be tighter than SLA to give you a buffer!
[!WARNING] Gotcha: 99.9% vs 99.99% Availability
The difference is not a mere 0.09 percentage points; it is 10x the allowed downtime.
  • 99.9% = 43 minutes/month downtime
  • 99.99% = 4.3 minutes/month downtime
Achieving 99.99% can cost 10x more. Choose based on business impact, not ego.

1. The SRE Mindset

SRE = What happens when you ask a software engineer to design operations.

Core SRE Principles

Embrace Risk

100% reliability is impossible and wasteful. Find the right balance between reliability and velocity.

Service Level Objectives

Define clear, measurable reliability targets. Make data-driven decisions.

Eliminate Toil

Automate repetitive manual work. Toil is the enemy of scaling.

Monitor Everything

You can’t improve what you don’t measure. Observability is mandatory.

Error Budgets

Use remaining error budget to balance innovation and stability.

Blameless Postmortems

Learn from failures without blaming individuals. Focus on systemic improvements.

2. Service Level Indicators (SLIs)

SLIs are carefully chosen metrics that represent the health of your service from the user’s perspective.

The Golden Signals (Google SRE)

Google’s four golden signals are latency, traffic, errors, and saturation. The example below measures the first one, latency: how long does a request take?
// Azure Application Insights - Latency SLI
requests
| where timestamp > ago(24h)
| where success == true
| summarize
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99)
    by bin(timestamp, 5m), name
| render timechart
Good SLI:
  • p95 latency < 300ms (fast)
  • p99 latency < 1000ms (acceptable)
Why percentiles?
  • Average hides outliers (1ms + 10,000ms = 5,000ms average, useless!)
  • p95 = 95% of users have this experience or better
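
To make the percentile point concrete, here is a small standalone C# sketch (not tied to Application Insights; the nearest-rank method is one common convention) showing how the average gets distorted by outliers while p50/p95/p99 describe what users actually see:

```csharp
using System;
using System.Linq;

// 98 "normal" requests (100-300 ms) plus two 10,000 ms outliers.
var rng = new Random(42);
double[] durations = Enumerable.Range(0, 98)
    .Select(_ => 100.0 + rng.Next(200))
    .Concat(new[] { 10_000.0, 10_000.0 })
    .ToArray();

double[] sorted = durations.OrderBy(d => d).ToArray();

// Nearest-rank percentile: the value at or above the requested percentile.
double Percentile(double p) => sorted[(int)Math.Ceiling(p / 100.0 * sorted.Length) - 1];

Console.WriteLine($"avg = {durations.Average():F0} ms");   // inflated by the two outliers
Console.WriteLine($"p50 = {Percentile(50):F0} ms");        // typical user experience
Console.WriteLine($"p95 = {Percentile(95):F0} ms");        // still in the normal range
Console.WriteLine($"p99 = {Percentile(99):F0} ms");        // catches the slow tail (10,000 ms)
```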

Choosing Good SLIs

Bad SLI Examples:
❌ "Server is up" (doesn't reflect user experience)
❌ "Average latency" (hides outliers)
❌ "Internal queue depth" (users don't care)
Good SLI Examples:
✅ "95% of API requests complete in < 300ms"
✅ "99.9% of requests return 2xx or 3xx"
✅ "Page load completes in < 2 seconds (p95)"
✅ "Search results returned in < 500ms (p99)"

3. Service Level Objectives (SLOs)

SLOs are internal reliability targets. They should be slightly tighter than your SLA.

SLO Examples

E-Commerce Website - SLIs & SLOs:
Latency:
- SLI: p95 page load time
- SLO: < 2 seconds
- Why: Studies show 2s+ loads = 20% abandonment

Availability:
- SLI: Successful checkout rate
- SLO: 99.9% (43 min downtime/month)
- Why: Every minute down = $10K revenue lost

Errors:
- SLI: Payment processing error rate
- SLO: < 0.01%
- Why: Payment errors damage trust
Public API - SLIs & SLOs:
Latency:
- SLI: p95 API response time
- SLO: < 100ms
- Why: Third-party integrations need fast responses

Availability:
- SLI: Successful API call rate
- SLO: 99.95% (21 min downtime/month)
- Why: Higher than web app (upstream dependency)

Throughput:
- SLI: Requests per second handled
- SLO: 10,000 RPS sustained
- Why: Contract guarantees 5,000 RPS average
Data Pipeline - SLIs & SLOs:
Freshness:
- SLI: Time from data arrival to processing completion
- SLO: < 5 minutes (p95)
- Why: Near real-time reporting requirement

Correctness:
- SLI: Percentage of jobs completing without errors
- SLO: 99.99%
- Why: Data quality is critical

Throughput:
- SLI: Events processed per hour
- SLO: 100 million events/hour
- Why: Peak load during business hours

4. Error Budgets

Error Budget = 100% - SLO. If your SLO is 99.9% availability, your error budget is 0.1% = 43.2 minutes/month.

Error Budget Policy

Example Decision Making:
Month: January
SLO: 99.9% availability (error budget = 0.1% = 43.2 minutes)

Week 1: 5 minutes downtime
  Remaining budget: 38.2 minutes (88%)
  ✅ Status: GREEN - ship new features

Week 2: 15 minutes downtime (cumulative: 20 min)
  Remaining budget: 23.2 minutes (54%)
  ✅ Status: GREEN - continue

Week 3: 20 minutes downtime (cumulative: 40 min)
  Remaining budget: 3.2 minutes (7%)
  🚨 Status: RED - STOP feature releases
  Actions:
    - Cancel Friday deployment
    - Focus on reliability fixes
    - Root cause analysis of incidents
    - Increase monitoring

Week 4: 0 minutes downtime (cumulative: 40 min)
  Remaining budget: 3.2 minutes (7%)
  ⚠️  Status: YELLOW - Careful releases only
  Actions:
    - Small, low-risk changes only
    - Extended bake time
    - Manual approval required
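
A minimal sketch of how this policy could be encoded as a deployment gate. The thresholds mirror the GREEN/YELLOW/RED bands above; the function itself is illustrative, not a standard API:

```csharp
using System;

// Illustrative deployment gate based on remaining error budget.
static string Evaluate(double sloPercent, double downtimeMinutesSoFar, int daysInMonth = 30)
{
    double totalMinutes = daysInMonth * 24 * 60;                       // 43,200 for a 30-day month
    double budgetMinutes = totalMinutes * (100 - sloPercent) / 100;    // 43.2 min at 99.9%
    double remainingFraction = (budgetMinutes - downtimeMinutesSoFar) / budgetMinutes;

    if (remainingFraction > 0.5) return "GREEN - ship freely";
    if (remainingFraction > 0.1) return "YELLOW - low-risk releases only";
    return "RED - freeze feature deployments";
}

Console.WriteLine(Evaluate(99.9, downtimeMinutesSoFar: 20));  // GREEN (~54% of budget left)
Console.WriteLine(Evaluate(99.9, downtimeMinutesSoFar: 40));  // RED   (~7% of budget left)
```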

Calculating Error Budget Burn Rate

// Error budget burn rate (Azure Monitor)
let slo = 99.9; // 99.9% availability SLO
let errorBudget = 100 - slo; // 0.1%
let timeWindow = 30d; // Monthly budget

requests
| where timestamp > ago(timeWindow)
| summarize
    total = count(),
    failed = countif(success == false)
| extend
    actualAvailability = ((total - failed) * 100.0) / total,
    budgetUsed = 100 - actualAvailability,
    budgetRemaining = errorBudget - (100 - actualAvailability),
    burnRate = (100 - actualAvailability) / errorBudget // 1.0 = burning at expected rate
| project
    SLO = slo,
    ActualAvailability = round(actualAvailability, 2),
    ErrorBudget = errorBudget,
    BudgetUsed = round(budgetUsed, 4),
    BudgetRemaining = round(budgetRemaining, 4),
    BurnRate = round(burnRate, 2),
    Status = case(
        burnRate < 0.5, "GREEN - Under budget",
        burnRate < 1.0, "YELLOW - On track",
        burnRate < 2.0, "ORANGE - Over budget",
        "RED - Critical"
    )

5. Toil Reduction

Toil = Repetitive, manual, automatable work that scales linearly with service growth.

What is Toil?

✅ Manually restarting failed services
✅ Provisioning new servers by hand
✅ Copy-pasting database queries for reports
✅ Manually scaling resources up/down
✅ SSH into servers to check logs
✅ Running the same kubectl commands daily
✅ Manually updating configuration files
✅ Responding to the same alerts with same fix
Characteristics:
  • Manual
  • Repetitive
  • Automatable
  • Tactical (no long-term value)
  • Scales linearly with service size

Toil Elimination Strategy

Step 1: Measure Toil

Track time spent on repetitive tasks
Weekly time audit:
- Manual deployments: 8 hours
- Alert triage (false positives): 5 hours
- Log searching: 4 hours
- Manual scaling: 3 hours
- Config updates: 2 hours

Total toil: 22 hours/week (55% of time!)
Step 2: Prioritize by ROI

Automate high-frequency, high-time tasks first
| Task | Frequency | Time | Total/Week | ROI |
|---|---|---|---|---|
| Manual deployments | 10/week | 0.5h | 5h | High |
| False alert triage | 50/week | 0.1h | 5h | High |
| Manual scaling | 20/week | 0.15h | 3h | Medium |
| Config updates | 5/week | 0.4h | 2h | Low |
Step 3: Automate

Examples:
# Before: Manual deployment (30 min)
ssh server
sudo systemctl stop app
git pull
dotnet publish
sudo systemctl start app

# After: CI/CD pipeline (2 min, no human)
git push
# GitHub Actions deploys automatically

# Before: Manual scaling (20 min)
# Check metrics, decide, update config, restart

# After: Autoscaling (0 min, automatic)
az monitor autoscale create \
  --resource-group rg-prod \
  --resource webapp \
  --min-count 2 \
  --max-count 10 \
  --count 2

# Before: Searching logs (15 min per incident)
ssh server
tail -f /var/log/app.log | grep ERROR

# After: Centralized logging (30 seconds)
# KQL query in Application Insights
traces | where severityLevel == 3 | take 100
Step 4: Measure Improvement

Track toil reduction over time
Q1: 55% time spent on toil
Q2: 40% (automated deployments)
Q3: 25% (added autoscaling)
Q4: 15% (improved alerting)

Time reclaimed: 20 hours/week
Used for: Feature development, reliability improvements

6. Chaos Engineering

Chaos Engineering = Intentionally breaking things in production to build confidence.

Chaos Principles

  1. Build a hypothesis (e.g., “If we kill 30% of pods, requests should still succeed”)
  2. Define steady state (e.g., “Error rate < 1%, latency p95 < 500ms”)
  3. Introduce chaos (e.g., kill pods, inject latency, fail database)
  4. Measure deviation (Did error rate spike? Did latency increase?)
  5. Learn and improve (Add retries? Implement circuit breaker?)

Azure Chaos Studio

Setup:
# Prepare the AKS cluster for Chaos Studio's Chaos Mesh faults:
# install Chaos Mesh on the cluster, then enable the cluster as a
# Chaos Studio target (portal or ARM; see the Chaos Studio docs)
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-testing --create-namespace

# Create chaos experiment
az chaos experiment create \
  --name "pod-failure-experiment" \
  --resource-group rg-chaos \
  --location eastus \
  --identity SystemAssigned \
  --steps '[
    {
      "name": "Kill Random Pods",
      "branches": [
        {
          "name": "Kill 30% of pods",
          "actions": [
            {
              "type": "continuous",
              "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.1",
              "duration": "PT5M",
              "parameters": [
                {
                  "key": "jsonSpec",
                  "value": "{\"action\":\"pod-kill\",\"mode\":\"fixed-percent\",\"value\":\"30\"}"
                }
              ],
              "selectorId": "aks-cluster-target"
            }
          ]
        }
      ]
    }
  ]'

# Run experiment
az chaos experiment start \
  --name "pod-failure-experiment" \
  --resource-group rg-chaos

Chaos Scenarios

Hypothesis: Application survives 30% pod failure
Test:
# Using Chaos Mesh
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-experiment
spec:
  action: pod-kill
  mode: fixed-percent
  value: "30"
  selector:
    namespaces:
      - production
    labelSelectors:
      app: web-api
  scheduler:
    cron: "@every 1h"
Expected: No user-visible errors (replicas take over)
Actual: ?
Learnings: Need faster readiness probes? Increase replica count?
Hypothesis: Application handles 500ms network latency
Test:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: payment-api
  delay:
    latency: "500ms"
    correlation: "100"
  duration: "5m"
Expected: Requests time out gracefully, circuit breaker opens
Actual: ?
Learnings: Add timeout? Implement retry with backoff?
Hypothesis: Autoscaling kicks in under CPU load
Test:
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: worker
  stressors:
    cpu:
      workers: 4
      load: 100
  duration: "10m"
Expected: HPA scales pods from 3 → 10 within 2 minutes
Actual: ?
Learnings: Adjust HPA thresholds? Add more headroom?
Hypothesis: App survives database failover
Test:
# Trigger Azure SQL failover
az sql db failover \
  --resource-group rg-prod \
  --server sql-prod \
  --database mydb

# Expected failover time: 30-60 seconds
Expected: Brief spike in errors (< 1 minute), then recovery
Actual: ?
Learnings: Does the connection pool handle failover? Need retry logic?
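
One common answer to the "need retry logic?" question is wrapping database calls in a transient-fault retry with exponential backoff. A minimal hand-rolled sketch, assuming the Microsoft.Data.SqlClient package (in practice you might instead use a library such as Polly or EF Core's built-in retry strategy):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;   // assumed NuGet package: Microsoft.Data.SqlClient

// Minimal transient-fault retry with exponential backoff. Production code would
// also inspect SqlException.Number and retry only known-transient errors.
public static class ResilientSql
{
    public static async Task<int> ExecuteWithRetryAsync(
        string connectionString, string sql, int maxAttempts = 7)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await using var conn = new SqlConnection(connectionString);
                await conn.OpenAsync();
                await using var cmd = new SqlCommand(sql, conn);
                return await cmd.ExecuteNonQueryAsync();
            }
            catch (SqlException) when (attempt < maxAttempts)
            {
                // Back off 1s, 2s, 4s, ... (~63s total across 7 attempts),
                // enough to ride out the 30-60 second failover window above.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
            }
        }
    }
}
```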

GameDay Exercise

Scenario: Black Friday Simulation
Timeline:
09:00 - Start load test (5,000 RPS)
09:15 - Kill 50% of web pods
09:20 - Inject 1s latency to payment service
09:25 - Trigger database failover
09:30 - Simulate CDN failure (disable Azure Front Door)
09:35 - End test

Metrics to Track:
- Error rate (target: < 1%)
- p95 latency (target: < 2s)
- Successful checkouts (target: > 99%)
- Revenue lost (target: < $1,000)

Team Roles:
- Incident Commander
- Operations (executes fixes)
- Communications (updates stakeholders)
- Observers (document learnings)

Post-GameDay:
- Blameless postmortem
- Action items with owners
- Update runbooks

7. Incident Management

Incident Response Lifecycle

Detect → Triage (assign severity and an Incident Commander) → Mitigate → Resolve → Learn (blameless postmortem)

Severity Levels

| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Critical - Revenue loss | Page immediately | Payment system down, database offline |
| SEV2 | Major - Degraded service | 15 minutes | API latency 10x higher, 10% error rate |
| SEV3 | Minor - Partial impact | 1 hour | Single pod failing, non-critical feature down |
| SEV4 | Cosmetic - No user impact | Next business day | UI typo, log noise |

Blameless Postmortem Template

# Incident Postmortem: Payment Service Outage

**Date**: 2026-01-20
**Duration**: 47 minutes
**Severity**: SEV1
**Impact**: $125K revenue lost, 15K failed transactions

## Timeline

09:42 - Alert fired: Payment API error rate 50%
09:43 - On-call engineer acknowledged
09:45 - Incident Commander assigned
09:47 - Root cause identified: Database connection pool exhausted
09:50 - Mitigation: Restarted API pods
09:55 - Error rate dropped to 5%
10:00 - Mitigation: Increased connection pool size
10:15 - Error rate back to normal (< 0.1%)
10:29 - Incident resolved

## Root Cause

Database connection pool configured for 100 connections.
Traffic spike (Black Friday sale) increased to 5,000 RPS.
Each request held connection for 500ms average.
Required connections: 5,000 * 0.5 = 2,500.

Result: Connection pool exhausted → requests failed.

## What Went Well

✅ Alert fired within 30 seconds
✅ Clear escalation path
✅ Team mobilized quickly
✅ Mitigation identified in < 5 minutes

## What Went Wrong

❌ No capacity planning for Black Friday
❌ Connection pool size not monitored
❌ No autoscaling for connection pool
❌ Runbook didn't cover this scenario

## Action Items

1. [P0] Increase connection pool to 5,000 (@alice, Due: Jan 21)
2. [P0] Add connection pool saturation alert (@bob, Due: Jan 21)
3. [P1] Implement autoscaling for connection pool (@charlie, Due: Jan 25)
4. [P1] Update runbook with connection pool troubleshooting (@david, Due: Jan 27)
5. [P2] Load test before major events (@team, Due: Feb 1)

## Lessons Learned

- Monitor resource saturation, not just errors
- Load test before high-traffic events
- Connection pools are a finite resource
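
Following up on action item 1, the pool limit usually lives in the connection string. A minimal sketch using SqlConnectionStringBuilder; server name, auth mode, and numbers are illustrative, not recommendations:

```csharp
using Microsoft.Data.SqlClient;   // assumed NuGet package: Microsoft.Data.SqlClient

// Sketch of the postmortem fix: size the pool for peak concurrency and keep the
// setting in one visible place. Peak 5,000 RPS x 0.5s average hold time needs
// ~2,500 connections; 5,000 gives roughly 2x headroom.
var builder = new SqlConnectionStringBuilder
{
    DataSource = "sql-prod.database.windows.net",
    InitialCatalog = "mydb",
    Authentication = SqlAuthenticationMethod.ActiveDirectoryManagedIdentity,
    MinPoolSize = 100,
    MaxPoolSize = 5000,
    ConnectTimeout = 15
};

string connectionString = builder.ConnectionString;
```

Pair this with the pool saturation alert from action item 2 so the limit is never discovered the hard way again.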

8. On-Call Best Practices

On-Call Rotation

Team of 6 engineers
Rotation: 1 week on-call, 5 weeks off

Primary On-Call:
- Responds to all alerts
- Pages: 5-10 per week expected
- Compensation: $500/week stipend

Secondary On-Call:
- Backup if primary unreachable
- Takes over for escalations
- Compensation: $250/week stipend

Schedule:
Week 1: Alice (primary), Bob (secondary)
Week 2: Charlie (primary), David (secondary)
Week 3: Eve (primary), Frank (secondary)
...

On-Call Expectations

Response Times:
  • SEV1: 5 minutes
  • SEV2: 15 minutes
  • SEV3: 1 hour
Escalation Path:
1. Primary on-call (5 min)
2. Secondary on-call (10 min)
3. Engineering Manager (15 min)
4. VP Engineering (20 min)
Handoff:
## On-Call Handoff: Alice → Charlie

**Open Incidents**:
- INC-1234: Database latency spikes (SEV3, investigating)

**Ongoing Issues**:
- Redis cache hit rate dropped to 60% (normal: 90%)
- Monitoring this, no action needed yet

**Upcoming Maintenance**:
- SQL failover test scheduled for Wednesday 2am

**Tips**:
- If you see "connection timeout" errors, restart worker pods
- Database runbook: https://wiki.company.com/db-runbook
- Slack #oncall-help if you need guidance

9. Load Testing & Performance Testing

Testing your system under load is critical for understanding scalability limits and preventing outages.

Load Testing vs Performance Testing

| Type | Purpose | When | Tools |
|---|---|---|---|
| Load Testing | Verify system handles expected load | Before launch, before traffic spikes | k6, JMeter, Azure Load Testing |
| Stress Testing | Find breaking point | Capacity planning | k6, Locust |
| Spike Testing | Handle sudden traffic bursts | Black Friday prep | k6, Gatling |
| Soak Testing | Detect memory leaks over time | After deployments | k6 (24+ hours) |
| Performance Testing | Measure response times | Every release | Azure Load Testing |
(Figure: typical k6 load test stages - ramp up, steady load, ramp down)

Azure Load Testing

Azure Load Testing is a fully managed service for generating high-scale load.

Create Load Test

# Create Azure Load Testing resource
az load create \
  --name myloadtest \
  --resource-group rg-prod \
  --location eastus

# Upload JMeter test plan
az load test create \
  --load-test-resource myloadtest \
  --test-id checkout-load-test \
  --display-name "Checkout API Load Test" \
  --description "Test 1000 concurrent users" \
  --test-plan checkout-test.jmx \
  --engine-instances 10

JMeter Test Plan Example

<?xml version="1.0" encoding="UTF-8"?>
<jmeterTestPlan version="1.2">
  <hashTree>
    <TestPlan guiclass="TestPlanGui" testclass="TestPlan" testname="Checkout Load Test">
      <elementProp name="TestPlan.user_defined_variables" elementType="Arguments">
        <collectionProp name="Arguments.arguments">
          <elementProp name="TARGET_URL" elementType="Argument">
            <stringProp name="Argument.name">TARGET_URL</stringProp>
            <stringProp name="Argument.value">${__P(TARGET_URL,https://api.example.com)}</stringProp>
          </elementProp>
        </collectionProp>
      </elementProp>
    </TestPlan>
    <hashTree>
      <ThreadGroup guiclass="ThreadGroupGui" testclass="ThreadGroup" testname="Users">
        <intProp name="ThreadGroup.num_threads">1000</intProp>
        <intProp name="ThreadGroup.ramp_time">60</intProp>
        <longProp name="ThreadGroup.duration">600</longProp>
      </ThreadGroup>
      <hashTree>
        <HTTPSamplerProxy guiclass="HttpTestSampleGui" testclass="HTTPSamplerProxy" testname="POST /checkout">
          <stringProp name="HTTPSampler.domain">${TARGET_URL}</stringProp>
          <stringProp name="HTTPSampler.path">/api/checkout</stringProp>
          <stringProp name="HTTPSampler.method">POST</stringProp>
          <boolProp name="HTTPSampler.auto_redirects">true</boolProp>
          <boolProp name="HTTPSampler.use_keepalive">true</boolProp>
        </HTTPSamplerProxy>
      </hashTree>
    </hashTree>
  </hashTree>
</jmeterTestPlan>

k6 Load Testing

k6 is a modern, developer-friendly load testing tool.

Basic Load Test

// load-test.js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Ramp up to 100 users
    { duration: '5m', target: 100 },   // Stay at 100 users
    { duration: '2m', target: 200 },   // Spike to 200 users
    { duration: '5m', target: 200 },   // Stay at 200 users
    { duration: '2m', target: 0 },     // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],   // 95% of requests < 500ms
    http_req_failed: ['rate<0.01'],     // Error rate < 1%
  },
};

export default function () {
  const payload = JSON.stringify({
    userId: `user${__VU}`,
    cartId: `cart${__VU}`,
    items: [{ productId: 'prod123', quantity: 2 }],
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.API_TOKEN}`,
    },
  };

  const res = http.post('https://api.example.com/checkout', payload, params);

  check(res, {
    'status is 200': (r) => r.status === 200,
    'checkout successful': (r) => r.json('success') === true,
    'response time < 500ms': (r) => r.timings.duration < 500,
  });

  sleep(1);  // Think time: 1 second between requests
}

Run k6 Test

# Local test
k6 run load-test.js

# Cloud test (distributed load)
k6 cloud load-test.js

# With environment variables
k6 run -e API_TOKEN=$TOKEN load-test.js

# Output to InfluxDB + Grafana
k6 run --out influxdb=http://localhost:8086/k6 load-test.js

Performance Testing Scenarios

Scenario 1: Black Friday Load Test

// black-friday.js
export const options = {
  scenarios: {
    // Normal traffic (1000 RPS)
    normal_traffic: {
      executor: 'constant-arrival-rate',
      rate: 1000,
      timeUnit: '1s',
      duration: '1h',
      preAllocatedVUs: 500,
      maxVUs: 2000,
    },
    // Black Friday spike (10x traffic for 30 min)
    black_friday_spike: {
      executor: 'constant-arrival-rate',
      rate: 10000,
      timeUnit: '1s',
      duration: '30m',
      startTime: '1h',  // Start after normal traffic
      preAllocatedVUs: 5000,
      maxVUs: 20000,
    },
  },
  thresholds: {
    'http_req_duration{scenario:normal_traffic}': ['p(99)<1000'],
    'http_req_duration{scenario:black_friday_spike}': ['p(99)<2000'],
    'http_req_failed': ['rate<0.05'],  // 5% error budget during spike
  },
};

Scenario 2: Soak Test (Memory Leak Detection)

// soak-test.js
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 100 },    // Ramp up
    { duration: '24h', target: 100 },   // Run for 24 hours
    { duration: '5m', target: 0 },      // Ramp down
  ],
};

export default function () {
  http.get('https://api.example.com/products');
  sleep(3);  // 3 seconds think time
}
What to Monitor:
  • Memory usage (should be flat, not increasing)
  • Connection pool size (should stabilize)
  • Database connections (no leaks)
  • HTTP response times (should not degrade)

Azure Application Insights Performance Testing

Custom Telemetry for Load Tests

// Program.cs (minimal hosting model)
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false;  // Disable during load tests
    options.DeveloperMode = false;
});

// LoadTestController.cs
[HttpGet("health")]
public IActionResult Health()
{
    var telemetry = new TelemetryClient();

    // Track custom metric during load test
    telemetry.TrackMetric("DatabaseConnectionPoolSize",
        GetConnectionPoolSize());

    telemetry.TrackMetric("MemoryUsageMB",
        GC.GetTotalMemory(false) / 1024 / 1024);

    return Ok(new { status = "healthy" });
}

KQL Query: Analyze Load Test Results

// Performance during load test
requests
| where timestamp between(datetime(2026-01-21 10:00) .. datetime(2026-01-21 11:00))
| summarize
    P50 = percentile(duration, 50),
    P95 = percentile(duration, 95),
    P99 = percentile(duration, 99),
    RequestRate = count() / 60.0,  // requests per second (1-minute bins)
    FailureRate = countif(success == false) * 100.0 / count()
    by bin(timestamp, 1m)
| project timestamp, P50, P95, P99, RequestRate, FailureRate
| render timechart

Performance Testing Best Practices

1. Realistic Test Data

// Bad: Same user every time
const userId = 'user123';

// Good: Random users from pool
const users = ['user1', 'user2', 'user3', ...];  // 10,000 users
const userId = users[Math.floor(Math.random() * users.length)];

// Better: Unique user per VU
const userId = `user${__VU}`;

2. Think Time (Realistic User Behavior)

export default function () {
  // User lands on homepage
  http.get('https://example.com/');
  sleep(2);  // User reads content

  // User searches for product
  http.get('https://example.com/search?q=laptop');
  sleep(3);  // User browses results

  // User clicks product
  http.get('https://example.com/products/laptop-123');
  sleep(5);  // User reads reviews

  // User adds to cart
  http.post('https://example.com/cart/add', { productId: 'laptop-123' });
  sleep(1);
}

3. Connection Pooling

export const options = {
  batch: 20,  // Send 20 requests in parallel per VU
  batchPerHost: 6,  // Max 6 parallel connections per host (HTTP/1.1)
};

4. Gradual Ramp-Up

// Bad: Instant spike (thundering herd)
export const options = {
  vus: 1000,
  duration: '10m',
};

// Good: Gradual ramp-up
export const options = {
  stages: [
    { duration: '2m', target: 100 },
    { duration: '2m', target: 500 },
    { duration: '2m', target: 1000 },
    { duration: '10m', target: 1000 },
  ],
};

Interpreting Load Test Results

Metrics to Track

| Metric | Good | Warning | Critical |
|---|---|---|---|
| P95 Latency | <500ms | 500-1000ms | >1000ms |
| Error Rate | <0.1% | 0.1-1% | >1% |
| Throughput | Meets SLO | 80% of SLO | <80% of SLO |
| CPU Usage | <70% | 70-85% | >85% |
| Memory Usage | Stable | Slow increase | Rapid increase |

Bottleneck Identification

// Find slowest dependencies
dependencies
| where timestamp > ago(1h)
| summarize
    AvgDuration = avg(duration),
    P95Duration = percentile(duration, 95),
    Count = count()
    by target
| order by P95Duration desc
| take 10
Common Bottlenecks:
  1. Database queries (most common) - Add indexes, optimize queries
  2. External API calls - Add caching, circuit breakers
  3. CPU-intensive operations - Move to background jobs
  4. Network latency - Use CDN, co-locate services
  5. Memory allocation - Reduce allocations, use object pooling

Load Testing Checklist

Before running production load tests:
  • Test environment matches production (same VM SKU, database tier, autoscaling rules)
  • Disable sampling in Application Insights (100% telemetry during tests)
  • Notify team (Slack #on-call: “Load test starting at 10 AM”)
  • Monitor dashboards open (Grafana, Application Insights Live Metrics)
  • Rollback plan ready (know how to stop the test immediately)
  • Test data prepared (realistic user IDs, product IDs, payment methods)
  • Success criteria defined (P95 <500ms, error rate <1%, no memory leaks)
  • Baseline captured (run test before changes, compare after)
[!WARNING] Gotcha: Testing in Production Load testing in production is risky but sometimes necessary. If you must:
  • Use synthetic users (test-user-1, test-user-2) to isolate test data
  • Route test traffic to a separate backend pool (blue-green)
  • Start with 5% of production load, increase gradually
  • Have instant rollback ready (kill switch)
  • Never test payment processing or critical user actions in production

10. Production Readiness Checklist

Before launching a new service to production:
✅ Observability
  ✅ Metrics instrumented (Golden Signals)
  ✅ Logs centralized (Application Insights / Log Analytics)
  ✅ Distributed tracing enabled
  ✅ Dashboards created
  ✅ Alerts configured with runbooks

✅ Reliability
  ✅ SLOs defined and measured
  ✅ Error budget policy in place
  ✅ Circuit breakers implemented
  ✅ Retries with exponential backoff
  ✅ Timeouts configured
  ✅ Health checks (liveness & readiness)

✅ Scalability
  ✅ Load tested at 2x peak traffic
  ✅ Autoscaling configured
  ✅ Database connection pooling
  ✅ Caching strategy
  ✅ Rate limiting for external APIs

✅ Security
  ✅ Secrets in Key Vault (not in code)
  ✅ Managed Identity for authentication
  ✅ Network isolation (Private Endpoints)
  ✅ WAF enabled
  ✅ Security scanning in CI/CD

✅ Operational
  ✅ Runbooks documented
  ✅ On-call rotation defined
  ✅ Disaster recovery tested
  ✅ Backup and restore verified
  ✅ Chaos engineering experiments run

✅ Cost
  ✅ Cost estimation completed
  ✅ Budgets and alerts set
  ✅ Autoscaling to reduce waste
  ✅ Reserved Instances for steady load

11. Interview Questions

Beginner Level

Question: What is the difference between SLI, SLO, and SLA?

Answer:

SLI (Service Level Indicator):
  • What you measure
  • Example: “p95 latency = 250ms”
SLO (Service Level Objective):
  • Your internal target
  • Example: “p95 latency < 300ms”
  • Guides engineering decisions
SLA (Service Level Agreement):
  • Customer promise with consequences
  • Example: “99.9% uptime or 10% refund”
  • Legal/financial agreement
Relationship: you measure SLIs, set an SLO as your internal target, and promise a looser SLA externally (e.g., SLO 99.9% vs SLA 99.5%); the measured SLI should stay above the SLO.
Question: What is an error budget, and how do you calculate it?

Answer: Error budget = how much unreliability you can tolerate.

Calculation:
SLO: 99.9% availability
Error budget: 100% - 99.9% = 0.1%

Monthly: 30 days × 24 hours × 60 min = 43,200 min
Error budget: 43,200 × 0.001 = 43.2 minutes downtime allowed
Use:
  • Budget remaining → ship features fast
  • Budget exhausted → focus on reliability

Intermediate Level

Question: How do you measure and reduce toil?

Answer:

Measure:
  1. Time audit: Track manual, repetitive tasks
  2. Calculate percentage of time spent on toil
  3. Goal: Keep toil < 50% (ideally < 30%)
Reduce:
  1. Automate high-frequency tasks first
  2. Examples:
    • Manual deployments → CI/CD
    • Manual scaling → Autoscaling
    • Log searching → Centralized logging
  3. Measure improvement quarterly
Not Toil:
  • Novel problem solving
  • Architecture design
  • Incident response (new incidents)
Question: How would you design an on-call rotation for a team of six engineers?

Answer:

Rotation:
  • Primary on-call: 1 week
  • Secondary on-call: 1 week
  • Rotation: 6 weeks per person
Schedule:
Week 1: Alice (P), Bob (S)
Week 2: Charlie (P), David (S)
Week 3: Eve (P), Frank (S)
Week 4: Bob (P), Alice (S)
Week 5: David (P), Charlie (S)
Week 6: Frank (P), Eve (S)
Compensation:
  • Primary: $500/week
  • Secondary: $250/week
Handoff Process:
  • Written handoff document
  • 15-minute sync call
  • Slack thread with context

Advanced Level

Question: Design an error budget policy for a critical service.

Answer:
Service: E-Commerce Checkout
SLO: 99.95% availability (21.6 min downtime/month)

Error Budget Policy:

GREEN (> 80% budget remaining):
- Ship features daily
- Canary deployments (10% traffic)
- Automated rollouts
- Chaos experiments allowed

YELLOW (30-80% budget remaining):
- Ship features 2x per week
- Extended canary period (24 hours)
- Manual approval for risky changes
- Pause chaos experiments

ORANGE (10-30% budget remaining):
- Feature freeze (critical fixes only)
- Extended testing period
- Manual deployments
- Root cause analysis required

RED (< 10% budget remaining):
- FULL STOP on features
- Emergency fixes only
- Incident review with leadership
- Mandatory postmortems
- Focus: Pay down reliability debt

Reset: Monthly (first day of month)

Escalation:
- Budget hits YELLOW → Notify team
- Budget hits ORANGE → Notify manager
- Budget hits RED → Notify VP Engineering
Question: How would you introduce chaos engineering to an organization new to it?

Answer:
**Phase 1: Foundation (Month 1-2)**
- Set up Azure Chaos Studio
- Define blast radius (start with dev/staging)
- Create baseline metrics dashboard
- Get stakeholder buy-in

**Phase 2: Simple Experiments (Month 3-4)**
- Pod failures (10% → 30% → 50%)
- Network latency (100ms → 500ms → 1s)
- CPU stress (50% → 80% → 100%)
- Run in staging, measure impact

**Phase 3: Production Testing (Month 5-6)**
- Start with off-peak hours
- Small blast radius (single AZ)
- Gradual rollout to full production
- Always have kill switch

**Phase 4: GameDays (Month 7+)**
- Quarterly exercises
- Simulate real outages
- Test incident response
- Cross-team coordination

**Metrics to Track**:
- MTTR (Mean Time To Recovery)
- Blast radius (how many users affected)
- Learnings per experiment
- Action items completed

**Culture**:
- Blameless
- Learning-focused
- Celebrate finding weaknesses
- Share results company-wide

12. Key Takeaways

SLOs Drive Decisions

Define clear SLOs. Use error budgets to balance velocity and reliability.

Eliminate Toil

Automate everything repetitive. Spend time on engineering, not firefighting.

Break Things on Purpose

Chaos engineering builds confidence. Test in production before customers do.

Blameless Culture

Focus on systems, not people. Learn from failures without punishment.

Monitor What Matters

User experience > server metrics. Latency, traffic, errors, saturation.

Operational Excellence

Production readiness checklist before launch. On-call with clear expectations.

Next Steps

Back to Course Overview

Return to course overview or continue to the Capstone Project

Continue to Chapter 15

Apply everything you’ve learned in the enterprise e-commerce capstone