SRE = Treating operations like a software engineering problem

Think of it like maintaining a car.

Traditional Operations (Old Way):
Car breaks down
  ↓
Mechanic manually fixes it
  ↓
Car breaks down again next month
  ↓
Mechanic manually fixes it again
  ↓
Repeat forever... ❌

Problems:
- Reactive (fixing problems after they happen)
- Manual work (a mechanic is required every time)
- No improvement (the same problems repeat)
- Doesn't scale (you need more mechanics as the fleet grows)
Site Reliability Engineering (New Way):
Car sensor detects problem before failure
  ↓
Automated system orders replacement part
  ↓
Technician installs during scheduled maintenance
  ↓
System analyzes failure pattern
  ↓
Software patch prevents issue in all cars ✅

Benefits:
- Proactive (preventing problems before they happen)
- Automated (systems fix themselves)
- Continuous improvement (learn from every incident)
- Scales (the same team manages 10x more systems)
SRE in Software:
- Automation: Replace manual tasks with code
- Monitoring: Detect problems before users notice
- Reliability: Measure and improve system uptime
- Balance: Ship features while maintaining stability
SLI (Service Level Indicator) = What You Actually Measure

Pizza Delivery Service:
- SLI #1: Delivery time (measured with a GPS tracker)
  - Order at 7:00 PM → Delivered at 7:25 PM = 25 minutes
- SLI #2: Pizza temperature (measured with a thermometer)
  - Temperature at delivery: 145°F
- SLI #3: Order accuracy (measured by the customer)
  - Correct toppings: Yes / No

**SLI = The actual measurements**
SLO (Service Level Objective) = Your Internal Goal
Pizza Company's Goals (not told to customers):
- SLO #1: 95% of pizzas delivered in < 30 minutes
- SLO #2: 99% of pizzas arrive above 140°F
- SLO #3: 99.9% of orders are correct

**SLO = What you promise yourself to achieve**
SLA (Service Level Agreement) = Your Customer Promise
Pizza Company's Guarantee (public promise):
- SLA: "Pizza in 30 minutes or it's free!"

**SLA = What you promise customers (with consequences)**
The Relationship:
SLI (What you measure) → SLO (Your goal) → SLA (Customer promise)

Real Numbers:
- SLI: Last 100 pizzas averaged 28 minutes
- SLO: We want 95% delivered in < 30 min
- SLA: We promise 30 min or it's free

Buffer: the internal SLO (95%) budgets for a few failures, so the occasional free pizza is a planned cost rather than a surprise
→ This gives you room for occasional failures ✅
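To make the chain concrete, here is a minimal JavaScript sketch that turns raw SLI measurements into an SLO verdict (the delivery times are made-up sample data):

```javascript
// SLI: raw measurements — delivery times in minutes (hypothetical sample data)
const deliveryTimes = [22, 25, 28, 31, 24, 27, 29, 35, 26, 23];

// SLO: 95% of deliveries should take < 30 minutes
const SLO_TARGET = 0.95;
const SLO_THRESHOLD_MIN = 30;

const onTime = deliveryTimes.filter((t) => t < SLO_THRESHOLD_MIN).length;
const achieved = onTime / deliveryTimes.length;

console.log(`SLI: ${(achieved * 100).toFixed(1)}% delivered in < ${SLO_THRESHOLD_MIN} min`);
console.log(achieved >= SLO_TARGET ? 'SLO met ✅' : 'SLO missed ❌');
```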
The same pattern for a software service:

What we measure:
- API response time: 250ms (measured with Application Insights)
- Error rate: 0.05% (measured with logs)
- Uptime: 99.95% (measured with health checks)

These are FACTS (actual measurements)
Our public guarantee:
- 99.5% uptime or we give you credits
- (We promise customers 99.5%, but aim for 99.9% internally)

Buffer: 99.9% (SLO) vs 99.5% (SLA) = 0.4% buffer ✅
Why the Buffer?
Month 1:
- Actual uptime: 99.8% ✅ (above the SLO of 99.9%? No. Above the SLA of 99.5%? Yes!)
- No customer refunds
- Internal: "We missed our SLO, let's improve"

Month 2:
- Actual uptime: 99.4% ❌ (below the SLA of 99.5%)
- Customer refunds triggered
- Internal: "Emergency! Stop feature releases!"

The buffer prevents small problems from costing money.
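A sketch of that month-end logic, assuming uptime has already been measured (the thresholds are the ones above):

```javascript
// Evaluate a month's uptime against the internal SLO and the public SLA.
function monthlyReview(actualUptime, slo = 99.9, sla = 99.5) {
  if (actualUptime < sla) {
    return 'Below SLA: customer refunds triggered, freeze feature releases 🚨';
  }
  if (actualUptime < slo) {
    return 'Below SLO but above SLA: no refunds, prioritize reliability work ⚠️';
  }
  return 'SLO met: keep shipping ✅';
}

console.log(monthlyReview(99.8)); // Month 1 → buffer absorbs the miss
console.log(monthlyReview(99.4)); // Month 2 → SLA breach
```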
Think of an error budget like a bank account:

High Balance = Take Risks
Error Budget: 43.2 minutes/month
Used so far: 5 minutes (12%)
Remaining: 38.2 minutes (88%)

Decision: ✅ Ship the new feature Friday!
Why? We have plenty of budget left.
If it breaks, we can afford 30 min of downtime.
Low Balance = Play it Safe
Error Budget: 43.2 minutes/month
Used so far: 40 minutes (93%)
Remaining: 3.2 minutes (7%)

Decision: ❌ STOP all deployments!
Why? We're almost out of budget.
Focus on stability, not new features.
Real-World Example: Netflix
Netflix SLO: 99.99% uptime
Error Budget: 4.3 minutes/month

Policy:
- Budget > 50% → Deploy multiple times per day ✅
- Budget 25-50% → Deploy once per day with extra testing ⚠️
- Budget < 25% → FREEZE deployments, fix reliability 🚨

Result:
- Balances innovation (new features) with stability (uptime)
- Engineers don't argue about "should we ship?" → Check the error budget!
- Data-driven decisions, not politics
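The budget arithmetic and the deploy policy are easy to encode. Here is a sketch (the thresholds mirror the policy above; the minute counts assume a 30-day month):

```javascript
const MINUTES_PER_MONTH = 30 * 24 * 60; // 43,200

// Error budget = allowed downtime for a given SLO, e.g. 99.9% → 43.2 min/month.
function errorBudgetMinutes(sloPercent) {
  return MINUTES_PER_MONTH * (1 - sloPercent / 100);
}

// Map remaining budget to a deploy policy (thresholds from the example above).
function deployPolicy(sloPercent, downtimeUsedMin) {
  const budget = errorBudgetMinutes(sloPercent);
  const remaining = (budget - downtimeUsedMin) / budget;
  if (remaining > 0.5) return 'Deploy multiple times per day ✅';
  if (remaining > 0.25) return 'Deploy once per day with extra testing ⚠️';
  return 'FREEZE deployments, fix reliability 🚨';
}

console.log(errorBudgetMinutes(99.9)); // 43.2
console.log(deployPolicy(99.9, 5));    // 88% remaining → ship
console.log(deployPolicy(99.9, 40));   // 7% remaining → freeze
```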
Mistake #3: SLO Based on System Metrics (Not User Experience)
The Trap:
Bad SLOs:
- "CPU < 80%" ❌
- "Database connections < 100" ❌
- "Server is running" ❌

Problem: a server can be "running" while users still see errors!
The Fix: SLOs based on user experience
Good SLOs:
- "95% of page loads < 2 seconds" ✅ (user-facing)
- "99.9% of API calls succeed" ✅ (user-facing)
- "99% of search results in < 500ms" ✅ (user-facing)

Why better? They measure what users actually experience.
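User-facing SLOs like these map directly onto k6 thresholds (k6 appears again in the load-testing section below). A minimal sketch, assuming an example.com endpoint:

```javascript
import http from 'k6/http';

// Encode user-facing SLOs as k6 thresholds: the test fails if they're violated.
export const options = {
  thresholds: {
    http_req_duration: ['p(95)<2000'], // 95% of page loads < 2 seconds
    http_req_failed: ['rate<0.001'],   // 99.9% of requests succeed
  },
};

export default function () {
  http.get('https://example.com/');
}
```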
Mistake #4: Too Many SLOs

The Trap:

Team defines 50+ SLOs:
- API latency p50, p75, p90, p95, p99, p99.9
- Error rate by endpoint (20 endpoints)
- Database query time by query type (30 types)

Result: Nobody knows which metrics matter ❌
Meetings are spent arguing about SLO violations ❌
The Fix: Start with 3-5 critical SLOs
E-Commerce Site (5 SLOs):
1. Availability: 99.9% (most important)
2. p95 page load: < 2 seconds
3. p95 API latency: < 300ms
4. Payment error rate: < 0.01%
5. Search results: < 500ms (p99)

Rule: If it's not in the top 5, don't make it an SLO.
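One way to keep the list honest is to declare the top five as a reviewed config. A sketch (the names are illustrative, not from the text):

```javascript
// The only five SLOs for the e-commerce site: adding a sixth requires review.
const SLOS = [
  { name: 'availability',       target: '99.9%' },
  { name: 'page_load_p95',      target: '< 2s' },
  { name: 'api_latency_p95',    target: '< 300ms' },
  { name: 'payment_error_rate', target: '< 0.01%' },
  { name: 'search_latency_p99', target: '< 500ms' },
];
```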
> [!TIP]
> Jargon Alert: SLI vs SLO vs SLA
>
> - SLI (Indicator): What you measure (e.g., latency = 250ms)
> - SLO (Objective): Your internal goal (e.g., p95 latency < 300ms)
> - SLA (Agreement): Your customer promise (e.g., 99.9% uptime or money back)
>
> SLO should be tighter than SLA to give you a buffer!
> [!WARNING]
> Gotcha: 99.9% vs 99.99% Availability
>
> The difference is NOT a mere 0.09%: 99.9% allows 10x more downtime than 99.99%!
>
> - 99.9% = 43 minutes/month of downtime
> - 99.99% = 4.3 minutes/month of downtime
>
> Achieving 99.99% can cost 10x more. Choose based on business impact, not ego.
❌ "Server is up" (doesn't reflect user experience)❌ "Average latency" (hides outliers)❌ "Internal queue depth" (users don't care)
Good SLI Examples:
✅ "95% of API requests complete in < 300ms"✅ "99.9% of requests return 2xx or 3xx"✅ "Page load completes in < 2 seconds (p95)"✅ "Search results returned in < 500ms (p99)"
API Service

SLIs & SLOs:

Latency:
- SLI: p95 API response time
- SLO: < 100ms
- Why: Third-party integrations need fast responses

Availability:
- SLI: Successful API call rate
- SLO: 99.95% (21.6 min downtime/month)
- Why: Higher than the web app (it's an upstream dependency)

Throughput:
- SLI: Requests per second handled
- SLO: 10,000 RPS sustained
- Why: The contract guarantees 5,000 RPS average
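The throughput SLO can be verified with k6's constant-arrival-rate executor, which drives a fixed request rate regardless of response times. A sketch (the endpoint and VU counts are illustrative):

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    sustained_rps: {
      executor: 'constant-arrival-rate',
      rate: 10000,          // SLO: 10,000 requests per second...
      timeUnit: '1s',
      duration: '5m',       // ...sustained for five minutes
      preAllocatedVUs: 500, // VUs k6 pre-allocates to hit the rate
      maxVUs: 2000,
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<100'], // latency SLO: p95 < 100ms
    http_req_failed: ['rate<0.0005'], // availability SLO: 99.95%
  },
};

export default function () {
  http.get('https://api.example.com/v1/health');
}
```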
Batch Processing System
SLIs & SLOs:
Freshness:
- SLI: Time from data arrival to processing completion
- SLO: < 5 minutes (p95)
- Why: Near-real-time reporting requirement

Correctness:
- SLI: Percentage of jobs completing without errors
- SLO: 99.99%
- Why: Data quality is critical

Throughput:
- SLI: Events processed per hour
- SLO: 100 million events/hour
- Why: Peak load during business hours
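Freshness is just the lag between arrival and completion. A sketch computing the p95 lag from job records (the field names are hypothetical):

```javascript
// Hypothetical job records with arrival and completion timestamps (ms epoch).
const jobs = [
  { arrivedAt: 1700000000000, completedAt: 1700000120000 }, // 2 min
  { arrivedAt: 1700000060000, completedAt: 1700000420000 }, // 6 min
  { arrivedAt: 1700000100000, completedAt: 1700000280000 }, // 3 min
];

// Freshness SLI: p95 of (completion - arrival), compared to the 5-minute SLO.
const lagsMin = jobs
  .map((j) => (j.completedAt - j.arrivedAt) / 60000)
  .sort((a, b) => a - b);
const p95Lag = lagsMin[Math.min(lagsMin.length - 1, Math.floor(lagsMin.length * 0.95))];

console.log(`p95 freshness: ${p95Lag} min (SLO: < 5 min) ${p95Lag < 5 ? '✅' : '❌'}`);
```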
Signs of toil: manual, repetitive work that should be automated. (A sketch of automating the first item follows this list.)

✅ Manually restarting failed services
✅ Provisioning new servers by hand
✅ Copy-pasting database queries for reports
✅ Manually scaling resources up/down
✅ SSHing into servers to check logs
✅ Running the same kubectl commands daily
✅ Manually updating configuration files
✅ Responding to the same alerts with the same fix
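As a flavor of what "replace the manual fix with code" can look like, here is a hedged Node.js sketch that polls a health endpoint and runs a standard kubectl restart when the service is down (the endpoint and deployment name are hypothetical):

```javascript
import { execSync } from 'node:child_process';

// Hypothetical health endpoint and deployment name: adjust for your system.
const HEALTH_URL = 'https://my-service.example.com/healthz';
const DEPLOYMENT = 'deployment/my-service';

async function checkAndHeal() {
  try {
    const res = await fetch(HEALTH_URL, { signal: AbortSignal.timeout(5000) });
    if (res.ok) return; // healthy, nothing to do
  } catch {
    // fall through: unreachable counts as unhealthy
  }
  console.log('Health check failed, restarting pods');
  // The same fix the on-call engineer used to apply by hand, now codified.
  execSync(`kubectl rollout restart ${DEPLOYMENT}`, { stdio: 'inherit' });
}

// Poll every 30 seconds instead of paging a human.
setInterval(checkAndHeal, 30_000);
```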
```markdown
# Incident Postmortem: Payment Service Outage

**Date**: 2026-01-20
**Duration**: 47 minutes
**Severity**: SEV1
**Impact**: $125K revenue lost, 15K failed transactions

## Timeline

09:42 - Alert fired: Payment API error rate 50%
09:43 - On-call engineer acknowledged
09:45 - Incident Commander assigned
09:47 - Root cause identified: Database connection pool exhausted
09:50 - Mitigation: Restarted API pods
09:55 - Error rate dropped to 5%
10:00 - Mitigation: Increased connection pool size
10:15 - Error rate back to normal (< 0.1%)
10:29 - Incident resolved

## Root Cause

Database connection pool configured for 100 connections.
Traffic spike (Black Friday sale) increased load to 5,000 RPS.
Each request held a connection for 500ms on average.
Required connections: 5,000 × 0.5 = 2,500.
Result: Connection pool exhausted → requests failed.

## What Went Well

✅ Alert fired within 30 seconds
✅ Clear escalation path
✅ Team mobilized quickly
✅ Mitigation identified in < 5 minutes

## What Went Wrong

❌ No capacity planning for Black Friday
❌ Connection pool size not monitored
❌ No autoscaling for connection pool
❌ Runbook didn't cover this scenario

## Action Items

1. [P0] Increase connection pool to 5,000 (@alice, Due: Jan 21)
2. [P0] Add connection pool saturation alert (@bob, Due: Jan 21)
3. [P1] Implement autoscaling for connection pool (@charlie, Due: Jan 25)
4. [P1] Update runbook with connection pool troubleshooting (@david, Due: Jan 27)
5. [P2] Load test before major events (@team, Due: Feb 1)

## Lessons Learned

- Monitor resource saturation, not just errors
- Load test before high-traffic events
- Connection pools are a finite resource
```
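The root-cause arithmetic is Little's law: concurrent connections ≈ arrival rate × average hold time. A quick sketch of the capacity check the action items call for:

```javascript
// Little's law: required concurrency ≈ throughput × latency.
function requiredConnections(rps, avgHoldSeconds) {
  return rps * avgHoldSeconds;
}

const poolSize = 100; // as configured before the incident
const needed = requiredConnections(5000, 0.5); // Black Friday traffic → 2,500

console.log(`Need ${needed} connections, pool has ${poolSize}`);
if (needed > poolSize) {
  console.log('Pool will be exhausted: resize it or add load shedding 🚨');
}
```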
Team of 6 engineers
Rotation: 1 week on-call, 5 weeks off

Primary On-Call:
- Responds to all alerts
- Pages: 5-10 per week expected
- Compensation: $500/week stipend

Secondary On-Call:
- Backup if the primary is unreachable
- Takes over for escalations
- Compensation: $250/week stipend

Schedule:
Week 1: Alice (primary), Bob (secondary)
Week 2: Charlie (primary), David (secondary)
Week 3: Eve (primary), Frank (secondary)
...
```markdown
## On-Call Handoff: Alice → Charlie

**Open Incidents**:
- INC-1234: Database latency spikes (SEV3, investigating)

**Ongoing Issues**:
- Redis cache hit rate dropped to 60% (normal: 90%)
- Monitoring this, no action needed yet

**Upcoming Maintenance**:
- SQL failover test scheduled for Wednesday 2am

**Tips**:
- If you see "connection timeout" errors, restart the worker pods
- Database runbook: https://wiki.company.com/db-runbook
- Slack #oncall-help if you need guidance
```
```bash
# Local test
k6 run load-test.js

# Cloud test (distributed load)
k6 cloud load-test.js

# With environment variables
k6 run -e API_TOKEN=$TOKEN load-test.js

# Output to InfluxDB + Grafana
k6 run --out influxdb=http://localhost:8086/k6 load-test.js
```
```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export default function () {
  // User lands on homepage
  http.get('https://example.com/');
  sleep(2); // User reads content

  // User searches for a product
  http.get('https://example.com/search?q=laptop');
  sleep(3); // User browses results

  // User clicks a product
  http.get('https://example.com/products/laptop-123');
  sleep(5); // User reads reviews

  // User adds to cart
  http.post('https://example.com/cart/add', { productId: 'laptop-123' });
  sleep(1);
}
```
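To run that journey under realistic load, add a ramping load profile via k6 options (the stage values here are illustrative):

```javascript
// Ramp virtual users up, hold, then ramp down: a typical load profile.
export const options = {
  stages: [
    { duration: '2m', target: 100 }, // ramp up to 100 VUs
    { duration: '5m', target: 100 }, // hold at 100 VUs
    { duration: '2m', target: 0 },   // ramp down
  ],
};
```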
Full 6-week rotation (P = primary, S = secondary):

Week 1: Alice (P), Bob (S)
Week 2: Charlie (P), David (S)
Week 3: Eve (P), Frank (S)
Week 4: Bob (P), Alice (S)
Week 5: David (P), Charlie (S)
Week 6: Frank (P), Eve (S)
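That pairing pattern can be generated rather than maintained by hand. A sketch that reproduces the six-week schedule above (names as in the example):

```javascript
// Fixed pairs; each pair serves two weeks, swapping primary and secondary.
const pairs = [['Alice', 'Bob'], ['Charlie', 'David'], ['Eve', 'Frank']];

function rotation(weeks) {
  const schedule = [];
  for (let w = 0; w < weeks; w++) {
    const [a, b] = pairs[w % pairs.length];
    const swap = Math.floor(w / pairs.length) % 2 === 1; // weeks 4-6 swap roles
    schedule.push(`Week ${w + 1}: ${swap ? b : a} (P), ${swap ? a : b} (S)`);
  }
  return schedule;
}

console.log(rotation(6).join('\n'));
```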