Track 6: Production Excellence
Operating distributed systems at scale requires specialized skills beyond just building them.

Track Duration: 28-36 hours
Modules: 5
Key Topics: Observability, Chaos Engineering, SRE, Incident Management, Capacity Planning
Module 27: Observability at Scale
The Three Pillars
┌─────────────────────────────────────────────────────────────────────────────┐
│ THREE PILLARS OF OBSERVABILITY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ LOGS │ │ METRICS │ │ TRACES │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ What happened │ │ How much/many │ │ Request journey │ │
│ │ at a point │ │ over time │ │ across services │ │
│ │ in time │ │ │ │ │ │
│ ├─────────────────┤ ├─────────────────┤ ├─────────────────┤ │
│ │ • Debug issues │ │ • Dashboards │ │ • Latency │ │
│ │ • Audit trail │ │ • Alerting │ │ breakdown │ │
│ │ • Security │ │ • Trends │ │ • Dependencies │ │
│ │ • Compliance │ │ • Capacity │ │ • Bottlenecks │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ CORRELATION IS KEY: │
│ ─────────────────── │
│ trace_id: abc123 links logs, metrics, and traces together │
│ │
│ Alert fires (metric) → Find trace_id → View logs for that trace │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
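
A minimal sketch of the correlation idea in Python: if every log line is structured and carries the request's trace_id, an alert can be followed to the trace and from the trace to the exact logs for the offending request. The field names and logger setup here are illustrative, not a prescribed schema.

import json
import logging
import sys
import uuid

# Structured-logging sketch: every log line carries a trace_id so an alert
# (metric) can be correlated with the trace and with these log entries.
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")

def log_event(message: str, trace_id: str, **fields) -> None:
    """Emit one JSON log line tagged with the current trace_id."""
    log.info(json.dumps({"message": message, "trace_id": trace_id, **fields}))

trace_id = uuid.uuid4().hex          # normally taken from the incoming request
log_event("order received", trace_id, order_id=42)
log_event("payment failed", trace_id, order_id=42, error="card_declined")
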
Distributed Tracing
┌─────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED TRACING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ REQUEST FLOW: │
│ │
│ User │
│ │ │
│ │ trace_id: abc123 │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ API Gateway (span: 0-450ms) │ │
│ │ │ │ │
│ │ ├──► Auth Service (span: 10-50ms) │ │
│ │ │ │ │
│ │ ├──► Order Service (span: 60-300ms) │ │
│ │ │ │ │ │
│ │ │ ├──► Inventory DB (span: 80-150ms) │ │
│ │ │ │ │ │
│ │ │ └──► Payment Service (span: 160-290ms) │ │
│ │ │ │ │ │
│ │ │ └──► Stripe API (span: 170-280ms) ← Bottleneck! │ │
│ │ │ │ │
│ │ └──► Notification Service (span: 310-440ms) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ TRACE CONTEXT PROPAGATION: │
│ ────────────────────────── │
│ HTTP Header: traceparent: 00-abc123-def456-01 │
│ │
│ W3C Trace Context format: │
│ 00 - version │
│ abc123 - trace-id (request identifier) │
│ def456 - parent-id (span identifier) │
│ 01 - flags (sampled) │
│ │
│ TOOLS: Jaeger, Zipkin, AWS X-Ray, Datadog APM, Honeycomb │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
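
The header format is simple enough to handle by hand. Below is a small Python sketch of parsing a traceparent header and minting a child span ID for an outgoing call; in practice an OpenTelemetry or vendor SDK does this propagation for you, and the helper names here are made up for illustration.

import secrets

def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}

def child_traceparent(incoming: str) -> str:
    """Keep the trace-id, mint a new span-id for the outgoing call."""
    ctx = parse_traceparent(incoming)
    new_span_id = secrets.token_hex(8)     # 16 hex chars, per the spec
    return f"00-{ctx['trace_id']}-{new_span_id}-{ctx['flags']}"

incoming = "00-" + secrets.token_hex(16) + "-" + secrets.token_hex(8) + "-01"
print(child_traceparent(incoming))
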
Metrics and Alerting
┌─────────────────────────────────────────────────────────────────────────────┐
│ METRICS BEST PRACTICES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ USE (Brendan Gregg's method): │
│ ──────────────────────────── │
│ Utilization: % of resource used │
│ Saturation: Queue depth, backpressure │
│ Errors: Error count/rate │
│ │
│ RED (for services): │
│ ───────────────── │
│ Rate: Requests per second │
│ Errors: Error rate │
│ Duration: Latency distribution (p50, p95, p99) │
│ │
│ THE FOUR GOLDEN SIGNALS (Google SRE): │
│ ───────────────────────────────────── │
│ 1. Latency: Time to serve request │
│ 2. Traffic: Requests per second │
│ 3. Errors: Rate of failed requests │
│ 4. Saturation: How full is your system? │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ ALERTING BEST PRACTICES: │
│ ──────────────────────── │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ GOOD ALERTS │ BAD ALERTS │ │
│ ├────────────────────────────────────────────────────────────────────┤ │
│ │ Error rate > 1% for 5 min │ CPU > 80% │ │
│ │ p99 latency > 500ms │ Single node down │ │
│ │ Error budget < 10% │ Disk usage > 70% │ │
│ │ Zero successful payments │ Memory usage high │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ ALERT ON SYMPTOMS, NOT CAUSES │
│ "Users are affected" not "CPU is high" │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
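
As a sketch of "alert on symptoms", the Python snippet below fires only when the error rate stays above 1% for a full five-minute window, rather than paging on a single bad data point. The class and thresholds are illustrative; a real deployment would express this as a rule in Prometheus or a similar system.

from collections import deque

class ErrorRateAlert:
    """Fire only when the error rate stays above the threshold for the whole
    window, i.e. alert on a sustained symptom rather than a momentary blip."""

    def __init__(self, threshold: float = 0.01, window_minutes: int = 5):
        self.threshold = threshold
        self.window = deque(maxlen=window_minutes)   # one (errors, total) per minute

    def record_minute(self, errors: int, total: int) -> bool:
        self.window.append((errors, total))
        if len(self.window) < self.window.maxlen:
            return False                             # not enough data yet
        return all(t > 0 and e / t > self.threshold for e, t in self.window)

alert = ErrorRateAlert()
for _ in range(6):
    firing = alert.record_minute(errors=30, total=2000)   # 1.5% error rate
print("page on-call!" if firing else "ok")
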
SLIs, SLOs, and Error Budgets
┌─────────────────────────────────────────────────────────────────────────────┐
│ SLIs, SLOs, ERROR BUDGETS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SLI (Service Level Indicator): │
│ ────────────────────────────── │
│ Quantitative measure of service level │
│ │
│ Examples: │
│ • Request latency (p99 < 200ms) │
│ • Availability (successful requests / total requests) │
│ • Throughput (requests per second) │
│ │
│ SLO (Service Level Objective): │
│ ────────────────────────────── │
│ Target value or range for an SLI │
│ │
│ Examples: │
│ • 99.9% of requests complete in < 200ms │
│ • 99.95% availability per month │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ ERROR BUDGET: │
│ ───────────── │
│ Error Budget = 1 - SLO │
│ │
│ SLO: 99.9% availability │
│ Error Budget: 0.1% downtime allowed │
│ │
│ Per month (30 days): │
│ 0.1% × 30 × 24 × 60 = 43.2 minutes of allowed downtime │
│ │
│ BURN RATE: │
│ ────────── │
│ How fast are you consuming error budget? │
│ │
│ Burn rate 1.0 = Using budget at expected rate │
│ Burn rate 2.0 = Using budget 2x as fast (will exhaust in 15 days) │
│ Burn rate 10.0 = Critical! Budget gone in 3 days │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ ERROR BUDGET VISUALIZATION │ │
│ │ │ │
│ │ 100% ┤█████████████████████████████████████████████████████ │ │
│ │ │█████████████████████████████████████████████ │ │
│ │ 50% ┤████████████████████████████ │ │
│ │ │██████████████████████ ← Budget running low! │ │
│ │ 0% ┼──────────────────────────────────────────────────────── │ │
│ │ Day 1 Day 7 Day 14 Day 21 Day 28 Day 30 │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
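
The arithmetic is simple enough to sanity-check by hand; here is a small Python sketch of the budget and burn-rate formulas above (function names are illustrative).

def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Error budget = 1 - SLO, expressed as minutes of downtime per period."""
    return (1 - slo) * days * 24 * 60

def burn_rate(budget_consumed_fraction: float, period_elapsed_fraction: float) -> float:
    """1.0 = on track; 2.0 = budget will run out halfway through the period."""
    return budget_consumed_fraction / period_elapsed_fraction

print(allowed_downtime_minutes(0.999))   # 43.2 minutes for 99.9% over 30 days
print(burn_rate(budget_consumed_fraction=0.5, period_elapsed_fraction=0.25))  # 2.0
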
Module 28: Chaos Engineering
Netflix’s Approach
┌─────────────────────────────────────────────────────────────────────────────┐
│ CHAOS ENGINEERING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ "The discipline of experimenting on a system to build confidence │
│ in its capability to withstand turbulent conditions in production." │
│ │
│ NETFLIX SIMIAN ARMY: │
│ ──────────────────── │
│ │
│ 🐒 Chaos Monkey: Kill random instances │
│ 🦍 Chaos Gorilla: Kill entire availability zone │
│ 🦧 Chaos Kong: Kill entire region │
│ ⏰ Latency Monkey: Inject network latency │
│ 📝 Conformity Monkey: Find non-conforming instances │
│ 🩺 Doctor Monkey: Health checks │
│ 🧹 Janitor Monkey: Clean up unused resources │
│ Security Monkey: Find security issues │
│ │
│ PRINCIPLES: │
│ ─────────── │
│ 1. Start with hypothesis about steady state │
│ 2. Vary real-world events (failure, traffic spike) │
│ 3. Run experiments in production │
│ 4. Automate experiments to run continuously │
│ 5. Minimize blast radius │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Designing Chaos Experiments
┌─────────────────────────────────────────────────────────────────────────────┐
│ CHAOS EXPERIMENT DESIGN │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ EXPERIMENT TEMPLATE: │
│ ──────────────────── │
│ │
│ 1. HYPOTHESIS │
│ "When database X fails, service Y will failover to replica │
│ within 30 seconds with no user-visible errors" │
│ │
│ 2. STEADY STATE │
│ - Error rate < 0.1% │
│ - P99 latency < 200ms │
│ - Orders per minute: ~1000 │
│ │
│ 3. EXPERIMENT │
│ - Target: Production database primary │
│ - Action: Kill database process │
│ - Duration: Until failover complete │
│ │
│ 4. BLAST RADIUS CONTROL │
│ - Run during low traffic (2 AM) │
│ - Limit to 5% of users (feature flag) │
│ - Automatic rollback if error rate > 5% │
│ │
│ 5. METRICS TO WATCH │
│ - Database connection errors │
│ - Order completion rate │
│ - Failover detection time │
│ │
│ 6. ABORT CONDITIONS │
│ - Error rate > 5% │
│ - No failover after 60 seconds │
│ - Manual abort button │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
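
A guard like the one sketched below can enforce the abort conditions automatically instead of relying on someone watching a dashboard. This is a minimal Python sketch; get_error_rate and failover_complete are placeholders for queries against your real monitoring and orchestration systems.

import time

ERROR_RATE_ABORT = 0.05        # abort if error rate exceeds 5%
FAILOVER_DEADLINE_S = 60       # abort if no failover within 60 seconds

def get_error_rate() -> float:
    """Placeholder: in a real experiment, query your metrics system."""
    return 0.002

def failover_complete() -> bool:
    """Placeholder: e.g. check that a replica has been promoted."""
    return False

def run_experiment_guard() -> None:
    """Call right after injecting the failure; rolls back if either abort
    condition trips before failover completes."""
    start = time.time()
    while not failover_complete():
        if get_error_rate() > ERROR_RATE_ABORT:
            print("ABORT: error rate above 5%, rolling back")
            return
        if time.time() - start > FAILOVER_DEADLINE_S:
            print("ABORT: no failover within 60 seconds, rolling back")
            return
        time.sleep(5)
    print("Failover completed, experiment passed")
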
Failure Injection Types
┌────────────────────┬────────────────────────────────────────────────────────┐
│ Failure Type │ How to Inject │
├────────────────────┼────────────────────────────────────────────────────────┤
│ Process crash │ kill -9, container stop │
│ Instance failure │ Terminate EC2, pod delete │
│ Zone failure │ Block traffic to zone, DNS manipulation │
│ Region failure │ Failover entire region │
│ Network latency │ tc netem, iptables delay │
│ Network partition │ iptables DROP, security groups │
│ Packet loss │ tc netem loss │
│ CPU stress │ stress-ng, burn CPU │
│ Memory pressure │ stress-ng --vm │
│ Disk full │ dd if=/dev/zero of=/tmp/fill │
│ Disk slow │ dm-delay, slow filesystem │
│ Clock skew │ date --set, chronyd manipulation │
│ DNS failure │ Block port 53, bad DNS response │
│ Certificate expiry │ Use expired cert, revoke cert │
└────────────────────┴────────────────────────────────────────────────────────┘
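
Most of these injections are one-line shell commands; a thin wrapper like the Python sketch below makes them scriptable and, more importantly, easy to undo. It assumes root privileges and an interface named eth0, both of which will differ in your environment.

import subprocess

def inject_latency(interface: str = "eth0", delay_ms: int = 100) -> None:
    """Add fixed network latency on an interface via tc netem (requires root)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem",
         "delay", f"{delay_ms}ms"],
        check=True,
    )

def clear_latency(interface: str = "eth0") -> None:
    """Remove the netem qdisc, restoring normal latency."""
    subprocess.run(
        ["tc", "qdisc", "del", "dev", interface, "root", "netem"],
        check=True,
    )
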
Module 29: SRE Practices
Error Budgets and Release Velocity
┌─────────────────────────────────────────────────────────────────────────────┐
│ ERROR BUDGETS IN PRACTICE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ERROR BUDGET POLICY: │
│ ──────────────────── │
│ │
│ If Error Budget > 50%: │
│ ✓ Deploy normally │
│ ✓ Run experiments │
│ ✓ Add new features │
│ │
│ If Error Budget 10-50%: │
│ ⚠ Slower deployments │
│ ⚠ Extra review for risky changes │
│ ⚠ Focus on reliability improvements │
│ │
│ If Error Budget < 10%: │
│ 🛑 Feature freeze │
│ 🛑 Only reliability fixes deployed │
│ 🛑 Postmortem required for any new issues │
│ │
│ If Error Budget Exhausted: │
│ 🚨 All hands on reliability │
│ 🚨 Executive escalation │
│ 🚨 No deploys until budget restored │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ BALANCING ACT: │
│ ────────────── │
│ │
│ Too much budget remaining = not innovating fast enough │
│ Too little budget = reliability suffering │
│ │
│ Target: Consume ~100% of budget by end of period │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
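
A policy like this is easy to encode so that CI/CD can enforce it mechanically. The Python sketch below maps remaining budget to a release posture using the tiers above; the return strings are illustrative.

def release_policy(budget_remaining: float) -> str:
    """Map remaining error budget (0.0-1.0) to a release posture."""
    if budget_remaining <= 0:
        return "freeze: no deploys until budget restored"
    if budget_remaining < 0.10:
        return "feature freeze: reliability fixes only"
    if budget_remaining < 0.50:
        return "caution: slower deploys, extra review"
    return "normal: deploy and experiment freely"

print(release_policy(0.35))   # caution: slower deploys, extra review
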
On-Call Best Practices
┌─────────────────────────────────────────────────────────────────────────────┐
│ ON-CALL BEST PRACTICES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ROTATION STRUCTURE: │
│ ─────────────────── │
│ Primary → Secondary → Escalation Manager │
│ │
│ Primary: First responder (5 min SLA) │
│ Secondary: Backup if primary unavailable │
│ Escalation: For cross-team or executive decisions │
│ │
│ SUSTAINABLE ON-CALL: │
│ ──────────────────── │
│ ✓ Max 1 week in 4 on-call │
│ ✓ Max 2 incidents per 12-hour shift │
│ ✓ Time off after tough on-call shifts │
│ ✓ Clear escalation paths │
│ ✓ Good runbooks │
│ │
│ REDUCING TOIL: │
│ ────────────── │
│ Goal: < 50% of SRE time on toil │
│ │
│ Toil = Manual, repetitive, automatable, no lasting value │
│ │
│ Examples of toil: │
│ • Manual deployments │
│ • Restarting services │
│ • Certificate rotation │
│ • Capacity additions │
│ │
│ Fix: Automate everything that can be automated │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Progressive Rollouts
┌─────────────────────────────────────────────────────────────────────────────┐
│ PROGRESSIVE ROLLOUTS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CANARY DEPLOYMENT: │
│ ────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Load Balancer │ │
│ │ │ │ │
│ │ ┌────────────┼────────────┐ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │ v1.0 │ │ v1.0 │ │ v1.1 │ ← Canary (5%) │ │
│ │ │ (47.5%)│ │ (47.5%)│ │ (5%) │ │ │
│ │ └────────┘ └────────┘ └────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ STAGES: │
│ ─────── │
│ Stage 1: 1% → Verify no crashes (1 hour) │
│ Stage 2: 5% → Check metrics (2 hours) │
│ Stage 3: 25% → Broader validation (4 hours) │
│ Stage 4: 50% → Almost there (8 hours) │
│ Stage 5: 100% → Full rollout │
│ │
│ AUTOMATIC ROLLBACK TRIGGERS: │
│ ──────────────────────────── │
│ • Error rate > baseline + 0.5% │
│ • P99 latency > baseline + 100ms │
│ • Memory usage > 90% │
│ • Any crash loop │
│ │
│ BLUE-GREEN DEPLOYMENT: │
│ ────────────────────── │
│ │
│ Before: LB → Blue (v1.0) [Active] │
│ Green (v1.1) [Staged] │
│ │
│ After: LB → Green (v1.1) [Active] │
│ Blue (v1.0) [Standby for rollback] │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
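
The rollback triggers are just threshold comparisons against the baseline; here is a Python sketch (metric names and sample values are illustrative).

def should_rollback(baseline: dict, canary: dict) -> bool:
    """Compare canary metrics to baseline using the trigger thresholds above."""
    if canary["error_rate"] > baseline["error_rate"] + 0.005:      # +0.5%
        return True
    if canary["p99_latency_ms"] > baseline["p99_latency_ms"] + 100:
        return True
    if canary["memory_utilization"] > 0.90:
        return True
    if canary["crash_looping"]:
        return True
    return False

baseline = {"error_rate": 0.001, "p99_latency_ms": 180,
            "memory_utilization": 0.55, "crash_looping": False}
canary   = {"error_rate": 0.002, "p99_latency_ms": 320,
            "memory_utilization": 0.60, "crash_looping": False}
print(should_rollback(baseline, canary))   # True: p99 regressed by more than 100ms
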
Module 30: Incident Management
Incident Response Framework
┌─────────────────────────────────────────────────────────────────────────────┐
│ INCIDENT RESPONSE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SEVERITY LEVELS: │
│ ──────────────── │
│ │
│ SEV1 (Critical): │
│ • Complete service outage │
│ • Data loss or security breach │
│ • All hands, exec involvement │
│ │
│ SEV2 (High): │
│ • Major feature unavailable │
│ • Significant user impact │
│ • Dedicated incident team │
│ │
│ SEV3 (Medium): │
│ • Partial degradation │
│ • Limited user impact │
│ • On-call handles │
│ │
│ SEV4 (Low): │
│ • Minor issue │
│ • Workaround available │
│ • Fix in normal sprint │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ INCIDENT ROLES: │
│ ─────────────── │
│ │
│ Incident Commander (IC): │
│ • Coordinates response │
│ • Makes decisions │
│ • NOT doing technical work │
│ │
│ Communications Lead: │
│ • Updates status page │
│ • Communicates with stakeholders │
│ • Manages customer messaging │
│ │
│ Technical Lead: │
│ • Leads technical investigation │
│ • Coordinates engineering response │
│ │
│ Scribe: │
│ • Documents timeline │
│ • Records decisions and actions │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Postmortem Culture
┌─────────────────────────────────────────────────────────────────────────────┐
│ BLAMELESS POSTMORTEMS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PRINCIPLES: │
│ ─────────── │
│ │
│ 1. BLAMELESS: Focus on systems, not individuals │
│ BAD: "John pushed bad code" │
│ GOOD: "Our CI pipeline didn't catch the regression" │
│ │
│ 2. LEARN: Extract lessons, don't just assign blame │
│ │
│ 3. SHARE: Postmortems are public within company │
│ │
│ 4. ACT: Action items must be tracked and completed │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ POSTMORTEM TEMPLATE: │
│ ──────────────────── │
│ │
│ ## Incident Summary │
│ • Duration: 2 hours 15 minutes │
│ • Impact: 40% of users saw errors │
│ • Severity: SEV2 │
│ │
│ ## Timeline │
│ 14:30 - Deploy of commit abc123 │
│ 14:35 - Error rate alerts fire │
│ 14:40 - On-call acknowledges │
│ 14:55 - Root cause identified │
│ 15:10 - Rollback initiated │
│ 15:15 - Service restored │
│ 16:45 - Incident closed │
│ │
│ ## Root Cause │
│ Database connection pool exhausted due to missing timeout │
│ │
│ ## What Went Well │
│ • Fast detection (5 min) │
│ • Clear runbook │
│ │
│ ## What Went Poorly │
│ • No canary caught the issue │
│ • Rollback took 20 min (should be 5) │
│ │
│ ## Action Items │
│ 1. [P0] Add connection pool monitoring - @alice - Due 1/15 │
│ 2. [P1] Improve canary coverage - @bob - Due 1/20 │
│ 3. [P2] Speed up rollback - @charlie - Due 1/30 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Module 31: Capacity Planning
Load Testing
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOAD TESTING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TYPES OF TESTS: │
│ ─────────────── │
│ │
│ LOAD TEST: Normal/expected load │
│ ──────────────────────────── │
│ RPS ▲ │
│ │ ┌───────────────────────┐ │
│ 1000 │─────┤ Sustained 1000 RPS │───── │
│ │ └───────────────────────┘ │
│ └──────────────────────────────────────► Time │
│ │
│ STRESS TEST: Find breaking point │
│ ──────────────────────────────── │
│ RPS ▲ │
│ │ ╱ Break! │
│ 5000 │ ╱ │
│ │ ╱ │
│ 2500 │ ╱ │
│ │ ╱ │
│ 1000 │─────────────────╱ │
│ └──────────────────────────────────────► Time │
│ │
│ SOAK TEST: Long duration stability │
│ ────────────────────────────────── │
│ RPS ▲ │
│ │ ┌──────────────────────────────────────────┐ │
│ 1000 │─────┤ 24 hours at 1000 RPS │ │
│ │ └──────────────────────────────────────────┘ │
│ └──────────────────────────────────────────────────► Time │
│ Look for: Memory leaks, connection exhaustion │
│ │
│ SPIKE TEST: Sudden traffic burst │
│ ──────────────────────────────── │
│ RPS ▲ │
│ │ ╱╲ │
│ 5000 │ ╱ ╲ │
│ │ ╱ ╲ │
│ 1000 │───────╱ ╲─────── │
│ └──────────────────────────────────────► Time │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
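
Dedicated tools such as k6, Locust, or Gatling are the usual choice, but the core loop is small. Below is a stdlib-only Python sketch of a closed-loop load test against an assumed local endpoint, reporting p50/p99 latency; the URL, request count, and concurrency are placeholders.

import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "http://localhost:8080/health"   # assumed endpoint, replace with yours

def one_request() -> float:
    """Issue a single GET and return its latency in milliseconds."""
    start = time.time()
    urllib.request.urlopen(TARGET_URL, timeout=5).read()
    return (time.time() - start) * 1000

def load_test(total_requests: int = 500, concurrency: int = 20) -> None:
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: one_request(), range(total_requests)))
    print(f"p50={statistics.median(latencies):.1f}ms  "
          f"p99={latencies[int(len(latencies) * 0.99) - 1]:.1f}ms")

if __name__ == "__main__":
    load_test()
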
Capacity Modeling
┌─────────────────────────────────────────────────────────────────────────────┐
│ CAPACITY MODELING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ STEP 1: UNDERSTAND CURRENT STATE │
│ ───────────────────────────────── │
│ • Current RPS: 10,000 │
│ • Current instances: 20 │
│ • CPU utilization: 40% │
│ • RPS per instance: 500 │
│ │
│ STEP 2: DEFINE HEADROOM │
│ ──────────────────────── │
│ • Target CPU: 60% (buffer for spikes) │
│ • Available headroom: 20% (60% - 40%) │
│ • Max RPS at current capacity: 15,000 │
│ │
│ STEP 3: PROJECT GROWTH │
│ ──────────────────────── │
│ • Current: 10,000 RPS │
│ • Growth rate: 10% per month │
│ • In 6 months: 17,716 RPS │
│ • In 12 months: 31,384 RPS │
│ │
│ STEP 4: PLAN ADDITIONS │
│ ──────────────────────── │
│ │
│ Month 1-4: Current capacity sufficient │
│ Month 5: Need 25 instances (add 5) │
│ Month 8: Need 32 instances (add 7) │
│ Month 12: Need 42 instances (add 10) │
│ │
│ STEP 5: BUFFER FOR EVENTS │
│ ───────────────────────── │
│ • Black Friday: 3x normal traffic │
│ • Need 3x capacity for 48 hours │
│ • Pre-scale 1 week before │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
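
The projection is compound growth divided by per-instance headroom. Below is a Python sketch of that arithmetic; the exact instance counts in the plan above will differ slightly depending on how much extra buffer you round up with.

import math

def projected_instances(current_rps: float, growth_per_month: float, months: int,
                        rps_per_instance: float = 500,
                        current_utilization: float = 0.40,
                        target_utilization: float = 0.60) -> int:
    """Project instances needed after N months of compound growth.
    An instance serving 500 RPS at 40% CPU can take ~750 RPS at the 60% target."""
    future_rps = current_rps * (1 + growth_per_month) ** months
    max_rps_per_instance = rps_per_instance * (target_utilization / current_utilization)
    return math.ceil(future_rps / max_rps_per_instance)

for month in (6, 12):
    print(month, projected_instances(10_000, 0.10, month))
# 6 months: ~17,716 RPS -> 24 instances; 12 months: ~31,384 RPS -> 42 instances
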
Autoscaling Strategies
┌─────────────────────────────────────────────────────────────────────────────┐
│ AUTOSCALING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SCALING METRICS: │
│ ──────────────── │
│ │
│ CPU-based: │
│ Scale up when: avg CPU > 70% │
│ Scale down when: avg CPU < 30% │
│ │
│ Request-based: │
│ Target: 1000 RPS per instance │
│ Current: 3000 RPS, 2 instances │
│ Action: Scale to 3 instances │
│ │
│ Queue-based: │
│ Scale up when: queue depth > 1000 │
│ Scale down when: queue empty for 5 min │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ PREDICTIVE SCALING: │
│ ─────────────────── │
│ │
│ Traffic ▲ Expected Actual │
│ │ ───────── ──────── │
│ │ ╱╲ ╱╲ │
│ │ ╱ ╲ ╱ ╲ │
│ │ ╱ ╲ ╱ ╲ │
│ │───╱────────────╲╱──────────── │
│ │ 9AM 12PM 3PM │
│ └──────────────────────────────────► Time │
│ │
│ Scale BEFORE traffic arrives based on historical patterns │
│ │
│ COOLDOWN PERIODS: │
│ ───────────────── │
│ Scale up cooldown: 3 minutes │
│ Scale down cooldown: 10 minutes │
│ Prevents thrashing (scale up/down/up/down) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
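
Request-based scaling reduces to a ceiling division plus some cooldown bookkeeping, similar in spirit to the Kubernetes HPA formula. A Python sketch using the illustrative numbers above:

import math
import time

class Autoscaler:
    """Desired = ceil(total_rps / target_rps_per_instance), with separate
    scale-up and scale-down cooldowns so the fleet does not thrash."""

    def __init__(self, target_rps_per_instance: float = 1000,
                 scale_up_cooldown_s: int = 180, scale_down_cooldown_s: int = 600):
        self.target = target_rps_per_instance
        self.up_cooldown = scale_up_cooldown_s
        self.down_cooldown = scale_down_cooldown_s
        self.last_scaled = 0.0

    def desired_instances(self, total_rps: float, current: int) -> int:
        desired = max(1, math.ceil(total_rps / self.target))
        cooldown = self.up_cooldown if desired > current else self.down_cooldown
        now = time.time()
        if desired != current and now - self.last_scaled >= cooldown:
            self.last_scaled = now
            return desired
        return current            # still cooling down, hold steady

scaler = Autoscaler()
print(scaler.desired_instances(total_rps=3000, current=2))   # scales to 3
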
Key Interview Questions
Q: How would you debug a latency spike across 100 services?
Systematic approach:
- Check dashboards first
  - Which services show elevated latency?
  - When did it start? Correlate with deployments
  - Scope: All users or specific segment?
- Use distributed tracing
  - Find slow traces
  - Identify which span is slowest
  - Look for patterns (specific service, DB, external API)
- Drill down
  - Check that service’s metrics (CPU, memory, connections)
  - Check dependencies (DB latency, cache hit rate)
  - Check for new error types in logs
- Common culprits
  - Database slow queries
  - Cache miss spike
  - Connection pool exhaustion
  - Garbage collection
  - Noisy neighbor (shared resources)
  - External API degradation
- Mitigation while investigating
  - Scale up if resource-bound
  - Enable circuit breaker
  - Failover to backup
Q: Design a chaos engineering program for your team
PHASE 1: FOUNDATION (Month 1-2)
─────────────────────────────
• Define steady state metrics
• Set up experiment tracking
• Train team on principles
• Start with game days (manual)
PHASE 2: BASIC EXPERIMENTS (Month 3-4)
──────────────────────────────────────
• Instance failures
• Network latency injection
• Run in staging first
• Graduate to production with small blast radius
PHASE 3: ADVANCED (Month 5-6)
─────────────────────────────
• Zone failures
• Database failover
• Certificate expiry
• Clock skew
PHASE 4: AUTOMATION (Month 7+)
──────────────────────────────
• Continuous chaos in production
• Automatic rollback on failure
• Integrate with CI/CD
• Coverage reporting
KEY PRINCIPLES:
• Always have rollback plan
• Start small, expand gradually
• Document everything
• Celebrate finding issues!
Q: How do you handle being paged 5 times a night?
Short-term mitigation:
- Analyze page patterns
- Suppress non-actionable alerts
- Add a secondary on-call
- Improve runbooks for faster resolution

Long-term fixes:
- Automate common remediations
- Add better monitoring to prevent issues
- Fix underlying reliability issues
- Add circuit breakers, retries
- Improve capacity planning
- Push back on feature work when the error budget is being missed

Process:
- Track alert metrics (pages per week)
- Review every page in team meeting
- Goal: < 2 pages per on-call shift
- Escalate if consistently exceeded
Q: How do you prepare for a 10x traffic spike?
PREPARATION TIMELINE:
T-4 weeks:
• Load test current capacity
• Identify bottlenecks
• Order additional capacity
T-2 weeks:
• Pre-scale databases (can't autoscale fast)
• Increase cache capacity
• Add read replicas
• Pre-warm caches
T-1 week:
• Scale compute to 3x normal
• Verify autoscaling works
• Test circuit breakers
• Prepare runbooks
T-1 day:
• Final scale to target (10x)
• War room ready
• All hands available
During event:
• Monitor dashboards
• Quick decisions on feature flags
• Prepared to shed load (graceful degradation)
After:
• Scale down gradually (30% per hour)
• Postmortem any issues
• Update capacity model
Capstone: Interview Preparation
Practice System Design Problems
Apply everything you’ve learned:
- Design Uber’s dispatch system
  - Real-time location tracking
  - Matching drivers to riders
  - Surge pricing
- Design Stripe’s payment processing
  - Exactly-once payments
  - Idempotency
  - Saga pattern for complex transactions
- Design Netflix’s video streaming
  - CDN architecture
  - Adaptive bitrate
  - Regional failover
- Design Twitter’s timeline
  - Fan-out on write vs read
  - Celebrity problem
  - Real-time updates
Congratulations!
You’ve completed the Distributed Systems Mastery course. You now have the knowledge to:
- ✅ Explain consensus protocols (Raft, Paxos) in interviews
- ✅ Design systems with appropriate consistency guarantees
- ✅ Choose between replication strategies
- ✅ Implement distributed transactions correctly
- ✅ Build and operate systems at massive scale
- ✅ Debug complex distributed systems issues
Next steps: Practice! Do mock interviews, build projects, and read the papers referenced throughout this course.