Track 6: Production Excellence

Operating distributed systems at scale requires specialized skills beyond those needed to build them.
Track Duration: 28-36 hours
Modules: 5
Key Topics: Observability, Chaos Engineering, SRE, Incident Management, Capacity Planning

Module 27: Observability at Scale

The Three Pillars

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THREE PILLARS OF OBSERVABILITY                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐                │
│  │     LOGS        │ │    METRICS      │ │    TRACES       │                │
│  ├─────────────────┤ ├─────────────────┤ ├─────────────────┤                │
│  │ What happened   │ │ How much/many   │ │ Request journey │                │
│  │ at a point      │ │ over time       │ │ across services │                │
│  │ in time         │ │                 │ │                 │                │
│  ├─────────────────┤ ├─────────────────┤ ├─────────────────┤                │
│  │ • Debug issues  │ │ • Dashboards    │ │ • Latency       │                │
│  │ • Audit trail   │ │ • Alerting      │ │   breakdown     │                │
│  │ • Security      │ │ • Trends        │ │ • Dependencies  │                │
│  │ • Compliance    │ │ • Capacity      │ │ • Bottlenecks   │                │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘                │
│                                                                              │
│  CORRELATION IS KEY:                                                        │
│  ───────────────────                                                        │
│  trace_id: abc123 links logs, metrics, and traces together                 │
│                                                                              │
│  Alert fires (metric) → Find trace_id → View logs for that trace          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
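The correlation workflow in the box (alert fires → find trace_id → view logs) can be sketched in a few lines of Python. `log_event` and `logs_for_trace` are illustrative names, not any particular logging library's API; the point is that a shared trace_id is the join key across pillars:

```python
import json


def log_event(message: str, trace_id: str) -> str:
    """Emit one structured (JSON) log line; trace_id is the join key
    that links this log to metrics and traces for the same request."""
    return json.dumps({"message": message, "trace_id": trace_id})


def logs_for_trace(log_lines: list[str], trace_id: str) -> list[dict]:
    """The 'alert fires -> find trace_id -> view logs' step: given raw
    log lines, keep only the records belonging to one trace."""
    records = (json.loads(line) for line in log_lines)
    return [r for r in records if r.get("trace_id") == trace_id]
```

In a real system the filtering happens in a log backend (Loki, Elasticsearch, etc.), but the query is conceptually this same equality match on trace_id.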

Distributed Tracing

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TRACING                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  REQUEST FLOW:                                                              │
│                                                                              │
│  User                                                                       │
│   │                                                                         │
│   │ trace_id: abc123                                                        │
│   ▼                                                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ API Gateway (span: 0-450ms)                                         │    │
│  │   │                                                                 │    │
│  │   ├──► Auth Service (span: 10-50ms)                                │    │
│  │   │                                                                 │    │
│  │   ├──► Order Service (span: 60-300ms)                              │    │
│  │   │      │                                                          │    │
│  │   │      ├──► Inventory DB (span: 80-150ms)                        │    │
│  │   │      │                                                          │    │
│  │   │      └──► Payment Service (span: 160-290ms)                    │    │
│  │   │             │                                                   │    │
│  │   │             └──► Stripe API (span: 170-280ms) ← Bottleneck!   │    │
│  │   │                                                                 │    │
│  │   └──► Notification Service (span: 310-440ms)                      │    │
│  │                                                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  TRACE CONTEXT PROPAGATION:                                                 │
│  ──────────────────────────                                                 │
│  HTTP Header: traceparent: 00-abc123-def456-01                             │
│                                                                              │
│  W3C Trace Context format:                                                  │
│  00          - version                                                      │
│  abc123      - trace-id (request identifier)                               │
│  def456      - parent-id (span identifier)                                 │
│  01          - flags (sampled)                                              │
│                                                                              │
│  TOOLS: Jaeger, Zipkin, AWS X-Ray, Datadog APM, Honeycomb                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
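Parsing the traceparent header is a simple split on `-`. Note that the ids above are abbreviated for readability; the W3C spec uses a 32-hex-character trace-id and a 16-hex-character parent-id. A minimal sketch (`parse_traceparent` is a hypothetical helper, not an OpenTelemetry API):

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields.
    Assumes a well-formed header; real ids are 32 (trace) and
    16 (parent) hex chars -- the short values in the example
    above are abbreviations."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        # bit 0 of the flags byte is the 'sampled' flag
        "sampled": int(flags, 16) & 0x01 == 1,
    }
```

Each downstream service reads this header, creates its own span with the incoming parent-id as parent, and forwards the header with its own span id substituted in.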

Metrics and Alerting

┌─────────────────────────────────────────────────────────────────────────────┐
│                    METRICS BEST PRACTICES                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  USE (Brendan Gregg's method):                                              │
│  ────────────────────────────                                               │
│  Utilization: % of resource used                                            │
│  Saturation: Queue depth, backpressure                                      │
│  Errors: Error count/rate                                                   │
│                                                                              │
│  RED (for services):                                                        │
│  ─────────────────                                                          │
│  Rate: Requests per second                                                  │
│  Errors: Error rate                                                         │
│  Duration: Latency distribution (p50, p95, p99)                            │
│                                                                              │
│  THE FOUR GOLDEN SIGNALS (Google SRE):                                      │
│  ─────────────────────────────────────                                      │
│  1. Latency: Time to serve request                                          │
│  2. Traffic: Requests per second                                            │
│  3. Errors: Rate of failed requests                                         │
│  4. Saturation: How full is your system?                                    │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  ALERTING BEST PRACTICES:                                                   │
│  ────────────────────────                                                   │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ GOOD ALERTS                     │ BAD ALERTS                       │     │
│  ├────────────────────────────────────────────────────────────────────┤     │
│  │ Error rate > 1% for 5 min       │ CPU > 80%                        │     │
│  │ p99 latency > 500ms             │ Single node down                 │     │
│  │ Error budget < 10%              │ Disk usage > 70%                 │     │
│  │ Zero successful payments        │ Memory usage high                │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  ALERT ON SYMPTOMS, NOT CAUSES                                              │
│  "Users are affected" not "CPU is high"                                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
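The RED numbers can be computed from a window of raw request records. A sketch assuming each record carries a `latency_ms` value and an `ok` flag (illustrative field names), using nearest-rank percentiles over a fixed 60-second window:

```python
import math


def red_metrics(requests: list[dict]) -> dict:
    """Rate / Errors / Duration for one non-empty 60-second window.
    Each record: {"latency_ms": float, "ok": bool}."""
    window_s = 60
    latencies = sorted(r["latency_ms"] for r in requests)
    errors = sum(1 for r in requests if not r["ok"])

    def pct(p: float) -> float:
        # nearest-rank percentile over the sorted latencies
        idx = max(0, math.ceil(p / 100 * len(latencies)) - 1)
        return latencies[idx]

    return {
        "rate_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }
```

Production systems compute these with histograms or sketches (e.g. Prometheus histogram buckets) rather than sorting raw samples, but the definitions are the same.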

SLIs, SLOs, and Error Budgets

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SLIs, SLOs, ERROR BUDGETS                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SLI (Service Level Indicator):                                             │
│  ──────────────────────────────                                             │
│  Quantitative measure of service level                                      │
│                                                                              │
│  Examples:                                                                  │
│  • Request latency (p99 < 200ms)                                           │
│  • Availability (successful requests / total requests)                      │
│  • Throughput (requests per second)                                         │
│                                                                              │
│  SLO (Service Level Objective):                                             │
│  ──────────────────────────────                                             │
│  Target value or range for an SLI                                           │
│                                                                              │
│  Examples:                                                                  │
│  • 99.9% of requests complete in < 200ms                                   │
│  • 99.95% availability per month                                            │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  ERROR BUDGET:                                                              │
│  ─────────────                                                              │
│  Error Budget = 1 - SLO                                                     │
│                                                                              │
│  SLO: 99.9% availability                                                    │
│  Error Budget: 0.1% downtime allowed                                        │
│                                                                              │
│  Per month (30 days):                                                       │
│  0.1% × 30 × 24 × 60 = 43.2 minutes of allowed downtime                    │
│                                                                              │
│  BURN RATE:                                                                 │
│  ──────────                                                                 │
│  How fast are you consuming error budget?                                   │
│                                                                              │
│  Burn rate 1.0 = Using budget at expected rate                             │
│  Burn rate 2.0 = Using budget 2x as fast (will exhaust in 15 days)         │
│  Burn rate 10.0 = Critical! Budget gone in 3 days                          │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    ERROR BUDGET VISUALIZATION                       │    │
│  │                                                                     │    │
│  │  100% ┤█████████████████████████████████████████████████████        │    │
│  │       │█████████████████████████████████████████████                │    │
│  │   50% ┤████████████████████████████                                 │    │
│  │       │██████████████████████ ← Budget running low!                │    │
│  │    0% ┼────────────────────────────────────────────────────────     │    │
│  │       Day 1    Day 7    Day 14   Day 21   Day 28   Day 30          │    │
│  │                                                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
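The budget and burn-rate arithmetic above is easy to encode; these helper names are illustrative:

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime for the period: 1 - SLO, as minutes.
    Mirrors the 43.2-minute example for a 99.9% SLO over 30 days."""
    return (1 - slo) * days * 24 * 60


def burn_rate(budget_consumed_fraction: float,
              period_elapsed_fraction: float) -> float:
    """Burn rate 1.0 = on track to spend exactly the budget by period end."""
    return budget_consumed_fraction / period_elapsed_fraction


def days_to_exhaustion(rate: float, period_days: int = 30) -> float:
    """At a constant burn rate, when does the budget run out?"""
    return period_days / rate
```

This is why `days_to_exhaustion(2.0)` gives 15 days and `days_to_exhaustion(10.0)` gives 3, matching the figures above.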

Module 28: Chaos Engineering

Netflix’s Approach

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CHAOS ENGINEERING                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  "The discipline of experimenting on a system to build confidence           │
│   in its capability to withstand turbulent conditions in production."       │
│                                                                              │
│  NETFLIX SIMIAN ARMY:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  🐒 Chaos Monkey: Kill random instances                                     │
│  🦍 Chaos Gorilla: Kill entire availability zone                            │
│  🦧 Chaos Kong: Kill entire region                                          │
│  ⏰ Latency Monkey: Inject network latency                                   │
│  📝 Conformity Monkey: Find non-conforming instances                        │
│  🩺 Doctor Monkey: Health checks                                            │
│  🧹 Janitor Monkey: Clean up unused resources                               │
│  Security Monkey: Find security issues                                      │
│                                                                              │
│  PRINCIPLES:                                                                │
│  ───────────                                                                │
│  1. Start with hypothesis about steady state                                │
│  2. Vary real-world events (failure, traffic spike)                         │
│  3. Run experiments in production                                           │
│  4. Automate experiments to run continuously                                │
│  5. Minimize blast radius                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Designing Chaos Experiments

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CHAOS EXPERIMENT DESIGN                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  EXPERIMENT TEMPLATE:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  1. HYPOTHESIS                                                              │
│     "When database X fails, service Y will failover to replica              │
│      within 30 seconds with no user-visible errors"                         │
│                                                                              │
│  2. STEADY STATE                                                            │
│     - Error rate < 0.1%                                                     │
│     - P99 latency < 200ms                                                   │
│     - Orders per minute: ~1000                                              │
│                                                                              │
│  3. EXPERIMENT                                                              │
│     - Target: Production database primary                                   │
│     - Action: Kill database process                                         │
│     - Duration: Until failover complete                                     │
│                                                                              │
│  4. BLAST RADIUS CONTROL                                                    │
│     - Run during low traffic (2 AM)                                         │
│     - Limit to 5% of users (feature flag)                                   │
│     - Automatic rollback if error rate > 5%                                 │
│                                                                              │
│  5. METRICS TO WATCH                                                        │
│     - Database connection errors                                            │
│     - Order completion rate                                                 │
│     - Failover detection time                                               │
│                                                                              │
│  6. ABORT CONDITIONS                                                        │
│     - Error rate > 5%                                                       │
│     - No failover after 60 seconds                                          │
│     - Manual abort button                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
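Step 6's abort conditions can be expressed as one guard function that the experiment harness polls during the run. The signature below is a sketch under the thresholds in the template, not any chaos tool's real API:

```python
def should_abort(error_rate: float,
                 seconds_since_kill: float,
                 failover_done: bool,
                 manual_abort: bool = False) -> bool:
    """Evaluate the abort conditions from the experiment template:
    error rate > 5%, no failover after 60s, or a manual abort."""
    if manual_abort:
        return True
    if error_rate > 0.05:
        return True
    if seconds_since_kill > 60 and not failover_done:
        return True
    return False
```

Keeping the abort check this explicit, with hard-coded thresholds reviewed before the run, is part of minimizing blast radius: the rollback decision should never depend on a human noticing a dashboard.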

Failure Injection Types

┌────────────────────┬────────────────────────────────────────────────────────┐
│ Failure Type       │ How to Inject                                          │
├────────────────────┼────────────────────────────────────────────────────────┤
│ Process crash      │ kill -9, container stop                                │
│ Instance failure   │ Terminate EC2, pod delete                              │
│ Zone failure       │ Block traffic to zone, DNS manipulation                │
│ Region failure     │ Failover entire region                                 │
│ Network latency    │ tc netem, iptables delay                               │
│ Network partition  │ iptables DROP, security groups                         │
│ Packet loss        │ tc netem loss                                          │
│ CPU stress         │ stress-ng, burn CPU                                    │
│ Memory pressure    │ stress-ng --vm                                         │
│ Disk full          │ dd if=/dev/zero of=/tmp/fill                           │
│ Disk slow          │ dm-delay, slow filesystem                              │
│ Clock skew         │ date --set, chronyd manipulation                       │
│ DNS failure        │ Block port 53, bad DNS response                        │
│ Certificate expiry │ Use expired cert, revoke cert                          │
└────────────────────┴────────────────────────────────────────────────────────┘

Module 29: SRE Practices

Error Budgets and Release Velocity

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ERROR BUDGETS IN PRACTICE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ERROR BUDGET POLICY:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  If Error Budget > 50%:                                                     │
│    ✓ Deploy normally                                                        │
│    ✓ Run experiments                                                        │
│    ✓ Add new features                                                       │
│                                                                              │
│  If Error Budget 10-50%:                                                    │
│    ⚠ Slower deployments                                                     │
│    ⚠ Extra review for risky changes                                         │
│    ⚠ Focus on reliability improvements                                      │
│                                                                              │
│  If Error Budget < 10%:                                                     │
│    🛑 Feature freeze                                                        │
│    🛑 Only reliability fixes deployed                                       │
│    🛑 Postmortem required for any new issues                                │
│                                                                              │
│  If Error Budget Exhausted:                                                 │
│    🚨 All hands on reliability                                              │
│    🚨 Executive escalation                                                  │
│    🚨 No deploys until budget restored                                      │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  BALANCING ACT:                                                             │
│  ──────────────                                                             │
│                                                                              │
│  Too much budget remaining = not innovating fast enough                     │
│  Too little budget = reliability suffering                                  │
│                                                                              │
│  Target: Consume ~100% of budget by end of period                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
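The policy table maps cleanly to a lookup function. Thresholds follow the bands above; exactly how the boundaries (0%, 10%, 50%) are assigned is a team judgment call, and the strings are illustrative:

```python
def release_policy(budget_remaining: float) -> str:
    """Map remaining error budget (0.0 to 1.0) to a release posture,
    following the policy bands above."""
    if budget_remaining <= 0.0:
        return "all hands on reliability, no deploys"
    if budget_remaining < 0.10:
        return "feature freeze"
    if budget_remaining <= 0.50:
        return "slow deploys, extra review"
    return "deploy normally"
```

The value of encoding the policy is that it removes the argument from the incident: the deploy gate consults the budget, not a manager.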

On-Call Best Practices

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ON-CALL BEST PRACTICES                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ROTATION STRUCTURE:                                                        │
│  ───────────────────                                                        │
│  Primary → Secondary → Escalation Manager                                   │
│                                                                              │
│  Primary: First responder (5 min SLA)                                       │
│  Secondary: Backup if primary unavailable                                   │
│  Escalation: For cross-team or executive decisions                          │
│                                                                              │
│  SUSTAINABLE ON-CALL:                                                       │
│  ────────────────────                                                       │
│  ✓ Max 1 week in 4 on-call                                                 │
│  ✓ Max 2 incidents per 12-hour shift                                       │
│  ✓ Time off after tough on-call shifts                                     │
│  ✓ Clear escalation paths                                                  │
│  ✓ Good runbooks                                                           │
│                                                                              │
│  REDUCING TOIL:                                                             │
│  ──────────────                                                             │
│  Goal: < 50% of SRE time on toil                                           │
│                                                                              │
│  Toil = Manual, repetitive, automatable, no lasting value                   │
│                                                                              │
│  Examples of toil:                                                          │
│  • Manual deployments                                                       │
│  • Restarting services                                                      │
│  • Certificate rotation                                                     │
│  • Capacity additions                                                       │
│                                                                              │
│  Fix: Automate everything that can be automated                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Progressive Rollouts

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PROGRESSIVE ROLLOUTS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CANARY DEPLOYMENT:                                                         │
│  ──────────────────                                                         │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Load Balancer                                  │    │
│  │                           │                                         │    │
│  │              ┌────────────┼────────────┐                            │    │
│  │              │            │            │                            │    │
│  │              ▼            ▼            ▼                            │    │
│  │         ┌────────┐  ┌────────┐   ┌────────┐                         │    │
│  │         │ v1.0   │  │ v1.0   │   │ v1.1   │ ← Canary (5%)          │    │
│  │         │ (47.5%)│  │ (47.5%)│   │ (5%)   │                         │    │
│  │         └────────┘  └────────┘   └────────┘                         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  STAGES:                                                                    │
│  ───────                                                                    │
│  Stage 1: 1% → Verify no crashes (1 hour)                                  │
│  Stage 2: 5% → Check metrics (2 hours)                                     │
│  Stage 3: 25% → Broader validation (4 hours)                               │
│  Stage 4: 50% → Almost there (8 hours)                                     │
│  Stage 5: 100% → Full rollout                                              │
│                                                                              │
│  AUTOMATIC ROLLBACK TRIGGERS:                                               │
│  ────────────────────────────                                               │
│  • Error rate > baseline + 0.5%                                            │
│  • P99 latency > baseline + 100ms                                          │
│  • Memory usage > 90%                                                       │
│  • Any crash loop                                                           │
│                                                                              │
│  BLUE-GREEN DEPLOYMENT:                                                     │
│  ──────────────────────                                                     │
│                                                                              │
│  Before:  LB → Blue (v1.0) [Active]                                        │
│                Green (v1.1) [Staged]                                        │
│                                                                              │
│  After:   LB → Green (v1.1) [Active]                                       │
│                Blue (v1.0) [Standby for rollback]                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
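The automatic rollback triggers reduce to a straightforward comparison of canary metrics against the baseline fleet. Field names here are illustrative, and a real controller (Argo Rollouts, Flagger, etc.) would evaluate these over a sustained window rather than a single sample:

```python
def should_rollback(baseline: dict, canary: dict) -> bool:
    """Apply the rollback triggers above: error rate above baseline
    by > 0.5%, p99 above baseline by > 100ms, memory > 90%, or a
    crash loop."""
    if canary["error_rate"] > baseline["error_rate"] + 0.005:
        return True
    if canary["p99_ms"] > baseline["p99_ms"] + 100:
        return True
    if canary["memory_pct"] > 90:
        return True
    if canary.get("crash_looping", False):
        return True
    return False
```

Comparing against the live baseline, rather than a fixed threshold, is what makes this alert on the *change* the canary introduced instead of on ambient noise.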

Module 30: Incident Management

Incident Response Framework

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INCIDENT RESPONSE                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SEVERITY LEVELS:                                                           │
│  ────────────────                                                           │
│                                                                              │
│  SEV1 (Critical):                                                           │
│  • Complete service outage                                                  │
│  • Data loss or security breach                                             │
│  • All hands, exec involvement                                              │
│                                                                              │
│  SEV2 (High):                                                               │
│  • Major feature unavailable                                                │
│  • Significant user impact                                                  │
│  • Dedicated incident team                                                  │
│                                                                              │
│  SEV3 (Medium):                                                             │
│  • Partial degradation                                                      │
│  • Limited user impact                                                      │
│  • On-call handles                                                          │
│                                                                              │
│  SEV4 (Low):                                                                │
│  • Minor issue                                                              │
│  • Workaround available                                                     │
│  • Fix in normal sprint                                                     │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  INCIDENT ROLES:                                                            │
│  ───────────────                                                            │
│                                                                              │
│  Incident Commander (IC):                                                   │
│  • Coordinates response                                                     │
│  • Makes decisions                                                          │
│  • NOT doing technical work                                                 │
│                                                                              │
│  Communications Lead:                                                       │
│  • Updates status page                                                      │
│  • Communicates with stakeholders                                           │
│  • Manages customer messaging                                               │
│                                                                              │
│  Technical Lead:                                                            │
│  • Leads technical investigation                                            │
│  • Coordinates engineering response                                         │
│                                                                              │
│  Scribe:                                                                    │
│  • Documents timeline                                                       │
│  • Records decisions and actions                                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Postmortem Culture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    BLAMELESS POSTMORTEMS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PRINCIPLES:                                                                │
│  ───────────                                                                │
│                                                                              │
│  1. BLAMELESS: Focus on systems, not individuals                            │
│     BAD: "John pushed bad code"                                             │
│     GOOD: "Our CI pipeline didn't catch the regression"                     │
│                                                                              │
│  2. LEARN: Extract lessons, don't just assign blame                         │
│                                                                              │
│  3. SHARE: Postmortems are public within company                            │
│                                                                              │
│  4. ACT: Action items must be tracked and completed                         │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  POSTMORTEM TEMPLATE:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  ## Incident Summary                                                        │
│  • Duration: 2 hours 15 minutes                                             │
│  • Impact: 40% of users saw errors                                          │
│  • Severity: SEV2                                                           │
│                                                                              │
│  ## Timeline                                                                │
│  14:30 - Deploy of commit abc123                                            │
│  14:35 - Error rate alerts fire                                             │
│  14:40 - On-call acknowledges                                               │
│  14:55 - Root cause identified                                              │
│  15:10 - Rollback initiated                                                 │
│  15:15 - Service restored                                                   │
│  16:45 - Incident closed                                                    │
│                                                                              │
│  ## Root Cause                                                              │
│  Database connection pool exhausted due to missing timeout                  │
│                                                                              │
│  ## What Went Well                                                          │
│  • Fast detection (5 min)                                                   │
│  • Clear runbook                                                            │
│                                                                              │
│  ## What Went Poorly                                                        │
│  • No canary caught the issue                                               │
│  • Rollback took 20 min (should be 5)                                       │
│                                                                              │
│  ## Action Items                                                            │
│  1. [P0] Add connection pool monitoring - @alice - Due 1/15                │
│  2. [P1] Improve canary coverage - @bob - Due 1/20                         │
│  3. [P2] Speed up rollback - @charlie - Due 1/30                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 31: Capacity Planning

Load Testing

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOAD TESTING                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TYPES OF TESTS:                                                            │
│  ───────────────                                                            │
│                                                                              │
│  LOAD TEST: Normal/expected load                                            │
│  ────────────────────────────                                               │
│  RPS  ▲                                                                     │
│       │     ┌───────────────────────┐                                       │
│  1000 │─────┤  Sustained 1000 RPS   │─────                                  │
│       │     └───────────────────────┘                                       │
│       └──────────────────────────────────────► Time                         │
│                                                                              │
│  STRESS TEST: Find breaking point                                           │
│  ────────────────────────────────                                           │
│  RPS  ▲                                                                     │
│       │                           ╱ Break!                                  │
│  5000 │                         ╱                                           │
│       │                       ╱                                             │
│  2500 │                     ╱                                               │
│       │                   ╱                                                 │
│  1000 │─────────────────╱                                                   │
│       └──────────────────────────────────────► Time                         │
│                                                                              │
│  SOAK TEST: Long duration stability                                         │
│  ──────────────────────────────────                                         │
│  RPS  ▲                                                                     │
│       │     ┌──────────────────────────────────────────┐                    │
│  1000 │─────┤          24 hours at 1000 RPS            │                    │
│       │     └──────────────────────────────────────────┘                    │
│       └──────────────────────────────────────────────────► Time             │
│  Look for: Memory leaks, connection exhaustion                              │
│                                                                              │
│  SPIKE TEST: Sudden traffic burst                                           │
│  ────────────────────────────────                                           │
│  RPS  ▲                                                                     │
│       │          ╱╲                                                         │
│  5000 │         ╱  ╲                                                        │
│       │        ╱    ╲                                                       │
│  1000 │───────╱      ╲───────                                               │
│       └──────────────────────────────────────► Time                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
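The four profiles above differ only in how the target request rate varies over time. A minimal sketch of that schedule (the numbers mirror the diagrams; a real test would drive a tool such as k6, Locust, or Gatling rather than hand-rolled code):

```python
def target_rps(test_type: str, t: float, duration: float,
               base: float = 1000.0, peak: float = 5000.0) -> float:
    """Target request rate at time t (seconds) for each load-test profile."""
    if test_type == "load":     # sustained expected load
        return base
    if test_type == "soak":     # same rate, just a much longer duration
        return base
    if test_type == "stress":   # linear ramp from base toward the breaking point
        return base + (peak - base) * (t / duration)
    if test_type == "spike":    # sudden burst during the middle third of the run
        third = duration / 3
        return peak if third <= t < 2 * third else base
    raise ValueError(f"unknown test type: {test_type}")
```

For example, halfway through a 10-minute stress test the target is 3000 RPS, while a spike test at the same point is at its 5000 RPS burst.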

Capacity Modeling

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CAPACITY MODELING                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  STEP 1: UNDERSTAND CURRENT STATE                                           │
│  ─────────────────────────────────                                          │
│  • Current RPS: 10,000                                                      │
│  • Current instances: 20                                                    │
│  • CPU utilization: 40%                                                     │
│  • RPS per instance: 500                                                    │
│                                                                              │
│  STEP 2: DEFINE HEADROOM                                                    │
│  ────────────────────────                                                   │
│  • Target CPU: 60% (buffer for spikes)                                      │
│  • Available headroom: 20% (60% - 40%)                                      │
│  • Max RPS at current capacity: 15,000                                      │
│                                                                              │
│  STEP 3: PROJECT GROWTH                                                     │
│  ────────────────────────                                                   │
│  • Current: 10,000 RPS                                                      │
│  • Growth rate: 10% per month                                               │
│  • In 6 months: 17,716 RPS                                                  │
│  • In 12 months: 31,384 RPS                                                 │
│                                                                              │
│  STEP 4: PLAN ADDITIONS                                                     │
│  ────────────────────────                                                   │
│                                                                              │
│  Month 1-4:   Current capacity sufficient                                   │
│  Month 5:     Need 22 instances (add 2)                                     │
│  Month 8:     Need 29 instances (add 7)                                     │
│  Month 12:    Need 42 instances (add 13)                                    │
│                                                                              │
│  STEP 5: BUFFER FOR EVENTS                                                  │
│  ─────────────────────────                                                  │
│  • Black Friday: 3x normal traffic                                          │
│  • Need 3x capacity for 48 hours                                            │
│  • Pre-scale 1 week before                                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
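The five steps above reduce to a few lines of arithmetic. A sketch (function and parameter names are illustrative):

```python
import math

def project_capacity(current_rps: float, instances: int, cpu_util: float,
                     target_cpu: float, growth_per_month: float, months: int):
    """Walk steps 1-4 of the capacity model above.

    Returns (projected_rps, instances_needed) for the given month.
    """
    rps_per_instance = current_rps / instances                  # step 1: 500 RPS
    max_per_instance = rps_per_instance * (target_cpu / cpu_util)  # 750 RPS at 60% CPU
    projected = current_rps * (1 + growth_per_month) ** months  # step 3: compound growth
    needed = math.ceil(projected / max_per_instance)            # step 4: round up
    return round(projected), needed
```

With the numbers above, `project_capacity(10_000, 20, 0.40, 0.60, 0.10, 12)` gives roughly 31,384 RPS and 42 instances, matching the 12-month row. Event buffers (step 5) are applied on top of this baseline.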

Autoscaling Strategies

┌─────────────────────────────────────────────────────────────────────────────┐
│                    AUTOSCALING                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCALING METRICS:                                                           │
│  ────────────────                                                           │
│                                                                              │
│  CPU-based:                                                                 │
│    Scale up when: avg CPU > 70%                                             │
│    Scale down when: avg CPU < 30%                                           │
│                                                                              │
│  Request-based:                                                             │
│    Target: 1000 RPS per instance                                            │
│    Current: 3000 RPS, 2 instances                                           │
│    Action: Scale to 3 instances                                             │
│                                                                              │
│  Queue-based:                                                               │
│    Scale up when: queue depth > 1000                                        │
│    Scale down when: queue empty for 5 min                                   │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  PREDICTIVE SCALING:                                                        │
│  ───────────────────                                                        │
│                                                                              │
│  Traffic  ▲      Expected     Actual                                        │
│           │      ─────────    ────────                                      │
│           │         ╱╲           ╱╲                                         │
│           │       ╱    ╲       ╱    ╲                                       │
│           │     ╱        ╲   ╱        ╲                                     │
│           │───╱────────────╲╱────────────                                   │
│           │  9AM          12PM         3PM                                  │
│           └──────────────────────────────────► Time                         │
│                                                                              │
│  Scale BEFORE traffic arrives based on historical patterns                  │
│                                                                              │
│  COOLDOWN PERIODS:                                                          │
│  ─────────────────                                                          │
│  Scale up cooldown: 3 minutes                                               │
│  Scale down cooldown: 10 minutes                                            │
│  Prevents thrashing (scale up/down/up/down)                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
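The request-based rule and the cooldown gate above can be sketched in a few lines (names are illustrative; real autoscalers such as the Kubernetes HPA or AWS target tracking implement the same idea):

```python
import math

def desired_instances(total_load: float, target_per_instance: float,
                      minimum: int = 1, maximum: int = 100) -> int:
    """Target-tracking rule: ceil(total load / per-instance target), clamped."""
    desired = math.ceil(total_load / target_per_instance)
    return max(minimum, min(maximum, desired))

def may_scale(last_scale_ts: float, now: float, scaling_up: bool,
              up_cooldown: float = 180.0, down_cooldown: float = 600.0) -> bool:
    """Cooldown gate: 3 min between scale-ups, 10 min between scale-downs,
    which prevents the thrashing described above."""
    cooldown = up_cooldown if scaling_up else down_cooldown
    return now - last_scale_ts >= cooldown
```

The request-based example above falls out directly: 3000 RPS at a 1000 RPS/instance target yields 3 instances.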

Key Interview Questions

Q: Latency has spiked across the system. How do you find and fix the cause?

Systematic approach:
  1. Check dashboards first
    • Which services show elevated latency?
    • When did it start? Correlate with deployments
    • Scope: All users or specific segment?
  2. Use distributed tracing
    • Find slow traces
    • Identify which span is slowest
    • Look for patterns (specific service, DB, external API)
  3. Drill down
    • Check that service’s metrics (CPU, memory, connections)
    • Check dependencies (DB latency, cache hit rate)
    • Check for new error types in logs
  4. Common culprits
    • Database slow queries
    • Cache miss spike
    • Connection pool exhaustion
    • Garbage collection
    • Noisy neighbor (shared resources)
    • External API degradation
  5. Mitigation while investigating
    • Scale up if resource-bound
    • Enable circuit breaker
    • Failover to backup
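Step 2's "identify which span is slowest" is mechanical once trace data is in hand. A sketch over a hypothetical list of span records (real traces would come from a tool like Jaeger or Zipkin):

```python
def slowest_span(spans):
    """Return the single span contributing the most latency in one trace.

    Each span is a dict like {"service": ..., "name": ..., "duration_ms": ...}.
    """
    return max(spans, key=lambda s: s["duration_ms"])

def slowest_service(spans):
    """Aggregate span time per service to spot patterns across many traces."""
    totals = {}
    for s in spans:
        totals[s["service"]] = totals.get(s["service"], 0) + s["duration_ms"]
    return max(totals, key=totals.get)
```

If the same service or query dominates across many slow traces, that is where to drill down in step 3.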
Q: How would you introduce chaos engineering to an organization that has never practiced it?

PHASE 1: FOUNDATION (Month 1-2)
─────────────────────────────
• Define steady state metrics
• Set up experiment tracking
• Train team on principles
• Start with game days (manual)

PHASE 2: BASIC EXPERIMENTS (Month 3-4)
──────────────────────────────────────
• Instance failures
• Network latency injection
• Run in staging first
• Graduate to production with small blast radius

PHASE 3: ADVANCED (Month 5-6)
─────────────────────────────
• Zone failures
• Database failover
• Certificate expiry
• Clock skew

PHASE 4: AUTOMATION (Month 7+)
──────────────────────────────
• Continuous chaos in production
• Automatic rollback on failure
• Integrate with CI/CD
• Coverage reporting

KEY PRINCIPLES:
• Always have rollback plan
• Start small, expand gradually
• Document everything
• Celebrate finding issues!
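Phase 2's latency injection can start as a simple wrapper that delays a fraction of calls. A toy sketch, not a substitute for purpose-built tools such as Toxiproxy or Chaos Monkey:

```python
import functools
import random
import time

def inject_latency(probability: float, delay_s: float, rng=random.random):
    """Decorator: delay a fraction of calls to simulate a slow dependency.

    Keep the blast radius small — start with a low probability in staging,
    and keep a rollback plan (remove the decorator) ready.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                time.sleep(delay_s)
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

The `rng` hook makes the injection deterministic in tests; in production the probability is the blast-radius knob.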
Q: Your on-call team is drowning in pages. How do you reduce alert fatigue?

Short-term mitigation:
  • Analyze page patterns
  • Suppress non-actionable alerts
  • Add secondary on-call
Medium-term fixes:
  • Improve runbooks for faster resolution
  • Automate common remediations
  • Add better monitoring to prevent issues
Long-term solutions:
  • Fix underlying reliability issues
  • Add circuit breakers, retries
  • Improve capacity planning
  • Push back on feature work when the error budget is exhausted
Process changes:
  • Track alert metrics (pages per week)
  • Review every page in team meeting
  • Goal: < 2 pages per on-call shift
  • Escalate if consistently exceeded
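The "< 2 pages per on-call shift" goal is easy to track mechanically. A sketch (field names are illustrative):

```python
def alert_budget_status(page_count: int, shifts: int,
                        budget_per_shift: float = 2.0) -> dict:
    """Compare paging load against the team's budget of 2 pages per shift."""
    rate = page_count / shifts
    return {"pages_per_shift": rate, "over_budget": rate > budget_per_shift}
```

A week with 10 pages across 4 shifts averages 2.5 pages per shift — over budget, and grounds to escalate per the process above.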
Q: How would you prepare a service for a planned 10x traffic event (e.g., a major sale)?

PREPARATION TIMELINE:

T-4 weeks:
• Load test current capacity
• Identify bottlenecks
• Order additional capacity

T-2 weeks:
• Pre-scale databases (can't autoscale fast)
• Increase cache capacity
• Add read replicas
• Pre-warm caches

T-1 week:
• Scale compute to 3x normal
• Verify autoscaling works
• Test circuit breakers
• Prepare runbooks

T-1 day:
• Final scale to target (10x)
• War room ready
• All hands available

During event:
• Monitor dashboards
• Quick decisions on feature flags
• Prepared to shed load (graceful degradation)

After:
• Scale down gradually (30% per hour)
• Postmortem any issues
• Update capacity model

Capstone: Interview Preparation

Practice System Design Problems

Apply everything you’ve learned:
  1. Design Uber’s dispatch system
    • Real-time location tracking
    • Matching drivers to riders
    • Surge pricing
  2. Design Stripe’s payment processing
    • Exactly-once payments
    • Idempotency
    • Saga pattern for complex transactions
  3. Design Netflix’s video streaming
    • CDN architecture
    • Adaptive bitrate
    • Regional failover
  4. Design Twitter’s timeline
    • Fan-out on write vs read
    • Celebrity problem
    • Real-time updates

Congratulations!

You’ve completed the Distributed Systems Mastery course. You now have the knowledge to:
  • ✅ Explain consensus protocols (Raft, Paxos) in interviews
  • ✅ Design systems with appropriate consistency guarantees
  • ✅ Choose between replication strategies
  • ✅ Implement distributed transactions correctly
  • ✅ Build and operate systems at massive scale
  • ✅ Debug complex distributed systems issues
Next steps: Practice! Do mock interviews, build projects, and read the papers referenced throughout this course.