
Track 6: Production Excellence

Operating distributed systems at scale requires specialized skills beyond just building them.
Track Duration: 28-36 hours
Modules: 5
Key Topics: Observability, Chaos Engineering, SRE, Incident Management, Capacity Planning

Module 27: Observability at Scale

The Three Pillars

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THREE PILLARS OF OBSERVABILITY                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐                │
│  │     LOGS        │ │    METRICS      │ │    TRACES       │                │
│  ├─────────────────┤ ├─────────────────┤ ├─────────────────┤                │
│  │ What happened   │ │ How much/many   │ │ Request journey │                │
│  │ at a point      │ │ over time       │ │ across services │                │
│  │ in time         │ │                 │ │                 │                │
│  ├─────────────────┤ ├─────────────────┤ ├─────────────────┤                │
│  │ • Debug issues  │ │ • Dashboards    │ │ • Latency       │                │
│  │ • Audit trail   │ │ • Alerting      │ │   breakdown     │                │
│  │ • Security      │ │ • Trends        │ │ • Dependencies  │                │
│  │ • Compliance    │ │ • Capacity      │ │ • Bottlenecks   │                │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘                │
│                                                                              │
│  CORRELATION IS KEY:                                                        │
│  ───────────────────                                                        │
│  trace_id: abc123 links logs, metrics, and traces together                 │
│                                                                              │
│  Alert fires (metric) → Find trace_id → View logs for that trace          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Distributed Tracing

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TRACING                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  REQUEST FLOW:                                                              │
│                                                                              │
│  User                                                                       │
│   │                                                                         │
│   │ trace_id: abc123                                                        │
│   ▼                                                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ API Gateway (span: 0-450ms)                                         │    │
│  │   │                                                                 │    │
│  │   ├──► Auth Service (span: 10-50ms)                                │    │
│  │   │                                                                 │    │
│  │   ├──► Order Service (span: 60-300ms)                              │    │
│  │   │      │                                                          │    │
│  │   │      ├──► Inventory DB (span: 80-150ms)                        │    │
│  │   │      │                                                          │    │
│  │   │      └──► Payment Service (span: 160-290ms)                    │    │
│  │   │             │                                                   │    │
│  │   │             └──► Stripe API (span: 170-280ms) ← Bottleneck!   │    │
│  │   │                                                                 │    │
│  │   └──► Notification Service (span: 310-440ms)                      │    │
│  │                                                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  TRACE CONTEXT PROPAGATION:                                                 │
│  ──────────────────────────                                                 │
│  HTTP Header: traceparent: 00-abc123-def456-01                             │
│                                                                              │
│  W3C Trace Context format:                                                  │
│  00          - version                                                      │
│  abc123      - trace-id (request identifier)                               │
│  def456      - parent-id (span identifier)                                 │
│  01          - flags (sampled)                                              │
│                                                                              │
│  TOOLS: Jaeger, Zipkin, AWS X-Ray, Datadog APM, Honeycomb                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Advanced: Tail-Based Sampling

At massive scale (millions of spans/sec), you cannot afford to store every trace.
  • Head-Based Sampling: The decision to sample is made at the start of the request (e.g., “Sample 1% of all requests”).
    • The Flaw: It might discard the 0.01% of requests that actually had an error or a 5-second latency spike.
  • Tail-Based Sampling: The sampling decision is delayed until the entire trace has been collected.
    • Spans are buffered in a collector (like the OpenTelemetry Collector).
    • Once the trace is complete, a policy is applied: “Keep if status is Error OR latency > 500ms OR method is POST”.
    • Result: You capture 100% of “interesting” traces while only paying for 1% of “boring” (successful/fast) ones.
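
In practice this policy runs in a collector (e.g., the OpenTelemetry Collector's tail-sampling processor). A minimal Python sketch of the decision logic, with illustrative Span fields:

import random
from dataclasses import dataclass

@dataclass
class Span:
    status_ok: bool
    duration_ms: float
    http_method: str = "GET"

def keep_trace(spans: list[Span], slow_ms: float = 500.0,
               boring_rate: float = 0.01) -> bool:
    """Decide only once the whole trace has been buffered."""
    if any(not s.status_ok for s in spans):          # keep all errors
        return True
    if any(s.duration_ms > slow_ms for s in spans):  # keep latency spikes
        return True
    if any(s.http_method == "POST" for s in spans):  # keep mutations
        return True
    return random.random() < boring_rate             # 1% of boring traces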

27.2.1 Advanced: Trace Reconstruction & Causality

At Staff/Principal level, you must understand the mechanics of how traces are reconstructed from a sea of independent spans.

1. W3C Trace Context (The Standard)

The industry has standardized on the W3C Trace Context. It consists of two headers:
  • traceparent: Propagates the trace-id and parent-id.
    • Format: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  • tracestate: Propagates vendor-specific metadata (e.g., Datadog or New Relic specific tags).
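
A minimal parser for the traceparent header (real services should use an OpenTelemetry propagator rather than hand-rolling this):

def parse_traceparent(header: str) -> dict:
    version, trace_id, parent_id, flags = header.split("-")
    # Per the W3C spec: 32 hex chars for trace-id, 16 for parent-id
    assert len(trace_id) == 32 and len(parent_id) == 16
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),  # bit 0 = sampled flag
    }

parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")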

2. Trace Reconstruction (The DAG Problem)

When you view a trace in Jaeger or Honeycomb, the backend has to perform a Distributed Join to reconstruct the tree:
  1. Gathering: Collect all spans sharing the same trace-id.
  2. Topological Sort: Use the parent-id to arrange spans into a Directed Acyclic Graph (DAG).
  3. Clock Correction: Since different nodes have different clocks, the backend must adjust span start times so that a child span never starts before its parent (Causality preserved).
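
A sketch of steps 1-2 over already-gathered spans (the clock-correction step would additionally clamp each child's start time to be no earlier than its parent's):

from collections import defaultdict

def build_trace_tree(spans: list[dict]):
    """Arrange spans sharing a trace-id into a parent -> children tree.
    Each span dict carries 'span_id', 'parent_id' (None for the root),
    and 'start_ms'. Orphans (parent never arrived) are promoted to
    roots so a partially collected trace still renders."""
    by_id = {s["span_id"]: s for s in spans}
    children = defaultdict(list)
    roots = []
    for s in spans:
        p = s.get("parent_id")
        if p is None or p not in by_id:
            roots.append(s)
        else:
            children[p].append(s)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["start_ms"])  # causal order per parent
    return roots, children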

3. Tracing through Queues (The “Link” Pattern)

Standard tracing assumes a parent-child relationship (synchronous RPC). But for asynchronous messages (Kafka/SQS), we use Links.
  • The Problem: If a consumer processes 100 messages in one batch, who is the “parent”?
  • The Solution: The consumer’s span carries a Link to each of the 100 originating spans, so you can see the fan-in without breaking any originating trace’s structure. See the comparison below.
┌──────────────┬───────────────────────────┬───────────────────────────┐
│ Feature      │ Child-Of Relationship     │ Link Relationship         │
├──────────────┼───────────────────────────┼───────────────────────────┤
│ Model        │ Synchronous (RPC)         │ Asynchronous (messages)   │
│ Dependency   │ Direct parent             │ Causal link               │
│ Causality    │ Strong (blocking)         │ Weak (hand-off)           │
└──────────────┴───────────────────────────┴───────────────────────────┘

Staff Tip: When implementing tracing, never use a “global” trace context variable in your code. Always pass the Context object explicitly (in Go) or use ThreadLocal storage (in Java) to avoid trace leakage between concurrent requests.

27.3: Context Propagation Implementation

The hardest part of distributed tracing is ensuring the trace context flows through every boundary: HTTP, gRPC, message queues, thread pools, and async callbacks.

1. HTTP Propagation

# Python: Inject context into outgoing HTTP requests
from opentelemetry import trace
from opentelemetry.propagate import inject
import requests

def make_http_call(url: str, data: dict):
    tracer = trace.get_tracer(__name__)
    
    with tracer.start_as_current_span("http_call") as span:
        headers = {}
        # Inject trace context into headers
        inject(headers)  # Adds 'traceparent' and 'tracestate'
        
        response = requests.post(url, json=data, headers=headers)
        
        span.set_attribute("http.status_code", response.status_code)
        return response

# Server side: Extract context from incoming request
from opentelemetry.propagate import extract

def handle_request(request):
    # Extract trace context from headers
    context = extract(request.headers)
    
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("handle_request", context=context) as span:
        # Process request - this span is now a child of the caller's span
        return process(request)

2. Message Queue Propagation (Kafka/SQS)

The trickiest case: async messages break the parent-child relationship.
# Producer: Inject context into message headers
# (assumes an initialized kafka-python KafkaProducer as `producer`)
import json

def send_to_kafka(topic: str, message: dict):
    tracer = trace.get_tracer(__name__)
    
    with tracer.start_as_current_span("kafka_produce") as span:
        headers = []
        carrier = {}
        inject(carrier)
        
        # Convert to Kafka header format
        for key, value in carrier.items():
            headers.append((key, value.encode('utf-8')))
        
        producer.send(
            topic,
            value=json.dumps(message).encode(),
            headers=headers
        )

# Consumer: Extract context, but use LINKS not parent
# (assumes an initialized KafkaConsumer as `consumer`)
def consume_from_kafka():
    tracer = trace.get_tracer(__name__)
    
    for message in consumer:
        # Extract the trace context from producer
        carrier = {k: v.decode() for k, v in message.headers}
        producer_context = extract(carrier)
        
        # Create LINK to producer span (not parent-child!)
        producer_span = trace.get_current_span(producer_context)
        links = [trace.Link(producer_span.get_span_context())]
        
        # Start new span with link to producer
        with tracer.start_span(
            "kafka_consume",
            links=links  # Links allow many-to-many relationships
        ) as span:
            process_message(message)
Why Links instead of Parent?
  • A consumer might process messages from 100 different producers in one batch
  • A span can have only one parent, so parent-child would force either an arbitrary choice of parent or splitting the batch into 100 separate traces
  • Links let you record “this processing was triggered by these 100 messages”

3. Thread Pool / Async Context Propagation

Context can be lost when work is submitted to a thread pool.
# BAD: Context is lost!
def bad_async_call():
    with tracer.start_as_current_span("parent"):
        # Context is NOT automatically copied to the thread pool worker
        executor.submit(do_work)  # do_work sees NO trace context

# GOOD: Explicitly propagate context
from opentelemetry.context import attach, detach, get_current

def good_async_call():
    with tracer.start_as_current_span("parent"):
        # Capture current context
        ctx = get_current()
        
        def wrapped_work():
            # Attach captured context in worker thread
            token = attach(ctx)
            try:
                do_work()  # Now sees the trace context!
            finally:
                detach(token)
        
        executor.submit(wrapped_work)

# Best: Use the threading instrumentation package
from opentelemetry.instrumentation.threading import ThreadingInstrumentor
ThreadingInstrumentor().instrument()
# Now threads and executor.submit() calls auto-propagate context

4. Context Propagation Matrix

┌────────────────┬─────────────────────┬──────────────────────────┬──────────────┐
│ Boundary Type  │ Propagation Method  │ Headers/Carrier          │ Relationship │
├────────────────┼─────────────────────┼──────────────────────────┼──────────────┤
│ HTTP/gRPC      │ W3C Trace Context   │ traceparent, tracestate  │ Parent-Child │
│ Kafka          │ Message headers     │ Custom headers           │ Link         │
│ SQS/SNS        │ Message attributes  │ AWSTraceHeader           │ Link         │
│ RabbitMQ       │ Message properties  │ trace_id in headers      │ Link         │
│ Redis Pub/Sub  │ Message payload     │ Embedded JSON            │ Link         │
│ Thread Pool    │ Context API         │ In-memory                │ Parent-Child │
│ Async/Await    │ Context vars        │ Language runtime         │ Parent-Child │
└────────────────┴─────────────────────┴──────────────────────────┴──────────────┘

5. Common Pitfalls

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONTEXT PROPAGATION PITFALLS                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. LOST CONTEXT IN ASYNC                                                   │
│     Problem: Context not copied to worker threads                           │
│     Fix: Use instrumented executors or manual propagation                   │
│                                                                              │
│  2. CONTEXT LEAKAGE                                                         │
│     Problem: Global context variable shared between requests                │
│     Fix: Use ThreadLocal/ContextVar, never global                          │
│                                                                              │
│  3. MISSING HEADERS IN RETRIES                                              │
│     Problem: Retry library doesn't copy trace headers                       │
│     Fix: Ensure retry wrapper copies headers from original request          │
│                                                                              │
│  4. WRONG RELATIONSHIP FOR QUEUES                                           │
│     Problem: Using parent-child for async messages                          │
│     Fix: Use Links for message queues, parent-child for sync calls         │
│                                                                              │
│  5. SAMPLING DECISIONS INCONSISTENT                                         │
│     Problem: Parent is sampled but child isn't (or vice versa)             │
│     Fix: Propagate sampling decision in tracestate                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Staff Tip: When designing an observability stack, ensure your Trace Context propagates through asynchronous boundaries like message queues (Kafka headers) and background jobs (Sidekiq/Celery); otherwise your traces will have “gaps” that make debugging impossible.

Metrics and Alerting

┌─────────────────────────────────────────────────────────────────────────────┐
│                    METRICS BEST PRACTICES                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  USE (Brendan Gregg's method):                                              │
│  ────────────────────────────                                               │
│  Utilization: % of resource used                                            │
│  Saturation: Queue depth, backpressure                                      │
│  Errors: Error count/rate                                                   │
│                                                                              │
│  RED (for services):                                                        │
│  ─────────────────                                                          │
│  Rate: Requests per second                                                  │
│  Errors: Error rate                                                         │
│  Duration: Latency distribution (p50, p95, p99)                            │
│                                                                              │
│  THE FOUR GOLDEN SIGNALS (Google SRE):                                      │
│  ─────────────────────────────────────                                      │
│  1. Latency: Time to serve request                                          │
│  2. Traffic: Requests per second                                            │
│  3. Errors: Rate of failed requests                                         │
│  4. Saturation: How full is your system?                                    │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  ALERTING BEST PRACTICES:                                                   │
│  ────────────────────────                                                   │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ GOOD ALERTS                     │ BAD ALERTS                       │     │
│  ├────────────────────────────────────────────────────────────────────┤     │
│  │ Error rate > 1% for 5 min       │ CPU > 80%                        │     │
│  │ p99 latency > 500ms             │ Single node down                 │     │
│  │ Error budget < 10%              │ Disk usage > 70%                 │     │
│  │ Zero successful payments        │ Memory usage high                │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  ALERT ON SYMPTOMS, NOT CAUSES                                              │
│  "Users are affected" not "CPU is high"                                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
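
The RED signals map naturally onto one counter and one histogram. A minimal sketch using the prometheus_client library (metric names and buckets are illustrative):

import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Request count (Rate + Errors)",
                   ["method", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Latency (Duration)",
                    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5))

def instrumented(method, handler, *args):
    start = time.monotonic()
    status = "200"
    try:
        return handler(*args)
    except Exception:
        status = "500"
        raise
    finally:
        REQUESTS.labels(method=method, status=status).inc()
        LATENCY.observe(time.monotonic() - start)

start_http_server(9100)  # expose /metrics for Prometheus to scrape

From these two series you can derive all three signals in queries: rate() of the counter for Rate, the 5xx share for Errors, and histogram_quantile() for Duration.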

SLIs, SLOs, and Error Budgets

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SLIs, SLOs, ERROR BUDGETS                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SLI (Service Level Indicator):                                             │
│  ──────────────────────────────                                             │
│  Quantitative measure of service level                                      │
│                                                                              │
│  Examples:                                                                  │
│  • Request latency (p99 < 200ms)                                           │
│  • Availability (successful requests / total requests)                      │
│  • Throughput (requests per second)                                         │
│                                                                              │
│  SLO (Service Level Objective):                                             │
│  ──────────────────────────────                                             │
│  Target value or range for an SLI                                           │
│                                                                              │
│  Examples:                                                                  │
│  • 99.9% of requests complete in < 200ms                                   │
│  • 99.95% availability per month                                            │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  ERROR BUDGET:                                                              │
│  ─────────────                                                              │
│  Error Budget = 1 - SLO                                                     │
│                                                                              │
│  SLO: 99.9% availability                                                    │
│  Error Budget: 0.1% downtime allowed                                        │
│                                                                              │
│  Per month (30 days):                                                       │
│  0.1% × 30 × 24 × 60 = 43.2 minutes of allowed downtime                    │
│                                                                              │
│  BURN RATE:                                                                 │
│  ──────────                                                                 │
│  How fast are you consuming error budget?                                   │
│                                                                              │
│  Burn rate 1.0 = Using budget at expected rate                             │
│  Burn rate 2.0 = Using budget 2x as fast (will exhaust in 15 days)         │
│  Burn rate 10.0 = Critical! Budget gone in 3 days                          │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    ERROR BUDGET VISUALIZATION                       │    │
│  │                                                                     │    │
│  │  100% ┤█████████████████████████████████████████████████████        │    │
│  │       │█████████████████████████████████████████████                │    │
│  │   50% ┤████████████████████████████                                 │    │
│  │       │██████████████████████ ← Budget running low!                │    │
│  │    0% ┼────────────────────────────────────────────────────────     │    │
│  │       Day 1    Day 7    Day 14   Day 21   Day 28   Day 30          │    │
│  │                                                                     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
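
The budget and burn-rate arithmetic above is simple enough to sanity-check in code:

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime for an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(failed_fraction: float, slo: float) -> float:
    """1.0 = consuming budget exactly at the allowed rate."""
    return failed_fraction / (1 - slo)

print(error_budget_minutes(0.999))  # 43.2 minutes per 30 days
print(burn_rate(0.002, 0.999))      # 2.0 -> budget exhausted in 15 days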

Module 28: Chaos Engineering

Netflix’s Approach

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CHAOS ENGINEERING                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  "The discipline of experimenting on a system to build confidence           │
│   in its capability to withstand turbulent conditions in production."       │
│                                                                              │
│  NETFLIX SIMIAN ARMY:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  🐒 Chaos Monkey: Kill random instances                                     │
│  🦍 Chaos Gorilla: Kill entire availability zone                            │
│  🦧 Chaos Kong: Kill entire region                                          │
│  ⏰ Latency Monkey: Inject network latency                                   │
│  📝 Conformity Monkey: Find non-conforming instances                        │
│  🩺 Doctor Monkey: Health checks                                            │
│  🧹 Janitor Monkey: Clean up unused resources                               │
│  Security Monkey: Find security issues                                      │
│                                                                              │
│  PRINCIPLES:                                                                │
│  ───────────                                                                │
│  1. Start with hypothesis about steady state                                │
│  2. Vary real-world events (failure, traffic spike)                         │
│  3. Run experiments in production                                           │
│  4. Automate experiments to run continuously                                │
│  5. Minimize blast radius                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Designing Chaos Experiments

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CHAOS EXPERIMENT DESIGN                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  EXPERIMENT TEMPLATE:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  1. HYPOTHESIS                                                              │
│     "When database X fails, service Y will failover to replica              │
│      within 30 seconds with no user-visible errors"                         │
│                                                                              │
│  2. STEADY STATE                                                            │
│     - Error rate < 0.1%                                                     │
│     - P99 latency < 200ms                                                   │
│     - Orders per minute: ~1000                                              │
│                                                                              │
│  3. EXPERIMENT                                                              │
│     - Target: Production database primary                                   │
│     - Action: Kill database process                                         │
│     - Duration: Until failover complete                                     │
│                                                                              │
│  4. BLAST RADIUS CONTROL                                                    │
│     - Run during low traffic (2 AM)                                         │
│     - Limit to 5% of users (feature flag)                                   │
│     - Automatic rollback if error rate > 5%                                 │
│                                                                              │
│  5. METRICS TO WATCH                                                        │
│     - Database connection errors                                            │
│     - Order completion rate                                                 │
│     - Failover detection time                                               │
│                                                                              │
│  6. ABORT CONDITIONS                                                        │
│     - Error rate > 5%                                                       │
│     - No failover after 60 seconds                                          │
│     - Manual abort button                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
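
A sketch of how a harness might enforce the abort conditions; all four callbacks are hypothetical hooks into your failure-injection tooling and metrics:

import time

ERROR_RATE_ABORT = 0.05      # abort if error rate > 5%
FAILOVER_DEADLINE_S = 60     # abort if no failover after 60 seconds

def run_experiment(inject_failure, restore, error_rate, failover_complete):
    inject_failure()
    start = time.monotonic()
    try:
        while not failover_complete():
            if error_rate() > ERROR_RATE_ABORT:
                print("ABORT: error rate exceeded threshold")
                return False
            if time.monotonic() - start > FAILOVER_DEADLINE_S:
                print("ABORT: failover deadline missed")
                return False
            time.sleep(1)
        return True  # hypothesis holds: failover within the deadline
    finally:
        restore()    # always roll back the injected failure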

Failure Injection Types

┌────────────────────┬────────────────────────────────────────────────────────┐
│ Failure Type       │ How to Inject                                          │
├────────────────────┼────────────────────────────────────────────────────────┤
│ Process crash      │ kill -9, container stop                                │
│ Instance failure   │ Terminate EC2, pod delete                              │
│ Zone failure       │ Block traffic to zone, DNS manipulation                │
│ Region failure     │ Failover entire region                                 │
│ Network latency    │ tc netem, iptables delay                               │
│ Network partition  │ iptables DROP, security groups                         │
│ Packet loss        │ tc netem loss                                          │
│ CPU stress         │ stress-ng, burn CPU                                    │
│ Memory pressure    │ stress-ng --vm                                         │
│ Disk full          │ dd if=/dev/zero of=/tmp/fill                           │
│ Disk slow          │ dm-delay, slow filesystem                              │
│ Clock skew         │ date --set, chronyd manipulation                       │
│ DNS failure        │ Block port 53, bad DNS response                        │
│ Certificate expiry │ Use expired cert, revoke cert                          │
└────────────────────┴────────────────────────────────────────────────────────┘

Module 29: SRE Practices

Error Budgets and Release Velocity

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ERROR BUDGETS IN PRACTICE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ERROR BUDGET POLICY:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  If Error Budget > 50%:                                                     │
│    ✓ Deploy normally                                                        │
│    ✓ Run experiments                                                        │
│    ✓ Add new features                                                       │
│                                                                              │
│  If Error Budget 10-50%:                                                    │
│    ⚠ Slower deployments                                                     │
│    ⚠ Extra review for risky changes                                         │
│    ⚠ Focus on reliability improvements                                      │
│                                                                              │
│  If Error Budget < 10%:                                                     │
│    🛑 Feature freeze                                                        │
│    🛑 Only reliability fixes deployed                                       │
│    🛑 Postmortem required for any new issues                                │
│                                                                              │
│  If Error Budget Exhausted:                                                 │
│    🚨 All hands on reliability                                              │
│    🚨 Executive escalation                                                  │
│    🚨 No deploys until budget restored                                      │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  BALANCING ACT:                                                             │
│  ──────────────                                                             │
│                                                                              │
│  Too much budget remaining = not innovating fast enough                     │
│  Too little budget = reliability suffering                                  │
│                                                                              │
│  Target: Consume ~100% of budget by end of period                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

On-Call Best Practices

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ON-CALL BEST PRACTICES                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ROTATION STRUCTURE:                                                        │
│  ───────────────────                                                        │
│  Primary → Secondary → Escalation Manager                                   │
│                                                                              │
│  Primary: First responder (5 min SLA)                                       │
│  Secondary: Backup if primary unavailable                                   │
│  Escalation: For cross-team or executive decisions                          │
│                                                                              │
│  SUSTAINABLE ON-CALL:                                                       │
│  ────────────────────                                                       │
│  ✓ Max 1 week in 4 on-call                                                 │
│  ✓ Max 2 incidents per 12-hour shift                                       │
│  ✓ Time off after tough on-call shifts                                     │
│  ✓ Clear escalation paths                                                  │
│  ✓ Good runbooks                                                           │
│                                                                              │
│  REDUCING TOIL:                                                             │
│  ──────────────                                                             │
│  Goal: < 50% of SRE time on toil                                           │
│                                                                              │
│  Toil = Manual, repetitive, automatable, no lasting value                   │
│                                                                              │
│  Examples of toil:                                                          │
│  • Manual deployments                                                       │
│  • Restarting services                                                      │
│  • Certificate rotation                                                     │
│  • Capacity additions                                                       │
│                                                                              │
│  Fix: Automate everything that can be automated                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Progressive Rollouts

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PROGRESSIVE ROLLOUTS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CANARY DEPLOYMENT:                                                         │
│  ──────────────────                                                         │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                      Load Balancer                                  │    │
│  │                           │                                         │    │
│  │              ┌────────────┼────────────┐                            │    │
│  │              │            │            │                            │    │
│  │              ▼            ▼            ▼                            │    │
│  │         ┌────────┐  ┌────────┐   ┌────────┐                         │    │
│  │         │ v1.0   │  │ v1.0   │   │ v1.1   │ ← Canary (5%)          │    │
│  │         │ (47.5%)│  │ (47.5%)│   │ (5%)   │                         │    │
│  │         └────────┘  └────────┘   └────────┘                         │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  STAGES:                                                                    │
│  ───────                                                                    │
│  Stage 1: 1% → Verify no crashes (1 hour)                                  │
│  Stage 2: 5% → Check metrics (2 hours)                                     │
│  Stage 3: 25% → Broader validation (4 hours)                               │
│  Stage 4: 50% → Almost there (8 hours)                                     │
│  Stage 5: 100% → Full rollout                                              │
│                                                                              │
│  AUTOMATIC ROLLBACK TRIGGERS:                                               │
│  ────────────────────────────                                               │
│  • Error rate > baseline + 0.5%                                            │
│  • P99 latency > baseline + 100ms                                          │
│  • Memory usage > 90%                                                       │
│  • Any crash loop                                                           │
│                                                                              │
│  BLUE-GREEN DEPLOYMENT:                                                     │
│  ──────────────────────                                                     │
│                                                                              │
│  Before:  LB → Blue (v1.0) [Active]                                        │
│                Green (v1.1) [Staged]                                        │
│                                                                              │
│  After:   LB → Green (v1.1) [Active]                                       │
│                Blue (v1.0) [Standby for rollback]                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
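
The automatic-rollback triggers reduce to a predicate evaluated at each stage against the stable fleet's baseline. A minimal sketch (field names are illustrative):

STAGES = [(0.01, 1), (0.05, 2), (0.25, 4), (0.50, 8), (1.00, None)]  # (traffic, soak hours)

def should_rollback(canary: dict, baseline: dict) -> bool:
    return (
        canary["error_rate"] > baseline["error_rate"] + 0.005  # +0.5%
        or canary["p99_ms"] > baseline["p99_ms"] + 100         # +100 ms
        or canary["memory_pct"] > 90
        or canary["crash_looping"]
    )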

Module 30: Advanced Resiliency Patterns

At the Staff/Principal level, reliability isn’t just about “fixing bugs”—it’s about designing architectures that are inherently resistant to failure.

1. Static Stability

A system is Statically Stable if it continues to operate in its “steady state” without needing to make changes during a dependency failure.
  • The Problem: Reactive autoscaling. If AZ-1 fails, AZ-2 and AZ-3 try to scale up. But if the control plane (Kubernetes/EC2 API) is also failing, they can’t scale, and the whole system crashes.
  • The Solution: Over-provisioning. Run AZ-1, AZ-2, and AZ-3 at 50% utilization each. If one AZ fails, the surviving two absorb the full load (rising to ~75% utilization) immediately, without calling any external APIs.
  • Key Principle: Avoid “Control Plane” dependencies in the “Data Plane” recovery path.

2. Cell-Based Architectures

Instead of one giant “monolith” cluster, you split your infrastructure into many independent Cells.
  • Definition: A Cell is a complete, self-contained instance of the service (App + DB + Cache).
  • Blast Radius: If Cell A has a “poison pill” request or a hardware failure, only the 5% of users in Cell A are affected. The other 95% of users in other cells are completely isolated.
  • Scaling: To double capacity, you don’t scale the cells; you just add more cells.
  • Used By: AWS (Lambda, DynamoDB), Salesforce, and Facebook.
┌─────────────────────────────────────────────────────────────────────────────┐
│                         CELL-BASED ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│       [ Global DNS / Router ]                                               │
│                │                                                             │
│       ┌────────┴────────┬───────────────┬──────────────┐                     │
│       │                 │               │              │                     │
│       ▼                 ▼               ▼              ▼                     │
│  ┌──────────┐      ┌──────────┐    ┌──────────┐   ┌──────────┐               │
│  │  CELL 1  │      │  CELL 2  │    │  CELL 3  │   │  CELL 4  │               │
│  │ (Users   │      │ (Users   │    │ (Users   │   │ (Users   │               │
│  │  1-10k)  │      │ 10-20k)  │    │ 20-30k)  │   │ 30-40k)  │               │
│  └──────────┘      └──────────┘    └──────────┘   └──────────┘               │
│                                                                              │
│  - No cross-cell dependencies.                                               │
│  - Failures are contained within a single cell.                              │
│  - Upgrades happen one cell at a time.                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
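
Routing must be deterministic so a user always lands in the same cell. A minimal sketch (real systems typically use a persisted mapping or consistent hashing so adding a cell does not reshuffle existing users):

import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

def cell_for_user(user_id: str) -> str:
    # Hash, don't round-robin: the same user must always hit the same cell
    digest = hashlib.sha256(user_id.encode()).digest()
    return CELLS[int.from_bytes(digest[:8], "big") % len(CELLS)]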

3. Dependency Isolation (The Bulkhead Pattern)

Just like a ship is divided into watertight compartments (bulkheads), a distributed system should isolate its dependencies.
  • Implementation: Use separate thread pools or connection pools for different downstream services.
  • Benefit: If Service A becomes slow, its thread pool fills up, but Service B’s thread pool remains free, allowing the rest of the system to function.
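
A minimal sketch: one bounded thread pool per downstream dependency, so a hung Service A can block at most its own compartment:

from concurrent.futures import ThreadPoolExecutor

POOLS = {  # one compartment per dependency (sizes illustrative)
    "service_a": ThreadPoolExecutor(max_workers=10),
    "service_b": ThreadPoolExecutor(max_workers=10),
}

def call(dependency: str, fn, *args):
    # If service_a hangs, only its 10 workers block; service_b stays healthy
    return POOLS[dependency].submit(fn, *args)

Note that ThreadPoolExecutor's internal queue is unbounded; a production bulkhead also bounds the queue and fails fast when the compartment is full.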

Module 31: Incident Management

Incident Response Framework

┌─────────────────────────────────────────────────────────────────────────────┐
│                    INCIDENT RESPONSE                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SEVERITY LEVELS:                                                           │
│  ────────────────                                                           │
│                                                                              │
│  SEV1 (Critical):                                                           │
│  • Complete service outage                                                  │
│  • Data loss or security breach                                             │
│  • All hands, exec involvement                                              │
│                                                                              │
│  SEV2 (High):                                                               │
│  • Major feature unavailable                                                │
│  • Significant user impact                                                  │
│  • Dedicated incident team                                                  │
│                                                                              │
│  SEV3 (Medium):                                                             │
│  • Partial degradation                                                      │
│  • Limited user impact                                                      │
│  • On-call handles                                                          │
│                                                                              │
│  SEV4 (Low):                                                                │
│  • Minor issue                                                              │
│  • Workaround available                                                     │
│  • Fix in normal sprint                                                     │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  INCIDENT ROLES:                                                            │
│  ───────────────                                                            │
│                                                                              │
│  Incident Commander (IC):                                                   │
│  • Coordinates response                                                     │
│  • Makes decisions                                                          │
│  • NOT doing technical work                                                 │
│                                                                              │
│  Communications Lead:                                                       │
│  • Updates status page                                                      │
│  • Communicates with stakeholders                                           │
│  • Manages customer messaging                                               │
│                                                                              │
│  Technical Lead:                                                            │
│  • Leads technical investigation                                            │
│  • Coordinates engineering response                                         │
│                                                                              │
│  Scribe:                                                                    │
│  • Documents timeline                                                       │
│  • Records decisions and actions                                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Postmortem Culture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    BLAMELESS POSTMORTEMS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PRINCIPLES:                                                                │
│  ───────────                                                                │
│                                                                              │
│  1. BLAMELESS: Focus on systems, not individuals                            │
│     BAD: "John pushed bad code"                                             │
│     GOOD: "Our CI pipeline didn't catch the regression"                     │
│                                                                              │
│  2. LEARN: Extract lessons, don't just assign blame                         │
│                                                                              │
│  3. SHARE: Postmortems are public within company                            │
│                                                                              │
│  4. ACT: Action items must be tracked and completed                         │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  POSTMORTEM TEMPLATE:                                                       │
│  ────────────────────                                                       │
│                                                                              │
│  ## Incident Summary                                                        │
│  • Duration: 2 hours 15 minutes                                             │
│  • Impact: 40% of users saw errors                                          │
│  • Severity: SEV2                                                           │
│                                                                              │
│  ## Timeline                                                                │
│  14:30 - Deploy of commit abc123                                            │
│  14:35 - Error rate alerts fire                                             │
│  14:40 - On-call acknowledges                                               │
│  14:55 - Root cause identified                                              │
│  15:10 - Rollback initiated                                                 │
│  15:15 - Service restored                                                   │
│  16:45 - Incident closed                                                    │
│                                                                              │
│  ## Root Cause                                                              │
│  Database connection pool exhausted due to missing timeout                  │
│                                                                              │
│  ## What Went Well                                                          │
│  • Fast detection (5 min)                                                   │
│  • Clear runbook                                                            │
│                                                                              │
│  ## What Went Poorly                                                        │
│  • No canary caught the issue                                               │
│  • Rollback took 20 min (should be 5)                                       │
│                                                                              │
│  ## Action Items                                                            │
│  1. [P0] Add connection pool monitoring - @alice - Due 1/15                │
│  2. [P1] Improve canary coverage - @bob - Due 1/20                         │
│  3. [P2] Speed up rollback - @charlie - Due 1/30                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 32: Capacity Planning

Load Testing

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOAD TESTING                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TYPES OF TESTS:                                                            │
│  ───────────────                                                            │
│                                                                              │
│  LOAD TEST: Normal/expected load                                            │
│  ────────────────────────────                                               │
│  RPS  ▲                                                                     │
│       │     ┌───────────────────────┐                                       │
│  1000 │─────┤  Sustained 1000 RPS   │─────                                  │
│       │     └───────────────────────┘                                       │
│       └──────────────────────────────────────► Time                         │
│                                                                              │
│  STRESS TEST: Find breaking point                                           │
│  ────────────────────────────────                                           │
│  RPS  ▲                                                                     │
│       │                           ╱ Break!                                  │
│  5000 │                         ╱                                           │
│       │                       ╱                                             │
│  2500 │                     ╱                                               │
│       │                   ╱                                                 │
│  1000 │─────────────────╱                                                   │
│       └──────────────────────────────────────► Time                         │
│                                                                              │
│  SOAK TEST: Long duration stability                                         │
│  ──────────────────────────────────                                         │
│  RPS  ▲                                                                     │
│       │     ┌──────────────────────────────────────────┐                    │
│  1000 │─────┤          24 hours at 1000 RPS            │                    │
│       │     └──────────────────────────────────────────┘                    │
│       └──────────────────────────────────────────────────► Time             │
│  Look for: Memory leaks, connection exhaustion                              │
│                                                                              │
│  SPIKE TEST: Sudden traffic burst                                           │
│  ────────────────────────────────                                           │
│  RPS  ▲                                                                     │
│       │          ╱╲                                                         │
│  5000 │         ╱  ╲                                                        │
│       │        ╱    ╲                                                       │
│  1000 │───────╱      ╲───────                                               │
│       └──────────────────────────────────────► Time                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
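
Tooling varies (k6, Gatling, JMeter, Locust); as one example, a minimal Locust script with illustrative endpoints and traffic mix:

from locust import HttpUser, task, between

class ShopUser(HttpUser):
    wait_time = between(0.5, 1.5)  # think time between actions

    @task(9)                       # 9:1 browse-to-checkout mix
    def browse(self):
        self.client.get("/products")

    @task(1)
    def checkout(self):
        self.client.post("/checkout", json={"sku": "sku-123", "qty": 1})

# Run: locust -f loadtest.py --host https://staging.example.com \
#             --users 1000 --spawn-rate 50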

Capacity Modeling

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CAPACITY MODELING                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  STEP 1: UNDERSTAND CURRENT STATE                                           │
│  ─────────────────────────────────                                          │
│  • Current RPS: 10,000                                                      │
│  • Current instances: 20                                                    │
│  • CPU utilization: 40%                                                     │
│  • RPS per instance: 500                                                    │
│                                                                              │
│  STEP 2: DEFINE HEADROOM                                                    │
│  ────────────────────────                                                   │
│  • Target CPU: 60% (buffer for spikes)                                      │
│  • Available headroom: 20% (60% - 40%)                                      │
│  • Max RPS at current capacity: 15,000                                      │
│                                                                              │
│  STEP 3: PROJECT GROWTH                                                     │
│  ────────────────────────                                                   │
│  • Current: 10,000 RPS                                                      │
│  • Growth rate: 10% per month                                               │
│  • In 6 months: 17,716 RPS                                                  │
│  • In 12 months: 31,384 RPS                                                 │
│                                                                              │
│  STEP 4: PLAN ADDITIONS                                                     │
│  ────────────────────────                                                   │
│                                                                              │
│  Month 1-4:   Current capacity sufficient                                   │
│  Month 5:     Need 25 instances (add 5)                                     │
│  Month 8:     Need 32 instances (add 7)                                     │
│  Month 12:    Need 42 instances (add 10)                                    │
│                                                                              │
│  STEP 5: BUFFER FOR EVENTS                                                  │
│  ─────────────────────────                                                  │
│  • Black Friday: 3x normal traffic                                          │
│  • Need 3x capacity for 48 hours                                            │
│  • Pre-scale 1 week before                                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
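
To make the arithmetic reproducible, here is a minimal Python sketch of the model above. The constants come from the worked example; the plan above appears to provision a few instances beyond the raw projection in months 5 and 8 as extra buffer, which real plans typically do.

import math

CURRENT_RPS = 10_000
CURRENT_INSTANCES = 20
CURRENT_CPU = 0.40          # observed average utilization
TARGET_CPU = 0.60           # leave headroom for spikes
MONTHLY_GROWTH = 0.10

# Sustainable RPS per instance at the target utilization:
# 500 RPS at 40% CPU scales linearly to 750 RPS at 60% CPU.
rps_per_instance = (CURRENT_RPS / CURRENT_INSTANCES) * (TARGET_CPU / CURRENT_CPU)

for month in (4, 5, 8, 12):
    projected_rps = CURRENT_RPS * (1 + MONTHLY_GROWTH) ** month
    instances = math.ceil(projected_rps / rps_per_instance)
    print(f"Month {month:2d}: {projected_rps:8.0f} RPS -> {instances} instances")

# Month 12 works out to ~31,384 RPS and 42 instances, matching the plan.
# Round up further (and add event buffers) before committing to a number.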

Autoscaling Strategies

┌─────────────────────────────────────────────────────────────────────────────┐
│                    AUTOSCALING                                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCALING METRICS:                                                           │
│  ────────────────                                                           │
│                                                                              │
│  CPU-based:                                                                 │
│    Scale up when: avg CPU > 70%                                             │
│    Scale down when: avg CPU < 30%                                           │
│                                                                              │
│  Request-based:                                                             │
│    Target: 1000 RPS per instance                                            │
│    Current: 3000 RPS, 2 instances                                           │
│    Action: Scale to 3 instances                                             │
│                                                                              │
│  Queue-based:                                                               │
│    Scale up when: queue depth > 1000                                        │
│    Scale down when: queue empty for 5 min                                   │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  PREDICTIVE SCALING:                                                        │
│  ───────────────────                                                        │
│                                                                              │
│  Traffic  ▲      Expected     Actual                                        │
│           │      ─────────    ────────                                      │
│           │         ╱╲           ╱╲                                         │
│           │       ╱    ╲       ╱    ╲                                       │
│           │     ╱        ╲   ╱        ╲                                     │
│           │───╱────────────╲╱────────────                                   │
│           │  9AM          12PM         3PM                                  │
│           └──────────────────────────────────► Time                         │
│                                                                              │
│  Scale BEFORE traffic arrives based on historical patterns                  │
│                                                                              │
│  COOLDOWN PERIODS:                                                          │
│  ─────────────────                                                          │
│  Scale up cooldown: 3 minutes                                               │
│  Scale down cooldown: 10 minutes                                            │
│  Prevents thrashing (scale up/down/up/down)                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
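
Here is a minimal sketch of the request-based rule with asymmetric cooldowns, using the numbers above; the names and structure are illustrative, not a real autoscaler API.

import math
import time

TARGET_RPS_PER_INSTANCE = 1000
MIN_INSTANCES, MAX_INSTANCES = 2, 50      # illustrative bounds
SCALE_UP_COOLDOWN_S = 3 * 60
SCALE_DOWN_COOLDOWN_S = 10 * 60

_last_scale = 0.0

def desired_count(total_rps: float) -> int:
    # Same rule as the example: 3000 RPS / 1000 per instance -> 3.
    want = math.ceil(total_rps / TARGET_RPS_PER_INSTANCE)
    return max(MIN_INSTANCES, min(MAX_INSTANCES, want))

def maybe_scale(total_rps: float, current: int) -> int:
    global _last_scale
    want = desired_count(total_rps)
    elapsed = time.time() - _last_scale
    # Scale up quickly, scale down slowly: the asymmetric cooldowns are
    # what prevent thrashing when traffic oscillates around a threshold.
    if want > current and elapsed >= SCALE_UP_COOLDOWN_S:
        _last_scale = time.time()
        return want
    if want < current and elapsed >= SCALE_DOWN_COOLDOWN_S:
        _last_scale = time.time()
        return want
    return current

print(maybe_scale(3000, 2))  # -> 3, matching the request-based example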

Advanced: Cluster Scheduling Internals (Borg & DRF)

In a modern distributed environment, you don’t manage individual servers; you manage a cluster. A scheduler (such as Google’s Borg or Kubernetes’ kube-scheduler) decides which node each task runs on.

The Bin Packing Problem

At its core, scheduling is a “Multidimensional Bin Packing” problem: you have containers with varying CPU/RAM needs and nodes with varying capacities, and every container must be placed without exceeding any node’s capacity in any dimension.
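
As a toy illustration, here is first-fit placement in two dimensions (CPU and RAM); real schedulers track many more dimensions and use far smarter heuristics:

def first_fit(tasks, nodes):
    """tasks/nodes: lists of dicts with 'cpu' and 'ram' fields."""
    placements = {}
    for i, task in enumerate(tasks):
        for j, node in enumerate(nodes):
            # A node is feasible only if EVERY dimension fits.
            if node["cpu"] >= task["cpu"] and node["ram"] >= task["ram"]:
                node["cpu"] -= task["cpu"]
                node["ram"] -= task["ram"]
                placements[i] = j
                break
        else:
            placements[i] = None  # unschedulable with current capacity
    return placements

nodes = [{"cpu": 4, "ram": 8}, {"cpu": 8, "ram": 16}]
tasks = [{"cpu": 2, "ram": 4}, {"cpu": 3, "ram": 6}, {"cpu": 4, "ram": 8}]
print(first_fit(tasks, nodes))  # {0: 0, 1: 1, 2: 1}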

1. Dominant Resource Fairness (DRF)

In a cluster, users need multiple resources (e.g., User A needs high CPU, User B needs high RAM). How do you allocate resources fairly? DRF is the standard algorithm.
  • Definition: DRF calculates the “dominant share” for each user (the resource they need the most of, relative to the cluster’s total capacity) and tries to equalize these shares across users; the sketch below makes this concrete.
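
The classic example from the DRF paper: a cluster with 9 CPUs and 18 GB RAM, where user A’s tasks each need ⟨1 CPU, 4 GB⟩ (RAM-dominant) and user B’s each need ⟨3 CPU, 1 GB⟩ (CPU-dominant). A minimal progressive-filling sketch:

CLUSTER = {"cpu": 9.0, "ram": 18.0}
demands = {"A": {"cpu": 1, "ram": 4}, "B": {"cpu": 3, "ram": 1}}

alloc = {u: {"cpu": 0.0, "ram": 0.0} for u in demands}
used = {"cpu": 0.0, "ram": 0.0}

def dominant_share(user):
    # The user's largest share of any single resource.
    return max(alloc[user][r] / CLUSTER[r] for r in CLUSTER)

while True:
    # Give the next task to the user with the LOWEST dominant share.
    user = min(demands, key=dominant_share)
    need = demands[user]
    if any(used[r] + need[r] > CLUSTER[r] for r in CLUSTER):
        break  # simplification: a fuller version would try other users
    for r in CLUSTER:
        alloc[user][r] += need[r]
        used[r] += need[r]

for u in demands:
    print(u, alloc[u], f"dominant share = {dominant_share(u):.0%}")
# Both end at a 2/3 dominant share: A holds 12/18 GB RAM (3 tasks),
# B holds 6/9 CPUs (2 tasks), which is exactly the DRF paper's result.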

2. Priority and Preemption

Not all jobs are equal. Borg introduced two main categories:
  • Prod (Production): Low latency, high availability (e.g., Search, Gmail).
  • Non-Prod (Batch): High throughput, latency-insensitive (e.g., Log processing, ML training).
Preemption: If a high-priority “Prod” job needs resources, the scheduler will kill (evict) lower-priority “Non-Prod” jobs to make room. This allows Google to run clusters at 90%+ utilization.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        SCHEDULER DECISION FLOW                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Task arrives: {CPU: 2, RAM: 4GB, Priority: PROD}                            │
│                                                                              │
│  1. FILTERING (Predicates):                                                  │
│     - Does node have enough free RAM?                                        │
│     - Does node have required OS/Hardware?                                   │
│     - Result: Set of "Feasible Nodes"                                        │
│                                                                              │
│  2. SCORING (Priorities):                                                    │
│     - Which node minimizes fragmentation? (Bin packing)                      │
│     - Which node has the data locally? (Locality)                            │
│     - Result: "Best Node" picked                                             │
│                                                                              │
│  3. PREEMPTION (If no feasible nodes):                                       │
│     - Can I kill priority 0 jobs to fit priority 10?                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
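
Here is a minimal Python sketch of the filter-then-score loop. The scoring rule (prefer the tightest feasible fit, i.e. bin packing) is one common heuristic among many, and summing CPU cores with GB directly is a simplification a real scorer would normalize away.

def schedule(task, nodes):
    # 1. FILTERING: keep only nodes that satisfy hard constraints.
    feasible = [
        n for n in nodes
        if n["free_cpu"] >= task["cpu"] and n["free_ram"] >= task["ram"]
    ]
    if not feasible:
        return None  # a real scheduler would consider preemption here

    # 2. SCORING: fewer leftover resources = tighter packing.
    def leftover(n):
        return (n["free_cpu"] - task["cpu"]) + (n["free_ram"] - task["ram"])

    best = min(feasible, key=leftover)
    best["free_cpu"] -= task["cpu"]
    best["free_ram"] -= task["ram"]
    return best["name"]

nodes = [
    {"name": "n1", "free_cpu": 8, "free_ram": 32},
    {"name": "n2", "free_cpu": 2, "free_ram": 8},
]
print(schedule({"cpu": 2, "ram": 4}, nodes))  # "n2": tightest fit wins
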
Staff Tip: When designing a system, differentiate between Hard Constraints (the app cannot run without this) and Soft Constraints/Affinities (the app should run here for better performance). Over-constraining your scheduler leads to “unschedulable” tasks and wasted resources.

Key Interview Questions

Q1: Production latency has increased across several services. How would you debug it?

Systematic approach:
  1. Check dashboards first
    • Which services show elevated latency?
    • When did it start? Correlate with deployments
    • Scope: All users or specific segment?
  2. Use distributed tracing
    • Find slow traces
    • Identify which span is slowest
    • Look for patterns (specific service, DB, external API)
  3. Drill down
    • Check that service’s metrics (CPU, memory, connections)
    • Check dependencies (DB latency, cache hit rate)
    • Check for new error types in logs
  4. Common culprits
    • Database slow queries
    • Cache miss spike
    • Connection pool exhaustion
    • Garbage collection
    • Noisy neighbor (shared resources)
    • External API degradation
  5. Mitigation while investigating
    • Scale up if resource-bound
    • Enable circuit breaker
    • Failover to backup

Q2: How would you introduce chaos engineering at an organization that has never practiced it?

PHASE 1: FOUNDATION (Month 1-2)
─────────────────────────────
• Define steady state metrics
• Set up experiment tracking
• Train team on principles
• Start with game days (manual)

PHASE 2: BASIC EXPERIMENTS (Month 3-4)
──────────────────────────────────────
• Instance failures
• Network latency injection
• Run in staging first
• Graduate to production with small blast radius

PHASE 3: ADVANCED (Month 5-6)
─────────────────────────────
• Zone failures
• Database failover
• Certificate expiry
• Clock skew

PHASE 4: AUTOMATION (Month 7+)
──────────────────────────────
• Continuous chaos in production
• Automatic rollback on failure
• Integrate with CI/CD
• Coverage reporting

KEY PRINCIPLES:
• Always have rollback plan
• Start small, expand gradually
• Document everything
• Celebrate finding issues!
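
Phase 2’s latency injection can start as a simple wrapper around dependency calls. A minimal sketch, where `call_downstream` and the constants are illustrative placeholders:

import random
import time

BLAST_RADIUS = 0.01        # affect ~1% of requests
INJECTED_LATENCY_S = 0.3   # extra delay per affected request

def with_chaos(call_downstream, *args, **kwargs):
    # Injecting on a small random fraction keeps the experiment's impact
    # bounded while still exercising timeouts and retries.
    if random.random() < BLAST_RADIUS:
        time.sleep(INJECTED_LATENCY_S)
    return call_downstream(*args, **kwargs)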

Q3: Your on-call team is suffering from alert fatigue (too many pages). How do you fix it?

Short-term mitigation:
  • Analyze page patterns
  • Suppress non-actionable alerts
  • Add secondary on-call
Medium-term fixes:
  • Improve runbooks for faster resolution
  • Automate common remediations
  • Add better monitoring to prevent issues
Long-term solutions:
  • Fix underlying reliability issues
  • Add circuit breakers, retries
  • Improve capacity planning
  • Push back on feature launches when the error budget is exhausted
Process changes:
  • Track alert metrics (pages per week)
  • Review every page in team meeting
  • Goal: < 2 pages per on-call shift
  • Escalate if consistently exceeded

Q4: Your company expects roughly 10x normal traffic for an upcoming event. How do you prepare?

PREPARATION TIMELINE:

T-4 weeks:
• Load test current capacity
• Identify bottlenecks
• Order additional capacity

T-2 weeks:
• Pre-scale databases (can't autoscale fast)
• Increase cache capacity
• Add read replicas
• Pre-warm caches

T-1 week:
• Scale compute to 3x normal
• Verify autoscaling works
• Test circuit breakers
• Prepare runbooks

T-1 day:
• Final scale to target (10x)
• War room ready
• All hands available

During event:
• Monitor dashboards
• Quick decisions on feature flags
• Prepared to shed load (graceful degradation)

After:
• Scale down gradually (30% per hour)
• Postmortem any issues
• Update capacity model

Capstone: Interview Preparation

Practice System Design Problems

Apply everything you’ve learned:
  1. Design Uber’s dispatch system
    • Real-time location tracking
    • Matching drivers to riders
    • Surge pricing
  2. Design Stripe’s payment processing
    • Exactly-once payments
    • Idempotency
    • Saga pattern for complex transactions
  3. Design Netflix’s video streaming
    • CDN architecture
    • Adaptive bitrate
    • Regional failover
  4. Design Twitter’s timeline
    • Fan-out on write vs read
    • Celebrity problem
    • Real-time updates

Congratulations!

You’ve completed the Distributed Systems Mastery course. You now have the knowledge to:
  • ✅ Explain consensus protocols (Raft, Paxos) in interviews
  • ✅ Design systems with appropriate consistency guarantees
  • ✅ Choose between replication strategies
  • ✅ Implement distributed transactions correctly
  • ✅ Build and operate systems at massive scale
  • ✅ Debug complex distributed systems issues
Next steps: Practice! Do mock interviews, build projects, and read the papers referenced throughout this course.