Chaos Engineering

Chaos engineering proactively tests system resilience by intentionally introducing failures. Learn how Netflix and other tech giants build confidence in their distributed systems.

Learning Objectives:

Understand chaos engineering principles
Implement failure injection techniques
Design and run chaos experiments
Build resilience through controlled chaos
Create game day exercises

Why Chaos Engineering?

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE NEED FOR CHAOS ENGINEERING                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  REALITY OF DISTRIBUTED SYSTEMS:                                            │
│  ───────────────────────────────────────                                    │
│                                                                              │
│  "Everything fails, all the time" - Werner Vogels, Amazon CTO               │
│                                                                              │
│  Microservices introduce:                                                   │
│  • Network partitions              • Dependency failures                    │
│  • Latency spikes                  • Resource exhaustion                    │
│  • Data inconsistency              • Configuration errors                   │
│  • Cascading failures              • Deployment issues                      │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════════│
│                                                                              │
│  CHAOS ENGINEERING APPROACH:                                                │
│  ─────────────────────────────                                              │
│                                                                              │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │   Hypothesis    │───▶│    Inject       │───▶│    Observe      │         │
│  │  "System will   │    │   Failures      │    │    Behavior     │         │
│  │   handle X"     │    │   (Controlled)  │    │                 │         │
│  └─────────────────┘    └─────────────────┘    └────────┬────────┘         │
│                                                          │                  │
│         ┌────────────────────────────────────────────────┘                  │
│         │                                                                   │
│         ▼                                                                   │
│  ┌─────────────────┐    ┌─────────────────┐                                │
│  │   Learn &       │◀───│    Analyze      │                                │
│  │   Improve       │    │    Results      │                                │
│  └─────────────────┘    └─────────────────┘                                │
│                                                                              │
│  GOAL: Build confidence that your system can withstand turbulent           │
│        conditions in production                                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Chaos Engineering Principles

The Scientific Method

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CHAOS EXPERIMENT LIFECYCLE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. DEFINE STEADY STATE                                                     │
│     ─────────────────────────                                               │
│     "What does 'healthy' look like?"                                        │
│     • Response time p99 < 200ms                                             │
│     • Error rate < 0.1%                                                     │
│     • Orders processed per minute > 100                                     │
│                                                                              │
│  2. HYPOTHESIZE                                                             │
│     ─────────────────                                                       │
│     "Steady state will continue when..."                                    │
│     • Payment service becomes unavailable                                   │
│     • Database latency increases 10x                                        │
│     • 30% of instances are terminated                                       │
│                                                                              │
│  3. DESIGN EXPERIMENT                                                       │
│     ─────────────────────                                                   │
│     • What failure to inject?                                               │
│     • Blast radius (scope)                                                  │
│     • Duration                                                              │
│     • Abort conditions                                                      │
│                                                                              │
│  4. RUN EXPERIMENT                                                          │
│     ────────────────────                                                    │
│     • Inject failure                                                        │
│     • Monitor systems                                                       │
│     • Observe behavior                                                      │
│     • Be ready to abort                                                     │
│                                                                              │
│  5. ANALYZE & LEARN                                                         │
│     ────────────────────                                                    │
│     • Did steady state hold?                                                │
│     • What broke?                                                           │
│     • What was the blast radius?                                            │
│     • How can we improve?                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
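
The lifecycle above can be captured as plain data plus a small check. The following is a minimal sketch rather than a prescribed format: the field names (`steadyState`, `design`, `abortWhen`) and the `checkSteadyState` helper are hypothetical, and the thresholds are simply the example values from step 1.

```javascript
// Hypothetical experiment definition mirroring the lifecycle steps above.
const experimentPlan = {
  steadyState: {
    maxLatencyP99Ms: 200,     // Response time p99 < 200ms
    maxErrorRate: 0.001,      // Error rate < 0.1%
    minOrdersPerMinute: 100   // Orders processed per minute > 100
  },
  hypothesis: 'Steady state continues when the payment service is unavailable',
  design: {
    failure: 'payment-service unavailable',
    blastRadius: '5% of order traffic',
    durationMs: 60000,
    abortWhen: { maxErrorRate: 0.05 }
  }
};

// Returns the list of violated steady-state conditions (empty means healthy).
function checkSteadyState(snapshot, steadyState) {
  const violations = [];
  if (snapshot.latencyP99Ms >= steadyState.maxLatencyP99Ms) {
    violations.push(`latencyP99 ${snapshot.latencyP99Ms}ms >= ${steadyState.maxLatencyP99Ms}ms`);
  }
  if (snapshot.errorRate >= steadyState.maxErrorRate) {
    violations.push(`errorRate ${snapshot.errorRate} >= ${steadyState.maxErrorRate}`);
  }
  if (snapshot.ordersPerMinute <= steadyState.minOrdersPerMinute) {
    violations.push(`ordersPerMinute ${snapshot.ordersPerMinute} <= ${steadyState.minOrdersPerMinute}`);
  }
  return violations;
}
```

Running the same check before injection (step 1) and after analysis (step 5) keeps the definition of "healthy" identical across the experiment.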

Failure Injection Types

Service Level Failures

// chaos/service-failures.js

class ServiceChaos {
  constructor(options = {}) {
    this.enabled = process.env.CHAOS_ENABLED === 'true';
    // Use ?? so an explicit 0 (inject nothing / no delay) is respected
    this.targetPercentage = options.targetPercentage ?? 10;
    this.latencyMs = options.latencyMs ?? 2000;
  }

  // Middleware for Express
  middleware() {
    return (req, res, next) => {
      if (!this.enabled) return next();
      
      const random = Math.random() * 100;
      
      // Inject failures based on configuration
      if (random < this.targetPercentage) {
        const failureType = this.selectFailure();
        return this.injectFailure(failureType, req, res, next);
      }
      
      next();
    };
  }

  selectFailure() {
    const failures = [
      { type: 'latency', weight: 50 },
      { type: 'error', weight: 30 },
      { type: 'timeout', weight: 15 },
      { type: 'exception', weight: 5 }
    ];
    
    const total = failures.reduce((sum, f) => sum + f.weight, 0);
    let random = Math.random() * total;
    
    for (const failure of failures) {
      random -= failure.weight;
      if (random <= 0) return failure.type;
    }
    
    return 'latency';
  }

  injectFailure(type, req, res, next) {
    console.log(`[CHAOS] Injecting ${type} failure for ${req.path}`);
    
    switch (type) {
      case 'latency':
        // Add artificial delay
        setTimeout(next, this.latencyMs);
        break;
        
      case 'error':
        // Return 500 error
        res.status(500).json({
          error: 'Internal Server Error',
          chaos: true,
          message: 'This is a chaos-injected failure'
        });
        break;
        
      case 'timeout':
        // Don't respond (simulate timeout)
        // The client will eventually timeout
        break;
        
      case 'exception':
        // Throw an exception
        throw new Error('Chaos-injected exception');
        
      default:
        next();
    }
  }
}

// Apply to specific routes (`app` is your existing Express application instance)
app.use('/api/orders', new ServiceChaos({
  targetPercentage: 5,
  latencyMs: 3000
}).middleware());
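
Because `selectFailure` draws from `Math.random()` directly, its weighting is hard to verify in a unit test. One option, sketched here under the assumption that you are free to restructure the helper, is to pass the random source in. `pickWeighted` and its `rand` parameter are illustrative names, not part of the class above.

```javascript
// Weighted selection with an injectable random source (defaults to Math.random).
// `rand` must return a number in [0, 1).
function pickWeighted(items, rand = Math.random) {
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  let remaining = rand() * total;

  for (const item of items) {
    remaining -= item.weight;
    if (remaining < 0) return item.type;
  }
  // Only reachable through floating-point rounding; fall back to the last item.
  return items[items.length - 1].type;
}

// Same weights as selectFailure() above.
const failureWeights = [
  { type: 'latency', weight: 50 },
  { type: 'error', weight: 30 },
  { type: 'timeout', weight: 15 },
  { type: 'exception', weight: 5 }
];

// Fixed "random" values make each bucket reachable deterministically.
console.log(pickWeighted(failureWeights, () => 0.10)); // latency   (10 falls in 0-50)
console.log(pickWeighted(failureWeights, () => 0.60)); // error     (60 falls in 50-80)
console.log(pickWeighted(failureWeights, () => 0.90)); // timeout   (90 falls in 80-95)
console.log(pickWeighted(failureWeights, () => 0.97)); // exception (97 falls in 95-100)
```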

Network Failures

// chaos/network-chaos.js
const http = require('http');
const https = require('https');

class NetworkChaos {
  constructor(options = {}) {
    this.enabled = process.env.CHAOS_ENABLED === 'true';
    this.config = {
      packetLoss: options.packetLoss || 0,  // Percentage
      latency: options.latency || 0,  // Milliseconds
      jitter: options.jitter || 0,  // Milliseconds
      bandwidth: options.bandwidth || null,  // Bytes per second
      ...options
    };
    
    this.originalRequest = http.request.bind(http);
    this.originalSecureRequest = https.request.bind(https);
  }

  enable() {
    if (!this.enabled) return;
    
    // Monkey-patch http.request
    http.request = (options, callback) => {
      return this.wrapRequest(this.originalRequest, options, callback);
    };
    
    https.request = (options, callback) => {
      return this.wrapRequest(this.originalSecureRequest, options, callback);
    };
    
    console.log('[CHAOS] Network chaos enabled');
  }

  disable() {
    http.request = this.originalRequest;
    https.request = this.originalSecureRequest;
    console.log('[CHAOS] Network chaos disabled');
  }

  wrapRequest(originalFn, options, callback) {
    const hostname = typeof options === 'string' 
      ? new URL(options).hostname 
      : options.hostname || options.host;
    
    // Check if this host should be affected
    if (!this.shouldAffect(hostname)) {
      return originalFn(options, callback);
    }
    
    // Simulate packet loss
    if (Math.random() * 100 < this.config.packetLoss) {
      console.log(`[CHAOS] Simulating packet loss for ${hostname}`);
      const req = originalFn(options, () => {});
      req.on('socket', (socket) => {
        socket.destroy(new Error('Chaos: simulated packet loss'));
      });
      return req;
    }
    
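    // Add latency with jitter by delaying delivery of the response.
    // The wrapper must still return a ClientRequest synchronously: returning a
    // Promise would break callers that use req.write(), req.end(), or req.on().
    const delay = this.config.latency + (Math.random() * this.config.jitter);
    if (delay > 0 && typeof callback === 'function') {
      console.log(`[CHAOS] Adding ${Math.round(delay)}ms latency for ${hostname}`);
      return originalFn(options, (res) => {
        setTimeout(() => callback(res), delay);
      });
    }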
    
    return originalFn(options, callback);
  }

  shouldAffect(hostname) {
    // Only affect specific services
    const targetHosts = (process.env.CHAOS_TARGET_HOSTS || '').split(',');
    
    if (targetHosts.length === 0 || targetHosts[0] === '') {
      return true;  // Affect all hosts
    }
    
    return targetHosts.some(host => hostname.includes(host));
  }
}

module.exports = { NetworkChaos };
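
`shouldAffect` matches targets with `hostname.includes(host)`, so a target of `payment-service` would also match `prepayment-service`. If that is too loose for the intended blast radius, a stricter matcher is one option. The sketch below is an assumption about what "stricter" could mean (exact host, or a match on whole dot-separated labels at either end); `matchesTarget` is a hypothetical helper.

```javascript
// Returns true when `hostname` equals a target or contains it as a leading or
// trailing run of whole DNS labels. Unlike the original, an empty target list
// matches nothing, which is the safer default for fault injection.
function matchesTarget(hostname, targets) {
  if (!hostname) return false;
  const host = hostname.toLowerCase();

  return targets
    .map(target => target.trim().toLowerCase())
    .filter(Boolean)
    .some(target =>
      host === target ||
      host.startsWith(`${target}.`) ||
      host.endsWith(`.${target}`)
    );
}

console.log(matchesTarget('payment-service.production.svc', ['payment-service'])); // true
console.log(matchesTarget('prepayment-service', ['payment-service']));             // false
console.log(matchesTarget('api.example.com', ['example.com']));                    // true
```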

Resource Exhaustion

// chaos/resource-chaos.js

class ResourceChaos {
  constructor() {
    this.memoryHogs = [];
    this.cpuBurner = null;
    this.fdLeaks = [];
  }

  // Consume memory
  exhaustMemory(megabytes = 100) {
    console.log(`[CHAOS] Consuming ${megabytes}MB of memory`);
    
    const chunks = Math.ceil(megabytes / 10);
    for (let i = 0; i < chunks; i++) {
      // Allocate 10MB chunks
      const buffer = Buffer.alloc(10 * 1024 * 1024);
      buffer.fill('X');
      this.memoryHogs.push(buffer);
    }
    
    console.log(`[CHAOS] Memory allocated: ${this.memoryHogs.length * 10}MB`);
    return this;
  }

  releaseMemory() {
    this.memoryHogs = [];
    if (global.gc) {
      global.gc();
    }
    console.log('[CHAOS] Memory released');
    return this;
  }

  // Burn CPU
  burnCPU(percentage = 50, durationMs = 5000) {
    console.log(`[CHAOS] Burning ${percentage}% CPU for ${durationMs}ms`);
    
    const endTime = Date.now() + durationMs;
    const workTime = percentage;
    const sleepTime = 100 - percentage;
    
    const burn = () => {
      if (Date.now() >= endTime) {
        console.log('[CHAOS] CPU burn complete');
        return;
      }
      
      // Work (busy loop)
      const workEnd = Date.now() + workTime;
      while (Date.now() < workEnd) {
        Math.random() * Math.random();
      }
      
      // Sleep
      setTimeout(burn, sleepTime);
    };
    
    burn();
    return this;
  }

  // Exhaust file descriptors
  exhaustFileDescriptors(count = 1000) {
    const fs = require('fs');
    
    console.log(`[CHAOS] Opening ${count} file descriptors`);
    
    for (let i = 0; i < count; i++) {
      try {
        const fd = fs.openSync('/dev/null', 'r');
        this.fdLeaks.push(fd);
      } catch (error) {
        console.log(`[CHAOS] Hit FD limit at ${i} descriptors`);
        break;
      }
    }
    
    return this;
  }

  releaseFileDescriptors() {
    const fs = require('fs');
    
    for (const fd of this.fdLeaks) {
      try {
        fs.closeSync(fd);
      } catch (e) {}
    }
    
    this.fdLeaks = [];
    console.log('[CHAOS] File descriptors released');
    return this;
  }

  // Fill disk
  fillDisk(path, gigabytes = 1) {
    const fs = require('fs');
    
    console.log(`[CHAOS] Filling ${gigabytes}GB at ${path}`);
    
    const chunkSize = 100 * 1024 * 1024;  // 100MB chunks
    const chunks = gigabytes * 10;
    const buffer = Buffer.alloc(chunkSize);
    buffer.fill('X');
    
    const fd = fs.openSync(path, 'w');
    
    for (let i = 0; i < chunks; i++) {
      try {
        fs.writeSync(fd, buffer);
      } catch (error) {
        console.log(`[CHAOS] Disk fill stopped at ${i * 100}MB: ${error.message}`);
        break;
      }
    }
    
    fs.closeSync(fd);
    return path;
  }
}

module.exports = { ResourceChaos };
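
Resource exhaustion is easy to overdo on a shared host. A small guard that caps the request against currently free memory helps keep the blast radius bounded. This is a sketch only: `capAllocationMb` and the 50% headroom default are assumptions, not part of `ResourceChaos`.

```javascript
const os = require('os');

// Cap a requested allocation so that at least `reserveFraction` of the
// currently free memory stays untouched. Sizes are in megabytes.
function capAllocationMb(requestedMb, freeBytes = os.freemem(), reserveFraction = 0.5) {
  const freeMb = freeBytes / (1024 * 1024);
  const allowedMb = Math.floor(freeMb * (1 - reserveFraction));
  return Math.max(0, Math.min(requestedMb, allowedMb));
}

const ONE_GIB = 1024 * 1024 * 1024;
console.log(capAllocationMb(100, ONE_GIB)); // 100 (fits within the 512MB allowance)
console.log(capAllocationMb(800, ONE_GIB)); // 512 (capped at half of free memory)
```

A call such as `exhaustMemory(capAllocationMb(500))` would then never take more than half of what is free at that moment.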

Dependency Failures

// chaos/dependency-chaos.js

class DependencyChaos {
  constructor(httpClient) {
    this.httpClient = httpClient;
    this.failures = new Map();
  }

  // Fail specific service
  failService(serviceName, options = {}) {
    const config = {
      type: options.type || 'error',  // 'error', 'timeout', 'slow'
      errorCode: options.errorCode || 500,
      delay: options.delay || 5000,
      message: options.message || 'Chaos-injected failure'
    };
    
    this.failures.set(serviceName, config);
    console.log(`[CHAOS] ${serviceName} will ${config.type}`);
  }

  restoreService(serviceName) {
    this.failures.delete(serviceName);
    console.log(`[CHAOS] ${serviceName} restored`);
  }

  // Wrap HTTP client
  wrapClient() {
    const original = this.httpClient.request.bind(this.httpClient);
    
    this.httpClient.request = async (url, options = {}) => {
      const serviceName = this.extractServiceName(url);
      const failure = this.failures.get(serviceName);
      
      if (failure) {
        return this.simulateFailure(failure, url);
      }
      
      return original(url, options);
    };
  }

  extractServiceName(url) {
    try {
      const parsed = new URL(url);
      return parsed.hostname.split('.')[0];  // e.g., 'payment-service'
    } catch {
      return url;
    }
  }

  simulateFailure(config, url) {
    console.log(`[CHAOS] Simulating ${config.type} for ${url}`);
    
    switch (config.type) {
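      case 'error':
        // Reject with a real Error so callers get a stack trace and
        // `instanceof Error` checks keep working
        return Promise.reject(Object.assign(new Error(config.message), {
          status: config.errorCode,
          chaos: true
        }));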
        
      case 'timeout':
        return new Promise((_, reject) => {
          setTimeout(() => {
            reject(new Error('Chaos: Connection timeout'));
          }, config.delay);
        });
        
      case 'slow':
        return new Promise((resolve) => {
          setTimeout(() => {
            resolve({ status: 200, data: { slow: true } });
          }, config.delay);
        });
        
      default:
        return Promise.reject(new Error('Unknown chaos type'));
    }
  }
}

module.exports = { DependencyChaos };

Chaos Experiment Framework

// chaos/experiment-framework.js

class ChaosExperiment {
  constructor(name, options = {}) {
    this.name = name;
    this.description = options.description || '';
    this.hypothesis = options.hypothesis || '';
    this.steadyState = options.steadyState || {};
    this.metrics = [];
    this.status = 'pending';
    this.startTime = null;
    this.endTime = null;
    this.results = null;
  }

  async run(chaosAction, monitoringClient, options = {}) {
    const {
      duration = 60000,  // 1 minute default
      warmup = 10000,    // 10 seconds warmup
      cooldown = 10000,  // 10 seconds cooldown
      abortThreshold = null
    } = options;

    console.log(`\n${'='.repeat(60)}`);
    console.log(`CHAOS EXPERIMENT: ${this.name}`);
    console.log(`${'='.repeat(60)}`);
    console.log(`Hypothesis: ${this.hypothesis}`);
    console.log(`Duration: ${duration}ms`);
    console.log(`${'='.repeat(60)}\n`);

    this.status = 'running';
    this.startTime = new Date();

    try {
      // Phase 1: Collect baseline metrics
      console.log('[Phase 1] Collecting baseline metrics...');
      const baseline = await this.collectMetrics(monitoringClient, warmup);
      console.log('Baseline:', JSON.stringify(baseline, null, 2));

      // Verify steady state before experiment
      if (!this.verifySteadyState(baseline)) {
        throw new Error('System not in steady state before experiment');
      }

      // Phase 2: Inject chaos
      console.log('\n[Phase 2] Injecting chaos...');
      await chaosAction.start();

      // Phase 3: Monitor during chaos
      console.log('\n[Phase 3] Monitoring during chaos...');
      const chaosMetrics = await this.monitorWithAbort(
        monitoringClient,
        duration,
        abortThreshold,
        chaosAction
      );

      // Phase 4: Stop chaos
      console.log('\n[Phase 4] Stopping chaos...');
      await chaosAction.stop();

      // Phase 5: Cooldown and collect recovery metrics
      console.log('\n[Phase 5] Collecting recovery metrics...');
      await this.sleep(cooldown);
      const recovery = await this.collectMetrics(monitoringClient, 5000);

      // Analyze results
      this.results = {
        baseline,
        chaos: chaosMetrics,
        recovery,
        hypothesis: this.evaluateHypothesis(baseline, chaosMetrics, recovery)
      };

      this.status = this.results.hypothesis.passed ? 'passed' : 'failed';
      
    } catch (error) {
      console.error('Experiment failed:', error);
      this.status = 'aborted';
      this.results = { error: error.message };
      
      // Ensure chaos is stopped
      try {
        await chaosAction.stop();
      } catch (e) {}
      
    } finally {
      this.endTime = new Date();
    }

    this.printResults();
    return this.results;
  }

  async collectMetrics(monitoringClient, duration) {
    const metrics = {
      errorRate: [],
      latencyP50: [],
      latencyP99: [],
      throughput: [],
      saturation: []
    };

    const interval = 1000;
    const iterations = Math.ceil(duration / interval);

    for (let i = 0; i < iterations; i++) {
      const snapshot = await monitoringClient.getMetrics();
      
      metrics.errorRate.push(snapshot.errorRate);
      metrics.latencyP50.push(snapshot.latencyP50);
      metrics.latencyP99.push(snapshot.latencyP99);
      metrics.throughput.push(snapshot.throughput);
      metrics.saturation.push(snapshot.saturation);

      await this.sleep(interval);
    }

    return {
      errorRate: this.average(metrics.errorRate),
      latencyP50: this.average(metrics.latencyP50),
      latencyP99: this.average(metrics.latencyP99),
      throughput: this.average(metrics.throughput),
      saturation: this.average(metrics.saturation)
    };
  }

  async monitorWithAbort(monitoringClient, duration, threshold, chaosAction) {
    const metrics = [];
    const interval = 5000;
    const iterations = Math.ceil(duration / interval);

    for (let i = 0; i < iterations; i++) {
      const snapshot = await monitoringClient.getMetrics();
      metrics.push(snapshot);

      // Check abort conditions
      if (threshold && this.shouldAbort(snapshot, threshold)) {
        console.log('\n⚠️  ABORTING: Threshold exceeded');
        await chaosAction.stop();
        break;
      }

      console.log(`  [${i + 1}/${iterations}] Error: ${(snapshot.errorRate * 100).toFixed(2)}%, Latency P99: ${snapshot.latencyP99}ms`);
      await this.sleep(interval);
    }

    return {
      errorRate: this.average(metrics.map(m => m.errorRate)),
      latencyP50: this.average(metrics.map(m => m.latencyP50)),
      latencyP99: this.average(metrics.map(m => m.latencyP99)),
      throughput: this.average(metrics.map(m => m.throughput)),
      maxErrorRate: Math.max(...metrics.map(m => m.errorRate)),
      maxLatency: Math.max(...metrics.map(m => m.latencyP99))
    };
  }

  shouldAbort(metrics, threshold) {
    return (
      metrics.errorRate > threshold.maxErrorRate ||
      metrics.latencyP99 > threshold.maxLatency
    );
  }

  verifySteadyState(metrics) {
    const { steadyState } = this;
    
    if (steadyState.maxErrorRate && metrics.errorRate > steadyState.maxErrorRate) {
      console.log(`Steady state check failed: errorRate ${metrics.errorRate} > ${steadyState.maxErrorRate}`);
      return false;
    }
    
    if (steadyState.maxLatencyP99 && metrics.latencyP99 > steadyState.maxLatencyP99) {
      console.log(`Steady state check failed: latencyP99 ${metrics.latencyP99} > ${steadyState.maxLatencyP99}`);
      return false;
    }
    
    return true;
  }

  evaluateHypothesis(baseline, chaos, recovery) {
    const results = {
      passed: true,
      findings: []
    };

    // Check if error rate stayed within bounds (errorRate is a fraction, 0..1)
    const errorRateIncrease = chaos.errorRate - baseline.errorRate;
    if (errorRateIncrease > 0.05) {  // More than 5 percentage points
      results.findings.push(`Error rate increased by ${(errorRateIncrease * 100).toFixed(2)} percentage points`);
      results.passed = false;
    }

    // Check if latency degradation was acceptable (relative to baseline)
    const latencyIncrease = (chaos.latencyP99 - baseline.latencyP99) / baseline.latencyP99;
    if (latencyIncrease > 0.5) {  // More than 50% increase
      results.findings.push(`Latency P99 increased by ${(latencyIncrease * 100).toFixed(0)}%`);
      results.passed = false;
    }

    // Check recovery: ratio of post-chaos P99 latency to baseline P99 latency
    const recoveryRatio = recovery.latencyP99 / baseline.latencyP99;
    if (recoveryRatio > 1.1) {  // Not recovered to within 10%
      results.findings.push('System did not fully recover');
      results.passed = false;
    }

    if (results.passed) {
      results.findings.push('Hypothesis validated: system maintained steady state');
    }

    return results;
  }

  printResults() {
    console.log(`\n${'='.repeat(60)}`);
    console.log('EXPERIMENT RESULTS');
    console.log(`${'='.repeat(60)}`);
    console.log(`Status: ${this.status.toUpperCase()}`);
    console.log(`Duration: ${(this.endTime - this.startTime) / 1000}s`);
    
    if (this.results?.error) {
      // Aborted runs carry only an error message, not metrics
      console.log(`\nError: ${this.results.error}`);
    } else if (this.results) {
      console.log('\nMetrics Comparison:');
      console.log(`  Error Rate: ${(this.results.baseline.errorRate * 100).toFixed(2)}% → ${(this.results.chaos.errorRate * 100).toFixed(2)}%`);
      console.log(`  Latency P99: ${this.results.baseline.latencyP99}ms → ${this.results.chaos.latencyP99}ms`);
      console.log(`  Throughput: ${this.results.baseline.throughput} → ${this.results.chaos.throughput}`);
      
      console.log('\nFindings:');
      for (const finding of this.results.hypothesis.findings) {
        console.log(`  • ${finding}`);
      }
    }
    
    console.log(`${'='.repeat(60)}\n`);
  }

  average(arr) {
    return arr.reduce((a, b) => a + b, 0) / arr.length;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

module.exports = { ChaosExperiment };
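
Before pointing the framework at a real monitoring backend, it can help to dry-run it against canned data. The sketch below assumes only the `getMetrics()` contract used by `collectMetrics` and `monitorWithAbort` above; `createScriptedMonitoringClient` and its `next()` method are hypothetical names.

```javascript
// Returns a monitoring client that replays `snapshots` in order and then keeps
// repeating the last one. `next()` is the synchronous core; `getMetrics()`
// wraps it in a Promise to match what ChaosExperiment expects.
function createScriptedMonitoringClient(snapshots) {
  if (snapshots.length === 0) {
    throw new Error('At least one snapshot is required');
  }
  let index = 0;

  const next = () => {
    const snapshot = snapshots[Math.min(index, snapshots.length - 1)];
    index += 1;
    return { ...snapshot };  // Copy so callers cannot mutate the script
  };

  return { next, getMetrics: async () => next() };
}

// Example script: one healthy reading, then a degraded one that repeats.
const scriptedSnapshots = [
  { errorRate: 0.001, latencyP50: 40, latencyP99: 150, throughput: 120, saturation: 0.4 },
  { errorRate: 0.2, latencyP50: 90, latencyP99: 900, throughput: 60, saturation: 0.9 }
];
```

Passing `createScriptedMonitoringClient(scriptedSnapshots)` as the second argument to `run()`, together with a chaos action whose `start` and `stop` do nothing, exercises the phase logic without touching a live system.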

Running Chaos Experiments

Example: Service Failure Experiment

// experiments/payment-service-failure.js

const { ChaosExperiment } = require('../chaos/experiment-framework');
const { DependencyChaos } = require('../chaos/dependency-chaos');

async function runPaymentFailureExperiment() {
  // Define the experiment
  const experiment = new ChaosExperiment('Payment Service Failure', {
    description: 'Simulate complete payment service unavailability',
    hypothesis: 'Order service will gracefully handle payment failures with proper fallbacks',
    steadyState: {
      maxErrorRate: 0.01,
      maxLatencyP99: 200
    }
  });

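  // Create chaos action. `httpClient` is the HTTP client instance the order
  // service uses for outbound calls; it must be in scope here.
  const dependencyChaos = new DependencyChaos(httpClient);
  dependencyChaos.wrapClient();  // Without this, failService() has no effect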
  
  const chaosAction = {
    start: async () => {
      dependencyChaos.failService('payment-service', {
        type: 'error',
        errorCode: 503,
        message: 'Service Unavailable'
      });
    },
    stop: async () => {
      dependencyChaos.restoreService('payment-service');
    }
  };

  // Create monitoring client
  const monitoringClient = {
    getMetrics: async () => {
      const response = await fetch('http://prometheus:9090/api/v1/query', {
        method: 'POST',
        body: new URLSearchParams({
          query: `
            sum(rate(http_requests_total{service="order-service"}[1m])) by (status)
          `
        })
      });
      const data = await response.json();
      
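      // Parse Prometheus response: one series per status code, values in req/s
      const series = data.data.result.map(r => ({
        status: Number(r.metric.status),
        rate: parseFloat(r.value[1])
      }));
      const total = series.reduce((sum, s) => sum + s.rate, 0);
      const errors = series
        .filter(s => s.status >= 500)
        .reduce((sum, s) => sum + s.rate, 0);

      // getLatencyPercentile and getCPUUtilization are helpers defined elsewhere
      return {
        errorRate: total > 0 ? errors / total : 0,  // Fraction of requests (0..1)
        latencyP50: await getLatencyPercentile(50),
        latencyP99: await getLatencyPercentile(99),
        throughput: total,
        saturation: await getCPUUtilization()
      };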
    }
  };

  // Run experiment
  const results = await experiment.run(chaosAction, monitoringClient, {
    duration: 120000,  // 2 minutes
    warmup: 15000,
    cooldown: 30000,
    abortThreshold: {
      maxErrorRate: 0.5,  // Abort if error rate exceeds 50%
      maxLatency: 10000   // Abort if latency exceeds 10s
    }
  });

  return results;
}

runPaymentFailureExperiment().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});
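
The monitoring client above calls `getLatencyPercentile`, which the example does not define. One possible shape is sketched below. It assumes the service exports a Prometheus histogram named `http_request_duration_seconds` (a common convention, but an assumption here), reuses the `http://prometheus:9090` address from the example, and relies on the global `fetch` available in Node.js 18 and later; `buildLatencyQuery` is a hypothetical helper.

```javascript
const PROMETHEUS_QUERY_URL = 'http://prometheus:9090/api/v1/query';

// Build a PromQL expression for a latency percentile using histogram_quantile.
// ASSUMPTION: the histogram metric name and the `service` label.
function buildLatencyQuery(percentile, service = 'order-service') {
  const quantile = percentile / 100;
  return `histogram_quantile(${quantile}, ` +
    `sum(rate(http_request_duration_seconds_bucket{service="${service}"}[1m])) by (le))`;
}

// Returns the percentile in milliseconds, or 0 when Prometheus has no data.
async function getLatencyPercentile(percentile) {
  const response = await fetch(PROMETHEUS_QUERY_URL, {
    method: 'POST',
    body: new URLSearchParams({ query: buildLatencyQuery(percentile) })
  });
  const data = await response.json();

  // Instant-query values arrive as [timestamp, "value"]; the histogram is in seconds.
  const seconds = parseFloat(data.data.result[0]?.value[1] ?? '0');
  return Number.isNaN(seconds) ? 0 : seconds * 1000;
}
```

Converting to milliseconds keeps the result comparable with `maxLatencyP99: 200` and the `maxLatency: 10000` abort threshold used in the experiment.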

Kubernetes Chaos with LitmusChaos

# litmus/pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  appinfo:
    appns: 'production'
    applabel: 'app=order-service'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
            - name: PODS_AFFECTED_PERC
              value: '50'
        probe:
          - name: 'check-order-endpoint'
            type: 'httpProbe'
            mode: 'Continuous'
            runProperties:
              probeTimeout: 5
              retry: 2
              interval: 5
            httpProbe/inputs:
              url: 'http://order-service.production.svc:80/health'
              insecureSkipVerify: false
              responseTimeout: 3000
              method:
                get:
                  criteria: '=='
                  responseCode: '200'
---
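# Network chaos
# (abbreviated: appinfo, engineState, and chaosServiceAccount are omitted here;
# set them as in the pod-delete example above)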
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-chaos
spec:
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: NETWORK_LATENCY
              value: '2000'  # 2 second latency
            - name: TOTAL_CHAOS_DURATION
              value: '120'
            - name: TARGET_PODS
              value: 'payment-service'

Game Day Exercises

// gameday/runner.js

class GameDay {
  constructor(name, scenarios) {
    this.name = name;
    this.scenarios = scenarios;
    this.results = [];
    this.observers = [];
  }

  addObserver(observer) {
    this.observers.push(observer);
  }

  notify(event) {
    for (const observer of this.observers) {
      observer.onEvent(event);
    }
  }

  async run() {
    console.log(`\n${'#'.repeat(70)}`);
    console.log(`# GAME DAY: ${this.name}`);
    console.log(`# Date: ${new Date().toISOString()}`);
    console.log(`# Scenarios: ${this.scenarios.length}`);
    console.log(`${'#'.repeat(70)}\n`);

    this.notify({ type: 'gameday_start', name: this.name });

    for (let i = 0; i < this.scenarios.length; i++) {
      const scenario = this.scenarios[i];
      
      console.log(`\n--- Scenario ${i + 1}/${this.scenarios.length}: ${scenario.name} ---`);
      this.notify({ type: 'scenario_start', scenario: scenario.name });

      try {
        // Pre-scenario check
        const preCheck = await scenario.preCheck();
        if (!preCheck.ready) {
          console.log(`Skipping: ${preCheck.reason}`);
          this.results.push({ scenario: scenario.name, status: 'skipped', reason: preCheck.reason });
          continue;
        }

        // Run scenario
        const result = await scenario.execute();
        
        // Validate expectations
        const validation = await scenario.validate(result);
        
        this.results.push({
          scenario: scenario.name,
          status: validation.passed ? 'passed' : 'failed',
          result,
          validation
        });

        this.notify({ 
          type: 'scenario_complete', 
          scenario: scenario.name, 
          passed: validation.passed 
        });

        // Recovery period
        console.log('Recovery period...');
        await this.sleep(scenario.recoveryTime || 30000);

      } catch (error) {
        console.error(`Scenario failed with error: ${error.message}`);
        this.results.push({
          scenario: scenario.name,
          status: 'error',
          error: error.message
        });
        
        // Try to recover
        await scenario.cleanup?.();
      }
    }

    this.printSummary();
    this.notify({ type: 'gameday_complete', results: this.results });

    return this.results;
  }

  printSummary() {
    console.log(`\n${'='.repeat(70)}`);
    console.log('GAME DAY SUMMARY');
    console.log(`${'='.repeat(70)}`);
    
    const passed = this.results.filter(r => r.status === 'passed').length;
    const failed = this.results.filter(r => r.status === 'failed').length;
    const errors = this.results.filter(r => r.status === 'error').length;
    const skipped = this.results.filter(r => r.status === 'skipped').length;
    
    console.log(`Total Scenarios: ${this.results.length}`);
    console.log(`Passed: ${passed}`);
    console.log(`Failed: ${failed}`);
    console.log(`Errors: ${errors}`);
    console.log(`Skipped: ${skipped}`);
    
    console.log('\nDetails:');
    for (const result of this.results) {
      const icon = result.status === 'passed' ? '✅' : result.status === 'failed' ? '❌' : '⚠️';
      console.log(`  ${icon} ${result.scenario}: ${result.status}`);
      
      if (result.validation?.findings) {
        for (const finding of result.validation.findings) {
          console.log(`      - ${finding}`);
        }
      }
    }
    
    console.log(`${'='.repeat(70)}\n`);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

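// Example Game Day Scenario
// Helpers used by the scenarios below
const { promisify } = require('util');
const exec = promisify(require('child_process').exec);  // Resolves to { stdout, stderr }
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

const scenarios = [
  {
    name: 'Database Failover',
    preCheck: async () => ({ ready: true }),
    execute: async () => {
      // Simulate database failure
      await exec('kubectl delete pod postgres-primary-0 -n database');
      
      // Wait for failover
      await sleep(30000);
      
      // Check if replica was promoted
      const { stdout } = await exec('kubectl get pods -n database -l role=primary');
      return { newPrimary: stdout.includes('Running') };
    },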
    validate: async (result) => ({
      passed: result.newPrimary,
      findings: result.newPrimary 
        ? ['Failover completed successfully'] 
        : ['Failover did not complete']
    }),
    cleanup: async () => {
      // Restore original setup if needed
    },
    recoveryTime: 60000
  },
  
  {
    name: 'Cache Failure',
    preCheck: async () => ({ ready: true }),
    execute: async () => {
      // Stop Redis
      await exec('kubectl scale deployment redis --replicas=0 -n cache');
      
      // Make requests during outage
      const responses = await Promise.all(
        Array(100).fill().map(() => 
          fetch('http://order-service/api/products').catch(e => ({ error: true }))
        )
      );
      
      // Restore Redis
      await exec('kubectl scale deployment redis --replicas=3 -n cache');
      
      const errors = responses.filter(r => r.error || r.status >= 500).length;
      return { errorRate: errors / 100 };
    },
    validate: async (result) => ({
      passed: result.errorRate < 0.1,  // Less than 10% error rate
      findings: [`Error rate during cache failure: ${(result.errorRate * 100).toFixed(1)}%`]
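    }),
    cleanup: async () => {
      // Make sure Redis is scaled back up if execute() failed midway
      await exec('kubectl scale deployment redis --replicas=3 -n cache');
    },
    recoveryTime: 45000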
  }
];

const gameDay = new GameDay('Q4 Resilience Testing', scenarios);
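gameDay.run().catch(error => {
  // Surface unexpected failures instead of leaving an unhandled rejection
  console.error(`Game day aborted: ${error.message}`);
  process.exitCode = 1;
});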

Interview Questions

Q1: What is chaos engineering and why is it important?

Answer: Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.

Why important:

Distributed systems have unpredictable failure modes
Traditional testing doesn’t cover all scenarios
Builds confidence before production incidents
Reveals weaknesses proactively

Key principles:

Define “steady state” (normal behavior)
Hypothesize that steady state continues
Introduce real-world failures
Try to disprove the hypothesis
Run in production (with safety)
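A steady-state hypothesis can be written down as explicit bounds on a few metrics and checked before and during the experiment. The sketch below is one minimal way to do that; the metric names (`errorRate`, `p99LatencyMs`, `ordersPerSecond`) and their bounds are illustrative assumptions, not values from this chapter.

```javascript
// Hypothetical steady-state definition: metric names and bounds are
// illustrative assumptions, not values from any real system.
const steadyState = {
  errorRate: { max: 0.01 },
  p99LatencyMs: { max: 500 },
  ordersPerSecond: { min: 50 },
};

// Compare observed metrics against the definition and list every violation.
function evaluateSteadyState(definition, observed) {
  const violations = [];
  for (const [metric, bounds] of Object.entries(definition)) {
    const value = observed[metric];
    if (typeof value !== 'number') {
      // Missing data means the hypothesis cannot be confirmed
      violations.push(`${metric}: no data`);
      continue;
    }
    if (bounds.max !== undefined && value > bounds.max) {
      violations.push(`${metric}=${value} is above max ${bounds.max}`);
    }
    if (bounds.min !== undefined && value < bounds.min) {
      violations.push(`${metric}=${value} is below min ${bounds.min}`);
    }
  }
  return { holds: violations.length === 0, violations };
}

// Usage: sample metrics once before and once during failure injection
const before = evaluateSteadyState(steadyState, {
  errorRate: 0.002, p99LatencyMs: 310, ordersPerSecond: 80,
});
const during = evaluateSteadyState(steadyState, {
  errorRate: 0.04, p99LatencyMs: 420, ordersPerSecond: 75,
});
console.log('Steady state before injection:', before.holds);
console.log('Violations during injection:', during.violations);
```

If the check fails before any failure is injected, the experiment should not start: the system is not in steady state, so the hypothesis cannot be tested.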

Q2: What is Netflix's Chaos Monkey?

Answer: Chaos Monkey randomly terminates production instances to ensure services can survive instance failures.

Part of the Simian Army:

Chaos Monkey: Kills instances
Latency Monkey: Adds artificial delays
Conformity Monkey: Checks for best practices
Chaos Gorilla: Kills entire availability zones
Chaos Kong: Kills entire regions

Design principles:

Everything must handle instance failure
Stateless services
Redundancy at every level
Automated recovery
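The core idea of random instance termination can be sketched in a few lines. This is not Netflix's implementation: the `group` and `chaosEnabled` fields are hypothetical, and the rule of skipping groups with fewer than two instances is an added assumption to keep the blast radius small.

```javascript
// Pick at most one random victim per service group.
// `group` and `chaosEnabled` are hypothetical fields for this sketch.
function pickVictims(instances, random = Math.random) {
  const byGroup = new Map();
  for (const instance of instances) {
    if (!instance.chaosEnabled) continue; // opt-in only
    if (!byGroup.has(instance.group)) byGroup.set(instance.group, []);
    byGroup.get(instance.group).push(instance);
  }

  const victims = [];
  for (const members of byGroup.values()) {
    // Assumption: never terminate the only instance of a group
    if (members.length < 2) continue;
    victims.push(members[Math.floor(random() * members.length)]);
  }
  return victims;
}

// Usage with made-up instance data
const fleet = [
  { id: 'a1', group: 'api', chaosEnabled: true },
  { id: 'a2', group: 'api', chaosEnabled: true },
  { id: 'b1', group: 'billing', chaosEnabled: true },
  { id: 'c1', group: 'cart', chaosEnabled: false },
  { id: 'c2', group: 'cart', chaosEnabled: false },
];
console.log(pickVictims(fleet).map(instance => instance.id));
```

Passing `random` as a parameter makes the selection reproducible in tests, while production use falls back to `Math.random`.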

Lesson: Design for failure from day one.

Q3: How do you safely run chaos experiments in production?

Answer:

Safety measures:

1. Start small
   - Begin in staging
   - Small blast radius
   - Short duration
2. Abort conditions
   - Define thresholds (error rate, latency)
   - Automatic rollback
   - Kill switch ready
3. Observability
   - Real-time monitoring
   - Dashboards visible
   - Alerts configured
4. Team preparedness
   - Incident response ready
   - Runbooks available
   - All stakeholders aware
5. Gradual expansion
   - Increase scope over time
   - Learn from each experiment
   - Build confidence incrementally
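Abort conditions and automatic rollback can be combined in a small guard loop. The sketch below assumes the caller supplies three hypothetical callbacks (`inject`, `rollback`, `readMetrics`); the threshold values are placeholders to tune per service.

```javascript
// Hypothetical abort thresholds; tune these for your own service.
const abortThresholds = { errorRate: 0.05, p99LatencyMs: 2000 };

// Return a description of the first breached threshold, or null if none.
function findBreach(metrics, thresholds) {
  for (const [name, limit] of Object.entries(thresholds)) {
    const value = metrics[name];
    // Fail safe: if a metric cannot be observed, treat it as a breach
    if (typeof value !== 'number') return `${name}: no data`;
    if (value > limit) return `${name}=${value} exceeded ${limit}`;
  }
  return null;
}

// Inject a failure, poll metrics, and always roll back at the end.
// inject, rollback, and readMetrics are caller-supplied async functions.
async function runGuarded({ inject, rollback, readMetrics, durationMs, pollMs = 1000 }) {
  try {
    await inject();
    const deadline = Date.now() + durationMs;
    while (Date.now() < deadline) {
      const breach = findBreach(await readMetrics(), abortThresholds);
      if (breach) return { aborted: true, reason: breach };
      await new Promise(resolve => setTimeout(resolve, pollMs));
    }
    return { aborted: false };
  } finally {
    // Runs whether the experiment finished, aborted, or threw
    await rollback();
  }
}
```

Putting the rollback in `finally` means the kill switch path and the normal completion path share the same cleanup, so there is only one piece of recovery code to trust.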

Q4: What failures would you test in a microservices system?

Answer:

Service failures:

Service unavailable (crash, OOM)
Slow responses (latency)
Error responses (5xx)

Network failures:

Packet loss
Network partition
DNS failure

Infrastructure:

Instance termination
Zone failure
Disk full

Dependencies:

Database failure
Cache unavailable
Message queue failure

Resource exhaustion:

CPU saturation
Memory exhaustion
Connection pool exhaustion
Thread pool exhaustion
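Several of the service-level failures above (slow responses, error responses) can be injected at the application layer by wrapping a call. The following is a minimal sketch; the config fields `errorRate`, `latencyRate`, and `latencyMs` are names chosen for this example.

```javascript
// Decide which faults to apply to a single call.
// Config field names are assumptions made for this sketch.
function planFault(config, random = Math.random) {
  return {
    fail: random() < (config.errorRate ?? 0),
    delayMs: random() < (config.latencyRate ?? 0) ? config.latencyMs : 0,
  };
}

// Wrap an async function so that some calls are delayed or fail.
function withFaults(fn, config, random = Math.random) {
  return async (...args) => {
    const fault = planFault(config, random);
    if (fault.delayMs > 0) {
      await new Promise(resolve => setTimeout(resolve, fault.delayMs));
    }
    if (fault.fail) {
      throw new Error('Injected failure (chaos experiment)');
    }
    return fn(...args);
  };
}

// Usage: 10% of calls fail, 20% are delayed by 300 ms
const getProduct = async (id) => ({ id, name: 'example' });
const flakyGetProduct = withFaults(getProduct, {
  errorRate: 0.1,
  latencyRate: 0.2,
  latencyMs: 300,
});
```

Application-level wrappers like this are easy to scope to one dependency, but they do not cover network or infrastructure failures such as packet loss or a full disk, which need to be injected outside the process.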

Q5: What is a Game Day?

Answer: A Game Day is a scheduled event where teams intentionally inject failures to test system resilience.

Components:

Planning: Define scenarios, success criteria
Communication: Notify stakeholders
Execution: Run scenarios with observers
Observation: Monitor and document
Retrospective: Analyze and improve

Benefits:

Team practices incident response
Reveals documentation gaps
Tests monitoring and alerting
Builds muscle memory for real incidents

Example scenarios:

Database failover
Region evacuation
DDoS simulation
Major dependency outage

Chapter Summary

Key Takeaways:

Chaos engineering proactively finds weaknesses before production incidents
Follow the scientific method: hypothesis → experiment → analyze
Start small in staging, gradually expand to production
Always have abort conditions and rollback plans
Game days help teams practice incident response
Design systems assuming everything will fail

Next Chapter: Real-World Case Studies - Architecture breakdowns from Netflix, Uber, Amazon.
