Chaos Engineering

Chaos engineering proactively tests system resilience by intentionally introducing failures. Learn how Netflix and other tech giants build confidence in their distributed systems.
Learning Objectives:
  • Understand chaos engineering principles
  • Implement failure injection techniques
  • Design and run chaos experiments
  • Build resilience through controlled chaos
  • Create game day exercises

Why Chaos Engineering?

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE NEED FOR CHAOS ENGINEERING                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  REALITY OF DISTRIBUTED SYSTEMS:                                            │
│  ───────────────────────────────────────                                    │
│                                                                              │
│  "Everything fails, all the time" - Werner Vogels, Amazon CTO               │
│                                                                              │
│  Microservices introduce:                                                   │
│  • Network partitions              • Dependency failures                    │
│  • Latency spikes                  • Resource exhaustion                    │
│  • Data inconsistency              • Configuration errors                   │
│  • Cascading failures              • Deployment issues                      │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════════│
│                                                                              │
│  CHAOS ENGINEERING APPROACH:                                                │
│  ─────────────────────────────                                              │
│                                                                              │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐         │
│  │   Hypothesis    │───▶│    Inject       │───▶│    Observe      │         │
│  │  "System will   │    │   Failures      │    │    Behavior     │         │
│  │   handle X"     │    │   (Controlled)  │    │                 │         │
│  └─────────────────┘    └─────────────────┘    └────────┬────────┘         │
│                                                          │                  │
│         ┌────────────────────────────────────────────────┘                  │
│         │                                                                   │
│         ▼                                                                   │
│  ┌─────────────────┐    ┌─────────────────┐                                │
│  │   Learn &       │◀───│    Analyze      │                                │
│  │   Improve       │    │    Results      │                                │
│  └─────────────────┘    └─────────────────┘                                │
│                                                                              │
│  GOAL: Build confidence that your system can withstand turbulent           │
│        conditions in production                                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Chaos Engineering Principles

The Scientific Method

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CHAOS EXPERIMENT LIFECYCLE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. DEFINE STEADY STATE                                                     │
│     ─────────────────────────                                               │
│     "What does 'healthy' look like?"                                        │
│     • Response time p99 < 200ms                                             │
│     • Error rate < 0.1%                                                     │
│     • Orders processed per minute > 100                                     │
│                                                                              │
│  2. HYPOTHESIZE                                                             │
│     ─────────────────                                                       │
│     "Steady state will continue when..."                                    │
│     • Payment service becomes unavailable                                   │
│     • Database latency increases 10x                                        │
│     • 30% of instances are terminated                                       │
│                                                                              │
│  3. DESIGN EXPERIMENT                                                       │
│     ─────────────────────                                                   │
│     • What failure to inject?                                               │
│     • Blast radius (scope)                                                  │
│     • Duration                                                              │
│     • Abort conditions                                                      │
│                                                                              │
│  4. RUN EXPERIMENT                                                          │
│     ────────────────────                                                    │
│     • Inject failure                                                        │
│     • Monitor systems                                                       │
│     • Observe behavior                                                      │
│     • Be ready to abort                                                     │
│                                                                              │
│  5. ANALYZE & LEARN                                                         │
│     ────────────────────                                                    │
│     • Did steady state hold?                                                │
│     • What broke?                                                           │
│     • What was the blast radius?                                            │
│     • How can we improve?                                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
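
Written down, an experiment is just structured data. The sketch below is one possible way to capture the steady state, hypothesis, and abort conditions before involving any tooling; the field names and thresholds are illustrative rather than taken from a specific framework.

// experiments/definitions/payment-outage.js
// Illustrative, framework-agnostic experiment definition. Field names and
// thresholds are assumptions; adapt them to whatever chaos tooling you use.
module.exports = {
  name: 'payment-service-outage',
  hypothesis: 'Orders are still accepted while the payment service is down; capture is retried later',
  steadyState: {
    latencyP99Ms: 200,        // p99 response time stays under 200ms
    maxErrorRate: 0.001,      // error rate stays under 0.1%
    minOrdersPerMinute: 100   // business metric keeps flowing
  },
  method: {
    failure: 'dependency-unavailable',
    target: 'payment-service',
    blastRadius: '10% of order-service traffic',
    durationMs: 5 * 60 * 1000
  },
  abortWhen: {
    maxErrorRate: 0.05,       // kill switch: stop if errors exceed 5%
    maxLatencyP99Ms: 2000
  }
};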

Failure Injection Types

Service Level Failures

// chaos/service-failures.js

class ServiceChaos {
  constructor(options = {}) {
    this.enabled = process.env.CHAOS_ENABLED === 'true';
    this.targetPercentage = options.targetPercentage || 10;
    this.latencyMs = options.latencyMs || 2000;
  }

  // Middleware for Express
  middleware() {
    return (req, res, next) => {
      if (!this.enabled) return next();
      
      const random = Math.random() * 100;
      
      // Inject failures based on configuration
      if (random < this.targetPercentage) {
        const failureType = this.selectFailure();
        return this.injectFailure(failureType, req, res, next);
      }
      
      next();
    };
  }

  selectFailure() {
    const failures = [
      { type: 'latency', weight: 50 },
      { type: 'error', weight: 30 },
      { type: 'timeout', weight: 15 },
      { type: 'exception', weight: 5 }
    ];
    
    const total = failures.reduce((sum, f) => sum + f.weight, 0);
    let random = Math.random() * total;
    
    for (const failure of failures) {
      random -= failure.weight;
      if (random <= 0) return failure.type;
    }
    
    return 'latency';
  }

  injectFailure(type, req, res, next) {
    console.log(`[CHAOS] Injecting ${type} failure for ${req.path}`);
    
    switch (type) {
      case 'latency':
        // Add artificial delay
        setTimeout(next, this.latencyMs);
        break;
        
      case 'error':
        // Return 500 error
        res.status(500).json({
          error: 'Internal Server Error',
          chaos: true,
          message: 'This is a chaos-injected failure'
        });
        break;
        
      case 'timeout':
        // Don't respond (simulate timeout)
        // The client will eventually timeout
        break;
        
      case 'exception':
        // Throw an exception
        throw new Error('Chaos-injected exception');
        
      default:
        next();
    }
  }
}

// Apply to specific routes
app.use('/api/orders', new ServiceChaos({
  targetPercentage: 5,
  latencyMs: 3000
}).middleware());
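
With chaos enabled, it helps to verify that the middleware actually fires at roughly the configured rate. The smoke test below is a small standalone sketch (the URL, timeout, and request count are assumptions), not part of the middleware itself.

// chaos/smoke-test.js
// Rough check: with CHAOS_ENABLED=true and targetPercentage=5, around 5% of
// requests should fail, hang, or slow down noticeably. Requires Node 18+ for
// global fetch and AbortSignal.timeout.
async function smokeTest(url = 'http://localhost:3000/api/orders', total = 200) {
  let affected = 0;
  for (let i = 0; i < total; i++) {
    const started = Date.now();
    try {
      const res = await fetch(url, { signal: AbortSignal.timeout(3000) });
      if (res.status >= 500 || Date.now() - started > 1000) affected++;
    } catch {
      affected++;  // timeouts and dropped connections count as chaos hits
    }
  }
  console.log(`Chaos-affected requests: ${affected}/${total} (${(affected / total * 100).toFixed(1)}%)`);
}

smokeTest().catch(console.error);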

Network Failures

// chaos/network-chaos.js
const http = require('http');
const https = require('https');

class NetworkChaos {
  constructor(options = {}) {
    this.enabled = process.env.CHAOS_ENABLED === 'true';
    this.config = {
      packetLoss: options.packetLoss || 0,  // Percentage
      latency: options.latency || 0,  // Milliseconds
      jitter: options.jitter || 0,  // Milliseconds
      bandwidth: options.bandwidth || null,  // Bytes per second
      ...options
    };
    
    this.originalRequest = http.request.bind(http);
    this.originalSecureRequest = https.request.bind(https);
  }

  enable() {
    if (!this.enabled) return;
    
    // Monkey-patch http.request
    http.request = (options, callback) => {
      return this.wrapRequest(this.originalRequest, options, callback);
    };
    
    https.request = (options, callback) => {
      return this.wrapRequest(this.originalSecureRequest, options, callback);
    };
    
    console.log('[CHAOS] Network chaos enabled');
  }

  disable() {
    http.request = this.originalRequest;
    https.request = this.originalSecureRequest;
    console.log('[CHAOS] Network chaos disabled');
  }

  wrapRequest(originalFn, options, callback) {
    const hostname = typeof options === 'string' 
      ? new URL(options).hostname 
      : options.hostname || options.host;
    
    // Check if this host should be affected
    if (!this.shouldAffect(hostname)) {
      return originalFn(options, callback);
    }
    
    // Simulate packet loss
    if (Math.random() * 100 < this.config.packetLoss) {
      console.log(`[CHAOS] Simulating packet loss for ${hostname}`);
      const req = originalFn(options, () => {});
      req.on('socket', (socket) => {
        socket.destroy(new Error('Chaos: simulated packet loss'));
      });
      return req;
    }
    
    // Add latency with jitter by delaying delivery of the response callback.
    // (http.request must return a ClientRequest synchronously, so wrapping the
    // call in a Promise here would break callers.)
    const delay = this.config.latency + (Math.random() * this.config.jitter);
    if (delay > 0 && typeof callback === 'function') {
      console.log(`[CHAOS] Adding ${Math.round(delay)}ms latency for ${hostname}`);
      return originalFn(options, (res) => {
        setTimeout(() => callback(res), delay);
      });
    }
    
    return originalFn(options, callback);
  }

  shouldAffect(hostname) {
    // Only affect specific services
    const targetHosts = (process.env.CHAOS_TARGET_HOSTS || '').split(',');
    
    if (targetHosts.length === 0 || targetHosts[0] === '') {
      return true;  // Affect all hosts
    }
    
    return targetHosts.some(host => hostname.includes(host));
  }
}

module.exports = { NetworkChaos };
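
Because NetworkChaos monkey-patches the global http and https modules, it should be enabled for a bounded window and always disabled afterwards. A possible driver script (hosts and timings are illustrative):

// chaos/run-network-chaos.js
// Requires CHAOS_ENABLED=true; CHAOS_TARGET_HOSTS (e.g. "payment") limits
// which outbound hosts are affected.
const { NetworkChaos } = require('./network-chaos');

async function main() {
  const chaos = new NetworkChaos({ latency: 500, jitter: 250, packetLoss: 5 });
  chaos.enable();

  try {
    // ...exercise the system while the degradation is active...
    await new Promise(resolve => setTimeout(resolve, 60000));
  } finally {
    chaos.disable();  // always restore the original request functions
  }
}

main().catch(console.error);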

Resource Exhaustion

// chaos/resource-chaos.js

class ResourceChaos {
  constructor() {
    this.memoryHogs = [];
    this.cpuBurner = null;
    this.fdLeaks = [];
  }

  // Consume memory
  exhaustMemory(megabytes = 100) {
    console.log(`[CHAOS] Consuming ${megabytes}MB of memory`);
    
    const chunks = Math.ceil(megabytes / 10);
    for (let i = 0; i < chunks; i++) {
      // Allocate 10MB chunks
      const buffer = Buffer.alloc(10 * 1024 * 1024);
      buffer.fill('X');
      this.memoryHogs.push(buffer);
    }
    
    console.log(`[CHAOS] Memory allocated: ${this.memoryHogs.length * 10}MB`);
    return this;
  }

  releaseMemory() {
    this.memoryHogs = [];
    if (global.gc) {
      global.gc();
    }
    console.log('[CHAOS] Memory released');
    return this;
  }

  // Burn CPU
  burnCPU(percentage = 50, durationMs = 5000) {
    console.log(`[CHAOS] Burning ${percentage}% CPU for ${durationMs}ms`);
    
    const endTime = Date.now() + durationMs;
    // Duty cycle over a ~100ms window: busy-loop for `percentage` ms,
    // then yield to the event loop for the remainder.
    const workTime = percentage;
    const sleepTime = 100 - percentage;
    
    const burn = () => {
      if (Date.now() >= endTime) {
        console.log('[CHAOS] CPU burn complete');
        return;
      }
      
      // Work (busy loop)
      const workEnd = Date.now() + workTime;
      while (Date.now() < workEnd) {
        Math.random() * Math.random();
      }
      
      // Sleep
      setTimeout(burn, sleepTime);
    };
    
    burn();
    return this;
  }

  // Exhaust file descriptors
  exhaustFileDescriptors(count = 1000) {
    const fs = require('fs');
    
    console.log(`[CHAOS] Opening ${count} file descriptors`);
    
    for (let i = 0; i < count; i++) {
      try {
        const fd = fs.openSync('/dev/null', 'r');
        this.fdLeaks.push(fd);
      } catch (error) {
        console.log(`[CHAOS] Hit FD limit at ${i} descriptors`);
        break;
      }
    }
    
    return this;
  }

  releaseFileDescriptors() {
    const fs = require('fs');
    
    for (const fd of this.fdLeaks) {
      try {
        fs.closeSync(fd);
      } catch (e) {}
    }
    
    this.fdLeaks = [];
    console.log('[CHAOS] File descriptors released');
    return this;
  }

  // Fill disk
  fillDisk(path, gigabytes = 1) {
    const fs = require('fs');
    
    console.log(`[CHAOS] Filling ${gigabytes}GB at ${path}`);
    
    const chunkSize = 100 * 1024 * 1024;  // 100MB chunks
    const chunks = gigabytes * 10;
    const buffer = Buffer.alloc(chunkSize);
    buffer.fill('X');
    
    const fd = fs.openSync(path, 'w');
    
    for (let i = 0; i < chunks; i++) {
      try {
        fs.writeSync(fd, buffer);
      } catch (error) {
        console.log(`[CHAOS] Disk fill stopped at ${i * 100}MB: ${error.message}`);
        break;
      }
    }
    
    fs.closeSync(fd);
    return path;
  }
}

module.exports = { ResourceChaos };
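
Resource exhaustion experiments should pair every injection with a guaranteed release, even if checks in between fail. One way to drive the class above (sizes and durations are illustrative):

// chaos/run-resource-chaos.js
const { ResourceChaos } = require('./resource-chaos');

async function memoryAndCpuPressure() {
  const chaos = new ResourceChaos();
  try {
    chaos.exhaustMemory(500);    // hold roughly 500MB
    chaos.burnCPU(70, 30000);    // ~70% duty cycle for 30 seconds
    await new Promise(resolve => setTimeout(resolve, 30000));
    // ...check health endpoints and dashboards here...
  } finally {
    chaos.releaseMemory();       // clean up no matter what happened
  }
}

memoryAndCpuPressure().catch(console.error);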

Dependency Failures

// chaos/dependency-chaos.js

class DependencyChaos {
  constructor(httpClient) {
    this.httpClient = httpClient;
    this.failures = new Map();
  }

  // Fail specific service
  failService(serviceName, options = {}) {
    const config = {
      type: options.type || 'error',  // 'error', 'timeout', 'slow'
      errorCode: options.errorCode || 500,
      delay: options.delay || 5000,
      message: options.message || 'Chaos-injected failure'
    };
    
    this.failures.set(serviceName, config);
    console.log(`[CHAOS] ${serviceName} will ${config.type}`);
  }

  restoreService(serviceName) {
    this.failures.delete(serviceName);
    console.log(`[CHAOS] ${serviceName} restored`);
  }

  // Wrap HTTP client
  wrapClient() {
    const original = this.httpClient.request.bind(this.httpClient);
    
    this.httpClient.request = async (url, options = {}) => {
      const serviceName = this.extractServiceName(url);
      const failure = this.failures.get(serviceName);
      
      if (failure) {
        return this.simulateFailure(failure, url);
      }
      
      return original(url, options);
    };
  }

  extractServiceName(url) {
    try {
      const parsed = new URL(url);
      return parsed.hostname.split('.')[0];  // e.g., 'payment-service'
    } catch {
      return url;
    }
  }

  simulateFailure(config, url) {
    console.log(`[CHAOS] Simulating ${config.type} for ${url}`);
    
    switch (config.type) {
      case 'error':
        return Promise.reject({
          status: config.errorCode,
          message: config.message,
          chaos: true
        });
        
      case 'timeout':
        return new Promise((_, reject) => {
          setTimeout(() => {
            reject(new Error('Chaos: Connection timeout'));
          }, config.delay);
        });
        
      case 'slow':
        return new Promise((resolve) => {
          setTimeout(() => {
            resolve({ status: 200, data: { slow: true } });
          }, config.delay);
        });
        
      default:
        return Promise.reject(new Error('Unknown chaos type'));
    }
  }
}

module.exports = { DependencyChaos };

Chaos Experiment Framework

// chaos/experiment-framework.js

class ChaosExperiment {
  constructor(name, options = {}) {
    this.name = name;
    this.description = options.description || '';
    this.hypothesis = options.hypothesis || '';
    this.steadyState = options.steadyState || {};
    this.metrics = [];
    this.status = 'pending';
    this.startTime = null;
    this.endTime = null;
    this.results = null;
  }

  async run(chaosAction, monitoringClient, options = {}) {
    const {
      duration = 60000,  // 1 minute default
      warmup = 10000,    // 10 seconds warmup
      cooldown = 10000,  // 10 seconds cooldown
      abortThreshold = null
    } = options;

    console.log(`\n${'='.repeat(60)}`);
    console.log(`CHAOS EXPERIMENT: ${this.name}`);
    console.log(`${'='.repeat(60)}`);
    console.log(`Hypothesis: ${this.hypothesis}`);
    console.log(`Duration: ${duration}ms`);
    console.log(`${'='.repeat(60)}\n`);

    this.status = 'running';
    this.startTime = new Date();

    try {
      // Phase 1: Collect baseline metrics
      console.log('[Phase 1] Collecting baseline metrics...');
      const baseline = await this.collectMetrics(monitoringClient, warmup);
      console.log('Baseline:', JSON.stringify(baseline, null, 2));

      // Verify steady state before experiment
      if (!this.verifySteadyState(baseline)) {
        throw new Error('System not in steady state before experiment');
      }

      // Phase 2: Inject chaos
      console.log('\n[Phase 2] Injecting chaos...');
      await chaosAction.start();

      // Phase 3: Monitor during chaos
      console.log('\n[Phase 3] Monitoring during chaos...');
      const chaosMetrics = await this.monitorWithAbort(
        monitoringClient,
        duration,
        abortThreshold,
        chaosAction
      );

      // Phase 4: Stop chaos
      console.log('\n[Phase 4] Stopping chaos...');
      await chaosAction.stop();

      // Phase 5: Cooldown and collect recovery metrics
      console.log('\n[Phase 5] Collecting recovery metrics...');
      await this.sleep(cooldown);
      const recovery = await this.collectMetrics(monitoringClient, 5000);

      // Analyze results
      this.results = {
        baseline,
        chaos: chaosMetrics,
        recovery,
        hypothesis: this.evaluateHypothesis(baseline, chaosMetrics, recovery)
      };

      this.status = this.results.hypothesis.passed ? 'passed' : 'failed';
      
    } catch (error) {
      console.error('Experiment failed:', error);
      this.status = 'aborted';
      this.results = { error: error.message };
      
      // Ensure chaos is stopped
      try {
        await chaosAction.stop();
      } catch (e) {}
      
    } finally {
      this.endTime = new Date();
    }

    this.printResults();
    return this.results;
  }

  async collectMetrics(monitoringClient, duration) {
    const metrics = {
      errorRate: [],
      latencyP50: [],
      latencyP99: [],
      throughput: [],
      saturation: []
    };

    const interval = 1000;
    const iterations = Math.ceil(duration / interval);

    for (let i = 0; i < iterations; i++) {
      const snapshot = await monitoringClient.getMetrics();
      
      metrics.errorRate.push(snapshot.errorRate);
      metrics.latencyP50.push(snapshot.latencyP50);
      metrics.latencyP99.push(snapshot.latencyP99);
      metrics.throughput.push(snapshot.throughput);
      metrics.saturation.push(snapshot.saturation);

      await this.sleep(interval);
    }

    return {
      errorRate: this.average(metrics.errorRate),
      latencyP50: this.average(metrics.latencyP50),
      latencyP99: this.average(metrics.latencyP99),
      throughput: this.average(metrics.throughput),
      saturation: this.average(metrics.saturation)
    };
  }

  async monitorWithAbort(monitoringClient, duration, threshold, chaosAction) {
    const metrics = [];
    const interval = 5000;
    const iterations = Math.ceil(duration / interval);

    for (let i = 0; i < iterations; i++) {
      const snapshot = await monitoringClient.getMetrics();
      metrics.push(snapshot);

      // Check abort conditions
      if (threshold && this.shouldAbort(snapshot, threshold)) {
        console.log('\n⚠️  ABORTING: Threshold exceeded');
        await chaosAction.stop();
        break;
      }

      console.log(`  [${i + 1}/${iterations}] Error: ${(snapshot.errorRate * 100).toFixed(2)}%, Latency P99: ${snapshot.latencyP99}ms`);
      await this.sleep(interval);
    }

    return {
      errorRate: this.average(metrics.map(m => m.errorRate)),
      latencyP50: this.average(metrics.map(m => m.latencyP50)),
      latencyP99: this.average(metrics.map(m => m.latencyP99)),
      throughput: this.average(metrics.map(m => m.throughput)),
      maxErrorRate: Math.max(...metrics.map(m => m.errorRate)),
      maxLatency: Math.max(...metrics.map(m => m.latencyP99))
    };
  }

  shouldAbort(metrics, threshold) {
    return (
      metrics.errorRate > threshold.maxErrorRate ||
      metrics.latencyP99 > threshold.maxLatency
    );
  }

  verifySteadyState(metrics) {
    const { steadyState } = this;
    
    if (steadyState.maxErrorRate && metrics.errorRate > steadyState.maxErrorRate) {
      console.log(`Steady state check failed: errorRate ${metrics.errorRate} > ${steadyState.maxErrorRate}`);
      return false;
    }
    
    if (steadyState.maxLatencyP99 && metrics.latencyP99 > steadyState.maxLatencyP99) {
      console.log(`Steady state check failed: latencyP99 ${metrics.latencyP99} > ${steadyState.maxLatencyP99}`);
      return false;
    }
    
    return true;
  }

  evaluateHypothesis(baseline, chaos, recovery) {
    const results = {
      passed: true,
      findings: []
    };

    // Check if error rate stayed within bounds
    const errorRateIncrease = chaos.errorRate - baseline.errorRate;
    if (errorRateIncrease > 0.05) {  // More than 5% increase
      results.findings.push(`Error rate increased by ${(errorRateIncrease * 100).toFixed(2)}%`);
      results.passed = false;
    }

    // Check if latency degradation was acceptable
    const latencyIncrease = (chaos.latencyP99 - baseline.latencyP99) / baseline.latencyP99;
    if (latencyIncrease > 0.5) {  // More than 50% increase
      results.findings.push(`Latency P99 increased by ${(latencyIncrease * 100).toFixed(0)}%`);
      results.passed = false;
    }

    // Check recovery
    const recoveryTime = recovery.latencyP99 / baseline.latencyP99;
    if (recoveryTime > 1.1) {  // Not recovered to within 10%
      results.findings.push('System did not fully recover');
      results.passed = false;
    }

    if (results.passed) {
      results.findings.push('Hypothesis validated: system maintained steady state');
    }

    return results;
  }

  printResults() {
    console.log(`\n${'='.repeat(60)}`);
    console.log('EXPERIMENT RESULTS');
    console.log(`${'='.repeat(60)}`);
    console.log(`Status: ${this.status.toUpperCase()}`);
    console.log(`Duration: ${(this.endTime - this.startTime) / 1000}s`);
    
    if (this.results && this.results.baseline) {
      console.log('\nMetrics Comparison:');
      console.log(`  Error Rate: ${(this.results.baseline.errorRate * 100).toFixed(2)}% → ${(this.results.chaos.errorRate * 100).toFixed(2)}%`);
      console.log(`  Latency P99: ${this.results.baseline.latencyP99}ms → ${this.results.chaos.latencyP99}ms`);
      console.log(`  Throughput: ${this.results.baseline.throughput} → ${this.results.chaos.throughput}`);
      
      console.log('\nFindings:');
      for (const finding of this.results.hypothesis.findings) {
        console.log(`  • ${finding}`);
      }
    }
    
    console.log(`${'='.repeat(60)}\n`);
  }

  average(arr) {
    return arr.reduce((a, b) => a + b, 0) / arr.length;
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

module.exports = { ChaosExperiment };

Running Chaos Experiments

Example: Service Failure Experiment

// experiments/payment-service-failure.js

const { ChaosExperiment } = require('../chaos/experiment-framework');
const { DependencyChaos } = require('../chaos/dependency-chaos');

async function runPaymentFailureExperiment() {
  // Define the experiment
  const experiment = new ChaosExperiment('Payment Service Failure', {
    description: 'Simulate complete payment service unavailability',
    hypothesis: 'Order service will gracefully handle payment failures with proper fallbacks',
    steadyState: {
      maxErrorRate: 0.01,
      maxLatencyP99: 200
    }
  });

  // Create chaos action. httpClient is assumed to be the order service's
  // outbound HTTP client; wrapClient() must be called so DependencyChaos can
  // intercept its requests.
  const dependencyChaos = new DependencyChaos(httpClient);
  dependencyChaos.wrapClient();

  const chaosAction = {
    start: async () => {
      dependencyChaos.failService('payment-service', {
        type: 'error',
        errorCode: 503,
        message: 'Service Unavailable'
      });
    },
    stop: async () => {
      dependencyChaos.restoreService('payment-service');
    }
  };

  // Create monitoring client
  const monitoringClient = {
    getMetrics: async () => {
      const response = await fetch('http://prometheus:9090/api/v1/query', {
        method: 'POST',
        body: new URLSearchParams({
          query: `
            sum(rate(http_requests_total{service="order-service"}[1m])) by (status)
          `
        })
      });
      const data = await response.json();
      
      // Parse Prometheus response. getLatencyPercentile() and getCPUUtilization()
      // are separate helper queries (sketched after this example).
      return {
        errorRate: parseFloat(data.data.result.find(r => r.metric.status >= 500)?.value[1] || 0),
        latencyP50: await getLatencyPercentile(50),
        latencyP99: await getLatencyPercentile(99),
        throughput: parseFloat(data.data.result.reduce((sum, r) => sum + parseFloat(r.value[1]), 0)),
        saturation: await getCPUUtilization()
      };
    }
  };

  // Run experiment
  const results = await experiment.run(chaosAction, monitoringClient, {
    duration: 120000,  // 2 minutes
    warmup: 15000,
    cooldown: 30000,
    abortThreshold: {
      maxErrorRate: 0.5,  // Abort if error rate exceeds 50%
      maxLatency: 10000   // Abort if latency exceeds 10s
    }
  });

  return results;
}

runPaymentFailureExperiment().catch(console.error);
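
The monitoring client above calls getLatencyPercentile() and getCPUUtilization(), which are not shown. The sketches below assume a Prometheus histogram named http_request_duration_seconds and the node exporter's node_cpu_seconds_total metric; your metric names will likely differ.

// experiments/prometheus-helpers.js
// Illustrative helpers for the monitoring client above. The metric names are
// assumptions; substitute whatever your services actually expose.
const PROMETHEUS_URL = 'http://prometheus:9090/api/v1/query';

async function promQuery(query) {
  const response = await fetch(`${PROMETHEUS_URL}?${new URLSearchParams({ query })}`);
  const data = await response.json();
  return parseFloat(data.data.result[0]?.value[1] || 0);
}

async function getLatencyPercentile(percentile) {
  const quantile = percentile / 100;
  // histogram_quantile over the request-duration histogram, converted to ms
  return promQuery(
    `histogram_quantile(${quantile}, sum(rate(http_request_duration_seconds_bucket{service="order-service"}[1m])) by (le)) * 1000`
  );
}

async function getCPUUtilization() {
  // Fraction of CPU time not spent idle, averaged across cores
  return promQuery(`1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))`);
}

module.exports = { getLatencyPercentile, getCPUUtilization };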

Kubernetes Chaos with LitmusChaos

# litmus/pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  appinfo:
    appns: 'production'
    applabel: 'app=order-service'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
            - name: PODS_AFFECTED_PERC
              value: '50'
        probe:
          - name: 'check-order-endpoint'
            type: 'httpProbe'
            mode: 'Continuous'
            runProperties:
              probeTimeout: 5
              retry: 2
              interval: 5
            httpProbe/inputs:
              url: 'http://order-service.production.svc:80/health'
              insecureSkipVerify: false
              responseTimeout: 3000
              method:
                get:
                  criteria: '=='
                  responseCode: '200'
---
# Network chaos
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-chaos
spec:
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: NETWORK_LATENCY
              value: '2000'  # 2 second latency
            - name: TOTAL_CHAOS_DURATION
              value: '120'
            - name: TARGET_PODS
              value: 'payment-service'

Game Day Exercises

// gameday/runner.js

class GameDay {
  constructor(name, scenarios) {
    this.name = name;
    this.scenarios = scenarios;
    this.results = [];
    this.observers = [];
  }

  addObserver(observer) {
    this.observers.push(observer);
  }

  notify(event) {
    for (const observer of this.observers) {
      observer.onEvent(event);
    }
  }

  async run() {
    console.log(`\n${'#'.repeat(70)}`);
    console.log(`# GAME DAY: ${this.name}`);
    console.log(`# Date: ${new Date().toISOString()}`);
    console.log(`# Scenarios: ${this.scenarios.length}`);
    console.log(`${'#'.repeat(70)}\n`);

    this.notify({ type: 'gameday_start', name: this.name });

    for (let i = 0; i < this.scenarios.length; i++) {
      const scenario = this.scenarios[i];
      
      console.log(`\n--- Scenario ${i + 1}/${this.scenarios.length}: ${scenario.name} ---`);
      this.notify({ type: 'scenario_start', scenario: scenario.name });

      try {
        // Pre-scenario check
        const preCheck = await scenario.preCheck();
        if (!preCheck.ready) {
          console.log(`Skipping: ${preCheck.reason}`);
          this.results.push({ scenario: scenario.name, status: 'skipped', reason: preCheck.reason });
          continue;
        }

        // Run scenario
        const result = await scenario.execute();
        
        // Validate expectations
        const validation = await scenario.validate(result);
        
        this.results.push({
          scenario: scenario.name,
          status: validation.passed ? 'passed' : 'failed',
          result,
          validation
        });

        this.notify({ 
          type: 'scenario_complete', 
          scenario: scenario.name, 
          passed: validation.passed 
        });

        // Recovery period
        console.log('Recovery period...');
        await this.sleep(scenario.recoveryTime || 30000);

      } catch (error) {
        console.error(`Scenario failed with error: ${error.message}`);
        this.results.push({
          scenario: scenario.name,
          status: 'error',
          error: error.message
        });
        
        // Try to recover
        await scenario.cleanup?.();
      }
    }

    this.printSummary();
    this.notify({ type: 'gameday_complete', results: this.results });

    return this.results;
  }

  printSummary() {
    console.log(`\n${'='.repeat(70)}`);
    console.log('GAME DAY SUMMARY');
    console.log(`${'='.repeat(70)}`);
    
    const passed = this.results.filter(r => r.status === 'passed').length;
    const failed = this.results.filter(r => r.status === 'failed').length;
    const errors = this.results.filter(r => r.status === 'error').length;
    
    console.log(`Total Scenarios: ${this.results.length}`);
    console.log(`Passed: ${passed}`);
    console.log(`Failed: ${failed}`);
    console.log(`Errors: ${errors}`);
    
    console.log('\nDetails:');
    for (const result of this.results) {
      const icon = result.status === 'passed' ? '✅' : result.status === 'failed' ? '❌' : '⚠️';
      console.log(`  ${icon} ${result.scenario}: ${result.status}`);
      
      if (result.validation?.findings) {
        for (const finding of result.validation.findings) {
          console.log(`      - ${finding}`);
        }
      }
    }
    
    console.log(`${'='.repeat(70)}\n`);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Example Game Day Scenarios
// Helper utilities used by the scenarios below: exec(cmd) resolves with a
// shell command's stdout, and sleep(ms) resolves after the given delay.
const { promisify } = require('util');
const runCommand = promisify(require('child_process').exec);
const exec = async (cmd) => (await runCommand(cmd)).stdout;
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

const scenarios = [
  {
    name: 'Database Failover',
    preCheck: async () => ({ ready: true }),
    execute: async () => {
      // Simulate database failure
      await exec('kubectl delete pod postgres-primary-0 -n database');
      
      // Wait for failover
      await sleep(30000);
      
      // Check if replica was promoted
      const status = await exec('kubectl get pods -n database -l role=primary');
      return { newPrimary: status.includes('Running') };
    },
    validate: async (result) => ({
      passed: result.newPrimary,
      findings: result.newPrimary 
        ? ['Failover completed successfully'] 
        : ['Failover did not complete']
    }),
    cleanup: async () => {
      // Restore original setup if needed
    },
    recoveryTime: 60000
  },
  
  {
    name: 'Cache Failure',
    preCheck: async () => ({ ready: true }),
    execute: async () => {
      // Stop Redis
      await exec('kubectl scale deployment redis --replicas=0 -n cache');
      
      // Make requests during outage
      const responses = await Promise.all(
        Array(100).fill().map(() => 
          fetch('http://order-service/api/products').catch(e => ({ error: true }))
        )
      );
      
      // Restore Redis
      await exec('kubectl scale deployment redis --replicas=3 -n cache');
      
      const errors = responses.filter(r => r.error || r.status >= 500).length;
      return { errorRate: errors / 100 };
    },
    validate: async (result) => ({
      passed: result.errorRate < 0.1,  // Less than 10% error rate
      findings: [`Error rate during cache failure: ${(result.errorRate * 100).toFixed(1)}%`]
    }),
    recoveryTime: 45000
  }
];

const gameDay = new GameDay('Q4 Resilience Testing', scenarios);
gameDay.run();
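
The GameDay class exposes an observer hook, but the example never registers one. A minimal observer might simply forward events to a chat channel; the webhook URL below is a placeholder supplied via an environment variable, and the observer should be registered before run() so it sees every event.

// gameday/observers.js
// Minimal observer sketch. Any notification channel works the same way; the
// Slack webhook URL is a placeholder provided through SLACK_WEBHOOK_URL.
const slackObserver = {
  onEvent(event) {
    const text = `[GameDay] ${event.type}` +
      (event.scenario ? `: ${event.scenario}` : '') +
      (event.passed !== undefined ? ` (${event.passed ? 'passed' : 'failed'})` : '');

    // Fire-and-forget: observer failures must never affect the exercise itself
    if (!process.env.SLACK_WEBHOOK_URL) return console.log(text);
    fetch(process.env.SLACK_WEBHOOK_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text })
    }).catch(() => {});
  }
};

module.exports = { slackObserver };

Register it with gameDay.addObserver(slackObserver) before calling gameDay.run().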

Interview Questions

What is chaos engineering, and why is it important?

Answer: Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.

Why it matters:
  • Distributed systems have unpredictable failure modes
  • Traditional testing doesn’t cover all scenarios
  • Builds confidence before production incidents
  • Reveals weaknesses proactively
Key principles:
  1. Define “steady state” (normal behavior)
  2. Hypothesize that steady state continues
  3. Introduce real-world failures
  4. Try to disprove the hypothesis
  5. Run in production (with safety)

How does Netflix use Chaos Monkey and the Simian Army?

Answer: Chaos Monkey randomly terminates production instances to ensure services can survive instance failures.

Part of the Simian Army:
  • Chaos Monkey: Kills instances
  • Latency Monkey: Adds artificial delays
  • Conformity Monkey: Checks for best practices
  • Chaos Gorilla: Kills entire availability zones
  • Chaos Kong: Kills entire regions
Design principles:
  • Everything must handle instance failure
  • Stateless services
  • Redundancy at every level
  • Automated recovery
Lesson: Design for failure from day one.

How do you run chaos experiments safely?

Answer: Safety measures:
  1. Start small
    • Begin in staging
    • Small blast radius
    • Short duration
  2. Abort conditions
    • Define thresholds (error rate, latency)
    • Automatic rollback
    • Kill switch ready
  3. Observability
    • Real-time monitoring
    • Dashboards visible
    • Alerts configured
  4. Team preparedness
    • Incident response ready
    • Runbooks available
    • All stakeholders aware
  5. Gradual expansion
    • Increase scope over time
    • Learn from each experiment
    • Build confidence incrementally

What types of failures should you inject?

Answer:

Service failures:
  • Service unavailable (crash, OOM)
  • Slow responses (latency)
  • Error responses (5xx)
Network failures:
  • Packet loss
  • Network partition
  • DNS failure
Infrastructure:
  • Instance termination
  • Zone failure
  • Disk full
Dependencies:
  • Database failure
  • Cache unavailable
  • Message queue failure
Resource exhaustion:
  • CPU saturation
  • Memory exhaustion
  • Connection pool exhaustion
  • Thread pool exhaustion

What is a Game Day exercise?

Answer: A Game Day is a scheduled event where teams intentionally inject failures to test system resilience.

Components:
  1. Planning: Define scenarios, success criteria
  2. Communication: Notify stakeholders
  3. Execution: Run scenarios with observers
  4. Observation: Monitor and document
  5. Retrospective: Analyze and improve
Benefits:
  • Team practices incident response
  • Reveals documentation gaps
  • Tests monitoring and alerting
  • Builds muscle memory for real incidents
Example scenarios:
  • Database failover
  • Region evacuation
  • DDoS simulation
  • Major dependency outage

Chapter Summary

Key Takeaways:
  • Chaos engineering proactively finds weaknesses before production incidents
  • Follow the scientific method: hypothesis → experiment → analyze
  • Start small in staging, gradually expand to production
  • Always have abort conditions and rollback plans
  • Game days help teams practice incident response
  • Design systems assuming everything will fail
Next Chapter: Real-World Case Studies - Architecture breakdowns from Netflix, Uber, Amazon.