Chaos Engineering
Chaos engineering proactively tests system resilience by intentionally introducing failures. Learn how Netflix and other tech giants build confidence in their distributed systems.
Learning Objectives:
- Understand chaos engineering principles
- Implement failure injection techniques
- Design and run chaos experiments
- Build resilience through controlled chaos
- Create game day exercises
Why Chaos Engineering?
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE NEED FOR CHAOS ENGINEERING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ REALITY OF DISTRIBUTED SYSTEMS: │
│ ─────────────────────────────────────── │
│ │
│ "Everything fails, all the time" - Werner Vogels, Amazon CTO │
│ │
│ Microservices introduce: │
│ • Network partitions • Dependency failures │
│ • Latency spikes • Resource exhaustion │
│ • Data inconsistency • Configuration errors │
│ • Cascading failures • Deployment issues │
│ │
│ ═══════════════════════════════════════════════════════════════════════════│
│ │
│ CHAOS ENGINEERING APPROACH: │
│ ───────────────────────────── │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Hypothesis │───▶│ Inject │───▶│ Observe │ │
│ │ "System will │ │ Failures │ │ Behavior │ │
│ │ handle X" │ │ (Controlled) │ │ │ │
│ └─────────────────┘ └─────────────────┘ └────────┬────────┘ │
│ │ │
│ ┌────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Learn & │◀───│ Analyze │ │
│ │ Improve │ │ Results │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ GOAL: Build confidence that your system can withstand turbulent │
│ conditions in production │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Chaos Engineering Principles
The Scientific Method
┌─────────────────────────────────────────────────────────────────────────────┐
│ CHAOS EXPERIMENT LIFECYCLE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. DEFINE STEADY STATE │
│ ───────────────────────── │
│ "What does 'healthy' look like?" │
│ • Response time p99 < 200ms │
│ • Error rate < 0.1% │
│ • Orders processed per minute > 100 │
│ │
│ 2. HYPOTHESIZE │
│ ───────────────── │
│ "Steady state will continue when..." │
│ • Payment service becomes unavailable │
│ • Database latency increases 10x │
│ • 30% of instances are terminated │
│ │
│ 3. DESIGN EXPERIMENT │
│ ───────────────────── │
│ • What failure to inject? │
│ • Blast radius (scope) │
│ • Duration │
│ • Abort conditions │
│ │
│ 4. RUN EXPERIMENT │
│ ──────────────────── │
│ • Inject failure │
│ • Monitor systems │
│ • Observe behavior │
│ • Be ready to abort │
│ │
│ 5. ANALYZE & LEARN │
│ ──────────────────── │
│ • Did steady state hold? │
│ • What broke? │
│ • What was the blast radius? │
│ • How can we improve? │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
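To make the lifecycle concrete, here is a minimal sketch of how steps 1-3 might be captured as a plain experiment definition. The names and thresholds below are illustrative, not prescriptive:
// Illustrative experiment definition covering steps 1-3 of the lifecycle
const paymentOutageExperiment = {
  name: 'Payment service outage',
  // 1. Steady state: what "healthy" means for this system
  steadyState: {
    maxLatencyP99Ms: 200,
    maxErrorRate: 0.001,        // 0.1%
    minOrdersPerMinute: 100
  },
  // 2. Hypothesis: steady state will hold under the injected failure
  hypothesis: 'Order flow stays within steady state while payment-service is unavailable',
  // 3. Experiment design: failure, blast radius, duration, abort conditions
  failure: { type: 'service-unavailable', target: 'payment-service' },
  blastRadius: { traffic: '5%', environment: 'staging' },
  durationMs: 2 * 60 * 1000,
  abortWhen: { errorRate: 0.05, latencyP99Ms: 2000 }
};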
Failure Injection Types
Service Level Failures
// chaos/service-failures.js
class ServiceChaos {
constructor(options = {}) {
this.enabled = process.env.CHAOS_ENABLED === 'true';
this.targetPercentage = options.targetPercentage || 10;
this.latencyMs = options.latencyMs || 2000;
}
// Middleware for Express
middleware() {
return (req, res, next) => {
if (!this.enabled) return next();
const random = Math.random() * 100;
// Inject failures based on configuration
if (random < this.targetPercentage) {
const failureType = this.selectFailure();
return this.injectFailure(failureType, req, res, next);
}
next();
};
}
selectFailure() {
const failures = [
{ type: 'latency', weight: 50 },
{ type: 'error', weight: 30 },
{ type: 'timeout', weight: 15 },
{ type: 'exception', weight: 5 }
];
const total = failures.reduce((sum, f) => sum + f.weight, 0);
let random = Math.random() * total;
for (const failure of failures) {
random -= failure.weight;
if (random <= 0) return failure.type;
}
return 'latency';
}
injectFailure(type, req, res, next) {
console.log(`[CHAOS] Injecting ${type} failure for ${req.path}`);
switch (type) {
case 'latency':
// Add artificial delay
setTimeout(next, this.latencyMs);
break;
case 'error':
// Return 500 error
res.status(500).json({
error: 'Internal Server Error',
chaos: true,
message: 'This is a chaos-injected failure'
});
break;
case 'timeout':
// Don't respond (simulate timeout)
// The client will eventually timeout
break;
case 'exception':
// Throw an exception
throw new Error('Chaos-injected exception');
default:
next();
}
}
}
// Apply to specific routes
app.use('/api/orders', new ServiceChaos({
targetPercentage: 5,
latencyMs: 3000
}).middleware());
Network Failures
// chaos/network-chaos.js
const http = require('http');
const https = require('https');
class NetworkChaos {
constructor(options = {}) {
this.enabled = process.env.CHAOS_ENABLED === 'true';
this.config = {
packetLoss: options.packetLoss || 0, // Percentage
latency: options.latency || 0, // Milliseconds
jitter: options.jitter || 0, // Milliseconds
bandwidth: options.bandwidth || null, // Bytes per second
...options
};
this.originalRequest = http.request.bind(http);
this.originalSecureRequest = https.request.bind(https);
}
enable() {
if (!this.enabled) return;
// Monkey-patch http.request
http.request = (options, callback) => {
return this.wrapRequest(this.originalRequest, options, callback);
};
https.request = (options, callback) => {
return this.wrapRequest(this.originalSecureRequest, options, callback);
};
console.log('[CHAOS] Network chaos enabled');
}
disable() {
http.request = this.originalRequest;
https.request = this.originalSecureRequest;
console.log('[CHAOS] Network chaos disabled');
}
wrapRequest(originalFn, options, callback) {
const hostname = typeof options === 'string'
? new URL(options).hostname
: options.hostname || options.host;
// Check if this host should be affected
if (!this.shouldAffect(hostname)) {
return originalFn(options, callback);
}
// Simulate packet loss
if (Math.random() * 100 < this.config.packetLoss) {
console.log(`[CHAOS] Simulating packet loss for ${hostname}`);
const req = originalFn(options, () => {});
req.on('socket', (socket) => {
socket.destroy(new Error('Chaos: simulated packet loss'));
});
return req;
}
// Add latency with jitter
// NOTE: returning a Promise only works for callers that await the wrapped
// request (e.g. promise-based HTTP helpers); raw http.request callers expect
// a ClientRequest to be returned synchronously.
const delay = this.config.latency + (Math.random() * this.config.jitter);
if (delay > 0) {
console.log(`[CHAOS] Adding ${delay}ms latency for ${hostname}`);
return new Promise((resolve) => {
setTimeout(() => {
resolve(originalFn(options, callback));
}, delay);
});
}
return originalFn(options, callback);
}
shouldAffect(hostname) {
// Only affect specific services
const targetHosts = (process.env.CHAOS_TARGET_HOSTS || '').split(',');
if (targetHosts.length === 0 || targetHosts[0] === '') {
return true; // Affect all hosts
}
return targetHosts.some(host => hostname.includes(host));
}
}
module.exports = { NetworkChaos };
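A minimal usage sketch for NetworkChaos follows. The loss and latency values are illustrative, and CHAOS_ENABLED / CHAOS_TARGET_HOSTS are assumed to be set in the environment:
// Illustrative usage of NetworkChaos (values are examples, not recommendations)
const { NetworkChaos } = require('./network-chaos');

// Assumes CHAOS_ENABLED=true and, optionally, CHAOS_TARGET_HOSTS=payment-service.
const networkChaos = new NetworkChaos({
  packetLoss: 5,   // drop ~5% of outbound requests
  latency: 500,    // add 500ms base latency
  jitter: 200      // plus up to 200ms random jitter
});

networkChaos.enable();

// ... run traffic against the affected dependencies here ...

// Always restore the original http/https behavior when the experiment ends.
process.on('SIGINT', () => {
  networkChaos.disable();
  process.exit(0);
});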
Resource Exhaustion
// chaos/resource-chaos.js
class ResourceChaos {
constructor() {
this.memoryHogs = [];
this.cpuBurner = null;
this.fdLeaks = [];
}
// Consume memory
exhaustMemory(megabytes = 100) {
console.log(`[CHAOS] Consuming ${megabytes}MB of memory`);
const chunks = Math.ceil(megabytes / 10);
for (let i = 0; i < chunks; i++) {
// Allocate 10MB chunks
const buffer = Buffer.alloc(10 * 1024 * 1024);
buffer.fill('X');
this.memoryHogs.push(buffer);
}
console.log(`[CHAOS] Memory allocated: ${this.memoryHogs.length * 10}MB`);
return this;
}
releaseMemory() {
this.memoryHogs = [];
if (global.gc) {
global.gc();
}
console.log('[CHAOS] Memory released');
return this;
}
// Burn CPU
burnCPU(percentage = 50, durationMs = 5000) {
console.log(`[CHAOS] Burning ${percentage}% CPU for ${durationMs}ms`);
const endTime = Date.now() + durationMs;
const workTime = percentage;
const sleepTime = 100 - percentage;
const burn = () => {
if (Date.now() >= endTime) {
console.log('[CHAOS] CPU burn complete');
return;
}
// Work (busy loop)
const workEnd = Date.now() + workTime;
while (Date.now() < workEnd) {
Math.random() * Math.random();
}
// Sleep
setTimeout(burn, sleepTime);
};
burn();
return this;
}
// Exhaust file descriptors
exhaustFileDescriptors(count = 1000) {
const fs = require('fs');
console.log(`[CHAOS] Opening ${count} file descriptors`);
for (let i = 0; i < count; i++) {
try {
const fd = fs.openSync('/dev/null', 'r');
this.fdLeaks.push(fd);
} catch (error) {
console.log(`[CHAOS] Hit FD limit at ${i} descriptors`);
break;
}
}
return this;
}
releaseFileDescriptors() {
const fs = require('fs');
for (const fd of this.fdLeaks) {
try {
fs.closeSync(fd);
} catch (e) {}
}
this.fdLeaks = [];
console.log('[CHAOS] File descriptors released');
return this;
}
// Fill disk
fillDisk(path, gigabytes = 1) {
const fs = require('fs');
console.log(`[CHAOS] Filling ${gigabytes}GB at ${path}`);
const chunkSize = 100 * 1024 * 1024; // 100MB chunks
const chunks = gigabytes * 10;
const buffer = Buffer.alloc(chunkSize);
buffer.fill('X');
const fd = fs.openSync(path, 'w');
for (let i = 0; i < chunks; i++) {
try {
fs.writeSync(fd, buffer);
} catch (error) {
console.log(`[CHAOS] Disk fill stopped at ${i * 100}MB: ${error.message}`);
break;
}
}
fs.closeSync(fd);
return path;
}
}
module.exports = { ResourceChaos };
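A short usage sketch follows; the sizes and durations are illustrative only, and resources should always be released when the experiment ends:
// Illustrative usage of ResourceChaos (values are examples, not recommendations)
const { ResourceChaos } = require('./resource-chaos');

const resourceChaos = new ResourceChaos();

// Hold roughly 200MB of heap for 30 seconds, then release it.
resourceChaos.exhaustMemory(200);
setTimeout(() => resourceChaos.releaseMemory(), 30000);

// Burn roughly 70% of one CPU core for 10 seconds.
resourceChaos.burnCPU(70, 10000);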
Dependency Failures
// chaos/dependency-chaos.js
class DependencyChaos {
constructor(httpClient) {
this.httpClient = httpClient;
this.failures = new Map();
}
// Fail specific service
failService(serviceName, options = {}) {
const config = {
type: options.type || 'error', // 'error', 'timeout', 'slow'
errorCode: options.errorCode || 500,
delay: options.delay || 5000,
message: options.message || 'Chaos-injected failure'
};
this.failures.set(serviceName, config);
console.log(`[CHAOS] ${serviceName} will ${config.type}`);
}
restoreService(serviceName) {
this.failures.delete(serviceName);
console.log(`[CHAOS] ${serviceName} restored`);
}
// Wrap HTTP client
wrapClient() {
const original = this.httpClient.request.bind(this.httpClient);
this.httpClient.request = async (url, options = {}) => {
const serviceName = this.extractServiceName(url);
const failure = this.failures.get(serviceName);
if (failure) {
return this.simulateFailure(failure, url);
}
return original(url, options);
};
}
extractServiceName(url) {
try {
const parsed = new URL(url);
return parsed.hostname.split('.')[0]; // e.g., 'payment-service'
} catch {
return url;
}
}
simulateFailure(config, url) {
console.log(`[CHAOS] Simulating ${config.type} for ${url}`);
switch (config.type) {
case 'error':
return Promise.reject({
status: config.errorCode,
message: config.message,
chaos: true
});
case 'timeout':
return new Promise((_, reject) => {
setTimeout(() => {
reject(new Error('Chaos: Connection timeout'));
}, config.delay);
});
case 'slow':
return new Promise((resolve) => {
setTimeout(() => {
resolve({ status: 200, data: { slow: true } });
}, config.delay);
});
default:
return Promise.reject(new Error('Unknown chaos type'));
}
}
}
module.exports = { DependencyChaos };
Chaos Experiment Framework
// chaos/experiment-framework.js
class ChaosExperiment {
constructor(name, options = {}) {
this.name = name;
this.description = options.description || '';
this.hypothesis = options.hypothesis || '';
this.steadyState = options.steadyState || {};
this.metrics = [];
this.status = 'pending';
this.startTime = null;
this.endTime = null;
this.results = null;
}
async run(chaosAction, monitoringClient, options = {}) {
const {
duration = 60000, // 1 minute default
warmup = 10000, // 10 seconds warmup
cooldown = 10000, // 10 seconds cooldown
abortThreshold = null
} = options;
console.log(`\n${'='.repeat(60)}`);
console.log(`CHAOS EXPERIMENT: ${this.name}`);
console.log(`${'='.repeat(60)}`);
console.log(`Hypothesis: ${this.hypothesis}`);
console.log(`Duration: ${duration}ms`);
console.log(`${'='.repeat(60)}\n`);
this.status = 'running';
this.startTime = new Date();
try {
// Phase 1: Collect baseline metrics
console.log('[Phase 1] Collecting baseline metrics...');
const baseline = await this.collectMetrics(monitoringClient, warmup);
console.log('Baseline:', JSON.stringify(baseline, null, 2));
// Verify steady state before experiment
if (!this.verifySteadyState(baseline)) {
throw new Error('System not in steady state before experiment');
}
// Phase 2: Inject chaos
console.log('\n[Phase 2] Injecting chaos...');
await chaosAction.start();
// Phase 3: Monitor during chaos
console.log('\n[Phase 3] Monitoring during chaos...');
const chaosMetrics = await this.monitorWithAbort(
monitoringClient,
duration,
abortThreshold,
chaosAction
);
// Phase 4: Stop chaos
console.log('\n[Phase 4] Stopping chaos...');
await chaosAction.stop();
// Phase 5: Cooldown and collect recovery metrics
console.log('\n[Phase 5] Collecting recovery metrics...');
await this.sleep(cooldown);
const recovery = await this.collectMetrics(monitoringClient, 5000);
// Analyze results
this.results = {
baseline,
chaos: chaosMetrics,
recovery,
hypothesis: this.evaluateHypothesis(baseline, chaosMetrics, recovery)
};
this.status = this.results.hypothesis.passed ? 'passed' : 'failed';
} catch (error) {
console.error('Experiment failed:', error);
this.status = 'aborted';
this.results = { error: error.message };
// Ensure chaos is stopped
try {
await chaosAction.stop();
} catch (e) {}
} finally {
this.endTime = new Date();
}
this.printResults();
return this.results;
}
async collectMetrics(monitoringClient, duration) {
const metrics = {
errorRate: [],
latencyP50: [],
latencyP99: [],
throughput: [],
saturation: []
};
const interval = 1000;
const iterations = Math.ceil(duration / interval);
for (let i = 0; i < iterations; i++) {
const snapshot = await monitoringClient.getMetrics();
metrics.errorRate.push(snapshot.errorRate);
metrics.latencyP50.push(snapshot.latencyP50);
metrics.latencyP99.push(snapshot.latencyP99);
metrics.throughput.push(snapshot.throughput);
metrics.saturation.push(snapshot.saturation);
await this.sleep(interval);
}
return {
errorRate: this.average(metrics.errorRate),
latencyP50: this.average(metrics.latencyP50),
latencyP99: this.average(metrics.latencyP99),
throughput: this.average(metrics.throughput),
saturation: this.average(metrics.saturation)
};
}
async monitorWithAbort(monitoringClient, duration, threshold, chaosAction) {
const metrics = [];
const interval = 5000;
const iterations = Math.ceil(duration / interval);
for (let i = 0; i < iterations; i++) {
const snapshot = await monitoringClient.getMetrics();
metrics.push(snapshot);
// Check abort conditions
if (threshold && this.shouldAbort(snapshot, threshold)) {
console.log('\n⚠️ ABORTING: Threshold exceeded');
await chaosAction.stop();
break;
}
console.log(` [${i + 1}/${iterations}] Error: ${(snapshot.errorRate * 100).toFixed(2)}%, Latency P99: ${snapshot.latencyP99}ms`);
await this.sleep(interval);
}
return {
errorRate: this.average(metrics.map(m => m.errorRate)),
latencyP50: this.average(metrics.map(m => m.latencyP50)),
latencyP99: this.average(metrics.map(m => m.latencyP99)),
throughput: this.average(metrics.map(m => m.throughput)),
maxErrorRate: Math.max(...metrics.map(m => m.errorRate)),
maxLatency: Math.max(...metrics.map(m => m.latencyP99))
};
}
shouldAbort(metrics, threshold) {
return (
metrics.errorRate > threshold.maxErrorRate ||
metrics.latencyP99 > threshold.maxLatency
);
}
verifySteadyState(metrics) {
const { steadyState } = this;
if (steadyState.maxErrorRate && metrics.errorRate > steadyState.maxErrorRate) {
console.log(`Steady state check failed: errorRate ${metrics.errorRate} > ${steadyState.maxErrorRate}`);
return false;
}
if (steadyState.maxLatencyP99 && metrics.latencyP99 > steadyState.maxLatencyP99) {
console.log(`Steady state check failed: latencyP99 ${metrics.latencyP99} > ${steadyState.maxLatencyP99}`);
return false;
}
return true;
}
evaluateHypothesis(baseline, chaos, recovery) {
const results = {
passed: true,
findings: []
};
// Check if error rate stayed within bounds
const errorRateIncrease = chaos.errorRate - baseline.errorRate;
if (errorRateIncrease > 0.05) { // More than 5% increase
results.findings.push(`Error rate increased by ${(errorRateIncrease * 100).toFixed(2)}%`);
results.passed = false;
}
// Check if latency degradation was acceptable
const latencyIncrease = (chaos.latencyP99 - baseline.latencyP99) / baseline.latencyP99;
if (latencyIncrease > 0.5) { // More than 50% increase
results.findings.push(`Latency P99 increased by ${(latencyIncrease * 100).toFixed(0)}%`);
results.passed = false;
}
// Check recovery
const recoveryTime = recovery.latencyP99 / baseline.latencyP99;
if (recoveryTime > 1.1) { // Not recovered to within 10%
results.findings.push('System did not fully recover');
results.passed = false;
}
if (results.passed) {
results.findings.push('Hypothesis validated: system maintained steady state');
}
return results;
}
printResults() {
console.log(`\n${'='.repeat(60)}`);
console.log('EXPERIMENT RESULTS');
console.log(`${'='.repeat(60)}`);
console.log(`Status: ${this.status.toUpperCase()}`);
console.log(`Duration: ${(this.endTime - this.startTime) / 1000}s`);
if (this.results) {
console.log('\nMetrics Comparison:');
console.log(` Error Rate: ${(this.results.baseline.errorRate * 100).toFixed(2)}% → ${(this.results.chaos.errorRate * 100).toFixed(2)}%`);
console.log(` Latency P99: ${this.results.baseline.latencyP99}ms → ${this.results.chaos.latencyP99}ms`);
console.log(` Throughput: ${this.results.baseline.throughput} → ${this.results.chaos.throughput}`);
console.log('\nFindings:');
for (const finding of this.results.hypothesis.findings) {
console.log(` • ${finding}`);
}
}
console.log(`${'='.repeat(60)}\n`);
}
average(arr) {
return arr.reduce((a, b) => a + b, 0) / arr.length;
}
sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
module.exports = { ChaosExperiment };
Running Chaos Experiments
Example: Service Failure Experiment
// experiments/payment-service-failure.js
const { ChaosExperiment } = require('../chaos/experiment-framework');
const { DependencyChaos } = require('../chaos/dependency-chaos');
async function runPaymentFailureExperiment() {
// Define the experiment
const experiment = new ChaosExperiment('Payment Service Failure', {
description: 'Simulate complete payment service unavailability',
hypothesis: 'Order service will gracefully handle payment failures with proper fallbacks',
steadyState: {
maxErrorRate: 0.01,
maxLatencyP99: 200
}
});
// Create chaos action
const dependencyChaos = new DependencyChaos(httpClient); // httpClient: the service's shared HTTP client instance
const chaosAction = {
start: async () => {
dependencyChaos.failService('payment-service', {
type: 'error',
errorCode: 503,
message: 'Service Unavailable'
});
},
stop: async () => {
dependencyChaos.restoreService('payment-service');
}
};
// Create monitoring client
const monitoringClient = {
getMetrics: async () => {
const response = await fetch('http://prometheus:9090/api/v1/query', {
method: 'POST',
body: new URLSearchParams({
query: `
sum(rate(http_requests_total{service="order-service"}[1m])) by (status)
`
})
});
const data = await response.json();
// Parse Prometheus response
return {
errorRate: parseFloat(data.data.result.find(r => r.metric.status >= 500)?.value[1] || 0),
latencyP50: await getLatencyPercentile(50),
latencyP99: await getLatencyPercentile(99),
throughput: parseFloat(data.data.result.reduce((sum, r) => sum + parseFloat(r.value[1]), 0)),
saturation: await getCPUUtilization()
};
}
};
// Run experiment
const results = await experiment.run(chaosAction, monitoringClient, {
duration: 120000, // 2 minutes
warmup: 15000,
cooldown: 30000,
abortThreshold: {
maxErrorRate: 0.5, // Abort if error rate exceeds 50%
maxLatency: 10000 // Abort if latency exceeds 10s
}
});
return results;
}
runPaymentFailureExperiment().catch(console.error);
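The monitoring client above references getLatencyPercentile and getCPUUtilization, which are not shown. Possible sketches, assuming the standard Prometheus HTTP query API and a histogram metric named http_request_duration_seconds, might look like this:
// Possible implementations of the helpers referenced above (assumed, not prescribed)
async function queryPrometheus(query) {
  const response = await fetch('http://prometheus:9090/api/v1/query', {
    method: 'POST',
    body: new URLSearchParams({ query })
  });
  const data = await response.json();
  return parseFloat(data.data.result[0]?.value[1] || 0);
}

async function getLatencyPercentile(percentile) {
  // Assumes a Prometheus histogram named http_request_duration_seconds (seconds),
  // converted to milliseconds here.
  return queryPrometheus(`
    histogram_quantile(${percentile / 100},
      sum(rate(http_request_duration_seconds_bucket{service="order-service"}[1m])) by (le)
    ) * 1000
  `);
}

async function getCPUUtilization() {
  // Assumes cAdvisor-style container CPU metrics are scraped.
  return queryPrometheus(`
    avg(rate(container_cpu_usage_seconds_total{pod=~"order-service.*"}[1m]))
  `);
}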
Kubernetes Chaos with LitmusChaos
# litmus/pod-delete-experiment.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: order-service-chaos
  namespace: production
spec:
  appinfo:
    appns: 'production'
    applabel: 'app=order-service'
    appkind: 'deployment'
  engineState: 'active'
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            - name: CHAOS_INTERVAL
              value: '10'
            - name: FORCE
              value: 'false'
            - name: PODS_AFFECTED_PERC
              value: '50'
        probe:
          - name: 'check-order-endpoint'
            type: 'httpProbe'
            mode: 'Continuous'
            runProperties:
              probeTimeout: 5
              retry: 2
              interval: 5
            httpProbe/inputs:
              url: 'http://order-service.production.svc:80/health'
              insecureSkipVerify: false
              responseTimeout: 3000
              method:
                get:
                  criteria: '=='
                  responseCode: '200'
---
# Network chaos
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-chaos
spec:
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: 'eth0'
            - name: NETWORK_LATENCY
              value: '2000'  # 2 second latency
            - name: TOTAL_CHAOS_DURATION
              value: '120'
            - name: TARGET_PODS
              value: 'payment-service'
Game Day Exercises
// gameday/runner.js
class GameDay {
constructor(name, scenarios) {
this.name = name;
this.scenarios = scenarios;
this.results = [];
this.observers = [];
}
addObserver(observer) {
this.observers.push(observer);
}
notify(event) {
for (const observer of this.observers) {
observer.onEvent(event);
}
}
async run() {
console.log(`\n${'#'.repeat(70)}`);
console.log(`# GAME DAY: ${this.name}`);
console.log(`# Date: ${new Date().toISOString()}`);
console.log(`# Scenarios: ${this.scenarios.length}`);
console.log(`${'#'.repeat(70)}\n`);
this.notify({ type: 'gameday_start', name: this.name });
for (let i = 0; i < this.scenarios.length; i++) {
const scenario = this.scenarios[i];
console.log(`\n--- Scenario ${i + 1}/${this.scenarios.length}: ${scenario.name} ---`);
this.notify({ type: 'scenario_start', scenario: scenario.name });
try {
// Pre-scenario check
const preCheck = await scenario.preCheck();
if (!preCheck.ready) {
console.log(`Skipping: ${preCheck.reason}`);
this.results.push({ scenario: scenario.name, status: 'skipped', reason: preCheck.reason });
continue;
}
// Run scenario
const result = await scenario.execute();
// Validate expectations
const validation = await scenario.validate(result);
this.results.push({
scenario: scenario.name,
status: validation.passed ? 'passed' : 'failed',
result,
validation
});
this.notify({
type: 'scenario_complete',
scenario: scenario.name,
passed: validation.passed
});
// Recovery period
console.log('Recovery period...');
await this.sleep(scenario.recoveryTime || 30000);
} catch (error) {
console.error(`Scenario failed with error: ${error.message}`);
this.results.push({
scenario: scenario.name,
status: 'error',
error: error.message
});
// Try to recover
await scenario.cleanup?.();
}
}
this.printSummary();
this.notify({ type: 'gameday_complete', results: this.results });
return this.results;
}
printSummary() {
console.log(`\n${'='.repeat(70)}`);
console.log('GAME DAY SUMMARY');
console.log(`${'='.repeat(70)}`);
const passed = this.results.filter(r => r.status === 'passed').length;
const failed = this.results.filter(r => r.status === 'failed').length;
const errors = this.results.filter(r => r.status === 'error').length;
console.log(`Total Scenarios: ${this.results.length}`);
console.log(`Passed: ${passed}`);
console.log(`Failed: ${failed}`);
console.log(`Errors: ${errors}`);
console.log('\nDetails:');
for (const result of this.results) {
const icon = result.status === 'passed' ? '✅' : result.status === 'failed' ? '❌' : '⚠️';
console.log(` ${icon} ${result.scenario}: ${result.status}`);
if (result.validation?.findings) {
for (const finding of result.validation.findings) {
console.log(` - ${finding}`);
}
}
}
console.log(`${'='.repeat(70)}\n`);
}
sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
// Example Game Day Scenario
const scenarios = [
{
name: 'Database Failover',
preCheck: async () => ({ ready: true }),
execute: async () => {
// Simulate database failure
await exec('kubectl delete pod postgres-primary-0 -n database');
// Wait for failover
await sleep(30000);
// Check if replica was promoted
const status = await exec('kubectl get pods -n database -l role=primary');
return { newPrimary: status.includes('Running') };
},
validate: async (result) => ({
passed: result.newPrimary,
findings: result.newPrimary
? ['Failover completed successfully']
: ['Failover did not complete']
}),
cleanup: async () => {
// Restore original setup if needed
},
recoveryTime: 60000
},
{
name: 'Cache Failure',
preCheck: async () => ({ ready: true }),
execute: async () => {
// Stop Redis
await exec('kubectl scale deployment redis --replicas=0 -n cache');
// Make requests during outage
const responses = await Promise.all(
Array(100).fill().map(() =>
fetch('http://order-service/api/products').catch(e => ({ error: true }))
)
);
// Restore Redis
await exec('kubectl scale deployment redis --replicas=3 -n cache');
const errors = responses.filter(r => r.error || r.status >= 500).length;
return { errorRate: errors / 100 };
},
validate: async (result) => ({
passed: result.errorRate < 0.1, // Less than 10% error rate
findings: [`Error rate during cache failure: ${(result.errorRate * 100).toFixed(1)}%`]
}),
recoveryTime: 45000
}
];
const gameDay = new GameDay('Q4 Resilience Testing', scenarios);
gameDay.run();
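The scenarios above rely on two small helpers, exec and sleep, that are not defined in the snippet. A minimal sketch, assuming scenario steps shell out via child_process, could be:
// Helpers assumed by the scenarios above (illustrative implementations)
const util = require('util');
const childProcess = require('child_process');

const execAsync = util.promisify(childProcess.exec);

// Run a shell command and return its stdout so scenarios can inspect it.
async function exec(command) {
  const { stdout } = await execAsync(command);
  return stdout;
}

// Simple promise-based delay used between scenario steps.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}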
Interview Questions
Q1: What is chaos engineering and why is it important?
Answer: Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
Why it is important:
- Distributed systems have unpredictable failure modes
- Traditional testing doesn’t cover all scenarios
- Builds confidence before production incidents
- Reveals weaknesses proactively
Core principles:
- Define “steady state” (normal behavior)
- Hypothesize that steady state continues
- Introduce real-world failures
- Try to disprove the hypothesis
- Run in production (with safety)
Q2: What is Netflix's Chaos Monkey?
Answer: Chaos Monkey randomly terminates production instances to ensure services can survive instance failures.
Part of the Simian Army:
- Chaos Monkey: Kills instances
- Latency Monkey: Adds artificial delays
- Conformity Monkey: Checks for best practices
- Chaos Gorilla: Kills entire availability zones
- Chaos Kong: Kills entire regions
Design implications:
- Everything must handle instance failure
- Stateless services
- Redundancy at every level
- Automated recovery
Q3: How do you safely run chaos experiments in production?
Answer: Safety measures:
- Start small
  - Begin in staging
  - Small blast radius
  - Short duration
- Abort conditions
  - Define thresholds (error rate, latency)
  - Automatic rollback
  - Kill switch ready
- Observability
  - Real-time monitoring
  - Dashboards visible
  - Alerts configured
- Team preparedness
  - Incident response ready
  - Runbooks available
  - All stakeholders aware
- Gradual expansion
  - Increase scope over time
  - Learn from each experiment
  - Build confidence incrementally
Q4: What failures would you test in a microservices system?
Answer:
Service failures:
- Service unavailable (crash, OOM)
- Slow responses (latency)
- Error responses (5xx)
Network failures:
- Packet loss
- Network partition
- DNS failure
Infrastructure failures:
- Instance termination
- Zone failure
- Disk full
Dependency failures:
- Database failure
- Cache unavailable
- Message queue failure
Resource failures:
- CPU saturation
- Memory exhaustion
- Connection pool exhaustion
- Thread pool exhaustion
Q5: What is a Game Day?
Answer: A Game Day is a scheduled event where teams intentionally inject failures to test system resilience.
Components:
- Planning: Define scenarios, success criteria
- Communication: Notify stakeholders
- Execution: Run scenarios with observers
- Observation: Monitor and document
- Retrospective: Analyze and improve
Benefits:
- Team practices incident response
- Reveals documentation gaps
- Tests monitoring and alerting
- Builds muscle memory for real incidents
Common scenarios:
- Database failover
- Region evacuation
- DDoS simulation
- Major dependency outage
Chapter Summary
Key Takeaways:
- Chaos engineering proactively finds weaknesses before production incidents
- Follow the scientific method: hypothesis → experiment → analyze
- Start small in staging, gradually expand to production
- Always have abort conditions and rollback plans
- Game days help teams practice incident response
- Design systems assuming everything will fail