# Reliability & Fault Tolerance

> Building resilient systems that survive failures

**Senior Level**: Fault tolerance is what separates good systems from great ones. Interviewers expect senior engineers to design for failure from day one.

## Design for Failure Mindset

[Figure: Failure Scenarios]

## Redundancy Patterns

### Active-Passive (Standby)

[Figure: Active-Passive Pattern]

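As a rough illustration of active-passive failover, the sketch below promotes the standby when the active node stops sending heartbeats. The names `Node`, `ActivePassivePair`, and `heartbeat_timeout` are hypothetical and not part of the original page. Real deployments usually delegate this decision to a load balancer or cluster manager and add fencing to avoid split-brain.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    last_heartbeat: float = field(default_factory=time.monotonic)


class ActivePassivePair:
    """Routes to the active node and promotes the standby when heartbeats stop."""

    def __init__(self, active: Node, standby: Node, heartbeat_timeout: float = 3.0):
        self.active = active
        self.standby = standby
        self.heartbeat_timeout = heartbeat_timeout

    def heartbeat(self, node_name: str, now: Optional[float] = None) -> None:
        """Record a heartbeat from the named node."""
        now = time.monotonic() if now is None else now
        for node in (self.active, self.standby):
            if node.name == node_name:
                node.last_heartbeat = now

    def current(self, now: Optional[float] = None) -> Node:
        """Return the node that should receive traffic, failing over if needed."""
        now = time.monotonic() if now is None else now
        active_stale = now - self.active.last_heartbeat > self.heartbeat_timeout
        standby_alive = now - self.standby.last_heartbeat <= self.heartbeat_timeout
        if active_stale and standby_alive:
            # Swap roles: the old active becomes the standby until it recovers
            self.active, self.standby = self.standby, self.active
        return self.active
```

The trade-off this makes visible: the standby sits idle until failover, and detection time is bounded below by `heartbeat_timeout`.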
### Active-Active

[Figure: Active-Active Pattern]

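A minimal sketch of the active-active idea, assuming a simple in-process pool: every node serves traffic, and unhealthy nodes are skipped during round-robin selection. `ActiveActivePool`, `mark`, and `pick` are illustrative names, not an API from the original page.

```python
from itertools import count
from typing import Dict, List


class ActiveActivePool:
    """Round-robin over all nodes that are currently marked healthy."""

    def __init__(self, nodes: List[str]):
        self.nodes = list(nodes)
        self.healthy: Dict[str, bool] = {node: True for node in nodes}
        self._counter = count()

    def mark(self, node: str, healthy: bool) -> None:
        """Update a node's health, e.g. from a health-check loop."""
        self.healthy[node] = healthy

    def pick(self) -> str:
        """Return the next healthy node, or fail fast if none remain."""
        candidates = [node for node in self.nodes if self.healthy[node]]
        if not candidates:
            raise RuntimeError("no healthy nodes available")
        return candidates[next(self._counter) % len(candidates)]
```

Unlike active-passive, losing a node here reduces capacity rather than triggering a failover, so each remaining node needs headroom for the redistributed load.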
### Multi-Region Active-Active

[Figure: Multi-Region Deployment]

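One way to picture multi-region routing is a per-user preference list that falls through to the next region when the nearest one is unhealthy. The function name, region names, and the shape of the `preference` map below are assumptions made for illustration. In practice this logic typically lives in DNS or a global load balancer.

```python
from typing import Dict, List


def route_request(
    user_region: str,
    preference: Dict[str, List[str]],
    healthy: Dict[str, bool],
) -> str:
    """Return the first healthy region in the user's preference order."""
    for region in preference[user_region]:
        if healthy.get(region, False):
            return region
    raise RuntimeError(f"no healthy region for users in {user_region}")


# Hypothetical topology: European users prefer eu-west, then us-east, then ap-south
preference = {"eu": ["eu-west", "us-east", "ap-south"]}
healthy = {"eu-west": False, "us-east": True, "ap-south": True}
chosen = route_request("eu", preference, healthy)  # falls through to us-east
```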
## Resilience Patterns

### Circuit Breaker (Deep Dive)

```python
from enum import Enum
from datetime import datetime, timedelta
import threading


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call because the circuit is open."""


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing fast
    HALF_OPEN = "half_open"  # Testing if recovered


class CircuitBreaker:
    """
    Production-grade circuit breaker with:
    - Failure threshold
    - Success threshold for recovery
    - Timeout for open state
    - Thread safety
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        success_threshold: int = 3,
        timeout_seconds: int = 30
    ):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout = timedelta(seconds=timeout_seconds)

        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        # Hold the lock only for the state check: threading.Lock is not
        # reentrant, and _on_success/_on_failure acquire it themselves.
        with self.lock:
            if not self._can_execute():
                raise CircuitOpenError("Circuit is OPEN")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _can_execute(self) -> bool:
        """Must be called with self.lock held."""
        if self.state == CircuitState.CLOSED:
            return True

        if self.state == CircuitState.OPEN:
            # Check if timeout has passed
            if datetime.now() - self.last_failure_time > self.timeout:
                self.state = CircuitState.HALF_OPEN
                self.success_count = 0
                return True
            return False

        # HALF_OPEN: allow limited requests
        return True

    def _on_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    # Service recovered!
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            elif self.state == CircuitState.CLOSED:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = datetime.now()

            if self.state == CircuitState.HALF_OPEN:
                # Failed during recovery test
                self.state = CircuitState.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN


# Usage with fallback
# (user_service and cache are assumed to be defined elsewhere)
circuit = CircuitBreaker(failure_threshold=5, timeout_seconds=30)

def get_user_with_fallback(user_id):
    try:
        return circuit.call(user_service.get_user, user_id)
    except CircuitOpenError:
        # Return cached data or default
        return cache.get(f"user:{user_id}") or {"id": user_id, "name": "Unknown"}
```

### Retry Strategies

```python
import asyncio
import inspect
import logging
import random
from dataclasses import dataclass
from enum import Enum
from functools import wraps
from typing import Any, Callable, Optional, Tuple, Type, TypeVar

import aiohttp  # third-party; used only by the fetch_data example below

T = TypeVar('T')
logger = logging.getLogger(__name__)


class RetryStrategy(Enum):
    EXPONENTIAL = "exponential"
    LINEAR = "linear"
    FIBONACCI = "fibonacci"
    DECORRELATED_JITTER = "decorrelated_jitter"


@dataclass
class RetryConfig:
    max_retries: int = 5  # total number of attempts, including the first
    base_delay: float = 1.0
    max_delay: float = 60.0
    strategy: RetryStrategy = RetryStrategy.EXPONENTIAL
    jitter: bool = True
    retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,)
    non_retryable_exceptions: Tuple[Type[Exception], ...] = ()
    on_retry: Optional[Callable[[Exception, int, float], None]] = None


class RetryExhausted(Exception):
    def __init__(self, last_exception: Exception, attempts: int):
        self.last_exception = last_exception
        self.attempts = attempts
        super().__init__(f"All {attempts} retry attempts failed")


class RetryHandler:
    """Advanced retry handler with multiple strategies"""

    def __init__(self, config: Optional[RetryConfig] = None):
        self.config = config or RetryConfig()
        self._fib_cache = {0: 0, 1: 1}
        self._last_delay = self.config.base_delay

    def _fibonacci(self, n: int) -> int:
        if n not in self._fib_cache:
            self._fib_cache[n] = self._fibonacci(n - 1) + self._fibonacci(n - 2)
        return self._fib_cache[n]

    def _calculate_delay(self, attempt: int) -> float:
        cfg = self.config

        if cfg.strategy == RetryStrategy.EXPONENTIAL:
            delay = cfg.base_delay * (2 ** attempt)
        elif cfg.strategy == RetryStrategy.LINEAR:
            delay = cfg.base_delay * (attempt + 1)
        elif cfg.strategy == RetryStrategy.FIBONACCI:
            delay = cfg.base_delay * self._fibonacci(attempt + 2)
        elif cfg.strategy == RetryStrategy.DECORRELATED_JITTER:
            # AWS recommended: sleep = min(cap, random(base, sleep * 3))
            # Cap before storing so the next delay derives from the capped value.
            delay = min(cfg.max_delay, random.uniform(cfg.base_delay, self._last_delay * 3))
            self._last_delay = delay
        else:
            delay = cfg.base_delay

        delay = min(delay, cfg.max_delay)

        # Add standard jitter (not for decorrelated which has built-in)
        if cfg.jitter and cfg.strategy != RetryStrategy.DECORRELATED_JITTER:
            delay = delay * (0.5 + random.random())

        return delay

    def _should_retry(self, exception: Exception) -> bool:
        # Check non-retryable first
        if isinstance(exception, self.config.non_retryable_exceptions):
            return False
        return isinstance(exception, self.config.retryable_exceptions)

    async def execute(self, func: Callable[[], Any]) -> Any:
        """Execute function with retry logic"""
        last_exception = None
        # Reset decorrelated-jitter state so earlier calls don't inflate delays
        self._last_delay = self.config.base_delay

        for attempt in range(self.config.max_retries):
            try:
                result = func()
                # A lambda wrapping a coroutine function is not itself a
                # coroutine function, so inspect the result instead.
                if inspect.isawaitable(result):
                    result = await result
                return result
            except Exception as e:
                last_exception = e

                if not self._should_retry(e):
                    raise

                if attempt == self.config.max_retries - 1:
                    break

                delay = self._calculate_delay(attempt)

                logger.warning(
                    f"Attempt {attempt + 1}/{self.config.max_retries} failed: {e}. "
                    f"Retrying in {delay:.2f}s"
                )

                if self.config.on_retry:
                    self.config.on_retry(e, attempt + 1, delay)

                await asyncio.sleep(delay)

        raise RetryExhausted(last_exception, self.config.max_retries)

    def __call__(self, func: Callable) -> Callable:
        """Use as decorator"""
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            return await self.execute(lambda: func(*args, **kwargs))

        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            # asyncio.run cannot be used from inside a running event loop
            return asyncio.run(self.execute(lambda: func(*args, **kwargs)))

        return async_wrapper if asyncio.iscoroutinefunction(func) else sync_wrapper


# Retry with circuit breaker integration
class ResilientCaller:
    """Combines retry, circuit breaker, and timeout"""

    def __init__(
        self,
        circuit_breaker: 'CircuitBreaker',
        retry_config: Optional[RetryConfig] = None,
        timeout_seconds: float = 30.0
    ):
        self.circuit = circuit_breaker
        self.retry = RetryHandler(retry_config or RetryConfig(max_retries=3))
        self.timeout = timeout_seconds

    async def call(
        self,
        func: Callable,
        fallback: Optional[Callable] = None,
        *args,
        **kwargs
    ) -> Any:
        """Execute with full resilience pattern"""
        async def wrapped():
            # asyncio.timeout requires Python 3.11+
            async with asyncio.timeout(self.timeout):
                # call_async is assumed to be an async counterpart of
                # CircuitBreaker.call; the breaker shown earlier is sync-only.
                return await self.circuit.call_async(
                    lambda: func(*args, **kwargs)
                )

        try:
            return await self.retry.execute(wrapped)
        # CircuitOpenError is the exception raised by the CircuitBreaker above
        except (RetryExhausted, CircuitOpenError) as e:
            if fallback:
                logger.warning(f"Using fallback due to: {e}")
                if asyncio.iscoroutinefunction(fallback):
                    return await fallback(*args, **kwargs)
                return fallback(*args, **kwargs)
            raise


# Usage examples
@RetryHandler(RetryConfig(
    max_retries=3,
    strategy=RetryStrategy.EXPONENTIAL,
    retryable_exceptions=(ConnectionError, TimeoutError)
))
async def fetch_data(url: str) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            return await resp.json()


# With metrics callback (metrics and payment_client are assumed to exist)
def log_retry(exc: Exception, attempt: int, delay: float):
    metrics.increment("retries", tags={"attempt": attempt})

retry_handler = RetryHandler(RetryConfig(
    max_retries=5,
    strategy=RetryStrategy.DECORRELATED_JITTER,
    on_retry=log_retry
))

async def call_payment_api(amount: float):
    return await retry_handler.execute(
        lambda: payment_client.charge(amount)
    )
```

```javascript
// Retry strategies
const RetryStrategy = {
  EXPONENTIAL: 'exponential',
  LINEAR: 'linear',
  FIBONACCI: 'fibonacci',
  DECORRELATED_JITTER: 'decorrelated_jitter'
};

class RetryExhaustedError extends Error {
  constructor(lastError, attempts) {
    super(`All ${attempts} retry attempts failed`);
    this.name = 'RetryExhaustedError';
    this.lastError = lastError;
    this.attempts = attempts;
  }
}

class RetryHandler {
  constructor(options = {}) {
    this.config = {
      maxRetries: options.maxRetries || 5, // total attempts, including the first
      baseDelay: options.baseDelay || 1000,
      maxDelay: options.maxDelay || 60000,
      strategy: options.strategy || RetryStrategy.EXPONENTIAL,
      jitter: options.jitter !== false,
      retryableErrors: options.retryableErrors || [Error],
      nonRetryableErrors: options.nonRetryableErrors || [],
      onRetry: options.onRetry || null,
      shouldRetry: options.shouldRetry || null
    };

    this.fibCache = { 0: 0, 1: 1 };
    this.lastDelay = this.config.baseDelay;
  }

  fibonacci(n) {
    if (!(n in this.fibCache)) {
      this.fibCache[n] = this.fibonacci(n - 1) + this.fibonacci(n - 2);
    }
    return this.fibCache[n];
  }

  calculateDelay(attempt) {
    const { baseDelay, maxDelay, strategy, jitter } = this.config;
    let delay;

    switch (strategy) {
      case RetryStrategy.EXPONENTIAL:
        delay = baseDelay * Math.pow(2, attempt);
        break;
      case RetryStrategy.LINEAR:
        delay = baseDelay * (attempt + 1);
        break;
      case RetryStrategy.FIBONACCI:
        delay = baseDelay * this.fibonacci(attempt + 2);
        break;
      case RetryStrategy.DECORRELATED_JITTER:
        // min(cap, random(base, lastDelay * 3)); cap before storing
        delay = Math.min(
          maxDelay,
          Math.random() * (this.lastDelay * 3 - baseDelay) + baseDelay
        );
        this.lastDelay = delay;
        break;
      default:
        delay = baseDelay;
    }

    delay = Math.min(delay, maxDelay);

    if (jitter && strategy !== RetryStrategy.DECORRELATED_JITTER) {
      delay = delay * (0.5 + Math.random());
    }

    return delay;
  }

  shouldRetryError(error) {
    // Check non-retryable first
    for (const ErrorClass of this.config.nonRetryableErrors) {
      if (error instanceof ErrorClass) return false;
    }

    // Custom predicate
    if (this.config.shouldRetry) {
      return this.config.shouldRetry(error);
    }

    // Check retryable
    for (const ErrorClass of this.config.retryableErrors) {
      if (error instanceof ErrorClass) return true;
    }

    return false;
  }

  async execute(fn) {
    let lastError;
    // Reset decorrelated-jitter state for each logical call
    this.lastDelay = this.config.baseDelay;

    for (let attempt = 0; attempt < this.config.maxRetries; attempt++) {
      try {
        return await fn();
      } catch (error) {
        lastError = error;

        if (!this.shouldRetryError(error)) {
          throw error;
        }

        if (attempt === this.config.maxRetries - 1) {
          break;
        }

        const delay = this.calculateDelay(attempt);

        console.warn(
          `Attempt ${attempt + 1}/${this.config.maxRetries} failed: ${error.message}. ` +
          `Retrying in ${(delay / 1000).toFixed(2)}s`
        );

        if (this.config.onRetry) {
          await this.config.onRetry(error, attempt + 1, delay);
        }

        await this.sleep(delay);
      }
    }

    throw new RetryExhaustedError(lastError, this.config.maxRetries);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }

  // Use as wrapper
  wrap(fn) {
    return async (...args) => {
      return this.execute(() => fn(...args));
    };
  }
}

// Resilient caller combining patterns
// (CircuitBreaker is assumed to be a class exposing execute(fn) and throwing
// an error named 'CircuitBreakerOpenError' when open; it is not defined here)
class ResilientCaller {
  constructor({ circuitBreaker, retryConfig = {}, timeoutMs = 30000 }) {
    this.circuit = circuitBreaker;
    this.retry = new RetryHandler(retryConfig);
    this.timeoutMs = timeoutMs;
  }

  async call(fn, { fallback = null, args = [] } = {}) {
    const wrapped = async () => {
      return this.withTimeout(
        this.circuit.execute(() => fn(...args)),
        this.timeoutMs
      );
    };

    try {
      return await this.retry.execute(wrapped);
    } catch (error) {
      if (fallback && (error instanceof RetryExhaustedError ||
                       error.name === 'CircuitBreakerOpenError')) {
        console.warn(`Using fallback due to: ${error.message}`);
        return typeof fallback === 'function' ? fallback(...args) : fallback;
      }
      throw error;
    }
  }

  withTimeout(promise, ms) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
    });
    // Clear the timer once either side settles so it does not linger
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
  }
}

// Usage examples
const retry = new RetryHandler({
  maxRetries: 3,
  strategy: RetryStrategy.EXPONENTIAL,
  shouldRetry: (error) => {
    // Retry on network errors or 5xx
    return error.code === 'ECONNREFUSED' ||
           error.code === 'ETIMEDOUT' ||
           (error.response && error.response.status >= 500);
  },
  onRetry: async (error, attempt, delay) => {
    // Send to metrics
    await metrics.increment('api.retry', { attempt, error: error.name });
  }
});

const fetchWithRetry = retry.wrap(async (url) => {
  const response = await fetch(url);
  if (!response.ok) {
    const error = new Error(`HTTP ${response.status}`);
    error.response = response;
    throw error;
  }
  return response.json();
});

// Full resilience pattern
const paymentCaller = new ResilientCaller({
  circuitBreaker: new CircuitBreaker('payments', { failureThreshold: 3 }),
  retryConfig: { maxRetries: 2, strategy: RetryStrategy.DECORRELATED_JITTER },
  timeoutMs: 5000
});

async function chargePayment(amount) {
  return paymentCaller.call(
    (amt) => paymentService.charge(amt),
    {
      args: [amount],
      fallback: async (amt) => {
        // Queue for later processing
        await paymentQueue.add({ amount: amt, status: 'pending' });
        return { status: 'queued', message: 'Payment will be processed shortly' };
      }
    }
  );
}
```

### Bulkhead Pattern

```python
import asyncio
import logging
import time
from contextlib import asynccontextmanager
from dataclasses import dataclass
from functools import wraps
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger(__name__)


class BulkheadFullError(Exception):
    def __init__(self, bulkhead_name: str, current: int, max_size: int):
        self.bulkhead_name = bulkhead_name
        self.current = current
        self.max_size = max_size
        super().__init__(
            f"Bulkhead '{bulkhead_name}' is full ({current}/{max_size})"
        )


@dataclass
class BulkheadMetrics:
    accepted: int = 0
    rejected: int = 0
    active: int = 0
    peak_active: int = 0
    total_wait_time: float = 0.0


class Bulkhead:
    """
    Bulkhead pattern with semaphore and queue.
    Isolates failures to prevent cascade effects.
    """

    def __init__(
        self,
        name: str,
        max_concurrent: int,
        max_queue: int = 0,
        queue_timeout: float = 10.0
    ):
        self.name = name
        self.max_concurrent = max_concurrent
        self.max_queue = max_queue
        self.queue_timeout = queue_timeout

        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.metrics = BulkheadMetrics()
        self._queue_size = 0

    @asynccontextmanager
    async def acquire(self, timeout: Optional[float] = None):
        """Acquire a slot in the bulkhead"""
        timeout = timeout if timeout is not None else self.queue_timeout
        start = time.time()

        # If every slot is busy this caller has to queue; reject when the queue is full
        queued = self.semaphore.locked()
        if queued:
            if self._queue_size >= self.max_queue:
                self.metrics.rejected += 1
                raise BulkheadFullError(
                    self.name, self.metrics.active, self.max_concurrent
                )
            self._queue_size += 1

        try:
            await asyncio.wait_for(self.semaphore.acquire(), timeout=timeout)
        except asyncio.TimeoutError:
            self.metrics.rejected += 1
            raise BulkheadFullError(
                self.name, self.metrics.active, self.max_concurrent
            )
        finally:
            # Decrement exactly once, and only if this caller was counted as queued
            if queued:
                self._queue_size -= 1

        wait_time = time.time() - start
        self.metrics.total_wait_time += wait_time
        self.metrics.active += 1
        self.metrics.accepted += 1
        self.metrics.peak_active = max(self.metrics.peak_active, self.metrics.active)

        try:
            yield
        finally:
            self.metrics.active -= 1
            self.semaphore.release()

    def __call__(self, func: Callable):
        """Use as decorator"""
        @wraps(func)
        async def wrapper(*args, **kwargs):
            async with self.acquire():
                return await func(*args, **kwargs)
        return wrapper

    def get_metrics(self) -> Dict[str, Any]:
        total = self.metrics.accepted + self.metrics.rejected
        return {
            "name": self.name,
            "active": self.metrics.active,
            "max_concurrent": self.max_concurrent,
            "queue_size": self._queue_size,
            "max_queue": self.max_queue,
            "accepted": self.metrics.accepted,
            "rejected": self.metrics.rejected,
            "rejection_rate": self.metrics.rejected / max(1, total),
            "peak_active": self.metrics.peak_active,
            "avg_wait_ms": (self.metrics.total_wait_time / max(1, self.metrics.accepted)) * 1000
        }


class ThreadPoolBulkhead:
    """
    Thread pool based bulkhead for CPU-bound or blocking operations.
    Uses a dedicated thread pool to isolate work.
    """

    def __init__(self, name: str, max_workers: int):
        from concurrent.futures import ThreadPoolExecutor

        self.name = name
        self.executor = ThreadPoolExecutor(
            max_workers=max_workers,
            thread_name_prefix=f"bulkhead-{name}"
        )
        self.max_workers = max_workers

    async def run(self, func: Callable, *args, **kwargs) -> Any:
        """Run blocking function in isolated thread pool"""
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            self.executor,
            lambda: func(*args, **kwargs)
        )

    def shutdown(self):
        self.executor.shutdown(wait=True)


class BulkheadManager:
    """Manage multiple bulkheads for different services"""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.bulkheads = {}
        return cls._instance

    def create(
        self,
        name: str,
        max_concurrent: int,
        max_queue: int = 0
    ) -> Bulkhead:
        bulkhead = Bulkhead(name, max_concurrent, max_queue)
        self.bulkheads[name] = bulkhead
        return bulkhead

    def get(self, name: str) -> Optional[Bulkhead]:
        return self.bulkheads.get(name)

    def get_all_metrics(self) -> Dict[str, Dict]:
        return {name: bh.get_metrics() for name, bh in self.bulkheads.items()}


# Usage
bulkhead_manager = BulkheadManager()

# Different pools for different services
payment_bulkhead = bulkhead_manager.create("payment", max_concurrent=10, max_queue=20)
inventory_bulkhead = bulkhead_manager.create("inventory", max_concurrent=30)
notification_bulkhead = bulkhead_manager.create("notification", max_concurrent=50)

# payment_client and inventory_client are assumed to be defined elsewhere
@payment_bulkhead
async def charge_payment(order_id: str, amount: float):
    """Payment calls isolated in their own pool"""
    return await payment_client.charge(order_id, amount)

@inventory_bulkhead
async def reserve_inventory(items: list):
    """Inventory calls isolated - won't affect payments"""
    return await inventory_client.reserve(items)

async def process_order(order):
    """Even if inventory is slow, payments still work"""
    try:
        # These run in isolated pools
        payment = await charge_payment(order.id, order.total)
        inventory = await reserve_inventory(order.items)
        return {"status": "success", "payment": payment}
    except BulkheadFullError as e:
        # One service full doesn't crash everything
        logger.warning(f"Bulkhead full: {e}")
        return {"status": "retry_later", "reason": str(e)}


# FastAPI integration
from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def bulkhead_metrics_middleware(request: Request, call_next):
    response = await call_next(request)

    # Add bulkhead metrics to response headers
    metrics = bulkhead_manager.get_all_metrics()
    response.headers["X-Bulkhead-Active"] = str(
        sum(m["active"] for m in metrics.values())
    )

    return response

@app.get("/metrics/bulkheads")
async def get_bulkhead_metrics():
    return bulkhead_manager.get_all_metrics()
```

```javascript
class BulkheadFullError extends Error {
  constructor(bulkheadName, current, maxSize) {
    super(`Bulkhead '${bulkheadName}' is full (${current}/${maxSize})`);
    this.name = 'BulkheadFullError';
    this.bulkheadName = bulkheadName;
    this.current = current;
    this.maxSize = maxSize;
  }
}

class Bulkhead {
  constructor(name, { maxConcurrent = 10, maxQueue = 0, queueTimeout = 10000 } = {}) {
    this.name = name;
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.queueTimeout = queueTimeout;

    this.active = 0;
    this.queue = [];
    this.metrics = {
      accepted: 0,
      rejected: 0,
      peakActive: 0,
      totalWaitTime: 0
    };
  }

  async execute(fn) {
    const start = Date.now();

    // Check if at capacity
    if (this.active >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        this.metrics.rejected++;
        throw new BulkheadFullError(this.name, this.active, this.maxConcurrent);
      }

      // Wait in queue
      await this.waitInQueue();
    }

    const waitTime = Date.now() - start;
    this.metrics.totalWaitTime += waitTime;
    this.active++;
    this.metrics.accepted++;
    this.metrics.peakActive = Math.max(this.metrics.peakActive, this.active);

    try {
      return await fn();
    } finally {
      this.active--;
      this.releaseNext();
    }
  }

  waitInQueue() {
    return new Promise((resolve, reject) => {
      const entry = {};

      const timeout = setTimeout(() => {
        // Remove this entry so releaseNext() never hands a slot to a timed-out waiter
        const index = this.queue.indexOf(entry);
        if (index !== -1) {
          this.queue.splice(index, 1);
        }
        this.metrics.rejected++;
        reject(new BulkheadFullError(this.name, this.active, this.maxConcurrent));
      }, this.queueTimeout);

      entry.resolve = () => {
        clearTimeout(timeout);
        resolve();
      };
      this.queue.push(entry);
    });
  }

  releaseNext() {
    if (this.queue.length > 0 && this.active < this.maxConcurrent) {
      const next = this.queue.shift();
      next.resolve();
    }
  }

  wrap(fn) {
    return async (...args) => {
      return this.execute(() => fn(...args));
    };
  }

  getMetrics() {
    return {
      name: this.name,
      active: this.active,
      maxConcurrent: this.maxConcurrent,
      queueSize: this.queue.length,
      maxQueue: this.maxQueue,
      ...this.metrics,
      rejectionRate: this.metrics.rejected /
        Math.max(1, this.metrics.accepted + this.metrics.rejected),
      avgWaitMs: this.metrics.totalWaitTime / Math.max(1, this.metrics.accepted)
    };
  }
}

// Worker pool bulkhead for CPU-bound tasks
class WorkerPoolBulkhead {
  constructor(name, maxWorkers) {
    this.name = name;
    this.maxWorkers = maxWorkers;
    this.workers = [];
    this.taskQueue = [];

    // Initialize worker pool (using worker_threads in Node.js)
    this.initWorkers();
  }

  initWorkers() {
    // In a real implementation, use worker_threads
    // This is a simplified version using a semaphore pattern
    this.activeWorkers = 0;
  }

  async execute(task) {
    // Queue task if all workers busy
    if (this.activeWorkers >= this.maxWorkers) {
      return new Promise((resolve, reject) => {
        this.taskQueue.push({ task, resolve, reject });
      });
    }

    return this.runTask(task);
  }

  async runTask(task) {
    this.activeWorkers++;
    try {
      return await task();
    } finally {
      this.activeWorkers--;
      this.processQueue();
    }
  }

  processQueue() {
    if (this.taskQueue.length > 0 && this.activeWorkers < this.maxWorkers) {
      const { task, resolve, reject } = this.taskQueue.shift();
      this.runTask(task).then(resolve).catch(reject);
    }
  }
}

// Bulkhead manager singleton
class BulkheadManager {
  constructor() {
    if (BulkheadManager.instance) {
      return BulkheadManager.instance;
    }
    this.bulkheads = new Map();
    BulkheadManager.instance = this;
  }

  create(name, options = {}) {
    const bulkhead = new Bulkhead(name, options);
    this.bulkheads.set(name, bulkhead);
    return bulkhead;
  }

  get(name) {
    return this.bulkheads.get(name);
  }

  getAllMetrics() {
    const metrics = {};
    for (const [name, bulkhead] of this.bulkheads) {
      metrics[name] = bulkhead.getMetrics();
    }
    return metrics;
  }
}

// Usage
const manager = new BulkheadManager();

const paymentBulkhead = manager.create('payment', { maxConcurrent: 10, maxQueue: 20 });
const inventoryBulkhead = manager.create('inventory', { maxConcurrent: 30 });
const notificationBulkhead = manager.create('notification', { maxConcurrent: 50 });

// Wrap service calls (paymentClient and inventoryClient are assumed to exist)
const chargePayment = paymentBulkhead.wrap(async (orderId, amount) => {
  return paymentClient.charge(orderId, amount);
});

const reserveInventory = inventoryBulkhead.wrap(async (items) => {
  return inventoryClient.reserve(items);
});

async function processOrder(order) {
  try {
    const [payment, inventory] = await Promise.all([
      chargePayment(order.id, order.total),
      reserveInventory(order.items)
    ]);

    return { status: 'success', payment, inventory };
  } catch (error) {
    if (error instanceof BulkheadFullError) {
      console.warn(`Bulkhead full: ${error.message}`);
      return { status: 'retry_later', reason: error.message };
    }
    throw error;
  }
}

// Express integration
const express = require('express');
const app = express();

app.use((req, res, next) => {
  const metrics = manager.getAllMetrics();
  res.set('X-Bulkhead-Active',
    Object.values(metrics).reduce((sum, m) => sum + m.active, 0)
  );
  next();
});

app.get('/metrics/bulkheads', (req, res) => {
  res.json(manager.getAllMetrics());
});

app.get('/orders/:id', async (req, res) => {
  try {
    const result = await processOrder({ id: req.params.id, items: [] });
    res.json(result);
  } catch (error) {
    if (error instanceof BulkheadFullError) {
      res.status(503).json({
        error: 'Service temporarily unavailable',
        retryAfter: 5
      });
    } else {
      res.status(500).json({ error: error.message });
    }
  }
});
```

## Health Checks

### Health Check Types

```
Health Check Types
─────────────────────────────────────────────────────────────

LIVENESS CHECK
──────────────
"Is the process running?"
If fails: Restart the container/process

GET /health/live → 200 OK

Check:
• Process responding
• Not deadlocked

─────────────────────────────────────────────────────────────

READINESS CHECK
───────────────
"Can it handle traffic?"
If fails: Remove from load balancer

GET /health/ready → 200 OK or 503 Not Ready

Check:
• Database connection works
• Cache connection works
• Required dependencies reachable
• Warmup complete

─────────────────────────────────────────────────────────────

DEEP HEALTH CHECK (Use sparingly!)
──────────────────────────────────
"Is everything working?"
Used for: Monitoring dashboards, not load balancers

GET /health/deep → { db: ok, cache: ok, queue: ok }

Warning: Can be expensive, rate limit!
```

```python
from datetime import datetime, timezone

from fastapi import FastAPI, Response

app = FastAPI()

# db, redis, and app.state.warmup_complete are assumed to be set up elsewhere

@app.get("/health/live")
async def liveness():
    """Just checks if the process is alive"""
    return {"status": "alive", "timestamp": datetime.now(timezone.utc).isoformat()}

@app.get("/health/ready")
async def readiness(response: Response):
    """Checks if we can handle traffic"""
    checks = {}
    # Track readiness locally instead of reading response.status_code back:
    # the injected Response has no status code until one is assigned.
    ready = True

    # Check database
    try:
        await db.execute("SELECT 1")
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {str(e)}"
        ready = False

    # Check Redis
    try:
        await redis.ping()
        checks["redis"] = "ok"
    except Exception as e:
        checks["redis"] = f"error: {str(e)}"
        ready = False

    # Check if warmup complete
    if not app.state.warmup_complete:
        checks["warmup"] = "in progress"
        ready = False
    else:
        checks["warmup"] = "complete"

    if not ready:
        response.status_code = 503

    return {
        "status": "ready" if ready else "not ready",
        "checks": checks
    }
```

## Timeouts

### Timeout Hierarchy

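The idea behind a timeout hierarchy is that each layer's timeout should be shorter than its caller's, so an inner call gives up before the outer one does. A minimal sketch of that rule follows. The layer names, the 10-second outer timeout, and the `margin` factor are illustrative assumptions, not values from the original page.

```python
from typing import Dict, List


def derive_timeouts(layers: List[str], outer_timeout: float, margin: float = 0.8) -> Dict[str, float]:
    """Assign each inner layer a fixed fraction of its caller's timeout."""
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    timeouts: Dict[str, float] = {}
    budget = outer_timeout
    for layer in layers:  # ordered from outermost to innermost
        timeouts[layer] = round(budget, 3)
        budget *= margin
    return timeouts


def check_hierarchy(timeouts_outer_to_inner: List[float]) -> bool:
    """True when every timeout is strictly shorter than the one above it."""
    pairs = zip(timeouts_outer_to_inner, timeouts_outer_to_inner[1:])
    return all(inner < outer for outer, inner in pairs)


# Hypothetical call chain: client -> gateway -> service -> database
budgets = derive_timeouts(["client", "gateway", "service", "database"], 10.0, margin=0.5)
```

A check like `check_hierarchy` can run at startup or in a config test to catch an inverted hierarchy, where an inner call can outlive the caller that is waiting on it.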
### Deadline Propagation

```python
import asyncio
import time
from contextvars import ContextVar
from typing import Optional


class DeadlineExceeded(Exception):
    """Raised when the request deadline has passed or a downstream call times out."""


# Context variable holding the absolute deadline (epoch seconds) for the current request
deadline_ctx: ContextVar[Optional[float]] = ContextVar('deadline', default=None)

def with_deadline(timeout_seconds: float):
    """Set deadline for current request"""
    deadline = time.time() + timeout_seconds
    deadline_ctx.set(deadline)
    return deadline

def remaining_time() -> float:
    """Get remaining time until deadline"""
    deadline = deadline_ctx.get()
    if deadline is None:
        return float('inf')
    return max(0, deadline - time.time())

async def call_service(service_name: str, payload: dict):
    """Call service with propagated deadline"""
    remaining = remaining_time()

    if remaining <= 0:
        raise DeadlineExceeded("Request deadline already passed")

    # Use remaining time as timeout (with buffer)
    timeout = min(remaining * 0.9, 30.0)  # 90% of remaining, max 30s

    try:
        # asyncio.timeout requires Python 3.11+;
        # http_client is assumed to be an async HTTP client defined elsewhere
        async with asyncio.timeout(timeout):
            return await http_client.post(
                f"http://{service_name}/api",
                json=payload,
                headers={"X-Deadline": str(deadline_ctx.get())}
            )
    except asyncio.TimeoutError:
        raise DeadlineExceeded(f"Timeout calling {service_name}")
```

## Graceful Degradation

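Graceful degradation can be sketched as an ordered list of providers, from the richest response down to a static default, where each failure falls through to the next tier. The `degrade` helper and the tier names below are hypothetical and only illustrate the shape of the pattern.

```python
import logging
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def degrade(tiers: List[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Try each tier in order; return the tier name together with its result."""
    last_error: Optional[Exception] = None
    for name, provider in tiers:
        try:
            return name, provider()
        except Exception as exc:
            # Log and fall through to the next, more degraded tier
            logger.warning("tier %s failed: %s", name, exc)
            last_error = exc
    raise RuntimeError("all tiers failed") from last_error


def personalized_recommendations() -> List[str]:
    # Stand-in for a call to a dependency that is currently down
    raise ConnectionError("recommendation service unavailable")


tier, items = degrade([
    ("personalized", personalized_recommendations),
    ("cached", lambda: ["item-1", "item-2"]),
    ("static", lambda: []),
])
```

Returning the tier name alongside the result lets callers emit a metric or annotate the response, so degraded traffic stays visible rather than silently looking healthy.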
## Senior Interview Questions

**Key components**:

1. **Redundancy**: At least 2 of everything (servers, DBs, regions)
2. **Load balancing**: Automatic failover when a node fails
3. **Health checks**: Detect failures in seconds
4. **Auto-scaling**: Handle traffic spikes
5. **Multi-region**: Survive region outages
6. **Chaos engineering**: Regularly test failure scenarios

**Math**: 99.99% availability = about 52 minutes of downtime per year

* A single component at 99.9% can't achieve 99.99%
* Need redundancy: 2 redundant components at 99.9% each = 99.9999% (if failures are independent), because both must be down at once: 1 - 0.001 × 0.001

**Saga Pattern**:

1. Each step has a compensating action
2. If step N fails, run compensations for steps N-1 to 1
3. Track saga state in database

**Example**:

```
1. Create order      → Compensate: Cancel order
2. Reserve inventory → Compensate: Release inventory
3. Charge payment    → Compensate: Refund payment
4. Ship order        → Compensate: Cancel shipment

If step 3 fails:
  - Refund payment (if partially charged)
  - Release inventory
  - Cancel order
```

**Defense layers**:

1. **Circuit breakers**: Stop calling failing service
2. **Timeouts**: Don't wait forever
3. **Bulkheads**: Isolate failures to one service
4. **Rate limiting**: Prevent overload
5. **Load shedding**: Reject low-priority requests
6. **Fallbacks**: Degrade gracefully

**Key insight**: "Fast failure is better than slow failure. If a service is struggling, fail fast and use fallback."

**Chaos Engineering approach**:

1. **Define steady state**: Normal metrics (latency, error rate)
2. **Form hypothesis**: "System handles server failure"
3. **Inject failure**: Kill a server
4. **Observe**: Did metrics stay within bounds?
5. **Fix and repeat**

**Types of failures to test**:

* Server crashes
* Network partitions
* High latency
* Disk full
* Memory exhaustion
* Clock skew
* Dependency outages

## Interview Questions

**Strong answer:**

* A circuit breaker wraps calls to an external dependency and monitors failures. The core idea is borrowed from electrical engineering -- when current overloads, the breaker trips to prevent fires. In software, it prevents your system from wasting resources hammering a service that is already down, which can turn a partial outage into a full cascading failure.
* The three states are **Closed** (normal operation -- requests flow through and failures are counted), **Open** (the breaker has tripped after the failure threshold is exceeded -- all requests fail immediately without calling the downstream service), and **Half-Open** (after a timeout period, the breaker allows a limited number of test requests through to see if the service has recovered).
* The key design decisions are: what counts as a failure (timeouts? 5xx errors? specific exceptions?), what the failure threshold should be (e.g., 5 failures in 30 seconds), and what the recovery timeout is. In production, you want these configurable per-dependency because a payment service and a notification service have very different tolerance profiles. A payment service should trip fast, for example at 3 failures out of 5 calls; a non-critical analytics service might tolerate 10 out of 20.
* The pattern becomes critical when combined with fallbacks. When the circuit is open, you serve cached data, a default response, or queue the request for later. Netflix's Hystrix popularized this -- if the recommendation engine is down, they show a generic "top 10" list instead of personalized suggestions. The user experience degrades but doesn't break.
* **Example:** An e-commerce checkout service calls a payment gateway. Without a circuit breaker, if the gateway goes down, every checkout request hangs for 30 seconds waiting for a timeout, thread pools fill up, and the entire site becomes unresponsive. With a circuit breaker set to trip at 5 failures, after 5 timeouts the breaker opens, checkout immediately returns "payment temporarily unavailable, try again in a minute," and the rest of the site (browsing, search, cart) stays healthy.

**Red flag answer:** "It's like a switch that turns off when there are too many errors and turns back on after a while." This shows no understanding of the half-open state, failure counting mechanics, fallback strategies, or why the pattern matters in distributed systems.

**Follow-ups:**

1. What happens if you set the failure threshold too low versus too high? How would you tune it for a payment service versus a recommendation service?
2. How would you implement circuit breaker state sharing across multiple instances of the same service behind a load balancer -- should the state be local or distributed?

**Strong answer:**

* Fixed-interval retries (e.g., retry every 2 seconds) are simple but dangerous at scale because they create the **thundering herd problem**. If a service goes down and 10,000 clients all start retrying at the same 2-second interval, they synchronize into periodic traffic spikes that can keep the recovering service overloaded. It is one of the worst things you can do during an outage.
* Exponential backoff (1s, 2s, 4s, 8s, 16s...) spreads retries out over time so the downstream service gets breathing room to recover. But pure exponential backoff without jitter still has a problem -- clients that started at the same time will still be synchronized at each backoff interval.
* Adding **jitter** (randomness) to the backoff breaks this synchronization. There are two common approaches: **full jitter** (`random(0, base * 2^attempt)`) and **decorrelated jitter** (`random(base, previous_delay * 3)`). AWS's published analysis of backoff strategies found that both jittered variants cut contention sharply compared with unjittered backoff. The key insight is that jitter is not optional for production retry logic -- it is mandatory.
* Fixed-interval retries only make sense in tightly controlled environments where you know the number of callers is small and bounded -- for example, a single cron job retrying a database migration, or an internal tool with one user. Anything customer-facing or multi-tenant must use backoff with jitter.
* **Example:** Stripe's API docs recommend exponential backoff with jitter for their webhooks. If Stripe sends 50,000 webhooks and the receiving server is momentarily down, pure fixed retries would DDoS the server on recovery. With decorrelated jitter, retries spread over a wide time window, giving the server a smooth ramp-up.

**Red flag answer:** "I'd just retry 3 times with a 1-second delay." This ignores thundering herds, jitter, max delay caps, and the distinction between retryable vs non-retryable errors (retrying a 400 Bad Request is pointless).

**Follow-ups:**

1. Should you retry a payment charge that timed out? What is the risk and how would you make the operation safe to retry?
2. How would you implement a retry budget -- limiting total retries across all callers to prevent retry storms from amplifying an outage?

**Strong answer:**

* The bulkhead pattern borrows its name from ship design -- ships have watertight compartments (bulkheads) so that if one section floods, the ship doesn't sink. In software, the idea is identical: you isolate resources for different dependencies so that one failing dependency can't consume all available resources and take down everything else.
* The most common implementation is giving each external dependency its own **connection pool or thread pool** with a fixed size and a bounded queue. For instance, your payment service gets a pool of 10 concurrent connections, your inventory service gets 30, and your notification service gets 50. If the payment gateway becomes slow and all 10 connections are saturated, only payment-related requests queue up and eventually fail -- the inventory and notification pools continue operating normally.
* Without bulkheads, all dependencies share the same thread pool. When one dependency becomes slow, its requests hold threads open, gradually consuming the entire pool until no threads are available for any requests -- even completely healthy code paths. This is how a slow notification email service can take down your checkout flow.
* The key tuning parameters are `max_concurrent` (how many simultaneous requests to allow), `max_queue` (how many waiting requests to buffer), and `queue_timeout` (how long to wait before rejecting). Setting these requires understanding each dependency's latency profile and criticality. If every payment call runs to a 5-second timeout, 10 concurrent slots sustain only 10 / 5 = 2 payments per second -- is that enough?
* **Example:** At a large e-commerce platform, a third-party shipping rate calculator became unresponsive during Black Friday. Without bulkheads, the slow shipping API calls consumed all 200 threads in the shared pool within minutes. With bulkheads, the shipping pool (20 threads) filled up, shipping rate requests got "service unavailable" errors, but checkout, search, and browsing (using the other 180 threads) continued without interruption. Revenue impact was reduced from "site down" to "shipping estimates temporarily unavailable."

**Red flag answer:** "You just limit the number of threads" -- this misses the core point about failure isolation between different dependencies, and doesn't address queue management, timeout interaction, or how to size the pools.

**Follow-ups:**

1. How would you decide the `max_concurrent` and `max_queue` values for each bulkhead in production? What metrics would you monitor to know if your values are correct?
2. How does the bulkhead pattern interact with circuit breakers -- should they be layered together, and if so, in what order?

**Strong answer:**

* **Liveness** answers "is this process alive and not deadlocked?" It should be trivially cheap -- essentially `return 200 OK`.
If a liveness check fails, the orchestrator (Kubernetes, ECS) **kills and restarts** the container. The only thing you should check here is whether the process can respond at all. If you put a database check in your liveness probe and the database goes down, Kubernetes will restart all your healthy application pods, turning a database outage into a complete application outage. * **Readiness** answers "can this instance handle traffic right now?" If a readiness check fails, the load balancer **removes the instance from the rotation** but does not kill it. This is where you check dependency connections (database, cache, message queue), whether warmup/cache priming is complete, and whether the instance has finished initialization. An instance that fails readiness stays alive and keeps checking -- once its dependencies recover, it passes readiness again and gets added back to the pool. * The critical mistake is putting dependency checks in liveness probes. In Kubernetes specifically, a failed liveness probe triggers a container restart. If your database has a brief hiccup and your liveness probe checks the DB, every pod restarts simultaneously, causing a full outage on top of the DB issue. This is called a **death spiral** -- restarting pods increases load on the recovering database, which causes more liveness failures, which causes more restarts. * There is also a **startup probe** concept (Kubernetes added this in 1.16) for slow-starting applications. It disables liveness checking until the app finishes startup, preventing premature kills during initialization (e.g., a Java application loading a large ML model). * **Example:** A team put a Redis connectivity check in their liveness probe. During a routine Redis failover (primary to replica, takes 3-5 seconds), all 40 pods simultaneously failed liveness, Kubernetes restarted them all, the new pods all tried to connect to Redis at once, overwhelmed the new Redis primary, and the entire service was down for 12 minutes. 
The fix was moving Redis checks to readiness only and making the liveness probe a simple `return 200`. **Red flag answer:** "Liveness checks if the app is healthy and readiness checks if it's ready to serve traffic -- they're basically the same thing." This completely misses the operational consequence: liveness triggers restarts, readiness triggers traffic removal. Confusing them causes cascading outages. **Follow-ups:** 1. How would you design a deep health check endpoint for monitoring dashboards that doesn't accidentally become a DoS vector against your own dependencies? 2. What startup probe configuration would you use for a service that takes 60 seconds to warm up its in-memory cache before it can serve accurate responses? **Strong answer:** * Deadline propagation means passing a request's absolute expiration time (not a relative timeout) through every service in the call chain. If the client sets a 5-second deadline and Service A takes 2 seconds, Service A passes the remaining 3 seconds (or the original absolute deadline timestamp) to Service B. Without this, every service in the chain uses its own independent timeout, and the total end-to-end time can exceed anything reasonable. * The problem deadline propagation solves is **wasted work**. Consider a chain: Client -> API Gateway (10s timeout) -> Service A (30s timeout) -> Service B (30s timeout) -> Database. The client gives up after 10 seconds. Without deadline propagation, Service A is still waiting on Service B, which is still waiting on the database -- all doing work that nobody will use because the client is already gone. Multiply this across thousands of concurrent requests and you have a significant resource waste that can worsen an overload. * The standard implementation uses a context variable (Go's `context.Context` does this natively, gRPC has built-in deadline propagation). At each service boundary, you check remaining time before making a downstream call. 
If remaining time is less than or equal to zero, you short-circuit immediately. When making the call, you set the downstream timeout to `min(remaining_time * 0.9, service_default_timeout)` -- the 0.9 factor leaves a buffer for network latency and response processing.
* A subtlety: track the deadline internally as an **absolute timestamp**, not a relative duration. If Service A receives "timeout: 3 seconds" and spends 500ms doing local work before calling Service B, it has to remember to pass "timeout: 2.5 seconds" to B, and any hop that forgets the subtraction silently extends the budget. With an absolute deadline like "expires at 14:30:05.000Z", the remaining time is always just `deadline - now`. The trade-off is that wall-clock deadlines are only as accurate as the clock synchronization between hosts, which is why gRPC transmits the remaining time as a relative `grpc-timeout` value on the wire and each receiver converts it back into a local deadline.
* **Example:** Google's internal systems (and gRPC, when the incoming call context is passed on to outgoing calls) propagate deadlines through their entire call stack. If a user search request has a 200ms deadline, every service in the chain (query parsing, index lookup, ranking, ad serving) receives the same deadline. If the index lookup takes 180ms, the ranking service knows it only has 20ms left and can return a faster but less optimal ranking rather than doing its full 150ms computation that would be wasted anyway.

**Red flag answer:** "Just set a timeout on every HTTP call" -- this ignores the cascading timeout problem, wasted work in downstream services, and the absolute-vs-relative timestamp distinction.

**Follow-ups:**

1. How do you handle clock skew between services when propagating absolute deadlines? What if Service A's clock is 2 seconds ahead of Service B's?
2. Should you always respect the propagated deadline, or are there cases where a downstream service should ignore it and complete its work anyway (e.g., a write that must not be half-completed)?

**Strong answer:**

* First, the math: 99.9% allows 8.76 hours of downtime per year (43.8 minutes/month). 99.99% allows only 52.6 minutes per year (4.38 minutes/month). That is a 10x reduction. Every single deployment, config change, and dependency failure now matters.
You cannot achieve this with good engineering alone -- it requires operational discipline and architectural changes. * **Multi-region active-active deployment** is almost mandatory. A single region cannot realistically deliver 99.99% because cloud providers themselves typically only guarantee 99.99% per region for compute, and any single dependency below that threshold (database, load balancer, DNS) breaks your SLA. With active-active in two regions, you survive an entire region outage. The math: if each region is 99.9% available independently, two regions give you `1 - (0.001 * 0.001) = 99.9999%` theoretical availability (assuming independent failures). * **Zero-downtime deployments** become non-negotiable. Blue-green or canary deployments where you shift traffic gradually. A bad deploy that takes 5 minutes to detect and roll back consumes your entire monthly error budget at 99.99%. You need automated canary analysis that compares error rates between old and new versions and auto-rolls-back within 60 seconds. * **Dependency isolation and fallbacks for every critical path.** Every external call needs a circuit breaker, timeout, and a fallback that lets the core user journey succeed even if that dependency is down. If your recommendation engine is down, show trending items. If your user profile service is slow, serve from cache. * **Runbook automation and on-call SLOs.** At 99.99%, human response time is too slow. You need automated detection (anomaly detection, not just threshold alerts), automated mitigation (auto-scaling, auto-failover, auto-rollback), and humans are only for novel incidents. Mean time to detect (MTTD) must be under 1 minute and mean time to recover (MTTR) under 5 minutes. 
* **Example:** Moving from 99.9% to 99.99% at a fintech company required: adding a second AWS region with active-active routing via Route 53 health checks, switching from rolling deployments to canary with automated rollback, adding circuit breakers to all 12 downstream services, implementing database read replicas with automatic failover, and establishing an on-call rotation with 5-minute response SLO. The infrastructure cost roughly doubled, and operational complexity tripled. **Red flag answer:** "Add more servers and use auto-scaling." This shows no understanding of the 99.99% availability math, multi-region requirements, deployment risk, or the operational discipline needed. **Follow-ups:** 1. How would you handle database writes in a multi-region active-active setup? What consistency model would you choose and what are the trade-offs? 2. If your monthly error budget for 99.99% is 4.38 minutes and a bad deploy already consumed 3 minutes, what policy changes would you enforce for the rest of the month? **Strong answer:** * Graceful degradation means your system continues to provide core functionality even when some components fail, by shedding non-essential features. The key word is "graceful" -- users should get a slightly worse experience, not an error page. It is the opposite of the "all or nothing" approach where any failure returns a 500 error. * The decision of what to degrade requires a **feature criticality matrix** defined before an incident happens, not during one. You categorize every feature into tiers: **Tier 1 (critical)** -- must always work (checkout, login, core search), **Tier 2 (important)** -- degrade with notice (recommendations, reviews, real-time inventory counts), **Tier 3 (nice-to-have)** -- can be completely disabled (analytics tracking, A/B test variants, social features). During an overload event, you shed tiers in reverse order. * Implementation typically involves **feature flags combined with dependency health monitoring**. 
When the recommendation service circuit breaker opens, the feature flag for "personalized recommendations" automatically switches to "show trending items" (static cache). When the inventory service is slow, you show "in stock" based on a cached snapshot from 5 minutes ago instead of real-time counts. * Load shedding is the extreme form: when the system is overwhelmed, you actively reject low-priority requests to preserve capacity for high-priority ones. For example, during a flash sale, you might reject browse/search requests from non-authenticated users to preserve capacity for users who are actively in checkout. This is controversial but effective. * **Example:** Twitter (now X) historically degrades by disabling features under load: first, follower count updates stop being real-time and switch to periodic batch updates. Then, the "who to follow" recommendations disappear. Then, the trending topics become stale. But the core timeline and tweet posting continue working. Each degradation tier has a predefined trigger (e.g., p99 latency exceeding 500ms, error rate above 1%) and an automatic activation mechanism. **Red flag answer:** "Just show an error message saying the service is temporarily unavailable." That is a total failure, not graceful degradation. The point is that the user can still accomplish their primary task. **Follow-ups:** 1. How would you implement automatic degradation that triggers without human intervention? What signals would you use and how do you prevent false positives from triggering unnecessary degradations? 2. How do you test graceful degradation -- can you verify that each degradation tier actually works before you need it in production? **Strong answer:** * A retry storm happens when a service becomes slow or partially unavailable, causing all its callers to retry simultaneously, which multiplies the load on the already struggling service and pushes it from "slow" to "completely down." 
If every caller retries each failed request 3 times, a service that was handling 10,000 requests/second can see up to 30,000 additional requests/second on top of the original traffic -- exactly when it can least handle the load.
* The first defense is **exponential backoff with jitter** at each individual client, which we discussed. But that alone is not sufficient because each client acts independently. The more powerful mechanism is a **retry budget** at the caller side: "this service is allowed to retry at most 10% of its requests over any 30-second window." If the service is failing 50% of requests and every failure triggers a retry, you cap the retry traffic at 10% of total volume rather than letting it grow to 50% additional load.
* **Server-side cooperation** is equally important. The struggling service should return `429 Too Many Requests` with a `Retry-After` header when a client is sending too much, giving that client an explicit signal to back off. When the service as a whole is overloaded or failing over, it can return `503 Service Unavailable` with `Retry-After: 30` to tell clients not to retry for 30 seconds. Clients that respect these headers dramatically reduce retry pressure.
* **Circuit breakers** at the caller side are the final safety net. After N failures, the circuit opens and all requests fail immediately (no retries at all) for a timeout period. This gives the downstream service complete relief. The combination of retry budgets + circuit breakers + exponential backoff with jitter forms a layered defense.
* **Example:** An internal platform team at a large company discovered that during a database failover (which took 15 seconds), 200 microservices all started retrying their database calls simultaneously. Each service retried 3 times with 1-second intervals. The database received several times its normal write volume the moment it came back online (every failed call resent up to three more times, all on the same 1-second cadence), immediately fell over again, triggering another round of retries.
The fix was implementing a global retry budget (max 10% retry ratio per service), adding jitter, and having the database return `503 Retry-After: 10` during failover. Recovery time dropped from 12 minutes to 20 seconds. **Red flag answer:** "Limit retries to 3 per request." This addresses individual request retries but completely ignores the systemic problem -- 100,000 clients each doing 3 retries is 300,000 additional requests hitting a service that is already drowning. **Follow-ups:** 1. How would you implement a distributed retry budget across multiple instances of the same service? Do they need to coordinate, or can each instance track its own budget independently? 2. Your service returns a mix of 200s and 503s during degraded operation. How do you differentiate between "this request is safe to retry" and "the service is overloaded, stop retrying entirely"? ## Interview Deep-Dive Questions **What the interviewer is really testing:** Whether you understand that a system's availability cannot exceed its least available dependency unless you architect around it, and whether you can apply concrete patterns to close the gap. **Strong Answer:** * The math problem: if you depend on a 99.9% gateway and call it synchronously on every request, your service's availability ceiling is 99.9% -- below the 99.95% target. The gap between 99.9% and 99.95% is 4.38 hours of downtime per year that the gateway experiences but your service cannot. * Strategy 1 -- Multi-provider failover: integrate with two payment gateways (e.g., Stripe and Adyen). Route traffic primarily through Stripe. When Stripe's circuit breaker opens (5 failures in 30 seconds), automatically route to Adyen. This gives you availability of `1 - (0.001 * 0.001) = 99.9999%` assuming independent failures. The cost: maintaining two integrations, handling different response formats, and reconciling transactions across providers. This is the highest-impact solution and most production payment systems use it. 
* Strategy 2 -- Async processing with queuing: for payments that are not time-sensitive (subscriptions, scheduled payments), queue the payment request and process it asynchronously. If the gateway is down, the request stays in the queue and is retried with backoff until the gateway recovers. The user sees "payment processing" instead of an error. This converts a synchronous availability dependency into a latency dependency -- the payment succeeds eventually, just slower. * Strategy 3 -- Intelligent caching and pre-authorization: for repeat customers, pre-authorize payment methods during idle periods. Store the authorization token. When the customer checks out, you already have a valid authorization and only need to capture, which is a simpler call that can be retried more aggressively. If the capture fails, you have a window (usually 7 days) to retry before the authorization expires. * Strategy 4 -- Graceful degradation: if the gateway is down and failover is not available, accept the order and process the payment later. This requires credit risk assessment (do you trust this customer enough to ship before payment clears?). For returning customers with payment history, this is often acceptable. For new customers, show "We are experiencing payment issues, please try again in a few minutes." * The combination I would implement: multi-provider failover as the primary defense, async queuing for non-interactive payments, and graceful degradation as the last resort. This gives a theoretical availability well above 99.95%. * **Example:** Amazon reportedly uses multiple payment processors and will route around failures automatically. If Visa's network is slow, they can fall back to processing through a different acquiring bank. Their checkout success rate is a key business metric that is monitored second-by-second. **Follow-up: You fail over from Stripe to Adyen, but the customer's card declines on Adyen (different fraud rules). 
How do you handle this without losing the sale?** This is most likely a false decline caused by the failover, not a genuine card issue. Solutions: (1) Queue the transaction for retry on Stripe when it recovers (within seconds to minutes). Present the user with "Your payment is being processed" rather than "Payment declined." (2) If the user is still interactive, offer an alternative payment method ("Would you like to try a different card or PayPal?"). (3) Log the false decline and use it to negotiate with Adyen to align their fraud rules with your risk profile. Over time, ensure both providers have similar acceptance rates for your customer base by sharing fraud signals with both.

**What the interviewer is really testing:** Whether you understand chaos engineering as a disciplined practice with safety mechanisms, not reckless "break things in production" cowboy behavior.

**Strong Answer:**

* The justification is simple: untested resilience is not resilience. If we have never verified that our service survives instance failures in production, we are relying on hope. The question is not whether instances will fail -- cloud instances fail regularly (hardware faults, host retirements, and zone incidents are routine at fleet scale). The question is whether our system handles it gracefully or falls apart. Finding out during a real incident is far more costly than finding out in a controlled experiment.
* Safeguards before the experiment: (1) Define a steady state hypothesis: "Terminating 10% of instances will result in no user-visible errors, latency increase of less than 50ms at p99, and auto-scaling will replace instances within 90 seconds." (2) Set up automated abort conditions: if error rate exceeds 1% or p99 latency exceeds 500ms, automatically stop the experiment and restore the killed instances. Use an experiment controller (like Gremlin, LitmusChaos, or Chaos Monkey) that monitors these conditions in real-time.
(3) Start with a smaller blast radius: begin with 1 instance, not 10%. Validate the hypothesis. Then increase to 5%, then 10%. (4) Run during low-traffic hours initially, then graduate to business hours once confidence is established. (5) Notify the on-call team and customer support that a chaos experiment is running so they are not surprised by alerts. * Safeguards during the experiment: (1) The experiment controller watches dashboards in real-time and has a kill switch that immediately stops the experiment. (2) Run the experiment for a bounded duration (10 minutes, not all day). (3) A human operator is watching the dashboards during the entire experiment. (4) The experiment logs exactly which instances were terminated and when, for post-mortem correlation. * What you learn: either the system handles it (confidence increases, you document this and run it regularly) or it does not (you found a resilience gap before a real incident found it for you). Common findings: auto-scaling takes longer than expected, health check intervals are too long so traffic is still routed to dying instances, connection pools do not recover gracefully, and caches are cold on new instances causing a latency spike. * **Example:** Netflix runs Chaos Monkey continuously in production -- it terminates random instances during business hours every single day. But they started small (one instance at a time in non-critical services) and built up over years. They also have Chaos Kong, which simulates an entire region failure. The key insight from their practice: the experiments themselves rarely find bugs. The discipline of preparing for chaos (building fallbacks, testing auto-scaling, validating health checks) is what actually improves reliability. **Follow-up: The chaos experiment reveals that when 10% of instances die, the remaining instances' CPU spikes to 95% and p99 latency doubles. Auto-scaling kicks in but takes 3 minutes to provision new instances. 
How do you fix this?** The root cause is insufficient headroom. If losing 10% of capacity causes 95% CPU on the remaining instances, you are running at roughly 85% utilization normally (0.95 x 0.90 is about 0.85) -- that is too hot. Fix the headroom: run at 60-70% average CPU so that losing 10% of instances only raises the others to roughly 67-78%. For the 3-minute auto-scaling lag: (1) Use pre-warmed instances (warm standby instances that are already booted and ready, just not receiving traffic). (2) Set more aggressive auto-scaling triggers (scale out at a lower CPU threshold than the current one, e.g., 60% instead of 80%). (3) Over-provision slightly -- keep 1-2 extra instances beyond what current traffic needs. The cost of those extra instances is far less than the cost of a 3-minute latency spike during an incident.

**What the interviewer is really testing:** Whether you can apply the Saga pattern with proper compensation logic and understand the nuances of distributed consistency in a real business workflow.

**Strong Answer:**

* This is a distributed workflow that cannot use traditional database transactions because the data lives across five independent services. The pattern I would use is an Orchestration Saga with compensating transactions.
* The Saga sequence: (1) Create Order (status: PENDING). (2) Reserve Inventory (decrement available stock). (3) Charge Payment (authorize and capture). (4) Schedule Shipping (create shipment label). (5) Send Notification (confirmation email/push). Each step has a compensating action that undoes it if a later step fails.
* Compensations: (1) Cancel Order. (2) Release Inventory. (3) Refund Payment. (4) Cancel Shipment. (5) No compensation needed for notifications (send an "order canceled" notification instead).
* Failure scenario walkthrough: Payment succeeds at step 3, but Shipping fails at step 4 (carrier API is down). The Saga orchestrator triggers compensations in reverse: Cancel Shipment (no-op since it failed), Refund Payment, Release Inventory, Cancel Order.
The user sees "We could not complete your order. Your payment has been refunded." * The hard parts: (1) Idempotent compensations. The refund must be safe to call twice. If the orchestrator retries the compensation due to a network timeout, you cannot double-refund. Use an idempotency key derived from the order ID. (2) Partial compensation. What if the Refund call fails? You need a retry loop with exponential backoff for compensations, and if retries are exhausted, escalate to a dead-letter queue for manual intervention. (3) Observability. The orchestrator must log every state transition so that customer support can see exactly where the order failed and what compensations ran. (4) Concurrent Saga instances. Two orders for the last item in stock: both reserve inventory at step 2, but only one can succeed. Use optimistic locking on inventory (check-and-decrement atomically) so the second Saga fails at step 2 and compensates immediately. * Implementation: use a workflow engine (Temporal, AWS Step Functions, or Cadence). Define the Saga as an explicit workflow with retries, timeouts, and compensation handlers. The workflow engine provides durable execution (survives orchestrator restarts), automatic retries, and visibility into workflow state. * What I would explicitly avoid: (1) Two-phase commit across all five services -- holds locks, kills availability. (2) Choreography Saga (event-driven) for this workflow -- with 5 services, the implicit workflow is too hard to debug and monitor. (3) Ignoring notification failures -- even if the email fails, the order should still succeed. Treat notification as best-effort with a separate retry mechanism. **Follow-up: During a flash sale, your inventory system is getting hammered and the "Reserve Inventory" step takes 10 seconds instead of 200ms. This causes the Payment authorization to expire (they have a 30-second hold time). 
How do you redesign the Saga ordering to handle this?** Reorder the Saga to put the most latency-variable step first: (1) Reserve Inventory first (when this is slow, you have not yet touched payment). (2) Then authorize Payment (now you know inventory is reserved and the authorization starts fresh). (3) Then create Order record. This way, if inventory reservation is slow, you have not wasted a payment authorization. Additionally, implement a timeout on the inventory reservation step: if it does not respond in 5 seconds, fail the Saga early rather than holding the user waiting for 10 seconds. The user sees "High demand, please try again in a moment" which is better than a silent 30-second hang followed by a payment authorization expiry error.
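The reordered Saga can be sketched as a small in-memory orchestrator. This is an illustration under stated assumptions rather than a production design: `SagaStep`, `run_saga`, `SagaFailed`, and the step names are hypothetical, the service calls are replaced by functions that append to a log, and the step timeouts and durable state mentioned above are left to a workflow engine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # raises an exception on failure
    compensation: Callable[[], None]  # assumed to be idempotent


class SagaFailed(Exception):
    """Carries which step failed and which completed steps were compensated."""

    def __init__(self, failed_step: str, compensated: List[str]):
        super().__init__(f"saga failed at {failed_step!r}")
        self.failed_step = failed_step
        self.compensated = compensated


def run_saga(steps: List[SagaStep]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception as exc:
            compensated: List[str] = []
            # The failed step itself is skipped: it never took effect.
            for done in reversed(completed):
                done.compensation()
                compensated.append(done.name)
            raise SagaFailed(step.name, compensated) from exc
        completed.append(step)
    return [step.name for step in completed]


log: List[str] = []


def fail_payment() -> None:
    # Stand-in for a payment authorization that is declined or times out.
    raise RuntimeError("payment authorization failed")


# Reordered sequence: inventory first, then payment, then the order record.
steps = [
    SagaStep("reserve_inventory", lambda: log.append("reserve"),
             lambda: log.append("release")),
    SagaStep("authorize_payment", fail_payment,
             lambda: log.append("void_authorization")),
    SagaStep("create_order", lambda: log.append("create"),
             lambda: log.append("cancel")),
]

outcome: Optional[Tuple[str, List[str]]] = None
try:
    run_saga(steps)
except SagaFailed as err:
    outcome = (err.failed_step, err.compensated)
```

With payment failing in this sketch, only the inventory reservation is released and no order record is ever created, which is the property the reordering is meant to provide.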