Resilience Patterns

In distributed systems, failure is inevitable. Resilience patterns help your services degrade gracefully and recover quickly.

Learning Objectives:

Implement circuit breaker pattern
Design effective retry strategies
Apply bulkhead isolation
Create fallback mechanisms
Handle timeouts properly

Why Resilience Matters

┌─────────────────────────────────────────────────────────────────────────────┐
│                         CASCADE FAILURE                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  WITHOUT RESILIENCE:                                                         │
│  ──────────────────────                                                      │
│                                                                              │
│  User ──▶ API ──▶ Order ──▶ Payment ──▶ ❌ Bank API (down)                  │
│              │       │         │                                             │
│              │       │         └── Threads blocked, timeout...              │
│              │       └── Connection pool exhausted...                       │
│              └── All requests queuing...                                    │
│  ▲                                                                           │
│  └── Eventually entire system fails!                                        │
│                                                                              │
│                                                                              │
│  WITH RESILIENCE:                                                            │
│  ─────────────────                                                           │
│                                                                              │
│  User ──▶ API ──▶ Order ──▶ Payment ──▶ ⚡ Circuit Breaker                  │
│                               │              │                              │
│                               │              └── Fast fail, use fallback    │
│                               └── Returns "Payment pending"                 │
│  ▲                                                                           │
│  └── System stays responsive, partial degradation only                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Circuit Breaker Pattern

Prevent cascading failures by “breaking the circuit” to a failing service, so callers fail fast instead of waiting on it.

State Machine

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CIRCUIT BREAKER STATES                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                           ┌──────────────┐                                  │
│                           │    CLOSED    │  ◀── Normal operation            │
│                           │  (Healthy)   │      Requests pass through       │
│                           └──────┬───────┘                                  │
│                                  │                                          │
│                    Failure threshold reached                                │
│                                  │                                          │
│                                  ▼                                          │
│                           ┌──────────────┐                                  │
│      Timeout expires ───▶ │     OPEN     │  ◀── Fail fast                   │
│             │             │  (Tripped)   │      Return error immediately    │
│             │             └──────┬───────┘                                  │
│             │                    │                                          │
│             │         After reset timeout                                   │
│             │                    │                                          │
│             │                    ▼                                          │
│             │             ┌──────────────┐                                  │
│             └──────────── │  HALF-OPEN   │  ◀── Testing                     │
│                           │  (Testing)   │      Allow limited requests      │
│                           └──────┬───────┘                                  │
│                                  │                                          │
│              ┌───────────────────┴───────────────────┐                      │
│              │                                       │                      │
│        Test succeeds                           Test fails                   │
│              │                                       │                      │
│              ▼                                       ▼                      │
│        ┌──────────────┐                       ┌──────────────┐              │
│        │    CLOSED    │                       │     OPEN     │              │
│        └──────────────┘                       └──────────────┘              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
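
The diagram above can also be read as a small transition table. The sketch below is illustrative only: the event names (`FAILURE_THRESHOLD_REACHED`, `RESET_TIMEOUT_ELAPSED`, `TEST_SUCCEEDED`, `TEST_FAILED`) are assumptions chosen to mirror the labels in the diagram, not part of any library.

```javascript
// Transition table for the three circuit breaker states.
// Event names are hypothetical and mirror the arrows in the diagram.
const TRANSITIONS = {
  CLOSED: { FAILURE_THRESHOLD_REACHED: 'OPEN' },
  OPEN: { RESET_TIMEOUT_ELAPSED: 'HALF_OPEN' },
  HALF_OPEN: { TEST_SUCCEEDED: 'CLOSED', TEST_FAILED: 'OPEN' }
};

// Returns the next state, or the current state when the event does not apply.
function nextState(state, event) {
  return TRANSITIONS[state]?.[event] ?? state;
}

// Walk one trip, one failed probe, and one successful recovery.
const events = [
  'FAILURE_THRESHOLD_REACHED',
  'RESET_TIMEOUT_ELAPSED',
  'TEST_FAILED',
  'RESET_TIMEOUT_ELAPSED',
  'TEST_SUCCEEDED'
];

const path = events.reduce(
  (states, event) => [...states, nextState(states[states.length - 1], event)],
  ['CLOSED']
);

console.log(path.join(' -> '));
// CLOSED -> OPEN -> HALF_OPEN -> OPEN -> HALF_OPEN -> CLOSED
```

Keeping the transitions in one table makes it easy to see that OPEN can only be left through HALF_OPEN, never directly to CLOSED.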

Implementation from Scratch

// resilience/CircuitBreaker.js
class CircuitBreaker {
  constructor(options = {}) {
    this.name = options.name || 'default';
    this.failureThreshold = options.failureThreshold || 5;
    this.successThreshold = options.successThreshold || 2;
    this.timeout = options.timeout || 10000;
    this.resetTimeout = options.resetTimeout || 30000;
    
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = null;
    this.nextAttemptTime = null;
    
    this.metrics = {
      totalCalls: 0,
      successfulCalls: 0,
      failedCalls: 0,
      rejectedCalls: 0,
      timeouts: 0
    };
    
    this.listeners = {
      stateChange: [],
      success: [],
      failure: [],
      reject: []
    };
  }

  async call(fn, fallback = null) {
    this.metrics.totalCalls++;
    
    // Check if circuit is open
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptTime) {
        this.metrics.rejectedCalls++;
        this.emit('reject', { reason: 'Circuit is open' });
        
        if (fallback) {
          return fallback();
        }
        throw new CircuitOpenError(this.name);
      }
      
      // Try half-open
      this.transitionTo('HALF_OPEN');
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error);
      
      if (fallback) {
        return fallback(error);
      }
      throw error;
    }
  }

  async executeWithTimeout(fn) {
    let timer;
    // Rejects once the configured timeout elapses
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => {
        this.metrics.timeouts++;
        reject(new TimeoutError(`Operation timed out after ${this.timeout}ms`));
      }, this.timeout);
    });

    try {
      // Whichever settles first wins. The operation itself is not cancelled
      // on timeout; its late result is simply ignored.
      return await Promise.race([Promise.resolve().then(() => fn()), timeout]);
    } finally {
      clearTimeout(timer);
    }
  }

  onSuccess() {
    this.metrics.successfulCalls++;
    this.emit('success');
    
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      
      if (this.successCount >= this.successThreshold) {
        this.transitionTo('CLOSED');
      }
    }
    
    this.failureCount = 0;
  }

  onFailure(error) {
    this.metrics.failedCalls++;
    this.lastFailureTime = Date.now();
    this.emit('failure', error);
    
    if (this.state === 'HALF_OPEN') {
      this.transitionTo('OPEN');
      return;
    }
    
    this.failureCount++;
    
    if (this.failureCount >= this.failureThreshold) {
      this.transitionTo('OPEN');
    }
  }

  transitionTo(newState) {
    const oldState = this.state;
    this.state = newState;
    
    if (newState === 'OPEN') {
      this.nextAttemptTime = Date.now() + this.resetTimeout;
    }
    
    if (newState === 'CLOSED') {
      this.failureCount = 0;
      this.successCount = 0;
    }
    
    if (newState === 'HALF_OPEN') {
      this.successCount = 0;
    }
    
    this.emit('stateChange', { from: oldState, to: newState });
    console.log(`Circuit [${this.name}]: ${oldState} → ${newState}`);
  }

  on(event, callback) {
    this.listeners[event].push(callback);
  }

  emit(event, data) {
    this.listeners[event]?.forEach(cb => cb(data));
  }

  getStatus() {
    return {
      name: this.name,
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      metrics: this.metrics,
      nextAttemptTime: this.nextAttemptTime
    };
  }
}

class CircuitOpenError extends Error {
  constructor(circuitName) {
    super(`Circuit breaker [${circuitName}] is open`);
    this.name = 'CircuitOpenError';
    this.circuitName = circuitName;
  }
}

class TimeoutError extends Error {
  constructor(message) {
    super(message);
    this.name = 'TimeoutError';
  }
}

module.exports = { CircuitBreaker, CircuitOpenError, TimeoutError };

Using with Service Clients

// services/PaymentServiceClient.js
const { CircuitBreaker } = require('../resilience/CircuitBreaker');

class PaymentServiceClient {
  constructor(httpClient, cache) {
    this.http = httpClient;
    this.cache = cache;
    
    this.circuitBreaker = new CircuitBreaker({
      name: 'payment-service',
      failureThreshold: 5,
      successThreshold: 2,
      timeout: 5000,
      resetTimeout: 30000
    });
    
    // Listen to state changes
    this.circuitBreaker.on('stateChange', ({ from, to }) => {
      if (to === 'OPEN') {
        this.alertOps(`Payment service circuit opened!`);
      }
    });
  }

  async processPayment(orderId, amount, paymentMethod) {
    return this.circuitBreaker.call(
      // Primary operation
      async () => {
        const response = await this.http.post('/payments', {
          orderId,
          amount,
          paymentMethod
        });
        return response.data;
      },
      // Fallback
      async (error) => {
        console.log('Using payment fallback:', error.message);
        
        // Queue for later processing
        await this.queuePaymentForRetry(orderId, amount, paymentMethod);
        
        return {
          status: 'PENDING',
          message: 'Payment queued for processing',
          retryId: generateId()
        };
      }
    );
  }

  async getPaymentStatus(paymentId) {
    return this.circuitBreaker.call(
      async () => {
        const response = await this.http.get(`/payments/${paymentId}`);
        // Cache successful responses
        await this.cache.set(`payment:${paymentId}`, response.data, 60);
        return response.data;
      },
      async () => {
        // Return cached data as fallback
        const cached = await this.cache.get(`payment:${paymentId}`);
        if (cached) {
          return { ...cached, _fromCache: true };
        }
        throw new Error('Payment status unavailable');
      }
    );
  }
}

Retry Patterns

Retry Strategies

┌─────────────────────────────────────────────────────────────────────────────┐
│                         RETRY STRATEGIES                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IMMEDIATE RETRY (for transient failures):                                   │
│  ─────────────────────────────────────────                                  │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✓── [Success]                 │
│     │             │                │                                        │
│     0ms           0ms              0ms                                      │
│                                                                              │
│  LINEAR BACKOFF:                                                             │
│  ───────────────                                                             │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓             │
│     │             │                │                │                       │
│     0ms         1000ms           2000ms          3000ms                     │
│                                                                              │
│  EXPONENTIAL BACKOFF (recommended):                                          │
│  ──────────────────────────────────                                         │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓             │
│     │             │                │                │                       │
│     0ms         1000ms           2000ms          4000ms                     │
│                                                                              │
│  EXPONENTIAL BACKOFF WITH JITTER (best):                                     │
│  ───────────────────────────────────────                                    │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓             │
│     │             │                │                │                       │
│     0ms       1000+rand        2000+rand       4000+rand                    │
│                                                                              │
│  Jitter prevents thundering herd when many clients retry simultaneously      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
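
The delays in the diagram follow directly from the strategy formulas. A minimal sketch, assuming a 1000 ms base delay as in the diagram; the function names `baseDelayFor` and `schedule` are made up for illustration, and jitter and the maximum-delay cap are left out:

```javascript
// Delay before retry number `retry` (1-based), without jitter or a cap.
function baseDelayFor(strategy, retry, baseDelayMs = 1000) {
  switch (strategy) {
    case 'immediate':
      return 0;
    case 'linear':
      return baseDelayMs * retry;
    case 'exponential':
      return baseDelayMs * 2 ** (retry - 1);
    default:
      throw new Error(`Unknown strategy: ${strategy}`);
  }
}

// Delays for the first `retries` retries.
function schedule(strategy, retries, baseDelayMs = 1000) {
  return Array.from({ length: retries }, (_, i) =>
    baseDelayFor(strategy, i + 1, baseDelayMs)
  );
}

console.log(schedule('immediate', 3));   // [ 0, 0, 0 ]
console.log(schedule('linear', 3));      // [ 1000, 2000, 3000 ]
console.log(schedule('exponential', 3)); // [ 1000, 2000, 4000 ]
console.log(schedule('exponential', 5)); // [ 1000, 2000, 4000, 8000, 16000 ]
```

Linear and exponential backoff agree on the first two retries and only diverge from the third one on, which is why short retry budgets often behave the same under both.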

Implementation

// resilience/Retry.js
class RetryPolicy {
  constructor(options = {}) {
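    // `??` keeps explicit zeros (e.g. maxRetries: 0 or jitterFactor: 0),
    // which `||` would silently replace with the default
    this.maxRetries = options.maxRetries ?? 3;
    this.strategy = options.strategy || 'exponential';
    this.baseDelay = options.baseDelay ?? 1000;
    this.maxDelay = options.maxDelay ?? 30000;
    this.jitterFactor = options.jitterFactor ?? 0.2;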
    this.retryableErrors = options.retryableErrors || this.defaultRetryableErrors();
  }

  defaultRetryableErrors() {
    return [
      'ECONNRESET',
      'ETIMEDOUT',
      'ECONNREFUSED',
      'NETWORK_ERROR'
    ];
  }

  async execute(operation) {
    let lastError;
    
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error;
        
        if (!this.isRetryable(error) || attempt === this.maxRetries) {
          throw error;
        }
        
        // Pass the error along so subclasses can inspect it (e.g. a Retry-After header)
        const delay = this.calculateDelay(attempt, error);
        console.log(`Retry ${attempt + 1}/${this.maxRetries} after ${delay}ms`);
        
        await this.sleep(delay);
      }
    }
    
    throw lastError;
  }

  isRetryable(error) {
    // An open circuit already means "stop calling"; retrying would defeat it
    if (error.name === 'CircuitOpenError') {
      return false;
    }

    // Network errors
    if (this.retryableErrors.includes(error.code)) {
      return true;
    }

    // HTTP status codes. Plain errors may carry `statusCode`; axios errors
    // expose the status as `error.response.status`.
    const status = error.statusCode ?? error.response?.status;
    const retryableStatusCodes = [408, 429, 500, 502, 503, 504];
    return retryableStatusCodes.includes(status);
  }

  calculateDelay(attempt) {
    let delay;
    
    switch (this.strategy) {
      case 'fixed':
        delay = this.baseDelay;
        break;
        
      case 'linear':
        delay = this.baseDelay * (attempt + 1);
        break;
        
      case 'exponential':
        delay = this.baseDelay * Math.pow(2, attempt);
        break;
        
      default:
        delay = this.baseDelay;
    }
    
    // Apply jitter
    const jitter = delay * this.jitterFactor * Math.random();
    delay = delay + jitter;
    
    // Cap at max delay
    return Math.min(delay, this.maxDelay);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Retry with specific status code handling
class HttpRetryPolicy extends RetryPolicy {
  constructor(options = {}) {
    super(options);
    this.statusCodeHandlers = options.statusCodeHandlers || {};
  }

  isRetryable(error) {
    if (error.statusCode === 429) {
      // Rate limited - respect Retry-After header
      return true;
    }
    
    return super.isRetryable(error);
  }

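  calculateDelay(attempt, error) {
    // Respect Retry-After when the server sends it as a number of seconds
    // (the HTTP-date form is not handled here). Headers may sit on the error
    // itself or, with axios, on error.response.
    const headers = error?.headers ?? error?.response?.headers;
    const retryAfter = Number.parseInt(headers?.['retry-after'], 10);
    if (!Number.isNaN(retryAfter)) {
      return retryAfter * 1000;
    }

    return super.calculateDelay(attempt);
  }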
}

module.exports = { RetryPolicy, HttpRetryPolicy };
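
HTTP allows the `Retry-After` header to carry either a number of seconds or an HTTP-date, while the policy above only reads the numeric form. A minimal sketch of a parser for both forms follows; the helper name `retryAfterToMs` and its `nowMs` parameter are assumptions for illustration, not part of the classes above.

```javascript
// Converts a Retry-After header value to a delay in milliseconds.
// Returns null when the value cannot be interpreted.
function retryAfterToMs(value, nowMs = Date.now()) {
  if (typeof value !== 'string' || value.trim() === '') {
    return null;
  }
  const trimmed = value.trim();

  // Form 1: delay in whole seconds, e.g. "120"
  if (/^\d+$/.test(trimmed)) {
    return Number(trimmed) * 1000;
  }

  // Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) {
    return null;
  }
  // A date in the past means "retry now"
  return Math.max(0, dateMs - nowMs);
}

const now = Date.parse('Wed, 21 Oct 2015 07:27:00 GMT');
console.log(retryAfterToMs('120'));                                // 120000
console.log(retryAfterToMs('Wed, 21 Oct 2015 07:28:00 GMT', now)); // 60000
console.log(retryAfterToMs('soon'));                               // null
```

Passing the current time in as a parameter keeps the function deterministic, which makes the date branch straightforward to test.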

Combining Circuit Breaker and Retry

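// resilience/ResilientClient.js
const { CircuitBreaker } = require('./CircuitBreaker');
const { RetryPolicy } = require('./Retry');

class ResilientClient {
  constructor(options = {}) {
    this.circuitBreaker = new CircuitBreaker(options.circuitBreaker);
    this.retryPolicy = new RetryPolicy(options.retry);
  }

  async execute(operation, fallback = null) {
    // The breaker sees one call per retry sequence: it records a failure only
    // after all retries are exhausted. Its timeout also covers the whole
    // sequence, backoff delays included, so size it accordingly.
    return this.circuitBreaker.call(
      // Wrap operation with retry
      () => this.retryPolicy.execute(operation),
      fallback
    );
  }
}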

// Usage (run inside an async function; CommonJS modules do not support top-level await)
const resilientPaymentClient = new ResilientClient({
  circuitBreaker: {
    name: 'payment',
    failureThreshold: 5,
    resetTimeout: 30000
  },
  retry: {
    maxRetries: 3,
    strategy: 'exponential',
    baseDelay: 1000
  }
});

const result = await resilientPaymentClient.execute(
  async () => {
    return axios.post('http://payment-service/charge', { amount: 100 });
  },
  async () => {
    return { status: 'PENDING', queued: true };
  }
);
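
Because the circuit breaker's timeout wraps the entire retry sequence, it helps to estimate the worst-case duration of that sequence. The sketch below is a rough calculation under stated assumptions: jitter is ignored, every attempt is assumed to run for a hypothetical per-attempt limit of 2000 ms (`attemptTimeout` is not a setting of the classes above), and the function name is made up for illustration.

```javascript
// Worst-case wall-clock time for a retried call with exponential backoff,
// ignoring jitter: every attempt uses its full time limit and every
// backoff delay is taken.
function worstCaseDurationMs({ maxRetries, baseDelay, maxDelay, attemptTimeout }) {
  let total = (maxRetries + 1) * attemptTimeout; // initial try plus retries
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    total += Math.min(baseDelay * 2 ** attempt, maxDelay);
  }
  return total;
}

const worstCase = worstCaseDurationMs({
  maxRetries: 3,
  baseDelay: 1000,
  maxDelay: 30000,
  attemptTimeout: 2000 // assumed per-attempt limit
});

console.log(worstCase); // 15000
```

With these numbers the sequence can take 15000 ms (4 attempts of 2000 ms plus 1000 + 2000 + 4000 ms of backoff), which is longer than the breaker's default `timeout` of 10000 ms. In that case the breaker would report a timeout before the last retry finishes, so either the breaker timeout or the retry budget would need adjusting.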

Bulkhead Pattern

Isolate failures to prevent them from affecting other parts of the system.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         BULKHEAD PATTERN                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  WITHOUT BULKHEAD:                                                           │
│  ─────────────────                                                           │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────┐            │
│  │                    SHARED THREAD POOL (100)                  │            │
│  │  [Payment] [Payment] [Payment] ... (all blocked)            │            │
│  │  [Orders] ✗ No threads available                            │            │
│  │  [Users] ✗ No threads available                             │            │
│  └─────────────────────────────────────────────────────────────┘            │
│                                                                              │
│  Payment service is slow → ALL services affected!                            │
│                                                                              │
│                                                                              │
│  WITH BULKHEAD:                                                              │
│  ───────────────                                                             │
│                                                                              │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐                │
│  │ Payment (30)    │ │ Orders (40)     │ │ Users (30)      │                │
│  │ [blocked...]    │ │ [working]       │ │ [working]       │                │
│  │ [blocked...]    │ │ [working]       │ │ [working]       │                │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘                │
│                                                                              │
│  Payment service is slow → Only payment pool affected!                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation

// resilience/Bulkhead.js
class Bulkhead {
  constructor(options = {}) {
    this.name = options.name || 'default';
    this.maxConcurrent = options.maxConcurrent || 10;
    this.maxQueue = options.maxQueue || 100;
    this.queueTimeout = options.queueTimeout || 30000;
    
    this.activeCount = 0;
    this.queue = [];
    
    this.metrics = {
      totalCalls: 0,
      activeCalls: 0,
      queuedCalls: 0,
      rejectedCalls: 0,
      completedCalls: 0,
      timeoutCalls: 0
    };
  }

  async execute(operation) {
    this.metrics.totalCalls++;
    
    // If within limit, execute immediately
    if (this.activeCount < this.maxConcurrent) {
      return this.executeOperation(operation);
    }
    
    // If queue is full, reject
    if (this.queue.length >= this.maxQueue) {
      this.metrics.rejectedCalls++;
      throw new BulkheadFullError(this.name);
    }
    
    // Queue the operation
    return this.queueOperation(operation);
  }

  async executeOperation(operation) {
    this.activeCount++;
    this.metrics.activeCalls = this.activeCount;
    
    try {
      const result = await operation();
      this.metrics.completedCalls++;
      return result;
    } finally {
      this.activeCount--;
      this.metrics.activeCalls = this.activeCount;
      this.processQueue();
    }
  }

  queueOperation(operation) {
    return new Promise((resolve, reject) => {
      const queueItem = {
        operation,
        resolve,
        reject,
        timestamp: Date.now(),
        timer: setTimeout(() => {
          this.removeFromQueue(queueItem);
          this.metrics.timeoutCalls++;
          reject(new QueueTimeoutError(this.name, this.queueTimeout));
        }, this.queueTimeout)
      };
      
      this.queue.push(queueItem);
      this.metrics.queuedCalls = this.queue.length;
    });
  }

  processQueue() {
    if (this.queue.length === 0 || this.activeCount >= this.maxConcurrent) {
      return;
    }
    
    const item = this.queue.shift();
    clearTimeout(item.timer);
    this.metrics.queuedCalls = this.queue.length;
    
    this.executeOperation(item.operation)
      .then(item.resolve)
      .catch(item.reject);
  }

  removeFromQueue(item) {
    const index = this.queue.indexOf(item);
    if (index > -1) {
      this.queue.splice(index, 1);
      this.metrics.queuedCalls = this.queue.length;
    }
  }

  getStatus() {
    return {
      name: this.name,
      activeCount: this.activeCount,
      queueLength: this.queue.length,
      metrics: this.metrics
    };
  }
}

class BulkheadFullError extends Error {
  constructor(name) {
    super(`Bulkhead [${name}] is full`);
    this.name = 'BulkheadFullError';
  }
}

class QueueTimeoutError extends Error {
  constructor(name, timeout) {
    super(`Queue timeout in bulkhead [${name}] after ${timeout}ms`);
    this.name = 'QueueTimeoutError';
  }
}

module.exports = { Bulkhead, BulkheadFullError, QueueTimeoutError };

Semaphore-Based Bulkhead

// resilience/SemaphoreBulkhead.js
class Semaphore {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
    this.waiting = [];
  }

  async acquire() {
    if (this.current < this.maxConcurrent) {
      this.current++;
      return;
    }

    await new Promise((resolve) => {
      this.waiting.push(resolve);
    });
  }

  release() {
    this.current--;
    
    if (this.waiting.length > 0) {
      const next = this.waiting.shift();
      this.current++;
      next();
    }
  }

  async execute(fn) {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}

// Usage with per-service bulkheads
class BulkheadManager {
  constructor() {
    this.bulkheads = new Map();
  }

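  // The limit is fixed when a bulkhead is first created; later calls with a
  // different maxConcurrent reuse the existing semaphore unchanged
  get(name, maxConcurrent = 10) {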
    if (!this.bulkheads.has(name)) {
      this.bulkheads.set(name, new Semaphore(maxConcurrent));
    }
    return this.bulkheads.get(name);
  }

  async execute(name, maxConcurrent, fn) {
    const bulkhead = this.get(name, maxConcurrent);
    return bulkhead.execute(fn);
  }
}

// Usage (run inside an async function; CommonJS modules do not support top-level await)
const bulkheadManager = new BulkheadManager();

// Payment operations limited to 20 concurrent
const paymentResult = await bulkheadManager.execute('payment', 20, async () => {
  return paymentService.processPayment(orderId, amount);
});

// Inventory operations limited to 50 concurrent
const stockResult = await bulkheadManager.execute('inventory', 50, async () => {
  return inventoryService.checkStock(productId);
});
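
One way to check that a concurrency limit actually holds is to measure the peak number of operations in flight. The sketch below restates a compact limiter so it runs on its own; `createLimiter` and `measurePeak` are illustrative names, and the 10 ms sleep stands in for real work.

```javascript
// Compact promise-based limiter using the same slot hand-off idea
// as the Semaphore class: a released slot goes straight to the next waiter.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];

  async function acquire() {
    if (active < maxConcurrent) {
      active++;
      return;
    }
    // Wait until release() hands this caller a slot
    await new Promise(resolve => waiting.push(resolve));
  }

  function release() {
    const next = waiting.shift();
    if (next) {
      next(); // slot passes directly to the waiter; `active` is unchanged
    } else {
      active--;
    }
  }

  return async function run(fn) {
    await acquire();
    try {
      return await fn();
    } finally {
      release();
    }
  };
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Runs `taskCount` tasks through a limiter and reports the peak concurrency.
async function measurePeak(limit, taskCount) {
  const run = createLimiter(limit);
  let inFlight = 0;
  let peak = 0;

  const task = async () => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await sleep(10);
    inFlight--;
  };

  await Promise.all(Array.from({ length: taskCount }, () => run(task)));
  return peak;
}

measurePeak(3, 10).then(peak => console.log(`peak concurrency: ${peak}`));
// peak concurrency: 3
```

Handing the slot directly to the next waiter, rather than decrementing and letting waiters race for it, prevents a newly arriving caller from slipping in and briefly exceeding the limit.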

Timeout Patterns

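// resilience/Timeout.js
const { TimeoutError } = require('./CircuitBreaker');

class TimeoutPolicy {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  async execute(operation, fallback = null) {
    let timer;
    let timedOut = false;
    // Rejects once the time limit elapses
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => {
        timedOut = true;
        reject(new TimeoutError(`Operation timed out after ${this.timeoutMs}ms`));
      }, this.timeoutMs);
    });

    try {
      // The operation is not cancelled on timeout; its late result is ignored
      return await Promise.race([Promise.resolve().then(() => operation()), timeout]);
    } catch (error) {
      // Only a timeout triggers the fallback; other errors propagate.
      // Errors thrown by the fallback reject the returned promise.
      if (timedOut && fallback) {
        return fallback();
      }
      throw error;
    } finally {
      clearTimeout(timer);
    }
  }
}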

// Cascading timeouts
class CascadingTimeout {
  // Each outer layer must have a longer timeout than the layer it calls,
  // so an inner timeout surfaces while the caller still has time to handle it
  
  /*
   API Gateway (5s) 
       └── Order Service (4s)
            └── Payment Service (3s)
                 └── Bank API (2s)
  */
  
  static forLayer(layer) {
    const timeouts = {
      gateway: 5000,
      application: 4000,
      integration: 3000,
      external: 2000
    };
    return new TimeoutPolicy(timeouts[layer] || 5000);
  }
}
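
The layered values in the comment above (5s, 4s, 3s, 2s) follow a simple rule: each layer gets the caller's timeout minus a fixed reserve. A minimal sketch of that rule, assuming a 1000 ms reserve per hop; `cascadeTimeouts` is a hypothetical helper, not part of the classes above.

```javascript
// Derives a timeout for each layer, outermost first, by subtracting a fixed
// reserve per hop. The reserve is the time a caller keeps for its own
// error handling after the callee gives up.
function cascadeTimeouts(layers, outerTimeoutMs, reserveMs) {
  const result = {};
  layers.forEach((layer, depth) => {
    const timeout = outerTimeoutMs - depth * reserveMs;
    if (timeout <= 0) {
      throw new RangeError(`No time budget left for layer "${layer}"`);
    }
    result[layer] = timeout;
  });
  return result;
}

console.log(cascadeTimeouts(['gateway', 'order', 'payment', 'bank'], 5000, 1000));
// { gateway: 5000, order: 4000, payment: 3000, bank: 2000 }
```

Throwing when the budget runs out makes an overly deep call chain visible at configuration time instead of producing timeouts that can never be met.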

// Adaptive timeout based on p99 latency
class AdaptiveTimeout {
  constructor(options = {}) {
    this.baseTimeout = options.baseTimeout || 1000;
    this.multiplier = options.multiplier || 3;
    this.minTimeout = options.minTimeout || 500;
    this.maxTimeout = options.maxTimeout || 30000;
    this.latencies = [];
    this.windowSize = options.windowSize || 100;
  }

  recordLatency(latency) {
    this.latencies.push(latency);
    if (this.latencies.length > this.windowSize) {
      this.latencies.shift();
    }
  }

  getCurrentTimeout() {
    if (this.latencies.length < 10) {
      return this.baseTimeout;
    }

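    // Calculate p99 latency (nearest-rank method). Using floor(n * 0.99) as
    // the index would pick the maximum sample when the window holds 100 values.
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const p99Index = Math.ceil((sorted.length * 99) / 100) - 1;
    const p99 = sorted[p99Index];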

    // Timeout = p99 * multiplier
    const timeout = p99 * this.multiplier;

    return Math.max(this.minTimeout, Math.min(timeout, this.maxTimeout));
  }

  async execute(operation) {
    const timeout = this.getCurrentTimeout();
    const start = Date.now();

    try {
      const result = await new TimeoutPolicy(timeout).execute(operation);
      this.recordLatency(Date.now() - start);
      return result;
    } catch (error) {
      if (!(error instanceof TimeoutError)) {
        this.recordLatency(Date.now() - start);
      }
      throw error;
    }
  }
}
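
The percentile step is easy to get off by one, so it can help to look at it in isolation. The sketch below uses the nearest-rank definition (the smallest sample such that at least p percent of samples are less than or equal to it); the `percentile` helper is an illustrative assumption, and other percentile definitions with interpolation exist.

```javascript
// Nearest-rank percentile over a list of numeric samples.
function percentile(samples, p) {
  if (samples.length === 0) {
    throw new RangeError('No samples');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  // 1-based rank; integer arithmetic avoids floating-point surprises
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// 100 samples: 1 ms, 2 ms, ..., 100 ms
const latencies = Array.from({ length: 100 }, (_, i) => i + 1);

console.log(percentile(latencies, 99)); // 99
console.log(percentile(latencies, 50)); // 50
console.log(percentile([120, 80, 95], 99)); // 120
```

With fewer than 100 samples the nearest-rank p99 is simply the largest sample, so a small window makes the adaptive timeout track the single slowest recent call.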

Complete Resilience Stack

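// resilience/ResilientService.js
const { CircuitBreaker } = require('./CircuitBreaker');
const { RetryPolicy } = require('./Retry');
const { Bulkhead } = require('./Bulkhead');
// ResilienceMetrics and ServiceUnavailableError are assumed to be defined elsewhere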
class ResilientService {
  constructor(name, options = {}) {
    this.name = name;
    
    this.circuitBreaker = new CircuitBreaker({
      name,
      failureThreshold: options.circuitBreaker?.failureThreshold || 5,
      resetTimeout: options.circuitBreaker?.resetTimeout || 30000,
      timeout: options.timeout?.operation || 5000
    });
    
    this.retryPolicy = new RetryPolicy({
      maxRetries: options.retry?.maxRetries || 3,
      strategy: options.retry?.strategy || 'exponential',
      baseDelay: options.retry?.baseDelay || 1000
    });
    
    this.bulkhead = new Bulkhead({
      name,
      maxConcurrent: options.bulkhead?.maxConcurrent || 10,
      maxQueue: options.bulkhead?.maxQueue || 100
    });
    
    this.cache = options.cache;
    this.metrics = new ResilienceMetrics(name);
  }

  async execute(operationName, operation, options = {}) {
    const { 
      fallback = null,
      cacheKey = null,
      cacheTtl = 60
    } = options;

    const startTime = Date.now();

    try {
      // Layer 1: Bulkhead (limit concurrency)
      return await this.bulkhead.execute(async () => {
        
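        // Layer 2: Circuit Breaker (fail fast)
        // Its timeout spans the whole retry sequence below, backoff delays
        // included, so it must exceed the worst-case total or retries are cut short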
        return await this.circuitBreaker.call(async () => {
          
          // Layer 3: Retry (handle transient failures)
          return await this.retryPolicy.execute(async () => {
            
            const result = await operation();
            
            // Cache successful results
            if (cacheKey && this.cache) {
              await this.cache.set(cacheKey, result, cacheTtl);
            }
            
            this.metrics.recordSuccess(operationName, Date.now() - startTime);
            return result;
          });
          
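        }, async (error) => {
          // Fallback: runs when the circuit is open (no error argument)
          // or when the call still fails after all retries
          if (!error) {
            this.metrics.recordCircuitOpen(operationName);
          }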
          
          // Try cache
          if (cacheKey && this.cache) {
            const cached = await this.cache.get(cacheKey);
            if (cached) {
              this.metrics.recordCacheHit(operationName);
              return { ...cached, _fromCache: true };
            }
          }
          
          // Use provided fallback
          if (fallback) {
            return fallback();
          }
          
          throw new ServiceUnavailableError(this.name);
        });
      });
      
    } catch (error) {
      this.metrics.recordFailure(operationName, error, Date.now() - startTime);
      throw error;
    }
  }

  getStatus() {
    return {
      service: this.name,
      circuitBreaker: this.circuitBreaker.getStatus(),
      bulkhead: this.bulkhead.getStatus(),
      metrics: this.metrics.getSummary()
    };
  }
}

// Usage (run inside an async function; CommonJS modules do not support top-level await)
const paymentService = new ResilientService('payment-service', {
  circuitBreaker: {
    failureThreshold: 5,
    resetTimeout: 30000
  },
  retry: {
    maxRetries: 3,
    strategy: 'exponential'
  },
  bulkhead: {
    maxConcurrent: 20,
    maxQueue: 50
  },
  cache: redisCache
});

const payment = await paymentService.execute(
  'processPayment',
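  async () => {
    // Return the response body rather than the whole axios response,
    // so the cached value is plain data
    const response = await axios.post('http://payment-api/charge', { amount });
    return response.data;
  },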
  {
    cacheKey: `payment:${orderId}`,
    cacheTtl: 300,
    fallback: () => ({
      status: 'PENDING',
      message: 'Payment will be processed shortly'
    })
  }
);

Health Checks

// health/HealthChecker.js
class HealthChecker {
  constructor() {
    this.checks = new Map();
  }

  register(name, checkFn, options = {}) {
    this.checks.set(name, {
      check: checkFn,
      critical: options.critical !== false,
      timeout: options.timeout || 5000
    });
  }

  async runCheck(name) {
    const config = this.checks.get(name);
    if (!config) {
      return { name, status: 'UNKNOWN', message: 'Check not found' };
    }

    let timer;
    try {
      const result = await Promise.race([
        config.check(),
        new Promise((_, reject) => {
          timer = setTimeout(
            () => reject(new Error('Health check timeout')),
            config.timeout
          );
        })
      ]);

      return {
        name,
        status: 'HEALTHY',
        critical: config.critical,
        details: result
      };
    } catch (error) {
      return {
        name,
        status: 'UNHEALTHY',
        critical: config.critical,
        error: error.message
      };
    } finally {
      // Clear the timer so a check that finishes early does not leave it pending
      clearTimeout(timer);
    }
  }

  async runAllChecks() {
    const results = await Promise.all(
      Array.from(this.checks.keys()).map(name => this.runCheck(name))
    );

    const unhealthyCritical = results.filter(
      r => r.status === 'UNHEALTHY' && r.critical
    );

    return {
      status: unhealthyCritical.length > 0 ? 'UNHEALTHY' : 'HEALTHY',
      timestamp: new Date().toISOString(),
      checks: results
    };
  }
}

// Setup health checks
const healthChecker = new HealthChecker();

healthChecker.register('database', async () => {
  await db.query('SELECT 1');
  return { connected: true };
}, { critical: true });

healthChecker.register('redis', async () => {
  await redis.ping();
  return { connected: true };
}, { critical: true });

healthChecker.register('payment-service', async () => {
  const response = await axios.get('http://payment-service/health', { timeout: 2000 });
  return response.data;
}, { critical: false });

// Health endpoint
app.get('/health', async (req, res) => {
  const health = await healthChecker.runAllChecks();
  const statusCode = health.status === 'HEALTHY' ? 200 : 503;
  res.status(statusCode).json(health);
});

// Liveness probe (for Kubernetes)
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'UP' });
});

// Readiness probe (for Kubernetes)
app.get('/health/ready', async (req, res) => {
  const health = await healthChecker.runAllChecks();
  const statusCode = health.status === 'HEALTHY' ? 200 : 503;
  res.status(statusCode).json(health);
});

Interview Questions

Q1: Explain the Circuit Breaker pattern and its states

Answer: Circuit Breaker prevents cascade failures by monitoring call failures.

States:

CLOSED: Normal operation, requests pass through
OPEN: Failure threshold reached, requests fail immediately
HALF-OPEN: After reset timeout, allows limited requests to test recovery

Parameters:

Failure threshold (e.g., 50% in 10 seconds)
Reset timeout (e.g., 30 seconds)
Success threshold for recovery (e.g., 3 successful calls)

Benefits: Fast failure, reduced load on failing service, automatic recovery
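
The state transitions above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the class name `SimpleCircuitBreaker`, its option names, and the injectable `now` clock are assumptions made for the example. It counts consecutive failures rather than a failure rate over a time window, and it does not limit how many trial requests run at once while half-open.

```javascript
// Minimal sketch of the CLOSED -> OPEN -> HALF_OPEN -> CLOSED cycle.
// Class and option names are illustrative assumptions.
class SimpleCircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeout = 30000, successThreshold = 3, now = Date.now } = {}) {
    Object.assign(this, { failureThreshold, resetTimeout, successThreshold, now });
    this.state = 'CLOSED';
    this.failures = 0;   // consecutive failures while CLOSED
    this.successes = 0;  // successful trial calls while HALF_OPEN
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit is OPEN'); // fail fast, do not touch the dependency
      }
      this.state = 'HALF_OPEN'; // reset timeout elapsed: allow trial requests
      this.successes = 0;
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    if (this.state === 'HALF_OPEN') {
      this.successes++;
      if (this.successes >= this.successThreshold) {
        this.state = 'CLOSED';
        this.failures = 0;
      }
    } else {
      this.failures = 0;
    }
  }

  onFailure() {
    this.failures++;
    // Any failure during a trial, or too many failures in a row, opens the circuit
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}
```

Passing a fake `now` function makes the reset timeout easy to exercise in tests without waiting 30 seconds.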

Q2: Why use exponential backoff with jitter?

Answer:

Exponential Backoff:

Increases wait time between retries (1s, 2s, 4s, 8s…)
Reduces load on recovering service
Allows more time for transient issues to resolve

Jitter:

Adds randomness to delay (e.g., 2s + 0-500ms)
Prevents “thundering herd” where all clients retry simultaneously
Spreads retry load evenly over time

delay = baseDelay * 2^attempt + random(0, baseDelay * 0.2)
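
The formula above can be sketched as a small helper. This is a minimal sketch under stated assumptions: the function name `backoffDelay`, the parameter names, and the `maxDelayMs` cap are illustrative additions, and the `random` parameter exists only so the jitter can be controlled in tests.

```javascript
// Sketch of: delay = baseDelay * 2^attempt + random(0, baseDelay * 0.2)
// The maxDelayMs cap is an added assumption to keep delays bounded.
function backoffDelay(attempt, baseDelayMs = 1000, maxDelayMs = 30000, random = Math.random) {
  const exponential = baseDelayMs * 2 ** attempt;  // 1s, 2s, 4s, 8s...
  const jitter = random() * baseDelayMs * 0.2;     // between 0 and 20% of the base delay
  return Math.min(exponential + jitter, maxDelayMs);
}

for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`attempt ${attempt}: wait ~${Math.round(backoffDelay(attempt))}ms`);
}
```

Because each client draws its own jitter, clients that failed at the same moment are unlikely to retry at exactly the same moment.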

Q3: What is the Bulkhead pattern?

Answer:

Concept: Isolate components the way a ship's bulkheads stop flooding from spreading between compartments.

Implementation:

Separate thread pools per dependency
Limit concurrent calls per service
Queue excess requests with timeout

Example:

Payment service: 20 concurrent max
Inventory service: 50 concurrent max
If payment is slow, only payment pool is affected

Benefits:

Failure isolation
Prevents resource exhaustion
Graceful degradation
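
In Node.js, where there are no thread pools per dependency, the same idea is usually expressed as a cap on concurrent calls. The following is a minimal sketch under that assumption: `SimpleBulkhead` and its parameters are illustrative names, and the queue timeout mentioned above is left out to keep the example short.

```javascript
// Minimal sketch of a concurrency-limiting bulkhead; names are illustrative.
class SimpleBulkhead {
  constructor(maxConcurrent, maxQueue) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.active = 0;   // calls currently holding a slot
    this.queue = [];   // resolvers for callers waiting on a slot
  }

  async execute(fn) {
    if (this.active >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new Error('Bulkhead full'); // reject instead of queuing without bound
      }
      // Wait until a finishing call hands its slot over
      await new Promise((resolve) => this.queue.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await fn();
    } finally {
      const next = this.queue.shift();
      if (next) next();     // pass the slot directly to the next waiter
      else this.active--;   // nobody waiting: free the slot
    }
  }
}

// One bulkhead per dependency, so a slow payment service cannot use up inventory slots.
// The queue sizes here are illustrative assumptions.
const paymentBulkhead = new SimpleBulkhead(20, 50);
const inventoryBulkhead = new SimpleBulkhead(50, 100);
```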

Q4: How do you design cascading timeouts?

Answer:

Rule: Inner timeouts < Outer timeouts

Example:

API Gateway:    5000ms
└─ Order:       4000ms (leaves 1s for gateway error handling)
   └─ Payment:  3000ms (leaves 1s for order error handling)
      └─ Bank:  2000ms (leaves 1s for payment error handling)

Why:

Outer service needs time to handle timeout errors
Prevents double timeouts (inner and outer calls timing out together, which hides where the failure occurred)
Enables proper error responses at each layer
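
The rule can be illustrated with a small timeout wrapper. This is a sketch under stated assumptions: `withTimeout`, `callBank`, `callPayment`, and `callOrder` are hypothetical names, and the timeouts are scaled down 100x from the example above so the demo finishes quickly.

```javascript
// Rejects if `promise` does not settle within `ms`. Helper name is illustrative.
function withTimeout(promise, ms, label) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer once either side settles so it does not linger
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical layers: each one gives its downstream call less time than it has itself.
const callBank = () => sleep(200).then(() => 'charged');            // slow dependency
const callPayment = () => withTimeout(callBank(), 20, 'bank');      // inner budget: 20ms
const callOrder = () => withTimeout(callPayment(), 30, 'payment');  // outer budget: 30ms

async function demo() {
  try {
    return await callOrder();
  } catch (err) {
    return err.message; // the layer that timed out is named in the message
  }
}

demo().then(console.log);
```

Because the inner budget expires first, the error that reaches the caller names the bank call, and the order layer still has about 10ms left to turn it into a proper response.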

Summary

Key Takeaways

Circuit Breaker prevents cascade failures
Retry with exponential backoff + jitter
Bulkhead isolates failure domains
Cascading timeouts for proper error handling
Always have fallback strategies

Next Steps

In the next chapter, we’ll explore Observability - distributed tracing, logging, and monitoring.
