Resilience Patterns

In distributed systems, failure is inevitable. Resilience patterns help your services degrade gracefully and recover quickly.
Learning Objectives:
  • Implement circuit breaker pattern
  • Design effective retry strategies
  • Apply bulkhead isolation
  • Create fallback mechanisms
  • Handle timeouts properly

Why Resilience Matters

┌─────────────────────────────────────────────────────────────────────────────┐
│                         CASCADE FAILURE                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  WITHOUT RESILIENCE:                                                         │
│  ──────────────────────                                                      │
│                                                                              │
│  User ──▶ API ──▶ Order ──▶ Payment ──▶ ❌ Bank API (down)                  │
│              │       │         │                                             │
│              │       │         └── Threads blocked, timeout...              │
│              │       └── Connection pool exhausted...                       │
│              └── All requests queuing...                                    │
│  ▲                                                                           │
│  └── Eventually entire system fails!                                        │
│                                                                              │
│                                                                              │
│  WITH RESILIENCE:                                                            │
│  ─────────────────                                                           │
│                                                                              │
│  User ──▶ API ──▶ Order ──▶ Payment ──▶ ⚡ Circuit Breaker                  │
│                               │              │                              │
│                               │              └── Fast fail, use fallback    │
│                               └── Returns "Payment pending"                 │
│  ▲                                                                           │
│  └── System stays responsive, partial degradation only                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Circuit Breaker Pattern

Prevent cascade failures by “breaking the circuit” to failing services.

State Machine

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CIRCUIT BREAKER STATES                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                           ┌──────────────┐                                  │
│                           │    CLOSED    │  ◀── Normal operation            │
│                           │  (Healthy)   │      Requests pass through       │
│                           └──────┬───────┘                                  │
│                                  │                                          │
│                    Failure threshold reached                                │
│                                  │                                          │
│                                  ▼                                          │
│                           ┌──────────────┐                                  │
│      Timeout expires ───▶ │     OPEN     │  ◀── Fail fast                   │
│             │             │  (Tripped)   │      Return error immediately    │
│             │             └──────┬───────┘                                  │
│             │                    │                                          │
│             │         After reset timeout                                   │
│             │                    │                                          │
│             │                    ▼                                          │
│             │             ┌──────────────┐                                  │
│             └──────────── │  HALF-OPEN   │  ◀── Testing                     │
│                           │  (Testing)   │      Allow limited requests      │
│                           └──────┬───────┘                                  │
│                                  │                                          │
│              ┌───────────────────┴───────────────────┐                      │
│              │                                       │                      │
│        Test succeeds                           Test fails                   │
│              │                                       │                      │
│              ▼                                       ▼                      │
│        ┌──────────────┐                       ┌──────────────┐              │
│        │    CLOSED    │                       │     OPEN     │              │
│        └──────────────┘                       └──────────────┘              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation from Scratch

// resilience/CircuitBreaker.js
class CircuitBreaker {
  constructor(options = {}) {
    this.name = options.name || 'default';
    this.failureThreshold = options.failureThreshold || 5;
    this.successThreshold = options.successThreshold || 2;
    this.timeout = options.timeout || 10000;
    this.resetTimeout = options.resetTimeout || 30000;
    
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = null;
    this.nextAttemptTime = null;
    
    this.metrics = {
      totalCalls: 0,
      successfulCalls: 0,
      failedCalls: 0,
      rejectedCalls: 0,
      timeouts: 0
    };
    
    this.listeners = {
      stateChange: [],
      success: [],
      failure: [],
      reject: []
    };
  }

  async call(fn, fallback = null) {
    this.metrics.totalCalls++;
    
    // Check if circuit is open
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptTime) {
        this.metrics.rejectedCalls++;
        this.emit('reject', { reason: 'Circuit is open' });
        
        if (fallback) {
          return fallback();
        }
        throw new CircuitOpenError(this.name);
      }
      
      // Try half-open
      this.transitionTo('HALF_OPEN');
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error);
      
      if (fallback) {
        return fallback(error);
      }
      throw error;
    }
  }

  executeWithTimeout(fn) {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.metrics.timeouts++;
        reject(new TimeoutError(`Operation timed out after ${this.timeout}ms`));
      }, this.timeout);

      // Avoid the async-executor antipattern: run fn() and settle the outer promise
      Promise.resolve()
        .then(() => fn())
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }

  onSuccess() {
    this.metrics.successfulCalls++;
    this.emit('success');
    
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      
      if (this.successCount >= this.successThreshold) {
        this.transitionTo('CLOSED');
      }
    }
    
    this.failureCount = 0;
  }

  onFailure(error) {
    this.metrics.failedCalls++;
    this.lastFailureTime = Date.now();
    this.emit('failure', error);
    
    if (this.state === 'HALF_OPEN') {
      this.transitionTo('OPEN');
      return;
    }
    
    this.failureCount++;
    
    if (this.failureCount >= this.failureThreshold) {
      this.transitionTo('OPEN');
    }
  }

  transitionTo(newState) {
    const oldState = this.state;
    this.state = newState;
    
    if (newState === 'OPEN') {
      this.nextAttemptTime = Date.now() + this.resetTimeout;
    }
    
    if (newState === 'CLOSED') {
      this.failureCount = 0;
      this.successCount = 0;
    }
    
    if (newState === 'HALF_OPEN') {
      this.successCount = 0;
    }
    
    this.emit('stateChange', { from: oldState, to: newState });
    console.log(`Circuit [${this.name}]: ${oldState} → ${newState}`);
  }

  on(event, callback) {
    this.listeners[event].push(callback);
  }

  emit(event, data) {
    this.listeners[event]?.forEach(cb => cb(data));
  }

  getStatus() {
    return {
      name: this.name,
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      metrics: this.metrics,
      nextAttemptTime: this.nextAttemptTime
    };
  }
}

class CircuitOpenError extends Error {
  constructor(circuitName) {
    super(`Circuit breaker [${circuitName}] is open`);
    this.name = 'CircuitOpenError';
    this.circuitName = circuitName;
  }
}

class TimeoutError extends Error {
  constructor(message) {
    super(message);
    this.name = 'TimeoutError';
  }
}

module.exports = { CircuitBreaker, CircuitOpenError, TimeoutError };

Using with Service Clients

// services/PaymentServiceClient.js
const { CircuitBreaker } = require('../resilience/CircuitBreaker');

class PaymentServiceClient {
  constructor(httpClient, cache) {
    this.http = httpClient;
    this.cache = cache;
    
    this.circuitBreaker = new CircuitBreaker({
      name: 'payment-service',
      failureThreshold: 5,
      successThreshold: 2,
      timeout: 5000,
      resetTimeout: 30000
    });
    
    // Listen to state changes
    this.circuitBreaker.on('stateChange', ({ from, to }) => {
      if (to === 'OPEN') {
        this.alertOps(`Payment service circuit opened!`);
      }
    });
  }

  async processPayment(orderId, amount, paymentMethod) {
    return this.circuitBreaker.call(
      // Primary operation
      async () => {
        const response = await this.http.post('/payments', {
          orderId,
          amount,
          paymentMethod
        });
        return response.data;
      },
      // Fallback
      async (error) => {
        console.log('Using payment fallback:', error.message);
        
        // Queue for later processing
        await this.queuePaymentForRetry(orderId, amount, paymentMethod);
        
        return {
          status: 'PENDING',
          message: 'Payment queued for processing',
          retryId: generateId()
        };
      }
    );
  }

  async getPaymentStatus(paymentId) {
    return this.circuitBreaker.call(
      async () => {
        const response = await this.http.get(`/payments/${paymentId}`);
        // Cache successful responses
        await this.cache.set(`payment:${paymentId}`, response.data, 60);
        return response.data;
      },
      async () => {
        // Return cached data as fallback
        const cached = await this.cache.get(`payment:${paymentId}`);
        if (cached) {
          return { ...cached, _fromCache: true };
        }
        throw new Error('Payment status unavailable');
      }
    );
  }
}
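
Because every breaker exposes getStatus(), its state can be surfaced for dashboards or alerting. A minimal sketch, assuming an Express app and access to the PaymentServiceClient instance created at startup (the route path is illustrative):

// routes/circuitStatus.js (sketch)
module.exports = function registerCircuitStatusRoute(app, paymentClient) {
  // Expose the breaker's state, counters, and next attempt time for monitoring
  app.get('/admin/circuits/payment', (req, res) => {
    res.json(paymentClient.circuitBreaker.getStatus());
  });
};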

Retry Patterns

Retry Strategies

┌─────────────────────────────────────────────────────────────────────────────┐
│                         RETRY STRATEGIES                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IMMEDIATE RETRY (for transient failures):                                   │
│  ─────────────────────────────────────────                                  │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✓── [Success]                 │
│     │             │                │                                        │
│     0ms           0ms              0ms                                      │
│                                                                              │
│  LINEAR BACKOFF:                                                             │
│  ───────────────                                                             │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓             │
│     │             │                │                │                       │
│     0ms         1000ms           2000ms          3000ms                     │
│                                                                              │
│  EXPONENTIAL BACKOFF (recommended):                                          │
│  ──────────────────────────────────                                         │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓             │
│     │             │                │                │                       │
│     0ms         1000ms           2000ms          4000ms                     │
│                                                                              │
│  EXPONENTIAL BACKOFF WITH JITTER (best):                                     │
│  ───────────────────────────────────────                                    │
│  [Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓             │
│     │             │                │                │                       │
│     0ms       1000+rand        2000+rand       4000+rand                    │
│                                                                              │
│  Jitter prevents thundering herd when many clients retry simultaneously      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation

// resilience/Retry.js
class RetryPolicy {
  constructor(options = {}) {
    this.maxRetries = options.maxRetries || 3;
    this.strategy = options.strategy || 'exponential';
    this.baseDelay = options.baseDelay || 1000;
    this.maxDelay = options.maxDelay || 30000;
    this.jitterFactor = options.jitterFactor || 0.2;
    this.retryableErrors = options.retryableErrors || this.defaultRetryableErrors();
  }

  defaultRetryableErrors() {
    return [
      'ECONNRESET',
      'ETIMEDOUT',
      'ECONNREFUSED',
      'NETWORK_ERROR'
    ];
  }

  async execute(operation) {
    let lastError;
    
    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error;
        
        if (!this.isRetryable(error) || attempt === this.maxRetries) {
          throw error;
        }
        
        // Pass the error so subclasses (e.g., HttpRetryPolicy) can honor Retry-After
        const delay = this.calculateDelay(attempt, error);
        console.log(`Retry ${attempt + 1}/${this.maxRetries} after ${delay}ms`);
        
        await this.sleep(delay);
      }
    }
    
    throw lastError;
  }

  isRetryable(error) {
    // Network errors
    if (this.retryableErrors.includes(error.code)) {
      return true;
    }
    
    // HTTP status codes
    const retryableStatusCodes = [408, 429, 500, 502, 503, 504];
    if (retryableStatusCodes.includes(error.statusCode)) {
      return true;
    }
    
    // Circuit breaker open (will retry after reset)
    if (error.name === 'CircuitOpenError') {
      return false; // Don't retry, circuit breaker handles it
    }
    
    return false;
  }

  calculateDelay(attempt) {
    let delay;
    
    switch (this.strategy) {
      case 'fixed':
        delay = this.baseDelay;
        break;
        
      case 'linear':
        delay = this.baseDelay * (attempt + 1);
        break;
        
      case 'exponential':
        delay = this.baseDelay * Math.pow(2, attempt);
        break;
        
      default:
        delay = this.baseDelay;
    }
    
    // Apply jitter
    const jitter = delay * this.jitterFactor * Math.random();
    delay = delay + jitter;
    
    // Cap at max delay
    return Math.min(delay, this.maxDelay);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Retry with specific status code handling
class HttpRetryPolicy extends RetryPolicy {
  constructor(options = {}) {
    super(options);
    this.statusCodeHandlers = options.statusCodeHandlers || {};
  }

  isRetryable(error) {
    if (error.statusCode === 429) {
      // Rate limited - respect Retry-After header
      return true;
    }
    
    return super.isRetryable(error);
  }

  calculateDelay(attempt, error) {
    // Respect Retry-After header
    if (error?.headers?.['retry-after']) {
      const retryAfter = parseInt(error.headers['retry-after']);
      if (!isNaN(retryAfter)) {
        return retryAfter * 1000;
      }
    }
    
    return super.calculateDelay(attempt);
  }
}

module.exports = { RetryPolicy, HttpRetryPolicy };
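
A short usage sketch (the user-service URL and the axios error normalization are illustrative assumptions, not part of the module): wrapping an HTTP call with HttpRetryPolicy so that 429 responses honor the Retry-After header while other retryable failures back off exponentially.

// Usage (sketch)
const axios = require('axios');
const { HttpRetryPolicy } = require('./resilience/Retry');

const httpRetry = new HttpRetryPolicy({ maxRetries: 4, baseDelay: 500 });

const user = await httpRetry.execute(async () => {
  try {
    const response = await axios.get('http://user-service/users/42');
    return response.data;
  } catch (error) {
    // Expose statusCode and headers so isRetryable/calculateDelay can inspect them
    error.statusCode = error.response?.status;
    error.headers = error.response?.headers;
    throw error;
  }
});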

Combining Circuit Breaker and Retry

// resilience/ResilientClient.js
const { CircuitBreaker } = require('./CircuitBreaker');
const { RetryPolicy } = require('./Retry');

class ResilientClient {
  constructor(options = {}) {
    this.circuitBreaker = new CircuitBreaker(options.circuitBreaker);
    this.retryPolicy = new RetryPolicy(options.retry);
  }

  async execute(operation, fallback = null) {
    return this.circuitBreaker.call(
      // Wrap operation with retry
      () => this.retryPolicy.execute(operation),
      fallback
    );
  }
}

// Usage
const resilientPaymentClient = new ResilientClient({
  circuitBreaker: {
    name: 'payment',
    failureThreshold: 5,
    resetTimeout: 30000
  },
  retry: {
    maxRetries: 3,
    strategy: 'exponential',
    baseDelay: 1000
  }
});

const result = await resilientPaymentClient.execute(
  async () => {
    return axios.post('http://payment-service/charge', { amount: 100 });
  },
  async () => {
    return { status: 'PENDING', queued: true };
  }
);

Bulkhead Pattern

Isolate failures to prevent them from affecting other parts of the system.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         BULKHEAD PATTERN                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  WITHOUT BULKHEAD:                                                           │
│  ─────────────────                                                           │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────┐            │
│  │                    SHARED THREAD POOL (100)                  │            │
│  │  [Payment] [Payment] [Payment] ... (all blocked)            │            │
│  │  [Orders] ✗ No threads available                            │            │
│  │  [Users] ✗ No threads available                             │            │
│  └─────────────────────────────────────────────────────────────┘            │
│                                                                              │
│  Payment service is slow → ALL services affected!                            │
│                                                                              │
│                                                                              │
│  WITH BULKHEAD:                                                              │
│  ───────────────                                                             │
│                                                                              │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐                │
│  │ Payment (30)    │ │ Orders (40)     │ │ Users (30)      │                │
│  │ [blocked...]    │ │ [working]       │ │ [working]       │                │
│  │ [blocked...]    │ │ [working]       │ │ [working]       │                │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘                │
│                                                                              │
│  Payment service is slow → Only payment pool affected!                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation

// resilience/Bulkhead.js
class Bulkhead {
  constructor(options = {}) {
    this.name = options.name || 'default';
    this.maxConcurrent = options.maxConcurrent || 10;
    this.maxQueue = options.maxQueue || 100;
    this.queueTimeout = options.queueTimeout || 30000;
    
    this.activeCount = 0;
    this.queue = [];
    
    this.metrics = {
      totalCalls: 0,
      activeCalls: 0,
      queuedCalls: 0,
      rejectedCalls: 0,
      completedCalls: 0,
      timeoutCalls: 0
    };
  }

  async execute(operation) {
    this.metrics.totalCalls++;
    
    // If within limit, execute immediately
    if (this.activeCount < this.maxConcurrent) {
      return this.executeOperation(operation);
    }
    
    // If queue is full, reject
    if (this.queue.length >= this.maxQueue) {
      this.metrics.rejectedCalls++;
      throw new BulkheadFullError(this.name);
    }
    
    // Queue the operation
    return this.queueOperation(operation);
  }

  async executeOperation(operation) {
    this.activeCount++;
    this.metrics.activeCalls = this.activeCount;
    
    try {
      const result = await operation();
      this.metrics.completedCalls++;
      return result;
    } finally {
      this.activeCount--;
      this.metrics.activeCalls = this.activeCount;
      this.processQueue();
    }
  }

  queueOperation(operation) {
    return new Promise((resolve, reject) => {
      const queueItem = {
        operation,
        resolve,
        reject,
        timestamp: Date.now(),
        timer: setTimeout(() => {
          this.removeFromQueue(queueItem);
          this.metrics.timeoutCalls++;
          reject(new QueueTimeoutError(this.name, this.queueTimeout));
        }, this.queueTimeout)
      };
      
      this.queue.push(queueItem);
      this.metrics.queuedCalls = this.queue.length;
    });
  }

  processQueue() {
    if (this.queue.length === 0 || this.activeCount >= this.maxConcurrent) {
      return;
    }
    
    const item = this.queue.shift();
    clearTimeout(item.timer);
    this.metrics.queuedCalls = this.queue.length;
    
    this.executeOperation(item.operation)
      .then(item.resolve)
      .catch(item.reject);
  }

  removeFromQueue(item) {
    const index = this.queue.indexOf(item);
    if (index > -1) {
      this.queue.splice(index, 1);
      this.metrics.queuedCalls = this.queue.length;
    }
  }

  getStatus() {
    return {
      name: this.name,
      activeCount: this.activeCount,
      queueLength: this.queue.length,
      metrics: this.metrics
    };
  }
}

class BulkheadFullError extends Error {
  constructor(name) {
    super(`Bulkhead [${name}] is full`);
    this.name = 'BulkheadFullError';
  }
}

class QueueTimeoutError extends Error {
  constructor(name, timeout) {
    super(`Queue timeout in bulkhead [${name}] after ${timeout}ms`);
    this.name = 'QueueTimeoutError';
  }
}

module.exports = { Bulkhead, BulkheadFullError, QueueTimeoutError };
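
A brief usage sketch (the payment-service URL is an assumption, and orderId/amount are taken from the surrounding request context): the queue-based bulkhead caps in-flight calls and sheds load once the queue fills.

// Usage (sketch)
const paymentBulkhead = new Bulkhead({
  name: 'payment',
  maxConcurrent: 20,  // at most 20 in-flight payment calls
  maxQueue: 50,       // queue up to 50 more, then reject with BulkheadFullError
  queueTimeout: 5000  // queued callers give up after 5 seconds
});

const charge = await paymentBulkhead.execute(() =>
  axios.post('http://payment-service/charge', { orderId, amount })
);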

Semaphore-Based Bulkhead

// resilience/SemaphoreBulkhead.js
class Semaphore {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
    this.waiting = [];
  }

  async acquire() {
    if (this.current < this.maxConcurrent) {
      this.current++;
      return;
    }

    await new Promise((resolve) => {
      this.waiting.push(resolve);
    });
  }

  release() {
    this.current--;
    
    if (this.waiting.length > 0) {
      const next = this.waiting.shift();
      this.current++;
      next();
    }
  }

  async execute(fn) {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}

// Usage with per-service bulkheads
class BulkheadManager {
  constructor() {
    this.bulkheads = new Map();
  }

  get(name, maxConcurrent = 10) {
    if (!this.bulkheads.has(name)) {
      this.bulkheads.set(name, new Semaphore(maxConcurrent));
    }
    return this.bulkheads.get(name);
  }

  async execute(name, maxConcurrent, fn) {
    const bulkhead = this.get(name, maxConcurrent);
    return bulkhead.execute(fn);
  }
}

// Usage
const bulkheadManager = new BulkheadManager();

// Payment operations limited to 20 concurrent
const paymentResult = await bulkheadManager.execute('payment', 20, async () => {
  return paymentService.processPayment(orderId, amount);
});

// Inventory operations limited to 50 concurrent
const stockResult = await bulkheadManager.execute('inventory', 50, async () => {
  return inventoryService.checkStock(productId);
});

Timeout Patterns

// resilience/Timeout.js
const { TimeoutError } = require('./CircuitBreaker');

class TimeoutPolicy {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  execute(operation, fallback = null) {
    return new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        if (fallback) {
          resolve(fallback());
        } else {
          reject(new TimeoutError(`Operation timed out after ${this.timeoutMs}ms`));
        }
      }, this.timeoutMs);

      // Avoid the async-executor antipattern: run the operation and settle the outer promise
      Promise.resolve()
        .then(() => operation())
        .then(result => {
          clearTimeout(timer);
          resolve(result);
        })
        .catch(error => {
          clearTimeout(timer);
          reject(error);
        });
    });
  }
}

// Cascading timeouts
class CascadingTimeout {
  // Each inner call must use a shorter timeout than its caller,
  // so the caller still has time to handle the timeout error
  
  /*
   API Gateway (5s) 
       └── Order Service (4s)
            └── Payment Service (3s)
                 └── Bank API (2s)
  */
  
  static forLayer(layer) {
    const timeouts = {
      gateway: 5000,
      application: 4000,
      integration: 3000,
      external: 2000
    };
    return new TimeoutPolicy(timeouts[layer] || 5000);
  }
}
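
// Usage sketch (illustrative): the payment-service URL and the fallback shape
// are assumptions, not part of the module.
const paymentTimeout = CascadingTimeout.forLayer('integration'); // 3000ms

const charge = await paymentTimeout.execute(
  () => axios.post('http://payment-service/charge', { amount: 100 }),
  // Optional fallback keeps the caller responsive if the call exceeds 3s
  () => ({ status: 'PENDING', reason: 'payment timed out, will be retried' })
);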

// Adaptive timeout based on p99 latency
class AdaptiveTimeout {
  constructor(options = {}) {
    this.baseTimeout = options.baseTimeout || 1000;
    this.multiplier = options.multiplier || 3;
    this.minTimeout = options.minTimeout || 500;
    this.maxTimeout = options.maxTimeout || 30000;
    this.latencies = [];
    this.windowSize = options.windowSize || 100;
  }

  recordLatency(latency) {
    this.latencies.push(latency);
    if (this.latencies.length > this.windowSize) {
      this.latencies.shift();
    }
  }

  getCurrentTimeout() {
    if (this.latencies.length < 10) {
      return this.baseTimeout;
    }

    // Calculate p99 latency
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const p99Index = Math.floor(sorted.length * 0.99);
    const p99 = sorted[p99Index];

    // Timeout = p99 * multiplier
    const timeout = p99 * this.multiplier;

    return Math.max(this.minTimeout, Math.min(timeout, this.maxTimeout));
  }

  async execute(operation) {
    const timeout = this.getCurrentTimeout();
    const start = Date.now();

    try {
      const result = await new TimeoutPolicy(timeout).execute(operation);
      this.recordLatency(Date.now() - start);
      return result;
    } catch (error) {
      if (!(error instanceof TimeoutError)) {
        this.recordLatency(Date.now() - start);
      }
      throw error;
    }
  }
}
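
// Usage sketch (illustrative): the inventory-service URL is an assumption.
// Every non-timeout result records its latency, so the effective timeout
// converges on roughly multiplier x the observed p99.
const inventoryTimeout = new AdaptiveTimeout({ baseTimeout: 1000, multiplier: 3 });

async function getStock(productId) {
  return inventoryTimeout.execute(() =>
    axios.get(`http://inventory-service/stock/${productId}`)
  );
}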

Complete Resilience Stack

// resilience/ResilientService.js
const { CircuitBreaker } = require('./CircuitBreaker');
const { RetryPolicy } = require('./Retry');
const { Bulkhead } = require('./Bulkhead');

class ResilientService {
  constructor(name, options = {}) {
    this.name = name;
    
    this.circuitBreaker = new CircuitBreaker({
      name,
      failureThreshold: options.circuitBreaker?.failureThreshold || 5,
      resetTimeout: options.circuitBreaker?.resetTimeout || 30000,
      timeout: options.timeout?.operation || 5000
    });
    
    this.retryPolicy = new RetryPolicy({
      maxRetries: options.retry?.maxRetries || 3,
      strategy: options.retry?.strategy || 'exponential',
      baseDelay: options.retry?.baseDelay || 1000
    });
    
    this.bulkhead = new Bulkhead({
      name,
      maxConcurrent: options.bulkhead?.maxConcurrent || 10,
      maxQueue: options.bulkhead?.maxQueue || 100
    });
    
    this.cache = options.cache;
    this.metrics = new ResilienceMetrics(name);
  }

  async execute(operationName, operation, options = {}) {
    const { 
      fallback = null,
      cacheKey = null,
      cacheTtl = 60
    } = options;

    const startTime = Date.now();

    try {
      // Layer 1: Bulkhead (limit concurrency)
      return await this.bulkhead.execute(async () => {
        
        // Layer 2: Circuit Breaker (fail fast)
        return await this.circuitBreaker.call(async () => {
          
          // Layer 3: Retry (handle transient failures)
          return await this.retryPolicy.execute(async () => {
            
            const result = await operation();
            
            // Cache successful results
            if (cacheKey && this.cache) {
              await this.cache.set(cacheKey, result, cacheTtl);
            }
            
            this.metrics.recordSuccess(operationName, Date.now() - startTime);
            return result;
          });
          
        }, async () => {
          // Circuit breaker fallback
          this.metrics.recordCircuitOpen(operationName);
          
          // Try cache
          if (cacheKey && this.cache) {
            const cached = await this.cache.get(cacheKey);
            if (cached) {
              this.metrics.recordCacheHit(operationName);
              return { ...cached, _fromCache: true };
            }
          }
          
          // Use provided fallback
          if (fallback) {
            return fallback();
          }
          
          throw new ServiceUnavailableError(this.name);
        });
      });
      
    } catch (error) {
      this.metrics.recordFailure(operationName, error, Date.now() - startTime);
      throw error;
    }
  }

  getStatus() {
    return {
      service: this.name,
      circuitBreaker: this.circuitBreaker.getStatus(),
      bulkhead: this.bulkhead.getStatus(),
      metrics: this.metrics.getSummary()
    };
  }
}

// Usage
const paymentService = new ResilientService('payment-service', {
  circuitBreaker: {
    failureThreshold: 5,
    resetTimeout: 30000
  },
  retry: {
    maxRetries: 3,
    strategy: 'exponential'
  },
  bulkhead: {
    maxConcurrent: 20,
    maxQueue: 50
  },
  cache: redisCache
});

const payment = await paymentService.execute(
  'processPayment',
  async () => {
    return axios.post('http://payment-api/charge', { amount });
  },
  {
    cacheKey: `payment:${orderId}`,
    cacheTtl: 300,
    fallback: () => ({
      status: 'PENDING',
      message: 'Payment will be processed shortly'
    })
  }
);

Health Checks

// health/HealthChecker.js
class HealthChecker {
  constructor() {
    this.checks = new Map();
  }

  register(name, checkFn, options = {}) {
    this.checks.set(name, {
      check: checkFn,
      critical: options.critical !== false,
      timeout: options.timeout || 5000
    });
  }

  async runCheck(name) {
    const config = this.checks.get(name);
    if (!config) {
      return { name, status: 'UNKNOWN', message: 'Check not found' };
    }

    try {
      const result = await Promise.race([
        config.check(),
        new Promise((_, reject) => 
          setTimeout(() => reject(new Error('Health check timeout')), config.timeout)
        )
      ]);

      return {
        name,
        status: 'HEALTHY',
        critical: config.critical,
        details: result
      };
    } catch (error) {
      return {
        name,
        status: 'UNHEALTHY',
        critical: config.critical,
        error: error.message
      };
    }
  }

  async runAllChecks() {
    const results = await Promise.all(
      Array.from(this.checks.keys()).map(name => this.runCheck(name))
    );

    const unhealthyCritical = results.filter(
      r => r.status === 'UNHEALTHY' && r.critical
    );

    return {
      status: unhealthyCritical.length > 0 ? 'UNHEALTHY' : 'HEALTHY',
      timestamp: new Date().toISOString(),
      checks: results
    };
  }
}

// Setup health checks
const healthChecker = new HealthChecker();

healthChecker.register('database', async () => {
  await db.query('SELECT 1');
  return { connected: true };
}, { critical: true });

healthChecker.register('redis', async () => {
  await redis.ping();
  return { connected: true };
}, { critical: true });

healthChecker.register('payment-service', async () => {
  const response = await axios.get('http://payment-service/health', { timeout: 2000 });
  return response.data;
}, { critical: false });

// Health endpoint
app.get('/health', async (req, res) => {
  const health = await healthChecker.runAllChecks();
  const statusCode = health.status === 'HEALTHY' ? 200 : 503;
  res.status(statusCode).json(health);
});

// Liveness probe (for Kubernetes)
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'UP' });
});

// Readiness probe (for Kubernetes)
app.get('/health/ready', async (req, res) => {
  const health = await healthChecker.runAllChecks();
  const statusCode = health.status === 'HEALTHY' ? 200 : 503;
  res.status(statusCode).json(health);
});

Interview Questions

Question: How does the circuit breaker pattern work?

Answer: A circuit breaker prevents cascade failures by monitoring call failures.

States:
  1. CLOSED: Normal operation, requests pass through
  2. OPEN: Failure threshold reached, requests fail immediately
  3. HALF-OPEN: After the reset timeout, allows limited requests to test recovery

Parameters:
  • Failure threshold (e.g., 50% of calls in 10 seconds)
  • Reset timeout (e.g., 30 seconds)
  • Success threshold for recovery (e.g., 3 successful calls)

Benefits: Fast failure, reduced load on the failing service, automatic recovery.
Question: Why use exponential backoff with jitter for retries?

Answer:

Exponential backoff:
  • Increases the wait time between retries (1s, 2s, 4s, 8s…)
  • Reduces load on the recovering service
  • Allows more time for transient issues to resolve

Jitter:
  • Adds randomness to each delay (e.g., 2s + 0–400ms)
  • Prevents the “thundering herd” where all clients retry simultaneously
  • Spreads retry load evenly over time

delay = baseDelay * 2^attempt * (1 + jitterFactor * random(0, 1))

With baseDelay = 1000ms and jitterFactor = 0.2, that gives roughly 1000–1200ms, 2000–2400ms, and 4000–4800ms for the first three retries.
Question: What is the bulkhead pattern and when would you use it?

Answer:

Concept: Isolate components the way a ship's bulkheads stop flooding from spreading.

Implementation:
  • Separate thread/connection pools per dependency
  • Limit concurrent calls per service
  • Queue excess requests with a timeout

Example:
  • Payment service: 20 concurrent max
  • Inventory service: 50 concurrent max
  • If payment is slow, only the payment pool is affected

Benefits:
  • Failure isolation
  • Prevents resource exhaustion
  • Graceful degradation
Question: How should timeouts be configured across service layers?

Answer:

Rule: Inner timeouts < outer timeouts.

Example:
API Gateway:    5000ms
└─ Order:       4000ms (leaves 1s for gateway error handling)
   └─ Payment:  3000ms (leaves 1s for order error handling)
      └─ Bank:  2000ms (leaves 1s for payment error handling)

Why:
  • The outer service needs time to handle the timeout error from the layer below
  • Prevents double timeouts (inner and outer both timing out)
  • Enables a proper error response at each layer

Summary

Key Takeaways

  • Circuit Breaker prevents cascade failures
  • Retry with exponential backoff + jitter
  • Bulkhead isolates failure domains
  • Cascading timeouts for proper error handling
  • Always have fallback strategies

Next Steps

In the next chapter, we’ll explore Observability: distributed tracing, logging, and monitoring.