Resilience Patterns
In distributed systems, failure is inevitable. Resilience patterns help your services degrade gracefully and recover quickly.

Learning Objectives:
- Implement circuit breaker pattern
- Design effective retry strategies
- Apply bulkhead isolation
- Create fallback mechanisms
- Handle timeouts properly
Why Resilience Matters
CASCADE FAILURE

WITHOUT RESILIENCE:

User ──▶ API ──▶ Order ──▶ Payment ──▶ ❌ Bank API (down)
          │        │          │
          │        │          └── Threads blocked, timeouts...
          │        └── Connection pool exhausted...
          └── All requests queuing...
              → Eventually the entire system fails!

WITH RESILIENCE:

User ──▶ API ──▶ Order ──▶ Payment ──▶ ⚡ Circuit Breaker
          │                   │
          │                   └── Fast fail, use fallback
          └── Returns "Payment pending"
              → System stays responsive, partial degradation only
Circuit Breaker Pattern
Prevent cascade failures by “breaking the circuit” to failing services.

State Machine
CIRCUIT BREAKER STATES

       ┌──────────────┐
       │    CLOSED    │ ◀── Normal operation:
       │  (Healthy)   │     requests pass through
       └──────┬───────┘
              │
              │ failure threshold reached
              ▼
       ┌──────────────┐
       │     OPEN     │ ◀── Fail fast:
       │  (Tripped)   │     return error immediately
       └──────┬───────┘
              │
              │ after reset timeout
              ▼
       ┌──────────────┐
       │  HALF-OPEN   │ ◀── Testing:
       │  (Testing)   │     allow limited requests
       └──────┬───────┘
              │
      ┌───────┴────────┐
      │                │
 test succeeds     test fails
      │                │
      ▼                ▼
┌──────────────┐ ┌──────────────┐
│    CLOSED    │ │     OPEN     │
└──────────────┘ └──────────────┘
Implementation from Scratch
// resilience/CircuitBreaker.js
class CircuitBreaker {
  constructor(options = {}) {
    this.name = options.name || 'default';
    this.failureThreshold = options.failureThreshold || 5;
    this.successThreshold = options.successThreshold || 2;
    this.timeout = options.timeout || 10000;
    this.resetTimeout = options.resetTimeout || 30000;

    this.state = 'CLOSED';
    this.failureCount = 0;
    this.successCount = 0;
    this.lastFailureTime = null;
    this.nextAttemptTime = null;

    this.metrics = {
      totalCalls: 0,
      successfulCalls: 0,
      failedCalls: 0,
      rejectedCalls: 0,
      timeouts: 0
    };

    this.listeners = {
      stateChange: [],
      success: [],
      failure: [],
      reject: []
    };
  }

  async call(fn, fallback = null) {
    this.metrics.totalCalls++;

    // Check if circuit is open
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptTime) {
        this.metrics.rejectedCalls++;
        this.emit('reject', { reason: 'Circuit is open' });
        if (fallback) {
          return fallback();
        }
        throw new CircuitOpenError(this.name);
      }
      // Reset timeout has elapsed: try half-open
      this.transitionTo('HALF_OPEN');
    }

    try {
      const result = await this.executeWithTimeout(fn);
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure(error);
      if (fallback) {
        return fallback(error);
      }
      throw error;
    }
  }

  executeWithTimeout(fn) {
    // Race the operation against a timer instead of using an async
    // Promise executor (an antipattern that can swallow rejections)
    let timer;
    const timeoutPromise = new Promise((_, reject) => {
      timer = setTimeout(() => {
        this.metrics.timeouts++;
        reject(new TimeoutError(`Operation timed out after ${this.timeout}ms`));
      }, this.timeout);
    });
    return Promise.race([fn(), timeoutPromise])
      .finally(() => clearTimeout(timer));
  }

  onSuccess() {
    this.metrics.successfulCalls++;
    this.emit('success');

    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= this.successThreshold) {
        this.transitionTo('CLOSED');
      }
    }
    this.failureCount = 0;
  }

  onFailure(error) {
    this.metrics.failedCalls++;
    this.lastFailureTime = Date.now();
    this.emit('failure', error);

    if (this.state === 'HALF_OPEN') {
      // A single failure during testing re-opens the circuit
      this.transitionTo('OPEN');
      return;
    }

    this.failureCount++;
    if (this.failureCount >= this.failureThreshold) {
      this.transitionTo('OPEN');
    }
  }

  transitionTo(newState) {
    const oldState = this.state;
    this.state = newState;

    if (newState === 'OPEN') {
      this.nextAttemptTime = Date.now() + this.resetTimeout;
    }
    if (newState === 'CLOSED') {
      this.failureCount = 0;
      this.successCount = 0;
    }
    if (newState === 'HALF_OPEN') {
      this.successCount = 0;
    }

    this.emit('stateChange', { from: oldState, to: newState });
    console.log(`Circuit [${this.name}]: ${oldState} → ${newState}`);
  }

  on(event, callback) {
    this.listeners[event].push(callback);
  }

  emit(event, data) {
    this.listeners[event]?.forEach(cb => cb(data));
  }

  getStatus() {
    return {
      name: this.name,
      state: this.state,
      failureCount: this.failureCount,
      successCount: this.successCount,
      metrics: this.metrics,
      nextAttemptTime: this.nextAttemptTime
    };
  }
}

class CircuitOpenError extends Error {
  constructor(circuitName) {
    super(`Circuit breaker [${circuitName}] is open`);
    this.name = 'CircuitOpenError';
    this.circuitName = circuitName;
  }
}

class TimeoutError extends Error {
  constructor(message) {
    super(message);
    this.name = 'TimeoutError';
  }
}

module.exports = { CircuitBreaker, CircuitOpenError, TimeoutError };
Using with Service Clients
// services/PaymentServiceClient.js
const { CircuitBreaker } = require('../resilience/CircuitBreaker');

class PaymentServiceClient {
  constructor(httpClient, cache) {
    this.http = httpClient;
    this.cache = cache;

    this.circuitBreaker = new CircuitBreaker({
      name: 'payment-service',
      failureThreshold: 5,
      successThreshold: 2,
      timeout: 5000,
      resetTimeout: 30000
    });

    // Listen to state changes
    this.circuitBreaker.on('stateChange', ({ from, to }) => {
      if (to === 'OPEN') {
        this.alertOps('Payment service circuit opened!');
      }
    });
  }

  async processPayment(orderId, amount, paymentMethod) {
    return this.circuitBreaker.call(
      // Primary operation
      async () => {
        const response = await this.http.post('/payments', {
          orderId,
          amount,
          paymentMethod
        });
        return response.data;
      },
      // Fallback
      async (error) => {
        console.log('Using payment fallback:', error.message);
        // Queue for later processing
        await this.queuePaymentForRetry(orderId, amount, paymentMethod);
        return {
          status: 'PENDING',
          message: 'Payment queued for processing',
          retryId: generateId() // assumes an ID helper defined elsewhere
        };
      }
    );
  }

  async getPaymentStatus(paymentId) {
    return this.circuitBreaker.call(
      async () => {
        const response = await this.http.get(`/payments/${paymentId}`);
        // Cache successful responses
        await this.cache.set(`payment:${paymentId}`, response.data, 60);
        return response.data;
      },
      async () => {
        // Return cached data as fallback
        const cached = await this.cache.get(`payment:${paymentId}`);
        if (cached) {
          return { ...cached, _fromCache: true };
        }
        throw new Error('Payment status unavailable');
      }
    );
  }
}
Retry Patterns
Retry Strategies
RETRY STRATEGIES

IMMEDIATE RETRY (for transient failures):

[Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✓── [Success]
     0ms            0ms             0ms

LINEAR BACKOFF:

[Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓
     0ms          1000ms          2000ms          3000ms

EXPONENTIAL BACKOFF (recommended):

[Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓
     0ms          1000ms          2000ms          4000ms

EXPONENTIAL BACKOFF WITH JITTER (best):

[Request] ──✗── [Retry 1] ──✗── [Retry 2] ──✗── [Retry 3] ──✓
     0ms        1000+rand       2000+rand       4000+rand

Jitter prevents a thundering herd when many clients retry simultaneously.
Implementation
// resilience/Retry.js
class RetryPolicy {
  constructor(options = {}) {
    this.maxRetries = options.maxRetries || 3;
    this.strategy = options.strategy || 'exponential';
    this.baseDelay = options.baseDelay || 1000;
    this.maxDelay = options.maxDelay || 30000;
    this.jitterFactor = options.jitterFactor || 0.2;
    this.retryableErrors = options.retryableErrors || this.defaultRetryableErrors();
  }

  defaultRetryableErrors() {
    return [
      'ECONNRESET',
      'ETIMEDOUT',
      'ECONNREFUSED',
      'NETWORK_ERROR'
    ];
  }

  async execute(operation) {
    let lastError;

    for (let attempt = 0; attempt <= this.maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error) {
        lastError = error;

        if (!this.isRetryable(error) || attempt === this.maxRetries) {
          throw error;
        }

        // Pass the error so subclasses can inspect it (e.g. Retry-After)
        const delay = this.calculateDelay(attempt, error);
        console.log(`Retry ${attempt + 1}/${this.maxRetries} after ${delay}ms`);
        await this.sleep(delay);
      }
    }

    throw lastError;
  }

  isRetryable(error) {
    // Network errors
    if (this.retryableErrors.includes(error.code)) {
      return true;
    }

    // HTTP status codes
    const retryableStatusCodes = [408, 429, 500, 502, 503, 504];
    if (retryableStatusCodes.includes(error.statusCode)) {
      return true;
    }

    // Circuit breaker open: don't retry, the breaker handles recovery
    if (error.name === 'CircuitOpenError') {
      return false;
    }

    return false;
  }

  calculateDelay(attempt) {
    let delay;

    switch (this.strategy) {
      case 'fixed':
        delay = this.baseDelay;
        break;
      case 'linear':
        delay = this.baseDelay * (attempt + 1);
        break;
      case 'exponential':
        delay = this.baseDelay * Math.pow(2, attempt);
        break;
      default:
        delay = this.baseDelay;
    }

    // Apply jitter
    const jitter = delay * this.jitterFactor * Math.random();
    delay = delay + jitter;

    // Cap at max delay
    return Math.min(delay, this.maxDelay);
  }

  sleep(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Retry with specific status code handling
class HttpRetryPolicy extends RetryPolicy {
  constructor(options = {}) {
    super(options);
    this.statusCodeHandlers = options.statusCodeHandlers || {};
  }

  isRetryable(error) {
    if (error.statusCode === 429) {
      // Rate limited - respect Retry-After header
      return true;
    }
    return super.isRetryable(error);
  }

  calculateDelay(attempt, error) {
    // Respect Retry-After header
    if (error?.headers?.['retry-after']) {
      const retryAfter = parseInt(error.headers['retry-after'], 10);
      if (!isNaN(retryAfter)) {
        return retryAfter * 1000;
      }
    }
    return super.calculateDelay(attempt);
  }
}

module.exports = { RetryPolicy, HttpRetryPolicy };
Combining Circuit Breaker and Retry
// resilience/ResilientClient.js
const { CircuitBreaker } = require('./CircuitBreaker');
const { RetryPolicy } = require('./Retry');

class ResilientClient {
  constructor(options = {}) {
    this.circuitBreaker = new CircuitBreaker(options.circuitBreaker);
    this.retryPolicy = new RetryPolicy(options.retry);
  }

  async execute(operation, fallback = null) {
    return this.circuitBreaker.call(
      // Wrap operation with retry
      () => this.retryPolicy.execute(operation),
      fallback
    );
  }
}

// Usage
const resilientPaymentClient = new ResilientClient({
  circuitBreaker: {
    name: 'payment',
    failureThreshold: 5,
    resetTimeout: 30000
  },
  retry: {
    maxRetries: 3,
    strategy: 'exponential',
    baseDelay: 1000
  }
});

const result = await resilientPaymentClient.execute(
  async () => {
    return axios.post('http://payment-service/charge', { amount: 100 });
  },
  async () => {
    return { status: 'PENDING', queued: true };
  }
);
Bulkhead Pattern
Isolate failures to prevent them from affecting other parts of the system.
BULKHEAD PATTERN

WITHOUT BULKHEAD:

┌─────────────────────────────────────────────────┐
│           SHARED THREAD POOL (100)              │
│  [Payment] [Payment] [Payment] ... (all blocked)│
│  [Orders]  ✗ No threads available               │
│  [Users]   ✗ No threads available               │
└─────────────────────────────────────────────────┘

Payment service is slow → ALL services affected!

WITH BULKHEAD:

┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Payment (30)  │  │  Orders (40)  │  │  Users (30)   │
│ [blocked...]  │  │  [working]    │  │  [working]    │
│ [blocked...]  │  │  [working]    │  │  [working]    │
└───────────────┘  └───────────────┘  └───────────────┘

Payment service is slow → only the payment pool is affected!
Implementation
// resilience/Bulkhead.js
class Bulkhead {
  constructor(options = {}) {
    this.name = options.name || 'default';
    this.maxConcurrent = options.maxConcurrent || 10;
    this.maxQueue = options.maxQueue || 100;
    this.queueTimeout = options.queueTimeout || 30000;

    this.activeCount = 0;
    this.queue = [];

    this.metrics = {
      totalCalls: 0,
      activeCalls: 0,
      queuedCalls: 0,
      rejectedCalls: 0,
      completedCalls: 0,
      timeoutCalls: 0
    };
  }

  async execute(operation) {
    this.metrics.totalCalls++;

    // If within limit, execute immediately
    if (this.activeCount < this.maxConcurrent) {
      return this.executeOperation(operation);
    }

    // If queue is full, reject
    if (this.queue.length >= this.maxQueue) {
      this.metrics.rejectedCalls++;
      throw new BulkheadFullError(this.name);
    }

    // Queue the operation
    return this.queueOperation(operation);
  }

  async executeOperation(operation) {
    this.activeCount++;
    this.metrics.activeCalls = this.activeCount;

    try {
      const result = await operation();
      this.metrics.completedCalls++;
      return result;
    } finally {
      this.activeCount--;
      this.metrics.activeCalls = this.activeCount;
      this.processQueue();
    }
  }

  queueOperation(operation) {
    return new Promise((resolve, reject) => {
      const queueItem = {
        operation,
        resolve,
        reject,
        timestamp: Date.now(),
        timer: setTimeout(() => {
          this.removeFromQueue(queueItem);
          this.metrics.timeoutCalls++;
          reject(new QueueTimeoutError(this.name, this.queueTimeout));
        }, this.queueTimeout)
      };

      this.queue.push(queueItem);
      this.metrics.queuedCalls = this.queue.length;
    });
  }

  processQueue() {
    if (this.queue.length === 0 || this.activeCount >= this.maxConcurrent) {
      return;
    }

    const item = this.queue.shift();
    clearTimeout(item.timer);
    this.metrics.queuedCalls = this.queue.length;

    this.executeOperation(item.operation)
      .then(item.resolve)
      .catch(item.reject);
  }

  removeFromQueue(item) {
    const index = this.queue.indexOf(item);
    if (index > -1) {
      this.queue.splice(index, 1);
      this.metrics.queuedCalls = this.queue.length;
    }
  }

  getStatus() {
    return {
      name: this.name,
      activeCount: this.activeCount,
      queueLength: this.queue.length,
      metrics: this.metrics
    };
  }
}

class BulkheadFullError extends Error {
  constructor(name) {
    super(`Bulkhead [${name}] is full`);
    this.name = 'BulkheadFullError';
  }
}

class QueueTimeoutError extends Error {
  constructor(name, timeout) {
    super(`Queue timeout in bulkhead [${name}] after ${timeout}ms`);
    this.name = 'QueueTimeoutError';
  }
}

module.exports = { Bulkhead, BulkheadFullError, QueueTimeoutError };
Semaphore-Based Bulkhead
// resilience/SemaphoreBulkhead.js
class Semaphore {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.current = 0;
    this.waiting = [];
  }

  async acquire() {
    if (this.current < this.maxConcurrent) {
      this.current++;
      return;
    }
    await new Promise((resolve) => {
      this.waiting.push(resolve);
    });
  }

  release() {
    this.current--;
    if (this.waiting.length > 0) {
      const next = this.waiting.shift();
      this.current++;
      next();
    }
  }

  async execute(fn) {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}

// Usage with per-service bulkheads
class BulkheadManager {
  constructor() {
    this.bulkheads = new Map();
  }

  get(name, maxConcurrent = 10) {
    if (!this.bulkheads.has(name)) {
      this.bulkheads.set(name, new Semaphore(maxConcurrent));
    }
    return this.bulkheads.get(name);
  }

  async execute(name, maxConcurrent, fn) {
    const bulkhead = this.get(name, maxConcurrent);
    return bulkhead.execute(fn);
  }
}

// Usage
const bulkheadManager = new BulkheadManager();

// Payment operations limited to 20 concurrent
const paymentResult = await bulkheadManager.execute('payment', 20, async () => {
  return paymentService.processPayment(orderId, amount);
});

// Inventory operations limited to 50 concurrent
const stockResult = await bulkheadManager.execute('inventory', 50, async () => {
  return inventoryService.checkStock(productId);
});
Timeout Patterns
// resilience/Timeout.js
const { TimeoutError } = require('./CircuitBreaker');

class TimeoutPolicy {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  execute(operation, fallback = null) {
    // Race the operation against a timer (avoids the async-executor antipattern)
    let timer;
    const timeoutPromise = new Promise((resolve, reject) => {
      timer = setTimeout(() => {
        if (fallback) {
          resolve(fallback());
        } else {
          reject(new TimeoutError(`Operation timed out after ${this.timeoutMs}ms`));
        }
      }, this.timeoutMs);
    });

    return Promise.race([operation(), timeoutPromise])
      .finally(() => clearTimeout(timer));
  }
}

// Cascading timeouts
class CascadingTimeout {
  // Inner services must time out before outer services,
  // leaving each outer layer time for error handling:
  /*
    API Gateway (5s)
      └── Order Service (4s)
            └── Payment Service (3s)
                  └── Bank API (2s)
  */
  static forLayer(layer) {
    const timeouts = {
      gateway: 5000,
      application: 4000,
      integration: 3000,
      external: 2000
    };
    return new TimeoutPolicy(timeouts[layer] || 5000);
  }
}

// Adaptive timeout based on p99 latency
class AdaptiveTimeout {
  constructor(options = {}) {
    this.baseTimeout = options.baseTimeout || 1000;
    this.multiplier = options.multiplier || 3;
    this.minTimeout = options.minTimeout || 500;
    this.maxTimeout = options.maxTimeout || 30000;
    this.latencies = [];
    this.windowSize = options.windowSize || 100;
  }

  recordLatency(latency) {
    this.latencies.push(latency);
    if (this.latencies.length > this.windowSize) {
      this.latencies.shift();
    }
  }

  getCurrentTimeout() {
    if (this.latencies.length < 10) {
      return this.baseTimeout;
    }

    // Calculate p99 latency
    const sorted = [...this.latencies].sort((a, b) => a - b);
    const p99Index = Math.floor(sorted.length * 0.99);
    const p99 = sorted[p99Index];

    // Timeout = p99 * multiplier, clamped to [min, max]
    const timeout = p99 * this.multiplier;
    return Math.max(this.minTimeout, Math.min(timeout, this.maxTimeout));
  }

  async execute(operation) {
    const timeout = this.getCurrentTimeout();
    const start = Date.now();

    try {
      const result = await new TimeoutPolicy(timeout).execute(operation);
      this.recordLatency(Date.now() - start);
      return result;
    } catch (error) {
      // Don't record timed-out calls: they would skew the latency window
      if (!(error instanceof TimeoutError)) {
        this.recordLatency(Date.now() - start);
      }
      throw error;
    }
  }
}
Complete Resilience Stack
// resilience/ResilientService.js
const { CircuitBreaker } = require('./CircuitBreaker');
const { RetryPolicy } = require('./Retry');
const { Bulkhead } = require('./Bulkhead');
// ResilienceMetrics and ServiceUnavailableError are assumed to be
// defined elsewhere in the project.

class ResilientService {
  constructor(name, options = {}) {
    this.name = name;

    this.circuitBreaker = new CircuitBreaker({
      name,
      failureThreshold: options.circuitBreaker?.failureThreshold || 5,
      resetTimeout: options.circuitBreaker?.resetTimeout || 30000,
      timeout: options.timeout?.operation || 5000
    });

    this.retryPolicy = new RetryPolicy({
      maxRetries: options.retry?.maxRetries || 3,
      strategy: options.retry?.strategy || 'exponential',
      baseDelay: options.retry?.baseDelay || 1000
    });

    this.bulkhead = new Bulkhead({
      name,
      maxConcurrent: options.bulkhead?.maxConcurrent || 10,
      maxQueue: options.bulkhead?.maxQueue || 100
    });

    this.cache = options.cache;
    this.metrics = new ResilienceMetrics(name);
  }

  async execute(operationName, operation, options = {}) {
    const {
      fallback = null,
      cacheKey = null,
      cacheTtl = 60
    } = options;

    const startTime = Date.now();

    try {
      // Layer 1: Bulkhead (limit concurrency)
      return await this.bulkhead.execute(async () => {
        // Layer 2: Circuit breaker (fail fast)
        return await this.circuitBreaker.call(async () => {
          // Layer 3: Retry (handle transient failures)
          return await this.retryPolicy.execute(async () => {
            const result = await operation();

            // Cache successful results
            if (cacheKey && this.cache) {
              await this.cache.set(cacheKey, result, cacheTtl);
            }

            this.metrics.recordSuccess(operationName, Date.now() - startTime);
            return result;
          });
        }, async () => {
          // Circuit breaker fallback
          this.metrics.recordCircuitOpen(operationName);

          // Try cache
          if (cacheKey && this.cache) {
            const cached = await this.cache.get(cacheKey);
            if (cached) {
              this.metrics.recordCacheHit(operationName);
              return { ...cached, _fromCache: true };
            }
          }

          // Use provided fallback
          if (fallback) {
            return fallback();
          }

          throw new ServiceUnavailableError(this.name);
        });
      });
    } catch (error) {
      this.metrics.recordFailure(operationName, error, Date.now() - startTime);
      throw error;
    }
  }

  getStatus() {
    return {
      service: this.name,
      circuitBreaker: this.circuitBreaker.getStatus(),
      bulkhead: this.bulkhead.getStatus(),
      metrics: this.metrics.getSummary()
    };
  }
}

// Usage
const paymentService = new ResilientService('payment-service', {
  circuitBreaker: {
    failureThreshold: 5,
    resetTimeout: 30000
  },
  retry: {
    maxRetries: 3,
    strategy: 'exponential'
  },
  bulkhead: {
    maxConcurrent: 20,
    maxQueue: 50
  },
  cache: redisCache
});

const payment = await paymentService.execute(
  'processPayment',
  async () => {
    return axios.post('http://payment-api/charge', { amount });
  },
  {
    cacheKey: `payment:${orderId}`,
    cacheTtl: 300,
    fallback: () => ({
      status: 'PENDING',
      message: 'Payment will be processed shortly'
    })
  }
);
Health Checks
// health/HealthChecker.js
class HealthChecker {
  constructor() {
    this.checks = new Map();
  }

  register(name, checkFn, options = {}) {
    this.checks.set(name, {
      check: checkFn,
      critical: options.critical !== false,
      timeout: options.timeout || 5000
    });
  }

  async runCheck(name) {
    const config = this.checks.get(name);
    if (!config) {
      return { name, status: 'UNKNOWN', message: 'Check not found' };
    }

    try {
      const result = await Promise.race([
        config.check(),
        new Promise((_, reject) =>
          setTimeout(() => reject(new Error('Health check timeout')), config.timeout)
        )
      ]);

      return {
        name,
        status: 'HEALTHY',
        critical: config.critical,
        details: result
      };
    } catch (error) {
      return {
        name,
        status: 'UNHEALTHY',
        critical: config.critical,
        error: error.message
      };
    }
  }

  async runAllChecks() {
    const results = await Promise.all(
      Array.from(this.checks.keys()).map(name => this.runCheck(name))
    );

    const unhealthyCritical = results.filter(
      r => r.status === 'UNHEALTHY' && r.critical
    );

    return {
      status: unhealthyCritical.length > 0 ? 'UNHEALTHY' : 'HEALTHY',
      timestamp: new Date().toISOString(),
      checks: results
    };
  }
}

// Setup health checks
const healthChecker = new HealthChecker();

healthChecker.register('database', async () => {
  await db.query('SELECT 1');
  return { connected: true };
}, { critical: true });

healthChecker.register('redis', async () => {
  await redis.ping();
  return { connected: true };
}, { critical: true });

healthChecker.register('payment-service', async () => {
  const response = await axios.get('http://payment-service/health', { timeout: 2000 });
  return response.data;
}, { critical: false });

// Health endpoint
app.get('/health', async (req, res) => {
  const health = await healthChecker.runAllChecks();
  const statusCode = health.status === 'HEALTHY' ? 200 : 503;
  res.status(statusCode).json(health);
});

// Liveness probe (for Kubernetes)
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'UP' });
});

// Readiness probe (for Kubernetes)
app.get('/health/ready', async (req, res) => {
  const health = await healthChecker.runAllChecks();
  const statusCode = health.status === 'HEALTHY' ? 200 : 503;
  res.status(statusCode).json(health);
});
Interview Questions
Q1: Explain the Circuit Breaker pattern and its states
Answer: A circuit breaker prevents cascade failures by monitoring call failures.

States:
- CLOSED: Normal operation, requests pass through
- OPEN: Failure threshold reached, requests fail immediately
- HALF-OPEN: After reset timeout, allows limited requests to test recovery

Key configuration:
- Failure threshold (e.g., 50% in 10 seconds)
- Reset timeout (e.g., 30 seconds)
- Success threshold for recovery (e.g., 3 successful calls)
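Those thresholds can be seen in action with a minimal, hypothetical state tracker (names and defaults here are illustrative; the full CircuitBreaker class earlier in the chapter adds timeouts, metrics, and events):

```javascript
// Hypothetical minimal breaker, for illustrating state transitions only.
class MiniBreaker {
  constructor({ failureThreshold = 3, resetTimeout = 100 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }
  recordFailure(now = Date.now()) {
    if (this.state === 'HALF_OPEN') return this.trip(now); // re-open on any test failure
    this.failures++;
    if (this.failures >= this.failureThreshold) this.trip(now);
  }
  recordSuccess() {
    this.failures = 0;
    if (this.state === 'HALF_OPEN') this.state = 'CLOSED';
  }
  trip(now) {
    this.state = 'OPEN';
    this.openedAt = now;
  }
  canAttempt(now = Date.now()) {
    if (this.state !== 'OPEN') return true;
    if (now - this.openedAt >= this.resetTimeout) {
      this.state = 'HALF_OPEN'; // reset timeout elapsed: allow a test request
      return true;
    }
    return false; // still open: fail fast
  }
}

const b = new MiniBreaker();
b.recordFailure();
b.recordFailure();
b.recordFailure();
console.log(b.state); // OPEN after 3 failures
console.log(b.canAttempt(b.openedAt + 200)); // true (transitions to HALF_OPEN)
b.recordSuccess();
console.log(b.state); // CLOSED
```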
Q2: Why use exponential backoff with jitter?
Answer:

Exponential backoff:
- Increases the wait time between retries (1s, 2s, 4s, 8s…)
- Reduces load on the recovering service
- Allows more time for transient issues to resolve

Jitter:
- Adds randomness to the delay (e.g., 2s + 0-500ms)
- Prevents the “thundering herd” where all clients retry simultaneously
- Spreads retry load evenly over time
delay = baseDelay * 2^attempt + random(0, baseDelay * 0.2)
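That formula can be sketched as a standalone function (parameter names and defaults are illustrative):

```javascript
// Exponential backoff with multiplicative jitter, capped at a maximum delay.
function backoffDelay(attempt, baseDelay = 1000, jitterFactor = 0.2, maxDelay = 30000) {
  const exponential = baseDelay * Math.pow(2, attempt);       // 1s, 2s, 4s, 8s...
  const jitter = exponential * jitterFactor * Math.random();  // spreads out retries
  return Math.min(exponential + jitter, maxDelay);            // never exceed the cap
}

// Each client computes a slightly different delay, avoiding synchronized retries
for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`attempt ${attempt}: ~${Math.round(backoffDelay(attempt))}ms`);
}
```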
Q3: What is the Bulkhead pattern?
Answer:

Concept: Isolate components the way a ship’s bulkheads contain flooding.

Implementation:
- Separate thread pools per dependency
- Limit concurrent calls per service
- Queue excess requests, with a timeout

Example:
- Payment service: 20 concurrent max
- Inventory service: 50 concurrent max
- If payment is slow, only the payment pool is affected

Benefits:
- Failure isolation
- Prevents resource exhaustion
- Graceful degradation
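The isolation property can be checked with a small standalone sketch (independent of the Semaphore class earlier in the chapter; all names are illustrative): no matter how many tasks arrive, concurrency never exceeds the pool size.

```javascript
// Hypothetical minimal semaphore that caps concurrent executions.
class MiniSemaphore {
  constructor(max) {
    this.max = max;
    this.current = 0;
    this.waiting = [];
  }
  async acquire() {
    if (this.current < this.max) {
      this.current++;
      return;
    }
    await new Promise(resolve => this.waiting.push(resolve)); // wait for a slot
    this.current++;
  }
  release() {
    this.current--;
    const next = this.waiting.shift();
    if (next) next();
  }
}

// Run 10 tasks through a 2-slot bulkhead and record peak concurrency
async function runDemo() {
  const sem = new MiniSemaphore(2);
  let active = 0;
  let peak = 0;

  const task = async () => {
    await sem.acquire();
    try {
      active++;
      peak = Math.max(peak, active);
      await new Promise(r => setTimeout(r, 10)); // simulated work
    } finally {
      active--;
      sem.release();
    }
  };

  await Promise.all(Array.from({ length: 10 }, () => task()));
  return peak;
}

runDemo().then(peak => console.log('peak concurrency:', peak)); // stays at 2
```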
Q4: How do you design cascading timeouts?
Answer:

Rule: Inner timeouts must be shorter than outer timeouts.

Example:

API Gateway: 5000ms
 └─ Order: 4000ms (leaves 1s for gateway error handling)
    └─ Payment: 3000ms (leaves 1s for order error handling)
       └─ Bank: 2000ms (leaves 1s for payment error handling)

Why:
- The outer service needs time to handle timeout errors
- Prevents double timeouts (inner times out, then outer times out)
- Enables proper error responses at each layer
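Such a budget can be derived mechanically with a hypothetical helper (the one-second margin is an assumption matching the example above; names are illustrative):

```javascript
// Assign each layer the remaining budget, reserving a fixed
// error-handling margin per hop.
function layerTimeouts(outerBudgetMs, layers, marginMs = 1000) {
  const result = {};
  let budget = outerBudgetMs;
  for (const layer of layers) {
    result[layer] = budget;
    budget -= marginMs; // each inner layer gets less time than its caller
  }
  return result;
}

console.log(layerTimeouts(5000, ['gateway', 'order', 'payment', 'bank']));
// { gateway: 5000, order: 4000, payment: 3000, bank: 2000 }
```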
Summary
Key Takeaways
- Circuit Breaker prevents cascade failures
- Retry with exponential backoff + jitter
- Bulkhead isolates failure domains
- Cascading timeouts for proper error handling
- Always have fallback strategies
Next Steps
In the next chapter, we’ll explore Observability: distributed tracing, logging, and monitoring.