Fault Tolerance Patterns
In distributed systems, failures are not exceptional events; they are expected. This module covers battle-tested patterns for building systems that survive and thrive despite failures.
Track Duration: 12-16 hours
Key Topics: Circuit Breakers, Bulkheads, Retries, Timeouts, Health Checks, Graceful Degradation
Interview Focus: Netflix Hystrix patterns, cascading failure prevention, failure domain isolation
The Reality of Failure
Module 37: Timeout Patterns
The Foundation of Fault Tolerance
Without timeouts, a slow dependency can block your entire system forever.
Timeout Implementation
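The original implementation block is not reproduced here; below is a minimal sketch of the idea in Python, assuming an async call path and using `asyncio.wait_for` to bound how long we wait. The `fetch_profile` function is a hypothetical stand-in for a real dependency call.

```python
import asyncio

async def fetch_profile(user_id: str) -> dict:
    # Placeholder for a real downstream call (HTTP, gRPC, DB query, ...).
    await asyncio.sleep(0.05)
    return {"user_id": user_id, "name": "example"}

async def fetch_profile_with_timeout(user_id: str, timeout_s: float = 0.2) -> dict:
    """Bound how long we are willing to wait for the dependency."""
    try:
        return await asyncio.wait_for(fetch_profile(user_id), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Fail fast instead of letting the caller hang; the caller can
        # retry, fall back, or surface a degraded response.
        raise TimeoutError(f"profile lookup exceeded {timeout_s}s")

if __name__ == "__main__":
    print(asyncio.run(fetch_profile_with_timeout("u-123")))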
Timeout Best Practices
Module 38: Retry Patterns
When and How to Retry
Exponential Backoff with Jitter
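A sketch of the idea (not the module's original code): exponential backoff with "full jitter", where each sleep is drawn uniformly between zero and an exponentially growing cap. The `flaky_http_get` name in the usage comment is hypothetical.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry `call` with exponential backoff and full jitter.

    The sleep is drawn uniformly from [0, min(max_delay, base * 2**attempt)],
    which avoids synchronized retry storms across many clients.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))

# Usage: retry_with_backoff(lambda: flaky_http_get("/inventory"))
```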
Retry Amplification Problem
Advanced: The “Tail at Scale” (Hedged Requests & Backup Tasks)
In large-scale systems (like Google Search), the total system latency is determined by the slowest node. As you increase the number of nodes involved in a request, the probability of hitting a “tail” latency (p99 or p99.9) increases dramatically.
1. Hedged Requests
Instead of waiting for a timeout to retry, a system sends the same request to multiple replicas simultaneously.
- Classic Hedge: Send to two replicas, take the fastest response.
- Adaptive Hedge: Send to one replica. If no response arrives within a high-percentile latency threshold (e.g., the p95), send a second request to another replica.
- Result: Drastically reduces tail latency with only a small (~5%) increase in total load.
2. Backup Tasks (MapReduce/Spark)
In batch processing, a single “straggler” task can delay an entire job.
- When a job is nearly finished but a few tasks are still running, the scheduler spawns Backup Tasks (replicas of the slow tasks).
- The first one to finish wins, and the others are killed.
3. Tie-breaking (The “Power of Two Choices” variant)
When a client has multiple healthy replicas, it sends a probe to two of them. It then sends the actual request to the one with the shorter queue (or better historical latency).
Staff Tip: When designing for low latency, don’t just “fix the slow node.” Assume some nodes will be slow (due to GC, background tasks, or noisy neighbors) and build “Hedged” logic into your client libraries to route around them.
Advanced: Tail Latency Optimization (p99/p999 Patterns)
At scale, tail latency (p99, p99.9) becomes more important than average latency. Here’s why and how to fix it.
The Math of Fan-Out
If a single server exceeds its p99 latency 1% of the time, a request that fans out to 100 servers waits on at least one slow server about 63% of the time (1 - 0.99^100 ≈ 0.63). The more backends a request touches, the more the tail dominates end-to-end latency.
Pattern 1: Hedged Requests
Send duplicate requests to reduce waiting for slow nodes.
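A minimal asyncio sketch of an adaptive hedge, assuming two interchangeable replicas and a hedge delay set near the observed p95; `query_replica` is a hypothetical stand-in for the real replica call.

```python
import asyncio
import random

async def query_replica(name: str, key: str) -> str:
    # Stand-in for a real replica call; latency varies per replica.
    await asyncio.sleep(random.uniform(0.01, 0.3))
    return f"{key} from {name}"

async def hedged_get(key: str, replicas: list[str], hedge_after_s: float = 0.05) -> str:
    """Send to one replica; if it hasn't answered within `hedge_after_s`
    (roughly the observed p95), send a backup request and take the winner."""
    first = asyncio.create_task(query_replica(replicas[0], key))
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()
    backup = asyncio.create_task(query_replica(replicas[1], key))
    done, pending = await asyncio.wait({first, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # the slower request is no longer needed
    return done.pop().result()

if __name__ == "__main__":
    print(asyncio.run(hedged_get("user:42", ["replica-a", "replica-b"])))
```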
Pattern 2: Request Coalescing
Combine duplicate requests to reduce backend load.
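A sketch of coalescing using a singleflight-style helper, assuming an asyncio service; the `Coalescer` class and `load_product` function are illustrative names, not a specific library API.

```python
import asyncio

class Coalescer:
    """Collapse concurrent identical requests into one backend call
    (similar in spirit to Go's singleflight)."""

    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch):
        existing = self._inflight.get(key)
        if existing is not None:
            return await existing                 # piggyback on the in-flight call
        future = asyncio.get_running_loop().create_future()
        self._inflight[key] = future
        try:
            result = await fetch()
            future.set_result(result)
            return result
        except Exception as exc:
            future.set_exception(exc)
            raise
        finally:
            del self._inflight[key]               # next request triggers a fresh fetch

async def main():
    calls = 0

    async def load_product():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.1)                  # simulated slow backend read
        return {"id": "p1"}

    c = Coalescer()
    results = await asyncio.gather(*[c.do("product:p1", load_product) for _ in range(50)])
    print(len(results), "responses,", calls, "backend call(s)")

if __name__ == "__main__":
    asyncio.run(main())
```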
Pattern 3: Adaptive Timeout Percentile Tracking
Don’t use static timeouts; track latency percentiles and adapt.
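One possible shape for percentile-driven timeouts, assuming you can record per-request latencies; the window size, percentile, and multiplier here are illustrative defaults, not values from the original module.

```python
import random
from collections import deque

class AdaptiveTimeout:
    """Track recent latencies and derive the timeout from a percentile
    (e.g., p99 * safety margin) instead of a hard-coded constant."""

    def __init__(self, window=1000, percentile=0.99, multiplier=1.5,
                 floor_s=0.05, ceiling_s=2.0):
        self.samples = deque(maxlen=window)
        self.percentile = percentile
        self.multiplier = multiplier
        self.floor_s = floor_s
        self.ceiling_s = ceiling_s

    def record(self, latency_s: float) -> None:
        self.samples.append(latency_s)

    def current_timeout(self) -> float:
        if not self.samples:
            return self.ceiling_s                   # no data yet: be permissive
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(self.percentile * len(ordered)))
        candidate = ordered[idx] * self.multiplier
        return max(self.floor_s, min(self.ceiling_s, candidate))

if __name__ == "__main__":
    t = AdaptiveTimeout()
    for _ in range(500):
        t.record(random.lognormvariate(-3.0, 0.5))  # synthetic latency samples
    print(f"adaptive timeout: {t.current_timeout():.3f}s")
```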
Pattern 4: Goodput Optimization
Maximize goodput (successful work), not throughput (total work).
p99 Optimization Checklist
| Technique | p99 Improvement | Cost |
|---|---|---|
| Hedged requests | 50-80% | +5% load |
| Request coalescing | Variable | Complexity |
| LIFO queuing | 30-50% | Unfairness |
| Adaptive timeouts | 20-40% | Monitoring |
| Pre-warming connections | 50%+ on cold | Memory |
| GC tuning | 30-50% | Expertise |
| Disable Nagle’s algorithm | Variable | Bandwidth |
Module 39: Circuit Breaker Pattern
The circuit breaker prevents cascading failures by failing fast when a dependency is unhealthy.
Circuit Breaker Implementation
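Since the original implementation block is not included here, this is a minimal in-process sketch of the CLOSED → OPEN → HALF_OPEN state machine; the thresholds and the `charge_card` usage example are hypothetical.

```python
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN after N consecutive failures,
    OPEN -> HALF_OPEN after a cooldown, HALF_OPEN -> CLOSED on one success."""

    def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout_s:
                self.state = "HALF_OPEN"          # allow one trial request through
            else:
                raise CircuitOpenError("failing fast: dependency marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = time.monotonic()

# Usage: breaker = CircuitBreaker(); breaker.call(charge_card, payment)
```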
Module 40: Bulkhead Isolation
Isolate components so a failure in one doesn’t sink the entire ship.
Bulkhead Types
Thread Pool Bulkhead
Separate thread pools for different dependencies.
Pros: Strong isolation, works for blocking calls
Cons: Resource overhead, context switching
Use for: HTTP clients, database connections
Semaphore Bulkhead
Limit concurrent calls with semaphores.
Pros: Lightweight, works with async
Cons: Weaker isolation, no queue
Use for: Rate limiting, quick operations
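A small sketch of the semaphore variant, assuming asyncio; note that it rejects rather than queues when all permits are taken, which matches the "no queue" trade-off above. Class and function names are illustrative.

```python
import asyncio

class BulkheadFullError(Exception):
    pass

class SemaphoreBulkhead:
    """Cap concurrent calls to one dependency so it cannot consume
    every worker when it slows down. No queue: excess calls are rejected."""

    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_fn, *args, **kwargs):
        if self._sem.locked():                     # all permits taken: shed instead of queueing
            raise BulkheadFullError("bulkhead at capacity")
        async with self._sem:
            return await coro_fn(*args, **kwargs)

async def main():
    bulkhead = SemaphoreBulkhead(max_concurrent=2)

    async def slow_dependency(i):
        await asyncio.sleep(0.2)
        return i

    results = await asyncio.gather(
        *[bulkhead.run(slow_dependency, i) for i in range(5)],
        return_exceptions=True,
    )
    print(results)   # two calls complete, the rest are rejected immediately

if __name__ == "__main__":
    asyncio.run(main())
```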
Distributed Rate Limiting
Generic Cell Rate Algorithm (GCRA)
For Staff-level infrastructure, basic “Fixed Window” or “Sliding Window” rate limiting is often insufficient due to burstiness. GCRA, used in ATM networks and by companies like Stripe and Cloudflare (via limitador), provides a sophisticated way to handle rate limiting with sub-millisecond precision.
How it works: TAT (Theoretical Arrival Time)
Instead of a counter, GCRA tracks the Theoretical Arrival Time (TAT) of the next allowed request (a sketch of this logic follows the list).
- Emission Interval (T): The time between requests (e.g., if the limit is 10 req/sec, T = 100 ms).
- Burst Tolerance (τ): How much earlier than the TAT a request can arrive.
- Logic:
  - When a request arrives at time t:
    - If t < TAT - τ: Reject (too early).
    - Else: Accept and update TAT = max(t, TAT) + T.
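A minimal in-memory sketch of the TAT logic above; the parameter names (`rate`, `burst`) are illustrative, and in production the same check-and-update would run atomically per key (e.g., in Redis).

```python
import time

class GCRA:
    """In-memory GCRA limiter: `rate` requests/sec with bursts of `burst`.

    In production the same check-and-update runs atomically per key in a
    shared store; this sketch keeps it local for clarity."""

    def __init__(self, rate, burst):
        self.emission_interval = 1.0 / rate                      # T: ideal gap between requests
        self.tolerance = self.emission_interval * (burst - 1)    # tau: how far ahead of schedule we allow
        self.tat = 0.0                                           # theoretical arrival time of next request

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if now < self.tat - self.tolerance:
            return False                                         # arrived too early: reject
        self.tat = max(now, self.tat) + self.emission_interval   # schedule the next slot
        return True

if __name__ == "__main__":
    limiter = GCRA(rate=10, burst=5)
    allowed = sum(limiter.allow(now=0.0) for _ in range(20))
    print(f"{allowed} of 20 simultaneous requests allowed")      # the burst is honored, the rest rejected
```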
Implementation with Redis
GCRA is easy to implement atomically in Redis using a single key and SET ... GET or a Lua script.
- Memory Efficient: Only stores one number (the TAT) per user/key.
- Fairness: Smooths out traffic better than leaky bucket.
- Precision: Handles high-frequency requests without the “window reset” spikes of fixed-window algorithms.
Control Theory & Adaptive Load Shedding
While rate limiting (GCRA) protects against specific users, Load Shedding protects the system from its own congestion. Principal-level systems often move beyond static thresholds to Control Theory (PID Controllers) to maintain stability under extreme load.
The Feedback Loop
In control theory, we treat the system as a process with an output (e.g., CPU, latency) that we want to keep at a Set Point.
- Process Variable (PV): The measured value (e.g., current 99th-percentile latency).
- Set Point (SP): The desired value (e.g., 200ms).
- Error (e): The gap between measured and desired values (here e = PV - SP, so a positive error means we are over the latency target).
- Manipulated Variable (MV): What we change to fix the error (e.g., the percentage of traffic we drop).
PID Controllers in Load Shedding
A PID (Proportional-Integral-Derivative) controller calculates the rejection rate based on the error over time; a minimal sketch follows the list.
- Proportional (P): Drops traffic based on the current error. If latency is high, drop more now.
- Integral (I): Drops traffic based on the history of the error. If latency has been high for a while, increase the rejection rate even if it is currently coming down.
- Derivative (D): Drops traffic based on the rate of change. If latency is spiking rapidly, drop traffic aggressively before it hits the limit.
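A sketch of a PID-driven shedder, assuming the controller is fed a p99 measurement once per interval; the gains shown are illustrative and would need tuning per system.

```python
class PIDLoadShedder:
    """PID controller that maps latency error to a rejection probability."""

    def __init__(self, target_p99_s: float, kp=0.8, ki=0.1, kd=0.4):
        self.target = target_p99_s
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured_p99_s: float, dt_s: float = 1.0) -> float:
        """Return the fraction of traffic to reject (0.0 - 1.0)."""
        error = (measured_p99_s - self.target) / self.target    # > 0 means over target
        self.integral = max(0.0, self.integral + error * dt_s)  # don't bank credit from past slack
        derivative = (error - self.prev_error) / dt_s
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        return min(1.0, max(0.0, output))                       # clamp to a valid probability

# Example: called once per second from a metrics loop.
shedder = PIDLoadShedder(target_p99_s=0.200)
for p99 in [0.15, 0.25, 0.40, 0.35, 0.22]:
    print(f"p99={p99:.2f}s -> reject {shedder.update(p99):.0%} of traffic")
```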
Little’s Law and Congestion
Load shedding is mathematically grounded in Little’s Law: L = λW (Items = Arrival Rate × Wait Time). If W (latency) increases while L (concurrent requests) is capped by bulkheads, the system must reduce λ (arrival rate) or it will enter a “Congestion Collapse” where it spends more time context switching than doing work.
Health Checks
Levels of Health Checking
Health Check Implementation
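The original implementation is not shown; as a sketch, here is a tiny standard-library HTTP server exposing separate liveness (/healthz) and readiness (/readyz) endpoints, with `check_database` and `check_cache` as hypothetical dependency probes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    # Placeholder: in a real service this would run `SELECT 1` with a short timeout.
    return True

def check_cache() -> bool:
    # Placeholder: e.g., a cache PING with a short timeout.
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and able to serve HTTP at all.
            self._reply(200, {"status": "alive"})
        elif self.path == "/readyz":
            # Readiness: dependencies are reachable; if not, the load balancer
            # should take this instance out of rotation (but not restart it).
            checks = {"database": check_database(), "cache": check_cache()}
            ok = all(checks.values())
            self._reply(200 if ok else 503,
                        {"status": "ready" if ok else "degraded", "checks": checks})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code: int, body: dict):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```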
Graceful Degradation
When things fail, fail gracefully instead of completely.
Implementing Graceful Degradation
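A sketch of a fallback chain. The class name GracefulDegrader matches the one referenced in the scenarios below, but this particular implementation and the search strategies are assumptions for illustration.

```python
import logging

class GracefulDegrader:
    """Try increasingly degraded strategies in order and return the first
    one that succeeds, instead of failing the whole request."""

    def __init__(self, *strategies):
        # Each strategy is a (name, callable) pair, ordered best-first.
        self.strategies = strategies

    def execute(self, *args, **kwargs):
        last_error = None
        for name, strategy in self.strategies:
            try:
                return name, strategy(*args, **kwargs)
            except Exception as exc:
                logging.warning("strategy %r failed: %s", name, exc)
                last_error = exc
        raise RuntimeError("all degradation levels failed") from last_error

# Hypothetical search example: best option first, cheapest fallback last.
def live_search(q):      raise TimeoutError("search cluster overloaded")
def cached_results(q):   return [f"cached hit for {q!r}"]
def popular_results(q):  return ["trending item 1", "trending item 2"]

degrader = GracefulDegrader(
    ("live", live_search),
    ("cached", cached_results),
    ("popular", popular_results),
)
print(degrader.execute("wireless headphones"))
```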
Load Shedding
When overwhelmed, strategically drop load to preserve core functionality.
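A sketch of priority-based shedding; the PriorityLoadShedder name matches the one referenced in the scenarios below, while the thresholds and probabilistic drop curve are illustrative choices.

```python
import random

class PriorityLoadShedder:
    """Shed low-priority work first as utilization climbs.
    Priority 0 = critical (checkout); higher numbers = more expendable."""

    # Utilization level above which each priority class starts being dropped.
    THRESHOLDS = {0: 0.95, 1: 0.85, 2: 0.70}

    def should_accept(self, priority: int, utilization: float) -> bool:
        threshold = self.THRESHOLDS.get(priority, 0.60)
        if utilization <= threshold:
            return True
        # Above the threshold, drop probabilistically rather than all-or-nothing,
        # which keeps behavior smooth as load fluctuates.
        overload = (utilization - threshold) / max(1e-9, 1.0 - threshold)
        return random.random() > overload

shedder = PriorityLoadShedder()
for prio, name in [(0, "checkout"), (1, "full-text search"), (2, "autocomplete")]:
    accepted = sum(shedder.should_accept(prio, utilization=0.90) for _ in range(1000))
    print(f"{name}: accepted ~{accepted / 10:.0f}% at 90% utilization")
```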
Module 41: Graceful Degradation
Combining Patterns
Real systems use multiple patterns together to maintain partial functionality during failures:
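As a compact, self-contained sketch (with toy stand-ins rather than the module's original code), here is one request path that combines a per-attempt timeout, bounded retries with jittered backoff, a crude circuit state, and a cached fallback.

```python
import asyncio
import random

STATE = {"consecutive_failures": 0, "circuit_open": False}   # toy circuit state

async def flaky_inventory_service(sku: str) -> int:
    await asyncio.sleep(random.uniform(0.01, 0.3))           # sometimes slow
    if random.random() < 0.3:                                 # sometimes failing
        raise RuntimeError("inventory backend error")
    return 42

def cached_inventory(sku: str) -> int:
    return 40                                                 # stale-but-useful fallback

async def get_inventory(sku: str) -> int:
    if STATE["circuit_open"]:
        return cached_inventory(sku)                          # fail fast + fallback
    for attempt in range(3):                                  # bounded retries
        try:
            count = await asyncio.wait_for(                   # per-attempt timeout
                flaky_inventory_service(sku), timeout=0.15)
            STATE["consecutive_failures"] = 0
            return count
        except (asyncio.TimeoutError, RuntimeError):
            STATE["consecutive_failures"] += 1
            if STATE["consecutive_failures"] >= 5:
                STATE["circuit_open"] = True                  # stop hammering the backend
                break
            await asyncio.sleep(random.uniform(0, 0.05 * 2 ** attempt))  # backoff + jitter
    return cached_inventory(sku)                              # graceful degradation

if __name__ == "__main__":
    print(asyncio.run(get_inventory("sku-1")))
```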
Advanced Design Scenarios
Scenario 1: Resilient Payment Service
You are responsible for a payment service that calls multiple downstream providers (card processor, fraud service, ledger). Outages or latency at any dependency must not take down checkout globally.
Requirements:
- Never charge a customer twice for the same request.
- Degrade gracefully to limited functionality (e.g., only some payment methods) during partial outages.
- Protect core services from cascading failures.
Design:
- Timeouts + retries with idempotency:
  - Each payment request carries a unique idempotency_key.
  - The gateway retries only idempotent operations, with exponential backoff + jitter.
- Circuit breakers per downstream:
  - Separate circuits for card-processor, fraud-service, and ledger.
  - Open circuits fail fast and use GracefulDegrader to fall back.
- Bulkheads:
  - Dedicated connection/thread pools for each dependency.
  - Failure or slowness in fraud-service does not starve ledger calls.
- Fallback strategies:
  - fraud-service down → apply conservative rules + lower risk limits.
  - A secondary payment provider is configured for card processing.
Scenario 2: Multi-Region Read API with Safe Degradation
You operate a global read API (product catalog, user profiles) with a primary write region and multiple read replicas. You must stay up even if a region or inter-region link fails.
Design:
- Local region first:
- For each client, route to nearest healthy region using DNS or anycast.
- Within a region, use health checks and load shedding to reject traffic before overload.
- Cross-region fallback:
- If local DB or cache is unhealthy (readiness fails), circuit-break that dependency.
- Use GracefulDegrader to fall back to:
- Stale cache snapshot.
- A remote region’s read replica (with higher latency but still available).
- Health checks drive traffic routing: if an entire region is unhealthy, remove it from DNS rotation.
- Use bounded staleness policies: local caches can serve data for up to T seconds when both the DB and the remote region are unavailable.
Scenario 3: Protecting Core System with Load Shedding
A search service provides typeahead suggestions and full-text search. Under heavy load, it must prioritize checkout and order placement over search.
Strategy:
- Use PriorityLoadShedder in front of search.
- Assign higher priorities to requests originating from checkout flows, lower priorities for background/autocomplete.
- During overload, low-priority searches are dropped first, preserving capacity for critical flows.
- Combined with timeouts, retries, and circuit breakers, this keeps the system responsive instead of failing uniformly.
Advanced Testing: Deterministic Simulation
While Chaos Engineering (injecting failures in production) is powerful, it has a major flaw: non-determinism. If you find a bug in production, it’s often impossible to reproduce it exactly. Deterministic Simulation Testing (DST), pioneered by FoundationDB, is the “Holy Grail” of distributed systems testing.
The FoundationDB Philosophy
Every component of the system (the database, the network, the disk, and even time) is virtualized. The entire distributed system runs in a single-threaded, deterministic simulator.
Requirements for Determinism
To make a system simulation-ready, you must eliminate all sources of non-determinism (a sketch of a simulated clock follows this list):
- No Threads: Use an actor model or a single-threaded event loop (like Go’s goroutines with a deterministic scheduler, or C++ actors).
- No Direct Time: Never call time.Now(). Use a TimeSource interface that the simulator can inject.
- No Direct Randomness: Never call rand(). Use a PRNG seeded by the simulator.
- No Direct I/O: Virtualize network and disk calls so the simulator can fail or delay them at specific points.
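A sketch of the "no direct time" rule in Python terms: a SimClock that a simulator (or test) advances explicitly, so timers fire deterministically and faster than real time. The class and method names are illustrative, not FoundationDB's API.

```python
import heapq
from typing import Callable

class SimClock:
    """Deterministic replacement for wall-clock time: the test advances
    time explicitly, so 'waiting' code runs instantly and reproducibly."""

    def __init__(self):
        self._now = 0.0
        self._timers: list[tuple[float, int, Callable[[], None]]] = []
        self._seq = 0

    def now(self) -> float:
        return self._now

    def call_later(self, delay_s: float, callback: Callable[[], None]) -> None:
        self._seq += 1   # tie-breaker keeps ordering deterministic for equal deadlines
        heapq.heappush(self._timers, (self._now + delay_s, self._seq, callback))

    def advance(self, seconds: float) -> None:
        """Run every timer that fires within the next `seconds` of virtual time."""
        deadline = self._now + seconds
        while self._timers and self._timers[0][0] <= deadline:
            when, _, callback = heapq.heappop(self._timers)
            self._now = when
            callback()
        self._now = deadline

clock = SimClock()
clock.call_later(30.0, lambda: print("circuit breaker reset at t =", clock.now()))
clock.advance(60.0)   # a 60-second scenario runs in microseconds of real time
```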
DST vs. Chaos Engineering
| Feature | Chaos Engineering (Jepsen/Chaos Monkey) | Deterministic Simulation (DST) |
|---|---|---|
| Environment | Real network/Real OS | Single-process simulator |
| Speed | Real-time | Faster than real-time (no idle waits) |
| Reproducibility | Low (Heisenbugs) | 100% (Same seed = Same result) |
| Coverage | Surface-level failures | Deep logic races and corner cases |
Interview Questions
Q: Design a resilient API gateway
Key points to cover:
- Timeouts: Connection and request timeouts per backend
- Circuit Breakers: Per-backend, with different thresholds
- Bulkheads: Separate thread/connection pools per backend
- Retries: Only for idempotent operations, with backoff
- Rate Limiting: Per-client and global limits
- Health Checks: Remove unhealthy backends from rotation
- Fallbacks: Static responses, cached data, or error pages
Q: What's the retry amplification problem and how do you solve it?
Problem:
In a multi-tier system, each tier retrying multiplies the total attempts.
3 tiers × 3 attempts each (1 try + 2 retries) = 3³ = 27 actual requests for 1 user request.
Solutions:
- Retry at edge only: Internal services fail fast
- Retry budgets: Track the share of requests that are retries and stop retrying once it exceeds a budget (sketched below)
- Deadline propagation: Don’t retry if deadline passed
- Hedged requests: Send to multiple replicas, take first
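A sketch of a retry budget, assuming a sliding window over recent traffic; the 10% ratio is an illustrative default.

```python
import time
from collections import deque

class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of
    recent request volume (e.g., retries <= 10% of requests)."""

    def __init__(self, ratio=0.1, window_s=10.0):
        self.ratio = ratio
        self.window_s = window_s
        self.requests = deque()   # timestamps of first attempts
        self.retries = deque()    # timestamps of retries

    def _trim(self, q, now):
        while q and now - q[0] > self.window_s:
            q.popleft()

    def record_request(self):
        self.requests.append(time.monotonic())

    def can_retry(self) -> bool:
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        if len(self.retries) + 1 > self.ratio * max(1, len(self.requests)):
            return False          # budget exhausted: fail fast instead of amplifying
        self.retries.append(now)
        return True

# Usage: call record_request() on each first attempt and can_retry() before each retry.
```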
Q: How would you implement graceful degradation for a search feature?
Degradation levels:
- Full functionality: Real-time search with personalization
- Cached results: Return last known good results for query
- Popular results: Return trending/popular items
- Static results: Return hardcoded fallback results
- Maintenance mode: Show “search temporarily unavailable”
Key Takeaways
Design for Failure
Assume everything will fail. Build systems that survive failures gracefully.
Fail Fast
Use timeouts, circuit breakers, and health checks to detect failures quickly.
Isolate Failures
Use bulkheads to prevent failures from spreading. Contain the blast radius.
Degrade Gracefully
When things break, return something useful. Stale data beats no data.