Chapter 6: Fault Tolerance
Fault tolerance is at the heart of GFS’s design. Built for commodity hardware where failures are the norm, not the exception, GFS employs multiple layers of redundancy and recovery mechanisms. This chapter explores how GFS handles failures at every level, from disk corruption to datacenter outages.

Chapter Goals:
- Understand GFS’s multi-layered fault tolerance approach
- Master chunk replication and re-replication strategies
- Learn master replication and fast recovery mechanisms
- Explore failure detection and handling procedures
- Grasp data integrity maintenance across failures
Failure Assumptions
GFS was designed with specific failure assumptions based on Google’s operational experience.

Expected Failure Rates
In a cluster built from thousands of commodity machines, some disks and machines are effectively always failing or recovering at any given moment, so the design treats component failure as the steady state rather than an exceptional event.
Design for Failure
Assume Everything Fails
Core Philosophy:
- Disks fail continuously
- Machines crash regularly
- Network partitions happen
- Silent data corruption occurs
- Plan for all failure scenarios
Fast Detection
Quick Discovery:
- Heartbeat every 30-60 seconds
- Checksum verification on reads
- Version number staleness detection
- Client-reported errors
- Background scrubbing
Automatic Recovery
Self-Healing:
- Re-replication on detection
- Master failover automated
- Chunk migration for balance
- Garbage collection cleanup
- No manual intervention
Multiple Replicas
Redundancy:
- 3 replicas default per chunk
- Cross-rack placement
- Independent failure domains
- Can tolerate 2 failures
- Configurable replication factor
Chunk Replication
Chunk replication is the primary defense against data loss.

Replication Strategy
- Placement Policy
- Re-replication
- Rebalancing
Initial Replica Placement:
- Prefer chunkservers with below-average disk utilization
- Limit the number of recent chunk creations on any single chunkserver, since new chunks attract heavy write traffic
- Spread replicas across racks so no single rack failure can destroy every copy
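The selection logic might look roughly like the sketch below; the ChunkserverInfo fields, the max_recent threshold, and the two-pass rack spreading are illustrative assumptions, not GFS’s actual code.

```python
from dataclasses import dataclass

@dataclass
class ChunkserverInfo:
    # Illustrative fields; the real master tracks richer per-server state.
    name: str
    rack: str
    disk_utilization: float   # fraction of disk space in use
    recent_creations: int     # chunks created on this server recently

def pick_initial_replicas(servers, num_replicas=3, max_recent=10):
    """Choose lightly loaded servers and spread replicas across racks
    (hypothetical selection logic, not GFS's actual algorithm)."""
    # Skip servers that have absorbed too many recent creations, then
    # prefer the ones with the lowest disk utilization.
    candidates = sorted(
        (s for s in servers if s.recent_creations < max_recent),
        key=lambda s: s.disk_utilization)
    chosen, racks_used = [], set()
    # First pass: at most one replica per rack for failure independence.
    for s in candidates:
        if len(chosen) == num_replicas:
            break
        if s.rack not in racks_used:
            chosen.append(s)
            racks_used.add(s.rack)
    # Second pass: fill any remaining slots even if a rack repeats.
    for s in candidates:
        if len(chosen) == num_replicas:
            break
        if s not in chosen:
            chosen.append(s)
    return chosen

servers = [
    ChunkserverInfo("cs-1", "rack-a", 0.40, 2),
    ChunkserverInfo("cs-2", "rack-a", 0.55, 1),
    ChunkserverInfo("cs-3", "rack-b", 0.35, 0),
    ChunkserverInfo("cs-4", "rack-c", 0.60, 3),
]
print([s.name for s in pick_initial_replicas(servers)])  # ['cs-3', 'cs-1', 'cs-4']
```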
The “External Watchdog” (Chubby)
While GFS is self-contained for data, its Master Election is often coordinated by an external distributed locking service called Chubby.
- Master Election: When the master starts, it attempts to acquire a specific lock in Chubby. Only one process can hold this lock at a time, becoming the “Primary Master”.
- Monitoring: If the Primary Master crashes, its Chubby session expires and the lock is released.
- Failover: A waiting “Shadow Master” or a new process detects the released lock, acquires it, and promotes itself to Primary Master.
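A toy version of this election loop is sketched below. The ToyLockService stands in for Chubby; its method names (try_acquire, release) are invented for illustration and are not Chubby’s real API.

```python
import threading
import time

class ToyLockService:
    """Stand-in for an external lock service such as Chubby: a single
    advisory lock. The interface here is invented for illustration."""
    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None

    def try_acquire(self, client_id):
        with self._lock:
            if self._holder is None:
                self._holder = client_id
                return True
            return False

    def release(self, client_id):
        with self._lock:
            if self._holder == client_id:
                self._holder = None

def master_loop(name, lock_service, poll_interval=0.1, rounds=3):
    """Each candidate polls the lock; whoever holds it acts as primary master."""
    for _ in range(rounds):
        if lock_service.try_acquire(name):
            print(f"{name}: acquired master lock, serving as primary")
            return
        print(f"{name}: lock held elsewhere, standing by as shadow")
        time.sleep(poll_interval)

chubby = ToyLockService()
master_loop("master-1", chubby)             # becomes primary
master_loop("master-2", chubby, rounds=1)   # stays shadow
chubby.release("master-1")                  # primary crashes / session expires
master_loop("master-2", chubby, rounds=1)   # shadow promotes itself
```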
Data Integrity: The Checksumming Trade-off
GFS chooses to verify data integrity only at the endpoints (chunkservers) rather than during every hop in the network.

No Inter-Replica Checksumming
- The Decision: GFS does not compare checksums between replicas during write operations.
- The Reason: Network bandwidth is expensive. Comparing 64MB of data across 3 replicas every time a chunk is copied would consume massive aggregate bandwidth.
- The Risk: If a chunk is corrupted during a copy from one server to another, the destination might store corrupted data.
- The Mitigation: The destination chunkserver computes its own checksum after receiving the data. On the next read, the corruption will be detected, and the master will schedule a repair from a different, healthy replica.
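The sketch below illustrates this mitigation under simplified assumptions: the receiving replica recomputes CRC32 checksums over 64KB blocks, and reads verify those checksums before returning data. The ChunkReplica class and its methods are hypothetical, not GFS source.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # GFS checksums data in 64 KB blocks

def block_checksums(data: bytes):
    """Compute a CRC32 per 64 KB block (illustrative helper)."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]

class ChunkReplica:
    def __init__(self):
        self.data = b""
        self.checksums = []

    def receive_from_peer(self, data: bytes):
        # The receiver never compares checksums with the sender;
        # it simply recomputes its own over whatever bytes arrived.
        self.data = data
        self.checksums = block_checksums(data)

    def read(self, offset: int, length: int):
        # Verify every block touched by the read before returning data.
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for i in range(first, last + 1):
            block = self.data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            if zlib.crc32(block) != self.checksums[i]:
                # In GFS the chunkserver would return an error and report the
                # corruption to the master, which repairs from a healthy replica.
                raise IOError(f"checksum mismatch in block {i}")
        return self.data[offset:offset + length]

replica = ChunkReplica()
replica.receive_from_peer(b"x" * (3 * BLOCK_SIZE))
print(len(replica.read(0, 100)))          # 100
replica.data = b"y" + replica.data[1:]    # simulate corruption on disk
try:
    replica.read(0, 100)
except IOError as e:
    print("detected:", e)
```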
Silent Data Corruption (Bit Rot)
Chunkservers don’t just wait for reads. They cycle through inactive chunks in the background to detect bit rot on aging disks. This “Scrubbing” ensures that a rarely-accessed archive doesn’t slowly decay into unreadability.
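A single scrubbing pass might look roughly like this; scrub_once, is_idle, and report_corruption are hypothetical names used only to illustrate the idea.

```python
import zlib

def scrub_once(chunks, is_idle, report_corruption):
    """One background scrubbing pass over locally stored chunks (illustrative).
    `chunks` maps chunk handle -> (data bytes, expected CRC32)."""
    for handle, (data, expected_crc) in chunks.items():
        if not is_idle(handle):
            continue                      # hot chunks get verified on read anyway
        if zlib.crc32(data) != expected_crc:
            # The master would react by cloning a healthy replica
            # and garbage-collecting this corrupted one.
            report_corruption(handle)

chunks = {"chunk-42": (b"payload", zlib.crc32(b"payload")),
          "chunk-43": (b"rotted!", zlib.crc32(b"payload"))}
scrub_once(chunks, is_idle=lambda h: True,
           report_corruption=lambda h: print("corrupt replica:", h))
```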
Master Fault Tolerance
The master is the most critical component—its failure requires special handling.

Master Replication
- The operation log is replicated to multiple remote machines
- A mutation is acknowledged to clients only after its log record is flushed locally and remotely
- Shadow masters apply the replicated log and serve read-only metadata requests during an outage
Checkpointing
- The master periodically serializes its in-memory state into a compact checkpoint
- Checkpoints are built in the background so ongoing mutations are not blocked
- Recovery loads the latest checkpoint and replays only the log records written after it
Failure Scenarios
How GFS handles different failure types:

Chunkserver Failure
Most Common Failure:
- Detected when heartbeats stop arriving at the master
- After a timeout the master declares the server dead and stops counting its replicas
- Every chunk that drops below its replication target is queued for re-replication, prioritized by how few replicas remain
- If the server later returns, its chunks are re-verified and any stale replicas are garbage collected
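A minimal heartbeat monitor along these lines is sketched below; the interval and miss count come from this chapter’s numbers, while the class and method names are invented for illustration.

```python
import time

HEARTBEAT_INTERVAL = 30      # seconds (this chapter cites 30-60s heartbeats)
MISSES_BEFORE_DEAD = 3       # declare dead after roughly 3 missed heartbeats

class HeartbeatMonitor:
    """Tracks last-heard times and flags silent chunkservers (illustrative)."""
    def __init__(self):
        self.last_seen = {}

    def record_heartbeat(self, server, now=None):
        self.last_seen[server] = now if now is not None else time.time()

    def dead_servers(self, now=None):
        now = now if now is not None else time.time()
        deadline = HEARTBEAT_INTERVAL * MISSES_BEFORE_DEAD
        return [s for s, t in self.last_seen.items() if now - t > deadline]

monitor = HeartbeatMonitor()
monitor.record_heartbeat("cs-1", now=0)
monitor.record_heartbeat("cs-2", now=0)
monitor.record_heartbeat("cs-1", now=120)    # cs-2 has gone quiet
# At t=150, cs-2 has exceeded the deadline; its chunks need re-replication.
print(monitor.dead_servers(now=150))          # ['cs-2']
```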
Disk Failure
Disk Corruption or Failure:
- Corruption is caught by checksum verification on reads or by background scrubbing
- The chunkserver returns an error to the requester and reports the bad replica to the master
- The master clones a new replica from a healthy copy, then garbage-collects the corrupted one
- A failed disk simply looks like the loss of every replica it held; normal re-replication restores them
Network Partition
Network Isolation:
- A partitioned chunkserver misses heartbeats and is treated like a crashed one; its chunks are re-replicated elsewhere
- A partitioned primary cannot renew its lease, so once the lease expires it must stop accepting mutations
- The master waits out the old lease before granting a new one, which prevents split-brain writes
- When the partition heals, version numbers expose any stale replicas, which are then garbage collected
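The sketch below shows the lease-expiry rule in isolation, assuming the 60-second default lease from the GFS paper; the Primary class is a made-up stand-in for a chunkserver primary.

```python
LEASE_DURATION = 60  # seconds; GFS leases default to about 60s

class Primary:
    """A partitioned primary refuses mutations once its lease expires, so the
    master can safely grant a new lease elsewhere (illustrative model)."""
    def __init__(self, granted_at):
        self.lease_expires = granted_at + LEASE_DURATION

    def accept_mutation(self, now):
        if now >= self.lease_expires:
            raise RuntimeError("lease expired; refusing write, client must re-ask master")
        return "applied"

old_primary = Primary(granted_at=0)
print(old_primary.accept_mutation(now=30))   # still within lease: applied
try:
    old_primary.accept_mutation(now=90)      # partitioned past expiry: refused
except RuntimeError as e:
    print(e)
# Meanwhile the master waits out the old lease before granting a new one,
# so two primaries never mutate the same chunk concurrently.
```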
Master Failure
Most Critical Failure:
- External monitoring (e.g. the Chubby lock) detects that the master is down
- A new master process starts from the replicated operation log and the latest checkpoint
- Recovery takes under a minute: load the checkpoint, replay the log tail, poll chunkservers for locations
- Shadow masters keep serving read-only metadata requests while the primary is unavailable
Data Integrity
Multiple layers ensure data correctness:

Integrity Mechanisms
Checksums
Chunk-Level Verification:
- 64KB blocks with CRC32
- Verified on every read
- Updated on every write
- Background scrubbing
- Detects silent corruption
Version Numbers
Staleness Detection:
- Incremented on lease grant
- Stored in metadata
- Checked on every operation
- Stale replicas garbage collected
- Prevents reading old data
Replication
Redundancy:
- 3 copies minimum
- Cross-rack placement
- Independent failures
- Can lose 2 copies safely
- Automatic restoration
Application Validation
Application-Level:
- Record checksums
- Unique identifiers
- Magic numbers
- Length markers
- De-duplication
Interview Questions
Basic: Why does GFS use 3 replicas by default?
Expected Answer:
GFS uses 3 replicas to balance durability, availability, and cost.

Why 3?
- Fault Tolerance: Can lose 2 replicas and still have data
  - Common scenario: 1 server down for maintenance, 1 unexpected failure
  - With 3 replicas, still have 1 copy available
  - 2 replicas: Single failure away from data loss
- Availability: Read from any replica
  - Load balancing across 3 servers
  - Better performance than 2
  - Diminishing returns after 3
- Cost: Storage overhead
  - 3x storage cost
  - 4 or 5 replicas → higher cost
  - 3 is sweet spot for Google’s workload
Replica count comparison:
- 1 replica: No fault tolerance, disaster
- 2 replicas: One failure away from data loss, insufficient
- 3 replicas: Can tolerate 2 failures, good balance
- 5 replicas: Higher availability but 67% more storage cost
Placement:
- 2 in one rack, 1 in another (typical)
- Survives rack failure
- Optimizes intra-rack bandwidth
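As a rough back-of-the-envelope check on these trade-offs, assume each replica fails independently with probability p during a repair window; both the independence assumption and the value of p below are simplifications for illustration only.

```python
# Rough intuition for the replica-count trade-off, assuming each replica fails
# independently with probability p before a lost copy can be re-replicated.
p = 0.01   # hypothetical per-replica failure probability in the window

for replicas in (1, 2, 3, 5):
    loss = p ** replicas    # data is lost only if every replica fails
    print(f"{replicas} replicas: P(loss) ~ {loss:.0e}, storage = {replicas}x")
```

The diminishing returns are visible immediately: going from 2 to 3 replicas buys another factor of 100 in durability for 50% more storage, while going from 3 to 5 costs 67% more storage for risk that was already negligible for this workload.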
Intermediate: How does GFS detect and handle stale replicas?
Expected Answer:
GFS uses version numbers to detect stale replicas.

Version Number System:
- Each chunk has version number
- Stored in master metadata and on chunkserver
- Incremented when master grants new lease
- All current replicas updated to new version
Detection and Handling:
- Lease Grant (version increment):
  - Master grants lease to primary
  - Increments version: v3 → v4
  - Updates all available replicas to v4
  - Replica that’s down stays at v3 (STALE)
- Heartbeat Report:
  - Chunkserver reports: “I have chunk X, version 3”
  - Master knows current version is 4
  - Master identifies replica as stale
- Client Request:
  - Master returns replica locations with version 4
  - Client contacts replica, sends expected version
  - If replica has v3, client knows it’s stale
  - Tries different replica
- Mark for Deletion:
  - Master tells chunkserver: “Delete chunk X (stale)”
  - Garbage collection cleanup
  - Not immediate (lazy deletion)
- Re-replication:
  - Chunk now has fewer valid replicas
  - Schedule re-replication from good replica
  - Restore to target count
- Never Served:
  - Stale replicas never returned to clients
  - Master only returns current version
  - Prevents reading old data
Example:
- Chunk v3 on [cs-1, cs-2, cs-3]
- cs-3 goes offline
- Client writes → master grants lease → v4
- cs-1, cs-2 updated to v4
- cs-3 comes back with v3
- Master: “Delete v3, re-replicate v4”
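A minimal model of the master-side bookkeeping in this example might look like the sketch below; the class and field names are invented for illustration.

```python
class MasterMetadata:
    """Minimal model of the master's per-chunk version tracking (illustrative)."""
    def __init__(self):
        self.current_version = {}   # chunk handle -> latest version
        self.stale = []             # (server, handle) pairs queued for GC

    def grant_lease(self, handle):
        # The version bumps on every new lease grant.
        self.current_version[handle] = self.current_version.get(handle, 0) + 1
        return self.current_version[handle]

    def on_heartbeat(self, server, reported):
        """`reported` maps chunk handle -> version held by this chunkserver."""
        for handle, version in reported.items():
            if version < self.current_version.get(handle, version):
                # Older version: mark for lazy deletion; the chunk will also
                # be re-replicated from an up-to-date copy.
                self.stale.append((server, handle))

master = MasterMetadata()
master.current_version["chunk-7"] = 3          # replicas on cs-1..cs-3 at v3
master.grant_lease("chunk-7")                   # cs-3 is down; version becomes 4
master.on_heartbeat("cs-3", {"chunk-7": 3})     # cs-3 returns with the old version
print(master.stale)                             # [('cs-3', 'chunk-7')]
```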
Advanced: Explain GFS's master recovery process in detail
Expected Answer:
GFS master recovery is designed for speed (under 1 minute) with no data loss.

Phase 1: Checkpoint Load (~1-2 seconds):
- Load most recent checkpoint from disk
- Checkpoint contains:
- Full namespace (files, directories)
- Chunk handles for each file
- Version numbers
- File metadata (permissions, timestamps)
- Compact B-tree format (~1GB)
- Loaded into memory quickly
Phase 2: Operation Log Replay:
- Read operation log since last checkpoint
- Apply operations sequentially:
- File creates/deletes
- Chunk allocations
- Version increments
- Metadata updates
- Typical: 100K-1M operations
- Rate: 100K ops/second
- Brings namespace to current state
Phase 3: Chunk Location Discovery:
- Chunk locations NOT persisted (by design)
- Master polls all chunkservers: “What chunks do you have?”
- Each chunkserver reports:
- Chunk handles
- Version numbers
- Master builds in-memory location map
- Identifies stale replicas (version mismatch)
- Marks stale replicas for deletion
Phase 4: Resume Serving:
- Master now has:
- Complete namespace
- All chunk locations
- Version information
- Begins accepting requests:
- Grants leases
- Serves metadata queries
- Creates chunks
- Fully operational
Why recovery is fast:
- Checkpoint eliminates long log replay
- All metadata in memory (no disk I/O for operations)
- Chunk locations discovered in parallel
- Simple, single-master design
Why no data is lost:
- Operation log replicated before client sees success
- Multiple copies (shadows, remote disks)
- Different failure domains/datacenters
- If master fails, log intact
- Replay gives exact state
Availability during recovery:
- Shadow masters serve read requests
- Writes blocked temporarily
- Clients retry automatically
- Minimal user impact
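Putting the phases together, a highly simplified recovery routine could be structured like the sketch below; the checkpoint and log-entry shapes are invented for illustration and do not reflect GFS’s on-disk formats.

```python
def recover_master(checkpoint, log_entries, chunkserver_reports):
    """Sketch of the recovery sequence described above (data shapes invented):
    start from a checkpoint, replay the log tail, rebuild chunk locations."""
    # Phase 1: the checkpoint (read from local disk in reality) seeds in-memory state.
    state = {"files": dict(checkpoint["files"]),
             "chunk_versions": dict(checkpoint["chunk_versions"])}

    # Phase 2: replay operations recorded after the checkpoint was taken.
    for op in log_entries:
        if op["type"] == "create":
            state["files"][op["path"]] = op["chunks"]
        elif op["type"] == "bump_version":
            state["chunk_versions"][op["chunk"]] = op["version"]

    # Phase 3: chunk locations are never persisted; poll every chunkserver.
    locations = {}
    for server, chunks in chunkserver_reports.items():
        for handle, version in chunks.items():
            if version == state["chunk_versions"].get(handle):
                locations.setdefault(handle, []).append(server)
            # else: stale replica; it will be garbage collected later.

    # Phase 4: with namespace and locations in memory, resume serving clients.
    return state, locations

state, locations = recover_master(
    checkpoint={"files": {"/logs/a": ["chunk-1"]}, "chunk_versions": {"chunk-1": 4}},
    log_entries=[{"type": "bump_version", "chunk": "chunk-1", "version": 5}],
    chunkserver_reports={"cs-1": {"chunk-1": 5}, "cs-2": {"chunk-1": 4}},
)
print(locations)   # {'chunk-1': ['cs-1']}  -- cs-2's copy is stale
```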
System Design: Design a more fault-tolerant master for GFS
Expected Answer:
Several approaches can improve GFS master fault tolerance beyond shadow masters:

Approach 1: Multi-Master with Consensus (like Colossus):
- Multiple active masters using Paxos/Raft
- Shared replicated state machine
- Any master can handle requests
- Benefits:
- No failover delay (masters always available)
- Higher read throughput (load balanced)
- Better fault tolerance (majority quorum)
- Challenges:
- Complex consensus protocol
- Higher write latency (consensus overhead)
- More difficult to implement/debug
Approach 2: Namespace Sharding:
- Partition namespace across multiple masters
- Each master handles subset of files
- Example: Hash(filename) % num_masters
- Benefits:
- Scales metadata capacity
- Scales throughput
- Failure affects subset
- Challenges:
- Cross-shard operations complex
- Rebalancing difficult
- Client routing logic
Approach 3: Hierarchical Masters:
- Root master for namespace
- Leaf masters for chunk management
- Separate concerns
- Benefits:
- Scales chunk operations
- Root master simpler (less load)
- Challenges:
- Two-level coordination
- Partial failures complex
Approach 4: Multi-Master with Eventual Consistency:
- Multiple masters with optimistic concurrency
- Eventually consistent
- Conflict resolution protocol
- Benefits:
- No failover needed
- Higher availability
- Challenges:
- Consistency corner cases
- Conflict resolution complexity
Recommended Approach: Consensus-Based Replication (Approach 1):
- Colossus (GFS successor) uses this
- Paxos provides strong consistency
- Multiple masters for availability
- Acceptable latency increase (metadata ops less frequent)
- Worth complexity for zero-downtime failover
Design sketch:
- 5-7 master servers (quorum=3-4)
- Consensus on operation log
- Replicated state machine
- Any master can serve requests
- Automatic leader election
- Client tries multiple masters
Trade-offs:
- Complexity: Higher (consensus protocol)
- Performance: Slightly higher write latency (consensus overhead)
- Availability: Much higher (no failover delay)
- Consistency: Stronger (linearizable)
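To make the “client tries multiple masters” point concrete, here is a hedged sketch of client-side fallback across master replicas; metadata_request and the leader flag are placeholders for a real RPC layer and leader-election state, not an actual API.

```python
class MasterUnavailable(Exception):
    pass

def metadata_request(master, op):
    """Placeholder for an RPC to one master replica (invented for illustration)."""
    if master["leader"]:
        return f"{op} handled by {master['name']}"
    raise MasterUnavailable(master["name"])

def call_any_master(masters, op):
    """Client-side fallback from the design sketch above: try each master
    replica in turn until one (the current leader) accepts the request."""
    for master in masters:
        try:
            return metadata_request(master, op)
        except MasterUnavailable:
            continue            # that replica is a follower or down; try the next
    raise RuntimeError("no master reachable")

masters = [{"name": "m-0", "leader": False},
           {"name": "m-1", "leader": True},
           {"name": "m-2", "leader": False}]
print(call_any_master(masters, "open /logs/web"))   # m-1 handles the request
```

Because any replica can redirect or reject a request, clients never need to wait out a failover timer; they simply keep trying until the current leader answers.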
Key Takeaways
Fault Tolerance Summary:
- Design for Failure: Assume constant component failures
- Replication: 3 copies across racks for durability
- Fast Detection: Heartbeats every 30-60s, declare dead after 3 misses
- Automatic Recovery: Re-replication without human intervention
- Prioritization: Chunks down to a single surviving replica are re-replicated before those that still have two
- Master Replication: Operation log replicated, shadow masters for failover
- Fast Recovery: under 1 minute master restart via checkpoint + log
- Version Numbers: Simple, effective staleness detection
- No Split-Brain: Lease expiration prevents divergent writes
- Lazy Cleanup: Garbage collection handles deleted/stale chunks
Up Next
In Chapter 7: Performance & Optimizations, we’ll explore:
- Real-world GFS benchmarks and measurements
- Performance characteristics and bottlenecks
- Optimization techniques Google employed
- Workload analysis and tuning strategies
- Lessons learned from production deployment