Chapter 6: Fault Tolerance
Fault tolerance is at the heart of GFS’s design, and this chapter is arguably the most practically valuable for any engineer who will operate production distributed systems. Built for commodity hardware where failures are the norm, not the exception, GFS employs multiple layers of redundancy and recovery mechanisms. What makes GFS’s approach noteworthy is not any single fault tolerance technique — replication, checksumming, and heartbeats all existed before GFS — but the way these techniques are composed into a cohesive, self-healing system that requires minimal human intervention. Before GFS, most storage systems required operators to manually intervene when failures occurred. GFS demonstrated that a well-designed system could handle the vast majority of failure scenarios automatically, an insight that became the foundation for Site Reliability Engineering (SRE) practices at Google and the entire DevOps movement. This chapter explores how GFS handles failures at every level, from disk corruption to datacenter outages.- Understand GFS’s multi-layered fault tolerance approach
- Master chunk replication and re-replication strategies
- Learn master replication and fast recovery mechanisms
- Explore failure detection and handling procedures
- Grasp data integrity maintenance across failures
Failure Assumptions
GFS was designed with specific failure assumptions based on Google’s operational experience.Expected Failure Rates
Design for Failure
Assume Everything Fails
- Disks fail continuously
- Machines crash regularly
- Network partitions happen
- Silent data corruption occurs
- Plan for all failure scenarios
Fast Detection
- Heartbeat every 30-60 seconds
- Checksum verification on reads
- Version number staleness detection
- Client-reported errors
- Background scrubbing
Automatic Recovery
- Re-replication on detection
- Master failover automated
- Chunk migration for balance
- Garbage collection cleanup
- No manual intervention
Multiple Replicas
- 3 replicas default per chunk
- Cross-rack placement
- Independent failure domains
- Can tolerate 2 failures
- Configurable replication factor
Chunk Replication
Chunk replication is the primary defense against data loss.Replication Strategy
- Placement Policy
- Re-replication
- Rebalancing
The “External Watchdog” (Chubby)
While GFS is self-contained for data, its Master Election is often coordinated by an external distributed locking service called Chubby. This architectural decision — delegating leader election to a separate, purpose-built coordination service — is a pattern that has been widely adopted in modern distributed systems. Just as GFS relied on Chubby, Hadoop HDFS relies on ZooKeeper, Kafka relies on ZooKeeper (or its newer KRaft protocol), and Kubernetes relies on etcd. The key insight is that leader election and distributed consensus are hard problems that benefit from being solved once in a dedicated, well-tested component rather than being reimplemented in every system that needs them.- Master Election: When the master starts, it attempts to acquire a specific lock in Chubby. Only one process can hold this lock at a time, becoming the “Primary Master”.
- Monitoring: If the Primary Master crashes, its Chubby session expires and the lock is released.
- Failover: A waiting “Shadow Master” or a new process detects the released lock, acquires it, and promotes itself to Primary Master.
Data Integrity: The Checksumming Trade-off
GFS chooses to verify data integrity only at the endpoints (Chunkservers) rather than during every hop in the network. This is an application of the “end-to-end argument” in systems design, a foundational principle articulated by Saltzer, Reed, and Clark in their 1984 paper. The argument states that reliability checks at intermediate points in a system do not eliminate the need for end-to-end checks, so adding intermediate checks primarily adds overhead without proportional benefit. GFS applies this principle pragmatically: verifying data at every network hop during replication would consume significant CPU and bandwidth, while end-to-end verification at the destination chunkserver catches the same errors at a fraction of the cost.No Inter-Replica Checksumming
- The Decision: GFS does not compare checksums between replicas during write operations.
- The Reason: Network bandwidth is expensive. Comparing 64MB of data across 3 replicas every time a chunk is copied would consume massive aggregate bandwidth.
- The Risk: If a chunk is corrupted during a copy from one server to another, the destination might store corrupted data.
- The Mitigation: The destination chunkserver computes its own checksum after receiving the data. On the next read, the corruption will be detected, and the master will schedule a repair from a different, healthy replica.
Silent Data Corruption (Bit Rot)
Chunkservers don’t just wait for reads. They cycle through inactive chunks in the background to detect bit rot on aging disks. This “Scrubbing” ensures that a rarely-accessed archive doesn’t slowly decay into unreadability.Master Fault Tolerance
The master is the most critical component—its failure requires special handling.Master Replication
Checkpointing
Failure Scenarios
How GFS handles different failure types:Chunkserver Failure
Chunkserver Failure
Disk Failure
Disk Failure
Network Partition
Network Partition
Master Failure
Master Failure
Data Integrity
Multiple layers ensure data correctness:Integrity Mechanisms
Checksums
- 64KB blocks with CRC32
- Verified on every read
- Updated on every write
- Background scrubbing
- Detects silent corruption
Version Numbers
- Incremented on lease grant
- Stored in metadata
- Checked on every operation
- Stale replicas garbage collected
- Prevents reading old data
Replication
- 3 copies minimum
- Cross-rack placement
- Independent failures
- Can lose 2 copies safely
- Automatic restoration
Application Validation
- Record checksums
- Unique identifiers
- Magic numbers
- Length markers
- De-duplication
Interview Questions
Basic: Why does GFS use 3 replicas by default?
Basic: Why does GFS use 3 replicas by default?
-
Fault Tolerance: Can lose 2 replicas and still have data
- Common scenario: 1 server down for maintenance, 1 unexpected failure
- With 3 replicas, still have 1 copy available
- 2 replicas: Single failure away from data loss
-
Availability: Read from any replica
- Load balancing across 3 servers
- Better performance than 2
- Diminishing returns after 3
-
Cost: Storage overhead
- 3x storage cost
- 4 or 5 replicas → higher cost
- 3 is sweet spot for Google’s workload
- 1 replica: No fault tolerance, disaster
- 2 replicas: One failure away from data loss, insufficient
- 3 replicas: Can tolerate 2 failures, good balance
- 5 replicas: Higher availability but 67% more storage cost
- 2 in one rack, 1 in another (typical)
- Survives rack failure
- Optimizes intra-rack bandwidth
Intermediate: How does GFS detect and handle stale replicas?
Intermediate: How does GFS detect and handle stale replicas?
- Each chunk has version number
- Stored in master metadata and on chunkserver
- Incremented when master grants new lease
- All current replicas updated to new version
-
Lease Grant (version increment):
- Master grants lease to primary
- Increments version: v3 → v4
- Updates all available replicas to v4
- Replica that’s down stays at v3 (STALE)
-
Heartbeat Report:
- Chunkserver reports: “I have chunk X, version 3”
- Master knows current version is 4
- Master identifies replica as stale
-
Client Request:
- Master returns replica locations with version 4
- Client contacts replica, sends expected version
- If replica has v3, client knows it’s stale
- Tries different replica
-
Mark for Deletion:
- Master tells chunkserver: “Delete chunk X (stale)”
- Garbage collection cleanup
- Not immediate (lazy deletion)
-
Re-replication:
- Chunk now has fewer valid replicas
- Schedule re-replication from good replica
- Restore to target count
-
Never Served:
- Stale replicas never returned to clients
- Master only returns current version
- Prevents reading old data
- Chunk v3 on [cs-1, cs-2, cs-3]
- cs-3 goes offline
- Client writes → master grants lease → v4
- cs-1, cs-2 updated to v4
- cs-3 comes back with v3
- Master: “Delete v3, re-replicate v4”
Advanced: Explain GFS's master recovery process in detail
Advanced: Explain GFS's master recovery process in detail
- Load most recent checkpoint from disk
- Checkpoint contains:
- Full namespace (files, directories)
- Chunk handles for each file
- Version numbers
- File metadata (permissions, timestamps)
- Compact B-tree format (~1GB)
- Loaded into memory quickly
- Read operation log since last checkpoint
- Apply operations sequentially:
- File creates/deletes
- Chunk allocations
- Version increments
- Metadata updates
- Typical: 100K-1M operations
- Rate: 100K ops/second
- Brings namespace to current state
- Chunk locations NOT persisted (by design)
- Master polls all chunkservers: “What chunks do you have?”
- Each chunkserver reports:
- Chunk handles
- Version numbers
- Master builds in-memory location map
- Identifies stale replicas (version mismatch)
- Marks stale replicas for deletion
- Master now has:
- Complete namespace
- All chunk locations
- Version information
- Begins accepting requests:
- Grants leases
- Serves metadata queries
- Creates chunks
- Fully operational
- Checkpoint eliminates long log replay
- All metadata in memory (no disk I/O for operations)
- Chunk locations discovered in parallel
- Simple, single-master design
- Operation log replicated before client sees success
- Multiple copies (shadows, remote disks)
- Different failure domains/datacenters
- If master fails, log intact
- Replay gives exact state
- Shadow masters serve read requests
- Writes blocked temporarily
- Clients retry automatically
- Minimal user impact
System Design: Design a more fault-tolerant master for GFS
System Design: Design a more fault-tolerant master for GFS
- Multiple active masters using Paxos/Raft
- Shared replicated state machine
- Any master can handle requests
- Benefits:
- No failover delay (masters always available)
- Higher read throughput (load balanced)
- Better fault tolerance (majority quorum)
- Challenges:
- Complex consensus protocol
- Higher write latency (consensus overhead)
- More difficult to implement/debug
- Partition namespace across multiple masters
- Each master handles subset of files
- Example: Hash(filename) % num_masters
- Benefits:
- Scales metadata capacity
- Scales throughput
- Failure affects subset
- Challenges:
- Cross-shard operations complex
- Rebalancing difficult
- Client routing logic
- Root master for namespace
- Leaf masters for chunk management
- Separate concerns
- Benefits:
- Scales chunk operations
- Root master simpler (less load)
- Challenges:
- Two-level coordination
- Partial failures complex
- Multiple masters with optimistic concurrency
- Eventually consistent
- Conflict resolution protocol
- Benefits:
- No failover needed
- Higher availability
- Challenges:
- Consistency corner cases
- Conflict resolution complexity
- Colossus (GFS successor) uses this
- Paxos provides strong consistency
- Multiple masters for availability
- Acceptable latency increase (metadata ops less frequent)
- Worth complexity for zero-downtime failover
- 5-7 master servers (quorum=3-4)
- Consensus on operation log
- Replicated state machine
- Any master can serve requests
- Automatic leader election
- Client tries multiple masters
- Complexity: Higher (consensus protocol)
- Performance: Slightly lower write latency
- Availability: Much higher (no failover delay)
- Consistency: Stronger (linearizable)
Key Takeaways
- Design for Failure: Assume constant component failures
- Replication: 3 copies across racks for durability
- Fast Detection: Heartbeats every 30-60s, declare dead after 3 misses
- Automatic Recovery: Re-replication without human intervention
- Prioritization: Urgent chunks (1 replica) before normal (2 replicas)
- Master Replication: Operation log replicated, shadow masters for failover
- Fast Recovery: under 1 minute master restart via checkpoint + log
- Version Numbers: Simple, effective staleness detection
- No Split-Brain: Lease expiration prevents divergent writes
- Lazy Cleanup: Garbage collection handles deleted/stale chunks
Up Next
In Chapter 7: Performance & Optimizations, we’ll explore:- Real-world GFS benchmarks and measurements
- Performance characteristics and bottlenecks
- Optimization techniques Google employed
- Workload analysis and tuning strategies
- Lessons learned from production deployment
Interview Deep-Dive
GFS re-replicates chunks when a replica is lost. Walk me through how the master prioritizes which chunks to re-replicate first.
GFS re-replicates chunks when a replica is lost. Walk me through how the master prioritizes which chunks to re-replicate first.
How does GFS handle silent data corruption -- bit rot where the disk returns wrong data without reporting an error?
How does GFS handle silent data corruption -- bit rot where the disk returns wrong data without reporting an error?
The GFS master is a single point of failure. How does GFS handle master crashes, and what is the recovery time?
The GFS master is a single point of failure. How does GFS handle master crashes, and what is the recovery time?