Chapter 5: Consistency Model
GFS’s consistency model is one of its most interesting and often misunderstood aspects. Unlike traditional file systems that provide strong consistency, GFS offers a relaxed model that trades some guarantees for higher performance and a simpler implementation. This chapter explores what GFS guarantees, what it doesn’t, and how applications work with this model.
Chapter Goals:
- Understand GFS’s consistency guarantees and terminology
- Learn the difference between defined, undefined, and inconsistent regions
- Master atomic record append semantics
- Explore how applications handle relaxed consistency
- Grasp the trade-offs between consistency and performance
Consistency Guarantees
GFS provides different consistency guarantees depending on the operation type and on whether the operation succeeds or fails.
Consistency Terminology
- Defined: All replicas agree, and the data is exactly what the client wrote
- Consistent but undefined: All replicas agree, but the data may be a mix of fragments from concurrent writes
- Inconsistent: Replicas disagree; different clients may see different data
The “Consistent but Undefined” Paradox
A common point of confusion is how a region can be consistent (all replicas agree) yet undefined (meaningless to the application).
The Interleaving Problem
Consider two clients writing to the same 64 MB chunk at the same offset (offset 0):
- Client A writes: [AAAAA]
- Client B writes: [BBBBB]
- The primary receives both requests and serializes them.
- Within the network pipeline, however, fragments of A and B may be interleaved if the client library sends them in multiple RPCs.
- Replicas might end up with [AABBA] or [BBAAA].
- Consistent: All replicas (R1, R2, R3) will have exactly the same interleaved data (e.g., [AABBA]) because they all applied the primary’s serial order of individual packets.
- Undefined: Neither Client A nor Client B wrote [AABBA]; to the application, the data is garbage.
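To make the paradox concrete, here is a toy Python simulation (illustrative only, not GFS code). It assumes each client’s write is split into one-byte fragments and that the primary happens to interleave them; every “replica” applies the same serial order, so they agree byte for byte, yet the result matches neither client’s intent:

```python
# Toy simulation of "consistent but undefined" (illustrative only, not GFS code).
# Assumption: each client's 5-byte write at offset 0 is split into 1-byte
# fragments, and the primary interleaves fragments from both clients.

def fragments(data: bytes, base_offset: int = 0):
    """Split one client's write into (offset, byte) fragments."""
    return [(base_offset + i, data[i:i + 1]) for i in range(len(data))]

def apply_serial_order(order, chunk_size: int = 5) -> bytes:
    """Every replica applies the primary's serial order of fragments;
    later fragments overwrite earlier ones at the same offset."""
    chunk = bytearray(chunk_size)
    for offset, frag in order:
        chunk[offset:offset + len(frag)] = frag
    return bytes(chunk)

a = fragments(b"AAAAA")  # Client A's write
b = fragments(b"BBBBB")  # Client B's write
order = [a[0], a[1], b[2], b[3], a[4]]  # one possible interleaving at the primary

r1, r2, r3 = (apply_serial_order(order) for _ in range(3))
print(r1, r1 == r2 == r3)  # b'AABBA' True: replicas agree, but neither client wrote this
```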
Operation-Based Guarantees
- Successful Write: the mutated region is defined if the write was serial, consistent but undefined if writes were concurrent
- Failed Write: the region is inconsistent; replicas may hold different data
- Concurrent Writes: the region is consistent but undefined; all replicas agree, but the bytes may mix fragments from several clients
- Record Append: the record is written atomically at least once at an offset GFS chooses; the surrounding file may contain padding or duplicate records
Write That Succeeds on All Replicas:
When every replica acknowledges the mutation, all replicas have applied it in the primary’s serial order, so the region is consistent; if no other client wrote the same region concurrently, it is also defined.
Consistency Matrix
Visual summary of GFS consistency guarantees:
- Serial write, success: defined
- Concurrent writes, success: consistent but undefined
- Write, failure: inconsistent
- Record append, success (serial or concurrent): defined where the record landed, possibly interspersed with inconsistent regions (padding, duplicates)
- Record append, failure: inconsistent
Application Implications
How do applications work with GFS’s relaxed consistency?
Handling Inconsistent Regions
- Record Validation: add checksums and unique IDs so readers can recognize valid records and skip garbage
- Write Patterns: prefer append-only and write-once files over in-place updates
- MapReduce Integration: the framework re-executes failed tasks and tolerates duplicate output, so it absorbs GFS’s relaxed guarantees naturally
Detecting Bad Records:
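Here is a minimal sketch of what such detection might look like, assuming a self-describing record format of magic number, length, and CRC32 (the format and helper names are illustrative, not part of any GFS API):

```python
import struct
import zlib

MAGIC = 0xDEADBEEF  # illustrative magic number marking the start of a record
HEADER = struct.Struct(">III")  # magic, payload length, CRC32 of payload

def pack_record(payload: bytes) -> bytes:
    """Frame a payload so readers can later validate it."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload

def scan_records(data: bytes):
    """Yield valid records, skipping padding, garbage, and torn writes."""
    pos = 0
    while pos + HEADER.size <= len(data):
        magic, length, crc = HEADER.unpack_from(data, pos)
        end = pos + HEADER.size + length
        if magic == MAGIC and end <= len(data):
            payload = data[pos + HEADER.size:end]
            if zlib.crc32(payload) == crc:
                yield payload
                pos = end
                continue
        pos += 1  # not a valid record here; resynchronize byte by byte
```

Because every record carries its own framing, a reader can resynchronize after an inconsistent region instead of abandoning the whole scan.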
Consistency Best Practices
Use Record Append
Prefer Atomic Appends:
- Use record_append for concurrent writes
- Guaranteed atomic and defined
- No coordination between clients
- Handle duplicates at read time
- Perfect for logging and MapReduce
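As a concrete illustration of these bullets, here is a hedged sketch assuming a hypothetical client library whose record_append(path, data) returns the offset GFS chose (the gfs object and method names are assumptions, not a documented API):

```python
import json
import uuid

def append_event(gfs, path: str, event: dict) -> int:
    """Attach a unique ID so duplicates from retries can be filtered later."""
    record = dict(event, record_id=str(uuid.uuid4()))
    data = json.dumps(record).encode()
    # Hypothetical client call: GFS picks the offset and returns it.
    return gfs.record_append(path, data)

def read_events(records):
    """De-duplicate at read time using the embedded record IDs."""
    seen = set()
    for raw in records:
        event = json.loads(raw)
        if event["record_id"] not in seen:
            seen.add(event["record_id"])
            yield event
```

The writer never coordinates with other writers; the reader pays a small cost (tracking seen IDs) to filter the duplicates that retries may create.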
Write-Once Pattern
Immutable After Creation:
- Write file once, then read-only
- No mutations after finalized
- Eliminates consistency issues
- Safe for concurrent readers
- Common pattern in data processing
Validate Records
Application-Level Checking:
- Add checksums to records
- Use unique IDs for de-duplication
- Include magic numbers
- Scan for valid records
- Skip inconsistent regions
Avoid Overwrites
Don’t Modify Existing Data:
- No random writes to same location
- No concurrent writes to same offset
- Use new files for updates
- Append-only workflows
- Checkpoint with new files
Relaxed Consistency Trade-offs
Why did GFS choose relaxed consistency?
Benefits of Relaxed Model
- No distributed consensus on the data path; writes take one round trip through the lease-holding primary
- Higher throughput and availability; the system never blocks waiting for agreement
- Simpler implementation: the primary serializes mutations locally
Costs of Relaxed Model
- Applications must validate records, de-duplicate, and tolerate padding
- Readers can observe inconsistent or undefined regions after failures or concurrent writes
- Poor fit for workloads that need in-place updates or transactional semantics
Consistency Violations Examples
Real scenarios that can occur:
Lost Update
Scenario: Concurrent Writes Overwrite Each Other:
Two clients write to the same region at nearly the same time. The primary serializes the writes, so one client’s data is silently overwritten by the other’s, even though both clients receive success.
Partial Read
Scenario: Reading During Write:
A client that reads a region while a large write is still in progress may observe a mix of old and new data, or different bytes depending on which replica serves the read, because the mutation has not yet been applied everywhere.
Duplicate Records
Scenario: Append Retry Creates Duplicates:
A record append fails on one replica, the primary reports an error, and the client retries. The retry succeeds at a new offset, but the first attempt may already have written the record on some replicas, leaving a duplicate or partial copy earlier in the file.
Interview Questions
Basic: What does 'defined' mean in GFS?
Expected Answer:
In GFS, a file region is “defined” when it is:
- Consistent: All replicas have the same data
- Matches the mutation: The data is exactly what the client wrote
In practical terms:
- All replicas agree on the data (consistent)
- The data reflects the client’s write operation completely
- Any client reading the region will see the expected data
The other two states, for contrast:
- Consistent but undefined: All replicas agree, but data may be a mix of concurrent writes (not what any single client expected)
- Inconsistent: Replicas disagree; clients may see different data depending on which replica serves the read
Operations that leave regions defined:
- Successful serial writes (all replicas ACK)
- Successful record appends (atomic; all replicas apply the record at the same offset)
Intermediate: How does record append achieve at-least-once delivery?
Expected Answer:
Record append provides at-least-once delivery through automatic retry with duplicate tolerance. At-least-once is a good fit for idempotent operations and batch processing, where duplicates can be filtered.
Mechanism:
- Client sends append request to primary
- Primary assigns offset and coordinates replicas
- If any replica fails, primary returns ERROR
- Client automatically retries the entire append
- Primary assigns a NEW offset for the retry
- Eventually all replicas succeed
- Client receives the successful offset
Key properties:
- Guaranteed success (at least once): Client retries until success
- May have duplicates: Failed attempt may have partially applied
- Application handles duplicates: Uses unique IDs to filter
Example:
- Attempt 1: Succeeds on R1, R2, fails on R3 → Error returned
- Partial state: R1 and R2 have record at offset 1000
- Attempt 2: Succeeds on all at offset 1003 → Success
- Result: Record at 1003 (guaranteed), possible duplicate at 1000
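A sketch of the retry loop this sequence implies, again assuming a hypothetical record_append call that raises IOError when the primary reports failure (assumed client API, not real GFS code):

```python
import time

def append_at_least_once(gfs, path: str, data: bytes, max_attempts: int = 10) -> int:
    """Retry the whole append until the primary reports success.
    A failed attempt may still have left the record on some replicas,
    which is why readers must tolerate duplicates."""
    for attempt in range(max_attempts):
        try:
            # Hypothetical call: on success every replica holds `data` at the
            # same offset, and that offset is returned to the client.
            return gfs.record_append(path, data)
        except IOError:
            time.sleep(0.1 * (2 ** attempt))  # back off, then retry at a new offset
    raise RuntimeError("record append did not succeed after retries")
```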
Advanced: Explain how GFS's consistency model enables higher performance
Expected Answer:
GFS’s relaxed consistency model enables higher performance through several mechanisms:
1. No Distributed Consensus:
- Strong consistency requires Paxos/Raft for every write (2-3 round trips, 50-100ms)
- GFS uses lease-based primary authority (1 round trip, 5-10ms)
- Primary makes serialization decisions locally without voting
- 10x latency improvement
2. Lease-Based Coordination:
- Traditional: Distributed locks (expensive, deadlock-prone)
- GFS: 60-second leases with automatic timeout
- No explicit release protocol needed
- Failures handled by timeout, not complex recovery
- Primary can process thousands of operations per second
3. Decoupled Data and Control Flow:
- Strong consistency: All communication through coordinator
- GFS: Data pushed in pipeline, control to primary separately
- Maximizes network bandwidth utilization
- Parallel data transfer while primary makes decisions
4. No Blocking on Failures:
- Failed writes create inconsistent regions
- Client retries overwrite with defined data
- System doesn’t block waiting for consensus
- Higher availability and throughput
5. Costs Shifted to Applications:
- Cost shifted to application (de-duplication, validation)
- But applications can optimize for their use case
- MapReduce already handles duplicates for fault tolerance
- Perfect synergy with workload
Bottom line:
- Performance: 10x lower latency, 10x higher throughput
- Cost: Application complexity, not suitable for all workloads
System Design: How would you build a database on top of GFS?
Expected Answer:
Building a database on GFS is challenging due to relaxed consistency. Several approaches:
Approach 1: Log-Structured Merge (LSM) Tree:
- Write-ahead log (WAL) using record append
- Immutable sorted string tables (SSTables) as GFS files
- Compaction creates new files
- As in the Bigtable implementation:
- WAL: Append-only (perfect for record append)
- SSTables: Write-once, immutable (no consistency issues)
- Memtable: In-memory, not in GFS
Approach 2: Versioned Snapshot Files:
- Store each version in separate GFS file
- Atomic rename for version transition
- Readers see consistent snapshots
- Writers append to new version
- Garbage collect old versions
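A sketch of the version-transition idea, under the assumption that the client library exposes create/write/read/rename operations on the GFS namespace (the API and the CURRENT-pointer convention here are illustrative, not part of GFS):

```python
def publish_new_version(gfs, table: str, version: int, data: bytes) -> None:
    """Write the new snapshot to a fresh file, then switch readers to it
    by atomically renaming a small pointer file over CURRENT."""
    snapshot = f"/db/{table}/v{version:06d}.snapshot"
    gfs.create(snapshot)
    gfs.write(snapshot, offset=0, data=data)   # single writer, no concurrency

    # Publish: write the pointer under a temp name, then rename over CURRENT.
    pointer_tmp = f"/db/{table}/CURRENT.tmp"
    gfs.create(pointer_tmp)
    gfs.write(pointer_tmp, offset=0, data=snapshot.encode())
    gfs.rename(pointer_tmp, f"/db/{table}/CURRENT")  # readers see old or new, never a mix

def read_current(gfs, table: str) -> bytes:
    """Readers resolve the pointer, then read an immutable snapshot."""
    snapshot = gfs.read(f"/db/{table}/CURRENT").decode()
    return gfs.read(snapshot)
```

Old snapshot files are garbage-collected once no reader can still resolve them through an older pointer.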
Approach 3: External Coordination Service:
- Use Chubby/ZooKeeper for consistency
- GFS for storage only
- Coordinator provides locks, transactions
- GFS provides durability, availability
- Example: Bigtable uses Chubby for coordination
Approach 4: Multi-Version Concurrency Control (MVCC):
- Each record has version number
- Record append for new versions
- Readers filter to consistent version
- Garbage collect old versions
Key Principles:
- Don’t rely on GFS for consistency
- Use append-only patterns
- Leverage immutability where possible
- Add coordination layer for transactions
- Handle duplicates and validation in app
Example Stack (Bigtable-style):
- WAL on GFS (record append)
- SSTables on GFS (immutable)
- Chubby for coordination
- Client library handles consistency
- Perfect layering of systems
Key Takeaways
Consistency Model Summary:
- Three States: Defined (best), Consistent but undefined (mixed), Inconsistent (bad)
- Record Append: Atomic, at-least-once, defined on success; applications must handle duplicates
- Regular Writes: Defined on serial success, inconsistent on failure, consistent but undefined if concurrent
- Application Responsibility: Validate records, de-duplicate, handle inconsistency
- Performance Trade-off: Relaxed consistency enables 10x better performance
- Workload Match: Perfect for append-heavy batch processing, not for OLTP
- Design Pattern: Write-once, append-only, immutable after creation
- Record Format: Magic numbers, checksums, unique IDs, length markers
Up Next
In Chapter 6: Fault Tolerance, we’ll explore:
- How GFS handles master failures and recovery
- Chunk replication strategies and re-replication
- Detecting and handling chunkserver failures
- Data integrity across component failures
- Disaster recovery mechanisms