
Chapter 5: Consistency Model

GFS’s consistency model is one of its most interesting and often misunderstood aspects. Unlike traditional file systems that provide strong consistency, GFS offers a relaxed model that trades some guarantees for higher performance and simpler implementation. This chapter explores what GFS guarantees, what it doesn’t, and how applications work with this model.
Chapter Goals:
  • Understand GFS’s consistency guarantees and terminology
  • Learn the difference between defined, undefined, and inconsistent regions
  • Master atomic record append semantics
  • Explore how applications handle relaxed consistency
  • Grasp the trade-offs between consistency and performance

Consistency Guarantees

GFS provides different consistency guarantees depending on the operation type and success/failure scenarios.

Consistency Terminology

GFS CONSISTENCY STATES
─────────────────────

Three possible states for file regions:

1. CONSISTENT
   ──────────
   Definition: All clients see the same data,
               regardless of which replica they read from

   ┌─────────────────────────────────┐
   │ Replica 1: "Hello World"        │
   │ Replica 2: "Hello World"        │ ← All identical
   │ Replica 3: "Hello World"        │
   └─────────────────────────────────┘

   Client A reads: "Hello World"
   Client B reads: "Hello World"
   Client C reads: "Hello World"

   Property: Agreement across replicas


2. DEFINED
   ────────
   Definition: Consistent AND clients see the mutation
               in its entirety

   ┌─────────────────────────────────┐
   │ All replicas: "XYZ" (from write)│
   │ Exactly what client wrote       │
   └─────────────────────────────────┘

   Stronger than consistent:
   • All replicas agree (consistent)
   • Data matches what was written (defined)


3. INCONSISTENT
   ────────────
   Definition: Clients may see different data
               depending on which replica they read

   ┌─────────────────────────────────┐
   │ Replica 1: "ABC"                │
   │ Replica 2: "XYZ"                │ ← Different!
   │ Replica 3: "ABC"                │
   └─────────────────────────────────┘

   Client A (reads from R1): "ABC"
   Client B (reads from R2): "XYZ"

   Property: Disagreement across replicas


RELATIONSHIP:
────────────

   Defined ⊂ Consistent ⊂ All states

   Defined: Best case (what app wants)
   Consistent: Readable but may have junk
   Inconsistent: Unusable, skip this region

The “Consistent but Undefined” Paradox

A common point of confusion is how a region can be consistent (all replicas agree) but undefined (meaningless to the application).

The Interleaving Problem

Consider two clients writing to the same 64MB chunk at the same offset (offset 0).
  • Client A writes: [AAAAA]
  • Client B writes: [BBBBB]
If these writes are large (e.g., several MBs) and not handled via “Record Append”, GFS does not guarantee that the write is atomic across the entire byte range. Scenario:
  1. The client library breaks each large write into several smaller write operations.
  2. The primary serializes every incoming operation, so operations from A and B can interleave in the serial order.
  3. Every replica applies that same serial order, so all of them might end up with [AABBA] or [BBAAA].
Result:
  • Consistent: All replicas (R1, R2, R3) hold exactly the same interleaved data (e.g., [AABBA]) because they all applied the primary’s serial order of individual write operations.
  • Undefined: Neither Client A nor Client B wrote [AABBA]. The data is garbage.
Key Insight: Consistency in GFS refers to replica agreement, not data integrity relative to client intent. This is why GFS strongly recommends “Record Append” for concurrent access, as it guarantees atomicity.
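
To make the paradox concrete, here is a small self-contained simulation (plain Python, not GFS code) of the interleaving above: each client’s write is split into fragments, the primary picks one serial order for the fragments, and every replica applies that order. The replicas end up byte-for-byte identical (consistent), yet the final contents match neither client’s write (undefined):

# Toy simulation of a "consistent but undefined" region (illustration only).

REGION_SIZE = 10  # bytes in the contested region at offset 0

def fragments(data, piece_size=2):
    # Split one client's write into (sub_offset, bytes) pieces, the way a
    # client library might split a large write across several operations.
    return [(i, data[i:i + piece_size]) for i in range(0, len(data), piece_size)]

def apply_serial_order(order):
    # Every replica applies the primary's serial order, so all end up identical.
    region = bytearray(b"." * REGION_SIZE)
    for sub_offset, piece in order:
        region[sub_offset:sub_offset + len(piece)] = piece
    return bytes(region)

write_a = fragments(b"AAAAAAAAAA")
write_b = fragments(b"BBBBBBBBBB")

# One serial order the primary might choose: A's first fragments, then all of
# B's fragments, then A's last fragments.
order = write_a[:3] + write_b + write_a[3:]

replicas = [apply_serial_order(order) for _ in range(3)]
print(replicas[0])                              # b'BBBBBBAAAA' on every replica
print(len(set(replicas)) == 1)                  # True: consistent (replicas agree)
print(replicas[0] in (b"A" * 10, b"B" * 10))    # False: undefined (neither client's write)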

Operation-Based Guarantees

Write That Succeeds on All Replicas:
SUCCESSFUL WRITE GUARANTEE
─────────────────────────

Operation: write(offset, data)
Result: All replicas ACK success

Guarantee: DEFINED
─────────────────

Meaning:
• All replicas have identical data
• Data is exactly what client wrote
• Region is consistent and defined

Example:
────────
Client writes "HELLO" at offset 1000

Before:
┌────────────────────────────┐
│ Offset 1000: "xxxxx"       │
│ (all replicas)             │
└────────────────────────────┘

Write operation succeeds

After:
┌─────────────────────────────────┐
│ Replica 1: offset 1000: "HELLO" │
│ Replica 2: offset 1000: "HELLO" │
│ Replica 3: offset 1000: "HELLO" │
└─────────────────────────────────┘

Client reads back: "HELLO" ✓
Different client reads: "HELLO" ✓
Reading from any replica: "HELLO" ✓

State: DEFINED


WHY DEFINED?
───────────

Primary serializes all writes:
1. Assigns serial numbers
2. Applies in order
3. Tells secondaries same order
4. All replicas apply identically

Deterministic outcome, all replicas identical, defined region.
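
A minimal sketch of that serialization idea, with in-memory stand-ins for the primary and replicas (illustration only; the Primary and Replica classes are invented for this example, and real chunkservers persist data and talk over RPC):

class Replica:
    def __init__(self):
        self.data = {}      # offset -> bytes actually stored

    def apply(self, offset, payload):
        self.data[offset] = payload
        return True         # ACK

class Primary:
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_serial = 0

    def write(self, offset, payload):
        self.next_serial += 1                 # 1. assign a serial number
        acks = [r.apply(offset, payload)      # 2-4. every replica applies the
                for r in self.replicas]       #      same mutation in the same order
        return all(acks)                      # success only if all replicas ACK

replicas = [Replica() for _ in range(3)]
primary = Primary(replicas)

print(primary.write(1000, b"HELLO"))          # True: all replicas ACKed
print({r.data[1000] for r in replicas})       # {b'HELLO'}: identical everywhere -> DEFINED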

Consistency Matrix

Visual summary of GFS consistency guarantees:
CONSISTENCY GUARANTEE MATRIX
───────────────────────────

┌─────────────────┬──────────────┬──────────────────┐
│   Operation     │  All Success │   Some Fail      │
├─────────────────┼──────────────┼──────────────────┤
│                 │              │                  │
│ Write           │   DEFINED    │  INCONSISTENT    │
│ (single client) │              │                  │
│                 │  ✓ Identical │  ✗ Replicas      │
│                 │  ✓ Expected  │     differ       │
│                 │              │  ✗ Undefined     │
│                 │              │     behavior     │
├─────────────────┼──────────────┼──────────────────┤
│                 │              │                  │
│ Write           │  CONSISTENT  │  INCONSISTENT    │
│ (concurrent)    │ but UNDEFINED│                  │
│                 │              │  ✗ Replicas      │
│                 │  ✓ Identical │     differ       │
│                 │  ✗ May not   │  ✗ Unpredictable │
│                 │     match    │                  │
│                 │     expected │                  │
├─────────────────┼──────────────┼──────────────────┤
│                 │              │                  │
│ Record Append   │   DEFINED    │  Record retried  │
│                 │ (at offset   │  → eventually    │
│                 │  returned)   │     DEFINED      │
│                 │              │                  │
│                 │  ✓ Identical │  ⚠ May have      │
│                 │  ✓ Expected  │     duplicates   │
│                 │  ✓ Atomic    │  ⚠ Inconsistent  │
│                 │              │     regions      │
└─────────────────┴──────────────┴──────────────────┘

KEY INSIGHTS:
────────────

1. Record append is the "safe" operation
   → Always eventually defined
   → At-least-once delivery

2. Regular writes can create inconsistent regions
   → Application must handle or retry

3. Concurrent writes: Just don't do it!
   → Use record append instead

4. Failed operations → inconsistent regions
   → But client knows it failed
   → Can retry to overwrite (sketched below)
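
A sketch of insight 4 from the application’s side: the write call either succeeds on every replica or reports failure, so the client can keep retrying until the region is defined again. The gfs.write call and GFSWriteError exception are hypothetical client-library names, not a documented GFS API:

import time

class GFSWriteError(Exception):
    """Hypothetical error raised when any replica fails to acknowledge a write."""

def write_until_defined(gfs, path, offset, data, max_retries=5, base_backoff=0.1):
    # Retry a failed write; each retry overwrites the inconsistent region,
    # and a fully acknowledged write leaves the region defined.
    for attempt in range(max_retries):
        try:
            gfs.write(path, offset, data)     # hypothetical call; raises on partial failure
            return True                       # all replicas ACKed -> region is defined
        except GFSWriteError:
            time.sleep(base_backoff * (2 ** attempt))
    return False                              # caller must treat the region as inconsistent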

Application Implications

How do applications work with GFS’s relaxed consistency?

Handling Inconsistent Regions

Detecting Bad Records:
RECORD VALIDATION STRATEGY
─────────────────────────

Application record format:
──────────────────────────

┌───────────────────────────────────────┐
│ Record Header                         │
├───────────────────────────────────────┤
│ Magic number:   0xDEADBEEF (4 bytes) │
│ Record ID:      UUID (16 bytes)      │
│ Checksum:       CRC32 (4 bytes)      │
│ Length:         (4 bytes)            │
├───────────────────────────────────────┤
│ Payload Data                          │
│ (variable length)                     │
└───────────────────────────────────────┘

Reading records:
───────────────

def read_records(f, file_size):
    offset = 0
    records = []
    seen_ids = set()

    while offset < file_size:
        try:
            # Read and parse the fixed-size header
            header = parse_header(read(f, offset, HEADER_SIZE))

            # Validate magic number
            if header.magic != 0xDEADBEEF:
                # INCONSISTENT REGION!
                offset = scan_for_next_magic(f, offset)
                continue

            # Validate application-level checksum
            payload = read(f, offset + HEADER_SIZE, header.length)
            if crc32(payload) != header.checksum:
                # CORRUPTED RECORD!
                offset += HEADER_SIZE + header.length
                continue

            # Check for duplicates left behind by append retries
            if header.record_id in seen_ids:
                # DUPLICATE!
                offset += HEADER_SIZE + header.length
                continue

            # Valid record!
            records.append(payload)
            seen_ids.add(header.record_id)
            offset += HEADER_SIZE + header.length

        except Exception:
            # Read failed here; scan forward for the next valid record
            offset += SCAN_INCREMENT

    return records


HANDLING STRATEGIES:
───────────────────

1. Skip inconsistent regions:
   • No valid magic number
   • Checksum mismatch
   • Partial records
   → Move to next potential record

2. De-duplicate:
   • Track seen record IDs
   • Skip duplicates from retries

3. Validate checksums:
   • Application-level checksums
   • Independent of GFS checksums
   • Detect inconsistent regions

4. Use markers:
   • Beginning/end markers
   • Identify record boundaries
   • Scan forward if corrupted
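
The reading loop above assumes the writer produced records in this self-describing format. Below is a minimal writer-side sketch; the field order and sizes follow the header diagram, while gfs.record_append in the usage comment is a hypothetical client call:

import struct
import uuid
import zlib

MAGIC = 0xDEADBEEF
HEADER = struct.Struct(">I16sII")   # magic (4), record id (16), CRC32 (4), length (4)
HEADER_SIZE = HEADER.size           # 28 bytes, matching the diagram above

def pack_record(payload: bytes) -> bytes:
    record_id = uuid.uuid4().bytes               # unique id enables de-duplication
    checksum = zlib.crc32(payload) & 0xFFFFFFFF  # application-level checksum
    return HEADER.pack(MAGIC, record_id, checksum, len(payload)) + payload

def unpack_header(buf: bytes):
    magic, record_id, checksum, length = HEADER.unpack(buf[:HEADER_SIZE])
    return magic, record_id, checksum, length

# Usage (hypothetical client API):
#   gfs.record_append("/logs/events", pack_record(b"event payload"))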

Consistency Best Practices

Use Record Append

Prefer Atomic Appends:
  • Use record_append for concurrent writes
  • Guaranteed atomic and defined
  • No coordination between clients
  • Handle duplicates at read time
  • Perfect for logging and MapReduce

Write-Once Pattern

Immutable After Creation:
  • Write file once, then read-only
  • No mutations after finalized
  • Eliminates consistency issues
  • Safe for concurrent readers
  • Common pattern in data processing

Validate Records

Application-Level Checking:
  • Add checksums to records
  • Use unique IDs for de-duplication
  • Include magic numbers
  • Scan for valid records
  • Skip inconsistent regions

Avoid Overwrites

Don’t Modify Existing Data:
  • No random writes to same location
  • No concurrent writes to same offset
  • Use new files for updates
  • Append-only workflows
  • Checkpoint with new files (see the sketch below)
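
A minimal sketch of the checkpoint pattern, assuming a hypothetical client library with create, write, and rename calls: write the new checkpoint into a fresh file, then switch names once it is complete, so readers never observe a partially written checkpoint:

import time

def write_checkpoint(gfs, directory, payload: bytes) -> str:
    # Hypothetical gfs.create/write/rename calls; the pattern, not the API, is the point.
    tmp_path = f"{directory}/checkpoint.{int(time.time())}.tmp"
    final_path = f"{directory}/checkpoint.current"

    gfs.create(tmp_path)
    gfs.write(tmp_path, 0, payload)      # single writer, no concurrent mutation
    gfs.rename(tmp_path, final_path)     # readers only ever see a complete checkpoint
    return final_path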

Relaxed Consistency Trade-offs

Why did GFS choose relaxed consistency?

Benefits of Relaxed Model

ADVANTAGES OF RELAXED CONSISTENCY
─────────────────────────────────

1. PERFORMANCE
   ───────────

   No Distributed Consensus:
   • No Paxos/Raft for every write
   • Primary makes decisions locally
   • Lower latency (1-10ms vs 50-100ms)
   • Higher throughput (10,000+ ops/sec)

   Example:
   ────────
   Strong consistency write:
   1. Propose operation
   2. Consensus among replicas (2-3 RTT)
   3. Apply to all
   4. Acknowledge
   Total: 50-100ms

   GFS write:
   1. Primary assigns order
   2. Apply to all (1 RTT pipeline)
   3. Acknowledge
   Total: 5-10ms


2. AVAILABILITY
   ────────────

   No Blocking on Failures:
   • Failed write → client retries
   • No need to wait for failed replica
   • System keeps running
   • Eventual consistency via re-replication

   Strong consistency:
   • Need quorum (2 of 3)
   • If 2 replicas down → unavailable
   • Must wait for recovery

   GFS:
   • Primary can make progress
   • Client retries create defined regions
   • Inconsistent regions eventually fixed


3. SIMPLICITY
   ──────────

   Simpler Protocol:
   • No complex consensus algorithm
   • Lease-based primary authority
   • Clear failure semantics
   • Easier to implement and debug

   Code complexity comparison:
   • Paxos/Raft: 1000s of lines
   • GFS lease mechanism: 100s of lines


4. SCALABILITY
   ───────────

   Less Coordination:
   • Primary makes local decisions
   • No cross-replica voting
   • No distributed state machines
   • Scales to more replicas easier

   1000 concurrent writes:
   • Strong consistency: serialized
   • GFS: primary serializes (fast)


5. MATCHES WORKLOAD
   ────────────────

   Google's Use Cases:
   • MapReduce: append-only, de-dup OK
   • Logging: append-only, duplicates OK
   • Batch processing: validate input
   • Not: OLTP database (need strong consistency)

   Applications can handle:
   • Duplicates (unique IDs)
   • Inconsistent regions (validation)
   • Retries (idempotent operations)

Costs of Relaxed Model

DISADVANTAGES OF RELAXED CONSISTENCY
────────────────────────────────────

1. APPLICATION COMPLEXITY
   ──────────────────────

   Application Must Handle:
   • De-duplication logic
   • Record validation
   • Checksum verification
   • Scanning for inconsistent regions
   • Idempotent operations

   Code burden shifted to application!

   Example application code:
   ────────────────────────

   # Strong consistency:
   data = file.read()
   # Just works, always consistent

   # GFS:
   def read_valid_records(file):
       records = []
       seen = set()
       for record in file:
           if not valid_magic(record):
               continue  # Inconsistent
           if not valid_checksum(record):
               continue  # Corrupted
           if record.id in seen:
               continue  # Duplicate
           records.append(record)
       return records

   → 10x more code!


2. DIFFICULT TO REASON ABOUT
   ─────────────────────────

   Consistency Model Subtleties:
   • Defined vs consistent vs inconsistent
   • Different guarantees per operation
   • Timing-dependent behaviors
   • Non-intuitive failure cases

   Developer confusion:
   • "Why does my write disappear?"
   • "Why do I see old data sometimes?"
   • "Why are there duplicates?"

   → Requires education and experience


3. NOT SUITABLE FOR ALL APPS
   ─────────────────────────

   Bad fit for:
   • Financial transactions (need atomicity)
   • Database storage (need ACID)
   • Configuration files (need consistency)
   • User-facing writes (expect consistency)

   Example:
   ────────
   Bank account update:
   • Write: balance = balance - 100
   • Concurrent write: balance = balance + 50
   • GFS: Undefined result!
   • Need: Atomic transactions

   GFS not a general-purpose file system!


4. DEBUGGING CHALLENGES
   ──────────────────────

   Hard to Debug:
   • Inconsistent regions → sporadic failures
   • Duplicates → non-deterministic behavior
   • Timing-dependent → hard to reproduce
   • Multi-replica state → distributed debugging

   Example bug:
   ───────────
   "Application sometimes processes
    records twice"

   Root cause:
   • Record append retry created duplicate
   • Application didn't de-duplicate
   • Hard to trace which append failed
   • Timing-dependent reproduction


5. REQUIRES CO-DESIGN
   ──────────────────

   Not Plug-and-Play:
   • Applications must be GFS-aware
   • Can't just use POSIX applications
   • Need custom record formats
   • Need validation logic

   Cannot do:
   • Run MySQL on GFS (needs consistency)
   • Use standard Unix tools
   • Expect POSIX semantics

   Can do:
   • MapReduce (co-designed)
   • Custom batch processing
   • Logging systems

Consistency Violation Examples

Real scenarios that can occur:
Scenario: Concurrent Writes Overwrite Each Other:
LOST UPDATE PROBLEM
──────────────────

Initial: offset 1000 = "AAAA"

Time T1:
────────
Client 1: write(1000, "BBBB")
Client 2: write(1000, "CCCC")

Both sent to primary simultaneously

Primary serializes:
──────────────────
Serial #1: Write "BBBB" at 1000
Serial #2: Write "CCCC" at 1000

Apply in order:
──────────────
offset 1000 = "BBBB"  (from client 1)
offset 1000 = "CCCC"  (from client 2, overwrites)

Final: offset 1000 = "CCCC"

Client 1's write is LOST!

Result:
──────
• Client 1 thinks it wrote "BBBB"
• Client 2 thinks it wrote "CCCC"
• Actual result: "CCCC"
• Client 1's update lost

State: CONSISTENT (all replicas agree on "CCCC")
       but UNDEFINED (not what client 1 expected)

Solution:
────────
Don't do concurrent writes!
Use record append instead (sketched below)
Or application-level locking
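
A sketch of the record-append alternative: because GFS, not the client, chooses the offset, concurrent appends land at different offsets and neither update is lost. gfs.record_append is the hypothetical client call used earlier in this chapter:

def log_update(gfs, path, client_id, payload):
    # GFS serializes concurrent appends and returns the offset it chose.
    record = f"{client_id}:{payload}".encode()
    return gfs.record_append(path, record)

# Client 1 and Client 2 append concurrently:
#   log_update(gfs, "/logs/updates", "client-1", "BBBB")  -> e.g. offset 1000
#   log_update(gfs, "/logs/updates", "client-2", "CCCC")  -> e.g. offset 1009
# Both records survive; readers see both and merge them at the application level.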

Scenario: Reading During Write:
READING DURING WRITE
───────────────────

Scenario:
────────
Writer: Writing 1MB at offset 1000
Reader: Reading 1MB starting at offset 1000

Timeline:
────────

T0: Writer pushes the 1MB of data to all replicas
    (buffered at the chunkservers, not yet applied)

T1: Writer sends the write command to the primary

T2: Primary applies the mutation and forwards the
    command to the secondaries

T3: Replica 1 has applied the full 1MB
    Replica 2 has applied only the first 500KB so far

T4: Reader asks the master for chunk locations
    Master returns: [R1, R2, R3]

T5: Reader picks R2 (closest)
    Reads 1MB from offset 1000

    Gets: 500KB new data + 500KB old data

T6: Replica 2 finishes applying the remaining 500KB

Reader's data:
─────────────
• First 500KB: New data (already applied on R2)
• Last 500KB: Old data (apply still in progress)
• Replicas momentarily disagree: INCONSISTENT read!

Solutions:
─────────
1. Application validates records
   (checksums detect partial writes)

2. Use write-once pattern
   (don't read until write finalized)

3. Application-level versioning
   (read specific version)

GFS doesn't prevent this!
Application must handle

Scenario: Append Retry Creates Duplicates:
DUPLICATE RECORDS
────────────────

Client appends record "DATA-123"

Attempt 1:
─────────
T1: Push data to R1, R2, R3 (all buffer it)
T2: Client sends append command to primary
T3: Primary (R1) applies the record at offset 1500
T4: R2 applies at offset 1500 ✓
T5: R3 FAILS to apply (network timeout)
T6: Client gets error
    (mutation did not reach all replicas)

Attempt 2:
─────────
T7:  Client retries
T8:  Push data to R1, R2, R3
T9:  All buffer successfully
T10: Send append command to primary
T11: Primary assigns a NEW offset: 2000
T12: All replicas apply at offset 2000
T13: Client gets success: offset=2000

But wait! Attempt 1 left partial state behind:
──────────────────────────────────────────────
• R1 and R2 applied "DATA-123" at offset 1500
• R3 never applied it, so offset 1500 differs across replicas
• "DATA-123" now exists at TWO offsets on R1 and R2!

File contents:
─────────────
R1: offset 1500: "DATA-123" (from attempt 1)
    offset 2000: "DATA-123" (from attempt 2)

R2: offset 1500: "DATA-123" (from attempt 1)
    offset 2000: "DATA-123" (from attempt 2)

R3: offset 1500: <junk/old data>
    offset 2000: "DATA-123" (from attempt 2)

Reader sees:
───────────
• Offset 1500: Duplicate (on R1, R2)
• Offset 2000: Success (on all)

Application handles:
───────────────────
if record.id in seen_ids:
    continue  # Duplicate from the failed attempt, skip it
seen_ids.add(record.id)

Unique IDs enable de-duplication!

Interview Questions

Question: What does it mean for a file region to be “defined” in GFS?

Expected Answer: In GFS, a file region is “defined” when it is:
  1. Consistent: All replicas have the same data
  2. Matches the mutation: The data is exactly what the client wrote
“Defined” is the strongest guarantee GFS provides. It means that:
  • All replicas agree on the data (consistent)
  • The data reflects the client’s write operation completely
  • Any client reading the region will see the expected data
Contrast with:
  • Consistent but undefined: All replicas agree, but data may be a mix of concurrent writes (not what any single client expected)
  • Inconsistent: Replicas disagree, may see different data depending on which replica you read from
GFS guarantees defined regions for:
  • Successful writes (all replicas ACK)
  • Successful record appends (atomic, all replicas apply at same offset)
Failed operations create inconsistent regions that applications must handle.
Question: How does record append provide at-least-once semantics, and how do applications deal with the duplicates it can create?

Expected Answer: Record append provides at-least-once delivery through automatic retry with duplicate tolerance.

Mechanism:
  1. Client sends append request to primary
  2. Primary assigns offset and coordinates replicas
  3. If any replica fails, primary returns ERROR
  4. Client automatically retries the entire append
  5. Primary assigns a NEW offset for the retry
  6. Eventually all replicas succeed
  7. Client receives the successful offset
Result:
  • Guaranteed success (at least once): Client retries until success
  • May have duplicates: Failed attempt may have partially applied
  • Application handles duplicates: Uses unique IDs to filter
Example:
  • Attempt 1: Succeeds on R1, R2, fails on R3 → Error returned
  • Partial state: R1 and R2 have record at offset 1000
  • Attempt 2: Succeeds on all at offset 1003 → Success
  • Result: Record at 1003 (guaranteed), possible duplicate at 1000
Application Code:
# Write with unique ID
record = {'id': uuid(), 'data': ...}
gfs.record_append(file, record)

# Read with de-duplication
seen_ids = set()
for record in read(file):
    if record.id in seen_ids:
        continue  # Skip duplicate
    process(record)
At-least-once is perfect for idempotent operations and batch processing where duplicates can be filtered.
Question: How does GFS’s relaxed consistency model enable higher performance than a strongly consistent design?

Expected Answer: GFS’s relaxed consistency model enables higher performance through several mechanisms:

1. No Distributed Consensus:
  • Strong consistency requires Paxos/Raft for every write (2-3 round trips, 50-100ms)
  • GFS uses lease-based primary authority (1 round trip, 5-10ms)
  • Primary makes serialization decisions locally without voting
  • 10x latency improvement
2. Leases Instead of Locks:
  • Traditional: Distributed locks (expensive, deadlock-prone)
  • GFS: 60-second leases with automatic timeout
  • No explicit release protocol needed
  • Failures handled by timeout, not complex recovery
  • Primary can process thousands of operations per second
3. Decoupled Data and Control Flow:
  • Strong consistency: All communication through coordinator
  • GFS: Data pushed in pipeline, control to primary separately
  • Maximizes network bandwidth utilization
  • Parallel data transfer while primary makes decisions
4. Acceptable Inconsistent Regions:
  • Failed writes create inconsistent regions
  • Client retries overwrite with defined data
  • System doesn’t block waiting for consensus
  • Higher availability and throughput
5. Application-Level Handling:
  • Cost shifted to application (de-duplication, validation)
  • But applications can optimize for their use case
  • MapReduce already handles duplicates for fault tolerance
  • Perfect synergy with workload
Trade-off:
  • Performance: 10x lower latency, 10x higher throughput
  • Cost: Application complexity, not suitable for all workloads
For Google’s batch processing workload (MapReduce, logging, analytics), this trade-off was perfect.
Question: How would you build a database-like system on top of GFS, given its relaxed consistency?

Expected Answer: Building a database on GFS is challenging due to relaxed consistency. Several approaches help:

Approach 1: Log-Structured Merge (LSM) Tree:
  • Write-ahead log (WAL) using record append
  • Immutable sorted string tables (SSTables) as GFS files
  • Compaction creates new files
  • This mirrors the Bigtable implementation:
    • WAL: Append-only (perfect for record append)
    • SSTables: Write-once, immutable (no consistency issues)
    • Memtable: In-memory, not in GFS
Approach 2: Snapshot Isolation:
  • Store each version in separate GFS file
  • Atomic rename for version transition
  • Readers see consistent snapshots
  • Writers append to new version
  • Garbage collect old versions
Approach 3: External Coordination:
  • Use Chubby/Zookeeper for consistency
  • GFS for storage only
  • Coordinator provides locks, transactions
  • GFS provides durability, availability
  • Example: Bigtable uses Chubby for coordination
Approach 4: Application-Level MVCC (sketched below):
  • Multi-version concurrency control
  • Each record has version number
  • Record append for new versions
  • Readers filter to consistent version
  • Garbage collect old versions
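
A sketch of the read-side filtering for approach 4, assuming each appended record carries a key, a monotonically increasing version, and a value; read_valid_records stands for the validated, de-duplicated reader sketched earlier in the chapter:

def latest_versions(records):
    # Keep only the newest version of each key; older versions are ignored and
    # can be garbage collected later by rewriting live records into a new file.
    current = {}
    for rec in records:                        # rec = {'key': ..., 'version': ..., 'value': ...}
        key, version = rec["key"], rec["version"]
        if key not in current or version > current[key]["version"]:
            current[key] = rec
    return current

# Usage:
#   snapshot = latest_versions(read_valid_records(gfs_file))
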
Key Principles:
  1. Don’t rely on GFS for consistency
  2. Use append-only patterns
  3. Leverage immutability where possible
  4. Add coordination layer for transactions
  5. Handle duplicates and validation in app
Real Example: Bigtable:
  • WAL on GFS (record append)
  • SSTables on GFS (immutable)
  • Chubby for coordination
  • Client library handles consistency
  • Perfect layering of systems
You wouldn’t build a traditional RDBMS directly on GFS, but log-structured systems work well.

Key Takeaways

Consistency Model Summary:
  1. Three States: Defined (best), Consistent but undefined (mixed), Inconsistent (bad)
  2. Record Append: Atomic, at-least-once, defined on success, handles duplicates
  3. Regular Writes: Defined on success, inconsistent on failure, undefined if concurrent
  4. Application Responsibility: Validate records, de-duplicate, handle inconsistency
  5. Performance Trade-off: Relaxed consistency enables 10x better performance
  6. Workload Match: Perfect for append-heavy batch processing, not for OLTP
  7. Design Pattern: Write-once, append-only, immutable after creation
  8. Record Format: Magic numbers, checksums, unique IDs, length markers

Up Next

In Chapter 6: Fault Tolerance, we’ll explore:
  • How GFS handles master failures and recovery
  • Chunk replication strategies and re-replication
  • Detecting and handling chunkserver failures
  • Data integrity across component failures
  • Disaster recovery mechanisms
The consistency model defines what GFS guarantees—now we’ll see how it maintains those guarantees despite constant failures.