
Chapter 5: Consistency Model

GFS’s consistency model is one of its most interesting and often misunderstood aspects. Unlike traditional file systems that provide strong consistency, GFS offers a relaxed model that trades some guarantees for higher performance and simpler implementation. This chapter explores what GFS guarantees, what it doesn’t, and how applications work with this model.
Chapter Goals:
  • Understand GFS’s consistency guarantees and terminology
  • Learn the difference between defined, undefined, and inconsistent regions
  • Master atomic record append semantics
  • Explore how applications handle relaxed consistency
  • Grasp the trade-offs between consistency and performance

Consistency Guarantees

GFS provides different consistency guarantees depending on the operation type and success/failure scenarios.

Consistency Terminology

GFS CONSISTENCY STATES
─────────────────────

Three possible states for file regions:

1. CONSISTENT
   ──────────
   Definition: All clients see the same data,
               regardless of which replica they read from

   ┌─────────────────────────────────┐
   │ Replica 1: "Hello World"        │
   │ Replica 2: "Hello World"        │ ← All identical
   │ Replica 3: "Hello World"        │
   └─────────────────────────────────┘

   Client A reads: "Hello World"
   Client B reads: "Hello World"
   Client C reads: "Hello World"

   Property: Agreement across replicas


2. DEFINED
   ────────
   Definition: Consistent AND clients see the mutation
               in its entirety

   ┌─────────────────────────────────┐
   │ All replicas: "XYZ" (from write)│
   │ Exactly what client wrote       │
   └─────────────────────────────────┘

   Stronger than consistent:
   • All replicas agree (consistent)
   • Data matches what was written (defined)


3. INCONSISTENT
   ────────────
   Definition: Clients may see different data
               depending on which replica they read

   ┌─────────────────────────────────┐
   │ Replica 1: "ABC"                │
   │ Replica 2: "XYZ"                │ ← Different!
   │ Replica 3: "ABC"                │
   └─────────────────────────────────┘

   Client A (reads from R1): "ABC"
   Client B (reads from R2): "XYZ"

   Property: Disagreement across replicas


RELATIONSHIP:
────────────

   Defined ⊂ Consistent ⊂ All states

   Defined: Best case (what app wants)
   Consistent: Readable but may have junk
   Inconsistent: Unusable, skip this region

The “Consistent but Undefined” Paradox

A common point of confusion is how a region can be consistent (all replicas agree) but undefined (meaningless to the application).

The Interleaving Problem

Consider two clients writing to the same 64MB chunk at the same offset (offset 0).
  • Client A writes: [AAAAA]
  • Client B writes: [BBBBB]
If these writes are large (e.g., several MBs) and not handled via “Record Append”, GFS does not guarantee that the write is atomic across the entire byte range. Scenario:
  1. The client library breaks each large write into several smaller write operations.
  2. The primary serializes every incoming operation, so operations from A and B can interleave in the serial order.
  3. Every replica applies that same serial order, so all of them might end up with [AABBA] or [BBAAA].
Result:
  • Consistent: All replicas (R1, R2, R3) hold exactly the same interleaved data (e.g., [AABBA]) because they all applied the primary’s serial order of individual write operations.
  • Undefined: Neither Client A nor Client B wrote [AABBA]. The data is garbage.
Key Insight: Consistency in GFS refers to replica agreement, not data integrity relative to client intent. This is why GFS strongly recommends “Record Append” for concurrent access, as it guarantees atomicity.
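
To make the paradox concrete, here is a small self-contained simulation (plain Python, not GFS code) of the interleaving above: each client’s write is split into fragments, the primary picks one serial order for the fragments, and every replica applies that order. The replicas end up byte-for-byte identical (consistent), yet the final contents match neither client’s write (undefined):

# Toy simulation of a "consistent but undefined" region (illustration only).

REGION_SIZE = 10  # bytes in the contested region at offset 0

def fragments(data, piece_size=2):
    # Split one client's write into (sub_offset, bytes) pieces, the way a
    # client library might split a large write across several operations.
    return [(i, data[i:i + piece_size]) for i in range(0, len(data), piece_size)]

def apply_serial_order(order):
    # Every replica applies the primary's serial order, so all end up identical.
    region = bytearray(b"." * REGION_SIZE)
    for sub_offset, piece in order:
        region[sub_offset:sub_offset + len(piece)] = piece
    return bytes(region)

write_a = fragments(b"AAAAAAAAAA")
write_b = fragments(b"BBBBBBBBBB")

# One serial order the primary might choose: A's first fragments, then all of
# B's fragments, then A's last fragments.
order = write_a[:3] + write_b + write_a[3:]

replicas = [apply_serial_order(order) for _ in range(3)]
print(replicas[0])                              # b'BBBBBBAAAA' on every replica
print(len(set(replicas)) == 1)                  # True: consistent (replicas agree)
print(replicas[0] in (b"A" * 10, b"B" * 10))    # False: undefined (neither client's write)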

Operation-Based Guarantees

Write That Succeeds on All Replicas:
SUCCESSFUL WRITE GUARANTEE
─────────────────────────

Operation: write(offset, data)
Result: All replicas ACK success

Guarantee: DEFINED
─────────────────

Meaning:
• All replicas have identical data
• Data is exactly what client wrote
• Region is consistent and defined

Example:
────────
Client writes "HELLO" at offset 1000

Before:
┌────────────────────────────┐
│ Offset 1000: "xxxxx"       │
│ (all replicas)             │
└────────────────────────────┘

Write operation succeeds

After:
┌─────────────────────────────────┐
│ Replica 1: offset 1000: "HELLO" │
│ Replica 2: offset 1000: "HELLO" │
│ Replica 3: offset 1000: "HELLO" │
└─────────────────────────────────┘

Client reads back: "HELLO" ✓
Different client reads: "HELLO" ✓
Reading from any replica: "HELLO" ✓

State: DEFINED


WHY DEFINED?
───────────

Primary serializes all writes:
1. Assigns serial numbers
2. Applies in order
3. Tells secondaries same order
4. All replicas apply identically

Deterministic outcome, all replicas identical, defined region.
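
A minimal sketch of that serialization idea, with in-memory stand-ins for the primary and replicas (illustration only; the Primary and Replica classes are invented for this example, and real chunkservers persist data and talk over RPC):

class Replica:
    def __init__(self):
        self.data = {}      # offset -> bytes actually stored

    def apply(self, offset, payload):
        self.data[offset] = payload
        return True         # ACK

class Primary:
    def __init__(self, replicas):
        self.replicas = replicas
        self.next_serial = 0

    def write(self, offset, payload):
        self.next_serial += 1                 # 1. assign a serial number
        acks = [r.apply(offset, payload)      # 2-4. every replica applies the
                for r in self.replicas]       #      same mutation in the same order
        return all(acks)                      # success only if all replicas ACK

replicas = [Replica() for _ in range(3)]
primary = Primary(replicas)

print(primary.write(1000, b"HELLO"))          # True: all replicas ACKed
print({r.data[1000] for r in replicas})       # {b'HELLO'}: identical everywhere -> DEFINED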

Consistency Matrix

Visual summary of GFS consistency guarantees:
CONSISTENCY GUARANTEE MATRIX
───────────────────────────

┌─────────────────┬──────────────┬──────────────────┐
│   Operation     │  All Success │   Some Fail      │
├─────────────────┼──────────────┼──────────────────┤
│                 │              │                  │
│ Write           │   DEFINED    │  INCONSISTENT    │
│ (single client) │              │                  │
│                 │  ✓ Identical │  ✗ Replicas      │
│                 │  ✓ Expected  │     differ       │
│                 │              │  ✗ Undefined     │
│                 │              │     behavior     │
├─────────────────┼──────────────┼──────────────────┤
│                 │              │                  │
│ Write           │  CONSISTENT  │  INCONSISTENT    │
│ (concurrent)    │ but UNDEFINED│                  │
│                 │              │  ✗ Replicas      │
│                 │  ✓ Identical │     differ       │
│                 │  ✗ May not   │  ✗ Unpredictable │
│                 │     match    │                  │
│                 │     expected │                  │
├─────────────────┼──────────────┼──────────────────┤
│                 │              │                  │
│ Record Append   │   DEFINED    │  Record retried  │
│                 │ (at offset   │  → eventually    │
│                 │  returned)   │     DEFINED      │
│                 │              │                  │
│                 │  ✓ Identical │  ⚠ May have      │
│                 │  ✓ Expected  │     duplicates   │
│                 │  ✓ Atomic    │  ⚠ Inconsistent  │
│                 │              │     regions      │
└─────────────────┴──────────────┴──────────────────┘

KEY INSIGHTS:
────────────

1. Record append is the "safe" operation
   → Always eventually defined
   → At-least-once delivery

2. Regular writes can create inconsistent regions
   → Application must handle or retry

3. Concurrent writes: Just don't do it!
   → Use record append instead

4. Failed operations → inconsistent regions
   → But client knows it failed
   → Can retry to overwrite (sketched below)
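
A sketch of insight 4 from the application’s side: the write call either succeeds on every replica or reports failure, so the client can keep retrying until the region is defined again. The gfs.write call and GFSWriteError exception are hypothetical client-library names, not a documented GFS API:

import time

class GFSWriteError(Exception):
    """Hypothetical error raised when any replica fails to acknowledge a write."""

def write_until_defined(gfs, path, offset, data, max_retries=5, base_backoff=0.1):
    # Retry a failed write; each retry overwrites the inconsistent region,
    # and a fully acknowledged write leaves the region defined.
    for attempt in range(max_retries):
        try:
            gfs.write(path, offset, data)     # hypothetical call; raises on partial failure
            return True                       # all replicas ACKed -> region is defined
        except GFSWriteError:
            time.sleep(base_backoff * (2 ** attempt))
    return False                              # caller must treat the region as inconsistent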

Application Implications

How do applications work with GFS’s relaxed consistency?

Handling Inconsistent Regions

Detecting Bad Records:
RECORD VALIDATION STRATEGY
─────────────────────────

Application record format:
──────────────────────────

┌───────────────────────────────────────┐
│ Record Header                         │
├───────────────────────────────────────┤
│ Magic number:   0xDEADBEEF (4 bytes) │
│ Record ID:      UUID (16 bytes)      │
│ Checksum:       CRC32 (4 bytes)      │
│ Length:         (4 bytes)            │
├───────────────────────────────────────┤
│ Payload Data                          │
│ (variable length)                     │
└───────────────────────────────────────┘

Reading records:
───────────────

def read_records(f, file_size):
    offset = 0
    records = []
    seen_ids = set()

    while offset < file_size:
        try:
            # Read and parse the fixed-size header
            header = parse_header(read(f, offset, HEADER_SIZE))

            # Validate magic number
            if header.magic != 0xDEADBEEF:
                # INCONSISTENT REGION!
                offset = scan_for_next_magic(f, offset)
                continue

            # Validate application-level checksum
            payload = read(f, offset + HEADER_SIZE, header.length)
            if crc32(payload) != header.checksum:
                # CORRUPTED RECORD!
                offset += HEADER_SIZE + header.length
                continue

            # Check for duplicates left behind by append retries
            if header.record_id in seen_ids:
                # DUPLICATE!
                offset += HEADER_SIZE + header.length
                continue

            # Valid record!
            records.append(payload)
            seen_ids.add(header.record_id)
            offset += HEADER_SIZE + header.length

        except Exception:
            # Read failed here; scan forward for the next valid record
            offset += SCAN_INCREMENT

    return records


HANDLING STRATEGIES:
───────────────────

1. Skip inconsistent regions:
   • No valid magic number
   • Checksum mismatch
   • Partial records
   → Move to next potential record

2. De-duplicate:
   • Track seen record IDs
   • Skip duplicates from retries

3. Validate checksums:
   • Application-level checksums
   • Independent of GFS checksums
   • Detect inconsistent regions

4. Use markers:
   • Beginning/end markers
   • Identify record boundaries
   • Scan forward if corrupted
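
The reading loop above assumes the writer produced records in this self-describing format. Below is a minimal writer-side sketch; the field order and sizes follow the header diagram, while gfs.record_append in the usage comment is a hypothetical client call:

import struct
import uuid
import zlib

MAGIC = 0xDEADBEEF
HEADER = struct.Struct(">I16sII")   # magic (4), record id (16), CRC32 (4), length (4)
HEADER_SIZE = HEADER.size           # 28 bytes, matching the diagram above

def pack_record(payload: bytes) -> bytes:
    record_id = uuid.uuid4().bytes               # unique id enables de-duplication
    checksum = zlib.crc32(payload) & 0xFFFFFFFF  # application-level checksum
    return HEADER.pack(MAGIC, record_id, checksum, len(payload)) + payload

def unpack_header(buf: bytes):
    magic, record_id, checksum, length = HEADER.unpack(buf[:HEADER_SIZE])
    return magic, record_id, checksum, length

# Usage (hypothetical client API):
#   gfs.record_append("/logs/events", pack_record(b"event payload"))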

Consistency Best Practices

Use Record Append

Prefer Atomic Appends:
  • Use record_append for concurrent writes
  • Guaranteed atomic and defined
  • No coordination between clients
  • Handle duplicates at read time
  • Perfect for logging and MapReduce

Write-Once Pattern

Immutable After Creation:
  • Write file once, then read-only
  • No mutations after finalized
  • Eliminates consistency issues
  • Safe for concurrent readers
  • Common pattern in data processing

Validate Records

Application-Level Checking:
  • Add checksums to records
  • Use unique IDs for de-duplication
  • Include magic numbers
  • Scan for valid records
  • Skip inconsistent regions

Avoid Overwrites

Don’t Modify Existing Data:
  • No random writes to same location
  • No concurrent writes to same offset
  • Use new files for updates
  • Append-only workflows
  • Checkpoint with new files (see the sketch below)
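
A minimal sketch of the checkpoint pattern, assuming a hypothetical client library with create, write, and rename calls: write the new checkpoint into a fresh file, then switch names once it is complete, so readers never observe a partially written checkpoint:

import time

def write_checkpoint(gfs, directory, payload: bytes) -> str:
    # Hypothetical gfs.create/write/rename calls; the pattern, not the API, is the point.
    tmp_path = f"{directory}/checkpoint.{int(time.time())}.tmp"
    final_path = f"{directory}/checkpoint.current"

    gfs.create(tmp_path)
    gfs.write(tmp_path, 0, payload)      # single writer, no concurrent mutation
    gfs.rename(tmp_path, final_path)     # readers only ever see a complete checkpoint
    return final_path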

Relaxed Consistency Trade-offs

Why did GFS choose relaxed consistency?

Benefits of Relaxed Model

ADVANTAGES OF RELAXED CONSISTENCY
─────────────────────────────────

1. PERFORMANCE
   ───────────

   No Distributed Consensus:
   • No Paxos/Raft for every write
   • Primary makes decisions locally
   • Lower latency (1-10ms vs 50-100ms)
   • Higher throughput (10,000+ ops/sec)

   Example:
   ────────
   Strong consistency write:
   1. Propose operation
   2. Consensus among replicas (2-3 RTT)
   3. Apply to all
   4. Acknowledge
   Total: 50-100ms

   GFS write:
   1. Primary assigns order
   2. Apply to all (1 RTT pipeline)
   3. Acknowledge
   Total: 5-10ms


2. AVAILABILITY
   ────────────

   No Blocking on Failures:
   • Failed write → client retries
   • No need to wait for failed replica
   • System keeps running
   • Eventual consistency via re-replication

   Strong consistency:
   • Need quorum (2 of 3)
   • If 2 replicas down → unavailable
   • Must wait for recovery

   GFS:
   • Primary can make progress
   • Client retries create defined regions
   • Inconsistent regions eventually fixed


3. SIMPLICITY
   ──────────

   Simpler Protocol:
   • No complex consensus algorithm
   • Lease-based primary authority
   • Clear failure semantics
   • Easier to implement and debug

   Code complexity comparison:
   • Paxos/Raft: 1000s of lines
   • GFS lease mechanism: 100s of lines


4. SCALABILITY
   ───────────

   Less Coordination:
   • Primary makes local decisions
   • No cross-replica voting
   • No distributed state machines
   • Scales to more replicas easier

   1000 concurrent writes:
   • Strong consistency: serialized
   • GFS: primary serializes (fast)


5. MATCHES WORKLOAD
   ────────────────

   Google's Use Cases:
   • MapReduce: append-only, de-dup OK
   • Logging: append-only, duplicates OK
   • Batch processing: validate input
   • Not: OLTP database (need strong consistency)

   Applications can handle:
   • Duplicates (unique IDs)
   • Inconsistent regions (validation)
   • Retries (idempotent operations)

Costs of Relaxed Model

DISADVANTAGES OF RELAXED CONSISTENCY
────────────────────────────────────

1. APPLICATION COMPLEXITY
   ──────────────────────

   Application Must Handle:
   • De-duplication logic
   • Record validation
   • Checksum verification
   • Scanning for inconsistent regions
   • Idempotent operations

   Code burden shifted to application!

   Example application code:
   ────────────────────────

   # Strong consistency:
   data = file.read()
   # Just works, always consistent

   # GFS:
   def read_valid_records(file):
       records = []
       seen = set()
       for record in file:
           if not valid_magic(record):
               continue  # Inconsistent
           if not valid_checksum(record):
               continue  # Corrupted
           if record.id in seen:
               continue  # Duplicate
           records.append(record)
       return records

   → 10x more code!


2. DIFFICULT TO REASON ABOUT
   ─────────────────────────

   Consistency Model Subtleties:
   • Defined vs consistent vs inconsistent
   • Different guarantees per operation
   • Timing-dependent behaviors
   • Non-intuitive failure cases

   Developer confusion:
   • "Why does my write disappear?"
   • "Why do I see old data sometimes?"
   • "Why are there duplicates?"

   → Requires education and experience


3. NOT SUITABLE FOR ALL APPS
   ─────────────────────────

   Bad fit for:
   • Financial transactions (need atomicity)
   • Database storage (need ACID)
   • Configuration files (need consistency)
   • User-facing writes (expect consistency)

   Example:
   ────────
   Bank account update:
   • Write: balance = balance - 100
   • Concurrent write: balance = balance + 50
   • GFS: Undefined result!
   • Need: Atomic transactions

   GFS not a general-purpose file system!


4. DEBUGGING CHALLENGES
   ──────────────────────

   Hard to Debug:
   • Inconsistent regions → sporadic failures
   • Duplicates → non-deterministic behavior
   • Timing-dependent → hard to reproduce
   • Multi-replica state → distributed debugging

   Example bug:
   ───────────
   "Application sometimes processes
    records twice"

   Root cause:
   • Record append retry created duplicate
   • Application didn't de-duplicate
   • Hard to trace which append failed
   • Timing-dependent reproduction


5. REQUIRES CO-DESIGN
   ──────────────────

   Not Plug-and-Play:
   • Applications must be GFS-aware
   • Can't just use POSIX applications
   • Need custom record formats
   • Need validation logic

   Cannot do:
   • Run MySQL on GFS (needs consistency)
   • Use standard Unix tools
   • Expect POSIX semantics

   Can do:
   • MapReduce (co-designed)
   • Custom batch processing
   • Logging systems

Consistency Violation Examples

Real scenarios that can occur:
Scenario: Concurrent Writes Overwrite Each Other:
LOST UPDATE PROBLEM
──────────────────

Initial: offset 1000 = "AAAA"

Time T1:
────────
Client 1: write(1000, "BBBB")
Client 2: write(1000, "CCCC")

Both sent to primary simultaneously

Primary serializes:
──────────────────
Serial #1: Write "BBBB" at 1000
Serial #2: Write "CCCC" at 1000

Apply in order:
──────────────
offset 1000 = "BBBB"  (from client 1)
offset 1000 = "CCCC"  (from client 2, overwrites)

Final: offset 1000 = "CCCC"

Client 1's write is LOST!

Result:
──────
• Client 1 thinks it wrote "BBBB"
• Client 2 thinks it wrote "CCCC"
• Actual result: "CCCC"
• Client 1's update lost

State: CONSISTENT (all replicas agree on "CCCC")
       but UNDEFINED (not what client 1 expected)

Solution:
────────
Don't do concurrent writes!
Use record append instead (sketched below)
Or application-level locking
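
A sketch of the record-append alternative: because GFS, not the client, chooses the offset, concurrent appends land at different offsets and neither update is lost. gfs.record_append is the hypothetical client call used earlier in this chapter:

def log_update(gfs, path, client_id, payload):
    # GFS serializes concurrent appends and returns the offset it chose.
    record = f"{client_id}:{payload}".encode()
    return gfs.record_append(path, record)

# Client 1 and Client 2 append concurrently:
#   log_update(gfs, "/logs/updates", "client-1", "BBBB")  -> e.g. offset 1000
#   log_update(gfs, "/logs/updates", "client-2", "CCCC")  -> e.g. offset 1009
# Both records survive; readers see both and merge them at the application level.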

Scenario: Reading During Write:
READING DURING WRITE
───────────────────

Scenario:
────────
Writer: Writing 1MB at offset 1000
Reader: Reading 1MB starting at offset 1000

Timeline:
────────

T0: Writer pushes the 1MB of data to all replicas
    (buffered at the chunkservers, not yet applied)

T1: Writer sends the write command to the primary

T2: Primary applies the mutation and forwards the
    command to the secondaries

T3: Replica 1 has applied the full 1MB
    Replica 2 has applied only the first 500KB so far

T4: Reader asks the master for chunk locations
    Master returns: [R1, R2, R3]

T5: Reader picks R2 (closest)
    Reads 1MB from offset 1000

    Gets: 500KB new data + 500KB old data

T6: Replica 2 finishes applying the remaining 500KB

Reader's data:
─────────────
• First 500KB: New data (already applied on R2)
• Last 500KB: Old data (apply still in progress)
• Replicas momentarily disagree: INCONSISTENT read!

Solutions:
─────────
1. Application validates records
   (checksums detect partial writes)

2. Use write-once pattern
   (don't read until write finalized)

3. Application-level versioning
   (read specific version)

GFS doesn't prevent this!
Application must handle

Scenario: Append Retry Creates Duplicates:
DUPLICATE RECORDS
────────────────

Client appends record "DATA-123"

Attempt 1:
─────────
T1: Push data to R1, R2, R3 (all buffer it)
T2: Client sends append command to primary
T3: Primary (R1) applies the record at offset 1500
T4: R2 applies at offset 1500 ✓
T5: R3 FAILS to apply (network timeout)
T6: Client gets error
    (mutation did not reach all replicas)

Attempt 2:
─────────
T7:  Client retries
T8:  Push data to R1, R2, R3
T9:  All buffer successfully
T10: Send append command to primary
T11: Primary assigns a NEW offset: 2000
T12: All replicas apply at offset 2000
T13: Client gets success: offset=2000

But wait! Attempt 1 left partial state behind:
──────────────────────────────────────────────
• R1 and R2 applied "DATA-123" at offset 1500
• R3 never applied it, so offset 1500 differs across replicas
• "DATA-123" now exists at TWO offsets on R1 and R2!

File contents:
─────────────
R1: offset 1500: "DATA-123" (from attempt 1)
    offset 2000: "DATA-123" (from attempt 2)

R2: offset 1500: "DATA-123" (from attempt 1)
    offset 2000: "DATA-123" (from attempt 2)

R3: offset 1500: <junk/old data>
    offset 2000: "DATA-123" (from attempt 2)

Reader sees:
───────────
• Offset 1500: Duplicate (on R1, R2)
• Offset 2000: Success (on all)

Application handles:
───────────────────
if record.id in seen_ids:
    continue  # Duplicate from the failed attempt, skip it
seen_ids.add(record.id)

Unique IDs enable de-duplication!

Interview Questions

Question: What does it mean for a file region to be “defined” in GFS?

Expected Answer: In GFS, a file region is “defined” when it is:
  1. Consistent: All replicas have the same data
  2. Matches the mutation: The data is exactly what the client wrote
“Defined” is the strongest guarantee GFS provides. It means that:
  • All replicas agree on the data (consistent)
  • The data reflects the client’s write operation completely
  • Any client reading the region will see the expected data
Contrast with:
  • Consistent but undefined: All replicas agree, but data may be a mix of concurrent writes (not what any single client expected)
  • Inconsistent: Replicas disagree, may see different data depending on which replica you read from
GFS guarantees defined regions for:
  • Successful writes (all replicas ACK)
  • Successful record appends (atomic, all replicas apply at same offset)
Failed operations create inconsistent regions that applications must handle.
Question: How does record append provide at-least-once semantics, and how do applications deal with the duplicates it can create?

Expected Answer: Record append provides at-least-once delivery through automatic retry with duplicate tolerance.

Mechanism:
  1. Client sends append request to primary
  2. Primary assigns offset and coordinates replicas
  3. If any replica fails, primary returns ERROR
  4. Client automatically retries the entire append
  5. Primary assigns a NEW offset for the retry
  6. Eventually all replicas succeed
  7. Client receives the successful offset
Result:
  • Guaranteed success (at least once): Client retries until success
  • May have duplicates: Failed attempt may have partially applied
  • Application handles duplicates: Uses unique IDs to filter
Example:
  • Attempt 1: Succeeds on R1, R2, fails on R3 → Error returned
  • Partial state: R1 and R2 have record at offset 1000
  • Attempt 2: Succeeds on all at offset 1003 → Success
  • Result: Record at 1003 (guaranteed), possible duplicate at 1000
Application Code:
# Write with unique ID
record = {'id': uuid(), 'data': ...}
gfs.record_append(file, record)

# Read with de-duplication
seen_ids = set()
for record in read(file):
    if record.id in seen_ids:
        continue  # Skip duplicate
    process(record)
At-least-once is perfect for idempotent operations and batch processing where duplicates can be filtered.
Question: How does GFS’s relaxed consistency model enable higher performance than a strongly consistent design?

Expected Answer: GFS’s relaxed consistency model enables higher performance through several mechanisms:

1. No Distributed Consensus:
  • Strong consistency requires Paxos/Raft for every write (2-3 round trips, 50-100ms)
  • GFS uses lease-based primary authority (1 round trip, 5-10ms)
  • Primary makes serialization decisions locally without voting
  • 10x latency improvement
2. Leases Instead of Locks:
  • Traditional: Distributed locks (expensive, deadlock-prone)
  • GFS: 60-second leases with automatic timeout
  • No explicit release protocol needed
  • Failures handled by timeout, not complex recovery
  • Primary can process thousands of operations per second
3. Decoupled Data and Control Flow:
  • Strong consistency: All communication through coordinator
  • GFS: Data pushed in pipeline, control to primary separately
  • Maximizes network bandwidth utilization
  • Parallel data transfer while primary makes decisions
4. Acceptable Inconsistent Regions:
  • Failed writes create inconsistent regions
  • Client retries overwrite with defined data
  • System doesn’t block waiting for consensus
  • Higher availability and throughput
5. Application-Level Handling:
  • Cost shifted to application (de-duplication, validation)
  • But applications can optimize for their use case
  • MapReduce already handles duplicates for fault tolerance
  • Perfect synergy with workload
Trade-off:
  • Performance: 10x lower latency, 10x higher throughput
  • Cost: Application complexity, not suitable for all workloads
For Google’s batch processing workload (MapReduce, logging, analytics), this trade-off was perfect.
Question: How would you build a database-like system on top of GFS, given its relaxed consistency?

Expected Answer: Building a database on GFS is challenging due to relaxed consistency. Several approaches help:

Approach 1: Log-Structured Merge (LSM) Tree:
  • Write-ahead log (WAL) using record append
  • Immutable sorted string tables (SSTables) as GFS files
  • Compaction creates new files
  • This mirrors the Bigtable implementation:
    • WAL: Append-only (perfect for record append)
    • SSTables: Write-once, immutable (no consistency issues)
    • Memtable: In-memory, not in GFS
Approach 2: Snapshot Isolation:
  • Store each version in separate GFS file
  • Atomic rename for version transition
  • Readers see consistent snapshots
  • Writers append to new version
  • Garbage collect old versions
Approach 3: External Coordination:
  • Use Chubby/Zookeeper for consistency
  • GFS for storage only
  • Coordinator provides locks, transactions
  • GFS provides durability, availability
  • Example: Bigtable uses Chubby for coordination
Approach 4: Application-Level MVCC (sketched below):
  • Multi-version concurrency control
  • Each record has version number
  • Record append for new versions
  • Readers filter to consistent version
  • Garbage collect old versions
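
A sketch of the read-side filtering for approach 4, assuming each appended record carries a key, a monotonically increasing version, and a value; read_valid_records stands for the validated, de-duplicated reader sketched earlier in the chapter:

def latest_versions(records):
    # Keep only the newest version of each key; older versions are ignored and
    # can be garbage collected later by rewriting live records into a new file.
    current = {}
    for rec in records:                        # rec = {'key': ..., 'version': ..., 'value': ...}
        key, version = rec["key"], rec["version"]
        if key not in current or version > current[key]["version"]:
            current[key] = rec
    return current

# Usage:
#   snapshot = latest_versions(read_valid_records(gfs_file))
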
Key Principles:
  1. Don’t rely on GFS for consistency
  2. Use append-only patterns
  3. Leverage immutability where possible
  4. Add coordination layer for transactions
  5. Handle duplicates and validation in app
Real Example: Bigtable:
  • WAL on GFS (record append)
  • SSTables on GFS (immutable)
  • Chubby for coordination
  • Client library handles consistency
  • Perfect layering of systems
You wouldn’t build a traditional RDBMS directly on GFS, but log-structured systems work well.

Key Takeaways

Consistency Model Summary:
  1. Three States: Defined (best), Consistent but undefined (mixed), Inconsistent (bad)
  2. Record Append: Atomic, at-least-once, defined on success, handles duplicates
  3. Regular Writes: Defined on success, inconsistent on failure, undefined if concurrent
  4. Application Responsibility: Validate records, de-duplicate, handle inconsistency
  5. Performance Trade-off: Relaxed consistency enables 10x better performance
  6. Workload Match: Perfect for append-heavy batch processing, not for OLTP
  7. Design Pattern: Write-once, append-only, immutable after creation
  8. Record Format: Magic numbers, checksums, unique IDs, length markers

Up Next

In Chapter 6: Fault Tolerance, we’ll explore:
  • How GFS handles master failures and recovery
  • Chunk replication strategies and re-replication
  • Detecting and handling chunkserver failures
  • Data integrity across component failures
  • Disaster recovery mechanisms
The consistency model defines what GFS guarantees—now we’ll see how it maintains those guarantees despite constant failures.