Chunkservers are the workhorses of GFS, storing actual data and serving client requests. This chapter explores how chunks are stored, how data flows through the system, and how GFS ensures data integrity despite commodity hardware failures.
Chapter Goals:
Understand chunk storage implementation
Master the replication pipeline and data flow optimization
Learn detailed read, write, and record append flows
Explore checksumming and data integrity mechanisms
CHECKSUM STRUCTURE
─────────────────
Each chunk divided into 64KB blocks:

┌─────────────────────────────────────────┐
│ Chunk (64MB max)                        │
├─────────────────────────────────────────┤
│ Block 0 (64KB)    → Checksum 0 (4 bytes)│
│ Block 1 (64KB)    → Checksum 1 (4 bytes)│
│ Block 2 (64KB)    → Checksum 2 (4 bytes)│
│ ...                                     │
│ Block 1023 (64KB) → Checksum 1023       │
└─────────────────────────────────────────┘

Checksum = 32-bit CRC32

For a full 64MB chunk:
• 1024 blocks
• 1024 checksums
• 4KB checksum metadata
Stored: in memory + on disk

CHECKSUM COMPUTATION:
────────────────────

On Write:
─────────
def write_block(block_index, data):
    checksum = crc32(data)
    write_data_to_disk(block_index, data)
    update_checksum(block_index, checksum)
    persist_checksum()

On Read:
────────
def read_block(block_index):
    data = read_data_from_disk(block_index)
    expected_checksum = get_checksum(block_index)
    actual_checksum = crc32(data)
    if actual_checksum != expected_checksum:
        report_corruption()
        read_from_different_replica()
    else:
        return data

WHY 64KB BLOCKS?
───────────────
Trade-off analysis:

Smaller blocks (e.g., 4KB):
✓ Detect corruption precisely
✓ Re-read smaller amount
✗ More checksums (higher overhead)
✗ More CPU for checksum computation

Larger blocks (e.g., 1MB):
✓ Fewer checksums
✓ Less CPU overhead
✗ Must re-read larger amount on corruption
✗ Less precise error detection

64KB: Sweet spot
• 0.006% overhead (4KB per 64MB)
• Fast CRC32 computation
• Reasonable re-read size
• Matches common I/O sizes
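The read/write pseudocode above can be made concrete with Python's built-in CRC32. This is only a sketch: a small in-memory byte string stands in for a 64MB chunk file, and the helper names are illustrative, not GFS APIs.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64KB blocks, as in GFS


def compute_block_checksums(chunk_data: bytes) -> list:
    """Compute a 32-bit CRC for each 64KB block of a chunk."""
    return [
        zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
        for i in range(0, len(chunk_data), BLOCK_SIZE)
    ]


def verify_block(chunk_data: bytes, block_index: int, stored: int) -> bool:
    """Re-read one block and compare its CRC against the stored checksum."""
    start = block_index * BLOCK_SIZE
    block = chunk_data[start:start + BLOCK_SIZE]
    return zlib.crc32(block) == stored
```

Because each block is checked independently, a single flipped bit is localized to one 64KB block, which bounds how much data must be re-read from another replica.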
Detecting Stale Replicas:
VERSION NUMBER SYSTEM
────────────────────
Each chunk has a version number
Purpose: detect stale replicas

Lifecycle:
──────────
1. Chunk created: version = 1
2. Master grants lease: version++ (now version = 2)
   All current replicas updated to 2
3. Replica was down:
   Server was offline during the lease grant
   Its version still = 1 (STALE!)
4. Detection:
   Server comes back online
   Heartbeat: "I have chunk X, version 1"
   Master: "Current version is 2, yours is stale"
   → Mark for garbage collection

EXAMPLE SCENARIO:
────────────────
Chunk 0x1a2b, initially version 1
Replicas: [cs-5, cs-12, cs-23]

Time T0: All have version 1
────────────────────────────
cs-5:  version 1 ✓
cs-12: version 1 ✓
cs-23: version 1 ✓

Time T1: cs-23 goes offline
───────────────────────────
cs-23 crashes, network partition, etc.

Time T2: Client writes to chunk
───────────────────────────────
Master grants lease → version++
New version: 2
Master updates:
cs-5:  version 2 ✓
cs-12: version 2 ✓
cs-23: offline (still has version 1)

Time T3: cs-23 comes back online
────────────────────────────────
cs-23 heartbeat: "I have chunk 0x1a2b, version 1"
Master checks:
• Current version: 2
• cs-23 version: 1
• STALE REPLICA!

Master → cs-23: "Delete chunk 0x1a2b (stale)"
Master schedules re-replication:
• Copy from cs-5 or cs-12 (version 2)
• To a new chunkserver
• Maintain 3 replicas

PERSISTENCE:
───────────
Version stored:
• In master's operation log (persistent)
• In chunkserver's metadata (persistent)
• In master's memory (volatile)

After master restart:
• Loads versions from the operation log
• Chunkservers report their versions
• Master validates and cleans up stale replicas
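The master-side check at time T3 reduces to a simple comparison. A minimal sketch (the function and return labels are illustrative; the AHEAD case mirrors the paper's rule that a replica reporting a version higher than the master's record implies the master crashed during a grant and should adopt the higher version):

```python
def classify_replica(current_version: int, reported_version: int) -> str:
    """Compare a heartbeat-reported chunk version against the master's record."""
    if reported_version < current_version:
        return "STALE"    # schedule deletion and re-replicate from a current copy
    if reported_version > current_version:
        return "AHEAD"    # master missed the grant; adopt the higher version
    return "CURRENT"
```

In the T3 scenario above, `classify_replica(2, 1)` yields "STALE", triggering garbage collection of cs-23's copy.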
Globally Unique Identifier:
CHUNK HANDLE
───────────
64-bit globally unique identifier
Generation: master assigns at chunk creation

Properties:
──────────
• Unique across the entire GFS cluster
• Never reused
• Identifies the chunk forever
• Used as the filename on chunkservers

Format:
───────
0x1a2b3c4d5e6f7a8b
│         │
│         └─ Random/sequential component
└─── Timestamp component (optional)

Mapping:
────────
Master maintains:
File → List of chunk handles
/data/file1 → [0x1a2b..., 0x2c3d..., 0x3e4f...]

Each chunk handle → Metadata
0x1a2b... → {
  version: 5,
  replicas: [cs-5, cs-12, cs-23],
  primary: cs-5,
  lease_expiration: T+60
}

Chunkserver stores:
Chunk handle → Chunk file
0x1a2b... → /gfs/chunks/0x1a2b3c4d5e6f7a8b

Benefits:
────────
✓ Simple mapping
✓ No path dependencies
✓ Easy to move chunks between servers
✓ No name conflicts
✓ Efficient lookup
COMPLETE WRITE OPERATION
────────────────────────

CLIENT WRITE REQUEST:
────────────────────
gfs.write("/data/file1", offset=50MB, data=1MB)

PHASE 1: LEASE ACQUISITION
──────────────────────────
Client → Master: "Need to write to /data/file1, chunk 0"
Master checks:
  1. Chunk 0 exists? Yes
  2. Has current primary? No
  3. Select primary: cs-5 (lowest load)
  4. Grant lease to cs-5 (60 seconds)
  5. Increment version: 4 → 5
Master → cs-5, cs-12, cs-23: "Increment version to 5"
All chunkservers ACK
Master → Client: {
  chunk_handle: 0x1a2b...,
  version: 5,
  primary: cs-5,
  secondaries: [cs-12, cs-23],
  lease_expiration: T+60
}

PHASE 2: DATA PUSH
─────────────────
Client forms chain: cs-5 → cs-12 → cs-23
(based on network topology)

Client → cs-5: {
  type: DATA_PUSH,
  chunk_handle: 0x1a2b...,
  data_id: uuid_123,   // Unique ID for this data
  data: [1MB payload]
}
cs-5:
  1. Buffer data in memory (don't write to disk yet!)
  2. Forward to cs-12 while still receiving from the client
     → cs-12: DATA_PUSH with same data
cs-12:
  1. Buffer data in memory
  2. Forward to cs-23
     → cs-23: DATA_PUSH with same data
cs-23:
  1. Buffer data in memory
  2. Send ACK to cs-12
cs-12 → ACK to cs-5
cs-5 → ACK to client

Client now knows all replicas have buffered the data

PHASE 3: WRITE COMMAND
─────────────────────
Client → Primary (cs-5): {
  type: WRITE_COMMAND,
  chunk_handle: 0x1a2b...,
  data_id: uuid_123,   // Reference to buffered data
  offset: 50MB,
  length: 1MB
}

PRIMARY (cs-5) PROCESSING:
─────────────────────────
1. Assign serial number:
   (If multiple concurrent writes, serialize them)
   serial_number = next_serial++   // e.g., 42
2. Apply write to local disk:
   fd = open("/gfs/chunks/0x1a2b...")
   lseek(fd, 50MB)
   write(fd, buffered_data[uuid_123])
3. Update checksums:
   start_block = 50MB / 64KB
   end_block = 51MB / 64KB
   for block in [start_block..end_block]:
     checksum[block] = crc32(data[block])
4. Forward write command to secondaries:
   cs-5 → cs-12, cs-23: {
     type: APPLY_WRITE,
     chunk_handle: 0x1a2b...,
     data_id: uuid_123,
     offset: 50MB,
     length: 1MB,
     serial_number: 42
   }

SECONDARIES (cs-12, cs-23):
──────────────────────────
1. Apply in serial number order:
   (Wait if earlier serial numbers not yet applied)
2. Write to disk: same as primary
3. Update checksums: same as primary
4. Send ACK to primary:
   cs-12 → cs-5: ACK(serial_number=42)
   cs-23 → cs-5: ACK(serial_number=42)

PRIMARY RECEIVES ACKS:
─────────────────────
If ALL secondaries ACK:
  cs-5 → Client: SUCCESS
If ANY secondary fails:
  cs-5 → Client: ERROR
  (Client retries the entire operation)

A write returns SUCCESS only after every replica applies it; a failed
attempt can leave replicas inconsistent until a retry succeeds.

FAILURE SCENARIOS:
─────────────────

Scenario 1: cs-23 fails during data push
────────────────────────────────────────
• Client doesn't receive ACK from all replicas
• Client retries the entire write
• Master may detect cs-23 failure
• Master re-replicates the chunk elsewhere

Scenario 2: cs-12 fails during apply
────────────────────────────────────
• Primary doesn't receive ACK from cs-12
• Primary returns ERROR to client
• Client retries
• Chunk now inconsistent:
  - cs-5: has write
  - cs-12: may or may not have write
  - cs-23: has write
• But: all replicas have the same version (5)
• The next successful write creates a defined region
• GFS's relaxed consistency model allows this

Scenario 3: Primary fails after apply
─────────────────────────────────────
• Primary wrote locally and sent to secondaries
• Crashed before receiving ACKs
• Client sees a timeout
• Client retries
• Master's lease expires (60 sec max wait)
• Master grants the lease to a different replica
• New primary has the data (it was applied)
• Client retry may create a duplicate (application handles)
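The secondaries' "apply in serial number order" step can be sketched as a small reorder buffer. This is an illustrative model (class and method names are assumptions, not GFS code): mutations may arrive out of order, but each is applied only when all lower serial numbers have been applied.

```python
import heapq


class SecondaryApplier:
    """Apply mutations in the primary-assigned serial order."""

    def __init__(self):
        self.next_serial = 1
        self.pending = []  # min-heap of (serial, mutation)

    def receive(self, serial, mutation, apply_fn):
        """Buffer a mutation; apply any that are now in order.

        Returns the list of serial numbers that were just applied
        (these would be ACKed back to the primary).
        """
        heapq.heappush(self.pending, (serial, mutation))
        acked = []
        while self.pending and self.pending[0][0] == self.next_serial:
            _, m = heapq.heappop(self.pending)
            apply_fn(m)                    # write to disk, update checksums
            acked.append(self.next_serial)
            self.next_serial += 1
        return acked
```

If serial 2 arrives before serial 1, nothing is applied until 1 shows up, after which both are applied back to back, so every replica ends up with the identical mutation order.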
SUCCESSFUL RECORD APPEND
───────────────────────

CLIENT REQUEST:
──────────────
offset = gfs.record_append("/data/file1", data=100KB)

CLIENT → MASTER:
───────────────
"Need to append to /data/file1"
Master returns the last chunk (say, chunk 5):
{
  chunk_handle: 0x1a2b...,
  version: 3,
  primary: cs-5,
  secondaries: [cs-12, cs-23],
  current_size: 60MB   // Approximate
}

DATA PUSH PHASE:
───────────────
(Same as write: push to all replicas)
Client → cs-5 → cs-12 → cs-23
Data buffered in memory on all

APPEND COMMAND:
──────────────
Client → Primary (cs-5):
{
  type: RECORD_APPEND,
  chunk_handle: 0x1a2b...,
  data_id: uuid_456,
  length: 100KB
}

PRIMARY PROCESSING:
──────────────────
1. Check current offset:
   current_offset = get_chunk_size(0x1a2b...)   // 60MB
2. Check if the record fits:
   new_offset = 60MB + 100KB
   if new_offset <= 64MB:
     # Fits! Proceed
   else:
     # Doesn't fit: pad and use the next chunk
3. Assign offset: append_offset = 60MB
4. Assign serial number: serial = next_serial++
5. Apply locally:
   fd = open("/gfs/chunks/0x1a2b...")
   lseek(fd, 60MB)
   write(fd, buffered_data[uuid_456])
6. Update checksums
7. Forward to secondaries:
   cs-5 → cs-12, cs-23: {
     type: APPLY_APPEND,
     data_id: uuid_456,
     offset: 60MB,   // SAME offset for all!
     length: 100KB,
     serial: serial
   }

SECONDARIES APPLY:
─────────────────
cs-12: lseek(60MB), write(data), update_checksums(), ACK to primary
cs-23: same

PRIMARY → CLIENT:
────────────────
{
  status: SUCCESS,
  offset: 60MB
}
Client now knows the record is at offset 60MB
All replicas identical!
APPEND AT CHUNK BOUNDARY
───────────────────────
Current chunk state:
• Chunk 5: 63.9MB used (almost full)
• Want to append: 200KB

Check: 63.9MB + 200KB = 64.1MB > 64MB
→ Doesn't fit in the current chunk!

PRIMARY ACTION:
──────────────
1. Pad the current chunk to 64MB:
   padding = 64MB - 63.9MB = 100KB
   write(padding_bytes)   // All zeros or random
2. All secondaries do the same:
   cs-5 → cs-12, cs-23: {
     type: PAD_TO_END,
     chunk: 0x1a2b...,
     padding: 100KB
   }
3. Tell the client to retry with the next chunk:
   Primary → Client: {
     status: RETRY_NEXT_CHUNK,
     next_chunk: 6
   }

CLIENT RETRIES:
──────────────
Client → Master: "Need chunk 6 of /data/file1"
Master:
• Creates chunk 6 if it doesn't exist
• Assigns a chunk handle, selects replicas
• Returns metadata
Client retries the append with chunk 6
→ Succeeds at offset 0 of the new chunk

RESULT:
──────
Chunk 5: 63.9MB data + 100KB padding = 64MB (full)
Chunk 6: 200KB record at offset 0

Application sees:
• Record at chunk 6, offset 0
• Knows to skip padding when reading
• Uses record markers/checksums to identify data
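The primary's fit-or-pad decision is a one-line arithmetic check. A minimal sketch (function name and return shape are illustrative assumptions):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB


def plan_append(current_size: int, record_len: int):
    """Decide whether a record fits in the current chunk.

    Returns (action, offset, padding): either APPEND at the current
    end of the chunk, or RETRY_NEXT_CHUNK after padding to 64MB.
    """
    if current_size + record_len <= CHUNK_SIZE:
        return ("APPEND", current_size, 0)
    padding = CHUNK_SIZE - current_size  # bytes of padding written to all replicas
    return ("RETRY_NEXT_CHUNK", None, padding)
```

This is also why GFS limits a record append to a fraction of the chunk size: the worst-case padding is bounded by one record length.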
APPEND WITH FAILURES
───────────────────
Scenario: secondary fails during append

CLIENT REQUEST:
──────────────
gfs.record_append("/data/file1", data=100KB)

DATA PUSH:
─────────
Client → cs-5 → cs-12 → cs-23
cs-5:  Buffered ✓
cs-12: Buffered ✓
cs-23: CRASH (network partition, disk failure, etc.)
Client receives ACKs from cs-5 and cs-12 only

APPEND COMMAND:
──────────────
Client → Primary (cs-5): RECORD_APPEND(data_id, 100KB)

PRIMARY APPLIES:
───────────────
cs-5:
• Assigns offset: 60MB
• Writes locally at 60MB ✓
cs-5 → cs-12: APPLY_APPEND(offset=60MB)
cs-12:
• Writes at 60MB ✓
• ACKs to primary ✓
cs-5 → cs-23: APPLY_APPEND(offset=60MB)
cs-23:
• TIMEOUT (offline!)

PRIMARY RECEIVES:
────────────────
ACKs: cs-12 ✓, cs-23 ✗
NOT all replicas ACKed!
Primary → Client: ERROR: append failed

CURRENT STATE:
─────────────
cs-5:  Has record at 60MB ✓
cs-12: Has record at 60MB ✓
cs-23: Doesn't have record ✗
Chunk is now INCONSISTENT!

CLIENT RETRIES:
──────────────
Client retries the entire operation:
1. Re-push data to all replicas (cs-23 may be back, or replaced)
2. Primary assigns a NEW offset: now 60.1MB (after the previous record)
3. Applies to all replicas
4. SUCCESS

RESULT:
──────
Chunk contents:
┌──────────────────────────────┐
│ ...                          │
│ 60.0MB: Record data (100KB)  │ ← First attempt (on cs-5, cs-12)
│ 60.1MB: Record data (100KB)  │ ← Retry (on all replicas)
│ ...                          │
└──────────────────────────────┘
• DUPLICATE at 60.0MB (only on cs-5, cs-12)
• Successful record at 60.1MB (on all)
• Client receives offset 60.1MB
• Application will see the duplicate when reading

APPLICATION HANDLING:
────────────────────
Reader scans the file:

for record in read_file("/data/file1"):
    if not valid_checksum(record):
        skip   // Inconsistent region (60.0MB on cs-23)
    if seen_id(record.id):
        skip   // Duplicate (60.0MB on cs-5, cs-12)
    process(record)   // Process the 60.1MB record

Application de-duplicates based on unique IDs
Application skips inconsistent regions

GFS guarantees:
• At-least-once delivery
• May have duplicates
• May have inconsistent regions
• Application handles this
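The reader loop above can be written as a runnable sketch. Note that the per-record checksum and unique ID are application-level conventions layered on top of GFS, not something GFS itself defines; the record shape here (a dict with an `id` field) is purely illustrative.

```python
def read_records(records, is_valid):
    """Filter a scanned record stream: drop records that fail their
    application-level checksum (inconsistent regions, padding) and
    records whose unique ID was already seen (duplicates from retries)."""
    seen = set()
    out = []
    for rec in records:
        if not is_valid(rec):
            continue          # skip inconsistent region
        if rec["id"] in seen:
            continue          # skip duplicate from a retried append
        seen.add(rec["id"])
        out.append(rec)
    return out
```

Given the failure above, a reader on cs-5 would see the 60.0MB record and the 60.1MB record with the same ID and keep only the first valid copy; a reader on cs-23 would see garbage at 60.0MB, fail the checksum, and skip it.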
MULTIPLE CONCURRENT APPENDS
──────────────────────────
Scenario: 3 clients append simultaneously

CLIENT A: append(data_A, 50KB)
CLIENT B: append(data_B, 75KB)
CLIENT C: append(data_C, 100KB)

All three:
1. Push data to replicas
2. Send an append command to the primary

PRIMARY SERIALIZES:
──────────────────
Primary receives three append requests:

Request queue:
1. Append(data_A, 50KB)  - arrived first
2. Append(data_B, 75KB)  - arrived second
3. Append(data_C, 100KB) - arrived third

Primary processes IN ORDER:

Step 1: Process data_A
──────────────────────
current_offset = 60MB
assign offset: 60MB
write data_A at 60MB
forward to secondaries: "write at 60MB"
new_offset = 60MB + 50KB = 60.05MB

Step 2: Process data_B
──────────────────────
current_offset = 60.05MB
assign offset: 60.05MB
write data_B at 60.05MB
forward to secondaries: "write at 60.05MB"
new_offset = 60.05MB + 75KB = 60.125MB

Step 3: Process data_C
──────────────────────
current_offset = 60.125MB
assign offset: 60.125MB
write data_C at 60.125MB
forward to secondaries: "write at 60.125MB"
new_offset = 60.125MB + 100KB = 60.225MB

RESULT:
──────
All replicas have identical contents:
┌────────────────────────────────┐
│ ...                            │
│ 60.000MB: data_A (50KB)        │
│ 60.050MB: data_B (75KB)        │
│ 60.125MB: data_C (100KB)       │
│ 60.225MB: <next append here>   │
└────────────────────────────────┘

Clients receive:
• Client A: offset 60.000MB
• Client B: offset 60.050MB
• Client C: offset 60.125MB

KEY POINT:
─────────
The primary serializes ALL appends
→ Consistent order across all replicas
→ No distributed consensus needed
→ Simple, fast, scalable

Throughput:
• Limited by the primary's processing
• Thousands of appends/second possible
• Perfect for MapReduce intermediate data
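The heart of this serialization is an atomic "reserve the next offset" step. A minimal sketch of the primary's bookkeeping, assuming a single lock as the serialization point (class and method names are illustrative):

```python
import threading


class PrimaryAppendSerializer:
    """Assign each concurrent append a unique, contiguous offset."""

    def __init__(self, start_offset=0):
        self.offset = start_offset
        self.lock = threading.Lock()

    def assign(self, length: int) -> int:
        """Atomically reserve `length` bytes; return the record's offset."""
        with self.lock:
            off = self.offset
            self.offset += length
            return off
```

Because every append goes through this one primary, all replicas receive the same (offset, data) pairs in the same order, with no cross-client coordination.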
READ PATH WITH CORRUPTION:
─────────────────────────
Client reads from cs-5:
1. Chunkserver reads the block from disk
2. Computes checksum
3. Compares with the stored checksum
4. MISMATCH! Corruption detected

Chunkserver → Client: ERROR: data corrupted

Client response:
1. Try a different replica (cs-12)
2. Report the corruption to the master
3. Return correct data to the application

Master action:
1. Mark cs-5's copy as corrupted
2. Schedule re-replication from cs-12
3. Eventually delete the corrupted copy from cs-5
4. Restore to 3 replicas

EXAMPLE:
───────
Read request: chunk 0x1a2b, offset 5MB, length 1MB
cs-5 reads blocks [80..95] (5MB/64KB to 6MB/64KB)
Block 87: checksum mismatch!
Expected: 0xabcd1234
Actual:   0xabcd1235  ← One bit different
Error returned to client

Client retries:
cs-12 → Checksum OK → Return data

Master notified:
• cs-5 chunk 0x1a2b: CORRUPTED
• Re-replicate from cs-12 or cs-23
• Delete from cs-5 after re-replication
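The client-side retry path can be sketched as a failover loop. This is an illustrative model, not the real GFS client library: `replicas` are stand-in objects with a `read()` method, and the exception type is an assumption.

```python
class ChecksumError(Exception):
    """Raised by a chunkserver stub when block verification fails."""


def read_with_failover(replicas, chunk_handle, offset, length, report):
    """Try replicas in order; report corrupt ones to the master.

    The master uses the reports to schedule re-replication and to
    delete the corrupted copy once a healthy replica is copied.
    """
    for server in replicas:
        try:
            return server.read(chunk_handle, offset, length)
        except ChecksumError:
            report(server, chunk_handle)  # master will re-replicate
    raise IOError("no replica returned valid data")
```

The application never sees the corruption: it only pays one extra round trip while the master repairs the replica count in the background.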
Scrubbing
Proactive Corruption Detection:
BACKGROUND SCRUBBING
───────────────────
Chunkserver runs a background task:

def scrubbing_task():
    while True:
        for chunk in all_chunks:
            if chunk.idle() and low_load():
                verify_chunk(chunk)
            sleep(random(60, 300))   # 1-5 minutes between chunks

def verify_chunk(chunk_handle):
    for block_index in range(num_blocks):
        data = read_block(chunk_handle, block_index)
        expected = get_checksum(block_index)
        actual = crc32(data)
        if expected != actual:
            # Found corruption!
            log_error(...)
            report_to_master(chunk_handle, block_index)
            # Don't fix locally; let the master coordinate

BENEFITS:
────────
• Detect corruption before a client reads it
• Find corruption in rarely-read data
• Catch disk degradation early
• Reduce read errors for clients

SCHEDULING:
──────────
• Low priority (idle time only)
• Rate limited (1-2 chunks/minute)
• Doesn't impact foreground I/O
• Full cluster scrub: weeks/months

MASTER RESPONSE:
───────────────
Chunkserver: "Chunk 0x1a2b is corrupted"
Master:
1. Verify the chunk has other good replicas
2. Increase the re-replication priority
3. Schedule re-replication
4. Mark the corrupted replica for deletion
5. Eventually delete it from the reporting chunkserver
Write-Time Optimization
Append Checksum Optimization:
APPEND CHECKSUM HANDLING
───────────────────────
Problem: the last block may be partial

Example:
────────
Chunk size: 5.5MB
Blocks: 0-87 (88 blocks)
Last block (87): only 32KB used (not the full 64KB)
Append: 100KB data

Naive approach:
──────────────
1. Read the last block (32KB existing data)
2. Append new data (100KB)
3. Compute checksums for the modified blocks
4. Write back
→ Extra read I/O! Slow!

GFS Optimization:
────────────────
1. Don't read the last block
2. Append data starting at a block boundary
3. Pad if necessary

Example:
────────
Block 87: 32KB data + 32KB padding
Block 88: 64KB new data (part of the 100KB)
Block 89: 36KB new data + 28KB unused

Trade-off:
• Some wasted space (padding)
• No extra read I/O
• Faster appends
• Acceptable for an append-heavy workload

CHECKSUM INCREMENTAL UPDATE:
───────────────────────────
For record append:
1. Primary assigns the offset
2. Writes the data
3. Computes checksums for the new blocks only
4. Sends checksums to secondaries with the data
5. Secondaries verify and store

No need to re-checksum the entire chunk
→ Fast, scalable
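The pad-to-boundary optimization described above reduces to a little modular arithmetic: compute the padding to the next 64KB boundary, then checksum only the freshly written blocks. A sketch (function name and dict-of-checksums shape are illustrative assumptions):

```python
import zlib

BLOCK = 64 * 1024  # 64KB checksum blocks


def append_at_boundary(chunk_len: int, data: bytes):
    """Place new append data at the next block boundary.

    Returns (padding_bytes, {block_index: crc32}) covering only the
    new blocks; the partial last block is never read or re-checksummed.
    """
    pad = (-chunk_len) % BLOCK               # padding to reach the boundary
    first_new_block = (chunk_len + pad) // BLOCK
    num_new_blocks = (len(data) + BLOCK - 1) // BLOCK
    new_checksums = {
        first_new_block + i: zlib.crc32(data[i * BLOCK:(i + 1) * BLOCK])
        for i in range(num_new_blocks)
    }
    return pad, new_checksums
```

For the example above (block 87 half full, 100KB appended), this yields 32KB of padding and fresh checksums for blocks 88 and 89 only.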
Expected Answer: GFS uses pipelined data transfer to maximize network bandwidth utilization.

Without Pipelining (Sequential):
Client sends to replica 1, waits for completion
Then sends to replica 2, waits
Then sends to replica 3
Time: 3× transfer time
Only one network link active at a time
With Pipelining:
Client sends to replica 1
Replica 1 forwards to replica 2 while still receiving from client
Replica 2 forwards to replica 3 while receiving from replica 1
All network links active simultaneously
Time: ~1× transfer time (plus small propagation delay)
Benefits:
3× speedup for 3 replicas
Fully utilizes network topology
Each link operates at full bandwidth
Essential for GFS’s high throughput goals
GFS chains replicas by network distance (closest first) to minimize cross-rack transfers and optimize pipelining.
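The claimed speedup is easy to sanity-check with a back-of-the-envelope model. A sketch, assuming an idealized 1 ms per-hop forwarding delay (the bandwidth and latency figures are illustrative, not from the GFS paper):

```python
def sequential_time(nbytes: float, bw: float, replicas: int) -> float:
    """Client sends the full payload to each replica in turn."""
    return replicas * nbytes / bw


def pipelined_time(nbytes: float, bw: float, replicas: int,
                   hop_latency: float = 0.001) -> float:
    """Chained forwarding: roughly one payload time plus a small
    per-hop propagation delay."""
    return nbytes / bw + (replicas - 1) * hop_latency
```

For a 1MB payload at 100 Mbps (12.5 MB/s) with 3 replicas, the model gives about 0.25 s sequential versus about 0.086 s pipelined, matching the "~1× transfer time" intuition above.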
Intermediate: How does record append handle concurrent writers?
Expected Answer: Record append enables multiple clients to append to the same file concurrently without coordination.

Process:
Data Push: Each client pushes data to all replicas independently
Append Request: Each client sends append command to primary (not to a specific offset!)
Primary Serializes: Primary assigns offset for each append in the order received
Sequential Application: Primary applies appends sequentially, each at its assigned offset
Replica Coordination: Primary tells secondaries exact offset for each append
Offset Return: Primary returns assigned offset to each client
Key Properties:
Primary acts as serialization point (no distributed consensus needed)
All replicas apply appends in same order at same offsets
Each client learns where its record was placed
No locking or coordination between clients required
Enables thousands of concurrent appenders
Failure Handling:
If append fails on any replica, client retries
Retry may create duplicate at different offset
Application de-duplicates using unique record IDs
At-least-once delivery guaranteed
Perfect for MapReduce where thousands of mappers append to shared output files.
Advanced: Explain GFS's checksum strategy and trade-offs
Expected Answer: GFS uses 32-bit CRC32 checksums on 64KB blocks.

Design:
Each 64MB chunk divided into 1024 × 64KB blocks
Each block has 32-bit checksum (4KB metadata per chunk)
Checksums stored in memory and on disk
Verified on every read
Why 64KB blocks?

Smaller blocks (4KB):
✓ More precise corruption detection
✗ More checksums (higher memory overhead)
✗ More CPU for computation
✗ Higher metadata storage
Larger blocks (1MB):
✓ Less overhead
✗ Must re-read more data on corruption
✗ Less precise detection
64KB: Balance of precision and overhead

Write Optimization:
For appends: Don’t re-read last partial block
Pad to block boundary instead
Trade space for speed (append-optimized workload)
Detection Timing:
Read-time: Every read verified
Scrubbing: Background task checks idle chunks
Replication: Verified during chunk copy
Corruption Response:
Read from different replica
Report to master
Master schedules re-replication
Corrupted replica deleted
Trade-offs:
0.006% space overhead (acceptable)
CPU cost negligible (modern CRC32 instructions)
False negative: ~1 in 4 billion (acceptable for commodity hardware)
Catches virtually all bit flips, disk corruption, memory errors
System Design: How would you optimize GFS for small random writes?
Expected Answer: GFS is optimized for large sequential writes/appends. For small random writes, several optimizations are possible.

1. Client-Side Buffering:
Buffer multiple small writes in client
Batch into larger writes (e.g., 1MB)
Flush periodically or when buffer full
Trade-off: Latency for throughput, memory usage
2. Log-Structured Approach:
Append small writes to log file
Background compaction merges into final locations
Like LevelDB/RocksDB approach
Trade-off: Complexity, write amplification
3. Write Coalescing at Chunkserver:
Chunkserver buffers writes in memory
Coalesces overlapping/adjacent writes
Flushes in batches
Trade-off: Durability concerns, memory usage
4. Smaller Chunks:
Reduce from 64MB to 4-8MB
Less internal fragmentation
More metadata overhead (acceptable for small files)
Trade-off: More master load, more metadata
5. Separate Tier:
Small-write tier with different chunk size/strategy
Large-write tier with current 64MB chunks
Route based on access pattern
Trade-off: System complexity, two code paths
6. Hybrid Storage:
Small writes to SSD/NVME (low latency)
Large sequential to HDD (high throughput)
Background migration based on patterns
Trade-off: Cost, complexity
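Option 1 (client-side buffering) can be sketched in a few lines. This is an illustrative sketch, not a real GFS client API: `flush_fn` stands in for whatever large-write call the underlying store provides.

```python
class WriteBuffer:
    """Coalesce small writes into one large batched write."""

    def __init__(self, flush_fn, limit=1024 * 1024):
        self.flush_fn = flush_fn        # e.g., a 1MB GFS write or append
        self.limit = limit
        self.pending = bytearray()

    def write(self, data: bytes):
        """Buffer a small write; flush once the batch reaches the limit."""
        self.pending += data
        if len(self.pending) >= self.limit:
            self.flush()

    def flush(self):
        """Force out any buffered data (call periodically and on close)."""
        if self.pending:
            self.flush_fn(bytes(self.pending))
            self.pending = bytearray()
```

The trade-off noted above is visible directly in the code: buffered bytes are not durable until `flush()` runs, so a client crash can lose up to `limit` bytes of acknowledged-to-caller data unless the caller also flushes on a timer.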
Real-World:
Colossus (GFS successor) uses metadata sharding and smaller chunks
HDFS has append-only model similar to GFS
Modern systems like Ceph use different strategies entirely
For truly random small writes, key-value stores (Bigtable, HBase) are a better fit