The master is the brain of GFS, orchestrating all metadata operations and coordinating the distributed system. This chapter explores how the master manages the namespace, allocates chunks, grants leases, places replicas, and maintains system health through garbage collection.
Chapter Goals:
Understand namespace management with fine-grained locking
LOCK ORDERING PROTOCOL
─────────────────────
Rule:  Always acquire locks in a consistent order
Order: Lexicographic order by full pathname

Example:
────────
Operation: Rename /a/x to /b/y

Incorrect:
  /a/x lock (write)
  /b/y lock (write)
  → If another thread locks /b/... first → DEADLOCK!

Correct:
  Sort: [/a, /a/x, /b, /b/y]
  Acquire in order:
    /a   → read
    /a/x → write
    /b   → read
    /b/y → write
  Deadlock impossible!

IMPLEMENTATION:
──────────────
Algorithm:
──────────
1. Collect all paths needed for the operation
2. Sort lexicographically
3. Acquire locks in sorted order
4. Perform the operation
5. Release all locks

Example with snapshot:
─────────────────────
Snapshot /home/alice to /backup/alice_snap

Paths needed:
- /home (read)
- /home/alice (write, to prevent changes)
- /backup (read)
- /backup/alice_snap (write)

Sorted order:
1. /backup (read)
2. /backup/alice_snap (write)
3. /home (read)
4. /home/alice (write)

Acquire in this order → No deadlock possible
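To make the sorted-acquisition rule concrete, here is a minimal Python sketch; the lock table, helper names, and the use of plain mutexes (instead of GFS's per-path read/write locks) are simplifications invented for illustration.

import os
import threading
from collections import defaultdict

# Hypothetical per-path lock table; real GFS uses read/write locks per
# path component, but plain mutexes keep this sketch self-contained.
_locks = defaultdict(threading.Lock)

def acquire_in_order(paths):
    """Sort the full pathnames lexicographically and lock in that order,
    which makes deadlock between concurrent operations impossible."""
    ordered = sorted(set(paths))
    for p in ordered:
        _locks[p].acquire()
    return ordered

def release_all(held):
    for p in reversed(held):
        _locks[p].release()

def snapshot(src, dst):
    # Lock the parent directories plus both leaf paths (read vs. write
    # intent is omitted here; only the ordering rule is shown).
    needed = [os.path.dirname(src), src, os.path.dirname(dst), dst]
    held = acquire_in_order(needed)
    try:
        pass  # ... perform the metadata copy here ...
    finally:
        release_all(held)

snapshot("/home/alice", "/backup/alice_snap")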
Scalability Benefits:
PERFORMANCE COMPARISON
─────────────────────
Metric: Operations per second on /data/logs/

Traditional (directory-level locking):
──────────────────────────────────────
1 thread:  1000 creates/sec
2 threads: 1000 creates/sec (serialized!)
4 threads: 1000 creates/sec (serialized!)
8 threads: 1000 creates/sec (serialized!)

GFS (fine-grained path locking):
────────────────────────────────
1 thread:  1000 creates/sec
2 threads: 2000 creates/sec (parallel!)
4 threads: 4000 creates/sec (parallel!)
8 threads: 8000 creates/sec (parallel!)
→ Linear scaling with parallelism

REAL-WORLD IMPACT:
─────────────────
MapReduce with 1000 mappers:
• All write to /output/intermediate/
• Each creates a file with a unique name
• All proceed in parallel
• No lock contention!

Without fine-grained locking:
• Serialized on the directory lock
• 1000x slower
• System unusable
The master keeps three types of metadata, all in memory:
File Namespace (directory and file names)
• Full pathname → metadata
• Modification times
• Owner/permissions
• List of chunk handles
→ Persistent (operation log)

Chunk Metadata (chunk handle → info)
• Chunk version number
• List of replica locations
• Primary (if leased)
• Lease expiration
→ Partially persistent

Chunkserver State (per-server info)
• Available disk space
• Current load (CPU, I/O)
• Chunks stored
• Last heartbeat time
→ Not persistent
METADATA MEMORY LAYOUT
─────────────────────
For 100 million chunks (several petabytes):

File Namespace:
──────────────
• ~10 million files
• ~100 bytes per file
• Total: ~1 GB

Chunk Metadata:
──────────────
• 100 million chunks
• ~64 bytes per chunk
  - Chunk handle (8 bytes)
  - Version number (8 bytes)
  - Replica locations (3-5 locations × 8 bytes)
  - Lease info (16 bytes)
• Total: ~6 GB

Chunkserver State:
─────────────────
• ~1000 chunkservers
• ~1 KB per server
• Total: ~1 MB

TOTAL: < 10 GB (easily fits in the RAM of a modern server)

Benefits:
────────
✓ Fast operations (no disk I/O)
✓ Simple consistency
✓ Global view for decisions
✓ Easy to scan the entire namespace
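The totals can be reproduced with quick arithmetic using the per-entry sizes quoted above (approximate figures, not measurements):

# Rough back-of-the-envelope check of the memory figures above.
files, bytes_per_file = 10_000_000, 100
chunks, bytes_per_chunk = 100_000_000, 64
servers, bytes_per_server = 1_000, 1_024

namespace_gb = files * bytes_per_file / 1e9          # ≈ 1.0 GB
chunk_meta_gb = chunks * bytes_per_chunk / 1e9       # ≈ 6.4 GB
server_state_mb = servers * bytes_per_server / 1e6   # ≈ 1.0 MB

print(namespace_gb, chunk_meta_gb, server_state_mb)  # well under 10 GB total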
LEASE GRANT PROCESS
──────────────────
Client requests a write to chunk X

Master checks:
1. Does chunk X have a current primary?
   NO → Select one replica as primary
2. Grant lease:
   ┌────────────────────────────┐
   │ Chunk: X                   │
   │ Primary: chunkserver-5     │
   │ Lease expiration: T+60sec  │
   │ Version: 7                 │
   └────────────────────────────┘
3. Send to chunkserver-5:
   "You are primary for chunk X until T+60"
4. Increment chunk version to 8
5. Return to client:
   Primary: cs-5
   Secondaries: [cs-2, cs-9]
   Version: 8
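To make the grant flow concrete, here is a minimal Python sketch under the assumption that the master keeps an in-memory record per chunk; ChunkInfo, grant_lease(), and send_rpc are invented names, not GFS's actual interfaces.

import time

LEASE_SECONDS = 60

class ChunkInfo:
    def __init__(self, handle, replicas, version):
        self.handle = handle
        self.replicas = replicas      # e.g. ["cs-5", "cs-2", "cs-9"]
        self.version = version
        self.primary = None
        self.lease_expiration = 0.0

def grant_lease(chunk, send_rpc):
    """Return (primary, secondaries, version) for a client write request."""
    now = time.time()
    if chunk.primary is None or chunk.lease_expiration <= now:
        # No current primary: pick one replica, bump the version, and
        # tell the chosen chunkserver it holds the lease for 60 seconds.
        chunk.primary = chunk.replicas[0]
        chunk.version += 1
        chunk.lease_expiration = now + LEASE_SECONDS
        send_rpc(chunk.primary,
                 f"You are primary for {chunk.handle} "
                 f"until T+{LEASE_SECONDS}, version {chunk.version}")
    secondaries = [r for r in chunk.replicas if r != chunk.primary]
    return chunk.primary, secondaries, chunk.version

# Example: chunk X replicated on cs-5, cs-2, cs-9 at version 7.
chunk_x = ChunkInfo("X", ["cs-5", "cs-2", "cs-9"], version=7)
print(grant_lease(chunk_x, send_rpc=lambda server, msg: None))
# → ('cs-5', ['cs-2', 'cs-9'], 8)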
2. Primary Orders Mutations
PRIMARY ROLE
───────────
While the lease is held, the primary has authority:

Multiple clients send writes:
Client A: Write(offset=1000, len=100)
Client B: Append(len=200)
Client C: Write(offset=5000, len=50)

Primary receives all three:
─────────────────────────
1. Assign serial numbers:
   A → Serial #1
   B → Serial #2
   C → Serial #3
2. Apply to the local chunk in order:
   #1: Write at 1000
   #2: Append at end
   #3: Write at 5000
3. Send to secondaries:
   "Apply operations in order: #1, #2, #3"
4. Wait for ACKs
5. Reply to clients

Result: Consistent ordering across all replicas!
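A sketch of the serialization step, assuming a simple in-memory chunk buffer; the Primary/Secondary classes and the mutate() call are illustrative only, and ACK handling and error paths are omitted.

import itertools

class Secondary:
    def apply(self, serial, op, offset, payload):
        pass  # a real secondary applies the mutation in the given serial order

class Primary:
    """Sketch of how a primary serializes concurrent mutations."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self._serial = itertools.count(1)   # monotonically increasing serial numbers
        self.data = bytearray()

    def mutate(self, op, offset=None, payload=b""):
        serial = next(self._serial)         # 1. assign a serial number
        self._apply(op, offset, payload)    # 2. apply locally in serial order
        for s in self.secondaries:          # 3. forward the same order
            s.apply(serial, op, offset, payload)
        return serial                       # 4./5. wait for ACKs, reply (omitted)

    def _apply(self, op, offset, payload):
        if op == "write":
            end = offset + len(payload)
            if len(self.data) < end:
                self.data.extend(b"\x00" * (end - len(self.data)))
            self.data[offset:end] = payload
        elif op == "append":
            self.data.extend(payload)

primary = Primary([Secondary(), Secondary()])
primary.mutate("write", offset=1000, payload=b"A" * 100)   # → serial 1
primary.mutate("append", payload=b"B" * 200)               # → serial 2
primary.mutate("write", offset=5000, payload=b"C" * 50)    # → serial 3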
3. Lease Renewal
LEASE LIFECYCLE
──────────────
Initial lease: 60 seconds

Renewal (if writes are ongoing):
────────────────────────────
Every ~10-15 seconds:
Primary → Master: "Heartbeat: still active, renew lease for chunk X"
Master → Primary: "Lease extended to T+60"
This continues while writes happen

Expiration:
──────────
No renewal for 60 seconds?
→ Lease expires
→ Master can grant it to a different replica
→ Or to the same replica (if still alive)

REVOCATION:
──────────
Master needs to revoke a lease
(e.g., a client requested chunk deletion):
1. Master sends a revoke message to the primary
2. Primary stops accepting new mutations
3. Primary completes in-flight operations
4. Primary ACKs the revocation to the master
5. Master can now delete the chunk

If the primary is unreachable?
→ Wait for the 60-second expiration
→ Then proceed
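A minimal sketch of the master-side lease record implied by this lifecycle; the Lease class and its method names are illustrative only.

import time

LEASE_SECONDS = 60
RENEW_INTERVAL = 15   # renewals piggyback on the regular heartbeats

class Lease:
    """Sketch of a master-side lease record for one chunk."""
    def __init__(self, chunk, primary):
        self.chunk = chunk
        self.primary = primary
        self.expires_at = time.time() + LEASE_SECONDS
        self.revoked = False

    def renew(self):
        # Called when the primary's heartbeat asks for an extension.
        if not self.revoked:
            self.expires_at = time.time() + LEASE_SECONDS

    def expired(self):
        # After expiration the master may grant the lease elsewhere.
        return time.time() >= self.expires_at

    def revoke(self):
        # Master-initiated revocation; if the primary is unreachable,
        # the master simply waits for the natural expiration instead.
        self.revoked = True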
WHY LEASES?
──────────
Problem Without Leases:
───────────────────────
Distributed consensus for every write:

Client → Replica 1 ┐
Client → Replica 2 ┼→ Agree on order? (Paxos/Raft)
Client → Replica 3 ┘
Slow! Complex!

For each write:
- Multiple round trips
- Consensus protocol
- Failure handling
→ 10-100 ms latency per operation

With Leases:
───────────
One-time setup (lease grant):
Master → Primary: "You order operations for 60 sec"

Then for each write:
Client → Primary: "Write this"
Primary → Secondaries: "Apply in this order"
→ ~1 ms latency per operation

Benefits:
────────
✓ No distributed consensus per operation
✓ Primary makes the serialization decision
✓ Low-latency writes
✓ Simple protocol
✓ Automatic timeout (no need for perfect failure detection)

LEASE VS LOCK:
─────────────
Lock:
- Must be explicitly released
- Failure? → Deadlock or complex recovery

Lease:
- Automatic timeout
- Failure? → Expires naturally
- Master can wait or revoke
- No distributed state
GFS uses Copy-on-Write (CoW) at the chunk level to implement snapshots efficiently.
Lease Revocation: When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files to be snapshotted. This ensures that any subsequent writes will require a new lease, giving the master a chance to intercept and create a copy.
Log & Copy Metadata: The master logs the snapshot operation to disk. It then applies the operation to its in-memory state by duplicating the metadata for the file or directory. The new “snapshot” file points to the same chunk handles as the original.
Reference Counting: Each chunk handle now has a reference count > 1.
When a client wants to write to a chunk that has been snapshotted:
1. Client asks Master for a lease: "Write to chunk C (ref_count=2)"
2. Master sees ref_count > 1:
   a) Pick a new chunk handle C'
   b) Tell the chunkservers holding C: "Copy C to C' locally"
   c) Update metadata: the current file now points to C' (ref_count=1)
   d) The snapshot still points to C (ref_count=1)
3. Master grants the lease for C' to the client.
4. Client writes to C' normally.
Key Benefit: The initial snapshot is just a metadata operation (copying pointers). The actual data copying is deferred until a write occurs, and only for the specific chunks being modified.
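A minimal sketch of that deferred copy, assuming the master tracks a ref_count per chunk record; the Chunk class, prepare_write(), and the copy_chunk_on_servers callback are illustrative names, not GFS's actual interfaces.

import itertools

_next_handle = itertools.count(1000)

class Chunk:
    def __init__(self, ref_count=1):
        self.handle = next(_next_handle)
        self.ref_count = ref_count

def prepare_write(file_meta, chunk_index, copy_chunk_on_servers):
    """Return the chunk handle the client should write to."""
    chunk = file_meta["chunks"][chunk_index]
    if chunk.ref_count > 1:
        # Chunk is shared with a snapshot: make a private copy first.
        new_chunk = Chunk(ref_count=1)
        copy_chunk_on_servers(src=chunk.handle, dst=new_chunk.handle)
        chunk.ref_count -= 1                       # the snapshot keeps the old chunk
        file_meta["chunks"][chunk_index] = new_chunk
        chunk = new_chunk
    return chunk.handle                            # lease is then granted on this handle

# Example: chunk C shared by the live file and one snapshot (ref_count=2).
shared = Chunk(ref_count=2)
live_file = {"chunks": [shared]}
handle = prepare_write(live_file, 0, copy_chunk_on_servers=lambda src, dst: None)
assert handle != shared.handle and shared.ref_count == 1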
CHUNK CREATION WORKFLOW
──────────────────────
Scenario: Client creates a new file and writes the first chunk

1. CLIENT CREATES FILE
   ──────────────────
   Client: "Create file /data/logs/2003-10-15"
   Master:
   a) Create namespace entry
   b) Log operation to disk
   c) Return success

2. CLIENT WRITES DATA
   ──────────────────
   Client: "Write to /data/logs/2003-10-15, offset 0, 1 MB"
   Master:
   a) File has no chunks yet
   b) Need to create chunk 0

3. MASTER ALLOCATES CHUNK
   ──────────────────────
   a) Generate chunk handle (globally unique 64-bit)
      chunk_handle = 0x1a2b3c4d5e6f7a8b
   b) Select 3 chunkservers:
      Available servers: [cs-1, cs-2, ..., cs-50]
      Filter:
      • Enough disk space
      • Below-average utilization
      • Rack diversity
      Selected: [cs-5, cs-12, cs-23]
      (cs-5 and cs-12 in rack-1, cs-23 in rack-2)
   c) Designate primary (lowest load)
      Primary: cs-5
   d) Grant lease to cs-5 (60 seconds)
   e) Increment version to 1
   f) Log operation:
      "CHUNK_CREATE: handle=0x1a2b, file=/data/logs/2003-10-15,
       index=0, version=1, locations=[cs-5, cs-12, cs-23]"

4. MASTER CONTACTS CHUNKSERVERS
   ────────────────────────────
   Master → cs-5, cs-12, cs-23: "Create chunk 0x1a2b, version 1"
   Each chunkserver:
   a) Create empty file /gfs/chunks/0x1a2b3c4d5e6f7a8b
   b) Initialize version to 1
   c) ACK to master

5. MASTER RETURNS TO CLIENT
   ────────────────────────
   Master → Client:
   "Chunk 0 metadata:
      handle: 0x1a2b3c4d5e6f7a8b
      primary: cs-5
      secondaries: [cs-12, cs-23]
      version: 1
      lease_expiration: T+60"

6. CLIENT WRITES DATA
   ──────────────────
   (Normal write flow as described in Chapter 2)
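Step 3b's placement filter can be sketched as a scoring pass over candidate chunkservers; the ChunkserverInfo fields, thresholds, and pick_replicas() policy below are simplifications invented for illustration, not the actual placement code.

from dataclasses import dataclass

@dataclass
class ChunkserverInfo:
    name: str
    rack: str
    free_bytes: int
    utilization: float      # fraction of disk used, 0.0-1.0

def pick_replicas(servers, n=3, min_free=64 * 2**20):
    """Pick n chunkservers: enough space, below-average utilization, rack diversity."""
    avg_util = sum(s.utilization for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s.free_bytes >= min_free and s.utilization <= avg_util]
    candidates.sort(key=lambda s: s.utilization)      # least-loaded first
    # First pass: one server per rack for diversity; second pass: fill up.
    chosen, racks = [], set()
    for s in candidates:
        if len(chosen) < n and s.rack not in racks:
            chosen.append(s)
            racks.add(s.rack)
    for s in candidates:
        if len(chosen) < n and s not in chosen:
            chosen.append(s)
    return chosen

servers = [
    ChunkserverInfo("cs-5",  "rack-1", 10 * 2**30, 0.30),
    ChunkserverInfo("cs-12", "rack-1", 12 * 2**30, 0.35),
    ChunkserverInfo("cs-23", "rack-2",  9 * 2**30, 0.40),
    ChunkserverInfo("cs-7",  "rack-2",  2 * 2**30, 0.90),
]
print([s.name for s in pick_replicas(servers)])   # ['cs-5', 'cs-23', 'cs-12']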
FILE DELETION WORKFLOW
─────────────────────
1. CLIENT DELETES FILE
   ──────────────────
   Client: "Delete /data/old/file1"
   Master:
   a) Rename the file to a hidden name:
      /data/old/file1
      → /.trash/file1.deleted.T12345
   b) Add deletion timestamp
   c) Log operation
   d) Return success

   File still exists, just renamed!
   Client can immediately:
   • Delete again (no-op)
   • Undelete (rename back)

2. GRACE PERIOD
   ────────────
   Hidden file stays for N days (configurable, e.g., 3 days)
   During this time:
   • File not visible to normal operations
   • Chunks still exist
   • Can be recovered if needed
   • Storage not freed
   After 3 days:
   → Master's background GC removes it

3. BACKGROUND SCAN (Master)
   ────────────────────────
   Periodic scan (e.g., every few hours):

   for each file in /.trash/:
       if (current_time - deletion_time) > grace_period:
           # Permanently delete
           chunk_handles = file.get_chunks()
           for chunk in chunk_handles:
               remove_from_namespace(chunk)
           delete_namespace_entry(file)

   No rush, happens eventually

4. CHUNK REMOVAL
   ─────────────
   Deleted chunks are marked in master memory:
   "Chunk 0x1a2b: DELETED"
   Not immediately removed from chunkservers
   → The next step handles this
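A sketch of steps 1-3, modeling the namespace as a plain dict; the trash-path format, field names, and helper functions are illustrative, not the master's actual data structures.

import time

GRACE_PERIOD = 3 * 24 * 3600   # three days, in seconds

def delete_file(namespace, path):
    """Deletion is just a rename to a hidden, timestamped name."""
    meta = namespace.pop(path)
    meta["deletion_time"] = time.time()
    hidden = f"/.trash/{path.rsplit('/', 1)[-1]}.deleted.{int(meta['deletion_time'])}"
    namespace[hidden] = meta
    return hidden                # undelete = rename back before the grace period ends

def namespace_gc(namespace, now=None):
    """Background scan: permanently remove trash entries past the grace period."""
    now = now or time.time()
    for path in [p for p in namespace if p.startswith("/.trash/")]:
        if now - namespace[path]["deletion_time"] > GRACE_PERIOD:
            del namespace[path]  # its chunks become orphans, collected via heartbeats

ns = {"/data/old/file1": {"chunks": ["0x1a2b"]}}
hidden = delete_file(ns, "/data/old/file1")
namespace_gc(ns, now=time.time() + GRACE_PERIOD + 1)
assert hidden not in ns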
CHUNK GARBAGE COLLECTION
───────────────────────
1. HEARTBEAT EXCHANGE
   ──────────────────
   Regular heartbeat (every ~30 seconds):
   Chunkserver → Master: "I have chunks: [0x1a2b, 0x2c3d, 0x3e4f, ...]"
   Master → Chunkserver: "Delete these: [0x5e6f, 0x7c8d]"

2. MASTER IDENTIFIES GARBAGE
   ─────────────────────────
   Master compares the chunkserver's list with its metadata:
   Chunkserver reports: [0x1a2b, 0x2c3d, 0x9e8f]
   Master's namespace has:
   • 0x1a2b ✓ (belongs to /data/file1)
   • 0x2c3d ✓ (belongs to /data/file2)
   • 0x9e8f ✗ (NOT in namespace)
   → 0x9e8f is orphaned/garbage

   Reasons for garbage:
   • File was deleted
   • Chunk creation failed partway
   • Master crashed during an operation
   • Stale replica from an old version

3. CHUNKSERVER DELETES
   ───────────────────
   Chunkserver receives the delete list:

   for chunk in delete_list:
       os.remove(f"/gfs/chunks/{chunk}")
       log(f"Deleted chunk {chunk}")

   Simple, local operation
   No coordination needed

4. STALE REPLICA DETECTION
   ───────────────────────
   Version numbers detect stale replicas:
   Chunk 0x1a2b, current version: 5
   Heartbeat report:
   cs-5:  "chunk 0x1a2b, version 5" ✓
   cs-12: "chunk 0x1a2b, version 5" ✓
   cs-23: "chunk 0x1a2b, version 4" ✗ (STALE!)
   Master → cs-23: "Delete 0x1a2b (stale)"

   How the version increases:
   • Master grants a new lease → version++
   • A chunkserver was down during the lease grant
   • The chunkserver comes back with the old version
   • Master detects and deletes the stale replica
GC BACKGROUND PROCESS
────────────────────
Master runs continuous background tasks:

TASK 1: Namespace Scan
──────────────────────
Frequency: Every few hours
Purpose:   Remove old deleted files

pseudocode:
──────────
def namespace_gc():
    for file in namespace:
        if file.is_hidden_deleted():
            age = now() - file.deletion_time
            if age > GRACE_PERIOD:
                permanently_delete(file)

Impact:
• Low priority
• Runs when the master is idle
• Rate limited (100s of files/sec)

TASK 2: Chunk Accounting
────────────────────────
Frequency: Ongoing (via heartbeats)
Purpose:   Identify orphaned chunks

pseudocode:
──────────
def on_heartbeat(chunkserver, chunk_list):
    for chunk in chunk_list:
        if chunk not in namespace:
            send_delete_command(chunkserver, chunk)
        elif chunk.version < expected_version(chunk):
            send_delete_command(chunkserver, chunk)

Impact:
• Immediate (next heartbeat)
• No extra overhead
• Piggybacked on the heartbeat

TASK 3: Orphan Detection
────────────────────────
Frequency: Daily
Purpose:   Find chunks that should exist but don't

pseudocode:
──────────
def orphan_detection():
    for file in namespace:
        for chunk in file.chunks:
            replicas = get_chunk_locations(chunk)
            if len(replicas) < REPLICATION_GOAL:
                schedule_replication(chunk)

Impact:
• Ensures all data is replicated
• Catches missed failures
• Safety net

RATE LIMITING:
─────────────
All GC tasks are rate-limited:
• Don't overload the master
• Don't saturate the network
• Don't impact foreground ops

Configuration:
• Max deletions per heartbeat: 10
• Max namespace scans per sec: 100
• Max concurrent re-replications: 50
1. Checkpoint Load
Master starts (after a crash or planned restart)
1. Load the latest checkpoint from disk
   • Compact B-tree format
   • Contains the full namespace
   • Loads in 1-2 seconds
   • Typically < 1 GB
2. Result:
   • Full file namespace in memory
   • Chunk handle mappings
   • Version numbers
2. Log Replay
Replay the operation log since the checkpoint
1. Read the log file
2. Apply each operation in order:
   • File creates
   • Chunk allocations
   • Version updates
   • Deletions
3. Typically a few seconds (100K ops/sec)
4. Result: Master has the current namespace state
3. Chunk Location Discovery
The master doesn't persist chunk locations
(chunkservers are the source of truth)
1. Master asks all chunkservers:
   "What chunks do you have?"
2. Each chunkserver replies:
   "I have: [chunk1, chunk2, ..., chunkN]
    With versions: [v1, v2, ..., vN]"
3. Master builds the location map in memory
4. Identifies stale replicas (old versions)
5. Marks them for garbage collection

Typically completes in 10-30 seconds
4. Resume Operations
The master is now ready!
Total recovery time: < 1 minute
1. Checkpoint load: 1-2 sec
2. Log replay: 5-10 sec
3. Chunk poll: 10-30 sec
4. Ready: 30-60 sec total

Meanwhile:
• Shadow masters can serve reads
• System remains partially available
• No data loss
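The whole restart sequence fits in a short sketch, assuming a JSON checkpoint and a line-oriented JSON log purely for illustration (the real checkpoint is a compact B-tree-like format); report_chunks() and apply_operation() are invented helpers.

import json

def recover_master(checkpoint_path, log_path, chunkservers):
    """Rebuild master state: checkpoint + log replay + chunk location poll."""
    # 1. Load the latest checkpoint (full namespace snapshot).
    with open(checkpoint_path) as f:
        namespace = json.load(f)

    # 2. Replay the operation log recorded since that checkpoint.
    with open(log_path) as f:
        for line in f:
            apply_operation(namespace, json.loads(line))

    # 3. Chunk locations are not persisted: ask every chunkserver.
    locations = {}
    for cs in chunkservers:
        for handle, version in cs.report_chunks():
            locations.setdefault(handle, []).append((cs.name, version))

    # 4. Ready to serve; stale replicas (old versions) are flagged for GC.
    return namespace, locations

def apply_operation(namespace, op):
    if op["type"] == "create":
        namespace[op["path"]] = {"chunks": []}
    elif op["type"] == "delete":
        namespace.pop(op["path"], None)
    # ... chunk allocations, version updates, etc.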
Failure Handling: If a deletion message is lost, the chunk is still collected eventually
Spread Load: Deletion I/O is spread over time rather than arriving as a sudden burst
Piggybacking: Uses the existing heartbeat mechanism, so GC adds no extra messages
The trade-off is that storage isn't freed immediately, but for Google's workload this was acceptable: storage was relatively cheap and safety mattered more.
Process: file deletion → rename to hidden name → wait out the grace period → background GC removes the namespace entry → heartbeat informs chunkservers → chunkservers delete local chunks
In a distributed system, some replicas may miss updates (e.g., if a chunkserver crashes during a write). GFS uses Chunk Version Numbers to distinguish between up-to-date and stale replicas.
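As a sketch of that check, assume the master records the expected version per chunk handle and compares it against what each heartbeat reports; the data shapes here are illustrative only.

def find_stale_replicas(expected_versions, heartbeat_report):
    """Return (chunkserver, handle) pairs whose reported version lags the
    master's expected version and should therefore be garbage collected."""
    stale = []
    for server, chunks in heartbeat_report.items():
        for handle, version in chunks:
            if version < expected_versions.get(handle, version):
                stale.append((server, handle))
    return stale

expected = {"0x1a2b": 5}
report = {
    "cs-5":  [("0x1a2b", 5)],
    "cs-12": [("0x1a2b", 5)],
    "cs-23": [("0x1a2b", 4)],   # missed the last lease grant while down
}
print(find_stale_replicas(expected, report))   # [('cs-23', '0x1a2b')]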