Chapter 3: Master Operations
The master is the brain of GFS, orchestrating all metadata operations and coordinating the distributed system. This chapter explores how the master manages the namespace, allocates chunks, grants leases, places replicas, and maintains system health through garbage collection.
Chapter Goals:
- Understand namespace management with fine-grained locking
- Learn chunk lease mechanism for consistency
- Explore replica placement strategies
- Master garbage collection techniques
- Grasp master fault tolerance mechanisms
Namespace Management
Unlike traditional file systems with per-directory data structures, GFS uses a flat namespace with efficient path-to-metadata lookups.
Namespace Structure
The namespace is a lookup table mapping full pathnames to metadata, kept entirely in memory and stored compactly with prefix compression of pathnames. There is no per-directory data structure that lists the files in a directory; directories exist only as entries in this table.
Fine-Grained Locking
GFS uses namespace locks to allow many operations to run concurrently. The key aspects are:
- Lock Types
- Concurrent Operations
- Lock Ordering
- Performance Impact
Read-Write Locks on Paths: each absolute file or directory name in the namespace has an associated read-write lock. An operation acquires read locks on all ancestor directories of its targets and a read or write lock on each full target pathname.
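Below is a minimal sketch of the lock plan such a scheme computes for a file creation. The `lock_plan_for_create` helper is illustrative, not GFS's actual code:

```python
def lock_plan_for_create(path):
    """Return the (pathname, mode) locks a create(path) would acquire:
    read locks on every ancestor directory, a write lock on the target
    itself, acquired in sorted pathname order."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    plan = [(a, "read") for a in ancestors] + [(path, "write")]
    return sorted(plan)

# Two creates in the same directory share only *read* locks on /home and
# /home/user1; their write locks are on different leaf names, so both
# operations can proceed in parallel.
print(lock_plan_for_create("/home/user1/file1"))
print(lock_plan_for_create("/home/user1/file2"))
```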
Metadata Storage
The master keeps three types of metadata, all in memory; a sketch of these records follows the lists below.
File Namespace
Directory and File Names
- Full pathname → metadata
- Persistent (operation log)
- Modification times
- Owner/permissions
- List of chunk handles
Chunk Metadata
Chunk Handle → Info
- Chunk version number
- List of replica locations
- Primary (if leased)
- Lease expiration
- Partially persistent
Server State
Chunkserver Info
- Available disk space
- Current load (CPU, I/O)
- Chunks stored
- Last heartbeat time
- Not persistent
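To make the three metadata categories above concrete, here is a hedged sketch of how the in-memory records might look. The field names are illustrative; GFS's actual structures are C++ and far more compact:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FileMetadata:
    """File namespace entry: full pathname -> file info (persisted via the operation log)."""
    pathname: str                                          # e.g. "/home/user1/file"
    chunk_handles: list[str] = field(default_factory=list) # ordered chunks making up the file
    mtime: float = 0.0                                     # modification time
    owner: str = ""                                        # owner / permission info

@dataclass
class ChunkMetadata:
    """Chunk handle -> chunk info (version persisted; replica locations rebuilt from heartbeats)."""
    handle: str
    version: int = 0                                       # persisted chunk version number
    replicas: list[str] = field(default_factory=list)      # chunkserver addresses (volatile)
    primary: Optional[str] = None                          # current lease holder, if any
    lease_expiration: float = 0.0                          # when the primary's lease runs out

@dataclass
class ChunkserverState:
    """Per-chunkserver state, never persisted; refreshed by heartbeats."""
    address: str
    free_space_bytes: int = 0
    load: float = 0.0                                      # rough CPU / I/O utilization
    chunks: set[str] = field(default_factory=set)          # handles this server reports holding
    last_heartbeat: float = 0.0
```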
Chunk Lease Mechanism
Leases are central to GFS's consistency model, enabling mutations without distributed consensus.
How Leases Work
The master grants a chunk lease to one of the chunk's replicas, which becomes the primary. The lease has an initial timeout of about 60 seconds, but the primary can request extensions indefinitely, piggybacked on heartbeat messages, as long as mutations continue. The primary picks a serial order for all mutations to the chunk, and every replica applies them in that order.
Lease Benefits
- The master is not involved in ordering individual mutations
- The master can regain control once the lease expires, even if it cannot reach the primary
- No distributed consensus protocol is needed for write ordering
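A rough master-side sketch of granting and extending a lease, assuming the ChunkMetadata record sketched earlier and the paper's 60-second default; the function names are illustrative:

```python
import time

LEASE_DURATION = 60.0  # seconds; the GFS paper's initial lease timeout

def grant_lease(chunk, now=None):
    """Pick a primary for this chunk and grant it a time-bounded lease.
    `chunk` is a ChunkMetadata-like record with .replicas, .primary,
    .lease_expiration and .version fields (illustrative)."""
    if now is None:
        now = time.time()
    if chunk.primary and chunk.lease_expiration > now:
        return chunk.primary                  # an unexpired lease is already out
    if not chunk.replicas:
        raise RuntimeError("no replicas available for chunk %s" % chunk.handle)
    chunk.version += 1                        # bump the version before granting (stale-replica detection)
    chunk.primary = chunk.replicas[0]         # any up-to-date replica will do
    chunk.lease_expiration = now + LEASE_DURATION
    # Real GFS also logs the new version and pushes it to the replicas here.
    return chunk.primary

def extend_lease(chunk, now=None):
    """Extensions are requested by the primary and piggybacked on heartbeats."""
    if now is None:
        now = time.time()
    if chunk.primary and chunk.lease_expiration > now:
        chunk.lease_expiration = now + LEASE_DURATION
```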
Snapshot Mechanics (Copy-on-Write)
Snapshots in GFS are nearly instantaneous and allow users to create a copy of a file or a directory tree without copying data.
How it Works (The Metadata-Only Copy)
GFS uses Copy-on-Write (CoW) at the chunk level to implement snapshots efficiently.
- Lease Revocation: When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files to be snapshotted. This ensures that any subsequent writes will require a new lease, giving the master a chance to intercept and create a copy.
- Log & Copy Metadata: The master logs the snapshot operation to disk. It then applies the operation to its in-memory state by duplicating the metadata for the file or directory. The new “snapshot” file points to the same chunk handles as the original.
- Reference Counting: Each chunk handle now has a reference count > 1.
The First Write After Snapshot
When a client wants to write to a chunk that has been snapshotted:
- The client asks the master for the current primary, and the master notices the chunk's reference count is greater than one
- The master defers its reply, picks a new chunk handle, and asks every chunkserver holding the old chunk to create a local copy of it (a disk-to-disk copy on the same machine, avoiding any network transfer)
- The master then grants a lease on the new chunk and replies to the client, which writes to the copy while the snapshot keeps the original (see the sketch below)
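A hedged sketch of that copy-on-write decision. The reference-count table, the file's chunk list, and the `ask_chunkservers_to_clone` callback are illustrative stand-ins for the master's real bookkeeping:

```python
import uuid

def handle_first_write(chunk_refcount, file_chunks, chunk_index, ask_chunkservers_to_clone):
    """Return the chunk handle the client should actually write to.

    If the chunk is shared because of a snapshot (refcount > 1), mint a new
    handle, have every chunkserver holding the old chunk copy it locally,
    and repoint this file at the new handle before a lease is granted."""
    old_handle = file_chunks[chunk_index]
    if chunk_refcount.get(old_handle, 1) == 1:
        return old_handle                                # not shared: mutate in place
    new_handle = uuid.uuid4().hex                        # fresh chunk handle
    ask_chunkservers_to_clone(old_handle, new_handle)    # local disk-to-disk copy, no network hop
    chunk_refcount[old_handle] -= 1
    chunk_refcount[new_handle] = 1
    file_chunks[chunk_index] = new_handle
    return new_handle
```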
Replica Placement
The master decides where to place chunk replicas, optimizing for reliability, bandwidth, and load balancing.
Placement Goals
Maximize Reliability
Survive Multiple Failures: spread replicas across machines and across racks, so that losing a disk, a machine, or even an entire rack (a shared power circuit or network switch) does not destroy every copy of a chunk.
Maximize Bandwidth
Network Topology Awareness: placing replicas in multiple racks lets reads exploit the aggregate bandwidth of several racks, at the cost of write traffic that must cross rack boundaries.
Balance Load
Distribute Storage and I/O: put new replicas on chunkservers with below-average disk utilization and avoid concentrating recent creations (and the heavy write traffic that follows them) on a few servers.
Chunk Creation
When the master creates a new chunk, it chooses where to place the initially empty replicas by considering:
- Chunkservers with below-average disk space utilization
- Chunkservers that have not performed many recent creations (recent creations predict imminent heavy write traffic)
- Spreading the replicas across racks, as shown in the sketch below
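A simplified sketch of a placement function weighing these factors; the thresholds, field names, and scoring are invented for illustration and are not GFS's actual policy code:

```python
def choose_chunkservers(servers, num_replicas=3, max_recent_creations=5):
    """Pick chunkservers for a new chunk's replicas.

    `servers` are objects with .address, .rack, .disk_utilization (0.0-1.0)
    and .recent_creations (illustrative fields). Prefer below-average disk
    utilization, avoid servers with many recent creations, and spread the
    chosen replicas across racks."""
    avg_util = sum(s.disk_utilization for s in servers) / len(servers)
    candidates = sorted(
        (s for s in servers
         if s.disk_utilization <= avg_util and s.recent_creations < max_recent_creations),
        key=lambda s: s.disk_utilization)

    chosen, racks_used = [], set()
    for s in candidates:                      # first pass: at most one replica per rack
        if s.rack not in racks_used:
            chosen.append(s)
            racks_used.add(s.rack)
        if len(chosen) == num_replicas:
            return chosen
    for s in candidates:                      # second pass: relax the rack constraint if needed
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == num_replicas:
            break
    return chosen
```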
Re-replication
When replicas fall below the target count (e.g., after a chunkserver failure), the master re-replicates the affected chunks. It prioritizes chunks by how far they are from their replication goal, prefers chunks of live files over recently deleted ones, and boosts any chunk that is currently blocking client progress; clone traffic is throttled per chunkserver so re-replication does not swamp normal workloads. A priority sketch follows below.
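A sketch of how such a re-replication priority could be computed; the weights are invented for illustration, not GFS's actual formula:

```python
def rereplication_priority(chunk, target_replicas=3):
    """Higher value = clone sooner. `chunk` is assumed to expose
    .live_replicas, .file_deleted and .blocking_client (illustrative fields)."""
    missing = max(0, target_replicas - chunk.live_replicas)
    priority = missing * 10                   # farther from the replication goal -> more urgent
    if not chunk.file_deleted:
        priority += 5                         # live files beat recently deleted ones
    if chunk.blocking_client:
        priority += 20                        # chunks blocking client progress jump the queue
    return priority

# The master sorts needy chunks by this priority, clones the highest first,
# and throttles clone traffic per chunkserver so normal traffic isn't swamped.
```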
Garbage Collection
GFS uses lazy garbage collection instead of immediate deletion.
Why Lazy Deletion?
Benefits
Advantages of Lazy GC:
- Simple implementation
- Batched operations
- No complex distributed deletion
- Can recover from accidental deletes
- Spreads I/O load over time
- Handles failures gracefully
Trade-offs
Considerations:
- Storage not freed immediately
- May need manual cleanup for urgent cases
- Requires background process
- Deleted files visible briefly
- Not suitable for quota systems
Garbage Collection Process
- File Deletion: when an application deletes a file, the master logs the deletion and renames the file to a hidden name that includes the deletion timestamp; during the grace period the file can still be read and undeleted under that hidden name.
- Chunk Collection: during its regular namespace scan, the master permanently removes hidden files older than the grace period (e.g., 3 days) and erases their metadata, leaving their chunks orphaned; a parallel scan of the chunk namespace identifies chunks no longer reachable from any file.
- Background Process: in heartbeat exchanges, each chunkserver reports the chunks it holds and the master replies with those no longer present in its metadata; the chunkserver is then free to delete them, as sketched below.
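A minimal sketch of the heartbeat-time exchange and the background namespace sweep, assuming the master keeps the set of chunk handles still referenced by some file and a table of hidden (deleted) files with deletion timestamps; the names and the 3-day grace period are illustrative defaults:

```python
GRACE_PERIOD_SECONDS = 3 * 24 * 3600          # e.g. a 3-day recovery window

def chunks_to_delete(reported_handles, live_handles):
    """Heartbeat reply: the handles a chunkserver reported that no longer
    belong to any file. The chunkserver may delete these replicas whenever
    it likes; no separate deletion protocol is needed."""
    return set(reported_handles) - set(live_handles)

def sweep_hidden_files(hidden_files, now):
    """Background namespace scan: return the hidden (deleted) pathnames whose
    grace period has expired, so their metadata can be erased and their
    chunks become orphans for the heartbeat exchange above to collect."""
    return [path for path, deleted_at in hidden_files.items()
            if now - deleted_at > GRACE_PERIOD_SECONDS]
```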
Master Fault Tolerance
The master is replicated to ensure system availability:
Replication Strategy
The master's state is made durable through its operation log and checkpoints, which are replicated on multiple machines. A metadata mutation is considered committed only after its log record has been flushed to disk both locally and on the remote log replicas. "Shadow" masters replay the log to provide read-only access to the file system even while the primary master is down.
Fast Recovery
The master recovers by loading its most recent checkpoint (stored in a compact, directly memory-mappable form) and replaying only the log records written after it, so restart takes seconds rather than minutes. Chunk locations are not recovered from disk at all; they are rebuilt by polling chunkservers at startup and through regular heartbeats.
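A hedged sketch of the commit rule described under Replication Strategy above: a metadata mutation is acknowledged only after its log record is flushed locally and on the remote log replicas. The `local_log` and `remote_logs` objects with `append`/`flush` methods are hypothetical stand-ins:

```python
def commit_mutation(record, local_log, remote_logs):
    """Append a metadata mutation to the operation log and replicate it.

    Only after the record is durable locally *and* on every remote log
    replica is the mutation applied to the in-memory namespace and
    acknowledged to the client."""
    local_log.append(record)
    local_log.flush()                         # force the record to local disk
    for remote in remote_logs:
        remote.append(record)
        remote.flush()                        # force the record to the replica's disk
    return True                               # safe to apply in memory and ACK
```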
Interview Questions
Basic: Why does GFS use lazy garbage collection?
Expected Answer:
GFS uses lazy garbage collection instead of immediate deletion for several reasons:
- Simplicity: No complex distributed deletion protocol needed
- Recovery: Accidental deletions can be recovered during grace period (e.g., 3 days)
- Batch Operations: Deletions batched together, reducing overhead
- Failure Handling: If deletion message lost, chunk collected eventually anyway
- Spread Load: I/O spread over time, not sudden burst
- Piggybacking: Uses existing heartbeat mechanism
Intermediate: Explain GFS's namespace locking mechanism
Expected Answer:
GFS uses fine-grained read-write locks on full pathnames to enable concurrent operations.
How it works:
- Each path (file or directory) has a read-write lock
- Operations acquire locks on full paths, not just the target
- Example: Creating /home/user1/file requires:
  - Read lock on /home
  - Read lock on /home/user1
  - Write lock on /home/user1/file
- Multiple operations in the same directory can proceed in parallel
- Example: Creating /data/file1 and /data/file2 concurrently
  - Both acquire a read lock on /data (shared)
  - Each acquires a write lock on a different file (no conflict)
- Locks are acquired in lexicographic order
- Example: An operation needing /a/x and /b/y always acquires them in sorted order: /a/x then /b/y
Advanced: How does the chunk lease mechanism ensure consistency?
Expected Answer:
The lease mechanism provides consistency without distributed consensus.
Setup:
- Master grants 60-second lease to one replica (primary)
- Only primary can order mutations during lease period
- Primary identity cached by clients
Write flow:
- Data pushed to all replicas (in memory, not applied)
- Client sends write request to primary
- Primary assigns serial number (ordering)
- Primary applies to local disk
- Primary sends order to secondaries
- Secondaries apply in same order
- All ACK to primary
- Primary ACKs to client
Why it is consistent:
- All replicas apply mutations in same order (serialized by primary)
- Same serial numbers → same state
- No distributed consensus needed
- Primary has authority during lease
Failure handling:
- Lease timeout (60 sec) ensures master can regain control
- If primary fails, lease expires naturally
- Master can grant new lease to different replica
- No need for perfect failure detection
Key properties:
- Single authority (primary) during lease
- Time-bounded authority (60 sec)
- Master retains ultimate control
- Simple protocol, high performance
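A toy sketch of the primary's serializing role in that flow. The secondaries are modeled as plain callables standing in for RPCs, and the data-push step is assumed to have already happened; none of this is GFS's actual code:

```python
class Primary:
    """Lease holder for one chunk; serializes every mutation during its lease."""

    def __init__(self, secondaries):
        self.secondaries = secondaries        # callables standing in for RPCs to secondary replicas
        self.next_serial = 0

    def apply_locally(self, serial, mutation):
        # Stand-in for applying the already-pushed data to the local replica.
        print(f"primary applied mutation {serial}: {mutation!r}")

    def handle_write(self, mutation):
        """The client sends its write request after pushing the data to all
        replicas. The primary picks the serial order, applies it locally,
        and tells the secondaries to apply in exactly the same order."""
        serial = self.next_serial
        self.next_serial += 1
        self.apply_locally(serial, mutation)
        acks = [secondary(serial, mutation) for secondary in self.secondaries]
        return all(acks)                      # ACK the client only if every secondary ACKed

# Usage: two "secondaries" that always apply and acknowledge.
primary = Primary(secondaries=[lambda s, m: True, lambda s, m: True])
assert primary.handle_write(b"record data") is True
```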
System Design: How would you improve GFS's single master limitation?
Expected Answer:
Several approaches to scale beyond the single master:
1. Metadata Sharding (like Colossus):
- Partition namespace by path prefix
- Multiple master shards, each handles subset
- Example: Master1 handles /data/*, Master2 handles /logs/*
- Benefits: Scales metadata capacity and throughput
- Challenges: Cross-shard operations, rebalancing
2. Hierarchical Masters:
- Root master coordinates multiple sub-masters
- Each sub-master handles subset of chunkservers
- Root handles namespace, sub-masters handle chunks
- Benefits: Scales chunk management
- Challenges: Two-level hierarchy complexity
3. Client Metadata Caching:
- Aggressive client caching with long timeouts
- Lease-based cache consistency
- Master only for cache misses
- Benefits: Reduces master load dramatically
- Challenges: Consistency protocol more complex
4. Consensus-Replicated Master:
- Distribute master state using consensus (Paxos/Raft)
- Read from any replica, write to leader
- Benefits: High availability, read scalability
- Challenges: Write latency, consistency overhead
Real-World Examples:
- Colossus (GFS successor) uses metadata sharding
- HDFS Federation uses multiple namenodes
- Both prove that the single-master limitation can be overcome while preserving simplicity where possible
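To make approach 1 concrete, here is a toy router that maps a pathname to a metadata shard by its longest matching prefix; the shard table and shard names are invented for illustration:

```python
# Hypothetical prefix -> master shard assignment.
SHARD_TABLE = {
    "/data": "master-1",
    "/logs": "master-2",
    "/":     "master-0",   # default shard for everything else
}

def shard_for(path):
    """Route a metadata operation to the shard owning the longest matching prefix."""
    best_prefix = max((p for p in SHARD_TABLE if path.startswith(p)), key=len)
    return SHARD_TABLE[best_prefix]

assert shard_for("/data/2024/events") == "master-1"
assert shard_for("/logs/app.log") == "master-2"
assert shard_for("/home/user1/file") == "master-0"
```

Cross-shard operations (for example, a rename that moves a file between prefixes owned by different shards) are exactly the hard part this routing does not solve, which is why they are listed above as a challenge.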
Stale Replica Detection (Version Numbers)
In a distributed system, some replicas may miss updates (e.g., if a chunkserver crashes during a write). GFS uses Chunk Version Numbers to distinguish between up-to-date and stale replicas.
The Versioning Protocol
- Lease Granting: Before granting a lease, the master increments the chunk’s version number in its persistent metadata.
- Propagation: The master notifies the primary and all secondaries of the new version number.
- Persisting: Both the master and the chunkservers record the new version number on their respective persistent disks before the mutation starts.
- Stale Check: If a chunkserver was down during the update, it will still have the old version number.
How the Master Uses Versions
- During Heartbeats: Chunkservers report their (handle, version) pairs. If the master sees a version lower than the one in its own records, it knows the replica is stale.
- Garbage Collection: Stale replicas are immediately scheduled for garbage collection.
- Client Requests: When a client asks for chunk locations, the master never returns a stale replica, ensuring the client only sees current data.
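A sketch of the heartbeat-time version check, reusing the ChunkMetadata-style records from earlier. The handling of a higher reported version (the master takes it as current, assuming it failed right after granting a lease) follows the paper, but the code itself is illustrative:

```python
def check_replica_versions(master_chunks, chunkserver, reported):
    """`reported` holds the (handle, version) pairs from a chunkserver's
    heartbeat; `master_chunks` maps handle -> ChunkMetadata-like records.
    Returns the handles whose replica on this chunkserver is stale."""
    stale = []
    for handle, version in reported:
        chunk = master_chunks.get(handle)
        if chunk is None:
            continue                            # orphaned chunk: normal garbage collection handles it
        if version < chunk.version:
            stale.append(handle)                # replica missed an update while the server was down
            continue
        if version > chunk.version:
            chunk.version = version             # master lagged (failed after granting a lease): adopt it
        if chunkserver not in chunk.replicas:
            chunk.replicas.append(chunkserver)  # up-to-date replica: record its location
    return stale
```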
Key Takeaways
Master Operations Summary:
- Namespace Locking: Fine-grained path-based locks enable parallel operations
- Leases for Consistency: Time-bounded primary authority avoids distributed consensus
- Replica Placement: Rack-aware placement balances reliability and performance
- Lazy Garbage Collection: Simple, safe deletion with recovery window
- In-Memory Metadata: Fast operations, simple consistency, small memory footprint
- Master Replication: Operation log replication ensures durability and fast recovery
- Background Processes: GC, re-replication, balancing happen asynchronously
- Version Numbers: Detect stale replicas reliably without complex protocols
Up Next
In Chapter 4: Chunkservers & Data Flow, we'll explore:
- How chunkservers store and manage chunks
- Detailed read, write, and record append flows
- Data integrity mechanisms with checksums
- Replication pipeline and optimization
- Handling chunkserver failures