Chapter 2: Architecture Overview
The Google File System architecture is elegantly simple: a single master coordinating hundreds of chunkservers, with clients communicating directly with chunkservers for data operations. This chapter explores how these components work together to create a highly scalable distributed file system.

Chapter Goals:
- Understand the three main components: Master, Chunkservers, and Clients
- Learn how data flows through the system
- Grasp the separation of control and data flow
- Appreciate the single-master design choice
System Components
GFS consists of three main types of components:

The Master
The master is the brain of GFS: a single process that manages all metadata. Although it looks like a bottleneck, the design leverages metadata minimization and direct data transfer so that one master can coordinate petabytes of data for thousands of clients. Its responsibilities fall into a few areas:

- Namespace & Locking
- Lease Mechanism
- In-Memory State
Namespace Structure:
Unlike traditional file systems with directory inodes, GFS uses a lookup table mapping full pathnames to metadata. This table is stored with prefix compression (conceptually a radix tree or hash table) to keep its RAM footprint small.

The Locking Mechanism:
Since there is no per-directory data structure, operations on /a/b/c do not require exclusively locking the parents /a or /a/b. Instead:
- To create /home/user/foo, the master acquires read locks on /home and /home/user, and a write lock on /home/user/foo.
- This allows concurrent file creations in the same directory: /home/user/bar can be created simultaneously because it only needs read locks on the parent names plus a write lock on its own name.
- There is no deadlock risk: locks are always acquired in a consistent total order (by namespace level, then lexicographically within a level).
This per-path read/write locking takes the place of the directory-entry (dentry) locking used by traditional file systems.

Persistent State
What Gets Persisted:
- The file and chunk namespaces
- The mapping from files to chunks
- Both are recorded in an operation log that is replicated on remote machines and periodically compacted into checkpoints
- Chunk replica locations are NOT persisted; the master rebuilds them by asking chunkservers at startup and via heartbeats

Why Single Master?
A single master keeps the design simple and gives one place with global knowledge for placement and replication decisions. The bottleneck risk is managed by keeping metadata in RAM, minimizing the master's involvement in reads and writes, and caching metadata at clients (see Key Architectural Decisions below).
Chunkservers
Chunkservers are the workhorses that store the actual data:

Storage Model
How Chunks Are Stored:
- Each chunk: 64MB max
- Stored as Linux file on local disk
- Named by chunk handle (globally unique)
- File path: /gfs/chunks/<chunk_handle>
- Checksums for each 64KB block
- Multiple chunks per disk
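A small sketch of what this storage model could look like, assuming a chunkserver keeps each chunk as one local file named by its handle with a 32-bit checksum per 64KB block. The directory layout and the use of CRC32 here are illustrative choices; the paper only specifies Linux files and 32-bit block checksums.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"os"
	"path/filepath"
)

const blockSize = 64 * 1024 // checksum granularity: 64 KB

// storeChunk writes chunk data as a plain local file named by its handle and
// returns a CRC32 checksum for every 64 KB block.
func storeChunk(dir string, handle uint64, data []byte) ([]uint32, error) {
	path := filepath.Join(dir, fmt.Sprintf("%016x", handle))
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return nil, err
	}
	var sums []uint32
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.ChecksumIEEE(data[off:end]))
	}
	return sums, nil
}

// verifyBlock re-reads one 64 KB block and checks it against the stored
// checksum before returning it to a client.
func verifyBlock(dir string, handle uint64, block int, want uint32) ([]byte, error) {
	path := filepath.Join(dir, fmt.Sprintf("%016x", handle))
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, blockSize)
	n, err := f.ReadAt(buf, int64(block)*blockSize)
	if err != nil && n == 0 {
		return nil, err
	}
	buf = buf[:n]
	if crc32.ChecksumIEEE(buf) != want {
		return nil, fmt.Errorf("checksum mismatch in chunk %x, block %d", handle, block)
	}
	return buf, nil
}

func main() {
	dir, _ := os.MkdirTemp("", "chunks")
	sums, _ := storeChunk(dir, 0xdeadbeef, []byte("example chunk payload"))
	blk, err := verifyBlock(dir, 0xdeadbeef, 0, sums[0])
	fmt.Println(len(blk), err)
}
```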
Responsibilities
What Chunkservers Do:
- Store and retrieve chunks
- Verify data integrity (checksums)
- Replicate chunks to other servers
- Report to master via heartbeat
- Delete garbage chunks
- Handle client read/write requests
No Metadata Cache
Simple Design:
- Chunkservers don’t cache metadata
- Don’t track which files chunks belong to
- Master tells them what to do
- Simplifies consistency
- Reduces memory requirements
Replication
Chunk Copies:
- Each chunk replicated 3x (default)
- Replicas on different racks
- Replicas can serve reads
- One primary for writes (leased)
- Replicas forward writes in chain
Clients
GFS clients are library code linked directly into applications. The client library implements the file system API, caches metadata (but never file data), talks to the master for metadata, and talks directly to chunkservers for data.

Data Flow Patterns
Understanding how data flows through GFS is crucial. The key principle is the separation of control flow from data flow: control flows from the client to the master and then to the primary replica, while data is pushed linearly along a chain of chunkservers to maximize network throughput.

The Pipeline Push Mechanism
To fully utilize each machine's outbound network bandwidth, data is pushed along a chain of chunkservers rather than fanned out from the client in a star topology. Each server forwards bytes to the next replica as soon as they start arriving, so the transfer is pipelined and the total time approaches the data size divided by the slowest link.
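The sketch below illustrates the pipelining idea with in-memory buffers standing in for chunkservers and Go pipes standing in for network connections; the function name pushChain is invented for this example. Each hop stores the bytes it receives and forwards them downstream at the same time.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// pushChain sketches GFS's pipelined data push: each chunkserver in the chain
// writes incoming bytes to its own buffer and simultaneously forwards them to
// the next server, rather than the client fanning data out in a star.
// Buffers stand in for chunkserver data caches; network I/O is elided.
func pushChain(data io.Reader, replicas []*bytes.Buffer) error {
	if len(replicas) == 0 {
		return nil
	}
	in := data
	for i, buf := range replicas {
		if i == len(replicas)-1 {
			// Last replica only stores; nothing further to forward.
			_, err := io.Copy(buf, in)
			return err
		}
		pr, pw := io.Pipe()
		go func(src io.Reader, local *bytes.Buffer, next *io.PipeWriter) {
			// TeeReader stores locally while the copy forwards downstream.
			_, err := io.Copy(next, io.TeeReader(src, local))
			next.CloseWithError(err)
		}(in, buf, pw)
		in = pr
	}
	return nil
}

func main() {
	// Client pushes to the nearest replica; data flows along the chain.
	replicas := []*bytes.Buffer{{}, {}, {}}
	if err := pushChain(bytes.NewReader([]byte("64MB of chunk data...")), replicas); err != nil {
		panic(err)
	}
	for i, r := range replicas {
		fmt.Printf("replica %d buffered %d bytes\n", i, r.Len())
	}
}
```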
Read Operation

Complete Read Flow:
1. The application issues a read of (filename, byte offset) through the client library.
2. The client converts the byte offset into a chunk index (offset / 64MB) and asks the master for that chunk's handle and replica locations.
3. The master replies; the client caches the answer keyed by (filename, chunk index).
4. The client sends a read request for (chunk handle, byte range) to the nearest replica.
5. The chunkserver verifies checksums and returns the data; further reads of the same chunk need no master interaction.
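A compact sketch of the client-side read path following these steps; the Master and ChunkserverClient interfaces are stand-ins for the real RPCs, whose names and signatures are not specified by the paper.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // 64 MB

// Illustrative stand-ins for the master and chunkserver RPC interfaces.
type Master interface {
	// GetChunk returns the chunk handle and replica addresses for the
	// chunkIndex-th chunk of the named file.
	GetChunk(file string, chunkIndex int) (handle uint64, replicas []string, err error)
}

type ChunkserverClient interface {
	ReadChunk(addr string, handle uint64, offset, length int) ([]byte, error)
}

type cacheKey struct {
	file  string
	index int
}

type cacheEntry struct {
	handle   uint64
	replicas []string
}

// Read shows the GFS read path: translate (file, offset) to a chunk index,
// ask the master once and cache the answer, then fetch the byte range
// directly from a replica. Replica selection and lease details are omitted.
func Read(m Master, cs ChunkserverClient, cache map[cacheKey]cacheEntry,
	file string, offset, length int) ([]byte, error) {

	index := offset / chunkSize // step 2: compute chunk index from byte offset
	key := cacheKey{file, index}

	entry, ok := cache[key]
	if !ok { // ask the master only on a cache miss
		handle, replicas, err := m.GetChunk(file, index)
		if err != nil {
			return nil, err
		}
		entry = cacheEntry{handle, replicas}
		cache[key] = entry
	}

	// step 4: read directly from a replica; the master is not on the data path.
	return cs.ReadChunk(entry.replicas[0], entry.handle, offset%chunkSize, length)
}

func main() {
	fmt.Println("sketch of the GFS client read path")
}
```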
Write Operation

Writes are more complex because consistency must be maintained across replicas. A write proceeds in three phases: the master grants a lease that designates one replica as primary, the client pushes data along the replica chain, and the client then issues the write command to the primary, which assigns the mutation a serial number and forwards that order to the secondaries. The full sequence is walked through in the interview section at the end of this chapter.

Record Append Operation
The record append is GFS's "killer feature": a client specifies only the data, not the offset, and GFS appends the record atomically at least once at an offset of its own choosing, then returns that offset to the client.

- How It Works: the append follows the normal write path, but the primary picks the offset (padding the current chunk and retrying on a new chunk if the record would not fit).
- Detailed Flow: the same lease, data push, and commit phases as a regular write, with the primary serializing concurrent appends.
- Consistency Guarantees: each record is written atomically at least once; failed or retried attempts can leave duplicates or padding in some replicas.
- Application Usage: many producers can append to one file without locking (for example, a shared log or merge queue); readers skip padding and discard duplicates, typically using checksums and unique record IDs (see the sketch below).
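As a sketch of this application-side discipline (the Record framing and the dedupe helper are hypothetical, not part of the GFS client library), a reader of a record-append file drops duplicate records by unique ID:

```go
package main

import "fmt"

// Record is how an application might frame data for record append: a unique
// ID for deduplication plus a payload. Framing, checksums, and the actual
// record-append RPC are assumptions for illustration, not the GFS API.
type Record struct {
	ID      string
	Payload string
}

// dedupe shows the reader-side discipline record append requires: GFS
// guarantees each record is appended atomically at least once, so readers
// must tolerate duplicates (and, in the real system, padding and partial
// records detected via checksums).
func dedupe(records []Record) []Record {
	seen := make(map[string]bool)
	var out []Record
	for _, r := range records {
		if seen[r.ID] {
			continue // duplicate left behind by a retried append
		}
		seen[r.ID] = true
		out = append(out, r)
	}
	return out
}

func main() {
	// A retried append can leave the same record in the file twice.
	log := []Record{
		{ID: "evt-1", Payload: "click"},
		{ID: "evt-2", Payload: "view"},
		{ID: "evt-2", Payload: "view"}, // first attempt timed out, client retried
	}
	for _, r := range dedupe(log) {
		fmt.Println(r.ID, r.Payload)
	}
}
```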
Architecture Deep Dive
Chunk Size: Why 64MB?
The 64MB chunk size is a defining characteristic of GFS:

Advantages of Large Chunks
Benefits of 64MB Chunks:
- Reduced Metadata:
  - Far fewer chunks per file, so far less metadata for the master to keep in RAM (a rough calculation follows this list)
- Fewer Network Hops:
  - Client requests chunk locations once
  - Works with the chunk for an extended period
  - Reduces master load significantly
- Better TCP Performance:
  - Long-lived TCP connections
  - Connection setup cost amortized
  - Better throughput utilization
- Data Locality:
  - MapReduce can schedule an entire chunk to one worker
  - Reduces network transfer
  - Better cache utilization
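A rough back-of-the-envelope calculation of the metadata savings, using the paper's figure of roughly 64 bytes of master metadata per chunk; the 1 PiB total data size is just an example.

```go
package main

import "fmt"

// Rough metadata arithmetic behind the 64 MB choice. The ~64 bytes of master
// metadata per chunk is the figure quoted in the GFS paper; the other numbers
// are illustrative.
func main() {
	const (
		totalData         = 1 << 50 // 1 PiB of file data
		bytesPerChunkMeta = 64      // approx. master metadata per chunk
	)
	for _, chunkSize := range []int64{1 << 20 /* 1 MiB */, 64 << 20 /* 64 MiB */} {
		chunks := int64(totalData) / chunkSize
		metaBytes := chunks * bytesPerChunkMeta
		fmt.Printf("chunk size %4d MiB -> %12d chunks -> ~%6d MiB of master metadata\n",
			chunkSize>>20, chunks, metaBytes>>20)
	}
}
```

Under these assumptions, a petabyte of data costs the master about 1 GiB of chunk metadata with 64 MB chunks, versus roughly 64 GiB with 1 MB chunks.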
Disadvantages and Mitigations
Problems with Large Chunks:
- Internal Fragmentation:
  - A small file still occupies a whole chunk's worth of bookkeeping, though lazy space allocation means the underlying Linux file only grows as data is actually written
- Hot Spots:
  - A chunk holding a popular small file can be hammered by many clients at once; GFS mitigates this with a higher replication factor for such files and by staggering application start times
- Startup Issues:
  - Small configuration files
  - Many clients read them at startup
  - Creates a temporary hotspot until the extra replicas or staggered starts take effect
Metadata Design
- In-Memory Only: all metadata lives in the master's RAM, so metadata operations are fast and the master can cheaply scan its entire state in the background
- What's Persistent: the namespace and the file-to-chunk mapping are recorded in the operation log (replicated remotely); chunk replica locations are not persisted
- Checkpointing: the operation log is periodically compacted into a checkpoint so recovery only replays the recent tail of the log

Why Keep Metadata in RAM?
- Metadata operations are fast and never block on disk
- Periodic background scans (garbage collection, re-replication after failures, rebalancing) become cheap
- Metadata is small: the master keeps less than 64 bytes per 64MB chunk, and prefix compression keeps per-file namespace data similarly small, so RAM capacity is rarely the limit
Consistency Without Locks
GFS achieves consistency without distributed locking or consensus: for each chunk the master grants a lease to one replica, the primary, and the primary assigns a serial number to every mutation. All replicas apply mutations in that serial order, so concurrent writers see a single, consistent ordering without any cross-replica locking.
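A toy illustration of the serialization the lease provides (the Primary and Mutation types are invented for this sketch): the lease holder is the one place that assigns serial numbers, and every replica applies mutations in that order.

```go
package main

import (
	"fmt"
	"sync"
)

// Mutation is a write the primary must order. This only sketches how a
// lease-holding primary serializes concurrent mutations by assigning serial
// numbers, which every replica then applies in the same order.
type Mutation struct {
	Serial int
	Data   string
}

type Primary struct {
	mu     sync.Mutex
	serial int
	order  []Mutation // the order forwarded to all secondaries
}

// Apply is called for each concurrent client mutation. The lock stands in
// for the primary being the single serialization point for its chunk.
func (p *Primary) Apply(data string) Mutation {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.serial++
	m := Mutation{Serial: p.serial, Data: data}
	p.order = append(p.order, m)
	return m
}

func main() {
	p := &Primary{}
	var wg sync.WaitGroup
	for _, d := range []string{"A", "B", "C"} {
		wg.Add(1)
		go func(d string) { defer wg.Done(); p.Apply(d) }(d)
	}
	wg.Wait()
	// Secondaries would receive and apply exactly this sequence.
	for _, m := range p.order {
		fmt.Printf("serial %d: %s\n", m.Serial, m.Data)
	}
}
```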
Component Interactions

Master-Chunkserver Communication
- Chunkservers send regular heartbeat messages reporting their state and the chunks they hold
- The master piggybacks instructions on heartbeat replies (for example, deleting orphaned chunks or creating new replicas)
- Missed heartbeats are how the master detects chunkserver failure
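A sketch of what a heartbeat exchange might carry; the field names and the handleHeartbeat logic are assumptions for illustration, not the actual GFS protocol.

```go
package main

import "fmt"

// Heartbeat is an illustrative chunkserver report: what it stores and how
// much space it has left.
type Heartbeat struct {
	ServerAddr   string
	ChunkHandles []uint64 // chunks this server currently stores
	DiskFreeGB   int
}

// HeartbeatReply carries instructions the master piggybacks on the reply,
// such as deleting chunks it no longer references.
type HeartbeatReply struct {
	DeleteChunks []uint64
}

// handleHeartbeat is a toy master-side handler: it records the reported
// locations and tells the server to drop chunks missing from master metadata.
// (Real bookkeeping would avoid duplicate location entries; this is elided.)
func handleHeartbeat(known map[uint64]bool, locations map[uint64][]string, hb Heartbeat) HeartbeatReply {
	var reply HeartbeatReply
	for _, h := range hb.ChunkHandles {
		if known[h] {
			locations[h] = append(locations[h], hb.ServerAddr)
		} else {
			reply.DeleteChunks = append(reply.DeleteChunks, h) // orphaned chunk
		}
	}
	return reply
}

func main() {
	known := map[uint64]bool{1: true, 2: true}
	locations := map[uint64][]string{}
	hb := Heartbeat{ServerAddr: "cs-07:7070", ChunkHandles: []uint64{1, 2, 99}, DiskFreeGB: 412}
	fmt.Printf("%+v\n", handleHeartbeat(known, locations, hb))
	fmt.Println(locations)
}
```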
Client-Master Communication
- Clients contact the master only for metadata: namespace operations and chunk handle/location lookups
- Responses are cached at the client, so repeated access to the same chunk needs no further master traffic
- File data never flows through the master
Key Architectural Decisions
Single Master
Decision: One master, many chunkservers

Why:
- Simplifies design dramatically
- Strong metadata consistency
- Global knowledge for decisions
- Fast in-memory operations
Mitigations:
- Minimize master involvement in the data path
- Shadow masters for HA
- Client caching
Large Chunks
Decision: 64MB chunk size

Why:
- Reduces metadata size
- Fewer network round trips
- Better throughput
- Data locality benefits
Trade-offs:
- Internal fragmentation
- Potential hot spots
- Not optimal for small files
Separation of Control/Data
Decision: Metadata through the master, data directly between clients and chunkservers

Why:
- Master not bottleneck for data
- Scales to GB/s throughput
- Parallel data transfers
- Flexible data flow routing
Result:
- Master handles thousands of metadata ops/sec
- Data path limited only by network
Relaxed Consistency
Decision: A relaxed ("defined") consistency model rather than strict consistency

Why:
- Higher performance
- Simpler implementation
- No distributed transactions
- Matches application needs
How applications cope:
- Application-level handling
- Deduplication
- Validation
Interview Questions
Basic: What are the main components of GFS?
Expected Answer:

GFS has three main components:
- Master (single):
- Stores all metadata in RAM
- Manages namespace (files/directories)
- Tracks chunk locations
- Makes placement decisions
- Coordinates system-wide operations
- Chunkservers (hundreds/thousands):
- Store 64MB chunks as Linux files
- Serve read/write requests
- Replicate data to other chunkservers
- Report status to master via heartbeat
- Clients (many):
- Application library
- Caches metadata
- Communicates with master for metadata
- Talks directly to chunkservers for data
Intermediate: How does GFS separate control and data flow?
Expected Answer:

GFS separates control (metadata) and data flow for scalability:

Control Flow (Client ↔ Master):
- Client asks master for chunk locations
- Master returns list of replicas
- Client caches this information
- Happens only once per chunk/file
Data Flow (Client ↔ Chunkservers):
- Client communicates directly with chunkservers
- No master involvement for data transfer
- Enables massive parallel throughput
Result:
- Master is not a bottleneck
- Master handles thousands of metadata ops/sec
- Data throughput limited only by network/chunkservers
- System scales to hundreds of clients at GB/s aggregate
Advanced: Explain the write data flow in detail
Expected Answer:

The GFS write operation has three phases:

1. Lease Grant:
- Client asks master for chunk replicas
- Master grants lease to one replica (primary)
- Client caches primary and secondary locations
2. Data Push:
- Client pushes data to the closest replica
- Each replica forwards to next in chain
- Pipelined: forward while receiving
- All replicas ACK when data buffered in memory
- Data not yet written to disk!
3. Write Commit:
- Client sends the write command to the primary
- Primary assigns serial number (ordering)
- Primary applies writes to local disk in order
- Primary forwards write order to secondaries
- Secondaries apply in same order
- Secondaries ACK to primary
- Primary ACKs success/failure to client
Key Points:
- Data flow is optimized for network topology
- Control flow ensures consistent ordering
- Primary serializes concurrent operations
- No distributed consensus needed
- If any replica fails, entire write fails (client retries)
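A client-side sketch of these phases. The Replica interface and the PushData and Commit method names are hypothetical, and the data push is shown as a simple loop rather than a pipelined chain: push data to every replica, commit through the primary, and retry the whole mutation if anything fails.

```go
package main

import (
	"errors"
	"fmt"
)

// Replica is an illustrative interface, not the real GFS RPC surface.
// PushData buffers data in the replica's memory; Commit asks the primary to
// apply it (the primary then forwards the order to secondaries).
type Replica interface {
	PushData(id string, data []byte) error
	Commit(id string) error // only meaningful on the lease-holding primary
}

// write sketches the client side of a GFS write: push data to all replicas,
// then send the write command to the primary; on any failure the client
// simply retries the whole mutation.
func write(primary Replica, secondaries []Replica, id string, data []byte, retries int) error {
	var lastErr error
	for attempt := 0; attempt <= retries; attempt++ {
		lastErr = func() error {
			// Phase 2: data push to every replica (pipelined along a chain in reality).
			for _, r := range append([]Replica{primary}, secondaries...) {
				if err := r.PushData(id, data); err != nil {
					return err
				}
			}
			// Phase 3: commit through the primary, which orders the mutation.
			return primary.Commit(id)
		}()
		if lastErr == nil {
			return nil
		}
	}
	return fmt.Errorf("write %s failed after retries: %w", id, lastErr)
}

// flakyReplica fails its first PushData to exercise the retry path.
type flakyReplica struct{ calls int }

func (f *flakyReplica) PushData(id string, data []byte) error {
	f.calls++
	if f.calls == 1 {
		return errors.New("temporarily unreachable")
	}
	return nil
}
func (f *flakyReplica) Commit(id string) error { return nil }

// okReplica always succeeds.
type okReplica struct{}

func (okReplica) PushData(id string, data []byte) error { return nil }
func (okReplica) Commit(id string) error                { return nil }

func main() {
	err := write(&flakyReplica{}, []Replica{okReplica{}, okReplica{}}, "mut-42", []byte("payload"), 2)
	fmt.Println("result:", err) // succeeds on the second attempt
}
```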
System Design: Why is chunk location not persisted?
Expected Answer:

Chunk locations are NOT stored persistently by the master, only in memory. Reasons:

Why Not Persist:
- Chunkservers are the source of truth:
- Chunkserver knows what chunks it has (on disk)
- No risk of inconsistency between master and reality
- Dynamic nature:
- Chunks added/deleted frequently
- Chunkservers may fail
- Disks may fail
- Replication changes chunk locations
- Simpler consistency:
- No need to keep persistent state in sync
- No risk of master having stale information
- Master polls chunkservers to rebuild state
How the Master Learns Locations:
- Startup: Master polls all chunkservers for their chunk lists
- Ongoing: Heartbeat messages update chunk locations
- Changes: Chunkservers report additions/deletions
Benefits:
- Eliminates an entire class of consistency bugs
- Faster recovery (no persistent state to reload)
- Simpler implementation
Cost:
- Startup requires polling all chunkservers
- Acceptable because happens rarely and completes quickly
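A minimal sketch of the startup rebuild (the Chunkserver interface and ListChunks call are illustrative, not the real RPC): the master polls every chunkserver in parallel and reconstructs the location map from their reports.

```go
package main

import (
	"fmt"
	"sync"
)

// Chunkserver is an illustrative stand-in for the "report your chunks"
// request the master issues at startup.
type Chunkserver interface {
	Addr() string
	ListChunks() []uint64
}

// rebuildLocations shows how the master can reconstruct the chunk-location
// map purely by polling chunkservers, since that state is never persisted.
func rebuildLocations(servers []Chunkserver) map[uint64][]string {
	locations := make(map[uint64][]string)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, s := range servers {
		wg.Add(1)
		go func(s Chunkserver) { // poll servers in parallel
			defer wg.Done()
			handles := s.ListChunks()
			mu.Lock()
			defer mu.Unlock()
			for _, h := range handles {
				locations[h] = append(locations[h], s.Addr())
			}
		}(s)
	}
	wg.Wait()
	return locations
}

// fakeServer is a test double standing in for a real chunkserver.
type fakeServer struct {
	addr   string
	chunks []uint64
}

func (f fakeServer) Addr() string         { return f.addr }
func (f fakeServer) ListChunks() []uint64 { return f.chunks }

func main() {
	servers := []Chunkserver{
		fakeServer{"cs-01:7070", []uint64{1, 2}},
		fakeServer{"cs-02:7070", []uint64{2, 3}},
		fakeServer{"cs-03:7070", []uint64{1, 3}},
	}
	fmt.Println(rebuildLocations(servers))
}
```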
Summary
Key Architecture Takeaways:
- Single Master Design: Simplicity and strong consistency for metadata
- Separation of Concerns: Control (master) vs. Data (chunkservers) flow
- Large Chunks: 64MB chunks reduce metadata and network overhead
- In-Memory Metadata: Fast operations, simple design
- Leases for Consistency: Primary replica orders operations without consensus
- Direct Client-Chunkserver: Data flow doesn’t involve master
- Relaxed Consistency: Trade-off for performance, application handles edge cases
- Record Append: Atomic concurrent appends without locks
Up Next
In Chapter 3: Master Operations, we'll explore:
- Namespace management and locking
- Chunk creation and allocation
- Replica placement strategies
- Lease management in detail
- Garbage collection mechanism