The Hadoop Distributed File System (HDFS) is the storage foundation of the Hadoop ecosystem. Inspired by Google’s GFS, HDFS implements a distributed file system designed to run on commodity hardware and provide high-throughput access to application data.
NameNode Metadata: fsimage and the Edit Log:
FSIMAGE (Namespace Snapshot)
────────────────────────────
Checkpoint of the entire namespace:
• All directories and files
• File → block mappings
• Permissions, ownership
• Replication factors
• Block sizes

Format: Protobuf (compact binary)
Size: Gigabytes for large clusters
Created: Periodically by the Secondary NameNode

Example:
/fsimage/fsimage_0000000000012345678

EDIT LOG (Transaction Log)
──────────────────────────
All namespace mutations since the last fsimage:
┌──────────────────────────────────────┐
│ TxID: 12345678                       │
│ Op: OP_ADD                           │
│ Path: /user/hadoop/data.txt          │
│ Replication: 3                       │
│ Timestamp: 1642531200                │
├──────────────────────────────────────┤
│ TxID: 12345679                       │
│ Op: OP_ALLOCATE_BLOCK                │
│ Path: /user/hadoop/data.txt          │
│ BlockID: blk_1234567890              │
│ GenStamp: 1001                       │
├──────────────────────────────────────┤
│ TxID: 12345680                       │
│ Op: OP_CLOSE                         │
│ Path: /user/hadoop/data.txt          │
└──────────────────────────────────────┘

Enables:
• Fast recovery (replay edits)
• Durability (survive crashes)
• High availability (standby replays)

RECOVERY PROCESS
────────────────
On NameNode startup:
1. Load the latest fsimage
   ↓
2. Replay edit log entries
   ↓
3. Merge to create the current namespace
   ↓
4. Enter safe mode (read-only)
   ↓
5. Receive block reports from DataNodes
   ↓
6. Verify block replication meets the threshold
   ↓
7. Exit safe mode (read-write)

Time: Minutes to hours (depends on size)

CHECKPOINT PROCESS
──────────────────
Secondary NameNode periodically:
1. Downloads the fsimage and edits from the NameNode
2. Loads the fsimage into memory
3. Applies the edit log transactions
4. Saves the new merged fsimage
5. Uploads it to the NameNode
6. NameNode replaces the old fsimage
7. Starts a new edit log

Why? Keeps the edit log from growing unbounded
Frequency: Hourly, or when edits reach a threshold
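To make the recovery sequence concrete, here is a minimal, illustrative sketch of loading a checkpoint and replaying edit-log operations over an in-memory namespace. It is not Hadoop's actual FSImage/FSEditLog code; the class, record, and operation handling are simplified stand-ins.

```java
// Conceptual sketch of edit-log replay during NameNode recovery.
// NOT Hadoop's FSImage/FSEditLog classes; names here are illustrative only.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyNamespace {
    // path -> block IDs (the fsimage checkpoint would populate this)
    private final Map<String, List<Long>> files = new HashMap<>();

    record EditOp(long txId, String op, String path, Long blockId) {}

    // Step 1: load the checkpoint. Step 2: replay every logged mutation in TxID order.
    void recover(Map<String, List<Long>> fsimageSnapshot, List<EditOp> editLog) {
        files.putAll(fsimageSnapshot);                 // load latest fsimage
        for (EditOp e : editLog) {                     // replay edits
            switch (e.op()) {
                case "OP_ADD" -> files.put(e.path(), new java.util.ArrayList<>());
                case "OP_ALLOCATE_BLOCK" -> files.get(e.path()).add(e.blockId());
                case "OP_CLOSE" -> { /* finalize the file; nothing to track in this toy model */ }
                default -> throw new IllegalStateException("unknown op " + e.op());
            }
        }
        // The real NameNode then waits in safe mode for DataNode block reports
        // before it starts serving writes.
    }
}
```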
Deep Dive: NameNode High Availability:
The NameNode is the brain of HDFS. In early versions, its failure meant total cluster downtime. HA solves this by maintaining a “Hot Standby” that can take over in seconds.
HDFS HA ARCHITECTURE WITH QJM
──────────────────────────────
┌────────────────┐              ┌────────────────┐
│   Active NN    │              │   Standby NN   │
│  (Read/Write)  │              │  (Read-only)   │
└───────┬────────┘              └───────┬────────┘
        │     ┌───────────────────┐     │
        └────▶│   JournalNodes    │◀────┘
              │ (Quorum of 3, 5)  │
              └───────────────────┘
┌─────────────┐                      ┌─────────────┐
│    ZKFC     │◀─────(ZooKeeper)────▶│    ZKFC     │
└─────────────┘                      └─────────────┘
The Shared Edit Log (Quorum Journal Manager):
Instead of a single local edits file, the Active NN sends edits to a set of JournalNodes (JNs).
A write is successful only if a quorum (majority) of JNs acknowledge it.
Standby NN continuously reads (tails) these edits from the JNs to keep its in-memory namespace in sync.
No Shared Storage Required: Unlike NFS-based HA, QJM doesn’t have a single point of failure in the storage layer.
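A sketch of the core hdfs-site.xml/core-site.xml settings behind a QJM deployment, expressed as Hadoop Configuration calls. The nameservice ID mycluster and the hostnames are placeholder examples; the property names are the standard HA keys.

```java
// Core client/NameNode settings for QJM-based HA (values are examples).
import org.apache.hadoop.conf.Configuration;

public class HaQjmConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // The Active NN writes edits to a quorum of JournalNodes, not just local disk.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
        // Clients discover which NN is Active through a failover proxy provider.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // Automatic failover via ZKFC + ZooKeeper (covered below).
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}
```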
JournalNode Mechanics:
JNs are lightweight processes, usually run on the same nodes as NameNodes or other masters.
Each writer (the Active NN) is assigned an increasing Epoch Number, and every edit it sends to the JNs is tagged with that epoch.
Fencing: If a new NameNode becomes active, it increments the epoch. JNs will then reject any further writes from the old Active NN (preventing “split-brain”).
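A toy illustration of the epoch rule, assuming a simplified JournalNode that only tracks the highest epoch it has promised. Hadoop's real JournalNode protocol carries more state, but the accept/reject decision looks like this:

```java
// Toy epoch-based fencing on a JournalNode. Not Hadoop's implementation;
// it only shows the accept/reject rule that prevents split-brain writes.
class ToyJournalNode {
    private long promisedEpoch = 0;   // highest epoch this JN has accepted

    // Called when a NameNode becomes Active and claims a new, higher epoch.
    synchronized boolean newEpoch(long epoch) {
        if (epoch <= promisedEpoch) {
            return false;             // stale claimant, refuse
        }
        promisedEpoch = epoch;
        return true;
    }

    // Called for every batch of edits. A fenced-off (old Active) NameNode
    // still holds its old epoch, so its writes are rejected here.
    synchronized void journal(long writerEpoch, long txId, byte[] edits) {
        if (writerEpoch < promisedEpoch) {
            throw new IllegalStateException(
                "Rejecting edits from epoch " + writerEpoch +
                "; a newer writer holds epoch " + promisedEpoch);
        }
        // ...append the edits durably, indexed by txId...
    }
}
```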
Automatic Failover (ZKFC & ZooKeeper):
ZKFC (ZooKeeper Failover Controller): A separate process running on each NN node.
It monitors the local NN’s health and maintains a session in ZooKeeper.
Active Election: ZKFCs compete to create an ephemeral node in ZooKeeper (/hadoop-ha/.../ActiveStandbyElectorLock).
The winner triggers its local NN to become Active.
If the Active NN or its ZKFC crashes, the lock is released, and the Standby ZKFC immediately wins the election.
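The election itself boils down to the ephemeral-znode pattern. A minimal sketch using the ZooKeeper Java client follows; the lock path and connect string are placeholders, and the real ZKFC wraps this logic in its ActiveStandbyElector.

```java
// Minimal sketch of the ephemeral-lock election ZKFC relies on.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ToyElector {
    private static final String LOCK = "/hadoop-ha/mycluster/ActiveStandbyElectorLock";

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> {});
        try {
            // Ephemeral: the node vanishes when our session dies, which is what
            // releases the lock if the Active NN (or its ZKFC) crashes.
            zk.create(LOCK, "nn1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("Won election -> tell the local NameNode to go Active");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Lost election -> stay Standby and watch the lock");
            zk.exists(LOCK, event -> System.out.println("Lock released, re-elect"));
        }
        // Keep the session (and thus the lock) alive while this process runs.
    }
}
```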
The Fencing Mechanism:
Crucial for preventing two NameNodes from thinking they are both Active.
QJM Fencing: As mentioned, JNs reject old epochs.
SSH Fencing: The Standby NN can log into the old Active node and kill -9 the process.
Fencing Script: Custom scripts can power off the node (PDU) or disable the network port.
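A sketch of the corresponding fencing configuration, assuming sshfence is tried first and a custom shell script is the fallback; the script path and key file are placeholder examples.

```java
// Fencing methods consulted during failover, tried in order (values are examples).
import org.apache.hadoop.conf.Configuration;

public class FencingConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // QJM epoch fencing always applies; these methods are the extra layer:
        // sshfence kills the NameNode process on the old Active host,
        // shell(...) can power the node off via a PDU or disable its port.
        conf.set("dfs.ha.fencing.methods",
                 "sshfence\nshell(/usr/local/bin/fence-nn.sh)");
        conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
        return conf;
    }
}
```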
Block Reports in HA:
In non-HA, only one NN gets reports.
In HA, DataNodes send heartbeats and block reports to BOTH NameNodes.
This allows the Standby to have a near-perfect map of block locations, enabling a “Hot” failover without waiting for a full block report.
Comparison: QJM vs. NFS for HA:
Feature            QJM (Recommended)          NFS (Legacy)
─────────────────  ─────────────────────────  ─────────────────────────
Shared storage     Quorum of JournalNodes     Single NFS filer
Fault tolerance    Can lose (N-1)/2 JNs       NFS is a SPoF
Consistency        Strong (quorum-based)      Depends on the NFS server
Network            Standard TCP               Requires robust NAS/SAN
Scaling the Namespace with Federation:
While HA provides reliability, Federation provides scalability. In a single-namespace cluster, the NameNode’s heap is the limit: every file, directory, and block object costs roughly 150 bytes of memory, so a namespace with 500 million blocks needs on the order of 75 GB of heap for block objects alone. Federation sidesteps this ceiling by running multiple independent NameNodes, each responsible for a portion of the namespace, while all of them share the same pool of DataNodes for block storage.
Rack-Aware Block Placement:
PLACEMENT ALGORITHM
───────────────────
Input: Block to replicate
Output: List of DataNode targets

Factors considered:
1. Rack diversity (fault tolerance)
2. Network bandwidth cost
3. Even distribution across the cluster
4. Available disk space
5. DataNode load

For Replication Factor = 3:

Case 1: Writer on a DataNode in the cluster
──────────────────────────────────────
1st replica: Same DataNode as the writer
2nd replica: DataNode on a different rack
3rd replica: Different DataNode, same rack as the 2nd

Example:
Writer on DN1 (Rack 1)
→ Replicas: DN1 (Rack 1), DN4 (Rack 2), DN5 (Rack 2)

Case 2: Writer outside the cluster (client)
────────────────────────────────────────
1st replica: Random DataNode
2nd replica: DataNode on a different rack
3rd replica: Different DataNode, same rack as the 2nd

Example:
External client
→ Replicas: DN2 (Rack 1), DN3 (Rack 2), DN4 (Rack 2)

For Replication Factor > 3:
───────────────────────────
4th+ replicas: Random placement

Constraints:
• No more than 2 replicas on the same rack
• Prefer under-utilized nodes
• Balance across racks

BENEFITS
────────
1. Fault Tolerance: Can lose any rack and still have the data
2. Read Performance: Multiple locations → parallel reads; choose the replica closest to the reader
3. Write Performance: Only 1 off-rack transfer; 2 of 3 replicas use the fast rack-local network
4. Load Balancing: Distribute popular blocks widely
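A simplified sketch of that 3-replica rule, assuming toy Node/rack types rather than Hadoop's BlockPlacementPolicyDefault:

```java
// Simplified sketch of the default 3-replica placement rule described above.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

record Node(String name, String rack) {}

class ToyPlacementPolicy {
    private final Random rnd = new Random();

    List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        // 1st replica: the writer's own node if it is a DataNode, else a random node.
        Node first = (writer != null) ? writer : cluster.get(rnd.nextInt(cluster.size()));
        targets.add(first);
        // 2nd replica: any node on a *different* rack (off-rack, for fault tolerance).
        Node second = pick(cluster, n -> !n.rack().equals(first.rack()));
        targets.add(second);
        // 3rd replica: a different node on the *same* rack as the 2nd (cheap rack-local copy).
        Node third = pick(cluster, n -> n.rack().equals(second.rack()) && !n.equals(second));
        targets.add(third);
        return targets;
    }

    private Node pick(List<Node> cluster, Predicate<Node> ok) {
        List<Node> candidates = cluster.stream().filter(ok).toList();
        return candidates.get(rnd.nextInt(candidates.size()));  // assumes a candidate exists
    }
}
```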
Handling Under-Replication:
TRIGGERS FOR RE-REPLICATION
──────────────────────────
1. DataNode Failure
   DN3 dies → all its blocks become under-replicated
   NameNode detects it (no heartbeat)
   Initiates re-replication immediately
2. Disk Failure
   DN reports a disk error
   Blocks on that disk → under-replicated
   Re-replicate from other replicas
3. Replication Factor Increase
   User increases replication: 3 → 5
   NameNode creates 2 more replicas per block
4. Corrupt Block Detected
   Checksum mismatch
   Mark the block corrupt
   Re-replicate from good replicas

PRIORITY QUEUE
──────────────
Blocks are prioritized by replication level:
Priority 1: 0 replicas (data loss imminent!)
Priority 2: 1 replica (one more failure = loss)
Priority 3: < replication factor
Process the highest priority first

RE-REPLICATION PROCESS
──────────────────────
1. NameNode identifies under-replicated blocks
   ↓
2. Selects a source DataNode (has a good replica)
   ↓
3. Selects a target DataNode (where to copy)
   Considers:
   • Rack diversity
   • Available space
   • Current load
   ↓
4. Sends command: DN_source → copy block to DN_target
   ↓
5. Source initiates the transfer to the target
   ↓
6. Target stores the block
   ↓
7. Target reports to the NameNode
   ↓
8. NameNode updates the block map

THROTTLING
──────────
Limit re-replication bandwidth:
• Default: 2 concurrent replication transfers per DataNode
• Prevents network saturation
• Avoids impacting client requests
• Configurable: dfs.namenode.replication.max-streams

During large-scale failures:
• May take hours to re-replicate all blocks
• Cluster degraded but operational
• Prioritization ensures critical data first
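A toy model of the priority ordering, assuming a plain PriorityQueue keyed on the live-replica count rather than the NameNode's real data structure:

```java
// Toy model of repair ordering: fewest live replicas first.
import java.util.PriorityQueue;

record UnderReplicatedBlock(long blockId, int liveReplicas, int targetReplicas) {}

class ReplicationQueueSketch {
    // Blocks with 0 live replicas come out first, then 1, then the rest.
    private final PriorityQueue<UnderReplicatedBlock> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.liveReplicas(), b.liveReplicas()));

    void blockBecameUnderReplicated(UnderReplicatedBlock b) {
        if (b.liveReplicas() < b.targetReplicas()) {
            queue.add(b);
        }
    }

    UnderReplicatedBlock nextToRepair() {
        return queue.poll();   // most at-risk block
    }
}
```

On the client side, trigger 3 above is a single call, FileSystem.setReplication(path, (short) 5); the NameNode then queues the extra copies exactly as sketched.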
Graceful Node Removal:
DECOMMISSIONING WORKFLOW
────────────────────────
Scenario: Remove DN3 from the cluster (hardware upgrade)

1. Administrator marks DN3 for decommission
   Edit: hdfs-site.xml (exclude file)
   Command: hdfs dfsadmin -refreshNodes
2. NameNode puts DN3 in the "Decommissioning" state
   DN3 still sends heartbeats
   No new blocks are assigned to DN3
3. NameNode identifies every block with a replica on DN3
   Creates a list of blocks to replicate elsewhere
4. Re-replication begins
   For each block on DN3:
   • Select a target DataNode
   • Copy the block to the target
   • Update the block map
5. Monitor progress
   Command: hdfs dfsadmin -report
   Shows: "Decommissioning in progress: 45% complete"
6. All blocks copied
   DN3 state: "Decommissioned"
   DN3 holds zero unique blocks
7. Safe to power down DN3
   No data loss
   Cluster continues normally

WHY DECOMMISSION (don't just power off!)
────────────────────────────────────────
Without decommissioning:
✗ Immediate under-replication
✗ NameNode thinks DN3 is dead
✗ Emergency re-replication is triggered
✗ Network saturation
✗ Potential data loss if multiple failures overlap

With decommissioning:
✓ Gradual, controlled process
✓ No data loss risk
✓ Network-friendly
✓ Clean removal

TIME ESTIMATE
─────────────
Depends on:
• Amount of data on the node
• Network bandwidth
• Cluster load
• Number of concurrent decommissions
Typical: Hours to days for heavily loaded nodes
Cluster Load Balancing:
THE BALANCING PROBLEM
─────────────────────
Over time, the cluster becomes imbalanced:
DN1: 90% full ▓▓▓▓▓▓▓▓▓░
DN2: 45% full ▓▓▓▓░░░░░░
DN3: 80% full ▓▓▓▓▓▓▓▓░░
DN4: 30% full ▓▓▓░░░░░░░

Causes:
• New nodes added (they start empty)
• Uneven write patterns
• Some nodes have larger disks
• Decommissioned nodes

Problems:
✗ Some nodes fill up first
✗ New writes fail (no space)
✗ Uneven read load
✗ Inefficient resource use

HDFS BALANCER
─────────────
Standalone tool that redistributes blocks:

Command:
$ hdfs balancer -threshold 10

Meaning:
• Target: All nodes within 10% of the average utilization
• Average: 61.25% full
• Goal: Each node between 51.25% and 71.25%

BALANCER ALGORITHM
──────────────────
1. Calculate the cluster's average utilization
   Total used / Total capacity = 61.25%
2. Identify over-utilized nodes (> average + threshold)
   DN1: 90% (over by 28.75%)
   DN3: 80% (over by 18.75%)
3. Identify under-utilized nodes (< average - threshold)
   DN2: 45% (under by 16.25%)
   DN4: 30% (under by 31.25%)
4. For each over-utilized node:
   Select blocks to move
   Choose an under-utilized destination
   Initiate block transfers
5. Repeat until balanced or the maximum iterations are reached

CONSTRAINTS
───────────
The balancer obeys the replication policy:
✓ Maintains rack diversity
✓ Doesn't reduce replicas below the replication factor
✓ Respects rack awareness

Throttling:
• Default: 10 MB/s per DataNode
• Configurable: dfs.datanode.balance.bandwidthPerSec
• Runs in the background, low priority

Safety:
✓ Only moves replicas (doesn't delete data)
✓ Verifies checksums after the move
✓ Updates NameNode metadata atomically
✓ Can be stopped safely at any time

BEST PRACTICES
──────────────
When to run:
→ After adding new nodes
→ When utilization becomes skewed
→ During maintenance windows (low load)

How often:
→ Weekly or monthly (depends on write rate)
→ Not constantly (overhead)

Monitoring:
→ The balancer logs bytes moved per iteration to its console
→ Watch cluster balance: hdfs dfsadmin -report
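A small sketch of the balancer's first step using the numbers above, assuming equal-capacity nodes (the real balancer weights utilization by capacity):

```java
// Toy version of the balancer's classification step: compute the average
// utilization and split nodes into over- and under-utilized sets.
import java.util.LinkedHashMap;
import java.util.Map;

public class BalancerMath {
    public static void main(String[] args) {
        Map<String, Double> utilization = new LinkedHashMap<>();
        utilization.put("DN1", 90.0);
        utilization.put("DN2", 45.0);
        utilization.put("DN3", 80.0);
        utilization.put("DN4", 30.0);
        double threshold = 10.0;   // hdfs balancer -threshold 10

        double avg = utilization.values().stream()
                                .mapToDouble(Double::doubleValue).average().orElse(0);
        System.out.printf("Average utilization: %.2f%%%n", avg);   // 61.25%

        utilization.forEach((node, pct) -> {
            if (pct > avg + threshold) {
                System.out.printf("%s over-utilized by %.2f%% -> move blocks off%n", node, pct - avg);
            } else if (pct < avg - threshold) {
                System.out.printf("%s under-utilized by %.2f%% -> receive blocks%n", node, avg - pct);
            } else {
                System.out.printf("%s within threshold%n", node);
            }
        });
    }
}
```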
The Write Pipeline:
SEQUENTIAL REPLICATION (Slow):
──────────────────────────────
Client ──▶ DN1 (write 128MB)
              ↓
           DN1 ──▶ DN3 (copy 128MB)
                      ↓
                   DN3 ──▶ DN4 (copy 128MB)

Time: 1s + 1s + 1s = 3 seconds
Bandwidth: Network links used one at a time

PIPELINED REPLICATION (Fast):
─────────────────────────────
Client ──▶ DN1 ──▶ DN3 ──▶ DN4

All transfers happen simultaneously:
Packet 1: Client→DN1   DN1→DN3   DN3→DN4
Packet 2: Client→DN1   DN1→DN3   DN3→DN4
Packet 3: Client→DN1   DN1→DN3   DN3→DN4

Time: ~1 second
Bandwidth: All links utilized in parallel

PACKET STRUCTURE
────────────────
Packet (64KB):
┌────────────────────────────────┐
│ Header                         │
│ • Sequence number              │
│ • Block ID                     │
│ • Offset in block              │
├────────────────────────────────┤
│ Data (64KB)                    │
├────────────────────────────────┤
│ Checksums                      │
│ • Per 512 bytes                │
└────────────────────────────────┘

ACK MECHANISM
─────────────
Each DataNode:
1. Receives the packet
2. Verifies the checksums
3. Writes to disk
4. Forwards to the next node (if not last)
5. Waits for an ACK from downstream
6. Sends an ACK upstream

ACK Packet:
┌────────────────────────────────┐
│ Sequence number                │
│ Status: SUCCESS/FAILURE        │
│ Failed replicas (if any)       │
└────────────────────────────────┘

BENEFITS
────────
✓ ~3x faster than sequential replication
✓ Better network utilization
✓ Lower latency for clients
✓ Scales with the replication factor
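From the client's side, the whole pipeline is hidden behind the FileSystem API; a minimal write looks like this (the path and replication factor are example values):

```java
// Client-side view of the write path: create() opens the pipeline,
// write() streams packets through it, close() finalizes the block.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class PipelineWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site/hdfs-site
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/hadoop/data.txt");
            short replication = 3;                           // 3-node pipeline as shown above
            try (FSDataOutputStream out = fs.create(file, replication)) {
                // Data is buffered into ~64KB packets and streamed down the pipeline;
                // the client only advances once ACKs flow back upstream.
                out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
            }   // close() finalizes the last block and reports to the NameNode
        }
    }
}
```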
Pipeline Failure Recovery:
SCENARIO: DN3 Fails During a Write
────────────────────────────────
Original pipeline:
Client ──▶ DN1 ──▶ DN3 ──▶ DN4
                    ❌ (fails)

Step 1: Detect the Failure
──────────────────────
DN1 doesn't receive an ACK from DN3
DN1 times out (configurable, e.g. 60s)
DN1 reports to the Client

Step 2: Rebuild the Pipeline
────────────────────────
Client:
• Removes DN3 from the pipeline
• Current replicas: DN1, DN4 (only 2!)
• Continues with the reduced pipeline

New pipeline:
Client ──▶ DN1 ──▶ DN4

Step 3: Resume Writing
──────────────────────
The client resends packets that were not ACKed:
• Keeps track of unacknowledged packets
• Resends from the last successful packet
• No data loss

Step 4: Inform the NameNode
───────────────────────
Client ──▶ NameNode: "DN3 failed during write of blk_1234567890"
NameNode:
• Marks DN3 as dead (or suspect)
• Notes the block is under-replicated

Step 5: Post-Write Re-Replication
─────────────────────────────────
After the file is closed:
NameNode schedules re-replication
The block is copied from DN1 or DN4 to a new DN (e.g., DN5)
Final replicas: [DN1, DN4, DN5]

MULTIPLE FAILURES
─────────────────
If 2 DataNodes fail:
Still 1 replica left
Continue writing
Re-replicate after close

If all 3 fail:
The write fails
Client retries with a new pipeline
NameNode selects different DataNodes

BLOCK VERSIONS
──────────────
Each recovery attempt creates a new generation stamp:
Attempt 1: blk_1234567890, gen_stamp: 1001
Attempt 2: blk_1234567890, gen_stamp: 1002
Prevents confusion with partial writes
Old generation stamps are eventually garbage collected

CLIENT-SIDE ROBUSTNESS
──────────────────────
The client maintains:
• Packet queue (unacknowledged packets)
• Maximum retries
• Timeout configuration
• Error counters

If there are too many failures:
→ Report to the application
→ The application decides whether to retry or fail
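How aggressively the client repairs a broken pipeline is configurable. A sketch of the relevant client properties follows; the keys are standard HDFS client settings, but treat the values shown as illustrative rather than authoritative defaults for every release.

```java
// Client-side knobs governing pipeline recovery (values are illustrative).
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Whether to ask the NameNode for a replacement DataNode when one drops out.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
        // DEFAULT replaces only for larger pipelines/appends; ALWAYS and NEVER are the other policies.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
        // Keep writing with fewer replicas if a replacement cannot be found.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", false);
        return conf;
    }
}
```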
Ensuring Data Integrity:
CHECKSUM VERIFICATION
─────────────────────
Write path:
1. Client computes checksums
   One per 512 bytes of data
2. Sends data + checksums in each packet
3. Each DataNode:
   • Receives the packet
   • Verifies the checksums
   • Writes the data to disk
   • Stores the checksums separately
   • Forwards to the next DataNode
4. If a checksum mismatch occurs:
   • The DataNode reports the error
   • The client excludes that DataNode from the pipeline
   • Rebuilds the pipeline without the bad node

ATOMIC BLOCK CREATION
─────────────────────
Blocks are created atomically:

During the write:
• Block is in the "under construction" state
• Not visible to readers
• Stored in a temporary location

After close:
• Block finalized
• Moved to its final location
• Made visible to readers
• Generation stamp finalized

If a crash occurs before finalization:
• Incomplete blocks are deleted
• No partial data is visible

LEASE MECHANISM
───────────────
Prevents concurrent writes:
Client ──▶ NameNode: "open for write"
NameNode grants an exclusive lease (60s)

The client must renew the lease:
• Heartbeat every 30s
• While the file is open

If the client crashes:
• The lease expires after 60s
• NameNode recovers the file
• Discards the incomplete last block

If another client tries to write the same file:
NameNode: "Lease already held, wait or fail"

PIPELINE ACK
────────────
Guarantees all replicas are written:

Client sends a packet
   ↓
DN1 writes ──▶ DN3 writes ──▶ DN4 writes
                                  ↓
                              DN4 ACKs
                    ↓
              DN3 ACKs (after DN4)
      ↓
DN1 ACKs (after DN3)
   ↓
Client receives the ACK

Only when all replicas confirm:
→ The client considers the packet written
→ Moves on to the next packet

FAILURE ATOMICITY
─────────────────
Either all replicas succeed or none:

Success: All ACKs received
→ Packet committed

Partial failure: Some ACK, some don't
→ Remove the failed DataNodes
→ Resend the packet
→ Continue with the reduced pipeline

Complete failure: No ACKs
→ Retry the packet
→ If the failure persists, fail the entire block
→ Start a new block
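A self-contained illustration of per-chunk checksumming, assuming plain CRC32 per 512-byte chunk to keep the sketch dependency-free (HDFS itself defaults to CRC32C, with the chunk size set by dfs.bytes-per-checksum):

```java
// Per-chunk checksums: one CRC per 512 bytes, recomputed and compared on receipt.
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512;   // mirrors dfs.bytes-per-checksum

    static List<Long> compute(byte[] data) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
            CRC32 crc = new CRC32();
            crc.update(data, off, Math.min(BYTES_PER_CHECKSUM, data.length - off));
            sums.add(crc.getValue());
        }
        return sums;
    }

    // A receiving DataNode recomputes and compares; any mismatch marks that copy corrupt.
    static boolean verify(byte[] data, List<Long> expected) {
        return compute(data).equals(expected);
    }
}
```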
Appending to Existing Files:
APPEND WORKFLOW
───────────────
Use case: Log aggregation
Multiple writers append to the same file over time (one at a time; see concurrency control below)

Step 1: Request Append
──────────────────────
Client ──▶ NameNode: "append /var/log/app.log"
NameNode:
• Checks permissions
• Verifies the file exists
• Grants a lease to the client
• Returns the last block's location

NameNode ──▶ Client:
  Last block: blk_999, size: 64MB
  Locations: [DN2, DN5, DN8]

Step 2: Append Data
───────────────────
If the last block < 128MB:
• Append to the existing block
• Establish a pipeline with the existing replicas
• Continue from the last offset

If the last block is full (128MB):
• Allocate a new block
• Normal write pipeline

Step 3: Pipeline Synchronization
────────────────────────────────
Critical: All replicas must agree on the block state

Client ──▶ DN2, DN5, DN8:
  "What's your last confirmed offset for blk_999?"

Responses:
DN2: 67108864 bytes (64MB)
DN5: 67108864 bytes (64MB)
DN8: 67108864 bytes (64MB)
All agree → Continue from 64MB

If there is a mismatch:
DN2: 64MB
DN5: 64MB
DN8: 63MB ← Out of sync!

Resolution:
• Use the minimum confirmed offset
• Truncate the longer replicas
• Restart from 63MB

Step 4: Continue Writing
────────────────────────
Same as a normal write:
• Packet streaming
• ACK protocol
• Checksum verification

CONCURRENCY CONTROL
───────────────────
Only ONE writer at a time:
Client A holds the lease → Can append
Client B tries to append → Blocked until the lease is released

Lease duration: 60 seconds
Renewable while writing

If Client A crashes:
• The lease expires
• NameNode recovers the file
• Client B can now append

USE CASES
─────────
Good for:
✓ Log aggregation
✓ Audit trails
✓ Sequential data streams

Not for:
✗ Random writes
✗ Frequent small appends (overhead)
✗ Multiple concurrent writers (lease contention)

LIMITATIONS
───────────
Performance:
⚠ Append is slower than a fresh write
⚠ Pipeline synchronization overhead
⚠ Potential for replica divergence

Recommendation:
→ For high-volume logs, roll over to new files periodically
→ Use append only for occasional additions
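The client-side entry point is FileSystem.append(); a minimal sketch follows (the log path is an example):

```java
// Client-side append: append() resumes the last block's pipeline, or starts
// a new block if the last one is already full.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path log = new Path("/var/log/app.log");
            // append() acquires the write lease; a second concurrent caller on the
            // same path fails until this lease is released or expires.
            try (FSDataOutputStream out = fs.append(log)) {
                out.write("2022-01-18 12:00:00 INFO something happened\n"
                          .getBytes(StandardCharsets.UTF_8));
                out.hflush();   // make the appended bytes visible to new readers
            }
        }
    }
}
```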