
Chapter 2: HDFS Architecture

The Hadoop Distributed File System (HDFS) is the storage foundation of the Hadoop ecosystem. Inspired by Google’s GFS, HDFS implements a distributed file system designed to run on commodity hardware and provide high-throughput access to application data.
Chapter Goals:
  • Understand HDFS architecture and component roles
  • Learn how NameNode and DataNodes interact
  • Master block replication and placement strategies
  • Explore read and write data flows
  • Compare HDFS with GFS and identify improvements

HDFS Architecture Overview

System Components

+---------------------------------------------------------------+
|                    HDFS ARCHITECTURE                          |
+---------------------------------------------------------------+
|                                                               |
|                   ┌──────────────────┐                        |
|                   │   NAMENODE       │                        |
|                   │  (Master Server) │                        |
|                   ├──────────────────┤                        |
|                   │ • Namespace      │  Responsibilities:     |
|                   │ • Block Mapping  │  • Metadata management |
|                   │ • Replication    │  • File operations     |
|                   │   Policy         │  • Block placement     |
|                   │ • Cluster State  │  • Replication control |
|                   └──────────────────┘                        |
|                    ↑ ↑ ↑        ↑ ↑                           |
|      Metadata ops  │ │ │        │ │  Heartbeats + Block       |
|      & block info  │ │ │        │ │  Reports                  |
|                    │ │ │        │ │                           |
|        ┌───────────┘ │ └────────┘ └──────────────┐            |
|        │             │                            │            |
|   ┌────────┐   ┌────────┐                   ┌────────┐        |
|   │ CLIENT │   │ CLIENT │                   │ CLIENT │        |
|   └────────┘   └────────┘                   └────────┘        |
|        │             │                            │            |
|        │ Read/Write  │                            │            |
|        │ data flow   │                            │            |
|        ↓             ↓                            ↓            |
|   ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐        |
|   │DATANODE 1│ │DATANODE 2│ │DATANODE 3│ │DATANODE 4│        |
|   ├──────────┤ ├──────────┤ ├──────────┤ ├──────────┤        |
|   │ Block A  │ │ Block A  │ │ Block B  │ │ Block A  │        |
|   │ Block B  │ │ Block C  │ │ Block C  │ │ Block D  │        |
|   │ Block E  │ │ Block D  │ │ Block F  │ │ Block E  │        |
|   └──────────┘ └──────────┘ └──────────┘ └──────────┘        |
|        ↑             ↑            ↑            ↑               |
|        └─────────────┴────────────┴────────────┘               |
|              Replication Pipeline (data flow)                 |
|                                                               |
+---------------------------------------------------------------+

KEY INSIGHT: Metadata and data flows are separated
            NameNode handles metadata, clients talk to DataNodes for data

NameNode: The Master

The NameNode is the central authority in HDFS, managing the file system namespace and controlling access to files by clients.
What the NameNode Does:
1. NAMESPACE MANAGEMENT
───────────────────────

File System Tree:
/
├── user/
│   ├── hadoop/
│   │   ├── input/
│   │   │   └── data.txt (3 blocks)
│   │   └── output/
│   └── hive/
│       └── warehouse/
└── tmp/

Operations (see the client-API sketch after this list):
• Create files and directories
• Delete files and directories
• Rename files and directories
• List directory contents
• Get file attributes
• Set permissions and ownership
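
These operations are what applications invoke through the standard org.apache.hadoop.fs.FileSystem client API. Below is a minimal sketch; the paths are illustrative, and the cluster address is assumed to come from fs.defaultFS in core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        // Connects to the NameNode configured in core-site.xml (fs.defaultFS)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/hadoop/input");        // illustrative path
        fs.mkdirs(dir);                                    // create directory
        fs.rename(dir, new Path("/user/hadoop/staging"));  // rename

        // List directory contents and read file attributes
        for (FileStatus st : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(st.getPath() + " " + st.getLen() + " bytes");
        }

        // Permissions (octal 750) -- requires appropriate privileges
        fs.setPermission(new Path("/user/hadoop/staging"),
                         new FsPermission((short) 0750));
        fs.delete(new Path("/tmp/scratch"), true);         // recursive delete
        fs.close();
    }
}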


2. BLOCK MANAGEMENT
──────────────────

Block Metadata:
File: /user/hadoop/input/data.txt (300MB)

Blocks:
┌──────────────────────────────────────┐
│ Block ID: blk_1234567890              │
│ Size: 128MB                           │
│ Locations: [DN1, DN3, DN5]            │
│ Generation Stamp: 1001                │
├──────────────────────────────────────┤
│ Block ID: blk_1234567891              │
│ Size: 128MB                           │
│ Locations: [DN2, DN4, DN6]            │
│ Generation Stamp: 1002                │
├──────────────────────────────────────┤
│ Block ID: blk_1234567892              │
│ Size: 44MB (last block)               │
│ Locations: [DN1, DN2, DN7]            │
│ Generation Stamp: 1003                │
└──────────────────────────────────────┘
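
Clients can inspect this block-to-location mapping themselves through FileSystem.getFileBlockLocations. A small sketch, reusing the example file above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status =
            fs.getFileStatus(new Path("/user/hadoop/input/data.txt"));

        // One BlockLocation per block: offset, length, and the DataNodes holding replicas
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d len=%d hosts=%s%n",
                b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}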


3. REPLICATION MANAGEMENT
────────────────────────

Monitor and Maintain:
• Ensure each block has correct # replicas (default: 3)
• Re-replicate under-replicated blocks
• Delete over-replicated blocks
• Balance blocks across DataNodes
• Handle DataNode failures

Replication Policy:
1st replica: Same node as writer (if in cluster)
2nd replica: Different rack
3rd replica: Same rack as 2nd, different node

Why? Balance fault tolerance and network cost


4. CLUSTER COORDINATION
──────────────────────

Heartbeat Protocol:
DataNode → NameNode (every 3 seconds):
{
  "nodeId": "DN1",
  "capacity": "4TB",
  "used": "2.3TB",
  "remaining": "1.7TB",
  "blockCount": 15234,
  "timestamp": 1642531200
}

If no heartbeat for 10 minutes → DataNode dead

Block Reports:
DataNode → NameNode (every 6 hours):
{
  "nodeId": "DN1",
  "blocks": [
    {"blockId": "blk_1", "size": 128MB, "genStamp": 1001},
    {"blockId": "blk_5", "size": 128MB, "genStamp": 1005},
    ...thousands more...
  ]
}

Enables NameNode to verify block placement
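
Conceptually, the NameNode's liveness check is a table of last-heartbeat timestamps swept on a timer. The sketch below illustrates only that idea; the class name and constants are invented for the example, and the real HDFS heartbeat manager is considerably more involved.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of heartbeat-based liveness tracking (names are illustrative).
class HeartbeatMonitor {
    // 10 minutes without a heartbeat => the DataNode is considered dead
    static final long EXPIRY_MS = 10 * 60 * 1000L;

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a heartbeat arrives (every ~3 seconds per DataNode)
    void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodic sweep: mark expired nodes dead and trigger re-replication
    void sweep() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > EXPIRY_MS) {
                System.out.println(e.getKey() + " is dead; re-replicating its blocks");
                // In HDFS, the NameNode would now schedule re-replication of every
                // block whose replica count dropped below the target factor.
            }
        }
    }
}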

DataNodes: The Workers

DataNodes store the actual data blocks and serve read/write requests from clients.

Storage Model

How Blocks Are Stored:
  • Each block: up to 128MB (default)
  • Stored as regular Linux files
  • Path: /data/hadoop/dfs/data/current/
  • File naming: blk_<block_id>
  • Separate metadata file: blk_<block_id>.meta
  • Contains checksums for integrity
  • Multiple blocks per disk

Responsibilities

What DataNodes Do:
  • Store and retrieve blocks
  • Serve read requests from clients
  • Execute write operations
  • Replicate blocks to other DataNodes
  • Send heartbeats to NameNode
  • Report blocks to NameNode
  • Delete blocks when instructed

Block Verification

Data Integrity (see the checksum sketch after this list):
  • Checksum per 512 bytes
  • Verify on read operations
  • Periodic background scanning
  • Report corrupt blocks to NameNode
  • Automatic re-replication from good replicas
  • Checksum stored in .meta file
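
The per-chunk checksumming idea can be sketched with java.util.zip.CRC32 over 512-byte chunks. This illustrates the concept only, not HDFS's exact on-disk .meta layout:

import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Per-chunk checksumming, as in HDFS's 512-byte checksum chunks (simplified).
class ChunkChecksummer {
    static final int CHUNK_SIZE = 512;

    // Compute one CRC32 value per 512-byte chunk of a block's data
    static List<Long> checksum(byte[] blockData) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < blockData.length; off += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, blockData.length - off);
            CRC32 crc = new CRC32();
            crc.update(blockData, off, len);
            sums.add(crc.getValue());
        }
        return sums;
    }

    // On read, recompute and compare; a mismatch means this replica is corrupt
    static boolean verify(byte[] blockData, List<Long> stored) {
        return checksum(blockData).equals(stored);
    }
}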

Heartbeat Protocol

Communication with NameNode:
  • Heartbeat every 3 seconds
  • Contains capacity info
  • Receives commands from NameNode
  • Block reports every 6 hours
  • 10 minutes without heartbeat = dead
  • NameNode initiates re-replication

Block Replication and Placement

Replication Strategy

HDFS replicates blocks to ensure fault tolerance and enable data locality for MapReduce.
DEFAULT REPLICATION POLICY (Factor = 3)
───────────────────────────────────────

Rack Awareness:
┌─────────────────────────────────────────────────────────┐
│                        CLUSTER                          │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Rack 1               Rack 2               Rack 3       │
│  ┌──────────┐        ┌──────────┐        ┌──────────┐  │
│  │  DN1     │        │  DN3     │        │  DN5     │  │
│  │  DN2     │        │  DN4     │        │  DN6     │  │
│  └──────────┘        └──────────┘        └──────────┘  │
│                                                         │
│  Writer on DN1:                                         │
│  ───────────────                                        │
│                                                         │
│  Replica 1 → DN1 (same node as writer)                  │
│  Replica 2 → DN3 (different rack)                       │
│  Replica 3 → DN4 (same rack as replica 2, diff node)    │
│                                                         │
│                                                         │
│  Why This Policy?                                       │
│  ────────────────                                       │
│                                                         │
│  1. Local Write (Replica 1):                            │
│     • Zero network transfer for first replica           │
│     • Fast write initiation                             │
│                                                         │
│  2. Off-Rack (Replica 2):                               │
│     • Survives rack failure                             │
│     • Network cost of 1 rack transfer                   │
│                                                         │
│  3. Same Rack as #2 (Replica 3):                        │
│     • Rack-local transfer (faster than off-rack)        │
│     • Still survives single rack failure                │
│                                                         │
│  Trade-off: Availability vs Network Cost                │
│  ────────────────────────────────────────               │
│                                                         │
│  Survives:                                              │
│  ✓ Any single DataNode failure                          │
│  ✓ Any single rack failure                              │
│  ✗ Does NOT survive 2 rack failures                     │
│                                                         │
└─────────────────────────────────────────────────────────┘

Replica Placement Details

Default 3-Way Replication:
PLACEMENT ALGORITHM
───────────────────

Input: Block to replicate
Output: List of DataNode targets

Factors Considered:
1. Rack diversity (fault tolerance)
2. Network bandwidth cost
3. Even distribution across cluster
4. Available disk space
5. DataNode load


For Replication Factor = 3:

Case 1: Writer on DataNode in cluster
──────────────────────────────────────
1st replica: Same DataNode as writer
2nd replica: DataNode on different rack
3rd replica: Different DataNode, same rack as 2nd

Example:
Writer on DN1 (Rack 1)
→ Replicas: DN1 (Rack 1), DN3 (Rack 2), DN4 (Rack 2)


Case 2: Writer outside cluster (client)
────────────────────────────────────────
1st replica: Random DataNode
2nd replica: DataNode on different rack
3rd replica: Different DataNode, same rack as 2nd

Example:
External client
→ Replicas: DN2 (Rack 1), DN3 (Rack 2), DN4 (Rack 2)
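
These two cases can be condensed into a small selection routine. The sketch below is a simplified illustration of the default policy; the Node type and the whole class are invented for the example and are not HDFS's actual BlockPlacementPolicyDefault.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Simplified sketch of the default 3-replica placement policy (illustrative only).
class PlacementSketch {
    static class Node {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    private final Random rnd = new Random();

    List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();

        // 1st replica: the writer's own node if it runs a DataNode, else a random node
        Node first = (writer != null) ? writer : cluster.get(rnd.nextInt(cluster.size()));
        targets.add(first);

        // 2nd replica: any node on a different rack (survives a whole-rack failure)
        Node second = cluster.stream()
            .filter(n -> !n.rack.equals(first.rack))
            .findAny().orElseThrow(IllegalStateException::new);
        targets.add(second);

        // 3rd replica: a different node on the SAME rack as the 2nd (cheap rack-local copy)
        Node third = cluster.stream()
            .filter(n -> n.rack.equals(second.rack) && !n.name.equals(second.name))
            .findAny().orElseThrow(IllegalStateException::new);
        targets.add(third);

        return targets;
    }
}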


For Replication Factor > 3:
───────────────────────────
4th+ replicas: Random placement
Constraints:
• No more than 2 replicas on same rack
• Prefer under-utilized nodes
• Balance across racks


BENEFITS
────────

1. Fault Tolerance:
   Can lose any rack and still have data

2. Read Performance:
   Multiple locations → parallel reads
   Choose closest replica to reader

3. Write Performance:
   Only 1 off-rack transfer
   2/3 replicas use fast rack-local network

4. Load Balancing:
   Distribute popular blocks widely

Data Flow: Read Operations

Understanding how data flows through HDFS is crucial for optimization.

Read Path

HDFS READ OPERATION
───────────────────

Client wants to read: /user/hadoop/input.txt

Step 1: Open File
─────────────────

Client ────(1)───▶ NameNode
    "open /user/hadoop/input.txt"

NameNode:
• Checks permissions
• Looks up file metadata
• Returns block locations

NameNode ────(2)───▶ Client
    Block Locations:
    blk_1: [DN1, DN3, DN5]
    blk_2: [DN2, DN4, DN6]
    blk_3: [DN1, DN2, DN7]


Step 2: Read Blocks
───────────────────

For each block, client:

1. Selects closest DataNode
   (based on network topology)

   Client chooses DN1 for blk_1
   (same rack → lowest latency)

2. Establishes connection

   Client ────(3)───▶ DN1
       "read blk_1"

3. DN1 reads from local disk

   DN1:
   • Locates block file
   • Reads data
   • Computes checksums
   • Verifies integrity

4. Streams data to client

   DN1 ────(4)───▶ Client
       [data stream with checksums]

5. Client verifies checksums

   If checksum mismatch:
   • Report to NameNode
   • Try different replica
   • NameNode schedules re-replication


Step 3: Repeat for All Blocks
──────────────────────────────

Client reads blk_2 from DN2
Client reads blk_3 from DN1


Step 4: Close File
──────────────────

Client ────(5)───▶ NameNode
    "close /user/hadoop/input.txt"

NameNode updates access time


TIMELINE
────────

┌─────────────────────────────────────────────────────┐
│                                                     │
│  t=0ms    Client → NameNode (open file)             │
│  t=10ms   NameNode → Client (block locations)       │
│  t=15ms   Client → DN1 (read blk_1)                 │
│  t=20ms   DN1 starts streaming data                 │
│  t=1000ms DN1 finishes streaming 128MB              │
│  t=1005ms Client → DN2 (read blk_2)                 │
│  t=2000ms DN2 finishes streaming                    │
│  ...                                                │
│                                                     │
└─────────────────────────────────────────────────────┘

Total time dominated by data transfer, not metadata ops
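
From application code, all of these steps are hidden behind the FileSystem client: a read is just open-and-stream. A minimal sketch, reusing the example path above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode for block locations; the returned stream
        // then reads each block directly from the closest DataNode replica.
        try (FSDataInputStream in = fs.open(new Path("/user/hadoop/input.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}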

Read Optimizations

Choosing the Closest Replica:
Network Topology Awareness
──────────────────────────

HDFS models network as a tree:

                Cluster

        ┌──────────┼──────────┐
     Rack1       Rack2      Rack3
        │          │           │
    ┌───┼───┐  ┌───┼───┐   ┌───┼───┐
   DN1 DN2 DN3 DN4 DN5 DN6 DN7 DN8 DN9


Distance Calculation:
─────────────────────

distance(DN1, DN1) = 0  (same node)
distance(DN1, DN2) = 2  (same rack)
distance(DN1, DN4) = 4  (different rack)

Formula:
distance(A, B) = hops from A up to their lowest common ancestor
                 + hops from B up to that ancestor


Replica Selection:
──────────────────

Client on DN1 reading block with replicas [DN1, DN4, DN7]:

1. Sort by distance:
   DN1: distance 0 ← CHOSEN
   DN4: distance 4
   DN7: distance 4

2. Choose closest

3. If closest fails, try next


Impact:
───────

Same node: no network transfer at all
Same rack: one hop through the top-of-rack switch (full NIC bandwidth)
Different rack: crosses oversubscribed inter-rack links shared by many nodes

Massive performance difference at scale!
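
Assuming topology paths of the form /rackN/DNx, the distance rule and closest-replica selection can be sketched as below; the path format and helper names are illustrative, not Hadoop's internal API.

import java.util.Comparator;
import java.util.List;

// Tree-distance over topology paths such as "/rack1/DN1" (simplified sketch).
class TopologyDistance {
    // distance = hops from A up to the lowest common ancestor + hops from B up to it
    static int distance(String a, String b) {
        String[] pa = a.split("/"), pb = b.split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    // Pick the replica closest to the reader; ties broken arbitrarily
    static String closestReplica(String reader, List<String> replicas) {
        return replicas.stream()
            .min(Comparator.comparingInt(r -> distance(reader, r)))
            .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        List<String> replicas = List.of("/rack1/DN1", "/rack2/DN4", "/rack3/DN7");
        System.out.println(distance("/rack1/DN1", "/rack1/DN2"));   // 2 (same rack)
        System.out.println(distance("/rack1/DN1", "/rack2/DN4"));   // 4 (different rack)
        System.out.println(closestReplica("/rack1/DN1", replicas)); // /rack1/DN1
    }
}
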
Reading Multiple Blocks Concurrently:
Sequential Read (Slow):
───────────────────────

Read blk_1 ──▶ Read blk_2 ──▶ Read blk_3

Time: 1000ms + 1000ms + 1000ms = 3000ms


Parallel Read (Fast):
─────────────────────

Read blk_1 ──▶ ┐
               ├──▶ Done
Read blk_2 ──▶ │

Read blk_3 ──▶ ┘

Time: ~1000ms (network bandwidth permitting)


Implementation:
───────────────

// Pseudocode: read all blocks of a file in parallel
ExecutorService executor = Executors.newFixedThreadPool(10);

for (BlockLocation block : blocks) {
    executor.submit(() -> {
        DataNode dn = selectClosestReplica(block);    // nearest healthy replica
        byte[] data = dn.readBlock(block.blockId);
        buffer.write(block.offset, data);             // place data at the block's offset
    });
}

executor.shutdown();                                  // no more tasks will be submitted
executor.awaitTermination(10, TimeUnit.MINUTES);      // wait for in-flight reads to finish


Benefits:
─────────

✓ Higher aggregate throughput
✓ Better network utilization
✓ Reduced latency for large files

Caveats:
────────

⚠ Don't over-parallelize (network saturation)
⚠ Consider downstream processing capacity
⚠ Memory usage for buffering

Bypassing Network Stack:
Standard Read (Through Network):
────────────────────────────────

Client (same machine as DataNode):

Client ──▶ Network Stack ──▶ DataNode ──▶ Read Disk
       ◀── Network Stack ◀──           ◀──

Overhead:
• Network protocol overhead
• Serialization/deserialization
• Kernel/user space transitions
• Unnecessary copying


Short-Circuit Read (Direct):
────────────────────────────

Client ──▶ Shared Memory ──▶ Direct File Access
       ◀──                ◀──

Requirements:
• Client and DataNode on same machine
• File descriptor passing enabled
• Proper permissions


Configuration:
──────────────

hdfs-site.xml:
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>

<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>


Performance Impact:
───────────────────

Standard read: 100 MB/s
Short-circuit: 500+ MB/s

5x improvement!

Critical for:
→ MapReduce (data locality)
→ Spark (local reads)
→ HBase (co-located RegionServers)

Caching Block Locations:
Without Caching:
────────────────

Every read operation:
Client → NameNode (get block locations)

For 1000 reads:
→ 1000 RPC calls to NameNode
→ NameNode bottleneck
→ Added latency


With Caching:
─────────────

First read:
Client → NameNode (get all block locations for file)
Client caches locations locally

Subsequent reads:
Client uses cached locations
No NameNode contact needed


Cache Invalidation:
───────────────────

Cache expires:
• After configurable timeout
• On checksum verification failure
• If DataNode unreachable

Then: Re-query NameNode


Implementation:
───────────────

// Pseudocode: client-side cache of block locations per file
class BlockLocationCache {
    Map<Path, List<BlockLocation>> cache;

    List<BlockLocation> get(Path file) {
        // Serve from cache while the entry is still fresh
        if (cache.containsKey(file) && !expired(file)) {
            return cache.get(file);
        }

        // Otherwise query the NameNode once and remember the answer
        List<BlockLocation> locations =
            nameNode.getBlockLocations(file);
        cache.put(file, locations);
        return locations;
    }
}


Benefits:
─────────

✓ Reduced NameNode load
✓ Lower read latency
✓ Better scalability

Trade-off:
──────────

⚠ Slightly stale location info
⚠ Handled by trying alternate replicas on failure

Data Flow: Write Operations

Write operations are more complex than reads due to replication.

Write Path

HDFS WRITE OPERATION
────────────────────

Client wants to write: /user/hadoop/output.txt (400MB)

Step 1: Create File
───────────────────

Client ────(1)───▶ NameNode
    "create /user/hadoop/output.txt"

NameNode:
• Checks permissions
• Checks quota
• Adds to namespace
• Does NOT allocate blocks yet

NameNode ────(2)───▶ Client
    "OK, file created"


Step 2: Write Data
──────────────────

Client has 400MB of data → needs 4 blocks (3 × 128MB plus a final 16MB block)

For first block:

2a. Request Block Allocation

    Client ────(3)───▶ NameNode
        "allocate block for /user/hadoop/output.txt"

    NameNode:
    • Generates block ID: blk_1234567890
    • Selects 3 DataNodes: [DN1, DN3, DN4]
    • Considers:
      - Available space
      - Network topology
      - Load balancing
      - Replication policy

    NameNode ────(4)───▶ Client
        Block: blk_1234567890
        Pipeline: DN1 → DN3 → DN4


2b. Establish Pipeline

    Client ────(5)───▶ DN1
        "prepare to receive blk_1234567890"
        "forward to DN3"

    DN1 ────(6)───▶ DN3
        "prepare to receive blk_1234567890"
        "forward to DN4"

    DN3 ────(7)───▶ DN4
        "prepare to receive blk_1234567890"

    DN4 ────(8)───▶ DN3 ──▶ DN1 ──▶ Client
        "ACK: Pipeline established"


2c. Stream Data Through Pipeline

    REPLICATION PIPELINE:

    Client ──▶ DN1 ──▶ DN3 ──▶ DN4

    Data flows in packets (64KB):

    Packet 1:  Client ──▶ DN1 ──▶ DN3 ──▶ DN4
    Packet 2:  Client ──▶ DN1 ──▶ DN3 ──▶ DN4
    ...
    Packet N:  Client ──▶ DN1 ──▶ DN3 ──▶ DN4

    ACKs flow backward:

    DN4 ──▶ DN3 ──▶ DN1 ──▶ Client (ACK packet 1)
    DN4 ──▶ DN3 ──▶ DN1 ──▶ Client (ACK packet 2)


2d. Close Block

    All 128MB written

    Client ────(9)───▶ DN1
        "close block blk_1234567890"

    DN1 → DN3 → DN4 (propagate close)

    DN4 → DN3 → DN1 → Client (ACK closed)

    Each DataNode:
    • Finalizes block on disk
    • Computes final checksums


Step 3: Repeat for Remaining Blocks
────────────────────────────────────

Block 2: Allocate + Pipeline + Stream + Close
Block 3: Allocate + Pipeline + Stream + Close
Block 4: Allocate + Pipeline + Stream + Close (16MB)


Step 4: Close File
──────────────────

Client ────(10)───▶ NameNode
    "close /user/hadoop/output.txt"
    "blocks: [blk_1, blk_2, blk_3, blk_4]"

NameNode:
• Marks file as complete
• Updates metadata
• Block locations confirmed

NameNode ────(11)───▶ Client
    "File closed successfully"


TIMELINE
────────

t=0ms      Create file
t=10ms     Allocate block 1
t=20ms     Establish pipeline
t=30ms     Start streaming
t=1030ms   Block 1 complete (128MB @ ~125 MB/s)
t=1040ms   Allocate block 2
...
t=4000ms   All blocks written
t=4010ms   Close file
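
From the application's point of view, block allocation and pipelining all happen inside the output stream returned by create(). A minimal write sketch; the path and contents are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() registers the file with the NameNode; the stream then
        // allocates blocks and streams packets through the replica pipeline.
        try (FSDataOutputStream out = fs.create(new Path("/user/hadoop/output.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } // close() finalizes the last block and completes the file on the NameNode
        fs.close();
    }
}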

Write Optimizations and Reliability

Why Pipeline Instead of Sequential?:
SEQUENTIAL REPLICATION (Slow):
──────────────────────────────

Client ──▶ DN1 (write 128MB)

       DN1 ──▶ DN3 (copy 128MB)

            DN3 ──▶ DN4 (copy 128MB)

Time: 1s + 1s + 1s = 3 seconds
Bandwidth: Network used sequentially


PIPELINED REPLICATION (Fast):
─────────────────────────────

Client ──▶ DN1 ──▶ DN3 ──▶ DN4

All transfers happen simultaneously:

Packet 1:  Client→DN1  DN1→DN3  DN3→DN4
Packet 2:  Client→DN1  DN1→DN3  DN3→DN4
Packet 3:  Client→DN1  DN1→DN3  DN3→DN4

Time: ~1 second
Bandwidth: All links utilized in parallel


PACKET STRUCTURE
────────────────

Packet (64KB):
┌────────────────────────────────┐
│ Header                         │
│ • Sequence number              │
│ • Block ID                     │
│ • Offset in block              │
├────────────────────────────────┤
│ Data (64KB)                    │
├────────────────────────────────┤
│ Checksums                      │
│ • Per 512 bytes                │
└────────────────────────────────┘


ACK MECHANISM
─────────────

Each DataNode:
1. Receives packet
2. Verifies checksum
3. Writes to disk
4. Forwards to next (if not last)
5. Waits for ACK from downstream
6. Sends ACK upstream

ACK Packet:
┌────────────────────────────────┐
│ Sequence number                │
│ Status: SUCCESS/FAILURE        │
│ Failed replicas (if any)       │
└────────────────────────────────┘
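
A heavily simplified way to picture one DataNode's role per packet: persist locally, forward downstream, and only acknowledge upstream once the downstream ACK arrives. The class and method names below are invented for illustration; real DataNodes overlap these steps and stream ACKs asynchronously.

import java.io.IOException;
import java.io.OutputStream;

// One DataNode's role in the write pipeline, heavily simplified (names are illustrative).
class PipelineStage {
    private final OutputStream localDisk;      // where the block replica is written
    private final PipelineStage downstream;    // next DataNode, or null if last in pipeline

    PipelineStage(OutputStream localDisk, PipelineStage downstream) {
        this.localDisk = localDisk;
        this.downstream = downstream;
    }

    // Returns true (ACK) only if this node and every downstream node succeeded
    boolean receivePacket(byte[] packet) throws IOException {
        localDisk.write(packet);                  // 1. persist the packet locally
        if (downstream == null) {
            return true;                          // last node in pipeline: ack immediately
        }
        return downstream.receivePacket(packet);  // 2. forward, 3. wait for downstream ack
    }
}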


BENEFITS
────────

✓ 3x faster than sequential
✓ Better network utilization
✓ Lower latency for clients
✓ Scales with replication factor

HDFS vs GFS: Key Differences

Block Size

128MB vs 64MB:
  • HDFS default: 128MB
  • GFS: 64MB
  • HDFS evolved with hardware
  • Reduces metadata overhead
  • Better for larger files
  • Configurable per file

Secondary NameNode

Checkpointing:
  • HDFS: Secondary NameNode
  • Not a hot standby (misleading name!)
  • Creates fsimage checkpoints
  • Reduces edit log size
  • Later: Standby NameNode (HA)

File Permissions

POSIX-like Security:
  • HDFS: Full permission model
  • User, group, others
  • Read, write, execute
  • ACLs in later versions
  • GFS: Simpler model

Quotas

Resource Management:
  • HDFS: Directory quotas
  • Space quotas per directory
  • Name quotas (file count)
  • Enables multi-tenancy
  • GFS: No native quotas

Key Takeaways

Remember These Core Insights:
  1. Single NameNode Design: Simplifies metadata management but requires HA for production
  2. Block-Based Storage: 128MB blocks optimize for large files and reduce metadata overhead
  3. Replication for Reliability: 3x replication default survives single rack failures
  4. Rack Awareness: Intelligent replica placement balances fault tolerance and network cost
  5. Pipelined Writes: Streaming data through replica pipeline is 3x faster than sequential
  6. Data Locality: Moving computation to data is fundamental to Hadoop’s efficiency
  7. Metadata Separated from Data: NameNode handles metadata, clients stream from DataNodes
  8. Checksums Everywhere: Data integrity verified at every step—write, read, and background

Interview Questions

Question 1: Explain the HDFS architecture and the roles of its main components.

Expected Answer: HDFS has a master-worker architecture with three main components:
  1. NameNode (Master):
    • Manages file system namespace (directories, files)
    • Stores metadata (file-to-block mapping)
    • Coordinates file operations
    • Monitors DataNode health
    • All metadata in RAM for fast access
  2. DataNodes (Workers):
    • Store actual data blocks (128MB each)
    • Serve read/write requests from clients
    • Send heartbeats to NameNode (every 3s)
    • Report blocks they store (every 6h)
    • Execute replication commands
  3. Clients:
    • Contact NameNode for metadata
    • Read/write data directly from/to DataNodes
    • Maintain consistency with checksums
Key principle: Metadata and data flows are separated. NameNode only handles metadata; clients talk to DataNodes for actual data.

Question 2: How does HDFS detect and recover from DataNode failures?

Expected Answer: HDFS handles DataNode failures through multiple mechanisms.

Detection:
  • DataNodes send heartbeats every 3 seconds
  • If NameNode doesn’t receive heartbeat for 10 minutes, marks DataNode as dead
  • Immediate action triggered
Recovery:
  1. NameNode identifies all blocks on failed DataNode
  2. Checks which blocks are now under-replicated
  3. Prioritizes blocks by replication level:
    • 0 replicas: Critical priority
    • 1 replica: High priority
    • 2 replicas: Normal priority
  4. Selects source (healthy replica) and target (new DataNode)
  5. Commands source to copy blocks to target
  6. Verifies checksums during copy
  7. Updates block location map
Prevention:
  • 3x replication by default
  • Rack-aware placement (survives rack failures)
  • Continuous background verification
  • Automatic re-replication maintains factor
Time to Recovery: Minutes to hours depending on data volume, but the cluster remains operational during recovery.

Question 3: Why is HDFS's pipelined write replication more efficient than sequential replication?

Expected Answer: HDFS uses pipelined replication for writes, which is significantly more efficient than sequential replication.

Pipeline Mechanism:
Client → DN1 → DN2 → DN3 (simultaneous streaming)
Instead of:
Client → DN1, then DN1 → DN2, then DN2 → DN3 (sequential)
How it Works:
  1. Client gets pipeline: [DN1, DN2, DN3] from NameNode
  2. Client establishes connection to DN1
  3. DN1 connects to DN2, DN2 to DN3
  4. Client streams 64KB packets to DN1
  5. DN1 simultaneously:
    • Writes to local disk
    • Forwards packet to DN2
  6. DN2 does same (write + forward to DN3)
  7. ACKs flow backward: DN3→DN2→DN1→Client
Efficiency Benefits:
  • 3x faster: All transfers happen in parallel vs sequential
  • Network utilization: All links active simultaneously
  • Latency: ~1 second for 128MB vs ~3 seconds sequential
  • Scalability: Time independent of replication factor
Failure Handling:
  • If DN2 fails, Client removes it from pipeline
  • Continues with [DN1, DN3]
  • Re-replicates to 3x after write completes
  • No data loss, minimal interruption

Question 4: Why are small files a problem for HDFS, and how can you address them?

Expected Answer: Small files are HDFS's Achilles' heel: each file consumes ~150 bytes of NameNode RAM regardless of file size.

Problem Quantification:
  • 1 million files = 150MB NameNode RAM
  • 100 million files = 15GB RAM
  • 1 billion files = 150GB RAM + slow operations
  • Each file requires NameNode RPC = overhead
Solutions:
  1. HAR Files (Hadoop Archives):
    • Combine many small files into larger archive
    • Like tar for HDFS
    • Reduces NameNode metadata
    • Trade-off: Slower access (need to unpack)
  2. Sequence Files:
    • Container format: key-value pairs
    • Multiple small files → single SequenceFile
    • Splittable for MapReduce
    • Built-in compression
  3. HBase:
    • Store small files as rows in HBase
    • HBase handles small data efficiently
    • Random access support
    • Better than HDFS for this use case
  4. CombineFileInputFormat:
    • MapReduce optimization
    • Combines multiple small files into single split
    • Reduces number of map tasks
    • Better resource utilization
  5. HDFS Federation (Hadoop 3.x):
    • Multiple NameNodes, each managing subset
    • Horizontal scaling of namespace
    • Allows more total files
    • But doesn’t solve per-file overhead
Real-World Approach: Most companies use a combination:
  • Archive old small files into SequenceFiles
  • Use HBase for active small file workloads
  • Educate users to avoid small files
  • Set up quotas and monitoring
  • Consider cloud object storage (S3) for small files
Modern Alternative: Many modern systems (Snowflake, Delta Lake) use cloud object storage (S3, GCS) which handles small files better than HDFS.
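
As a concrete illustration of the SequenceFile approach above, many small local files can be packed into one HDFS file as (name → bytes) records. A sketch using the standard org.apache.hadoop.io.SequenceFile API; the target path is illustrative:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // One SequenceFile holds many small files as (name -> contents) records,
        // so the NameNode tracks a single large file instead of thousands of tiny ones.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/user/hadoop/packed.seq")),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (String local : args) {  // local small files to pack
                byte[] bytes = Files.readAllBytes(Paths.get(local));
                writer.append(new Text(local), new BytesWritable(bytes));
            }
        }
    }
}
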
Question 5: How does rack awareness influence replica placement in HDFS?

Expected Answer: HDFS models the network as a tree and places replicas strategically.

Network Topology:
          Cluster
             |
   +---------+---------+
  Rack1    Rack2    Rack3
   |         |         |
 DN1-3     DN4-6     DN7-9
Distance Calculation:
  • Same node: 0
  • Same rack: 2 (node→rack→node)
  • Different rack: 4 (node→rack→cluster→rack→node)
Default Placement (3 replicas):

If the writer is on DN1 (Rack1):
  1. 1st replica: DN1 (same node as writer)
    • Zero network cost
    • Fast write initiation
  2. 2nd replica: DN4 (different rack, e.g., Rack2)
    • Survives rack failure
    • One off-rack transfer
  3. 3rd replica: DN5 (same rack as 2nd, Rack2)
    • Rack-local transfer (faster)
    • Still survives rack failure
Why This Policy?

Fault Tolerance:
  • Survives any single node failure
  • Survives any single rack failure
  • Does NOT survive 2 rack failures (acceptable trade-off)
Network Cost:
  • Only 1 out of 3 transfers crosses racks
  • 2/3 transfers are rack-local (10x faster)
  • Balances reliability with performance
Read Optimization:
  • Multiple racks → better read parallelism
  • Readers choose closest replica
  • Load distributed across racks
For Replication Factor > 3:
  • 4th and beyond: Random, but max 2 per rack
  • Diminishing returns on rack diversity
  • Focus on load balancing
Configuration: Rack topology specified in:
  • Script: topology.script.file.name
  • Returns rack ID for each host
  • Example: /rack1, /rack2
Impact of Wrong Topology:
  • If NameNode doesn’t know racks, treats all as same rack
  • Loses fault tolerance benefit
  • Survives only node failures, not rack failures
  • Critical to configure correctly!
Real-World Considerations:
  • Cloud environments: Availability zones = racks
  • On-premises: Physical rack layout
  • Network switches as failure domains
  • Balance across power circuits

Up Next

In Chapter 3: MapReduce Framework, we’ll explore:
  • The MapReduce programming model in depth
  • Job execution flow and task lifecycle
  • Shuffle and sort mechanisms
  • Optimization techniques for MapReduce jobs
  • How MapReduce leverages HDFS data locality
We’ve mastered HDFS storage. Next, we’ll learn how to process that data efficiently with MapReduce.