
HDFS Architecture & Internals

Module Duration: 4-5 hours
Hands-on Labs: 6 practical exercises
Prerequisites: Understanding of GFS concepts from Module 1

Introduction

HDFS (Hadoop Distributed File System) is the storage foundation of the Hadoop ecosystem. Inspired by Google’s GFS, HDFS provides:
  • Fault-tolerant storage across commodity hardware
  • High throughput for large datasets
  • Scalability to petabytes and beyond
  • Data locality for efficient processing
In this module, we’ll go beyond theory to understand implementation details and write real code.

HDFS Architecture Overview

The Core Components

┌──────────────────────────────────────────────────────────────┐
│                     HDFS CLUSTER                             │
│                                                              │
│  ┌────────────────────────────────────────┐                 │
│  │         NameNode (Master)              │                 │
│  │                                        │                 │
│  │  • Metadata storage (in-memory)        │                 │
│  │  • Namespace operations                │                 │
│  │  • Block mapping                       │                 │
│  │  • Replication policy enforcement      │                 │
│  │  • Heartbeat & Block reports           │                 │
│  └────────────┬───────────────────────────┘                 │
│               │                                             │
│               │ Heartbeat (3 sec) + Block Reports          │
│               │                                             │
│  ┌────────────┼────────────┬───────────────┬──────────┐    │
│  │            │            │               │          │    │
│  ▼            ▼            ▼               ▼          ▼    │
│ ┌──────────┐┌──────────┐┌──────────┐┌──────────┐┌────────┐│
│ │DataNode 1││DataNode 2││DataNode 3││DataNode 4││DataNode││
│ │          ││          ││          ││          ││   N    ││
│ │ Stores:  ││ Stores:  ││ Stores:  ││ Stores:  ││        ││
│ │ Block 1  ││ Block 1  ││ Block 2  ││ Block 2  ││  ...   ││
│ │ Block 3  ││ Block 2  ││ Block 3  ││ Block 4  ││        ││
│ │ Block 5  ││ Block 4  ││ Block 5  ││ Block 6  ││        ││
│ └──────────┘└──────────┘└──────────┘└──────────┘└────────┘│
│                                                              │
│  ┌─────────────────────────────────┐                        │
│  │  Secondary NameNode             │                        │
│  │  (Checkpoint creation)          │                        │
│  └─────────────────────────────────┘                        │
└──────────────────────────────────────────────────────────────┘

         ▲                           ▲
         │                           │
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ Client  │                 │ Client  │
    │ (Read/  │                 │ (Read/  │
    │  Write) │                 │  Write) │
    └─────────┘                 └─────────┘

NameNode: The Brain

The NameNode is the single master that manages:
  1. File System Namespace:
    • Directory tree
    • File metadata (permissions, timestamps)
    • Mapping: filename → block IDs
  2. Block Management:
    • Block locations (which DataNodes store each block)
    • Replication factor enforcement
    • Block creation, deletion, and re-replication
  3. Heartbeat Management:
    • Monitors DataNode health (3-second heartbeats)
    • Detects failures and initiates recovery
Critical Design Choice: All metadata stored in RAM
  • Pros: Fast metadata operations (no disk I/O)
  • Cons: Limits namespace size (1GB RAM ≈ 1 million blocks)
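
A back-of-the-envelope sizing sketch based on that rule of thumb (illustrative only; real memory use depends on the mix of file, directory, and block objects):
// Rough NameNode heap estimate from the "1GB RAM ≈ 1 million blocks" rule of thumb.
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long files = 50_000_000L;              // hypothetical: 50 million files
        double avgBlocksPerFile = 1.5;         // hypothetical average
        long blocks = (long) (files * avgBlocksPerFile);

        double heapGb = blocks / 1_000_000.0;  // ~1GB per million blocks
        System.out.printf("%d blocks -> roughly %.0f GB of NameNode heap%n", blocks, heapGb);
    }
}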

DataNodes: The Workers

DataNodes:
  • Store actual data blocks on local disk
  • Send heartbeats to NameNode every 3 seconds
  • Report block lists periodically (block reports)
  • Serve read/write requests from clients
  • Execute replication commands from NameNode
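
Both the heartbeat and block-report intervals are configurable. A minimal sketch of the relevant keys with their usual defaults (these normally live in hdfs-site.xml rather than in client code):
import org.apache.hadoop.conf.Configuration;

public class DataNodeIntervals {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Heartbeat interval in seconds (default: 3)
        conf.setLong("dfs.heartbeat.interval", 3);

        // Full block report interval in milliseconds (default: 6 hours)
        conf.setLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);

        System.out.println("heartbeat = " + conf.get("dfs.heartbeat.interval") + " s");
        System.out.println("block report = " + conf.get("dfs.blockreport.intervalMsec") + " ms");
    }
}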

Secondary NameNode: The Misconception

Common Misconception: Secondary NameNode is NOT a backup or standby!
Actual Role: Performs periodic checkpointing
  • Merges edit logs into FSImage
  • Reduces NameNode restart time
  • Does NOT take over if NameNode fails (use HA NameNode for that)
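
How often that checkpoint runs is controlled by two settings. A sketch with their usual defaults (normally set in hdfs-site.xml):
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Checkpoint at least once per hour (value in seconds)...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);

        // ...or sooner, once this many un-checkpointed edit-log transactions accumulate
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);

        System.out.println("period = " + conf.get("dfs.namenode.checkpoint.period") + " s, " +
            "txns = " + conf.get("dfs.namenode.checkpoint.txns"));
    }
}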

Block Storage Deep Dive

What is a Block?

HDFS splits files into fixed-size blocks:
  • Default size: 128MB (configurable: 64MB, 256MB, etc.)
  • Why so large? Fewer, larger blocks mean less NameNode metadata and longer sequential reads
Example:
File: access.log (350MB)

Split into blocks:
┌─────────────┬─────────────┬─────────────┐
│  Block 1    │  Block 2    │  Block 3    │
│  (128MB)    │  (128MB)    │  (94MB)     │
│  ID: 1001   │  ID: 1002   │  ID: 1003   │
└─────────────┴─────────────┴─────────────┘
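
The split is plain integer arithmetic; a small sketch reproducing the numbers above:
public class BlockSplit {
    public static void main(String[] args) {
        long fileSize  = 350L * 1024 * 1024;   // 350MB file
        long blockSize = 128L * 1024 * 1024;   // 128MB default block size

        long fullBlocks  = fileSize / blockSize;                    // 2
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024);  // 94MB
        long numBlocks   = fullBlocks + (lastBlockMb > 0 ? 1 : 0);  // 3

        System.out.println(numBlocks + " blocks, last block: " + lastBlockMb + "MB");
    }
}
Note that the final block occupies only 94MB on disk; HDFS does not pad the last block out to the full block size.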

Replication Strategy

Each block is replicated (default: 3 copies) for fault tolerance.

Replica Placement Algorithm (Rack-Aware):
Block X needs 3 replicas:

Replica 1: Local node (where client writes from)
           OR random node (if client outside cluster)

Replica 2: Different node on SAME rack as Replica 1
           (same rack-switch, low latency)

Replica 3: Different node on DIFFERENT rack
           (survives rack failure)

┌──────────────────────────────────────────────┐
│                  Network                     │
└────────┬─────────────────────────┬───────────┘
         │                         │
    ┌────▼─────┐              ┌────▼─────┐
    │ Rack 1   │              │ Rack 2   │
    │ Switch   │              │ Switch   │
    └────┬─────┘              └────┬─────┘
         │                         │
    ┌────┼─────┬─────┐        ┌────┴─────┐
    │    │     │     │        │          │
    ▼    ▼     ▼     ▼        ▼          ▼
   DN1  DN2   DN3   DN4      DN5        DN6
    ↑    ↑                    ↑
    │    │                    │
 Replica Replica           Replica
   1      2                  3
Why This Strategy?
  • Fault tolerance: Survives both node and rack failures
  • Write performance: Two replicas sit on the same rack, keeping the pipeline hop short
  • Read optimization: Multiple replicas allow load balancing across nodes
  • Network efficiency: Only one pipeline transfer per block crosses a rack boundary, instead of two

Block Placement Code Example

Here’s how HDFS decides block placement (simplified from actual code):
public class BlockPlacementPolicyDefault {

    /**
     * Choose target DataNodes for block replicas
     * @param numOfReplicas Number of replicas to create
     * @param writer Client writing the block (for locality)
     * @param excludedNodes Nodes to exclude
     * @param blocksize Size of the block
     * @return Array of chosen DataNodes
     */
    public DatanodeDescriptor[] chooseTarget(
            int numOfReplicas,
            Node writer,
            List<DatanodeDescriptor> excludedNodes,
            long blocksize) {

        if (numOfReplicas == 0 || clusterMap.getNumOfLeaves() == 0) {
            return new DatanodeDescriptor[0];
        }

        List<DatanodeDescriptor> results = new ArrayList<>();

        // First replica: local to writer if possible
        DatanodeDescriptor firstNode = chooseFirstReplica(writer);
        results.add(firstNode);
        excludedNodes.add(firstNode);

        if (numOfReplicas == 1) {
            return results.toArray(new DatanodeDescriptor[1]);
        }

        // Second replica: different node, same rack as first
        DatanodeDescriptor secondNode = chooseLocalRack(
            firstNode, excludedNodes, blocksize);
        results.add(secondNode);
        excludedNodes.add(secondNode);

        if (numOfReplicas == 2) {
            return results.toArray(new DatanodeDescriptor[2]);
        }

        // Third replica: different rack
        DatanodeDescriptor thirdNode = chooseRemoteRack(
            1, results, excludedNodes, blocksize);
        results.add(thirdNode);
        excludedNodes.add(thirdNode);

        // Additional replicas: distribute randomly
        for (int i = 3; i < numOfReplicas; i++) {
            DatanodeDescriptor nextNode = chooseRandom(
                excludedNodes, blocksize);
            results.add(nextNode);
            excludedNodes.add(nextNode);
        }

        return results.toArray(new DatanodeDescriptor[numOfReplicas]);
    }

    private DatanodeDescriptor chooseFirstReplica(Node writer) {
        // If writer is a DataNode, use it
        if (writer instanceof DatanodeDescriptor) {
            return (DatanodeDescriptor) writer;
        }
        // Otherwise, choose random node
        return chooseRandom(new ArrayList<>(), 0);
    }

    private DatanodeDescriptor chooseLocalRack(
            DatanodeDescriptor node,
            List<DatanodeDescriptor> excluded,
            long blocksize) {

        String rackName = node.getNetworkLocation();
        List<DatanodeDescriptor> sameRackNodes =
            clusterMap.getDatanodesInRack(rackName);

        // Remove excluded nodes
        sameRackNodes.removeAll(excluded);

        // Choose node with most available space
        return chooseBestNode(sameRackNodes, blocksize);
    }

    private DatanodeDescriptor chooseBestNode(
            List<DatanodeDescriptor> nodes,
            long blocksize) {

        DatanodeDescriptor best = null;
        long maxSpace = 0;

        for (DatanodeDescriptor node : nodes) {
            long available = node.getRemaining();
            if (available > maxSpace && available >= blocksize) {
                maxSpace = available;
                best = node;
            }
        }

        return best;
    }
}

HDFS Read Operation

Let’s trace a complete read operation with detailed code.

The Read Flow

Step 1: Client → NameNode
   "Where are the blocks for /data/logs.txt?"

Step 2: NameNode → Client
   Block 1: [DN1, DN2, DN5]
   Block 2: [DN2, DN3, DN6]
   Block 3: [DN3, DN4, DN1]
   (sorted by proximity to client)

Step 3: Client → DN1 (closest DataNode)
   "Send me Block 1"

Step 4: DN1 → Client
   [Streaming Block 1 data...]

Step 5: Client → DN2 (closest for Block 2)
   "Send me Block 2"

... and so on

Java Client Code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.IOUtils;

public class HDFSReader {

    public static void main(String[] args) throws Exception {
        // Configuration object loads hdfs-site.xml, core-site.xml
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // Get FileSystem instance (connects to NameNode)
        FileSystem fs = FileSystem.get(conf);

        // Path to read
        Path filePath = new Path("/user/hadoop/data/access.log");

        // Open input stream
        // Internally: Contacts NameNode for block locations
        FSDataInputStream in = fs.open(filePath);

        try {
            // Read data
            // Internally: Connects to DataNodes sequentially
            IOUtils.copyBytes(in, System.out, 4096, false);

            // Seek to specific position (if needed)
            in.seek(1024000); // Jump to byte offset 1024000
            byte[] buffer = new byte[1024];
            in.read(buffer);

        } finally {
            IOUtils.closeStream(in);
        }

        fs.close();
    }
}

Behind the Scenes: DFSInputStream

Here’s what happens inside fs.open():
public class DFSInputStream extends FSInputStream {

    private DFSClient dfsClient;
    private String src; // File path
    private LocatedBlocks locatedBlocks; // Block locations from NN
    private long fileLength;
    private long pos = 0; // Current position in file

    public DFSInputStream(DFSClient dfsClient, String src) {
        this.dfsClient = dfsClient;
        this.src = src;

        // Contact NameNode to get block locations
        openInfo();
    }

    private void openInfo() {
        // RPC call to NameNode: fetch locations for the whole file
        // (simplified; fileLength is not known yet, so request everything)
        LocatedBlocks newInfo =
            dfsClient.namenode.getBlockLocations(src, 0, Long.MAX_VALUE);

        this.locatedBlocks = newInfo;
        this.fileLength = newInfo.getFileLength();
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Find which block contains current position
        LocatedBlock targetBlock = getBlockAt(pos);

        // Choose best DataNode (closest, most available)
        DatanodeInfo chosenNode = getBestNode(targetBlock);

        // Connect to DataNode and read
        BlockReader reader = getBlockReader(chosenNode, targetBlock);

        int bytesRead = reader.read(buf, off, len);
        if (bytesRead > 0) {
            pos += bytesRead; // don't advance past EOF (read returns -1)
        }

        return bytesRead;
    }

    private DatanodeInfo getBestNode(LocatedBlock block) {
        DatanodeInfo[] nodes = block.getLocations();

        // Sort by network distance (local node preferred)
        Arrays.sort(nodes, new Comparator<DatanodeInfo>() {
            public int compare(DatanodeInfo a, DatanodeInfo b) {
                int distA = networkDistance(clientNode, a);
                int distB = networkDistance(clientNode, b);
                return distA - distB;
            }
        });

        return nodes[0]; // Closest node
    }

    private int networkDistance(Node a, Node b) {
        // Same node: distance 0
        // Same rack: distance 2
        // Different rack: distance 4
        String pathA = a.getNetworkLocation();
        String pathB = b.getNetworkLocation();

        if (a.equals(b)) return 0;
        if (pathA.equals(pathB)) return 2;
        return 4;
    }
}
Key Optimizations:
  • Data locality: Reads from closest replica
  • Streaming: Reads one block at a time (doesn’t load entire file)
  • Retry logic: Switches to another replica on failure
  • Checksum verification: Validates data integrity
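
The last item, checksum verification, is also visible through the client API. A minimal sketch (reusing the example path from the reader above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/data/access.log");

        // Per-read checksum verification is on by default; it can be toggled per FileSystem
        fs.setVerifyChecksum(true);

        // Whole-file checksum, derived from the per-block checksums stored on DataNodes
        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum != null) {
            System.out.println(checksum.getAlgorithmName() + ": " + checksum);
        }

        fs.close();
    }
}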

HDFS Write Operation

Writing is more complex due to replication pipeline.

The Write Flow

Step 1: Client → NameNode
   "I want to create /data/new.log"

Step 2: NameNode checks permissions, quota
   Creates file entry in namespace
   Returns: "Proceed"

Step 3: Client buffers first 64KB of data

Step 4: Client → NameNode
   "Allocate a block for /data/new.log"

Step 5: NameNode → Client
   "Block ID 1234, write to [DN1, DN2, DN5]"

Step 6: Client → DN1
   "Store block 1234, forward to DN2"
   [Streams data to DN1]

Step 7: DN1 → DN2
   "Store block 1234, forward to DN5"
   [Streams data to DN2]

Step 8: DN2 → DN5
   "Store block 1234"
   [Streams data to DN5]

Step 9: DN5 → DN2 → DN1 → Client
   "ACK: Block written successfully"

Repeat Steps 4-9 for each block

Write Pipeline Architecture

Client

  │ (1) Write packet

┌────────────┐
│  DataNode  │───(2) Forward packet────┐
│     1      │                         │
│            │◄──(4) ACK──────────┐    │
└────────────┘                    │    │
                                  │    ▼
                              ┌────────────┐
                              │  DataNode  │──(3) Forward──┐
                              │     2      │               │
                              │            │◄──(5) ACK──┐  │
                              └────────────┘            │  │
                                                        │  ▼
                                                    ┌────────────┐
                                                    │  DataNode  │
                                                    │     3      │
                                                    │            │
                                                    └────────────┘

                                                         │ (6) ACK

                                               Pipeline complete

Java Write Code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

public class HDFSWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path filePath = new Path("/user/hadoop/output/results.txt");

        // Create file (overwrites if exists)
        FSDataOutputStream out = fs.create(filePath);

        // Or append to existing file
        // FSDataOutputStream out = fs.append(filePath);

        try {
            String data = "Processing complete\n";
            out.write(data.getBytes("UTF-8"));

            // Flush to ensure data written
            out.flush();

            // Optional: force the data to disk on the DataNodes
            out.hsync(); // hflush() is the lighter alternative: visible to new readers, not necessarily on disk

        } finally {
            out.close(); // Flushes remaining data and completes the file on the NameNode
        }

        fs.close();
    }
}

Advanced: Custom Replication Factor

public class CustomReplication {

    public static void writeWithReplication(
            String hdfsPath,
            byte[] data,
            short replicationFactor) throws Exception {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        // Specify replication factor at creation
        FSDataOutputStream out = fs.create(
            path,
            true,              // overwrite
            4096,              // buffer size
            replicationFactor, // custom replication
            134217728          // block size (128MB)
        );

        out.write(data);
        out.close();
        fs.close();
    }

    public static void changeReplication(
            String hdfsPath,
            short newReplication) throws Exception {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        // Change replication factor of existing file
        boolean success = fs.setReplication(path, newReplication);

        if (success) {
            System.out.println("Replication changed to " + newReplication);
        }

        fs.close();
    }
}

Fault Tolerance Mechanisms

DataNode Failure

Detection:
NameNode expects heartbeat every 3 seconds
If no heartbeat for ~10 minutes (10.5 minutes with default settings):
  → Mark DataNode as dead
  → Identify under-replicated blocks
  → Schedule re-replication
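
The ~10-minute figure is derived from two settings rather than configured directly; with default values the arithmetic works out to 10.5 minutes:
public class DeadNodeTimeout {
    public static void main(String[] args) {
        long recheckIntervalMs   = 300_000; // dfs.namenode.heartbeat.recheck-interval (default 5 min)
        long heartbeatIntervalMs = 3_000;   // dfs.heartbeat.interval (default 3 s)

        // NameNode declares a DataNode dead after:
        // 2 * recheck interval + 10 * heartbeat interval
        long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;

        System.out.println("Dead-node timeout: " + timeoutMs / 60000.0 + " minutes"); // 10.5
    }
}
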
Recovery Process:
public class NameNodeReplication {

    private void handleDeadDataNode(DatanodeDescriptor deadNode) {
        // Get all blocks stored on dead node
        List<Block> affectedBlocks = deadNode.getBlockList();

        for (Block block : affectedBlocks) {
            // Check current replication level
            int currentReplicas = countReplicas(block);
            int targetReplicas = block.getReplication();

            if (currentReplicas < targetReplicas) {
                // Under-replicated! Schedule re-replication
                addToReplicationQueue(block,
                    targetReplicas - currentReplicas);
            }
        }

        // Remove dead node from cluster
        removeDataNode(deadNode);
    }

    private void processReplicationQueue() {
        while (!replicationQueue.isEmpty()) {
            Block block = replicationQueue.poll();

            // Find nodes with existing replicas
            List<DatanodeDescriptor> sources =
                getDataNodesForBlock(block);

            // Choose target for new replica
            DatanodeDescriptor target =
                chooseTargetDataNode(sources);

            // Command source to replicate to target
            sources.get(0).addBlockToReplicate(block, target);
        }
    }
}

NameNode Failure (High Availability)

Problem: The NameNode is a single point of failure.
Solution: HA NameNode with ZooKeeper.
┌─────────────────────────────────────────────┐
│              ZooKeeper Cluster              │
│     (Coordination & Leader Election)        │
└──────────────┬──────────────────────────────┘

        ┌──────┴──────┐
        │             │
        ▼             ▼
┌──────────────┐ ┌──────────────┐
│   Active     │ │   Standby    │
│  NameNode    │ │  NameNode    │
│              │ │              │
│ (Serves      │ │ (Tails edit  │
│  requests)   │ │  log, ready  │
│              │ │  to takeover)│
└──────┬───────┘ └──────┬───────┘
       │                │
       │                │
       ▼                ▼
┌──────────────────────────────┐
│   Shared Edit Log            │
│   (on JournalNodes or NFS)   │
└──────────────────────────────┘
Automatic Failover:
  1. Active NameNode crashes
  2. ZooKeeper detects failure
  3. Standby NameNode promoted to Active
  4. Clients automatically redirect to new Active
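
Step 4 works because HA clients are configured with a logical nameservice and a failover proxy provider instead of a single NameNode host. A sketch of the client-side properties (the nameservice name and hosts are examples):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HAClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical nameservice instead of a single NameNode host
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:9000");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:9000");

        // The proxy provider fails over to whichever NameNode is currently Active
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf); // resolves to the Active NameNode
        System.out.println("Connected to " + fs.getUri());
        fs.close();
    }
}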

HDFS Configuration

Essential Configuration Files

core-site.xml (Cluster-wide settings):
<configuration>
    <!-- NameNode URI -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:9000</value>
    </property>

    <!-- Temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/hadoop/tmp</value>
    </property>
</configuration>
hdfs-site.xml (HDFS-specific):
<configuration>
    <!-- Replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>

    <!-- Block size (128MB) -->
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>

    <!-- NameNode data directory -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///var/hadoop/namenode</value>
    </property>

    <!-- DataNode data directories -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///disk1/hadoop,file:///disk2/hadoop</value>
    </property>

    <!-- Permissions (disable for testing only!) -->
    <property>
        <name>dfs.permissions.enabled</name>
        <value>true</value>
    </property>
</configuration>

Hands-on Labs

Lab 1: HDFS CLI Operations

# Create directory
hdfs dfs -mkdir -p /user/hadoop/input

# Upload file
hdfs dfs -put localfile.txt /user/hadoop/input/

# List files with block info
hdfs fsck /user/hadoop/input/localfile.txt -files -blocks -locations

# Check file stats
hdfs dfs -stat "%b %r %n" /user/hadoop/input/localfile.txt
# Output: file size in bytes, replication factor, filename

# Change replication
hdfs dfs -setrep -w 5 /user/hadoop/input/localfile.txt

# View file content
hdfs dfs -cat /user/hadoop/input/localfile.txt

# Get file to local
hdfs dfs -get /user/hadoop/input/localfile.txt ./downloaded.txt

# Check cluster health
hdfs dfsadmin -report

# Balance cluster
hdfs balancer -threshold 10

Lab 2: Programmatic File Operations

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HDFSOperations {

    private FileSystem fs;

    public HDFSOperations() throws Exception {
        Configuration conf = new Configuration();
        this.fs = FileSystem.get(conf);
    }

    // Create directory
    public void createDirectory(String path) throws Exception {
        Path dirPath = new Path(path);
        fs.mkdirs(dirPath);
        System.out.println("Created: " + path);
    }

    // List directory contents
    public void listDirectory(String path) throws Exception {
        Path dirPath = new Path(path);
        FileStatus[] statuses = fs.listStatus(dirPath);

        for (FileStatus status : statuses) {
            System.out.printf("%s %10d %s\n",
                status.isDirectory() ? "DIR " : "FILE",
                status.getLen(),
                status.getPath().getName()
            );
        }
    }

    // Get block locations for a file
    public void showBlockLocations(String path) throws Exception {
        Path filePath = new Path(path);
        FileStatus status = fs.getFileStatus(filePath);
        BlockLocation[] blocks = fs.getFileBlockLocations(
            status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            BlockLocation block = blocks[i];
            System.out.println("Block " + i + ":");
            System.out.println("  Hosts: " +
                String.join(", ", block.getHosts()));
            System.out.println("  Offset: " + block.getOffset());
            System.out.println("  Length: " + block.getLength());
        }
    }

    // Delete file/directory
    public void delete(String path, boolean recursive) throws Exception {
        Path targetPath = new Path(path);
        boolean success = fs.delete(targetPath, recursive);
        System.out.println("Deleted: " + success);
    }

    // Copy file within HDFS
    public void copyFile(String src, String dst) throws Exception {
        Path srcPath = new Path(src);
        Path dstPath = new Path(dst);
        FileUtil.copy(fs, srcPath, fs, dstPath, false, fs.getConf());
    }

    public void close() throws Exception {
        fs.close();
    }
}

Lab 3: Monitoring HDFS Health

import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class HDFSMonitor {

    public static void clusterHealth() throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(conf);

        // Get cluster statistics
        FsStatus status = dfs.getStatus();
        long capacity = status.getCapacity();
        long used = status.getUsed();
        long remaining = status.getRemaining();

        System.out.println("HDFS Cluster Status:");
        System.out.println("  Capacity:  " + formatSize(capacity));
        System.out.println("  Used:      " + formatSize(used) +
            " (" + (used * 100 / capacity) + "%)");
        System.out.println("  Remaining: " + formatSize(remaining));

        // Get DataNode information
        DatanodeInfo[] datanodes = dfs.getDataNodeStats();
        System.out.println("\nDataNodes: " + datanodes.length);

        for (DatanodeInfo node : datanodes) {
            System.out.println("  " + node.getHostName());
            System.out.println("    Capacity:  " +
                formatSize(node.getCapacity()));
            System.out.println("    Used:      " +
                formatSize(node.getDfsUsed()));
            System.out.println("    Remaining: " +
                formatSize(node.getRemaining()));
            System.out.println("    State:     " +
                node.getAdminState());
        }

        dfs.close();
    }

    private static String formatSize(long bytes) {
        long tb = 1024L * 1024 * 1024 * 1024;
        long gb = 1024L * 1024 * 1024;
        long mb = 1024L * 1024;

        if (bytes >= tb) return (bytes / tb) + " TB";
        if (bytes >= gb) return (bytes / gb) + " GB";
        if (bytes >= mb) return (bytes / mb) + " MB";
        return bytes + " B";
    }
}

Performance Tuning

Optimizing Block Size

Rule of Thumb: Choose a block size large enough to keep NameNode metadata small, but small enough to preserve parallelism (roughly one map task per block)
Scenario: Processing 1TB of log files

Option 1: 64MB blocks
  → 16,384 blocks
  → More metadata
  → More map tasks

Option 2: 128MB blocks (default)
  → 8,192 blocks
  → Balanced

Option 3: 256MB blocks
  → 4,096 blocks
  → Less metadata
  → Fewer map tasks (less parallelism)

Choice: 128MB for balanced parallelism
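
The block counts above fall straight out of dividing the dataset size by the block size; a quick sketch:
public class BlockCountComparison {
    public static void main(String[] args) {
        long dataset = 1024L * 1024 * 1024 * 1024;   // 1TB of log files
        long[] blockSizesMb = {64, 128, 256};

        for (long mb : blockSizesMb) {
            long blocks = dataset / (mb * 1024 * 1024);  // 16384, 8192, 4096
            System.out.println(mb + "MB blocks -> " + blocks +
                " blocks (roughly one map task per block by default)");
        }
    }
}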

Short-Circuit Reads

When client and DataNode are on same machine, skip network:
<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>

<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
Performance Impact: 30-50% faster local reads

Common Issues & Troubleshooting

NameNode Out of Memory

Symptoms: NameNode crashes, OutOfMemoryError in the logs
Cause: Too many files/blocks for the allocated heap
Solutions:
  1. Increase the NameNode heap: -Xmx16g to -Xmx32g
  2. Enable HDFS Federation (multiple NameNodes)
  3. Merge small files into larger ones
  4. Clean up old/unused data
Prevention: Monitor namespace size and set quotas (sketched below)
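
One way to apply the "set quotas" advice programmatically, assuming the DistributedFileSystem.setQuota API (the directory and limits are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class NamespaceQuota {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Cap the directory at 1 million names (files + directories) and
        // 10TB of raw storage (replicas count against the space quota)
        long namespaceQuota = 1_000_000L;
        long storageQuota   = 10L * 1024 * 1024 * 1024 * 1024;

        dfs.setQuota(new Path("/user/hadoop/input"), namespaceQuota, storageQuota);

        dfs.close();
    }
}
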
Under-Replicated Blocks

Symptoms: hdfs fsck shows under-replicated blocks
Causes:
  • DataNode failures
  • Network issues
  • Disk space exhaustion
Solutions:
# Check under-replicated blocks
hdfs fsck / | grep "Under-replicated"

# Identify problematic DataNodes
hdfs dfsadmin -report

# Inspect replication state (metasave dumps block and replication-queue info to a file)
hdfs dfsadmin -metasave /tmp/meta.txt

# If urgent, increase replication temporarily
hdfs dfs -setrep -R 4 /critical/data

DataNode Fails to Start

Check the logs: /var/log/hadoop/hadoop-hdfs-datanode-*.log
Common causes:
  1. Incompatible cluster ID:
    Error: Incompatible clusterIDs
    Solution: Delete DataNode data, rejoin cluster
    
  2. Port already in use:
    Error: Address already in use: 50010
    Solution: Kill process or change port in hdfs-site.xml
    
  3. Permission issues:
    Error: Permission denied on /var/hadoop/datanode
    Solution: chown -R hdfs:hadoop /var/hadoop/datanode
    

Interview Focus

Key Concepts to Master:
  • Explain NameNode vs DataNode responsibilities
  • Describe block replication algorithm and rack awareness
  • Walk through read/write data flows
  • Discuss Single NameNode limitations and HA solutions
  • Compare HDFS to other storage (S3, traditional file systems)
Sample Questions:
  1. “Why can’t HDFS handle lots of small files efficiently?”
    • Answer: Each file/block = metadata entry in NameNode RAM
    • 1 million 1KB files = 1 million blocks vs 8 files of 128MB = 8 blocks
  2. “How does HDFS ensure data locality for MapReduce?”
    • Answer: JobTracker queries NameNode for block locations, schedules map tasks on DataNodes storing blocks
  3. “What happens if a client crashes during a write?”
    • Answer: File remains in incomplete state, lease eventually expires, NameNode closes file

What’s Next?

You now understand the HDFS storage layer. Next, learn how to process this data with MapReduce!

Module 3: MapReduce Programming Model

Master distributed data processing with MapReduce patterns and hands-on coding