
HDFS Architecture & Internals

Module Duration: 4-5 hours
Hands-on Labs: 6 practical exercises
Prerequisites: Understanding of GFS concepts from Module 1

Introduction

HDFS (Hadoop Distributed File System) is the storage foundation of the Hadoop ecosystem. Inspired by Google’s GFS, HDFS provides:
  • Fault-tolerant storage across commodity hardware
  • High throughput for large datasets
  • Scalability to petabytes and beyond
  • Data locality for efficient processing
In this module, we’ll go beyond theory to understand implementation details and write real code.

HDFS Architecture Overview

The Core Components

┌──────────────────────────────────────────────────────────────┐
│                     HDFS CLUSTER                             │
│                                                              │
│  ┌────────────────────────────────────────┐                 │
│  │         NameNode (Master)              │                 │
│  │                                        │                 │
│  │  • Metadata storage (in-memory)        │                 │
│  │  • Namespace operations                │                 │
│  │  • Block mapping                       │                 │
│  │  • Replication policy enforcement      │                 │
│  │  • Heartbeat & Block reports           │                 │
│  └────────────┬───────────────────────────┘                 │
│               │                                             │
│               │ Heartbeat (3 sec) + Block Reports          │
│               │                                             │
│  ┌────────────┼────────────┬───────────────┬──────────┐    │
│  │            │            │               │          │    │
│  ▼            ▼            ▼               ▼          ▼    │
│ ┌──────────┐┌──────────┐┌──────────┐┌──────────┐┌────────┐│
│ │DataNode 1││DataNode 2││DataNode 3││DataNode 4││DataNode││
│ │          ││          ││          ││          ││   N    ││
│ │ Stores:  ││ Stores:  ││ Stores:  ││ Stores:  ││        ││
│ │ Block 1  ││ Block 1  ││ Block 2  ││ Block 2  ││  ...   ││
│ │ Block 3  ││ Block 2  ││ Block 3  ││ Block 4  ││        ││
│ │ Block 5  ││ Block 4  ││ Block 5  ││ Block 6  ││        ││
│ └──────────┘└──────────┘└──────────┘└──────────┘└────────┘│
│                                                              │
│  ┌─────────────────────────────────┐                        │
│  │  Secondary NameNode             │                        │
│  │  (Checkpoint creation)          │                        │
│  └─────────────────────────────────┘                        │
└──────────────────────────────────────────────────────────────┘

         ▲                           ▲
         │                           │
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ Client  │                 │ Client  │
    │ (Read/  │                 │ (Read/  │
    │  Write) │                 │  Write) │
    └─────────┘                 └─────────┘

NameNode: The Brain

The NameNode is the single master that manages:
  1. File System Namespace:
    • Directory tree
    • File metadata (permissions, timestamps)
    • Mapping: filename → block IDs
  2. Block Management:
    • Block locations (which DataNodes store each block)
    • Replication factor enforcement
    • Block creation, deletion, and re-replication
  3. Heartbeat Management:
    • Monitors DataNode health (3-second heartbeats)
    • Detects failures and initiates recovery
Critical Design Choice: All metadata stored in RAM
  • Pros: Fast metadata operations (no disk I/O)
  • Cons: Limits namespace size (1GB RAM ≈ 1 million blocks)
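
A back-of-the-envelope sizing sketch based on that rule of thumb (illustrative only; real memory use depends on the mix of file, directory, and block objects):
// Rough NameNode heap estimate from the "1GB RAM ≈ 1 million blocks" rule of thumb.
public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        long files = 50_000_000L;              // hypothetical: 50 million files
        double avgBlocksPerFile = 1.5;         // hypothetical average
        long blocks = (long) (files * avgBlocksPerFile);

        double heapGb = blocks / 1_000_000.0;  // ~1GB per million blocks
        System.out.printf("%d blocks -> roughly %.0f GB of NameNode heap%n", blocks, heapGb);
    }
}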

DataNodes: The Workers

DataNodes:
  • Store actual data blocks on local disk
  • Send heartbeats to NameNode every 3 seconds
  • Report block lists periodically (block reports)
  • Serve read/write requests from clients
  • Execute replication commands from NameNode
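
Both the heartbeat and block-report intervals are configurable. A minimal sketch of the relevant keys with their usual defaults (these normally live in hdfs-site.xml rather than in client code):
import org.apache.hadoop.conf.Configuration;

public class DataNodeIntervals {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Heartbeat interval in seconds (default: 3)
        conf.setLong("dfs.heartbeat.interval", 3);

        // Full block report interval in milliseconds (default: 6 hours)
        conf.setLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);

        System.out.println("heartbeat = " + conf.get("dfs.heartbeat.interval") + " s");
        System.out.println("block report = " + conf.get("dfs.blockreport.intervalMsec") + " ms");
    }
}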

Secondary NameNode: The Misconception

Common Misconception: Secondary NameNode is NOT a backup or standby!
Actual Role: Performs periodic checkpointing
  • Merges edit logs into FSImage
  • Reduces NameNode restart time
  • Does NOT take over if NameNode fails (use HA NameNode for that)
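
How often that checkpoint runs is controlled by two settings. A sketch with their usual defaults (normally set in hdfs-site.xml):
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Checkpoint at least once per hour (value in seconds)...
        conf.setLong("dfs.namenode.checkpoint.period", 3600);

        // ...or sooner, once this many un-checkpointed edit-log transactions accumulate
        conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000);

        System.out.println("period = " + conf.get("dfs.namenode.checkpoint.period") + " s, " +
            "txns = " + conf.get("dfs.namenode.checkpoint.txns"));
    }
}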

Block Storage Deep Dive

What is a Block?

HDFS splits files into fixed-size blocks:
  • Default size: 128MB (configurable: 64MB, 256MB, etc.)
  • Why so large? Fewer, larger blocks mean less NameNode metadata and longer sequential reads
Example:
File: access.log (350MB)

Split into blocks:
┌─────────────┬─────────────┬─────────────┐
│  Block 1    │  Block 2    │  Block 3    │
│  (128MB)    │  (128MB)    │  (94MB)     │
│  ID: 1001   │  ID: 1002   │  ID: 1003   │
└─────────────┴─────────────┴─────────────┘
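
The split is plain integer arithmetic; a small sketch reproducing the numbers above:
public class BlockSplit {
    public static void main(String[] args) {
        long fileSize  = 350L * 1024 * 1024;   // 350MB file
        long blockSize = 128L * 1024 * 1024;   // 128MB default block size

        long fullBlocks  = fileSize / blockSize;                    // 2
        long lastBlockMb = (fileSize % blockSize) / (1024 * 1024);  // 94MB
        long numBlocks   = fullBlocks + (lastBlockMb > 0 ? 1 : 0);  // 3

        System.out.println(numBlocks + " blocks, last block: " + lastBlockMb + "MB");
    }
}
Note that the final block occupies only 94MB on disk; HDFS does not pad the last block out to the full block size.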

Replication Strategy

Each block is replicated (default: 3 copies) for fault tolerance.

Replica Placement Algorithm (Rack-Aware):
Block X needs 3 replicas:

Replica 1: Local node (where client writes from)
           OR random node (if client outside cluster)

Replica 2: Different node on SAME rack as Replica 1
           (same rack-switch, low latency)

Replica 3: Different node on DIFFERENT rack
           (survives rack failure)

┌──────────────────────────────────────────────┐
│                  Network                     │
└────────┬─────────────────────────┬───────────┘
         │                         │
    ┌────▼─────┐              ┌────▼─────┐
    │ Rack 1   │              │ Rack 2   │
    │ Switch   │              │ Switch   │
    └────┬─────┘              └────┬─────┘
         │                         │
    ┌────┼─────┬─────┐        ┌────┴─────┐
    │    │     │     │        │          │
    ▼    ▼     ▼     ▼        ▼          ▼
   DN1  DN2   DN3   DN4      DN5        DN6
    ↑    ↑                    ↑
    │    │                    │
 Replica Replica           Replica
   1      2                  3
Why This Strategy?
  • Fault tolerance: Survives both node and rack failures
  • Write performance: Two replicas sit on the same rack, keeping the pipeline hop short
  • Read optimization: Multiple replicas allow load balancing across nodes
  • Network efficiency: Only one pipeline transfer per block crosses a rack boundary, instead of two

Block Placement Code Example

Here’s how HDFS decides block placement (simplified from actual code):
public class BlockPlacementPolicyDefault {

    /**
     * Choose target DataNodes for block replicas
     * @param numOfReplicas Number of replicas to create
     * @param writer Client writing the block (for locality)
     * @param excludedNodes Nodes to exclude
     * @param blocksize Size of the block
     * @return Array of chosen DataNodes
     */
    public DatanodeDescriptor[] chooseTarget(
            int numOfReplicas,
            Node writer,
            List<DatanodeDescriptor> excludedNodes,
            long blocksize) {

        if (numOfReplicas == 0 || clusterMap.getNumOfLeaves() == 0) {
            return new DatanodeDescriptor[0];
        }

        List<DatanodeDescriptor> results = new ArrayList<>();

        // First replica: local to writer if possible
        DatanodeDescriptor firstNode = chooseFirstReplica(writer);
        results.add(firstNode);
        excludedNodes.add(firstNode);

        if (numOfReplicas == 1) {
            return results.toArray(new DatanodeDescriptor[1]);
        }

        // Second replica: different node, same rack as first
        DatanodeDescriptor secondNode = chooseLocalRack(
            firstNode, excludedNodes, blocksize);
        results.add(secondNode);
        excludedNodes.add(secondNode);

        if (numOfReplicas == 2) {
            return results.toArray(new DatanodeDescriptor[2]);
        }

        // Third replica: different rack
        DatanodeDescriptor thirdNode = chooseRemoteRack(
            1, results, excludedNodes, blocksize);
        results.add(thirdNode);
        excludedNodes.add(thirdNode);

        // Additional replicas: distribute randomly
        for (int i = 3; i < numOfReplicas; i++) {
            DatanodeDescriptor nextNode = chooseRandom(
                excludedNodes, blocksize);
            results.add(nextNode);
            excludedNodes.add(nextNode);
        }

        return results.toArray(new DatanodeDescriptor[numOfReplicas]);
    }

    private DatanodeDescriptor chooseFirstReplica(Node writer) {
        // If writer is a DataNode, use it
        if (writer instanceof DatanodeDescriptor) {
            return (DatanodeDescriptor) writer;
        }
        // Otherwise, choose random node
        return chooseRandom(new ArrayList<>(), 0);
    }

    private DatanodeDescriptor chooseLocalRack(
            DatanodeDescriptor node,
            List<DatanodeDescriptor> excluded,
            long blocksize) {

        String rackName = node.getNetworkLocation();
        List<DatanodeDescriptor> sameRackNodes =
            clusterMap.getDatanodesInRack(rackName);

        // Remove excluded nodes
        sameRackNodes.removeAll(excluded);

        // Choose node with most available space
        return chooseBestNode(sameRackNodes, blocksize);
    }

    private DatanodeDescriptor chooseBestNode(
            List<DatanodeDescriptor> nodes,
            long blocksize) {

        DatanodeDescriptor best = null;
        long maxSpace = 0;

        for (DatanodeDescriptor node : nodes) {
            long available = node.getRemaining();
            if (available > maxSpace && available >= blocksize) {
                maxSpace = available;
                best = node;
            }
        }

        return best;
    }
}

HDFS Read Operation

Let’s trace a complete read operation with detailed code.

The Read Flow

Step 1: Client → NameNode
   "Where are the blocks for /data/logs.txt?"

Step 2: NameNode → Client
   Block 1: [DN1, DN2, DN5]
   Block 2: [DN2, DN3, DN6]
   Block 3: [DN3, DN4, DN1]
   (sorted by proximity to client)

Step 3: Client → DN1 (closest DataNode)
   "Send me Block 1"

Step 4: DN1 → Client
   [Streaming Block 1 data...]

Step 5: Client → DN2 (closest for Block 2)
   "Send me Block 2"

... and so on

Java Client Code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.IOUtils;

public class HDFSReader {

    public static void main(String[] args) throws Exception {
        // Configuration object loads hdfs-site.xml, core-site.xml
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // Get FileSystem instance (connects to NameNode)
        FileSystem fs = FileSystem.get(conf);

        // Path to read
        Path filePath = new Path("/user/hadoop/data/access.log");

        // Open input stream
        // Internally: Contacts NameNode for block locations
        FSDataInputStream in = fs.open(filePath);

        try {
            // Read data
            // Internally: Connects to DataNodes sequentially
            IOUtils.copyBytes(in, System.out, 4096, false);

            // Seek to specific position (if needed)
            in.seek(1024000); // Jump to byte offset 1024000
            byte[] buffer = new byte[1024];
            in.read(buffer);

        } finally {
            IOUtils.closeStream(in);
        }

        fs.close();
    }
}

Behind the Scenes: DFSInputStream

Here’s what happens inside fs.open():
public class DFSInputStream extends FSInputStream {

    private DFSClient dfsClient;
    private String src; // File path
    private LocatedBlocks locatedBlocks; // Block locations from NN
    private long fileLength;
    private long pos = 0; // Current position in file

    public DFSInputStream(DFSClient dfsClient, String src) {
        this.dfsClient = dfsClient;
        this.src = src;

        // Contact NameNode to get block locations
        openInfo();
    }

    private void openInfo() {
        // RPC call to NameNode: fetch locations for the whole file
        // (simplified; fileLength is not known yet, so request everything)
        LocatedBlocks newInfo =
            dfsClient.namenode.getBlockLocations(src, 0, Long.MAX_VALUE);

        this.locatedBlocks = newInfo;
        this.fileLength = newInfo.getFileLength();
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // Find which block contains current position
        LocatedBlock targetBlock = getBlockAt(pos);

        // Choose best DataNode (closest, most available)
        DatanodeInfo chosenNode = getBestNode(targetBlock);

        // Connect to DataNode and read
        BlockReader reader = getBlockReader(chosenNode, targetBlock);

        int bytesRead = reader.read(buf, off, len);
        if (bytesRead > 0) {
            pos += bytesRead; // don't advance past EOF (read returns -1)
        }

        return bytesRead;
    }

    private DatanodeInfo getBestNode(LocatedBlock block) {
        DatanodeInfo[] nodes = block.getLocations();

        // Sort by network distance (local node preferred)
        Arrays.sort(nodes, new Comparator<DatanodeInfo>() {
            public int compare(DatanodeInfo a, DatanodeInfo b) {
                int distA = networkDistance(clientNode, a);
                int distB = networkDistance(clientNode, b);
                return distA - distB;
            }
        });

        return nodes[0]; // Closest node
    }

    private int networkDistance(Node a, Node b) {
        // Same node: distance 0
        // Same rack: distance 2
        // Different rack: distance 4
        String pathA = a.getNetworkLocation();
        String pathB = b.getNetworkLocation();

        if (a.equals(b)) return 0;
        if (pathA.equals(pathB)) return 2;
        return 4;
    }
}
Key Optimizations:
  • Data locality: Reads from closest replica
  • Streaming: Reads one block at a time (doesn’t load entire file)
  • Retry logic: Switches to another replica on failure
  • Checksum verification: Validates data integrity
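
The last item, checksum verification, is also visible through the client API. A minimal sketch (reusing the example path from the reader above):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/data/access.log");

        // Per-read checksum verification is on by default; it can be toggled per FileSystem
        fs.setVerifyChecksum(true);

        // Whole-file checksum, derived from the per-block checksums stored on DataNodes
        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum != null) {
            System.out.println(checksum.getAlgorithmName() + ": " + checksum);
        }

        fs.close();
    }
}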

HDFS Write Operation

Writing is more complex due to replication pipeline.

The Write Flow

Step 1: Client → NameNode
   "I want to create /data/new.log"

Step 2: NameNode checks permissions, quota
   Creates file entry in namespace
   Returns: "Proceed"

Step 3: Client buffers first 64KB of data

Step 4: Client → NameNode
   "Allocate a block for /data/new.log"

Step 5: NameNode → Client
   "Block ID 1234, write to [DN1, DN2, DN5]"

Step 6: Client → DN1
   "Store block 1234, forward to DN2"
   [Streams data to DN1]

Step 7: DN1 → DN2
   "Store block 1234, forward to DN5"
   [Streams data to DN2]

Step 8: DN2 → DN5
   "Store block 1234"
   [Streams data to DN5]

Step 9: DN5 → DN2 → DN1 → Client
   "ACK: Block written successfully"

Repeat Steps 4-9 for each block

Write Pipeline Architecture

Client

  │ (1) Write packet

┌────────────┐
│  DataNode  │───(2) Forward packet────┐
│     1      │                         │
│            │◄──(4) ACK──────────┐    │
└────────────┘                    │    │
                                  │    ▼
                              ┌────────────┐
                              │  DataNode  │──(3) Forward──┐
                              │     2      │               │
                              │            │◄──(5) ACK──┐  │
                              └────────────┘            │  │
                                                        │  ▼
                                                    ┌────────────┐
                                                    │  DataNode  │
                                                    │     3      │
                                                    │            │
                                                    └────────────┘

                                                         │ (6) ACK

                                               Pipeline complete

Java Write Code

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

public class HDFSWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path filePath = new Path("/user/hadoop/output/results.txt");

        // Create file (overwrites if exists)
        FSDataOutputStream out = fs.create(filePath);

        // Or append to existing file
        // FSDataOutputStream out = fs.append(filePath);

        try {
            String data = "Processing complete\n";
            out.write(data.getBytes("UTF-8"));

            // Flush to ensure data written
            out.flush();

            // Optional: force the data to disk on the DataNodes
            out.hsync(); // hflush() is the lighter alternative: visible to new readers, not necessarily on disk

        } finally {
            out.close(); // Flushes remaining data and completes the file on the NameNode
        }

        fs.close();
    }
}

Advanced: Custom Replication Factor

public class CustomReplication {

    public static void writeWithReplication(
            String hdfsPath,
            byte[] data,
            short replicationFactor) throws Exception {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        // Specify replication factor at creation
        FSDataOutputStream out = fs.create(
            path,
            true,              // overwrite
            4096,              // buffer size
            replicationFactor, // custom replication
            134217728          // block size (128MB)
        );

        out.write(data);
        out.close();
        fs.close();
    }

    public static void changeReplication(
            String hdfsPath,
            short newReplication) throws Exception {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        // Change replication factor of existing file
        boolean success = fs.setReplication(path, newReplication);

        if (success) {
            System.out.println("Replication changed to " + newReplication);
        }

        fs.close();
    }
}

Fault Tolerance Mechanisms

DataNode Failure

Detection:
NameNode expects heartbeat every 3 seconds
If no heartbeat for ~10 minutes (10.5 minutes with default settings):
  → Mark DataNode as dead
  → Identify under-replicated blocks
  → Schedule re-replication
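
The ~10-minute figure is derived from two settings rather than configured directly; with default values the arithmetic works out to 10.5 minutes:
public class DeadNodeTimeout {
    public static void main(String[] args) {
        long recheckIntervalMs   = 300_000; // dfs.namenode.heartbeat.recheck-interval (default 5 min)
        long heartbeatIntervalMs = 3_000;   // dfs.heartbeat.interval (default 3 s)

        // NameNode declares a DataNode dead after:
        // 2 * recheck interval + 10 * heartbeat interval
        long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;

        System.out.println("Dead-node timeout: " + timeoutMs / 60000.0 + " minutes"); // 10.5
    }
}
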
Recovery Process:
public class NameNodeReplication {

    private void handleDeadDataNode(DatanodeDescriptor deadNode) {
        // Get all blocks stored on dead node
        List<Block> affectedBlocks = deadNode.getBlockList();

        for (Block block : affectedBlocks) {
            // Check current replication level
            int currentReplicas = countReplicas(block);
            int targetReplicas = block.getReplication();

            if (currentReplicas < targetReplicas) {
                // Under-replicated! Schedule re-replication
                addToReplicationQueue(block,
                    targetReplicas - currentReplicas);
            }
        }

        // Remove dead node from cluster
        removeDataNode(deadNode);
    }

    private void processReplicationQueue() {
        while (!replicationQueue.isEmpty()) {
            Block block = replicationQueue.poll();

            // Find nodes with existing replicas
            List<DatanodeDescriptor> sources =
                getDataNodesForBlock(block);

            // Choose target for new replica
            DatanodeDescriptor target =
                chooseTargetDataNode(sources);

            // Command source to replicate to target
            sources.get(0).addBlockToReplicate(block, target);
        }
    }
}

NameNode Failure (High Availability)

Problem: The NameNode is a single point of failure.
Solution: HA NameNode with ZooKeeper.
┌─────────────────────────────────────────────┐
│              ZooKeeper Cluster              │
│     (Coordination & Leader Election)        │
└──────────────┬──────────────────────────────┘

        ┌──────┴──────┐
        │             │
        ▼             ▼
┌──────────────┐ ┌──────────────┐
│   Active     │ │   Standby    │
│  NameNode    │ │  NameNode    │
│              │ │              │
│ (Serves      │ │ (Tails edit  │
│  requests)   │ │  log, ready  │
│              │ │  to takeover)│
└──────┬───────┘ └──────┬───────┘
       │                │
       │                │
       ▼                ▼
┌──────────────────────────────┐
│   Shared Edit Log            │
│   (on JournalNodes or NFS)   │
└──────────────────────────────┘
Automatic Failover:
  1. Active NameNode crashes
  2. ZooKeeper detects failure
  3. Standby NameNode promoted to Active
  4. Clients automatically redirect to new Active
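
Step 4 works because HA clients are configured with a logical nameservice and a failover proxy provider instead of a single NameNode host. A sketch of the client-side properties (the nameservice name and hosts are examples):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HAClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical nameservice instead of a single NameNode host
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:9000");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:9000");

        // The proxy provider fails over to whichever NameNode is currently Active
        conf.set("dfs.client.failover.proxy.provider.mycluster",
            "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf); // resolves to the Active NameNode
        System.out.println("Connected to " + fs.getUri());
        fs.close();
    }
}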

HDFS Configuration

Essential Configuration Files

core-site.xml (Cluster-wide settings):
<configuration>
    <!-- NameNode URI -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:9000</value>
    </property>

    <!-- Temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/hadoop/tmp</value>
    </property>
</configuration>
hdfs-site.xml (HDFS-specific):
<configuration>
    <!-- Replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>

    <!-- Block size (128MB) -->
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>

    <!-- NameNode data directory -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///var/hadoop/namenode</value>
    </property>

    <!-- DataNode data directories -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///disk1/hadoop,file:///disk2/hadoop</value>
    </property>

    <!-- Permissions (disable for testing only!) -->
    <property>
        <name>dfs.permissions.enabled</name>
        <value>true</value>
    </property>
</configuration>

Hands-on Labs

Lab 1: HDFS CLI Operations

# Create directory
hdfs dfs -mkdir -p /user/hadoop/input

# Upload file
hdfs dfs -put localfile.txt /user/hadoop/input/

# List files with block info
hdfs fsck /user/hadoop/input/localfile.txt -files -blocks -locations

# Check file stats
hdfs dfs -stat "%b %r %n" /user/hadoop/input/localfile.txt
# Output: file size in bytes, replication factor, filename

# Change replication
hdfs dfs -setrep -w 5 /user/hadoop/input/localfile.txt

# View file content
hdfs dfs -cat /user/hadoop/input/localfile.txt

# Get file to local
hdfs dfs -get /user/hadoop/input/localfile.txt ./downloaded.txt

# Check cluster health
hdfs dfsadmin -report

# Balance cluster
hdfs balancer -threshold 10

Lab 2: Programmatic File Operations

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HDFSOperations {

    private FileSystem fs;

    public HDFSOperations() throws Exception {
        Configuration conf = new Configuration();
        this.fs = FileSystem.get(conf);
    }

    // Create directory
    public void createDirectory(String path) throws Exception {
        Path dirPath = new Path(path);
        fs.mkdirs(dirPath);
        System.out.println("Created: " + path);
    }

    // List directory contents
    public void listDirectory(String path) throws Exception {
        Path dirPath = new Path(path);
        FileStatus[] statuses = fs.listStatus(dirPath);

        for (FileStatus status : statuses) {
            System.out.printf("%s %10d %s\n",
                status.isDirectory() ? "DIR " : "FILE",
                status.getLen(),
                status.getPath().getName()
            );
        }
    }

    // Get block locations for a file
    public void showBlockLocations(String path) throws Exception {
        Path filePath = new Path(path);
        FileStatus status = fs.getFileStatus(filePath);
        BlockLocation[] blocks = fs.getFileBlockLocations(
            status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            BlockLocation block = blocks[i];
            System.out.println("Block " + i + ":");
            System.out.println("  Hosts: " +
                String.join(", ", block.getHosts()));
            System.out.println("  Offset: " + block.getOffset());
            System.out.println("  Length: " + block.getLength());
        }
    }

    // Delete file/directory
    public void delete(String path, boolean recursive) throws Exception {
        Path targetPath = new Path(path);
        boolean success = fs.delete(targetPath, recursive);
        System.out.println("Deleted: " + success);
    }

    // Copy file within HDFS
    public void copyFile(String src, String dst) throws Exception {
        Path srcPath = new Path(src);
        Path dstPath = new Path(dst);
        FileUtil.copy(fs, srcPath, fs, dstPath, false, fs.getConf());
    }

    public void close() throws Exception {
        fs.close();
    }
}

Lab 3: Monitoring HDFS Health

import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class HDFSMonitor {

    public static void clusterHealth() throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(conf);

        // Get cluster statistics
        FsStatus status = dfs.getStatus();
        long capacity = status.getCapacity();
        long used = status.getUsed();
        long remaining = status.getRemaining();

        System.out.println("HDFS Cluster Status:");
        System.out.println("  Capacity:  " + formatSize(capacity));
        System.out.println("  Used:      " + formatSize(used) +
            " (" + (used * 100 / capacity) + "%)");
        System.out.println("  Remaining: " + formatSize(remaining));

        // Get DataNode information
        DatanodeInfo[] datanodes = dfs.getDataNodeStats();
        System.out.println("\nDataNodes: " + datanodes.length);

        for (DatanodeInfo node : datanodes) {
            System.out.println("  " + node.getHostName());
            System.out.println("    Capacity:  " +
                formatSize(node.getCapacity()));
            System.out.println("    Used:      " +
                formatSize(node.getDfsUsed()));
            System.out.println("    Remaining: " +
                formatSize(node.getRemaining()));
            System.out.println("    State:     " +
                node.getAdminState());
        }

        dfs.close();
    }

    private static String formatSize(long bytes) {
        long tb = 1024L * 1024 * 1024 * 1024;
        long gb = 1024L * 1024 * 1024;
        long mb = 1024L * 1024;

        if (bytes >= tb) return (bytes / tb) + " TB";
        if (bytes >= gb) return (bytes / gb) + " GB";
        if (bytes >= mb) return (bytes / mb) + " MB";
        return bytes + " B";
    }
}

Performance Tuning

Optimizing Block Size

Rule of Thumb: Choose a block size large enough to keep NameNode metadata small, but small enough to preserve parallelism (roughly one map task per block)
Scenario: Processing 1TB of log files

Option 1: 64MB blocks
  → 16,384 blocks
  → More metadata
  → More map tasks

Option 2: 128MB blocks (default)
  → 8,192 blocks
  → Balanced

Option 3: 256MB blocks
  → 4,096 blocks
  → Less metadata
  → Fewer map tasks (less parallelism)

Choice: 128MB for balanced parallelism
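
The block counts above fall straight out of dividing the dataset size by the block size; a quick sketch:
public class BlockCountComparison {
    public static void main(String[] args) {
        long dataset = 1024L * 1024 * 1024 * 1024;   // 1TB of log files
        long[] blockSizesMb = {64, 128, 256};

        for (long mb : blockSizesMb) {
            long blocks = dataset / (mb * 1024 * 1024);  // 16384, 8192, 4096
            System.out.println(mb + "MB blocks -> " + blocks +
                " blocks (roughly one map task per block by default)");
        }
    }
}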

Short-Circuit Reads

When client and DataNode are on same machine, skip network:
<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>

<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
Performance Impact: 30-50% faster local reads

Common Issues & Troubleshooting

NameNode Out of Memory

Symptoms: NameNode crashes, OutOfMemoryError in the logs
Cause: Too many files/blocks for the allocated heap
Solutions:
  1. Increase the NameNode heap: -Xmx16g to -Xmx32g
  2. Enable HDFS Federation (multiple NameNodes)
  3. Merge small files into larger ones
  4. Clean up old/unused data
Prevention: Monitor namespace size and set quotas (sketched below)
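
One way to apply the "set quotas" advice programmatically, assuming the DistributedFileSystem.setQuota API (the directory and limits are illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class NamespaceQuota {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);

        // Cap the directory at 1 million names (files + directories) and
        // 10TB of raw storage (replicas count against the space quota)
        long namespaceQuota = 1_000_000L;
        long storageQuota   = 10L * 1024 * 1024 * 1024 * 1024;

        dfs.setQuota(new Path("/user/hadoop/input"), namespaceQuota, storageQuota);

        dfs.close();
    }
}
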
Under-Replicated Blocks

Symptoms: hdfs fsck shows under-replicated blocks
Causes:
  • DataNode failures
  • Network issues
  • Disk space exhaustion
Solutions:
# Check under-replicated blocks
hdfs fsck / | grep "Under-replicated"

# Identify problematic DataNodes
hdfs dfsadmin -report

# Inspect replication state (metasave dumps block and replication-queue info to a file)
hdfs dfsadmin -metasave /tmp/meta.txt

# If urgent, increase replication temporarily
hdfs dfs -setrep -R 4 /critical/data

DataNode Fails to Start

Check the logs: /var/log/hadoop/hadoop-hdfs-datanode-*.log
Common causes:
  1. Incompatible cluster ID:
    Error: Incompatible clusterIDs
    Solution: Delete DataNode data, rejoin cluster
    
  2. Port already in use:
    Error: Address already in use: 50010
    Solution: Kill process or change port in hdfs-site.xml
    
  3. Permission issues:
    Error: Permission denied on /var/hadoop/datanode
    Solution: chown -R hdfs:hadoop /var/hadoop/datanode
    

Interview Focus

Key Concepts to Master:
  • Explain NameNode vs DataNode responsibilities
  • Describe block replication algorithm and rack awareness
  • Walk through read/write data flows
  • Discuss Single NameNode limitations and HA solutions
  • Compare HDFS to other storage (S3, traditional file systems)
Sample Questions:
  1. “Why can’t HDFS handle lots of small files efficiently?”
    • Answer: Each file/block = metadata entry in NameNode RAM
    • 1 million 1KB files = 1 million blocks vs 8 files of 128MB = 8 blocks
  2. “How does HDFS ensure data locality for MapReduce?”
    • Answer: JobTracker queries NameNode for block locations, schedules map tasks on DataNodes storing blocks
  3. “What happens if a client crashes during a write?”
    • Answer: File remains in incomplete state, lease eventually expires, NameNode closes file

What’s Next?

You now understand the HDFS storage layer. Next, learn how to process this data with MapReduce!

Module 3: MapReduce Programming Model

Master distributed data processing with MapReduce patterns and hands-on coding