# HDFS Architecture & Internals

> Master the Hadoop Distributed File System - architecture, replication, fault tolerance, and hands-on implementation

<Info>
  **Module Duration**: 4-5 hours
  **Hands-on Labs**: 6 practical exercises
  **Prerequisites**: Understanding of GFS concepts from Module 1
</Info>

## Introduction

HDFS (Hadoop Distributed File System) is the storage foundation of the Hadoop ecosystem. Inspired by Google's GFS, HDFS provides:

* **Fault-tolerant storage** across commodity hardware
* **High throughput** for large datasets
* **Scalability** to petabytes and beyond
* **Data locality** for efficient processing

In this module, we'll go beyond theory to understand implementation details and write real code.

***

## HDFS Architecture Overview

### The Core Components

```
┌──────────────────────────────────────────────────────────────┐
│                     HDFS CLUSTER                             │
│                                                              │
│  ┌────────────────────────────────────────┐                 │
│  │         NameNode (Master)              │                 │
│  │                                        │                 │
│  │  • Metadata storage (in-memory)        │                 │
│  │  • Namespace operations                │                 │
│  │  • Block mapping                       │                 │
│  │  • Replication policy enforcement      │                 │
│  │  • Heartbeat & Block reports           │                 │
│  └────────────┬───────────────────────────┘                 │
│               │                                             │
│               │ Heartbeat (3 sec) + Block Reports          │
│               │                                             │
│  ┌────────────┼────────────┬───────────────┬──────────┐    │
│  │            │            │               │          │    │
│  ▼            ▼            ▼               ▼          ▼    │
│ ┌──────────┐┌──────────┐┌──────────┐┌──────────┐┌────────┐│
│ │DataNode 1││DataNode 2││DataNode 3││DataNode 4││DataNode││
│ │          ││          ││          ││          ││   N    ││
│ │ Stores:  ││ Stores:  ││ Stores:  ││ Stores:  ││        ││
│ │ Block 1  ││ Block 1  ││ Block 2  ││ Block 2  ││  ...   ││
│ │ Block 3  ││ Block 2  ││ Block 3  ││ Block 4  ││        ││
│ │ Block 5  ││ Block 4  ││ Block 5  ││ Block 6  ││        ││
│ └──────────┘└──────────┘└──────────┘└──────────┘└────────┘│
│                                                              │
│  ┌─────────────────────────────────┐                        │
│  │  Secondary NameNode             │                        │
│  │  (Checkpoint creation)          │                        │
│  └─────────────────────────────────┘                        │
└──────────────────────────────────────────────────────────────┘

         ▲                           ▲
         │                           │
         │                           │
    ┌────┴────┐                 ┌────┴────┐
    │ Client  │                 │ Client  │
    │ (Read/  │                 │ (Read/  │
    │  Write) │                 │  Write) │
    └─────────┘                 └─────────┘
```

### NameNode: The Brain

The NameNode is the single master that manages:

1. **File System Namespace**:
   * Directory tree
   * File metadata (permissions, timestamps)
   * Mapping: filename → block IDs

2. **Block Management**:
   * Block locations (which DataNodes store each block)
   * Replication factor enforcement
   * Block creation, deletion, and re-replication

3. **Heartbeat Management**:
   * Monitors DataNode health (3-second heartbeats)
   * Detects failures and initiates recovery

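**Critical Design Choice**: All metadata is held in RAM (the namespace is also persisted to disk through the FSImage and edit log)

* *Pros*: Fast metadata operations (no disk I/O on the lookup path)
* *Cons*: Limits namespace size (rule of thumb: 1GB RAM ≈ 1 million blocks)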
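
The RAM limit can be turned into a rough sizing calculation. The sketch below only applies the rule of thumb quoted above; the class and method names are hypothetical, and the result assumes every block is full, which real clusters rarely achieve.

```java
/** Back-of-the-envelope NameNode sizing from the rule of thumb above. */
final class NameNodeHeapEstimate {

    // Assumption: ~1 million blocks per GB of heap (rule of thumb, not a measurement)
    static final long BLOCKS_PER_GB_HEAP = 1_000_000L;

    /** Approximate number of blocks a heap of the given size can track. */
    static long maxBlocks(long heapGb) {
        return heapGb * BLOCKS_PER_GB_HEAP;
    }

    /** Data (in TiB) those blocks can address if every block is full. */
    static long addressableTiB(long heapGb, long blockSizeMb) {
        long totalMb = maxBlocks(heapGb) * blockSizeMb;
        return totalMb / (1024L * 1024L); // MiB -> TiB
    }

    public static void main(String[] args) {
        System.out.println("Blocks tracked by 64 GB heap: " + maxBlocks(64));
        System.out.println("Addressable TiB at 128 MB blocks: " + addressableTiB(64, 128));
    }
}
```

Many small files lower the addressable capacity sharply, because each file consumes at least one block entry no matter how little data it holds.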

### DataNodes: The Workers

DataNodes:

* Store actual data blocks on local disk
* Send heartbeats to NameNode every 3 seconds
* Report block lists periodically (block reports)
* Serve read/write requests from clients
* Execute replication commands from NameNode

### Secondary NameNode: The Misconception

<Warning>
  **Common Misconception**: Secondary NameNode is NOT a backup or standby!
</Warning>

**Actual Role**: Performs periodic checkpointing

* Merges edit logs into FSImage
* Reduces NameNode restart time
* Does NOT take over if NameNode fails (use HA NameNode for that)
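
Checkpointing amounts to replaying the edit log on top of the last FSImage. A toy model of that merge, assuming a namespace that is just a set of paths and a hypothetical `Edit` type with only two operations (the real edit log has many more):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of checkpointing: replay the edit log onto the last FSImage. */
final class CheckpointSketch {

    /** One logged namespace change (hypothetical, far simpler than real edit log ops). */
    static final class Edit {
        final String op;
        final String path;

        Edit(String op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    /** Returns a new image with all edits applied; the inputs are not modified. */
    static TreeSet<String> checkpoint(TreeSet<String> fsImage, List<Edit> editLog) {
        TreeSet<String> merged = new TreeSet<>(fsImage);
        for (Edit e : editLog) {
            switch (e.op) {
                case "CREATE":
                    merged.add(e.path);
                    break;
                case "DELETE":
                    merged.remove(e.path);
                    break;
                default:
                    throw new IllegalArgumentException("Unknown op: " + e.op);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<>(Arrays.asList("/data/a.log", "/data/b.log"));
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit("CREATE", "/data/c.log"));
        edits.add(new Edit("DELETE", "/data/a.log"));

        TreeSet<String> newImage = checkpoint(image, edits);
        edits.clear(); // replayed edits are no longer needed after the checkpoint
        System.out.println(newImage);
    }
}
```

A shorter edit log is what reduces restart time: the NameNode loads the merged image and replays only the edits written since the last checkpoint.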

***

## Block Storage Deep Dive

### What is a Block?

HDFS splits files into fixed-size blocks:

* **Default size**: 128MB (configurable: 64MB, 256MB, etc.)
* **Why so large?**: Minimize metadata overhead, optimize sequential reads

**Example**:

```
File: access.log (350MB)

Split into blocks:
┌─────────────┬─────────────┬─────────────┐
│  Block 1    │  Block 2    │  Block 3    │
│  (128MB)    │  (128MB)    │  (94MB)     │
│  ID: 1001   │  ID: 1002   │  ID: 1003   │
└─────────────┴─────────────┴─────────────┘
```
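
The split shown above is plain arithmetic: full blocks until the remainder, then one shorter block. A minimal sketch (the `BlockSplitter` name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

/** Computes the block lengths a file of a given size is split into. */
final class BlockSplitter {

    static List<Long> blockLengths(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<Long> lengths = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            // The last block holds only the bytes that remain
            lengths.add(Math.min(blockSize, fileSize - offset));
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        for (long length : blockLengths(350 * mb, 128 * mb)) {
            System.out.println((length / mb) + " MB");
        }
    }
}
```

The final block occupies only as much disk space as it contains, so the 94MB tail does not waste the remaining 34MB.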

### Replication Strategy

Each block is replicated (default: 3 copies) for fault tolerance.

**Replica Placement Algorithm** (Rack-Aware):

```
Block X needs 3 replicas:

Replica 1: Local node (where client writes from)
           OR random node (if client outside cluster)

Replica 2: Node on a DIFFERENT rack from Replica 1
           (survives rack failure)

Replica 3: Different node on the SAME rack as Replica 2
           (same rack switch, low latency)

┌──────────────────────────────────────────────┐
│                  Network                     │
└────────┬─────────────────────────┬───────────┘
         │                         │
    ┌────▼─────┐              ┌────▼─────┐
    │ Rack 1   │              │ Rack 2   │
    │ Switch   │              │ Switch   │
    └────┬─────┘              └────┬─────┘
         │                         │
    ┌────┼─────┬─────┐        ┌────┴─────┐
    │    │     │     │        │          │
    ▼    ▼     ▼     ▼        ▼          ▼
   DN1  DN2   DN3   DN4      DN5        DN6
    ↑                         ↑          ↑
    │                         │          │
 Replica                   Replica    Replica
    1                         2          3
```

**Why This Strategy?**:

* **Fault tolerance**: Survives node and rack failures
* **Write performance**: 2 replicas share a rack, so that hop stays behind one switch (fast)
* **Read optimization**: Multiple replicas for load balancing
* **Network efficiency**: Only 1 of the 3 pipeline hops crosses racks

### Block Placement Code Example

Here's how HDFS decides block placement (simplified from actual code):

```java theme={null}
public class BlockPlacementPolicyDefault {

    /**
     * Choose target DataNodes for block replicas
     * @param numOfReplicas Number of replicas to create
     * @param writer Client writing the block (for locality)
     * @param excludedNodes Nodes to exclude
     * @param blocksize Size of the block
     * @return Array of chosen DataNodes
     */
    public DatanodeDescriptor[] chooseTarget(
            int numOfReplicas,
            Node writer,
            List<DatanodeDescriptor> excludedNodes,
            long blocksize) {

        if (numOfReplicas == 0 || clusterMap.getNumOfLeaves() == 0) {
            return new DatanodeDescriptor[0];
        }

        List<DatanodeDescriptor> results = new ArrayList<>();

        // First replica: local to writer if possible
        DatanodeDescriptor firstNode = chooseFirstReplica(writer);
        results.add(firstNode);
        excludedNodes.add(firstNode);

        if (numOfReplicas == 1) {
            return results.toArray(new DatanodeDescriptor[1]);
        }

        // Second replica: node on a different rack from the first
        DatanodeDescriptor secondNode = chooseRemoteRack(
            1, results, excludedNodes, blocksize);
        results.add(secondNode);
        excludedNodes.add(secondNode);

        if (numOfReplicas == 2) {
            return results.toArray(new DatanodeDescriptor[2]);
        }

        // Third replica: different node, same rack as the second
        DatanodeDescriptor thirdNode = chooseLocalRack(
            secondNode, excludedNodes, blocksize);
        results.add(thirdNode);
        excludedNodes.add(thirdNode); // keep later random picks off this node

        // Additional replicas: distribute randomly
        for (int i = 3; i < numOfReplicas; i++) {
            DatanodeDescriptor nextNode = chooseRandom(
                excludedNodes, blocksize);
            results.add(nextNode);
            excludedNodes.add(nextNode);
        }

        return results.toArray(new DatanodeDescriptor[numOfReplicas]);
    }

    private DatanodeDescriptor chooseFirstReplica(Node writer) {
        // If writer is a DataNode, use it
        if (writer instanceof DatanodeDescriptor) {
            return (DatanodeDescriptor) writer;
        }
        // Otherwise, choose random node
        return chooseRandom(new ArrayList<>(), 0);
    }

    private DatanodeDescriptor chooseLocalRack(
            DatanodeDescriptor node,
            List<DatanodeDescriptor> excluded,
            long blocksize) {

        String rackName = node.getNetworkLocation();
        List<DatanodeDescriptor> sameRackNodes =
            clusterMap.getDatanodesInRack(rackName);

        // Remove excluded nodes
        sameRackNodes.removeAll(excluded);

        // Choose node with most available space
        return chooseBestNode(sameRackNodes, blocksize);
    }

    private DatanodeDescriptor chooseBestNode(
            List<DatanodeDescriptor> nodes,
            long blocksize) {

        DatanodeDescriptor best = null;
        long maxSpace = 0;

        for (DatanodeDescriptor node : nodes) {
            long available = node.getRemaining();
            if (available > maxSpace && available >= blocksize) {
                maxSpace = available;
                best = node;
            }
        }

        return best;
    }
}
```

***

## HDFS Read Operation

Let's trace a complete read operation with detailed code.

### The Read Flow

```
Step 1: Client → NameNode
   "Where are the blocks for /data/logs.txt?"

Step 2: NameNode → Client
   Block 1: [DN1, DN2, DN5]
   Block 2: [DN2, DN3, DN6]
   Block 3: [DN3, DN4, DN1]
   (sorted by proximity to client)

Step 3: Client → DN1 (closest DataNode)
   "Send me Block 1"

Step 4: DN1 → Client
   [Streaming Block 1 data...]

Step 5: Client → DN2 (closest for Block 2)
   "Send me Block 2"

... and so on
```

### Java Client Code

```java theme={null}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.IOUtils;

public class HDFSReader {

    public static void main(String[] args) throws Exception {
        // Configuration loads core-default.xml and core-site.xml from the classpath;
        // hdfs-site.xml is added once the HDFS client classes are loaded
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // Get FileSystem instance (connects to NameNode)
        FileSystem fs = FileSystem.get(conf);

        // Path to read
        Path filePath = new Path("/user/hadoop/data/access.log");

        // Open input stream
        // Internally: Contacts NameNode for block locations
        FSDataInputStream in = fs.open(filePath);

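        try {
            // Read data
            // Internally: Connects to DataNodes sequentially
            IOUtils.copyBytes(in, System.out, 4096, false);

            // Seek to specific position (if needed)
            // Seeking past the end of the file throws an IOException, so check first
            long fileLength = fs.getFileStatus(filePath).getLen();
            if (fileLength > 1024000) {
                in.seek(1024000); // Jump to byte offset 1024000
                byte[] buffer = new byte[1024];
                int bytesRead = in.read(buffer); // May return fewer than 1024 bytes
                System.err.println("Read " + bytesRead + " bytes after seek");
            }

        } finally {
            IOUtils.closeStream(in);
        }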

        fs.close();
    }
}
```

### Behind the Scenes: DFSInputStream

Here's a simplified sketch of what happens behind `fs.open()` and the reads that follow (illustrative, not the actual Hadoop source):

```java theme={null}
public class DFSInputStream extends FSInputStream {

    private DFSClient dfsClient;
    private String src; // File path
    private LocatedBlocks locatedBlocks; // Block locations from NN
    private long fileLength;
    private long pos = 0; // Current position in file

    public DFSInputStream(DFSClient dfsClient, String src) {
        this.dfsClient = dfsClient;
        this.src = src;

        // Contact NameNode to get block locations
        openInfo();
    }

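    private void openInfo() {
        // RPC call to NameNode. The file length is not known yet, so this
        // sketch asks for every block from offset 0 (the real client
        // requests only a prefetch window and fetches more on demand)
        LocatedBlocks newInfo =
            dfsClient.namenode.getBlockLocations(src, 0, Long.MAX_VALUE);

        this.locatedBlocks = newInfo;
        this.fileLength = newInfo.getFileLength();
    }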

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // End of file: nothing left to read
        if (pos >= fileLength) {
            return -1;
        }

        // Find which block contains current position
        LocatedBlock targetBlock = getBlockAt(pos);

        // Choose the closest DataNode that holds this block
        DatanodeInfo chosenNode = getBestNode(targetBlock);

        // Connect to DataNode and read
        BlockReader reader = getBlockReader(chosenNode, targetBlock);

        int bytesRead = reader.read(buf, off, len);
        if (bytesRead > 0) {
            pos += bytesRead; // Advance only when data was actually read
        }

        return bytesRead;
    }

    private DatanodeInfo getBestNode(LocatedBlock block) {
        DatanodeInfo[] nodes = block.getLocations();

        // Sort by network distance (local node preferred)
        Arrays.sort(nodes, new Comparator<DatanodeInfo>() {
            public int compare(DatanodeInfo a, DatanodeInfo b) {
                int distA = networkDistance(clientNode, a);
                int distB = networkDistance(clientNode, b);
                return distA - distB;
            }
        });

        return nodes[0]; // Closest node
    }

    private int networkDistance(Node a, Node b) {
        // Same node: distance 0
        // Same rack: distance 2
        // Different rack: distance 4
        String pathA = a.getNetworkLocation();
        String pathB = b.getNetworkLocation();

        if (a.equals(b)) return 0;
        if (pathA.equals(pathB)) return 2;
        return 4;
    }
}
```

**Key Optimizations**:

* **Data locality**: Reads from closest replica
* **Streaming**: Reads one block at a time (doesn't load entire file)
* **Retry logic**: Switches to another replica on failure
* **Checksum verification**: Validates data integrity
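
Retry and checksum verification work together: a replica that fails verification is skipped and the next closest one is tried. The sketch below is a stand-alone model of that idea, not client code from Hadoop. It assumes one checksum per 512-byte chunk and uses the JDK's `CRC32` as a stand-in for the checksum HDFS applies; `ReplicaReadSketch` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

/** Model of "verify checksums, fall back to the next replica on mismatch". */
final class ReplicaReadSketch {

    // Assumption for this sketch: one checksum per 512 bytes of data
    static final int BYTES_PER_CHECKSUM = 512;

    /** Computes one CRC per chunk of the given data. */
    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int start = i * BYTES_PER_CHECKSUM;
            int len = Math.min(BYTES_PER_CHECKSUM, data.length - start);
            CRC32 crc = new CRC32();
            crc.update(data, start, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Returns the first replica (in proximity order) whose chunks all verify. */
    static byte[] readVerified(List<byte[]> replicasByDistance, long[] expected) {
        for (byte[] replica : replicasByDistance) {
            if (Arrays.equals(checksums(replica), expected)) {
                return replica;
            }
            // Mismatch: treat this replica as corrupt and try the next one
        }
        throw new IllegalStateException("No replica passed checksum verification");
    }

    public static void main(String[] args) {
        byte[] good = new byte[2000];
        Arrays.fill(good, (byte) 7);
        byte[] corrupt = good.clone();
        corrupt[600] ^= 0x01; // flip one bit in the second chunk

        long[] expected = checksums(good);
        byte[] result = readVerified(Arrays.asList(corrupt, good), expected);
        System.out.println("Served by healthy replica: " + (result == good));
    }
}
```

In this model the expected checksums are passed in by the caller; in HDFS they are stored next to each block on the DataNode and sent along with the data.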

***

## HDFS Write Operation

Writing is more complex because of the replication pipeline.

### The Write Flow

```
Step 1: Client → NameNode
   "I want to create /data/new.log"

Step 2: NameNode checks permissions, quota
   Creates file entry in namespace
   Returns: "Proceed"

Step 3: Client buffers first 64KB of data

Step 4: Client → NameNode
   "Allocate a block for /data/new.log"

Step 5: NameNode → Client
   "Block ID 1234, write to [DN1, DN2, DN5]"

Step 6: Client → DN1
   "Store block 1234, forward to DN2"
   [Streams data to DN1]

Step 7: DN1 → DN2
   "Store block 1234, forward to DN5"
   [Streams data to DN2]

Step 8: DN2 → DN5
   "Store block 1234"
   [Streams data to DN5]

Step 9: DN5 → DN2 → DN1 → Client
   "ACK: Block written successfully"

Repeat Steps 4-9 for each block
```

### Write Pipeline Architecture

```
           (1) Write packet              (2) Forward              (3) Forward
┌────────┐                  ┌──────────┐             ┌──────────┐             ┌──────────┐
│        │─────────────────►│          │────────────►│          │────────────►│          │
│ Client │                  │DataNode 1│             │DataNode 2│             │DataNode 3│
│        │◄─────────────────│          │◄────────────│          │◄────────────│          │
└────────┘                  └──────────┘             └──────────┘             └──────────┘
           (6) ACK                       (5) ACK                  (4) ACK

ACKs travel upstream: DataNode 3 acknowledges first, and the client
receives its ACK only after every DataNode has the packet.
```
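
The ordering of stores and ACKs in the pipeline can be modelled with a short recursion. This is a deliberately sequential sketch with hypothetical names; the real pipeline keeps many packets in flight at once and overlaps forwarding with disk writes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Traces one packet through a replication pipeline and back. */
final class WritePipelineSketch {

    /** Sends one packet down the pipeline and returns the ordered event log. */
    static List<String> sendPacket(List<String> pipeline, int seqNo) {
        if (pipeline.isEmpty()) {
            throw new IllegalArgumentException("Pipeline needs at least one DataNode");
        }
        List<String> events = new ArrayList<>();
        forward(pipeline, 0, seqNo, events);
        return events;
    }

    // Store at pipeline[index], forward downstream, then ACK upstream
    // once the downstream node (if any) has acknowledged.
    private static void forward(List<String> pipeline, int index,
                                int seqNo, List<String> events) {
        String node = pipeline.get(index);
        events.add(node + " stores packet " + seqNo);
        if (index + 1 < pipeline.size()) {
            forward(pipeline, index + 1, seqNo, events);
        }
        String upstream = (index == 0) ? "Client" : pipeline.get(index - 1);
        events.add(node + " ACKs packet " + seqNo + " to " + upstream);
    }

    public static void main(String[] args) {
        for (String event : sendPacket(Arrays.asList("DN1", "DN2", "DN5"), 0)) {
            System.out.println(event);
        }
    }
}
```

The last event is always the first DataNode acknowledging to the client, which is why a client-side ACK implies that every replica in the pipeline received the packet.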

### Java Write Code

```java theme={null}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

public class HDFSWriter {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path filePath = new Path("/user/hadoop/output/results.txt");

        // Create file (overwrites if exists)
        FSDataOutputStream out = fs.create(filePath);

        // Or append to existing file
        // FSDataOutputStream out = fs.append(filePath);

        try {
            String data = "Processing complete\n";
            out.write(data.getBytes("UTF-8"));

            // Push buffered data to the DataNode pipeline so that
            // new readers can see it
            out.hflush();

            // Optional: also force DataNodes to sync the data to disk
            out.hsync(); // Stronger durability than hflush(), but slower

        } finally {
            out.close(); // Flushes remaining data and completes the file
        }

        fs.close();
    }
}
```

### Advanced: Custom Replication Factor

```java theme={null}
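import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;

public class CustomReplication {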

    public static void writeWithReplication(
            String hdfsPath,
            byte[] data,
            short replicationFactor) throws Exception {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        // Specify replication factor at creation
        FSDataOutputStream out = fs.create(
            path,
            true,              // overwrite
            4096,              // buffer size
            replicationFactor, // custom replication
            134217728          // block size (128MB)
        );

        out.write(data);
        out.close();
        fs.close();
    }

    public static void changeReplication(
            String hdfsPath,
            short newReplication) throws Exception {

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(hdfsPath);

        // Change replication factor of existing file
        boolean success = fs.setReplication(path, newReplication);

        if (success) {
            System.out.println("Replication changed to " + newReplication);
        }

        fs.close();
    }
}
```

***

## Fault Tolerance Mechanisms

### DataNode Failure

**Detection**:

```
NameNode expects heartbeat every 3 seconds
If no heartbeat for about 10.5 minutes (default: 2 × 5 min recheck interval + 10 × 3 sec heartbeat interval):
  → Mark DataNode as dead
  → Identify under-replicated blocks
  → Schedule re-replication
```
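
The dead-node threshold is derived from two settings rather than configured directly. A small sketch of the calculation, assuming the defaults of `dfs.namenode.heartbeat.recheck-interval` (300000 ms) and `dfs.heartbeat.interval` (3 s); the class and method names are hypothetical.

```java
/** Computes when the NameNode would consider a silent DataNode dead. */
final class DeadNodeTimeout {

    /** timeout = 2 x recheck interval + 10 x heartbeat interval, in milliseconds. */
    static long timeoutMs(long recheckIntervalMs, long heartbeatIntervalSec) {
        return 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
    }

    /** True once the last heartbeat is older than the timeout. */
    static boolean isDead(long lastHeartbeatMs, long nowMs, long timeoutMs) {
        return nowMs - lastHeartbeatMs > timeoutMs;
    }

    public static void main(String[] args) {
        long timeout = timeoutMs(300_000, 3);
        System.out.println("Timeout: " + timeout + " ms");
        System.out.println("Dead after 11 min of silence: "
            + isDead(0, 11 * 60 * 1000, timeout));
    }
}
```

With the defaults this gives 630000 ms, or 10 minutes 30 seconds. Lowering either setting speeds up failure detection at the cost of more false positives during long GC pauses or network hiccups.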

**Recovery Process**:

```java theme={null}
public class NameNodeReplication {

    private void handleDeadDataNode(DatanodeDescriptor deadNode) {
        // Get all blocks stored on dead node
        List<Block> affectedBlocks = deadNode.getBlockList();

        for (Block block : affectedBlocks) {
            // Check current replication level
            int currentReplicas = countReplicas(block);
            int targetReplicas = block.getReplication();

            if (currentReplicas < targetReplicas) {
                // Under-replicated! Schedule re-replication
                addToReplicationQueue(block,
                    targetReplicas - currentReplicas);
            }
        }

        // Remove dead node from cluster
        removeDataNode(deadNode);
    }

    private void processReplicationQueue() {
        while (!replicationQueue.isEmpty()) {
            Block block = replicationQueue.poll();

            // Find nodes with existing replicas
            List<DatanodeDescriptor> sources =
                getDataNodesForBlock(block);

            // No live replica left: the block is missing and
            // cannot be re-replicated from anywhere
            if (sources.isEmpty()) {
                continue;
            }

            // Choose target for new replica
            DatanodeDescriptor target =
                chooseTargetDataNode(sources);

            // Command source to replicate to target
            sources.get(0).addBlockToReplicate(block, target);
        }
    }
}
```

### NameNode Failure (High Availability)

**Problem**: The NameNode is a single point of failure

**Solution**: HA NameNode with ZooKeeper

```
┌─────────────────────────────────────────────┐
│              ZooKeeper Cluster              │
│     (Coordination & Leader Election)        │
└──────────────┬──────────────────────────────┘
               │
        ┌──────┴──────┐
        │             │
        ▼             ▼
┌──────────────┐ ┌──────────────┐
│   Active     │ │   Standby    │
│  NameNode    │ │  NameNode    │
│              │ │              │
│ (Serves      │ │ (Tails edit  │
│  requests)   │ │  log, ready  │
│              │ │  to takeover)│
└──────┬───────┘ └──────┬───────┘
       │                │
       │                │
       ▼                ▼
┌──────────────────────────────┐
│   Shared Edit Log            │
│   (on JournalNodes or NFS)   │
└──────────────────────────────┘
```

**Automatic Failover**:

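1. Active NameNode crashes
2. Its ZKFailoverController (ZKFC) session in ZooKeeper expires, so the failure is detected
3. The Standby's ZKFC wins the election, fences the old Active, and promotes the Standby to Active
4. Clients retry against the other NameNode and continue with the new Active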
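
On the client side, the last step is a retry loop over the configured NameNodes. The following is a self-contained model of that behaviour, not the Hadoop implementation (real clients delegate this to a failover proxy provider set in the client configuration). `FailoverClientSketch` and `NotActiveException` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Client-side failover: try the believed-active NameNode, then the other. */
final class FailoverClientSketch {

    /** Signals that the contacted NameNode is down or in standby state. */
    static final class NotActiveException extends RuntimeException {
        NotActiveException(String message) {
            super(message);
        }
    }

    private final List<String> nameNodes;
    private int current = 0; // index of the NameNode believed to be active

    FailoverClientSketch(List<String> nameNodes) {
        this.nameNodes = nameNodes;
    }

    /** Runs the call, switching NameNodes on failure, at most once per NameNode. */
    <T> T invoke(Function<String, T> call) {
        NotActiveException last = null;
        for (int attempt = 0; attempt < nameNodes.size(); attempt++) {
            try {
                return call.apply(nameNodes.get(current));
            } catch (NotActiveException e) {
                last = e;
                current = (current + 1) % nameNodes.size(); // remember the switch
            }
        }
        throw new IllegalStateException("No active NameNode found", last);
    }

    String currentTarget() {
        return nameNodes.get(current);
    }

    public static void main(String[] args) {
        FailoverClientSketch client =
            new FailoverClientSketch(Arrays.asList("nn1", "nn2"));
        String reply = client.invoke(addr -> {
            if (addr.equals("nn1")) {
                throw new NotActiveException("nn1 is not active");
            }
            return "ok from " + addr;
        });
        System.out.println(reply + ", now using " + client.currentTarget());
    }
}
```

Remembering the switch matters: later calls go straight to the new Active instead of paying for a failed attempt each time.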

***

## HDFS Configuration

### Essential Configuration Files

**core-site.xml** (Cluster-wide settings):

```xml theme={null}
<configuration>
    <!-- NameNode URI -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.example.com:9000</value>
    </property>

    <!-- Temporary directory -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/hadoop/tmp</value>
    </property>
</configuration>
```

**hdfs-site.xml** (HDFS-specific):

```xml theme={null}
<configuration>
    <!-- Replication factor -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>

    <!-- Block size (128MB) -->
    <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>
    </property>

    <!-- NameNode data directory -->
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///var/hadoop/namenode</value>
    </property>

    <!-- DataNode data directories -->
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///disk1/hadoop,file:///disk2/hadoop</value>
    </property>

    <!-- Permissions (disable for testing only!) -->
    <property>
        <name>dfs.permissions.enabled</name>
        <value>true</value>
    </property>
</configuration>
```

***

## Hands-on Labs

### Lab 1: HDFS CLI Operations

```bash theme={null}
# Create directory
hdfs dfs -mkdir -p /user/hadoop/input

# Upload file
hdfs dfs -put localfile.txt /user/hadoop/input/

# List files with block info
hdfs fsck /user/hadoop/input/localfile.txt -files -blocks -locations

# Check file stats (%o = block size, %r = replication, %n = name; %b would print the file size)
hdfs dfs -stat "%o %r %n" /user/hadoop/input/localfile.txt
# Output: blocksize replication filename

# Change replication
hdfs dfs -setrep -w 5 /user/hadoop/input/localfile.txt

# View file content
hdfs dfs -cat /user/hadoop/input/localfile.txt

# Get file to local
hdfs dfs -get /user/hadoop/input/localfile.txt ./downloaded.txt

# Check cluster health
hdfs dfsadmin -report

# Balance cluster
hdfs balancer -threshold 10
```

### Lab 2: Programmatic File Operations

```java theme={null}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HDFSOperations {

    private FileSystem fs;

    public HDFSOperations() throws Exception {
        Configuration conf = new Configuration();
        this.fs = FileSystem.get(conf);
    }

    // Create directory
    public void createDirectory(String path) throws Exception {
        Path dirPath = new Path(path);
        fs.mkdirs(dirPath);
        System.out.println("Created: " + path);
    }

    // List directory contents
    public void listDirectory(String path) throws Exception {
        Path dirPath = new Path(path);
        FileStatus[] statuses = fs.listStatus(dirPath);

        for (FileStatus status : statuses) {
            System.out.printf("%s %10d %s\n",
                status.isDirectory() ? "DIR " : "FILE",
                status.getLen(),
                status.getPath().getName()
            );
        }
    }

    // Get block locations for a file
    public void showBlockLocations(String path) throws Exception {
        Path filePath = new Path(path);
        FileStatus status = fs.getFileStatus(filePath);
        BlockLocation[] blocks = fs.getFileBlockLocations(
            status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            BlockLocation block = blocks[i];
            System.out.println("Block " + i + ":");
            System.out.println("  Hosts: " +
                String.join(", ", block.getHosts()));
            System.out.println("  Offset: " + block.getOffset());
            System.out.println("  Length: " + block.getLength());
        }
    }

    // Delete file/directory
    public void delete(String path, boolean recursive) throws Exception {
        Path targetPath = new Path(path);
        boolean success = fs.delete(targetPath, recursive);
        System.out.println("Deleted: " + success);
    }

    // Copy file within HDFS
    public void copyFile(String src, String dst) throws Exception {
        Path srcPath = new Path(src);
        Path dstPath = new Path(dst);
        FileUtil.copy(fs, srcPath, fs, dstPath, false, fs.getConf());
    }

    public void close() throws Exception {
        fs.close();
    }
}
```

### Lab 3: Monitoring HDFS Health

```java theme={null}
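import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;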

public class HDFSMonitor {

    public static void clusterHealth() throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(conf);

        // Get cluster statistics
        FsStatus status = dfs.getStatus();
        long capacity = status.getCapacity();
        long used = status.getUsed();
        long remaining = status.getRemaining();

        System.out.println("HDFS Cluster Status:");
        System.out.println("  Capacity:  " + formatSize(capacity));
        System.out.println("  Used:      " + formatSize(used) +
            " (" + (used * 100 / capacity) + "%)");
        System.out.println("  Remaining: " + formatSize(remaining));

        // Get DataNode information
        DatanodeInfo[] datanodes = dfs.getDataNodeStats();
        System.out.println("\nDataNodes: " + datanodes.length);

        for (DatanodeInfo node : datanodes) {
            System.out.println("  " + node.getHostName());
            System.out.println("    Capacity:  " +
                formatSize(node.getCapacity()));
            System.out.println("    Used:      " +
                formatSize(node.getDfsUsed()));
            System.out.println("    Remaining: " +
                formatSize(node.getRemaining()));
            System.out.println("    State:     " +
                node.getAdminState());
        }

        dfs.close();
    }

    private static String formatSize(long bytes) {
        long tb = 1024L * 1024 * 1024 * 1024;
        long gb = 1024L * 1024 * 1024;
        long mb = 1024L * 1024;

        if (bytes >= tb) return (bytes / tb) + " TB";
        if (bytes >= gb) return (bytes / gb) + " GB";
        if (bytes >= mb) return (bytes / mb) + " MB";
        return bytes + " B";
    }
}
```

***

## Performance Tuning

### Optimizing Block Size

**Rule of Thumb**: Pick a block size large enough to keep NameNode metadata small, but small enough to leave plenty of blocks for parallel processing

```
Scenario: Processing 1TB of log files

Option 1: 64MB blocks
  → 16,384 blocks
  → More metadata
  → More map tasks

Option 2: 128MB blocks (default)
  → 8,192 blocks
  → Balanced

Option 3: 256MB blocks
  → 4,096 blocks
  → Less metadata
  → Fewer map tasks (less parallelism)

Choice: 128MB for balanced parallelism
```
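
The block counts in this scenario follow from a ceiling division. A minimal sketch that reproduces them (hypothetical class name; it assumes roughly one map task per block, which matches the scenario above but depends on the input format's split settings):

```java
/** Reproduces the block counts for 1TB of data at different block sizes. */
final class BlockSizeTradeoff {

    /** Blocks needed for dataBytes at the given block size (ceiling division). */
    static long blockCount(long dataBytes, long blockSizeBytes) {
        return (dataBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long oneTb = 1024L * 1024 * mb;
        for (long sizeMb : new long[] {64, 128, 256}) {
            long blocks = blockCount(oneTb, sizeMb * mb);
            // Assumption: about one map task per block
            System.out.println(sizeMb + " MB blocks -> " + blocks + " blocks / map tasks");
        }
    }
}
```

Doubling the block size halves both the NameNode entries and the number of tasks available to run in parallel, which is the trade-off the scenario describes.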

### Short-Circuit Reads

When the client and DataNode are on the same machine, the client can read block files directly from local disk and skip the DataNode's network path:

```xml theme={null}
<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
</property>

<property>
    <name>dfs.domain.socket.path</name>
    <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

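**Performance Impact**: Local reads can be noticeably faster (figures of 30-50% are sometimes quoted), but the gain depends on workload and hardware, so measure it on your own cluster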

***

## Common Issues & Troubleshooting

<AccordionGroup>
  <Accordion title="Issue: NameNode Out of Memory" icon="memory">
    **Symptoms**: NameNode crashes, `OutOfMemoryError` in logs

    **Cause**: Too many files/blocks for allocated heap

    **Solutions**:

    1. Increase NameNode heap: `-Xmx16g` → `-Xmx32g`
    2. Enable HDFS Federation (multiple NameNodes)
    3. Merge small files into larger ones
    4. Clean up old/unused data

    **Prevention**: Monitor namespace size, set quotas
  </Accordion>

  <Accordion title="Issue: Under-Replicated Blocks" icon="copy">
    **Symptoms**: `hdfs fsck` shows under-replicated blocks

    **Causes**:

    * DataNode failures
    * Network issues
    * Disk space exhaustion

    **Solutions**:

    ```bash theme={null}
    # Check under-replicated blocks
    hdfs fsck / | grep "Under-replicated"

    # Identify problematic DataNodes
    hdfs dfsadmin -report

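    # Dump blocks waiting for replication to meta.txt in the NameNode's log directory
    # (diagnostic only; the NameNode schedules re-replication on its own)
    hdfs dfsadmin -metasave meta.txt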

    # If urgent, increase replication temporarily
    hdfs dfs -setrep -R 4 /critical/data
    ```
  </Accordion>

  <Accordion title="Issue: DataNode Not Starting" icon="server">
    **Check logs**: `/var/log/hadoop/hadoop-hdfs-datanode-*.log`

    **Common causes**:

    1. **Incompatible cluster ID**:
       ```
       Error: Incompatible clusterIDs
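       Solution: Make the clusterID in the DataNode's VERSION file match the NameNode's,
                 or wipe the DataNode data directory if its blocks are expendable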
       ```

    2. **Port already in use**:
       ```
       Error: Address already in use: 50010
       (50010 is the Hadoop 2 default DataNode port; Hadoop 3 uses 9866)
       Solution: Kill process or change port in hdfs-site.xml
       ```

    3. **Permission issues**:
       ```
       Error: Permission denied on /var/hadoop/datanode
       Solution: chown -R hdfs:hadoop /var/hadoop/datanode
       ```
  </Accordion>
</AccordionGroup>

***

## Interview Focus

<Note>
  **Key Concepts to Master**:

  * Explain NameNode vs DataNode responsibilities
  * Describe block replication algorithm and rack awareness
  * Walk through read/write data flows
  * Discuss Single NameNode limitations and HA solutions
  * Compare HDFS to other storage (S3, traditional file systems)
</Note>

**Sample Questions**:

1. **"Why can't HDFS handle lots of small files efficiently?"**
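   * Answer: Every file and every block is a metadata entry in NameNode RAM, regardless of how much data it holds
   * 1 million 1KB files (about 1GB) = 1 million files and 1 million blocks, versus the same data as 8 files of 128MB = 8 files and 8 blocks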

2. **"How does HDFS ensure data locality for MapReduce?"**
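   * Answer: Block locations from the NameNode are attached to the input splits, and the scheduler (JobTracker in MRv1, the MapReduce ApplicationMaster on YARN in Hadoop 2+) tries to run each map task on a DataNode that stores its block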

3. **"What happens if a client crashes during a write?"**
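   * Answer: The file stays open and incomplete until the client's lease expires; the NameNode then runs lease recovery, settles the length of the last block across replicas, and closes the file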
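
For the small-files question, the comparison can be made concrete by counting the namespace objects the NameNode has to keep in memory. A minimal sketch, assuming one object per file plus one per block and ignoring directories (hypothetical class name):

```java
/** Counts NameNode namespace objects for a set of equally sized files. */
final class SmallFilesCost {

    /** One object per file plus one per block of that file. */
    static long namespaceObjects(long fileCount, long fileSizeBytes, long blockSizeBytes) {
        long blocksPerFile = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
        return fileCount * (1 + blocksPerFile);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        System.out.println("1M files of 1KB:   "
            + namespaceObjects(1_000_000, 1024, blockSize) + " objects");
        System.out.println("8 files of 128MB: "
            + namespaceObjects(8, blockSize, blockSize) + " objects");
    }
}
```

Both layouts hold roughly the same amount of data, yet the small-file layout needs 2,000,000 objects against 16, which is why merging small files is the usual remedy.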

***

## What's Next?

You now understand the HDFS storage layer. Next, learn how to process this data with MapReduce!

<Card title="Module 3: MapReduce Programming Model" icon="code" href="/distributed-systems-tools/hadoop-mapreduce">
  Master distributed data processing with MapReduce patterns and hands-on coding
</Card>
