The Hadoop Distributed File System (HDFS) is the storage foundation of the Hadoop ecosystem. Inspired by Google’s GFS, HDFS implements a distributed file system designed to run on commodity hardware and provide high-throughput access to application data.
NameNode Metadata: fsimage and the Edit Log:
FSIMAGE (Namespace Snapshot)
────────────────────────────
Checkpoint of the entire namespace:
• All directories and files
• File → block mappings
• Permissions, ownership
• Replication factors
• Block sizes

Format: Protobuf (compact binary)
Size: Gigabytes for large clusters
Created: Periodically by the Secondary NameNode

Example:
/fsimage/fsimage_0000000000012345678

EDIT LOG (Transaction Log)
──────────────────────────
All namespace mutations since the last fsimage:
┌──────────────────────────────────────┐
│ TxID: 12345678                       │
│ Op: OP_ADD                           │
│ Path: /user/hadoop/data.txt          │
│ Replication: 3                       │
│ Timestamp: 1642531200                │
├──────────────────────────────────────┤
│ TxID: 12345679                       │
│ Op: OP_ALLOCATE_BLOCK                │
│ Path: /user/hadoop/data.txt          │
│ BlockID: blk_1234567890              │
│ GenStamp: 1001                       │
├──────────────────────────────────────┤
│ TxID: 12345680                       │
│ Op: OP_CLOSE                         │
│ Path: /user/hadoop/data.txt          │
└──────────────────────────────────────┘

Enables:
• Fast recovery (replay edits)
• Durability (survive crashes)
• High availability (standby replays)

RECOVERY PROCESS
────────────────
On NameNode startup:
1. Load the latest fsimage
   ↓
2. Replay edit log entries
   ↓
3. Merge to create the current namespace
   ↓
4. Enter safe mode (read-only)
   ↓
5. Receive block reports from DataNodes
   ↓
6. Verify block replication meets the threshold
   ↓
7. Exit safe mode (read-write)

Time: Minutes to hours (depends on size)

CHECKPOINT PROCESS
──────────────────
Secondary NameNode periodically:
1. Downloads the fsimage and edits from the NameNode
2. Loads the fsimage into memory
3. Applies the edit log transactions
4. Saves the new merged fsimage
5. Uploads it to the NameNode
6. NameNode replaces the old fsimage
7. Starts a new edit log

Why? Keeps the edit log from growing unbounded
Frequency: Hourly, or when edits reach a threshold
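To make the recovery sequence concrete, here is a minimal, illustrative sketch of loading a checkpoint and replaying edit-log operations over an in-memory namespace. It is not Hadoop's actual FSImage/FSEditLog code; the class, record, and operation handling are simplified stand-ins.

```java
// Conceptual sketch of edit-log replay during NameNode recovery.
// NOT Hadoop's FSImage/FSEditLog classes; names here are illustrative only.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ToyNamespace {
    // path -> block IDs (the fsimage checkpoint would populate this)
    private final Map<String, List<Long>> files = new HashMap<>();

    record EditOp(long txId, String op, String path, Long blockId) {}

    // Step 1: load the checkpoint. Step 2: replay every logged mutation in TxID order.
    void recover(Map<String, List<Long>> fsimageSnapshot, List<EditOp> editLog) {
        files.putAll(fsimageSnapshot);                 // load latest fsimage
        for (EditOp e : editLog) {                     // replay edits
            switch (e.op()) {
                case "OP_ADD" -> files.put(e.path(), new java.util.ArrayList<>());
                case "OP_ALLOCATE_BLOCK" -> files.get(e.path()).add(e.blockId());
                case "OP_CLOSE" -> { /* finalize the file; nothing to track in this toy model */ }
                default -> throw new IllegalStateException("unknown op " + e.op());
            }
        }
        // The real NameNode then waits in safe mode for DataNode block reports
        // before it starts serving writes.
    }
}
```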
Deep Dive: NameNode High Availability:
The NameNode is the brain of HDFS. In early versions, its failure meant total cluster downtime. HA solves this by maintaining a “Hot Standby” that can take over in seconds.
HDFS HA ARCHITECTURE WITH QJM
──────────────────────────────
┌────────────────┐              ┌────────────────┐
│   Active NN    │              │   Standby NN   │
│  (Read/Write)  │              │  (Read-only)   │
└───────┬────────┘              └───────┬────────┘
        │     ┌───────────────────┐     │
        └────▶│   JournalNodes    │◀────┘
              │ (Quorum of 3, 5)  │
              └───────────────────┘
┌─────────────┐                      ┌─────────────┐
│    ZKFC     │◀─────(ZooKeeper)────▶│    ZKFC     │
└─────────────┘                      └─────────────┘
The Shared Edit Log (Quorum Journal Manager):
Instead of a single local edits file, the Active NN sends edits to a set of JournalNodes (JNs).
A write is successful only if a quorum (majority) of JNs acknowledge it.
Standby NN continuously reads (tails) these edits from the JNs to keep its in-memory namespace in sync.
No Shared Storage Required: Unlike NFS-based HA, QJM doesn’t have a single point of failure in the storage layer.
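A sketch of the core hdfs-site.xml/core-site.xml settings behind a QJM deployment, expressed as Hadoop Configuration calls. The nameservice ID mycluster and the hostnames are placeholder examples; the property names are the standard HA keys.

```java
// Core client/NameNode settings for QJM-based HA (values are examples).
import org.apache.hadoop.conf.Configuration;

public class HaQjmConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // The Active NN writes edits to a quorum of JournalNodes, not just local disk.
        conf.set("dfs.namenode.shared.edits.dir",
                 "qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster");
        // Clients discover which NN is Active through a failover proxy provider.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        // Automatic failover via ZKFC + ZooKeeper (covered below).
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}
```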
JournalNode Mechanics:
JNs are lightweight processes, usually run on the same nodes as NameNodes or other masters.
Each writer (the Active NN) is assigned an increasing Epoch Number, and every edit it sends to the JNs is tagged with that epoch.
Fencing: If a new NameNode becomes active, it increments the epoch. JNs will then reject any further writes from the old Active NN (preventing “split-brain”).
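A toy illustration of the epoch rule, assuming a simplified JournalNode that only tracks the highest epoch it has promised. Hadoop's real JournalNode protocol carries more state, but the accept/reject decision looks like this:

```java
// Toy epoch-based fencing on a JournalNode. Not Hadoop's implementation;
// it only shows the accept/reject rule that prevents split-brain writes.
class ToyJournalNode {
    private long promisedEpoch = 0;   // highest epoch this JN has accepted

    // Called when a NameNode becomes Active and claims a new, higher epoch.
    synchronized boolean newEpoch(long epoch) {
        if (epoch <= promisedEpoch) {
            return false;             // stale claimant, refuse
        }
        promisedEpoch = epoch;
        return true;
    }

    // Called for every batch of edits. A fenced-off (old Active) NameNode
    // still holds its old epoch, so its writes are rejected here.
    synchronized void journal(long writerEpoch, long txId, byte[] edits) {
        if (writerEpoch < promisedEpoch) {
            throw new IllegalStateException(
                "Rejecting edits from epoch " + writerEpoch +
                "; a newer writer holds epoch " + promisedEpoch);
        }
        // ...append the edits durably, indexed by txId...
    }
}
```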
Automatic Failover (ZKFC & ZooKeeper):
ZKFC (ZooKeeper Failover Controller): A separate process running on each NN node.
It monitors the local NN’s health and maintains a session in ZooKeeper.
Active Election: ZKFCs compete to create an ephemeral node in ZooKeeper (/hadoop-ha/.../ActiveStandbyElectorLock).
The winner triggers its local NN to become Active.
If the Active NN or its ZKFC crashes, the lock is released, and the Standby ZKFC immediately wins the election.
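The election itself boils down to the ephemeral-znode pattern. A minimal sketch using the ZooKeeper Java client follows; the lock path and connect string are placeholders, and the real ZKFC wraps this logic in its ActiveStandbyElector.

```java
// Minimal sketch of the ephemeral-lock election ZKFC relies on.
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ToyElector {
    private static final String LOCK = "/hadoop-ha/mycluster/ActiveStandbyElectorLock";

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 5000, event -> {});
        try {
            // Ephemeral: the node vanishes when our session dies, which is what
            // releases the lock if the Active NN (or its ZKFC) crashes.
            zk.create(LOCK, "nn1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.EPHEMERAL);
            System.out.println("Won election -> tell the local NameNode to go Active");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("Lost election -> stay Standby and watch the lock");
            zk.exists(LOCK, event -> System.out.println("Lock released, re-elect"));
        }
        // Keep the session (and thus the lock) alive while this process runs.
    }
}
```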
The Fencing Mechanism:
Crucial for preventing two NameNodes from thinking they are both Active.
QJM Fencing: As mentioned, JNs reject old epochs.
SSH Fencing: The Standby NN can log into the old Active node and kill -9 the process.
Fencing Script: Custom scripts can power off the node (PDU) or disable the network port.
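A sketch of the corresponding fencing configuration, assuming sshfence is tried first and a custom shell script is the fallback; the script path and key file are placeholder examples.

```java
// Fencing methods consulted during failover, tried in order (values are examples).
import org.apache.hadoop.conf.Configuration;

public class FencingConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // QJM epoch fencing always applies; these methods are the extra layer:
        // sshfence kills the NameNode process on the old Active host,
        // shell(...) can power the node off via a PDU or disable its port.
        conf.set("dfs.ha.fencing.methods",
                 "sshfence\nshell(/usr/local/bin/fence-nn.sh)");
        conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");
        return conf;
    }
}
```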
Block Reports in HA:
In non-HA, only one NN gets reports.
In HA, DataNodes send heartbeats and block reports to BOTH NameNodes.
This allows the Standby to have a near-perfect map of block locations, enabling a “Hot” failover without waiting for a full block report.
Comparison: QJM vs. NFS for HA:
Feature            QJM (Recommended)          NFS (Legacy)
─────────────────  ─────────────────────────  ─────────────────────────
Shared storage     Quorum of JournalNodes     Single NFS filer
Fault tolerance    Can lose (N-1)/2 JNs       NFS is a SPoF
Consistency        Strong (quorum-based)      Depends on the NFS server
Network            Standard TCP               Requires robust NAS/SAN
Scaling the Namespace with Federation:
While HA provides reliability, Federation provides scalability. In a single-namespace cluster, the NameNode’s heap is the limit: every file, directory, and block object costs roughly 150 bytes of memory, so a namespace with 500 million blocks needs on the order of 75 GB of heap for block objects alone. Federation sidesteps this ceiling by running multiple independent NameNodes, each responsible for a portion of the namespace, while all of them share the same pool of DataNodes for block storage.
Rack-Aware Block Placement:
PLACEMENT ALGORITHM
───────────────────
Input: Block to replicate
Output: List of DataNode targets

Factors considered:
1. Rack diversity (fault tolerance)
2. Network bandwidth cost
3. Even distribution across the cluster
4. Available disk space
5. DataNode load

For Replication Factor = 3:

Case 1: Writer on a DataNode in the cluster
──────────────────────────────────────
1st replica: Same DataNode as the writer
2nd replica: DataNode on a different rack
3rd replica: Different DataNode, same rack as the 2nd

Example:
Writer on DN1 (Rack 1)
→ Replicas: DN1 (Rack 1), DN4 (Rack 2), DN5 (Rack 2)

Case 2: Writer outside the cluster (client)
────────────────────────────────────────
1st replica: Random DataNode
2nd replica: DataNode on a different rack
3rd replica: Different DataNode, same rack as the 2nd

Example:
External client
→ Replicas: DN2 (Rack 1), DN3 (Rack 2), DN4 (Rack 2)

For Replication Factor > 3:
───────────────────────────
4th+ replicas: Random placement

Constraints:
• No more than 2 replicas on the same rack
• Prefer under-utilized nodes
• Balance across racks

BENEFITS
────────
1. Fault Tolerance: Can lose any rack and still have the data
2. Read Performance: Multiple locations → parallel reads; choose the replica closest to the reader
3. Write Performance: Only 1 off-rack transfer; 2 of 3 replicas use the fast rack-local network
4. Load Balancing: Distribute popular blocks widely
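A simplified sketch of that 3-replica rule, assuming toy Node/rack types rather than Hadoop's BlockPlacementPolicyDefault:

```java
// Simplified sketch of the default 3-replica placement rule described above.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

record Node(String name, String rack) {}

class ToyPlacementPolicy {
    private final Random rnd = new Random();

    List<Node> chooseTargets(Node writer, List<Node> cluster) {
        List<Node> targets = new ArrayList<>();
        // 1st replica: the writer's own node if it is a DataNode, else a random node.
        Node first = (writer != null) ? writer : cluster.get(rnd.nextInt(cluster.size()));
        targets.add(first);
        // 2nd replica: any node on a *different* rack (off-rack, for fault tolerance).
        Node second = pick(cluster, n -> !n.rack().equals(first.rack()));
        targets.add(second);
        // 3rd replica: a different node on the *same* rack as the 2nd (cheap rack-local copy).
        Node third = pick(cluster, n -> n.rack().equals(second.rack()) && !n.equals(second));
        targets.add(third);
        return targets;
    }

    private Node pick(List<Node> cluster, Predicate<Node> ok) {
        List<Node> candidates = cluster.stream().filter(ok).toList();
        return candidates.get(rnd.nextInt(candidates.size()));  // assumes a candidate exists
    }
}
```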
Handling Under-Replication:
TRIGGERS FOR RE-REPLICATION
──────────────────────────
1. DataNode Failure
   DN3 dies → all its blocks become under-replicated
   NameNode detects it (no heartbeat)
   Initiates re-replication immediately
2. Disk Failure
   DN reports a disk error
   Blocks on that disk → under-replicated
   Re-replicate from other replicas
3. Replication Factor Increase
   User increases replication: 3 → 5
   NameNode creates 2 more replicas per block
4. Corrupt Block Detected
   Checksum mismatch
   Mark the block corrupt
   Re-replicate from good replicas

PRIORITY QUEUE
──────────────
Blocks are prioritized by replication level:
Priority 1: 0 replicas (data loss imminent!)
Priority 2: 1 replica (one more failure = loss)
Priority 3: < replication factor
Process the highest priority first

RE-REPLICATION PROCESS
──────────────────────
1. NameNode identifies under-replicated blocks
   ↓
2. Selects a source DataNode (has a good replica)
   ↓
3. Selects a target DataNode (where to copy)
   Considers:
   • Rack diversity
   • Available space
   • Current load
   ↓
4. Sends command: DN_source → copy block to DN_target
   ↓
5. Source initiates the transfer to the target
   ↓
6. Target stores the block
   ↓
7. Target reports to the NameNode
   ↓
8. NameNode updates the block map

THROTTLING
──────────
Limit re-replication bandwidth:
• Default: 2 concurrent replication transfers per DataNode
• Prevents network saturation
• Avoids impacting client requests
• Configurable: dfs.namenode.replication.max-streams

During large-scale failures:
• May take hours to re-replicate all blocks
• Cluster degraded but operational
• Prioritization ensures critical data first
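A toy model of the priority ordering, assuming a plain PriorityQueue keyed on the live-replica count rather than the NameNode's real data structure:

```java
// Toy model of repair ordering: fewest live replicas first.
import java.util.PriorityQueue;

record UnderReplicatedBlock(long blockId, int liveReplicas, int targetReplicas) {}

class ReplicationQueueSketch {
    // Blocks with 0 live replicas come out first, then 1, then the rest.
    private final PriorityQueue<UnderReplicatedBlock> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.liveReplicas(), b.liveReplicas()));

    void blockBecameUnderReplicated(UnderReplicatedBlock b) {
        if (b.liveReplicas() < b.targetReplicas()) {
            queue.add(b);
        }
    }

    UnderReplicatedBlock nextToRepair() {
        return queue.poll();   // most at-risk block
    }
}
```

On the client side, trigger 3 above is a single call, FileSystem.setReplication(path, (short) 5); the NameNode then queues the extra copies exactly as sketched.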
Graceful Node Removal:
DECOMMISSIONING WORKFLOW
────────────────────────
Scenario: Remove DN3 from the cluster (hardware upgrade)

1. Administrator marks DN3 for decommission
   Edit: hdfs-site.xml (exclude file)
   Command: hdfs dfsadmin -refreshNodes
2. NameNode puts DN3 in the "Decommissioning" state
   DN3 still sends heartbeats
   No new blocks are assigned to DN3
3. NameNode identifies every block with a replica on DN3
   Creates a list of blocks to replicate elsewhere
4. Re-replication begins
   For each block on DN3:
   • Select a target DataNode
   • Copy the block to the target
   • Update the block map
5. Monitor progress
   Command: hdfs dfsadmin -report
   Shows: "Decommissioning in progress: 45% complete"
6. All blocks copied
   DN3 state: "Decommissioned"
   DN3 holds zero unique blocks
7. Safe to power down DN3
   No data loss
   Cluster continues normally

WHY DECOMMISSION (don't just power off!)
────────────────────────────────────────
Without decommissioning:
✗ Immediate under-replication
✗ NameNode thinks DN3 is dead
✗ Emergency re-replication is triggered
✗ Network saturation
✗ Potential data loss if multiple failures overlap

With decommissioning:
✓ Gradual, controlled process
✓ No data loss risk
✓ Network-friendly
✓ Clean removal

TIME ESTIMATE
─────────────
Depends on:
• Amount of data on the node
• Network bandwidth
• Cluster load
• Number of concurrent decommissions
Typical: Hours to days for heavily loaded nodes
Cluster Load Balancing:
THE BALANCING PROBLEM
─────────────────────
Over time, the cluster becomes imbalanced:
DN1: 90% full ▓▓▓▓▓▓▓▓▓░
DN2: 45% full ▓▓▓▓░░░░░░
DN3: 80% full ▓▓▓▓▓▓▓▓░░
DN4: 30% full ▓▓▓░░░░░░░

Causes:
• New nodes added (they start empty)
• Uneven write patterns
• Some nodes have larger disks
• Decommissioned nodes

Problems:
✗ Some nodes fill up first
✗ New writes fail (no space)
✗ Uneven read load
✗ Inefficient resource use

HDFS BALANCER
─────────────
Standalone tool that redistributes blocks:

Command:
$ hdfs balancer -threshold 10

Meaning:
• Target: All nodes within 10% of the average utilization
• Average: 61.25% full
• Goal: Each node between 51.25% and 71.25%

BALANCER ALGORITHM
──────────────────
1. Calculate the cluster's average utilization
   Total used / Total capacity = 61.25%
2. Identify over-utilized nodes (> average + threshold)
   DN1: 90% (over by 28.75%)
   DN3: 80% (over by 18.75%)
3. Identify under-utilized nodes (< average - threshold)
   DN2: 45% (under by 16.25%)
   DN4: 30% (under by 31.25%)
4. For each over-utilized node:
   Select blocks to move
   Choose an under-utilized destination
   Initiate block transfers
5. Repeat until balanced or the maximum iterations are reached

CONSTRAINTS
───────────
The balancer obeys the replication policy:
✓ Maintains rack diversity
✓ Doesn't reduce replicas below the replication factor
✓ Respects rack awareness

Throttling:
• Default: 10 MB/s per DataNode
• Configurable: dfs.datanode.balance.bandwidthPerSec
• Runs in the background, low priority

Safety:
✓ Only moves replicas (doesn't delete data)
✓ Verifies checksums after the move
✓ Updates NameNode metadata atomically
✓ Can be stopped safely at any time

BEST PRACTICES
──────────────
When to run:
→ After adding new nodes
→ When utilization becomes skewed
→ During maintenance windows (low load)

How often:
→ Weekly or monthly (depends on write rate)
→ Not constantly (overhead)

Monitoring:
→ The balancer logs bytes moved per iteration to its console
→ Watch cluster balance: hdfs dfsadmin -report
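A small sketch of the balancer's first step using the numbers above, assuming equal-capacity nodes (the real balancer weights utilization by capacity):

```java
// Toy version of the balancer's classification step: compute the average
// utilization and split nodes into over- and under-utilized sets.
import java.util.LinkedHashMap;
import java.util.Map;

public class BalancerMath {
    public static void main(String[] args) {
        Map<String, Double> utilization = new LinkedHashMap<>();
        utilization.put("DN1", 90.0);
        utilization.put("DN2", 45.0);
        utilization.put("DN3", 80.0);
        utilization.put("DN4", 30.0);
        double threshold = 10.0;   // hdfs balancer -threshold 10

        double avg = utilization.values().stream()
                                .mapToDouble(Double::doubleValue).average().orElse(0);
        System.out.printf("Average utilization: %.2f%%%n", avg);   // 61.25%

        utilization.forEach((node, pct) -> {
            if (pct > avg + threshold) {
                System.out.printf("%s over-utilized by %.2f%% -> move blocks off%n", node, pct - avg);
            } else if (pct < avg - threshold) {
                System.out.printf("%s under-utilized by %.2f%% -> receive blocks%n", node, avg - pct);
            } else {
                System.out.printf("%s within threshold%n", node);
            }
        });
    }
}
```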
The Write Pipeline:
SEQUENTIAL REPLICATION (Slow):
──────────────────────────────
Client ──▶ DN1 (write 128MB)
              ↓
           DN1 ──▶ DN3 (copy 128MB)
                      ↓
                   DN3 ──▶ DN4 (copy 128MB)

Time: 1s + 1s + 1s = 3 seconds
Bandwidth: Network links used one at a time

PIPELINED REPLICATION (Fast):
─────────────────────────────
Client ──▶ DN1 ──▶ DN3 ──▶ DN4

All transfers happen simultaneously:
Packet 1: Client→DN1   DN1→DN3   DN3→DN4
Packet 2: Client→DN1   DN1→DN3   DN3→DN4
Packet 3: Client→DN1   DN1→DN3   DN3→DN4

Time: ~1 second
Bandwidth: All links utilized in parallel

PACKET STRUCTURE
────────────────
Packet (64KB):
┌────────────────────────────────┐
│ Header                         │
│ • Sequence number              │
│ • Block ID                     │
│ • Offset in block              │
├────────────────────────────────┤
│ Data (64KB)                    │
├────────────────────────────────┤
│ Checksums                      │
│ • Per 512 bytes                │
└────────────────────────────────┘

ACK MECHANISM
─────────────
Each DataNode:
1. Receives the packet
2. Verifies the checksums
3. Writes to disk
4. Forwards to the next node (if not last)
5. Waits for an ACK from downstream
6. Sends an ACK upstream

ACK Packet:
┌────────────────────────────────┐
│ Sequence number                │
│ Status: SUCCESS/FAILURE        │
│ Failed replicas (if any)       │
└────────────────────────────────┘

BENEFITS
────────
✓ ~3x faster than sequential replication
✓ Better network utilization
✓ Lower latency for clients
✓ Scales with the replication factor
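From the client's side, the whole pipeline is hidden behind the FileSystem API; a minimal write looks like this (the path and replication factor are example values):

```java
// Client-side view of the write path: create() opens the pipeline,
// write() streams packets through it, close() finalizes the block.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class PipelineWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site/hdfs-site
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/hadoop/data.txt");
            short replication = 3;                           // 3-node pipeline as shown above
            try (FSDataOutputStream out = fs.create(file, replication)) {
                // Data is buffered into ~64KB packets and streamed down the pipeline;
                // the client only advances once ACKs flow back upstream.
                out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
            }   // close() finalizes the last block and reports to the NameNode
        }
    }
}
```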
Pipeline Failure Recovery:
SCENARIO: DN3 Fails During a Write
────────────────────────────────
Original pipeline:
Client ──▶ DN1 ──▶ DN3 ──▶ DN4
                    ❌ (fails)

Step 1: Detect the Failure
──────────────────────
DN1 doesn't receive an ACK from DN3
DN1 times out (configurable, e.g. 60s)
DN1 reports to the Client

Step 2: Rebuild the Pipeline
────────────────────────
Client:
• Removes DN3 from the pipeline
• Current replicas: DN1, DN4 (only 2!)
• Continues with the reduced pipeline

New pipeline:
Client ──▶ DN1 ──▶ DN4

Step 3: Resume Writing
──────────────────────
The client resends packets that were not ACKed:
• Keeps track of unacknowledged packets
• Resends from the last successful packet
• No data loss

Step 4: Inform the NameNode
───────────────────────
Client ──▶ NameNode: "DN3 failed during write of blk_1234567890"
NameNode:
• Marks DN3 as dead (or suspect)
• Notes the block is under-replicated

Step 5: Post-Write Re-Replication
─────────────────────────────────
After the file is closed:
NameNode schedules re-replication
The block is copied from DN1 or DN4 to a new DN (e.g., DN5)
Final replicas: [DN1, DN4, DN5]

MULTIPLE FAILURES
─────────────────
If 2 DataNodes fail:
Still 1 replica left
Continue writing
Re-replicate after close

If all 3 fail:
The write fails
Client retries with a new pipeline
NameNode selects different DataNodes

BLOCK VERSIONS
──────────────
Each recovery attempt creates a new generation stamp:
Attempt 1: blk_1234567890, gen_stamp: 1001
Attempt 2: blk_1234567890, gen_stamp: 1002
Prevents confusion with partial writes
Old generation stamps are eventually garbage collected

CLIENT-SIDE ROBUSTNESS
──────────────────────
The client maintains:
• Packet queue (unacknowledged packets)
• Maximum retries
• Timeout configuration
• Error counters

If there are too many failures:
→ Report to the application
→ The application decides whether to retry or fail
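How aggressively the client repairs a broken pipeline is configurable. A sketch of the relevant client properties follows; the keys are standard HDFS client settings, but treat the values shown as illustrative rather than authoritative defaults for every release.

```java
// Client-side knobs governing pipeline recovery (values are illustrative).
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryConfigSketch {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Whether to ask the NameNode for a replacement DataNode when one drops out.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
        // DEFAULT replaces only for larger pipelines/appends; ALWAYS and NEVER are the other policies.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
        // Keep writing with fewer replicas if a replacement cannot be found.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", false);
        return conf;
    }
}
```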
Ensuring Data Integrity:
CHECKSUM VERIFICATION
─────────────────────
Write path:
1. Client computes checksums
   One per 512 bytes of data
2. Sends data + checksums in each packet
3. Each DataNode:
   • Receives the packet
   • Verifies the checksums
   • Writes the data to disk
   • Stores the checksums separately
   • Forwards to the next DataNode
4. If a checksum mismatch occurs:
   • The DataNode reports the error
   • The client excludes that DataNode from the pipeline
   • Rebuilds the pipeline without the bad node

ATOMIC BLOCK CREATION
─────────────────────
Blocks are created atomically:

During the write:
• Block is in the "under construction" state
• Not visible to readers
• Stored in a temporary location

After close:
• Block finalized
• Moved to its final location
• Made visible to readers
• Generation stamp finalized

If a crash occurs before finalization:
• Incomplete blocks are deleted
• No partial data is visible

LEASE MECHANISM
───────────────
Prevents concurrent writes:
Client ──▶ NameNode: "open for write"
NameNode grants an exclusive lease (60s)

The client must renew the lease:
• Heartbeat every 30s
• While the file is open

If the client crashes:
• The lease expires after 60s
• NameNode recovers the file
• Discards the incomplete last block

If another client tries to write the same file:
NameNode: "Lease already held, wait or fail"

PIPELINE ACK
────────────
Guarantees all replicas are written:

Client sends a packet
   ↓
DN1 writes ──▶ DN3 writes ──▶ DN4 writes
                                  ↓
                              DN4 ACKs
                    ↓
              DN3 ACKs (after DN4)
      ↓
DN1 ACKs (after DN3)
   ↓
Client receives the ACK

Only when all replicas confirm:
→ The client considers the packet written
→ Moves on to the next packet

FAILURE ATOMICITY
─────────────────
Either all replicas succeed or none:

Success: All ACKs received
→ Packet committed

Partial failure: Some ACK, some don't
→ Remove the failed DataNodes
→ Resend the packet
→ Continue with the reduced pipeline

Complete failure: No ACKs
→ Retry the packet
→ If the failure persists, fail the entire block
→ Start a new block
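A self-contained illustration of per-chunk checksumming, assuming plain CRC32 per 512-byte chunk to keep the sketch dependency-free (HDFS itself defaults to CRC32C, with the chunk size set by dfs.bytes-per-checksum):

```java
// Per-chunk checksums: one CRC per 512 bytes, recomputed and compared on receipt.
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int BYTES_PER_CHECKSUM = 512;   // mirrors dfs.bytes-per-checksum

    static List<Long> compute(byte[] data) {
        List<Long> sums = new ArrayList<>();
        for (int off = 0; off < data.length; off += BYTES_PER_CHECKSUM) {
            CRC32 crc = new CRC32();
            crc.update(data, off, Math.min(BYTES_PER_CHECKSUM, data.length - off));
            sums.add(crc.getValue());
        }
        return sums;
    }

    // A receiving DataNode recomputes and compares; any mismatch marks that copy corrupt.
    static boolean verify(byte[] data, List<Long> expected) {
        return compute(data).equals(expected);
    }
}
```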
Appending to Existing Files:
APPEND WORKFLOW
───────────────
Use case: Log aggregation
Multiple writers append to the same file over time (one at a time; see concurrency control below)

Step 1: Request Append
──────────────────────
Client ──▶ NameNode: "append /var/log/app.log"
NameNode:
• Checks permissions
• Verifies the file exists
• Grants a lease to the client
• Returns the last block's location

NameNode ──▶ Client:
  Last block: blk_999, size: 64MB
  Locations: [DN2, DN5, DN8]

Step 2: Append Data
───────────────────
If the last block < 128MB:
• Append to the existing block
• Establish a pipeline with the existing replicas
• Continue from the last offset

If the last block is full (128MB):
• Allocate a new block
• Normal write pipeline

Step 3: Pipeline Synchronization
────────────────────────────────
Critical: All replicas must agree on the block state

Client ──▶ DN2, DN5, DN8:
  "What's your last confirmed offset for blk_999?"

Responses:
DN2: 67108864 bytes (64MB)
DN5: 67108864 bytes (64MB)
DN8: 67108864 bytes (64MB)
All agree → Continue from 64MB

If there is a mismatch:
DN2: 64MB
DN5: 64MB
DN8: 63MB ← Out of sync!

Resolution:
• Use the minimum confirmed offset
• Truncate the longer replicas
• Restart from 63MB

Step 4: Continue Writing
────────────────────────
Same as a normal write:
• Packet streaming
• ACK protocol
• Checksum verification

CONCURRENCY CONTROL
───────────────────
Only ONE writer at a time:
Client A holds the lease → Can append
Client B tries to append → Blocked until the lease is released

Lease duration: 60 seconds
Renewable while writing

If Client A crashes:
• The lease expires
• NameNode recovers the file
• Client B can now append

USE CASES
─────────
Good for:
✓ Log aggregation
✓ Audit trails
✓ Sequential data streams

Not for:
✗ Random writes
✗ Frequent small appends (overhead)
✗ Multiple concurrent writers (lease contention)

LIMITATIONS
───────────
Performance:
⚠ Append is slower than a fresh write
⚠ Pipeline synchronization overhead
⚠ Potential for replica divergence

Recommendation:
→ For high-volume logs, roll over to new files periodically
→ Use append only for occasional additions
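The client-side entry point is FileSystem.append(); a minimal sketch follows (the log path is an example):

```java
// Client-side append: append() resumes the last block's pipeline, or starts
// a new block if the last one is already full.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path log = new Path("/var/log/app.log");
            // append() acquires the write lease; a second concurrent caller on the
            // same path fails until this lease is released or expires.
            try (FSDataOutputStream out = fs.append(log)) {
                out.write("2022-01-18 12:00:00 INFO something happened\n"
                          .getBytes(StandardCharsets.UTF_8));
                out.hflush();   // make the appended bytes visible to new readers
            }
        }
    }
}
```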