Production Deployment & Operations
Module Duration: 3-4 hours
Focus: Real-world deployment, security, monitoring, troubleshooting
Outcome: Production-ready operational knowledge
Cluster Planning
Hardware Selection
NameNode:
CPU: 16-24 cores (high clock speed)
RAM: 128-256GB (1GB per million blocks)
Disk: RAID-10 SSDs for metadata
Network: 10GbE bonded
Example:
100M files × 2 blocks per file (avg) = 200M blocks
RAM needed: 200GB
Choose: 256GB server
DataNode:
CPU: 8-16 cores
RAM: 64-128GB
Disk: 12-24 × 4-8TB SATA (JBOD, not RAID)
Network: 10GbE
Calculation:
24 disks × 6TB = 144TB raw per node
With replication factor 3: 48TB usable per node
Edge Node (client gateway):
CPU: 8 cores
RAM: 32-64GB
Disk: 500GB SSD
Network: 10GbE
Cluster Sizing
Target capacity: 1PB usable
Calculation:
1PB usable × 3 (replication) = 3PB raw
1PB usable / 48TB usable per node ≈ 21 DataNodes
Add ~10% overhead and round up → 24 DataNodes
Total cluster:
1 Active NameNode
1 Standby NameNode
24 DataNodes
3 ZooKeeper nodes
2-3 Edge nodes
= 31-32 nodes
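The same sizing arithmetic as a quick one-liner, so it can be rerun with other disk layouts or capacity targets (the values match the example above):
# usable target (TB) x replication / raw TB per node, plus ~10% headroom
awk 'BEGIN { target=1000; repl=3; raw=144; n=target*repl/raw; printf "%.1f DataNodes, about %.0f with 10%% headroom\n", n, n*1.1 }'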
Network Topology
┌──────────────────────────────────────────────────┐
│ Core Switch (40GbE) │
└────────┬─────────────────────┬───────────────────┘
│ │
┌────▼────┐ ┌─────▼────┐
│ Rack 1 │ │ Rack 2 │
│ Switch │ │ Switch │
│ (10GbE) │ │ (10GbE) │
└────┬────┘ └─────┬────┘
│ │
┌────┼──────┬───────┐ │
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
NN1 NN2 DN1-15 DN16-30 ...
Rack Awareness Configuration:
#!/bin/bash
# /etc/hadoop/topology.sh
# Map a DataNode IP address to its rack location
case $1 in
  10.1.1.*) echo "/dc1/rack1" ;;
  10.1.2.*) echo "/dc1/rack2" ;;
  10.1.3.*) echo "/dc1/rack3" ;;
  *)        echo "/default-rack" ;;
esac
<!-- core-site.xml -->
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/topology.sh</value>
</property>
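The script must be executable by the user running the NameNode, and the mapping only takes effect after a NameNode restart. It is easy to verify by hand and then from HDFS itself:
# Test the script directly with a sample DataNode IP
chmod +x /etc/hadoop/topology.sh
/etc/hadoop/topology.sh 10.1.1.23    # expected output: /dc1/rack1
# After restarting the NameNode, confirm nodes land in the right racks
hdfs dfsadmin -printTopology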
Security
Kerberos Authentication
1. Install Kerberos:
# On KDC server
yum install krb5-server krb5-libs krb5-workstation
# Edit /etc/krb5.conf
[libdefaults]
default_realm = HADOOP.EXAMPLE.COM
dns_lookup_realm = false
dns_lookup_kdc = false
[realms]
HADOOP.EXAMPLE.COM = {
kdc = kdc.example.com
admin_server = kdc.example.com
}
# Initialize database
kdb5_util create -s
# Start KDC
systemctl start krb5kdc
systemctl start kadmin
2. Create Principals:
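A minimal sketch of creating the service principals and keytabs with kadmin.local, assuming a single NameNode host (namenode1.example.com is a placeholder) and the keytab paths used in the configuration below; the DataNode principal is repeated per host:
# Create per-host service principals (hostnames are examples)
kadmin.local -q "addprinc -randkey nn/namenode1.example.com@HADOOP.EXAMPLE.COM"
kadmin.local -q "addprinc -randkey HTTP/namenode1.example.com@HADOOP.EXAMPLE.COM"
kadmin.local -q "addprinc -randkey dn/datanode1.example.com@HADOOP.EXAMPLE.COM"
# Export keytabs and copy each to its host with restrictive permissions
kadmin.local -q "ktadd -k /etc/security/keytabs/nn.service.keytab nn/namenode1.example.com HTTP/namenode1.example.com"
kadmin.local -q "ktadd -k /etc/security/keytabs/dn.service.keytab dn/datanode1.example.com"
chmod 400 /etc/security/keytabs/*.keytab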
3. Configure Hadoop:
<!-- core-site.xml -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/security/keytabs/nn.service.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@HADOOP.EXAMPLE.COM</value>
</property>
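Once the daemons are restarted with these settings, it is worth confirming that Kerberos is actually enforced. A quick sanity check from an edge node, assuming a test user principal (alice) exists:
# Without a ticket, access should fail with an authentication error
kdestroy
hdfs dfs -ls /
# With a valid ticket, the same command should succeed
kinit alice@HADOOP.EXAMPLE.COM
hdfs dfs -ls /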
Encryption
Data at Rest (HDFS Transparent Encryption):
# Create encryption zone
hadoop key create mykey
hdfs crypto -createZone -keyName mykey -path /encrypted
# Files in /encrypted automatically encrypted
hdfs dfs -put sensitive.txt /encrypted/
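Note that hadoop key create needs a key provider, typically Hadoop KMS configured via hadoop.security.key.provider.path in core-site.xml. A quick check that the key and zone are in place, assuming KMS is already running:
# List keys known to the configured provider and the encryption zones that use them
hadoop key list
hdfs crypto -listZones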
Data in Transit:
<!-- hdfs-site.xml -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.algorithm</name>
  <value>3des</value>
</property>
Authorization (Apache Ranger)
Ranger Policy Example:
┌─────────────────────────────────────────┐
│ Policy: Production Data Access │
├─────────────────────────────────────────┤
│ Resource: │
│ Path: /prod/* │
│ │
│ Permissions: │
│ Group: data-engineers │
│ - Read, Write, Execute │
│ │
│ Group: analysts │
│ - Read only │
│ │
│ User: admin │
│ - All permissions │
└─────────────────────────────────────────┘
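Ranger policies are created in the Ranger Admin UI and enforced by the HDFS plugin. For clusters without Ranger, roughly the same access matrix can be approximated with native HDFS ACLs; a sketch using the groups from the policy above (group names are examples, and ACLs must be enabled via dfs.namenode.acls.enabled):
# Full access for data engineers, read-only for analysts, under /prod
hdfs dfs -setfacl -R -m group:data-engineers:rwx /prod
hdfs dfs -setfacl -R -m group:analysts:r-x /prod
# Verify the resulting ACLs
hdfs dfs -getfacl /prod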
High Availability (HA)
NameNode HA with QJM
Architecture:
┌──────────┐         ┌──────────┐
│  Active  │◄───────►│ Standby  │
│ NameNode │         │ NameNode │
└────┬─────┘         └────┬─────┘
     │                    │
     │     Edit logs      │
     ▼                    ▼
┌────────────────────────────────┐
│       JournalNode Cluster      │
│     (Quorum: 3 or 5 nodes)     │
└────────────────────────────────┘
                ▲
                │
          ┌─────┴─────┐
          │ ZooKeeper │
          │ (Failover │
          │   Coord)  │
          └───────────┘
Configuration:
<!-- hdfs-site.xml -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<!-- Fencing must be configured even with QJM; shell(/bin/true) is the minimal no-op choice -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>

<!-- core-site.xml (ZooKeeper hosts below are examples) -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
Initialize and Start:
# Format ZooKeeper for failover
hdfs zkfc -formatZK
# Start JournalNodes
hadoop-daemon.sh start journalnode
# Format NameNode 1
hdfs namenode -format
# Start NameNode 1
hadoop-daemon.sh start namenode
# Bootstrap standby NameNode 2
hdfs namenode -bootstrapStandby
# Start NameNode 2
hadoop-daemon.sh start namenode
# Start a ZKFailoverController on each NameNode host
hadoop-daemon.sh start zkfc
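With both NameNodes and their ZKFCs running, confirm the HA state before putting the cluster into service; a quick check using the nn1/nn2 service IDs from the configuration above:
# One NameNode should report active, the other standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# Optionally exercise a controlled failover to prove end-to-end failover works
hdfs haadmin -failover nn1 nn2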
Monitoring
Key Metrics to Monitor
NameNode Metrics:
- Heap memory usage (alert at 80%)
- GC time (should be < 1% of total time)
- RPC queue time
- Number of under-replicated blocks
- Number of missing blocks
- Number of dead DataNodes
DataNode Metrics:
- Disk usage (alert at 85%)
- Disk I/O wait time
- Network throughput
- Number of failed volumes
- Heartbeat latency to NameNode
YARN Metrics:
- Cluster memory utilization
- Cluster CPU utilization
- Queue capacity usage
- Number of pending applications
- Application success/failure rate
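Most of these metrics can be spot-checked without extra tooling, since every daemon exposes them over JMX and the ResourceManager has a REST API. For example (hostnames are placeholders; ports are the Hadoop 3 defaults):
# FSNamesystem metrics: under-replicated/missing blocks, live and dead DataNodes
curl -s 'http://namenode1:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
# Cluster-wide memory, vcores, and application counts from the ResourceManager
curl -s 'http://resourcemanager:8088/ws/v1/cluster/metrics'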
Prometheus + Grafana Setup
1. Export Hadoop Metrics:
# hadoop-metrics2.properties
*.sink.prometheus.class=org.apache.hadoop.metrics2.sink.PrometheusSink
*.sink.prometheus.port=9095
namenode.sink.prometheus.class=org.apache.hadoop.metrics2.sink.PrometheusSink
namenode.sink.prometheus.port=9096
datanode.sink.prometheus.class=org.apache.hadoop.metrics2.sink.PrometheusSink
datanode.sink.prometheus.port=9097
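A bundled Prometheus sink is not present in every Hadoop release, so treat the sink class above as distribution-specific. On Hadoop 3.3+ an alternative is the built-in Prometheus endpoint (hadoop.prometheus.endpoint.enabled=true in core-site.xml), which serves metrics in Prometheus text format from each daemon's web UI:
# With hadoop.prometheus.endpoint.enabled=true (Hadoop 3.3+), daemons expose /prom
curl -s http://namenode1:9870/prom | head
curl -s http://datanode1:9864/prom | head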
2. Prometheus Configuration:
# prometheus.yml
scrape_configs:
  - job_name: 'hadoop-namenode'
    static_configs:
      - targets: ['namenode:9096']
  - job_name: 'hadoop-datanodes'
    static_configs:
      - targets:
          - 'datanode1:9097'
          - 'datanode2:9097'
          - 'datanode3:9097'
  - job_name: 'yarn-resourcemanager'
    static_configs:
      - targets: ['resourcemanager:9098']
3. Grafana Dashboard:
{
  "title": "HDFS Cluster Health",
  "panels": [
    {
      "title": "NameNode Heap Usage",
      "targets": [
        { "expr": "jvm_memory_used_bytes{job='hadoop-namenode'}" }
      ]
    },
    {
      "title": "Under-Replicated Blocks",
      "targets": [
        { "expr": "hadoop_namenode_under_replicated_blocks" }
      ]
    }
  ]
}
Alerting Rules
# prometheus-alerts.yml
groups:
  - name: hadoop_alerts
    rules:
      - alert: NameNodeHeapHigh
        expr: jvm_memory_used_bytes{job="hadoop-namenode"} / jvm_memory_max_bytes > 0.8
        for: 5m
        annotations:
          summary: "NameNode heap usage high"
      - alert: UnderReplicatedBlocks
        expr: hadoop_namenode_under_replicated_blocks > 1000
        for: 10m
        annotations:
          summary: "High number of under-replicated blocks"
      - alert: DataNodeDown
        expr: up{job="hadoop-datanodes"} == 0
        for: 2m
        annotations:
          summary: "DataNode {{ $labels.instance }} is down"
Backup and Disaster Recovery
NameNode Metadata Backup
# Automatic metadata checkpointing is handled by the Secondary NameNode
# (or by the Standby NameNode in an HA deployment)

# Manual backup: force a checkpoint, then copy the metadata off-cluster
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
scp -r /var/hadoop/namenode/current backup-server:/backups/nn-$(date +%Y%m%d)/

# Restore: copy the backup back onto the NameNode, then import the checkpoint
# (-importCheckpoint loads the image from the directory set as dfs.namenode.checkpoint.dir)
scp -r backup-server:/backups/nn-20240115/ /var/hadoop/namenode/current
hdfs namenode -importCheckpoint
Data Replication Across Clusters
DistCp (Distributed Copy):
# Copy data to disaster recovery cluster
hadoop distcp \
-update \
-delete \
-p \
hdfs://prod-cluster/data \
hdfs://dr-cluster/data
# Incremental copy (only changed files)
hadoop distcp \
-update \
-skipcrccheck \
hdfs://prod-cluster/data \
hdfs://dr-cluster/data
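If both clusters support HDFS snapshots, incremental copies can also be driven by snapshot diffs instead of modification times, which scales better for large directory trees. A sketch, assuming snapshots s1 (already copied) and s2 (current) exist on the source path and the target has not changed since s1:
# One-time setup on the source directory, then one snapshot per backup run
hdfs dfsadmin -allowSnapshot /data
hdfs dfs -createSnapshot /data s2
# Copy only the changes between the two snapshots (requires -update)
hadoop distcp -update -diff s1 s2 hdfs://prod-cluster/data hdfs://dr-cluster/data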
Automated Daily Backup:
#!/bin/bash
# backup-hdfs.sh
DATE=$(date +%Y%m%d)
LOG=/var/log/hadoop-backup-${DATE}.log

hadoop distcp \
  -update \
  -delete \
  -log /tmp/distcp-${DATE} \
  hdfs://prod/data \
  hdfs://backup/data/${DATE} 2>&1 | tee "$LOG"

# Check distcp's exit status rather than tee's
if [ "${PIPESTATUS[0]}" -eq 0 ]; then
  echo "Backup completed successfully"
else
  echo "Backup failed" | mail -s "HDFS Backup Failed" [email protected]
fi
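The script can then be scheduled from an edge node, for example with cron (the install path and schedule below are illustrative):
# Crontab entry: run the backup every night at 01:30
30 1 * * * /opt/hadoop-scripts/backup-hdfs.sh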
Troubleshooting Common Issues
Issue 1: Slow MapReduce Jobs
Diagnosis:
# Check job counters
hadoop job -counter job_123456 \
org.apache.hadoop.mapreduce.TaskCounter \
SPILLED_RECORDS
# High spills → Increase sort buffer
# Long shuffle time → Data skew or network issues
Solutions:
<!-- mapred-site.xml -->
<!-- Increase sort buffer -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>512</value>
</property>
<!-- Increase reduce shuffle parallelism -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>20</value>
</property>
Issue 2: NameNode Out of Memory
Diagnosis:
# Check heap usage
jmap -heap <namenode-pid>
# Count files and blocks (fsck prints the totals in its summary)
hdfs fsck / | grep -iE 'total (files|blocks)'
Solutions:
Increase heap (set in hadoop-env.sh):
export HADOOP_NAMENODE_OPTS="-Xmx32g"
Enable HDFS Federation (multiple NameNodes)
Consolidate small files, e.g. pack cold directories into a Hadoop Archive:
hadoop archive -archiveName old-data.har -p /path /archive
Issue 3: DataNode Disk Failure
Detection:
2024-01-15 10:30:15 ERROR datanode.DataNode:
Failed to read block blk_123456 from disk /mnt/disk3:
Input/output error
Recovery:
# Remove the failed disk from dfs.datanode.data.dir in hdfs-site.xml
# (or raise dfs.datanode.failed.volumes.tolerated so the DataNode keeps running without it)
# Restart the DataNode so it comes back up without the bad volume
hadoop-daemon.sh stop datanode
hadoop-daemon.sh start datanode
# The NameNode re-replicates the lost blocks from the remaining replicas
# Monitor: hdfs fsck / -blocks
Capacity Planning
Growth Projection
# Calculate growth trajectory
current_size_tb = 500        # TB
monthly_growth_rate = 0.15   # 15% per month

def project_capacity(months):
    size = current_size_tb
    for _ in range(months):
        size *= (1 + monthly_growth_rate)
    return size

# Projection for 12 months
print(f"12 months: {project_capacity(12):.2f} TB")
# Output: 12 months: 2675.13 TB
# When to add nodes?
# Current capacity: 24 nodes × 48TB = 1,152 TB usable
# At 15% monthly growth from 500 TB, that capacity is reached in roughly 6 months,
# so plan hardware additions at least a quarter in advance
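The same assumptions can be turned into a rough "months until full" estimate; a throwaway one-liner (the numbers mirror the example above and should be adjusted for your own cluster):
# Compound 15% monthly growth from 500 TB until the ~1,152 TB usable capacity is exceeded
awk 'BEGIN { size=500; cap=1152; m=0; while (size < cap) { size*=1.15; m++ } printf "capacity reached in %d months (projected %.0f TB)\n", m, size }'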
Cost Optimization
Data Lifecycle Management:
# Move old data to erasure coding (saves space)
hdfs ec -setPolicy -path /archive -policy RS-6-3-1024k
# Erasure coding: 6 data + 3 parity = 1.5x overhead
# vs. Replication: 3x overhead
# Savings: 50% storage for archive data
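Erasure coding policies must be enabled before use, and they only apply to data written after the policy is set, so existing replicated files need to be re-copied into the zone. A quick way to verify what is enabled and applied:
# Show available policies, enable RS-6-3-1024k if needed, and check the target path
hdfs ec -listPolicies
hdfs ec -enablePolicy -policy RS-6-3-1024k
hdfs ec -getPolicy -path /archive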
Upgrading Hadoop
Rolling Upgrade Process
# 1. Prepare for upgrade (Hadoop 3.0+)
hdfs dfsadmin -rollingUpgrade prepare
# 2. Upgrade DataNodes one by one
ssh datanode1
hadoop-daemon.sh stop datanode
# Install new version
hadoop-daemon.sh start datanode
# Repeat for each DataNode with delay
# Wait for re-replication before next node
# 3. Upgrade NameNode standby
hadoop-daemon.sh stop namenode
# Install new version
hadoop-daemon.sh start namenode
# 4. Fail over to upgraded NameNode
hdfs haadmin -failover nn1 nn2
# 5. Upgrade former active NameNode
# (Now standby)
# 6. Finalize upgrade
hdfs dfsadmin -rollingUpgrade finalize
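Between steps it helps to confirm progress before moving on; two checks that are useful throughout the rollout:
# Show the current rolling upgrade status (start time and finalize state)
hdfs dfsadmin -rollingUpgrade query
# After each DataNode restart, confirm it re-registered with the NameNode
hdfs dfsadmin -report | grep -i datanodes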
Best Practices Checklist
Security
- Kerberos authentication enabled for all daemons and clients
- Encryption configured at rest (encryption zones) and in transit
- Authorization policies defined (Ranger or HDFS ACLs) and reviewed regularly
Availability
- NameNode HA with QJM and automatic failover configured and tested
- Rack awareness script deployed and verified
- NameNode metadata backups and DistCp replication to a DR cluster scheduled
Operations
- Metrics exported to Prometheus with Grafana dashboards and alerting rules
- Capacity projections reviewed regularly; erasure coding applied to archive data
- Rolling upgrade procedure documented and rehearsed
What’s Next?
Capstone Project: Building a Complete Data Pipeline. Apply everything you’ve learned in a comprehensive, production-ready project.