Chapter 1: Introduction and Origins

Apache Hadoop revolutionized big data processing by bringing Google’s pioneering distributed systems concepts to the open-source world. Born from the need to process massive web data, Hadoop has evolved into the de facto platform for large-scale data processing across industries.
Chapter Goals:
  • Understand how Hadoop originated from Google’s research papers
  • Learn the relationship between GFS, MapReduce, and Hadoop
  • Grasp Hadoop’s design philosophy and goals
  • Explore the Hadoop ecosystem and its evolution

The Genesis: From Google Papers to Open Source

The Timeline

+---------------------------------------------------------------+
|                  HADOOP ORIGIN STORY                          |
+---------------------------------------------------------------+
|                                                               |
|  2003: GFS Paper Published                                    |
|  ┌────────────────────────────────────────┐                   |
|  │ Google publishes "The Google File      │                   |
|  │ System" at SOSP 2003                   │                   |
|  │ → Describes distributed file system    │                   |
|  └────────────────────────────────────────┘                   |
|         ↓                                                     |
|  2004: MapReduce Paper Published                              |
|  ┌────────────────────────────────────────┐                   |
|  │ "MapReduce: Simplified Data Processing │                   |
|  │ on Large Clusters" published           │                   |
|  │ → Distributed processing framework     │                   |
|  └────────────────────────────────────────┘                   |
|         ↓                                                     |
|  2004: Nutch Project Needs Scale                              |
|  ┌────────────────────────────────────────┐                   |
|  │ Doug Cutting & Mike Cafarella          │                   |
|  │ → Building open-source search engine   │                   |
|  │ → Need to process billions of web pages│                   |
|  │ → Existing solutions don't scale       │                   |
|  └────────────────────────────────────────┘                   |
|         ↓                                                     |
|  2005: Hadoop is Born                                         |
|  ┌────────────────────────────────────────┐                   |
|  │ Implement GFS concepts → HDFS          │                   |
|  │ Implement MapReduce → Hadoop MapReduce │                   |
|  │ Named after Doug's son's toy elephant  │                   |
|  └────────────────────────────────────────┘                   |
|         ↓                                                     |
|  2006: Yahoo! Adopts Hadoop                                   |
|  ┌────────────────────────────────────────┐                   |
|  │ Doug Cutting joins Yahoo!              │                   |
|  │ → Yahoo dedicates team to Hadoop       │                   |
|  │ → Production clusters, 1000s of nodes  │                   |
|  │ → Processes petabytes of web data      │                   |
|  └────────────────────────────────────────┘                   |
|         ↓                                                     |
|  2008: Apache Top-Level Project                               |
|  ┌────────────────────────────────────────┐                   |
|  │ Hadoop becomes Apache top-level project│                   |
|  │ → Broad industry adoption begins       │                   |
|  │ → Ecosystem tools emerge               │                   |
|  └────────────────────────────────────────┘                   |
|                                                               |
+---------------------------------------------------------------+

Why Was Hadoop Created?

Challenge: Doug Cutting and Mike Cafarella were building Nutch, an open-source web search engine.
Scale Requirements (2004):

Web Data:
- Billions of web pages to crawl
- Terabytes of HTML content
- Massive link graph analysis
- Real-time index updates needed

Processing Needs:
- Distributed crawling across many machines
- Parallel processing of crawl data
- Build searchable inverted index
- Handle machine failures gracefully

Existing Solutions:
✗ Single machine: Can't handle the scale
✗ Traditional databases: Too slow, too expensive
✗ Custom solutions: Too complex to maintain
✗ Commercial tools: Cost prohibitive

The Gap:
Google could do it (with GFS + MapReduce)
But those were proprietary, internal systems
No open-source equivalent existed
Solution: Build an open-source implementation of Google’s approach.
The GFS Paper (2003):
Key Ideas from GFS:

1. Commodity Hardware
   → Cheap machines will fail
   → Design for failure, not against it
   → Automatic recovery is essential

2. Large Chunks (64MB)
   → Optimize for large files
   → Reduce metadata overhead
   → Sequential access patterns

3. Single Master
   → Simplifies metadata management
   → Strong consistency for namespace
   → Master coordinates, doesn't bottleneck

4. Data Replication
   → 3 replicas by default
   → Survive machine failures
   → Enable data locality
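These storage ideas carried straight into HDFS's client API. Below is a minimal sketch (the NameNode address and file path are placeholders, not from this chapter) showing how a client can request a specific replication factor and block size when creating a file; in practice these usually come from cluster-wide defaults (dfs.replication, dfs.blocksize).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point the client at your NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Ask for 3 replicas and 128MB blocks, mirroring the GFS ideas
        // of replication and large chunks.
        short replication = 3;
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out = fs.create(
                new Path("/data/example.txt"), true, 4096, replication, blockSize)) {
            out.writeBytes("hello hdfs\n");
        }
    }
}
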
The MapReduce Paper (2004):
Key Ideas from MapReduce:

1. Simple Programming Model
   → Map: process records in parallel
   → Reduce: aggregate results
   → Framework handles distribution

2. Automatic Parallelization
   → Developers write logic
   → Framework handles execution
   → No manual thread management

3. Fault Tolerance
   → Re-execute failed tasks
   → Speculative execution
   → No data loss on failures

4. Data Locality
   → Move computation to data
   → Minimize network transfer
   → Leverage file system knowledge
Hadoop’s Approach: Implement these ideas in Java, make them open source.
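To make the model concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API (the class names are illustrative). The user code only expresses per-record and per-key logic; splitting the input, running map tasks in parallel, shuffling by key, and re-running failed tasks are all handled by the framework.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every word in one line of input.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: the framework groups values by key, so each call sees one
// word together with all of its counts; we only need to sum them.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}

A small driver class would wire these into a Job (input and output paths, key/value types) and submit it; no thread, socket, or retry code appears anywhere in the user logic.
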
Why Open Source Mattered:
Before Hadoop:
──────────────
Big Data = Big Money
- Proprietary systems (Oracle, Teradata)
- Expensive hardware requirements
- Vendor lock-in
- Only large companies could afford

Hadoop's Impact:
────────────────
Democratization of Big Data
- Free and open source
- Runs on commodity hardware
- Community-driven development
- Available to everyone

Benefits:
────────
1. Transparency
   → See how it works
   → Understand trade-offs
   → Fix bugs yourself

2. No Vendor Lock-In
   → Free to use and modify
   → Multiple distributions available
   → Run anywhere

3. Community Innovation
   → Thousands of contributors
   → Rapid ecosystem growth
   → Diverse use cases drive features

4. Cost Effective
   → No licensing fees
   → Commodity hardware
   → Scale economically
Yahoo’s Investment in Hadoop:
Why Yahoo Bet on Hadoop (2006):

Yahoo's Challenge:
- Largest web index (billions of pages)
- Massive user data analytics
- Search ranking algorithms
- Ad targeting at scale

Why Hadoop:
✓ Open source (can customize)
✓ Designed for web-scale data
✓ Cost-effective (commodity hardware)
✓ Active development (could hire Doug Cutting)

Yahoo's Contributions:

1. Production Testing
   → Deployed clusters with 4000+ nodes
   → Processed petabytes of data
   → Found and fixed bugs at scale

2. Engineering Resources
   → Dedicated team of developers
   → Performance optimizations
   → Operational tools

3. Ecosystem Development
   → Pig: data flow language
   → Contributed improvements to core
   → Shared operational knowledge

4. Industry Credibility
   → Proved Hadoop works at scale
   → Encouraged other companies to adopt
   → Created Hadoop job market

Result: Hadoop became production-ready

Hadoop vs Google’s Systems

Understanding how Hadoop relates to and differs from Google’s systems is crucial:

Architecture Comparison

+---------------------------------------------------------------+
|           GOOGLE SYSTEMS vs HADOOP EQUIVALENTS                |
+---------------------------------------------------------------+
|                                                               |
|  GOOGLE (Proprietary)        HADOOP (Open Source)            |
|  ───────────────────         ──────────────────              |
|                                                               |
|  GFS                         HDFS                             |
|  ┌──────────────┐           ┌──────────────┐                 |
|  │ Master       │           │ NameNode     │                 |
|  │ Chunkservers │           │ DataNodes    │                 |
|  │ 64MB chunks  │           │ 128MB blocks │                 |
|  └──────────────┘           └──────────────┘                 |
|       │                            │                         |
|       ↓                            ↓                         |
|  MapReduce                   Hadoop MapReduce                |
|  ┌──────────────┐           ┌──────────────┐                 |
|  │ C++          │           │ Java         │                 |
|  │ Master       │           │ JobTracker   │                 |
|  │ Workers      │           │ TaskTrackers │                 |
|  └──────────────┘           └──────────────┘                 |
|       │                            │                         |
|       ↓                            ↓                         |
|  Cluster Manager             YARN (Hadoop 2.0)               |
|  ┌──────────────┐           ┌──────────────┐                 |
|  │ Borg/Omega   │           │ Resource     │                 |
|  │ (Internal)   │           │ Manager      │                 |
|  └──────────────┘           │ Containers   │                 |
|                             └──────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

Key Differences

C++ vs Java:
Google (C++):
────────────
Advantages:
✓ Performance (closer to metal)
✓ Memory control
✓ Lower overhead

Trade-offs:
✗ Harder to develop
✗ More complex debugging
✗ Platform dependencies

Hadoop (Java):
─────────────
Advantages:
✓ Write once, run anywhere (JVM)
✓ Easier development
✓ Rich ecosystem of libraries
✓ Automatic memory management
✓ Larger developer community

Trade-offs:
✗ GC pauses can cause issues
✗ Higher memory overhead
✗ Slightly lower performance

Why Java Won for Hadoop:
→ Portability across platforms
→ Larger developer ecosystem
→ Faster development cycle
→ Good-enough performance

Hadoop Design Goals

What did Hadoop set out to achieve?

Primary Objectives

Scalability

Scale to Thousands of Nodes:
  • Start with 10 nodes, grow to 4000+
  • Linear performance scaling
  • Petabytes of storage
  • Thousands of concurrent jobs
  • Add capacity without downtime

Reliability

Assume Failure, Ensure Reliability:
  • Automatic failure detection
  • Transparent recovery
  • No data loss on failures
  • Checksumming for integrity
  • Self-healing capabilities

Efficiency

Maximize Resource Utilization:
  • Data locality optimization
  • High aggregate throughput
  • Efficient use of network
  • Minimize data movement
  • Parallel processing

Simplicity

Easy to Use and Operate:
  • Simple programming model
  • Automatic parallelization
  • Framework handles complexity
  • Straightforward deployment
  • Manageable at scale

Design Principles

+---------------------------------------------------------------+
|                  HADOOP DESIGN PRINCIPLES                     |
+---------------------------------------------------------------+
|                                                               |
|  1. COMMODITY HARDWARE                                        |
|  ───────────────────────                                      |
|  → Use inexpensive, standard servers                          |
|  → No special storage hardware needed                         |
|  → Scale horizontally, not vertically                         |
|  → Failure is expected and handled                            |
|                                                               |
|  2. FAULT TOLERANCE                                           |
|  ────────────────                                             |
|  → Data replication (default 3x)                              |
|  → Automatic re-replication on failure                        |
|  → Task re-execution on node failure                          |
|  → Checkpointing and recovery                                 |
|                                                               |
|  3. DATA LOCALITY                                             |
|  ──────────────                                               |
|  → Move computation to data                                   |
|  → Minimize network bandwidth usage                           |
|  → Schedule tasks on nodes with data                          |
|  → Co-locate storage and compute                              |
|                                                               |
|  4. SIMPLE CORE, RICH ECOSYSTEM                               |
|  ─────────────────────────────                                |
|  → HDFS provides storage abstraction                          |
|  → MapReduce provides processing model                        |
|  → Higher-level tools build on core                           |
|  → Extensible and pluggable                                   |
|                                                               |
|  5. WRITE ONCE, READ MANY                                     |
|  ───────────────────────                                      |
|  → Optimized for append-only workloads                        |
|  → Immutable data simplifies consistency                      |
|  → Random writes not primary use case                         |
|  → Fits big data analytics patterns                           |
|                                                               |
+---------------------------------------------------------------+

The Hadoop Ecosystem

Hadoop is not just HDFS and MapReduce—it’s an entire ecosystem:

Core Components

                    THE HADOOP ECOSYSTEM

        ┌─────────────────────────────────────────┐
        │        Applications & Use Cases         │
        │  Web Analytics, ML, ETL, Log Analysis   │
        └─────────────────────────────────────────┘

        ┌────────────────┴────────────────┐
        │                                 │
┌───────────────┐              ┌──────────────────┐
│  SQL Tools    │              │ Processing       │
│  - Hive       │              │ - Spark          │
│  - Impala     │              │ - Flink          │
│  - Presto     │              │ - Tez            │
└───────────────┘              └──────────────────┘
        │                                 │
        └────────────────┬────────────────┘

        ┌─────────────────────────────────────────┐
        │        YARN (Resource Management)       │
        │  ResourceManager, NodeManagers          │
        └─────────────────────────────────────────┘

        ┌─────────────────────────────────────────┐
        │      HDFS (Distributed Storage)         │
        │  NameNode, DataNodes, Blocks            │
        └─────────────────────────────────────────┘

Ecosystem Tools

Processing Frameworks:
MapReduce (Original):
- Batch processing model
- Map and Reduce phases
- Disk-based intermediate data
- Good for large batch jobs

Apache Spark:
- In-memory processing
- 10-100x faster than MapReduce
- Unified API (batch, streaming, ML)
- Most popular Hadoop workload today

Apache Flink:
- Stream-first processing
- True streaming (not micro-batch)
- Stateful computations
- Event time processing

Apache Tez:
- DAG-based execution
- Faster than MapReduce
- Used by Hive for optimization
- Better resource utilization

Hadoop’s Impact on Industry

Adoption Timeline

2006-2008: Early Adopters
──────────────────────────
→ Yahoo: 4000+ node clusters
→ Facebook: User data analytics
→ Last.fm: Music recommendations

2009-2011: Enterprise Adoption
──────────────────────────────
→ LinkedIn: People You May Know
→ Twitter: Tweet analytics
→ eBay: Search indexing
→ Adobe: Log processing

2012-2015: Mainstream
─────────────────────
→ Banks: Fraud detection
→ Retailers: Customer analytics
→ Healthcare: Medical research
→ Telecommunications: Network analysis

2016-Present: Cloud Evolution
──────────────────────────────
→ AWS EMR: Managed Hadoop
→ Azure HDInsight: Cloud Hadoop
→ Google Dataproc: Cloud Hadoop
→ Databricks: Spark-focused platform
→ Many moving to cloud-native solutions

Key Success Stories

Yahoo

Search and Analytics:
  • 40,000+ nodes across multiple clusters
  • Processes petabytes daily
  • Web search indexing
  • Ad targeting optimization
  • Proved Hadoop at scale

Facebook

User Data Analysis:
  • One of the largest Hadoop clusters (early 2010s)
  • Analyze billions of interactions
  • News Feed optimization
  • Friend recommendations
  • Created Hive for SQL access

LinkedIn

Social Graph Analytics:
  • “People You May Know”
  • Job recommendations
  • Skills endorsements
  • Created Apache Kafka
  • Advanced data pipelines

Netflix

Recommendation Engine:
  • Analyze viewing patterns
  • Personalized recommendations
  • A/B testing infrastructure
  • Content quality analysis
  • Viewer behavior insights

Hadoop Today: Evolution and Alternatives

Current State (2025)

HADOOP'S EVOLUTION:

Still Widely Used:
─────────────────
✓ Many existing deployments
✓ Proven at petabyte scale
✓ Rich ecosystem of tools
✓ Battle-tested in production
✓ On-premises installations

Challenges:
──────────
⚠ Complex to operate
⚠ High operational overhead
⚠ Slower than modern alternatives
⚠ Not cloud-native
⚠ Declining new adoption

Modern Alternatives:
───────────────────
→ Apache Spark: Faster processing
→ Cloud data warehouses: Snowflake, BigQuery, Redshift
→ Object storage: S3, GCS instead of HDFS
→ Kubernetes: Container orchestration replacing YARN
→ Serverless: AWS Lambda, Google Cloud Functions

The Shift:
─────────
Many companies have been moving along this path:

  Hadoop on-premises
        ↓
  Cloud-managed Hadoop (EMR, Dataproc)
        ↓
  Cloud-native solutions (Snowflake, BigQuery)
        or
  Hybrid: Spark on Kubernetes with cloud storage

Why Learn Hadoop Today?

Understanding Hadoop helps you understand modern data systems:
  • Spark builds on Hadoop concepts
  • Cloud data warehouses use similar distributed patterns
  • Kubernetes shares resource management concepts with YARN
  • Data lakes evolved from HDFS patterns
Learning Hadoop gives you the foundational knowledge to understand the entire big data ecosystem.
Many companies still run Hadoop in production:
  • Large enterprises with existing investments
  • On-premises deployments for compliance
  • Cost-sensitive organizations
  • Legacy applications dependent on Hadoop
Job market still demands Hadoop expertise for maintenance and migration projects.
Hadoop remains interview-relevant:
  • System design questions often reference Hadoop
  • Understanding HDFS helps explain distributed file systems
  • MapReduce is a classic programming model question
  • Comparing Hadoop vs modern alternatives shows depth
Employers value understanding both legacy and modern systems.
Core Hadoop patterns apply everywhere:
  • Data locality optimization
  • Fault tolerance through replication
  • Separating storage and compute
  • Resource management and scheduling
  • Shuffle and sort patterns
These patterns transcend Hadoop and appear in all distributed systems.

Key Takeaways

Remember These Core Insights:
  1. Hadoop = Open Source GFS + MapReduce: Born from Google’s research papers, made accessible to everyone
  2. Yahoo’s Investment Was Critical: Yahoo’s engineering resources and production usage made Hadoop enterprise-ready
  3. Java Was the Right Choice: Portability and developer community outweighed performance concerns
  4. Ecosystem Over Core: Hive, Pig, HBase, Spark built on Hadoop foundation created lasting value
  5. Data Locality is Key: Moving computation to data rather than vice versa is fundamental to Hadoop’s efficiency
  6. Simple Beats Complex: Single NameNode, straightforward replication, clear programming model
  7. Evolution is Continuous: From MapReduce-only to YARN, from batch to streaming, constant improvement
  8. Open Source Democratized Big Data: What only Google could do became available to everyone

Interview Questions

Question: What is Hadoop, and why was it created?

Expected Answer: Hadoop is an open-source framework for distributed storage and processing of large datasets. It was created by Doug Cutting and Mike Cafarella in 2005 to solve the web-scale data processing problem for the Nutch search engine project.

Key Points:
  1. Inspired by Google: Based on GFS (2003) and MapReduce (2004) papers
  2. Open Source Implementation: Made Google’s concepts available to everyone
  3. Core Components: HDFS (storage) and MapReduce (processing)
  4. Yahoo’s Role: Crucial investment and production testing at scale
  5. Ecosystem: Grew beyond core to include Hive, HBase, Pig, Spark, etc.
Why It Mattered: Democratized big data processing, enabling companies without Google’s resources to process massive datasets cost-effectively.

Question: How does Hadoop differ from Google’s original systems?

Expected Answer: While Hadoop implements Google’s concepts, there are several key differences.

Technical Differences:
  1. Language: Google used C++, Hadoop uses Java (for portability and ease of development)
  2. Block Size: GFS used 64MB chunks; HDFS defaults to 128MB blocks today (early HDFS also used 64MB; defaults grew with hardware)
  3. Terminology: Master/Chunkserver vs NameNode/DataNode
  4. Resource Management: Google’s approach unknown, Hadoop added YARN (Hadoop 2.0)
Philosophical Differences:
  1. Open vs Closed: Hadoop is open-source and community-driven
  2. API Stability: Hadoop maintains backward compatibility
  3. Use Cases: Google optimized for internal workloads, Hadoop serves diverse use cases
  4. Evolution: Different trajectories (Colossus vs HDFS 2.x+)
Trade-offs: Hadoop chose portability and community over raw performance. Google optimized for their specific needs; Hadoop generalized for broad adoption.

Question: Why was Hadoop written in Java rather than C++?

Expected Answer: The choice of Java was strategic and practical.

Advantages of Java:
  1. Portability: “Write once, run anywhere” - works on any platform with JVM
  2. Developer Ecosystem: Much larger pool of Java developers than C++ systems programmers
  3. Faster Development: Garbage collection, rich standard library, easier debugging
  4. Safety: Type safety, memory safety reduce entire classes of bugs
  5. Integration: Easier to integrate with enterprise Java applications
Performance Trade-offs:
  1. GC Pauses: Can cause issues but manageable with tuning
  2. Memory Overhead: Higher than C++ but acceptable with cheap RAM
  3. Throughput: Good enough for distributed systems where network is often bottleneck
Real-World Validation: Hadoop’s success proves Java was the right choice. Performance bottlenecks are usually disk I/O or network, not CPU. The ability to iterate quickly and attract contributors mattered more than raw performance.

Question: When would you choose Hadoop today versus a modern alternative?

Expected Answer: The decision depends on multiple factors.

Choose Hadoop/HDFS When:
  • Existing investment in Hadoop ecosystem
  • On-premises deployment required (compliance, data sovereignty)
  • Cost-sensitive with own hardware
  • Need for Hive/HBase integration
  • Team expertise in Hadoop
Choose Cloud Alternatives (Snowflake, BigQuery) When:
  • Primarily SQL workloads
  • Want managed service (no operations)
  • Elastic scaling needed
  • Willing to pay premium for simplicity
  • Modern analytics use case
Choose Spark on Kubernetes When:
  • Complex data processing (beyond SQL)
  • Need flexibility and control
  • Want container-based orchestration
  • Modern DevOps practices
  • Mix of batch and streaming
Decision Framework:
  1. Workload: Batch vs streaming vs SQL
  2. Scale: Data size and growth rate
  3. Team: Skills and preferences
  4. Budget: CapEx vs OpEx
  5. Timeline: Build vs buy decision
  6. Compliance: Data location requirements
Modern Approach: Many companies use hybrid—Spark for processing, cloud storage (S3/GCS) instead of HDFS, managed Kubernetes instead of YARN.

Question: What role did Yahoo play in Hadoop’s success?

Expected Answer: Yahoo’s contribution was transformative and multifaceted.

Engineering Investment:
  1. Hired Doug Cutting: Brought creator in-house with dedicated team
  2. Production Scale: Deployed 4000+ node clusters, found and fixed bugs at scale
  3. Performance Tuning: Optimized for real-world workloads
  4. Operational Tools: Built monitoring, debugging, and management tools
Technical Contributions:
  1. Pig: Created high-level data flow language
  2. Core Improvements: Contributed optimizations back to Hadoop
  3. Testing: Stress-tested with petabytes of real web data
  4. Documentation: Shared learnings and best practices
Industry Impact:
  1. Credibility: Proved Hadoop works at web scale
  2. Talent Development: Trained engineers, created Hadoop expertise
  3. Ecosystem Growth: Success encouraged other companies to adopt
  4. Open Source Commitment: Could have kept improvements proprietary but didn’t
Counterfactual: Without Yahoo, Hadoop might have remained a small open-source project. Yahoo’s investment turned it into an industry-standard platform.

Comparison to Google: Google published papers but kept code proprietary. Yahoo made the implementation truly open and production-ready.

Further Reading

GFS Paper

“The Google File System” (SOSP 2003). The foundation for HDFS’s design.

MapReduce Paper

“MapReduce: Simplified Data Processing on Large Clusters” (OSDI 2004). The original programming model.

Hadoop: The Definitive Guide

Tom White’s comprehensive book; the industry-standard reference.

Designing Data-Intensive Applications

Martin Kleppmann’s book; see the chapter on batch processing, which covers Hadoop in depth.

Deep Dive: Hadoop Version Evolution

Understanding Hadoop’s historical versions helps you interpret documentation, debug legacy clusters, and reason about architectural trade-offs.
HIGH-LEVEL VERSION HISTORY
──────────────────────────

2005-2010: Hadoop 0.x / 1.x ("Classic" Hadoop)
- Single NameNode (no built-in HA)
- JobTracker + many TaskTrackers
- MapReduce is the only processing model
- Focus: batch processing on-prem clusters

2012+: Hadoop 2.x (YARN + HDFS 2)
- YARN introduces generic resource management
- Multiple processing frameworks (MapReduce v2, Tez, Spark, etc.)
- NameNode HA and Federation added
- Focus: multi-tenant clusters, broader workloads

2017+: Hadoop 3.x
- Erasure coding (beyond 3x replication)
- Multiple standby NameNodes
- Improved scaling, containerized execution
- Focus: storage efficiency + modern integrations

Hadoop 1.x: Classic Architecture

Hadoop 1.x (often called “MRv1”) is what most early blog posts and tutorials describe.
  • Single NameNode: Manages HDFS namespace and block mappings
  • SecondaryNameNode: Periodically checkpoints the NameNode’s metadata (not a hot standby)
  • JobTracker: Schedules MapReduce jobs across the cluster
  • TaskTrackers: Run map/reduce tasks in fixed slots on each worker
HADOOP 1.x CONTROL PLANE

                ┌─────────────────────┐
                │       Client        │
                │  (submits job J)    │
                └─────────┬───────────┘
                          │ 1. submitJob(J)

                ┌──────────▼──────────┐
                │     JobTracker      │
                │ - Schedules tasks   │
                │ - Monitors progress │
                └──────────┬──────────┘
                          │ 2. assign map/reduce tasks
   ┌──────────────────────┼───────────────────────────┐
   │                      │                           │
┌──▼──────────┐      ┌────▼─────────┐           ┌─────▼────────┐
│ TaskTracker │      │ TaskTracker  │   ...     │ TaskTracker  │
│ (slots)     │      │ (slots)      │           │ (slots)      │
└─────────────┘      └──────────────┘           └──────────────┘

Each TaskTracker pre-allocates a fixed number of map slots and reduce slots.
- Pros: Simple mental model, easy to reason about
- Cons: Rigid resource model (slots may be idle even when CPU/memory is free)
Limitations of Hadoop 1.x:
  • Single JobTracker bottleneck: All scheduling and job bookkeeping centralized
  • Single NameNode: Operationally risky; manual failover required
  • MapReduce-only: Hard to run iterative/interactive workloads efficiently
  • Rigid slots: Poor resource utilization for mixed workloads
These limitations directly motivated the design of YARN and HDFS 2.x.

Hadoop 2.x: YARN and HDFS 2

Hadoop 2.x (MRv2) decouples resource management from computation.
FROM HADOOP 1.x TO 2.x

Before (1.x):
- JobTracker = scheduler + resource manager + job history
- TaskTracker = fixed slots per node

After (2.x):
- ResourceManager = cluster-wide resource scheduler
- NodeManagers = per-node resource agents
- ApplicationMaster = framework-specific logic (MapReduce, Spark, etc.)

Result: Hadoop becomes a general resource management platform (not MapReduce-only).
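As a rough illustration of what "general resource management platform" means, the sketch below uses YARN's Java client API to inspect a cluster (the ResourceManager address is a placeholder). Any YARN application, whether MapReduce, Spark, or Tez, is visible through this same interface.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point at your ResourceManager.
        conf.set("yarn.resourcemanager.address", "resourcemanager:8032");

        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();

        // Every framework running on the cluster shows up here.
        List<ApplicationReport> apps = yarn.getApplications();
        System.out.println("Applications known to the RM: " + apps.size());

        // NodeManagers are the per-node resource agents.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capacity: " + node.getCapability());
        }

        yarn.stop();
    }
}
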
Key HDFS 2.x improvements:
  • NameNode High Availability (HA)
    • Active + Standby NameNodes coordinated via ZooKeeper
    • Shared edits (e.g., NFS, JournalNodes) ensure consistent metadata
    • Automatic failover reduces downtime dramatically (see the client-side sketch after this list)
  • HDFS Federation
    • Multiple independent NameNodes, each managing a portion of the namespace
    • DataNodes register with multiple NameNodes
    • Improves scalability and isolation between workloads
  • Block Storage Enhancements
    • Support for heterogeneous storage (SSD vs HDD tiers)
    • Policy-based placement (hot vs cold data)
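To see what HA means from a client's perspective, here is a minimal sketch. The dfs.* property names and the failover proxy provider class are real HDFS configuration keys, but the nameservice name, hosts, and ports are placeholders; real deployments set these in hdfs-site.xml rather than in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Clients address a logical nameservice, not a single NameNode host.
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
        // Client-side proxy that fails over to whichever NameNode is active.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // The application code never needs to know which NameNode is active.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
    }
}
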
On the processing side, MapReduce is re-implemented on top of YARN as just one YARN application among many.
YARN LOGICAL VIEW

         ┌────────────────────────────────────┐
         │          ResourceManager           │
         │  - Schedules containers            │
         │  - Enforces cluster policies       │
         └───────────────┬────────────────────┘
                         │
                         │  container allocations
                         │
   ┌─────────────────────▼─────────────────────┐
   │                NodeManagers                │
   │  (one per worker node, launch containers)│
   └───────────────────────────────────────────┘

Each application (e.g., a MapReduce job or Spark app) runs an
ApplicationMaster inside a container and requests more containers
for its tasks.

Hadoop 3.x: Storage Efficiency and Modernization

Hadoop 3.x focuses on long-term operational efficiency.
  • Erasure Coding
    • Replaces 3x replication for cold data with Reed–Solomon-style encoding
    • Typical configuration: ~1.5x storage overhead instead of 3x
    • Trade-off: Higher CPU and network cost on reads/writes of erasure-coded files (worked storage comparison after this list)
  • Multiple Standby NameNodes
    • Support for more than one standby
    • Better failover and maintenance story for very large clusters
  • Containerized Execution
    • Better integration with Docker and container runtimes
    • Moves Hadoop closer to modern DevOps workflows
  • Java 8+ and Ecosystem Updates
    • Updated dependency baselines, better performance and security
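To see where the ~1.5x figure comes from, consider the common RS-6-3 policy: every stripe of 6 data blocks gets 3 parity blocks, so 9 blocks of raw storage hold 6 blocks of data. A quick back-of-the-envelope comparison (illustrative numbers only):

public class StorageOverhead {
    public static void main(String[] args) {
        double logicalTB = 100.0; // logical data to store

        // 3x replication: every block is stored three times.
        double replicatedTB = logicalTB * 3.0; // 300 TB raw

        // RS(6,3) erasure coding: 6 data + 3 parity blocks per stripe.
        double erasureCodedTB = logicalTB * (6 + 3) / 6.0; // 150 TB raw

        System.out.printf("3x replication: %.0f TB raw%n", replicatedTB);
        System.out.printf("RS-6-3 EC:      %.0f TB raw%n", erasureCodedTB);
        // Replication tolerates losing 2 copies of any block; RS(6,3)
        // tolerates losing any 3 blocks per stripe. EC pays for the
        // savings with extra CPU and network work on reads and writes.
    }
}
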
Understanding these version differences is crucial when reading production postmortems or planning migrations.

Case Study: From Nutch to Web-Scale Analytics

To internalize Hadoop’s design, walk through a concrete evolution from the original Nutch use case to a generalized analytics platform.

Phase 1: Nutch on a Small Cluster

Goal: Crawl and index a few billion web pages

Cluster:
- Dozens of machines
- Local file systems, ad-hoc scripts

Pain Points:
- Re-running failed jobs manually
- Difficult to scale beyond a few dozen nodes
- No unified storage layer
Outcome: GFS + MapReduce ideas show a clear path forward, but the code is internal to Google.

Phase 2: Early Hadoop at Yahoo

Goal: Web search + log analytics at web scale

Cluster:
- Hundreds → thousands of commodity nodes
- HDFS + MapReduce (Hadoop 0.x/1.x)

Workloads:
- Web crawl processing
- Index construction
- Clickstream analysis
Key engineering lessons learned:
  1. Metadata pressure on NameNode
    • Billions of small files exhaust NameNode heap
    • Solution: File consolidation, sequence files, better schema design
  2. Stragglers and skew
    • A few slow TaskTrackers delay entire job completion
    • Speculative execution and better partitioners mitigate the issue
  3. Debuggability
    • MapReduce failures produce huge logs spread across nodes
    • Yahoo invested heavily in tooling, UIs, and standardized logging formats

Phase 3: Hadoop as a Multi-Purpose Data Platform

As more teams adopted Hadoop, requirements diversified:
  • Data scientists wanted interactive SQL → Hive and later Impala/Presto
  • Streaming teams needed near-real-time processing → Kafka + Storm/Flink
  • ML teams needed iterative algorithms → Mahout, then Spark MLlib
Hadoop evolved from “log cruncher” to a shared data lake foundation.
DATA PLATFORM VIEW

         Applications & Use Cases
         ───────────────────────
         - Dashboards & BI
         - Feature engineering
         - Offline model training
         - Compliance reporting

                    ↓
           Query & Processing Engines
           - Hive / Impala / Presto
           - Spark / Flink / Tez

                    ↓
           Storage & Resource Layer
           - HDFS (with replication/EC)
           - YARN (containers)
This evolution is why modern “data platform” diagrams still look very similar to a Hadoop architecture diagram, even when HDFS is replaced by S3 and YARN by Kubernetes.

Operational Lessons from Early Hadoop Clusters

Many war stories from the 2010s Hadoop era translate directly into design heuristics.
  • Avoid small files
    • Thousands of tiny files (KB-sized) cause NameNode memory blow-ups
    • Prefer large, partitioned files in columnar formats (Parquet/ORC); a rough sizing sketch follows this list
  • Plan for hardware churn
    • In a 1000-node cluster, nodes fail every day
    • Automation (config management, auto-replacement) is mandatory
  • Capacity planning is subtle
    • Triple replication + temporary MapReduce outputs inflate storage
    • Network oversubscription can silently cap throughput
  • Multi-tenancy needs guardrails
    • Without queues and quotas, a single rogue job can saturate the cluster
    • YARN schedulers (Capacity/Fair) exist to enforce isolation
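To make the small-files point concrete, a widely used rule of thumb is that each file, directory, and block costs on the order of 150 bytes of NameNode heap. The sketch below applies that rule to illustrative numbers; treat it as an order-of-magnitude guide, not a measurement.

public class NameNodeHeapEstimate {
    public static void main(String[] args) {
        // Rule of thumb: ~150 bytes of NameNode heap per namespace
        // object (file, directory, or block).
        final long BYTES_PER_OBJECT = 150;

        // 1 billion tiny files, one block each: file + block = 2 objects.
        long smallFiles = 1_000_000_000L;
        long smallHeap = smallFiles * 2 * BYTES_PER_OBJECT;

        // By contrast, 1 million ~1GB files (8 blocks each at 128MB)
        // hold about a petabyte yet cost only 9 objects per file.
        long bigFiles = 1_000_000L;
        long bigHeap = bigFiles * 9 * BYTES_PER_OBJECT;

        System.out.printf("1B small files: ~%d GB heap%n", smallHeap / 1_000_000_000L); // ~300 GB
        System.out.printf("1M large files: ~%d GB heap%n", bigHeap / 1_000_000_000L);   // ~1 GB
    }
}
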
These lessons are as relevant for cloud-era data platforms as they were for on-prem Hadoop clusters.

Up Next

In Chapter 2: HDFS Architecture, we’ll dive deep into:
  • NameNode and DataNode design and responsibilities
  • How HDFS implements and improves upon GFS concepts
  • Block replication and placement strategies
  • Read, write, and append operations in detail
  • Metadata management and namespace operations
We’ve covered Hadoop’s origins and place in history. Next, we’ll explore the distributed file system that makes it all possible: HDFS.