Chapter 5: Hadoop Ecosystem
The true power of Hadoop lies not just in HDFS and MapReduce, but in the rich ecosystem of tools built on top of it. This chapter explores the key components that transform Hadoop from a low-level distributed file system and processing framework into a complete big data platform.

Chapter Goals:
- Understand the Hadoop ecosystem landscape
- Master Hive for SQL on Hadoop
- Learn Pig for data flow programming
- Explore HBase for real-time NoSQL storage
- Study workflow orchestration with Oozie
- Understand data ingestion with Kafka and Flume
- Compare ecosystem tools and when to use each
Ecosystem Overview
The Hadoop Stack
Why the Ecosystem Matters
Abstraction
Hide Complexity:
- MapReduce requires Java programming
- Hive provides SQL interface
- Easier onboarding, faster development
- Reach more users (analysts, not just engineers)
Productivity
Faster Development:
- 100 lines of Java MapReduce → 5 lines of SQL
- Pig reduces code by 10-20x
- Faster iteration, fewer bugs
- Focus on logic, not plumbing
Specialization
Right Tool for Job:
- Hive for SQL analytics
- HBase for real-time access
- Kafka for stream ingestion
- Each optimized for specific use case
Innovation
Community-Driven:
- Open source enables experimentation
- Best tools emerge organically
- Spark displaced MapReduce
- Ecosystem evolves with needs
Apache Hive: SQL on Hadoop
What is Hive?
Hive provides a SQL interface (HiveQL) to data stored in HDFS, translating SQL queries into MapReduce, Tez, or Spark jobs.

Hive Fundamentals
- HiveQL Basics
- Partitioning
- File Formats
- Optimization
SQL-Like Syntax

Behind the Scenes: each query translates to a MapReduce/Tez job.
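A minimal sketch, assuming a hypothetical `sales` table partitioned by `year`:

```sql
-- Hive compiles this into a map stage (scan + filter on the sales table)
-- and a reduce stage (the GROUP BY aggregation).
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE year = 2023
GROUP BY region;
```

The analyst writes a few lines of SQL; the engine generates and schedules the equivalent distributed job.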
Hive Metastore
Metastore Architecture
Centralized Metadata Repository

Example Metadata: Metastore Entries
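A sketch of how these entries can be inspected; `TBLS`, `PARTITIONS`, and `SDS` are actual tables in the Metastore's backing schema, while the `sales` table and the MySQL backend are assumptions for illustration:

```sql
-- Each Hive table is a row in TBLS; each partition is a row in PARTITIONS;
-- each partition's HDFS path lives in its storage descriptor (SDS).
SELECT t.TBL_NAME, p.PART_NAME, s.LOCATION
FROM TBLS t
JOIN PARTITIONS p ON p.TBL_ID = t.TBL_ID
JOIN SDS s ON s.SD_ID = p.SD_ID
WHERE t.TBL_NAME = 'sales';
```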
Managed vs External Tables
Table Ownership: When to Use Each
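A minimal sketch of the distinction (table names and the HDFS path are hypothetical):

```sql
-- Managed: Hive owns data and metadata; DROP TABLE deletes the HDFS files.
-- Use when Hive is the sole producer and consumer of the data.
CREATE TABLE logs_managed (ts STRING, msg STRING);

-- External: Hive owns only the metadata; DROP TABLE leaves the files in place.
-- Use when other tools or pipelines share the underlying files.
CREATE EXTERNAL TABLE logs_external (ts STRING, msg STRING)
LOCATION '/data/raw/logs';
```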
SerDe (Serializer/Deserializer)
Reading Custom Formats

SerDe Flow: on read, the InputFormat produces records and the SerDe deserializes each record into columns; on write, the SerDe serializes rows back into the storage format.
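A sketch using the built-in `OpenCSVSerde` (the class name is real; the table and path are hypothetical):

```sql
-- The SerDe parses each line into columns at read time (schema-on-read).
-- Note: OpenCSVSerde exposes every column as STRING.
CREATE EXTERNAL TABLE users_csv (id STRING, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
LOCATION '/data/csv/users';
```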
Deep Dive: Hive Metastore Scalability and the “Partition Explosion”
As data lakes grow to petabyte scale, the Hive Metastore often becomes the primary bottleneck in the entire stack.

1. The Partition Bottleneck
In a standard RDBMS-backed Metastore (MySQL/PostgreSQL), a query like `SELECT * FROM sales WHERE year=2023` requires the Metastore to:
- Look up the table `sales`.
- Scan the `PARTITIONS` table for all entries matching the filter.
- Fetch the `SDS` (Storage Descriptor) row for each partition to find the HDFS paths.
- If a table has 1,000,000 partitions (common in high-cardinality data), a single ad-hoc query might force the Metastore to load 1GB of metadata into memory just to plan the scan.
- This leads to “Metastore OOM” and serialized query planning that can take minutes before a single mapper even starts.
2. Scaling Strategies
- Metastore Federation: Splitting metadata across multiple Metastore instances based on database name.
- Partition Pruning at the Source: Ensuring clients use partition-bound filters to avoid full metadata scans (see the sketch after this list).
- Direct SQL: Modern Hive versions use direct SQL queries to the backend DB instead of the slower DataNucleus ORM layer to fetch partitions.
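A sketch of partition-bound versus unbounded filters, reusing the hypothetical `sales` table partitioned by `year`:

```sql
-- Pruned: the filter is a plain predicate on the partition column, so the
-- Metastore fetches metadata only for the matching partitions.
SELECT * FROM sales WHERE year = 2023;

-- Not pruned: wrapping the partition column in a function typically defeats
-- pruning, forcing the planner to enumerate all partition metadata.
SELECT * FROM sales WHERE CAST(year AS STRING) LIKE '20%';
```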
Deep Dive: Hive Query Lifecycle and Execution
To understand Hive’s performance, one must look past the SQL interface into the transformation pipeline that converts a declarative query into a distributed DAG.

1. The Query Planning Pipeline
Hive’s “Compiler” is a multi-stage engine that performs sophisticated optimizations before any data is touched.

| Stage | Action | Output |
|---|---|---|
| Parser | Parses HiveQL using ANTLR. | Abstract Syntax Tree (AST) |
| Semantic Analyzer | Resolves table names, column types, and partition metadata from the Metastore. | Query Block (QB) Tree |
| Logical Plan Gen | Converts QB tree into basic relational algebra operators (Filter, Join, Project). | Operator Tree (Initial) |
| Optimizer | Applies rules like Predicate Pushdown, Column Pruning, and Partition Pruning. | Operator Tree (Optimized) |
| Physical Plan Gen | Breaks the operator tree into executable tasks (MapReduce, Tez, or Spark). | Task Tree (DAG) |
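The compiled plan can be inspected without running the query, which is a quick way to verify that pruning and pushdown actually fired (hypothetical `sales` table again):

```sql
-- Prints the stage DAG produced by the pipeline above, without executing it.
EXPLAIN
SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE year = 2023
GROUP BY region;
```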
2. Tez vs. MapReduce: The DAG Revolution
While Hive 1.x relied on MapReduce, modern Hive (2.x/3.x) uses Apache Tez to eliminate the “HDFS barrier” between jobs.

- MapReduce Barrier: Every stage in a complex query (e.g., multiple joins) must write intermediate data to HDFS, causing massive I/O overhead.
- Tez DAG: Tez allows data to flow directly from one task to the next (e.g., Map -> Reduce -> Reduce) without intermediate HDFS writes. It uses a Directed Acyclic Graph of tasks.
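The engine is selected per session; a minimal sketch, assuming Tez is installed on the cluster:

```sql
-- hive.execution.engine accepts mr, tez, or spark; mr is deprecated in Hive 2.x+.
SET hive.execution.engine=tez;
```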
3. LLAP (Live Long and Process)
Introduced in Hive 2.0, LLAP is a hybrid architecture that combines persistent query servers with standard YARN containers.

- Persistent Daemons: Instead of starting a new JVM for every task (high latency), LLAP uses long-running daemons on worker nodes.
- In-Memory Caching: LLAP caches columnar data (ORC/Parquet) in a smart, asynchronous cache, avoiding repetitive HDFS reads.
- Vectorized Execution: LLAP processes data in batches of 1024 rows at a time using SIMD instructions, drastically reducing CPU cycles per row.
| Feature | Standard Hive | Hive with LLAP |
|---|---|---|
| Startup Latency | High (Container launch) | Ultra-low (Always-on daemons) |
| Data Access | HDFS Scan | In-Memory Cache + HDFS |
| Execution | MapReduce/Tez Tasks | Fragment-based execution |
| Target Use Case | Large Batch ETL | Interactive BI / Sub-second SQL |
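A sketch of the session-level switches involved; the property names are real, but defaults and the required cluster setup vary by Hive version and distribution:

```sql
SET hive.execution.engine=tez;      -- LLAP runs on top of Tez
SET hive.llap.execution.mode=all;   -- route eligible query fragments to LLAP daemons
```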
Apache Pig: Data Flow Language
What is Pig?
Pig provides a high-level scripting language (Pig Latin) for data transformations, compiling to MapReduce/Tez jobs.

Pig Latin Basics
- Core Operations
- Advanced Features
- Pig vs Hive
Fundamental Pig Commands

Example: Word Count in Pig (the equivalent MapReduce program runs to roughly 200 lines of Java).
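A minimal sketch of that script (the input and output paths are hypothetical):

```pig
-- Load raw text, split each line into words, then count per word.
lines  = LOAD '/data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount';
```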
Apache HBase: NoSQL on HDFS
What is HBase?
HBase is a distributed, column-oriented NoSQL database built on HDFS, modeled after Google’s Bigtable.

HBase Operations
- Basic CRUD
- Row Key Design
- HBase vs HDFS/RDBMS
Create, Read, Update, Delete

HBase Shell:
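A sketch covering all four operations (the `users` table and `info` column family are hypothetical names):

```
create 'users', 'info'                      # Create a table with one column family
put 'users', 'row1', 'info:name', 'Alice'   # Create/update a cell
get 'users', 'row1'                         # Read a single row
scan 'users', {LIMIT => 10}                 # Read a range of rows
delete 'users', 'row1', 'info:name'         # Delete one cell
```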
Deep Dive: HBase Internals and Storage Engine
HBase is not a relational database; it is a Log-Structured Merge-Tree (LSM-Tree) based storage system. This architecture is optimized for high write throughput and sequential disk I/O.

1. The Write Path: WAL and MemStore
Every write to HBase follows a strict “Persistence First” protocol to ensure data durability even if a RegionServer crashes.

- WAL (Write-Ahead Log): The write is first appended to a log on HDFS. If the server dies, this log is used to replay the data.
- MemStore: After the WAL is synced, the data is written to an in-memory sorted buffer called the MemStore.
- Acknowledgement: The client receives a success response as soon as the data is in the MemStore.
2. MemStore Flush and HFile Creation
When a MemStore reaches its threshold (e.g., 128MB), it is “flushed” to HDFS as an HFile.

- HFile (SSTable): A sorted, immutable file. Once written, it is never changed.
- BlockIndex: Each HFile contains an index of its data blocks for fast binary search during lookups.
3. LSM-Tree and Read Amplification
Because HFiles are immutable, a single row might have data scattered across multiple HFiles (e.g., an update in HFile 3 overriding a value in HFile 1).

- Read Path: HBase must check the MemStore, then scan multiple HFiles, merging the results to find the latest version of a cell. This is known as Read Amplification.
- Bloom Filters: To speed this up, HBase uses Bloom Filters to skip HFiles that definitely do not contain a specific row key.
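Bloom filters are configured per column family; a sketch from the HBase shell (names hypothetical):

```
# ROW-level Bloom filter: lets reads skip HFiles that cannot contain the row key.
# ROWCOL also exists for get-by-column workloads, at a higher storage cost.
create 'users', {NAME => 'info', BLOOMFILTER => 'ROW'}
```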
4. Compaction: Managing File Bloat
To prevent the number of HFiles from growing indefinitely, HBase performs Compaction.

| Type | Action | Impact |
|---|---|---|
| Minor Compaction | Picks a few small HFiles and merges them into one slightly larger HFile. | Low I/O; cleans up some “Read Amplification”. |
| Major Compaction | Merges ALL HFiles in a Column Family into a single file. | High I/O; deletes expired cells and tombstones (deletes). |
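Both kinds run automatically, but can also be requested from the HBase shell (table name hypothetical):

```
compact 'users'         # request a minor compaction
major_compact 'users'   # request a major compaction; heavy I/O, often run off-peak
```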
5. Data Locality and HDFS Interaction
HBase achieves “local reads” by co-locating RegionServers with their HDFS DataNodes: when a MemStore flushes, the first replica is written to the local disk. Over time, as regions move, HBase relies on the HDFS Balancer and major compactions to restore data locality.

6. The Mathematics of Region Sizing
A common failure in production HBase clusters is “Region Squatting”: having too many regions per RegionServer, which fragments the available MemStore memory.

The Capacity Formula: the maximum number of regions a RegionServer can safely handle is bounded by the total MemStore heap:

$$\text{max regions} \approx \frac{\text{heap} \times \text{MemStore fraction}}{\text{flush size} \times \text{column families}}$$

Example:
- Heap: 32 GB
- MemStore Fraction: 0.4 (40% of heap reserved for writes)
- Flush Size: 128 MB
- Column Families: 2
- Result: (32,768 MB × 0.4) / (128 MB × 2) ≈ 51 regions
7. The Region Split Protocol (State Machine)
When a region exceeds `hbase.hregion.max.filesize` (e.g., 10GB), it must split. This is a complex distributed transaction coordinated via ZooKeeper:
- PRE_SPLIT: RegionServer (RS) creates a `split` znode in ZooKeeper.
- OFFLINE: RS takes the parent region offline, stopping all writes.
- DAUGHTER_CREATION: RS creates two “Reference Files” (daughter regions) pointing to the parent’s HFiles. This is a metadata-only operation (very fast).
- OPEN_DAUGHTERS: RS opens the two daughter regions and begins serving requests.
- POST_SPLIT: RS updates the `.META.` table and deletes the parent region metadata.
- Compaction (Background): Over time, the daughter regions undergo compaction, which physically splits the parent HFiles into new, independent HFiles, eventually deleting the reference files.
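A split can also be requested manually, for example to pre-empt a hot region (table and split key hypothetical):

```
split 'users', 'row500'   # split the enclosing region at row key 'row500'
```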
Ecosystem Integration and Coordination
Beyond storage and processing, the ecosystem requires robust coordination and ingestion tools.

1. Apache ZooKeeper: The Glue

ZooKeeper is a distributed coordination service used by almost every component:

- HBase: Master election and region server tracking.
- HDFS: NameNode HA leader election.
- YARN: ResourceManager HA and state management.
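These znodes can be inspected with the stock ZooKeeper CLI; a sketch, with the host hypothetical and the exact znode layout varying by component version:

```
$ zkCli.sh -server zk1.example.com:2181
ls /hbase            # region server, master, and state znodes
get /hbase/master    # data identifying the currently active HMaster
```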
2. Apache Oozie: Workflow Orchestration
Oozie manages complex pipelines of Hadoop jobs.

- Workflow: A DAG of actions (Hive query -> Pig script -> MapReduce job).
- Coordinator: Triggers workflows based on time or data availability (e.g., “Run every day at 2 AM” or “Run when the /sales/today folder exists”).
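A minimal workflow sketch with a single Hive action (the app name, script, and property values are hypothetical):

```xml
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
  <start to="hive-etl"/>
  <action name="hive-etl">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>etl.hql</script>
    </hive>
    <ok to="end"/>      <!-- on success, finish the workflow -->
    <error to="fail"/>  <!-- on failure, jump to the kill node -->
  </action>
  <kill name="fail">
    <message>ETL failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
```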
3. Data Ingestion: Sqoop and Flume
- Sqoop (SQL-to-Hadoop): Efficiently transfers bulk data between RDBMS (MySQL, Oracle) and HDFS/Hive.
- Flume: A distributed service for collecting, aggregating, and moving large amounts of streaming log data into HDFS.
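As a sketch, a typical Sqoop import (connection details and paths are hypothetical):

```bash
sqoop import \
  --connect jdbc:mysql://db.example.com/shop \
  --username etl \
  --password-file /user/etl/.mysql.pw \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4   # parallel map tasks; Sqoop splits on the primary key by default
```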