
Chapter 5: Hadoop Ecosystem

The true power of Hadoop lies not just in HDFS and MapReduce, but in the rich ecosystem of tools built on top of it. This chapter explores the key components that transform Hadoop from a low-level distributed file system and processing framework into a complete big data platform.
Chapter Goals:
  • Understand the Hadoop ecosystem landscape
  • Master Hive for SQL on Hadoop
  • Learn Pig for data flow programming
  • Explore HBase for real-time NoSQL storage
  • Study workflow orchestration with Oozie
  • Understand data ingestion with Kafka and Flume
  • Compare ecosystem tools and when to use each

Ecosystem Overview

The Hadoop Stack

+---------------------------------------------------------------+
|                  HADOOP ECOSYSTEM STACK                       |
+---------------------------------------------------------------+
|                                                               |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              APPLICATIONS & USE CASES                   │  |
|  │  Business Intelligence, Analytics, ML, ETL, Reporting   │  |
|  └─────────────────────────────────────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              HIGH-LEVEL TOOLS                           │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  SQL Query:  │ Scripting:  │ ML:       │ Graph:        │  |
|  │  • Hive      │ • Pig       │ • Mahout  │ • Giraph      │  |
|  │  • Impala    │ • Cascading │ • Spark   │               │  |
|  │  • Presto    │             │   MLlib   │               │  |
|  └──────────────┴─────────────┴───────────┴───────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │         DATA PROCESSING FRAMEWORKS                      │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  Batch:      │ Stream:     │ Interactive:              │  |
|  │  • MapReduce │ • Storm     │ • Impala                  │  |
|  │  • Spark     │ • Flink     │ • Presto                  │  |
|  │  • Tez       │ • Samza     │ • Drill                   │  |
|  └──────────────┴─────────────┴───────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │         RESOURCE MANAGEMENT & ORCHESTRATION             │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  • YARN (Resource Manager)                              │  |
|  │  • Oozie (Workflow Scheduler)                           │  |
|  │  • ZooKeeper (Coordination)                             │  |
|  └─────────────────────────────────────────────────────────┘  |
|                           ↑                                   |
|  ┌─────────────────────────────────────────────────────────┐  |
|  │              STORAGE LAYER                              │  |
|  ├─────────────────────────────────────────────────────────┤  |
|  │  Distributed FS:  │ NoSQL:    │ Ingestion:            │  |
|  │  • HDFS           │ • HBase   │ • Kafka               │  |
|  │  • S3             │ • Kudu    │ • Flume               │  |
|  │                   │           │ • Sqoop               │  |
|  └───────────────────┴───────────┴───────────────────────┘  |
|                                                               |
+---------------------------------------------------------------+

GUIDING PRINCIPLE:
─────────────────
Each layer builds on lower layers.
Higher layers provide easier abstractions.
Lower layers provide more control and flexibility.

Why the Ecosystem Matters

Abstraction

Hide Complexity:
  • MapReduce requires Java programming
  • Hive provides SQL interface
  • Easier onboarding, faster development
  • Reach more users (analysts, not just engineers)

Productivity

Faster Development:
  • 100 lines of Java MapReduce → 5 lines of SQL
  • Pig reduces code by 10-20x
  • Faster iteration, fewer bugs
  • Focus on logic, not plumbing

Specialization

Right Tool for Job:
  • Hive for SQL analytics
  • HBase for real-time access
  • Kafka for stream ingestion
  • Each optimized for specific use case

Innovation

Community-Driven:
  • Open source enables experimentation
  • Best tools emerge organically
  • Spark displaced MapReduce
  • Ecosystem evolves with needs

Apache Hive: SQL on Hadoop

What is Hive?

Hive provides a SQL interface (HiveQL) to data stored in HDFS, translating SQL queries into MapReduce, Tez, or Spark jobs.
+---------------------------------------------------------------+
|                    APACHE HIVE ARCHITECTURE                   |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  CLIENT (SQL Interface)                  │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Hive CLI                              │                 |
|  │  • Beeline (JDBC client)                 │                 |
|  │  • JDBC/ODBC drivers                     │                 |
|  │  • BI tools (Tableau, PowerBI)           │                 |
|  └─────────────────┬────────────────────────┘                 |
|                    │                                          |
|                    │ HiveQL query                             |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  HIVE SERVER (HiveServer2)               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Parser                            │  │                 |
|  │  │  (SQL → Abstract Syntax Tree)      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Semantic Analyzer                 │  │                 |
|  │  │  (Validate, resolve metadata)      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Logical Plan Generator            │  │                 |
|  │  │  (Operator tree)                   │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Optimizer                         │  │                 |
|  │  │  (Predicate pushdown, join reorder)│  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Physical Plan (MapReduce/Tez)     │  │                 |
|  │  │  (Execution plan)                  │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  └──────────────────────────────────────────┘                 |
|                    │                                          |
|                    │ Submit MR/Tez job                        |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  EXECUTION ENGINE                        │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • MapReduce (slow)                      │                 |
|  │  • Tez (faster, DAG-based)               │                 |
|  │  • Spark (fastest, in-memory)            │                 |
|  └──────────────────────────────────────────┘                 |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  METASTORE                               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  Database: MySQL, PostgreSQL, Derby      │                 |
|  │  ┌────────────────────────────────────┐  │                 |
|  │  │  Tables:                           │  │                 |
|  │  │  • Table metadata                  │  │                 |
|  │  │  • Column types                    │  │                 |
|  │  │  • Partitions                      │  │                 |
|  │  │  • Storage location (HDFS path)    │  │                 |
|  │  │  • SerDe info                      │  │                 |
|  │  └────────────────────────────────────┘  │                 |
|  └──────────────────────────────────────────┘                 |
|                    ↓                                          |
|  ┌──────────────────────────────────────────┐                 |
|  │  STORAGE (HDFS)                          │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  /user/hive/warehouse/                   │                 |
|  │    ├─ sales/                             │                 |
|  │    │  ├─ year=2023/                      │                 |
|  │    │  │  └─ month=01/                    │                 |
|  │    │  │     └─ data.parquet              │                 |
|  │    └─ users/                             │                 |
|  │       └─ data.orc                        │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+
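
The JDBC path shown in the diagram is how BI tools and applications typically reach HiveServer2. The sketch below is a minimal Java JDBC client, assuming a HiveServer2 endpoint on the default port 10000 (host name, credentials, and table are illustrative placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Connect to HiveServer2 (placeholder host, user, and empty password).
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver.example.com:10000/default", "analyst", "");
         Statement stmt = conn.createStatement()) {

      // HiveServer2 parses, optimizes, and runs this as a MapReduce/Tez/Spark job.
      ResultSet rs = stmt.executeQuery(
          "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept");
      while (rs.next()) {
        System.out.println(rs.getString("dept") + " -> " + rs.getDouble("avg_salary"));
      }
    }
  }
}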

Hive Fundamentals

SQL-Like Syntax
-- CREATE TABLE
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DECIMAL(10,2),
  dept STRING,
  hire_date DATE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;


-- LOAD DATA
LOAD DATA INPATH '/user/data/employees.csv'
INTO TABLE employees;


-- SELECT QUERY
SELECT dept, AVG(salary) as avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'
GROUP BY dept
HAVING AVG(salary) > 50000
ORDER BY avg_salary DESC;


-- JOIN
SELECT e.name, e.salary, d.dept_name
FROM employees e
JOIN departments d
ON e.dept = d.dept_id;


-- SUBQUERY
SELECT name, salary
FROM employees
WHERE salary > (
  SELECT AVG(salary) FROM employees
);
Behind the Scenes: each query translates to a MapReduce/Tez job:
SELECT dept, COUNT(*) FROM employees GROUP BY dept;

↓ Translates to:

Map Phase:
• Read employees table from HDFS
• For each row, emit (dept, 1)

Shuffle:
• Group by dept

Reduce Phase:
• For each dept, sum counts
• Write to HDFS

Result written to temp HDFS location.
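
For contrast, a hand-written MapReduce job for the same GROUP BY count could look roughly like the following sketch, assuming the comma-delimited employees layout defined earlier (with dept as the fourth field):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DeptCount {

  // Map phase: read a CSV line, emit (dept, 1).
  public static class DeptMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text dept = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length >= 4) {              // id,name,salary,dept,hire_date
        dept.set(fields[3]);
        ctx.write(dept, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each dept.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "dept count");
    job.setJarByClass(DeptCount.class);
    job.setMapperClass(DeptMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}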

Hive Metastore

Centralized Metadata Repository
METASTORE ARCHITECTURE:
──────────────────────

┌──────────────────────────────────────┐
│  Hive Clients                        │
│  • Hive CLI                          │
│  • Beeline                           │
│  • Spark SQL                         │
│  • Impala                            │
│  • Presto                            │
└───────────────┬──────────────────────┘

                │ Thrift API

┌──────────────────────────────────────┐
│  Metastore Service                   │
│  (HiveMetastore)                     │
└───────────────┬──────────────────────┘

                │ JDBC

┌──────────────────────────────────────┐
│  Metastore Database                  │
│  (MySQL, PostgreSQL, Derby)          │
├──────────────────────────────────────┤
│  Tables:                             │
│  • DBS (databases)                   │
│  • TBLS (tables)                     │
│  • COLUMNS_V2 (columns)              │
│  • PARTITIONS (partition metadata)   │
│  • SDS (storage descriptors)         │
│  • SERDES (serialization info)       │
└──────────────────────────────────────┘


STORED METADATA:
───────────────

For each table:
• Database name
• Table name
• Column names and types
• Partition keys
• Storage location (HDFS path)
• File format (Parquet, ORC, etc.)
• SerDe class
• Bucket information
• Table statistics
Example Metadata:
CREATE TABLE sales (
  id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/sales';
Metastore Entries:
DBS table:
┌────────────────────────────────────┐
│ DB_ID | NAME    | DESC | LOCATION  │
├───────┼─────────┼──────┼───────────┤
│ 1     | default | ...  | /user/... │
└────────────────────────────────────┘

TBLS table:
┌───────────────────────────────────────────┐
│ TBL_ID | DB_ID | TBL_NAME | TBL_TYPE   │
├────────┼───────┼──────────┼────────────┤
│ 100    | 1     | sales    | MANAGED    │
└───────────────────────────────────────────┘

COLUMNS_V2 table:
┌─────────────────────────────────────────┐
│ CD_ID | COLUMN_NAME | TYPE_NAME         │
├───────┼─────────────┼───────────────────┤
│ 200   | id          | int               │
│ 200   | amount      | decimal(10,2)     │
└─────────────────────────────────────────┘

PARTITIONS table:
┌───────────────────────────────────────┐
│ PART_ID | TBL_ID | PART_NAME         │
├─────────┼────────┼───────────────────┤
│ 300     | 100    | year=2023/month=1 │
│ 301     | 100    | year=2023/month=2 │
└───────────────────────────────────────┘

SDS (Storage Descriptor):
┌──────────────────────────────────────────┐
│ SD_ID | LOCATION                         │
├───────┼──────────────────────────────────┤
│ 400   | /user/hive/warehouse/sales/...  │
└──────────────────────────────────────────┘
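
Engines such as Spark SQL, Impala, and Presto read these same entries through the Metastore's Thrift API rather than by querying the backing RDBMS directly. A minimal Java sketch using HiveMetaStoreClient (the metastore URI is an assumed placeholder; it is normally supplied by hive-site.xml):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreInspect {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // Assumed metastore endpoint; normally picked up from hive-site.xml.
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore.example.com:9083");

    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      Table sales = client.getTable("default", "sales");
      // These fields correspond to the TBLS / SDS / SERDES rows shown above.
      System.out.println("Location:       " + sales.getSd().getLocation());
      System.out.println("InputFormat:    " + sales.getSd().getInputFormat());
      System.out.println("SerDe:          " + sales.getSd().getSerdeInfo().getSerializationLib());
      System.out.println("Partition keys: " + sales.getPartitionKeys());
    } finally {
      client.close();
    }
  }
}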
Table Ownership
-- MANAGED TABLE (Hive owns data)
CREATE TABLE managed_sales (
  id INT,
  amount DECIMAL
);

-- Data stored in Hive warehouse:
-- /user/hive/warehouse/managed_sales/

DROP TABLE managed_sales;
-- Data DELETED from HDFS!


-- EXTERNAL TABLE (Hive doesn't own data)
CREATE EXTERNAL TABLE external_sales (
  id INT,
  amount DECIMAL
)
LOCATION '/data/sales';

-- Data at /data/sales/ (outside Hive warehouse)

DROP TABLE external_sales;
-- Only metadata deleted, data remains!
When to Use Each:
MANAGED TABLES:
──────────────

Use when:
• Hive is primary consumer
• Want Hive to manage lifecycle
• Data is temporary/intermediate

Benefit:
• DROP TABLE cleans everything
• Simpler management


EXTERNAL TABLES:
───────────────

Use when:
• Multiple tools access data (Hive, Spark, Impala)
• Data produced outside Hive (Kafka, Flume)
• Production data (don't want accidental deletion)

Benefit:
• Data survives table drop
• Flexibility in storage location


BEST PRACTICE:
─────────────

Production: Use EXTERNAL tables
Reason: Prevents accidental data loss
Reading Custom Formats
-- Default: LazySimpleSerDe (CSV-like)
CREATE TABLE default_format (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';


-- JSON SerDe
CREATE TABLE json_data (
  id INT,
  name STRING,
  attributes MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';

-- Can read JSON files directly!


-- RegEx SerDe (parse logs)
CREATE TABLE apache_logs (
  ip STRING,
  timestamp STRING,
  request STRING,
  status INT,
  bytes BIGINT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) [^ ]* [^ ]* \\[([^\\]]*)\\] \"([^\"]*)\" ([0-9]*) ([0-9]*)"
);


-- Parquet SerDe (built-in)
CREATE TABLE parquet_data (
  id INT,
  name STRING
)
STORED AS PARQUET;
-- Uses org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe


-- Custom SerDe (write your own!)
CREATE TABLE custom_format (
  ...
)
ROW FORMAT SERDE 'com.company.CustomSerDe'
WITH SERDEPROPERTIES (
  "field.delim" = "|",
  "escape.delim" = "\\"
);
SerDe Flow:
READ PATH:
─────────

HDFS File

InputFormat (reads bytes)

SerDe (deserialize bytes → objects)

Hive Row Objects

Query Processing


WRITE PATH:
──────────

Query Results

Hive Row Objects

SerDe (serialize objects → bytes)

OutputFormat (write bytes)

HDFS File
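
To make the read path concrete, the following is a minimal sketch of a custom, read-only SerDe for pipe-delimited text that exposes every column as a string, written against the Hive 2.x serde2 API. The class name and delimiter are illustrative; a production SerDe would also handle typed columns, nulls, and the write path:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde.serdeConstants;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Read-only SerDe: deserializes "a|b|c" lines into rows of strings.
public class PipeDelimitedSerDe extends AbstractSerDe {
  private ObjectInspector inspector;
  private int numColumns;

  @Override
  public void initialize(Configuration conf, Properties tbl) throws SerDeException {
    // Column names come from the table definition via the "columns" property.
    List<String> columnNames = Arrays.asList(
        tbl.getProperty(serdeConstants.LIST_COLUMNS).split(","));
    numColumns = columnNames.size();
    List<ObjectInspector> columnOIs = new ArrayList<>();
    for (int i = 0; i < numColumns; i++) {
      columnOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
    }
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(columnNames, columnOIs);
  }

  @Override
  public Object deserialize(Writable blob) throws SerDeException {
    // The InputFormat hands us one Text line; split it into column values.
    String[] parts = blob.toString().split("\\|", -1);
    List<String> row = new ArrayList<>(numColumns);
    for (int i = 0; i < numColumns; i++) {
      row.add(i < parts.length ? parts[i] : null);
    }
    return row;
  }

  @Override
  public ObjectInspector getObjectInspector() { return inspector; }

  @Override
  public Writable serialize(Object obj, ObjectInspector oi) throws SerDeException {
    throw new SerDeException("Write path not implemented in this sketch");
  }

  @Override
  public Class<? extends Writable> getSerializedClass() { return Text.class; }

  @Override
  public SerDeStats getSerDeStats() { return null; }
}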

Deep Dive: Hive Metastore Scalability and the “Partition Explosion”

As data lakes grow to petabyte scale, the Hive Metastore often becomes the primary bottleneck in the entire stack.

1. The Partition Bottleneck

In a standard RDBMS-backed Metastore (MySQL/PostgreSQL), a query like SELECT * FROM sales WHERE year=2023 requires the Metastore to:
  1. Look up the table sales.
  2. Scan the PARTITIONS table for all entries matching the filter.
  3. Fetch the SDS (Storage Descriptor) for each partition to find the HDFS paths.
The Math of Failure:
  • If a table has 1,000,000 partitions (common in high-cardinality data), a single ad-hoc query might force the Metastore to load 1GB of metadata into memory just to plan the scan.
  • This leads to “Metastore OOM” and serialized query planning that can take minutes before a single mapper even starts.

2. Scaling Strategies

  • Metastore Federation: Splitting metadata across multiple Metastore instances based on database name.
  • Partition Pruning at the Source: Ensuring clients use partition-bound filters to avoid full metadata scans.
  • Direct SQL: Modern Hive versions use direct SQL queries to the backend DB instead of the slower DataNucleus ORM layer to fetch partitions.
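
The sketch below illustrates partition pruning at the source by asking the Metastore only for matching partitions instead of the full partition list. The table name and filter are illustrative, and the filter grammar accepted by listPartitionsByFilter varies by Hive version:

import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class PartitionPruningDemo {
  public static void main(String[] args) throws Exception {
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    try {
      // BAD: pulls metadata for every partition of the table (can be millions of rows).
      List<Partition> all = client.listPartitions("default", "sales", (short) -1);

      // BETTER: push the filter to the Metastore so only matching partitions are returned.
      List<Partition> pruned = client.listPartitionsByFilter(
          "default", "sales", "year = 2023", (short) -1);

      System.out.println("All partitions:    " + all.size());
      System.out.println("Pruned partitions: " + pruned.size());
    } finally {
      client.close();
    }
  }
}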

Deep Dive: Hive Query Lifecycle and Execution

To understand Hive’s performance, one must look past the SQL interface into the transformation pipeline that converts a declarative query into a distributed DAG.

1. The Query Planning Pipeline

Hive’s “Compiler” is a multi-stage engine that performs sophisticated optimizations before any data is touched.
Stage → Action → Output:
  • Parser: tokenizes HiveQL using Antlr → Abstract Syntax Tree (AST)
  • Semantic Analyzer: resolves table names, column types, and partition metadata from the Metastore → Query Block (QB) Tree
  • Logical Plan Generator: converts the QB tree into basic relational algebra operators (Filter, Join, Project) → Operator Tree (initial)
  • Optimizer: applies rules such as Predicate Pushdown, Column Pruning, and Partition Pruning → Operator Tree (optimized)
  • Physical Plan Generator: breaks the operator tree into executable tasks (MapReduce, Tez, or Spark) → Task Tree (DAG)

2. Tez vs. MapReduce: The DAG Revolution

While Hive 1.x relied on MapReduce, modern Hive (2.x/3.x) uses Apache Tez to eliminate the “HDFS barrier” between jobs.
  • MapReduce Barrier: Every stage in a complex query (e.g., multiple joins) must write intermediate data to HDFS, causing massive I/O overhead.
  • Tez DAG: Tez allows data to flow directly from one task to the next (e.g., Map -> Reduce -> Reduce) without intermediate HDFS writes. It uses a Directed Acyclic Graph of tasks.

3. LLAP (Live Long and Process)

Introduced in Hive 2.0, LLAP is a hybrid architecture that combines persistent query servers with standard YARN containers.
  • Persistent Daemons: Instead of starting a new JVM for every task (high latency), LLAP uses long-running daemons on worker nodes.
  • In-Memory Caching: LLAP caches columnar data (ORC/Parquet) in a smart, asynchronous cache, avoiding repetitive HDFS reads.
  • Vectorized Execution: LLAP processes data in batches of 1024 rows at a time using SIMD instructions, drastically reducing CPU cycles per row.
Feature          | Standard Hive            | Hive with LLAP
Startup Latency  | High (container launch)  | Ultra-low (always-on daemons)
Data Access      | HDFS scan                | In-memory cache + HDFS
Execution        | MapReduce/Tez tasks      | Fragment-based execution
Target Use Case  | Large batch ETL          | Interactive BI / sub-second SQL

Apache Pig: Data Flow Language

What is Pig?

Pig provides a high-level scripting language (Pig Latin) for data transformations, compiling to MapReduce/Tez jobs.
+---------------------------------------------------------------+
|                      APACHE PIG ARCHITECTURE                  |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  PIG SCRIPT (Pig Latin)                  │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  data = LOAD '/input' AS (id, name);     │                 |
|  │  filtered = FILTER data BY id > 100;     │                 |
|  │  grouped = GROUP filtered BY name;       │                 |
|  │  counts = FOREACH grouped GENERATE       │                 |
|  │             group, COUNT(filtered);      │                 |
|  │  STORE counts INTO '/output';            │                 |
|  └──────────────────┬───────────────────────┘                 |
|                     │                                         |
|                     │ Submit                                  |
|                     ↓                                         |
|  ┌──────────────────────────────────────────┐                 |
|  │  PIG COMPILER                            │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Parser (Pig Latin → logical plan)     │                 |
|  │  • Optimizer (merge, push filters)       │                 |
|  │  • Physical plan (MR/Tez operators)      │                 |
|  │  • Generate MapReduce jobs               │                 |
|  └──────────────────┬───────────────────────┘                 |
|                     │                                         |
|                     │ Submit MR jobs                          |
|                     ↓                                         |
|  ┌──────────────────────────────────────────┐                 |
|  │  EXECUTION ENGINE                        │                 |
|  │  (MapReduce, Tez)                        │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

DESIGN PHILOSOPHY:
─────────────────

• Procedural (not declarative like SQL)
• Dataflow-oriented (transformations on datasets)
• Schema-optional (can work without strict types)
• ETL-focused (Extract, Transform, Load)

Pig Latin Basics

Fundamental Pig Commands
-- LOAD data from HDFS
users = LOAD '/data/users.csv'
        USING PigStorage(',')
        AS (id:int, name:chararray, age:int, city:chararray);


-- FILTER rows
adults = FILTER users BY age >= 18;


-- FOREACH (transform each row)
names_ages = FOREACH adults GENERATE name, age;


-- GROUP BY
by_city = GROUP users BY city;

-- Result: {group: "NYC", users: {(1, "Alice", 30, "NYC"), (2, "Bob", 25, "NYC")}}


-- JOIN
orders = LOAD '/data/orders.csv'
         AS (order_id:int, user_id:int, amount:double);

joined = JOIN users BY id, orders BY user_id;


-- ORDER BY
sorted = ORDER users BY age DESC;


-- DISTINCT
unique_cities = DISTINCT (FOREACH users GENERATE city);


-- LIMIT
top_10 = LIMIT sorted 10;


-- STORE results
STORE top_10 INTO '/output' USING PigStorage('\t');
Example: Word Count in Pig
-- Load input
lines = LOAD '/input/books.txt' AS (line:chararray);

-- Split into words
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group by word
grouped = GROUP words BY word;

-- Count
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Sort by count descending
sorted = ORDER counts BY count DESC;

-- Store result
STORE sorted INTO '/output/wordcount';
Equivalent in MapReduce: ~200 lines of Java!

Apache HBase: NoSQL on HDFS

What is HBase?

HBase is a distributed, column-oriented NoSQL database built on HDFS, modeled after Google’s Bigtable.
+---------------------------------------------------------------+
|                    APACHE HBASE ARCHITECTURE                  |
+---------------------------------------------------------------+
|                                                               |
|  ┌──────────────────────────────────────────┐                 |
|  │  CLIENT                                  │                 |
|  │  (HBase Shell, Java API, REST/Thrift)    │                 |
|  └───────────────┬──────────────────────────┘                 |
|                  │                                            |
|                  │ RPC                                        |
|                  ↓                                            |
|  ┌──────────────────────────────────────────┐                 |
|  │  HMASTER (Cluster Master)                │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  • Assign regions to RegionServers       │                 |
|  │  • Handle region splits/merges           │                 |
|  │  • Schema changes (create/delete tables) │                 |
|  │  • Load balancing                        │                 |
|  └──────────────────────────────────────────┘                 |
|                  ↑                                            |
|  ┌───────────────┴──────────────────────┐                     |
|  │         ZOOKEEPER                    │                     |
|  ├──────────────────────────────────────┤                     |
|  │  • Cluster coordination              │                     |
|  │  • Master election                   │                     |
|  │  • Region assignment tracking        │                     |
|  └──────────────────────────────────────┘                     |
|                  ↑                                            |
|  ┌───────────────┴──────────────────────────────┐             |
|  │  REGIONSERVER 1  │  REGIONSERVER 2  │  ...   │             |
|  ├──────────────────┼──────────────────┼────────┤             |
|  │  ┌────────────┐  │  ┌────────────┐  │        │             |
|  │  │ Region A   │  │  │ Region B   │  │        │             |
|  │  ├────────────┤  │  ├────────────┤  │        │             |
|  │  │ MemStore   │  │  │ MemStore   │  │        │             |
|  │  │ (in-memory)│  │  │ (in-memory)│  │        │             |
|  │  ├────────────┤  │  ├────────────┤  │        │             |
|  │  │ BlockCache │  │  │ BlockCache │  │        │             |
|  │  │ (reads)    │  │  │ (reads)    │  │        │             |
|  │  └────────────┘  │  └────────────┘  │        │             |
|  └──────────────────┴──────────────────┴────────┘             |
|                  ↓                                            |
|  ┌──────────────────────────────────────────┐                 |
|  │  HDFS (Persistent Storage)               │                 |
|  ├──────────────────────────────────────────┤                 |
|  │  HFiles (SSTable format):                │                 |
|  │  • Immutable sorted files                │                 |
|  │  • Stored per column family              │                 |
|  │  • 3x replicated (HDFS)                  │                 |
|  └──────────────────────────────────────────┘                 |
|                                                               |
+---------------------------------------------------------------+

DATA MODEL:
──────────

Table: users
┌─────────────┬──────────────────────────────────────────┐
│ Row Key     │ Column Family: info    │ Column Family: │
│             │                        │ prefs          │
├─────────────┼────────────────────────┼────────────────┤
│ user_001    │ info:name = "Alice"    │ prefs:theme=   │
│             │ info:email= "[email protected]"  │   "dark"       │
│             │ ts=1234567890          │ ts=1234567891  │
├─────────────┼────────────────────────┼────────────────┤
│ user_002    │ info:name = "Bob"      │ prefs:lang=    │
│             │ info:email= "[email protected]"  │   "en"         │
│             │ ts=1234567892          │ ts=1234567893  │
└─────────────┴────────────────────────┴────────────────┘

KEY CONCEPTS:
────────────

• Row Key: Primary key, lexicographically sorted
• Column Family: Physical grouping of columns
• Column Qualifier: Column name within family
• Cell: (row, column family, column, timestamp) → value
• Timestamp: Version of cell (multi-versioning)
• Sparse: Rows can have different columns

HBase Operations

Create, Read, Update, Delete
// CREATE TABLE
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("users"));
tableDesc.addFamily(new HColumnDescriptor("info"));
tableDesc.addFamily(new HColumnDescriptor("prefs"));
admin.createTable(tableDesc);


// PUT (insert or update)
HTable table = new HTable(conf, "users");
Put put = new Put(Bytes.toBytes("user_001"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
              Bytes.toBytes("Alice"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"),
              Bytes.toBytes("[email protected]"));
table.put(put);


// GET (read single row)
Get get = new Get(Bytes.toBytes("user_001"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
System.out.println("Name: " + Bytes.toString(name));


// SCAN (read multiple rows)
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("user_000"));
scan.setStopRow(Bytes.toBytes("user_100"));
ResultScanner scanner = table.getScanner(scan);
for (Result r : scanner) {
  // Process each row
}
scanner.close();


// DELETE
Delete delete = new Delete(Bytes.toBytes("user_001"));
table.delete(delete);
HBase Shell:
# Create table
create 'users', 'info', 'prefs'

# Put data
put 'users', 'user_001', 'info:name', 'Alice'
put 'users', 'user_001', 'info:email', '[email protected]'

# Get data
get 'users', 'user_001'

# Scan table
scan 'users', {STARTROW => 'user_000', STOPROW => 'user_100'}

# Count rows
count 'users'

# Delete row
deleteall 'users', 'user_001'

# Disable and drop table
disable 'users'
drop 'users'

Deep Dive: HBase Internals and Storage Engine

HBase is not a relational database; it is a Log-Structured Merge-Tree (LSM-Tree) based storage system. This architecture is optimized for high-write throughput and sequential disk I/O.

1. The Write Path: WAL and MemStore

Every write to HBase follows a strict “Persistence First” protocol to ensure data durability even if a RegionServer crashes.
  1. WAL (Write-Ahead Log): The write is first appended to a log on HDFS. If the server dies, this log is used to replay the data.
  2. MemStore: After the WAL is synced, the data is written to an in-memory sorted buffer called the MemStore.
  3. Acknowledgement: The client receives a success response as soon as the data is in the MemStore.
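
Clients can tune step 1 explicitly through the Durability setting on a write. A small sketch against the HBase 1.x client API, reusing the users table from earlier (row key and values are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteDurabilityExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("users"))) {

      Put put = new Put(Bytes.toBytes("user_003"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Carol"));

      // SYNC_WAL: append to the WAL and sync it before acknowledging (durable).
      // Durability.SKIP_WAL skips step 1 entirely: faster, but data sitting in
      // the MemStore is lost if the RegionServer crashes before the next flush.
      put.setDurability(Durability.SYNC_WAL);

      table.put(put);
    }
  }
}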

2. MemStore Flush and HFile Creation

When a MemStore reaches its threshold (e.g., 128MB), it is “flushed” to HDFS as an HFile.
  • HFile (SSTable): A sorted, immutable file. Once written, it is never changed.
  • BlockIndex: Each HFile contains an index of its data blocks for fast binary search during lookups.

3. LSM-Tree and Read Amplification

Because HFiles are immutable, a single row might have data scattered across multiple HFiles (e.g., an update in HFile 3 overriding a value in HFile 1).
  • Read Path: HBase must check the MemStore, then scan multiple HFiles, merging the results to find the latest version of a cell. This is known as Read Amplification.
  • Bloom Filters: To speed this up, HBase uses Bloom Filters to skip HFiles that definitely do not contain a specific row key.
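
Both the MemStore flush threshold from section 2 and the Bloom filters above are configured per table or per column family. A hedged sketch using the HBase 1.x admin classes (table and family names are illustrative):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class CreateTunedTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {

      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
      // Flush each region's MemStore at 128MB (the usual default, shown explicitly).
      desc.setMemStoreFlushSize(128 * 1024 * 1024L);

      HColumnDescriptor cf = new HColumnDescriptor("info");
      // Row-level Bloom filter: lets reads skip HFiles that cannot contain the key.
      cf.setBloomFilterType(BloomType.ROW);
      cf.setMaxVersions(3);   // keep up to 3 timestamped versions per cell
      desc.addFamily(cf);

      admin.createTable(desc);
    }
  }
}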

4. Compaction: Managing File Bloat

To prevent the number of HFiles from growing indefinitely, HBase performs Compaction.
Type → Action → Impact:
  • Minor Compaction: picks a few small HFiles and merges them into one slightly larger HFile → low I/O; cleans up some read amplification
  • Major Compaction: merges ALL HFiles in a Column Family into a single file → high I/O; deletes expired cells and tombstones (deletes)
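
Compactions can also be requested manually; operators often trigger major compactions off-peak rather than relying on the automatic schedule. A minimal sketch using the HBase 1.x Admin API on the users table:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CompactTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {

      // Flush MemStores to HFiles first, then request a major compaction.
      admin.flush(TableName.valueOf("users"));
      admin.majorCompact(TableName.valueOf("users"));   // asynchronous request

      // A minor compaction of selected HFiles can be requested with:
      admin.compact(TableName.valueOf("users"));
    }
  }
}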

5. Data Locality and HDFS Interaction

HBase achieves "local reads" by co-locating RegionServers with HDFS DataNodes. When a MemStore flushes, the first HDFS replica is written to the local DataNode, so the RegionServer can read its own HFiles without crossing the network. When regions move to other servers, locality degrades; the next major compaction rewrites the HFiles on the new server and restores it.

6. The Mathematics of Region Sizing

A common failure in production HBase clusters is "Region Squatting": hosting too many regions per RegionServer, which fragments the available MemStore memory.

The Capacity Formula: the maximum number of regions a RegionServer can safely handle is bounded by the total MemStore heap:

  MaxRegions = (RS_Heap × MemStore_Fraction) / (MemStore_Flush_Size × Num_Column_Families)

Example:
  • Heap: 32GB
  • MemStore Fraction: 0.4 (40% of heap reserved for writes)
  • Flush Size: 128MB
  • Column Families: 2
  • Result: (32,768 × 0.4) / (128 × 2) ≈ 51 regions
Consequence of Over-provisioning: If you host 500 regions on this server, the ~13GB MemStore budget spreads to only ~26MB per region (about 13MB per column family), far below the 128MB flush size. This causes "Thundering Flushes" where the server spends all its time writing tiny HFiles, leading to massive compaction pressure and I/O saturation.
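
A few lines of arithmetic make the capacity formula concrete. This worked example simply plugs in the numbers above (all constants are illustrative):

public class RegionCapacity {
  // MaxRegions = (heapMb * memstoreFraction) / (flushSizeMb * columnFamilies)
  static long maxRegions(long heapMb, double memstoreFraction,
                         long flushSizeMb, int columnFamilies) {
    return (long) ((heapMb * memstoreFraction) / (flushSizeMb * columnFamilies));
  }

  public static void main(String[] args) {
    // 32GB heap, 40% reserved for MemStores, 128MB flush size, 2 column families.
    System.out.println(maxRegions(32 * 1024, 0.4, 128, 2));    // prints 51

    // Hosting 500 regions spreads the same MemStore budget far too thin:
    double perRegionMb = (32 * 1024 * 0.4) / 500;              // ~26MB per region
    System.out.printf("MemStore per region: %.1f MB%n", perRegionMb);
  }
}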

7. The Region Split Protocol (State Machine)

When a region exceeds the hbase.hregion.max.filesize (e.g., 10GB), it must split. This is a complex distributed transaction coordinated via ZooKeeper:
  1. PRE_SPLIT: RegionServer (RS) creates a split znode in ZooKeeper.
  2. OFFLINE: RS takes the parent region offline, stopping all writes.
  3. DAUGHTER_CREATION: RS creates two “Reference Files” (daughter regions) pointing to the parent’s HFiles. This is a metadata-only operation (very fast).
  4. OPEN_DAUGHTERS: RS opens the two daughter regions and begins serving requests.
  5. POST_SPLIT: RS updates the .META. table and deletes the parent region metadata.
  6. Compaction (Background): Over time, the daughter regions undergo compaction, which physically splits the parent HFiles into new, independent HFiles, eventually deleting the reference files.

Ecosystem Integration and Coordination

Beyond storage and processing, the ecosystem requires robust coordination and ingestion tools.

1. Apache ZooKeeper: The Glue

ZooKeeper is a distributed coordination service used by almost every component:
  • HBase: Master election and region server tracking.
  • HDFS: NameNode HA leader election.
  • YARN: ResourceManager HA and state management.
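
The primitive behind most of these uses is the ephemeral znode: the first process to create it becomes the leader, and the node disappears automatically if that process's session dies. A minimal sketch against the ZooKeeper Java API (ensemble address and paths are placeholders, and the parent znode is assumed to exist):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElectionSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the ZooKeeper ensemble (placeholder host list).
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {});

    try {
      // Ephemeral node: removed automatically when this session ends.
      zk.create("/myservice/leader", "host-a".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("I am the leader");
    } catch (KeeperException.NodeExistsException e) {
      // Someone else created the node first; watch it and take over on failover.
      System.out.println("Standing by as follower");
    }
  }
}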

2. Apache Oozie: Workflow Orchestration

Oozie manages complex pipelines of Hadoop jobs.
  • Workflow: A DAG of actions (Hive query -> Pig script -> MapReduce job).
  • Coordinator: Triggers workflows based on time or data availability (e.g., “Run every day at 2 AM” or “Run when the /sales/today folder exists”).

3. Data Ingestion: Sqoop and Flume

  • Sqoop (SQL-to-Hadoop): Efficiently transfers bulk data between RDBMS (MySQL, Oracle) and HDFS/Hive.
  • Flume: A distributed service for collecting, aggregating, and moving large amounts of streaming log data into HDFS.

Conclusion: The Modern Hadoop Stack

Today, the “Hadoop Ecosystem” has evolved. While Hive remains the standard for SQL-on-HDFS, many processing tasks have shifted to Apache Spark due to its in-memory performance. However, the core principles of the Hadoop ecosystem—separation of storage (HDFS), resource management (YARN), and specialized processing engines—continue to define modern big data architecture.