Spark SQL & DataFrames

Module Duration: 6-8 hours
Focus: Structured data processing with DataFrames and SQL
Outcome: Build optimized analytical queries using Spark’s DataFrame API

From RDDs to DataFrames

RDD Limitations:
  • No schema → No optimization
  • Manual type handling
  • Verbose transformations
DataFrames solve this:
  • Schema-aware (like SQL tables)
  • Automatic optimization (Catalyst)
  • Unified API (SQL + functional)
  • Often 10-100x faster than equivalent RDD code (see the side-by-side sketch below)
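For example, computing an average salary per department both ways (a minimal sketch, assuming a SparkSession already exists as spark):
from pyspark.sql.functions import avg

data = [("eng", 100.0), ("eng", 80.0), ("hr", 60.0)]

# RDD version: manual bookkeeping of (sum, count) pairs, no optimizer help
rdd_avg = (spark.sparkContext.parallelize(data)
           .mapValues(lambda s: (s, 1))
           .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
           .mapValues(lambda t: t[0] / t[1]))

# DataFrame version: declarative, schema-aware, optimized by Catalyst
df_avg = spark.createDataFrame(data, ["dept", "salary"]).groupBy("dept").agg(avg("salary"))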

Part 1: DataFrame Basics

Creating DataFrames

From existing data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# From list
data = [("Alice", 34), ("Bob", 28), ("Charlie", 42)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
From files:
# CSV
df = spark.read.csv("users.csv", header=True, inferSchema=True)

# JSON
df = spark.read.json("users.json")

# Parquet (columnar, optimized)
df = spark.read.parquet("users.parquet")

# With an explicit schema (skips the extra inferSchema pass, better performance)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.read.schema(schema).csv("users.csv", header=True)
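Whichever source you use, it is worth confirming the schema before building queries on it (the output path below is just an example):
df.printSchema()   # Verify column names and types
df.show(5)         # Preview the first five rows
df.write.mode("overwrite").parquet("users_clean.parquet")  # Persist in the columnar format for faster reads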

DataFrame Operations

Select columns:
df.select("name", "age").show()
df.select(df.name, df.age + 1).show()
Filter rows:
df.filter(df.age > 30).show()
df.where("age > 30").show()  # where() is an alias for filter() and accepts a SQL expression string
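Conditions combine with & (and), | (or), and ~ (not) on column expressions; each condition needs its own parentheses:
from pyspark.sql.functions import col

df.filter((col("age") > 30) & (col("name") != "Bob")).show()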
Add columns:
df.withColumn("senior", df.age > 65).show()
Aggregations:
from pyspark.sql.functions import avg, max, min

df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    max("salary").alias("max_salary")
).show()
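Aggregations chain naturally with other transformations. A sketch, assuming a DataFrame with department and salary columns like the one above:
from pyspark.sql.functions import avg, count

(df.filter(df.salary > 0)
   .groupBy("department")
   .agg(count("*").alias("headcount"), avg("salary").alias("avg_salary"))
   .orderBy("avg_salary", ascending=False)
   .show())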

Part 2: SQL Queries

df.createOrReplaceTempView("users")

result = spark.sql("""
    SELECT age, COUNT(*) as count
    FROM users
    WHERE age > 25
    GROUP BY age
    ORDER BY count DESC
""")
result.show()
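The same query in the DataFrame API compiles to an equivalent optimized plan, so the choice between SQL and method chaining is mostly a matter of taste:
from pyspark.sql.functions import count

(df.filter(df.age > 25)
   .groupBy("age")
   .agg(count("*").alias("count"))
   .orderBy("count", ascending=False)
   .show())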

Part 3: Catalyst Optimizer

Stages:
  1. Analysis → Resolve columns
  2. Logical Optimization → Predicate pushdown
  3. Physical Planning → Choose join strategies
  4. Code Generation → Optimized bytecode
df.explain(True)  # True prints the parsed, analyzed, optimized, and physical plans
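To watch predicate pushdown happen, filter a Parquet-backed DataFrame and inspect the physical plan; the filter shows up as a pushed predicate in the file scan (a sketch reusing users.parquet from Part 1):
parquet_df = spark.read.parquet("users.parquet")
parquet_df.filter(parquet_df.age > 30).select("name").explain(True)
# The scan node in the physical plan lists the pushed predicate, e.g. PushedFilters: [..., GreaterThan(age,30)]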

Part 4: Joins & Window Functions

Joins:
employees.join(departments, "dept_id", "inner").show()
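When one side of a join is small, broadcasting it to every executor avoids a shuffle. Catalyst does this automatically below spark.sql.autoBroadcastJoinThreshold (10 MB by default), and you can also force it with a hint:
from pyspark.sql.functions import broadcast

employees.join(broadcast(departments), "dept_id", "left").show()  # Ship departments to every executor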
Window functions:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank

windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
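A common use is top-N per group, which needs a descending ordering and a filter on the rank (same department/salary columns assumed):
from pyspark.sql.functions import col, dense_rank, desc

top3_spec = Window.partitionBy("department").orderBy(desc("salary"))
df.withColumn("rank", dense_rank().over(top3_spec)).filter(col("rank") <= 3).show()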

Part 5: Performance

Caching:
df.cache()
df.count()  # cache() is lazy; the first action materializes the cached data
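cache() is shorthand for persist() with the default storage level; persist() lets you choose the level explicitly, and unpersist() releases the space when you are done:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # Spill partitions to disk if they don't fit in memory
df.count()                                # Action materializes the persisted data
df.unpersist()                            # Release it once the repeated reads are finished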
Partitioning:
df.repartition(10, "department")
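repartition() always shuffles and can increase the partition count; coalesce() only merges existing partitions and is the cheaper choice when reducing it:
by_dept = df.repartition(10, "department")  # Full shuffle, co-locating rows by department
smaller = by_dept.coalesce(4)               # Merge down to 4 partitions without a full shuffle
print(smaller.rdd.getNumPartitions())       # 4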
UDFs:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

square = udf(lambda x: x * x, IntegerType())  # Without a return type, UDFs default to StringType
df.withColumn("squared", square(df.value)).show()  # Assumes df has an integer "value" column
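Plain Python UDFs run row by row outside the JVM and are opaque to Catalyst, so prefer built-in functions when one exists. When a UDF is unavoidable, a vectorized pandas_udf (requires pyarrow) processes whole batches and is usually much faster; a sketch assuming the same numeric value column:
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("long")
def square_vec(s: pd.Series) -> pd.Series:
    return s * s  # Operates on a whole pandas Series per batch

df.withColumn("squared", square_vec(df.value)).show()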

Summary

DataFrames can be 10-100x faster than equivalent RDD code thanks to Catalyst optimization, schema awareness, and whole-stage code generation. The SQL and DataFrame APIs are interchangeable: both compile to the same optimized plans.

What’s Next?

Module 4: Spark Streaming

Process real-time data streams with Structured Streaming