Spark SQL & DataFrames
Module Duration: 6-8 hours
Focus: Structured data processing with DataFrames and SQL
Outcome: Build optimized analytical queries using Spark’s DataFrame API
From RDDs to DataFrames
RDD Limitations:
No schema → No optimization
Manual type handling
Verbose transformations
DataFrames solve this:
Schema-aware (like SQL tables)
Automatic optimization (Catalyst)
Unified API (SQL + functional)
Often far faster than equivalent RDD code (commonly cited as 10-100x; see the sketch after this list)
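A minimal sketch of the contrast, assuming an existing SparkSession named spark and a hypothetical users.csv with name and age columns:
# RDD style: manual parsing, manual typing, no optimizer help
rdd = spark.sparkContext.textFile("users.csv")
ages = (rdd.filter(lambda line: not line.startswith("name"))  # skip the header row
           .map(lambda line: int(line.split(",")[1])))
print(ages.sum() / ages.count())
# DataFrame style: schema-aware, optimized by Catalyst
from pyspark.sql.functions import avg
df = spark.read.csv("users.csv", header=True, inferSchema=True)
df.agg(avg("age")).show()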
Part 1: DataFrame Basics
Creating DataFrames
From existing data:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()
# From a list
data = [("Alice", 34), ("Bob", 28), ("Charlie", 42)]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
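DataFrames can also be built from Row objects, which name the fields inline; a minimal sketch with the same hypothetical data:
from pyspark.sql import Row
rows = [Row(name="Alice", age=34), Row(name="Bob", age=28)]
df_rows = spark.createDataFrame(rows)
df_rows.show()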
From files:
# CSV
df = spark.read.csv("users.csv", header=True, inferSchema=True)
# JSON
df = spark.read.json("users.json")
# Parquet (columnar, optimized)
df = spark.read.parquet("users.parquet")
# With an explicit schema (better performance: skips the extra pass that inferSchema needs)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df = spark.read.schema(schema).csv("users.csv")
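The schema can also be written as a DDL-formatted string, which is more compact (a sketch against the same hypothetical users.csv):
df = spark.read.schema("name STRING, age INT").csv("users.csv")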
DataFrame Operations
Select columns:
df.select("name", "age").show()
df.select(df.name, df.age + 1).show()
Filter rows:
df.filter(df.age > 30).show()
df.where("age > 30").show()  # SQL syntax
Add columns:
df.withColumn("senior", df.age > 65).show()
Aggregations:
from pyspark.sql.functions import avg, max, min  # these shadow Python's built-in max/min here
df.groupBy("department").agg(
    avg("salary").alias("avg_salary"),
    max("salary").alias("max_salary"),
    min("salary").alias("min_salary")
).show()
Part 2: SQL Queries
df.createOrReplaceTempView("users")
result = spark.sql("""
    SELECT age, COUNT(*) AS count
    FROM users
    WHERE age > 25
    GROUP BY age
    ORDER BY count DESC
""")
result.show()
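The same query can be written with the DataFrame API; both forms go through Catalyst and produce the same plan (a sketch against the df registered as the users view):
from pyspark.sql.functions import count, desc
result = (df.filter(df.age > 25)
            .groupBy("age")
            .agg(count("*").alias("count"))
            .orderBy(desc("count")))
result.show()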
Part 3: Catalyst Optimizer
Stages:
Analysis → Resolve column and table references
Logical Optimization → Apply rules such as predicate pushdown
Physical Planning → Choose join strategies and physical operators
Code Generation → Emit optimized JVM bytecode (whole-stage codegen)
df.explain(True)  # View the parsed, analyzed, optimized, and physical plans
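One way to see the optimizer at work is to filter a Parquet read and inspect the plan; the predicate should appear as a pushed filter on the scan (users.parquet is a placeholder, and the exact plan text varies by Spark version):
df = spark.read.parquet("users.parquet")
df.filter(df.age > 30).explain(True)
# Look for PushedFilters in the Parquet scan node of the physical plan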
Part 4: Joins & Window Functions
Joins:
employees.join(departments, "dept_id", "inner").show()
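The employees and departments DataFrames are assumed above; a minimal sketch of how they might be created (hypothetical columns and data):
employees = spark.createDataFrame(
    [("Alice", 1, 85000), ("Bob", 2, 72000)],
    ["name", "dept_id", "salary"])
departments = spark.createDataFrame(
    [(1, "Engineering"), (2, "Marketing")],
    ["dept_id", "dept_name"])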
Window functions:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
windowSpec = Window.partitionBy("department").orderBy("salary")
df.withColumn("rank", rank().over(windowSpec)).show()
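Other ranking functions follow the same pattern; for example, dense_rank and row_number over the same window:
from pyspark.sql.functions import dense_rank, row_number
df.withColumn("dense_rank", dense_rank().over(windowSpec)) \
  .withColumn("row_number", row_number().over(windowSpec)) \
  .show()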
Caching:
df.cache()
df.count() # Triggers caching
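persist lets you pick an explicit storage level, and unpersist releases the cached data; a minimal sketch:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)  # explicit storage level instead of cache()'s default
df.count()      # first action materializes the cache
df.unpersist()  # release when no longer needed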
Partitioning:
df.repartition(10, "department")
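To reduce the number of partitions without a full shuffle, coalesce is the cheaper counterpart to repartition:
df.coalesce(4)  # narrow dependency, avoids a full shuffle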
UDFs:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
square = udf(lambda x: x * x, IntegerType())  # declare the return type; the default is StringType
df.withColumn("squared", square(df.value)).show()
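Python UDFs are opaque to Catalyst and add serialization overhead, so prefer built-in column expressions when one exists; the same result without a UDF:
from pyspark.sql.functions import col
df.withColumn("squared", col("value") * col("value")).show()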
Summary
DataFrames are often dramatically faster than equivalent RDD code (commonly cited as 10-100x) thanks to Catalyst optimization, schema awareness, and whole-stage code generation. The SQL and DataFrame APIs are interchangeable and compile to the same execution plans.
What’s Next?
Module 4: Spark Streaming
Process real-time data streams with Structured Streaming