Skip to main content
SQL Foundations - Relational model, query structure, and operations

Module 1: SQL Foundations

SQL (Structured Query Language) is the universal language for interacting with relational databases. This module builds your foundation from the ground up, ensuring you can write any query with confidence.
Estimated Time: 8-10 hours
Hands-On: 30+ practice exercises
Project: Build queries for an e-commerce analytics dashboard

Learning Objectives

By the end of this module, you will be able to:
  • Explain core relational concepts: tables, rows, columns, primary/foreign keys, and relationships.
  • Write idiomatic SELECT queries: projection, filtering, sorting, pagination, and basic expressions.
  • Join multiple tables: use INNER/LEFT joins correctly and reason about when each is appropriate.
  • Aggregate data: use GROUP BY, HAVING, and common aggregate functions to answer business questions.
  • Use window functions safely: apply ranking, running totals, and moving averages without breaking correctness.
  • Design and query simple schemas: model small domains (e.g., blog, e-commerce) and query them confidently.

Module-Specific Learning Outcomes

Section 1.1 - Relational Database Fundamentals

  • Understand the relational model and how it differs from file-based data storage
  • Identify and explain the purpose of primary keys, foreign keys, and their role in data integrity
  • Recognize common data types in PostgreSQL and know when to use each
  • Apply constraints (UNIQUE, NOT NULL, CHECK) to enforce data quality rules
  • Explain referential integrity and how foreign key constraints prevent orphaned records

Section 1.2 - Query Construction and Filtering

  • Construct basic SELECT queries with proper column selection and aliases
  • Apply WHERE clauses with comparison operators (=, !=, <, >, <=, >=)
  • Combine multiple conditions using AND, OR, and NOT operators
  • Use special operators (IN, BETWEEN, LIKE, ILIKE) for pattern matching
  • Implement result ordering (ORDER BY) and pagination (LIMIT, OFFSET)
  • Apply performance best practices like avoiding SELECT * in production code

Section 1.3 - Data Aggregation and Analysis

  • Utilize aggregate functions (COUNT, SUM, AVG, MIN, MAX) appropriately
  • Group data using GROUP BY with single and multiple columns
  • Filter groups using HAVING clauses after aggregation
  • Combine WHERE and HAVING clauses in complex analytical queries
  • Interpret aggregated results in business context

Section 1.4 - Multi-Table Operations

  • Differentiate between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN
  • Apply appropriate JOIN types based on desired result set behavior
  • Handle NULL values that result from unmatched JOINs
  • Execute multiple JOINs across several tables in a single query
  • Use self-JOINs for hierarchical data and comparative analysis
  • Construct complex queries involving multiple table relationships

Section 1.5 - Advanced Query Techniques

  • Create and utilize subqueries in WHERE, FROM, and SELECT clauses
  • Implement correlated subqueries that reference outer query context
  • Apply EXISTS and NOT EXISTS for existence checking
  • Understand query execution order and performance implications
  • Troubleshoot common query construction errors and misconceptions

Hands-On Lab 1: Query Basics and Filtering

Goal: Master basic SELECT queries, filtering, and result manipulation with realistic data.
  1. Create a lab_products table:
    CREATE TABLE lab_products (
      id SERIAL PRIMARY KEY,
      name TEXT NOT NULL,
      category TEXT NOT NULL,
      price DECIMAL(10,2) NOT NULL,
      stock_quantity INTEGER NOT NULL,
      supplier TEXT NOT NULL,
      created_at TIMESTAMP DEFAULT NOW()
    );
    
  2. Insert at least 15 diverse products across 4-5 categories (e.g., Electronics, Books, Clothing, Home, Sports) with varying prices and stock levels:
    • Electronics: Laptops, phones, tablets, headphones
    • Books: Fiction, non-fiction, textbooks
    • Clothing: shirts, pants, jackets
    • Home: kitchen items, furniture, decor
    • Sports: equipment, clothing, accessories
  3. Write queries to:
    • List all electronics priced under $500, sorted by price (lowest first)
    • Find all products with low stock (less than 10 units) in the ‘Electronics’ or ‘Home’ categories
    • Show the 5 most expensive products, displaying only name and price
    • Find all products with ‘pro’ or ‘max’ in their name (case insensitive)
    • List products that cost between 25and25 and 200, ordered by category then price
    • Show products that have no stock (stock_quantity = 0)
  4. Add a few products with special characters in names or unusual pricing, then run your queries again to ensure they handle edge cases properly.
  5. Practice using LIMIT and OFFSET to implement pagination: write a query that shows products 6-10 when sorted alphabetically by name.
When you can write these queries from memory without referencing examples, you’ve mastered basic query construction.

Hands-On Lab 2: Joins & Aggregations

Goal: Connect multiple tables with JOINs and analyze data using aggregate functions.
  1. Create three related tables representing an e-commerce scenario:
    CREATE TABLE lab_customers (
      id SERIAL PRIMARY KEY,
      name TEXT NOT NULL,
      email TEXT UNIQUE NOT NULL,
      city TEXT NOT NULL,
      registration_date DATE DEFAULT CURRENT_DATE
    );
    
    CREATE TABLE lab_products (
      id SERIAL PRIMARY KEY,
      name TEXT NOT NULL,
      category TEXT NOT NULL,
      price DECIMAL(10,2) NOT NULL,
      cost DECIMAL(10,2) NOT NULL  -- For profit calculations
    );
    
    CREATE TABLE lab_orders (
      id SERIAL PRIMARY KEY,
      customer_id INTEGER NOT NULL REFERENCES lab_customers(id),
      product_id INTEGER NOT NULL REFERENCES lab_products(id),
      quantity INTEGER NOT NULL,
      order_date DATE DEFAULT CURRENT_DATE,
      status TEXT DEFAULT 'completed'  -- completed, pending, cancelled
    );
    
  2. Populate the tables with realistic data:
    • 8-10 customers from different cities
    • 12-15 products across various categories
    • 25-30 orders linking customers to products (some customers should have multiple orders, some products should be ordered multiple times)
    • Include at least 2 customers who have never placed an order
    • Include at least 2 products that have never been ordered
  3. Write queries to:
    • List all orders with customer name and product name
    • Show total quantity ordered and total revenue per customer
    • Find the top 3 products by total revenue generated (price * quantity)
    • Calculate profit margins per product category (revenue - cost)
    • Find customers who have never placed an order using LEFT JOIN
    • Identify products that have never been ordered
    • Show monthly order trends (orders per month)
  4. Advanced challenges:
    • Create a summary showing each customer’s favorite product category (category they spent the most on)
    • Find the customer who spent the most money in each city
    • Calculate conversion rates (percentage of products that have been ordered at least once)
  5. Experiment with changing INNER JOINs to LEFT JOINs and observe how the result set changes when there are unmatched rows.
When you can confidently write JOINs and aggregations to answer complex business questions, you’re ready for advanced SQL techniques.

1.1 Introduction to Relational Databases

What is a Relational Database? (The Big Picture)

Think of a relational database like a digital filing cabinet that’s incredibly organized and can find information instantly. Unlike a spreadsheet where data might be scattered, a database stores data in structured tables that can be connected to each other. Real-World Analogy:
  • A table is like a spreadsheet sheet (e.g., “Users”, “Orders”, “Products”)
  • A row is like one entry in that spreadsheet (e.g., one user’s information)
  • A column is like a field/category (e.g., “name”, “email”, “age”)
  • A relationship connects tables (e.g., “Orders belong to Users”)
Why Use a Database Instead of Files/Spreadsheets?
  1. Data Integrity: Prevents invalid data (e.g., can’t create an order for a user that doesn’t exist)
  2. Concurrent Access: Multiple people can read/write simultaneously without conflicts
  3. Query Power: Find complex patterns across millions of records in milliseconds
  4. Relationships: Link related data automatically (orders → users → products)
  5. ACID Guarantees: Transactions ensure data consistency even if something crashes
A relational database organizes data into tables (relations) with rows (records) and columns (attributes). Tables can be linked through relationships using keys.
┌─────────────────────────────────────────────────────────────────────────┐
│                          RELATIONAL MODEL                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  USERS TABLE                          ORDERS TABLE                      │
│  ┌────┬───────────┬─────────────┐    ┌────┬─────────┬────────┬───────┐ │
│  │ id │   name    │    email    │    │ id │ user_id │ total  │ status│ │
│  ├────┼───────────┼─────────────┤    ├────┼─────────┼────────┼───────┤ │
│  │  1 │ Alice     │ a@email.com │◄───│  1 │    1    │ 150.00 │ done  │ │
│  │  2 │ Bob       │ b@email.com │◄───│  2 │    1    │  75.50 │ done  │ │
│  │  3 │ Charlie   │ c@email.com │◄───│  3 │    2    │ 200.00 │pending│ │
│  └────┴───────────┴─────────────┘    └────┴─────────┴────────┴───────┘ │
│         ▲                                      │                        │
│         │          PRIMARY KEY                 │                        │
│         └──────────────────────────────────────┘                        │
│                     FOREIGN KEY                                         │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Key Concepts

What is it? A column (or combination of columns) that uniquely identifies each row in a table.Why do we need it? Think of it like a Social Security Number or student ID. Every person has a unique one, and you can use it to find exactly that person. In databases, the primary key lets you:
  • Uniquely identify any row (no duplicates allowed!)
  • Link tables together (foreign keys reference primary keys)
  • Make queries faster (databases automatically index primary keys)
Real-World Example:
  • In a library: Each book has a unique ISBN (primary key)
  • In a school: Each student has a unique student ID
  • In an e-commerce site: Each user has a unique user ID
CREATE TABLE users (
    id SERIAL PRIMARY KEY,  -- Auto-incrementing unique identifier
    email VARCHAR(255) UNIQUE NOT NULL,
    name VARCHAR(100) NOT NULL
);
Breaking down SERIAL PRIMARY KEY:
  • SERIAL: PostgreSQL automatically generates the next number (1, 2, 3, 4…)
  • PRIMARY KEY: This column is the unique identifier
  • You don’t need to provide the ID when inserting - it’s automatic!
Rules (Why these matter):
  • Must be unique: If two rows had the same ID, how would you tell them apart?
  • Cannot be NULL: Every row MUST have an ID (you can’t have “unknown” as an identifier)
  • Should rarely change: If you change an ID, all foreign keys pointing to it break!
  • Often auto-incrementing: Makes it easy - just insert data, ID is assigned automatically
Common Mistake:
-- ❌ BAD: Using email as primary key
CREATE TABLE users (
    email VARCHAR(255) PRIMARY KEY,  -- What if user changes email?
    name VARCHAR(100)
);

-- ✅ GOOD: Separate ID as primary key
CREATE TABLE users (
    id SERIAL PRIMARY KEY,  -- Never changes
    email VARCHAR(255) UNIQUE NOT NULL,  -- Can change, but stays unique
    name VARCHAR(100)
);
What is it? A column that references the primary key of another table, creating a relationship.Why do we need it? Think of it like a pointer or reference. It says “this order belongs to that user” by storing the user’s ID.Real-World Analogy:
  • In a library system: A “loan” record has a book_id (foreign key) pointing to the “books” table
  • In a school: A “grade” record has a student_id (foreign key) pointing to the “students” table
  • In e-commerce: An “order” has a user_id (foreign key) pointing to the “users” table
CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),  -- This is the foreign key!
    total DECIMAL(10, 2) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);
Breaking down REFERENCES users(id):
  • user_id: The column in the orders table that stores the user’s ID
  • REFERENCES users(id): This tells PostgreSQL “user_id must exist in the users table’s id column”
  • NOT NULL: Every order MUST belong to a user (can’t be orphaned)
What happens with foreign keys?
  1. Prevents Invalid Data:
-- ❌ This will FAIL:
INSERT INTO orders (user_id, total) VALUES (99999, 100.00);
-- Error: insert or update on table "orders" violates foreign key constraint
-- Because user_id 99999 doesn't exist in users table!

-- ✅ This works:
INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com') RETURNING id;
-- Returns: id = 1
INSERT INTO orders (user_id, total) VALUES (1, 100.00);  -- Success!
  1. Prevents Orphaned Records:
-- If you try to delete a user who has orders:
DELETE FROM users WHERE id = 1;
-- Error: update or delete on table "users" violates foreign key constraint
-- PostgreSQL protects you from creating "orphaned" orders with no user!

-- You must delete orders first, or use CASCADE:
CREATE TABLE orders (
    user_id INTEGER REFERENCES users(id) ON DELETE CASCADE
    -- Now deleting a user automatically deletes their orders
);
  1. Makes Queries Powerful:
-- Join tables using the foreign key relationship:
SELECT u.name, o.total, o.created_at
FROM users u
JOIN orders o ON u.id = o.user_id;  -- The foreign key makes this join possible!
Benefits (Why this matters):
  • Enforces referential integrity: Can’t create an order for a non-existent user
  • Prevents orphaned records: Can’t delete a user if they have orders (unless you handle it)
  • Documents relationships: The schema itself shows how tables connect
  • Enables JOINs: Foreign keys make it easy to combine related data
Common Patterns:
-- One-to-Many: One user has many orders
users (1) ──< (many) orders

-- Many-to-Many: Students enroll in courses
students (many) ──< enrollments >── (many) courses
-- Need a junction table with foreign keys to both!
Common PostgreSQL data types:
TypeDescriptionExample
INTEGERWhole numbers42, -100
BIGINTLarge whole numbersUser IDs at scale
SERIALAuto-incrementing integerPrimary keys
VARCHAR(n)Variable-length stringNames, emails
TEXTUnlimited length stringDescriptions
DECIMAL(p,s)Exact numericMoney: DECIMAL(10,2)
BOOLEANTrue/Falseis_active
TIMESTAMPDate and timecreated_at
TIMESTAMPTZTimestamp with timezoneAlways use for times
JSONBBinary JSONFlexible data
UUIDUniversally unique IDDistributed systems

1.2 Basic SELECT Queries

Hands-On Lab 1 – Query the users table
Goal: Get comfortable with SELECT, WHERE, ORDER BY, and LIMIT.
  1. Create a users table:
    CREATE TABLE lab_users (
      id SERIAL PRIMARY KEY,
      name TEXT NOT NULL,
      email TEXT NOT NULL UNIQUE,
      age INT NOT NULL,
      country TEXT NOT NULL
    );
    
  2. Insert at least 10 realistic rows covering multiple ages and countries.
  3. Write queries to:
    • List all users ordered by newest id first.
    • Find all users older than 25.
    • Find all users from a specific country (e.g. ‘US’).
    • Return only name and email for users younger than 21.
  4. Add one or two edge cases (very old age, unusual country) and re-run your queries to make sure they still behave as expected.
When you can write these queries without looking at the examples, you’re done with this lab.
What is SELECT? The SELECT statement is how you ask the database questions. It’s like saying “Show me…” or “Find me…”. The Basic Structure:
SELECT [what columns] FROM [which table] WHERE [what conditions];
Think of it like a question:
  • “SELECT name, email” = “Show me the name and email”
  • “FROM users” = “from the users table”
  • “WHERE age > 18” = “where the age is greater than 18”
Step-by-Step: Your First Query Let’s say you have a users table:
users table:
┌────┬──────────┬─────────────────────┬─────┐
│ id │   name   │       email         │ age │
├────┼──────────┼─────────────────────┼─────┤
│  1 │ Alice    │ alice@example.com   │  25 │
│  2 │ Bob      │ bob@example.com    │  30 │
│  3 │ Charlie  │ charlie@example.com │  18 │
└────┴──────────┴─────────────────────┴─────┘
Query 1: Get everything
SELECT * FROM users;
What happens:
  1. Database looks at the users table
  2. * means “all columns”
  3. Returns all rows with all columns
Result:
┌────┬──────────┬─────────────────────┬─────┐
│ id │   name   │       email         │ age │
├────┼──────────┼─────────────────────┼─────┤
│  1 │ Alice    │ alice@example.com   │  25 │
│  2 │ Bob      │ bob@example.com    │  30 │
│  3 │ Charlie  │ charlie@example.com │  18 │
└────┴──────────┴─────────────────────┴─────┘
Query 2: Get specific columns
SELECT name, email FROM users;
What happens:
  1. Database looks at the users table
  2. Only selects name and email columns
  3. Returns all rows, but only those two columns
Result:
┌──────────┬─────────────────────┐
│   name   │       email         │
├──────────┼─────────────────────┤
│ Alice    │ alice@example.com   │
│ Bob      │ bob@example.com    │
│ Charlie  │ charlie@example.com │
└──────────┴─────────────────────┘
Why specify columns instead of *?
  1. Performance: Fetching only what you need is faster
  2. Clarity: Makes it obvious what data you’re using
  3. Safety: If someone adds a huge column later, SELECT * would fetch it unnecessarily
  4. Index optimization: Some queries can use “index-only scans” if you only select indexed columns
The SELECT statement retrieves data from tables.
-- Select all columns from users table
SELECT * FROM users;

-- Select specific columns
SELECT name, email FROM users;

-- Select with column alias
SELECT 
    name AS customer_name,
    email AS contact_email
FROM users;
Avoid SELECT * in production code! It fetches unnecessary data, breaks when schema changes, and prevents index-only scans. Always specify the columns you need.

Filtering with WHERE: Finding Specific Data

What is WHERE? The WHERE clause is like a filter. It says “only show me rows that match these conditions.” Real-World Analogy:
  • Without WHERE: “Show me all books in the library” (thousands of books!)
  • With WHERE: “Show me all books published after 2020” (much smaller, relevant list)
How WHERE Works:
  1. Database looks at each row in the table
  2. Checks if the row matches your WHERE condition
  3. Only returns rows where the condition is TRUE
  4. Skips rows where the condition is FALSE or NULL
Step-by-Step Example: Starting data:
users table:
┌────┬──────────┬─────────────────────┬─────┐
│ id │   name   │       email         │ age │
├────┼──────────┼─────────────────────┼─────┤
│  1 │ Alice    │ alice@example.com   │  25 │
│  2 │ Bob      │ bob@example.com    │  30 │
│  3 │ Charlie  │ charlie@example.com │  18 │
│  4 │ Diana    │ diana@example.com  │  22 │
└────┴──────────┴─────────────────────┴─────┘
Query: Find users older than 21
SELECT * FROM users WHERE age > 21;
What the database does:
  1. Look at row 1: age = 25, is 25 > 21? YES → Include this row
  2. Look at row 2: age = 30, is 30 > 21? YES → Include this row
  3. Look at row 3: age = 18, is 18 > 21? NO → Skip this row
  4. Look at row 4: age = 22, is 22 > 21? YES → Include this row
Result:
┌────┬──────────┬─────────────────────┬─────┐
│ id │   name   │       email         │ age │
├────┼──────────┼─────────────────────┼─────┤
│  1 │ Alice    │ alice@example.com   │  25 │
│  2 │ Bob      │ bob@example.com    │  30 │
│  4 │ Diana    │ diana@example.com  │  22 │
└────┴──────────┴─────────────────────┴─────┘
Important: WHERE vs HAVING
  • WHERE: Filters rows BEFORE grouping/aggregation
  • HAVING: Filters groups AFTER grouping/aggregation
  • You’ll learn HAVING later, but remember: WHERE comes first!
The WHERE clause filters rows based on conditions.
-- Basic equality
SELECT * FROM users WHERE id = 1;

-- String matching
SELECT * FROM users WHERE email = 'alice@example.com';

-- Comparison operators
SELECT * FROM products WHERE price > 100;
SELECT * FROM products WHERE stock <= 10;

-- Not equal
SELECT * FROM users WHERE status != 'deleted';
SELECT * FROM users WHERE status <> 'deleted';  -- Same as !=

Multiple Conditions

Combine conditions with AND, OR, and NOT.
-- AND: Both conditions must be true
SELECT * FROM products 
WHERE category = 'electronics' 
  AND price < 500;

-- OR: At least one condition must be true
SELECT * FROM products 
WHERE category = 'electronics' 
   OR category = 'computers';

-- NOT: Negates a condition
SELECT * FROM users 
WHERE NOT status = 'banned';

-- Complex combinations (use parentheses!)
SELECT * FROM products 
WHERE (category = 'electronics' OR category = 'computers')
  AND price BETWEEN 100 AND 1000
  AND stock > 0;

Special Operators

-- IN: Match any value in a list
SELECT * FROM orders 
WHERE status IN ('pending', 'processing', 'shipped');

-- BETWEEN: Range inclusive
SELECT * FROM orders 
WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';

-- LIKE: Pattern matching
SELECT * FROM users WHERE email LIKE '%@gmail.com';     -- Ends with
SELECT * FROM users WHERE name LIKE 'John%';            -- Starts with
SELECT * FROM users WHERE name LIKE '%son%';            -- Contains

-- ILIKE: Case-insensitive LIKE (PostgreSQL)
SELECT * FROM users WHERE name ILIKE '%john%';

-- IS NULL / IS NOT NULL
SELECT * FROM orders WHERE shipped_at IS NULL;          -- Not yet shipped
SELECT * FROM users WHERE phone IS NOT NULL;            -- Has phone number

Sorting Results

-- Sort ascending (default)
SELECT * FROM products ORDER BY price;
SELECT * FROM products ORDER BY price ASC;

-- Sort descending
SELECT * FROM products ORDER BY price DESC;

-- Multiple sort columns
SELECT * FROM products 
ORDER BY category ASC, price DESC;

-- Sort with NULL handling
SELECT * FROM products 
ORDER BY discount NULLS LAST;  -- NULLs at the end

Limiting Results

-- Get first N rows
SELECT * FROM products ORDER BY created_at DESC LIMIT 10;

-- Pagination with OFFSET
SELECT * FROM products 
ORDER BY id 
LIMIT 20 OFFSET 40;  -- Page 3 (rows 41-60)

-- Better pagination with keyset (cursor-based)
SELECT * FROM products 
WHERE id > 1000      -- Last seen ID
ORDER BY id 
LIMIT 20;
Performance Tip: OFFSET becomes slow for large values because the database must scan and discard all skipped rows. Use keyset pagination for better performance on large datasets.

1.3 Aggregations and Grouping

Aggregate functions compute a single result from multiple rows.

Aggregate Functions

-- COUNT: Number of rows
SELECT COUNT(*) FROM orders;                    -- All rows
SELECT COUNT(shipped_at) FROM orders;           -- Non-NULL values only
SELECT COUNT(DISTINCT user_id) FROM orders;     -- Unique users

-- SUM: Total of numeric values
SELECT SUM(total) FROM orders WHERE status = 'completed';

-- AVG: Average
SELECT AVG(price) FROM products WHERE category = 'electronics';

-- MIN / MAX
SELECT MIN(price), MAX(price) FROM products;

-- Combined
SELECT 
    COUNT(*) AS total_orders,
    SUM(total) AS revenue,
    AVG(total) AS avg_order_value,
    MIN(total) AS smallest_order,
    MAX(total) AS largest_order
FROM orders
WHERE status = 'completed';

GROUP BY

Group rows by column values and apply aggregates to each group.
-- Orders per user
SELECT 
    user_id,
    COUNT(*) AS order_count,
    SUM(total) AS total_spent
FROM orders
GROUP BY user_id;

-- Sales by category
SELECT 
    category,
    COUNT(*) AS product_count,
    AVG(price) AS avg_price,
    SUM(stock) AS total_inventory
FROM products
GROUP BY category;

-- Multiple grouping columns
SELECT 
    EXTRACT(YEAR FROM created_at) AS year,
    EXTRACT(MONTH FROM created_at) AS month,
    COUNT(*) AS order_count,
    SUM(total) AS revenue
FROM orders
GROUP BY 
    EXTRACT(YEAR FROM created_at),
    EXTRACT(MONTH FROM created_at)
ORDER BY year, month;

HAVING: Filter Groups

WHERE filters rows before grouping. HAVING filters groups after aggregation.
-- Find users with more than 5 orders
SELECT 
    user_id,
    COUNT(*) AS order_count
FROM orders
GROUP BY user_id
HAVING COUNT(*) > 5;

-- Categories with average price over $100
SELECT 
    category,
    AVG(price) AS avg_price
FROM products
GROUP BY category
HAVING AVG(price) > 100
ORDER BY avg_price DESC;

-- Combined WHERE and HAVING
SELECT 
    user_id,
    SUM(total) AS total_spent
FROM orders
WHERE status = 'completed'           -- Filter rows first
GROUP BY user_id
HAVING SUM(total) > 1000             -- Then filter groups
ORDER BY total_spent DESC;

1.4 JOINs Mastery: Combining Data from Multiple Tables

Why Do We Need JOINs?

The Problem: Data is split across multiple tables for organization, but you often need to see it together. Real-World Example:
  • You have a users table with names and emails
  • You have an orders table with order totals and dates
  • You want to see: “Show me all orders with the customer’s name”
Without JOINs: You’d have to:
  1. Query orders table → get user_id and total
  2. Query users table → get name for each user_id
  3. Manually match them in your application code (slow, error-prone!)
With JOINs: One query gets everything, matched automatically! The Concept: JOINs combine rows from multiple tables based on related columns (usually foreign keys). Think of it like a VLOOKUP in Excel, but much more powerful.

Understanding JOIN Types: A Visual Guide

Before diving in, here’s the key difference:
INNER JOIN:  "Show me only matches" (intersection)
LEFT JOIN:   "Show me all from left table, plus matches from right"
RIGHT JOIN:  "Show me all from right table, plus matches from left"
FULL JOIN:   "Show me everything from both tables"

INNER JOIN: The Most Common Join

What it does: Returns only rows that have matches in BOTH tables. Real-World Analogy: Like a Venn diagram intersection. Only show things that exist in both sets. Step-by-Step Example: Starting Data: Users table:
┌────┬──────────┬─────────────────────┐
│ id │   name   │       email         │
├────┼──────────┼─────────────────────┤
│  1 │ Alice    │ alice@example.com   │
│  2 │ Bob      │ bob@example.com    │
│  3 │ Charlie  │ charlie@example.com│
└────┴──────────┴─────────────────────┘
Orders table:
┌────┬─────────┬────────┬──────────────┐
│ id │ user_id │ total  │   status    │
├────┼─────────┼────────┼──────────────┤
│  1 │    1    │ 150.00 │ completed   │
│  2 │    1    │  75.50 │ completed   │
│  3 │    2    │ 200.00 │ pending     │
└────┴─────────┴────────┴──────────────┘
Query:
SELECT 
    o.id AS order_id,
    o.total,
    o.created_at,
    u.name AS customer_name,
    u.email
FROM orders o
INNER JOIN users u ON o.user_id = u.id;
What happens step-by-step:
  1. Database starts with the orders table (the “left” table in the JOIN)
  2. For each order row:
    • Order 1: user_id = 1 → Look for user with id = 1 → Found Alice! → Include this match
    • Order 2: user_id = 1 → Look for user with id = 1 → Found Alice! → Include this match
    • Order 3: user_id = 2 → Look for user with id = 2 → Found Bob! → Include this match
  3. Result: Only orders that have matching users (all 3 orders in this case)
Result:
┌──────────┬────────┬──────────────┬──────────────┬─────────────────────┐
│ order_id │ total  │ created_at   │customer_name │       email         │
├──────────┼────────┼──────────────┼──────────────┼─────────────────────┤
│    1     │ 150.00 │ 2024-01-15   │ Alice        │ alice@example.com   │
│    2     │  75.50 │ 2024-01-16   │ Alice        │ alice@example.com   │
│    3     │ 200.00 │ 2024-01-17   │ Bob          │ bob@example.com    │
└──────────┴────────┴──────────────┴──────────────┴─────────────────────┘
Notice: Charlie doesn’t appear because they have no orders. INNER JOIN only shows matches! When to use INNER JOIN:
  • When you only want rows that have matches in both tables
  • Most common join type (90% of joins are INNER)
  • Example: “Show me orders with customer info” (only orders that have customers)
Returns only rows that have matches in both tables.
-- Get order details with user names
SELECT 
    o.id AS order_id,
    o.total,
    o.created_at,
    u.name AS customer_name,
    u.email
FROM orders o
INNER JOIN users u ON o.user_id = u.id;
┌─────────────────────────────────────────────────────────────────────┐
│                           INNER JOIN                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   USERS                    ORDERS                 RESULT            │
│   ┌───┬───────┐           ┌───┬─────────┐       ┌───┬───────┬────┐ │
│   │ 1 │ Alice │ ─────────▶│ 1 │ user: 1 │──────▶│ 1 │ Alice │ O1 │ │
│   │ 2 │ Bob   │ ─────────▶│ 2 │ user: 1 │──────▶│ 2 │ Alice │ O2 │ │
│   │ 3 │ Carol │ ✗         │ 3 │ user: 2 │──────▶│ 3 │ Bob   │ O3 │ │
│   └───┴───────┘           └───┴─────────┘       └───┴───────┴────┘ │
│                                                                     │
│   Carol has no orders ──▶ Not in result                            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

LEFT JOIN (LEFT OUTER JOIN): Keep Everything from the Left

What it does: Returns ALL rows from the left table, plus matching rows from the right table. If there’s no match, the right table columns are NULL. Real-World Analogy: Like a report card. You list ALL students (left table), and show their grades (right table) if they have any. If a student has no grades yet, you still list them but show “No grades” (NULL). Why “LEFT”? The table mentioned first (before JOIN) is the “left” table. The table after JOIN is the “right” table. Step-by-Step Example: Starting Data: Users table:
┌────┬──────────┬─────────────────────┐
│ id │   name   │       email         │
├────┼──────────┼─────────────────────┤
│  1 │ Alice    │ alice@example.com   │
│  2 │ Bob      │ bob@example.com    │
│  3 │ Charlie  │ charlie@example.com│  ← Has no orders!
└────┴──────────┴─────────────────────┘
Orders table:
┌────┬─────────┬────────┬──────────────┐
│ id │ user_id │ total  │   status    │
├────┼─────────┼────────┼──────────────┤
│  1 │    1    │ 150.00 │ completed   │
│  2 │    1    │  75.50 │ completed   │
│  3 │    2    │ 200.00 │ pending     │
└────┴─────────┴────────┴──────────────┘
Query:
SELECT 
    u.id,
    u.name,
    COUNT(o.id) AS order_count,
    COALESCE(SUM(o.total), 0) AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;
What happens step-by-step:
  1. Database starts with ALL users (the left table)
  2. For each user:
    • Alice (id=1): Find orders with user_id=1 → Found 2 orders → Include with order data
    • Bob (id=2): Find orders with user_id=2 → Found 1 order → Include with order data
    • Charlie (id=3): Find orders with user_id=3 → Found 0 orders → Still include, but order columns are NULL
  3. Result: All users appear, even if they have no orders!
Result:
┌────┬──────────┬─────────────┬─────────────┐
│ id │   name   │ order_count │ total_spent │
├────┼──────────┼─────────────┼─────────────┤
│  1 │ Alice    │      2      │   225.50    │
│  2 │ Bob      │      1      │   200.00    │
│  3 │ Charlie  │      0      │     0.00    │  ← Still appears!
└────┴──────────┴─────────────┴─────────────┘
Key Points:
  • COALESCE(SUM(o.total), 0): Converts NULL to 0 (SUM returns NULL when there are no rows)
  • COUNT(o.id): Counts 0 for users with no orders (COUNT counts non-NULL values)
  • Charlie appears even though they have no orders!
When to use LEFT JOIN:
  • “Show me all users and their order counts” (including users with 0 orders)
  • “List all products and their sales” (including products that never sold)
  • “Find customers who haven’t placed orders” (use LEFT JOIN + WHERE right_table.id IS NULL)
Common Pattern: Finding Missing Relationships
-- Find users who have never placed an order
SELECT u.name, u.email
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL;  -- No matching order found!
Returns all rows from the left table, with matching rows from the right (or NULL if no match).
-- All users, with their orders (if any)
SELECT 
    u.id,
    u.name,
    COUNT(o.id) AS order_count,
    COALESCE(SUM(o.total), 0) AS total_spent
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;
┌─────────────────────────────────────────────────────────────────────┐
│                           LEFT JOIN                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   USERS                    ORDERS                 RESULT            │
│   ┌───┬───────┐           ┌───┬─────────┐       ┌───┬───────┬────┐ │
│   │ 1 │ Alice │ ─────────▶│ 1 │ user: 1 │──────▶│ 1 │ Alice │ O1 │ │
│   │ 2 │ Bob   │ ─────────▶│ 2 │ user: 1 │──────▶│ 2 │ Alice │ O2 │ │
│   │ 3 │ Carol │ ─────────▶│ 3 │ user: 2 │──────▶│ 3 │ Bob   │ O3 │ │
│   └───┴───────┘           └───┴─────────┘       │ 4 │ Carol │NULL│ │
│                                                  └───┴───────┴────┘ │
│                                                                     │
│   Carol included with NULL for order data                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

RIGHT JOIN and FULL OUTER JOIN

-- RIGHT JOIN: All rows from right table
SELECT u.name, o.id AS order_id
FROM users u
RIGHT JOIN orders o ON u.id = o.user_id;

-- FULL OUTER JOIN: All rows from both tables
SELECT u.name, o.id AS order_id
FROM users u
FULL OUTER JOIN orders o ON u.id = o.user_id;

Multiple JOINs

-- Orders with user info and product details
SELECT 
    o.id AS order_id,
    u.name AS customer,
    p.name AS product,
    oi.quantity,
    oi.unit_price,
    oi.quantity * oi.unit_price AS line_total
FROM orders o
INNER JOIN users u ON o.user_id = u.id
INNER JOIN order_items oi ON o.id = oi.order_id
INNER JOIN products p ON oi.product_id = p.id
WHERE o.created_at > '2024-01-01'
ORDER BY o.id, p.name;

Self JOIN

Join a table to itself, useful for hierarchical or comparative data.
-- Employees with their managers
SELECT 
    e.name AS employee,
    m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- Find products with the same category and similar prices
SELECT 
    p1.name AS product1,
    p2.name AS product2,
    p1.category,
    p1.price AS price1,
    p2.price AS price2
FROM products p1
INNER JOIN products p2 
    ON p1.category = p2.category 
    AND p1.id < p2.id  -- Avoid duplicates
    AND ABS(p1.price - p2.price) < 50;

CROSS JOIN

Cartesian product: every row from first table paired with every row from second.
-- All possible size/color combinations
SELECT s.name AS size, c.name AS color
FROM sizes s
CROSS JOIN colors c;

-- Generate date series with categories
SELECT 
    d.date,
    c.name AS category
FROM generate_series('2024-01-01', '2024-12-31', '1 day') AS d(date)
CROSS JOIN categories c;

1.5 Subqueries

A subquery is a query nested inside another query.

Subqueries in WHERE

-- Users who have placed orders
SELECT * FROM users
WHERE id IN (SELECT DISTINCT user_id FROM orders);

-- Products priced above average
SELECT * FROM products
WHERE price > (SELECT AVG(price) FROM products);

-- Users with more than 5 orders
SELECT * FROM users
WHERE id IN (
    SELECT user_id 
    FROM orders 
    GROUP BY user_id 
    HAVING COUNT(*) > 5
);

Correlated Subqueries

References the outer query — executes once per outer row.
-- Products with price above their category average
SELECT p1.name, p1.price, p1.category
FROM products p1
WHERE p1.price > (
    SELECT AVG(p2.price)
    FROM products p2
    WHERE p2.category = p1.category  -- References outer query
);

-- Users with their latest order
SELECT 
    u.name,
    (SELECT MAX(created_at) FROM orders o WHERE o.user_id = u.id) AS last_order
FROM users u;

EXISTS and NOT EXISTS

-- Users who have placed at least one order
SELECT u.name, u.email
FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id
);

-- Products that have never been ordered
SELECT p.name, p.price
FROM products p
WHERE NOT EXISTS (
    SELECT 1 FROM order_items oi WHERE oi.product_id = p.id
);
Performance Tip: EXISTS often performs better than IN for large subqueries because it can stop as soon as it finds one match.

Subqueries in FROM (Derived Tables)

-- Monthly revenue with growth calculation
SELECT 
    current_month.month,
    current_month.revenue,
    prev_month.revenue AS prev_revenue,
    current_month.revenue - COALESCE(prev_month.revenue, 0) AS growth
FROM (
    SELECT 
        DATE_TRUNC('month', created_at) AS month,
        SUM(total) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
) current_month
LEFT JOIN (
    SELECT 
        DATE_TRUNC('month', created_at) AS month,
        SUM(total) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
) prev_month ON current_month.month = prev_month.month + INTERVAL '1 month'
ORDER BY current_month.month;

1.6 Common Table Expressions (CTEs)

CTEs make complex queries readable by breaking them into named, logical steps.

Basic CTE Syntax

WITH monthly_revenue AS (
    SELECT 
        DATE_TRUNC('month', created_at) AS month,
        SUM(total) AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT 
    month,
    revenue,
    AVG(revenue) OVER () AS avg_monthly_revenue,
    revenue - AVG(revenue) OVER () AS diff_from_avg
FROM monthly_revenue
ORDER BY month;

Multiple CTEs

WITH 
-- Step 1: Calculate user order stats
user_stats AS (
    SELECT 
        user_id,
        COUNT(*) AS order_count,
        SUM(total) AS total_spent,
        AVG(total) AS avg_order
    FROM orders
    WHERE status = 'completed'
    GROUP BY user_id
),

-- Step 2: Classify users into tiers
user_tiers AS (
    SELECT 
        user_id,
        order_count,
        total_spent,
        CASE 
            WHEN total_spent >= 10000 THEN 'platinum'
            WHEN total_spent >= 5000 THEN 'gold'
            WHEN total_spent >= 1000 THEN 'silver'
            ELSE 'bronze'
        END AS tier
    FROM user_stats
),

-- Step 3: Count users per tier
tier_summary AS (
    SELECT 
        tier,
        COUNT(*) AS user_count,
        SUM(total_spent) AS tier_revenue
    FROM user_tiers
    GROUP BY tier
)

-- Final result
SELECT 
    tier,
    user_count,
    tier_revenue,
    ROUND(100.0 * tier_revenue / SUM(tier_revenue) OVER (), 2) AS revenue_pct
FROM tier_summary
ORDER BY tier_revenue DESC;

Recursive CTEs

For hierarchical data like org charts, categories, or graphs.
-- Employee hierarchy (org chart)
WITH RECURSIVE org_chart AS (
    -- Base case: top-level employees (no manager)
    SELECT id, name, manager_id, 1 AS level, name::TEXT AS path
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case: employees with managers
    SELECT e.id, e.name, e.manager_id, oc.level + 1, oc.path || ' > ' || e.name
    FROM employees e
    INNER JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT * FROM org_chart ORDER BY path;

-- Category tree with all descendants
WITH RECURSIVE category_tree AS (
    SELECT id, name, parent_id, 0 AS depth, ARRAY[id] AS path
    FROM categories
    WHERE parent_id IS NULL
    
    UNION ALL
    
    SELECT c.id, c.name, c.parent_id, ct.depth + 1, ct.path || c.id
    FROM categories c
    INNER JOIN category_tree ct ON c.parent_id = ct.id
)
SELECT 
    REPEAT('  ', depth) || name AS category_tree,
    depth,
    path
FROM category_tree
ORDER BY path;

1.7 Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing them into a single output row like aggregates.

ROW_NUMBER, RANK, DENSE_RANK

-- Rank products by price within each category
SELECT 
    name,
    category,
    price,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) AS row_num,
    RANK() OVER (PARTITION BY category ORDER BY price DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY price DESC) AS dense_rank
FROM products;
┌──────────────────────────────────────────────────────────────────────┐
│           ROW_NUMBER vs RANK vs DENSE_RANK                           │
├──────────────────────────────────────────────────────────────────────┤
│ Product    │ Price │ ROW_NUMBER │ RANK │ DENSE_RANK                  │
├────────────┼───────┼────────────┼──────┼────────────                  │
│ Laptop     │ 1200  │     1      │  1   │     1                        │
│ Desktop    │ 1000  │     2      │  2   │     2                        │
│ Tablet     │ 1000  │     3      │  2   │     2      ◄── Same price    │
│ Phone      │  800  │     4      │  4   │     3      ◄── RANK skips 3  │
│ Earbuds    │  100  │     5      │  5   │     4                        │
└──────────────────────────────────────────────────────────────────────┘

LAG and LEAD

Access previous or next row values.
-- Compare each order to the previous order for the same user
SELECT 
    user_id,
    id AS order_id,
    total,
    created_at,
    LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_order_total,
    total - LAG(total) OVER (PARTITION BY user_id ORDER BY created_at) AS change,
    LEAD(total) OVER (PARTITION BY user_id ORDER BY created_at) AS next_order_total
FROM orders;

-- Calculate month-over-month revenue growth
WITH monthly_revenue AS (
    SELECT 
        DATE_TRUNC('month', created_at) AS month,
        SUM(total) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT 
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month,
    ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY month)) / 
          LAG(revenue) OVER (ORDER BY month), 2) AS growth_pct
FROM monthly_revenue;

Running Totals and Moving Averages

-- Running total of sales
SELECT 
    created_at::date AS date,
    total,
    SUM(total) OVER (ORDER BY created_at) AS running_total
FROM orders;

-- 7-day moving average
SELECT 
    date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_sales;

-- Running total within each month
SELECT 
    created_at::date AS date,
    total,
    SUM(total) OVER (
        PARTITION BY DATE_TRUNC('month', created_at) 
        ORDER BY created_at
    ) AS monthly_running_total
FROM orders;

FIRST_VALUE, LAST_VALUE, NTH_VALUE

-- Compare each product to the most expensive in its category
SELECT 
    name,
    category,
    price,
    FIRST_VALUE(name) OVER (
        PARTITION BY category 
        ORDER BY price DESC
    ) AS top_product,
    FIRST_VALUE(price) OVER (
        PARTITION BY category 
        ORDER BY price DESC
    ) AS top_price,
    ROUND(100.0 * price / FIRST_VALUE(price) OVER (
        PARTITION BY category 
        ORDER BY price DESC
    ), 2) AS pct_of_top
FROM products;

Percentiles and Distribution

-- Percentile ranks
SELECT 
    name,
    salary,
    PERCENT_RANK() OVER (ORDER BY salary) AS percentile,
    NTILE(4) OVER (ORDER BY salary) AS quartile
FROM employees;

-- Median salary per department
SELECT DISTINCT
    department,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) OVER (PARTITION BY department) AS median_salary
FROM employees;

Advanced Query Semantics and Patterns

How SQL Actually Runs (Logical Evaluation Order)

SQL is not evaluated in the same order you write it. Understanding the logical evaluation order explains many “weird” rules (like why you can’t use column aliases in WHERE). Typical written order:
SELECT ...
FROM ...
WHERE ...
GROUP BY ...
HAVING ...
WINDOW ...        -- e.g., window clause in some dialects
ORDER BY ...
LIMIT ... OFFSET ...
Logical evaluation order (simplified):
  1. FROM + JOIN (build row set)
  2. WHERE (filter rows)
  3. GROUP BY (form groups)
  4. HAVING (filter groups)
  5. Window functions (OVER (...)) are computed
  6. SELECT (compute final columns/aliases)
  7. DISTINCT
  8. ORDER BY
  9. LIMIT / OFFSET
Key consequences:
  • WHERE cannot see column aliases defined in SELECT (they don’t exist yet), but ORDER BY can.
  • You cannot use aggregates in WHERE (aggregation happens after grouping), but you can use them in HAVING.
  • Window functions are computed after grouping but before the final SELECT projection.
-- ❌ This fails: alias not visible in WHERE
SELECT
  price * quantity AS line_total
FROM order_items
WHERE line_total > 100;  -- ERROR

-- ✅ Use the expression directly or a subquery/CTE
SELECT
  price * quantity AS line_total
FROM order_items
WHERE price * quantity > 100;

-- ✅ Or wrap and filter outside
SELECT *
FROM (
  SELECT price * quantity AS line_total
  FROM order_items
) t
WHERE line_total > 100;
-- ❌ Aggregates not allowed in WHERE
SELECT user_id, SUM(total) AS total_spent
FROM orders
WHERE SUM(total) > 1000;  -- ERROR

-- ✅ Use HAVING instead
SELECT user_id, SUM(total) AS total_spent
FROM orders
GROUP BY user_id
HAVING SUM(total) > 1000;

Choosing Between JOINs, Subqueries, and CTEs

For many problems you can write equivalent queries in different shapes. You should understand the trade-offs:
  • JOIN: Best when you truly need columns from multiple tables and clear relationships.
  • Subquery in WHERE / EXISTS: Great for existence checks and filtering sets.
  • Derived table / CTE: Best for multi-step logic where readability matters.
-- 1) Join style: users who have placed orders
SELECT DISTINCT u.id, u.name
FROM users u
JOIN orders o ON o.user_id = u.id;

-- 2) Subquery IN style
SELECT id, name
FROM users
WHERE id IN (SELECT DISTINCT user_id FROM orders);

-- 3) EXISTS style (often best for large data)
SELECT id, name
FROM users u
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.user_id = u.id
);
Practical guidance:
  • Prefer EXISTS when you only care if related rows exist (not how many).
  • Prefer CTEs when your query has multiple reusable steps (e.g., compute user stats, then bucket into tiers, then summarize). Treat a CTE like a named temp view.

Advanced Filtering Patterns (Production Query Shapes)

1. Anti-joins: “Who has NOT done X?”

Anti-joins show entities that lack related rows.
-- Customers who have never placed an order
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.id IS NULL;

-- Same idea with NOT EXISTS
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.id
);

2. Conditional Aggregation instead of CASE explosions

Instead of multiple joins or subqueries, you can use conditional aggregates to pivot by condition.
-- Per-customer order stats broken down by status
SELECT
  user_id,
  COUNT(*)                             AS orders_total,
  SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS orders_completed,
  SUM(CASE WHEN status = 'pending'   THEN 1 ELSE 0 END) AS orders_pending,
  SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS orders_cancelled
FROM orders
GROUP BY user_id;
This pattern is heavily used in dashboards and analytics queries.

3. Top-N Per Group (without subqueries)

A very common requirement: “Top 3 products per category”, “Top 5 students per class”, etc.
-- Top 3 most expensive products in each category
WITH ranked AS (
  SELECT
    id,
    name,
    category,
    price,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY price DESC
    ) AS rn
  FROM products
)
SELECT id, name, category, price
FROM ranked
WHERE rn <= 3
ORDER BY category, price DESC;
This uses the window + filter pattern: rank within group using a window function, then filter by the rank.

Common Interview Traps and How to Avoid Them

  • Trap 1: Mixing WHERE and HAVING incorrectly
    • Use WHERE for row-level filters (no aggregates).
    • Use HAVING for group-level filters (aggregates allowed).
  • Trap 2: Assuming GROUP BY output is ordered
    • SQL does not guarantee any order unless you use ORDER BY.
    • Always add an explicit ORDER BY if order matters.
  • Trap 3: Relying on vendor-specific behavior
    • Some engines allow selecting non-grouped columns without aggregates — this is non-standard and can produce surprising results.
    • Prefer the strict rule: every selected, non-aggregated column must appear in GROUP BY.
-- Non-portable pattern (allowed by some engines, but ambiguous)
SELECT country, name
FROM users
GROUP BY country;  -- What name should be picked?

-- Portable, deterministic pattern
SELECT country, COUNT(*) AS user_count
FROM users
GROUP BY country
ORDER BY user_count DESC;

Advanced Query Techniques and Optimization

Complex Subquery Optimization

Understanding how the database optimizer handles subqueries is critical for performance. Here are key optimization patterns:
  1. Correlated vs Non-Correlated Subqueries:
    • Non-correlated (evaluated once): SELECT * FROM users WHERE department_id IN (SELECT id FROM departments WHERE budget > 100000)
    • Correlated (evaluated per row): SELECT * FROM users u WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id AND o.amount > 1000)
  2. Subquery Flattening: Modern optimizers convert certain subqueries into JOINs automatically:
    -- This subquery...
    SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total > 100);
    
    -- May be flattened to this JOIN internally:
    SELECT DISTINCT u.* FROM users u INNER JOIN orders o ON u.id = o.user_id WHERE o.total > 100;
    
  3. Semi-Joins and Anti-Joins: IN and EXISTS often become semi-joins, while NOT IN and NOT EXISTS become anti-joins. EXISTS is often more efficient than IN for existence checks.

LATERAL Joins

LATERAL joins allow you to reference columns from preceding tables in the FROM clause, enabling correlated subqueries in a JOIN-like syntax:
-- Get top 3 orders for each user
SELECT u.name, orders_subq.order_id, orders_subq.total
FROM users u
CROSS JOIN LATERAL (
    SELECT id AS order_id, total
    FROM orders o
    WHERE o.user_id = u.id
    ORDER BY total DESC
    LIMIT 3
) orders_subq;

-- Calculate derived values that depend on other table's data
SELECT p.name, p.category, avg_by_category.avg_price
FROM products p
CROSS JOIN LATERAL (
    SELECT AVG(price) AS avg_price
    FROM products p2
    WHERE p2.category = p.category
) avg_by_category;
LATERAL is particularly powerful for:
  • Correlated aggregations
  • Top-N per group queries
  • Complex calculations that reference multiple tables
  • Calling table functions with parameters from other tables

Advanced Window Frame Specifications

Window functions offer sophisticated frame controls beyond basic PARTITION BY:
-- ROWS frame: physical row offsets
SELECT 
    date,
    sales,
    AVG(sales) OVER (
        ORDER BY date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- 7-day moving average
    ) AS weekly_avg
FROM daily_sales;

-- RANGE frame: logical value-based window
SELECT 
    employee_id,
    salary,
    COUNT(*) OVER (
        ORDER BY salary 
        RANGE BETWEEN 10000 PRECEDING AND 10000 FOLLOWING  -- employees with similar salary
    ) AS peers_in_salary_range
FROM employees;

-- Groups frame: treating tied values as a single unit
SELECT 
    product_id,
    price,
    RANK() OVER (ORDER BY price) AS rank,
    COUNT(*) OVER (
        ORDER BY price 
        GROUPS BETWEEN CURRENT ROW AND 2 FOLLOWING  -- count products in current and next 2 price groups
    ) AS products_in_current_and_next_groups
FROM products;
Frame specifications allow for:
  • Running calculations with specific lookback periods
  • Comparisons within value ranges
  • Handling ties consistently in analytical computations
  • Efficient computation of rolling statistics

1.8 Practice Exercises

Exercise 1: Basic Queries

Using a products table with columns: id, name, category, price, stock, created_at:
-- 1. Find all products in the 'Electronics' category priced under $500
-- 2. Find products with low stock (less than 10) that need reordering
-- 3. List the 5 most recently added products
-- 4. Find products with names containing 'Pro' or 'Ultra'
-- 1
SELECT * FROM products 
WHERE category = 'Electronics' AND price < 500;

-- 2
SELECT name, stock, category FROM products 
WHERE stock < 10 
ORDER BY stock ASC;

-- 3
SELECT * FROM products 
ORDER BY created_at DESC 
LIMIT 5;

-- 4
SELECT * FROM products 
WHERE name ILIKE '%pro%' OR name ILIKE '%ultra%';

Exercise 2: Aggregations

-- 1. Count products per category
-- 2. Find the average price per category, only for categories with avg > $200
-- 3. Find the total inventory value (price * stock) per category
-- 4. Find categories with more than 50 products
-- 1
SELECT category, COUNT(*) as product_count 
FROM products 
GROUP BY category 
ORDER BY product_count DESC;

-- 2
SELECT category, AVG(price) as avg_price 
FROM products 
GROUP BY category 
HAVING AVG(price) > 200;

-- 3
SELECT category, SUM(price * stock) as inventory_value 
FROM products 
GROUP BY category 
ORDER BY inventory_value DESC;

-- 4
SELECT category, COUNT(*) as count 
FROM products 
GROUP BY category 
HAVING COUNT(*) > 50;

Exercise 3: JOINs

Hands-On Lab 2 – Joins & aggregations on e-commerce data
Goal: Become comfortable joining tables and aggregating results.
  1. Create three tables:
    CREATE TABLE lab_customers (
      id SERIAL PRIMARY KEY,
      name TEXT NOT NULL
    );
    
    CREATE TABLE lab_products (
      id SERIAL PRIMARY KEY,
      name TEXT NOT NULL,
      price NUMERIC(10,2) NOT NULL
    );
    
    CREATE TABLE lab_orders (
      id SERIAL PRIMARY KEY,
      customer_id INT NOT NULL REFERENCES lab_customers(id),
      product_id INT NOT NULL REFERENCES lab_products(id),
      quantity INT NOT NULL,
      created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    );
    
  2. Insert at least 5 customers, 5 products, and 20 orders with a mix of quantities.
  3. Write queries to:
    • List each order with customer name and product name.
    • Show total quantity and total revenue per customer.
    • Find the top 3 products by total revenue.
    • Find customers who have never placed an order (add at least one such customer).
  4. Experiment with turning your queries into LEFT JOINs and see how the result set changes when there are missing rows.
When you can write these joins and aggregations from memory, you’re ready for more advanced SQL.
-- 1. List all orders with customer name and email
-- 2. Find customers who have never placed an order
-- 3. Get the top 10 customers by total spending
-- 4. List products that have been ordered more than 100 times
-- 1
SELECT o.id, o.total, o.created_at, u.name, u.email
FROM orders o
INNER JOIN users u ON o.user_id = u.id
ORDER BY o.created_at DESC;

-- 2
SELECT u.name, u.email
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE o.id IS NULL;

-- 3
SELECT u.name, u.email, SUM(o.total) as total_spent
FROM users u
INNER JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name, u.email
ORDER BY total_spent DESC
LIMIT 10;

-- 4
SELECT p.name, COUNT(oi.id) as times_ordered
FROM products p
INNER JOIN order_items oi ON p.id = oi.product_id
GROUP BY p.id, p.name
HAVING COUNT(oi.id) > 100
ORDER BY times_ordered DESC;

Exercise 4: Window Functions

-- 1. Rank products by price within each category
-- 2. For each order, show the running total for that user
-- 3. Calculate the percentage each product contributes to its category's total stock
-- 4. Find the second-highest priced product in each category
-- 1
SELECT 
    name, category, price,
    RANK() OVER (PARTITION BY category ORDER BY price DESC) as price_rank
FROM products;

-- 2
SELECT 
    user_id, id as order_id, total, created_at,
    SUM(total) OVER (PARTITION BY user_id ORDER BY created_at) as running_total
FROM orders;

-- 3
SELECT 
    name, category, stock,
    ROUND(100.0 * stock / SUM(stock) OVER (PARTITION BY category), 2) as stock_pct
FROM products;

-- 4
WITH ranked AS (
    SELECT 
        name, category, price,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY price DESC) as rn
    FROM products
)
SELECT name, category, price
FROM ranked
WHERE rn = 2;

Module Mastery Checklist

Complete this checklist to confirm you’ve mastered SQL Foundations:

Relational Database Fundamentals

  • Can explain the difference between relational databases and file-based storage
  • Understand the purpose of primary keys and can identify them in any schema
  • Know how foreign keys create relationships and enforce referential integrity
  • Can distinguish between different PostgreSQL data types and their appropriate use cases
  • Apply constraints (UNIQUE, NOT NULL, CHECK) to maintain data quality

Query Construction and Filtering

  • Write SELECT queries with proper column selection and aliases
  • Apply WHERE clauses with comparison operators (= != < > <= >=)
  • Combine multiple conditions using AND, OR, and NOT operators
  • Use special operators (IN, BETWEEN, LIKE, ILIKE) for advanced filtering
  • Implement result ordering (ORDER BY) and pagination (LIMIT, OFFSET)
  • Avoid performance pitfalls like SELECT * in production code

Data Aggregation and Analysis

  • Use aggregate functions (COUNT, SUM, AVG, MIN, MAX) appropriately
  • Group data using GROUP BY with single and multiple columns
  • Filter groups using HAVING clauses after aggregation
  • Combine WHERE and HAVING clauses in complex analytical queries
  • Interpret aggregated results in business context

Multi-Table Operations

  • Differentiate between INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL OUTER JOIN
  • Apply appropriate JOIN types based on desired result set behavior
  • Handle NULL values that result from unmatched JOINs
  • Execute multiple JOINs across several tables in a single query
  • Use self-JOINs for hierarchical data and comparative analysis
  • Troubleshoot complex multi-table queries

Advanced Query Techniques

  • Create and utilize subqueries in WHERE, FROM, and SELECT clauses
  • Implement correlated subqueries that reference outer query context
  • Apply EXISTS and NOT EXISTS for existence checking
  • Understand query execution order and performance implications
  • Write basic window functions (ROW_NUMBER, RANK, LAG/LEAD)
  • Solve common interview problems involving SQL patterns

Practical Application

  • Complete both hands-on labs (Query Basics and Joins & Aggregations)
  • Design and query a simple schema for a small domain (e.g., blog, e-commerce)
  • Optimize queries for performance using appropriate indexes and techniques
  • Troubleshoot common SQL errors and misconceptions

Next Module

Module 2: Database Design & Modeling

Learn to design schemas that scale and maintain data integrity