Module 4: Indexing Deep Dive
Indexing is the single most impactful skill for database performance: a well-placed index can turn a query from minutes to milliseconds. This module gives you a deep understanding of how indexes work and when to use them.
Estimated Time: 12-15 hours
Hands-On: Index analysis labs with real query plans
Key Skill: Knowing exactly which index to create for any query
4.1 How Indexes Work
The Problem: Full Table Scans
Without indexes, every query must scan every row:
-- Without index: scans 10 million rows
SELECT * FROM users WHERE email = '[email protected]';
-- Time: 5-10 seconds
-- With index: jumps directly to the row
-- Time: 1-5 milliseconds
┌─────────────────────────────────────────────────────────────────────────┐
│ FULL TABLE SCAN vs INDEX LOOKUP │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ SEQUENTIAL SCAN (No Index) B-TREE INDEX LOOKUP │
│ ─────────────────────────── ───────────────── │
│ │
│ ┌─────┬─────┬─────┬─────┐ ┌─────────────────┐ │
│ │Row 1│Row 2│Row 3│ ... │ │ Root Node │ │
│ └──┬──┴──┬──┴──┬──┴─────┘ │ [M values] │ │
│ │ │ │ └────────┬────────┘ │
│ ▼ ▼ ▼ │ │
│ Check Check Check ┌───────┴───────┐ │
│ each each each ▼ ▼ │
│ row row row ┌─────────┐ ┌─────────┐ │
│ │ │ │ │ Branch │ │ Branch │ │
│ ▼ ▼ ▼ └────┬────┘ └────┬────┘ │
│ 10 million comparisons │ │ │
│ ▼ ▼ │
│ O(n) - Linear time ┌──────────┐ ┌──────────┐ │
│ │ Leaf │ │ Leaf │ │
│ │ (Target!)│ │ │ │
│ └──────────┘ └──────────┘ │
│ │
│ O(log n) - Logarithmic time │
│ ~25 comparisons for 10M rows │
│ │
└─────────────────────────────────────────────────────────────────────────┘
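The difference between O(n) and O(log n) can be made concrete with a small Python simulation. This is purely illustrative (a sorted list standing in for an index, not PostgreSQL's actual B-Tree implementation), but the comparison counts behave the same way:

```python
# Sorted keys standing in for an indexed column of 10 million rows (illustrative).
n = 10_000_000
keys = range(n)  # range objects are sorted and support len()/indexing

# Sequential scan: worst case touches every key -> O(n) comparisons.
target = n - 1
seq_comparisons = 0
for k in keys:
    seq_comparisons += 1
    if k == target:
        break

def binary_search(seq, value):
    """Binary search over sorted keys -> O(log n) comparisons,
    the same growth rate as a B-Tree descent."""
    lo, hi, steps = 0, len(seq) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if seq[mid] == value:
            return steps
        if seq[mid] < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

log_comparisons = binary_search(keys, target)
print(seq_comparisons)   # 10 million comparisons
print(log_comparisons)   # around 24 (log2 of 10M is roughly 23.3)
```

A real B-Tree does even fewer node visits than a binary search, because each node holds many keys, so the tree is typically only 3-4 levels deep for 10M rows.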
B-Tree Index Structure
PostgreSQL’s default index type. Perfect for equality and range queries.
┌─────────────────────────────────────────────────────────────────────────┐
│ B-TREE STRUCTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────┐ │
│ │ [30] [60] │ Root Node │
│ └──┬─────┬─────┬──────┘ │
│ │ │ │ │
│ ┌──────────────┘ │ └──────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ [10] [20] │ │ [40] [50] │ │ [70] [80] │ │
│ └──┬──┬──┬─────┘ └──┬──┬──┬─────┘ └──┬──┬──┬─────┘ │
│ │ │ │ │ │ │ │ │ │ │
│ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ │
│ Leaf Nodes: Contain actual data pointers (ctid: page, offset) │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ [5,8,10] ──▶ [12,15,20] ──▶ [25,28,30] ──▶ ... │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ Leaf nodes are linked for efficient range scans │
│ │
│ Properties: │
│ • Balanced: All leaves at same depth │
│ • Sorted: Keys in order (enables range queries) │
│ • Dense: Each leaf contains multiple keys │
│ • Linked: Leaves form a linked list │
│ │
└─────────────────────────────────────────────────────────────────────────┘
What Indexes Support
-- B-Tree supports these operators:
-- <, <=, =, >=, >, BETWEEN, IN, IS NULL, IS NOT NULL
-- Equality lookup: O(log n)
SELECT * FROM users WHERE email = '[email protected]';
-- Range scan: O(log n + k) where k is result size
SELECT * FROM orders WHERE created_at > '2024-01-01';
-- Sorting (if index order matches): No extra sort needed!
SELECT * FROM products ORDER BY price DESC LIMIT 10;
-- PREFIX pattern matching (starts with)
SELECT * FROM users WHERE name LIKE 'John%'; -- Can use index (with C collation or text_pattern_ops)
SELECT * FROM users WHERE name LIKE '%John'; -- Cannot use index!
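Why can a prefix match use the index while a suffix match cannot? In sorted order, all keys sharing a prefix form one contiguous run, so two binary searches can bound it. Keys sharing a suffix are scattered. A minimal Python sketch (illustrative names, not real data):

```python
import bisect

# Sorted keys stand in for a B-Tree's ordered leaf entries (illustrative).
names = sorted(["Alice", "Bob", "Johnathan", "Johnny", "John", "Jose", "Zed"])

def prefix_range(sorted_keys, prefix):
    """Keys starting with `prefix` form one contiguous run in sorted order,
    so two binary searches bound it: O(log n + k)."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")  # just past the run
    return sorted_keys[lo:hi]

print(prefix_range(names, "John"))  # ['John', 'Johnathan', 'Johnny']

# A suffix match ('%John') has no contiguous run in sorted order:
# matching keys are scattered, so every key must be checked -- a full scan.
print([n for n in names if n.endswith("John")])
```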
4.2 Creating and Managing Indexes
Basic Index Creation
-- Single column index
CREATE INDEX idx_users_email ON users(email);
-- Unique index (also enforces uniqueness)
CREATE UNIQUE INDEX idx_users_email_unique ON users(email);
-- Multi-column (composite) index
CREATE INDEX idx_orders_user_date ON orders(user_id, created_at DESC);
-- Index with included columns (covering index)
CREATE INDEX idx_orders_user_covering ON orders(user_id)
INCLUDE (status, total);
-- Partial index (index only some rows)
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Expression index (index on computed value)
CREATE INDEX idx_users_lower_email ON users(lower(email));
-- Concurrent creation (doesn't lock table)
CREATE INDEX CONCURRENTLY idx_products_name ON products(name);
Index Naming Convention
idx_<table>_<columns>_<type>
Examples:
- idx_users_email -- Single column
- idx_orders_user_date -- Multiple columns
- idx_users_email_unique -- Unique index
- idx_orders_pending_part -- Partial index
- idx_users_email_gin -- Specific type (GIN, GiST, etc.)
Dropping and Rebuilding
-- Drop index
DROP INDEX idx_users_email;
-- Drop concurrently (doesn't lock)
DROP INDEX CONCURRENTLY idx_users_email;
-- Rebuild index (useful after lots of updates/deletes)
REINDEX INDEX idx_users_email;
-- Rebuild concurrently (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_users_email;
-- Rebuild all indexes on a table
REINDEX TABLE users;
4.3 Multi-Column Indexes
Understanding column order in composite indexes is crucial.
The Leftmost Prefix Rule
CREATE INDEX idx_orders_user_status_date ON orders(user_id, status, created_at);
This index can serve queries filtering on:
- ✅ (user_id): first column
- ✅ (user_id, status): first two columns
- ✅ (user_id, status, created_at): all columns
- ❌ (status): not the leftmost column
- ❌ (status, created_at): not leftmost
- ❌ (user_id, created_at): skips status, so only the user_id part of the index is usable
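The rule follows directly from how a composite index is sorted. A minimal Python sketch (illustrative rows, with a sorted list of tuples standing in for the index):

```python
import bisect

# Composite index entries sorted by (user_id, status, created_at) -- illustrative.
entries = sorted([
    (1, "pending", "2024-01-05"),
    (1, "shipped", "2024-01-02"),
    (2, "pending", "2024-01-03"),
    (2, "shipped", "2024-01-04"),
    (3, "pending", "2024-01-01"),
])

# Leftmost prefix (user_id = 2): matches form one contiguous run,
# so binary search works.
lo = bisect.bisect_left(entries, (2,))
hi = bisect.bisect_left(entries, (3,))
print(entries[lo:hi])  # both rows for user 2, adjacent in the index

# Non-leftmost column (status = 'pending'): matches are scattered through
# the sort order, so the whole index would have to be scanned.
print([e for e in entries if e[1] == "pending"])
```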
4.4 Specialized Index Types
Hash Index
Best for: Equality-only lookups
CREATE INDEX idx_sessions_token ON sessions USING hash(token);
-- Only supports =
SELECT * FROM sessions WHERE token = 'abc123xyz'; -- Fast!
SELECT * FROM sessions WHERE token > 'abc'; -- Cannot use index!
Hash indexes are rarely needed. B-Tree handles equality almost as fast and is more versatile. Use Hash only when you’re certain you’ll never need range queries.
GIN (Generalized Inverted Index)
Best for: Array containment, JSONB, full-text search
-- Array operations
CREATE INDEX idx_posts_tags ON posts USING gin(tags);
SELECT * FROM posts WHERE tags @> ARRAY['postgresql'];
SELECT * FROM posts WHERE tags && ARRAY['sql', 'nosql']; -- Overlap
-- JSONB operations
CREATE INDEX idx_products_attrs ON products USING gin(attributes);
SELECT * FROM products WHERE attributes @> '{"color": "red"}';
SELECT * FROM products WHERE attributes ? 'size';
-- Full-text search
CREATE INDEX idx_articles_search ON articles
USING gin(to_tsvector('english', title || ' ' || body));
SELECT * FROM articles
WHERE to_tsvector('english', title || ' ' || body)
@@ to_tsquery('english', 'postgresql & performance');
┌─────────────────────────────────────────────────────────────────────────┐
│ GIN INDEX STRUCTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Data: │
│ ┌────────────────────────────────────────────────────┐ │
│ │ id │ tags │ │
│ ├────┼───────────────────────────────────────────────┤ │
│ │ 1 │ ['postgresql', 'database', 'sql'] │ │
│ │ 2 │ ['python', 'postgresql'] │ │
│ │ 3 │ ['javascript', 'react'] │ │
│ │ 4 │ ['database', 'nosql', 'mongodb'] │ │
│ └────┴───────────────────────────────────────────────┘ │
│ │
│ GIN Index (Inverted): │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Key │ Posting List (row IDs) │ │
│ ├──────────────┼─────────────────────────────────────┤ │
│ │ database │ [1, 4] │ │
│ │ javascript │ [3] │ │
│ │ mongodb │ [4] │ │
│ │ nosql │ [4] │ │
│ │ postgresql │ [1, 2] │ │
│ │ python │ [2] │ │
│ │ react │ [3] │ │
│ │ sql │ [1] │ │
│ └──────────────┴─────────────────────────────────────┘ │
│ │
│ Query: tags @> ARRAY['postgresql'] │
│ → Look up 'postgresql' → [1, 2] → Fetch rows 1 and 2 │
│ │
└─────────────────────────────────────────────────────────────────────────┘
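The inverted index in the diagram can be simulated in a few lines of Python. This is an illustrative sketch of the lookup logic only (real GIN posting lists are compressed on-disk structures):

```python
from collections import defaultdict

# Rows from the diagram: id -> tags.
posts = {
    1: ["postgresql", "database", "sql"],
    2: ["python", "postgresql"],
    3: ["javascript", "react"],
    4: ["database", "nosql", "mongodb"],
}

# Build the inverted index: each element maps to a posting list of row ids.
gin = defaultdict(list)
for row_id, tags in sorted(posts.items()):
    for tag in tags:
        gin[tag].append(row_id)

def contains(required):
    """tags @> ARRAY[...]: intersect the posting lists of all required keys."""
    ids = set(posts)
    for key in required:
        ids &= set(gin.get(key, []))
    return sorted(ids)

print(contains(["postgresql"]))          # [1, 2]
print(contains(["database", "nosql"]))   # [4]
```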
GiST (Generalized Search Tree)
Best for: Geometric data, ranges, full-text with phrases
-- Range types
CREATE INDEX idx_events_duration ON events USING gist(during);
SELECT * FROM events
WHERE during && '[2024-01-01, 2024-01-31]'::daterange; -- Overlaps
-- Geometric (with PostGIS)
CREATE INDEX idx_locations_point ON locations USING gist(geom);
SELECT * FROM locations
WHERE ST_DWithin(geom, ST_MakePoint(-73.99, 40.73)::geography, 1000);
-- Full-text with phrase search
CREATE INDEX idx_docs_search ON documents
USING gist(to_tsvector('english', content));
BRIN (Block Range Index)
Best for: Very large tables with naturally ordered data (time series)
-- Perfect for append-only time series data
CREATE INDEX idx_logs_time ON logs USING brin(created_at);
-- Much smaller than B-Tree: stores min/max per block range
-- Table: 100GB → B-Tree: 2GB, BRIN: 2MB
┌─────────────────────────────────────────────────────────────────────────┐
│ BRIN INDEX STRUCTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Table blocks (pages): │
│ ┌─────────────┬─────────────┬─────────────┬─────────────┐ │
│ │ Block 0-127 │ Block 128-255│ Block 256-383│ Block 384+ │ │
│ │ Jan 1-7 │ Jan 8-14 │ Jan 15-21 │ Jan 22+ │ │
│ └─────────────┴─────────────┴─────────────┴─────────────┘ │
│ │
│ BRIN Index (tiny!): │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ Block Range │ Min Date │ Max Date │ │
│ ├──────────────┼─────────────┼──────────────────────────┤ │
│ │ 0-127 │ Jan 1 │ Jan 7 │ │
│ │ 128-255 │ Jan 8 │ Jan 14 │ │
│ │ 256-383 │ Jan 15 │ Jan 21 │ │
│ │ 384-511 │ Jan 22 │ Jan 28 │ │
│ └──────────────┴─────────────┴──────────────────────────┘ │
│ │
│ Query: WHERE created_at = 'Jan 10' │
│ → Check BRIN: Jan 10 could be in blocks 128-255 │
│ → Scan only those blocks (not entire table!) │
│ │
│ Requirements: │
│ • Data must be physically clustered by index column │
│ • Works best for append-only tables │
│ • Loses effectiveness if data is scattered │
│ │
└─────────────────────────────────────────────────────────────────────────┘
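The pruning step in the diagram is easy to simulate. An illustrative Python sketch of a BRIN summary (just (block range, min, max) triples, mirroring the table above):

```python
# BRIN summary from the diagram: (block_range, min_date, max_date) per range.
brin = [
    ((0, 127),   "2024-01-01", "2024-01-07"),
    ((128, 255), "2024-01-08", "2024-01-14"),
    ((256, 383), "2024-01-15", "2024-01-21"),
    ((384, 511), "2024-01-22", "2024-01-28"),
]

def candidate_blocks(value):
    """Return block ranges that *could* contain `value`;
    only those blocks are scanned, the rest are skipped entirely."""
    return [rng for rng, lo, hi in brin if lo <= value <= hi]

print(candidate_blocks("2024-01-10"))  # [(128, 255)] -- one range, not the whole table
```

Note the index never says a value *is* in a range, only that it *could* be; the actual rows in those blocks are still checked against the predicate.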
4.5 Advanced Index Techniques
Partial Indexes
Index only the rows you actually query.
-- Only index active users (most queries filter for active)
CREATE INDEX idx_users_email_active ON users(email)
WHERE deleted_at IS NULL;
-- Only index pending orders (statuses you filter often)
CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';
-- Index high-value orders
CREATE INDEX idx_orders_high_value ON orders(user_id, created_at)
WHERE total > 1000;
-- Size comparison
-- Full index: 500MB
-- Partial index: 50MB (if only 10% of rows match condition)
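The size saving comes simply from indexing fewer entries. An illustrative Python sketch (hypothetical orders, with a sorted list standing in for the index):

```python
# Illustrative table: order id -> row.
orders = {
    1: {"status": "pending", "created_at": "2024-01-10"},
    2: {"status": "shipped", "created_at": "2024-01-11"},
    3: {"status": "pending", "created_at": "2024-01-12"},
    4: {"status": "shipped", "created_at": "2024-01-13"},
}

# CREATE INDEX ... WHERE status = 'pending': only matching rows get entries.
partial_index = sorted(
    (row["created_at"], oid)
    for oid, row in orders.items()
    if row["status"] == "pending"
)
print(partial_index)       # entries for rows 1 and 3 only
print(len(partial_index))  # 2 of 4 rows -> proportionally smaller index
```

The flip side: the planner can only use this index for queries whose WHERE clause implies `status = 'pending'`.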
Expression Indexes
Index computed values.
-- Case-insensitive email lookup
CREATE INDEX idx_users_lower_email ON users(lower(email));
-- Now this uses the index:
SELECT * FROM users WHERE lower(email) = '[email protected]';
-- Extract year for date queries
CREATE INDEX idx_orders_year ON orders(EXTRACT(YEAR FROM created_at));
-- JSONB field extraction
CREATE INDEX idx_products_brand ON products((attributes->>'brand'));
SELECT * FROM products WHERE attributes->>'brand' = 'Apple';
Covering Indexes (Index-Only Scans)
Include extra columns to avoid table lookups entirely.
-- Regular index
CREATE INDEX idx_orders_user ON orders(user_id);
-- Query needs status and total:
SELECT status, total FROM orders WHERE user_id = 123;
-- 1. Lookup user_id in index → get row pointer
-- 2. Fetch row from table → get status, total
-- 2 operations!
-- Covering index (PostgreSQL 11+)
CREATE INDEX idx_orders_user_covering ON orders(user_id)
INCLUDE (status, total);
-- Now:
-- 1. Lookup user_id in index → status, total already there!
-- Index-only scan (faster!)
-- Use EXPLAIN to verify:
EXPLAIN SELECT status, total FROM orders WHERE user_id = 123;
-- "Index Only Scan using idx_orders_user_covering"
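The one-step vs two-step difference can be sketched in Python. This is an illustrative model (hypothetical rows; dicts standing in for the index and heap), not how PostgreSQL stores data:

```python
# Heap table: row id -> full row.
table = {
    901: {"user_id": 123, "status": "shipped", "total": 80},
    902: {"user_id": 123, "status": "pending", "total": 45},
    903: {"user_id": 456, "status": "shipped", "total": 20},
}

# Regular index: key -> row pointers only.
regular_index = {123: [901, 902], 456: [903]}

def query_regular(user_id):
    # Step 1: index lookup -> row ids. Step 2: fetch each row from the table.
    return [(table[rid]["status"], table[rid]["total"])
            for rid in regular_index[user_id]]

# Covering index: the INCLUDEd columns live in the index entry itself,
# so the table is never touched (an index-only scan).
covering_index = {123: [("shipped", 80), ("pending", 45)], 456: [("shipped", 20)]}

def query_covering(user_id):
    return covering_index[user_id]  # one step, no table access

print(query_regular(123) == query_covering(123))  # True: same answer, fewer steps
```

(In real PostgreSQL, index-only scans also require the visibility map to confirm the rows are visible without a heap check.)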
Unique Indexes for Constraints
-- Primary key creates unique index automatically
CREATE TABLE users (
id SERIAL PRIMARY KEY -- Creates unique index on id
);
-- UNIQUE constraint creates unique index
ALTER TABLE users ADD CONSTRAINT unique_email UNIQUE (email);
-- Creates a unique index named after the constraint (unique_email)
-- Unique with condition (partial unique)
CREATE UNIQUE INDEX idx_users_email_active ON users(email)
WHERE deleted_at IS NULL;
-- Only active users need unique emails
-- Unique across multiple columns
CREATE UNIQUE INDEX idx_product_variants ON products(product_id, size, color);
-- Each combination must be unique
4.6 Reading Execution Plans
Understanding EXPLAIN output is crucial for index optimization.
EXPLAIN Basics
-- Basic explain (estimated costs only)
EXPLAIN SELECT * FROM users WHERE email = '[email protected]';
-- With actual execution stats
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';
-- Full details
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM users WHERE email = '[email protected]';
Reading the Plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, COUNT(o.id) as order_count
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name
ORDER BY order_count DESC
LIMIT 10;
Limit (cost=1500.00..1500.03 rows=10 width=40) (actual time=45.2..45.3 rows=10 loops=1)
Buffers: shared hit=1200 read=50
-> Sort (cost=1500.00..1510.00 rows=4000 width=40) (actual time=45.2..45.2 rows=10 loops=1)
Sort Key: (count(o.id)) DESC
Sort Method: top-N heapsort Memory: 25kB
-> HashAggregate (cost=1200.00..1250.00 rows=4000 width=40) (actual time=35.5..40.1 rows=4000 loops=1)
Group Key: u.id
Batches: 1 Memory Usage: 600kB
-> Hash Join (cost=100.00..1000.00 rows=20000 width=32) (actual time=5.2..25.4 rows=20000 loops=1)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o (cost=0.00..500.00 rows=30000 width=8) (actual time=0.01..8.5 rows=30000 loops=1)
-> Hash (cost=80.00..80.00 rows=4000 width=24) (actual time=4.8..4.8 rows=4000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 250kB
-> Index Scan using idx_users_created on users u (cost=0.29..80.00 rows=4000 width=24)
Index Cond: (created_at > '2024-01-01')
Buffers: shared hit=100
Planning Time: 0.5 ms
Execution Time: 45.5 ms
Key Metrics to Watch
| Metric | Interpretation |
|---|---|
| Index Scan | ✅ Using an index |
| Seq Scan on a large table | ❌ Likely a missing index |
| Buffers: shared hit | ✅ Data served from cache |
| Buffers: read | ⚠️ Disk I/O |
| actual rows ≈ estimated rows | ✅ Good statistics |
| actual rows >> estimated rows | ❌ Stale statistics: run ANALYZE |
| Sort Method: quicksort Memory | ✅ Sort fits in memory |
| Sort Method: external merge Disk | ❌ Needs more work_mem |
Common Scan Types
┌─────────────────────────────────────────────────────────────────────────┐
│ SCAN TYPE COMPARISON │
├──────────────────────┬──────────────────────────────────────────────────┤
│ Seq Scan │ Reads entire table, row by row │
│ │ O(n) - Use when: no index, or selecting most │
│ │ of the table │
├──────────────────────┼──────────────────────────────────────────────────┤
│ Index Scan │ Traverse index, then fetch rows from table │
│ │ O(log n + k) - Use when: selecting few rows │
├──────────────────────┼──────────────────────────────────────────────────┤
│ Index Only Scan │ All data from index, no table access │
│ │ Fastest! Requires covering index + visible rows │
├──────────────────────┼──────────────────────────────────────────────────┤
│ Bitmap Index Scan │ Build bitmap of matching rows, then fetch │
│ │ Good for: medium selectivity, multiple indexes │
│ │ combining with OR │
└──────────────────────┴──────────────────────────────────────────────────┘
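The bitmap strategy in the last row is worth a closer look: each index produces a bitmap of matching rows, the bitmaps are combined with AND/OR, and the surviving rows are fetched in physical order. An illustrative Python sketch using sets as bitmaps (hypothetical row ids):

```python
# Two single-column indexes, each yielding the set of matching row ids.
idx_status = {"pending": {2, 5, 9}, "shipped": {1, 3}}
idx_total_over_100 = {4, 5, 9, 11}  # rows where total > 100 (illustrative)

# Bitmap scans build one bitmap per index, combine them, then fetch the
# surviving rows in physical order -- efficient for AND/OR across indexes.
bitmap_and = idx_status["pending"] & idx_total_over_100  # status = 'pending' AND total > 100
bitmap_or = idx_status["pending"] | idx_total_over_100   # status = 'pending' OR total > 100

print(sorted(bitmap_and))  # [5, 9]
print(sorted(bitmap_or))   # [2, 4, 5, 9, 11]
```

This is why a query with `WHERE a = 1 OR b = 2` can still use two separate single-column indexes, even though no single B-Tree covers both conditions.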
4.7 Index Maintenance
Monitoring Index Usage
-- Find unused indexes
SELECT
schemaname,
relname AS table,
indexrelname AS index,
idx_scan AS times_used,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Find most used indexes
SELECT
relname AS table,
indexrelname AS index,
idx_scan AS times_used,
idx_tup_read AS rows_read,
idx_tup_fetch AS rows_fetched
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC
LIMIT 20;
-- Tables with too many sequential scans (may need indexes)
SELECT
relname AS table,
seq_scan,
idx_scan,
CASE WHEN idx_scan > 0 THEN
ROUND(100.0 * seq_scan / (seq_scan + idx_scan), 2)
END AS seq_scan_pct,
pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_scan DESC
LIMIT 20;
Index Bloat
Indexes accumulate dead entries over time, becoming larger and slower.
-- Check index size vs table size
SELECT
relname AS table,
pg_size_pretty(pg_table_size(relid)) AS table_size,
pg_size_pretty(pg_indexes_size(relid)) AS indexes_size,
ROUND(100.0 * pg_indexes_size(relid) / NULLIF(pg_table_size(relid), 0), 2) AS index_ratio
FROM pg_stat_user_tables
ORDER BY pg_indexes_size(relid) DESC;
-- Rebuild bloated index
REINDEX INDEX CONCURRENTLY idx_orders_user;
-- Or create replacement and swap
CREATE INDEX CONCURRENTLY idx_orders_user_new ON orders(user_id);
DROP INDEX idx_orders_user;
ALTER INDEX idx_orders_user_new RENAME TO idx_orders_user;
Keeping Statistics Fresh
-- Update statistics for a table
ANALYZE users;
-- Update for entire database
ANALYZE;
-- Check when statistics were last updated
SELECT
relname,
last_analyze,
last_autoanalyze,
n_live_tup,
n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
-- Configure autovacuum for busy tables
ALTER TABLE orders SET (
autovacuum_analyze_scale_factor = 0.01, -- Analyze at 1% change
autovacuum_vacuum_scale_factor = 0.01 -- Vacuum at 1% dead tuples
);
4.8 Practice Exercises
Exercise 1: Index Design
Given these queries, design optimal indexes:
-- Query 1: User lookup by email
SELECT * FROM users WHERE email = '[email protected]';
-- Query 2: User's recent orders
SELECT * FROM orders
WHERE user_id = 123
ORDER BY created_at DESC
LIMIT 10;
-- Query 3: Pending orders older than 1 hour
SELECT * FROM orders
WHERE status = 'pending'
AND created_at < NOW() - INTERVAL '1 hour';
-- Query 4: Product search by category and price range
SELECT * FROM products
WHERE category_id = 5
AND price BETWEEN 100 AND 500
ORDER BY price;
-- Query 5: Count of orders by status per user (dashboard)
SELECT user_id, status, COUNT(*)
FROM orders
GROUP BY user_id, status;
Solutions
-- Query 1: Simple unique index
CREATE UNIQUE INDEX idx_users_email ON users(email);
-- Query 2: Composite with sort order
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);
-- Query 3: Partial index on pending orders
CREATE INDEX idx_orders_pending_created ON orders(created_at)
WHERE status = 'pending';
-- Query 4: Composite for filter + sort
CREATE INDEX idx_products_category_price ON products(category_id, price);
-- Query 5: Covering index for aggregation
CREATE INDEX idx_orders_user_status ON orders(user_id, status);
-- Or include count columns if other columns needed
Exercise 2: Analyze Slow Query
-- This query is slow. Analyze and fix it.
EXPLAIN ANALYZE
SELECT
p.name,
p.price,
c.name as category_name,
AVG(r.rating) as avg_rating,
COUNT(r.id) as review_count
FROM products p
LEFT JOIN categories c ON p.category_id = c.id
LEFT JOIN reviews r ON p.id = r.product_id
WHERE lower(p.name) LIKE '%laptop%'
AND p.price > 500
AND p.status = 'active'
GROUP BY p.id, p.name, p.price, c.name
HAVING COUNT(r.id) > 10
ORDER BY avg_rating DESC
LIMIT 20;
Analysis and Solution
-- Problems identified:
-- 1. lower(p.name) LIKE '%laptop%' - Cannot use index (function + leading wildcard)
-- 2. No covering index for the main filter conditions
-- 3. Reviews aggregation might be slow
-- Solution 1: Full-text search for name
ALTER TABLE products ADD COLUMN name_search tsvector
GENERATED ALWAYS AS (to_tsvector('english', name)) STORED;
CREATE INDEX idx_products_name_search ON products USING gin(name_search);
-- Solution 2: Index for filters
CREATE INDEX idx_products_active_price ON products(price)
WHERE status = 'active';
-- Solution 3: Index for reviews aggregation
CREATE INDEX idx_reviews_product ON reviews(product_id);
-- Optimized query:
SELECT
p.name,
p.price,
c.name as category_name,
AVG(r.rating) as avg_rating,
COUNT(r.id) as review_count
FROM products p
LEFT JOIN categories c ON p.category_id = c.id
LEFT JOIN reviews r ON p.id = r.product_id
WHERE p.name_search @@ to_tsquery('english', 'laptop') -- Use full-text search
AND p.price > 500
AND p.status = 'active'
GROUP BY p.id, p.name, p.price, c.name
HAVING COUNT(r.id) > 10
ORDER BY avg_rating DESC
LIMIT 20;
Next Module
Module 5: Query Optimization
Learn systematic approaches to optimize any slow query