-- This query takes 5 seconds. What's wrong?
SELECT u.name, COUNT(o.id) as order_count
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
GROUP BY u.id, u.name
ORDER BY order_count DESC;

-- Without execution plans: Pure guesswork
-- - Add an index? Which column?
-- - Rewrite the query? How?
-- - Increase memory? By how much?

-- With execution plans: Surgical precision
-- - See exactly what's slow (Seq Scan on orders: 4.8s)
-- - Know the exact fix (Index on orders.user_id)
-- - Verify improvement (Now 50ms)
What You’ll Master:
Read and interpret every node type (Seq Scan, Index Scan, Hash Join, etc.)
Identify performance bottlenecks instantly
Understand cost calculations and row estimates
Optimize queries systematically, not randomly
Compare execution across PostgreSQL, MySQL, SQL Server
This is not just EXPLAIN syntax. This is learning to think like the query planner, predict its decisions, and guide it to the fastest execution path.
-- EXPLAIN: Shows the PLAN (doesn't execute)
EXPLAIN
SELECT * FROM users WHERE email = '[email protected]';

-- Output:
Index Scan using users_email_idx on users  (cost=0.42..8.44 rows=1 width=40)
  Index Cond: (email = '[email protected]'::text)

-- EXPLAIN ANALYZE: Shows PLAN + ACTUAL EXECUTION
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = '[email protected]';

-- Output:
Index Scan using users_email_idx on users  (cost=0.42..8.44 rows=1 width=40) (actual time=0.025..0.027 rows=1 loops=1)
  Index Cond: (email = '[email protected]'::text)
Planning Time: 0.123 ms
Execution Time: 0.052 ms
Key Differences:
Aspect                 | EXPLAIN        | EXPLAIN ANALYZE
-----------------------+----------------+--------------------------
Executes query?        | ❌ No          | ✅ Yes
Shows actual times?    | ❌ No          | ✅ Yes
Shows actual rows?     | ❌ No          | ✅ Yes
Safe for production?   | ✅ Yes         | ⚠️ Caution (runs query!)
Use when               | Checking plan  | Diagnosing slow query
EXPLAIN ANALYZE actually runs the query!
For writes (INSERT/UPDATE/DELETE), wrap in transaction:
BEGIN;
EXPLAIN ANALYZE DELETE FROM huge_table WHERE ...;
ROLLBACK;  -- Prevent actual deletion
Sort  (cost=1000.00..1050.00 rows=100 width=50)
            ^^^^^^^  ^^^^^^^
            startup  total
Startup Cost: Work before first row returned
Useful for queries with LIMIT
Matters for interactive applications
Total Cost: Complete execution
What you care about for batch jobs
Example:
-- Sort has high startup (must collect all rows first)
EXPLAIN
SELECT * FROM users ORDER BY created_at LIMIT 10;

Sort  (cost=5000.00..5500.00 ...)   -- High startup
  ->  Seq Scan on users  (cost=0.00..100.00 ...)

-- vs Index Scan (low startup, returns rows immediately)
EXPLAIN
SELECT * FROM users WHERE id = 5 LIMIT 10;

Index Scan  (cost=0.42..8.44 ...)   -- Low startup, immediate results
Seq Scan on orders  (cost=0.00..1000.00 rows=50 width=100)
                                        ^^^^^^^
                                        ESTIMATED rows

EXPLAIN ANALYZE output:

Seq Scan on orders  (cost=0.00..1000.00 rows=50 width=100) (actual time=0.1..10.5 rows=5000 loops=1)
                                                                                  ^^^^^^^^^
                                                                                  ACTUAL rows (100x off!)
Why Row Estimates Matter:
Bad estimate (planner thinks 50 rows, actually 5000):
        ↓
Chooses wrong join method (Nested Loop instead of Hash Join)
        ↓
Slow query (10 seconds instead of 100ms)
Common Causes of Bad Estimates:
Outdated statistics: Run ANALYZE
Correlated columns: Planner assumes independence
Complex WHERE conditions: Planner guesses conservatively
Functions in WHERE: Planner can’t estimate WHERE lower(email) = ...
Fixing Bad Estimates:
-- 1. Update statistics
ANALYZE users;

-- 2. Increase statistics target for important columns
ALTER TABLE users ALTER COLUMN email SET STATISTICS 1000;
ANALYZE users;

-- 3. Create expression index for function-based queries
CREATE INDEX users_lower_email_idx ON users(lower(email));

-- 4. Use extended statistics for correlated columns
CREATE STATISTICS users_city_state_stats (dependencies)
ON city, state FROM users;
ANALYZE users;
Seq Scan on users  (cost=0.00..100.00 rows=1000 width=40)
                                                ^^^^^^^^
                                                Average row size in bytes
What it means:
Average bytes per row returned
Used for memory calculations (work_mem, hash tables)
Includes only columns in SELECT
Example:
-- All columns
SELECT * FROM users;          -- width=200 (name + email + address + ...)

-- Specific columns
SELECT id, name FROM users;   -- width=40 (just id + name)
EXPLAIN ANALYZE
SELECT * FROM users WHERE created_at > '2020-01-01';

-- Returns 95% of rows
Seq Scan on users  (cost=0.00..1000.00 rows=9500 width=100) (actual time=0.05..15.2 rows=9500 loops=1)
  Filter: (created_at > '2020-01-01'::date)
  Rows Removed by Filter: 500
How It Works:
1. Read table pages sequentially from disk
2. For each page:
   - Load into buffer pool
   - Scan all tuples
   - Apply filter
   - Return matching rows
3. Continue until end of table
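In rough Python terms (a toy sketch of the loop above, with the table modeled as a list of pages of row dicts; this is an illustration, not PostgreSQL's executor code):

# Table modeled as a list of pages; each page is a list of row dicts (toy data)
users_pages = [
    [{"id": 1, "created_at": "2019-06-01"}, {"id": 2, "created_at": "2021-03-10"}],
    [{"id": 3, "created_at": "2022-11-05"}],
]

def seq_scan(pages, predicate):
    for page in pages:              # read pages in physical order (sequential I/O)
        for row in page:            # scan every tuple on the page
            if predicate(row):      # apply the WHERE filter to each tuple
                yield row           # return only the matching rows

print(list(seq_scan(users_pages, lambda r: r["created_at"] > "2020-01-01")))
# → the rows with id 2 and 3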
Performance Characteristics:
Advantages:
✅ Efficient for reading large % of table (>5-10%)
✅ Utilizes disk sequential read (fast)
✅ No index overhead

Disadvantages:
❌ Reads entire table (wasteful if few matches)
❌ Slow for selective queries
CREATE INDEX users_email_idx ON users(email);

EXPLAIN ANALYZE
SELECT * FROM users WHERE email = '[email protected]';

Index Scan using users_email_idx on users  (cost=0.42..8.44 rows=1 width=100) (actual time=0.025..0.027 rows=1 loops=1)
  Index Cond: (email = '[email protected]'::text)
How It Works:
1. Search B-tree index for '[email protected]'
2. Find leaf node with pointer to heap tuple
3. Fetch heap page containing the row
4. Return row

For multiple rows: Repeat steps 2-4 for each matching index entry
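A toy Python sketch of this probe-then-fetch pattern, where a sorted list plus bisect stands in for the B-tree and a plain list stands in for the heap (the data and helpers are made up for illustration):

from bisect import bisect_left

# Toy "index": sorted (key, heap position) pairs; toy "heap": rows in physical order
index = sorted([("[email protected]", 0), ("[email protected]", 2), ("[email protected]", 1)])
heap = [
    {"email": "[email protected]", "name": "Alice"},
    {"email": "[email protected]", "name": "Carol"},
    {"email": "[email protected]", "name": "Bob"},
]

def index_scan(index, heap, key):
    pos = bisect_left(index, (key,))          # "descend the B-tree" to the first match
    while pos < len(index) and index[pos][0] == key:
        _, heap_pos = index[pos]
        yield heap[heap_pos]                  # one random heap fetch per matching entry
        pos += 1

print(list(index_scan(index, heap, "[email protected]")))   # Bob's row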
Performance:
Advantages:
✅ Fast for selective queries (<5% of table)
✅ Returns rows in index order (helps ORDER BY)

Disadvantages:
❌ Random I/O (slower than sequential)
❌ Heap lookups add overhead
Cost Breakdown:
-- Startup cost (0.42):
--   - B-tree descent (log N)
--   - Typically 3-4 levels

-- Total cost (8.44):
--   - Index read + Heap tuple fetch
When Used: Index contains all needed columns (covering index).
CREATE INDEX users_email_name_idx ON users(email, name);

EXPLAIN ANALYZE
SELECT name FROM users WHERE email = '[email protected]';

Index Only Scan using users_email_name_idx on users  (cost=0.42..4.44 rows=1 width=15) (actual time=0.015..0.016 rows=1 loops=1)
  Index Cond: (email = '[email protected]'::text)
  Heap Fetches: 0   ← No heap access!
How It Works:
1. Search index for '[email protected]'
2. Read 'name' directly from index leaf
3. Return result (No heap tuple fetch needed!)
Massive Performance Win:
Index Scan:      Index read + Heap fetch = ~8.44 cost
Index Only Scan: Index read only         = ~4.44 cost
                                            ↑ 2x faster
Caveat - Visibility Map:
Index Only Scan ... (Heap Fetches: 342)
                                   ^^^^
                     Had to check heap for some rows

Why?
- PostgreSQL MVCC: Need to check row visibility
- Visibility map tracks "all tuples on page visible"
- Old/dead tuples → Heap fetch required

Solution: VACUUM regularly

VACUUM users;
-- Updates visibility map → More index-only scans
When Used: Combining multiple indexes OR fetching many rows.
CREATE INDEX users_city_idx ON users(city);
CREATE INDEX users_age_idx ON users(age);

EXPLAIN ANALYZE
SELECT * FROM users WHERE city = 'NYC' AND age > 30;

Bitmap Heap Scan on users  (cost=100.00..500.00 rows=500)
  Recheck Cond: ((city = 'NYC') AND (age > 30))
  ->  BitmapAnd  (cost=100.00..100.00 rows=500)
        ->  Bitmap Index Scan on users_city_idx
        ->  Bitmap Index Scan on users_age_idx
How It Works:
1. Scan users_city_idx → Create bitmap of matching TIDs (tuple IDs)
   Bitmap: [page 5, page 12, page 23, ...]
2. Scan users_age_idx → Create second bitmap
   Bitmap: [page 3, page 12, page 45, ...]
3. AND bitmaps together
   Result: [page 12, ...] (pages in both bitmaps)
4. Sort pages by physical location
   [page 12] → Sequential access!
5. Fetch heap pages in sorted order
6. Recheck conditions (bitmap is lossy)
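Steps 2-6 in miniature, as a Python sketch where plain sets of page numbers stand in for the TID bitmaps (illustrative only; the page numbers are made up, and real bitmaps can track individual tuples, not just pages):

# Pages that might contain matches for each condition, as produced by the two Bitmap Index Scans
city_pages = {5, 12, 23, 40}      # city = 'NYC'
age_pages = {3, 12, 40, 45}       # age > 30

# BitmapAnd: only pages present in both bitmaps can hold qualifying rows
candidate_pages = city_pages & age_pages      # {12, 40}

# Visit candidates in physical order → mostly sequential heap access
for page_no in sorted(candidate_pages):
    # the Bitmap Heap Scan reads the page and RECHECKS both conditions,
    # because a lossy bitmap remembers pages, not individual tuples
    print("fetch page", page_no, "and recheck (city = 'NYC' AND age > 30)")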
Why Bitmap?
Problem with regular Index Scan on multiple rows:
  - Fetches rows in index order
  - Random heap access
  - Slow

Bitmap solution:
  - Buffers TIDs in memory
  - Sorts by heap location
  - Sequential heap access
  - Much faster for moderate result sets (100-10000 rows)
Bitmap vs Index Scan Decision:
Few rows (<100):      Index Scan (direct access)
Moderate (100-10k):   Bitmap Index Scan (sorted access)
Many (>10k):          Seq Scan (read entire table)
Concept: For each row in outer table, scan inner table.
EXPLAIN ANALYZE
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.id = 5;

Nested Loop  (cost=0.84..16.88 rows=1)
  ->  Index Scan on users u  (cost=0.42..8.44 rows=1)
        Index Cond: (id = 5)
  ->  Index Scan on orders o  (cost=0.42..8.44 rows=1)
        Index Cond: (user_id = 5)
Pseudocode:
result = []
for outer_row in outer_table:        # users
    for inner_row in inner_table:    # orders
        if outer_row.id == inner_row.user_id:
            result.append((outer_row, inner_row))
return result
Performance:
Cost: O(outer_rows × inner_rows) in the worst case

Best case: Inner table has index on join key
  - Index Scan for each outer row
  - Cost: outer_rows × index_lookup_cost

Worst case: No index on inner table
  - Seq Scan for EACH outer row
  - Cost: outer_rows × table_size
  - Catastrophic!
When Optimal:
✅ Small outer table (few rows)
✅ Index exists on inner table join key
✅ Selective WHERE clause on outer table

Example:
- Outer: 1 user (WHERE user_id = 5)
- Inner: Orders with index on user_id
- Result: 1 index lookup = Fast!
When Terrible:
❌ Large outer table
❌ No index on inner table
❌ Cartesian product (no join condition)

Example:
- Outer: 10,000 users
- Inner: 1 million orders, NO index
- Result: 10,000 table scans = Days!
Concept: Build a hash table on one input, probe it with the other.
EXPLAIN ANALYZE
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id;

Hash Join  (cost=500.00..5000.00 rows=10000)
  Hash Cond: (o.user_id = u.id)
  ->  Seq Scan on orders o  (cost=0.00..2000.00 rows=100000)
  ->  Hash  (cost=100.00..100.00 rows=10000)
        ->  Seq Scan on users u  (cost=0.00..100.00 rows=10000)
Algorithm:
# Phase 1: Build hash table (smaller table)
hash_table = {}
for row in users:
    hash_table[row.id] = row

# Phase 2: Probe (larger table)
result = []
for row in orders:
    if row.user_id in hash_table:
        result.append((hash_table[row.user_id], row))
return result
Performance:
Build phase: O(smaller_table_rows)
Probe phase: O(larger_table_rows)
Total:       O(N + M) - Linear!

Memory requirement:
  Hash table size ≈ smaller_table_rows * average_row_width
  Must fit in work_mem (default 4MB)
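A back-of-the-envelope version of that memory check (a rough sketch; the 1.5x per-tuple overhead factor is an assumption, not PostgreSQL's exact accounting):

# Estimated size of the build (smaller) side, read off the Hash node's input
build_rows = 10_000          # rows=10000
row_width = 100              # width=100 (bytes)
overhead = 1.5               # assumed per-tuple hash-table overhead (rough guess)
work_mem_kb = 4 * 1024       # default work_mem = 4MB

hash_table_kb = build_rows * row_width * overhead / 1024
print(f"~{hash_table_kb:.0f} kB needed vs {work_mem_kb} kB work_mem")
# ~1465 kB < 4096 kB → should fit in memory, so "Batches: 1" is expected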
When Optimal:
✅ Large tables without indexes
✅ Equi-joins (=, not <, >, !=)
✅ Smaller table fits in work_mem
✅ Both tables scanned anyway (no indexes help)

Example:
- 10,000 users × 100,000 orders
- Build hash on users (smaller)
- Probe with orders
- Time: Seconds (vs hours with Nested Loop)
Memory Spills:
Hash Join (...) (actual time=1000..5000 ...)
  Buckets: 16384  Batches: 4  Memory Usage: 8192kB
                           ^
                           Spilled to disk! (work_mem too small)

When hash table > work_mem:
  - Splits into batches
  - Writes batches to temp files on disk
  - Slower (disk I/O)

Fix: Increase work_mem
SET work_mem = '256MB';
EXPLAIN ANALYZE
SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
ORDER BY u.id;

Merge Join  (cost=2000.00..3000.00 rows=10000)
  Merge Cond: (u.id = o.user_id)
  ->  Index Scan using users_pkey on users u
  ->  Sort  (cost=1500.00..1750.00 rows=100000)
        Sort Key: o.user_id
        ->  Seq Scan on orders o
Algorithm:
# Both inputs sorted by join key
users_sorted = sorted(users, key=lambda r: r.id)
orders_sorted = sorted(orders, key=lambda r: r.user_id)

i, j = 0, 0
result = []
while i < len(users_sorted) and j < len(orders_sorted):
    if users_sorted[i].id == orders_sorted[j].user_id:
        result.append((users_sorted[i], orders_sorted[j]))
        j += 1
    elif users_sorted[i].id < orders_sorted[j].user_id:
        i += 1
    else:
        j += 1
return result
Performance:
Best case: Both inputs pre-sorted (indexes)
  Cost: O(N + M) - Linear scan

Worst case: Must sort both inputs
  Cost: O(N log N + M log M + N + M)
  Still better than Nested Loop for large datasets
When Optimal:
✅ Both inputs already sorted (indexes on join keys)
✅ Query includes ORDER BY on join key
✅ Large datasets (competitive with Hash Join)

Example:

SELECT * FROM users u
JOIN orders o ON u.id = o.user_id
ORDER BY u.id;

- users has primary key on id (sorted)
- orders has index on user_id (sorted)
- Merge join reads both in order
- Result already sorted for ORDER BY
- No additional sort step needed!
EXPLAIN ANALYZE
SELECT city, COUNT(*)
FROM users
GROUP BY city;

HashAggregate  (cost=1500.00..1550.00 rows=1000)
  Group Key: city
  ->  Seq Scan on users  (cost=0.00..1000.00 rows=10000)
Two Methods:

1. HashAggregate (default for unsorted input):
# Build hash table
groups = {}
for row in users:
    if row.city not in groups:
        groups[row.city] = {'count': 0}
    groups[row.city]['count'] += 1
return groups.values()

# Memory: O(distinct_groups)
# Time: O(N)
2. GroupAggregate (for sorted input):

-- HashAggregate (unsorted input)
SELECT city, COUNT(*) FROM users GROUP BY city;

-- GroupAggregate (sorted by index)
CREATE INDEX users_city_idx ON users(city);
SELECT city, COUNT(*) FROM users GROUP BY city;

GroupAggregate  (cost=0.00..500.00 rows=1000)
  Group Key: city
  ->  Index Only Scan using users_city_idx on users
      ↑ Already sorted!
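For symmetry with the HashAggregate pseudocode above, a Python sketch of how GroupAggregate streams over input that is already sorted by the group key (illustrative, with made-up rows):

# Input already sorted by city (e.g. from an index scan), so each group is a
# contiguous run and only one group's running state is held at a time
rows = [{"city": "Chicago"}, {"city": "Chicago"},
        {"city": "NYC"}, {"city": "NYC"}, {"city": "NYC"}]

def group_aggregate(sorted_rows):
    current_city, count = None, 0
    for row in sorted_rows:
        if row["city"] != current_city:
            if current_city is not None:
                yield current_city, count      # emit the finished group
            current_city, count = row["city"], 0
        count += 1
    if current_city is not None:
        yield current_city, count              # emit the last group

print(list(group_aggregate(rows)))   # [('Chicago', 2), ('NYC', 3)]
# Memory: one group's state at a time, instead of O(distinct_groups)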
Memory Issues:
-- Too many groups → HashAggregate spills to disk
HashAggregate (Batches: 5  Memory Usage: 65536kB)
                        ^
                        Disk spills

-- Solution 1: Increase work_mem
SET work_mem = '256MB';

-- Solution 2: Force GroupAggregate with index
CREATE INDEX ON users(city);
-- Now planner uses sorted index scan
EXPLAIN ANALYZE
SELECT * FROM users ORDER BY created_at DESC LIMIT 10;

Limit  (cost=500.00..500.02 rows=10)
  ->  Sort  (cost=500.00..525.00 rows=10000)
        Sort Key: created_at DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on users  (cost=0.00..100.00 rows=10000)
Sort Methods:

1. Quicksort (general purpose):
Used when: Sorting all rows
Memory:    All rows must fit in work_mem
Time:      O(N log N)
2. Top-N Heapsort (for LIMIT):
from heapq import heappush, heapreplace

# Only keep the top N rows in a min-heap, keyed by created_at
heap = []
for row in users:
    if len(heap) < 10:
        heappush(heap, (row.created_at, row))
    elif row.created_at > heap[0][0]:
        heapreplace(heap, (row.created_at, row))
return [row for _, row in sorted(heap, key=lambda t: t[0], reverse=True)]

# Memory: O(N) where N = LIMIT
# Time: O(M log N) where M = total rows, N = LIMIT
3. External Sort (disk-based):
When: Data > work_mem
Process:
  1. Read work_mem chunks
  2. Sort each chunk
  3. Write to temp files
  4. Merge sorted chunks

Slow! Avoid by increasing work_mem
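The chunk-sort-merge idea in miniature, as a Python sketch where in-memory lists stand in for the temp files PostgreSQL would write to disk:

from heapq import merge

def external_sort(rows, chunk_size):
    runs = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]   # read one work_mem-sized chunk
        runs.append(sorted(chunk))               # sort it (PostgreSQL writes it to a temp file)
    return list(merge(*runs))                    # final pass: merge the sorted runs

data = [42, 7, 99, 1, 56, 23, 8, 64]
print(external_sort(data, chunk_size=3))         # [1, 7, 8, 23, 42, 56, 64, 99]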
Checking for Disk Sorts:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM huge_table ORDER BY created_at;

Sort (...) (actual time=5000..10000 ...)
  Sort Method: external merge  Disk: 204800kB
                               ^^^^^^^^^^^^^^
                               Spilled to disk!

-- Fix:
SET work_mem = '512MB';

-- Now:
  Sort Method: quicksort  Memory: 204800kB
                          ^ In-memory!
Avoiding Sorts with Indexes:
-- Without index: Full table sort
EXPLAIN SELECT * FROM users ORDER BY created_at;
Sort + Seq Scan

-- With index: No sort needed!
CREATE INDEX users_created_at_idx ON users(created_at);
EXPLAIN SELECT * FROM users ORDER BY created_at;
Index Scan using users_created_at_idx
  ↑ Returns rows in sorted order
EXPLAIN ANALYZE
SELECT *
FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id;

Nested Loop  (cost=0.84..50000.00 rows=100) (actual time=0.1..2500.5 rows=100 loops=1)
  ->  Nested Loop  (cost=0.42..10000.00 rows=100) (actual time=0.05..500.2 rows=100 loops=1)
        ->  Seq Scan on users u  (cost=0.00..100.00 rows=10) (actual time=0.01..5.0 rows=10 loops=1)
              Filter: (active = true)
              Rows Removed by Filter: 90
        ->  Index Scan on orders o  (cost=0.42..980.00 rows=10) (actual time=0.02..48.0 rows=10 loops=10)
                                                                                                ^^^^^^^^
                                                                                                Ran 10 times!
              Index Cond: (user_id = u.id)
  ->  Index Scan on products p  (cost=0.42..398.00 rows=1) (actual time=0.01..20.0 rows=1 loops=100)
                                                                                         ^^^^^^^^^
                                                                                         Ran 100 times!
        Index Cond: (id = o.product_id)
Understanding loops:
loops=1:   Operation ran once
loops=10:  Operation ran 10 times (once per outer row)
loops=100: Operation ran 100 times (once per row of the combined outer result)

Total cost = node_cost × loops

Example:
- Index Scan on products: cost=398 × loops=100
- Total: 39,800
- If planner expected loops=1 but actual=100 → 100x slower!
Optimizing High-Loop Nested Joins:
-- Problem: Inner scans repeated many times
Nested Loop (loops=10000)
  -> Large outer scan
  -> Index scan repeated 10000 times

-- Solution 1: Switch to Hash Join
SET enable_nestloop = off;
-- Forces planner to consider Hash Join

-- Solution 2: Better WHERE clause selectivity
WHERE users.active = true          -- Reduce outer rows
  AND orders.status = 'pending'    -- Reduce inner rows per outer

-- Solution 3: Materialization
CREATE TEMP TABLE filtered_users AS
SELECT * FROM users WHERE active = true;

SELECT * FROM filtered_users u
JOIN orders o ON u.id = o.user_id;
-- Now outer is small → Fewer loops
-- PostgreSQL 11 and earlier:
WITH expensive_cte AS (
    SELECT * FROM huge_table WHERE complex_condition
)
SELECT * FROM expensive_cte WHERE simple_condition;

-- Plan (PostgreSQL 11):
CTE Scan on expensive_cte
  Filter: simple_condition
  CTE expensive_cte
    ->  Seq Scan on huge_table
          Filter: complex_condition

-- Problem: CTE materialized BEFORE outer WHERE applied
-- Result: Processes all huge_table rows even if only a few are needed

-- PostgreSQL 12+: Inlines CTEs
Seq Scan on huge_table
  Filter: (complex_condition AND simple_condition)
  ↑ Optimized! Applies both filters together

-- Force materialization if needed:
WITH expensive_cte AS MATERIALIZED (...)
EXPLAIN FORMAT=TREE
SELECT ...;

-> Nested loop inner join
    -> Table scan on u  (actual time=0.1..10.5 rows=100)
    -> Index lookup on o  (actual time=0.01..0.02 rows=5)
MySQL 8.0.18+ shows actual times in TREE format
Older versions: No actual execution stats in EXPLAIN
SQL Server:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT ...;

-- Output:
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 1 ms.
SQL Server Execution Times:
   CPU time = 15 ms, elapsed time = 42 ms.

Table 'users'. Scan count 1, logical reads 5, physical reads 0.
Table 'orders'. Scan count 100, logical reads 500, physical reads 10.
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name TEXT,
    category TEXT,
    price DECIMAL(10, 2),
    stock INT
);

INSERT INTO products
SELECT
    i,
    'Product ' || i,
    (ARRAY['Electronics', 'Clothing', 'Food'])[1 + mod(i, 3)],
    random() * 1000,
    floor(random() * 100)
FROM generate_series(1, 100000) i;
Query:
EXPLAIN ANALYZE
SELECT * FROM products WHERE price > 500;
Questions:
What scan method is used?
What’s the estimated vs actual row count?
How many rows were filtered?
What’s the execution time?
Expected Answer:
Seq Scan on products  (cost=0.00..2834.00 rows=50000 width=...) (actual time=0.015..25.123 rows=49823 loops=1)
  Filter: (price > '500'::numeric)
  Rows Removed by Filter: 50177
Planning Time: 0.142 ms
Execution Time: 27.456 ms

Analysis:
1. Seq Scan (no index on price)
2. Estimate: 50,000 rows; Actual: 49,823 (very close!)
3. Filtered: 50,177 rows
4. Time: ~27ms
Optimization Challenge:
-- Add index
CREATE INDEX products_price_idx ON products(price);

-- Re-run
EXPLAIN ANALYZE
SELECT * FROM products WHERE price > 500;

-- Still Seq Scan! Why?
-- Answer: Query returns ~50% of rows
-- Seq Scan cheaper than 50,000 index lookups

-- Try more selective query:
EXPLAIN ANALYZE
SELECT * FROM products WHERE price > 900;

-- Now the index is used (via a bitmap scan)!
Bitmap Heap Scan on products  (cost=542.84..2156.12 rows=9952)
  Recheck Cond: (price > '900'::numeric)
  ->  Bitmap Index Scan on products_price_idx
CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name TEXT,
    city TEXT
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INT,
    order_date DATE,
    total DECIMAL(10, 2)
);

INSERT INTO customers
SELECT
    i,
    'Customer ' || i,
    (ARRAY['NYC', 'LA', 'Chicago'])[1 + mod(i, 3)]
FROM generate_series(1, 10000) i;

INSERT INTO orders
SELECT
    i,
    1 + floor(random() * 10000)::INT,
    CURRENT_DATE - floor(random() * 365)::INT,
    random() * 1000
FROM generate_series(1, 100000) i;

ANALYZE customers, orders;
Query 1: Small result set:
EXPLAIN ANALYZE
SELECT c.name, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.id = 42;
Expected:
Nested Loop  (cost=0.84..120.00 rows=10)
  ->  Index Scan using customers_pkey on customers c
        Index Cond: (id = 42)
  ->  Seq Scan on orders o
        Filter: (customer_id = 42)

Why Nested Loop?
- Outer: 1 row (highly selective)
- Inner: Sequential scan (no index on customer_id yet)
- Only scans orders once
Optimization:
CREATE INDEX orders_customer_id_idx ON orders(customer_id);

EXPLAIN ANALYZE
SELECT c.name, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.id = 42;

-- Now:
Nested Loop  (cost=0.84..25.00 rows=10)
  ->  Index Scan using customers_pkey on customers c
  ->  Index Scan using orders_customer_id_idx on orders o
        Index Cond: (customer_id = 42)

-- Much faster! Index on inner relation
Query 2: Large result set:
EXPLAIN ANALYZE
SELECT c.name, COUNT(o.id)
FROM customers c
JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name;
Expected:
Hash Join  (cost=3000.00..8000.00 rows=10000)
  Hash Cond: (o.customer_id = c.id)
  ->  Seq Scan on orders o
  ->  Hash
        ->  Seq Scan on customers c

Why Hash Join?
- Large result set (all customers)
- Both tables fully scanned
- Hash Join optimal for this pattern
EXPLAIN ANALYZE
SELECT category, AVG(price), COUNT(*)
FROM products
GROUP BY category;
Without Index:
HashAggregate  (cost=2834.00..2834.03 rows=3)
  Group Key: category
  ->  Seq Scan on products
With Index:
-- Index on the group key, with price included so AVG(price)
-- can be computed from the index alone
CREATE INDEX products_category_idx ON products(category) INCLUDE (price);

EXPLAIN ANALYZE
SELECT category, AVG(price), COUNT(*)
FROM products
GROUP BY category;

-- Now:
GroupAggregate  (cost=0.42..5000.00 rows=3)
  Group Key: category
  ->  Index Only Scan using products_category_idx on products

-- Streaming aggregation (less memory)
EXPLAIN ANALYZE
SELECT * FROM products
ORDER BY price DESC
LIMIT 10;
Without Index:
Limit  (cost=5250.00..5250.03 rows=10)
  ->  Sort  (cost=5250.00..5500.00 rows=100000)
        Sort Key: price DESC
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on products

-- Top-N heapsort (efficient for LIMIT)
-- Only keeps top 10 in memory
With Index:
CREATE INDEX products_price_desc_idx ON products(price DESC);

EXPLAIN ANALYZE
SELECT * FROM products
ORDER BY price DESC
LIMIT 10;

-- Now:
Limit  (cost=0.42..0.85 rows=10)
  ->  Index Scan using products_price_desc_idx on products

-- No sort needed! Index already sorted DESC
-- Just read first 10 entries
-- This query suddenly became slow (was fast last week)
SELECT * FROM orders WHERE status = 'pending';
-- Takes 30 seconds now (was 100ms before)
Diagnosis:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending';

Seq Scan on orders  (cost=0.00..50000.00 rows=1000000) (actual time=0.01..15234.56 rows=999500)
  Filter: (status = 'pending'::text)
  Rows Removed by Filter: 500

-- Problem: Index exists but not used!
-- Estimated 1,000,000 rows, actual 999,500 rows
Root Cause:
-- Check index
\d orders
-- Index "orders_status_idx" BTREE (status)

-- Check statistics
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

-- Result:
n_distinct: 3 (only 3 distinct values)
most_common_vals: {pending, completed, cancelled}
most_common_freqs: {0.9995, 0.0003, 0.0002}

-- Ah! "pending" is 99.95% of rows
-- Planner correctly chose Seq Scan (index would be slower)
Why it Changed:
Last week:
- 1M orders, 10K pending (1%) → Index Scan
- Fast!

This week:
- 1M orders, 999.5K pending (99.95%) → Seq Scan
- Slow (but optimal given data distribution)
Solutions:
-- Solution 1: Index on selective status
CREATE INDEX orders_completed_idx ON orders(id)
WHERE status = 'completed';

-- Now:
SELECT * FROM orders WHERE status = 'completed';
-- Uses partial index (fast)

SELECT * FROM orders WHERE status = 'pending';
-- Seq Scan (correct, most rows)

-- Solution 2: Partition table
CREATE TABLE orders_pending (LIKE orders);
CREATE TABLE orders_completed (LIKE orders);
-- Move old completed orders out of main table
-- Join query degraded from 1s to 60s
SELECT u.name, COUNT(o.id)
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;
Diagnosis:
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, COUNT(o.id)
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;

Hash Left Join  (cost=5000.00..100000.00 rows=10000) (actual time=1000..58234 rows=10000 loops=1)
  Hash Cond: (u.id = o.user_id)
  Buffers: shared hit=50000 read=150000
                            ^^^^^^^^^^^
                            Massive disk reads!
  ->  Seq Scan on users u
  ->  Hash  (cost=3000.00..3000.00 rows=5000000) (actual time=50000..50000 rows=5000000 loops=1)
        Buckets: 32768  Batches: 64  Memory Usage: 4096kB
                                 ^^
                                 Spilled to disk!
        ->  Seq Scan on orders o
Root Cause:
- Orders table grew from 1M to 5M rows
- Hash table > work_mem (4MB default)
- Spilled to disk in 64 batches
- Each batch: Read from disk, hash, probe
- Result: 60x slowdown
Solution:
-- Temporary fix: Increase work_mem for this session
SET work_mem = '256MB';

-- Verify:
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, COUNT(o.id)
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;

Hash Left Join (...)
  Hash (...)
    Buckets: 32768  Batches: 1  Memory Usage: 256000kB
                             ^
                             In-memory now!

-- Back to 1s execution time

-- Permanent fix: Increase globally
ALTER DATABASE mydb SET work_mem = '256MB';

-- OR: Index optimization
CREATE INDEX orders_user_id_covering_idx
ON orders(user_id) INCLUDE (id);
-- May allow Merge Join instead of Hash Join
Before deploying a new query to production:

□ Run EXPLAIN ANALYZE on production-sized data
  - Use realistic data volumes
  - Test with actual distribution (not uniform test data)

□ Check for Seq Scans on large tables
  - Acceptable if: Returns >5% of rows
  - Red flag if: Returns <1% of rows
  - Fix: Add appropriate index

□ Verify row estimates are reasonable
  - If actual/estimated > 10x: Run ANALYZE
  - If still off: Consider extended statistics

□ Check for memory spills
  - Look for "Batches > 1" in Hash nodes
  - Look for "external merge" in Sort nodes
  - Fix: Increase work_mem or reduce data

□ Look for high loop counts in joins
  - Nested Loop with loops=10000+ → Consider Hash Join
  - Fix: Add index or rewrite query

□ Validate ORDER BY optimization
  - Sort node present? Consider index
  - LIMIT present? Ensure Top-N heap sort

□ Test with concurrent load
  - Queries fast alone but slow under load?
  - May indicate lock contention or cache pressure

□ Set reasonable statement timeout
  SET statement_timeout = '30s';
  - Prevents runaway queries
-- Default: 4MB (often too small)
-- Affects: Hash joins, sorts, bitmap operations

-- Finding optimal value:
-- 1. Run EXPLAIN (ANALYZE, BUFFERS)
-- 2. Check for "Batches > 1" or "external merge"
-- 3. Increase work_mem gradually

-- For specific query:
SET work_mem = '256MB';

-- Globally (careful! per operation, can multiply):
ALTER DATABASE mydb SET work_mem = '64MB';
-- Caution: 10 concurrent queries × 5 hash joins each × 64MB = 3.2GB memory!
effective_cache_size (hint to planner):
-- Tells planner how much RAM available for caching
-- Doesn't allocate memory, just informs decisions
-- Rule of thumb: 50-75% of system RAM

-- 16GB RAM server:
ALTER SYSTEM SET effective_cache_size = '12GB';

-- Makes planner favor index scans (knows data likely cached)
random_page_cost:
-- Default: 4.0 (assumes spinning disk)
-- For SSD: 1.1 - 1.5 (random ≈ sequential)

-- SSDs:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();

-- Makes planner favor index scans on SSDs
Statistics targets:
-- Default: 100 (ANALYZE samples ~30,000 rows)
-- For important columns: 1000+ (samples ~300,000 rows)

ALTER TABLE products ALTER COLUMN category_id SET STATISTICS 1000;
ANALYZE products;

-- Better histograms → Better estimates → Better plans
1. OFFSET pagination:

-- BAD: OFFSET scales linearly
SELECT * FROM products
ORDER BY id
LIMIT 10 OFFSET 100000;
-- Must scan 100,010 rows, return 10

-- GOOD: Keyset pagination
SELECT * FROM products
WHERE id > 100000
ORDER BY id
LIMIT 10;
-- Scans 10 rows via index
2. OR in WHERE clause:
-- BAD: Often forces Seq Scan
SELECT * FROM users
WHERE email = '[email protected]' OR phone = '555-1234';
-- Planner can't use indexes efficiently

-- GOOD: UNION ALL
SELECT * FROM users WHERE email = '[email protected]'
UNION ALL
SELECT * FROM users WHERE phone = '555-1234' AND email != '[email protected]';
-- Uses both indexes
3. Function calls prevent index use:
-- BAD: Can't use index on email
SELECT * FROM users WHERE lower(email) = '[email protected]';

-- GOOD: Expression index
CREATE INDEX users_lower_email_idx ON users(lower(email));
-- Now index can be used

-- OR: Store normalized
ALTER TABLE users ADD COLUMN email_lower TEXT;
UPDATE users SET email_lower = lower(email);
CREATE INDEX users_email_lower_idx ON users(email_lower);