Performance Engineering Deep Dive
This module covers the tools and techniques used by database engineers at companies like PlanetScale, Supabase, Neon, and CockroachDB to diagnose and fix performance issues at scale.

Target Audience: Engineers joining distributed database teams
Prerequisites: Query Engine, Data Structures modules
Tools Covered: perf, flamegraphs, pg_stat_*, BPF
Interview Relevance: Staff+ performance interviews
Part 1: The Performance Investigation Framework
1.1 The Performance Stack
┌─────────────────────────────────────────────────────────────────────────────┐
│ PERFORMANCE INVESTIGATION LAYERS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Layer 6: Application │
│ ├── Query patterns, connection management, ORM behavior │
│ │ │
│ Layer 5: PostgreSQL Query │
│ ├── Query plans, statistics, locks, wait events │
│ │ │
│ Layer 4: PostgreSQL Internals │
│ ├── Buffer pool, WAL, vacuum, checkpointer │
│ │ │
│ Layer 3: Operating System │
│ ├── I/O scheduler, memory management, CPU scheduling │
│ │ │
│ Layer 2: Hardware │
│ ├── SSD/NVMe, RAM, CPU cache, network │
│ │ │
│ Layer 1: Cloud Infrastructure │
│ └── EBS limits, network bandwidth, IOPS credits │
│ │
│ Investigation Order: │
│ 1. Check cloud/hardware limits first (IOPS, bandwidth) │
│ 2. OS-level metrics (iostat, vmstat) │
│ 3. PostgreSQL wait events and statistics │
│ 4. Query-level analysis │
│ 5. Application patterns │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
1.2 The Four Golden Signals
-- 1. LATENCY: How long do queries take?
SELECT
query,
calls,
mean_exec_time::numeric(10,2) as avg_ms,
max_exec_time::numeric(10,2) as max_ms,
stddev_exec_time::numeric(10,2) as stddev_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
-- 2. THROUGHPUT: How many queries per second?
SELECT
sum(calls) /
EXTRACT(EPOCH FROM (now() - stats_reset)) as qps
FROM pg_stat_statements, pg_stat_statements_info;
-- 3. ERRORS: What's failing?
SELECT
datname,
xact_rollback,
conflicts,
deadlocks
FROM pg_stat_database;
-- 4. SATURATION: Are resources maxed out?
-- Check wait events to see what's saturated
SELECT
wait_event_type,
wait_event,
count(*) as waiting_sessions
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;
Part 2: PostgreSQL Statistics Deep Dive
2.1 pg_stat_statements Mastery
-- Enable pg_stat_statements with all options
-- In postgresql.conf:
-- shared_preload_libraries = 'pg_stat_statements'
-- pg_stat_statements.track = all
-- pg_stat_statements.track_utility = on
-- pg_stat_statements.track_planning = on
-- Top queries by total time (production bottlenecks)
SELECT
left(query, 80) as query_preview,
calls,
total_exec_time::numeric(12,2) as total_ms,
mean_exec_time::numeric(10,2) as avg_ms,
rows,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0)
as hit_percent
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
-- Queries with worst cache hit ratio (I/O bound)
SELECT
left(query, 80),
calls,
shared_blks_read,
shared_blks_hit,
100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0)
as hit_ratio
FROM pg_stat_statements
WHERE shared_blks_hit + shared_blks_read > 1000
ORDER BY hit_ratio ASC NULLS LAST
LIMIT 20;
-- Queries with high planning time (complex plans)
SELECT
left(query, 80),
calls,
mean_plan_time::numeric(10,2) as avg_plan_ms,
mean_exec_time::numeric(10,2) as avg_exec_ms,
mean_plan_time / nullif(mean_exec_time, 0) as plan_exec_ratio
FROM pg_stat_statements
WHERE calls > 100
ORDER BY mean_plan_time DESC
LIMIT 20;
-- Queries with high temp file usage (work_mem issues)
SELECT
left(query, 80),
calls,
temp_blks_read,
temp_blks_written,
(temp_blks_read + temp_blks_written) * 8 / 1024 as temp_mb
FROM pg_stat_statements
WHERE temp_blks_read + temp_blks_written > 0
ORDER BY temp_blks_read + temp_blks_written DESC;
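When measuring before and after a change, it often helps to reset these counters so they cover only the comparison window; a small sketch (pg_stat_statements_reset requires superuser or an explicit GRANT):
-- Reset counters before a measurement window so the statistics
-- reflect only the workload you are comparing
SELECT pg_stat_statements_reset();
-- ... run the workload, then re-run the queries above ...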
2.2 Wait Event Analysis
-- Current wait events
SELECT
pid,
usename,
application_name,
state,
wait_event_type,
wait_event,
left(query, 60) as query_preview
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY wait_event_type, wait_event;
-- Wait event categories explained:
/*
┌─────────────────────────────────────────────────────────────────────────────┐
│ WAIT EVENT CATEGORIES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Category Common Events Meaning │
│ ─────────────────────────────────────────────────────────────────────────── │
│ LWLock buffer_content Buffer pool contention │
│ buffer_mapping Finding buffer in pool │
│ WALWriteLock WAL write serialization │
│ lock_manager Lock table access │
│ │
│ Lock relation Waiting for table lock │
│ tuple Waiting for row lock │
│ transactionid Waiting for xact to complete │
│ virtualxid Waiting for virtual xact │
│ │
│ BufferPin BufferPin Another backend has buffer pinned │
│ │
│ IO DataFileRead Reading from disk │
│ DataFileWrite Writing to disk │
│ WALWrite Writing WAL │
│ WALSync Fsyncing WAL │
│ │
│ Client ClientRead Waiting for client to send data │
│ ClientWrite Waiting to send to client │
│ │
│ IPC BgWorkerStartup Waiting for background worker │
│ ParallelFinish Waiting for parallel workers │
│ ProcArrayLock Transaction visibility check │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
*/
-- Aggregate wait events over time (sample periodically)
CREATE TABLE wait_event_samples (
sample_time timestamptz,
wait_event_type text,
wait_event text,
count int
);
-- Sampling query (run periodically, e.g., every few seconds via cron or pg_cron)
INSERT INTO wait_event_samples
SELECT
now(),
wait_event_type,
wait_event,
count(*)
FROM pg_stat_activity
WHERE state = 'active' AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event;
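Once samples accumulate, aggregate them over a window to see which waits dominate; a minimal query against the sampling table above:
-- Top wait events over the last 10 minutes, from the samples above
SELECT
    wait_event_type,
    wait_event,
    sum(count) as total_waiting_samples
FROM wait_event_samples
WHERE sample_time > now() - interval '10 minutes'
GROUP BY wait_event_type, wait_event
ORDER BY total_waiting_samples DESC
LIMIT 10;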
2.3 Index Statistics
-- Index usage statistics
SELECT
schemaname,
relname as table_name,
indexrelname as index_name,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
-- Unused indexes (candidates for removal)
SELECT
schemaname || '.' || relname as table,
indexrelname as index,
idx_scan,
pg_size_pretty(pg_relation_size(indexrelid)) as size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
AND indexrelid NOT IN (
SELECT conindid FROM pg_constraint -- Keep constraint indexes
)
ORDER BY pg_relation_size(indexrelid) DESC;
-- Index bloat estimation
SELECT
    i.schemaname,
    i.relname as table_name,
    i.indexrelname as index_name,
    pg_size_pretty(pg_relation_size(i.indexrelid)) as index_size,
    pg_size_pretty(
        pg_relation_size(i.indexrelid) -
        (c.reltuples * 40)::bigint -- Rough estimate: 40 bytes per index entry
    ) as estimated_bloat
FROM pg_stat_user_indexes i
JOIN pg_class c ON c.oid = i.indexrelid
WHERE c.reltuples > 10000
ORDER BY pg_relation_size(i.indexrelid) DESC;
-- Index efficiency (fetches vs reads)
SELECT
indexrelname,
idx_tup_read,
idx_tup_fetch,
CASE WHEN idx_tup_read > 0
THEN round(100.0 * idx_tup_fetch / idx_tup_read, 2)
ELSE 0 END as fetch_ratio
FROM pg_stat_user_indexes
WHERE idx_tup_read > 1000
ORDER BY fetch_ratio ASC;
-- Low fetch ratio = index reads many tuples but few are actually needed
-- Consider more selective index or query optimization
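A common remedy is a partial index matched to the selective predicate. A hypothetical example (table, column, and index names are illustrative):
-- Most queries only touch the rare 'pending' rows, so a partial index
-- reads far fewer tuples per fetch than a full index on status
CREATE INDEX CONCURRENTLY orders_pending_created_idx
    ON orders (created_at)
    WHERE status = 'pending';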
2.4 Table Statistics
-- Table access patterns
SELECT
schemaname,
relname,
seq_scan,
seq_tup_read,
idx_scan,
idx_tup_fetch,
n_tup_ins,
n_tup_upd,
n_tup_del,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2)
as dead_tup_percent,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY seq_tup_read DESC;
-- Tables needing vacuum
SELECT
schemaname || '.' || relname as table,
n_live_tup,
n_dead_tup,
round(100.0 * n_dead_tup / nullif(n_live_tup, 0), 2) as dead_percent,
last_autovacuum,
pg_size_pretty(pg_total_relation_size(relid)) as total_size
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;
-- HOT update ratio (in-place updates, no index update needed)
SELECT
schemaname,
relname,
n_tup_upd,
n_tup_hot_upd,
round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 2) as hot_update_percent
FROM pg_stat_user_tables
WHERE n_tup_upd > 1000
ORDER BY hot_update_percent ASC;
-- Low HOT ratio = indexes being updated frequently = consider fillfactor
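One lever for raising the HOT ratio is a lower heap fillfactor, which leaves room on each page for updated row versions. An illustrative sketch (accounts is a hypothetical table):
-- Reserve 20% free space per heap page so updates can stay on the
-- same page (HOT) instead of touching every index
ALTER TABLE accounts SET (fillfactor = 80);
-- Only affects newly written pages; rewrite existing ones to apply:
VACUUM FULL accounts;  -- takes ACCESS EXCLUSIVE; pg_repack avoids the lock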
Part 3: System-Level Profiling
3.1 Linux perf Basics
# Install perf (Ubuntu/Debian)
apt-get install linux-tools-common linux-tools-$(uname -r)
# Basic CPU profiling of PostgreSQL
# Find postgres backend PID handling your query
psql -c "SELECT pg_backend_pid();" # e.g., 12345
# Profile that backend
perf record -g -p 12345 -- sleep 30
# Generate report
perf report --stdio
# Example output:
# 35.00% postgres [kernel] [k] copy_user_enhanced_fast_string
# 22.00% postgres postgres [.] hash_search_with_hash_value
# 15.00% postgres postgres [.] ExecScan
# 10.00% postgres postgres [.] slot_deform_heap_tuple
# 8.00% postgres postgres [.] ExecQual
# Profile all postgres processes
perf record -g -p $(pgrep -d',' postgres) -- sleep 30
3.2 Flame Graphs
# Clone flamegraph tools
git clone https://github.com/brendangregg/FlameGraph.git
cd FlameGraph
# Capture profile
perf record -F 99 -g -p $(pgrep -d',' postgres) -- sleep 60
# Generate flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > pg-flamegraph.svg
# View in browser
# Open pg-flamegraph.svg
┌─────────────────────────────────────────────────────────────────────────────┐
│ READING FLAME GRAPHS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │████████████████████████████ hash_search ████████████████████████████│ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────┐ ┌──────────────────────────────────────┐ │
│ │██████ ExecHashJoin █████████│ │███████████ ExecSeqScan ████████████│ │
│ └────────────────────────────┘ └──────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │████████████████████████ ExecProcNode ██████████████████████████████│ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │██████████████████████████ ExecutorRun █████████████████████████████│ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ Reading Guide: │
│ • Y-axis: Stack depth (bottom = entry point) │
│ • X-axis: Time spent (width = % of samples) │
│ • Color: Random, just for differentiation │
│ • Plateaus: Functions that consume CPU time │
│ • Towers: Deep call stacks │
│ │
│ What to look for: │
│ • Wide plateaus at top = hot functions to optimize │
│ • hash_search = possible hash table sizing issue │
│ • ExecSeqScan = missing index │
│ • slot_deform_heap_tuple = consider column ordering │
│ • LWLockAcquire = contention issue │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
3.3 Off-CPU Analysis (I/O Waits)
# Off-CPU flame graph shows where processes are WAITING, not running
# Requires BPF/eBPF
# Using bcc-tools
apt-get install bpfcc-tools
# Trace off-CPU events
offcputime-bpfcc -p $(pgrep -d',' postgres) -f 30 > offcpu.stacks
# Generate off-CPU flame graph
./flamegraph.pl --color=io --countname=us < offcpu.stacks > offcpu.svg
# Common off-CPU patterns:
# • pread/pwrite → Disk I/O
# • epoll_wait → Waiting for client
# • futex → Lock contention
# • nanosleep → Intentional sleep (background workers)
3.4 I/O Analysis
# Basic I/O stats
iostat -xz 1
# Key columns:
# r/s, w/s - Reads/writes per second
# rkB/s, wkB/s - KB read/written per second
# await - Average I/O wait time (ms)
# %util - Device utilization
# Example concerning output:
# Device r/s w/s rkB/s wkB/s await %util
# nvme0n1 5000 2000 400000 160000 12.5 99.8
# ↑ ↑
# High wait Saturated!
# Per-process I/O with iotop
iotop -oPa
# BPF-based I/O latency histogram
biolatency-bpfcc -D 10
# Example output shows latency distribution:
# usecs : count distribution
# 0 -> 1 : 0 | |
# 2 -> 3 : 0 | |
# 4 -> 7 : 12 |* |
# 8 -> 15 : 156 |************ |
# 16 -> 31 : 298 |***********************|
# 32 -> 63 : 842 |************************| ← Most I/O here
# 64 -> 127 : 234 |****************** |
# 128 -> 255: 89 |******* |
# 256 -> 511: 12 |* | ← Tail latency
Part 4: Benchmarking
4.1 pgbench Fundamentals
# Initialize test database
pgbench -i -s 100 testdb # Scale factor 100 = 10M rows
# Basic TPC-B style benchmark
pgbench -c 10 -j 2 -T 60 testdb
# -c 10: 10 concurrent clients
# -j 2: 2 worker threads
# -T 60: Run for 60 seconds
# Output interpretation:
# tps = 1234.567890 (including connections establishing)
# tps = 1245.678901 (excluding connections establishing)
# ↑ This is your throughput
# With detailed timing
pgbench -c 10 -j 2 -T 60 -r testdb
# -r: Report per-statement latencies
# Example output:
# statement latencies in milliseconds:
# 0.002 \set aid random(1, 100000 * :scale)
# 0.001 \set bid random(1, 1 * :scale)
# 0.001 \set tid random(1, 10 * :scale)
# 0.001 \set delta random(-5000, 5000)
# 0.156 BEGIN;
# 0.234 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
# 0.089 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
# 0.067 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
# 0.078 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
# 0.045 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
# 0.823 END;
4.2 Custom pgbench Scripts
-- Custom benchmark script: point_read.sql
-- Point query (80% of workload)
\set user_id random(1, 10000000)
SELECT * FROM users WHERE id = :user_id;
-- Run the scripts with weights:
-- pgbench -f point_read.sql@80 -f range_scan.sql@15 -f write.sql@5 -T 60 testdb
-- Range scan workload: range_scan.sql
\set start_id random(1, 9900000)
\set end_id :start_id + 1000
SELECT * FROM users WHERE id BETWEEN :start_id AND :end_id ORDER BY id;
-- Write workload: write.sql
\set user_id random(1, 10000000)
\set new_balance random(0, 100000)
UPDATE users SET balance = :new_balance WHERE id = :user_id;
4.3 Latency Histogram Analysis
# Capture per-transaction latency log
pgbench -c 20 -j 4 -T 300 -L 100 --log testdb
# -L 100: Latency limit 100ms (counts transactions that exceed it)
# --log: Write per-transaction logs; field 3 is latency in microseconds
# (don't combine with --aggregate-interval, which switches the log to
#  per-interval summary rows and breaks the analysis below)
# Analyze with percentiles (converting microseconds to ms)
cat pgbench_log.* | awk '{print $3 / 1000.0}' | sort -n | \
awk 'BEGIN{c=0} {a[c++]=$1} END{
print "P50:", a[int(c*0.50)], "ms";
print "P95:", a[int(c*0.95)], "ms";
print "P99:", a[int(c*0.99)], "ms";
print "P99.9:", a[int(c*0.999)], "ms";
}'
# Or use HdrHistogram for accurate percentiles
# https://github.com/HdrHistogram/HdrHistogram_c
4.4 Benchmark Methodology
┌─────────────────────────────────────────────────────────────────────────────┐
│ BENCHMARK BEST PRACTICES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. Warm Up Phase │
│ • Run benchmark for 5-10 minutes before measuring │
│ • Ensures buffer pool is populated │
│ • JIT compilation has occurred │
│ • Statistics are accurate │
│ │
│ 2. Measurement Phase │
│ • Run for at least 5-10 minutes for stable results │
│ • Multiple runs (at least 3) to check variance │
│ • Report median, not mean (avoids outlier skew) │
│ │
│ 3. Control Variables │
│ • Same hardware configuration │
│ • Same data size and distribution │
│ • Same PostgreSQL configuration │
│ • Same OS tuning │
│ • Quiesce other workloads │
│ │
│ 4. Metrics to Capture │
│ • Throughput (TPS or QPS) │
│ • Latency percentiles (P50, P95, P99, P99.9) │
│ • CPU utilization │
│ • I/O utilization │
│ • Memory usage │
│ • Error rate │
│ │
│ 5. Common Mistakes │
│ ✗ Running too short (< 1 minute) │
│ ✗ Not warming up │
│ ✗ Comparing averages instead of percentiles │
│ ✗ Ignoring variance between runs │
│ ✗ Testing with unrealistic data sizes │
│ ✗ Not monitoring system resources during test │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
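To capture database-side resource metrics for a run, one approach is to snapshot PostgreSQL's cumulative counters before and after the measurement phase and report deltas. A minimal sketch; the bench_snapshots table and labels are inventions of this example:
-- Snapshot table for before/after comparisons
CREATE TABLE IF NOT EXISTS bench_snapshots (
    taken_at    timestamptz DEFAULT now(),
    label       text,
    xact_commit bigint,
    blks_read   bigint,
    blks_hit    bigint,
    temp_bytes  bigint
);
-- Run once with 'before' and once with 'after' around the measurement phase:
INSERT INTO bench_snapshots (label, xact_commit, blks_read, blks_hit, temp_bytes)
SELECT 'before', xact_commit, blks_read, blks_hit, temp_bytes
FROM pg_stat_database
WHERE datname = current_database();
-- Deltas for the run:
SELECT
    a.xact_commit - b.xact_commit as commits,
    a.blks_read - b.blks_read as blocks_read,
    a.temp_bytes - b.temp_bytes as temp_bytes_written
FROM bench_snapshots a, bench_snapshots b
WHERE a.label = 'after' AND b.label = 'before';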
Part 5: Connection and Memory Tuning
5.1 Connection Pool Sizing
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONNECTION POOL FORMULA │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Optimal pool size (per database node): │
│ │
│ connections = (cores * 2) + effective_spindle_count │
│ │
│ For SSD/NVMe (no spindles): │
│ connections = (cores * 2) + 1 │
│ │
│ Example: 8-core server with NVMe: │
│ connections = (8 * 2) + 1 = 17 │
│ │
│ Key insight: More connections != more throughput │
│ After optimal point, contention increases, throughput decreases │
│ │
│ Symptoms of too many connections: │
│ • Increased latency │
│ • Context switching overhead │
│ • Lock contention │
│ • Memory pressure (each backend uses ~10MB+) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
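A quick way to sanity-check headroom against this formula is to compare live client connections to the configured ceiling; a simple sketch:
-- Client connections vs. configured maximum
SELECT
    count(*) FILTER (WHERE backend_type = 'client backend') as client_connections,
    current_setting('max_connections')::int as max_connections
FROM pg_stat_activity;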
5.2 PgBouncer Configuration
# pgbouncer.ini for high performance
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
# Connection pooling mode
pool_mode = transaction # Best for most workloads
# Pool sizing
default_pool_size = 20 # Connections per user/database pair
min_pool_size = 5 # Keep some connections warm
reserve_pool_size = 5 # Extra for bursts
reserve_pool_timeout = 3 # How long before using reserve
# Connection limits
max_client_conn = 1000 # Max client connections to pgbouncer
max_db_connections = 50 # Max connections to actual database
# Timeouts
query_timeout = 30 # Kill query if longer than 30s
client_idle_timeout = 600 # Close idle clients after 10min
server_idle_timeout = 60 # Close idle server connections
# Performance
tcp_keepalive = 1
tcp_keepidle = 60
tcp_keepintvl = 10
tcp_keepcnt = 3
# Logging
log_connections = 0 # Reduce log spam in production
log_disconnections = 0
log_pooler_errors = 1
# Admin
admin_users = pgbouncer_admin
stats_users = pgbouncer_stats
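To verify the pool behaves as configured, query PgBouncer's admin console over psql; a brief sketch using the stats user defined above and assuming the default listen_port of 6432:
# psql -h 127.0.0.1 -p 6432 -U pgbouncer_stats pgbouncer
SHOW POOLS;   -- cl_active/cl_waiting (clients) vs sv_active/sv_idle (server conns)
SHOW STATS;   -- per-database totals and average query/transaction times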
5.3 Memory Configuration
-- Memory settings for different workload types
-- OLTP workload (many small queries)
shared_buffers = '8GB' -- 25% of RAM (up to ~8GB)
effective_cache_size = '24GB' -- 75% of RAM
work_mem = '64MB' -- Conservative for many connections
maintenance_work_mem = '2GB' -- For vacuum, index builds
wal_buffers = '64MB'
-- OLAP workload (few large queries)
shared_buffers = '32GB' -- Can be larger for analytics
effective_cache_size = '96GB'
work_mem = '2GB' -- Larger for complex queries
maintenance_work_mem = '4GB'
max_parallel_workers_per_gather = 8
-- Work_mem calculation for hash joins:
/*
* Memory needed = (inner_rows × tuple_width × hash_overhead)
*
* For 1M rows, 100 bytes/row:
* Memory = 1,000,000 × 100 × 1.5 ≈ 150MB
*
* Set work_mem = 256MB to fit with margin
*
* Warning: work_mem is PER OPERATION
* Complex query with 5 sorts/hashes = 5 × work_mem
 * With 100 connections: potential 5 × 256MB × 100 = 125GB!
*/
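Because of this multiplication, prefer scoping a larger work_mem to the queries that need it instead of raising it server-wide; a minimal sketch:
-- Scope the larger setting to one transaction instead of the whole server
BEGIN;
SET LOCAL work_mem = '256MB';  -- matches the calculation above
-- ... run the one hash-join-heavy report query here ...
COMMIT;  -- work_mem reverts automatically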
Part 6: Lock Contention Analysis
6.1 Lock Monitoring
-- Current locks (who's blocking whom)
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- Lock types and their conflicts
/*
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOCK CONFLICT MATRIX │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Requested → ACCESS ROW ROW SHARE SHARE SHARE EXCL ACCESS │
│ Held ↓ SHARE SHARE EXCL UPDATE ROW UPDATE EXCL │
│ EXCL EXCL │
│ ─────────────────────────────────────────────────────────────────────────── │
│  ACCESS SHARE        -     -     -     -     -     -     -     X           │
│  ROW SHARE           -     -     -     -     -     -     X     X           │
│  ROW EXCLUSIVE       -     -     -     -     X     X     X     X           │
│  SHARE UPDATE EX     -     -     -     X     X     X     X     X           │
│  SHARE               -     -     X     X     -     X     X     X           │
│  SHARE ROW EXCL      -     -     X     X     X     X     X     X           │
│  EXCLUSIVE           -     X     X     X     X     X     X     X           │
│  ACCESS EXCL         X     X     X     X     X     X     X     X           │
│ │
│ X = Conflict (must wait) │
│ - = No conflict (can proceed) │
│ │
│ Common operations and their locks: │
│ • SELECT: ACCESS SHARE │
│ • INSERT/UPDATE/DELETE: ROW EXCLUSIVE │
│ • CREATE INDEX CONCURRENTLY: SHARE UPDATE EXCLUSIVE │
│ • VACUUM FULL, REINDEX: ACCESS EXCLUSIVE │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
*/
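To see which of these modes are currently held or requested on a specific table (orders here is illustrative):
-- Lock modes on one table, held vs. waiting
SELECT mode, granted, count(*)
FROM pg_locks
WHERE relation = 'orders'::regclass
GROUP BY mode, granted
ORDER BY count(*) DESC;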
6.2 Deadlock Analysis
-- Enable deadlock logging
-- In postgresql.conf:
-- log_lock_waits = on
-- deadlock_timeout = 1s
-- Check logs for deadlock patterns:
/*
LOG: process 12345 detected deadlock while waiting for ShareLock on
transaction 987654 after 1000.123 ms
DETAIL: Process holding the lock: 12346. Wait queue: .
CONTEXT: while updating tuple (0,42) in relation "orders"
STATEMENT: UPDATE orders SET status = 'shipped' WHERE id = 100
LOG: process 12345 still waiting for ShareLock on transaction 987654
after 1000.456 ms
DETAIL: Process holding the lock: 12346. Wait queue: 12345.
CONTEXT: while updating tuple (0,42) in relation "orders"
STATEMENT: UPDATE orders SET status = 'shipped' WHERE id = 100
*/
-- Analyze deadlock from logs:
/*
Deadlock scenario:
Transaction A: Transaction B:
UPDATE orders WHERE id=100 UPDATE orders WHERE id=200
(holds lock on row 100) (holds lock on row 200)
... ...
UPDATE orders WHERE id=200 UPDATE orders WHERE id=100
(waits for B's lock) (waits for A's lock)
↓ DEADLOCK
Solution: Consistent lock ordering
- Always acquire locks in same order (e.g., by id ASC)
- Or use SELECT ... FOR UPDATE with explicit ordering
*/
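A sketch of the fix using the orders example above: take both row locks in a consistent order up front, so neither transaction can wait on a lock the other holds:
BEGIN;
-- Acquire both row locks in id order before updating either row
SELECT id FROM orders WHERE id IN (100, 200) ORDER BY id FOR UPDATE;
UPDATE orders SET status = 'shipped' WHERE id = 200;
UPDATE orders SET status = 'shipped' WHERE id = 100;
COMMIT;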
Part 7: Production Troubleshooting Runbook
7.1 High CPU Usage
-- Step 1: Identify CPU-heavy queries
SELECT
pid,
usename,
state,
wait_event_type,
wait_event,
left(query, 100)
FROM pg_stat_activity
WHERE state = 'active'
AND wait_event_type IS NULL -- Not waiting = actually running
ORDER BY backend_start;
-- Step 2: Check pg_stat_statements
SELECT
query,
calls,
mean_exec_time,
total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
-- Step 3: Profile with perf
-- perf record -g -p <pid> -- sleep 30
-- perf report
-- Common causes:
-- • Missing index (ExecSeqScan in profile)
-- • Complex expression evaluation (ExecEvalExpr)
-- • Hash table operations (hash_search)
-- • Sort operations (tuplesort_*)
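To test the "missing index" hypothesis without a profiler, look for large tables with heavy sequential scans; a quick sketch:
-- Large tables read mostly via seq scans are index candidates
SELECT
    relname,
    seq_scan,
    seq_tup_read / nullif(seq_scan, 0) as avg_rows_per_seq_scan,
    pg_size_pretty(pg_relation_size(relid)) as table_size
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;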
7.2 High I/O Wait
-- Step 1: Check wait events
SELECT
wait_event_type,
wait_event,
count(*)
FROM pg_stat_activity
WHERE state = 'active'
AND wait_event_type IN ('IO', 'BufferPin')
GROUP BY wait_event_type, wait_event;
-- Step 2: Check buffer cache hit ratio
SELECT
sum(heap_blks_read) as heap_read,
sum(heap_blks_hit) as heap_hit,
sum(heap_blks_hit) /
nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) as ratio
FROM pg_statio_user_tables;
-- Should be > 0.99 for OLTP
-- Step 3: Check checkpoint frequency
SELECT
checkpoints_timed,
checkpoints_req, -- Frequent requested checkpoints = WAL pressure; raise max_wal_size
buffers_checkpoint,
buffers_backend -- Backend writes = buffer pool pressure; should be near 0
FROM pg_stat_bgwriter;
-- (PostgreSQL 17+ moved the checkpoint counters to pg_stat_checkpointer)
-- Step 4: System level
-- iostat -xz 1
-- iotop
-- Common causes:
-- • shared_buffers too small
-- • Checkpoint too aggressive (checkpoint_completion_target low)
-- • seq scans on large tables
-- • Bloated tables need VACUUM
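If checkpoints turn out to be the cause, review the relevant settings first; the values in the comments below are illustrative starting points, not firm recommendations:
-- Current checkpoint-related configuration
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'checkpoint_timeout',
               'checkpoint_completion_target', 'max_wal_size');
-- Typical adjustments to spread checkpoint I/O (illustrative):
-- checkpoint_timeout = '15min'
-- checkpoint_completion_target = 0.9
-- max_wal_size = '16GB'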
7.3 Lock Contention
-- Step 1: Check current locks
SELECT
locktype,
relation::regclass,
mode,
granted,
pid,
left(query, 50) as query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT granted
ORDER BY relation;
-- Step 2: Find blocking queries
SELECT unnest(pg_blocking_pids(12345)); -- Replace 12345 with the waiting PID
-- Step 3: Check for long-running transactions
SELECT
pid,
now() - xact_start as xact_duration,
now() - query_start as query_duration,
state,
left(query, 60)
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
-- Step 4: Kill if necessary
SELECT pg_terminate_backend(12345); -- Last resort
-- Prevention:
-- • Use statement_timeout and idle_in_transaction_session_timeout
-- • Use lock_timeout
-- • Avoid long transactions in OLTP
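A sketch of those preventive settings applied at the database level (mydb and all values are illustrative; tune per workload):
ALTER DATABASE mydb SET statement_timeout = '30s';
ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '60s';
ALTER DATABASE mydb SET lock_timeout = '5s';
-- New sessions pick these up; existing sessions keep their old values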
Part 8: Interview Questions
Staff/Principal Level
Q: How would you diagnose a sudden 10x latency increase in production?
Systematic Approach:
1. Immediate checks (< 1 min) (see the triage query after this list)
   - pg_stat_activity: What's running, what's waiting?
   - pg_locks: Any blocking?
   - System metrics: CPU, I/O, memory via monitoring
2. Compare to baseline
   - pg_stat_statements: Which queries slowed down?
   - Was there a deploy? Schema change?
   - Traffic pattern change?
3. Drill down
   - EXPLAIN ANALYZE the slow queries
   - Check for plan changes (pg_stat_plans if available)
   - Statistics stale? Run ANALYZE
4. System level if needed
   - iostat: I/O saturation?
   - vmstat: Memory pressure?
   - perf/flamegraph: CPU profile
5. Common root causes
   - Autovacuum not keeping up
   - Statistics out of date → bad plans
   - Connection saturation
   - Lock contention from a long transaction
   - Checkpoint storm
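A minimal "first minute" triage query in the spirit of step 1 (a sketch; adjust the state filter to taste):
-- What is running and what it is waiting on, oldest first
SELECT
    pid,
    now() - query_start as runtime,
    state,
    wait_event_type,
    wait_event,
    left(query, 80) as query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;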
Q: Design a monitoring system for PostgreSQL at scale
Architecture:
1. Metrics Collection
   - pg_stat_* views polled every 10-60s
   - pg_stat_statements for query patterns
   - OS metrics (node_exporter)
   - Custom metrics via PgBouncer stats
2. Storage
   - Time-series database (Prometheus, InfluxDB, TimescaleDB)
   - Retention: 15 days high-res, 1 year aggregated
3. Dashboards
   - Golden signals: latency, throughput, errors, saturation
   - Per-query latency percentiles
   - Replication lag
   - Vacuum progress
   - Connection pool stats
4. Alerting
   - Latency P99 > threshold
   - Replication lag > 30s
   - Connections > 80% of max
   - Disk usage > 80%
   - Long-running transactions > 10 min
5. Query Insights
   - Track query fingerprints over time
   - Detect plan regressions
   - Identify optimization opportunities
Q: How would you scale PostgreSQL to handle 1M QPS?
Strategy:
1. Read Scaling
   - Read replicas (streaming replication)
   - Load balancer (HAProxy, pgpool) with read/write splitting
   - Connection pooling (PgBouncer) in front of each node
   - Caching layer (Redis) for hot data
2. Write Scaling (harder)
   - Vertical scaling first (bigger instance)
   - Application-level sharding
   - Citus for distributed PostgreSQL
   - Partitioning for time-series data
3. Connection Management
   - PgBouncer with transaction pooling
   - Aim for ~50 actual connections per node
   - 1000s of clients → PgBouncer → 50 DB connections
4. Query Optimization
   - Prepared statements (reduce parsing)
   - Efficient indexes
   - Minimize lock contention
5. Infrastructure
   - NVMe storage for IOPS
   - Sufficient RAM for the working set
   - Network: 10+ Gbps between components
Bottom line: 1M QPS is reached with many read replicas (10-50+), aggressive caching, sharding for writes, and by ensuring not all queries hit the database.