Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Performance Engineering: The Bottleneck Analysis Guide

Performance engineering is not about “tuning parameters” — it is a systematic process of identifying the bottleneck in a complex, layered system and applying the minimum change required to shift that bottleneck to a more scalable resource. Real-world analogy: Think of a database system like a factory assembly line. The overall throughput is determined by the slowest station. If the painting station can only process 10 units per hour but assembly handles 100, making assembly faster is a total waste of effort. Performance engineering is the discipline of finding the painting station, improving it, and then finding the next bottleneck. The moment you start optimizing without measuring, you are guessing which station to fix — and guessing has about a 10% hit rate.
Target Audience: Staff/Principal Database Engineers
Core Methodology: The USE Method + Wait Event Profiling
Key Goal: Move from “guessing” to “evidential analysis”

The Performance Engineering Mindset

A principal engineer view of performance follows three laws:
  1. The Law of Amdahl: The speedup of a system is limited by its serial components.
  2. The Law of Little: Average number of items in a system = arrival rate × average time spent in system (L = λW).
  3. The Law of Bottlenecks: Performance is always limited by the most saturated resource. Optimizing anything else is a waste of time.

Phase 0: Triage & The Four Golden Signals

Before diving into profiles, you must identify the symptom. We use the “Four Golden Signals” to determine if the system is healthy.
SignalMetricPrincipal Question
LatencyP99 Response TimeIs the database slow, or is it just a few “outlier” queries?
ThroughputQueries Per Second (QPS)Are we handling the load, or is the engine stalling?
ErrorsRollback/Conflict RateAre we failing due to deadlocks or serialization issues?
SaturationResource % (CPU/IO)How much “headroom” is left before a total collapse?

Triage Query: The 30-Second Snapshot

-- Identify the "Current Reality" of the engine.
-- This is the single most important query in your performance toolkit.
-- Run it first, every time. It tells you WHERE the database is spending time.
SELECT 
    state,
    wait_event_type,          -- NULL = actively using CPU; non-NULL = waiting
    wait_event,               -- Specific lock/IO/event the backend is stalled on
    count(*) as count,
    round(100.0 * count(*) / sum(count(*)) over(), 2) as pct  -- percentage of active backends
FROM pg_stat_activity
WHERE state != 'idle'         -- Exclude connections doing nothing
GROUP BY 1, 2, 3
ORDER BY 4 DESC;
How to read the results: If wait_event_type is NULL for most sessions, you are CPU Bound — the database is actively computing, not waiting. If it is IO, you are Disk Bound — backends are stalled waiting for pages from disk. If it is LWLock, you are Contention Bound — multiple backends are fighting over the same internal resource. Each diagnosis leads to a fundamentally different remediation path, which is why this query must always be your first step.

Phase 1: Resource Saturation (The USE Method)

Developed by Brendan Gregg, the USE Method should be applied to every resource (CPU, Memory, Disk, Network).

1.1 CPU: Computation vs. Context Switching

Utilization: top or mpstat. High utilization (>80%) is expected in high-performance DBs. Saturation: Look at the Run Queue Length (load average). If Load > Number of Cores, tasks are waiting for a CPU slot. Principal Observation: High CPU with low throughput often indicates Spinlock Contention or Excessive Context Switching.
# Check context switches per second
vmstat 1
# 'cs' column > 100k/sec usually indicates a problem (e.g., too many connections)

1.2 Memory: The Buffer Hit Ratio Myth

Utilization: free -m. Saturation: Scanning/Paging. Look at si/so (swap in/out) in vmstat. Staff Engineer Insight: A “99% Buffer Cache Hit Ratio” can be a trap. If you have 1,000,000 hits and 10,000 reads, your ratio is 99%. But if those 10,000 reads are random I/Os on a slow disk, your latency will be terrible. Focus on the absolute number of reads/sec, not the ratio. Real-world analogy: A 99% cache hit ratio is like saying “99% of our customers were served in under a second.” That sounds great until you learn you have 1 million customers per hour — meaning 10,000 people per hour waited 50ms+ for a disk read. In an OLTP system with a P99 latency SLA, those 10,000 slow operations are your entire problem, and the ratio hides them completely.

1.3 Disk: IOPS vs. Throughput

Utilization: %util in iostat. Saturation: await (Average Wait Time).
iostat -xy 1
# If await > 5ms on an NVMe drive, the storage is saturated.
# If %util is 100% but rkB/s is low, you have a Random I/O bottleneck.

Phase 2: Database Internal Waits

Phase 2: Database Internal Waits

If resources are not saturated but latency is high, the database is likely waiting on internal synchronization.

2.1 The Hierarchy of Waits

Wait ClassPrincipal ArchetypeRoot Cause
LWLockbuffer_contentShared Buffer Contention. Too many backends trying to access the same page (Hot Block).
LWLockWALWriteLockWAL Serializer Bottleneck. Too many small transactions; slow fsync to WAL disk.
LockrelationCatalog Contention. An uncommitted DDL (e.g., ALTER TABLE) is blocking all DML.
IODataFileReadCache Miss. Query is pulling cold data from disk into the buffer pool.

2.2 Advanced: ProcArrayLock Contention

In high-concurrency systems (1000+ TPS), ProcArrayLock often becomes the global bottleneck. This lock is used when a transaction needs to compute a “Snapshot” of the system (visibility). Staff Engineer Insight: If you see ProcArrayLock as a top wait event, scaling CPU won’t help. You must reduce the transaction rate, use a connection pooler (PgBouncer) to reduce backend count, or move to a version of Postgres (14+) with snapshot caching optimizations.

Phase 3: Code-Level Profiling (Flamegraphs)

When you are CPU Bound and cannot find a simple query fix, you must look at what the CPU is actually doing inside the Postgres binary.

3.1 Capturing the Flamegraph

Real-world analogy: A flamegraph is like a time-lapse photo of a kitchen during a dinner rush. Each flame shows where the cooks (CPU) are spending their time. A wide plateau means a lot of time is spent in one activity (e.g., chopping onions). Jagged towers mean the work is fragmented across many small tasks. You optimize by making the widest plateaus narrower.
# Profile all postgres backends for 60 seconds.
# -F 99: Sample at 99 Hz (not 100, to avoid aliasing with periodic tasks).
# -g: Capture full call stacks (essential for useful flamegraphs).
# -p: Target only postgres backend processes.
perf record -F 99 -g -p $(pgrep -d',' postgres) -- sleep 60

# Generate the SVG (requires Brendan Gregg's FlameGraph tools)
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > db_profile.svg
# Open db_profile.svg in a browser. Click on frames to zoom in.

3.2 Interpreting “Staff-Level” Patterns

PatternCode PathMeaning
Wide Plateauhash_searchThe Hash Table (for Joins or Aggs) is huge and spending time in bucket lookup. Consider work_mem tuning.
Wide Plateauslot_deform_tupleThe Defragmentation Tax. Postgres spends massive CPU “deforming” rows to find a column at the end of a wide table. Move heavily accessed columns to the front.
Jagged TowersExecQualMassive expression evaluation. Likely a complex WHERE clause with non-indexed functions.
Wide TowerLWLockAcquireLock Contention. The engine is spending more time waiting for metadata than doing work.

Phase 4: Query-Level Optimization (The 80/20 Rule)

You cannot optimize every query. A principal engineer identifies the 20% of queries causing 80% of the load.

4.1 The pg_stat_statements Strategy

-- Identify queries with the highest I/O impact.
-- This is your "top offenders" list. Fix these first.
SELECT 
    query,
    calls,                        -- How often this query runs
    round(total_exec_time::numeric, 2) as total_ms,  -- Cumulative wall-clock time
    shared_blks_read + shared_blks_written as io_blocks,  -- Total disk I/O in 8KB blocks
    round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as hit_ratio
FROM pg_stat_statements
ORDER BY io_blocks DESC          -- Sort by I/O impact, not just time
LIMIT 10;
Staff Engineer Insight: Look for queries with high calls but low rows. This indicates index-thrashing or unnecessary repeat work. A query called 10,000 times a second that takes 1ms is often more dangerous than a 10s query called once an hour. Why sort by io_blocks instead of total_exec_time? Time-based sorting conflates CPU-bound and I/O-bound work. A CPU-bound query on a fast machine might show high time but zero disk pressure. An I/O-bound query might show modest time but be saturating your disk, degrading every other query on the system. I/O blocks reveal the queries that are poisoning the shared resource.

4.2 The “Temp File” Trap

If work_mem is too small, Postgres spills sorts and hashes to disk. This is a silent performance killer — you will not see an error or warning; the query just silently becomes 10-100x slower because in-memory operations become disk I/O operations. Real-world analogy: Imagine sorting a deck of 1,000 cards on a table that only fits 100 cards. You would have to sort in batches of 100, write each batch to a pile on the floor, then merge the piles. The floor (disk) is orders of magnitude slower than the table (RAM). That is exactly what an “external merge sort” does.
-- Find queries that are spilling to disk.
-- Any row in this result is a query that could be dramatically faster
-- with more work_mem or a query rewrite that reduces intermediate results.
SELECT query, temp_blks_read, temp_blks_written
FROM pg_stat_statements
WHERE temp_blks_read > 0
ORDER BY 2 DESC;
Solution: Increase work_mem for specific users or queries (not globally — see the memory management pitfall about per-operation limits), or optimize the join to reduce the intermediate result set size so it fits in memory.

Phase 5: Concurrency & Lock Contention

5.1 The Blocker Tree

Use this recursive query to find the “Root Blocker” causing a pile-up.
WITH RECURSIVE blockers AS (
  SELECT 
    pid, 
    pg_blocking_pids(pid) AS pids 
  FROM pg_stat_activity 
  WHERE cardinality(pg_blocking_pids(pid)) > 0
  UNION ALL
  SELECT 
    s.pid, 
    pg_blocking_pids(s.pid) 
  FROM pg_stat_activity s 
  JOIN blockers b ON s.pid = ANY(b.pids)
)
SELECT 
  pid, 
  state, 
  wait_event, 
  query 
FROM pg_stat_activity 
WHERE pid IN (SELECT unnest(pids) FROM blockers)
  AND pid NOT IN (SELECT pid FROM blockers);

5.2 Common Contention Archetypes

  1. The Hot Row: Millions of updates to a single row (e.g., a counter). Solution: Use batching or CRDT-style additive counters.
  2. The Foreign Key Lock: Updating a child table requires a shared lock on the parent. If many children update simultaneously, the parent becomes a bottleneck.
  3. The Index Build: Creating an index without CONCURRENTLY locks the whole table for writes.

Phase 6: Production Troubleshooting Runbook

SymptomInitial CheckProbable Root Cause
Spiking Latencypg_stat_activityLock pile-up or a “Plan Flip” (CBO chose a bad plan due to stale stats).
High Load Avgvmstat / topHigh context switching or Spinlock contention.
I/O SaturationiostatCheckpoint spike or a huge Sequential Scan.
Out of Memorydmesgwork_mem is set too high for the connection count.

Summary: The Principal’s Workflow

  1. Start at the OS: Is there a resource (CPU/IO) at 100%? (USE Method)
  2. Look at the Waits: Is the database waiting on itself? (pg_stat_activity)
  3. Find the Culprit: Which queries are causing the most impact? (pg_stat_statements)
  4. Profile if Needed: If CPU is high but queries look fine, use perf and Flamegraphs.
  5. Optimize Evidentially: Change one variable, measure the throughput shift, and repeat.

5.2 PgBouncer Configuration

Why connection pooling matters: Each PostgreSQL connection is a separate OS process consuming approximately 5-10MB of RAM. A naive setup with 500 application threads each holding a database connection wastes 2.5-5GB of RAM just on connection overhead, and PostgreSQL performs poorly beyond ~100-200 active connections due to lock contention in the ProcArray. PgBouncer solves this by multiplexing thousands of client connections onto a small pool of real database connections. Real-world analogy: PgBouncer is like a restaurant host who manages a waiting list. The restaurant (PostgreSQL) has 50 tables (connections). When 500 customers (application threads) show up, the host does not try to seat them all at once. Instead, each customer gets a table only while they are actively eating (executing a transaction), and gives it up between courses. This is transaction pooling mode.
# pgbouncer.ini for high performance

[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
# Connection pooling mode
# "transaction" returns connection to pool after each transaction completes.
# "session" holds connection for entire client session (like no pooling).
# "statement" returns after each statement (breaks multi-statement transactions).
pool_mode = transaction  # Best for most workloads; use "session" if you need prepared statements

# Pool sizing -- the most important tuning parameter
default_pool_size = 20      # Connections per user/database pair (start low, increase if needed)
min_pool_size = 5           # Keep some connections warm (avoids cold-start latency)
reserve_pool_size = 5       # Extra for traffic bursts
reserve_pool_timeout = 3    # Seconds to wait before dipping into the reserve pool

# Connection limits
max_client_conn = 1000      # Max client connections to pgbouncer (can be very high)
max_db_connections = 50     # Max connections to actual database (this is the real constraint)

# Timeouts
query_timeout = 30          # Kill query if longer than 30s
client_idle_timeout = 600   # Close idle clients after 10min
server_idle_timeout = 60    # Close idle server connections

# Performance
tcp_keepalive = 1
tcp_keepidle = 60
tcp_keepintvl = 10
tcp_keepcnt = 3

# Logging
log_connections = 0         # Reduce log spam in production
log_disconnections = 0
log_pooler_errors = 1

# Admin
admin_users = pgbouncer_admin
stats_users = pgbouncer_stats

5.3 Memory Configuration

Practical tip: The settings below are starting points, not final answers. Every workload is different. The right approach is to start with these values, monitor with pg_stat_bgwriter (for checkpoint tuning), pg_stat_statements (for temp file usage), and pg_stat_activity (for memory-related wait events), then adjust based on evidence.
-- Memory settings for different workload types

-- OLTP workload (many small queries, many connections)
shared_buffers = '8GB'              -- ~25% of RAM. Above 8-16GB, diminishing returns
                                    -- because the OS page cache handles the rest.
effective_cache_size = '24GB'       -- Tell the planner how much total cache exists
                                    -- (shared_buffers + OS cache). Does NOT allocate memory.
work_mem = '64MB'                   -- Conservative for OLTP. Many connections = many
                                    -- concurrent operations, each consuming work_mem.
maintenance_work_mem = '2GB'        -- For VACUUM, CREATE INDEX. Only a few run at once,
                                    -- so this can be much larger than work_mem.
wal_buffers = '64MB'                -- WAL insertion buffer. Default is too small for
                                    -- write-heavy workloads. 64MB is a safe upper bound.

-- OLAP workload (few large queries, few connections)
shared_buffers = '32GB'             -- Larger because analytics scans more data.
effective_cache_size = '96GB'       
work_mem = '2GB'                    -- Much larger because OLAP queries have complex
                                    -- sorts and hash joins. Acceptable because there
                                    -- are only 5-20 concurrent queries, not 500.
maintenance_work_mem = '4GB'
max_parallel_workers_per_gather = 8 -- Parallelism is critical for large scans/aggregations.

-- Work_mem calculation for hash joins:
/*
 * Memory needed = (inner_rows × tuple_width × hash_overhead)
 * 
 * For 1M rows, 100 bytes/row:
 * Memory = 1,000,000 × 100 × 1.5 ≈ 150MB
 * 
 * Set work_mem = 256MB to fit with margin
 * 
 * Warning: work_mem is PER OPERATION
 * Complex query with 5 sorts/hashes = 5 × work_mem
 * With 100 connections: potential 5 × 256MB × 100 = 128GB!
 */

Part 6: Lock Contention Analysis

6.1 Lock Monitoring

-- Current locks (who's blocking whom)
SELECT 
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_statement,
    blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity 
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks 
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
    AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
    AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
    AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
    AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity 
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- Lock types and their conflicts
/*
┌─────────────────────────────────────────────────────────────────────────────┐
│                      LOCK CONFLICT MATRIX                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│ Requested →   ACCESS  ROW    ROW    SHARE  SHARE  SHARE   EXCL   ACCESS    │
│ Held ↓        SHARE   SHARE  EXCL   UPDATE ROW    UPDATE         EXCL      │
│                                            EXCL   EXCL                       │
│ ─────────────────────────────────────────────────────────────────────────── │
│ ACCESS SHARE    -       -      -      -      -      -       -      X        │
│ ROW SHARE       -       -      -      -      -      -       X      X        │
│ ROW EXCLUSIVE   -       -      -      -      -      X       X      X        │
│ SHARE UPDATE EX -       -      -      -      X      X       X      X        │
│ SHARE           -       -      -      -      X      X       X      X        │
│ SHARE ROW EXCL  -       -      X      X      X      X       X      X        │
│ EXCLUSIVE       -       X      X      X      X      X       X      X        │
│ ACCESS EXCL     X       X      X      X      X      X       X      X        │
│                                                                              │
│ X = Conflict (must wait)                                                    │
│ - = No conflict (can proceed)                                               │
│                                                                              │
│ Common operations and their locks:                                          │
│ • SELECT: ACCESS SHARE                                                      │
│ • INSERT/UPDATE/DELETE: ROW EXCLUSIVE                                       │
│ • CREATE INDEX CONCURRENTLY: SHARE UPDATE EXCLUSIVE                         │
│ • VACUUM FULL, REINDEX: ACCESS EXCLUSIVE                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
*/

6.2 Deadlock Analysis

-- Enable deadlock logging
-- In postgresql.conf:
-- log_lock_waits = on
-- deadlock_timeout = 1s

-- Check logs for deadlock patterns:
/*
LOG:  process 12345 detected deadlock while waiting for ShareLock on 
      transaction 987654 after 1000.123 ms
DETAIL:  Process holding the lock: 12346. Wait queue: .
CONTEXT:  while updating tuple (0,42) in relation "orders"
STATEMENT:  UPDATE orders SET status = 'shipped' WHERE id = 100

LOG:  process 12345 still waiting for ShareLock on transaction 987654 
      after 1000.456 ms
DETAIL:  Process holding the lock: 12346. Wait queue: 12345.
CONTEXT:  while updating tuple (0,42) in relation "orders"
STATEMENT:  UPDATE orders SET status = 'shipped' WHERE id = 100
*/

-- Analyze deadlock from logs:
/*
 Deadlock scenario:
 
 Transaction A:                  Transaction B:
 UPDATE orders WHERE id=100      UPDATE orders WHERE id=200
 (holds lock on row 100)         (holds lock on row 200)
 ...                             ...
 UPDATE orders WHERE id=200      UPDATE orders WHERE id=100
 (waits for B's lock)            (waits for A's lock)
 
 ↓ DEADLOCK
 
 Solution: Consistent lock ordering
 - Always acquire locks in same order (e.g., by id ASC)
 - Or use SELECT ... FOR UPDATE with explicit ordering
*/

Part 7: Production Troubleshooting Runbook

7.1 High CPU Usage

-- Step 1: Identify CPU-heavy queries
SELECT 
    pid,
    usename,
    state,
    wait_event_type,
    wait_event,
    left(query, 100)
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type IS NULL  -- Not waiting = actually running
ORDER BY backend_start;

-- Step 2: Check pg_stat_statements
SELECT 
    query,
    calls,
    mean_exec_time,
    total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Step 3: Profile with perf
-- perf record -g -p <pid> -- sleep 30
-- perf report

-- Common causes:
-- • Missing index (ExecSeqScan in profile)
-- • Complex expression evaluation (ExecEvalExpr)
-- • Hash table operations (hash_search)
-- • Sort operations (tuplesort_*)

7.2 High I/O Wait

-- Step 1: Check wait events
SELECT 
    wait_event_type,
    wait_event,
    count(*)
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type IN ('IO', 'BufferPin')
GROUP BY wait_event_type, wait_event;

-- Step 2: Check buffer cache hit ratio
SELECT 
    sum(heap_blks_read) as heap_read,
    sum(heap_blks_hit) as heap_hit,
    sum(heap_blks_hit) / 
        nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) as ratio
FROM pg_statio_user_tables;
-- Should be > 0.99 for OLTP

-- Step 3: Check checkpoint frequency
SELECT 
    checkpoints_timed,
    checkpoints_req,  -- Requested = forced = BAD
    buffers_checkpoint,
    buffers_backend   -- Backend writes = BAD, should be 0
FROM pg_stat_bgwriter;

-- Step 4: System level
-- iostat -xz 1
-- iotop

-- Common causes:
-- • shared_buffers too small
-- • Checkpoint too aggressive (checkpoint_completion_target low)
-- • seq scans on large tables
-- • Bloated tables need VACUUM

7.3 Lock Contention

-- Step 1: Check current locks
SELECT 
    locktype,
    relation::regclass,
    mode,
    granted,
    pid,
    left(query, 50) as query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE NOT granted
ORDER BY relation;

-- Step 2: Find blocking queries
SELECT * FROM pg_blocking_pids(12345);  -- Replace with waiting PID

-- Step 3: Check for long-running transactions
SELECT 
    pid,
    now() - xact_start as xact_duration,
    now() - query_start as query_duration,
    state,
    left(query, 60)
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

-- Step 4: Kill if necessary
SELECT pg_terminate_backend(12345);  -- Last resort

-- Prevention:
-- • Use statement_timeout and idle_in_transaction_session_timeout
-- • Use lock_timeout
-- • Avoid long transactions in OLTP

Part 8: Interview Questions

Staff/Principal Level

Systematic Approach:
  1. Immediate checks (< 1 min)
    • pg_stat_activity: What’s running, what’s waiting?
    • pg_locks: Any blocking?
    • System metrics: CPU, I/O, memory via monitoring
  2. Compare to baseline
    • pg_stat_statements: Which queries slowed down?
    • Was there a deploy? Schema change?
    • Traffic pattern change?
  3. Drill down
    • EXPLAIN ANALYZE the slow queries
    • Check for plan changes (pg_stat_plans if available)
    • Statistics stale? Run ANALYZE
  4. System level if needed
    • iostat: I/O saturation?
    • vmstat: Memory pressure?
    • perf/flamegraph: CPU profile
  5. Common root causes
    • Autovacuum not keeping up
    • Statistics out of date → bad plans
    • Connection saturation
    • Lock contention from long transaction
    • Checkpoint storm
Architecture:
  1. Metrics Collection
    • pg_stat_* views polled every 10-60s
    • pg_stat_statements for query patterns
    • OS metrics (node_exporter)
    • Custom metrics via pgbouncer stats
  2. Storage
    • Time-series database (Prometheus, InfluxDB, TimescaleDB)
    • Retention: 15 days high-res, 1 year aggregated
  3. Dashboards
    • Golden signals: latency, throughput, errors, saturation
    • Per-query latency percentiles
    • Replication lag
    • Vacuum progress
    • Connection pool stats
  4. Alerting
    • Latency P99 > threshold
    • Replication lag > 30s
    • Connections > 80% max
    • Disk usage > 80%
    • Long-running transactions > 10 min
  5. Query Insights
    • Track query fingerprints over time
    • Detect plan regressions
    • Identify optimization opportunities
Strategy:
  1. Read Scaling
    • Read replicas (streaming replication)
    • Load balancer (HAProxy, pgpool) with read/write splitting
    • Connection pooling (PgBouncer) in front of each node
    • Caching layer (Redis) for hot data
  2. Write Scaling (harder)
    • Vertical scaling first (bigger instance)
    • Application-level sharding
    • Citus for distributed PostgreSQL
    • Partitioning for time-series data
  3. Connection Management
    • PgBouncer with transaction pooling
    • Aim for ~50 actual connections per node
    • 1000s of clients → PgBouncer → 50 DB connections
  4. Query Optimization
    • Prepared statements (reduce parsing)
    • Efficient indexes
    • Minimize lock contention
  5. Infrastructure
    • NVMe storage for IOPS
    • Sufficient RAM for working set
    • Network: 10+ Gbps between components
Reality check: 1M QPS usually requires:
  • Many read replicas (10-50+)
  • Aggressive caching
  • Sharding for writes
  • Not all queries hit the database

Production Troubleshooting Scenario 1: High-CPU Query Analysis

Incident Report

Symptoms: Database server showing 95%+ CPU utilization. Application reporting 5x normal query latency.
Impact: User-facing API timing out, customer complaints escalating.
Time to Resolve: Target < 15 minutes

Investigation Workflow

Step 1: Identify CPU-Intensive Queries (2 minutes)
-- Find queries consuming the most CPU time
SELECT 
    left(query, 100) as query_preview,
    calls,
    total_exec_time::numeric(12,2) as total_exec_ms,
    mean_exec_time::numeric(10,2) as mean_exec_ms,
    (total_exec_time / sum(total_exec_time) OVER ()) * 100 as pct_total_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Check currently running expensive queries
SELECT 
    pid,
    usename,
    application_name,
    now() - query_start as duration,
    state,
    left(query, 100) as query_preview
FROM pg_stat_activity
WHERE state = 'active' 
  AND query NOT LIKE '%pg_stat_activity%'
ORDER BY query_start;
What to Look For:
  • Queries with high total_exec_time but reasonable mean_exec_time → high call frequency issue
  • Queries with high mean_exec_time → individual query performance issue
  • Long-running queries that started recently → may be holding locks
Step 2: Analyze the Problematic Query (3 minutes)
-- Get the full query text if truncated
SELECT query FROM pg_stat_statements WHERE queryid = <identified_queryid>;

-- Check execution plan
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) 
<paste the problematic query here>;
Common High-CPU Patterns:
  1. Full Table Scans on Large Tables
    • Look for Seq Scan on tables with millions of rows
    • Missing index or statistics out of date
  2. Nested Loop Joins with High Row Estimates
    • Nested Loop on large result sets is CPU-intensive
    • May need hash join or merge join instead
  3. Complex Aggregations Without Indexes
    • GROUP BY on non-indexed columns
    • Window functions scanning entire tables
  4. Regular Expression or String Operations
    • LIKE '%pattern%' (can’t use indexes)
    • regexp_match on large text fields
Step 3: Check for Statistics Issues (2 minutes)
-- Check when tables were last analyzed
SELECT 
    schemaname,
    relname,
    last_analyze,
    last_autoanalyze,
    n_live_tup,
    n_dead_tup
FROM pg_stat_user_tables
WHERE relname IN (
    -- List tables from the problematic query
    'users', 'orders', 'products'
)
ORDER BY last_analyze NULLS FIRST;

-- If statistics are stale, run ANALYZE
ANALYZE users;
ANALYZE orders;
-- Check if plan improves
Step 4: Immediate Mitigation (3 minutes) Option A: Kill the Offending Queries
-- Identify specific PIDs
SELECT 
    pid,
    usename,
    left(query, 80),
    now() - query_start as duration
FROM pg_stat_activity
WHERE query ILIKE '%<pattern from step 1>%'
  AND state = 'active';

-- Terminate them
SELECT pg_terminate_backend(pid) FROM pg_stat_activity 
WHERE query ILIKE '%<pattern>%' AND state = 'active';
Option B: Add Missing Index
-- If Seq Scan is the issue and WHERE clause is clear
CREATE INDEX CONCURRENTLY idx_users_created_at ON users(created_at);
-- CONCURRENTLY prevents table locking
Option C: Query Rewrite (if possible)
-- Example: Replace full table scan with indexed lookup
-- Bad:
SELECT * FROM orders WHERE status IN ('pending', 'processing');

-- Better:
SELECT * FROM orders WHERE status = 'pending'
UNION ALL
SELECT * FROM orders WHERE status = 'processing';
-- If there's an index on status, this uses it twice instead of seq scan
Step 5: Monitor and Verify (5 minutes)
-- Watch CPU and query stats
SELECT 
    sum(total_exec_time) as total_time,
    sum(calls) as total_calls
FROM pg_stat_statements;

-- Check every 30 seconds, verify total_time growth rate decreased
-- Monitor system CPU with `top` or monitoring dashboard

Post-Incident Analysis

Root Cause Categories:
  1. Missing Index: Application added new query pattern without proper indexing
  2. Statistics Drift: Large data changes made planner choose wrong plan
  3. Query Regression: Application deployed with inefficient query
  4. Load Spike: Sudden traffic increase exposed existing slow query
Preventive Measures:
  • Enable auto_explain with log_min_duration to catch slow queries before they become incidents
  • Set up alerts on CPU >80% sustained for >2 minutes
  • Run ANALYZE after bulk data imports
  • Review new queries in development with EXPLAIN ANALYZE
  • Implement query review process before production deployment

Production Troubleshooting Scenario 2: I/O-Bound Workload with Low Cache Hit Ratio

Incident Report

Symptoms: Response times degraded 10x. Disk I/O is saturated at 100%. Cache hit ratio dropped from 99% to 85%.
Impact: All database-dependent services experiencing timeouts.
Time to Resolve: Target < 20 minutes

Investigation Workflow

Step 1: Confirm I/O Saturation (1 minute)
# Check disk I/O utilization (Linux)
iostat -x 1 5

# Look for:
# - %util near 100% (disk saturated)
# - await > 10ms (high latency)
# - r/s and w/s (read/write operations per second)

# PostgreSQL wait events
SELECT 
    wait_event_type,
    wait_event,
    count(*) as waiting_backends
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;

-- Look for: DataFileRead, DataFileWrite, WALWrite
Step 2: Identify Cache-Unfriendly Queries (3 minutes)
-- Queries with worst buffer hit ratios
SELECT 
    left(query, 100) as query_preview,
    calls,
    shared_blks_read as disk_reads,
    shared_blks_hit as cache_hits,
    shared_blks_read + shared_blks_hit as total_blks,
    round(100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0), 2) as cache_hit_pct
FROM pg_stat_statements
WHERE shared_blks_read > 1000  -- Significant disk activity
ORDER BY shared_blks_read DESC
LIMIT 20;

-- Identify specific tables being read from disk
SELECT 
    schemaname,
    relname as table_name,
    heap_blks_read as disk_reads,
    heap_blks_hit as cache_hits,
    heap_blks_read + heap_blks_hit as total_accesses,
    round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 2) as hit_ratio
FROM pg_statio_user_tables
WHERE heap_blks_read > 0
ORDER BY heap_blks_read DESC
LIMIT 20;
Step 3: Analyze Query Plans for I/O Issues (4 minutes)
-- Get explain plan with buffer statistics
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
<paste problematic query>;

-- Look for:
-- • "Buffers: shared read=XXXXX" - high numbers mean disk reads
-- • Seq Scans on large tables
-- • Index scans that read too many pages
-- • Sorts or aggregations spilling to disk
Common I/O Problem Patterns:
  1. Large Sequential Scans
    • Reading entire large tables from disk
    • Solution: Add appropriate indexes or partition data
  2. Index Scans with Poor Selectivity
    • Index scan fetching >10% of table (seq scan might be better)
    • Solution: Consider covering index or change query
  3. Work Memory Exhaustion
    • Sorts/hashes spilling to temp files
    • Look for “external merge disk” in plan
    • Solution: Increase work_mem
  4. Table Bloat
    • Dead tuples force reading more pages
    • Solution: Run VACUUM or VACUUM FULL
Step 4: Check for Bloat and Fragmentation (3 minutes)
-- Identify bloated tables
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) as total_size,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_tup_pct,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

-- Check if autovacuum is running
SELECT * FROM pg_stat_progress_vacuum;
Step 5: Immediate Mitigation (5 minutes) Option A: Reduce Query Load
-- Kill or throttle the I/O-heavy queries
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE '%<identified pattern>%'
  AND state = 'active'
  AND backend_type = 'client backend';

-- Or set a statement timeout for that session
ALTER DATABASE mydb SET statement_timeout = '5s';
Option B: Emergency VACUUM
-- If bloat is the issue
VACUUM (ANALYZE, VERBOSE) problematic_table;

-- Monitor progress
SELECT * FROM pg_stat_progress_vacuum;
Option C: Add Targeted Index
-- If missing index is clear from EXPLAIN
CREATE INDEX CONCURRENTLY idx_orders_user_created 
ON orders(user_id, created_at) 
INCLUDE (status, total);  -- Covering index to avoid heap lookups
Option D: Temporarily Increase Shared Buffers
-- If you have available RAM and can restart
-- postgresql.conf:
-- shared_buffers = 8GB  (increase from current value)
-- effective_cache_size = 24GB  (hint for planner)

-- Or for immediate session-level help:
SET work_mem = '256MB';  -- For current session only
Step 6: Monitor Recovery (4 minutes)
-- Watch cache hit ratio improve
SELECT 
    sum(heap_blks_hit) as cache_hits,
    sum(heap_blks_read) as disk_reads,
    round(100.0 * sum(heap_blks_hit) / 
          nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0), 2) as hit_ratio
FROM pg_statio_user_tables;

-- Check every 30 seconds, target >99%

-- Monitor I/O wait events
SELECT 
    wait_event_type,
    count(*) as waiting
FROM pg_stat_activity
WHERE wait_event_type = 'IO'
GROUP BY wait_event_type;
-- Should decrease significantly

Post-Incident Analysis

Root Cause Categories:
  1. Working Set Exceeds Cache: Data accessed frequently doesn’t fit in shared_buffers
  2. Sudden Query Change: New query pattern reads cold data from disk
  3. Bloat Accumulation: VACUUM not keeping up, forcing reads of many dead-tuple pages
  4. Cache Eviction Storm: Massive seq scan pushed frequently-used data out of cache
Preventive Measures:
  • Right-size shared_buffers: Aim for 25-40% of system RAM
  • Monitor bloat proactively: Set up alerts when dead_tup_pct >20%
  • Tune autovacuum aggressively: Lower autovacuum_vacuum_scale_factor
  • Implement query review: Catch full table scans in development
  • Set up I/O alerts: Alert when disk await >10ms sustained or cache hit ratio under 95%
  • Consider read replicas: Offload analytical queries to dedicated instances

Module Mastery Checklist

Complete this checklist to confirm you’ve mastered Performance Engineering:

Performance Investigation Framework

  • Apply the 6-layer performance stack to diagnose issues systematically
  • Implement the four golden signals for database monitoring
  • Conduct root cause analysis using appropriate metrics at each layer
  • Triage production incidents within the first 5 minutes
  • Communicate performance findings to technical and non-technical stakeholders

Statistics and Monitoring

  • Configure and query pg_stat_statements for query performance analysis
  • Interpret wait event statistics to identify resource bottlenecks
  • Use pg_stat_activity to monitor active connections and queries
  • Analyze pg_stat_user_tables for table-level performance metrics
  • Build custom monitoring queries for production dashboards

CPU Profiling and System Analysis

  • Use perf and flamegraphs to profile PostgreSQL processes
  • Identify hot code paths and CPU-intensive operations
  • Interpret flamegraph patterns for different workload types
  • Analyze system-level CPU metrics and context switches
  • Apply profiling insights to optimize queries

I/O Performance

  • Monitor and optimize buffer cache hit ratios
  • Identify queries causing excessive physical I/O
  • Diagnose disk saturation using iostat and wait events
  • Configure shared_buffers and effective_cache_size appropriately
  • Implement strategies to reduce I/O amplification

Query Optimization

  • Read and interpret EXPLAIN and EXPLAIN ANALYZE output
  • Identify suboptimal scan types and join algorithms
  • Recognize when statistics are stale and run ANALYZE
  • Force better query plans when necessary
  • Optimize queries to use appropriate indexes

Concurrency and Locking

  • Diagnose lock contention using pg_locks and wait events
  • Identify and resolve deadlocks
  • Understand lock mode compatibility
  • Optimize transaction scope to minimize lock duration
  • Debug serialization failures in high-concurrency scenarios

Production Troubleshooting

  • Diagnose and resolve high-CPU query workloads
  • Fix I/O-bound performance degradation
  • Respond to sudden query slowdowns systematically
  • Debug connection pool exhaustion
  • Identify and remediate bloat-related performance issues

Configuration Tuning

  • Tune shared_buffers based on workload and available RAM
  • Configure work_mem to prevent disk spills
  • Optimize checkpoint settings for write performance
  • Tune autovacuum for production workloads
  • Apply connection pooling strategies effectively

Practical Application

  • Complete production troubleshooting scenarios
  • Build a monitoring dashboard for PostgreSQL
  • Conduct performance reviews of new queries before deployment
  • Implement proactive alerting for performance metrics
  • Document performance baselines and SLOs for production systems

Next Steps

Storage Engine Deep Dive

MVCC internals, vacuum, WAL, and recovery

Distributed Systems

Replication, consensus, and sharding

Interview Deep-Dive

Strong Answer:
  • First 30 seconds: Run the 30-second snapshot query: SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity WHERE state != 'idle' GROUP BY 1, 2, 3 ORDER BY count(*) DESC. This tells you where backends are spending time. If most are active with NULL wait_event, the database is CPU-bound on query execution. If most show Lock wait events, you have contention. If IO wait events dominate, storage is the bottleneck.
  • Next 60 seconds: Check pg_stat_statements for the top offenders: SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5. This identifies whether one specific query is causing the problem or the load is distributed.
  • Next 60 seconds: If a specific query is the culprit, run EXPLAIN (ANALYZE, BUFFERS) on it. If lock contention is the issue, check SELECT * FROM pg_locks WHERE NOT granted to find blocked queries, then find what is holding the lock.
  • Action (remaining 2 minutes): For a runaway query: SELECT pg_terminate_backend(pid) to kill it. For lock contention: identify and terminate the blocking transaction. For systemic slowness: check if autovacuum is running aggressively (anti-wraparound vacuum cannot be interrupted — you may need to wait), or if a large COPY/DELETE is consuming I/O bandwidth.
  • The key principle: measure before acting. The difference between a junior and a senior responding to an incident is that the senior runs three diagnostic queries before touching anything, while the junior starts restarting services.
Follow-up: The snapshot shows 200 backends all waiting on LWLock:BufferMapping. What does this mean?BufferMapping lock contention means too many backends are simultaneously trying to look up or insert pages in the buffer pool’s hash table. This typically happens during high-concurrency sequential scans on large tables where many backends are fetching the same pages. Solutions: reduce the number of concurrent backends hitting the same table (queue at the application level), add indexes to convert sequential scans into index scans (fewer page lookups), or if the workload is legitimate, increase the number of buffer partitions (requires PostgreSQL recompilation — not a quick fix).
Strong Answer:
  • Amdahl’s Law: The maximum speedup of a system is limited by the fraction that cannot be parallelized. If 80% of a query’s time is spent in a parallel hash join and 20% in a serial sort, making the hash join infinitely fast only gives you a 5x speedup (1 / 0.2). Concrete example: you enable parallel query with max_parallel_workers_per_gather = 8. The parallel sequential scan portion scales nearly linearly, but the final aggregation is serial. If aggregation is 30% of the total time, your 8 workers can only achieve ~3.3x speedup at best. This is why throwing more parallelism at a query with a large serial component shows diminishing returns.
  • Little’s Law: L = lambda * W. Average number of items in a system = arrival rate * average time in system. Concrete example: your database receives 1000 queries/sec (lambda) and each query takes an average of 50ms (W = 0.05s). L = 1000 * 0.05 = 50. You need 50 concurrent connections to sustain this throughput. If max_connections is 30, you will see queuing. If a slow query increases average latency from 50ms to 500ms, you suddenly need 500 concurrent connections — your connection pool explodes. This is why a single slow query can cascade into a full system outage.
  • These laws inform capacity planning. Before the CEO asks “can we handle 10x traffic?”, I use Little’s Law to calculate the required concurrency, Amdahl’s Law to predict the benefit of adding parallel workers, and the bottleneck analysis to identify which resource (CPU, I/O, connections) will saturate first.
Follow-up: How do you apply the USE method (Utilization, Saturation, Errors) to PostgreSQL specifically?For each resource: CPU — utilization from OS metrics, saturation from pg_stat_activity showing backends in active state with NULL wait_event (CPU-bound), errors from query timeouts. Disk I/O — utilization from iostat, saturation from pg_stat_activity wait_event_type = ‘IO’, errors from LOG: could not write to file. Connections — utilization from SELECT count(*) FROM pg_stat_activity vs max_connections, saturation from application-side connection pool queue depth, errors from FATAL: too many connections. Locks — utilization not directly measurable, saturation from pg_stat_activity wait_event_type = ‘Lock’, errors from deadlock_detected counts.
Strong Answer:
  • Baseline establishment: Run the production workload for 2 weeks and capture: P50/P95/P99 query latency (from pg_stat_statements), QPS (queries per second), buffer cache hit ratio, replication lag, connection count distribution, and disk I/O utilization. These form the baseline. Capture during peak hours and off-peak separately.
  • SLO definition: SLOs should be based on user-facing impact, not raw metrics. Example: “P99 query latency for the checkout flow will not exceed 100ms” (derived from the 200ms total API latency budget with 100ms allocated to database). “Replication lag will not exceed 5 seconds” (derived from the acceptable staleness for read-replica-served pages). “Buffer cache hit ratio will remain above 99.5%” (below this, the working set does not fit in memory and disk I/O degrades latency).
  • Violation response: Tier the response based on severity. P99 > 100ms for 5 minutes: page the on-call DBA, investigate via pg_stat_statements for new slow queries. P99 > 500ms for 1 minute: auto-scale read replicas, enable query circuit breakers in the application. Cache hit ratio < 95%: emergency investigation for unexpected table growth or working set change. Replication lag > 60s: consider promoting a fresher replica.
  • The meta-principle: SLOs are contracts between the database team and the application team. They must be derived from user experience requirements, not from arbitrary database metrics. A 50% CPU utilization SLO is meaningless if the application is happy at 80% CPU. A 10ms P99 latency SLO is unnecessarily expensive if the application adds 200ms of its own processing.
Follow-up: How do you handle SLO erosion — where latency gradually increases by 5% per month?This is the most insidious performance problem because no single change triggers an alert. Automated regression detection: track pg_stat_statements mean_exec_time per query fingerprint over time. If the 30-day moving average increases by more than 20%, flag it for investigation. Common causes: data growth (table/index size increases, scans take longer), statistics drift (planner makes increasingly poor choices), index bloat (gradual degradation), or cumulative schema changes that add overhead. The fix is usually ANALYZE + REINDEX + reviewing recently added columns or constraints.