Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Performance Engineering: The Bottleneck Analysis Guide
Performance engineering is not about “tuning parameters” — it is a systematic process of identifying the bottleneck in a complex, layered system and applying the minimum change required to shift that bottleneck to a more scalable resource. Real-world analogy: Think of a database system like a factory assembly line. The overall throughput is determined by the slowest station. If the painting station can only process 10 units per hour but assembly handles 100, making assembly faster is a total waste of effort. Performance engineering is the discipline of finding the painting station, improving it, and then finding the next bottleneck. The moment you start optimizing without measuring, you are guessing which station to fix — and guessing has about a 10% hit rate.Core Methodology: The USE Method + Wait Event Profiling
Key Goal: Move from “guessing” to “evidential analysis”
The Performance Engineering Mindset
A principal engineer view of performance follows three laws:- The Law of Amdahl: The speedup of a system is limited by its serial components.
- The Law of Little: Average number of items in a system = arrival rate × average time spent in system (L = λW).
- The Law of Bottlenecks: Performance is always limited by the most saturated resource. Optimizing anything else is a waste of time.
Phase 0: Triage & The Four Golden Signals
Before diving into profiles, you must identify the symptom. We use the “Four Golden Signals” to determine if the system is healthy.| Signal | Metric | Principal Question |
|---|---|---|
| Latency | P99 Response Time | Is the database slow, or is it just a few “outlier” queries? |
| Throughput | Queries Per Second (QPS) | Are we handling the load, or is the engine stalling? |
| Errors | Rollback/Conflict Rate | Are we failing due to deadlocks or serialization issues? |
| Saturation | Resource % (CPU/IO) | How much “headroom” is left before a total collapse? |
Triage Query: The 30-Second Snapshot
wait_event_type is NULL for most sessions, you are CPU Bound — the database is actively computing, not waiting. If it is IO, you are Disk Bound — backends are stalled waiting for pages from disk. If it is LWLock, you are Contention Bound — multiple backends are fighting over the same internal resource. Each diagnosis leads to a fundamentally different remediation path, which is why this query must always be your first step.
Phase 1: Resource Saturation (The USE Method)
Developed by Brendan Gregg, the USE Method should be applied to every resource (CPU, Memory, Disk, Network).1.1 CPU: Computation vs. Context Switching
Utilization:top or mpstat. High utilization (>80%) is expected in high-performance DBs.
Saturation: Look at the Run Queue Length (load average). If Load > Number of Cores, tasks are waiting for a CPU slot.
Principal Observation: High CPU with low throughput often indicates Spinlock Contention or Excessive Context Switching.
1.2 Memory: The Buffer Hit Ratio Myth
Utilization:free -m.
Saturation: Scanning/Paging. Look at si/so (swap in/out) in vmstat.
Staff Engineer Insight: A “99% Buffer Cache Hit Ratio” can be a trap. If you have 1,000,000 hits and 10,000 reads, your ratio is 99%. But if those 10,000 reads are random I/Os on a slow disk, your latency will be terrible. Focus on the absolute number of reads/sec, not the ratio.
Real-world analogy: A 99% cache hit ratio is like saying “99% of our customers were served in under a second.” That sounds great until you learn you have 1 million customers per hour — meaning 10,000 people per hour waited 50ms+ for a disk read. In an OLTP system with a P99 latency SLA, those 10,000 slow operations are your entire problem, and the ratio hides them completely.
1.3 Disk: IOPS vs. Throughput
Utilization:%util in iostat.
Saturation: await (Average Wait Time).
Phase 2: Database Internal Waits
Phase 2: Database Internal Waits
If resources are not saturated but latency is high, the database is likely waiting on internal synchronization.2.1 The Hierarchy of Waits
| Wait Class | Principal Archetype | Root Cause |
|---|---|---|
| LWLock | buffer_content | Shared Buffer Contention. Too many backends trying to access the same page (Hot Block). |
| LWLock | WALWriteLock | WAL Serializer Bottleneck. Too many small transactions; slow fsync to WAL disk. |
| Lock | relation | Catalog Contention. An uncommitted DDL (e.g., ALTER TABLE) is blocking all DML. |
| IO | DataFileRead | Cache Miss. Query is pulling cold data from disk into the buffer pool. |
2.2 Advanced: ProcArrayLock Contention
In high-concurrency systems (1000+ TPS),ProcArrayLock often becomes the global bottleneck. This lock is used when a transaction needs to compute a “Snapshot” of the system (visibility).
Staff Engineer Insight: If you see ProcArrayLock as a top wait event, scaling CPU won’t help. You must reduce the transaction rate, use a connection pooler (PgBouncer) to reduce backend count, or move to a version of Postgres (14+) with snapshot caching optimizations.
Phase 3: Code-Level Profiling (Flamegraphs)
When you are CPU Bound and cannot find a simple query fix, you must look at what the CPU is actually doing inside the Postgres binary.3.1 Capturing the Flamegraph
Real-world analogy: A flamegraph is like a time-lapse photo of a kitchen during a dinner rush. Each flame shows where the cooks (CPU) are spending their time. A wide plateau means a lot of time is spent in one activity (e.g., chopping onions). Jagged towers mean the work is fragmented across many small tasks. You optimize by making the widest plateaus narrower.3.2 Interpreting “Staff-Level” Patterns
| Pattern | Code Path | Meaning |
|---|---|---|
| Wide Plateau | hash_search | The Hash Table (for Joins or Aggs) is huge and spending time in bucket lookup. Consider work_mem tuning. |
| Wide Plateau | slot_deform_tuple | The Defragmentation Tax. Postgres spends massive CPU “deforming” rows to find a column at the end of a wide table. Move heavily accessed columns to the front. |
| Jagged Towers | ExecQual | Massive expression evaluation. Likely a complex WHERE clause with non-indexed functions. |
| Wide Tower | LWLockAcquire | Lock Contention. The engine is spending more time waiting for metadata than doing work. |
Phase 4: Query-Level Optimization (The 80/20 Rule)
You cannot optimize every query. A principal engineer identifies the 20% of queries causing 80% of the load.4.1 The pg_stat_statements Strategy
calls but low rows. This indicates index-thrashing or unnecessary repeat work. A query called 10,000 times a second that takes 1ms is often more dangerous than a 10s query called once an hour.
Why sort by io_blocks instead of total_exec_time? Time-based sorting conflates CPU-bound and I/O-bound work. A CPU-bound query on a fast machine might show high time but zero disk pressure. An I/O-bound query might show modest time but be saturating your disk, degrading every other query on the system. I/O blocks reveal the queries that are poisoning the shared resource.
4.2 The “Temp File” Trap
Ifwork_mem is too small, Postgres spills sorts and hashes to disk. This is a silent performance killer — you will not see an error or warning; the query just silently becomes 10-100x slower because in-memory operations become disk I/O operations.
Real-world analogy: Imagine sorting a deck of 1,000 cards on a table that only fits 100 cards. You would have to sort in batches of 100, write each batch to a pile on the floor, then merge the piles. The floor (disk) is orders of magnitude slower than the table (RAM). That is exactly what an “external merge sort” does.
work_mem for specific users or queries (not globally — see the memory management pitfall about per-operation limits), or optimize the join to reduce the intermediate result set size so it fits in memory.
Phase 5: Concurrency & Lock Contention
5.1 The Blocker Tree
Use this recursive query to find the “Root Blocker” causing a pile-up.5.2 Common Contention Archetypes
- The Hot Row: Millions of updates to a single row (e.g., a counter). Solution: Use batching or CRDT-style additive counters.
- The Foreign Key Lock: Updating a child table requires a shared lock on the parent. If many children update simultaneously, the parent becomes a bottleneck.
- The Index Build: Creating an index without
CONCURRENTLYlocks the whole table for writes.
Phase 6: Production Troubleshooting Runbook
| Symptom | Initial Check | Probable Root Cause |
|---|---|---|
| Spiking Latency | pg_stat_activity | Lock pile-up or a “Plan Flip” (CBO chose a bad plan due to stale stats). |
| High Load Avg | vmstat / top | High context switching or Spinlock contention. |
| I/O Saturation | iostat | Checkpoint spike or a huge Sequential Scan. |
| Out of Memory | dmesg | work_mem is set too high for the connection count. |
Summary: The Principal’s Workflow
- Start at the OS: Is there a resource (CPU/IO) at 100%? (USE Method)
- Look at the Waits: Is the database waiting on itself? (
pg_stat_activity) - Find the Culprit: Which queries are causing the most impact? (
pg_stat_statements) - Profile if Needed: If CPU is high but queries look fine, use
perfand Flamegraphs. - Optimize Evidentially: Change one variable, measure the throughput shift, and repeat.
5.2 PgBouncer Configuration
Why connection pooling matters: Each PostgreSQL connection is a separate OS process consuming approximately 5-10MB of RAM. A naive setup with 500 application threads each holding a database connection wastes 2.5-5GB of RAM just on connection overhead, and PostgreSQL performs poorly beyond ~100-200 active connections due to lock contention in the ProcArray. PgBouncer solves this by multiplexing thousands of client connections onto a small pool of real database connections. Real-world analogy: PgBouncer is like a restaurant host who manages a waiting list. The restaurant (PostgreSQL) has 50 tables (connections). When 500 customers (application threads) show up, the host does not try to seat them all at once. Instead, each customer gets a table only while they are actively eating (executing a transaction), and gives it up between courses. This istransaction pooling mode.
5.3 Memory Configuration
Practical tip: The settings below are starting points, not final answers. Every workload is different. The right approach is to start with these values, monitor withpg_stat_bgwriter (for checkpoint tuning), pg_stat_statements (for temp file usage), and pg_stat_activity (for memory-related wait events), then adjust based on evidence.
Part 6: Lock Contention Analysis
6.1 Lock Monitoring
6.2 Deadlock Analysis
Part 7: Production Troubleshooting Runbook
7.1 High CPU Usage
7.2 High I/O Wait
7.3 Lock Contention
Part 8: Interview Questions
Staff/Principal Level
Q: How would you diagnose a sudden 10x latency increase in production?
Q: How would you diagnose a sudden 10x latency increase in production?
-
Immediate checks (< 1 min)
pg_stat_activity: What’s running, what’s waiting?pg_locks: Any blocking?- System metrics: CPU, I/O, memory via monitoring
-
Compare to baseline
- pg_stat_statements: Which queries slowed down?
- Was there a deploy? Schema change?
- Traffic pattern change?
-
Drill down
- EXPLAIN ANALYZE the slow queries
- Check for plan changes (pg_stat_plans if available)
- Statistics stale? Run ANALYZE
-
System level if needed
- iostat: I/O saturation?
- vmstat: Memory pressure?
- perf/flamegraph: CPU profile
-
Common root causes
- Autovacuum not keeping up
- Statistics out of date → bad plans
- Connection saturation
- Lock contention from long transaction
- Checkpoint storm
Q: Design a monitoring system for PostgreSQL at scale
Q: Design a monitoring system for PostgreSQL at scale
-
Metrics Collection
- pg_stat_* views polled every 10-60s
- pg_stat_statements for query patterns
- OS metrics (node_exporter)
- Custom metrics via pgbouncer stats
-
Storage
- Time-series database (Prometheus, InfluxDB, TimescaleDB)
- Retention: 15 days high-res, 1 year aggregated
-
Dashboards
- Golden signals: latency, throughput, errors, saturation
- Per-query latency percentiles
- Replication lag
- Vacuum progress
- Connection pool stats
-
Alerting
- Latency P99 > threshold
- Replication lag > 30s
- Connections > 80% max
- Disk usage > 80%
- Long-running transactions > 10 min
-
Query Insights
- Track query fingerprints over time
- Detect plan regressions
- Identify optimization opportunities
Q: How would you scale PostgreSQL to handle 1M QPS?
Q: How would you scale PostgreSQL to handle 1M QPS?
-
Read Scaling
- Read replicas (streaming replication)
- Load balancer (HAProxy, pgpool) with read/write splitting
- Connection pooling (PgBouncer) in front of each node
- Caching layer (Redis) for hot data
-
Write Scaling (harder)
- Vertical scaling first (bigger instance)
- Application-level sharding
- Citus for distributed PostgreSQL
- Partitioning for time-series data
-
Connection Management
- PgBouncer with transaction pooling
- Aim for ~50 actual connections per node
- 1000s of clients → PgBouncer → 50 DB connections
-
Query Optimization
- Prepared statements (reduce parsing)
- Efficient indexes
- Minimize lock contention
-
Infrastructure
- NVMe storage for IOPS
- Sufficient RAM for working set
- Network: 10+ Gbps between components
- Many read replicas (10-50+)
- Aggressive caching
- Sharding for writes
- Not all queries hit the database
Production Troubleshooting Scenario 1: High-CPU Query Analysis
Incident Report
Symptoms: Database server showing 95%+ CPU utilization. Application reporting 5x normal query latency.Impact: User-facing API timing out, customer complaints escalating.
Time to Resolve: Target < 15 minutes
Investigation Workflow
Step 1: Identify CPU-Intensive Queries (2 minutes)- Queries with high
total_exec_timebut reasonablemean_exec_time→ high call frequency issue - Queries with high
mean_exec_time→ individual query performance issue - Long-running queries that started recently → may be holding locks
-
Full Table Scans on Large Tables
- Look for
Seq Scanon tables with millions of rows - Missing index or statistics out of date
- Look for
-
Nested Loop Joins with High Row Estimates
Nested Loopon large result sets is CPU-intensive- May need hash join or merge join instead
-
Complex Aggregations Without Indexes
GROUP BYon non-indexed columns- Window functions scanning entire tables
-
Regular Expression or String Operations
LIKE '%pattern%'(can’t use indexes)regexp_matchon large text fields
Post-Incident Analysis
Root Cause Categories:- Missing Index: Application added new query pattern without proper indexing
- Statistics Drift: Large data changes made planner choose wrong plan
- Query Regression: Application deployed with inefficient query
- Load Spike: Sudden traffic increase exposed existing slow query
- Enable
auto_explainwithlog_min_durationto catch slow queries before they become incidents - Set up alerts on CPU >80% sustained for >2 minutes
- Run
ANALYZEafter bulk data imports - Review new queries in development with
EXPLAIN ANALYZE - Implement query review process before production deployment
Production Troubleshooting Scenario 2: I/O-Bound Workload with Low Cache Hit Ratio
Incident Report
Symptoms: Response times degraded 10x. Disk I/O is saturated at 100%. Cache hit ratio dropped from 99% to 85%.Impact: All database-dependent services experiencing timeouts.
Time to Resolve: Target < 20 minutes
Investigation Workflow
Step 1: Confirm I/O Saturation (1 minute)-
Large Sequential Scans
- Reading entire large tables from disk
- Solution: Add appropriate indexes or partition data
-
Index Scans with Poor Selectivity
- Index scan fetching >10% of table (seq scan might be better)
- Solution: Consider covering index or change query
-
Work Memory Exhaustion
- Sorts/hashes spilling to temp files
- Look for “external merge disk” in plan
- Solution: Increase
work_mem
-
Table Bloat
- Dead tuples force reading more pages
- Solution: Run VACUUM or VACUUM FULL
Post-Incident Analysis
Root Cause Categories:- Working Set Exceeds Cache: Data accessed frequently doesn’t fit in shared_buffers
- Sudden Query Change: New query pattern reads cold data from disk
- Bloat Accumulation: VACUUM not keeping up, forcing reads of many dead-tuple pages
- Cache Eviction Storm: Massive seq scan pushed frequently-used data out of cache
- Right-size shared_buffers: Aim for 25-40% of system RAM
- Monitor bloat proactively: Set up alerts when dead_tup_pct >20%
- Tune autovacuum aggressively: Lower
autovacuum_vacuum_scale_factor - Implement query review: Catch full table scans in development
- Set up I/O alerts: Alert when disk await >10ms sustained or cache hit ratio under 95%
- Consider read replicas: Offload analytical queries to dedicated instances
Module Mastery Checklist
Complete this checklist to confirm you’ve mastered Performance Engineering:Performance Investigation Framework
- Apply the 6-layer performance stack to diagnose issues systematically
- Implement the four golden signals for database monitoring
- Conduct root cause analysis using appropriate metrics at each layer
- Triage production incidents within the first 5 minutes
- Communicate performance findings to technical and non-technical stakeholders
Statistics and Monitoring
- Configure and query pg_stat_statements for query performance analysis
- Interpret wait event statistics to identify resource bottlenecks
- Use pg_stat_activity to monitor active connections and queries
- Analyze pg_stat_user_tables for table-level performance metrics
- Build custom monitoring queries for production dashboards
CPU Profiling and System Analysis
- Use perf and flamegraphs to profile PostgreSQL processes
- Identify hot code paths and CPU-intensive operations
- Interpret flamegraph patterns for different workload types
- Analyze system-level CPU metrics and context switches
- Apply profiling insights to optimize queries
I/O Performance
- Monitor and optimize buffer cache hit ratios
- Identify queries causing excessive physical I/O
- Diagnose disk saturation using iostat and wait events
- Configure shared_buffers and effective_cache_size appropriately
- Implement strategies to reduce I/O amplification
Query Optimization
- Read and interpret EXPLAIN and EXPLAIN ANALYZE output
- Identify suboptimal scan types and join algorithms
- Recognize when statistics are stale and run ANALYZE
- Force better query plans when necessary
- Optimize queries to use appropriate indexes
Concurrency and Locking
- Diagnose lock contention using pg_locks and wait events
- Identify and resolve deadlocks
- Understand lock mode compatibility
- Optimize transaction scope to minimize lock duration
- Debug serialization failures in high-concurrency scenarios
Production Troubleshooting
- Diagnose and resolve high-CPU query workloads
- Fix I/O-bound performance degradation
- Respond to sudden query slowdowns systematically
- Debug connection pool exhaustion
- Identify and remediate bloat-related performance issues
Configuration Tuning
- Tune shared_buffers based on workload and available RAM
- Configure work_mem to prevent disk spills
- Optimize checkpoint settings for write performance
- Tune autovacuum for production workloads
- Apply connection pooling strategies effectively
Practical Application
- Complete production troubleshooting scenarios
- Build a monitoring dashboard for PostgreSQL
- Conduct performance reviews of new queries before deployment
- Implement proactive alerting for performance metrics
- Document performance baselines and SLOs for production systems
Next Steps
Storage Engine Deep Dive
Distributed Systems
Interview Deep-Dive
Production is slow. You have 5 minutes before the CEO's all-hands meeting. Walk me through your triage process.
Production is slow. You have 5 minutes before the CEO's all-hands meeting. Walk me through your triage process.
- First 30 seconds: Run the 30-second snapshot query:
SELECT state, wait_event_type, wait_event, count(*) FROM pg_stat_activity WHERE state != 'idle' GROUP BY 1, 2, 3 ORDER BY count(*) DESC. This tells you where backends are spending time. If most areactivewith NULL wait_event, the database is CPU-bound on query execution. If most showLockwait events, you have contention. IfIOwait events dominate, storage is the bottleneck. - Next 60 seconds: Check pg_stat_statements for the top offenders:
SELECT query, calls, mean_exec_time, total_exec_time FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 5. This identifies whether one specific query is causing the problem or the load is distributed. - Next 60 seconds: If a specific query is the culprit, run
EXPLAIN (ANALYZE, BUFFERS)on it. If lock contention is the issue, checkSELECT * FROM pg_locks WHERE NOT grantedto find blocked queries, then find what is holding the lock. - Action (remaining 2 minutes): For a runaway query:
SELECT pg_terminate_backend(pid)to kill it. For lock contention: identify and terminate the blocking transaction. For systemic slowness: check if autovacuum is running aggressively (anti-wraparound vacuum cannot be interrupted — you may need to wait), or if a large COPY/DELETE is consuming I/O bandwidth. - The key principle: measure before acting. The difference between a junior and a senior responding to an incident is that the senior runs three diagnostic queries before touching anything, while the junior starts restarting services.
Explain Amdahl's Law and Little's Law in the context of database performance. Give a concrete example for each.
Explain Amdahl's Law and Little's Law in the context of database performance. Give a concrete example for each.
- Amdahl’s Law: The maximum speedup of a system is limited by the fraction that cannot be parallelized. If 80% of a query’s time is spent in a parallel hash join and 20% in a serial sort, making the hash join infinitely fast only gives you a 5x speedup (1 / 0.2). Concrete example: you enable parallel query with
max_parallel_workers_per_gather = 8. The parallel sequential scan portion scales nearly linearly, but the final aggregation is serial. If aggregation is 30% of the total time, your 8 workers can only achieve ~3.3x speedup at best. This is why throwing more parallelism at a query with a large serial component shows diminishing returns. - Little’s Law: L = lambda * W. Average number of items in a system = arrival rate * average time in system. Concrete example: your database receives 1000 queries/sec (lambda) and each query takes an average of 50ms (W = 0.05s). L = 1000 * 0.05 = 50. You need 50 concurrent connections to sustain this throughput. If max_connections is 30, you will see queuing. If a slow query increases average latency from 50ms to 500ms, you suddenly need 500 concurrent connections — your connection pool explodes. This is why a single slow query can cascade into a full system outage.
- These laws inform capacity planning. Before the CEO asks “can we handle 10x traffic?”, I use Little’s Law to calculate the required concurrency, Amdahl’s Law to predict the benefit of adding parallel workers, and the bottleneck analysis to identify which resource (CPU, I/O, connections) will saturate first.
pg_stat_activity showing backends in active state with NULL wait_event (CPU-bound), errors from query timeouts. Disk I/O — utilization from iostat, saturation from pg_stat_activity wait_event_type = ‘IO’, errors from LOG: could not write to file. Connections — utilization from SELECT count(*) FROM pg_stat_activity vs max_connections, saturation from application-side connection pool queue depth, errors from FATAL: too many connections. Locks — utilization not directly measurable, saturation from pg_stat_activity wait_event_type = ‘Lock’, errors from deadlock_detected counts.How do you establish performance baselines and SLOs for a PostgreSQL database, and what happens when you violate them?
How do you establish performance baselines and SLOs for a PostgreSQL database, and what happens when you violate them?
- Baseline establishment: Run the production workload for 2 weeks and capture: P50/P95/P99 query latency (from pg_stat_statements), QPS (queries per second), buffer cache hit ratio, replication lag, connection count distribution, and disk I/O utilization. These form the baseline. Capture during peak hours and off-peak separately.
- SLO definition: SLOs should be based on user-facing impact, not raw metrics. Example: “P99 query latency for the checkout flow will not exceed 100ms” (derived from the 200ms total API latency budget with 100ms allocated to database). “Replication lag will not exceed 5 seconds” (derived from the acceptable staleness for read-replica-served pages). “Buffer cache hit ratio will remain above 99.5%” (below this, the working set does not fit in memory and disk I/O degrades latency).
- Violation response: Tier the response based on severity. P99 > 100ms for 5 minutes: page the on-call DBA, investigate via pg_stat_statements for new slow queries. P99 > 500ms for 1 minute: auto-scale read replicas, enable query circuit breakers in the application. Cache hit ratio < 95%: emergency investigation for unexpected table growth or working set change. Replication lag > 60s: consider promoting a fresher replica.
- The meta-principle: SLOs are contracts between the database team and the application team. They must be derived from user experience requirements, not from arbitrary database metrics. A 50% CPU utilization SLO is meaningless if the application is happy at 80% CPU. A 10ms P99 latency SLO is unnecessarily expensive if the application adds 200ms of its own processing.