Module 6: PostgreSQL Internals

Understanding how PostgreSQL works internally transforms you from someone who uses databases to someone who truly understands them. This knowledge helps you make better design decisions and debug complex issues.

Estimated Time: 12-15 hours
Depth Level: Advanced
Key Skill: Understanding MVCC, storage, and memory management

Learning Objectives

By the end of this module, you will be able to:

Understand PostgreSQL’s multi-process architecture: Explain the roles of postmaster, backend processes, and auxiliary processes.
Master MVCC principles: Describe how PostgreSQL implements Multi-Version Concurrency Control for isolation and performance.
Navigate shared memory structures: Identify the purpose of shared buffers, WAL buffers, lock tables, and process arrays.
Analyze storage mechanisms: Understand how PostgreSQL stores data in 8KB pages, heaps, and indexes on disk.
Debug using system catalogs: Query pg_stat_* and pg_* views to diagnose performance issues and observe internal state.
Optimize VACUUM strategies: Configure and tune autovacuum to manage dead tuples and prevent bloat.
Interpret transaction internals: Understand how XIDs, snapshots, and commit logs work together to enforce isolation.

Module-Specific Learning Outcomes

Section 6.1 - Architecture Overview

Explain PostgreSQL’s multi-process model vs. multi-threading approaches
Identify the responsibilities of the postmaster process
Describe the lifecycle of a backend process from connection to termination
Understand the purpose and function of auxiliary processes (checkpointer, bgwriter, WAL writer, autovacuum)
Map signal handling mechanisms between postmaster and backend processes
Navigate the shared memory layout and its major regions

Section 6.2 - Shared Memory and Processes

Calculate shared memory requirements based on configuration parameters
Understand the structure and purpose of the buffer pool
Explain how ProcGlobal tracks backend states
Describe the role of WAL buffers in transaction durability
Analyze memory usage using pg_stat_* views
Inspect process behavior using system catalogs

Section 6.3 - Storage Architecture

Understand PostgreSQL’s heap storage model and page structure
Explain how data is organized in 8KB pages with tuples and metadata
Describe the role of the Free Space Map (FSM) and Visibility Map (VM)
Identify how TOAST handles large values that exceed page size
Navigate the data directory structure and file naming conventions
Analyze table and index files using system tools

Section 6.4 - MVCC Implementation

Explain how MVCC enables non-blocking reads
Understand transaction IDs (XIDs) and their role in visibility determination
Describe how snapshots capture database state for isolation
Analyze tuple visibility rules based on xmin, xmax, and commit status
Diagnose transaction ID wraparound and its prevention strategies
Optimize for MVCC overhead in high-write workloads

Section 6.5 - VACUUM and Tuple Management

Explain the difference between VACUUM and VACUUM FULL
Understand how dead tuples accumulate and impact performance
Configure autovacuum thresholds and aggressive strategies
Monitor bloat using pg_stat_user_tables and pgstattuple
Diagnose and fix excessive table bloat
Apply best practices for VACUUM scheduling in production

Section 6.6 - WAL and Checkpoints

Understand Write-Ahead Logging principles and durability guarantees
Describe WAL record structure and LSN (Log Sequence Numbers)
Explain checkpoint mechanisms and their performance impact
Configure checkpoint parameters to balance durability and performance
Analyze WAL generation rate and optimize for write workloads
Implement Point-in-Time Recovery (PITR) strategies

Section 6.7 - System Catalogs and Metadata

Query pg_class, pg_attribute, pg_index for schema metadata
Use pg_stat_user_tables and pg_statio_user_tables for performance analysis
Inspect pg_stat_activity to monitor active queries and connections
Leverage pg_stat_statements for query performance tracking
Analyze lock contention using pg_locks
Understand catalog cache invalidation and its implications

Hands-On Lab 1: Inspecting PostgreSQL Processes and Shared Memory

Goal: Master the tools for observing PostgreSQL’s internal process architecture and memory structures.

Examine the Process Hierarchy:

-- View all PostgreSQL processes
SELECT 
    pid,
    usename AS username,
    application_name,
    client_addr,
    backend_start,
    state,
    backend_type,
    query
FROM pg_stat_activity
ORDER BY backend_type, backend_start;

-- Identify process types:
-- - client backend: Your query connections
-- - autovacuum launcher/worker: Cleanup processes
-- - logical replication launcher/worker: Replication processes
-- - background writer: Writes dirty buffers
-- - checkpointer: Performs checkpoints
-- - walwriter: Writes WAL records

Investigate Shared Memory Configuration:

-- Check shared memory settings
SELECT 
    name,
    setting,
    unit,
    context,
    short_desc
FROM pg_settings
WHERE name IN (
    'shared_buffers',
    'wal_buffers',
    'max_connections',
    'max_prepared_transactions',
    'max_locks_per_transaction',
    'shared_memory_size'
)
ORDER BY name;

-- Calculate total shared memory usage
SELECT pg_size_pretty(
    current_setting('shared_buffers')::bigint * 
    (SELECT setting FROM pg_settings WHERE name = 'block_size')::bigint
) AS shared_buffers_size;

Monitor Buffer Pool Activity:

-- Install pg_buffercache extension (if not already installed)
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- See what tables are cached in shared buffers
SELECT 
    c.relname AS table_name,
    COUNT(*) AS buffer_count,
    pg_size_pretty(COUNT(*) * 8192) AS size_in_cache,
    ROUND(100.0 * COUNT(*) / (SELECT setting FROM pg_settings WHERE name='shared_buffers')::int, 2) AS percent_of_cache
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
  AND c.relkind IN ('r', 'i')  -- tables and indexes
GROUP BY c.relname
ORDER BY buffer_count DESC
LIMIT 20;

-- Check buffer cache hit ratio
SELECT 
    SUM(heap_blks_hit) AS heap_read_hits,
    SUM(heap_blks_read) AS heap_disk_reads,
    ROUND(100.0 * SUM(heap_blks_hit) / NULLIF(SUM(heap_blks_hit) + SUM(heap_blks_read), 0), 2) AS cache_hit_ratio
FROM pg_statio_user_tables;
-- Target: > 99% for well-tuned systems

Analyze WAL Activity:

-- Check current WAL position
SELECT 
    pg_current_wal_lsn() AS current_wal_position,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal_generated;

-- View WAL writer statistics
SELECT 
    wal_records,
    wal_fpi,  -- Full Page Images
    wal_bytes,
    pg_size_pretty(wal_bytes) AS wal_bytes_pretty,
    wal_buffers_full,
    wal_write,
    wal_sync,
    stats_reset
FROM pg_stat_wal;

Monitor Background Workers:

-- Check background writer stats
SELECT 
    checkpoints_timed,      -- Scheduled checkpoints
    checkpoints_req,         -- Requested checkpoints (may indicate tuning needed)
    buffers_checkpoint,      -- Buffers written during checkpoint
    buffers_clean,          -- Buffers cleaned by bgwriter
    buffers_backend,        -- Buffers written by backends (should be low)
    buffers_alloc,          -- Buffers allocated
    stats_reset
FROM pg_stat_bgwriter;

-- If buffers_backend is high, consider increasing bgwriter activity

Challenge Exercises:
- Create a table with 1 million rows and observe how it affects the buffer cache
- Run a full table scan and watch the buffer pool fill up
- Compare cache hit ratios before and after the scan
- Identify which auxiliary process is consuming the most CPU using pg_stat_activity
- Calculate the percentage of shared memory used by each major component

When you can confidently navigate PostgreSQL’s internal monitoring views and interpret the results, you’ve mastered process and memory inspection.

Hands-On Lab 2: Observing MVCC Behavior and VACUUM Effects

Goal: Understand how MVCC creates tuple versions and how VACUUM reclaims space.

Set Up a Test Environment:

-- Create a test table
CREATE TABLE mvcc_test (
    id SERIAL PRIMARY KEY,
    data TEXT,
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Insert initial data
INSERT INTO mvcc_test (data)
SELECT 'Initial data ' || i
FROM generate_series(1, 10000) AS i;

-- Check initial table size
SELECT 
    pg_size_pretty(pg_table_size('mvcc_test')) AS table_size,
    pg_size_pretty(pg_total_relation_size('mvcc_test')) AS total_size_with_indexes;

Observe Tuple Versioning:

-- Enable pageinspect extension
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- View tuple details including xmin and xmax
SELECT 
    ctid,           -- Physical location (page, offset)
    xmin,           -- Transaction ID that created this tuple
    xmax,           -- Transaction ID that deleted/updated (0 if current)
    id,
    data
FROM mvcc_test
WHERE id <= 5;

-- Note the xmin values - they're the transaction IDs that created these rows

Create Dead Tuples Through Updates:

-- Update every row to create dead tuples
UPDATE mvcc_test SET data = 'Updated once', updated_at = NOW();

-- Check table stats
SELECT 
    n_live_tup AS live_tuples,
    n_dead_tup AS dead_tuples,
    n_tup_upd AS total_updates,
    n_tup_del AS total_deletes,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'mvcc_test';

-- Check table size (should have increased significantly)
SELECT pg_size_pretty(pg_table_size('mvcc_test')) AS table_size_after_update;

Observe Dead Tuples with Multiple Updates:

-- Do several more updates to accumulate dead tuples
UPDATE mvcc_test SET data = 'Updated twice' WHERE id % 2 = 0;
UPDATE mvcc_test SET data = 'Updated thrice' WHERE id % 3 = 0;
UPDATE mvcc_test SET data = 'Updated fourth' WHERE id % 4 = 0;

-- Check bloat accumulation
SELECT 
    relname AS table_name,
    n_live_tup,
    n_dead_tup,
    ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_tuple_percent,
    pg_size_pretty(pg_table_size(relname::regclass)) AS table_size,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'mvcc_test';

-- You should see significant dead tuple accumulation

Run VACUUM and Observe Effects:

-- Run VACUUM (not FULL)
VACUUM VERBOSE mvcc_test;

-- Check results
SELECT 
    n_live_tup,
    n_dead_tup,
    pg_size_pretty(pg_table_size('mvcc_test')) AS table_size
FROM pg_stat_user_tables
WHERE relname = 'mvcc_test';

-- Note: Table size may not shrink much because VACUUM doesn't return space to OS
-- It marks space as reusable within the table

Experiment with VACUUM FULL:

-- VACUUM FULL rewrites entire table (locks table exclusively)
VACUUM FULL VERBOSE mvcc_test;

-- Check size reduction
SELECT 
    pg_size_pretty(pg_table_size('mvcc_test')) AS table_size_after_vacuum_full;

-- Should see significant size reduction

Observe Transaction ID Visibility:

-- In one session, start a transaction
BEGIN;
SELECT txid_current() AS my_transaction_id;
INSERT INTO mvcc_test (data) VALUES ('Uncommitted data');

-- Note the transaction ID
SELECT ctid, xmin, xmax, id, data 
FROM mvcc_test 
WHERE data = 'Uncommitted data';

-- In another session (different connection), try to see this row
SELECT * FROM mvcc_test WHERE data = 'Uncommitted data';
-- Result: Empty! This demonstrates MVCC isolation

-- Back in first session, commit
COMMIT;

-- Now second session can see it
SELECT * FROM mvcc_test WHERE data = 'Uncommitted data';

Monitor Autovacuum Activity:

-- Check autovacuum configuration
SELECT 
    name,
    setting,
    unit,
    short_desc
FROM pg_settings
WHERE name LIKE 'autovacuum%'
ORDER BY name;

-- View recent autovacuum activity
SELECT 
    schemaname,
    relname,
    last_autovacuum,
    autovacuum_count,
    n_live_tup,
    n_dead_tup
FROM pg_stat_user_tables
WHERE autovacuum_count > 0
ORDER BY last_autovacuum DESC NULLS LAST;

Advanced Challenge:
- Create a workload with 50,000 updates and measure bloat accumulation
- Configure autovacuum to run more aggressively and observe the difference
- Use pageinspect to examine the internal structure of a page before and after VACUUM
- Simulate a long-running transaction and observe how it prevents VACUUM from reclaiming space
- Calculate the “bloat ratio” for all tables in your database

When you can explain how MVCC creates multiple tuple versions, how dead tuples accumulate, and how VACUUM reclaims space (with and without returning it to the OS), you’ve mastered PostgreSQL’s core concurrency mechanism.

Lock Manager

Spinlocks, LWLocks, heavyweight locks

Buffer Manager

Clock sweep, ring buffers, page eviction

Memory Management

Memory contexts, palloc, shared memory

Catalog System

System catalogs and caching

Extension Development

Building PostgreSQL extensions

6.1 Architecture Overview

Process Model

PostgreSQL uses a multi-process architecture (not multi-threaded). Each client connection gets its own backend process.

┌─────────────────────────────────────────────────────────────────────────────┐
│                    PROCESS ARCHITECTURE                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        POSTMASTER (pid=1)                              │  │
│  │  • Parent of all PostgreSQL processes                                 │  │
│  │  • Listens on TCP port 5432 (configurable)                           │  │
│  │  • fork()s backend processes for each connection                      │  │
│  │  • Monitors children, restarts if crash                               │  │
│  │  • Never touches database data directly                               │  │
│  └───────────────────────────────┬───────────────────────────────────────┘  │
│                                  │ fork()                                    │
│    ┌─────────────────────────────┼─────────────────────────────┐            │
│    │                             │                             │            │
│    ▼                             ▼                             ▼            │
│  ┌─────────────┐           ┌─────────────┐           ┌─────────────┐       │
│  │  Backend 1  │           │  Backend 2  │           │  Backend 3  │       │
│  │ pid=1001    │           │ pid=1002    │           │ pid=1003    │       │
│  │ user=alice  │           │ user=bob    │           │ user=alice  │       │
│  │ db=myapp    │           │ db=analytics│           │ db=myapp    │       │
│  └─────────────┘           └─────────────┘           └─────────────┘       │
│         │                         │                         │              │
│         └─────────────────────────┼─────────────────────────┘              │
│                                   │                                         │
│                                   ▼                                         │
│           ┌───────────────────────────────────────────────────┐            │
│           │              SHARED MEMORY                         │            │
│           │  • Shared Buffer Pool (shared_buffers)            │            │
│           │  • WAL Buffers (wal_buffers)                      │            │
│           │  • Lock Tables                                     │            │
│           │  • Proc Array (backend state)                     │            │
│           │  • CLOG (commit status)                           │            │
│           │  • Background worker state                        │            │
│           └───────────────────────────────────────────────────┘            │
│                                   │                                         │
│    ┌──────────────────────────────┴──────────────────────────────┐         │
│    │               AUXILIARY PROCESSES                            │         │
│    │                                                              │         │
│    │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │         │
│    │  │ Checkpointer│  │ BG Writer  │  │ WAL Writer │             │         │
│    │  │ Flushes all │  │ Writes    │  │ Flushes WAL│             │         │
│    │  │ dirty pages │  │ dirty     │  │ buffers to │             │         │
│    │  │ periodically│  │ pages     │  │ disk       │             │         │
│    │  └────────────┘  └────────────┘  └────────────┘             │         │
│    │                                                              │         │
│    │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │         │
│    │  │ Autovacuum │  │ Stats      │  │ Archiver   │             │         │
│    │  │ Launcher   │  │ Collector  │  │ Archives   │             │         │
│    │  │ + Workers  │  │ Aggregates │  │ WAL files  │             │         │
│    │  └────────────┘  └────────────┘  └────────────┘             │         │
│    │                                                              │         │
│    │  ┌────────────┐  ┌────────────┐                             │         │
│    │  │ WAL Sender │  │ WAL        │  (Replication)              │         │
│    │  │ (primary)  │  │ Receiver   │                             │         │
│    │  └────────────┘  └────────────┘                             │         │
│    └──────────────────────────────────────────────────────────────┘         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Signal Handling

PostgreSQL uses Unix signals for inter-process communication:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SIGNAL HANDLING                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Signal      │ Sender        │ Effect on Backend                            │
│  ────────────┼───────────────┼───────────────────────────────────────────── │
│  SIGTERM     │ Postmaster    │ Smart shutdown (finish current txn, exit)   │
│  SIGINT      │ Postmaster    │ Fast shutdown (rollback current txn, exit)  │
│  SIGQUIT     │ Postmaster    │ Immediate shutdown (crash recovery needed)  │
│  SIGHUP      │ pg_reload_conf│ Reload configuration files                  │
│  SIGUSR1     │ Postmaster    │ Latch wakeup (check for work)               │
│  SIGUSR2     │ Various       │ Recovery-related                            │
│  SIGALRM     │ Self          │ Statement timeout, lock timeout             │
│                                                                              │
│  Postmaster signal handling:                                                 │
│  • SIGTERM/SIGINT → Initiate shutdown sequence                              │
│  • SIGCHLD → Child process died, check if restart needed                    │
│  • SIGHUP → Signal all children to reload config                            │
│                                                                              │
│  Example: Graceful shutdown                                                  │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │ 1. Admin: pg_ctl stop -m smart                                         │ │
│  │ 2. Postmaster receives SIGTERM                                         │ │
│  │ 3. Postmaster sends SIGTERM to all backends                            │ │
│  │ 4. Backends finish current transaction                                  │ │
│  │ 5. Backends exit                                                        │ │
│  │ 6. Postmaster waits for all children                                   │ │
│  │ 7. Postmaster performs checkpoint                                       │ │
│  │ 8. Postmaster exits                                                     │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

-- View all PostgreSQL processes
SELECT pid, usename, application_name, client_addr, 
       state, query, backend_type
FROM pg_stat_activity;

-- Send signals to backends
SELECT pg_terminate_backend(pid);  -- SIGTERM to specific backend
SELECT pg_cancel_backend(pid);     -- SIGINT to cancel current query

-- Reload configuration without restart
SELECT pg_reload_conf();           -- SIGHUP to postmaster

-- View postmaster start time
SELECT pg_postmaster_start_time();

Shared Memory Layout

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SHARED MEMORY LAYOUT                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  FIXED-SIZE STRUCTURES                                                │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ ShmemIndex: Hash table mapping names to shared memory regions   │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ ProcGlobal: Array of PGPROC structures (one per max_connections)│  │  │
│  │  │   • PGPROC: Backend state, transaction info, lock info          │  │  │
│  │  │   • size = max_connections * sizeof(PGPROC)                     │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ XLOG Control: WAL write position, flush position, etc.         │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Control File Data: Database system state                        │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  BUFFER POOL (largest region)                                         │  │
│  │  size = shared_buffers (typically 25-40% of RAM)                      │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Buffer Descriptors: Metadata for each buffer                    │  │  │
│  │  │   Array[NBuffers] of { tag, flags, usage_count, refcount, ... } │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Buffer Blocks: Actual 8KB page data                             │  │  │
│  │  │   Array[NBuffers] of char[BLCKSZ]                               │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Buffer Hash Table: Maps (relfilenode, forknum, blknum) → bufid  │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  WAL BUFFERS                                                          │  │
│  │  size = wal_buffers (default: 1/32 of shared_buffers)                │  │
│  │  Ring buffer of WAL pages pending write to disk                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  CLOG (Commit Log / pg_xact)                                          │  │
│  │  2 bits per transaction: committed, aborted, in-progress, sub-commit  │  │
│  │  Cached pages for fast commit status lookup                           │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  LOCK TABLES                                                          │  │
│  │  • Main Lock Table: Heavyweight locks (max_locks_per_transaction)    │  │
│  │  • Predicate Lock Table: SSI serialization                           │  │
│  │  • Fast-path locks: Per-backend array for common cases               │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  OTHER SUBSYSTEMS                                                     │  │
│  │  • Subtransaction cache                                               │  │
│  │  • Multixact state                                                    │  │
│  │  • Two-phase commit state                                             │  │
│  │  • Async notification state                                           │  │
│  │  • Background worker state                                            │  │
│  │  • Replication slots                                                  │  │
│  │  • Logical replication state                                          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

-- View shared memory settings
SHOW shared_buffers;
SHOW wal_buffers;
SHOW max_connections;

-- Estimate shared memory size
-- Approximately: shared_buffers + (max_connections * ~10MB) + overhead

-- On Linux, check actual shared memory usage
-- $ ipcs -m | grep postgres

-- View proc array (backend slots)
SELECT count(*) as backends_used FROM pg_stat_activity;
SELECT setting::int as max_backends FROM pg_settings WHERE name = 'max_connections';

┌─────────────────────────────────────────────────────────────────────────────┐
│                      POSTGRESQL ARCHITECTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Client Applications                                                        │
│   ┌─────────┐ ┌─────────┐ ┌─────────┐                                       │
│   │  App 1  │ │  App 2  │ │  App 3  │                                       │
│   └────┬────┘ └────┬────┘ └────┬────┘                                       │
│        │           │           │                                             │
│        └───────────┼───────────┘                                             │
│                    │                                                         │
│                    ▼                                                         │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        POSTMASTER                                    │   │
│   │                   (Main PostgreSQL Process)                         │   │
│   │        Listens for connections, spawns backend processes            │   │
│   └────────────────────────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│        ┌───────────────────────────┼───────────────────────────┐            │
│        │                           │                           │            │
│        ▼                           ▼                           ▼            │
│   ┌─────────────┐           ┌─────────────┐           ┌─────────────┐       │
│   │  Backend 1  │           │  Backend 2  │           │  Backend 3  │       │
│   │  (Session)  │           │  (Session)  │           │  (Session)  │       │
│   └──────┬──────┘           └──────┬──────┘           └──────┬──────┘       │
│          │                         │                         │              │
│          └─────────────────────────┼─────────────────────────┘              │
│                                    │                                         │
│                                    ▼                                         │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        SHARED MEMORY                                 │   │
│   │  ┌───────────────────────────────────────────────────────────────┐  │   │
│   │  │                    Shared Buffer Pool                         │  │   │
│   │  │   ┌────────┬────────┬────────┬────────┬────────┬────────┐    │  │   │
│   │  │   │ Buffer │ Buffer │ Buffer │ Buffer │ Buffer │  ...   │    │  │   │
│   │  │   │   1    │   2    │   3    │   4    │   5    │        │    │  │   │
│   │  │   └────────┴────────┴────────┴────────┴────────┴────────┘    │  │   │
│   │  └───────────────────────────────────────────────────────────────┘  │   │
│   │  ┌────────────────┐ ┌────────────────┐ ┌────────────────────────┐  │   │
│   │  │   WAL Buffers  │ │   Clog Buffer  │ │  Lock Tables/Semaphores│  │   │
│   │  └────────────────┘ └────────────────┘ └────────────────────────┘  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                         │
│   ┌────────────────────────────────┼────────────────────────────────────┐   │
│   │                    BACKGROUND PROCESSES                              │   │
│   │                                                                      │   │
│   │  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────────┐    │   │
│   │  │   WAL      │ │ Background │ │ Checkpointer│ │   Autovacuum   │    │   │
│   │  │   Writer   │ │   Writer   │ │            │ │   Launcher     │    │   │
│   │  └────────────┘ └────────────┘ └────────────┘ └────────────────┘    │   │
│   │  ┌────────────┐ ┌────────────┐ ┌────────────┐                       │   │
│   │  │   Stats    │ │  Archiver  │ │  Logical   │                       │   │
│   │  │ Collector  │ │            │ │ Replication│                       │   │
│   │  └────────────┘ └────────────┘ └────────────┘                       │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                         │
│                                    ▼                                         │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           DISK STORAGE                               │   │
│   │  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────────┐ │   │
│   │  │  Data Files  │  │  WAL Files   │  │ Configuration & Logs       │ │   │
│   │  │  (Tables,    │  │  (pg_wal/)   │  │ (postgresql.conf, pg_log/) │ │   │
│   │  │   Indexes)   │  │              │  │                            │ │   │
│   │  │  (base/)     │  │              │  │                            │ │   │
│   │  └──────────────┘  └──────────────┘  └────────────────────────────┘ │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

6.2 Query Processing Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                        QUERY PROCESSING STAGES                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. PARSER                                                                   │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Input: "SELECT name FROM users WHERE id = 1"                      │   │
│     │                                                                   │   │
│     │ Lexer → Tokens: [SELECT] [name] [FROM] [users] [WHERE] [id] [=]  │   │
│     │ Parser → Parse Tree (AST)                                        │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  2. ANALYZER                                                                 │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Semantic Analysis:                                                │   │
│     │ • Does table 'users' exist?                                       │   │
│     │ • Does column 'name' exist in 'users'?                            │   │
│     │ • What are the data types?                                        │   │
│     │ • Does user have SELECT permission?                               │   │
│     │                                                                   │   │
│     │ Output: Query Tree (with resolved names and types)                │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  3. REWRITER                                                                 │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Apply transformation rules:                                       │   │
│     │ • Expand views to their underlying queries                        │   │
│     │ • Apply row-level security policies                               │   │
│     │ • Handle rules (CREATE RULE)                                      │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  4. PLANNER / OPTIMIZER                                                      │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Generate possible execution plans:                                │   │
│     │ • Plan A: Seq Scan + Filter                                       │   │
│     │ • Plan B: Index Scan on idx_users_id                              │   │
│     │ • Plan C: Bitmap Index Scan + Bitmap Heap Scan                    │   │
│     │                                                                   │   │
│     │ Cost estimation using statistics (pg_statistic)                   │   │
│     │ Choose lowest-cost plan                                           │   │
│     │                                                                   │   │
│     │ Output: Execution Plan (tree of plan nodes)                       │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  5. EXECUTOR                                                                 │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Execute the plan:                                                 │   │
│     │ • Access data pages via buffer manager                            │   │
│     │ • Apply filters, joins, aggregations                             │   │
│     │ • Handle transactions (visibility checks)                        │   │
│     │ • Return results to client                                       │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

6.3 Storage Architecture

Page Structure

Everything in PostgreSQL is stored in 8KB pages (blocks).

┌─────────────────────────────────────────────────────────────────────────────┐
│                         PAGE STRUCTURE (8KB)                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Offset 0                                                                    │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                    PAGE HEADER (24 bytes)                             │  │
│  │  • pd_lsn: Last WAL position that modified this page                  │  │
│  │  • pd_checksum: Data checksum                                         │  │
│  │  • pd_lower: Offset to start of free space                           │  │
│  │  • pd_upper: Offset to end of free space                             │  │
│  │  • pd_special: Offset to special space (for indexes)                 │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                    LINE POINTERS (4 bytes each)                       │  │
│  │  Array of (offset, length, flags) pointing to tuples                 │  │
│  │  [LP1] [LP2] [LP3] [LP4] [LP5] ...                                   │  │
│  │    │     │     │                                                      │  │
│  └────┼─────┼─────┼──────────────────────────────────────────────────────┘  │
│       │     │     │                                                         │
│  ┌────┼─────┼─────┼──────────────────────────────────────────────────────┐  │
│  │    │     │     │       FREE SPACE                                     │  │
│  │    │     │     │                                                      │  │
│  │    │     │     │   (pd_lower)          (pd_upper)                     │  │
│  │    │     │     │       ↓                   ↓                          │  │
│  │    │     │     │    ████████████████████████                         │  │
│  └────┼─────┼─────┼──────────────────────────────────────────────────────┘  │
│       │     │     │                                                         │
│  ┌────┼─────┼─────┼──────────────────────────────────────────────────────┐  │
│  │    │     │     └───────────────────────────────────┐                  │  │
│  │    │     └─────────────────────────┐               │                  │  │
│  │    └────────────────┐              │               │                  │  │
│  │                     ▼              ▼               ▼                  │  │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                  │  │
│  │  │   Tuple 3    │ │   Tuple 2    │ │   Tuple 1    │  ← Grows DOWN    │  │
│  │  │ (Row Data)   │ │ (Row Data)   │ │ (Row Data)   │                  │  │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  Offset 8192                                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Tuple (Row) Structure

┌─────────────────────────────────────────────────────────────────────────────┐
│                          TUPLE HEADER (23+ bytes)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  t_xmin (4 bytes)     │ Transaction ID that inserted this tuple      │   │
│  │  t_xmax (4 bytes)     │ Transaction ID that deleted/updated          │   │
│  │  t_cid (4 bytes)      │ Command ID within transaction                │   │
│  │  t_ctid (6 bytes)     │ Current tuple ID (page, offset)              │   │
│  │  t_infomask (2 bytes) │ Status flags (committed, aborted, etc.)      │   │
│  │  t_infomask2 (2 bytes)│ More flags + number of attributes            │   │
│  │  t_hoff (1 byte)      │ Offset to user data                          │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  NULL BITMAP (variable)                                              │   │
│  │  One bit per column, indicates if value is NULL                      │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  USER DATA                                                           │   │
│  │  [Column 1 Value] [Column 2 Value] [Column 3 Value] ...              │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

View Hidden Columns

-- See system columns for any row
SELECT 
    ctid,           -- Physical location (page, offset)
    xmin,           -- Transaction that created
    xmax,           -- Transaction that deleted (0 if live)
    cmin,           -- Command ID for insert
    cmax,           -- Command ID for delete
    tableoid,       -- OID of the table
    *
FROM users 
WHERE id = 1;

-- Result:
-- ctid   | (0,1)        -- Page 0, tuple 1
-- xmin   | 12345        -- Created by txn 12345
-- xmax   | 0            -- Not deleted
-- cmin   | 0            -- First command in its txn
-- cmax   | 0

6.4 MVCC (Multi-Version Concurrency Control)

PostgreSQL’s MVCC allows readers and writers to work without blocking each other.

How MVCC Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                            MVCC IN ACTION                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Initial State: Row with id=1, name='Alice'                                  │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ xmin=100 │ xmax=0 │ id=1 │ name='Alice' │                          │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  Transaction 200: UPDATE users SET name='Bob' WHERE id=1                    │
│                                                                              │
│  After UPDATE: TWO tuple versions exist!                                     │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ xmin=100 │ xmax=200 │ id=1 │ name='Alice' │  ← Old version (dead)  │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ xmin=200 │ xmax=0   │ id=1 │ name='Bob'   │  ← New version (live)  │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  Visibility Rules:                                                          │
│  ─────────────────                                                          │
│  Transaction 150 (started before 200):                                      │
│    → Sees 'Alice' (xmin=100 committed, xmax=200 not visible)               │
│                                                                              │
│  Transaction 250 (started after 200 committed):                             │
│    → Sees 'Bob' (xmin=200 committed, xmax=0 means live)                    │
│                                                                              │
│  Transaction 200 (the updater itself):                                      │
│    → Sees 'Bob' (its own changes are visible)                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Tuple Visibility Check

-- Simplified visibility logic (actual code is more complex)
-- A tuple is visible if:
-- 1. xmin is committed AND
-- 2. xmin started before our snapshot AND
-- 3. (xmax is 0 OR xmax is aborted OR xmax started after our snapshot)

-- Check transaction status
SELECT * FROM pg_stat_activity WHERE pid = pg_backend_pid();

-- View commit log status
SELECT 
    pg_xact_status(xmin) as xmin_status,
    pg_xact_status(xmax) as xmax_status
FROM users WHERE id = 1;
-- Returns: 'committed', 'aborted', 'in progress', or NULL

Snapshot Isolation

-- REPEATABLE READ takes a snapshot at first query
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

SELECT * FROM accounts WHERE id = 1;  -- Snapshot taken HERE
-- Returns balance = 1000

-- Another transaction updates and commits...

SELECT * FROM accounts WHERE id = 1;
-- Still returns balance = 1000! (same snapshot)

COMMIT;

-- New transaction sees the update
SELECT * FROM accounts WHERE id = 1;
-- Returns balance = 1200

6.5 VACUUM and Dead Tuples

Why VACUUM is Necessary

┌─────────────────────────────────────────────────────────────────────────────┐
│                        THE DEAD TUPLE PROBLEM                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  After many updates, pages fill with dead tuples:                           │
│                                                                              │
│  Page:                                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ [DEAD] [LIVE] [DEAD] [DEAD] [LIVE] [DEAD] [LIVE] [DEAD] [DEAD]      │   │
│  │                                                                      │   │
│  │  70% dead tuples = wasted space + slow scans!                       │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Problems:                                                                   │
│  • Table bloat: 10GB of data uses 100GB on disk                            │
│  • Slow sequential scans: must read dead tuples too                        │
│  • Index bloat: indexes point to dead tuples                               │
│  • Transaction ID wraparound risk                                          │
│                                                                              │
│  After VACUUM:                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ [FREE] [LIVE] [FREE] [FREE] [LIVE] [FREE] [LIVE] [FREE] [FREE]      │   │
│  │                                                                      │   │
│  │  Space marked as free for reuse (but not returned to OS)            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  After VACUUM FULL (rewrites table):                                        │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ [LIVE] [LIVE] [LIVE]                                                │     │
│  │                                                                     │     │
│  │  Compacted! Space returned to OS. But table is locked during.      │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

VACUUM Operations

-- Regular VACUUM: Mark dead tuples as reusable
VACUUM users;

-- VACUUM with analysis: Also update statistics
VACUUM ANALYZE users;

-- VACUUM VERBOSE: Show what it's doing
VACUUM VERBOSE users;

-- VACUUM FULL: Rewrite table (locks table exclusively!)
VACUUM FULL users;  -- Use sparingly!

-- Check vacuum stats
SELECT 
    relname,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

Autovacuum Configuration

-- View current settings
SELECT name, setting, unit, short_desc
FROM pg_settings
WHERE name LIKE '%autovacuum%';

-- Per-table settings for high-traffic tables
ALTER TABLE orders SET (
    autovacuum_vacuum_threshold = 1000,        -- Min dead tuples
    autovacuum_vacuum_scale_factor = 0.01,     -- + 1% of table
    autovacuum_analyze_threshold = 500,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 2           -- Less aggressive
);

-- Check if autovacuum is running
SELECT pid, datname, usename, state, query, 
       backend_start, xact_start
FROM pg_stat_activity 
WHERE query LIKE '%autovacuum%';

6.6 Write-Ahead Logging (WAL)

How WAL Works

┌─────────────────────────────────────────────────────────────────────────────┐
│                        WRITE-AHEAD LOGGING                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  The Golden Rule: "Write the log before the data"                           │
│                                                                              │
│  Transaction Flow:                                                           │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  1. Application: INSERT INTO orders VALUES (...)                    │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  2. PostgreSQL: Create WAL record in WAL Buffer                     │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  3. Modify page in Shared Buffer (in memory)                        │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  4. Application: COMMIT                                              │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  5. Flush WAL Buffer to disk (fsync)  ← Durability guaranteed here! │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  6. Return success to application                                   │   │
│  │                      │                                               │   │
│  │                      ▼ (Later, asynchronously)                       │   │
│  │  7. Background Writer: Write dirty pages to data files             │   │
│  │                                                                      │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Recovery After Crash:                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  1. Start PostgreSQL                                                 │   │
│  │  2. Read WAL from last checkpoint                                    │   │
│  │  3. Replay all WAL records                                           │   │
│  │  4. Database is consistent! No data loss.                            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

WAL Configuration

-- View WAL settings
SHOW wal_level;              -- minimal, replica, logical
SHOW max_wal_size;           -- Maximum WAL size before checkpoint
SHOW min_wal_size;           -- Minimum WAL to keep
SHOW wal_buffers;            -- WAL buffer size

-- View current WAL position
SELECT pg_current_wal_lsn();

-- View WAL file info
SELECT * FROM pg_stat_wal;

-- Force a checkpoint (flushes all data to disk)
CHECKPOINT;

-- View checkpoint stats
SELECT * FROM pg_stat_bgwriter;

6.7 Buffer Management

Shared Buffer Pool

┌─────────────────────────────────────────────────────────────────────────────┐
│                        BUFFER POOL MANAGEMENT                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Shared Buffer Pool (shared_buffers):                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   │   │
│  │  │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │   │   │
│  │  │  1  │ │  2  │ │  3  │ │  4  │ │  5  │ │  6  │ │  7  │ │  8  │   │   │
│  │  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘   │   │
│  │     │      │      │      │      │      │      │      │              │   │
│  │     │      │      │      │      │      │      │      │              │   │
│  │  ┌──┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──┐          │   │
│  │  │           Buffer Descriptor Array                    │          │   │
│  │  │  • Page ID (table OID, block number)                │          │   │
│  │  │  • Usage count (for clock sweep eviction)           │          │   │
│  │  │  • Dirty flag                                       │          │   │
│  │  │  • Pin count (number of backends using it)          │          │   │
│  │  └──────────────────────────────────────────────────────┘          │   │
│  │                                                                      │   │
│  │  ┌──────────────────────────────────────────────────────┐          │   │
│  │  │              Hash Table                              │          │   │
│  │  │  Maps (table_oid, block_num) → buffer_id             │          │   │
│  │  └──────────────────────────────────────────────────────┘          │   │
│  │                                                                      │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  When page is needed:                                                       │
│  1. Check hash table for (table, block)                                    │
│  2. If found: buffer hit! Increment usage count.                           │
│  3. If not found: buffer miss                                              │
│     a. Find victim buffer (clock sweep: usage count = 0)                   │
│     b. If victim is dirty, write to disk first                             │
│     c. Read requested page from disk                                       │
│     d. Insert into hash table                                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Monitoring Buffer Usage

-- Buffer cache hit ratio
SELECT 
    sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) as cache_hit_ratio
FROM pg_statio_user_tables;
-- Should be > 0.99 for well-tuned system

-- Per-table buffer stats
SELECT 
    relname,
    heap_blks_read,    -- Blocks read from disk
    heap_blks_hit,     -- Blocks found in cache
    round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 2) as hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;

-- What's in the buffer cache? (requires pg_buffercache extension)
CREATE EXTENSION pg_buffercache;

SELECT 
    c.relname,
    count(*) AS buffers,
    pg_size_pretty(count(*) * 8192) AS size_in_cache
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY count(*) DESC
LIMIT 20;

6.8 Practice Exercises

Exercise 1: Investigate Tuple Visibility

-- Setup: Create a test scenario
CREATE TABLE visibility_test (id SERIAL PRIMARY KEY, value TEXT);

-- Session 1
BEGIN;
INSERT INTO visibility_test (value) VALUES ('first');
-- Note the xmin

-- Session 2 (before Session 1 commits)
SELECT ctid, xmin, xmax, * FROM visibility_test;
-- What do you see?

-- Session 1
COMMIT;

-- Session 2
SELECT ctid, xmin, xmax, * FROM visibility_test;
-- What changed?

Exercise 2: Analyze Table Bloat

-- Create and bloat a table
CREATE TABLE bloat_test AS 
SELECT generate_series(1, 100000) AS id, 'data' AS value;

-- Update every row multiple times
UPDATE bloat_test SET value = 'updated1';
UPDATE bloat_test SET value = 'updated2';
UPDATE bloat_test SET value = 'updated3';

-- Check bloat
SELECT 
    pg_size_pretty(pg_table_size('bloat_test')) AS table_size,
    n_live_tup,
    n_dead_tup,
    last_vacuum
FROM pg_stat_user_tables 
WHERE relname = 'bloat_test';

-- Run vacuum and compare
VACUUM bloat_test;
-- Check sizes again

Module Mastery Checklist

Complete this checklist to confirm you’ve mastered PostgreSQL Internals:

Process Architecture

Explain PostgreSQL’s multi-process model and its advantages
Identify the role of the postmaster process
Describe how backend processes are created and managed
Understand the function of auxiliary processes (checkpointer, bgwriter, WAL writer, autovacuum)
Interpret signal handling between postmaster and backends
Monitor process activity using pg_stat_activity

Shared Memory Structures

Explain the layout and purpose of shared memory regions
Calculate shared memory requirements based on configuration
Understand the buffer pool structure and management
Describe how ProcGlobal tracks backend state
Analyze WAL buffer usage and configuration
Query shared memory statistics using system views

Storage Architecture

Explain PostgreSQL’s heap storage model
Describe the structure of 8KB pages
Understand tuple layout and metadata (xmin, xmax, ctid)
Identify the role of Free Space Map (FSM) and Visibility Map (VM)
Explain how TOAST handles large values
Navigate the data directory and interpret file naming

MVCC and Transaction Management

Explain how MVCC enables non-blocking reads
Understand transaction IDs (XIDs) and snapshot isolation
Describe tuple visibility rules
Analyze xmin/xmax values to determine tuple status
Diagnose transaction ID wraparound scenarios
Optimize for MVCC overhead in high-concurrency workloads

VACUUM and Maintenance

Explain the difference between VACUUM and VACUUM FULL
Understand how dead tuples accumulate
Configure autovacuum thresholds appropriately
Monitor bloat using pg_stat_user_tables
Diagnose and remediate table bloat
Apply VACUUM best practices for production systems

WAL and Durability

Explain Write-Ahead Logging principles
Understand WAL record structure and LSNs
Describe checkpoint mechanisms and configuration
Monitor WAL generation rate
Configure WAL settings for performance and durability
Implement WAL archiving for PITR

Buffer Management

Understand clock-sweep buffer eviction algorithm
Monitor buffer cache hit ratios
Identify which tables/indexes are cached
Analyze buffer pool efficiency
Configure shared_buffers appropriately
Diagnose buffer-related performance issues

System Catalogs and Monitoring

Query pg_stat_* views for performance metrics
Use pg_stat_activity to monitor connections
Analyze pg_stat_user_tables for table statistics
Leverage pg_stat_statements for query analysis
Inspect pg_locks for lock contention
Navigate system catalogs (pg_class, pg_attribute, pg_index)

Practical Application

Complete Hands-On Lab 1: Process and memory inspection
Complete Hands-On Lab 2: MVCC and VACUUM observation
Diagnose real-world performance issues using internals knowledge
Tune PostgreSQL configuration based on workload characteristics
Implement monitoring strategies for production databases

Next Module

Module 7: Replication & High Availability

Build systems that survive failures

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Module 6: PostgreSQL Internals

​Learning Objectives

​Module-Specific Learning Outcomes

​Section 6.1 - Architecture Overview

​Section 6.2 - Shared Memory and Processes

​Section 6.3 - Storage Architecture

​Section 6.4 - MVCC Implementation

​Section 6.5 - VACUUM and Tuple Management

​Section 6.6 - WAL and Checkpoints

​Section 6.7 - System Catalogs and Metadata

​Hands-On Lab 1: Inspecting PostgreSQL Processes and Shared Memory

​Hands-On Lab 2: Observing MVCC Behavior and VACUUM Effects

Lock Manager

Buffer Manager

Memory Management

Catalog System

Extension Development

​6.1 Architecture Overview

​Process Model

​Signal Handling

​Shared Memory Layout

​6.2 Query Processing Pipeline

​6.3 Storage Architecture

​Page Structure

​Tuple (Row) Structure

​View Hidden Columns

​6.4 MVCC (Multi-Version Concurrency Control)

​How MVCC Works

​Tuple Visibility Check

​Snapshot Isolation

​6.5 VACUUM and Dead Tuples

​Why VACUUM is Necessary

​VACUUM Operations

​Autovacuum Configuration

​6.6 Write-Ahead Logging (WAL)

​How WAL Works

​WAL Configuration

​6.7 Buffer Management

Module 6: PostgreSQL Internals

Learning Objectives

Module-Specific Learning Outcomes

Section 6.1 - Architecture Overview

Section 6.2 - Shared Memory and Processes

Section 6.3 - Storage Architecture

Section 6.4 - MVCC Implementation

Section 6.5 - VACUUM and Tuple Management

Section 6.6 - WAL and Checkpoints

Section 6.7 - System Catalogs and Metadata

Hands-On Lab 1: Inspecting PostgreSQL Processes and Shared Memory

Hands-On Lab 2: Observing MVCC Behavior and VACUUM Effects

6.1 Architecture Overview

Process Model

Signal Handling

Shared Memory Layout

6.2 Query Processing Pipeline

6.3 Storage Architecture

Page Structure

Tuple (Row) Structure

View Hidden Columns

6.4 MVCC (Multi-Version Concurrency Control)

How MVCC Works

Tuple Visibility Check

Snapshot Isolation

6.5 VACUUM and Dead Tuples

Why VACUUM is Necessary

VACUUM Operations

Autovacuum Configuration

6.6 Write-Ahead Logging (WAL)

How WAL Works

WAL Configuration

6.7 Buffer Management

Shared Buffer Pool

Monitoring Buffer Usage

6.8 Practice Exercises

Exercise 1: Investigate Tuple Visibility

Exercise 2: Analyze Table Bloat

Module Mastery Checklist

Process Architecture

Shared Memory Structures

Storage Architecture

MVCC and Transaction Management

VACUUM and Maintenance

WAL and Durability

Buffer Management

System Catalogs and Monitoring

Practical Application

Next Module