> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# PostgreSQL Internals

> Understand how PostgreSQL works under the hood

<Frame>
  <img src="https://mintcdn.com/devweeekends/0kwJwOL2KCwg2YYu/images/courses/database-engineering/internals-concept.svg?fit=max&auto=format&n=0kwJwOL2KCwg2YYu&q=85&s=f9f7da60d86f7ed056fb1aac2776c25d" alt="PostgreSQL Internals Concept" width="1080" height="1080" data-path="images/courses/database-engineering/internals-concept.svg" />
</Frame>

# Module 6: PostgreSQL Internals

Understanding how PostgreSQL works internally transforms you from someone who uses databases to someone who truly understands them. This knowledge helps you make better design decisions and debug complex issues.

<Info>
  **Estimated Time**: 12-15 hours\
  **Depth Level**: Advanced\
  **Key Skill**: Understanding MVCC, storage, and memory management
</Info>

## Learning Objectives

By the end of this module, you will be able to:

* **Understand PostgreSQL's multi-process architecture**: Explain the roles of postmaster, backend processes, and auxiliary processes.
* **Master MVCC principles**: Describe how PostgreSQL implements Multi-Version Concurrency Control for isolation and performance.
* **Navigate shared memory structures**: Identify the purpose of shared buffers, WAL buffers, lock tables, and process arrays.
* **Analyze storage mechanisms**: Understand how PostgreSQL stores data in 8KB pages, heaps, and indexes on disk.
* **Debug using system catalogs**: Query pg\_stat\_\* and pg\_\* views to diagnose performance issues and observe internal state.
* **Optimize VACUUM strategies**: Configure and tune autovacuum to manage dead tuples and prevent bloat.
* **Interpret transaction internals**: Understand how XIDs, snapshots, and commit logs work together to enforce isolation.

### Module-Specific Learning Outcomes

#### Section 6.1 - Architecture Overview

* Explain PostgreSQL's multi-process model vs. multi-threading approaches
* Identify the responsibilities of the postmaster process
* Describe the lifecycle of a backend process from connection to termination
* Understand the purpose and function of auxiliary processes (checkpointer, bgwriter, WAL writer, autovacuum)
* Map signal handling mechanisms between postmaster and backend processes
* Navigate the shared memory layout and its major regions

#### Section 6.2 - Shared Memory and Processes

* Calculate shared memory requirements based on configuration parameters
* Understand the structure and purpose of the buffer pool
* Explain how ProcGlobal tracks backend states
* Describe the role of WAL buffers in transaction durability
* Analyze memory usage using pg\_stat\_\* views
* Inspect process behavior using system catalogs

#### Section 6.3 - Storage Architecture

* Understand PostgreSQL's heap storage model and page structure
* Explain how data is organized in 8KB pages with tuples and metadata
* Describe the role of the Free Space Map (FSM) and Visibility Map (VM)
* Identify how TOAST handles large values that exceed page size
* Navigate the data directory structure and file naming conventions
* Analyze table and index files using system tools

#### Section 6.4 - MVCC Implementation

* Explain how MVCC enables non-blocking reads
* Understand transaction IDs (XIDs) and their role in visibility determination
* Describe how snapshots capture database state for isolation
* Analyze tuple visibility rules based on xmin, xmax, and commit status
* Diagnose transaction ID wraparound and its prevention strategies
* Optimize for MVCC overhead in high-write workloads

#### Section 6.5 - VACUUM and Tuple Management

* Explain the difference between VACUUM and VACUUM FULL
* Understand how dead tuples accumulate and impact performance
* Configure autovacuum thresholds and aggressive strategies
* Monitor bloat using pg\_stat\_user\_tables and pgstattuple
* Diagnose and fix excessive table bloat
* Apply best practices for VACUUM scheduling in production

#### Section 6.6 - WAL and Checkpoints

* Understand Write-Ahead Logging principles and durability guarantees
* Describe WAL record structure and LSN (Log Sequence Numbers)
* Explain checkpoint mechanisms and their performance impact
* Configure checkpoint parameters to balance durability and performance
* Analyze WAL generation rate and optimize for write workloads
* Implement Point-in-Time Recovery (PITR) strategies

#### Section 6.7 - System Catalogs and Metadata

* Query pg\_class, pg\_attribute, pg\_index for schema metadata
* Use pg\_stat\_user\_tables and pg\_statio\_user\_tables for performance analysis
* Inspect pg\_stat\_activity to monitor active queries and connections
* Leverage pg\_stat\_statements for query performance tracking
* Analyze lock contention using pg\_locks
* Understand catalog cache invalidation and its implications

***

## Hands-On Lab 1: Inspecting PostgreSQL Processes and Shared Memory

> **Goal**: Master the tools for observing PostgreSQL's internal process architecture and memory structures.

> 1. **Examine the Process Hierarchy:**
>    ```sql theme={null}
>    -- View all PostgreSQL processes
>    SELECT 
>        pid,
>        usename AS username,
>        application_name,
>        client_addr,
>        backend_start,
>        state,
>        backend_type,
>        query
>    FROM pg_stat_activity
>    ORDER BY backend_type, backend_start;
>
>    -- Identify process types:
>    -- - client backend: Your query connections
>    -- - autovacuum launcher/worker: Cleanup processes
>    -- - logical replication launcher/worker: Replication processes
>    -- - background writer: Writes dirty buffers
>    -- - checkpointer: Performs checkpoints
>    -- - walwriter: Writes WAL records
>    ```
>
> 2. **Investigate Shared Memory Configuration:**
>    ```sql theme={null}
>    -- Check shared memory settings
>    SELECT 
>        name,
>        setting,
>        unit,
>        context,
>        short_desc
>    FROM pg_settings
>    WHERE name IN (
>        'shared_buffers',
>        'wal_buffers',
>        'max_connections',
>        'max_prepared_transactions',
>        'max_locks_per_transaction',
>        'shared_memory_size'
>    )
>    ORDER BY name;
>
>    -- Calculate total shared memory usage
>    SELECT pg_size_pretty(
>        current_setting('shared_buffers')::bigint * 
>        (SELECT setting FROM pg_settings WHERE name = 'block_size')::bigint
>    ) AS shared_buffers_size;
>    ```
>
> 3. **Monitor Buffer Pool Activity:**
>    ```sql theme={null}
>    -- Install pg_buffercache extension (if not already installed)
>    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
>
>    -- See what tables are cached in shared buffers
>    SELECT 
>        c.relname AS table_name,
>        COUNT(*) AS buffer_count,
>        pg_size_pretty(COUNT(*) * 8192) AS size_in_cache,
>        ROUND(100.0 * COUNT(*) / (SELECT setting FROM pg_settings WHERE name='shared_buffers')::int, 2) AS percent_of_cache
>    FROM pg_buffercache b
>    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
>    WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
>      AND c.relkind IN ('r', 'i')  -- tables and indexes
>    GROUP BY c.relname
>    ORDER BY buffer_count DESC
>    LIMIT 20;
>
>    -- Check buffer cache hit ratio
>    SELECT 
>        SUM(heap_blks_hit) AS heap_read_hits,
>        SUM(heap_blks_read) AS heap_disk_reads,
>        ROUND(100.0 * SUM(heap_blks_hit) / NULLIF(SUM(heap_blks_hit) + SUM(heap_blks_read), 0), 2) AS cache_hit_ratio
>    FROM pg_statio_user_tables;
>    -- Target: > 99% for well-tuned systems
>    ```
>
> 4. **Analyze WAL Activity:**
>    ```sql theme={null}
>    -- Check current WAL position
>    SELECT 
>        pg_current_wal_lsn() AS current_wal_position,
>        pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS total_wal_generated;
>
>    -- View WAL writer statistics
>    SELECT 
>        wal_records,
>        wal_fpi,  -- Full Page Images
>        wal_bytes,
>        pg_size_pretty(wal_bytes) AS wal_bytes_pretty,
>        wal_buffers_full,
>        wal_write,
>        wal_sync,
>        stats_reset
>    FROM pg_stat_wal;
>    ```
>
> 5. **Monitor Background Workers:**
>    ```sql theme={null}
>    -- Check background writer stats
>    SELECT 
>        checkpoints_timed,      -- Scheduled checkpoints
>        checkpoints_req,         -- Requested checkpoints (may indicate tuning needed)
>        buffers_checkpoint,      -- Buffers written during checkpoint
>        buffers_clean,          -- Buffers cleaned by bgwriter
>        buffers_backend,        -- Buffers written by backends (should be low)
>        buffers_alloc,          -- Buffers allocated
>        stats_reset
>    FROM pg_stat_bgwriter;
>
>    -- If buffers_backend is high, consider increasing bgwriter activity
>    ```
>
> 6. **Challenge Exercises:**
>    * Create a table with 1 million rows and observe how it affects the buffer cache
>    * Run a full table scan and watch the buffer pool fill up
>    * Compare cache hit ratios before and after the scan
>    * Identify which auxiliary process is consuming the most CPU using pg\_stat\_activity
>    * Calculate the percentage of shared memory used by each major component
>
> When you can confidently navigate PostgreSQL's internal monitoring views and interpret the results, you've mastered process and memory inspection.

***

## Hands-On Lab 2: Observing MVCC Behavior and VACUUM Effects

> **Goal**: Understand how MVCC creates tuple versions and how VACUUM reclaims space.

> 1. **Set Up a Test Environment:**
>    ```sql theme={null}
>    -- Create a test table
>    CREATE TABLE mvcc_test (
>        id SERIAL PRIMARY KEY,
>        data TEXT,
>        updated_at TIMESTAMP DEFAULT NOW()
>    );
>
>    -- Insert initial data
>    INSERT INTO mvcc_test (data)
>    SELECT 'Initial data ' || i
>    FROM generate_series(1, 10000) AS i;
>
>    -- Check initial table size
>    SELECT 
>        pg_size_pretty(pg_table_size('mvcc_test')) AS table_size,
>        pg_size_pretty(pg_total_relation_size('mvcc_test')) AS total_size_with_indexes;
>    ```
>
> 2. **Observe Tuple Versioning:**
>    ```sql theme={null}
>    -- Enable pageinspect extension
>    CREATE EXTENSION IF NOT EXISTS pageinspect;
>
>    -- View tuple details including xmin and xmax
>    SELECT 
>        ctid,           -- Physical location (page, offset)
>        xmin,           -- Transaction ID that created this tuple
>        xmax,           -- Transaction ID that deleted/updated (0 if current)
>        id,
>        data
>    FROM mvcc_test
>    WHERE id <= 5;
>
>    -- Note the xmin values - they're the transaction IDs that created these rows
>    ```
>
> 3. **Create Dead Tuples Through Updates:**
>    ```sql theme={null}
>    -- Update every row to create dead tuples
>    UPDATE mvcc_test SET data = 'Updated once', updated_at = NOW();
>
>    -- Check table stats
>    SELECT 
>        n_live_tup AS live_tuples,
>        n_dead_tup AS dead_tuples,
>        n_tup_upd AS total_updates,
>        n_tup_del AS total_deletes,
>        last_vacuum,
>        last_autovacuum
>    FROM pg_stat_user_tables
>    WHERE relname = 'mvcc_test';
>
>    -- Check table size (should have increased significantly)
>    SELECT pg_size_pretty(pg_table_size('mvcc_test')) AS table_size_after_update;
>    ```
>
> 4. **Observe Dead Tuples with Multiple Updates:**
>    ```sql theme={null}
>    -- Do several more updates to accumulate dead tuples
>    UPDATE mvcc_test SET data = 'Updated twice' WHERE id % 2 = 0;
>    UPDATE mvcc_test SET data = 'Updated thrice' WHERE id % 3 = 0;
>    UPDATE mvcc_test SET data = 'Updated fourth' WHERE id % 4 = 0;
>
>    -- Check bloat accumulation
>    SELECT 
>        relname AS table_name,
>        n_live_tup,
>        n_dead_tup,
>        ROUND(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_tuple_percent,
>        pg_size_pretty(pg_table_size(relname::regclass)) AS table_size,
>        last_vacuum,
>        last_autovacuum
>    FROM pg_stat_user_tables
>    WHERE relname = 'mvcc_test';
>
>    -- You should see significant dead tuple accumulation
>    ```
>
> 5. **Run VACUUM and Observe Effects:**
>    ```sql theme={null}
>    -- Run VACUUM (not FULL)
>    VACUUM VERBOSE mvcc_test;
>
>    -- Check results
>    SELECT 
>        n_live_tup,
>        n_dead_tup,
>        pg_size_pretty(pg_table_size('mvcc_test')) AS table_size
>    FROM pg_stat_user_tables
>    WHERE relname = 'mvcc_test';
>
>    -- Note: Table size may not shrink much because VACUUM doesn't return space to OS
>    -- It marks space as reusable within the table
>    ```
>
> 6. **Experiment with VACUUM FULL:**
>    ```sql theme={null}
>    -- VACUUM FULL rewrites entire table (locks table exclusively)
>    VACUUM FULL VERBOSE mvcc_test;
>
>    -- Check size reduction
>    SELECT 
>        pg_size_pretty(pg_table_size('mvcc_test')) AS table_size_after_vacuum_full;
>
>    -- Should see significant size reduction
>    ```
>
> 7. **Observe Transaction ID Visibility:**
>    ```sql theme={null}
>    -- In one session, start a transaction
>    BEGIN;
>    SELECT txid_current() AS my_transaction_id;
>    INSERT INTO mvcc_test (data) VALUES ('Uncommitted data');
>
>    -- Note the transaction ID
>    SELECT ctid, xmin, xmax, id, data 
>    FROM mvcc_test 
>    WHERE data = 'Uncommitted data';
>
>    -- In another session (different connection), try to see this row
>    SELECT * FROM mvcc_test WHERE data = 'Uncommitted data';
>    -- Result: Empty! This demonstrates MVCC isolation
>
>    -- Back in first session, commit
>    COMMIT;
>
>    -- Now second session can see it
>    SELECT * FROM mvcc_test WHERE data = 'Uncommitted data';
>    ```
>
> 8. **Monitor Autovacuum Activity:**
>    ```sql theme={null}
>    -- Check autovacuum configuration
>    SELECT 
>        name,
>        setting,
>        unit,
>        short_desc
>    FROM pg_settings
>    WHERE name LIKE 'autovacuum%'
>    ORDER BY name;
>
>    -- View recent autovacuum activity
>    SELECT 
>        schemaname,
>        relname,
>        last_autovacuum,
>        autovacuum_count,
>        n_live_tup,
>        n_dead_tup
>    FROM pg_stat_user_tables
>    WHERE autovacuum_count > 0
>    ORDER BY last_autovacuum DESC NULLS LAST;
>    ```
>
> 9. **Advanced Challenge:**
>    * Create a workload with 50,000 updates and measure bloat accumulation
>    * Configure autovacuum to run more aggressively and observe the difference
>    * Use pageinspect to examine the internal structure of a page before and after VACUUM
>    * Simulate a long-running transaction and observe how it prevents VACUUM from reclaiming space
>    * Calculate the "bloat ratio" for all tables in your database
>
> When you can explain how MVCC creates multiple tuple versions, how dead tuples accumulate, and how VACUUM reclaims space (with and without returning it to the OS), you've mastered PostgreSQL's core concurrency mechanism.

***

<CardGroup cols={2}>
  <Card title="Lock Manager" icon="lock" href="/courses/database-engineering/lock-manager">
    Spinlocks, LWLocks, heavyweight locks
  </Card>

  <Card title="Buffer Manager" icon="database" href="/courses/database-engineering/buffer-manager">
    Clock sweep, ring buffers, page eviction
  </Card>

  <Card title="Memory Management" icon="memory" href="/courses/database-engineering/memory-management">
    Memory contexts, palloc, shared memory
  </Card>

  <Card title="Catalog System" icon="folder-tree" href="/courses/database-engineering/catalog-system">
    System catalogs and caching
  </Card>

  <Card title="Extension Development" icon="puzzle-piece" href="/courses/database-engineering/extension-development">
    Building PostgreSQL extensions
  </Card>
</CardGroup>

***

## 6.1 Architecture Overview

### Process Model

PostgreSQL uses a **multi-process architecture** (not multi-threaded). Each client connection gets its own backend process.

**Real-world analogy**: Think of PostgreSQL as a hospital emergency room. The Postmaster is the triage nurse at the front desk who greets every patient (client connection) and assigns them a dedicated doctor (backend process). Each doctor works independently in their own examination room -- if one doctor makes a mistake, the other patients are unaffected. The support staff working behind the scenes (auxiliary processes) are the specialists: the janitor who cleans rooms for reuse (Autovacuum), the pharmacist who stocks the medicine cabinet from the warehouse (Background Writer flushing dirty pages), the record keeper who writes everything into the logbook before any procedure happens (WAL Writer), and the charge nurse who does periodic full inventory counts (Checkpointer). This process-per-connection design means PostgreSQL survives individual backend crashes without bringing down the whole hospital.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    PROCESS ARCHITECTURE                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                        POSTMASTER (pid=1)                              │  │
│  │  • Parent of all PostgreSQL processes                                 │  │
│  │  • Listens on TCP port 5432 (configurable)                           │  │
│  │  • fork()s backend processes for each connection                      │  │
│  │  • Monitors children, restarts if crash                               │  │
│  │  • Never touches database data directly                               │  │
│  └───────────────────────────────┬───────────────────────────────────────┘  │
│                                  │ fork()                                    │
│    ┌─────────────────────────────┼─────────────────────────────┐            │
│    │                             │                             │            │
│    ▼                             ▼                             ▼            │
│  ┌─────────────┐           ┌─────────────┐           ┌─────────────┐       │
│  │  Backend 1  │           │  Backend 2  │           │  Backend 3  │       │
│  │ pid=1001    │           │ pid=1002    │           │ pid=1003    │       │
│  │ user=alice  │           │ user=bob    │           │ user=alice  │       │
│  │ db=myapp    │           │ db=analytics│           │ db=myapp    │       │
│  └─────────────┘           └─────────────┘           └─────────────┘       │
│         │                         │                         │              │
│         └─────────────────────────┼─────────────────────────┘              │
│                                   │                                         │
│                                   ▼                                         │
│           ┌───────────────────────────────────────────────────┐            │
│           │              SHARED MEMORY                         │            │
│           │  • Shared Buffer Pool (shared_buffers)            │            │
│           │  • WAL Buffers (wal_buffers)                      │            │
│           │  • Lock Tables                                     │            │
│           │  • Proc Array (backend state)                     │            │
│           │  • CLOG (commit status)                           │            │
│           │  • Background worker state                        │            │
│           └───────────────────────────────────────────────────┘            │
│                                   │                                         │
│    ┌──────────────────────────────┴──────────────────────────────┐         │
│    │               AUXILIARY PROCESSES                            │         │
│    │                                                              │         │
│    │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │         │
│    │  │ Checkpointer│  │ BG Writer  │  │ WAL Writer │             │         │
│    │  │ Flushes all │  │ Writes    │  │ Flushes WAL│             │         │
│    │  │ dirty pages │  │ dirty     │  │ buffers to │             │         │
│    │  │ periodically│  │ pages     │  │ disk       │             │         │
│    │  └────────────┘  └────────────┘  └────────────┘             │         │
│    │                                                              │         │
│    │  ┌────────────┐  ┌────────────┐  ┌────────────┐             │         │
│    │  │ Autovacuum │  │ Stats      │  │ Archiver   │             │         │
│    │  │ Launcher   │  │ Collector  │  │ Archives   │             │         │
│    │  │ + Workers  │  │ Aggregates │  │ WAL files  │             │         │
│    │  └────────────┘  └────────────┘  └────────────┘             │         │
│    │                                                              │         │
│    │  ┌────────────┐  ┌────────────┐                             │         │
│    │  │ WAL Sender │  │ WAL        │  (Replication)              │         │
│    │  │ (primary)  │  │ Receiver   │                             │         │
│    │  └────────────┘  └────────────┘                             │         │
│    └──────────────────────────────────────────────────────────────┘         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Signal Handling

PostgreSQL uses Unix signals for inter-process communication:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        SIGNAL HANDLING                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Signal      │ Sender        │ Effect on Backend                            │
│  ────────────┼───────────────┼───────────────────────────────────────────── │
│  SIGTERM     │ Postmaster    │ Smart shutdown (finish current txn, exit)   │
│  SIGINT      │ Postmaster    │ Fast shutdown (rollback current txn, exit)  │
│  SIGQUIT     │ Postmaster    │ Immediate shutdown (crash recovery needed)  │
│  SIGHUP      │ pg_reload_conf│ Reload configuration files                  │
│  SIGUSR1     │ Postmaster    │ Latch wakeup (check for work)               │
│  SIGUSR2     │ Various       │ Recovery-related                            │
│  SIGALRM     │ Self          │ Statement timeout, lock timeout             │
│                                                                              │
│  Postmaster signal handling:                                                 │
│  • SIGTERM/SIGINT → Initiate shutdown sequence                              │
│  • SIGCHLD → Child process died, check if restart needed                    │
│  • SIGHUP → Signal all children to reload config                            │
│                                                                              │
│  Example: Graceful shutdown                                                  │
│  ┌────────────────────────────────────────────────────────────────────────┐ │
│  │ 1. Admin: pg_ctl stop -m smart                                         │ │
│  │ 2. Postmaster receives SIGTERM                                         │ │
│  │ 3. Postmaster sends SIGTERM to all backends                            │ │
│  │ 4. Backends finish current transaction                                  │ │
│  │ 5. Backends exit                                                        │ │
│  │ 6. Postmaster waits for all children                                   │ │
│  │ 7. Postmaster performs checkpoint                                       │ │
│  │ 8. Postmaster exits                                                     │ │
│  └────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

```sql theme={null}
-- View all PostgreSQL processes
-- Why backend_type matters: it distinguishes client backends from system processes.
-- If you see many 'client backend' entries in 'idle in transaction' state, those are
-- holding snapshots open and preventing VACUUM from reclaiming dead tuples.
SELECT pid, usename, application_name, client_addr, 
       state, query, backend_type
FROM pg_stat_activity;

-- Send signals to backends
SELECT pg_terminate_backend(pid);  -- SIGTERM: finish current work, then exit
SELECT pg_cancel_backend(pid);     -- SIGINT: cancel current query only (backend stays)

-- Reload configuration without restart
-- Why this is safe: only certain parameters (e.g., work_mem, shared_buffers) require
-- a full restart. Most GUCs (logging, autovacuum tuning) take effect via reload.
SELECT pg_reload_conf();           -- SIGHUP to postmaster

-- View postmaster start time -- useful for calculating uptime
SELECT pg_postmaster_start_time();

-- Performance pitfall: each backend process consumes ~10MB of private memory plus
-- any work_mem allocations. With max_connections=1000 and work_mem=256MB, a burst
-- of complex queries could demand 256GB of RAM. Use connection pooling (PgBouncer)
-- to keep active backends low, and set max_connections to the actual concurrency
-- your hardware can sustain (typically 2-4x CPU cores for OLTP).
```

### Shared Memory Layout

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    SHARED MEMORY LAYOUT                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  FIXED-SIZE STRUCTURES                                                │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ ShmemIndex: Hash table mapping names to shared memory regions   │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ ProcGlobal: Array of PGPROC structures (one per max_connections)│  │  │
│  │  │   • PGPROC: Backend state, transaction info, lock info          │  │  │
│  │  │   • size = max_connections * sizeof(PGPROC)                     │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ XLOG Control: WAL write position, flush position, etc.         │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Control File Data: Database system state                        │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  BUFFER POOL (largest region)                                         │  │
│  │  size = shared_buffers (typically 25-40% of RAM)                      │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Buffer Descriptors: Metadata for each buffer                    │  │  │
│  │  │   Array[NBuffers] of { tag, flags, usage_count, refcount, ... } │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Buffer Blocks: Actual 8KB page data                             │  │  │
│  │  │   Array[NBuffers] of char[BLCKSZ]                               │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  │  ┌─────────────────────────────────────────────────────────────────┐  │  │
│  │  │ Buffer Hash Table: Maps (relfilenode, forknum, blknum) → bufid  │  │  │
│  │  └─────────────────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  WAL BUFFERS                                                          │  │
│  │  size = wal_buffers (default: 1/32 of shared_buffers)                │  │
│  │  Ring buffer of WAL pages pending write to disk                       │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  CLOG (Commit Log / pg_xact)                                          │  │
│  │  2 bits per transaction: committed, aborted, in-progress, sub-commit  │  │
│  │  Cached pages for fast commit status lookup                           │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  LOCK TABLES                                                          │  │
│  │  • Main Lock Table: Heavyweight locks (max_locks_per_transaction)    │  │
│  │  • Predicate Lock Table: SSI serialization                           │  │
│  │  • Fast-path locks: Per-backend array for common cases               │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │  OTHER SUBSYSTEMS                                                     │  │
│  │  • Subtransaction cache                                               │  │
│  │  • Multixact state                                                    │  │
│  │  • Two-phase commit state                                             │  │
│  │  • Async notification state                                           │  │
│  │  • Background worker state                                            │  │
│  │  • Replication slots                                                  │  │
│  │  • Logical replication state                                          │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

```sql theme={null}
-- View shared memory settings
SHOW shared_buffers;
SHOW wal_buffers;
SHOW max_connections;

-- Estimate shared memory size
-- Approximately: shared_buffers + (max_connections * ~10MB) + overhead

-- On Linux, check actual shared memory usage
-- $ ipcs -m | grep postgres

-- View proc array (backend slots)
SELECT count(*) as backends_used FROM pg_stat_activity;
SELECT setting::int as max_backends FROM pg_settings WHERE name = 'max_connections';
```

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      POSTGRESQL ARCHITECTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Client Applications                                                        │
│   ┌─────────┐ ┌─────────┐ ┌─────────┐                                       │
│   │  App 1  │ │  App 2  │ │  App 3  │                                       │
│   └────┬────┘ └────┬────┘ └────┬────┘                                       │
│        │           │           │                                             │
│        └───────────┼───────────┘                                             │
│                    │                                                         │
│                    ▼                                                         │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        POSTMASTER                                    │   │
│   │                   (Main PostgreSQL Process)                         │   │
│   │        Listens for connections, spawns backend processes            │   │
│   └────────────────────────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│        ┌───────────────────────────┼───────────────────────────┐            │
│        │                           │                           │            │
│        ▼                           ▼                           ▼            │
│   ┌─────────────┐           ┌─────────────┐           ┌─────────────┐       │
│   │  Backend 1  │           │  Backend 2  │           │  Backend 3  │       │
│   │  (Session)  │           │  (Session)  │           │  (Session)  │       │
│   └──────┬──────┘           └──────┬──────┘           └──────┬──────┘       │
│          │                         │                         │              │
│          └─────────────────────────┼─────────────────────────┘              │
│                                    │                                         │
│                                    ▼                                         │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                        SHARED MEMORY                                 │   │
│   │  ┌───────────────────────────────────────────────────────────────┐  │   │
│   │  │                    Shared Buffer Pool                         │  │   │
│   │  │   ┌────────┬────────┬────────┬────────┬────────┬────────┐    │  │   │
│   │  │   │ Buffer │ Buffer │ Buffer │ Buffer │ Buffer │  ...   │    │  │   │
│   │  │   │   1    │   2    │   3    │   4    │   5    │        │    │  │   │
│   │  │   └────────┴────────┴────────┴────────┴────────┴────────┘    │  │   │
│   │  └───────────────────────────────────────────────────────────────┘  │   │
│   │  ┌────────────────┐ ┌────────────────┐ ┌────────────────────────┐  │   │
│   │  │   WAL Buffers  │ │   Clog Buffer  │ │  Lock Tables/Semaphores│  │   │
│   │  └────────────────┘ └────────────────┘ └────────────────────────┘  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                         │
│   ┌────────────────────────────────┼────────────────────────────────────┐   │
│   │                    BACKGROUND PROCESSES                              │   │
│   │                                                                      │   │
│   │  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────────┐    │   │
│   │  │   WAL      │ │ Background │ │ Checkpointer│ │   Autovacuum   │    │   │
│   │  │   Writer   │ │   Writer   │ │            │ │   Launcher     │    │   │
│   │  └────────────┘ └────────────┘ └────────────┘ └────────────────┘    │   │
│   │  ┌────────────┐ ┌────────────┐ ┌────────────┐                       │   │
│   │  │   Stats    │ │  Archiver  │ │  Logical   │                       │   │
│   │  │ Collector  │ │            │ │ Replication│                       │   │
│   │  └────────────┘ └────────────┘ └────────────┘                       │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                    │                                         │
│                                    ▼                                         │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                           DISK STORAGE                               │   │
│   │  ┌──────────────┐  ┌──────────────┐  ┌────────────────────────────┐ │   │
│   │  │  Data Files  │  │  WAL Files   │  │ Configuration & Logs       │ │   │
│   │  │  (Tables,    │  │  (pg_wal/)   │  │ (postgresql.conf, pg_log/) │ │   │
│   │  │   Indexes)   │  │              │  │                            │ │   │
│   │  │  (base/)     │  │              │  │                            │ │   │
│   │  └──────────────┘  └──────────────┘  └────────────────────────────┘ │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## 6.2 Query Processing Pipeline

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        QUERY PROCESSING STAGES                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. PARSER                                                                   │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Input: "SELECT name FROM users WHERE id = 1"                      │   │
│     │                                                                   │   │
│     │ Lexer → Tokens: [SELECT] [name] [FROM] [users] [WHERE] [id] [=]  │   │
│     │ Parser → Parse Tree (AST)                                        │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  2. ANALYZER                                                                 │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Semantic Analysis:                                                │   │
│     │ • Does table 'users' exist?                                       │   │
│     │ • Does column 'name' exist in 'users'?                            │   │
│     │ • What are the data types?                                        │   │
│     │ • Does user have SELECT permission?                               │   │
│     │                                                                   │   │
│     │ Output: Query Tree (with resolved names and types)                │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  3. REWRITER                                                                 │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Apply transformation rules:                                       │   │
│     │ • Expand views to their underlying queries                        │   │
│     │ • Apply row-level security policies                               │   │
│     │ • Handle rules (CREATE RULE)                                      │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  4. PLANNER / OPTIMIZER                                                      │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Generate possible execution plans:                                │   │
│     │ • Plan A: Seq Scan + Filter                                       │   │
│     │ • Plan B: Index Scan on idx_users_id                              │   │
│     │ • Plan C: Bitmap Index Scan + Bitmap Heap Scan                    │   │
│     │                                                                   │   │
│     │ Cost estimation using statistics (pg_statistic)                   │   │
│     │ Choose lowest-cost plan                                           │   │
│     │                                                                   │   │
│     │ Output: Execution Plan (tree of plan nodes)                       │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                              │                                               │
│                              ▼                                               │
│  5. EXECUTOR                                                                 │
│     ┌───────────────────────────────────────────────────────────────────┐   │
│     │ Execute the plan:                                                 │   │
│     │ • Access data pages via buffer manager                            │   │
│     │ • Apply filters, joins, aggregations                             │   │
│     │ • Handle transactions (visibility checks)                        │   │
│     │ • Return results to client                                       │   │
│     └───────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## 6.3 Storage Architecture

### Page Structure

Everything in PostgreSQL is stored in 8KB pages (blocks).

**Real-world analogy**: An 8KB page is like a fixed-size filing cabinet drawer. The drawer label on the front (page header) records the last time anything was filed here (pd\_lsn) and a tamper-evident seal (pd\_checksum). Inside, sticky tabs at the front (line pointers) point to individual folders (tuples) stacked from the back. New folders are added from the back toward the front, and new sticky tabs are added from the front toward the back. When the sticky tabs meet the folders, the drawer is full. This "grow toward each other" layout means PostgreSQL never needs to shift data around within a page just to insert a new row.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         PAGE STRUCTURE (8KB)                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Offset 0                                                                    │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                    PAGE HEADER (24 bytes)                             │  │
│  │  • pd_lsn: Last WAL position that modified this page                  │  │
│  │  • pd_checksum: Data checksum                                         │  │
│  │  • pd_lower: Offset to start of free space                           │  │
│  │  • pd_upper: Offset to end of free space                             │  │
│  │  • pd_special: Offset to special space (for indexes)                 │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│  ┌───────────────────────────────────────────────────────────────────────┐  │
│  │                    LINE POINTERS (4 bytes each)                       │  │
│  │  Array of (offset, length, flags) pointing to tuples                 │  │
│  │  [LP1] [LP2] [LP3] [LP4] [LP5] ...                                   │  │
│  │    │     │     │                                                      │  │
│  └────┼─────┼─────┼──────────────────────────────────────────────────────┘  │
│       │     │     │                                                         │
│  ┌────┼─────┼─────┼──────────────────────────────────────────────────────┐  │
│  │    │     │     │       FREE SPACE                                     │  │
│  │    │     │     │                                                      │  │
│  │    │     │     │   (pd_lower)          (pd_upper)                     │  │
│  │    │     │     │       ↓                   ↓                          │  │
│  │    │     │     │    ████████████████████████                         │  │
│  └────┼─────┼─────┼──────────────────────────────────────────────────────┘  │
│       │     │     │                                                         │
│  ┌────┼─────┼─────┼──────────────────────────────────────────────────────┐  │
│  │    │     │     └───────────────────────────────────┐                  │  │
│  │    │     └─────────────────────────┐               │                  │  │
│  │    └────────────────┐              │               │                  │  │
│  │                     ▼              ▼               ▼                  │  │
│  │  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                  │  │
│  │  │   Tuple 3    │ │   Tuple 2    │ │   Tuple 1    │  ← Grows DOWN    │  │
│  │  │ (Row Data)   │ │ (Row Data)   │ │ (Row Data)   │                  │  │
│  │  └──────────────┘ └──────────────┘ └──────────────┘                  │  │
│  └───────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
│  Offset 8192                                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Tuple (Row) Structure

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TUPLE HEADER (23+ bytes)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  t_xmin (4 bytes)     │ Transaction ID that inserted this tuple      │   │
│  │  t_xmax (4 bytes)     │ Transaction ID that deleted/updated          │   │
│  │  t_cid (4 bytes)      │ Command ID within transaction                │   │
│  │  t_ctid (6 bytes)     │ Current tuple ID (page, offset)              │   │
│  │  t_infomask (2 bytes) │ Status flags (committed, aborted, etc.)      │   │
│  │  t_infomask2 (2 bytes)│ More flags + number of attributes            │   │
│  │  t_hoff (1 byte)      │ Offset to user data                          │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  NULL BITMAP (variable)                                              │   │
│  │  One bit per column, indicates if value is NULL                      │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  USER DATA                                                           │   │
│  │  [Column 1 Value] [Column 2 Value] [Column 3 Value] ...              │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### View Hidden Columns

```sql theme={null}
-- See system columns for any row
SELECT 
    ctid,           -- Physical location (page, offset)
    xmin,           -- Transaction that created
    xmax,           -- Transaction that deleted (0 if live)
    cmin,           -- Command ID for insert
    cmax,           -- Command ID for delete
    tableoid,       -- OID of the table
    *
FROM users 
WHERE id = 1;

-- Result:
-- ctid   | (0,1)        -- Page 0, tuple 1
-- xmin   | 12345        -- Created by txn 12345
-- xmax   | 0            -- Not deleted
-- cmin   | 0            -- First command in its txn
-- cmax   | 0
```

***

## 6.4 MVCC (Multi-Version Concurrency Control)

PostgreSQL's MVCC allows readers and writers to work without blocking each other.

**Real-world analogy**: MVCC works like Google Docs version history. When you edit a document, Google does not overwrite the old version -- it creates a new version while keeping the old one accessible. Anyone who opened the document before your edit continues to see the version they started reading (their "snapshot"). Only people who open the document after your edit see the new version. PostgreSQL does the same thing at the row level: an UPDATE creates a new tuple version while the old one remains visible to transactions that started before the update. The old versions pile up like document revisions and must eventually be cleaned up by VACUUM -- PostgreSQL's equivalent of "delete old revision history."

### How MVCC Works

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                            MVCC IN ACTION                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Initial State: Row with id=1, name='Alice'                                  │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ xmin=100 │ xmax=0 │ id=1 │ name='Alice' │                          │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  Transaction 200: UPDATE users SET name='Bob' WHERE id=1                    │
│                                                                              │
│  After UPDATE: TWO tuple versions exist!                                     │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ xmin=100 │ xmax=200 │ id=1 │ name='Alice' │  ← Old version (dead)  │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ xmin=200 │ xmax=0   │ id=1 │ name='Bob'   │  ← New version (live)  │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│  Visibility Rules:                                                          │
│  ─────────────────                                                          │
│  Transaction 150 (started before 200):                                      │
│    → Sees 'Alice' (xmin=100 committed, xmax=200 not visible)               │
│                                                                              │
│  Transaction 250 (started after 200 committed):                             │
│    → Sees 'Bob' (xmin=200 committed, xmax=0 means live)                    │
│                                                                              │
│  Transaction 200 (the updater itself):                                      │
│    → Sees 'Bob' (its own changes are visible)                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Tuple Visibility Check

```sql theme={null}
-- Simplified visibility logic (actual code is more complex)
-- A tuple is visible if:
-- 1. xmin is committed AND
-- 2. xmin started before our snapshot AND
-- 3. (xmax is 0 OR xmax is aborted OR xmax started after our snapshot)

-- Check transaction status
SELECT * FROM pg_stat_activity WHERE pid = pg_backend_pid();

-- View commit log status
SELECT 
    pg_xact_status(xmin) as xmin_status,
    pg_xact_status(xmax) as xmax_status
FROM users WHERE id = 1;
-- Returns: 'committed', 'aborted', 'in progress', or NULL
```

### Snapshot Isolation

```sql theme={null}
-- REPEATABLE READ takes a snapshot at first query
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;

SELECT * FROM accounts WHERE id = 1;  -- Snapshot taken HERE
-- Returns balance = 1000

-- Another transaction updates and commits...

SELECT * FROM accounts WHERE id = 1;
-- Still returns balance = 1000! (same snapshot)

COMMIT;

-- New transaction sees the update
SELECT * FROM accounts WHERE id = 1;
-- Returns balance = 1200
```

***

## 6.5 VACUUM and Dead Tuples

**Performance pitfall -- the "silent bloat killer"**: In a high-write OLTP system, dead tuples can accumulate faster than autovacuum cleans them up. The default autovacuum trigger is `threshold + scale_factor * table_rows`, which means a 100-million-row table must accumulate 20 million dead tuples (20%) before autovacuum fires. For large tables, set per-table `autovacuum_vacuum_scale_factor = 0.01` (1%) to trigger sooner. Also watch for `idle in transaction` sessions -- a single forgotten `BEGIN` without `COMMIT` prevents VACUUM from reclaiming any rows created after that transaction's snapshot.

### Why VACUUM is Necessary

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        THE DEAD TUPLE PROBLEM                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  After many updates, pages fill with dead tuples:                           │
│                                                                              │
│  Page:                                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ [DEAD] [LIVE] [DEAD] [DEAD] [LIVE] [DEAD] [LIVE] [DEAD] [DEAD]      │   │
│  │                                                                      │   │
│  │  70% dead tuples = wasted space + slow scans!                       │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Problems:                                                                   │
│  • Table bloat: 10GB of data uses 100GB on disk                            │
│  • Slow sequential scans: must read dead tuples too                        │
│  • Index bloat: indexes point to dead tuples                               │
│  • Transaction ID wraparound risk                                          │
│                                                                              │
│  After VACUUM:                                                               │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │ [FREE] [LIVE] [FREE] [FREE] [LIVE] [FREE] [LIVE] [FREE] [FREE]      │   │
│  │                                                                      │   │
│  │  Space marked as free for reuse (but not returned to OS)            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  After VACUUM FULL (rewrites table):                                        │
│  ┌────────────────────────────────────────────────────────────────────┐     │
│  │ [LIVE] [LIVE] [LIVE]                                                │     │
│  │                                                                     │     │
│  │  Compacted! Space returned to OS. But table is locked during.      │     │
│  └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### VACUUM Operations

```sql theme={null}
-- Regular VACUUM: Mark dead tuples as reusable
VACUUM users;

-- VACUUM with analysis: Also update statistics
VACUUM ANALYZE users;

-- VACUUM VERBOSE: Show what it's doing
VACUUM VERBOSE users;

-- VACUUM FULL: Rewrite table (locks table exclusively!)
VACUUM FULL users;  -- Use sparingly!

-- Check vacuum stats
SELECT 
    relname,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 2) as dead_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```

### Autovacuum Configuration

```sql theme={null}
-- View current settings
SELECT name, setting, unit, short_desc
FROM pg_settings
WHERE name LIKE '%autovacuum%';

-- Per-table settings for high-traffic tables
ALTER TABLE orders SET (
    autovacuum_vacuum_threshold = 1000,        -- Min dead tuples
    autovacuum_vacuum_scale_factor = 0.01,     -- + 1% of table
    autovacuum_analyze_threshold = 500,
    autovacuum_analyze_scale_factor = 0.005,
    autovacuum_vacuum_cost_delay = 2           -- Less aggressive
);

-- Check if autovacuum is running
SELECT pid, datname, usename, state, query, 
       backend_start, xact_start
FROM pg_stat_activity 
WHERE query LIKE '%autovacuum%';
```

***

## 6.6 Write-Ahead Logging (WAL)

<img src="https://mintcdn.com/devweeekends/0kwJwOL2KCwg2YYu/images/courses/database-engineering/wal-mechanism.svg?fit=max&auto=format&n=0kwJwOL2KCwg2YYu&q=85&s=c8e113d10f2cb1b50ceb9d981e442970" alt="WAL Mechanism" width="1080" height="1080" data-path="images/courses/database-engineering/wal-mechanism.svg" />

**Real-world analogy**: WAL is like the black box flight recorder on an airplane. Before the pilot (backend process) makes any maneuver (data modification), the action is recorded in the black box (WAL) first. If the plane crashes (server power loss), investigators can replay the black box recording to reconstruct exactly what happened and recover to a consistent state. The key insight is that writing sequentially to the black box is much faster than updating every instrument panel (random data page writes) in real time. PostgreSQL exploits this same principle: sequential WAL writes are 10-100x faster than random data file writes, so committing a transaction only requires flushing the WAL -- the actual data pages can be written lazily later by the background writer.

### How WAL Works

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        WRITE-AHEAD LOGGING                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  The Golden Rule: "Write the log before the data"                           │
│                                                                              │
│  Transaction Flow:                                                           │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  1. Application: INSERT INTO orders VALUES (...)                    │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  2. PostgreSQL: Create WAL record in WAL Buffer                     │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  3. Modify page in Shared Buffer (in memory)                        │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  4. Application: COMMIT                                              │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  5. Flush WAL Buffer to disk (fsync)  ← Durability guaranteed here! │   │
│  │                      │                                               │   │
│  │                      ▼                                               │   │
│  │  6. Return success to application                                   │   │
│  │                      │                                               │   │
│  │                      ▼ (Later, asynchronously)                       │   │
│  │  7. Background Writer: Write dirty pages to data files             │   │
│  │                                                                      │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  Recovery After Crash:                                                       │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │  1. Start PostgreSQL                                                 │   │
│  │  2. Read WAL from last checkpoint                                    │   │
│  │  3. Replay all WAL records                                           │   │
│  │  4. Database is consistent! No data loss.                            │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### WAL Configuration

```sql theme={null}
-- View WAL settings
SHOW wal_level;              -- minimal, replica, logical
                              -- 'replica' is needed for streaming replication + PITR
                              -- 'logical' adds row-level change decoding (CDC)
SHOW max_wal_size;           -- Maximum WAL before forced checkpoint (default: 1GB)
                              -- Setting this too low causes frequent checkpoints (I/O spikes).
                              -- Setting it too high means longer crash recovery times.
SHOW min_wal_size;           -- Minimum WAL to keep (default: 80MB)
SHOW wal_buffers;            -- WAL buffer size (auto-tuned from shared_buffers)

-- View current WAL position -- the LSN (Log Sequence Number) is a monotonically
-- increasing pointer into the WAL stream. Two LSNs can be subtracted to calculate
-- bytes of WAL generated between them.
SELECT pg_current_wal_lsn();

-- View WAL file info -- wal_fpi (full page images) count is important: after a
-- checkpoint, the first write to each page must include the entire 8KB page in WAL
-- (not just the diff). High wal_fpi means your checkpoint interval is too short.
SELECT * FROM pg_stat_wal;

-- Force a checkpoint (flushes all data to disk)
-- Performance pitfall: calling CHECKPOINT manually during peak traffic causes a
-- massive I/O storm. Let PostgreSQL manage checkpoints automatically via
-- checkpoint_timeout and max_wal_size.
CHECKPOINT;

-- View checkpoint stats -- if checkpoints_req >> checkpoints_timed, your
-- max_wal_size is too small and WAL is filling up between scheduled checkpoints.
SELECT * FROM pg_stat_bgwriter;
```

***

## 6.7 Buffer Management

### Shared Buffer Pool

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        BUFFER POOL MANAGEMENT                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Shared Buffer Pool (shared_buffers):                                        │
│  ┌──────────────────────────────────────────────────────────────────────┐   │
│  │                                                                      │   │
│  │  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐   │   │
│  │  │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │ │ Buf │   │   │
│  │  │  1  │ │  2  │ │  3  │ │  4  │ │  5  │ │  6  │ │  7  │ │  8  │   │   │
│  │  └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘   │   │
│  │     │      │      │      │      │      │      │      │              │   │
│  │     │      │      │      │      │      │      │      │              │   │
│  │  ┌──┴──────┴──────┴──────┴──────┴──────┴──────┴──────┴──┐          │   │
│  │  │           Buffer Descriptor Array                    │          │   │
│  │  │  • Page ID (table OID, block number)                │          │   │
│  │  │  • Usage count (for clock sweep eviction)           │          │   │
│  │  │  • Dirty flag                                       │          │   │
│  │  │  • Pin count (number of backends using it)          │          │   │
│  │  └──────────────────────────────────────────────────────┘          │   │
│  │                                                                      │   │
│  │  ┌──────────────────────────────────────────────────────┐          │   │
│  │  │              Hash Table                              │          │   │
│  │  │  Maps (table_oid, block_num) → buffer_id             │          │   │
│  │  └──────────────────────────────────────────────────────┘          │   │
│  │                                                                      │   │
│  └──────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  When page is needed:                                                       │
│  1. Check hash table for (table, block)                                    │
│  2. If found: buffer hit! Increment usage count.                           │
│  3. If not found: buffer miss                                              │
│     a. Find victim buffer (clock sweep: usage count = 0)                   │
│     b. If victim is dirty, write to disk first                             │
│     c. Read requested page from disk                                       │
│     d. Insert into hash table                                              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Monitoring Buffer Usage

```sql theme={null}
-- Buffer cache hit ratio
SELECT 
    sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) as cache_hit_ratio
FROM pg_statio_user_tables;
-- Should be > 0.99 for well-tuned system

-- Per-table buffer stats
SELECT 
    relname,
    heap_blks_read,    -- Blocks read from disk
    heap_blks_hit,     -- Blocks found in cache
    round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 2) as hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;

-- What's in the buffer cache? (requires pg_buffercache extension)
CREATE EXTENSION pg_buffercache;

SELECT 
    c.relname,
    count(*) AS buffers,
    pg_size_pretty(count(*) * 8192) AS size_in_cache
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY count(*) DESC
LIMIT 20;
```

***

## 6.8 Practice Exercises

### Exercise 1: Investigate Tuple Visibility

```sql theme={null}
-- Setup: Create a test scenario
CREATE TABLE visibility_test (id SERIAL PRIMARY KEY, value TEXT);

-- Session 1
BEGIN;
INSERT INTO visibility_test (value) VALUES ('first');
-- Note the xmin

-- Session 2 (before Session 1 commits)
SELECT ctid, xmin, xmax, * FROM visibility_test;
-- What do you see?

-- Session 1
COMMIT;

-- Session 2
SELECT ctid, xmin, xmax, * FROM visibility_test;
-- What changed?
```

### Exercise 2: Analyze Table Bloat

```sql theme={null}
-- Create and bloat a table
CREATE TABLE bloat_test AS 
SELECT generate_series(1, 100000) AS id, 'data' AS value;

-- Update every row multiple times
UPDATE bloat_test SET value = 'updated1';
UPDATE bloat_test SET value = 'updated2';
UPDATE bloat_test SET value = 'updated3';

-- Check bloat
SELECT 
    pg_size_pretty(pg_table_size('bloat_test')) AS table_size,
    n_live_tup,
    n_dead_tup,
    last_vacuum
FROM pg_stat_user_tables 
WHERE relname = 'bloat_test';

-- Run vacuum and compare
VACUUM bloat_test;
-- Check sizes again
```

***

## Module Mastery Checklist

Complete this checklist to confirm you've mastered PostgreSQL Internals:

### Process Architecture

* [ ] Explain PostgreSQL's multi-process model and its advantages
* [ ] Identify the role of the postmaster process
* [ ] Describe how backend processes are created and managed
* [ ] Understand the function of auxiliary processes (checkpointer, bgwriter, WAL writer, autovacuum)
* [ ] Interpret signal handling between postmaster and backends
* [ ] Monitor process activity using pg\_stat\_activity

### Shared Memory Structures

* [ ] Explain the layout and purpose of shared memory regions
* [ ] Calculate shared memory requirements based on configuration
* [ ] Understand the buffer pool structure and management
* [ ] Describe how ProcGlobal tracks backend state
* [ ] Analyze WAL buffer usage and configuration
* [ ] Query shared memory statistics using system views

### Storage Architecture

* [ ] Explain PostgreSQL's heap storage model
* [ ] Describe the structure of 8KB pages
* [ ] Understand tuple layout and metadata (xmin, xmax, ctid)
* [ ] Identify the role of Free Space Map (FSM) and Visibility Map (VM)
* [ ] Explain how TOAST handles large values
* [ ] Navigate the data directory and interpret file naming

### MVCC and Transaction Management

* [ ] Explain how MVCC enables non-blocking reads
* [ ] Understand transaction IDs (XIDs) and snapshot isolation
* [ ] Describe tuple visibility rules
* [ ] Analyze xmin/xmax values to determine tuple status
* [ ] Diagnose transaction ID wraparound scenarios
* [ ] Optimize for MVCC overhead in high-concurrency workloads

### VACUUM and Maintenance

* [ ] Explain the difference between VACUUM and VACUUM FULL
* [ ] Understand how dead tuples accumulate
* [ ] Configure autovacuum thresholds appropriately
* [ ] Monitor bloat using pg\_stat\_user\_tables
* [ ] Diagnose and remediate table bloat
* [ ] Apply VACUUM best practices for production systems

### WAL and Durability

* [ ] Explain Write-Ahead Logging principles
* [ ] Understand WAL record structure and LSNs
* [ ] Describe checkpoint mechanisms and configuration
* [ ] Monitor WAL generation rate
* [ ] Configure WAL settings for performance and durability
* [ ] Implement WAL archiving for PITR

### Buffer Management

* [ ] Understand clock-sweep buffer eviction algorithm
* [ ] Monitor buffer cache hit ratios
* [ ] Identify which tables/indexes are cached
* [ ] Analyze buffer pool efficiency
* [ ] Configure shared\_buffers appropriately
* [ ] Diagnose buffer-related performance issues

### System Catalogs and Monitoring

* [ ] Query pg\_stat\_\* views for performance metrics
* [ ] Use pg\_stat\_activity to monitor connections
* [ ] Analyze pg\_stat\_user\_tables for table statistics
* [ ] Leverage pg\_stat\_statements for query analysis
* [ ] Inspect pg\_locks for lock contention
* [ ] Navigate system catalogs (pg\_class, pg\_attribute, pg\_index)

### Practical Application

* [ ] Complete Hands-On Lab 1: Process and memory inspection
* [ ] Complete Hands-On Lab 2: MVCC and VACUUM observation
* [ ] Diagnose real-world performance issues using internals knowledge
* [ ] Tune PostgreSQL configuration based on workload characteristics
* [ ] Implement monitoring strategies for production databases

***

## Next Module

<Card title="Module 7: Replication & High Availability" icon="arrow-right" href="/courses/database-engineering/replication">
  Build systems that survive failures
</Card>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Explain how PostgreSQL's MVCC implementation differs from MySQL/InnoDB's approach and the operational consequences of each.">
    **Strong Answer:**

    * PostgreSQL stores old row versions directly in the main heap. When a row is updated, the old version remains with xmax set and a new version is appended. This requires VACUUM to reclaim space. The benefit is consistent read performance -- no undo chain traversal to reconstruct old versions.
    * InnoDB stores only the latest version in the clustered index. Old versions go to a separate undo log. Reads of old snapshots require reverse-applying undo records, which becomes expensive for long-running transactions. But the main table stays compact without a VACUUM equivalent.
    * PostgreSQL consequence: table bloat is a first-class operational concern. Write amplification is higher because every UPDATE creates a full new tuple (mitigated by HOT updates when no indexed column changes). InnoDB consequence: undo log can grow large under long transactions, and "history list length" is the monitoring metric to watch.

    **Follow-up: What is a HOT update and when does it fail?**

    HOT (Heap-Only Tuple) allows an UPDATE to skip all index maintenance if no indexed column changes and the new tuple fits on the same page. The old line pointer is redirected. It fails when an indexed column is modified, no free space exists on the same page, or fillfactor is set too high. Monitor with `pg_stat_user_tables.n_tup_hot_upd` vs `n_tup_upd`.
  </Accordion>

  <Accordion title="Walk through what happens inside PostgreSQL when you execute a SELECT, from SQL text arrival to result delivery.">
    **Strong Answer:**

    * **Parsing**: Lexer (scan.l) tokenizes SQL, parser (gram.y) builds a raw parse tree -- pure syntax, no semantic meaning.
    * **Analysis**: Analyzer resolves names against system catalogs (pg\_class, pg\_attribute, pg\_proc), producing a type-checked Query tree.
    * **Rewriting**: View expansion and rule application. Most queries pass through unchanged.
    * **Planning**: The planner generates execution paths, estimates costs using pg\_statistic, considers join orders and index usage, and selects the cheapest Plan tree.
    * **Execution**: The executor uses a demand-driven pull model. Top node requests tuples from children recursively down to scan nodes reading from tables/indexes via the buffer manager.
    * **Result delivery**: Tuples are serialized into the PostgreSQL wire protocol and streamed to the client over TCP.

    **Follow-up: Where does the buffer pool come into play during execution?**

    When a scan node needs a page, it calls the buffer manager which checks the buffer mapping hash table. Buffer hits return a pinned pointer immediately. Buffer misses trigger clock-sweep eviction, a disk read (possibly satisfied by OS page cache), and loading into shared\_buffers. The `Buffers: shared hit=X read=Y` in EXPLAIN directly reflects this.
  </Accordion>

  <Accordion title="Why does PostgreSQL use a multi-process architecture instead of threads? What are the implications for modern workloads?">
    **Strong Answer:**

    * PostgreSQL predates POSIX threads standardization. The fork()-based model gives each connection its own OS process with isolated address space, communicating via shared memory.
    * Key advantage: a segfault in one backend does not crash the server -- postmaster restarts just that process. Shared memory is explicitly managed, simplifying concurrency reasoning.
    * Key cost: 5-10MB private memory per backend process. 1000 connections means 5-10GB overhead. Context switching is more expensive than threads. Connection pooling (PgBouncer) is mandatory for high-connection workloads.
    * Active community work on AIO patches and background worker threading. A full thread migration would require making all global variables thread-safe across 1.3M lines of C.

    **Follow-up: How does this compare to MySQL's threading model?**

    MySQL uses one thread per connection within a single process. Lower per-connection overhead means MySQL handles thousands of connections natively. The tradeoff: a thread crash can corrupt the entire process memory. In practice, connection pooling neutralizes the difference -- both architectures operate with limited active workers regardless of external connection count.
  </Accordion>
</AccordionGroup>