This module provides the internals knowledge required to work on storage engines at companies building database infrastructure. We cover PostgreSQL’s MVCC, vacuum, WAL, and recovery systems in production-level depth.
Real-world analogy: Transaction IDs are like the ticket numbers at a bakery. Every customer gets the next number in sequence. The critical difference is that PostgreSQL’s counter is only 32 bits — meaning it can only count to about 4 billion before it wraps around to zero. When it wraps, suddenly “new” tickets look like they came before “old” tickets, breaking visibility rules. This is the transaction ID wraparound problem, and it is why VACUUM’s “freezing” operation is not optional — it is an existential requirement that prevents the database from becoming unusable.
/* Transaction ID (XID) - 32-bit wraparound counter */typedef uint32 TransactionId;/* Special transaction IDs */#define InvalidTransactionId ((TransactionId) 0)#define BootstrapTransactionId ((TransactionId) 1)#define FrozenTransactionId ((TransactionId) 2)#define FirstNormalTransactionId ((TransactionId) 3)/* XID comparison (handles wraparound) *//* * TransactionIdPrecedes(a, b): * Returns true if a < b in circular XID space * * Example with 32-bit wraparound: * If current XID = 100: * XID 50 is in the past * XID 150 is in the future * XID 2^31 + 100 = 2147483748 is "past" (wraparound danger!) */boolTransactionIdPrecedes(TransactionId id1, TransactionId id2){ /* If id1 and id2 differ by more than 2^31, one has wrapped */ int32 diff = (int32) (id1 - id2); return diff < 0;}
┌─────────────────────────────────────────────────────────────────────────────┐│ MVCC VISIBILITY RULES │├─────────────────────────────────────────────────────────────────────────────┤│ ││ Tuple is VISIBLE if: ││ ───────────────────────────────────────────────────────────────────────── ││ ││ 1. xmin is committed AND ││ 2. xmin was committed before my snapshot AND ││ 3. (xmax is invalid OR ││ xmax is aborted OR ││ xmax was not committed before my snapshot OR ││ xmax is my own transaction) ││ ││ Snapshot contains: ││ • xmin: Oldest active transaction ID ││ • xmax: One past newest transaction ID ││ • xip[]: Array of in-progress transaction IDs ││ ││ Visibility Example: ││ ───────────────────────────────────────────────────────────────────────── ││ ││ My Transaction ID: 105 ││ My Snapshot: xmin=100, xmax=107, xip=[101, 103, 104] ││ ││ Tuple A: xmin=95, xmax=0 → VISIBLE (committed, not deleted) ││ Tuple B: xmin=101, xmax=0 → INVISIBLE (101 in xip, not committed) ││ Tuple C: xmin=90, xmax=102 → VISIBLE (102 not in xip, committed) ││ Tuple D: xmin=90, xmax=103 → VISIBLE (103 in xip, not committed yet) ││ Tuple E: xmin=90, xmax=105 → VISIBLE (105 is my own xact) ││ Tuple F: xmin=108, xmax=0 → INVISIBLE (108 >= snapshot xmax) ││ │└─────────────────────────────────────────────────────────────────────────────┘
Real-world analogy: Hint bits are like sticky notes on a file folder. The first time someone needs to know whether a transaction committed, they have to walk to the filing cabinet (CLOG) and look it up. Then they stick a note on the folder saying “committed” or “aborted.” Every subsequent reader just looks at the sticky note instead of walking to the cabinet. The tradeoff: sticking the note makes the folder “dirty” (it has been modified), so it needs to be written back to disk eventually.Common pitfall: Hint-bit setting is the reason you sometimes see unexpected write I/O on a read-only replica. The first read of a tuple after a checkpoint may set hint bits, dirtying the page. This is normal behavior, not corruption, but it can be surprising when monitoring shows write activity on a server that should be read-only.
/* Hint bits in t_infomask - avoid repeated CLOG lookups */#define HEAP_XMIN_COMMITTED 0x0100 /* xmin transaction committed */#define HEAP_XMIN_INVALID 0x0200 /* xmin transaction aborted */#define HEAP_XMAX_COMMITTED 0x0400 /* xmax transaction committed */#define HEAP_XMAX_INVALID 0x0800 /* xmax transaction aborted/invalid *//* * Hint bit flow: * * 1. First read of tuple: No hint bits set * → Must check CLOG for xmin/xmax status * * 2. After CLOG lookup: * → Set hint bits based on transaction status * → Tuple page becomes dirty (will be written back) * * 3. Subsequent reads: * → Hint bits checked first * → Skip CLOG lookup if hints present * * Impact: * - First read after checkpoint may set hints * - Causes unexpected dirty pages * - Can affect I/O patterns *//* Example visibility check with hint bits */boolHeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot){ HeapTupleHeader tuple = htup->t_data; if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)) { if (tuple->t_infomask & HEAP_XMIN_INVALID) return false; /* Aborted, definitely not visible */ /* No hint bit - check CLOG */ if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple))) { /* Set hint bit for future reads */ tuple->t_infomask |= HEAP_XMIN_COMMITTED; /* Page is now dirty */ } else if (TransactionIdDidAbort(HeapTupleHeaderGetRawXmin(tuple))) { tuple->t_infomask |= HEAP_XMIN_INVALID; return false; } else { /* Transaction still in progress */ return false; /* My snapshot doesn't see it */ } } /* Continue checking xmax... */}
The CLOG (now called pg_xact/ on disk, though the internal terminology persists) is a remarkably compact data structure. It stores the status of every transaction that has ever run using just 2 bits per transaction. This means 4 billion transaction statuses fit in about 1GB of disk space — and only a small portion is cached in shared memory at any given time.Practical tip: If you see high SLRUReadPage wait events in pg_stat_activity, backends are contending on CLOG buffer pages. This typically happens when many transactions are performing visibility checks on tuples whose hint bits have not been set yet (common after a crash recovery or after restoring from a backup). Running VACUUM on affected tables will set hint bits, eliminating the CLOG lookups.
┌─────────────────────────────────────────────────────────────────────────────┐│ THE DEAD TUPLE PROBLEM │├─────────────────────────────────────────────────────────────────────────────┤│ ││ Without Vacuum: ││ ───────────────────────────────────────────────────────────────────────── ││ ││ Time T1: INSERT user (id=1, name='Alice') ││ → Tuple version V1 created, xmin=100 ││ ││ Time T2: UPDATE users SET name='Alicia' WHERE id=1 ││ → V1 marked deleted: xmax=200 ││ → Tuple version V2 created: xmin=200 ││ ││ Time T3: UPDATE users SET name='Alison' WHERE id=1 ││ → V2 marked deleted: xmax=300 ││ → Tuple version V3 created: xmin=300 ││ ││ Table now contains: V1 (dead), V2 (dead), V3 (live) ││ 3x the storage for 1 logical row! ││ ││ Problems: ││ 1. Table bloat → slower scans, more I/O ││ 2. Index bloat → indexes point to dead tuples ││ 3. Transaction ID wraparound → database shutdown ││ ││ Vacuum's Job: ││ 1. Identify dead tuples (no active transaction can see them) ││ 2. Mark space as reusable in FSM (Free Space Map) ││ 3. Update visibility map ││ 4. Freeze old tuples (prevent XID wraparound) ││ 5. Update statistics ││ │└─────────────────────────────────────────────────────────────────────────────┘
/* Full Page Writes (FPW) - protection against torn pages *//* * Problem: Partial page write * ─────────────────────────────────────────────────── * PostgreSQL page: 8KB * Disk sector: 512B or 4KB * * If crash during page write: * - Some sectors written, some not * - Page is corrupted (torn page) * - Can't apply WAL to corrupted page! * * Solution: Full Page Image (FPI) * ─────────────────────────────────────────────────── * First modification after checkpoint: write entire page to WAL * Recovery: restore page from FPI, then apply subsequent WAL */typedef struct XLogRecordBlockHeader{ uint8 id; /* Block ID (for multi-block records) */ uint8 fork_flags; /* Fork number + flags */ uint16 data_length; /* Length of block-specific data */ /* If BKPBLOCK_HAS_IMAGE flag set: */ uint16 bimg_len; /* Length of page image */ uint16 hole_offset; /* Offset of hole (if any) */ uint8 bimg_info; /* Compression, etc. */ /* Block reference follows... */} XLogRecordBlockHeader;/* FPW behavior *//* * Checkpoint at LSN 0/10000000 * * WAL record at 0/10000100: UPDATE page 42 * → First change to page 42 since checkpoint * → Include full 8KB page image * → xl_info includes XLR_BKP_BLOCK flag * * WAL record at 0/10001000: UPDATE page 42 (same page again) * → Not first change since checkpoint * → Only include delta (changed bytes) * → Much smaller record * * Next checkpoint at 0/20000000 * → Page 42 modifications reset * → Next change will be FPW again */
-- Checkpoint configuration-- How often to checkpoint (at minimum)checkpoint_timeout = '5min' -- Default: every 5 minutes-- Or when this much WAL generatedmax_wal_size = '1GB' -- Default: ~64 WAL segmentsmin_wal_size = '80MB' -- Keep at least this much-- Spread checkpoint I/O over timecheckpoint_completion_target = 0.9 -- Use 90% of interval-- Example with completion_target:/* * checkpoint_timeout = 5min = 300s * checkpoint_completion_target = 0.9 * * Target completion time = 300 × 0.9 = 270s * * If we have 1000 dirty buffers: * Rate = 1000 / 270 = ~3.7 buffers/second * * Spreads I/O instead of burst at checkpoint start */-- Monitoring checkpointsSELECT checkpoints_timed, -- By timeout checkpoints_req, -- Forced (WAL full, manual) checkpoint_write_time, -- Time writing buffers (ms) checkpoint_sync_time, -- Time fsync'ing (ms) buffers_checkpoint, -- Buffers written in checkpoints buffers_backend -- Emergency backend writes (BAD)FROM pg_stat_bgwriter;-- buffers_backend should be 0 or very low-- If high: shared_buffers too small or checkpoint too infrequent
┌─────────────────────────────────────────────────────────────────────────────┐│ CRASH RECOVERY PROCESS │├─────────────────────────────────────────────────────────────────────────────┤│ ││ Startup after crash: ││ ││ 1. Read pg_control ││ ├── State = "in production" → needs recovery ││ ├── Last checkpoint location ││ └── Last known timeline ││ ││ 2. Read checkpoint record ││ └── Get REDO point (where to start replay) ││ ││ 3. REDO Phase (replay WAL from checkpoint) ││ for each WAL record from REDO point to end: ││ ├── Read record ││ ├── Check if page needs record applied ││ │ (compare page LSN to record LSN) ││ ├── If FPI: restore full page image ││ ├── Else: apply redo function ││ └── Continue until end of valid WAL ││ ││ 4. Mark recovery complete ││ ├── Update pg_control state = "normal" ││ ├── Write end-of-recovery checkpoint ││ └── Open for connections ││ ││ Timeline: ││ ───────────────────────────────────────────────────────────────────────── ││ ││ |←─── already on disk ───→|←─ in WAL buffers ─→| ││ | | (lost) | ││ ──────────●───────────────●────────────────────● ││ Checkpoint Crash Now ││ ↑ ││ REDO point ││ |←─────── REDO phase replays these records ─────→| ││ │└─────────────────────────────────────────────────────────────────────────────┘
/* Page LSN - tracks last WAL record applied to page *//* Every PostgreSQL page has a header */typedef struct PageHeaderData{ PageXLogRecPtr pd_lsn; /* LSN of last change to this page */ uint16 pd_checksum; /* Page checksum if enabled */ uint16 pd_flags; /* Flag bits */ LocationIndex pd_lower; /* Offset to start of free space */ LocationIndex pd_upper; /* Offset to end of free space */ LocationIndex pd_special; /* Offset to start of special space */ uint16 pd_pagesize_version; TransactionId pd_prune_xid; /* Hint for when to prune */} PageHeaderData;/* * Recovery logic using page LSN: * * for each WAL record with LSN X: * page = read_page(record->block_ref) * page_lsn = PageGetLSN(page) * * if (X <= page_lsn) * // Record already applied (page was flushed after this record) * skip record * else * // Page is older than record * apply record to page * PageSetLSN(page, X) * * This handles the case where: * 1. WAL record written at LSN X * 2. Page flushed to disk (has LSN X) * 3. Crash before checkpoint * 4. Recovery replays from before X * 5. Record at X is skipped because page already has it */
-- Point-In-Time Recovery configuration-- Step 1: Enable archiving (postgresql.conf)archive_mode = onarchive_command = 'cp %p /archive/%f'-- Or use pg_receivewal for continuous archiving-- Step 2: Take base backuppg_basebackup -D /backup/base -Fp -Xs -P-- Step 3: After disaster, create recovery.signal-- And set recovery target in postgresql.conf:restore_command = 'cp /archive/%f %p'-- Recovery targets (choose one):recovery_target_time = '2024-01-15 14:30:00 UTC'-- ORrecovery_target_xid = '12345678'-- ORrecovery_target_lsn = '0/1A2B3C4D'-- ORrecovery_target_name = 'before_migration' -- Created with pg_create_restore_point()-- Stop before or after targetrecovery_target_inclusive = true -- Include target transaction-- What to do after reaching targetrecovery_target_action = 'pause' -- pause, promote, or shutdown-- Step 4: Start PostgreSQL-- It will replay WAL until target, then pause/promote
Explain the WAL protocol in PostgreSQL. Why does writing to WAL first improve both durability and performance?
Strong Answer:
The WAL protocol guarantees that no data page modification is written to disk before the corresponding WAL record is flushed. On COMMIT, PostgreSQL flushes the WAL (fsync) but does NOT immediately flush the modified data pages. Dirty pages remain in shared_buffers and are written lazily by the background writer or checkpointer.
Durability: if the server crashes after COMMIT, dirty data pages may be lost, but the WAL on disk contains enough information to reconstruct those pages during recovery by replaying from the last checkpoint.
Performance: WAL writes are sequential appends to a single file. Data page writes are random I/O scattered across hundreds of files. Sequential I/O is 10-100x faster on spinning disks and 2-5x faster on SSDs. A transaction modifying 50 pages only needs one sequential WAL flush instead of 50 random page writes.
The checkpoint mechanism periodically flushes all dirty pages to disk, limiting recovery time. checkpoint_completion_target = 0.9 spreads checkpoint I/O over 90% of the interval to smooth the impact.
Follow-up: What is full_page_writes and why is it important?After a checkpoint, the first modification to any page triggers a full 8KB page image write to WAL (not just the delta). This protects against torn pages — if a crash occurs during a partial page write to disk, recovery would otherwise apply a WAL delta to a corrupted page. The cost is WAL volume inflation immediately after each checkpoint. This is a significant factor in WAL sizing and replication bandwidth planning.
A 100GB table contains only 20GB of live data. How did this happen and how do you fix it without downtime?
Strong Answer:
This is classic table bloat. Common causes: autovacuum throttled too aggressively and could not keep up with dead tuple creation, a long-running transaction prevented VACUUM from removing tuples visible to that snapshot, a massive UPDATE created millions of dead tuples faster than autovacuum could process them, or hot_standby_feedback on a replica fed back an old xmin.
Online fix: use pg_repack. It creates a new copy of the table with only live tuples, builds new indexes, and swaps atomically with a brief exclusive lock. The table remains readable and writable throughout. You need roughly 100GB of free disk temporarily.
If pg_repack is unavailable, VACUUM FULL rewrites the table but holds an ACCESS EXCLUSIVE lock for the entire duration (potentially hours for 100GB). Never run this during business hours.
Prevention: tune autovacuum per-table for high-churn tables (autovacuum_vacuum_scale_factor = 0.01), monitor n_dead_tup / n_live_tup ratio, and alert when it exceeds 20%. Kill transactions older than 1 hour.
Follow-up: Why does regular VACUUM not return space to the OS?VACUUM marks dead tuple space as reusable within PostgreSQL’s file but can only truncate empty pages at the END of the file. If the last page has even one live tuple, the file cannot shrink. Truncating mid-file would invalidate all ctid references used by indexes, requiring a full index rebuild. VACUUM FULL and pg_repack create entirely new files with only live data and rebuild all indexes.
Describe the page layout of a PostgreSQL heap page and explain how understanding it helps debug production issues.
Strong Answer:
Every data file is divided into 8KB pages. Each page structure: page header (24 bytes) containing LSN of last WAL modification, flags, and free space boundaries. Then a line pointer array (4-byte entries growing forward) where each pointer contains the offset and length of a tuple. Then free space in the middle. Then tuples stored from the end of the page backward, each with a header (23 bytes minimum) containing xmin, xmax, infomask flags, null bitmap, and actual column data.
Line pointer indirection is key: tuples are never accessed directly, always through their line pointer. This enables HOT updates (redirect a line pointer without touching indexes) and makes VACUUM’s dead tuple cleanup possible without index updates.
Practical debugging: the pageinspect extension lets you examine raw page contents. SELECT * FROM heap_page_items(get_raw_page('users', 0)) shows every tuple on page 0, including dead tuples, xmin/xmax, and infomask flags. This is invaluable for diagnosing visibility issues, corruption, or unexpected bloat patterns.
Follow-up: What is TOAST and when does it activate?TOAST (The Oversized-Attribute Storage Technique) handles values exceeding approximately 2KB per column. Large values are compressed and/or moved to a separate TOAST table, with the main heap storing a small pointer. The key production impact: TOAST reads are additional random I/O. A query selecting a TOASTed text column is slower than selecting an integer on the same row. This is why SELECT * is discouraged — you may trigger TOAST decompression for columns you do not need.