Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

PostgreSQL Data Structures

Understanding the data structures inside PostgreSQL is essential for performance optimization, debugging, and contributing to the codebase. This module covers everything from page layout to B-Tree internals.
Estimated Time: 14-16 hours
Difficulty: Advanced
Prerequisite: Basic understanding of data structures
OSS Relevance: Critical — core storage code

1. Page Layout

Every data file in PostgreSQL is divided into 8KB pages (configurable at compile time). Understanding page layout is fundamental. Real-world analogy: Think of each 8KB page as a page in a physical notebook. The top of the page has a header with the page number and date (page header with LSN and checksum). Just below the header is a table of contents listing where each note starts (line pointers). The actual notes (tuples) are written from the bottom of the page upward. Free space sits in the middle — the table of contents grows down while the notes grow up. When they meet, the page is full. This “meet in the middle” design lets PostgreSQL add both line pointers and tuples without shuffling existing data.

Page Structure

Line Pointer (ItemId)

/* src/include/storage/itemid.h */
typedef struct ItemIdData {
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

/* Line pointer flags */
#define LP_UNUSED       0       /* unused (available) */
#define LP_NORMAL       1       /* used (tuple data exists) */
#define LP_REDIRECT     2       /* HOT redirect (points to another line pointer) */
#define LP_DEAD         3       /* dead (tuple was deleted) */

2. Heap Tuple Structure

Heap tuples are the actual row data stored in table pages. Real-world analogy: A heap tuple is like a shipping package. The tuple header is the shipping label — it tells you who sent it (xmin: the transaction that created it), who might be returning it (xmax: the transaction that deleted or updated it), and a tracking number (ctid: physical location). The infomask flags are the “FRAGILE” or “THIS SIDE UP” stickers — metadata about how to handle the package. The actual column data is the contents inside the box. Every time you “update” a row, PostgreSQL does not rip open the box and swap items. Instead, it ships a whole new package and stamps the old one as “return to sender.”

Tuple Layout

Key Infomask Flags

/* src/include/access/htup_details.h */

/* Tuple has been inserted/deleted */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */

/* Locking flags */
#define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
#define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax is for locking only */
#define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-share lock */
#define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is an exclusive lock */

/* Other flags */
#define HEAP_HASNULL            0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002  /* has variable-width attributes */
#define HEAP_HASEXTERNAL        0x0004  /* has external stored attributes */
#define HEAP_HASOID_OLD         0x0008  /* has an OID (deprecated) */
#define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE         0x8000  /* this is a heap-only tuple */

MVCC Visibility

-- View tuple visibility info
-- Why these columns matter: ctid shows physical location (page, offset),
-- xmin/xmax reveal the MVCC lifecycle of each row version.
SELECT ctid, xmin, xmax, 
       CASE WHEN xmax = 0 THEN 'visible' 
            WHEN xmax::text::int > txid_current() THEN 'visible (xmax in future)'
            ELSE 'check clog' END AS status
FROM users;

-- Check transaction status in CLOG (commit log)
-- pg_xact/0000 contains commit status for XIDs 0-8M
-- Each XID uses 2 bits: 00=in_progress, 01=committed, 10=aborted, 11=sub-committed
-- Performance pitfall: Long-running transactions prevent dead tuple cleanup because
-- VACUUM cannot reclaim rows that might still be visible to an old snapshot.
-- A single idle-in-transaction session can cause unbounded table bloat.

3. B-Tree Index Structure

B-Trees are the default and most common index type in PostgreSQL. Real-world analogy: A B-Tree index works exactly like the index at the back of a textbook. The root page is the main alphabetical directory (“A-F: page 1, G-L: page 2”). Internal pages are like sub-sections that narrow your search further. Leaf pages are the final entries listing the exact page numbers (TIDs) where each term appears. The key difference with a B+ tree — which PostgreSQL uses — is that the leaf pages are also linked together left-to-right, like a chain. This means once you find “JavaScript” you can slide along the chain to “Java,” “Jest,” “JIT” without climbing back up the tree. That is what makes range scans (WHERE name BETWEEN 'J' AND 'K') efficient.

B+ Tree Structure

PostgreSQL uses B+ trees, where all data is stored in leaf nodes and internal nodes only contain keys for navigation: Key Characteristics:
  • Internal nodes: Only store keys and pointers (no data)
  • Leaf nodes: Store key-value pairs (TIDs pointing to heap tuples)
  • Leaf linking: All leaf nodes are linked together for efficient range scans
  • Fanout: Higher fanout than B-trees (more keys per internal node)

B-Tree Page Layout

┌─────────────────────────────────────────────────────────────────────────────┐
│                    B-TREE INDEX STRUCTURE                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   META PAGE (block 0)                                                        │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  btm_magic, btm_version, btm_root, btm_level, btm_fastroot, etc.   │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ROOT PAGE (internal node, level = tree height)                            │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  PageHeader + BTPageOpaqueData                                      │   │
│   │  ┌─────────────────────────────────────────────────────────────┐   │   │
│   │  │ High Key │ Key₁,Ptr₁ │ Key₂,Ptr₂ │ ... │ KeyN,PtrN │       │   │   │
│   │  └─────────────────────────────────────────────────────────────┘   │   │
│   │                     │          │          │                         │   │
│   └─────────────────────┼──────────┼──────────┼─────────────────────────┘   │
│                         │          │          │                             │
│           ┌─────────────┘          │          └─────────────┐               │
│           ▼                        ▼                        ▼               │
│   ┌──────────────┐         ┌──────────────┐         ┌──────────────┐       │
│   │ INTERNAL     │         │ INTERNAL     │         │ INTERNAL     │       │
│   │ PAGE         │         │ PAGE         │         │ PAGE         │       │
│   │ (level N-1)  │         │ (level N-1)  │         │ (level N-1)  │       │
│   └──────┬───────┘         └──────┬───────┘         └──────┬───────┘       │
│          │                        │                        │                │
│          ▼                        ▼                        ▼                │
│   ┌──────────────┐         ┌──────────────┐         ┌──────────────┐       │
│   │ LEAF PAGE    │◀───────▶│ LEAF PAGE    │◀───────▶│ LEAF PAGE    │       │
│   │ (level 0)    │ sibling │ (level 0)    │ sibling │ (level 0)    │       │
│   │              │  links  │              │  links  │              │       │
│   │ Key,TID pairs│         │ Key,TID pairs│         │ Key,TID pairs│       │
│   └──────────────┘         └──────────────┘         └──────────────┘       │
│                                                                              │
│   LEAF PAGE DETAIL:                                                          │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  PageHeader                                                         │   │
│   │  ┌─────────────────────────────────────────────────────────────┐   │   │
│   │  │ High Key │ Key₁,TID₁ │ Key₂,TID₂ │ Key₃,TID₃ │ ...         │   │   │
│   │  └─────────────────────────────────────────────────────────────┘   │   │
│   │  BTPageOpaqueData:                                                  │   │
│   │  ┌─────────────────────────────────────────────────────────────┐   │   │
│   │  │ btpo_prev │ btpo_next │ btpo_level │ btpo_flags            │   │   │
│   │  └─────────────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Lehman-Yao Algorithm

PostgreSQL uses the Lehman-Yao B-Tree algorithm for concurrent access:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LEHMAN-YAO CONCURRENT B-TREE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Key Innovations:                                                           │
│                                                                              │
│   1. HIGH KEY: Each page stores the highest key that can be on that page   │
│      ┌───────────────────────────────────────────────────────────┐          │
│      │ High Key = 50 │ 10 │ 20 │ 30 │ 40 │                      │          │
│      └───────────────────────────────────────────────────────────┘          │
│      All keys on this page are ≤ 50                                         │
│                                                                              │
│   2. RIGHT-LINK: Every page has a pointer to its right sibling             │
│      ┌────────────┐     ┌────────────┐     ┌────────────┐                  │
│      │ Page A     │────▶│ Page B     │────▶│ Page C     │                  │
│      │ keys ≤ 50  │     │ keys ≤ 100 │     │ keys ≤ ∞   │                  │
│      └────────────┘     └────────────┘     └────────────┘                  │
│                                                                              │
│   3. MOVE RIGHT: If key > high key, follow right-link                      │
│                                                                              │
│   SEARCH ALGORITHM:                                                          │
│   ─────────────────                                                          │
│   search(key):                                                               │
│     page = root                                                              │
│     while not leaf(page):                                                    │
│       while key > high_key(page):    # concurrent split happened           │
│         page = right_link(page)      # move right                          │
│       page = find_child(page, key)                                          │
│     while key > high_key(page):      # check at leaf level too             │
│       page = right_link(page)                                               │
│     return search_in_page(page, key)                                        │
│                                                                              │
│   SPLIT ALGORITHM:                                                           │
│   ────────────────                                                           │
│   1. Create new page (becomes right sibling)                                │
│   2. Move upper half of keys to new page                                    │
│   3. Set high key of old page = first key of new page                      │
│   4. Set right-link of old page = new page                                  │
│   5. Insert downlink to new page in parent                                  │
│                                                                              │
│   BENEFIT: Searches never block, even during splits!                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

B-Tree Operations: Insertion and Splitting

Understanding how B-trees handle insertions and node splits is crucial for performance tuning: B-Tree Operations Insertion Process:
  1. Search: Navigate from root to appropriate leaf node
  2. Insert: Add key-value pair to leaf node
  3. Split if needed: If node exceeds maximum keys, split into two nodes
  4. Promote: Push middle key up to parent node
  5. Recursive: Parent may also need to split (propagates up)
Performance Implications:
  • Best case: O(log n) - no splits needed
  • Worst case: O(log n) - splits propagate to root
  • Write amplification: Splits require multiple page writes
  • Fillfactor: PostgreSQL uses 90% fillfactor by default to reduce splits

B-Tree Inspection

-- Install pageinspect -- a diagnostic extension that lets you peer inside
-- index and heap pages at the byte level.
CREATE EXTENSION pageinspect;

-- View B-Tree meta page (block 0) -- this tells you the tree depth and root location.
-- Why it matters: if 'level' is high (>4), the index may have excessive bloat.
SELECT * FROM bt_metap('users_pkey');
-- magic | version | root | level | fastroot | fastlevel | ...

-- View B-Tree page stats -- dead_items > 0 means the index needs vacuuming.
SELECT * FROM bt_page_stats('users_pkey', 1);
-- blkno | type | live_items | dead_items | avg_item_size | ...

-- View B-Tree page items -- each entry is a (key, TID) pair pointing to a heap tuple.
-- Performance pitfall: if avg_item_size is large relative to the page size (8KB),
-- you get low fanout and a deeper tree, meaning more I/O per lookup.
-- This commonly happens with wide composite index keys.
SELECT * FROM bt_page_items('users_pkey', 1);
-- itemoffset | ctid | itemlen | nulls | vars | data

### Practical B-Tree Implementation (Python)

A simplified in-memory B-Tree to demonstrate splitting and insertion logic:

```python
class BTreeNode:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []
        self.children = []

class BTree:
    def __init__(self, t):
        self.root = BTreeNode(True)
        self.t = t  # Minimum degree

    def insert(self, k):
        root = self.root
        if len(root.keys) == (2 * self.t) - 1:
            temp = BTreeNode()
            self.root = temp
            temp.children.insert(0, root)
            self.split_child(temp, 0)
            self.insert_non_full(temp, k)
        else:
            self.insert_non_full(root, k)

    def insert_non_full(self, x, k):
        i = len(x.keys) - 1
        if x.leaf:
            x.keys.append((None, None))
            while i >= 0 and k < x.keys[i]:
                x.keys[i + 1] = x.keys[i]
                i -= 1
            x.keys[i + 1] = k
        else:
            while i >= 0 and k < x.keys[i]:
                i -= 1
            i += 1
            if len(x.children[i].keys) == (2 * self.t) - 1:
                self.split_child(x, i)
                if k > x.keys[i]:
                    i += 1
            self.insert_non_full(x.children[i], k)

    def split_child(self, x, i):
        t = self.t
        y = x.children[i]
        z = BTreeNode(y.leaf)
        x.children.insert(i + 1, z)
        x.keys.insert(i, y.keys[t - 1])
        z.keys = y.keys[t: (2 * t) - 1]
        y.keys = y.keys[0: t - 1]
        if not y.leaf:
            z.children = y.children[t: 2 * t]
            y.children = y.children[0: t]

---

## 4. Other Index Types

### Hash Index

**Real-world analogy**: A hash index is like a coat check at a theater. You hand over your coat (the key), the attendant gives you a numbered ticket (the hash), and they hang your coat on that numbered hook. Retrieval is instant -- hand over the ticket, get the coat. But if someone asks "give me all coats between ticket 50 and 75," the attendant has to check every single hook because there is no ordering. That is why hash indexes only support equality lookups (`=`), never range scans.

┌─────────────────────────────────────────────────────────────────────────────┐ │ HASH INDEX STRUCTURE │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ META PAGE (block 0) │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ hashm_ntuples, hashm_ffactor, hashm_bsize, hashm_maxbucket, etc. │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ BUCKET PAGES (organized by hash value) │ │ │ │ hash(key) = 0b…XXXX │ │ ││││ │ │ ││││ ┌───────────────────┐ │ │ │││└───▶│ Bucket 0 │ │ │ ││└────▶│ Bucket 1 │ │ │ │└─────▶│ Bucket 2 │ │ │ └──────▶│ Bucket 3 │ │ │ │ … │ │ │ │ Bucket 2^N - 1 │ │ │ └───────────────────┘ │ │ │ │ OVERFLOW PAGES (when bucket is full) │ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │ │ Bucket Page │────▶│ Overflow 1 │────▶│ Overflow 2 │ │ │ │ (primary) │ │ │ │ │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ │ │ │ Only supports equality lookups (=) │ │ O(1) average case, but can degrade with many overflows │ │ │ └─────────────────────────────────────────────────────────────────────────────┘

### GIN (Generalized Inverted Index)

**Real-world analogy**: A GIN index works like the index at the back of a cookbook, but for individual ingredients rather than recipe names. Under "garlic," you see a list of every recipe page that uses garlic. Under "basil," another list. When you search for recipes using both garlic AND basil, the database intersects the two posting lists. The "pending list" optimization is like having a sticky note pad -- new recipes go on the sticky notes first and get merged into the main index during quiet periods (VACUUM).

┌─────────────────────────────────────────────────────────────────────────────┐ │ GIN INDEX STRUCTURE │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ Perfect for: Full-text search, JSONB, arrays, hstore │ │ │ │ STRUCTURE: │ │ ┌─────────────────────────────────────────────────────────────────────┐ │ │ │ Entry Tree (B-Tree) │ │ │ │ │ │ │ │ Each distinct key value (word, array element, JSON key) is an │ │ │ │ entry pointing to a posting tree/list of TIDs │ │ │ │ │ │ │ │ ┌───────────────────────────────────────────────┐ │ │ │ │ │ “apple” │ “banana” │ “cherry” │ “date” │ … │ │ │ │ │ └────┬─────┴────┬─────┴────┬─────┴────┬────┴────┘ │ │ │ │ │ │ │ │ │ │ │ │ ▼ ▼ ▼ ▼ │ │ │ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │ │ │ │Posting │ │Posting │ │Posting │ │Posting │ │ │ │ │ │Tree/ │ │Tree/ │ │Tree/ │ │Tree/ │ │ │ │ │ │List │ │List │ │List │ │List │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │TID1 │ │TID3 │ │TID1 │ │TID2 │ │ │ │ │ │TID2 │ │TID5 │ │TID4 │ │TID7 │ │ │ │ │ │TID4 │ │TID9 │ │TID8 │ │… │ │ │ │ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │ │ │ │ │ │ └─────────────────────────────────────────────────────────────────────┘ │ │ │ │ PENDING LIST (for fast writes): │ │ Newly inserted entries go to pending list first, then bulk-merged │ │ into main structure during VACUUM or when list gets too big │ │ │ │ gin_pending_list_limit (default 4MB) │ │ fastupdate = on (default) │ │ │ └─────────────────────────────────────────────────────────────────────────────┘

```sql
-- GIN index for full-text search
-- Why GIN over GiST for FTS: GIN is faster for reads (exact posting lists) but
-- slower for writes (must update posting lists). Use GiST if your table has
-- very high write throughput and you can tolerate slower search.
CREATE INDEX idx_articles_fts ON articles USING GIN (to_tsvector('english', content));

-- GIN for JSONB -- jsonb_path_ops is ~40% smaller than the default jsonb_ops
-- but only supports @> (containment) queries, not key-exists (?).
CREATE INDEX idx_data_gin ON events USING GIN (properties jsonb_path_ops);

-- Performance pitfall: with fastupdate=on (default), the pending list can grow
-- large during bulk inserts, making the FIRST query after the insert slow because
-- it triggers a pending list cleanup. For bulk loads, consider:
--   ALTER INDEX idx_articles_fts SET (fastupdate = off);
-- then re-enable after the load.

GiST (Generalized Search Tree)

┌─────────────────────────────────────────────────────────────────────────────┐
│                         GiST INDEX STRUCTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Perfect for: Geometric data, ranges, full-text (alternative to GIN)      │
│                                                                              │
│   KEY CONCEPT: Bounding box containment hierarchy                           │
│                                                                              │
│   R-TREE EXAMPLE (for geometric data):                                       │
│                                                                              │
│              ┌─────────────────────────────────────────┐                    │
│              │            Root Bounding Box            │                    │
│              │  ┌───────────────┬───────────────────┐ │                    │
│              │  │   R1          │        R2         │ │                    │
│              │  │ ┌───┐ ┌───┐  │  ┌───────┐ ┌───┐  │ │                    │
│              │  │ │ a │ │ b │  │  │   c   │ │ d │  │ │                    │
│              │  │ └───┘ └───┘  │  └───────┘ └───┘  │ │                    │
│              │  │              │                    │ │                    │
│              │  └───────────────┴───────────────────┘ │                    │
│              └─────────────────────────────────────────┘                    │
│                                                                              │
│   TREE STRUCTURE:                                                            │
│                                                                              │
│                    ┌─────────────────────┐                                  │
│                    │    Root (bbox)      │                                  │
│                    └──────────┬──────────┘                                  │
│                    ┌──────────┴──────────┐                                  │
│                    ▼                     ▼                                  │
│            ┌───────────────┐     ┌───────────────┐                         │
│            │ R1 (bbox)     │     │ R2 (bbox)     │                         │
│            └───────┬───────┘     └───────┬───────┘                         │
│            ┌───────┴───────┐     ┌───────┴───────┐                         │
│            ▼               ▼     ▼               ▼                         │
│       ┌────────┐      ┌────────┐┌────────┐  ┌────────┐                    │
│       │ a TID  │      │ b TID ││ c TID  │  │ d TID  │                    │
│       └────────┘      └────────┘└────────┘  └────────┘                    │
│                                                                              │
│   QUERY: Find all objects overlapping query box                             │
│   → Descend only into nodes whose bounding box overlaps query               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

BRIN (Block Range Index)

Real-world analogy: A BRIN index is like the date labels on archive boxes in a warehouse. You do not index every document — you just label each box with the earliest and latest date it contains. When someone asks for documents from January 10-12, you scan the labels, grab only the relevant boxes, and search within them. The trade-off is obvious: you might open a box that contains January 8-14 even though you only need 10-12 (false positives), but labeling boxes is nearly free compared to indexing every individual document. BRIN indexes can be 1000x smaller than B-Tree indexes on the same column.
┌─────────────────────────────────────────────────────────────────────────────┐
│                         BRIN INDEX STRUCTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Perfect for: Large tables with naturally ordered data (timestamps, etc.)  │
│                                                                              │
│   CONCEPT: Store min/max summary for ranges of blocks                       │
│                                                                              │
│   Table (naturally ordered by timestamp):                                    │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ Block 0-127  │ Block 128-255 │ Block 256-383 │ Block 384-511 │ ... │   │
│   │ Jan 1-7      │ Jan 8-14      │ Jan 15-21     │ Jan 22-28     │     │   │
│   └──────────────┴───────────────┴───────────────┴───────────────┴─────┘   │
│                                                                              │
│   BRIN Index (very small!):                                                  │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ Range 0-127:   min=Jan 1,  max=Jan 7                               │   │
│   │ Range 128-255: min=Jan 8,  max=Jan 14                              │   │
│   │ Range 256-383: min=Jan 15, max=Jan 21                              │   │
│   │ Range 384-511: min=Jan 22, max=Jan 28                              │   │
│   │ ...                                                                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   QUERY: WHERE timestamp BETWEEN 'Jan 10' AND 'Jan 12'                      │
│   → Only scan blocks 128-255 (skip everything else!)                        │
│                                                                              │
│   pages_per_range = 128 (default, tunable)                                  │
│   Index size: ~1000x smaller than B-Tree                                    │
│   Trade-off: Less precise (may scan extra blocks)                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
-- BRIN for time-series data
-- Why pages_per_range = 128? Each summary covers 128 pages (~1MB).
-- Smaller values = more precise but larger index. Larger values = smaller index
-- but more false-positive blocks scanned. Tune based on your data's natural ordering.
CREATE INDEX idx_events_time_brin ON events USING BRIN (created_at) 
WITH (pages_per_range = 128);

-- Check BRIN summaries -- reveals min/max per block range
SELECT * FROM brin_page_items(get_raw_page('idx_events_time_brin', 2), 'idx_events_time_brin');

-- Performance pitfall: BRIN indexes are USELESS if your data is not physically
-- ordered by the indexed column. If events are inserted out of order (e.g.,
-- backfilled from multiple sources), each block range will have min=earliest,
-- max=latest, and the index will never eliminate any blocks.
-- Fix: CLUSTER events USING idx_events_time_brin; (rewrites table in order,
-- but requires an exclusive lock -- schedule during maintenance windows).

5. Buffer Pool

The buffer pool is PostgreSQL’s main memory cache for data pages. Real-world analogy: The buffer pool is like the desk of a researcher in a library. The library’s stacks (disk) hold millions of books (pages), but the researcher can only have a few dozen on their desk (shared_buffers) at once. When they need a new book, they check the desk first (buffer hit). If it is not there, they walk to the stacks (buffer miss — expensive I/O). When the desk is full and a new book is needed, they must put one back. The clock-sweep algorithm decides which book to return: frequently referenced books get a “do not return yet” sticky note (usage_count > 0), and the algorithm sweeps the desk like a clock hand, peeling off one sticky note per pass. A book with no sticky notes left gets returned to the stacks.

Buffer Pool Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                      BUFFER POOL ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SHARED MEMORY                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                                                                     │   │
│   │   BUFFER DESCRIPTORS (BufferDesc array)                            │   │
│   │   ┌────────┬────────┬────────┬────────┬────────┬────────┐          │   │
│   │   │ Desc 0 │ Desc 1 │ Desc 2 │ Desc 3 │  ...   │ Desc N │          │   │
│   │   └───┬────┴───┬────┴───┬────┴───┬────┴────────┴───┬────┘          │   │
│   │       │        │        │        │                  │               │   │
│   │       ▼        ▼        ▼        ▼                  ▼               │   │
│   │   BUFFER POOL (array of 8KB buffers)                               │   │
│   │   ┌────────┬────────┬────────┬────────┬────────┬────────┐          │   │
│   │   │ Buf 0  │ Buf 1  │ Buf 2  │ Buf 3  │  ...   │ Buf N  │          │   │
│   │   │ 8KB    │ 8KB    │ 8KB    │ 8KB    │        │ 8KB    │          │   │
│   │   └────────┴────────┴────────┴────────┴────────┴────────┘          │   │
│   │                                                                     │   │
│   │   BUFFER TABLE (hash table: BufferTag → buffer_id)                 │   │
│   │   ┌─────────────────────────────────────────────────────────────┐  │   │
│   │   │ BufferTag = (RelFileNode, ForkNumber, BlockNumber)         │  │   │
│   │   │                                                             │  │   │
│   │   │ Hash bucket 0: tag1→buf5, tag2→buf9                        │  │   │
│   │   │ Hash bucket 1: tag3→buf12                                  │  │   │
│   │   │ Hash bucket 2: (empty)                                     │  │   │
│   │   │ ...                                                        │  │   │
│   │   └─────────────────────────────────────────────────────────────┘  │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   BUFFER DESCRIPTOR CONTENTS:                                                │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  tag: (relfilenode, forknum, blocknum) - identifies the page       │   │
│   │  buf_id: index in buffer pool                                       │   │
│   │  state: (BM_LOCKED | BM_DIRTY | BM_VALID | BM_TAG_VALID | ...)     │   │
│   │  refcount: number of backends pinning this buffer                   │   │
│   │  usage_count: for clock-sweep (0-5)                                │   │
│   │  wait_backend_pgprocno: for buffer pin waits                       │   │
│   │  freeNext: link in freelist                                        │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Clock-Sweep Algorithm Details

/* Simplified clock-sweep from src/backend/storage/buffer/freelist.c */

static BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    int         tries;
    uint32      local_buf_state;

    for (;;)
    {
        /* Circularly scan through buffer pool */
        buf = GetBufferDescriptor(nextVictimBuffer);
        nextVictimBuffer = (nextVictimBuffer + 1) % NBuffers;

        local_buf_state = LockBufHdr(buf);

        /* Skip if pinned */
        if (BUF_STATE_GET_REFCOUNT(local_buf_state) != 0)
        {
            UnlockBufHdr(buf, local_buf_state);
            continue;
        }

        /* If usage_count > 0, decrement and continue */
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) > 0)
        {
            local_buf_state -= BUF_USAGECOUNT_ONE;
            UnlockBufHdr(buf, local_buf_state);
            continue;
        }

        /* Found a victim! */
        /* Pin it and return */
        local_buf_state += BUF_REFCOUNT_ONE;
        *buf_state = local_buf_state;
        return buf;
    }
}

Ring Buffer Strategy

Performance pitfall: Without ring buffers, a single SELECT * FROM big_table (sequential scan) would evict every hot page from the buffer pool, replacing your carefully cached OLTP data with pages you will never read again. This is called “buffer pool pollution.” PostgreSQL detects large sequential scans (> 1/4 of shared_buffers) and automatically confines them to a tiny 256KB ring buffer. The scan reuses the same 32 buffers over and over, leaving the rest of the pool untouched. If you see poor cache hit ratios after an analytics query, check whether the table was small enough to dodge the ring buffer threshold.
┌─────────────────────────────────────────────────────────────────────────────┐
│                     BUFFER ACCESS STRATEGIES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Large sequential scans can "pollute" the buffer pool.                     │
│   PostgreSQL uses RING BUFFERS to prevent this.                             │
│                                                                              │
│   NORMAL ACCESS:                                                             │
│   Uses main buffer pool, full clock-sweep algorithm                         │
│                                                                              │
│   BULK READ (sequential scan > 1/4 of shared_buffers):                      │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │                    Ring of 256KB (32 buffers)                      │    │
│   │   ┌───┬───┬───┬───┬───┬───┬───┬───┐                               │    │
│   │   │ 0 │ 1 │ 2 │ 3 │...│30 │31 │ 0 │ ← wraps around                │    │
│   │   └───┴───┴───┴───┴───┴───┴───┴───┘                               │    │
│   │   Reuses same 32 buffers, doesn't evict hot data                  │    │
│   └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│   BULK WRITE (COPY, CREATE TABLE AS SELECT, etc.):                          │
│   Ring of 16MB (2048 buffers)                                               │
│                                                                              │
│   VACUUM:                                                                    │
│   Ring of 256KB (32 buffers)                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

6. TOAST (The Oversized-Attribute Storage Technique)

Real-world analogy: TOAST is PostgreSQL’s “overflow storage” system. Think of each 8KB page as a standard mailing envelope. Most letters (rows) fit inside easily. But sometimes you need to send a poster (a large TEXT or JSONB column). TOAST handles this automatically: first, it tries folding the poster (compression with LZ4 or pglz). If it still does not fit, TOAST ships the poster in a separate box (the TOAST table) and puts a claim ticket (TOAST pointer) in the original envelope. When you SELECT that column, PostgreSQL follows the claim ticket to fetch the poster. The critical insight: if your query never selects that column, the poster is never fetched — saving significant I/O.

TOAST Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         TOAST SYSTEM                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Problem: Page size is 8KB, but attributes can be larger                   │
│   Solution: TOAST - automatically handles large values                       │
│                                                                              │
│   TOAST STRATEGIES (per column):                                            │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                                                                     │   │
│   │   PLAIN  - No compression, no out-of-line storage                  │   │
│   │            (for fixed-size types like integer)                     │   │
│   │                                                                     │   │
│   │   MAIN   - Try compression first, out-of-line only if needed      │   │
│   │            (default for most variable-length types)                │   │
│   │                                                                     │   │
│   │   EXTERNAL - No compression, out-of-line if > threshold           │   │
│   │              (for pre-compressed data like images)                 │   │
│   │                                                                     │   │
│   │   EXTENDED - Compression + out-of-line storage                     │   │
│   │              (default for text, bytea, jsonb)                      │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   TOAST THRESHOLD: ~2KB (TOAST_TUPLE_THRESHOLD)                             │
│                                                                              │
│   STORAGE PROCESS:                                                           │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                                                                     │   │
│   │   1. If tuple fits in page → store inline                          │   │
│   │                                                                     │   │
│   │   2. If too large:                                                 │   │
│   │      a. Try compression (LZ4 or pglz)                              │   │
│   │      b. If still too large → move to TOAST table                  │   │
│   │                                                                     │   │
│   │   3. TOAST table stores chunks:                                    │   │
│   │      ┌──────────┬──────────┬──────────┬──────────┐                │   │
│   │      │ chunk_id │ chunk_seq│ chunk_data (≤2KB)  │                │   │
│   │      │ chunk_id │ chunk_seq│ chunk_data         │                │   │
│   │      │ chunk_id │ chunk_seq│ chunk_data         │                │   │
│   │      └──────────┴──────────┴──────────┴──────────┘                │   │
│   │                                                                     │   │
│   │   4. Main tuple stores TOAST pointer:                              │   │
│   │      ┌──────────────────────────────────────────────────┐         │   │
│   │      │ va_header │ toast_relid │ chunk_id │ chunk_size │         │   │
│   │      └──────────────────────────────────────────────────┘         │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
-- View TOAST table -- every TOASTable table gets a companion pg_toast.pg_toast_<OID> table
SELECT relname, reltoastrelid::regclass 
FROM pg_class WHERE relname = 'articles';

-- Check TOAST storage -- pg_column_size shows the on-disk inline size,
-- length() shows the decompressed logical size. A big gap means compression is working.
SELECT 
    pg_column_size(content) AS inline_size,
    length(content) AS actual_length
FROM articles LIMIT 5;

-- Change TOAST strategy
-- EXTERNAL = no compression, out-of-line storage. Use this for already-compressed
-- data (images, protobuf) to avoid wasting CPU on compression that yields no savings.
ALTER TABLE articles ALTER COLUMN content SET STORAGE EXTERNAL;

-- Performance pitfall: SELECT * on a table with large TOAST columns decompresses
-- and fetches every oversized value even if you do not need it. Always select only
-- the columns you need. The difference can be 100x in I/O for tables with large
-- TEXT/JSONB columns.

7. Practice: Analyze Storage

Hands-On Exercise

-- Create test table
CREATE TABLE storage_test (
    id SERIAL PRIMARY KEY,
    small_text VARCHAR(100),
    large_text TEXT,
    json_data JSONB
);

-- Insert data
INSERT INTO storage_test (small_text, large_text, json_data)
SELECT 
    'Small text ' || i,
    repeat('Large text content ', 1000),
    jsonb_build_object('key', i, 'data', repeat('x', 500))
FROM generate_series(1, 1000) i;

-- Analyze page layout
SELECT * FROM page_header(get_raw_page('storage_test', 0));

-- Check tuple sizes
SELECT 
    id,
    pg_column_size(small_text) AS small_size,
    pg_column_size(large_text) AS large_size,
    pg_column_size(json_data) AS json_size,
    pg_column_size(storage_test.*) AS row_size
FROM storage_test LIMIT 5;

-- Find TOAST table
SELECT c.relname, t.relname AS toast_table
FROM pg_class c
JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relname = 'storage_test';

-- Check TOAST chunks
SELECT chunk_id, chunk_seq, length(chunk_data)
FROM pg_toast.pg_toast_<oid> LIMIT 10;

8. Comparison: PostgreSQL vs LSM Trees

Understanding how PostgreSQL differs from LSM-based databases (RocksDB, LevelDB, Cassandra).
┌─────────────────────────────────────────────────────────────────────────────┐
│              POSTGRESQL (B-TREE) vs LSM TREE COMPARISON                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   POSTGRESQL B-TREE:                     LSM TREE:                          │
│   ─────────────────                      ─────────                          │
│                                                                              │
│   ┌──────────────┐                       ┌──────────────┐                   │
│   │   B-Tree     │                       │  MemTable    │ ← writes go here  │
│   │   (on disk)  │                       │  (in memory) │   (sorted)        │
│   │              │                       └──────┬───────┘                   │
│   │  ┌────────┐  │                              │ flush                     │
│   │  │  Root  │  │                              ▼                           │
│   │  └───┬────┘  │                       ┌──────────────┐                   │
│   │      │       │                       │   Level 0    │ (immutable)       │
│   │  ┌───┴───┐   │                       │   SSTable    │                   │
│   │  ▼       ▼   │                       └──────┬───────┘                   │
│   │┌────┐ ┌────┐ │                              │ compact                   │
│   ││Leaf│ │Leaf│ │                              ▼                           │
│   │└────┘ └────┘ │                       ┌──────────────┐                   │
│   │              │                       │   Level 1    │                   │
│   │ In-place     │                       │   SSTables   │                   │
│   │ updates      │                       └──────┬───────┘                   │
│   └──────────────┘                              │                           │
│                                                 ▼                           │
│                                          ┌──────────────┐                   │
│                                          │   Level N    │                   │
│                                          │   SSTables   │                   │
│                                          └──────────────┘                   │
│                                                                              │
│   WRITES:                                WRITES:                            │
│   • In-place update                      • Append to MemTable               │
│   • Random I/O                           • Sequential I/O (batch)           │
│   • WAL for durability                   • WAL + MemTable flush             │
│                                                                              │
│   READS:                                 READS:                             │
│   • O(log N) B-Tree traversal           • Check MemTable first              │
│   • Single page usually                 • Then each level                   │
│   • Very predictable                     • May check many SSTables          │
│                                          • Bloom filters help               │
│                                                                              │
│   SPACE:                                 SPACE:                             │
│   • Dead tuples until VACUUM            • Write amplification               │
│   • No write amplification              • Compaction overhead               │
│   • MVCC overhead                        • Efficient for deletes            │
│                                                                              │
│   BEST FOR:                              BEST FOR:                          │
│   • Read-heavy OLTP                      • Write-heavy workloads            │
│   • Point lookups                        • Time-series data                 │
│   • Range scans                          • High-volume ingestion            │
│   • Mixed workloads                      • SSD storage                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

9. Source Code References

Data StructureSource FilesKey Functions
Page Layoutsrc/include/storage/bufpage.hPageInit, PageAddItem
Heap Tuplesrc/include/access/htup_details.hheap_form_tuple, heap_deform_tuple
B-Treesrc/backend/access/nbtree/_bt_search, _bt_insertonpg
Buffer Poolsrc/backend/storage/buffer/bufmgr.cReadBuffer, MarkBufferDirty
TOASTsrc/backend/access/heap/tuptoaster.ctoast_insert_or_update
GINsrc/backend/access/gin/ginInsert, gingetbitmap
GiSTsrc/backend/access/gist/gistinsert, gistgettuple


6. Alternative Data Structures

While PostgreSQL primarily uses B-Trees and Heaps, understanding other structures is crucial for database engineering.

Log-Structured Merge-Tree (LSM Tree)

Used by RocksDB, CockroachDB, and Cassandra. Optimized for high write throughput. Key Concepts:
  • Writes: Go to in-memory MemTable (fast).
  • Flushes: MemTable flushed to disk as SSTable (Sorted String Table).
  • Compaction: Background process merges SSTables to reclaim space and remove deleted data.
  • Read Path: Check MemTable -> L0 SSTables -> L1 -> …

Skip List

Used by Redis (Sorted Sets) and some in-memory databases. Probabilistic alternative to balanced trees. Key Concepts:
  • Layers: Multiple linked lists. Bottom layer has all nodes.
  • Search: Start at top, move right until value > target, then move down.
  • Complexity: Average O(log n) for search/insert/delete.
  • Concurrency: Easier to implement lock-free than B-Trees.

Hash Index (Chaining)

Used in PostgreSQL Hash Indexes (for equality checks only) and many in-memory maps. Key Concepts:
  • Buckets: Array of pointers.
  • Collisions: Handled by chaining (linked list) or open addressing.
  • O(1) average access time.
  • Limitation: No range queries (unlike B-Trees).

Next Module

Query Processing Deep Dive

From SQL text to execution results — the complete pipeline

Interview Deep-Dive

Strong Answer:
  • B-Tree (PostgreSQL, MySQL InnoDB): In-place updates, reads traverse a balanced tree in O(log n). Writes may trigger leaf splits (allocating a new page, moving half the keys, updating the parent). Optimized for read-heavy workloads with mixed reads and writes. Operational implication: fragmentation over time requires REINDEX, and the tree must be locked during splits (briefly).
  • LSM-Tree (RocksDB, Cassandra, CockroachDB’s storage): Writes go to an in-memory memtable, then flush as immutable sorted files (SSTables) on disk. Reads must check the memtable plus multiple SSTable levels. Background compaction merges SSTables to reduce read amplification. Optimized for write-heavy workloads. Operational implication: write amplification from compaction (data is rewritten multiple times), and read latency can spike if compaction falls behind (too many levels to check).
  • Tradeoff framework: B-Tree has lower read amplification but higher write amplification (random I/O for in-place updates). LSM-Tree has lower write amplification (sequential I/O) but higher read amplification (multiple files to check) and space amplification (data exists in multiple levels during compaction). For OLTP with balanced read/write: B-Tree. For time-series ingestion or write-heavy logging: LSM-Tree.
  • CockroachDB chose LSM (RocksDB/Pebble) because distributed writes benefit from sequential I/O patterns, and the Raft consensus layer already adds write latency that dwarfs the B-Tree vs LSM difference.
Follow-up: PostgreSQL uses B-Trees for indexes but heaps for tables. Why not use B-Trees for both like InnoDB’s clustered indexes?InnoDB stores table rows directly in the primary key B-Tree (clustered index). This makes primary key lookups extremely fast (single B-Tree traversal returns the row) but secondary index lookups require a double lookup (secondary index to find the primary key, then primary key B-Tree to find the row). PostgreSQL’s heap stores rows in insertion order with a separate B-Tree index pointing to heap locations (TID). This means every index is equally efficient (single index traversal + one heap lookup) but primary key lookups are not privileged. PostgreSQL’s design makes HOT updates possible (update without touching indexes) and avoids the “index maintenance cascade” that InnoDB faces when a primary key changes.
Strong Answer:
  • The 8KB page size is a compile-time constant that affects every data structure in PostgreSQL. Each heap page holds multiple tuples, each B-Tree page holds multiple index entries, and each WAL record references 8KB pages. The page size determines the granularity of I/O, buffering, and locking.
  • Why 8KB: It aligns with common filesystem block sizes and SSD page sizes. It is small enough to avoid wasting buffer pool space on partially-used pages, but large enough to amortize the per-page overhead (24-byte header, line pointer array). For typical OLTP row sizes (100-500 bytes), 8KB pages hold 10-50 rows, which is a reasonable unit of I/O.
  • Larger pages (16KB, 32KB): Fewer I/O operations for sequential scans (reading one 32KB page instead of four 8KB pages). More rows per page means B-Trees are shallower (wider fan-out). But: more wasted space for small rows, full_page_writes in WAL become larger (inflating WAL volume), and buffer pool granularity is coarser (one hot row pins 32KB instead of 8KB). Some databases (Oracle defaults to 8KB but supports 16/32KB) use larger pages for data warehouse workloads.
  • Smaller pages (4KB): Finer buffer pool granularity, less WAL inflation from full_page_writes. But: deeper B-Trees (more I/O per lookup), more page-level overhead, and more line pointers to manage. Generally not beneficial.
  • Practical note: changing PostgreSQL’s page size requires recompilation and full data reload. It is not a tuning knob — it is an architectural decision made at cluster creation.
Follow-up: How does TOAST interact with the page size limitation?TOAST activates when a tuple exceeds approximately 2KB (one quarter of the page size). Since the maximum tuple size must be less than one page (a single tuple cannot span pages in a heap file), and PostgreSQL wants multiple tuples per page for efficiency, TOAST compresses and/or out-of-lines large column values. With a 32KB page size, TOAST would activate at ~8KB per tuple instead of ~2KB, allowing larger inline storage but consuming more buffer pool per page. This is one reason larger page sizes benefit analytical workloads with wide rows.
Strong Answer:
  • A GIN (Generalized Inverted Index) maps each distinct value (or element) to a posting list of row TIDs that contain that value. Think of it as a book’s back-of-the-book index: “PostgreSQL” -> pages 5, 23, 47, 89. For array columns, each array element gets its own posting list entry. For JSONB, each key and value gets an entry. For full-text search, each lexeme gets an entry.
  • B-Tree vs GIN: A B-Tree indexes whole column values and is optimal for equality and range queries on scalar values. A GIN index decomposes composite values (arrays, JSONB, text documents) into individual elements and indexes each element separately. You cannot efficiently ask a B-Tree “which rows contain the element ‘postgresql’ somewhere in their tags array?” but a GIN index answers this in O(1) lookup + posting list scan.
  • GIN advantages: (1) Containment queries on arrays: WHERE tags @> ARRAY['postgresql']. (2) JSONB key existence: WHERE data ? 'email'. (3) Full-text search: WHERE tsvector @@ tsquery. These are impossible or extremely slow with B-Tree because B-Tree treats the entire array/JSONB as a single opaque value.
  • GIN disadvantages: (1) Slower writes — inserting one row with a 10-element array requires 10 posting list updates. GIN mitigates this with a “pending list” that batches insertions, but at the cost of slower reads until the pending list is flushed. (2) Larger on disk than a B-Tree for the same column if the column has low cardinality elements. (3) Not useful for range queries — GIN answers “contains X” not “between X and Y.”
Follow-up: What is the GIN pending list and how does it affect query performance?To avoid the write penalty of updating multiple posting lists per INSERT, GIN stores new entries in an unsorted “pending list” (a linear list of entries not yet merged into the main index). Reads must check both the main index and the pending list, which degrades read performance as the pending list grows. VACUUM or explicit gin_clean_pending_list() merges the pending list into the main index. Tune gin_pending_list_limit (default 4MB) — higher values reduce write overhead but increase read overhead. For read-heavy workloads, set it lower and vacuum frequently.