PostgreSQL Data Structures

Understanding the data structures inside PostgreSQL is essential for performance optimization, debugging, and contributing to the codebase. This module covers everything from page layout to B-Tree internals.
Estimated Time: 14-16 hours
Difficulty: Advanced
Prerequisite: Basic understanding of data structures
OSS Relevance: Critical — core storage code

1. Page Layout

Every data file in PostgreSQL is divided into 8KB pages (configurable at compile time). Understanding page layout is fundamental.

Page Structure

Every page has the same skeleton: a fixed 24-byte PageHeaderData, a line pointer array that grows forward, free space, tuple data that grows backward from the end of the page, and an optional "special space" used by index access methods. pd_lower marks the end of the line pointers and pd_upper the start of tuple data; the gap between them is the page's free space.

┌────────────┬──────────────────┬ · · free space · · ┬──────────────┬─────────┐
│ PageHeader │ line pointers →  │                    │ ← tuple data │ special │
└────────────┴──────────────────┴────────────────────┴──────────────┴─────────┘
                       pd_lower ┘                    └ pd_upper

Line Pointer (ItemId)

/* src/include/storage/itemid.h */
typedef struct ItemIdData {
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

/* Line pointer flags */
#define LP_UNUSED       0       /* unused (available) */
#define LP_NORMAL       1       /* used (tuple data exists) */
#define LP_REDIRECT     2       /* HOT redirect (points to another line pointer) */
#define LP_DEAD         3       /* dead (tuple was deleted) */
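To make the bitfield concrete, here is a minimal Python sketch that unpacks a 32-bit ItemIdData word into its three fields. It assumes the common little-endian bitfield order (lp_off in the low bits); the actual in-memory layout is compiler- and platform-dependent:

# Sketch: unpack a 4-byte ItemIdData word into its three bitfields.
# Assumes lp_off in bits 0-14, lp_flags in bits 15-16, lp_len in
# bits 17-31 (the usual little-endian bitfield layout).

LP_FLAG_NAMES = {0: "LP_UNUSED", 1: "LP_NORMAL", 2: "LP_REDIRECT", 3: "LP_DEAD"}

def decode_item_id(word: int) -> dict:
    return {
        "lp_off":   word & 0x7FFF,                 # offset to tuple from page start
        "lp_flags": LP_FLAG_NAMES[(word >> 15) & 0x3],
        "lp_len":   (word >> 17) & 0x7FFF,         # byte length of tuple
    }

# Example: a normal line pointer at offset 8016, tuple length 176
print(decode_item_id(8016 | (1 << 15) | (176 << 17)))
# {'lp_off': 8016, 'lp_flags': 'LP_NORMAL', 'lp_len': 176}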

2. Heap Tuple Structure

Heap tuples are the actual row data stored in table pages.

Tuple Layout

Each tuple starts with a 23-byte HeapTupleHeaderData: t_xmin and t_xmax (the inserting and deleting transaction IDs), t_cid (command ID, overlaid with t_xvac), t_ctid (the tuple's current or updated location), the t_infomask2 and t_infomask flag words, and t_hoff (offset to the user data). An optional null bitmap follows the header, then the attribute data itself.

Key Infomask Flags

/* src/include/access/htup_details.h */

/* Tuple has been inserted/deleted */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */

/* Locking flags */
#define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
#define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax is for locking only */
#define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-share lock */
#define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is an exclusive lock */

/* Other t_infomask flags */
#define HEAP_HASNULL            0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002  /* has variable-width attribute(s) */
#define HEAP_HASEXTERNAL        0x0004  /* has external stored attribute(s) */
#define HEAP_HASOID_OLD         0x0008  /* has an OID (pre-v12 tables only) */

/* Note: these two live in t_infomask2, not t_infomask */
#define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE         0x8000  /* this is a heap-only tuple */
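As a quick tool for experiments, here is a small Python sketch (not part of PostgreSQL) that decodes a t_infomask value such as the ones reported by pageinspect's heap_page_items; HEAP_HOT_UPDATED and HEAP_ONLY_TUPLE would be decoded from t_infomask2 the same way:

# Sketch: decode a t_infomask value using the flag bits listed above.

T_INFOMASK_FLAGS = {
    0x0001: "HEAP_HASNULL",
    0x0002: "HEAP_HASVARWIDTH",
    0x0004: "HEAP_HASEXTERNAL",
    0x0008: "HEAP_HASOID_OLD",
    0x0010: "HEAP_XMAX_KEYSHR_LOCK",
    0x0040: "HEAP_XMAX_EXCL_LOCK",
    0x0080: "HEAP_XMAX_LOCK_ONLY",
    0x0100: "HEAP_XMIN_COMMITTED",
    0x0200: "HEAP_XMIN_INVALID",
    0x0400: "HEAP_XMAX_COMMITTED",
    0x0800: "HEAP_XMAX_INVALID",
    0x1000: "HEAP_XMAX_IS_MULTI",
}

def decode_infomask(mask: int) -> list[str]:
    return [name for bit, name in T_INFOMASK_FLAGS.items() if mask & bit]

print(decode_infomask(0x0902))
# ['HEAP_HASVARWIDTH', 'HEAP_XMIN_COMMITTED', 'HEAP_XMAX_INVALID']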

MVCC Visibility

-- View tuple visibility info
SELECT ctid, xmin, xmax, 
       CASE WHEN xmax = 0 THEN 'visible' 
            WHEN xmax::text::bigint > txid_current() THEN 'visible (xmax in future)'
            ELSE 'check clog' END AS status
FROM users;

-- Check transaction status in CLOG
-- pg_xact/0000 contains commit status for the first ~1 million XIDs
-- (a 256KB segment holds 4 XID statuses per byte)
-- Each XID uses 2 bits: 00=in_progress, 01=committed, 10=aborted, 11=sub-committed
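The 2-bits-per-XID packing is easy to demonstrate. A Python sketch of the lookup, assuming the layout described above (4 XIDs per byte, lowest-numbered XID in the low-order bits), not a byte-exact reimplementation of pg_xact:

STATUS = {0b00: "in_progress", 0b01: "committed", 0b10: "aborted", 0b11: "sub_committed"}

def xid_status(clog: bytes, xid: int) -> str:
    byte = clog[xid // 4]                 # 4 XIDs packed per byte
    shift = (xid % 4) * 2                 # 2 bits per XID
    return STATUS[(byte >> shift) & 0b11]

clog = bytes([0b01_10_01_00])  # XID 0: in_progress, 1: committed, 2: aborted, 3: committed
print([xid_status(clog, x) for x in range(4)])
# ['in_progress', 'committed', 'aborted', 'committed']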

3. B-Tree Index Structure

B-Trees are the default and most common index type in PostgreSQL.

B+ Tree Structure

PostgreSQL uses B+ trees, in which all data is stored in leaf nodes and internal nodes contain only keys for navigation.

Key Characteristics:
  • Internal nodes: store only keys and child pointers (no data)
  • Leaf nodes: store key-TID pairs (TIDs point to heap tuples)
  • Leaf linking: all leaf nodes are linked together for efficient range scans
  • Fanout: higher fanout than plain B-trees (more keys per internal node)

B-Tree Page Layout

┌─────────────────────────────────────────────────────────────────────────────┐
│                    B-TREE INDEX STRUCTURE                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   META PAGE (block 0)                                                        │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  btm_magic, btm_version, btm_root, btm_level, btm_fastroot, etc.   │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   ROOT PAGE (internal node, level = tree height)                            │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  PageHeader + BTPageOpaqueData                                      │   │
│   │  ┌─────────────────────────────────────────────────────────────┐   │   │
│   │  │ High Key │ Key₁,Ptr₁ │ Key₂,Ptr₂ │ ... │ KeyN,PtrN │       │   │   │
│   │  └─────────────────────────────────────────────────────────────┘   │   │
│   │                     │          │          │                         │   │
│   └─────────────────────┼──────────┼──────────┼─────────────────────────┘   │
│                         │          │          │                             │
│           ┌─────────────┘          │          └─────────────┐               │
│           ▼                        ▼                        ▼               │
│   ┌──────────────┐         ┌──────────────┐         ┌──────────────┐       │
│   │ INTERNAL     │         │ INTERNAL     │         │ INTERNAL     │       │
│   │ PAGE         │         │ PAGE         │         │ PAGE         │       │
│   │ (level N-1)  │         │ (level N-1)  │         │ (level N-1)  │       │
│   └──────┬───────┘         └──────┬───────┘         └──────┬───────┘       │
│          │                        │                        │                │
│          ▼                        ▼                        ▼                │
│   ┌──────────────┐         ┌──────────────┐         ┌──────────────┐       │
│   │ LEAF PAGE    │◀───────▶│ LEAF PAGE    │◀───────▶│ LEAF PAGE    │       │
│   │ (level 0)    │ sibling │ (level 0)    │ sibling │ (level 0)    │       │
│   │              │  links  │              │  links  │              │       │
│   │ Key,TID pairs│         │ Key,TID pairs│         │ Key,TID pairs│       │
│   └──────────────┘         └──────────────┘         └──────────────┘       │
│                                                                              │
│   LEAF PAGE DETAIL:                                                          │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  PageHeader                                                         │   │
│   │  ┌─────────────────────────────────────────────────────────────┐   │   │
│   │  │ High Key │ Key₁,TID₁ │ Key₂,TID₂ │ Key₃,TID₃ │ ...         │   │   │
│   │  └─────────────────────────────────────────────────────────────┘   │   │
│   │  BTPageOpaqueData:                                                  │   │
│   │  ┌─────────────────────────────────────────────────────────────┐   │   │
│   │  │ btpo_prev │ btpo_next │ btpo_level │ btpo_flags            │   │   │
│   │  └─────────────────────────────────────────────────────────────┘   │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Lehman-Yao Algorithm

PostgreSQL uses the Lehman-Yao B-Tree algorithm for concurrent access:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LEHMAN-YAO CONCURRENT B-TREE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Key Innovations:                                                           │
│                                                                              │
│   1. HIGH KEY: Each page stores the highest key that can be on that page   │
│      ┌───────────────────────────────────────────────────────────┐          │
│      │ High Key = 50 │ 10 │ 20 │ 30 │ 40 │                      │          │
│      └───────────────────────────────────────────────────────────┘          │
│      All keys on this page are ≤ 50                                         │
│                                                                              │
│   2. RIGHT-LINK: Every page has a pointer to its right sibling             │
│      ┌────────────┐     ┌────────────┐     ┌────────────┐                  │
│      │ Page A     │────▶│ Page B     │────▶│ Page C     │                  │
│      │ keys ≤ 50  │     │ keys ≤ 100 │     │ keys ≤ ∞   │                  │
│      └────────────┘     └────────────┘     └────────────┘                  │
│                                                                              │
│   3. MOVE RIGHT: If key > high key, follow right-link                      │
│                                                                              │
│   SEARCH ALGORITHM:                                                          │
│   ─────────────────                                                          │
│   search(key):                                                               │
│     page = root                                                              │
│     while not leaf(page):                                                    │
│       while key > high_key(page):    # concurrent split happened           │
│         page = right_link(page)      # move right                          │
│       page = find_child(page, key)                                          │
│     while key > high_key(page):      # check at leaf level too             │
│       page = right_link(page)                                               │
│     return search_in_page(page, key)                                        │
│                                                                              │
│   SPLIT ALGORITHM:                                                           │
│   ────────────────                                                           │
│   1. Create new page (becomes right sibling)                                │
│   2. Move upper half of keys to new page                                    │
│   3. Set high key of old page = first key of new page                      │
│   4. Set right-link of old page = new page                                  │
│   5. Insert downlink to new page in parent                                  │
│                                                                              │
│   BENEFIT: Searches never block, even during splits!                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
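The search pseudocode above translates almost directly into runnable Python. This toy version (its Page class is invented for illustration, not PostgreSQL's page layout) shows how the move-right rule works:

import math

class Page:
    def __init__(self, keys, high_key, right=None, children=None):
        self.keys = keys                # sorted keys on this page
        self.high_key = high_key        # upper bound; inf on the rightmost page
        self.right = right              # right-sibling link
        self.children = children        # child pages, or None for a leaf

def search(page, key):
    while True:
        while key > page.high_key:      # concurrent split happened: move right
            page = page.right
        if page.children is None:       # leaf level
            return key in page.keys
        i = 0                           # find the child covering `key`
        while i < len(page.keys) and key > page.keys[i]:
            i += 1
        page = page.children[i]

inf = math.inf
leaf1 = Page([10, 20], high_key=20)
leaf2 = Page([30, 40], high_key=40)
leaf3 = Page([50, 60], high_key=inf)
leaf1.right, leaf2.right = leaf2, leaf3
root = Page([20, 40], high_key=inf, children=[leaf1, leaf2, leaf3])
print(search(root, 30), search(root, 99))   # True False

# If leaf2 split while we were descending, its high_key would shrink and
# the move-right loop would simply follow the right link - readers never
# block on a split in progress.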

B-Tree Operations: Insertion and Splitting

Understanding how B-trees handle insertions and node splits is crucial for performance tuning; the Python implementation below walks through the same logic.

Insertion Process:
  1. Search: navigate from the root to the appropriate leaf node
  2. Insert: add the key-value pair to the leaf node
  3. Split if needed: if the node exceeds its maximum number of keys, split it into two nodes
  4. Promote: push the middle key up into the parent node
  5. Recurse: the parent may also need to split (this can propagate up to the root)

Performance Implications:
  • Best case: O(log n) - no splits needed
  • Worst case: O(log n) - splits propagate all the way to the root
  • Write amplification: splits require multiple page writes
  • Fillfactor: PostgreSQL fills B-Tree leaf pages to 90% by default to reduce splits

B-Tree Inspection

-- Install pageinspect
CREATE EXTENSION pageinspect;

-- View B-Tree meta page
SELECT * FROM bt_metap('users_pkey');
-- magic | version | root | level | fastroot | fastlevel | ...

-- View B-Tree page stats
SELECT * FROM bt_page_stats('users_pkey', 1);
-- blkno | type | live_items | dead_items | avg_item_size | ...

-- View B-Tree page items
SELECT * FROM bt_page_items('users_pkey', 1);
-- itemoffset | ctid | itemlen | nulls | vars | data

Practical B-Tree Implementation (Python)

A simplified in-memory B-Tree to demonstrate splitting and insertion logic:

class BTreeNode:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []          # sorted keys in this node
        self.children = []      # child nodes (empty for leaves)

class BTree:
    def __init__(self, t):
        self.root = BTreeNode(True)
        self.t = t              # minimum degree: nodes hold at most 2t - 1 keys

    def insert(self, k):
        root = self.root
        if len(root.keys) == (2 * self.t) - 1:
            # Root is full: grow the tree by one level, then split
            temp = BTreeNode()
            self.root = temp
            temp.children.insert(0, root)
            self.split_child(temp, 0)
            self.insert_non_full(temp, k)
        else:
            self.insert_non_full(root, k)

    def insert_non_full(self, x, k):
        i = len(x.keys) - 1
        if x.leaf:
            # Shift larger keys right to make room for k
            x.keys.append(None)
            while i >= 0 and k < x.keys[i]:
                x.keys[i + 1] = x.keys[i]
                i -= 1
            x.keys[i + 1] = k
        else:
            # Find the child that should receive k
            while i >= 0 and k < x.keys[i]:
                i -= 1
            i += 1
            if len(x.children[i].keys) == (2 * self.t) - 1:
                self.split_child(x, i)        # split a full child before descending
                if k > x.keys[i]:
                    i += 1
            self.insert_non_full(x.children[i], k)

    def split_child(self, x, i):
        # Split full child y of x into y and a new right sibling z,
        # promoting y's median key into x
        t = self.t
        y = x.children[i]
        z = BTreeNode(y.leaf)
        x.children.insert(i + 1, z)
        x.keys.insert(i, y.keys[t - 1])       # median key moves up
        z.keys = y.keys[t: (2 * t) - 1]       # upper half goes to z
        y.keys = y.keys[0: t - 1]             # lower half stays in y
        if not y.leaf:
            z.children = y.children[t: 2 * t]
            y.children = y.children[0: t]
4. Other Index Types

Hash Index

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HASH INDEX STRUCTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   META PAGE (block 0)                                                        │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  hashm_ntuples, hashm_ffactor, hashm_bsize, hashm_maxbucket, etc.  │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   BUCKET PAGES (selected by the low-order bits of hash(key) = 0b…XXXX):     │
│   ┌───────────────────┐                                                     │
│   │ Bucket 0          │                                                     │
│   │ Bucket 1          │                                                     │
│   │ Bucket 2          │                                                     │
│   │ Bucket 3          │                                                     │
│   │ …                 │                                                     │
│   │ Bucket 2^N - 1    │                                                     │
│   └───────────────────┘                                                     │
│                                                                              │
│   OVERFLOW PAGES (when a bucket fills up):                                   │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐                │
│   │ Bucket Page  │────▶│ Overflow 1   │────▶│ Overflow 2   │                │
│   │ (primary)    │     │              │     │              │                │
│   └──────────────┘     └──────────────┘     └──────────────┘                │
│                                                                              │
│   Only supports equality lookups (=)                                         │
│   O(1) average case, but can degrade with many overflow pages                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
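A sketch of the bucket-selection idea from the diagram: the low-order bits of the hash value pick the bucket. PostgreSQL's real hash index uses its own per-type hash functions and grows the bucket count incrementally via linear hashing; this shows only the addressing idea:

import hashlib

def bucket_for(key: str, n_bits: int) -> int:
    # Any stable 64-bit hash works for the sketch; PostgreSQL uses its own.
    h = int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "little")
    return h & ((1 << n_bits) - 1)      # low n_bits select the bucket

print([bucket_for(k, 2) for k in ("apple", "banana", "cherry")])  # values in 0..3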

GIN (Generalized Inverted Index)

┌─────────────────────────────────────────────────────────────────────────────┐
│                          GIN INDEX STRUCTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Perfect for: Full-text search, JSONB, arrays, hstore                       │
│                                                                              │
│   STRUCTURE:                                                                 │
│   Entry Tree (B-Tree): each distinct key value (word, array element,         │
│   JSON key) is an entry pointing to a posting tree/list of TIDs.             │
│                                                                              │
│   ┌─────────┬──────────┬──────────┬────────┬─────┐                           │
│   │ "apple" │ "banana" │ "cherry" │ "date" │ ... │                           │
│   └────┬────┴────┬─────┴────┬─────┴───┬────┴─────┘                           │
│        ▼         ▼          ▼         ▼                                      │
│   ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐                                │
│   │Posting │ │Posting │ │Posting │ │Posting │                                │
│   │Tree/   │ │Tree/   │ │Tree/   │ │Tree/   │                                │
│   │List    │ │List    │ │List    │ │List    │                                │
│   │        │ │        │ │        │ │        │                                │
│   │TID1    │ │TID3    │ │TID1    │ │TID2    │                                │
│   │TID2    │ │TID5    │ │TID4    │ │TID7    │                                │
│   │TID4    │ │TID9    │ │TID8    │ │...     │                                │
│   └────────┘ └────────┘ └────────┘ └────────┘                                │
│                                                                              │
│   PENDING LIST (for fast writes):                                            │
│   Newly inserted entries go to the pending list first, then are              │
│   bulk-merged into the main structure during VACUUM or when the              │
│   list grows too large.                                                      │
│                                                                              │
│   gin_pending_list_limit (default 4MB)                                       │
│   fastupdate = on (default)                                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

-- GIN index for full-text search
CREATE INDEX idx_articles_fts ON articles USING GIN (to_tsvector('english', content));

-- GIN for JSONB
CREATE INDEX idx_data_gin ON events USING GIN (properties jsonb_path_ops);

-- Check pending-list state (pageinspect)
SELECT * FROM gin_metapage_info(get_raw_page('idx_articles_fts', 0));
-- pending_head | pending_tail | ... | n_pending_pages | n_pending_tuples | ...
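The entry-tree/posting-list split is essentially an inverted index. A toy Python version of the idea (the real GIN stores entries in a B-Tree and turns large posting lists into their own trees):

from collections import defaultdict

index = defaultdict(list)              # entry key -> posting list of row ids
docs = {1: "apple banana", 2: "banana cherry", 3: "apple cherry date"}
for tid, text in docs.items():
    for word in set(text.split()):
        index[word].append(tid)        # one posting per containing row

print(sorted(index["banana"]))                              # [1, 2]
print(sorted(set(index["apple"]) & set(index["cherry"])))   # AND query -> [3]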

GiST (Generalized Search Tree)

┌─────────────────────────────────────────────────────────────────────────────┐
│                         GiST INDEX STRUCTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Perfect for: Geometric data, ranges, full-text (alternative to GIN)      │
│                                                                              │
│   KEY CONCEPT: Bounding box containment hierarchy                           │
│                                                                              │
│   R-TREE EXAMPLE (for geometric data):                                       │
│                                                                              │
│              ┌─────────────────────────────────────────┐                    │
│              │            Root Bounding Box            │                    │
│              │  ┌───────────────┬───────────────────┐ │                    │
│              │  │   R1          │        R2         │ │                    │
│              │  │ ┌───┐ ┌───┐  │  ┌───────┐ ┌───┐  │ │                    │
│              │  │ │ a │ │ b │  │  │   c   │ │ d │  │ │                    │
│              │  │ └───┘ └───┘  │  └───────┘ └───┘  │ │                    │
│              │  │              │                    │ │                    │
│              │  └───────────────┴───────────────────┘ │                    │
│              └─────────────────────────────────────────┘                    │
│                                                                              │
│   TREE STRUCTURE:                                                            │
│                                                                              │
│                    ┌─────────────────────┐                                  │
│                    │    Root (bbox)      │                                  │
│                    └──────────┬──────────┘                                  │
│                    ┌──────────┴──────────┐                                  │
│                    ▼                     ▼                                  │
│            ┌───────────────┐     ┌───────────────┐                         │
│            │ R1 (bbox)     │     │ R2 (bbox)     │                         │
│            └───────┬───────┘     └───────┬───────┘                         │
│            ┌───────┴───────┐     ┌───────┴───────┐                         │
│            ▼               ▼     ▼               ▼                         │
│       ┌────────┐      ┌────────┐┌────────┐  ┌────────┐                    │
│       │ a TID  │      │ b TID ││ c TID  │  │ d TID  │                    │
│       └────────┘      └────────┘└────────┘  └────────┘                    │
│                                                                              │
│   QUERY: Find all objects overlapping query box                             │
│   → Descend only into nodes whose bounding box overlaps query               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
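The overlap-pruning rule is easy to express in code. A toy R-tree-style search in Python (the node layout is invented for illustration; boxes are (xmin, ymin, xmax, ymax) tuples):

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, bbox, children=None, entries=None):
        self.bbox = bbox            # bounding box of everything below
        self.children = children    # internal node: child Nodes
        self.entries = entries      # leaf node: (bbox, tid) pairs

def search(node, query, out):
    if not overlaps(node.bbox, query):
        return out                  # prune this whole subtree
    if node.entries is not None:
        out += [tid for bbox, tid in node.entries if overlaps(bbox, query)]
    else:
        for child in node.children:
            search(child, query, out)
    return out

r1 = Node((0, 0, 4, 4), entries=[((1, 1, 2, 2), "a"), ((3, 3, 4, 4), "b")])
r2 = Node((5, 0, 9, 4), entries=[((5, 1, 7, 3), "c"), ((8, 0, 9, 1), "d")])
root = Node((0, 0, 9, 4), children=[r1, r2])
print(search(root, (6, 2, 8, 4), []))   # ['c'] - the R1 subtree is never visited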

BRIN (Block Range Index)

┌─────────────────────────────────────────────────────────────────────────────┐
│                         BRIN INDEX STRUCTURE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Perfect for: Large tables with naturally ordered data (timestamps, etc.)  │
│                                                                              │
│   CONCEPT: Store min/max summary for ranges of blocks                       │
│                                                                              │
│   Table (naturally ordered by timestamp):                                    │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ Block 0-127  │ Block 128-255 │ Block 256-383 │ Block 384-511 │ ... │   │
│   │ Jan 1-7      │ Jan 8-14      │ Jan 15-21     │ Jan 22-28     │     │   │
│   └──────────────┴───────────────┴───────────────┴───────────────┴─────┘   │
│                                                                              │
│   BRIN Index (very small!):                                                  │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ Range 0-127:   min=Jan 1,  max=Jan 7                               │   │
│   │ Range 128-255: min=Jan 8,  max=Jan 14                              │   │
│   │ Range 256-383: min=Jan 15, max=Jan 21                              │   │
│   │ Range 384-511: min=Jan 22, max=Jan 28                              │   │
│   │ ...                                                                 │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   QUERY: WHERE timestamp BETWEEN 'Jan 10' AND 'Jan 12'                      │
│   → Only scan blocks 128-255 (skip everything else!)                        │
│                                                                              │
│   pages_per_range = 128 (default, tunable)                                  │
│   Index size: ~1000x smaller than B-Tree                                    │
│   Trade-off: Less precise (may scan extra blocks)                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
-- BRIN for time-series data
CREATE INDEX idx_events_time_brin ON events USING BRIN (created_at) 
WITH (pages_per_range = 128);

-- Check BRIN summaries
SELECT * FROM brin_page_items(get_raw_page('idx_events_time_brin', 2), 'idx_events_time_brin');
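The pruning logic itself fits in a few lines. A Python sketch of BRIN-style range elimination, using the summaries from the diagram:

ranges = [  # (first_block, min_val, max_val) summaries, one per 128 blocks
    (0,   "2024-01-01", "2024-01-07"),
    (128, "2024-01-08", "2024-01-14"),
    (256, "2024-01-15", "2024-01-21"),
]

lo, hi = "2024-01-10", "2024-01-12"
to_scan = [blk for blk, mn, mx in ranges if mn <= hi and lo <= mx]
print(to_scan)  # [128] -> only blocks 128-255 need to be read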

5. Buffer Pool

The buffer pool is PostgreSQL’s main memory cache for data pages.

Buffer Pool Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                      BUFFER POOL ARCHITECTURE                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SHARED MEMORY                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                                                                     │   │
│   │   BUFFER DESCRIPTORS (BufferDesc array)                            │   │
│   │   ┌────────┬────────┬────────┬────────┬────────┬────────┐          │   │
│   │   │ Desc 0 │ Desc 1 │ Desc 2 │ Desc 3 │  ...   │ Desc N │          │   │
│   │   └───┬────┴───┬────┴───┬────┴───┬────┴────────┴───┬────┘          │   │
│   │       │        │        │        │                  │               │   │
│   │       ▼        ▼        ▼        ▼                  ▼               │   │
│   │   BUFFER POOL (array of 8KB buffers)                               │   │
│   │   ┌────────┬────────┬────────┬────────┬────────┬────────┐          │   │
│   │   │ Buf 0  │ Buf 1  │ Buf 2  │ Buf 3  │  ...   │ Buf N  │          │   │
│   │   │ 8KB    │ 8KB    │ 8KB    │ 8KB    │        │ 8KB    │          │   │
│   │   └────────┴────────┴────────┴────────┴────────┴────────┘          │   │
│   │                                                                     │   │
│   │   BUFFER TABLE (hash table: BufferTag → buffer_id)                 │   │
│   │   ┌─────────────────────────────────────────────────────────────┐  │   │
│   │   │ BufferTag = (RelFileNode, ForkNumber, BlockNumber)         │  │   │
│   │   │                                                             │  │   │
│   │   │ Hash bucket 0: tag1→buf5, tag2→buf9                        │  │   │
│   │   │ Hash bucket 1: tag3→buf12                                  │  │   │
│   │   │ Hash bucket 2: (empty)                                     │  │   │
│   │   │ ...                                                        │  │   │
│   │   └─────────────────────────────────────────────────────────────┘  │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   BUFFER DESCRIPTOR CONTENTS:                                                │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │  tag: (relfilenode, forknum, blocknum) - identifies the page       │   │
│   │  buf_id: index in buffer pool                                       │   │
│   │  state: (BM_LOCKED | BM_DIRTY | BM_VALID | BM_TAG_VALID | ...)     │   │
│   │  refcount: number of backends pinning this buffer                   │   │
│   │  usage_count: for clock-sweep (0-5)                                │   │
│   │  wait_backend_pgprocno: for buffer pin waits                       │   │
│   │  freeNext: link in freelist                                        │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Clock-Sweep Algorithm Details

/* Simplified clock-sweep from src/backend/storage/buffer/freelist.c */

static BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    uint32      local_buf_state;

    for (;;)
    {
        /* Circularly scan through buffer pool */
        buf = GetBufferDescriptor(nextVictimBuffer);
        nextVictimBuffer = (nextVictimBuffer + 1) % NBuffers;

        local_buf_state = LockBufHdr(buf);

        /* Skip if pinned */
        if (BUF_STATE_GET_REFCOUNT(local_buf_state) != 0)
        {
            UnlockBufHdr(buf, local_buf_state);
            continue;
        }

        /* If usage_count > 0, decrement and continue */
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) > 0)
        {
            local_buf_state -= BUF_USAGECOUNT_ONE;
            UnlockBufHdr(buf, local_buf_state);
            continue;
        }

        /* Found a victim! */
        /* Pin it and return */
        local_buf_state += BUF_REFCOUNT_ONE;
        *buf_state = local_buf_state;
        return buf;
    }
}
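The same loop in miniature, as a runnable Python simulation (no locking or freelist handling, and it assumes at least one evictable buffer exists):

class Buf:
    def __init__(self, name):
        self.name = name
        self.pinned = False
        self.usage_count = 0    # capped at 5 in PostgreSQL

def clock_sweep(bufs, hand):
    while True:
        b = bufs[hand]
        hand = (hand + 1) % len(bufs)
        if b.pinned:
            continue            # skip pinned buffers
        if b.usage_count > 0:
            b.usage_count -= 1  # second chance: decrement and move on
            continue
        return b, hand          # found a victim

bufs = [Buf(f"buf{i}") for i in range(4)]
bufs[0].usage_count = 2
bufs[1].pinned = True
victim, hand = clock_sweep(bufs, 0)
print(victim.name)              # buf2: buf0 had chances left, buf1 is pinned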

Ring Buffer Strategy

┌─────────────────────────────────────────────────────────────────────────────┐
│                     BUFFER ACCESS STRATEGIES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Large sequential scans can "pollute" the buffer pool.                     │
│   PostgreSQL uses RING BUFFERS to prevent this.                             │
│                                                                              │
│   NORMAL ACCESS:                                                             │
│   Uses main buffer pool, full clock-sweep algorithm                         │
│                                                                              │
│   BULK READ (sequential scan > 1/4 of shared_buffers):                      │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │                    Ring of 256KB (32 buffers)                      │    │
│   │   ┌───┬───┬───┬───┬───┬───┬───┬───┐                               │    │
│   │   │ 0 │ 1 │ 2 │ 3 │...│30 │31 │ 0 │ ← wraps around                │    │
│   │   └───┴───┴───┴───┴───┴───┴───┴───┘                               │    │
│   │   Reuses same 32 buffers, doesn't evict hot data                  │    │
│   └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│   BULK WRITE (COPY, CREATE TABLE AS SELECT, etc.):                          │
│   Ring of 16MB (2048 buffers)                                               │
│                                                                              │
│   VACUUM:                                                                    │
│   Ring of 256KB (32 buffers)                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

6. TOAST (The Oversized-Attribute Storage Technique)

TOAST Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                         TOAST SYSTEM                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Problem: Page size is 8KB, but attributes can be larger                   │
│   Solution: TOAST - automatically handles large values                       │
│                                                                              │
│   TOAST STRATEGIES (per column):                                            │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                                                                     │   │
│   │   PLAIN  - No compression, no out-of-line storage                  │   │
│   │            (for fixed-size types like integer)                     │   │
│   │                                                                     │   │
│   │   MAIN   - Try compression first; out-of-line only as a last      │   │
│   │            resort (used by types like numeric)                    │   │
│   │                                                                     │   │
│   │   EXTERNAL - No compression, out-of-line if > threshold           │   │
│   │              (for pre-compressed data like images)                 │   │
│   │                                                                     │   │
│   │   EXTENDED - Compression + out-of-line storage                     │   │
│   │              (default for text, bytea, jsonb)                      │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│   TOAST THRESHOLD: ~2KB (TOAST_TUPLE_THRESHOLD)                             │
│                                                                              │
│   STORAGE PROCESS:                                                           │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │                                                                     │   │
│   │   1. If tuple fits in page → store inline                          │   │
│   │                                                                     │   │
│   │   2. If too large:                                                 │   │
│   │      a. Try compression (LZ4 or pglz)                              │   │
│   │      b. If still too large → move to TOAST table                  │   │
│   │                                                                     │   │
│   │   3. TOAST table stores chunks:                                    │   │
│   │      ┌──────────┬──────────┬──────────┬──────────┐                │   │
│   │      │ chunk_id │ chunk_seq│ chunk_data (≤2KB)  │                │   │
│   │      │ chunk_id │ chunk_seq│ chunk_data         │                │   │
│   │      │ chunk_id │ chunk_seq│ chunk_data         │                │   │
│   │      └──────────┴──────────┴──────────┴──────────┘                │   │
│   │                                                                     │   │
│   │   4. Main tuple stores TOAST pointer:                              │   │
│   │      ┌──────────────────────────────────────────────────┐         │   │
│   │      │ va_header │ toast_relid │ chunk_id │ chunk_size │         │   │
│   │      └──────────────────────────────────────────────────┘         │   │
│   │                                                                     │   │
│   └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
-- View TOAST table
SELECT relname, reltoastrelid::regclass 
FROM pg_class WHERE relname = 'articles';

-- Check TOAST storage
SELECT 
    pg_column_size(content) AS inline_size,
    length(content) AS actual_length
FROM articles LIMIT 5;

-- Change TOAST strategy
ALTER TABLE articles ALTER COLUMN content SET STORAGE EXTERNAL;
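A sketch of the chunking step in Python; the chunk size here approximates TOAST_MAX_CHUNK_SIZE for an 8KB page (roughly 2KB minus per-tuple overhead), and the chunk_id value is made up for the example:

CHUNK = 1996                      # ~TOAST_MAX_CHUNK_SIZE for an 8KB page

def toast_chunks(value: bytes, chunk_id: int):
    return [(chunk_id, seq, value[off:off + CHUNK])
            for seq, off in enumerate(range(0, len(value), CHUNK))]

rows = toast_chunks(b"x" * 5000, chunk_id=16385)
print([(cid, seq, len(data)) for cid, seq, data in rows])
# [(16385, 0, 1996), (16385, 1, 1996), (16385, 2, 1008)]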

7. Practice: Analyze Storage

Hands-On Exercise

-- Create test table
CREATE TABLE storage_test (
    id SERIAL PRIMARY KEY,
    small_text VARCHAR(100),
    large_text TEXT,
    json_data JSONB
);

-- Insert data
INSERT INTO storage_test (small_text, large_text, json_data)
SELECT 
    'Small text ' || i,
    repeat('Large text content ', 1000),
    jsonb_build_object('key', i, 'data', repeat('x', 500))
FROM generate_series(1, 1000) i;

-- Analyze page layout
SELECT * FROM page_header(get_raw_page('storage_test', 0));

-- Check tuple sizes
SELECT 
    id,
    pg_column_size(small_text) AS small_size,
    pg_column_size(large_text) AS large_size,
    pg_column_size(json_data) AS json_size,
    pg_column_size(storage_test.*) AS row_size
FROM storage_test LIMIT 5;

-- Find TOAST table
SELECT c.relname, t.relname AS toast_table
FROM pg_class c
JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relname = 'storage_test';

-- Check TOAST chunks (substitute the toast table name found above)
SELECT chunk_id, chunk_seq, length(chunk_data)
FROM pg_toast.pg_toast_<oid> LIMIT 10;

8. Comparison: PostgreSQL vs LSM Trees

Understanding how PostgreSQL differs from LSM-based databases (RocksDB, LevelDB, Cassandra).
┌─────────────────────────────────────────────────────────────────────────────┐
│              POSTGRESQL (B-TREE) vs LSM TREE COMPARISON                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   POSTGRESQL B-TREE:                     LSM TREE:                          │
│   ─────────────────                      ─────────                          │
│                                                                              │
│   ┌──────────────┐                       ┌──────────────┐                   │
│   │   B-Tree     │                       │  MemTable    │ ← writes go here  │
│   │   (on disk)  │                       │  (in memory) │   (sorted)        │
│   │              │                       └──────┬───────┘                   │
│   │  ┌────────┐  │                              │ flush                     │
│   │  │  Root  │  │                              ▼                           │
│   │  └───┬────┘  │                       ┌──────────────┐                   │
│   │      │       │                       │   Level 0    │ (immutable)       │
│   │  ┌───┴───┐   │                       │   SSTable    │                   │
│   │  ▼       ▼   │                       └──────┬───────┘                   │
│   │┌────┐ ┌────┐ │                              │ compact                   │
│   ││Leaf│ │Leaf│ │                              ▼                           │
│   │└────┘ └────┘ │                       ┌──────────────┐                   │
│   │              │                       │   Level 1    │                   │
│   │ In-place     │                       │   SSTables   │                   │
│   │ updates      │                       └──────┬───────┘                   │
│   └──────────────┘                              │                           │
│                                                 ▼                           │
│                                          ┌──────────────┐                   │
│                                          │   Level N    │                   │
│                                          │   SSTables   │                   │
│                                          └──────────────┘                   │
│                                                                              │
│   WRITES:                                WRITES:                            │
│   • In-place update                      • Append to MemTable               │
│   • Random I/O                           • Sequential I/O (batch)           │
│   • WAL for durability                   • WAL + MemTable flush             │
│                                                                              │
│   READS:                                 READS:                             │
│   • O(log N) B-Tree traversal           • Check MemTable first              │
│   • Single page usually                 • Then each level                   │
│   • Very predictable                     • May check many SSTables          │
│                                          • Bloom filters help               │
│                                                                              │
│   SPACE:                                 SPACE:                             │
│   • Dead tuples until VACUUM            • Write amplification               │
│   • Some write amp (page splits)        • Compaction overhead               │
│   • MVCC overhead                        • Efficient for deletes            │
│                                                                              │
│   BEST FOR:                              BEST FOR:                          │
│   • Read-heavy OLTP                      • Write-heavy workloads            │
│   • Point lookups                        • Time-series data                 │
│   • Range scans                          • High-volume ingestion            │
│   • Mixed workloads                      • SSD storage                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

9. Source Code References

Data Structure   Source Files                            Key Functions
──────────────   ──────────────────────────────────────  ──────────────────────────────────
Page Layout      src/include/storage/bufpage.h           PageInit, PageAddItem
Heap Tuple       src/include/access/htup_details.h       heap_form_tuple, heap_deform_tuple
B-Tree           src/backend/access/nbtree/              _bt_search, _bt_insertonpg
Buffer Pool      src/backend/storage/buffer/bufmgr.c     ReadBuffer, MarkBufferDirty
TOAST            src/backend/access/heap/tuptoaster.c    toast_insert_or_update
GIN              src/backend/access/gin/                 ginInsert, gingetbitmap
GiST             src/backend/access/gist/                gistinsert, gistgettuple


10. Alternative Data Structures

While PostgreSQL primarily uses B-Trees and Heaps, understanding other structures is crucial for database engineering.

Log-Structured Merge-Tree (LSM Tree)

Used by RocksDB, CockroachDB, and Cassandra; optimized for high write throughput. A toy implementation follows this list. Key Concepts:
  • Writes: Go to in-memory MemTable (fast).
  • Flushes: MemTable flushed to disk as SSTable (Sorted String Table).
  • Compaction: Background process merges SSTables to reclaim space and remove deleted data.
  • Read Path: Check MemTable -> L0 SSTables -> L1 -> …
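A toy LSM tree in Python showing the write and read paths (compaction is omitted; newer data simply shadows older data):

class ToyLSM:
    def __init__(self, flush_at=4):
        self.memtable = {}
        self.runs = []                 # newest first; each run is a sorted dict
        self.flush_at = flush_at

    def put(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.flush_at:
            # Flush: memtable becomes an immutable sorted run (an "SSTable")
            self.runs.insert(0, dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, k):
        if k in self.memtable:
            return self.memtable[k]
        for run in self.runs:          # check newest run first
            if k in run:
                return run[k]
        return None

lsm = ToyLSM(flush_at=2)
lsm.put("a", 1); lsm.put("b", 2)       # triggers a flush
lsm.put("a", 99)                       # newer value stays in the memtable
print(lsm.get("a"), lsm.get("b"))      # 99 2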

Skip List

Used by Redis (Sorted Sets) and some in-memory databases; a probabilistic alternative to balanced trees, sketched after this list. Key Concepts:
  • Layers: Multiple linked lists. Bottom layer has all nodes.
  • Search: Start at top, move right until value > target, then move down.
  • Complexity: Average O(log n) for search/insert/delete.
  • Concurrency: Easier to implement lock-free than B-Trees.
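A compact Python skip list showing the layered search and the randomized level assignment (p = 0.5 with a capped height; a sketch, not a production structure):

import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # forward[i] = next node at layer i

class SkipList:
    MAX_LEVEL = 8

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)
        self.level = 1

    def _random_level(self):
        lvl = 1
        while random.random() < 0.5 and lvl < self.MAX_LEVEL:
            lvl += 1                    # coin flips decide the node's height
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):   # top layer down
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]            # move right
            update[i] = node                      # last node before key at layer i
        lvl = self._random_level()
        self.level = max(self.level, lvl)
        new = SkipNode(key, lvl)
        for i in range(lvl):                      # splice into each layer
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def search(self, key):
        node = self.head
        for i in range(self.level - 1, -1, -1):   # start high, drop down
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

sl = SkipList()
for k in [3, 7, 1, 9, 5]:
    sl.insert(k)
print(sl.search(7), sl.search(4))   # True False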

Hash Index (Chaining)

Used in PostgreSQL Hash Indexes (for equality checks only) and many in-memory maps; see the sketch after this list. Key Concepts:
  • Buckets: Array of pointers.
  • Collisions: Handled by chaining (linked list) or open addressing.
  • O(1) average access time.
  • Limitation: No range queries (unlike B-Trees).
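A minimal chaining hash table in Python; collisions simply extend the bucket's list:

class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]   # each bucket: list of (k, v)

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)    # overwrite an existing key
                return
        bucket.append((key, value))         # collision -> chain grows

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

ht = ChainedHashTable(n_buckets=4)
ht.put("apple", 1)
ht.put("banana", 2)
print(ht.get("banana"))   # 2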

Next Module

Query Processing Deep Dive

From SQL text to execution results — the complete pipeline