PostgreSQL Data Structures
Understanding the data structures inside PostgreSQL is essential for performance optimization, debugging, and contributing to the codebase. This module covers everything from page layout to B-Tree internals.
Estimated Time: 14-16 hours
Difficulty: Advanced
Prerequisite: Basic understanding of data structures
OSS Relevance: Critical — core storage code
1. Page Layout
Every data file in PostgreSQL is divided into 8KB pages (configurable at compile time). Understanding page layout is fundamental.
Page Structure
Line Pointer (ItemId)
```c
/* src/include/storage/itemid.h */
typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

/* Line pointer flags */
#define LP_UNUSED   0           /* unused (available) */
#define LP_NORMAL   1           /* used (tuple data exists) */
#define LP_REDIRECT 2           /* HOT redirect (points to another line pointer) */
#define LP_DEAD     3           /* dead (tuple was deleted) */
```
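To see how the three bitfields pack into a single 4-byte word, here is a small decoder. It assumes the typical GCC little-endian bitfield layout (`lp_off` in the low bits); bitfield order is compiler-defined, so this is an illustration, not a portable page parser.

```python
# Decode a 4-byte ItemIdData, assuming lp_off occupies the low 15 bits,
# lp_flags the next 2, and lp_len the high 15 bits.

LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD = 0, 1, 2, 3

def decode_item_id(raw: int):
    lp_off = raw & 0x7FFF          # low 15 bits: offset into the page
    lp_flags = (raw >> 15) & 0x3   # 2 bits: line pointer state
    lp_len = (raw >> 17) & 0x7FFF  # high 15 bits: tuple length
    return lp_off, lp_flags, lp_len

def encode_item_id(lp_off: int, lp_flags: int, lp_len: int) -> int:
    return lp_off | (lp_flags << 15) | (lp_len << 17)

raw = encode_item_id(8000, LP_NORMAL, 60)
print(decode_item_id(raw))  # (8000, 1, 60)
```

Note that 15 bits for `lp_off` and `lp_len` is exactly enough to address any byte in a page up to 32KB, which is why 32KB is the maximum compile-time block size.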
2. Heap Tuple Structure
Heap tuples are the actual row data stored in table pages.
Tuple Layout
Key Infomask Flags
```c
/* src/include/access/htup_details.h */

/* Tuple has been inserted/deleted */
#define HEAP_XMIN_COMMITTED     0x0100  /* t_xmin committed */
#define HEAP_XMIN_INVALID       0x0200  /* t_xmin invalid/aborted */
#define HEAP_XMAX_COMMITTED     0x0400  /* t_xmax committed */
#define HEAP_XMAX_INVALID       0x0800  /* t_xmax invalid/aborted */

/* Locking flags */
#define HEAP_XMAX_IS_MULTI      0x1000  /* t_xmax is a MultiXactId */
#define HEAP_XMAX_LOCK_ONLY     0x0080  /* xmax is for locking only */
#define HEAP_XMAX_KEYSHR_LOCK   0x0010  /* xmax is a key-share lock */
#define HEAP_XMAX_EXCL_LOCK     0x0040  /* xmax is an exclusive lock */

/* Other flags */
#define HEAP_HASNULL            0x0001  /* has null attribute(s) */
#define HEAP_HASVARWIDTH        0x0002  /* has variable-width attributes */
#define HEAP_HASEXTERNAL        0x0004  /* has external stored attributes */
#define HEAP_HASOID_OLD         0x0008  /* has an OID (deprecated) */
#define HEAP_HOT_UPDATED        0x4000  /* tuple was HOT-updated */
#define HEAP_ONLY_TUPLE         0x8000  /* this is a heap-only tuple */
```
MVCC Visibility
```sql
-- View tuple visibility info
SELECT ctid, xmin, xmax,
       CASE WHEN xmax = 0 THEN 'visible'
            WHEN xmax::text::bigint > txid_current() THEN 'visible (xmax in future)'
            ELSE 'check clog' END AS status
FROM users;

-- Check transaction status in CLOG
-- Each pg_xact segment (e.g. pg_xact/0000) covers 32 pages x 32768 XIDs ≈ 1M XIDs
-- Each XID uses 2 bits: 00=in_progress, 01=committed, 10=aborted, 11=sub-committed
```
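The 2-bit layout makes locating any XID's status a matter of simple arithmetic. A sketch, assuming the standard constants: 4 XIDs per byte, 8192-byte CLOG pages, and 32 pages per SLRU segment file.

```python
# Map a transaction ID to its 2-bit status slot in pg_xact.

XACTS_PER_BYTE = 4
BLCKSZ = 8192
XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE                 # 32768
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = XACTS_PER_PAGE * PAGES_PER_SEGMENT   # 1048576

def clog_location(xid: int):
    """Return (segment file name, page, byte offset, bit shift) for an XID."""
    segment = xid // XACTS_PER_SEGMENT                   # which pg_xact/NNNN file
    page = (xid // XACTS_PER_PAGE) % PAGES_PER_SEGMENT
    byte_in_page = (xid % XACTS_PER_PAGE) // XACTS_PER_BYTE
    bit_shift = (xid % XACTS_PER_BYTE) * 2               # 2 status bits per XID
    return f"{segment:04X}", page, byte_in_page, bit_shift

print(clog_location(1_000_000))  # ('0000', 30, 4240, 0)
```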
3. B-Tree Index Structure
B-Trees are the default and most common index type in PostgreSQL.
B+ Tree Structure
PostgreSQL uses B+ trees, where all data is stored in leaf nodes and internal nodes only contain keys for navigation:
Key Characteristics:
Internal nodes: Only store keys and pointers (no data)
Leaf nodes: Store key-value pairs (TIDs pointing to heap tuples)
Leaf linking: All leaf nodes are linked together for efficient range scans
Fanout: Higher fanout than plain B-trees (more keys per internal node)
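To make the fanout point concrete, here is a back-of-the-envelope capacity estimate. The 16-byte item size and 40-byte page overhead are illustrative assumptions, not PostgreSQL's exact on-disk numbers.

```python
# Rough B+ tree fanout math for 8KB pages.

PAGE_SIZE = 8192
PAGE_OVERHEAD = 40            # assumed page header + special space
ITEM_SIZE = 16                # assumed key + downlink + item header

fanout = (PAGE_SIZE - PAGE_OVERHEAD) // ITEM_SIZE
print(fanout)                 # 509 downlinks per internal page

# A 3-level tree (root -> internal -> leaf) then addresses:
leaf_pages = fanout ** 2
print(leaf_pages)             # 259081 leaf pages, roughly 2GB of leaves
```

With fanout in the hundreds, even very large indexes stay only four or five levels deep, which is why a B-Tree point lookup touches so few pages.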
B-Tree Page Layout
┌─────────────────────────────────────────────────────────────────────────────┐
│ B-TREE INDEX STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ META PAGE (block 0) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ btm_magic, btm_version, btm_root, btm_level, btm_fastroot, etc. │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ ROOT PAGE (internal node, level = tree height) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PageHeader + BTPageOpaqueData │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ High Key │ Key₁,Ptr₁ │ Key₂,Ptr₂ │ ... │ KeyN,PtrN │ │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │ │ │ │
│ └─────────────────────┼──────────┼──────────┼─────────────────────────┘ │
│ │ │ │ │
│ ┌─────────────┘ │ └─────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ INTERNAL │ │ INTERNAL │ │ INTERNAL │ │
│ │ PAGE │ │ PAGE │ │ PAGE │ │
│ │ (level N-1) │ │ (level N-1) │ │ (level N-1) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ LEAF PAGE │◀───────▶│ LEAF PAGE │◀───────▶│ LEAF PAGE │ │
│ │ (level 0) │ sibling │ (level 0) │ sibling │ (level 0) │ │
│ │ │ links │ │ links │ │ │
│ │ Key,TID pairs│ │ Key,TID pairs│ │ Key,TID pairs│ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ LEAF PAGE DETAIL: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ PageHeader │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ High Key │ Key₁,TID₁ │ Key₂,TID₂ │ Key₃,TID₃ │ ... │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ BTPageOpaqueData: │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ btpo_prev │ btpo_next │ btpo_level │ btpo_flags │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Lehman-Yao Algorithm
PostgreSQL uses the Lehman-Yao B-Tree algorithm for concurrent access:
┌─────────────────────────────────────────────────────────────────────────────┐
│ LEHMAN-YAO CONCURRENT B-TREE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Key Innovations: │
│ │
│ 1. HIGH KEY: Each page stores the highest key that can be on that page │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ High Key = 50 │ 10 │ 20 │ 30 │ 40 │ │ │
│ └───────────────────────────────────────────────────────────┘ │
│ All keys on this page are ≤ 50 │
│ │
│ 2. RIGHT-LINK: Every page has a pointer to its right sibling │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Page A │────▶│ Page B │────▶│ Page C │ │
│ │ keys ≤ 50 │ │ keys ≤ 100 │ │ keys ≤ ∞ │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │
│ 3. MOVE RIGHT: If key > high key, follow right-link │
│ │
│ SEARCH ALGORITHM: │
│ ───────────────── │
│ search(key): │
│ page = root │
│ while not leaf(page): │
│ while key > high_key(page): # concurrent split happened │
│ page = right_link(page) # move right │
│ page = find_child(page, key) │
│ while key > high_key(page): # check at leaf level too │
│ page = right_link(page) │
│ return search_in_page(page, key) │
│ │
│ SPLIT ALGORITHM: │
│ ──────────────── │
│ 1. Create new page (becomes right sibling) │
│ 2. Move upper half of keys to new page │
│ 3. Set high key of old page = first key of new page │
│ 4. Set right-link of old page = new page │
│ 5. Insert downlink to new page in parent │
│ │
│ BENEFIT: Searches never block, even during splits! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
B-Tree Operations: Insertion and Splitting
Understanding how B-trees handle insertions and node splits is crucial for performance tuning:
Insertion Process:
Search: Navigate from root to appropriate leaf node
Insert: Add key-value pair to leaf node
Split if needed: If node exceeds maximum keys, split into two nodes
Promote: Push middle key up to parent node
Recursive: Parent may also need to split (propagates up)
Performance Implications:
Best case: O(log n) - no splits needed
Worst case: O(log n) - splits propagate to root
Write amplification: Splits require multiple page writes
Fillfactor: PostgreSQL uses 90% fillfactor by default to reduce splits
B-Tree Inspection
```sql
-- Install pageinspect
CREATE EXTENSION pageinspect;

-- View B-Tree meta page
SELECT * FROM bt_metap('users_pkey');
-- magic | version | root | level | fastroot | fastlevel | ...

-- View B-Tree page stats
SELECT * FROM bt_page_stats('users_pkey', 1);
-- blkno | type | live_items | dead_items | avg_item_size | ...

-- View B-Tree page items
SELECT * FROM bt_page_items('users_pkey', 1);
-- itemoffset | ctid | itemlen | nulls | vars | data
```
### Practical B-Tree Implementation (Python)
A simplified in-memory B-Tree to demonstrate splitting and insertion logic:
```python
class BTreeNode:
    def __init__(self, leaf=False):
        self.leaf = leaf
        self.keys = []
        self.children = []


class BTree:
    def __init__(self, t):
        self.root = BTreeNode(True)
        self.t = t  # Minimum degree

    def insert(self, k):
        root = self.root
        if len(root.keys) == (2 * self.t) - 1:
            # Root is full: grow the tree by one level
            temp = BTreeNode()
            self.root = temp
            temp.children.insert(0, root)
            self.split_child(temp, 0)
            self.insert_non_full(temp, k)
        else:
            self.insert_non_full(root, k)

    def insert_non_full(self, x, k):
        i = len(x.keys) - 1
        if x.leaf:
            x.keys.append(None)  # placeholder slot for the shift below
            while i >= 0 and k < x.keys[i]:
                x.keys[i + 1] = x.keys[i]
                i -= 1
            x.keys[i + 1] = k
        else:
            while i >= 0 and k < x.keys[i]:
                i -= 1
            i += 1
            if len(x.children[i].keys) == (2 * self.t) - 1:
                self.split_child(x, i)
                if k > x.keys[i]:
                    i += 1
            self.insert_non_full(x.children[i], k)

    def split_child(self, x, i):
        t = self.t
        y = x.children[i]
        z = BTreeNode(y.leaf)
        x.children.insert(i + 1, z)
        x.keys.insert(i, y.keys[t - 1])  # promote the median key
        z.keys = y.keys[t:(2 * t) - 1]
        y.keys = y.keys[0:t - 1]
        if not y.leaf:
            z.children = y.children[t:2 * t]
            y.children = y.children[0:t]
```

---
## 4. Other Index Types
### Hash Index
┌─────────────────────────────────────────────────────────────────────────────┐
│ HASH INDEX STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ META PAGE (block 0) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ hashm_ntuples, hashm_ffactor, hashm_bsize, hashm_maxbucket, etc. │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ BUCKET PAGES (organized by hash value) │
│ │
│ hash(key) = 0b…XXXX │
│ ││││ │
│ ││││ ┌───────────────────┐ │
│ │││└───▶│ Bucket 0 │ │
│ ││└────▶│ Bucket 1 │ │
│ │└─────▶│ Bucket 2 │ │
│ └──────▶│ Bucket 3 │ │
│ │ … │ │
│ │ Bucket 2^N - 1 │ │
│ └───────────────────┘ │
│ │
│ OVERFLOW PAGES (when bucket is full) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Bucket Page │────▶│ Overflow 1 │────▶│ Overflow 2 │ │
│ │ (primary) │ │ │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Only supports equality lookups (=) │
│ O(1) average case, but can degrade with many overflows │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
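The bucket-selection rule can be sketched in a few lines. PostgreSQL's hash AM uses a scheme of this shape (mask the hash with a high mask, then fall back to the low mask when the result names a bucket that does not exist yet); the mask derivation below is a simplification for illustration.

```python
# Linear-hashing bucket selection: take low-order bits of the hash,
# dropping one bit if the result points past the highest allocated bucket.

def bucket_for(hash_value: int, max_bucket: int) -> int:
    mask = (1 << max_bucket.bit_length()) - 1   # enough low bits to cover max_bucket
    b = hash_value & mask
    if b > max_bucket:
        b &= mask >> 1                          # bucket not split yet: use fewer bits
    return b

# With buckets 0..5 allocated, hash 0b1110 (14) maps to bucket 14 & 7 = 6,
# which doesn't exist yet, so it falls back to 14 & 3 = 2:
print(bucket_for(0b1110, 5))  # 2
```

This is what lets the bucket count grow one bucket at a time instead of doubling the whole table at once.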
### GIN (Generalized Inverted Index)
┌─────────────────────────────────────────────────────────────────────────────┐
│ GIN INDEX STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Perfect for: Full-text search, JSONB, arrays, hstore │
│ │
│ STRUCTURE: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Entry Tree (B-Tree) │ │
│ │ │ │
│ │ Each distinct key value (word, array element, JSON key) is an │ │
│ │ entry pointing to a posting tree/list of TIDs │ │
│ │ │ │
│ │ ┌───────────────────────────────────────────────┐ │ │
│ │ │ “apple” │ “banana” │ “cherry” │ “date” │ … │ │ │
│ │ └────┬─────┴────┬─────┴────┬─────┴────┬────┴────┘ │ │
│ │ │ │ │ │ │ │
│ │ ▼ ▼ ▼ ▼ │ │
│ │ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │ │
│ │ │Posting │ │Posting │ │Posting │ │Posting │ │ │
│ │ │Tree/ │ │Tree/ │ │Tree/ │ │Tree/ │ │ │
│ │ │List │ │List │ │List │ │List │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │TID1 │ │TID3 │ │TID1 │ │TID2 │ │ │
│ │ │TID2 │ │TID5 │ │TID4 │ │TID7 │ │ │
│ │ │TID4 │ │TID9 │ │TID8 │ │… │ │ │
│ │ └────────┘ └────────┘ └────────┘ └────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ PENDING LIST (for fast writes): │
│ Newly inserted entries go to pending list first, then bulk-merged │
│ into main structure during VACUUM or when list gets too big │
│ │
│ gin_pending_list_limit (default 4MB) │
│ fastupdate = on (default) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```sql
-- GIN index for full-text search
CREATE INDEX idx_articles_fts ON articles USING GIN (to_tsvector('english', content));

-- GIN for JSONB
CREATE INDEX idx_data_gin ON events USING GIN (properties jsonb_path_ops);

-- Check the pending list via the GIN meta page (pageinspect)
SELECT * FROM gin_metapage_info(get_raw_page('idx_articles_fts', 0));
```
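The GIN structure reduces to a map from each distinct key to a sorted posting list of TIDs, with multi-key queries answered by intersecting posting lists. A toy version:

```python
from bisect import insort

class ToyGIN:
    def __init__(self):
        self.entries = {}                 # key -> sorted posting list of TIDs

    def insert(self, tid, keys):
        for k in keys:
            insort(self.entries.setdefault(k, []), tid)

    def search_all(self, keys):
        """TIDs containing every key (like an AND tsquery)."""
        lists = [set(self.entries.get(k, [])) for k in keys]
        return sorted(set.intersection(*lists)) if lists else []

gin = ToyGIN()
gin.insert((0, 1), ["apple", "cherry"])   # TIDs are (block, offset) pairs
gin.insert((0, 2), ["banana", "apple"])
gin.insert((1, 1), ["apple", "cherry"])
print(gin.search_all(["apple", "cherry"]))  # [(0, 1), (1, 1)]
```

The real index adds the pending list on top of this: inserts append to an unsorted buffer first, and `search_all` would have to scan that buffer too, which is why a huge pending list slows queries down.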
### GiST (Generalized Search Tree)
┌─────────────────────────────────────────────────────────────────────────────┐
│ GiST INDEX STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Perfect for: Geometric data, ranges, full-text (alternative to GIN) │
│ │
│ KEY CONCEPT: Bounding box containment hierarchy │
│ │
│ R-TREE EXAMPLE (for geometric data): │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Root Bounding Box │ │
│ │ ┌───────────────┬───────────────────┐ │ │
│ │ │ R1 │ R2 │ │ │
│ │ │ ┌───┐ ┌───┐ │ ┌───────┐ ┌───┐ │ │ │
│ │ │ │ a │ │ b │ │ │ c │ │ d │ │ │ │
│ │ │ └───┘ └───┘ │ └───────┘ └───┘ │ │ │
│ │ │ │ │ │ │
│ │ └───────────────┴───────────────────┘ │ │
│ └─────────────────────────────────────────┘ │
│ │
│ TREE STRUCTURE: │
│ │
│ ┌─────────────────────┐ │
│ │ Root (bbox) │ │
│ └──────────┬──────────┘ │
│ ┌──────────┴──────────┐ │
│ ▼ ▼ │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ R1 (bbox) │ │ R2 (bbox) │ │
│ └───────┬───────┘ └───────┬───────┘ │
│ ┌───────┴───────┐ ┌───────┴───────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌────────┐┌────────┐ ┌────────┐ │
│ │ a TID │ │ b TID ││ c TID │ │ d TID │ │
│ └────────┘ └────────┘└────────┘ └────────┘ │
│ │
│ QUERY: Find all objects overlapping query box │
│ → Descend only into nodes whose bounding box overlaps query │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
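The pruning rule in the diagram is just a bounding-box overlap test applied at every level. A minimal sketch with hand-built nodes (plain dicts rather than real GiST pages); boxes are `(x1, y1, x2, y2)`:

```python
def overlaps(a, b):
    """True if rectangles a and b share any area (edges count)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rtree_search(node, query, out):
    if "tid" in node:                        # leaf entry: report the TID
        out.append(node["tid"])
        return
    for child in node["children"]:
        if overlaps(child["bbox"], query):   # prune non-overlapping subtrees
            rtree_search(child, query, out)

r1 = {"bbox": (0, 0, 4, 4), "children": [
    {"bbox": (1, 1, 2, 2), "tid": "a"},
    {"bbox": (3, 3, 4, 4), "tid": "b"}]}
r2 = {"bbox": (5, 0, 9, 4), "children": [
    {"bbox": (5, 1, 7, 3), "tid": "c"},
    {"bbox": (8, 0, 9, 1), "tid": "d"}]}
root = {"bbox": (0, 0, 9, 4), "children": [r1, r2]}

hits = []
rtree_search(root, (3, 2, 6, 4), hits)   # query box straddles R1 and R2
print(hits)  # ['b', 'c']
```

Subtrees `a` and `d` are never visited: their bounding boxes fail the overlap test, which is exactly where GiST saves its I/O.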
### BRIN (Block Range Index)
┌─────────────────────────────────────────────────────────────────────────────┐
│ BRIN INDEX STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Perfect for: Large tables with naturally ordered data (timestamps, etc.) │
│ │
│ CONCEPT: Store min/max summary for ranges of blocks │
│ │
│ Table (naturally ordered by timestamp): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Block 0-127 │ Block 128-255 │ Block 256-383 │ Block 384-511 │ ... │ │
│ │ Jan 1-7 │ Jan 8-14 │ Jan 15-21 │ Jan 22-28 │ │ │
│ └──────────────┴───────────────┴───────────────┴───────────────┴─────┘ │
│ │
│ BRIN Index (very small!): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Range 0-127: min=Jan 1, max=Jan 7 │ │
│ │ Range 128-255: min=Jan 8, max=Jan 14 │ │
│ │ Range 256-383: min=Jan 15, max=Jan 21 │ │
│ │ Range 384-511: min=Jan 22, max=Jan 28 │ │
│ │ ... │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ QUERY: WHERE timestamp BETWEEN 'Jan 10' AND 'Jan 12' │
│ → Only scan blocks 128-255 (skip everything else!) │
│ │
│ pages_per_range = 128 (default, tunable) │
│ Index size: ~1000x smaller than B-Tree │
│ Trade-off: Less precise (may scan extra blocks) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```sql
-- BRIN for time-series data
CREATE INDEX idx_events_time_brin ON events USING BRIN (created_at)
WITH (pages_per_range = 128);

-- Check BRIN summaries
SELECT * FROM brin_page_items(get_raw_page('idx_events_time_brin', 2), 'idx_events_time_brin');
```
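The pruning logic is easy to model; the block ranges and day numbers below mirror the diagram:

```python
# BRIN in miniature: keep only (min, max) per block range, then scan
# just the ranges whose summary overlaps the query interval.

def ranges_to_scan(summaries, lo, hi):
    """summaries: list of (first_block, last_block, min_val, max_val)."""
    return [(f, l) for f, l, mn, mx in summaries if mn <= hi and lo <= mx]

summaries = [
    (0,   127, 1,  7),    # blocks 0-127 hold day numbers 1-7 (Jan 1-7)
    (128, 255, 8,  14),
    (256, 383, 15, 21),
    (384, 511, 22, 28),
]

# WHERE day BETWEEN 10 AND 12 -> only one block range survives:
print(ranges_to_scan(summaries, 10, 12))  # [(128, 255)]
```

The same model also shows the trade-off: if rows were inserted out of order, each range's (min, max) widens, more ranges overlap every query, and the index degrades toward a full scan.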
5. Buffer Pool
The buffer pool is PostgreSQL’s main memory cache for data pages.
Buffer Pool Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ BUFFER POOL ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SHARED MEMORY │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ BUFFER DESCRIPTORS (BufferDesc array) │ │
│ │ ┌────────┬────────┬────────┬────────┬────────┬────────┐ │ │
│ │ │ Desc 0 │ Desc 1 │ Desc 2 │ Desc 3 │ ... │ Desc N │ │ │
│ │ └───┬────┴───┬────┴───┬────┴───┬────┴────────┴───┬────┘ │ │
│ │ │ │ │ │ │ │ │
│ │ ▼ ▼ ▼ ▼ ▼ │ │
│ │ BUFFER POOL (array of 8KB buffers) │ │
│ │ ┌────────┬────────┬────────┬────────┬────────┬────────┐ │ │
│ │ │ Buf 0 │ Buf 1 │ Buf 2 │ Buf 3 │ ... │ Buf N │ │ │
│ │ │ 8KB │ 8KB │ 8KB │ 8KB │ │ 8KB │ │ │
│ │ └────────┴────────┴────────┴────────┴────────┴────────┘ │ │
│ │ │ │
│ │ BUFFER TABLE (hash table: BufferTag → buffer_id) │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ BufferTag = (RelFileNode, ForkNumber, BlockNumber) │ │ │
│ │ │ │ │ │
│ │ │ Hash bucket 0: tag1→buf5, tag2→buf9 │ │ │
│ │ │ Hash bucket 1: tag3→buf12 │ │ │
│ │ │ Hash bucket 2: (empty) │ │ │
│ │ │ ... │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ BUFFER DESCRIPTOR CONTENTS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ tag: (relfilenode, forknum, blocknum) - identifies the page │ │
│ │ buf_id: index in buffer pool │ │
│ │ state: (BM_LOCKED | BM_DIRTY | BM_VALID | BM_TAG_VALID | ...) │ │
│ │ refcount: number of backends pinning this buffer │ │
│ │ usage_count: for clock-sweep (0-5) │ │
│ │ wait_backend_pgprocno: for buffer pin waits │ │
│ │ freeNext: link in freelist │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Clock-Sweep Algorithm Details
```c
/* Simplified clock-sweep from src/backend/storage/buffer/freelist.c */
static BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
    BufferDesc *buf;
    uint32      local_buf_state;

    for (;;)
    {
        /* Circularly scan through buffer pool */
        buf = GetBufferDescriptor(nextVictimBuffer);
        nextVictimBuffer = (nextVictimBuffer + 1) % NBuffers;

        local_buf_state = LockBufHdr(buf);

        /* Skip if pinned */
        if (BUF_STATE_GET_REFCOUNT(local_buf_state) != 0)
        {
            UnlockBufHdr(buf, local_buf_state);
            continue;
        }

        /* If usage_count > 0, decrement and continue */
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) > 0)
        {
            local_buf_state -= BUF_USAGECOUNT_ONE;
            UnlockBufHdr(buf, local_buf_state);
            continue;
        }

        /* Found a victim: pin it and return */
        local_buf_state += BUF_REFCOUNT_ONE;
        *buf_state = local_buf_state;
        return buf;
    }
}
```
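The same loop in runnable Python, for experimentation. It assumes at least one unpinned buffer exists (the real code reports an error when every buffer is pinned):

```python
# Clock-sweep: skip pinned buffers, decrement usage_count, and evict
# the first buffer found with refcount == 0 and usage_count == 0.

class Buf:
    def __init__(self):
        self.refcount = 0      # backends pinning this buffer
        self.usage_count = 0   # second-chance counter (0-5 in PostgreSQL)

def clock_sweep(buffers, hand):
    """Return (victim_index, next_hand_position)."""
    while True:
        victim = hand
        buf = buffers[victim]
        hand = (hand + 1) % len(buffers)
        if buf.refcount != 0:
            continue               # pinned: never evict
        if buf.usage_count > 0:
            buf.usage_count -= 1   # give it another lap around the clock
            continue
        buf.refcount = 1           # pin the victim
        return victim, hand

pool = [Buf() for _ in range(4)]
pool[0].refcount = 1               # pinned by some backend
pool[1].usage_count = 1            # recently used
victim, hand = clock_sweep(pool, 0)
print(victim)  # 2: buffer 0 is pinned, buffer 1 got a second chance
```

A frequently-hit buffer keeps its `usage_count` topped up and survives many sweeps, which is how the algorithm approximates LRU without a global list.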
Ring Buffer Strategy
┌─────────────────────────────────────────────────────────────────────────────┐
│ BUFFER ACCESS STRATEGIES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Large sequential scans can "pollute" the buffer pool. │
│ PostgreSQL uses RING BUFFERS to prevent this. │
│ │
│ NORMAL ACCESS: │
│ Uses main buffer pool, full clock-sweep algorithm │
│ │
│ BULK READ (sequential scan > 1/4 of shared_buffers): │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Ring of 256KB (32 buffers) │ │
│ │ ┌───┬───┬───┬───┬───┬───┬───┬───┐ │ │
│ │ │ 0 │ 1 │ 2 │ 3 │...│30 │31 │ 0 │ ← wraps around │ │
│ │ └───┴───┴───┴───┴───┴───┴───┴───┘ │ │
│ │ Reuses same 32 buffers, doesn't evict hot data │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ │
│ BULK WRITE (COPY, CREATE TABLE AS SELECT, etc.): │
│ Ring of 16MB (2048 buffers) │
│ │
│ VACUUM: │
│ Ring of 256KB (32 buffers) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
6. TOAST (The Oversized-Attribute Storage Technique)
TOAST Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ TOAST SYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Problem: Page size is 8KB, but attributes can be larger │
│ Solution: TOAST - automatically handles large values │
│ │
│ TOAST STRATEGIES (per column): │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ PLAIN - No compression, no out-of-line storage │ │
│ │ (for fixed-size types like integer) │ │
│ │ │ │
│ │ MAIN - Try compression first, out-of-line only if needed │ │
│  │              (out-of-line storage only as a last resort)            │  │
│ │ │ │
│ │ EXTERNAL - No compression, out-of-line if > threshold │ │
│ │ (for pre-compressed data like images) │ │
│ │ │ │
│ │ EXTENDED - Compression + out-of-line storage │ │
│ │ (default for text, bytea, jsonb) │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ TOAST THRESHOLD: ~2KB (TOAST_TUPLE_THRESHOLD) │
│ │
│ STORAGE PROCESS: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 1. If tuple fits in page → store inline │ │
│ │ │ │
│ │ 2. If too large: │ │
│ │ a. Try compression (LZ4 or pglz) │ │
│ │ b. If still too large → move to TOAST table │ │
│ │ │ │
│ │ 3. TOAST table stores chunks: │ │
│ │ ┌──────────┬──────────┬──────────┬──────────┐ │ │
│ │ │ chunk_id │ chunk_seq│ chunk_data (≤2KB) │ │ │
│ │ │ chunk_id │ chunk_seq│ chunk_data │ │ │
│ │ │ chunk_id │ chunk_seq│ chunk_data │ │ │
│ │ └──────────┴──────────┴──────────┴──────────┘ │ │
│ │ │ │
│ │ 4. Main tuple stores TOAST pointer: │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ va_header │ toast_relid │ chunk_id │ chunk_size │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```sql
-- View TOAST table
SELECT relname, reltoastrelid::regclass
FROM pg_class WHERE relname = 'articles';

-- Check TOAST storage
SELECT
    pg_column_size(content) AS inline_size,
    length(content) AS actual_length
FROM articles LIMIT 5;

-- Change TOAST strategy
ALTER TABLE articles ALTER COLUMN content SET STORAGE EXTERNAL;
```
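The chunking step from the diagram can be modeled directly. The 2000-byte chunk size and the `16385` chunk_id are illustrative (the real `TOAST_MAX_CHUNK_SIZE` on 8KB pages is slightly under 2KB, and chunk_ids are OIDs assigned by the server):

```python
# Split a large value into TOAST-style (chunk_id, chunk_seq, data) rows,
# then reassemble it by concatenating chunks in chunk_seq order.

CHUNK_SIZE = 2000

def toast_value(chunk_id: int, value: bytes):
    return [(chunk_id, seq, value[off:off + CHUNK_SIZE])
            for seq, off in enumerate(range(0, len(value), CHUNK_SIZE))]

def detoast(chunks):
    return b"".join(data for _, _, data in sorted(chunks, key=lambda c: c[1]))

big = b"x" * 5000
chunks = toast_value(16385, big)
print([(cid, seq, len(data)) for cid, seq, data in chunks])
# [(16385, 0, 2000), (16385, 1, 2000), (16385, 2, 1000)]
assert detoast(chunks) == big
```

Because chunks are addressed by `(chunk_id, chunk_seq)`, the TOAST table's index lets PostgreSQL fetch only the slice of a value it needs, e.g. for `substr()` on an EXTERNAL column.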
7. Practice: Analyze Storage
Hands-On Exercise
```sql
-- Create test table
CREATE TABLE storage_test (
    id SERIAL PRIMARY KEY,
    small_text VARCHAR(100),
    large_text TEXT,
    json_data JSONB
);

-- Insert data
INSERT INTO storage_test (small_text, large_text, json_data)
SELECT
    'Small text ' || i,
    repeat('Large text content ', 1000),
    jsonb_build_object('key', i, 'data', repeat('x', 500))
FROM generate_series(1, 1000) i;

-- Analyze page layout
SELECT * FROM page_header(get_raw_page('storage_test', 0));

-- Check tuple sizes
SELECT
    id,
    pg_column_size(small_text) AS small_size,
    pg_column_size(large_text) AS large_size,
    pg_column_size(json_data) AS json_size,
    pg_column_size(storage_test.*) AS row_size
FROM storage_test LIMIT 5;

-- Find TOAST table
SELECT c.relname, t.relname AS toast_table
FROM pg_class c
JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relname = 'storage_test';

-- Check TOAST chunks (substitute the toast table's actual OID)
SELECT chunk_id, chunk_seq, length(chunk_data)
FROM pg_toast.pg_toast_<oid> LIMIT 10;
```
8. Comparison: PostgreSQL vs LSM Trees
Understanding how PostgreSQL differs from LSM-based databases (RocksDB, LevelDB, Cassandra).
┌─────────────────────────────────────────────────────────────────────────────┐
│ POSTGRESQL (B-TREE) vs LSM TREE COMPARISON │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ POSTGRESQL B-TREE: LSM TREE: │
│ ───────────────── ───────── │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ B-Tree │ │ MemTable │ ← writes go here │
│ │ (on disk) │ │ (in memory) │ (sorted) │
│ │ │ └──────┬───────┘ │
│ │ ┌────────┐ │ │ flush │
│ │ │ Root │ │ ▼ │
│ │ └───┬────┘ │ ┌──────────────┐ │
│ │ │ │ │ Level 0 │ (immutable) │
│ │ ┌───┴───┐ │ │ SSTable │ │
│ │ ▼ ▼ │ └──────┬───────┘ │
│ │┌────┐ ┌────┐ │ │ compact │
│ ││Leaf│ │Leaf│ │ ▼ │
│ │└────┘ └────┘ │ ┌──────────────┐ │
│ │ │ │ Level 1 │ │
│ │ In-place │ │ SSTables │ │
│ │ updates │ └──────┬───────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Level N │ │
│ │ SSTables │ │
│ └──────────────┘ │
│ │
│ WRITES: WRITES: │
│ • In-place update • Append to MemTable │
│ • Random I/O • Sequential I/O (batch) │
│ • WAL for durability • WAL + MemTable flush │
│ │
│ READS: READS: │
│ • O(log N) B-Tree traversal • Check MemTable first │
│ • Single page usually • Then each level │
│ • Very predictable • May check many SSTables │
│ • Bloom filters help │
│ │
│ SPACE: SPACE: │
│ • Dead tuples until VACUUM • Write amplification │
│ • No write amplification • Compaction overhead │
│ • MVCC overhead • Efficient for deletes │
│ │
│ BEST FOR: BEST FOR: │
│ • Read-heavy OLTP • Write-heavy workloads │
│ • Point lookups • Time-series data │
│ • Range scans • High-volume ingestion │
│ • Mixed workloads • SSD storage │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
9. Source Code References
| Data Structure | Source File(s) | Key Functions |
| --- | --- | --- |
| Page Layout | src/include/storage/bufpage.h | PageInit, PageAddItem |
| Heap Tuple | src/include/access/htup_details.h | heap_form_tuple, heap_deform_tuple |
| B-Tree | src/backend/access/nbtree/ | _bt_search, _bt_insertonpg |
| Buffer Pool | src/backend/storage/buffer/bufmgr.c | ReadBuffer, MarkBufferDirty |
| TOAST | src/backend/access/heap/tuptoaster.c | toast_insert_or_update |
| GIN | src/backend/access/gin/ | ginInsert, gingetbitmap |
| GiST | src/backend/access/gist/ | gistinsert, gistgettuple |
10. Alternative Data Structures
While PostgreSQL primarily uses B-Trees and Heaps, understanding other structures is crucial for database engineering.
Log-Structured Merge-Tree (LSM Tree)
Used by RocksDB, CockroachDB, and Cassandra. Optimized for high write throughput.
Key Concepts:
Writes: Go to an in-memory MemTable (fast).
Flushes: MemTable flushed to disk as an SSTable (Sorted String Table).
Compaction: Background process merges SSTables to reclaim space and remove deleted data.
Read Path: Check MemTable -> L0 SSTables -> L1 -> …
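These four concepts fit in a toy implementation (compaction omitted):

```python
# Minimal LSM sketch: writes land in a sorted in-memory memtable; a full
# memtable is frozen into an immutable "SSTable"; reads check the
# memtable first, then SSTables from newest to oldest.

class ToyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []                 # newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # flush: freeze the memtable as a sorted, immutable run
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:          # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = ToyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 9)                   # newer version lives in the memtable
print(db.get("a"), db.get("b"))  # 9 2
```

Note how `get("a")` never reaches the stale `("a", 1)` in the flushed run; compaction's job is to eventually discard such shadowed versions.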
Skip List
Used by Redis (Sorted Sets) and some in-memory databases. Probabilistic alternative to balanced trees.
Key Concepts:
Layers: Multiple linked lists; the bottom layer contains all nodes.
Search: Start at the top, move right while the next value is below the target, then move down a level.
Complexity: Average O(log n) for search/insert/delete.
Concurrency: Easier to implement lock-free than B-Trees.
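A compact sketch of that top-down walk. Node heights are fixed here so the example is deterministic, where a real skip list chooses each height with coin flips:

```python
class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height             # one forward pointer per level

def skiplist_search(head, key):
    """Top-down walk: move right while the next key is below the target."""
    node = head
    for level in range(len(head.next) - 1, -1, -1):
        while node.next[level] and node.next[level].key < key:
            node = node.next[level]             # move right on this lane
    node = node.next[0]                         # drop to the candidate node
    return node is not None and node.key == key

def build_skiplist(keys_with_heights, max_height):
    """Build from sorted (key, height) pairs."""
    head = Node(None, max_height)
    update = [head] * max_height                # rightmost node seen per level
    for key, h in keys_with_heights:
        node = Node(key, h)
        for lvl in range(h):
            node.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = node
        update = [node if lvl < h else update[lvl] for lvl in range(max_height)]
    return head

slist = build_skiplist([(3, 1), (6, 3), (7, 1), (9, 2), (12, 1), (17, 2), (25, 3)], 3)
print(skiplist_search(slist, 17), skiplist_search(slist, 8))  # True False
```

The taller nodes (6 and 25 here) act as express lanes: the search for 17 skips straight from the head to 6 on the top level before descending.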
Hash Index (Chaining)
Used in PostgreSQL Hash Indexes (for equality checks only) and many in-memory maps.
Key Concepts:
Buckets: Array of pointers.
Collisions: Handled by chaining (linked list) or open addressing.
Access: O(1) average access time.
Limitation: No range queries (unlike B-Trees).
Next Module
Query Processing Deep Dive: from SQL text to execution results — the complete pipeline