Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
PostgreSQL Data Structures
Understanding the data structures inside PostgreSQL is essential for performance optimization, debugging, and contributing to the codebase. This module covers everything from page layout to B-Tree internals.Estimated Time: 14-16 hours
Difficulty: Advanced
Prerequisite: Basic understanding of data structures
OSS Relevance: Critical — core storage code
Difficulty: Advanced
Prerequisite: Basic understanding of data structures
OSS Relevance: Critical — core storage code
1. Page Layout
Every data file in PostgreSQL is divided into 8KB pages (configurable at compile time). Understanding page layout is fundamental. Real-world analogy: Think of each 8KB page as a page in a physical notebook. The top of the page has a header with the page number and date (page header with LSN and checksum). Just below the header is a table of contents listing where each note starts (line pointers). The actual notes (tuples) are written from the bottom of the page upward. Free space sits in the middle — the table of contents grows down while the notes grow up. When they meet, the page is full. This “meet in the middle” design lets PostgreSQL add both line pointers and tuples without shuffling existing data.Page Structure
Line Pointer (ItemId)
2. Heap Tuple Structure
Heap tuples are the actual row data stored in table pages. Real-world analogy: A heap tuple is like a shipping package. The tuple header is the shipping label — it tells you who sent it (xmin: the transaction that created it), who might be returning it (xmax: the transaction that deleted or updated it), and a tracking number (ctid: physical location). The infomask flags are the “FRAGILE” or “THIS SIDE UP” stickers — metadata about how to handle the package. The actual column data is the contents inside the box. Every time you “update” a row, PostgreSQL does not rip open the box and swap items. Instead, it ships a whole new package and stamps the old one as “return to sender.”Tuple Layout
Key Infomask Flags
MVCC Visibility
3. B-Tree Index Structure
B-Trees are the default and most common index type in PostgreSQL. Real-world analogy: A B-Tree index works exactly like the index at the back of a textbook. The root page is the main alphabetical directory (“A-F: page 1, G-L: page 2”). Internal pages are like sub-sections that narrow your search further. Leaf pages are the final entries listing the exact page numbers (TIDs) where each term appears. The key difference with a B+ tree — which PostgreSQL uses — is that the leaf pages are also linked together left-to-right, like a chain. This means once you find “JavaScript” you can slide along the chain to “Java,” “Jest,” “JIT” without climbing back up the tree. That is what makes range scans (WHERE name BETWEEN 'J' AND 'K') efficient.
B+ Tree Structure
PostgreSQL uses B+ trees, where all data is stored in leaf nodes and internal nodes only contain keys for navigation: Key Characteristics:- Internal nodes: Only store keys and pointers (no data)
- Leaf nodes: Store key-value pairs (TIDs pointing to heap tuples)
- Leaf linking: All leaf nodes are linked together for efficient range scans
- Fanout: Higher fanout than B-trees (more keys per internal node)
B-Tree Page Layout
Lehman-Yao Algorithm
PostgreSQL uses the Lehman-Yao B-Tree algorithm for concurrent access:B-Tree Operations: Insertion and Splitting
Understanding how B-trees handle insertions and node splits is crucial for performance tuning:- Search: Navigate from root to appropriate leaf node
- Insert: Add key-value pair to leaf node
- Split if needed: If node exceeds maximum keys, split into two nodes
- Promote: Push middle key up to parent node
- Recursive: Parent may also need to split (propagates up)
- Best case: O(log n) - no splits needed
- Worst case: O(log n) - splits propagate to root
- Write amplification: Splits require multiple page writes
- Fillfactor: PostgreSQL uses 90% fillfactor by default to reduce splits
B-Tree Inspection
GiST (Generalized Search Tree)
BRIN (Block Range Index)
Real-world analogy: A BRIN index is like the date labels on archive boxes in a warehouse. You do not index every document — you just label each box with the earliest and latest date it contains. When someone asks for documents from January 10-12, you scan the labels, grab only the relevant boxes, and search within them. The trade-off is obvious: you might open a box that contains January 8-14 even though you only need 10-12 (false positives), but labeling boxes is nearly free compared to indexing every individual document. BRIN indexes can be 1000x smaller than B-Tree indexes on the same column.5. Buffer Pool
The buffer pool is PostgreSQL’s main memory cache for data pages. Real-world analogy: The buffer pool is like the desk of a researcher in a library. The library’s stacks (disk) hold millions of books (pages), but the researcher can only have a few dozen on their desk (shared_buffers) at once. When they need a new book, they check the desk first (buffer hit). If it is not there, they walk to the stacks (buffer miss — expensive I/O). When the desk is full and a new book is needed, they must put one back. The clock-sweep algorithm decides which book to return: frequently referenced books get a “do not return yet” sticky note (usage_count > 0), and the algorithm sweeps the desk like a clock hand, peeling off one sticky note per pass. A book with no sticky notes left gets returned to the stacks.Buffer Pool Architecture
Clock-Sweep Algorithm Details
Ring Buffer Strategy
Performance pitfall: Without ring buffers, a singleSELECT * FROM big_table (sequential scan) would evict every hot page from the buffer pool, replacing your carefully cached OLTP data with pages you will never read again. This is called “buffer pool pollution.” PostgreSQL detects large sequential scans (> 1/4 of shared_buffers) and automatically confines them to a tiny 256KB ring buffer. The scan reuses the same 32 buffers over and over, leaving the rest of the pool untouched. If you see poor cache hit ratios after an analytics query, check whether the table was small enough to dodge the ring buffer threshold.
6. TOAST (The Oversized-Attribute Storage Technique)
Real-world analogy: TOAST is PostgreSQL’s “overflow storage” system. Think of each 8KB page as a standard mailing envelope. Most letters (rows) fit inside easily. But sometimes you need to send a poster (a large TEXT or JSONB column). TOAST handles this automatically: first, it tries folding the poster (compression with LZ4 or pglz). If it still does not fit, TOAST ships the poster in a separate box (the TOAST table) and puts a claim ticket (TOAST pointer) in the original envelope. When you SELECT that column, PostgreSQL follows the claim ticket to fetch the poster. The critical insight: if your query never selects that column, the poster is never fetched — saving significant I/O.TOAST Overview
7. Practice: Analyze Storage
Hands-On Exercise
8. Comparison: PostgreSQL vs LSM Trees
Understanding how PostgreSQL differs from LSM-based databases (RocksDB, LevelDB, Cassandra).9. Source Code References
| Data Structure | Source Files | Key Functions |
|---|---|---|
| Page Layout | src/include/storage/bufpage.h | PageInit, PageAddItem |
| Heap Tuple | src/include/access/htup_details.h | heap_form_tuple, heap_deform_tuple |
| B-Tree | src/backend/access/nbtree/ | _bt_search, _bt_insertonpg |
| Buffer Pool | src/backend/storage/buffer/bufmgr.c | ReadBuffer, MarkBufferDirty |
| TOAST | src/backend/access/heap/tuptoaster.c | toast_insert_or_update |
| GIN | src/backend/access/gin/ | ginInsert, gingetbitmap |
| GiST | src/backend/access/gist/ | gistinsert, gistgettuple |
6. Alternative Data Structures
While PostgreSQL primarily uses B-Trees and Heaps, understanding other structures is crucial for database engineering.Log-Structured Merge-Tree (LSM Tree)
Used by RocksDB, CockroachDB, and Cassandra. Optimized for high write throughput. Key Concepts:- Writes: Go to in-memory MemTable (fast).
- Flushes: MemTable flushed to disk as SSTable (Sorted String Table).
- Compaction: Background process merges SSTables to reclaim space and remove deleted data.
- Read Path: Check MemTable -> L0 SSTables -> L1 -> …
Skip List
Used by Redis (Sorted Sets) and some in-memory databases. Probabilistic alternative to balanced trees. Key Concepts:- Layers: Multiple linked lists. Bottom layer has all nodes.
- Search: Start at top, move right until value > target, then move down.
- Complexity: Average O(log n) for search/insert/delete.
- Concurrency: Easier to implement lock-free than B-Trees.
Hash Index (Chaining)
Used in PostgreSQL Hash Indexes (for equality checks only) and many in-memory maps. Key Concepts:- Buckets: Array of pointers.
- Collisions: Handled by chaining (linked list) or open addressing.
- O(1) average access time.
- Limitation: No range queries (unlike B-Trees).
Next Module
Query Processing Deep Dive
From SQL text to execution results — the complete pipeline
Interview Deep-Dive
Compare B-Tree and LSM-Tree data structures. When would you choose each, and what are the operational implications?
Compare B-Tree and LSM-Tree data structures. When would you choose each, and what are the operational implications?
Strong Answer:
- B-Tree (PostgreSQL, MySQL InnoDB): In-place updates, reads traverse a balanced tree in O(log n). Writes may trigger leaf splits (allocating a new page, moving half the keys, updating the parent). Optimized for read-heavy workloads with mixed reads and writes. Operational implication: fragmentation over time requires REINDEX, and the tree must be locked during splits (briefly).
- LSM-Tree (RocksDB, Cassandra, CockroachDB’s storage): Writes go to an in-memory memtable, then flush as immutable sorted files (SSTables) on disk. Reads must check the memtable plus multiple SSTable levels. Background compaction merges SSTables to reduce read amplification. Optimized for write-heavy workloads. Operational implication: write amplification from compaction (data is rewritten multiple times), and read latency can spike if compaction falls behind (too many levels to check).
- Tradeoff framework: B-Tree has lower read amplification but higher write amplification (random I/O for in-place updates). LSM-Tree has lower write amplification (sequential I/O) but higher read amplification (multiple files to check) and space amplification (data exists in multiple levels during compaction). For OLTP with balanced read/write: B-Tree. For time-series ingestion or write-heavy logging: LSM-Tree.
- CockroachDB chose LSM (RocksDB/Pebble) because distributed writes benefit from sequential I/O patterns, and the Raft consensus layer already adds write latency that dwarfs the B-Tree vs LSM difference.
Explain how PostgreSQL's 8KB page size affects performance and what tradeoffs a different page size would create.
Explain how PostgreSQL's 8KB page size affects performance and what tradeoffs a different page size would create.
Strong Answer:
- The 8KB page size is a compile-time constant that affects every data structure in PostgreSQL. Each heap page holds multiple tuples, each B-Tree page holds multiple index entries, and each WAL record references 8KB pages. The page size determines the granularity of I/O, buffering, and locking.
- Why 8KB: It aligns with common filesystem block sizes and SSD page sizes. It is small enough to avoid wasting buffer pool space on partially-used pages, but large enough to amortize the per-page overhead (24-byte header, line pointer array). For typical OLTP row sizes (100-500 bytes), 8KB pages hold 10-50 rows, which is a reasonable unit of I/O.
- Larger pages (16KB, 32KB): Fewer I/O operations for sequential scans (reading one 32KB page instead of four 8KB pages). More rows per page means B-Trees are shallower (wider fan-out). But: more wasted space for small rows, full_page_writes in WAL become larger (inflating WAL volume), and buffer pool granularity is coarser (one hot row pins 32KB instead of 8KB). Some databases (Oracle defaults to 8KB but supports 16/32KB) use larger pages for data warehouse workloads.
- Smaller pages (4KB): Finer buffer pool granularity, less WAL inflation from full_page_writes. But: deeper B-Trees (more I/O per lookup), more page-level overhead, and more line pointers to manage. Generally not beneficial.
- Practical note: changing PostgreSQL’s page size requires recompilation and full data reload. It is not a tuning knob — it is an architectural decision made at cluster creation.
What is a GIN index and how does its inverted structure differ from a B-Tree? When does GIN outperform B-Tree and vice versa?
What is a GIN index and how does its inverted structure differ from a B-Tree? When does GIN outperform B-Tree and vice versa?
Strong Answer:
- A GIN (Generalized Inverted Index) maps each distinct value (or element) to a posting list of row TIDs that contain that value. Think of it as a book’s back-of-the-book index: “PostgreSQL” -> pages 5, 23, 47, 89. For array columns, each array element gets its own posting list entry. For JSONB, each key and value gets an entry. For full-text search, each lexeme gets an entry.
- B-Tree vs GIN: A B-Tree indexes whole column values and is optimal for equality and range queries on scalar values. A GIN index decomposes composite values (arrays, JSONB, text documents) into individual elements and indexes each element separately. You cannot efficiently ask a B-Tree “which rows contain the element ‘postgresql’ somewhere in their tags array?” but a GIN index answers this in O(1) lookup + posting list scan.
- GIN advantages: (1) Containment queries on arrays:
WHERE tags @> ARRAY['postgresql']. (2) JSONB key existence:WHERE data ? 'email'. (3) Full-text search:WHERE tsvector @@ tsquery. These are impossible or extremely slow with B-Tree because B-Tree treats the entire array/JSONB as a single opaque value. - GIN disadvantages: (1) Slower writes — inserting one row with a 10-element array requires 10 posting list updates. GIN mitigates this with a “pending list” that batches insertions, but at the cost of slower reads until the pending list is flushed. (2) Larger on disk than a B-Tree for the same column if the column has low cardinality elements. (3) Not useful for range queries — GIN answers “contains X” not “between X and Y.”
VACUUM or explicit gin_clean_pending_list() merges the pending list into the main index. Tune gin_pending_list_limit (default 4MB) — higher values reduce write overhead but increase read overhead. For read-heavy workloads, set it lower and vacuum frequently.