Memory Management: The Hardware-Software Interface
Memory management is the most complex dance between hardware (CPU/MMU) and software (Kernel). It is not merely about “giving memory to programs”; it is about creating a consistent, isolated, and high-performance execution environment while hiding the messy reality of physical RAM. This module covers everything from basic allocation strategies to the deep internals of 5-level paging and TLB management on modern x86-64 processors.

Interview Frequency: Critical
Key Topics: Page Tables, TLB, Fragmentation, Buddy/Slab Allocators, NUMA
Time to Master: 15-20 hours
1. The Core Architecture: Virtual vs. Physical
At the heart of modern computing lies a lie: every process thinks it has the entire memory space to itself. This is Virtual Memory.
1.1. Why Virtual Memory?
- Isolation: Process A cannot read Process B’s memory.
- Relocatability: A process can be loaded at any physical address without changing its code.
- Efficiency: We only keep the “working set” of a process in physical RAM.
- Security: We can mark certain areas as “No-Execute” or “Read-Only”.
1.2. The Address Space Layout
A standard 64-bit address space is effectively a vast void. On x86-64, only 48 or 57 bits are actually used (canonical addresses).
- User Space (Lower half): Holds the process’s code, data, heap, and stack.
- The “Gap” (Non-canonical): Hardware throws a #GP fault if accessed.
- Kernel Space (Upper half): Mapped into every process, but accessible only in kernel mode.
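
To see the lower half in action, print addresses from different regions of a running process. A minimal sketch, assuming Linux on x86-64 with 48-bit canonical addresses: all four values will fall below the user-space ceiling of 0x00007fffffffffff.

```c
/* Minimal probe: print addresses from different regions of this process's
 * virtual address space (assumes Linux on x86-64, 48-bit layout). */
#include <stdio.h>
#include <stdlib.h>

int global_var; /* .data/.bss: low in the user half */

int main(void) {
    int stack_var;                            /* stack: near the top of user space */
    int *heap_var = malloc(sizeof *heap_var); /* heap: above the executable */

    printf("code  : %p\n", (void *)main);
    printf("global: %p\n", (void *)&global_var);
    printf("heap  : %p\n", (void *)heap_var);
    printf("stack : %p\n", (void *)&stack_var);

    free(heap_var);
    return 0;
}
```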
2. Hardware Support: The MMU and Page Tables
The Memory Management Unit (MMU) is a hardware component in the CPU that performs the translation from Virtual Address (VA) to Physical Address (PA) on every single memory access.
2.1. The Paging Mechanism
Memory is divided into fixed-size Pages (Virtual) and Frames (Physical). A typical page is 4KB.
Virtual Address Breakdown (4KB pages):
- VPN (Virtual Page Number): Used to index into the page table.
- Offset: Position within the 4KB page (lowest 12 bits).
2.2. Multi-Level Page Tables (The x86-64 Walk)
A single-level page table for a 64-bit space would be impossibly large. We use a tree structure. For 4-level paging (standard; the bit-slicing is sketched in code after this list):
- CR3 Register: Holds the physical address of the PML4 (Level 4 table).
- PML4: Uses bits 47-39 of the VA to find the PDPT (Page Directory Pointer Table).
- PDPT: Uses bits 38-30 to find the PD (Page Directory).
- PD: Uses bits 29-21 to find the PT (Page Table).
- PT: Uses bits 20-12 to find the Physical Frame.
- Offset: Bits 11-0 are added to the Frame start to get the byte.
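
The walk is pure bit-slicing. Here is a small sketch that carves a virtual address into the five fields listed above; the example address is arbitrary.

```c
/* Sketch: carve a 48-bit virtual address into the x86-64 4-level
 * paging indices described above (9 bits per level, 12-bit offset). */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va = 0x00007f1234567abcULL; /* example canonical user address */

    unsigned pml4 = (va >> 39) & 0x1ff;  /* bits 47-39: PML4 index  */
    unsigned pdpt = (va >> 30) & 0x1ff;  /* bits 38-30: PDPT index  */
    unsigned pd   = (va >> 21) & 0x1ff;  /* bits 29-21: PD index    */
    unsigned pt   = (va >> 12) & 0x1ff;  /* bits 20-12: PT index    */
    unsigned off  =  va        & 0xfff;  /* bits 11-0 : page offset */

    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}
```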
2.5. From `*p = 42` to a DRAM Cell
To make all the abstractions concrete, walk through a single C statement: `*p = 42;`
- User-space allocator (`malloc`) has already:
  - Reserved a chunk of virtual address space (via `brk()` or `mmap()`).
  - Returned a pointer `p` that lives in your process’s virtual address space.
- CPU executes the store:
  - The compiler emits something like `movl $42, (%rdi)`, where `%rdi` holds `p`.
  - The Virtual Address (VA) in `%rdi` is handed to the MMU.
- MMU + Page Tables:
  - The MMU looks up the VA in the TLB.
  - On a TLB hit: It immediately gets the Physical Frame Number (PFN) and offset.
  - On a TLB miss: It walks the multi-level page tables using the scheme described above (PML4 → PDPT → PD → PT) to find the PFN, then updates the TLB.
- Potential Page Fault:
  - If the page table entry is not present or lacks write permission, the CPU raises a page fault exception.
  - The kernel’s page fault handler (`do_page_fault`) decides whether to allocate a new frame, fetch from swap, or kill the process (e.g., invalid pointer).
- Cache Hierarchy:
  - Once the Physical Address is known, the CPU looks in the L1 data cache.
  - On a miss, it walks out to L2/L3 and finally to DRAM.
  - The cache line (typically 64 bytes) containing the target address is loaded into L1.
- Store Buffer and Write:
  - The value `42` is placed into the CPU’s store buffer and eventually merged into the L1 cache line.
  - Later, the cache coherence protocol (MESI) ensures other cores see the updated value when needed.

To the programmer, `*p = 42;` is a single operation. In reality, it is a coordinated dance between:
- User-space allocator (managing virtual addresses).
- Kernel (managing page tables and page faults).
- MMU/TLB (translating addresses).
- Cache hierarchy and coherence protocol (moving and sharing data).
- DRAM controller (accessing physical cells).
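
You can peek at one link in this chain, the VA → PFN mapping, from user space. The sketch below reads Linux’s `/proc/self/pagemap` (one 64-bit entry per virtual page); on recent kernels the PFN bits read as zero without root, so treat the output as illustrative.

```c
/* Sketch: ask the kernel for the physical frame backing a virtual page
 * via /proc/self/pagemap (Linux-specific; reading real PFNs usually
 * requires root on modern kernels -- unprivileged reads show PFN 0). */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int *p = malloc(sizeof *p);
    *p = 42;                                   /* fault the page in */

    long psize = sysconf(_SC_PAGESIZE);
    uint64_t vpn = (uint64_t)(uintptr_t)p / psize;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t entry;
    /* one 64-bit entry per virtual page */
    if (pread(fd, &entry, sizeof entry, vpn * sizeof entry) != sizeof entry) {
        perror("pread"); return 1;
    }
    close(fd);

    int present  = (entry >> 63) & 1;          /* bit 63: page present */
    uint64_t pfn = entry & ((1ULL << 55) - 1); /* bits 54-0: PFN */

    printf("VA %p -> present=%d PFN=0x%llx\n",
           (void *)p, present, (unsigned long long)pfn);
    free(p);
    return 0;
}
```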
3. The TLB: Translation Lookaside Buffer
To avoid the 4-lookup penalty of a full page table walk on every access, the CPU uses a specialized cache called the TLB.
3.1. TLB Internals
- CAM (Content-Addressable Memory): The TLB is extremely fast because it is built from content-addressable memory: all entries are searched in parallel. It stores (VPN → PFN) mappings.
- TLB Hit: Translation completes in roughly a single cycle.
- TLB Miss: Hardware (or software on some RISC architectures) must perform the page table walk. The sketch below makes this cost visible.
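
A rough way to feel the miss penalty: stride through far more pages than the TLB can cover, one access per page. This is a toy measurement (no warm-up control, no isolation), not a rigorous benchmark.

```c
/* Rough sketch: touch one byte per 4KB page across a region far larger
 * than typical TLB reach, so most accesses miss the TLB and pay for a
 * page-table walk. Timings are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGES (1 << 18) /* 256K pages = 1GB of address space */
#define PAGE  4096

int main(void) {
    volatile char *buf = malloc((size_t)PAGES * PAGE);
    if (!buf) return 1;

    /* Fault every page in first, so we measure TLB misses, not page faults. */
    for (size_t i = 0; i < (size_t)PAGES * PAGE; i += PAGE)
        buf[i] = 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long sum = 0;
    for (size_t i = 0; i < (size_t)PAGES * PAGE; i += PAGE)
        sum += buf[i]; /* one access per page: a TLB-hostile stride */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns/access (sum=%ld)\n", ns / PAGES, sum);
    return 0;
}
```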
3.2. Context Switches and the TLB
When switching from Process A to Process B, the page tables change. Traditionally, we must flush the TLB, because A’s mapping of a given virtual address is different from B’s.
- Performance Hit: A flushed TLB leads to a “cold start” period of high latency.
- Optimization: ASIDs (Address Space IDs): Modern CPUs tag TLB entries with an address-space ID (PCID on Intel). Entries belonging to another address space simply never match, so most context switches no longer require a flush.
4. Contiguous Allocation: The Buddy System
While paging handles user memory, the kernel often needs contiguous physical memory (e.g., for DMA or hugepages).
4.1. The Buddy System Logic
Linux uses the Buddy Allocator. It manages memory in blocks of pages grouped by Orders (an order-n block is 2^n pages).
- Allocation: If you want Order 2 (16KB) but only an Order 4 block (64KB) is free, the allocator splits 64KB → 32KB + 32KB, then one 32KB half → 16KB + 16KB.
- Freeing (Coalescing): When you free a block, the kernel checks its “buddy,” the adjacent block of the same order. If the buddy is also free, they merge back into a single block of the next order, as the sketch below illustrates.
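
The elegance of the scheme is that a block’s buddy is found with a single XOR of the block’s page-frame number. A toy sketch (the helper `buddy_pfn` is illustrative, not kernel code):

```c
#include <stdio.h>

/* For a block of order k (2^k pages), the buddy's page-frame number
 * differs in exactly one bit, so it is found with a single XOR. */
static unsigned long buddy_pfn(unsigned long pfn, unsigned order) {
    return pfn ^ (1UL << order);
}

int main(void) {
    unsigned long pfn = 4; /* an order-2 block: 4 pages = 16KB, at PFN 4 */
    unsigned order = 2;

    unsigned long buddy  = buddy_pfn(pfn, order); /* PFN 0 */
    unsigned long merged = pfn & ~(1UL << order); /* start of merged block */

    printf("buddy of PFN %lu (order %u) is PFN %lu\n", pfn, order, buddy);
    printf("if both are free, they coalesce into an order-%u block at PFN %lu\n",
           order + 1, merged);
    return 0;
}
```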
4.2. Fragmentation
- Internal: Wasted space inside the block (e.g., requesting 17KB gets you 32KB).
- External: Plenty of free memory, but none of it is contiguous enough for a large request.
5. Small Object Allocation: SLAB and SLUB
Requesting a full 4KB page just to store a small kernel object (say, a 32-byte structure) is wasteful.
5.1. The Slab Concept
A Slab is a set of one or more contiguous pages, partitioned into small, fixed-size slots for specific kernel objects (e.g., inodes, dentries, mm_structs).
5.2. Object Caching
Instead of initializing memory every time, the Slab allocator keeps a pool of “constructed” objects. When you free an object, it remains initialized in the “free” pool, ready for the next request.
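
A minimal model of the idea, assuming fixed 64-byte objects carved from a page-sized chunk and recycled through an intrusive free list (the `my_obj` type is a hypothetical stand-in for a real kernel object):

```c
/* Toy sketch of slab-style object caching: carve a page-sized chunk into
 * fixed-size slots and recycle freed objects through a free list instead
 * of returning them to the page allocator. Illustrative only. */
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define SLAB_SIZE 4096

struct my_obj { long id; char payload[56]; }; /* 64-byte objects */

union slot { struct my_obj obj; union slot *next_free; };

struct slab {
    union slot *free_list; /* head of the free-slot list */
    union slot slots[SLAB_SIZE / sizeof(union slot)];
};

static struct slab *slab_create(void) {
    struct slab *s = malloc(sizeof *s); /* stands in for a page allocation */
    if (!s) return NULL;
    s->free_list = NULL;
    size_t n = sizeof s->slots / sizeof s->slots[0];
    for (size_t i = 0; i < n; i++) {    /* thread every slot onto the list */
        s->slots[i].next_free = s->free_list;
        s->free_list = &s->slots[i];
    }
    return s;
}

static struct my_obj *slab_alloc(struct slab *s) {
    if (!s->free_list) return NULL;     /* slab full: a real allocator
                                           would grab another page */
    union slot *slot = s->free_list;
    s->free_list = slot->next_free;
    return &slot->obj;
}

static void slab_free(struct slab *s, struct my_obj *o) {
    union slot *slot = (union slot *)o; /* object stays "constructed" */
    slot->next_free = s->free_list;
    s->free_list = slot;
}

int main(void) {
    struct slab *s = slab_create();
    struct my_obj *a = slab_alloc(s);
    a->id = 1;
    slab_free(s, a);                  /* recycled, not returned to the OS */
    struct my_obj *b = slab_alloc(s); /* the same slot comes back */
    printf("a=%p b=%p same=%d\n", (void *)a, (void *)b, a == b);
    free(s);
    return 0;
}
```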
6. Segmentation: A Historical Perspective
Before Paging became dominant, Segmentation was used to divide memory into logical units (Code, Data, Stack).
- GDT (Global Descriptor Table): A table where each entry defines a segment’s base, limit, and permissions.
- Why it failed: Variable-sized segments led to massive External Fragmentation and complex compaction needs.
- Modern Use: On x86-64, segmentation is mostly disabled, but the `FS` and `GS` segment registers are still used for Thread-Local Storage (TLS) and for pointing to per-CPU kernel data.
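
You can observe the TLS half of this from user space: each thread gets its own copy of a `__thread` variable, which the compiler addresses relative to the `FS` base. A minimal sketch (compile with `-pthread`):

```c
/* Sketch: each thread sees its own copy of a __thread variable; on
 * x86-64 Linux the compiler addresses it relative to the FS segment
 * base, one of the last living uses of segmentation. */
#include <pthread.h>
#include <stdio.h>

static __thread int tls_counter; /* one instance per thread */

static void *worker(void *arg) {
    tls_counter = (int)(long)arg;
    printf("thread %d: &tls_counter = %p\n", tls_counter,
           (void *)&tls_counter); /* a distinct address in each thread */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```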
7. Advanced Memory Features
7.1. HugePages
Standard 4KB pages are small. For a 1TB database, the page tables themselves consume gigabytes.
- HugePages (2MB or 1GB): Reduces the depth of the page table walk (a 2MB mapping stops one level early, at the PD) and increases TLB coverage.
- THP (Transparent HugePages): A Linux kernel feature that automatically attempts to promote 4KB pages to 2MB HugePages in the background.
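
A minimal sketch of requesting a 2MB page explicitly, assuming a Linux system where huge pages have already been reserved (e.g., via `/proc/sys/vm/nr_hugepages`); otherwise the `mmap` call simply fails:

```c
/* Sketch: explicitly request a 2MB huge page with mmap(MAP_HUGETLB).
 * Linux-specific; fails cleanly if no huge pages are reserved. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void) {
    void *p = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)"); /* likely: no hugepages reserved */
        return 1;
    }
    memset(p, 0, HUGE_2MB);          /* one 2MB page, one TLB entry */
    printf("2MB huge page mapped at %p\n", p);
    munmap(p, HUGE_2MB);
    return 0;
}
```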
7.2. NUMA (Non-Uniform Memory Access)
In multi-socket servers, CPU 0 is “closer” to RAM Bank 0 than to RAM Bank 1.
- Local vs Remote: Accessing remote RAM crosses the inter-socket interconnect and is measurably slower than a local access.
- First-Touch Policy: Linux generally allocates physical frames on the node that first writes to the page.
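
When first-touch isn’t enough, placement can be made explicit with libnuma. A sketch, assuming the libnuma development package is installed (link with `-lnuma`) and that node 0 exists:

```c
/* Sketch: place memory on a specific NUMA node with libnuma.
 * Without explicit placement, Linux's first-touch policy decides. */
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not supported on this system\n");
        return 1;
    }
    size_t len = 1 << 20;
    void *p = numa_alloc_onnode(len, 0); /* physical frames from node 0 */
    if (!p) return 1;
    ((char *)p)[0] = 1; /* touching from any CPU still hits node 0's RAM */
    printf("1MB allocated on node 0 of %d nodes\n", numa_max_node() + 1);
    numa_free(p, len);
    return 0;
}
```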
7.3. Copy-on-Write (COW)
When you `fork()` a process, the kernel doesn’t copy the memory. It points the new process’s page tables to the same physical frames but marks them Read-Only.
- If either process tries to write, the CPU triggers a Page Fault.
- The kernel then copies that specific page, updates the page table to Read-Write, and restarts the instruction.
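
A small demo of this copy-on-write sequence: parent and child start out sharing frames, and the child’s write quietly gets its own copy.

```c
/* Sketch: after fork(), parent and child share frames copy-on-write;
 * the child's write triggers a fault and gets a private copy, so the
 * parent's value is untouched. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int *shared = malloc(sizeof *shared);
    *shared = 1; /* page is faulted in before the fork */

    pid_t pid = fork();
    if (pid == 0) {     /* child */
        *shared = 99;   /* write fault: kernel copies just this page */
        printf("child : *shared = %d\n", *shared);
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent: *shared = %d\n", *shared); /* still 1 */
    free(shared);
    return 0;
}
```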
8. Senior Interview Deep Dive
Q1: Walk through exactly what happens during a Page Fault.
- Hardware Trigger: CPU tries to access a virtual address. The MMU finds the PTE (Page Table Entry) is either not present or the permissions (W/X) are violated.
- Trap: The CPU generates an exception and jumps to the kernel’s Page Fault Handler (e.g., `do_page_fault` in Linux).
- Context Capture: The CPU saves the faulting address in a control register (`CR2` on x86-64).
- VMA Lookup: The kernel looks up the process’s `vm_area_struct` to see if the address is valid (e.g., is it in a known segment like the heap?).
- Allocation/Swap:
- If it’s a new heap page: Find a free frame in the Buddy Allocator.
- If it’s swapped: Issue I/O to read the page from disk.
- PTE Update: The kernel updates the page table with the new Physical Frame address and sets the “Present” bit.
- Return: The kernel returns from the trap, and the CPU re-executes the exact same instruction that faulted.
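
When the VMA lookup above fails, the fault surfaces in user space as SIGSEGV, carrying the hardware-reported faulting address (the `CR2` value) in `siginfo_t.si_addr`. A minimal demonstration:

```c
/* Sketch: when the kernel cannot resolve a fault, it delivers SIGSEGV;
 * siginfo->si_addr carries the faulting address the hardware reported
 * (captured from CR2 on x86-64). */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    /* printf is not async-signal-safe; acceptable for a demo that exits. */
    printf("SIGSEGV at faulting address %p\n", info->si_addr);
    _exit(1);
}

int main(void) {
    struct sigaction sa = { 0 };
    sa.sa_sigaction = handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    volatile int *bad = (int *)0xdeadbeef; /* no VMA covers this address */
    *bad = 42;                             /* VMA lookup fails -> SIGSEGV */
    return 0;
}
```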
Q2: What is a TLB Shootdown?
When the kernel changes a page table mapping on a multi-core system (e.g., unmapping memory or changing permissions), it must ensure that other CPUs don’t have the old mapping cached in their local TLBs. The kernel sends an Inter-Processor Interrupt (IPI) to all other cores, forcing them to flush the specific entry from their TLBs. This is expensive and a major bottleneck for high-frequency mapping changes.
Q3: Why do we use 4-level or 5-level page tables instead of 2?
A 2-level table for a 64-bit space would still require massive tables: with 48-bit addresses and 4KB pages there are 2^36 virtual pages, so splitting the 36 index bits into two 18-bit levels means every table holds 2^18 entries × 8 bytes = 2MB, and even a tiny process pays roughly 4MB in page tables. Multi-level tables allow for sparsity. We only allocate the lower-level tables for address ranges that the process is actually using. If a process only uses 1MB of memory, we only need one entry in each of four small tables (one PML4, one PDPT, one PD, one PT), about 16KB of tables in total.
9. Practice: Memory Forensics
In Linux, you can inspect a process’s memory map directly:
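
A minimal sketch that dumps the process’s own map (equivalent to running `cat /proc/<pid>/maps`); each line is one VMA with its address range, permissions, and backing file:

```c
/* Sketch: dump this process's own memory map from /proc/self/maps. */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) { perror("fopen"); return 1; }
    char line[512];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```

Next: Virtual Memory & Swapping →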