
Memory Management: The Hardware-Software Interface

Memory management is the most complex dance between hardware (CPU/MMU) and software (Kernel). It is not merely about “giving memory to programs”; it is about creating a consistent, isolated, and high-performance execution environment while hiding the messy reality of physical RAM. This module covers everything from basic allocation strategies to the deep internals of 5-level paging and TLB management on modern x86-64 processors.
Interview Frequency: Critical
Key Topics: Page Tables, TLB, Fragmentation, Buddy/Slab Allocators, NUMA
Time to Master: 15-20 hours

1. The Core Architecture: Virtual vs. Physical

At the heart of modern computing lies a lie: Every process thinks it has the entire memory space to itself. This is Virtual Memory.

1.1. Why Virtual Memory?

  1. Isolation: Process A cannot read Process B’s memory.
  2. Relocatability: A process can be loaded at any physical address without changing its code.
  3. Efficiency: We only keep the “working set” of a process in physical RAM.
  4. Security: We can mark certain areas as “No-Execute” or “Read-Only”.
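A quick way to see points 1 and 4 in action from user space is to flip a page's permissions and watch the MMU enforce them. A minimal sketch (Linux and glibc assumed): mprotect() marks the page read-only, and the very next store is blocked by the hardware and delivered to the process as SIGSEGV.

/* Sketch: hardware-enforced page protection (Linux assumed). */
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static void handler(int sig) {
    (void)sig;
    /* write() is async-signal-safe; printf() is not. */
    static const char msg[] = "SIGSEGV: the MMU blocked a write to a read-only page\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(0);
}

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    buf[0] = 'A';                      /* fine: page is writable        */
    signal(SIGSEGV, handler);
    mprotect(buf, page, PROT_READ);    /* flip the page to read-only    */
    buf[0] = 'B';                      /* faults: handler runs, exits 0 */
    return 1;                          /* not reached */
}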

1.2. The Address Space Layout

A standard 64-bit address space is effectively a vast void. On x86-64, only 48 or 57 bits are actually used (canonical addresses).
  • User Space (Lower half): 0x0000000000000000 … 0x00007FFFFFFFFFFF
  • The “Gap” (Non-canonical): Hardware throws a #GP fault if accessed.
  • Kernel Space (Upper half): 0xFFFF800000000000 … 0xFFFFFFFFFFFFFFFF
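To make the layout tangible, here is a small sketch (Linux/x86-64 assumed; exact values vary with ASLR) that prints where code, globals, heap, and stack actually land. Every address comes out below 0x00007FFFFFFFFFFF, i.e., in the lower, user half.

#include <stdio.h>
#include <stdlib.h>

int global_var;                                   /* data segment */

int main(void) {
    int stack_var;                                /* stack        */
    int *heap_var = malloc(sizeof *heap_var);     /* heap         */

    printf("code  (main):   %p\n", (void *)main);
    printf("data  (global): %p\n", (void *)&global_var);
    printf("heap  (malloc): %p\n", (void *)heap_var);
    printf("stack (local):  %p\n", (void *)&stack_var);

    free(heap_var);
    return 0;
}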

2. Hardware Support: The MMU and Page Tables

The Memory Management Unit (MMU) is a hardware component in the CPU that performs the translation from Virtual Address (VA) to Physical Address (PA) on every single memory access.

2.1. The Paging Mechanism

Memory is divided into fixed-size Pages (Virtual) and Frames (Physical). A typical page is 4KB.
Virtual Address Breakdown (4KB pages):
  • VPN (Virtual Page Number): Used to index into the page table.
  • Offset: Position within the 4KB page (lowest 12 bits).

2.2. Multi-Level Page Tables (The x86-64 Walk)

A single-level page table for a 64-bit space would be impossibly large. We use a tree structure. For 4-level paging (standard):
  1. CR3 Register: Holds the physical address of the PML4 (Level 4 table).
  2. PML4: Uses bits 47-39 of the VA to find the PDPT (Page Directory Pointer Table).
  3. PDPT: Uses bits 38-30 to find the PD (Page Directory).
  4. PD: Uses bits 29-21 to find the PT (Page Table).
  5. PT: Uses bits 20-12 to find the Physical Frame.
  6. Offset: Bits 11-0 are added to the Frame start to get the byte.
The “Walk” Cost: Every memory access theoretically requires 4 additional memory lookups!
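The index arithmetic of the walk is easy to reproduce. A sketch that slices a virtual address into the four 9-bit table indices plus the 12-bit offset (this only mirrors the bit ranges listed above; the real tables are visible only to the kernel):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va = 0x00007f3a12345678ULL;          /* arbitrary user address */

    uint64_t offset   = va & 0xFFF;               /* bits 11-0  */
    uint64_t pt_idx   = (va >> 12) & 0x1FF;       /* bits 20-12 */
    uint64_t pd_idx   = (va >> 21) & 0x1FF;       /* bits 29-21 */
    uint64_t pdpt_idx = (va >> 30) & 0x1FF;       /* bits 38-30 */
    uint64_t pml4_idx = (va >> 39) & 0x1FF;       /* bits 47-39 */

    printf("PML4=%" PRIu64 " PDPT=%" PRIu64 " PD=%" PRIu64
           " PT=%" PRIu64 " offset=0x%" PRIx64 "\n",
           pml4_idx, pdpt_idx, pd_idx, pt_idx, offset);
    return 0;
}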

2.3. From *p = 42 to a DRAM Cell

To make all the abstractions concrete, walk through a single C statement:
int *p = malloc(sizeof(int));
*p = 42;
What actually happens when the CPU executes *p = 42?
  1. User-space allocator (malloc) has already:
    • Reserved a chunk of virtual address space (via brk() or mmap()).
    • Returned a pointer p that lives in your process’s virtual address space.
  2. CPU executes the store:
    • The compiler emits something like movl $42, (%rdi) where %rdi holds p.
    • The Virtual Address (VA) in %rdi is handed to the MMU.
  3. MMU + Page Tables:
    • The MMU looks up the VA in the TLB.
    • On a TLB hit: It immediately gets the Physical Frame Number (PFN); the low 12 bits of the VA already supply the offset.
    • On a TLB miss: It walks the multi-level page tables using the scheme described above (PML4 → PDPT → PD → PT) to find the PFN, then updates the TLB.
  4. Potential Page Fault:
    • If the page table entry is not present or lacks write permissions, the CPU raises a page fault exception.
    • The kernel’s page fault handler (do_page_fault) decides whether to allocate a new frame, fetch from swap, or kill the process (e.g., invalid pointer).
  5. Cache Hierarchy:
    • Once the Physical Address is known, the CPU looks in L1 data cache.
    • On a miss, it walks out to L2/L3 and finally to DRAM.
    • The cache line (typically 64 bytes) containing the target address is loaded into L1.
  6. Store Buffer and Write:
    • The value 42 is placed into the CPU’s store buffer and eventually merged into the L1 cache line.
    • Later, the cache coherence protocol (MESI) ensures other cores see the updated value when needed.
From the C programmer’s perspective, *p = 42; is a single operation. In reality, it is a coordinated dance between:
  • User-space allocator (managing virtual addresses).
  • Kernel (managing page tables and page faults).
  • MMU/TLB (translating addresses).
  • Cache hierarchy and coherence protocol (moving and sharing data).
  • DRAM controller (accessing physical cells).
Keep this mental picture in mind when debugging performance issues: a “simple store” might involve a TLB miss, a page fault, cache misses, and NUMA penalties.
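To attach real numbers to that picture, a minimal sketch (Linux assumed) that touches one byte per page of a large anonymous mapping, so every store takes the demand-paging path. Run it under something like perf stat -e page-faults,dTLB-store-misses ./a.out (exact event names vary by CPU) and watch the fault count track the number of pages.

#include <sys/mman.h>

#define SIZE (256UL * 1024 * 1024)   /* 256 MB */
#define PAGE 4096UL

int main(void) {
    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    /* One store per page: each first touch is a minor page fault,
     * a TLB fill, and the pull of a fresh cache line. */
    for (unsigned long i = 0; i < SIZE; i += PAGE)
        buf[i] = 42;

    munmap(buf, SIZE);
    return 0;
}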

3. The TLB: Translation Lookaside Buffer

To solve the 4-lookup penalty, the CPU uses a specialized cache called the TLB.

3.1. TLB Internals

  • CAM (Content-Addressable Memory): The TLB is a small, content-addressable cache searched in parallel on every access; it stores (VPN → PFN) mappings.
  • TLB Hit: Translation in ≈1 cycle.
  • TLB Miss: Hardware (or software on some RISC) must perform the page table walk.

3.2. Context Switches and the TLB

When switching from Process A to Process B, the page tables change. Traditionally, we must flush the TLB because A’s mapping of address 0x4000 is different from B’s.
  • Performance Hit: A flushed TLB leads to a “cold start” period of high latency.
  • Optimization: ASIDs (Address Space IDs): Modern CPUs tag TLB entries with an address-space ID (PCID on Intel), so a context switch no longer needs a full flush: entries tagged with another process’s ID simply don’t match, and a flush is only required when an ID has to be recycled.

4. Contiguous Allocation: The Buddy System

While paging handles user memory, the kernel often needs contiguous physical memory (e.g., for DMA or hugepages).

4.1. The Buddy System Logic

Linux uses the Buddy Allocator. It manages memory in blocks of 2^n pages (Orders).
  • Allocation: If you want Order 2 (16KB) but only Order 4 (64KB) is free, the allocator splits 64 -> 32+32, then 32 -> 16+16.
  • Freeing (Coalescing): When you free a block, the kernel checks its “buddy.” If the buddy is also free, they merge back into a 2^(n+1) block.
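Two bits of buddy arithmetic worth being able to reproduce (a sketch, assuming 4KB pages): the order needed for a request, and the fact that a block's buddy is found by XOR-ing its page frame number with 2^order.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Smallest order whose block (2^order pages) covers `bytes`. */
static unsigned order_for(unsigned long bytes) {
    unsigned long pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    unsigned order = 0;
    while ((1UL << order) < pages)
        order++;
    return order;
}

int main(void) {
    unsigned order = order_for(17 * 1024);               /* 17KB request  */
    printf("17KB -> order %u (%lu KB block)\n",
           order, (PAGE_SIZE << order) / 1024);           /* order 3, 32KB */

    /* A block's buddy differs only in the bit for its order:
     * the order-3 block starting at PFN 8 has its buddy at PFN 0. */
    unsigned long pfn = 8, buddy = pfn ^ (1UL << order);
    printf("buddy of PFN %lu at order %u: PFN %lu\n", pfn, order, buddy);
    return 0;
}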

4.2. Fragmentation

  • Internal: Wasted space inside the 2^n block (e.g., requesting 17KB gets you 32KB).
  • External: Plenty of free memory, but none of it is contiguous enough for a large request.

5. Small Object Allocation: SLAB and SLUB

Requesting an entire 4KB page just to store a kernel object of a few dozen or a few hundred bytes is wasteful.

5.1. The Slab Concept

A Slab is a set of one or more contiguous pages, partitioned into small, fixed-size slots for specific kernel objects (e.g., inodes, dentries, mm_struct).

5.2. Object Caching

Instead of initializing memory every time, the Slab allocator keeps a pool of “constructed” objects. When you free an object, it remains initialized in the “free” pool, ready for the next request.
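A user-space analogue of object caching (a sketch, not the kernel's actual SLAB/SLUB code): freed objects stay on a free list, still constructed, so the next allocation skips both the underlying allocator and re-initialization.

#include <stdlib.h>
#include <string.h>

struct obj {
    char payload[64];
    struct obj *next_free;      /* reused as the free-list link when cached */
};

static struct obj *free_list;

static struct obj *obj_alloc(void) {
    if (free_list) {                     /* fast path: pop a cached object */
        struct obj *o = free_list;
        free_list = o->next_free;
        return o;                        /* still "constructed" from last use */
    }
    struct obj *o = malloc(sizeof *o);   /* slow path: fresh object */
    if (o) memset(o->payload, 0, sizeof o->payload);   /* one-time init */
    return o;
}

static void obj_free(struct obj *o) {
    o->next_free = free_list;            /* keep the initialized object cached */
    free_list = o;
}

int main(void) {
    struct obj *a = obj_alloc();         /* slow path                    */
    obj_free(a);                         /* cached, not returned to libc */
    struct obj *b = obj_alloc();         /* fast path: same slot back    */
    return (a == b) ? 0 : 1;
}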

6. Segmentation: A Historical Perspective

Before Paging became dominant, Segmentation was used to divide memory into logical units (Code, Data, Stack).
  • GDT (Global Descriptor Table): A table where each entry defines a segment’s base, limit, and permissions.
  • Why it failed: Variable-sized segments led to massive External Fragmentation and complex compaction needs.
  • Modern Use: On x86-64, segmentation is mostly disabled, but the FS and GS segment registers are still used for Thread Local Storage (TLS) and pointing to per-CPU kernel data.
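The TLS use is easy to see from C (Linux/x86-64 with GCC or Clang assumed): variables declared __thread are addressed relative to FS, so each thread quietly gets its own copy. Build with -pthread.

#include <pthread.h>
#include <stdio.h>

static __thread int counter;        /* compiler emits FS-relative accesses */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++)
        counter++;                  /* no locking needed: per-thread copy */
    printf("worker counter = %d\n", counter);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("main counter   = %d\n", counter);   /* still 0 in main's copy */
    return 0;
}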

7. Advanced Memory Features

7.1. HugePages

Standard 4KB pages are small: mapping a 1TB working set takes roughly 268 million PTEs, about 2GB of last-level page tables alone.
  • HugePages (2MB or 1GB): Reduces the depth of the page table walk and increases TLB coverage.
  • THP (Transparent HugePages): A Linux kernel feature that automatically attempts to promote 4KB pages to 2MB HugePages in the background.
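Both flavours can be requested from user space. A hedged sketch (Linux assumed): MAP_HUGETLB asks for explicitly reserved hugetlbfs pages and fails if vm.nr_hugepages is 0, while madvise(MADV_HUGEPAGE) merely hints that THP may back the region with 2MB pages.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define LEN (8UL * 1024 * 1024)     /* 8 MB, a multiple of 2MB */

int main(void) {
    /* Explicit hugepages: fails unless some have been reserved. */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        perror("mmap(MAP_HUGETLB)");

    /* Transparent hugepages: a normal mapping plus a hint. */
    void *q = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (q != MAP_FAILED && madvise(q, LEN, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    return 0;
}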

7.2. NUMA (Non-Uniform Memory Access)

In multi-socket servers, CPU 0 is “closer” to RAM Bank 0 than RAM Bank 1.
  • Local vs Remote: Accessing remote RAM can be 2x-3x slower.
  • First-Touch Policy: Linux generally allocates physical frames on the node that first writes to the page.
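In code, this becomes the first-touch pattern: the thread that will work on a slice of the buffer is also the one that initializes it, so its pages end up on that thread's local node. A sketch (pinning threads to nodes via numactl or pthread_setaffinity_np is assumed but omitted):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 4
#define SLICE (64UL * 1024 * 1024)

static char *buf;

static void *worker(void *arg) {
    long id = (long)arg;
    /* First touch: this thread, not main(), writes its own slice,
     * so the frames are allocated on this thread's node. */
    memset(buf + id * SLICE, 0, SLICE);
    /* ... then keep working on the same slice ... */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    buf = malloc(NTHREADS * SLICE);   /* virtual reservation; no frames yet */
    if (!buf) return 1;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    free(buf);
    return 0;
}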

7.3. Copy-on-Write (COW)

When you fork() a process, the kernel doesn’t copy the memory. It points the new process’s page tables to the same physical frames but marks them Read-Only.
  • If either process tries to write, the CPU triggers a Page Fault.
  • The kernel then copies that specific page, updates the page table to Read-Write, and restarts the instruction.
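A minimal demonstration (Linux assumed): parent and child print the same virtual address, yet after the child's write (which triggers the COW fault and a private copy) their values differ.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int *val = malloc(sizeof *val);
    *val = 1;

    pid_t pid = fork();               /* page tables copied, frames shared */
    if (pid == 0) {                   /* child */
        *val = 99;                    /* COW fault: kernel copies the page */
        printf("child : addr=%p val=%d\n", (void *)val, *val);
        exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent: addr=%p val=%d\n", (void *)val, *val);   /* still 1 */
    return 0;
}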

8. Senior Interview Deep Dive

The classic question: what exactly happens on a page fault, from the hardware trigger to the instruction restart?
  1. Hardware Trigger: CPU tries to access a virtual address. The MMU finds the PTE (Page Table Entry) is either not present or the permissions (W/X) are violated.
  2. Trap: CPU generates an exception and jumps to the kernel’s Page Fault Handler (e.g., do_page_fault in Linux).
  3. Context Capture: The CPU saves the faulting address in a control register (e.g., CR2).
  4. VMA Lookup: The kernel looks up the process’s vm_area_struct to see if the address is valid (e.g., is it in a known segment like the heap?).
  5. Allocation/Swap:
    • If it’s a new heap page: Find a free frame in the Buddy Allocator.
    • If it’s swapped: Issue I/O to read the page from disk.
  6. PTE Update: The kernel updates the page table with the new Physical Frame address and sets the “Present” bit.
  7. Return: The kernel returns from the trap, and the CPU re-executes the exact same instruction that faulted.
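You can watch this machinery from user space by counting minor faults around a first-touch loop. A sketch using getrusage() (Linux assumed); the fault count tracks the number of pages touched:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define SIZE (64UL * 1024 * 1024)

static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    char *buf = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return 1;

    long before = minor_faults();
    for (unsigned long i = 0; i < SIZE; i += 4096)
        buf[i] = 1;                  /* each first touch takes the path above */
    long after = minor_faults();

    printf("minor faults: %ld (~%lu pages touched)\n",
           after - before, SIZE / 4096);
    munmap(buf, SIZE);
    return 0;
}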
TLB Shootdowns: When the kernel changes a page table mapping on a multi-core system (e.g., unmapping memory or changing permissions), it must ensure that other CPUs don’t have the old mapping cached in their local TLBs. The kernel sends an Inter-Processor Interrupt (IPI) to all other cores, forcing them to flush the specific entry from their TLBs. This is expensive and a major bottleneck for high-frequency mapping changes.
Why multi-level page tables? A 2-level table for a 64-bit space would still require a massive top-level directory. Multi-level tables allow for sparsity: we only allocate the lower-level tables for address ranges that the process is actually using. For a 48-bit space with 4KB pages, a flat table would need 2^36 entries, roughly 512GB of PTEs at 8 bytes each; a process using only 1MB of memory instead needs one entry each in the PML4, PDPT, and PD, plus a single 4KB page table, i.e., four 4KB tables (about 16KB) in total.

9. Practice: Memory Forensics

In Linux, you can inspect a process’s memory map directly:
# See the segments of a process
cat /proc/[pid]/maps

# Per-mapping detail: RSS, PSS, dirty and swapped-out pages
cat /proc/[pid]/smaps

# Physical frame flags (binary file; needs root, parse with a tool such as page-types)
cat /proc/kpageflags
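
Going one step further, /proc/[pid]/pagemap exposes the VA → PFN translation itself: 8 bytes per virtual page, with bit 63 = present and bits 0-54 = PFN. A sketch that translates one of its own addresses (since Linux 4.0 the PFN reads back as zero without root/CAP_SYS_ADMIN):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page = sysconf(_SC_PAGESIZE);
    int *target = malloc(sizeof *target);
    *target = 42;                              /* make sure the page is present */

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t vpn = (uint64_t)(uintptr_t)target / page;
    uint64_t entry;
    if (pread(fd, &entry, sizeof entry, (off_t)(vpn * sizeof entry)) != sizeof entry) {
        perror("pread");
        close(fd);
        return 1;
    }
    close(fd);

    if (entry & (1ULL << 63))                  /* present bit */
        printf("VA %p -> PFN 0x%llx\n", (void *)target,
               (unsigned long long)(entry & ((1ULL << 55) - 1)));
    else
        printf("page not present\n");
    free(target);
    return 0;
}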

Next: Virtual Memory & Swapping