
Virtual Memory

Virtual memory is the abstraction that gives each process its own private address space, decoupled from physical memory. It’s one of the most important OS concepts and heavily tested in senior interviews.
Interview Frequency: Very High
Key Topics: Paging, TLB, page faults, replacement algorithms
Time to Master: 12-14 hours

Why Virtual Memory?

Isolation

Each process has a private address space; one process can't corrupt another's memory.

Simplification

Every process thinks it has contiguous memory starting at 0. No coordination needed.

Overcommit

Virtual address space can exceed physical memory, using disk (swap) as an extension.

Sharing

Libraries and code pages can be shared between processes, even at different virtual addresses.

Address Translation

The Big Picture

On every memory access, the MMU translates the virtual address to a physical one using the current process's page tables; recent translations are cached in the TLB (covered below).

Virtual Address Structure (x86-64, 4KB pages)

  47        39  38        30  29        21  20        12  11          0
┌──────────────┬──────────────┬──────────────┬──────────────┬──────────────┐
│  PML4 index  │  PDPT index  │   PD index   │   PT index   │ Page offset  │
│   (9 bits)   │   (9 bits)   │   (9 bits)   │   (9 bits)   │  (12 bits)   │
└──────────────┴──────────────┴──────────────┴──────────────┴──────────────┘

Four 9-bit indexes select entries in the four table levels; the 12-bit offset addresses a byte within the 4KB page.
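As a concrete illustration (a user-space sketch of the bit arithmetic, not how the hardware does it), extracting these fields looks like this:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t va = 0x7fff12345678ULL;       /* example virtual address */
    unsigned pml4 = (va >> 39) & 0x1FF;    /* bits 47:39 */
    unsigned pdpt = (va >> 30) & 0x1FF;    /* bits 38:30 */
    unsigned pd   = (va >> 21) & 0x1FF;    /* bits 29:21 */
    unsigned pt   = (va >> 12) & 0x1FF;    /* bits 20:12 */
    unsigned off  = va & 0xFFF;            /* bits 11:0  */
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
           pml4, pdpt, pd, pt, off);
    return 0;
}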

Paging Deep Dive

Page Table Entry (PTE)

┌─────────────────────────────────────────────────────────────────┐
│                    PAGE TABLE ENTRY (x86-64)                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  63    62    52 51                  12 11  9 8 7 6 5 4 3 2 1 0  │
│  ┌──┬───────┬───────────────────────┬─────┬─┬─┬─┬─┬─┬─┬─┬─┬─┐  │
│  │NX│ Avail │    Physical Frame #   │Avail│G│0│D│A│C│T│U│W│P│  │
│  └──┴───────┴───────────────────────┴─────┴─┴─┴─┴─┴─┴─┴─┴─┴─┘  │
│                                                                  │
│  Key bits:                                                       │
│  P  (Present)      - Is page in physical memory?                │
│  W  (Write)        - Is write access allowed?                   │
│  U  (User)         - Accessible from user mode?                 │
│  A  (Accessed)     - Has page been accessed (R or W)?           │
│  D  (Dirty)        - Has page been written?                     │
│  NX (No Execute)   - Prevent code execution (security)         │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Multi-Level Page Tables

A translation walks one table per level (the full x86-64 walk is traced in the deep-dive questions below). Why multi-level?
  • Sparse address spaces: only allocate tables for regions actually used
  • A flat single-level table for a 48-bit space would need 2^36 PTEs × 8 bytes = 512GB per process!
  • Multi-level: unused upper-level entries stay NULL, so empty regions cost nothing
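A minimal two-level sketch of this idea, with lower tables allocated only on first touch (simplified to 18-bit virtual page numbers; all names here are illustrative):

#include <stdint.h>
#include <stdlib.h>

#define ENTRIES 512   /* 9 bits per level, as on x86-64 */

typedef struct { uint64_t pte[ENTRIES]; } PageTable;
typedef struct { PageTable *tables[ENTRIES]; } TopLevel; /* NULL = unmapped */

/* Map virtual page `vpn` to physical frame `pfn`, allocating the
   second-level table only on first use. Returns 0 on success. */
static int map_page(TopLevel *top, uint64_t vpn, uint64_t pfn) {
    size_t hi = (vpn >> 9) & (ENTRIES - 1);
    size_t lo = vpn & (ENTRIES - 1);
    if (top->tables[hi] == NULL) {
        top->tables[hi] = calloc(1, sizeof(PageTable));
        if (top->tables[hi] == NULL) return -1;
    }
    top->tables[hi]->pte[lo] = (pfn << 12) | 1;  /* frame address | Present */
    return 0;
}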

Translation Lookaside Buffer (TLB)

The TLB is a small, fast hardware cache of recent virtual→physical translations; a TLB hit skips the page table walk entirely.

TLB Characteristics

Property      | Typical Value
--------------|-------------------------------------------
Size          | 64-2048 entries
Associativity | Fully associative or 4-8 way
Access time   | 0.5-1 cycle
Miss penalty  | 10-100+ cycles
Hierarchy     | L1 (instruction), L1 (data), L2 (unified)
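For intuition, a back-of-envelope using illustrative values from the ranges above (1-cycle hit, 50-cycle miss penalty):

effective translation time = hit_time + miss_rate × miss_penalty
  99% hit rate: 1 + 0.01 × 50 = 1.5 cycles per access
  90% hit rate: 1 + 0.10 × 50 = 6.0 cycles per access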

TLB Shootdown

When a page mapping changes (e.g., munmap), all TLBs must be invalidated:
┌─────────────────────────────────────────────────────────────────┐
│                    TLB SHOOTDOWN                                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   CPU 0                  CPU 1                  CPU 2           │
│   ┌─────┐                ┌─────┐                ┌─────┐         │
│   │ TLB │                │ TLB │                │ TLB │         │
│   │ 0x5 │                │ 0x5 │                │ 0x5 │         │
│   └──┬──┘                └──┬──┘                └──┬──┘         │
│      │                      │                      │            │
│   Process unmaps page 0x5000                                    │
│      │                      │                      │            │
│      ▼                      ▼                      ▼            │
│   ┌──────────────────────────────────────────────────────┐     │
│   │           Inter-Processor Interrupt (IPI)             │     │
│   │            "Invalidate TLB entry for 0x5"            │     │
│   └──────────────────────────────────────────────────────┘     │
│      │                      │                      │            │
│      ▼                      ▼                      ▼            │
│   Invalidated            Invalidated            Invalidated    │
│                                                                  │
│   Expensive! 1000s of cycles for cross-CPU coordination        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
ASID (Address Space Identifier): Tags TLB entries with process ID to avoid flush on context switch.

Page Faults

A page fault occurs when the virtual address can’t be translated:
#include <sys/mman.h>

// Types of page faults

// 1. Minor (soft) fault - page in memory but not mapped
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, 
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
// First access: minor fault, kernel maps zero page

// 2. Major (hard) fault - page must be read from disk
char c = *swapped_out_page;  // illustrative: touching it triggers disk I/O

// 3. Invalid fault - actual error
int *bad = (int *)0xDEADBEEF;
*bad = 42;  // SIGSEGV - segmentation fault
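You can observe the distinction on Linux: /usr/bin/time -v ./program reports minor ("reclaiming a frame") and major ("requiring I/O") page fault counts separately.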

Page Fault Handler Flow

On a fault, the CPU traps to the kernel, which looks up the faulting address in the process's VMA list and either maps a zero page, reads the page from disk, performs a COW copy, or delivers SIGSEGV (the decision logic is traced in the deep-dive questions below).

Page Replacement Algorithms

When physical memory is full, which page to evict?

FIFO (First-In-First-Out)

Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Frames: 3

Step | Reference | Frame 1 | Frame 2 | Frame 3 | Fault?
-----|-----------|---------|---------|---------|-------
  1  |     1     |    1    |    -    |    -    |  Yes
  2  |     2     |    1    |    2    |    -    |  Yes
  3  |     3     |    1    |    2    |    3    |  Yes
  4  |     4     |    4    |    2    |    3    |  Yes (evict 1)
  5  |     1     |    4    |    1    |    3    |  Yes (evict 2)
  6  |     2     |    4    |    1    |    2    |  Yes (evict 3)
  7  |     5     |    5    |    1    |    2    |  Yes (evict 4)
  8  |     1     |    5    |    1    |    2    |  No
  9  |     2     |    5    |    1    |    2    |  No
 10  |     3     |    5    |    3    |    2    |  Yes (evict 1)
 11  |     4     |    5    |    3    |    4    |  Yes (evict 2)
 12  |     5     |    5    |    3    |    4    |  No

Total faults: 9
Belady’s Anomaly: More frames can cause MORE faults with FIFO!

LRU (Least Recently Used)

Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
Frames: 3

Step | Reference | Frame 1 | Frame 2 | Frame 3 | Fault?
-----|-----------|---------|---------|---------|-------
  1  |     1     |    1    |    -    |    -    |  Yes
  2  |     2     |    1    |    2    |    -    |  Yes
  3  |     3     |    1    |    2    |    3    |  Yes
  4  |     4     |    4    |    2    |    3    |  Yes (evict 1, LRU)
  5  |     1     |    4    |    1    |    3    |  Yes (evict 2, LRU)
  6  |     2     |    4    |    1    |    2    |  Yes (evict 3, LRU)
  7  |     5     |    5    |    1    |    2    |  Yes (evict 4, LRU)
  8  |     1     |    5    |    1    |    2    |  No
  9  |     2     |    5    |    1    |    2    |  No
 10  |     3     |    3    |    1    |    2    |  Yes (evict 5, LRU)
 11  |     4     |    3    |    4    |    2    |  Yes (evict 1, LRU)
 12  |     5     |    3    |    4    |    5    |  Yes (evict 2, LRU)

Total faults: 10
Problem: True LRU requires tracking access order (expensive).

Clock Algorithm (Second Chance)

Approximates LRU with just a reference bit per frame. Frames form a circle; a clock hand sweeps them: a frame with its reference bit set gets the bit cleared and a second chance, and the first frame found with the bit clear is evicted.
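A minimal sketch of the eviction sweep (the reference bit is simulated here; in a real MMU the hardware sets it on access):

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  page;        /* resident page number, -1 if frame is empty */
    bool referenced;  /* reference bit, set on access               */
} Frame;

/* Advance the hand until a frame with a clear reference bit is found;
   frames with the bit set get it cleared (their second chance). */
static size_t clock_evict(Frame *frames, size_t n, size_t *hand) {
    for (;;) {
        if (!frames[*hand].referenced) {
            size_t victim = *hand;
            *hand = (*hand + 1) % n;
            return victim;
        }
        frames[*hand].referenced = false;
        *hand = (*hand + 1) % n;
    }
}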

Comparison

Algorithm | Faults (example)        | Complexity                  | Notes
----------|-------------------------|-----------------------------|--------------------------------------
OPT       | Minimum possible        | N/A                         | Requires future knowledge
LRU       | Near optimal            | O(n), or O(1) with hardware | Stack property (no Belady's anomaly)
Clock     | Slightly worse than LRU | O(1)                        | Practical approximation
FIFO      | Often poor              | O(1)                        | Simple, but Belady's anomaly

Demand Paging and Copy-on-Write

Demand Paging

Pages loaded only when accessed:
// Executable loading with demand paging
execl("./program", "program", (char *)NULL);

// Initially: No code pages loaded
// Page table entries marked "not present"

// First instruction fetch:
// 1. Page fault on address of main()
// 2. Kernel loads that page from disk
// 3. Execution continues

// Benefits:
// - Fast startup (don't load entire program)
// - Memory efficient (unused code never loaded)

Copy-on-Write (COW)

Defer copying until modification. fork() doesn't duplicate the parent's memory: both processes share the same physical pages, marked read-only, and the first write by either side faults so the kernel can copy just that one page.
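A small demonstration, assuming Linux/POSIX: the child's write is the first store to the shared page, after which each process sees its own copy:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char buffer[4096] = "original";

int main(void) {
    pid_t pid = fork();
    if (pid == 0) {                      /* child */
        strcpy(buffer, "child copy");    /* write faults -> private page */
        printf("child sees:  %s\n", buffer);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", buffer); /* still "original" */
    return 0;
}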

CPU Extensions for Memory Management

Modern CPUs provide hardware extensions to enhance memory management capabilities:

PAE (Physical Address Extension)

Allows 32-bit CPUs to access more than 4GB of physical memory:
┌─────────────────────────────────────────────────────────────────┐
│              PAE (Physical Address Extension)                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Without PAE (32-bit):                                          │
│  • 32-bit physical addresses → 4GB max physical RAM             │
│  • 2-level page tables                                          │
│                                                                  │
│  With PAE enabled:                                              │
│  • 36-bit physical addresses → 64GB max physical RAM            │
│  • 3-level page tables (PDPT → PD → PT)                        │
│  • Page table entries: 64 bits (instead of 32)                 │
│                                                                  │
│  Enable in Linux:                                               │
│  CONFIG_HIGHMEM64G=y                                            │
│  CONFIG_X86_PAE=y                                               │
│                                                                  │
│  Limitation:                                                    │
│  • Each 32-bit process still limited to 4GB virtual space       │
│  • But multiple processes can use different physical memory     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

PSE (Page Size Extensions)

Enables support for larger page sizes beyond standard 4KB:
┌─────────────────────────────────────────────────────────────────┐
│              PSE (Page Size Extensions)                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Standard pages:  4KB                                           │
│  PSE (32-bit):   4MB pages; PSE-36 maps them above 4GB          │
│  x86-64:         2MB and 1GB large-page support                 │
│                                                                  │
│  Benefits:                                                      │
│  • Fewer TLB entries needed                                     │
│  • Reduced page table overhead                                  │
│  • Better performance for large memory workloads                │
│                                                                  │
│  Check PSE support:                                             │
│  $ grep pse /proc/cpuinfo                                       │
│  flags: ... pse pse36 ...                                       │
│                                                                  │
│  Page size hierarchy (x86-64):                                  │
│  • 4KB  - Standard pages (12-bit offset)                        │
│  • 2MB  - Huge pages    (21-bit offset, PSE enabled)           │
│  • 1GB  - Gigantic pages (30-bit offset, PDPE1GB)              │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

NX Bit (No-Execute / DEP)

Prevents code execution from data pages — critical for security:
┌─────────────────────────────────────────────────────────────────┐
│                 NX Bit (No-Execute Protection)                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  AMD: NX (No-Execute) bit                                       │
│  Intel: XD (Execute Disable) bit                                │
│  Windows: DEP (Data Execution Prevention)                       │
│                                                                  │
│  Page Table Entry (bit 63):                                     │
│  ┌──┬───────┬────────────────────┬────────────────┐            │
│  │NX│ ...   │   Physical Frame   │  Other bits    │            │
│  └──┴───────┴────────────────────┴────────────────┘            │
│   │                                                             │
│   └─ NX=1: Execution not allowed                               │
│      NX=0: Code execution permitted                             │
│                                                                  │
│  Memory region classifications:                                 │
│  • Code segment:  NX=0, W=0 (executable, read-only)            │
│  • Data segment:  NX=1, W=1 (writable, non-executable)         │
│  • Stack:         NX=1, W=1 (non-executable stack)             │
│  • Heap:          NX=1, W=1 (non-executable heap)              │
│                                                                  │
│  Prevents common exploits:                                      │
│  • Buffer overflow → shellcode on stack → NX blocks execution   │
│  • Heap spraying attacks                                        │
│  • Note: return-to-libc / ROP can still bypass NX               │
│                                                                  │
│  Check NX support:                                              │
│  $ grep nx /proc/cpuinfo                                        │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
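A sketch of the user-space view on Linux: anonymous mappings start out non-executable, and mprotect() must grant PROT_EXEC explicitly (as JIT compilers do); strict W^X policies may refuse the request:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Jumping to p now would fault: the page's NX bit is set. */
    if (mprotect(p, pagesz, PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect");   /* strict W^X policies may refuse this */
        return 1;
    }
    puts("page is now executable (NX cleared for this mapping)");
    munmap(p, pagesz);
    return 0;
}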

SMEP & SMAP (Supervisor Mode Protections)

Advanced kernel protection mechanisms:
┌─────────────────────────────────────────────────────────────────┐
│            SMEP & SMAP (Kernel Protections)                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SMEP (Supervisor Mode Execution Prevention):                   │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━              │
│  • Prevents kernel from executing user-space code               │
│  • Stops kernel exploits that jump to user-controlled code      │
│  • Enabled in CR4 register (bit 20)                             │
│                                                                  │
│  Without SMEP:                                                  │
│  ┌─────────────┐                                                │
│  │   Kernel    │ ──can execute──► User code (exploit!)          │
│  └─────────────┘                                                │
│                                                                  │
│  With SMEP:                                                     │
│  ┌─────────────┐                                                │
│  │   Kernel    │ ──X blocked──► User code (page fault)          │
│  └─────────────┘                                                │
│                                                                  │
│  SMAP (Supervisor Mode Access Prevention):                      │
│  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━              │
│  • Prevents kernel from accessing user-space memory             │
│  • Must use copy_from_user() / copy_to_user() APIs             │
│  • Enabled in CR4 register (bit 21)                             │
│                                                                  │
│  Legitimate kernel access:                                      │
│  • Temporarily disable SMAP (STAC instruction)                  │
│  • Access user memory                                           │
│  • Re-enable SMAP (CLAC instruction)                            │
│                                                                  │
│  Check support:                                                 │
│  $ grep -E 'smep|smap' /proc/cpuinfo                            │
│  flags: ... smep smap ...                                       │
│                                                                  │
│  Kernel boot parameters:                                        │
│  nosmep    - Disable SMEP (dangerous!)                          │
│  nosmap    - Disable SMAP (dangerous!)                          │
│                                                                  │
│  Security impact:                                               │
│  • Stops many privilege escalation attacks                      │
│  • Attackers need kernel-space ROP gadgets                      │
│  • Combined with KASLR, makes exploits harder                   │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

PCID (Process-Context Identifiers)

Improves TLB efficiency across context switches:
┌─────────────────────────────────────────────────────────────────┐
│            PCID (Process-Context Identifiers)                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Without PCID:                                                  │
│  • Context switch → Flush entire TLB                            │
│  • First memory accesses after switch: all TLB misses           │
│  • Expensive for workloads with frequent context switches       │
│                                                                  │
│  With PCID:                                                     │
│  • Each TLB entry tagged with 12-bit PCID                       │
│  • Context switch preserves TLB entries                         │
│  • Only flush entries for specific PCID when needed             │
│                                                                  │
│  TLB entry with PCID:                                           │
│  ┌──────────┬──────────────┬───────────────────┐                │
│  │   PCID   │  Virtual Pg  │  Physical Frame   │                │
│  │ (12-bit) │              │                   │                │
│  └──────────┴──────────────┴───────────────────┘                │
│                                                                  │
│  Performance benefit:                                           │
│  • Microservices with many processes: 5-10% improvement         │
│  • Reduce context switch overhead by 50-70%                     │
│                                                                  │
│  Check support:                                                 │
│  $ grep pcid /proc/cpuinfo                                      │
│                                                                  │
│  Linux kernel support:                                          │
│  • Enabled by default since kernel 4.14                         │
│  • Disable with: nopcid (or noinvpcid) boot parameters          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Huge Pages

Larger pages reduce TLB misses:
Normal 4KB pages:           Huge 2MB pages:
┌───────────────────┐      ┌───────────────────┐
│ 1GB working set   │      │ 1GB working set   │
│ = 262,144 pages   │      │ = 512 pages       │
│                   │      │                   │
│ TLB entries needed│      │ TLB entries needed│
│ for full coverage:│      │ for full coverage:│
│     262,144       │      │       512         │
│                   │      │                   │
│ With 1024-entry   │      │ With 1024-entry   │
│ TLB: 0.4% coverage│      │ TLB: 100% coverage│
└───────────────────┘      └───────────────────┘

Using Huge Pages in Linux

#include <sys/mman.h>

// Method 1: mmap with MAP_HUGETLB
// (fails with ENOMEM unless huge pages are reserved:
//  echo 128 > /proc/sys/vm/nr_hugepages)
void *ptr = mmap(NULL, 2 * 1024 * 1024,  // 2MB
                 PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                 -1, 0);

// Method 2: Transparent Huge Pages (THP)
// Enabled by default, kernel auto-promotes
// Check: cat /sys/kernel/mm/transparent_hugepage/enabled

// Method 3: hugetlbfs
// mount -t hugetlbfs none /mnt/hugepages
// Or mmap files from hugetlbfs mount
Trade-offs:
  • Pros: Fewer TLB misses, better for large working sets
  • Cons: More internal fragmentation
  • Cons: Memory compaction overhead (for THP)

Memory-Mapped Files

Map files directly into virtual address space:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int fd = open("data.bin", O_RDWR);
struct stat sb;
fstat(fd, &sb);

// Map entire file into memory
char *data = mmap(NULL, sb.st_size,
                  PROT_READ | PROT_WRITE,
                  MAP_SHARED,  // Changes written back
                  fd, 0);

// Access like regular memory
data[0] = 'A';  // Writes to file (eventually)

// Changes are visible to other processes mapping the same file
msync(data, sb.st_size, MS_SYNC);  // force write-back to disk

munmap(data, sb.st_size);
close(fd);

mmap vs read/write

Aspect        | read()/write()     | mmap()
--------------|--------------------|----------------------
Copies        | Kernel→User buffer | Zero-copy (usually)
Random access | Seek + read        | Direct pointer access
Large files   | Memory efficient   | Maps entire region
Sharing       | Requires IPC       | MAP_SHARED automatic
Small I/O     | More efficient     | Page fault overhead

Interview Deep Dive Questions

Q: A load instruction references a virtual address. Walk through exactly how it becomes a physical memory access.

Answer (assuming TLB miss, 4KB pages, x86-64):
  1. CPU generates virtual address (e.g., 0x7fff12345678)
  2. Check TLB: Not found (TLB miss)
  3. Page table walk:
    • Read CR3 to get PML4 base
    • Extract bits [47:39] → index into PML4 → get PDPT address
    • Extract bits [38:30] → index into PDPT → get PD address
    • Extract bits [29:21] → index into PD → get PT address
    • Extract bits [20:12] → index into PT → get PTE
    • Each step = 1 memory read (4 total)
  4. Check PTE:
    • Present bit = 1? Continue
    • Present bit = 0? Page fault
    • Permissions match access type? Continue
  5. Calculate physical address:
    • Physical frame from PTE + offset [11:0] from VA
  6. Fill TLB with this translation
  7. Access physical memory through cache hierarchy
Total time (TLB miss): ~100-400 cycles
Q: On a page fault, how does the kernel tell a legal access from a segfault?

Answer: The kernel maintains Virtual Memory Areas (VMAs):
struct vm_area_struct {
    unsigned long vm_start;    // Start address
    unsigned long vm_end;      // End address
    unsigned long vm_flags;    // Permissions (read, write, exec)
    struct file *vm_file;      // Backing file (if any)
    // ...
};
On page fault:
  1. Search VMA tree for faulting address
  2. No VMA found → Invalid access → SIGSEGV
  3. VMA found, but permissions wrong → SIGSEGV
  4. VMA found, permissions OK:
    • Anonymous page? Allocate zero frame
    • File-backed? Read from disk
    • Copy-on-write? Make private copy
Example:
VMA list for process:
[0x0000 - 0x1000] code, r-x
[0x2000 - 0x3000] data, rw-
[0x7fff0000 - 0x7fff1000] stack, rw-

Access to 0x5000 → No VMA → SIGSEGV
Write to 0x0500 → VMA found but no 'w' → SIGSEGV
Read from 0x2500 → VMA found, permitted → Handle fault
Q: Why is TLB shootdown expensive, and how can shootdowns be avoided?

Answer: Why it's expensive:
  1. IPI (Inter-Processor Interrupt): Must interrupt all CPUs
  2. Synchronization: Initiating CPU waits for acknowledgment
  3. TLB flush: Target CPUs invalidate entries
  4. Memory barrier: Ensure ordering
Typical cost: 1,000-10,000+ cycles
Avoiding shootdowns:
  1. ASID (Address Space Identifiers):
    • Tag TLB entries with process ID
    • Context switch doesn’t need flush
    • Shootdown only for entries with matching ASID
  2. Lazy shootdown:
    • Mark pages invalid but don’t send IPI
    • Other CPUs will fault and update themselves
    • Works for some scenarios (e.g., unmap)
  3. Batch invalidations:
    • Collect multiple unmaps, single shootdown
    • Linux: mmu_gather structure
  4. Huge pages:
    • Fewer TLB entries to invalidate
    • One shootdown covers 2MB instead of 4KB
  5. Design for locality:
    • Keep threads on same NUMA node
    • Reduce cross-CPU sharing
Q: Design a buffer pool (page cache) for a database.

Answer: Requirements:
  • Fixed-size pages (e.g., 8KB)
  • Fast allocation/deallocation
  • Track dirty pages
  • LRU-like eviction
Design:
struct BufferPool {
    Page *pages;           // Pre-allocated page array
    size_t num_pages;
    
    // Free list for O(1) allocation
    Page *free_list_head;
    
    // Hash table: page_id → buffer_pool_slot
    HashMap *page_table;
    
    // LRU tracking
    Page *lru_head, *lru_tail;
    
    pthread_mutex_t mutex;
};

struct Page {
    page_id_t id;
    bool is_dirty;
    int pin_count;      // Prevent eviction while in use
    Page *lru_next, *lru_prev;
    char data[PAGE_SIZE];
};
Operations:
  1. Allocate: Pop from free list (O(1))
  2. Free: Push to free list (O(1))
  3. Access: Move to LRU head (O(1); see the sketch below)
  4. Evict: Take from LRU tail, write if dirty
  5. Pin/Unpin: Increment/decrement reference count
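A minimal sketch of operation 3's O(1) move-to-head, assuming the MRU-at-head list and the struct fields shown above (lru_touch is a hypothetical helper, not part of the design):

// Hypothetical helper: relink a page at the LRU head in O(1).
static void lru_touch(BufferPool *bp, Page *pg) {
    if (bp->lru_head == pg) return;              // already most recent
    // unlink from current position
    if (pg->lru_prev) pg->lru_prev->lru_next = pg->lru_next;
    if (pg->lru_next) pg->lru_next->lru_prev = pg->lru_prev;
    if (bp->lru_tail == pg) bp->lru_tail = pg->lru_prev;
    // relink at head
    pg->lru_prev = NULL;
    pg->lru_next = bp->lru_head;
    if (bp->lru_head) bp->lru_head->lru_prev = pg;
    bp->lru_head = pg;
    if (bp->lru_tail == NULL) bp->lru_tail = pg;
}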
Enhancements:
  • Background writer for dirty pages
  • Clock algorithm instead of strict LRU
  • NUMA-aware allocation
  • Huge pages for buffer pool
Q: What is memory overcommit, and what happens when the system truly runs out of memory?

Answer: Memory overcommit:
  • Linux allows allocating more virtual memory than physical
  • malloc(1GB) succeeds even with 512MB RAM
  • Actual physical pages allocated on first access
Why?:
  • Most allocations not fully used
  • fork() would fail otherwise (COW makes it cheap)
  • Allows efficient sparse arrays
Overcommit settings (/proc/sys/vm/overcommit_memory):
  • 0: Heuristic (default; refuses only obviously excessive requests)
  • 1: Always allow (never fail malloc)
  • 2: Strict (fail once commit would exceed swap + RAM × overcommit_ratio%)
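A small experiment, assuming a 64-bit Linux box with default overcommit (the 8GB figure is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    size_t size = (size_t)8 << 30;       /* 8GB of virtual address space */
    char *p = malloc(size);
    if (p == NULL) { perror("malloc"); return 1; }
    printf("reserved %zu bytes without touching them\n", size);
    memset(p, 1, 1 << 20);               /* touch 1MB -> ~256 minor faults */
    free(p);
    return 0;
}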
OOM Killer: When memory truly exhausted:
  1. Kernel invokes OOM killer
  2. Calculates “badness” score for each process:
    • Memory usage (primary factor)
    • Process age, nice value
    • oom_score_adj tuning
  3. Kills process with highest score
  4. Reclaims memory
Avoiding OOM kills:
# Protect critical process
echo -1000 > /proc/$(pidof myapp)/oom_score_adj

# Or use cgroups memory limits

Practice Exercises

1. Page Table Simulator: Implement a two-level page table with TLB. Measure TLB hit rates for different access patterns.

2. Page Replacement: Implement FIFO, LRU, and Clock algorithms. Compare fault rates on trace files.

3. COW Fork: Write a program demonstrating COW. Use /proc/self/pagemap to observe physical pages.

4. Memory-Mapped IPC: Implement producer-consumer using mmap with MAP_SHARED.

Key Takeaways

Multi-Level Saves Memory

Sparse address spaces only need page tables for used regions.

TLB is Critical

TLB misses cost 100+ cycles. Huge pages reduce misses dramatically.

Clock ≈ LRU in Practice

Reference bit approximation is simple and nearly as effective as LRU.

COW Enables Cheap fork()

Defer copying until write. Most pages never copied.

Next: File Systems