
Virtual Memory: The Grand Abstraction

Virtual Memory is the crowning achievement of operating system design. It creates the illusion that every process has access to a massive, contiguous block of memory, while simultaneously providing hardware-enforced isolation, efficient resource sharing, and the ability to run programs larger than physical RAM. This module provides an exhaustive, senior-level exploration of the mechanisms, algorithms, and hardware optimizations that make virtual memory possible.
Interview Frequency: Extremely High
Key Topics: Demand Paging, Thrashing, Page Replacement Algorithms, KPTI, Meltdown
Time to Master: 15-20 hours

1. The Philosophical Shift: Mapping Intent to Reality

In the early days, programs were loaded into fixed physical addresses. This meant only one program could run at a time (or a few with fixed partitions). Virtual memory changed this by introducing a Mapping Layer.

1.1. The Mapping Function

At any moment, a virtual address $V$ is mapped to a physical address $P$ by the translation function

$$f(V) = \begin{cases} P & \text{if the page is present} \\ \text{page-fault exception} & \text{if not present} \end{cases}$$
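To make the mapping concrete, here is a toy single-level translation in C. The flat table, the PAGE_SHIFT value, and aborting on a fault are all simplifying assumptions; real hardware walks multi-level tables and the OS services the fault:

// toy_translate.c -- a toy model of f(V): split the virtual address,
// consult a flat page table, and either return P or raise a "fault".
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT 12                     // 4 KB pages
#define NUM_PAGES  1024                   // toy address-space size

typedef struct {
    uint64_t frame;                       // physical frame number
    int present;                          // present bit
} pte_t;

static pte_t page_table[NUM_PAGES];

static uint64_t translate(uint64_t vaddr) {
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1UL << PAGE_SHIFT) - 1);
    if (vpn >= NUM_PAGES || !page_table[vpn].present) {
        fprintf(stderr, "page fault at 0x%llx\n", (unsigned long long)vaddr);
        exit(1);                          // a real OS would service the fault
    }
    return (page_table[vpn].frame << PAGE_SHIFT) | offset;
}

int main(void) {
    page_table[3] = (pte_t){ .frame = 42, .present = 1 };
    printf("0x3005 -> 0x%llx\n", (unsigned long long)translate(0x3005));
    return 0;
}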

1.2. Why the Abstraction Wins

  • Decoupling: The compiler doesn’t need to know how much RAM the end-user has.
  • Protection: A process literally cannot name memory that belongs to another process.
  • Overcommit: You can allocate 1 TB of virtual memory on a system with 16 GB of RAM.

2. Advanced Paging Mechanics

Building on Memory Management, we now look at how the OS handles paging at scale.

2.1. Inverted Page Tables (IPT)

On extremely large (64-bit and beyond) address spaces, multi-level tables can still be deep, and every process needs its own set of them.
  • The Concept: Instead of a table per process, have one table for the entire system.
  • Structure: Each entry in the IPT corresponds to a Physical Frame. The entry contains (ProcessID, VirtualPageNumber).
  • Search: To translate, you hash (PID, VPN) and look it up.
  • Used by: IBM PowerPC, IA-64.
  • Trade-off: Harder to share pages between processes.
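A minimal sketch of the hashed lookup, assuming a multiplicative hash and linear probing in place of the chaining a real IPT would use; real designs (e.g., PowerPC's hashed page tables) differ in detail:

// ipt_lookup.c -- sketch of an inverted page table lookup. The hash
// function and the open-addressing collision policy are illustrative
// assumptions, not any particular architecture's scheme.
#include <stdint.h>

#define IPT_SIZE 4096                 // one entry per physical frame (toy)

typedef struct {
    uint32_t pid;                     // owning process
    uint64_t vpn;                     // virtual page number
    int      valid;
} ipt_entry_t;

static ipt_entry_t ipt[IPT_SIZE];     // indexed by physical frame number

static inline uint64_t ipt_hash(uint32_t pid, uint64_t vpn) {
    return (vpn ^ ((uint64_t)pid * 2654435761u)) % IPT_SIZE;
}

// Translate (pid, vpn) -> physical frame number, or -1 for a page fault.
static long ipt_translate(uint32_t pid, uint64_t vpn) {
    uint64_t h = ipt_hash(pid, vpn);
    for (int probes = 0; probes < IPT_SIZE; probes++) {
        uint64_t f = (h + probes) % IPT_SIZE;
        if (ipt[f].valid && ipt[f].pid == pid && ipt[f].vpn == vpn)
            return (long)f;           // the entry's index IS the PFN
        if (!ipt[f].valid)
            break;                    // empty slot ends the probe run
    }
    return -1;                        // not resident: page fault
}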

2.2. Shared Memory Implementation

How do shared libraries such as libc.so, or shared memory segments, work?
  • Multiple processes have different VPNs that point to the same PFN (Physical Frame Number) in their respective page tables.
  • Permissions: Process A might have Read-Only access, while Process B has Read-Write.
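A small demonstration using POSIX shared memory: two processes that map the same object see the same bytes through different virtual addresses, i.e., different VPNs backed by the same PFNs. The object name /vm_demo is arbitrary and error handling is trimmed for brevity; add -lrt when linking on older glibc:

// shm_demo.c -- map a named POSIX shared-memory object. Run one
// instance as a writer (./shm_demo hello) and another as a reader
// (./shm_demo): both map the same physical frames.
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int fd = shm_open("/vm_demo", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, 4096);                        // size it to one page
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);          // MAP_SHARED: shared frames
    if (argc > 1)
        strcpy(p, argv[1]);                     // writer
    else
        printf("mapped at %p: \"%s\"\n", (void *)p, p);  // reader
    return 0;
}

Note that each process will typically report a different mapping address for the same contents, which is the point: different virtual pages, one physical frame.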

3. Page Replacement: The Quest for the Optimal Page

When RAM is full and a page fault occurs, we must evict a page. This is the Page Replacement Problem.

3.1. The Benchmarks

  1. Optimal Algorithm (MIN/Belady’s): Evict the page that will not be used for the longest period in the future.
    • Status: Impossible to implement (requires prescience).
    • Utility: Used as a baseline for measuring other algorithms.
  2. FIFO (First-In, First-Out):
    • The Flaw: Belady’s Anomaly. Adding more frames can actually increase the number of page faults. This happens because FIFO doesn’t follow the Stack Property.
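You can watch the anomaly happen by simulating FIFO directly. This sketch replays the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 and counts faults with 3 frames (9 faults) versus 4 frames (10 faults):

// belady_demo.c -- count FIFO page faults for a reference string to
// demonstrate Belady's Anomaly.
#include <stdio.h>

static int fifo_faults(const int *refs, int n, int frames) {
    int mem[16], used = 0, next = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (mem[j] == refs[i]) { hit = 1; break; }
        if (hit) continue;
        faults++;
        if (used < frames) {
            mem[used++] = refs[i];                 // fill a free frame
        } else {
            mem[next] = refs[i];                   // evict the oldest arrival
            next = (next + 1) % frames;
        }
    }
    return faults;
}

int main(void) {
    int refs[] = {1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5};
    int n = sizeof refs / sizeof refs[0];
    printf("3 frames: %d faults\n", fifo_faults(refs, n, 3)); // 9
    printf("4 frames: %d faults\n", fifo_faults(refs, n, 4)); // 10: more!
    return 0;
}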

3.2. LRU: The Practical King

Least Recently Used (LRU) is based on the principle of Temporal Locality: If you used it recently, you’ll probably use it again.
  • LRU is a Stack Algorithm: It is provably immune to Belady’s Anomaly.
  • Implementation Cost: Keeping a timestamp or a sorted list on every memory access is too slow for hardware.

3.3. Approximations: Clock and WSClock

Modern OSs use the Clock Algorithm (Second Chance).
  • The Reference Bit: Every PTE has a bit that the hardware sets to 1 when the page is accessed.
  • The Pointer: A “clock hand” sweeps through pages.
    • If bit is 1, set to 0 and move on (giving it a second chance).
    • If bit is 0, evict.
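A minimal sketch of the sweep, assuming a toy frame table and a software ref array standing in for the hardware-set reference bit in each PTE:

// clock_evict.c -- sketch of the Clock (Second Chance) sweep over a
// toy frame table.
#include <stdio.h>

#define NFRAMES 8

static int ref[NFRAMES];              // reference bits (HW sets to 1 on access)
static int hand = 0;                  // the clock hand

static int clock_evict(void) {
    for (;;) {
        if (ref[hand]) {
            ref[hand] = 0;            // second chance: clear bit, move on
            hand = (hand + 1) % NFRAMES;
        } else {
            int victim = hand;        // bit already 0: evict this frame
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
    }
}

int main(void) {
    ref[0] = ref[1] = ref[3] = 1;     // pretend frames 0, 1, 3 were accessed
    printf("evict frame %d\n", clock_evict());   // frame 2: first 0 bit
    printf("evict frame %d\n", clock_evict());   // frame 4
    return 0;
}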

4. Thrashing and the Working Set Model

Thrashing occurs when the system spends more time moving pages between RAM and disk than executing instructions.

4.1. The Cause: Over-multiprogramming

As you increase the number of processes, each gets fewer frames. Eventually, no process has enough frames to hold its “Working Set,” and the system performance collapses (throughput drops to near zero).

4.2. The Working Set Model ($\Delta$)

The Working Set $W(t, \Delta)$ is the set of pages referenced by a process in the last $\Delta$ time units.
  • The Goal: The sum of the working sets of all active processes must fit in RAM.
  • Admission Control: If $\sum_i W_i > \text{Total RAM}$, the OS must suspend a process entirely (Swap Out) to save the others.
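Offline, the model is easy to compute from a reference trace. This sketch counts the distinct pages touched in the last $\Delta$ references; the trace, window size, and page-number range are toy values:

// working_set.c -- count distinct pages referenced in the last DELTA
// references of a trace: an offline model of W(t, DELTA).
#include <stdio.h>

#define DELTA    4         // window size, in references (toy value)
#define MAX_PAGE 64        // toy page-number range

static int working_set_size(const int *trace, int t) {
    int seen[MAX_PAGE] = {0}, count = 0;
    int start = (t - DELTA + 1 > 0) ? t - DELTA + 1 : 0;
    for (int i = start; i <= t; i++)
        if (!seen[trace[i]]) { seen[trace[i]] = 1; count++; }
    return count;
}

int main(void) {
    int trace[] = {1, 2, 1, 3, 1, 2, 7, 7, 7, 2};
    int n = sizeof trace / sizeof trace[0];
    for (int t = 0; t < n; t++)
        printf("t=%d |W| = %d\n", t, working_set_size(trace, t));
    return 0;
}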

5. Linux Internals: Memory Reclaim Path

Linux uses a sophisticated dual-path reclaim strategy.

5.1. kswapd vs. Direct Reclaim

  • kswapd: A background kernel thread that wakes up when free memory falls below a “Low Watermark.” It tries to proactively free pages to avoid blocking applications.
  • Direct Reclaim: If an application requests memory and kswapd hasn’t freed enough, the allocating task itself must synchronously reclaim pages before its allocation can proceed. This causes large latency spikes.

5.2. Zswap and Zram

Instead of writing to a slow disk, Linux can compress pages and store them in a dedicated portion of RAM.
  • Zram: A compressed RAM disk used as swap.
  • Zswap: A compressed in-RAM cache in front of an existing swap device; pages spill to disk only when the cache fills.
  • Result: Drastically better performance on systems with fast CPUs and limited RAM (like Android or IoT).

6. Security: The Meltdown Legacy

In 2018, the Meltdown vulnerability revealed that CPUs could speculatively read kernel memory from user space and leak it through a CPU cache side channel.

6.1. KPTI (Kernel Page Table Isolation)

Before Meltdown, the kernel was mapped into the upper half of every process’s page table for performance (to avoid TLB flushes on syscalls).
  • KPTI changed this. Now, the kernel and user space have separate page tables.
  • The Cost: Every syscall now requires a CR3 switch and a TLB flush (unless PCID is used to tag TLB entries). This caused a 5-30% performance hit on older CPUs.
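You can check whether a Linux system applies PTI by reading the sysfs vulnerability status, as in this small sketch (the path is standard on post-2018 kernels; the exact output string varies by CPU and kernel):

// kpti_check.c -- print the kernel's Meltdown mitigation status.
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/sys/devices/system/cpu/vulnerabilities/meltdown", "r");
    char line[128];
    if (f && fgets(line, sizeof line, f))
        printf("meltdown: %s", line);  // e.g. "Mitigation: PTI"
    else
        puts("status file not found (very old kernel?)");
    if (f) fclose(f);
    return 0;
}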

7. Advanced Performance: HugePages and IOMMU

7.1. Transparent HugePages (THP)

Linux attempts to automatically merge 4KB pages into 2MB HugePages.
  • The Problem: If memory is fragmented, the kernel might spend huge amounts of CPU time “compacting” memory to make a 2MB hole.
  • Recommendation: For databases (PostgreSQL/Redis), it is often better to disable THP (via /sys/kernel/mm/transparent_hugepage/enabled) and use Explicit HugePages.

8. Senior Interview Deep Dive

Belady’s Anomaly is the counter-intuitive observation that adding more physical memory (frames) can increase page faults in FIFO.
Why LRU is immune: LRU is a Stack Algorithm. In a stack algorithm, the set of pages held in $N$ frames is always a subset of the pages that would be held in $N+1$ frames, so extra frames can never introduce extra faults. FIFO doesn’t satisfy this property because it cares about arrival order, not usage order.
A Dirty Bit (in the PTE) is set by hardware when a page is written to.
  • Clean Page: Can be evicted instantly (it’s already on disk/executable).
  • Dirty Page: Must be written back to disk before the frame can be reused.
  • Optimization: The reclaim path prefers evicting clean pages first to avoid blocking on write-back I/O.
Traditional swap is limited by disk I/O latency (milliseconds for HDDs, tens of microseconds for NVMe). Memory compression is limited by CPU and memory speed (on the order of a microsecond per page). Compressing a page in RAM is therefore typically an order of magnitude or more faster than writing it out, even counting the CPU overhead.

9. Hands-on Labs: Observing Virtual Memory in Practice

These exercises are designed so you can see the theory in action on a Linux system.

9.1 Lab: Watching Page Faults and Working Set Growth

  1. Write a simple C program that touches memory gradually:
// vm_walk.c -- touch memory one page at a time and watch RSS grow.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    size_t n_mb = (argc > 1) ? (size_t)atol(argv[1]) : 1024; // default 1 GB
    size_t bytes = n_mb * 1024UL * 1024UL;
    char *buf = malloc(bytes);
    if (!buf) {
        perror("malloc");
        return 1;
    }

    for (size_t i = 0; i < bytes; i += 4096) {
        buf[i] = 1; // touch one byte per page: one minor fault per new page
        if (i % (128UL * 1024 * 1024) == 0) {
            fprintf(stderr, "Touched %zu MB\n", i / (1024 * 1024));
            usleep(100000);
        }
    }
    pause(); // keep the process alive so we can inspect it
    return 0;
}
  2. Run it and watch page faults and RSS:
$ gcc -O2 vm_walk.c -o vm_walk
$ ./vm_walk 2048 &   # 2 GB
$ pid=$!

$ watch -n1 "ps -o pid,rss,maj_flt,min_flt,cmd -p $pid"
$ sudo perf stat -e page-faults,minor-faults,major-faults -p $pid sleep 10
You should see minor faults increase as new pages are allocated, and the resident set size (RSS) grow with the working set.

9.2 Lab: TLB Behavior with Strided Access

  1. Write a program that scans an array with different strides and times the result.
  2. Use perf stat -e dTLB-load-misses,iTLB-load-misses to correlate access pattern with TLB pressure.
This makes the abstract idea of locality and TLB coverage concrete: you will see the per-access cost jump for pathological strides. A sketch of such a scanner follows.
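A minimal sketch; the array size, stride list, and timing method are arbitrary choices:

// stride_scan.c -- time array sweeps at different strides to expose
// cache-line and TLB effects.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (256UL * 1024 * 1024)    // 256 MB

int main(void) {
    volatile char *a = malloc(SIZE);
    if (!a) { perror("malloc"); return 1; }

    size_t strides[] = {64, 4096, 8192, 65536};
    for (size_t s = 0; s < sizeof strides / sizeof strides[0]; s++) {
        size_t stride = strides[s];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < SIZE; i += stride)
            a[i]++;                   // one access per stride
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("stride %6zu: %.1f ns/access\n", stride, ns / (SIZE / stride));
    }
    return 0;
}

Run it under perf (sudo perf stat -e dTLB-load-misses ./stride_scan) and compare: expect the per-access cost to climb once the stride reaches the page size, because every access then touches a different TLB entry.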

9.3 Lab: Inspecting Mappings with /proc

Use existing tools to introspect a running process:
$ cat /proc/$pid/maps     # segments and permissions
$ cat /proc/$pid/smaps    # per-VMA statistics
$ sudo ./pagemap_inspect $pid  # custom tool to dump PFNs (optional)
Tie what you see back to the concepts in this chapter: VMAs, page tables, shared mappings, copy-on-write.
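The pagemap_inspect tool above is left open; here is one possible minimal implementation, assuming 4 KB pages and the documented /proc/<pid>/pagemap format (bit 63 = present, bits 0-54 = PFN; root is required to see PFNs on modern kernels):

// pagemap_inspect.c -- translate one virtual address of a target process.
// Usage: sudo ./pagemap_inspect <pid> <hex-vaddr>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <hex-vaddr>\n", argv[0]);
        return 1;
    }
    uint64_t vaddr = strtoull(argv[2], NULL, 16);
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/pagemap", argv[1]);
    FILE *f = fopen(path, "rb");
    if (!f) { perror("fopen"); return 1; }

    uint64_t entry;                   // one 64-bit entry per virtual page
    fseek(f, (long)(vaddr / 4096 * 8), SEEK_SET);
    if (fread(&entry, sizeof entry, 1, f) != 1) { perror("fread"); return 1; }

    if (entry & (1ULL << 63))
        printf("vaddr 0x%" PRIx64 " -> PFN 0x%" PRIx64 "\n",
               vaddr, entry & ((1ULL << 55) - 1));
    else
        printf("vaddr 0x%" PRIx64 " not present\n", vaddr);
    fclose(f);
    return 0;
}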

10. Practice: The Page Table Challenge

Scenario: You are designing a 64-bit system with 8KB pages and 4-level paging.
  1. How many bits are used for the offset? ($2^{13} = 8192$, so 13 bits).
  2. If each PTE is 8 bytes, how many entries fit in a page? ($8192 / 8 = 1024$, so 10 bits per level).
  3. How many bits of virtual address are covered by this 4-level scheme? ($10 \times 4 + 13 = 53$ bits).

Next: Synchronization Primitives