Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Virtual Memory: The Grand Abstraction
Virtual Memory is the crowning achievement of operating system design. It creates the illusion that every process has access to a massive, contiguous block of memory, while simultaneously providing hardware-enforced isolation, efficient resource sharing, and the ability to run programs larger than physical RAM. This module provides an exhaustive, senior-level exploration of the mechanisms, algorithms, and hardware optimizations that make virtual memory possible.Interview Frequency: Extremely High
Key Topics: Demand Paging, Thrashing, Page Replacement Algorithms, KPTI, Meltdown
Time to Master: 15-20 hours
Key Topics: Demand Paging, Thrashing, Page Replacement Algorithms, KPTI, Meltdown
Time to Master: 15-20 hours
1. The Philosophical Shift: Mapping intent to Reality
In the early days, programs were loaded into fixed physical addresses. This meant only one program could run at a time (or a few with fixed partitions). Virtual memory changed this by introducing a Mapping Layer. The analogy that best captures virtual memory: every process thinks it lives in a sprawling mansion with a private address for every room. In reality, all processes share a single apartment building (physical RAM). The MMU is the concierge who translates “my room 42” into “building room 1,337” — and who ensures you can never accidentally walk into someone else’s apartment.1.1. The Mapping Function
At any moment, a virtual address is mapped to a physical address :1.2. Why the Abstraction Wins
- Decoupling: The compiler doesn’t need to know how much RAM the end-user has.
- Protection: A process literally cannot name memory that belongs to another process.
- Overcommit: You can allocate of virtual memory on a system with of RAM.
2. Advanced Paging Mechanics
Building on Memory Management, we now look at how the OS handles paging at scale.2.1. Inverted Page Tables (IPT)
On extremely large address spaces (64-bit or 128-bit), multi-level tables can still be deep.- The Concept: Instead of a table per process, have one table for the entire system.
- Structure: Each entry in the IPT corresponds to a Physical Frame. The entry contains
(ProcessID, VirtualPageNumber). - Search: To translate, you hash
(PID, VPN)and look it up. - Used by: IBM PowerPC, IA-64.
- Trade-off: Harder to share pages between processes.
2.2. Shared Memory Implementation
How dolibc.so or shared memory segments work?
- Multiple processes have different VPNs that point to the same PFN (Physical Frame Number) in their respective page tables.
- Permissions: Process A might have Read-Only access, while Process B has Read-Write.
3. Page Replacement: The Quest for the Optimal Page
When RAM is full and a page fault occurs, we must evict a page. This is the Page Replacement Problem.3.1. The Benchmarks
- Optimal Algorithm (MIN/Belady’s): Evict the page that will not be used for the longest period in the future.
- Status: Impossible to implement (requires prescience).
- Utility: Used as a baseline for measuring other algorithms.
- FIFO (First-In, First-Out):
- The Flaw: Belady’s Anomaly. Adding more frames can actually increase the number of page faults. This happens because FIFO doesn’t follow the Stack Property.
3.2. LRU: The Practical King
Least Recently Used (LRU) is based on the principle of Temporal Locality: If you used it recently, you’ll probably use it again.- LRU is a Stack Algorithm: It is provably immune to Belady’s Anomaly.
- Implementation Cost: Keeping a timestamp or a sorted list on every memory access is too slow for hardware.
3.3. Approximations: Clock and WSClock
True LRU is impractical at hardware speeds, so modern OSs approximate it using the Clock Algorithm (also called Second Chance). The analogy: imagine a security guard walking around a circular hallway of offices. Each office has a “someone was here recently” light. The guard walks past each door: if the light is on, she turns it off and moves on (giving that office a second chance). If the light is already off, the occupant has not been seen in a full rotation — evict them.- The Reference Bit: Every PTE has a bit that the hardware sets to
1when the page is accessed (any read or write). - The Pointer: A “clock hand” sweeps through the circular list of pages.
- If bit is
1, set to0and move on (giving it a second chance — the page was recently used). - If bit is
0, the page has not been accessed since the last sweep — evict it.
- If bit is
4. Thrashing and the Working Set Model
Thrashing occurs when the system spends more time moving pages in and out of disk than executing instructions. It is the memory equivalent of a traffic jam: adding more cars (processes) to an already-congested highway (RAM) does not increase throughput — it collapses it.4.1. The Cause: Over-multiprogramming
As you increase the number of processes, each gets fewer frames. Eventually, no process has enough frames to hold its “Working Set,” and the system performance collapses (throughput drops to near zero).4.2. The Working Set Model ()
The Working Set is the set of pages used by a process in the last time units.- The Goal: The sum of the working sets of all active processes must fit in RAM.
- Admission Control: If , the OS must suspend a process entirely (Swap Out) to save the others.
5. Linux Internals: Memory Reclaim Path
Linux uses a sophisticated dual-path reclaim strategy.5.1. kswapd vs. Direct Reclaim
Think of memory reclaim like inventory management in a warehouse. kswapd is the overnight restocking crew that keeps shelves from going empty. Direct reclaim is when a customer arrives to find the shelves bare and has to wait while a worker runs to the back to fetch stock in real time.- kswapd: A background kernel thread that wakes up when free memory falls below a “Low Watermark.” It scans the LRU lists, evicting clean pages and writing back dirty ones, aiming to restore free memory to the “High Watermark.” If kswapd keeps up, applications never notice memory pressure.
- Direct Reclaim: If an application requests memory and
kswapdhasn’t freed enough, the application itself is forced to stop and free pages before it can proceed. This causes massive latency spikes — the__alloc_pages_slowpath()call stalls for milliseconds or more. Direct reclaim is the number-one cause of unexpected latency spikes in memory-constrained Linux systems.
vmstat 1 (the si/so columns for swap in/out) or perf stat -e vmscan:mm_vmscan_direct_reclaim_begin. If direct reclaim is triggering frequently, your system is under memory pressure and you need to either add RAM, reduce workload, or tune vm.min_free_kbytes to give kswapd more headroom.
5.2. Zswap and Zram
Instead of writing to a slow disk, Linux can compress pages and store them in a dedicated portion of RAM.- Zram: A compressed RAM disk used as swap.
- Zswap: A compressed cache for the swap pages.
- Result: Drastically better performance on systems with fast CPUs and limited RAM (like Android or IoT).
6. Security: The Meltdown Legacy
In 2018, the Meltdown vulnerability revealed that CPUs could speculatively leak kernel memory to user-space through the TLB cache.6.1. KPTI (Kernel Page Table Isolation)
Before Meltdown, the kernel was mapped into the upper half of every process’s page table for performance (to avoid TLB flushes on syscalls). This was an optimization dating back decades: by keeping the kernel mapped in every process’s address space, a syscall could transition from user to kernel mode without changing the page table (CR3 register), preserving TLB entries. Meltdown shattered this assumption. It showed that speculative execution could read kernel-mapped pages from user mode: the CPU would speculatively load kernel data, use it in a cache-timing side channel, and leak it to user space — all before the permission check was enforced.- KPTI changed this. Now, the kernel and user space have separate page tables. When running in user mode, the page table maps only user-space memory plus a tiny “trampoline” region needed for the syscall entry path. On syscall entry, the kernel swaps to the full page table.
- The Cost: Every syscall now requires a CR3 switch and a TLB flush (unless PCID — Process Context Identifiers — is used). PCID tags TLB entries so the hardware can keep entries from both page tables in the TLB simultaneously, reducing the flush cost from catastrophic to modest. On CPUs with PCID support (Intel Haswell and later), the overhead is typically 2-5%. On older CPUs without PCID, the hit was 5-30%, which is why Meltdown was such a wake-up call for the industry.
dmesg | grep "page tables isolation". Check PCID support with grep pcid /proc/cpuinfo. If you are running latency-sensitive workloads on pre-Haswell hardware, KPTI overhead is substantial and you should factor it into capacity planning.
7. Advanced Performance: HugePages and IOMMU
7.1. Transparent HugePages (THP)
Linux attempts to automatically merge 4KB pages into 2MB HugePages. The benefit: a single TLB entry now covers 2MB instead of 4KB, reducing TLB misses by up to 512x for large working sets. For a database scanning a 64GB table, this translates to measurable throughput improvements.- The Problem: If memory is fragmented, the kernel might spend huge amounts of CPU time “compacting” memory to make a 2MB hole. This compaction can cause latency spikes of 100+ ms, appearing as mysterious stalls in your application. The kernel can also “stall” an allocation while it waits for compaction, causing a process to hang unpredictably.
- Recommendation: For databases (PostgreSQL, MySQL, Redis), it is often better to disable THP (
echo never > /sys/kernel/mm/transparent_hugepage/enabled) and use Explicit HugePages (vm.nr_hugepagesin sysctl). Explicit huge pages are pre-allocated at boot, avoiding runtime compaction entirely.
perf top shows compact_zone or khugepaged, THP compaction is the likely culprit. Disable THP or pin your heap to explicit huge pages. The JVM flag -XX:+UseHugeTLBFS maps the heap to pre-allocated huge pages.
8. Senior Interview Deep Dive
Explain demand paging and copy-on-write -- what data structures are involved?
Explain demand paging and copy-on-write -- what data structures are involved?
Strong Answer Framework:Common Wrong Answers:
- Both are forms of lazy allocation. The kernel defers the work of allocating physical pages until the moment they are actually needed. Demand paging covers first-touch on a fresh anonymous mapping; copy-on-write covers post-fork sharing that becomes private on first write.
- The shared data structures. Each process has an
mm_struct, which contains a red-black tree (and now also a maple tree, kernel 6.1+) ofvm_area_structregions. Each VMA describes a virtually contiguous range with permissions, backing store (anon, file, shared), and flags. Below that, multi-level page tables (pgd,p4d,pud,pmd,pte) translate VAs to PFNs. Thestruct pagearray (one per physical frame) tracks reference counts and reverse mappings. - Demand paging walk.
mmap(...MAP_ANON)allocates a VMA but no page tables are populated. On first access, the MMU fails to find a PTE, raises a page fault.do_anonymous_pageallocates a frame from the buddy allocator, zeros it (or maps the shared zero page if read-only), inserts the PTE, returns. Total cost: a few microseconds for a minor fault. - COW walk.
forkclones the page tables but marks both copies’ PTEs read-only and bumps the refcount on eachstruct page. On the next write by either process, the MMU traps a permission fault.do_wp_pagechecks the refcount: if greater than 1, allocate a new frame, copy the contents, install a writable PTE pointing at the new frame, decrement the original’s refcount. - Reverse mapping (rmap). To handle COW and page reclaim, every
struct pageknows which VMAs map it via theanon_vmachain (for anon pages) or theaddress_spaceinterval tree (for file-backed). Without rmap, the kernel could not unmap a page from all sharers when reclaiming or migrating. - Why both matter. Demand paging is what makes Linux processes cheap to start: a 10 GB sparse mapping costs nothing physically. COW is what makes
forkcheap: a 32 GB Redis can fork forBGSAVEin milliseconds even though copying 32 GB would take seconds.
BGSAVE is the textbook COW story in production. Redis forks a child to dump the dataset; the parent keeps serving writes; the OS makes copies only of pages the parent dirties during the snapshot. In 2018, an Airbnb post-mortem described a Redis instance with a write rate of 80 percent of dataset size per minute; during a 30-second BGSAVE, the parent dirtied 40 percent of its 64 GB working set, requiring the kernel to allocate roughly 25 GB of new pages — briefly doubling RSS and triggering OOM. The fix: schedule BGSAVE during low-write windows, and configure vm.overcommit_memory=1 so Redis fork does not get rejected by the overcommit heuristic mid-save.Senior follow-up: What is the “page fault scaling” problem and how was it fixed?Until kernel 5.8, page faults serialized on
mmap_sem (per-process rw-semaphore). On a process with 1000 threads all faulting in different VMAs, the lock became a bottleneck even though the work was parallelizable. Per-VMA locking (Suren Baghdasaryan, 2022-2023) replaced this with fine-grained per-VMA locks, allowing concurrent faults in different regions. Java GC, Postgres, and large in-memory analytics workloads all benefit.Senior follow-up: How is the dirty bit propagated from hardware to the rmap?The hardware sets the dirty bit in the PTE on write. The kernel queries it during reclaim (
page_referenced_one) by walking the rmap to find every PTE mapping the page. If any is dirty, the page must be written back before reclaim. After writeback, the kernel clears the dirty bits via the rmap. This is why rmap is performance-critical: every reclaim scan and every COW break walks rmap.Senior follow-up: What happens if
fork is called on a process with 100 GB RSS and the system has 60 GB free?Default vm.overcommit_memory=0 uses heuristic accounting: the kernel checks if the request looks plausible. Forking 100 GB looks like 100 GB of new commit, which the kernel may reject — even though COW means almost no actual new memory will be needed. Setting vm.overcommit_memory=1 (always overcommit) tells the kernel to trust that COW will handle it. Redis explicitly recommends overcommit=1 for this reason. The trade-off: if every page actually does get written, you are out of memory and someone gets OOM-killed.- “COW just copies pages when needed.” True at a high level but skips the refcount-based decision, the rmap walk, and the write-protect mechanism. An interviewer expects the dance: shared, write fault, refcount check, allocate-and-copy or just-flip-permissions.
- “Demand paging means everything is lazy.” Almost true, but file-backed mappings can be eagerly populated via
MAP_POPULATEto avoid future faults. Pre-faulting is sometimes the right choice for latency-critical code. - “Forking a 100 GB process costs 100 GB.” No — it costs page-table copying (megabytes), not data copying. COW makes fork constant-time in dataset size. Misunderstanding this is the canonical “I have not actually built a forking server” tell.
- LWN: “Per-VMA locking” series (2022-2023) — the canonical explanation of fault scalability.
- Mel Gorman, “Understanding the Linux Virtual Memory Manager,” chapters 3-4 (free PDF) — the rmap and fault paths in depth.
- Linux kernel:
mm/memory.cdo_anonymous_pageanddo_wp_page— surprisingly readable.
Walk through a page fault -- what registers, what tables, what code path.
Walk through a page fault -- what registers, what tables, what code path.
Strong Answer Framework:Common Wrong Answers:
- Hardware trap. MMU fails to translate or detects a permission violation. CPU writes the faulting address to
CR2, the error code (present, write, user, instruction-fetch flags) to the stack, and vectors via the IDT to interrupt 14. On x86-64, the entry stub is inarch/x86/entry/entry_64.S, which calls intodo_page_faultinarch/x86/mm/fault.c. - Classify. Read
CR2and the error code. Determine whether it is a user-mode fault (most common), a kernel-mode fault (then check the exception table for a fixup, e.g.,copy_from_userfailures), or a fault while atomic (which is a kernel bug —BUG_ON(in_atomic())). - mmap_lock acquire. Take the per-mm lock (mmap_lock as of 5.8, formerly mmap_sem) for read. If per-VMA locking applies, take the per-VMA lock instead.
- VMA lookup. Walk the maple tree (or rb-tree on older kernels) for the VMA covering
CR2. No VMA: SIGSEGV. VMA exists but access type is forbidden (write to read-only, exec on no-exec): SIGSEGV. - Dispatch into
handle_mm_faultandhandle_pte_fault. This is the architecture-independent core. It walks the page tables, allocating intermediate levels if needed, until reaching the PTE. - Per-fault-type handler.
- PTE absent and anonymous:
do_anonymous_page— allocate from buddy, zero, install PTE. - PTE absent and file-backed:
do_fault(calls into the filesystem’svm_ops->fault) — consults page cache, may issue I/O. - PTE present, write fault, COW:
do_wp_page— check refcount, copy if shared, install writable PTE. - PTE marked swapped-out:
do_swap_page— read from swap device.
- PTE absent and anonymous:
- Install the PTE atomically. Use
set_pte_atwith appropriate barriers. Update accessed/dirty bits. Flush the local TLB entry (flush_tlb_page); if the mapping is shared, send shootdown IPIs. - Return. Drop locks,
iret, CPU restarts the faulting instruction, which now succeeds.
mmap_lock contention on multi-threaded servers. Their internal madvise(MADV_HUGEPAGE) policy and a custom faster fault path were precursors to upstream’s per-VMA locking work. The lesson is that “page fault” is not free even when minor: at fleet scale, even microsecond costs aggregate into measurable percent overhead, and reducing them is real money.Senior follow-up: Why is
mmap_lock taken for read on a fault but for write on mmap?Faults observe and update the page tables but do not change the VMA topology. mmap/munmap/mprotect change the VMA tree itself. Read-locked faults can run concurrently with each other; writers exclude readers. This is a classic readers-writer pattern and matches the access pattern: VMA changes are rare; faults are constant.Senior follow-up: How does the kernel handle a fault while holding a spinlock?It cannot. A fault can sleep (allocating from the buddy can block; reading from disk certainly can), and you cannot sleep with a spinlock held. The kernel uses
pagefault_disable() around critical sections to assert this; if a fault occurs inside such a region, it is treated as a kernel bug. This is why kernel code that needs to access user memory under a spinlock uses __copy_from_user_inatomic plus a fallback path that retries after dropping the lock.Senior follow-up: What is
userfaultfd and why might you want it?userfaultfd lets a user-space process register a region and handle faults in that region itself, in user mode. Used by post-copy live migration (Qemu/KVM), checkpoint-restore (CRIU), and user-space memory managers. The handler receives a notification, fetches the missing page from a remote source, and ioctl(UFFDIO_COPY)s it into the region. This decouples paging logic from the kernel, enabling things like “migrate VM while still running.”- “The CPU automatically loads the page from disk.” No — the CPU does nothing automatic beyond raising the trap. All work happens in the kernel handler.
- “The instruction is skipped after the kernel handles the fault.” The instruction is restarted, not skipped. Anything else would corrupt program state.
- “Page faults always cost a disk I/O.” Most page faults are minor — no I/O at all. Only major faults (swap-in, first read of file-backed) incur I/O. Conflating the two is the most common mistake.
- Linux kernel:
arch/x86/mm/fault.c— the entry point. Read top-to-bottom; it is well-commented. - LWN: “The future of VMA locking” by Jonathan Corbet — explains the mmap_lock evolution.
- Mel Gorman, “Understanding the Linux Virtual Memory Manager” — chapter 4 covers fault paths in detail.
TLB miss vs page fault -- different costs and frequencies.
TLB miss vs page fault -- different costs and frequencies.
Strong Answer Framework:Common Wrong Answers:
- Different events, different layers. A TLB miss is when the CPU’s translation cache cannot translate a virtual address; the CPU must walk the page tables to find the mapping. A page fault is when the mapping does not exist or violates permissions; the CPU traps to the kernel. TLB miss is hardware; page fault is hardware-plus-software.
- TLB miss cost. The hardware page-table walker reads CR3, walks 4 levels of page tables (PML4, PDPT, PD, PT), each fetching a cache line from L1 or further. Best case: all four reads hit L1, total 5-15 cycles — nanoseconds. Worst case: all four miss to DRAM, each 60-100ns, total 200-400ns. Still software-invisible.
- Page fault cost. Minor fault: 1-10 microseconds (no I/O, just allocation and PTE install). Major fault: 100 microseconds on NVMe, milliseconds on SATA SSD, tens of milliseconds on spinning disk. Page faults invoke the kernel and run software.
- Frequency in practice. On a typical server, TLB misses run at 100K-10M per second per core depending on working set vs TLB coverage. Minor faults run at thousands to tens of thousands per second; major faults should be near zero in healthy operation. The cost ratio is: 1 page fault equals roughly 1000-100000 TLB misses.
- What to optimize. TLB pressure: enable hugepages so a single TLB entry covers more memory (a 2 MB hugepage = 512 4 KB pages of TLB coverage). Page fault rate: pre-fault critical regions (
MAP_POPULATE,mlock), avoid swap, monitor major-fault counters. - The shared mechanism. A page fault often invalidates and refills a TLB entry. So a major fault is also at least one TLB miss. They are layered: the TLB miss happens first; if the page-table walk finds nothing, that escalates to a page fault.
Senior follow-up: How big is the TLB and why are there L1 and L2 TLBs?Modern x86: L1 dTLB has 64-128 entries, L1 iTLB about 64, L2 STLB (shared) has 1500-2000. L1 is fully associative for low latency; L2 is set-associative and slower. The hierarchy mirrors the data caches. Coverage with 4 KB pages: L1 covers about 256 KB, L2 covers about 8 MB. With 2 MB hugepages, coverage jumps 512x.
Senior follow-up: Can a TLB miss happen on a hugepage?Yes — the hugepage TLB is separate from the 4 KB TLB on most CPUs (typically 16-32 entries on L1). If you have many hugepages mapped, you still get hugepage TLB misses, but the page-table walk is shorter (3 levels instead of 4 for 2 MB pages, 2 levels for 1 GB pages), so the miss is cheaper.
Senior follow-up: How do you measure TLB misses in production?
perf stat -e dTLB-load-misses,iTLB-load-misses,dTLB-store-misses gives per-CPU counters. For a single process: perf stat -p [pid]. To find which code path causes them: perf record -e dTLB-load-misses -p [pid] then perf report. The hot functions should match your hot code paths; if a “boring” function suddenly tops the list, suspect a memory-layout issue (poor cache locality, large strides through huge data structures).- “TLB miss and page fault are the same thing.” They are not. A TLB miss is satisfied entirely by hardware walking the page tables. A page fault is when there is no mapping at all and the kernel must do work.
- “You can’t optimize TLB misses without kernel changes.” Hugepages are a user-tunable knob. So is
prctl(PR_SET_THP_DISABLE)for opting out. Memory layout (stride patterns, working-set partitioning) also affects TLB pressure. - “Page faults are always bad.” Minor faults are normal and cheap; they are how lazy allocation works. The right metric is major-fault rate, not total fault rate.
- Intel Software Developer Manual, Volume 3, chapter on paging — covers the TLB hierarchy.
- Brendan Gregg, “Linux Performance” book, chapter on memory — includes TLB and fault analysis.
- ACM article: “Translation Caching: Skip, Don’t Walk (the Page Table)” — TLB optimization research relevant to modern hardware.
What is a 'Dirty Page' and why does it matter for replacement?
What is a 'Dirty Page' and why does it matter for replacement?
Strong Answer Framework:Common Wrong Answers:
- The dirty bit. Each PTE has a hardware-set bit (
PTE_DIRTYon x86) that the CPU sets atomically on the first write to the page. The kernel queries it during reclaim to decide what to do. - Clean vs dirty implications.
- Clean page: contents match the backing store (executable code, file-backed read-only mmap, fresh page-cache page). Eviction cost equals freeing the frame — microseconds.
- Dirty page: in-memory contents differ from the backing store. Eviction requires writeback (anonymous page goes to swap; file-backed page goes to the original file). Cost equals an I/O operation — milliseconds.
- Replacement preference. The kernel’s reclaim path (in
mm/vmscan.c) prefers evicting clean pages first. A clean page can be reclaimed synchronously inside the allocator. Dirty pages are flushed asynchronously bykswapdor the writeback threads; only as a last resort does direct reclaim wait for dirty page I/O. - Anonymous vs file-backed dirty handling. Dirty file-backed pages are written back to their file. Dirty anon pages are written to swap. If swap is disabled or full, the kernel cannot reclaim dirty anon pages — they are “pinned” until written to swap or the process dies.
- The dirty ratio knobs.
vm.dirty_ratio(default 20 percent) is the upper bound on dirty memory before processes synchronously write back.vm.dirty_background_ratio(default 10 percent) is the threshold where background writeback begins. On a 256 GB box, default ratios mean 25.6 GB of dirty memory before background flush — often too high for spinning disks, which leads to multi-second stalls when the writeback finally fires. - Production tuning. For latency-sensitive workloads on slow storage, lower
dirty_ratioto 5-10 percent anddirty_background_ratioto 2-5 percent. This trades a small throughput hit for much smoother latency.
dirty_ratio was the default 20 percent on machines with 64 GB RAM, allowing 12.8 GB of dirty pages before synchronous writeback. When a backup job triggered a flush of 12 GB to spinning disk, every process trying to write more got stalled for 30+ seconds in wait_on_page_writeback. The fix was to set dirty_ratio=10 and dirty_background_ratio=5, plus switch to faster storage for the backup target. Post-fix, the same scenarios produced sub-second stalls.Senior follow-up: How does the kernel handle a write to a clean file-backed page?The page is marked dirty in the PTE (hardware) and also via
set_page_dirty, which propagates the dirty state into the address_space radix tree (now xarray). The page is added to the dirty LRU. Periodic writeback (the per-bdi flusher threads) eventually writes it back via the filesystem’s writepages operation, then clears the dirty bit. This is the classic writeback path; it is what fdatasync waits for.Senior follow-up: What is the “dirty page balancing” mechanism?When dirty memory exceeds
dirty_ratio, processes that are dirtying memory get throttled in balance_dirty_pages. The kernel computes a “dirty rate” the process can sustain and makes it sleep proportionally. This prevents one runaway writer from filling memory faster than the disk can drain it. The behavior is intentional but can look like a hang. iotop shows the wait state, and /proc/[pid]/stack pinpoints balance_dirty_pages in the call chain.Senior follow-up: Why might
MADV_DONTNEED leave the page dirty?MADV_DONTNEED immediately frees the page; the memory becomes zero on next access. Dirty bit is moot because the page is gone. Compare with MADV_FREE (kernel 4.5+), which marks the pages as candidates for reclaim but leaves them in place; if the application writes to them before reclaim, the kernel notices the dirty bit and keeps the contents — a clever lazy-reclaim mechanism that allocators (jemalloc) rely on heavily.- “Dirty pages are bad and should be avoided.” No — dirty pages are normal and necessary. Every write produces a dirty page. The question is how aggressively they are flushed, not whether they exist.
- “The kernel always evicts the oldest page first.” Modern kernels use a two-list LRU (active and inactive) plus the clean/dirty distinction. Eviction order is a function of recency and write state, not a single FIFO.
- “Dirty pages can always be discarded if memory is needed.” Only clean pages can. Dirty pages must be written back first or the data is lost.
- LWN: “Better page-cache writeback control” — covers per-bdi flushing.
- kernel.org Documentation/admin-guide/sysctl/vm.rst — canonical reference for dirty_ratio and friends.
- Robert Love, “Linux Kernel Development” chapter on the page cache and writeback.
8. Hands-on Labs: Observing Virtual Memory in Practice
These exercises are designed so you can see the theory in action on a Linux system.8.1 Lab: Watching Page Faults and Working Set Growth
- Write a simple C program that touches memory gradually:
- Run it and watch page faults and RSS:
8.2 Lab: TLB Behavior with Strided Access
- Write a program that scans an array with different strides and times the result.
- Use
perf stat -e dTLB-load-misses,iTLB-load-missesto correlate access pattern with TLB pressure.
8.3 Lab: Inspecting Mappings with /proc
Use existing tools to introspect a running process:
9. Practice: The Page Table Challenge
Scenario: You are designing a 64-bit system with 8KB pages and 4-level paging.- How many bits are used for the offset? (, so 13 bits).
- If each PTE is 8 bytes, how many entries fit in a page? (, so 10 bits per level).
- How many bits of virtual address are covered by this 4-level scheme? ().
Next: Synchronization Primitives →