Memory Management: The Hardware-Software Interface

Memory management is the most complex dance between hardware (CPU/MMU) and software (Kernel). It is not merely about “giving memory to programs”; it is about creating a consistent, isolated, and high-performance execution environment while hiding the messy reality of physical RAM. Think of it like a hotel manager: guests (processes) each think they have the entire building to themselves (virtual address space), but behind the scenes the manager is constantly juggling room assignments (page tables), cleaning rooms for new guests (page reclaim), and moving guests to overflow parking (swap). This module covers everything from basic allocation strategies to the deep internals of 5-level paging and TLB management on modern x86-64 processors.
Interview Frequency: Critical
Key Topics: Page Tables, TLB, Fragmentation, Buddy/Slab Allocators, NUMA
Time to Master: 15-20 hours

1. The Core Architecture: Virtual vs. Physical

At the heart of modern computing lies a lie: Every process thinks it has the entire memory space to itself. This is Virtual Memory.

1.1. Why Virtual Memory?

  1. Isolation: Process A cannot read Process B’s memory.
  2. Relocatability: A process can be loaded at any physical address without changing its code.
  3. Efficiency: We only keep the “working set” of a process in physical RAM.
  4. Security: We can mark certain areas as “No-Execute” or “Read-Only”.

1.2. The Address Space Layout

A standard 64-bit address space is effectively a vast void. On x86-64, only 48 bits (4-level paging) or 57 bits (5-level paging) are actually used, and every valid address must be canonical: the unused upper bits are copies of the highest implemented bit. With 48-bit addressing the layout is:
  • User Space (Lower half): 0x0000000000000000 to 0x00007FFFFFFFFFFF
  • The “Gap” (Non-canonical): Hardware throws a #GP fault if accessed.
  • Kernel Space (Upper half): 0xFFFF800000000000 to 0xFFFFFFFFFFFFFFFF
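The canonical-address rule can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names is_canonical and classify are made up for this example and do not correspond to any kernel API.

```python
def is_canonical(va: int, va_bits: int = 48) -> bool:
    """Return True if a 64-bit virtual address is canonical for va_bits of translation."""
    if not 0 <= va < 1 << 64:
        raise ValueError("expected a 64-bit unsigned address")
    # Bits 63 down to (va_bits - 1) must all be equal: all zeros or all ones.
    top = va >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def classify(va: int, va_bits: int = 48) -> str:
    """Label an address as user half, kernel half, or inside the non-canonical gap."""
    if not is_canonical(va, va_bits):
        return "non-canonical"
    return "kernel" if va >> 63 else "user"
```

For example, classify(0x0000800000000000) falls in the gap with 48-bit addressing, while the same address is a valid user address when va_bits=57.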

2. Hardware Support: The MMU and Page Tables

The Memory Management Unit (MMU) is a hardware component in the CPU that performs the translation from Virtual Address (VA) to Physical Address (PA) on every single memory access. This is not an optional slow path — every MOV, every instruction fetch, every stack push passes through the MMU. It is one of the most performance-critical pieces of hardware in the entire CPU.

2.1. The Paging Mechanism

Memory is divided into fixed-size Pages (Virtual) and Frames (Physical). A typical page is 4KB. Virtual Address Breakdown (4KB pages):
  • VPN (Virtual Page Number): Used to index into the page table.
  • Offset: Position within the 4KB page (lowest 12 bits).

2.2. Multi-Level Page Tables (The x86-64 Walk)

A single-level page table for a 48-bit virtual address space would require 512 GB of memory just for the table itself — obviously impossible. We use a tree structure instead, where unused branches of the tree are simply not allocated. Think of it like a book index: you do not need an entry for every possible word, only the words that actually appear. For 4-level paging (standard on x86-64):
  1. CR3 Register: Holds the physical address of the PML4 (Level 4 table).
  2. PML4: Uses bits 47-39 of the VA to find the PDPT (Page Directory Pointer Table).
  3. PDPT: Uses bits 38-30 to find the PD (Page Directory).
  4. PD: Uses bits 29-21 to find the PT (Page Table).
  5. PT: Uses bits 20-12 to find the Physical Frame.
  6. Offset: Bits 11-0 are added to the Frame start to get the byte.
The “Walk” Cost: Every memory access theoretically requires 4 additional memory lookups (one per level of the page table tree). Without the TLB (described next), this would make every memory access 5x slower. This is why the TLB is arguably the most important cache in the entire CPU.
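To make the bit ranges above concrete, here is a minimal sketch that slices a 48-bit virtual address into its four table indices and page offset, and reproduces the 512 GB figure for a flat table. The names split_va and flat_table_bytes are invented for illustration, and the 8-byte entry size is the x86-64 PTE size.

```python
PAGE_SHIFT = 12          # 4 KB pages: low 12 bits are the offset
INDEX_BITS = 9           # each level indexes a 512-entry table
LEVELS = ("pml4", "pdpt", "pd", "pt")


def split_va(va: int) -> dict:
    """Split a virtual address into per-level indices plus the page offset."""
    parts = {"offset": va & ((1 << PAGE_SHIFT) - 1)}
    for depth, name in enumerate(LEVELS):
        # pml4 uses bits 47-39, pdpt 38-30, pd 29-21, pt 20-12
        shift = PAGE_SHIFT + INDEX_BITS * (len(LEVELS) - 1 - depth)
        parts[name] = (va >> shift) & ((1 << INDEX_BITS) - 1)
    return parts


def flat_table_bytes(va_bits: int = 48, pte_bytes: int = 8) -> int:
    """Size of a hypothetical single-level table covering the whole address space."""
    return (1 << (va_bits - PAGE_SHIFT)) * pte_bytes
```

With these definitions, flat_table_bytes() evaluates to 2^39 bytes (512 GB), which is why the tree structure is used instead.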

2.5 From *p = 42 to a DRAM Cell

To make all the abstractions concrete, walk through a single C statement:
int *p = malloc(sizeof(int));
*p = 42;
What actually happens when the CPU executes *p = 42?
  1. User-space allocator (malloc) has already:
    • Reserved a chunk of virtual address space (via brk() or mmap()).
    • Returned a pointer p that lives in your process’s virtual address space.
  2. CPU executes the store:
    • The compiler emits something like movl $42, (%rdi) where %rdi holds p.
    • The Virtual Address (VA) in %rdi is handed to the MMU.
  3. MMU + Page Tables:
    • The MMU looks up the VA in the TLB.
    • On a TLB hit: It immediately gets the Physical Frame Number (PFN) and offset.
    • On a TLB miss: It walks the multi-level page tables using the scheme described above (PML4 → PDPT → PD → PT) to find the PFN, then updates the TLB.
  4. Potential Page Fault:
    • If the page table entry is not present or lacks write permissions, the CPU raises a page fault exception.
    • The kernel’s page fault handler (do_page_fault historically, exc_page_fault on x86 since kernel 5.8) decides whether to allocate a new frame, fetch from swap, or kill the process (e.g., invalid pointer).
  5. Cache Hierarchy:
    • Once the Physical Address is known, the CPU looks in L1 data cache.
    • On a miss, it walks out to L2/L3 and finally to DRAM.
    • The cache line (typically 64 bytes) containing the target address is loaded into L1.
  6. Store Buffer and Write:
    • The value 42 is placed into the CPU’s store buffer and eventually merged into the L1 cache line.
    • Later, the cache coherence protocol (MESI) ensures other cores see the updated value when needed.
From the C programmer’s perspective, *p = 42; is a single operation. In reality, it is a coordinated dance between:
  • User-space allocator (managing virtual addresses).
  • Kernel (managing page tables and page faults).
  • MMU/TLB (translating addresses).
  • Cache hierarchy and coherence protocol (moving and sharing data).
  • DRAM controller (accessing physical cells).
Keep this mental picture in mind when debugging performance issues: a “simple store” might involve a TLB miss, a page fault, cache misses, and NUMA penalties.

3. The TLB: Translation Lookaside Buffer

To solve the 4-lookup penalty, the CPU uses a specialized cache called the TLB.

3.1. TLB Internals

  • CAM (Content-Addressable Memory): The TLB is extremely fast. It stores (VPN → PFN) mappings.
  • TLB Hit: Translation in ≈1 cycle.
  • TLB Miss: Hardware (or software on some RISC) must perform the page table walk.

3.2. Context Switches and the TLB

When switching from Process A to Process B, the page tables change. Traditionally, we must flush the TLB because A’s mapping of address 0x4000 is different from B’s.
  • Performance Hit: A flushed TLB leads to a “cold start” period of high latency.
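  • Optimization: ASIDs (Address Space IDs): Modern CPUs tag TLB entries with an address-space identifier (PCID on Intel). Entries belonging to other processes can then stay in the TLB across a context switch, and a flush is needed only when an identifier has to be recycled for a different address space.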
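A toy model helps show why tagging matters. The sketch below assumes a tiny fully associative TLB with LRU replacement (real TLBs are set-associative with hardware-specific policies), and the class name ToyTLB and its methods are hypothetical. A dictionary lookup stands in for the page table walk.

```python
from collections import OrderedDict


class ToyTLB:
    """Tiny ASID-tagged TLB model: (asid, vpn) -> pfn, evicting least recently used."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def lookup(self, asid: int, vpn: int, page_table: dict) -> int:
        key = (asid, vpn)
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)      # mark as most recently used
            return self.entries[key]
        self.misses += 1
        pfn = page_table[vpn]                  # stands in for the multi-level walk
        self.entries[key] = pfn
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        return pfn

    def flush(self) -> None:
        """What an untagged TLB must do on every context switch."""
        self.entries.clear()
```

Because the key includes the ASID, two processes can both map virtual page 4 to different frames without evicting each other; calling flush() between them instead forces every lookup after the switch to miss.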

4. Contiguous Allocation: The Buddy System

While paging handles user memory, the kernel often needs physically contiguous memory (e.g., for DMA transfers where the hardware expects a continuous range of physical addresses, or for hugepages). The page allocator hands out individual 4KB frames, but the buddy system organizes those frames into larger contiguous blocks.

4.1. The Buddy System Logic

Linux uses the Buddy Allocator. It manages memory in blocks of 2^n pages (Orders). The name comes from the pairing: every block has exactly one “buddy” that it can merge with. Think of it like splitting a chocolate bar: you can break a 64-square bar into two 32-square halves, then break one half into two 16-square quarters. When you finish eating and return the pieces, if both halves of any pair are returned, they snap back together into the larger block.
  • Allocation: If you want Order 2 (16KB) but only Order 4 (64KB) is free, the allocator splits 64 -> 32+32, then 32 -> 16+16. One 16KB block is given to you; the other 16KB and 32KB blocks go on their respective free lists for future use.
  • Freeing (Coalescing): When you free a block, the kernel checks its “buddy” (the adjacent block of the same size). If the buddy is also free, they merge back into a block of 2^(n+1) pages, and this cascading merge can continue up through multiple orders.
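The split and coalesce steps above can be sketched in a few lines. This is a simplified model, not the kernel's implementation: it tracks free blocks as sets of starting page numbers per order, and the class and method names are invented for the example. The buddy of a block is found by flipping one bit of its start page.

```python
class BuddyAllocator:
    """Toy buddy allocator over 2**max_order pages, starting at page 0."""

    def __init__(self, max_order: int):
        self.max_order = max_order
        self.free = {order: set() for order in range(max_order + 1)}
        self.free[max_order].add(0)            # one big free block initially

    def alloc(self, order: int):
        """Return the start page of a 2**order block, or None if nothing fits."""
        for o in range(order, self.max_order + 1):
            if self.free[o]:
                break
        else:
            return None
        start = min(self.free[o])
        self.free[o].remove(start)
        while o > order:                       # split, keeping the lower half
            o -= 1
            self.free[o].add(start + (1 << o)) # upper half returns to the free list
        return start

    def free_block(self, start: int, order: int) -> None:
        """Free a block and merge with its buddy for as long as the buddy is free."""
        while order < self.max_order:
            buddy = start ^ (1 << order)       # buddy differs only in this bit
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free[order].add(start)
```

Allocating Order 2 from a single free Order 4 block leaves one Order 2 and one Order 3 block on the free lists, matching the 64 -> 32+32 -> 16+16 walk-through; freeing it again cascades back to a single Order 4 block.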
Practical tip: You can see the buddy system’s current state in /proc/buddyinfo. If the higher-order columns (Order 3+) are all zeros, the system is suffering from external fragmentation — plenty of free memory, but no contiguous chunks large enough for hugepage allocations or DMA. Running echo 1 > /proc/sys/vm/compact_memory triggers the kernel’s memory compaction to defragment physical memory.
Caveat: Fragmentation vs. Free Memory (the “Plenty of RAM but Allocation Fails” Trap)
A classic production puzzle: free -h shows 12 GB available, but the kernel logs “page allocation failure: order:4”. The system has memory but cannot find a contiguous 64 KB block. This is not a bug; it is external fragmentation. Long-running servers with mixed allocation patterns (kernel slab objects scattered among user pages) accumulate fragmentation until the buddy allocator can no longer satisfy high-order requests.
This bites drivers that need contiguous DMA buffers, jumbo-frame networking, and any code path that requests a high-order block, for example alloc_pages(GFP_KERNEL, order) with order > 3 or a large kmalloc(). The symptoms look like a memory leak, but the underlying memory accounting is fine.
Fix: Inspect, Compact, Reserve
  1. Inspect: cat /proc/buddyinfo — if Order 4 and above are all zeros, you have fragmentation, not exhaustion.
  2. Trigger compaction: echo 1 > /proc/sys/vm/compact_memory. This is synchronous and can stall the system briefly, so do it during a maintenance window the first time.
  3. Long-term: pre-allocate critical contiguous regions at boot using hugepages=, cma=, or memmap= boot parameters. CMA (Contiguous Memory Allocator) reserves a region that movable allocations can use opportunistically but that drivers can reclaim on demand.
  4. Tune vm.extfrag_threshold and enable proactive compaction with vm.compaction_proactiveness (kernel 5.9+). For long-uptime fleet machines, schedule periodic compaction during quiet hours.
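The "inspect" step can be automated. The sketch below parses text in the /proc/buddyinfo layout (one row per node and zone, one count per order) and reports the largest order that still has a free block. The helper names and the sample row in the usage note are illustrative, not taken from a real machine.

```python
def parse_buddyinfo(text: str) -> dict:
    """Map (node, zone) -> list of free-block counts, indexed by order."""
    zones = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Layout: "Node", "0,", "zone", "Normal", count_order0, count_order1, ...
        node = int(fields[1].rstrip(","))
        zone = fields[3]
        zones[(node, zone)] = [int(n) for n in fields[4:]]
    return zones


def largest_free_order(counts: list):
    """Highest order with at least one free block, or None if the zone is empty."""
    for order in range(len(counts) - 1, -1, -1):
        if counts[order]:
            return order
    return None


def free_pages(counts: list) -> int:
    """Total free pages in the zone: each order-n block holds 2**n pages."""
    return sum(n << order for order, n in enumerate(counts))
```

Given a made-up row such as "Node 0, zone Normal 9000 4000 1500 200 0 0 0 0 0 0 0", free_pages reports 24600 pages (about 96 MB) while largest_free_order returns 3, which is exactly the fragmentation signature described above: memory is free, but an order-4 request cannot be satisfied without compaction.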

4.2. Fragmentation

  • Internal: Wasted space inside the 2^n block (e.g., requesting 17KB gets you 32KB).
  • External: Plenty of free memory, but none of it is contiguous enough for a large request.
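Internal fragmentation in a power-of-two allocator is simple arithmetic. The following sketch assumes 4 KB pages and integer sizes in KB; order_for and internal_waste_kb are hypothetical helper names.

```python
PAGE_KB = 4


def order_for(size_kb: int) -> int:
    """Smallest order whose block (2**order pages) can hold size_kb."""
    pages = -(-size_kb // PAGE_KB)            # ceiling division
    return max(pages - 1, 0).bit_length()     # round the page count up to a power of two


def internal_waste_kb(size_kb: int) -> int:
    """KB allocated but unused when size_kb is rounded up to a buddy block."""
    return (PAGE_KB << order_for(size_kb)) - size_kb
```

A 17 KB request needs 5 pages, which rounds up to Order 3 (32 KB), so internal_waste_kb(17) is 15; a 16 KB request fits Order 2 exactly and wastes nothing.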

5. Small Object Allocation: SLAB and SLUB

Requesting a 4KB page just to store a 32-byte object is wasteful: more than 99% of the page would go unused. The kernel needs a “sub-page” allocator for the millions of small objects it creates and destroys every second.

5.1. The Slab Concept

A Slab is a set of one or more contiguous pages, partitioned into small, fixed-size slots for specific kernel objects (e.g., inodes, dentries, mm_struct). Think of it like an ice cube tray: the tray (slab) has a fixed number of identically-shaped slots, and each slot holds exactly one object. Different object types get different trays.

5.2. Object Caching

Instead of initializing memory every time, the Slab allocator keeps a pool of “constructed” objects. When you free an object, it remains initialized in the “free” pool, ready for the next request. This is a huge win for objects with expensive constructors. For example, a struct inode requires setting up several internal locks and lists, so reusing a previously initialized inode is much faster than building one from scratch.
Practical tip: Run slabtop to see which kernel object caches are consuming the most memory. On a system running many processes, task_struct and mm_struct caches will be large. On a file server, dentry and inode_cache will dominate. If dentry is consuming gigabytes, the kernel is caching directory entries for files that were accessed in the past. This is usually beneficial (faster path lookups), but you can reclaim it with echo 2 > /proc/sys/vm/drop_caches if needed.
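The object-caching idea from 5.2 can be modeled with a small free pool. This is a conceptual sketch only: the SlabCache class is invented for illustration, a Python list stands in for the slots of a slab, and the real SLAB/SLUB allocators manage raw memory with per-CPU lists rather than Python objects.

```python
class SlabCache:
    """Toy object cache: constructs objects one slab at a time and reuses freed ones."""

    def __init__(self, name: str, ctor, objs_per_slab: int = 4):
        self.name = name
        self.ctor = ctor                      # the "expensive" constructor
        self.objs_per_slab = objs_per_slab
        self.free_objs = []                   # constructed objects ready for reuse
        self.slabs_grown = 0
        self.ctor_calls = 0

    def _grow(self) -> None:
        """Carve a new slab into slots and run the constructor once per slot."""
        self.slabs_grown += 1
        for _ in range(self.objs_per_slab):
            self.ctor_calls += 1
            self.free_objs.append(self.ctor())

    def alloc(self):
        if not self.free_objs:
            self._grow()
        return self.free_objs.pop()

    def free(self, obj) -> None:
        """Return the object to the pool without tearing it down."""
        self.free_objs.append(obj)
```

Allocating, freeing, and allocating again hands back the same already-constructed object, and the constructor count stays at one slab's worth no matter how many alloc/free cycles follow.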
Caveat: RSS, PSS, and USS (Why “Memory Used” Has Three Different Answers)
Ask top how much memory your process is using and you get RSS (Resident Set Size). RSS counts every resident page mapped into the process, including pages shared with other processes. If five workers each load the same 200 MB shared library, each one’s RSS includes that 200 MB, so naive sums report 1 GB of “memory used” when the actual physical footprint is around 200 MB shared plus per-process private data.
The three values you actually need:
  • RSS: total resident pages, shared counted in full per process. Overcounts on shared memory.
  • PSS (Proportional Set Size): shared pages divided by the number of processes sharing them. Sum of PSS across the system equals real physical memory used. This is the right number for capacity planning.
  • USS (Unique Set Size): only private pages. The amount you would actually free by killing this one process.
Fix: Use smem and /proc/[pid]/smaps_rollup
smem -t gives you RSS, PSS, and USS in one view. For per-process kernel-side answers: cat /proc/[pid]/smaps_rollup reports Rss, Pss, Private_Clean, Private_Dirty, and Shared_Clean. When deciding whether killing process X will actually free 4 GB, look at USS, not RSS. When sizing instances for a fleet of identical workers, sum PSS across the fleet. You will typically find you can pack 30-50 percent more workers per host than RSS-based sizing suggests, because the shared text segments and read-only mmaps are counted once per process under RSS.
Concrete example: on a server running 16 PHP-FPM workers, RSS sum was 12 GB. PSS sum was 4.2 GB. The other 7.8 GB was the same OPcache and shared libraries counted 16 times.
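The relationship between the three numbers is easy to verify with the five-worker example. The sketch assumes each worker has 50 MB of private memory (a made-up figure) on top of the shared 200 MB library; memory_views is a hypothetical helper, not a real tool.

```python
def memory_views(private_mb: float, shared_mb: float, sharers: int):
    """Return (RSS, PSS, USS) in MB for one process in a group sharing shared_mb."""
    rss = private_mb + shared_mb              # shared pages counted in full
    pss = private_mb + shared_mb / sharers    # shared pages split between sharers
    uss = private_mb                          # only what this process owns alone
    return rss, pss, uss
```

For five such workers, summing RSS gives 1250 MB, while summing PSS gives 450 MB, which equals the true footprint of 5 x 50 MB private plus one 200 MB shared copy.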
Caveat: Transparent Huge Pages Cause Latency Spikes (Disable for Latency-Critical Services)
THP sounds free: bigger pages, fewer TLB misses, faster lookups. In practice, promoting 4 KB pages to 2 MB hugepages requires scanning, compacting, and migrating pages, whether in the khugepaged kernel thread or directly in the page-fault path. On a fragmented system, that compaction triggers TLB shootdowns across CPUs and can stall user threads for tens of milliseconds.
Vendor guidance for Cassandra, MongoDB, Redis, PostgreSQL, Couchbase, and Oracle has commonly recommended disabling THP, and it is a frequent suspect behind otherwise mysterious p99 latency spikes. Recommendations change between releases, so check the current documentation for your version. The pattern in dashboards: a periodic latency saw-tooth correlated with khugepaged CPU usage.
Fix: Disable THP, Use Explicit Hugepages
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Make this persistent via a tuned profile, the GRUB cmdline (transparent_hugepage=never), or a systemd unit that runs before your service starts. If you want hugepage benefits without the compaction risk, reserve explicit hugepages (hugepages=N on the kernel command line, or the vm.nr_hugepages sysctl early in boot) and have your service map them with mmap(MAP_HUGETLB). Alternatively, set the enabled knob to madvise instead of never, so that only regions the application marks with madvise(MADV_HUGEPAGE) are eligible for THP.
Verify: after the change, grep AnonHugePages /proc/meminfo should drop toward zero, and your p99 saw-tooth should disappear within a deploy cycle.

6. Segmentation: A Historical Perspective

Before Paging became dominant, Segmentation was used to divide memory into logical units (Code, Data, Stack).
  • GDT (Global Descriptor Table): A table where each entry defines a segment’s base, limit, and permissions.
  • Why it failed: Variable-sized segments led to massive External Fragmentation and complex compaction needs.
  • Modern Use: On x86-64, segmentation is mostly disabled, but the FS and GS segment registers are still used for Thread Local Storage (TLS) and pointing to per-CPU kernel data.

7. Advanced Memory Features

7.1. HugePages

Standard 4KB pages are small. For a 1TB database in memory, you need 256 million page table entries — the page table itself consumes gigabytes of RAM and the TLB can only cache a tiny fraction of those mappings, leading to constant TLB misses (each costing a 4-level page table walk).
  • HugePages (2MB or 1GB): Using 2MB pages, the same 1TB database needs only 512K entries. The TLB can cover much more memory, and the page table walk is shorter (3 levels instead of 4 for 2MB pages, 2 levels for 1GB pages).
  • THP (Transparent HugePages): A Linux kernel feature that automatically attempts to promote 4KB pages to 2MB HugePages in the background. Sounds great in theory, but in practice it can cause latency spikes — the kernel’s khugepaged thread periodically scans memory, compacts pages, and does TLB shootdowns to create hugepages, all of which can pause application threads.
Practical tip: Databases (PostgreSQL, Redis, MongoDB) and JVMs often recommend disabling THP and using explicit HugePages instead (configured via vm.nr_hugepages). THP’s background compaction causes unpredictable latency spikes that are unacceptable for latency-sensitive workloads. Explicit hugepages are allocated at boot time and never compacted, giving you the TLB benefits without the latency cost. Check grep -i huge /proc/meminfo to see your current hugepage allocation.
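The entry counts quoted in this section follow directly from dividing the mapped size by the page size. The sketch below reproduces them and adds a TLB reach calculation; the 1536-entry TLB in the usage note is an assumed figure for illustration, not a statement about any particular CPU.

```python
KIB, MIB, GIB, TIB = 1 << 10, 1 << 20, 1 << 30, 1 << 40


def leaf_entries(mapped_bytes: int, page_bytes: int) -> int:
    """Number of leaf page-table entries needed to map mapped_bytes."""
    return mapped_bytes // page_bytes


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of address space a TLB with this many entries can cover at once."""
    return entries * page_bytes
```

Mapping 1 TB with 4 KB pages takes 2^28 entries (about 2 GB of 8-byte PTEs), while 2 MB pages need 512K entries. A hypothetical 1536-entry TLB covers 6 MB with 4 KB pages but 3 GB with 2 MB pages.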

7.2. NUMA (Non-Uniform Memory Access)

In multi-socket servers, CPU 0 is “closer” to RAM Bank 0 than RAM Bank 1. Think of it like offices in different buildings: accessing a file cabinet in your own building (local NUMA node) is fast, but walking to the other building (remote NUMA node) takes 2-3x longer.
  • Local vs Remote: Accessing remote RAM can be 2x-3x slower in terms of latency and also consumes cross-socket interconnect bandwidth (QPI/UPI on Intel, Infinity Fabric on AMD).
  • First-Touch Policy: Linux generally allocates physical frames on the NUMA node where the thread that first writes to the page is running. This is usually correct (the thread that initializes data is often the one that uses it), but it can go wrong if you initialize data on one thread and then hand it off to threads on a different NUMA node.
Practical tip: Use numactl --hardware to see your NUMA topology and numastat to see per-node memory allocation and miss counts. If other_node hits are high, your application is frequently accessing remote memory. For NUMA-sensitive workloads (databases, HPC), pin processes to specific NUMA nodes with numactl --cpunodebind=0 --membind=0 to ensure all memory accesses are local.

7.3. Copy-on-Write (COW)

When you fork() a process, the kernel does not copy the memory. That would be painfully slow for a process with gigabytes of RAM. Instead, it points the new process’s page tables to the same physical frames but marks them Read-Only. Think of it like sharing a Google Doc with “view only” permissions — both users see the same document, and only when one needs to edit does the system create a private copy.
  • If either process tries to write, the CPU triggers a Page Fault (the page is marked read-only).
  • The kernel’s fault handler recognizes this as a COW fault, allocates a new physical frame, copies the page content, updates the faulting process’s page table to point to the new frame with Read-Write permissions, and restarts the instruction.
  • The other process continues to use the original frame, unaware that a copy was made.
Practical tip: COW is why Redis can BGSAVE (background snapshot) without pausing — it forks a child process that shares the parent’s memory via COW. Only pages that the parent modifies during the snapshot get copied. However, if the parent is write-heavy, the COW copies can temporarily double memory usage. Monitor RSS of both parent and child during snapshotting.
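The fork-then-copy-on-write sequence can be simulated with reference-counted frames. This is a teaching model under simplifying assumptions: Frame and Process are invented classes, a write to a read-only entry plays the role of the page fault, and permissions are a single writable flag per page.

```python
class Frame:
    """A physical frame with a reference count of page tables pointing at it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class Process:
    def __init__(self):
        self.pages = {}                       # vpn -> [frame, writable]

    def map(self, vpn: int, data: bytes) -> None:
        self.pages[vpn] = [Frame(data), True]

    def fork(self) -> "Process":
        """Share every frame with the child and mark both mappings read-only."""
        child = Process()
        for vpn, entry in self.pages.items():
            frame = entry[0]
            frame.refs += 1
            entry[1] = False                  # the parent loses write access too
            child.pages[vpn] = [frame, False]
        return child

    def write(self, vpn: int, offset: int, value: int) -> None:
        entry = self.pages[vpn]
        frame, writable = entry
        if not writable:                      # simulated COW page fault
            if frame.refs > 1:
                frame.refs -= 1
                frame = Frame(frame.data)     # private copy for the writer
            entry[0], entry[1] = frame, True  # sole owner: just restore write access
        frame.data[offset] = value

    def read(self, vpn: int, offset: int) -> int:
        return self.pages[vpn][0].data[offset]
```

After fork(), parent and child point at the same Frame object; the first write by either side produces a private copy for the writer, and the remaining owner simply regains write permission on the original frame without any copy.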
Caveat: OOM Killer Scoring (Production Services Must Opt Out Carefully)
The OOM killer scores every process via oom_score, weighted heavily by memory usage. Your beautifully tuned 32 GB Postgres process is therefore likely the highest-scoring victim on the box. When memory pressure spikes, even briefly during a backup, a log rotation, or a runaway sidecar, the kernel will kill your database before the noisy log shipper.
Naive fix: set oom_score_adj=-1000 on the database. This makes it un-killable. Now the OOM killer may be unable to free enough memory by killing anything else, the system enters a death spiral, and the kernel panics or hangs instead of recovering. You traded one failure mode for a worse one.
Fix: Tier the oom_score_adj Across Your Process Tree
Use a layered policy, not a binary one:
# Apply an oom_score_adj value to every PID matching a pattern. pgrep -f can
# return several PIDs (e.g. one per postgres backend). Run as root.
set_oom_adj() {
  for pid in $(pgrep -f "$2"); do
    echo "$1" > "/proc/$pid/oom_score_adj"
  done
}

# Critical service: very strong protection but not absolute
set_oom_adj -900 postgres

# Sidecars and agents: take the hit first
set_oom_adj 500 filebeat
set_oom_adj 500 node_exporter

# Batch jobs: kill on sight
set_oom_adj 1000 nightly_backup
Then enforce a cgroup memory limit on the database with memory.high (soft pressure) below memory.max (hard cap). The cgroup OOM kicks in within the container, killing only the offender. systemd unit files support the score adjustment via OOMScoreAdjust=. Combine with vm.panic_on_oom=0 so the kernel kills rather than panics, and kernel.softlockup_panic=0 during reclaim storms.
Verify the policy works: dmesg -T | grep -i "killed process" after a stress test.

8. Senior Interview Deep Dive

Strong Answer Framework:
  1. Hardware trap. The MMU finds the PTE is not present, or the access violates W/X permissions. The CPU writes the faulting virtual address into CR2, pushes the error code onto the stack, and vectors to interrupt 14 (#PF). Control transfers to the kernel via the IDT entry for the page-fault handler (do_page_fault historically, exc_page_fault on x86 since kernel 5.8); the x86-specific code lives in arch/x86/mm/fault.c.
  2. Classify the fault. The handler reads CR2 and the error code (present/write/user/instruction bits). It distinguishes user-mode from kernel-mode faults, and read from write from execute.
  3. VMA lookup. Search the process’s VMA tree (the red-black tree current->mm->mm_rb in older kernels, a maple tree since 6.1) for the vm_area_struct covering the faulting address. No VMA means SIGSEGV. A VMA that exists but does not permit the access also means SIGSEGV.
  4. Decide the fault type. Inside handle_mm_fault and handle_pte_fault: anonymous page never touched (allocate zero page on demand), file-backed page not yet read (issue I/O via filemap_fault), copy-on-write fault (allocate a new frame and copy), or swap-in (read from swap device).
  5. Allocate or fetch. New anon page calls into the buddy allocator. Swap-in waits for a block I/O to complete — this is what makes major faults expensive. The page is then inserted into the LRU lists.
  6. Update the PTE. Atomically install the new PTE with the present bit, write/execute bits, and dirty/accessed bits as appropriate. If the fault replaced an existing translation (for example a COW fault on a read-only mapping), flush the stale local TLB entry and, if other CPUs may have cached it, send TLB shootdown IPIs.
  7. Return and restart. iret returns to user mode and the CPU re-executes the faulting instruction. Critically, the instruction is restarted, not skipped, so the user program has no idea anything happened.
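Steps 2 through 4 amount to a decision tree over the hardware error code and the VMA list. The sketch below uses the x86 #PF error-code bit positions (bit 0 present, bit 1 write, bit 2 user, bit 4 instruction fetch); everything else is a simplification: the VMA class, the linear search in place of the kernel's tree lookup, the pte_swapped flag, and the returned labels are all invented for illustration.

```python
from dataclasses import dataclass

PF_PRESENT, PF_WRITE, PF_USER, PF_INSTR = 1 << 0, 1 << 1, 1 << 2, 1 << 4


@dataclass
class VMA:
    start: int
    end: int                  # exclusive
    perms: str                # subset of "rwx"
    file_backed: bool = False


def classify_fault(addr: int, error_code: int, vmas: list, pte_swapped: bool = False) -> str:
    """Return a label describing how a toy fault handler would resolve the fault."""
    vma = next((v for v in vmas if v.start <= addr < v.end), None)
    if vma is None:
        return "SIGSEGV (no VMA)"
    if error_code & PF_INSTR:
        needed = "x"
    elif error_code & PF_WRITE:
        needed = "w"
    else:
        needed = "r"
    if needed not in vma.perms:
        return "SIGSEGV (permission)"
    if error_code & PF_PRESENT:
        # The page is mapped but the PTE denied an access the VMA allows.
        return "copy-on-write" if needed == "w" else "spurious"
    if pte_swapped:
        return "swap-in (major)"
    if vma.file_backed:
        return "file-backed read-in"
    return "anonymous zero-fill (minor)"
```

A write to an untouched heap page resolves to a zero-fill minor fault, the same write with the present bit set is treated as copy-on-write, and a write into a read-and-execute text mapping is rejected before any allocation happens.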
Real-World Example: Page-fault scalability on multi-threaded workloads was a long-standing Linux problem. Every fault took mmap_sem (the per-process mmap semaphore, renamed mmap_lock in kernel 5.8) for read, so workloads that generate very high minor-fault rates across many threads, such as a JVM during garbage collection, contended on this one lock and stalled behind any writer. Per-VMA locking (Suren Baghdasaryan’s series, merged in kernel 6.4 in 2023, following the earlier “Speculative Page Fault” proposals) allows concurrent faults in different VMAs without taking the global mmap_lock, which improved throughput on highly multi-threaded workloads.
Senior follow-up: How is mmap_sem (now mmap_lock) involved during a fault, and why was it a scalability bottleneck?
mmap_lock is a per-process rw-semaphore. Page faults take it for read; mmap, munmap, and mprotect take it for write. On a process with hundreds of threads all generating faults, the cache-line bouncing of the semaphore counter alone became a bottleneck even when no writer was contending. Per-VMA locking splits the lock so that two threads faulting in different VMAs no longer serialize.
Senior follow-up: What is the difference between a minor fault and a major fault?
Minor faults do not touch disk: the page is already in memory (zero page, COW, page cache hit on a previously read file). Major faults require I/O (swap-in, first read of a file-backed page). Minor faults take on the order of microseconds; major faults take hundreds of microseconds to milliseconds on SSD and tens of milliseconds on HDD, which is orders of magnitude more expensive. ps -o maj_flt,min_flt shows the per-process counts. A high major-fault rate is a sign of memory pressure or thrashing.
Senior follow-up: Can a page fault happen in the kernel itself, and what are the rules?
Yes. vmalloc regions and user-space accesses via copy_from_user can fault in kernel mode. The kernel uses an “extable” mechanism: for each instruction that may touch user memory, an entry in the exception table tells the fault handler “if this faults, jump to recovery label X.” This lets the kernel catch faulting user pointers without crashing. What the kernel cannot do is take a fault that sleeps while holding a spinlock or in interrupt context; that is what GFP_ATOMIC and pagefault_disable() are for.
Common Wrong Answers:
  1. “The page is just loaded from disk and the program continues.” Skips classification, VMA lookup, and the distinction between fault types. Most page faults never touch disk — the majority are minor faults (anonymous zero pages or COW). Conflating “page fault” with “swap-in” is the single most common interview mistake.
  2. “The kernel allocates memory and the CPU jumps past the faulting instruction.” The CPU restarts, it does not skip. The whole point is that the user code is unaware of the fault. Skipping would corrupt program state.
  3. “Page faults are only a problem when you run out of RAM.” Misses the major vs minor distinction. A healthy program generates minor faults constantly and each one is cheap; major faults are the expensive ones.
Further Reading:
  • Linux kernel source: arch/x86/mm/fault.c and mm/memory.c — the canonical implementation, surprisingly readable.
  • LWN: “Per-VMA locking” series by Jonathan Corbet (2022-2023) — explains the mmap_lock scalability fix in depth.
  • “Understanding the Linux Virtual Memory Manager” by Mel Gorman — chapter 4 covers fault handling end-to-end.
Strong Answer Framework:
  1. User code calls malloc(size). This is a libc function (glibc’s ptmalloc, jemalloc, tcmalloc, mimalloc — behaviors differ but the layering is the same).
  2. Allocator picks a strategy by size. Small allocations (under 128 KB in glibc by default, controlled by M_MMAP_THRESHOLD) come from a per-thread arena: a free-list of chunks carved out of a previously obtained heap region. Large allocations bypass the arena and call mmap directly.
  3. Arena exhausted — grow the heap. When the arena needs more, it calls brk or sbrk (extend the data segment) for the main arena, or mmap(MAP_ANONYMOUS) for thread arenas. brk is contiguous and historically simpler; mmap regions can live anywhere in the address space.
  4. Kernel: VMA bookkeeping only. brk/mmap updates vm_area_struct records but does not allocate physical pages. The memory is reserved virtually. Returning success does not mean a single byte of RAM was committed.
  5. First write triggers a page fault. Lazy allocation: the first store to each new page faults, and do_anonymous_page allocates a zero-filled frame from the buddy allocator (or maps the shared zero page if the first access is a read). This is why a process can malloc(1 GB) instantly even on a 512 MB box (subject to the kernel’s overcommit policy): nothing has been physically allocated yet.
  6. Buddy allocator hands out frames. The buddy walks free-lists by order; small allocations come from Order 0. NUMA policies decide which node.
  7. Free path mirrors allocate. free returns the chunk to the arena. Large mmap chunks are unmapped immediately. Small chunks are coalesced with neighbors and returned to the free-list, but brk-based heap is typically not shrunk because of fragmentation — pages near the top of the heap that are still in use prevent shrinking.
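Lazy allocation (steps 4 and 5) can be observed from user space by counting minor faults around an anonymous mapping. This is a rough, Unix-only sketch using Python's mmap and resource modules; the helper names are made up, and the exact fault counts depend on the kernel, the page size, and whether transparent hugepages are active, so treat the numbers as indicative only.

```python
import mmap
import resource


def minor_faults() -> int:
    """Minor page faults taken by this process so far, as reported by getrusage."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def touch_and_count(size: int = 64 * 1024 * 1024):
    """Map anonymous memory, then write one byte per page; return fault counts."""
    region = mmap.mmap(-1, size)              # virtual reservation only
    after_map = minor_faults()
    for off in range(0, size, mmap.PAGESIZE):
        region[off] = 1                       # first store to each page
    after_touch = minor_faults()
    region.close()
    return after_map, after_touch
```

On a typical Linux system the second number is noticeably larger than the first, because the physical frames are only allocated when each page is first written, not when the mapping is created.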
Real-World Example: In 2018, Facebook’s mcrouter team measured a 30 percent reduction in resident memory by switching from glibc’s ptmalloc to jemalloc. The reason was not the speed of the fast path — it was that ptmalloc’s per-thread arenas in a 200-thread process held large unreturned pools. jemalloc’s arena sizing and madvise(MADV_DONTNEED) policies returned freed pages to the kernel more aggressively. The general lesson: glibc malloc holds onto memory longer than you think; for long-running multi-threaded servers, an alternative allocator can dramatically cut RSS.
Senior follow-up: What is M_MMAP_THRESHOLD and why does it default to 128 KB?
Below 128 KB, glibc uses the brk-based arena: cheap to carve, easy to coalesce, but never returned to the OS until the heap top can shrink. Above 128 KB, each allocation gets its own mmap region: more expensive (one syscall, one VMA per chunk), but free immediately unmaps and returns memory to the OS. The 128 KB threshold balances syscall overhead against memory return aggressiveness. By default glibc adjusts the threshold dynamically; setting it explicitly via mallopt(M_MMAP_THRESHOLD, ...) fixes it, and tuning it down makes large workloads release memory faster at the cost of more syscalls.
Senior follow-up: Why does RSS not drop after I free a big buffer?
Two reasons. First, glibc may have kept the pages in its arena instead of returning them to the kernel. Second, how the allocator returns pages matters: madvise(MADV_DONTNEED) drops the pages right away, but an allocator that uses madvise(MADV_FREE) only marks them reclaimable, and the kernel may not take them back until memory pressure builds. Neither call removes the VMA, so virtual size stays the same. Use malloc_trim(0) to force glibc to return what it can. Use jemalloc’s mallctl("arena.X.purge") for finer control.
Senior follow-up: What happens if two threads call malloc simultaneously?
glibc maintains multiple arenas (M_ARENA_MAX, defaults to 8 times the CPU count on 64-bit systems). Each thread is sticky-bound to an arena via TLS to avoid contention. The arena itself has a mutex. Since glibc 2.26 a small per-thread cache (tcache) serves many small allocations without taking that mutex, but under high concurrency you can still see arena lock contention near the top of perf profiles. That is one reason high-throughput servers reach for jemalloc or tcmalloc, both of which were designed around thread-local caches for the fast path.
Common Wrong Answers:
  1. “malloc calls brk, which allocates memory in the kernel.” Conflates virtual reservation with physical allocation. brk only extends the VMA; the physical pages come on first write via page fault.
  2. “The kernel always uses mmap for allocations now.” Glibc still uses brk for the main arena under the threshold. Other allocators (jemalloc) use only mmap, but that is implementation-specific.
  3. “free returns memory to the OS.” Almost never true for small allocations. Memory stays in the user-space allocator’s free-list. Only large mmap-backed allocations return immediately.
Further Reading:
  • “ptmalloc” original paper by Wolfram Gloger — the design glibc inherited.
  • jemalloc design documents on jemalloc.net — contrast with ptmalloc on arena and purge policy.
  • LWN: “Memory management for embedded systems” — explains overcommit and lazy allocation in production context.
Strong Answer Framework:
  1. First, define “using”. Which tool reports 4 GB? top RSS, cgroup memory.current, application-reported heap, or container metrics? They measure different things.
  2. Compare RSS to PSS. cat /proc/[pid]/smaps_rollup. If RSS is 4 GB but PSS is 1.2 GB, most of the apparent memory is shared (libraries, OPcache, mmap’d files). The process’s proportional share is 1.2 GB, and its private cost (USS) is at most that.
  3. Check page cache attribution. free reports “available” by including reclaimable page cache. If the process mmap’d a 4 GB file, those pages count toward the process’s RSS but are also reclaimable by the kernel under pressure — so they appear in both columns.
  4. Look at anon vs file-backed memory. grep -E "RssAnon|RssFile|RssShmem" /proc/[pid]/status (or the Anonymous line in smaps_rollup). Anonymous pages are private heap/stack; file-backed pages are mmap’d files and shared libraries. A process with 3.5 GB of anon and 0.5 GB file-backed has very different scaling behavior than one that is mostly file-backed.
  5. Verify with cgroup accounting. If running in a container, cat /sys/fs/cgroup/memory.current (v2) shows the cgroup’s total, including kernel memory and page cache attributed to the container. This is what triggers OOM kills.
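  6. Check for kernel-side memory. cat /proc/meminfo and look at Slab, KernelStack, PageTables. A process with a very large thread count shows up in KernelStack, and one that maps a very large, sparsely touched address space can have gigabytes in PageTables alone.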
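The RSS-versus-PSS comparison from the steps above can be sketched as a small parser. The sample text below is made up for illustration (it mirrors the 4 GB RSS / 1.2 GB PSS case), the helper names are hypothetical, and the Pss_Anon/Pss_File fields are assumed to be present, which is only true on reasonably recent kernels.

```python
import re

# Made-up sample in the shape of /proc/[pid]/smaps_rollup.
SAMPLE_SMAPS_ROLLUP = """\
55d0c0a00000-7ffd5e9f2000 ---p 00000000 00:00 0                          [rollup]
Rss:             4194304 kB
Pss:             1258291 kB
Pss_Anon:         524288 kB
Pss_File:         734003 kB
Pss_Shmem:             0 kB
"""


def parse_kb_fields(text):
    """Collect every 'Name:   <n> kB' line into a dict of integers (kB)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+) kB$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def summarize(fields):
    """Report how much of the RSS is really charged to other sharers."""
    rss, pss = fields["Rss"], fields["Pss"]
    return {
        "rss_kb": rss,
        "pss_kb": pss,
        "shared_discount": round(1 - pss / rss, 2),
        "anon_kb": fields.get("Pss_Anon", 0),
        "file_kb": fields.get("Pss_File", 0),
    }
```

To run it against a live process, read the text from /proc/[pid]/smaps_rollup instead of the sample string.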
Real-World Example: In 2020, a Discord post-mortem on their voice service described exactly this confusion: container metrics reported 8 GB used, the JVM heap reported 2 GB, and the kernel reported 12 GB free. The gap was JVM off-heap direct buffers (NIO) plus tcmalloc retaining freed memory in its thread caches. The fix involved bounding MaxDirectMemorySize, switching to G1 GC’s String Deduplication, and forcing periodic tcmalloc::MallocExtension::ReleaseFreeMemory(). Lesson: a single “memory used” number is meaningless without knowing what is being counted.
Senior follow-up: What does MemAvailable in /proc/meminfo actually compute? It is an estimate of how much memory is available for new allocations without swapping. The formula (since kernel 3.14): MemFree plus reclaimable page cache minus low watermark, plus reclaimable slab. It tries to answer “how much can I allocate before the system pages?” MemFree alone is misleading because the kernel deliberately uses spare RAM for page cache; MemAvailable is the metric to alert on.
Senior follow-up: Why might RSS differ from the cgroup’s memory.current? Cgroup accounts for the whole hierarchy: anon pages, file-backed pages mapped by processes in the cgroup, kernel memory (slab) charged to the cgroup, and tmpfs usage. RSS is per-process and ignores tmpfs and kernel memory. A container with a heavy tmpfs (/dev/shm) will see cgroup memory much higher than the sum of process RSSes.
Senior follow-up: How do you find which file is consuming the page cache? vmtouch -v /path reports per-file resident pages. pcstat (Al Tobey’s tool) reports the cached fraction of each file you pass it. lsof -nP plus /proc/[pid]/maps correlates open files to processes. Useful when you suspect a misbehaving process is filling page cache with throwaway data and evicting hot pages of another service.
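The MemAvailable estimate described in the first follow-up above can be written out as arithmetic. This is a sketch of the shape of the calculation, assuming all inputs are in kB and using a single low-watermark value; the exact terms vary between kernel versions, and the function name is hypothetical.

```python
def estimate_mem_available(mem_free, file_lru, slab_reclaimable, low_watermark):
    """Rough MemAvailable estimate; every argument is in kB.

    Free memory below the low watermark is not usable, and at most half of
    the page cache and reclaimable slab (capped at the watermark) is assumed
    to be needed to keep the system from thrashing.
    """
    available = mem_free - low_watermark
    available += file_lru - min(file_lru // 2, low_watermark)
    available += slab_reclaimable - min(slab_reclaimable // 2, low_watermark)
    return max(available, 0)
```

With 1 GB free, 4 GB of file cache, 200 MB of reclaimable slab, and a 50 MB watermark (all illustrative figures), the estimate is far larger than MemFree, which is why alerting on MemFree alone produces false alarms.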
Common Wrong Answers:
  1. “The metrics are wrong, restart the service.” Treats the diagnostic as broken without investigating. The metrics are usually right; the model is incomplete.
  2. “It’s a memory leak.” Possible, but you have not gathered enough data. RSS-PSS mismatch and shared library counting often explain the apparent gap without any leak.
  3. “free is showing buffers/cache as available, so there is no problem.” Sometimes true, but ignores the case where the process has mmap’d a file and the page cache is being attributed to it on one tool and not another.
Further Reading:
  • “Linux Memory Statistics” by Brendan Gregg — distinguishes RSS, PSS, USS, and the cache attribution rules.
  • kernel.org Documentation/filesystems/proc.rst — the canonical reference for /proc/meminfo and /proc/[pid]/smaps.
  • Discord engineering blog: “Why Discord is switching from Go to Rust” — not directly about this but discusses memory accounting under heavy concurrency.
Strong Answer Framework:
  1. Why shootdowns exist. Each CPU has its own TLB. When a page-table mapping changes (munmap, mprotect, madvise(MADV_DONTNEED), COW break, swap-out), the kernel must invalidate the now-stale TLB entries on every CPU that may have cached them.
  2. The mechanism. The initiating CPU calls flush_tlb_mm or flush_tlb_range, which sends an IPI (inter-processor interrupt) to the target CPUs. Each target enters an interrupt handler, executes INVLPG (invalidate one page) or a full TLB flush, and acknowledges. The initiating CPU spins until all acks come in.
  3. Why it hurts. IPIs are tens of microseconds round-trip. If a server with 96 cores does an munmap that invalidates all of them, you stall 96 cores for 30+ microseconds and the initiator waits for all of them. At high frequency (concurrent madvise calls, tight mmap/munmap loops), shootdowns can dominate runtime.
  4. Tracking it. Look at /proc/interrupts for the TLB line, or perf stat -e tlb:tlb_flush. On a busy box you might see millions of TLB shootdowns per second.
  5. Mitigations. Batch unmaps, prefer per-VMA scope over full-mm flushes, use lazy TLB invalidation for kernel threads, exploit PCID/ASID tagging so that user-mode entries from one process do not need to be flushed when switching to another. Modern kernels skip shootdowns for CPUs known not to have run the affected mm recently.
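The tracking step above can be sketched as a parser for the TLB row of /proc/interrupts. The sample text is made up, the row label and layout are those of x86 Linux, and the helper names are hypothetical.

```python
# Made-up sample in the shape of /proc/interrupts on a 4-CPU x86 machine.
SAMPLE_INTERRUPTS = """\
           CPU0       CPU1       CPU2       CPU3
  0:         36          0          0          0   IO-APIC   2-edge      timer
TLB:     120034     118220      98011     101377   TLB shootdowns
"""


def tlb_shootdowns_per_cpu(interrupts_text):
    """Return the per-CPU counters from the 'TLB:' row, or [] if absent."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.strip().partition(":")
        if label == "TLB":
            counts = []
            for token in rest.split():
                if not token.isdigit():
                    break  # reached the trailing description text
                counts.append(int(token))
            return counts
    return []


def shootdown_rate(before, after, interval_s):
    """Shootdowns per second across all CPUs between two samples."""
    return (sum(after) - sum(before)) / interval_s
```

Sampling the file twice a few seconds apart and feeding both readings to shootdown_rate gives a quick rate without needing perf.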
Real-World Example: In 2018, Cloudflare published a deep-dive (“Speeding up Linux disk encryption”) where they found a single core’s dm-crypt workqueue was getting saturated with TLB shootdowns triggered by per-block memory remapping. Switching from per-block to bulk processing reduced shootdowns by 95 percent and unlocked roughly 2x throughput. The general pattern: anywhere you see frequent map-unmap cycles in a hot path, shootdowns are a likely cost driver.
Senior follow-up: What is PCID/ASID and how does it reduce flush cost? PCID (Intel) and ASID (ARM) tag TLB entries with a process context ID. When the kernel switches mm (process switch or KPTI swap), it changes the active PCID instead of flushing the TLB. Entries from the old context remain in the TLB, tagged with the old PCID, ignored for current lookups. On return to the previous context, those entries are valid again. This is what made KPTI tolerable: without PCID, every syscall would require a TLB flush. With PCID, the cost dropped from 30+ percent to 2-5 percent on most workloads.
Senior follow-up: Are there workloads where shootdowns are more expensive than the mapping change itself? Yes. Tight mmap/munmap of small regions in a loop, COW pages on a write-heavy fork-then-mutate pattern, and heavy madvise(MADV_DONTNEED) usage all spend more cycles in shootdown coordination than in the actual page table edit. The fix is either to batch (one large munmap instead of many small ones) or to use mechanisms that avoid invalidation entirely (e.g., MADV_FREE instead of MADV_DONTNEED, which does not flush until reclaim).
Senior follow-up: How does the kernel know which CPUs to send the IPI to? Each mm_struct has a cpu_bitmap of CPUs that have ever run with this mm active. The shootdown IPI targets only those CPUs. This is critical on big machines: shooting down only the 4 CPUs that actually ran your process is much cheaper than shooting all 128 cores. The bitmap is updated lazily during context switches.
Common Wrong Answers:
  1. “Shootdowns only happen when you munmap.” Plenty of other operations trigger them: mprotect, mremap, swap-out, page migration (NUMA balancing), THP collapse, madvise(MADV_DONTNEED). The munmap answer is incomplete.
  2. “Just disable preemption around critical sections to avoid shootdowns.” Misunderstands the problem. Shootdowns happen across CPUs, not within a single CPU’s preemption window. Disabling preemption helps with lock acquisition order but does nothing for inter-CPU IPIs.
  3. “Big TLBs eliminate shootdowns.” Bigger TLBs reduce miss rates but every modification still requires invalidation. The shootdown cost scales with CPU count, not TLB size.
Further Reading:
  • LWN: “TLB flushing for the sloppy” by Jonathan Corbet — covers lazy invalidation and the cpu_bitmap optimization.
  • Intel Software Developer’s Manual, Volume 3, chapter on paging — canonical reference for PCID and INVPCID semantics.
  • “Latency-Tolerant Software for High-Throughput Servers” — ACM article on minimizing shootdowns in network code.

9. Practice: Memory Forensics

In Linux, you can inspect a process’s memory map directly:
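# See the segments (VMAs) of a process
cat /proc/[pid]/maps

# Per-VMA memory statistics (Rss, Pss, Swap, ...)
cat /proc/[pid]/smaps

# Per-frame flags (needs root). The file is binary, one 64-bit word per PFN,
# so dump it with od rather than cat
sudo od -A d -t x8 -N 64 /proc/kpageflags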
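For anything beyond eyeballing, the maps output is easy to parse. The following is a minimal sketch; the sample lines are made up, and the Mapping type and parse_maps name are hypothetical.

```python
from typing import NamedTuple

# Made-up sample in the shape of /proc/[pid]/maps.
SAMPLE_MAPS = """\
55d0c0a00000-55d0c0a21000 r--p 00000000 08:01 1311021                    /usr/bin/cat
7f3a1c000000-7f3a1c021000 rw-p 00000000 00:00 0
7ffd5e9d1000-7ffd5e9f2000 rw-p 00000000 00:00 0                          [stack]
"""


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str  # empty string for anonymous mappings


def parse_maps(text):
    """Parse 'start-end perms offset dev inode [path]' lines."""
    mappings = []
    for line in text.splitlines():
        parts = line.split(maxsplit=5)
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = parts[0].split("-")
        path = parts[5] if len(parts) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), parts[1], path))
    return mappings


def total_writable_bytes(mappings):
    """Sum the sizes of all mappings with the write permission bit set."""
    return sum(m.end - m.start for m in mappings if "w" in m.perms)
```

Reading the text from /proc/self/maps instead of the sample lets a process inspect its own layout.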

Next: Virtual Memory & Swapping

Interview Deep-Dive

Strong Answer: The way I think about this is that the Buddy Allocator works beautifully when there is available high-order memory, but under sustained allocation pressure, external fragmentation makes large contiguous blocks scarce. When a request for a high-order block fails, the kernel has several fallback mechanisms:
  • Compaction (memory compaction): The kernel’s kcompactd daemon or a synchronous compaction pass physically migrates movable pages (user-space pages, page cache) to consolidate free frames into contiguous runs. This is like defragmenting a disk but for RAM. It walks the zone from both ends — a migration scanner from the bottom and a free scanner from the top — and moves pages until a sufficiently large hole appears.
  • Reclaim before retry: The allocator can invoke kswapd or direct reclaim to free clean page cache pages or evict anonymous pages to swap, hoping the freed frames coalesce with their buddies.
  • CMA (Contiguous Memory Allocator): For device drivers that absolutely must have contiguous DMA-able memory, Linux reserves a CMA region at boot. The region is usable by movable allocations during normal operation but can be evacuated when a driver needs it. This is the safety net for things like GPU framebuffers or camera buffers on embedded SoCs.
  • Falling back to vmalloc: If the allocation is not for DMA or a hardware page table, the kernel can use vmalloc() to create a virtually contiguous mapping backed by scattered physical pages. This avoids the fragmentation problem entirely but with a TLB overhead since each page needs its own mapping.
In my experience, the real gotcha in production is that compaction can stall allocation-hot paths for milliseconds. At a company I know of running large Redis instances, Transparent HugePages with synchronous compaction caused p99 latency spikes of 50ms+. The fix was to stop applying THP system-wide (echo madvise > /sys/kernel/mm/transparent_hugepage/enabled, which limits it to regions that opt in via madvise()) and use explicit HugePages reserved at boot with vm.nr_hugepages.
Follow-up: How does the kernel decide which pages are “movable” vs “unmovable” during compaction?
Linux classifies allocations into three main migrate types when they enter the buddy allocator: MIGRATE_MOVABLE (user pages and page cache, which can be relocated because the kernel controls the page table mappings), MIGRATE_UNMOVABLE (kernel slab objects and page tables themselves, where moving them would require updating every pointer in the kernel), and MIGRATE_RECLAIMABLE (kernel allocations that cannot be moved but can be freed under pressure, such as the dentry and inode caches). The buddy allocator groups contiguous blocks by migrate type so that unmovable allocations cluster together, leaving large contiguous regions of movable pages that can be compacted. This is called “pageblock grouping” and was introduced specifically to make compaction feasible. The key insight: if you sprinkle unmovable allocations randomly across all physical memory, compaction becomes impossible because you cannot move those pinned pages out of the way.
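The fragmentation problem in the answer above comes down to buddy arithmetic: a block can only merge with the one neighbour whose frame number differs in a single bit. The toy model below is a sketch, not kernel code; the function names are hypothetical, and it scans a plain set of free frame numbers rather than per-order free lists.

```python
def buddy_of(pfn, order):
    """Frame number of the buddy of the 2**order block starting at pfn."""
    return pfn ^ (1 << order)


def largest_free_order(free_pfns, max_order=10):
    """Highest order with a naturally aligned, fully free block; -1 if none."""
    free = set(free_pfns)
    best = -1
    for order in range(max_order + 1):
        size = 1 << order
        for pfn in free:
            if pfn % size == 0 and all((pfn + i) in free for i in range(size)):
                best = order
                break
        if best < order:
            break  # no block at this order means none at any higher order
    return best
```

A checkerboard of 32 free frames out of 64 can satisfy only order-0 requests, while the same 32 frames laid out contiguously form one order-5 block. That gap between "free" and "usable for a large request" is what compaction exists to close.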
Strong Answer: NUMA stands for Non-Uniform Memory Access. On a 2-socket server, each CPU socket has its own local memory controller and RAM. Accessing local memory might take 80-100ns, while accessing remote memory on the other socket goes through the interconnect (QPI/UPI on Intel) and costs 150-300ns, roughly 2-3x the latency with lower bandwidth.
  • First-touch policy: Linux allocates physical memory on the NUMA node where the thread that first writes to the page is running. This is usually sensible, but it breaks badly when initialization is single-threaded. For example, if thread 0 on node 0 initializes a large array, and then 64 threads spread across both nodes use it, half of them are hitting remote memory for every access.
  • Interleave policy: Spreads pages round-robin across nodes. Good for large shared data structures where you cannot predict which node will access which part. Reduces worst-case latency at the cost of average-case.
  • Preferred/bind policies: Pin allocations to a specific node or set of nodes. Used when you know the access pattern.
For debugging a NUMA regression, here is my approach:
  1. numastat -p PID: Shows per-node memory allocation for the process. If I see most memory on node 0 but the process has threads running on node 1, that is a red flag — lots of remote accesses.
  2. perf stat -e node-load-misses,node-store-misses: Hardware counters that directly measure cross-node memory accesses. A high ratio of remote-to-local accesses confirms the problem.
  3. numactl --hardware: Verify the topology (how many nodes, which CPUs belong to which node, and the inter-node distance matrix).
  4. lstopo (hwloc): Visualize the full topology including caches and PCIe devices.
The fix depends on the root cause. If the workload is a database like PostgreSQL, I would use numactl --interleave=all to spread the shared buffer pool evenly. If it is a latency-sensitive service, I would bind the process to a single node with numactl --membind=0 --cpunodebind=0 to eliminate all remote accesses. The worst thing you can do is nothing: the default first-touch policy with multi-threaded workloads is a landmine.
Follow-up: What is “NUMA balancing” in the Linux kernel and when would you disable it?
Linux has automatic NUMA balancing (enabled via kernel.numa_balancing=1) that periodically unmaps pages and uses the resulting page faults to detect which node each thread actually accesses. It then migrates pages to the node where they are being used most. This is clever but not free: the artificial page faults add overhead (typically 1-5% CPU), and for workloads with stable access patterns (like a database with pre-bound NUMA policy), it is pure waste. You would disable it when you have already done explicit NUMA placement (via numactl or mbind()) or when the workload is latency-sensitive and cannot tolerate the periodic fault injection. Conversely, keep it enabled for general-purpose workloads where threads migrate unpredictably.
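The first-touch versus interleave trade-off from the answer above can be made concrete with a toy latency model. The 90 ns and 200 ns figures are assumptions picked from the ranges quoted earlier, every page is assumed to be accessed equally often, and all names are hypothetical.

```python
LOCAL_NS = 90    # assumed local-node access latency
REMOTE_NS = 200  # assumed remote-node access latency


def first_touch(num_pages, init_node):
    """All pages land on the node of the single initializing thread."""
    return [init_node] * num_pages


def interleave(num_pages, num_nodes):
    """Pages are spread round-robin across nodes."""
    return [i % num_nodes for i in range(num_pages)]


def mean_latency_ns(thread_node, page_nodes):
    """Average latency for a thread that touches every page equally often."""
    total = sum(LOCAL_NS if node == thread_node else REMOTE_NS for node in page_nodes)
    return total / len(page_nodes)
```

With single-threaded initialization on node 0, threads on node 0 see 90 ns and threads on node 1 see 200 ns. Interleaving gives every thread 145 ns: worse than the best case, better than the worst, which is the trade described in the interleave bullet.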
Strong Answer: That statement is almost certainly wrong on a default Linux system, and this is one of the most misunderstood aspects of Linux memory management.
  • Overcommit: By default, Linux uses memory overcommit (controlled by vm.overcommit_memory, default value 0). When you call malloc(1GB), the kernel says “sure” and returns a valid pointer — even if there is only 512MB of free RAM. The kernel has not allocated any physical frames yet. It has only updated the process’s virtual memory area (VMA) structures. Physical pages are allocated on demand when you actually write to the memory, triggering page faults.
  • When malloc returns NULL: On a default Linux system, malloc returning NULL is extremely rare. It usually means the process has hit its virtual address space limit (ulimit -v), the process has exceeded its cgroup memory limit, or the kernel’s overcommit heuristic (which still does a rough sanity check) rejected an absurdly large request.
  • The OOM Killer: The real “out of memory” event on Linux is not malloc returning NULL — it is the OOM Killer activating. When the system genuinely cannot find or reclaim a single physical page and all swap is exhausted, the kernel invokes oom_kill_process() to sacrifice a process and reclaim its memory. The OOM killer scores processes by their memory usage (/proc/PID/oom_score) and kills the highest-scored one.
What most people miss is that this design is intentional. The Linux philosophy is that most large allocations (especially after fork()) never fully materialize, so rejecting them eagerly would waste the overcommit opportunity. The trade-off is that your process can be killed at any time by the OOM killer, even though malloc never returned NULL.
For production services, the right approach is: set cgroup memory limits so the OOM killer targets the offending container rather than a random innocent process, monitor /proc/meminfo for MemAvailable approaching zero, and use vm.overcommit_memory=2 (strict accounting) only in safety-critical systems where predictability matters more than efficiency.
Follow-up: How does the OOM killer decide which process to kill, and how would you protect a critical process from being killed?
The OOM killer computes an oom_score for every process based primarily on its RSS (resident set size), so bigger consumers score higher. You can influence this via /proc/PID/oom_score_adj which ranges from -1000 (never kill) to +1000 (kill first). Setting oom_score_adj to -1000 for your database process effectively makes it immune to the OOM killer. Setting it to +1000 for a batch job makes it the first sacrifice. In production, I always set -1000 on the core service process and let sidecar processes (log shippers, metrics agents) take the hit first. The kernel also never selects kernel threads or init (PID 1), since killing them would take the whole system down.
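The scoring described in the follow-up above can be sketched as a small model. It is loosely patterned on the kernel's badness heuristic (resident pages plus swap entries plus page-table pages, shifted by oom_score_adj scaled to total memory), but it is a simplification with hypothetical function names and illustrative numbers, not the kernel's code.

```python
OOM_SCORE_ADJ_MIN = -1000  # value that exempts a process entirely


def oom_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Return a badness score in pages, or None if the process is exempt."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None
    points = rss_pages + swap_pages + pgtable_pages
    # Each adj point is worth one thousandth of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


def pick_victim(procs, total_pages):
    """procs maps a name to oom_badness keyword arguments (minus total_pages)."""
    scores = {}
    for name, usage in procs.items():
        score = oom_badness(total_pages=total_pages, **usage)
        if score is not None:
            scores[name] = score
    return max(scores, key=scores.get) if scores else None
```

In this model a small batch job with oom_score_adj=500 outranks a much larger web process at 0, and a database at -1000 is never considered, which matches the protection strategy described above.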
Strong Answer: Both SLAB and SLUB solve the same fundamental problem: the buddy allocator gives you memory in page-sized chunks (4KB minimum), but the kernel constantly needs small objects of tens to hundreds of bytes, such as dentry structures, inode objects, and small kmalloc buffers. Allocating a full page for each would waste enormous amounts of memory to internal fragmentation.
  • SLAB (the original): Introduced by Jeff Bonwick (originally for Solaris, then ported to Linux). It maintains per-CPU caches, per-node shared caches, and a three-list management structure (full slabs, partial slabs, empty slabs). Objects are pre-constructed so that frequently allocated kernel objects (like inodes) skip expensive initialization on reuse.
  • SLUB (the replacement, merged in Linux 2.6.22 and the default since 2.6.23): Designed by Christoph Lameter to fix SLAB’s complexity. Key differences:
    • No per-CPU queues or shared caches in the SLAB sense. Instead, SLUB uses a simpler per-CPU “freelist” pointer directly into the slab page. This means fewer data structures and less metadata overhead.
    • No three-list management. SLUB tracks partial slabs in a simple linked list per node.
    • Better NUMA awareness. SLUB allocates from the local node’s partial list first.
    • Better debugging. SLUB has built-in red-zoning, poisoning, and tracking that can be enabled at runtime without recompilation.
    • Lower memory overhead. SLAB kept separate per-slab management metadata plus per-CPU arrays; SLUB embeds the freelist in the objects themselves, using the freed object’s memory as a linked-list pointer.
The reason Linux moved to SLUB is that SLAB’s complexity (three-list management, per-CPU arrays, shared arrays) made it hard to maintain, hard to debug, and actually slower on NUMA machines due to excessive object shuffling between per-CPU and shared caches. SLUB’s simpler design ended up being both faster and more memory-efficient for the common case.
cat /proc/slabinfo shows per-cache statistics under either allocator. To tell which one your kernel was built with, check the kernel config (for example grep -E 'CONFIG_SL[AOU]B' /boot/config-$(uname -r) on distributions that ship it) or look for the SLUB-only /sys/kernel/slab directory. In production, SLUB is almost always the right choice. The only scenario where you might want SLAB is on very old embedded systems with tiny amounts of RAM where SLAB’s object caching behavior might reduce initialization cost. Note that SLOB was removed in Linux 6.4 and SLAB in 6.8, so on current kernels SLUB is the only allocator.
Follow-up: What is SLOB and when would you use it?
SLOB (Simple List Of Blocks) is the third allocator option, designed for extremely memory-constrained embedded systems with less than 16MB of RAM. It uses a simple first-fit algorithm with minimal metadata overhead: literally just a linked list of free blocks. It is slower and fragments more than SLUB, but its code is tiny (about 600 lines vs SLUB’s several thousand) and its per-allocation overhead is minimal. You would choose SLOB only for microcontrollers or IoT devices running Linux where every kilobyte of kernel metadata counts.
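The embedded freelist that the SLUB answer above describes can be sketched with a byte array standing in for one slab page. This is a toy under stated assumptions: the ToySlab class is hypothetical, the "pointer" is a 4-byte offset stored at the start of each free object, and real SLUB adds per-CPU fast paths, partial lists, and optional pointer hardening on top.

```python
import struct


class ToySlab:
    """Fixed-size object slab whose free list lives inside the free objects."""

    END = 0xFFFFFFFF  # sentinel meaning "no next free object"

    def __init__(self, object_size=64, slab_size=4096):
        assert 4 <= object_size <= slab_size
        self.object_size = object_size
        self.memory = bytearray(slab_size)
        count = slab_size // object_size
        # Thread every object onto the list: each stores its successor's offset.
        for i in range(count):
            nxt = (i + 1) * object_size if i + 1 < count else self.END
            struct.pack_into("<I", self.memory, i * object_size, nxt)
        self.freelist = 0

    def alloc(self):
        """Pop the head of the free list; returns an offset or None if full."""
        if self.freelist == self.END:
            return None
        offset = self.freelist
        (self.freelist,) = struct.unpack_from("<I", self.memory, offset)
        return offset

    def free(self, offset):
        """Push the object back; its own bytes hold the old list head."""
        struct.pack_into("<I", self.memory, offset, self.freelist)
        self.freelist = offset
```

No metadata exists outside the slab beyond the single freelist head, and a freed object is the next one handed out, which keeps recently used (cache-hot) memory in circulation.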