# Memory Management Internals

> Deep dive into Linux memory management: zones, buddy allocator, slab, NUMA, and memory cgroups

Linux Memory Management - Virtual memory, buddy allocator, and page tables

Memory management is one of the most complex subsystems in the Linux kernel. Understanding it deeply is crucial for infrastructure engineers debugging OOM issues, optimizing container resource limits, and understanding application performance.

**Interview Frequency**: Very High (especially at observability/infrastructure companies)\
**Key Topics**: Buddy allocator, slab, zones, OOM killer, cgroups\
**Time to Master**: 16-18 hours

***

## Physical Memory Organization

### Memory Zones

Linux organizes physical memory into zones based on addressing constraints:

Physical Memory Zones
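As a rough illustration of how addressing constraints map to zones, the sketch below classifies a physical address using the boundaries commonly seen on x86-64: 16 MiB for the DMA zone and 4 GiB for the DMA32 zone. The enum, the limit macros, and `zone_for_phys_addr()` are hypothetical names made up for this example, not kernel APIs, and other architectures or kernel configurations use different zones.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 boundaries, for illustration only. */
#define SKETCH_DMA_LIMIT   (16ULL << 20)  /* 16 MiB: reach of legacy ISA DMA */
#define SKETCH_DMA32_LIMIT (4ULL << 30)   /* 4 GiB: reach of 32-bit DMA */

/* Hypothetical zone identifiers (not the kernel's enum zone_type). */
enum sketch_zone { SKETCH_ZONE_DMA, SKETCH_ZONE_DMA32, SKETCH_ZONE_NORMAL };

/* Return the zone that a physical address would fall into. */
static enum sketch_zone zone_for_phys_addr(uint64_t phys)
{
    if (phys < SKETCH_DMA_LIMIT)
        return SKETCH_ZONE_DMA;
    if (phys < SKETCH_DMA32_LIMIT)
        return SKETCH_ZONE_DMA32;
    return SKETCH_ZONE_NORMAL;
}

/* Human-readable name, matching the labels shown in /proc/buddyinfo. */
static const char *zone_name(enum sketch_zone z)
{
    static const char *const names[] = { "DMA", "DMA32", "Normal" };

    assert((unsigned int)z <= SKETCH_ZONE_NORMAL);
    return names[z];
}
```

Under these assumptions, an address just below 16 MiB lands in DMA, anything from 16 MiB up to 4 GiB lands in DMA32, and everything above lands in Normal.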

### Viewing Zone Information

```bash
# View zone details
cat /proc/zoneinfo

# Summary
cat /proc/buddyinfo

# Example output (free block counts per order):
# Node 0, zone   Normal   1234  876  432  210  105   52   26   13    6    3    1
#                          2^0  2^1  2^2  2^3  2^4  2^5  2^6  2^7  2^8  2^9  2^10
```

***

## Buddy Allocator

### Why Power-of-2 Allocation?

The buddy allocator is the kernel's physical page allocator. But why does it use powers of 2?

**The problem**: If you allow arbitrary-sized allocations, memory becomes fragmented. You might have 100 free pages scattered around, but can't allocate a contiguous 64-page block.

**The solution**: Only allow power-of-2 sized blocks (1, 2, 4, 8, 16... pages). This enables:

1. **Fast splitting**: A 16-page block splits perfectly into two 8-page blocks
2. **Fast coalescing**: Two adjacent 8-page blocks merge back into one 16-page block
3. **Simple math**: Finding a block's "buddy" is just an XOR of the block's starting page frame number with the block size

**What is a "buddy"?** Two blocks are buddies if:

* They're the same size
* They're adjacent in memory
* Together they form the next larger power-of-2 block, aligned to that larger size

Example: Pages 0-7 and pages 8-15 are buddies (together = pages 0-15). Pages 8-15 and pages 16-23 are adjacent but are not buddies, because pages 8-23 do not form an aligned 16-page block.

The buddy system manages physical page allocation.

Buddy Allocator Mechanism
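The XOR trick can be sketched in a few lines of userspace C. This is a minimal illustration of the arithmetic, not kernel code; `buddy_pfn()` and `merged_pfn()` are hypothetical helper names, and `pfn` stands for the page frame number of a block's first page.

```c
#include <assert.h>
#include <stdint.h>

/* PFN of the buddy of the 2^order-page block that starts at pfn. */
static uint64_t buddy_pfn(uint64_t pfn, unsigned int order)
{
    /* A block of 2^order pages always starts on a 2^order boundary. */
    assert((pfn & ((1ULL << order) - 1)) == 0);

    /* Flipping bit 'order' moves to the other half of the parent block. */
    return pfn ^ (1ULL << order);
}

/* First PFN of the merged 2^(order+1) block: the lower of the two buddies. */
static uint64_t merged_pfn(uint64_t pfn, unsigned int order)
{
    return pfn & ~(1ULL << order);
}
```

With order 3 (8-page blocks), `buddy_pfn(0, 3)` is 8 and `buddy_pfn(8, 3)` is 0, which matches the pages 0-7 / 8-15 example. `buddy_pfn(16, 3)` is 24, so the block at page 16 pairs with pages 24-31 rather than with pages 8-15.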

### How Buddy Allocation Works

### Buddy Allocator Code

```c
// mm/page_alloc.c (heavily simplified; not the literal kernel source)

// Allocate 2^order contiguous pages
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid)
{
    struct alloc_context ac = { };              // zonelist, nodemask, migratetype
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    struct page *page;

    // Fast path: take a block from a zone free list
    page = get_page_from_freelist(gfp, order, alloc_flags, &ac);
    if (page)
        return page;

    // Slow path: reclaim, compact, OOM
    return __alloc_pages_slowpath(gfp, order, &ac);
}

// Free 2^order pages back to the buddy allocator
void __free_pages(struct page *page, unsigned int order)
{
    struct zone *zone = page_zone(page);

    // Coalesce for as long as the buddy is also free
    while (order < MAX_ORDER - 1) {
        struct page *buddy = find_buddy(page, order);

        if (!page_is_buddy(buddy, order))
            break;

        // Take the buddy off its free list and merge the two blocks
        list_del(&buddy->lru);
        order++;
        page = min(page, buddy);  // merged block starts at the lower address
    }

    // Add the (possibly merged) block to the free list for its order
    list_add(&page->lru, &zone->free_area[order].free_list);
}
```

### Fragmentation Problem

![Memory Fragmentation and Compaction](/images/courses/linux-memory-fragmentation.svg)

```bash
# Check fragmentation
cat /proc/buddyinfo
# Low numbers at high orders = fragmentation

# Trigger compaction (as root)
echo 1 > /proc/sys/vm/compact_memory

# Check compaction stats
cat /proc/vmstat | grep compact
```

***

## Slab Allocator (SLUB)

### The Problem with Buddy Allocator for Small Objects

The buddy allocator works well for page-sized allocations (4KB and up), but what about small objects like `task_struct` (several KB, depending on kernel version and configuration) or `dentry` (192 bytes)?

**Problems**:

1. **Waste**: Allocating a full 4KB page for a 192-byte object wastes about 95% of the page
2. **Performance**: Buddy allocations that miss the per-CPU page lists take a per-zone lock, which causes contention on multi-core systems
3. **Cache efficiency**: Small objects scattered across different pages have poor cache locality

**The solution**: SLUB (the unqueued slab allocator)

* Pre-allocate pages and divide them into same-sized objects
* Per-CPU caches for a lockless fast path
* Object caches for common kernel structures (`task_struct`, `inode`, `dentry`, etc.)

SLUB serves small object allocations (smaller than a page).
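The idea of carving a page into same-sized objects can be sketched in userspace. The toy cache below is an assumption-laden illustration, not SLUB itself: it manages a single 4 KiB slab obtained from `malloc()`, has no per-CPU state or locking, and all `toy_cache_*` names are hypothetical. What it does share with SLUB is the trick of storing the "next free" pointer inside each free object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SLAB_BYTES 4096  /* one page per slab, assumed for illustration */

/* Hypothetical single-slab cache: every object has the same size. */
struct toy_cache {
    size_t obj_size;  /* object size, rounded up to pointer alignment */
    void *slab;       /* backing memory for the slab */
    void *freelist;   /* first free object, or NULL when the slab is full */
};

/* Carve one slab into objects and chain them through their first word.
 * Returns the number of objects per slab, or -1 if malloc() fails. */
static int toy_cache_init(struct toy_cache *c, size_t obj_size)
{
    size_t align = sizeof(void *);
    size_t n, i;

    assert(obj_size > 0 && obj_size <= SLAB_BYTES);
    c->obj_size = (obj_size + align - 1) / align * align;
    c->slab = malloc(SLAB_BYTES);
    if (!c->slab)
        return -1;

    n = SLAB_BYTES / c->obj_size;
    for (i = 0; i < n; i++) {
        char *obj = (char *)c->slab + i * c->obj_size;
        void *next = (i + 1 < n) ? obj + c->obj_size : NULL;

        *(void **)obj = next;  /* a free object stores the link to the next */
    }
    c->freelist = c->slab;
    return (int)n;
}

/* Allocation pops the head of the freelist: no searching, no splitting. */
static void *toy_cache_alloc(struct toy_cache *c)
{
    void *obj = c->freelist;

    if (obj)
        c->freelist = *(void **)obj;
    return obj;
}

/* Freeing pushes the object back onto the head of the freelist. */
static void toy_cache_free(struct toy_cache *c, void *obj)
{
    *(void **)obj = c->freelist;
    c->freelist = obj;
}
```

With 192-byte objects, one 4 KiB slab holds 21 objects and leaves 64 bytes unused, compared with one object per page if each allocation went to the buddy allocator directly.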
### SLUB Architecture

SLUB Allocator Architecture

### SLUB Fast Path

```c
// mm/slub.c (heavily simplified; the real fast path also uses a
// transaction ID and cmpxchg so it stays safe against preemption)
static void *slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
    struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
    void *object;

    // Fast path: get from per-CPU freelist
    object = c->freelist;
    if (likely(object)) {
        c->freelist = get_freepointer(s, object);
        return object;
    }

    // Slow path: get new slab from partial list or allocate
    return __slab_alloc(s, gfpflags, node);
}
```

### Viewing Slab Information

```bash
# View all slab caches (root only on most systems)
sudo cat /proc/slabinfo

# Interactive slab viewer
sudo slabtop

# Memory usage by cache
sudo slabtop -o | head -20

# Detailed stats
cat /sys/kernel/slab/task_struct/object_size
cat /sys/kernel/slab/task_struct/objs_per_slab
cat /sys/kernel/slab/task_struct/total_objects
```

***

## Page Tables and Virtual Memory

### 4-Level Page Tables (x86-64)

x86-64 Page Table Walk
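To make the walk concrete, the sketch below splits a 48-bit virtual address into the four 9-bit table indices and the 12-bit page offset used with 4-level paging and 4 KiB pages. The struct and `split_vaddr()` are hypothetical names for this example; the field names follow the kernel's PGD/PUD/PMD/PTE terminology.

```c
#include <assert.h>
#include <stdint.h>

/* Indices used by a 4-level x86-64 walk with 4 KiB pages. */
struct pt_indices {
    unsigned int pgd;     /* bits 47..39: index into the top-level table */
    unsigned int pud;     /* bits 38..30 */
    unsigned int pmd;     /* bits 29..21 */
    unsigned int pte;     /* bits 20..12: index into the last-level table */
    unsigned int offset;  /* bits 11..0:  byte offset inside the 4 KiB page */
};

/* Split a canonical virtual address into its page table indices. */
static struct pt_indices split_vaddr(uint64_t vaddr)
{
    struct pt_indices idx;
    uint64_t top = vaddr >> 47;

    /* Canonical addresses have bits 63..47 all equal (all 0 or all 1). */
    assert(top == 0 || top == 0x1FFFF);

    idx.pgd = (vaddr >> 39) & 0x1FF;  /* 9 bits select 1 of 512 entries */
    idx.pud = (vaddr >> 30) & 0x1FF;
    idx.pmd = (vaddr >> 21) & 0x1FF;
    idx.pte = (vaddr >> 12) & 0x1FF;
    idx.offset = vaddr & 0xFFF;
    return idx;
}
```

Each level holds 512 entries of 8 bytes, so every table fits exactly in one 4 KiB page. A 2 MB huge page stops the walk at the PMD level, in which case the low 21 bits become the offset.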

### TLB (Translation Lookaside Buffer)

TLB Caching Mechanism
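One way to reason about TLB pressure is "TLB reach": the amount of memory the TLB can map without a page table walk, which is simply the number of entries times the page size. The helpers below are a hypothetical sketch of that arithmetic; the entry count you pass in is an assumption, since real TLB sizes vary by CPU model and by page size.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of memory covered by a TLB with 'entries' entries of 'page_size'. */
static uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    /* Page sizes are powers of two (4 KiB, 2 MiB, 1 GiB on x86-64). */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return entries * page_size;
}

/* Number of pages (and so translations) needed to map 'bytes' of memory. */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (bytes + page_size - 1) / page_size;  /* round up */
}
```

For an assumed 1024-entry TLB, the reach is 4 MiB with 4 KiB pages and 2 GiB with 2 MiB pages. Likewise, a 1 GiB working set needs 262144 translations with 4 KiB pages but only 512 with 2 MiB pages.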

***

## Page Fault Handling

### When Do Page Faults Happen?

Page faults aren't errors - they're a normal part of how Linux manages memory efficiently. Here are real-world scenarios:

**Minor fault examples**:

* You run `./myprogram` and its code is already in the page cache - the first execution of each page only needs a mapping (demand paging)
* After `fork()`, child writes to memory - triggers copy-on-write
* You `mmap()` a file that is already cached - the first access maps the cached page

**Major fault examples**:

* You run a program whose code is not cached yet - its pages must be read from disk
* You open an uncached 1GB video file with `mmap()` - reading it triggers disk I/O
* System under memory pressure swapped out your idle browser - switching back triggers swap-in

**Invalid fault examples**:

* `char *p = NULL; *p = 'x';` - accessing NULL pointer
* Buffer overflow writing past array bounds into an unmapped or read-only page
* Accessing freed memory (use-after-free) after the allocator has unmapped it

![Page Fault Types](/images/courses/linux-page-fault-types.svg)

### Page Fault Handler Flow

![Page Fault Handling Sequence](/images/courses/linux-page-fault-handling.svg)

When a process accesses a memory address that isn't currently mapped to a physical page, the CPU raises a **page fault exception**. The kernel's page fault handler (`exc_page_fault` on recent x86 kernels, `do_page_fault` on older ones) takes over to resolve the situation.

**Resolution Paths:**

1. **Minor Fault**: The page is in memory (e.g., in page cache or shared library) but not mapped to this process. The kernel simply updates the page table. Fast.
2. **Major Fault**: The page is not in memory (e.g., swapped out or not read from disk yet). The kernel must issue disk I/O to fetch it. Slow.
3. **Invalid Fault**: The access is truly invalid (e.g., writing to read-only memory, accessing unmapped space). The kernel sends `SIGSEGV` (Segmentation Fault).

```c
// arch/x86/mm/fault.c (heavily simplified; locking, stack growth,
// and retry logic omitted)
static void handle_page_fault(struct pt_regs *regs, unsigned long error_code,
                              unsigned long address)
{
    struct mm_struct *mm = current->mm;
    unsigned int flags = FAULT_FLAG_DEFAULT;
    struct vm_area_struct *vma;
    vm_fault_t fault;

    // Find VMA containing the address
    vma = find_vma(mm, address);
    if (!vma || address < vma->vm_start) {
        // No VMA - invalid access
        bad_area(regs, error_code, address);
        return;
    }

    // Check permissions
    if (unlikely(access_error(error_code, vma))) {
        bad_area_access_error(regs, error_code, address, vma);
        return;
    }

    // Handle the fault (demand paging, COW, swap-in, ...)
    fault = handle_mm_fault(vma, address, flags, regs);

    // Check result
    if (fault & VM_FAULT_OOM)
        pagefault_out_of_memory();
    if (fault & VM_FAULT_SIGSEGV)
        bad_area_nosemaphore(regs, error_code, address);
}
```

### Monitoring Page Faults

```bash
# View page fault stats
cat /proc/vmstat | grep pgfault
# pgfault = total faults, pgmajfault = major faults

# Per-process faults (field 10 = minor, field 12 = major)
cat /proc/<pid>/stat | awk '{print "Minor:", $10, "Major:", $12}'

# Real-time monitoring with perf
sudo perf stat -e page-faults,minor-faults,major-faults ./myprogram
```

***

## Memory Reclaim

### kswapd and Direct Reclaim

![Memory Reclaim Watermarks](/images/courses/linux-memory-reclaim.svg)

### LRU Lists

```c
// Pages organized in LRU lists
enum lru_list {
    LRU_INACTIVE_ANON,  // Anonymous, not recently used
    LRU_ACTIVE_ANON,    // Anonymous, recently used
    LRU_INACTIVE_FILE,  // File-backed, not recently used
    LRU_ACTIVE_FILE,    // File-backed, recently used
    LRU_UNEVICTABLE,    // Locked pages, mlocked
    NR_LRU_LISTS
};
```

```bash
# View LRU stats
cat /proc/meminfo | grep -E "Active|Inactive"
# Active(anon):    2048000 kB
# Inactive(anon):  1024000 kB
# Active(file):    4096000 kB
# Inactive(file):  2048000 kB
```

### Swappiness

The `vm.swappiness` parameter controls how aggressively the kernel swaps out anonymous memory relative to reclaiming page cache.
* **Range**: 0 to 200 since kernel 5.8, 0 to 100 on older kernels (default 60)
* **High value (100 or more)**: Swap out anonymous memory more readily to keep page cache (file data) in memory. Good for I/O heavy workloads.
* **Low value (0)**: Avoid swapping anonymous memory unless absolutely necessary. Good for latency-sensitive applications (databases).
* **Misconception**: Setting to 0 does NOT disable swap. It just tells the kernel to prefer reclaiming file pages first.

```bash
# Check current value
cat /proc/sys/vm/swappiness

# Change temporarily
sudo sysctl vm.swappiness=10

# Make permanent in /etc/sysctl.conf
# vm.swappiness = 10
```

***

## OOM Killer

### Design Philosophy

When the system runs completely out of memory, Linux faces an impossible choice: which process to kill?

**The goal**: Kill the "least important" process that frees the most memory.

**The challenge**: How do you define "least important"?

The OOM killer uses a heuristic:

* **Memory usage** is the primary factor (kill big memory hogs): resident memory, swap, and page tables all count
* **oom\_score\_adj** lets you override (protect critical services)
* **Historical extras**: older kernels also gave root processes a small bonus and took process runtime into account; current kernels score on memory usage and `oom_score_adj` only

**Why not just kill the biggest process?** Because that might be your database with critical data. The scoring system tries to balance memory freed vs. importance.

### OOM Score Calculation

![OOM Killer Scoring](/images/courses/linux-oom-killer-scoring.svg)

### Configuring OOM Behavior

```bash
# View OOM scores
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj

# Make process immune to OOM
echo -1000 > /proc/<pid>/oom_score_adj

# Make process OOM target
echo 1000 > /proc/<pid>/oom_score_adj

# Strict overcommit (dangerous!): refuse allocations beyond the commit limit.
# This makes OOM kills much less likely but does not disable the OOM killer,
# and it can make malloc() and fork() fail instead.
echo 2 > /proc/sys/vm/overcommit_memory

# Panic on OOM instead of killing
echo 1 > /proc/sys/vm/panic_on_oom
```

***

## Memory Cgroups

Critical for container resource management.
### Cgroup v2 Memory Controller

```bash
# Create cgroup
mkdir /sys/fs/cgroup/mygroup

# Set hard memory limit (100MB); exceeding it triggers the cgroup OOM killer
echo 104857600 > /sys/fs/cgroup/mygroup/memory.max

# Set throttling threshold (50MB); above it the kernel reclaims aggressively
# and slows the cgroup down, but does not kill anything
echo 52428800 > /sys/fs/cgroup/mygroup/memory.high

# Add process to cgroup
echo $$ > /sys/fs/cgroup/mygroup/cgroup.procs

# View current usage
cat /sys/fs/cgroup/mygroup/memory.current

# View statistics
cat /sys/fs/cgroup/mygroup/memory.stat
```

### Memory Cgroup OOM

***

## NUMA (Non-Uniform Memory Access)

### NUMA Architecture

![NUMA Topology](/images/courses/linux-numa-topology.svg)

### NUMA Policies

```bash
# View NUMA topology
numactl --hardware

# Run with specific NUMA policy
numactl --cpunodebind=0 --membind=0 ./myprogram  # Local only
numactl --interleave=all ./myprogram             # Spread across nodes

# View NUMA stats
numastat
numastat -p <pid>

# View per-process NUMA memory
cat /proc/<pid>/numa_maps
```

***

## Huge Pages

### Why Huge Pages Matter

### Configuring Huge Pages

```bash
# View huge page info
cat /proc/meminfo | grep Huge
# HugePages_Total:     128
# HugePages_Free:      100
# HugePages_Rsvd:        0
# Hugepagesize:       2048 kB

# Reserve huge pages (as root)
echo 128 > /proc/sys/vm/nr_hugepages

# Use in an application (C code, shown here as a comment):
#   void *ptr = mmap(NULL, size, PROT_READ|PROT_WRITE,
#                    MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);

# Transparent Huge Pages (automatic)
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
```

***

## Lab Exercises

**Objective**: Understand memory allocation patterns

```bash
# Watch memory allocation in real-time
watch -n 1 'cat /proc/meminfo | head -20'

# Trace page allocations
sudo perf record -e kmem:mm_page_alloc -a sleep 5
sudo perf script | head -50

# View buddy allocator fragmentation
cat /proc/buddyinfo

# Trigger memory pressure
stress --vm 2 --vm-bytes 1G --timeout 10s

# Watch reclaim
watch -n 1 'cat /proc/vmstat | grep -E "pgfault|pgmaj|pswp"'
```

**Objective**: Understand OOM killer behavior

```bash
# Create a cgroup with 100MB limit
sudo mkdir /sys/fs/cgroup/test_oom
echo 104857600 | sudo tee /sys/fs/cgroup/test_oom/memory.max

# Run memory-hungry process in cgroup
sudo bash -c 'echo $$ > /sys/fs/cgroup/test_oom/cgroup.procs && \
    python3 -c "x = [0] * (200 * 1024 * 1024 // 8)"'

# Watch OOM events
sudo dmesg | tail -20

# View OOM scores for all processes
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if [ -f /proc/$pid/oom_score ]; then
        echo "$pid $(cat /proc/$pid/oom_score 2>/dev/null) $(cat /proc/$pid/comm 2>/dev/null)"
    fi
done | sort -nk2 | tail -20

# Cleanup
sudo rmdir /sys/fs/cgroup/test_oom
```

**Objective**: Profile page faults

```c
// fault_test.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>

// Print the minor and major fault counters of this process
void print_faults(const char *label) {
    struct rusage usage;
    getrusage(RUSAGE_SELF, &usage);
    printf("%s - Minor: %ld, Major: %ld\n",
           label, usage.ru_minflt, usage.ru_majflt);
}

int main(void) {
    size_t size = 100 * 1024 * 1024;  // 100 MB

    print_faults("Before allocation");

    // Allocate (no faults yet - lazy allocation)
    char *mem = malloc(size);
    if (!mem) {
        perror("malloc");
        return 1;
    }
    print_faults("After malloc");

    // Touch every page (causes faults)
    for (size_t i = 0; i < size; i += 4096) {
        mem[i] = 'x';
    }
    print_faults("After touching pages");

    free(mem);
    return 0;
}
```

```bash
gcc fault_test.c -o fault_test
./fault_test
```

***

## Interview Questions

**Answer**:

**Minor fault**:

* Page is in memory but not mapped in page table
* No disk I/O required
* Examples: First touch of allocated memory, COW after fork
* Cost: roughly \~1-10 microseconds

**Major fault**:

* Page not in memory, must read from disk
* Examples: Swap-in, reading an uncached memory-mapped file
* Cost: roughly \~1-10 milliseconds (about 1000x slower!)
**Production impact**:

* Minor faults: Normal, usually not a concern
* Major faults: Serious performance problem when sustained; indicates memory pressure or a working set that is too large

**Monitoring**:

```bash
perf stat -e page-faults,minor-faults,major-faults ./program
```

**Answer**:

**OOM score calculation**:

```
score = (process_memory / total_memory) × 1000 + oom_score_adj
```

**Factors**:

1. Memory usage (primary factor)
2. `oom_score_adj` (-1000 to +1000)
3. Root processes got a slight bonus on older kernels only

**Selection process**:

1. Calculate score for all eligible processes
2. Select process with highest score
3. Send SIGKILL to that process
4. Wait for memory to free

**Protection strategies**:

* Set `oom_score_adj = -1000` for critical services
* Use memory cgroups to limit container memory
* Enable `vm.panic_on_oom` for critical systems where a reboot is preferable to losing an arbitrary process

**Answer**:

| Aspect            | kmalloc           | vmalloc                        |
| ----------------- | ----------------- | ------------------------------ |
| Physical memory   | Contiguous        | Non-contiguous                 |
| Max size          | \~4 MB (typical)  | Virtual space limit            |
| Performance       | Faster            | Slower (TLB overhead)          |
| Use in DMA        | Yes               | No (not physically contiguous) |
| Interrupt context | Yes (GFP\_ATOMIC) | No (may sleep)                 |

**When to use kmalloc**:

* Small allocations (\<4 MB)
* DMA buffers
* Performance-critical paths
* Interrupt context

**When to use vmalloc**:

* Large allocations
* Module loading
* Memory that doesn't need DMA
* Non-critical paths

**Answer**:

**Memory cgroup controls**:

* `memory.max`: Hard limit (OOM if exceeded)
* `memory.high`: Soft limit (throttling)
* `memory.low`: Best-effort protection
* `memory.min`: Hard protection

**Container behavior**:

1. Container requests resources (Kubernetes requests)
2. Scheduler places based on available memory
3. Cgroup limits enforced at runtime
4. Exceeding `memory.max` → container OOM (not host)

**OOM handling**:

* Default: Kill one process in cgroup
* `memory.oom.group = 1`: Kill entire cgroup
* Docker default: Restart policy determines behavior

**Best practices**:

* Set `memory.max` to prevent runaway containers
* Set `memory.high` slightly below max for graceful throttling
* Monitor container memory usage with cAdvisor/Prometheus

***

## Key Takeaways

* **Buddy allocator**: Manages physical pages efficiently with power-of-2 coalescing
* **Slab allocator (SLUB)**: Optimizes small object allocation with caching and per-CPU pools
* **OOM killer**: Protects system by killing processes based on memory score
* **Memory cgroups**: Enable container memory limits with per-cgroup OOM handling

***

## Interview Deep-Dive

**Strong Answer:**

* The container was killed by the cgroup OOM killer, not the system-wide OOM killer. The cgroup v2 memory controller enforces a per-cgroup `memory.max` limit that is independent of host-level free memory. When the container's `memory.current` would exceed `memory.max`, the kernel first attempts to reclaim pages from within that cgroup (file-backed pages, reclaimable slab). If reclaim fails to free enough, the cgroup OOM killer selects and kills a process within the cgroup.
* To investigate, I would first check the cgroup memory stats: `cat /sys/fs/cgroup/<cgroup>/memory.stat` to see the breakdown of `anon` (heap, stack), `file` (page cache charged to this cgroup), `kernel` (slab, page tables, socket buffers), and `sock` (network buffers). A common surprise is that page cache from file reads is charged to the container's cgroup, so a database doing sequential scans can fill its memory limit with page cache before the application's heap is even close to the limit.
* I would also check `memory.events` for the `oom_kill` counter and correlate with `dmesg` for the detailed OOM output, which shows per-process RSS and the scoring breakdown.
* Prevention strategies: set `memory.high` below `memory.max` to create a throttling buffer where the kernel aggressively reclaims before hitting the hard limit. For databases, consider using `O_DIRECT` to bypass page cache. Note that `vm.swappiness` is a host-wide setting under cgroup v2 (only cgroup v1 had a per-cgroup `memory.swappiness`), so limit swap per container with `memory.swap.max` instead. Always set the JVM's `-Xmx` below the container's memory limit, leaving headroom for native allocations, page tables, and kernel memory charged to the cgroup.

**Follow-up:** How does the buddy allocator's fragmentation interact with cgroup memory limits?

**Follow-up Answer:**

* The buddy allocator operates on physical pages at the zone level, not per-cgroup; cgroups only account for (charge) the pages a cgroup uses. When a cgroup is near its `memory.max`, the kernel must reclaim pages charged to that cgroup before it can charge new ones. Fragmentation is a separate, zone-wide problem: if free memory is split into many small chunks with no contiguous blocks, allocations requesting higher-order pages (like transparent huge pages needing order-9 = 2MB) fail or stall even when plenty of memory is free or reclaimable. The kernel then attempts compaction, which migrates pages within the zone and adds latency to the allocating task. This is why disabling transparent huge pages (`echo never > /sys/kernel/mm/transparent_hugepage/enabled`) is often recommended for latency-sensitive containerized workloads.

**Strong Answer:**

* When user space calls `malloc()`, glibc either services it from its internal free list or calls `mmap()` or `brk()` for new pages. The kernel's `mmap()` handler creates a `vm_area_struct` (VMA) describing the new mapping but does not allocate any physical pages. The page table entries for this range remain empty. This is lazy allocation: the kernel defers physical allocation until the page is actually accessed.
* When the process first writes to the allocated address, the CPU walks the page table, finds no valid PTE, and raises a page fault exception (vector 14 on x86-64). The CPU pushes the error code onto the kernel stack and stores the faulting address in the CR2 register.
* The kernel's page fault handler in `arch/x86/mm/fault.c` (`exc_page_fault()` in recent kernels, `do_page_fault()` in older ones) reads CR2 to get the faulting address, then calls `find_vma()` to locate the VMA containing that address. If no VMA exists, or the access violates VMA permissions, the kernel sends SIGSEGV.
* For a valid anonymous mapping, `handle_mm_fault()` walks the page table levels (PGD, P4D, PUD, PMD, PTE), allocating intermediate page table pages as needed. At the PTE level, it calls `do_anonymous_page()`, which allocates a physical page via `alloc_pages()` (the buddy allocator), zeroes it (for security), creates a PTE mapping the virtual address to the physical page, and inserts it into the page table.
* This is a minor fault: no disk I/O was needed, and the cost is roughly 2-10 microseconds.

**Follow-up:** How does transparent huge page allocation change this flow?

**Follow-up Answer:**

* When THP is enabled and the faulting address falls within a suitably aligned VMA, the PMD-level handler (`do_huge_pmd_anonymous_page()`) attempts to allocate an order-9 compound page (2MB) instead of a regular 4KB page. If the buddy allocator has a free 2MB block, it maps the entire 2MB region with a single PMD entry, replacing the 512 PTEs that 4KB pages would need and dramatically reducing TLB misses. If the 2MB allocation fails due to fragmentation, the kernel falls back to regular 4KB pages. The `khugepaged` kernel thread runs in the background, scanning for opportunities to collapse adjacent 4KB pages into 2MB huge pages, which involves migrating pages to achieve contiguity.

**Strong Answer:**

* `kswapd` is the kernel's background memory reclaim daemon. It wakes up when free memory in a zone drops below the `low` watermark and reclaims pages until free memory reaches the `high` watermark. If kswapd is consuming 30% CPU, the system is under sustained memory pressure: applications are allocating about as fast as, or faster than, kswapd can reclaim.
* The implications are severe: kswapd reclaim involves scanning LRU lists (active/inactive for both anonymous and file-backed pages), which is CPU-intensive. Worse, if kswapd cannot keep up, application threads enter direct reclaim during `alloc_pages()`, meaning they block in the allocator trying to free pages before they can proceed. This causes unpredictable latency spikes across all applications.
* To diagnose, I would check `/proc/meminfo` for the breakdown: if `Cached` is high, file pages are consuming memory and can usually be reclaimed. If `AnonPages` is high, the anonymous working set may genuinely exceed physical memory, and the kernel has to swap to make room. I would check `/proc/vmstat` for `pgscan_kswapd` vs `pgscan_direct`: a high ratio of direct reclaim to kswapd reclaim means kswapd is falling behind.
* Fixes depend on the cause: if dentry and inode caches dominate (check `slabtop`), tune `vm.vfs_cache_pressure` higher to reclaim them more aggressively. If anonymous memory pressure, either add physical memory, reduce the workload, or add swap (swap prevents OOM but adds latency). If the problem is specific containers, tighten their `memory.high` limits so they self-throttle before causing system-wide pressure.

**Follow-up:** How does the multi-gen LRU (MGLRU) improve on the traditional two-list LRU for reclaim decisions?

**Follow-up Answer:**

* The traditional LRU uses two lists per page type (active and inactive) and makes binary decisions: a page is either recently used or not. This leads to poor decisions when the working set changes, because pages must cycle through both lists before being reclaimed. MGLRU (merged in kernel 6.1) introduces multiple generations of pages, tracking access recency with higher fidelity. It scans page tables and checks the hardware-maintained accessed bits to build a multi-generational age distribution, allowing the kernel to reclaim the coldest pages first. In practice, MGLRU can reduce page fault rates and improve throughput under memory pressure, especially for workloads with large, dynamic working sets like databases.

***

Next: [Virtual Memory & Address Translation →](/courses/linux-internals/virtual-memory)