Memory Management Internals
Memory management is one of the most complex subsystems in the Linux kernel. Understanding it deeply is crucial for infrastructure engineers debugging OOM issues, optimizing container resource limits, and understanding application performance.
Key Topics: Buddy allocator, slab, zones, OOM killer, cgroups
Time to Master: 16-18 hours
Physical Memory Organization
Memory Zones
Linux organizes physical memory into zones based on addressing constraints:
Viewing Zone Information
Buddy Allocator
Why Power-of-2 Allocation?
The buddy allocator is the kernel’s physical page allocator. But why does it use powers of 2?
The problem: If you allow arbitrary-sized allocations, memory becomes fragmented. You might have 100 free pages scattered around, but can’t allocate a contiguous 64-page block.
The solution: Only allow power-of-2 sized blocks (1, 2, 4, 8, 16… pages). This enables:
- Fast splitting: A 16-page block splits perfectly into two 8-page blocks
- Fast coalescing: Two adjacent 8-page blocks merge back into one 16-page block
- Simple math: Finding a block’s “buddy” is just XOR with the block size
Two blocks are buddies when:
- They’re the same size
- They’re adjacent in memory
- Together they form the next larger power-of-2 block
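The XOR rule above can be checked with a few lines of arithmetic. This is a user-space sketch in Python, not kernel code: the names `buddy_index` and `merged_index` are illustrative, and page indices are assumed to be relative to a region aligned to the largest block size.

```python
def buddy_index(page_index: int, order: int) -> int:
    """Return the start index of the buddy of a 2**order-page block."""
    block_pages = 1 << order
    # A block of a given order always starts on a multiple of its own size.
    if page_index % block_pages != 0:
        raise ValueError("block is not aligned to its size")
    # Flipping the bit that equals the block size selects the other half
    # of the enclosing block of the next order.
    return page_index ^ block_pages


def merged_index(page_index: int, order: int) -> int:
    """Return the start of the order+1 block formed by a block and its buddy."""
    # Clearing the block-size bit yields the lower of the two buddies.
    return page_index & ~(1 << order)


if __name__ == "__main__":
    print(buddy_index(8, 3))   # buddy of the 8-page block starting at page 8
    print(merged_index(8, 3))  # start of the 16-page block they form together
```

The same bit flip works in both directions: the block at page 0 and the block at page 8 each name the other as their order-3 buddy.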
How Buddy Allocation Works
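As a rough illustration of splitting and coalescing, the toy model below keeps one free set per order and follows the power-of-2 rules described earlier. It is a simplified sketch under stated assumptions, not the kernel's implementation: `ToyBuddy` is a hypothetical class, it manages a single region of `2**max_order` pages, and it ignores zones, migrate types, and locking.

```python
class ToyBuddy:
    """Toy buddy allocator over page indices 0 .. 2**max_order - 1."""

    def __init__(self, max_order: int):
        self.max_order = max_order
        # free[order] holds the start index of every free block of that order.
        self.free = {order: set() for order in range(max_order + 1)}
        self.free[max_order].add(0)

    def alloc(self, order: int):
        """Allocate a 2**order-page block; return its start index or None."""
        # Find the smallest free block that is large enough.
        for current in range(order, self.max_order + 1):
            if self.free[current]:
                start = min(self.free[current])
                self.free[current].remove(start)
                break
        else:
            return None
        # Split repeatedly, keeping the lower half and freeing the upper half.
        while current > order:
            current -= 1
            self.free[current].add(start + (1 << current))
        return start

    def free_block(self, start: int, order: int) -> None:
        """Free a block and merge it with its buddy for as long as possible."""
        while order < self.max_order:
            buddy = start ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free[order].add(start)


if __name__ == "__main__":
    allocator = ToyBuddy(max_order=4)   # 16 pages in total
    a = allocator.alloc(0)              # splits 16 -> 8 -> 4 -> 2 -> 1
    b = allocator.alloc(1)
    allocator.free_block(a, 0)
    allocator.free_block(b, 1)          # everything coalesces back
    print(allocator.free)
```

Freeing `a` alone cannot merge past order 1 because its order-1 buddy is still allocated; only once `b` is freed does the region collapse back into a single 16-page block.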
Buddy Allocator Code
Fragmentation Problem
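One way to see fragmentation is to look at how free memory is spread across block orders, which is what `/proc/buddyinfo` reports per zone. The sketch below parses text in that layout. The sample line is made up for illustration (it is not output from a real machine), and the helper names are hypothetical.

```python
# Illustrative sample in the /proc/buddyinfo layout: one count per order 0..10.
SAMPLE_BUDDYINFO = (
    "Node 0, zone   Normal   4000   1200    300     40      5"
    "      0      0      0      0      0      0\n"
)


def parse_buddyinfo(text: str) -> dict:
    """Map (node, zone) to the list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_pages(counts: list) -> int:
    """Total free pages: each order-n block contributes 2**n pages."""
    return sum(count << order for order, count in enumerate(counts))


def largest_free_order(counts: list):
    """Highest order that still has at least one free block, or None."""
    orders = [order for order, count in enumerate(counts) if count > 0]
    return max(orders) if orders else None


if __name__ == "__main__":
    counts = parse_buddyinfo(SAMPLE_BUDDYINFO)[(0, "Normal")]
    print(free_pages(counts), largest_free_order(counts))
```

In the sample, thousands of pages are free in total, yet no free block is larger than order 4 (16 pages), so a request for a larger contiguous block would need compaction or reclaim first.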
Slab Allocator (SLUB)
The Problem with Buddy Allocator for Small Objects
The buddy allocator works great for page-sized allocations (4KB+), but what about small objects like task_struct (2.6KB) or dentry (192 bytes)?
Problems:
- Waste: Allocating a full 4KB page for a 192-byte object wastes 95% of memory
- Performance: The buddy allocator’s free lists are protected by a per-zone lock, so frequent small allocations cause contention on multi-core systems
- Cache efficiency: Small objects from different pages = poor cache locality
The slab allocator addresses these problems:
- Pre-allocate pages and divide them into same-sized objects
- Per-CPU caches for lockless fast path
- Object caches for common kernel structures (task_struct, inode, dentry, etc.)
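The waste figure and the benefit of packing same-sized objects can be worked out directly. This is a back-of-the-envelope sketch that assumes a 4 KB page and ignores per-object metadata, alignment padding, and the fact that real SLUB caches may use multi-page slabs; the function names are illustrative.

```python
PAGE_SIZE = 4096  # bytes, assuming 4 KB pages


def waste_fraction_one_per_page(object_size: int) -> float:
    """Fraction of a page wasted when each object gets a page to itself."""
    return (PAGE_SIZE - object_size) / PAGE_SIZE


def objects_per_slab(object_size: int, slab_pages: int = 1) -> tuple:
    """Return (objects that fit, leftover bytes) for a slab of slab_pages pages."""
    slab_bytes = slab_pages * PAGE_SIZE
    count = slab_bytes // object_size
    return count, slab_bytes - count * object_size


if __name__ == "__main__":
    print(f"{waste_fraction_one_per_page(192):.1%}")  # one 192-byte dentry per page
    print(objects_per_slab(192))                      # dentries packed into one page
```

Packing 192-byte objects into a single page fits 21 of them and leaves 64 bytes unused, compared with wasting roughly 95% of the page when each object is allocated separately.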
SLUB Architecture
SLUB Fast Path
Viewing Slab Information
Page Tables and Virtual Memory
4-Level Page Tables (x86-64)
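With 4-level paging on x86-64, a 48-bit virtual address is split into four 9-bit table indices and a 12-bit page offset. The sketch below performs that split in Python; the function name and the dictionary keys (using the Linux level names PGD, PUD, PMD, PTE) are illustrative choices.

```python
INDEX_BITS = 9      # 512 entries per table level
OFFSET_BITS = 12    # 4 KB pages
INDEX_MASK = (1 << INDEX_BITS) - 1


def split_virtual_address(vaddr: int) -> dict:
    """Split a 48-bit x86-64 virtual address into table indices and offset."""
    return {
        "pgd": (vaddr >> 39) & INDEX_MASK,   # bits 47..39
        "pud": (vaddr >> 30) & INDEX_MASK,   # bits 38..30
        "pmd": (vaddr >> 21) & INDEX_MASK,   # bits 29..21
        "pte": (vaddr >> 12) & INDEX_MASK,   # bits 20..12
        "offset": vaddr & ((1 << OFFSET_BITS) - 1),  # bits 11..0
    }


if __name__ == "__main__":
    for name, value in split_virtual_address(0x00007F1234567ABC).items():
        print(f"{name:>6}: {value:#x}")
```

Each level indexes a 512-entry table, so the four indices plus the offset cover 9 × 4 + 12 = 48 bits of address.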
TLB (Translation Lookaside Buffer)
Page Fault Handling
When Do Page Faults Happen?
Page faults aren’t errors - they’re a normal part of how Linux manages memory efficiently. Here are real-world scenarios:
Minor fault examples:
- You run ./myprogram - the first time code executes, it’s not mapped yet (demand paging; a minor fault when the file is already in the page cache)
- After fork(), child writes to memory - triggers copy-on-write
- You mmap() a file but haven’t read it yet - first access maps it in
Major fault examples:
- Your laptop suspended with apps open - resuming reads pages back from swap
- You open a 1GB video file with mmap() - reading it triggers disk I/O
- System under memory pressure swapped out your idle browser - switching back triggers swap-in
Invalid fault examples (the process receives SIGSEGV):
- char *p = NULL; *p = 'x'; - accessing NULL pointer
- Buffer overflow writing past array bounds
- Accessing freed memory (use-after-free)
Monitoring Page Faults
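A process can observe its own fault counters through `getrusage()`. The sketch below, which assumes a Linux system and Python's standard `resource` and `mmap` modules, maps anonymous memory and touches one byte per page to trigger demand-paging faults. The function names are illustrative, and the exact counts depend on the system (for example, transparent huge pages can reduce the number of faults).

```python
import mmap
import resource

PAGE_SIZE = mmap.PAGESIZE


def fault_counts() -> tuple:
    """Return (minor, major) page fault counts for the current process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_anonymous_pages(n_pages: int) -> tuple:
    """Map n_pages of anonymous memory, write to each page, return fault deltas."""
    before = fault_counts()
    # The mapping is created without physical pages behind it.
    region = mmap.mmap(
        -1, n_pages * PAGE_SIZE,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
    )
    # The first write to each page is what forces the kernel to back it.
    for page in range(n_pages):
        region[page * PAGE_SIZE] = 1
    after = fault_counts()
    region.close()
    return after[0] - before[0], after[1] - before[1]


if __name__ == "__main__":
    minor, major = touch_anonymous_pages(4096)
    print(f"minor faults: {minor}, major faults: {major}")
```

On a typical system the minor count rises with the number of pages touched while the major count stays at or near zero, because no disk I/O is needed to back fresh anonymous memory.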
Memory Reclaim
kswapd and Direct Reclaim
Swappiness
The vm.swappiness parameter controls how aggressively the kernel swaps out anonymous memory relative to reclaiming page cache.
- Range: 0 to 200 on kernel 5.8 and later, 0 to 100 on older kernels (default 60)
- High value (100): Aggressively swap out anonymous memory to keep page cache (file data) in memory. Good for I/O heavy workloads.
- Low value (0): Avoid swapping anonymous memory unless absolutely necessary. Good for latency-sensitive applications (databases).
- Misconception: Setting to 0 does NOT disable swap. It just tells the kernel to prefer reclaiming file pages first.
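A simplified way to read the swappiness value is as a pair of relative weights: anonymous memory gets `swappiness` and file-backed memory gets `200 - swappiness`. The sketch below shows only that baseline weighting; it is an approximation that leaves out the further adjustments the kernel applies based on recent reclaim cost, and the function name is illustrative.

```python
def reclaim_weights(swappiness: int, max_swappiness: int = 200) -> tuple:
    """Return (anon_weight, file_weight) for a given vm.swappiness value."""
    if not 0 <= swappiness <= max_swappiness:
        raise ValueError("swappiness out of range")
    # Simplified model: the two weights always add up to 200.
    return swappiness, max_swappiness - swappiness


if __name__ == "__main__":
    for value in (0, 60, 100):
        anon, file = reclaim_weights(value)
        print(f"swappiness={value}: anon weight {anon}, file weight {file}")
```

Under this model the default of 60 already favors reclaiming file pages over swapping, and a value of 100 weights the two equally.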
OOM Killer
Design Philosophy
When the system runs completely out of memory, Linux faces an impossible choice: which process to kill?
The goal: Kill the “least important” process that frees the most memory.
The challenge: How do you define “least important”?
The OOM killer uses a heuristic:
- Memory usage is the primary factor (kill big memory hogs)
- oom_score_adj lets you override (protect critical services)
- Older kernels also gave root processes a slight bonus (system services are important) and took process runtime into account; current kernels score on memory footprint and oom_score_adj alone
OOM Score Calculation
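The scoring can be approximated in a few lines. The sketch below is modelled on my reading of the `oom_badness()` logic in recent kernels: the base score is the process footprint in pages (RSS, swap entries, page tables), shifted by `oom_score_adj` as a per-mille share of total memory. The function and parameter names are illustrative, and older kernels apply additional adjustments not shown here.

```python
def oom_badness(rss_pages: int, swap_pages: int, pgtable_pages: int,
                oom_score_adj: int, total_pages: int):
    """Approximate OOM badness in pages; None means never selected."""
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be within -1000..1000")
    if oom_score_adj == -1000:
        return None  # the task is exempt from OOM killing
    # Base score: how much memory killing this task would free.
    points = rss_pages + swap_pages + pgtable_pages
    # Each unit of oom_score_adj is worth 1/1000 of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


if __name__ == "__main__":
    total = 4_000_000  # roughly 16 GB of 4 KB pages, chosen for illustration
    print(oom_badness(1_000_000, 0, 2_000, 0, total))
    print(oom_badness(1_000_000, 0, 2_000, 500, total))
    print(oom_badness(1_000_000, 0, 2_000, -1000, total))
```

The process with the highest score is the one selected, so a positive adjustment makes a task a more likely victim and a negative one protects it.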
Memory Cgroups
Critical for container resource management.
Cgroup v2 Memory Controller
Memory Cgroup OOM
Huge Pages
Why Huge Pages Matter
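A quick way to see the benefit is TLB reach: the amount of memory the TLB can cover without a miss, which is the number of entries multiplied by the page size. The entry count below (1536) is an assumed, illustrative figure rather than a property of any particular CPU, and real TLBs often have separate capacities for different page sizes.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory covered by a TLB with the given entry count and page size."""
    return entries * page_size


if __name__ == "__main__":
    entries = 1536  # hypothetical TLB size, for illustration only
    small = tlb_reach_bytes(entries, 4 * KIB)
    huge = tlb_reach_bytes(entries, 2 * MIB)
    print(f"4 KB pages: {small // MIB} MiB of reach")
    print(f"2 MB pages: {huge // MIB} MiB of reach")
```

With the same number of entries, 2 MB pages cover 512 times more memory, which is why workloads with large working sets see fewer TLB misses.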
Configuring Huge Pages
Lab Exercises
Lab 1: Memory Allocation Analysis
Lab 2: OOM Killer Experimentation
Lab 3: Page Fault Analysis
Interview Questions
Q1: Explain the difference between minor and major page faults
Minor fault:
- Page is in memory but not mapped in page table
- No disk I/O required
- Examples: First touch of allocated memory, COW after fork
- Cost: ~1-10 microseconds
Major fault:
- Page not in memory, must read from disk
- Examples: Swap-in, reading a memory-mapped file that is not in the page cache
- Cost: ~1-10 milliseconds (1000x slower!)
Performance impact:
- Minor faults: Normal, usually not a concern
- Major faults: Serious performance problem, indicates memory pressure or working set too large
Q2: How does the kernel decide which process to kill during OOM?
Scoring factors:
- Memory usage (primary factor)
- oom_score_adj (-1000 to +1000)
- Root processes get slight preference (older kernels only)
Selection process:
- Calculate score for all processes
- Select process with highest score
- Send SIGKILL to that process
- Wait for memory to free
Prevention:
- Set oom_score_adj = -1000 for critical services
- Use memory cgroups to limit container memory
- Enable vm.panic_on_oom for critical systems
Q3: What are the trade-offs between kmalloc and vmalloc?
| Aspect | kmalloc | vmalloc |
|---|---|---|
| Physical memory | Contiguous | Non-contiguous |
| Max size | ~4 MB | Virtual space limit |
| Performance | Faster | Slower (TLB overhead) |
| Use in DMA | Yes | No (not physically contiguous) |
| Interrupt context | Yes (GFP_ATOMIC) | No (may sleep) |
Use kmalloc for:
- Small allocations (<4 MB)
- DMA buffers
- Performance-critical paths
- Interrupt context
Use vmalloc for:
- Large allocations
- Module loading
- Memory that doesn’t need DMA
- Non-critical paths
Q4: Explain how cgroups memory limits work with containers
Cgroup v2 memory controls:
- memory.max: Hard limit (OOM if exceeded)
- memory.high: Soft limit (throttling)
- memory.low: Best-effort protection
- memory.min: Hard protection
Container integration:
- Container requests resources (Kubernetes requests)
- Scheduler places based on available memory
- Cgroup limits enforced at runtime
- Exceeding memory.max → container OOM (not host)
OOM behavior:
- Default: Kill one process in cgroup
- memory.oom.group = 1: Kill entire cgroup
- Docker default: Restart policy determines behavior
Best practices:
- Set memory.max to prevent runaway containers
- Set memory.high slightly below max for graceful throttling
- Monitor container memory usage with cAdvisor/Prometheus
Key Takeaways
Buddy Allocator
Slab Allocator
OOM Killer
Memory Cgroups
Interview Deep-Dive
A production database container was OOM-killed, but the host machine had 50GB free. Explain why this happened and walk through how you would investigate and prevent it.
- The container was killed by the cgroup OOM killer, not the system-wide OOM killer. The cgroup v2 memory controller enforces a per-cgroup memory.max limit that is independent of host-level free memory. When the container’s memory.current would exceed memory.max, the kernel first attempts to reclaim pages from within that cgroup (file-backed pages, reclaimable slab). If reclaim fails to free enough, the cgroup OOM killer selects and kills a process within the cgroup.
- To investigate, I would first check the cgroup memory stats: cat /sys/fs/cgroup/<path>/memory.stat to see the breakdown of anon (heap, stack), file (page cache charged to this cgroup), kernel (slab, page tables, socket buffers), and sock (network buffers). A common surprise is that page cache from file reads is charged to the container’s cgroup, so a database doing sequential scans can fill its memory limit with page cache before the application’s heap is even close to the limit.
- I would also check memory.events for the oom_kill counter and correlate with dmesg for the detailed OOM output, which shows per-process RSS and oom_score_adj for the tasks in the cgroup.
- Prevention strategies: set memory.high below memory.max to create a throttling buffer where the kernel aggressively reclaims before hitting the hard limit. For databases, consider using O_DIRECT to bypass page cache. Cgroup v2 has no per-cgroup swappiness knob, so use memory.swap.max to control how much the cgroup may swap. For JVM-based services, always set -Xmx below the container’s memory limit, leaving headroom for native allocations, page tables, and kernel memory charged to the cgroup.
- The buddy allocator operates on physical pages at the zone level, not per-cgroup. However, cgroup memory limits affect whether an allocation made in a cgroup context can proceed. When a cgroup is near its memory.max, the kernel may need to reclaim pages before allocating new ones. If physical memory is heavily fragmented (many small free chunks but no contiguous blocks), allocations requesting higher-order pages (like transparent huge pages needing order-9 = 2MB) will fail or stall even if the cgroup has reclaimable memory. The kernel will attempt compaction, which works on zones rather than on a single cgroup’s pages, and this adds latency. This is why disabling transparent huge pages (echo never > /sys/kernel/mm/transparent_hugepage/enabled) is often recommended for latency-sensitive containerized workloads.
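The investigation steps above (reading memory.stat and memory.events) can be scripted. Both files use a flat "key value" layout, so one small parser covers them. The sketch below works on sample text rather than a live cgroup: the numbers are invented for illustration, and the helper names are hypothetical.

```python
# Invented sample values in the flat "key value" layout of cgroup v2 files.
SAMPLE_MEMORY_STAT = """\
anon 805306368
file 3221225472
kernel 134217728
sock 4194304
"""

SAMPLE_MEMORY_EVENTS = """\
low 0
high 1520
max 37
oom 1
oom_kill 1
"""


def parse_flat_keyed(text: str) -> dict:
    """Parse 'key value' lines into a dict of integers."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            values[fields[0]] = int(fields[1])
    return values


def largest_consumer(stat: dict, keys=("anon", "file", "kernel", "sock")) -> str:
    """Return which of the given memory.stat categories is largest."""
    present = {key: stat[key] for key in keys if key in stat}
    return max(present, key=present.get)


if __name__ == "__main__":
    stat = parse_flat_keyed(SAMPLE_MEMORY_STAT)
    events = parse_flat_keyed(SAMPLE_MEMORY_EVENTS)
    print("largest category:", largest_consumer(stat))
    print("OOM kills so far:", events["oom_kill"])
```

In this sample the page cache (`file`) dominates the charge, which matches the scenario of a database filling its limit through sequential reads. Against a real container, the same parser could be fed the contents of the files under /sys/fs/cgroup/<path>/.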
Explain the complete lifecycle of a page fault when a process accesses a heap allocation for the first time after malloc(). What kernel code paths are involved?
- When user space calls malloc(), glibc either services it from its internal free list or calls mmap() or brk() for new pages. The kernel’s mmap() handler creates a vm_area_struct (VMA) describing the new mapping but does not allocate any physical pages. The page table entries for this range remain empty. This is lazy allocation: the kernel defers physical allocation until the page is actually accessed.
- When the process first writes to the allocated address, the CPU walks the page table, finds no valid PTE, and raises a page fault exception (vector 14 on x86-64). The CPU pushes an error code onto the kernel stack and stores the faulting address in the CR2 register.
- The kernel’s do_page_fault() handler in arch/x86/mm/fault.c (named exc_page_fault() on recent kernels) reads CR2 to get the faulting address, then calls find_vma() to locate the VMA containing that address. If no VMA exists, or the access violates VMA permissions, the kernel sends SIGSEGV.
- For a valid anonymous mapping, handle_mm_fault() walks the page table levels (PGD, P4D, PUD, PMD, PTE), allocating intermediate page table pages as needed. At the PTE level, it calls do_anonymous_page(), which allocates a physical page via alloc_pages() (the buddy allocator), zeroes it (for security), creates a PTE mapping the virtual address to the physical page, and inserts it into the page table.
- This is a minor fault: no disk I/O was needed, and the cost is roughly 2-10 microseconds.
- When THP is enabled and the faulting address falls within a suitably aligned VMA, the PMD-level handler (do_huge_pmd_anonymous_page()) attempts to allocate an order-9 compound page (2MB) instead of a regular 4KB page. If the buddy allocator has a free 2MB block, it maps the entire 2MB region with a single PMD entry, eliminating 511 page table entries and dramatically reducing TLB misses. If the 2MB allocation fails due to fragmentation, the kernel falls back to regular 4KB pages. The khugepaged kernel thread runs in the background, scanning for opportunities to collapse adjacent 4KB pages into 2MB huge pages, which involves migrating pages to achieve contiguity.
Your monitoring shows kswapd consuming 30% CPU on a server. What is happening, what are the implications, and how would you fix it?
- kswapd is the kernel’s background memory reclaim daemon. It wakes up when free memory in a zone drops below the low watermark and reclaims pages until free memory reaches the high watermark. If kswapd is consuming 30% CPU, the system is under sustained memory pressure: applications are allocating nearly as fast as, or faster than, kswapd can reclaim.
- The implications are severe: kswapd reclaim involves scanning LRU lists (active/inactive for both anonymous and file-backed pages), which is CPU-intensive. Worse, if kswapd cannot keep up, application threads enter direct reclaim during alloc_pages(), meaning they block in the allocator trying to free pages before they can proceed. This causes unpredictable latency spikes across all applications.
- To diagnose, I would check /proc/meminfo for the breakdown: if Cached is high, file pages are consuming memory and can be reclaimed. If AnonPages is high, the working set genuinely exceeds physical memory and swap is being used. I would check /proc/vmstat for pgscan_kswapd vs pgscan_direct: a high ratio of direct reclaim to kswapd reclaim means kswapd is falling behind.
- Fixes depend on the cause: if the dentry and inode caches have grown large, tune vm.vfs_cache_pressure higher to reclaim them more aggressively. If anonymous memory pressure, either add physical memory, reduce the workload, or add swap (swap prevents OOM but adds latency). If the problem is specific containers, tighten their memory.high limits so they self-throttle before causing system-wide pressure.
- The traditional LRU uses two lists per page type (active and inactive) and makes binary decisions: a page is either recently used or not. This leads to poor decisions when the working set changes, because pages must cycle through both lists before being reclaimed. MGLRU (merged in kernel 6.1) introduces multiple generations of pages, tracking access recency with higher fidelity. It uses hardware-assisted page table scanning (checking accessed bits) to build a multi-generational age distribution, allowing the kernel to reclaim the truly coldest pages first. In practice, MGLRU reduces page fault rates and improves throughput under memory pressure, especially for workloads with large, dynamic working sets like databases.
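The pgscan_kswapd versus pgscan_direct comparison from the diagnosis above reduces to a simple ratio. The sketch below computes the share of page scanning done in direct reclaim from text in the /proc/vmstat "name value" layout. The sample values are invented, the function names are illustrative, and the counters are cumulative since boot, so in practice two snapshots should be compared.

```python
# Invented sample in the /proc/vmstat "name value" layout.
SAMPLE_VMSTAT = """\
pgscan_kswapd 900000
pgscan_direct 100000
pgsteal_kswapd 850000
pgsteal_direct 60000
"""


def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            stats[fields[0]] = int(fields[1])
    return stats


def direct_reclaim_share(stats: dict) -> float:
    """Fraction of scanned pages that were scanned by direct reclaim."""
    kswapd = stats.get("pgscan_kswapd", 0)
    direct = stats.get("pgscan_direct", 0)
    total = kswapd + direct
    return direct / total if total else 0.0


if __name__ == "__main__":
    share = direct_reclaim_share(parse_vmstat(SAMPLE_VMSTAT))
    print(f"direct reclaim share: {share:.0%}")
```

A share that keeps growing between snapshots suggests that allocating threads are increasingly doing their own reclaim, which is the sign that kswapd is falling behind.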