Memory Management Internals
Memory management is one of the most complex subsystems in the Linux kernel. Understanding it deeply is crucial for infrastructure engineers debugging OOM issues, optimizing container resource limits, and understanding application performance.
Interview Frequency: Very High (especially at observability/infrastructure companies)
Key Topics: Buddy allocator, slab, zones, OOM killer, cgroups
Time to Master: 16-18 hours
Physical Memory Organization
Memory Zones
Linux organizes physical memory into zones based on addressing constraints:
- ZONE_DMA: Lowest 16 MB, for legacy devices that can only address 24 bits
- ZONE_DMA32: Below 4 GB, for devices limited to 32-bit DMA addressing
- ZONE_NORMAL: Memory the kernel maps directly for general allocations
- ZONE_HIGHMEM: Memory beyond the kernel’s direct map (32-bit systems only)
- ZONE_MOVABLE: Pages that can be migrated, supporting compaction and memory hotplug

Viewing Zone Information
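A quick way to inspect zones is /proc/zoneinfo; below is a minimal C sketch that pulls out each zone’s free page count and the min/low/high watermarks (the filtering is a convenience; `cat /proc/zoneinfo` shows everything):

```c
/* zones.c - summarize /proc/zoneinfo: per-zone free pages and the
 * min/low/high watermarks that trigger kswapd and direct reclaim. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/zoneinfo", "r");
    if (!f) {
        perror("/proc/zoneinfo");
        return 1;
    }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        /* Keep zone headers ("Node 0, zone   Normal") and watermark lines. */
        if (strstr(line, ", zone") || strstr(line, "pages free") ||
            strstr(line, " min ") || strstr(line, " low ") ||
            strstr(line, " high "))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
```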
Buddy Allocator
Why Power-of-2 Allocation?
The buddy allocator is the kernel’s physical page allocator. But why does it use powers of 2?

The problem: If you allow arbitrary-sized allocations, memory becomes fragmented. You might have 100 free pages scattered around, but can’t allocate a contiguous 64-page block.

The solution: Only allow power-of-2 sized blocks (1, 2, 4, 8, 16… pages). This enables:
- Fast splitting: A 16-page block splits perfectly into two 8-page blocks
- Fast coalescing: Two adjacent 8-page blocks merge back into one 16-page block
- Simple math: Finding a block’s “buddy” is just XOR with the block size
Two blocks are “buddies” when:
- They’re the same size
- They’re adjacent in memory
- Together they form the next larger power-of-2 block
How Buddy Allocation Works
Buddy Allocator Code
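The real implementation lives in mm/page_alloc.c; here is a userspace sketch of just the split/merge arithmetic described above (the free-list bookkeeping is omitted):

```c
/* buddy_math.c - the arithmetic at the heart of the buddy allocator.
 * A block of 2^order pages starting at page-frame number `pfn` has its
 * buddy at pfn ^ (1 << order): same size, adjacent, and together they
 * form the next larger power-of-2 block. */
#include <stdio.h>

static unsigned long buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    /* Splitting: an order-3 block (8 pages) at pfn 0 splits into two
     * order-2 buddies covering pages 0-3 and 4-7. */
    unsigned int order = 3;
    unsigned long left = 0;
    unsigned long right = buddy_of(left, order - 1);
    printf("order-%u block at pfn %lu splits into pfn %lu and pfn %lu\n",
           order, left, left, right);

    /* Coalescing: when the order-2 block at pfn 4 is freed and its buddy
     * (4 ^ 4 == 0) is also free, they merge back into one order-3 block. */
    printf("buddy of pfn 4 at order 2 is pfn %lu\n", buddy_of(4, 2));
    return 0;
}
```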
Fragmentation Problem
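Fragmentation is visible in /proc/buddyinfo, which lists free blocks per order for each zone: when the high-order columns hit zero, plenty of memory can be free yet a large contiguous allocation will still fail. A sketch that flags that condition (the column layout is assumed stable):

```c
/* fraginfo.c - parse /proc/buddyinfo and flag zones that have free
 * memory but no high-order (>= order-4, i.e. 64 KB) blocks left:
 * the classic external-fragmentation signature. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ORDER 11    /* orders 0..10 on a typical x86-64 build */

int main(void)
{
    FILE *f = fopen("/proc/buddyinfo", "r");
    if (!f) { perror("/proc/buddyinfo"); return 1; }

    char line[512];
    while (fgets(line, sizeof(line), f)) {
        int node;
        char zone[32];
        if (sscanf(line, "Node %d, zone %31s", &node, zone) != 2)
            continue;

        char *p = strstr(line, zone) + strlen(zone);
        long order0 = 0;
        int high_order_free = 0;
        for (int order = 0; order < MAX_ORDER; order++) {
            char *end;
            long n = strtol(p, &end, 10);
            if (end == p)
                break;              /* no more columns */
            if (order == 0)
                order0 = n;
            if (order >= 4 && n > 0)
                high_order_free = 1;
            p = end;
        }

        printf("node %d zone %-8s order-0 free: %-6ld high-order: %s\n",
               node, zone, order0,
               high_order_free ? "available" : "NONE (fragmented)");
    }
    fclose(f);
    return 0;
}
```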
Slab Allocator (SLUB)
The Problem with Buddy Allocator for Small Objects
The buddy allocator works great for page-sized allocations (4 KB+), but what about small objects like task_struct (2.6 KB) or dentry (192 bytes)?
Problems:
- Waste: Allocating a full 4KB page for a 192-byte object wastes 95% of memory
- Performance: Buddy allocator has global locks - contention on multi-core systems
- Cache efficiency: Small objects from different pages = poor cache locality
The slab allocator’s solution:
- Pre-allocate pages and divide them into same-sized objects
- Per-CPU caches for a lockless fast path
- Object caches for common kernel structures (task_struct, inode, dentry, etc.)
SLUB Architecture
SLUB Fast Path
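In kernel code the interface is kmem_cache_create()/kmem_cache_alloc(); on the fast path SLUB hands out objects from a per-CPU slab without taking a global lock. A hedged module sketch (the struct and module scaffolding are illustrative; the slab calls are the real API):

```c
/* slab_demo.c - illustrative kernel-module sketch of the slab API.
 * kmem_cache_create()/kmem_cache_alloc() are the real interfaces;
 * struct my_object and the module scaffolding are hypothetical. */
#include <linux/module.h>
#include <linux/slab.h>

struct my_object {              /* hypothetical 192-byte object */
    int id;
    char payload[188];
};

static struct kmem_cache *my_cache;
static struct my_object *obj;

static int __init slab_demo_init(void)
{
    /* One cache, many same-sized objects; SLUB serves allocations from
     * a per-CPU slab without a global lock on the fast path. */
    my_cache = kmem_cache_create("my_object_cache",
                                 sizeof(struct my_object),
                                 0, SLAB_HWCACHE_ALIGN, NULL);
    if (!my_cache)
        return -ENOMEM;

    obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
    if (!obj) {
        kmem_cache_destroy(my_cache);
        return -ENOMEM;
    }
    obj->id = 42;
    pr_info("slab_demo: allocated object %d\n", obj->id);
    return 0;
}

static void __exit slab_demo_exit(void)
{
    kmem_cache_free(my_cache, obj);
    kmem_cache_destroy(my_cache);
}

module_init(slab_demo_init);
module_exit(slab_demo_exit);
MODULE_LICENSE("GPL");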
Viewing Slab Information
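From the command line, slabtop and /proc/slabinfo expose the cache statistics; a rough C stand-in that prints each cache’s name, live object count, and object size (reading /proc/slabinfo typically requires root):

```c
/* slabtop_lite.c - print name, active objects, and object size for each
 * cache in /proc/slabinfo (root required). A rough stand-in for slabtop. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/slabinfo", "r");
    if (!f) { perror("/proc/slabinfo (try running as root)"); return 1; }

    char line[1024];
    /* Skip the version line and the column-header line. */
    fgets(line, sizeof(line), f);
    fgets(line, sizeof(line), f);

    printf("%-24s %12s %10s\n", "cache", "active_objs", "objsize");
    while (fgets(line, sizeof(line), f)) {
        char name[64];
        unsigned long active, total, objsize;
        if (sscanf(line, "%63s %lu %lu %lu",
                   name, &active, &total, &objsize) == 4)
            printf("%-24s %12lu %10lu\n", name, active, objsize);
    }
    fclose(f);
    return 0;
}
```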
Page Tables and Virtual Memory
4-Level Page Tables (x86-64)
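With 4 KB pages and 48-bit virtual addresses, an address splits into four 9-bit table indexes plus a 12-bit page offset; a sketch of the decomposition the hardware walker performs:

```c
/* va_decode.c - split an x86-64 virtual address into its 4-level
 * page-table indexes (PGD, PUD, PMD, PTE: 9 bits each) and the
 * 12-bit offset within the 4 KB page. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0x00007f1234567abcULL;    /* arbitrary user address */

    unsigned pgd = (va >> 39) & 0x1ff;      /* bits 47..39 */
    unsigned pud = (va >> 30) & 0x1ff;      /* bits 38..30 */
    unsigned pmd = (va >> 21) & 0x1ff;      /* bits 29..21 */
    unsigned pte = (va >> 12) & 0x1ff;      /* bits 20..12 */
    unsigned off = (unsigned)(va & 0xfff);  /* bits 11..0  */

    printf("va     = 0x%016llx\n", (unsigned long long)va);
    printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%03x\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```

Four dependent memory reads per translation is exactly why the TLB exists: it caches recent virtual-to-physical translations so most accesses skip the walk entirely.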
TLB (Translation Lookaside Buffer)
Page Fault Handling
When Do Page Faults Happen?
Page faults aren’t errors - they’re a normal part of how Linux manages memory efficiently. Here are real-world scenarios:

Minor fault examples:
- You run ./myprogram - the first time code executes, it’s not in memory yet (demand paging)
- After fork(), the child writes to memory - triggers copy-on-write
- You mmap() a file but haven’t read it yet - first access brings it in

Major fault examples:
- Your laptop suspended with apps open - resuming reads pages back from swap
- You open a 1 GB video file with mmap() - reading it triggers disk I/O
- The system under memory pressure swapped out your idle browser - switching back triggers swap-in

Invalid access examples (these raise SIGSEGV):
- char *p = NULL; *p = 'x'; - accessing a NULL pointer
- Buffer overflow writing past array bounds
- Accessing freed memory (use-after-free)
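To watch the minor-fault path yourself, here is a sketch using getrusage(): mapping anonymous memory costs nothing until each page’s first touch:

```c
/* faults.c - demonstrate demand paging: each first touch of an anonymous
 * page is a minor fault. Counters come from getrusage(). */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

static void report(const char *when)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("%-12s minor=%ld major=%ld\n", when, ru.ru_minflt, ru.ru_majflt);
}

int main(void)
{
    const size_t pages = 10000, psz = 4096;

    report("before");
    char *buf = mmap(NULL, pages * psz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    report("after mmap");           /* mapping alone faults nothing */

    for (size_t i = 0; i < pages; i++)
        buf[i * psz] = 1;           /* first touch -> one minor fault each */
    report("after touch");          /* expect ~10000 more minor faults */

    munmap(buf, pages * psz);
    return 0;
}
```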
Monitoring Page Faults
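For a live process, minflt and majflt are fields 10 and 12 of /proc/&lt;pid&gt;/stat (also visible with `ps -o min_flt,maj_flt`); a small monitor sketch:

```c
/* pgfaultmon.c - print minor/major fault counts for a PID from
 * /proc/<pid>/stat (fields 10 and 12). Usage: ./pgfaultmon <pid> */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%s/stat", argv[1]);
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }
    if (!fgets(buf, sizeof(buf), f)) { fclose(f); return 1; }
    fclose(f);

    /* comm (field 2) may contain spaces; skip past its closing paren. */
    char *p = strrchr(buf, ')');
    if (!p) return 1;
    p += 2;                          /* now at field 3 (state) */

    char state;
    int ppid, pgrp, session, tty, tpgid;
    unsigned long flags, minflt = 0, cminflt = 0, majflt = 0, cmajflt = 0;
    sscanf(p, "%c %d %d %d %d %d %lu %lu %lu %lu %lu",
           &state, &ppid, &pgrp, &session, &tty, &tpgid,
           &flags, &minflt, &cminflt, &majflt, &cmajflt);

    printf("pid %s: minflt=%lu majflt=%lu\n", argv[1], minflt, majflt);
    return 0;
}
```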
Memory Reclaim
kswapd and Direct Reclaim
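kswapd wakes when a zone’s free pages fall below the low watermark and reclaims in the background; if allocations dip below min anyway, the allocating task stalls and performs direct reclaim itself. /proc/vmstat exposes both paths; a sketch comparing them:

```c
/* reclaimstat.c - compare background vs direct reclaim using the
 * pgscan_kswapd / pgscan_direct counters in /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("/proc/vmstat"); return 1; }

    char key[64];
    unsigned long long val, kswapd = 0, direct = 0;
    while (fscanf(f, "%63s %llu", key, &val) == 2) {
        if (!strcmp(key, "pgscan_kswapd")) kswapd = val;
        if (!strcmp(key, "pgscan_direct")) direct = val;
    }
    fclose(f);

    /* High pgscan_direct relative to pgscan_kswapd means allocations are
     * stalling to reclaim memory themselves - a latency red flag. */
    printf("pgscan_kswapd = %llu\npgscan_direct = %llu\n", kswapd, direct);
    return 0;
}
```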
Swappiness
The vm.swappiness parameter controls how aggressively the kernel swaps out anonymous memory relative to reclaiming page cache.
- Range: 0 to 100 (default 60)
- High value (100): Aggressively swap out anonymous memory to keep page cache (file data) in memory. Good for I/O heavy workloads.
- Low value (0): Avoid swapping anonymous memory unless absolutely necessary. Good for latency-sensitive applications (databases).
- Misconception: Setting to 0 does NOT disable swap. It just tells the kernel to prefer reclaiming file pages first.
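The knob lives at /proc/sys/vm/swappiness; a minimal reader (changing it needs root, e.g. `sysctl -w vm.swappiness=10`):

```c
/* swappiness.c - read the current vm.swappiness value; equivalent to
 * `sysctl vm.swappiness` or `cat /proc/sys/vm/swappiness`. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    if (!f) { perror("/proc/sys/vm/swappiness"); return 1; }
    int val;
    if (fscanf(f, "%d", &val) == 1)
        printf("vm.swappiness = %d\n", val);
    fclose(f);
    /* Writing requires root: echo 10 > /proc/sys/vm/swappiness,
     * or persist with vm.swappiness=10 in /etc/sysctl.conf. */
    return 0;
}
```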
OOM Killer
Design Philosophy
When the system runs completely out of memory, Linux faces an impossible choice: which process to kill?

The goal: Kill the “least important” process that frees the most memory.

The challenge: How do you define “least important”? The OOM killer uses a heuristic:
- Memory usage is the primary factor (kill big memory hogs)
- oom_score_adj lets you override (protect critical services)
- Root processes get a slight bonus (system services are important)
- Recently started processes are slightly preferred (haven’t done much work yet)
OOM Score Calculation
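A hedged userspace sketch of the heuristic’s shape (modeled on the kernel’s oom_badness(), with illustrative numbers, not the exact kernel code):

```c
/* oom_score.c - simplified sketch of the OOM badness heuristic, modeled
 * on the kernel's oom_badness(): points grow with memory footprint and
 * are shifted by oom_score_adj. Values and scaling are illustrative. */
#include <stdio.h>

static long badness(long rss_pages, long swap_pages, long pgtable_pages,
                    int oom_score_adj, long totalpages)
{
    if (oom_score_adj == -1000)
        return 0;                    /* -1000 makes a task unkillable */

    long points = rss_pages + swap_pages + pgtable_pages;

    /* oom_score_adj is in [-1000, 1000]; each unit is worth 0.1% of
     * total memory, so -500 forgives half the machine's RAM. */
    points += (long)oom_score_adj * totalpages / 1000;

    return points > 0 ? points : 1;
}

int main(void)
{
    long totalpages = 1048576;       /* 4 GB of 4 KB pages */

    /* A protected database vs an unprotected leaky worker: */
    printf("database (2 GB RSS, adj=-500): %ld\n",
           badness(524288, 0, 2048, -500, totalpages));
    printf("worker   (3 GB RSS, adj=0):    %ld\n",
           badness(786432, 0, 3072, 0, totalpages));
    return 0;
}
```

The worker scores far higher, so it is the one killed, even though the database uses comparable memory.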
Memory Cgroups
Critical for container resource management.

Cgroup v2 Memory Controller
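Everything in cgroup v2 is a file operation; a sketch that creates a group and sets its limits (it assumes the unified hierarchy mounted at /sys/fs/cgroup, the “demo” group name is made up, and root is required):

```c
/* cgroup_demo.c - create a cgroup v2 group and set memory limits.
 * Shell equivalent: mkdir /sys/fs/cgroup/demo &&
 *                   echo 536870912 > /sys/fs/cgroup/demo/memory.max */
#include <stdio.h>
#include <sys/stat.h>
#include <errno.h>

static int write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fputs(val, f);
    return fclose(f);
}

int main(void)
{
    if (mkdir("/sys/fs/cgroup/demo", 0755) && errno != EEXIST) {
        perror("mkdir /sys/fs/cgroup/demo");
        return 1;
    }

    /* Hard limit: allocations beyond this trigger a cgroup-local OOM. */
    write_file("/sys/fs/cgroup/demo/memory.max", "536870912\n");  /* 512 MB */
    /* Soft limit: above this the kernel throttles and reclaims harder. */
    write_file("/sys/fs/cgroup/demo/memory.high", "402653184\n"); /* 384 MB */

    /* Enroll a process by writing its PID to cgroup.procs, e.g.:
     * write_file("/sys/fs/cgroup/demo/cgroup.procs", "1234\n"); */
    puts("cgroup 'demo' configured: memory.max=512M memory.high=384M");
    return 0;
}
```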
Memory Cgroup OOM
Huge Pages
Why Huge Pages Matter
Configuring Huge Pages
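Why they matter: one TLB entry then covers 2 MB instead of 4 KB, and the table walk is one level shorter, so large working sets take far fewer TLB misses. A sketch that maps one explicit huge page (reserve pages first, e.g. `echo 64 > /proc/sys/vm/nr_hugepages`, as root):

```c
/* hugepage_demo.c - map one 2 MB huge page with MAP_HUGETLB.
 * Reserve pages first: echo 64 > /proc/sys/vm/nr_hugepages
 * One TLB entry then covers 2 MB instead of 512 separate 4 KB entries. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_2MB (2UL * 1024 * 1024)

int main(void)
{
    char *p = mmap(NULL, HUGE_2MB, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB) - are huge pages reserved?");
        return 1;
    }

    p[0] = 'x';                 /* touch it: one fault maps all 2 MB */
    printf("mapped a 2 MB huge page at %p\n", (void *)p);

    munmap(p, HUGE_2MB);
    return 0;
}
```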
Lab Exercises
Lab 1: Memory Allocation Analysis
Objective: Understand memory allocation patterns
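A possible probe to start from: allocate, then touch, and watch VmRSS; it shows malloc reserving address space while physical pages arrive only on first touch:

```c
/* lab1_alloc.c - watch RSS: malloc alone doesn't consume physical
 * memory; touching pages does. Compare VmRSS after each step. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long vmrss_kb(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && fgets(line, sizeof(line), f))
        if (sscanf(line, "VmRSS: %ld", &kb) == 1)
            break;
    if (f) fclose(f);
    return kb;
}

int main(void)
{
    printf("start:        VmRSS = %ld kB\n", vmrss_kb());

    size_t sz = 256UL * 1024 * 1024;        /* 256 MB */
    char *p = malloc(sz);
    printf("after malloc: VmRSS = %ld kB\n", vmrss_kb());  /* ~unchanged */

    memset(p, 1, sz);                        /* fault every page in */
    printf("after touch:  VmRSS = %ld kB\n", vmrss_kb());  /* +256 MB */

    free(p);
    return 0;
}
```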
Lab 2: OOM Killer Experimentation
Objective: Understand OOM killer behavior
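A classic driver for this lab (run it only inside a VM or a memory-limited cgroup): allocate and touch memory until the OOM killer steps in, then inspect the kernel log with `dmesg | grep -i oom`:

```c
/* lab2_hog.c - allocate and touch memory until the OOM killer
 * intervenes. Run in a throwaway VM or a memory-limited cgroup. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t chunk = 64UL * 1024 * 1024;      /* 64 MB per step */
    size_t total = 0;

    for (;;) {
        char *p = malloc(chunk);
        if (!p) {                            /* overcommit may never fail */
            puts("malloc failed before OOM");
            break;
        }
        memset(p, 1, chunk);                 /* touching forces real pages */
        total += chunk;
        printf("allocated %zu MB\n", total >> 20);
    }
    return 0;
}
```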
Lab 3: Page Fault Analysis
Objective: Profile page faults
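One way to generate major faults on demand: mmap a large file, drop the page cache (as root: `echo 3 > /proc/sys/vm/drop_caches`), then read one byte per page; getrusage() shows the resulting major-fault count:

```c
/* lab3_majfault.c - generate major faults: mmap a file and read it
 * after dropping the page cache. Usage: ./lab3_majfault <big-file> */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[1]); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i += 4096)
        sum += p[i];                         /* cold page -> major fault */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor=%ld major=%ld (sum=%ld)\n",
           ru.ru_minflt, ru.ru_majflt, sum);
    return 0;
}
```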
Interview Questions
Q1: Explain the difference between minor and major page faults
Answer:

Minor fault:
- Page is in memory but not mapped in page table
- No disk I/O required
- Examples: First touch of allocated memory, COW after fork
- Cost: ~1-10 microseconds
Major fault:
- Page not in memory, must read from disk
- Examples: Swap-in, reading memory-mapped file
- Cost: ~1-10 milliseconds (1000x slower!)
In practice:
- Minor faults: Normal, usually not a concern
- Major faults: Serious performance problem, indicates memory pressure or working set too large
Q2: How does the kernel decide which process to kill during OOM?
Answer:

OOM score factors:
- Memory usage (primary factor)
- oom_score_adj (-1000 to +1000)
- Root processes get slight preference
Selection process:
- Calculate a score for all processes
- Select the process with the highest score
- Send SIGKILL to that process
- Wait for memory to be freed

Protection strategies:
- Set oom_score_adj = -1000 for critical services
- Use memory cgroups to limit container memory
- Enable vm.panic_on_oom for critical systems
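The protection knob is per-process; a sketch that writes oom_score_adj and reads back the kernel’s resulting oom_score (equivalent to `echo -1000 > /proc/<pid>/oom_score_adj`):

```c
/* oom_protect.c - set oom_score_adj for a PID and read back oom_score.
 * Usage: ./oom_protect <pid> <adj>   (adj in [-1000, 1000]; lowering it
 * for processes you don't own requires root) */
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <adj>\n", argv[0]);
        return 1;
    }

    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", argv[1]);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return 1; }
    fprintf(f, "%s\n", argv[2]);
    fclose(f);

    /* oom_score is the kernel's current badness estimate for the task. */
    snprintf(path, sizeof(path), "/proc/%s/oom_score", argv[1]);
    f = fopen(path, "r");
    if (!f) { perror(path); return 1; }
    int score;
    if (fscanf(f, "%d", &score) == 1)
        printf("pid %s: oom_score is now %d\n", argv[1], score);
    fclose(f);
    return 0;
}
```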
Q3: What are the trade-offs between kmalloc and vmalloc?
Answer:
| Aspect | kmalloc | vmalloc |
|---|---|---|
| Physical memory | Contiguous | Non-contiguous |
| Max size | ~4 MB | Virtual space limit |
| Performance | Faster | Slower (TLB overhead) |
| Use in DMA | Yes | No (not physically contiguous) |
| Interrupt context | Yes (GFP_ATOMIC) | No (may sleep) |
When to use kmalloc:
- Small allocations (<4 MB)
- DMA buffers
- Performance-critical paths
- Interrupt context
When to use vmalloc:
- Large allocations
- Module loading
- Memory that doesn’t need DMA
- Non-critical paths
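A hedged kernel-module sketch showing both allocators side by side (the module scaffolding is illustrative; kmalloc/vmalloc/kfree/vfree are the real calls):

```c
/* alloc_demo.c - kmalloc vs vmalloc in a kernel module (sketch).
 * kmalloc: physically contiguous, fast, DMA-safe, limited size.
 * vmalloc: virtually contiguous only, may sleep, suits large buffers. */
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

static void *small_buf;     /* DMA-friendly scratch space */
static void *big_buf;       /* large table nobody DMAs into */

static int __init alloc_demo_init(void)
{
    small_buf = kmalloc(4096, GFP_KERNEL);       /* contiguous pages */
    big_buf   = vmalloc(16UL * 1024 * 1024);     /* 16 MB, page by page */
    if (!small_buf || !big_buf) {
        kfree(small_buf);    /* kfree(NULL)/vfree(NULL) are no-ops */
        vfree(big_buf);
        return -ENOMEM;
    }
    pr_info("alloc_demo: kmalloc=%p vmalloc=%p\n", small_buf, big_buf);
    return 0;
}

static void __exit alloc_demo_exit(void)
{
    kfree(small_buf);
    vfree(big_buf);
}

module_init(alloc_demo_init);
module_exit(alloc_demo_exit);
MODULE_LICENSE("GPL");
```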
Q4: Explain how cgroups memory limits work with containers
Answer:

Memory cgroup controls:
- memory.max: Hard limit (OOM if exceeded)
- memory.high: Soft limit (throttling)
- memory.low: Best-effort protection
- memory.min: Hard protection

How containers use them:
- Container requests resources (Kubernetes requests)
- The scheduler places the container based on available memory
- Cgroup limits are enforced at runtime
- Exceeding memory.max → container OOM (not host)

OOM behavior inside a cgroup:
- Default: Kill one process in the cgroup
- memory.oom.group = 1: Kill the entire cgroup
- Docker default: The restart policy determines behavior

Best practices:
- Set memory.max to prevent runaway containers
- Set memory.high slightly below max for graceful throttling
- Monitor container memory usage with cAdvisor/Prometheus
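For monitoring, each cgroup’s memory.events file counts limit hits and kills; a sketch that checks the oom_kill counter (the “demo” path is the hypothetical group from earlier):

```c
/* oom_events.c - read a cgroup's memory.events and report OOM kills.
 * The /sys/fs/cgroup/demo path assumes the 'demo' group created above. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/sys/fs/cgroup/demo/memory.events", "r");
    if (!f) { perror("memory.events"); return 1; }

    char key[32];
    unsigned long long val;
    while (fscanf(f, "%31s %llu", key, &val) == 2)
        if (!strcmp(key, "oom_kill"))
            printf("cgroup has OOM-killed %llu task(s)\n", val);
    fclose(f);
    return 0;
}
```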
Key Takeaways
- Buddy Allocator: Manages physical pages efficiently with power-of-2 coalescing
- Slab Allocator: Optimizes small object allocation with caching and per-CPU pools
- OOM Killer: Protects the system by killing processes based on memory score
- Memory Cgroups: Enable container memory limits with per-cgroup OOM handling
Next: Virtual Memory & Address Translation →