Linux Memory Management - Virtual memory, buddy allocator, and page tables

Memory Management Internals

Memory management is one of the most complex subsystems in the Linux kernel. Understanding it deeply is crucial for infrastructure engineers debugging OOM issues, optimizing container resource limits, and understanding application performance.

Interview Frequency: Very High (especially at observability/infrastructure companies)
Key Topics: Buddy allocator, slab, zones, OOM killer, cgroups
Time to Master: 16-18 hours

Physical Memory Organization

Memory Zones

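Linux organizes physical memory into zones based on addressing constraints. On x86-64 these are typically ZONE_DMA (the first 16 MB, for legacy devices), ZONE_DMA32 (below 4 GB, for devices limited to 32-bit addresses), and ZONE_NORMAL (everything above that). 32-bit kernels add ZONE_HIGHMEM for memory the kernel cannot keep permanently mapped.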

Viewing Zone Information

# View zone details
cat /proc/zoneinfo

# Summary
cat /proc/buddyinfo

# Example output:
# Node 0, zone   Normal 1234  876  432  210  105   52   26   13    6    3    1
#                        2^0  2^1 2^2  2^3  2^4  2^5  2^6  2^7  2^8  2^9 2^10
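
To make the columns concrete, here is a minimal C sketch that turns one row of /proc/buddyinfo into a total free-page count. The function name buddyinfo_free_pages is invented for this illustration, and it assumes the caller has already parsed the per-order counts into an array.

```c
#include <stddef.h>

/* Hypothetical helper: counts[i] is the number of free blocks of order i,
 * in the same left-to-right order as one row of /proc/buddyinfo. */
unsigned long buddyinfo_free_pages(const unsigned long *counts, size_t n_orders)
{
    unsigned long pages = 0;

    for (size_t order = 0; order < n_orders; order++)
        pages += counts[order] << order;   /* each block holds 2^order pages */

    return pages;
}
```

For the example row above this gives 17,162 free pages, which is about 67 MB with 4 KB pages.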

Buddy Allocator

Why Power-of-2 Allocation?

The buddy allocator is the kernel’s physical page allocator. But why does it use powers of 2?

The problem: If you allow arbitrary-sized allocations, memory becomes fragmented. You might have 100 free pages scattered around but be unable to allocate a contiguous 64-page block.

The solution: Only allow power-of-2 sized blocks (1, 2, 4, 8, 16… pages). This enables:

Fast splitting: A 16-page block splits perfectly into two 8-page blocks
Fast coalescing: Two adjacent 8-page blocks merge back into one 16-page block
Simple math: A block’s “buddy” is found by XORing the block’s starting page frame number with the block size in pages (1 << order)

What is a “buddy”? Two blocks are buddies if:

They’re the same size
They’re adjacent in memory
Together they form the next larger power-of-2 block

Example: Pages 0-7 and pages 8-15 are buddies: both are 8-page (order-3) blocks, and together they form the order-4 block covering pages 0-15.
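
The buddy relationship can be sketched in a few lines of C. The function names below are made up for illustration; the arithmetic is the XOR trick described above, applied to page frame numbers (PFNs) that are assumed to be aligned to the block size.

```c
/* PFN of the buddy of the 2^order-page block that starts at pfn. */
unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);    /* flip the bit that separates the pair */
}

/* PFN of the merged 2^(order+1)-page block: the lower of the two buddies. */
unsigned long merged_pfn(unsigned long pfn, unsigned int order)
{
    return pfn & ~(1UL << order);   /* clear that same bit */
}
```

With order 3 (8 pages), buddy_pfn(0, 3) is 8 and buddy_pfn(8, 3) is 0, matching the example: either block finds the other with one XOR, and merged_pfn(8, 3) is 0, the start of the combined 16-page block.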

How Buddy Allocation Works
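
A toy model of the splitting step, tracking only how many free blocks exist at each order. This is a sketch rather than kernel code: toy_alloc and TOY_MAX_ORDER are made-up names, and the real free lists hold page structures, not counters.

```c
#define TOY_MAX_ORDER 10   /* assumption: orders 0..10 */

/* free_count[o] is the number of free blocks of order o.
 * Returns the order of the block that was taken (before splitting),
 * or -1 if no free block is large enough. */
int toy_alloc(unsigned long free_count[TOY_MAX_ORDER + 1], unsigned int order)
{
    unsigned int o;
    int found;

    if (order > TOY_MAX_ORDER)
        return -1;

    /* Find the smallest free block that is large enough. */
    for (o = order; o <= TOY_MAX_ORDER; o++)
        if (free_count[o] > 0)
            break;
    if (o > TOY_MAX_ORDER)
        return -1;

    found = (int)o;
    free_count[o]--;

    /* Split until the block has the requested size. Each split puts one
     * half back on the free list one order down and keeps the other half. */
    while (o > order) {
        o--;
        free_count[o]++;
    }
    return found;
}
```

Starting with a single free order-4 block, a request for order 1 takes that block and leaves one free block each at orders 3, 2 and 1, which is the chain of buddies that can later merge back.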

Buddy Allocator Code

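// mm/page_alloc.c (heavily simplified; not compilable as-is, and names
// and signatures differ between kernel versions)

// Allocate 2^order physically contiguous pages
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid)
{
    struct alloc_context ac = { };               // zonelist, nodemask, migratetype
    unsigned int alloc_flags = ALLOC_WMARK_LOW;  // fast path honors the low watermark
    struct page *page;

    // Fast path: take a block from a zone's free lists
    page = get_page_from_freelist(gfp, order, alloc_flags, &ac);
    if (page)
        return page;

    // Slow path: wake kswapd, direct reclaim, compaction, and finally OOM
    return __alloc_pages_slowpath(gfp, order, &ac);
}

// Free 2^order pages back to the buddy allocator
void __free_pages(struct page *page, unsigned int order)
{
    struct zone *zone = page_zone(page);

    // Keep merging while the buddy at the current order is also free
    while (order < MAX_ORDER - 1) {
        struct page *buddy = find_buddy(page, order);  // PFN ^ (1 << order)
        if (!page_is_buddy(buddy, order))
            break;  // buddy is in use or has a different order: stop

        // Coalesce: unlink the buddy from its free list and move up one order
        list_del(&buddy->lru);
        order++;
        page = min(page, buddy);  // merged block starts at the lower address
    }

    // Put the (possibly merged) block on the free list for its order
    list_add(&page->lru, &zone->free_area[order].free_list);
}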

Fragmentation Problem

![Memory Fragmentation and Compaction](/images/courses/linux-memory-fragmentation.svg)

```bash
# Check fragmentation
cat /proc/buddyinfo
# Low counts in the high-order (rightmost) columns = fragmentation

# Trigger compaction of all zones (requires root)
echo 1 | sudo tee /proc/sys/vm/compact_memory

# Check compaction stats
grep compact /proc/vmstat
```

Slab Allocator (SLUB)

The Problem with Buddy Allocator for Small Objects

The buddy allocator works great for page-sized allocations (4KB+), but what about small objects like task_struct (several KB, depending on kernel version and configuration) or dentry (192 bytes)? Problems:

Waste: Allocating a full 4KB page for a 192-byte object wastes 95% of memory
Performance: Buddy allocations that miss the per-CPU page lists take the per-zone lock - contention on multi-core systems
Cache efficiency: Small objects from different pages = poor cache locality

The solution: SLUB (Slab Allocator Unqueued)

Pre-allocate pages and divide them into same-sized objects
Per-CPU caches for lockless fast path
Object caches for common kernel structures (task_struct, inode, dentry, etc.)

SLUB serves these small-object allocations, which are typically smaller than a page.
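
The space saving can be checked with simple arithmetic. The sketch below uses invented helper names and deliberately ignores the alignment padding and per-object metadata that a real kmem_cache may add, so treat the numbers as an upper bound on packing density.

```c
#include <stddef.h>

/* How many objects of obj_size bytes fit in a slab of slab_bytes bytes. */
size_t objs_per_slab(size_t slab_bytes, size_t obj_size)
{
    return obj_size ? slab_bytes / obj_size : 0;
}

/* Bytes left unused at the end of the slab. */
size_t slab_leftover(size_t slab_bytes, size_t obj_size)
{
    return slab_bytes - objs_per_slab(slab_bytes, obj_size) * obj_size;
}
```

For a 192-byte object in a single 4096-byte page, objs_per_slab returns 21 and slab_leftover returns 64, so under 2% of the page is lost instead of 95%.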

SLUB Architecture

SLUB Fast Path

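// mm/slub.c (simplified sketch; function names vary by kernel version.
// The real fast path updates the freelist with a lockless cmpxchg, and
// ___slab_alloc() is the slow path it falls back to.)
static void *slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node)
{
    void *object;
    struct kmem_cache_cpu *c;

    // Per-CPU state for this cache: no lock needed on the fast path
    c = this_cpu_ptr(s->cpu_slab);

    // Fast path: pop the first object off the per-CPU freelist
    object = c->freelist;
    if (likely(object)) {
        // The pointer to the next free object is stored inside the object itself
        c->freelist = get_freepointer(s, object);
        return object;
    }

    // Slow path: refill from a partial slab or allocate a new slab
    return __slab_alloc(s, gfpflags, node);
}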
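
The freelist trick in the fast path, where each free object stores the pointer to the next free object, can be tried in user space. The following is a minimal single-threaded sketch with made-up names (toy_cache and friends); it has none of SLUB's per-CPU handling, debugging features, or pointer hardening.

```c
#include <stddef.h>

/* Toy object cache: free objects are chained through their own first bytes,
 * so the freelist needs no separate bookkeeping memory. */
struct toy_cache {
    void *freelist;    /* first free object, or NULL when exhausted */
    size_t obj_size;   /* must be a non-zero multiple of sizeof(void *) */
};

/* Carve buf (pointer-aligned, buf_size bytes) into objects and chain them. */
void toy_cache_init(struct toy_cache *c, void *buf, size_t buf_size,
                    size_t obj_size)
{
    char *base = buf;
    size_t n = buf_size / obj_size;

    c->obj_size = obj_size;
    c->freelist = NULL;

    /* Push in reverse so that the first allocation returns buf itself. */
    for (size_t i = n; i-- > 0; ) {
        void **obj = (void **)(base + i * obj_size);
        *obj = c->freelist;
        c->freelist = obj;
    }
}

/* Pop the head of the freelist; returns NULL when no object is free. */
void *toy_cache_alloc(struct toy_cache *c)
{
    void *obj = c->freelist;

    if (obj)
        c->freelist = *(void **)obj;   /* plays the role of get_freepointer() */
    return obj;
}

/* Push obj back; it becomes the next object handed out. */
void toy_cache_free(struct toy_cache *c, void *obj)
{
    *(void **)obj = c->freelist;
    c->freelist = obj;
}
```

Allocation and free are each a couple of pointer operations, which is why the per-CPU fast path is so cheap. Freed objects are reused in LIFO order, so a recently freed (and likely cache-hot) object is handed out first.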

Viewing Slab Information

# View all slab caches
cat /proc/slabinfo

# Interactive slab viewer
sudo slabtop

# Memory usage by cache
sudo slabtop -o | head -20

# Detailed stats
cat /sys/kernel/slab/task_struct/object_size
cat /sys/kernel/slab/task_struct/objs_per_slab
cat /sys/kernel/slab/task_struct/total_objects

Page Tables and Virtual Memory

4-Level Page Tables (x86-64)
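
With 4-level paging and 4 KB pages, a 48-bit virtual address splits into four 9-bit table indices and a 12-bit page offset. The sketch below shows that split; the struct and function names are invented for illustration, while the level names (PGD, PUD, PMD, PTE) follow Linux terminology.

```c
#include <stdint.h>

#define PAGE_SHIFT 12                          /* 4 KB pages */
#define LEVEL_BITS 9                           /* 512 entries per table */
#define LEVEL_MASK ((1ULL << LEVEL_BITS) - 1)

struct pt_indices {
    unsigned pgd;      /* bits 47-39: index into the top-level table */
    unsigned pud;      /* bits 38-30 */
    unsigned pmd;      /* bits 29-21 */
    unsigned pte;      /* bits 20-12: index into the last-level table */
    unsigned offset;   /* bits 11-0: byte offset within the page */
};

struct pt_indices split_vaddr(uint64_t vaddr)
{
    struct pt_indices ix;

    ix.offset = (unsigned)(vaddr & ((1ULL << PAGE_SHIFT) - 1));
    ix.pte = (unsigned)((vaddr >> PAGE_SHIFT) & LEVEL_MASK);
    ix.pmd = (unsigned)((vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK);
    ix.pud = (unsigned)((vaddr >> (PAGE_SHIFT + 2 * LEVEL_BITS)) & LEVEL_MASK);
    ix.pgd = (unsigned)((vaddr >> (PAGE_SHIFT + 3 * LEVEL_BITS)) & LEVEL_MASK);
    return ix;
}
```

For the address 0x7f1234567abc this yields PGD 254, PUD 72, PMD 418, PTE 359 and offset 0xabc. A TLB miss on that address means the hardware walks all four tables in that order.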

TLB (Translation Lookaside Buffer)

Page Fault Handling

When Do Page Faults Happen?

Page faults aren’t errors - they’re a normal part of how Linux manages memory efficiently. Here are real-world scenarios:

Minor fault examples:

You run ./myprogram - the first time code executes, its pages aren’t mapped yet (demand paging); the fault is minor if the file is already in the page cache
After fork(), child writes to memory - triggers copy-on-write
You mmap() a file that is already in the page cache - first access only has to map the page

Major fault examples:

Your laptop suspended with apps open - resuming reads pages back from swap
You open a 1GB video file with mmap() - reading it triggers disk I/O
System under memory pressure swapped out your idle browser - switching back triggers swap-in

Invalid fault examples:

char *p = NULL; *p = 'x'; - accessing NULL pointer
Buffer overflow writing past array bounds
Accessing freed memory (use-after-free)

![Page Fault Types](/images/courses/linux-page-fault-types.svg)

### Page Fault Handler Flow

![Page Fault Handling Sequence](/images/courses/linux-page-fault-handling.svg)

When a process accesses a memory address that isn't currently mapped to a physical page, or accesses it in a way the page table entry forbids, the CPU raises a **page fault exception**. The kernel's page fault handler (`exc_page_fault` on current x86 kernels, `do_page_fault` on older ones) takes over to resolve the situation.

**Resolution Paths:**
1. **Minor Fault**: The page is in memory (e.g., in page cache or shared library) but not mapped to this process. The kernel simply updates the page table. Fast.
2. **Major Fault**: The page is not in memory (e.g., swapped out or not read from disk yet). The kernel must issue disk I/O to fetch it. Slow.
3. **Invalid Fault**: The access is truly invalid (e.g., writing to read-only memory, accessing unmapped space). The kernel sends `SIGSEGV` (Segmentation Fault).

```c
// arch/x86/mm/fault.c (simplified; locking, retries, and stack growth omitted)
static void handle_page_fault(struct pt_regs *regs, unsigned long error_code,
                              unsigned long address)
{
    struct vm_area_struct *vma;
    struct mm_struct *mm = current->mm;
    unsigned int flags = FAULT_FLAG_DEFAULT;
    vm_fault_t fault;

    // Find the first VMA that ends above the address
    vma = find_vma(mm, address);

    if (!vma || address < vma->vm_start) {
        // Address is not inside any VMA: invalid access, deliver SIGSEGV
        bad_area(regs, error_code, address);
        return;
    }

    // Check permissions (e.g. a write to a read-only mapping)
    if (unlikely(access_error(error_code, vma))) {
        bad_area_access_error(regs, error_code, address, vma);
        return;
    }

    // Valid access: let the core MM resolve it (demand paging, COW, swap-in)
    fault = handle_mm_fault(vma, address, flags, regs);

    // Check result
    if (fault & VM_FAULT_OOM)
        pagefault_out_of_memory();
    else if (fault & VM_FAULT_SIGSEGV)
        bad_area_nosemaphore(regs, error_code, address);
}
```

Monitoring Page Faults

# View page fault stats
grep -E 'pgfault|pgmajfault' /proc/vmstat
# pgfault = total faults, pgmajfault = major faults

# Per-process faults
cat /proc/<pid>/stat | awk '{print "Minor:", $10, "Major:", $12}'

# Real-time monitoring with perf
sudo perf stat -e page-faults,minor-faults,major-faults ./myprogram

Memory Reclaim

kswapd and Direct Reclaim

![Memory Reclaim Watermarks](/images/courses/linux-memory-reclaim.svg)
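
Each zone has three watermarks (min, low, high) that decide who reclaims memory and when. The following is a simplified sketch of that decision with invented names; the real checks also account for allocation order, reserved memory, and GFP flags.

```c
enum reclaim_action {
    RECLAIM_NONE,          /* enough free memory: allocate immediately */
    RECLAIM_WAKE_KSWAPD,   /* allocate, but wake kswapd to reclaim in the background */
    RECLAIM_DIRECT         /* the allocating task must reclaim memory itself */
};

/* What an allocation triggers, given the zone's free page count. */
enum reclaim_action reclaim_decision(unsigned long free_pages,
                                     unsigned long wmark_min,
                                     unsigned long wmark_low)
{
    if (free_pages < wmark_min)
        return RECLAIM_DIRECT;
    if (free_pages < wmark_low)
        return RECLAIM_WAKE_KSWAPD;
    return RECLAIM_NONE;
}

/* Once woken, kswapd keeps reclaiming until free memory reaches the high watermark. */
int kswapd_should_continue(unsigned long free_pages, unsigned long wmark_high)
{
    return free_pages < wmark_high;
}
```

The gap between low and high gives kswapd room to work ahead of demand. Direct reclaim is the expensive case because the allocating process stalls while it frees memory.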

### LRU Lists

```c
// Pages organized in LRU lists
enum lru_list {
    LRU_INACTIVE_ANON,   // Anonymous, not recently used
    LRU_ACTIVE_ANON,     // Anonymous, recently used
    LRU_INACTIVE_FILE,   // File-backed, not recently used
    LRU_ACTIVE_FILE,     // File-backed, recently used
    LRU_UNEVICTABLE,     // Locked pages, mlocked
    NR_LRU_LISTS
};
```

# View LRU stats
cat /proc/meminfo | grep -E "Active|Inactive"
# Active(anon):     2048000 kB
# Inactive(anon):   1024000 kB
# Active(file):     4096000 kB
# Inactive(file):   2048000 kB

Swappiness

The vm.swappiness parameter controls how aggressively the kernel swaps out anonymous memory relative to reclaiming page cache.

Range: 0 to 200 on Linux 5.8 and later, 0 to 100 on older kernels (default 60)
High value (100 and above): At 100, anonymous and file pages are treated as equally costly to reclaim. Higher values favor swapping out anonymous memory to keep page cache (file data) in memory, which can suit I/O-heavy workloads or fast swap devices such as zram.
Low value (0): Avoid swapping anonymous memory unless absolutely necessary. Good for latency-sensitive applications (databases).
Misconception: Setting to 0 does NOT disable swap. It just tells the kernel to prefer reclaiming file pages first.
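
As a rough mental model, swappiness sets the relative weight of scanning anonymous pages versus file pages. The sketch below is modeled on how older kernels seeded that balance (anonymous weight = swappiness, file weight = 200 - swappiness); the names are invented, and real kernels further adjust the ratio based on recent reclaim cost.

```c
struct scan_weights {
    unsigned anon;   /* relative pressure on anonymous (swap-backed) pages */
    unsigned file;   /* relative pressure on file-backed page cache */
};

/* Simplified model: the two weights always add up to 200. */
struct scan_weights swappiness_weights(unsigned swappiness)
{
    struct scan_weights w;

    if (swappiness > 200)
        swappiness = 200;   /* clamp to the documented maximum */
    w.anon = swappiness;
    w.file = 200 - swappiness;
    return w;
}
```

With the default of 60 the weights are 60 to 140, so file pages are scanned more than twice as hard as anonymous pages. At 100 the two are equal, and at 0 the anonymous weight is zero, which is why the kernel then swaps only when it cannot free enough file pages.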

# Check current value
cat /proc/sys/vm/swappiness

# Change temporarily
sudo sysctl vm.swappiness=10

# Make permanent in /etc/sysctl.conf
# vm.swappiness = 10

OOM Killer

Design Philosophy

When the system runs completely out of memory, Linux faces an impossible choice: which process to kill?

The goal: Kill the “least important” process that frees the most memory.

The challenge: How do you define “least important”? The OOM killer uses a heuristic:

Memory usage is the primary factor (kill big memory hogs)
oom_score_adj lets you override (protect critical services)
Root processes got a slight bonus in older kernels (removed in Linux 4.17)
Process age and CPU time were part of the original heuristic but are no longer used by current kernels

Why not just kill the biggest process? Because that might be your database with critical data. The scoring system tries to balance memory freed vs. importance.

OOM Score Calculation

![OOM Killer Scoring](/images/courses/linux-oom-killer-scoring.svg)
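
A commonly quoted approximation of the score is (process memory / total memory) × 1000 + oom_score_adj. The sketch below implements only that approximation, with an invented function name; the exact scaling of /proc/<pid>/oom_score has changed between kernel versions, so do not expect the numbers to match a live system.

```c
/* Approximate badness score: share of memory in thousandths, shifted by
 * oom_score_adj and floored at 0. Memory is counted in pages (RSS, swap,
 * and page tables in the real kernel). */
long oom_score_sketch(unsigned long long process_pages,
                      unsigned long long total_pages,
                      int oom_score_adj)
{
    long score;

    /* An adjustment of -1000 exempts the task from the OOM killer. */
    if (oom_score_adj <= -1000 || total_pages == 0)
        return 0;

    score = (long)(process_pages * 1000 / total_pages) + oom_score_adj;
    return score < 0 ? 0 : score;
}
```

A process using a quarter of memory scores 250 with no adjustment and 750 with oom_score_adj = 500. The same process with oom_score_adj = -300 scores 0 and is spared unless nothing else can be killed.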

### Configuring OOM Behavior

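```bash
# View OOM scores
cat /proc/<pid>/oom_score
cat /proc/<pid>/oom_score_adj

# Exempt a process from the OOM killer (lowering the value requires root)
echo -1000 | sudo tee /proc/<pid>/oom_score_adj

# Make a process the preferred OOM target
echo 1000 | sudo tee /proc/<pid>/oom_score_adj

# Strict overcommit accounting: allocations beyond the commit limit fail
# with ENOMEM up front. This makes OOM kills much less likely but does not
# disable the OOM killer, and it can break programs that rely on overcommit.
echo 2 | sudo tee /proc/sys/vm/overcommit_memory

# Panic on OOM instead of killing
echo 1 | sudo tee /proc/sys/vm/panic_on_oom
```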

Memory Cgroups

Critical for container resource management.

Cgroup v2 Memory Controller

# Create cgroup
mkdir /sys/fs/cgroup/mygroup

# Set memory limit (100MB)
echo 104857600 > /sys/fs/cgroup/mygroup/memory.max

# Set throttling threshold (50MB): above this, the cgroup is reclaimed aggressively and its tasks are slowed down
echo 52428800 > /sys/fs/cgroup/mygroup/memory.high

# Add process to cgroup
echo $$ > /sys/fs/cgroup/mygroup/cgroup.procs

# View current usage
cat /sys/fs/cgroup/mygroup/memory.current

# View statistics
cat /sys/fs/cgroup/mygroup/memory.stat

Memory Cgroup OOM

---

## NUMA (Non-Uniform Memory Access)

### NUMA Architecture

![NUMA Topology](/images/courses/linux-numa-topology.svg)

### NUMA Policies

```bash
# View NUMA topology
numactl --hardware

# Run with specific NUMA policy
numactl --cpunodebind=0 --membind=0 ./myprogram  # Local only
numactl --interleave=all ./myprogram  # Spread across nodes

# View NUMA stats
numastat
numastat -p <pid>

# View per-process NUMA memory
cat /proc/<pid>/numa_maps
```

Huge Pages

Why Huge Pages Matter
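
The benefit is mostly about TLB reach: how much memory the TLB can cover without a miss. The sketch below does the arithmetic with invented helper names; the entry count passed in is a hypothetical example, since real TLB sizes depend on the CPU model.

```c
#include <stdint.h>

/* Bytes of address space a TLB with `entries` entries can map at once. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}

/* Pages (and so last-level page table entries) needed to map len bytes. */
uint64_t pages_needed(uint64_t len, uint64_t page_size)
{
    return (len + page_size - 1) / page_size;
}
```

Assuming a 1024-entry TLB, 4 KB pages cover 4 MB while 2 MB huge pages cover 2 GB. Mapping a 1 GB buffer takes 262,144 entries with 4 KB pages but only 512 with 2 MB pages, which also shortens page table walks on a miss.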

Configuring Huge Pages

# View huge page info
grep Huge /proc/meminfo
# HugePages_Total:     128
# HugePages_Free:      100
# HugePages_Rsvd:        0
# Hugepagesize:       2048 kB

# Reserve 128 huge pages (128 x 2 MB = 256 MB); requires root
echo 128 | sudo tee /proc/sys/vm/nr_hugepages

# Transparent Huge Pages (automatic)
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never   (brackets mark the active mode)

// Use in an application (C): request huge-page backing with MAP_HUGETLB.
// mmap returns MAP_FAILED if not enough huge pages are reserved.
void *ptr = mmap(NULL, size, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);

Lab Exercises

Lab 1: Memory Allocation Analysis

Objective: Understand memory allocation patterns

# Watch memory allocation in real-time
watch -n 1 'cat /proc/meminfo | head -20'

# Trace page allocations
sudo perf record -e kmem:mm_page_alloc -a sleep 5
sudo perf script | head -50

# View buddy allocator fragmentation
cat /proc/buddyinfo

# Trigger memory pressure
stress --vm 2 --vm-bytes 1G --timeout 10s

# Watch reclaim
watch -n 1 'cat /proc/vmstat | grep -E "pgfault|pgmaj|pswp"'

Lab 2: OOM Killer Experimentation

Objective: Understand OOM killer behavior

# Create a cgroup with 100MB limit
sudo mkdir /sys/fs/cgroup/test_oom
echo 104857600 | sudo tee /sys/fs/cgroup/test_oom/memory.max

# Run memory-hungry process in cgroup
sudo bash -c 'echo $$ > /sys/fs/cgroup/test_oom/cgroup.procs && \
              python3 -c "x = [0] * (200 * 1024 * 1024 // 8)"'

# Watch OOM events
sudo dmesg | tail -20

# View OOM scores for all processes
for pid in $(ls /proc | grep -E '^[0-9]+$'); do
    if [ -f /proc/$pid/oom_score ]; then
        echo "$pid $(cat /proc/$pid/oom_score 2>/dev/null) $(cat /proc/$pid/comm 2>/dev/null)"
    fi
done | sort -nk2 | tail -20

# Cleanup
sudo rmdir /sys/fs/cgroup/test_oom

Lab 3: Page Fault Analysis

Objective: Profile page faults

// fault_test.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

// Print this process's cumulative minor and major fault counts
static void print_faults(const char *label) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        perror("getrusage");
        exit(EXIT_FAILURE);
    }
    printf("%s - Minor: %ld, Major: %ld\n",
           label, usage.ru_minflt, usage.ru_majflt);
}

int main(void) {
    size_t size = 100 * 1024 * 1024;  // 100 MB
    size_t page = (size_t)sysconf(_SC_PAGESIZE);  // usually 4096

    print_faults("Before allocation");

    // Allocate (almost no faults yet - lazy allocation)
    char *mem = malloc(size);
    if (mem == NULL) {
        perror("malloc");
        return EXIT_FAILURE;
    }
    print_faults("After malloc");

    // Touch one byte per page (each first touch causes a minor fault)
    for (size_t i = 0; i < size; i += page) {
        mem[i] = 'x';
    }
    print_faults("After touching pages");

    free(mem);
    return 0;
}

gcc fault_test.c -o fault_test
./fault_test

Interview Questions

Q1: Explain the difference between minor and major page faults

Answer:

Minor fault:

Page is in memory but not mapped in page table
No disk I/O required
Examples: First touch of allocated memory, COW after fork
Cost: ~1-10 microseconds

Major fault:

Page not in memory, must read from disk
Examples: Swap-in, reading memory-mapped file
Cost: ~1-10 milliseconds (1000x slower!)

Production impact:

Minor faults: Normal, usually not a concern
Major faults: Serious performance problem, indicates memory pressure or working set too large

Monitoring:

perf stat -e page-faults,minor-faults,major-faults ./program

Q2: How does the kernel decide which process to kill during OOM?

Answer:

OOM score calculation:

score = (process_memory / total_memory) × 1000 + oom_score_adj

Factors:

Memory usage (primary factor)
oom_score_adj (-1000 to +1000)
Root processes got a slight discount in older kernels; current kernels do not treat them specially

Selection process:

Calculate score for all processes
Select process with highest score
Send SIGKILL to that process
Wait for memory to free

Protection strategies:

Set oom_score_adj = -1000 for critical services
Use memory cgroups to limit container memory
Enable vm.panic_on_oom for critical systems

Q3: What are the trade-offs between kmalloc and vmalloc?

Answer:

| Aspect | kmalloc | vmalloc |
|---|---|---|
| Physical memory | Contiguous | Non-contiguous |
| Max size | ~4 MB | Virtual space limit |
| Performance | Faster | Slower (TLB overhead) |
| Use in DMA | Yes | No (not physically contiguous) |
| Interrupt context | Yes (GFP_ATOMIC) | No (may sleep) |

When to use kmalloc:

Small allocations (<4 MB)
DMA buffers
Performance-critical paths
Interrupt context

When to use vmalloc:

Large allocations
Module loading
Memory that doesn’t need DMA
Non-critical paths

Q4: Explain how cgroups memory limits work with containers

Answer:

Memory cgroup controls:

memory.max: Hard limit (OOM if exceeded)
memory.high: Soft limit (throttling)
memory.low: Best-effort protection
memory.min: Hard protection
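
How the four knobs interact can be sketched as two small decisions: what happens when a cgroup tries to charge more memory, and how willing the kernel is to reclaim from it. The names below are invented and the model is simplified; in particular, the kernel works with effective values that also depend on ancestor cgroups.

```c
#include <stdint.h>

enum memcg_charge_result {
    MEMCG_CHARGE_OK,          /* at or below memory.high: no intervention */
    MEMCG_CHARGE_THROTTLED,   /* above memory.high: heavy reclaim and throttling */
    MEMCG_CHARGE_OVER_MAX     /* would exceed memory.max: reclaim, then OOM kill */
};

/* Outcome of charging `charge` more bytes to a cgroup currently at `usage`. */
enum memcg_charge_result memcg_charge(uint64_t usage, uint64_t charge,
                                      uint64_t high, uint64_t max)
{
    uint64_t after = usage + charge;

    if (after > max)
        return MEMCG_CHARGE_OVER_MAX;
    if (after > high)
        return MEMCG_CHARGE_THROTTLED;
    return MEMCG_CHARGE_OK;
}

enum memcg_reclaim_priority {
    MEMCG_RECLAIM_NEVER,        /* usage within memory.min: never reclaimed */
    MEMCG_RECLAIM_LAST_RESORT,  /* usage within memory.low: only if nothing else is left */
    MEMCG_RECLAIM_NORMAL        /* unprotected: reclaimed like any other memory */
};

/* How reclaim treats a cgroup when the system is under memory pressure. */
enum memcg_reclaim_priority memcg_reclaim_priority(uint64_t usage,
                                                   uint64_t min, uint64_t low)
{
    if (usage <= min)
        return MEMCG_RECLAIM_NEVER;
    if (usage <= low)
        return MEMCG_RECLAIM_LAST_RESORT;
    return MEMCG_RECLAIM_NORMAL;
}
```

With memory.high at 50 MB and memory.max at 100 MB, a cgroup at 60 MB that charges another 5 MB is throttled, while one at 98 MB would go over the hard limit and trigger reclaim followed by a cgroup OOM kill.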

Container behavior:

Container requests resources (Kubernetes requests)
Scheduler places based on available memory
Cgroup limits enforced at runtime
Exceeding memory.max → container OOM (not host)

OOM handling:

Default: Kill one process in cgroup
memory.oom.group = 1: Kill entire cgroup
Docker default: Restart policy determines behavior

Best practices:

Set memory.max to prevent runaway containers
Set memory.high slightly below max for graceful throttling
Monitor container memory usage with cAdvisor/Prometheus

Key Takeaways

Buddy Allocator

Manages physical pages efficiently with power-of-2 coalescing

Slab Allocator

Optimizes small object allocation with caching and per-CPU pools

OOM Killer

Protects system by killing processes based on memory score

Memory Cgroups

Enable container memory limits with per-cgroup OOM handling

