Chapter 2: Control Groups (cgroups)

While namespaces provide isolation (what a container can see), cgroups provide resource limits (what a container can use). Without cgroups, a container could consume all system resources and starve everything else on the host. Let’s implement resource limiting!

Here is a real-world analogy: namespaces are like giving each tenant in an apartment building their own mailbox and doorbell, so they don’t interfere with each other. Cgroups are like the lease agreement that limits how much water and electricity each tenant can use. Without the lease, one tenant could run every faucet at full blast and leave the rest of the building dry.

In cloud computing, cgroups are what make multi-tenant container platforms safe and fair. Container services on every major cloud provider (AWS, GCP, Azure) rely on cgroups to enforce the CPU and memory limits you configure.
Prerequisites: Chapter 1: Namespaces
Further Reading: Operating Systems: Resource Management
Time: 3-4 hours
Outcome: Containers with CPU, memory, and process limits

What Are Cgroups?

┌─────────────────────────────────────────────────────────────────────────────┐
│                         WITHOUT CGROUPS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SYSTEM: 8 CPU cores, 16GB RAM                                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A        Container B        Container C                 │  │
│   │   ┌──────────┐      ┌──────────┐       ┌──────────┐                │  │
│   │   │ Uses 6   │      │ Uses 1.5 │       │ Uses 0.5 │                │  │
│   │   │ CPUs,    │      │ CPUs,    │       │ CPUs,    │                │  │
│   │   │ 14GB RAM │      │ 1GB RAM  │       │ 1GB RAM  │                │  │
│   │   └──────────┘      └──────────┘       └──────────┘                │  │
│   │                                                                      │  │
│   │   Container A is a "noisy neighbor" - starving others!              │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WITH CGROUPS                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SYSTEM: 8 CPU cores, 16GB RAM                                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A        Container B        Container C                 │  │
│   │   ┌──────────┐      ┌──────────┐       ┌──────────┐                │  │
│   │   │ LIMIT:   │      │ LIMIT:   │       │ LIMIT:   │                │  │
│   │   │ 2 CPUs   │      │ 2 CPUs   │       │ 2 CPUs   │                │  │
│   │   │ 4GB RAM  │      │ 4GB RAM  │       │ 4GB RAM  │                │  │
│   │   │ 100 procs│      │ 100 procs│       │ 100 procs│                │  │
│   │   └──────────┘      └──────────┘       └──────────┘                │  │
│   │                                                                      │  │
│   │   Fair resource allocation! Predictable performance!                │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Cgroup v2 Hierarchy

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CGROUP V2 FILESYSTEM                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   /sys/fs/cgroup/                         ← Cgroup root                     │
│   ├── cgroup.controllers                  ← Available controllers           │
│   ├── cgroup.subtree_control              ← Enabled controllers             │
│   ├── cgroup.procs                        ← PIDs in the root cgroup         │
│   ├── cpu.stat                            ← Root CPU statistics             │
│   │                                                                         │
│   ├── minidocker/                         ← Our container group            │
│   │   ├── cgroup.controllers                                                │
│   │   ├── cgroup.procs                    ← PIDs in this group             │
│   │   ├── cpu.max                         ← CPU limit (quota period)       │
│   │   ├── cpu.weight                      ← CPU shares                     │
│   │   ├── memory.max                      ← Memory limit (bytes)           │
│   │   ├── memory.current                  ← Current memory usage           │
│   │   ├── pids.max                        ← Max processes                  │
│   │   └── pids.current                    ← Current process count          │
│   │                                                                         │
│   └── docker/                             ← Docker's groups                 │
│       ├── container-abc123/                                                 │
│       └── container-def456/                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
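
Before creating a container group, it can help to confirm that the controllers this chapter relies on are actually listed in `cgroup.controllers`, which holds a space-separated list of controller names. The following is a minimal sketch under that assumption; `ControllerCheck` and its `missing` method are hypothetical names for illustration and are not part of the chapter's code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the chapter's classes.
final class ControllerCheck {

    /** Returns the required controllers that a cgroup.controllers line does not list. */
    static Set<String> missing(String controllersFileContent, List<String> required) {
        Set<String> available =
            new HashSet<>(Arrays.asList(controllersFileContent.trim().split("\\s+")));
        Set<String> absent = new LinkedHashSet<>(required);
        absent.removeAll(available);
        return absent;
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of("/sys/fs/cgroup/cgroup.controllers");
        if (!Files.exists(file)) {
            System.out.println("No cgroup v2 hierarchy mounted at /sys/fs/cgroup");
            return;
        }
        Set<String> absent =
            missing(Files.readString(file), List.of("cpu", "memory", "pids", "io"));
        System.out.println(absent.isEmpty()
            ? "All required controllers are available"
            : "Missing controllers: " + absent);
    }
}
```

Keeping the parsing in a pure method that takes the file content as a string makes it easy to exercise without root privileges or a real cgroup mount.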

Part 1: Cgroup Manager

src/main/java/com/minidocker/cgroup/CgroupManager.java
package com.minidocker.cgroup;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Manages cgroup v2 resource limits.
 * 
 * Cgroups control:
 * - CPU: How much CPU time a container can use
 * - Memory: Maximum memory allocation
 * - PIDs: Maximum number of processes
 * - I/O: Disk bandwidth limits
 */
public class CgroupManager {
    
    private static final Path CGROUP_ROOT = Path.of("/sys/fs/cgroup");
    private static final String CGROUP_NAME = "minidocker";
    
    private final String containerId;
    private final Path cgroupPath;
    
    public CgroupManager(String containerId) {
        this.containerId = containerId;
        this.cgroupPath = CGROUP_ROOT.resolve(CGROUP_NAME).resolve(containerId);
    }
    
    /**
     * Creates the cgroup for this container.
     */
    public void create() throws IOException {
        // Create the cgroup directory
        Files.createDirectories(cgroupPath);
        System.out.println("✓ Created cgroup: " + cgroupPath);
        
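        // Enable controllers at every level above the container cgroup.
        // A controller only reaches minidocker/<id> if it is listed in the
        // cgroup.subtree_control of both the root and minidocker/.
        String controllers = "+cpu +memory +pids +io";
        for (Path dir : new Path[] {CGROUP_ROOT, cgroupPath.getParent()}) {
            Path subtreeControl = dir.resolve("cgroup.subtree_control");
            if (Files.exists(subtreeControl)) {
                Files.writeString(subtreeControl, controllers);
            }
        }
        System.out.println("✓ Enabled controllers: cpu, memory, pids, io");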
    }
    
    /**
     * Sets CPU limit.
     * 
     * @param cpuPercent CPU percentage (100 = 1 core, 200 = 2 cores)
     *
     * How this works under the hood: the kernel refills the cgroup's CPU
     * budget every PERIOD microseconds. Once the cgroup has consumed QUOTA
     * microseconds of CPU time within a period, all of its processes are
     * throttled (paused) until the next period begins. Both values are in
     * microseconds, so "100000 100000" means "at most 100ms of CPU time every
     * 100ms", which is exactly one core, and "200000 100000" means "200ms per
     * 100ms", which is two cores.
     *
     * Common pitfall: a QUOTA of 0 does NOT mean "unlimited". Linux rejects
     * quotas below 1000 (1ms) as invalid. Write "max 100000" for unlimited CPU.
     */
    public void setCpuLimit(int cpuPercent) throws IOException {
        // cpu.max format: "$QUOTA $PERIOD"
        // QUOTA: microseconds of CPU time per PERIOD
        // PERIOD: typically 100000 (100ms)
        
        int period = 100000;  // 100ms
        int quota = period * cpuPercent / 100;
        
        Path cpuMaxPath = cgroupPath.resolve("cpu.max");
        String value = quota + " " + period;
        Files.writeString(cpuMaxPath, value);
        
        System.out.println("✓ CPU limit set: " + cpuPercent + "% (" + value + ")");
    }
    
    /**
     * Sets memory limit.
     * 
     * @param memoryBytes Maximum memory in bytes
     */
    public void setMemoryLimit(long memoryBytes) throws IOException {
        Path memoryMaxPath = cgroupPath.resolve("memory.max");
        Files.writeString(memoryMaxPath, String.valueOf(memoryBytes));
        
        // Also set swap limit to prevent swap usage
        Path memorySwapMaxPath = cgroupPath.resolve("memory.swap.max");
        if (Files.exists(memorySwapMaxPath)) {
            Files.writeString(memorySwapMaxPath, "0");
        }
        
        String humanReadable = formatBytes(memoryBytes);
        System.out.println("✓ Memory limit set: " + humanReadable);
    }
    
    /**
     * Sets process (PID) limit.
     * 
     * This is your defense against fork bombs -- a malicious or buggy process
     * that recursively spawns children until the system runs out of PIDs or
     * memory. Without this limit, a single container could crash the entire
     * host by exhausting the kernel's process table. A typical default for
     * Docker is 4096 PIDs per container.
     * 
     * @param maxPids Maximum number of processes
     */
    public void setPidsLimit(int maxPids) throws IOException {
        Path pidsMaxPath = cgroupPath.resolve("pids.max");
        Files.writeString(pidsMaxPath, String.valueOf(maxPids));
        
        System.out.println("✓ PIDs limit set: " + maxPids);
    }
    
    /**
     * Sets I/O bandwidth limits.
     * 
     * @param deviceMajorMinor Device ID (e.g., "8:0" for /dev/sda)
     * @param rbps Read bytes per second
     * @param wbps Write bytes per second
     */
    public void setIOLimit(String deviceMajorMinor, long rbps, long wbps) throws IOException {
        Path ioMaxPath = cgroupPath.resolve("io.max");
        String value = deviceMajorMinor + " rbps=" + rbps + " wbps=" + wbps;
        Files.writeString(ioMaxPath, value);
        
        System.out.println("✓ I/O limit set for " + deviceMajorMinor + 
                          ": read=" + formatBytes(rbps) + "/s, write=" + formatBytes(wbps) + "/s");
    }
    
    /**
     * Adds a process to this cgroup.
     * 
     * @param pid Process ID to add
     */
    public void addProcess(long pid) throws IOException {
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        Files.writeString(procsPath, String.valueOf(pid));
        
        System.out.println("✓ Added PID " + pid + " to cgroup");
    }
    
    /**
     * Adds the current process to this cgroup.
     */
    public void addCurrentProcess() throws IOException {
        addProcess(ProcessHandle.current().pid());
    }
    
    /**
     * Gets current resource usage.
     */
    public ResourceUsage getUsage() throws IOException {
        long memoryBytes = 0;
        int pids = 0;
        long cpuUsage = 0;
        
        Path memoryCurrent = cgroupPath.resolve("memory.current");
        if (Files.exists(memoryCurrent)) {
            memoryBytes = Long.parseLong(Files.readString(memoryCurrent).trim());
        }
        
        Path pidsCurrent = cgroupPath.resolve("pids.current");
        if (Files.exists(pidsCurrent)) {
            pids = Integer.parseInt(Files.readString(pidsCurrent).trim());
        }
        
        Path cpuStat = cgroupPath.resolve("cpu.stat");
        if (Files.exists(cpuStat)) {
            String stats = Files.readString(cpuStat);
            for (String line : stats.split("\n")) {
                if (line.startsWith("usage_usec")) {
                    cpuUsage = Long.parseLong(line.split(" ")[1]);
                }
            }
        }
        
        return new ResourceUsage(memoryBytes, pids, cpuUsage);
    }
    
    /**
     * Removes the cgroup (cleanup).
     */
    public void destroy() throws IOException {
        // First, kill all processes in the cgroup
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        if (Files.exists(procsPath)) {
            String procs = Files.readString(procsPath);
            for (String pid : procs.split("\n")) {
                if (!pid.isEmpty()) {
                    ProcessHandle.of(Long.parseLong(pid))
                                 .ifPresent(ProcessHandle::destroy);
                }
            }
        }
        
        // Then remove the directory
        Files.deleteIfExists(cgroupPath);
        System.out.println("✓ Destroyed cgroup: " + cgroupPath);
    }
    
    private static String formatBytes(long bytes) {
        if (bytes < 1024) return bytes + " B";
        if (bytes < 1024 * 1024) return (bytes / 1024) + " KB";
        if (bytes < 1024 * 1024 * 1024) return (bytes / (1024 * 1024)) + " MB";
        return (bytes / (1024 * 1024 * 1024)) + " GB";
    }
    
    /**
     * Resource usage statistics.
     */
    public record ResourceUsage(long memoryBytes, int pids, long cpuUsageMicros) {
        @Override
        public String toString() {
            return String.format("Memory: %s, PIDs: %d, CPU: %.2fms",
                formatBytes(memoryBytes), pids, cpuUsageMicros / 1000.0);
        }
    }
}
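
The quota arithmetic in `setCpuLimit` can be checked in isolation. The sketch below reproduces the same formula (quota = period × percent / 100, with a 100ms period) in a standalone class; `CpuMaxExample` and `cpuMaxValue` are hypothetical names, and rejecting values below 1% is an assumption added here for illustration rather than something the chapter's code does.

```java
// Hypothetical standalone version of the cpu.max calculation, for illustration.
final class CpuMaxExample {

    static final long PERIOD_USEC = 100_000; // 100ms, the period used in this chapter

    /** Builds the "QUOTA PERIOD" string for cpu.max (100 = one core). */
    static String cpuMaxValue(int cpuPercent) {
        if (cpuPercent < 1) {
            // Assumption for this sketch: refuse limits that would yield a zero quota.
            throw new IllegalArgumentException("cpuPercent must be at least 1");
        }
        long quota = PERIOD_USEC * cpuPercent / 100;
        return quota + " " + PERIOD_USEC;
    }

    public static void main(String[] args) {
        System.out.println(cpuMaxValue(50));   // 50000 100000  (half a core)
        System.out.println(cpuMaxValue(100));  // 100000 100000 (one core)
        System.out.println(cpuMaxValue(200));  // 200000 100000 (two cores)
    }
}
```

Using `long` for the multiplication avoids integer overflow for very large percentages.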

Part 2: Resource Limits Configuration

src/main/java/com/minidocker/cgroup/ResourceLimits.java
package com.minidocker.cgroup;

/**
 * Container resource limits configuration.
 * 
 * Similar to Docker's --cpus, --memory, --pids-limit flags.
 */
public class ResourceLimits {
    
    private int cpuPercent = 100;           // 100 = 1 core
    private long memoryBytes = 512 * 1024 * 1024;  // 512MB default
    private int maxPids = 100;              // Max processes
    
    public static ResourceLimits defaults() {
        return new ResourceLimits();
    }
    
    public ResourceLimits withCpu(double cpus) {
        this.cpuPercent = (int) (cpus * 100);
        return this;
    }
    
    public ResourceLimits withMemory(String memory) {
        this.memoryBytes = parseMemory(memory);
        return this;
    }
    
    public ResourceLimits withMemoryBytes(long bytes) {
        this.memoryBytes = bytes;
        return this;
    }
    
    public ResourceLimits withMaxPids(int pids) {
        this.maxPids = pids;
        return this;
    }
    
    public int getCpuPercent() { return cpuPercent; }
    public long getMemoryBytes() { return memoryBytes; }
    public int getMaxPids() { return maxPids; }
    
    private static long parseMemory(String memory) {
        memory = memory.trim().toUpperCase();
        
        long multiplier = 1;
        if (memory.endsWith("K")) {
            multiplier = 1024;
            memory = memory.substring(0, memory.length() - 1);
        } else if (memory.endsWith("M")) {
            multiplier = 1024 * 1024;
            memory = memory.substring(0, memory.length() - 1);
        } else if (memory.endsWith("G")) {
            multiplier = 1024 * 1024 * 1024;
            memory = memory.substring(0, memory.length() - 1);
        }
        
        return Long.parseLong(memory) * multiplier;
    }
    
    @Override
    public String toString() {
        return String.format("ResourceLimits{cpu=%.2f cores, memory=%dMB, pids=%d}",
            cpuPercent / 100.0,
            memoryBytes / (1024 * 1024),
            maxPids);
    }
}

Part 3: Integrating with Container

src/main/java/com/minidocker/Container.java
package com.minidocker;

import com.minidocker.cgroup.CgroupManager;
import com.minidocker.cgroup.ResourceLimits;
import com.minidocker.linux.LibC;
import com.minidocker.namespace.NamespaceManager;
import com.minidocker.namespace.NamespaceOptions;

import java.util.UUID;

/**
 * Container with namespace isolation AND resource limits.
 */
public class Container {
    
    private final LibC libc = LibC.INSTANCE;
    private final NamespaceManager namespaces = new NamespaceManager();
    
    private final String id;
    private final String hostname;
    private final String[] command;
    private final ResourceLimits limits;
    
    private CgroupManager cgroup;
    
    public Container(String hostname, String[] command, ResourceLimits limits) {
        this.id = UUID.randomUUID().toString().substring(0, 12);
        this.hostname = hostname;
        this.command = command;
        this.limits = limits;
    }
    
    public void run() throws Exception {
        System.out.println("=== Starting Container " + id + " ===");
        System.out.println("Hostname: " + hostname);
        System.out.println("Limits: " + limits);
        System.out.println();
        
        // Step 1: Create cgroup
        cgroup = new CgroupManager(id);
        cgroup.create();
        
        // Step 2: Set resource limits
        cgroup.setCpuLimit(limits.getCpuPercent());
        cgroup.setMemoryLimit(limits.getMemoryBytes());
        cgroup.setPidsLimit(limits.getMaxPids());
        
        // Step 3: Create namespaces
        NamespaceOptions nsOptions = NamespaceOptions.builder()
            .withPid()
            .withMount()
            .withUts()
            .withIpc()
            .build();
        
        namespaces.createNamespaces(nsOptions);
        namespaces.setHostname(hostname);
        
        // Step 4: Fork
        int pid = libc.fork();
        
        if (pid == 0) {
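            // Child: add to cgroup and run.
            // getpid() asks the kernel directly, so the child never reuses a
            // PID value the JVM may have cached before fork().
            cgroup.addProcess(libc.getpid());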
            runContainerInit();
        } else if (pid > 0) {
            // Parent: wait and monitor
            monitorContainer(pid);
        } else {
            throw new RuntimeException("Fork failed");
        }
    }
    
    private void monitorContainer(int childPid) {
        // Start monitoring thread
        Thread monitor = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(1000);
                    CgroupManager.ResourceUsage usage = cgroup.getUsage();
                    System.out.println("[Monitor] " + usage);
                }
            } catch (Exception e) {
                // Container exited
            }
        });
        monitor.setDaemon(true);
        monitor.start();
        
        // Wait for child
        int[] status = new int[1];
        libc.waitpid(childPid, status, 0);
        
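        // waitpid() reports a raw wait status, not an exit code: the low 7 bits
        // hold the terminating signal (9 = SIGKILL, e.g. after an OOM kill) and
        // bits 8-15 hold the exit code of a normal exit.
        int signal = status[0] & 0x7f;
        if (signal != 0) {
            System.out.println("Container killed by signal " + signal);
        } else {
            System.out.println("Container exited with code: " + ((status[0] >> 8) & 0xff));
        }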
        
        // Cleanup
        try {
            cgroup.destroy();
        } catch (Exception e) {
            System.err.println("Failed to cleanup cgroup: " + e.getMessage());
        }
    }
    
    private void runContainerInit() {
        try {
            System.out.println("\n=== Container Init (PID " + libc.getpid() + ") ===");
            
            if (command.length > 0) {
                String[] argv = new String[command.length + 1];
                System.arraycopy(command, 0, argv, 0, command.length);
                argv[command.length] = null;
                
                libc.execv(command[0], argv);
                System.err.println("Failed to execute: " + command[0]);
                System.exit(1);
            }
        } catch (Exception e) {
            System.err.println("Container init failed: " + e.getMessage());
            System.exit(1);
        }
    }
    
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java Container <hostname> <command...>");
            System.out.println("       --cpus=<n>     CPU cores (e.g., 0.5, 2)");
            System.out.println("       --memory=<n>   Memory limit (e.g., 256M, 1G)");
            System.out.println("       --pids=<n>     Max processes");
            System.exit(1);
        }
        
        ResourceLimits limits = ResourceLimits.defaults();
        String hostname = null;
        String[] command = null;
        
        // Parse arguments
        int cmdStart = 0;
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("--cpus=")) {
                limits.withCpu(Double.parseDouble(args[i].substring(7)));
            } else if (args[i].startsWith("--memory=")) {
                limits.withMemory(args[i].substring(9));
            } else if (args[i].startsWith("--pids=")) {
                limits.withMaxPids(Integer.parseInt(args[i].substring(7)));
            } else if (hostname == null) {
                hostname = args[i];
            } else {
                cmdStart = i;
                break;
            }
        }
        
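        // Without this check a missing command would surface later as a
        // NullPointerException inside the container init.
        if (hostname == null || cmdStart == 0) {
            System.err.println("Error: a hostname and a command are required");
            System.exit(1);
        }
        command = new String[args.length - cmdStart];
        System.arraycopy(args, cmdStart, command, 0, command.length);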
        
        try {
            Container container = new Container(hostname, command, limits);
            container.run();
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
            System.exit(1);
        }
    }
}

Part 4: Testing Resource Limits

Testing CPU Limits

# Run a CPU-intensive workload with 50% CPU limit
sudo java Container --cpus=0.5 testhost /bin/sh -c "while true; do :; done"

# In another terminal, observe CPU usage (should be ~50%)
# pgrep can match several PIDs (sudo, java, sh), so join them with commas for top
top -p "$(pgrep -d, -f 'while true')"

Testing Memory Limits

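// MemoryHog.java - For testing memory limits
// Run with a heap cap above the cgroup limit (e.g. -Xmx2g in a 256M container);
// otherwise the JVM may throw OutOfMemoryError before the kernel steps in.
import java.util.ArrayList;
import java.util.List;

public class MemoryHog {
    public static void main(String[] args) throws Exception {
        // Keep references so the garbage collector cannot free the chunks
        List<byte[]> allocations = new ArrayList<>();
        
        // Container will be OOM-killed when hitting memory limit!
        while (true) {
            // Allocate 10MB chunks
            byte[] chunk = new byte[10 * 1024 * 1024];
            allocations.add(chunk);
            System.out.println("Allocated: " + (allocations.size() * 10) + "MB");
            Thread.sleep(100);
        }
    }
}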

Testing PID Limits

# Fork bomb protection!
# Without pids limit, this would crash the system:
:(){ :|:& };:

# With pids limit set to 100, the fork bomb is contained

Understanding Cgroup Controllers

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CGROUP CONTROLLERS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   CONTROLLER    FILES               DESCRIPTION                             │
│   ──────────    ─────               ───────────                             │
│                                                                              │
│   cpu           cpu.max             Bandwidth limit (quota/period)          │
│                 cpu.weight          Relative CPU shares (1-10000)           │
│                 cpu.stat            Usage statistics                        │
│                                                                              │
│   memory        memory.max          Hard limit (OOM kill if exceeded)       │
│                 memory.high         Soft limit (throttling starts)          │
│                 memory.current      Current usage                           │
│                 memory.swap.max     Swap limit                              │
│                 memory.stat         Detailed statistics                     │
│                                                                              │
│   pids          pids.max            Maximum processes                       │
│                 pids.current        Current process count                   │
│                                                                              │
│   io            io.max              Bandwidth/IOPS limits per device        │
│                 io.weight           Relative I/O priority                   │
│                 io.stat             I/O statistics                          │
│                                                                              │
│   cpuset        cpuset.cpus         Allowed CPU cores (e.g., "0-3")         │
│                 cpuset.mems         Allowed NUMA nodes                      │
│                                                                              │
│   hugetlb       hugetlb.X.max       Huge page limits                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
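
Unlike `cpu.max`, `cpu.weight` is relative: when sibling cgroups all want CPU at the same time, each receives time in proportion to its weight (the default weight is 100). A small worked sketch of that proportion follows; `CpuWeightShares` is a hypothetical name, and the model deliberately ignores idle siblings, whose unused share is redistributed by the kernel.

```java
// Hypothetical illustration of proportional sharing under full contention.
final class CpuWeightShares {

    /** Fraction of CPU each busy sibling receives: weight / sum of all weights. */
    static double[] shares(int... weights) {
        long total = 0;
        for (int w : weights) {
            total += w;
        }
        double[] result = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            result[i] = (double) weights[i] / total;
        }
        return result;
    }

    public static void main(String[] args) {
        // Two siblings: one at the default weight of 100, one at 300.
        double[] s = shares(100, 300);
        System.out.println(s[0]); // 0.25
        System.out.println(s[1]); // 0.75
    }
}
```

A weight only matters under contention: a lone busy cgroup can still use every core unless `cpu.max` caps it.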

Exercises

Add disk I/O limits:
// 1. Find the device major:minor for the disk
// cat /sys/block/sda/dev

// 2. Set io.max
// "8:0 rbps=1048576 wbps=1048576" = 1MB/s read/write

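// 3. Test with dd (oflag=direct bypasses the page cache so the limit is visible;
//    the output file must live on the throttled device):
// dd if=/dev/zero of=/tmp/test bs=1M count=100 oflag=direct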
Pin container to specific CPU cores:
// Use cpuset controller
// Write to cpuset.cpus: "0,2" (only use cores 0 and 2)
// Write to cpuset.mems: "0" (NUMA node 0)

// Dedicated cores keep other containers from evicting this one's
// per-core caches (L1/L2). Great for latency-sensitive workloads
Implement real-time resource monitoring:
// 1. Read cpu.stat periodically for CPU usage
// 2. Read memory.current and memory.stat for memory
// 3. Read io.stat for disk I/O
// 4. Calculate deltas to get rates
// 5. Display like 'docker stats'
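
As a starting point for step 4, a rate is a counter delta divided by elapsed time. For CPU, `usage_usec` in `cpu.stat` is cumulative, so two samples are needed. The sketch below shows only that calculation; `CpuRate` and `cpuPercent` are hypothetical names, and sampling the file and timing it (for example with `System.nanoTime()`) are left to the exercise.

```java
// Hypothetical helper: turns two cumulative usage_usec samples into a percentage.
final class CpuRate {

    /**
     * @param usageUsecBefore usage_usec from the first sample
     * @param usageUsecAfter  usage_usec from the second sample
     * @param elapsedNanos    wall-clock time between the samples
     * @return CPU usage where 100.0 means one fully busy core
     */
    static double cpuPercent(long usageUsecBefore, long usageUsecAfter, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        double elapsedUsec = elapsedNanos / 1_000.0;
        return 100.0 * (usageUsecAfter - usageUsecBefore) / elapsedUsec;
    }

    public static void main(String[] args) {
        // 500,000 microseconds of CPU time consumed during one second of wall time
        System.out.println("CPU usage: " + cpuPercent(1_000_000, 1_500_000, 1_000_000_000L) + "%");
        // prints "CPU usage: 50.0%"
    }
}
```

The same delta-over-time pattern applies to the byte counters in `io.stat`.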

Key Takeaways

Hierarchical

Cgroups form a tree: a child can never use more than its ancestors’ limits allow

Controller-Based

Different controllers for CPU, memory, I/O, PIDs

Kernel Enforcement

Limits are enforced by the kernel, not by an honor system

OOM Killer

Hitting memory.max when nothing more can be reclaimed triggers the OOM killer inside that cgroup
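
The hierarchical takeaway can be made concrete: because every ancestor's limit also applies, the effective memory limit of a cgroup is the smallest `memory.max` along its path. The sketch below computes that minimum from a list of values; `EffectiveLimit` is a hypothetical name, and the list is assumed to hold one `memory.max` value per non-root level, from the top-level group down to the container.

```java
import java.util.List;

// Hypothetical illustration of how limits combine down the cgroup tree.
final class EffectiveLimit {

    static final long UNLIMITED = Long.MAX_VALUE;

    /** Smallest memory.max along the path; the literal "max" means no limit at that level. */
    static long effectiveMemoryMax(List<String> memoryMaxTopToLeaf) {
        long effective = UNLIMITED;
        for (String value : memoryMaxTopToLeaf) {
            String v = value.trim();
            if (!v.equals("max")) {
                effective = Math.min(effective, Long.parseLong(v));
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        // minidocker/ capped at 1 GiB, the container inside it configured with 2 GiB
        System.out.println(effectiveMemoryMax(List.of("1073741824", "2147483648")));
        // prints 1073741824: the parent's tighter limit wins
    }
}
```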

What’s Next?

In Chapter 3: Filesystem, we’ll implement:
  • Overlay filesystems
  • Copy-on-write layers
  • Container root filesystem setup

Interview Deep-Dive

Strong Answer:
  • Cgroups v1 had a per-controller hierarchy: each controller (cpu, memory, pids) had its own independent tree, and a process could be in different groups for different controllers. This created incoherent resource accounting — a process could be in cgroup A for CPU but cgroup B for memory, making it impossible to get a unified view of resource consumption.
  • Cgroups v2 uses a single unified hierarchy. A process belongs to exactly one cgroup, and all controllers apply to that cgroup. This makes resource accounting consistent and simplifies management tooling. It also enables cross-controller coordination — for example, the memory controller can influence the I/O controller’s behavior for a process, which was impossible in v1.
  • V2 also introduced the “no internal processes” rule: if a cgroup has child cgroups, the parent cannot directly contain processes. This eliminates ambiguity about which level of the hierarchy should be charged for resource usage.
  • The migration took years because systemd, Docker, and Kubernetes all had deep dependencies on v1 semantics. The practical interview relevance is that debugging resource limits in production requires knowing which cgroup version the host is running, because the filesystem paths and file formats differ significantly.
Follow-up: How do you check which cgroup version a system is running?
Check with stat -fc %T /sys/fs/cgroup/. If it returns cgroup2fs, the system uses v2; tmpfs indicates v1. In v1, you find CPU limits at /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us. In v2, it is /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.max. The file format also differs: v1 uses separate files for quota and period; v2 combines them into one file (cpu.max).
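
From Java, a rough equivalent of that check is to look for `cgroup.controllers` directly under the mount point, since that file sits at the root of a v2 hierarchy. This is a sketch of one possible approach rather than the only way to detect the version; `CgroupVersion` and `isUnifiedV2` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: detects a cgroup v2 (unified) hierarchy at a mount point.
final class CgroupVersion {

    /** True if the directory looks like the root of a cgroup v2 hierarchy. */
    static boolean isUnifiedV2(Path cgroupRoot) {
        return Files.exists(cgroupRoot.resolve("cgroup.controllers"));
    }

    public static void main(String[] args) {
        Path root = Path.of("/sys/fs/cgroup");
        System.out.println(isUnifiedV2(root)
            ? "cgroup v2 (unified hierarchy)"
            : "no cgroup v2 hierarchy at " + root + " (v1, hybrid, or not mounted)");
    }
}
```

Taking the root as a parameter keeps the check usable on hosts that mount the hierarchy somewhere else.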
Strong Answer:
  • When a cgroup hits its memory.max limit, the kernel’s OOM killer activates for that cgroup specifically (it does not kill processes outside the cgroup). It selects the victim by calculating an oom_score for each process based on RSS, swap usage, and the oom_score_adj value.
  • For production tuning, set memory.high to a value below memory.max (e.g., high at 900MB, max at 1GB). When usage crosses memory.high, the kernel throttles memory allocation instead of killing, giving the application a chance to shed load or run garbage collection. This turns a hard crash into a soft slowdown.
  • Also configure oom_score_adj to ensure the right process dies if OOM is unavoidable. Setting the main process to -500 and helper sidecars to 0 ensures sidecars die first. And critically, monitor memory.events to alert on oom_kill events before they cascade.
  • A subtle point: application-level memory metrics (like JVM heap usage) only show user-space allocations. The kernel counts all memory charged to the cgroup, including page cache, tmpfs mounts, and kernel stack pages. A container writing heavily to tmpfs will trigger OOM even though the application reports low heap usage.
Follow-up: What is the difference between a cgroup OOM and a host-level OOM? Which is worse operationally?
A cgroup OOM is scoped and controlled: only processes in that cgroup die. A host-level OOM means the entire machine’s memory is exhausted, and the kernel’s global OOM killer activates, often killing the wrong process. Host OOM is operationally catastrophic because it cascades across unrelated services. This is precisely why cgroup memory limits exist: they convert a global disaster into a local, contained failure.
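
Alerting on `memory.events`, as suggested in the answer above, comes down to parsing a flat "key value" file and watching the `oom_kill` counter grow. The sketch below shows that comparison on two snapshots; `MemoryEvents` is a hypothetical name and the sample contents are made-up illustrative values, not output from a real system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: compares two snapshots of a cgroup's memory.events file.
final class MemoryEvents {

    /** Parses lines of the form "key value" into a map; malformed lines are skipped. */
    static Map<String, Long> parse(String content) {
        Map<String, Long> events = new HashMap<>();
        for (String line : content.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                events.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return events;
    }

    /** Number of OOM kills that happened between the two snapshots. */
    static long newOomKills(String before, String after) {
        return parse(after).getOrDefault("oom_kill", 0L)
             - parse(before).getOrDefault("oom_kill", 0L);
    }

    public static void main(String[] args) {
        // Illustrative snapshots only
        String before = "low 0\nhigh 4\nmax 2\noom 0\noom_kill 0\n";
        String after  = "low 0\nhigh 9\nmax 5\noom 1\noom_kill 1\n";
        System.out.println("New OOM kills: " + newOomKills(before, after)); // New OOM kills: 1
    }
}
```

A rising `high` counter is an earlier warning sign than `oom_kill`, because it counts how often usage crossed `memory.high` and triggered throttling.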
Strong Answer:
  • CFS bandwidth control gives each cgroup a quota of CPU time per period (default 100ms). If cpu.max is 50000 100000, the container gets 50ms of CPU time per 100ms period.
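  • The latency spike occurs when the container uses its entire 50ms quota in a burst at the start of the period. For the remaining 50ms, its threads are suspended even if the CPU is idle. A request arriving just after the quota runs out waits up to 50ms for the next period, creating artificial latency. With several busy threads the quota burns down faster (ten threads exhaust 50ms in 5ms of wall time), so the stall can approach the full 100ms period.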
  • This is measurable via cpu.stat — look for nr_throttled and throttled_usec. A container with low average CPU usage can still show thousands of throttle events if usage is bursty.
  • Mitigation strategies: increase the CPU limit for burst headroom, reduce the CFS period (smaller periods like 10ms distribute throttling more evenly but increase scheduling overhead), or use cpuset.cpus to pin the container to dedicated cores, bypassing the bandwidth controller entirely.
Follow-up: Why do some Kubernetes operators recommend removing CPU limits entirely?
The argument is that CFS throttling causes more damage to latency-sensitive services than noisy neighbors do. If the cluster has accurate CPU requests and nodes are not overprovisioned, the scheduler places pods so total requested CPU does not exceed node capacity. Without limits, a pod can burst above its request when CPU is available. The counterargument is that without limits, a misbehaving pod can starve neighbors during peak load. The right answer depends on workload characteristics: for steady-state services with predictable CPU profiles, removing limits works well; for batch jobs or untrusted workloads, limits are necessary.
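
The throttling counters mentioned in this answer can be turned into a single health signal: the fraction of enforcement periods in which the cgroup was throttled (`nr_throttled / nr_periods`). The following is a minimal sketch of that calculation; `ThrottleStats` is a hypothetical name and the sample `cpu.stat` content is illustrative, not captured from a real host.

```java
// Hypothetical helper: computes how often a cgroup was throttled from cpu.stat content.
final class ThrottleStats {

    /** Returns nr_throttled / nr_periods, or 0.0 if no periods were recorded. */
    static double throttledFraction(String cpuStat) {
        long periods = 0;
        long throttled = 0;
        for (String line : cpuStat.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue;
            }
            if (parts[0].equals("nr_periods")) {
                periods = Long.parseLong(parts[1]);
            } else if (parts[0].equals("nr_throttled")) {
                throttled = Long.parseLong(parts[1]);
            }
        }
        return periods == 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) {
        // Illustrative values only
        String sample = String.join("\n",
            "usage_usec 5000000",
            "nr_periods 1000",
            "nr_throttled 250",
            "throttled_usec 7500000");
        System.out.println("Throttled fraction: " + throttledFraction(sample)); // 0.25
    }
}
```

Because the counters are cumulative, a monitoring loop would normally compute this ratio over the difference between two samples instead of over the lifetime totals.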