# Chapter 2: Control Groups (cgroups)

> Implement resource limits using Linux cgroups for CPU, memory, and process control

While namespaces provide **isolation** (what a container can *see*), cgroups provide **resource limits** (what a container can *use*). Without cgroups, a container could consume all system resources and starve everything else on the host. Let's implement resource limiting!

Here is a real-world analogy: namespaces are like giving each tenant in an apartment building their own mailbox and doorbell, so they don't interfere with each other. Cgroups are like the lease agreement that limits how much water and electricity each tenant can use. Without the lease, one tenant could run every faucet at full blast and leave the rest of the building dry. In cloud computing, cgroups are a large part of what makes multi-tenant container platforms safe and fair: Docker, Kubernetes, and the managed container services built on them all rely on cgroups to enforce the CPU and memory limits you configure.

<Info>
  **Prerequisites**: [Chapter 1: Namespaces](/courses/build-your-own-x/docker-1-namespaces)\
  **Further Reading**: [Operating Systems: Resource Management](/operating-systems/scheduling)\
  **Time**: 3-4 hours\
  **Outcome**: Containers with CPU, memory, and process limits
</Info>

***

## What Are Cgroups?

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         WITHOUT CGROUPS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SYSTEM: 8 CPU cores, 16GB RAM                                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A        Container B        Container C                 │  │
│   │   ┌──────────┐      ┌──────────┐       ┌──────────┐                │  │
│   │   │ Uses 6   │      │ Uses 1.5 │       │ Uses 0.5 │                │  │
│   │   │ CPUs,    │      │ CPUs,    │       │ CPUs,    │                │  │
│   │   │ 14GB RAM │      │ 1GB RAM  │       │ 1GB RAM  │                │  │
│   │   └──────────┘      └──────────┘       └──────────┘                │  │
│   │                                                                      │  │
│   │   Container A is a "noisy neighbor" - starving others!              │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WITH CGROUPS                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SYSTEM: 8 CPU cores, 16GB RAM                                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A        Container B        Container C                 │  │
│   │   ┌──────────┐      ┌──────────┐       ┌──────────┐                │  │
│   │   │ LIMIT:   │      │ LIMIT:   │       │ LIMIT:   │                │  │
│   │   │ 2 CPUs   │      │ 2 CPUs   │       │ 2 CPUs   │                │  │
│   │   │ 4GB RAM  │      │ 4GB RAM  │       │ 4GB RAM  │                │  │
│   │   │ 100 procs│      │ 100 procs│       │ 100 procs│                │  │
│   │   └──────────┘      └──────────┘       └──────────┘                │  │
│   │                                                                      │  │
│   │   Fair resource allocation! Predictable performance!                │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Cgroup v2 Hierarchy

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      CGROUP V2 FILESYSTEM                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   /sys/fs/cgroup/                         ← Cgroup root                     │
│   ├── cgroup.controllers                  ← Available controllers           │
│   ├── cgroup.subtree_control              ← Enabled controllers             │
│   ├── cgroup.procs                        ← PIDs in the root group          │
│   ├── cpu.stat                            ← Root CPU statistics             │
│   │                                                                         │
│   ├── minidocker/                         ← Our container group            │
│   │   ├── cgroup.controllers                                                │
│   │   ├── cgroup.procs                    ← PIDs in this group             │
│   │   ├── cpu.max                         ← CPU limit (quota period)       │
│   │   ├── cpu.weight                      ← CPU shares                     │
│   │   ├── memory.max                      ← Memory limit (bytes)           │
│   │   ├── memory.current                  ← Current memory usage           │
│   │   ├── pids.max                        ← Max processes                  │
│   │   └── pids.current                    ← Current process count          │
│   │                                                                         │
│   └── docker/                             ← Docker's groups                 │
│       ├── container-abc123/                                                 │
│       └── container-def456/                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
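
Before writing any limits, it helps to confirm that the host actually exposes this layout. The following is a minimal sketch, assuming the unified hierarchy is mounted at `/sys/fs/cgroup`; the class `CgroupProbe` and its method names are illustrative and not part of the chapter's code.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical helper: inspects a cgroup v2 mount point. */
class CgroupProbe {

    /** cgroup v2 exposes cgroup.controllers at the mount root; a v1 mount does not. */
    static boolean isCgroupV2(Path root) {
        return Files.isRegularFile(root.resolve("cgroup.controllers"));
    }

    /** Parses the space-separated contents of cgroup.controllers. */
    static Set<String> parseControllers(String fileContents) {
        return Arrays.stream(fileContents.trim().split("\\s+"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Builds a cgroup.subtree_control value from the wanted controllers that are available. */
    static String enableString(Set<String> available, String... wanted) {
        return Arrays.stream(wanted)
                .filter(available::contains)
                .map(c -> "+" + c)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) throws Exception {
        Path root = Path.of("/sys/fs/cgroup");
        if (!isCgroupV2(root)) {
            System.out.println("No cgroup v2 hierarchy found at " + root);
            return;
        }
        Set<String> available =
                parseControllers(Files.readString(root.resolve("cgroup.controllers")));
        System.out.println("Available controllers: " + available);
        System.out.println("Would enable: "
                + enableString(available, "cpu", "memory", "pids", "io"));
    }
}
```

Filtering against the available set matters because a write to `cgroup.subtree_control` fails as a whole if it names a controller the kernel does not offer at that level.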

***

## Part 1: Cgroup Manager

```java src/main/java/com/minidocker/cgroup/CgroupManager.java theme={null}
package com.minidocker.cgroup;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Manages cgroup v2 resource limits.
 * 
 * Cgroups control:
 * - CPU: How much CPU time a container can use
 * - Memory: Maximum memory allocation
 * - PIDs: Maximum number of processes
 * - I/O: Disk bandwidth limits
 */
public class CgroupManager {
    
    private static final Path CGROUP_ROOT = Path.of("/sys/fs/cgroup");
    private static final String CGROUP_NAME = "minidocker";
    
    private final String containerId;
    private final Path cgroupPath;
    
    public CgroupManager(String containerId) {
        this.containerId = containerId;
        this.cgroupPath = CGROUP_ROOT.resolve(CGROUP_NAME).resolve(containerId);
    }
    
    /**
     * Creates the cgroup for this container.
     */
    public void create() throws IOException {
        // Create the cgroup directory (and the "minidocker" parent if missing)
        Files.createDirectories(cgroupPath);
        System.out.println("✓ Created cgroup: " + cgroupPath);
        
        // A controller is only usable in a cgroup if its parent lists it in
        // cgroup.subtree_control, so enable top-down: root first, then "minidocker".
        String controllers = "+cpu +memory +pids +io";
        for (Path dir : new Path[] {CGROUP_ROOT, cgroupPath.getParent()}) {
            Path subtreeControl = dir.resolve("cgroup.subtree_control");
            if (Files.exists(subtreeControl)) {
                Files.writeString(subtreeControl, controllers);
            }
        }
        System.out.println("✓ Enabled controllers: cpu, memory, pids, io");
    }
    
    /**
     * Sets CPU limit.
     * 
     * @param cpuPercent CPU percentage (100 = 1 core, 200 = 2 cores)
     *
     * How this works under the hood: the kernel scheduler accounts CPU time
     * against cpu.max in windows of PERIOD microseconds. Once the cgroup has
     * consumed QUOTA microseconds of CPU time within the current period, all
     * processes in the group are throttled (paused) until the next period
     * begins. So "100000 100000" means "use at most 100ms of CPU every 100ms",
     * which is exactly one core. "200000 100000" means "200ms per 100ms",
     * which is two cores.
     *
     * Common pitfall: a QUOTA of 0 does NOT mean "unlimited". The kernel
     * rejects quotas below 1ms (1000); write "max 100000" for unlimited CPU.
     */
    public void setCpuLimit(int cpuPercent) throws IOException {
        // cpu.max format: "$QUOTA $PERIOD"
        // QUOTA: microseconds of CPU time per PERIOD
        // PERIOD: typically 100000 (100ms)
        
        int period = 100000;  // 100ms
        int quota = period * cpuPercent / 100;
        
        Path cpuMaxPath = cgroupPath.resolve("cpu.max");
        String value = quota + " " + period;
        Files.writeString(cpuMaxPath, value);
        
        System.out.println("✓ CPU limit set: " + cpuPercent + "% (" + value + ")");
    }
    
    /**
     * Sets memory limit.
     * 
     * @param memoryBytes Maximum memory in bytes
     */
    public void setMemoryLimit(long memoryBytes) throws IOException {
        Path memoryMaxPath = cgroupPath.resolve("memory.max");
        Files.writeString(memoryMaxPath, String.valueOf(memoryBytes));
        
        // Also set swap limit to prevent swap usage
        Path memorySwapMaxPath = cgroupPath.resolve("memory.swap.max");
        if (Files.exists(memorySwapMaxPath)) {
            Files.writeString(memorySwapMaxPath, "0");
        }
        
        String humanReadable = formatBytes(memoryBytes);
        System.out.println("✓ Memory limit set: " + humanReadable);
    }
    
    /**
     * Sets process (PID) limit.
     * 
     * This is your defense against fork bombs -- a malicious or buggy process
     * that recursively spawns children until the system runs out of PIDs or
     * memory. Without this limit, a single container could crash the entire
     * host by exhausting the kernel's process table. Docker exposes the same
     * control as --pids-limit; a cap in the low thousands is a common choice.
     * 
     * @param maxPids Maximum number of processes
     */
    public void setPidsLimit(int maxPids) throws IOException {
        Path pidsMaxPath = cgroupPath.resolve("pids.max");
        Files.writeString(pidsMaxPath, String.valueOf(maxPids));
        
        System.out.println("✓ PIDs limit set: " + maxPids);
    }
    
    /**
     * Sets I/O bandwidth limits.
     * 
     * @param deviceMajorMinor Device ID (e.g., "8:0" for /dev/sda)
     * @param rbps Read bytes per second
     * @param wbps Write bytes per second
     */
    public void setIOLimit(String deviceMajorMinor, long rbps, long wbps) throws IOException {
        Path ioMaxPath = cgroupPath.resolve("io.max");
        String value = deviceMajorMinor + " rbps=" + rbps + " wbps=" + wbps;
        Files.writeString(ioMaxPath, value);
        
        System.out.println("✓ I/O limit set for " + deviceMajorMinor + 
                          ": read=" + formatBytes(rbps) + "/s, write=" + formatBytes(wbps) + "/s");
    }
    
    /**
     * Adds a process to this cgroup.
     * 
     * @param pid Process ID to add
     */
    public void addProcess(long pid) throws IOException {
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        Files.writeString(procsPath, String.valueOf(pid));
        
        System.out.println("✓ Added PID " + pid + " to cgroup");
    }
    
    /**
     * Adds the current process to this cgroup.
     */
    public void addCurrentProcess() throws IOException {
        addProcess(ProcessHandle.current().pid());
    }
    
    /**
     * Gets current resource usage.
     */
    public ResourceUsage getUsage() throws IOException {
        long memoryBytes = 0;
        int pids = 0;
        long cpuUsage = 0;
        
        Path memoryCurrent = cgroupPath.resolve("memory.current");
        if (Files.exists(memoryCurrent)) {
            memoryBytes = Long.parseLong(Files.readString(memoryCurrent).trim());
        }
        
        Path pidsCurrent = cgroupPath.resolve("pids.current");
        if (Files.exists(pidsCurrent)) {
            pids = Integer.parseInt(Files.readString(pidsCurrent).trim());
        }
        
        Path cpuStat = cgroupPath.resolve("cpu.stat");
        if (Files.exists(cpuStat)) {
            String stats = Files.readString(cpuStat);
            for (String line : stats.split("\n")) {
                if (line.startsWith("usage_usec")) {
                    cpuUsage = Long.parseLong(line.split(" ")[1]);
                }
            }
        }
        
        return new ResourceUsage(memoryBytes, pids, cpuUsage);
    }
    
    /**
     * Removes the cgroup (cleanup).
     */
    public void destroy() throws IOException {
        // First, kill all processes in the cgroup
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        if (Files.exists(procsPath)) {
            String procs = Files.readString(procsPath);
            for (String pid : procs.split("\n")) {
                if (!pid.isBlank()) {
                    ProcessHandle.of(Long.parseLong(pid.trim())).ifPresent(p -> {
                        p.destroyForcibly();   // SIGKILL; destroy() only requests termination
                        p.onExit().join();     // rmdir fails while the cgroup still has members
                    });
                }
            }
        }
        
        // Then remove the (now empty) cgroup directory
        Files.deleteIfExists(cgroupPath);
        System.out.println("✓ Destroyed cgroup: " + cgroupPath);
    }
    
    private static String formatBytes(long bytes) {
        if (bytes < 1024) return bytes + " B";
        if (bytes < 1024 * 1024) return (bytes / 1024) + " KB";
        if (bytes < 1024 * 1024 * 1024) return (bytes / (1024 * 1024)) + " MB";
        return (bytes / (1024 * 1024 * 1024)) + " GB";
    }
    
    /**
     * Resource usage statistics.
     */
    public record ResourceUsage(long memoryBytes, int pids, long cpuUsageMicros) {
        @Override
        public String toString() {
            return String.format("Memory: %s, PIDs: %d, CPU: %.2fms",
                formatBytes(memoryBytes), pids, cpuUsageMicros / 1000.0);
        }
    }
}
```
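
The quota arithmetic behind `cpu.max` is easy to get wrong by a factor of 1000, so here is a small worked sketch of it. It assumes the 100ms period used in this chapter; `CpuMax` is a hypothetical helper, separate from `CgroupManager`.

```java
/** Hypothetical helper: converts between a core count and a cpu.max value. */
class CpuMax {
    static final long PERIOD_US = 100_000; // 100ms period, as used in this chapter

    /** Formats "QUOTA PERIOD" for a positive number of cores (0.5 = half a core). */
    static String format(double cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive, use unlimited() instead");
        }
        long quotaUs = Math.round(cores * PERIOD_US);
        return quotaUs + " " + PERIOD_US;
    }

    /** The value that removes the bandwidth limit. */
    static String unlimited() {
        return "max " + PERIOD_US;
    }

    /** Parses "QUOTA PERIOD" back into cores; "max" maps to positive infinity. */
    static double parseCores(String cpuMax) {
        String[] parts = cpuMax.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected \"QUOTA PERIOD\": " + cpuMax);
        }
        if (parts[0].equals("max")) {
            return Double.POSITIVE_INFINITY;
        }
        return (double) Long.parseLong(parts[0]) / Long.parseLong(parts[1]);
    }

    public static void main(String[] args) {
        for (double cores : new double[] {0.5, 1, 2}) {
            System.out.println(cores + " cores -> cpu.max \"" + format(cores) + "\"");
        }
        System.out.println("no limit  -> cpu.max \"" + unlimited() + "\"");
    }
}
```

Rejecting non-positive values up front keeps the "zero means unlimited" misunderstanding from reaching the kernel at all.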

***

## Part 2: Resource Limits Configuration

```java src/main/java/com/minidocker/cgroup/ResourceLimits.java theme={null}
package com.minidocker.cgroup;

/**
 * Container resource limits configuration.
 * 
 * Similar to Docker's --cpus, --memory, --pids-limit flags.
 */
public class ResourceLimits {
    
    private int cpuPercent = 100;           // 100 = 1 core
    private long memoryBytes = 512 * 1024 * 1024;  // 512MB default
    private int maxPids = 100;              // Max processes
    
    public static ResourceLimits defaults() {
        return new ResourceLimits();
    }
    
    public ResourceLimits withCpu(double cpus) {
        this.cpuPercent = (int) (cpus * 100);
        return this;
    }
    
    public ResourceLimits withMemory(String memory) {
        this.memoryBytes = parseMemory(memory);
        return this;
    }
    
    public ResourceLimits withMemoryBytes(long bytes) {
        this.memoryBytes = bytes;
        return this;
    }
    
    public ResourceLimits withMaxPids(int pids) {
        this.maxPids = pids;
        return this;
    }
    
    public int getCpuPercent() { return cpuPercent; }
    public long getMemoryBytes() { return memoryBytes; }
    public int getMaxPids() { return maxPids; }
    
    private static long parseMemory(String memory) {
        memory = memory.trim().toUpperCase();
        
        long multiplier = 1;
        if (memory.endsWith("K")) {
            multiplier = 1024;
            memory = memory.substring(0, memory.length() - 1);
        } else if (memory.endsWith("M")) {
            multiplier = 1024 * 1024;
            memory = memory.substring(0, memory.length() - 1);
        } else if (memory.endsWith("G")) {
            multiplier = 1024 * 1024 * 1024;
            memory = memory.substring(0, memory.length() - 1);
        }
        
        return Long.parseLong(memory) * multiplier;
    }
    
    @Override
    public String toString() {
        return String.format("ResourceLimits{cpu=%.2f cores, memory=%dMB, pids=%d}",
            cpuPercent / 100.0,
            memoryBytes / (1024 * 1024),
            maxPids);
    }
}
```
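
To make the unit arithmetic in `parseMemory` concrete, the sketch below shows one way a stricter parser could look. It is a standalone illustration, not the chapter's implementation: the class name `MemorySize`, the tolerance for `B`/`iB` suffixes, and the rejection of non-positive sizes are all assumptions made for the example.

```java
import java.util.Locale;

/** Hypothetical helper: parses sizes such as "256M", "1g" or "2GiB" into bytes. */
class MemorySize {

    static long parse(String text) {
        String s = text.trim().toUpperCase(Locale.ROOT);
        // Tolerate an optional "B" or "iB" after the unit letter (assumption of this sketch)
        if (s.endsWith("IB")) {
            s = s.substring(0, s.length() - 2);
        } else if (s.endsWith("B")) {
            s = s.substring(0, s.length() - 1);
        }
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty size: " + text);
        }
        // All units are binary: K = 2^10, M = 2^20, G = 2^30
        int shift = switch (s.charAt(s.length() - 1)) {
            case 'K' -> 10;
            case 'M' -> 20;
            case 'G' -> 30;
            default -> 0;
        };
        String digits = shift == 0 ? s : s.substring(0, s.length() - 1);
        // multiplyExact throws instead of silently overflowing
        long bytes = Math.multiplyExact(Long.parseLong(digits.trim()), 1L << shift);
        if (bytes <= 0) {
            throw new IllegalArgumentException("size must be positive: " + text);
        }
        return bytes;
    }

    public static void main(String[] args) {
        for (String input : new String[] {"512", "256M", "1g", "2GiB"}) {
            System.out.println(input + " -> " + parse(input) + " bytes");
        }
    }
}
```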

***

## Part 3: Integrating with Container

```java src/main/java/com/minidocker/Container.java theme={null}
package com.minidocker;

import com.minidocker.cgroup.CgroupManager;
import com.minidocker.cgroup.ResourceLimits;
import com.minidocker.linux.LibC;
import com.minidocker.namespace.NamespaceManager;
import com.minidocker.namespace.NamespaceOptions;

import java.util.UUID;

/**
 * Container with namespace isolation AND resource limits.
 */
public class Container {
    
    private final LibC libc = LibC.INSTANCE;
    private final NamespaceManager namespaces = new NamespaceManager();
    
    private final String id;
    private final String hostname;
    private final String[] command;
    private final ResourceLimits limits;
    
    private CgroupManager cgroup;
    
    public Container(String hostname, String[] command, ResourceLimits limits) {
        this.id = UUID.randomUUID().toString().substring(0, 12);
        this.hostname = hostname;
        this.command = command;
        this.limits = limits;
    }
    
    public void run() throws Exception {
        System.out.println("=== Starting Container " + id + " ===");
        System.out.println("Hostname: " + hostname);
        System.out.println("Limits: " + limits);
        System.out.println();
        
        // Step 1: Create cgroup
        cgroup = new CgroupManager(id);
        cgroup.create();
        
        // Step 2: Set resource limits
        cgroup.setCpuLimit(limits.getCpuPercent());
        cgroup.setMemoryLimit(limits.getMemoryBytes());
        cgroup.setPidsLimit(limits.getMaxPids());
        
        // Step 3: Create namespaces
        NamespaceOptions nsOptions = NamespaceOptions.builder()
            .withPid()
            .withMount()
            .withUts()
            .withIpc()
            .build();
        
        namespaces.createNamespaces(nsOptions);
        namespaces.setHostname(hostname);
        
        // Step 4: Fork
        int pid = libc.fork();
        
        if (pid == 0) {
            // Child: add to cgroup and run
            cgroup.addCurrentProcess();
            runContainerInit();
        } else if (pid > 0) {
            // Parent: wait and monitor
            monitorContainer(pid);
        } else {
            throw new RuntimeException("Fork failed");
        }
    }
    
    private void monitorContainer(int childPid) {
        // Start monitoring thread
        Thread monitor = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(1000);
                    CgroupManager.ResourceUsage usage = cgroup.getUsage();
                    System.out.println("[Monitor] " + usage);
                }
            } catch (Exception e) {
                // Container exited
            }
        });
        monitor.setDaemon(true);
        monitor.start();
        
        // Wait for child
        int[] status = new int[1];
        libc.waitpid(childPid, status, 0);
        
        System.out.println("Container exited with status: " + status[0]);
        
        // Cleanup
        try {
            cgroup.destroy();
        } catch (Exception e) {
            System.err.println("Failed to cleanup cgroup: " + e.getMessage());
        }
    }
    
    private void runContainerInit() {
        try {
            System.out.println("\n=== Container Init (PID " + libc.getpid() + ") ===");
            
            if (command.length > 0) {
                String[] argv = new String[command.length + 1];
                System.arraycopy(command, 0, argv, 0, command.length);
                argv[command.length] = null;
                
                libc.execv(command[0], argv);
                System.err.println("Failed to execute: " + command[0]);
                System.exit(1);
            }
        } catch (Exception e) {
            System.err.println("Container init failed: " + e.getMessage());
            System.exit(1);
        }
    }
    
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java Container <hostname> <command...>");
            System.out.println("       --cpus=<n>     CPU cores (e.g., 0.5, 2)");
            System.out.println("       --memory=<n>   Memory limit (e.g., 256M, 1G)");
            System.out.println("       --pids=<n>     Max processes");
            System.exit(1);
        }
        
        ResourceLimits limits = ResourceLimits.defaults();
        String hostname = null;
        String[] command = null;
        
        // Parse arguments
        int cmdStart = 0;
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("--cpus=")) {
                limits.withCpu(Double.parseDouble(args[i].substring(7)));
            } else if (args[i].startsWith("--memory=")) {
                limits.withMemory(args[i].substring(9));
            } else if (args[i].startsWith("--pids=")) {
                limits.withMaxPids(Integer.parseInt(args[i].substring(7)));
            } else if (hostname == null) {
                hostname = args[i];
            } else {
                cmdStart = i;
                break;
            }
        }
        
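        if (hostname != null && cmdStart > 0) {
            command = new String[args.length - cmdStart];
            System.arraycopy(args, cmdStart, command, 0, command.length);
        }
        
        // Without this check, a missing command surfaces later as a NullPointerException
        if (hostname == null || command == null) {
            System.err.println("Error: both <hostname> and <command...> are required");
            System.exit(1);
        }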
        
        try {
            Container container = new Container(hostname, command, limits);
            container.run();
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
            System.exit(1);
        }
    }
}
```
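
`monitorContainer` prints the raw integer that `waitpid` fills in, which packs the exit code and the terminating signal into one value. The sketch below decodes it, assuming the Linux wait-status encoding (terminating signal in the low 7 bits, exit code in the next byte); `WaitStatus` is a hypothetical helper and not part of the chapter's code.

```java
/** Hypothetical helper: decodes the status integer produced by waitpid on Linux. */
class WaitStatus {

    /** Signal that terminated the process, or 0 if it exited on its own. */
    static int termSignal(int status) {
        return status & 0x7f;
    }

    static boolean exitedNormally(int status) {
        return termSignal(status) == 0;
    }

    /** Exit code passed to exit(); only meaningful when exitedNormally() is true. */
    static int exitCode(int status) {
        return (status >> 8) & 0xff;
    }

    static String describe(int status) {
        if (exitedNormally(status)) {
            return "exited with code " + exitCode(status);
        }
        int signal = termSignal(status);
        String hint = signal == 9 ? " (SIGKILL; the OOM killer uses this signal)" : "";
        return "killed by signal " + signal + hint;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1 << 8, 9};
        for (int status : samples) {
            System.out.println(status + " -> " + describe(status));
        }
    }
}
```

Decoding the status makes an OOM kill distinguishable from an ordinary non-zero exit, which becomes useful once the memory limit is being tested.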

***

## Part 4: Testing Resource Limits

### Testing CPU Limits

```bash theme={null}
# Run a CPU-intensive workload with 50% CPU limit
sudo java Container --cpus=0.5 testhost /bin/sh -c "while true; do :; done"

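# In another terminal, observe CPU usage of the busy loop (should hover around 50%)
# -d, joins multiple matching PIDs with commas, which is the format top -p expects
top -p "$(pgrep -d, -f 'while true')"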
```
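
The same check can be made from the cgroup's own accounting rather than from `top`. The sketch below samples `usage_usec` from a `cpu.stat` file twice and converts the difference into a percentage of one core. `CpuUsageSampler` is a hypothetical name, and the path to `cpu.stat` is passed in as an argument because it depends on the container ID.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: derives CPU utilisation from two cpu.stat samples. */
class CpuUsageSampler {

    /** Returns the value of a "key value" line in cpu.stat, or -1 if the key is absent. */
    static long readCounter(String cpuStat, String key) {
        for (String line : cpuStat.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals(key)) {
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    /** CPU time consumed between two samples, as a percentage of one core. */
    static double percentOfOneCore(long usageBefore, long usageAfter, long intervalMicros) {
        return 100.0 * (usageAfter - usageBefore) / intervalMicros;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java CpuUsageSampler.java <path-to-cpu.stat>");
            return;
        }
        Path stat = Path.of(args[0]);
        long startNanos = System.nanoTime();
        long before = readCounter(Files.readString(stat), "usage_usec");
        Thread.sleep(1000);
        long after = readCounter(Files.readString(stat), "usage_usec");
        long elapsedMicros = (System.nanoTime() - startNanos) / 1000;
        System.out.printf("CPU: %.1f%% of one core%n",
                percentOfOneCore(before, after, elapsedMicros));
    }
}
```

With `--cpus=0.5` the reported value should settle near 50, since the busy loop always wants more CPU than its quota allows.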

### Testing Memory Limits

```java theme={null}
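// MemoryHog.java - For testing memory limits
// Note: the JVM caps its own heap, so give it a heap larger than the cgroup
// limit (e.g. java -Xmx2g MemoryHog). Otherwise you will see a Java
// OutOfMemoryError instead of a kernel OOM kill.
import java.util.ArrayList;
import java.util.List;

public class MemoryHog {
    public static void main(String[] args) throws Exception {
        List<byte[]> allocations = new ArrayList<>();
        
        // The container is OOM-killed once usage hits the memory limit
        while (true) {
            // Allocate 10MB chunks
            byte[] chunk = new byte[10 * 1024 * 1024];
            allocations.add(chunk);
            System.out.println("Allocated: " + (allocations.size() * 10) + "MB");
            Thread.sleep(100);
        }
    }
}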
```
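
To confirm that a process really died from the cgroup limit, the kernel's `memory.events` file keeps a running count of OOM kills. Here is a minimal parsing sketch; `MemoryEvents` is a hypothetical helper, and the sample text in `main` is made-up illustrative data in the file's `key value` format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: reads counters from the contents of memory.events. */
class MemoryEvents {

    /** Parses "key value" lines into a map, skipping anything malformed. */
    static Map<String, Long> parse(String contents) {
        Map<String, Long> events = new LinkedHashMap<>();
        for (String line : contents.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                events.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return events;
    }

    /** Number of processes in the cgroup killed by the OOM killer so far. */
    static long oomKills(String contents) {
        return parse(contents).getOrDefault("oom_kill", 0L);
    }

    public static void main(String[] args) {
        // Illustrative sample, not captured from a real run
        String sample = "low 0\nhigh 12\nmax 3\noom 1\noom_kill 1\n";
        System.out.println("Events: " + parse(sample));
        System.out.println("OOM kills: " + oomKills(sample));
    }
}
```

Note that the counter has to be read before the cgroup directory is removed, because the file disappears with it.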

### Testing PID Limits

```bash theme={null}
# Fork bomb protection!
# Run this ONLY inside the container's shell. On the host, or without a pids
# limit, it would exhaust the process table and can hang the system:
:(){ :|:& };:

# With pids limit set to 100, the fork bomb is contained
```

***

## Understanding Cgroup Controllers

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      CGROUP CONTROLLERS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   CONTROLLER    FILES               DESCRIPTION                             │
│   ──────────    ─────               ───────────                             │
│                                                                              │
│   cpu           cpu.max             Bandwidth limit (quota/period)          │
│                 cpu.weight          Relative CPU shares (1-10000)           │
│                 cpu.stat            Usage statistics                        │
│                                                                              │
│   memory        memory.max          Hard limit (OOM kill if exceeded)       │
│                 memory.high         Soft limit (throttling starts)          │
│                 memory.current      Current usage                           │
│                 memory.swap.max     Swap limit                              │
│                 memory.stat         Detailed statistics                     │
│                                                                              │
│   pids          pids.max            Maximum processes                       │
│                 pids.current        Current process count                   │
│                                                                              │
│   io            io.max              Bandwidth/IOPS limits per device        │
│                 io.weight           Relative I/O priority                   │
│                 io.stat             I/O statistics                          │
│                                                                              │
│   cpuset        cpuset.cpus         Allowed CPU cores (e.g., "0-3")         │
│                 cpuset.mems         Allowed NUMA nodes                      │
│                                                                              │
│   hugetlb       hugetlb.X.max       Huge page limits                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
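
The table lists `memory.high` as a soft limit and `memory.max` as the hard one. A common way to use the pair is to place the soft limit a fixed fraction below the hard limit, so throttling starts before the OOM killer does. The sketch below shows only that arithmetic; `MemoryThresholds` and the 90% figure are illustrative assumptions, not values taken from this chapter's code.

```java
/** Hypothetical helper: derives a memory.high value from a memory.max value. */
class MemoryThresholds {

    /** The two values that would be written to memory.high and memory.max. */
    record Thresholds(long highBytes, long maxBytes) {}

    /**
     * @param maxBytes     hard limit in bytes
     * @param highFraction soft limit as a fraction of the hard limit, strictly between 0 and 1
     */
    static Thresholds withHeadroom(long maxBytes, double highFraction) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        if (!(highFraction > 0 && highFraction < 1)) {
            throw new IllegalArgumentException("highFraction must be between 0 and 1");
        }
        return new Thresholds(Math.round(maxBytes * highFraction), maxBytes);
    }

    public static void main(String[] args) {
        Thresholds t = withHeadroom(512L * 1024 * 1024, 0.9);
        System.out.println("memory.high <- " + t.highBytes());
        System.out.println("memory.max  <- " + t.maxBytes());
    }
}
```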

***

## Exercises

<Accordion title="Exercise 1: Implement I/O Throttling" icon="hard-drive">
  Add disk I/O limits:

  ```java theme={null}
  // 1. Find the device major:minor for the disk
  // cat /sys/block/sda/dev

  // 2. Set io.max
  // "8:0 rbps=1048576 wbps=1048576" = 1MB/s read/write

  // 3. Test with dd:
  // dd if=/dev/zero of=/tmp/test bs=1M count=100
  ```
</Accordion>
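
As a possible starting point for Exercise 1, the sketch below builds the line that goes into `io.max`. It assumes you have already read the `major:minor` pair from `/sys/block/<device>/dev`; `IoMax` is a hypothetical helper, and using a negative number to mean "no limit" is a convention of this sketch only.

```java
/** Hypothetical helper: formats a line for the io.max file. */
class IoMax {
    /** Convention of this sketch: any negative value means "no limit". */
    static final long UNLIMITED = -1;

    /**
     * @param majorMinor device ID such as "8:0"
     * @param rbps       read bytes per second, or UNLIMITED
     * @param wbps       write bytes per second, or UNLIMITED
     */
    static String format(String majorMinor, long rbps, long wbps) {
        String device = majorMinor.trim();
        if (!device.matches("\\d+:\\d+")) {
            throw new IllegalArgumentException("expected MAJOR:MINOR, got: " + majorMinor);
        }
        return device + " rbps=" + value(rbps) + " wbps=" + value(wbps);
    }

    /** The kernel accepts the literal "max" to clear a limit. */
    private static String value(long bytesPerSecond) {
        return bytesPerSecond < 0 ? "max" : Long.toString(bytesPerSecond);
    }

    public static void main(String[] args) {
        // 1 MB/s in both directions on device 8:0
        System.out.println(format("8:0", 1024 * 1024, 1024 * 1024));
        // Limit writes only
        System.out.println(format("8:0", UNLIMITED, 1024 * 1024));
    }
}
```

Trimming the device string lets you pass the contents of the `dev` file directly, including its trailing newline.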

<Accordion title="Exercise 2: Implement CPU Pinning" icon="microchip">
  Pin container to specific CPU cores:

  ```java theme={null}
  // Use cpuset controller
  // Write to cpuset.cpus: "0,2" (only use cores 0 and 2)
  // Write to cpuset.mems: "0" (NUMA node 0)

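  // Pinning keeps the container's threads on the same cores, which avoids
  // migrations and keeps caches warm. It only gives exclusive use of those
  // cores if no other cgroup is allowed to run on them.
  // Great for latency-sensitive workloads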
  ```
</Accordion>
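
For Exercise 2, it helps to validate the core list before writing it to `cpuset.cpus`. The sketch below parses the basic list syntax (comma-separated core numbers and `a-b` ranges) into a sorted set. `CpusetList` is a hypothetical helper, and it deliberately handles only that basic syntax.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper: parses cpuset lists such as "0,2" or "0-3,8". */
class CpusetList {

    static SortedSet<Integer> parse(String list) {
        SortedSet<Integer> cpus = new TreeSet<>();
        String s = list.trim();
        if (s.isEmpty()) {
            return cpus; // an empty file means no cores are assigned
        }
        for (String part : s.split(",")) {
            String[] range = part.trim().split("-");
            if (range.length > 2) {
                throw new IllegalArgumentException("bad range: " + part);
            }
            int first = Integer.parseInt(range[0]);
            int last = range.length == 2 ? Integer.parseInt(range[1]) : first;
            if (last < first) {
                throw new IllegalArgumentException("descending range: " + part);
            }
            for (int cpu = first; cpu <= last; cpu++) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }

    public static void main(String[] args) {
        System.out.println("0,2   -> " + parse("0,2"));
        System.out.println("0-3,8 -> " + parse("0-3,8"));
    }
}
```

The parsed set can then be compared against `Runtime.getRuntime().availableProcessors()` or the parent cgroup's `cpuset.cpus.effective` before anything is written.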

<Accordion title="Exercise 3: Add Resource Monitoring" icon="chart-line">
  Implement real-time resource monitoring:

  ```java theme={null}
  // 1. Read cpu.stat periodically for CPU usage
  // 2. Read memory.current and memory.stat for memory
  // 3. Read io.stat for disk I/O
  // 4. Calculate deltas to get rates
  // 5. Display like 'docker stats'
  ```
</Accordion>

***

## Key Takeaways

<CardGroup cols={2}>
  <Card title="Hierarchical" icon="sitemap">
    Cgroups form a tree - a child can never exceed the limits of its ancestors
  </Card>

  <Card title="Controller-Based" icon="sliders">
    Different controllers for CPU, memory, I/O, PIDs
  </Card>

  <Card title="Kernel Enforcement" icon="shield">
    Limits are enforced by the kernel, not by an honor system
  </Card>

  <Card title="OOM Killer" icon="skull">
    Exceeding the memory limit triggers the OOM killer for that cgroup
  </Card>
</CardGroup>

***

## What's Next?

In [Chapter 3: Filesystem](/courses/build-your-own-x/docker-3-filesystem), we'll implement:

* Overlay filesystems
* Copy-on-write layers
* Container root filesystem setup

<Card title="Next: Filesystem" icon="arrow-right" href="/courses/build-your-own-x/docker-3-filesystem">
  Build the layered filesystem
</Card>

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="Explain the difference between cgroups v1 and v2. Why did the kernel need a new version?">
    **Strong Answer:**

    * Cgroups v1 had a per-controller hierarchy: each controller (cpu, memory, pids) had its own independent tree, and a process could be in different groups for different controllers. This created incoherent resource accounting -- a process could be in cgroup A for CPU but cgroup B for memory, making it impossible to get a unified view of resource consumption.
    * Cgroups v2 uses a single unified hierarchy. A process belongs to exactly one cgroup, and all controllers apply to that cgroup. This makes resource accounting consistent and simplifies management tooling. It also enables cross-controller coordination -- for example, the memory controller can influence the I/O controller's behavior for a process, which was impossible in v1.
    * V2 also introduced the "no internal processes" rule: if a cgroup has child cgroups, the parent cannot directly contain processes. This eliminates ambiguity about which level of the hierarchy should be charged for resource usage.
    * The migration took years because systemd, Docker, and Kubernetes all had deep dependencies on v1 semantics. The practical interview relevance is that debugging resource limits in production requires knowing which cgroup version the host is running, because the filesystem paths and file formats differ significantly.

    **Follow-up: How do you check which cgroup version a system is running?**

    Check with `stat -fc %T /sys/fs/cgroup/` -- if it returns `cgroup2fs`, the system uses v2; `tmpfs` indicates v1. In v1, you find CPU limits at `/sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us`. In v2, it is `/sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.max`. The file format also differs: v1 uses separate files for quota and period; v2 combines them into one file (`cpu.max`).
  </Accordion>

  <Accordion title="How does the OOM killer decide which process to kill inside a container, and how would you tune it for a production service?">
    **Strong Answer:**

    * When a cgroup hits its `memory.max` limit, the kernel's OOM killer activates for that cgroup specifically (it does not kill processes outside the cgroup). It selects the victim by calculating an `oom_score` for each process based on RSS, swap usage, and the `oom_score_adj` value.
    * For production tuning, set `memory.high` to a value below `memory.max` (e.g., high at 900MB, max at 1GB). When usage crosses `memory.high`, the kernel throttles memory allocation instead of killing, giving the application a chance to shed load or run garbage collection. This turns a hard crash into a soft slowdown.
    * Also configure `oom_score_adj` so the right process dies if OOM is unavoidable. Setting the main process to `-500` and helper sidecars to `0` makes the sidecars the preferred victims, although the kernel still weighs each process's memory usage. And critically, monitor `memory.events` to alert on `oom_kill` events before they cascade.
    * A subtle point: application-level memory metrics (like JVM heap usage) only show user-space allocations. The kernel counts *all* memory charged to the cgroup, including page cache, tmpfs mounts, and kernel stack pages. A container writing heavily to tmpfs will trigger OOM even though the application reports low heap usage.

    **Follow-up: What is the difference between a cgroup OOM and a host-level OOM? Which is worse operationally?**

    A cgroup OOM is scoped and controlled -- only processes in that cgroup die. A host-level OOM means the entire machine's memory is exhausted, and the kernel's global OOM killer activates, often killing the wrong process. Host OOM is operationally catastrophic because it cascades across unrelated services. This is precisely why cgroup memory limits exist -- they convert a global disaster into a local, contained failure.
  </Accordion>

  <Accordion title="What is CPU throttling in CFS bandwidth control, and why does it cause latency spikes even when average CPU usage is low?">
    **Strong Answer:**

    * CFS bandwidth control gives each cgroup a quota of CPU time per period (default 100ms). If `cpu.max` is `50000 100000`, the container gets 50ms of CPU time per 100ms period.
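    * The latency spike occurs when the container uses its entire 50ms quota in a burst at the start of the period. For the remaining 50ms, threads are suspended even if the CPU is idle. A request arriving just after the quota runs out waits up to 50ms in this example, and a multi-threaded service that burns its quota on several cores at once can be stalled for nearly the full 100ms period. Either way, the latency is artificial.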
    * This is measurable via `cpu.stat` -- look for `nr_throttled` and `throttled_usec`. A container with low average CPU usage can still show thousands of throttle events if usage is bursty.
    * Mitigation strategies: increase the CPU limit for burst headroom, reduce the CFS period (smaller periods like 10ms distribute throttling more evenly but increase scheduling overhead), or remove the quota and use `cpuset.cpus` to pin the container to dedicated cores, so it is bounded by the cores it owns rather than by the bandwidth controller.

    **Follow-up: Why do some Kubernetes operators recommend removing CPU limits entirely?**

    The argument is that CFS throttling causes more damage to latency-sensitive services than noisy neighbors do. If the cluster has accurate CPU requests and nodes are not overprovisioned, the scheduler places pods so total requested CPU does not exceed node capacity. Without limits, a pod can burst above its request when CPU is available. The counterargument is that without limits, a misbehaving pod can starve neighbors during peak load. The right answer depends on workload characteristics: for steady-state services with predictable CPU profiles, removing limits works well; for batch jobs or untrusted workloads, limits are necessary.
  </Accordion>
</AccordionGroup>
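
The throttling counters mentioned in the last answer can be turned into a simple health signal: the fraction of scheduler periods in which the cgroup was throttled. The sketch below computes it from two `cpu.stat` snapshots. `ThrottleStats` is a hypothetical helper, and the sample text in `main` is made-up illustrative data.

```java
/** Hypothetical helper: measures how often a cgroup was throttled between two samples. */
class ThrottleStats {

    record Snapshot(long nrPeriods, long nrThrottled, long throttledUsec) {}

    /** Extracts the bandwidth-control counters from the contents of cpu.stat. */
    static Snapshot parse(String cpuStat) {
        long periods = 0, throttled = 0, throttledUsec = 0;
        for (String line : cpuStat.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue;
            }
            switch (parts[0]) {
                case "nr_periods" -> periods = Long.parseLong(parts[1]);
                case "nr_throttled" -> throttled = Long.parseLong(parts[1]);
                case "throttled_usec" -> throttledUsec = Long.parseLong(parts[1]);
                default -> { /* other counters are not needed here */ }
            }
        }
        return new Snapshot(periods, throttled, throttledUsec);
    }

    /** Fraction (0.0 to 1.0) of elapsed periods in which the cgroup hit its quota. */
    static double throttledFraction(Snapshot before, Snapshot after) {
        long periods = after.nrPeriods() - before.nrPeriods();
        if (periods <= 0) {
            return 0.0;
        }
        return (double) (after.nrThrottled() - before.nrThrottled()) / periods;
    }

    public static void main(String[] args) {
        // Illustrative samples, not captured from a real run
        Snapshot before = parse("nr_periods 100\nnr_throttled 10\nthrottled_usec 5000\n");
        Snapshot after = parse("nr_periods 200\nnr_throttled 35\nthrottled_usec 90000\n");
        System.out.printf("Throttled in %.0f%% of periods%n",
                100 * throttledFraction(before, after));
    }
}
```

Working with deltas rather than absolute counters matters because the values in `cpu.stat` accumulate for the lifetime of the cgroup.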
