
Chapter 2: Control Groups (cgroups)

While namespaces provide isolation, cgroups provide resource limits. Without cgroups, a container could consume all system resources. Let’s implement resource limiting!
Prerequisites: Chapter 1: Namespaces
Further Reading: Operating Systems: Resource Management
Time: 3-4 hours
Outcome: Containers with CPU, memory, and process limits

What Are Cgroups?

┌─────────────────────────────────────────────────────────────────────────────┐
│                         WITHOUT CGROUPS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SYSTEM: 8 CPU cores, 16GB RAM                                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A        Container B        Container C                 │  │
│   │   ┌──────────┐      ┌──────────┐       ┌──────────┐                │  │
│   │   │ Uses 6   │      │ Uses 1.5 │       │ Uses 0.5 │                │  │
│   │   │ CPUs,    │      │ CPUs,    │       │ CPUs,    │                │  │
│   │   │ 14GB RAM │      │ 1GB RAM  │       │ 1GB RAM  │                │  │
│   │   └──────────┘      └──────────┘       └──────────┘                │  │
│   │                                                                      │  │
│   │   Container A is a "noisy neighbor" - starving others!              │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WITH CGROUPS                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   SYSTEM: 8 CPU cores, 16GB RAM                                             │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A        Container B        Container C                 │  │
│   │   ┌──────────┐      ┌──────────┐       ┌──────────┐                │  │
│   │   │ LIMIT:   │      │ LIMIT:   │       │ LIMIT:   │                │  │
│   │   │ 2 CPUs   │      │ 2 CPUs   │       │ 2 CPUs   │                │  │
│   │   │ 4GB RAM  │      │ 4GB RAM  │       │ 4GB RAM  │                │  │
│   │   │ 100 procs│      │ 100 procs│       │ 100 procs│                │  │
│   │   └──────────┘      └──────────┘       └──────────┘                │  │
│   │                                                                      │  │
│   │   Fair resource allocation! Predictable performance!                │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Cgroup v2 Hierarchy

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CGROUP V2 FILESYSTEM                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   /sys/fs/cgroup/                         ← Cgroup root                     │
│   ├── cgroup.controllers                  ← Available controllers           │
│   ├── cgroup.subtree_control              ← Enabled controllers             │
│   ├── cgroup.procs                        ← PIDs in the root cgroup         │
│   │   (cpu.max and memory.max exist only in child cgroups)                  │
│   │                                                                         │
│   ├── minidocker/                         ← Our container group            │
│   │   ├── cgroup.controllers                                                │
│   │   ├── cgroup.procs                    ← PIDs in this group             │
│   │   ├── cpu.max                         ← CPU limit (quota period)       │
│   │   ├── cpu.weight                      ← CPU shares                     │
│   │   ├── memory.max                      ← Memory limit (bytes)           │
│   │   ├── memory.current                  ← Current memory usage           │
│   │   ├── pids.max                        ← Max processes                  │
│   │   └── pids.current                    ← Current process count          │
│   │                                                                         │
│   └── docker/                             ← Docker's groups                 │
│       ├── container-abc123/                                                 │
│       └── container-def456/                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
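The `cgroup.controllers` and `cgroup.subtree_control` files in the tree above each hold a space-separated list of controller names. As a rough sketch of how a tool might work out which controllers still need enabling, the hypothetical `ControllerCheck` class below parses such a list and builds the `+name` string to write back. The class and method names are illustrative assumptions, and the sample contents stand in for a real read of `/sys/fs/cgroup/cgroup.subtree_control`.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the chapter's code.
class ControllerCheck {

    /** Parses the space-separated contents of cgroup.controllers or cgroup.subtree_control. */
    static Set<String> parse(String fileContents) {
        Set<String> names = new LinkedHashSet<>();
        for (String name : fileContents.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Builds the value to write to cgroup.subtree_control for controllers not yet enabled. */
    static String enableCommand(Set<String> alreadyEnabled, List<String> wanted) {
        return wanted.stream()
                .filter(c -> !alreadyEnabled.contains(c))
                .map(c -> "+" + c)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Sample contents (assumption); on a real host, read the file instead.
        Set<String> enabled = parse("memory pids\n");
        System.out.println(enableCommand(enabled, List.of("cpu", "memory", "pids", "io")));
        // prints "+cpu +io"
    }
}
```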

Part 1: Cgroup Manager

src/main/java/com/minidocker/cgroup/CgroupManager.java
package com.minidocker.cgroup;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Manages cgroup v2 resource limits.
 * 
 * Cgroups control:
 * - CPU: How much CPU time a container can use
 * - Memory: Maximum memory allocation
 * - PIDs: Maximum number of processes
 * - I/O: Disk bandwidth limits
 */
public class CgroupManager {
    
    private static final Path CGROUP_ROOT = Path.of("/sys/fs/cgroup");
    private static final String CGROUP_NAME = "minidocker";
    
    private final String containerId;
    private final Path cgroupPath;
    
    public CgroupManager(String containerId) {
        this.containerId = containerId;
        this.cgroupPath = CGROUP_ROOT.resolve(CGROUP_NAME).resolve(containerId);
    }
    
    /**
     * Creates the cgroup for this container.
     */
    public void create() throws IOException {
        // Create the cgroup directory (and the parent "minidocker" group if needed)
        Files.createDirectories(cgroupPath);
        System.out.println("✓ Created cgroup: " + cgroupPath);
        
        // A controller is only usable in a cgroup if every ancestor lists it in
        // cgroup.subtree_control, so enable top-down: root first, then the parent.
        String controllers = "+cpu +memory +pids +io";
        for (Path dir : new Path[] { CGROUP_ROOT, cgroupPath.getParent() }) {
            Path subtreeControl = dir.resolve("cgroup.subtree_control");
            if (Files.exists(subtreeControl)) {
                Files.writeString(subtreeControl, controllers);
            }
        }
        System.out.println("✓ Enabled controllers: cpu, memory, pids, io");
    }
    
    /**
     * Sets CPU limit.
     * 
     * @param cpuPercent CPU percentage (100 = 1 core, 200 = 2 cores)
     */
    public void setCpuLimit(int cpuPercent) throws IOException {
        // cpu.max format: "$QUOTA $PERIOD"
        // QUOTA: microseconds of CPU time per PERIOD
        // PERIOD: typically 100000 (100ms)
        
        int period = 100000;  // 100ms
        int quota = period * cpuPercent / 100;
        
        Path cpuMaxPath = cgroupPath.resolve("cpu.max");
        String value = quota + " " + period;
        Files.writeString(cpuMaxPath, value);
        
        System.out.println("✓ CPU limit set: " + cpuPercent + "% (" + value + ")");
    }
    
    /**
     * Sets memory limit.
     * 
     * @param memoryBytes Maximum memory in bytes
     */
    public void setMemoryLimit(long memoryBytes) throws IOException {
        Path memoryMaxPath = cgroupPath.resolve("memory.max");
        Files.writeString(memoryMaxPath, String.valueOf(memoryBytes));
        
        // Also set swap limit to prevent swap usage
        Path memorySwapMaxPath = cgroupPath.resolve("memory.swap.max");
        if (Files.exists(memorySwapMaxPath)) {
            Files.writeString(memorySwapMaxPath, "0");
        }
        
        String humanReadable = formatBytes(memoryBytes);
        System.out.println("✓ Memory limit set: " + humanReadable);
    }
    
    /**
     * Sets process (PID) limit.
     * 
     * Prevents fork bombs!
     * 
     * @param maxPids Maximum number of processes
     */
    public void setPidsLimit(int maxPids) throws IOException {
        Path pidsMaxPath = cgroupPath.resolve("pids.max");
        Files.writeString(pidsMaxPath, String.valueOf(maxPids));
        
        System.out.println("✓ PIDs limit set: " + maxPids);
    }
    
    /**
     * Sets I/O bandwidth limits.
     * 
     * @param deviceMajorMinor Device ID (e.g., "8:0" for /dev/sda)
     * @param rbps Read bytes per second
     * @param wbps Write bytes per second
     */
    public void setIOLimit(String deviceMajorMinor, long rbps, long wbps) throws IOException {
        Path ioMaxPath = cgroupPath.resolve("io.max");
        String value = deviceMajorMinor + " rbps=" + rbps + " wbps=" + wbps;
        Files.writeString(ioMaxPath, value);
        
        System.out.println("✓ I/O limit set for " + deviceMajorMinor + 
                          ": read=" + formatBytes(rbps) + "/s, write=" + formatBytes(wbps) + "/s");
    }
    
    /**
     * Adds a process to this cgroup.
     * 
     * @param pid Process ID to add
     */
    public void addProcess(long pid) throws IOException {
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        Files.writeString(procsPath, String.valueOf(pid));
        
        System.out.println("✓ Added PID " + pid + " to cgroup");
    }
    
    /**
     * Adds the current process to this cgroup.
     */
    public void addCurrentProcess() throws IOException {
        addProcess(ProcessHandle.current().pid());
    }
    
    /**
     * Gets current resource usage.
     */
    public ResourceUsage getUsage() throws IOException {
        long memoryBytes = 0;
        int pids = 0;
        long cpuUsage = 0;
        
        Path memoryCurrent = cgroupPath.resolve("memory.current");
        if (Files.exists(memoryCurrent)) {
            memoryBytes = Long.parseLong(Files.readString(memoryCurrent).trim());
        }
        
        Path pidsCurrent = cgroupPath.resolve("pids.current");
        if (Files.exists(pidsCurrent)) {
            pids = Integer.parseInt(Files.readString(pidsCurrent).trim());
        }
        
        Path cpuStat = cgroupPath.resolve("cpu.stat");
        if (Files.exists(cpuStat)) {
            String stats = Files.readString(cpuStat);
            for (String line : stats.split("\n")) {
                if (line.startsWith("usage_usec")) {
                    cpuUsage = Long.parseLong(line.trim().split("\\s+")[1]);
                }
            }
        }
        
        return new ResourceUsage(memoryBytes, pids, cpuUsage);
    }
    
    /**
     * Removes the cgroup (cleanup).
     */
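    public void destroy() throws IOException {
        // First, kill all processes still in the cgroup
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        if (Files.exists(procsPath)) {
            String procs = Files.readString(procsPath);
            for (String pid : procs.split("\n")) {
                if (!pid.isBlank()) {
                    // Use SIGKILL: a container's init process (PID 1 in its namespace)
                    // ignores SIGTERM from outside unless it installed a handler
                    ProcessHandle.of(Long.parseLong(pid.trim()))
                                 .ifPresent(ProcessHandle::destroyForcibly);
                }
            }
        }
        
        // Then remove the directory. The kernel rejects this with EBUSY while any
        // process is still in the cgroup, so it can fail until they have all exited.
        Files.deleteIfExists(cgroupPath);
        System.out.println("✓ Destroyed cgroup: " + cgroupPath);
    }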
    
    private static String formatBytes(long bytes) {
        if (bytes < 1024) return bytes + " B";
        if (bytes < 1024 * 1024) return (bytes / 1024) + " KB";
        if (bytes < 1024 * 1024 * 1024) return (bytes / (1024 * 1024)) + " MB";
        return (bytes / (1024 * 1024 * 1024)) + " GB";
    }
    
    /**
     * Resource usage statistics.
     */
    public record ResourceUsage(long memoryBytes, int pids, long cpuUsageMicros) {
        @Override
        public String toString() {
            return String.format("Memory: %s, PIDs: %d, CPU: %.2fms",
                formatBytes(memoryBytes), pids, cpuUsageMicros / 1000.0);
        }
    }
}
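The `cpu.max` arithmetic in `setCpuLimit` is easy to check by hand: with a 100 ms period, half a core is a quota of 50000 µs. The sketch below works through that calculation and the reverse direction, including the literal `max` the kernel reports when no limit is set. `CpuMax` and its methods are hypothetical names used only for illustration.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not part of CgroupManager.
class CpuMax {
    static final long DEFAULT_PERIOD_USEC = 100_000; // 100 ms

    /** Formats a cpu.max value for a number of cores, e.g. 0.5 -> "50000 100000". */
    static String format(double cpus) {
        if (cpus <= 0) {
            throw new IllegalArgumentException("cpus must be positive");
        }
        long quota = Math.round(cpus * DEFAULT_PERIOD_USEC);
        return quota + " " + DEFAULT_PERIOD_USEC;
    }

    /** Parses a cpu.max value back into cores; "max <period>" means no limit. */
    static OptionalDouble parse(String contents) {
        String[] parts = contents.trim().split("\\s+");
        if (parts[0].equals("max")) {
            return OptionalDouble.empty();
        }
        long quota = Long.parseLong(parts[0]);
        long period = Long.parseLong(parts[1]);
        return OptionalDouble.of((double) quota / period);
    }

    public static void main(String[] args) {
        System.out.println(format(0.5));            // 50000 100000
        System.out.println(format(2));              // 200000 100000
        System.out.println(parse("max 100000"));    // OptionalDouble.empty
        System.out.println(parse("150000 100000")); // OptionalDouble[1.5]
    }
}
```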

Part 2: Resource Limits Configuration

src/main/java/com/minidocker/cgroup/ResourceLimits.java
package com.minidocker.cgroup;

/**
 * Container resource limits configuration.
 * 
 * Similar to Docker's --cpus, --memory, --pids-limit flags.
 */
public class ResourceLimits {
    
    private int cpuPercent = 100;           // 100 = 1 core
    private long memoryBytes = 512 * 1024 * 1024;  // 512MB default
    private int maxPids = 100;              // Max processes
    
    public static ResourceLimits defaults() {
        return new ResourceLimits();
    }
    
    public ResourceLimits withCpu(double cpus) {
        this.cpuPercent = (int) (cpus * 100);
        return this;
    }
    
    public ResourceLimits withMemory(String memory) {
        this.memoryBytes = parseMemory(memory);
        return this;
    }
    
    public ResourceLimits withMemoryBytes(long bytes) {
        this.memoryBytes = bytes;
        return this;
    }
    
    public ResourceLimits withMaxPids(int pids) {
        this.maxPids = pids;
        return this;
    }
    
    public int getCpuPercent() { return cpuPercent; }
    public long getMemoryBytes() { return memoryBytes; }
    public int getMaxPids() { return maxPids; }
    
    private static long parseMemory(String memory) {
        memory = memory.trim().toUpperCase();
        
        long multiplier = 1;
        if (memory.endsWith("K")) {
            multiplier = 1024;
            memory = memory.substring(0, memory.length() - 1);
        } else if (memory.endsWith("M")) {
            multiplier = 1024 * 1024;
            memory = memory.substring(0, memory.length() - 1);
        } else if (memory.endsWith("G")) {
            multiplier = 1024 * 1024 * 1024;
            memory = memory.substring(0, memory.length() - 1);
        }
        
        return Long.parseLong(memory) * multiplier;
    }
    
    @Override
    public String toString() {
        return String.format("ResourceLimits{cpu=%.2f cores, memory=%dMB, pids=%d}",
            cpuPercent / 100.0,
            memoryBytes / (1024 * 1024),
            maxPids);
    }
}
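`parseMemory` above accepts values such as `256M` or `1G`, but it rejects `512MB` with a `NumberFormatException` and silently wraps around on very large inputs. One possible stricter variant is sketched below. `MemorySize` is a hypothetical class, and accepting an optional trailing `B` is an assumption about what users might type, not something Docker's flag parsing is claimed to do.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stricter parser, shown as an alternative sketch.
class MemorySize {
    // Digits, optional K/M/G unit, optional trailing "B" (e.g. "512", "256m", "4GB").
    private static final Pattern SIZE =
            Pattern.compile("(\\d+)\\s*([KMG]?)B?", Pattern.CASE_INSENSITIVE);

    static long parse(String text) {
        Matcher m = SIZE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid memory size: " + text);
        }
        long value = Long.parseLong(m.group(1));
        long multiplier = switch (m.group(2).toUpperCase()) {
            case "K" -> 1024L;
            case "M" -> 1024L * 1024;
            case "G" -> 1024L * 1024 * 1024;
            default -> 1L;
        };
        // multiplyExact throws ArithmeticException instead of overflowing silently
        return Math.multiplyExact(value, multiplier);
    }

    public static void main(String[] args) {
        System.out.println(parse("256M")); // 268435456
        System.out.println(parse("4GB"));  // 4294967296
        System.out.println(parse("512"));  // 512
    }
}
```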

Part 3: Integrating with Container

src/main/java/com/minidocker/Container.java
package com.minidocker;

import com.minidocker.cgroup.CgroupManager;
import com.minidocker.cgroup.ResourceLimits;
import com.minidocker.linux.LibC;
import com.minidocker.namespace.NamespaceManager;
import com.minidocker.namespace.NamespaceOptions;

import java.util.UUID;

/**
 * Container with namespace isolation AND resource limits.
 */
public class Container {
    
    private final LibC libc = LibC.INSTANCE;
    private final NamespaceManager namespaces = new NamespaceManager();
    
    private final String id;
    private final String hostname;
    private final String[] command;
    private final ResourceLimits limits;
    
    private CgroupManager cgroup;
    
    public Container(String hostname, String[] command, ResourceLimits limits) {
        this.id = UUID.randomUUID().toString().substring(0, 12);
        this.hostname = hostname;
        this.command = command;
        this.limits = limits;
    }
    
    public void run() throws Exception {
        System.out.println("=== Starting Container " + id + " ===");
        System.out.println("Hostname: " + hostname);
        System.out.println("Limits: " + limits);
        System.out.println();
        
        // Step 1: Create cgroup
        cgroup = new CgroupManager(id);
        cgroup.create();
        
        // Step 2: Set resource limits
        cgroup.setCpuLimit(limits.getCpuPercent());
        cgroup.setMemoryLimit(limits.getMemoryBytes());
        cgroup.setPidsLimit(limits.getMaxPids());
        
        // Step 3: Create namespaces
        NamespaceOptions nsOptions = NamespaceOptions.builder()
            .withPid()
            .withMount()
            .withUts()
            .withIpc()
            .build();
        
        namespaces.createNamespaces(nsOptions);
        namespaces.setHostname(hostname);
        
        // Step 4: Fork
        int pid = libc.fork();
        
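        if (pid == 0) {
            // Child: join the cgroup before exec so the command starts under the limits.
            // Use libc.getpid() here: ProcessHandle.current() can still report the PID
            // the JVM cached before fork().
            cgroup.addProcess(libc.getpid());
            runContainerInit();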
        } else if (pid > 0) {
            // Parent: wait and monitor
            monitorContainer(pid);
        } else {
            throw new RuntimeException("Fork failed");
        }
    }
    
    private void monitorContainer(int childPid) {
        // Start monitoring thread
        Thread monitor = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(1000);
                    CgroupManager.ResourceUsage usage = cgroup.getUsage();
                    System.out.println("[Monitor] " + usage);
                }
            } catch (Exception e) {
                // Container exited
            }
        });
        monitor.setDaemon(true);
        monitor.start();
        
        // Wait for child
        int[] status = new int[1];
        libc.waitpid(childPid, status, 0);
        
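        // waitpid() fills in a raw wait status: on Linux the exit code is in bits 8-15
        // and a terminating signal (e.g. 9 after an OOM kill) is in the low 7 bits.
        int signal = status[0] & 0x7f;
        if (signal != 0) {
            System.out.println("Container killed by signal: " + signal);
        } else {
            System.out.println("Container exited with code: " + ((status[0] >> 8) & 0xff));
        }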
        
        // Cleanup
        try {
            cgroup.destroy();
        } catch (Exception e) {
            System.err.println("Failed to cleanup cgroup: " + e.getMessage());
        }
    }
    
    private void runContainerInit() {
        try {
            System.out.println("\n=== Container Init (PID " + libc.getpid() + ") ===");
            
            if (command.length > 0) {
                String[] argv = new String[command.length + 1];
                System.arraycopy(command, 0, argv, 0, command.length);
                argv[command.length] = null;
                
                libc.execv(command[0], argv);
                System.err.println("Failed to execute: " + command[0]);
                System.exit(1);
            }
        } catch (Exception e) {
            System.err.println("Container init failed: " + e.getMessage());
            System.exit(1);
        }
    }
    
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java Container <hostname> <command...>");
            System.out.println("       --cpus=<n>     CPU cores (e.g., 0.5, 2)");
            System.out.println("       --memory=<n>   Memory limit (e.g., 256M, 1G)");
            System.out.println("       --pids=<n>     Max processes");
            System.exit(1);
        }
        
        ResourceLimits limits = ResourceLimits.defaults();
        String hostname = null;
        String[] command = null;
        
        // Parse arguments
        int cmdStart = 0;
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("--cpus=")) {
                limits.withCpu(Double.parseDouble(args[i].substring(7)));
            } else if (args[i].startsWith("--memory=")) {
                limits.withMemory(args[i].substring(9));
            } else if (args[i].startsWith("--pids=")) {
                limits.withMaxPids(Integer.parseInt(args[i].substring(7)));
            } else if (hostname == null) {
                hostname = args[i];
            } else {
                cmdStart = i;
                break;
            }
        }
        
        if (hostname == null || cmdStart == 0) {
            System.err.println("Error: a hostname and a command are required");
            System.exit(1);
        }
        command = new String[args.length - cmdStart];
        System.arraycopy(args, cmdStart, command, 0, command.length);
        
        try {
            Container container = new Container(hostname, command, limits);
            container.run();
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
            System.exit(1);
        }
    }
}
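The value that `waitpid` stores in `status[0]` is not the exit code itself but a packed wait status. A minimal sketch of decoding it follows, assuming the Linux encoding (exit code in bits 8-15, terminating signal in the low 7 bits) and a blocking wait with options `0`, so stopped or continued states never appear. `WaitStatus` is a hypothetical helper name.

```java
// Hypothetical helper for decoding a Linux wait status.
class WaitStatus {

    /** Low 7 bits: the terminating signal, or 0 if the process exited on its own. */
    static int termSignal(int status) {
        return status & 0x7f;
    }

    /** Bits 8-15: the value passed to exit(), meaningful only when termSignal is 0. */
    static int exitCode(int status) {
        return (status >> 8) & 0xff;
    }

    static String describe(int status) {
        int signal = termSignal(status);
        if (signal == 0) {
            return "exited with code " + exitCode(status);
        }
        return "killed by signal " + signal;
    }

    public static void main(String[] args) {
        System.out.println(describe(0));      // exited with code 0
        System.out.println(describe(1 << 8)); // exited with code 1
        System.out.println(describe(9));      // killed by signal 9
    }
}
```

Signal 9 (SIGKILL) is what a container reports after the kernel OOM killer terminates it, so this distinction matters when testing memory limits.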

Part 4: Testing Resource Limits

Testing CPU Limits

# Run a CPU-intensive workload with 50% CPU limit
sudo java Container --cpus=0.5 testhost /bin/sh -c "while true; do :; done"

# In another terminal, observe CPU usage (should be ~50%)
# pgrep matches several PIDs here (sudo, java, sh); -d, joins them for top
top -p "$(pgrep -d, -f 'while true')"
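Besides watching `top`, the cgroup's own `cpu.stat` file shows whether the limit is biting: `nr_periods` counts enforcement periods and `nr_throttled` counts the periods in which the group ran out of quota. The sketch below parses that flat key-value format from a sample string. `CpuStat` is a hypothetical class and the sample numbers are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the flat "key value" format used by cpu.stat.
class CpuStat {

    static Map<String, Long> parse(String contents) {
        Map<String, Long> stats = new LinkedHashMap<>();
        for (String line : contents.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2) {
                stats.put(kv[0], Long.parseLong(kv[1]));
            }
        }
        return stats;
    }

    /** Fraction of enforcement periods in which the cgroup was throttled. */
    static double throttledFraction(Map<String, Long> stats) {
        long periods = stats.getOrDefault("nr_periods", 0L);
        long throttled = stats.getOrDefault("nr_throttled", 0L);
        return periods == 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) {
        // Made-up sample; on a real host read <cgroup>/cpu.stat instead.
        String sample = "usage_usec 5000000\nnr_periods 200\nnr_throttled 150\nthrottled_usec 7400000\n";
        double fraction = throttledFraction(parse(sample));
        System.out.println("Throttled in " + Math.round(100 * fraction) + "% of periods");
        // prints "Throttled in 75% of periods"
    }
}
```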

Testing Memory Limits

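// MemoryHog.java - For testing memory limits
// Run with a heap larger than the cgroup limit (e.g. -Xmx2g under --memory=256M);
// otherwise the JVM is likely to throw OutOfMemoryError before the kernel steps in.
import java.util.ArrayList;
import java.util.List;

public class MemoryHog {
    public static void main(String[] args) throws Exception {
        List<byte[]> allocations = new ArrayList<>();
        
        // Loops until the container is OOM-killed at the memory limit
        while (true) {
            // Allocate 10MB chunks
            byte[] chunk = new byte[10 * 1024 * 1024];
            allocations.add(chunk);
            System.out.println("Allocated: " + (allocations.size() * 10) + "MB");
            Thread.sleep(100);
        }
    }
}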
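To confirm that a container died from the memory limit rather than some other failure, the cgroup's `memory.events` file can be inspected: its `oom_kill` counter increases each time the kernel kills a process in the group. A small sketch of reading one counter from that file follows. `MemoryEvents` is a hypothetical helper, and the sample contents are invented.

```java
// Hypothetical reader for a single counter in memory.events.
class MemoryEvents {

    /** Returns the value for the given key, or 0 if the key is absent. */
    static long counter(String contents, String key) {
        for (String line : contents.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2 && kv[0].equals(key)) {
                return Long.parseLong(kv[1]);
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        // Invented sample; on a real host read <cgroup>/memory.events instead.
        String sample = "low 0\nhigh 0\nmax 37\noom 1\noom_kill 1\n";
        System.out.println("oom_kill = " + counter(sample, "oom_kill")); // oom_kill = 1
        System.out.println("max = " + counter(sample, "max"));           // max = 37
    }
}
```

The file has to be read before the cgroup directory is removed, so a check like this would run ahead of `destroy()`.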

Testing PID Limits

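# Fork bomb protection!
# WARNING: run this only inside the container. Without a pids limit it
# spawns processes until the system becomes unresponsive:
:(){ :|:& };:

# With the pids limit set to 100, fork() starts failing at 100 processes and the bomb is contained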

Understanding Cgroup Controllers

┌─────────────────────────────────────────────────────────────────────────────┐
│                      CGROUP CONTROLLERS                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   CONTROLLER    FILES               DESCRIPTION                             │
│   ──────────    ─────               ───────────                             │
│                                                                              │
│   cpu           cpu.max             Bandwidth limit (quota/period)          │
│                 cpu.weight          Relative CPU shares (1-10000)           │
│                 cpu.stat            Usage statistics                        │
│                                                                              │
│   memory        memory.max          Hard limit (OOM kill if exceeded)       │
│                 memory.high         Soft limit (throttling starts)          │
│                 memory.current      Current usage                           │
│                 memory.swap.max     Swap limit                              │
│                 memory.stat         Detailed statistics                     │
│                                                                              │
│   pids          pids.max            Maximum processes                       │
│                 pids.current        Current process count                   │
│                                                                              │
│   io            io.max              Bandwidth/IOPS limits per device        │
│                 io.weight           Relative I/O priority                   │
│                 io.stat             I/O statistics                          │
│                                                                              │
│   cpuset        cpuset.cpus         Allowed CPU cores (e.g., "0-3")         │
│                 cpuset.mems         Allowed NUMA nodes                      │
│                                                                              │
│   hugetlb       hugetlb.X.max       Huge page limits                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
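The `cpuset.cpus` entry in the table uses a compact list syntax such as `0-3` or `0,2`. A sketch of expanding that syntax into individual core numbers is shown below, which can help when validating user input before writing it to the file. `CpuList` is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for cpuset-style lists such as "0-3,5".
class CpuList {

    static List<Integer> parse(String spec) {
        List<Integer> cpus = new ArrayList<>();
        String trimmed = spec.trim();
        if (trimmed.isEmpty()) {
            return cpus;
        }
        for (String part : trimmed.split(",")) {
            String[] bounds = part.trim().split("-");
            int first = Integer.parseInt(bounds[0]);
            int last = bounds.length > 1 ? Integer.parseInt(bounds[1]) : first;
            if (last < first) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            for (int cpu = first; cpu <= last; cpu++) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }

    public static void main(String[] args) {
        System.out.println(parse("0-3,5")); // [0, 1, 2, 3, 5]
        System.out.println(parse("0,2"));   // [0, 2]
    }
}
```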

Exercises

Add disk I/O limits:
// 1. Find the device major:minor for the disk
// cat /sys/block/sda/dev

// 2. Set io.max
// "8:0 rbps=1048576 wbps=1048576" = 1MB/s read/write

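// 3. Test with dd (oflag=direct bypasses the page cache so the limit shows up immediately;
//    make sure the target file is on the limited device, not a tmpfs):
// dd if=/dev/zero of=/tmp/test bs=1M count=100 oflag=direct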
Pin container to specific CPU cores:
// Use cpuset controller
// Write to cpuset.cpus: "0,2" (only use cores 0 and 2)
// Write to cpuset.mems: "0" (NUMA node 0)

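// Pinning keeps the container's threads on the same cores (warm caches, fewer migrations);
// it only avoids sharing those cores if other containers are kept off them.
// Useful for latency-sensitive workloads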
Implement real-time resource monitoring:
// 1. Read cpu.stat periodically for CPU usage
// 2. Read memory.current and memory.stat for memory
// 3. Read io.stat for disk I/O
// 4. Calculate deltas to get rates
// 5. Display like 'docker stats'
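For step 4 of the monitoring exercise, the counters in `cpu.stat` are cumulative, so a rate comes from the difference between two samples divided by the elapsed wall-clock time. The sketch below shows that calculation for CPU, expressed as a percentage of one core. `CpuRate` is a hypothetical helper and the sample readings are made up.

```java
// Hypothetical helper that turns two cumulative usage_usec readings into a rate.
class CpuRate {

    /**
     * @param prevUsageUsec  usage_usec from the earlier sample
     * @param currUsageUsec  usage_usec from the later sample
     * @param intervalMillis wall-clock time between the samples
     * @return CPU usage as a percentage of one core (200.0 = two full cores)
     */
    static double cpuPercent(long prevUsageUsec, long currUsageUsec, long intervalMillis) {
        if (intervalMillis <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long deltaUsec = currUsageUsec - prevUsageUsec;
        return 100.0 * deltaUsec / (intervalMillis * 1000.0);
    }

    public static void main(String[] args) {
        // 0.5 s of CPU time consumed during a 1 s interval
        System.out.println(cpuPercent(1_000_000, 1_500_000, 1000) + "%"); // 50.0%
    }
}
```

The same delta approach applies to the byte counters in `io.stat`. `memory.current` is already an instantaneous value and needs no differencing.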

Key Takeaways

Hierarchical

Cgroups form a tree: a child can never exceed the limits set on its ancestors

Controller-Based

Different controllers for CPU, memory, I/O, PIDs

Kernel Enforcement

Limits are enforced by the kernel, not by an honor system

OOM Killer

When usage reaches memory.max and cannot be reclaimed, the kernel OOM killer terminates a process in the cgroup
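The "Hierarchical" point can be made concrete: the limit a container actually experiences is the smallest limit on the path from the root down to its own cgroup, because every ancestor's limit also applies. A sketch of that calculation follows, assuming the `memory.max` values along the path have already been read into strings. `EffectiveLimit` is a hypothetical name.

```java
import java.util.List;
import java.util.OptionalLong;

// Hypothetical helper: computes the tightest limit along a cgroup path.
class EffectiveLimit {

    /** Parses a limit file value; the literal "max" means unlimited. */
    static OptionalLong parseLimit(String contents) {
        String value = contents.trim();
        return value.equals("max")
                ? OptionalLong.empty()
                : OptionalLong.of(Long.parseLong(value));
    }

    /** Smallest limit among the ancestors and the leaf; empty if none is set. */
    static OptionalLong effective(List<String> limitsFromRootToLeaf) {
        return limitsFromRootToLeaf.stream()
                .map(EffectiveLimit::parseLimit)
                .filter(OptionalLong::isPresent)
                .mapToLong(OptionalLong::getAsLong)
                .min();
    }

    public static void main(String[] args) {
        // Parent limited to 1 GiB, child asks for 4 GiB: the parent's limit wins.
        System.out.println(effective(List.of("1073741824", "4294967296")));
        // prints "OptionalLong[1073741824]"
        System.out.println(effective(List.of("max", "max"))); // OptionalLong.empty
    }
}
```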

What’s Next?

In Chapter 3: Filesystem, we’ll implement:
  • Overlay filesystems
  • Copy-on-write layers
  • Container root filesystem setup
