Chapter 2: Control Groups (cgroups)
While namespaces provide isolation, cgroups provide resource limits. Without cgroups, a container could consume all system resources. Let’s implement resource limiting!
Prerequisites: Chapter 1: Namespaces
Further Reading: Operating Systems: Resource Management
Time: 3-4 hours
Outcome: Containers with CPU, memory, and process limits
What Are Cgroups?
┌─────────────────────────────────────────────────────────────────────────────┐
│ WITHOUT CGROUPS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM: 8 CPU cores, 16GB RAM │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Container A Container B Container C │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ Uses 6 │ │ Uses 1.5 │ │ Uses 0.5 │ │ │
│ │ │ CPUs, │ │ CPUs, │ │ CPUs, │ │ │
│ │ │ 14GB RAM │ │ 1GB RAM │ │ 1GB RAM │ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ │ Container A is a "noisy neighbor" - starving others! │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ WITH CGROUPS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SYSTEM: 8 CPU cores, 16GB RAM │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Container A Container B Container C │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │
│ │ │ LIMIT: │ │ LIMIT: │ │ LIMIT: │ │ │
│ │ │ 2 CPUs │ │ 2 CPUs │ │ 2 CPUs │ │ │
│ │ │ 4GB RAM │ │ 4GB RAM │ │ 4GB RAM │ │ │
│ │ │ 100 procs│ │ 100 procs│ │ 100 procs│ │ │
│ │ └──────────┘ └──────────┘ └──────────┘ │ │
│ │ │ │
│ │ Fair resource allocation! Predictable performance! │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Cgroup v2 Hierarchy
┌─────────────────────────────────────────────────────────────────────────────┐
│ CGROUP V2 FILESYSTEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ /sys/fs/cgroup/ ← Cgroup root │
│ ├── cgroup.controllers ← Available controllers │
│ ├── cgroup.subtree_control ← Enabled controllers │
│ ├── cgroup.procs ← PIDs in the root group (cpu.max and memory.max exist only in child groups) │
│ │ │
│ ├── minidocker/ ← Our container group │
│ │ ├── cgroup.controllers │
│ │ ├── cgroup.procs ← PIDs in this group │
│ │ ├── cpu.max ← CPU limit (quota period) │
│ │ ├── cpu.weight ← CPU shares │
│ │ ├── memory.max ← Memory limit (bytes) │
│ │ ├── memory.current ← Current memory usage │
│ │ ├── pids.max ← Max processes │
│ │ └── pids.current ← Current process count │
│ │ │
│ └── docker/ ← Docker's groups │
│ ├── container-abc123/ │
│ └── container-def456/ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
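Before creating anything under /sys/fs/cgroup, it can help to confirm that the host actually exposes the unified v2 hierarchy and the controllers this chapter relies on. The following is a minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; the class name CgroupProbe and its helper are hypothetical and not part of the minidocker code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

class CgroupProbe {
    /** Splits the contents of a cgroup.controllers file, e.g. "cpu io memory pids\n". */
    static Set<String> parseControllers(String fileContents) {
        Set<String> controllers = new LinkedHashSet<>();
        for (String name : fileContents.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                controllers.add(name);
            }
        }
        return controllers;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the v2 hierarchy is mounted at the conventional location
        Path root = Path.of("/sys/fs/cgroup");
        Path controllersFile = root.resolve("cgroup.controllers");
        if (!Files.exists(controllersFile)) {
            System.out.println("No cgroup.controllers under " + root
                    + ": cgroup v2 does not appear to be mounted here");
            return;
        }
        Set<String> available = parseControllers(Files.readString(controllersFile));
        System.out.println("Available controllers: " + available);
        for (String needed : new String[] {"cpu", "memory", "pids"}) {
            if (!available.contains(needed)) {
                System.out.println("Missing controller: " + needed);
            }
        }
    }
}
```

If a controller is reported missing, writing it to cgroup.subtree_control will fail, so this check is a cheap way to produce a clearer error up front.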
Part 1: Cgroup Manager
src/main/java/com/minidocker/cgroup/CgroupManager.java
package com.minidocker.cgroup;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
/**
* Manages cgroup v2 resource limits.
*
* Cgroups control:
* - CPU: How much CPU time a container can use
* - Memory: Maximum memory allocation
* - PIDs: Maximum number of processes
* - I/O: Disk bandwidth limits
*/
public class CgroupManager {
private static final Path CGROUP_ROOT = Path.of("/sys/fs/cgroup");
private static final String CGROUP_NAME = "minidocker";
private final String containerId;
private final Path cgroupPath;
public CgroupManager(String containerId) {
this.containerId = containerId;
this.cgroupPath = CGROUP_ROOT.resolve(CGROUP_NAME).resolve(containerId);
}
/**
* Creates the cgroup for this container.
*/
public void create() throws IOException {
// Create the cgroup directory
Files.createDirectories(cgroupPath);
System.out.println("✓ Created cgroup: " + cgroupPath);
// Enable controllers top-down: the root must delegate them to "minidocker"
// before "minidocker" can delegate them to this container's cgroup
String controllers = "+cpu +memory +pids +io";
for (Path dir : new Path[] {CGROUP_ROOT, cgroupPath.getParent()}) {
Path subtreeControl = dir.resolve("cgroup.subtree_control");
if (Files.exists(subtreeControl)) {
Files.writeString(subtreeControl, controllers);
}
}
System.out.println("✓ Enabled controllers: cpu, memory, pids, io");
}
/**
* Sets CPU limit.
*
* @param cpuPercent CPU percentage (100 = 1 core, 200 = 2 cores)
*/
public void setCpuLimit(int cpuPercent) throws IOException {
// cpu.max format: "$QUOTA $PERIOD"
// QUOTA: microseconds of CPU time per PERIOD
// PERIOD: typically 100000 (100ms)
int period = 100000; // 100ms
int quota = period * cpuPercent / 100;
Path cpuMaxPath = cgroupPath.resolve("cpu.max");
String value = quota + " " + period;
Files.writeString(cpuMaxPath, value);
System.out.println("✓ CPU limit set: " + cpuPercent + "% (" + value + ")");
}
/**
* Sets memory limit.
*
* @param memoryBytes Maximum memory in bytes
*/
public void setMemoryLimit(long memoryBytes) throws IOException {
Path memoryMaxPath = cgroupPath.resolve("memory.max");
Files.writeString(memoryMaxPath, String.valueOf(memoryBytes));
// Also set swap limit to prevent swap usage
Path memorySwapMaxPath = cgroupPath.resolve("memory.swap.max");
if (Files.exists(memorySwapMaxPath)) {
Files.writeString(memorySwapMaxPath, "0");
}
String humanReadable = formatBytes(memoryBytes);
System.out.println("✓ Memory limit set: " + humanReadable);
}
/**
* Sets process (PID) limit.
*
* Prevents fork bombs!
*
* @param maxPids Maximum number of processes
*/
public void setPidsLimit(int maxPids) throws IOException {
Path pidsMaxPath = cgroupPath.resolve("pids.max");
Files.writeString(pidsMaxPath, String.valueOf(maxPids));
System.out.println("✓ PIDs limit set: " + maxPids);
}
/**
* Sets I/O bandwidth limits.
*
* @param deviceMajorMinor Device ID (e.g., "8:0" for /dev/sda)
* @param rbps Read bytes per second
* @param wbps Write bytes per second
*/
public void setIOLimit(String deviceMajorMinor, long rbps, long wbps) throws IOException {
Path ioMaxPath = cgroupPath.resolve("io.max");
String value = deviceMajorMinor + " rbps=" + rbps + " wbps=" + wbps;
Files.writeString(ioMaxPath, value);
System.out.println("✓ I/O limit set for " + deviceMajorMinor +
": read=" + formatBytes(rbps) + "/s, write=" + formatBytes(wbps) + "/s");
}
/**
* Adds a process to this cgroup.
*
* @param pid Process ID to add
*/
public void addProcess(int pid) throws IOException {
Path procsPath = cgroupPath.resolve("cgroup.procs");
Files.writeString(procsPath, String.valueOf(pid));
System.out.println("✓ Added PID " + pid + " to cgroup");
}
/**
* Adds the current process to this cgroup.
*/
public void addCurrentProcess() throws IOException {
// pid() returns a long; Linux PIDs fit in an int
addProcess((int) ProcessHandle.current().pid());
}
/**
* Gets current resource usage.
*/
public ResourceUsage getUsage() throws IOException {
long memoryBytes = 0;
int pids = 0;
long cpuUsage = 0;
Path memoryCurrent = cgroupPath.resolve("memory.current");
if (Files.exists(memoryCurrent)) {
memoryBytes = Long.parseLong(Files.readString(memoryCurrent).trim());
}
Path pidsCurrent = cgroupPath.resolve("pids.current");
if (Files.exists(pidsCurrent)) {
pids = Integer.parseInt(Files.readString(pidsCurrent).trim());
}
Path cpuStat = cgroupPath.resolve("cpu.stat");
if (Files.exists(cpuStat)) {
String stats = Files.readString(cpuStat);
for (String line : stats.split("\n")) {
if (line.startsWith("usage_usec")) {
cpuUsage = Long.parseLong(line.split(" ")[1]);
}
}
}
return new ResourceUsage(memoryBytes, pids, cpuUsage);
}
/**
* Removes the cgroup (cleanup).
*/
public void destroy() throws IOException {
// First, kill any processes still in the cgroup (SIGKILL cannot be ignored);
// the directory cannot be removed while it still has members
Path procsPath = cgroupPath.resolve("cgroup.procs");
if (Files.exists(procsPath)) {
String procs = Files.readString(procsPath);
for (String pid : procs.split("\n")) {
if (!pid.isBlank()) {
ProcessHandle.of(Long.parseLong(pid.trim()))
.ifPresent(ProcessHandle::destroyForcibly);
}
}
}
// Then remove the directory
Files.deleteIfExists(cgroupPath);
System.out.println("✓ Destroyed cgroup: " + cgroupPath);
}
private static String formatBytes(long bytes) {
if (bytes < 1024) return bytes + " B";
if (bytes < 1024 * 1024) return (bytes / 1024) + " KB";
if (bytes < 1024 * 1024 * 1024) return (bytes / (1024 * 1024)) + " MB";
return (bytes / (1024 * 1024 * 1024)) + " GB";
}
/**
* Resource usage statistics.
*/
public record ResourceUsage(long memoryBytes, int pids, long cpuUsageMicros) {
@Override
public String toString() {
return String.format("Memory: %s, PIDs: %d, CPU: %.2fms",
formatBytes(memoryBytes), pids, cpuUsageMicros / 1000.0);
}
}
}
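As a worked example of the cpu.max arithmetic used in setCpuLimit, the sketch below computes the value written for a few limits. The class name CpuMaxExample and the choice to map non-positive percentages to "max" (no limit) are assumptions made for illustration; they are not part of CgroupManager.

```java
class CpuMaxExample {
    // 100 ms scheduling period, the same value setCpuLimit uses
    static final long PERIOD_USEC = 100_000;

    /**
     * Builds the "$QUOTA $PERIOD" string for cpu.max.
     * Assumption for this sketch: cpuPercent <= 0 means "no limit".
     */
    static String cpuMaxValue(int cpuPercent) {
        if (cpuPercent <= 0) {
            return "max " + PERIOD_USEC;
        }
        long quotaUsec = PERIOD_USEC * cpuPercent / 100;
        return quotaUsec + " " + PERIOD_USEC;
    }

    public static void main(String[] args) {
        System.out.println(cpuMaxValue(50));  // 50000 100000  (half a core)
        System.out.println(cpuMaxValue(100)); // 100000 100000 (one core)
        System.out.println(cpuMaxValue(200)); // 200000 100000 (two cores)
        System.out.println(cpuMaxValue(0));   // max 100000    (unlimited)
    }
}
```

A quota larger than the period is how cpu.max expresses more than one core: 200000 microseconds of CPU time per 100000-microsecond period can only be consumed by running on two cores at once.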
Part 2: Resource Limits Configuration
src/main/java/com/minidocker/cgroup/ResourceLimits.java
package com.minidocker.cgroup;
/**
* Container resource limits configuration.
*
* Similar to Docker's --cpus, --memory, --pids-limit flags.
*/
public class ResourceLimits {
private int cpuPercent = 100; // 100 = 1 core
private long memoryBytes = 512 * 1024 * 1024; // 512MB default
private int maxPids = 100; // Max processes
public static ResourceLimits defaults() {
return new ResourceLimits();
}
public ResourceLimits withCpu(double cpus) {
// Round instead of truncating: a plain cast would turn 0.29 * 100 (28.999...) into 28
this.cpuPercent = (int) Math.round(cpus * 100);
return this;
}
public ResourceLimits withMemory(String memory) {
this.memoryBytes = parseMemory(memory);
return this;
}
public ResourceLimits withMemoryBytes(long bytes) {
this.memoryBytes = bytes;
return this;
}
public ResourceLimits withMaxPids(int pids) {
this.maxPids = pids;
return this;
}
public int getCpuPercent() { return cpuPercent; }
public long getMemoryBytes() { return memoryBytes; }
public int getMaxPids() { return maxPids; }
private static long parseMemory(String memory) {
memory = memory.trim().toUpperCase();
long multiplier = 1;
if (memory.endsWith("K")) {
multiplier = 1024;
memory = memory.substring(0, memory.length() - 1);
} else if (memory.endsWith("M")) {
multiplier = 1024 * 1024;
memory = memory.substring(0, memory.length() - 1);
} else if (memory.endsWith("G")) {
multiplier = 1024 * 1024 * 1024;
memory = memory.substring(0, memory.length() - 1);
}
return Long.parseLong(memory) * multiplier;
}
@Override
public String toString() {
return String.format("ResourceLimits{cpu=%.2f cores, memory=%dMB, pids=%d}",
cpuPercent / 100.0,
memoryBytes / (1024 * 1024),
maxPids);
}
}
Part 3: Integrating with Container
src/main/java/com/minidocker/Container.java
package com.minidocker;
import com.minidocker.cgroup.CgroupManager;
import com.minidocker.cgroup.ResourceLimits;
import com.minidocker.linux.LibC;
import com.minidocker.namespace.NamespaceManager;
import com.minidocker.namespace.NamespaceOptions;
import java.util.UUID;
/**
* Container with namespace isolation AND resource limits.
*/
public class Container {
private final LibC libc = LibC.INSTANCE;
private final NamespaceManager namespaces = new NamespaceManager();
private final String id;
private final String hostname;
private final String[] command;
private final ResourceLimits limits;
private CgroupManager cgroup;
public Container(String hostname, String[] command, ResourceLimits limits) {
this.id = UUID.randomUUID().toString().substring(0, 12);
this.hostname = hostname;
this.command = command;
this.limits = limits;
}
public void run() throws Exception {
System.out.println("=== Starting Container " + id + " ===");
System.out.println("Hostname: " + hostname);
System.out.println("Limits: " + limits);
System.out.println();
// Step 1: Create cgroup
cgroup = new CgroupManager(id);
cgroup.create();
// Step 2: Set resource limits
cgroup.setCpuLimit(limits.getCpuPercent());
cgroup.setMemoryLimit(limits.getMemoryBytes());
cgroup.setPidsLimit(limits.getMaxPids());
// Step 3: Create namespaces
NamespaceOptions nsOptions = NamespaceOptions.builder()
.withPid()
.withMount()
.withUts()
.withIpc()
.build();
namespaces.createNamespaces(nsOptions);
namespaces.setHostname(hostname);
// Step 4: Fork
int pid = libc.fork();
if (pid == 0) {
// Child: add to cgroup and run
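// Use getpid() here: it is evaluated after the fork, whereas
// ProcessHandle.current() may return a PID cached in the parent
cgroup.addProcess(libc.getpid());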
runContainerInit();
} else if (pid > 0) {
// Parent: wait and monitor
monitorContainer(pid);
} else {
throw new RuntimeException("Fork failed");
}
}
private void monitorContainer(int childPid) {
// Start monitoring thread
Thread monitor = new Thread(() -> {
try {
while (true) {
Thread.sleep(1000);
CgroupManager.ResourceUsage usage = cgroup.getUsage();
System.out.println("[Monitor] " + usage);
}
} catch (Exception e) {
// Container exited
}
});
monitor.setDaemon(true);
monitor.start();
// Wait for child
int[] status = new int[1];
libc.waitpid(childPid, status, 0);
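// waitpid fills in a raw wait status: the low 7 bits hold the terminating
// signal (if any), bits 8-15 hold the exit code
int signal = status[0] & 0x7f;
int exitCode = (status[0] >> 8) & 0xff;
System.out.println(signal != 0
? "Container killed by signal: " + signal
: "Container exited with code: " + exitCode);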
// Cleanup
try {
cgroup.destroy();
} catch (Exception e) {
System.err.println("Failed to cleanup cgroup: " + e.getMessage());
}
}
private void runContainerInit() {
try {
System.out.println("\n=== Container Init (PID " + libc.getpid() + ") ===");
if (command.length > 0) {
String[] argv = new String[command.length + 1];
System.arraycopy(command, 0, argv, 0, command.length);
argv[command.length] = null;
libc.execv(command[0], argv);
System.err.println("Failed to execute: " + command[0]);
System.exit(1);
}
} catch (Exception e) {
System.err.println("Container init failed: " + e.getMessage());
System.exit(1);
}
}
public static void main(String[] args) {
if (args.length < 2) {
System.out.println("Usage: java Container [options] <hostname> <command...>");
System.out.println(" --cpus=<n> CPU cores (e.g., 0.5, 2)");
System.out.println(" --memory=<n> Memory limit (e.g., 256M, 1G)");
System.out.println(" --pids=<n> Max processes");
System.exit(1);
}
ResourceLimits limits = ResourceLimits.defaults();
String hostname = null;
String[] command = null;
// Parse arguments
int cmdStart = 0;
for (int i = 0; i < args.length; i++) {
if (args[i].startsWith("--cpus=")) {
limits.withCpu(Double.parseDouble(args[i].substring(7)));
} else if (args[i].startsWith("--memory=")) {
limits.withMemory(args[i].substring(9));
} else if (args[i].startsWith("--pids=")) {
limits.withMaxPids(Integer.parseInt(args[i].substring(7)));
} else if (hostname == null) {
hostname = args[i];
} else {
cmdStart = i;
break;
}
}
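// Fail early with a clear message instead of a NullPointerException later
if (hostname == null || cmdStart == 0) {
System.err.println("Error: expected <hostname> followed by a command");
System.exit(1);
}
command = new String[args.length - cmdStart];
System.arraycopy(args, cmdStart, command, 0, command.length);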
try {
Container container = new Container(hostname, command, limits);
container.run();
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
e.printStackTrace();
System.exit(1);
}
}
}
Part 4: Testing Resource Limits
Testing CPU Limits
# Run a CPU-intensive workload with 50% CPU limit
sudo java Container --cpus=0.5 testhost /bin/sh -c "while true; do :; done"
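# In another terminal, observe CPU usage (should be ~50% of one core)
# pgrep -f also matches the java launcher, so join all matching PIDs with commas for top
top -p "$(pgrep -d, -f 'while true')"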
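Besides watching top, the cgroup itself reports whether the limit is biting: once the cpu controller is enabled, cpu.stat includes the counters nr_periods, nr_throttled, and throttled_usec. The sketch below reads them; the class name CpuThrottleCheck is hypothetical, and the cgroup directory (for example /sys/fs/cgroup/minidocker/<container id>) is assumed to be passed as the first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

class CpuThrottleCheck {
    /** Parses a cgroup "flat keyed" file: one "key value" pair per line. */
    static Map<String, Long> parseFlatKeyed(String contents) {
        Map<String, Long> values = new LinkedHashMap<>();
        for (String line : contents.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                values.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java CpuThrottleCheck <cgroup directory>");
            return;
        }
        Map<String, Long> stat = parseFlatKeyed(Files.readString(Path.of(args[0], "cpu.stat")));
        long periods = stat.getOrDefault("nr_periods", 0L);
        long throttled = stat.getOrDefault("nr_throttled", 0L);
        long throttledUsec = stat.getOrDefault("throttled_usec", 0L);
        System.out.println("Throttled in " + throttled + " of " + periods
                + " periods (" + throttledUsec + " us spent throttled)");
    }
}
```

For the busy loop above with a 50% limit, nr_throttled should keep climbing alongside nr_periods; if it stays at zero, the process is probably not in the cgroup you think it is.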
Testing Memory Limits
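// MemoryHog.java - For testing memory limits
// Run with a heap larger than the cgroup limit (e.g. java -Xmx2g MemoryHog); otherwise a
// container-aware JVM may cap its own heap and throw OutOfMemoryError before the kernel steps in.
import java.util.ArrayList;
import java.util.List;
public class MemoryHog {
public static void main(String[] args) throws Exception {
List<byte[]> allocations = new ArrayList<>();
// Loops until the cgroup reaches memory.max and the kernel OOM-kills the process
while (true) {
// Allocate 10MB chunks
byte[] chunk = new byte[10 * 1024 * 1024];
allocations.add(chunk);
System.out.println("Allocated: " + (allocations.size() * 10) + "MB");
Thread.sleep(100);
}
}
}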
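To tell an OOM kill apart from an ordinary crash, the cgroup's memory.events file keeps an oom_kill counter. The sketch below extracts it; the class name OomCheck is hypothetical, and the cgroup directory is assumed to be passed as the first argument. Note that the file has to be read before the cgroup directory is removed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class OomCheck {
    /** Returns the oom_kill counter from the contents of memory.events, or 0 if absent. */
    static long oomKills(String memoryEvents) {
        for (String line : memoryEvents.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals("oom_kill")) {
                return Long.parseLong(parts[1]);
            }
        }
        return 0;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java OomCheck <cgroup directory>");
            return;
        }
        long kills = oomKills(Files.readString(Path.of(args[0], "memory.events")));
        System.out.println(kills > 0
                ? "Processes OOM-killed in this cgroup: " + kills
                : "No OOM kills recorded");
    }
}
```

One possible use is to call something like this from the parent just before cleaning up the cgroup, so the exit message can say that the container was killed for exceeding its memory limit.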
Testing PID Limits
# Fork bomb protection!
# Run this ONLY inside the container. Without a pids limit it can exhaust the host's process table:
:(){ :|:& };:
# With pids.max set to 100, fork() starts failing with EAGAIN and the fork bomb is contained
Understanding Cgroup Controllers
┌─────────────────────────────────────────────────────────────────────────────┐
│ CGROUP CONTROLLERS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CONTROLLER FILES DESCRIPTION │
│ ────────── ───── ─────────── │
│ │
│ cpu cpu.max Bandwidth limit (quota/period) │
│ cpu.weight Relative CPU shares (1-10000) │
│ cpu.stat Usage statistics │
│ │
│ memory memory.max Hard limit (OOM kill if exceeded) │
│ memory.high Soft limit (throttling starts) │
│ memory.current Current usage │
│ memory.swap.max Swap limit │
│ memory.stat Detailed statistics │
│ │
│ pids pids.max Maximum processes │
│ pids.current Current process count │
│ │
│ io io.max Bandwidth/IOPS limits per device │
│ io.weight Relative I/O priority │
│ io.stat I/O statistics │
│ │
│ cpuset cpuset.cpus Allowed CPU cores (e.g., "0-3") │
│ cpuset.mems Allowed NUMA nodes │
│ │
│ hugetlb hugetlb.X.max Huge page limits │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
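Unlike cpu.max, cpu.weight only matters under contention: when sibling cgroups all want CPU at the same time, each receives time in proportion to its weight. The sketch below works through that arithmetic; the class name CpuWeightShares is hypothetical, and the model assumes every sibling is fully busy.

```java
class CpuWeightShares {
    /** Expected fraction of CPU time per sibling when all are busy: weight / sum of weights. */
    static double[] expectedShares(int... weights) {
        long total = 0;
        for (int w : weights) {
            if (w < 1 || w > 10_000) {
                throw new IllegalArgumentException("cpu.weight must be in [1, 10000]: " + w);
            }
            total += w;
        }
        double[] shares = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            shares[i] = (double) weights[i] / total;
        }
        return shares;
    }

    public static void main(String[] args) {
        // Two containers at the default weight (100) and one at 200
        double[] shares = expectedShares(100, 100, 200);
        for (double share : shares) {
            System.out.println(share * 100 + "%"); // 25.0%, 25.0%, 50.0%
        }
    }
}
```

If the other containers are idle, a low-weight container can still use all available CPU, which is why weights are usually combined with a cpu.max ceiling when predictable limits are needed.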
Exercises
Exercise 1: Implement I/O Throttling
Add disk I/O limits:
// 1. Find the device major:minor for the disk
// cat /sys/block/sda/dev
// 2. Set io.max
// "8:0 rbps=1048576 wbps=1048576" = 1MB/s read/write
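// 3. Test with dd (oflag=direct bypasses the page cache so the limit shows up immediately;
// the target file must live on the throttled device, not on a tmpfs):
// dd if=/dev/zero of=/tmp/test bs=1M count=100 oflag=direct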
Exercise 2: Implement CPU Pinning
Pin container to specific CPU cores:
// Use cpuset controller
// Write to cpuset.cpus: "0,2" (only use cores 0 and 2)
// Write to cpuset.mems: "0" (NUMA node 0)
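// Pinning confines the container to those cores, which reduces migrations and cache misses;
// caches stay undisturbed only if no other workload is scheduled on the same cores
// Great for latency-sensitive workloads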
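The cpuset files use a compact list syntax with commas and ranges, such as "0,2" or "0-3,5". As a starting point for this exercise, the sketch below expands such a list into individual core numbers so that a requested set can be validated before it is written. The class name CpuList is hypothetical.

```java
import java.util.SortedSet;
import java.util.TreeSet;

class CpuList {
    /** Expands a cpuset list such as "0-3,5" into individual CPU numbers. */
    static SortedSet<Integer> parse(String list) {
        SortedSet<Integer> cpus = new TreeSet<>();
        String trimmed = list.trim();
        if (trimmed.isEmpty()) {
            return cpus; // an empty file means no CPUs are configured explicitly
        }
        for (String part : trimmed.split(",")) {
            String[] bounds = part.trim().split("-");
            int first = Integer.parseInt(bounds[0]);
            int last = bounds.length == 2 ? Integer.parseInt(bounds[1]) : first;
            if (last < first) {
                throw new IllegalArgumentException("Invalid range: " + part);
            }
            for (int cpu = first; cpu <= last; cpu++) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }

    public static void main(String[] args) {
        System.out.println(parse("0,2"));   // [0, 2]
        System.out.println(parse("0-3,5")); // [0, 1, 2, 3, 5]
    }
}
```

One way to use this is to parse both the requested list and the parent's cpuset.cpus.effective file, then reject the request unless the first set is contained in the second.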
Exercise 3: Add Resource Monitoring
Implement real-time resource monitoring:
// 1. Read cpu.stat periodically for CPU usage
// 2. Read memory.current and memory.stat for memory
// 3. Read io.stat for disk I/O
// 4. Calculate deltas to get rates
// 5. Display like 'docker stats'
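Step 4 is the part that is easy to get wrong: usage_usec in cpu.stat is a cumulative counter, so a rate only appears once two samples are compared against the wall-clock time between them. The sketch below shows that calculation; the class name CpuRate is hypothetical, and the sample values in main are made up for illustration.

```java
class CpuRate {
    /**
     * CPU usage between two samples as a percentage of one core.
     * usageUsec values come from the usage_usec line of cpu.stat;
     * elapsedNanos is the wall-clock time between the samples (e.g. from System.nanoTime()).
     */
    static double cpuPercent(long usageUsecBefore, long usageUsecAfter, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        double usedUsec = usageUsecAfter - usageUsecBefore;
        double elapsedUsec = elapsedNanos / 1_000.0;
        return 100.0 * usedUsec / elapsedUsec;
    }

    public static void main(String[] args) {
        // Hypothetical samples taken one second apart: 0.5 s of CPU time was consumed
        long before = 1_000_000;
        long after = 1_500_000;
        long elapsedNanos = 1_000_000_000L;
        System.out.println("CPU: " + cpuPercent(before, after, elapsedNanos) + "% of one core");
    }
}
```

Values above 100 are expected for containers allowed more than one core. The same delta-over-time approach applies to the byte and operation counters in io.stat.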
Key Takeaways
Hierarchical
Cgroups form a tree - a child cgroup can never exceed the limits set on its ancestors
Controller-Based
Different controllers for CPU, memory, I/O, PIDs
Kernel Enforcement
Limits are enforced by kernel, not by honor system
OOM Killer
When usage reaches memory.max and the kernel cannot reclaim enough memory, the OOM killer terminates a process in the cgroup
What’s Next?
In Chapter 3: Filesystem, we’ll implement:
- Overlay filesystems
- Copy-on-write layers
- Container root filesystem setup