While namespaces provide isolation (what a container can see), cgroups provide resource limits (what a container can use). Without cgroups, a container could consume all system resources and starve everything else on the host. Let’s implement resource limiting!
Here is a real-world analogy: namespaces are like giving each tenant in an apartment building their own mailbox and doorbell, so they don’t interfere with each other. Cgroups are like the lease agreement that limits how much water and electricity each tenant can use. Without the lease, one tenant could run every faucet at full blast and leave the rest of the building dry. In cloud computing, cgroups are what make multi-tenant systems safe and fair. Every major cloud provider (AWS, GCP, Azure) uses cgroups under the hood to enforce the resource limits you pay for.
package com.minidocker.cgroup;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Manages cgroup v2 resource limits.
 *
 * Cgroups control:
 * - CPU: How much CPU time a container can use
 * - Memory: Maximum memory allocation
 * - PIDs: Maximum number of processes
 * - I/O: Disk bandwidth limits
 */
public class CgroupManager {

    private static final Path CGROUP_ROOT = Path.of("/sys/fs/cgroup");
    private static final String CGROUP_NAME = "minidocker";

    private final String containerId;
    private final Path cgroupPath;

    public CgroupManager(String containerId) {
        this.containerId = containerId;
        this.cgroupPath = CGROUP_ROOT.resolve(CGROUP_NAME).resolve(containerId);
    }

    /**
     * Creates the cgroup for this container.
     */
    public void create() throws IOException {
        // Create the cgroup directory
        Files.createDirectories(cgroupPath);
        System.out.println("✓ Created cgroup: " + cgroupPath);

        // Enable controllers in the parent ("minidocker") so they are available
        // in this container's cgroup. The same controllers must also be enabled
        // in the root's cgroup.subtree_control for the parent to offer them.
        Path parentSubtreeControl = cgroupPath.getParent().resolve("cgroup.subtree_control");
        if (Files.exists(parentSubtreeControl)) {
            String controllers = "+cpu +memory +pids +io";
            Files.writeString(parentSubtreeControl, controllers);
            System.out.println("✓ Enabled controllers: cpu, memory, pids, io");
        }
    }

    /**
     * Sets CPU limit.
     *
     * @param cpuPercent CPU percentage (100 = 1 core, 200 = 2 cores)
     *
     * How this works under the hood: the kernel scheduler accounts CPU time
     * against cpu.max in windows of PERIOD microseconds. If the cgroup has
     * consumed more than QUOTA microseconds of CPU time in that period, all
     * processes in the group are throttled (paused) until the next period
     * begins. So "100000 100000" means "use at most 100ms of CPU every
     * 100ms" -- exactly one core. "200000 100000" means "200ms per 100ms"
     * -- two cores.
     *
     * Common pitfall: a QUOTA of 0 does NOT mean "unlimited".
     * Use "max 100000" for unlimited CPU.
     */
    public void setCpuLimit(int cpuPercent) throws IOException {
        if (cpuPercent <= 0) {
            throw new IllegalArgumentException("cpuPercent must be positive");
        }
        // cpu.max format: "$QUOTA $PERIOD"
        // QUOTA: microseconds of CPU time per PERIOD
        // PERIOD: typically 100000 (100ms)
        long period = 100_000; // 100ms
        long quota = period * cpuPercent / 100;

        Path cpuMaxPath = cgroupPath.resolve("cpu.max");
        String value = quota + " " + period;
        Files.writeString(cpuMaxPath, value);
        System.out.println("✓ CPU limit set: " + cpuPercent + "% (" + value + ")");
    }

    /**
     * Sets memory limit.
     *
     * @param memoryBytes Maximum memory in bytes
     */
    public void setMemoryLimit(long memoryBytes) throws IOException {
        Path memoryMaxPath = cgroupPath.resolve("memory.max");
        Files.writeString(memoryMaxPath, String.valueOf(memoryBytes));

        // Also set swap limit to prevent swap usage
        Path memorySwapMaxPath = cgroupPath.resolve("memory.swap.max");
        if (Files.exists(memorySwapMaxPath)) {
            Files.writeString(memorySwapMaxPath, "0");
        }

        String humanReadable = formatBytes(memoryBytes);
        System.out.println("✓ Memory limit set: " + humanReadable);
    }

    /**
     * Sets process (PID) limit.
     *
     * This is your defense against fork bombs -- a malicious or buggy process
     * that recursively spawns children until the system runs out of PIDs or
     * memory. Without this limit, a single container could crash the entire
     * host by exhausting the kernel's process table. Docker exposes the same
     * control through its --pids-limit flag.
     *
     * @param maxPids Maximum number of processes
     */
    public void setPidsLimit(int maxPids) throws IOException {
        Path pidsMaxPath = cgroupPath.resolve("pids.max");
        Files.writeString(pidsMaxPath, String.valueOf(maxPids));
        System.out.println("✓ PIDs limit set: " + maxPids);
    }

    /**
     * Sets I/O bandwidth limits.
     *
     * @param deviceMajorMinor Device ID (e.g., "8:0" for /dev/sda)
     * @param rbps Read bytes per second
     * @param wbps Write bytes per second
     */
    public void setIOLimit(String deviceMajorMinor, long rbps, long wbps) throws IOException {
        Path ioMaxPath = cgroupPath.resolve("io.max");
        String value = deviceMajorMinor + " rbps=" + rbps + " wbps=" + wbps;
        Files.writeString(ioMaxPath, value);
        System.out.println("✓ I/O limit set for " + deviceMajorMinor
                + ": read=" + formatBytes(rbps) + "/s, write=" + formatBytes(wbps) + "/s");
    }

    /**
     * Adds a process to this cgroup.
     *
     * @param pid Process ID to add (long, to match ProcessHandle.pid())
     */
    public void addProcess(long pid) throws IOException {
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        Files.writeString(procsPath, String.valueOf(pid));
        System.out.println("✓ Added PID " + pid + " to cgroup");
    }

    /**
     * Adds the current process to this cgroup.
     */
    public void addCurrentProcess() throws IOException {
        addProcess(ProcessHandle.current().pid());
    }

    /**
     * Gets current resource usage.
     */
    public ResourceUsage getUsage() throws IOException {
        long memoryBytes = 0;
        int pids = 0;
        long cpuUsage = 0;

        Path memoryCurrent = cgroupPath.resolve("memory.current");
        if (Files.exists(memoryCurrent)) {
            memoryBytes = Long.parseLong(Files.readString(memoryCurrent).trim());
        }

        Path pidsCurrent = cgroupPath.resolve("pids.current");
        if (Files.exists(pidsCurrent)) {
            pids = Integer.parseInt(Files.readString(pidsCurrent).trim());
        }

        Path cpuStat = cgroupPath.resolve("cpu.stat");
        if (Files.exists(cpuStat)) {
            String stats = Files.readString(cpuStat);
            for (String line : stats.split("\n")) {
                if (line.startsWith("usage_usec")) {
                    cpuUsage = Long.parseLong(line.split(" ")[1].trim());
                }
            }
        }

        return new ResourceUsage(memoryBytes, pids, cpuUsage);
    }

    /**
     * Removes the cgroup (cleanup).
     */
    public void destroy() throws IOException {
        // First, kill all processes in the cgroup. SIGKILL (destroyForcibly) is
        // used because a container's init process may ignore SIGTERM.
        Path procsPath = cgroupPath.resolve("cgroup.procs");
        if (Files.exists(procsPath)) {
            String procs = Files.readString(procsPath);
            for (String pid : procs.split("\n")) {
                if (!pid.isBlank()) {
                    ProcessHandle.of(Long.parseLong(pid.trim()))
                            .ifPresent(ProcessHandle::destroyForcibly);
                }
            }
        }

        // Then remove the directory. Note: the removal fails while any process
        // is still exiting, so callers may need to retry.
        Files.deleteIfExists(cgroupPath);
        System.out.println("✓ Destroyed cgroup: " + cgroupPath);
    }

    private static String formatBytes(long bytes) {
        if (bytes < 1024) return bytes + " B";
        if (bytes < 1024 * 1024) return (bytes / 1024) + " KB";
        if (bytes < 1024 * 1024 * 1024) return (bytes / (1024 * 1024)) + " MB";
        return (bytes / (1024 * 1024 * 1024)) + " GB";
    }

    /**
     * Resource usage statistics.
     */
    public record ResourceUsage(long memoryBytes, int pids, long cpuUsageMicros) {
        @Override
        public String toString() {
            return String.format("Memory: %s, PIDs: %d, CPU: %.2fms",
                    formatBytes(memoryBytes), pids, cpuUsageMicros / 1000.0);
        }
    }
}
# Run a CPU-intensive workload with a 50% CPU limit
sudo java Container --cpus=0.5 testhost /bin/sh -c "while true; do :; done"

# In another terminal, observe CPU usage (should be ~50%)
# pgrep can match several PIDs, so join them with commas for top
top -p "$(pgrep -d, -f 'while true')"
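The `--cpus=0.5` flag above is fractional, while `setCpuLimit` takes a whole percentage. One way such a flag could be mapped onto a `cpu.max` line is sketched below. The `CpuMax` class and `fromCpus` method are hypothetical names, not part of the `Container` CLI shown in this section, and the 1000 µs floor is an assumption about the kernel's minimum accepted quota.

```java
/** Hypothetical helper (not part of the original CgroupManager). */
class CpuMax {
    /** Default CFS period in microseconds (100ms). */
    static final long PERIOD_USEC = 100_000;
    /** Assumed kernel minimum for the quota, in microseconds. */
    static final long MIN_QUOTA_USEC = 1_000;

    /** Returns "QUOTA PERIOD" for a fractional core count, or "max PERIOD". */
    static String fromCpus(double cpus) {
        if (Double.isNaN(cpus) || cpus <= 0) {
            // Assumption: no positive limit given means "unlimited".
            return "max " + PERIOD_USEC;
        }
        long quota = Math.max(MIN_QUOTA_USEC, Math.round(cpus * PERIOD_USEC));
        return quota + " " + PERIOD_USEC;
    }

    public static void main(String[] args) {
        System.out.println(fromCpus(0.5)); // 50000 100000
        System.out.println(fromCpus(2.0)); // 200000 100000
    }
}
```

Keeping the period fixed and scaling only the quota mirrors what `setCpuLimit` does, so the two entry points would produce the same file contents for equivalent inputs.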
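// 1. Find the device major:minor for the disk
//    cat /sys/block/sda/dev
// 2. Set io.max
//    "8:0 rbps=1048576 wbps=1048576" = 1MB/s read/write
// 3. Test with dd, bypassing the page cache so the write limit is visible
//    (the target file must live on the limited device, not on tmpfs):
//    dd if=/dev/zero of=/tmp/test bs=1M count=100 oflag=direct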
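The first two steps can be sketched in Java as follows. `IoMaxLine` is a hypothetical helper, not part of `CgroupManager`; it validates the `MAJOR:MINOR` string read from a sysfs `dev` file and builds the line to write to `io.max`. Mapping a non-positive rate to `max` (no limit) is an assumption made for this sketch.

```java
/** Hypothetical helper for the io.max exercise. */
class IoMaxLine {
    /** Validates the contents of a sysfs dev file such as /sys/block/sda/dev. */
    static String parseDeviceId(String devFileContents) {
        String id = devFileContents.trim();
        if (!id.matches("\\d+:\\d+")) {
            throw new IllegalArgumentException("Expected MAJOR:MINOR, got: " + id);
        }
        return id;
    }

    /** Builds one io.max line with read and write byte-rate limits. */
    static String format(String deviceId, long rbps, long wbps) {
        return parseDeviceId(deviceId) + " rbps=" + limit(rbps) + " wbps=" + limit(wbps);
    }

    // Assumption for this sketch: zero or negative means "no limit".
    private static String limit(long bytesPerSecond) {
        return bytesPerSecond > 0 ? Long.toString(bytesPerSecond) : "max";
    }

    public static void main(String[] args) {
        // The trailing newline mimics what reading the sysfs file returns.
        System.out.println(format("8:0\n", 1_048_576, 1_048_576));
    }
}
```

The resulting string is what `setIOLimit` would write; validating the device id first turns a malformed value into a clear exception instead of a failed write.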
Exercise 2: Implement CPU Pinning
Pin container to specific CPU cores:
// Use the cpuset controller
// Write to cpuset.cpus: "0,2" (only use cores 0 and 2)
// Write to cpuset.mems: "0" (NUMA node 0)
// Pinning reduces cache contention, provided other workloads are kept off these cores
// Great for latency-sensitive workloads
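As a starting point for this exercise, here is a small sketch that turns a collection of core ids into the list syntax `cpuset.cpus` expects. `CpuSetList` is a hypothetical name; collapsing consecutive ids into ranges such as `0-2` is optional, since the kernel accepts both forms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper for the CPU pinning exercise. */
class CpuSetList {
    /** Formats core ids in list syntax, e.g. {0, 1, 2, 5} becomes "0-2,5". */
    static String format(Iterable<Integer> cores) {
        TreeSet<Integer> sorted = new TreeSet<>();
        for (int core : cores) {
            if (core < 0) {
                throw new IllegalArgumentException("Negative core id: " + core);
            }
            sorted.add(core);
        }

        List<String> parts = new ArrayList<>();
        Integer start = null;
        Integer prev = null;
        for (int core : sorted) {
            if (start == null) {
                start = core;                     // first id opens a range
            } else if (core != prev + 1) {
                parts.add(range(start, prev));    // gap found: close the range
                start = core;
            }
            prev = core;
        }
        if (start != null) {
            parts.add(range(start, prev));        // close the final range
        }
        return String.join(",", parts);
    }

    private static String range(int from, int to) {
        return from == to ? String.valueOf(from) : from + "-" + to;
    }

    public static void main(String[] args) {
        System.out.println(format(List.of(0, 2))); // 0,2
    }
}
```

The string would then be written to `cpuset.cpus` in the container's cgroup, which requires `+cpuset` to be enabled in the parent's `cgroup.subtree_control` alongside the other controllers.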
Exercise 3: Add Resource Monitoring
Implement real-time resource monitoring:
// 1. Read cpu.stat periodically for CPU usage
// 2. Read memory.current and memory.stat for memory
// 3. Read io.stat for disk I/O
// 4. Calculate deltas to get rates
// 5. Display like 'docker stats'
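Step 4 is where most of the arithmetic lives. A minimal sketch of the CPU part, assuming two `usage_usec` samples from `cpu.stat` and the wall-clock time between them measured with `System.nanoTime()`; `CpuRate` is a hypothetical name.

```java
/** Hypothetical helper for the monitoring exercise. */
class CpuRate {
    /**
     * Returns CPU use between two samples as a percentage of one core.
     * 100.0 means one core was fully busy; 200.0 means two cores.
     */
    static double percent(long prevUsageUsec, long currUsageUsec, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            return 0.0; // no time elapsed, so no rate can be computed
        }
        double elapsedUsec = elapsedNanos / 1_000.0;
        return 100.0 * (currUsageUsec - prevUsageUsec) / elapsedUsec;
    }

    public static void main(String[] args) {
        // 0.5 s of CPU time consumed over 1 s of wall-clock time.
        System.out.println(percent(1_000_000, 1_500_000, 1_000_000_000L)); // 50.0
    }
}
```

The same delta-over-elapsed-time pattern applies to the byte counters in `io.stat`; `memory.current` is already an instantaneous value and needs no delta.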
Explain the difference between cgroups v1 and v2. Why did the kernel need a new version?
Strong Answer:
Cgroups v1 had a per-controller hierarchy: each controller (cpu, memory, pids) had its own independent tree, and a process could be in different groups for different controllers. This created incoherent resource accounting — a process could be in cgroup A for CPU but cgroup B for memory, making it impossible to get a unified view of resource consumption.
Cgroups v2 uses a single unified hierarchy. A process belongs to exactly one cgroup, and all controllers apply to that cgroup. This makes resource accounting consistent and simplifies management tooling. It also enables cross-controller coordination — for example, the memory controller can influence the I/O controller’s behavior for a process, which was impossible in v1.
V2 also introduced the “no internal processes” rule: if a cgroup has child cgroups, the parent cannot directly contain processes. This eliminates ambiguity about which level of the hierarchy should be charged for resource usage.
The migration took years because systemd, Docker, and Kubernetes all had deep dependencies on v1 semantics. The practical interview relevance is that debugging resource limits in production requires knowing which cgroup version the host is running, because the filesystem paths and file formats differ significantly.
Follow-up: How do you check which cgroup version a system is running?
Check with stat -fc %T /sys/fs/cgroup/: if it returns cgroup2fs, the system uses v2; tmpfs indicates v1. In v1, you find CPU limits at /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us. In v2, it is /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.max. The file format also differs: v1 uses separate files for quota and period; v2 combines them into one file (cpu.max).
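The same check can be made from Java. The sketch below uses a different signal than `stat`: the `cgroup.controllers` file sits at the root of a v2 unified hierarchy and has no v1 counterpart at that location. `CgroupVersion` is a hypothetical helper, and `toCpuMax` only illustrates the file-format difference described above.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; names are illustrative only. */
class CgroupVersion {
    /** Guesses the cgroup version mounted at the given root directory. */
    static String detect(Path cgroupRoot) {
        if (Files.exists(cgroupRoot.resolve("cgroup.controllers"))) {
            return "v2";
        }
        if (Files.isDirectory(cgroupRoot)) {
            return "v1 or hybrid";
        }
        return "unknown";
    }

    /** Combines v1's cpu.cfs_quota_us and cpu.cfs_period_us into a v2 cpu.max line. */
    static String toCpuMax(long cfsQuotaUs, long cfsPeriodUs) {
        // v1 uses -1 for "no limit"; v2 spells it "max".
        String quota = cfsQuotaUs < 0 ? "max" : Long.toString(cfsQuotaUs);
        return quota + " " + cfsPeriodUs;
    }

    public static void main(String[] args) {
        System.out.println(detect(Path.of("/sys/fs/cgroup")));
        System.out.println(toCpuMax(50_000, 100_000)); // 50000 100000
    }
}
```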
How does the OOM killer decide which process to kill inside a container, and how would you tune it for a production service?
Strong Answer:
When a cgroup hits its memory.max limit, the kernel’s OOM killer activates for that cgroup specifically (it does not kill processes outside the cgroup). It selects the victim by calculating an oom_score for each process based on RSS, swap usage, and the oom_score_adj value.
For production tuning, set memory.high to a value below memory.max (e.g., high at 900MB, max at 1GB). When usage crosses memory.high, the kernel throttles memory allocation instead of killing, giving the application a chance to shed load or run garbage collection. This turns a hard crash into a soft slowdown.
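A sketch of how the two thresholds might be applied together, assuming the soft limit is expressed as a percentage of the hard limit. `MemoryThresholds` and its methods are hypothetical names, and the 90% figure simply mirrors the 900MB/1GB example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper; not part of the original CgroupManager. */
class MemoryThresholds {
    /** Computes memory.high as a percentage of memory.max. */
    static long highFor(long maxBytes, int percent) {
        if (maxBytes <= 0 || percent <= 0 || percent > 100) {
            throw new IllegalArgumentException("Need maxBytes > 0 and 0 < percent <= 100");
        }
        return Math.multiplyExact(maxBytes, (long) percent) / 100;
    }

    /** Writes both limits into the given cgroup directory. */
    static void apply(Path cgroupDir, long maxBytes, int percent) throws IOException {
        Files.writeString(cgroupDir.resolve("memory.max"), Long.toString(maxBytes));
        Files.writeString(cgroupDir.resolve("memory.high"),
                Long.toString(highFor(maxBytes, percent)));
    }

    public static void main(String[] args) {
        System.out.println(highFor(1_000_000_000L, 90)); // 900000000
    }
}
```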
Also configure oom_score_adj (range -1000 to 1000; lower values make a process less likely to be chosen) to steer which process dies if OOM is unavoidable. Setting the main process to -500 and leaving helper sidecars at 0 makes the sidecars the preferred victims. And critically, monitor memory.events to alert on oom_kill events before they cascade.
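Monitoring memory.events comes down to comparing counters between two reads of the file. A minimal sketch, assuming the file's contents are passed in as strings; `MemoryEvents` is a hypothetical name.

```java
/** Hypothetical helper for alerting on cgroup OOM kills. */
class MemoryEvents {
    /** Returns the counter for the given key, or 0 if the key is absent. */
    static long count(String memoryEvents, String key) {
        for (String line : memoryEvents.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            // Exact key match, so "oom" is not confused with "oom_kill".
            if (kv.length == 2 && kv[0].equals(key)) {
                return Long.parseLong(kv[1]);
            }
        }
        return 0;
    }

    /** True if the oom_kill counter grew between two samples of the file. */
    static boolean shouldAlert(String previous, String current) {
        return count(current, "oom_kill") > count(previous, "oom_kill");
    }

    public static void main(String[] args) {
        String before = "low 0\nhigh 12\nmax 3\noom 0\noom_kill 0\n";
        String after = "low 0\nhigh 40\nmax 9\noom 1\noom_kill 1\n";
        System.out.println(shouldAlert(before, after)); // true
    }
}
```

Watching the `high` and `max` counters with the same `count` method gives earlier warning, since they increase before any process is killed.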
A subtle point: application-level memory metrics (like JVM heap usage) only show user-space allocations. The kernel counts all memory charged to the cgroup, including page cache, tmpfs mounts, and kernel stack pages. A container writing heavily to tmpfs will trigger OOM even though the application reports low heap usage.
Follow-up: What is the difference between a cgroup OOM and a host-level OOM? Which is worse operationally?
A cgroup OOM is scoped and controlled: only processes in that cgroup die. A host-level OOM means the entire machine’s memory is exhausted, and the kernel’s global OOM killer activates, often killing the wrong process. Host OOM is operationally catastrophic because it cascades across unrelated services. This is precisely why cgroup memory limits exist: they convert a global disaster into a local, contained failure.
What is CPU throttling in CFS bandwidth control, and why does it cause latency spikes even when average CPU usage is low?
Strong Answer:
CFS bandwidth control gives each cgroup a quota of CPU time per period (default 100ms). If cpu.max is 50000 100000, the container gets 50ms of CPU time per 100ms period.
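The latency spike occurs when the container uses its entire 50ms quota in a burst at the start of the period. For the rest of the period, its threads are suspended even if the CPU is idle. A single busy thread exhausts the quota after 50ms and stalls for the remaining 50ms; several threads running in parallel exhaust it sooner and stall for longer. A request arriving during the throttled window waits until the next period begins, creating artificial latency.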
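The length of the throttled window can be worked out directly. The sketch below assumes a worst case in which every busy thread spins from the very start of the period; `ThrottleWindow` is a hypothetical name, and the model ignores scheduler details such as per-CPU quota slices.

```java
/** Hypothetical worked example of the throttling arithmetic. */
class ThrottleWindow {
    /**
     * Wall-clock milliseconds per period spent throttled, assuming all
     * busy threads burn CPU from the start of the period.
     */
    static double throttledMillis(double quotaMs, double periodMs, int busyThreads) {
        if (busyThreads < 1) {
            throw new IllegalArgumentException("busyThreads must be at least 1");
        }
        // N threads consume the quota N times faster in wall-clock terms.
        double wallTimeToExhaust = quotaMs / busyThreads;
        return Math.max(0.0, periodMs - wallTimeToExhaust);
    }

    public static void main(String[] args) {
        System.out.println(throttledMillis(50, 100, 1)); // 50.0
        System.out.println(throttledMillis(50, 100, 4)); // 87.5
    }
}
```

With four busy threads the 50ms quota is gone after 12.5ms of wall-clock time, so a request arriving just afterwards can wait close to a full period.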
This is measurable via cpu.stat — look for nr_throttled and throttled_usec. A container with low average CPU usage can still show thousands of throttle events if usage is bursty.
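A rough sketch of how those counters could be turned into a single health metric: the fraction of periods in which the cgroup was throttled. `ThrottleStats` is a hypothetical name, and the sample input in `main` is made-up data used only to show the file's shape.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for reading throttling counters from cpu.stat. */
class ThrottleStats {
    /** Parses "key value" lines into a map; other lines are skipped. */
    static Map<String, Long> parse(String cpuStat) {
        Map<String, Long> fields = new HashMap<>();
        for (String line : cpuStat.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2) {
                fields.put(kv[0], Long.parseLong(kv[1]));
            }
        }
        return fields;
    }

    /** Fraction of enforcement periods in which the cgroup was throttled. */
    static double throttledRatio(Map<String, Long> fields) {
        long periods = fields.getOrDefault("nr_periods", 0L);
        long throttled = fields.getOrDefault("nr_throttled", 0L);
        return periods == 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) {
        // Illustrative sample, not real measurements.
        String sample = "usage_usec 5000000\nnr_periods 1000\n"
                + "nr_throttled 250\nthrottled_usec 1200000\n";
        System.out.println(throttledRatio(parse(sample))); // 0.25
    }
}
```

Because the counters are cumulative, a monitoring loop would compute this ratio over the difference between two samples rather than over the lifetime totals.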
Mitigation strategies: increase the CPU limit for burst headroom, reduce the CFS period (smaller periods like 10ms distribute throttling more evenly but increase scheduling overhead), or use cpuset.cpus to pin the container to dedicated cores and leave cpu.max at max, so the core count rather than a bandwidth quota bounds its CPU use.
Follow-up: Why do some Kubernetes operators recommend removing CPU limits entirely?
The argument is that CFS throttling causes more damage to latency-sensitive services than noisy neighbors do. If the cluster has accurate CPU requests and nodes are not overprovisioned, the scheduler places pods so total requested CPU does not exceed node capacity. Without limits, a pod can burst above its request when CPU is available. The counterargument is that without limits, a misbehaving pod can starve neighbors during peak load. The right answer depends on workload characteristics: for steady-state services with predictable CPU profiles, removing limits works well; for batch jobs or untrusted workloads, limits are necessary.