Chapter 2: Control Groups (cgroups)
While namespaces provide isolation (what a container can see), cgroups provide resource limits (what a container can use). Without cgroups, a container could consume all system resources and starve everything else on the host. Let’s implement resource limiting!
Here is a real-world analogy: namespaces are like giving each tenant in an apartment building their own mailbox and doorbell, so they don’t interfere with each other. Cgroups are like the lease agreement that limits how much water and electricity each tenant can use. Without the lease, one tenant could run every faucet at full blast and leave the rest of the building dry.
In cloud computing, cgroups are what make multi-tenant systems safe and fair. Every major cloud provider (AWS, GCP, Azure) uses cgroups under the hood to enforce the resource limits you pay for.
Prerequisites: Chapter 1: Namespaces
Further Reading: Operating Systems: Resource Management
Time: 3-4 hours
Outcome: Containers with CPU, memory, and process limits
What Are Cgroups?
Cgroup v2 Hierarchy
Part 1: Cgroup Manager
src/main/java/com/minidocker/cgroup/CgroupManager.java
Part 2: Resource Limits Configuration
src/main/java/com/minidocker/cgroup/ResourceLimits.java
Part 3: Integrating with Container
src/main/java/com/minidocker/Container.java
Part 4: Testing Resource Limits
Testing CPU Limits
Testing Memory Limits
Testing PID Limits
Understanding Cgroup Controllers
Exercises
Exercise 1: Implement I/O Throttling
Add disk I/O limits:
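One possible starting point, as a sketch only: in cgroup v2, per-device throughput limits are written to the `io.max` file as a line of the form `MAJ:MIN rbps=<bytes> wbps=<bytes>`, where `max` means no limit. The `IoMax` class and its `line` method below are hypothetical names for illustration, not part of the chapter's code. The helper only builds the line; writing it to `io.max` additionally assumes the `io` controller is enabled for the container's cgroup.

```java
import java.util.OptionalLong;

/** Hypothetical helper: builds one line for a cgroup v2 io.max file. */
final class IoMax {
    /**
     * @param major    block device major number (for example 8 for many SCSI/SATA disks)
     * @param minor    block device minor number
     * @param readBps  read limit in bytes per second, empty for no limit
     * @param writeBps write limit in bytes per second, empty for no limit
     */
    static String line(int major, int minor, OptionalLong readBps, OptionalLong writeBps) {
        if (major < 0 || minor < 0) {
            throw new IllegalArgumentException("device numbers must be non-negative");
        }
        return major + ":" + minor
                + " rbps=" + render(readBps)
                + " wbps=" + render(writeBps);
    }

    // The kernel accepts the literal "max" to mean "unlimited" for a key.
    private static String render(OptionalLong limit) {
        return limit.isPresent() ? Long.toString(limit.getAsLong()) : "max";
    }

    public static void main(String[] args) {
        // Limit reads to 1 MiB/s on device 8:0 and leave writes unlimited.
        System.out.println(line(8, 0, OptionalLong.of(1024L * 1024), OptionalLong.empty()));
    }
}
```

The IOPS keys (`riops`, `wiops`) follow the same pattern and are left out to keep the sketch short.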
Exercise 2: Implement CPU Pinning
Pin container to specific CPU cores:
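As a hedged sketch of one piece of this exercise: the cgroup v2 `cpuset.cpus` file takes a comma-separated list of core numbers and ranges, such as `0-2,5`. The hypothetical `CpuSetSpec.format` helper below turns a collection of core IDs into that form. It does not check that the cores exist on the host, and it assumes the `cpuset` controller is enabled for the container's cgroup.

```java
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper: renders core IDs in the list format used by cpuset.cpus. */
final class CpuSetSpec {
    static String format(Collection<Integer> cores) {
        TreeSet<Integer> sorted = new TreeSet<>(cores); // sorts and removes duplicates
        StringBuilder sb = new StringBuilder();
        int start = -1; // -1 means "no range open yet"
        int prev = -1;
        for (int core : sorted) {
            if (core < 0) {
                throw new IllegalArgumentException("core IDs must be non-negative: " + core);
            }
            if (start < 0) {
                start = core;                  // open the first range
            } else if (core != prev + 1) {
                appendRange(sb, start, prev);  // gap found: close the current range
                start = core;
            }
            prev = core;
        }
        if (start >= 0) {
            appendRange(sb, start, prev);      // close the final range
        }
        return sb.toString();
    }

    private static void appendRange(StringBuilder sb, int start, int end) {
        if (sb.length() > 0) {
            sb.append(',');
        }
        sb.append(start);
        if (end > start) {
            sb.append('-').append(end);
        }
    }

    public static void main(String[] args) {
        System.out.println(format(List.of(5, 0, 1, 2))); // cores 0, 1, 2 and 5
    }
}
```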
Exercise 3: Add Resource Monitoring
Implement real-time resource monitoring:
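A minimal sketch of how such monitoring might start, assuming cgroup v2: `cpu.stat` exposes cumulative CPU time as `usage_usec`, and `memory.current` holds the bytes currently charged to the cgroup. Taking two samples and dividing the CPU-time delta by the elapsed wall time gives a utilization figure, where 100% means one full core. The `ResourceSample` class and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical snapshot of a cgroup's CPU and memory usage at one point in time. */
final class ResourceSample {
    final long usageUsec;    // cumulative CPU time from cpu.stat
    final long memoryBytes;  // value of memory.current
    final long takenAtNanos; // monotonic timestamp of the sample

    ResourceSample(long usageUsec, long memoryBytes, long takenAtNanos) {
        this.usageUsec = usageUsec;
        this.memoryBytes = memoryBytes;
        this.takenAtNanos = takenAtNanos;
    }

    /** Extracts the usage_usec value from the text of a cpu.stat file. */
    static long usageUsec(String cpuStat) {
        for (String line : cpuStat.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2 && kv[0].equals("usage_usec")) {
                return Long.parseLong(kv[1]);
            }
        }
        throw new IllegalArgumentException("usage_usec not found in cpu.stat");
    }

    /** Reads one sample from a cgroup directory such as /sys/fs/cgroup/<group>. */
    static ResourceSample read(Path cgroupDir) throws IOException {
        long usage = usageUsec(Files.readString(cgroupDir.resolve("cpu.stat")));
        long memory = Long.parseLong(Files.readString(cgroupDir.resolve("memory.current")).trim());
        return new ResourceSample(usage, memory, System.nanoTime());
    }

    /** CPU utilization between two samples; 100.0 means one core fully busy. */
    static double cpuPercent(ResourceSample before, ResourceSample after) {
        long elapsedUsec = (after.takenAtNanos - before.takenAtNanos) / 1_000;
        if (elapsedUsec <= 0) {
            return 0.0;
        }
        return 100.0 * (after.usageUsec - before.usageUsec) / elapsedUsec;
    }
}
```

A monitor loop would call `read` on a fixed interval and pass consecutive samples to `cpuPercent`.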
Key Takeaways
Hierarchical
Cgroups form a tree - children inherit parent limits
Controller-Based
Different controllers for CPU, memory, I/O, PIDs
Kernel Enforcement
Limits are enforced by kernel, not by honor system
OOM Killer
Exceeding memory limit triggers OOM killer
What’s Next?
In Chapter 3: Filesystem, we’ll implement:
- Overlay filesystems
- Copy-on-write layers
- Container root filesystem setup
Next: Filesystem
Build the layered filesystem
Interview Deep-Dive
Explain the difference between cgroups v1 and v2. Why did the kernel need a new version?
Strong Answer:
- Cgroups v1 had a per-controller hierarchy: each controller (cpu, memory, pids) had its own independent tree, and a process could be in different groups for different controllers. This created incoherent resource accounting — a process could be in cgroup A for CPU but cgroup B for memory, making it impossible to get a unified view of resource consumption.
- Cgroups v2 uses a single unified hierarchy. A process belongs to exactly one cgroup, and all controllers apply to that cgroup. This makes resource accounting consistent and simplifies management tooling. It also enables cross-controller coordination — for example, the memory controller can influence the I/O controller’s behavior for a process, which was impossible in v1.
- V2 also introduced the “no internal processes” rule: if a cgroup has child cgroups, the parent cannot directly contain processes. This eliminates ambiguity about which level of the hierarchy should be charged for resource usage.
- The migration took years because systemd, Docker, and Kubernetes all had deep dependencies on v1 semantics. The practical interview relevance is that debugging resource limits in production requires knowing which cgroup version the host is running, because the filesystem paths and file formats differ significantly.
To check which cgroup version a host uses, run stat -fc %T /sys/fs/cgroup/. If it returns cgroup2fs, the system uses v2; tmpfs indicates v1. In v1, you find CPU limits at /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us. In v2 (with the systemd cgroup driver), it is /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.max. The file format also differs: v1 uses separate files for quota and period; v2 combines them into one file (cpu.max).
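The format difference can be made concrete with a small sketch. The hypothetical `CpuMax` class below models the v2 `cpu.max` value (`<quota|max> <period>`) and shows how the two v1 values, `cpu.cfs_quota_us` and `cpu.cfs_period_us`, map onto it. It assumes the usual v1 convention that a quota of -1 means "no limit". The class is illustrative and not part of the chapter's code.

```java
import java.util.OptionalLong;

/** Hypothetical model of a cgroup v2 cpu.max value. */
final class CpuMax {
    final OptionalLong quotaUsec; // empty means "max", i.e. no quota
    final long periodUsec;

    CpuMax(OptionalLong quotaUsec, long periodUsec) {
        this.quotaUsec = quotaUsec;
        this.periodUsec = periodUsec;
    }

    /** Parses the v2 single-file format: "<quota|max> <period>". */
    static CpuMax parse(String content) {
        String[] parts = content.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected '<quota|max> <period>', got: " + content);
        }
        OptionalLong quota = parts[0].equals("max")
                ? OptionalLong.empty()
                : OptionalLong.of(Long.parseLong(parts[0]));
        return new CpuMax(quota, Long.parseLong(parts[1]));
    }

    /** Combines the two v1 values; v1 stores -1 in cpu.cfs_quota_us for "no limit". */
    static CpuMax fromV1(long cfsQuotaUs, long cfsPeriodUs) {
        OptionalLong quota = cfsQuotaUs < 0 ? OptionalLong.empty() : OptionalLong.of(cfsQuotaUs);
        return new CpuMax(quota, cfsPeriodUs);
    }

    /** Renders the value in the form the v2 cpu.max file uses. */
    String toFileContent() {
        String quota = quotaUsec.isPresent() ? Long.toString(quotaUsec.getAsLong()) : "max";
        return quota + " " + periodUsec;
    }

    public static void main(String[] args) {
        System.out.println(fromV1(50_000, 100_000).toFileContent());
        System.out.println(fromV1(-1, 100_000).toFileContent());
    }
}
```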
How does the OOM killer decide which process to kill inside a container, and how would you tune it for a production service?
Strong Answer:
- When a cgroup hits its memory.max limit, the kernel’s OOM killer activates for that cgroup specifically (it does not kill processes outside the cgroup). It selects the victim by calculating an oom_score for each process based on RSS, swap usage, and the oom_score_adj value.
- For production tuning, set memory.high to a value below memory.max (e.g., high at 900MB, max at 1GB). When usage crosses memory.high, the kernel throttles memory allocation instead of killing, giving the application a chance to shed load or run garbage collection. This turns a hard crash into a soft slowdown.
- Also configure oom_score_adj so that the right process dies if OOM is unavoidable. Setting the main process to -500 and helper sidecars to 0 makes the sidecars the preferred victims. And critically, monitor memory.events to alert on oom_kill events before they cascade.
- A subtle point: application-level memory metrics (like JVM heap usage) only show user-space allocations. The kernel counts all memory charged to the cgroup, including page cache, tmpfs mounts, and kernel stack pages. A container writing heavily to tmpfs can trigger OOM even though the application reports low heap usage.
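The memory.high / memory.max pairing and the memory.events check could be sketched as follows. This assumes cgroup v2, where both limits are plain byte counts written to files in the cgroup directory and memory.events is a list of `key value` lines. The `MemoryTuning` class and its method names are hypothetical, and writing to a real cgroup directory normally requires root or a delegated cgroup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for the soft/hard memory limit pattern. */
final class MemoryTuning {
    /** Writes the soft (memory.high) and hard (memory.max) limits, both in bytes. */
    static void applyLimits(Path cgroupDir, long highBytes, long maxBytes) throws IOException {
        if (highBytes > maxBytes) {
            throw new IllegalArgumentException("memory.high should not exceed memory.max");
        }
        Files.writeString(cgroupDir.resolve("memory.high"), Long.toString(highBytes));
        Files.writeString(cgroupDir.resolve("memory.max"), Long.toString(maxBytes));
    }

    /** Parses memory.events content into a map, e.g. {"oom_kill" -> 2}. */
    static Map<String, Long> parseEvents(String content) {
        Map<String, Long> events = new HashMap<>();
        for (String line : content.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2) {
                events.put(kv[0], Long.parseLong(kv[1]));
            }
        }
        return events;
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for a real cgroup directory here.
        Path dir = Files.createTempDirectory("fake-cgroup");
        applyLimits(dir, 900L * 1024 * 1024, 1024L * 1024 * 1024);
        System.out.println(Files.readString(dir.resolve("memory.high")));
        System.out.println(parseEvents("high 12\noom 1\noom_kill 1\n"));
    }
}
```

An alerting loop would compare the oom_kill counter between polls and raise an alert when it increases.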
What is CPU throttling in CFS bandwidth control, and why does it cause latency spikes even when average CPU usage is low?
Strong Answer:
- CFS bandwidth control gives each cgroup a quota of CPU time per period (default 100ms). If cpu.max is 50000 100000, the container gets 50ms of CPU time per 100ms period.
- The latency spike occurs when the container uses its entire 50ms quota in a burst at the start of the period. For the remaining 50ms, threads are suspended even if the CPU is idle. A request arriving during the throttled window waits up to 50ms in this example, and closer to the full 100ms when several threads burn through the quota in parallel, creating artificial latency.
- This is measurable via cpu.stat: look for nr_throttled and throttled_usec. A container with low average CPU usage can still show thousands of throttle events if usage is bursty.
- Mitigation strategies: increase the CPU limit for burst headroom, reduce the CFS period (smaller periods like 10ms distribute throttling more evenly but increase scheduling overhead), or use cpuset.cpus to pin the container to dedicated cores and leave cpu.max unlimited, so the bandwidth controller never throttles it.
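To put numbers on the points above, here is a small sketch. `throttledRatio` reads the nr_periods and nr_throttled counters from cpu.stat text and reports the fraction of periods in which the cgroup was throttled. `worstCaseStallMs` is a deliberately simplified model, assuming all busy threads start at the beginning of the period and run in parallel on separate cores. The `ThrottleStats` class and both method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helpers for reasoning about CFS throttling. */
final class ThrottleStats {
    /** Parses cpu.stat content ("key value" per line) into a map. */
    static Map<String, Long> parse(String cpuStat) {
        Map<String, Long> stat = new HashMap<>();
        for (String line : cpuStat.split("\n")) {
            String[] kv = line.trim().split("\\s+");
            if (kv.length == 2) {
                stat.put(kv[0], Long.parseLong(kv[1]));
            }
        }
        return stat;
    }

    /** Fraction of enforcement periods in which the cgroup was throttled. */
    static double throttledRatio(Map<String, Long> stat) {
        long periods = stat.getOrDefault("nr_periods", 0L);
        if (periods == 0) {
            return 0.0;
        }
        return (double) stat.getOrDefault("nr_throttled", 0L) / periods;
    }

    /**
     * Simplified model: busyThreads run flat out in parallel from the start of
     * the period, so the quota is exhausted after quotaMs / busyThreads of wall
     * time and the cgroup stalls for the rest of the period.
     */
    static double worstCaseStallMs(double quotaMs, double periodMs, int busyThreads) {
        if (busyThreads < 1) {
            throw new IllegalArgumentException("busyThreads must be at least 1");
        }
        double wallTimeToExhaustMs = quotaMs / busyThreads;
        return Math.max(0.0, periodMs - wallTimeToExhaustMs);
    }

    public static void main(String[] args) {
        Map<String, Long> stat = parse("nr_periods 200\nnr_throttled 50\nthrottled_usec 123456\n");
        System.out.println(throttledRatio(stat));
        System.out.println(worstCaseStallMs(50, 100, 1));
        System.out.println(worstCaseStallMs(50, 100, 4));
    }
}
```

Under this model, one busy thread with a 50ms quota stalls for 50ms, while four busy threads exhaust the same quota in 12.5ms and stall for 87.5ms, which is why multi-threaded services see the worst spikes.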