Control groups are the foundation of container resource management. Understanding cgroups deeply is essential for debugging container issues and designing resource-aware systems.
Prerequisites: Process fundamentals, basic container concepts
Interview Focus: cgroups v2, memory limits, CPU throttling, debugging
Companies: All container/cloud companies heavily test this
Common Interview Topic: Why is my container slow even when CPU usage looks low? Answer: CPU throttling!
# Check throttling statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/cpu.stat

# Example output:
usage_usec 1234567890      # Total CPU time used
user_usec 1000000000       # User-space time
system_usec 234567890      # Kernel-space time
nr_periods 50000           # Number of enforcement periods
nr_throttled 1000          # Periods where throttling occurred
throttled_usec 5000000     # Total time spent throttled

# Throttling percentage:
#   throttled_percentage = nr_throttled / nr_periods * 100
# If > 5%, consider increasing CPU limit
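The percentage formula above can be computed straight from the file. This is a minimal sketch, assuming `awk` is available; the function name `throttle_pct` is illustrative and not part of any existing tool.

```shell
# throttle_pct FILE
# Print the share of enforcement periods that were throttled, as a percentage.
# FILE is a cpu.stat file in the format shown above.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.2f\n", throttled / periods * 100
            else             print "0.00"   # no periods recorded yet
        }
    ' "$1"
}

# Usage (placeholder path, as in the commands above):
#   throttle_pct /sys/fs/cgroup/<path>/cpu.stat
```

With the example output above (1000 throttled periods out of 50000), the function prints 2.00. The counters are cumulative, so comparing two samples taken some seconds apart gives the current throttling rate rather than the lifetime average.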
# Set target latency for device
echo "8:0 target=10000" > /sys/fs/cgroup/<path>/io.latency
# 10000 = 10ms target latency
# The kernel will throttle this cgroup if its I/O
# is causing other cgroups to exceed their targets
# Limit maximum number of processes
echo 100 > /sys/fs/cgroup/<path>/pids.max

# Check current count
cat /sys/fs/cgroup/<path>/pids.current

# Check if limit was hit
cat /sys/fs/cgroup/<path>/pids.events
# max 5   <- Fork denied 5 times due to limit
Fork Bomb Protection: The pids controller prevents fork bombs from exhausting system resources. Set reasonable limits on all container cgroups.
# Pin to specific CPUs
echo "0,2,4,6" > /sys/fs/cgroup/<path>/cpuset.cpus

# Pin to specific memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/<path>/cpuset.mems

# Verify current settings
cat /sys/fs/cgroup/<path>/cpuset.cpus.effective
cat /sys/fs/cgroup/<path>/cpuset.mems.effective
Cause: Burst usage exceeds the quota within a period even though the average is low.
Solutions:
Increase CPU limit
Use cpu.max.burst (cgroups v2, Linux 5.14+) for burst allowance
Spread work more evenly
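The first two solutions come down to writing two files. In cgroups v2 the limit lives in cpu.max as "<quota> <period>" in microseconds, and the burst allowance lives in cpu.max.burst. The sketch below only formats the value; `cpu_max_line` is a hypothetical helper name, and the numbers are examples rather than recommendations.

```shell
# cpu_max_line MILLICORES [PERIOD_USEC]
# Print a "<quota> <period>" value for cpu.max (period defaults to 100000us).
cpu_max_line() {
    cm_period=${2:-100000}
    echo "$(( $1 * $cm_period / 1000 )) $cm_period"
}

cpu_max_line 500    # prints "50000 100000": half a CPU per 100ms period
cpu_max_line 2000   # prints "200000 100000": two CPUs

# Applying it (placeholder path; needs write access to the cgroup):
#   cpu_max_line 2000 > /sys/fs/cgroup/<path>/cpu.max
#   echo 50000 > /sys/fs/cgroup/<path>/cpu.max.burst   # burst allowance in usec
```

The burst value cannot exceed the quota, so raise cpu.max first if a larger burst is needed.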
Can't write to cgroup files
Symptom: echo: write error: Device or resource busy
Diagnosis:
# Check if cgroup has processes
cat /sys/fs/cgroup/<path>/cgroup.procs

# Check if cgroup has children
ls /sys/fs/cgroup/<path>/
Cause: The cgroups v2 “no internal processes” rule: if a cgroup has controllers enabled in cgroup.subtree_control, its processes must live in leaf cgroups.
Solution: Move processes to leaf cgroups before modifying the parent.
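The fix can be sketched as two small helpers: one that creates a leaf and moves the parent's processes into it, and one that then enables controllers on the parent. Both function names are hypothetical, and the sketch assumes root access and a cgroup v2 mount.

```shell
# move_to_leaf PARENT LEAF
# Create PARENT/LEAF and move every PID listed in PARENT/cgroup.procs into it.
move_to_leaf() {
    mkdir -p "$1/$2" || return 1
    # cgroup.procs takes one PID per write, so write them one at a time.
    for ml_pid in $(cat "$1/cgroup.procs"); do
        echo "$ml_pid" > "$1/$2/cgroup.procs" ||
            echo "could not move PID $ml_pid" >&2
    done
}

# enable_controllers PARENT CONTROLLER...
# Enable the named controllers for PARENT's children (writes e.g. "+cpu +memory").
enable_controllers() {
    ec_parent=$1; shift
    ec_list=""
    for ec_name in "$@"; do ec_list="$ec_list +$ec_name"; done
    echo "${ec_list# }" > "$ec_parent/cgroup.subtree_control"
}

# Usage (placeholder path):
#   move_to_leaf /sys/fs/cgroup/<path> leaf
#   enable_controllers /sys/fs/cgroup/<path> cpu memory
```

The order matters: the parent must be empty of processes before the write to cgroup.subtree_control can succeed. A process may exit between the read and the write, which is why a failed move is reported rather than treated as fatal.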
IO limits not working
Symptom: io.max is set, but the process still uses the full disk bandwidth.
Diagnosis:
# Check if IO controller is enabled
cat /sys/fs/cgroup/<path>/cgroup.controllers | grep io

# Check io.max format (needs MAJ:MIN)
lsblk
cat /sys/fs/cgroup/<path>/io.max
Common causes:
Wrong device major:minor
IO through page cache (buffered writes)
Controller not enabled
Solution:
# Use O_DIRECT or sync writes
# Or use io.latency for latency-based control
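Since a wrong device number is the first cause listed, it can help to build the io.max line with a check on the MAJ:MIN format. The helper below is a sketch; `io_max_line` is a made-up name, and the 10 MiB/s figures are example values only.

```shell
# io_max_line MAJ:MIN READ_BPS WRITE_BPS
# Print a line in io.max format, rejecting device arguments that are not MAJ:MIN.
io_max_line() {
    case "$1" in
        *[!0-9:]*|*:*:*) im_ok=no ;;    # stray characters or too many colons
        [0-9]*:[0-9]*)   im_ok=yes ;;
        *)               im_ok=no ;;
    esac
    if [ "$im_ok" = no ]; then
        echo "device must be MAJ:MIN, got '$1'" >&2
        return 1
    fi
    echo "$1 rbps=$2 wbps=$3"
}

io_max_line 8:0 10485760 10485760   # prints "8:0 rbps=10485760 wbps=10485760"

# Applying it (placeholder device and path):
#   lsblk -dno MAJ:MIN /dev/sda
#   io_max_line 8:0 10485760 10485760 > /sys/fs/cgroup/<path>/io.max
```

Passing a device path such as /dev/sda instead of 8:0 fails the check here, before the kernel has a chance to reject the write.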
A Kubernetes pod is showing 30% average CPU usage but requests are timing out. Users report the service is 'slow.' How do you diagnose this using cgroup internals, and what is the most likely cause?
Strong Answer:
The most likely cause is CFS bandwidth throttling. Average CPU usage can be misleading because CFS enforces CPU limits per-period (typically 100ms). A service might use only 30% average CPU, but if requests arrive in bursts, the container could exhaust its entire quota in the first 30ms of a period and be throttled for the remaining 70ms. During throttling, all threads in the cgroup are descheduled regardless of available CPU on the host.
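A little arithmetic shows how a burst turns into latency. The function below is a deliberately simplified model, not the kernel's accounting: it assumes a single thread, a burst that starts at a period boundary with a full quota, and no other load in the cgroup. The name `burst_wall_ms` is illustrative.

```shell
# burst_wall_ms WORK_MS QUOTA_MS PERIOD_MS
# Wall-clock time needed to finish WORK_MS of CPU work under a quota.
# Simplified model (an assumption): one thread, burst starts with a full quota.
burst_wall_ms() {
    # Periods in which the quota runs out and the thread waits for the refill.
    bw_throttled=$(( ($1 - 1) / $2 ))
    # CPU time still needed in the final period.
    bw_rest=$(( $1 - bw_throttled * $2 ))
    echo $(( bw_throttled * $3 + bw_rest ))
}

burst_wall_ms 60 30 100    # prints 130: 30ms run, 70ms throttled, 30ms run
burst_wall_ms 60 100 100   # prints 60: the quota is never exhausted
```

Under this model a request needing 60ms of CPU takes 130ms with a 30ms quota, more than double its unthrottled time, even though average CPU usage stays low.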
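To diagnose, I would read /sys/fs/cgroup/<path>/cpu.stat and compute the throttle percentage: nr_throttled / nr_periods * 100. If this exceeds 5%, throttling is significant. I would also check throttled_usec to see total time lost to throttling. Because these counters are cumulative, I would compare two samples taken some seconds apart to get the current rate rather than the lifetime average.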
The kernel mechanism is the CFS bandwidth controller in kernel/sched/fair.c. Each cgroup has a cfs_bandwidth structure tracking quota and period. When a task in the cgroup runs, its runtime is charged against the quota. When quota reaches zero, the scheduler removes all tasks in the cgroup from their runqueues until the next period boundary replenishes the quota.
Solutions in order of preference: increase the CPU limit in the Kubernetes resource spec; set cpu.max.burst in cgroups v2, which lets the cgroup bank unused quota from quiet periods and spend it in a later burst, up to the configured amount; spread work more evenly across time using request queuing; or rely on CPU shares (cpu.weight) instead of hard limits if the node is not oversubscribed.
Follow-up: Why does Kubernetes use CPU limits (quota) at all instead of just CPU requests (shares)?
Follow-up Answer:
CPU requests map to cpu.weight (shares) which provide proportional fairness: if a container requests 1 CPU and another requests 2 CPUs, the second gets twice the CPU time when both are contending. But shares only work when there is contention. Without limits, a runaway container could consume all available CPU during low-load periods, then cause latency for other containers when load increases. Limits provide a hard ceiling via cpu.max that caps CPU usage regardless of host utilization. The trade-off is that limits cause throttling even when CPU is idle, which is wasteful. Many teams now advocate for setting requests but not limits for CPU (Google’s best practice), accepting the risk of noisy neighbors in exchange for eliminating throttling-induced latency.
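The request-to-weight mapping can be made concrete. The sketch below assumes the conversion used when cgroup v2 support first landed in runc and the kubelet: requests become v1-style shares (1024 per CPU), and shares are mapped linearly onto the cpu.weight range. Newer runtime versions may use a different curve, so treat the formula and both function names as assumptions to verify against the runtime in use.

```shell
# shares_from_millicores MILLICORES
# v1-style cpu.shares for a CPU request (1024 per full CPU, minimum 2).
shares_from_millicores() {
    sf_shares=$(( $1 * 1024 / 1000 ))
    if [ "$sf_shares" -lt 2 ]; then sf_shares=2; fi
    echo "$sf_shares"
}

# weight_from_shares SHARES
# Assumed linear mapping of shares [2, 262144] onto cpu.weight [1, 10000].
weight_from_shares() {
    echo $(( 1 + (($1 - 2) * 9999) / 262142 ))
}

weight_from_shares "$(shares_from_millicores 1000)"   # 1024 shares -> prints 39
weight_from_shares "$(shares_from_millicores 2000)"   # 2048 shares -> prints 79
```

The resulting weights of 39 and 79 keep roughly the 1:2 ratio of the requests, which is all that matters for proportional sharing under contention.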
Explain the cgroups v2 'no internal processes' rule and why it was introduced. How does this affect container runtime design?
Strong Answer:
In cgroups v2, a cgroup that has controllers enabled in its cgroup.subtree_control cannot have processes directly in cgroup.procs — processes must be in leaf cgroups only. This means if /sys/fs/cgroup/mygroup/ enables CPU and memory controllers for its children, all processes must be in child cgroups like /sys/fs/cgroup/mygroup/child1/, not in /sys/fs/cgroup/mygroup/ itself.
This rule was introduced to solve a fundamental ambiguity in cgroups v1: if a parent cgroup has processes and also has child cgroups, how should the parent’s resource share be divided between its direct processes and its children? In v1, this was handled inconsistently across controllers and led to confusing behavior where resource distribution depended on the tree structure in unintuitive ways.
For container runtimes, this means the cgroup hierarchy must be designed carefully. A container runtime cannot simply create /sys/fs/cgroup/containers/ with controllers enabled and dump container processes there. Instead, it must create /sys/fs/cgroup/containers/container-abc/ for each container and place processes in the leaf. Systemd handles this naturally with its slice/scope/service hierarchy: system.slice -> docker-abc123.scope where the scope is the leaf containing the container’s processes.
The practical impact is that management processes (such as the container runtime’s shim) must be in a separate cgroup from the container’s processes. Typically the shim stays in a cgroup outside the container’s own, for example alongside the runtime daemon, while runc places the container’s processes in their own leaf.
Follow-up: How does thread-level cgroup control work in v2, and when would you use it?
Follow-up Answer:
Cgroups v2 introduced threaded mode: writing "threaded" to a cgroup’s cgroup.type allows individual threads of a process to be in different cgroups within a threaded subtree. This is useful when you want to give different threads different CPU priorities within the same process. For example, a database might put its query processing threads in one cgroup with higher CPU weight and its background compaction threads in another cgroup with lower weight. The root of the threaded subtree is a “domain threaded” cgroup, and its children are “threaded” cgroups. Not all controllers support threaded mode: per the kernel documentation, the thread-aware controllers are cpu, cpuset, perf_event, and pids.
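The database example could be set up roughly as follows. This is an untested sketch: the helper names, the directory names `queries` and `compaction`, and the weights are all illustrative. It assumes the process already lives in the parent cgroup, which becomes the threaded domain.

```shell
# make_threaded CGROUP_DIR
# Switch a cgroup into threaded mode.
make_threaded() {
    echo threaded > "$1/cgroup.type"
}

# move_thread TID CGROUP_DIR
# Move a single thread, identified by its TID, into a threaded cgroup.
move_thread() {
    echo "$1" > "$2/cgroup.threads"
}

# Usage (paths and TIDs are placeholders):
#   mkdir /sys/fs/cgroup/<path>/queries /sys/fs/cgroup/<path>/compaction
#   make_threaded /sys/fs/cgroup/<path>/queries
#   make_threaded /sys/fs/cgroup/<path>/compaction
#   echo "+cpu" > /sys/fs/cgroup/<path>/cgroup.subtree_control
#   echo 200 > /sys/fs/cgroup/<path>/queries/cpu.weight
#   echo 50  > /sys/fs/cgroup/<path>/compaction/cpu.weight
#   move_thread <tid> /sys/fs/cgroup/<path>/compaction
```

Threads are addressed through cgroup.threads rather than cgroup.procs, and a thread can only move between cgroups that belong to the same threaded domain.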
Design a monitoring system that detects containers approaching resource exhaustion before they get OOM-killed or CPU-throttled. What cgroup metrics would you monitor, and what are the thresholds?
Strong Answer:
For memory, I would monitor three signals. First, memory.current / memory.max as a utilization ratio, alerting at 80%. Second, memory.pressure from PSI (Pressure Stall Information): some avg10 > 10 means that over the last 10 seconds at least one task was stalled waiting for memory more than 10% of the time, which indicates reclaim pressure before OOM. Third, the oom_kill counter in memory.events, which counts processes actually killed by the OOM killer (the separate oom counter records the cgroup hitting its limit).
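The utilization ratio needs one special case: an unlimited cgroup reports the literal string "max" instead of a number. The helper below is a sketch with a made-up name, `usage_pct`. The same logic applies to pids.current and pids.max, which use the same convention.

```shell
# usage_pct CURRENT_FILE MAX_FILE
# Print current/max as an integer percentage, or "unlimited" when no limit is set.
usage_pct() {
    up_current=$(cat "$1") || return 1
    up_limit=$(cat "$2") || return 1
    if [ "$up_limit" = "max" ]; then
        echo unlimited
    elif [ "$up_limit" -eq 0 ]; then
        echo 100                      # a zero limit is treated as fully used
    else
        echo $(( up_current * 100 / up_limit ))
    fi
}

# Usage (placeholder path):
#   usage_pct /sys/fs/cgroup/<path>/memory.current /sys/fs/cgroup/<path>/memory.max
#   usage_pct /sys/fs/cgroup/<path>/pids.current   /sys/fs/cgroup/<path>/pids.max
```

An alerting rule would compare the printed number against the 80% threshold and skip cgroups that report unlimited.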
For CPU, I would compute throttle percentage from cpu.stat: nr_throttled / nr_periods * 100. Alert at 5% throttle rate. I would also monitor cpu.pressure where some avg10 > 25 indicates meaningful CPU contention. The distinction between some (at least one task stalled) and full (all tasks stalled) is important: some indicates contention, full indicates complete blockage.
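PSI files hold one line per stall type, such as `some avg10=12.50 avg60=3.10 avg300=0.80 total=123456`. A minimal sketch for pulling out avg10 and comparing it with a threshold follows; both function names are hypothetical, and `awk` is assumed to be available.

```shell
# psi_avg10 FILE KIND
# Print the avg10 value from the "some" or "full" line of a PSI file.
psi_avg10() {
    awk -v kind="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
        }
    ' "$1"
}

# psi_over FILE KIND THRESHOLD
# Succeed (exit status 0) when avg10 for KIND exceeds THRESHOLD.
psi_over() {
    po_val=$(psi_avg10 "$1" "$2")
    [ -n "$po_val" ] && awk -v v="$po_val" -v t="$3" 'BEGIN { exit !(v > t) }'
}

# Usage (placeholder path):
#   psi_over /sys/fs/cgroup/<path>/cpu.pressure some 25 && echo "CPU contention"
```

Returning an exit status rather than printing keeps the check easy to use in `if` and `&&` chains inside a polling loop.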
For I/O, monitor io.pressure for both some and full stall percentages. Also check io.stat for per-device throughput to detect I/O bottlenecks.
For PID limits, monitor pids.current / pids.max and alert at 80%. A fork bomb or thread leak will hit this before OOM.
Implementation-wise, I would use a polling agent reading these pseudo-files every 5-10 seconds. PSI metrics are particularly efficient because they are pre-computed moving averages, so each one costs a single read of a small file containing one some line and, where available, one full line. For production, I would expose these as Prometheus metrics and set up Grafana alerts.
Follow-up: How does PSI (Pressure Stall Information) work internally, and why is it better than just checking utilization percentages?
Follow-up Answer:
PSI tracks the actual time tasks spend stalled waiting for resources, not just how much of a resource is being used. The kernel maintains per-CPU counters that are updated on every task state transition. When a task transitions from runnable to blocked-on-memory-reclaim, the kernel increments the memory stall counter for that CPU. PSI then computes exponentially weighted moving averages over 10s, 60s, and 300s windows. This is fundamentally better than utilization because utilization does not capture demand: a container at 90% memory utilization might be fine (large but stable working set) or about to OOM (growing leak). PSI tells you whether tasks are actually waiting, which directly correlates with user-visible performance degradation.