# Control Groups (cgroups)

> Master cgroups v1 and v2 for resource limiting, accounting, and container isolation

Control Groups - Resource limits, accounting, and prioritization


Control groups are the foundation of container resource management. Understanding cgroups deeply is essential for debugging container issues and designing resource-aware systems.

**Prerequisites**: Process fundamentals, basic container concepts\
**Interview Focus**: cgroups v2, memory limits, CPU throttling, debugging\
**Companies**: All container/cloud companies heavily test this

***

## cgroups Overview

Control Groups Hierarchy

***

## cgroups v1 vs v2

### cgroups v1

**Characteristics**:

* Multiple hierarchies (one per controller)
* Controllers can be mounted independently
* More flexible but complex
* Still used in many production systems

**File system layout**:

```
/sys/fs/cgroup/
├── cpu/
│   └── docker/
│       └── container-abc/
│           ├── cpu.cfs_quota_us
│           └── cpu.cfs_period_us
├── memory/
│   └── docker/
│       └── container-abc/
│           ├── memory.limit_in_bytes
│           └── memory.usage_in_bytes
└── pids/
    └── docker/
        └── container-abc/
            └── pids.max
```

**Key issues**:

* Inconsistent APIs across controllers
* Race conditions between hierarchies
* No unified resource management

### cgroups v2

**Characteristics**:

* Single unified hierarchy
* All controllers in one tree
* Simpler, more consistent API
* Default in modern systems (RHEL 9+, Ubuntu 22.04+)

**File system layout**:

```
/sys/fs/cgroup/
├── cgroup.controllers        # Available controllers
├── cgroup.subtree_control    # Enabled for children
├── system.slice/
│   └── sshd.service/
│       ├── cgroup.procs
│       ├── cpu.stat
│       └── memory.current
└── user.slice/
    └── user-1000.slice/
```

**Key improvements**:

* Consistent no-internal-process rule
* Unified resource control
* Better pressure metrics
* Thread-level controls

***

## CPU Controller Deep Dive

Cgroup Resource Limits

### CPU Bandwidth Limiting

```
┌────────────────────────────────────────────────────────────┐
│                   CPU BANDWIDTH LIMITING                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Configuration (cgroups v2):                                │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ cpu.max = "$QUOTA $PERIOD"                             │ │
│ │                                                        │ │
│ │ Examples:                                              │ │
│ │ "50000 100000"  = 50ms per 100ms period = 50% of 1 CPU │ │
│ │ "100000 100000" = 100ms per 100ms = 100% of 1 CPU      │ │
│ │ "200000 100000" = 200ms per 100ms = 200% = 2 CPUs      │ │
│ │ "max 100000"    = Unlimited                            │ │
│ │                                                        │ │
│ └────────────────────────────────────────────────────────┘ │
│                                                            │
│ Kubernetes translation:                                    │
│   resources:                                               │
│     limits:                                                │
│       cpu: "500m"  →  cpu.max = "50000 100000"             │
│       cpu: "2"     →  cpu.max = "200000 100000"            │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

### CPU Throttling

**Common Interview Topic**: Why is my container slow even when CPU usage looks low? Answer: CPU throttling!
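A toy model can make that answer concrete. The sketch below is not how the kernel implements CFS bandwidth control; it only assumes that work exceeding the quota in one period is carried into the next. `simulate_throttling` and its arguments are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Toy model of quota enforcement (an illustration, not kernel code).
# Usage: simulate_throttling QUOTA_MS PERIOD_MS DEMAND_MS...
# Each DEMAND_MS is the CPU time the workload wants in one period;
# work that does not fit in the quota is carried into the next period.
simulate_throttling() {
  quota_ms=$1
  period_ms=$2
  shift 2
  printf '%s\n' "$@" | awk -v quota="$quota_ms" -v period="$period_ms" '
    {
      work = backlog + $1          # carried-over work plus new demand
      if (work > quota) { throttled++; run = quota } else { run = work }
      backlog = work - run         # unfinished work waits for the next period
      ran += run
    }
    END {
      printf "periods=%d throttled=%d ran_ms=%d avg_cpu_pct=%.1f\n",
             NR, throttled, ran, ran * 100 / (NR * period)
    }'
}

# One 80ms burst followed by three idle periods, 50ms quota per 100ms period
simulate_throttling 50 100 80 0 0 0
# prints: periods=4 throttled=1 ran_ms=80 avg_cpu_pct=20.0
```

In this example the average usage is only 20% of one CPU, yet one period in four is throttled, which is the pattern that makes a "low CPU" container feel slow.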
```bash
# Check throttling statistics (cgroups v2)
cat /sys/fs/cgroup//cpu.stat

# Example output:
#   usage_usec 1234567890    # Total CPU time used
#   user_usec 1000000000     # User-space time
#   system_usec 234567890    # Kernel-space time
#   nr_periods 50000         # Number of enforcement periods
#   nr_throttled 1000        # Periods where throttling occurred
#   throttled_usec 5000000   # Total time spent throttled

# Throttling percentage:
#   throttled_percentage = nr_throttled / nr_periods * 100
# If > 5%, consider increasing CPU limit
```

### CPU Throttling Visualization

```
┌────────────────────────────────────────────────────────────┐
│                  CPU THROTTLING BEHAVIOR                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Period: 100ms, Quota: 50ms (50% of 1 CPU)                  │
│                                                            │
│ Time ────────────────────────────────────────────────►     │
│                                                            │
│  ┌────────────┬────────────┬────────────┬────────────┐     │
│  │ Period 1   │ Period 2   │ Period 3   │ Period 4   │     │
│  │0ms    100ms│100ms  200ms│200ms  300ms│300ms  400ms│     │
│  ├────────────┼────────────┼────────────┼────────────┤     │
│  │██████░░░░░░│██████░░░░░░│█████       │██████░░░░░░│     │
│  │██████░░░░░░│██████░░░░░░│█████       │██████░░░░░░│     │
│  └────────────┴────────────┴────────────┴────────────┘     │
│                                                            │
│ ██ = Running    ░░ = Throttled                             │
│                                                            │
│ Period 3: Only 40ms work, no throttle                      │
│ Other periods: Hit 50ms quota, throttled                   │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

***

## Memory Controller Deep Dive

### Memory Limits and Protection

```
┌────────────────────────────────────────────────────────────┐
│                   MEMORY CGROUP CONTROLS                   │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Usage                                                      │
│   │                                                        │
│   ├── memory.max ────────  ← Hard limit (OOM if exceeded)  │
│   │                                                        │
│   ├── memory.high ───────  ← Throttling begins (reclaim)   │
│   │                                                        │
│   │   ┌──────────────────────────────────┐                 │
│   │   │ Normal Operation                 │                 │
│   │   │ Application allocates freely     │                 │
│   │   └──────────────────────────────────┘                 │
│   │                                                        │
│   ├── memory.low ────────  ← Best-effort protection        │
│   │                                                        │
│   ├── memory.min ────────  ← Guaranteed protection         │
│   │                                                        │
│   └────────────────────────────────────────────► Time      │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

### Memory Accounting Details

```bash
# Read memory statistics (cgroups v2)
cat /sys/fs/cgroup//memory.stat

# Key fields:
#   anon 1073741824            # Anonymous memory (heap, stack)
#   file 536870912             # Page cache (file-backed pages)
#   kernel 67108864            # Kernel memory (slab, etc.)
#   kernel_stack 1048576       # Kernel stacks
#   pagetables 8388608         # Page table memory
#   sock 134217728             # Socket buffers
#   shmem 0                    # Shared memory
#   file_mapped 268435456      # Memory-mapped files
#   file_dirty 4096            # Dirty pages
#   file_writeback 0           # Pages being written
#   swapcached 0               # Swap cache
#   anon_thp 0                 # Anonymous huge pages
#   file_thp 0                 # File-backed huge pages
#   slab_reclaimable 16777216  # Reclaimable slab
#   slab_unreclaimable 8388608 # Non-reclaimable slab
#   pgfault 15000000           # Total page faults
#   pgmajfault 1000            # Major page faults (disk I/O)
```

### Memory Limit vs OOM

```
┌────────────────────────────────────────────────────────────┐
│                   MEMORY LIMIT EXCEEDED                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Process requests allocation                                │
│          │                                                 │
│          ▼                                                 │
│ Is memory.current < memory.max?                            │
│          │                                                 │
│     ┌────┴────┐                                            │
│     │YES    NO│                                            │
│     ▼         ▼                                            │
│ Allocate   Try to reclaim                                  │
│ memory     from this cgroup                                │
│                  │                                         │
│                  ▼                                         │
│         Reclaim successful?                                │
│                  │                                         │
│             ┌────┴────┐                                    │
│             │YES    NO│                                    │
│             ▼         ▼                                    │
│         Allocate   Invoke OOM killer                       │
│         memory     for this cgroup                         │
│                           │                                │
│                           ▼                                │
│                    Kill process with                       │
│                    highest oom_score                       │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

***

## I/O Controller

### I/O Bandwidth Limiting

```bash
# cgroups v2 I/O controls
cat /sys/fs/cgroup//io.max
# Format: "MAJ:MIN rbps=LIMIT wbps=LIMIT riops=LIMIT wiops=LIMIT"

# Example: Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup//io.max

# Check I/O statistics
cat /sys/fs/cgroup//io.stat
# Example output:
# 8:0 rbytes=1073741824 wbytes=536870912 rios=10000 wios=5000 dbytes=0 dios=0
```

### I/O Latency Control (io.latency)

```bash
# Set target latency for device
echo "8:0 target=10000" > /sys/fs/cgroup//io.latency
# 10000 = 10ms target latency (the value is in microseconds)

# The kernel will throttle this cgroup if its I/O
# is causing other cgroups to exceed their targets
```

***

## PID Controller

```bash
# Limit maximum number of processes
echo 100 > /sys/fs/cgroup//pids.max

# Check current count
cat /sys/fs/cgroup//pids.current

# Check if limit was hit
cat /sys/fs/cgroup//pids.events
# max 5    ← Fork denied 5 times due to limit
```

**Fork Bomb Protection**: The pids controller prevents fork bombs from exhausting system resources. Set reasonable limits on all container cgroups.
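To judge whether a limit is "reasonable", it helps to see how close a cgroup already is to it. The following is a minimal sketch, assuming a directory that contains `pids.current` and `pids.max`; the helper name `pids_usage` is hypothetical, and the demo runs against a scratch directory rather than a real cgroup.

```shell
#!/bin/sh
# Hypothetical helper: summarize pids usage for one cgroup directory.
# Usage: pids_usage CGROUP_DIR  (a directory containing pids.current and pids.max)
pids_usage() {
  cg=$1
  current=$(cat "$cg/pids.current") || return 1
  max=$(cat "$cg/pids.max") || return 1
  # pids.max holds either a number or the literal string "max" (no limit)
  if [ "$max" = "max" ] || [ "$max" -eq 0 ]; then
    echo "current=$current max=$max"
  else
    echo "current=$current max=$max used_pct=$(( current * 100 / max ))"
  fi
}

# Demo against a scratch directory that mimics the two files
demo=$(mktemp -d)
echo 42 > "$demo/pids.current"
echo 100 > "$demo/pids.max"
pids_usage "$demo"    # prints: current=42 max=100 used_pct=42
rm -rf "$demo"
```

On a real system the argument would be the cgroup's directory under `/sys/fs/cgroup`.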
***

## cpuset Controller

```bash
# Pin to specific CPUs
echo "0,2,4,6" > /sys/fs/cgroup//cpuset.cpus

# Pin to specific memory nodes (NUMA)
echo "0" > /sys/fs/cgroup//cpuset.mems

# Verify current settings
cat /sys/fs/cgroup//cpuset.cpus.effective
cat /sys/fs/cgroup//cpuset.mems.effective
```

### NUMA and cpuset

```
┌────────────────────────────────────────────────────────────┐
│                       NUMA TOPOLOGY                        │
├────────────────────────────────────────────────────────────┤
│                                                            │
│     Node 0                         Node 1                  │
│   ┌─────────────────────┐        ┌─────────────────────┐   │
│   │  CPU 0    CPU 1     │        │  CPU 4    CPU 5     │   │
│   │  CPU 2    CPU 3     │        │  CPU 6    CPU 7     │   │
│   │                     │        │                     │   │
│   │  ┌───────────────┐  │        │  ┌───────────────┐  │   │
│   │  │    Memory     │  │        │  │    Memory     │  │   │
│   │  │     64GB      │  │        │  │     64GB      │  │   │
│   │  └───────────────┘  │        │  └───────────────┘  │   │
│   └─────────────────────┘        └─────────────────────┘   │
│                                                            │
│ For best performance, pin process to CPUs and memory       │
│ on the same NUMA node:                                     │
│                                                            │
│   cpuset.cpus = "0-3"                                      │
│   cpuset.mems = "0"                                        │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

***

## Practical cgroups Operations

### Creating and Managing cgroups

```bash
# Create a new cgroup
mkdir /sys/fs/cgroup/mygroup

# Enable controllers for children of the root
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control

# Enable the same controllers for mygroup's children;
# without this, child/cpu.max and friends do not exist
echo "+cpu +memory +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Create child cgroup
mkdir /sys/fs/cgroup/mygroup/child

# Add process to cgroup
echo $$ > /sys/fs/cgroup/mygroup/child/cgroup.procs

# Set limits
echo "50000 100000" > /sys/fs/cgroup/mygroup/child/cpu.max
echo "100M" > /sys/fs/cgroup/mygroup/child/memory.max
echo 50 > /sys/fs/cgroup/mygroup/child/pids.max
```

### Container Runtime cgroup Operations

```bash
# Find container's cgroup

# Docker (v2); --no-trunc returns the full ID used in the scope name
CONTAINER_ID=$(docker ps -q --no-trunc --filter name=mycontainer)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/memory.current

# Using systemd (cgroups v2)
systemctl status docker-${CONTAINER_ID}.scope

# Using /proc
cat /proc//cgroup
# Output: 0::/system.slice/docker-abc123.scope
```

***

## Delegation and Nesting

### Cgroup Delegation

```
┌────────────────────────────────────────────────────────────┐
│                     CGROUP DELEGATION                      │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ Root owns top levels:                                      │
│                                                            │
│  /sys/fs/cgroup/                  ← root-owned             │
│  └── system.slice/                ← root-owned             │
│      └── docker.service/          ← root-owned             │
│          └── container-abc/       ← delegated to container │
│              ├── cgroup.procs     ← container can write    │
│              ├── memory.current   ← container can read     │
│              └── child/           ← container can create   │
│                                                            │
│ Delegation enables:                                        │
│ 1. Container can create nested cgroups                     │
│ 2. Container can move its processes between cgroups        │
│ 3. Container cannot exceed parent's limits                 │
│ 4. Container cannot affect sibling cgroups                 │
│                                                            │
└────────────────────────────────────────────────────────────┘
```

### Delegation Requirements

```bash
# Check delegation
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# For delegation to work:
# 1. Directory owned by delegate
# 2. cgroup.procs owned by delegate
# 3. cgroup.subtree_control owned by delegate
# 4. Parent's subtree_control enables needed controllers

# Set up delegation: hand over only the directory and the files listed above.
# A recursive chown would also hand over the limit files (memory.max, cpu.max, ...),
# letting the delegate raise its own limits.
mkdir /sys/fs/cgroup/delegated
chown user:user /sys/fs/cgroup/delegated
chown user:user /sys/fs/cgroup/delegated/cgroup.procs \
                /sys/fs/cgroup/delegated/cgroup.subtree_control
```

***

## Debugging cgroups Issues

### Common Issues and Solutions

**Symptom**: Container killed by OOM, but host shows available memory.

**Diagnosis**:

```bash
# Check cgroup memory limit
cat /sys/fs/cgroup//memory.max
cat /sys/fs/cgroup//memory.current

# Check for anonymous vs cache memory
grep -E "^(anon|file)" /sys/fs/cgroup//memory.stat

# Check OOM events
cat /sys/fs/cgroup//memory.events
# oom 5    ← OOM occurred 5 times
```

**Solutions**:

* Increase memory limit
* Reduce application memory usage
* Add swap (for burst tolerance)

**Symptom**: Container shows 30% CPU usage but requests are slow.
**Diagnosis**:

```bash
# Check throttling stats
cat /sys/fs/cgroup//cpu.stat
# nr_throttled 15000    ← Throttled 15000 periods!

# Calculate throttle percentage
# throttle% = nr_throttled / nr_periods * 100
```

**Cause**: Burst usage exceeds quota within period even though average is low.

**Solutions**:

* Increase CPU limit
* Use `cpu.max.burst` (cgroups v2) for burst allowance
* Spread work more evenly

**Symptom**: `echo: write error: Device or resource busy`

**Diagnosis**:

```bash
# Check if cgroup has processes
cat /sys/fs/cgroup//cgroup.procs

# Check if cgroup has children
ls /sys/fs/cgroup//
```

**Cause**: Cgroups v2 "no internal processes" rule - if a cgroup has controllers enabled in subtree\_control, processes must be in leaf cgroups.

**Solution**: Move processes to leaf cgroups before modifying parent.

**Symptom**: Set io.max but process still uses full disk bandwidth.

**Diagnosis**:

```bash
# Check if IO controller is enabled
cat /sys/fs/cgroup//cgroup.controllers | grep io

# Check io.max format (needs MAJ:MIN)
lsblk
cat /sys/fs/cgroup//io.max
```

**Common causes**:

* Wrong device major:minor
* IO through page cache (buffered writes)
* Controller not enabled

**Solution**:

```bash
# Use O_DIRECT or sync writes
# Or use io.latency for latency-based control
```

***

## Interview Questions

**Answer**: When a cgroup exceeds `memory.high`:

1. **Reclaim pressure increases**: Kernel aggressively reclaims memory from this cgroup
2. **Throttling may occur**: Memory allocation requests may be delayed
3. **No OOM**: The process is NOT killed (unlike exceeding `memory.max`)
4. **Performance impact**: Application may slow down due to reclaim

This is useful for:

* Soft limits with burst allowance
* Preventing one container from consuming all cache
* Graceful degradation instead of hard OOM

**Key differences**:

| Aspect             | v1                        | v2             |
| ------------------ | ------------------------- | -------------- |
| Hierarchy          | Multiple (per controller) | Single unified |
| Internal processes | Allowed                   | Not allowed    |
| Thread control     | Limited                   | Full support   |
| Pressure metrics   | No                        | Yes (PSI)      |
| Delegation         | Complex                   | Simple         |

**v2 advantages**:

* Simpler mental model
* Consistent behavior
* Better pressure metrics (PSI)
* Proper thread-level controls
* Cleaner delegation model

**Systematic approach**:

1. **Check CPU throttling**:
   ```bash
   cat /sys/fs/cgroup//cpu.stat | grep throttled
   ```
2. **Check memory pressure**:
   ```bash
   cat /sys/fs/cgroup//memory.pressure
   cat /sys/fs/cgroup//memory.events | grep oom
   ```
3. **Check I/O latency**:
   ```bash
   cat /sys/fs/cgroup//io.stat
   cat /sys/fs/cgroup//io.pressure
   ```
4. **Check for noisy neighbors**:
   * Look at sibling cgroups' usage
   * Check parent cgroup limits
5. **Use tracing**:
   ```bash
   # Off-CPU analysis
   offcputime-bpfcc -p
   ```

***

## PSI (Pressure Stall Information)

New in cgroups v2 - provides pressure metrics:

```bash
# Memory pressure
cat /sys/fs/cgroup//memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# CPU pressure
cat /sys/fs/cgroup//cpu.pressure
# some avg10=25.00 avg60=20.00 avg300=15.00 total=1234567

# IO pressure
cat /sys/fs/cgroup//io.pressure
# some avg10=5.00 avg60=3.00 avg300=2.00 total=789012
# full avg10=2.00 avg60=1.00 avg300=0.50 total=456789
```

**Interpretation**:

* `some`: At least one task stalled
* `full`: All non-idle tasks stalled at the same time
* `avg10/60/300`: 10s/60s/300s moving averages (%)
* `total`: Total stall time in microseconds

***

## Interview Deep-Dive

**Strong Answer:**

* The most likely cause is CFS bandwidth throttling. Average CPU usage can be misleading because CFS enforces CPU limits per-period (typically 100ms). A service might use only 30% average CPU, but if requests arrive in bursts, the container could exhaust its entire quota in the first 30ms of a period and be throttled for the remaining 70ms. During throttling, all threads in the cgroup are descheduled regardless of available CPU on the host.
* To diagnose, I would read `cat /sys/fs/cgroup//cpu.stat` and compute the throttle percentage: `nr_throttled / nr_periods * 100`. If this exceeds 5%, throttling is significant. I would also check `throttled_usec` to see total time lost to throttling.
* The kernel mechanism is the CFS bandwidth controller in `kernel/sched/fair.c`. Each cgroup has a `cfs_bandwidth` structure tracking quota and period. When a task in the cgroup runs, its runtime is charged against the quota. When quota reaches zero, the scheduler removes all tasks in the cgroup from their runqueues until the next period boundary replenishes the quota.
* Solutions in order of preference: increase the CPU limit in the Kubernetes resource spec, enable bursting via `cpu.max.burst` in cgroups v2 (lets the cgroup bank unused quota from earlier periods and spend it in a later burst), spread work more evenly across time using request queuing, or switch to CPU shares (`cpu.weight`) instead of hard limits if the node is not oversubscribed.

**Follow-up:** Why does Kubernetes use CPU limits (quota) at all instead of just CPU requests (shares)?

**Follow-up Answer:**

* CPU requests map to `cpu.weight` (shares), which provides proportional fairness: if one container requests 1 CPU and another requests 2 CPUs, the second gets twice the CPU time when both are contending. But shares only matter when there is contention. Without limits, a runaway container could consume all available CPU during low-load periods, then cause latency for other containers when load increases. Limits provide a hard ceiling via `cpu.max` that caps CPU usage regardless of host utilization. The trade-off is that limits cause throttling even when CPU is idle, which is wasteful. Many teams now advocate for setting requests but not limits for CPU (Google's best practice), accepting the risk of noisy neighbors in exchange for eliminating throttling-induced latency.

**Strong Answer:**

* In cgroups v2, a cgroup that has controllers enabled in its `cgroup.subtree_control` cannot have processes directly in `cgroup.procs` -- processes must be in leaf cgroups only. This means if `/sys/fs/cgroup/mygroup/` enables CPU and memory controllers for its children, all processes must be in child cgroups like `/sys/fs/cgroup/mygroup/child1/`, not in `/sys/fs/cgroup/mygroup/` itself.
* This rule was introduced to solve a fundamental ambiguity in cgroups v1: if a parent cgroup has processes and also has child cgroups, how should the parent's resource share be divided between its direct processes and its children?
In v1, this was handled inconsistently across controllers and led to confusing behavior where resource distribution depended on the tree structure in unintuitive ways.
* For container runtimes, this means the cgroup hierarchy must be designed carefully. A container runtime cannot simply create `/sys/fs/cgroup/containers/` with controllers enabled and dump container processes there. Instead, it must create `/sys/fs/cgroup/containers/container-abc/` for each container and place processes in the leaf. Systemd handles this naturally with its slice/scope/service hierarchy: `system.slice -> docker-abc123.scope` where the scope is the leaf containing the container's processes.
* The practical impact is that management processes (such as the container runtime's shim) must live in a different cgroup from the container's processes: the shim stays outside the container's cgroup, and the container gets its own leaf.

**Follow-up:** How does thread-level cgroup control work in v2, and when would you use it?

**Follow-up Answer:**

* Cgroups v2 introduced `cgroup.type = "threaded"`, which allows individual threads of a process to be in different cgroups within a threaded subtree. This is useful when you want to give different threads different CPU priorities within the same process. For example, a database might put its query processing threads in one cgroup with higher CPU weight and its background compaction threads in another cgroup with lower weight. The root of the threaded subtree is a "domain threaded" cgroup, and its children are "threaded" cgroups. Not all controllers support threaded mode -- only threaded controllers such as `cpu`, `cpuset`, `perf_event`, and `pids` can be enabled there, while `memory` and `io` cannot.

**Strong Answer:**

* For memory, I would monitor three signals. First, `memory.current / memory.max` as a utilization ratio -- alert at 80%. Second, `memory.pressure` from PSI (Pressure Stall Information): `some avg10 > 10` means at least one task is stalling 10% of the time waiting for memory, which indicates reclaim pressure before OOM.
Third, the `oom` counter in `memory.events` to detect actual OOM kills.
* For CPU, I would compute throttle percentage from `cpu.stat`: `nr_throttled / nr_periods * 100`. Alert at 5% throttle rate. I would also monitor `cpu.pressure` where `some avg10 > 25` indicates meaningful CPU contention. The distinction between `some` (at least one task stalled) and `full` (all tasks stalled) is important: `some` indicates contention, `full` indicates complete blockage.
* For I/O, monitor `io.pressure` for both `some` and `full` stall percentages. Also check `io.stat` for per-device throughput to detect I/O bottlenecks.
* For PID limits, monitor `pids.current / pids.max` and alert at 80%. A fork bomb or thread leak will hit this before OOM.
* Implementation-wise, I would use a polling agent reading these pseudo-files every 5-10 seconds. PSI metrics are particularly cheap to collect because the moving averages are pre-computed by the kernel, so each resource costs one small file read (a `some` line and, where present, a `full` line). For production, I would expose these as Prometheus metrics and set up Grafana alerts.

**Follow-up:** How does PSI (Pressure Stall Information) work internally, and why is it better than just checking utilization percentages?

**Follow-up Answer:**

* PSI tracks the actual time tasks spend stalled waiting for resources, not just how much of a resource is being used. The kernel maintains per-CPU counters that are updated on every task state transition. When a task transitions from runnable to blocked-on-memory-reclaim, the kernel starts charging memory stall time for that CPU. PSI then computes exponentially weighted moving averages over 10s, 60s, and 300s windows. This is fundamentally better than utilization because utilization does not capture demand: a container at 90% memory utilization might be fine (large but stable working set) or about to OOM (growing leak). PSI tells you whether tasks are actually waiting, which directly correlates with user-visible performance degradation.
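The two calculations such a polling agent leans on most, throttle percentage and PSI `avg10` extraction, can be sketched as small shell helpers. This is an illustration under stated assumptions rather than a production agent: `throttle_pct` and `psi_avg10` are hypothetical names, and the demo uses sample files in a scratch directory instead of a real cgroup.

```shell
#!/bin/sh
# Hypothetical helpers for a cgroup polling agent (illustrative only).

# throttle_pct CPU_STAT_FILE: print nr_throttled / nr_periods * 100
throttle_pct() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) { printf "%.2f\n", throttled * 100 / periods }
         else             { print "0.00" }
       }' "$1"
}

# psi_avg10 PRESSURE_FILE KIND: print avg10 from the "some" or "full" line
psi_avg10() {
  awk -v kind="$2" '$1 == kind { sub(/^avg10=/, "", $2); print $2 }' "$1"
}

# Demo with sample data shaped like cpu.stat and memory.pressure
demo=$(mktemp -d)
printf 'nr_periods 50000\nnr_throttled 1000\n' > "$demo/cpu.stat"
printf 'some avg10=12.50 avg60=8.00 avg300=3.00 total=123456\n' > "$demo/memory.pressure"
printf 'full avg10=1.25 avg60=0.50 avg300=0.10 total=4567\n' >> "$demo/memory.pressure"
throttle_pct "$demo/cpu.stat"             # prints: 2.00
psi_avg10 "$demo/memory.pressure" some    # prints: 12.50
rm -rf "$demo"
```

An agent would compare these values against the thresholds mentioned above (5% throttling, `some avg10` above 10 for memory) and export them as metrics.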
***

Next: [Filesystem & VFS →](/courses/linux-internals/filesystem-vfs)