Control Groups - Resource limits, accounting, and prioritization

Control Groups (cgroups)

Control groups are the foundation of container resource management. Understanding cgroups deeply is essential for debugging container issues and designing resource-aware systems.
Prerequisites: Process fundamentals, basic container concepts
Interview Focus: cgroups v2, memory limits, CPU throttling, debugging
Companies: All container/cloud companies heavily test this

cgroups Overview

Control Groups Hierarchy

cgroups v1 vs v2

cgroups v1 characteristics:
  • Multiple hierarchies (one per controller)
  • Controllers can be mounted independently
  • More flexible but complex
  • Still used in many production systems
File system layout:
/sys/fs/cgroup/
├── cpu/
│   └── docker/
│       └── container-abc/
│           ├── cpu.cfs_quota_us
│           └── cpu.cfs_period_us
├── memory/
│   └── docker/
│       └── container-abc/
│           ├── memory.limit_in_bytes
│           └── memory.usage_in_bytes
└── pids/
    └── docker/
        └── container-abc/
            └── pids.max
Key issues with v1:
  • Inconsistent APIs across controllers
  • Race conditions between hierarchies
  • No unified resource management
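A common way to tell which layout a host uses is the filesystem type mounted at /sys/fs/cgroup. The sketch below assumes GNU `stat` is available; the helper name `cgroup_version_from_fstype` is hypothetical and only maps the reported type to a label.

```shell
# Map the filesystem type reported for /sys/fs/cgroup to a cgroup version.
# cgroup2fs -> unified v2 hierarchy; tmpfs -> v1 (or hybrid) layout.
cgroup_version_from_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# Usage on a live host (GNU coreutils stat):
#   cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup)"
```

Keeping the mapping separate from the `stat` call makes it easy to check without touching the real hierarchy.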

CPU Controller Deep Dive

Cgroup Resource Limits

CPU Bandwidth Limiting

┌─────────────────────────────────────────────────────────────────────┐
│                    CPU BANDWIDTH LIMITING                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Configuration (cgroups v2):                                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  cpu.max = "$QUOTA $PERIOD"                                      ││
│  │                                                                  ││
│  │  Examples:                                                       ││
│  │  "50000 100000"  = 50ms per 100ms period = 50% of 1 CPU         ││
│  │  "100000 100000" = 100ms per 100ms = 100% of 1 CPU             ││
│  │  "200000 100000" = 200ms per 100ms = 200% = 2 CPUs             ││
│  │  "max 100000"    = Unlimited                                    ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kubernetes translation:                                             │
│  resources:                                                          │
│    limits:                                                           │
│      cpu: "500m"    →  cpu.max = "50000 100000"                     │
│      cpu: "2"       →  cpu.max = "200000 100000"                    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
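The Kubernetes translation above is plain arithmetic: quota = millicores × period / 1000. A minimal sketch, assuming the default 100000 µs period; the function name `millicores_to_cpu_max` is made up for illustration.

```shell
# Convert a CPU limit in millicores to a cpu.max value ("QUOTA PERIOD").
# $1: millicores (1000 = one full CPU)
# $2: period in microseconds (optional, assumed default 100000)
millicores_to_cpu_max() {
  local millicores=$1 period=${2:-100000}
  echo "$(( millicores * period / 1000 )) ${period}"
}

# Usage:
#   millicores_to_cpu_max 500    # 500m
#   millicores_to_cpu_max 2000   # 2 CPUs
```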

CPU Throttling

Common Interview Topic: Why is my container slow even when CPU usage looks low? Answer: CPU throttling!
# Check throttling statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/cpu.stat

# Example output:
usage_usec 1234567890        # Total CPU time used
user_usec 1000000000         # User-space time
system_usec 234567890        # Kernel-space time
nr_periods 50000             # Number of enforcement periods
nr_throttled 1000            # Periods where throttling occurred
throttled_usec 5000000       # Total time spent throttled

# Throttling percentage = nr_throttled / nr_periods * 100
# Here: 1000 / 50000 * 100 = 2%
# If > 5%, consider increasing the CPU limit
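The throttling ratio can be computed directly from a cpu.stat file. This is a sketch rather than a monitoring tool; `throttle_pct` is a hypothetical helper that expects the v2 "key value" format shown above.

```shell
# Print the percentage of enforcement periods in which the cgroup was throttled.
# $1: path to a cpu.stat file (cgroups v2 format: one "key value" pair per line)
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.2f\n", throttled / periods * 100
      else             print "0.00"
    }
  ' "$1"
}

# Usage:
#   throttle_pct /sys/fs/cgroup/<path>/cpu.stat
```

Because the counters are cumulative, sampling twice and comparing the deltas gives a better picture of current throttling than a single read.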

CPU Throttling Visualization

┌────────────────────────────────────────────────────────────────────────┐
│                    CPU THROTTLING BEHAVIOR                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Period: 100ms, Quota: 50ms (50% of 1 CPU)                             │
│                                                                         │
│  Time ─────────────────────────────────────────────────────────────►   │
│                                                                         │
│  Period 1      │ Period 2      │ Period 3      │ Period 4              │
│  0ms    100ms  │ 100ms  200ms  │ 200ms  300ms  │ 300ms  400ms          │
│  │      │      │       │       │       │       │       │               │
│  ├──────┤      ├───────┤       ├───────┤       ├───────┤               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│░░░░░░│███████│░░░░░░░│████│░░│░░░░░░░│███████│░░░░░░         │
│  └──────┘      └───────┘       └────┘  │       └───────┘               │
│                                        │                               │
│  ██ = Running  ░░ = Throttled         │                               │
│                                        │                               │
│  Period 3: Only 40ms work, no throttle ◄───────                        │
│  Other periods: Hit 50ms quota, throttled                              │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Memory Controller Deep Dive

Memory Limits and Protection

┌─────────────────────────────────────────────────────────────────────┐
│                     MEMORY CGROUP CONTROLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                          memory.max                                  │
│                              │                                       │
│                              │ ← Hard limit (OOM if exceeded)       │
│  Usage                       │                                       │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.high                                  │
│    │                         │                                       │
│    │                         │ ← Throttling begins (reclaim)        │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │  ┌──────────────────────┴──────────────────────────┐           │
│    │  │                 Normal Operation                 │           │
│    │  │          Application allocates freely            │           │
│    │  └─────────────────────────────────────────────────┘           │
│    │                         │                                       │
│    │                    memory.low                                   │
│    │                         │ ← Best-effort protection             │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.min                                   │
│    │                         │ ← Guaranteed protection               │
│    └─────────────────────────┴────────────────────────────► Time    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Memory Accounting Details

# Read memory statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/memory.stat

# Key fields:
anon 1073741824                  # Anonymous memory (heap, stack)
file 536870912                   # Page cache (file-backed pages)
kernel 67108864                  # Kernel memory (slab, etc.)
kernel_stack 1048576             # Kernel stacks
pagetables 8388608               # Page table memory
sock 134217728                   # Socket buffers
shmem 0                          # Shared memory
file_mapped 268435456            # Memory-mapped files
file_dirty 4096                  # Dirty pages
file_writeback 0                 # Pages being written
swapcached 0                     # Swap cache
anon_thp 0                       # Anonymous huge pages
file_thp 0                       # File-backed huge pages
slab_reclaimable 16777216        # Reclaimable slab
slab_unreclaimable 8388608       # Non-reclaimable slab
pgfault 15000000                 # Total page faults
pgmajfault 1000                  # Major page faults (disk I/O)
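When reading memory.stat, the split between `anon` and `file` is usually the first thing to check, since page cache can be reclaimed while anonymous memory cannot be dropped without swap. A small sketch with a hypothetical helper name, `mem_breakdown`:

```shell
# Summarise anonymous vs. file-backed memory from a memory.stat file, in MiB.
# $1: path to a memory.stat file (cgroups v2)
mem_breakdown() {
  awk '
    $1 == "anon" { anon = $2 }
    $1 == "file" { file = $2 }
    END { printf "anon=%dMiB file=%dMiB\n", anon / 1048576, file / 1048576 }
  ' "$1"
}

# Usage:
#   mem_breakdown /sys/fs/cgroup/<path>/memory.stat
```

The exact-match comparison on the first column keeps fields such as `anon_thp` or `file_mapped` from being picked up by mistake.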

Memory Limit vs OOM

┌─────────────────────────────────────────────────────────────────────┐
│                    MEMORY LIMIT EXCEEDED                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Process requests allocation                                         │
│         │                                                            │
│         ▼                                                            │
│  Is memory.current < memory.max?                                     │
│         │                                                            │
│    ┌────┴────┐                                                       │
│    │YES     NO│                                                       │
│    ▼          ▼                                                       │
│  Allocate   Try to reclaim                                           │
│  memory     from this cgroup                                         │
│              │                                                        │
│              ▼                                                        │
│         Reclaim successful?                                          │
│              │                                                        │
│         ┌────┴────┐                                                  │
│         │YES     NO│                                                  │
│         ▼          ▼                                                  │
│       Allocate   Invoke OOM killer                                   │
│       memory     for this cgroup                                     │
│                   │                                                   │
│                   ▼                                                   │
│            Kill process with                                         │
│            highest oom_score                                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
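The first decision in the flow compares memory.current with memory.max. A rough way to see how close a cgroup is to that point is to express usage as a percentage of the limit. The helper below is a sketch; `mem_util_pct` is a hypothetical name, and it treats the literal value `max` as "no limit".

```shell
# Print memory usage as an integer percentage of the limit.
# $1: value read from memory.current (bytes)
# $2: value read from memory.max (bytes, or the literal string "max")
mem_util_pct() {
  local current=$1 limit=$2
  if [ "$limit" = "max" ]; then
    echo "unlimited"
  else
    echo $(( current * 100 / limit ))
  fi
}

# Usage:
#   mem_util_pct "$(cat /sys/fs/cgroup/<path>/memory.current)" \
#                "$(cat /sys/fs/cgroup/<path>/memory.max)"
```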

I/O Controller

I/O Bandwidth Limiting

# cgroups v2 I/O controls
cat /sys/fs/cgroup/<path>/io.max

# Format: "MAJ:MIN rbps=LIMIT wbps=LIMIT riops=LIMIT wiops=LIMIT"
# Example: Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/<path>/io.max

# Check I/O statistics
cat /sys/fs/cgroup/<path>/io.stat

# Example output:
# 8:0 rbytes=1073741824 wbytes=536870912 rios=10000 wios=5000 dbytes=0 dios=0
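Because io.max takes raw bytes per second, a small helper can keep the unit conversion out of hand-typed commands. This is only a sketch: `io_max_line` is a hypothetical name, and it covers just the `rbps` and `wbps` keys from the format above.

```shell
# Build an io.max line from limits given in MiB/s.
# $1: device as MAJ:MIN, $2: read limit in MiB/s, $3: write limit in MiB/s
io_max_line() {
  echo "$1 rbps=$(( $2 * 1048576 )) wbps=$(( $3 * 1048576 ))"
}

# Usage (needs write access to the cgroup):
#   io_max_line 8:0 10 5 > /sys/fs/cgroup/<path>/io.max
```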

I/O Latency Control (io.latency)

# Set a target latency for this cgroup on device 8:0 (value in microseconds)
echo "8:0 target=10000" > /sys/fs/cgroup/<path>/io.latency
# 10000 us = 10ms target latency

# If this cgroup's average I/O latency exceeds its target, the kernel
# throttles sibling cgroups that have a looser (higher) latency target

PID Controller

# Limit maximum number of processes
echo 100 > /sys/fs/cgroup/<path>/pids.max

# Check current count
cat /sys/fs/cgroup/<path>/pids.current

# Check if limit was hit
cat /sys/fs/cgroup/<path>/pids.events
# max 5    ← Fork denied 5 times due to limit
Fork Bomb Protection: The pids controller prevents fork bombs from exhausting system resources. Set reasonable limits on all container cgroups.

cpuset Controller

# Pin to specific CPUs
echo "0,2,4,6" > /sys/fs/cgroup/<path>/cpuset.cpus

# Pin to specific memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/<path>/cpuset.mems

# Verify current settings
cat /sys/fs/cgroup/<path>/cpuset.cpus.effective
cat /sys/fs/cgroup/<path>/cpuset.mems.effective
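The cpuset files use a list syntax that mixes single CPUs and ranges (for example `0-3,8`). To see how many CPUs a cgroup can actually use, the list has to be expanded. A minimal sketch, with `cpuset_count` as a hypothetical helper name:

```shell
# Count the CPUs named by a cpuset list such as "0-3,8,10-11".
# $1: list as read from cpuset.cpus or cpuset.cpus.effective
cpuset_count() {
  echo "$1" | awk -F, '{
    n = 0
    for (i = 1; i <= NF; i++) {
      if (split($i, r, "-") == 2) n += r[2] - r[1] + 1   # range "a-b"
      else if ($i != "")         n += 1                  # single CPU
    }
    print n
  }'
}

# Usage:
#   cpuset_count "$(cat /sys/fs/cgroup/<path>/cpuset.cpus.effective)"
```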

NUMA and cpuset

┌─────────────────────────────────────────────────────────────────────┐
│                        NUMA TOPOLOGY                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Node 0                          Node 1                             │
│  ┌─────────────────────┐        ┌─────────────────────┐            │
│  │  CPU 0   CPU 1      │        │  CPU 4   CPU 5      │            │
│  │  CPU 2   CPU 3      │        │  CPU 6   CPU 7      │            │
│  │                     │        │                     │            │
│  │  ┌───────────────┐  │        │  ┌───────────────┐  │            │
│  │  │   Memory      │  │        │  │   Memory      │  │            │
│  │  │   64GB        │  │        │  │   64GB        │  │            │
│  │  └───────────────┘  │        │  └───────────────┘  │            │
│  └─────────────────────┘        └─────────────────────┘            │
│                                                                      │
│  For best performance, pin process to CPUs and memory               │
│  on the same NUMA node:                                              │
│                                                                      │
│  cpuset.cpus = "0-3"                                                │
│  cpuset.mems = "0"                                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Practical cgroups Operations

Creating and Managing cgroups

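# Create a new cgroup (run as root)
mkdir /sys/fs/cgroup/mygroup

# Check which controllers are available at the root
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# Enable controllers for the root's children (makes them available in mygroup)
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control

# Enable the same controllers for mygroup's children;
# without this, child/ has no cpu.max, memory.max, or pids.max files
echo "+cpu +memory +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Create child cgroup
mkdir /sys/fs/cgroup/mygroup/child

# Add the current shell to the child cgroup
echo $$ > /sys/fs/cgroup/mygroup/child/cgroup.procs

# Set limits
echo "50000 100000" > /sys/fs/cgroup/mygroup/child/cpu.max
echo "100M" > /sys/fs/cgroup/mygroup/child/memory.max
echo 50 > /sys/fs/cgroup/mygroup/child/pids.max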

Container Runtime cgroup Operations

# Find container's cgroup
# Docker (cgroups v2, systemd cgroup driver)
# --no-trunc returns the full ID, which the scope name uses
CONTAINER_ID=$(docker ps -q --no-trunc --filter name=mycontainer)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current

# Using systemd (cgroups v2)
systemctl status docker-${CONTAINER_ID}.scope

# Using /proc
cat /proc/<pid>/cgroup
# Output: 0::/system.slice/docker-abc123.scope
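On a v2 host the entry in /proc/<pid>/cgroup has the form `0::<path>`, so the sysfs directory can be derived by stripping the `0::` prefix and prepending /sys/fs/cgroup. A sketch, with `cgroup_v2_path` as a hypothetical helper name; inside a container with its own cgroup namespace the reported path is relative to that namespace.

```shell
# Extract the cgroup v2 path from a /proc/<pid>/cgroup file.
# The v2 entry has hierarchy ID 0 and an empty controller list: "0::<path>".
# $1: path to the file, e.g. /proc/self/cgroup
cgroup_v2_path() {
  sed -n 's/^0:://p' "$1"
}

# Usage:
#   cat "/sys/fs/cgroup$(cgroup_v2_path /proc/self/cgroup)/memory.current"
```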

Delegation and Nesting

Cgroup Delegation

┌─────────────────────────────────────────────────────────────────────┐
│                      CGROUP DELEGATION                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Root owns top levels:                                               │
│                                                                      │
│  /sys/fs/cgroup/                        ← root-owned                │
│  └── system.slice/                      ← root-owned                │
│      └── docker.service/                ← root-owned                │
│          └── container-abc/             ← delegated to container    │
│              ├── cgroup.procs           ← container can write       │
│              ├── memory.current         ← container can read        │
│              └── child/                 ← container can create      │
│                                                                      │
│  Delegation enables:                                                 │
│  1. Container can create nested cgroups                             │
│  2. Container can move its processes between cgroups                │
│  3. Container cannot exceed parent's limits                         │
│  4. Container cannot affect sibling cgroups                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Delegation Requirements

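# Check which controllers are available to delegate
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# For delegation to work:
# 1. Directory owned by delegate
# 2. cgroup.procs owned by delegate
# 3. cgroup.subtree_control owned by delegate
# 4. cgroup.threads owned by delegate (needed for threaded subtrees)
# 5. Parent's subtree_control enables needed controllers

# Set up delegation: hand over only the directory and these files, not the
# resource-control files (cpu.max, memory.max, ...) of the delegated cgroup itself
mkdir /sys/fs/cgroup/delegated
chown user:user /sys/fs/cgroup/delegated \
    /sys/fs/cgroup/delegated/cgroup.procs \
    /sys/fs/cgroup/delegated/cgroup.subtree_control \
    /sys/fs/cgroup/delegated/cgroup.threads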

Debugging cgroups Issues

Common Issues and Solutions

Symptom: Container killed by OOM, but host shows available memory.
Diagnosis:
# Check cgroup memory limit
cat /sys/fs/cgroup/<path>/memory.max
cat /sys/fs/cgroup/<path>/memory.current

# Check for anonymous vs cache memory
grep -E "^(anon|file)" /sys/fs/cgroup/<path>/memory.stat

# Check OOM events
cat /sys/fs/cgroup/<path>/memory.events
# oom 5  ← OOM occurred 5 times
Solutions:
  • Increase memory limit
  • Reduce application memory usage
  • Add swap (for burst tolerance)
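The counters in memory.events are easy to pull out one at a time, which helps when scripting the diagnosis. A minimal sketch; `mem_event` is a hypothetical helper, and it prints 0 when the requested counter is not present in the file.

```shell
# Print the value of one counter from a memory.events file.
# $1: path to memory.events
# $2: counter name (for example: high, max, oom, oom_kill)
mem_event() {
  awk -v key="$2" '
    $1 == key { print $2; found = 1 }
    END       { if (!found) print 0 }
  ' "$1"
}

# Usage:
#   mem_event /sys/fs/cgroup/<path>/memory.events oom_kill
```

`oom` counts how often the limit was reached with allocation about to fail, while `oom_kill` counts processes actually killed, so the two can differ.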
Symptom: Container shows 30% CPU usage but requests are slow.
Diagnosis:
# Check throttling stats
cat /sys/fs/cgroup/<path>/cpu.stat
# nr_throttled 15000  ← Throttled 15000 periods!

# Calculate throttle percentage
# throttle% = nr_throttled / nr_periods * 100
Cause: Burst usage exceeds quota within period even though average is low.
Solutions:
  • Increase CPU limit
  • Use cpu.max.burst (cgroups v2) for burst allowance
  • Spread work more evenly
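To make the "low average, still throttled" effect concrete, here is a toy model of per-period quota enforcement. It is a deliberate simplification, not the kernel's algorithm: it assumes a multi-threaded workload, millisecond granularity, no burst allowance, and that work which does not fit in a period simply carries over. The function name `simulate_quota` is hypothetical.

```shell
# Toy model of CPU bandwidth control. All values are in milliseconds.
# $1: quota per period; remaining args: new CPU demand arriving in each period.
simulate_quota() {
  local quota=$1 backlog=0 period=0 demand run state
  shift
  for demand in "$@"; do
    period=$(( period + 1 ))
    backlog=$(( backlog + demand ))
    if [ "$backlog" -gt "$quota" ]; then
      run=$quota          # quota exhausted: the rest waits for the next period
      state=throttled
    else
      run=$backlog
      state=ok
    fi
    backlog=$(( backlog - run ))
    echo "period=${period} ran=${run}ms backlog=${backlog}ms ${state}"
  done
}

# Usage: a 120ms burst in the first of four 100ms periods, quota 50ms/period
#   simulate_quota 50 120 0 0 0
```

In that run the average usage over the four periods is 120 ms out of 400 ms, or 30% of one CPU, yet the first two periods are throttled and the burst only finishes in the third.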
Symptom: echo: write error: Device or resource busy
Diagnosis:
# Check if cgroup has processes
cat /sys/fs/cgroup/<path>/cgroup.procs

# Check if cgroup has children
ls /sys/fs/cgroup/<path>/
Cause: Cgroups v2 “no internal processes” rule: if a cgroup has controllers enabled in subtree_control, processes must be in leaf cgroups.
Solution: Move processes to leaf cgroups before modifying parent.
Symptom: Set io.max but process still uses full disk bandwidth.
Diagnosis:
# Check if IO controller is enabled
cat /sys/fs/cgroup/<path>/cgroup.controllers | grep io

# Check io.max format (needs MAJ:MIN)
lsblk
cat /sys/fs/cgroup/<path>/io.max
Common causes:
  • Wrong device major:minor
  • IO through page cache (buffered writes)
  • Controller not enabled
Solution:
# Use O_DIRECT or sync writes
# Or use io.latency for latency-based control

Interview Questions

Answer: When a cgroup exceeds memory.high:
  1. Reclaim pressure increases: Kernel aggressively reclaims memory from this cgroup
  2. Throttling may occur: Memory allocation requests may be delayed
  3. No OOM: The process is NOT killed (unlike exceeding memory.max)
  4. Performance impact: Application may slow down due to reclaim
This is useful for:
  • Soft limits with burst allowance
  • Preventing one container from consuming all cache
  • Graceful degradation instead of hard OOM
Key differences:
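| Aspect             | v1                        | v2             |
|--------------------|---------------------------|----------------|
| Hierarchy          | Multiple (per controller) | Single unified |
| Internal processes | Allowed                   | Not allowed    |
| Thread control     | Limited                   | Full support   |
| Pressure metrics   | No                        | Yes (PSI)      |
| Delegation         | Complex                   | Simple         |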
v2 advantages:
  • Simpler mental model
  • Consistent behavior
  • Better pressure metrics (PSI)
  • Proper thread-level controls
  • Cleaner delegation model
Systematic approach:
  1. Check CPU throttling:
    cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled
    
  2. Check memory pressure:
    cat /sys/fs/cgroup/<path>/memory.pressure
    cat /sys/fs/cgroup/<path>/memory.events | grep oom
    
  3. Check I/O latency:
    cat /sys/fs/cgroup/<path>/io.stat
    cat /sys/fs/cgroup/<path>/io.pressure
    
  4. Check for noisy neighbors:
    • Look at sibling cgroups’ usage
    • Check parent cgroup limits
  5. Use tracing:
    # Off-CPU analysis
    offcputime-bpfcc -p <pid>
    

PSI (Pressure Stall Information)

New in cgroups v2 - provides pressure metrics:
# Memory pressure
cat /sys/fs/cgroup/<path>/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# CPU pressure  
cat /sys/fs/cgroup/<path>/cpu.pressure
# some avg10=25.00 avg60=20.00 avg300=15.00 total=1234567

# IO pressure
cat /sys/fs/cgroup/<path>/io.pressure
# some avg10=5.00 avg60=3.00 avg300=2.00 total=789012
# full avg10=2.00 avg60=1.00 avg300=0.50 total=456789
Interpretation:
  • some: At least one task stalled
  • full: All tasks stalled
  • avg10/60/300: 10s/60s/300s moving averages (%)
  • total: Total stall time in microseconds
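Each PSI line is a kind (`some` or `full`) followed by `key=value` pairs, so a single field can be extracted with a few lines of awk. A sketch with a hypothetical helper name, `psi_field`:

```shell
# Print one field (avg10, avg60, avg300 or total) from a PSI file.
# $1: path to a *.pressure file
# $2: line kind (some or full)
# $3: field name
psi_field() {
  awk -v kind="$2" -v field="$3" '
    $1 == kind {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == field) print kv[2]
      }
    }
  ' "$1"
}

# Usage:
#   psi_field /sys/fs/cgroup/<path>/memory.pressure some avg10
```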

Interview Deep-Dive

Strong Answer:
  • The most likely cause is CFS bandwidth throttling. Average CPU usage can be misleading because CFS enforces CPU limits per-period (typically 100ms). A service might use only 30% average CPU, but if requests arrive in bursts, the container could exhaust its entire quota in the first 30ms of a period and be throttled for the remaining 70ms. During throttling, all threads in the cgroup are descheduled regardless of available CPU on the host.
  • To diagnose, I would read /sys/fs/cgroup/<path>/cpu.stat and compute the throttle percentage: nr_throttled / nr_periods * 100. If this exceeds 5%, throttling is significant. I would also check throttled_usec to see total time lost to throttling.
  • The kernel mechanism is the CFS bandwidth controller in kernel/sched/fair.c. Each cgroup has a cfs_bandwidth structure tracking quota and period. When a task in the cgroup runs, its runtime is charged against the quota. When quota reaches zero, the scheduler removes all tasks in the cgroup from their runqueues until the next period boundary replenishes the quota.
  • Solutions in order of preference: increase the CPU limit in the Kubernetes resource spec, enable bursting via cpu.max.burst in cgroups v2 (lets the cgroup temporarily exceed its quota by spending unused quota accumulated in earlier periods), spread work more evenly across time using request queuing, or switch to CPU shares (cpu.weight) instead of hard limits if the node is not oversubscribed.
Follow-up: Why does Kubernetes use CPU limits (quota) at all instead of just CPU requests (shares)?
Follow-up Answer:
  • CPU requests map to cpu.weight (shares) which provide proportional fairness: if a container requests 1 CPU and another requests 2 CPUs, the second gets twice the CPU time when both are contending. But shares only work when there is contention. Without limits, a runaway container could consume all available CPU during low-load periods, then cause latency for other containers when load increases. Limits provide a hard ceiling via cpu.max that caps CPU usage regardless of host utilization. The trade-off is that limits cause throttling even when CPU is idle, which is wasteful. Many teams now advocate for setting requests but not limits for CPU (Google’s best practice), accepting the risk of noisy neighbors in exchange for eliminating throttling-induced latency.
Strong Answer:
  • In cgroups v2, a cgroup that has controllers enabled in its cgroup.subtree_control cannot have processes directly in cgroup.procs — processes must be in leaf cgroups only. This means if /sys/fs/cgroup/mygroup/ enables CPU and memory controllers for its children, all processes must be in child cgroups like /sys/fs/cgroup/mygroup/child1/, not in /sys/fs/cgroup/mygroup/ itself.
  • This rule was introduced to solve a fundamental ambiguity in cgroups v1: if a parent cgroup has processes and also has child cgroups, how should the parent’s resource share be divided between its direct processes and its children? In v1, this was handled inconsistently across controllers and led to confusing behavior where resource distribution depended on the tree structure in unintuitive ways.
  • For container runtimes, this means the cgroup hierarchy must be designed carefully. A container runtime cannot simply create /sys/fs/cgroup/containers/ with controllers enabled and dump container processes there. Instead, it must create /sys/fs/cgroup/containers/container-abc/ for each container and place processes in the leaf. Systemd handles this naturally with its slice/scope/service hierarchy: system.slice -> docker-abc123.scope where the scope is the leaf containing the container’s processes.
  • The practical impact is that management processes (such as the container runtime’s shim) must be in a separate cgroup from the container’s processes. A typical arrangement keeps the shim in a sibling cgroup and the container’s processes in their own leaf.
Follow-up: How does thread-level cgroup control work in v2, and when would you use it?
Follow-up Answer:
  • Cgroups v2 introduced cgroup.type = "threaded", which allows individual threads of a process to be in different cgroups within a threaded subtree. This is useful when you want to give different threads different CPU priorities within the same process. For example, a database might put its query processing threads in one cgroup with higher CPU weight and its background compaction threads in another cgroup with lower weight. The root of the threaded subtree is a “domain threaded” cgroup, and its children are “threaded” cgroups. Not all controllers support threaded mode: cpu, cpuset, perf_event, and pids are thread-aware, while memory and io are not and can only be enabled at process (domain) granularity.
Strong Answer:
  • For memory, I would monitor three signals. First, memory.current / memory.max as a utilization ratio — alert at 80%. Second, memory.pressure from PSI (Pressure Stall Information): some avg10 > 10 means at least one task is stalling 10% of the time waiting for memory, which indicates reclaim pressure before OOM. Third, the oom counter in memory.events to detect actual OOM kills.
  • For CPU, I would compute throttle percentage from cpu.stat: nr_throttled / nr_periods * 100. Alert at 5% throttle rate. I would also monitor cpu.pressure where some avg10 > 25 indicates meaningful CPU contention. The distinction between some (at least one task stalled) and full (all tasks stalled) is important: some indicates contention, full indicates complete blockage.
  • For I/O, monitor io.pressure for both some and full stall percentages. Also check io.stat for per-device throughput to detect I/O bottlenecks.
  • For PID limits, monitor pids.current / pids.max and alert at 80%. A fork bomb or thread leak will hit this before OOM.
  • Implementation-wise, I would use a polling agent reading these pseudo-files every 5-10 seconds. PSI metrics are particularly efficient because they are pre-computed moving averages, so reading them is a single file read returning one or two short lines (some and full). For production, I would expose these as Prometheus metrics and set up Grafana alerts.
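The threshold checks in this answer amount to comparing a sampled value, often a decimal, against a limit. Shell integer tests cannot compare decimals, so a polling script might delegate the comparison to awk. This is a sketch only; `alert_if_over` and the metric labels in the usage lines are hypothetical.

```shell
# Print an alert line when a sampled value exceeds its threshold.
# $1: metric label, $2: sampled value, $3: threshold (decimals allowed)
alert_if_over() {
  # awk exits 0 when value > threshold, 1 otherwise
  if awk -v v="$2" -v t="$3" 'BEGIN { exit !(v + 0 > t + 0) }'; then
    echo "ALERT $1: $2 > $3"
  else
    echo "ok $1: $2 <= $3"
  fi
}

# Usage with the thresholds mentioned above:
#   alert_if_over mem_pressure_some_avg10 12.50 10
#   alert_if_over cpu_throttle_pct 2.00 5
```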
Follow-up: How does PSI (Pressure Stall Information) work internally, and why is it better than just checking utilization percentages?
Follow-up Answer:
  • PSI tracks the actual time tasks spend stalled waiting for resources, not just how much of a resource is being used. The kernel maintains per-CPU counters that are updated on every task state transition. When a task transitions from runnable to blocked-on-memory-reclaim, the kernel increments the memory stall counter for that CPU. PSI then computes exponentially weighted moving averages over 10s, 60s, and 300s windows. This is fundamentally better than utilization because utilization does not capture demand: a container at 90% memory utilization might be fine (large but stable working set) or about to OOM (growing leak). PSI tells you whether tasks are actually waiting, which directly correlates with user-visible performance degradation.
