Control Groups - Resource limits, accounting, and prioritization

Control Groups (cgroups)

Control groups are the foundation of container resource management. Understanding cgroups deeply is essential for debugging container issues and designing resource-aware systems.
Prerequisites: Process fundamentals, basic container concepts
Interview Focus: cgroups v2, memory limits, CPU throttling, debugging
Companies: All container/cloud companies heavily test this

cgroups Overview

Control Groups Hierarchy

cgroups v1 vs v2

cgroups v1 characteristics:
  • Multiple hierarchies (one per controller)
  • Controllers can be mounted independently
  • More flexible but complex
  • Still used in many production systems
File system layout:
/sys/fs/cgroup/
├── cpu/
│   └── docker/
│       └── container-abc/
│           ├── cpu.cfs_quota_us
│           └── cpu.cfs_period_us
├── memory/
│   └── docker/
│       └── container-abc/
│           ├── memory.limit_in_bytes
│           └── memory.usage_in_bytes
└── pids/
    └── docker/
        └── container-abc/
            └── pids.max
Key issues:
  • Inconsistent APIs across controllers
  • Race conditions between hierarchies
  • No unified resource management
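
cgroups v2 replaces these per-controller trees with a single unified hierarchy mounted at /sys/fs/cgroup/, which the later examples on this page use. A quick way to check which version a host is actually running (assuming the standard mount point):

stat -fc %T /sys/fs/cgroup/
# cgroup2fs  → unified cgroups v2
# tmpfs      → cgroups v1 (per-controller hierarchies mounted underneath)

mount | grep cgroup   # lists the individual v1 controller mounts, if any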

CPU Controller Deep Dive

Cgroup Resource Limits

CPU Bandwidth Limiting

┌─────────────────────────────────────────────────────────────────────┐
│                    CPU BANDWIDTH LIMITING                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Configuration (cgroups v2):                                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  cpu.max = "$QUOTA $PERIOD"                                      ││
│  │                                                                  ││
│  │  Examples:                                                       ││
│  │  "50000 100000"  = 50ms per 100ms period = 50% of 1 CPU         ││
│  │  "100000 100000" = 100ms per 100ms = 100% of 1 CPU             ││
│  │  "200000 100000" = 200ms per 100ms = 200% = 2 CPUs             ││
│  │  "max 100000"    = Unlimited                                    ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kubernetes translation:                                             │
│  resources:                                                          │
│    limits:                                                           │
│      cpu: "500m"    →  cpu.max = "50000 100000"                     │
│      cpu: "2"       →  cpu.max = "200000 100000"                    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
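
The translation is plain arithmetic: quota = millicores × period / 1000, with a default period of 100ms. A small illustrative sketch (the variable names are ours, not kubelet code):

# Hypothetical conversion from Kubernetes millicores to a cpu.max value
MILLICORES=500                       # cpu limit of "500m"
PERIOD=100000                        # default 100ms enforcement period, in microseconds
QUOTA=$(( MILLICORES * PERIOD / 1000 ))
echo "cpu.max = \"$QUOTA $PERIOD\""  # → cpu.max = "50000 100000"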

CPU Throttling

Common Interview Topic: Why is my container slow even when CPU usage looks low? Answer: CPU throttling!
# Check throttling statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/cpu.stat

# Example output:
usage_usec 1234567890        # Total CPU time used
user_usec 1000000000         # User-space time
system_usec 234567890        # Kernel-space time
nr_periods 50000             # Number of enforcement periods
nr_throttled 1000            # Periods where throttling occurred
throttled_usec 5000000       # Total time spent throttled

# Throttling percentage
# throttled_percentage = nr_throttled / nr_periods * 100
# If > 5%, consider increasing CPU limit
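
The calculation above can be scripted directly against cpu.stat; a minimal sketch (substitute the container's cgroup for <path>):

awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2} \
     END {if (p > 0) printf "throttled in %.1f%% of periods\n", t/p*100}' \
    /sys/fs/cgroup/<path>/cpu.stat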

CPU Throttling Visualization

┌────────────────────────────────────────────────────────────────────────┐
│                    CPU THROTTLING BEHAVIOR                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Period: 100ms, Quota: 50ms (50% of 1 CPU)                             │
│                                                                         │
│  Time ─────────────────────────────────────────────────────────────►   │
│                                                                         │
│  Period 1      │ Period 2      │ Period 3      │ Period 4              │
│  0ms    100ms  │ 100ms  200ms  │ 200ms  300ms  │ 300ms  400ms          │
│  │      │      │       │       │       │       │       │               │
│  ├──────┤      ├───────┤       ├───────┤       ├───────┤               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│░░░░░░│███████│░░░░░░░│████│░░│░░░░░░░│███████│░░░░░░         │
│  └──────┘      └───────┘       └────┘  │       └───────┘               │
│                                        │                               │
│  ██ = Running  ░░ = Throttled         │                               │
│                                        │                               │
│  Period 3: Only 40ms work, no throttle ◄───────                        │
│  Other periods: Hit 50ms quota, throttled                              │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Memory Controller Deep Dive

Memory Limits and Protection

┌─────────────────────────────────────────────────────────────────────┐
│                     MEMORY CGROUP CONTROLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                          memory.max                                  │
│                              │                                       │
│                              │ ← Hard limit (OOM if exceeded)       │
│  Usage                       │                                       │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.high                                  │
│    │                         │                                       │
│    │                         │ ← Throttling begins (reclaim)        │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │  ┌──────────────────────┴──────────────────────────┐           │
│    │  │                 Normal Operation                 │           │
│    │  │          Application allocates freely            │           │
│    │  └─────────────────────────────────────────────────┘           │
│    │                         │                                       │
│    │                    memory.low                                   │
│    │                         │ ← Best-effort protection             │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.min                                   │
│    │                         │ ← Guaranteed protection               │
│    └─────────────────────────┴────────────────────────────► Time    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
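
A minimal sketch of setting all four thresholds on a test cgroup (values are illustrative; requires root, a cgroups v2 host, and the memory controller enabled in the parent's subtree_control):

CG=/sys/fs/cgroup/mygroup        # assumes this cgroup already exists
echo "64M"  > $CG/memory.min     # guaranteed: not reclaimed even under global pressure
echo "128M" > $CG/memory.low     # best-effort protection below this
echo "384M" > $CG/memory.high    # above this, the kernel reclaims and throttles
echo "512M" > $CG/memory.max     # hard limit: OOM kill inside the cgroup if exceeded
cat $CG/memory.current           # compare current usage against the thresholds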

Memory Accounting Details

# Read memory statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/memory.stat

# Key fields:
anon 1073741824                  # Anonymous memory (heap, stack)
file 536870912                   # Page cache (file-backed pages)
kernel 67108864                  # Kernel memory (slab, etc.)
kernel_stack 1048576             # Kernel stacks
pagetables 8388608               # Page table memory
sock 134217728                   # Socket buffers
shmem 0                          # Shared memory
file_mapped 268435456            # Memory-mapped files
file_dirty 4096                  # Dirty pages
file_writeback 0                 # Pages being written
swapcached 0                     # Swap cache
anon_thp 0                       # Anonymous huge pages
file_thp 0                       # File-backed huge pages
slab_reclaimable 16777216        # Reclaimable slab
slab_unreclaimable 8388608       # Non-reclaimable slab
pgfault 15000000                 # Total page faults
pgmajfault 1000                  # Major page faults (disk I/O)
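
The usual first question is how much of the usage is reclaimable page cache versus anonymous memory. A small sketch that pulls the split out of memory.stat; the "working set" line mirrors the usage-minus-inactive-file heuristic commonly attributed to cAdvisor/kubelet, stated here as an assumption rather than verified kubelet code:

CG=/sys/fs/cgroup/<path>
usage=$(cat $CG/memory.current)
anon=$(awk '$1=="anon" {print $2}' $CG/memory.stat)
file=$(awk '$1=="file" {print $2}' $CG/memory.stat)
inactive_file=$(awk '$1=="inactive_file" {print $2}' $CG/memory.stat)
echo "usage:        $usage bytes"
echo "anon:         $anon bytes (heap/stack, not easily reclaimable)"
echo "file cache:   $file bytes (reclaimable under pressure)"
echo "working set:  $(( usage - inactive_file )) bytes"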

Memory Limit vs OOM

┌─────────────────────────────────────────────────────────────────────┐
│                    MEMORY LIMIT EXCEEDED                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Process requests allocation                                         │
│         │                                                            │
│         ▼                                                            │
│  Is memory.current < memory.max?                                     │
│         │                                                            │
│    ┌────┴────┐                                                       │
│    │YES     NO│                                                       │
│    ▼          ▼                                                       │
│  Allocate   Try to reclaim                                           │
│  memory     from this cgroup                                         │
│              │                                                        │
│              ▼                                                        │
│         Reclaim successful?                                          │
│              │                                                        │
│         ┌────┴────┐                                                  │
│         │YES     NO│                                                  │
│         ▼          ▼                                                  │
│       Allocate   Invoke OOM killer                                   │
│       memory     for this cgroup                                     │
│                   │                                                   │
│                   ▼                                                   │
│            Kill process with                                         │
│            highest oom_score                                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
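
The flow is easy to reproduce in a throwaway cgroup. A hedged demo (requires root and cgroups v2, deliberately gets the allocating process OOM-killed, and assumes python3 is available as the memory hog):

mkdir /sys/fs/cgroup/oomdemo
echo "50M" > /sys/fs/cgroup/oomdemo/memory.max
echo "0"   > /sys/fs/cgroup/oomdemo/memory.swap.max   # if swap accounting is enabled

# Join the cgroup, then try to allocate 200MB
sh -c 'echo $$ > /sys/fs/cgroup/oomdemo/cgroup.procs
       exec python3 -c "x = bytearray(200 * 1024 * 1024)"'
# Killed

cat /sys/fs/cgroup/oomdemo/memory.events   # oom / oom_kill counters increment
dmesg | grep -i "memory cgroup out of memory" | tail -1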

I/O Controller

I/O Bandwidth Limiting

# cgroups v2 I/O controls
cat /sys/fs/cgroup/<path>/io.max

# Format: "MAJ:MIN rbps=LIMIT wbps=LIMIT riops=LIMIT wiops=LIMIT"
# Example: Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > io.max

# Check I/O statistics
cat /sys/fs/cgroup/<path>/io.stat

# Example output:
# 8:0 rbytes=1073741824 wbytes=536870912 rios=10000 wios=5000 dbytes=0 dios=0

I/O Latency Control (io.latency)

# Set target latency for device
echo "8:0 target=10000" > /sys/fs/cgroup/<path>/io.latency
# 10000 = 10ms target latency

# The kernel will throttle this cgroup if its I/O
# is causing other cgroups to exceed their targets

PID Controller

# Limit maximum number of processes
echo 100 > /sys/fs/cgroup/<path>/pids.max

# Check current count
cat /sys/fs/cgroup/<path>/pids.current

# Check if limit was hit
cat /sys/fs/cgroup/<path>/pids.events
# max 5    ← Fork denied 5 times due to limit
Fork Bomb Protection: The pids controller prevents fork bombs from exhausting system resources. Set reasonable limits on all container cgroups.
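
A quick way to see the limit bite (hedged sketch; requires root, cgroups v2, and the pids controller enabled in the parent's subtree_control):

CG=/sys/fs/cgroup/pidsdemo
mkdir $CG
echo 10 > $CG/pids.max

# Join the cgroup and try to spawn 20 background sleeps
sh -c "echo \$\$ > $CG/cgroup.procs; for i in \$(seq 1 20); do sleep 30 & done"
# forks beyond the limit fail (typically "Resource temporarily unavailable")

cat $CG/pids.events    # "max N" counts the denied forks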

cpuset Controller

# Pin to specific CPUs
echo "0,2,4,6" > /sys/fs/cgroup/<path>/cpuset.cpus

# Pin to specific memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/<path>/cpuset.mems

# Verify current settings
cat /sys/fs/cgroup/<path>/cpuset.cpus.effective
cat /sys/fs/cgroup/<path>/cpuset.mems.effective

NUMA and cpuset

┌─────────────────────────────────────────────────────────────────────┐
│                        NUMA TOPOLOGY                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Node 0                          Node 1                             │
│  ┌─────────────────────┐        ┌─────────────────────┐            │
│  │  CPU 0   CPU 1      │        │  CPU 4   CPU 5      │            │
│  │  CPU 2   CPU 3      │        │  CPU 6   CPU 7      │            │
│  │                     │        │                     │            │
│  │  ┌───────────────┐  │        │  ┌───────────────┐  │            │
│  │  │   Memory      │  │        │  │   Memory      │  │            │
│  │  │   64GB        │  │        │  │   64GB        │  │            │
│  │  └───────────────┘  │        │  └───────────────┘  │            │
│  └─────────────────────┘        └─────────────────────┘            │
│                                                                      │
│  For best performance, pin process to CPUs and memory               │
│  on the same NUMA node:                                              │
│                                                                      │
│  cpuset.cpus = "0-3"                                                │
│  cpuset.mems = "0"                                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
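
To choose values, inspect the topology first; a minimal sketch using lscpu (the node-to-CPU mapping below matches the example machine above and will differ on real hardware):

lscpu | grep -i numa
# NUMA node(s):        2
# NUMA node0 CPU(s):   0-3
# NUMA node1 CPU(s):   4-7

echo "0-3" > /sys/fs/cgroup/<path>/cpuset.cpus
echo "0"   > /sys/fs/cgroup/<path>/cpuset.mems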

Practical cgroups Operations

Creating and Managing cgroups

# Create a new cgroup
mkdir /sys/fs/cgroup/mygroup

# Check which controllers are available
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# Enable controllers for children at each level of the tree
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
echo "+cpu +memory +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Create child cgroup
mkdir /sys/fs/cgroup/mygroup/child

# Add process to cgroup
echo $$ > /sys/fs/cgroup/mygroup/child/cgroup.procs

# Set limits
echo "50000 100000" > /sys/fs/cgroup/mygroup/child/cpu.max
echo "100M" > /sys/fs/cgroup/mygroup/child/memory.max
echo 50 > /sys/fs/cgroup/mygroup/child/pids.max
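
Teardown works in reverse: a cgroup can only be removed once it has no member processes and no children. A short sketch, assuming the shell is the only process that was moved in:

# Move the shell back out, then remove leaf cgroups first
echo $$ > /sys/fs/cgroup/cgroup.procs   # on a systemd host, its original user slice also works
rmdir /sys/fs/cgroup/mygroup/child
rmdir /sys/fs/cgroup/mygroup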

Container Runtime cgroup Operations

# Find container's cgroup
# Docker (v2)
CONTAINER_ID=$(docker ps --no-trunc -q --filter name=mycontainer)   # full ID matches the systemd scope name
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/memory.current

# Using systemd (cgroups v2)
systemctl status docker-${CONTAINER_ID}.scope

# Using /proc
cat /proc/<pid>/cgroup
# Output: 0::/system.slice/docker-abc123.scope
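
The /proc lookup generalizes to any process; a small sketch that resolves a PID's v2 cgroup path and reads its stats (the PID value is a placeholder):

PID=1234                                                # hypothetical target process
CGPATH=$(awk -F'::' '/^0::/ {print $2}' /proc/$PID/cgroup)
cat /sys/fs/cgroup${CGPATH}/memory.current
cat /sys/fs/cgroup${CGPATH}/cpu.stat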

Delegation and Nesting

Cgroup Delegation

┌─────────────────────────────────────────────────────────────────────┐
│                      CGROUP DELEGATION                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Root owns top levels:                                               │
│                                                                      │
│  /sys/fs/cgroup/                        ← root-owned                │
│  └── system.slice/                      ← root-owned                │
│      └── docker.service/                ← root-owned                │
│          └── container-abc/             ← delegated to container    │
│              ├── cgroup.procs           ← container can write       │
│              ├── memory.current         ← container can read        │
│              └── child/                 ← container can create      │
│                                                                      │
│  Delegation enables:                                                 │
│  1. Container can create nested cgroups                             │
│  2. Container can move its processes between cgroups                │
│  3. Container cannot exceed parent's limits                         │
│  4. Container cannot affect sibling cgroups                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Delegation Requirements

# Check delegation
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# For delegation to work:
# 1. Directory owned by delegate
# 2. cgroup.procs owned by delegate  
# 3. cgroup.subtree_control owned by delegate
# 4. Parent's subtree_control enables needed controllers

# Set up delegation
mkdir /sys/fs/cgroup/delegated
chown -R user:user /sys/fs/cgroup/delegated
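
On systemd hosts it is usually cleaner to ask systemd for a delegated subtree than to chown cgroupfs files by hand. A hedged sketch using systemd-run (Delegate= and MemoryMax= are documented systemd unit properties; behavior varies by systemd version, and "mycommand" is a placeholder):

# Run a command in its own transient scope with cgroup delegation
systemd-run --scope -p Delegate=yes -p MemoryMax=512M mycommand

# Or get a delegated subtree for an unprivileged user session
systemd-run --user --scope -p Delegate=yes bash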

Debugging cgroups Issues

Common Issues and Solutions

Symptom: Container killed by OOM, but host shows available memory.
Diagnosis:
# Check cgroup memory limit
cat /sys/fs/cgroup/<path>/memory.max
cat /sys/fs/cgroup/<path>/memory.current

# Check for anonymous vs cache memory
grep -E "^(anon|file)" /sys/fs/cgroup/<path>/memory.stat

# Check OOM events
cat /sys/fs/cgroup/<path>/memory.events
# oom 5  ← OOM occurred 5 times
Solutions:
  • Increase memory limit
  • Reduce application memory usage
  • Add swap (for burst tolerance)
Symptom: Container shows 30% CPU usage but requests are slow.
Diagnosis:
# Check throttling stats
cat /sys/fs/cgroup/<path>/cpu.stat
# nr_throttled 15000  ← Throttled 15000 periods!

# Calculate throttle percentage
# throttle% = nr_throttled / nr_periods * 100
Cause: Bursty usage exhausts the quota early in each period even though average utilization is low.
Solutions:
  • Increase the CPU limit
  • Allow bursting with cpu.max.burst (cgroups v2; see the sketch after this list)
  • Spread work more evenly across the period
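
A sketch of enabling burst (requires a recent kernel, roughly 5.14+; the cpu.max.burst file name and the burst counters in cpu.stat are per the cgroups v2 documentation, so verify on your kernel):

# Allow the cgroup to bank up to 50ms of unused quota and spend it in a later period
echo 50000 > /sys/fs/cgroup/<path>/cpu.max.burst

grep burst /sys/fs/cgroup/<path>/cpu.stat
# nr_bursts / burst_usec show how often and how much burst was actually used
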
Symptom: echo: write error: Device or resource busy
Diagnosis:
# Check if cgroup has processes
cat /sys/fs/cgroup/<path>/cgroup.procs

# Check if cgroup has children
ls /sys/fs/cgroup/<path>/
Cause: The cgroups v2 "no internal processes" rule: if a cgroup has controllers enabled in subtree_control, processes may only live in leaf cgroups.
Solution: Move processes into leaf cgroups before enabling controllers on the parent.
Symptom: Set io.max but process still uses full disk bandwidth.
Diagnosis:
# Check if IO controller is enabled
cat /sys/fs/cgroup/<path>/cgroup.controllers | grep io

# Check io.max format (needs MAJ:MIN)
lsblk
cat /sys/fs/cgroup/<path>/io.max
Common causes:
  • Wrong device major:minor
  • IO through page cache (buffered writes)
  • Controller not enabled
Solution:
# Use O_DIRECT or sync writes
# Or use io.latency for latency-based control
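
A quick way to verify the limit once writes bypass the page cache is direct I/O (hedged sketch; the mount point and sizes are illustrative, and the process must already be inside the limited cgroup):

dd if=/dev/zero of=/mnt/data/testfile bs=1M count=100 oflag=direct
# With wbps=5242880 (5MB/s) this should take roughly 20 seconds

cat /sys/fs/cgroup/<path>/io.stat    # wbytes should grow at the capped rate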

Interview Questions

Q: What happens when a cgroup exceeds memory.high?

Answer: When a cgroup exceeds memory.high:
  1. Reclaim pressure increases: Kernel aggressively reclaims memory from this cgroup
  2. Throttling may occur: Memory allocation requests may be delayed
  3. No OOM: The process is NOT killed (unlike exceeding memory.max)
  4. Performance impact: Application may slow down due to reclaim
This is useful for:
  • Soft limits with burst allowance
  • Preventing one container from consuming all cache
  • Graceful degradation instead of hard OOM
Q: What are the key differences between cgroups v1 and v2?

Answer: Key differences:

Aspect               v1                          v2
Hierarchy            Multiple (per controller)   Single unified
Internal processes   Allowed                     Not allowed
Thread control       Limited                     Full support
Pressure metrics     No                          Yes (PSI)
Delegation           Complex                     Simple
v2 advantages:
  • Simpler mental model
  • Consistent behavior
  • Better pressure metrics (PSI)
  • Proper thread-level controls
  • Cleaner delegation model
Q: How would you debug high latency in a containerized application whose resource usage looks normal?

Systematic approach:
  1. Check CPU throttling:
    cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled
    
  2. Check memory pressure:
    cat /sys/fs/cgroup/<path>/memory.pressure
    cat /sys/fs/cgroup/<path>/memory.events | grep oom
    
  3. Check I/O latency:
    cat /sys/fs/cgroup/<path>/io.stat
    cat /sys/fs/cgroup/<path>/io.pressure
    
  4. Check for noisy neighbors:
    • Look at sibling cgroups’ usage
    • Check parent cgroup limits
  5. Use tracing:
    # Off-CPU analysis
    offcputime-bpfcc -p <pid>
    

PSI (Pressure Stall Information)

New in cgroups v2 - provides pressure metrics:
# Memory pressure
cat /sys/fs/cgroup/<path>/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# CPU pressure  
cat /sys/fs/cgroup/<path>/cpu.pressure
# some avg10=25.00 avg60=20.00 avg300=15.00 total=1234567

# IO pressure
cat /sys/fs/cgroup/<path>/io.pressure
# some avg10=5.00 avg60=3.00 avg300=2.00 total=789012
# full avg10=2.00 avg60=1.00 avg300=0.50 total=456789
Interpretation:
  • some: At least one task stalled
  • full: All tasks stalled
  • avg10/60/300: 10s/60s/300s moving averages (%)
  • total: Total stall time in microseconds
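
For ad-hoc debugging, a simple polling loop over the pressure files is often enough (the kernel also supports event-driven PSI triggers via poll(2), not shown here):

# Print the 10-second "some" averages for CPU, memory, and IO every 5 seconds
while sleep 5; do
  for res in cpu memory io; do
    printf "%s %-6s %s\n" "$(date +%T)" "$res" \
      "$(grep '^some' /sys/fs/cgroup/<path>/$res.pressure)"
  done
done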
