Control Groups (cgroups)
Control groups are the foundation of container resource management. Understanding cgroups deeply is essential for debugging container issues and designing resource-aware systems.
Prerequisites: Process fundamentals, basic container concepts
Interview Focus: cgroups v2, memory limits, CPU throttling, debugging
Companies: All container/cloud companies heavily test this
cgroups Overview
cgroups v1 vs v2
cgroups v1
Characteristics:
Multiple hierarchies (one per controller)
Controllers can be mounted independently
More flexible but complex
Still used in many production systems
File system layout:
/sys/fs/cgroup/
├── cpu/
│ └── docker/
│ └── container-abc/
│ ├── cpu.cfs_quota_us
│ └── cpu.cfs_period_us
├── memory/
│ └── docker/
│ └── container-abc/
│ ├── memory.limit_in_bytes
│ └── memory.usage_in_bytes
└── pids/
└── docker/
└── container-abc/
└── pids.max
Key issues:
Inconsistent APIs across controllers
Race conditions between hierarchies
No unified resource management
cgroups v2
Characteristics:
Single unified hierarchy
All controllers in one tree
Simpler, more consistent API
Default in modern systems (RHEL 8+, Ubuntu 22.04+)
File system layout:
/sys/fs/cgroup/
├── cgroup.controllers # Available controllers
├── cgroup.subtree_control # Enabled for children
├── system.slice/
│ └── sshd.service/
│ ├── cgroup.procs
│ ├── cpu.stat
│ └── memory.current
└── user.slice/
└── user-1000.slice/
Key improvements:
Consistent no-internal-process rule
Unified resource control
Better pressure metrics
Thread-level controls
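A quick way to tell which version a host is running is to check the filesystem type mounted at /sys/fs/cgroup (a minimal sketch; exact output varies slightly by distro):
# cgroup2fs means the unified v2 hierarchy; tmpfs means v1 per-controller mounts below it
stat -fc %T /sys/fs/cgroup/
# List cgroup mounts explicitly
mount | grep cgroup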
CPU Controller Deep Dive
CPU Bandwidth Limiting
┌─────────────────────────────────────────────────────────────────────┐
│ CPU BANDWIDTH LIMITING │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Configuration (cgroups v2): │
│ ┌─────────────────────────────────────────────────────────────────┐│
│ │ cpu.max = "$QUOTA $PERIOD" ││
│ │ ││
│ │ Examples: ││
│ │ "50000 100000" = 50ms per 100ms period = 50% of 1 CPU ││
│ │ "100000 100000" = 100ms per 100ms = 100% of 1 CPU ││
│ │ "200000 100000" = 200ms per 100ms = 200% = 2 CPUs ││
│ │ "max 100000" = Unlimited ││
│ │ ││
│ └─────────────────────────────────────────────────────────────────┘│
│ │
│ Kubernetes translation: │
│ resources: │
│ limits: │
│ cpu: "500m" → cpu.max = "50000 100000" │
│ cpu: "2" → cpu.max = "200000 100000" │
│ │
└─────────────────────────────────────────────────────────────────────┘
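Doing the same translation by hand is just a write to cpu.max (a sketch; the cgroup path "demo" is a made-up example):
# Equivalent of a Kubernetes limit of cpu: "500m": 50ms of quota per 100ms period
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max
cat /sys/fs/cgroup/demo/cpu.max
# 50000 100000
# Remove the limit again
echo "max 100000" > /sys/fs/cgroup/demo/cpu.max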
CPU Throttling
Common Interview Topic: Why is my container slow even when CPU usage looks low? Answer: CPU throttling!
# Check throttling statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/cpu.stat
# Example output:
usage_usec 1234567890 # Total CPU time used
user_usec 1000000000 # User-space time
system_usec 234567890 # Kernel-space time
nr_periods 50000 # Number of enforcement periods
nr_throttled 1000 # Periods where throttling occurred
throttled_usec 5000000 # Total time spent throttled
# Throttling percentage
throttled_percentage = nr_throttled / nr_periods * 100
# If > 5%, consider increasing CPU limit
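A small awk one-liner computes that percentage directly from cpu.stat (a sketch; the cgroup path is an assumed example):
CG=/sys/fs/cgroup/mygroup   # assumed example path
awk '/^nr_periods/ {p=$2} /^nr_throttled/ {t=$2}
     END { if (p > 0) printf "throttled in %.2f%% of periods\n", 100*t/p }' "$CG/cpu.stat"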
CPU Throttling Visualization
┌────────────────────────────────────────────────────────────────────────┐
│ CPU THROTTLING BEHAVIOR │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Period: 100ms, Quota: 50ms (50% of 1 CPU) │
│ │
│ Time ─────────────────────────────────────────────────────────────► │
│ │
│ Period 1 │ Period 2 │ Period 3 │ Period 4 │
│ 0ms 100ms │ 100ms 200ms │ 200ms 300ms │ 300ms 400ms │
│ │ │ │ │ │ │ │ │ │
│ ├──────┤ ├───────┤ ├───────┤ ├───────┤ │
│ │██████│ │███████│ │████│ │ │███████│ │
│ │██████│ │███████│ │████│ │ │███████│ │
│ │██████│░░░░░░│███████│░░░░░░░│████│░░│░░░░░░░│███████│░░░░░░ │
│ └──────┘ └───────┘ └────┘ │ └───────┘ │
│ │ │
│ ██ = Running ░░ = Throttled │ │
│ │ │
│ Period 3: Only 40ms work, no throttle ◄─────── │
│ Other periods: Hit 50ms quota, throttled │
│ │
└────────────────────────────────────────────────────────────────────────┘
Memory Controller Deep Dive
Memory Limits and Protection
┌─────────────────────────────────────────────────────────────────────┐
│ MEMORY CGROUP CONTROLS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ memory.max │
│ │ │
│ │ ← Hard limit (OOM if exceeded) │
│ Usage │ │
│ │ ────┼──────────────────────────── │
│ │ │ │
│ │ memory.high │
│ │ │ │
│ │ │ ← Throttling begins (reclaim) │
│ │ ────┼──────────────────────────── │
│ │ │ │
│ │ ┌──────────────────────┴──────────────────────────┐ │
│ │ │ Normal Operation │ │
│ │ │ Application allocates freely │ │
│ │ └─────────────────────────────────────────────────┘ │
│ │ │ │
│ │ memory.low │
│ │ │ ← Best-effort protection │
│ │ ────┼──────────────────────────── │
│ │ │ │
│ │ memory.min │
│ │ │ ← Guaranteed protection │
│ └─────────────────────────┴────────────────────────────► Time │
│ │
└─────────────────────────────────────────────────────────────────────┘
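Setting all four thresholds is just four writes (a sketch; "demo" and the sizes are arbitrary examples):
CG=/sys/fs/cgroup/demo
echo "64M"  > "$CG/memory.min"    # guaranteed: not reclaimed below this
echo "128M" > "$CG/memory.low"    # best-effort protection
echo "400M" > "$CG/memory.high"   # above this: aggressive reclaim / throttling
echo "512M" > "$CG/memory.max"    # hard limit: OOM kill if reclaim fails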
Memory Accounting Details
# Read memory statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/memory.stat
# Key fields:
anon 1073741824 # Anonymous memory (heap, stack)
file 536870912 # Page cache (file-backed pages)
kernel 67108864 # Kernel memory (slab, etc.)
kernel_stack 1048576 # Kernel stacks
pagetables 8388608 # Page table memory
sock 134217728 # Socket buffers
shmem 0 # Shared memory
file_mapped 268435456 # Memory-mapped files
file_dirty 4096 # Dirty pages
file_writeback 0 # Pages being written
swapcached 0 # Swap cache
anon_thp 0 # Anonymous huge pages
file_thp 0 # File-backed huge pages
slab_reclaimable 16777216 # Reclaimable slab
slab_unreclaimable 8388608 # Non-reclaimable slab
pgfault 15000000 # Total page faults
pgmajfault 1000 # Major page faults (disk I/O)
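These fields let you estimate a container's working set roughly the way kubelet does: current usage minus inactive page cache (a sketch; the path is assumed):
CG=/sys/fs/cgroup/mygroup
CURRENT=$(cat "$CG/memory.current")
INACTIVE_FILE=$(awk '/^inactive_file / {print $2}' "$CG/memory.stat")
echo "working set: $(( CURRENT - INACTIVE_FILE )) bytes"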
Memory Limit vs OOM
┌─────────────────────────────────────────────────────────────────────┐
│ MEMORY LIMIT EXCEEDED │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Process requests allocation │
│ │ │
│ ▼ │
│ Is memory.current < memory.max? │
│ │ │
│ ┌────┴────┐ │
│ │YES NO│ │
│ ▼ ▼ │
│ Allocate Try to reclaim │
│ memory from this cgroup │
│ │ │
│ ▼ │
│ Reclaim successful? │
│ │ │
│ ┌────┴────┐ │
│ │YES NO│ │
│ ▼ ▼ │
│ Allocate Invoke OOM killer │
│ memory for this cgroup │
│ │ │
│ ▼ │
│ Kill process with │
│ highest oom_score │
│ │
└─────────────────────────────────────────────────────────────────────┘
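You can see how often that flow ended in an OOM kill from memory.events (a sketch; the path and counter values are illustrative):
cat /sys/fs/cgroup/demo/memory.events
# low 0
# high 0
# max 12        ← memory.max was hit 12 times
# oom 3         ← OOM killer invoked for this cgroup 3 times
# oom_kill 3    ← processes actually killed
# Kernel log entries for cgroup OOM kills
dmesg | grep -i "memory cgroup out of memory" | tail -n 3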
I/O Controller
I/O Bandwidth Limiting
# cgroups v2 I/O controls
cat /sys/fs/cgroup/<path>/io.max
# Format: "MAJ:MIN rbps=LIMIT wbps=LIMIT riops=LIMIT wiops=LIMIT"
# Example: Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > io.max
# Check I/O statistics
cat /sys/fs/cgroup/<path>/io.stat
# Example output:
# 8:0 rbytes=1073741824 wbytes=536870912 rios=10000 wios=5000 dbytes=0 dios=0
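io.max is keyed by the block device's major:minor numbers, which you can look up before writing the limit (a sketch; /dev/sda is a placeholder device):
# List block devices with their major:minor numbers
lsblk -d -o NAME,MAJ:MIN
# 8:0 corresponds to /dev/sda on a typical SCSI/SATA system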
I/O Latency Control (io.latency)
# Set target latency for device
echo "8:0 target=10000" > /sys/fs/cgroup/ < pat h > /io.latency
# 10000 = 10ms target latency
# The kernel will throttle this cgroup if its I/O
# is causing other cgroups to exceed their targets
PID Controller
# Limit maximum number of processes
echo 100 > /sys/fs/cgroup/<path>/pids.max
# Check current count
cat /sys/fs/cgroup/<path>/pids.current
# Check if limit was hit
cat /sys/fs/cgroup/<path>/pids.events
# max 5 ← Fork denied 5 times due to limit
Fork Bomb Protection: The pids controller prevents fork bombs from exhausting system resources. Set reasonable limits on all container cgroups.
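A quick way to see the controller in action (a throwaway-cgroup sketch; assumes the pids controller is enabled in the parent's cgroup.subtree_control, which systemd typically does at the root):
CG=/sys/fs/cgroup/pidtest
mkdir "$CG"
echo 10 > "$CG/pids.max"
echo $$ > "$CG/cgroup.procs"              # move the current shell in
for i in $(seq 1 20); do sleep 60 & done  # beyond 10 tasks, fork fails: "Resource temporarily unavailable"
cat "$CG/pids.events"
# max 11   ← denied forks (value illustrative)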
cpuset Controller
# Pin to specific CPUs
echo "0,2,4,6" > /sys/fs/cgroup/ < pat h > /cpuset.cpus
# Pin to specific memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/ < pat h > /cpuset.mems
# Verify current settings
cat /sys/fs/cgroup/ < pat h > /cpuset.cpus.effective
cat /sys/fs/cgroup/ < pat h > /cpuset.mems.effective
NUMA and cpuset
┌─────────────────────────────────────────────────────────────────────┐
│ NUMA TOPOLOGY │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Node 0 Node 1 │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ CPU 0 CPU 1 │ │ CPU 4 CPU 5 │ │
│ │ CPU 2 CPU 3 │ │ CPU 6 CPU 7 │ │
│ │ │ │ │ │
│ │ ┌───────────────┐ │ │ ┌───────────────┐ │ │
│ │ │ Memory │ │ │ │ Memory │ │ │
│ │ │ 64GB │ │ │ │ 64GB │ │ │
│ │ └───────────────┘ │ │ └───────────────┘ │ │
│ └─────────────────────┘ └─────────────────────┘ │
│ │
│ For best performance, pin process to CPUs and memory │
│ on the same NUMA node: │
│ │
│ cpuset.cpus = "0-3" │
│ cpuset.mems = "0" │
│ │
└─────────────────────────────────────────────────────────────────────┘
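Applying that recommendation by hand looks like this (a sketch; "numa-demo" is an example name, and the cpuset controller must be enabled in the parent's subtree_control):
CG=/sys/fs/cgroup/numa-demo
mkdir "$CG"
echo "0-3" > "$CG/cpuset.cpus"   # node 0's CPUs
echo "0"   > "$CG/cpuset.mems"   # node 0's memory
numactl --hardware               # confirms which CPUs and memory belong to each node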
Practical cgroups Operations
Creating and Managing cgroups
# Create a new cgroup
mkdir /sys/fs/cgroup/mygroup
# Enable controllers for children
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
# Also enable controllers for mygroup's children (otherwise the child
# will not expose cpu.max, memory.max, or pids.max)
echo "+cpu +memory +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control
# Create child cgroup
mkdir /sys/fs/cgroup/mygroup/child
# Add process to cgroup
echo $$ > /sys/fs/cgroup/mygroup/child/cgroup.procs
# Set limits
echo "50000 100000" > /sys/fs/cgroup/mygroup/child/cpu.max
echo "100M" > /sys/fs/cgroup/mygroup/child/memory.max
echo 50 > /sys/fs/cgroup/mygroup/child/pids.max
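On a systemd host you rarely create these directories by hand; systemd-run can apply equivalent limits to a transient unit (a sketch; ./my-workload is a placeholder command):
# CPUQuota=50% ≈ cpu.max "50000 100000"; MemoryMax and TasksMax map to memory.max and pids.max
systemd-run --scope -p CPUQuota=50% -p MemoryMax=100M -p TasksMax=50 ./my-workload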
Container Runtime cgroup Operations
# Find container's cgroup
# Docker (v2)
CONTAINER_ID=$(docker ps -q --no-trunc --filter name=mycontainer)   # full ID needed for the scope name
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/memory.current
# Using systemd (cgroups v2)
systemctl status docker-${CONTAINER_ID}.scope
# Using /proc
cat /proc/<pid>/cgroup
# Output: 0::/system.slice/docker-abc123.scope
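Going from a PID to its cgroup's live usage is a common debugging move (a sketch; PID 1234 is a placeholder):
PID=1234
CGPATH=$(awk -F'::' '{print $2}' /proc/$PID/cgroup)   # v2 has a single "0::<path>" line
cat "/sys/fs/cgroup${CGPATH}/memory.current"
cat "/sys/fs/cgroup${CGPATH}/cpu.stat"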
Delegation and Nesting
Cgroup Delegation
┌─────────────────────────────────────────────────────────────────────┐
│ CGROUP DELEGATION │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Root owns top levels: │
│ │
│ /sys/fs/cgroup/ ← root-owned │
│ └── system.slice/ ← root-owned │
│ └── docker.service/ ← root-owned │
│ └── container-abc/ ← delegated to container │
│ ├── cgroup.procs ← container can write │
│ ├── memory.current ← container can read │
│ └── child/ ← container can create │
│ │
│ Delegation enables: │
│ 1. Container can create nested cgroups │
│ 2. Container can move its processes between cgroups │
│ 3. Container cannot exceed parent's limits │
│ 4. Container cannot affect sibling cgroups │
│ │
└─────────────────────────────────────────────────────────────────────┘
Delegation Requirements
# Check delegation
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids
# For delegation to work:
# 1. Directory owned by delegate
# 2. cgroup.procs owned by delegate
# 3. cgroup.subtree_control owned by delegate
# 4. Parent's subtree_control enables needed controllers
# Set up delegation
mkdir /sys/fs/cgroup/delegated
chown -R user:user /sys/fs/cgroup/delegated
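On systemd systems the usual way to get a properly delegated subtree is the Delegate= unit property rather than manual chown (a minimal sketch under that assumption):
# Start a transient scope whose cgroup subtree is delegated to the invoking user
systemd-run --user --scope -p Delegate=yes sleep 300
# Inside that scope the user owns cgroup.procs and cgroup.subtree_control
# and can create nested cgroups without root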
Debugging cgroups Issues
Common Issues and Solutions
Container OOM but host has memory
Symptom: Container killed by OOM, but host shows available memory.
Diagnosis:
# Check cgroup memory limit
cat /sys/fs/cgroup/<path>/memory.max
cat /sys/fs/cgroup/<path>/memory.current
# Check for anonymous vs cache memory
grep -E "^(anon|file)" /sys/fs/cgroup/<path>/memory.stat
# Check OOM events
cat /sys/fs/cgroup/<path>/memory.events
# oom 5 ← OOM occurred 5 times
Solutions:
Increase memory limit
Reduce application memory usage
Add swap (for burst tolerance)
CPU throttling at low usage
Symptom: Container shows 30% CPU usage but requests are slow.
Diagnosis:
# Check throttling stats
cat /sys/fs/cgroup/<path>/cpu.stat
# nr_throttled 15000 ← Throttled 15000 periods!
# Calculate throttle percentage
# throttle% = nr_throttled / nr_periods * 100
Cause: Burst usage exceeds the quota within a period even though the average is low.
Solutions:
Increase CPU limit
Use cpu.max.burst (cgroups v2) for a burst allowance (see the sketch after this list)
Spread work more evenly
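Configuring the burst allowance mentioned above is a single extra write (a sketch; needs a reasonably recent kernel, roughly 5.14+, and an example cgroup path):
CG=/sys/fs/cgroup/demo
echo "50000 100000" > "$CG/cpu.max"        # 50% average over each 100ms period
echo "50000"        > "$CG/cpu.max.burst"  # may additionally borrow up to 50ms of previously unused quota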
Can't write to cgroup files
Symptom: echo: write error: Device or resource busy
Diagnosis:
# Check if cgroup has processes
cat /sys/fs/cgroup/<path>/cgroup.procs
# Check if cgroup has children
ls /sys/fs/cgroup/<path>/
Cause: The cgroups v2 "no internal processes" rule: if a cgroup has controllers enabled in subtree_control, its processes must live in leaf cgroups.
Solution: Move processes to leaf cgroups before modifying the parent.
I/O limits not taking effect
Symptom: Set io.max but the process still uses full disk bandwidth.
Diagnosis:
# Check if the IO controller is enabled
cat /sys/fs/cgroup/<path>/cgroup.controllers | grep io
# Check io.max format (needs MAJ:MIN)
lsblk
cat /sys/fs/cgroup/<path>/io.max
Common causes:
Wrong device major:minor
IO through page cache (buffered writes)
Controller not enabled
Solution:
# Use O_DIRECT or sync writes
# Or use io.latency for latency-based control
Interview Questions
Q: What happens when a container exceeds memory.high?
Answer: When a cgroup exceeds memory.high:
Reclaim pressure increases: The kernel aggressively reclaims memory from this cgroup
Throttling may occur: Memory allocation requests may be delayed
No OOM: The process is NOT killed (unlike exceeding memory.max)
Performance impact: The application may slow down due to reclaim
This is useful for:
Soft limits with burst allowance
Preventing one container from consuming all cache
Graceful degradation instead of hard OOM
Q: Explain the difference between cgroups v1 and v2
Key differences:
Aspect               | v1                         | v2
Hierarchy            | Multiple (per controller)  | Single unified
Internal processes   | Allowed                    | Not allowed
Thread control       | Limited                    | Full support
Pressure metrics     | No                         | Yes (PSI)
Delegation           | Complex                    | Simple
v2 advantages :
Simpler mental model
Consistent behavior
Better pressure metrics (PSI)
Proper thread-level controls
Cleaner delegation model
Q: How would you debug high latency in a container?
Systematic approach:
Check CPU throttling:
cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled
Check memory pressure:
cat /sys/fs/cgroup/<path>/memory.pressure
cat /sys/fs/cgroup/<path>/memory.events | grep oom
Check I/O latency:
cat /sys/fs/cgroup/<path>/io.stat
cat /sys/fs/cgroup/<path>/io.pressure
Check for noisy neighbors:
Look at sibling cgroups’ usage
Check parent cgroup limits
Use tracing:
# Off-CPU analysis
offcputime-bpfcc -p <pid>
Pressure Stall Information (PSI)
PSI, new in cgroups v2, provides per-cgroup pressure metrics:
# Memory pressure
cat /sys/fs/cgroup/<path>/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# CPU pressure
cat /sys/fs/cgroup/<path>/cpu.pressure
# some avg10=25.00 avg60=20.00 avg300=15.00 total=1234567
# IO pressure
cat /sys/fs/cgroup/<path>/io.pressure
# some avg10=5.00 avg60=3.00 avg300=2.00 total=789012
# full avg10=2.00 avg60=1.00 avg300=0.50 total=456789
Interpretation:
some: At least one task stalled
full: All tasks stalled
avg10/60/300: 10s/60s/300s moving averages (%)
total: Total stall time in microseconds
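A crude watch loop over these files is often enough for a first-pass monitor (a sketch; the path and the 20% threshold are arbitrary):
CG=/sys/fs/cgroup/mygroup
while sleep 10; do
  avg10=$(awk '/^some/ { sub("avg10=", "", $2); print $2 }' "$CG/cpu.pressure")
  awk -v v="$avg10" 'BEGIN { if (v + 0 > 20) print "CPU pressure high:", v "%" }'
done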
Next: Filesystem & VFS →