Control Groups - Resource limits, accounting, and prioritization

Control Groups (cgroups)

Control groups are the foundation of container resource management. Understanding cgroups deeply is essential for debugging container issues and designing resource-aware systems.

Prerequisites: Process fundamentals, basic container concepts
Interview Focus: cgroups v2, memory limits, CPU throttling, debugging
Companies: Commonly tested at container and cloud infrastructure companies

cgroups Overview

cgroups v1 vs v2

v1 (Legacy)
v2 (Unified)

v1 (Legacy) characteristics:

Multiple hierarchies (one per controller)
Controllers can be mounted independently
More flexible but complex
Still used in many production systems

File system layout:

/sys/fs/cgroup/
├── cpu/
│   └── docker/
│       └── container-abc/
│           ├── cpu.cfs_quota_us
│           └── cpu.cfs_period_us
├── memory/
│   └── docker/
│       └── container-abc/
│           ├── memory.limit_in_bytes
│           └── memory.usage_in_bytes
└── pids/
    └── docker/
        └── container-abc/
            └── pids.max

Key issues:

Inconsistent APIs across controllers
Race conditions between hierarchies
No unified resource management

v2 (Unified) characteristics:

Single unified hierarchy
All controllers in one tree
Simpler, more consistent API
Default in modern systems (RHEL 9+, Ubuntu 22.04+); RHEL 8 still boots with v1 by default

File system layout:

/sys/fs/cgroup/
├── cgroup.controllers         # Available controllers
├── cgroup.subtree_control     # Enabled for children
├── system.slice/
│   └── sshd.service/
│       ├── cgroup.procs
│       ├── cpu.stat
│       └── memory.current
└── user.slice/
    └── user-1000.slice/

Key improvements:

Consistent no-internal-process rule
Unified resource control
Better pressure metrics
Thread-level controls
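
Before reading any of the files above, it helps to know which layout a host uses. The sketch below assumes GNU coreutils `stat` and a cgroup mount at /sys/fs/cgroup; `cgroup_mode` is a hypothetical helper name, not a standard tool. A v2 host reports the filesystem type `cgroup2fs`, while a v1 (or hybrid) host reports `tmpfs` with per-controller mounts underneath.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup layout.
# cgroup2fs -> unified v2 hierarchy
# tmpfs     -> v1 (or hybrid) with one mount per controller below it
cgroup_mode() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# On a live Linux host:
#   cgroup_mode "$(stat -fc %T /sys/fs/cgroup)"
cgroup_mode cgroup2fs   # prints: v2
```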

CPU Controller Deep Dive

CPU Bandwidth Limiting

┌─────────────────────────────────────────────────────────────────────┐
│                    CPU BANDWIDTH LIMITING                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Configuration (cgroups v2):                                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  cpu.max = "$QUOTA $PERIOD"                                      ││
│  │                                                                  ││
│  │  Examples:                                                       ││
│  │  "50000 100000"  = 50ms per 100ms period = 50% of 1 CPU         ││
│  │  "100000 100000" = 100ms per 100ms = 100% of 1 CPU             ││
│  │  "200000 100000" = 200ms per 100ms = 200% = 2 CPUs             ││
│  │  "max 100000"    = Unlimited                                    ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kubernetes translation:                                             │
│  resources:                                                          │
│    limits:                                                           │
│      cpu: "500m"    →  cpu.max = "50000 100000"                     │
│      cpu: "2"       →  cpu.max = "200000 100000"                    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
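
The Kubernetes translation shown in the box is plain arithmetic: quota = millicores × period / 1000. A minimal sketch, assuming the 100000 µs period used in the examples (the function name `millicores_to_cpu_max` is illustrative only):

```shell
# Convert a CPU limit in millicores to a cpu.max value ("QUOTA PERIOD").
# $1 = millicores, $2 = period in microseconds (assumed default: 100000)
millicores_to_cpu_max() (
  period="${2:-100000}"
  echo "$(( $1 * period / 1000 )) $period"
)

millicores_to_cpu_max 500    # prints: 50000 100000
millicores_to_cpu_max 2000   # prints: 200000 100000
```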

CPU Throttling

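Common Interview Topic: Why is my container slow even when average CPU usage looks low? A frequent cause is CPU throttling: the cgroup uses up its quota early in a period and then cannot run until the next period starts.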

# Check throttling statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/cpu.stat

# Example output:
usage_usec 1234567890        # Total CPU time used
user_usec 1000000000         # User-space time
system_usec 234567890        # Kernel-space time
nr_periods 50000             # Number of enforcement periods
nr_throttled 1000            # Periods where throttling occurred
throttled_usec 5000000       # Total time spent throttled

# Throttling percentage
# throttled_percentage = nr_throttled / nr_periods * 100
# With the values above: 1000 / 50000 * 100 = 2%
# If > 5%, consider increasing CPU limit
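
The percentage formula can be wrapped in a small helper that reads cpu.stat from standard input. This is a sketch; `throttle_pct` is a hypothetical name, and the 5% threshold stays a rule of thumb rather than a kernel-defined limit.

```shell
# Print the share of enforcement periods in which the cgroup was throttled.
# Reads cpu.stat content on stdin.
throttle_pct() {
  awk '$1 == "nr_periods"   { periods = $2 }
       $1 == "nr_throttled" { throttled = $2 }
       END {
         if (periods > 0) printf "%.2f\n", throttled / periods * 100
         else             print "0.00"
       }'
}

# On a live host: throttle_pct < /sys/fs/cgroup/<path>/cpu.stat
printf 'nr_periods 50000\nnr_throttled 1000\n' | throttle_pct   # prints: 2.00
```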

CPU Throttling Visualization

┌────────────────────────────────────────────────────────────────────────┐
│                    CPU THROTTLING BEHAVIOR                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Period: 100ms, Quota: 50ms (50% of 1 CPU)                             │
│                                                                         │
│  Time ─────────────────────────────────────────────────────────────►   │
│                                                                         │
│  Period 1      │ Period 2      │ Period 3      │ Period 4              │
│  0ms    100ms  │ 100ms  200ms  │ 200ms  300ms  │ 300ms  400ms          │
│  │      │      │       │       │       │       │       │               │
│  ├──────┤      ├───────┤       ├───────┤       ├───────┤               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│░░░░░░│███████│░░░░░░░│████│░░│░░░░░░░│███████│░░░░░░         │
│  └──────┘      └───────┘       └────┘  │       └───────┘               │
│                                        │                               │
│  ██ = Running  ░░ = Throttled         │                               │
│                                        │                               │
│  Period 3: Only 40ms work, no throttle ◄───────                        │
│  Other periods: Hit 50ms quota, throttled                              │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘

Memory Controller Deep Dive

Memory Limits and Protection

┌─────────────────────────────────────────────────────────────────────┐
│                     MEMORY CGROUP CONTROLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                          memory.max                                  │
│                              │                                       │
│                              │ ← Hard limit (OOM if exceeded)       │
│  Usage                       │                                       │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.high                                  │
│    │                         │                                       │
│    │                         │ ← Throttling begins (reclaim)        │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │  ┌──────────────────────┴──────────────────────────┐           │
│    │  │                 Normal Operation                 │           │
│    │  │          Application allocates freely            │           │
│    │  └─────────────────────────────────────────────────┘           │
│    │                         │                                       │
│    │                    memory.low                                   │
│    │                         │ ← Best-effort protection             │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.min                                   │
│    │                         │ ← Guaranteed protection               │
│    └─────────────────────────┴────────────────────────────► Time    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Memory Accounting Details

# Read memory statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/memory.stat

# Key fields:
anon 1073741824                  # Anonymous memory (heap, stack)
file 536870912                   # Page cache (file-backed pages)
kernel 67108864                  # Kernel memory (slab, etc.)
kernel_stack 1048576             # Kernel stacks
pagetables 8388608               # Page table memory
sock 134217728                   # Socket buffers
shmem 0                          # Shared memory
file_mapped 268435456            # Memory-mapped files
file_dirty 4096                  # Dirty pages
file_writeback 0                 # Pages being written
swapcached 0                     # Swap cache
anon_thp 0                       # Anonymous huge pages
file_thp 0                       # File-backed huge pages
slab_reclaimable 16777216        # Reclaimable slab
slab_unreclaimable 8388608       # Non-reclaimable slab
pgfault 15000000                 # Total page faults
pgmajfault 1000                  # Major page faults (disk I/O)
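
When comparing anon and file usage, raw byte counts are hard to read. The following sketch converts one memory.stat field to MiB; the helper name `mem_stat_mib` is an assumption for illustration.

```shell
# Print one memory.stat field in MiB (integer, truncated).
# $1 = field name; memory.stat content is read from stdin.
mem_stat_mib() {
  awk -v key="$1" '$1 == key { printf "%d\n", $2 / 1048576 }'
}

# On a live host: mem_stat_mib anon < /sys/fs/cgroup/<path>/memory.stat
printf 'anon 1073741824\nfile 536870912\n' | mem_stat_mib anon   # prints: 1024
printf 'anon 1073741824\nfile 536870912\n' | mem_stat_mib file   # prints: 512
```

A high anon value points at heap growth in the application, while a high file value is page cache that the kernel can usually reclaim before the limit is hit.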

Memory Limit vs OOM

┌─────────────────────────────────────────────────────────────────────┐
│                    MEMORY LIMIT EXCEEDED                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Process requests allocation                                         │
│         │                                                            │
│         ▼                                                            │
│  Is memory.current < memory.max?                                     │
│         │                                                            │
│    ┌────┴────┐                                                       │
│    │YES     NO│                                                       │
│    ▼          ▼                                                       │
│  Allocate   Try to reclaim                                           │
│  memory     from this cgroup                                         │
│              │                                                        │
│              ▼                                                        │
│         Reclaim successful?                                          │
│              │                                                        │
│         ┌────┴────┐                                                  │
│         │YES     NO│                                                  │
│         ▼          ▼                                                  │
│       Allocate   Invoke OOM killer                                   │
│       memory     for this cgroup                                     │
│                   │                                                   │
│                   ▼                                                   │
│            Kill process with                                         │
│            highest oom_score                                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
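
The two branches of the flow above are visible in memory.events: the `max` counter increases each time usage hits memory.max and reclaim is attempted, and `oom_kill` counts processes actually killed. A minimal sketch that summarizes both (the name `oom_summary` is hypothetical):

```shell
# Summarize how often the limit was hit versus how often a process was killed.
# Reads memory.events content on stdin.
oom_summary() {
  awk '$1 == "max"      { hits = $2 }
       $1 == "oom_kill" { kills = $2 }
       END { printf "limit_hits=%d oom_kills=%d\n", hits, kills }'
}

# On a live host: oom_summary < /sys/fs/cgroup/<path>/memory.events
printf 'low 0\nhigh 0\nmax 12\noom 5\noom_kill 3\n' | oom_summary
# prints: limit_hits=12 oom_kills=3
```

Many limit hits with few kills suggests reclaim is keeping the cgroup alive at the cost of latency.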

I/O Controller

I/O Bandwidth Limiting

# cgroups v2 I/O controls
cat /sys/fs/cgroup/<path>/io.max

# Format: "MAJ:MIN rbps=LIMIT wbps=LIMIT riops=LIMIT wiops=LIMIT"
# Example: Limit to 10MB/s read, 5MB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/<path>/io.max

# Check I/O statistics
cat /sys/fs/cgroup/<path>/io.stat

# Example output:
# 8:0 rbytes=1073741824 wbytes=536870912 rios=10000 wios=5000 dbytes=0 dios=0
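
Each io.stat line is a device followed by key=value pairs, so a single counter can be pulled out with awk. The sketch below reads io.stat from stdin; `io_stat_field` is an illustrative name, not an existing utility.

```shell
# Print one counter for one device from io.stat.
# $1 = device as MAJ:MIN, $2 = key (rbytes, wbytes, rios, wios, ...)
io_stat_field() {
  awk -v dev="$1" -v key="$2" '
    $1 == dev {
      for (i = 2; i <= NF; i++) {
        split($i, kv, "=")
        if (kv[1] == key) print kv[2]
      }
    }'
}

# On a live host: io_stat_field 8:0 wbytes < /sys/fs/cgroup/<path>/io.stat
echo '8:0 rbytes=1073741824 wbytes=536870912 rios=10000 wios=5000 dbytes=0 dios=0' |
  io_stat_field 8:0 wbytes   # prints: 536870912
```

Sampling the same counter twice and dividing the difference by the interval gives the actual bandwidth to compare against io.max.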

I/O Latency Control (io.latency)

# Set target latency for device
echo "8:0 target=10000" > /sys/fs/cgroup/<path>/io.latency
# 10000 = 10ms target latency

# If this cgroup's average I/O latency exceeds its target, the kernel
# throttles peer cgroups that have a looser (higher) latency target

PID Controller

# Limit the maximum number of tasks (processes and threads)
echo 100 > /sys/fs/cgroup/<path>/pids.max

# Check current count
cat /sys/fs/cgroup/<path>/pids.current

# Check if limit was hit
cat /sys/fs/cgroup/<path>/pids.events
# max 5    ← Fork denied 5 times due to limit

Fork Bomb Protection: The pids controller prevents fork bombs from exhausting system resources. Set reasonable limits on all container cgroups.
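
To judge whether a limit is reasonable, compare pids.current with pids.max. A small sketch, assuming the two values have already been read from those files (`pids_headroom` is a hypothetical helper):

```shell
# Print how many more tasks can be created before pids.max is reached.
# $1 = value of pids.current, $2 = value of pids.max (a number or "max")
pids_headroom() (
  current="$1"
  limit="$2"
  if [ "$limit" = "max" ]; then
    echo "unlimited"
  else
    echo "$(( limit - current ))"
  fi
)

pids_headroom 80 100   # prints: 20
pids_headroom 80 max   # prints: unlimited
```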

cpuset Controller

# Pin to specific CPUs
echo "0,2,4,6" > /sys/fs/cgroup/<path>/cpuset.cpus

# Pin to specific memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/<path>/cpuset.mems

# Verify current settings
cat /sys/fs/cgroup/<path>/cpuset.cpus.effective
cat /sys/fs/cgroup/<path>/cpuset.mems.effective

NUMA and cpuset

┌─────────────────────────────────────────────────────────────────────┐
│                        NUMA TOPOLOGY                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Node 0                          Node 1                             │
│  ┌─────────────────────┐        ┌─────────────────────┐            │
│  │  CPU 0   CPU 1      │        │  CPU 4   CPU 5      │            │
│  │  CPU 2   CPU 3      │        │  CPU 6   CPU 7      │            │
│  │                     │        │                     │            │
│  │  ┌───────────────┐  │        │  ┌───────────────┐  │            │
│  │  │   Memory      │  │        │  │   Memory      │  │            │
│  │  │   64GB        │  │        │  │   64GB        │  │            │
│  │  └───────────────┘  │        │  └───────────────┘  │            │
│  └─────────────────────┘        └─────────────────────┘            │
│                                                                      │
│  For best performance, pin process to CPUs and memory               │
│  on the same NUMA node:                                              │
│                                                                      │
│  cpuset.cpus = "0-3"                                                │
│  cpuset.mems = "0"                                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
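
cpuset files use a list syntax that mixes single CPUs and ranges ("0,2,4,6" or "0-3"). When sizing thread pools for a pinned workload it is useful to know how many CPUs a list covers. The sketch below is a plain POSIX shell parser; `cpuset_count` is an illustrative name.

```shell
# Count the CPUs named by a cpuset list such as "0-3" or "0,2,4,6".
cpuset_count() (
  total=0
  IFS=','
  for part in $1; do
    case "$part" in
      # Range "lo-hi": add hi - lo + 1
      *-*) total=$(( total + ${part#*-} - ${part%-*} + 1 )) ;;
      # Single CPU number
      ?*)  total=$(( total + 1 )) ;;
    esac
  done
  echo "$total"
)

# On a live host: cpuset_count "$(cat /sys/fs/cgroup/<path>/cpuset.cpus.effective)"
cpuset_count "0-3"       # prints: 4
cpuset_count "0,2,4,6"   # prints: 4
```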

Practical cgroups Operations

Creating and Managing cgroups

# Create a new cgroup
mkdir /sys/fs/cgroup/mygroup

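# List the controllers available at the root
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# Enable controllers for the root's children (makes them available in mygroup)
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control

# Create child cgroup
mkdir /sys/fs/cgroup/mygroup/child

# Enable the same controllers one level down so that child gets
# cpu.max, memory.max, and pids.max (mygroup itself must hold no processes)
echo "+cpu +memory +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control

# Add the current shell to the leaf cgroup
echo $$ > /sys/fs/cgroup/mygroup/child/cgroup.procs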

# Set limits
echo "50000 100000" > /sys/fs/cgroup/mygroup/child/cpu.max
echo "100M" > /sys/fs/cgroup/mygroup/child/memory.max
echo 50 > /sys/fs/cgroup/mygroup/child/pids.max

Container Runtime cgroup Operations

# Find container's cgroup
# Docker (v2, systemd cgroup driver); --no-trunc returns the full ID used in the scope name
CONTAINER_ID=$(docker ps -q --no-trunc --filter name=mycontainer)
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}.scope/memory.current

# Using systemd (cgroups v2)
systemctl status docker-${CONTAINER_ID}.scope

# Using /proc
cat /proc/<pid>/cgroup
# Output: 0::/system.slice/docker-abc123.scope
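
On a v2 host the `0::` line maps directly to a directory under the cgroup mount. A minimal sketch, assuming cgroup2 is mounted at /sys/fs/cgroup (`cgroup_v2_dir` is a hypothetical helper name):

```shell
# Turn the v2 entry of /proc/<pid>/cgroup into a filesystem path.
# Reads the file content on stdin; v1 lines (non-zero hierarchy IDs) are ignored.
cgroup_v2_dir() {
  sed -n 's|^0::|/sys/fs/cgroup|p'
}

# On a live host: cgroup_v2_dir < /proc/self/cgroup
echo '0::/system.slice/docker-abc123.scope' | cgroup_v2_dir
# prints: /sys/fs/cgroup/system.slice/docker-abc123.scope
```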

Delegation and Nesting

Cgroup Delegation

┌─────────────────────────────────────────────────────────────────────┐
│                      CGROUP DELEGATION                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Root owns top levels:                                               │
│                                                                      │
│  /sys/fs/cgroup/                        ← root-owned                │
│  └── system.slice/                      ← root-owned                │
│      └── docker.service/                ← root-owned                │
│          └── container-abc/             ← delegated to container    │
│              ├── cgroup.procs           ← container can write       │
│              ├── memory.current         ← container can read        │
│              └── child/                 ← container can create      │
│                                                                      │
│  Delegation enables:                                                 │
│  1. Container can create nested cgroups                             │
│  2. Container can move its processes between cgroups                │
│  3. Container cannot exceed parent's limits                         │
│  4. Container cannot affect sibling cgroups                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Delegation Requirements

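# Check which controllers are available to delegate
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# For delegation to work:
# 1. Directory owned by delegate
# 2. cgroup.procs owned by delegate
# 3. cgroup.threads owned by delegate
# 4. cgroup.subtree_control owned by delegate
# 5. Parent's subtree_control enables needed controllers

# Set up delegation: hand over the directory and the cgroup.* management files only.
# Resource files such as memory.max stay root-owned so the delegate cannot raise its own limits.
mkdir /sys/fs/cgroup/delegated
chown user:user /sys/fs/cgroup/delegated \
    /sys/fs/cgroup/delegated/cgroup.procs \
    /sys/fs/cgroup/delegated/cgroup.threads \
    /sys/fs/cgroup/delegated/cgroup.subtree_control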

Debugging cgroups Issues

Common Issues and Solutions

Container OOM but host has memory

Symptom: Container killed by OOM, but host shows available memory.

Diagnosis:

# Check cgroup memory limit
cat /sys/fs/cgroup/<path>/memory.max
cat /sys/fs/cgroup/<path>/memory.current

# Check for anonymous vs cache memory
grep -E "^(anon|file)" /sys/fs/cgroup/<path>/memory.stat

# Check OOM events
cat /sys/fs/cgroup/<path>/memory.events
# oom 5  ← OOM occurred 5 times

Solutions:

Increase memory limit
Reduce application memory usage
Add swap (for burst tolerance)

CPU throttling at low usage

Symptom: Container shows 30% CPU usage but requests are slow.

Diagnosis:

# Check throttling stats
cat /sys/fs/cgroup/<path>/cpu.stat
# nr_throttled 15000  ← Throttled 15000 periods!

# Calculate throttle percentage
# throttle% = nr_throttled / nr_periods * 100

Cause: Burst usage exceeds the quota within a single period even though the average over many periods is low.

Solutions:

Increase CPU limit
Use cpu.max.burst (cgroups v2, Linux 5.14+) for burst allowance
Spread work more evenly
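
A worked number makes the cause concrete. Quota is counted in CPU time across all threads, so a multi-threaded burst drains it faster than wall-clock time passes. The sketch below is a simplification that assumes every thread runs continuously on its own CPU; `ms_until_throttled` is an illustrative name.

```shell
# Wall-clock milliseconds until the quota is exhausted within one period.
# $1 = quota in ms per period, $2 = number of threads running in parallel
ms_until_throttled() {
  awk -v quota="$1" -v threads="$2" 'BEGIN { printf "%.2f\n", quota / threads }'
}

ms_until_throttled 50 8   # prints: 6.25
```

With a 50ms quota per 100ms period, 8 busy threads use up the quota after roughly 6.25ms and the cgroup then waits for the rest of the period, which is why request latency suffers even when averaged CPU usage looks modest.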

Can't write to cgroup files

Symptom: echo: write error: Device or resource busy

Diagnosis:

# Check if cgroup has processes
cat /sys/fs/cgroup/<path>/cgroup.procs

# Check if cgroup has children
ls /sys/fs/cgroup/<path>/

Cause: The cgroups v2 “no internal processes” rule: a non-root cgroup cannot both contain processes and enable controllers in cgroup.subtree_control, so processes must live in leaf cgroups.

Solution: Move processes to leaf cgroups before enabling controllers in the parent's cgroup.subtree_control.

IO limits not working

Symptom: Set io.max but process still uses full disk bandwidth.

Diagnosis:

# Check if IO controller is enabled
grep -w io /sys/fs/cgroup/<path>/cgroup.controllers

# Check io.max format (needs MAJ:MIN)
lsblk
cat /sys/fs/cgroup/<path>/io.max

Common causes:

Wrong device major:minor
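IO through page cache (buffered writes): writeback is only charged to the cgroup when both the io and memory controllers are enabled and the filesystem supports cgroup writeback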
Controller not enabled

Solution:

# Use O_DIRECT or sync writes
# Or use io.latency for latency-based control

Interview Questions

Q: What happens when a container exceeds memory.high?

Answer: When a cgroup exceeds memory.high:

Reclaim pressure increases: Kernel aggressively reclaims memory from this cgroup
Throttling may occur: Memory allocation requests may be delayed
No OOM: The process is NOT killed (unlike exceeding memory.max)
Performance impact: Application may slow down due to reclaim

This is useful for:

Soft limits with burst allowance
Preventing one container from consuming all cache
Graceful degradation instead of hard OOM

Q: Explain the difference between cgroups v1 and v2

Key differences:

Aspect              v1                          v2
Hierarchy           Multiple (per controller)   Single unified
Internal processes  Allowed                     Not allowed
Thread control      Limited                     Full support
Pressure metrics    No                          Yes (PSI)
Delegation          Complex                     Simple

v2 advantages:

Simpler mental model
Consistent behavior
Better pressure metrics (PSI)
Proper thread-level controls
Cleaner delegation model

Q: How would you debug high latency in a container?

Systematic approach:

1. Check CPU throttling:

cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled

2. Check memory pressure:

cat /sys/fs/cgroup/<path>/memory.pressure
cat /sys/fs/cgroup/<path>/memory.events | grep oom

3. Check I/O latency:

cat /sys/fs/cgroup/<path>/io.stat
cat /sys/fs/cgroup/<path>/io.pressure

4. Check for noisy neighbors:
- Look at sibling cgroups’ usage
- Check parent cgroup limits

5. Use tracing:

# Off-CPU analysis
offcputime-bpfcc -p <pid>

PSI (Pressure Stall Information)

New in cgroups v2 - provides pressure metrics:

# Memory pressure
cat /sys/fs/cgroup/<path>/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# CPU pressure  
cat /sys/fs/cgroup/<path>/cpu.pressure
# some avg10=25.00 avg60=20.00 avg300=15.00 total=1234567

# IO pressure
cat /sys/fs/cgroup/<path>/io.pressure
# some avg10=5.00 avg60=3.00 avg300=2.00 total=789012
# full avg10=2.00 avg60=1.00 avg300=0.50 total=456789

Interpretation:

some: At least one task was stalled on the resource
full: All non-idle tasks were stalled at the same time
avg10/60/300: 10s/60s/300s moving averages (%)
total: Total stall time in microseconds
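
For alerting, the `some avg10` value is usually the first number to watch. The following sketch compares it against a caller-supplied threshold; both the threshold and the helper name `psi_some_avg10_exceeds` are assumptions for illustration, not recommended values.

```shell
# Print "high" if the 10-second "some" average exceeds a threshold, else "ok".
# $1 = threshold in percent; a *.pressure file is read from stdin.
psi_some_avg10_exceeds() {
  awk -v limit="$1" '
    $1 == "some" {
      sub(/^avg10=/, "", $2)          # "avg10=25.00" -> "25.00"
      if ($2 + 0 > limit + 0) print "high"
      else                    print "ok"
    }'
}

# On a live host: psi_some_avg10_exceeds 10 < /sys/fs/cgroup/<path>/cpu.pressure
echo 'some avg10=25.00 avg60=20.00 avg300=15.00 total=1234567' |
  psi_some_avg10_exceeds 10   # prints: high
```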
