# Control Groups (cgroups)

> Master cgroups v1 and v2 for resource limiting, accounting, and container isolation

<Frame>
  <img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-internals/cgroups-concept.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=39b16879b0bdb8c0859c336f67e6461b" alt="Control Groups - Resource limits, accounting, and prioritization" width="1080" height="1080" data-path="images/courses/linux-internals/cgroups-concept.svg" />
</Frame>

Control groups are the foundation of container resource management. Understanding cgroups deeply is essential for debugging container issues and designing resource-aware systems.

<Info>
  **Prerequisites**: Process fundamentals, basic container concepts\
  **Interview Focus**: cgroups v2, memory limits, CPU throttling, debugging\
  **Companies**: All container/cloud companies heavily test this
</Info>

***

## cgroups Overview

<img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-cgroups-hierarchy.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=bc891888433656077341845f0f01e7dd" alt="Control Groups Hierarchy" width="1080" height="1080" data-path="images/courses/linux-cgroups-hierarchy.svg" />

***

## cgroups v1 vs v2

<Tabs>
  <Tab title="v1 (Legacy)">
    **Characteristics**:

    * Multiple hierarchies (one per controller)
    * Controllers can be mounted independently
    * More flexible but complex
    * Still used in many production systems

    **File system layout**:

    ```
    /sys/fs/cgroup/
    ├── cpu/
    │   └── docker/
    │       └── container-abc/
    │           ├── cpu.cfs_quota_us
    │           └── cpu.cfs_period_us
    ├── memory/
    │   └── docker/
    │       └── container-abc/
    │           ├── memory.limit_in_bytes
    │           └── memory.usage_in_bytes
    └── pids/
        └── docker/
            └── container-abc/
                └── pids.max
    ```

    **Key issues**:

    * Inconsistent APIs across controllers
    * Race conditions between hierarchies
    * No unified resource management
  </Tab>

  <Tab title="v2 (Unified)">
    **Characteristics**:

    * Single unified hierarchy
    * All controllers in one tree
    * Simpler, more consistent API
    * Default in modern systems (RHEL 9+, Ubuntu 22.04+)

    **File system layout**:

    ```
    /sys/fs/cgroup/
    ├── cgroup.controllers         # Available controllers
    ├── cgroup.subtree_control     # Enabled for children
    ├── system.slice/
    │   └── sshd.service/
    │       ├── cgroup.procs
    │       ├── cpu.stat
    │       └── memory.current
    └── user.slice/
        └── user-1000.slice/
    ```

    **Key improvements**:

    * Consistent no-internal-process rule
    * Unified resource control
    * Better pressure metrics
    * Thread-level controls
  </Tab>
</Tabs>

***

## CPU Controller Deep Dive

<img src="https://mintcdn.com/devweeekends/1GcDwVN8SzYRbJg1/images/courses/linux-cgroup-resource-limits.svg?fit=max&auto=format&n=1GcDwVN8SzYRbJg1&q=85&s=e2d5fddcb88c220520fe0d080b2503c0" alt="Cgroup Resource Limits" width="1080" height="1080" data-path="images/courses/linux-cgroup-resource-limits.svg" />

### CPU Bandwidth Limiting

```
┌─────────────────────────────────────────────────────────────────────┐
│                    CPU BANDWIDTH LIMITING                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Configuration (cgroups v2):                                         │
│  ┌─────────────────────────────────────────────────────────────────┐│
│  │  cpu.max = "$QUOTA $PERIOD"                                      ││
│  │                                                                  ││
│  │  Examples:                                                       ││
│  │  "50000 100000"  = 50ms per 100ms period = 50% of 1 CPU         ││
│  │  "100000 100000" = 100ms per 100ms = 100% of 1 CPU             ││
│  │  "200000 100000" = 200ms per 100ms = 200% = 2 CPUs             ││
│  │  "max 100000"    = Unlimited                                    ││
│  │                                                                  ││
│  └─────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  Kubernetes translation:                                             │
│  resources:                                                          │
│    limits:                                                           │
│      cpu: "500m"    →  cpu.max = "50000 100000"                     │
│      cpu: "2"       →  cpu.max = "200000 100000"                    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
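
The Kubernetes translation shown above is simple arithmetic. A minimal sketch, assuming the default 100ms period; the function name `cpu_max_from_millicores` is hypothetical and not part of any tool:

```bash
# Convert a Kubernetes CPU limit in millicores to a cgroups v2 cpu.max value.
# Hypothetical helper; the period defaults to 100000us (100ms).
cpu_max_from_millicores() {
    millicores=$1
    period_us=${2:-100000}
    # quota_us = (millicores / 1000) * period_us, kept in integer arithmetic
    quota_us=$(( millicores * period_us / 1000 ))
    echo "$quota_us $period_us"
}

cpu_max_from_millicores 500     # prints: 50000 100000
cpu_max_from_millicores 2000    # prints: 200000 100000
```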

### CPU Throttling

<Warning>
  **Common Interview Topic**: Why is my container slow even when CPU usage looks low? Answer: CPU throttling!
</Warning>

```bash theme={null}
# Check throttling statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/cpu.stat

# Example output:
usage_usec 1234567890        # Total CPU time used
user_usec 1000000000         # User-space time
system_usec 234567890        # Kernel-space time
nr_periods 50000             # Number of enforcement periods
nr_throttled 1000            # Periods where throttling occurred
throttled_usec 5000000       # Total time spent throttled

# Throttling percentage:
#   throttled_percentage = nr_throttled / nr_periods * 100   (here: 1000 / 50000 = 2%)
# If > 5%, consider increasing CPU limit
```
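
The throttling percentage can be computed directly from `cpu.stat`. A small sketch, assuming the file content arrives on stdin; `throttle_pct` is a hypothetical helper name:

```bash
# Hypothetical helper: compute the percentage of throttled periods
# from cpu.stat content supplied on stdin.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", throttled / periods * 100
            else             print "0.0"
        }'
}

# Sample values taken from the example output above.
printf 'nr_periods 50000\nnr_throttled 1000\n' | throttle_pct   # prints: 2.0
```

On a live system, redirect the cgroup's `cpu.stat` file into the function instead of the sample input. Because the counters are cumulative, comparing two readings taken some seconds apart gives a better picture of current throttling than a single reading.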

### CPU Throttling Visualization

```
┌────────────────────────────────────────────────────────────────────────┐
│                    CPU THROTTLING BEHAVIOR                              │
├────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Period: 100ms, Quota: 50ms (50% of 1 CPU)                             │
│                                                                         │
│  Time ─────────────────────────────────────────────────────────────►   │
│                                                                         │
│  Period 1      │ Period 2      │ Period 3      │ Period 4              │
│  0ms    100ms  │ 100ms  200ms  │ 200ms  300ms  │ 300ms  400ms          │
│  │      │      │       │       │       │       │       │               │
│  ├──────┤      ├───────┤       ├───────┤       ├───────┤               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│      │███████│       │████│  │       │███████│               │
│  │██████│░░░░░░│███████│░░░░░░░│████│░░│░░░░░░░│███████│░░░░░░         │
│  └──────┘      └───────┘       └────┘  │       └───────┘               │
│                                        │                               │
│  ██ = Running  ░░ = Throttled         │                               │
│                                        │                               │
│  Period 3: Only 40ms work, no throttle ◄───────                        │
│  Other periods: Hit 50ms quota, throttled                              │
│                                                                         │
└────────────────────────────────────────────────────────────────────────┘
```

***

## Memory Controller Deep Dive

### Memory Limits and Protection

```
┌─────────────────────────────────────────────────────────────────────┐
│                     MEMORY CGROUP CONTROLS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│                          memory.max                                  │
│                              │                                       │
│                              │ ← Hard limit (OOM if exceeded)       │
│  Usage                       │                                       │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.high                                  │
│    │                         │                                       │
│    │                         │ ← Throttling begins (reclaim)        │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │  ┌──────────────────────┴──────────────────────────┐           │
│    │  │                 Normal Operation                 │           │
│    │  │          Application allocates freely            │           │
│    │  └─────────────────────────────────────────────────┘           │
│    │                         │                                       │
│    │                    memory.low                                   │
│    │                         │ ← Best-effort protection             │
│    │                     ────┼────────────────────────────          │
│    │                         │                                       │
│    │                    memory.min                                   │
│    │                         │ ← Guaranteed protection               │
│    └─────────────────────────┴────────────────────────────► Time    │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### Memory Accounting Details

```bash theme={null}
# Read memory statistics (cgroups v2)
cat /sys/fs/cgroup/<path>/memory.stat

# Key fields:
anon 1073741824                  # Anonymous memory (heap, stack)
file 536870912                   # Page cache (file-backed pages)
kernel 67108864                  # Kernel memory (slab, etc.)
kernel_stack 1048576             # Kernel stacks
pagetables 8388608               # Page table memory
sock 134217728                   # Socket buffers
shmem 0                          # Shared memory
file_mapped 268435456            # Memory-mapped files
file_dirty 4096                  # Dirty pages
file_writeback 0                 # Pages being written
swapcached 0                     # Swap cache
anon_thp 0                       # Anonymous huge pages
file_thp 0                       # File-backed huge pages
slab_reclaimable 16777216        # Reclaimable slab
slab_unreclaimable 8388608       # Non-reclaimable slab
pgfault 15000000                 # Total page faults
pgmajfault 1000                  # Major page faults (disk I/O)
```
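
When reading `memory.stat`, the first question is usually how usage splits between anonymous memory, page cache, and kernel memory. A rough sketch that prints those three fields in MiB from content on stdin; `mem_breakdown` is a hypothetical name, and the `kernel` field is assumed to be present (older kernels may not report it):

```bash
# Hypothetical helper: print the anon, file and kernel fields of
# memory.stat (read from stdin) in MiB.
mem_breakdown() {
    awk '
        $1 == "anon" || $1 == "file" || $1 == "kernel" {
            printf "%s %d MiB\n", $1, $2 / 1048576
        }'
}

# Sample values taken from the listing above.
printf 'anon 1073741824\nfile 536870912\nkernel 67108864\nsock 134217728\n' | mem_breakdown
# prints:
# anon 1024 MiB
# file 512 MiB
# kernel 64 MiB
```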

### Memory Limit vs OOM

```
┌─────────────────────────────────────────────────────────────────────┐
│                    MEMORY LIMIT EXCEEDED                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Process requests allocation                                         │
│         │                                                            │
│         ▼                                                            │
│  Is memory.current < memory.max?                                     │
│         │                                                            │
│    ┌────┴────┐                                                       │
│    │YES     NO│                                                       │
│    ▼          ▼                                                       │
│  Allocate   Try to reclaim                                           │
│  memory     from this cgroup                                         │
│              │                                                        │
│              ▼                                                        │
│         Reclaim successful?                                          │
│              │                                                        │
│         ┌────┴────┐                                                  │
│         │YES     NO│                                                  │
│         ▼          ▼                                                  │
│       Allocate   Invoke OOM killer                                   │
│       memory     for this cgroup                                     │
│                   │                                                   │
│                   ▼                                                   │
│            Kill process with                                         │
│            highest oom_score                                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
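
The first decision in the flow above compares `memory.current` with `memory.max`. A minimal sketch for checking how close a cgroup is to its limit, given the two values as arguments; `mem_utilization_pct` is a hypothetical helper:

```bash
# Hypothetical helper: utilization of memory.current against memory.max.
# memory.max holds the literal string "max" when no limit is set.
mem_utilization_pct() {
    current=$1
    limit=$2
    if [ "$limit" = "max" ]; then
        echo "unlimited"
    else
        # Integer percentage, rounded down
        echo $(( current * 100 / limit ))
    fi
}

mem_utilization_pct 83886080 104857600   # prints: 80
mem_utilization_pct 83886080 max         # prints: unlimited
```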

***

## I/O Controller

### I/O Bandwidth Limiting

```bash theme={null}
# cgroups v2 I/O controls
cat /sys/fs/cgroup/<path>/io.max

# Format: "MAJ:MIN rbps=LIMIT wbps=LIMIT riops=LIMIT wiops=LIMIT"
# Example: Limit to 10 MiB/s read, 5 MiB/s write on device 8:0
echo "8:0 rbps=10485760 wbps=5242880" > /sys/fs/cgroup/<path>/io.max

# Check I/O statistics
cat /sys/fs/cgroup/<path>/io.stat

# Example output:
# 8:0 rbytes=1073741824 wbytes=536870912 rios=10000 wios=5000 dbytes=0 dios=0
```
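
Since `io.max` takes byte values, it can help to generate the line from human-sized numbers. A sketch under the assumption that limits are given in MiB/s; `io_max_line` is a hypothetical helper and only builds the string, it does not write to any cgroup file:

```bash
# Hypothetical helper: build an io.max line from MiB/s limits.
io_max_line() {
    dev=$1          # MAJ:MIN of the block device, e.g. 8:0 (see lsblk)
    read_mib=$2     # read limit in MiB/s
    write_mib=$3    # write limit in MiB/s
    echo "$dev rbps=$(( read_mib * 1048576 )) wbps=$(( write_mib * 1048576 ))"
}

io_max_line 8:0 10 5    # prints: 8:0 rbps=10485760 wbps=5242880
```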

### I/O Latency Control (io.latency)

```bash theme={null}
# Set target latency for device
echo "8:0 target=10000" > /sys/fs/cgroup/<path>/io.latency
# 10000 = 10ms target latency

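# The target protects this cgroup: if its average I/O latency exceeds
# the target, the kernel throttles peer cgroups with a less strict
# target (or none). Likewise, this cgroup is throttled when its I/O
# causes a stricter peer to miss its target.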
```

***

## PID Controller

```bash theme={null}
# Limit maximum number of processes
echo 100 > /sys/fs/cgroup/<path>/pids.max

# Check current count
cat /sys/fs/cgroup/<path>/pids.current

# Check if limit was hit
cat /sys/fs/cgroup/<path>/pids.events
# max 5    ← Fork denied 5 times due to limit
```

<Note>
  **Fork Bomb Protection**: The pids controller prevents fork bombs from exhausting system resources. Set reasonable limits on all container cgroups.
</Note>

***

## cpuset Controller

```bash theme={null}
# Pin to specific CPUs
echo "0,2,4,6" > /sys/fs/cgroup/<path>/cpuset.cpus

# Pin to specific memory nodes (NUMA)
echo "0" > /sys/fs/cgroup/<path>/cpuset.mems

# Verify current settings
cat /sys/fs/cgroup/<path>/cpuset.cpus.effective
cat /sys/fs/cgroup/<path>/cpuset.mems.effective
```

### NUMA and cpuset

```
┌─────────────────────────────────────────────────────────────────────┐
│                        NUMA TOPOLOGY                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Node 0                          Node 1                             │
│  ┌─────────────────────┐        ┌─────────────────────┐            │
│  │  CPU 0   CPU 1      │        │  CPU 4   CPU 5      │            │
│  │  CPU 2   CPU 3      │        │  CPU 6   CPU 7      │            │
│  │                     │        │                     │            │
│  │  ┌───────────────┐  │        │  ┌───────────────┐  │            │
│  │  │   Memory      │  │        │  │   Memory      │  │            │
│  │  │   64GB        │  │        │  │   64GB        │  │            │
│  │  └───────────────┘  │        │  └───────────────┘  │            │
│  └─────────────────────┘        └─────────────────────┘            │
│                                                                      │
│  For best performance, pin process to CPUs and memory               │
│  on the same NUMA node:                                              │
│                                                                      │
│  cpuset.cpus = "0-3"                                                │
│  cpuset.mems = "0"                                                  │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```
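
`cpuset.cpus` uses a list format that mixes ranges and single CPUs. A short sketch that counts how many CPUs such a list covers, which is handy when sizing thread pools to a pinned cgroup; `cpuset_count` is a hypothetical helper:

```bash
# Hypothetical helper: count the CPUs in a cpuset list such as "0-3,8,10-11".
cpuset_count() {
    echo "$1" | awk -F, '{
        n = 0
        for (i = 1; i <= NF; i++) {
            # A field is either a range "a-b" or a single CPU number
            if (split($i, r, "-") == 2) n += r[2] - r[1] + 1
            else                        n += 1
        }
        print n
    }'
}

cpuset_count "0-3"          # prints: 4
cpuset_count "0,2,4,6"      # prints: 4
cpuset_count "0-3,8,10-11"  # prints: 7
```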

***

## Practical cgroups Operations

### Creating and Managing cgroups

```bash theme={null}
# Create a new cgroup
mkdir /sys/fs/cgroup/mygroup

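# List the controllers available at the root
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

# Enable controllers for the root's children (mygroup)
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control

# Enable them for mygroup's children too, so that child gets
# cpu.max, memory.max and pids.max
echo "+cpu +memory +pids" > /sys/fs/cgroup/mygroup/cgroup.subtree_control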

# Create child cgroup
mkdir /sys/fs/cgroup/mygroup/child

# Add process to cgroup
echo $$ > /sys/fs/cgroup/mygroup/child/cgroup.procs

# Set limits
echo "50000 100000" > /sys/fs/cgroup/mygroup/child/cpu.max
echo "100M" > /sys/fs/cgroup/mygroup/child/memory.max
echo 50 > /sys/fs/cgroup/mygroup/child/pids.max
```

### Container Runtime cgroup Operations

```bash theme={null}
# Find container's cgroup
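# Docker (v2, systemd cgroup driver)
# --no-trunc returns the full ID, which is what the scope name uses
CONTAINER_ID=$(docker ps -q --no-trunc --filter name=mycontainer)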
cat /sys/fs/cgroup/system.slice/docker-${CONTAINER_ID}*/memory.current

# Using systemd (cgroups v2)
systemctl status docker-${CONTAINER_ID}.scope

# Using /proc
cat /proc/<pid>/cgroup
# Output: 0::/system.slice/docker-abc123.scope
```
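
The `/proc/<pid>/cgroup` entry maps directly to a directory under the cgroup mount. A minimal sketch, assuming the unified hierarchy is mounted at `/sys/fs/cgroup`; `cgroup_dir` is a hypothetical helper that reads the file content from stdin:

```bash
# Hypothetical helper: turn the v2 entry ("0::<path>") of /proc/<pid>/cgroup
# content into its directory under /sys/fs/cgroup (assumed mount point).
cgroup_dir() {
    # Only the v2 entry starts with "0::"; v1 entries are ignored.
    sed -n 's|^0::|/sys/fs/cgroup|p'
}

echo "0::/system.slice/docker-abc123.scope" | cgroup_dir
# prints: /sys/fs/cgroup/system.slice/docker-abc123.scope
```

On a live system, feed it `/proc/self/cgroup` or the file of any other PID, then read `memory.current`, `cpu.stat`, and the other interface files from the resulting directory.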

***

## Delegation and Nesting

### Cgroup Delegation

```
┌─────────────────────────────────────────────────────────────────────┐
│                      CGROUP DELEGATION                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Root owns top levels:                                               │
│                                                                      │
│  /sys/fs/cgroup/                        ← root-owned                │
│  └── system.slice/                      ← root-owned                │
│      └── docker.service/                ← root-owned                │
│          └── container-abc/             ← delegated to container    │
│              ├── cgroup.procs           ← container can write       │
│              ├── memory.current         ← container can read        │
│              └── child/                 ← container can create      │
│                                                                      │
│  Delegation enables:                                                 │
│  1. Container can create nested cgroups                             │
│  2. Container can move its processes between cgroups                │
│  3. Container cannot exceed parent's limits                         │
│  4. Container cannot affect sibling cgroups                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
```

### Delegation Requirements

```bash theme={null}
# Check delegation
cat /sys/fs/cgroup/cgroup.controllers
# cpu io memory pids

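# For delegation to work:
# 1. Directory owned by delegate
# 2. cgroup.procs owned by delegate
# 3. cgroup.threads owned by delegate
# 4. cgroup.subtree_control owned by delegate
# 5. Parent's subtree_control enables needed controllers

# Set up delegation: chown only the directory and these files.
# Resource control files (memory.max, cpu.max, ...) stay with the parent's owner.
mkdir /sys/fs/cgroup/delegated
chown user:user /sys/fs/cgroup/delegated \
    /sys/fs/cgroup/delegated/cgroup.procs \
    /sys/fs/cgroup/delegated/cgroup.threads \
    /sys/fs/cgroup/delegated/cgroup.subtree_control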
```

***

## Debugging cgroups Issues

### Common Issues and Solutions

<AccordionGroup>
  <Accordion title="Container OOM but host has memory" icon="memory">
    **Symptom**: Container killed by OOM, but host shows available memory.

    **Diagnosis**:

    ```bash theme={null}
    # Check cgroup memory limit
    cat /sys/fs/cgroup/<path>/memory.max
    cat /sys/fs/cgroup/<path>/memory.current

    # Check for anonymous vs cache memory
    grep -E "^(anon|file)" /sys/fs/cgroup/<path>/memory.stat

    # Check OOM events
    cat /sys/fs/cgroup/<path>/memory.events
    # oom 5  ← OOM occurred 5 times
    ```

    **Solutions**:

    * Increase memory limit
    * Reduce application memory usage
    * Add swap (for burst tolerance)
  </Accordion>

  <Accordion title="CPU throttling at low usage" icon="microchip">
    **Symptom**: Container shows 30% CPU usage but requests are slow.

    **Diagnosis**:

    ```bash theme={null}
    # Check throttling stats
    cat /sys/fs/cgroup/<path>/cpu.stat
    # nr_throttled 15000  ← Throttled 15000 periods!

    # Calculate throttle percentage
    # throttle% = nr_throttled / nr_periods * 100
    ```

    **Cause**: Burst usage exceeds quota within period even though average is low.

    **Solutions**:

    * Increase CPU limit
    * Use `cpu.max.burst` (cgroups v2) for burst allowance
    * Spread work more evenly
  </Accordion>

  <Accordion title="Can't write to cgroup files" icon="lock">
    **Symptom**: `echo: write error: Device or resource busy`

    **Diagnosis**:

    ```bash theme={null}
    # Check if cgroup has processes
    cat /sys/fs/cgroup/<path>/cgroup.procs

    # Check if cgroup has children
    ls /sys/fs/cgroup/<path>/
    ```

    **Cause**: Cgroups v2 "no internal processes" rule - if a cgroup has controllers enabled in subtree\_control, processes must be in leaf cgroups.

    **Solution**: Move processes to leaf cgroups before modifying parent.
  </Accordion>

  <Accordion title="IO limits not working" icon="hard-drive">
    **Symptom**: Set io.max but process still uses full disk bandwidth.

    **Diagnosis**:

    ```bash theme={null}
    # Check if IO controller is enabled
    cat /sys/fs/cgroup/<path>/cgroup.controllers | grep io

    # Check io.max format (needs MAJ:MIN)
    lsblk
    cat /sys/fs/cgroup/<path>/io.max
    ```

    **Common causes**:

    * Wrong device major:minor
    * IO through page cache (buffered writes)
    * Controller not enabled

    **Solution**:

    ```bash theme={null}
    # Use O_DIRECT or sync writes
    # Or use io.latency for latency-based control
    ```
  </Accordion>
</AccordionGroup>

***

## Interview Questions

<AccordionGroup>
  <Accordion title="Q: What happens when a container exceeds memory.high?" icon="question">
    **Answer**:

    When a cgroup exceeds `memory.high`:

    1. **Reclaim pressure increases**: Kernel aggressively reclaims memory from this cgroup
    2. **Throttling may occur**: Memory allocation requests may be delayed
    3. **No OOM**: The process is NOT killed (unlike exceeding `memory.max`)
    4. **Performance impact**: Application may slow down due to reclaim

    This is useful for:

    * Soft limits with burst allowance
    * Preventing one container from consuming all cache
    * Graceful degradation instead of hard OOM
  </Accordion>

  <Accordion title="Q: Explain the difference between cgroups v1 and v2" icon="question">
    **Key differences**:

    | Aspect             | v1                        | v2             |
    | ------------------ | ------------------------- | -------------- |
    | Hierarchy          | Multiple (per controller) | Single unified |
    | Internal processes | Allowed                   | Not allowed    |
    | Thread control     | Per-thread, inconsistent  | Threaded mode  |
    | Pressure metrics   | No                        | Yes (PSI)      |
    | Delegation         | Complex                   | Simple         |

    **v2 advantages**:

    * Simpler mental model
    * Consistent behavior
    * Better pressure metrics (PSI)
    * Proper thread-level controls
    * Cleaner delegation model
  </Accordion>

  <Accordion title="Q: How would you debug high latency in a container?" icon="question">
    **Systematic approach**:

    1. **Check CPU throttling**:
       ```bash theme={null}
       cat /sys/fs/cgroup/<path>/cpu.stat | grep throttled
       ```

    2. **Check memory pressure**:
       ```bash theme={null}
       cat /sys/fs/cgroup/<path>/memory.pressure
       cat /sys/fs/cgroup/<path>/memory.events | grep oom
       ```

    3. **Check I/O latency**:
       ```bash theme={null}
       cat /sys/fs/cgroup/<path>/io.stat
       cat /sys/fs/cgroup/<path>/io.pressure
       ```

    4. **Check for noisy neighbors**:
       * Look at sibling cgroups' usage
       * Check parent cgroup limits

    5. **Use tracing**:
       ```bash theme={null}
       # Off-CPU analysis
       offcputime-bpfcc -p <pid>
       ```
  </Accordion>
</AccordionGroup>

***

## PSI (Pressure Stall Information)

Per-cgroup pressure metrics are available in cgroups v2 (system-wide equivalents live under `/proc/pressure/`):

```bash theme={null}
# Memory pressure
cat /sys/fs/cgroup/<path>/memory.pressure
# some avg10=0.00 avg60=0.00 avg300=0.00 total=0
# full avg10=0.00 avg60=0.00 avg300=0.00 total=0

# CPU pressure  
cat /sys/fs/cgroup/<path>/cpu.pressure
# some avg10=25.00 avg60=20.00 avg300=15.00 total=1234567

# IO pressure
cat /sys/fs/cgroup/<path>/io.pressure
# some avg10=5.00 avg60=3.00 avg300=2.00 total=789012
# full avg10=2.00 avg60=1.00 avg300=0.50 total=456789
```

**Interpretation**:

* `some`: At least one task stalled
* `full`: All tasks stalled
* `avg10/60/300`: 10s/60s/300s moving averages (%)
* `total`: Total stall time in microseconds
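
For alerting, a single field usually has to be pulled out of a PSI file. A small sketch that does this for content supplied on stdin; `psi_value` is a hypothetical helper, and the sample input reuses the `io.pressure` example values:

```bash
# Hypothetical helper: extract one field from a PSI file read on stdin.
# Usage: psi_value <some|full> <avg10|avg60|avg300|total>
psi_value() {
    awk -v kind="$1" -v key="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }'
}

sample='some avg10=5.00 avg60=3.00 avg300=2.00 total=789012
full avg10=2.00 avg60=1.00 avg300=0.50 total=456789'

echo "$sample" | psi_value some avg10    # prints: 5.00
echo "$sample" | psi_value full total    # prints: 456789
```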

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="A Kubernetes pod is showing 30% average CPU usage but requests are timing out. Users report the service is 'slow.' How do you diagnose this using cgroup internals, and what is the most likely cause?" icon="message">
    **Strong Answer:**

    * The most likely cause is CFS bandwidth throttling. Average CPU usage can be misleading because CFS enforces CPU limits per-period (typically 100ms). A service might use only 30% average CPU, but if requests arrive in bursts, the container could exhaust its entire quota in the first 30ms of a period and be throttled for the remaining 70ms. During throttling, all threads in the cgroup are descheduled regardless of available CPU on the host.
    * To diagnose, I would read `cat /sys/fs/cgroup/<path>/cpu.stat` and compute the throttle percentage: `nr_throttled / nr_periods * 100`. If this exceeds 5%, throttling is significant. I would also check `throttled_usec` to see total time lost to throttling.
    * The kernel mechanism is the CFS bandwidth controller in `kernel/sched/fair.c`. Each cgroup has a `cfs_bandwidth` structure tracking quota and period. When a task in the cgroup runs, its runtime is charged against the quota. When quota reaches zero, the scheduler removes all tasks in the cgroup from their runqueues until the next period boundary replenishes the quota.
    * Solutions in order of preference: increase the CPU limit in the Kubernetes resource spec, enable `cpu.max.burst` in cgroups v2 (lets the cgroup temporarily exceed its quota by spending unused runtime accumulated in earlier periods), spread work more evenly across time using request queuing, or switch to CPU shares (`cpu.weight`) instead of hard limits if the node is not oversubscribed.

    **Follow-up:** Why does Kubernetes use CPU limits (quota) at all instead of just CPU requests (shares)?

    **Follow-up Answer:**

    * CPU requests map to `cpu.weight` (shares) which provide proportional fairness: if a container requests 1 CPU and another requests 2 CPUs, the second gets twice the CPU time when both are contending. But shares only work when there is contention. Without limits, a runaway container could consume all available CPU during low-load periods, then cause latency for other containers when load increases. Limits provide a hard ceiling via `cpu.max` that caps CPU usage regardless of host utilization. The trade-off is that limits cause throttling even when CPU is idle, which is wasteful. Many teams now advocate for setting requests but not limits for CPU (Google's best practice), accepting the risk of noisy neighbors in exchange for eliminating throttling-induced latency.
  </Accordion>

  <Accordion title="Explain the cgroups v2 'no internal processes' rule and why it was introduced. How does this affect container runtime design?" icon="message">
    **Strong Answer:**

    * In cgroups v2, a cgroup that has controllers enabled in its `cgroup.subtree_control` cannot have processes directly in `cgroup.procs` -- processes must be in leaf cgroups only. This means if `/sys/fs/cgroup/mygroup/` enables CPU and memory controllers for its children, all processes must be in child cgroups like `/sys/fs/cgroup/mygroup/child1/`, not in `/sys/fs/cgroup/mygroup/` itself.
    * This rule was introduced to solve a fundamental ambiguity in cgroups v1: if a parent cgroup has processes and also has child cgroups, how should the parent's resource share be divided between its direct processes and its children? In v1, this was handled inconsistently across controllers and led to confusing behavior where resource distribution depended on the tree structure in unintuitive ways.
    * For container runtimes, this means the cgroup hierarchy must be designed carefully. A container runtime cannot simply create `/sys/fs/cgroup/containers/` with controllers enabled and dump container processes there. Instead, it must create `/sys/fs/cgroup/containers/container-abc/` for each container and place processes in the leaf. Systemd handles this naturally with its slice/scope/service hierarchy: `system.slice -> docker-abc123.scope` where the scope is the leaf containing the container's processes.
    * The practical impact is that management processes (such as the container runtime's shim) must be in a separate cgroup from the container's processes. Typically the shim stays in the runtime's own service cgroup, while each container gets its own leaf cgroup.

    **Follow-up:** How does thread-level cgroup control work in v2, and when would you use it?

    **Follow-up Answer:**

    * Cgroups v2 introduced `cgroup.type = "threaded"` which allows individual threads of a process to be in different cgroups within a threaded subtree. This is useful when you want to give different threads different CPU priorities within the same process. For example, a database might put its query processing threads in one cgroup with higher CPU weight and its background compaction threads in another cgroup with lower weight. The root of the threaded subtree is a "domain threaded" cgroup, and its children are "threaded" cgroups. Not all controllers support threaded mode -- only `cpu`, `cpuset`, `perf_event`, and `pids` can be enabled in a threaded cgroup; controllers such as `memory` and `io` operate on whole processes.
  </Accordion>

  <Accordion title="Design a monitoring system that detects containers approaching resource exhaustion before they get OOM-killed or CPU-throttled. What cgroup metrics would you monitor, and what are the thresholds?" icon="message">
    **Strong Answer:**

    * For memory, I would monitor three signals. First, `memory.current / memory.max` as a utilization ratio -- alert at 80%. Second, `memory.pressure` from PSI (Pressure Stall Information): `some avg10 > 10` means at least one task is stalling 10% of the time waiting for memory, which indicates reclaim pressure before OOM. Third, the `oom` counter in `memory.events` to detect actual OOM kills.
    * For CPU, I would compute throttle percentage from `cpu.stat`: `nr_throttled / nr_periods * 100`. Alert at 5% throttle rate. I would also monitor `cpu.pressure` where `some avg10 > 25` indicates meaningful CPU contention. The distinction between `some` (at least one task stalled) and `full` (all tasks stalled) is important: `some` indicates contention, `full` indicates complete blockage.
    * For I/O, monitor `io.pressure` for both `some` and `full` stall percentages. Also check `io.stat` for per-device throughput to detect I/O bottlenecks.
    * For PID limits, monitor `pids.current / pids.max` and alert at 80%. A fork bomb or thread leak will hit this before OOM.
    * Implementation-wise, I would use a polling agent reading these pseudo-files every 5-10 seconds. PSI metrics are particularly efficient because they are pre-computed moving averages, so reading them is a single small file read (one `some` line and, where present, one `full` line). For production, I would expose these as Prometheus metrics and set up Grafana alerts.

    **Follow-up:** How does PSI (Pressure Stall Information) work internally, and why is it better than just checking utilization percentages?

    **Follow-up Answer:**

    * PSI tracks the actual time tasks spend stalled waiting for resources, not just how much of a resource is being used. The kernel maintains per-CPU counters that are updated on every task state transition. When a task transitions from runnable to blocked-on-memory-reclaim, the kernel increments the memory stall counter for that CPU. PSI then computes exponentially weighted moving averages over 10s, 60s, and 300s windows. This is fundamentally better than utilization because utilization does not capture demand: a container at 90% memory utilization might be fine (large but stable working set) or about to OOM (growing leak). PSI tells you whether tasks are actually waiting, which directly correlates with user-visible performance degradation.
  </Accordion>
</AccordionGroup>

***

Next: [Filesystem & VFS →](/courses/linux-internals/filesystem-vfs)
