Containers & Virtualization

Isolation is the core requirement of multi-tenant cloud computing. Whether you are running a SaaS platform or a microservices cluster, you must ensure that processes are contained, resources are metered, and security boundaries are enforced. Modern systems achieve this through two distinct paths: OS-level virtualization (Containers) and Hardware-level virtualization (VMs).

Mastery Level: Senior Systems Engineer
Key Internals: CLONE_NEW*, Cgroups v2 Unified Hierarchy, VMCS, EPT/SLAT, Firecracker MicroVMs
Prerequisites: Process Internals, Memory Management

1. Container Internals: The Linux “Trio”

A container is not a “thing” in the Linux kernel. It is a user-space abstraction built using three primary kernel features: Namespaces, Control Groups (cgroups), and Union Filesystems.

┌─────────────────────────────────────────────────────────────────────┐
│                     CONTAINER ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Container Runtime (Docker/containerd/cri-o)                        │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  Container 1           Container 2           Container 3        │ │
│  │  ┌──────────┐          ┌──────────┐          ┌──────────┐      │ │
│  │  │ App      │          │ App      │          │ App      │      │ │
│  │  │ (nginx)  │          │ (redis)  │          │ (postgres)      │ │
│  │  └────┬─────┘          └────┬─────┘          └────┬─────┘      │ │
│  │       │                     │                     │             │ │
│  └───────┼─────────────────────┼─────────────────────┼─────────────┘ │
│          │                     │                     │               │
│  ════════╧═════════════════════╧═════════════════════╧═════════════  │
│                                                                     │
│  Linux Kernel Features                                              │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐ │ │
│  │  │   Namespaces    │  │    Cgroups      │  │  Union FS      │ │ │
│  │  │                 │  │                 │  │  (OverlayFS)   │ │ │
│  │  │ • PID           │  │ • CPU           │  │                │ │ │
│  │  │ • Mount         │  │ • Memory        │  │ • LowerDir     │ │ │
│  │  │ • Network       │  │ • PIDs          │  │ • UpperDir     │ │ │
│  │  │ • UTS           │  │ • Blkio         │  │ • MergedDir    │ │ │
│  │  │ • IPC           │  │ • Devices       │  │ • WorkDir      │ │ │
│  │  │ • User          │  │ • Network       │  │                │ │ │
│  │  │ • Cgroup        │  │                 │  │                │ │ │
│  │  │ • Time          │  │                 │  │                │ │ │
│  │  └─────────────────┘  └─────────────────┘  └────────────────┘ │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ════════════════════════════════════════════════════════════════   │
│                                                                     │
│  Shared Linux Kernel                                                │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  System Calls, Process Scheduler, Memory Management, Drivers   │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

1.1 Namespaces: The Illusion of Isolation

Namespaces wrap global system resources in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource.

Namespace	Flag	Isolated Resource
Mount	`CLONE_NEWNS`	Filesystem mount points (independent `mount`/`umount`).
UTS	`CLONE_NEWUTS`	Hostname and NIS domain name.
IPC	`CLONE_NEWIPC`	System V IPC, POSIX message queues.
PID	`CLONE_NEWPID`	Process IDs (Process 1 inside the container).
Network	`CLONE_NEWNET`	Network devices, stacks, ports, firewalls.
User	`CLONE_NEWUSER`	User and group IDs (Root in container != Root on host).
Cgroup	`CLONE_NEWCGROUP`	Cgroup root directory view.
Time	`CLONE_NEWTIME`	System boot and monotonic clocks.
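The flags in the table are plain bits that can be OR-ed together and passed to clone(2) or unshare(2). As a minimal sketch, the helper below turns a comma-separated list of short names into such a bitmask. The names ns_flag, ns_flags_from_list, and the short names ("mnt", "pid", and so on) are hypothetical and chosen only for illustration; the fallback definitions cover older C library headers that may lack the two newest flags.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Fallbacks for older headers; values as defined by the Linux UAPI. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif
#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

/* Hypothetical lookup table: short name -> namespace flag. */
struct ns_entry { const char *name; int flag; };

static const struct ns_entry NS_TABLE[] = {
    {"mnt", CLONE_NEWNS},        {"uts", CLONE_NEWUTS},
    {"ipc", CLONE_NEWIPC},       {"pid", CLONE_NEWPID},
    {"net", CLONE_NEWNET},       {"user", CLONE_NEWUSER},
    {"cgroup", CLONE_NEWCGROUP}, {"time", CLONE_NEWTIME},
};

/* Return the flag for one namespace name, or 0 if the name is unknown. */
int ns_flag(const char *name) {
    for (size_t i = 0; i < sizeof(NS_TABLE) / sizeof(NS_TABLE[0]); i++)
        if (strcmp(NS_TABLE[i].name, name) == 0)
            return NS_TABLE[i].flag;
    return 0;
}

/* OR together the flags for a list such as "pid,net,mnt".
 * Returns -1 if the list is too long or contains an unknown name. */
int ns_flags_from_list(const char *list) {
    char buf[128];
    int flags = 0;
    if (strlen(list) >= sizeof(buf))
        return -1;
    strcpy(buf, list);  /* strtok modifies its input, so work on a copy */
    for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        int f = ns_flag(tok);
        if (f == 0)
            return -1;
        flags |= f;
    }
    return flags;
}
```

A caller could then write unshare(ns_flags_from_list("mnt,uts,ipc")) after checking for -1.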

Deep Dive: PID Namespace

The PID namespace creates a hierarchical process view where each namespace has its own PID 1.

┌─────────────────────────────────────────────────────────────────────┐
│                      PID NAMESPACE HIERARCHY                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host (Initial PID Namespace)                                       │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  PID 1: /sbin/init (systemd)                                   │ │
│  │  PID 523: dockerd                                              │ │
│  │  PID 1024: container-init   ←────┐                             │ │
│  │  PID 1025: nginx (worker)   ←────┼─┐                           │ │
│  │  PID 1026: nginx (worker)   ←────┼─┼─┐                         │ │
│  │                                   │ │ │                         │ │
│  └───────────────────────────────────┼─┼─┼─────────────────────────┘ │
│                                      │ │ │                           │
│  Container PID Namespace             │ │ │                           │
│  ┌──────────────────────────────────┼─┼─┼─────────────────────────┐ │
│  │                                  │ │ │                         │ │
│  │  PID 1: /init  ──────────────────┘ │ │  (maps to host 1024)   │ │
│  │  PID 2: nginx master ───────────────┘ │  (maps to host 1025)   │ │
│  │  PID 3: nginx worker ─────────────────┘  (maps to host 1026)   │ │
│  │                                                                 │ │
│  │  Processes see only PIDs 1, 2, 3                               │ │
│  │  Cannot see or signal host processes                           │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Implementation Details:

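// Creating a PID namespace (needs CAP_SYS_ADMIN, e.g. run as root)
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int child_fn(void *arg) {
    setvbuf(stdout, NULL, _IONBF, 0);  // Unbuffered, so fork()/_exit() cannot lose or duplicate output
    printf("My PID: %d\n", getpid());  // Will print: 1
    pid_t child = fork();              // Fork a child inside the namespace
    if (child == 0) {
        printf("Child PID: %d\n", getpid());  // Will print: 2
        _exit(0);
    }
    waitpid(child, NULL, 0);           // PID 1 reaps its child
    return 0;
}

int main(void) {
    const size_t stack_size = 1024 * 1024;
    char *child_stack = malloc(stack_size);
    if (!child_stack) { perror("malloc"); return 1; }
    pid_t pid = clone(child_fn, child_stack + stack_size, CLONE_NEWPID | SIGCHLD, NULL);  // Pass the top of the stack
    if (pid == -1) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);
    return 0;
}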

Key Properties:

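The first process in the namespace becomes PID 1 and acts as its init: it adopts orphaned processes and must reap them.
If PID 1 exits, the kernel sends SIGKILL to every remaining process in the namespace.
The parent namespace sees the child processes under their “real” (outer) PIDs.
/proc reflects the PID namespace of the process that mounted it, so a container needs a fresh /proc mount (in its own mount namespace) to list only its own processes.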
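The two views of the same process can be observed from the parent namespace: since Linux 4.1, /proc/<pid>/status carries an NSpid line that lists the process ID in each nested PID namespace, outermost first. The following is a minimal sketch of a parser for that line; parse_nspid is a hypothetical helper name, and reading the line out of the status file is left to the caller.

```c
#include <stdlib.h>
#include <string.h>

/* Parse a line such as "NSpid:\t1024\t1\n" into out[], outermost
 * namespace first. Returns the number of PIDs stored (at most max),
 * or -1 if the line does not start with "NSpid:". */
int parse_nspid(const char *line, int *out, int max) {
    if (strncmp(line, "NSpid:", 6) != 0)
        return -1;

    const char *p = line + 6;
    int n = 0;
    while (n < max) {
        char *end;
        long v = strtol(p, &end, 10);  /* skips leading tabs and spaces */
        if (end == p)                  /* no digits left on the line */
            break;
        out[n++] = (int)v;
        p = end;
    }
    return n;
}
```

For the nginx example in the diagram, the line "NSpid:\t1025\t2" would yield host PID 1025 followed by container PID 2.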

Deep Dive: Network Namespace

Network namespaces isolate the network stack: devices, routing tables, firewall rules, sockets.

┌─────────────────────────────────────────────────────────────────────┐
│                    NETWORK NAMESPACE TOPOLOGY                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host Network Namespace                                             │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │ │
│  │  │   eth0   │  │  veth0   │  │  veth2   │  │  veth4   │       │ │
│  │  │ (physical)  │  (host)  │  │  (host)  │  │  (host)  │       │ │
│  │  └─────┬────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘       │ │
│  │        │            │             │             │              │ │
│  │        │       ┌────┴─────────────┴─────────────┴────┐         │ │
│  │        │       │         docker0 (bridge)            │         │ │
│  │        │       │         172.17.0.1/16               │         │ │
│  │        │       └─────────────────────────────────────┘         │ │
│  │        │                                                        │ │
│  │    [Internet]                                                  │ │
│  │                                                                 │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                           │             │             │             │
│                           │             │             │             │
│  Container 1 Netns        │             │             │             │
│  ┌───────────────────────┼─────────────┘             │             │
│  │  ┌────────────────────▼────┐                      │             │
│  │  │  eth0 (container view)  │                      │             │
│  │  │  veth1 (actual peer)    │                      │             │
│  │  │  172.17.0.2/16          │                      │             │
│  │  └─────────────────────────┘                      │             │
│  │  Route: default via 172.17.0.1                    │             │
│  └────────────────────────────────────────────────────┘             │
│                                                       │             │
│  Container 2 Netns                                    │             │
│  ┌───────────────────────────────────────────────────┼─────────────┘
│  │  ┌────────────────────────────────────────────────▼────┐         │
│  │  │  eth0 (container view)                             │         │
│  │  │  veth3 (actual peer)                               │         │
│  │  │  172.17.0.3/16                                     │         │
│  │  └────────────────────────────────────────────────────┘         │
│  │  Route: default via 172.17.0.1                                  │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Creating veth pairs:

# Create network namespace
ip netns add container1

# Create veth pair
ip link add veth0 type veth peer name veth1

# Move one end into namespace
ip link set veth1 netns container1

# Configure host end
ip addr add 172.17.0.1/16 dev veth0
ip link set veth0 up

# Configure container end
ip netns exec container1 ip addr add 172.17.0.2/16 dev veth1
ip netns exec container1 ip link set veth1 up
ip netns exec container1 ip route add default via 172.17.0.1

# Test connectivity (send three probes, then exit)
ip netns exec container1 ping -c 3 172.17.0.1

Code Example: Creating Network Namespace

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <arpa/inet.h>
#include <net/if.h>

int child_fn(void *arg) {
    // Now in new network namespace

    // List network interfaces
    struct if_nameindex *if_ni = if_nameindex();
    if (if_ni) {
        for (int i = 0; if_ni[i].if_index != 0; i++) {
            printf("Interface: %s (index %u)\n",
                   if_ni[i].if_name, if_ni[i].if_index);
        }
        if_freenameindex(if_ni);
    }

    // Only loopback exists in new namespace
    fflush(stdout);  // A clone() child may exit without flushing stdio buffers
    return 0;
}

int main(void) {
    const int STACK_SIZE = 65536;
    char *stack = malloc(STACK_SIZE);
    if (!stack) { perror("malloc"); return 1; }

    // CLONE_NEWNET needs CAP_SYS_ADMIN; pass the top of the stack (it grows down)
    if (clone(child_fn, stack + STACK_SIZE,
              CLONE_NEWNET | SIGCHLD, NULL) == -1) {
        perror("clone");
        return 1;
    }

    wait(NULL);
    free(stack);
    return 0;
}

Deep Dive: Mount Namespace

Mount namespaces isolate the filesystem mount points.

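// Create mount namespace
unshare(CLONE_NEWNS);

// The new namespace starts as a copy of the parent's mounts and keeps their
// propagation type. On hosts where / is shared (the systemd default), mark the
// tree private first, or later mounts would still propagate to the parent.
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

// Mounts are now private to this namespace
mount("/dev/sda1", "/mnt", "ext4", 0, NULL);

// Other namespaces won't see this mount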

Mount Propagation:

┌─────────────────────────────────────────────────────────────────────┐
│                     MOUNT PROPAGATION TYPES                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  MS_SHARED: Mounts propagate bidirectionally                        │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Namespace A   │ ◄───────────────► │  Namespace B   │           │
│  │  mount /foo    │    propagates      │  sees /foo     │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_PRIVATE: Mounts don't propagate (default in containers)         │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Namespace A   │  X  no sharing  X  │  Namespace B   │           │
│  │  mount /foo    │                    │  no /foo       │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_SLAVE: Receives mounts from master, but doesn't send            │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Master        │ ─────────────────► │  Slave         │           │
│  │  mount /foo    │    one-way         │  sees /foo     │           │
│  └────────────────┘ ◄─────X────────────└────────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
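The propagation type of an existing mount can be read back from /proc/self/mountinfo, where each line carries optional tags such as shared:N or master:N between the mount options and the " - " separator; a mount with no such tag is private. The sketch below classifies that tag string. propagation_type is a hypothetical helper name, and extracting the optional-fields portion from a full mountinfo line is left to the caller.

```c
#include <string.h>

/* Classify the optional-fields portion of a mountinfo line.
 * Examples of input: "shared:1", "master:2", "shared:3 master:2", "". */
const char *propagation_type(const char *optional_fields) {
    int shared = strstr(optional_fields, "shared:") != NULL;
    int slave = strstr(optional_fields, "master:") != NULL;

    if (strstr(optional_fields, "unbindable"))
        return "unbindable";
    if (shared && slave)
        return "shared+slave";  /* receives from a master and shares onward */
    if (shared)
        return "shared";
    if (slave)
        return "slave";
    return "private";           /* no propagation tag at all */
}
```

The combined "shared+slave" case is not in the diagram: it describes a mount that receives events from its master and also forwards them to its own peer group.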

Deep Dive: User Namespace

User namespaces allow mapping UIDs/GIDs, enabling rootless containers.

┌─────────────────────────────────────────────────────────────────────┐
│                      USER NAMESPACE MAPPING                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host                              Container                        │
│  ┌──────────────────┐              ┌──────────────────┐            │
│  │                  │              │                  │            │
│  │  UID 0 (root)    │ ────X────►  │  Not mapped      │            │
│  │  UID 1000 (user) │ ───────────► │  UID 0 (root)    │            │
│  │  UID 1001        │ ───────────► │  UID 1           │            │
│  │  UID 1002        │ ───────────► │  UID 2           │            │
│  │  ...             │              │  ...             │            │
│  │  UID 65535       │ ───────────► │  UID 64535       │            │
│  │                  │              │                  │            │
│  └──────────────────┘              └──────────────────┘            │
│                                                                     │
│  Configuration: /proc/<pid>/uid_map                                │
│  Format: <container_id> <host_id> <range>                          │
│  Example: 0 1000 65536                                             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
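The mapping arithmetic behind a uid_map entry is simple enough to sketch directly. The code below parses one line in the <container_id> <host_id> <range> format and translates a container ID to the host ID it stands for. struct id_map, parse_id_map, and map_to_host are hypothetical names used only for this illustration, and real map files may contain several lines, which this sketch does not handle.

```c
#include <stdio.h>

/* One line of /proc/<pid>/uid_map or gid_map. */
struct id_map {
    unsigned inside;   /* first ID as seen inside the namespace */
    unsigned outside;  /* first ID as seen on the host */
    unsigned count;    /* length of the mapped range */
};

/* Parse "<container_id> <host_id> <range>". Returns 0 on success, -1 on error. */
int parse_id_map(const char *line, struct id_map *m) {
    int n = sscanf(line, "%u %u %u", &m->inside, &m->outside, &m->count);
    return n == 3 ? 0 : -1;
}

/* Translate an ID from inside the namespace to the host ID.
 * Returns 0 and stores the result in *host, or -1 if the ID is unmapped. */
int map_to_host(const struct id_map *m, unsigned ns_id, unsigned *host) {
    if (ns_id < m->inside || ns_id - m->inside >= m->count)
        return -1;
    *host = m->outside + (ns_id - m->inside);
    return 0;
}
```

With the example entry "0 1000 65536", container UID 0 translates to host UID 1000 and container UID 64535 to host UID 65535, matching the diagram; container UID 65536 falls outside the range and is unmapped.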

Setting up User Namespace:

# Create user namespace
unshare --user --map-root-user /bin/bash

# Inside namespace
id  # uid=0(root) gid=0(root)

# But on host, this process runs as your regular user
# File operations as "root" in container map to your UID on host

Code Example:

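#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>

// Write a string to a /proc file; returns 0 on success, -1 on failure
static int write_file(const char *path, const char *data) {
    int fd = open(path, O_WRONLY);
    if (fd == -1) return -1;
    ssize_t n = write(fd, data, strlen(data));
    close(fd);
    return n == (ssize_t)strlen(data) ? 0 : -1;
}

// Called by the parent after clone(CLONE_NEWUSER), with the child's PID
void setup_uid_map(pid_t pid) {
    char path[256];
    const char *map = "0 1000 1";  // Map container root (0) to host user (1000)

    snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
    if (write_file(path, map) == -1) perror("uid_map");

    // An unprivileged writer must deny setgroups(2) before writing gid_map
    snprintf(path, sizeof(path), "/proc/%d/setgroups", pid);
    if (write_file(path, "deny") == -1) perror("setgroups");

    // Same mapping for GID
    snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
    if (write_file(path, map) == -1) perror("gid_map");
}

int child_fn(void *arg) {
    // Prints 0 only once the parent has written the maps; until then
    // getuid() returns the overflow ID (65534 by default)
    printf("UID in container: %d\n", getuid());  // 0
    printf("GID in container: %d\n", getgid());  // 0
    return 0;
}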

Deep Dive: IPC Namespace

IPC namespaces isolate System V IPC objects and POSIX message queues.

# In host namespace
ipcmk -Q  # Create message queue
ipcs -q   # List queues - visible

# In new IPC namespace
unshare --ipc ipcs -q  # Empty - can't see host queues

Deep Dive: UTS Namespace

UTS namespaces isolate hostname and domain name.

unshare(CLONE_NEWUTS);
sethostname("container1", 10);

// This hostname is isolated to this namespace
// Host and other containers see their own hostnames

Deep Dive: Time Namespace

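Time namespaces (Linux 5.6+) give a group of processes their own offsets for the boot-time and monotonic clocks; the wall clock (CLOCK_REALTIME) is not virtualized.

// Offset the monotonic clock by 1 hour
unshare(CLONE_NEWTIME);
// The caller itself stays in its old time namespace; only children created
// after this call enter the new one.

// Before the first child is created, write to /proc/self/timens_offsets
// Format: <clock_id> <seconds> <nanoseconds>, where clock_id is "monotonic" or "boottime"
// monotonic 3600 0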

`pivot_root` vs `chroot`

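chroot only changes the root directory used for path resolution. It is not a security boundary: a process that keeps CAP_SYS_CHROOT, or an open file descriptor pointing outside the new root, can “break out”. Containers use pivot_root instead, which swaps the root mount of the mount namespace for a new one. Once the old root has been unmounted, nothing in the namespace refers to the host filesystem any more, which gives a far stronger filesystem jail.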

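#include <sys/mount.h>    // umount2, MNT_DETACH
#include <sys/syscall.h>  // SYS_pivot_root
#include <unistd.h>       // syscall, chdir, rmdir

// glibc provides no wrapper for pivot_root, so call it through syscall()
int pivot_root(const char *new_root, const char *put_old) {
    // Make new_root the new root mount and move the old root to put_old,
    // which must be a directory at or underneath new_root
    return syscall(SYS_pivot_root, new_root, put_old);
}

// Usage (new_root must itself be a mount point, e.g. bind-mounted onto itself,
// and /new_root/old_root must already exist)
chdir("/new_root");
pivot_root(".", "old_root");
umount2("old_root", MNT_DETACH);  // Detaching the old root is what removes access to it
rmdir("old_root");
chdir("/");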

1.2 Cgroups: Resource Metering and Limiting

If Namespaces provide isolation (what you see), Cgroups provide containment (what you can use).

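Cgroups v1 (Legacy): Multiple hierarchies, typically one per controller. A process could be in one group for CPU and a completely different group for Memory, so no single group described all of a workload's resource usage and controllers could not cooperate (for example, buffered writeback could not be charged to the group whose memory held the dirty pages).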
Cgroups v2 (Modern/Unified): A single hierarchy. Every process belongs to exactly one cgroup in a unified tree. This allows for better resource accounting (e.g., attributing page cache writeback to the specific cgroup that caused the dirty pages).

┌─────────────────────────────────────────────────────────────────────┐
│                  CGROUPS V1 VS CGROUPS V2                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Cgroups v1 (Multiple Hierarchies)                                  │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  CPU Hierarchy        Memory Hierarchy      IO Hierarchy        │ │
│  │  ┌──────────┐         ┌──────────┐         ┌──────────┐        │ │
│  │  │   root   │         │   root   │         │   root   │        │ │
│  │  ├──────────┤         ├──────────┤         ├──────────┤        │ │
│  │  │ system   │         │ docker   │         │  user    │        │ │
│  │  │  ├─bash  │         │  ├─nginx │         │   ├─bash │        │ │
│  │  │  └─sshd  │         │  └─redis │         │   └─vim  │        │ │
│  │  └──────────┘         └──────────┘         └──────────┘        │ │
│  │                                                                 │ │
│  │  Problem: Process can be in different groups per controller    │ │
│  │  bash: CPU→system, Memory→docker, IO→user (confusing!)         │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Cgroups v2 (Unified Hierarchy)                                     │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  Single Hierarchy (All Controllers)                             │ │
│  │  ┌──────────────────────────────────────────────────────────┐  │ │
│  │  │                       root                                │  │ │
│  │  │               (cpu, memory, io, pids)                     │  │ │
│  │  ├──────────────────────┬───────────────────────────────────┤  │ │
│  │  │      system          │          user.slice               │  │ │
│  │  │  ├─ sshd.service     │      ├─ user-1000.slice          │  │ │
│  │  │  └─ cron.service     │      │   ├─ session-1.scope       │  │ │
│  │  │                      │      │   │   ├─ bash              │  │ │
│  │  │                      │      │   │   └─ vim               │  │ │
│  │  │                      │      │   └─ docker.service        │  │ │
│  │  │                      │      │       ├─ container1         │  │ │
│  │  │                      │      │       │   ├─ nginx          │  │ │
│  │  │                      │      │       └─ container2         │  │ │
│  │  │                      │      │           └─ redis          │  │ │
│  │  └──────────────────────┴───────────────────────────────────┘  │ │
│  │                                                                 │ │
│  │  Benefit: Process location same for all controllers            │ │
│  │  Proper resource attribution and delegation                    │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Controllers:

CPU Controller:

# Cgroups v2 CPU control
cd /sys/fs/cgroup/user.slice/user-1000.slice

# cpu.max format: $MAX $PERIOD
# Allow 50% of one CPU: 50000 out of 100000 microseconds
echo "50000 100000" > cpu.max

# CPU weight (shares): 1-10000, default 100
echo "200" > cpu.weight  # 2x normal priority

# Statistics
cat cpu.stat
# usage_usec 12345678
# user_usec 10000000
# system_usec 2345678
# nr_periods 1234
# nr_throttled 56
# throttled_usec 789012

Memory Controller:

# Memory limits
echo "512M" > memory.max      # Hard limit
echo "256M" > memory.high     # Soft limit (throttling)
echo "128M" > memory.low      # Best-effort protection
echo "64M" > memory.min       # Hard protection

# Current usage
cat memory.current

# Detailed statistics
cat memory.stat
# anon 104857600           # Anonymous memory (heap, stack)
# file 52428800            # Page cache
# kernel_stack 131072
# slab 8388608
# sock 65536
# shmem 0
# file_mapped 16777216
# file_dirty 1048576
# file_writeback 524288
# inactive_anon 0
# active_anon 104857600
# inactive_file 26214400
# active_file 26214400

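# Memory events
cat memory.events
# low 0                   # Times reclaimed despite usage below memory.low
# high 12                 # Times throttled for exceeding memory.high
# max 3                   # Times usage was about to exceed memory.max
# oom 0                   # Times an allocation was about to fail at the limit
# oom_kill 0              # Processes killed by the OOM killer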
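The memory files accept sizes with binary suffixes, so "512M" means 512 × 1024 × 1024 bytes, and the string "max" means no limit. The sketch below shows that conversion for the K, M, and G suffixes used in this section. parse_mem_value is a hypothetical helper, the sentinel return values are an assumption of this example, and the input is expected without a trailing newline.

```c
#include <stdlib.h>
#include <string.h>

/* Convert a value such as "512M" to bytes.
 * Returns -1 for "max" (no limit) and -2 for malformed input. */
long long parse_mem_value(const char *s) {
    if (strcmp(s, "max") == 0)
        return -1;

    char *end;
    long long v = strtoll(s, &end, 10);
    if (end == s || v < 0)
        return -2;

    /* Each suffix is 1024 times the previous one, hence the fall-through. */
    switch (*end) {
    case 'G': case 'g': v <<= 10; /* fall through */
    case 'M': case 'm': v <<= 10; /* fall through */
    case 'K': case 'k': v <<= 10; end++; break;
    case '\0': break;
    default: return -2;
    }
    return *end == '\0' ? v : -2;
}
```

With the limits above, memory.max is 536870912 bytes and memory.high is 268435456 bytes, so throttling starts once the group is using half of its hard limit.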

I/O Controller:

# I/O weight (1-10000)
echo "500" > io.weight

# I/O max (rate limiting)
# Format: $MAJ:$MIN rbps=$BYTES wbps=$BYTES riops=$IOPS wiops=$IOPS
echo "8:0 rbps=10485760 wbps=5242880" > io.max
# Limit reads to 10 MiB/s, writes to 5 MiB/s on device 8:0

# I/O statistics
cat io.stat
# 8:0 rbytes=1048576000 wbytes=524288000 rios=1000 wios=500

PIDs Controller:

# Limit number of processes
echo "100" > pids.max

# Current count
cat pids.current

# Events
cat pids.events
# max 5  # Times hit pids.max

Cgroups v2 Core Features:

┌─────────────────────────────────────────────────────────────────────┐
│                    CGROUPS V2 CORE CONCEPTS                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. No-Internal-Process Rule                                        │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Only leaf cgroups (and the root) can hold processes      │  │
│     │                                                            │  │
│     │  root/                                                     │  │
│     │  ├─ cgroup.procs         ← Root is exempt                │  │
│     │  └─ system/                                               │  │
│     │     ├─ cgroup.procs      ← Cannot write here             │  │
│     │     └─ sshd.service/                                      │  │
│     │        └─ cgroup.procs   ← Can write here (leaf)         │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  2. Controller Delegation                                           │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Controllers must be explicitly enabled                   │  │
│     │                                                            │  │
│     │  root/cgroup.controllers                                  │  │
│     │  → cpu memory io pids                                     │  │
│     │                                                            │  │
│     │  root/cgroup.subtree_control                              │  │
│     │  → +cpu +memory   (enable for children)                   │  │
│     │                                                            │  │
│     │  root/system/cgroup.controllers                           │  │
│     │  → cpu memory     (inherited from parent)                 │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  3. Pressure Stall Information (PSI)                                │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Tracks resource contention                               │  │
│     │                                                            │  │
│     │  cpu.pressure:                                            │  │
│     │  some avg10=5.23 avg60=3.14 avg300=1.87 total=123456      │  │
│     │                                                            │  │
│     │  memory.pressure:                                         │  │
│     │  some avg10=12.34 avg60=8.90 avg300=5.67 total=234567     │  │
│     │  full avg10=2.10 avg60=1.50 avg300=0.80 total=45678       │  │
│     │                                                            │  │
│     │  io.pressure:                                             │  │
│     │  some avg10=8.90 avg60=6.70 avg300=4.50 total=345678      │  │
│     │  full avg10=3.20 avg60=2.10 avg300=1.40 total=56789       │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
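Each line of a pressure file has the same shape, so a monitoring agent can parse it with a single format string. In these files, "some" is the share of time in which at least one task was stalled on the resource, and "full" is the share in which all non-idle tasks were stalled at once. The sketch below parses one such line; struct psi_line and parse_psi_line are hypothetical names introduced for this example.

```c
#include <stdio.h>

/* One line of cpu.pressure, memory.pressure, or io.pressure. */
struct psi_line {
    char kind[8];              /* "some" or "full" */
    double avg10;              /* % of time stalled, 10 s window */
    double avg60;              /* % of time stalled, 60 s window */
    double avg300;             /* % of time stalled, 300 s window */
    unsigned long long total;  /* cumulative stall time in microseconds */
};

/* Parse a PSI line. Returns 0 on success, -1 if the line is malformed. */
int parse_psi_line(const char *line, struct psi_line *out) {
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total);
    return n == 5 ? 0 : -1;
}
```

A caller would typically read the file line by line with fgets and alert when avg10 of the "full" line stays above a chosen threshold.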

Creating and Managing Cgroups:

// C API for cgroup management
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>   // mkdir
#include <sys/types.h>  // pid_t

void create_cgroup(const char *name) {
    char path[256];

    // Create cgroup directory
    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", name);
    mkdir(path, 0755);
}

void set_memory_limit(const char *name, const char *limit) {
    char path[256];
    int fd;

    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.max", name);
    fd = open(path, O_WRONLY);
    write(fd, limit, strlen(limit));
    close(fd);
}

void add_process_to_cgroup(const char *name, pid_t pid) {
    char path[256];
    char pid_str[32];
    int fd;

    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cgroup.procs", name);
    snprintf(pid_str, sizeof(pid_str), "%d", pid);

    fd = open(path, O_WRONLY);
    write(fd, pid_str, strlen(pid_str));
    close(fd);
}

int main() {
    create_cgroup("myapp");
    set_memory_limit("myapp", "512M");
    add_process_to_cgroup("myapp", getpid());

    // Process now limited to 512MB
    // Allocate memory and observe behavior

    return 0;
}

OOM Killer in Cgroups:

┌─────────────────────────────────────────────────────────────────────┐
│                    OOM KILLER IN CGROUPS                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  When cgroup exceeds memory.max:                                    │
│                                                                     │
│  1. Kernel triggers OOM killer                                      │
│  2. Selects victim ONLY from within the cgroup                      │
│  3. Score calculation (higher = more likely to kill):               │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  score = (rss + swap) * 1000 / total_memory                │  │
│     │  + oom_score_adj                                           │  │
│     │                                                             │  │
│     │  oom_score_adj range: -1000 to 1000                        │  │
│     │  -1000: disable OOM kill                                   │  │
│     │  0: default                                                │  │
│     │  1000: always kill first                                   │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  4. Kill victim process                                             │
│  5. Log to dmesg:                                                   │
│     "Memory cgroup out of memory: Killed process 1234 (app)"       │
│                                                                     │
│  Prevent OOM kill:                                                  │
│  echo "-1000" > /proc/<pid>/oom_score_adj                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
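The score formula in the diagram can be written out as a small function. This is a sketch of the simplified formula shown above, not the kernel's exact code (the kernel also counts page-table memory, for instance). The name oom_score_estimate is hypothetical, and clamping negative results to zero is a choice made for this example.

```c
/* Estimate an OOM score from page counts, following the simplified
 * formula: (rss + swap) * 1000 / total_memory + oom_score_adj.
 * Returns 0 for processes that are exempt or when total_pages is invalid. */
long oom_score_estimate(long rss_pages, long swap_pages,
                        long total_pages, int oom_score_adj) {
    if (oom_score_adj <= -1000)  /* -1000 disables OOM kill for the process */
        return 0;
    if (total_pages <= 0)
        return 0;

    long score = (rss_pages + swap_pages) * 1000 / total_pages + oom_score_adj;
    return score < 0 ? 0 : score;
}
```

A process holding a quarter of the cgroup's memory scores 250 with the default adjustment; raising oom_score_adj to 500 lifts that to 750, which makes it a likelier victim than a neighbour using twice as much memory with the default setting.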

1.3 OverlayFS: The Layered Filesystem

Containers use Union Filesystems (like OverlayFS) to provide a writable layer on top of read-only image layers.

LowerDir: Read-only layers (the Docker image).
UpperDir: The writable layer where changes are stored.
MergedDir: The unified view presented to the container.
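WorkDir: An empty scratch directory on the same filesystem as the UpperDir, used internally by OverlayFS to prepare files before they appear in the UpperDir.
Copy-on-Write (CoW): When a container first modifies a file that exists only in a LowerDir layer, the kernel copies the whole file up to the UpperDir and applies the change there. The lower layers are never modified, which is why many containers can share one image.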
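These directories come together in the option string passed when the overlay filesystem is mounted. As a minimal sketch, the helper below builds that string in the lowerdir=...,upperdir=...,workdir=... form; overlay_options is a hypothetical name, and the paths in the usage note are made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Build the data string for an overlay mount.
 * Several read-only layers can be given in `lower`, separated by ':'
 * with the topmost layer first.
 * Returns 0 on success, -1 if the buffer is too small. */
int overlay_options(char *buf, size_t len, const char *lower,
                    const char *upper, const char *work) {
    int n = snprintf(buf, len, "lowerdir=%s,upperdir=%s,workdir=%s",
                     lower, upper, work);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string would be passed as the last argument of mount("overlay", merged_path, "overlay", 0, options), where merged_path is the MergedDir the container sees as its root filesystem. That call needs CAP_SYS_ADMIN in the mount namespace.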

2. Virtualization: Emulating the Machine

Virtual Machines (VMs) take the isolation boundary down to the hardware level. Instead of sharing a kernel, they share the physical CPU and Memory.

2.1 The Hypervisor (VMM)

The Virtual Machine Monitor (VMM) is the software that manages guest execution.

Type 1 (Bare Metal): Runs directly on hardware (Xen, ESXi).
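Type 2 (Hosted): Runs as an application on a host OS (VirtualBox, VMware Workstation). Note: KVM blurs the line. It is a kernel module that turns the Linux kernel itself into a hypervisor, so it is usually classified as Type 1 even though the host remains a general-purpose OS.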

2.2 Hardware-Assisted Virtualization (VT-x / AMD-V)

Early x86 virtualization used “Binary Translation” to rewrite privileged and sensitive guest instructions on the fly, because the architecture could not trap all of them. Modern CPUs handle this in hardware:

VMX Root Mode: The hypervisor runs here (full privileges).
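VMX Non-Root Mode: The guest OS runs here. When the guest executes an instruction that the hypervisor has configured to trap (for example CPUID, which always exits, or HLT and CR3 writes, which exit depending on the VMCS execution controls), the CPU triggers a VM Exit, trapping into the hypervisor to handle the event.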

VMCS (Virtual Machine Control Structure)

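The VMCS is a memory region, one per virtual CPU, that stores the guest state, the host state, and the control bits that decide which events cause a VM Exit. When a physical core switches from a vCPU of VM A to a vCPU of VM B, the hypervisor makes the other VMCS current (the VMPTRLD instruction on Intel).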

2.3 Memory Virtualization: EPT and NPT

In a VM, there are three types of addresses:

Guest Virtual (GV)
Guest Physical (GP)
Host Physical (HP)

Shadow Page Tables (Old): The hypervisor tracked guest page table changes in software and built a combined GV→HP table. Every guest page table update had to trap to the hypervisor, which made this approach slow.

EPT (Extended Page Tables): Hardware handles the translation. The CPU has a second set of page tables that map GP→HP. A memory access that misses the TLB now involves a “2D Page Walk,” but it happens entirely in hardware. NPT (Nested Page Tables) is AMD's equivalent of Intel's EPT.
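To make the cost of the 2D page walk concrete, the sketch below counts worst-case memory references for a single guest access that misses the TLB. It is an abstract model that assumes no paging-structure caches; `walk_refs` and `native_refs` are illustrative names, not part of any hypervisor API.

```python
def native_refs(levels: int) -> int:
    """Without virtualization, a page walk reads one entry per level."""
    return levels


def walk_refs(guest_levels: int, ept_levels: int) -> int:
    """Worst-case memory references for one guest access on a TLB miss.

    Each guest page-table pointer is a guest-physical address, so reading
    one guest level costs a full EPT walk plus the read of the entry itself.
    The final guest-physical address then needs one more EPT walk.
    """
    per_guest_level = ept_levels + 1
    return guest_levels * per_guest_level + ept_levels


if __name__ == "__main__":
    print("native 4-level walk:", native_refs(4))
    print("4-level guest on 4-level EPT:", walk_refs(4, 4))
```

With four guest levels and four EPT levels the model yields 24 references instead of 4, which is why large TLBs and huge pages matter so much for virtualized workloads.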

3. The Middle Ground: MicroVMs

Plain containers expose a large attack surface (the full host syscall interface, several hundred syscalls). Plain VMs are slow to boot and carry the memory overhead of a full guest OS. MicroVMs (like Firecracker) bridge the gap.

Firecracker Architecture

Minimalism: Removes all non-essential devices (no VGA, no USB, no sound).
VirtIO: Uses paravirtualized drivers for network and disk, avoiding the overhead of emulating real hardware registers.
Jailer: Firecracker itself runs inside a container (Namespaces + Cgroups) to provide “Defense in Depth.”
Performance: Can boot a Linux kernel in < 125ms and run thousands of instances on a single host.

4. Comparison: When to Use What?

Feature	Containers	MicroVMs (Firecracker)	Full VMs (ESXi/KVM)
Isolation	Logical (Kernel)	Hardware (Minimal)	Hardware (Full)
Startup	< 1s	< 1s	> 10s
Payload	Process	Kernel + Rootfs	Full OS
Security	Medium (Shared Kernel)	High	Highest
Use Case	Microservices	Serverless / Multi-tenant	Legacy / Windows

5. Docker Internals: Putting It All Together

Docker is a high-level container runtime that orchestrates namespaces, cgroups, and OverlayFS.

┌─────────────────────────────────────────────────────────────────────┐
│                     DOCKER ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  User Space                                                          │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  docker CLI                                                     │ │
│  │  $ docker run -m 512m --cpus=0.5 nginx                         │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ REST API over Unix socket          │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  dockerd (Docker Daemon)                                        │ │
│  │  • Image management                                             │ │
│  │  • Volume management                                            │ │
│  │  • Network management                                           │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ gRPC                               │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  containerd (Container Runtime)                                 │ │
│  │  • Container lifecycle                                          │ │
│  │  • Image pulls/pushes                                           │ │
│  │  • Storage management                                           │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │                                    │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  containerd-shim (Per-container)                                │ │
│  │  • Keeps container running if containerd crashes                │ │
│  │  • Reports exit status                                          │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ fork/exec                          │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  runc (OCI Runtime)                                             │ │
│  │  • Creates namespaces                                           │ │
│  │  • Sets up cgroups                                              │ │
│  │  • Mounts overlay filesystem                                    │ │
│  │  • Executes container process                                   │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │                                    │
│  ════════════════════════════════╧═════════════════════════════════  │
│                                                                     │
│  Kernel Space                                                        │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Container Process                                              │ │
│  │  ┌───────────────────────────────────────────────────────────┐ │ │
│  │  │  PID, Mount, Network, UTS, IPC, User, Cgroup Namespaces   │ │ │
│  │  │  CPU, Memory, IO, PIDs Cgroups                             │ │ │
│  │  │  OverlayFS (LowerDir, UpperDir, WorkDir, MergedDir)        │ │ │
│  │  └───────────────────────────────────────────────────────────┘ │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Docker Container Creation Steps:

# 1. Pull image (if not cached)
docker pull nginx:latest

# 2. Create container
docker create --name web -p 80:80 nginx

# 3. Start container
docker start web

# What happens internally (simplified):
# a. containerd unpacks image layers
# b. runc creates namespaces (CLONE_NEWPID|CLONE_NEWNET|...)
# c. runc sets up cgroups (/sys/fs/cgroup/docker/<container-id>/)
# d. OverlayFS is mounted (by the storage driver) and passed to runc as the rootfs
# e. dockerd (libnetwork) configures the network (veth pair, bridge)
# f. runc calls pivot_root into the container filesystem
# g. runc executes CMD/ENTRYPOINT
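As a rough illustration of the cgroup setup step, the following sketch converts the `-m 512m --cpus=0.5` flags from the diagram above into the values that would be written to cgroup v2 files. It assumes a CPU period of 100000 microseconds and binary size units; the function names are hypothetical helpers, not Docker or runc APIs.

```python
import re

# Binary multipliers for size suffixes (assumed: k/m/g are powers of 1024).
_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_memory(value: str) -> int:
    """Convert a size such as '512m' to a byte count."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Render a cgroup v2 cpu.max line: '<quota> <period>' in microseconds."""
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


def cgroup_settings(memory: str, cpus: float) -> dict:
    """Map the two flags to the cgroup v2 files they control."""
    return {"memory.max": str(parse_memory(memory)), "cpu.max": cpu_max(cpus)}


if __name__ == "__main__":
    for name, value in cgroup_settings("512m", 0.5).items():
        print(f"{name}: {value}")
```

Half a CPU becomes a quota of 50000 microseconds per 100000 microsecond period, so the container may run on several cores at once as long as its total CPU time stays under the quota.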

Inspecting Docker Internals:

# View container's namespaces
docker inspect web | jq '.[0].State.Pid'  # Get PID
sudo ls -la /proc/<PID>/ns/
# lrwxrwxrwx 1 root root 0 pid -> pid:[4026532198]
# lrwxrwxrwx 1 root root 0 net -> net:[4026532201]
# lrwxrwxrwx 1 root root 0 mnt -> mnt:[4026532196]

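# View cgroup limits (cgroup v2 with the cgroupfs driver; with the systemd driver
# the directory is /sys/fs/cgroup/system.slice/docker-<container-id>.scope/)
cat /sys/fs/cgroup/docker/<container-id>/memory.max
cat /sys/fs/cgroup/docker/<container-id>/cpu.max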

# View OverlayFS layers
docker inspect web | jq '.[0].GraphDriver'
# {
#   "Data": {
#     "LowerDir": "/var/lib/docker/overlay2/abc.../diff",
#     "UpperDir": "/var/lib/docker/overlay2/def.../diff",
#     "WorkDir": "/var/lib/docker/overlay2/def.../work",
#     "MergedDir": "/var/lib/docker/overlay2/def.../merged"
#   }
# }
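The namespace links shown above can also be compared programmatically. The sketch below parses the link targets and reports which namespaces two processes share. It is Linux-only, reading another process's links may require root, and the helper names are illustrative.

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str):
    """Split a link target such as 'pid:[4026532198]' into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self") -> dict:
    """Return {entry name: inode} for /proc/<pid>/ns."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a: dict, b: dict) -> set:
    """Names of namespaces that both processes are members of."""
    return {name for name in a.keys() & b.keys() if a[name] == b[name]}


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    print(namespaces_of())
```

Calling `shared_namespaces(namespaces_of(), namespaces_of(container_pid))` on the host should return an empty or near-empty set for a default Docker container, while a container started with `--network host` would share `net`.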

6. Interview Deep Dive: Senior Level

Q1: How does 'User Namespaces' improve container security?

Answer: User Namespaces (CLONE_NEWUSER) allow a process to have UID 0 (root) inside the container while being a non-privileged UID (e.g., 1000) on the host.

Security Improvement:

Without User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                         Host                    │
│  ┌──────────────┐                 ┌──────────────┐        │
│  │  UID 0       │ ═══════════════►│  UID 0       │        │
│  │  (root)      │                 │  (root)      │        │
│  └──────────────┘                 └──────────────┘        │
│                                                            │
│  If container breakout occurs:                            │
│  → Attacker has root on host!                             │
└───────────────────────────────────────────────────────────┘

With User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                         Host                    │
│  ┌──────────────┐                 ┌──────────────┐        │
│  │  UID 0       │ ───────────────►│  UID 1000    │        │
│  │  (root)      │     mapped      │  (user)      │        │
│  └──────────────┘                 └──────────────┘        │
│                                                            │
│  If container breakout occurs:                            │
│  → Attacker only has UID 1000 permissions                 │
│  → Cannot write to /etc, /boot, system files              │
│  → Cannot load kernel modules                             │
│  → Cannot access other users' files                       │
└───────────────────────────────────────────────────────────┘

Implementation:

# Run Docker with user namespace remapping
dockerd --userns-remap=default

# Or manually with unshare
unshare --user --map-root-user /bin/bash

Limitations:

Some operations still require host root (mounting certain filesystems)
File ownership can be confusing (files created by container appear owned by high UIDs on host)
Not all containers work with user namespaces (especially those requiring true root)
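The UID translation behind these points follows the format of /proc/<pid>/uid_map, where each line holds three numbers: first ID inside the namespace, first ID outside, and range length. The sketch below applies such a map. The sample range starting at 100000 is an assumed example, and the function names are illustrative.

```python
from typing import List, Optional, Tuple

IdMap = List[Tuple[int, int, int]]


def parse_id_map(text: str) -> IdMap:
    """Parse uid_map/gid_map content: 'inside outside length' per line."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_id(container_id: int, id_map: IdMap) -> Optional[int]:
    """Translate an ID inside the namespace to the ID the host sees."""
    for inside, outside, length in id_map:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None  # unmapped IDs appear as the overflow ID (usually 65534)


if __name__ == "__main__":
    sample = parse_id_map("0 100000 65536\n")
    for uid in (0, 1000, 70000):
        print(uid, "->", to_host_id(uid, sample))
```

This also explains the file ownership confusion noted above: a file created by container root shows up on the host as owned by the first ID of the mapped range.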

Q2: Explain the difference between cgroups v1 and v2 and why v2 is better

Answer:

Cgroups v1 Problems:

Multiple Hierarchies:
- Each controller (cpu, memory, io) has its own hierarchy
- A process can be in /sys/fs/cgroup/cpu/groupA and /sys/fs/cgroup/memory/groupB
- Impossible to do unified resource accounting
Writeback Ambiguity:
- Process in cgroup A writes to page cache
- Page cache writeback happens later
- Which cgroup gets charged for the disk I/O?
- v1: Writeback runs in kernel flusher threads, so buffered writes are not charged to the cgroup that dirtied the pages
No Delegation:
- Can’t safely give non-root users control over cgroups
- Security issues with nested hierarchies

Cgroups v2 Solutions:

Single Hierarchy:
- One tree, all controllers
- Process location is the same for all resources
- Enables proper delegation and accounting
Proper Attribution:
- Tracks which cgroup dirtied pages
- I/O charged correctly even if writeback delayed
Pressure Stall Information (PSI):
- Built-in resource pressure metrics
- Can detect when cgroup is starved

Migration Example:

# v1: Multiple hierarchies
/sys/fs/cgroup/cpu/docker/container1/
/sys/fs/cgroup/memory/system/container1/

# v2: Single hierarchy
/sys/fs/cgroup/docker/container1/
# All controllers available here
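One way to check which layout a process lives in is to read /proc/self/cgroup, where each line has the form `hierarchy-ID:controller-list:path` and the v2 hierarchy always appears as `0::<path>`. The sketch below classifies such content; the function names are illustrative.

```python
def parse_proc_cgroup(text: str) -> list:
    """Parse /proc/<pid>/cgroup into (hierarchy_id, controllers, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        names = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), names, path))
    return entries


def cgroup_mode(entries: list) -> str:
    """Classify as 'v1', 'v2', or 'hybrid' (both mounted side by side)."""
    has_v2 = any(hid == 0 and not names for hid, names, _ in entries)
    has_v1 = any(names for _, names, _ in entries)
    if has_v1 and has_v2:
        return "hybrid"
    return "v2" if has_v2 else "v1"


if __name__ == "__main__":
    v1_sample = "4:memory:/docker/abc\n3:cpu,cpuacct:/docker/abc\n"
    v2_sample = "0::/docker/abc\n"
    print(cgroup_mode(parse_proc_cgroup(v1_sample)))
    print(cgroup_mode(parse_proc_cgroup(v2_sample)))
```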

Q3: How does Docker implement network isolation and connectivity?

Answer: Docker uses network namespaces + veth pairs + Linux bridge.

Default Bridge Network:

Create network namespace for container
Create veth pair (virtual ethernet cable with two ends)
Move one end into container namespace
Attach other end to docker0 bridge
Configure IP addresses and routes
Setup iptables rules for NAT

Detailed Flow:

# Container sends packet to 8.8.8.8:53
eth0@container (172.17.0.2) → veth pair
vethXXX@host → docker0 bridge (172.17.0.1)
SNAT (MASQUERADE rule): 172.17.0.2 → 192.168.1.100 (host IP)
eth0@host → Internet

# Response
eth0@host ← Internet
Reverse NAT (conntrack entry, no separate rule): 192.168.1.100 → 172.17.0.2
docker0 bridge → vethXXX@host
veth pair → eth0@container
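The outbound rewrite and the matching reverse translation can be modeled with a small connection-tracking table. This is a toy sketch, not how netfilter is implemented: it always allocates a fresh host port, whereas the kernel tries to keep the original source port, and the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: tuple  # (ip, port)
    dst: tuple  # (ip, port)


class Masquerade:
    """Toy source NAT backed by a connection-tracking table."""

    def __init__(self, host_ip: str, first_port: int = 32768):
        self.host_ip = host_ip
        self.next_port = first_port
        self.by_flow = {}       # (container src, remote dst) -> host port
        self.by_host_port = {}  # host port -> container (ip, port)

    def outbound(self, pkt: Packet) -> Packet:
        """Rewrite the source address and remember the mapping."""
        key = (pkt.src, pkt.dst)
        if key not in self.by_flow:
            self.by_flow[key] = self.next_port
            self.by_host_port[self.next_port] = pkt.src
            self.next_port += 1
        return Packet((self.host_ip, self.by_flow[key]), pkt.dst)

    def inbound(self, pkt: Packet):
        """Restore the container address for replies to tracked flows."""
        ip, port = pkt.dst
        if ip != self.host_ip or port not in self.by_host_port:
            return None  # no tracked connection, nothing to forward
        return Packet(pkt.src, self.by_host_port[port])


if __name__ == "__main__":
    nat = Masquerade("192.168.1.100")
    out = nat.outbound(Packet(("172.17.0.2", 40000), ("8.8.8.8", 53)))
    print(out)
    print(nat.inbound(Packet(("8.8.8.8", 53), out.src)))
```

The key point is that the reply path needs no rule of its own: the entry created by the first outbound packet is enough to route the response back to the container.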

Port Mapping:

docker run -p 8080:80 nginx

# iptables rule created (simplified):
iptables -t nat -A DOCKER -p tcp --dport 8080 \
  -j DNAT --to-destination 172.17.0.2:80

Network Modes:

Mode	Description	Use Case
bridge	Default, isolated network	Normal containers
host	Share host network namespace	High performance
none	No network	Security isolation
container:ID	Share another container’s netns	Sidecars

Code Example:

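# Simplified Docker network setup, expressed as iproute2 commands (run as root).
# Assumes the docker0 bridge exists and $PID is the container's init process.

# 1. Create veth pair
ip link add veth0 type veth peer name veth1

# 2. Move one end into the container's network namespace
ip link set veth1 netns "$PID"

# 3. Attach the host end to the bridge and bring it up
ip link set veth0 master docker0
ip link set veth0 up

# 4. Configure IP address and default route inside the namespace
nsenter -t "$PID" -n ip addr add 172.17.0.2/16 dev veth1
nsenter -t "$PID" -n ip link set veth1 up
nsenter -t "$PID" -n ip route add default via 172.17.0.1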

Q4: What is a 'VM Exit' and why is it expensive?

Answer: A VM Exit occurs when the guest OS performs an action that requires hypervisor intervention (e.g., I/O, CPUID, or accessing certain registers).

VM Exit Flow:

┌─────────────────────────────────────────────────────────────┐
│                      VM EXIT OVERHEAD                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest (VM)                                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. Execute privileged instruction (e.g., IN/OUT)    │   │
│  │     or access protected resource                      │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ CPU Trap                         │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  2. Hardware VM Exit                                  │   │
│  │     • Save guest state to VMCS (registers, RIP, etc.) │   │
│  │     • Load host state from VMCS                       │   │
│  │     • Jump to hypervisor entry point                  │   │
│  │     • Time: ~1000-3000 cycles                         │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  Hypervisor                                                  │
│  ┌────────────────────────▼─────────────────────────────┐   │
│  │  3. Handle VM Exit                                    │   │
│  │     • Inspect exit reason                             │   │
│  │     • Emulate instruction (e.g., read port 0x3F8)     │   │
│  │     • Update guest state                              │   │
│  │     • Time: 500-2000 cycles                           │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  4. Resume Guest                                      │   │
│  │     • Load guest state from VMCS                      │   │
│  │     • Switch to VMX non-root mode                     │   │
│  │     • Continue guest execution                        │   │
│  │     • Time: ~1000-2000 cycles                         │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Total Overhead: 2500-7000 cycles (1-3 microseconds)        │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Common VM Exit Causes:

Cause	Frequency	Mitigation
I/O instructions (IN/OUT)	High	Use virtio (paravirtualization)
CPUID	Medium	Cache results in guest
CR3 writes (page table)	High	Use EPT (hardware MMU)
Interrupts	Very High	APIC virtualization
MSR access	Medium	Use MSR bitmaps
HLT (idle)	Low	Acceptable (CPU idle anyway)

Optimization Strategies:

EPT (Extended Page Tables):
- Guest can change CR3 without VM exit
- Hardware handles GVA → GPA → HPA translation
APIC Virtualization:
- CPU keeps virtual APIC state in a virtual-APIC page referenced from the VMCS
- Most interrupt operations happen without exits
VirtIO:
- Paravirtualized drivers
- Shared memory rings reduce I/O exits

Modern virtualization aims to minimize VM Exits using features like APIC Virtualization and EPT.
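A quick back-of-the-envelope calculation shows why exit counts matter. The sketch below estimates the fraction of one core spent on exits, using the cycle range from the diagram above. The exit rates and the 3 GHz clock are assumed inputs chosen for illustration, not measurements.

```python
def exit_overhead(exits_per_sec: float, cycles_per_exit: float, cpu_hz: float) -> float:
    """Fraction of one core's cycles consumed by VM exits."""
    return exits_per_sec * cycles_per_exit / cpu_hz


if __name__ == "__main__":
    cpu_hz = 3e9  # assumed 3 GHz core
    scenarios = {
        "emulated I/O heavy (100k exits/s)": 100_000,
        "virtio, batched (2k exits/s)": 2_000,
    }
    for label, rate in scenarios.items():
        low = exit_overhead(rate, 2_500, cpu_hz)
        high = exit_overhead(rate, 7_000, cpu_hz)
        print(f"{label}: {low:.2%} to {high:.2%} of one core")
```

Under these assumptions, cutting the exit rate by a factor of 50 moves the cost from a noticeable share of a core to well under one percent, which is the motivation for the optimizations listed above.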

Q5: Explain the 'Nested Virtualization' problem

Answer: Nested virtualization is running a hypervisor and its VMs inside another VM (e.g., running KVM guests inside a cloud VM instance).

Address Translation Complexity:

┌─────────────────────────────────────────────────────────────┐
│                NESTED VIRTUALIZATION                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  L2 Guest (innermost VM)                                     │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GVA (Guest Virtual Address)                          │   │
│  │  e.g., 0x400000 (program address)                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ L2 Page Tables                   │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GPA-L2 (Guest Physical Address of L2)                │   │
│  │  e.g., 0x80000000                                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  L1 Hypervisor (middle VM)                                   │
│                           │ L1 EPT                           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GPA-L1 (Guest Physical Address of L1)                │   │
│  │  e.g., 0x100000000                                    │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  L0 Hypervisor (host)                                        │
│                           │ L0 EPT                           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  HPA (Host Physical Address)                          │   │
│  │  e.g., 0x200000000 (actual RAM)                       │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Without Nested EPT (L1 shadows L2's page tables):           │
│  → L2 page table updates trap to L0, then to L1             │
│  → L1's exit handler itself triggers more exits to L0       │
│  → Page-table-heavy workloads become very slow              │
│                                                              │
│  With Nested EPT (Intel):                                    │
│  → L0 merges L1 and L0 EPT into one shadow EPT               │
│  → Still slower than native, but manageable                  │
│  → ~2-3x overhead instead of 10x+                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Performance Impact:

Operation         Native    L1 VM    Nested L2 VM
Memory Access     1x        1.1x     2-3x
I/O               1x        2x       4-6x
Context Switch    1x        1.5x     3-4x

When to Use Nested Virtualization:

Development/Testing:
- Test hypervisor code
- CI/CD pipelines testing VMs
Cloud Services:
- VM-based sandboxes (Kata Containers, Firecracker) on cloud VMs
- CI runners that boot VMs inside cloud instances
Education:
- Teaching virtualization
- Lab environments

Avoid for:

Production databases
High-performance computing
Latency-sensitive applications

The main challenge is the “Level 2” Guest Physical to “Level 0” Host Physical translation. This requires either shadow page table merging in software or Nested EPT support in the L0 hypervisor. Both can significantly degrade memory performance, because TLB misses lead to longer page walks and page table changes cause additional VM Exits.

Q6: How does OverlayFS implement copy-on-write for containers?

Answer: OverlayFS provides a union mount where multiple layers are combined into a single view.

Layer Structure:

┌─────────────────────────────────────────────────────────────┐
│                    OVERLAYFS LAYERS                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Container View (MergedDir)                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  /bin/bash      ← from Base Layer                     │   │
│  │  /etc/nginx/    ← from Nginx Layer                    │   │
│  │  /var/log/app   ← from Container Layer (writable)     │   │
│  │  /app/config    ← from Container Layer (modified)     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  ═════════════════════════╧═════════════════════════════     │
│                                                              │
│  UpperDir (Writable Container Layer)                         │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  /var/log/app           ← New file                    │   │
│  │  /app/config            ← Modified file               │   │
│  │  oldfile (0/0 chardev)  ← Whiteout (deleted file)     │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  LowerDir (Read-Only Image Layers)                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Layer 3: Nginx Layer                                 │   │
│  │  /etc/nginx/nginx.conf                                │   │
│  │  /usr/sbin/nginx                                      │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Layer 2: App Dependencies                            │   │
│  │  /usr/lib/libssl.so                                   │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Layer 1: Base Ubuntu                                 │   │
│  │  /bin/bash                                            │   │
│  │  /etc/passwd                                          │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Copy-on-Write Operations:

1. Read File:

open("/etc/nginx/nginx.conf", O_RDONLY)
→ OverlayFS checks UpperDir: not found
→ Falls through to LowerDir: found in Nginx layer
→ Returns file from LowerDir (no copy needed)

2. Modify File:

open("/etc/nginx/nginx.conf", O_WRONLY)
→ OverlayFS checks UpperDir: not found
→ Copy file from LowerDir to UpperDir (copy-up)
→ Open file in UpperDir for writing
→ Future reads will use UpperDir version

3. Delete File:

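unlink("/etc/nginx/nginx.conf")
→ OverlayFS creates a whiteout in UpperDir
→ UpperDir/etc/nginx/nginx.conf becomes a character device with device number 0/0 (marks file as deleted)
→ The .wh.<name> prefix is the whiteout convention of AUFS and image layer tarballs, not of OverlayFS itself
→ LowerDir file remains (other containers unaffected)
→ MergedDir view hides the file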

4. Create New File:

open("/var/log/app.log", O_CREAT)
→ OverlayFS creates file directly in UpperDir
→ No interaction with LowerDir needed
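The four operations above can be captured in a small in-memory model. This is a teaching sketch only: layers are plain dictionaries, a sentinel object stands in for the whiteout marker, and the `Overlay` class and its methods are invented for illustration rather than taken from any kernel or Docker interface.

```python
WHITEOUT = object()  # stands in for the on-disk whiteout marker


class Overlay:
    """Toy union filesystem: read-only lower layers plus one writable upper."""

    def __init__(self, lowers):
        self.lowers = lowers  # list of {path: content}, topmost layer first
        self.upper = {}

    def _lookup_lower(self, path):
        for layer in self.lowers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def read(self, path):
        if path in self.upper:
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        return self._lookup_lower(path)  # no copy needed for reads

    def append(self, path, data):
        if path not in self.upper:
            self.upper[path] = self._lookup_lower(path)  # copy-up
        elif self.upper[path] is WHITEOUT:
            raise FileNotFoundError(path)
        self.upper[path] += data

    def create(self, path, data=""):
        self.upper[path] = data  # new files go straight to the upper layer

    def unlink(self, path):
        self.read(path)  # raises if the file is already absent
        if any(path in layer for layer in self.lowers):
            self.upper[path] = WHITEOUT  # hide the lower copy
        else:
            del self.upper[path]


if __name__ == "__main__":
    nginx = {"/etc/nginx/nginx.conf": "worker 1;"}
    base = {"/bin/bash": "elf"}
    fs = Overlay([nginx, base])
    fs.append("/etc/nginx/nginx.conf", " gzip on;")
    print(fs.read("/etc/nginx/nginx.conf"))
    print(nginx["/etc/nginx/nginx.conf"])
```

Because the lower dictionaries are never modified, a second `Overlay` built on the same layers sees the original files, which mirrors how containers share image layers.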

Mount Command:

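# The -o options must form a single comma-separated argument (no spaces).
# lowerdir is ordered top to bottom: /lower1 is the topmost layer.
mount -t overlay overlay \
  -o lowerdir=/lower1:/lower2:/lower3,upperdir=/upper,workdir=/work \
  /merged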
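A common pitfall with the mount command above is layer order: image layers are usually listed base first, while `lowerdir` expects the topmost layer first. The sketch below builds the option string from a base-first list. The helper names and paths are illustrative, and it assumes paths contain no commas or colons, which would need escaping.

```python
def overlay_options(image_layers, upperdir, workdir) -> str:
    """Build the -o string; image_layers is ordered base layer first."""
    lowerdir = ":".join(reversed(image_layers))  # topmost layer goes first
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


def mount_argv(image_layers, upperdir, workdir, target) -> list:
    """Argument vector for the mount command (not executed here)."""
    options = overlay_options(image_layers, upperdir, workdir)
    return ["mount", "-t", "overlay", "overlay", "-o", options, target]


if __name__ == "__main__":
    layers = ["/layers/base", "/layers/deps", "/layers/nginx"]
    print(" ".join(mount_argv(layers, "/upper", "/work", "/merged")))
```

Passing the result as a list to a process launcher avoids the shell quoting problems that arise when the option string is split across lines.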

Benefits:

Shared base layers save disk space
Fast container startup (no copying)
Efficient use of cache (shared pages)

Performance Considerations:

First write to file triggers copy-up (can be slow for large files)
Many layers slow down lookup
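Whiteouts and copied-up files accumulate in the writable layer (recreating the container resets it; docker system prune reclaims space from stopped containers and unused images)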

Q7: What are the security implications of sharing the kernel in containers vs VMs?

Answer:

Container Security (Shared Kernel):

Pros:

Faster startup and lower overhead
Easier management

Cons:

Kernel Exploits:

Container → Kernel Vulnerability → Host Compromise

Example: Dirty COW (CVE-2016-5195)
- Container can exploit kernel bug
- Gain root on host
- Escape to host system

Large Attack Surface:

~300+ system calls exposed
Any syscall vulnerability affects all containers

Mitigation: seccomp-bpf filters
→ Block dangerous syscalls
→ Reduce attack surface

Resource Exhaustion:

Without cgroups:
Container A → Allocate all memory → OOM kills Container B

With cgroups:
Container A → Hit memory.max → OOM kills processes in A only

Information Leakage:

/proc and /sys expose kernel information
- /proc/kallsyms (kernel symbols)
- /sys/kernel/debug (debug info)

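Mitigation: Mask sensitive paths or mount them read-only (Docker masks entries such as /proc/kcore by default); hidepid only hides other users' processes in /proc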

VM Security (Separate Kernel):

Pros:

Strong Isolation:

VM → Hypervisor → Host

Attack path requires:
1. Exploit in guest kernel
2. VM escape vulnerability
3. Hypervisor exploit

Much harder than container escape

Smaller Attack Surface:

VM → Hypervisor interface is small
- Hypercalls (10-20 vs 300+ syscalls)
- Device emulation
- Much less code to attack

Different Kernels:

Can run different kernel versions
Old vulnerable kernel in VM doesn't affect host

Comparison Table:

Aspect	Containers	VMs
Kernel isolation	Shared	Separate
Escape difficulty	Medium	Hard
Attack surface	Large (~300 syscalls)	Small (~20 hypercalls)
Vulnerability impact	Affects host	Contained to VM
Performance overhead	~2%	~5-10%
Startup time	under 1s	10-30s

Best Practices:

For Containers:

# 1. Use user namespaces (dockerd daemon flag, not a docker run option)
--userns-remap=default

# 2. Drop capabilities
--cap-drop=ALL --cap-add=NET_BIND_SERVICE

# 3. Seccomp filter
--security-opt seccomp=default.json

# 4. AppArmor/SELinux
--security-opt apparmor=docker-default

# 5. Read-only root
--read-only --tmpfs /tmp

# 6. No privileged mode
# NEVER use --privileged in production!
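The per-container flags above are easier to apply consistently when they are assembled in one place. The sketch below builds a `docker run` argument list from them. The `hardened_run` helper and its parameters are hypothetical, it only constructs the command without running it, and `--userns-remap` is left out because it is configured on the daemon.

```python
import shlex


def hardened_run(image, add_caps=(), tmpfs_paths=("/tmp",), seccomp_profile=None) -> list:
    """Assemble a docker run command line with the hardening flags applied."""
    argv = ["docker", "run", "--rm", "--cap-drop=ALL"]
    argv += [f"--cap-add={cap}" for cap in add_caps]  # re-add only what is needed
    if seccomp_profile is not None:
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    argv.append("--read-only")
    for path in tmpfs_paths:
        argv += ["--tmpfs", path]  # writable scratch space on a read-only root
    argv.append(image)
    return argv


if __name__ == "__main__":
    command = hardened_run("nginx", add_caps=["NET_BIND_SERVICE"])
    print(shlex.join(command))
```

Keeping `--privileged` out of the builder entirely means it cannot be enabled by accident through a stray parameter.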

For VMs:

# 1. Minimal device emulation
Use virtio instead of emulated hardware

# 2. Disable unnecessary devices
-nodefaults -vga none

# 3. Use KVM (hardware virtualization)
-enable-kvm

# 4. Memory ballooning
-device virtio-balloon

# 5. vTPM for measured boot (the full option also needs id= and a chardev= for the TPM emulator socket)
-tpmdev emulator

Hybrid Approach (Kata Containers):

Container API → Lightweight VM → Strong isolation

Benefits:
- Container-like UX
- VM-like security
- ~50-100ms startup (vs 10s for traditional VM)

Q8: How do hypervisors implement device emulation vs paravirtualization?

Answer:

Device Emulation (Full Virtualization): The guest believes it is talking to real hardware, and the hypervisor emulates every register read/write.

┌─────────────────────────────────────────────────────────────┐
│                  DEVICE EMULATION                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest VM                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. Guest writes to e1000 NIC register                │   │
│  │     outl(0xC000, ETH_TX_DESC)                         │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Exit (I/O port access)        │
│                           ▼                                  │
│  Hypervisor (QEMU/KVM)                                       │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  2. Trap I/O operation                                │   │
│  │     Decode: write to port 0xC000, value 0x12345678    │   │
│  │                                                        │   │
│  │  3. Emulate e1000 device logic                        │   │
│  │     - Update virtual NIC state                        │   │
│  │     - Copy packet from guest memory                   │   │
│  │     - Send to host TAP device                         │   │
│  │                                                        │   │
│  │  4. Return to guest                                   │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  Guest continues...                                          │
│                                                              │
│  Problem: EVERY register access causes VM Exit!              │
│  → Thousands of exits per packet                            │
│  → 10x+ overhead                                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Paravirtualization (VirtIO): The guest knows it is virtualized and uses an efficient shared-memory interface.

┌─────────────────────────────────────────────────────────────┐
│                  PARAVIRTUALIZATION                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest VM                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. VirtIO driver in guest                            │   │
│  │     - Shared memory ring (vring)                      │   │
│  │     - No register emulation needed                    │   │
│  │                                                        │   │
│  │  2. Write packet to shared ring                       │   │
│  │     vring[idx] = packet_buffer                        │   │
│  │     idx++                                             │   │
│  │                                                        │   │
│  │  3. Kick hypervisor (single VM exit)                  │   │
│  │     kick_notify()                                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ Single VM Exit                   │
│                           ▼                                  │
│  Hypervisor                                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  4. Process all pending packets                       │   │
│  │     while (vring has packets) {                       │   │
│  │       packet = vring[idx]                             │   │
│  │       send_to_tap(packet)                             │   │
│  │       idx++                                           │   │
│  │     }                                                 │   │
│  │                                                        │   │
│  │  5. Return to guest                                   │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  Guest continues...                                          │
│                                                              │
│  Benefit: One VM exit for multiple packets!                  │
│  → Near-native performance                                  │
│  → <5% overhead                                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘
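The batching effect in the diagram above can be shown with a toy queue that counts exits. This is a simplified model, not the virtio protocol: the `ToyVirtqueue` class is invented for illustration, and the number of register accesses per packet for the emulated device is an assumed parameter.

```python
from collections import deque


class ToyVirtqueue:
    """Shared ring: enqueueing is a memory write, only the kick causes an exit."""

    def __init__(self):
        self.ring = deque()
        self.exits = 0
        self.delivered = []

    def add_buffer(self, packet):
        self.ring.append(packet)  # guest writes to shared memory, no exit

    def kick(self):
        self.exits += 1  # the notification traps to the hypervisor once
        while self.ring:  # hypervisor drains everything that is pending
            self.delivered.append(self.ring.popleft())


def emulated_exits(packets: int, register_accesses_per_packet: int) -> int:
    """With full emulation every device register access is a separate exit."""
    return packets * register_accesses_per_packet


if __name__ == "__main__":
    queue = ToyVirtqueue()
    for n in range(32):
        queue.add_buffer(f"packet-{n}")
    queue.kick()
    print("virtio exits:", queue.exits)
    print("emulated exits:", emulated_exits(32, 5))
```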

Performance Comparison:

Operation	Emulation	VirtIO	Native
Network throughput	1 Gbps	9.5 Gbps	10 Gbps
Disk IOPS	5,000	45,000	50,000
VM exits per packet	100-1000	1-2	0

VirtIO Ring Structure:

#include <stdint.h>

// Descriptor table entry: describes one guest buffer
struct vring_desc {
    uint64_t addr;   // Guest physical address
    uint32_t len;    // Buffer length
    uint16_t flags;  // VRING_DESC_F_NEXT, etc.
    uint16_t next;   // Next descriptor in the chain
};

// Available ring: guest writes here
struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[];  // queue_size entries
};

struct vring_used_elem {
    uint32_t id;   // Index of the head descriptor
    uint32_t len;  // Bytes written
};

// Used ring: hypervisor writes here
struct vring_used {
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[];  // queue_size entries
};

// One virtqueue: three regions in guest memory
struct vring {
    unsigned int num;  // queue_size
    struct vring_desc *desc;
    struct vring_avail *avail;
    struct vring_used *used;
};
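To see how descriptors are laid out in memory, the sketch below packs a chain of buffers into the 16-byte descriptor format (addr, len, flags, next). It assumes little-endian encoding as used by modern virtio devices; `pack_chain` is an illustrative helper and the guest addresses are made up.

```python
import struct

VRING_DESC = struct.Struct("<QIHH")  # addr, len, flags, next
VRING_DESC_F_NEXT = 1   # another descriptor follows in the chain
VRING_DESC_F_WRITE = 2  # buffer is written by the device, not read


def pack_chain(buffers) -> bytes:
    """Pack (guest_phys_addr, length, device_writable) tuples as one chain."""
    table = bytearray()
    for index, (addr, length, writable) in enumerate(buffers):
        last = index == len(buffers) - 1
        flags = VRING_DESC_F_WRITE if writable else 0
        if not last:
            flags |= VRING_DESC_F_NEXT
        next_index = 0 if last else index + 1
        table += VRING_DESC.pack(addr, length, flags, next_index)
    return bytes(table)


if __name__ == "__main__":
    # A request header the device reads, then a buffer the device fills in.
    chain = pack_chain([(0x1000, 64, False), (0x2000, 128, True)])
    for offset in range(0, len(chain), VRING_DESC.size):
        print(VRING_DESC.unpack_from(chain, offset))
```

A chain like this lets a single entry in the available ring describe a whole request, which is part of why one notification can cover so much work.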

Tradeoffs:

Emulation:

Pros: No guest modification, runs any OS
Cons: Slow, many VM exits

Paravirtualization:

Pros: Fast, few VM exits
Cons: Requires guest support (virtio drivers in the guest OS)

Modern Approach:

Use paravirt for performance-critical devices (disk, network)
Use emulation for legacy devices (VGA, PS/2)
Gradually reduce emulation over time

6. Namespaces & Cgroups: A Single Process’s Perspective

What does a process actually “see” when it’s containerized? Here’s the view from inside:

What Changes for the Process

┌─────────────────────────────────────────────────────────────────────┐
│     PROCESS VIEW: BEFORE vs AFTER CONTAINERIZATION                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  BEFORE (Host Process)                AFTER (Containerized)         │
│  ────────────────────                 ─────────────────────         │
│                                                                     │
│  PID: 12345                           PID: 1  (thinks it's init!)   │
│  UID: 1000                            UID: 0  (root in container)   │
│  Hostname: myserver                   Hostname: container-abc       │
│  /proc: sees all processes            /proc: container PIDs only    │
│  Network: eth0 (192.168.1.5)          Network: eth0 (172.17.0.2)    │
│  Filesystem: /home/user/...           Filesystem: / (isolated root) │
│  Memory: unlimited                    Memory: 512MB max (cgroup)    │
│  CPU: all cores                       CPU: 50% of 1 core (cgroup)   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Inspecting Your Own Namespace

# Inside a container (or any process), see your namespaces:
ls -la /proc/self/ns/
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532198]'
# lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026532201]'
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532196]'
# ...

# Compare with host (different inode numbers = different namespace)
# Host:  pid:[4026531836]
# Container: pid:[4026532198]  ← Different!

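# See your cgroup limits (cgroup v2 paths, as seen from inside the container)
cat /sys/fs/cgroup/memory.max      # Memory limit in bytes, or "max"
cat /sys/fs/cgroup/cpu.max         # CPU limit as "<quota> <period>" in microseconds, or "max <period>"
cat /sys/fs/cgroup/pids.max        # Max processes, or "max"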

# See your cgroup resource usage
cat /sys/fs/cgroup/memory.current  # Current memory usage
cat /sys/fs/cgroup/cpu.stat        # CPU time consumed

The Process Doesn’t Know It’s Contained

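// The same binary runs unmodified on the host or in a container;
// only the values the kernel returns differ:
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("My PID: %d\n", (int)getpid());   // 1 in container, 12345 on host
    printf("My UID: %d\n", (int)getuid());   // 0 in container (mapped root when a user namespace is used)

    char hostname[256];
    if (gethostname(hostname, sizeof(hostname)) != 0) {
        perror("gethostname");
        return 1;
    }
    hostname[sizeof(hostname) - 1] = '\0';   // not guaranteed to be terminated on truncation
    printf("Hostname: %s\n", hostname);      // "container-abc" in container

    // None of these calls reveals the container:
    // each one is answered relative to the caller's namespaces.
    return 0;
}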

Key Insight: Syscalls Are Virtualized

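Not everything is translated: values that are not namespaced, such as the kernel version reported by `uname()` or the totals in /proc/meminfo, still reflect the host. For namespaced state (PIDs, UIDs, hostname, mounts, network devices), the kernel answers each syscall relative to the caller's namespaces: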

Syscall	Host Returns	Container Returns
`getpid()`	12345	1
`getuid()`	1000	0 (mapped root)
`uname()` (nodename)	myserver	container-abc
`readdir(/proc)`	All PIDs	Only container PIDs
`socket(AF_INET,...)`	Host network	Container network

7. Advanced Practice

Manual Namespace Build: Use the unshare command to create a shell with its own network and PID namespace. Try to ping the host.
Cgroup Stress Test: Create a cgroup v2 with a 100MB memory limit. Run a program that allocates 200MB and observe the kernel’s OOM killer logs in dmesg.
VirtIO Analysis: Run a KVM guest and use lspci inside the guest to identify which devices are using virtio drivers vs. emulated hardware.

