
Containers & Virtualization

Isolation is the core requirement of multi-tenant cloud computing. Whether you are running a SaaS platform or a microservices cluster, you must ensure that processes are contained, resources are metered, and security boundaries are enforced. Modern systems achieve this through two distinct paths: OS-level virtualization (Containers) and Hardware-level virtualization (VMs).
Mastery Level: Senior Systems Engineer
Key Internals: CLONE_NEW*, Cgroups v2 Unified Hierarchy, VMCS, EPT/SLAT, Firecracker MicroVMs
Prerequisites: Process Internals, Memory Management

1. Container Internals: The Linux “Trio”

A container is not a “thing” in the Linux kernel. It is a user-space abstraction built using three primary kernel features: Namespaces, Control Groups (cgroups), and Union Filesystems.
┌─────────────────────────────────────────────────────────────────────┐
│                     CONTAINER ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Container Runtime (Docker/containerd/cri-o)                        │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  Container 1           Container 2           Container 3        │ │
│  │  ┌──────────┐          ┌──────────┐          ┌──────────┐      │ │
│  │  │ App      │          │ App      │          │ App      │      │ │
│  │  │ (nginx)  │          │ (redis)  │          │(postgres)│      │ │
│  │  └────┬─────┘          └────┬─────┘          └────┬─────┘      │ │
│  │       │                     │                     │             │ │
│  └───────┼─────────────────────┼─────────────────────┼─────────────┘ │
│          │                     │                     │               │
│  ════════╧═════════════════════╧═════════════════════╧═════════════  │
│                                                                     │
│  Linux Kernel Features                                              │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐ │ │
│  │  │   Namespaces    │  │    Cgroups      │  │  Union FS      │ │ │
│  │  │                 │  │                 │  │  (OverlayFS)   │ │ │
│  │  │ • PID           │  │ • CPU           │  │                │ │ │
│  │  │ • Mount         │  │ • Memory        │  │ • LowerDir     │ │ │
│  │  │ • Network       │  │ • PIDs          │  │ • UpperDir     │ │ │
│  │  │ • UTS           │  │ • Blkio         │  │ • MergedDir    │ │ │
│  │  │ • IPC           │  │ • Devices       │  │ • WorkDir      │ │ │
│  │  │ • User          │  │ • Network       │  │                │ │ │
│  │  │ • Cgroup        │  │                 │  │                │ │ │
│  │  │ • Time          │  │                 │  │                │ │ │
│  │  └─────────────────┘  └─────────────────┘  └────────────────┘ │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ════════════════════════════════════════════════════════════════   │
│                                                                     │
│  Shared Linux Kernel                                                │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  System Calls, Process Scheduler, Memory Management, Drivers   │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

1.1 Namespaces: The Illusion of Isolation

Namespaces wrap global system resources in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource.
Namespace   Flag              Isolated Resource
Mount       CLONE_NEWNS       Filesystem mount points (independent mount/umount).
UTS         CLONE_NEWUTS      Hostname and NIS domain name.
IPC         CLONE_NEWIPC      System V IPC, POSIX message queues.
PID         CLONE_NEWPID      Process IDs (PID 1 inside the container).
Network     CLONE_NEWNET      Network devices, stacks, ports, firewall rules.
User        CLONE_NEWUSER     User and group IDs (root in container != root on host).
Cgroup      CLONE_NEWCGROUP   Cgroup root directory view.
Time        CLONE_NEWTIME     System boot-time and monotonic clocks.
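Inspecting Namespaces:
Every process's namespace memberships are visible as symlinks under /proc/<pid>/ns; two processes share a namespace exactly when the inode numbers match. A quick look at the current shell (output abbreviated, inode numbers will differ on your host):
# Namespaces of the current shell
ls -l /proc/$$/ns
# mnt -> mnt:[4026531841]
# net -> net:[4026531840]
# pid -> pid:[4026531836]
# ...

# System-wide overview (util-linux)
lsns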

Deep Dive: PID Namespace

The PID namespace creates a hierarchical process view where each namespace has its own PID 1.
┌─────────────────────────────────────────────────────────────────────┐
│                      PID NAMESPACE HIERARCHY                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host (Initial PID Namespace)                                       │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  PID 1: /sbin/init (systemd)                                   │ │
│  │  PID 523: dockerd                                              │ │
│  │  PID 1024: container-init   ←────┐                             │ │
│  │  PID 1025: nginx (worker)   ←────┼─┐                           │ │
│  │  PID 1026: nginx (worker)   ←────┼─┼─┐                         │ │
│  │                                   │ │ │                         │ │
│  └───────────────────────────────────┼─┼─┼─────────────────────────┘ │
│                                      │ │ │                           │
│  Container PID Namespace             │ │ │                           │
│  ┌──────────────────────────────────┼─┼─┼─────────────────────────┐ │
│  │                                  │ │ │                         │ │
│  │  PID 1: /init  ──────────────────┘ │ │  (maps to host 1024)   │ │
│  │  PID 2: nginx master ───────────────┘ │  (maps to host 1025)   │ │
│  │  PID 3: nginx worker ─────────────────┘  (maps to host 1026)   │ │
│  │                                                                 │ │
│  │  Processes see only PIDs 1, 2, 3                               │ │
│  │  Cannot see or signal host processes                           │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Implementation Details:
// Creating a PID namespace (requires CAP_SYS_ADMIN, i.e. run as root)
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs as PID 1 of the new namespace
int child_fn(void *arg) {
    printf("My PID: %d\n", getpid());           // Will print: 1

    pid_t child = fork();
    if (child == 0) {
        printf("Child PID: %d\n", getpid());    // Will print: 2
    } else {
        wait(NULL);                             // reap before PID 1 exits
    }
    return 0;
}

int main(void) {
    // clone() takes a pointer to the TOP of a caller-allocated stack
    char *child_stack = malloc(1024 * 1024);
    pid_t pid = clone(child_fn, child_stack + 1024 * 1024,
                      CLONE_NEWPID | SIGCHLD, NULL);
    waitpid(pid, NULL, 0);   // in the parent namespace, pid is the "real" PID
    return 0;
}
Key Properties:
  • First process in namespace becomes PID 1
  • If PID 1 exits, kernel kills all processes in namespace
  • Parent namespace can see child processes with their “real” PIDs
  • /proc shows only the namespace's processes once it is re-mounted inside the new mount namespace — a quick demo follows
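A quick way to observe these properties without writing C is util-linux unshare, which can fork into a new PID namespace and remount /proc so that ps reflects the new view (a minimal demo, assuming util-linux is installed):
# Fork a shell into a new PID namespace with its own /proc
sudo unshare --pid --fork --mount-proc /bin/bash

# Inside the new namespace:
echo $$    # prints 1 — this shell is PID 1 here
ps aux     # shows only processes in this namespace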

Deep Dive: Network Namespace

Network namespaces isolate the network stack: devices, routing tables, firewall rules, sockets.
┌─────────────────────────────────────────────────────────────────────┐
│                    NETWORK NAMESPACE TOPOLOGY                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host Network Namespace                                             │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │ │
│  │  │   eth0   │  │  veth0   │  │  veth2   │  │  veth4   │       │ │
│  │  │(physical)│  │  (host)  │  │  (host)  │  │  (host)  │       │ │
│  │  └─────┬────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘       │ │
│  │        │            │             │             │              │ │
│  │        │       ┌────┴─────────────┴─────────────┴────┐         │ │
│  │        │       │         docker0 (bridge)            │         │ │
│  │        │       │         172.17.0.1/16               │         │ │
│  │        │       └─────────────────────────────────────┘         │ │
│  │        │                                                        │ │
│  │    [Internet]                                                  │ │
│  │                                                                 │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                           │             │             │             │
│                           │             │             │             │
│  Container 1 Netns        │             │             │             │
│  ┌───────────────────────┼─────────────┘             │             │
│  │  ┌────────────────────▼────┐                      │             │
│  │  │  eth0 (container view)  │                      │             │
│  │  │  veth1 (actual peer)    │                      │             │
│  │  │  172.17.0.2/16          │                      │             │
│  │  └─────────────────────────┘                      │             │
│  │  Route: default via 172.17.0.1                    │             │
│  └────────────────────────────────────────────────────┘             │
│                                                       │             │
│  Container 2 Netns                                    │             │
│  ┌───────────────────────────────────────────────────┼─────────────┘
│  │  ┌────────────────────────────────────────────────▼────┐         │
│  │  │  eth0 (container view)                             │         │
│  │  │  veth3 (actual peer)                               │         │
│  │  │  172.17.0.3/16                                     │         │
│  │  └────────────────────────────────────────────────────┘         │
│  │  Route: default via 172.17.0.1                                  │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Creating veth pairs:
# Create network namespace
ip netns add container1

# Create veth pair
ip link add veth0 type veth peer name veth1

# Move one end into namespace
ip link set veth1 netns container1

# Configure host end
ip addr add 172.17.0.1/16 dev veth0
ip link set veth0 up

# Configure container end
ip netns exec container1 ip addr add 172.17.0.2/16 dev veth1
ip netns exec container1 ip link set veth1 up
ip netns exec container1 ip route add default via 172.17.0.1

# Test connectivity
ip netns exec container1 ping 172.17.0.1
Code Example: Creating Network Namespace
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <arpa/inet.h>
#include <net/if.h>

int child_fn(void *arg) {
    // Now in new network namespace

    // List network interfaces
    struct if_nameindex *if_ni = if_nameindex();
    if (if_ni) {
        for (int i = 0; if_ni[i].if_index != 0; i++) {
            printf("Interface: %s (index %d)\n",
                   if_ni[i].if_name, if_ni[i].if_index);
        }
        if_freenameindex(if_ni);
    }

    // Only loopback exists in new namespace
    return 0;
}

int main() {
    const int STACK_SIZE = 65536;
    char *stack = malloc(STACK_SIZE);

    // CLONE_NEWNET requires CAP_SYS_ADMIN (run as root)
    clone(child_fn, stack + STACK_SIZE,
          CLONE_NEWNET | SIGCHLD, NULL);

    wait(NULL);
    free(stack);
    return 0;
}

Deep Dive: Mount Namespace

Mount namespaces isolate the filesystem mount points.
// Create mount namespace
unshare(CLONE_NEWNS);

// Mounts are now private to this namespace
mount("/dev/sda1", "/mnt", "ext4", 0, NULL);

// Other namespaces won't see this mount
Mount Propagation:
┌─────────────────────────────────────────────────────────────────────┐
│                     MOUNT PROPAGATION TYPES                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  MS_SHARED: Mounts propagate bidirectionally                        │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Namespace A   │ ◄───────────────► │  Namespace B   │           │
│  │  mount /foo    │    propagates      │  sees /foo     │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_PRIVATE: Mounts don't propagate (default in containers)         │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Namespace A   │  X  no sharing  X  │  Namespace B   │           │
│  │  mount /foo    │                    │  no /foo       │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_SLAVE: Receives mounts from master, but doesn't send            │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Master        │ ─────────────────► │  Slave         │           │
│  │  mount /foo    │    one-way         │  sees /foo     │           │
│  └────────────────┘ ◄─────X────────────└────────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
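Propagation can be toggled per mount point from the shell; container runtimes typically make the container's root private. A small illustration (run as root; /mnt is an arbitrary example mount point):
# Change propagation of an existing mount point
mount --make-private /mnt
mount --make-shared  /mnt
mount --make-slave   /mnt

# Recursively for a whole subtree
mount --make-rprivate /

# Inspect current propagation flags
findmnt -o TARGET,PROPAGATION /mnt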

Deep Dive: User Namespace

User namespaces allow mapping UIDs/GIDs, enabling rootless containers.
┌─────────────────────────────────────────────────────────────────────┐
│                      USER NAMESPACE MAPPING                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host                              Container                        │
│  ┌──────────────────┐              ┌──────────────────┐            │
│  │                  │              │                  │            │
│  │  UID 0 (root)    │ ────X────►  │  Not mapped      │            │
│  │  UID 1000 (user) │ ───────────► │  UID 0 (root)    │            │
│  │  UID 1001        │ ───────────► │  UID 1           │            │
│  │  UID 1002        │ ───────────► │  UID 2           │            │
│  │  ...             │              │  ...             │            │
│  │  UID 65535       │ ───────────► │  UID 64535       │            │
│  │                  │              │                  │            │
│  └──────────────────┘              └──────────────────┘            │
│                                                                     │
│  Configuration: /proc/<pid>/uid_map                                │
│  Format: <container_id> <host_id> <range>                          │
│  Example: 0 1000 65536                                             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Setting up User Namespace:
# Create user namespace
unshare --user --map-root-user /bin/bash

# Inside namespace
id  # uid=0(root) gid=0(root)

# But on host, this process runs as your regular user
# File operations as "root" in container map to your UID on host
Code Example:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>

void setup_uid_map(pid_t pid) {
    char path[256];
    char map[256];

    // Map container root (0) to host user (1000)
    snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
    snprintf(map, sizeof(map), "0 1000 1");

    int fd = open(path, O_WRONLY);
    write(fd, map, strlen(map));
    close(fd);

    // Same for GID. Note: an unprivileged process must first write "deny"
    // to /proc/<pid>/setgroups before gid_map can be written (Linux >= 3.19).
    snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
    fd = open(path, O_WRONLY);
    write(fd, map, strlen(map));
    close(fd);
}

int child_fn(void *arg) {
    printf("UID in container: %d\n", getuid());  // 0
    printf("GID in container: %d\n", getgid());  // 0
    return 0;
}

Deep Dive: IPC Namespace

IPC namespaces isolate System V IPC objects and POSIX message queues.
# In host namespace
ipcmk -Q  # Create message queue
ipcs -q   # List queues - visible

# In new IPC namespace
unshare --ipc ipcs -q  # Empty - can't see host queues

Deep Dive: UTS Namespace

UTS namespaces isolate hostname and domain name.
unshare(CLONE_NEWUTS);
sethostname("container1", 10);

// This hostname is isolated to this namespace
// Host and other containers see their own hostnames
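The same isolation is easy to see from the shell with unshare (assuming util-linux is available):
# The hostname change is confined to the new UTS namespace
sudo unshare --uts bash -c 'hostname container1; hostname'
# container1

hostname   # back on the host: unchanged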

Deep Dive: Time Namespace

Time namespaces (Linux 5.6+) allow different boot times and monotonic clocks.
// Create a new time namespace. The caller itself stays in its old namespace;
// children created after this call are placed in the new one.
unshare(CLONE_NEWTIME);

// Configure offsets by writing /proc/self/timens_offsets
// (only allowed before the first process joins the namespace).
// Format: <clock> <seconds> <nanoseconds>, e.g. to offset the clocks by 1 hour:
//   monotonic 3600 0
//   boottime  3600 0
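Recent util-linux versions expose this from the shell as well; a sketch, assuming util-linux 2.36+ (for the --time/--boottime options) on a kernel built with time namespaces:
# Start a shell whose CLOCK_BOOTTIME is offset by one hour
sudo unshare --time --fork --boottime 3600 bash -c 'uptime; cat /proc/uptime'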

pivot_root vs. chroot

While chroot only changes the root directory for path resolution, it is insecure (processes can “break out” via .. or file descriptor trickery). Containers use pivot_root, which moves the entire mount namespace to a new root and removes access to the old one, providing a true filesystem jail.
// glibc has no pivot_root() wrapper, so invoke the raw syscall
#include <sys/syscall.h>
#include <unistd.h>

int pivot_root(const char *new_root, const char *put_old) {
    // Moves the old root to put_old (a directory under new_root),
    // makes new_root the new root, and drops access to the old one
    return syscall(SYS_pivot_root, new_root, put_old);
}

// Usage: new_root must be a mount point, and old_root must exist under it
chdir("/new_root");
mkdir("old_root", 0700);
pivot_root(".", "old_root");
umount2("old_root", MNT_DETACH);   // detach the old root entirely
rmdir("old_root");
chdir("/");

1.2 Cgroups: Resource Metering and Limiting

If Namespaces provide isolation (what you see), Cgroups provide containment (what you can use).
  • Cgroups v1 (Legacy): Multiple hierarchies. A process could be in one group for CPU and a completely different group for Memory. This led to massive complexity and performance issues.
  • Cgroups v2 (Modern/Unified): A single hierarchy. Every process belongs to exactly one cgroup in a unified tree. This allows for better resource accounting (e.g., attributing page cache writeback to the specific cgroup that caused the dirty pages).
┌─────────────────────────────────────────────────────────────────────┐
│                  CGROUPS V1 VS CGROUPS V2                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Cgroups v1 (Multiple Hierarchies)                                  │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  CPU Hierarchy        Memory Hierarchy      IO Hierarchy        │ │
│  │  ┌──────────┐         ┌──────────┐         ┌──────────┐        │ │
│  │  │   root   │         │   root   │         │   root   │        │ │
│  │  ├──────────┤         ├──────────┤         ├──────────┤        │ │
│  │  │ system   │         │ docker   │         │  user    │        │ │
│  │  │  ├─bash  │         │  ├─nginx │         │   ├─bash │        │ │
│  │  │  └─sshd  │         │  └─redis │         │   └─vim  │        │ │
│  │  └──────────┘         └──────────┘         └──────────┘        │ │
│  │                                                                 │ │
│  │  Problem: Process can be in different groups per controller    │ │
│  │  bash: CPU→system, Memory→docker, IO→user (confusing!)         │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Cgroups v2 (Unified Hierarchy)                                     │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  Single Hierarchy (All Controllers)                             │ │
│  │  ┌──────────────────────────────────────────────────────────┐  │ │
│  │  │                       root                                │  │ │
│  │  │               (cpu, memory, io, pids)                     │  │ │
│  │  ├──────────────────────┬───────────────────────────────────┤  │ │
│  │  │      system          │          user.slice               │  │ │
│  │  │  ├─ sshd.service     │      ├─ user-1000.slice          │  │ │
│  │  │  └─ cron.service     │      │   ├─ session-1.scope       │  │ │
│  │  │                      │      │   │   ├─ bash              │  │ │
│  │  │                      │      │   │   └─ vim               │  │ │
│  │  │                      │      │   └─ docker.service        │  │ │
│  │  │                      │      │       ├─ container1         │  │ │
│  │  │                      │      │       │   ├─ nginx          │  │ │
│  │  │                      │      │       └─ container2         │  │ │
│  │  │                      │      │           └─ redis          │  │ │
│  │  └──────────────────────┴───────────────────────────────────┘  │ │
│  │                                                                 │ │
│  │  Benefit: Process location same for all controllers            │ │
│  │  Proper resource attribution and delegation                    │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
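Checking which cgroup version a host runs:
# cgroup2fs → unified v2 hierarchy; tmpfs with per-controller subdirs → v1/hybrid
stat -fc %T /sys/fs/cgroup
mount | grep cgroup

# On v2, the controllers available at the root:
cat /sys/fs/cgroup/cgroup.controllers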

Key Controllers:

CPU Controller:
# Cgroups v2 CPU control
cd /sys/fs/cgroup/user.slice/user-1000.slice

# cpu.max format: $MAX $PERIOD
# Allow 50% of one CPU: 50000 out of 100000 microseconds
echo "50000 100000" > cpu.max

# CPU weight (shares): 1-10000, default 100
echo "200" > cpu.weight  # 2x normal priority

# Statistics
cat cpu.stat
# usage_usec 12345678
# user_usec 10000000
# system_usec 2345678
# nr_periods 1234
# nr_throttled 56
# throttled_usec 789012
Memory Controller:
# Memory limits
echo "512M" > memory.max      # Hard limit
echo "256M" > memory.high     # Soft limit (throttling)
echo "128M" > memory.low      # Best-effort protection
echo "64M" > memory.min       # Hard protection

# Current usage
cat memory.current

# Detailed statistics
cat memory.stat
# anon 104857600           # Anonymous memory (heap, stack)
# file 52428800            # Page cache
# kernel_stack 131072
# slab 8388608
# sock 65536
# shmem 0
# file_mapped 16777216
# file_dirty 1048576
# file_writeback 524288
# inactive_anon 0
# active_anon 104857600
# inactive_file 26214400
# active_file 26214400

# Memory events
cat memory.events
# low 0                   # Times below memory.low
# high 12                 # Times above memory.high
# max 3                   # Times hit memory.max
# oom 0                   # OOM kills
# oom_kill 0
I/O Controller:
# I/O weight (1-10000)
echo "500" > io.weight

# I/O max (rate limiting)
# Format: $MAJ:$MIN rbps=$BYTES wbps=$BYTES riops=$IOPS wiops=$IOPS
echo "8:0 rbps=10485760 wbps=5242880" > io.max
# Limit reads to 10MB/s, writes to 5MB/s on device 8:0

# I/O statistics
cat io.stat
# 8:0 rbytes=1048576000 wbytes=524288000 rios=1000 wios=500
PIDs Controller:
# Limit number of processes
echo "100" > pids.max

# Current count
cat pids.current

# Events
cat pids.events
# max 5  # Times hit pids.max
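Putting the controllers together, a minimal v2 cgroup can be created, limited, and populated straight through the filesystem (run as root; paths assume the unified hierarchy is mounted at /sys/fs/cgroup):
# Enable controllers for children of the root, then create a leaf cgroup
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/demo

# Apply limits
echo "50000 100000" > /sys/fs/cgroup/demo/cpu.max     # 50% of one CPU
echo "256M" > /sys/fs/cgroup/demo/memory.max
echo "64" > /sys/fs/cgroup/demo/pids.max

# Move the current shell into the cgroup and verify
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
cat /proc/self/cgroup    # 0::/demo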
Cgroups v2 Core Features:
┌─────────────────────────────────────────────────────────────────────┐
│                    CGROUPS V2 CORE CONCEPTS                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. No-Internal-Process Rule                                        │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Only leaf cgroups can have processes                     │  │
│     │                                                            │  │
│     │  root/                                                     │  │
│     │  ├─ cgroup.procs         ← Cannot write here             │  │
│     │  └─ system/                                               │  │
│     │     ├─ cgroup.procs      ← Cannot write here             │  │
│     │     └─ sshd.service/                                      │  │
│     │        └─ cgroup.procs   ← Can write here (leaf)         │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  2. Controller Delegation                                           │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Controllers must be explicitly enabled                   │  │
│     │                                                            │  │
│     │  root/cgroup.controllers                                  │  │
│     │  → cpu memory io pids                                     │  │
│     │                                                            │  │
│     │  root/cgroup.subtree_control                              │  │
│     │  → +cpu +memory   (enable for children)                   │  │
│     │                                                            │  │
│     │  root/system/cgroup.controllers                           │  │
│     │  → cpu memory     (inherited from parent)                 │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  3. Pressure Stall Information (PSI)                                │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Tracks resource contention                               │  │
│     │                                                            │  │
│     │  cpu.pressure:                                            │  │
│     │  some avg10=5.23 avg60=3.14 avg300=1.87 total=123456      │  │
│     │                                                            │  │
│     │  memory.pressure:                                         │  │
│     │  some avg10=12.34 avg60=8.90 avg300=5.67 total=234567     │  │
│     │  full avg10=2.10 avg60=1.50 avg300=0.80 total=45678       │  │
│     │                                                            │  │
│     │  io.pressure:                                             │  │
│     │  some avg10=8.90 avg60=6.70 avg300=4.50 total=345678      │  │
│     │  full avg10=3.20 avg60=2.10 avg300=1.40 total=56789       │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Creating and Managing Cgroups:
// Managing a cgroup v2 directory directly from C
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>     // mkdir()
#include <sys/types.h>

void create_cgroup(const char *name) {
    char path[256];

    // Create cgroup directory
    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", name);
    mkdir(path, 0755);
}

void set_memory_limit(const char *name, const char *limit) {
    char path[256];
    int fd;

    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.max", name);
    fd = open(path, O_WRONLY);
    write(fd, limit, strlen(limit));
    close(fd);
}

void add_process_to_cgroup(const char *name, pid_t pid) {
    char path[256];
    char pid_str[32];
    int fd;

    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cgroup.procs", name);
    snprintf(pid_str, sizeof(pid_str), "%d", pid);

    fd = open(path, O_WRONLY);
    write(fd, pid_str, strlen(pid_str));
    close(fd);
}

int main() {
    create_cgroup("myapp");
    set_memory_limit("myapp", "512M");
    add_process_to_cgroup("myapp", getpid());

    // Process now limited to 512MB
    // Allocate memory and observe behavior

    return 0;
}
OOM Killer in Cgroups:
┌─────────────────────────────────────────────────────────────────────┐
│                    OOM KILLER IN CGROUPS                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  When cgroup exceeds memory.max:                                    │
│                                                                     │
│  1. Kernel triggers OOM killer                                      │
│  2. Selects victim ONLY from within the cgroup                      │
│  3. Score calculation (higher = more likely to kill):               │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  score = (rss + swap) * 1000 / total_memory                │  │
│     │  + oom_score_adj                                           │  │
│     │                                                             │  │
│     │  oom_score_adj range: -1000 to 1000                        │  │
│     │  -1000: disable OOM kill                                   │  │
│     │  0: default                                                │  │
│     │  1000: always kill first                                   │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  4. Kill victim process                                             │
│  5. Log to dmesg:                                                   │
│     "Memory cgroup out of memory: Killed process 1234 (app)"       │
│                                                                     │
│  Prevent OOM kill:                                                  │
│  echo "-1000" > /proc/<pid>/oom_score_adj                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

1.3 OverlayFS: The Layered Filesystem

Containers use Union Filesystems (like OverlayFS) to provide a writable layer on top of read-only image layers.
  1. LowerDir: Read-only layers (the Docker image).
  2. UpperDir: The writable layer where changes are stored.
  3. MergedDir: The unified view presented to the container.
  4. Copy-on-Write (CoW): When a container modifies a file that lives in the LowerDir, the kernel first copies it up to the UpperDir and then applies the change (a hands-on example follows).
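The mechanism is easy to exercise by hand with a few directories (a minimal sketch, run as root; the paths are arbitrary):
mkdir -p /tmp/overlay/{lower,upper,work,merged}
echo "from the image layer" > /tmp/overlay/lower/hello.txt

mount -t overlay overlay \
  -o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
  /tmp/overlay/merged

# Writes through the merged view trigger copy-up into upperdir;
# lowerdir is never modified
echo "modified" > /tmp/overlay/merged/hello.txt
cat /tmp/overlay/lower/hello.txt   # original content
cat /tmp/overlay/upper/hello.txt   # modified copy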

2. Virtualization: Emulating the Machine

Virtual Machines (VMs) take the isolation boundary down to the hardware level. Instead of sharing a kernel, they share the physical CPU and Memory.

2.1 The Hypervisor (VMM)

The Virtual Machine Monitor (VMM) is the software that manages guest execution.
  • Type 1 (Bare Metal): Runs directly on hardware (Xen, ESXi).
  • Type 2 (Hosted): Runs as an app on a host OS (KVM, VirtualBox). Note: KVM is unique because it turns the Linux kernel itself into a Type 1 hypervisor.

2.2 Hardware-Assisted Virtualization (VT-x / AMD-V)

Early virtualization used “Binary Translation” to replace privileged instructions. Modern CPUs handle this in hardware:
  • VMX Root Mode: The hypervisor runs here (full privileges).
  • VMX Non-Root Mode: The guest OS runs here. If the guest tries to execute a privileged instruction (like HLT or modifying CR3), the CPU triggers a VM Exit, trapping into the hypervisor to handle the event.

VMCS (Virtual Machine Control Structure)

The VMCS is a memory block that stores the “state” of a virtual CPU (registers, control bits). When switching from VM A to VM B, the hypervisor swaps the VMCS pointers.
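On Linux this machinery is exposed to user space through /dev/kvm, which VMMs such as QEMU and Firecracker drive with a handful of ioctls. The minimal sketch below only creates a VM and one vCPU and maps the vCPU's shared kvm_run region; error handling and guest memory setup are omitted, so it is an illustration rather than a working VMM:
// Minimal KVM sketch: create a VM, one vCPU, and map its kvm_run area
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm < 0) { perror("open /dev/kvm"); return 1; }

    printf("KVM API version: %d\n", ioctl(kvm, KVM_GET_API_VERSION, 0));

    int vmfd   = ioctl(kvm, KVM_CREATE_VM, 0);      // one VM = one set of guest resources
    int vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);   // vCPU 0; its VMCS lives in the kernel

    // Each vCPU has a shared kvm_run structure used to report exit reasons
    int run_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, run_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpufd, 0);

    // A real VMM would now install guest memory (KVM_SET_USER_MEMORY_REGION),
    // load code, set registers, and loop on ioctl(vcpufd, KVM_RUN, 0),
    // handling run->exit_reason (KVM_EXIT_IO, KVM_EXIT_HLT, ...) after each exit.
    printf("vcpu run area: %p (%d bytes)\n", (void *)run, run_size);

    munmap(run, run_size);
    close(vcpufd);
    close(vmfd);
    close(kvm);
    return 0;
}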

2.3 Memory Virtualization: EPT and NPT

In a VM, there are three types of addresses:
  1. Guest Virtual (GV)
  2. Guest Physical (GP)
  3. Host Physical (HP)
Shadow Page Tables (legacy): The hypervisor trapped guest page table updates and maintained a combined GV→HP table in software, which was extremely slow.

EPT (Extended Page Tables): Hardware handles the translation. The CPU walks a second set of page tables that map GP→HP, so each memory access becomes a "2D page walk" performed entirely in hardware.

3. The Middle Ground: MicroVMs

Plain containers have a large attack surface (the full Linux syscall interface, several hundred system calls). Plain VMs are slow and heavy to start. MicroVMs (like Firecracker) bridge the gap.

Firecracker Architecture

  • Minimalism: Removes all non-essential devices (no VGA, no USB, no sound).
  • VirtIO: Uses paravirtualized drivers for network and disk, avoiding the overhead of emulating real hardware registers.
  • Jailer: Firecracker itself runs inside a container (Namespaces + Cgroups) to provide “Defense in Depth.”
  • Performance: Can boot a Linux kernel in < 125 ms and run thousands of MicroVMs on a single host. A sample API interaction follows.
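Firecracker is driven over a REST API on a Unix socket. The calls below follow the shape of the getting-started flow; the socket path, file names, and exact field names are placeholders and should be checked against the Firecracker version in use:
# Assumes firecracker is already running with --api-sock /tmp/firecracker.socket

# Point the microVM at a kernel image
curl --unix-socket /tmp/firecracker.socket -X PUT 'http://localhost/boot-source' \
  -H 'Content-Type: application/json' \
  -d '{"kernel_image_path": "./vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1"}'

# Attach a root filesystem as a virtio block device
curl --unix-socket /tmp/firecracker.socket -X PUT 'http://localhost/drives/rootfs' \
  -H 'Content-Type: application/json' \
  -d '{"drive_id": "rootfs", "path_on_host": "./rootfs.ext4", "is_root_device": true, "is_read_only": false}'

# Boot it
curl --unix-socket /tmp/firecracker.socket -X PUT 'http://localhost/actions' \
  -H 'Content-Type: application/json' \
  -d '{"action_type": "InstanceStart"}'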

4. Comparison: When to Use What?

Feature      Containers               MicroVMs (Firecracker)       Full VMs (ESXi/KVM)
Isolation    Logical (Kernel)         Hardware (Minimal)           Hardware (Full)
Startup      < 1s                     < 1s                         > 10s
Payload      Process                  Kernel + Rootfs              Full OS
Security     Medium (Shared Kernel)   High                         Highest
Use Case     Microservices            Serverless / Multi-tenant    Legacy / Windows

5. Docker Internals: Putting It All Together

Docker is a high-level container runtime that orchestrates namespaces, cgroups, and OverlayFS.
┌─────────────────────────────────────────────────────────────────────┐
│                     DOCKER ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  User Space                                                          │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  docker CLI                                                     │ │
│  │  $ docker run -m 512m --cpus=0.5 nginx                         │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ REST API over Unix socket          │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  dockerd (Docker Daemon)                                        │ │
│  │  • Image management                                             │ │
│  │  • Volume management                                            │ │
│  │  • Network management                                           │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ gRPC                               │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  containerd (Container Runtime)                                 │ │
│  │  • Container lifecycle                                          │ │
│  │  • Image pulls/pushes                                           │ │
│  │  • Storage management                                           │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │                                    │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  containerd-shim (Per-container)                                │ │
│  │  • Keeps container running if containerd crashes                │ │
│  │  • Reports exit status                                          │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ fork/exec                          │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  runc (OCI Runtime)                                             │ │
│  │  • Creates namespaces                                           │ │
│  │  • Sets up cgroups                                              │ │
│  │  • Mounts overlay filesystem                                    │ │
│  │  • Executes container process                                   │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │                                    │
│  ════════════════════════════════╧═════════════════════════════════  │
│                                                                     │
│  Kernel Space                                                        │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Container Process                                              │ │
│  │  ┌───────────────────────────────────────────────────────────┐ │ │
│  │  │  PID, Mount, Network, UTS, IPC, User, Cgroup Namespaces   │ │ │
│  │  │  CPU, Memory, IO, PIDs Cgroups                             │ │ │
│  │  │  OverlayFS (LowerDir, UpperDir, WorkDir, MergedDir)        │ │ │
│  │  └───────────────────────────────────────────────────────────┘ │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Docker Container Creation Steps:
# 1. Pull image (if not cached)
docker pull nginx:latest

# 2. Create container
docker create --name web -p 80:80 nginx

# 3. Start container
docker start web

# What happens internally:
# a. containerd unpacks image layers
# b. runc creates namespaces (CLONE_NEWPID|CLONE_NEWNET|...)
# c. runc sets up cgroups (/sys/fs/cgroup/docker/<container-id>/)
# d. runc mounts OverlayFS
# e. runc configures network (veth pair, bridge)
# f. runc pivot_root to container filesystem
# g. runc executes CMD/ENTRYPOINT
Inspecting Docker Internals:
# View container's namespaces
docker inspect web | jq '.[0].State.Pid'  # Get PID
sudo ls -la /proc/<PID>/ns/
# lrwxrwxrwx 1 root root 0 pid:[4026532198]
# lrwxrwxrwx 1 root root 0 net:[4026532201]
# lrwxrwxrwx 1 root root 0 mnt:[4026532196]

# View cgroup limits
cat /sys/fs/cgroup/docker/<container-id>/memory.max
cat /sys/fs/cgroup/docker/<container-id>/cpu.max

# View OverlayFS layers
docker inspect web | jq '.[0].GraphDriver'
# {
#   "Data": {
#     "LowerDir": "/var/lib/docker/overlay2/abc.../diff",
#     "UpperDir": "/var/lib/docker/overlay2/def.../diff",
#     "WorkDir": "/var/lib/docker/overlay2/def.../work",
#     "MergedDir": "/var/lib/docker/overlay2/def.../merged"
#   }
# }
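nsenter runs a command inside an existing process's namespaces, which makes it handy for poking at a container from the host using the PID obtained above:
# Enter the container's namespaces from the host
PID=$(docker inspect web --format '{{.State.Pid}}')
sudo nsenter -t "$PID" -n ip addr     # network namespace: container's interfaces
sudo nsenter -t "$PID" -u hostname    # UTS namespace: container's hostname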

6. Interview Deep Dive: Senior Level

Question: How do user namespaces make containers safer?

Answer: User Namespaces (CLONE_NEWUSER) allow a process to have UID 0 (root) inside the container while being a non-privileged UID (e.g., 1000) on the host.

Security Improvement:
Without User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                         Host                    │
│  ┌──────────────┐                 ┌──────────────┐        │
│  │  UID 0       │ ═══════════════►│  UID 0       │        │
│  │  (root)      │                 │  (root)      │        │
│  └──────────────┘                 └──────────────┘        │
│                                                            │
│  If container breakout occurs:                            │
│  → Attacker has root on host!                             │
└───────────────────────────────────────────────────────────┘

With User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                         Host                    │
│  ┌──────────────┐                 ┌──────────────┐        │
│  │  UID 0       │ ───────────────►│  UID 1000    │        │
│  │  (root)      │     mapped      │  (user)      │        │
│  └──────────────┘                 └──────────────┘        │
│                                                            │
│  If container breakout occurs:                            │
│  → Attacker only has UID 1000 permissions                 │
│  → Cannot write to /etc, /boot, system files              │
│  → Cannot load kernel modules                             │
│  → Cannot access other users' files                       │
└───────────────────────────────────────────────────────────┘
Implementation:
# Run Docker with user namespace remapping
dockerd --userns-remap=default

# Or manually with unshare
unshare --user --map-root-user /bin/bash
Limitations:
  • Some operations still require host root (mounting certain filesystems)
  • File ownership can be confusing (files created by container appear owned by high UIDs on host)
  • Not all containers work with user namespaces (especially those requiring true root)
Question: Why did the kernel move from cgroups v1 to the unified cgroups v2 hierarchy?

Answer: Cgroups v1 Problems:
  1. Multiple Hierarchies:
    • Each controller (cpu, memory, io) has its own hierarchy
    • A process can be in /sys/fs/cgroup/cpu/groupA and /sys/fs/cgroup/memory/groupB
    • Impossible to do unified resource accounting
  2. Writeback Ambiguity:
    • Process in cgroup A writes to page cache
    • Page cache writeback happens later
    • Which cgroup gets charged for the disk I/O?
    • v1: Charged to whoever triggers writeback (wrong!)
  3. No Delegation:
    • Can’t safely give non-root users control over cgroups
    • Security issues with nested hierarchies
Cgroups v2 Solutions:
  1. Single Hierarchy:
    • One tree, all controllers
    • Process location is the same for all resources
    • Enables proper delegation and accounting
  2. Proper Attribution:
    • Tracks which cgroup dirtied pages
    • I/O charged correctly even if writeback delayed
  3. Pressure Stall Information (PSI):
    • Built-in resource pressure metrics
    • Can detect when cgroup is starved
Migration Example:
# v1: Multiple hierarchies
/sys/fs/cgroup/cpu/docker/container1/
/sys/fs/cgroup/memory/system/container1/

# v2: Single hierarchy
/sys/fs/cgroup/docker/container1/
# All controllers available here
Question: How does Docker's default bridge networking work under the hood?

Answer: Docker uses network namespaces + veth pairs + a Linux bridge.

Default Bridge Network:
1. Create network namespace for container
2. Create veth pair (virtual ethernet cable with two ends)
3. Move one end into container namespace
4. Attach other end to docker0 bridge
5. Configure IP addresses and routes
6. Setup iptables rules for NAT
Detailed Flow:
# Container sends packet to 8.8.8.8:53
1. eth0@container (172.17.0.2) → veth pair
2. vethXXX@host → docker0 bridge (172.17.0.1)
3. SNAT: 172.17.0.2 → 192.168.1.100 (host IP)
4. eth0@host → Internet

# Response
1. Internet → eth0@host
2. DNAT: 192.168.1.100 → 172.17.0.2
3. docker0 bridge → vethXXX@host
4. veth pair → eth0@container
Port Mapping:
docker run -p 8080:80 nginx

# iptables rule created:
iptables -t nat -A DOCKER -p tcp --dport 8080 \
  -j DNAT --to-destination 172.17.0.2:80
Network Modes:
Mode           Description                        Use Case
bridge         Default, isolated network          Normal containers
host           Share host network namespace       High performance
none           No network                         Security isolation
container:ID   Share another container's netns    Sidecars
Pseudocode (the ip_* helpers below are illustrative stand-ins for the netlink calls a runtime makes, not a real API):
// Simplified Docker network setup

// 1. Create veth pair
ip_link_add("veth0", "veth1", VETH);

// 2. Move one end to namespace
ip_link_set_ns("veth1", container_netns);

// 3. Attach to bridge
ip_link_set_master("veth0", "docker0");

// 4. Configure IPs
ip_addr_add("172.17.0.2/16", "veth1", container_netns);
ip_route_add("default via 172.17.0.1", container_netns);
Question: What is a VM Exit, and why is it expensive?

Answer: A VM Exit occurs when the guest OS performs an action that requires hypervisor intervention (e.g., I/O, CPUID, or accessing certain registers).

VM Exit Flow:
┌─────────────────────────────────────────────────────────────┐
│                      VM EXIT OVERHEAD                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest (VM)                                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. Execute privileged instruction (e.g., IN/OUT)    │   │
│  │     or access protected resource                      │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ CPU Trap                         │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  2. Hardware VM Exit                                  │   │
│  │     • Save guest state to VMCS (registers, RIP, etc.) │   │
│  │     • Load host state from VMCS                       │   │
│  │     • Jump to hypervisor entry point                  │   │
│  │     • Time: ~1000-3000 cycles                         │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  Hypervisor                                                  │
│  ┌────────────────────────▼─────────────────────────────┐   │
│  │  3. Handle VM Exit                                    │   │
│  │     • Inspect exit reason                             │   │
│  │     • Emulate instruction (e.g., read port 0x3F8)     │   │
│  │     • Update guest state                              │   │
│  │     • Time: 500-2000 cycles                           │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  4. Resume Guest                                      │   │
│  │     • Load guest state from VMCS                      │   │
│  │     • Switch to VMX non-root mode                     │   │
│  │     • Continue guest execution                        │   │
│  │     • Time: ~1000-2000 cycles                         │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Total Overhead: 2500-7000 cycles (1-3 microseconds)        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Common VM Exit Causes:
Cause                       Frequency   Mitigation
I/O instructions (IN/OUT)   High        Use virtio (paravirtualization)
CPUID                       Medium      Cache results in guest
CR3 writes (page table)     High        Use EPT (hardware MMU)
Interrupts                  Very High   APIC virtualization
MSR access                  Medium      Use MSR bitmaps
HLT (idle)                  Low         Acceptable (CPU is idle anyway)
Optimization Strategies:
  1. EPT (Extended Page Tables):
    • Guest can change CR3 without VM exit
    • Hardware handles GVA → GPA → HPA translation
  2. APIC Virtualization:
    • Virtual APIC page in guest memory
    • Most interrupt operations happen without exits
  3. VirtIO:
    • Paravirtualized drivers
    • Shared memory rings reduce I/O exits
Modern virtualization aims to minimize VM Exits using features like APIC Virtualization and EPT.
Question: What happens under nested virtualization, and why is it slow?

Answer: Nested virtualization is running a VM inside another VM (e.g., running KVM guests inside a cloud instance).

Address Translation Complexity:
┌─────────────────────────────────────────────────────────────┐
│                NESTED VIRTUALIZATION                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  L2 Guest (innermost VM)                                     │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GVA (Guest Virtual Address)                          │   │
│  │  e.g., 0x400000 (program address)                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ L2 Page Tables                   │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GPA-L2 (Guest Physical Address of L2)                │   │
│  │  e.g., 0x80000000                                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  L1 Hypervisor (middle VM)                                   │
│                           │ L1 EPT                           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GPA-L1 (Guest Physical Address of L1)                │   │
│  │  e.g., 0x100000000                                    │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  L0 Hypervisor (host)                                        │
│                           │ L0 EPT                           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  HPA (Host Physical Address)                          │   │
│  │  e.g., 0x200000000 (actual RAM)                       │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Without Nested EPT:                                         │
│  → Each memory access requires 4 page walks                 │
│  → GVA→GPA-L2: 4 walks                                      │
│  → Each GPA-L2 access needs GPA-L2→HPA translation          │
│  → Total: 4 + (4 × 4) = 20 memory accesses!                 │
│                                                              │
│  With Nested EPT (Intel):                                    │
│  → Hardware combines L1 and L0 EPT                          │
│  → Still slower than native, but manageable                  │
│  → ~2-3x overhead instead of 10x+                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Performance Impact:
Operation         Native    L1 VM    Nested L2 VM
Memory Access     1x        1.1x     2-3x
I/O               1x        2x       4-6x
Context Switch    1x        1.5x     3-4x
When to Use Nested Virtualization:
  1. Development/Testing:
    • Test hypervisor code
    • CI/CD pipelines testing VMs
  2. Cloud Services:
    • Kubernetes on cloud VMs
    • CI runners in cloud
  3. Education:
    • Teaching virtualization
    • Lab environments
Avoid for:
  • Production databases
  • High-performance computing
  • Latency-sensitive applications
The main challenge is the “Level 2” Guest Physical to “Level 0” Host Physical translation. This requires either complex shadow page table merging or hardware support for Nested EPT, which can significantly degrade memory performance due to the exponentially more complex page walks.
Question: How does OverlayFS implement image layers and copy-on-write?

Answer: OverlayFS provides a union mount where multiple layers are combined into a single view.

Layer Structure:
┌─────────────────────────────────────────────────────────────┐
│                    OVERLAYFS LAYERS                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Container View (MergedDir)                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  /bin/bash      ← from Base Layer                     │   │
│  │  /etc/nginx/    ← from Nginx Layer                    │   │
│  │  /var/log/app   ← from Container Layer (writable)     │   │
│  │  /app/config    ← from Container Layer (modified)     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  ═════════════════════════╧═════════════════════════════     │
│                                                              │
│  UpperDir (Writable Container Layer)                         │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  /var/log/app           ← New file                    │   │
│  │  /app/config            ← Modified file               │   │
│  │  .wh.oldfile            ← Whiteout (deleted file)     │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  LowerDir (Read-Only Image Layers)                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Layer 3: Nginx Layer                                 │   │
│  │  /etc/nginx/nginx.conf                                │   │
│  │  /usr/sbin/nginx                                      │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Layer 2: App Dependencies                            │   │
│  │  /usr/lib/libssl.so                                   │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Layer 1: Base Ubuntu                                 │   │
│  │  /bin/bash                                            │   │
│  │  /etc/passwd                                          │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Copy-on-Write Operations:

1. Read File:
open("/etc/nginx/nginx.conf", O_RDONLY)
→ OverlayFS checks UpperDir: not found
→ Falls through to LowerDir: found in Nginx layer
→ Returns file from LowerDir (no copy needed)
2. Modify File:
open("/etc/nginx/nginx.conf", O_WRONLY)
→ OverlayFS checks UpperDir: not found
→ Copy file from LowerDir to UpperDir (copy-up)
→ Open file in UpperDir for writing
→ Future reads will use UpperDir version
3. Delete File:
unlink("/etc/nginx/nginx.conf")
→ OverlayFS creates whiteout file in UpperDir
→ UpperDir/.wh.nginx.conf (marks file as deleted)
→ LowerDir file remains (other containers unaffected)
→ MergedDir view hides the file
4. Create New File:
open("/var/log/app.log", O_CREAT)
→ OverlayFS creates file directly in UpperDir
→ No interaction with LowerDir needed
Mount Command:
mount -t overlay overlay \
  -o lowerdir=/lower1:/lower2:/lower3,upperdir=/upper,workdir=/work \
  /merged
Benefits:
  • Shared base layers save disk space
  • Fast container startup (no copying)
  • Efficient use of cache (shared pages)
Performance Considerations:
  • First write to file triggers copy-up (can be slow for large files)
  • Many layers slow down lookup
  • Whiteouts can accumulate (use docker system prune)
Question: How do containers and VMs compare from a security standpoint?

Answer: Container Security (Shared Kernel):

Pros:
  • Faster startup and lower overhead
  • Easier management
Cons:
  1. Kernel Exploits:
    Container → Kernel Vulnerability → Host Compromise
    
    Example: Dirty COW (CVE-2016-5195)
    - Container can exploit kernel bug
    - Gain root on host
    - Escape to host system
    
  2. Large Attack Surface:
    ~300+ system calls exposed
    Any syscall vulnerability affects all containers
    
    Mitigation: seccomp-bpf filters
    → Block dangerous syscalls
    → Reduce attack surface
    
  3. Resource Exhaustion:
    Without cgroups:
    Container A → Allocate all memory → OOM kills Container B
    
    With cgroups:
    Container A → Hit memory.max → OOM kills processes in A only
    
  4. Information Leakage:
    /proc and /sys expose kernel information
    - /proc/kallsyms (kernel symbols)
    - /sys/kernel/debug (debug info)
    
    Mitigation: Mount with hidepid, remove sensitive mounts
    
VM Security (Separate Kernel):

Pros:
  1. Strong Isolation:
    VM → Hypervisor → Host
    
    Attack path requires:
    1. Exploit in guest kernel
    2. VM escape vulnerability
    3. Hypervisor exploit
    
    Much harder than container escape
    
  2. Smaller Attack Surface:
    VM → Hypervisor interface is small
    - Hypercalls (10-20 vs 300+ syscalls)
    - Device emulation
    - Much less code to attack
    
  3. Different Kernels:
    Can run different kernel versions
    Old vulnerable kernel in VM doesn't affect host
    
Comparison Table:
Aspect                 Containers              VMs
Kernel isolation       Shared                  Separate
Escape difficulty      Medium                  Hard
Attack surface         Large (~300 syscalls)   Small (~20 hypercalls)
Vulnerability impact   Affects host            Contained to VM
Performance overhead   ~2%                     ~5-10%
Startup time           Under 1s                10-30s
Best Practices:

For Containers:
# 1. Use user namespaces
--userns-remap=default

# 2. Drop capabilities
--cap-drop=ALL --cap-add=NET_BIND_SERVICE

# 3. Seccomp filter
--security-opt seccomp=default.json

# 4. AppArmor/SELinux
--security-opt apparmor=docker-default

# 5. Read-only root
--read-only --tmpfs /tmp

# 6. No privileged mode
# NEVER use --privileged in production!
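
Combining the run-time flags above, a hardened invocation might look like the sketch below (image, profile path and resource limits are placeholders):

# Illustrative hardened container; adjust image, profile and limits to your workload
docker run -d \
  --cap-drop=ALL --cap-add=NET_BIND_SERVICE \
  --security-opt seccomp=/path/to/profile.json \
  --security-opt apparmor=docker-default \
  --security-opt no-new-privileges:true \
  --read-only --tmpfs /tmp \
  --pids-limit 256 --memory 512m --cpus 0.5 \
  nginx:alpine
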
For VMs:

# 1. Minimal device emulation
#    Use virtio devices instead of emulated hardware (e1000, IDE, ...)

# 2. Disable unnecessary devices
-nodefaults -vga none        # and don't attach audio/USB devices you don't need

# 3. Use KVM (hardware virtualization)
-enable-kvm

# 4. Memory ballooning
-device virtio-balloon

# 5. vTPM for measured boot
-tpmdev emulator             # paired with an swtpm chardev and a tpm-tis/tpm-crb device
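
As a rough end-to-end sketch (disk image, memory size and tap device are assumptions, not part of the recommendations above):

# Illustrative QEMU/KVM invocation using virtio devices only
qemu-system-x86_64 \
  -enable-kvm -cpu host -m 2048 \
  -nodefaults -vga none -nographic \
  -drive file=guest.img,if=virtio,format=qcow2 \
  -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
  -device virtio-net-pci,netdev=net0 \
  -device virtio-balloon
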
Hybrid Approach (Kata Containers):
Container API → Lightweight VM → Strong isolation

Benefits:
- Container-like UX
- VM-like security
- ~50-100ms startup (vs 10s for traditional VM)
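
Assuming Kata Containers is installed and registered with Docker under the runtime name kata-runtime, the separate guest kernel is directly observable:

uname -r                                                   # host kernel
docker run --rm alpine uname -r                            # runc: same kernel as the host
docker run --rm --runtime=kata-runtime alpine uname -r     # Kata: the microVM's guest kernel
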
Answer: Device Emulation (Full Virtualization)

The guest believes it is talking to real hardware; the hypervisor traps and emulates every register read/write.
┌─────────────────────────────────────────────────────────────┐
│                  DEVICE EMULATION                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest VM                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. Guest writes to e1000 NIC register                │   │
│  │     outl(0xC000, ETH_TX_DESC)                         │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Exit (I/O port access)        │
│                           ▼                                  │
│  Hypervisor (QEMU/KVM)                                       │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  2. Trap I/O operation                                │   │
│  │     Decode: write to port 0xC000, value 0x12345678    │   │
│  │                                                        │   │
│  │  3. Emulate e1000 device logic                        │   │
│  │     - Update virtual NIC state                        │   │
│  │     - Copy packet from guest memory                   │   │
│  │     - Send to host TAP device                         │   │
│  │                                                        │   │
│  │  4. Return to guest                                   │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  Guest continues...                                          │
│                                                              │
│  Problem: EVERY register access causes VM Exit!              │
│  → Thousands of exits per packet                            │
│  → 10x+ overhead                                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Paravirtualization (VirtIO)

The guest knows it is virtualized and cooperates through an efficient shared-memory interface instead of emulated registers.
┌─────────────────────────────────────────────────────────────┐
│                  PARAVIRTUALIZATION                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest VM                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. VirtIO driver in guest                            │   │
│  │     - Shared memory ring (vring)                      │   │
│  │     - No register emulation needed                    │   │
│  │                                                        │   │
│  │  2. Write packet to shared ring                       │   │
│  │     vring[idx] = packet_buffer                        │   │
│  │     idx++                                             │   │
│  │                                                        │   │
│  │  3. Kick hypervisor (single VM exit)                  │   │
│  │     kick_notify()                                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ Single VM Exit                   │
│                           ▼                                  │
│  Hypervisor                                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  4. Process all pending packets                       │   │
│  │     while (vring has packets) {                       │   │
│  │       packet = vring[idx]                             │   │
│  │       send_to_tap(packet)                             │   │
│  │       idx++                                           │   │
│  │     }                                                 │   │
│  │                                                        │   │
│  │  5. Return to guest                                   │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  Guest continues...                                          │
│                                                              │
│  Benefit: One VM exit for multiple packets!                  │
│  → Near-native performance                                  │
│  → <5% overhead                                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Performance Comparison:

Operation            Emulation   VirtIO     Native
───────────────────  ──────────  ─────────  ────────
Network throughput   1 Gbps      9.5 Gbps   10 Gbps
Disk IOPS            5,000       45,000     50,000
VM exits per packet  100-1000    1-2        0
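
On the host, VM-exit behavior like this can be measured with perf's KVM support (the pgrep pattern below assumes a single QEMU-based guest; otherwise pass an explicit PID):

# Record VM exits of a running guest for 10 seconds, then summarize by exit reason
sudo perf kvm stat record -p "$(pgrep -f qemu-system)" -- sleep 10
sudo perf kvm stat report
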
VirtIO Ring Structure (simplified split-ring layout; in guest memory the descriptor table comes first, then the available ring, then the used ring, separated by alignment padding):

struct vring {
    // Descriptor table: describes guest buffers
    struct vring_desc {
        uint64_t addr;   // Guest physical address of the buffer
        uint32_t len;    // Buffer length
        uint16_t flags;  // VRING_DESC_F_NEXT, VRING_DESC_F_WRITE, ...
        uint16_t next;   // Next descriptor in a chain
    } desc[QUEUE_SIZE];  // QUEUE_SIZE is fixed at device setup (a power of two)

    // Available ring: the guest (driver) writes here to publish descriptors
    struct vring_avail {
        uint16_t flags;
        uint16_t idx;               // Incremented by the guest after adding entries
        uint16_t ring[QUEUE_SIZE];  // Indices into the descriptor table
    } avail;

    // Used ring: the hypervisor (device) writes here when buffers are consumed
    struct vring_used {
        uint16_t flags;
        uint16_t idx;               // Incremented by the device
        struct vring_used_elem {
            uint32_t id;   // Head descriptor index of the completed chain
            uint32_t len;  // Bytes written into the buffer
        } ring[QUEUE_SIZE];
    } used;
};
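
Inside a guest using virtio-net, the ring sizes and queue counts negotiated for an interface can be inspected with ethtool (interface name eth0 is an assumption):

ethtool -g eth0    # RX/TX ring sizes
ethtool -l eth0    # number of queues (multiqueue virtio-net)
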
Tradeoffs:

Emulation:
  • Pros: No guest modification, runs any OS
  • Cons: Slow, many VM exits
Paravirtualization:
  • Pros: Fast, few VM exits
  • Cons: Requires guest support (modified drivers)
Modern Approach:
  • Use paravirt for performance-critical devices (disk, network)
  • Use emulation for legacy devices (VGA, PS/2)
  • Gradually reduce emulation over time
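
From inside a guest you can check which devices took the paravirtual path and which are still emulated (sketch assumes a PCI-based x86 guest with lspci installed):

# PCI devices and the kernel driver bound to each one
lspci -k

# Devices registered on the virtio bus
ls /sys/bus/virtio/devices/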

6. Namespaces & Cgroups: A Single Process’s Perspective

What does a process actually “see” when it’s containerized? Here’s the view from inside:

What Changes for the Process

┌─────────────────────────────────────────────────────────────────────┐
│     PROCESS VIEW: BEFORE vs AFTER CONTAINERIZATION                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  BEFORE (Host Process)                AFTER (Containerized)         │
│  ────────────────────                 ─────────────────────         │
│                                                                     │
│  PID: 12345                           PID: 1  (thinks it's init!)   │
│  UID: 1000                            UID: 0  (root in container)   │
│  Hostname: myserver                   Hostname: container-abc       │
│  /proc: sees all processes            /proc: sees only self         │
│  Network: eth0 (192.168.1.5)          Network: eth0 (172.17.0.2)    │
│  Filesystem: /home/user/...           Filesystem: / (isolated root) │
│  Memory: unlimited                    Memory: 512MB max (cgroup)    │
│  CPU: all cores                       CPU: 50% of 1 core (cgroup)   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
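
The "after" column can be reproduced without any container runtime using unshare from util-linux (the hostname is illustrative; cgroup limits would be applied separately):

# New UTS, PID and mount namespaces; --fork + --mount-proc make the shell PID 1 with a private /proc
sudo unshare --uts --pid --fork --mount-proc sh -c '
  hostname container-abc
  hostname            # container-abc
  echo "PID: $$"      # 1
  ps -e               # only the processes in this namespace
'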

Inspecting Your Own Namespace

# Inside a container (or any process), see your namespaces:
ls -la /proc/self/ns/
# lrwxrwxrwx 1 root root 0 ... pid -> pid:[4026532198]
# lrwxrwxrwx 1 root root 0 ... net -> net:[4026532201]
# lrwxrwxrwx 1 root root 0 ... mnt -> mnt:[4026532196]
# ...

# Compare with the host (different inode number = different namespace)
# Host:      pid:[4026531836]
# Container: pid:[4026532198]  ← Different!

# See your cgroup limits
cat /sys/fs/cgroup/memory.max      # Memory limit
cat /sys/fs/cgroup/cpu.max         # CPU limit ("<quota> <period>" in µs)
cat /sys/fs/cgroup/pids.max        # Max processes

# See your cgroup resource usage
cat /sys/fs/cgroup/memory.current  # Current memory usage
cat /sys/fs/cgroup/cpu.stat        # CPU time consumed

The Process Doesn’t Know It’s Contained

// This code behaves identically on host or in container:
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("My PID: %d\n", getpid());        // 1 in container, 12345 on host
    printf("My UID: %d\n", getuid());        // 0 in container (fake root)

    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("Hostname: %s\n", hostname);      // "container-abc" in container

    // Process has no idea it's in a container!
    // All syscalls return "virtualized" results
    return 0;
}
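
One way to see the difference is to compile the snippet once and run it on the host and inside a container (the file name nsdemo.c and the gcc image are placeholders):

gcc -o nsdemo nsdemo.c && ./nsdemo        # host view: real PID, real UID, host hostname

# Same source, compiled and run inside a container; exec makes nsdemo PID 1:
docker run --rm -v "$PWD":/src -w /src gcc:latest \
  sh -c 'gcc -o nsdemo nsdemo.c && exec ./nsdemo'
# → My PID: 1, My UID: 0, Hostname: <container id>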

Key Insight: Syscalls Are Virtualized

Every syscall that returns information about the system goes through namespace translation:
Syscall               Host Returns    Container Returns
────────────────────  ──────────────  ──────────────────────
getpid()              12345           1
getuid()              1000            0 (mapped root)
uname()               myserver        container-abc
readdir(/proc)        All PIDs        Only container PIDs
socket(AF_INET, ...)  Host network    Container network

7. Advanced Practice

  1. Manual Namespace Build: Use the unshare command to create a shell with its own network and PID namespace. Try to ping the host.
  2. Cgroup Stress Test: Create a cgroup v2 with a 100MB memory limit. Run a program that allocates 200MB and observe the kernel’s OOM killer logs in dmesg.
  3. VirtIO Analysis: Run a KVM guest and use lspci inside the guest to identify which devices are using virtio drivers vs. emulated hardware.

Next: OS Security & Hardening