Containers & Virtualization

Isolation is the core requirement of multi-tenant cloud computing. Whether you are running a SaaS platform or a microservices cluster, you must ensure that processes are contained, resources are metered, and security boundaries are enforced. Modern systems achieve this through two distinct paths: OS-level virtualization (Containers) and Hardware-level virtualization (VMs).
Mastery Level: Senior Systems Engineer
Key Internals: CLONE_NEW*, Cgroups v2 Unified Hierarchy, VMCS, EPT/SLAT, Firecracker MicroVMs
Prerequisites: Process Internals, Memory Management

1. Container Internals: The Linux “Trio”

A container is not a “thing” in the Linux kernel. It is a user-space abstraction built using three primary kernel features: Namespaces, Control Groups (cgroups), and Union Filesystems.
┌─────────────────────────────────────────────────────────────────────┐
│                     CONTAINER ARCHITECTURE                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Container Runtime (Docker/containerd/cri-o)                        │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  Container 1           Container 2           Container 3        │ │
│  │  ┌──────────┐          ┌──────────┐          ┌──────────┐      │ │
│  │  │ App      │          │ App      │          │ App      │      │ │
│  │  │ (nginx)  │          │ (redis)  │          │(postgres)│      │ │
│  │  └────┬─────┘          └────┬─────┘          └────┬─────┘      │ │
│  │       │                     │                     │             │ │
│  └───────┼─────────────────────┼─────────────────────┼─────────────┘ │
│          │                     │                     │               │
│  ════════╧═════════════════════╧═════════════════════╧═════════════  │
│                                                                     │
│  Linux Kernel Features                                              │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌────────────────┐ │ │
│  │  │   Namespaces    │  │    Cgroups      │  │  Union FS      │ │ │
│  │  │                 │  │                 │  │  (OverlayFS)   │ │ │
│  │  │ • PID           │  │ • CPU           │  │                │ │ │
│  │  │ • Mount         │  │ • Memory        │  │ • LowerDir     │ │ │
│  │  │ • Network       │  │ • PIDs          │  │ • UpperDir     │ │ │
│  │  │ • UTS           │  │ • Blkio         │  │ • MergedDir    │ │ │
│  │  │ • IPC           │  │ • Devices       │  │ • WorkDir      │ │ │
│  │  │ • User          │  │ • Network       │  │                │ │ │
│  │  │ • Cgroup        │  │                 │  │                │ │ │
│  │  │ • Time          │  │                 │  │                │ │ │
│  │  └─────────────────┘  └─────────────────┘  └────────────────┘ │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  ════════════════════════════════════════════════════════════════   │
│                                                                     │
│  Shared Linux Kernel                                                │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  System Calls, Process Scheduler, Memory Management, Drivers   │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

1.1 Namespaces: The Illusion of Isolation

Namespaces wrap global system resources in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource.
| Namespace | Flag | Isolated Resource |
|-----------|------|-------------------|
| Mount | CLONE_NEWNS | Filesystem mount points (independent mount/umount). |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name. |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues. |
| PID | CLONE_NEWPID | Process IDs (Process 1 inside the container). |
| Network | CLONE_NEWNET | Network devices, stacks, ports, firewalls. |
| User | CLONE_NEWUSER | User and group IDs (Root in container != Root on host). |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory view. |
| Time | CLONE_NEWTIME | System boot and monotonic clocks. |

Deep Dive: PID Namespace

The PID namespace creates a hierarchical process view where each namespace has its own PID 1.
┌─────────────────────────────────────────────────────────────────────┐
│                      PID NAMESPACE HIERARCHY                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host (Initial PID Namespace)                                       │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  PID 1: /sbin/init (systemd)                                   │ │
│  │  PID 523: dockerd                                              │ │
│  │  PID 1024: container-init   ←────┐                             │ │
│  │  PID 1025: nginx (worker)   ←────┼─┐                           │ │
│  │  PID 1026: nginx (worker)   ←────┼─┼─┐                         │ │
│  │                                   │ │ │                         │ │
│  └───────────────────────────────────┼─┼─┼─────────────────────────┘ │
│                                      │ │ │                           │
│  Container PID Namespace             │ │ │                           │
│  ┌──────────────────────────────────┼─┼─┼─────────────────────────┐ │
│  │                                  │ │ │                         │ │
│  │  PID 1: /init  ──────────────────┘ │ │  (maps to host 1024)   │ │
│  │  PID 2: nginx master ───────────────┘ │  (maps to host 1025)   │ │
│  │  PID 3: nginx worker ─────────────────┘  (maps to host 1026)   │ │
│  │                                                                 │ │
│  │  Processes see only PIDs 1, 2, 3                               │ │
│  │  Cannot see or signal host processes                           │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Implementation Details:
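#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs as the first process of the new PID namespace
static int child_fn(void *arg) {
    printf("My PID: %d\n", getpid());  // Will print: 1
    fflush(stdout);                    // flush before fork to avoid duplicated output

    // Fork a child
    pid_t child = fork();
    if (child == 0) {
        printf("Child PID: %d\n", getpid());  // Will print: 2
        exit(0);
    }
    waitpid(child, NULL, 0);
    return 0;
}

int main(void) {
    static char child_stack[1024 * 1024];
    // Creating a PID namespace (needs CAP_SYS_ADMIN); pass the top of the stack
    int clone_flags = CLONE_NEWPID | SIGCHLD;
    pid_t pid = clone(child_fn, child_stack + sizeof(child_stack), clone_flags, NULL);
    if (pid == -1) { perror("clone"); return 1; }
    waitpid(pid, NULL, 0);
    return 0;
}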
Key Properties:
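  • The first process in the namespace becomes PID 1
  • If PID 1 exits, the kernel sends SIGKILL to every remaining process in the namespace
  • The parent namespace can see child processes with their “real” PIDs
  • /proc shows only the processes of the current namespace once a fresh procfs is mounted inside it, which is why a PID namespace is normally paired with a mount namespace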

Deep Dive: Network Namespace

Network namespaces isolate the network stack: devices, routing tables, firewall rules, sockets.
┌─────────────────────────────────────────────────────────────────────┐
│                    NETWORK NAMESPACE TOPOLOGY                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host Network Namespace                                             │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │ │
│  │  │   eth0   │  │  veth0   │  │  veth2   │  │  veth4   │       │ │
│  │  │ (physical)  │  (host)  │  │  (host)  │  │  (host)  │       │ │
│  │  └─────┬────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘       │ │
│  │        │            │             │             │              │ │
│  │        │       ┌────┴─────────────┴─────────────┴────┐         │ │
│  │        │       │         docker0 (bridge)            │         │ │
│  │        │       │         172.17.0.1/16               │         │ │
│  │        │       └─────────────────────────────────────┘         │ │
│  │        │                                                        │ │
│  │    [Internet]                                                  │ │
│  │                                                                 │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                           │             │             │             │
│                           │             │             │             │
│  Container 1 Netns        │             │             │             │
│  ┌───────────────────────┼─────────────┘             │             │
│  │  ┌────────────────────▼────┐                      │             │
│  │  │  eth0 (container view)  │                      │             │
│  │  │  veth1 (actual peer)    │                      │             │
│  │  │  172.17.0.2/16          │                      │             │
│  │  └─────────────────────────┘                      │             │
│  │  Route: default via 172.17.0.1                    │             │
│  └────────────────────────────────────────────────────┘             │
│                                                       │             │
│  Container 2 Netns                                    │             │
│  ┌───────────────────────────────────────────────────┼─────────────┘
│  │  ┌────────────────────────────────────────────────▼────┐         │
│  │  │  eth0 (container view)                             │         │
│  │  │  veth3 (actual peer)                               │         │
│  │  │  172.17.0.3/16                                     │         │
│  │  └────────────────────────────────────────────────────┘         │
│  │  Route: default via 172.17.0.1                                  │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Creating veth pairs:
# Create network namespace
ip netns add container1

# Create veth pair
ip link add veth0 type veth peer name veth1

# Move one end into namespace
ip link set veth1 netns container1

# Configure host end
ip addr add 172.17.0.1/16 dev veth0
ip link set veth0 up

# Configure container end
ip netns exec container1 ip addr add 172.17.0.2/16 dev veth1
ip netns exec container1 ip link set veth1 up
ip netns exec container1 ip route add default via 172.17.0.1

# Test connectivity (send three probes, then stop)
ip netns exec container1 ping -c 3 172.17.0.1
Code Example: Creating Network Namespace
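#define _GNU_SOURCE
#include <net/if.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int child_fn(void *arg) {
    (void)arg;
    // Now in new network namespace

    // List network interfaces
    struct if_nameindex *if_ni = if_nameindex();
    if (if_ni) {
        for (int i = 0; if_ni[i].if_index != 0; i++) {
            printf("Interface: %s (index %u)\n",
                   if_ni[i].if_name, if_ni[i].if_index);
        }
        if_freenameindex(if_ni);
    }

    // Only loopback exists in new namespace (and it starts out down)
    fflush(stdout);  // flush explicitly: the clone child may exit without stdio cleanup
    return 0;
}

int main(void) {
    const int STACK_SIZE = 65536;
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL) { perror("malloc"); return 1; }

    // The stack grows down, so pass its top; CLONE_NEWNET needs CAP_SYS_ADMIN
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_NEWNET | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); free(stack); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}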

Deep Dive: Mount Namespace

Mount namespaces isolate the filesystem mount points.
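// Create mount namespace (requires CAP_SYS_ADMIN)
unshare(CLONE_NEWNS);

// The new namespace starts with a copy of the parent's mounts, including
// their propagation type. Mark everything private so later mounts stay local.
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);

// This mount is now private to this namespace
mount("/dev/sda1", "/mnt", "ext4", 0, NULL);

// Other namespaces won't see this mount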
Mount Propagation:
┌─────────────────────────────────────────────────────────────────────┐
│                     MOUNT PROPAGATION TYPES                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  MS_SHARED: Mounts propagate bidirectionally                        │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Namespace A   │ ◄───────────────► │  Namespace B   │           │
│  │  mount /foo    │    propagates      │  sees /foo     │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_PRIVATE: Mounts don't propagate (default in containers)         │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Namespace A   │  X  no sharing  X  │  Namespace B   │           │
│  │  mount /foo    │                    │  no /foo       │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_SLAVE: Receives mounts from master, but doesn't send            │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │  Master        │ ─────────────────► │  Slave         │           │
│  │  mount /foo    │    one-way         │  sees /foo     │           │
│  └────────────────┘ ◄─────X────────────└────────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Deep Dive: User Namespace

User namespaces allow mapping UIDs/GIDs, enabling rootless containers.
┌─────────────────────────────────────────────────────────────────────┐
│                      USER NAMESPACE MAPPING                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Host                              Container                        │
│  ┌──────────────────┐              ┌──────────────────┐            │
│  │                  │              │                  │            │
│  │  UID 0 (root)    │ ────X────►  │  Not mapped      │            │
│  │  UID 1000 (user) │ ───────────► │  UID 0 (root)    │            │
│  │  UID 1001        │ ───────────► │  UID 1           │            │
│  │  UID 1002        │ ───────────► │  UID 2           │            │
│  │  ...             │              │  ...             │            │
│  │  UID 65535       │ ───────────► │  UID 64535       │            │
│  │                  │              │                  │            │
│  └──────────────────┘              └──────────────────┘            │
│                                                                     │
│  Configuration: /proc/<pid>/uid_map                                │
│  Format: <container_id> <host_id> <range>                          │
│  Example: 0 1000 65536                                             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Setting up User Namespace:
# Create user namespace
unshare --user --map-root-user /bin/bash

# Inside namespace
id  # uid=0(root) gid=0(root)

# But on host, this process runs as your regular user
# File operations as "root" in container map to your UID on host
Code Example:
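#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static int sync_pipe[2];  // parent closes the write end once the maps are written

// Write a short string to a /proc file; returns 0 on success
static int write_file(const char *path, const char *data) {
    int fd = open(path, O_WRONLY);
    if (fd == -1) { perror(path); return -1; }
    ssize_t n = write(fd, data, strlen(data));
    if (n == -1) perror(path);
    close(fd);
    return n == -1 ? -1 : 0;
}

void setup_uid_map(pid_t pid) {
    char path[256];
    char map[64];

    // Map container root (0) to the invoking host user (1000 in the diagram)
    snprintf(path, sizeof(path), "/proc/%d/uid_map", pid);
    snprintf(map, sizeof(map), "0 %d 1", (int)getuid());
    write_file(path, map);

    // An unprivileged writer must deny setgroups(2) before writing gid_map
    snprintf(path, sizeof(path), "/proc/%d/setgroups", pid);
    write_file(path, "deny");

    // Same for GID
    snprintf(path, sizeof(path), "/proc/%d/gid_map", pid);
    snprintf(map, sizeof(map), "0 %d 1", (int)getgid());
    write_file(path, map);
}

int child_fn(void *arg) {
    char c;
    close(sync_pipe[1]);
    read(sync_pipe[0], &c, 1);  // returns at EOF, once the parent has written the maps
    printf("UID in container: %d\n", getuid());  // 0
    printf("GID in container: %d\n", getgid());  // 0
    fflush(stdout);
    return 0;
}

int main(void) {
    static char stack[1024 * 1024];
    if (pipe(sync_pipe) == -1) { perror("pipe"); return 1; }
    pid_t pid = clone(child_fn, stack + sizeof(stack), CLONE_NEWUSER | SIGCHLD, NULL);
    if (pid == -1) { perror("clone"); return 1; }
    setup_uid_map(pid);
    close(sync_pipe[1]);  // release the child
    waitpid(pid, NULL, 0);
    return 0;
}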

Deep Dive: IPC Namespace

IPC namespaces isolate System V IPC objects and POSIX message queues.
# In host namespace
ipcmk -Q  # Create message queue
ipcs -q   # List queues - visible

# In new IPC namespace
unshare --ipc ipcs -q  # Empty - can't see host queues

Deep Dive: UTS Namespace

UTS namespaces isolate hostname and domain name.
unshare(CLONE_NEWUTS);
sethostname("container1", 10);

// This hostname is isolated to this namespace
// Host and other containers see their own hostnames

Deep Dive: Time Namespace

Time namespaces (Linux 5.6+) allow different boot times and monotonic clocks.
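// Create a new time namespace. The caller itself stays in its current
// namespace; only children created afterwards are placed in the new one.
unshare(CLONE_NEWTIME);

// Before the first child is created, write the offsets to
// /proc/self/timens_offsets
// Format: <clock_id> <seconds> <nanoseconds>
// where <clock_id> is "monotonic" or "boottime".
// Offset the monotonic clock by 1 hour:
// monotonic 3600 0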

pivot_root vs chroot

chroot only changes the root directory used for path resolution, so it is not a security boundary: a process that keeps CAP_SYS_CHROOT or an open file descriptor outside the new root can “break out.” Containers use pivot_root instead, which swaps the root mount of the mount namespace for a new one. Once the old root is unmounted, nothing outside the new root remains reachable, which provides a true filesystem jail.
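#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

// glibc provides no wrapper for pivot_root, so invoke the syscall directly
static int pivot_root(const char *new_root, const char *put_old) {
    // Makes new_root the root mount and moves the old root to put_old,
    // which must be a directory at or underneath new_root
    return syscall(SYS_pivot_root, new_root, put_old);
}

// Usage (inside a new mount namespace; error handling omitted for brevity)
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);              // keep changes away from the host
mount("/new_root", "/new_root", NULL, MS_BIND | MS_REC, NULL);  // new_root must be a mount point
chdir("/new_root");
mkdir("old_root", 0700);
pivot_root(".", "old_root");
umount2("old_root", MNT_DETACH);  // this step removes access to the old root
rmdir("old_root");
chdir("/");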

1.2 Cgroups: Resource Metering and Limiting

If Namespaces provide isolation (what you see), Cgroups provide containment (what you can use).
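  • Cgroups v1 (Legacy): Multiple hierarchies. A process could be in one group for CPU and a completely different group for Memory. Because controllers could not agree on which group a process belonged to, features that span controllers (such as charging buffered writeback to both memory and I/O) were hard to implement, and managing the separate trees was complex.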
  • Cgroups v2 (Modern/Unified): A single hierarchy. Every process belongs to exactly one cgroup in a unified tree. This allows for better resource accounting (e.g., attributing page cache writeback to the specific cgroup that caused the dirty pages).
┌─────────────────────────────────────────────────────────────────────┐
│                  CGROUPS V1 VS CGROUPS V2                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Cgroups v1 (Multiple Hierarchies)                                  │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  CPU Hierarchy        Memory Hierarchy      IO Hierarchy        │ │
│  │  ┌──────────┐         ┌──────────┐         ┌──────────┐        │ │
│  │  │   root   │         │   root   │         │   root   │        │ │
│  │  ├──────────┤         ├──────────┤         ├──────────┤        │ │
│  │  │ system   │         │ docker   │         │  user    │        │ │
│  │  │  ├─bash  │         │  ├─nginx │         │   ├─bash │        │ │
│  │  │  └─sshd  │         │  └─redis │         │   └─vim  │        │ │
│  │  └──────────┘         └──────────┘         └──────────┘        │ │
│  │                                                                 │ │
│  │  Problem: Process can be in different groups per controller    │ │
│  │  bash: CPU→system, Memory→docker, IO→user (confusing!)         │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
│  Cgroups v2 (Unified Hierarchy)                                     │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │                                                                 │ │
│  │  Single Hierarchy (All Controllers)                             │ │
│  │  ┌──────────────────────────────────────────────────────────┐  │ │
│  │  │                       root                                │  │ │
│  │  │               (cpu, memory, io, pids)                     │  │ │
│  │  ├──────────────────────┬───────────────────────────────────┤  │ │
│  │  │      system          │          user.slice               │  │ │
│  │  │  ├─ sshd.service     │      ├─ user-1000.slice          │  │ │
│  │  │  └─ cron.service     │      │   ├─ session-1.scope       │  │ │
│  │  │                      │      │   │   ├─ bash              │  │ │
│  │  │                      │      │   │   └─ vim               │  │ │
│  │  │                      │      │   └─ docker.service        │  │ │
│  │  │                      │      │       ├─ container1         │  │ │
│  │  │                      │      │       │   ├─ nginx          │  │ │
│  │  │                      │      │       └─ container2         │  │ │
│  │  │                      │      │           └─ redis          │  │ │
│  │  └──────────────────────┴───────────────────────────────────┘  │ │
│  │                                                                 │ │
│  │  Benefit: Process location same for all controllers            │ │
│  │  Proper resource attribution and delegation                    │ │
│  │                                                                 │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Controllers:

CPU Controller:
# Cgroups v2 CPU control
cd /sys/fs/cgroup/user.slice/user-1000.slice

# cpu.max format: $MAX $PERIOD
# Allow 50% of one CPU: 50000 out of 100000 microseconds
echo "50000 100000" > cpu.max

# CPU weight (shares): 1-10000, default 100
echo "200" > cpu.weight  # 2x normal priority

# Statistics
cat cpu.stat
# usage_usec 12345678
# user_usec 10000000
# system_usec 2345678
# nr_periods 1234
# nr_throttled 56
# throttled_usec 789012
Memory Controller:
# Memory limits
echo "512M" > memory.max      # Hard limit
echo "256M" > memory.high     # Soft limit (throttling)
echo "128M" > memory.low      # Best-effort protection
echo "64M" > memory.min       # Hard protection

# Current usage
cat memory.current

# Detailed statistics
cat memory.stat
# anon 104857600           # Anonymous memory (heap, stack)
# file 52428800            # Page cache
# kernel_stack 131072
# slab 8388608
# sock 65536
# shmem 0
# file_mapped 16777216
# file_dirty 1048576
# file_writeback 524288
# inactive_anon 0
# active_anon 104857600
# inactive_file 26214400
# active_file 26214400

# Memory events
cat memory.events
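# low 0                   # Times reclaimed while usage was below memory.low
# high 12                 # Times throttled for exceeding memory.high
# max 3                   # Times usage was about to exceed memory.max
# oom 0                   # Times the cgroup reached an out-of-memory state
# oom_kill 0              # Processes killed by the OOM killer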
I/O Controller:
# I/O weight (1-10000)
echo "500" > io.weight

# I/O max (rate limiting)
# Format: $MAJ:$MIN rbps=$BYTES wbps=$BYTES riops=$IOPS wiops=$IOPS
echo "8:0 rbps=10485760 wbps=5242880" > io.max
# Limit reads to 10MB/s, writes to 5MB/s on device 8:0

# I/O statistics
cat io.stat
# 8:0 rbytes=1048576000 wbytes=524288000 rios=1000 wios=500
PIDs Controller:
# Limit number of processes
echo "100" > pids.max

# Current count
cat pids.current

# Events
cat pids.events
# max 5  # Times hit pids.max
Cgroups v2 Core Features:
┌─────────────────────────────────────────────────────────────────────┐
│                    CGROUPS V2 CORE CONCEPTS                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  1. No-Internal-Process Rule                                        │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  A non-root cgroup cannot both hold processes and         │  │
│     │  enable controllers for its children (root is exempt)      │  │
│     │  root/                                                     │  │
│     │  ├─ cgroup.procs         ← Root cgroup is exempt         │  │
│     │  └─ system/                                               │  │
│     │     ├─ cgroup.procs      ← Cannot write here             │  │
│     │     └─ sshd.service/                                      │  │
│     │        └─ cgroup.procs   ← Can write here (leaf)         │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  2. Controller Delegation                                           │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Controllers must be explicitly enabled                   │  │
│     │                                                            │  │
│     │  root/cgroup.controllers                                  │  │
│     │  → cpu memory io pids                                     │  │
│     │                                                            │  │
│     │  root/cgroup.subtree_control                              │  │
│     │  → +cpu +memory   (enable for children)                   │  │
│     │                                                            │  │
│     │  root/system/cgroup.controllers                           │  │
│     │  → cpu memory     (inherited from parent)                 │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  3. Pressure Stall Information (PSI)                                │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  Tracks resource contention                               │  │
│     │                                                            │  │
│     │  cpu.pressure:                                            │  │
│     │  some avg10=5.23 avg60=3.14 avg300=1.87 total=123456      │  │
│     │                                                            │  │
│     │  memory.pressure:                                         │  │
│     │  some avg10=12.34 avg60=8.90 avg300=5.67 total=234567     │  │
│     │  full avg10=2.10 avg60=1.50 avg300=0.80 total=45678       │  │
│     │                                                            │  │
│     │  io.pressure:                                             │  │
│     │  some avg10=8.90 avg60=6.70 avg300=4.50 total=345678      │  │
│     │  full avg10=3.20 avg60=2.10 avg300=1.40 total=56789       │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Creating and Managing Cgroups:
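// Minimal cgroup v2 management through the filesystem interface
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Write a string to a cgroup control file; returns 0 on success
static int write_file(const char *path, const char *value) {
    int fd = open(path, O_WRONLY);
    if (fd == -1) { perror(path); return -1; }
    ssize_t n = write(fd, value, strlen(value));
    if (n == -1) perror(path);
    close(fd);
    return n == -1 ? -1 : 0;
}

int create_cgroup(const char *name) {
    char path[256];

    // Create cgroup directory; the kernel populates the control files
    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s", name);
    if (mkdir(path, 0755) == -1 && errno != EEXIST) { perror(path); return -1; }
    return 0;
}

int set_memory_limit(const char *name, const char *limit) {
    char path[256];

    // memory.max only exists if the parent lists "memory" in cgroup.subtree_control
    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/memory.max", name);
    return write_file(path, limit);
}

int add_process_to_cgroup(const char *name, pid_t pid) {
    char path[256];
    char pid_str[32];

    snprintf(path, sizeof(path), "/sys/fs/cgroup/%s/cgroup.procs", name);
    snprintf(pid_str, sizeof(pid_str), "%d", (int)pid);
    return write_file(path, pid_str);
}

int main(void) {
    // Requires root (or a delegated subtree) on a cgroup v2 host
    if (create_cgroup("myapp") != 0 ||
        set_memory_limit("myapp", "512M") != 0 ||
        add_process_to_cgroup("myapp", getpid()) != 0)
        return 1;

    // Process now limited to 512MB
    // Allocate memory and observe behavior

    return 0;
}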
OOM Killer in Cgroups:
┌─────────────────────────────────────────────────────────────────────┐
│                    OOM KILLER IN CGROUPS                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  When cgroup exceeds memory.max:                                    │
│                                                                     │
│  1. Kernel triggers OOM killer                                      │
│  2. Selects victim ONLY from within the cgroup                      │
│  3. Score calculation (higher = more likely to kill):               │
│     ┌───────────────────────────────────────────────────────────┐  │
│     │  score = (rss + swap) * 1000 / total_memory                │  │
│     │  + oom_score_adj                                           │  │
│     │                                                             │  │
│     │  oom_score_adj range: -1000 to 1000                        │  │
│     │  -1000: disable OOM kill                                   │  │
│     │  0: default                                                │  │
│     │  1000: always kill first                                   │  │
│     └───────────────────────────────────────────────────────────┘  │
│                                                                     │
│  4. Kill victim process                                             │
│  5. Log to dmesg:                                                   │
│     "Memory cgroup out of memory: Killed process 1234 (app)"       │
│                                                                     │
│  Prevent OOM kill:                                                  │
│  echo "-1000" > /proc/<pid>/oom_score_adj                          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

1.3 OverlayFS: The Layered Filesystem

Containers use Union Filesystems (like OverlayFS) to provide a writable layer on top of read-only image layers.
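  1. LowerDir: Read-only layers (the Docker image).
  2. UpperDir: The writable layer where changes are stored.
  3. MergedDir: The unified view presented to the container.
  4. WorkDir: An empty scratch directory on the same filesystem as the UpperDir, used by the kernel to prepare changes atomically.
  5. Copy-on-Write (CoW): When a container modifies a file that exists only in the LowerDir, the kernel first copies the whole file up to the UpperDir and then applies the change there. Deletions are recorded as whiteout entries in the UpperDir.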

2. Virtualization: Emulating the Machine

Virtual Machines (VMs) take the isolation boundary down to the hardware level. Instead of sharing a kernel, they share the physical CPU and Memory.

2.1 The Hypervisor (VMM)

The Virtual Machine Monitor (VMM) is the software that manages guest execution.
  • Type 1 (Bare Metal): Runs directly on hardware (Xen, ESXi).
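  • Type 2 (Hosted): Runs as an application on a host OS (VirtualBox, VMware Workstation). Note: KVM does not fit either category cleanly. It is a kernel module that turns the Linux kernel itself into a hypervisor, so it is usually classified as Type 1 even though guests run alongside ordinary host processes.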

2.2 Hardware-Assisted Virtualization (VT-x / AMD-V)

Early virtualization used “Binary Translation” to replace privileged instructions. Modern CPUs handle this in hardware:
  • VMX Root Mode: The hypervisor runs here (full privileges).
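  • VMX Non-Root Mode: The guest OS runs here. Most instructions execute at native speed, but when the guest executes an instruction the hypervisor has configured to trap (for example CPUID, which always exits, or HLT when HLT exiting is enabled), the CPU triggers a VM Exit, trapping into the hypervisor to handle the event.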

VMCS (Virtual Machine Control Structure)

The VMCS is a memory block that stores the “state” of a virtual CPU (registers, control bits). When switching from VM A to VM B, the hypervisor swaps the VMCS pointers.

2.3 Memory Virtualization: EPT and NPT

In a VM, there are three types of addresses:
  1. Guest Virtual (GV)
  2. Guest Physical (GP)
  3. Host Physical (HP)
Shadow Page Tables (Old): The hypervisor tracked guest page table changes in software and built a combined GV→HP table. Every guest page table update trapped into the hypervisor, which made this approach extremely slow.
EPT (Extended Page Tables; NPT is AMD's equivalent): Hardware handles the translation. The CPU has a second set of page tables that map GP→HP. A memory access now involves a “2D Page Walk,” but it happens entirely in hardware.
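The cost of the 2D walk can be worked out with a little arithmetic. With n guest page-table levels and m EPT levels, each of the n guest table pointers plus the final guest-physical address needs an m-step EPT walk, and the n guest entries themselves must be read: (n + 1) × m + n references in the worst case. The sketch below just evaluates that formula; `walk_refs` is a hypothetical helper, and real CPUs avoid most of these references through TLBs and paging-structure caches.

```shell
#!/bin/sh
# walk_refs GUEST_LEVELS EPT_LEVELS
# Worst-case memory references for one translation with nested paging:
# (n + 1) EPT walks of m references each, plus the n guest entries.
walk_refs() {
    n=$1 m=$2
    echo $(( (n + 1) * m + n ))
}

echo "native 4-level walk:         4"
echo "4-level guest + 4-level EPT: $(walk_refs 4 4)"   # 24
echo "5-level guest + 5-level EPT: $(walk_refs 5 5)"   # 35
```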

3. The Middle Ground: MicroVMs

Plain containers have a large attack surface (the host kernel's full interface of several hundred syscalls). Traditional VMs are slow to boot and heavy. MicroVMs (like Firecracker) bridge the gap.

Firecracker Architecture

  • Minimalism: Removes all non-essential devices (no VGA, no USB, no sound).
  • VirtIO: Uses paravirtualized drivers for network and disk, avoiding the overhead of emulating real hardware registers.
  • Jailer: Firecracker itself runs inside a container (Namespaces + Cgroups) to provide “Defense in Depth.”
  • Performance: Can boot a guest to user space in roughly 125 ms and run thousands of instances on a single host.

4. Comparison: When to Use What?

Feature    Containers              MicroVMs (Firecracker)     Full VMs (ESXi/KVM)
Isolation  Logical (Kernel)        Hardware (Minimal)         Hardware (Full)
Startup    sub-second              sub-second                 over 10s
Payload    Process                 Kernel + Rootfs            Full OS
Security   Medium (Shared Kernel)  High                       Highest
Use Case   Microservices           Serverless / Multi-tenant  Legacy / Windows

5. Docker Internals: Putting It All Together

Docker is a high-level container runtime that orchestrates namespaces, cgroups, and OverlayFS.
┌─────────────────────────────────────────────────────────────────────┐
│                     DOCKER ARCHITECTURE                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  User Space                                                          │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  docker CLI                                                     │ │
│  │  $ docker run -m 512m --cpus=0.5 nginx                         │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ REST API over Unix socket          │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  dockerd (Docker Daemon)                                        │ │
│  │  • Image management                                             │ │
│  │  • Volume management                                            │ │
│  │  • Network management                                           │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ gRPC                               │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  containerd (Container Runtime)                                 │ │
│  │  • Container lifecycle                                          │ │
│  │  • Image pulls/pushes                                           │ │
│  │  • Storage management                                           │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │                                    │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  containerd-shim (Per-container)                                │ │
│  │  • Keeps container running if containerd crashes                │ │
│  │  • Reports exit status                                          │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │ fork/exec                          │
│                                 ▼                                    │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  runc (OCI Runtime)                                             │ │
│  │  • Creates namespaces                                           │ │
│  │  • Sets up cgroups                                              │ │
│  │  • Mounts overlay filesystem                                    │ │
│  │  • Executes container process                                   │ │
│  └──────────────────────────────┬─────────────────────────────────┘ │
│                                 │                                    │
│  ════════════════════════════════╧═════════════════════════════════  │
│                                                                     │
│  Kernel Space                                                        │
│  ┌────────────────────────────────────────────────────────────────┐ │
│  │  Container Process                                              │ │
│  │  ┌───────────────────────────────────────────────────────────┐ │ │
│  │  │  PID, Mount, Network, UTS, IPC, User, Cgroup Namespaces   │ │ │
│  │  │  CPU, Memory, IO, PIDs Cgroups                             │ │ │
│  │  │  OverlayFS (LowerDir, UpperDir, WorkDir, MergedDir)        │ │ │
│  │  └───────────────────────────────────────────────────────────┘ │ │
│  └────────────────────────────────────────────────────────────────┘ │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
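The `-m 512m --cpus=0.5` flags in the diagram end up as plain values in cgroup v2 files: `memory.max` holds a byte count and `cpu.max` holds `<quota> <period>` in microseconds. The sketch below shows the conversion under the assumption of the default 100 ms period; `cpus_to_cpu_max` and `mem_to_bytes` are hypothetical helpers, not Docker code, and the CPU value is passed in thousandths of a CPU because POSIX shell arithmetic is integer-only.

```shell
#!/bin/sh
# cpus_to_cpu_max MILLICPUS
# 500 millicpus (--cpus=0.5) becomes "50000 100000": 50 ms of CPU time
# per 100 ms period, in the "<quota> <period>" format of cpu.max.
cpus_to_cpu_max() {
    period=100000                        # microseconds, assumed default period
    echo "$(( $1 * period / 1000 )) $period"
}

# mem_to_bytes SIZE
# Converts sizes such as 512m or 2g to the byte count written to memory.max.
mem_to_bytes() {
    num=${1%[kmgKMG]}                    # strip one trailing unit letter, if any
    case $1 in
        *[kK]) echo $(( num * 1024 )) ;;
        *[mM]) echo $(( num * 1024 * 1024 )) ;;
        *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$num" ;;
    esac
}

cpus_to_cpu_max 500   # 50000 100000
mem_to_bytes 512m     # 536870912
```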
Docker Container Creation Steps:
# 1. Pull image (if not cached)
docker pull nginx:latest

# 2. Create container
docker create --name web -p 80:80 nginx

# 3. Start container
docker start web

# What happens internally:
# a. containerd unpacks image layers
# b. runc creates namespaces (CLONE_NEWPID|CLONE_NEWNET|...)
# c. runc sets up cgroups (/sys/fs/cgroup/docker/<container-id>/)
# d. runc mounts OverlayFS
# e. dockerd (libnetwork) configures the network in the new netns (veth pair, bridge)
# f. runc calls pivot_root to switch into the container filesystem
# g. runc executes CMD/ENTRYPOINT
Inspecting Docker Internals:
# View container's namespaces
docker inspect web | jq '.[0].State.Pid'  # Get PID
sudo ls -la /proc/<PID>/ns/
# lrwxrwxrwx 1 root root 0 pid:[4026532198]
# lrwxrwxrwx 1 root root 0 net:[4026532201]
# lrwxrwxrwx 1 root root 0 mnt:[4026532196]

# View cgroup limits
cat /sys/fs/cgroup/docker/<container-id>/memory.max
cat /sys/fs/cgroup/docker/<container-id>/cpu.max

# View OverlayFS layers
docker inspect web | jq '.[0].GraphDriver'
# {
#   "Data": {
#     "LowerDir": "/var/lib/docker/overlay2/abc.../diff",
#     "UpperDir": "/var/lib/docker/overlay2/def.../diff",
#     "WorkDir": "/var/lib/docker/overlay2/def.../work",
#     "MergedDir": "/var/lib/docker/overlay2/def.../merged"
#   }
# }
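The namespace links shown above have targets of the form `type:[inode]`, and two processes share a namespace exactly when those inode numbers match. The sketch below parses and compares such targets; `ns_inode` and `same_ns` are hypothetical helper names, and reading `/proc/1/ns/pid` usually requires root, so the live comparison is skipped when the link is not readable.

```shell
#!/bin/sh
# ns_inode LINK_TARGET
# Extracts the inode from a namespace link target such as "pid:[4026532198]".
ns_inode() {
    inode=${1#*\[}        # drop everything up to and including "["
    echo "${inode%\]}"    # drop the trailing "]"
}

# same_ns TARGET_A TARGET_B: succeeds when both targets name one namespace.
same_ns() {
    [ "$(ns_inode "$1")" = "$(ns_inode "$2")" ]
}

# Compare this shell's PID namespace with that of PID 1, when readable.
self=$(readlink /proc/self/ns/pid 2>/dev/null) || self=""
init=$(readlink /proc/1/ns/pid 2>/dev/null) || init=""
if [ -n "$self" ] && [ -n "$init" ]; then
    if same_ns "$self" "$init"; then
        echo "same PID namespace as PID 1"
    else
        echo "different PID namespace from PID 1"
    fi
fi
```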

6. Interview Deep Dive: Senior Level

Answer: User Namespaces (CLONE_NEWUSER) allow a process to have UID 0 (root) inside the container while being a non-privileged UID (e.g., 1000) on the host.
Security Improvement:
Without User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                         Host                    │
│  ┌──────────────┐                 ┌──────────────┐        │
│  │  UID 0       │ ═══════════════►│  UID 0       │        │
│  │  (root)      │                 │  (root)      │        │
│  └──────────────┘                 └──────────────┘        │
│                                                            │
│  If container breakout occurs:                            │
│  → Attacker has root on host!                             │
└───────────────────────────────────────────────────────────┘

With User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                         Host                    │
│  ┌──────────────┐                 ┌──────────────┐        │
│  │  UID 0       │ ───────────────►│  UID 1000    │        │
│  │  (root)      │     mapped      │  (user)      │        │
│  └──────────────┘                 └──────────────┘        │
│                                                            │
│  If container breakout occurs:                            │
│  → Attacker only has UID 1000 permissions                 │
│  → Cannot write to /etc, /boot, system files              │
│  → Cannot load kernel modules                             │
│  → Cannot access other users' files                       │
└───────────────────────────────────────────────────────────┘
Implementation:
# Run Docker with user namespace remapping
dockerd --userns-remap=default

# Or manually with unshare
unshare --user --map-root-user /bin/bash
Limitations:
  • Some operations still require host root (mounting certain filesystems)
  • File ownership can be confusing (files created by container appear owned by high UIDs on host)
  • Not all containers work with user namespaces (especially those requiring true root)
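The mapping itself is visible in `/proc/<pid>/uid_map`, where each line has three fields: the first UID inside the namespace, the first UID outside it, and the length of the range. The sketch below translates a container UID through one such entry. `map_uid` is a hypothetical helper, and the `0 100000 65536` entry is an assumed example of a subordinate UID range; `unshare --map-root-user` run by UID 1000 would instead produce `0 1000 1`.

```shell
#!/bin/sh
# map_uid CONTAINER_UID "INSIDE OUTSIDE LENGTH"
# Translates a UID inside the namespace to the host UID using one uid_map
# entry. Fails (exit status 1) if the UID is outside the mapped range.
map_uid() {
    uid=$1
    set -- $2                      # split the entry into $1 $2 $3
    inside=$1 outside=$2 length=$3
    if [ "$uid" -ge "$inside" ] && [ "$uid" -lt $(( inside + length )) ]; then
        echo $(( outside + uid - inside ))
    else
        return 1
    fi
}

map_uid 0    "0 100000 65536"   # 100000
map_uid 1000 "0 100000 65536"   # 101000
map_uid 0    "0 1000 1"         # 1000
```

This is also why files created by container root show up on the host under a high UID such as 100000.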
Answer:
Cgroups v1 Problems:
  1. Multiple Hierarchies:
    • Each controller (cpu, memory, io) has its own hierarchy
    • A process can be in /sys/fs/cgroup/cpu/groupA and /sys/fs/cgroup/memory/groupB
    • Impossible to do unified resource accounting
  2. Writeback Ambiguity:
    • Process in cgroup A writes to page cache
    • Page cache writeback happens later
    • Which cgroup gets charged for the disk I/O?
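    • v1: Writeback is done later by kernel flusher threads, so the blkio controller cannot attribute buffered writes to the cgroup that dirtied the pages (wrong!)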
  3. No Delegation:
    • Can’t safely give non-root users control over cgroups
    • Security issues with nested hierarchies
Cgroups v2 Solutions:
  1. Single Hierarchy:
    • One tree, all controllers
    • Process location is the same for all resources
    • Enables proper delegation and accounting
  2. Proper Attribution:
    • Tracks which cgroup dirtied pages
    • I/O charged correctly even if writeback delayed
  3. Pressure Stall Information (PSI):
    • Built-in resource pressure metrics
    • Can detect when cgroup is starved
Migration Example:
# v1: Multiple hierarchies
/sys/fs/cgroup/cpu/docker/container1/
/sys/fs/cgroup/memory/system/container1/

# v2: Single hierarchy
/sys/fs/cgroup/docker/container1/
# All controllers available here
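A common way to tell which layout a host uses is the filesystem type mounted at `/sys/fs/cgroup`: `cgroup2fs` means the unified v2 hierarchy, while `tmpfs` means v1 (or a hybrid setup) with one mount per controller underneath. The sketch below wraps that check; `cgroup_version` is a hypothetical helper, and `stat -fc %T` assumes GNU coreutils.

```shell
#!/bin/sh
# cgroup_version FSTYPE
# Maps the filesystem type mounted at /sys/fs/cgroup to a cgroup version.
cgroup_version() {
    case $1 in
        cgroup2fs) echo "v2" ;;        # unified hierarchy
        tmpfs)     echo "v1" ;;        # tmpfs holding per-controller mounts
        *)         echo "unknown" ;;
    esac
}

# Query the live system; fall back to "unknown" where stat -f is unavailable.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null) || fstype=""
echo "cgroup version: $(cgroup_version "$fstype")"
```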
Answer: Docker uses network namespaces + veth pairs + Linux bridge.
Default Bridge Network:
1. Create network namespace for container
2. Create veth pair (virtual ethernet cable with two ends)
3. Move one end into container namespace
4. Attach other end to docker0 bridge
5. Configure IP addresses and routes
6. Setup iptables rules for NAT
Detailed Flow:
# Container sends packet to 8.8.8.8:53
1. eth0@container (172.17.0.2) → veth pair
2. vethXXX@host → docker0 bridge (172.17.0.1)
3. SNAT: 172.17.0.2 → 192.168.1.100 (host IP)
4. eth0@host → Internet

# Response
1. eth0@host ← Internet
2. DNAT (conntrack reverses the earlier SNAT): 192.168.1.100 → 172.17.0.2
3. docker0 bridge → vethXXX@host
4. veth pair → eth0@container
Port Mapping:
docker run -p 8080:80 nginx

# iptables rule created:
iptables -t nat -A DOCKER -p tcp --dport 8080 \
  -j DNAT --to-destination 172.17.0.2:80
Network Modes:
Mode          Description                      Use Case
bridge        Default, isolated network        Normal containers
host          Share host network namespace     High performance
none          No network                       Security isolation
container:ID  Share another container’s netns  Sidecars
Code Example:
# Simplified Docker network setup, expressed as iproute2 commands (run as root).
# "demo" is a placeholder network namespace name; docker0 must already exist.
ip netns add demo

# 1. Create veth pair
ip link add veth0 type veth peer name veth1

# 2. Move one end to namespace
ip link set veth1 netns demo

# 3. Attach to bridge
ip link set veth0 master docker0
ip link set veth0 up

# 4. Configure IPs
ip -n demo addr add 172.17.0.2/16 dev veth1
ip -n demo link set veth1 up
ip -n demo route add default via 172.17.0.1
Answer: A VM Exit occurs when the guest OS performs an action that requires hypervisor intervention (e.g., I/O, CPUID, or accessing certain registers).
VM Exit Flow:
┌─────────────────────────────────────────────────────────────┐
│                      VM EXIT OVERHEAD                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest (VM)                                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. Execute privileged instruction (e.g., IN/OUT)    │   │
│  │     or access protected resource                      │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ CPU Trap                         │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  2. Hardware VM Exit                                  │   │
│  │     • Save guest state to VMCS (registers, RIP, etc.) │   │
│  │     • Load host state from VMCS                       │   │
│  │     • Jump to hypervisor entry point                  │   │
│  │     • Time: ~1000-3000 cycles                         │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  Hypervisor                                                  │
│  ┌────────────────────────▼─────────────────────────────┐   │
│  │  3. Handle VM Exit                                    │   │
│  │     • Inspect exit reason                             │   │
│  │     • Emulate instruction (e.g., read port 0x3F8)     │   │
│  │     • Update guest state                              │   │
│  │     • Time: 500-2000 cycles                           │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  4. Resume Guest                                      │   │
│  │     • Load guest state from VMCS                      │   │
│  │     • Switch to VMX non-root mode                     │   │
│  │     • Continue guest execution                        │   │
│  │     • Time: ~1000-2000 cycles                         │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Total Overhead: 2500-7000 cycles (1-3 microseconds)        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Common VM Exit Causes:
Cause                      Frequency  Mitigation
I/O instructions (IN/OUT)  High       Use virtio (paravirtualization)
CPUID                      Medium     Cache results in guest
CR3 writes (page table)    High       Use EPT (hardware MMU)
Interrupts                 Very High  APIC virtualization
MSR access                 Medium     Use MSR bitmaps
HLT (idle)                 Low        Acceptable (CPU idle anyway)
Optimization Strategies:
  1. EPT (Extended Page Tables):
    • Guest can change CR3 without VM exit
    • Hardware handles GVA → GPA → HPA translation
  2. APIC Virtualization:
    • Virtual APIC page in guest memory
    • Most interrupt operations happen without exits
  3. VirtIO:
    • Paravirtualized drivers
    • Shared memory rings reduce I/O exits
Modern virtualization aims to minimize VM Exits using features like APIC Virtualization and EPT.
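To see why exit counts matter, the cycle totals from the diagram can be converted to wall-clock time. The sketch below assumes a 3000 MHz clock and a rate of 100,000 exits per second; both are illustrative values rather than measurements, and `cycles_to_ns` is a hypothetical helper.

```shell
#!/bin/sh
# cycles_to_ns CYCLES CLOCK_MHZ
# Converts a cycle count to nanoseconds (integer division, rounds down).
cycles_to_ns() {
    echo $(( $1 * 1000 / $2 ))
}

echo "best case:  $(cycles_to_ns 2500 3000) ns"    # 833 ns
echo "worst case: $(cycles_to_ns 7000 3000) ns"    # 2333 ns

# CPU time lost per second at 100,000 exits/s in the worst case, in ms.
echo "budget: $(( 100000 * $(cycles_to_ns 7000 3000) / 1000000 )) ms per second"
```

At that assumed rate, the worst case consumes roughly 233 ms of every second on one core, which is why the mitigations in the table focus on avoiding exits rather than making each one cheaper.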
Answer: Nested virtualization is running a VM inside another VM (e.g., running KVM inside a Google Cloud VM instance).
Address Translation Complexity:
┌─────────────────────────────────────────────────────────────┐
│                NESTED VIRTUALIZATION                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  L2 Guest (innermost VM)                                     │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GVA (Guest Virtual Address)                          │   │
│  │  e.g., 0x400000 (program address)                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │ L2 Page Tables                   │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GPA-L2 (Guest Physical Address of L2)                │   │
│  │  e.g., 0x80000000                                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  L1 Hypervisor (middle VM)                                   │
│                           │ L1 EPT                           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  GPA-L1 (Guest Physical Address of L1)                │   │
│  │  e.g., 0x100000000                                    │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  L0 Hypervisor (host)                                        │
│                           │ L0 EPT                           │
│                           ▼                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  HPA (Host Physical Address)                          │   │
│  │  e.g., 0x200000000 (actual RAM)                       │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  Without Nested EPT:                                         │
│  → L1 must shadow L2's page tables in software              │
│  → Every L2 page-table update causes a VM exit              │
│  → One 2D walk alone: (4+1) × (4+1) - 1 = 24 accesses       │
│                                                              │
│  With Nested EPT (Intel):                                    │
│  → L0 merges L1's EPT with its own into one shadow EPT      │
│  → Hardware performs a normal 2D walk for L2                │
│  → Still slower than native, but manageable                  │
│  → ~2-3x overhead instead of 10x+                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Performance Impact:
Operation         Native    L1 VM    Nested L2 VM
Memory Access     1x        1.1x     2-3x
I/O               1x        2x       4-6x
Context Switch    1x        1.5x     3-4x
When to Use Nested Virtualization:
  1. Development/Testing:
    • Test hypervisor code
    • CI/CD pipelines testing VMs
  2. Cloud Services:
    • VM-based sandboxes (Kata Containers, Firecracker) on cloud VMs
    • CI runners that boot VMs or emulators in the cloud
  3. Education:
    • Teaching virtualization
    • Lab environments
Avoid for:
  • Production databases
  • High-performance computing
  • Latency-sensitive applications
The main challenge is translating “Level 2” Guest Physical addresses to “Level 0” Host Physical addresses. The hardware walks only one guest page table and one EPT, so L0 must either fall back to shadow page tables or merge L1's EPT with its own into a single shadow EPT (nested EPT). Both approaches add VM exits and TLB pressure, and the two-dimensional walk multiplies the cost of every TLB miss.
Answer: OverlayFS provides a union mount where multiple layers are combined into a single view.
Layer Structure:
┌─────────────────────────────────────────────────────────────┐
│                    OVERLAYFS LAYERS                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Container View (MergedDir)                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  /bin/bash      ← from Base Layer                     │   │
│  │  /etc/nginx/    ← from Nginx Layer                    │   │
│  │  /var/log/app   ← from Container Layer (writable)     │   │
│  │  /app/config    ← from Container Layer (modified)     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│  ═════════════════════════╧═════════════════════════════     │
│                                                              │
│  UpperDir (Writable Container Layer)                         │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  /var/log/app           ← New file                    │   │
│  │  /app/config            ← Modified file               │   │
│  │  oldfile (char dev 0/0) ← Whiteout (deleted file)     │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
│  LowerDir (Read-Only Image Layers)                           │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Layer 3: Nginx Layer                                 │   │
│  │  /etc/nginx/nginx.conf                                │   │
│  │  /usr/sbin/nginx                                      │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Layer 2: App Dependencies                            │   │
│  │  /usr/lib/libssl.so                                   │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │  Layer 1: Base Ubuntu                                 │   │
│  │  /bin/bash                                            │   │
│  │  /etc/passwd                                          │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Copy-on-Write Operations:
1. Read File:
open("/etc/nginx/nginx.conf", O_RDONLY)
→ OverlayFS checks UpperDir: not found
→ Falls through to LowerDir: found in Nginx layer
→ Returns file from LowerDir (no copy needed)
2. Modify File:
open("/etc/nginx/nginx.conf", O_WRONLY)
→ OverlayFS checks UpperDir: not found
→ Copy file from LowerDir to UpperDir (copy-up)
→ Open file in UpperDir for writing
→ Future reads will use UpperDir version
3. Delete File:
unlink("/etc/nginx/nginx.conf")
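→ OverlayFS creates a whiteout in UpperDir
→ UpperDir/etc/nginx/nginx.conf becomes a character device with device number 0/0 (marks file as deleted; the .wh.<name> form is the AUFS and OCI image-tar convention)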
→ LowerDir file remains (other containers unaffected)
→ MergedDir view hides the file
4. Create New File:
open("/var/log/app.log", O_CREAT)
→ OverlayFS creates file directly in UpperDir
→ No interaction with LowerDir needed
Mount Command:
mount -t overlay overlay \
  -o lowerdir=/lower1:/lower2:/lower3,\
     upperdir=/upper,\
     workdir=/work \
  /merged
Benefits:
  • Shared base layers save disk space
  • Fast container startup (no copying)
  • Efficient use of cache (shared pages)
Performance Considerations:
  • First write to file triggers copy-up (can be slow for large files)
  • Many layers slow down lookup
  • Whiteouts and copied-up files accumulate in the writable layer (keep containers short-lived and put write-heavy data on volumes)
Answer:
Container Security (Shared Kernel):
Pros:
  • Faster startup and lower overhead
  • Easier management
Cons:
  1. Kernel Exploits:
    Container → Kernel Vulnerability → Host Compromise
    
    Example: Dirty COW (CVE-2016-5195)
    - Container can exploit kernel bug
    - Gain root on host
    - Escape to host system
    
  2. Large Attack Surface:
    ~300+ system calls exposed
    Any syscall vulnerability affects all containers
    
    Mitigation: seccomp-bpf filters
    → Block dangerous syscalls
    → Reduce attack surface
    
  3. Resource Exhaustion:
    Without cgroups:
    Container A → Allocate all memory → OOM kills Container B
    
    With cgroups:
    Container A → Hit memory.max → OOM kills processes in A only
    
  4. Information Leakage:
    /proc and /sys expose kernel information
    - /proc/kallsyms (kernel symbols)
    - /sys/kernel/debug (debug info)
    
    Mitigation: Mount with hidepid, remove sensitive mounts
    
VM Security (Separate Kernel):
Pros:
  1. Strong Isolation:
    VM → Hypervisor → Host
    
    Attack path requires:
    1. Exploit in guest kernel
    2. VM escape vulnerability
    3. Hypervisor exploit
    
    Much harder than container escape
    
  2. Smaller Attack Surface:
    VM → Hypervisor interface is small
    - Hypercalls (10-20 vs 300+ syscalls)
    - Device emulation
    - Much less code to attack
    
  3. Different Kernels:
    Can run different kernel versions
    Old vulnerable kernel in VM doesn't affect host
    
Comparison Table:
Aspect                Containers             VMs
Kernel isolation      Shared                 Separate
Escape difficulty     Medium                 Hard
Attack surface        Large (~300 syscalls)  Small (~20 hypercalls)
Vulnerability impact  Affects host           Contained to VM
Performance overhead  ~2%                    ~5-10%
Startup time          under 1s               10-30s
Best Practices:
For Containers:
# 1. Use user namespaces
--userns-remap=default

# 2. Drop capabilities
--cap-drop=ALL --cap-add=NET_BIND_SERVICE

# 3. Seccomp filter
--security-opt seccomp=default.json

# 4. AppArmor/SELinux
--security-opt apparmor=docker-default

# 5. Read-only root
--read-only --tmpfs /tmp

# 6. No privileged mode
# NEVER use --privileged in production!
For VMs:
# 1. Minimal device emulation
Use virtio instead of emulated hardware

# 2. Disable unnecessary devices
-nodefaults -vga none

# 3. Use KVM (hardware virtualization)
-enable-kvm

# 4. Memory ballooning
-device virtio-balloon

# 5. vTPM for measured boot
-tpmdev emulator
Hybrid Approach (Kata Containers):
Container API → Lightweight VM → Strong isolation

Benefits:
- Container-like UX
- VM-like security
- Sub-second startup in typical setups (vs 10s+ for a traditional VM)
Answer:
Device Emulation (Full Virtualization): The guest believes it’s talking to real hardware, and the hypervisor emulates every register read/write.
┌─────────────────────────────────────────────────────────────┐
│                  DEVICE EMULATION                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest VM                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. Guest writes to e1000 NIC register                │   │
│  │     outl(0xC000, ETH_TX_DESC)                         │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Exit (I/O port access)        │
│                           ▼                                  │
│  Hypervisor (QEMU/KVM)                                       │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  2. Trap I/O operation                                │   │
│  │     Decode: write to port 0xC000, value 0x12345678    │   │
│  │                                                        │   │
│  │  3. Emulate e1000 device logic                        │   │
│  │     - Update virtual NIC state                        │   │
│  │     - Copy packet from guest memory                   │   │
│  │     - Send to host TAP device                         │   │
│  │                                                        │   │
│  │  4. Return to guest                                   │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  Guest continues...                                          │
│                                                              │
│  Problem: EVERY register access causes VM Exit!              │
│  → Hundreds of exits per packet                             │
│  → 10x+ overhead                                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Paravirtualization (VirtIO): The guest knows it’s virtualized and uses an efficient shared-memory interface.
┌─────────────────────────────────────────────────────────────┐
│                  PARAVIRTUALIZATION                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Guest VM                                                    │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  1. VirtIO driver in guest                            │   │
│  │     - Shared memory ring (vring)                      │   │
│  │     - No register emulation needed                    │   │
│  │                                                        │   │
│  │  2. Write packet to shared ring                       │   │
│  │     vring[idx] = packet_buffer                        │   │
│  │     idx++                                             │   │
│  │                                                        │   │
│  │  3. Kick hypervisor (single VM exit)                  │   │
│  │     kick_notify()                                     │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ Single VM Exit                   │
│                           ▼                                  │
│  Hypervisor                                                  │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  4. Process all pending packets                       │   │
│  │     while (vring has packets) {                       │   │
│  │       packet = vring[idx]                             │   │
│  │       send_to_tap(packet)                             │   │
│  │       idx++                                           │   │
│  │     }                                                 │   │
│  │                                                        │   │
│  │  5. Return to guest                                   │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                  │
│                           │ VM Entry                         │
│                           ▼                                  │
│  Guest continues...                                          │
│                                                              │
│  Benefit: One VM exit for multiple packets!                  │
│  → Near-native performance                                  │
│  → <5% overhead                                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Performance Comparison:
Operation            Emulation  VirtIO    Native
Network throughput   1 Gbps     9.5 Gbps  10 Gbps
Disk IOPS            5,000      45,000    50,000
VM exits per packet  100-1000   1-2       0
VirtIO Ring Structure:
#include <stdint.h>

#define QUEUE_SIZE 256  // Example size; the real value is negotiated per queue

// Descriptor table: describes buffers
struct vring_desc {
    uint64_t addr;   // Guest physical address
    uint32_t len;    // Buffer length
    uint16_t flags;  // VRING_DESC_F_NEXT, etc.
    uint16_t next;   // Next descriptor
};

// Available ring: guest writes here
struct vring_avail {
    uint16_t flags;
    uint16_t idx;
    uint16_t ring[QUEUE_SIZE];
};

struct vring_used_elem {
    uint32_t id;  // Descriptor index
    uint32_t len; // Bytes written
};

// Used ring: hypervisor writes here
struct vring_used {
    uint16_t flags;
    uint16_t idx;
    struct vring_used_elem ring[QUEUE_SIZE];
};

// Simplified view of one split virtqueue
struct vring {
    struct vring_desc  desc[QUEUE_SIZE];
    struct vring_avail avail;
    struct vring_used  used;
};
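The `idx` fields in the available and used rings are free-running 16-bit counters, so the ring slot and the number of pending entries are both computed with modular arithmetic. The sketch below shows that arithmetic only; `ring_slot` and `pending` are hypothetical helper names, and a queue size of 256 is an assumed example.

```shell
#!/bin/sh
# ring_slot IDX QUEUE_SIZE
# Ring entry used for a free-running 16-bit index.
ring_slot() {
    echo $(( ($1 & 0xFFFF) % $2 ))
}

# pending AVAIL_IDX LAST_SEEN
# Entries published but not yet consumed, computed modulo 2^16 so the
# result stays correct when the counter wraps from 65535 to 0.
pending() {
    echo $(( ($1 - $2) & 0xFFFF ))
}

ring_slot 258 256     # 2
pending 3 65534       # 5: the counter wrapped between the two reads
```

This is what lets the hypervisor drain every pending packet after a single kick: it compares the guest's `avail.idx` with the last index it processed and handles the difference in one pass.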
Tradeoffs:
Emulation:
  • Pros: No guest modification, runs any OS
  • Cons: Slow, many VM exits
Paravirtualization:
  • Pros: Fast, few VM exits
  • Cons: Requires guest support (modified drivers)
Modern Approach:
  • Use paravirt for performance-critical devices (disk, network)
  • Use emulation for legacy devices (VGA, PS/2)
  • Gradually reduce emulation over time

6. Namespaces & Cgroups: A Single Process’s Perspective

What does a process actually “see” when it’s containerized? Here’s the view from inside:

What Changes for the Process

┌─────────────────────────────────────────────────────────────────────┐
│     PROCESS VIEW: BEFORE vs AFTER CONTAINERIZATION                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  BEFORE (Host Process)                AFTER (Containerized)         │
│  ────────────────────                 ─────────────────────         │
│                                                                     │
│  PID: 12345                           PID: 1  (thinks it's init!)   │
│  UID: 1000                            UID: 0  (root in container)   │
│  Hostname: myserver                   Hostname: container-abc       │
│  /proc: sees all processes            /proc: sees only self         │
│  Network: eth0 (192.168.1.5)          Network: eth0 (172.17.0.2)    │
│  Filesystem: /home/user/...           Filesystem: / (isolated root) │
│  Memory: unlimited                    Memory: 512MB max (cgroup)    │
│  CPU: all cores                       CPU: 50% of 1 core (cgroup)   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Inspecting Your Own Namespace

# Inside a container (or any process), see your namespaces:
ls -la /proc/self/ns/
# lrwxrwxrwx 1 root root 0 pid:[4026532198]
# lrwxrwxrwx 1 root root 0 net:[4026532201]
# lrwxrwxrwx 1 root root 0 mnt:[4026532196]
# ...

# Compare with host (different inode numbers = different namespace)
# Host:  pid:[4026531836]
# Container: pid:[4026532198]  ← Different!

# See your cgroup limits
cat /sys/fs/cgroup/memory.max      # Memory limit
cat /sys/fs/cgroup/cpu.max         # CPU limit (quota period)
cat /sys/fs/cgroup/pids.max        # Max processes

# See your cgroup resource usage
cat /sys/fs/cgroup/memory.current  # Current memory usage
cat /sys/fs/cgroup/cpu.stat        # CPU time consumed
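The same files can be read programmatically. The sketch below assumes a cgroup v2 mount at /sys/fs/cgroup; the function names are hypothetical. It treats the literal value max as "no limit", and read_limits also returns None when a file is missing (the root cgroup has no cpu.max or memory.max).

```python
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Convert a cpu.max line ("<quota> <period>") into a core count."""
    quota, period = text.split()
    if quota == "max":
        return None  # unlimited
    return int(quota) / int(period)


def parse_memory_max(text: str) -> Optional[int]:
    """Convert memory.max content into bytes, or None if unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_limits(cgroup_dir: str = "/sys/fs/cgroup") -> dict:
    """Read the CPU and memory limits visible from inside the container."""
    root = Path(cgroup_dir)
    limits = {}
    for name, parser in (("cpu.max", parse_cpu_max),
                         ("memory.max", parse_memory_max)):
        path = root / name
        limits[name] = parser(path.read_text()) if path.exists() else None
    return limits
```

A cpu.max of "50000 100000" means 50 ms of CPU time per 100 ms period, which is the "50% of 1 core" case from the diagram above.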

The Process Doesn’t Know It’s Contained

// This code runs unmodified on the host or in a container; only the values differ:
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("My PID: %d\n", (int)getpid());   // 1 in container, 12345 on host
    printf("My UID: %d\n", (int)getuid());   // 0 in container (mapped root)

    char hostname[256];
    if (gethostname(hostname, sizeof(hostname)) == 0) {
        printf("Hostname: %s\n", hostname);  // "container-abc" in container
    }

    // None of these calls reveals the container:
    // each syscall returns values relative to the caller's namespaces
    return 0;
}

Key Insight: Syscalls Are Virtualized

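Syscalls that report namespaced state (PIDs, UIDs, hostname, network, mounts) are answered relative to the caller’s namespaces:

| Syscall              | Host Returns | Container Returns   |
|----------------------|--------------|---------------------|
| getpid()             | 12345        | 1                   |
| getuid()             | 1000         | 0 (mapped root)     |
| uname()              | myserver     | container-abc       |
| readdir(/proc)       | All PIDs     | Only container PIDs |
| socket(AF_INET, ...) | Host network | Container network   |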
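One way to check which of these views two processes actually share is to compare the namespace inode numbers under /proc/<pid>/ns. This is a minimal sketch with hypothetical helper names; reading another process's ns directory may require the same UID or extra privileges.

```python
import os
import re

# Link targets look like "pid:[4026532198]"
NS_LINK = re.compile(r"^(?P<kind>\w+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target: str):
    """Split a /proc/<pid>/ns/* link target into (kind, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def namespaces_of(pid="self") -> dict:
    """Map each entry in /proc/<pid>/ns to its namespace inode."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in os.listdir(ns_dir):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


def shared_namespaces(a: dict, b: dict) -> set:
    """Names of namespaces in which two processes see the same world."""
    return {name for name in a.keys() & b.keys() if a[name] == b[name]}
```

For example, shared_namespaces(namespaces_of(), namespaces_of(1)) run on the host as root shows which namespaces the current process shares with PID 1.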

Production Caveats and Patterns

The container ecosystem is full of footguns that only fire under load, in production, after months of working fine. Below are the four pitfalls that have generated the most incidents in real engineering organizations, paired with the patterns that defuse them.
Caveat 1: Containers are not VMs, and namespace breakouts are real. A container shares the host kernel. A kernel bug in any reachable code path can become a host compromise. The 2019 runc CVE (CVE-2019-5736) let a malicious container overwrite the host runc binary by abusing the way runc re-executes itself through /proc/self/exe; once overwritten, the next container start would execute attacker-controlled code as root on the host. Similar primitives have surfaced periodically (overlayfs in 2021, cgroups release_agent in 2022, OverlayFS again in CVE-2023-2640). The lesson: never treat the kernel attack surface as if it had VM-grade isolation properties.
Pattern: Apply defense-in-depth and choose the right runtime for the threat model. Run untrusted code under gVisor (user-space syscall interposition) or Kata/Firecracker (per-container microVM); AWS Lambda and Fly.io chose Firecracker for exactly this reason. For trusted workloads, layer seccomp (Docker’s default profile blocks roughly 44 of 300+ syscalls, and a workload-specific allowlist can be far tighter), drop all capabilities by default, enable user namespaces so container root maps to an unprivileged host UID, set a --read-only rootfs, and keep runc patched. Audit your seccomp profile: the default Docker profile is good but not minimal.
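To audit what "drop all capabilities" actually left behind, you can decode the CapEff bitmask from /proc/<pid>/status. The sketch below hard-codes the capability names in bit order from the Linux capability list; the helper names are hypothetical, and the sample mask 00000000a80425fb (the value commonly seen in a default Docker container) should be treated as an assumption to verify on your own system.

```python
# Capability names indexed by bit position (Linux capability numbering).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]


def decode_caps(mask_hex: str) -> list:
    """Turn a hex capability mask into the list of capability names it sets."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(status_text: str) -> list:
    """Extract and decode the CapEff line from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    raise ValueError("no CapEff line found")
```

Typical use is effective_caps(open("/proc/self/status").read()) inside the container; an empty list means every capability was dropped.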
Caveat 2: PID 1 in a container must reap children, or zombies pile up until you exhaust the PID limit. On a normal Linux system, init (systemd, sysvinit) reaps orphaned children. Inside a container, your application is PID 1 — and most application runtimes (Node, Python, the JVM, Go) do not reap orphans. If your app spawns short-lived shells (think child_process.exec), each exited child becomes a zombie that holds a process slot until PID 1 reads its exit status. Symptom: fork: retry: Resource temporarily unavailable after a few days, even though the container looks idle.
Pattern: Use a tiny init like tini, dumb-init, or docker run --init. Docker’s --init flag injects tini as PID 1, which forwards signals to your app and reaps zombies. The cost is one extra process (~1 MB RSS). If you control the image, ENTRYPOINT ["/usr/bin/dumb-init", "--"] is the production-standard pattern. The same logic applies in Kubernetes via shareProcessNamespace: true (so the pause container reaps) — but even then, prefer tini per-container for deterministic behavior.
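The reaping half of what tini does can be sketched in a few lines. This is an illustration of the mechanism, not a replacement for a real init: the function name is hypothetical, and a production init would also forward signals and run this from a SIGCHLD handler.

```python
import os


def reap_children() -> list:
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) pairs. As PID 1 this also picks up
    orphans that were reparented to us, which is what prevents zombies.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, os.waitstatus_to_exitcode(status)))
    return reaped
```

Until waitpid is called for an exited child, its entry stays in the process table as a zombie and keeps counting against pids.max.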
Caveat 3: When a container OOMs, the kernel kills a process inside it, not necessarily the one that allocated the memory. The cgroup OOM killer scores processes in the cgroup by memory footprint (mostly RSS) adjusted by oom_score_adj, then kills the highest scorer. If your container runs a parent process and several workers, the parent might survive while a worker dies, leaving you with a half-broken container that the orchestrator sees as “running.” Worse: memory that is charged to the cgroup but cannot be reclaimed quickly (tmpfs and shared memory, or dirty page cache still waiting for writeback) counts against the limit too, so the cgroup OOM may fire even though no process is actually holding much heap. Clean page cache, by contrast, is simply evicted.
Pattern: Set memory.high for soft pressure, treat OOM as restart-the-pod, and inspect memory.stat. In cgroups v2, memory.high throttles allocation before hitting the hard memory.max, giving you backpressure instead of a kill. In Kubernetes, set restartPolicy: Always and use a liveness probe that fails when the app’s worker count drops; the orchestrator will replace the half-dead pod. When debugging, never trust memory.current alone: read memory.stat and look at anon (real heap), file (page cache, reclaimable), and kernel_stack (per-thread cost). Many apparent “memory leaks” in containers turn out to be page cache growth, for example from heavy log-file I/O, rather than heap growth.
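A small parser makes the anon/file/kernel_stack split easy to log next to memory.current. This is a sketch under the assumption of the cgroup v2 memory.stat format ("key value" per line); the function names are hypothetical.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse memory.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats


def breakdown_mib(stats: dict) -> dict:
    """Report the three fields discussed above in MiB (missing keys count as 0)."""
    return {key: stats.get(key, 0) / 2**20
            for key in ("anon", "file", "kernel_stack")}
```

Feeding it the content of /sys/fs/cgroup/memory.stat shows at a glance whether usage is dominated by heap (anon) or by page cache (file).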
Caveat 4: Cgroup v1 vs v2 — many tools, runtimes, and interview answers still assume v1. Cgroups v2 (the unified hierarchy) is the default on RHEL 9, Ubuntu 22.04+, and modern Kubernetes. It changes the semantics: a single hierarchy instead of one per controller, memory.max instead of memory.limit_in_bytes, cpu.max instead of cpu.cfs_quota_us + cpu.cfs_period_us. Older Java versions (pre-15) read v1 paths and silently report wrong limits, leading to JVMs that allocate based on the host’s RAM rather than the container’s limit. Same trap with cAdvisor pre-0.40 and old Datadog agents.
Pattern: Detect the hierarchy and read the right files. Run stat -fc %T /sys/fs/cgroup/: cgroup2fs means v2, tmpfs means v1. For runtime detection in apps, read /proc/self/cgroup: a single line of 0::/... is v2, multiple lines with controller prefixes is v1. Upgrade to JDK 17+ (or backports) for correct cgroup v2 awareness. In Kubernetes 1.25+, cgroup v2 is GA and required for features like Memory QoS.
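The /proc/self/cgroup check can be sketched as below. The function name and the "hybrid" label (for hosts that mount both hierarchies) are assumptions of this example, not kernel terminology.

```python
def cgroup_version(proc_self_cgroup: str) -> str:
    """Classify the content of /proc/self/cgroup as v1, v2, or hybrid.

    v2 entries look like "0::/path"; v1 entries look like
    "<id>:<controllers>:/path", for example "12:memory:/docker/abc".
    """
    lines = [line for line in proc_self_cgroup.splitlines() if line.strip()]
    has_v2 = any(line.startswith("0::") for line in lines)
    has_v1 = any(not line.startswith("0::") for line in lines)
    if has_v2 and has_v1:
        return "hybrid"
    if has_v2:
        return "v2"
    if has_v1:
        return "v1"
    return "unknown"
```

An application can call cgroup_version(open("/proc/self/cgroup").read()) at startup and then choose between memory.max and memory.limit_in_bytes accordingly.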

Senior Interview Questions

Strong Answer Framework:
  1. State the isolation boundary first. A VM virtualizes hardware — the hypervisor (KVM, Hyper-V, Xen) traps privileged instructions, presents virtual CPUs and devices, and the guest runs its own kernel. A container shares the host kernel and uses namespaces plus cgroups to restrict what a process sees and how much it can consume. The boundary determines the threat model: a VM contains a guest-kernel exploit; a container does not.
  2. Describe the cost dimensions. VMs pay for boot time (seconds to tens of seconds for a full OS), per-instance memory (50-500 MB just for the guest kernel and userspace), and VM-exit overhead on every privileged operation. Containers boot in milliseconds, share the host kernel page cache, and have near-native syscall latency. On a typical 64 GB server, you fit 10-30 VMs or hundreds-to-thousands of containers.
  3. Map workload to choice. Multi-tenant code execution (Lambda, code sandboxes, customer-supplied workloads) wants a VM-grade boundary. Internal microservices owned by your org want containers — you trust the code, you want the density. Long-running stateful services with strict noisy-neighbor isolation (databases on shared hardware) sit in the middle: VMs or microVMs.
  4. Acknowledge the middle ground. Firecracker and Cloud Hypervisor are minimal VMMs that boot a microVM in ~125 ms with ~5 MB overhead — AWS Lambda and Fly.io use them precisely because they want VM isolation at container density. gVisor goes the other way: a user-space kernel that intercepts syscalls before they reach the host kernel, reducing the kernel attack surface at a 2-5x I/O perf cost.
  5. Name your default. “I default to containers for everything I own and trust, microVMs (Firecracker) for anything where the threat model includes hostile guest code, and full VMs only when I need a different OS or hardware-level features (nested virtualization, GPU passthrough).”
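The density argument in step 2 is simple arithmetic, sketched below. Every number in the usage is an illustrative assumption (workload sizes, a 512 MiB guest OS tax, a 5 MiB per-container shim), not a measurement.

```python
def instances_per_host(host_mib: int, workload_mib: int, overhead_mib: int) -> int:
    """How many instances fit in host memory, given a fixed per-instance overhead."""
    return host_mib // (workload_mib + overhead_mib)


HOST_MIB = 64 * 1024  # the 64 GB server from the text

# Full VM: assumed 2 GiB workload plus 512 MiB guest kernel and userspace.
vms = instances_per_host(HOST_MIB, workload_mib=2048, overhead_mib=512)

# Container: assumed 128 MiB workload plus about 5 MiB for the shim.
containers = instances_per_host(HOST_MIB, workload_mib=128, overhead_mib=5)
```

With these assumptions the host fits 25 VMs or 492 containers, which is consistent with the "10-30 VMs or hundreds-to-thousands of containers" range above. Note that the gap comes from both the per-instance overhead and the smaller workloads that containers make practical.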
Real-World Example: AWS built Firecracker to give Lambda and Fargate a VM-grade boundary at container-like density, and open-sourced it at re:Invent in November 2018. Each Fargate task runs inside its own isolation boundary and does not share a kernel with other customers’ tasks, which removes the shared-kernel attack surface that container escapes such as runc CVE-2019-5736 (February 2019) depend on. The latency cost is small (~125 ms for a microVM boot versus ~50 ms for a plain container start); the security win is eliminating an entire class of cross-tenant kernel exploits.
Senior follow-up: How does Firecracker boot in 125 ms when a normal Linux VM takes 30+ seconds? It strips the kernel to a minimal config (no PCI enumeration, almost no drivers, no initramfs), uses virtio-only devices over MMIO, and bypasses BIOS/UEFI by jumping straight into the kernel via the Linux Boot Protocol. The VMM itself is ~50k lines of Rust with no QEMU device model. The whole point is that “boot” in a microVM is closer to “exec a process” than to “start a computer.”
Senior follow-up: A colleague says they can run untrusted code in a Docker container with --security-opt no-new-privileges and a strict seccomp profile. Safe enough? No. You have closed common privilege-escalation paths, but you have not closed the kernel attack surface. CVE-2022-0492 (cgroups release_agent), CVE-2022-0185 (filesystem mount), and CVE-2023-0386 (overlayfs) were all kernel bugs that could be turned into a container escape when the vulnerable code path was reachable, for example when the container could create user namespaces or held extra capabilities. Seccomp narrows the syscall surface but does not eliminate it. Use a microVM if the code is genuinely hostile.
Senior follow-up: When does the VM vs container choice not matter? When the workload itself dominates — a CPU-bound numerical job spending 99% of cycles in user-space math does not care about VM-exit overhead. The choice matters for I/O-heavy workloads (where syscall + virt overhead is felt), short-lived workloads (where boot time is felt), and dense multi-tenant workloads (where per-instance memory tax is felt).
Common Wrong Answers:
  1. “Containers are just lightweight VMs.” Wrong on both technical and security grounds. Different isolation boundary, different threat model, different operational profile. This phrasing signals the candidate has not internalized why CVE-2019-5736 was possible at all.
  2. “VMs are always more secure.” Not always. A poorly-configured VM (default credentials, exposed management interface, unpatched hypervisor) is less secure than a hardened container. Isolation is necessary but not sufficient.
  3. “Use Kubernetes; it handles all this.” Kubernetes is an orchestrator, not an isolation strategy. By default, K8s pods on the same node share the kernel. Multi-tenant K8s requires explicit decisions: gVisor RuntimeClass, Kata Containers, dedicated node pools per tenant.
Further Reading:
  • Aqua Security CVE write-up on runc CVE-2019-5736 — the canonical example of a container breakout via shared filesystem semantics.
  • Firecracker design paper (NSDI 2020), “Firecracker: Lightweight Virtualization for Serverless Applications.”
  • “Container Security” by Liz Rice (O’Reilly) — the standard reference for production container isolation.
Strong Answer Framework:
  1. Start with the layered architecture. docker run talks to dockerd over a Unix socket. dockerd parses the request and delegates to containerd via gRPC. containerd resolves the image (pulling layers if needed), prepares an OCI runtime spec (a JSON config describing the container), and shells out to runc. runc is the piece that actually creates the namespaces.
  2. The runc dance. runc forks itself. The child calls unshare() (or passes CLONE_NEW* flags to clone()) to create new namespaces in a single transition: PID, mount, UTS, IPC, net, user, cgroup. Order matters — user namespace must be set up first so subsequent namespaces are owned by it. The parent stays in the host namespace and writes the new PID into the appropriate cgroup files.
  3. Filesystem setup. Inside the new mount namespace, runc bind-mounts the OverlayFS merged directory (lower = read-only image layers, upper = writable container layer, work = OverlayFS bookkeeping) to a temporary path, then pivot_roots into it. Old root is unmounted. /proc and /sys get fresh mounts inside the new mount namespace — this is what makes cat /proc/1/status show the containerized process as PID 1.
  4. Resource caps. The cgroup is configured before the workload starts: memory.max, cpu.max, pids.max, io.max are all written to the cgroup directory. The process is added to the cgroup via cgroup.procs. Once added, the kernel enforces limits on every allocation and scheduling decision for that process and its descendants.
  5. Security hardening. runc applies the seccomp BPF filter (~50 syscalls blocked by default profile), drops capabilities (default keeps ~14 of 40+ capabilities), applies AppArmor or SELinux label, and finally execves the actual entrypoint. At this point, the process is the container.
  6. Detach. runc exits, leaving the entrypoint reparented to a containerd-shim process. The shim holds the container’s TTY and exit code so dockerd can crash and the container survives.
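The OCI runtime spec that containerd hands to runc in step 1 carries the namespace list and cgroup limits from steps 2 and 4. The sketch below builds a deliberately incomplete config.json fragment using field names from the OCI runtime specification; the function name and default values are hypothetical, and a usable spec would also need mounts, capabilities, a seccomp section, and UID/GID mappings for the user namespace.

```python
import json


def minimal_oci_spec(args, memory_bytes, cpu_quota_us,
                     cpu_period_us=100_000, pids_limit=128) -> dict:
    """Build an illustrative (incomplete) OCI runtime config as a dict."""
    namespace_types = ("pid", "mount", "uts", "ipc", "network", "user", "cgroup")
    return {
        "ociVersion": "1.0.2",
        "process": {"args": list(args), "cwd": "/", "noNewPrivileges": True},
        "root": {"path": "rootfs", "readonly": True},
        "linux": {
            # One entry per namespace runc should create for the container.
            "namespaces": [{"type": t} for t in namespace_types],
            # Limits that end up in memory.max, cpu.max, and pids.max.
            "resources": {
                "memory": {"limit": memory_bytes},
                "cpu": {"quota": cpu_quota_us, "period": cpu_period_us},
                "pids": {"limit": pids_limit},
            },
        },
    }


spec = minimal_oci_spec(["/bin/sh"], memory_bytes=512 * 2**20, cpu_quota_us=50_000)
config_json = json.dumps(spec, indent=2)
```

Here a quota of 50,000 µs per 100,000 µs period corresponds to half a core, and the 512 MiB limit matches the example earlier in this chapter.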
Real-World Example: In 2021, a Datadog customer reported intermittent “container start” timeouts under load. Root cause: their CI system was launching ~500 containers per minute, each creating a new network namespace and veth pair. The kernel’s rtnl_lock (the global netlink lock) became a bottleneck — veth creation serialized across cores. Fix was to switch to ipvlan (which is lock-free per-namespace) and to batch container creation to avoid thundering-herd lock contention. The lesson: namespace creation is not free, and the cost is concentrated in specific kernel locks.
Senior follow-up: Why does docker run --pid=host exist, and when is it dangerous? It tells runc to skip creating a PID namespace — the container shares the host’s PID namespace. Useful for debugging tools (a sidecar that needs to kill -9 a host process) or process supervisors. Dangerous because a process inside the container can now see and signal every process on the host; if the container has CAP_KILL or CAP_SYS_PTRACE, that is a full escape primitive.
Senior follow-up: What does the containerd-shim actually do? It is the per-container parent process that survives daemon restarts. It owns the container’s stdin/stdout/stderr pipes, reaps the container’s exit code (so dockerd can read it later), and holds the controlling TTY. Without the shim, restarting dockerd would orphan every running container. The shim is roughly 5 MB of memory per container — non-trivial at scale, which is why projects like Kata Containers replace it with a single shim per VM.
Senior follow-up: How is docker exec different from docker run? exec does not create new namespaces. It uses setns(2) to enter the existing container’s namespaces (one syscall per namespace, with the target file descriptor from /proc/PID/ns/*), then execves the new command. The new process inherits all the container’s restrictions but is not its child — which is why execed processes do not show up under PID 1 in some ps views and require explicit handling for signal forwarding.
Common Wrong Answers:
  1. “Docker creates a lightweight virtual machine.” No. Docker creates namespaces and cgroups; there is no VM, no hypervisor, no separate kernel.
  2. “Containers are just chroot jails.” chroot only changes the filesystem root. Namespaces additionally isolate process IDs, network, IPC, hostname, users, and cgroup view. Containers are an isolated filesystem root plus those six other forms of isolation, plus resource limits.
  3. “Docker uses LXC.” It used to (in 2013-2014). Since libcontainer (now runc), Docker has its own implementation. LXC is a separate project with different design choices (system containers vs application containers).
Further Reading:
  • Liz Rice, “Containers from Scratch” talk (Container Camp 2017) — builds a container in ~100 lines of Go.
  • OCI Runtime Specification (github.com/opencontainers/runtime-spec) — the actual contract between containerd and runc.
  • “What Have Syscalls Done for You Lately?” by Jessie Frazelle — a tour of the kernel features Docker exercises.
Strong Answer Framework:
  1. Frame the question as ‘what is the attack surface’. Plain Docker exposes most of the Linux syscall ABI: of the 300+ syscalls, the default seccomp profile blocks only around 44. Any kernel bug reachable through the remaining syscalls is a potential escape. The three alternatives reduce the attack surface in different ways.
  2. gVisor (runsc). A user-space kernel called Sentry implements the Linux syscall interface in Go. Container syscalls go to Sentry; Sentry itself is confined to a small allowlist of host syscalls (on the order of tens). Attack surface drops by an order of magnitude. Cost: every syscall traverses Sentry’s user-space implementation, so I/O-heavy workloads see 2-5x slowdowns. Some uncommon syscalls or kernel features are not implemented (io_uring, some namespaces). Used by Google for App Engine and Cloud Run.
  3. Firecracker. Each container gets a lightweight VM with a minimal Linux kernel, a virtio-only device model, and a ~50k-line Rust VMM. The host kernel attack surface reduces to KVM plus virtio. Boot time ~125 ms, memory overhead ~5 MB per microVM. Syscall-compatible (it is real Linux), supports any syscall the guest kernel supports. Cost: per-VM kernel memory, VM-exit overhead, and you now manage a kernel image lifecycle. Used by AWS Lambda, AWS Fargate, Fly.io.
  4. Kata Containers. Same idea as Firecracker (a per-container VM) but using QEMU (or Cloud Hypervisor) and integrating directly with Kubernetes via a RuntimeClass. Compatible with full OCI semantics, supports more device types (GPUs via VFIO). Heavier than Firecracker (~30 MB memory overhead and a slower boot than Firecracker’s ~125 ms) but more flexible.
  5. Recommend with reasoning. “For untrusted code execution at scale — think a Lambda-style platform, a code-runner SaaS, or sandboxed CI — I would pick Firecracker. The threat model is ‘guest is hostile,’ and Firecracker gives me VM-grade isolation with container-grade density and start time. gVisor is a strong alternative when I cannot run my own kernel image (managed K8s without RuntimeClass support) and the workload is not I/O-bound.”
Real-World Example: AWS Lambda originally isolated functions with Linux containers and isolated customers from each other with separate EC2 VMs, so functions from different accounts never shared a kernel. To get the same boundary at much higher density, AWS built Firecracker (open-sourced in November 2018) and moved Lambda onto microVMs, one per function execution environment. The design targets were a boot time of about 125 ms and about 5 MB of VMM memory overhead per microVM, so the stronger per-function isolation did not come at the cost of container-like density or start time.
Senior follow-up: What is a VM exit and why does Firecracker care? A VM exit happens when the guest CPU executes an instruction that requires hypervisor mediation: accessing a device register, touching guest-physical memory that the hypervisor has not yet mapped (an EPT violation), or executing a privileged instruction that the VMCS is configured to trap. Each exit costs 1000-3000 cycles for state save/restore. Firecracker minimizes exits by using virtio over MMIO (shared memory rings instead of port I/O), pre-faulting memory to avoid page-fault exits, and configuring the VMCS to let the guest handle as many operations natively as possible. For I/O-bound workloads, exit frequency is the dominant performance factor.
Senior follow-up: Why is gVisor written in Go, given the performance cost? Memory safety. The whole point of gVisor is that the user-space kernel is a barrier: if Sentry has a memory corruption bug, you have not gained anything. Go’s runtime gives you bounds-checked array access, GC-managed lifetimes, and no buffer overflows, significantly reducing the class of bugs that matter for a kernel. The performance cost is the price of that safety. The project claws some of it back with platform-specific optimizations and with static analysis (its nogo checks) that keeps heap allocations out of syscall hot paths.
Senior follow-up: When is gVisor the wrong answer? When the workload uses syscalls gVisor has not implemented (io_uring is not supported, some perf events, some less-common namespaces), or when I/O latency dominates (a database, a high-RPS proxy). gVisor’s overhead on small reads/writes can be 3-5x; on a sustained NVMe workload, that is the difference between “works” and “useless.”
Common Wrong Answers:
  1. “They are all the same thing — secure containers.” They use fundamentally different mechanisms. gVisor intercepts syscalls in user space; Firecracker and Kata run a real guest kernel in a VM. The security properties and performance profiles are not interchangeable.
  2. “Just use seccomp.” Seccomp narrows the syscall surface but does not change the trust boundary. The remaining allowed syscalls still execute in the host kernel. Seccomp is a hardening layer, not an isolation strategy.
  3. “VMs are always slower.” Not for the metrics that matter at scale. Firecracker boots faster than many container runtimes (~125 ms vs ~200-500 ms for full Docker). Per-syscall overhead is comparable to native for compute-bound workloads. The VM tax is real but smaller than people assume.
Further Reading:
  • Firecracker NSDI 2020 paper — Agache et al., “Firecracker: Lightweight Virtualization for Serverless Applications.”
  • gVisor docs (gvisor.dev) — specifically the “Performance Guide” page which is honest about where gVisor is slow.
  • Kata Containers architecture overview at katacontainers.io — explains the shim-v2 model.

7. Advanced Practice

  1. Manual Namespace Build: Use the unshare command to create a shell with its own network and PID namespace. Try to ping the host.
  2. Cgroup Stress Test: Create a cgroup v2 with a 100MB memory limit. Run a program that allocates 200MB and observe the kernel’s OOM killer logs in dmesg.
  3. VirtIO Analysis: Run a KVM guest and use lspci inside the guest to identify which devices are using virtio drivers vs. emulated hardware.
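For the cgroup stress test in exercise 2, a minimal allocator sketch is shown below. The function name and chunk size are arbitrary choices. It writes to every page so that the memory is actually charged to the cgroup rather than merely reserved.

```python
def allocate(total_mib: int, chunk_mib: int = 10) -> list:
    """Allocate total_mib MiB in chunks and touch every 4 KiB page."""
    chunks = []
    remaining = total_mib
    while remaining > 0:
        size_mib = min(chunk_mib, remaining)
        buf = bytearray(size_mib * 2**20)
        for offset in range(0, len(buf), 4096):
            buf[offset] = 1  # force the kernel to back this page with memory
        chunks.append(buf)
        remaining -= size_mib
    return chunks
```

Calling allocate(200) from a process placed in the 100MB cgroup should end with the process being killed partway through, with a matching OOM entry in dmesg.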

Next: OS Security & Hardening

Interview Deep-Dive

Strong Answer:
  • Containers and VMs have fundamentally different isolation boundaries. A VM runs its own kernel on virtualized hardware — the hypervisor (KVM, Xen, Hyper-V) provides each VM with a virtual CPU, virtual memory, and virtual devices. The guest OS has no direct access to the host kernel. A container, on the other hand, shares the host kernel. It is just a regular process (or group of processes) with restricted views of system resources via namespaces and restricted resource usage via cgroups.
  • The security implication is significant: a kernel vulnerability (like a privilege escalation in a syscall handler) can allow a container to escape to the host. In a VM, the guest kernel vulnerability stays contained because the guest cannot directly call host kernel code — it must go through the hypervisor, which is a much smaller attack surface. This is why multi-tenant public clouds (AWS, GCP) use VMs, not containers, as the primary isolation boundary for customer workloads.
  • Containers add defense-in-depth with seccomp (restricting which syscalls a container can make), AppArmor/SELinux (mandatory access control profiles), dropped capabilities (containers typically run without CAP_SYS_ADMIN), and user namespaces (mapping container root to a non-root host UID). But all of these are enforced by the shared host kernel, so a kernel bug can bypass all of them simultaneously.
  • The middle ground is gVisor (which interposes a user-space kernel that handles syscalls before they reach the real kernel) and Kata Containers / Firecracker (which run each container inside a lightweight VM). AWS Lambda uses Firecracker MicroVMs: each function execution environment runs in its own microVM with a minimal Linux kernel that boots in about 125ms.
Follow-up: What specifically does a PID namespace do, and can a process inside a PID namespace see or signal processes outside it?
A PID namespace virtualizes process IDs. The first process in a new PID namespace becomes PID 1 within that namespace, even though it has a different PID on the host. Processes inside the namespace can only see and signal other processes within the same namespace (or descendant namespaces). They cannot see or signal host processes. However, from the host’s perspective, all container processes are visible with their host PIDs. This is one-way isolation: the host can see into containers, but containers cannot see out. The exception is if a container has CAP_SYS_PTRACE and shares the host PID namespace (e.g., Docker’s --pid=host flag), which breaks isolation and is a serious security risk.
Strong Answer:
  • In cgroups v2, the memory.max file sets the hard memory limit for a cgroup. The kernel tracks every page allocated by processes in the cgroup: anonymous pages (heap, stack), file-backed pages (page cache), kernel memory (slab, socket buffers), and swap (if memory.swap.max is set).
  • When a process in the cgroup tries to allocate memory and the cgroup’s usage is at memory.max, the kernel first tries to reclaim memory. It invokes the cgroup-aware reclaimer, which scans the cgroup’s LRU lists and evicts reclaimable pages — file-backed pages can be dropped (clean) or written back (dirty), and anonymous pages can be swapped out if swap is available.
  • If reclamation succeeds in freeing enough memory, the allocation proceeds and there is no OOM kill. This is actually the common case — the page cache grows until it fills the memory limit, then the kernel evicts old cached pages to make room for new allocations.
  • OOM kill only happens when reclamation fails — there is no more reclaimable memory (all pages are anonymous and there is no swap, or swap is also full). At that point, the cgroup-level OOM killer selects a process within the cgroup to kill. Crucially, it will NOT kill processes outside the cgroup, which is the whole point of containerized memory limits.
  • There is also memory.high, which is a throttling threshold (not a hard limit). When usage exceeds memory.high, the kernel applies memory pressure by slowing down allocations (the process is forced to do direct reclaim, which is slow). This provides backpressure before hitting the hard limit, giving the application a chance to reduce its memory footprint.
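The decision sequence in the bullets above can be summarized as a toy model. This is a deliberately simplified sketch, not the kernel's algorithm: the function name and the single "reclaimable" number are assumptions, and real reclaim is incremental and can partially succeed.

```python
def memory_pressure_action(usage: int, request: int, high: int,
                           max_: int, reclaimable: int) -> str:
    """Toy model of what happens when a cgroup asks for more memory.

    All values share one unit (for example MiB). Returns one of
    "allow", "throttle", "reclaim", or "oom-kill".
    """
    projected = usage + request
    if projected <= high:
        return "allow"       # below memory.high: no pressure applied
    if projected <= max_:
        return "throttle"    # above memory.high: allocator is slowed down
    needed = projected - max_
    if reclaimable >= needed:
        return "reclaim"     # at memory.max: evict page cache or swap out
    return "oom-kill"        # nothing left to reclaim inside the cgroup
```

With high=400 and max_=512, a request that pushes usage to 550 is satisfied by reclaim if at least 38 units of page cache can be dropped, and ends in an OOM kill otherwise.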
Follow-up: Why do container monitoring tools often show different memory usage than the host’s free command?
Container monitoring tools (like docker stats or cAdvisor) report cgroup-level memory usage, which includes the page cache attributed to that cgroup. The page cache is reclaimable, so it is not “used” in the same sense as heap memory, but it counts against the cgroup’s limit. This is why a container might show 90% memory usage while the application inside thinks it is only using 200MB of heap; the rest is kernel page cache from file I/O. The memory.stat file in the cgroup filesystem breaks this down into anon, file, shmem, etc. When diagnosing memory issues, always check memory.stat rather than just memory.current.
Strong Answer:
  • Docker containers (with default settings) provide namespace isolation, cgroup resource limits, seccomp syscall filtering (the default profile blocks roughly 44 of 300+ syscalls), and dropped capabilities. The attack surface is most of the Linux kernel syscall interface, since the large majority of syscalls remain allowed by default. A kernel zero-day reachable through any of them can escape the container. In practice, this is good enough for trusted workloads (your own code) but risky for untrusted code.
  • gVisor (runsc) interposes a user-space kernel called Sentry that implements the Linux syscall interface. The untrusted process’s syscalls go to Sentry, not the real kernel. Sentry only makes a small, allowlisted subset of syscalls to the host kernel (on the order of tens), massively reducing the attack surface. The trade-off is performance: every syscall goes through Sentry’s user-space implementation, which adds latency. I/O-heavy workloads can see 2-5x slowdowns. gVisor also does not support every Linux syscall perfectly, so some applications may not run correctly.
  • Firecracker MicroVMs give each workload its own lightweight VM with a minimal Linux kernel. The host kernel attack surface is reduced to KVM (the hypervisor) and a small set of virtio device emulations. Firecracker itself is written in Rust with a minimal device model (~50k lines of code), so the attack surface is tiny. Boot time is about 125ms, and memory overhead is about 5MB per VM. The trade-off is that you need to manage a full (minimal) kernel per workload, and there is overhead from the virtualization layer (EPT/SLAT page table walks, VM exits).
  • For untrusted workloads, I would choose Firecracker if latency and syscall compatibility matter (like running arbitrary user-submitted code, which is what AWS Lambda does), or gVisor if the workload is I/O-light and I want strong isolation without the operational complexity of managing VM images. I would never use plain Docker for truly untrusted code.
Follow-up: What is a VM exit, and why does it matter for Firecracker’s performance?
A VM exit occurs when the guest executes an instruction that the hardware cannot handle in guest mode, for example accessing a device register, executing a privileged instruction not configured in the VMCS, or hitting a page fault that requires hypervisor intervention. The CPU saves the guest state, loads the host state, and transfers control to the hypervisor (KVM), which handles the event and then re-enters the guest (VM entry). Each exit/entry cycle costs roughly 1000-3000 CPU cycles. Firecracker minimizes VM exits by using virtio paravirtualized devices (the guest cooperates with the hypervisor using shared memory rings instead of emulating real hardware registers) and by configuring the VMCS to let the guest handle as many operations as possible without exiting. For I/O-heavy workloads, the frequency of VM exits is the primary performance bottleneck.
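A back-of-the-envelope calculation shows why exit frequency dominates. The sketch below assumes 2000 cycles per exit (the middle of the range above) and a 3 GHz core; the function name and the packet rate in the usage are illustrative assumptions.

```python
def exit_overhead(packets_per_sec: float, exits_per_packet: float,
                  cycles_per_exit: int = 2000,
                  cpu_hz: int = 3_000_000_000) -> float:
    """Number of CPU cores consumed purely by VM exit/entry transitions."""
    return packets_per_sec * exits_per_packet * cycles_per_exit / cpu_hz


# 100,000 packets per second on an emulated NIC (about 100 exits per packet)
emulated = exit_overhead(100_000, 100)

# Same rate with virtio (about 1 exit per packet)
virtio = exit_overhead(100_000, 1)

# Same rate with virtio and 32 packets batched per notification
batched = exit_overhead(100_000, 1 / 32)
```

Under these assumptions emulation would need more than six cores just for exits, virtio needs under 7% of one core, and batching brings that to roughly 0.2%.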