Isolation is the core requirement of multi-tenant cloud computing. Whether you are running a SaaS platform or a microservices cluster, you must ensure that processes are contained, resources are metered, and security boundaries are enforced. Modern systems achieve this through two distinct paths: OS-level virtualization (Containers) and Hardware-level virtualization (VMs).
A container is not a “thing” in the Linux kernel. It is a user-space abstraction built using three primary kernel features: Namespaces, Control Groups (cgroups), and Union Filesystems.
Namespaces wrap global system resources in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource.
| Namespace | Flag | Isolated Resource |
|---|---|---|
| Mount | CLONE_NEWNS | Filesystem mount points (independent mount/umount). |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name. |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues. |
| PID | CLONE_NEWPID | Process IDs (Process 1 inside the container). |
| Network | CLONE_NEWNET | Network devices, stacks, ports, firewalls. |
| User | CLONE_NEWUSER | User and group IDs (Root in container != Root on host). |
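As a small illustration of how these flags combine, the sketch below ORs them into the single bitmask that clone() or unshare() receives. The numeric values are the ones defined in the Linux headers; the helper names `clone_mask` and `decode_mask` are illustrative only.

```python
# Flag values as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mount": 0x00020000,  # CLONE_NEWNS
    "uts": 0x04000000,    # CLONE_NEWUTS
    "ipc": 0x08000000,    # CLONE_NEWIPC
    "user": 0x10000000,   # CLONE_NEWUSER
    "pid": 0x20000000,    # CLONE_NEWPID
    "net": 0x40000000,    # CLONE_NEWNET
}


def clone_mask(namespaces):
    """OR together the flags for the requested namespace kinds."""
    mask = 0
    for name in namespaces:
        mask |= CLONE_FLAGS[name]
    return mask


def decode_mask(mask):
    """Return the sorted namespace kinds whose flag is set in mask."""
    return sorted(name for name, bit in CLONE_FLAGS.items() if mask & bit)


if __name__ == "__main__":
    mask = clone_mask(["pid", "net", "mount"])
    print(hex(mask), decode_mask(mask))
```

A runtime that wants every namespace in the table would pass all six names and hand the resulting mask to the kernel in one call.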
# Create network namespace
ip netns add container1
# Create veth pair
ip link add veth0 type veth peer name veth1
# Move one end into namespace
ip link set veth1 netns container1
# Configure host end
ip addr add 172.17.0.1/16 dev veth0
ip link set veth0 up
# Configure container end
ip netns exec container1 ip addr add 172.17.0.2/16 dev veth1
ip netns exec container1 ip link set veth1 up
ip netns exec container1 ip route add default via 172.17.0.1
# Test connectivity
ip netns exec container1 ping 172.17.0.1
Code Example: Creating Network Namespace
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <net/if.h>

static int child_fn(void *arg) {
    (void)arg;
    // Now in a new network namespace: list its interfaces
    struct if_nameindex *if_ni = if_nameindex();
    if (if_ni) {
        for (int i = 0; if_ni[i].if_index != 0; i++) {
            printf("Interface: %s (index %u)\n",
                   if_ni[i].if_name, if_ni[i].if_index);
        }
        if_freenameindex(if_ni);
    }
    // Only loopback exists in a new namespace
    return 0;
}

int main(void) {
    const size_t STACK_SIZE = 65536;
    char *stack = malloc(STACK_SIZE);
    if (!stack) {
        perror("malloc");
        return 1;
    }
    // The stack grows down, so pass its top. CLONE_NEWNET needs CAP_SYS_ADMIN.
    if (clone(child_fn, stack + STACK_SIZE, CLONE_NEWNET | SIGCHLD, NULL) == -1) {
        perror("clone");
        return 1;
    }
    wait(NULL);
    free(stack);
    return 0;
}
Mount namespaces isolate the filesystem mount points.
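// Create a new mount namespace (requires CAP_SYS_ADMIN)
unshare(CLONE_NEWNS);
// The new namespace starts as a copy of the parent's mounts and inherits
// their propagation type, so mark everything private before mounting
mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL);
// This mount is now visible only inside this namespace
mount("/dev/sda1", "/mnt", "ext4", 0, NULL);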
Mount Propagation:
┌─────────────────────────────────────────────────────────────────────┐
│                       MOUNT PROPAGATION TYPES                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  MS_SHARED: Mounts propagate bidirectionally                        │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │ Namespace A    │ ◄────────────────► │ Namespace B    │           │
│  │ mount /foo     │     propagates     │ sees /foo      │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_PRIVATE: Mounts don't propagate (default in containers)         │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │ Namespace A    │  X  no sharing  X  │ Namespace B    │           │
│  │ mount /foo     │                    │ no /foo        │           │
│  └────────────────┘                    └────────────────┘           │
│                                                                     │
│  MS_SLAVE: Receives mounts from master, but doesn't send            │
│  ┌────────────────┐                    ┌────────────────┐           │
│  │ Master         │ ─────────────────► │ Slave          │           │
│  │ mount /foo     │      one-way       │ sees /foo      │           │
│  └────────────────┘ ◄────────X──────── └────────────────┘           │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
# Create user namespace
unshare --user --map-root-user /bin/bash
# Inside namespace
id   # uid=0(root) gid=0(root)
# But on host, this process runs as your regular user
# File operations as "root" in container map to your UID on host
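chroot only changes the root directory used for path resolution, and it is not a security boundary: a process that keeps root privileges inside the jail can “break out” (for example by calling chroot again on a subdirectory and walking up with .., or through a file descriptor opened outside the jail). Containers use pivot_root instead, which swaps the root mount of the mount namespace for a new one. Once the old root is unmounted, nothing inside the namespace can reach it, which gives a much stronger filesystem jail.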
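#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

// glibc provides no pivot_root wrapper, so invoke the syscall directly
static int pivot_root(const char *new_root, const char *put_old) {
    // Makes new_root the root mount and moves the old root to put_old,
    // which must be a directory underneath new_root
    return syscall(SYS_pivot_root, new_root, put_old);
}

// Usage (inside a new mount namespace; /new_root must be a mount point
// and must already contain an old_root directory)
chdir("/new_root");
pivot_root(".", "old_root");
umount2("old_root", MNT_DETACH);  // detach the old root so it is unreachable
rmdir("old_root");
chdir("/");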
If Namespaces provide isolation (what you see), Cgroups provide containment (what you can use).
Cgroups v1 (Legacy): Multiple hierarchies. A process could be in one group for CPU and a completely different group for Memory. This led to massive complexity and performance issues.
Cgroups v2 (Modern/Unified): A single hierarchy. Every process belongs to exactly one cgroup in a unified tree. This allows for better resource accounting (e.g., attributing page cache writeback to the specific cgroup that caused the dirty pages).
Early virtualization used “Binary Translation” to replace privileged instructions. Modern CPUs handle this in hardware:
VMX Root Mode: The hypervisor runs here (full privileges).
VMX Non-Root Mode: The guest OS runs here. If the guest executes an instruction or hits an event that the hypervisor has configured to trap (for example CPUID, HLT, or, without EPT, a write to CR3), the CPU triggers a VM Exit, trapping into the hypervisor to handle the event.
The VMCS (Virtual Machine Control Structure) is a memory block that stores the “state” of a virtual CPU (guest and host registers, control bits). When switching from VM A to VM B, the hypervisor makes the other vCPU's VMCS the current one (the VMPTRLD instruction).
Shadow Page Tables (Old): The hypervisor tracked guest page table changes in software and built a combined guest-virtual to host-physical (GV→HP) table. This was slow because every guest page table update had to trap into the hypervisor.
EPT (Extended Page Tables): Hardware handles the translation. The CPU has a second set of page tables that map guest-physical to host-physical addresses (GP→HP). A memory access that misses the TLB now involves a “2D Page Walk,” but it happens entirely in hardware.
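To make the cost of the 2D page walk concrete, the sketch below counts worst-case memory references on a TLB miss. It assumes plain radix page tables with no paging-structure caches, and the function names are illustrative.

```python
def native_walk_refs(levels: int) -> int:
    """Without virtualization, a walk reads one entry per page-table level."""
    return levels


def nested_walk_refs(guest_levels: int, host_levels: int) -> int:
    """Worst-case memory references for one guest access under EPT.

    Every guest page-table entry lives at a guest-physical address, so
    reading it costs a full host walk plus the read itself. The final
    guest-physical address of the data then needs one more host walk.
    """
    per_guest_level = host_levels + 1
    return guest_levels * per_guest_level + host_levels


if __name__ == "__main__":
    print("native 4-level walk:", native_walk_refs(4))
    print("nested 4-level guest on 4-level EPT:", nested_walk_refs(4, 4))
```

With four levels on each side this comes to 24 references instead of 4, which is why large pages and TLB reach matter more inside a VM.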
Q1: How does 'User Namespaces' improve container security?
Answer: User Namespaces (CLONE_NEWUSER) allow a process to have UID 0 (root) inside the container while being a non-privileged UID (e.g., 1000) on the host.

Security Improvement:
Without User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                          Host                  │
│  ┌──────────────┐                   ┌──────────────┐      │
│  │ UID 0        │ ════════════════► │ UID 0        │      │
│  │ (root)       │                   │ (root)       │      │
│  └──────────────┘                   └──────────────┘      │
│                                                           │
│  If container breakout occurs:                            │
│  → Attacker has root on host!                             │
└───────────────────────────────────────────────────────────┘

With User Namespace:
┌───────────────────────────────────────────────────────────┐
│  Container                          Host                  │
│  ┌──────────────┐                   ┌──────────────┐      │
│  │ UID 0        │ ────────────────► │ UID 1000     │      │
│  │ (root)       │      mapped       │ (user)       │      │
│  └──────────────┘                   └──────────────┘      │
│                                                           │
│  If container breakout occurs:                            │
│  → Attacker only has UID 1000 permissions                 │
│  → Cannot write to /etc, /boot, system files              │
│  → Cannot load kernel modules                             │
│  → Cannot access other users' files                       │
└───────────────────────────────────────────────────────────┘
Implementation:
# Run Docker with user namespace remapping
dockerd --userns-remap=default
# Or manually with unshare
unshare --user --map-root-user /bin/bash
Limitations:
Some operations still require host root (mounting certain filesystems)
File ownership can be confusing (files created by container appear owned by high UIDs on host)
Not all containers work with user namespaces (especially those requiring true root)
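The "high UIDs on host" effect follows directly from the mapping table the kernel keeps in /proc/<pid>/uid_map, where each line is `inside-start outside-start count`. The sketch below parses that format and translates a container UID to its host UID. The helper names and the sample range of 100000 are assumptions for illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map contents into (inside, outside, count) tuples."""
    ranges = []
    for line in text.strip().splitlines():
        inside, outside, count = (int(field) for field in line.split())
        ranges.append((inside, outside, count))
    return ranges


def to_host_id(ranges, container_id):
    """Translate an ID seen inside the namespace to the ID the host sees."""
    for inside, outside, count in ranges:
        if inside <= container_id < inside + count:
            return outside + (container_id - inside)
    return None  # unmapped inside this namespace


if __name__ == "__main__":
    remap = parse_id_map("0 100000 65536\n")
    print("container root ->", to_host_id(remap, 0))
    print("container uid 1000 ->", to_host_id(remap, 1000))
```

Under this assumed mapping, a file created by container root shows up on the host as owned by UID 100000, which matches the confusing ownership described above.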
Q2: Explain the difference between cgroups v1 and v2 and why v2 is better
Answer:

Cgroups v1 Problems:
Multiple Hierarchies:
Each controller (cpu, memory, io) has its own hierarchy
A process can be in /sys/fs/cgroup/cpu/groupA and /sys/fs/cgroup/memory/groupB
Impossible to do unified resource accounting
Writeback Ambiguity:
Process in cgroup A writes to page cache
Page cache writeback happens later
Which cgroup gets charged for the disk I/O?
v1: Buffered writeback is done later by kernel flusher threads, so the I/O is not charged to the cgroup that dirtied the pages (wrong!)
No Delegation:
Can’t safely give non-root users control over cgroups
Security issues with nested hierarchies
Cgroups v2 Solutions:
Single Hierarchy:
One tree, all controllers
Process location is the same for all resources
Enables proper delegation and accounting
Proper Attribution:
Tracks which cgroup dirtied pages
I/O charged correctly even if writeback delayed
Pressure Stall Information (PSI):
Built-in resource pressure metrics
Can detect when cgroup is starved
Migration Example:
# v1: Multiple hierarchies
/sys/fs/cgroup/cpu/docker/container1/
/sys/fs/cgroup/memory/system/container1/
# v2: Single hierarchy
/sys/fs/cgroup/docker/container1/
# All controllers available here
Q3: How does Docker implement network isolation and connectivity?
1. Create network namespace for container
2. Create veth pair (virtual ethernet cable with two ends)
3. Move one end into container namespace
4. Attach other end to docker0 bridge
5. Configure IP addresses and routes
6. Setup iptables rules for NAT
// Simplified Docker network setup (pseudocode: the ip_* helpers below are
// illustrative stand-ins for netlink operations, not a real C API)
// 1. Create veth pair
ip_link_add("veth0", "veth1", VETH);
// 2. Move one end to namespace
ip_link_set_ns("veth1", container_netns);
// 3. Attach to bridge
ip_link_set_master("veth0", "docker0");
// 4. Configure IPs
ip_addr_add("172.17.0.2/16", "veth1", container_netns);
ip_route_add("default via 172.17.0.1", container_netns);
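The address-configuration step implies some bookkeeping: the bridge takes the first usable address in the subnet and each container gets the next free one. The sketch below shows that allocation with the standard `ipaddress` module. The class name `BridgeIpam` is hypothetical, and this is not a description of how Docker's own IPAM driver is written.

```python
import ipaddress


class BridgeIpam:
    """Hands out addresses from a bridge subnet; .1 is kept for the bridge."""

    def __init__(self, cidr="172.17.0.0/16"):
        self.network = ipaddress.ip_network(cidr)
        hosts = iter(self.network.hosts())
        self.gateway = next(hosts)  # first usable address goes to the bridge
        self._free = hosts          # remaining usable addresses, consumed lazily
        self.leases = {}

    def allocate(self, container_id):
        """Return the container's interface address, allocating on first use."""
        if container_id not in self.leases:
            address = next(self._free)  # StopIteration once the subnet is full
            self.leases[container_id] = ipaddress.ip_interface(
                f"{address}/{self.network.prefixlen}"
            )
        return self.leases[container_id]


if __name__ == "__main__":
    ipam = BridgeIpam()
    print("gateway:", ipam.gateway)
    print("container1:", ipam.allocate("container1"))
    print("container2:", ipam.allocate("container2"))
```

The gateway value is what the container's default route points at, and the allocated interface address is what gets assigned to the container end of the veth pair.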
Q4: What is a 'VM Exit' and why is it expensive?
Answer: A VM Exit occurs when the guest OS performs an action that requires hypervisor intervention (e.g., I/O, CPUID, or accessing certain registers).

VM Exit Flow:
┌─────────────────────────────────────────────────────────────┐
│                      VM EXIT OVERHEAD                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Guest (VM)                                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 1. Execute privileged instruction (e.g., IN/OUT)     │   │
│  │    or access protected resource                      │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                 │
│                           │ CPU Trap                        │
│                           ▼                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 2. Hardware VM Exit                                  │   │
│  │    • Save guest state to VMCS (registers, RIP, etc.) │   │
│  │    • Load host state from VMCS                       │   │
│  │    • Jump to hypervisor entry point                  │   │
│  │    • Time: ~1000-3000 cycles                         │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                 │
│  Hypervisor                                                 │
│  ┌────────────────────────▼─────────────────────────────┐   │
│  │ 3. Handle VM Exit                                    │   │
│  │    • Inspect exit reason                             │   │
│  │    • Emulate instruction (e.g., read port 0x3F8)     │   │
│  │    • Update guest state                              │   │
│  │    • Time: 500-2000 cycles                           │   │
│  └────────────────────────┬─────────────────────────────┘   │
│                           │                                 │
│                           │ VM Entry                        │
│                           ▼                                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ 4. Resume Guest                                      │   │
│  │    • Load guest state from VMCS                      │   │
│  │    • Switch to VMX non-root mode                     │   │
│  │    • Continue guest execution                        │   │
│  │    • Time: ~1000-2000 cycles                         │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
│  Total Overhead: 2500-7000 cycles (1-3 microseconds)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Common VM Exit Causes:
| Cause | Frequency | Mitigation |
|---|---|---|
| I/O instructions (IN/OUT) | High | Use virtio (paravirtualization) |
| CPUID | Medium | Cache results in guest |
| CR3 writes (page table) | High | Use EPT (hardware MMU) |
| Interrupts | Very High | APIC virtualization |
| MSR access | Medium | Use MSR bitmaps |
| HLT (idle) | Low | Acceptable (CPU idle anyway) |
Optimization Strategies:
EPT (Extended Page Tables):
Guest can change CR3 without VM exit
Hardware handles GVA → GPA → HPA translation
APIC Virtualization:
Virtual APIC page in guest memory
Most interrupt operations happen without exits
VirtIO:
Paravirtualized drivers
Shared memory rings reduce I/O exits
Modern virtualization aims to minimize VM Exits using features like APIC Virtualization and EPT.
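To see why exit frequency matters more than the cost of any single exit, the cycle counts can be turned into wall-clock time and a share of one core. This is back-of-the-envelope arithmetic only; the 3 GHz clock and the exit rate used below are assumed example inputs, not measurements.

```python
def exit_cost_us(cycles: float, clock_ghz: float) -> float:
    """Convert a per-exit cycle count into microseconds at a given clock."""
    return cycles / (clock_ghz * 1000.0)


def cpu_fraction_lost(exits_per_second: float, cycles_per_exit: float,
                      clock_ghz: float) -> float:
    """Fraction of one core's cycles spent on VM exits rather than guest work."""
    return exits_per_second * cycles_per_exit / (clock_ghz * 1e9)


if __name__ == "__main__":
    clock = 3.0  # GHz, assumed
    for cycles in (2500, 7000):
        print(f"{cycles} cycles = {exit_cost_us(cycles, clock):.2f} us")
    share = cpu_fraction_lost(100_000, 3000, clock)
    print(f"100k exits/s at 3000 cycles each = {share:.0%} of a core")
```

At these assumed numbers a guest taking 100,000 exits per second gives up about a tenth of a core, which is the motivation for the mitigations listed above.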
Q5: Explain the 'Nested Virtualization' problem
Answer: Nested virtualization is running a hypervisor, and therefore a VM, inside another VM (e.g., running KVM inside a cloud instance that is itself a VM).

Address Translation Complexity:
The main challenge is translating “Level 2” guest-physical addresses to “Level 0” host-physical addresses. This requires either complex shadow page table merging or hardware support for Nested EPT, and memory performance can degrade significantly because every added layer of translation multiplies the number of steps in a page walk.
Q6: How does OverlayFS implement copy-on-write for containers?
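Answer: OverlayFS provides a union mount where multiple layers are combined into a single view.

Layer Structure: the read-only image layers form the LowerDir, the container's writable layer is the UpperDir, and the MergedDir is the combined view that the container sees.

1. Read File: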
open("/etc/nginx/nginx.conf", O_RDONLY)
→ OverlayFS checks UpperDir: not found
→ Falls through to LowerDir: found in Nginx layer
→ Returns file from LowerDir (no copy needed)
2. Modify File:
open("/etc/nginx/nginx.conf", O_WRONLY)
→ OverlayFS checks UpperDir: not found
→ Copy file from LowerDir to UpperDir (copy-up)
→ Open file in UpperDir for writing
→ Future reads will use UpperDir version
3. Delete File:
unlink("/etc/nginx/nginx.conf")
→ OverlayFS creates a whiteout in UpperDir
→ UpperDir/etc/nginx/nginx.conf becomes a character device with device number 0/0 (marks file as deleted)
→ LowerDir file remains (other containers unaffected)
→ MergedDir view hides the file
4. Create New File:
open("/var/log/app.log", O_CREAT)
→ OverlayFS creates file directly in UpperDir
→ No interaction with LowerDir needed
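The four cases above can be modelled in a few lines. The sketch below is an in-memory toy that follows the same lookup order, copy-up, and whiteout rules; it is not how the kernel implements OverlayFS, and the names `OverlayModel` and `WHITEOUT` are made up for the example.

```python
WHITEOUT = object()  # stands in for the whiteout marker stored in the upper layer


class OverlayModel:
    """Toy model of one lower (image) layer plus one upper (container) layer."""

    def __init__(self, lower):
        self.lower = dict(lower)  # read-only image content: path -> bytes
        self.upper = {}           # writable container layer

    def read(self, path):
        """Upper layer wins; a whiteout hides whatever the lower layer has."""
        if path in self.upper:
            entry = self.upper[path]
            if entry is WHITEOUT:
                raise FileNotFoundError(path)
            return entry
        if path in self.lower:
            return self.lower[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Modify a file: copy it up on first write, then change the copy."""
        try:
            current = self.read(path)  # copy-up source (lower or upper)
        except FileNotFoundError:
            current = b""              # new file, created directly in upper
        self.upper[path] = current + data

    def unlink(self, path):
        """Delete a file from the merged view without touching the lower layer."""
        self.read(path)  # raises FileNotFoundError if it is not visible
        if path in self.lower:
            self.upper[path] = WHITEOUT
        else:
            del self.upper[path]


if __name__ == "__main__":
    conf = "/etc/nginx/nginx.conf"
    fs = OverlayModel({conf: b"worker_processes 1;\n"})
    fs.append(conf, b"# tuned\n")
    print(fs.read(conf))
    fs.unlink(conf)
    print(conf in fs.lower, fs.upper[conf] is WHITEOUT)
```

Because the lower dictionary is never modified, any number of models can share the same image content, which mirrors how containers share read-only layers.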
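Performance considerations:
First write to a file triggers copy-up of the whole file (can be slow for large files)
Many layers slow down lookup
Whiteouts and copied-up files accumulate in the upper layer until the container's writable layer is removed (docker rm, or docker system prune for stopped containers)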
Q7: What are the security implications of sharing the kernel in containers vs VMs?
Answer:

Container Security (Shared Kernel):

Pros:
Faster startup and lower overhead
Easier management

Cons:

Kernel Exploits:
Container → Kernel Vulnerability → Host Compromise
Example: Dirty COW (CVE-2016-5195)
- Container can exploit kernel bug
- Gain root on host
- Escape to host system

Large Attack Surface:
~300+ system calls exposed
Any syscall vulnerability affects all containers
Mitigation: seccomp-bpf filters
→ Block dangerous syscalls
→ Reduce attack surface

Resource Exhaustion:
Without cgroups:
Container A → Allocate all memory → OOM kills Container B
With cgroups:
Container A → Hit memory.max → OOM kills processes in A only

Information Leakage:
/proc and /sys expose kernel information
- /proc/kallsyms (kernel symbols)
- /sys/kernel/debug (debug info)
Mitigation: Mount with hidepid, remove sensitive mounts
VM Security (Separate Kernel):

Pros:

Strong Isolation:
VM → Hypervisor → Host
Attack path requires:
1. Exploit in guest kernel
2. VM escape vulnerability
3. Hypervisor exploit
Much harder than container escape

Smaller Attack Surface:
VM → Hypervisor interface is small
- Hypercalls (10-20 vs 300+ syscalls)
- Device emulation
- Much less code to attack

Different Kernels:
Can run different kernel versions
Old vulnerable kernel in VM doesn't affect host
Comparison Table:
| Aspect | Containers | VMs |
|---|---|---|
| Kernel isolation | Shared | Separate |
| Escape difficulty | Medium | Hard |
| Attack surface | Large (~300 syscalls) | Small (~20 hypercalls) |
| Vulnerability impact | Affects host | Contained to VM |
| Performance overhead | ~2% | ~5-10% |
| Startup time | under 1s | 10-30s |
Best Practices:

For Containers:
# 1. Use user namespaces (a daemon-level option, set on dockerd)
--userns-remap=default
# 2. Drop capabilities
--cap-drop=ALL --cap-add=NET_BIND_SERVICE
# 3. Seccomp filter
--security-opt seccomp=default.json
# 4. AppArmor/SELinux
--security-opt apparmor=docker-default
# 5. Read-only root
--read-only --tmpfs /tmp
# 6. No privileged mode
# NEVER use --privileged in production!
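One way to keep these flags from drifting between services is to generate the command line in one place. The sketch below assembles a `docker run` argument list from the per-container options above. The function name and defaults are illustrative, and user-namespace remapping is left out because it is configured on the daemon rather than per container.

```python
import shlex


def hardened_run_argv(image, command=(), seccomp_profile=None):
    """Build a docker run argv that applies the per-container hardening flags."""
    argv = [
        "docker", "run", "--rm",
        "--cap-drop=ALL", "--cap-add=NET_BIND_SERVICE",  # drop capabilities
        "--security-opt", "no-new-privileges",           # block setuid escalation
        "--security-opt", "apparmor=docker-default",     # AppArmor profile
        "--read-only", "--tmpfs", "/tmp",                # read-only root
    ]
    if seccomp_profile:
        # Path to a custom profile; Docker applies its default when omitted.
        argv += ["--security-opt", f"seccomp={seccomp_profile}"]
    return argv + [image, *command]


if __name__ == "__main__":
    print(shlex.join(hardened_run_argv("nginx:alpine")))
```

The list can be passed straight to `subprocess.run` on a host with Docker installed; printing it first makes the effective flags easy to review.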
// This code behaves identically on host or in container:
#include <stdio.h>
#include <unistd.h>

int main() {
    printf("My PID: %d\n", getpid());   // 1 in container, 12345 on host
    printf("My UID: %d\n", getuid());   // 0 in container (fake root)

    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    printf("Hostname: %s\n", hostname); // "container-abc" in container

    // Process has no idea it's in a container!
    // All syscalls return "virtualized" results
    return 0;
}
The container ecosystem is full of footguns that only fire under load, in production, after months of working fine. Below are the four pitfalls that have generated the most incidents in real engineering organizations, paired with the patterns that defuse them.
Caveat 1: Containers are not VMs, and namespace breakouts are real. A container shares the host kernel. A kernel bug in any reachable code path can become a host compromise. The 2019 runc CVE (CVE-2019-5736) let a malicious container overwrite the host runc binary by abusing the way runc re-executes itself through /proc/self/exe; once overwritten, the next container start would execute attacker-controlled code as root on the host. Similar primitives have surfaced periodically (overlayfs in 2021, cgroups release_agent in 2022, OverlayFS again in CVE-2023-2640). The lesson: never treat the kernel attack surface as if it had VM-grade isolation properties.
Pattern: Defense-in-depth and choose the right runtime for the threat model. Run untrusted code under gVisor (user-space syscall interposition) or Kata/Firecracker (per-container microVM) — AWS Lambda and Fly.io chose Firecracker for exactly this reason. For trusted workloads, layer seccomp (block ~250 of ~340 syscalls), drop all capabilities by default, enable user namespaces so container root maps to an unprivileged host UID, set --read-only rootfs, and keep runc patched. Audit your seccomp profile — the default Docker profile is good but not minimal.
Caveat 2: PID 1 in a container must reap children, or zombies pile up until you exhaust the PID limit. On a normal Linux system, init (systemd, sysvinit) reaps orphaned children. Inside a container, your application is PID 1 — and most application runtimes (Node, Python, the JVM, Go) do not reap orphans. If your app spawns short-lived shells (think child_process.exec), each exited child becomes a zombie that holds a process slot until PID 1 reads its exit status. Symptom: fork: retry: Resource temporarily unavailable after a few days, even though the container looks idle.
Pattern: Use a tiny init like tini, dumb-init, or docker run --init. Docker’s --init flag injects tini as PID 1, which forwards signals to your app and reaps zombies. The cost is one extra process (~1 MB RSS). If you control the image, ENTRYPOINT ["/usr/bin/dumb-init", "--"] is the production-standard pattern. The same logic applies in Kubernetes via shareProcessNamespace: true (so the pause container reaps) — but even then, prefer tini per-container for deterministic behavior.
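The reaping half of what an init does is small enough to sketch. The function below collects every child that has already exited, using a non-blocking `waitpid` loop. It is a minimal illustration with an assumed function name; a real init such as tini also forwards signals and handles process groups.

```python
import os
import time


def reap_zombies():
    """Collect all children that have already exited, without blocking.

    Returns (pid, code) pairs, where code is the exit status or the
    negated signal number for a child that was killed by a signal.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)
        reaped.append((pid, code))
    return reaped


if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(0)      # the child exits at once and becomes a zombie
    time.sleep(0.1)      # give the child time to exit
    print("reaped:", reap_zombies())
```

A PID 1 process would typically call a function like this from a SIGCHLD handler or a periodic loop, so exited children never linger in the process table.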
Caveat 3: When a container OOMs, the kernel kills a process inside it, not necessarily the one that allocated the memory. The cgroup OOM killer scores processes in the cgroup by oom_score_adj and RSS, then kills the highest scorer. If your container runs a parent process and several workers, the parent might survive while a worker dies, leaving you with a half-broken container that the orchestrator sees as “running.” Worse: if most of the container’s memory limit is consumed by page cache that cannot be reclaimed quickly (dirty pages waiting for writeback, or tmpfs/shmem pages, which are accounted as cache but cannot simply be dropped), the cgroup OOM may fire even though no process is actually holding much heap.
Pattern: Set memory.high for soft pressure, treat OOM as restart-the-pod, and inspect memory.stat. In cgroups v2, memory.high throttles allocation before hitting the hard memory.max — giving you backpressure instead of a kill. In Kubernetes, set restartPolicy: Always and use a liveness probe that fails when the app’s worker count drops; the orchestrator will replace the half-dead pod. When debugging, never trust memory.current alone — read memory.stat and look at anon (real heap), file (page cache, reclaimable), and kernel_stack (per-thread cost). Most “memory leaks” in containers are actually unbounded page cache from log files.
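Reading memory.stat by eye gets tedious, so a short parser helps when comparing containers. The sketch below parses the `key value` lines of a cgroup v2 memory.stat file and ranks the three fields named above. The function names and sample numbers are made up for illustration.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def breakdown(stats, fields=("anon", "file", "kernel_stack")):
    """Return (field, bytes) pairs for the chosen fields, largest first."""
    pairs = [(name, stats.get(name, 0)) for name in fields]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Invented sample: 100 MiB of heap, 400 MiB of page cache, 1 MiB of stacks.
SAMPLE = """anon 104857600
file 419430400
kernel_stack 1048576
"""

if __name__ == "__main__":
    for name, size in breakdown(parse_memory_stat(SAMPLE)):
        print(f"{name:>12}: {size / 2**20:8.1f} MiB")
```

On a real host the input would come from the memory.stat file in the container's cgroup directory. In this invented sample `file` dominates, which points at page cache rather than a heap leak.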
Caveat 4: Cgroup v1 vs v2 — many tools, runtimes, and interview answers still assume v1. Cgroups v2 (the unified hierarchy) is the default on RHEL 9, Ubuntu 22.04+, and modern Kubernetes. It changes the semantics: a single hierarchy instead of one per controller, memory.max instead of memory.limit_in_bytes, cpu.max instead of cpu.cfs_quota_us + cpu.cfs_period_us. Older Java versions (pre-15) read v1 paths and silently report wrong limits, leading to JVMs that allocate based on the host’s RAM rather than the container’s limit. Same trap with cAdvisor pre-0.40 and old Datadog agents.
Pattern: Detect the hierarchy and read the right files. Check stat -fc %T /sys/fs/cgroup/ — cgroup2fs means v2, tmpfs means v1. For runtime detection in apps, read /proc/self/cgroup: a single line of 0::/... is v2, multiple lines with controller prefixes is v1. Upgrade to JDK 17+ (or backports) for correct cgroup v2 awareness. In Kubernetes 1.25+, cgroup v2 is GA and required for features like Memory QoS.
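The /proc/self/cgroup rule translates into a small classifier. The sketch below takes the file's contents as a string, so it can be tried on any platform. The function name and the extra "hybrid" result (v1 controller lines alongside a `0::` line, as some systemd setups produce) are choices made for this example.

```python
def cgroup_version(proc_self_cgroup: str) -> str:
    """Classify /proc/self/cgroup contents as 'v1', 'v2', 'hybrid', or 'unknown'."""
    lines = [line for line in proc_self_cgroup.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    has_v2 = any(line.startswith("0::") for line in lines)
    has_v1 = any(not line.startswith("0::") for line in lines)
    if has_v2 and has_v1:
        return "hybrid"
    return "v2" if has_v2 else "v1"


def memory_limit_file(version: str) -> str:
    """Name of the file holding the hard memory limit for each hierarchy."""
    return "memory.max" if version == "v2" else "memory.limit_in_bytes"


if __name__ == "__main__":
    sample_v2 = "0::/system.slice/docker.service\n"
    sample_v1 = "12:memory:/docker/abc\n11:cpu,cpuacct:/docker/abc\n"
    for sample in (sample_v2, sample_v1):
        version = cgroup_version(sample)
        print(version, "->", memory_limit_file(version))
```

An application would read /proc/self/cgroup once at startup, pass the text to this function, and pick its limit files accordingly.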
Containers vs VMs -- when do you choose which, and what does 'isolation' really mean here?
Strong Answer Framework:
State the isolation boundary first. A VM virtualizes hardware — the hypervisor (KVM, Hyper-V, Xen) traps privileged instructions, presents virtual CPUs and devices, and the guest runs its own kernel. A container shares the host kernel and uses namespaces plus cgroups to restrict what a process sees and how much it can consume. The boundary determines the threat model: a VM contains a guest-kernel exploit; a container does not.
Describe the cost dimensions. VMs pay for boot time (seconds to tens of seconds for a full OS), per-instance memory (50-500 MB just for the guest kernel and userspace), and VM-exit overhead on every privileged operation. Containers boot in milliseconds, share the host kernel page cache, and have near-native syscall latency. On a typical 64 GB server, you fit 10-30 VMs or hundreds-to-thousands of containers.
Map workload to choice. Multi-tenant code execution (Lambda, code sandboxes, customer-supplied workloads) wants a VM-grade boundary. Internal microservices owned by your org want containers — you trust the code, you want the density. Long-running stateful services with strict noisy-neighbor isolation (databases on shared hardware) sit in the middle: VMs or microVMs.
Acknowledge the middle ground. Firecracker and Cloud Hypervisor are minimal VMMs that boot a microVM in ~125 ms with ~5 MB overhead — AWS Lambda and Fly.io use them precisely because they want VM isolation at container density. gVisor goes the other way: a user-space kernel that intercepts syscalls before they reach the host kernel, reducing the kernel attack surface at a 2-5x I/O perf cost.
Name your default. “I default to containers for everything I own and trust, microVMs (Firecracker) for anything where the threat model includes hostile guest code, and full VMs only when I need a different OS or hardware-level features (nested virtualization, GPU passthrough).”
Real-World Example: AWS Fargate originally ran customer containers on shared EC2 hosts with seccomp+namespace isolation. After internal red-team exercises and the runc CVE-2019-5736 in February 2019, Amazon transitioned Fargate to Firecracker-based microVMs (announced re:Invent 2019). Each customer task now runs in its own microVM, eliminating the kernel-shared attack surface. Latency cost was small (~125 ms cold boot vs ~50 ms for a container start); the security win was eliminating an entire class of cross-tenant kernel exploits.
Senior follow-up: How does Firecracker boot in 125 ms when a normal Linux VM takes 30+ seconds? It strips the kernel to a minimal config (no PCI enumeration, almost no drivers, no initramfs), uses virtio-only devices over MMIO, and bypasses BIOS/UEFI by jumping straight into the kernel via the Linux Boot Protocol. The VMM itself is ~50k lines of Rust with no QEMU device model. The whole point is that “boot” in a microVM is closer to “exec a process” than to “start a computer.”
Senior follow-up: A colleague says they can run untrusted code in a Docker container with --security-opt no-new-privileges and a strict seccomp profile — safe enough? No. You have closed common privilege-escalation paths, but you have not closed the kernel attack surface. CVE-2022-0492 (cgroups release_agent), CVE-2022-0185 (filesystem mount), and CVE-2023-0386 (overlayfs) all let a non-root container process escape to the host on default-configured systems. Seccomp narrows the syscall surface but does not eliminate it. Use a microVM if the code is genuinely hostile.
Senior follow-up: When does the VM vs container choice not matter? When the workload itself dominates — a CPU-bound numerical job spending 99% of cycles in user-space math does not care about VM-exit overhead. The choice matters for I/O-heavy workloads (where syscall + virt overhead is felt), short-lived workloads (where boot time is felt), and dense multi-tenant workloads (where per-instance memory tax is felt).
Common Wrong Answers:
“Containers are just lightweight VMs.” Wrong on both technical and security grounds. Different isolation boundary, different threat model, different operational profile. This phrasing signals the candidate has not internalized why CVE-2019-5736 was possible at all.
“VMs are always more secure.” Not always. A poorly-configured VM (default credentials, exposed management interface, unpatched hypervisor) is less secure than a hardened container. Isolation is necessary but not sufficient.
“Use Kubernetes; it handles all this.” Kubernetes is an orchestrator, not an isolation strategy. By default, K8s pods on the same node share the kernel. Multi-tenant K8s requires explicit decisions: gVisor RuntimeClass, Kata Containers, dedicated node pools per tenant.
Further Reading:
Aqua Security CVE write-up on runc CVE-2019-5736 — the canonical example of a container breakout via shared filesystem semantics.
Firecracker design paper (NSDI 2020), “Firecracker: Lightweight Virtualization for Serverless Applications.”
“Container Security” by Liz Rice (O’Reilly) — the standard reference for production container isolation.
Walk me through how Docker actually starts a container, namespace by namespace.
Strong Answer Framework:
Start with the layered architecture. docker run talks to dockerd over a Unix socket. dockerd parses the request and delegates to containerd via gRPC. containerd resolves the image (pulling layers if needed), prepares an OCI runtime spec (a JSON config describing the container), and shells out to runc. runc is the piece that actually creates the namespaces.
The runc dance. runc forks itself. The child calls unshare() (or passes CLONE_NEW* flags to clone()) to create new namespaces in a single transition: PID, mount, UTS, IPC, net, user, cgroup. Order matters — user namespace must be set up first so subsequent namespaces are owned by it. The parent stays in the host namespace and writes the new PID into the appropriate cgroup files.
Filesystem setup. Inside the new mount namespace, runc bind-mounts the OverlayFS merged directory (lower = read-only image layers, upper = writable container layer, work = OverlayFS bookkeeping) to a temporary path, then pivot_roots into it. Old root is unmounted. /proc and /sys get fresh mounts inside the new mount namespace — this is what makes cat /proc/1/status show the containerized process as PID 1.
Resource caps. The cgroup is configured before the workload starts: memory.max, cpu.max, pids.max, io.max are all written to the cgroup directory. The process is added to the cgroup via cgroup.procs. Once added, the kernel enforces limits on every allocation and scheduling decision for that process and its descendants.
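The values written to those files have a fixed textual format, which the sketch below produces from friendlier inputs. The function name and the 100 ms default period are assumptions for the example; it only builds the strings and does not touch /sys/fs/cgroup.

```python
def cgroup_v2_settings(cpus=None, memory_bytes=None, max_pids=None,
                       period_us=100_000):
    """Map resource limits to cgroup v2 control-file contents.

    cpu.max holds "<quota> <period>" in microseconds, or "max <period>"
    for no CPU cap. memory.max and pids.max hold a number or "max".
    """
    settings = {}
    if cpus is None:
        settings["cpu.max"] = f"max {period_us}"
    else:
        settings["cpu.max"] = f"{round(cpus * period_us)} {period_us}"
    settings["memory.max"] = "max" if memory_bytes is None else str(memory_bytes)
    settings["pids.max"] = "max" if max_pids is None else str(max_pids)
    return settings


if __name__ == "__main__":
    limits = cgroup_v2_settings(cpus=1.5, memory_bytes=512 * 2**20, max_pids=100)
    for filename, value in limits.items():
        print(f"{filename}: {value}")
```

Each value would be written to the file of the same name in the container's cgroup directory before the process ID is added to cgroup.procs.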
Security hardening. runc applies the seccomp BPF filter (~50 syscalls blocked by default profile), drops capabilities (default keeps ~14 of 40+ capabilities), applies AppArmor or SELinux label, and finally execves the actual entrypoint. At this point, the process is the container.
Detach. runc exits, leaving the entrypoint reparented to a containerd-shim process. The shim holds the container’s TTY and exit code so dockerd can crash and the container survives.
Real-World Example: In 2021, a Datadog customer reported intermittent “container start” timeouts under load. Root cause: their CI system was launching ~500 containers per minute, each creating a new network namespace and veth pair. The kernel’s rtnl_lock (the global netlink lock) became a bottleneck — veth creation serialized across cores. Fix was to switch to ipvlan (which is lock-free per-namespace) and to batch container creation to avoid thundering-herd lock contention. The lesson: namespace creation is not free, and the cost is concentrated in specific kernel locks.
Senior follow-up: Why does docker run --pid=host exist, and when is it dangerous? It tells runc to skip creating a PID namespace — the container shares the host’s PID namespace. Useful for debugging tools (a sidecar that needs to kill -9 a host process) or process supervisors. Dangerous because a process inside the container can now see and signal every process on the host; if the container has CAP_KILL or CAP_SYS_PTRACE, that is a full escape primitive.
Senior follow-up: What does the containerd-shim actually do? It is the per-container parent process that survives daemon restarts. It owns the container’s stdin/stdout/stderr pipes, reaps the container’s exit code (so dockerd can read it later), and holds the controlling TTY. Without the shim, restarting dockerd would orphan every running container. The shim is roughly 5 MB of memory per container — non-trivial at scale, which is why projects like Kata Containers replace it with a single shim per VM.
Senior follow-up: How is docker exec different from docker run? exec does not create new namespaces. It uses setns(2) to enter the existing container’s namespaces (one syscall per namespace, with the target file descriptor from /proc/PID/ns/*), then execves the new command. The new process inherits all the container’s restrictions but is not a child of the container’s PID 1, which is why execed processes do not show up under PID 1 in some ps views and require explicit handling for signal forwarding.
Common Wrong Answers:
“Docker creates a lightweight virtual machine.” No. Docker creates namespaces and cgroups; there is no VM, no hypervisor, no separate kernel.
“Containers are just chroot jails.” chroot only changes the filesystem root. Namespaces additionally isolate process IDs, network, IPC, hostname, users, and cgroup view. Containers are chroot plus seven other forms of isolation, plus resource limits.
“Docker uses LXC.” It used to (in 2013-2014). Since libcontainer (now runc), Docker has its own implementation. LXC is a separate project with different design choices (system containers vs application containers).
Further Reading:
Liz Rice, “Containers from Scratch” talk (Container Camp 2017) — builds a container in ~100 lines of Go.
OCI Runtime Specification (github.com/opencontainers/runtime-spec) — the actual contract between containerd and runc.
“What Have Syscalls Done for You Lately?” by Jessie Frazelle — a tour of the kernel features Docker exercises.
gVisor, Firecracker, Kata Containers -- compare the security trade-offs and pick one for an untrusted code workload.
Strong Answer Framework:
Frame the question as ‘what is the attack surface’. Plain Docker exposes most of the Linux syscall ABI (~340 syscalls, of which the default seccomp profile blocks only around 50). Any kernel bug reachable through the remaining syscalls is a potential escape. The three alternatives reduce the attack surface in different ways.
gVisor (runsc). A user-space kernel called Sentry implements the Linux syscall interface in Go. Container syscalls go to Sentry; Sentry only makes ~20 syscalls to the host kernel. Attack surface drops by an order of magnitude. Cost: every syscall traverses Sentry’s user-space implementation, so I/O-heavy workloads see 2-5x slowdowns. Some uncommon syscalls or kernel features are not implemented (io_uring, some namespaces). Used by Google for App Engine and Cloud Run.
Firecracker. Each container gets a lightweight VM with a minimal Linux kernel, a virtio-only device model, and a ~50k-line Rust VMM. The host kernel attack surface reduces to KVM plus virtio. Boot time ~125 ms, memory overhead ~5 MB per microVM. Syscall-compatible (it is real Linux), supports any syscall the guest kernel supports. Cost: per-VM kernel memory, VM-exit overhead, and you now manage a kernel image lifecycle. Used by AWS Lambda, AWS Fargate, Fly.io.
Kata Containers. Same idea as Firecracker — per-container VM — but using QEMU (or Cloud Hypervisor) and integrating directly with Kubernetes via a RuntimeClass. Compatible with full OCI semantics, supports more device types (GPUs via VFIO). Slightly heavier than Firecracker (~50-100 ms boot, ~30 MB memory) but more flexible.
Recommend with reasoning. “For untrusted code execution at scale (think a Lambda-style platform, a code-runner SaaS, or sandboxed CI) I would pick Firecracker. The threat model is ‘guest is hostile,’ and Firecracker gives me VM-grade isolation with container-grade density and start time. gVisor is a strong alternative when I cannot run VMs at all (for example, cloud instances without KVM or nested virtualization) and the workload is not I/O-bound.”
Real-World Example: AWS Lambda launched in 2014 running functions in containers, with each customer account isolated on its own EC2 instances, so the VM was the real tenant boundary. To get that same boundary at much higher density, AWS built Firecracker (open-sourced in November 2018) and moved Lambda onto microVMs, one per function execution environment. Every environment now has its own guest kernel, so no kernel is shared between tenants, while microVMs from many customers can be packed onto the same bare-metal host. Firecracker’s microVM boot time is about 125 ms, so the stronger isolation did not come at the cost of slower cold starts.
Senior follow-up: What is a VM exit and why does Firecracker care? A VM exit happens when the guest executes an instruction or triggers an event that requires hypervisor mediation: accessing an emulated device register, touching guest-physical memory that the hypervisor has not yet mapped (an EPT/NPT violation), or executing an instruction the hypervisor has configured to trap. Ordinary guest page faults are handled by the guest kernel without an exit. Each exit costs roughly 1000-3000 cycles for state save/restore. Firecracker minimizes exits by using virtio over MMIO (most data moves through shared-memory rings rather than per-register device emulation), pre-faulting memory to avoid EPT-violation exits, and configuring the VMCS to let the guest handle as many operations natively as possible. For I/O-bound workloads, exit frequency is the dominant performance factor.
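To get a feel for why exit frequency matters, a back-of-the-envelope model helps. The sketch below multiplies an assumed exit rate by an assumed per-exit cost. The 3 GHz clock, the 2000-cycle cost (the midpoint of the range quoted above), and both exit rates are illustrative assumptions, not measurements.

```python
def vm_exit_overhead(exits_per_second: float, cycles_per_exit: float, cpu_hz: float) -> float:
    """Fraction of one core's cycles spent purely on exit/entry transitions."""
    return exits_per_second * cycles_per_exit / cpu_hz


# Hypothetical workloads on an assumed 3 GHz core at 2000 cycles per exit.
for label, rate in [("compute-bound", 1_000), ("I/O-heavy", 200_000)]:
    fraction = vm_exit_overhead(rate, 2_000, 3e9)
    print(f"{label}: {fraction:.2%} of a core")
```

Under these assumptions the compute-bound case loses well under 0.1% of a core to transitions, while the I/O-heavy case loses about 13%, before counting the time the hypervisor spends actually handling each exit.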
Senior follow-up: Why is gVisor written in Go, given the performance cost? Memory safety. The whole point of gVisor is that the user-space kernel is a barrier: if Sentry has a memory corruption bug, you have not gained anything. Go gives you bounds-checked array access and GC-managed lifetimes, which rules out most buffer overflows and use-after-free bugs, the classes that matter most for a kernel. The performance cost is the price of that safety. Recent versions keep allocations off the syscall hot path and add platform-specific optimizations (such as the systrap and KVM platforms) to claw some of it back.
Senior follow-up: When is gVisor the wrong answer? When the workload depends on syscalls or kernel features gVisor has not fully implemented (io_uring, some perf events, some less-common namespaces), or when I/O latency dominates (a database, a high-RPS proxy). gVisor’s overhead on small reads/writes can be 2-5x; on a sustained NVMe workload, that is the difference between “works” and “useless.”
Common Wrong Answers:
“They are all the same thing — secure containers.” They use fundamentally different mechanisms. gVisor intercepts syscalls in user space; Firecracker and Kata run a real guest kernel in a VM. The security properties and performance profiles are not interchangeable.
“Just use seccomp.” Seccomp narrows the syscall surface but does not change the trust boundary. The remaining allowed syscalls still execute in the host kernel. Seccomp is a hardening layer, not an isolation strategy.
“VMs are always slower.” Not for the metrics that matter at scale. Firecracker boots faster than many container runtimes (~125 ms vs ~200-500 ms for full Docker). Syscalls made inside the guest are handled by the guest kernel without a VM exit, so compute-bound workloads run at close to native speed. The VM tax is real but smaller than people assume.
Further Reading:
Firecracker NSDI 2020 paper — Agache et al., “Firecracker: Lightweight Virtualization for Serverless Applications.”
gVisor docs (gvisor.dev) — specifically the “Performance Guide” page which is honest about where gVisor is slow.
Kata Containers architecture overview at katacontainers.io — explains the shim-v2 model.
Manual Namespace Build: Use the unshare command to create a shell with its own network and PID namespace. Try to ping the host.
Cgroup Stress Test: Create a cgroup v2 with a 100MB memory limit. Run a program that allocates 200MB and observe the kernel’s OOM killer logs in dmesg.
VirtIO Analysis: Run a KVM guest and use lspci inside the guest to identify which devices are using virtio drivers vs. emulated hardware.
A colleague says 'containers are lightweight VMs.' Correct this misconception, and explain the security implications of the difference.
Strong Answer:
Containers and VMs have fundamentally different isolation boundaries. A VM runs its own kernel on virtualized hardware — the hypervisor (KVM, Xen, Hyper-V) provides each VM with a virtual CPU, virtual memory, and virtual devices. The guest OS has no direct access to the host kernel. A container, on the other hand, shares the host kernel. It is just a regular process (or group of processes) with restricted views of system resources via namespaces and restricted resource usage via cgroups.
The security implication is significant: a kernel vulnerability (like a privilege escalation in a syscall handler) can allow a container to escape to the host. In a VM, the guest kernel vulnerability stays contained because the guest cannot directly call host kernel code — it must go through the hypervisor, which is a much smaller attack surface. This is why multi-tenant public clouds (AWS, GCP) use VMs, not containers, as the primary isolation boundary for customer workloads.
Containers add defense-in-depth with seccomp (restricting which syscalls a container can make), AppArmor/SELinux (mandatory access control profiles), dropped capabilities (containers typically run without CAP_SYS_ADMIN), and user namespaces (mapping container root to a non-root host UID). But all of these are enforced by the shared host kernel, so a kernel bug can bypass all of them simultaneously.
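One way to see the dropped-capabilities layer is to decode the CapEff bitmask from /proc/<pid>/status. The sketch below is a minimal illustration: the capability bit numbers come from linux/capability.h, while the sample mask is an assumption (a value commonly seen for a root process in a default Docker container) and should be checked against a real container.

```python
CAP_CHOWN = 0        # bit numbers as defined in linux/capability.h
CAP_SYS_PTRACE = 19
CAP_SYS_ADMIN = 21


def parse_cap_eff(status_text: str) -> int:
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    """True if capability bit `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


# Assumed sample; on a live system read open("/proc/self/status").read() instead.
SAMPLE_STATUS = "Name:\tsh\nCapEff:\t00000000a80425fb\n"
mask = parse_cap_eff(SAMPLE_STATUS)
print("CAP_SYS_ADMIN:", has_cap(mask, CAP_SYS_ADMIN))
print("CAP_SYS_PTRACE:", has_cap(mask, CAP_SYS_PTRACE))
print("CAP_CHOWN:", has_cap(mask, CAP_CHOWN))
```

Running the same check on the host as root typically shows CAP_SYS_ADMIN set, which is the difference the paragraph above describes.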
The middle ground is gVisor (which interposes a user-space kernel that handles syscalls before they reach the real kernel) and Kata Containers / Firecracker (which run each container inside a lightweight VM). AWS Lambda uses Firecracker microVMs: each function execution environment gets its own VM with a minimal Linux kernel that boots in about 125 ms, and that environment is reused for later invocations of the same function.
Follow-up: What specifically does a PID namespace do, and can a process inside a PID namespace see or signal processes outside it?
A PID namespace virtualizes process IDs. The first process in a new PID namespace becomes PID 1 within that namespace, even though it has a different PID on the host. Processes inside the namespace can only see and signal other processes within the same namespace (or descendant namespaces). They cannot see or signal host processes. However, from the host’s perspective, all container processes are visible with their host PIDs. This is one-way isolation: the host can see into containers, but containers cannot see out. The exception is if a container has CAP_SYS_PTRACE and shares the host PID namespace (e.g., Docker’s --pid=host flag), which breaks isolation and is a serious security risk.
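The two views of the same process are visible from the host in the NSpid field of /proc/<pid>/status, which lists the process ID in each nested PID namespace, outermost first. The sketch below parses that field; the sample text and the PID 48213 are made up for illustration.

```python
def parse_nspid(status_text: str) -> list:
    """Return the PIDs of one process, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    raise ValueError("no NSpid line found")


# Hypothetical status text for a container's init process, as read on the host.
SAMPLE_STATUS = "Name:\tnginx\nPid:\t48213\nNSpid:\t48213\t1\n"

pids = parse_nspid(SAMPLE_STATUS)
print("host PID:", pids[0])
print("PID inside the container:", pids[-1])
```

A process that lives only in the host namespace has a single entry in the list; each extra entry corresponds to one more level of PID namespace nesting.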
Explain how cgroups v2 memory limits work internally. What happens when a container hits its memory limit -- is it always an OOM kill?
Strong Answer:
In cgroups v2, the memory.max file sets the hard memory limit for a cgroup. The kernel charges every page used by processes in the cgroup to that cgroup: anonymous pages (heap, stack), file-backed pages (page cache), and kernel memory (slab, socket buffers). Swap usage is accounted separately and capped by memory.swap.max.
When a process in the cgroup tries to allocate memory and the cgroup’s usage is at memory.max, the kernel first tries to reclaim memory. It invokes the cgroup-aware reclaimer, which scans the cgroup’s LRU lists and evicts reclaimable pages — file-backed pages can be dropped (clean) or written back (dirty), and anonymous pages can be swapped out if swap is available.
If reclamation succeeds in freeing enough memory, the allocation proceeds and there is no OOM kill. This is actually the common case — the page cache grows until it fills the memory limit, then the kernel evicts old cached pages to make room for new allocations.
OOM kill only happens when reclamation fails — there is no more reclaimable memory (all pages are anonymous and there is no swap, or swap is also full). At that point, the cgroup-level OOM killer selects a process within the cgroup to kill. Crucially, it will NOT kill processes outside the cgroup, which is the whole point of containerized memory limits.
There is also memory.high, which is a throttling threshold (not a hard limit). When usage exceeds memory.high, the kernel applies memory pressure by slowing down allocations (the process is forced to do direct reclaim, which is slow). This provides backpressure before hitting the hard limit, giving the application a chance to reduce its memory footprint.
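The decision sequence described above can be modelled in a few lines. This is a toy model, not kernel code: the Cgroup class and charge function are hypothetical names, all page cache is assumed clean and reclaimable, memory.high throttling is left out, and the units are arbitrary (read them as MiB in the usage lines).

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    memory_max: int     # hard limit (memory.max)
    anon: int           # anonymous memory, reclaimable only by swapping
    file: int           # page cache, assumed fully reclaimable in this model
    swap_free: int = 0  # swap space still available to this cgroup


def charge(cg: Cgroup, request: int) -> str:
    """Charge an anonymous allocation; return 'ok', 'reclaimed', or 'oom'."""
    usage = cg.anon + cg.file
    if usage + request <= cg.memory_max:
        cg.anon += request
        return "ok"
    need = usage + request - cg.memory_max
    # Step 1: evict page cache.
    evicted = min(need, cg.file)
    cg.file -= evicted
    need -= evicted
    # Step 2: swap out anonymous pages if any swap is left.
    swapped = min(need, cg.anon, cg.swap_free)
    cg.anon -= swapped
    cg.swap_free -= swapped
    need -= swapped
    if need > 0:
        return "oom"  # reclaim could not free enough: the OOM killer runs
    cg.anon += request
    return "reclaimed"


cg = Cgroup(memory_max=100, anon=40, file=60)
print(charge(cg, 20))  # page cache is evicted to make room
print(charge(cg, 50))  # cache is exhausted and there is no swap
```

The first call succeeds after reclaim and the second ends in an OOM kill, which matches the point that hitting memory.max only becomes fatal once nothing reclaimable is left.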
Follow-up: Why do container monitoring tools often show different memory usage than the host’s free command?
Container monitoring tools report cgroup-level memory usage. The raw figure (memory.current, which cAdvisor exposes as its usage metric) includes the page cache charged to that cgroup, while some tools, such as docker stats, subtract cache before displaying a number. The page cache is reclaimable, so it is not “used” in the same sense as heap memory, but it counts against the cgroup’s limit. This is why a container might show 90% memory usage while the application inside thinks it is only using 200MB of heap: the rest is kernel page cache from file I/O. The memory.stat file in the cgroup filesystem breaks this down into anon, file, shmem, etc. When diagnosing memory issues, always check memory.stat rather than just memory.current.
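A small parser makes the memory.stat breakdown easy to inspect. The sketch below assumes the cgroup v2 format of one "key value" pair per line; the sample numbers are invented to mirror the 200MB-heap scenario above, and on a real system the text would come from the cgroup's memory.stat file.

```python
def parse_memory_stat(text: str) -> dict:
    """Parse cgroup v2 memory.stat content into a {key: bytes} mapping."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def page_cache_share(stats: dict) -> float:
    """Fraction of anon + file memory that is (mostly reclaimable) page cache."""
    return stats["file"] / (stats["anon"] + stats["file"])


# Hypothetical sample: 200 MiB of anonymous memory, 700 MiB of page cache.
SAMPLE = "anon 209715200\nfile 734003200\nshmem 0\n"

stats = parse_memory_stat(SAMPLE)
print(f"anon: {stats['anon'] // 2**20} MiB, file: {stats['file'] // 2**20} MiB")
print(f"page cache share: {page_cache_share(stats):.0%}")
```

In this sample most of the charged memory is page cache, so a usage figure near the limit would not by itself indicate that the application is about to be OOM-killed.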
You need to run an untrusted workload in production. Compare the isolation guarantees of Docker containers, gVisor, and Firecracker MicroVMs. Which would you choose and why?
Strong Answer:
Docker containers (with default settings) provide namespace isolation, cgroup resource limits, seccomp syscall filtering, and dropped capabilities. The default seccomp profile blocks only around 44 of the 300+ Linux syscalls, so the attack surface is still most of the kernel syscall interface. A kernel zero-day in any allowed syscall can lead to a container escape. In practice, this is good enough for trusted workloads (your own code) but risky for untrusted code.
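To check how much of the syscall surface a given seccomp profile leaves open, you can count the names it allows. The sketch below assumes the JSON layout used by Docker seccomp profiles (a defaultAction plus a list of rules with names and an action); TOY_PROFILE is a made-up three-syscall profile, not Docker's default. Real profiles also contain rules conditional on arguments or capabilities, so a count like this is an upper bound.

```python
def allowed_syscalls(profile: dict) -> set:
    """Collect syscall names that at least one rule explicitly allows."""
    allowed = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            allowed.update(rule.get("names", []))
    return allowed


# Hypothetical profile for illustration; load a real one with json.load().
TOY_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"},
        {"names": ["ptrace"], "action": "SCMP_ACT_ERRNO"},
    ],
}

allowed = allowed_syscalls(TOY_PROFILE)
print(len(allowed), "syscalls allowed:", sorted(allowed))
```

Every name in the resulting set is still serviced by the shared host kernel, which is why the filter narrows the attack surface without moving the trust boundary.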
gVisor (runsc) interposes a user-space kernel called Sentry that implements the Linux syscall interface. The untrusted process’s syscalls go to Sentry, not the real kernel. Sentry only makes a small subset of syscalls to the host kernel (a few dozen), massively reducing the attack surface. The trade-off is performance: every syscall goes through Sentry’s user-space implementation, which adds latency. I/O-heavy workloads can see 2-5x slowdowns. gVisor also does not support every Linux syscall perfectly, so some applications may not run correctly.
Firecracker MicroVMs give each workload its own lightweight VM with a minimal Linux kernel. The host kernel attack surface is reduced to KVM (the hypervisor) and a small set of virtio device emulations. Firecracker itself is written in Rust with a minimal device model (~50k lines of code), so the attack surface is tiny. Boot time is about 125ms, and memory overhead is about 5MB per VM. The trade-off is that you need to manage a full (minimal) kernel per workload, and there is overhead from the virtualization layer (EPT/SLAT page table walks, VM exits).
For untrusted workloads, I would choose Firecracker if latency and syscall compatibility matter (like running arbitrary user-submitted code, which is what AWS Lambda does), or gVisor if the workload is I/O-light and I want strong isolation without the operational complexity of managing VM images. I would never use plain Docker for truly untrusted code.
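The selection logic in this answer can be written down as a small decision function. This is only a sketch of the reasoning above, not a general policy: the function name and its boolean parameters are hypothetical, and a real choice would weigh more factors (KVM availability, device needs, team experience).

```python
def pick_isolation(untrusted: bool, io_heavy: bool,
                   needs_full_syscall_compat: bool,
                   wants_simple_ops: bool) -> str:
    """Map the workload traits discussed above to an isolation technology."""
    if not untrusted:
        return "docker"       # shared kernel is acceptable for your own code
    if io_heavy or needs_full_syscall_compat:
        return "firecracker"  # real guest kernel, no user-space syscall layer
    if wants_simple_ops:
        return "gvisor"       # strong isolation without managing VM images
    return "firecracker"


print(pick_isolation(untrusted=True, io_heavy=False,
                     needs_full_syscall_compat=False, wants_simple_ops=True))
print(pick_isolation(untrusted=True, io_heavy=True,
                     needs_full_syscall_compat=False, wants_simple_ops=True))
```

Note that no branch returns plain Docker for an untrusted workload, which is the one hard rule in the answer.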
Follow-up: What is a VM exit, and why does it matter for Firecracker’s performance?
A VM exit occurs when the guest executes an instruction that the hardware cannot handle in guest mode, for example accessing a device register, executing a privileged instruction not configured in the VMCS, or hitting a page fault that requires hypervisor intervention. The CPU saves the guest state, loads the host state, and transfers control to the hypervisor (KVM), which handles the event and then re-enters the guest (VM entry). Each exit/entry cycle costs roughly 1000-3000 CPU cycles. Firecracker minimizes VM exits by using virtio paravirtualized devices (the guest cooperates with the hypervisor using shared memory rings instead of emulating real hardware registers) and by configuring the VMCS to let the guest handle as many operations as possible without exiting. For I/O-heavy workloads, the frequency of VM exits is the primary performance bottleneck.