Containers & Virtualization
Isolation is the core requirement of multi-tenant cloud computing. Whether you are running a SaaS platform or a microservices cluster, you must ensure that processes are contained, resources are metered, and security boundaries are enforced. Modern systems achieve this through two distinct paths: OS-level virtualization (containers) and hardware-level virtualization (VMs).
Mastery Level: Senior Systems Engineer
Key Internals: CLONE_NEW*, Cgroups v2 Unified Hierarchy, VMCS, EPT/SLAT, Firecracker MicroVMs
Prerequisites: Process Internals, Memory Management
1. Container Internals: The Linux “Trio”
A container is not a “thing” in the Linux kernel. It is a user-space abstraction built from three primary kernel features: Namespaces, Control Groups (cgroups), and Union Filesystems.
1.1 Namespaces: The Illusion of Isolation
Namespaces wrap global system resources in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the resource.
| Namespace | Flag | Isolated Resource |
|---|---|---|
| Mount | CLONE_NEWNS | Filesystem mount points (independent mount/umount). |
| UTS | CLONE_NEWUTS | Hostname and NIS domain name. |
| IPC | CLONE_NEWIPC | System V IPC, POSIX message queues. |
| PID | CLONE_NEWPID | Process IDs (Process 1 inside the container). |
| Network | CLONE_NEWNET | Network devices, stacks, ports, firewalls. |
| User | CLONE_NEWUSER | User and group IDs (Root in container != Root on host). |
| Cgroup | CLONE_NEWCGROUP | Cgroup root directory view. |
| Time | CLONE_NEWTIME | System boot and monotonic clocks. |
Deep Dive: PID Namespace
The PID namespace creates a hierarchical process view where each namespace has its own PID 1.
- First process in namespace becomes PID 1
- If PID 1 exits, kernel kills all processes in namespace
- Parent namespace can see child processes with their “real” PIDs
- /proc shows only processes in the current namespace (when paired with a mount namespace that remounts /proc); see the clone(2) sketch below
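Below is a minimal sketch of this behavior using clone(2) directly; a real runtime would also remount /proc and set up the other namespaces, omitted here for brevity (needs root):

```c
/* Sketch: spawn a child in a new PID namespace and watch it become
 * PID 1 there, while the parent still sees its host-level PID. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];

static int child(void *arg) {
    (void)arg;
    printf("child:  getpid() = %d\n", getpid());   /* prints 1 */
    return 0;
}

int main(void) {
    /* CLONE_NEWPID starts a fresh PID namespace (needs CAP_SYS_ADMIN);
     * the stack pointer is the *top* of the buffer, since x86 stacks
     * grow downward. */
    pid_t pid = clone(child, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid < 0) { perror("clone"); exit(1); }

    printf("parent: child PID here = %d\n", pid);  /* a normal PID */
    waitpid(pid, NULL, 0);
    return 0;
}
```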
Deep Dive: Network Namespace
Network namespaces isolate the network stack: devices, routing tables, firewall rules, sockets.
Deep Dive: Mount Namespace
Mount namespaces isolate the set of filesystem mount points.
Deep Dive: User Namespace
User namespaces allow mapping UIDs/GIDs, enabling rootless containers.
Deep Dive: IPC Namespace
IPC namespaces isolate System V IPC objects and POSIX message queues.
Deep Dive: UTS Namespace
UTS namespaces isolate the hostname and NIS domain name.
Deep Dive: Time Namespace
Time namespaces (Linux 5.6+) allow per-namespace offsets for the boot and monotonic clocks.
pivot_root vs. chroot
While chroot only changes the root directory used for path resolution, it is insecure (processes can “break out” via .. or file-descriptor trickery). Containers use pivot_root, which swaps the root mount of the current mount namespace to a new root and detaches the old one, providing a true filesystem jail.
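Here is a hedged sketch of that pivot_root sequence as a runtime performs it (assumes a prepared rootfs at /newroot; pivot_root has no glibc wrapper, so it is invoked via syscall(2); run as root):

```c
/* Sketch: jail this process into /newroot using pivot_root. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    /* Private mount namespace so the swap doesn't leak to the host. */
    if (unshare(CLONE_NEWNS)) { perror("unshare"); return 1; }
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL))
        { perror("make-private"); return 1; }

    /* pivot_root requires new_root to be a mount point. */
    if (mount("/newroot", "/newroot", NULL, MS_BIND, NULL))
        { perror("bind"); return 1; }
    if (chdir("/newroot")) { perror("chdir"); return 1; }

    /* pivot_root(".", ".") stacks the old root under the new one;
     * detaching it afterwards leaves no path back out. */
    if (syscall(SYS_pivot_root, ".", ".")) { perror("pivot_root"); return 1; }
    if (umount2(".", MNT_DETACH)) { perror("umount2"); return 1; }
    chdir("/");

    execl("/bin/sh", "sh", NULL);  /* assumes a shell inside the rootfs */
    perror("execl");
    return 1;
}
```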
1.2 Cgroups: Resource Metering and Limiting
If Namespaces provide isolation (what you see), Cgroups provide containment (what you can use).
- Cgroups v1 (Legacy): Multiple hierarchies. A process could be in one group for CPU and a completely different group for Memory. This led to massive complexity and performance issues.
- Cgroups v2 (Modern/Unified): A single hierarchy. Every process belongs to exactly one cgroup in a unified tree. This allows for better resource accounting (e.g., attributing page cache writeback to the specific cgroup that caused the dirty pages).
Key Controllers (cgroups v2), each driven through plain files in the cgroup directory (a sketch follows):
- cpu: cpu.max caps CPU bandwidth (quota/period)
- memory: memory.max (hard limit) and memory.high (throttle threshold)
- io: io.max limits bandwidth and IOPS per device
- pids: pids.max caps the number of processes (fork-bomb protection)
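A hedged sketch of driving this file interface directly (assumes cgroup2 is mounted at /sys/fs/cgroup; the group name demo is made up for the example; run as root):

```c
/* Sketch: create a cgroup, cap its memory at 100 MiB, and join it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *val) {
    int fd = open(path, O_WRONLY);
    if (fd < 0) { perror(path); return -1; }
    ssize_t n = write(fd, val, strlen(val));
    close(fd);
    return n < 0 ? -1 : 0;
}

int main(void) {
    /* Creating a directory is all it takes to create a cgroup. */
    mkdir("/sys/fs/cgroup/demo", 0755);

    /* Hard limit: allocations beyond this trigger the OOM killer
     * inside this cgroup only. */
    write_file("/sys/fs/cgroup/demo/memory.max", "104857600");

    /* Writing a PID to cgroup.procs migrates that process. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", (int)getpid());
    write_file("/sys/fs/cgroup/demo/cgroup.procs", pid);

    printf("now running under a 100 MiB memory.max\n");
    return 0;
}
```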
1.3 OverlayFS: The Layered Filesystem
Containers use Union Filesystems (like OverlayFS) to provide a writable layer on top of read-only image layers.
- LowerDir: Read-only layers (the Docker image).
- UpperDir: The writable layer where changes are stored.
- MergedDir: The unified view presented to the container.
- Copy-on-Write (CoW): When a container modifies a file from the LowerDir, the kernel first copies it up to the UpperDir before applying the change (the mount sketch below shows the exact options).
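A minimal sketch of the overlay mount itself via mount(2) (assumes /lower, /upper, /work, and /merged already exist; the shell equivalent is mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... /merged):

```c
/* Sketch: assemble a union view at /merged from one lower layer. */
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    /* Multiple read-only layers would be colon-separated in lowerdir. */
    const char *opts = "lowerdir=/lower,upperdir=/upper,workdir=/work";

    /* The "overlay" source string is only cosmetic (/proc/mounts);
     * the fstype "overlay" is what selects the filesystem. */
    if (mount("overlay", "/merged", "overlay", 0, opts)) {
        perror("mount");
        return 1;
    }
    printf("union ready at /merged; copy-up happens on first write\n");
    return 0;
}
```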
2. Virtualization: Emulating the Machine
Virtual Machines (VMs) take the isolation boundary down to the hardware level. Instead of sharing a kernel, they share only the physical CPU and memory.
2.1 The Hypervisor (VMM)
The Virtual Machine Monitor (VMM) is the software that manages guest execution.
- Type 1 (Bare Metal): Runs directly on hardware (Xen, ESXi).
- Type 2 (Hosted): Runs as an app on a host OS (KVM, VirtualBox). Note: KVM is unique because it turns the Linux kernel itself into a Type 1 hypervisor.
2.2 Hardware-Assisted Virtualization (VT-x / AMD-V)
Early virtualization used “Binary Translation” to rewrite privileged instructions in software. Modern CPUs handle this in hardware:
- VMX Root Mode: The hypervisor runs here (full privileges).
- VMX Non-Root Mode: The guest OS runs here. If the guest tries to execute a privileged instruction (like HLT, or modifying CR3), the CPU triggers a VM Exit, trapping into the hypervisor to handle the event.
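To make VM Exits concrete, here is a hedged, minimal sketch against the public KVM API (x86-64 Linux, needs access to /dev/kvm; error handling mostly omitted). It runs a five-byte real-mode guest whose OUT and HLT instructions each bounce control back to us as exits:

```c
/* Sketch: observe VM Exits from userspace with /dev/kvm. */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void) {
    /* Guest code: mov al,42 ; out 0x10,al ; hlt */
    const unsigned char code[] = {0xb0, 42, 0xe6, 0x10, 0xf4};

    int kvm = open("/dev/kvm", O_RDWR);
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);

    /* Back guest-physical address 0 with one page of host memory. */
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, code, sizeof(code));
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0, .memory_size = 0x1000,
        .userspace_addr = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
    struct kvm_run *run =
        mmap(NULL, ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0),
             PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0);

    /* Real mode, CS:IP = 0:0, pointing at our code. */
    struct kvm_sregs sregs;
    ioctl(vcpu, KVM_GET_SREGS, &sregs);
    sregs.cs.base = 0; sregs.cs.selector = 0;
    ioctl(vcpu, KVM_SET_SREGS, &sregs);
    struct kvm_regs regs = { .rip = 0, .rflags = 0x2 };
    ioctl(vcpu, KVM_SET_REGS, &regs);

    /* Every return from KVM_RUN is a VM Exit the hardware could not
     * resolve without the hypervisor. */
    for (;;) {
        ioctl(vcpu, KVM_RUN, 0);
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            printf("VM Exit: port I/O, port=0x%x value=%u\n",
                   run->io.port,
                   *((unsigned char *)run + run->io.data_offset));
            break;
        case KVM_EXIT_HLT:
            printf("VM Exit: HLT, guest is done\n");
            return 0;
        default:
            printf("VM Exit: reason %d\n", run->exit_reason);
            return 1;
        }
    }
}
```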
VMCS (Virtual Machine Control Structure)
The VMCS is a memory block that stores the “state” of a virtual CPU (registers, control bits). Each vCPU has its own VMCS; when switching between them, the hypervisor simply swaps the active VMCS pointer.
2.3 Memory Virtualization: EPT and NPT
In a VM, there are three types of addresses:
- Guest Virtual (GV)
- Guest Physical (GP)
- Host Physical (HP)
The guest’s own page tables translate GV → GP, while EPT (Intel) / NPT (AMD), collectively called SLAT, translate GP → HP in hardware. This removes the need for software-maintained shadow page tables, at the cost of a two-dimensional page walk on a TLB miss (see the sketch below).
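A back-of-the-envelope count of why that two-dimensional walk makes TLB misses expensive (the standard nested-paging result, assuming 4-level guest page tables and a 4-level EPT):

```latex
% Each of the (n+1) steps of the guest walk (n table levels plus the
% final access) must itself be translated through the m-level EPT:
\[
  (n+1)(m+1) - 1 \;=\; (4+1)(4+1) - 1 \;=\; 24 \ \text{memory accesses}
\]
% versus 4 accesses for the same TLB miss on bare metal, which is why
% huge pages and large TLBs matter so much inside VMs.
```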
3. The Middle Ground: MicroVMs
Plain containers have a large attack surface (hundreds of reachable syscalls). Plain VMs are slow and heavy. MicroVMs (like Firecracker) bridge the gap.
Firecracker Architecture
- Minimalism: Removes all non-essential devices (no VGA, no USB, no sound).
- VirtIO: Uses paravirtualized drivers for network and disk, avoiding the overhead of emulating real hardware registers.
- Jailer: Firecracker itself runs inside a container (Namespaces + Cgroups) to provide “Defense in Depth.”
- Performance: Can boot a Linux kernel in < 125ms and run thousands of instances on a single host.
4. Comparison: When to Use What?
| Feature | Containers | MicroVMs (Firecracker) | Full VMs (ESXi/KVM) |
|---|---|---|---|
| Isolation | Logical (Kernel) | Hardware (Minimal) | Hardware (Full) |
| Startup | < 1s | < 1s | > 10s |
| Payload | Process | Kernel + Rootfs | Full OS |
| Security | Medium (Shared Kernel) | High | Highest |
| Use Case | Microservices | Serverless / Multi-tenant | Legacy / Windows |
5. Docker Internals: Putting It All Together
Docker is a high-level container runtime that orchestrates namespaces, cgroups, and OverlayFS: the docker CLI talks to dockerd, which delegates to containerd, which in turn invokes runc to create the namespaces and cgroups and mount the overlay rootfs described above.
6. Interview Deep Dive: Senior Level
Q1: How do User Namespaces improve container security?
Answer: User Namespaces (CLONE_NEWUSER) allow a process to have UID 0 (root) inside the container while being an unprivileged UID (e.g., 1000) on the host.
Security Improvement: a process that escapes the container arrives on the host as that unprivileged UID, so a breakout no longer hands the attacker host root.
Limitations:
- Some operations still require host root (mounting certain filesystems)
- File ownership can be confusing (files created by the container appear owned by high UIDs on the host)
- Not all containers work with user namespaces (especially those requiring true root)
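A minimal sketch of the rootless mapping (no privileges required; the one-line uid_map maps container UID 0 to whatever UID runs the program):

```c
/* Sketch: become "root" in a new user namespace as a normal user. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    uid_t host_uid = getuid();

    if (unshare(CLONE_NEWUSER)) { perror("unshare"); return 1; }

    /* Deny setgroups first (required before gid mappings), then map
     * container UID 0 -> our host UID. Format: inside outside count */
    int fd = open("/proc/self/setgroups", O_WRONLY);
    write(fd, "deny", 4); close(fd);

    char map[64];
    int n = snprintf(map, sizeof(map), "0 %d 1", (int)host_uid);
    fd = open("/proc/self/uid_map", O_WRONLY);
    write(fd, map, n); close(fd);

    /* getuid() now reports 0, but the host still sees host_uid. */
    printf("inside: uid=%d (host uid was %d)\n",
           (int)getuid(), (int)host_uid);
    return 0;
}
```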
Q2: Explain the difference between cgroups v1 and v2, and why v2 is better
Answer:
Cgroups v1 Problems:
1. Multiple Hierarchies:
 - Each controller (cpu, memory, io) has its own hierarchy
 - A process can be in /sys/fs/cgroup/cpu/groupA and /sys/fs/cgroup/memory/groupB
 - Impossible to do unified resource accounting
2. Writeback Ambiguity:
 - A process in cgroup A writes to the page cache; writeback happens later
 - Which cgroup gets charged for the disk I/O?
 - v1: charged to whoever triggers writeback (wrong!)
3. No Delegation:
 - Can’t safely give non-root users control over cgroups
 - Security issues with nested hierarchies
Cgroups v2 Improvements:
1. Single Hierarchy:
 - One tree, all controllers
 - Process location is the same for all resources
 - Enables proper delegation and accounting
2. Proper Attribution:
 - Tracks which cgroup dirtied pages
 - I/O charged correctly even if writeback is delayed
3. Pressure Stall Information (PSI):
 - Built-in resource pressure metrics
 - Can detect when a cgroup is starved (see the PSI sketch below)
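PSI can be read straight out of procfs; a small sketch (the system-wide file is shown; per-cgroup data lives in each cgroup’s memory.pressure file):

```c
/* Sketch: dump memory pressure stall information (Linux 4.20+). */
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/pressure/memory", "r");
    if (!f) { perror("PSI not available"); return 1; }

    /* Two lines: "some ..." = any task stalled on memory,
     * "full ..." = all non-idle tasks stalled, each with
     * avg10/avg60/avg300 percentages and a total in microseconds. */
    char line[256];
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```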
Q3: How does Docker implement network isolation and connectivity?
Answer: Docker uses network namespaces + veth pairs + a Linux bridge.
Default Bridge Network: each container gets its own network namespace; a veth pair connects it to the host, with one end (eth0) inside the container and the peer attached to the docker0 bridge, which switches traffic between containers and out through the host.
Port Mapping: published ports (e.g., -p 8080:80) are implemented with iptables DNAT rules (plus docker-proxy for edge cases) that forward host traffic to the container’s IP.
Network Modes:
| Mode | Description | Use Case |
|---|---|---|
| bridge | Default, isolated network | Normal containers |
| host | Share host network namespace | High performance |
| none | No network | Security isolation |
| container:ID | Share another container’s netns | Sidecars |
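The veth/bridge plumbing itself needs netlink, so here is a hedged sketch of just the isolation half: after unshare(CLONE_NEWNET), the process sees only a fresh loopback device (run as root; shells out to iproute2’s ip for brevity):

```c
/* Sketch: demonstrate network-namespace isolation. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("host namespace interfaces:\n");
    system("ip -brief link");

    /* Detach into a fresh network namespace: only an inactive
     * loopback exists here. Docker would now create a veth pair,
     * move one end in as eth0, and attach the other to docker0. */
    if (unshare(CLONE_NEWNET)) { perror("unshare"); return 1; }

    printf("\nnew namespace interfaces:\n");
    system("ip -brief link");
    return 0;
}
```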
Q4: What is a 'VM Exit' and why is it expensive?
Answer: A VM Exit occurs when the guest OS performs an action that requires hypervisor intervention (e.g., I/O, CPUID, or accessing certain registers).
VM Exit Flow: the CPU saves guest state into the VMCS, loads host state, and jumps to the hypervisor’s exit handler; VMRESUME later restores the guest. Each round trip costs on the order of thousands of cycles and disturbs caches, the TLB, and branch predictors, so exit rate dominates virtualization overhead.
Common VM Exit Causes:
| Cause | Frequency | Mitigation |
|---|---|---|
| I/O instructions (IN/OUT) | High | Use virtio (paravirtualization) |
| CPUID | Medium | Cache results in guest |
| CR3 writes (page table) | High | Use EPT (hardware MMU) |
| Interrupts | Very High | APIC virtualization |
| MSR access | Medium | Use MSR bitmaps |
| HLT (idle) | Low | Acceptable (CPU idle anyway) |
Optimization Strategies:
1. EPT (Extended Page Tables):
 - Guest can change CR3 without a VM exit
 - Hardware handles GVA → GPA → HPA translation
2. APIC Virtualization:
 - Virtual APIC page in guest memory
 - Most interrupt operations happen without exits
3. VirtIO:
 - Paravirtualized drivers
 - Shared-memory rings reduce I/O exits
Q5: Explain the 'Nested Virtualization' problem
Answer: Nested virtualization is running a VM inside another VM (e.g., GKE on Google Cloud, where your hypervisor runs inside a cloud VM).
Address Translation Complexity: a third page-table layer appears (L2-guest virtual → L2-guest physical → L1-guest physical → host physical), and every exit taken by the L1 hypervisor must itself be emulated by the L0 hypervisor.
Performance Impact: exit-heavy workloads degrade sharply, because each nested exit fans out into several real exits.
When to Use Nested Virtualization:
1. Development/Testing:
 - Test hypervisor code
 - CI/CD pipelines that create VMs
2. Cloud Services:
 - Kubernetes on cloud VMs
 - CI runners in the cloud
3. Education:
 - Teaching virtualization
 - Lab environments
Avoid it for:
- Production databases
- High-performance computing
- Latency-sensitive applications
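Whether nesting is even available depends on the host KVM module; a small sketch to check (the parameter paths are the standard ones for kvm_intel and kvm_amd):

```c
/* Sketch: is nested virtualization enabled on this host? */
#include <stdio.h>

int main(void) {
    const char *paths[] = {
        "/sys/module/kvm_intel/parameters/nested",
        "/sys/module/kvm_amd/parameters/nested",
    };
    for (int i = 0; i < 2; i++) {
        FILE *f = fopen(paths[i], "r");
        if (!f) continue;                 /* module not loaded */
        int c = fgetc(f);                 /* 'Y' or '1' = enabled */
        printf("%s: %c\n", paths[i], c);
        fclose(f);
        return 0;
    }
    puts("no KVM nested parameter found");
    return 1;
}
```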
Q6: How does OverlayFS implement copy-on-write for containers?
Answer: OverlayFS provides a union mount where multiple layers are combined into a single view.
Layer Structure: lowerdir (read-only image layers, colon-separated) + upperdir (writable) + workdir (kernel scratch space), presented at the merged mount point.
Copy-on-Write Operations:
1. Read File: served from the topmost layer that contains it (upper first, then lowers in order).
2. Modify File: a file that exists only in a lower layer is first copied up to the upperdir, then modified there.
3. Delete File: a whiteout entry is created in the upperdir to hide the lower copy.
4. Create New File: written directly to the upperdir.
Mount Command: mount -t overlay overlay -o lowerdir=/l1:/l2,upperdir=/u,workdir=/w /merged (see also the mount(2) sketch in section 1.3).
Benefits:
- Shared base layers save disk space
- Fast container startup (no copying)
- Efficient use of cache (shared pages)
Drawbacks:
- First write to a file triggers copy-up (can be slow for large files)
- Many layers slow down lookups
- Whiteouts can accumulate (clean up with docker system prune)
Q7: What are the security implications of sharing the kernel in containers vs VMs?
Answer:
Container Security (Shared Kernel):
Pros:
- Faster startup and lower overhead
- Easier management
Cons:
- Kernel Exploits: a single kernel vulnerability can compromise every container on the host
- Large Attack Surface: hundreds of syscalls reachable from each container
- Resource Exhaustion: a noisy tenant can starve the shared kernel
- Information Leakage: shared /proc, /sys, and timing side channels
VM Security (Separate Kernels):
- Strong Isolation: hardware-enforced boundary (VMX, EPT)
- Smaller Attack Surface: a narrow hypercall/virtio interface instead of the full syscall API
- Different Kernels: a guest kernel exploit stays contained in that guest
Best Practices:
For Containers: drop capabilities, apply seccomp/AppArmor profiles, enable user namespaces, use read-only root filesystems.
For VMs: minimize the virtual device set and keep the hypervisor patched.
Hybrid Approach (Kata Containers): run each container inside a lightweight VM, keeping the container workflow while gaining hardware isolation.
| Aspect | Containers | VMs |
|---|---|---|
| Kernel isolation | Shared | Separate |
| Escape difficulty | Medium | Hard |
| Attack surface | Large (~300 syscalls) | Small (~20 hypercalls) |
| Vulnerability impact | Affects host | Contained to VM |
| Performance overhead | ~2% | ~5-10% |
| Startup time | < 1s | 10-30s |
Q8: How do hypervisors implement device emulation vs paravirtualization?
Answer:
Device Emulation (Full Virtualization): the guest believes it is talking to real hardware; the hypervisor traps and emulates every register read/write.
Paravirtualization (VirtIO): the guest knows it is virtualized and talks through an efficient shared-memory interface instead of trapping on each register access.
Performance Comparison:
| Operation | Emulation | VirtIO | Native |
|---|---|---|---|
| Network throughput | 1 Gbps | 9.5 Gbps | 10 Gbps |
| Disk IOPS | 5,000 | 45,000 | 50,000 |
| VM exits per packet | 100-1000 | 1-2 | 0 |
VirtIO Ring Structure: a virtqueue is a descriptor ring in guest memory shared with the host; the guest posts buffers, issues one “kick” (a single exit) for a whole batch, and the host returns completions through a used ring (see the struct sketch below).
Tradeoffs:
Emulation:
- Pros: No guest modification, runs any OS
- Cons: Slow, many VM exits
Paravirtualization (VirtIO):
- Pros: Fast, few VM exits
- Cons: Requires guest support (modified drivers)
Strategy:
- Use paravirt for performance-critical devices (disk, network)
- Use emulation for legacy devices (VGA, PS/2)
- Gradually reduce emulation over time
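For reference, a sketch of the split-virtqueue layout as defined by the VirtIO specification (field names follow the spec; alignment and memory-barrier rules omitted):

```c
/* Sketch: the three shared-memory pieces of a split virtqueue. */
#include <stdint.h>

struct virtq_desc {         /* one slot in the descriptor table */
    uint64_t addr;          /* guest-physical buffer address */
    uint32_t len;           /* buffer length in bytes */
    uint16_t flags;         /* NEXT / WRITE / INDIRECT */
    uint16_t next;          /* index of next descriptor in a chain */
};

struct virtq_avail {        /* guest -> host: buffers ready */
    uint16_t flags;
    uint16_t idx;           /* where the guest writes next */
    uint16_t ring[];        /* descriptor-table indices */
};

struct virtq_used_elem {
    uint32_t id;            /* head of the completed chain */
    uint32_t len;           /* bytes written by the device */
};

struct virtq_used {         /* host -> guest: buffers consumed */
    uint16_t flags;
    uint16_t idx;
    struct virtq_used_elem ring[];
};
```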
7. Namespaces & Cgroups: A Single Process’s Perspective
What does a process actually “see” when it’s containerized? Here’s the view from inside.
What Changes for the Process
Its PID, hostname, visible mounts, network devices, and UID mappings are all namespace-local; the kernel, CPU, and page cache are still shared with the host.
Inspecting Your Own Namespace
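A small sketch a process can run on itself: each /proc/self/ns symlink encodes a namespace type plus an inode number, and two processes share a namespace exactly when those inodes match:

```c
/* Sketch: print this process's namespace memberships. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const char *ns[] = {"mnt", "uts", "ipc", "pid", "net",
                        "user", "cgroup", "time"};
    char path[64], target[128];

    for (unsigned i = 0; i < sizeof(ns) / sizeof(ns[0]); i++) {
        snprintf(path, sizeof(path), "/proc/self/ns/%s", ns[i]);
        ssize_t n = readlink(path, target, sizeof(target) - 1);
        if (n < 0) continue;              /* kernel lacks this ns type */
        target[n] = '\0';
        printf("%-6s -> %s\n", ns[i], target);  /* e.g. net:[4026531840] */
    }
    return 0;
}
```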
The Process Doesn’t Know It’s Contained
There is no kernel flag that announces “you are in a container”; the process simply observes a smaller world. Heuristics such as checking /.dockerenv or /proc/1/cgroup are runtime conventions, not kernel guarantees.
Key Insight: Syscalls Are Virtualized
Every syscall that returns information about the system goes through namespace translation:
| Syscall | Host Returns | Container Returns |
|---|---|---|
| getpid() | 12345 | 1 |
| getuid() | 1000 | 0 (mapped root) |
| uname() | myserver | container-abc |
| readdir(/proc) | All PIDs | Only container PIDs |
| socket(AF_INET, ...) | Host network | Container network |
8. Advanced Practice
- Manual Namespace Build: Use the unshare command to create a shell with its own network and PID namespaces. Try to ping the host.
- Cgroup Stress Test: Create a cgroup v2 with a 100MB memory limit. Run a program that allocates 200MB and observe the kernel’s OOM-killer logs in dmesg.
- VirtIO Analysis: Run a KVM guest and use lspci inside the guest to identify which devices use virtio drivers vs. emulated hardware.
Next: OS Security & Hardening →