Chapter 1: Linux Namespaces
Containers aren’t magic: they’re built on Linux kernel primitives, the oldest of which (mount namespaces) dates back to 2002. The first and most fundamental of these primitives is namespaces. In this chapter, we’ll build our own container runtime in Java, starting with namespace isolation.
Think of namespaces like one-way mirrors in an interrogation room. The person inside the room (the container) sees only what’s in their room. They have no idea other rooms exist. The person outside (the host) can see into every room. Namespaces give each container its own private view of system resources (its own process table, its own network stack, its own hostname) while the host kernel manages all of them simultaneously.
The key realization is that containers are not virtual machines. There is no hypervisor, no guest kernel. Containers are regular Linux processes that have been given a restricted view of the world.
Prerequisites: Linux Internals: Processes
Further Reading: Operating Systems: Process Management
Time: 3-4 hours
Outcome: Understanding of namespace isolation
What Are Namespaces?
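To make the "private view" idea concrete, the following is a minimal sketch that lists the namespaces the current process belongs to by reading the symlinks under `/proc/self/ns`. The class and method names (`NamespaceInspector`, `readNamespaces`) are hypothetical and not part of the chapter's project; the sketch assumes a Linux host with `/proc` mounted and returns an empty map elsewhere.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the chapter's code base.
final class NamespaceInspector {

    /**
     * Maps namespace type to its kernel identifier, for example
     * "pid" -> "pid:[4026531836]", for a /proc/<pid>/ns directory.
     * Returns an empty map if the directory does not exist.
     */
    static Map<String, String> readNamespaces(Path nsDir) {
        Map<String, String> result = new TreeMap<>();
        if (!Files.isDirectory(nsDir)) {
            return result; // not on Linux, or /proc is not mounted
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(nsDir)) {
            for (Path entry : entries) {
                if (Files.isSymbolicLink(entry)) {
                    // The link target encodes the namespace type and inode number.
                    result.put(entry.getFileName().toString(),
                               Files.readSymbolicLink(entry).toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        readNamespaces(Paths.get("/proc/self/ns"))
            .forEach((type, id) -> System.out.println(type + " -> " + id));
    }
}
```

Two processes that print the same identifier for a given type share that namespace; a process inside a container will typically print different identifiers from a process on the host.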
Linux Namespace Types
| Namespace | Flag | Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs - container sees own PID 1 |
| NET | CLONE_NEWNET | Network stack - own interfaces, IPs, ports |
| MNT | CLONE_NEWNS | Mount points - own filesystem view |
| UTS | CLONE_NEWUTS | Hostname and domain name |
| IPC | CLONE_NEWIPC | Inter-process communication |
| USER | CLONE_NEWUSER | User and group IDs |
| CGROUP | CLONE_NEWCGROUP | Cgroup root directory |
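The flags in the table are bit values that can be combined with bitwise OR before being handed to `unshare()` or `clone()`. The sketch below shows one way to model that in Java. The enum name `NamespaceFlag` and the method `toMask` are assumptions made for illustration; the numeric values are the ones defined in the Linux `sched.h` header.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum for illustration; values match the CLONE_NEW* constants in <linux/sched.h>.
enum NamespaceFlag {
    MNT(0x00020000),     // CLONE_NEWNS
    CGROUP(0x02000000),  // CLONE_NEWCGROUP
    UTS(0x04000000),     // CLONE_NEWUTS
    IPC(0x08000000),     // CLONE_NEWIPC
    USER(0x10000000),    // CLONE_NEWUSER
    PID(0x20000000),     // CLONE_NEWPID
    NET(0x40000000);     // CLONE_NEWNET

    final int value;

    NamespaceFlag(int value) {
        this.value = value;
    }

    /** Combines a set of flags into the single int mask the syscalls expect. */
    static int toMask(Set<NamespaceFlag> flags) {
        int mask = 0;
        for (NamespaceFlag flag : flags) {
            mask |= flag.value;
        }
        return mask;
    }

    public static void main(String[] args) {
        int mask = toMask(EnumSet.of(PID, UTS, MNT));
        System.out.println(String.format("0x%08x", mask));
    }
}
```

Using an `EnumSet` keeps call sites readable while still producing the raw mask that a native binding would need.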
Part 1: Project Setup
We’ll use Java with JNA (Java Native Access) to call Linux system calls.
pom.xml
Part 2: Linux System Call Bindings
First, we need to call Linux system calls from Java:
src/main/java/com/minidocker/linux/LibC.java
Part 3: Namespace Manager
src/main/java/com/minidocker/namespace/NamespaceManager.java
Part 4: Namespace Options
src/main/java/com/minidocker/namespace/NamespaceOptions.java
Part 5: Understanding Each Namespace
PID Namespace
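A process inside a PID namespace has one PID per nesting level: the one the host sees and the one it sees itself. On kernels 4.1 and later, the `NSpid` field of `/proc/<pid>/status` lists these values from the outermost namespace to the innermost. The following sketch parses that field; `NsPidSketch` and `parseNsPid` are hypothetical names, not part of the chapter's project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration.
final class NsPidSketch {

    /**
     * Extracts the NSpid values from the text of /proc/<pid>/status.
     * Returns an empty list if the field is absent (kernels before 4.1).
     */
    static List<Integer> parseNsPid(String statusText) {
        List<Integer> pids = new ArrayList<>();
        for (String line : statusText.split("\n")) {
            if (line.startsWith("NSpid:")) {
                String rest = line.substring("NSpid:".length()).trim();
                for (String token : rest.split("\\s+")) {
                    if (!token.isEmpty()) {
                        pids.add(Integer.parseInt(token));
                    }
                }
            }
        }
        return pids;
    }

    public static void main(String[] args) throws IOException {
        String status = new String(
            Files.readAllBytes(Paths.get("/proc/self/status")), StandardCharsets.UTF_8);
        // Outside a container this usually prints a single PID;
        // inside a nested PID namespace it prints one PID per level.
        System.out.println(parseNsPid(status));
    }
}
```

For a container's first process viewed from the host, the last value in the list would be 1.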
UTS Namespace
Mount Namespace
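Each mount namespace has its own mount table, which the kernel exposes per process in `/proc/<pid>/mountinfo`. As a rough sketch, the parser below pulls the mount point, filesystem type, and source out of one line of that file; comparing its output inside and outside a container shows the differing filesystem views. The names `MountInfoSketch` and `Entry` are hypothetical.

```java
// Hypothetical parser for illustration; field layout follows the proc(5) description of mountinfo.
final class MountInfoSketch {

    static final class Entry {
        final String mountPoint;
        final String fsType;
        final String source;

        Entry(String mountPoint, String fsType, String source) {
            this.mountPoint = mountPoint;
            this.fsType = fsType;
            this.source = source;
        }
    }

    /**
     * Parses one mountinfo line. Fields are space separated; a variable number of
     * optional fields is terminated by a lone "-", after which come the filesystem
     * type and the mount source. Spaces inside paths appear escaped as \040.
     */
    static Entry parseLine(String line) {
        String[] f = line.trim().split(" ");
        int separator = -1;
        for (int i = 6; i < f.length; i++) {
            if (f[i].equals("-")) {
                separator = i;
                break;
            }
        }
        if (separator < 0 || separator + 2 >= f.length) {
            throw new IllegalArgumentException("not a mountinfo line: " + line);
        }
        return new Entry(f[4], f[separator + 1], f[separator + 2]);
    }

    public static void main(String[] args) {
        Entry e = parseLine("36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue");
        System.out.println(e.mountPoint + " " + e.fsType + " " + e.source);
    }
}
```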
Part 6: Container Runner
src/main/java/com/minidocker/Container.java
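Before wiring up native calls, it can help to see the same isolation produced by the `unshare` command-line tool from util-linux. The sketch below is not the chapter's `Container` class; it is a hypothetical helper (`UnshareCommandSketch`) that builds an `unshare` argument list for `ProcessBuilder`, assuming the tool is installed. Actually launching the command typically requires root or a user namespace, so the launch line is left commented out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; delegates namespace creation to the util-linux `unshare` tool.
final class UnshareCommandSketch {

    /** Builds the argv for running `command` in the requested new namespaces. */
    static List<String> build(boolean pid, boolean uts, boolean mount, List<String> command) {
        List<String> argv = new ArrayList<>();
        argv.add("unshare");
        if (pid) {
            argv.add("--pid");
            argv.add("--fork"); // the forked child becomes PID 1 in the new PID namespace
        }
        if (uts) {
            argv.add("--uts");
        }
        if (mount) {
            argv.add("--mount");
        }
        if (pid && mount) {
            argv.add("--mount-proc"); // remount /proc so tools like ps see the new PID view
        }
        argv.add("--");
        argv.addAll(command);
        return argv;
    }

    public static void main(String[] args) throws Exception {
        List<String> argv = build(true, true, true,
            Arrays.asList("/bin/sh", "-c", "echo pid=$$; hostname"));
        System.out.println(String.join(" ", argv));
        // Uncomment to run it (needs sufficient privileges):
        // new ProcessBuilder(argv).inheritIO().start().waitFor();
    }
}
```

The `--fork` flag matters for PID namespaces: the process that calls `unshare()` does not move into the new PID namespace itself, only its children do.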
Exercises
Exercise 1: Add Network Namespace
Extend the namespace manager to create network namespaces:
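One way to check your work on this exercise is to list the network interfaces visible to a process: a freshly created network namespace normally contains only the loopback device. The sketch below parses the text of `/proc/net/dev`, which reflects the reading process's own network namespace. `NetDevSketch` and `parseInterfaceNames` are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical verification helper for the exercise.
final class NetDevSketch {

    /**
     * Extracts interface names from the contents of /proc/net/dev.
     * Data lines look like "    lo: 0 0 ..."; the two header lines contain no colon.
     */
    static List<String> parseInterfaceNames(String procNetDev) {
        List<String> names = new ArrayList<>();
        for (String line : procNetDev.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                names.add(line.substring(0, colon).trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String sample =
            "Inter-|   Receive                |  Transmit\n" +
            " face |bytes    packets errs     |bytes    packets errs\n" +
            "    lo:       0       0    0            0       0    0\n";
        System.out.println(parseInterfaceNames(sample));
    }
}
```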
Exercise 2: Implement Namespace Joining
Allow joining an existing container’s namespaces:
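As a reference point for this exercise, the util-linux tool `nsenter` joins the namespaces of a running process by opening `/proc/<pid>/ns/<type>` and calling `setns()` on each. The sketch below only builds an `nsenter` argument list, so you can compare its behavior with your own implementation. `NsenterCommandSketch` is a hypothetical name, and the accepted type names are the long option names of `nsenter`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; mirrors what a setns()-based implementation would do.
final class NsenterCommandSketch {

    private static final List<String> KNOWN_TYPES =
        Arrays.asList("mount", "uts", "ipc", "net", "pid", "user", "cgroup");

    /** Builds the argv for running `command` inside the namespaces of `targetPid`. */
    static List<String> build(long targetPid, List<String> nsTypes, List<String> command) {
        if (targetPid <= 0) {
            throw new IllegalArgumentException("target PID must be positive");
        }
        List<String> argv = new ArrayList<>();
        argv.add("nsenter");
        argv.add("--target");
        argv.add(Long.toString(targetPid));
        for (String type : nsTypes) {
            if (!KNOWN_TYPES.contains(type)) {
                throw new IllegalArgumentException("unknown namespace type: " + type);
            }
            argv.add("--" + type);
        }
        argv.add("--");
        argv.addAll(command);
        return argv;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
            build(4242, Arrays.asList("pid", "uts"), Arrays.asList("/bin/sh"))));
    }
}
```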
Exercise 3: User Namespace Mapping
Implement user namespace with UID/GID mapping:
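A possible starting point for this exercise is reading the subordinate ID ranges that the host grants to unprivileged users in `/etc/subuid` and `/etc/subgid`, where each line has the form `owner:start:count`. The sketch below parses that format and formats a single-range map line. `SubIdSketch`, `Range`, and `toMapLine` are hypothetical names, and mapping container ID 0 onto the start of the range is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the exercise.
final class SubIdSketch {

    static final class Range {
        final String owner;
        final long start;
        final long count;

        Range(String owner, long start, long count) {
            this.owner = owner;
            this.start = start;
            this.count = count;
        }

        /** Formats a uid_map or gid_map line mapping container ID 0 onto the start of this range. */
        String toMapLine() {
            return "0 " + start + " " + count;
        }
    }

    /** Parses subuid/subgid text; lines that do not have exactly three fields are skipped. */
    static List<Range> parse(String text) {
        List<Range> ranges = new ArrayList<>();
        for (String line : text.split("\n")) {
            String[] f = line.trim().split(":");
            if (f.length != 3) {
                continue;
            }
            ranges.add(new Range(f[0], Long.parseLong(f[1]), Long.parseLong(f[2])));
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (Range r : parse("alice:100000:65536\n")) {
            System.out.println(r.owner + " -> " + r.toMapLine());
        }
    }
}
```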
Key Takeaways
Isolation Not Virtualization
Namespaces isolate views of resources, not the resources themselves
Kernel Primitives
unshare(), clone(), setns() are the syscalls that power containers
Layered Isolation
Each namespace type isolates a different resource
No Overhead
Namespaces add negligible overhead - just kernel data structures
Further Reading
Linux Namespaces Manual
Official documentation for Linux namespaces
Linux Internals Course
Deep dive into Linux process management
What’s Next?
In Chapter 2: Control Groups (cgroups), we’ll implement:
- CPU limits
- Memory limits
- Process count limits
- Resource accounting
Interview Deep-Dive
What is the difference between unshare() and clone() for creating namespaces, and when would you use each?
A container process can see that it is PID 1. What responsibilities does PID 1 have in a PID namespace, and what goes wrong if the entrypoint does not handle them?
Strong Answer:
- PID 1 in any namespace has two critical responsibilities inherited from Unix init: signal handling and zombie reaping. The kernel treats PID 1 specially: signals for which it has not installed a handler are not delivered, so SIGTERM and SIGINT are ignored unless PID 1 explicitly registers a handler. This is why the `docker stop` timeout (10 seconds by default) matters: Docker sends SIGTERM, and if the container’s entrypoint does not handle it, Docker waits out the timeout and then sends SIGKILL.
- Zombie reaping is the second issue. When a child process exits, it becomes a zombie until its parent calls `wait()`. In a normal system, init (PID 1) adopts orphaned processes and reaps them. If a container’s PID 1 is a simple application that does not call `wait()`, orphaned child processes accumulate as zombies, consuming PID table entries. This is especially common with shell scripts as entrypoints that spawn background processes.
- The practical solutions are: use a proper init system like `tini` (Docker’s `--init` flag), or ensure your entrypoint is written to forward signals and reap children. Whatever the language (Go, Node.js, Python), that means installing explicit signal handlers and calling `wait()` in a loop when SIGCHLD arrives; language runtimes do not reap adopted orphans on their own.
- A war story: at scale, zombie accumulation inside containers can hit the PID limit set by cgroups (`pids.max`), causing the container to fail to spawn any new processes. The symptoms look like “cannot fork: resource temporarily unavailable” errors that are mystifying if you do not know to check for zombies with `ps aux | grep Z`.
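The zombie check described above can also be done programmatically by reading the state field of each `/proc/<pid>/stat` file. The following is a minimal sketch under the assumption of a Linux `/proc`; `ZombieScanSketch`, `parseState`, and `findZombies` are hypothetical names. The command name in `stat` is wrapped in parentheses and may itself contain spaces or parentheses, so the parser anchors on the last closing parenthesis.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper.
final class ZombieScanSketch {

    /** Returns the one-letter process state from the contents of /proc/<pid>/stat. */
    static char parseState(String statLine) {
        int close = statLine.lastIndexOf(')');
        if (close < 0 || close + 2 >= statLine.length()) {
            throw new IllegalArgumentException("not a stat line");
        }
        return statLine.charAt(close + 2); // layout: "pid (comm) state ..."
    }

    /** Lists PIDs whose state is 'Z'. Returns an empty list if procRoot is missing. */
    static List<Integer> findZombies(Path procRoot) {
        List<Integer> zombies = new ArrayList<>();
        if (!Files.isDirectory(procRoot)) {
            return zombies;
        }
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(procRoot, "[0-9]*")) {
            for (Path dir : dirs) {
                String name = dir.getFileName().toString();
                if (!name.matches("\\d+")) {
                    continue;
                }
                try {
                    String stat = new String(
                        Files.readAllBytes(dir.resolve("stat")), StandardCharsets.UTF_8);
                    if (parseState(stat) == 'Z') {
                        zombies.add(Integer.parseInt(name));
                    }
                } catch (IOException | IllegalArgumentException e) {
                    // The process exited during the scan or is unreadable; skip it.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return zombies;
    }

    public static void main(String[] args) {
        System.out.println("zombies: " + findZombies(Paths.get("/proc")));
    }
}
```

Run inside a container, a steadily growing list here suggests that PID 1 is not reaping its children.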
Kubernetes pods can opt into a shared PID namespace with `shareProcessNamespace: true`. When enabled, the pause container (the pod’s infrastructure container) becomes PID 1 and handles zombie reaping for all containers in the pod. Without this setting, each container has its own PID namespace and must handle its own signal forwarding and reaping. This is one reason the pause container exists: it is a minimal process that correctly implements init behavior, acting as the stable anchor for the pod’s shared namespaces.
How does the user namespace enable rootless containers, and what are the security trade-offs?
Strong Answer:
- The user namespace maps UIDs and GIDs inside the namespace to different UIDs outside. A process can be UID 0 (root) inside the container but map to UID 100000 (unprivileged) on the host. This means the container process has full root capabilities within its namespace but if it escapes the container, it lands as an unprivileged user on the host.
- The mapping is configured by writing to `/proc/<pid>/uid_map` and `/proc/<pid>/gid_map`. A typical mapping like `0 100000 65536` means container UIDs 0-65535 map to host UIDs 100000-165535. This requires either root on the host or entries in `/etc/subuid` and `/etc/subgid` that grant ranges to unprivileged users.
- The trade-off is complexity and compatibility. Some operations inside rootless containers behave differently. For example, `mknod` for device files is restricted because creating device nodes requires CAP_MKNOD in the initial (host) user namespace, which a rootless container does not hold. Network namespace setup requires workarounds (like `slirp4netns` instead of veth pairs) because creating network interfaces on the host side needs real CAP_NET_ADMIN on the host.
- Despite these trade-offs, rootless containers are a significant security improvement and are the default in Podman. For production environments where the threat model includes container escape, running rootless eliminates the most dangerous scenario: an attacker gaining root on the host.
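The arithmetic behind a mapping line such as `0 100000 65536` can be sketched in a few lines: an ID inside the namespace is translated by its offset from the start of the range, and IDs outside the range are unmapped. `UidMapSketch` and `toHost` are hypothetical names, and returning -1 for an unmapped ID is a convention chosen for this sketch only.

```java
// Hypothetical model of a single uid_map/gid_map line: "<inside> <outside> <length>".
final class UidMapSketch {

    final long insideStart;
    final long outsideStart;
    final long length;

    UidMapSketch(long insideStart, long outsideStart, long length) {
        this.insideStart = insideStart;
        this.outsideStart = outsideStart;
        this.length = length;
    }

    /** Parses one mapping line, for example "0 100000 65536". */
    static UidMapSketch parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 3) {
            throw new IllegalArgumentException("expected: <inside> <outside> <length>");
        }
        return new UidMapSketch(
            Long.parseLong(f[0]), Long.parseLong(f[1]), Long.parseLong(f[2]));
    }

    /** Translates an ID inside the namespace to the host ID, or -1 if it is unmapped. */
    long toHost(long insideId) {
        long offset = insideId - insideStart;
        return (offset >= 0 && offset < length) ? outsideStart + offset : -1;
    }

    public static void main(String[] args) {
        UidMapSketch map = parse("0 100000 65536");
        System.out.println(map.toHost(0));      // container root as seen by the host
        System.out.println(map.toHost(65535));  // last mapped ID
        System.out.println(map.toHost(65536));  // outside the range
    }
}
```

With the example mapping, container root (UID 0) corresponds to host UID 100000, which is why an escaped process lands as an unprivileged user.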