Containers aren’t magic: they’re built on Linux kernel primitives that have existed since 2002. The first and most fundamental of these is namespaces. In this chapter, we’ll build our own container runtime in Java, starting with namespace isolation.
Think of namespaces like one-way mirrors in an interrogation room. The person inside the room (the container) sees only what’s in their room. They have no idea other rooms exist. The person outside (the host) can see into every room. Namespaces give each container its own private view of system resources (its own process table, its own network stack, its own hostname) while the host kernel manages all of them simultaneously.
The key realization is that containers are not virtual machines. There is no hypervisor, no guest kernel. Containers are regular Linux processes that have been given a restricted view of the world.
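To make the "private view" idea concrete, the kernel exposes each process's namespace memberships as symbolic links under /proc/<pid>/ns. The following is a minimal sketch, not part of the runtime built in this chapter: the class name NamespaceInspector and its helper are hypothetical, and it assumes a Linux host with /proc mounted. Two processes that print the same inode number for a type share that namespace.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Prints the namespace identifiers of the current process (Linux only). */
final class NamespaceInspector {

    // Namespace types commonly exposed under /proc/<pid>/ns.
    static final List<String> TYPES =
            List.of("mnt", "uts", "ipc", "user", "pid", "net", "cgroup");

    /** Extracts the inode number from a link target such as "pid:[4026531836]". */
    static long parseInode(String linkTarget) {
        int open = linkTarget.indexOf('[');
        int close = linkTarget.indexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Unexpected namespace link: " + linkTarget);
        }
        return Long.parseLong(linkTarget.substring(open + 1, close));
    }

    public static void main(String[] args) throws IOException {
        for (String type : TYPES) {
            Path link = Path.of("/proc/self/ns", type);
            if (!Files.isSymbolicLink(link)) {
                continue; // this kernel does not expose that namespace type
            }
            String target = Files.readSymbolicLink(link).toString();
            System.out.println(type + " -> inode " + parseInode(target));
        }
    }
}
```

Running this on the host and again inside a container should show different inode numbers for every namespace type the container has unshared.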
First, we need a way to invoke Linux system calls from Java. The bindings below use JNA (Java Native Access) to call into libc:
src/main/java/com/minidocker/linux/LibC.java
package com.minidocker.linux;

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;

/**
 * JNA bindings to Linux libc functions.
 *
 * These are the low-level system calls that Docker uses internally.
 */
public interface LibC extends Library {

    LibC INSTANCE = Native.load("c", LibC.class);

    // Namespace flags
    int CLONE_NEWNS     = 0x00020000; // Mount namespace
    int CLONE_NEWCGROUP = 0x02000000; // Cgroup namespace
    int CLONE_NEWUTS    = 0x04000000; // UTS namespace (hostname)
    int CLONE_NEWIPC    = 0x08000000; // IPC namespace
    int CLONE_NEWUSER   = 0x10000000; // User namespace
    int CLONE_NEWPID    = 0x20000000; // PID namespace
    int CLONE_NEWNET    = 0x40000000; // Network namespace

    // Mount flags
    int MS_NOSUID  = 2;
    int MS_NODEV   = 4;
    int MS_NOEXEC  = 8;
    int MS_BIND    = 4096;
    int MS_REC     = 16384;
    int MS_PRIVATE = 1 << 18;

    /**
     * Create new namespaces and move the calling process into them.
     * For CLONE_NEWPID, only children created afterwards enter the new namespace.
     *
     * @param flags Combination of CLONE_NEW* flags
     * @return 0 on success, -1 on error
     */
    int unshare(int flags);

    /**
     * Change the root filesystem.
     *
     * @param path New root directory
     * @return 0 on success, -1 on error
     */
    int chroot(String path);

    /**
     * Change working directory.
     */
    int chdir(String path);

    /**
     * Mount a filesystem.
     *
     * @param source         Source device/path
     * @param target         Mount point
     * @param filesystemtype Type (e.g., "proc", "sysfs")
     * @param mountflags     Mount flags
     * @param data           Additional data
     * @return 0 on success, -1 on error
     */
    int mount(String source, String target, String filesystemtype,
              long mountflags, Pointer data);

    /**
     * Unmount a filesystem.
     */
    int umount2(String target, int flags);

    /**
     * Set hostname.
     *
     * @param name New hostname
     * @param len  Length of name in bytes
     */
    int sethostname(String name, int len);

    /**
     * Get process ID.
     */
    int getpid();

    /**
     * Get parent process ID.
     */
    int getppid();

    /**
     * Get user ID.
     */
    int getuid();

    /**
     * Get group ID.
     */
    int getgid();

    /**
     * Execute a program.
     */
    int execv(String path, String[] argv);

    /**
     * Fork the process.
     */
    int fork();

    /**
     * Wait for child process.
     */
    int waitpid(int pid, int[] status, int options);

    /**
     * Set the user ID of the calling process.
     */
    int setuid(int uid);

    /**
     * Set the group ID of the calling process.
     */
    int setgid(int gid);

    /**
     * Pivot root - atomically swap root filesystems.
     */
    int pivot_root(String new_root, String put_old);
}
src/main/java/com/minidocker/namespace/NamespaceManager.java
package com.minidocker.namespace;

import com.minidocker.linux.LibC;
import com.sun.jna.Native;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Manages Linux namespace creation and configuration.
 *
 * Namespaces provide isolation for various system resources:
 * - PID: Process sees its own process tree, with PID 1
 * - NET: Separate network stack (interfaces, routing, firewall)
 * - MNT: Separate mount points
 * - UTS: Separate hostname
 * - IPC: Separate System V IPC and POSIX message queues
 * - USER: Separate user/group ID mappings
 */
public class NamespaceManager {

    private final LibC libc = LibC.INSTANCE;

    // Host IDs are captured at construction time, before any unshare() call.
    // Inside a new user namespace with no mapping written yet, getuid() and
    // getgid() return the overflow ID (65534 by default), not the host ID.
    private final int hostUid = libc.getuid();
    private final int hostGid = libc.getgid();

    /**
     * Creates all namespaces needed for container isolation.
     *
     * @param options Configuration options
     * @throws NamespaceException if namespace creation fails
     */
    public void createNamespaces(NamespaceOptions options) throws NamespaceException {
        int flags = 0;

        if (options.newPidNamespace()) {
            flags |= LibC.CLONE_NEWPID;
        }
        if (options.newNetNamespace()) {
            flags |= LibC.CLONE_NEWNET;
        }
        if (options.newMountNamespace()) {
            flags |= LibC.CLONE_NEWNS;
        }
        if (options.newUtsNamespace()) {
            flags |= LibC.CLONE_NEWUTS;
        }
        if (options.newIpcNamespace()) {
            flags |= LibC.CLONE_NEWIPC;
        }
        if (options.newUserNamespace()) {
            // Caveat: the kernel rejects unshare(CLONE_NEWUSER) with EINVAL when the
            // caller is multithreaded, and a running JVM always has several threads.
            flags |= LibC.CLONE_NEWUSER;
        }

        System.out.println("Creating namespaces with flags: 0x" + Integer.toHexString(flags));

        // unshare() creates new namespaces and moves the calling process into them
        int result = libc.unshare(flags);
        if (result != 0) {
            throw new NamespaceException("Failed to create namespaces: errno " + Native.getLastError());
        }

        System.out.println("✓ Namespaces created successfully");
    }

    /**
     * Sets up user namespace mappings.
     *
     * Maps container user 0 (root) to the host user that started the runtime.
     */
    public void setupUserNamespace() throws IOException {
        int pid = libc.getpid();

        // Write UID mapping: container_uid host_uid count
        // Maps container root (0) to current host user
        writeProcFile("/proc/" + pid + "/uid_map", "0 " + hostUid + " 1\n");

        // Disable setgroups (required before an unprivileged write to gid_map)
        writeProcFile("/proc/" + pid + "/setgroups", "deny\n");

        // Write GID mapping
        writeProcFile("/proc/" + pid + "/gid_map", "0 " + hostGid + " 1\n");

        System.out.println("✓ User namespace mapped: container root -> host uid " + hostUid);
    }

    // The /proc entries already exist, so open them for writing only.
    private static void writeProcFile(String path, String content) throws IOException {
        Files.writeString(Path.of(path), content, StandardOpenOption.WRITE);
    }

    /**
     * Sets the hostname within the UTS namespace.
     */
    public void setHostname(String hostname) throws NamespaceException {
        // sethostname() expects a byte count, which equals length() only for ASCII
        // names (this assumes JNA passes strings in its default UTF-8 encoding).
        int length = hostname.getBytes(StandardCharsets.UTF_8).length;
        int result = libc.sethostname(hostname, length);
        if (result != 0) {
            throw new NamespaceException("Failed to set hostname: errno " + Native.getLastError());
        }
        System.out.println("✓ Hostname set to: " + hostname);
    }

    /**
     * Demonstrates PID namespace isolation.
     */
    public void showPidNamespaceInfo() {
        int pid = libc.getpid();
        int ppid = libc.getppid();

        System.out.println("Inside container:");
        System.out.println("  PID: " + pid);
        System.out.println("  PPID: " + ppid);

        // In a PID namespace, the first process gets PID 1
        if (pid == 1) {
            System.out.println("  → We are PID 1 (init process) in this namespace!");
        }
    }
}
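The flag-combining step in createNamespaces() can be checked without any native call. The sketch below is a hypothetical debugging helper (the class CloneFlagNames is not part of the chapter's code) that decodes a bitmask back into flag names, using the same constant values as LibC above. It may be handy for making the "Creating namespaces with flags: 0x..." log line readable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Decodes a CLONE_NEW* bitmask into human-readable flag names. */
final class CloneFlagNames {

    // Same values as the constants declared in LibC, in ascending order.
    private static final Map<String, Integer> FLAGS = new LinkedHashMap<>();
    static {
        FLAGS.put("CLONE_NEWNS", 0x00020000);
        FLAGS.put("CLONE_NEWCGROUP", 0x02000000);
        FLAGS.put("CLONE_NEWUTS", 0x04000000);
        FLAGS.put("CLONE_NEWIPC", 0x08000000);
        FLAGS.put("CLONE_NEWUSER", 0x10000000);
        FLAGS.put("CLONE_NEWPID", 0x20000000);
        FLAGS.put("CLONE_NEWNET", 0x40000000);
    }

    /** Returns the names of all namespace flags set in the given bitmask. */
    static List<String> describe(int flags) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : FLAGS.entrySet()) {
            if ((flags & entry.getValue()) != 0) {
                names.add(entry.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        int flags = 0x20000000 | 0x04000000; // PID + UTS
        System.out.println("0x" + Integer.toHexString(flags) + " = " + describe(flags));
    }
}
```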
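public void demonstrateMountNamespace() throws Exception {
    // Create mount namespace
    if (libc.unshare(LibC.CLONE_NEWNS) != 0) {
        throw new NamespaceException("unshare(CLONE_NEWNS) failed: errno " + Native.getLastError());
    }

    // Make all mounts private (changes don't propagate to host)
    if (libc.mount(null, "/", null, LibC.MS_REC | LibC.MS_PRIVATE, null) != 0) {
        throw new NamespaceException("Failed to make mounts private: errno " + Native.getLastError());
    }

    // Now we can mount things that only this container sees
    if (libc.mount("tmpfs", "/tmp", "tmpfs", 0, null) != 0) {
        throw new NamespaceException("Failed to mount tmpfs on /tmp: errno " + Native.getLastError());
    }

    // This /tmp is completely separate from host's /tmp
}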
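Whether the MS_PRIVATE step took effect can be verified from /proc/self/mountinfo, where a mount that still propagates events to the host carries a shared:N tag in its optional fields. The following is a sketch under that assumption; the class MountPropagation is hypothetical and only reads the file, it does not change any mount.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Reports which mounts in the current mount namespace are still shared. */
final class MountPropagation {

    /** Returns the mount point, the fifth field of a mountinfo line. */
    static String mountPoint(String line) {
        return line.split(" ")[4];
    }

    /**
     * Returns true if the line has a "shared:N" optional field. Optional fields
     * start at the seventh field and end at the "-" separator.
     */
    static boolean isShared(String line) {
        String[] fields = line.split(" ");
        for (int i = 6; i < fields.length && !fields[i].equals("-"); i++) {
            if (fields[i].startsWith("shared:")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/self/mountinfo"))) {
            String label = isShared(line) ? "shared" : "not shared";
            System.out.println(mountPoint(line) + ": " + label);
        }
    }
}
```

After a recursive MS_PRIVATE remount of "/", no line should report "shared".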
Exercise 1: Network Namespace
Extend the namespace manager to create network namespaces:
// 1. Create network namespace
libc.unshare(LibC.CLONE_NEWNET);

// 2. Bring up loopback interface
// Use: ip link set lo up
// (Requires additional native calls or ProcessBuilder)

// 3. Verify isolation
// The container should have its own network stack
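As a starting point for step 2, here is one possible ProcessBuilder approach. It is a sketch rather than a reference solution: the class LoopbackSetup is hypothetical, it assumes the ip tool from iproute2 is installed, and it assumes the child process is spawned from the same thread that called unshare(), so that it inherits the new network namespace.

```java
import java.io.IOException;
import java.util.List;

/** Brings up the loopback interface by shelling out to the ip tool. */
final class LoopbackSetup {

    /** The command from the exercise: ip link set lo up. */
    static List<String> loopbackUpCommand() {
        return List.of("ip", "link", "set", "lo", "up");
    }

    /** Runs a command with inherited stdio and returns its exit code. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        int exitCode = run(loopbackUpCommand());
        if (exitCode != 0) {
            System.err.println("ip exited with code " + exitCode);
        }
    }
}
```

Changing interface state needs CAP_NET_ADMIN in the namespace that owns the interface, so expect a non-zero exit code when this runs unprivileged on the host.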
Exercise 2: Implement Namespace Joining
Allow joining an existing container’s namespaces:
// Use setns() syscall to join existing namespace
// int setns(int fd, int nstype);
// fd = open("/proc/<pid>/ns/<type>")
// This is how "docker exec" works!
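The path and nstype half of this exercise can be prepared in plain Java. The sketch below uses a hypothetical class, NamespacePaths, that pairs each namespace file name with its CLONE_NEW* flag. The setns() call itself is left to the exercise, since it needs a native binding and a raw file descriptor, which Java's file APIs do not hand out.

```java
import java.nio.file.Path;

/** Maps namespace types to their /proc file names and CLONE_NEW* flags. */
final class NamespacePaths {

    enum Type {
        MNT("mnt", 0x00020000),
        UTS("uts", 0x04000000),
        IPC("ipc", 0x08000000),
        USER("user", 0x10000000),
        PID("pid", 0x20000000),
        NET("net", 0x40000000);

        final String fileName; // entry name under /proc/<pid>/ns
        final int cloneFlag;   // value to pass as nstype to setns()

        Type(String fileName, int cloneFlag) {
            this.fileName = fileName;
            this.cloneFlag = cloneFlag;
        }
    }

    /** Builds the path that would be opened to obtain the fd for setns(). */
    static Path nsPath(long pid, Type type) {
        return Path.of("/proc", Long.toString(pid), "ns", type.fileName);
    }

    public static void main(String[] args) {
        long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
        for (Type type : Type.values()) {
            System.out.println(nsPath(pid, type) + " nstype=0x" + Integer.toHexString(type.cloneFlag));
        }
    }
}
```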
Exercise 3: User Namespace Mapping
Implement user namespace with UID/GID mapping:
// 1. Create user namespace FIRST (before other namespaces)
// 2. Write to /proc/self/uid_map
// 3. Write "deny" to /proc/self/setgroups
// 4. Write to /proc/self/gid_map
// This enables unprivileged containers!
What is the difference between unshare() and clone() for creating namespaces, and when would you use each?
Strong Answer:
clone() creates a new child process that starts in the new namespace(s). It is analogous to fork() but accepts flags that specify which namespaces to create. The parent and child are in different namespaces from the moment the child starts executing.
unshare() moves the calling process itself into new namespace(s). There is no new process created. This is simpler when you want to isolate the current process rather than spawn a child.
Real container runtimes like runc use clone() because they need the container’s init process (PID 1 in the new PID namespace) to be a child process that the runtime can monitor and wait on. If you call unshare(CLONE_NEWPID), the calling process does not get PID 1 in the new namespace; only the first child it creates afterwards does. This is a subtle but critical distinction that trips up many implementations.
There is also setns(), which joins an existing namespace by opening /proc/<pid>/ns/<type> and passing the file descriptor. This is how docker exec works — it calls setns() to enter the target container’s namespaces before executing the new command.
Follow-up: Why does the PID namespace behave differently from other namespaces with unshare()?
Because PID namespace membership is determined at process creation time, not at runtime. When you call unshare(CLONE_NEWPID), the calling process remains in its original PID namespace, and it is the child created by the next fork() that gets PID 1 in the new namespace. This is by kernel design: a process cannot change its own PID. Other namespaces like UTS or NET can take effect immediately because they do not involve identity (the hostname or network stack can change under a running process without ambiguity, but changing a process’s PID mid-execution would break everything that references it).
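The "one process, several PIDs" view is visible in the NSpid line of /proc/<pid>/status (available since Linux 4.1), which lists the process's PID in each nested PID namespace, outermost first. The sketch below parses that line; the class NsPidReader is hypothetical and assumes a Linux host with /proc mounted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Reads the per-namespace PIDs of a process from its status file. */
final class NsPidReader {

    /** Returns the PIDs on the "NSpid:" line, or an empty list if it is absent. */
    static List<Integer> parseNsPid(String statusText) {
        for (String line : statusText.split("\n")) {
            if (line.startsWith("NSpid:")) {
                List<Integer> pids = new ArrayList<>();
                String values = line.substring("NSpid:".length()).trim();
                for (String token : values.split("\\s+")) {
                    pids.add(Integer.parseInt(token));
                }
                return pids;
            }
        }
        return List.of();
    }

    public static void main(String[] args) throws IOException {
        String status = Files.readString(Path.of("/proc/self/status"));
        System.out.println("PIDs from outermost to innermost namespace: " + parseNsPid(status));
    }
}
```

For a container's init process inspected from the host, the list would end in 1 while its first entry is the ordinary host PID.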
A container process can see that it is PID 1. What responsibilities does PID 1 have in a PID namespace, and what goes wrong if the entrypoint does not handle them?
Strong Answer:
PID 1 in any namespace has two critical responsibilities inherited from Unix init: signal handling and zombie reaping. The kernel treats PID 1 specially: a signal whose disposition is still the default is not delivered to it, so SIGTERM and SIGINT are effectively ignored unless PID 1 explicitly registers a handler. This is why docker stop has a timeout (10 seconds by default): it sends SIGTERM, and if the container’s entrypoint does not handle it, Docker waits out the timeout and then sends SIGKILL.
Zombie reaping is the second issue. When a child process exits, it becomes a zombie until its parent calls wait(). In a normal system, init (PID 1) adopts orphaned processes and reaps them. If a container’s PID 1 is a simple application that does not call wait(), orphaned child processes accumulate as zombies, consuming PID table entries. This is especially common with shell scripts as entrypoints that spawn background processes.
The practical solutions are: use a minimal init such as tini (which Docker’s --init flag injects), or ensure your entrypoint is written to forward signals and reap children. Language runtimes do not do this for you. Whether the entrypoint is written in Go, Node.js, or Python, it needs explicit signal handlers, and it needs a wait() loop if it can inherit orphaned processes.
A war story: at scale, zombie accumulation inside containers can hit the PID limit set by cgroups (pids.max), causing the container to fail to spawn any new processes. The symptoms look like “cannot fork: resource temporarily unavailable” errors that are mystifying if you do not know to check for zombies with ps aux | grep Z.
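The same zombie check can be done programmatically by reading the state field of /proc/<pid>/stat. The sketch below assumes a Linux host with /proc mounted; the class ZombieScanner is hypothetical. The state character follows the last closing parenthesis, because the command name in parentheses may itself contain spaces or parentheses.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Lists the PIDs of zombie processes visible in the current PID namespace. */
final class ZombieScanner {

    /** Extracts the state character from a stat line such as "12 (sh) Z 1 ...". */
    static char stateOf(String statLine) {
        int close = statLine.lastIndexOf(')');
        return statLine.charAt(close + 2);
    }

    static List<String> findZombies() throws IOException {
        List<String> zombies = new ArrayList<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(Path.of("/proc"), "[0-9]*")) {
            for (Path dir : dirs) {
                try {
                    // ISO-8859-1 accepts any byte, so odd command names cannot fail decoding.
                    String stat = Files.readString(dir.resolve("stat"), StandardCharsets.ISO_8859_1);
                    if (stateOf(stat) == 'Z') {
                        zombies.add(dir.getFileName().toString());
                    }
                } catch (IOException e) {
                    // The process exited between listing and reading; skip it.
                }
            }
        }
        return zombies;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Zombie PIDs: " + findZombies());
    }
}
```

Comparing the size of this list against the cgroup's pids.max gives an early warning before fork() starts failing.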
Follow-up: How does Kubernetes handle this problem?
Kubernetes enables process namespace sharing between containers in a pod via shareProcessNamespace: true. When enabled, the pause container (the pod’s infrastructure container) becomes PID 1 and handles zombie reaping for all containers in the pod. Without this setting, each container has its own PID namespace and must handle its own signal forwarding and reaping. This is one reason the pause container exists: it is a minimal process that correctly implements init behavior, acting as the stable anchor for the pod’s shared namespaces.
How does the user namespace enable rootless containers, and what are the security trade-offs?
Strong Answer:
The user namespace maps UIDs and GIDs inside the namespace to different UIDs outside. A process can be UID 0 (root) inside the container but map to UID 100000 (unprivileged) on the host. This means the container process has full root capabilities within its namespace but if it escapes the container, it lands as an unprivileged user on the host.
The mapping is configured by writing to /proc/<pid>/uid_map and /proc/<pid>/gid_map. A typical mapping like 0 100000 65536 means container UIDs 0-65535 map to host UIDs 100000-165535. This requires either root on the host or entries in /etc/subuid and /etc/subgid that grant ranges to unprivileged users.
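The arithmetic behind a mapping line can be sketched in a few lines. The class IdMapping below is hypothetical and only models a single map entry; a real uid_map file may contain several lines, each handled the same way.

```java
import java.util.OptionalLong;

/** Models one line of a uid_map or gid_map file: inside-start, outside-start, count. */
final class IdMapping {

    final long insideStart;
    final long outsideStart;
    final long count;

    IdMapping(long insideStart, long outsideStart, long count) {
        this.insideStart = insideStart;
        this.outsideStart = outsideStart;
        this.count = count;
    }

    /** Parses a line such as "0 100000 65536". */
    static IdMapping parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected three fields: " + line);
        }
        return new IdMapping(Long.parseLong(fields[0]),
                Long.parseLong(fields[1]), Long.parseLong(fields[2]));
    }

    /** Translates an ID inside the namespace to the host ID, if it is mapped. */
    OptionalLong toHost(long insideId) {
        if (insideId < insideStart || insideId >= insideStart + count) {
            return OptionalLong.empty(); // unmapped IDs appear as the overflow ID
        }
        return OptionalLong.of(outsideStart + (insideId - insideStart));
    }

    public static void main(String[] args) {
        IdMapping mapping = parse("0 100000 65536");
        System.out.println("container 0     -> host " + mapping.toHost(0));
        System.out.println("container 65535 -> host " + mapping.toHost(65535));
        System.out.println("container 65536 -> host " + mapping.toHost(65536));
    }
}
```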
The trade-off is complexity and compatibility. Some operations inside rootless containers behave differently. For example, mknod for device files is refused because creating device nodes requires CAP_MKNOD in the initial (host) user namespace, which a mapped root does not hold. Network namespace setup requires workarounds (like slirp4netns instead of veth pairs) because attaching an interface to the host’s network namespace needs real CAP_NET_ADMIN on the host.
Despite these trade-offs, rootless containers are a significant security improvement and are the default in Podman. For production environments where the threat model includes container escape, running rootless eliminates the most dangerous scenario: an attacker gaining root on the host.
Follow-up: If user namespaces are so beneficial, why did Docker not enable them by default from the start?
Primarily because of the compatibility burden. When user namespaces are enabled, every file in the container image is accessed as the mapped (unprivileged) host UID. This breaks volume mounts where the host directory is owned by a different user, breaks images that expect real root capabilities (like installing packages), and introduces subtle permission errors with shared storage. Docker chose operational simplicity over security by default. Podman made the opposite bet, choosing rootless by default and absorbing the compatibility pain. The industry has gradually shifted toward rootless as the ecosystem adapted, but the transition took years.