Chapter 1: Linux Namespaces

Containers aren’t magic: they’re built on Linux kernel primitives that date back to 2002, when mount namespaces were added in Linux 2.4.19. The first and most fundamental of these primitives is the namespace. In this chapter, we’ll start building our own container runtime in Java, beginning with namespace isolation.

Think of namespaces like one-way mirrors in an interrogation room. The person inside the room (the container) sees only what’s in their room and has no idea other rooms exist. The person outside (the host) can see into every room. Namespaces give each container its own private view of system resources (its own process table, its own network stack, its own hostname) while the host kernel manages all of them simultaneously.

The key realization is that containers are not virtual machines. There is no hypervisor and no guest kernel. Containers are regular Linux processes that have been given a restricted view of the world.
Prerequisites: Linux Internals: Processes
Further Reading: Operating Systems: Process Management
Time: 3-4 hours
Outcome: Understanding of namespace isolation

What Are Namespaces?

┌─────────────────────────────────────────────────────────────────────────────┐
│                         WITHOUT NAMESPACES                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   HOST SYSTEM                                                               │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Process A (PID 1234)    Process B (PID 1235)                      │  │
│   │   ┌─────────────────┐     ┌─────────────────┐                       │  │
│   │   │ Can see all     │     │ Can see all     │                       │  │
│   │   │ processes       │     │ processes       │                       │  │
│   │   │ Same network    │     │ Same network    │                       │  │
│   │   │ Same filesystem │     │ Same filesystem │                       │  │
│   │   │ Same users      │     │ Same users      │                       │  │
│   │   └─────────────────┘     └─────────────────┘                       │  │
│   │                                                                      │  │
│   │   Both processes share the same view of the system                  │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                          WITH NAMESPACES                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   HOST SYSTEM                                                               │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A                    Container B                        │  │
│   │   ┌───────────────────┐          ┌───────────────────┐             │  │
│   │   │ PID Namespace     │          │ PID Namespace     │             │  │
│   │   │ Sees only own PIDs│          │ Sees only own PIDs│             │  │
│   │   │ PID 1 = container │          │ PID 1 = container │             │  │
│   │   │ init process      │          │ init process      │             │  │
│   │   ├───────────────────┤          ├───────────────────┤             │  │
│   │   │ NET Namespace     │          │ NET Namespace     │             │  │
│   │   │ Own eth0, ports   │          │ Own eth0, ports   │             │  │
│   │   ├───────────────────┤          ├───────────────────┤             │  │
│   │   │ MNT Namespace     │          │ MNT Namespace     │             │  │
│   │   │ Own root fs       │          │ Own root fs       │             │  │
│   │   └───────────────────┘          └───────────────────┘             │  │
│   │                                                                      │  │
│   │   Each container has ISOLATED view of system resources              │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Linux Namespace Types

Namespace   Flag              Isolates
PID         CLONE_NEWPID      Process IDs - container sees own PID 1
NET         CLONE_NEWNET      Network stack - own interfaces, IPs, ports
MNT         CLONE_NEWNS       Mount points - own filesystem view
UTS         CLONE_NEWUTS      Hostname and domain name
IPC         CLONE_NEWIPC      Inter-process communication
USER        CLONE_NEWUSER     User and group IDs
CGROUP      CLONE_NEWCGROUP   Cgroup root directory
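
Because each flag is a single bit, several namespaces can be requested in one call by OR-ing the flags together. The sketch below only illustrates that bit arithmetic; it makes no system call. The class name `NamespaceFlags` and the `combine` helper are hypothetical, and the numeric values are the ones defined in `<linux/sched.h>` (the same values used in the LibC bindings in Part 2).

```java
import java.util.Map;

// Illustrative only: shows how CLONE_NEW* flags combine into one bitmask.
final class NamespaceFlags {
    // Values as defined in <linux/sched.h>
    static final int CLONE_NEWNS     = 0x00020000;
    static final int CLONE_NEWCGROUP = 0x02000000;
    static final int CLONE_NEWUTS    = 0x04000000;
    static final int CLONE_NEWIPC    = 0x08000000;
    static final int CLONE_NEWUSER   = 0x10000000;
    static final int CLONE_NEWPID    = 0x20000000;
    static final int CLONE_NEWNET    = 0x40000000;

    // Short names from the table above, mapped to their flags
    private static final Map<String, Integer> FLAGS = Map.of(
        "PID", CLONE_NEWPID, "NET", CLONE_NEWNET, "MNT", CLONE_NEWNS,
        "UTS", CLONE_NEWUTS, "IPC", CLONE_NEWIPC, "USER", CLONE_NEWUSER,
        "CGROUP", CLONE_NEWCGROUP);

    /** ORs together the flags for the named namespaces. */
    static int combine(String... names) {
        int flags = 0;
        for (String name : names) {
            Integer flag = FLAGS.get(name);
            if (flag == null) {
                throw new IllegalArgumentException("Unknown namespace: " + name);
            }
            flags |= flag;
        }
        return flags;
    }

    public static void main(String[] args) {
        int flags = combine("PID", "MNT", "UTS", "IPC");
        System.out.println("0x" + Integer.toHexString(flags)); // prints 0x2c020000
    }
}
```

The resulting integer is what would be passed as the `flags` argument of `unshare()` or `clone()`.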

Part 1: Project Setup

We’ll use Java with JNA (Java Native Access) to call Linux system calls.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    
    <groupId>com.minidocker</groupId>
    <artifactId>minidocker</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    
    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
    </properties>
    
    <dependencies>
        <!-- JNA for native Linux calls -->
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>5.13.0</version>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna-platform</artifactId>
            <version>5.13.0</version>
        </dependency>
        
        <!-- CLI parsing -->
        <dependency>
            <groupId>info.picocli</groupId>
            <artifactId>picocli</artifactId>
            <version>4.7.4</version>
        </dependency>
    </dependencies>
</project>

Part 2: Linux System Call Bindings

First, we need to call Linux system calls from Java:
src/main/java/com/minidocker/linux/LibC.java
package com.minidocker.linux;

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;

/**
 * JNA bindings to Linux libc functions.
 * 
 * These are the low-level system calls that Docker uses internally.
 */
public interface LibC extends Library {
    
    LibC INSTANCE = Native.load("c", LibC.class);
    
    // Namespace flags
    int CLONE_NEWNS     = 0x00020000;  // Mount namespace
    int CLONE_NEWUTS    = 0x04000000;  // UTS namespace (hostname)
    int CLONE_NEWIPC    = 0x08000000;  // IPC namespace
    int CLONE_NEWUSER   = 0x10000000;  // User namespace
    int CLONE_NEWPID    = 0x20000000;  // PID namespace
    int CLONE_NEWNET    = 0x40000000;  // Network namespace
    int CLONE_NEWCGROUP = 0x02000000;  // Cgroup namespace
    
    // Mount flags
    int MS_BIND         = 4096;
    int MS_REC          = 16384;
    int MS_PRIVATE      = 1 << 18;
    int MS_NOSUID       = 2;
    int MS_NOEXEC       = 8;
    int MS_NODEV        = 4;
    
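    /**
     * Create new namespaces for the calling process.
     * 
     * The caller moves into the new namespaces immediately, except for
     * CLONE_NEWPID: the caller keeps its PID namespace and only children
     * created afterwards are placed in the new one.
     * 
     * @param flags Combination of CLONE_NEW* flags
     * @return 0 on success, -1 on error
     */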
    int unshare(int flags);
    
    /**
     * Change the root filesystem.
     * 
     * @param path New root directory
     * @return 0 on success, -1 on error
     */
    int chroot(String path);
    
    /**
     * Change working directory.
     */
    int chdir(String path);
    
    /**
     * Mount a filesystem.
     * 
     * @param source Source device/path
     * @param target Mount point
     * @param filesystemtype Type (e.g., "proc", "sysfs")
     * @param mountflags Mount flags
     * @param data Additional data
     */
    int mount(String source, String target, String filesystemtype,
              long mountflags, Pointer data);
    
    /**
     * Unmount a filesystem.
     */
    int umount2(String target, int flags);
    
    /**
     * Set hostname.
     */
    int sethostname(String name, int len);
    
    /**
     * Get process ID.
     */
    int getpid();
    
    /**
     * Get parent process ID.
     */
    int getppid();
    
    /**
     * Get user ID.
     */
    int getuid();
    
    /**
     * Get group ID.
     */
    int getgid();
    
    /**
     * Execute a program.
     */
    int execv(String path, String[] argv);
    
    /**
     * Fork the process.
     */
    int fork();
    
    /**
     * Wait for child process.
     */
    int waitpid(int pid, int[] status, int options);
    
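    /**
     * Set the process's user ID / group ID. These do not write the user
     * namespace ID mappings, which go through /proc/<pid>/uid_map and gid_map.
     */
    int setuid(int uid);
    int setgid(int gid);
    
    /**
     * Pivot root - atomically swap root filesystems.
     * 
     * Caveat: glibc does not export a pivot_root() wrapper, so on glibc
     * systems JNA will typically throw UnsatisfiedLinkError when this method
     * is first called. There, invoke it through syscall(2) instead
     * (SYS_pivot_root is 155 on x86_64).
     */
    int pivot_root(String new_root, String put_old);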
}

Part 3: Namespace Manager

src/main/java/com/minidocker/namespace/NamespaceManager.java
package com.minidocker.namespace;

import com.minidocker.linux.LibC;
import com.sun.jna.Native;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Manages Linux namespace creation and configuration.
 * 
 * Namespaces provide isolation for various system resources:
 * - PID: Process sees its own process tree, with PID 1
 * - NET: Separate network stack (interfaces, routing, firewall)
 * - MNT: Separate mount points
 * - UTS: Separate hostname
 * - IPC: Separate System V IPC and POSIX message queues
 * - USER: Separate user/group ID mappings
 */
public class NamespaceManager {
    
    private final LibC libc = LibC.INSTANCE;
    
    /**
     * Creates all namespaces needed for container isolation.
     * 
     * @param options Configuration options
     * @throws NamespaceException if namespace creation fails
     */
    public void createNamespaces(NamespaceOptions options) throws NamespaceException {
        int flags = 0;
        
        if (options.newPidNamespace()) {
            flags |= LibC.CLONE_NEWPID;
        }
        if (options.newNetNamespace()) {
            flags |= LibC.CLONE_NEWNET;
        }
        if (options.newMountNamespace()) {
            flags |= LibC.CLONE_NEWNS;
        }
        if (options.newUtsNamespace()) {
            flags |= LibC.CLONE_NEWUTS;
        }
        if (options.newIpcNamespace()) {
            flags |= LibC.CLONE_NEWIPC;
        }
        if (options.newUserNamespace()) {
            flags |= LibC.CLONE_NEWUSER;
        }
        
        System.out.println("Creating namespaces with flags: 0x" + Integer.toHexString(flags));
        
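        // unshare() creates new namespaces and moves the calling process into them.
        // Exception: with CLONE_NEWPID the caller stays in its current PID namespace;
        // only children forked afterwards land in the new one (the first gets PID 1).
        // All flags except CLONE_NEWUSER need CAP_SYS_ADMIN (typically root).
        // CLONE_NEWUSER instead requires a single-threaded caller (see unshare(2)),
        // which a running JVM is not.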
        int result = libc.unshare(flags);
        if (result != 0) {
            throw new NamespaceException("Failed to create namespaces: " + 
                Native.getLastError());
        }
        
        System.out.println("✓ Namespaces created successfully");
    }
    
    /**
     * Sets up user namespace mappings.
     * 
     * Maps container user 0 (root) to the given host user. Capture hostUid and
     * hostGid with getuid()/getgid() BEFORE calling unshare(CLONE_NEWUSER):
     * inside a user namespace with no mapping yet, those calls return the
     * overflow ID (65534 by default), and writing that value into the map
     * is rejected for an unprivileged process.
     */
    public void setupUserNamespace(int hostUid, int hostGid) throws IOException {
        int pid = libc.getpid();
        
        // Write UID mapping: container_uid host_uid count
        // Maps container root (0) to the host user that created the namespace
        Path uidMapPath = Path.of("/proc/" + pid + "/uid_map");
        Files.writeString(uidMapPath, "0 " + hostUid + " 1\n");
        
        // Disable setgroups (required before an unprivileged write to gid_map)
        Path setgroupsPath = Path.of("/proc/" + pid + "/setgroups");
        Files.writeString(setgroupsPath, "deny\n");
        
        // Write GID mapping
        Path gidMapPath = Path.of("/proc/" + pid + "/gid_map");
        Files.writeString(gidMapPath, "0 " + hostGid + " 1\n");
        
        System.out.println("✓ User namespace mapped: container root -> host uid " + hostUid);
    }
    
    /**
     * Sets the hostname within the UTS namespace.
     */
    public void setHostname(String hostname) throws NamespaceException {
        int result = libc.sethostname(hostname, hostname.length());
        if (result != 0) {
            throw new NamespaceException("Failed to set hostname: " + 
                Native.getLastError());
        }
        System.out.println("✓ Hostname set to: " + hostname);
    }
    
    /**
     * Demonstrates PID namespace isolation.
     */
    public void showPidNamespaceInfo() {
        int pid = libc.getpid();
        int ppid = libc.getppid();
        
        System.out.println("Inside container:");
        System.out.println("  PID:  " + pid);
        System.out.println("  PPID: " + ppid);
        
        // In a PID namespace, the first process gets PID 1
        if (pid == 1) {
            System.out.println("  → We are PID 1 (init process) in this namespace!");
        }
    }
}
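
The code above throws `NamespaceException`, but the chapter never shows its definition. A minimal sketch follows, assuming it is a checked exception living in `com.minidocker.namespace`; the extra `errno` field and `getErrno()` accessor are additions made for illustration and are not part of the original code.

```java
// Hypothetical minimal definition. In the project this would be declared
// public, in its own file, under: package com.minidocker.namespace;
class NamespaceException extends Exception {

    // errno reported by the failed call, or 0 if not known (illustrative extra)
    private final int errno;

    NamespaceException(String message) {
        this(message, 0);
    }

    NamespaceException(String message, int errno) {
        super(message);
        this.errno = errno;
    }

    int getErrno() {
        return errno;
    }
}
```

The single-argument constructor is the only one the chapter's code needs; the two-argument form lets a caller keep the numeric error code separate from the message.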

Part 4: Namespace Options

src/main/java/com/minidocker/namespace/NamespaceOptions.java
package com.minidocker.namespace;

/**
 * Configuration for which namespaces to create.
 */
public record NamespaceOptions(
    boolean newPidNamespace,
    boolean newNetNamespace,
    boolean newMountNamespace,
    boolean newUtsNamespace,
    boolean newIpcNamespace,
    boolean newUserNamespace
) {
    /**
     * Creates options with all namespaces enabled (typical container).
     */
    public static NamespaceOptions all() {
        return new NamespaceOptions(true, true, true, true, true, true);
    }
    
    /**
     * Creates options with no namespaces (for testing).
     */
    public static NamespaceOptions none() {
        return new NamespaceOptions(false, false, false, false, false, false);
    }
    
    /**
     * Builder for custom namespace configurations.
     */
    public static Builder builder() {
        return new Builder();
    }
    
    public static class Builder {
        private boolean pid = false;
        private boolean net = false;
        private boolean mnt = false;
        private boolean uts = false;
        private boolean ipc = false;
        private boolean user = false;
        
        public Builder withPid() { this.pid = true; return this; }
        public Builder withNet() { this.net = true; return this; }
        public Builder withMount() { this.mnt = true; return this; }
        public Builder withUts() { this.uts = true; return this; }
        public Builder withIpc() { this.ipc = true; return this; }
        public Builder withUser() { this.user = true; return this; }
        
        public NamespaceOptions build() {
            return new NamespaceOptions(pid, net, mnt, uts, ipc, user);
        }
    }
}

Part 5: Understanding Each Namespace

PID Namespace

┌─────────────────────────────────────────────────────────────────────────────┐
│                          PID NAMESPACE                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   HOST VIEW:                         CONTAINER VIEW:                        │
│                                                                              │
│   PID  COMMAND                       PID  COMMAND                           │
│   1    systemd                       1    /bin/sh  ← Thinks it's PID 1!    │
│   100  sshd                          2    nginx                             │
│   200  dockerd                       3    worker                            │
│   300  /bin/sh (container init)                                             │
│   301  nginx                                                                 │
│   302  worker                                                                │
│                                                                              │
│   Host PID 300 = Container PID 1                                            │
│   The container cannot see host processes!                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
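
One way to observe the "Host PID 300 = Container PID 1" relationship from the host is the `NSpid` line in `/proc/<pid>/status` (available since Linux 4.1), which lists a process's PID in each nested PID namespace, outermost first. The sketch below parses that line; the class name `NsPid` is hypothetical, and on systems without `/proc` or without the field, `read` simply returns an empty array.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Illustrative helper: reads the per-namespace PIDs of a process.
final class NsPid {

    /** Parses a line such as "NSpid:\t300\t1" into [300, 1]. */
    static int[] parse(String statusLine) {
        if (!statusLine.startsWith("NSpid:")) {
            return new int[0];
        }
        String rest = statusLine.substring("NSpid:".length()).trim();
        if (rest.isEmpty()) {
            return new int[0];
        }
        return Arrays.stream(rest.split("\\s+")).mapToInt(Integer::parseInt).toArray();
    }

    /** Reads NSpid for a PID; empty if the file or field is unavailable. */
    static int[] read(long pid) {
        Path status = Path.of("/proc", Long.toString(pid), "status");
        try {
            for (String line : Files.readAllLines(status)) {
                if (line.startsWith("NSpid:")) {
                    return parse(line);
                }
            }
        } catch (IOException e) {
            // Not Linux, or the process has already exited
        }
        return new int[0];
    }

    public static void main(String[] args) {
        // Sample matching the diagram: host PID 300, container PID 1
        System.out.println(Arrays.toString(parse("NSpid:\t300\t1")));
        // PIDs of this JVM in each namespace it belongs to
        System.out.println(Arrays.toString(read(ProcessHandle.current().pid())));
    }
}
```

A process that is not inside any nested PID namespace has a single entry; a containerized process seen from the host has one entry per nesting level.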

UTS Namespace

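public void demonstrateUtsNamespace() throws Exception {
    // getHostname() is assumed to be a small helper, e.g. one that wraps
    // gethostname(2) or reads /proc/sys/kernel/hostname
    // Before: we have host's hostname
    System.out.println("Host hostname: " + getHostname());
    
    // Create UTS namespace (needs CAP_SYS_ADMIN, so typically root)
    if (libc.unshare(LibC.CLONE_NEWUTS) != 0) {
        throw new IllegalStateException("unshare(CLONE_NEWUTS) failed");
    }
    
    // Now we can change hostname without affecting host
    if (libc.sethostname("mycontainer", 11) != 0) {
        throw new IllegalStateException("sethostname failed");
    }
    
    System.out.println("Container hostname: " + getHostname());
    // Output: "mycontainer" - host is unaffected!
}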
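
The demo calls `getHostname()` without defining it. One possible stand-in, sketched here under the assumption that reading `/proc/sys/kernel/hostname` is acceptable (it reflects the UTS namespace of the reading process), is shown below. The class name `HostnameReader` and the fallback to the `HOSTNAME` environment variable are illustrative choices, not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative helper: reads the hostname visible to the current process.
final class HostnameReader {

    static String getHostname() {
        Path procHostname = Path.of("/proc/sys/kernel/hostname");
        try {
            // The kernel returns the hostname followed by a newline
            return Files.readString(procHostname).trim();
        } catch (IOException e) {
            // No procfs available: fall back to the environment, then a placeholder
            String env = System.getenv("HOSTNAME");
            return (env != null && !env.isBlank()) ? env : "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("Hostname: " + getHostname());
    }
}
```

A binding to `gethostname(2)` through JNA would work equally well; the procfs read just avoids adding another native declaration.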

Mount Namespace

public void demonstrateMountNamespace() throws Exception {
    // Create mount namespace
    libc.unshare(LibC.CLONE_NEWNS);
    
    // Make all mounts private (changes don't propagate to host)
    libc.mount(null, "/", null, LibC.MS_REC | LibC.MS_PRIVATE, null);
    
    // Now we can mount things that only this container sees
    libc.mount("tmpfs", "/tmp", "tmpfs", 0, null);
    
    // This /tmp is completely separate from host's /tmp
}
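
Whether the `MS_PRIVATE` remount took effect can be checked in `/proc/self/mountinfo`: each line carries optional propagation tags such as `shared:N` or `master:N` before a `-` separator, and a mount with no such tag is private. The sketch below parses one mountinfo line; the class name `MountPropagation` and the sample lines are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative parser for the optional fields of a /proc/<pid>/mountinfo line.
final class MountPropagation {

    /** Returns the propagation tags (e.g. "shared:1") found on one line. */
    static List<String> propagationTags(String mountinfoLine) {
        String[] fields = mountinfoLine.trim().split(" ");
        List<String> tags = new ArrayList<>();
        // Fields 0-5 are fixed (IDs, device, root, mount point, options);
        // optional fields follow until the "-" separator.
        for (int i = 6; i < fields.length && !fields[i].equals("-"); i++) {
            tags.add(fields[i]);
        }
        return tags;
    }

    /** A mount with no propagation tags is private. */
    static boolean isPrivate(String mountinfoLine) {
        return propagationTags(mountinfoLine).isEmpty();
    }

    public static void main(String[] args) {
        String shared  = "25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw";
        String priv    = "25 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw";
        System.out.println(propagationTags(shared)); // [shared:1]
        System.out.println(isPrivate(priv));         // true
    }
}
```

On many distributions the root mount is shared by default, which is why the remount to private matters before mounting anything container-specific.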

Part 6: Container Runner

src/main/java/com/minidocker/Container.java
package com.minidocker;

import com.minidocker.linux.LibC;
import com.minidocker.namespace.NamespaceManager;
import com.minidocker.namespace.NamespaceOptions;

/**
 * Main container class that orchestrates isolation.
 */
public class Container {
    
    private final LibC libc = LibC.INSTANCE;
    private final NamespaceManager namespaces = new NamespaceManager();
    
    private final String hostname;
    private final String[] command;
    
    public Container(String hostname, String[] command) {
        this.hostname = hostname;
        this.command = command;
    }
    
    /**
     * Runs the container.
     * 
     * This is a simplified version - real Docker uses fork() and 
     * clone() for proper isolation. We use unshare() for simplicity.
     */
    public void run() throws Exception {
        System.out.println("=== Starting Container ===");
        System.out.println("Hostname: " + hostname);
        System.out.println("Command: " + String.join(" ", command));
        System.out.println();
        
        // Step 1: Create namespaces
        NamespaceOptions options = NamespaceOptions.builder()
            .withPid()
            .withMount()
            .withUts()
            .withIpc()
            .build();
        
        namespaces.createNamespaces(options);
        
        // Step 2: Set hostname
        namespaces.setHostname(hostname);
        
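        // Step 3: Show namespace info
        // Note: this still reports the host PID. unshare(CLONE_NEWPID) does not
        // move the caller; only children forked afterwards enter the new namespace.
        namespaces.showPidNamespaceInfo();
        
        // Step 4: Fork so the child becomes PID 1 in the new namespace.
        // Caveat: fork() inside a multi-threaded JVM keeps only the calling thread
        // in the child (no GC or JIT threads), so the child should exec promptly.
        int pid = libc.fork();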
        
        if (pid == 0) {
            // Child process - this is PID 1 in new PID namespace
            runContainerInit();
        } else if (pid > 0) {
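            // Parent - wait for child
            int[] status = new int[1];
            libc.waitpid(pid, status, 0);
            // waitpid() fills in a packed status word; for a normal exit the
            // exit code sits in bits 8-15 (what C's WEXITSTATUS extracts)
            System.out.println("Container exited with status: " + ((status[0] >> 8) & 0xff));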
        } else {
            throw new RuntimeException("Fork failed");
        }
    }
    
    private void runContainerInit() {
        try {
            System.out.println("\n=== Container Init (PID " + libc.getpid() + ") ===");
            
            // Execute the command
            if (command.length > 0) {
                String[] argv = new String[command.length + 1];
                System.arraycopy(command, 0, argv, 0, command.length);
                argv[command.length] = null;  // null-terminated
                
                libc.execv(command[0], argv);
                // If we get here, exec failed
                System.err.println("Failed to execute: " + command[0]);
                System.exit(1);
            }
        } catch (Exception e) {
            System.err.println("Container init failed: " + e.getMessage());
            System.exit(1);
        }
    }
    
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java Container <hostname> <command...>");
            System.out.println("Example: java Container mycontainer /bin/sh");
            System.exit(1);
        }
        
        String hostname = args[0];
        String[] command = new String[args.length - 1];
        System.arraycopy(args, 1, command, 0, command.length);
        
        try {
            Container container = new Container(hostname, command);
            container.run();
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
            System.exit(1);
        }
    }
}
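
The value that `waitpid()` stores in `status` is a packed word, not a plain exit code. The sketch below mirrors the C macros `WIFEXITED`, `WEXITSTATUS`, `WIFSIGNALED`, and `WTERMSIG` as they are laid out on Linux; the class name `WaitStatus` and its method names are hypothetical.

```java
// Illustrative decoder for the status word filled in by waitpid() on Linux.
final class WaitStatus {

    /** True if the child terminated normally (WIFEXITED). */
    static boolean exited(int status) {
        return (status & 0x7f) == 0;
    }

    /** Exit code of a normally terminated child (WEXITSTATUS). */
    static int exitCode(int status) {
        return (status >> 8) & 0xff;
    }

    /** True if the child was killed by a signal (WIFSIGNALED). */
    static boolean signaled(int status) {
        int sig = status & 0x7f;
        return sig != 0 && sig != 0x7f;
    }

    /** Number of the signal that killed the child (WTERMSIG). */
    static int termSignal(int status) {
        return status & 0x7f;
    }

    static String describe(int status) {
        if (exited(status)) {
            return "exited with code " + exitCode(status);
        }
        if (signaled(status)) {
            return "killed by signal " + termSignal(status);
        }
        return "stopped or continued (raw status " + status + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(256)); // exited with code 1
        System.out.println(describe(9));   // killed by signal 9
    }
}
```

In `Container.run()`, passing `status[0]` to a helper like `describe` would distinguish a container that exited on its own from one that was killed.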

Exercises

Extend the namespace manager to create network namespaces:
// 1. Create network namespace
libc.unshare(LibC.CLONE_NEWNET);

// 2. Bring up loopback interface
// Use: ip link set lo up
// (Requires additional native calls or ProcessBuilder)

// 3. Verify isolation
// The container should have its own network stack

Allow joining an existing container’s namespaces:
// Use setns() syscall to join existing namespace
// int setns(int fd, int nstype);
// fd = open("/proc/<pid>/ns/<type>")

// This is how "docker exec" works!

Implement user namespace with UID/GID mapping:
// 1. Create user namespace FIRST (before other namespaces)
// 2. Write to /proc/self/uid_map
// 3. Write "deny" to /proc/self/setgroups
// 4. Write to /proc/self/gid_map

// This enables unprivileged containers!

Key Takeaways

Isolation Not Virtualization

Namespaces isolate views of resources, not the resources themselves

Kernel Primitives

unshare(), clone(), setns() are the syscalls that power containers

Layered Isolation

Each namespace type isolates a different resource

Negligible Overhead

Namespaces add negligible overhead - just kernel data structures

Further Reading

Linux Namespaces Manual

Official documentation for Linux namespaces

Linux Internals Course

Deep dive into Linux process management

What’s Next?

In Chapter 2: Control Groups (cgroups), we’ll implement:
  • CPU limits
  • Memory limits
  • Process count limits
  • Resource accounting

Interview Deep-Dive

Strong Answer:
  • clone() creates a new child process that starts in the new namespace(s). It is analogous to fork() but accepts flags that specify which namespaces to create. The parent and child are in different namespaces from the moment the child starts executing.
  • unshare() moves the calling process itself into new namespace(s). There is no new process created. This is simpler when you want to isolate the current process rather than spawn a child.
  • Real container runtimes like runc use clone() because they need the container’s init process (PID 1 in the new PID namespace) to be a child process that the runtime can monitor and wait on. If you used unshare(CLONE_NEWPID), the calling process does not get PID 1 in the new namespace — only its next child does. This is a subtle but critical distinction that trips up many implementations.
  • There is also setns(), which joins an existing namespace by opening /proc/<pid>/ns/<type> and passing the file descriptor. This is how docker exec works — it calls setns() to enter the target container’s namespaces before executing the new command.
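
The `/proc/<pid>/ns/<type>` entries that `setns()` consumes can be inspected without any native calls. The sketch below lists them for a given PID and shows the namespace identifier each symlink points to (for example `uts:[4026531838]`); two processes share a namespace when these identifiers match. It does not call `setns()` itself, which would need an extra JNA binding. The class name `NamespaceHandles` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Illustrative helper: lists the namespace handles of a process.
final class NamespaceHandles {

    /** Maps namespace type (e.g. "pid", "net") to its identifier string. */
    static Map<String, String> list(String pid) {
        Map<String, String> result = new TreeMap<>();
        Path nsDir = Path.of("/proc", pid, "ns");
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(nsDir)) {
            for (Path entry : entries) {
                try {
                    String target = Files.readSymbolicLink(entry).toString();
                    result.put(entry.getFileName().toString(), target);
                } catch (IOException | UnsupportedOperationException e) {
                    // Entry not readable by this user; skip it
                }
            }
        } catch (IOException e) {
            // No such process, or no procfs: return an empty map
        }
        return result;
    }

    public static void main(String[] args) {
        // "self" refers to the current process; a numeric PID works as well
        list("self").forEach((type, id) -> System.out.println(type + " -> " + id));
    }
}
```

A `docker exec`-style tool would open each of these paths for the container's init process and pass the resulting file descriptors to `setns()` before exec-ing the requested command.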
Follow-up: Why does the PID namespace behave differently from other namespaces with unshare()?
Because PID namespace membership is determined at process creation time, not at runtime. When you call unshare(CLONE_NEWPID), the calling process remains in its original PID namespace; it is the next child it forks that gets PID 1 in the new namespace. This is by kernel design: a process cannot change its own PID. Other namespaces like UTS or NET can take effect immediately because they do not involve identity. The hostname or network stack can change under a running process without ambiguity, but changing a process’s PID mid-execution would break everything that references it.
Strong Answer:
  • PID 1 in any namespace has two critical responsibilities inherited from Unix init: signal handling and zombie reaping. The kernel treats PID 1 specially: a signal is delivered to it only if it has installed a handler for that signal, so SIGTERM and SIGINT are silently dropped unless PID 1 explicitly registers handlers. This is why docker stop can appear to hang for 10 seconds: it sends SIGTERM, and if the container’s entrypoint does not handle it, Docker waits out the default 10-second timeout and then sends SIGKILL.
  • Zombie reaping is the second issue. When a child process exits, it becomes a zombie until its parent calls wait(). In a normal system, init (PID 1) adopts orphaned processes and reaps them. If a container’s PID 1 is a simple application that does not call wait(), orphaned child processes accumulate as zombies, consuming PID table entries. This is especially common with shell scripts as entrypoints that spawn background processes.
  • The practical solutions are: use a proper init system like tini (Docker’s --init flag), or ensure your entrypoint is written to forward signals and reap children. In Go, the runtime installs its own signal handlers, so SIGTERM still terminates the program by default, although reaping adopted orphans still takes explicit wait calls. In Node.js or Python, you need explicit signal handlers as well.
  • A war story: at scale, zombie accumulation inside containers can hit the PID limit set by cgroups (pids.max), causing the container to fail to spawn any new processes. The symptoms look like “cannot fork: resource temporarily unavailable” errors that are mystifying if you do not know to check for zombies with ps aux | grep Z.
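
The `ps aux | grep Z` check can also be done programmatically by reading the state field of `/proc/<pid>/stat`, where `Z` marks a zombie. The sketch below is one way to do that; the class name `ZombieScan` is hypothetical, and on systems without `/proc` the scan simply returns an empty list.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative scanner: finds zombie processes visible in /proc.
final class ZombieScan {

    /** Extracts the state letter from a /proc/<pid>/stat line. */
    static char state(String statLine) {
        // The command name is wrapped in parentheses and may itself contain
        // spaces or ')', so locate the LAST ')' and read the field after it.
        int close = statLine.lastIndexOf(')');
        if (close < 0 || close + 2 >= statLine.length()) {
            throw new IllegalArgumentException("Malformed stat line");
        }
        return statLine.charAt(close + 2);
    }

    /** Returns the PIDs of all processes currently in state Z. */
    static List<Long> zombies() {
        List<Long> result = new ArrayList<>();
        try (DirectoryStream<Path> procs =
                 Files.newDirectoryStream(Path.of("/proc"), "[0-9]*")) {
            for (Path proc : procs) {
                try {
                    String stat = Files.readString(proc.resolve("stat"));
                    if (state(stat) == 'Z') {
                        result.add(Long.parseLong(proc.getFileName().toString()));
                    }
                } catch (IOException | RuntimeException e) {
                    // Process exited mid-scan or entry is unreadable; skip it
                }
            }
        } catch (IOException e) {
            // No procfs available
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("Zombie PIDs: " + zombies());
    }
}
```

Run inside a container, a count that keeps growing is a strong hint that PID 1 is not reaping its children.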
Follow-up: How does Kubernetes handle this problem?
Kubernetes enables process namespace sharing between containers in a pod via shareProcessNamespace: true. When enabled, the pause container (the pod’s infrastructure container) becomes PID 1 and handles zombie reaping for all containers in the pod. Without this setting, each container has its own PID namespace and must handle its own signal forwarding and reaping. This is one reason the pause container exists: it is a minimal process that correctly implements init behavior, acting as the stable anchor for the pod’s shared namespaces.
Strong Answer:
  • The user namespace maps UIDs and GIDs inside the namespace to different UIDs outside. A process can be UID 0 (root) inside the container but map to UID 100000 (unprivileged) on the host. This means the container process has full root capabilities within its namespace but if it escapes the container, it lands as an unprivileged user on the host.
  • The mapping is configured by writing to /proc/<pid>/uid_map and /proc/<pid>/gid_map. A typical mapping like 0 100000 65536 means container UIDs 0-65535 map to host UIDs 100000-165535. This requires either root on the host or entries in /etc/subuid and /etc/subgid that grant ranges to unprivileged users.
  • The trade-off is complexity and compatibility. Some operations inside rootless containers behave differently — for example, mknod for device files is restricted because the kernel checks the host UID for device access. Network namespace setup requires workarounds (like slirp4netns instead of veth pairs) because creating network interfaces needs real CAP_NET_ADMIN on the host.
  • Despite these trade-offs, rootless containers are a significant security improvement and are the default in Podman. For production environments where the threat model includes container escape, running rootless eliminates the most dangerous scenario: an attacker gaining root on the host.
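
The mapping line format described above (`0 100000 65536`) can be made concrete with a little arithmetic. The sketch below parses one such line and translates a container ID to its host ID; the class name `IdMapRange` and its methods are hypothetical, and it handles a single range only, whereas a real `uid_map` may contain several lines.

```java
import java.util.OptionalLong;

// Illustrative model of one line of /proc/<pid>/uid_map or gid_map.
final class IdMapRange {
    final long insideStart;   // first ID inside the namespace
    final long outsideStart;  // first ID outside the namespace
    final long count;         // length of the mapped range

    IdMapRange(long insideStart, long outsideStart, long count) {
        this.insideStart = insideStart;
        this.outsideStart = outsideStart;
        this.count = count;
    }

    /** Parses a line of the form "inside outside count". */
    static IdMapRange parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields: " + line);
        }
        return new IdMapRange(Long.parseLong(parts[0]),
                              Long.parseLong(parts[1]),
                              Long.parseLong(parts[2]));
    }

    /** Translates an ID inside the namespace to the host ID, if mapped. */
    OptionalLong toHost(long insideId) {
        long offset = insideId - insideStart;
        if (offset < 0 || offset >= count) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(outsideStart + offset);
    }

    public static void main(String[] args) {
        IdMapRange range = IdMapRange.parse("0 100000 65536");
        System.out.println(range.toHost(0));      // OptionalLong[100000]
        System.out.println(range.toHost(65535));  // OptionalLong[165535]
        System.out.println(range.toHost(65536));  // OptionalLong.empty
    }
}
```

IDs that fall outside every mapped range have no host equivalent; the kernel shows them as the overflow ID (65534 by default).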
Follow-up: If user namespaces are so beneficial, why did Docker not enable them by default from the start?
Primarily because of the compatibility burden. When user namespaces are enabled, every file in the container image is accessed as the mapped (unprivileged) host UID. This breaks volume mounts where the host directory is owned by a different user, breaks images that expect real root capabilities (like installing packages), and introduces subtle permission errors with shared storage. Docker chose operational simplicity over security by default. Podman made the opposite bet, choosing rootless by default and absorbing the compatibility pain. The industry has gradually shifted toward rootless as the ecosystem adapted, but the transition took years.