
Chapter 1: Linux Namespaces

Containers aren’t magic - they’re built on Linux kernel primitives. The first and most fundamental of these is namespaces. In this chapter, we’ll build our own container runtime in Java, starting with namespace isolation.
Prerequisites: Linux Internals: Processes
Further Reading: Operating Systems: Process Management
Time: 3-4 hours
Outcome: Understanding of namespace isolation

What Are Namespaces?

┌─────────────────────────────────────────────────────────────────────────────┐
│                         WITHOUT NAMESPACES                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   HOST SYSTEM                                                               │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Process A (PID 1234)    Process B (PID 1235)                      │  │
│   │   ┌─────────────────┐     ┌─────────────────┐                       │  │
│   │   │ Can see all     │     │ Can see all     │                       │  │
│   │   │ processes       │     │ processes       │                       │  │
│   │   │ Same network    │     │ Same network    │                       │  │
│   │   │ Same filesystem │     │ Same filesystem │                       │  │
│   │   │ Same users      │     │ Same users      │                       │  │
│   │   └─────────────────┘     └─────────────────┘                       │  │
│   │                                                                      │  │
│   │   Both processes share the same view of the system                  │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                          WITH NAMESPACES                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   HOST SYSTEM                                                               │
│   ┌─────────────────────────────────────────────────────────────────────┐  │
│   │                                                                      │  │
│   │   Container A                    Container B                        │  │
│   │   ┌───────────────────┐          ┌───────────────────┐             │  │
│   │   │ PID Namespace     │          │ PID Namespace     │             │  │
│   │   │ Sees only own PIDs│          │ Sees only own PIDs│             │  │
│   │   │ PID 1 = container │          │ PID 1 = container │             │  │
│   │   │ init process      │          │ init process      │             │  │
│   │   ├───────────────────┤          ├───────────────────┤             │  │
│   │   │ NET Namespace     │          │ NET Namespace     │             │  │
│   │   │ Own eth0, ports   │          │ Own eth0, ports   │             │  │
│   │   ├───────────────────┤          ├───────────────────┤             │  │
│   │   │ MNT Namespace     │          │ MNT Namespace     │             │  │
│   │   │ Own root fs       │          │ Own root fs       │             │  │
│   │   └───────────────────┘          └───────────────────┘             │  │
│   │                                                                      │  │
│   │   Each container has ISOLATED view of system resources              │  │
│   │                                                                      │  │
│   └─────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Linux Namespace Types

Namespace   Flag               Isolates
PID         CLONE_NEWPID       Process IDs - container sees own PID 1
NET         CLONE_NEWNET       Network stack - own interfaces, IPs, ports
MNT         CLONE_NEWNS        Mount points - own filesystem view
UTS         CLONE_NEWUTS       Hostname and domain name
IPC         CLONE_NEWIPC       Inter-process communication
USER        CLONE_NEWUSER      User and group IDs
CGROUP      CLONE_NEWCGROUP    Cgroup root directory
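To see which namespaces a process currently belongs to, one option is to read the symlinks under /proc/self/ns. The sketch below does that with the standard library only. The class name NamespaceLister is hypothetical (it is not part of the chapter's project), and it assumes a Linux host with /proc mounted; elsewhere it simply returns an empty map.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the minidocker sources.
class NamespaceLister {

    /**
     * Reads the symlinks in an ns directory such as /proc/self/ns.
     * Each target looks like "pid:[4026531836]"; two processes are in the
     * same namespace when their targets for that type are equal.
     */
    static Map<String, String> list(Path nsDir) {
        Map<String, String> result = new TreeMap<>();
        if (!Files.isDirectory(nsDir)) {
            return result; // not Linux, or /proc is not mounted
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(nsDir)) {
            for (Path entry : entries) {
                if (Files.isSymbolicLink(entry)) {
                    result.put(entry.getFileName().toString(),
                               Files.readSymbolicLink(entry).toString());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        list(Path.of("/proc/self/ns"))
            .forEach((type, id) -> System.out.println(type + " -> " + id));
    }
}
```

Running it before and after a call to unshare() is a quick way to confirm that a namespace identifier actually changed.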

Part 1: Project Setup

We’ll use Java with JNA (Java Native Access) to call Linux system calls.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    
    <groupId>com.minidocker</groupId>
    <artifactId>minidocker</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    
    <properties>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
    </properties>
    
    <dependencies>
        <!-- JNA for native Linux calls -->
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>5.13.0</version>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna-platform</artifactId>
            <version>5.13.0</version>
        </dependency>
        
        <!-- CLI parsing -->
        <dependency>
            <groupId>info.picocli</groupId>
            <artifactId>picocli</artifactId>
            <version>4.7.4</version>
        </dependency>
    </dependencies>
</project>

Part 2: Linux System Call Bindings

First, we need to call Linux system calls from Java:
src/main/java/com/minidocker/linux/LibC.java
package com.minidocker.linux;

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;

/**
 * JNA bindings to Linux libc functions.
 * 
 * These are the low-level system calls that Docker uses internally.
 */
public interface LibC extends Library {
    
    LibC INSTANCE = Native.load("c", LibC.class);
    
    // Namespace flags
    int CLONE_NEWNS     = 0x00020000;  // Mount namespace
    int CLONE_NEWUTS    = 0x04000000;  // UTS namespace (hostname)
    int CLONE_NEWIPC    = 0x08000000;  // IPC namespace
    int CLONE_NEWUSER   = 0x10000000;  // User namespace
    int CLONE_NEWPID    = 0x20000000;  // PID namespace
    int CLONE_NEWNET    = 0x40000000;  // Network namespace
    int CLONE_NEWCGROUP = 0x02000000;  // Cgroup namespace
    
    // Mount flags
    int MS_BIND         = 4096;
    int MS_REC          = 16384;
    int MS_PRIVATE      = 1 << 18;
    int MS_NOSUID       = 2;
    int MS_NOEXEC       = 8;
    int MS_NODEV        = 4;
    
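    /**
     * Create new namespaces and move the calling process into them.
     * Exception: with CLONE_NEWPID the caller stays in its current PID
     * namespace; only children created afterwards enter the new one.
     * 
     * @param flags Combination of CLONE_NEW* flags
     * @return 0 on success, -1 on error
     */
    int unshare(int flags);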
    
    /**
     * Change the root filesystem.
     * 
     * @param path New root directory
     * @return 0 on success, -1 on error
     */
    int chroot(String path);
    
    /**
     * Change working directory.
     */
    int chdir(String path);
    
    /**
     * Mount a filesystem.
     * 
     * @param source Source device/path
     * @param target Mount point
     * @param filesystemtype Type (e.g., "proc", "sysfs")
     * @param mountflags Mount flags
     * @param data Additional data
     */
    int mount(String source, String target, String filesystemtype,
              long mountflags, Pointer data);
    
    /**
     * Unmount a filesystem.
     */
    int umount2(String target, int flags);
    
    /**
     * Set hostname.
     */
    int sethostname(String name, int len);
    
    /**
     * Get process ID.
     */
    int getpid();
    
    /**
     * Get parent process ID.
     */
    int getppid();
    
    /**
     * Get user ID.
     */
    int getuid();
    
    /**
     * Get group ID.
     */
    int getgid();
    
    /**
     * Execute a program.
     */
    int execv(String path, String[] argv);
    
    /**
     * Fork the process.
     */
    int fork();
    
    /**
     * Wait for child process.
     */
    int waitpid(int pid, int[] status, int options);
    
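    /**
     * Set the user ID / group ID of the calling process.
     * (The ID mappings themselves are written to /proc/<pid>/uid_map
     * and /proc/<pid>/gid_map, not set through these calls.)
     */
    int setuid(int uid);
    int setgid(int gid);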
    
    /**
     * Pivot root - atomically swap root filesystems.
     */
    int pivot_root(String new_root, String put_old);
}
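Because the CLONE_NEW* constants are single bits, a combined mask can be turned back into readable names, which helps when a log line only shows the hex value. The following is a small sketch of that idea using the same constant values as the listing above; the class name CloneFlags and the describe() method are illustrative assumptions, not part of JNA or of the chapter's code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for debugging flag masks.
class CloneFlags {
    // Same values as the LibC interface above, ordered by bit position.
    static final Map<String, Integer> FLAGS = new LinkedHashMap<>();
    static {
        FLAGS.put("CLONE_NEWNS",     0x00020000);
        FLAGS.put("CLONE_NEWCGROUP", 0x02000000);
        FLAGS.put("CLONE_NEWUTS",    0x04000000);
        FLAGS.put("CLONE_NEWIPC",    0x08000000);
        FLAGS.put("CLONE_NEWUSER",   0x10000000);
        FLAGS.put("CLONE_NEWPID",    0x20000000);
        FLAGS.put("CLONE_NEWNET",    0x40000000);
    }

    /** Returns the names of all namespace flags set in the mask. */
    static List<String> describe(int flags) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Integer> e : FLAGS.entrySet()) {
            if ((flags & e.getValue()) != 0) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // PID + mount + UTS + IPC
        int flags = 0x20000000 | 0x00020000 | 0x04000000 | 0x08000000;
        System.out.println("0x" + Integer.toHexString(flags) + " = " + describe(flags));
        // prints: 0x2c020000 = [CLONE_NEWNS, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWPID]
    }
}
```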

Part 3: Namespace Manager

src/main/java/com/minidocker/namespace/NamespaceManager.java
package com.minidocker.namespace;

import com.minidocker.linux.LibC;
import com.sun.jna.Native;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Manages Linux namespace creation and configuration.
 * 
 * Namespaces provide isolation for various system resources:
 * - PID: Process sees its own process tree, with PID 1
 * - NET: Separate network stack (interfaces, routing, firewall)
 * - MNT: Separate mount points
 * - UTS: Separate hostname
 * - IPC: Separate System V IPC and POSIX message queues
 * - USER: Separate user/group ID mappings
 */
public class NamespaceManager {
    
    private final LibC libc = LibC.INSTANCE;
    
    /**
     * Creates all namespaces needed for container isolation.
     * 
     * @param options Configuration options
     * @throws NamespaceException if namespace creation fails
     */
    public void createNamespaces(NamespaceOptions options) throws NamespaceException {
        int flags = 0;
        
        if (options.newPidNamespace()) {
            flags |= LibC.CLONE_NEWPID;
        }
        if (options.newNetNamespace()) {
            flags |= LibC.CLONE_NEWNET;
        }
        if (options.newMountNamespace()) {
            flags |= LibC.CLONE_NEWNS;
        }
        if (options.newUtsNamespace()) {
            flags |= LibC.CLONE_NEWUTS;
        }
        if (options.newIpcNamespace()) {
            flags |= LibC.CLONE_NEWIPC;
        }
        if (options.newUserNamespace()) {
            flags |= LibC.CLONE_NEWUSER;
        }
        
        System.out.println("Creating namespaces with flags: 0x" + Integer.toHexString(flags));
        
        // unshare() creates new namespaces and moves the calling process into
        // them (for CLONE_NEWPID, only children forked afterwards are moved)
        int result = libc.unshare(flags);
        if (result != 0) {
            throw new NamespaceException("Failed to create namespaces: " + 
                Native.getLastError());
        }
        
        System.out.println("✓ Namespaces created successfully");
    }
    
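    /**
     * Sets up user namespace mappings.
     * 
     * Maps container user 0 (root) to the given host user and group.
     * 
     * The host IDs must be read with getuid()/getgid() BEFORE
     * unshare(CLONE_NEWUSER): inside a user namespace without a mapping,
     * those calls return the overflow ID (typically 65534).
     */
    public void setupUserNamespace(int uid, int gid) throws IOException {
        int pid = libc.getpid();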
        
        // Write UID mapping: container_uid host_uid count
        // Maps container root (0) to current host user
        Path uidMapPath = Path.of("/proc/" + pid + "/uid_map");
        Files.writeString(uidMapPath, "0 " + uid + " 1\n");
        
        // Disable setgroups (required before writing gid_map)
        Path setgroupsPath = Path.of("/proc/" + pid + "/setgroups");
        Files.writeString(setgroupsPath, "deny\n");
        
        // Write GID mapping
        Path gidMapPath = Path.of("/proc/" + pid + "/gid_map");
        Files.writeString(gidMapPath, "0 " + gid + " 1\n");
        
        System.out.println("✓ User namespace mapped: container root -> host uid " + uid);
    }
    
    /**
     * Sets the hostname within the UTS namespace.
     */
    public void setHostname(String hostname) throws NamespaceException {
        int result = libc.sethostname(hostname, hostname.length());
        if (result != 0) {
            throw new NamespaceException("Failed to set hostname: " + 
                Native.getLastError());
        }
        System.out.println("✓ Hostname set to: " + hostname);
    }
    
    /**
     * Demonstrates PID namespace isolation.
     */
    public void showPidNamespaceInfo() {
        int pid = libc.getpid();
        int ppid = libc.getppid();
        
        System.out.println("Inside container:");
        System.out.println("  PID:  " + pid);
        System.out.println("  PPID: " + ppid);
        
        // In a PID namespace, the first process gets PID 1
        if (pid == 1) {
            System.out.println("  → We are PID 1 (init process) in this namespace!");
        }
    }
}
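NamespaceManager throws NamespaceException, but the class itself is not shown in this chapter. A minimal version might look like the sketch below. Everything beyond the single-String constructor that the listing above relies on is an assumption: the errno-aware constructor and the hint() text are illustrative additions, and the errno numbers are the usual Linux values. In the project it would live in the com.minidocker.namespace package; the package line is left out here so the snippet compiles on its own.

```java
// Sketch of the exception type referenced by NamespaceManager (assumed shape).
class NamespaceException extends Exception {

    private final int errno;

    /** Constructor matching the usage in NamespaceManager. */
    NamespaceException(String message) {
        super(message);
        this.errno = 0;
    }

    /** Optional variant that keeps the raw errno and appends a hint. */
    NamespaceException(String message, int errno) {
        super(message + " (errno " + errno + ": " + hint(errno) + ")");
        this.errno = errno;
    }

    int errno() {
        return errno;
    }

    /** Rough hints for errno values that unshare() commonly reports. */
    static String hint(int errno) {
        switch (errno) {
            case 1:  return "EPERM, missing privileges such as CAP_SYS_ADMIN";
            case 12: return "ENOMEM, out of kernel memory";
            case 22: return "EINVAL, invalid flags (for example CLONE_NEWUSER in a multithreaded process)";
            case 28: return "ENOSPC, namespace limit reached";
            default: return "unrecognized errno";
        }
    }

    public static void main(String[] args) {
        System.out.println(new NamespaceException("unshare failed", 1).getMessage());
    }
}
```

The EINVAL hint matters for this project in particular: a JVM always runs several threads, so a CLONE_NEWUSER request made from Java is likely to be rejected.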

Part 4: Namespace Options

src/main/java/com/minidocker/namespace/NamespaceOptions.java
package com.minidocker.namespace;

/**
 * Configuration for which namespaces to create.
 */
public record NamespaceOptions(
    boolean newPidNamespace,
    boolean newNetNamespace,
    boolean newMountNamespace,
    boolean newUtsNamespace,
    boolean newIpcNamespace,
    boolean newUserNamespace
) {
    /**
     * Creates options with all namespaces enabled (typical container).
     */
    public static NamespaceOptions all() {
        return new NamespaceOptions(true, true, true, true, true, true);
    }
    
    /**
     * Creates options with no namespaces (for testing).
     */
    public static NamespaceOptions none() {
        return new NamespaceOptions(false, false, false, false, false, false);
    }
    
    /**
     * Builder for custom namespace configurations.
     */
    public static Builder builder() {
        return new Builder();
    }
    
    public static class Builder {
        private boolean pid = false;
        private boolean net = false;
        private boolean mnt = false;
        private boolean uts = false;
        private boolean ipc = false;
        private boolean user = false;
        
        public Builder withPid() { this.pid = true; return this; }
        public Builder withNet() { this.net = true; return this; }
        public Builder withMount() { this.mnt = true; return this; }
        public Builder withUts() { this.uts = true; return this; }
        public Builder withIpc() { this.ipc = true; return this; }
        public Builder withUser() { this.user = true; return this; }
        
        public NamespaceOptions build() {
            return new NamespaceOptions(pid, net, mnt, uts, ipc, user);
        }
    }
}

Part 5: Understanding Each Namespace

PID Namespace

┌─────────────────────────────────────────────────────────────────────────────┐
│                          PID NAMESPACE                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   HOST VIEW:                         CONTAINER VIEW:                        │
│                                                                              │
│   PID  COMMAND                       PID  COMMAND                           │
│   1    systemd                       1    /bin/sh  ← Thinks it's PID 1!    │
│   100  sshd                          2    nginx                             │
│   200  dockerd                       3    worker                            │
│   300  /bin/sh (container init)                                             │
│   301  nginx                                                                 │
│   302  worker                                                                │
│                                                                              │
│   Host PID 300 = Container PID 1                                            │
│   The container cannot see host processes!                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
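The two views in the diagram can be observed from a single process: on kernels that expose it (Linux 4.1 and later), the NSpid line in /proc/<pid>/status lists the process ID as seen from each nested PID namespace, outermost first. Below is a sketch of a parser for that line; the class name NsPid is hypothetical, and the sample status text in the comment is made up to match the diagram.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the minidocker sources.
class NsPid {

    /**
     * Extracts the NSpid values from the text of a /proc/<pid>/status file.
     * For the container init in the diagram this would be [300, 1].
     */
    static List<Integer> parse(String statusText) {
        List<Integer> pids = new ArrayList<>();
        for (String line : statusText.split("\n")) {
            if (line.startsWith("NSpid:")) {
                String rest = line.substring("NSpid:".length()).trim();
                for (String field : rest.split("\\s+")) {
                    if (!field.isEmpty()) {
                        pids.add(Integer.parseInt(field));
                    }
                }
            }
        }
        return pids;
    }

    public static void main(String[] args) throws IOException {
        String status = Files.readString(Path.of("/proc/self/status"));
        System.out.println("PID per namespace level: " + parse(status));
    }
}
```

A process that is not inside a nested PID namespace reports a single value.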

UTS Namespace

public void demonstrateUtsNamespace() throws Exception {
    // Before: we have host's hostname
    System.out.println("Host hostname: " + getHostname());
    
    // Create UTS namespace
    libc.unshare(LibC.CLONE_NEWUTS);
    
    // Now we can change hostname without affecting host
    libc.sethostname("mycontainer", 11);
    
    System.out.println("Container hostname: " + getHostname());
    // Output: "mycontainer" - host is unaffected!
}
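The getHostname() helper used above is not defined in this chapter. One possible implementation, shown here as a sketch under the hypothetical name HostnameReader.readHostname(), reads /proc/sys/kernel/hostname, which reports the hostname of the reader's own UTS namespace. The fallback to InetAddress is an assumption added for systems without /proc.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for the getHostname() helper used in the demo.
class HostnameReader {

    /** Returns the current hostname, or "unknown" if it cannot be determined. */
    static String readHostname() {
        Path procFile = Path.of("/proc/sys/kernel/hostname");
        try {
            if (Files.isReadable(procFile)) {
                // Per-UTS-namespace value maintained by the kernel
                return Files.readString(procFile).trim();
            }
            return InetAddress.getLocalHost().getHostName();
        } catch (IOException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("Hostname: " + readHostname());
    }
}
```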

Mount Namespace

public void demonstrateMountNamespace() throws Exception {
    // Create mount namespace
    libc.unshare(LibC.CLONE_NEWNS);
    
    // Make all mounts private (changes don't propagate to host)
    libc.mount(null, "/", null, LibC.MS_REC | LibC.MS_PRIVATE, null);
    
    // Now we can mount things that only this container sees
    libc.mount("tmpfs", "/tmp", "tmpfs", 0, null);
    
    // This /tmp is completely separate from host's /tmp
}
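Whether the MS_PRIVATE remount took effect can be checked in /proc/self/mountinfo: shared mounts carry a shared:N tag in the optional fields that precede the "-" separator, while private mounts carry none. The parser below is a sketch of that check. The class name MountInfo is hypothetical, and the sample lines in the usage note are illustrative rather than taken from a real system.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the minidocker sources.
class MountInfo {
    final String mountPoint;
    final List<String> propagationTags; // e.g. ["shared:1"] or ["master:2"]

    MountInfo(String mountPoint, List<String> propagationTags) {
        this.mountPoint = mountPoint;
        this.propagationTags = propagationTags;
    }

    /** Parses one line of /proc/<pid>/mountinfo. */
    static MountInfo parseLine(String line) {
        String[] f = line.split(" ");
        if (f.length < 7) {
            throw new IllegalArgumentException("Not a mountinfo line: " + line);
        }
        // Fields 0-5 are fixed; optional tags run from index 6 up to "-"
        List<String> tags = new ArrayList<>();
        for (int i = 6; i < f.length && !f[i].equals("-"); i++) {
            tags.add(f[i]);
        }
        return new MountInfo(f[4], tags);
    }

    /** True if mount events on this mount propagate to its peer group. */
    boolean isShared() {
        return propagationTags.stream().anyMatch(t -> t.startsWith("shared:"));
    }

    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/self/mountinfo"))) {
            MountInfo m = parseLine(line);
            System.out.println(m.mountPoint + (m.isShared() ? "  [shared]" : "  [not shared]"));
        }
    }
}
```

For example, a line such as "22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw" would be reported as shared, and the same line without the shared:1 tag as not shared.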

Part 6: Container Runner

src/main/java/com/minidocker/Container.java
package com.minidocker;

import com.minidocker.linux.LibC;
import com.minidocker.namespace.NamespaceManager;
import com.minidocker.namespace.NamespaceOptions;

/**
 * Main container class that orchestrates isolation.
 */
public class Container {
    
    private final LibC libc = LibC.INSTANCE;
    private final NamespaceManager namespaces = new NamespaceManager();
    
    private final String hostname;
    private final String[] command;
    
    public Container(String hostname, String[] command) {
        this.hostname = hostname;
        this.command = command;
    }
    
    /**
     * Runs the container.
     * 
     * This is a simplified version - real Docker uses fork() and 
     * clone() for proper isolation. We use unshare() for simplicity.
     */
    public void run() throws Exception {
        System.out.println("=== Starting Container ===");
        System.out.println("Hostname: " + hostname);
        System.out.println("Command: " + String.join(" ", command));
        System.out.println();
        
        // Step 1: Create namespaces
        NamespaceOptions options = NamespaceOptions.builder()
            .withPid()
            .withMount()
            .withUts()
            .withIpc()
            .build();
        
        namespaces.createNamespaces(options);
        
        // Step 2: Set hostname
        namespaces.setHostname(hostname);
        
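        // Step 3: Show PID info. unshare(CLONE_NEWPID) does not move the
        // caller, so this still reports the PID from the original namespace.
        namespaces.showPidNamespaceInfo();
        
        // Step 4: Fork; the first child becomes PID 1 in the new namespace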
        int pid = libc.fork();
        
        if (pid == 0) {
            // Child process - this is PID 1 in new PID namespace
            runContainerInit();
        } else if (pid > 0) {
            // Parent - wait for child
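            int[] status = new int[1];
            libc.waitpid(pid, status, 0);
            // status[0] is the raw wait status; for a normal exit the
            // exit code is stored in bits 8-15
            int exitCode = (status[0] >> 8) & 0xff;
            System.out.println("Container exited with status: " + exitCode);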
        } else {
            throw new RuntimeException("Fork failed");
        }
    }
    
    private void runContainerInit() {
        try {
            System.out.println("\n=== Container Init (PID " + libc.getpid() + ") ===");
            
            // Execute the command
            if (command.length > 0) {
                String[] argv = new String[command.length + 1];
                System.arraycopy(command, 0, argv, 0, command.length);
                argv[command.length] = null;  // null-terminated
                
                libc.execv(command[0], argv);
                // If we get here, exec failed
                System.err.println("Failed to execute: " + command[0]);
                System.exit(1);
            }
        } catch (Exception e) {
            System.err.println("Container init failed: " + e.getMessage());
            System.exit(1);
        }
    }
    
    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("Usage: java Container <hostname> <command...>");
            System.out.println("Example: java Container mycontainer /bin/sh");
            System.exit(1);
        }
        
        String hostname = args[0];
        String[] command = new String[args.length - 1];
        System.arraycopy(args, 1, command, 0, command.length);
        
        try {
            Container container = new Container(hostname, command);
            container.run();
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
            e.printStackTrace();
            System.exit(1);
        }
    }
}
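The value that waitpid() stores is an encoded wait status, not a plain exit code. The sketch below reproduces the standard Linux WIFEXITED / WEXITSTATUS / WIFSIGNALED / WTERMSIG decoding in Java so the parent can tell a normal exit from a kill by signal. The class name WaitStatus is hypothetical.

```java
// Hypothetical helper mirroring the C wait-status macros on Linux.
class WaitStatus {

    /** WIFEXITED: the child terminated normally. */
    static boolean exited(int status) {
        return (status & 0x7f) == 0;
    }

    /** WEXITSTATUS: exit code, meaningful only if exited(status). */
    static int exitCode(int status) {
        return (status >> 8) & 0xff;
    }

    /** WIFSIGNALED: the child was terminated by a signal. */
    static boolean signaled(int status) {
        int sig = status & 0x7f;
        return sig != 0 && sig != 0x7f; // 0x7f marks a stopped child
    }

    /** WTERMSIG: terminating signal, meaningful only if signaled(status). */
    static int termSignal(int status) {
        return status & 0x7f;
    }

    static String describe(int status) {
        if (exited(status)) {
            return "exited with code " + exitCode(status);
        }
        if (signaled(status)) {
            return "killed by signal " + termSignal(status);
        }
        return "stopped or continued (raw status " + status + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe(256)); // exited with code 1
        System.out.println(describe(9));   // killed by signal 9
    }
}
```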

Exercises

Extend the namespace manager to create network namespaces:
// 1. Create network namespace
libc.unshare(LibC.CLONE_NEWNET);

// 2. Bring up loopback interface
// Use: ip link set lo up
// (Requires additional native calls or ProcessBuilder)

// 3. Verify isolation
// The container should have its own network stack
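For the verification step, one approach is to compare the /proc/<pid>/ns/net link targets of two processes: equal targets mean the same network namespace. The following is a sketch with hypothetical names (NamespaceCompare, sameNamespace). It assumes /proc is mounted, and reading another user's process entry may require extra privileges, in which case the helper reports false.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for checking namespace membership.
class NamespaceCompare {

    /** Returns an identifier such as "net:[4026531840]", or null if unreadable. */
    static String namespaceId(String pid, String type) {
        try {
            return Files.readSymbolicLink(Path.of("/proc", pid, "ns", type)).toString();
        } catch (IOException | UnsupportedOperationException e) {
            return null;
        }
    }

    /** True only if both identifiers are readable and equal. */
    static boolean sameNamespace(String pidA, String pidB, String type) {
        String a = namespaceId(pidA, type);
        return a != null && a.equals(namespaceId(pidB, type));
    }

    public static void main(String[] args) {
        // Pass a PID to compare against; defaults to comparing with ourselves
        String other = args.length > 0 ? args[0] : "self";
        System.out.println("Same net namespace as " + other + ": "
            + sameNamespace("self", other, "net"));
    }
}
```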

Allow joining an existing container’s namespaces:
// Use setns() syscall to join existing namespace
// int setns(int fd, int nstype);
// fd = open("/proc/<pid>/ns/<type>")

// This is how "docker exec" works!

Implement user namespace with UID/GID mapping:
// 1. Create user namespace FIRST (before other namespaces)
// 2. Write to /proc/self/uid_map
// 3. Write "deny" to /proc/self/setgroups
// 4. Write to /proc/self/gid_map

// This enables unprivileged containers!
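As a starting point for the user namespace exercise, the sketch below only builds the ordered list of writes (file path and content) instead of performing them, which makes the required sequence easy to inspect. The names IdMapPlan, mapLine and plan are hypothetical, and the host IDs are passed in on the assumption that they were captured before the user namespace was created.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical planning helper; it does not touch /proc itself.
class IdMapPlan {

    /** Formats one mapping line: "<id inside> <id outside> <count>". */
    static String mapLine(long insideId, long outsideId, long count) {
        if (insideId < 0 || outsideId < 0 || count <= 0) {
            throw new IllegalArgumentException("IDs must be >= 0 and count > 0");
        }
        return insideId + " " + outsideId + " " + count + "\n";
    }

    /**
     * Returns the writes in the order they must happen: uid_map,
     * then setgroups=deny, then gid_map.
     */
    static Map<String, String> plan(long hostUid, long hostGid) {
        Map<String, String> writes = new LinkedHashMap<>();
        writes.put("/proc/self/uid_map", mapLine(0, hostUid, 1));
        writes.put("/proc/self/setgroups", "deny\n");
        writes.put("/proc/self/gid_map", mapLine(0, hostGid, 1));
        return writes;
    }

    public static void main(String[] args) {
        plan(1000, 1000).forEach((path, content) ->
            System.out.print(path + " <- " + content));
    }
}
```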

Key Takeaways

Isolation Not Virtualization

Namespaces isolate views of resources, not the resources themselves

Kernel Primitives

unshare(), clone(), setns() are the syscalls that power containers

Layered Isolation

Each namespace type isolates a different resource

Negligible Overhead

Namespaces add negligible overhead - just kernel data structures


What’s Next?

In Chapter 2: Control Groups (cgroups), we’ll implement:
  • CPU limits
  • Memory limits
  • Process count limits
  • Resource accounting
