Linux Namespaces - The foundation of container isolation

Namespaces Deep Dive

Linux namespaces are the core technology enabling container isolation. Understanding them deeply is essential for infrastructure engineers working with Docker, Kubernetes, and any container-based systems.
Interview Frequency: Very High (especially at infrastructure companies)
Key Topics: Namespace types, creation mechanisms, container implementation
Time to Master: 12-14 hours

What Are Namespaces?

Namespaces partition kernel resources so that processes see isolated views of the system.

Namespace Types

| Namespace | Flag | What it Isolates | Kernel Version |
|-----------|------|------------------|----------------|
| Mount | CLONE_NEWNS | Mount points | 2.4.19 (2002) |
| UTS | CLONE_NEWUTS | Hostname, domain name | 2.6.19 (2006) |
| IPC | CLONE_NEWIPC | System V IPC objects, POSIX message queues | 2.6.19 (2006) |
| PID | CLONE_NEWPID | Process IDs | 2.6.24 (2008) |
| Network | CLONE_NEWNET | Network stack | 2.6.29 (2009) |
| User | CLONE_NEWUSER | User and group IDs | 3.8 (2013) |
| Cgroup | CLONE_NEWCGROUP | Cgroup root | 4.6 (2016) |
| Time | CLONE_NEWTIME | Boot and monotonic clock offsets | 5.6 (2020) |

PID Namespace

Each PID namespace has its own PID numbering, starting from 1.
┌─────────────────────────────────────────────────────────────────────────────┐
│                         PID NAMESPACE HIERARCHY                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Root PID Namespace (Host)                                                  │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │  PID 1: systemd                                                     │    │
│   │  PID 100: sshd                                                      │    │
│   │  PID 5000: containerd                                               │    │
│   │  PID 5001: container init ──────────────────────────────────────┐  │    │
│   │  PID 5002: container app                                        │  │    │
│   │  PID 5050: container init ────────────────────────────────┐     │  │    │
│   │  PID 5051: container app                                  │     │  │    │
│   └───────────────────────────────────────────────────────────│─────│──┘    │
│                                                               │     │        │
│   Container B PID Namespace                     Container A PID Namespace   │
│   ┌────────────────────────────┐              ┌────────────────────────────┐│
│   │  PID 1: /bin/sh (5050)     │              │  PID 1: /bin/sh (5001)     ││
│   │  PID 2: app (5051)         │              │  PID 2: app (5002)         ││
│   │                            │              │                            ││
│   │  Sees only PIDs 1, 2       │              │  Sees only PIDs 1, 2       ││
│   │  Cannot see host PIDs      │              │  Cannot see host PIDs      ││
│   └────────────────────────────┘              └────────────────────────────┘│
│                                                                              │
│   Key Properties:                                                            │
│   - Nested hierarchy (parent can see child PIDs)                            │
│   - PID 1 in namespace is "init" (reaps orphans)                            │
│   - Signals from parent namespace use host PIDs                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

PID 1 Responsibilities

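// PID 1 in a container must:
// 1. Reap zombie processes
// 2. Forward signals appropriately
// 3. Not exit (or container dies)

// Simple init process (signal forwarding omitted for brevity)
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

// Reap every child that has exited; keep errno intact for the interrupted code
static void sigchld_handler(int sig) {
    (void)sig;
    int saved_errno = errno;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
    errno = saved_errno;
}

int main(int argc, char **argv) {
    (void)argc;
    signal(SIGCHLD, sigchld_handler);

    // Fork the actual application
    pid_t child = fork();
    if (child == -1) {
        perror("fork");
        return 1;
    }
    if (child == 0) {
        execv("/app", argv);
        perror("execv");  // Only reached if exec failed
        _exit(127);
    }

    // Wait forever, reaping children
    while (1) {
        pause();
    }
}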

PID Namespace Demo

# Create new PID namespace
sudo unshare --pid --fork --mount-proc bash

# Inside new namespace:
ps aux  # Only shows processes in this namespace
echo $$  # PID is 1!

# View from host:
# The bash process has a different PID in host namespace
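One way to see both views of the same process is the NSpid line in /proc/<pid>/status (present since Linux 4.1). It lists the process's PID in each nested PID namespace, starting with the namespace that mounted that procfs. The helper below is a minimal sketch; the name nspid_chain is made up for illustration, and it reads status text on stdin so it can be tried on any saved output.

```shell
# nspid_chain (hypothetical helper): read /proc/<pid>/status-style text on
# stdin and print the PID the process has in each PID namespace it belongs
# to, outermost namespace first.
nspid_chain() {
    awk '/^NSpid:/ {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }'
}

# Example, run on the host (<host-pid> is the namespaced bash as the host sees it):
#   nspid_chain < /proc/<host-pid>/status
```

For the bash started by the demo above, the last number printed should be 1, and the first is its PID in the host namespace.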

Network Namespace

Isolates the entire network stack: interfaces, routing, iptables, sockets.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        NETWORK NAMESPACE ISOLATION                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Host Network Namespace                                                     │
│   ┌────────────────────────────────────────────────────────────────────┐    │
│   │                                                                     │    │
│   │  Physical Interface: eth0 (192.168.1.100)                          │    │
│   │  Bridge: docker0 (172.17.0.1)                                      │    │
│   │  Routes: default via 192.168.1.1                                   │    │
│   │  iptables: NAT rules for containers                                │    │
│   │                                                                     │    │
│   │  ┌──────────────────┐                                              │    │
│   │  │  veth-host-a     │ ←─────────────────────────────────┐          │    │
│   │  └──────────────────┘                                   │          │    │
│   │  ┌──────────────────┐                                   │          │    │
│   │  │  veth-host-b     │ ←─────────────────────────────┐   │          │    │
│   │  └──────────────────┘                               │   │          │    │
│   │                                                      │   │          │    │
│   └──────────────────────────────────────────────────────│───│──────────┘    │
│                                                          │   │               │
│   Container A Network NS                    Container B Network NS           │
│   ┌────────────────────────────┐          ┌────────────────────────────┐    │
│   │                            │          │                            │    │
│   │  Interface: eth0           │ ←──(veth pair)──→ veth-host-a        │    │
│   │  IP: 172.17.0.2           │          │  Interface: eth0           │    │
│   │  Routes: default via      │          │  IP: 172.17.0.3           │    │
│   │          172.17.0.1       │          │  Routes: default via      │    │
│   │  Loopback: lo             │          │          172.17.0.1       │    │
│   │                            │          │                            │    │
│   │  Own iptables rules        │          │  Own iptables rules        │    │
│   │  Own routing table         │          │  Own routing table         │    │
│   │  Own sockets               │          │  Own sockets               │    │
│   │                            │          │                            │    │
│   └────────────────────────────┘          └────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Creating Network Namespaces

# Create network namespace
sudo ip netns add container1

# List network namespaces
ip netns list

# Execute command in namespace
sudo ip netns exec container1 ip addr

# Create veth pair
sudo ip link add veth-host type veth peer name veth-container

# Move one end to container namespace
sudo ip link set veth-container netns container1

# Configure interfaces
sudo ip addr add 10.0.0.1/24 dev veth-host
sudo ip link set veth-host up

sudo ip netns exec container1 ip addr add 10.0.0.2/24 dev veth-container
sudo ip netns exec container1 ip link set veth-container up
sudo ip netns exec container1 ip link set lo up

# Test connectivity
sudo ip netns exec container1 ping -c 3 10.0.0.1

# Enable NAT for internet access (the host must also forward packets)
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
sudo ip netns exec container1 ip route add default via 10.0.0.1
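The walkthrough above leaves a namespace and a NAT rule behind. A possible teardown is sketched below, assuming the same names (container1, 10.0.0.0/24) were used. The RUN variable and the function name are illustrative additions, not part of iproute2 or iptables: RUN defaults to sudo, and setting RUN=echo prints the commands instead of running them.

```shell
# RUN is an illustrative knob: "sudo" by default, "echo" for a dry run.
RUN=${RUN:-sudo}

# teardown_netns_demo (hypothetical helper): undo the network namespace demo.
teardown_netns_demo() {
    # Once the namespace is gone, veth-container is destroyed and its peer
    # veth-host goes with it, so the veth pair needs no separate cleanup.
    $RUN ip netns del container1
    # Remove the NAT rule; -D takes the same rule spec that -A added.
    $RUN iptables -t nat -D POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
}

# Dry run:  RUN=echo; teardown_netns_demo
```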

Mount Namespace

Isolates mount points - each namespace can have different filesystem views.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        MOUNT NAMESPACE ISOLATION                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Host Mount Namespace                                                       │
│   /                                                                          │
│   ├── bin/                                                                   │
│   ├── etc/                                                                   │
│   ├── home/                                                                  │
│   ├── var/                                                                   │
│   │   └── lib/                                                              │
│   │       └── docker/                                                       │
│   │           └── overlay2/                                                 │
│   └── ...                                                                   │
│                                                                              │
│   Container Mount Namespace (via overlay filesystem)                         │
│   /                                  ← overlay mount (merged view)           │
│   ├── bin/ → ubuntu:latest          ← image layer (read-only)              │
│   ├── etc/ → merged                 ← overlay of layers                     │
│   ├── var/ → container layer        ← writable layer                        │
│   ├── proc → new procfs             ← container-specific                    │
│   ├── sys → new sysfs               ← container-specific                    │
│   ├── dev → device subset           ← limited devices                       │
│   └── app/ → bind mount             ← volume from host                      │
│                                                                              │
│   Mount Propagation:                                                         │
│   - private: mounts not visible outside namespace                           │
│   - shared: mounts propagate to peer namespaces                             │
│   - slave: receives but doesn't send mount events                           │
│   - unbindable: cannot be bind mounted                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
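The propagation type of each mount can be read from /proc/self/mountinfo: the optional fields between the mount options and the "-" separator carry shared:N, master:N (slave), or unbindable, and a mount with none of them is private. The following is a rough sketch of that parsing; mount_propagation is a hypothetical name, and `findmnt -o TARGET,PROPAGATION` reports the same information if util-linux is available.

```shell
# mount_propagation (hypothetical helper): read mountinfo-format lines on
# stdin and print "<mount point> <propagation type>" for each mount.
mount_propagation() {
    awk '{
        shared = 0; slave = 0; unbindable = 0
        # Optional fields start at field 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/)    shared = 1
            if ($i ~ /^master:/)    slave = 1
            if ($i == "unbindable") unbindable = 1
        }
        if (unbindable)           prop = "unbindable"
        else if (shared && slave) prop = "shared,slave"
        else if (shared)          prop = "shared"
        else if (slave)           prop = "slave"
        else                      prop = "private"
        print $5, prop
    }'
}

# Example: mount_propagation < /proc/self/mountinfo
```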

Container Root Filesystem

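# Create container-like filesystem isolation
sudo unshare --mount --fork bash

# Make mounts private (don't leak to host)
mount --make-rprivate /

# Create new root
mkdir -p /tmp/newroot/{bin,lib,lib64,proc,sys,dev,.oldroot}

# pivot_root requires the new root to be a mount point,
# so bind-mount the directory onto itself
mount --bind /tmp/newroot /tmp/newroot

# Copy busybox for a minimal root (it must be statically linked)
cp /bin/busybox /tmp/newroot/bin/

# Mount special filesystems
mount -t proc proc /tmp/newroot/proc
mount -t sysfs sys /tmp/newroot/sys

# Change root; the old root is parked at /.oldroot
cd /tmp/newroot
pivot_root . .oldroot

# Detach the old root (host binaries are no longer on the path, so call busybox directly)
/bin/busybox umount -l /.oldroot

# Now we're in new root
/bin/busybox sh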

User Namespace

Maps user/group IDs between namespace and host. Enables rootless containers.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        USER NAMESPACE MAPPING                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Host                              Container                                │
│   ┌────────────────────────┐       ┌────────────────────────┐               │
│   │                        │       │                        │               │
│   │  UID 0: root           │       │  UID 0: root ──────────│──┐            │
│   │  UID 1000: alice ──────│───────│→ UID 0 (mapped!)       │  │            │
│   │  UID 1001: bob         │       │  UID 1: app            │  │            │
│   │  ...                   │       │  ...                   │  │            │
│   │                        │       │                        │  │            │
│   │  GID 1000: alice ──────│───────│→ GID 0                 │  │            │
│   │                        │       │                        │  │            │
│   └────────────────────────┘       └────────────────────────┘  │            │
│                                                                 │            │
│   Mapping file (/proc/PID/uid_map):                            │            │
│   ┌─────────────────────────────────────────────────────────┐  │            │
│   │  0 1000 1     ← Container UID 0 = Host UID 1000        │  │            │
│   │  1 100000 65536  ← Container 1-65536 = Host 100000+    │──┘            │
│   └─────────────────────────────────────────────────────────┘               │
│                                                                              │
│   Effect:                                                                    │
│   - "root" in container is unprivileged user on host                        │
│   - Files created with UID 0 appear as UID 1000 on host                    │
│   - Enables rootless containers (Podman, Docker rootless)                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

User Namespace Demo

# Create user namespace (as unprivileged user)
unshare --user --map-root-user bash

# Check identity
id  # Shows uid=0(root) gid=0(root)
whoami  # Shows root

# But we're not really root on the host
cat /proc/self/uid_map
# 0  1000  1  (container 0 = host 1000)

# Try to access host files
cat /etc/shadow  # Permission denied (not real root)
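Each uid_map line has the form "inside-start outside-start length", so translating a container UID to a host UID is a range lookup. The sketch below shows that arithmetic; map_uid is a hypothetical helper name, and it reads the map on stdin so it works on /proc/<pid>/uid_map or on pasted text.

```shell
# map_uid (hypothetical helper): translate a UID inside the namespace to the
# host UID, given uid_map-format lines on stdin.
# Prints "unmapped" if no range covers the UID.
map_uid() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) print "unmapped" }'
}

# Example: map_uid 0 < /proc/self/uid_map
```

With the two-line map from the diagram above (0 1000 1 and 1 100000 65536), container UID 0 resolves to host UID 1000 and container UID 1000 resolves to 100999.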

Rootless Containers

# Run rootless container with Podman
podman run -it --rm alpine sh

# Inside container: appears as root
id  # uid=0(root)

# On host: process runs as your user
ps aux | grep alpine  # Shows your username, not root

Creating Namespaces

System Calls

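#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)  // 8 KiB is too small for most child functions
static char stack[STACK_SIZE];

// Method 1: clone() with namespace flags
int child_fn(void *arg) {
    (void)arg;
    // New process in new namespace(s)
    char *const argv[] = {"/bin/sh", NULL};
    execv("/bin/sh", argv);
    perror("execv");  // Only reached if exec failed
    return 1;
}

int main(void) {
    int flags = CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD;

    // Pass the top of the stack: it grows downward on most architectures
    pid_t child = clone(child_fn, stack + STACK_SIZE, flags, NULL);
    if (child == -1) {
        perror("clone");
        return 1;
    }
    waitpid(child, NULL, 0);
    return 0;
}

// Method 2: unshare() - move the calling process into new namespace(s).
// CLONE_NEWPID is the exception: the caller stays put, only its future children move.
int unshare_example(void) {
    if (unshare(CLONE_NEWPID | CLONE_NEWNS) == -1) {
        perror("unshare");
        return -1;
    }
    return 0;
}

// Method 3: setns() - join existing namespace
int setns_example(void) {
    int fd = open("/proc/1234/ns/net", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return -1;
    }
    int rc = setns(fd, CLONE_NEWNET);
    if (rc == -1)
        perror("setns");
    close(fd);
    return rc;
}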

Namespace Files

# View namespace of a process
ls -la /proc/$$/ns/
# lrwxrwxrwx 1 user user 0 Nov 29 10:00 cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 user user 0 Nov 29 10:00 ipc -> 'ipc:[4026531839]'
# lrwxrwxrwx 1 user user 0 Nov 29 10:00 mnt -> 'mnt:[4026531840]'
# lrwxrwxrwx 1 user user 0 Nov 29 10:00 net -> 'net:[4026531992]'
# lrwxrwxrwx 1 user user 0 Nov 29 10:00 pid -> 'pid:[4026531836]'
# lrwxrwxrwx 1 user user 0 Nov 29 10:00 user -> 'user:[4026531837]'
# lrwxrwxrwx 1 user user 0 Nov 29 10:00 uts -> 'uts:[4026531838]'

# Compare namespaces
readlink /proc/$$/ns/net
readlink /proc/1/ns/net  # Different if in different namespace

# Enter namespace (requires root)
sudo nsenter --target 1234 --net --pid bash
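Two processes share a namespace exactly when their /proc/<pid>/ns/<type> links carry the same inode number. A small sketch for pulling that number out of the link text follows; ns_inode is a hypothetical helper name, and it works on the string alone, so it needs no privileges.

```shell
# ns_inode (hypothetical helper): extract the inode number from a namespace
# link target such as "net:[4026531992]". Prints nothing if the argument does
# not look like a namespace link.
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# Example: compare this shell's network namespace with PID 1's
#   a=$(ns_inode "$(readlink /proc/$$/ns/net)")
#   b=$(ns_inode "$(sudo readlink /proc/1/ns/net)")
#   [ "$a" = "$b" ] && echo "same network namespace"
```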

Container Implementation

How Docker/runc actually creates containers:
┌─────────────────────────────────────────────────────────────────────────────┐
│                      CONTAINER CREATION FLOW                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   docker run ubuntu bash                                                     │
│        │                                                                     │
│        ▼                                                                     │
│   1. Docker daemon receives request                                          │
│        │                                                                     │
│        ▼                                                                     │
│   2. containerd prepares container                                           │
│      - Pull image if needed                                                  │
│      - Prepare overlay filesystem                                            │
│      - Generate OCI runtime spec                                            │
│        │                                                                     │
│        ▼                                                                     │
│   3. runc creates container:                                                 │
│      a. Create namespaces:                                                   │
│         clone(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | ...)             │
│                                                                              │
│      b. Set up cgroups:                                                      │
│         mkdir /sys/fs/cgroup/mycgroup                                       │
│         echo limits > memory.max, cpu.max, etc.                             │
│                                                                              │
│      c. Set up rootfs:                                                       │
│         mount("overlay", "/", ...)                                          │
│         pivot_root(newroot, oldroot)                                        │
│                                                                              │
│      d. Set up networking:                                                   │
│         ip link add veth pair                                               │
│         ip link set to namespace                                            │
│                                                                              │
│      e. Apply security:                                                      │
│         seccomp(filter)                                                      │
│         drop capabilities                                                    │
│         apply AppArmor/SELinux                                              │
│                                                                              │
│      f. Execute entrypoint:                                                  │
│         execve("/bin/bash", ...)                                            │
│        │                                                                     │
│        ▼                                                                     │
│   4. Container running!                                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Minimal Container Runtime

// minimal_container.c - Educational container runtime
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static char child_stack[STACK_SIZE];

int child_fn(void *arg) {
    char **argv = (char **)arg;
    
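    // Set hostname
    if (sethostname("container", 9) == -1)
        perror("sethostname");

    // Make mounts private so the /proc mount below does not propagate to the host
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        perror("mount (make private)");

    // Mount proc
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        perror("mount /proc");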
    
    // Execute command
    execvp(argv[0], argv);
    perror("execvp");
    return 1;
}

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <command>\n", argv[0]);
        return 1;
    }
    
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | SIGCHLD;
    
    pid_t child = clone(
        child_fn,
        child_stack + STACK_SIZE,
        flags,
        &argv[1]
    );
    
    if (child == -1) {
        perror("clone");
        return 1;
    }
    
    waitpid(child, NULL, 0);
    return 0;
}

Lab Exercises

Objective: Understand PID namespace hierarchy
# Create nested PID namespaces
sudo unshare --pid --fork bash -c '
    echo "Level 1 PID: $$"
    unshare --pid --fork bash -c "
        echo \"Level 2 PID: \$$\"
        ps aux
        sleep infinity
    " &
    ps aux
    wait
'

# View from host
ps aux | grep sleep
# Shows actual PID (not 1)

# Check namespace relationship
sudo ls -la /proc/<outer-pid>/ns/
sudo ls -la /proc/<inner-pid>/ns/

Objective: Build container networking from scratch
# Create two "containers" that can communicate

# Create namespaces
sudo ip netns add container1
sudo ip netns add container2

# Create bridge
sudo ip link add br0 type bridge
sudo ip addr add 10.0.0.1/24 dev br0
sudo ip link set br0 up

# Create veth pairs
sudo ip link add veth1 type veth peer name veth1-br
sudo ip link add veth2 type veth peer name veth2-br

# Move to namespaces
sudo ip link set veth1 netns container1
sudo ip link set veth2 netns container2

# Connect to bridge
sudo ip link set veth1-br master br0
sudo ip link set veth2-br master br0
sudo ip link set veth1-br up
sudo ip link set veth2-br up

# Configure container1
sudo ip netns exec container1 ip addr add 10.0.0.2/24 dev veth1
sudo ip netns exec container1 ip link set veth1 up
sudo ip netns exec container1 ip link set lo up

# Configure container2
sudo ip netns exec container2 ip addr add 10.0.0.3/24 dev veth2
sudo ip netns exec container2 ip link set veth2 up
sudo ip netns exec container2 ip link set lo up

# Test connectivity
sudo ip netns exec container1 ping -c 3 10.0.0.3

# Cleanup
sudo ip netns del container1
sudo ip netns del container2
sudo ip link del br0

Objective: Create container-like isolation
#!/bin/bash
# mini-container.sh

set -e

ROOTFS="/tmp/container-root"
CONTAINER_NAME="mini-container"

# Create rootfs using debootstrap or busybox
mkdir -p $ROOTFS
if ! [ -f "$ROOTFS/bin/sh" ]; then
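    # Use busybox for minimal root (it must be statically linked)
    mkdir -p $ROOTFS/{bin,proc,sys,dev,tmp}
    cp /bin/busybox $ROOTFS/bin/
    # umount and rmdir are needed after pivot_root, when host binaries are out of reach
    for cmd in sh ls ps cat echo mount umount rmdir; do
        ln -sf busybox $ROOTFS/bin/$cmd
    done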
fi

# Run container
sudo unshare \
    --mount \
    --uts \
    --ipc \
    --pid \
    --fork \
    /bin/bash -c "
        # Set hostname
        hostname $CONTAINER_NAME
        
        # pivot_root requires the new root to be a mount point
        mount --bind $ROOTFS $ROOTFS

        # Mount proc and sys
        mount -t proc proc $ROOTFS/proc
        mount -t sysfs sys $ROOTFS/sys
        
        # Change root
        cd $ROOTFS
        mkdir -p .oldroot
        pivot_root . .oldroot
        
        # Unmount old root
        umount -l /.oldroot
        rmdir /.oldroot
        
        # Run shell
        exec /bin/sh
    "

Interview Questions

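Question: How do containers differ from virtual machines?
Answer:
| Aspect | Containers | VMs |
|--------|------------|-----|
| Kernel | Shared with host | Separate kernel per VM |
| Isolation | Namespaces + cgroups | Hardware virtualization |
| Overhead | ~1-2% | ~5-15% |
| Boot time | Milliseconds | Seconds to minutes |
| Isolation strength | Process-level | Hardware-level |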
Key differences:
  • Containers share kernel (same syscalls, same vulnerabilities)
  • VMs have complete kernel isolation
  • Container escape = host access; VM escape = hypervisor access
  • Containers are lighter but less isolated
When to use VMs: Multi-tenant, untrusted workloads, different kernel requirements
When to use containers: Single-tenant, microservices, CI/CD
Question: How do rootless containers work?
Answer:
Mechanism:
  1. User namespace maps container root to unprivileged host user
  2. No actual root privileges on host
  3. Uses subordinate UIDs/GIDs from /etc/subuid, /etc/subgid
Implementation:
Host user alice (UID 1000)
Container root (UID 0) → mapped to host UID 1000 (alice herself)
Container UID 1 → mapped to host UID 100000 (first subordinate UID)
Benefits:
  • Container compromise = unprivileged access
  • No root daemon required
  • Better security posture
Limitations:
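  • Can’t bind to ports < 1024 by default (needs CAP_NET_BIND_SERVICE or a lowered net.ipv4.ip_unprivileged_port_start)
  • Some operations are restricted (for example, creating device nodes or mounting block devices)
  • Networking out of the namespace needs a user-space helper such as slirp4netns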
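To make the mapping concrete, here is a sketch of how the uid_map for a rootless container could be derived from /etc/subuid, assuming the common layout where container root maps to the invoking user and container UIDs from 1 upward map to the user's subordinate range. The helper name rootless_uid_map is hypothetical; real runtimes delegate the actual write to the setuid newuidmap tool.

```shell
# rootless_uid_map (hypothetical helper): given a user name and that user's
# UID, read /etc/subuid-format lines ("name:start:count") on stdin and print
# the uid_map lines a rootless runtime would typically request.
# Returns non-zero if the user has no subordinate range.
rootless_uid_map() {
    awk -F: -v user="$1" -v uid="$2" '
        $1 == user {
            print 0, uid, 1   # container root -> the invoking user
            print 1, $2, $3   # container UIDs from 1 up -> subordinate range
            found = 1
            exit
        }
        END { if (!found) exit 1 }'
}

# Example: rootless_uid_map "$(id -un)" "$(id -u)" < /etc/subuid
```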
Question: What happens to orphaned and zombie processes in a container?
Answer:
In regular Linux:
  • Orphaned process is adopted by PID 1 (init)
  • Init reaps zombie when process exits
In PID namespace:
  • Orphaned process adopted by namespace’s PID 1
  • If PID 1 doesn’t reap, zombies accumulate
  • If PID 1 exits, all processes in namespace killed
Docker behavior:
  • docker run --init starts tini as PID 1 for proper init
  • Reaps zombies and forwards signals
  • Without init: potential zombie accumulation
Why it matters:
  • Zombie processes consume PID table entries
  • Application might not handle SIGCHLD
  • Proper init is critical for long-running containers
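A quick way to check whether zombies are piling up is to filter ps output on the process state column, where zombies show up with a state starting with Z. The filter below is a minimal sketch; list_zombies is a hypothetical name, and it expects the four columns produced by the ps invocation shown in the comment.

```shell
# list_zombies (hypothetical helper): read "stat pid ppid comm" lines on
# stdin and print "pid ppid comm" for every zombie (state starting with Z).
list_zombies() {
    awk '$1 ~ /^Z/ { print $2, $3, $4 }'
}

# Example (inside the container or on the host):
#   ps -e -o stat= -o pid= -o ppid= -o comm= | list_zombies
```

The PPID column points at the process that is failing to reap; inside a container without a proper init, that is often PID 1.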
Question: How does Docker networking work?
Answer:
Bridge networking (default):
  1. Create bridge: docker0 bridge interface on host
  2. Per container:
    • Create veth pair
    • One end in container namespace (eth0)
    • Other end connected to docker0
  3. IP assignment: Docker’s IPAM assigns from subnet
  4. Routing:
    • Container default route via docker0
    • Host NATs outgoing (iptables MASQUERADE)
    • Port mapping via DNAT rules
iptables rules:
-A POSTROUTING -s 172.17.0.0/16 -j MASQUERADE
-A DOCKER -p tcp --dport 80 -j DNAT --to-destination 172.17.0.2:80
Host networking: No network namespace isolation, uses host stack directly
None networking: Only loopback, no external connectivity

Key Takeaways

Namespace Types

8 namespace types isolate different kernel resources

PID 1 Matters

Container’s init process must reap zombies and handle signals

User Namespaces

Enable rootless containers by mapping UIDs

Network Isolation

veth pairs and bridges connect isolated network stacks
