Linux namespaces are the core technology enabling container isolation. Understanding them deeply is essential for infrastructure engineers working with Docker, Kubernetes, and other container-based systems.
Interview Frequency: Very High (especially at infrastructure companies)
Key Topics: Namespace types, creation mechanisms, container implementation
Time to Master: 12-14 hours
    # Create new PID namespace
    sudo unshare --pid --fork --mount-proc bash

    # Inside new namespace:
    ps aux    # Only shows processes in this namespace
    echo $$   # PID is 1!

    # View from host:
    # The bash process has a different PID in host namespace
    # Create network namespace
    sudo ip netns add container1

    # List network namespaces
    ip netns list

    # Execute command in namespace
    sudo ip netns exec container1 ip addr

    # Create veth pair
    sudo ip link add veth-host type veth peer name veth-container

    # Move one end to container namespace
    sudo ip link set veth-container netns container1

    # Configure interfaces
    sudo ip addr add 10.0.0.1/24 dev veth-host
    sudo ip link set veth-host up
    sudo ip netns exec container1 ip addr add 10.0.0.2/24 dev veth-container
    sudo ip netns exec container1 ip link set veth-container up
    sudo ip netns exec container1 ip link set lo up

    # Test connectivity
    sudo ip netns exec container1 ping -c 3 10.0.0.1

    # Enable NAT for internet access (the host must forward packets for this to work)
    sudo sysctl -w net.ipv4.ip_forward=1
    sudo iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -j MASQUERADE
    sudo ip netns exec container1 ip route add default via 10.0.0.1
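    # Create container-like filesystem isolation
    sudo unshare --mount --fork bash

    # Make mounts private (don't leak to host)
    mount --make-rprivate /

    # Create new root; pivot_root requires it to be a mount point,
    # so bind-mount the directory onto itself
    mkdir -p /tmp/newroot/{bin,lib,lib64,proc,sys,dev,.oldroot}
    mount --bind /tmp/newroot /tmp/newroot

    # Copy busybox for a minimal root (assumes a statically linked busybox)
    cp /bin/busybox /tmp/newroot/bin/

    # Mount special filesystems
    mount -t proc proc /tmp/newroot/proc
    mount -t sysfs sys /tmp/newroot/sys

    # Change root, parking the old root on .oldroot
    cd /tmp/newroot
    pivot_root . .oldroot

    # Detach the old root; host binaries are gone now, so call busybox directly
    /bin/busybox umount -l /.oldroot

    # Now we're in new root
    /bin/busybox sh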
    # Run rootless container with Podman
    podman run -it --rm alpine sh

    # Inside container: appears as root
    id    # uid=0(root)

    # On host: process runs as your user
    ps aux | grep alpine    # Shows your username, not root
    # View namespace of a process
    ls -la /proc/$$/ns/
    # lrwxrwxrwx 1 user user 0 Nov 29 10:00 cgroup -> 'cgroup:[4026531835]'
    # lrwxrwxrwx 1 user user 0 Nov 29 10:00 ipc -> 'ipc:[4026531839]'
    # lrwxrwxrwx 1 user user 0 Nov 29 10:00 mnt -> 'mnt:[4026531840]'
    # lrwxrwxrwx 1 user user 0 Nov 29 10:00 net -> 'net:[4026531992]'
    # lrwxrwxrwx 1 user user 0 Nov 29 10:00 pid -> 'pid:[4026531836]'
    # lrwxrwxrwx 1 user user 0 Nov 29 10:00 user -> 'user:[4026531837]'
    # lrwxrwxrwx 1 user user 0 Nov 29 10:00 uts -> 'uts:[4026531838]'

    # Compare namespaces (reading PID 1's links requires root)
    readlink /proc/$$/ns/net
    sudo readlink /proc/1/ns/net    # Different if in different namespace

    # Enter namespace
    sudo nsenter --target 1234 --net --pid bash
    # Create nested PID namespaces
    sudo unshare --pid --fork bash -c '
        echo "Level 1 PID: $$"
        unshare --pid --fork bash -c "
            echo \"Level 2 PID: \$$\"
            ps aux
            sleep infinity
        " &
        ps aux
        wait
    '

    # View from host
    ps aux | grep sleep
    # Shows actual PID (not 1)

    # Check namespace relationship
    sudo ls -la /proc/<outer-pid>/ns/
    sudo ls -la /proc/<inner-pid>/ns/
Lab 2: Network Namespace Networking
Objective: Build container networking from scratch
    # Create two "containers" that can communicate

    # Create namespaces
    sudo ip netns add container1
    sudo ip netns add container2

    # Create bridge
    sudo ip link add br0 type bridge
    sudo ip addr add 10.0.0.1/24 dev br0
    sudo ip link set br0 up

    # Create veth pairs
    sudo ip link add veth1 type veth peer name veth1-br
    sudo ip link add veth2 type veth peer name veth2-br

    # Move to namespaces
    sudo ip link set veth1 netns container1
    sudo ip link set veth2 netns container2

    # Connect to bridge
    sudo ip link set veth1-br master br0
    sudo ip link set veth2-br master br0
    sudo ip link set veth1-br up
    sudo ip link set veth2-br up

    # Configure container1
    sudo ip netns exec container1 ip addr add 10.0.0.2/24 dev veth1
    sudo ip netns exec container1 ip link set veth1 up
    sudo ip netns exec container1 ip link set lo up

    # Configure container2
    sudo ip netns exec container2 ip addr add 10.0.0.3/24 dev veth2
    sudo ip netns exec container2 ip link set veth2 up
    sudo ip netns exec container2 ip link set lo up

    # Test connectivity
    sudo ip netns exec container1 ping -c 3 10.0.0.3

    # Cleanup
    sudo ip netns del container1
    sudo ip netns del container2
    sudo ip link del br0
Lab 3: Build a Minimal Container
Objective: Create container-like isolation
    #!/bin/bash
    # mini-container.sh
    set -e

    ROOTFS="/tmp/container-root"
    CONTAINER_NAME="mini-container"

    # Create rootfs using debootstrap or busybox
    mkdir -p "$ROOTFS"
    if ! [ -f "$ROOTFS/bin/sh" ]; then
        # Use busybox for minimal root (assumes a statically linked /bin/busybox)
        mkdir -p "$ROOTFS"/{bin,proc,sys,dev,tmp}
        cp /bin/busybox "$ROOTFS/bin/"
        # umount and rmdir must exist in the new root for the cleanup after pivot_root
        for cmd in sh ls ps cat echo mount umount rmdir; do
            ln -sf busybox "$ROOTFS/bin/$cmd"
        done
    fi

    # Run container
    sudo unshare \
        --mount \
        --uts \
        --ipc \
        --pid \
        --fork \
        /bin/bash -c "
            # Set hostname
            hostname $CONTAINER_NAME

            # pivot_root requires the new root to be a mount point
            mount --bind $ROOTFS $ROOTFS

            # Mount proc and sys
            mount -t proc proc $ROOTFS/proc
            mount -t sysfs sys $ROOTFS/sys

            # Change root
            cd $ROOTFS
            mkdir -p .oldroot
            pivot_root . .oldroot

            # Unmount old root
            umount -l /.oldroot
            rmdir /.oldroot

            # Run shell
            exec /bin/sh
        "
A security researcher claims they can escape a Docker container by exploiting the mount namespace. Walk through how mount namespace isolation works and identify the potential attack vectors.
Strong Answer:
Mount namespace isolation works by giving each container its own mount table, so mounts made inside the container are invisible to the host and vice versa. During container creation, runc calls clone(CLONE_NEWNS) to create a new mount namespace, then uses pivot_root() to change the container’s root filesystem to the overlay mount. The old host root is unmounted with MNT_DETACH, so the container should not be able to see or access host filesystems.
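A quick way to see this isolation from a shell is to compare the namespace links under /proc. The sketch below is illustrative only: `ns_id` and `same_ns` are hypothetical helper names, and it assumes a Linux /proc with the usual `type:[inode]` link format.

```shell
# ns_id LINK: extract the inode number from a namespace link
# such as "mnt:[4026531840]"; prints nothing if the text does not match.
ns_id() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)\]$/\1/p'
}

# same_ns TYPE PID_A PID_B: succeed when both processes share that namespace.
# Reading another user's /proc/<pid>/ns entries generally requires root.
same_ns() {
    a=$(readlink "/proc/$2/ns/$1") || return 2
    b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$(ns_id "$a")" = "$(ns_id "$b")" ]
}
```

For example, `sudo sh -c '. ./helpers.sh; same_ns mnt 1 <container-pid>'` (with the functions saved in a hypothetical helpers.sh) would be expected to fail for a process inside a container, because its mount namespace differs from that of the host's PID 1.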
The attack vectors are primarily misconfiguration, not namespace bugs. First, bind mounts: if the container runtime mounts host directories into the container (Docker volumes like -v /:/host), the container has direct access to the host filesystem through that mount. Second, the CAP_SYS_ADMIN capability allows processes to call mount() inside the container, potentially remounting filesystems or mounting procfs/sysfs entries that leak host information. Third, device access: if /dev is not properly filtered, the container could mknod and access host block devices directly, bypassing the filesystem entirely.
A properly configured container mitigates these: drop CAP_SYS_ADMIN, use seccomp to block the mount syscall, make the rootfs read-only, minimize bind mounts, and restrict device access with a device allowlist. User namespaces add another layer by mapping container root to an unprivileged host UID: even if mount() is reachable, the kernel permits only a small set of filesystem types there and rejects anything that needs privileges in the host’s user namespace, such as mounting a host block device.
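To check whether CAP_SYS_ADMIN has actually been dropped, one option is to inspect the effective capability mask that the kernel reports in /proc/<pid>/status. This is a minimal sketch: the helper names are hypothetical, and it relies only on CAP_SYS_ADMIN being capability number 21.

```shell
# has_cap_sys_admin MASK: succeed if bit 21 (CAP_SYS_ADMIN) is set in a
# hexadecimal capability mask, as printed on the CapEff line.
has_cap_sys_admin() {
    [ $(( (0x$1 >> 21) & 1 )) -eq 1 ]
}

# current_cap_eff: print the effective capability mask of the calling context.
current_cap_eff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}
```

Inside a container, `has_cap_sys_admin "$(current_cap_eff)" && echo "CAP_SYS_ADMIN present"` gives a quick signal. A mask such as 00000000a80425fb, which is commonly reported inside default Docker containers, does not have bit 21 set.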
Follow-up: How does pivot_root differ from chroot, and why do container runtimes use pivot_root?
Follow-up Answer:
chroot simply changes the process’s root directory reference (task_struct->fs->root) but does not change the mount namespace. The old root filesystem remains mounted and can still be reached, for example through /proc/1/root when procfs is available or through a file descriptor opened before the chroot. A process that holds CAP_SYS_CHROOT can also escape with the classic trick: chroot into a subdirectory while the working directory stays outside it, walk upward with chdir("..") until the real root is reached, then call chroot(".") again. pivot_root is fundamentally different: it atomically swaps the mount namespace’s root mount, making the old root a subdirectory that can then be fully unmounted. After unmounting the old root, there is no reference to the host filesystem in the mount namespace at all. This is why container runtimes use pivot_root followed by umount2(old_root, MNT_DETACH): it provides genuine isolation, while chroot is just a pathname illusion.
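Both sides of this can be checked by reading the mount table. pivot_root(2) requires the new root to be a mount point, and once the old root is detached it should no longer appear in the namespace's mount list. The helper below is a sketch with a hypothetical name; it matches on the mount point column (field 5) of a mountinfo-format file.

```shell
# is_mount_point DIR [MOUNTINFO]: succeed if DIR is listed as a mount point
# in a mountinfo-format file (defaults to the caller's own mount table).
# Paths containing spaces appear escaped (\040) in mountinfo and are not handled.
is_mount_point() {
    awk -v dir="$1" '$5 == dir { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mountinfo}"
}
```

Before calling pivot_root, `is_mount_point /tmp/newroot || mount --bind /tmp/newroot /tmp/newroot` ensures the precondition holds (the path here is only an example).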
Explain how PID namespace nesting works. If a process in a nested PID namespace sends a signal using kill(), what PID does it use, and how does the kernel resolve it?
Strong Answer:
PID namespaces form a hierarchy. A process can have a different PID in each level of the hierarchy. For example, the first process in a child PID namespace has PID 1 inside the namespace, but might have PID 5001 in the parent namespace. The kernel tracks all these PIDs simultaneously using a pid structure that contains an array of upid entries, one per namespace level.
When a process in the child namespace calls kill(2, SIGTERM), the kernel resolves PID 2 relative to the caller’s PID namespace. It looks up PID 2 in the caller’s namespace to find the target task_struct. If PID 2 exists in that namespace, the signal is delivered. The target process might be PID 5002 in the host namespace, but the caller never sees or uses that number.
Crucially, processes in a child namespace cannot see or signal processes in the parent namespace (they simply do not have PID numbers for parent namespace processes). However, processes in the parent namespace can see and signal all processes in child namespaces, using the PIDs those processes have in the parent namespace. This asymmetry is intentional: containers should be isolated from the host, but the host must retain full control.
The kernel function find_task_by_vpid() performs the namespace-aware PID lookup, using the caller’s PID namespace as the search context. find_task_by_pid_ns() takes an explicit namespace instead. A host process that signals a container process needs nothing special: it passes the PID the target has in the host’s namespace, and the lookup is again relative to the caller.
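The per-level PIDs described above are visible from user space: since Linux 4.1, /proc/<pid>/status has an NSpid line listing the process's PID in each namespace it belongs to, outermost first. The helper name below is hypothetical.

```shell
# nspids [STATUS_FILE]: print the PID a process has in each nested PID
# namespace, outermost first, taken from the NSpid line of a status file.
nspids() {
    awk '/^NSpid:/ {
        for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "${1:-/proc/self/status}"
}
```

Run on the host as `nspids /proc/<pid>/status` for a containerized process, this prints one number per namespace level, with the host PID first and the PID seen inside the container last.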
Follow-up: What happens when PID 1 inside a container exits or crashes?
Follow-up Answer:
When PID 1 in a PID namespace exits, the kernel sends SIGKILL to every remaining process in that namespace. PID 1 is the init process for the namespace: orphaned processes are re-parented to it, and it is responsible for reaping them once they exit. Without it, zombies would accumulate indefinitely, so the kernel tears the namespace down by iterating through all tasks whose PID namespace matches and sending them SIGKILL. This is why container runtimes offer an init process (like tini or Docker’s --init flag) that properly handles signals and reaps children. If the application runs as PID 1 directly and never calls wait() on the children it inherits, zombie processes accumulate. If it crashes, the entire container terminates. Running the application as PID 1 also means SIGTERM must be explicitly handled: the kernel drops any signal for which the namespace’s PID 1 has not installed a handler, which protects init from being killed by accident. The exceptions are SIGKILL and SIGSTOP sent from an ancestor namespace, which is how the host can still force-stop a container.
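The signal-handling half of an init shim can be sketched in a few lines of shell. This is an illustration under simplifying assumptions, not how tini is implemented: `mini_init` is a hypothetical name, it forwards only SIGTERM and SIGINT, and it waits for a single child rather than reaping every orphan re-parented to PID 1.

```shell
# mini_init CMD [ARGS...]: run CMD as a child, forward SIGTERM/SIGINT to it,
# and return its exit status.
mini_init() {
    "$@" &
    child=$!
    got_signal=0
    # Installing a handler is what makes SIGTERM deliverable to PID 1 at all.
    trap 'got_signal=1; kill -TERM "$child" 2>/dev/null' TERM INT
    while :; do
        # wait returns early (status > 128) when a trapped signal arrives.
        wait "$child" && status=0 || status=$?
        [ "$got_signal" -eq 1 ] || break
        # Interrupted by the trap: wait once more for the real exit status.
        got_signal=0
    done
    trap - TERM INT
    return "$status"
}
```

A container entrypoint script could end with something like `mini_init my-app --flag` (a placeholder command) so that a stop request reaches the application instead of being dropped.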
How do rootless containers work at the kernel level? Walk through the user namespace UID mapping and explain what security guarantees it provides versus privileged containers.
Strong Answer:
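Rootless containers use user namespaces to map UID 0 inside the container to an unprivileged UID on the host. When the container runtime calls clone(CLONE_NEWUSER), the kernel creates a new user namespace whose UID/GID mappings are defined by writing to /proc/<pid>/uid_map and /proc/<pid>/gid_map. A typical mapping is 0 100000 65536, meaning container UIDs 0-65535 map to host UIDs 100000-165535. The host UIDs come from the subordinate UID ranges defined in /etc/subuid. An unprivileged process may map only its own UID by itself, so rootless runtimes install a range like this through the setuid helpers newuidmap and newgidmap, which validate the request against /etc/subuid and /etc/subgid.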
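The translation a uid_map line describes is simple offset arithmetic. The sketch below models one map line with a hypothetical helper; the three numbers are, in order, the first UID inside the namespace, the first UID outside it, and the length of the range.

```shell
# map_uid CONTAINER_UID INSIDE_START OUTSIDE_START COUNT
# Print the host UID for CONTAINER_UID under one uid_map line, or fail if
# the UID falls outside the mapped range (the kernel treats it as unmapped).
map_uid() {
    uid=$1 inside=$2 outside=$3 count=$4
    if [ "$uid" -lt "$inside" ] || [ "$uid" -ge $(( inside + count )) ]; then
        return 1
    fi
    echo $(( outside + uid - inside ))
}
```

With the mapping 0 100000 65536, `map_uid 0 0 100000 65536` prints 100000 and `map_uid 1000 0 100000 65536` prints 101000, while UID 65536 is outside the range and the call fails.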
Inside the container, the process sees itself as root (UID 0) with full capabilities within its user namespace. It can create files owned by root, bind to privileged ports within its network namespace, and perform operations that normally require root. However, the kernel enforces that capabilities are scoped to the user namespace: CAP_SYS_ADMIN in the container’s user namespace does not grant CAP_SYS_ADMIN in the host’s user namespace. Any operation that touches a resource outside the container’s namespaces (like accessing a host-owned file) uses the mapped host UID (100000), which is unprivileged.
Security guarantees: if an attacker escapes the container, they land on the host as UID 100000, not root. They cannot read /etc/shadow, cannot load kernel modules, cannot mount host filesystems. This is a fundamental improvement over privileged containers where container root equals host root.
Limitations: some operations genuinely require host root (by default, binding to host ports below 1024 when sharing the host network namespace, and certain FUSE operations). Network namespace setup requires either slirp4netns (user-space network stack, slower) or root helper processes. Performance can be slightly lower, mostly because of the user-space networking path; UID translation itself adds little.
Follow-up: Can a process inside a user namespace mount a filesystem, and if so, what are the restrictions?
Follow-up Answer:
A process with CAP_SYS_ADMIN in its user namespace can perform certain mounts, but the kernel restricts which filesystem types are allowed. Only filesystems marked as FS_USERNS_MOUNT in the kernel are permitted: this includes tmpfs, procfs, sysfs, and overlayfs (since Linux 5.11, with restrictions). Block device filesystems like ext4 or XFS cannot be mounted because their on-disk format parsers are not hardened against malicious images, so letting an unprivileged user supply one would expose the kernel to attack. Even for allowed filesystems, the kernel performs additional checks: procfs mounted in a user namespace only exposes information relevant to that namespace, and sysfs entries are restricted. These restrictions are implemented in do_new_mount(), where the kernel checks mount_too_revealing() to prevent information leaks.