Docker Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to use Docker and ship code, feel free to skip ahead. No judgment.

This chapter pulls back the curtain on Docker. We will explore the Linux kernel features that make containers possible, understand how the Docker daemon orchestrates everything, and demystify the image layer system. This knowledge separates container users from container engineers.

Why Internals Matter

Understanding Docker internals helps you:

Debug production issues when containers misbehave
Optimize performance by understanding resource allocation
Ace interviews where internals questions are common
Make informed decisions about container security
Build better Dockerfiles by understanding how layers work

The Big Secret: Containers Are Not Virtual Machines

Here is the fundamental truth that changes everything: containers are just isolated Linux processes. There is no hypervisor. There is no guest operating system. Containers share the host kernel and use kernel features to create isolation.

Virtual Machine:
┌─────────────────────────────────────────────────────────┐
│                     Hypervisor                          │
├───────────────┬───────────────┬───────────────┬────────┤
│   Guest OS    │   Guest OS    │   Guest OS    │        │
│   (Full OS)   │   (Full OS)   │   (Full OS)   │        │
├───────────────┼───────────────┼───────────────┤        │
│     App       │     App       │     App       │ Host OS│
└───────────────┴───────────────┴───────────────┴────────┘

Container:
┌─────────────────────────────────────────────────────────┐
│                    Docker Engine                         │
├───────────────┬───────────────┬───────────────┬────────┤
│   Container   │   Container   │   Container   │        │
│   (Process)   │   (Process)   │   (Process)   │        │
│               │               │               │ Host   │
│     App       │     App       │     App       │ Kernel │
└───────────────┴───────────────┴───────────────┴────────┘

This is why containers are:

Faster to start (no OS boot, just process fork)
More lightweight (no duplicate OS, shared kernel)
Less isolated (shared kernel attack surface)

The Three Pillars of Container Isolation

Docker leverages three key Linux kernel features to create containers:

1. Namespaces - What You Can See

Namespaces provide isolation by limiting what a process can see. Each namespace type isolates a different aspect of the system.

Namespace	Flag	What It Isolates
PID	`CLONE_NEWPID`	Process IDs - container sees only its own processes
NET	`CLONE_NEWNET`	Network interfaces, IP addresses, routing tables
MNT	`CLONE_NEWNS`	Mount points - container has its own filesystem view
UTS	`CLONE_NEWUTS`	Hostname and domain name
IPC	`CLONE_NEWIPC`	Inter-process communication (shared memory, semaphores)
USER	`CLONE_NEWUSER`	User and group IDs (container root can be mapped to an unprivileged host user; Docker does not enable this by default)
CGROUP	`CLONE_NEWCGROUP`	Cgroup root directory

PID Namespace in Action

# On the host, you see all processes
ps aux | wc -l
# Output: 247

# Inside a container, you only see container processes
docker run --rm alpine ps aux
# PID   USER     TIME  COMMAND
#     1 root      0:00 ps aux

The container believes it is running PID 1 (init), but on the host, it has a completely different PID. This isolation is what makes containers feel like separate machines.
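One way to see both PIDs side by side is the NSpid line in /proc/<PID>/status on the host (kernel 4.1 and later), which lists the process ID in each nested PID namespace, outermost first. The helper below is a minimal sketch, not part of Docker; the name `ns_pids` is made up for illustration.

```shell
# Print the host PID and the innermost (container) PID of a process.
# $1: path to a /proc/<PID>/status file
# NSpid lists one PID per nested PID namespace, outermost first, so the
# second field is the host view and the last field is the container view.
ns_pids() {
  awk '/^NSpid:/ { printf "host=%s container=%s\n", $2, $NF }' "$1"
}

# Example (on a Docker host):
#   ns_pids "/proc/$(docker inspect --format '{{.State.Pid}}' container_name)/status"
```

For a process that is not in a nested PID namespace, NSpid has a single value and both fields print the same number.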

NET Namespace Deep Dive

Each container gets its own:

Network interfaces (typically eth0)
IP address
Routing table
Firewall rules (iptables)
/proc/net directory

# View container network namespace
docker run --rm alpine ip addr
# Shows container's own eth0 with its own IP

# On host, view the veth pair connecting to container
ip link | grep veth

Docker creates a virtual Ethernet (veth) pair: one end is moved into the container's network namespace, where it appears as eth0, and the other end stays on the host attached to a bridge (docker0 for the default bridge network).
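To work out which host-side veth belongs to a given container, you can compare interface indexes: inside the container, /sys/class/net/eth0/iflink holds the index of the peer interface, and on the host every interface publishes its own index in /sys/class/net/<name>/ifindex. A minimal sketch follows; the helper name `iface_by_index` and its optional directory argument are assumptions added for illustration, not Docker tooling.

```shell
# Print the name of the interface whose ifindex equals $1.
# $2: sysfs network directory (defaults to /sys/class/net)
iface_by_index() {
  want="$1"
  net_dir="${2:-/sys/class/net}"
  for dir in "$net_dir"/*; do
    [ -r "$dir/ifindex" ] || continue
    if [ "$(cat "$dir/ifindex")" = "$want" ]; then
      basename "$dir"
      return 0
    fi
  done
  return 1   # no interface with that index
}

# Example (on a Docker host):
#   iface_by_index "$(docker exec container_name cat /sys/class/net/eth0/iflink)"
```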

2. Cgroups - What You Can Use

Control Groups (cgroups) limit, account for, and isolate resource usage. While namespaces control what you can see, cgroups control what you can use.

Resource	Cgroup Controller	What It Controls
CPU	`cpu`, `cpuset`	CPU time, which CPUs to use
Memory	`memory`	RAM limits, swap
I/O	`blkio` (cgroup v1), `io` (cgroup v2)	Block device I/O bandwidth
Network	`net_cls`, `net_prio` (cgroup v1 only)	Network traffic classification
PIDs	`pids`	Maximum number of processes

Setting Resource Limits

# Limit container to 512MB RAM and 1 CPU
docker run -d \
  --memory=512m \
  --cpus=1 \
  nginx

# View cgroup settings for a container
docker inspect --format='{{.HostConfig.Memory}}' container_id

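# On the host, examine cgroup files directly (cgroup v1 layout)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes

# On cgroup v2 hosts the file is memory.max; with the systemd cgroup
# driver the path is typically:
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max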
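Both docker inspect and the cgroup file report the limit as a plain byte count, so `--memory=512m` shows up as 536870912. The sketch below converts Docker-style size strings (b, k, m, g suffixes, powers of 1024) to bytes; `to_bytes` is a hypothetical helper that handles whole numbers only.

```shell
# Convert a size such as 512m or 1g into bytes (binary multiples).
to_bytes() {
  num="${1%[bkmgBKMG]}"          # strip a single trailing unit letter, if any
  case "$1" in
    *[kK]) echo $((num * 1024)) ;;
    *[mM]) echo $((num * 1024 * 1024)) ;;
    *[gG]) echo $((num * 1024 * 1024 * 1024)) ;;
    *)     echo "$num" ;;        # plain bytes or explicit b suffix
  esac
}

# Example:
#   to_bytes 512m
```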

What Happens When Limits Are Exceeded

Memory: The OOM (Out of Memory) killer terminates the process
CPU: Process is throttled, not killed
PIDs: Fork bomb protection - fork() syscalls fail

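# The infamous OOM kill: allocate about 100MB under a 10MB limit
# (--memory-swap equal to --memory leaves the container no swap)
docker run --rm --memory=10m --memory-swap=10m python:3-alpine \
  python -c "x = b'x' * (100 * 1024 * 1024)"
echo $?
# Typically 137: the container is killed once it exceeds 10MB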
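A process terminated by the OOM killer dies from SIGKILL (signal 9), and by convention a signal death is reported as exit code 128 plus the signal number, which is where 137 comes from. A small sketch of that arithmetic, with `explain_exit` as a made-up helper name:

```shell
# Interpret an exit code using the 128 + signal convention.
# Note: this is only a convention; a program may also exit with a
# status above 128 on its own.
explain_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "exited with status $code"
  fi
}

# Example:
#   explain_exit 137
```

To confirm that the kill really came from the memory limit, `docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' container_name` shows the exit code next to Docker's own OOM flag.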

3. Union Filesystems - The Layer Cake

Union filesystems allow stacking multiple directories as a single unified view. This is the magic behind Docker’s efficient image layering.

How Layers Work

Docker Image Layers (Read-Only):
┌─────────────────────────────────────┐
│  Layer 4: COPY app.py /app/         │  <- Your code (small)
├─────────────────────────────────────┤
│  Layer 3: RUN pip install flask     │  <- Dependencies
├─────────────────────────────────────┤
│  Layer 2: RUN apt-get update        │  <- System packages
├─────────────────────────────────────┤
│  Layer 1: Ubuntu Base Image         │  <- Base OS (shared!)
└─────────────────────────────────────┘

Container Layer (Read-Write):
┌─────────────────────────────────────┐
│  Container Layer (ephemeral)        │  <- Runtime changes
└─────────────────────────────────────┘

Copy-on-Write (CoW) Strategy

When a container modifies a file from a lower layer:

The file is copied up to the container’s writable layer
The modification happens on the copy
The original file in the image layer remains unchanged

This is why:

100 containers from the same image share the base layers (massive space savings)
Container startup is near-instant (no copying, just a thin writable layer on top)
Deleting a file in a later layer does not reduce image size (a whiteout entry hides it, but the bytes remain in the lower layer)
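The lookup rules can be modelled in a few lines of shell. This is a toy simulation in user space, not how the kernel implements OverlayFS: it handles flat file names only and uses the `.wh.<name>` marker convention found in image layer tarballs. The function name `resolve_in_layers` is an assumption for illustration.

```shell
# Resolve NAME against a stack of layer directories, topmost first.
# Prints the path a union view would expose, or fails if the file is
# absent or hidden by a whiteout marker (.wh.NAME) in a higher layer.
# Usage: resolve_in_layers NAME TOP_LAYER [LOWER_LAYER...]
resolve_in_layers() {
  name="$1"
  shift
  for layer in "$@"; do
    if [ -e "$layer/.wh.$name" ]; then
      return 1                 # deleted here; lower copies stay hidden
    fi
    if [ -e "$layer/$name" ]; then
      echo "$layer/$name"      # first hit from the top wins
      return 0
    fi
  done
  return 1
}
```

Calling it with the writable layer first and the base image last mirrors the stack in the diagram: a modified file resolves to its copy in the top layer, and a whited-out file resolves to nothing even though its bytes still sit in the layer below.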

Storage Drivers

Docker supports multiple storage drivers, each implementing image layering differently (overlay2 and aufs are union filesystems; btrfs, zfs, and devicemapper rely on snapshots):

Driver	Technology	Best For
overlay2	OverlayFS	Modern default, most Linux distros
aufs	AuFS	Legacy Ubuntu (deprecated)
btrfs	Btrfs	Btrfs filesystems
zfs	ZFS	ZFS filesystems
devicemapper	Device Mapper	RHEL/CentOS legacy (deprecated)

# Check your storage driver
docker info | grep "Storage Driver"
# Storage Driver: overlay2

Docker Daemon Architecture

The Docker daemon (dockerd) is the brain of Docker. Let us trace what happens when you run docker run nginx.

The Request Flow

┌─────────────┐     REST API      ┌──────────────┐
│   Docker    │ ───────────────── │   dockerd    │
│   CLI       │   /var/run/       │  (daemon)    │
└─────────────┘   docker.sock     └──────┬───────┘
                                         │
                                         │ gRPC
                                         ▼
                                  ┌──────────────┐
                                  │  containerd  │
                                  │  (runtime)   │
                                  └──────┬───────┘
                                         │
                                         │ OCI Runtime
                                         ▼
                                  ┌──────────────┐
                                  │    runc      │
                                  │ (container)  │
                                  └──────────────┘

Component Breakdown

Docker CLI: The command-line interface you interact with. Translates commands into REST API calls.

Docker Daemon (dockerd):

Listens on Unix socket /var/run/docker.sock
Manages images, networks, volumes
Orchestrates container lifecycle
Exposes the Docker API

containerd:

High-level container runtime
Manages container lifecycle (create, start, stop)
Handles image pull/push
CNCF graduated project (also used by Kubernetes)

runc:

Low-level OCI runtime
Actually creates and runs containers
Sets up namespaces and cgroups
Reference implementation of OCI Runtime Spec

What Happens During `docker run nginx`

CLI parses the command and sends POST /containers/create (followed later by POST /containers/{id}/start)
dockerd checks if nginx image exists locally
If not, dockerd tells containerd to pull from registry
containerd downloads image layers, unpacks them
dockerd creates container metadata
containerd calls runc with container spec
runc creates namespaces, sets up cgroups, mounts filesystem
runc executes the container’s entrypoint (nginx)
runc exits; containerd (through its containerd-shim process) monitors the running container
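The CLI side of this flow can be reproduced by hand with curl against the daemon socket. The sketch below only prints the commands rather than running them, so it is safe to try anywhere; `print_run_api_calls` is a hypothetical helper, `<id>` stands for the container ID returned by the create call, and the pull step is only needed when the image is not already present locally.

```shell
# Print (do not execute) curl commands for the Engine API calls that
# roughly correspond to `docker run IMAGE`.
# $1: image name   $2: daemon socket (defaults to /var/run/docker.sock)
print_run_api_calls() {
  image="$1"
  sock="${2:-/var/run/docker.sock}"
  # Pull the image (skipped in practice when it already exists locally)
  printf '%s\n' "curl --unix-socket $sock -X POST 'http://localhost/images/create?fromImage=$image&tag=latest'"
  # Create the container; the JSON response carries its Id
  printf '%s\n' "curl --unix-socket $sock -X POST -H 'Content-Type: application/json' -d '{\"Image\": \"$image\"}' http://localhost/containers/create"
  # Start the container created above
  printf '%s\n' "curl --unix-socket $sock -X POST http://localhost/containers/<id>/start"
}

# Example:
#   print_run_api_calls nginx
```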

The OCI Specification

The Open Container Initiative (OCI) defines industry standards for containers:

OCI Runtime Spec

Defines how to run a container given a filesystem bundle:

config.json - container configuration
rootfs/ - container root filesystem

// Simplified config.json
{
  "ociVersion": "1.0.0",
  "process": {
    "args": ["nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "linux": {
    "namespaces": [
      {"type": "pid"},
      {"type": "network"},
      {"type": "mount"}
    ]
  }
}
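The namespace entries in config.json line up with the clone flags from the namespace table earlier in the chapter. As a rough sketch, the helper below lists the flags a bundle asks for. It assumes pretty-printed JSON with the namespaces array spread over several lines (the layout `runc spec` produces); for anything else a real JSON parser such as jq is the safer choice. The name `ns_flags` is made up for illustration.

```shell
# Map each namespace "type" in an OCI config.json to its clone(2) flag.
# $1: path to a pretty-printed config.json
ns_flags() {
  # Keep only the lines of the "namespaces" array so that other "type"
  # keys (for example in "mounts") are ignored.
  sed -n '/"namespaces"/,/\]/p' "$1" |
    grep -o '"type": *"[a-z]*"' |
    sed 's/.*"\([a-z]*\)"$/\1/' |
    while read -r type; do
      case "$type" in
        pid)     echo CLONE_NEWPID ;;
        network) echo CLONE_NEWNET ;;
        mount)   echo CLONE_NEWNS ;;
        uts)     echo CLONE_NEWUTS ;;
        ipc)     echo CLONE_NEWIPC ;;
        user)    echo CLONE_NEWUSER ;;
        cgroup)  echo CLONE_NEWCGROUP ;;
      esac
    done
}

# Example:
#   ns_flags config.json
```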

OCI Image Spec

Defines the format of container images:

Manifest (list of layers)
Configuration (env vars, entrypoint)
Layers (tar.gz of filesystem changes)

This standardization means images built with Docker work with Podman, containerd, CRI-O, and other OCI-compliant runtimes.

Security Implications

Understanding internals reveals security considerations:

Shared Kernel Attack Surface

Since containers share the host kernel, a kernel vulnerability affects all containers.

# Container escapes are possible through kernel exploits,
# e.g. CVE-2016-5195 (Dirty COW), and through runtime bugs,
# e.g. CVE-2019-5736 (runc vulnerability)

Root in Container vs Root on Host

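By default, root inside a container is UID 0 - the same as host root. If a process escapes the container, it has root privileges on the host.

Mitigation: User Namespaces

# Map container root to an unprivileged host user.
# userns-remap is a daemon option, not a docker run flag:
dockerd --userns-remap=default
# or set "userns-remap": "default" in /etc/docker/daemon.json and restart Docker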

Capabilities

Linux capabilities break down root privileges into smaller units. Docker drops many capabilities by default:

# View container capabilities
docker run --rm alpine cat /proc/1/status | grep Cap

# Add specific capability
docker run --cap-add NET_ADMIN alpine
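The Cap* lines in /proc/1/status are hexadecimal bitmasks, one bit per capability, which is not very readable. The sketch below decodes such a mask using the capability numbering from the kernel's capability.h (bits 0 to 40). The function name `decode_caps` is hypothetical, and the sample mask 00000000a80425fb is the value commonly seen for an unprivileged Docker container; treat it as an assumption, since the default set can differ between versions and configurations.

```shell
# Decode a capability bitmask (hex, as shown in /proc/<PID>/status)
# into capability names, one per line.
decode_caps() {
  mask=$((0x$1))
  bit=0
  for name in CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL \
    SETGID SETUID SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST \
    NET_ADMIN NET_RAW IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT \
    SYS_PTRACE SYS_PACCT SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE \
    SYS_TIME SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL \
    SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND \
    AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE; do
    # Test whether the bit for this capability is set
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      printf 'CAP_%s\n' "$name"
    fi
    bit=$((bit + 1))
  done
}

# Example:
#   decode_caps 00000000a80425fb
```

Where libcap's tools are installed, `capsh --decode=00000000a80425fb` gives the same kind of listing.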

Seccomp Profiles

Limit which system calls a container can make:

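# The default seccomp profile is applied automatically; no flag is needed
docker run nginx

# Use a custom profile instead (the path is a placeholder)
docker run --security-opt seccomp=/path/to/profile.json nginx

# Check if seccomp is enabled
docker info | grep -i seccomp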
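You can also check a single process: the Seccomp field in /proc/<PID>/status is 0 when seccomp is off, 1 in strict mode, and 2 when a filter (such as Docker's default profile) is active. A minimal sketch that turns the number into words; `seccomp_mode` is a made-up helper name.

```shell
# Translate the Seccomp field of a /proc/<PID>/status file into words.
# $1: path to the status file
seccomp_mode() {
  mode=$(awk '/^Seccomp:/ { print $2 }' "$1")
  case "$mode" in
    0) echo disabled ;;
    1) echo strict ;;
    2) echo filter ;;
    *) echo unknown ;;
  esac
}

# Example (on a Docker host):
#   seccomp_mode "/proc/$(docker inspect --format '{{.State.Pid}}' container_name)/status"
```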

Interview Deep Dive Questions

These are the questions that separate junior from senior candidates:

What is the difference between a container and a VM?

Answer: VMs virtualize hardware with a hypervisor, running a complete guest OS. Containers virtualize the OS, sharing the host kernel and using namespaces/cgroups for isolation. Containers are lighter (MBs vs GBs), faster to start (seconds vs minutes), but less isolated (shared kernel).

Explain how Docker uses namespaces

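Answer: Linux provides several namespace types: PID (process isolation), NET (network isolation), MNT (filesystem isolation), UTS (hostname), IPC (inter-process communication), USER (UID/GID mapping), and CGROUP. By default Docker gives each container its own PID, NET, MNT, UTS, and IPC namespaces (plus CGROUP on cgroup v2 hosts), making it appear as an isolated system. The USER namespace is opt-in, through userns-remap or rootless mode.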

What is copy-on-write and why does Docker use it?

Answer: Copy-on-write is a strategy where resources are shared until modified. In Docker, image layers are read-only and shared between containers. When a container modifies a file, it is copied to the container’s writable layer first. This enables: instant container startup, efficient disk usage (shared layers), and immutable base images.

What happens when you run docker run?

Answer: 1) CLI sends API request to dockerd, 2) dockerd checks for image, pulls if needed, 3) containerd is instructed to create container, 4) containerd calls runc with OCI spec, 5) runc creates namespaces, configures cgroups, sets up rootfs, 6) runc executes the entrypoint, 7) containerd monitors the process.

Why might running as root in a container be dangerous?

Answer: By default, root in a container is UID 0, same as host root. If a container escape occurs (kernel vulnerability, misconfiguration), the attacker has root access to the host. Mitigations: user namespaces, rootless containers, drop capabilities, run as non-root user inside container.

Explain the difference between overlay2 and aufs

Answer: Both are union filesystems for Docker. OverlayFS (overlay2) is in mainline Linux kernel since 3.18, faster, and the modern default. AuFS is older, not in mainline kernel (requires patching), slower, but was the original Docker storage driver. overlay2 is preferred for all modern deployments.

Debugging with Internals Knowledge

Inspect Container Namespaces

# Get container PID on host
docker inspect --format '{{.State.Pid}}' container_name

# View namespaces
ls -la /proc/<PID>/ns/

# Enter container namespace from host
nsenter --target <PID> --mount --uts --ipc --net --pid
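Each entry under /proc/<PID>/ns/ is a symlink whose target names the namespace, for example net:[4026531840]; two processes share a namespace exactly when those targets match. The sketch below wraps that comparison. The name `same_namespace` and the optional proc-root argument are assumptions for illustration, and reading another user's /proc/<PID>/ns usually requires root.

```shell
# Succeed if two processes share the namespace of the given type.
# $1, $2: PIDs
# $3: namespace type (pid, net, mnt, uts, ipc, user, cgroup)
# $4: proc root (defaults to /proc)
# Returns 0 if shared, 1 if different, 2 if a link could not be read.
same_namespace() {
  proc="${4:-/proc}"
  a=$(readlink "$proc/$1/ns/$3") || return 2
  b=$(readlink "$proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

# Example (as root on a Docker host): compare init with a container process
#   same_namespace 1 "$(docker inspect --format '{{.State.Pid}}' container_name)" net \
#     && echo "shares host network" || echo "isolated network"
```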

Examine Cgroups

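# Find container's cgroup
cat /proc/<PID>/cgroup

# View memory limit (cgroup v1; on cgroup v2, read memory.max under
# the cgroup reported by the previous command)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes

# View CPU quota (cgroup v1; on cgroup v2, read cpu.max instead)
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.cfs_quota_us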
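The quota only makes sense next to its period: the CPU limit is quota divided by period, both in microseconds, so `--cpus=1` typically appears as a quota of 100000 with a period of 100000. On cgroup v1 the two values live in cpu.cfs_quota_us and cpu.cfs_period_us; on cgroup v2 they share one file, cpu.max, in the form "quota period". A small sketch of the arithmetic, with `cpu_limit` as a hypothetical helper name:

```shell
# Convert a CFS quota and period (microseconds) into a CPU count.
# A quota of -1 (cgroup v1) or "max" (cgroup v2) means no limit.
cpu_limit() {
  quota="$1"
  period="$2"
  case "$quota" in
    -1|max) echo unlimited ;;
    *) awk -v q="$quota" -v p="$period" 'BEGIN { printf "%.2f\n", q / p }' ;;
  esac
}

# Examples:
#   cpu_limit 150000 100000          # values read from the cgroup v1 files
#   cpu_limit $(cat <cgroup_dir>/cpu.max)   # cgroup v2, unquoted on purpose
```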

Trace System Calls

# See what syscalls a running container makes (run on the host as root;
# requires strace to be installed)
strace -f -p "$(docker inspect --format '{{.State.Pid}}' container_name)"

Key Takeaways

Containers are isolated processes, not VMs - they share the host kernel
Namespaces provide the “what you can see” isolation
Cgroups provide the “what you can use” resource limits
Union filesystems enable efficient layered images with copy-on-write
Docker is a stack: CLI -> dockerd -> containerd -> runc
OCI standards ensure portability across container runtimes
Security requires understanding - shared kernel means shared risk

Ready to see these internals in action? Next up: Docker Images where we will build optimized, multi-stage Dockerfiles with layer caching.
