Docker Internals Deep Dive
If you love understanding how things actually work, this chapter is for you. If you just want to use Docker and ship code, feel free to skip ahead. No judgment.
This chapter pulls back the curtain on Docker. We will explore the Linux kernel features that make containers possible, understand how the Docker daemon orchestrates everything, and demystify the image layer system. This knowledge separates container users from container engineers.
Why Internals Matter
Understanding Docker internals helps you:
- Debug production issues when containers misbehave
- Optimize performance by understanding resource allocation
- Ace interviews where internals questions are common
- Make informed decisions about container security
- Build better Dockerfiles by understanding how layers work
The Big Secret: Containers Are Not Virtual Machines
Here is the fundamental truth that changes everything: containers are just isolated Linux processes. There is no hypervisor. There is no guest operating system. Containers share the host kernel and use kernel features to create isolation.
Compared with virtual machines, this makes containers:
- Faster to start (no OS boot, just a process start)
- More lightweight (no duplicate OS, shared kernel)
- Less isolated (shared kernel attack surface)
The Three Pillars of Container Isolation
Docker leverages three key Linux kernel features to create containers:
1. Namespaces - What You Can See
Namespaces provide isolation by limiting what a process can see. Each namespace type isolates a different aspect of the system.
| Namespace | Flag | What It Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs - container sees only its own processes |
| NET | CLONE_NEWNET | Network interfaces, IP addresses, routing tables |
| MNT | CLONE_NEWNS | Mount points - container has its own filesystem view |
| UTS | CLONE_NEWUTS | Hostname and domain name |
| IPC | CLONE_NEWIPC | Inter-process communication (shared memory, semaphores) |
| USER | CLONE_NEWUSER | User and group IDs (root in container != root on host) |
| CGROUP | CLONE_NEWCGROUP | Cgroup root directory |
PID Namespace in Action
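One way to see a PID namespace at work is the NSpid field in /proc/<pid>/status (available on Linux 4.1 and later). It lists a process's PID in each nested PID namespace, outermost first. The sketch below is a minimal illustration: parse_nspid is a hypothetical helper name, and the sample text is made up for the example rather than captured from a real host.

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid line of /proc/<pid>/status.

    The list runs from the outermost PID namespace to the innermost,
    so for a containerized process the last entry is the PID that the
    process sees for itself inside the container.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # field missing (older kernel) or not a status file


# Illustrative sample, not real output: host PID 12345, container PID 1.
SAMPLE_STATUS = "Name:\tnginx\nPid:\t12345\nNSpid:\t12345\t1\n"

if __name__ == "__main__":
    pids = parse_nspid(SAMPLE_STATUS)
    print("PID on the host:", pids[0])
    print("PID inside the container:", pids[-1])
```

On a real host you would feed it the contents of /proc/<pid>/status for a container's main process. The same process shows up with a large PID on the host and typically as PID 1 inside its own namespace.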
NET Namespace Deep Dive
Each container gets its own:
- Network interfaces (typically eth0)
- IP address
- Routing table
- Firewall rules (iptables)
- /proc/net directory
On the default bridge network, containers reach the host and each other through the docker0 bridge.
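Because /proc/net is per-namespace, reading /proc/net/dev from inside a container shows only that container's interfaces. Here is a small sketch of a parser for that file. The function name and the abbreviated sample are assumptions for illustration; a real file has more columns.

```python
def interfaces_from_proc_net_dev(text):
    """Return interface names listed in the text of /proc/net/dev."""
    names = []
    # The first two lines are column headers, so skip them.
    for line in text.splitlines()[2:]:
        name, separator, _counters = line.partition(":")
        if separator:
            names.append(name.strip())
    return names


# Illustrative, shortened sample of what a container might see.
SAMPLE_PROC_NET_DEV = """\
Inter-|   Receive                |  Transmit
 face |bytes    packets errs drop|bytes    packets errs drop
    lo:       0       0    0    0       0       0    0    0
  eth0:    1500      10    0    0     900       8    0    0
"""

if __name__ == "__main__":
    print(interfaces_from_proc_net_dev(SAMPLE_PROC_NET_DEV))
```

Run against the host's own /proc/net/dev, the same parser would typically also list docker0 and one veth interface per running container.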
2. Cgroups - What You Can Use
Control Groups (cgroups) limit, account, and isolate resource usage. While namespaces control what you can see, cgroups control what you can use.
| Resource | Cgroup Controller | What It Controls |
|---|---|---|
| CPU | cpu, cpuset | CPU time, which CPUs to use |
| Memory | memory | RAM limits, swap |
| I/O | blkio | Block device I/O bandwidth |
| Network | net_cls, net_prio | Network traffic classification |
| PIDs | pids | Maximum number of processes |
Setting Resource Limits
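Limits are set per container when it is started. The sketch below builds a docker run command line with the --memory, --cpus, and --pids-limit flags. docker_run_args is a hypothetical helper and the values are arbitrary examples; the function only assembles the argument list and does not start anything.

```python
def docker_run_args(image, memory=None, cpus=None, pids_limit=None):
    """Build an argument list for `docker run` with optional limits."""
    args = ["docker", "run", "--detach"]
    if memory is not None:
        args += ["--memory", memory]            # e.g. "512m"
    if cpus is not None:
        args += ["--cpus", str(cpus)]           # e.g. 1.5 CPUs
    if pids_limit is not None:
        args += ["--pids-limit", str(pids_limit)]
    args.append(image)
    return args


if __name__ == "__main__":
    print(" ".join(docker_run_args("nginx", memory="512m", cpus=1.5, pids_limit=100)))
```

On a machine with Docker installed, the returned list could be passed to subprocess.run(args, check=True). Building a list rather than a shell string avoids quoting problems.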
What Happens When Limits Are Exceeded
- Memory: The OOM (Out of Memory) killer terminates the process
- CPU: Process is throttled, not killed
- PIDs: Fork bomb protection - fork() syscalls fail
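CPU throttling is easier to reason about once you see the numbers behind it. Docker documents --cpus=1.5 as equivalent to a CFS quota of 150000 microseconds per 100000 microsecond period; on cgroup v2 hosts that pair appears in the cpu.max file, and throttling counters appear in cpu.stat. The helpers below are a sketch under those assumptions, with hypothetical names and made-up sample values.

```python
def cpus_to_cpu_max(cpus, period_us=100_000):
    """Convert a --cpus style value into a cgroup v2 'quota period' string."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"


def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _space, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


# Illustrative values only.
SAMPLE_CPU_STAT = "usage_usec 900000\nnr_periods 10\nnr_throttled 3\nthrottled_usec 120000\n"

if __name__ == "__main__":
    print(cpus_to_cpu_max(1.5))
    stats = parse_cpu_stat(SAMPLE_CPU_STAT)
    print("throttled in", stats["nr_throttled"], "of", stats["nr_periods"], "periods")
```

A growing nr_throttled value is a sign that the process keeps hitting its quota and being paused until the next period, which is the "throttled, not killed" behavior described above.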
3. Union Filesystems - The Layer Cake
Union filesystems allow stacking multiple directories as a single unified view. This is the magic behind Docker’s efficient image layering.
How Layers Work
Copy-on-Write (CoW) Strategy
When a container modifies a file from a lower layer:
- The file is copied up to the container’s writable layer
- The modification happens on the copy
- The original file in the image layer remains unchanged
This has practical consequences:
- 100 containers from the same image share the base layers (massive space savings)
- Container startup is fast (no copying, just a thin writable layer on top)
- Deleting a file that lives in a lower layer does not reduce image size (a whiteout file is created instead)
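The copy-up and whiteout behavior can be modeled in a few lines. This is a toy in-memory sketch, not how OverlayFS is implemented: layers are plain dictionaries, UnionView is a hypothetical class name, and whiteouts inside image layers are ignored for brevity.

```python
WHITEOUT = object()  # marker meaning "hidden by the writable layer"


class UnionView:
    """Toy model: read-only image layers plus one writable layer."""

    def __init__(self, image_layers):
        self.image_layers = image_layers  # list of dicts, bottom layer first
        self.upper = {}                   # this container's writable layer

    def read(self, path):
        if path in self.upper:
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        for layer in reversed(self.image_layers):  # top-most layer wins
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, extra):
        # Copy-up: take the visible content, then modify only the copy.
        self.upper[path] = self.read(path) + extra

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if nothing is visible
        if any(path in layer for layer in self.image_layers):
            self.upper[path] = WHITEOUT  # hide it; lower layers keep the bytes
        else:
            del self.upper[path]


if __name__ == "__main__":
    base = {"/etc/motd": "hello\n"}
    first, second = UnionView([base]), UnionView([base])
    first.append("/etc/motd", "from first\n")
    print(repr(first.read("/etc/motd")), repr(second.read("/etc/motd")))
```

Both views share the same base dictionary, which mirrors many containers sharing one set of image layers. Changes in one view never leak into the other or into the base.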
Storage Drivers
Docker supports multiple storage drivers, each implementing image layering differently:
| Driver | Technology | Best For |
|---|---|---|
| overlay2 | OverlayFS | Modern default, most Linux distros |
| aufs | AuFS | Legacy Ubuntu |
| btrfs | Btrfs | Btrfs filesystems |
| zfs | ZFS | ZFS filesystems |
| devicemapper | Device Mapper | RHEL/CentOS legacy |
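For OverlayFS, the layer stack is ultimately expressed as mount options: a colon-separated lowerdir list (top-most layer first), plus an upperdir and a workdir for the writable layer. The sketch below assembles such an option string. The helper name and every path are hypothetical; the directories Docker actually uses are not shown here.

```python
def overlay_mount_options(image_layers, upperdir, workdir):
    """Build an OverlayFS option string.

    image_layers is ordered bottom layer first, as layers are built.
    OverlayFS expects the top-most lower directory first, so reverse it.
    """
    lowerdir = ":".join(reversed(image_layers))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"


if __name__ == "__main__":
    options = overlay_mount_options(
        ["/layers/base", "/layers/app"],  # hypothetical layer directories
        upperdir="/containers/c1/diff",
        workdir="/containers/c1/work",
    )
    print(options)
```

A string like this is what would follow -o in a mount -t overlay command, with the merged directory as the mount point.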
Docker Daemon Architecture
The Docker daemon (dockerd) is the brain of Docker. Let us trace what happens when you run docker run nginx.
The Request Flow
Component Breakdown
Docker CLI: The command-line interface you interact with. Translates commands into REST API calls.
Docker Daemon (dockerd):
- Listens on Unix socket /var/run/docker.sock
- Manages images, networks, volumes
- Orchestrates container lifecycle
- Exposes the Docker API
containerd:
- High-level container runtime
- Manages container lifecycle (create, start, stop)
- Handles image pull/push
- CNCF graduated project (also used by Kubernetes)
runc:
- Low-level OCI runtime
- Actually creates and runs containers
- Sets up namespaces and cgroups
- Reference implementation of OCI Runtime Spec
What Happens During docker run nginx
- CLI parses command, sends POST to /containers/create
- dockerd checks if the nginx image exists locally
- If not, dockerd tells containerd to pull from registry
- containerd downloads image layers, unpacks them
- dockerd creates container metadata
- containerd calls runc with container spec
- runc creates namespaces, sets up cgroups, mounts filesystem
- runc executes the container’s entrypoint (nginx)
- runc exits, containerd monitors the running container
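The first step of that flow, the CLI talking to dockerd, is plain HTTP over the Unix socket, so it can be sketched with the standard library alone. This is a simplified illustration, not the CLI's actual code: the class and function names are hypothetical, it assumes the image is already present locally, and it skips pulling, API versioning, and most error handling.

```python
import http.client
import json
import socket

DOCKER_SOCKET = "/var/run/docker.sock"


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=10):
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def create_request(image):
    """Return (method, path, headers, body) for the container-create call."""
    body = json.dumps({"Image": image})
    return "POST", "/containers/create", {"Content-Type": "application/json"}, body


def create_and_start(image, socket_path=DOCKER_SOCKET):
    """Create a container from a local image, start it, and return its ID."""
    conn = UnixHTTPConnection(socket_path)
    try:
        method, path, headers, body = create_request(image)
        conn.request(method, path, body=body, headers=headers)
        response = conn.getresponse()
        payload = json.loads(response.read())
        if response.status != 201:
            raise RuntimeError(f"create failed ({response.status}): {payload}")
        container_id = payload["Id"]
        conn.request("POST", f"/containers/{container_id}/start")
        started = conn.getresponse()
        started.read()
        if started.status not in (204, 304):
            raise RuntimeError(f"start failed ({started.status})")
        return container_id
    finally:
        conn.close()
```

Calling create_and_start("nginx") requires a running daemon and permission to open the socket. Everything after the HTTP call (containerd, runc, namespaces, cgroups) happens on the daemon side.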
The OCI Specification
The Open Container Initiative (OCI) defines industry standards for containers:
OCI Runtime Spec
Defines how to run a container given a filesystem bundle:
- config.json - container configuration
- rootfs/ - container root filesystem
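To make the bundle less abstract, here is a sketch that generates a deliberately tiny config.json. It uses field names from the OCI Runtime Spec (ociVersion, process, root, linux.namespaces), but a config generated by a real tool such as runc spec also carries mounts, capabilities, and more. The helper name and the default values are assumptions for illustration.

```python
import json


def minimal_oci_config(args, hostname="sketch"):
    """Return a small, incomplete OCI runtime config as a dict."""
    namespace_types = ("pid", "network", "mount", "ipc", "uts")
    return {
        "ociVersion": "1.0.2",
        "process": {
            "args": list(args),              # command to run in the container
            "cwd": "/",
            "user": {"uid": 0, "gid": 0},
        },
        "root": {"path": "rootfs", "readonly": True},  # relative to the bundle
        "hostname": hostname,
        "linux": {"namespaces": [{"type": t} for t in namespace_types]},
    }


if __name__ == "__main__":
    print(json.dumps(minimal_oci_config(["nginx", "-g", "daemon off;"]), indent=2))
```

Notice how the namespaces list lines up with the namespace table earlier in the chapter: the runtime reads this list to decide which new namespaces to create for the process.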
OCI Image Spec
Defines the format of container images:
- Manifest (list of layers)
- Configuration (env vars, entrypoint)
- Layers (tar.gz of filesystem changes)
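The manifest ties those three pieces together with content-addressed descriptors: each one records a media type, a sha256 digest, and a size. The sketch below builds a manifest from raw bytes. The helper names are hypothetical and the blobs in the example are placeholders, not a real config or real layer archives.

```python
import hashlib
import json

MANIFEST_TYPE = "application/vnd.oci.image.manifest.v1+json"
CONFIG_TYPE = "application/vnd.oci.image.config.v1+json"
LAYER_TYPE = "application/vnd.oci.image.layer.v1.tar+gzip"


def descriptor(media_type, blob):
    """Describe a blob by media type, sha256 digest, and size in bytes."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


def build_manifest(config_blob, layer_blobs):
    """Build an OCI image manifest dict; layers are ordered bottom first."""
    return {
        "schemaVersion": 2,
        "mediaType": MANIFEST_TYPE,
        "config": descriptor(CONFIG_TYPE, config_blob),
        "layers": [descriptor(LAYER_TYPE, blob) for blob in layer_blobs],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real config and real layer archives.
    manifest = build_manifest(b'{"placeholder": true}', [b"layer-1", b"layer-2"])
    print(json.dumps(manifest, indent=2))
```

Because a layer is identified by the digest of its content, two images that contain an identical layer reference the same blob, which is what lets registries and hosts store it once.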
Security Implications
Understanding internals reveals security considerations:
Shared Kernel Attack Surface
Since containers share the host kernel, a kernel vulnerability can affect every container on that host.
Root in Container vs Root on Host
By default, root inside a container is UID 0 - the same as host root. If a process escapes the container, it has root privileges on the host.
Mitigation: User Namespaces
Capabilities
Linux capabilities break down root privileges into smaller units. Docker drops many capabilities by default, including CAP_SYS_ADMIN, CAP_NET_ADMIN, and CAP_SYS_MODULE.
Seccomp Profiles
Seccomp profiles limit which system calls a container can make.
Interview Deep Dive Questions
These are the questions that separate junior from senior candidates:
What is the difference between a container and a VM?
Answer: VMs virtualize hardware with a hypervisor, running a complete guest OS. Containers virtualize the OS, sharing the host kernel and using namespaces/cgroups for isolation. Containers are lighter (MBs vs GBs), faster to start (seconds vs minutes), but less isolated (shared kernel).
Explain how Docker uses namespaces
Answer: Docker uses 7 Linux namespaces: PID (process isolation), NET (network isolation), MNT (filesystem isolation), UTS (hostname), IPC (inter-process communication), USER (UID/GID mapping), and CGROUP. Each container runs in its own namespace set, making it appear as an isolated system.
What is copy-on-write and why does Docker use it?
Answer: Copy-on-write is a strategy where resources are shared until modified. In Docker, image layers are read-only and shared between containers. When a container modifies a file, it is copied to the container’s writable layer first. This enables: instant container startup, efficient disk usage (shared layers), and immutable base images.
What happens when you run docker run?
Answer: 1) CLI sends API request to dockerd, 2) dockerd checks for image, pulls if needed, 3) containerd is instructed to create container, 4) containerd calls runc with OCI spec, 5) runc creates namespaces, configures cgroups, sets up rootfs, 6) runc executes the entrypoint, 7) containerd monitors the process.
Why might running as root in a container be dangerous?
Answer: By default, root in a container is UID 0, same as host root. If a container escape occurs (kernel vulnerability, misconfiguration), the attacker has root access to the host. Mitigations: user namespaces, rootless containers, drop capabilities, run as non-root user inside container.
Explain the difference between overlay2 and aufs
Answer: Both are union filesystems for Docker. OverlayFS (overlay2) is in mainline Linux kernel since 3.18, faster, and the modern default. AuFS is older, not in mainline kernel (requires patching), slower, but was the original Docker storage driver. overlay2 is preferred for all modern deployments.
Debugging with Internals Knowledge
Inspect Container Namespaces
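Every process exposes its namespaces as symlinks under /proc/<pid>/ns, with targets such as pid:[4026531836]. Two processes share a namespace exactly when the inode numbers match. The sketch below reads and compares those links. The function names are hypothetical, and reading another process's links usually requires root.

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target like 'pid:[4026531836]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    for entry in os.listdir(ns_dir):
        _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        namespaces[entry] = inode
    return namespaces


def shared_namespaces(first, second):
    """Return the sorted names of namespaces that both mappings share."""
    return sorted(name for name in first if second.get(name) == first[name])


if __name__ == "__main__":
    # Made-up inode numbers: same user namespace, different pid and net.
    host = {"pid": 4026531836, "net": 4026531840, "user": 4026531837}
    container = {"pid": 4026532201, "net": 4026532204, "user": 4026531837}
    print(shared_namespaces(host, container))
```

To compare a live container with the host, take the container's main PID from docker inspect --format '{{.State.Pid}}' <container> and pass it to read_namespaces alongside read_namespaces("self").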
Examine Cgroups
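To find the limits that apply to a process, start from /proc/<pid>/cgroup, which names the cgroup the process belongs to, then read the matching files under /sys/fs/cgroup. The sketch below assumes a cgroup v2 host (a single line of the form 0::<path>). The helper names and the sample cgroup path are made up for illustration.

```python
import posixpath


def cgroup_v2_path(proc_cgroup_text):
    """Return the cgroup v2 path from the text of /proc/<pid>/cgroup."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        # The cgroup v2 entry has hierarchy ID 0 and an empty controller list.
        if len(parts) == 3 and parts[0] == "0" and parts[1] == "":
            return parts[2]
    return None


def limit_file(cgroup_path, name, root="/sys/fs/cgroup"):
    """Build the path of a control file such as memory.max or pids.max."""
    return posixpath.join(root, cgroup_path.lstrip("/"), name)


def parse_memory_max(text):
    """Return the memory limit in bytes, or None when it is 'max' (unlimited)."""
    value = text.strip()
    return None if value == "max" else int(value)


if __name__ == "__main__":
    sample = "0::/system.slice/docker-abc123.scope\n"  # hypothetical cgroup path
    path = cgroup_v2_path(sample)
    print(limit_file(path, "memory.max"))
    print(parse_memory_max("536870912\n"))
```

On a real host, a container started with --memory 512m should show 536870912 in its memory.max, which is a quick way to confirm that a limit actually reached the kernel.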
Trace System Calls
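Since a container is an ordinary process on the host, a tracer such as strace can attach to it by host PID (for example strace -f -p <pid>). Raw traces get long quickly, so a small summarizer helps. The sketch below counts syscalls by name in strace-style text. The function name is hypothetical and the sample lines are illustrative rather than captured from a real run.

```python
import re
from collections import Counter

# Matches 'name(' at line start, optionally after a '[pid 1234] ' prefix
# of the kind strace adds when following child processes.
SYSCALL_LINE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")


def count_syscalls(strace_output):
    """Count syscalls by name, ignoring signal and exit marker lines."""
    counts = Counter()
    for line in strace_output.splitlines():
        match = SYSCALL_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample in strace's output style.
SAMPLE_TRACE = "\n".join([
    'openat(AT_FDCWD, "/etc/nginx/nginx.conf", O_RDONLY) = 3',
    '[pid  1234] read(3, "events {}", 4096) = 9',
    'read(3, "", 4096) = 0',
    "--- SIGCHLD {si_signo=SIGCHLD} ---",
    "+++ exited with 0 +++",
])

if __name__ == "__main__":
    for name, count in count_syscalls(SAMPLE_TRACE).most_common():
        print(name, count)
```

A summary like this is also a reasonable starting point when deciding which system calls a seccomp profile needs to allow for a given workload.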
Key Takeaways
- Containers are isolated processes, not VMs - they share the host kernel
- Namespaces provide the “what you can see” isolation
- Cgroups provide the “what you can use” resource limits
- Union filesystems enable efficient layered images with copy-on-write
- Docker is a stack: CLI -> dockerd -> containerd -> runc
- OCI standards ensure portability across container runtimes
- Security requires understanding - shared kernel means shared risk
Ready to see these internals in action? Next up: Docker Images, where we will build optimized, multi-stage Dockerfiles with layer caching.