# Docker Internals

> How Docker actually works under the hood - namespaces, cgroups, and the container runtime

# Docker Internals Deep Dive

> **If you love understanding how things actually work, this chapter is for you. If you just want to use Docker and ship code, feel free to skip ahead. No judgment.**

This chapter pulls back the curtain on Docker. We will explore the Linux kernel features that make containers possible, understand how the Docker daemon orchestrates everything, and demystify the image layer system. This knowledge separates container users from container engineers.

***

## Why Internals Matter

Understanding Docker internals helps you:

* **Debug production issues** when containers misbehave
* **Optimize performance** by understanding resource allocation
* **Ace interviews** where internals questions are common
* **Make informed decisions** about container security
* **Build better Dockerfiles** by understanding how layers work

***

## The Big Secret: Containers Are Not Virtual Machines

Here is the fundamental truth that changes everything: **containers are just isolated Linux processes**. There is no hypervisor. There is no guest operating system. Containers share the host kernel and use kernel features to create isolation.

```
Virtual Machine:
┌─────────────────────────────────────────────────────────┐
│                     Hypervisor                          │
├───────────────┬───────────────┬───────────────┬────────┤
│   Guest OS    │   Guest OS    │   Guest OS    │        │
│   (Full OS)   │   (Full OS)   │   (Full OS)   │        │
├───────────────┼───────────────┼───────────────┤        │
│     App       │     App       │     App       │ Host OS│
└───────────────┴───────────────┴───────────────┴────────┘

Container:
┌─────────────────────────────────────────────────────────┐
│                    Docker Engine                         │
├───────────────┬───────────────┬───────────────┬────────┤
│   Container   │   Container   │   Container   │        │
│   (Process)   │   (Process)   │   (Process)   │        │
│               │               │               │ Host   │
│     App       │     App       │     App       │ Kernel │
└───────────────┴───────────────┴───────────────┴────────┘
```

This is why containers are:

* **Faster to start** (no OS boot, just process fork)
* **More lightweight** (no duplicate OS, shared kernel)
* **Less isolated** (shared kernel attack surface)

***

## The Three Pillars of Container Isolation

Docker leverages three key Linux kernel features to create containers:

### 1. Namespaces - What You Can See

Namespaces provide **isolation** by limiting what a process can see. Each namespace type isolates a different aspect of the system.

| Namespace  | Flag              | What It Isolates                                        |
| ---------- | ----------------- | ------------------------------------------------------- |
| **PID**    | `CLONE_NEWPID`    | Process IDs - container sees only its own processes     |
| **NET**    | `CLONE_NEWNET`    | Network interfaces, IP addresses, routing tables        |
| **MNT**    | `CLONE_NEWNS`     | Mount points - container has its own filesystem view    |
| **UTS**    | `CLONE_NEWUTS`    | Hostname and domain name                                |
| **IPC**    | `CLONE_NEWIPC`    | Inter-process communication (shared memory, semaphores) |
| **USER**   | `CLONE_NEWUSER`   | User and group IDs (opt-in in Docker; maps container root to an unprivileged host UID) |
| **CGROUP** | `CLONE_NEWCGROUP` | Cgroup root directory                                   |

#### PID Namespace in Action

```bash theme={null}
# On the host, you see all processes
ps aux | wc -l
# Output: 247

# Inside a container, you only see container processes
docker run --rm alpine ps aux
# PID   USER     TIME  COMMAND
#     1 root      0:00 ps aux
```

The container believes it is running PID 1 (init), but on the host, it has a completely different PID. This isolation is what makes containers feel like separate machines.
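
One way to check this from the host is to compare namespace identifiers under `/proc`. The sketch below assumes a Linux host with read access to `/proc/<pid>/ns`; the helper names `ns_inode` and `same_pid_ns` are illustrative and not part of any Docker tooling.

```bash
#!/usr/bin/env bash
# Each entry in /proc/<pid>/ns is a symlink such as "pid:[4026531836]".
# Two processes share a namespace exactly when those identifiers match.

# Extract the numeric identifier from a link target like "pid:[4026531836]".
ns_inode() {
  local link="$1"
  link="${link#*\[}"          # drop everything up to and including "["
  printf '%s\n' "${link%\]}"  # drop the trailing "]"
}

# Succeed if two PIDs live in the same PID namespace.
same_pid_ns() {
  local a b
  a="$(readlink "/proc/$1/ns/pid")" || return 2
  b="$(readlink "/proc/$2/ns/pid")" || return 2
  [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

# Example (run on the host, usually as root):
#   same_pid_ns $$ "$(docker inspect --format '{{.State.Pid}}' container_name)" \
#     || echo "container runs in its own PID namespace"
```

For a container started with default settings, the comparison against your host shell should fail, because the container's process has its own PID namespace.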

#### NET Namespace Deep Dive

Each container gets its own:

* Network interfaces (typically `eth0`)
* IP address
* Routing table
* Firewall rules (iptables)
* `/proc/net` directory

```bash theme={null}
# View container network namespace
docker run --rm alpine ip addr
# Shows container's own eth0 with its own IP

# On host, view the veth pair connecting to container
ip link | grep veth
```

Docker creates a **virtual ethernet pair** (veth): one end goes into the container's network namespace, and the other attaches to the `docker0` bridge on the host (when the container uses the default bridge network).
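
To match a host-side veth to its container, you can use the peer index that `ip link` embeds in the interface name. This is a small sketch; `peer_ifindex` is a hypothetical helper, and the interface name in the example is made up.

```bash
# Interface names printed by `ip link` look like "veth3f2a1b@if6":
# the number after "@if" is the interface index of the peer end,
# which lives in another network namespace.
peer_ifindex() {
  case "$1" in
    *@if*) printf '%s\n' "${1##*@if}" ;;
    *)     return 1 ;;   # no peer index in this name
  esac
}

peer_ifindex 'veth3f2a1b@if6'   # prints 6
```

Inside the container, `cat /sys/class/net/eth0/iflink` prints the index of the host-side end, so the two views can be paired up.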

### 2. Cgroups - What You Can Use

**Control Groups (cgroups)** limit, account for, and isolate resource usage. While namespaces control what you can see, cgroups control what you can use.

| Resource    | Cgroup Controller     | What It Controls               |
| ----------- | --------------------- | ------------------------------ |
| **CPU**     | `cpu`, `cpuset`       | CPU time, which CPUs to use    |
| **Memory**  | `memory`              | RAM limits, swap               |
| **I/O**     | `blkio`               | Block device I/O bandwidth     |
| **Network** | `net_cls`, `net_prio` | Network traffic classification |
| **PIDs**    | `pids`                | Maximum number of processes    |

#### Setting Resource Limits

```bash theme={null}
# Limit container to 512MB RAM and 1 CPU
docker run -d \
  --memory=512m \
  --cpus=1 \
  nginx

# View cgroup settings for a container
docker inspect --format='{{.HostConfig.Memory}}' container_id

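# On the host, examine cgroup files directly (cgroup v1 layout)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes
# On cgroup v2 hosts using the systemd cgroup driver, the file is memory.max
cat /sys/fs/cgroup/system.slice/docker-<container_id>.scope/memory.max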
```
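
To make the link between the flags and the cgroup files concrete, the sketch below converts `--memory` and `--cpus` values into the numbers you would expect to find on a cgroup v1 host. The helper names are hypothetical, only integer memory values with `k`/`m`/`g` suffixes are handled, and the 100000 microsecond CFS period is the default Docker pairs with `--cpus`.

```bash
# Convert a --memory value such as "512m" to bytes (binary units).
mem_to_bytes() {
  local value="$1" number unit
  number="${value%[kKmMgG]}"
  unit="${value#"$number"}"
  case "$unit" in
    k|K) echo $(( number * 1024 )) ;;
    m|M) echo $(( number * 1024 * 1024 )) ;;
    g|G) echo $(( number * 1024 * 1024 * 1024 )) ;;
    "")  echo "$number" ;;
  esac
}

# Convert a --cpus value to a CFS quota in microseconds per 100000 us period.
cpus_to_quota_us() {
  awk -v cpus="$1" 'BEGIN { printf "%.0f\n", cpus * 100000 }'
}

mem_to_bytes 512m      # 536870912, the value in memory.limit_in_bytes
cpus_to_quota_us 1     # 100000, the value in cpu.cfs_quota_us
```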

#### What Happens When Limits Are Exceeded

This is critical to understand because the failure modes are completely different depending on the resource:

* **Memory**: The OOM (Out of Memory) killer **terminates the process immediately** -- no graceful shutdown, no cleanup. Your app just disappears. This is the most common cause of mysterious container "crashes" in production.
* **CPU**: The process is **throttled, not killed**. It still runs, just slower. Users experience increased latency rather than errors. This is a gentler failure mode than an OOM kill, although heavy throttling can still push requests past their timeouts.
* **PIDs**: Fork bomb protection -- `fork()` syscalls return errors. This prevents a runaway process from spawning thousands of children and taking down the host.

```bash theme={null}
# The infamous OOM kill -- the container is terminated instantly.
# `tail` keeps buffering while it waits for a newline that /dev/zero never sends.
docker run --memory=10m alpine tail /dev/zero
# The container is killed once it exceeds 10MB

# Check if a container was OOM-killed
docker inspect --format='{{.State.OOMKilled}}' container_name
```
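
The exit code is another clue. An OOM-killed container typically reports 137, which is 128 plus the signal number of `SIGKILL` (9). The helper below is an illustrative sketch of that arithmetic; its name is an assumption.

```bash
# Exit codes above 128 mean "terminated by signal (code - 128)".
signal_from_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    return 1   # the process exited on its own
  fi
}

signal_from_exit_code 137   # prints 9 (SIGKILL)
```

You can read the code with `docker inspect --format '{{.State.ExitCode}}' container_name`. A `docker kill` also produces 137, so treat it as a hint and confirm with `.State.OOMKilled`.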

<Tip>
  **Production gotcha**: Java applications are especially prone to OOM kills because the JVM uses memory beyond the heap. A container with `--memory=512m` running a JVM with `-Xmx512m` is likely to be killed once the heap fills, because the JVM also needs memory for metaspace, thread stacks, and native allocations. Set `-Xmx` to about 75% of your container memory limit (e.g., `-Xmx384m` for a 512MB container).
</Tip>
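
The 75% rule of thumb from the tip is simple arithmetic. A minimal sketch, with a hypothetical helper name and the percentage as an adjustable assumption:

```bash
# Suggest a heap size (in MB) as a percentage of the container memory limit.
heap_mb_for_limit() {
  local limit_mb="$1" percent="${2:-75}"
  echo $(( limit_mb * percent / 100 ))
}

echo "-Xmx$(heap_mb_for_limit 512)m"   # -Xmx384m
```

On container-aware JVMs, `-XX:MaxRAMPercentage=75.0` expresses the same idea without hard-coding a number. The right percentage depends on how much non-heap memory your application uses.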

### 3. Union Filesystems - The Layer Cake

Union filesystems allow **stacking multiple directories as a single unified view**. This is the magic behind Docker's efficient image layering.

#### How Layers Work

```
Docker Image Layers (Read-Only):
┌─────────────────────────────────────┐
│  Layer 4: COPY app.py /app/         │  <- Your code (small)
├─────────────────────────────────────┤
│  Layer 3: RUN pip install flask     │  <- Dependencies
├─────────────────────────────────────┤
│  Layer 2: RUN apt-get update        │  <- System packages
├─────────────────────────────────────┤
│  Layer 1: Ubuntu Base Image         │  <- Base OS (shared!)
└─────────────────────────────────────┘

Container Layer (Read-Write):
┌─────────────────────────────────────┐
│  Container Layer (ephemeral)        │  <- Runtime changes
└─────────────────────────────────────┘
```

#### Copy-on-Write (CoW) Strategy

Copy-on-write is a resource-sharing strategy borrowed from operating systems. Think of it like a shared library book -- everyone reads the same copy, but if someone wants to write notes in the margins, they photocopy the page first and write on the copy.

When a container modifies a file from a lower (read-only) layer:

1. The file is **copied up** to the container's writable layer
2. The modification happens on the copy
3. The original file in the image layer remains unchanged

This is why:

* **100 containers from the same image share the base layers** -- massive space savings; you do not have 100 copies of Ubuntu on disk
* **Container startup is instant** -- no copying the entire filesystem, just adding an empty writable layer on top
* **Deleting a file does not reduce image size** -- a "whiteout" marker is created in the upper layer that hides the file, but the original bytes still exist in the lower layer

<Tip>
  **Performance implication**: The first write to any file from a lower layer incurs a copy-up penalty. For write-heavy workloads (like databases), this overhead is significant. That is why databases should always write to volumes, not to the container's writable layer.
</Tip>
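
The copy-up and whiteout rules can be imitated with two ordinary directories. This is a toy model to illustrate the lookup order, not how OverlayFS is implemented: the real work happens in the kernel, and the function names and file names here are invented for the example.

```bash
#!/usr/bin/env bash
# Toy model of union-filesystem rules using two plain directories.
lower="$(mktemp -d)"   # stands in for a read-only image layer
upper="$(mktemp -d)"   # stands in for the container's writable layer

# Read: a whiteout hides the file; otherwise the upper layer wins.
# (".wh.<name>" is the whiteout naming used in image layer tarballs;
# OverlayFS itself marks deletions with a 0/0 character device.)
union_read() {
  local name="$1"
  [ -e "$upper/.wh.$name" ] && return 1
  if [ -e "$upper/$name" ]; then cat "$upper/$name"
  elif [ -e "$lower/$name" ]; then cat "$lower/$name"
  else return 1
  fi
}

# Write: copy the file up on first modification, then change only the copy.
union_append() {
  local name="$1" text="$2"
  if [ -e "$upper/.wh.$name" ]; then
    rm -f "$upper/.wh.$name"             # previously deleted: start fresh
  elif [ ! -e "$upper/$name" ] && [ -e "$lower/$name" ]; then
    cp "$lower/$name" "$upper/$name"     # the copy-up step
  fi
  printf '%s\n' "$text" >> "$upper/$name"
}

# Delete: record a whiteout; the lower layer keeps its bytes.
union_delete() {
  local name="$1"
  rm -f "$upper/$name"
  : > "$upper/.wh.$name"
}

printf 'listen 80\n' > "$lower/app.conf"
union_append app.conf 'listen 443'
union_read app.conf      # prints both lines from the upper copy
cat "$lower/app.conf"    # still prints only "listen 80"
union_delete app.conf
union_read app.conf || echo "app.conf is hidden, but its bytes remain in the lower layer"
```

The last line is the reason a `RUN rm` in a later Dockerfile instruction does not shrink an image: the delete only adds a marker on top.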

#### Storage Drivers

Docker supports multiple storage drivers, each implementing layering and copy-on-write differently:

| Driver           | Technology    | Best For                           |
| ---------------- | ------------- | ---------------------------------- |
| **overlay2**     | OverlayFS     | Modern default, most Linux distros |
| **aufs**         | AuFS          | Legacy Ubuntu (deprecated)         |
| **btrfs**        | Btrfs         | Btrfs filesystems                  |
| **zfs**          | ZFS           | ZFS filesystems                    |
| **devicemapper** | Device Mapper | RHEL/CentOS legacy (deprecated)    |

```bash theme={null}
# Check your storage driver
docker info | grep "Storage Driver"
# Storage Driver: overlay2
```
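
With `overlay2`, the layer stack of a running container is visible in its mount options: `lowerdir` lists the read-only directories separated by colons, and `upperdir` is the writable layer. The sketch below counts the lower directories in such an option string; the helper name and the sample paths are assumptions for illustration.

```bash
# Count the read-only layers in an overlay mount's option string.
count_lower_layers() {
  local opts="$1" lowerdir
  lowerdir="$(printf '%s\n' "$opts" | tr ',' '\n' | sed -n 's/^lowerdir=//p')"
  [ -n "$lowerdir" ] || { echo 0; return; }
  printf '%s\n' "$lowerdir" | tr ':' '\n' | wc -l | tr -d ' '
}

# Shortened sample in the shape of an overlay entry from /proc/mounts
count_lower_layers 'rw,lowerdir=/l/A:/l/B:/l/C,upperdir=/u/diff,workdir=/u/work'   # prints 3
```

On a Docker host, `grep overlay /proc/mounts` shows the real option strings. The count will not necessarily equal the number of Dockerfile instructions, since not every instruction produces a filesystem layer.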

***

## Docker Daemon Architecture

The Docker daemon (`dockerd`) is the brain of Docker. Let us trace what happens when you run `docker run nginx`.

### The Request Flow

```
┌─────────────┐     REST API      ┌──────────────┐
│   Docker    │ ───────────────── │   dockerd    │
│   CLI       │   /var/run/       │  (daemon)    │
└─────────────┘   docker.sock     └──────┬───────┘
                                         │
                                         │ gRPC
                                         ▼
                                  ┌──────────────┐
                                  │  containerd  │
                                  │  (runtime)   │
                                  └──────┬───────┘
                                         │
                                         │ OCI Runtime
                                         ▼
                                  ┌──────────────┐
                                  │    runc      │
                                  │ (container)  │
                                  └──────────────┘
```

### Component Breakdown

**Docker CLI**: The command-line interface you interact with. Translates commands into REST API calls.

**Docker Daemon (dockerd)**:

* Listens on Unix socket `/var/run/docker.sock`
* Manages images, networks, volumes
* Orchestrates container lifecycle
* Exposes the Docker API

**containerd**:

* High-level container runtime
* Manages container lifecycle (create, start, stop)
* Handles image pull/push
* CNCF graduated project (also used by Kubernetes)

**runc**:

* Low-level OCI runtime
* Actually creates and runs containers
* Sets up namespaces and cgroups
* Reference implementation of OCI Runtime Spec

### What Happens During `docker run nginx`

1. **CLI** parses command, sends POST to `/containers/create`
2. **dockerd** checks if `nginx` image exists locally
3. If not, **dockerd** tells **containerd** to pull from registry
4. **containerd** downloads image layers, unpacks them
5. **dockerd** creates container metadata
6. **containerd** calls **runc** with container spec
7. **runc** creates namespaces, sets up cgroups, mounts filesystem
8. **runc** executes the container's entrypoint (nginx)
9. **runc** exits; a **containerd-shim** process stays behind as the container's parent and reports its state to **containerd**
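
Because the CLI is only an API client, the first step can be reproduced with `curl` against the socket. The sketch below prints the two requests behind `docker run -d IMAGE` instead of sending them; it assumes the default socket path, the function name is hypothetical, and `<id>` stands for the container ID returned by the create call.

```bash
# Print (do not send) the Engine API requests behind `docker run -d IMAGE`.
docker_run_requests() {
  local image="$1" sock="/var/run/docker.sock"
  printf "curl --unix-socket %s -X POST -H 'Content-Type: application/json' -d '{\"Image\":\"%s\"}' http://localhost/containers/create\n" "$sock" "$image"
  printf "curl --unix-socket %s -X POST http://localhost/containers/<id>/start\n" "$sock"
}

docker_run_requests nginx
```

Running the printed commands by hand (with the real ID substituted) requires the same permissions as the `docker` CLI, which is a useful reminder that access to the socket is access to the daemon.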

***

## The OCI Specification

The **Open Container Initiative (OCI)** defines industry standards for containers:

### OCI Runtime Spec

Defines how to run a container given a filesystem bundle:

* `config.json` - container configuration
* `rootfs/` - container root filesystem

A simplified `config.json`:

```json theme={null}
{
  "ociVersion": "1.0.0",
  "process": {
    "args": ["nginx", "-g", "daemon off;"],
    "env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"],
    "cwd": "/"
  },
  "root": {
    "path": "rootfs",
    "readonly": false
  },
  "linux": {
    "namespaces": [
      {"type": "pid"},
      {"type": "network"},
      {"type": "mount"}
    ]
  }
}
```
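
The `namespaces` list in `config.json` corresponds closely to the flags of the `unshare(1)` utility, which makes for a handy mental model. The mapping below is a sketch with a hypothetical function name; it covers the namespace primitive only, whereas runc also switches the root filesystem, applies cgroups, and drops privileges.

```bash
# Map OCI namespace types (as used in config.json) to unshare(1) flags.
oci_ns_to_unshare_flags() {
  local flags=() type
  for type in "$@"; do
    case "$type" in
      pid)     flags+=(--pid) ;;
      network) flags+=(--net) ;;
      mount)   flags+=(--mount) ;;
      ipc)     flags+=(--ipc) ;;
      uts)     flags+=(--uts) ;;
      user)    flags+=(--user) ;;
      cgroup)  flags+=(--cgroup) ;;
      *)       echo "unknown namespace type: $type" >&2; return 1 ;;
    esac
  done
  echo "${flags[*]}"
}

oci_ns_to_unshare_flags pid network mount   # prints: --pid --net --mount
```

To experiment, something like `sudo unshare --pid --net --mount --fork sh` starts a shell in fresh namespaces; `--fork` is needed so the shell becomes PID 1 of the new PID namespace.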

### OCI Image Spec

Defines the format of container images:

* Manifest (list of layers)
* Configuration (env vars, entrypoint)
* Layers (tar.gz of filesystem changes)

This standardization means images built with Docker work with Podman, containerd, CRI-O, and other OCI-compliant runtimes.
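
Manifests refer to the configuration and to each layer by content digest, which is how identical layers end up stored only once. A minimal sketch of computing a digest in that `sha256:<hex>` form, assuming `sha256sum` from GNU coreutils is installed (the function name is illustrative):

```bash
# Compute a content digest in the "sha256:<hex>" form for a file.
oci_digest() {
  printf 'sha256:%s\n' "$(sha256sum "$1" | cut -d' ' -f1)"
}

# Example: oci_digest layer.tar.gz
```

Two blobs with the same bytes always produce the same digest, so a registry or local store can tell that it already has a layer without comparing contents.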

***

## Security Implications

Understanding internals reveals security considerations:

### Shared Kernel Attack Surface

Since containers share the host kernel, a kernel vulnerability affects all containers.

```bash theme={null}
# Container escapes are possible through kernel or runtime bugs
# Kernel example: CVE-2022-0847 (Dirty Pipe)
# Runtime example: CVE-2019-5736 (runc vulnerability)
```

### Root in Container vs Root on Host

By default, root inside a container is UID 0 -- the same as host root. This means a container escape (via kernel exploit, misconfigured mount, or exposed Docker socket) gives the attacker full root access to the host. Think of it like a jail where the inmates have the warden's keys -- the walls only matter if the keys stay outside.

**Mitigation: User Namespaces**

```bash theme={null}
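# Map container root (UID 0) to an unprivileged host user (e.g., UID 100000).
# Even if an attacker escapes the container, they land as a nobody on the host.
# Remapping is a daemon-level setting, not a `docker run` flag:
dockerd --userns-remap=default
# or set "userns-remap": "default" in /etc/docker/daemon.json and restart the daemon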
```
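
The mapping itself is plain offset arithmetic over a range declared in `/etc/subuid`. With `--userns-remap=default`, Docker uses a `dockremap` user; the entry below is an example value, and `host_uid_for` is a hypothetical helper.

```bash
# Given an /etc/subuid entry ("name:start:count") and a UID inside the
# container, print the host UID it maps to under user-namespace remapping.
host_uid_for() {
  local entry="$1" container_uid="$2" start count
  IFS=: read -r _ start count <<< "$entry"
  [ "$container_uid" -lt "$count" ] || return 1   # outside the mapped range
  echo $(( start + container_uid ))
}

host_uid_for 'dockremap:100000:65536' 0      # prints 100000 (container root)
host_uid_for 'dockremap:100000:65536' 1000   # prints 101000
```

Files that container root creates on a bind mount therefore show up on the host as owned by UID 100000 in this example, which is the ownership confusion people run into when they first enable remapping.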

<Warning>
  **The Docker socket is the crown jewel**: Mounting `/var/run/docker.sock` into a container (common in CI/CD setups) gives that container full control over the Docker daemon -- which means full root access to the host. Never mount the Docker socket unless absolutely necessary, and when you do, put a proxy like [Docker Socket Proxy](https://github.com/Tecnativa/docker-socket-proxy) in front of it to restrict API calls. Mounting the socket read-only (`:ro`) does not help, because API requests still go through.
</Warning>

### Capabilities

Linux capabilities break down root privileges into smaller units. Docker drops many capabilities by default:

```bash theme={null}
# View container capabilities
docker run --rm alpine cat /proc/1/status | grep Cap

# Add specific capability
docker run --cap-add NET_ADMIN alpine
```
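
The `Cap*` lines in `/proc/<pid>/status` are hexadecimal bitmasks, one bit per capability. The sketch below tests individual bits; `has_cap` is a hypothetical helper, the bit numbers come from `<linux/capability.h>`, and the sample mask is the `CapEff` value commonly seen with Docker's default capability set (yours may differ by version and configuration).

```bash
# Test whether a capability bit is set in a hex mask from /proc/<pid>/status.
has_cap() {
  local mask="$1" bit="$2"
  (( (0x$mask >> bit) & 1 ))
}

# Bit numbers from <linux/capability.h>
CAP_NET_ADMIN=12
CAP_NET_RAW=13
CAP_SYS_ADMIN=21

mask=00000000a80425fb   # sample CapEff value (assumption, see above)
has_cap "$mask" "$CAP_NET_RAW"   && echo "NET_RAW: granted"
has_cap "$mask" "$CAP_NET_ADMIN" || echo "NET_ADMIN: dropped"
has_cap "$mask" "$CAP_SYS_ADMIN" || echo "SYS_ADMIN: dropped"
```

Where the libcap tools are installed, `capsh --decode=00000000a80425fb` prints the full list of names for a mask.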

### Seccomp Profiles

Limit which system calls a container can make:

```bash theme={null}
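# The default seccomp profile is applied automatically -- no flag is needed
docker run nginx

# Supply a custom profile instead
docker run --security-opt seccomp=/path/to/profile.json nginx

# Check if seccomp is enabled
docker info | grep -i seccomp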
```

***

## Interview Deep Dive Questions

These are the questions that separate junior from senior candidates:

<AccordionGroup>
  <Accordion title="What is the difference between a container and a VM?" icon="circle-question">
    **Answer**: VMs virtualize hardware with a hypervisor, running a complete guest OS. Containers virtualize the OS, sharing the host kernel and using namespaces/cgroups for isolation. Containers are lighter (MBs vs GBs), faster to start (seconds vs minutes), but less isolated (shared kernel).
  </Accordion>

  <Accordion title="Explain how Docker uses namespaces" icon="circle-question">
    **Answer**: Docker builds each container from a set of Linux namespaces: PID (process isolation), NET (network isolation), MNT (filesystem isolation), UTS (hostname), IPC (inter-process communication), CGROUP (cgroup root), and optionally USER (UID/GID mapping, which is not enabled by default). Each container runs in its own namespace set, making it appear as an isolated system.
  </Accordion>

  <Accordion title="What is copy-on-write and why does Docker use it?" icon="circle-question">
    **Answer**: Copy-on-write is a strategy where resources are shared until modified. In Docker, image layers are read-only and shared between containers. When a container modifies a file, it is copied to the container's writable layer first. This enables: instant container startup, efficient disk usage (shared layers), and immutable base images.
  </Accordion>

  <Accordion title="What happens when you run docker run?" icon="circle-question">
    **Answer**: 1) CLI sends API request to dockerd, 2) dockerd checks for image, pulls if needed, 3) containerd is instructed to create container, 4) containerd calls runc with OCI spec, 5) runc creates namespaces, configures cgroups, sets up rootfs, 6) runc executes the entrypoint, 7) containerd monitors the process.
  </Accordion>

  <Accordion title="Why might running as root in a container be dangerous?" icon="circle-question">
    **Answer**: By default, root in a container is UID 0, same as host root. If a container escape occurs (kernel vulnerability, misconfiguration), the attacker has root access to the host. Mitigations: user namespaces, rootless containers, drop capabilities, run as non-root user inside container.
  </Accordion>

  <Accordion title="Explain the difference between overlay2 and aufs" icon="circle-question">
    **Answer**: Both are union filesystems for Docker. OverlayFS (overlay2) is in mainline Linux kernel since 3.18, faster, and the modern default. AuFS is older, not in mainline kernel (requires patching), slower, but was the original Docker storage driver. overlay2 is preferred for all modern deployments.
  </Accordion>
</AccordionGroup>

***

## Debugging with Internals Knowledge

### Inspect Container Namespaces

```bash theme={null}
# Get container PID on host
docker inspect --format '{{.State.Pid}}' container_name

# View namespaces
ls -la /proc/<PID>/ns/

# Enter container namespace from host
nsenter --target <PID> --mount --uts --ipc --net --pid
```

### Examine Cgroups

```bash theme={null}
# Find container's cgroup
cat /proc/<PID>/cgroup

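# View memory limit (cgroup v1; on cgroup v2 read memory.max under the path printed above)
cat /sys/fs/cgroup/memory/docker/<container_id>/memory.limit_in_bytes

# View CPU quota (cgroup v1; on cgroup v2 read cpu.max)
cat /sys/fs/cgroup/cpu/docker/<container_id>/cpu.cfs_quota_us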
```
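
When you start from an arbitrary host PID, the cgroup path also tells you which container it belongs to. The sketch below pulls the 64-character container ID out of a `/proc/<pid>/cgroup` line in either the cgroup v1 or the v2 (systemd driver) layout; the helper name and the sample ID are made up for illustration.

```bash
# Extract a 64-character container ID from a /proc/<pid>/cgroup line.
container_id_from_cgroup() {
  printf '%s\n' "$1" | grep -oE '[0-9a-f]{64}' | head -n 1
}

id="0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
container_id_from_cgroup "12:memory:/docker/$id"               # cgroup v1 style
container_id_from_cgroup "0::/system.slice/docker-$id.scope"   # cgroup v2 style
```

Run this against `/proc/<PID>/cgroup` on the host. From inside a container that has its own cgroup namespace, the file may show only `0::/`, so there is nothing to extract.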

### Trace System Calls

```bash theme={null}
# See what syscalls a running container makes by attaching from the host
sudo strace -f -p "$(docker inspect --format '{{.State.Pid}}' container_name)"
```

***

## Key Takeaways

1. **Containers are isolated processes**, not VMs - they share the host kernel
2. **Namespaces** provide the "what you can see" isolation
3. **Cgroups** provide the "what you can use" resource limits
4. **Union filesystems** enable efficient layered images with copy-on-write
5. **Docker is a stack**: CLI -> dockerd -> containerd -> runc
6. **OCI standards** ensure portability across container runtimes
7. **Security requires understanding** - shared kernel means shared risk

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="A security team asks you to evaluate whether Docker containers provide sufficient isolation for running untrusted code from third-party vendors. What is your assessment?" icon="circle-question">
    **Strong Answer:**

    * My short answer is no -- not by default. Containers share the host kernel, and kernel vulnerabilities can lead to container escapes. CVE-2019-5736 (runc vulnerability) demonstrated that a malicious container could overwrite the host runc binary and gain root access to the host. For untrusted code, the shared kernel is an unacceptable attack surface.
    * For running untrusted workloads, I would recommend one of three approaches depending on the security requirements. First, gVisor (Google's user-space kernel) intercepts syscalls and implements a subset of Linux in user space. The untrusted code never touches the real kernel. The trade-off is performance overhead (\~5-15% for syscall-heavy workloads) and incomplete syscall coverage. Second, Kata Containers run each container inside a lightweight VM, giving you hardware-level isolation with near-container ergonomics. The trade-off is higher memory usage (each "container" boots a minimal VM kernel) and slower startup. Third, Firecracker (from AWS, powers Lambda) provides microVM isolation with very fast boot times (\~125ms) and minimal memory overhead (\~5MB per VM).
    * If the team insists on standard Docker containers, I would layer defenses: user namespaces (so root in the container maps to a nobody on the host), seccomp profiles that restrict syscalls to a minimal allowlist, AppArmor or SELinux mandatory access controls, dropping all capabilities with `--cap-drop=ALL`, and read-only filesystems. But even with all of these, I would not call it "sufficient" for genuinely adversarial code -- defense in depth reduces risk but does not eliminate the fundamental shared-kernel exposure.

    **Follow-up: How do user namespaces change the security model, and why are they not enabled by default in Docker?**

    User namespaces remap UID 0 (root) inside the container to a high-numbered unprivileged UID on the host (e.g., 100000). If an attacker escapes the container, they land as an unprivileged user on the host, dramatically limiting the damage. They are not enabled by default because they introduce complexity: file ownership on bind mounts becomes confusing (files appear owned by nobody), some applications break when they detect they are not "real" root, and there are subtle interactions with storage drivers. Docker has been improving support over time, but many teams still leave them disabled for compatibility. For untrusted workloads, the friction is worth accepting.
  </Accordion>

  <Accordion title="Explain copy-on-write in Docker, and then tell me why running a database directly on the container's writable layer is a performance disaster." icon="circle-question">
    **Strong Answer:**

    * Copy-on-write means that when a container needs to modify a file from a read-only image layer, OverlayFS copies the entire file from the lower layer to the container's writable upper layer before the modification happens. The original file remains unchanged in the lower layer. All subsequent reads and writes for that file go to the upper layer copy.
    * For a database like PostgreSQL, the write pattern is catastrophic. A database does random I/O across many data files. Every first write to a data page triggers a copy-up from the lower layer. The copy-up is a synchronous operation that blocks the write. For a busy database, this means hundreds or thousands of copy-ups at startup, each involving a full-file copy (not just the modified page). After the copy-up, subsequent writes to that file are at native speed, but the initial penalty is significant.
    * Beyond performance, there is a durability issue. The container's writable layer is ephemeral -- it vanishes when the container is removed. A `docker restart` is safe (the writable layer survives), but `docker rm` destroys the database: a subsequent `docker run` starts with a fresh layer and no data.
    * The fix is volumes. A named volume or bind mount bypasses the union filesystem entirely. The database reads and writes directly to the volume's filesystem (ext4, XFS, etc.) with zero copy-on-write overhead. The volume persists independently of the container lifecycle.
    * In my experience, this is the single most impactful performance and reliability fix for teams running databases in Docker. I have seen PostgreSQL write throughput improve 3-5x just by switching from container-layer storage to a named volume.

    **Follow-up: If 100 containers are created from the same image, how much disk space do the image layers consume?**

    The image layers are shared. Whether you run 1 container or 100, the base layers exist only once on disk. Each container adds only its own thin writable layer (typically a few kilobytes to megabytes, depending on what the container writes at runtime). This is the efficiency of copy-on-write: a team running 100 microservice instances from the same Node.js base image does not have 100 copies of Node.js on disk. The deduplication happens at the storage driver level (OverlayFS lower directories are shared).
  </Accordion>

  <Accordion title="Your organization mounts /var/run/docker.sock into CI/CD containers so they can build Docker images. A security auditor flags this. Explain the risk and propose alternatives." icon="circle-question">
    **Strong Answer:**

    * Mounting the Docker socket into a container gives that container full, unrestricted access to the Docker daemon API. The Docker daemon runs as root on the host. This means the CI container can: start new privileged containers, mount the host filesystem, access any container on the host, pull/push any image, and effectively has root access to the entire host machine. If the CI job runs untrusted code (e.g., building a pull request from an external contributor), that code inherits full host root access.
    * This is not a theoretical risk. In practice, attackers exploit this by running `docker run -v /:/host --privileged alpine chroot /host` from inside the CI container -- this gives them a root shell on the host.
    * Alternatives, in order of preference: First, use Kaniko (Google's in-container image builder). Kaniko builds Docker images from a Dockerfile entirely in user space, without a Docker daemon. It runs as a regular container and pushes directly to a registry. This is the standard approach for Kubernetes-based CI (Tekton, GitHub Actions on GKE). Second, use BuildKit's rootless mode, which runs the build daemon as a non-root user. Third, use Podman, which is daemonless and can build images without a socket. Fourth, if you must expose the Docker daemon, use a Docker Socket Proxy (like Tecnativa's) that restricts API calls -- allow image build and push but block container creation, volume mounts, and privileged operations.
    * The broader principle: CI/CD pipelines are a high-value target for supply chain attacks. Minimizing privileges in the build environment is not optional -- it is a core security requirement.

    **Follow-up: The team uses Docker-in-Docker (dind) as an alternative. Is that safe?**

    Docker-in-Docker (running a Docker daemon inside a container) is better than mounting the host socket because the inner daemon is isolated from the host daemon. However, it still typically runs as privileged (`--privileged` flag), which disables most of Docker's security features. A container escape from the inner Docker daemon can reach the outer Docker daemon. The best dind alternative is running in rootless mode or using sysbox (a container runtime that enables Docker-in-Docker without `--privileged`). But in most cases, Kaniko or BuildKit is still the preferred approach because it eliminates the daemon entirely.
  </Accordion>
</AccordionGroup>

***

Ready to see these internals in action? Next up: [Docker Images](/courses/devops-tools/docker-images) where we will build optimized, multi-stage Dockerfiles with layer caching.
