# Kubernetes Internals

> How Kubernetes actually works - etcd, scheduler, controller patterns, and the API machinery

# Kubernetes Internals Deep Dive

> **If you love understanding how things actually work, this chapter is for you. If you just want to deploy pods and call it a day, feel free to skip ahead. No judgment.**

This chapter takes you inside the Kubernetes control plane. We will explore how etcd stores cluster state, how the scheduler makes placement decisions, and why the controller pattern is one of the most elegant designs in distributed systems. This knowledge is what separates Kubernetes operators from Kubernetes engineers.

***

## Why Internals Matter

Understanding Kubernetes internals helps you:

* **Troubleshoot production outages** when pods refuse to schedule
* **Optimize cluster performance** by understanding scheduler behavior
* **Design better applications** that work with Kubernetes, not against it
* **Ace interviews** where internals questions are standard
* **Build operators and controllers** that extend Kubernetes

***

## The Declarative Model: Desired State vs Actual State

Kubernetes is built on a simple but powerful idea: **you declare what you want, and controllers work to make reality match your declaration**.

```yaml theme={null}
# You declare: "I want 3 nginx replicas"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
```

Kubernetes stores this as **desired state** and continuously works to ensure **actual state** matches it.

```
┌─────────────────────────────────────────────────────────────────┐
│                    The Reconciliation Loop                       │
│                                                                  │
│   ┌──────────────┐                        ┌──────────────┐       │
│   │  Desired     │     Controllers        │   Actual     │       │
│   │  State       │ ──────────────────────▶│   State      │       │
│   │  (etcd)      │   "Make it so"         │  (Reality)   │       │
│   └──────────────┘                        └──────────────┘       │
│          ▲                                       │               │
│          │         Observe & Compare             │               │
│          └───────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────────────┘
```

This model is why:

* **Self-healing works** - controllers constantly reconcile
* **Rollbacks are easy** - just change the desired state
* **Scaling is declarative** - change `replicas: 3` to `replicas: 10`

***

## etcd: The Cluster Brain

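**etcd** is a distributed key-value store that holds the entire cluster state. If etcd is lost, the control plane can no longer read or change that state: containers that are already running keep running, but nothing new can be scheduled, healed, or updated.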

### What is Stored in etcd

```
/registry/
├── pods/
│   └── default/
│       └── nginx-abc123
├── deployments/
│   └── default/
│       └── nginx
├── services/
│   └── default/
│       └── nginx-service
├── secrets/
│   └── default/
│       └── api-key
└── configmaps/
    └── default/
        └── app-config
```

Every Kubernetes object is stored as a key-value pair:

* **Key**: `/registry/<resource>/<namespace>/<name>`
* **Value**: Serialized object (JSON or Protobuf)
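To make the key layout concrete, here is a minimal sketch of how such a key could be assembled. The `registryKey` helper is hypothetical and only illustrates the pattern; treating an empty namespace as "cluster-scoped" is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"path"
)

// registryKey builds a key following the /registry/<resource>/<namespace>/<name>
// layout. An empty namespace stands for a cluster-scoped object (an assumption
// of this sketch), in which case the namespace segment is omitted.
func registryKey(resource, namespace, name string) string {
	if namespace == "" {
		return path.Join("/registry", resource, name)
	}
	return path.Join("/registry", resource, namespace, name)
}

func main() {
	fmt.Println(registryKey("pods", "default", "nginx-abc123"))
	fmt.Println(registryKey("namespaces", "", "kube-system"))
}
```

Because every object of one kind shares a prefix, a single prefix query or watch covers all pods, or all pods in one namespace.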

### etcd Consistency Model

etcd uses the **Raft consensus algorithm** to ensure consistency:

```
┌─────────┐     ┌─────────┐     ┌─────────┐
│  etcd   │     │  etcd   │     │  etcd   │
│ Leader  │────▶│Follower │     │Follower │
│  (R/W)  │     │  (R/O)  │     │  (R/O)  │
└─────────┘     └─────────┘     └─────────┘
     │                ▲               ▲
     │                │               │
     └────────────────┴───────────────┘
           Replicated writes
```

* **Leader** handles all writes
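* **Followers** replicate the leader's log; linearizable reads (the default) are confirmed through the leader, while serializable reads can be answered locally by a follower and may be slightly stale
* **Quorum** required for writes: ⌊n/2⌋ + 1 nodes must agree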
* **3 nodes** tolerates 1 failure, **5 nodes** tolerates 2
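The quorum arithmetic is easy to check by hand. The sketch below uses two hypothetical helpers, `quorum` and `faultTolerance`, to show why even-sized clusters buy no extra safety.

```go
package main

import "fmt"

// quorum returns how many members must agree before a write commits.
// Integer division floors n/2, so this is floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while a quorum
// is still reachable.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 4, 5} {
		fmt.Printf("members=%d quorum=%d tolerates=%d\n", n, quorum(n), faultTolerance(n))
	}
}
```

Note that 4 members tolerate only 1 failure, the same as 3, which is why odd cluster sizes are the norm.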

### Watches: The Real-Time Notification System

This is the magic that makes Kubernetes responsive:

```bash theme={null}
# etcd watch on pods
etcdctl watch --prefix /registry/pods/
```

When you create a pod:

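1. The API server writes the pod to etcd
2. etcd notifies the API server, which is the only component watching etcd directly
3. The API server streams the event to its own watchers (scheduler, controllers, kubelets)
4. The scheduler sees an unscheduled pod and binds it to a node
5. The kubelet on that node sees the assignment and starts the containers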

This event-driven architecture is why Kubernetes reacts in seconds, not minutes.
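The watch mechanism can be pictured as a store that pushes changes to anyone subscribed to a key prefix. The following is a toy, single-goroutine sketch of that idea; `Store`, `Event`, and `Watch` are hypothetical names, not the etcd client API.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a simplified change notification.
type Event struct {
	Key   string
	Value string
}

type watcher struct {
	prefix string
	ch     chan Event
}

// Store is a toy in-memory key-value store with prefix watches.
type Store struct {
	data     map[string]string
	watchers []watcher
}

func NewStore() *Store { return &Store{data: map[string]string{}} }

// Watch registers interest in every key under prefix.
func (s *Store) Watch(prefix string) <-chan Event {
	ch := make(chan Event, 16) // buffered so Put does not block in this sketch
	s.watchers = append(s.watchers, watcher{prefix: prefix, ch: ch})
	return ch
}

// Put stores the value and notifies each watcher whose prefix matches.
func (s *Store) Put(key, value string) {
	s.data[key] = value
	for _, w := range s.watchers {
		if strings.HasPrefix(key, w.prefix) {
			w.ch <- Event{Key: key, Value: value}
		}
	}
}

func main() {
	s := NewStore()
	pods := s.Watch("/registry/pods/")

	s.Put("/registry/pods/default/nginx-abc123", "Pending")
	s.Put("/registry/services/default/nginx-service", "ClusterIP") // not under the watched prefix

	ev := <-pods
	fmt.Println(ev.Key, ev.Value)
}
```

Subscribers never poll: they block on the channel and wake up only when something under their prefix changes.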

***

## The API Server: Central Hub

The **kube-apiserver** is the only component that talks to etcd. Everything else goes through it.

### Request Flow

```
kubectl apply -f pod.yaml
        │
        ▼
┌──────────────────────────────────────────────────────────────────┐
│                          API Server                               │
│                                                                   │
│  ┌──────────┐   ┌──────────────┐   ┌────────────┐   ┌─────────┐ │
│  │ Authent- │──▶│   Authoriz-  │──▶│ Admission  │──▶│ Valida- │ │
│  │ ication  │   │   ation      │   │ Controllers│   │ tion    │ │
│  │ (Who?)   │   │ (Allowed?)   │   │ (Mutate?)  │   │ (Valid?)│ │
│  └──────────┘   └──────────────┘   └────────────┘   └─────────┘ │
│                                                          │       │
│                                            ┌─────────────▼─────┐ │
│                                            │  Write to etcd   │ │
│                                            └───────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
```
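The request flow in the diagram is essentially a chain of stages, each of which can reject the request before it reaches storage. Here is a simplified sketch of that chain. The `Request` type, the stage functions, and the toy policy ("only jane may create objects") are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Request is a hypothetical, heavily simplified API request.
type Request struct {
	User     string
	Resource string
	Name     string
	Labels   map[string]string
}

// Stage is one step of the pipeline; it may mutate or reject the request.
type Stage func(*Request) error

func authenticate(r *Request) error {
	if r.User == "" {
		return errors.New("401: no identity presented")
	}
	return nil
}

func authorize(r *Request) error {
	if r.User != "jane" { // toy policy, not real RBAC
		return errors.New("403: " + r.User + " may not create " + r.Resource)
	}
	return nil
}

func admit(r *Request) error {
	if r.Labels == nil {
		r.Labels = map[string]string{}
	}
	if _, ok := r.Labels["team"]; !ok {
		r.Labels["team"] = "unassigned" // mutating admission: fill in a default
	}
	return nil
}

func validate(r *Request) error {
	if r.Name == "" {
		return errors.New("422: metadata.name is required")
	}
	return nil
}

// handle runs the stages in order and persists only if all of them pass.
func handle(r *Request, store map[string]*Request) error {
	for _, stage := range []Stage{authenticate, authorize, admit, validate} {
		if err := stage(r); err != nil {
			return err
		}
	}
	store[r.Resource+"/"+r.Name] = r // stands in for the etcd write
	return nil
}

func main() {
	store := map[string]*Request{}

	ok := &Request{User: "jane", Resource: "pods", Name: "nginx"}
	fmt.Println(handle(ok, store), ok.Labels["team"], len(store))

	anon := &Request{Resource: "pods", Name: "nginx2"}
	fmt.Println(handle(anon, store), len(store))
}
```

The key property is ordering: a request that fails an early stage never reaches the later ones, and nothing is written unless every stage passes.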

### Authentication: Who Are You?

Methods supported:

* **X.509 Client Certificates** - Strong, and common for cluster components and admins, but hard to revoke before they expire
* **Bearer Tokens** - Service accounts
* **OpenID Connect** - Enterprise SSO
* **Webhooks** - Custom authentication

### Authorization: Are You Allowed?

**RBAC (Role-Based Access Control)** is the standard:

```yaml theme={null}
# Role: What actions are allowed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

---
# RoleBinding: Who gets the role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io  # required in roleRef
```
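Conceptually, evaluating a Role comes down to checking whether any rule matches the requested API group, resource, and verb. The sketch below shows that matching logic; `PolicyRule` here is a trimmed-down stand-in, and `allowed` is a hypothetical helper rather than the real authorizer.

```go
package main

import "fmt"

// PolicyRule mirrors the three fields used in the Role above.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// contains reports whether list includes want or the "*" wildcard.
func contains(list []string, want string) bool {
	for _, v := range list {
		if v == want || v == "*" {
			return true
		}
	}
	return false
}

// allowed reports whether any rule permits the verb on the resource.
// RBAC is additive: there are no deny rules, so no match means denied.
func allowed(rules []PolicyRule, apiGroup, resource, verb string) bool {
	for _, r := range rules {
		if contains(r.APIGroups, apiGroup) && contains(r.Resources, resource) && contains(r.Verbs, verb) {
			return true
		}
	}
	return false
}

func main() {
	podReader := []PolicyRule{{
		APIGroups: []string{""}, // "" is the core API group
		Resources: []string{"pods"},
		Verbs:     []string{"get", "list", "watch"},
	}}

	fmt.Println(allowed(podReader, "", "pods", "list"))   // true
	fmt.Println(allowed(podReader, "", "pods", "delete")) // false
}
```

On a real cluster, `kubectl auth can-i list pods --as jane` asks the API server the same question.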

### Admission Controllers: The Gatekeepers

Admission controllers intercept requests before persistence:

**Mutating Admission** (modifies objects):

* Add default resource limits
* Inject sidecar containers (Istio)
* Add labels/annotations

**Validating Admission** (rejects invalid objects):

* Enforce naming conventions
* Require specific labels
* Block privileged containers

```yaml theme={null}
# Example: Always pull images (via admission controller)
# Input:  imagePullPolicy: IfNotPresent
# Output: imagePullPolicy: Always
```
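The difference between the two admission phases can be sketched in a few lines: mutation changes the object, validation only accepts or rejects it. The `Container` type and the functions below are hypothetical simplifications, not the admission webhook API.

```go
package main

import (
	"errors"
	"fmt"
)

// Container is a hypothetical, trimmed-down container spec.
type Container struct {
	Name            string
	ImagePullPolicy string
	Privileged      bool
}

// mutate mirrors the "always pull images" example: it rewrites the policy.
func mutate(c *Container) { c.ImagePullPolicy = "Always" }

// validate rejects privileged containers without changing the object.
func validate(c Container) error {
	if c.Privileged {
		return errors.New("privileged containers are not allowed: " + c.Name)
	}
	return nil
}

// admit runs mutation first, then validation, so validators see the final object.
func admit(c *Container) error {
	mutate(c)
	return validate(*c)
}

func main() {
	web := Container{Name: "web", ImagePullPolicy: "IfNotPresent"}
	fmt.Println(admit(&web), web.ImagePullPolicy)

	root := Container{Name: "root-shell", Privileged: true}
	fmt.Println(admit(&root))
}
```

Running mutation before validation matters: a validator that ran first could approve an object that a later mutation then changes.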

***

## The Scheduler: Where Do Pods Go?

The **kube-scheduler** watches for unscheduled pods and assigns them to nodes.

### The Scheduling Algorithm

```
┌──────────────────────────────────────────────────────────────┐
│                     Scheduling Cycle                          │
│                                                               │
│   1. FILTERING (Predicates)                                   │
│      ┌─────────┐                                              │
│      │All Nodes│──▶ Remove nodes that cannot run the pod     │
│      └─────────┘                                              │
│           │                                                   │
│           ▼                                                   │
│   2. SCORING (Priorities)                                     │
│      ┌─────────┐                                              │
│      │Feasible │──▶ Rank remaining nodes by preference       │
│      │ Nodes   │                                              │
│      └─────────┘                                              │
│           │                                                   │
│           ▼                                                   │
│   3. BINDING                                                  │
│      Assign pod to highest-scoring node                       │
└──────────────────────────────────────────────────────────────┘
```

### Filter Plugins (Predicates)

These eliminate unsuitable nodes:

| Filter              | What It Checks                                         |
| ------------------- | ------------------------------------------------------ |
| `NodeResourcesFit`  | Enough CPU/memory?                                     |
| `NodeName`          | Pod requests specific node?                            |
| `NodeAffinity`      | Node matches `nodeSelector` labels and affinity rules? |
| `TaintToleration`   | Pod tolerates node taints?                             |
| `NodePorts`         | Port available on node?                                |
| `PodTopologySpread` | Would violate spread constraints?                      |

### Score Plugins (Priorities)

These rank feasible nodes:

| Plugin                                         | What It Prefers                              |
| ---------------------------------------------- | -------------------------------------------- |
| `NodeResourcesFit` (`LeastAllocated` strategy) | Nodes with more free resources               |
| `NodeResourcesFit` (`MostAllocated` strategy)  | Nodes with less free resources (bin packing) |
| `NodeResourcesBalancedAllocation`              | Balanced CPU/memory usage                    |
| `ImageLocality`                                | Nodes that already have the image            |
| `NodeAffinity`                                 | Nodes matching preferred affinity            |
| `InterPodAffinity`                             | Nodes with related pods                      |

### Example Scheduling Decision

```yaml theme={null}
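# Pod with requirements
apiVersion: v1
kind: Pod
metadata:
  name: ml-trainer        # hypothetical name
spec:
  containers:
  - name: trainer         # name and image are required; these values are placeholders
    image: nginx:1.21
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
  nodeSelector:
    disktype: ssd
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "ml-workload"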

Scheduling flow:

1. **Filter**: Remove nodes without `disktype: ssd` label
2. **Filter**: Remove nodes without enough CPU/memory
3. **Filter**: Remove nodes with untolerated taints
4. **Score**: Rank by resource availability, image locality
5. **Bind**: Assign to highest-scoring node
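The flow above can be sketched as a filter pass followed by a scoring pass. This is a toy model, not the scheduler framework: `Node`, `Pod`, and the scoring rule (prefer the node with the most CPU left after placement) are simplifying assumptions.

```go
package main

import "fmt"

// Node and Pod are hypothetical, trimmed-down stand-ins for the real API types.
type Node struct {
	Name      string
	Labels    map[string]string
	Taints    []string // simplified to "key=value"
	FreeCPU   int      // millicores
	FreeMemMi int
}

type Pod struct {
	CPU, MemMi   int
	NodeSelector map[string]string
	Tolerations  []string // simplified to "key=value"
}

func tolerates(p Pod, taint string) bool {
	for _, t := range p.Tolerations {
		if t == taint {
			return true
		}
	}
	return false
}

// feasible applies the three filters: labels, resources, taints.
func feasible(p Pod, n Node) bool {
	for k, v := range p.NodeSelector {
		if n.Labels[k] != v {
			return false
		}
	}
	if n.FreeCPU < p.CPU || n.FreeMemMi < p.MemMi {
		return false
	}
	for _, t := range n.Taints {
		if !tolerates(p, t) {
			return false
		}
	}
	return true
}

// score prefers nodes with more CPU left after placement.
func score(p Pod, n Node) int { return n.FreeCPU - p.CPU }

// schedule returns the best node name, or "" if no node is feasible.
func schedule(p Pod, nodes []Node) string {
	best, bestScore := "", -1
	for _, n := range nodes {
		if !feasible(p, n) {
			continue
		}
		if s := score(p, n); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best
}

func main() {
	ssd := map[string]string{"disktype": "ssd"}
	nodes := []Node{
		{Name: "node-a", Labels: ssd, FreeCPU: 500, FreeMemMi: 1024},
		{Name: "node-b", Labels: map[string]string{"disktype": "hdd"}, FreeCPU: 4000, FreeMemMi: 8192},
		{Name: "node-c", Labels: ssd, Taints: []string{"dedicated=ml-workload"}, FreeCPU: 2000, FreeMemMi: 4096},
		{Name: "node-d", Labels: ssd, Taints: []string{"gpu=true"}, FreeCPU: 3000, FreeMemMi: 4096},
	}
	pod := Pod{CPU: 250, MemMi: 64, NodeSelector: ssd, Tolerations: []string{"dedicated=ml-workload"}}

	fmt.Println(schedule(pod, nodes)) // node-c
}
```

Here `node-b` fails the label filter and `node-d` carries an untolerated taint, so only `node-a` and `node-c` are scored. An empty result corresponds to a pod that stays `Pending`.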

***

## Controllers: The Brains of Kubernetes

Controllers are the workhorses that make Kubernetes self-healing. Each controller follows the same pattern.

### The Controller Pattern

```go theme={null}
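// Conceptual sketch: real controllers are driven by watch events and a
// work queue rather than a busy loop, but the logic is the same.
for {
    desired := getDesiredState() // read from the object's spec
    actual := getActualState()   // observed from the cluster

    if actual != desired {
        reconcile(actual, desired) // create, update, or delete to close the gap
    }
}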
```

This is an **infinite reconciliation loop**. Controllers never stop watching and fixing.
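Here is a small runnable version of a single reconcile pass, assuming a toy `Cluster` type that tracks running pods by name. It is only meant to show the observe, compare, act cycle, not how the real ReplicaSet controller is written.

```go
package main

import "fmt"

// Cluster is a toy stand-in for the real world: a set of running pod names.
type Cluster struct {
	pods   map[string]bool
	nextID int
}

// reconcile makes one pass: observe, compare, act. It returns true when
// actual already matched desired and nothing had to change.
func reconcile(c *Cluster, desired int) bool {
	actual := len(c.pods)
	switch {
	case actual < desired:
		for i := actual; i < desired; i++ {
			c.nextID++
			c.pods[fmt.Sprintf("nginx-%d", c.nextID)] = true
		}
	case actual > desired:
		for name := range c.pods {
			if len(c.pods) == desired {
				break
			}
			delete(c.pods, name)
		}
	default:
		return true
	}
	return false
}

func main() {
	c := &Cluster{pods: map[string]bool{}}

	reconcile(c, 3)
	fmt.Println(len(c.pods)) // 3

	delete(c.pods, "nginx-2") // a pod dies
	reconcile(c, 3)
	fmt.Println(len(c.pods)) // 3 again: the gap was closed
}
```

The function is idempotent: calling it again when nothing has drifted changes nothing, which is what makes it safe to run on every event.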

### Controller Manager Components

| Controller                    | What It Manages                   |
| ----------------------------- | --------------------------------- |
| **ReplicaSet Controller**     | Ensures correct pod count         |
| **Deployment Controller**     | Manages ReplicaSets for rollouts  |
| **Node Controller**           | Monitors node health, evicts pods |
| **Endpoint Controller**       | Populates Service endpoints       |
| **ServiceAccount Controller** | Creates default accounts          |
| **Job Controller**            | Runs pods to completion           |
| **PV Binder**                 | Binds PVCs to PVs                 |

### Deployment Controller Deep Dive

What happens when you update a Deployment:

```yaml theme={null}
# Change image from nginx:1.20 to nginx:1.21
spec:
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.21  # Changed!
```

1. **Deployment Controller** detects spec change
2. Creates new **ReplicaSet** with new pod template
3. Scales up new ReplicaSet (gradually)
4. Scales down old ReplicaSet (gradually)
5. Updates Deployment status

```
Rolling Update (maxSurge=1, maxUnavailable=0):

Time 0: [Old-1] [Old-2] [Old-3]
Time 1: [Old-1] [Old-2] [Old-3] [New-1]      # +1 new
Time 2: [Old-1] [Old-2] [New-1] [New-2]      # -1 old, +1 new
Time 3: [Old-1] [New-1] [New-2] [New-3]      # -1 old, +1 new
Time 4: [New-1] [New-2] [New-3]              # -1 old, done!
```
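The timeline above can be reproduced with a small simulation. This sketch assumes every pod becomes ready instantly, and the `rollout` function is a hypothetical model of the two limits, not the Deployment controller's actual code.

```go
package main

import "fmt"

// step is one snapshot of the rollout: how many old and new pods exist.
type step struct{ Old, New int }

// rollout simulates a rolling update, assuming pods become ready immediately.
// Each tick first removes old pods while keeping at least
// replicas-maxUnavailable pods available, then adds new pods while keeping
// the total at or below replicas+maxSurge.
func rollout(replicas, maxSurge, maxUnavailable int) []step {
	old, cur := replicas, 0
	steps := []step{{Old: old, New: cur}}
	for old > 0 || cur < replicas {
		removable := old + cur - (replicas - maxUnavailable)
		if removable > old {
			removable = old
		}
		if removable > 0 {
			old -= removable
		}
		addable := replicas + maxSurge - (old + cur)
		if addable > replicas-cur {
			addable = replicas - cur
		}
		if addable > 0 {
			cur += addable
		}
		if removable <= 0 && addable <= 0 {
			break // both limits at zero: the rollout cannot make progress
		}
		steps = append(steps, step{Old: old, New: cur})
	}
	return steps
}

func main() {
	for i, s := range rollout(3, 1, 0) {
		fmt.Printf("Time %d: old=%d new=%d\n", i, s.Old, s.New)
	}
}
```

With `maxSurge=1, maxUnavailable=0` the simulation never drops below 3 pods and never exceeds 4, matching the diagram.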

***

## Kubelet: The Node Agent

The **kubelet** runs on every node and manages pods assigned to that node.

### Kubelet Responsibilities

1. **Pod Lifecycle**: Create, start, stop, delete containers
2. **Health Checks**: Liveness, readiness, startup probes
3. **Resource Reporting**: Tell API server about node capacity
4. **Image Management**: Pull container images
5. **Volume Mounting**: Attach storage to pods

### How Kubelet Runs Containers

```
┌─────────────────────────────────────────────────────────────────┐
│                           Kubelet                                │
│                              │                                   │
│                              ▼                                   │
│                    ┌─────────────────┐                          │
│                    │       CRI       │  Container Runtime        │
│                    │   (Interface)   │  Interface                │
│                    └────────┬────────┘                          │
│                             │                                    │
│         ┌───────────────────┼───────────────────┐               │
│         ▼                   ▼                   ▼               │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐         │
│   │containerd│        │  CRI-O   │        │  Docker  │         │
│   └──────────┘        └──────────┘        └──────────┘         │
└─────────────────────────────────────────────────────────────────┘
```

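**CRI (Container Runtime Interface)** abstracts the container runtime. Kubernetes does not care whether you use containerd, CRI-O, or another CRI-compatible runtime. Docker Engine does not implement CRI itself: since the dockershim was removed in Kubernetes 1.24, it needs the cri-dockerd adapter.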

### Pod Sandbox and Containers

```
Pod:
┌─────────────────────────────────────────────────────────┐
│  Pause Container (Sandbox)                               │
│  - Holds network namespace                               │
│  - Holds IPC namespace                                   │
│  ┌─────────────────┐  ┌─────────────────┐               │
│  │  App Container  │  │  Sidecar        │               │
│  │  (nginx)        │  │  (log-shipper)  │               │
│  └─────────────────┘  └─────────────────┘               │
│                                                          │
│  Shared: Network (IP, ports), IPC, Volumes              │
└─────────────────────────────────────────────────────────┘
```

The **pause container** is started first and holds the pod's shared namespaces open, so they outlive any single app container. App containers join these namespaces.

***

## Networking Internals

Kubernetes networking follows three rules:

1. All pods can communicate with all other pods without NAT
2. All nodes can communicate with all pods without NAT
3. The IP a pod sees for itself is the same IP others see

### CNI: Container Network Interface

**CNI plugins** implement pod networking:

| Plugin          | Approach              | Best For                         |
| --------------- | --------------------- | -------------------------------- |
| **Calico**      | BGP routing, L3       | Large clusters, network policies |
| **Flannel**     | VXLAN overlay         | Simplicity, small clusters       |
| **Cilium**      | eBPF-based            | Performance, observability       |
| **Weave**       | VXLAN, encryption     | Multi-cloud, encrypted overlay   |
| **AWS VPC CNI** | Native AWS networking | EKS clusters                     |

### Pod-to-Pod Communication

```
Pod A (10.244.1.5)                      Pod B (10.244.2.3)
┌──────────────────┐                    ┌──────────────────┐
│                  │                    │                  │
│  eth0: 10.244.1.5│                    │  eth0: 10.244.2.3│
└────────┬─────────┘                    └────────┬─────────┘
         │                                       │
    ┌────┴─────┐                           ┌────┴─────┐
    │   veth   │                           │   veth   │
    └────┬─────┘                           └────┬─────┘
         │                                       │
    ┌────┴──────────────────────────────────────┴────┐
    │                CNI Bridge/Overlay               │
    └────────────────────────┬───────────────────────┘
                             │
                    Physical Network
```

### Service Networking: kube-proxy

**kube-proxy** implements Service load balancing in one of several modes. eBPF is listed for comparison: it is not a kube-proxy mode but a replacement for kube-proxy, as used by Cilium.

| Mode         | How It Works                   | Performance                                 |
| ------------ | ------------------------------ | ------------------------------------------- |
| **iptables** | Rules for packet redirection   | Good, O(n) rules                            |
| **IPVS**     | Linux Virtual Server           | Better, O(1) lookup                         |
| **eBPF**     | Kernel-level packet processing | Best, used by Cilium in place of kube-proxy |

```bash theme={null}
# View iptables rules for a Service
iptables -t nat -L KUBE-SERVICES -n

# KUBE-SVC-xxxx chain load balances to endpoints
# KUBE-SEP-xxxx chains DNAT to actual pod IPs
```
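In iptables mode the rules in a `KUBE-SVC-xxxx` chain are tried top to bottom, each with a random match probability. For the traffic to split evenly, the probabilities have to be 1/n, 1/(n-1), and so on down to 1. The sketch below works through that arithmetic; the helper names are hypothetical.

```go
package main

import "fmt"

// ruleProbabilities returns the match probability of each rule in a chain of
// n backends evaluated top to bottom: 1/n, 1/(n-1), ..., 1/1.
func ruleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := 0; i < n; i++ {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

// effectiveShare returns the overall fraction of traffic each backend gets,
// given that a rule is only reached when all rules above it did not match.
func effectiveShare(probs []float64) []float64 {
	share := make([]float64, len(probs))
	remaining := 1.0
	for i, p := range probs {
		share[i] = remaining * p
		remaining -= share[i]
	}
	return share
}

func main() {
	probs := ruleProbabilities(3)
	fmt.Printf("rule probabilities: %.3f\n", probs)
	fmt.Printf("effective shares:   %.3f\n", effectiveShare(probs))
}
```

With 3 endpoints the rules match with probability 0.333, 0.5, and 1, yet every backend ends up with one third of the traffic.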

***

## Storage Internals

Kubernetes abstracts storage through several layers.

### CSI: Container Storage Interface

```
┌─────────────────────────────────────────────────────────────────┐
│                         Kubernetes                               │
│                              │                                   │
│                              ▼                                   │
│                    ┌─────────────────┐                          │
│                    │       CSI       │  Container Storage        │
│                    │   (Interface)   │  Interface                │
│                    └────────┬────────┘                          │
│                             │                                    │
│         ┌───────────────────┼───────────────────┐               │
│         ▼                   ▼                   ▼               │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐         │
│   │ AWS EBS  │        │ GCP PD   │        │  Ceph    │         │
│   └──────────┘        └──────────┘        └──────────┘         │
└─────────────────────────────────────────────────────────────────┘
```

### PV/PVC Binding

```yaml theme={null}
# PersistentVolumeClaim (User Request)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast-ssd
```

The **PV Binder Controller** matches PVCs to PVs:

1. Find PVs with matching StorageClass
2. Filter by access modes and capacity
3. Select best fit (smallest PV that satisfies request)
4. Bind PVC to PV
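The matching steps above amount to a best-fit search. The following sketch uses simplified, hypothetical `PV` and `PVC` types (capacity as whole GiB, a single requested access mode) to illustrate it.

```go
package main

import "fmt"

// PV and PVC are hypothetical, trimmed-down versions of the real objects.
type PV struct {
	Name         string
	StorageClass string
	AccessModes  []string
	CapacityGi   int
	Bound        bool
}

type PVC struct {
	StorageClass string
	AccessMode   string
	RequestGi    int
}

func hasMode(modes []string, want string) bool {
	for _, m := range modes {
		if m == want {
			return true
		}
	}
	return false
}

// bestFit returns the index of the smallest unbound PV that satisfies the
// claim, or -1 if none does.
func bestFit(claim PVC, pvs []PV) int {
	best := -1
	for i, pv := range pvs {
		if pv.Bound || pv.StorageClass != claim.StorageClass {
			continue
		}
		if !hasMode(pv.AccessModes, claim.AccessMode) || pv.CapacityGi < claim.RequestGi {
			continue
		}
		if best == -1 || pv.CapacityGi < pvs[best].CapacityGi {
			best = i
		}
	}
	return best
}

func main() {
	rwo := []string{"ReadWriteOnce"}
	pvs := []PV{
		{Name: "pv-5", StorageClass: "fast-ssd", AccessModes: rwo, CapacityGi: 5},
		{Name: "pv-50", StorageClass: "fast-ssd", AccessModes: rwo, CapacityGi: 50},
		{Name: "pv-20", StorageClass: "fast-ssd", AccessModes: rwo, CapacityGi: 20},
		{Name: "pv-10-slow", StorageClass: "standard", AccessModes: rwo, CapacityGi: 10},
	}
	claim := PVC{StorageClass: "fast-ssd", AccessMode: "ReadWriteOnce", RequestGi: 10}

	if i := bestFit(claim, pvs); i >= 0 {
		pvs[i].Bound = true
		fmt.Println("bound to", pvs[i].Name) // bound to pv-20
	}
}
```

The claim gets the whole 20Gi volume even though it asked for 10Gi, because a PV binds to exactly one PVC.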

***

## Interview Deep Dive Questions

<AccordionGroup>
  <Accordion title="What happens when you run kubectl apply?" icon="circle-question">
    **Answer**: 1) kubectl parses the YAML and sends the request to the API server, 2) Authentication (who is this?), 3) Authorization (can they do this?), 4) Admission controllers (mutate/validate) and schema validation, 5) Object written to etcd, 6) The API server notifies the components watching that resource, 7) Relevant components react (e.g., the scheduler assigns the pod to a node, and the kubelet on that node runs it).
  </Accordion>

  <Accordion title="How does the scheduler decide where to place a pod?" icon="circle-question">
    **Answer**: Two phases: 1) Filtering - eliminate nodes that cannot run the pod (insufficient resources, wrong labels, taints not tolerated), 2) Scoring - rank remaining nodes by preferences (resource balance, image locality, affinity). Highest score wins. Configurable via scheduler profiles and plugins.
  </Accordion>

  <Accordion title="What is the controller pattern?" icon="circle-question">
    **Answer**: Controllers run infinite reconciliation loops: observe actual state, compare to desired state, take action to align them. This pattern enables self-healing - if a pod dies, the ReplicaSet controller notices the mismatch and creates a new one. Every Kubernetes controller follows this pattern.
  </Accordion>

  <Accordion title="Why is etcd critical to Kubernetes?" icon="circle-question">
    **Answer**: etcd is the single source of truth for cluster state. All objects (pods, services, secrets) are stored in etcd. It uses Raft consensus for consistency and watches for real-time notifications. If etcd is unavailable, the cluster cannot make changes. This is why etcd is typically run as a 3 or 5 node cluster for high availability.
  </Accordion>

  <Accordion title="Explain how Services work internally" icon="circle-question">
    **Answer**: Services are abstractions. kube-proxy watches Services and their endpoints (EndpointSlices in current versions), then programs iptables/IPVS rules on each node. When traffic hits a Service IP, rules DNAT (destination NAT) to a backend pod IP. Load balancing happens via random selection (iptables) or round-robin by default (IPVS). ClusterIP is virtual: no pod or process listens on it, and it exists only as forwarding rules on each node.
  </Accordion>

  <Accordion title="What is the difference between a DaemonSet and a Deployment?" icon="circle-question">
    **Answer**: Deployment runs N replicas scheduled anywhere. DaemonSet runs exactly one pod per node (or selected nodes). DaemonSet pods do not bypass the scheduler in current Kubernetes versions: the DaemonSet controller creates one pod per eligible node with a node affinity that pins it to that node, and the default scheduler binds it. Use cases: log collectors (fluentd), monitoring agents (node-exporter), network plugins (calico-node).
  </Accordion>
</AccordionGroup>

***

## Debugging with Internals Knowledge

### Check etcd Health

```bash theme={null}
# etcd status (run inside etcd pod)
# On TLS-enabled clusters, also pass --endpoints, --cacert, --cert, and --key
etcdctl endpoint health
etcdctl endpoint status --write-out=table

# List all keys (careful - lots of output)
etcdctl get / --prefix --keys-only
```

### Trace Scheduler Decisions

```bash theme={null}
# Why is my pod pending?
kubectl describe pod my-pod

# Look for:
# Events: FailedScheduling
# Message: "0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had taint..."

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
```

### Debug Controller Issues

```bash theme={null}
# Check controller manager logs
kubectl logs -n kube-system kube-controller-manager-<node>

# Look for reconciliation errors
kubectl describe deployment my-deployment
```

***

## Key Takeaways

1. **Desired state vs actual state** - controllers constantly reconcile
2. **etcd is the brain** - stores all cluster state, uses Raft consensus
3. **API server is the hub** - authentication, authorization, admission, persistence
4. **Scheduler uses filtering then scoring** - predicates eliminate, priorities rank
5. **Controllers follow the reconciliation pattern** - observe, compare, act
6. **Kubelet uses CRI** - pluggable container runtimes
7. **CNI implements pod networking** - Calico, Cilium, Flannel, etc.
8. **CSI implements storage** - pluggable storage backends

***

Ready to deploy workloads? Next up: [Kubernetes Workloads](/courses/devops-tools/kubernetes-workloads) where we will master Deployments, StatefulSets, and DaemonSets.