
Kubernetes Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to deploy pods and call it a day, feel free to skip ahead. No judgment.
This chapter takes you inside the Kubernetes control plane. We will explore how etcd stores cluster state, how the scheduler makes placement decisions, and why the controller pattern is one of the most elegant designs in distributed systems. This knowledge is what separates engineers who merely operate Kubernetes from those who truly understand it.

Why Internals Matter

Understanding Kubernetes internals helps you:
  • Troubleshoot production outages when pods refuse to schedule
  • Optimize cluster performance by understanding scheduler behavior
  • Design better applications that work with Kubernetes, not against it
  • Ace interviews where internals questions are standard
  • Build operators and controllers that extend Kubernetes

The Declarative Model: Desired State vs Actual State

Kubernetes is built on a simple but powerful idea: you declare what you want, and controllers work to make reality match your declaration.
# You declare: "I want 3 nginx replicas"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
Kubernetes stores this as desired state and continuously works to ensure actual state matches it.
┌─────────────────────────────────────────────────────────────────┐
│                    The Reconciliation Loop                       │
│                                                                  │
│   ┌──────────────┐                        ┌──────────────┐       │
│   │  Desired     │     Controllers        │   Actual     │       │
│   │  State       │ ──────────────────────▶│   State      │       │
│   │  (etcd)      │   "Make it so"         │  (Reality)   │       │
│   └──────────────┘                        └──────────────┘       │
│          ▲                                       │               │
│          │         Observe & Compare             │               │
│          └───────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────────────┘
This model is why:
  • Self-healing works - controllers constantly reconcile
  • Rollbacks are easy - just change the desired state
  • Scaling is declarative - change replicas: 3 to replicas: 10 (see the short demo below)
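You can watch this loop work from the command line. A minimal sketch, assuming the nginx Deployment above has been applied to the current namespace:

# Change only the desired state; the Deployment and ReplicaSet controllers do the rest
kubectl scale deployment nginx --replicas=10

# Watch the actual state converge on the new desired state
kubectl get pods --watch

kubectl scale is just a shortcut for patching spec.replicas; editing the manifest and re-running kubectl apply achieves the same result.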

etcd: The Cluster Brain

etcd is a distributed key-value store that holds the entire cluster state. If etcd becomes unavailable, the control plane grinds to a halt: existing workloads keep running, but nothing can be created, updated, or rescheduled.

What is Stored in etcd

/registry/
├── pods/
│   └── default/
│       └── nginx-abc123
├── deployments/
│   └── default/
│       └── nginx
├── services/
│   └── default/
│       └── nginx-service
├── secrets/
│   └── default/
│       └── api-key
└── configmaps/
    └── default/
        └── app-config
Every Kubernetes object is stored as a key-value pair, and you can read one back directly (example below):
  • Key: /registry/<resource>/<namespace>/<name>
  • Value: Serialized object (JSON or Protobuf)
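For illustration, here is a sketch of reading one of those keys straight from etcd with etcdctl. The certificate paths are kubeadm defaults and will differ on other distributions, and modern clusters store values as binary Protobuf, so the output is not human-friendly JSON:

# Read the raw value stored for a single pod
etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        get /registry/pods/default/nginx-abc123 --print-value-only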

etcd Consistency Model

etcd uses the Raft consensus algorithm to ensure consistency:
┌─────────┐     ┌─────────┐     ┌─────────┐
│  etcd   │     │  etcd   │     │  etcd   │
│ Leader  │────▶│Follower │     │Follower │
│  (R/W)  │     │  (R/O)  │     │  (R/O)  │
└─────────┘     └─────────┘     └─────────┘
     │                ▲               ▲
     │                │               │
     └────────────────┴───────────────┘
           Replicated writes
  • Leader handles all writes
  • Followers replicate the leader's log and can serve reads
  • Quorum required for writes: a majority of nodes, ⌊n/2⌋ + 1, must agree
  • A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2

Watches: The Real-Time Notification System

This is the magic that makes Kubernetes responsive:
# etcd watch on pods
etcdctl watch --prefix /registry/pods/
When you create a pod:
  1. API server writes to etcd
  2. etcd notifies all watchers
  3. Scheduler watches for unscheduled pods
  4. Kubelet watches for pods assigned to its node
This event-driven architecture is why Kubernetes reacts in seconds, not minutes.
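The same watch mechanism is exposed through the API server, so you can observe it without touching etcd directly:

# Stream pod changes as they happen (an API server watch, backed by an etcd watch)
kubectl get pods --watch

# The raw API equivalent (streams JSON watch events until interrupted)
kubectl get --raw "/api/v1/namespaces/default/pods?watch=true"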

The API Server: Central Hub

The kube-apiserver is the only component that talks to etcd. Everything else goes through it.

Request Flow

kubectl apply -f pod.yaml


┌──────────────────────────────────────────────────────────────────┐
│                          API Server                               │
│                                                                   │
│  ┌──────────┐   ┌──────────────┐   ┌────────────┐   ┌─────────┐ │
│  │ Authent- │──▶│   Authoriz-  │──▶│ Admission  │──▶│ Valida- │ │
│  │ ication  │   │   ation      │   │ Controllers│   │ tion    │ │
│  │ (Who?)   │   │ (Allowed?)   │   │ (Mutate?)  │   │ (Valid?)│ │
│  └──────────┘   └──────────────┘   └────────────┘   └─────────┘ │
│                                                          │       │
│                                            ┌─────────────▼─────┐ │
│                                            │  Write to etcd   │ │
│                                            └───────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Authentication: Who Are You?

Methods supported (a quick identity check follows the list):
  • X.509 client certificates - common for cluster admins and control plane components
  • Bearer tokens - service account tokens
  • OpenID Connect - enterprise SSO integration
  • Webhook token authentication - custom authentication backends
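A quick sanity check of the authentication layer is to ask the API server who it thinks you are. kubectl auth whoami needs a reasonably recent kubectl and cluster; inspecting the kubeconfig works everywhere:

# Which identity do my credentials map to?
kubectl auth whoami

# Which credentials is my current context actually using?
kubectl config view --minify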

Authorization: Are You Allowed?

RBAC (Role-Based Access Control) is the standard:
# Role: What actions are allowed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

---
# RoleBinding: Who gets the role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
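With the binding in place, kubectl's built-in authorization check can confirm the effective permissions (impersonating jane requires that your own credentials allow impersonation, e.g., cluster-admin):

# Can jane list pods in the default namespace? -> yes
kubectl auth can-i list pods --namespace default --as jane

# Can jane delete pods? -> no
kubectl auth can-i delete pods --namespace default --as jane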

Admission Controllers: The Gatekeepers

Admission controllers intercept requests after authentication and authorization, but before the object is persisted to etcd.
Mutating Admission (modifies objects):
  • Add default resource limits
  • Inject sidecar containers (Istio)
  • Add labels/annotations
Validating Admission (rejects invalid objects):
  • Enforce naming conventions
  • Require specific labels
  • Block privileged containers
# Example: Always pull images (via admission controller)
# Input:  imagePullPolicy: IfNotPresent
# Output: imagePullPolicy: Always
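Built-in plugins such as AlwaysPullImages are switched on via a kube-apiserver flag. The exact plugin list and manifest location vary by distribution, so treat this as a sketch of what appears in a control plane node's static pod manifest:

# Excerpt from the kube-apiserver command line
kube-apiserver \
  --enable-admission-plugins=NodeRestriction,LimitRanger,AlwaysPullImages \
  ...

Custom mutating and validating logic is registered separately, through MutatingWebhookConfiguration and ValidatingWebhookConfiguration objects.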

The Scheduler: Where Do Pods Go?

The kube-scheduler watches for unscheduled pods and assigns them to nodes.

The Scheduling Algorithm

┌──────────────────────────────────────────────────────────────┐
│                     Scheduling Cycle                          │
│                                                               │
│   1. FILTERING (Predicates)                                   │
│      ┌─────────┐                                              │
│      │All Nodes│──▶ Remove nodes that cannot run the pod     │
│      └─────────┘                                              │
│           │                                                   │
│           ▼                                                   │
│   2. SCORING (Priorities)                                     │
│      ┌─────────┐                                              │
│      │Feasible │──▶ Rank remaining nodes by preference       │
│      │ Nodes   │                                              │
│      └─────────┘                                              │
│           │                                                   │
│           ▼                                                   │
│   3. BINDING                                                  │
│      Assign pod to highest-scoring node                       │
└──────────────────────────────────────────────────────────────┘

Filter Plugins (Predicates)

These eliminate unsuitable nodes:
Filter              What It Checks
NodeResourcesFit    Enough CPU/memory?
NodeName            Pod requests specific node?
NodeSelector        Node has required labels?
NodeAffinity        Node matches affinity rules?
TaintToleration     Pod tolerates node taints?
NodePorts           Port available on node?
PodTopologySpread   Would violate spread constraints?

Score Plugins (Priorities)

These rank feasible nodes:
Priority                     What It Prefers
LeastAllocated               Nodes with more free resources
MostAllocated                Nodes with least free resources (bin packing)
BalancedResourceAllocation   Balanced CPU/memory ratio
ImageLocality                Nodes that already have the image
NodeAffinity                 Nodes matching preferred affinity
InterPodAffinity             Nodes with related pods

Example Scheduling Decision

# Pod with requirements
spec:
  containers:
  - resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
  nodeSelector:
    disktype: ssd
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "ml-workload"
Scheduling flow (the commands after the list confirm the outcome):
  1. Filter: Remove nodes without disktype: ssd label
  2. Filter: Remove nodes without enough CPU/memory
  3. Filter: Remove nodes with untolerated taints
  4. Score: Rank by resource availability, image locality
  5. Bind: Assign to highest-scoring node
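Once binding succeeds, the chosen node is recorded in the pod's spec.nodeName and a Scheduled event is emitted, both of which are easy to confirm:

# Which node did the pod land on?
kubectl get pod <pod-name> -o wide
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}'

# The scheduler's decision also appears as an event
kubectl describe pod <pod-name> | grep -A 5 Events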

Controllers: The Workhorses of Kubernetes

Controllers are the workhorses that make Kubernetes self-healing. Each controller follows the same pattern.

The Controller Pattern

// Simplified: every controller runs a loop of this shape
for {
    desired := getDesiredState() // read the object's spec from the API server
    actual := getActualState()   // observe the real world via watches and caches

    if actual != desired {
        reconcile(actual, desired) // create, update, or delete things to converge
    }
}
This is an infinite reconciliation loop. Controllers never stop watching and fixing.
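You can watch a reconciliation happen in real time: deleting a pod owned by a ReplicaSet creates a mismatch between desired and actual state, and the ReplicaSet controller closes the gap within seconds:

# In one terminal, watch pod churn
kubectl get pods --watch

# In another, delete a pod owned by the Deployment's ReplicaSet
kubectl delete pod <one-of-the-nginx-pods>
# A replacement pod appears almost immediately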

Controller Manager Components

Controller                  What It Manages
ReplicaSet Controller       Ensures correct pod count
Deployment Controller       Manages ReplicaSets for rollouts
Node Controller             Monitors node health, evicts pods
Endpoint Controller         Populates Service endpoints
ServiceAccount Controller   Creates default accounts
Job Controller              Runs pods to completion
PV Binder                   Binds PVCs to PVs

Deployment Controller Deep Dive

What happens when you update a Deployment:
# Change image from nginx:1.20 to nginx:1.21
spec:
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.21  # Changed!
  1. Deployment Controller detects spec change
  2. Creates new ReplicaSet with new pod template
  3. Scales up new ReplicaSet (gradually)
  4. Scales down old ReplicaSet (gradually)
  5. Updates Deployment status
Rolling Update (maxSurge=1, maxUnavailable=0):

Time 0: [Old-1] [Old-2] [Old-3]
Time 1: [Old-1] [Old-2] [Old-3] [New-1]      # +1 new
Time 2: [Old-1] [Old-2] [New-1] [New-2]      # -1 old, +1 new
Time 3: [Old-1] [New-1] [New-2] [New-3]      # -1 old, +1 new
Time 4: [New-1] [New-2] [New-3]              # -1 old, done!
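The standard kubectl rollout commands drive and inspect exactly this process:

# Trigger the rollout by changing the pod template
kubectl set image deployment/nginx nginx=nginx:1.21

# Follow progress, review history, roll back if needed
kubectl rollout status deployment/nginx
kubectl rollout history deployment/nginx
kubectl rollout undo deployment/nginx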

Kubelet: The Node Agent

The kubelet runs on every node and manages pods assigned to that node.

Kubelet Responsibilities

  1. Pod Lifecycle: Create, start, stop, delete containers
  2. Health Checks: Liveness, readiness, startup probes
  3. Resource Reporting: Tell API server about node capacity
  4. Image Management: Pull container images
  5. Volume Mounting: Attach storage to pods
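The capacity and health data the kubelet reports is visible through the API server; its own logs live on the node (the journalctl command assumes a systemd-managed kubelet):

# Capacity and allocatable resources reported by the kubelet
kubectl get node <node-name> -o jsonpath='{.status.capacity}'
kubectl describe node <node-name>   # also shows conditions and running pods

# Kubelet logs on the node itself
journalctl -u kubelet -f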

How Kubelet Runs Containers

┌─────────────────────────────────────────────────────────────────┐
│                           Kubelet                                │
│                              │                                   │
│                              ▼                                   │
│                    ┌─────────────────┐                          │
│                    │       CRI       │  Container Runtime        │
│                    │   (Interface)   │  Interface                │
│                    └────────┬────────┘                          │
│                             │                                    │
│         ┌───────────────────┼───────────────────┐               │
│         ▼                   ▼                   ▼               │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐         │
│   │containerd│        │  CRI-O   │        │  Docker  │         │
│   └──────────┘        └──────────┘        └──────────┘         │
└─────────────────────────────────────────────────────────────────┘
CRI (Container Runtime Interface) abstracts the container runtime. Kubernetes does not care whether you use containerd, CRI-O, or another CRI-compatible runtime (Docker Engine now requires the cri-dockerd shim, since dockershim was removed in Kubernetes 1.24).

Pod Sandbox and Containers

Pod:
┌─────────────────────────────────────────────────────────┐
│  Pause Container (Sandbox)                               │
│  - Holds network namespace                               │
│  - Holds IPC namespace                                   │
│  ┌─────────────────┐  ┌─────────────────┐               │
│  │  App Container  │  │  Sidecar        │               │
│  │  (nginx)        │  │  (log-shipper)  │               │
│  └─────────────────┘  └─────────────────┘               │
│                                                          │
│  Shared: Network (IP, ports), IPC, Volumes              │
└─────────────────────────────────────────────────────────┘
The pause container is the parent that holds namespaces. App containers join these namespaces.
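On a node, crictl lets you see sandboxes and their containers directly through the CRI (it must be installed and pointed at the runtime's socket, e.g., via /etc/crictl.yaml):

# List pod sandboxes known to the container runtime
crictl pods

# List the containers running inside one sandbox
crictl ps --pod <pod-sandbox-id>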

Networking Internals

Kubernetes networking follows three rules (the third is verified with the commands below):
  1. All pods can communicate with all other pods without NAT
  2. All nodes can communicate with all pods without NAT
  3. The IP a pod sees for itself is the same IP others see
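Rule 3 is easy to verify: compare the IP Kubernetes reports for a pod with the IP the pod sees on its own interface (assuming the container image ships the iproute2 tools):

# The IP Kubernetes reports for the pod
kubectl get pod <pod-name> -o jsonpath='{.status.podIP}'

# The IP the pod sees for itself - the same address, no NAT in between
kubectl exec <pod-name> -- ip -4 addr show eth0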

CNI: Container Network Interface

CNI plugins implement pod networking:
Plugin        Approach                Best For
Calico        BGP routing, L3         Large clusters, network policies
Flannel       VXLAN overlay           Simplicity, small clusters
Cilium        eBPF-based              Performance, observability
Weave         VXLAN, encryption       Multi-cloud, encrypted overlay
AWS VPC CNI   Native AWS networking   EKS clusters

Pod-to-Pod Communication

Pod A (10.244.1.5)                      Pod B (10.244.2.3)
┌──────────────────┐                    ┌──────────────────┐
│                  │                    │                  │
│  eth0: 10.244.1.5│                    │  eth0: 10.244.2.3│
└────────┬─────────┘                    └────────┬─────────┘
         │                                       │
    ┌────┴─────┐                           ┌────┴─────┐
    │   veth   │                           │   veth   │
    └────┬─────┘                           └────┬─────┘
         │                                       │
    ┌────┴──────────────────────────────────────┴────┐
    │                CNI Bridge/Overlay               │
    └────────────────────────┬───────────────────────┘

                    Physical Network

Service Networking: kube-proxy

kube-proxy implements Service load balancing using:
Mode       How It Works                     Performance
iptables   Rules for packet redirection     Good; O(n) rule evaluation
IPVS       Linux Virtual Server             Better; O(1) lookup
eBPF       Kernel-level packet processing   Best; used by Cilium as a kube-proxy replacement
# View iptables rules for a Service
iptables -t nat -L KUBE-SERVICES -n

# KUBE-SVC-xxxx chain load balances to endpoints
# KUBE-SEP-xxxx chains DNAT to actual pod IPs
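Which mode your kube-proxy is using depends on its configuration. On kubeadm-style clusters it lives in a ConfigMap, and the chosen proxier is also logged at startup (details vary by distribution, so treat this as a sketch):

# kube-proxy mode on kubeadm-style clusters (empty or "iptables" means iptables)
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"

# The chosen proxier is logged when kube-proxy starts
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i proxier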

Storage Internals

Kubernetes abstracts storage through several layers.

CSI: Container Storage Interface

┌─────────────────────────────────────────────────────────────────┐
│                         Kubernetes                               │
│                              │                                   │
│                              ▼                                   │
│                    ┌─────────────────┐                          │
│                    │       CSI       │  Container Storage        │
│                    │   (Interface)   │  Interface                │
│                    └────────┬────────┘                          │
│                             │                                    │
│         ┌───────────────────┼───────────────────┐               │
│         ▼                   ▼                   ▼               │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐         │
│   │ AWS EBS  │        │ GCP PD   │        │  Ceph    │         │
│   └──────────┘        └──────────┘        └──────────┘         │
└─────────────────────────────────────────────────────────────────┘

PV/PVC Binding

# PersistentVolumeClaim (User Request)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast-ssd
The PV Binder Controller matches PVCs to PVs (you can observe the binding with the commands after the list):
  1. Find PVs with matching StorageClass
  2. Filter by access modes and capacity
  3. Select best fit (smallest PV that satisfies request)
  4. Bind PVC to PV
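The result is visible on both objects once binding completes (with dynamic provisioning, the StorageClass's provisioner creates a matching PV on demand instead of selecting a pre-created one):

# The PVC moves from Pending to Bound and records the PV it bound to
kubectl get pvc my-pvc
kubectl describe pvc my-pvc   # events show provisioning and binding progress

# The PV side shows the claim reference
kubectl get pv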

Interview Deep Dive Questions

Question: What happens, end to end, when you run kubectl apply?
Answer: 1) kubectl validates YAML locally, 2) Request sent to API server, 3) Authentication (who is this?), 4) Authorization (can they do this?), 5) Admission controllers (mutate/validate), 6) Object written to etcd, 7) etcd notifies watchers, 8) Relevant controllers react (e.g., scheduler assigns pod, kubelet runs it).

Question: How does the scheduler decide where a pod runs?
Answer: Two phases: 1) Filtering - eliminate nodes that cannot run the pod (insufficient resources, wrong labels, taints not tolerated), 2) Scoring - rank remaining nodes by preferences (resource balance, image locality, affinity). Highest score wins. Configurable via scheduler profiles and plugins.

Question: How do controllers make Kubernetes self-healing?
Answer: Controllers run infinite reconciliation loops: observe actual state, compare to desired state, take action to align them. This pattern enables self-healing - if a pod dies, the ReplicaSet controller notices the mismatch and creates a new one. Every Kubernetes controller follows this pattern.

Question: Why is etcd so critical, and how is it made highly available?
Answer: etcd is the single source of truth for cluster state. All objects (pods, services, secrets) are stored in etcd. It uses Raft consensus for consistency and watches for real-time notifications. If etcd is unavailable, the cluster cannot make changes. This is why etcd is typically run as a 3- or 5-node cluster for high availability.

Question: How does a Service actually route traffic to pods?
Answer: Services are abstractions. kube-proxy watches Services and Endpoints, then programs iptables/IPVS rules on each node. When traffic hits a Service IP, rules DNAT (destination NAT) to a backend pod IP. Load balancing happens via random selection (iptables) or round-robin (IPVS). ClusterIP is virtual - it exists only in iptables rules.

Question: When would you use a DaemonSet instead of a Deployment?
Answer: A Deployment runs N replicas scheduled anywhere. A DaemonSet runs exactly one pod per node (or per selected node). DaemonSet pods are created by the DaemonSet controller, which pins each pod to its node (since Kubernetes 1.12 the default scheduler then binds them). Use cases: log collectors (fluentd), monitoring agents (node-exporter), network plugins (calico-node).

Debugging with Internals Knowledge

Check etcd Health

# etcd status (run inside etcd pod)
etcdctl endpoint health
etcdctl endpoint status --write-out=table

# List all keys (careful - lots of output)
etcdctl get / --prefix --keys-only

Trace Scheduler Decisions

# Why is my pod pending?
kubectl describe pod my-pod

# Look for:
# Events: FailedScheduling
# Message: "0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had taint..."

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>

Debug Controller Issues

# Check controller manager logs
kubectl logs -n kube-system kube-controller-manager-<node>

# Look for reconciliation errors
kubectl describe deployment my-deployment

Key Takeaways

  1. Desired state vs actual state - controllers constantly reconcile
  2. etcd is the brain - stores all cluster state, uses Raft consensus
  3. API server is the hub - authentication, authorization, admission, persistence
  4. Scheduler uses filtering then scoring - predicates eliminate, priorities rank
  5. Controllers follow the reconciliation pattern - observe, compare, act
  6. Kubelet uses CRI - pluggable container runtimes
  7. CNI implements pod networking - Calico, Cilium, Flannel, etc.
  8. CSI implements storage - pluggable storage backends

Ready to deploy workloads? Next up: Kubernetes Workloads where we will master Deployments, StatefulSets, and DaemonSets.