
Kubernetes Internals Deep Dive

If you love understanding how things actually work, this chapter is for you. If you just want to deploy pods and call it a day, feel free to skip ahead. No judgment.
This chapter takes you inside the Kubernetes control plane. We will explore how etcd stores cluster state, how the scheduler makes placement decisions, and why the controller pattern is one of the most elegant designs in distributed systems. This knowledge is what separates engineers who merely operate Kubernetes from those who truly understand it.

Why Internals Matter

Understanding Kubernetes internals helps you:
  • Troubleshoot production outages when pods refuse to schedule
  • Optimize cluster performance by understanding scheduler behavior
  • Design better applications that work with Kubernetes, not against it
  • Ace interviews where internals questions are standard
  • Build operators and controllers that extend Kubernetes

The Declarative Model: Desired State vs Actual State

Kubernetes is built on a simple but powerful idea: you declare what you want, and controllers work to make reality match your declaration.
# You declare: "I want 3 nginx replicas"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
Kubernetes stores this as desired state and continuously works to ensure actual state matches it.
┌─────────────────────────────────────────────────────────────────┐
│                    The Reconciliation Loop                       │
│                                                                  │
│   ┌──────────────┐                        ┌──────────────┐       │
│   │  Desired     │     Controllers        │   Actual     │       │
│   │  State       │ ──────────────────────▶│   State      │       │
│   │  (etcd)      │   "Make it so"         │  (Reality)   │       │
│   └──────────────┘                        └──────────────┘       │
│          ▲                                       │               │
│          │         Observe & Compare             │               │
│          └───────────────────────────────────────┘               │
└─────────────────────────────────────────────────────────────────┘
This model is why:
  • Self-healing works - controllers constantly reconcile
  • Rollbacks are easy - just change the desired state
  • Scaling is declarative - change replicas: 3 to replicas: 10 (see the short demo below)
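You can watch this loop work from the command line. A minimal sketch, assuming the nginx Deployment above has been applied to the current namespace:

# Change only the desired state; the Deployment and ReplicaSet controllers do the rest
kubectl scale deployment nginx --replicas=10

# Watch the actual state converge on the new desired state
kubectl get pods --watch

kubectl scale is just a shortcut for patching spec.replicas; editing the manifest and re-running kubectl apply achieves the same result.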

etcd: The Cluster Brain

etcd is a distributed key-value store that holds the entire cluster state. If etcd becomes unavailable, the control plane grinds to a halt: existing workloads keep running, but nothing can be created, updated, or rescheduled.

What is Stored in etcd

/registry/
├── pods/
│   └── default/
│       └── nginx-abc123
├── deployments/
│   └── default/
│       └── nginx
├── services/
│   └── default/
│       └── nginx-service
├── secrets/
│   └── default/
│       └── api-key
└── configmaps/
    └── default/
        └── app-config
Every Kubernetes object is stored as a key-value pair, and you can read one back directly (example below):
  • Key: /registry/<resource>/<namespace>/<name>
  • Value: Serialized object (JSON or Protobuf)
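For illustration, here is a sketch of reading one of those keys straight from etcd with etcdctl. The certificate paths are kubeadm defaults and will differ on other distributions, and modern clusters store values as binary Protobuf, so the output is not human-friendly JSON:

# Read the raw value stored for a single pod
etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        get /registry/pods/default/nginx-abc123 --print-value-only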

etcd Consistency Model

etcd uses the Raft consensus algorithm to ensure consistency:
┌─────────┐     ┌─────────┐     ┌─────────┐
│  etcd   │     │  etcd   │     │  etcd   │
│ Leader  │────▶│Follower │     │Follower │
│  (R/W)  │     │  (R/O)  │     │  (R/O)  │
└─────────┘     └─────────┘     └─────────┘
     │                ▲               ▲
     │                │               │
     └────────────────┴───────────────┘
           Replicated writes
  • Leader handles all writes
  • Followers replicate the leader's log and can serve reads
  • Quorum required for writes: a majority of nodes, ⌊n/2⌋ + 1, must agree
  • A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2

Watches: The Real-Time Notification System

This is the magic that makes Kubernetes responsive:
# etcd watch on pods
etcdctl watch --prefix /registry/pods/
When you create a pod:
  1. API server writes to etcd
  2. etcd notifies all watchers
  3. Scheduler watches for unscheduled pods
  4. Kubelet watches for pods assigned to its node
This event-driven architecture is why Kubernetes reacts in seconds, not minutes.
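The same watch mechanism is exposed through the API server, so you can observe it without touching etcd directly:

# Stream pod changes as they happen (an API server watch, backed by an etcd watch)
kubectl get pods --watch

# The raw API equivalent (streams JSON watch events until interrupted)
kubectl get --raw "/api/v1/namespaces/default/pods?watch=true"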

The API Server: Central Hub

The kube-apiserver is the only component that talks to etcd. Everything else goes through it.

Request Flow

kubectl apply -f pod.yaml


┌──────────────────────────────────────────────────────────────────┐
│                          API Server                               │
│                                                                   │
│  ┌──────────┐   ┌──────────────┐   ┌────────────┐   ┌─────────┐ │
│  │ Authent- │──▶│   Authoriz-  │──▶│ Admission  │──▶│ Valida- │ │
│  │ ication  │   │   ation      │   │ Controllers│   │ tion    │ │
│  │ (Who?)   │   │ (Allowed?)   │   │ (Mutate?)  │   │ (Valid?)│ │
│  └──────────┘   └──────────────┘   └────────────┘   └─────────┘ │
│                                                          │       │
│                                            ┌─────────────▼─────┐ │
│                                            │  Write to etcd   │ │
│                                            └───────────────────┘ │
└──────────────────────────────────────────────────────────────────┘

Authentication: Who Are You?

Methods supported (a quick identity check follows the list):
  • X.509 client certificates - common for cluster admins and control plane components
  • Bearer tokens - service account tokens
  • OpenID Connect - enterprise SSO integration
  • Webhook token authentication - custom authentication backends
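A quick sanity check of the authentication layer is to ask the API server who it thinks you are. kubectl auth whoami needs a reasonably recent kubectl and cluster; inspecting the kubeconfig works everywhere:

# Which identity do my credentials map to?
kubectl auth whoami

# Which credentials is my current context actually using?
kubectl config view --minify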

Authorization: Are You Allowed?

RBAC (Role-Based Access Control) is the standard:
# Role: What actions are allowed
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: default
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

---
# RoleBinding: Who gets the role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
- kind: User
  name: jane
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
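With the binding in place, kubectl's built-in authorization check can confirm the effective permissions (impersonating jane requires that your own credentials allow impersonation, e.g., cluster-admin):

# Can jane list pods in the default namespace? -> yes
kubectl auth can-i list pods --namespace default --as jane

# Can jane delete pods? -> no
kubectl auth can-i delete pods --namespace default --as jane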

Admission Controllers: The Gatekeepers

Admission controllers intercept requests after authentication and authorization, but before the object is persisted to etcd.
Mutating Admission (modifies objects):
  • Add default resource limits
  • Inject sidecar containers (Istio)
  • Add labels/annotations
Validating Admission (rejects invalid objects):
  • Enforce naming conventions
  • Require specific labels
  • Block privileged containers
# Example: Always pull images (via admission controller)
# Input:  imagePullPolicy: IfNotPresent
# Output: imagePullPolicy: Always
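Built-in plugins such as AlwaysPullImages are switched on via a kube-apiserver flag. The exact plugin list and manifest location vary by distribution, so treat this as a sketch of what appears in a control plane node's static pod manifest:

# Excerpt from the kube-apiserver command line
kube-apiserver \
  --enable-admission-plugins=NodeRestriction,LimitRanger,AlwaysPullImages \
  ...

Custom mutating and validating logic is registered separately, through MutatingWebhookConfiguration and ValidatingWebhookConfiguration objects.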

The Scheduler: Where Do Pods Go?

The kube-scheduler watches for unscheduled pods and assigns them to nodes.

The Scheduling Algorithm

┌──────────────────────────────────────────────────────────────┐
│                     Scheduling Cycle                          │
│                                                               │
│   1. FILTERING (Predicates)                                   │
│      ┌─────────┐                                              │
│      │All Nodes│──▶ Remove nodes that cannot run the pod     │
│      └─────────┘                                              │
│           │                                                   │
│           ▼                                                   │
│   2. SCORING (Priorities)                                     │
│      ┌─────────┐                                              │
│      │Feasible │──▶ Rank remaining nodes by preference       │
│      │ Nodes   │                                              │
│      └─────────┘                                              │
│           │                                                   │
│           ▼                                                   │
│   3. BINDING                                                  │
│      Assign pod to highest-scoring node                       │
└──────────────────────────────────────────────────────────────┘

Filter Plugins (Predicates)

These eliminate unsuitable nodes:
Filter              What It Checks
NodeResourcesFit    Enough CPU/memory?
NodeName            Pod requests specific node?
NodeSelector        Node has required labels?
NodeAffinity        Node matches affinity rules?
TaintToleration     Pod tolerates node taints?
NodePorts           Port available on node?
PodTopologySpread   Would violate spread constraints?

Score Plugins (Priorities)

These rank feasible nodes:
Priority                     What It Prefers
LeastAllocated               Nodes with more free resources
MostAllocated                Nodes with least free resources (bin packing)
BalancedResourceAllocation   Balanced CPU/memory ratio
ImageLocality                Nodes that already have the image
NodeAffinity                 Nodes matching preferred affinity
InterPodAffinity             Nodes with related pods

Example Scheduling Decision

# Pod with requirements
spec:
  containers:
  - resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
  nodeSelector:
    disktype: ssd
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "ml-workload"
Scheduling flow (the commands after the list confirm the outcome):
  1. Filter: Remove nodes without disktype: ssd label
  2. Filter: Remove nodes without enough CPU/memory
  3. Filter: Remove nodes with untolerated taints
  4. Score: Rank by resource availability, image locality
  5. Bind: Assign to highest-scoring node
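Once binding succeeds, the chosen node is recorded in the pod's spec.nodeName and a Scheduled event is emitted, both of which are easy to confirm:

# Which node did the pod land on?
kubectl get pod <pod-name> -o wide
kubectl get pod <pod-name> -o jsonpath='{.spec.nodeName}'

# The scheduler's decision also appears as an event
kubectl describe pod <pod-name> | grep -A 5 Events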

Controllers: The Workhorses of Kubernetes

Controllers are the workhorses that make Kubernetes self-healing. Each controller follows the same pattern.

The Controller Pattern

// Simplified: every controller runs a loop of this shape
for {
    desired := getDesiredState() // read the object's spec from the API server
    actual := getActualState()   // observe the real world via watches and caches

    if actual != desired {
        reconcile(actual, desired) // create, update, or delete things to converge
    }
}
This is an infinite reconciliation loop. Controllers never stop watching and fixing.
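You can watch a reconciliation happen in real time: deleting a pod owned by a ReplicaSet creates a mismatch between desired and actual state, and the ReplicaSet controller closes the gap within seconds:

# In one terminal, watch pod churn
kubectl get pods --watch

# In another, delete a pod owned by the Deployment's ReplicaSet
kubectl delete pod <one-of-the-nginx-pods>
# A replacement pod appears almost immediately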

Controller Manager Components

Controller                  What It Manages
ReplicaSet Controller       Ensures correct pod count
Deployment Controller       Manages ReplicaSets for rollouts
Node Controller             Monitors node health, evicts pods
Endpoint Controller         Populates Service endpoints
ServiceAccount Controller   Creates default accounts
Job Controller              Runs pods to completion
PV Binder                   Binds PVCs to PVs

Deployment Controller Deep Dive

What happens when you update a Deployment:
# Change image from nginx:1.20 to nginx:1.21
spec:
  template:
    spec:
      containers:
      - name: nginx
        image: nginx:1.21  # Changed!
  1. Deployment Controller detects spec change
  2. Creates new ReplicaSet with new pod template
  3. Scales up new ReplicaSet (gradually)
  4. Scales down old ReplicaSet (gradually)
  5. Updates Deployment status
Rolling Update (maxSurge=1, maxUnavailable=0):

Time 0: [Old-1] [Old-2] [Old-3]
Time 1: [Old-1] [Old-2] [Old-3] [New-1]      # +1 new
Time 2: [Old-1] [Old-2] [New-1] [New-2]      # -1 old, +1 new
Time 3: [Old-1] [New-1] [New-2] [New-3]      # -1 old, +1 new
Time 4: [New-1] [New-2] [New-3]              # -1 old, done!
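The standard kubectl rollout commands drive and inspect exactly this process:

# Trigger the rollout by changing the pod template
kubectl set image deployment/nginx nginx=nginx:1.21

# Follow progress, review history, roll back if needed
kubectl rollout status deployment/nginx
kubectl rollout history deployment/nginx
kubectl rollout undo deployment/nginx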

Kubelet: The Node Agent

The kubelet runs on every node and manages pods assigned to that node.

Kubelet Responsibilities

  1. Pod Lifecycle: Create, start, stop, delete containers
  2. Health Checks: Liveness, readiness, startup probes
  3. Resource Reporting: Tell API server about node capacity
  4. Image Management: Pull container images
  5. Volume Mounting: Attach storage to pods
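The capacity and health data the kubelet reports is visible through the API server; its own logs live on the node (the journalctl command assumes a systemd-managed kubelet):

# Capacity and allocatable resources reported by the kubelet
kubectl get node <node-name> -o jsonpath='{.status.capacity}'
kubectl describe node <node-name>   # also shows conditions and running pods

# Kubelet logs on the node itself
journalctl -u kubelet -f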

How Kubelet Runs Containers

┌─────────────────────────────────────────────────────────────────┐
│                           Kubelet                                │
│                              │                                   │
│                              ▼                                   │
│                    ┌─────────────────┐                          │
│                    │       CRI       │  Container Runtime        │
│                    │   (Interface)   │  Interface                │
│                    └────────┬────────┘                          │
│                             │                                    │
│         ┌───────────────────┼───────────────────┐               │
│         ▼                   ▼                   ▼               │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐         │
│   │containerd│        │  CRI-O   │        │  Docker  │         │
│   └──────────┘        └──────────┘        └──────────┘         │
└─────────────────────────────────────────────────────────────────┘
CRI (Container Runtime Interface) abstracts the container runtime. Kubernetes does not care whether you use containerd, CRI-O, or another CRI-compatible runtime (Docker Engine now requires the cri-dockerd shim, since dockershim was removed in Kubernetes 1.24).

Pod Sandbox and Containers

Pod:
┌─────────────────────────────────────────────────────────┐
│  Pause Container (Sandbox)                               │
│  - Holds network namespace                               │
│  - Holds IPC namespace                                   │
│  ┌─────────────────┐  ┌─────────────────┐               │
│  │  App Container  │  │  Sidecar        │               │
│  │  (nginx)        │  │  (log-shipper)  │               │
│  └─────────────────┘  └─────────────────┘               │
│                                                          │
│  Shared: Network (IP, ports), IPC, Volumes              │
└─────────────────────────────────────────────────────────┘
The pause container is the parent that holds namespaces. App containers join these namespaces.
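On a node, crictl lets you see sandboxes and their containers directly through the CRI (it must be installed and pointed at the runtime's socket, e.g., via /etc/crictl.yaml):

# List pod sandboxes known to the container runtime
crictl pods

# List the containers running inside one sandbox
crictl ps --pod <pod-sandbox-id>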

Networking Internals

Kubernetes networking follows three rules (the third is verified with the commands below):
  1. All pods can communicate with all other pods without NAT
  2. All nodes can communicate with all pods without NAT
  3. The IP a pod sees for itself is the same IP others see
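Rule 3 is easy to verify: compare the IP Kubernetes reports for a pod with the IP the pod sees on its own interface (assuming the container image ships the iproute2 tools):

# The IP Kubernetes reports for the pod
kubectl get pod <pod-name> -o jsonpath='{.status.podIP}'

# The IP the pod sees for itself - the same address, no NAT in between
kubectl exec <pod-name> -- ip -4 addr show eth0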

CNI: Container Network Interface

CNI plugins implement pod networking:
Plugin        Approach                Best For
Calico        BGP routing, L3         Large clusters, network policies
Flannel       VXLAN overlay           Simplicity, small clusters
Cilium        eBPF-based              Performance, observability
Weave         VXLAN, encryption       Multi-cloud, encrypted overlay
AWS VPC CNI   Native AWS networking   EKS clusters

Pod-to-Pod Communication

Pod A (10.244.1.5)                      Pod B (10.244.2.3)
┌──────────────────┐                    ┌──────────────────┐
│                  │                    │                  │
│  eth0: 10.244.1.5│                    │  eth0: 10.244.2.3│
└────────┬─────────┘                    └────────┬─────────┘
         │                                       │
    ┌────┴─────┐                           ┌────┴─────┐
    │   veth   │                           │   veth   │
    └────┬─────┘                           └────┬─────┘
         │                                       │
    ┌────┴──────────────────────────────────────┴────┐
    │                CNI Bridge/Overlay               │
    └────────────────────────┬───────────────────────┘

                    Physical Network

Service Networking: kube-proxy

kube-proxy implements Service load balancing using:
Mode       How It Works                     Performance
iptables   Rules for packet redirection     Good; O(n) rule evaluation
IPVS       Linux Virtual Server             Better; O(1) lookup
eBPF       Kernel-level packet processing   Best; used by Cilium as a kube-proxy replacement
# View iptables rules for a Service
iptables -t nat -L KUBE-SERVICES -n

# KUBE-SVC-xxxx chain load balances to endpoints
# KUBE-SEP-xxxx chains DNAT to actual pod IPs
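Which mode your kube-proxy is using depends on its configuration. On kubeadm-style clusters it lives in a ConfigMap, and the chosen proxier is also logged at startup (details vary by distribution, so treat this as a sketch):

# kube-proxy mode on kubeadm-style clusters (empty or "iptables" means iptables)
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"

# The chosen proxier is logged when kube-proxy starts
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i proxier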

Storage Internals

Kubernetes abstracts storage through several layers.

CSI: Container Storage Interface

┌─────────────────────────────────────────────────────────────────┐
│                         Kubernetes                               │
│                              │                                   │
│                              ▼                                   │
│                    ┌─────────────────┐                          │
│                    │       CSI       │  Container Storage        │
│                    │   (Interface)   │  Interface                │
│                    └────────┬────────┘                          │
│                             │                                    │
│         ┌───────────────────┼───────────────────┐               │
│         ▼                   ▼                   ▼               │
│   ┌──────────┐        ┌──────────┐        ┌──────────┐         │
│   │ AWS EBS  │        │ GCP PD   │        │  Ceph    │         │
│   └──────────┘        └──────────┘        └──────────┘         │
└─────────────────────────────────────────────────────────────────┘

PV/PVC Binding

# PersistentVolumeClaim (User Request)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: fast-ssd
The PV Binder Controller matches PVCs to PVs (you can observe the binding with the commands after the list):
  1. Find PVs with matching StorageClass
  2. Filter by access modes and capacity
  3. Select best fit (smallest PV that satisfies request)
  4. Bind PVC to PV
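The result is visible on both objects once binding completes (with dynamic provisioning, the StorageClass's provisioner creates a matching PV on demand instead of selecting a pre-created one):

# The PVC moves from Pending to Bound and records the PV it bound to
kubectl get pvc my-pvc
kubectl describe pvc my-pvc   # events show provisioning and binding progress

# The PV side shows the claim reference
kubectl get pv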

Interview Deep Dive Questions

Question: What happens, end to end, when you run kubectl apply?
Answer: 1) kubectl validates YAML locally, 2) Request sent to API server, 3) Authentication (who is this?), 4) Authorization (can they do this?), 5) Admission controllers (mutate/validate), 6) Object written to etcd, 7) etcd notifies watchers, 8) Relevant controllers react (e.g., scheduler assigns pod, kubelet runs it).

Question: How does the scheduler decide where a pod runs?
Answer: Two phases: 1) Filtering - eliminate nodes that cannot run the pod (insufficient resources, wrong labels, taints not tolerated), 2) Scoring - rank remaining nodes by preferences (resource balance, image locality, affinity). Highest score wins. Configurable via scheduler profiles and plugins.

Question: How do controllers make Kubernetes self-healing?
Answer: Controllers run infinite reconciliation loops: observe actual state, compare to desired state, take action to align them. This pattern enables self-healing - if a pod dies, the ReplicaSet controller notices the mismatch and creates a new one. Every Kubernetes controller follows this pattern.

Question: Why is etcd so critical, and how is it made highly available?
Answer: etcd is the single source of truth for cluster state. All objects (pods, services, secrets) are stored in etcd. It uses Raft consensus for consistency and watches for real-time notifications. If etcd is unavailable, the cluster cannot make changes. This is why etcd is typically run as a 3- or 5-node cluster for high availability.

Question: How does a Service actually route traffic to pods?
Answer: Services are abstractions. kube-proxy watches Services and Endpoints, then programs iptables/IPVS rules on each node. When traffic hits a Service IP, rules DNAT (destination NAT) to a backend pod IP. Load balancing happens via random selection (iptables) or round-robin (IPVS). ClusterIP is virtual - it exists only in iptables rules.

Question: When would you use a DaemonSet instead of a Deployment?
Answer: A Deployment runs N replicas scheduled anywhere. A DaemonSet runs exactly one pod per node (or per selected node). DaemonSet pods are created by the DaemonSet controller, which pins each pod to its node (since Kubernetes 1.12 the default scheduler then binds them). Use cases: log collectors (fluentd), monitoring agents (node-exporter), network plugins (calico-node).

Debugging with Internals Knowledge

Check etcd Health

# etcd status (run inside etcd pod)
etcdctl endpoint health
etcdctl endpoint status --write-out=table

# List all keys (careful - lots of output)
etcdctl get / --prefix --keys-only

Trace Scheduler Decisions

# Why is my pod pending?
kubectl describe pod my-pod

# Look for:
# Events: FailedScheduling
# Message: "0/3 nodes are available: 1 Insufficient cpu, 2 node(s) had taint..."

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>

Debug Controller Issues

# Check controller manager logs
kubectl logs -n kube-system kube-controller-manager-<node>

# Look for reconciliation errors
kubectl describe deployment my-deployment

Key Takeaways

  1. Desired state vs actual state - controllers constantly reconcile
  2. etcd is the brain - stores all cluster state, uses Raft consensus
  3. API server is the hub - authentication, authorization, admission, persistence
  4. Scheduler uses filtering then scoring - predicates eliminate, priorities rank
  5. Controllers follow the reconciliation pattern - observe, compare, act
  6. Kubelet uses CRI - pluggable container runtimes
  7. CNI implements pod networking - Calico, Cilium, Flannel, etc.
  8. CSI implements storage - pluggable storage backends

Ready to deploy workloads? Next up: Kubernetes Workloads where we will master Deployments, StatefulSets, and DaemonSets.