Kubernetes Internals Deep Dive
If you love understanding how things actually work, this chapter is for you. If you just want to deploy pods and call it a day, feel free to skip ahead. No judgment.

This chapter takes you inside the Kubernetes control plane. We will explore how etcd stores cluster state, how the scheduler makes placement decisions, and why the controller pattern is one of the most elegant designs in distributed systems. This knowledge is what separates Kubernetes operators from Kubernetes engineers.
Why Internals Matter
Understanding Kubernetes internals helps you:
- Troubleshoot production outages when pods refuse to schedule
- Optimize cluster performance by understanding scheduler behavior
- Design better applications that work with Kubernetes, not against it
- Ace interviews where internals questions are standard
- Build operators and controllers that extend Kubernetes
The Declarative Model: Desired State vs Actual State
Kubernetes is built on a simple but powerful idea: you declare what you want, and controllers work to make reality match your declaration. This model is why:
- Self-healing works - controllers constantly reconcile
- Rollbacks are easy - just change the desired state
- Scaling is declarative - change replicas: 3 to replicas: 10 and the controllers do the rest (see the sketch after this list)
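To make that concrete, here is a minimal client-go sketch that changes only the desired replica count. The Deployment name (web), namespace (demo), and kubeconfig path are assumptions for illustration; the ReplicaSet controller notices the new desired state and creates or removes pods to match.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Fetch the current desired state of a hypothetical Deployment "web".
	deploy, err := clientset.AppsV1().Deployments("demo").Get(context.TODO(), "web", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Declare a new desired state: 10 replicas instead of 3.
	replicas := int32(10)
	deploy.Spec.Replicas = &replicas

	// Write the desired state back; controllers reconcile actual state toward it.
	_, err = clientset.AppsV1().Deployments("demo").Update(context.TODO(), deploy, metav1.UpdateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("desired replicas set to 10; the ReplicaSet controller will converge")
}
```

Note that the program never creates or deletes a pod itself; it only edits the desired state and lets the control plane do the work.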
etcd: The Cluster Brain
etcd is a distributed key-value store that holds the entire cluster state. If etcd dies, the control plane dies - no changes can be made until it recovers.

What is Stored in etcd
Every API object lives at a predictable key (see the sketch after this list):
- Key: /registry/<resource>/<namespace>/<name>
- Value: Serialized object (JSON or Protobuf)
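As a rough illustration, the sketch below reads one of these keys directly with the etcd Go client. The endpoint, the pod name (nginx in default), and the omission of client TLS certificates are all assumptions; real clusters require the etcd client cert, key, and CA.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumes etcd is reachable locally; production clusters also need client TLS.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Objects live under /registry/<resource>/<namespace>/<name>.
	resp, err := cli.Get(ctx, "/registry/pods/default/nginx")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		// Values are usually Protobuf-encoded, so the bytes are not plain JSON.
		fmt.Printf("key=%s valueBytes=%d\n", kv.Key, len(kv.Value))
	}
}
```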
etcd Consistency Model
etcd uses the Raft consensus algorithm to ensure consistency:
- Leader handles all writes
- Followers replicate and serve reads
- Quorum required for writes: (n/2) + 1 nodes must agree
- 3 nodes tolerate 1 failure, 5 nodes tolerate 2 (the arithmetic is sketched below)
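A tiny sketch of the quorum arithmetic, nothing Kubernetes-specific:

```go
package main

import "fmt"

// quorum returns the minimum number of members that must agree on a write.
func quorum(members int) int { return members/2 + 1 }

func main() {
	for _, n := range []int{1, 3, 5, 7} {
		q := quorum(n)
		// Failures tolerated = members minus the quorum that must stay up.
		fmt.Printf("%d members: quorum=%d, tolerates %d failure(s)\n", n, q, n-q)
	}
}
```

This is also why even-sized clusters buy nothing: 4 members still only tolerate 1 failure.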
Watches: The Real-Time Notification System
This is the magic that makes Kubernetes responsive - components subscribe to changes instead of polling (see the watch sketch after this list):
- API server writes to etcd
- etcd notifies all watchers
- Scheduler watches for unscheduled pods
- Kubelet watches for pods assigned to its node
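The sketch below opens a watch on unscheduled pods, roughly what the scheduler subscribes to (real components use informers and caches rather than a raw watch). The kubeconfig path is an assumption.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Watch pods that have no node assigned yet.
	watcher, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=",
	})
	if err != nil {
		panic(err)
	}
	defer watcher.Stop()

	// Each event arrives as soon as etcd notifies the API server's watch machinery.
	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("%s: pod %s/%s is waiting for a node\n", event.Type, pod.Namespace, pod.Name)
	}
}
```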
The API Server: Central Hub
The kube-apiserver is the only component that talks to etcd. Everything else goes through it.

Request Flow
Every request passes through the same pipeline: authentication, then authorization, then admission control, and finally persistence to etcd.
Authentication: Who Are You?
Methods supported:
- X.509 Client Certificates - Most secure
- Bearer Tokens - Service accounts
- OpenID Connect - Enterprise SSO
- Webhooks - Custom authentication
Authorization: Are You Allowed?
RBAC (Role-Based Access Control) is the standard: Roles and ClusterRoles define permissions, and RoleBindings and ClusterRoleBindings grant them to users, groups, or service accounts.
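A minimal client-go sketch of that model, with illustrative names (namespace demo, role pod-reader, service account ci-bot are assumptions): a Role that can read pods, bound to one service account.

```go
package main

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	ctx := context.TODO()

	// A Role that allows reading pods in the "demo" namespace.
	role := &rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-reader", Namespace: "demo"},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""}, // core API group
			Resources: []string{"pods"},
			Verbs:     []string{"get", "list", "watch"},
		}},
	}
	if _, err := clientset.RbacV1().Roles("demo").Create(ctx, role, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Bind the Role to a hypothetical service account "ci-bot".
	binding := &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-reader-binding", Namespace: "demo"},
		Subjects: []rbacv1.Subject{{
			Kind:      "ServiceAccount",
			Name:      "ci-bot",
			Namespace: "demo",
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: "rbac.authorization.k8s.io",
			Kind:     "Role",
			Name:     "pod-reader",
		},
	}
	if _, err := clientset.RbacV1().RoleBindings("demo").Create(ctx, binding, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```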
Admission Controllers: The Gatekeepers
Admission controllers intercept requests after authorization but before persistence (a webhook sketch follows the lists below).

Mutating Admission (modifies objects):
- Add default resource limits
- Inject sidecar containers (Istio)
- Add labels/annotations
Validating Admission (accepts or rejects objects):
- Enforce naming conventions
- Require specific labels
- Block privileged containers
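Here is a minimal sketch of a hypothetical mutating webhook handler that adds a team=platform label to every object it sees. It assumes the object already has a labels map, and it omits everything a real webhook needs around it: the MutatingWebhookConfiguration, TLS serving certificates, and failure policy.

```go
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
)

// mutate handles AdmissionReview requests and returns a JSONPatch response.
func mutate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	// JSONPatch that adds metadata.labels["team"]="platform"
	// (assumes the object already carries a labels map).
	patch := []map[string]interface{}{
		{"op": "add", "path": "/metadata/labels/team", "value": "platform"},
	}
	patchBytes, _ := json.Marshal(patch)
	patchType := admissionv1.PatchTypeJSONPatch

	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patchBytes,
		PatchType: &patchType,
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/mutate", mutate)
	// Real webhooks must serve HTTPS with a certificate the API server trusts.
	http.ListenAndServe(":8443", nil)
}
```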
The Scheduler: Where Do Pods Go?
The kube-scheduler watches for unscheduled pods and assigns them to nodes.

The Scheduling Algorithm
Scheduling runs in two phases: filter plugins eliminate nodes that cannot run the pod, then score plugins rank the feasible nodes and the pod is bound to the highest-scoring one.
Filter Plugins (Predicates)
These eliminate unsuitable nodes:
| Filter | What It Checks |
|---|---|
| NodeResourcesFit | Enough CPU/memory? |
| NodeName | Pod requests specific node? |
| NodeSelector | Node has required labels? |
| NodeAffinity | Node matches affinity rules? |
| TaintToleration | Pod tolerates node taints? |
| NodePorts | Port available on node? |
| PodTopologySpread | Would violate spread constraints? |
Score Plugins (Priorities)
These rank feasible nodes:
| Priority | What It Prefers |
|---|---|
| LeastAllocated | Nodes with more free resources |
| MostAllocated | Nodes with least free resources (bin packing) |
| BalancedResourceAllocation | Balanced CPU/memory ratio |
| ImageLocality | Nodes that already have the image |
| NodeAffinity | Nodes matching preferred affinity |
| InterPodAffinity | Nodes with related pods |
Example Scheduling Decision
Suppose a pod requests nodes labeled disktype: ssd (the pod spec is sketched after this list):
- Filter: Remove nodes without the disktype: ssd label
- Filter: Remove nodes without enough CPU/memory
- Filter: Remove nodes with untolerated taints
- Score: Rank by resource availability, image locality
- Bind: Assign to highest-scoring node
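A sketch of a pod whose fields exercise those filters: a nodeSelector, resource requests, and a toleration. The names, the dedicated=cache taint, and the image are hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "fast-cache"},
		Spec: corev1.PodSpec{
			// NodeSelector filter: only nodes labeled disktype=ssd survive.
			NodeSelector: map[string]string{"disktype": "ssd"},
			// TaintToleration filter: tolerate a hypothetical dedicated=cache taint.
			Tolerations: []corev1.Toleration{{
				Key:      "dedicated",
				Operator: corev1.TolerationOpEqual,
				Value:    "cache",
				Effect:   corev1.TaintEffectNoSchedule,
			}},
			Containers: []corev1.Container{{
				Name:  "redis",
				Image: "redis:7",
				// NodeResourcesFit filter: nodes without this much free CPU/memory are removed.
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    resource.MustParse("500m"),
						corev1.ResourceMemory: resource.MustParse("256Mi"),
					},
				},
			}},
		},
	}
	out, _ := yaml.Marshal(pod)
	fmt.Println(string(out))
}
```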
Controllers: The Brains of Kubernetes
Controllers are the workhorses that make Kubernetes self-healing. Each controller follows the same pattern.

The Controller Pattern
Every controller runs an endless reconciliation loop: observe the actual state, compare it to the desired state, and act to close the gap.
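The sketch below is a deliberately simplified stand-in, not real controller code: it shows the observe-compare-act shape of the loop for an imaginary ReplicaSet-like controller. Real controllers react to watch events and work through client-go informers and work queues rather than a timer.

```go
package main

import (
	"fmt"
	"time"
)

// DesiredState and ActualState are simplified stand-ins for what a real
// controller reads from the API server's watch stream.
type DesiredState struct{ Replicas int }
type ActualState struct{ RunningPods int }

// reconcile compares desired and actual state and returns the action to take.
func reconcile(desired DesiredState, actual ActualState) string {
	switch {
	case actual.RunningPods < desired.Replicas:
		return fmt.Sprintf("create %d pod(s)", desired.Replicas-actual.RunningPods)
	case actual.RunningPods > desired.Replicas:
		return fmt.Sprintf("delete %d pod(s)", actual.RunningPods-desired.Replicas)
	default:
		return "nothing to do"
	}
}

func main() {
	desired := DesiredState{Replicas: 3}
	actual := ActualState{RunningPods: 2} // e.g. one pod just crashed

	for i := 0; i < 3; i++ {
		// Observe -> compare -> act, forever.
		action := reconcile(desired, actual)
		fmt.Println("reconcile:", action)
		if actual.RunningPods < desired.Replicas {
			actual.RunningPods++ // pretend the create succeeded
		}
		time.Sleep(10 * time.Millisecond)
	}
}
```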
Controller Manager Components
| Controller | What It Manages |
|---|---|
| ReplicaSet Controller | Ensures correct pod count |
| Deployment Controller | Manages ReplicaSets for rollouts |
| Node Controller | Monitors node health, evicts pods |
| Endpoint Controller | Populates Service endpoints |
| ServiceAccount Controller | Creates default accounts |
| Job Controller | Runs pods to completion |
| PV Binder | Binds PVCs to PVs |
Deployment Controller Deep Dive
What happens when you update a Deployment (a sketch of triggering a rollout follows this list):
- Deployment Controller detects spec change
- Creates new ReplicaSet with new pod template
- Scales up new ReplicaSet (gradually)
- Scales down old ReplicaSet (gradually)
- Updates Deployment status
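A minimal sketch of kicking off that sequence: patch the pod template image of a hypothetical Deployment (web in demo, selected by an assumed app=web label), then list the ReplicaSets the controller is juggling.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)
	ctx := context.TODO()

	// Changing the pod template is the spec change the Deployment controller reacts to.
	patch := []byte(`{"spec":{"template":{"spec":{"containers":[{"name":"web","image":"nginx:1.27"}]}}}}`)
	if _, err := clientset.AppsV1().Deployments("demo").Patch(ctx, "web",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// The controller now owns two ReplicaSets: the old one scaling down, the new one scaling up.
	rsList, err := clientset.AppsV1().ReplicaSets("demo").List(ctx, metav1.ListOptions{LabelSelector: "app=web"})
	if err != nil {
		panic(err)
	}
	for _, rs := range rsList.Items {
		fmt.Printf("%s: desired=%d ready=%d\n", rs.Name, *rs.Spec.Replicas, rs.Status.ReadyReplicas)
	}
}
```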
Kubelet: The Node Agent
The kubelet runs on every node and manages pods assigned to that node.

Kubelet Responsibilities
- Pod Lifecycle: Create, start, stop, delete containers
- Health Checks: Liveness, readiness, startup probes (an example follows this list)
- Resource Reporting: Tell API server about node capacity
- Image Management: Pull container images
- Volume Mounting: Attach storage to pods
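Here is a sketch of a pod with the probes the kubelet executes; the image, ports, and the /healthz and /ready paths are hypothetical.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "probed-app"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "nginx:1.27",
				// The kubelet restarts the container when the liveness probe fails.
				LivenessProbe: &corev1.Probe{
					ProbeHandler: corev1.ProbeHandler{
						HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(80)},
					},
					InitialDelaySeconds: 5,
					PeriodSeconds:       10,
				},
				// The kubelet reports readiness; Services only route traffic to ready pods.
				ReadinessProbe: &corev1.Probe{
					ProbeHandler: corev1.ProbeHandler{
						HTTPGet: &corev1.HTTPGetAction{Path: "/ready", Port: intstr.FromInt(80)},
					},
					PeriodSeconds: 5,
				},
			}},
		},
	}
	out, _ := yaml.Marshal(pod)
	fmt.Println(string(out))
}
```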
How Kubelet Runs Containers
The kubelet does not run containers itself. It talks to a container runtime (containerd, CRI-O) over the Container Runtime Interface (CRI), a gRPC API for creating sandboxes, pulling images, and starting containers.

Pod Sandbox and Containers
For each pod the runtime first creates a sandbox - a pause container that holds the pod's network namespace and IP - and then starts the app containers inside that shared sandbox.
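As a rough sketch of that interface, the program below connects to an assumed containerd CRI socket and makes two of the same calls the kubelet makes: identify the runtime, then list pod sandboxes. The socket path is an assumption and differs for other runtimes.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Assumes containerd's CRI socket at the default path.
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Ask the runtime to identify itself.
	version, err := client.Version(ctx, &runtimeapi.VersionRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Println("runtime:", version.RuntimeName, version.RuntimeVersion)

	// Each sandbox corresponds to one pod's shared network namespace.
	sandboxes, err := client.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{})
	if err != nil {
		panic(err)
	}
	for _, sb := range sandboxes.Items {
		fmt.Printf("sandbox %s holds pod %s/%s\n", sb.Id, sb.Metadata.Namespace, sb.Metadata.Name)
	}
}
```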
Networking Internals
Kubernetes networking follows three rules:
- All pods can communicate with all other pods without NAT
- All nodes can communicate with all pods without NAT
- The IP a pod sees for itself is the same IP others see
CNI: Container Network Interface
CNI plugins implement pod networking:
| Plugin | Approach | Best For |
|---|---|---|
| Calico | BGP routing, L3 | Large clusters, network policies |
| Flannel | VXLAN overlay | Simplicity, small clusters |
| Cilium | eBPF-based | Performance, observability |
| Weave | VXLAN, encryption | Multi-cloud, encrypted overlay |
| AWS VPC CNI | Native AWS networking | EKS clusters |
Pod-to-Pod Communication
On the same node, pod traffic flows over veth pairs into a local bridge or the node's routing table. Across nodes, the CNI plugin either routes pod CIDRs natively (Calico, AWS VPC CNI) or encapsulates traffic in an overlay (Flannel, Weave).
Service Networking: kube-proxy
kube-proxy implements Service load balancing using:
| Mode | How It Works | Performance |
|---|---|---|
| iptables | Rules for packet redirection | Good, O(n) rules |
| IPVS | Linux Virtual Server | Better, O(1) lookup |
| eBPF | Kernel-level packet processing (replaces kube-proxy) | Best; used by Cilium |
Storage Internals
Kubernetes abstracts storage through several layers.

CSI: Container Storage Interface
Storage vendors ship CSI drivers - pluggable backends that implement a standard interface for provisioning, attaching, and mounting volumes - so core Kubernetes never needs to know about specific storage systems.
PV/PVC Binding
- Find PVs with matching StorageClass
- Filter by access modes and capacity
- Select best fit (smallest PV that satisfies request)
- Bind PVC to PV (a claim that exercises these rules is sketched below)
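A sketch of a claim the binder would match on: StorageClass, access mode, and requested capacity. The fast-ssd class and names are hypothetical, and older versions of the k8s.io/api library call the Resources field type ResourceRequirements instead of VolumeResourceRequirements.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	storageClass := "fast-ssd" // hypothetical StorageClass name
	pvc := corev1.PersistentVolumeClaim{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "PersistentVolumeClaim"},
		ObjectMeta: metav1.ObjectMeta{Name: "data", Namespace: "demo"},
		Spec: corev1.PersistentVolumeClaimSpec{
			// The binder only considers PVs with the same StorageClass...
			StorageClassName: &storageClass,
			// ...that offer this access mode...
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// ...and at least this much capacity, preferring the smallest fit.
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("10Gi"),
				},
			},
		},
	}
	out, _ := yaml.Marshal(pvc)
	fmt.Println(string(out))
}
```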
Interview Deep Dive Questions
What happens when you run kubectl apply?
Answer: 1) kubectl validates YAML locally, 2) Request sent to API server, 3) Authentication (who is this?), 4) Authorization (can they do this?), 5) Admission controllers (mutate/validate), 6) Object written to etcd, 7) etcd notifies watchers, 8) Relevant controllers react (e.g., scheduler assigns pod, kubelet runs it).
How does the scheduler decide where to place a pod?
Answer: Two phases: 1) Filtering - eliminate nodes that cannot run the pod (insufficient resources, wrong labels, taints not tolerated), 2) Scoring - rank remaining nodes by preferences (resource balance, image locality, affinity). Highest score wins. Configurable via scheduler profiles and plugins.
What is the controller pattern?
Answer: Controllers run infinite reconciliation loops: observe actual state, compare to desired state, take action to align them. This pattern enables self-healing - if a pod dies, the ReplicaSet controller notices the mismatch and creates a new one. Every Kubernetes controller follows this pattern.
Why is etcd critical to Kubernetes?
Answer: etcd is the single source of truth for cluster state. All objects (pods, services, secrets) are stored in etcd. It uses Raft consensus for consistency and watches for real-time notifications. If etcd is unavailable, the cluster cannot make changes. This is why etcd is typically run as a 3 or 5 node cluster for high availability.
Explain how Services work internally
Answer: Services are abstractions. kube-proxy watches Services and Endpoints, then programs iptables/IPVS rules on each node. When traffic hits a Service IP, rules DNAT (destination NAT) it to a backend pod IP. Load balancing happens via random selection (iptables) or round-robin (IPVS). A ClusterIP is virtual - it exists only as rules on each node, not as an address on any interface.
What is the difference between a DaemonSet and a Deployment?
Answer: A Deployment runs N replicas scheduled anywhere. A DaemonSet runs exactly one pod per node (or per selected node). In early versions the DaemonSet controller placed pods itself; in current versions it creates one pod per node with node affinity and the default scheduler binds them. Use cases: log collectors (fluentd), monitoring agents (node-exporter), network plugins (calico-node).
Debugging with Internals Knowledge
Check etcd Health
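A minimal sketch using the etcd Go client; the endpoint is an assumption and client TLS is omitted (production etcd requires the client cert, key, and CA, on kubeadm clusters under /etc/kubernetes/pki/etcd).

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"https://127.0.0.1:2379"} // assumption
	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	for _, ep := range endpoints {
		// Status reports the member's version, DB size, and current Raft leader.
		status, err := cli.Status(ctx, ep)
		if err != nil {
			fmt.Printf("%s: UNHEALTHY (%v)\n", ep, err)
			continue
		}
		fmt.Printf("%s: version=%s dbSize=%dB leader=%x\n", ep, status.Version, status.DbSize, status.Leader)
	}
}
```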
Trace Scheduler Decisions
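Scheduling failures surface as Events with reason FailedScheduling, including which filters rejected which nodes. The sketch below lists Events for a hypothetical pod web-abc123 in namespace demo.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Events recorded for one pod; scheduling failures show up here.
	events, err := clientset.CoreV1().Events("demo").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.kind=Pod,involvedObject.name=web-abc123",
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\t%s\n", e.Reason, e.Source.Component, e.Message)
	}
}
```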
Debug Controller Issues
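When a controller seems stuck, compare the desired state with what it has achieved so far and read its conditions. A sketch for a hypothetical Deployment web in namespace demo:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	deploy, err := clientset.AppsV1().Deployments("demo").Get(context.TODO(), "web", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Desired state vs what the controllers have actually achieved.
	fmt.Printf("desired=%d updated=%d ready=%d available=%d\n",
		*deploy.Spec.Replicas, deploy.Status.UpdatedReplicas,
		deploy.Status.ReadyReplicas, deploy.Status.AvailableReplicas)

	// Conditions explain why reconciliation is stuck (e.g. ProgressDeadlineExceeded).
	for _, cond := range deploy.Status.Conditions {
		fmt.Printf("%s=%s: %s (%s)\n", cond.Type, cond.Status, cond.Reason, cond.Message)
	}
}
```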
Key Takeaways
- Desired state vs actual state - controllers constantly reconcile
- etcd is the brain - stores all cluster state, uses Raft consensus
- API server is the hub - authentication, authorization, admission, persistence
- Scheduler uses filtering then scoring - predicates eliminate, priorities rank
- Controllers follow the reconciliation pattern - observe, compare, act
- Kubelet uses CRI - pluggable container runtimes
- CNI implements pod networking - Calico, Cilium, Flannel, etc.
- CSI implements storage - pluggable storage backends
Ready to deploy workloads? Next up: Kubernetes Workloads where we will master Deployments, StatefulSets, and DaemonSets.