Azure Kubernetes Service (AKS)
AKS is Azure’s managed Kubernetes service. Master AKS to deploy modern, cloud-native applications at scale.

What You’ll Learn
By the end of this chapter, you’ll understand:
- What containers are and why they exist
- Why Kubernetes is needed for managing containers
- How AKS simplifies Kubernetes management
- How to deploy and scale applications with containers
- When to use containers vs VMs vs serverless
Introduction: What Are Containers? (Start Here if You’re Completely New)
The Problem Containers Solve
Have you ever heard this? “But it works on my machine!” 😤

This is one of software development’s biggest headaches. Let me explain why:
Before Containers: The “Works on My Machine” Problem
Scenario: You built an amazing web application. Your development laptop:
- Node.js version 18.15.0
- Python 3.10
- PostgreSQL 14
- Ubuntu 22.04
- 16 GB RAM

The production server:
- Node.js version 16.14.0 ← Different!
- Python 3.9 ← Different!
- PostgreSQL 13 ← Different!
- Windows 11 ← Different OS!
- 8 GB RAM ← Different!

The result: the app that ran perfectly on your laptop breaks in production.
Real-World Analogy: Shipping Containers
Think about physical shipping containers.

Before Shipping Containers (pre-1950s):
- Pack goods in boxes, crates, barrels
- Different ships need different loading methods
- Goods get damaged during transfer
- Loading/unloading takes weeks
- Ship → Train → Truck = repack everything each time

After Shipping Containers:
- Standard 20-foot or 40-foot metal boxes
- Works on ships, trains, trucks
- Contents protected and isolated
- Loading/unloading takes hours, not weeks
- Ship → Train → Truck = same container, no repacking

Software containers apply the same idea:
- Package your app + all dependencies in one “box”
- Runs identically on any computer with Docker
- Your laptop → Colleague’s laptop → Production server = same container
What is a Container?
Container = A lightweight, standalone package that includes:
- Your application code
- Runtime (Node.js, Python, Java, etc.)
- Libraries and dependencies
- Configuration files
- Operating system files (just what your app needs)
Container Example: Blog Website
Without Containers (Traditional Setup): you install Node.js, a database, and every dependency by hand on each server, and the versions drift between machines over time. With a container, the blog and its exact dependencies are packaged once and run the same everywhere.
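As a minimal, hypothetical sketch (base image, port, and entry point are illustrative assumptions, not from the original example), the whole blog environment can be described in a single Dockerfile:

```dockerfile
# Hypothetical Dockerfile for a Node.js blog
FROM node:18-alpine

WORKDIR /app

# Install dependencies first so Docker caches this layer between builds
COPY package*.json ./
RUN npm ci --omit=dev

# Copy the application code and declare the port the blog listens on
COPY . .
EXPOSE 3000

CMD ["node", "server.js"]
```

Build it once with `docker build -t myblog .` and run it anywhere Docker is installed with `docker run -p 3000:3000 myblog`: the same image on your laptop, a colleague’s laptop, or a production server.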
Containers vs Virtual Machines

Visual Comparison:

| Feature | Virtual Machine | Container |
|---|---|---|
| Size | 2-10 GB | 50-500 MB |
| Startup Time | 1-5 minutes | 1-5 seconds |
| Resource Usage | High (full OS) | Low (shared OS) |
| Isolation | Complete (hardware-level) | Process-level |
| Portability | Moderate | Excellent |
| Use Case | Different OS needed | Same OS, fast deployment |
- VM: You want to run Windows software on a Mac → Use VM
- Container: You want to deploy 100 copies of your web app → Use containers
Why Use Containers?
1. Consistency: the same container image runs identically in development, testing, and production, which eliminates the “works on my machine” class of bugs.

Real-World Example: E-Commerce Website
Traditional Deployment (No Containers): each part of the site (catalog, cart, payments) is installed by hand on its own VM, and every release risks breaking a shared dependency. With containers, each service ships as its own isolated image and can be deployed and rolled back independently.

Common Mistakes Beginners Make
❌ Mistake 1: Thinking containers are just lightweight VMs
✅ Reality: Containers share the host OS kernel; VMs don’t.

❌ Mistake 2: Storing data inside containers
✅ Reality: Containers are ephemeral (temporary). Use volumes for persistent data.

❌ Mistake 3: Running multiple apps in one container
✅ Reality: One container = one process (web server OR database, not both).

❌ Mistake 4: Using containers for everything
✅ Reality: Sometimes VMs are better (different OS, strong isolation).

When to Use Containers vs VMs
Use Containers When:
✅ You want fast deployment (seconds)
✅ You need to run many copies of the same app
✅ You want consistent environments (dev = production)
✅ Your app runs on Linux

Use VMs When:
✅ You need complete isolation (security, compliance)
✅ You need different operating systems (Windows + Linux on same hardware)
✅ You have legacy apps that can’t be containerized
✅ You need full control over the operating system

What is Kubernetes? (The Next Step After Containers)
The Problem Kubernetes Solves
You’ve learned containers solve the “works on my machine” problem. But…

Scenario: Your blog got popular! 🎉

Month 1:
- 100 visitors/day
- 1 container handles it easily
- Cost: $10/month

A few months later:
- 50,000 visitors/day
- Need 20 containers to handle traffic
- Multiple servers needed

Now you have new problems:

1. Which server should run which container?
   - Server 1 has 20 GB RAM free, Server 2 has 4 GB free
   - Manual placement = nightmare
2. What if a container crashes?
   - Who restarts it? How do you know it crashed?
   - Manual monitoring = 24/7 job
3. How do users reach the right container?
   - 20 containers, each with a different IP address
   - Users need one URL: myblog.com
4. How to update without downtime?
   - Stop all 20 containers = website down
   - Update one-by-one manually = takes hours, error-prone
5. How to handle traffic spikes?
   - Black Friday: need 50 containers
   - Tuesday at 3am: need 5 containers
   - Manual scaling = expensive or too slow
Real-World Analogy: Shipping Port
Without Kubernetes (Manual Container Management): you are the crane operator deciding by hand where every container goes, noticing when one falls over, and shuffling them around during rush hour. Kubernetes is the automated port authority that does all of that for you.

What is Kubernetes?
Kubernetes (K8s) = An open-source container orchestration platform that automates deployment, scaling, and management of containerized applications.

Think of Kubernetes as: an operating system for your containers across many servers.

Core Features:
- Self-Healing: Container crashed? Kubernetes restarts it automatically
- Load Balancing: Distributes traffic across containers
- Auto-Scaling: More traffic? Kubernetes adds containers. Less traffic? Removes them.
- Rolling Updates: Update app without downtime
- Service Discovery: Containers find each other automatically
- Storage Orchestration: Attach storage to containers automatically
Kubernetes in Simple Terms
Transportation Analogy: containers are the vehicles carrying your app; Kubernetes is the traffic control system that decides how many vehicles are on the road, which routes they take, and what happens when one breaks down.

Real Example: Online Shopping Website
Black Friday Sale (No Kubernetes): traffic spikes 10x, you log into servers at 2am to start more containers by hand, and when one server dies nobody notices until customers complain. With Kubernetes, the autoscaler adds pods as traffic climbs and replaces failed ones automatically.

What is Azure Kubernetes Service (AKS)?
Plain Kubernetes (DIY): you provision the control plane VMs yourself, install and patch etcd, the API server, the scheduler, and the controllers, and you are on call when any of them breaks. AKS hands the entire control plane to Microsoft; you manage only the worker nodes and your workloads.

Under the Hood: The AKS Control Plane
In AKS, the cluster is split into two distinct halves:

1. The Control Plane (Managed by Azure)
You never see these VMs, but they are there. They run:
- kube-apiserver: The front door. Every kubectl command hits this.
- etcd: The source of truth. A distributed database that stores the cluster state.
- kube-scheduler: Decides which node should run your pod based on resources.
- kube-controller-manager: Watches for deviations (e.g., “I need 3 pods, but only 2 are running”) and fixes them.
2. The Data Plane (Your Worker Nodes)
These are the VMs in your subscription. They run the following agents (you can see them yourself with the commands after this list):
- kubelet: The agent that takes orders from the control plane and starts containers.
- kube-proxy: Handles networking and load balancing between pods.
- Container Runtime: Usually containerd (Docker’s core).
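Once you are connected to a cluster, both halves are visible from your terminal (generic kubectl commands, nothing AKS-specific):

```bash
# Data plane: the worker node VMs in your subscription
kubectl get nodes -o wide

# System workloads (CoreDNS, metrics-server, kube-proxy) running on those nodes
kubectl get pods -n kube-system
```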
[!IMPORTANT] Pro Insight: The ‘Free’ Control Plane In the standard AKS tier, the control plane is free. However, if you have a massive cluster (100+ nodes), you should upgrade to the Uptime SLA tier ($0.10/hour). This gives you a financially backed 99.95% availability guarantee for the API server itself.
Cost Comparison Example
Running 20 Containers for a Web App:
- Option 1: Traditional VMs (No Containers): roughly one VM per application instance, so you pay for 20 mostly idle operating systems.
- Option 2: Containers on AKS: the scheduler packs those 20 containers onto a handful of worker nodes, so you pay for far fewer VMs.

1. Why Kubernetes?
Before Kubernetes
- Manual container orchestration
- No auto-scaling
- Complex networking
- Manual load balancing
- No self-healing
With Kubernetes
- Automated orchestration
- Auto-scaling (HPA, VPA, Cluster Autoscaler)
- Service discovery
- Built-in load balancing
- Self-healing (restart failed pods)
[!WARNING] Gotcha: System Node Pools Every AKS cluster needs at least one “System Node Pool” to run Kubernetes itself (CoreDNS, Metrics Server). You cannot delete this pool or scale it to 0. It will always cost you money (usually 1-3 VMs).
[!TIP] Jargon Alert: Pod vs Node. Node: A virtual machine (the house). Pod: A running process/container (the tenant living in the house). A single Node (VM) usually hosts many Pods.
2. AKS Architecture
3. Create AKS Cluster
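A minimal sketch of creating a cluster with the Azure CLI; resource group, cluster name, and region are placeholders, and your organization’s defaults (node size, networking, identity) will differ:

```bash
# Create a resource group to hold the cluster
az group create --name myResourceGroup --location eastus

# Create a small AKS cluster with a managed identity and 2 worker nodes
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 2 \
  --enable-managed-identity \
  --generate-ssh-keys

# Merge the cluster credentials into your kubeconfig so kubectl can talk to it
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster

# Confirm the nodes are Ready
kubectl get nodes
```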
4. AKS Networking

AKS supports two network plugin models (a hedged creation example follows the list):
- kubenet (Basic)
- Azure CNI (Advanced)
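The plugin is chosen at cluster creation time. A hedged sketch of creating an Azure CNI cluster whose pods get IPs directly from your VNet subnet (subnet ID and CIDRs are placeholders); for kubenet you would pass --network-plugin kubenet instead:

```bash
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --network-plugin azure \
  --vnet-subnet-id "/subscriptions/<sub-id>/resourceGroups/myRG/providers/Microsoft.Network/virtualNetworks/myVnet/subnets/aks-subnet" \
  --service-cidr 10.2.0.0/24 \
  --dns-service-ip 10.2.0.10 \
  --node-count 2
```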
5. Deploy Application
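A hedged sketch of the basic pattern: a Deployment that runs the containers and a Service that exposes them (names, image, and ports are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3                     # run 3 identical pods
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:1.0.0   # placeholder image in ACR
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer              # provisions an Azure Load Balancer with a public IP
  selector:
    app: myapp                    # must match the pod labels above
  ports:
    - port: 80
      targetPort: 8080
```

Apply it with `kubectl apply -f myapp.yaml`, then watch `kubectl get service myapp` until the EXTERNAL-IP appears.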
6. Autoscaling

AKS scales at two levels (hedged sketches follow the list):
- Horizontal Pod Autoscaler
- Cluster Autoscaler
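A hedged sketch of both layers: an HPA that scales pods on CPU, and the cluster autoscaler enabled on the node pool so AKS can add or remove VMs (names and thresholds are placeholders):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50   # add pods when average CPU exceeds 50%
```

```bash
# Let AKS grow/shrink the node pool itself when pods cannot be scheduled
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-cluster-autoscaler \
  --min-count 2 \
  --max-count 5
```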
7. Best Practices
Resource Limits
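A hedged container-spec fragment (numbers are illustrative): set requests so the scheduler and HPA can reason about the pod, and limits to cap runaway usage:

```yaml
resources:
  requests:
    cpu: 250m        # guaranteed share, used for scheduling decisions
    memory: 256Mi
  limits:
    cpu: 500m        # hard ceiling; exceeding the memory limit gets the pod OOMKilled
    memory: 512Mi
```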
Health Checks
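A hedged probe fragment (path and port assumed): the readiness probe gates traffic until the pod can serve it, and the liveness probe restarts a hung container:

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
```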
Use Namespaces
Security
Monitoring
GitOps
8. Interview Questions
Beginner Level
Q1: What is the difference between a Pod and a Node?
- Node: A worker machine (VM) in Kubernetes. It runs pods.
- Pod: The smallest deployable unit. Usually contains one container (but can have sidecars).
Q2: Explain the difference between ClusterIP, NodePort, and LoadBalancer
- ClusterIP: Internal IP only. Not accessible from outside. Default type.
- NodePort: Exposes service on a static port on each Node IP.
- LoadBalancer: Provisions an external Azure Load Balancer to expose service publicly.
Q3: What is the Master Node (Control Plane) responsible for?
- Scheduling pods (kube-scheduler)
- Detecting and responding to cluster events (kube-controller-manager)
- Storing cluster state (etcd)
- Exposing the Kubernetes API (kube-apiserver)
Intermediate Level
Q4: How does an Ingress Controller differ from a Load Balancer?
- Load Balancer: Layer 4 (TCP/UDP). One IP per service. Expensive for many services.
- Ingress Controller: Layer 7 (HTTP/HTTPS). Single IP for multiple services. Supports path-based routing (/api, /web), SSL termination, and URL rewriting (a hedged sketch follows).
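As a hedged sketch (assuming the NGINX ingress controller is installed; hostname and service names are placeholders), one Ingress routes both paths through a single public IP:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: myblog.com                 # placeholder hostname
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-service      # assumed backend Services
                port:
                  number: 80
          - path: /web
            pathType: Prefix
            backend:
              service:
                name: web-service
                port:
                  number: 80
```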
Q5: What happens when a Pod crashes?
- The Kubelet on the node detects the crash.
- Based on restartPolicy (default: Always), the kubelet restarts the container.
- If the pod is part of a Deployment/ReplicaSet and the Node itself dies, the ReplicaSet controller creates a replacement Pod and the Scheduler places it on a healthy Node.
Advanced Level
Q6: How do you upgrade an AKS cluster with zero downtime?
- Cordon a node (prevent new pods).
- Drain the node (move existing pods to other nodes).
- Delete the node.
- Create a new node with the updated version.
- Repeat for all nodes (one by one or in batches).

PodDisruptionBudgets must be configured to ensure minAvailable replicas during the process (a hedged sketch follows).
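A hedged PodDisruptionBudget sketch (label and count are placeholders) that keeps at least two replicas running while nodes are drained during the upgrade:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 2          # kubectl drain will pause rather than violate this
  selector:
    matchLabels:
      app: myapp
```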
Q7: Explain the Sidecar Pattern

A sidecar is a helper container that runs in the same Pod as the main application container, sharing its network and storage, to add functionality without changing the app. Common uses:
- Logging (sending logs to Splunk/Log Analytics)
- Proxying (Service Mesh like Istio/Linkerd)
- Config watching (reloading configuration)
- Security (TLS termination)
9. Helm: Kubernetes Package Manager
Why Helm?
Without Helm:
- Manage 20+ YAML files manually
- Copy-paste configurations for dev/staging/prod
- Hard to version and roll back deployments

With Helm:
- Single-command deployment: helm install myapp ./chart
- Templated configurations with values
- Easy rollbacks: helm rollback myapp 1
- Reusable charts from public repositories
Helm Architecture
Creating a Helm Chart
Chart.yaml (Metadata)
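A hedged Chart.yaml sketch (name and versions are placeholders):

```yaml
apiVersion: v2
name: myapp
description: A Helm chart for the myapp web service
type: application
version: 0.1.0          # chart version: bump when the chart structure changes
appVersion: "1.0.0"     # version of the application image being deployed
```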
values.yaml (Configuration)
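A hedged values.yaml sketch holding the knobs the templates will reference (values are placeholders):

```yaml
replicaCount: 2
image:
  repository: myregistry.azurecr.io/myapp   # placeholder ACR repository
  tag: "1.0.0"
  pullPolicy: IfNotPresent
service:
  type: ClusterIP
  port: 80
resources:
  requests:
    cpu: 250m
    memory: 256Mi
```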
templates/deployment.yaml (Templated Manifest)
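A hedged, trimmed-down templates/deployment.yaml showing how the values above are injected into the manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-myapp          # release name keeps installs unique
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Chart.Name }}
  template:
    metadata:
      labels:
        app: {{ .Chart.Name }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: {{ .Values.service.port }}
```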
Deploying with Helm
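Typical commands, assuming the chart lives in a local ./chart directory as in the earlier example:

```bash
# Install the chart as a release named "myapp"
helm install myapp ./chart

# Upgrade the release with environment-specific overrides
helm upgrade myapp ./chart -f values-prod.yaml

# Inspect the release history and roll back to revision 1 if the upgrade misbehaves
helm history myapp
helm rollback myapp 1
```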
Helm Repositories
Multi-Environment Strategy
values-dev.yaml: a small per-environment override file (dev/staging/prod) that changes only what differs (replica counts, image tags, resource sizes) and is applied at install time with -f values-dev.yaml.

[!TIP] Best Practice: Chart Versioning
- Chart version (version in Chart.yaml): Increment when chart structure changes
- App version (appVersion): Tracks the application version being deployed
- Use semantic versioning: 1.2.3 (MAJOR.MINOR.PATCH)
[!WARNING] Gotcha: Helm Secrets Never commit secrets to values.yaml! Use:
- Azure Key Vault: Inject secrets via CSI driver
- Sealed Secrets: Encrypt secrets in Git
- helm-secrets plugin: Encrypt values files with SOPS
10. GitOps with ArgoCD
GitOps Principles
- Declarative: Entire system described declaratively (YAML in Git)
- Versioned: Git history = deployment history
- Automated: Changes in Git automatically deployed
- Reconciled: Cluster state continuously reconciled with Git
ArgoCD Architecture
Installing ArgoCD on AKS
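A sketch using the upstream install manifests (the standard documented approach; namespace and port-forward are the defaults):

```bash
# Install ArgoCD into its own namespace
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Retrieve the generated admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

# Reach the UI/API locally
kubectl port-forward svc/argocd-server -n argocd 8080:443
```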
Creating an Application
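A hedged Application manifest sketch (repository URL, path, and namespaces are placeholders) that tells ArgoCD what to sync and where:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo   # placeholder GitOps repo
    targetRevision: main
    path: apps/myapp/prod                             # folder holding the manifests
  destination:
    server: https://kubernetes.default.svc            # the cluster ArgoCD runs in
    namespace: myapp
  syncPolicy:
    automated:
      prune: true        # delete resources that were removed from Git
      selfHeal: true     # revert manual kubectl changes back to the Git state
```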
Git Repository Structure: typically one folder per application and environment (e.g., apps/myapp/dev, apps/myapp/prod), each holding the manifests or Helm values ArgoCD should sync.

GitOps Workflow
[!IMPORTANT] Recommendation: Separate Repos
- Application code repo: Source code, Dockerfile
- GitOps repo: Kubernetes manifests, Helm charts
- CI updates GitOps repo after building image
Sync Strategies
| Strategy | Behavior | Use Case |
|---|---|---|
| Manual | Requires manual sync | Production (human approval) |
| Automated | Auto-sync on Git change | Dev/Staging |
| Auto-Prune | Delete resources not in Git | Clean up old resources |
| Self-Heal | Revert manual kubectl changes | Enforce Git as source of truth |
11. Service Mesh Basics (Istio)
Why Service Mesh?
Without Service Mesh:
- Implement retries, timeouts, circuit breakers in every microservice
- No visibility into service-to-service traffic
- Difficult to enforce mTLS between services

With a Service Mesh:
- Traffic Management: Canary deployments, A/B testing, retries
- Security: Automatic mTLS between services
- Observability: Distributed tracing, metrics, logs
Istio Architecture
Installing Istio on AKS
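A sketch of the upstream istioctl workflow (the profile choice is an assumption; production installs are usually tuned further):

```bash
# Download istioctl and install Istio with the demo profile
curl -L https://istio.io/downloadIstio | sh -
istioctl install --set profile=demo -y

# Label a namespace so Envoy sidecars are injected into new pods automatically
kubectl label namespace default istio-injection=enabled
```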
Traffic Management Example

Canary Deployment (90% v1, 10% v2), sketched below:
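A hedged VirtualService/DestinationRule sketch (service name and version labels are placeholders) that sends 90% of traffic to v1 and 10% to v2:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp                      # the Kubernetes Service name
  http:
    - route:
        - destination:
            host: myapp
            subset: v1
          weight: 90
        - destination:
            host: myapp
            subset: v2
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: v1
      labels:
        version: v1              # matches the pod label on the v1 Deployment
    - name: v2
      labels:
        version: v2
```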
[!NOTE] Deep Dive: When to Use Service Mesh?
- YES: Microservices (10+ services), need mTLS, complex traffic routing
- NO: Monolith, simple apps, small teams (adds complexity)
13. AKS Security Deep Dive
Pod Security Standards
Pod Security Standards replace the deprecated Pod Security Policies (PSPs). Three Levels:
- Privileged: Unrestricted
- Baseline: Minimally restrictive (prevents known privilege escalations)
- Restricted: Heavily restricted (hardened, follows pod hardening best practices)
Network Policies

Network Policies = firewall rules for pods: by default every pod can reach every other pod, and policies restrict that traffic by label and namespace (a hedged sketch follows).
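A hedged sketch (namespace and labels are placeholders): only frontend pods may reach the API pods on port 8080; once a policy selects those pods, everything else is denied:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api                  # the pods this policy protects
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend     # only pods with this label may connect
      ports:
        - protocol: TCP
          port: 8080
```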
Secrets Management with Azure Key Vault
CSI Driver for Azure Key Vault: the Secrets Store CSI driver mounts Key Vault secrets into pods as files at runtime, so secrets never need to live in plain Kubernetes Secrets or in Git.

14. StatefulSets & Persistent Storage
StatefulSet = For stateful applications (databases, message queues) that need stable network identity and persistent storage.

StatefulSet vs Deployment
| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod Names | Random (web-7d8f-xyz) | Ordered (web-0, web-1, web-2) |
| Scaling | Parallel | Sequential (web-0 → web-1 → web-2) |
| Storage | Shared or ephemeral | Dedicated persistent volume per pod |
| Network Identity | Random | Stable (web-0.service.namespace.svc) |
| Use Case | Stateless apps | Databases, Kafka, Redis |
StatefulSet Example
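A hedged sketch matching the table above: three ordered pods (web-0, web-1, web-2), each with its own Azure Disk-backed volume (image and sizes are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web               # headless Service that gives each pod a stable DNS name
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:          # one dedicated PersistentVolumeClaim per pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: managed-premium
        resources:
          requests:
            storage: 10Gi
```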
Azure Disk vs Azure Files
| Feature | Azure Disk | Azure Files |
|---|---|---|
| Access Mode | ReadWriteOnce (single pod) | ReadWriteMany (multiple pods) |
| Performance | Higher IOPS | Lower IOPS |
| Use Case | Databases | Shared storage, logs |
| Storage Class | managed-premium, managed-standard | azurefile, azurefile-premium |
15. KEDA: Event-Driven Autoscaling
Installing KEDA
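The usual route is the official Helm chart:

```bash
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
```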
Example: Scale Based on Azure Service Bus Queue
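A hedged ScaledObject sketch (queue, deployment, and auth names are placeholders) that produces the behaviour listed below: roughly one pod per 10 queued messages, between 1 and 20 replicas:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-scaler
spec:
  scaleTargetRef:
    name: orders-processor        # the Deployment KEDA scales
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders         # placeholder queue name
        messageCount: "10"        # target messages per replica
      authenticationRef:
        name: servicebus-auth     # TriggerAuthentication holding the connection string
```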
- Queue has 50 messages → KEDA scales to 5 pods (50/10)
- Queue has 200 messages → KEDA scales to 20 pods (max)
- Queue empty → KEDA scales to 1 pod (min)
Popular KEDA Scalers
- Azure Service Bus: Queue/Topic message count
- Azure Storage Queue: Queue length
- HTTP: Incoming HTTP requests
- Prometheus: Custom metrics
- Kafka: Consumer lag
- Redis: List length
- Cron: Time-based scaling
16. Interview Questions
Beginner Level
Q1: What is the difference between a Pod and a Deployment?
Pod:
- Smallest deployable unit in Kubernetes
- One or more containers running together
- Ephemeral (dies when node fails)
- No self-healing
Deployment:
- Manages a set of identical Pods (ReplicaSet)
- Ensures desired number of Pods are running
- Self-healing (recreates failed Pods)
- Supports rolling updates and rollbacks
Q2: Explain Kubernetes namespaces
A namespace is a logical partition of a single cluster, used to group and isolate resources. Common uses:
- Environment separation: dev, staging, prod
- Team isolation: team-a, team-b
- Resource quotas: Limit CPU/memory per namespace
Built-in namespaces:
- default: Default namespace for resources
- kube-system: Kubernetes system components
- kube-public: Public resources (readable by all)
Q3: What is a Service in Kubernetes?
A Service gives a set of pods a stable IP address and DNS name and load-balances traffic across them. Types:
- ClusterIP (default): Internal only (10.0.1.5)
- NodePort: Exposes on each node’s IP (30000-32767)
- LoadBalancer: Creates Azure Load Balancer (public IP)
Intermediate Level
Q4: How does Horizontal Pod Autoscaler (HPA) work?
- Metrics Server collects pod metrics every 15 seconds
- HPA controller checks metrics every 30 seconds
- If avg CPU > target, scale up
- If avg CPU < target (for 5 min), scale down
Example:
- Current: 3 pods, avg CPU 80%
- Target: 50%
- Desired: ceil(3 * (80/50)) = ceil(4.8) = 5 pods
Note: HPA requires resources.requests to be set!

Q5: Explain the difference between Kubenet and Azure CNI
| Feature | Kubenet | Azure CNI |
|---|---|---|
| Pod IP | Private (10.244.x.x) | VNet IP (10.0.1.x) |
| IP Consumption | Low (NAT used) | High (1 IP per pod) |
| Performance | Slight overhead (NAT) | Direct routing (faster) |
| VNet Integration | No | Yes (pods directly in VNet) |
| Network Policies | Calico required | Native support |
| Use Case | Small clusters, IP conservation | Enterprise, VNet integration |
Q6: How do you implement zero-downtime deployments in AKS?
Use a rolling update, the default Deployment strategy (see the sketch after these steps):
- Create 1 new pod (v2)
- Wait for readiness probe to pass
- Terminate 1 old pod (v1)
- Repeat until all pods are v2
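A hedged fragment of the Deployment spec that makes this behaviour explicit (numbers and probe details are illustrative):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # create at most 1 extra pod during the rollout
      maxUnavailable: 0      # never drop below the desired replica count
  template:
    spec:
      containers:
        - name: myapp
          image: myregistry.azurecr.io/myapp:2.0.0   # the new version
          readinessProbe:                            # old pod is removed only once this passes
            httpGet:
              path: /healthz
              port: 8080
```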
Advanced Level
Q7: Design a multi-tenant AKS architecture
Requirements:
- Isolate tenants (security, resources)
- Cost allocation per tenant
- Prevent noisy neighbor

A common approach: give each tenant its own namespace with ResourceQuotas and LimitRanges, enforce NetworkPolicies between namespaces, use dedicated node pools with taints/tolerations where stronger isolation is required, and label everything per tenant for cost allocation.
Q8: How do you troubleshoot a CrashLoopBackOff pod?
- Check pod events: kubectl describe pod <pod-name>
- Check logs: kubectl logs <pod-name> --previous (the crashed instance, not the restarted one)
- Common causes:
  - OOMKilled: Increase resources.limits.memory
  - Application error: Fix code, check environment variables
  - Missing dependencies: Database not ready → Add init container
  - Liveness probe failing: Adjust probe settings
- Debug with an ephemeral container (Kubernetes 1.23+): kubectl debug -it <pod-name> --image=busybox --target=<container>
- Disable probes temporarily (edit the Deployment) while you investigate
Q9: Implement a blue-green deployment strategy in AKS
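Short answer: run two full Deployments side by side (blue = the current version, green = the new one), test green privately, then flip the Service selector so all traffic cuts over at once; rollback is flipping it back. A hedged sketch of the Service that does the switching (names are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue        # change to "green" to cut over; back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080
```

The switch itself can be a one-line kubectl patch or, better, a Git commit applied by your GitOps tool.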
Troubleshooting: The AKS Production Triage
When a pod fails in production, don’t panic. Follow this 3-step triage:

1. The “Pods won’t start” Phase
- ImagePullBackOff: Kubernetes can’t download your container image.
  - The Pro Check: Does the AKS cluster have the AcrPull permission on your Container Registry?
- CrashLoopBackOff: The container starts but immediately crashes.
  - The Pro Check: Run kubectl logs <pod-name> --previous. You need to see the logs from the failed instance, not the new one that just restarted.
- Pending: The pod isn’t even trying to start.
  - The Pro Check: Run kubectl describe pod <pod-name>. Usually it’s because you requested 2 GB of RAM but your nodes only have 1 GB available.
2. The “Network Ghost” Phase
- Service but no Response: The service is running, but you get a 504 timeout.
  - The Pro Check: Do the selectors in your Service YAML exactly match the labels in your Deployment YAML? If not, the Load Balancer is sending traffic into a black hole.
3. The “Node Pressure” Phase
- Evicted Pods: Your pods are being killed randomly.
- The Pro Check: Your Node is out of disk space or RAM. Check “Azure Monitor for Containers” to see which app is leaking memory.
[!TIP]
Pro Tool: Lens & k9s
While kubectl is the standard, Principal Engineers often use Lens (Desktop UI) or k9s (Terminal UI) to visualize cluster health in real-time. These tools make it instantly obvious when a deployment is failing across multiple zones.
17. Key Takeaways
Managed Control Plane

Azure runs, patches, and scales the Kubernetes control plane for you; you manage only the node pools and the workloads that run on them.
Declarative Config

Describe the desired state in version-controlled YAML and let Kubernetes reconcile it; avoid imperative one-off commands (like kubectl run) in production.