
Azure Kubernetes Service (AKS)

AKS is Azure’s managed Kubernetes service. Master AKS to deploy modern, cloud-native applications at scale.

What You’ll Learn

By the end of this chapter, you’ll understand:
  • What containers are and why they exist
  • Why Kubernetes is needed for managing containers
  • How AKS simplifies Kubernetes management
  • How to deploy and scale applications with containers
  • When to use containers vs VMs vs serverless

Introduction: What Are Containers? (Start Here if You’re Completely New)

The Problem Containers Solve

Have you ever heard this?
“But it works on my machine!” 😤
This is one of software development’s biggest headaches. Let me explain why:

Before Containers: The “Works on My Machine” Problem

Scenario: You built an amazing web application. Your development laptop:
  • Node.js version 18.15.0
  • Python 3.10
  • PostgreSQL 14
  • Ubuntu 22.04
  • 16 GB RAM
Your colleague’s laptop:
  • Node.js version 16.14.0 ← Different!
  • Python 3.9 ← Different!
  • PostgreSQL 13 ← Different!
  • Windows 11 ← Different OS!
  • 8 GB RAM ← Different!
What happens?
Your machine: ✅ App runs perfectly
Colleague's machine: ❌ App crashes with dependency errors
Production server: ❌ App won't even start
The root cause: Your app depends on specific versions of libraries, tools, and the operating system. When any of these change, things break.

Real-World Analogy: Shipping Containers

Think about physical shipping containers.

Before Shipping Containers (1950s):
  • Pack goods in boxes, crates, barrels
  • Different ships need different loading methods
  • Goods get damaged during transfer
  • Loading/unloading takes weeks
  • Ship → Train → Truck = repack everything each time
After Shipping Containers:
  • Standard 20-foot or 40-foot metal boxes
  • Works on ships, trains, trucks
  • Contents protected and isolated
  • Loading/unloading takes hours, not weeks
  • Ship → Train → Truck = same container, no repacking
Software containers work the same way:
  • Package your app + all dependencies in one “box”
  • Runs identically on any computer with Docker
  • Your laptop → Colleague’s laptop → Production server = same container

What is a Container?

Container = A lightweight, standalone package that includes:
  • Your application code
  • Runtime (Node.js, Python, Java, etc.)
  • Libraries and dependencies
  • Configuration files
  • Operating system files (just what your app needs)
Think of it as: A fully-furnished apartment in a box. Everything your app needs to run is inside.

Container Example: Blog Website

Without Containers (Traditional Setup):
# On production server:
1. Install Node.js 18.15.0 manually
2. Install PostgreSQL 14 manually
3. Configure environment variables
4. Clone your code from Git
5. Install dependencies: npm install
6. Start the app: npm start
7. Hope nothing breaks 🤞

Problems:
- Takes 2-3 hours to setup
- Different versions = different bugs
- New server = redo all steps
- Server updates might break your app
With Containers (Modern Setup):
# Dockerfile (defines your container)
FROM node:18.15.0                    # Start with Node.js 18.15.0
WORKDIR /app                          # Working directory
COPY package*.json ./                 # Copy dependency files
RUN npm install                       # Install dependencies
COPY . .                              # Copy app code
EXPOSE 3000                           # App runs on port 3000
CMD ["npm", "start"]                  # Start the app
# On any server with Docker:
docker build -t myblog:1.0 .          # Build container (1 minute)
docker run -p 3000:3000 myblog:1.0    # Run container (5 seconds)

Benefits:
- Works identically everywhere
- Takes seconds to start
- Isolated from other apps
- Easy to update (new container = new version)

Containers vs Virtual Machines

Visual Comparison:
VIRTUAL MACHINES (Heavy):
┌─────────────────────────────────┐
│  Physical Server                 │
│  ├── Hardware (CPU, RAM, Disk)   │
│  ├── Host OS (Windows/Linux)     │
│  ├── Hypervisor (VMware/Hyper-V) │
│  └── Virtual Machines:           │
│      ├── VM 1:                   │
│      │   ├── Guest OS (2 GB)     │ ← Full operating system
│      │   └── App A                │
│      ├── VM 2:                   │
│      │   ├── Guest OS (2 GB)     │ ← Another full OS
│      │   └── App B                │
│      └── VM 3:                   │
│          ├── Guest OS (2 GB)     │ ← Yet another full OS
│          └── App C                │
└─────────────────────────────────┘
Total: 6 GB just for operating systems!

CONTAINERS (Lightweight):
┌─────────────────────────────────┐
│  Physical Server                 │
│  ├── Hardware (CPU, RAM, Disk)   │
│  ├── Host OS (Linux)              │
│  ├── Docker Engine                │
│  └── Containers:                 │
│      ├── Container 1 (App A)     │ ← Shares host OS
│      ├── Container 2 (App B)     │ ← Shares host OS
│      └── Container 3 (App C)     │ ← Shares host OS
└─────────────────────────────────┘
Total: ~100 MB for all containers!
| Feature | Virtual Machine | Container |
|---|---|---|
| Size | 2-10 GB | 50-500 MB |
| Startup Time | 1-5 minutes | 1-5 seconds |
| Resource Usage | High (full OS) | Low (shared OS) |
| Isolation | Complete (hardware-level) | Process-level |
| Portability | Moderate | Excellent |
| Use Case | Different OS needed | Same OS, fast deployment |
Example:
  • VM: You want to run Windows software on a Mac → Use VM
  • Container: You want to deploy 100 copies of your web app → Use containers

Why Use Containers?

1. Consistency:
Developer laptop: ✅ Works
Colleague's laptop: ✅ Works (same container)
Testing server: ✅ Works (same container)
Production server: ✅ Works (same container)
2. Speed:
VM startup: 2 minutes
Container startup: 2 seconds ← 60x faster!
3. Efficiency:
Physical Server (64 GB RAM):
├── VMs: Can run ~10 VMs (each uses 2+ GB OS)
└── Containers: Can run ~100 containers (share host OS)
4. Portability:
Build once → Run anywhere:
- Your laptop (Windows)
- Colleague's laptop (Mac)
- Azure datacenter (Linux)
- AWS datacenter (Linux)
- Google Cloud (Linux)
5. Isolation:
Container A crashes → Containers B, C, D unaffected
Container B has security bug → Containers A, C, D unaffected

Real-World Example: E-Commerce Website

Traditional Deployment (No Containers):
Production Server 1:
- Install Node.js 18
- Install nginx web server
- Install MongoDB 6
- Install Redis 7
- Deploy web app code
- Configure everything manually

Problem: Takes 3-4 hours, prone to human error

New developer joins:
- Spend 2 days setting up local environment
- "It doesn't work on my machine!" ← Wastes days debugging
Container Deployment:
docker-compose.yml (defines all containers):

version: '3'
services:
  web:                           # Web application
    image: mycompany/webapp:2.1
    ports: ["80:3000"]

  database:                      # MongoDB database
    image: mongo:6.0
    volumes: ["db-data:/data/db"]

  cache:                         # Redis cache
    image: redis:7.0
# On ANY server (dev, staging, production):
docker-compose up -d

# Result:
- Starts 3 containers in 10 seconds
- Works identically everywhere
- New developer: 5 minutes to run entire stack locally

Common Mistakes Beginners Make

Mistake 1: Thinking containers are just lightweight VMs.
  ✅ Reality: Containers share the host OS kernel; VMs don’t.
Mistake 2: Storing data inside containers.
  ✅ Reality: Containers are ephemeral (temporary). Use volumes for persistent data (see the volume sketch below).
Mistake 3: Running multiple apps in one container.
  ✅ Reality: One container = one process (web server OR database, not both).
Mistake 4: Using containers for everything.
  ✅ Reality: Sometimes VMs are better (need different OS, strong isolation).
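A minimal sketch of the fix for Mistake 2, assuming Docker and the public postgres:14 image; the volume name pgdata and container name mydb are hypothetical:

# Create a named volume that outlives any container:
docker volume create pgdata

# Mount it at PostgreSQL's data directory:
docker run -d --name mydb \
  -e POSTGRES_PASSWORD=example \
  -v pgdata:/var/lib/postgresql/data \
  postgres:14

# Destroy and recreate the container - the data in "pgdata" survives:
docker rm -f mydb
docker run -d --name mydb -e POSTGRES_PASSWORD=example \
  -v pgdata:/var/lib/postgresql/data postgres:14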

When to Use Containers vs VMs

Use Containers When:
  ✅ You want fast deployment (seconds)
  ✅ You need to run many copies of the same app
  ✅ You want consistent environments (dev = production)
  ✅ Your app runs on Linux
Use VMs When:
  ✅ You need complete isolation (security, compliance)
  ✅ You need different operating systems (Windows + Linux on the same hardware)
  ✅ You have legacy apps that can’t be containerized
  ✅ You need full control over the operating system

What is Kubernetes? (The Next Step After Containers)

The Problem Kubernetes Solves

You’ve learned that containers solve the “works on my machine” problem. But…

Scenario: Your blog got popular! 🎉

Month 1:
  • 100 visitors/day
  • 1 container handles it easily
  • Cost: $10/month
Month 6:
  • 50,000 visitors/day
  • Need 20 containers to handle traffic
  • Multiple servers needed
New problems arise:
  1. Which server should run which container?
    • Server 1 has 20 GB RAM free, Server 2 has 4 GB free
    • Manual placement = nightmare
  2. What if a container crashes?
    • Who restarts it? How do you know it crashed?
    • Manual monitoring = 24/7 job
  3. How do users reach the right container?
    • 20 containers, each with different IP address
    • Users need one URL: myblog.com
  4. How to update without downtime?
    • Stop all 20 containers = website down
    • Update one-by-one manually = takes hours, error-prone
  5. How to handle traffic spikes?
    • Black Friday: need 50 containers
    • Tuesday at 3am: need 5 containers
    • Manual scaling = expensive or too slow
Kubernetes solves ALL these problems automatically.

Real-World Analogy: Shipping Port

Without Kubernetes (Manual Container Management):
You own 20 shipping containers (your app containers)
You have 5 trucks (your servers)

Every day you manually:
- Decide which containers go on which trucks
- Check if containers fell off trucks (crashed)
- Put fallen containers back on trucks (restart)
- Tell customers which truck has their container
- Swap trucks when they're full

Result: Full-time job, errors, slow
With Kubernetes (Automated):
Kubernetes = Smart logistics system

You tell Kubernetes:
- "I need 20 containers of my app running"
- "Each container needs 500 MB RAM"
- "Don't run more than 5 on one truck (server)"

Kubernetes automatically:
✅ Places containers on trucks (servers) optimally
✅ Monitors all containers
✅ Restarts crashed containers instantly
✅ Gives customers one address (myblog.com)
✅ Routes customers to healthy containers
✅ Replaces containers during updates (zero downtime)
✅ Adds/removes containers based on traffic

Result: You focus on your app, Kubernetes handles operations

What is Kubernetes?

Kubernetes (K8s) = An open-source container orchestration platform that automates deployment, scaling, and management of containerized applications.

Think of Kubernetes as: an operating system for your containers across many servers.

Core Features:
  1. Self-Healing: Container crashed? Kubernetes restarts it automatically
  2. Load Balancing: Distributes traffic across containers
  3. Auto-Scaling: More traffic? Kubernetes adds containers. Less traffic? Removes them.
  4. Rolling Updates: Update app without downtime
  5. Service Discovery: Containers find each other automatically
  6. Storage Orchestration: Attach storage to containers automatically

Kubernetes in Simple Terms

Transportation Analogy:
Your App = Passengers that need to travel

Container = Car (carries your app)

Kubernetes = Smart Transportation System:
- Monitors all cars (containers)
- Dispatches new cars when needed
- Removes cars when traffic is light
- Redirects passengers if a car breaks down
- Shows passengers one address (the airport) instead of 20 car locations
- Replaces old cars with new models (updates) while passengers keep arriving

Real Example: Online Shopping Website

Black Friday Sale (No Kubernetes):
5:00 AM: Normal traffic (5 containers)
8:00 AM: Traffic increases (need 20 containers)

Your manual actions:
1. Notice website is slow (user complaints pour in)
2. Log into Azure portal
3. Manually create 15 new VMs (20 minutes)
4. Manually start 15 containers
5. Manually update load balancer configuration
6. Total time: 45 minutes of website slowness

4:00 PM: Traffic decreases (back to 5 containers needed)
7. Manually stop 15 containers
8. Manually delete 15 VMs (to save money)
9. Total time: 30 minutes of manual work

Result: Lost sales, frustrated customers, exhausted you
Black Friday Sale (With Kubernetes):
5:00 AM: Normal traffic (5 containers)
8:00 AM: Traffic increases

Kubernetes automatically:
1. Detects high CPU usage (70%+)
2. Creates 15 new containers in 2 minutes
3. Distributes traffic across all 20 containers
4. Total time: 2 minutes, zero human intervention ✅

4:00 PM: Traffic decreases

Kubernetes automatically:
1. Detects low CPU usage (<30%)
2. Gracefully stops 15 containers (waits for requests to finish)
3. Scales back to 5 containers
4. Total time: 5 minutes, zero human intervention ✅

Result: Happy customers, maximized sales, you sleep peacefully

What is Azure Kubernetes Service (AKS)?

Plain Kubernetes (DIY):
Your responsibilities:
- Install Kubernetes on VMs (complex, 50+ steps)
- Configure networking, storage, security
- Upgrade Kubernetes versions manually
- Monitor control plane (master nodes)
- Pay for control plane VMs
- Fix control plane issues (3am outages)

Time investment: 40-80 hours/month
Azure Kubernetes Service (AKS) (Managed):
Your responsibilities:
- Click "Create AKS Cluster" in Azure portal
- Deploy your containers

Azure's responsibilities:
✅ Installs and configures Kubernetes
✅ Manages control plane (free!)
✅ Auto-upgrades Kubernetes
✅ Monitors control plane health 24/7
✅ Fixes control plane issues
✅ Provides enterprise features (security, compliance)

Time investment: 4-8 hours/month
Cost: Control plane is FREE, you only pay for worker nodes (VMs that run your containers)
AKS = Kubernetes without the operational headache

Under the Hood: The AKS Control Plane

In AKS, the cluster is split into two distinct planes:

1. The Control Plane (Managed by Azure)

You never see these VMs, but they are there. They run:
  • kube-apiserver: The front door. Every kubectl command hits this.
  • etcd: The source of truth. A distributed database that stores the cluster state.
  • kube-scheduler: Decides which node should run your pod based on resources.
  • kube-controller-manager: Watches for deviations (e.g., “I need 3 pods, but only 2 are running”) and fixes them.
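You can see this split from the command line: in AKS, kubectl talks to the managed API server, and node listings only ever show your workers.

# The API server endpoint Azure hosts for you:
kubectl cluster-info

# Only worker nodes appear - the control plane VMs are hidden:
kubectl get nodes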

2. The Data Plane (Your Worker Nodes)

These are the VMs in your subscription. They run:
  • kubelet: The agent that takes orders from the control plane and starts containers.
  • kube-proxy: Handles networking and load balancing between pods.
  • Container Runtime: Usually containerd (the container runtime originally extracted from Docker).
[!IMPORTANT] Pro Insight: The ‘Free’ Control Plane. In the default AKS tier, the control plane is free. However, if you have a massive cluster (100+ nodes) or a business-critical workload, upgrade to the Uptime SLA tier ($0.10/cluster/hour). This gives you a financially backed guarantee of 99.95% availability for the API server itself.
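A hedged sketch of enabling the paid tier on an existing cluster; the flag has changed across CLI versions (--uptime-sla in older releases, --tier standard in newer ones), so check az aks update --help first:

# Upgrade an existing cluster to the paid SLA tier (newer CLI syntax):
az aks update \
  --name aks-prod \
  --resource-group rg-prod \
  --tier standard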

Cost Comparison Example

Running 20 Containers for a Web App:

Option 1: Traditional VMs (No Containers):
20 VMs × $50/month = $1,000/month
+ Slow to scale (5-10 minutes to provision new VM)
+ Manual management required
Option 2: Plain Kubernetes (DIY):
3 Control Plane VMs × $50 = $150/month  ← You pay for this
5 Worker VMs × $80 = $400/month         ← You pay for this
Total: $550/month
+ You manage control plane (time cost)
+ Complex setup and maintenance
Option 3: Azure Kubernetes Service (AKS):
Control Plane: $0/month                 ← Azure manages free!
5 Worker VMs × $80 = $400/month         ← You only pay this
Total: $400/month
+ Azure manages control plane
+ Enterprise-grade security and updates
+ Scales in seconds
+ Integrates with Azure services
Winner: AKS saves $150/month + hundreds of hours of management time

1. Why Kubernetes?

Before Kubernetes

  • Manual container orchestration
  • No auto-scaling
  • Complex networking
  • Manual load balancing
  • No self-healing

With Kubernetes

  • Automated orchestration
  • Auto-scaling (HPA, VPA, Cluster Autoscaler)
  • Service discovery
  • Built-in load balancing
  • Self-healing (restart failed pods)
[!WARNING] Gotcha: System Node Pools Every AKS cluster needs at least one “System Node Pool” to run Kubernetes itself (CoreDNS, Metrics Server). You cannot delete this pool or scale it to 0. It will always cost you money (usually 1-3 VMs).
[!TIP] Jargon Alert: Pod vs Node Node: A Virtual Machine (The house). Pod: A running process/container (The tenant living in the house). A single Node (VM) usually hosts many Pods.
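One command makes the analogy concrete: the NODE column shows which house (VM) each tenant (Pod) lives in.

kubectl get pods -o wide   # NODE column maps each pod to its VM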

2. AKS Architecture

At a high level, an AKS cluster has two halves: the Azure-managed control plane (kube-apiserver, etcd, scheduler, controller-manager) and the worker node pools in your subscription, as described in “Under the Hood” above.
3. Create AKS Cluster

# Create AKS cluster
az aks create \
  --name aks-prod \
  --resource-group rg-prod \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --zones 1 2 3 \
  --enable-managed-identity \
  --network-plugin azure \
  --enable-addons monitoring \
  --generate-ssh-keys

# Get credentials
az aks get-credentials \
  --name aks-prod \
  --resource-group rg-prod

# Verify
kubectl get nodes

4. AKS Networking

AKS supports two network plugins:

Kubenet (basic):
- Pods get IPs from a separate address space
- NAT for outbound connectivity
- Simpler, fewer IP addresses needed
- Use for: Dev/test, small clusters

Azure CNI (advanced):
- Every pod gets a routable IP from your VNet subnet
- Direct routing (no NAT) and native network policy support
- Use for: Production, VNet integration
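A hedged sketch of picking the plugin at creation time; the subnet ID is a hypothetical placeholder:

# Kubenet (pods get NAT'd IPs from a separate range):
az aks create --name aks-dev --resource-group rg-dev \
  --network-plugin kubenet

# Azure CNI (every pod gets a routable VNet IP):
az aks create --name aks-prod --resource-group rg-prod \
  --network-plugin azure \
  --vnet-subnet-id <subnet-resource-id>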

5. Deploy Application

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myregistry.azurecr.io/web:v1
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        ports:
        - containerPort: 80

---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: web

---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx   # the kubernetes.io/ingress.class annotation is deprecated
  tls:
  - hosts:
    - myapp.example.com
    secretName: tls-secret
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web-service
            port:
              number: 80
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
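After applying, a few standard kubectl commands confirm the rollout and reveal the public entry points:

# Wait until all 3 replicas are ready:
kubectl rollout status deployment/web-app

# EXTERNAL-IP column = the Azure Load Balancer IP:
kubectl get service web-service

# ADDRESS column = the ingress controller's IP:
kubectl get ingress web-ingress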

6. Autoscaling

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
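To watch the HPA react, apply it and generate some load; the busybox load generator below is a quick-and-dirty sketch, not a real benchmark:

# TARGETS column shows current vs. target CPU:
kubectl get hpa web-hpa --watch

# Throwaway pod hammering the internal service:
kubectl run load-gen --rm -it --image=busybox -- \
  /bin/sh -c "while true; do wget -q -O- http://web-service; done"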

7. Best Practices

Resource Limits

Always set CPU/memory requests and limits to prevent noisy neighbors

Health Checks

Configure liveness and readiness probes for self-healing
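A minimal probe sketch to drop into the web-app container spec from section 5; the /healthz and /ready paths are assumptions, so use whatever endpoints your app actually exposes:

livenessProbe:              # container is restarted if this fails
  httpGet:
    path: /healthz
    port: 80
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:             # pod is pulled out of the Service until this passes
  httpGet:
    path: /ready
    port: 80
  periodSeconds: 5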

Use Namespaces

Separate environments (dev, staging, prod) with namespaces

Security

Use Azure AD pod identity, network policies, and Pod Security Standards

Monitoring

Enable Container Insights for observability

GitOps

Use Flux or ArgoCD for declarative deployments

8. Interview Questions

Beginner Level

Q1: What is the difference between a Node and a Pod?

Answer:
  • Node: A worker machine (VM) in Kubernetes. It runs pods.
  • Pod: The smallest deployable unit. Usually contains one container (but can have sidecars).
Analogy: Node = House, Pod = Room, Container = Person in the room.
Q2: What are the main Kubernetes Service types?

Answer:
  • ClusterIP: Internal IP only. Not accessible from outside. Default type.
  • NodePort: Exposes service on a static port on each Node IP.
  • LoadBalancer: Provisions an external Azure Load Balancer to expose service publicly.
Q3: What does the Kubernetes control plane do?

Answer:
  • Scheduling pods (kube-scheduler)
  • Detecting and responding to cluster events (kube-controller-manager)
  • Storing cluster state (etcd)
  • Exposing the Kubernetes API (kube-apiserver)
Note: In AKS, Azure manages the control plane for you (free).

Intermediate Level

Q1: When would you use an Ingress Controller instead of a Load Balancer?

Answer:
  • Load Balancer: Layer 4 (TCP/UDP). One IP per service. Expensive for many services.
  • Ingress Controller: Layer 7 (HTTP/HTTPS). Single IP for multiple services. Supports path-based routing (/api, /web), SSL termination, and rewriting.
Q2: What happens when a container inside a pod crashes?

Answer:
  1. The Kubelet on the node detects the crash.
  2. Based on restartPolicy (default: Always), it restarts the container.
  3. If the pod belongs to a Deployment/ReplicaSet and the Node itself dies, the ReplicaSet controller creates a replacement Pod, which the Scheduler places on a healthy Node.

Advanced Level

Q1: How does AKS upgrade a node pool without downtime?

Answer: AKS handles this via Surge Upgrades:
  1. Cordon a node (prevent new pods).
  2. Drain the node (move existing pods to other nodes).
  3. Delete the node.
  4. Create a new node with the updated version.
  5. Repeat for all nodes (one by one or in batches).
Requirement: PodDisruptionBudgets must be configured to ensure minAvailable replicas during the process.
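A minimal PodDisruptionBudget sketch for the web-app Deployment from earlier, keeping at least 2 replicas alive while nodes are drained:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2        # never evict below 2 running replicas
  selector:
    matchLabels:
      app: web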
Q2: What is a sidecar container, and what is it used for?

Answer: A helper container running alongside the main application container in the same Pod. Uses:
  • Logging (sending logs to Splunk/Log Analytics)
  • Proxying (Service Mesh like Istio/Linkerd)
  • Config watching (reloading configuration)
  • Security (TLS termination)

9. Helm: Kubernetes Package Manager

Helm is the package manager for Kubernetes. It simplifies deploying complex applications with reusable charts.

Why Helm?

Without Helm:
  • Manage 20+ YAML files manually
  • Copy-paste configurations for dev/staging/prod
  • Hard to version and rollback deployments
With Helm:
  • Single command deployment: helm install myapp ./chart
  • Templated configurations with values
  • Easy rollbacks: helm rollback myapp 1
  • Reusable charts from public repositories

Helm Architecture

Helm Chart (Package)
├── Chart.yaml          # Metadata (name, version)
├── values.yaml         # Default configuration
├── templates/          # Kubernetes manifests with templating
│   ├── deployment.yaml
│   ├── service.yaml
│   └── ingress.yaml
└── charts/             # Dependencies (sub-charts)

Creating a Helm Chart

# Create new chart
helm create myapp

# Chart structure created:
myapp/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── ingress.yaml
│   └── _helpers.tpl

Chart.yaml (Metadata)

apiVersion: v2
name: myapp
description: A Helm chart for my application
type: application
version: 1.0.0        # Chart version
appVersion: "2.5.1"   # Application version

values.yaml (Configuration)

replicaCount: 3

image:
  repository: myregistry.azurecr.io/myapp
  tag: "2.5.1"
  pullPolicy: IfNotPresent

service:
  type: LoadBalancer
  port: 80

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: myapp.example.com
      paths:
        - path: /
          pathType: Prefix

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

templates/deployment.yaml (Templated Manifest)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      containers:
      - name: {{ .Chart.Name }}
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        ports:
        - name: http
          containerPort: 80
          protocol: TCP
        resources:
          {{- toYaml .Values.resources | nindent 12 }}
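Before installing, you can render and validate the chart locally; both commands are standard Helm:

# Render templates to stdout without touching the cluster:
helm template myapp ./myapp --set image.tag=3.0.0

# Static checks for chart structure and templating mistakes:
helm lint ./myapp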

Deploying with Helm

# Install chart
helm install myapp ./myapp

# Install with custom values
helm install myapp ./myapp \
  --set replicaCount=5 \
  --set image.tag=3.0.0

# Install with values file
helm install myapp ./myapp \
  -f values-production.yaml

# Upgrade deployment
helm upgrade myapp ./myapp \
  --set image.tag=3.1.0

# Rollback to previous version
helm rollback myapp 1

# Uninstall
helm uninstall myapp

Helm Repositories

# Add official Helm repo
helm repo add stable https://charts.helm.sh/stable

# Add Bitnami repo (popular charts)
helm repo add bitnami https://charts.bitnami.com/bitnami

# Search for charts
helm search repo nginx

# Install from repository
helm install my-nginx bitnami/nginx

# Update repo index
helm repo update

Multi-Environment Strategy

values-dev.yaml:
replicaCount: 1
image:
  tag: "latest"
ingress:
  hosts:
    - host: myapp-dev.example.com
values-prod.yaml:
replicaCount: 5
image:
  tag: "2.5.1"
ingress:
  hosts:
    - host: myapp.example.com
resources:
  limits:
    cpu: 1000m
    memory: 1Gi
# Deploy to dev
helm install myapp ./myapp -f values-dev.yaml

# Deploy to prod
helm install myapp ./myapp -f values-prod.yaml
[!TIP] Best Practice: Chart Versioning
  • Chart version (version in Chart.yaml): Increment when chart structure changes
  • App version (appVersion): Tracks the application version being deployed
  • Use semantic versioning: 1.2.3 (MAJOR.MINOR.PATCH)
[!WARNING] Gotcha: Helm Secrets Never commit secrets to values.yaml! Use:
  • Azure Key Vault: Inject secrets via CSI driver
  • Sealed Secrets: Encrypt secrets in Git
  • helm-secrets plugin: Encrypt values files with SOPS

10. GitOps with ArgoCD

GitOps = Git as the single source of truth for declarative infrastructure and applications.

GitOps Principles

  1. Declarative: Entire system described declaratively (YAML in Git)
  2. Versioned: Git history = deployment history
  3. Automated: Changes in Git automatically deployed
  4. Reconciled: Cluster state continuously reconciled with Git

ArgoCD Architecture

Git Repository (Source of Truth)
        ↓
ArgoCD (Continuous Sync)
        ↓
Kubernetes Cluster (Desired State)

Installing ArgoCD on AKS

# Create namespace
kubectl create namespace argocd

# Install ArgoCD
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Expose ArgoCD UI (LoadBalancer)
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'

# Get admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

# Get external IP
kubectl get svc argocd-server -n argocd

Creating an Application

Git Repository Structure:
my-app-gitops/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── overlays/
│   ├── dev/
│   │   └── kustomization.yaml
│   └── prod/
│       └── kustomization.yaml
ArgoCD Application Manifest:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp-prod
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/my-app-gitops
    targetRevision: main
    path: overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true      # Delete resources not in Git
      selfHeal: true   # Auto-sync if cluster drifts
    syncOptions:
      - CreateNamespace=true
# Apply ArgoCD application
kubectl apply -f argocd-app.yaml

# Or use ArgoCD CLI
argocd app create myapp-prod \
  --repo https://github.com/myorg/my-app-gitops \
  --path overlays/prod \
  --dest-server https://kubernetes.default.svc \
  --dest-namespace production \
  --sync-policy automated

GitOps Workflow

1. Developer commits code
2. CI builds Docker image (tag: v1.2.3)
3. CI updates the GitOps repo (image: myapp:v1.2.3) - see the CI sketch below
4. ArgoCD detects the change
5. ArgoCD syncs to the cluster
6. Deployment updated automatically
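A hedged sketch of step 3, the CI job that bumps the image tag in the GitOps repo; the repo URL and the NEW_TAG variable are hypothetical:

# Inside the CI pipeline, after the image build succeeds:
git clone https://github.com/myorg/my-app-gitops
cd my-app-gitops/overlays/prod

# Rewrites kustomization.yaml with the new tag:
kustomize edit set image myapp=myapp:${NEW_TAG}

git commit -am "Deploy myapp:${NEW_TAG}"
git push    # ArgoCD sees this commit and syncs the cluster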
[!IMPORTANT] Recommendation: Separate Repos
  • Application code repo: Source code, Dockerfile
  • GitOps repo: Kubernetes manifests, Helm charts
  • CI updates GitOps repo after building image

Sync Strategies

| Strategy | Behavior | Use Case |
|---|---|---|
| Manual | Requires manual sync | Production (human approval) |
| Automated | Auto-sync on Git change | Dev/Staging |
| Auto-Prune | Delete resources not in Git | Clean up old resources |
| Self-Heal | Revert manual kubectl changes | Enforce Git as source of truth |

11. Service Mesh Basics (Istio)

Service Mesh = Infrastructure layer for service-to-service communication with observability, security, and traffic management.

Why Service Mesh?

Without Service Mesh:
  • Implement retries, timeouts, circuit breakers in every microservice
  • No visibility into service-to-service traffic
  • Difficult to enforce mTLS between services
With Service Mesh (Istio):
  • Traffic Management: Canary deployments, A/B testing, retries
  • Security: Automatic mTLS between services
  • Observability: Distributed tracing, metrics, logs

Istio Architecture

Application Pod
├── App Container (your code)
└── Envoy Sidecar (injected by Istio)

All traffic flows through Envoy

Istio Control Plane (manages Envoy configs)

Installing Istio on AKS

# Download Istio
curl -L https://istio.io/downloadIstio | sh -
cd istio-1.20.0

# Install Istio
istioctl install --set profile=demo -y

# Enable sidecar injection for namespace
kubectl label namespace default istio-injection=enabled

# Verify
kubectl get pods -n istio-system

Traffic Management Example

Canary Deployment (90% v1, 10% v2):
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp.example.com
  http:
    - match:
        - headers:
            user-type:
              exact: beta-tester
      route:
        - destination:
            host: myapp
            subset: v2
    - route:
        - destination:
            host: myapp
            subset: v1
          weight: 90
        - destination:
            host: myapp
            subset: v2
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
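To try the canary config, save both manifests (the filename below is hypothetical) and let Istio's analyzer catch common mistakes, such as a subset with no matching pods:

kubectl apply -f canary-routing.yaml
istioctl analyze    # validates VirtualService/DestinationRule wiring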
[!NOTE] Deep Dive: When to Use Service Mesh?
  • YES: Microservices (10+ services), need mTLS, complex traffic routing
  • NO: Monolith, simple apps, small teams (adds complexity)


12. AKS Security Deep Dive

Pod Security Standards

Pod Security Standards replace the deprecated Pod Security Policies (PSPs). Three levels:
  1. Privileged: Unrestricted (no restrictions)
  2. Baseline: Minimally restrictive (prevents known privilege escalations)
  3. Restricted: Heavily restricted (hardened, follows pod hardening best practices)
# Enforce restricted policy on namespace
kubectl label namespace production \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted
Example: Restricted Pod:
apiVersion: v1
kind: Pod
metadata:
  name: secure-app
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:1.0
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      readOnlyRootFilesystem: true

Network Policies

Network Policies = Firewall rules for pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-from-web
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080
Default Deny All:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
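Gotcha worth sketching: the default-deny policy above also blocks DNS, so pods can no longer resolve names. A common companion policy (assuming the standard CoreDNS setup in kube-system) re-allows DNS egress:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53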

Secrets Management with Azure Key Vault

CSI Driver for Azure Key Vault:
# Install CSI driver
helm repo add csi-secrets-store-provider-azure \
  https://azure.github.io/secrets-store-csi-driver-provider-azure/charts

helm install csi csi-secrets-store-provider-azure/csi-secrets-store-provider-azure
SecretProviderClass:
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: azure-kv-sync
spec:
  provider: azure
  parameters:
    usePodIdentity: "false"
    useVMManagedIdentity: "true"
    userAssignedIdentityID: "your-identity-client-id"
    keyvaultName: "mykeyvault"
    objects: |
      array:
        - |
          objectName: database-password
          objectType: secret
    tenantId: "your-tenant-id"
Pod using Key Vault secret:
apiVersion: v1
kind: Pod
metadata:
  name: app-with-secrets
spec:
  containers:
  - name: app
    image: myapp:1.0
    volumeMounts:
    - name: secrets-store
      mountPath: "/mnt/secrets"
      readOnly: true
  volumes:
  - name: secrets-store
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: "azure-kv-sync"

13. StatefulSets & Persistent Storage

StatefulSet = For stateful applications (databases, message queues) that need stable network identity and persistent storage.

StatefulSet vs Deployment

| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod Names | Random (web-7d8f-xyz) | Ordered (web-0, web-1, web-2) |
| Scaling | Parallel | Sequential (web-0 → web-1 → web-2) |
| Storage | Shared or ephemeral | Dedicated persistent volume per pod |
| Network Identity | Random | Stable (web-0.service.namespace.svc) |
| Use Case | Stateless apps | Databases, Kafka, Redis |

StatefulSet Example

apiVersion: v1
kind: Service
metadata:
  name: mysql
spec:
  clusterIP: None  # Headless service
  selector:
    app: mysql
  ports:
  - port: 3306
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql
  replicas: 3
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        ports:
        - containerPort: 3306
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 100Gi
Accessing pods:
# Direct access to specific pod
mysql-0.mysql.default.svc.cluster.local
mysql-1.mysql.default.svc.cluster.local
mysql-2.mysql.default.svc.cluster.local

Azure Disk vs Azure Files

| Feature | Azure Disk | Azure Files |
|---|---|---|
| Access Mode | ReadWriteOnce (single pod) | ReadWriteMany (multiple pods) |
| Performance | Higher IOPS | Lower IOPS |
| Use Case | Databases | Shared storage, logs |
| Storage Class | managed-premium, managed-standard | azurefile, azurefile-premium |
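A minimal PVC sketch for the Azure Files column: azurefile supports ReadWriteMany, so several pods can mount the same share (the claim name is hypothetical):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-logs
spec:
  accessModes:
    - ReadWriteMany          # multiple pods, same volume
  storageClassName: azurefile
  resources:
    requests:
      storage: 10Gi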

14. KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) = Scale pods based on external metrics (queue length, HTTP requests, database queries).

Installing KEDA

# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Example: Scale Based on Azure Service Bus Queue

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
spec:
  scaleTargetRef:
    name: order-processor  # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: orders
      namespace: mynamespace
      messageCount: "10"  # Scale up when >10 messages
      connectionFromEnv: SERVICEBUS_CONNECTION
Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
spec:
  replicas: 1  # KEDA will override this
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      containers:
      - name: processor
        image: myapp/order-processor:1.0
        env:
        - name: SERVICEBUS_CONNECTION
          valueFrom:
            secretKeyRef:
              name: servicebus-secret
              key: connection-string
How it works:
  • Queue has 50 messages → KEDA scales to 5 pods (50/10)
  • Queue has 200 messages → KEDA scales to 20 pods (max)
  • Queue empty → KEDA scales to 1 pod (min)
Popular KEDA triggers:
  • Azure Service Bus: Queue/Topic message count
  • Azure Storage Queue: Queue length
  • HTTP: Incoming HTTP requests
  • Prometheus: Custom metrics
  • Kafka: Consumer lag
  • Redis: List length
  • Cron: Time-based scaling

15. Interview Questions

Beginner Level

Q1: What is the difference between a Pod and a Deployment?

Answer:

Pod:
  • Smallest deployable unit in Kubernetes
  • One or more containers running together
  • Ephemeral (dies when node fails)
  • No self-healing
Deployment:
  • Manages a set of identical Pods (ReplicaSet)
  • Ensures desired number of Pods are running
  • Self-healing (recreates failed Pods)
  • Supports rolling updates and rollbacks
In production: Always use Deployments, never bare Pods.
Q2: What are namespaces, and when would you use them?

Answer: Namespaces = Virtual clusters within a physical cluster. Use cases:
  • Environment separation: dev, staging, prod
  • Team isolation: team-a, team-b
  • Resource quotas: Limit CPU/memory per namespace
Default namespaces:
  • default: Default namespace for resources
  • kube-system: Kubernetes system components
  • kube-public: Public resources (readable by all)
Example:
kubectl create namespace production
kubectl get pods -n production
Q3: What is a Service, and why do we need one?

Answer: Service = Stable network endpoint for a set of Pods.
Problem: Pods have dynamic IPs (they change on restart).
Solution: A Service provides a stable IP and DNS name.
Types:
  • ClusterIP (default): Internal only (10.0.1.5)
  • NodePort: Exposes on each node’s IP (30000-32767)
  • LoadBalancer: Creates Azure Load Balancer (public IP)
Example:
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080

Intermediate Level

Q1: How does the Horizontal Pod Autoscaler (HPA) work?

Answer: HPA = Automatically scales pods based on CPU/memory usage. How it works:
  1. Metrics Server collects pod metrics every 15 seconds
  2. HPA controller checks metrics every 30 seconds
  3. If avg CPU > target, scale up
  4. If avg CPU < target (for 5 min), scale down
Formula:
desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))
Example:
  • Current: 3 pods, avg CPU 80%
  • Target: 50%
  • Desired: ceil(3 * (80/50)) = ceil(4.8) = 5 pods
Gotcha: Requires resources.requests to be set!
Q2: What is the difference between Kubenet and Azure CNI?

Answer:

| Feature | Kubenet | Azure CNI |
|---|---|---|
| Pod IP | Private (10.244.x.x) | VNet IP (10.0.1.x) |
| IP Consumption | Low (NAT used) | High (1 IP per pod) |
| Performance | Slight overhead (NAT) | Direct routing (faster) |
| VNet Integration | No | Yes (pods directly in VNet) |
| Network Policies | Calico required | Native support |
| Use Case | Small clusters, IP conservation | Enterprise, VNet integration |
Recommendation: Use Azure CNI for production (better integration, performance).
Q3: How do you achieve a zero-downtime deployment?

Answer: Strategy: Rolling Update with readiness probes.
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # Max 1 pod down at a time
      maxSurge: 1        # Max 1 extra pod during update
  template:
    spec:
      containers:
      - name: app
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
Process:
  1. Create 1 new pod (v2)
  2. Wait for readiness probe to pass
  3. Terminate 1 old pod (v1)
  4. Repeat until all pods are v2
Result: Always 4-6 pods running (never less than 4).

Advanced Level

Q1: How would you design a multi-tenant AKS cluster?

Answer: Requirements:
  • Isolate tenants (security, resources)
  • Cost allocation per tenant
  • Prevent noisy neighbor
Architecture:

Option 1: Namespace per Tenant (Soft Isolation)
# Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
  labels:
    tenant: acme

# Resource Quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    pods: "50"

# Network Policy (Deny cross-tenant traffic)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-other-tenants
  namespace: tenant-acme
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          tenant: acme
Option 2: Node Pool per Tenant (Hard Isolation)
# Create dedicated node pool for tenant
az aks nodepool add \
  --cluster-name aks-prod \
  --name acmepool \
  --node-count 3 \
  --node-taints tenant=acme:NoSchedule \
  --labels tenant=acme
Deployment with node affinity:
spec:
  tolerations:
  - key: tenant
    operator: Equal
    value: acme
    effect: NoSchedule
  nodeSelector:
    tenant: acme
Cost Allocation: Use tags/labels + Azure Cost Management.
Q2: A pod is stuck in CrashLoopBackOff. How do you troubleshoot it?

Answer: CrashLoopBackOff = Pod starts, crashes, Kubernetes restarts it, crashes again (loop). Troubleshooting Steps:
  1. Check pod events:
kubectl describe pod mypod
# Look for: Events section (OOMKilled, ImagePullBackOff, etc.)
  2. Check logs:
kubectl logs mypod
kubectl logs mypod --previous  # Logs from crashed container
  3. Common causes:
  • OOMKilled: Increase resources.limits.memory
  • Application error: Fix code, check environment variables
  • Missing dependencies: Database not ready → Add init container
  • Liveness probe failing: Adjust probe settings
  4. Debug with ephemeral container (Kubernetes 1.23+):
kubectl debug mypod -it --image=busybox --target=mycontainer
  5. Disable probes temporarily:
# Comment out liveness probe to prevent restarts
# livenessProbe:
#   httpGet:
#     path: /health
Q3: Explain Blue-Green deployments and how to implement one on Kubernetes.

Answer: Blue-Green = Run two identical environments (blue = current, green = new), then switch traffic instantly. Implementation with Services:
# Blue Deployment (current)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:1.0

# Green Deployment (new)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:2.0

# Service (points to blue initially)
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Switch to 'green' to cutover
  ports:
  - port: 80
    targetPort: 8080
Cutover Process:
# 1. Deploy green
kubectl apply -f deployment-green.yaml

# 2. Test green internally
kubectl port-forward deployment/myapp-green 8080:8080

# 3. Switch traffic (instant cutover)
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'

# 4. Monitor for issues
# If problems: kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'

# 5. Delete blue after validation
kubectl delete deployment myapp-blue
Pros: Instant rollback, zero downtime.
Cons: 2x resources during deployment.

Troubleshooting: The AKS Production Triage

When a pod fails in production, don’t panic. Follow this 3-step triage:

1. The “Pods won’t start” Phase

  • ImagePullBackOff: Kubernetes can’t download your container image.
    • The Pro Check: Does the AKS Cluster have the AcrPull permission on your Container Registry?
  • CrashLoopBackOff: The container starts but immediately crashes.
    • The Pro Check: Run kubectl logs <pod-name> --previous. You need to see the logs from the failed instance, not the new one that just restarted.
  • Pending: The pod isn’t even trying to start.
    • The Pro Check: Run kubectl describe pod <pod-name>. Usually, it’s because you requested 2 GB of RAM but your nodes only have 1 GB available.

2. The “Network Ghost” Phase

  • Service but no Response: The service is running, but you get a 504 timeout.
    • The Pro Check: Do the selectors in your Service YAML exactly match the labels in your Deployment YAML? If not, the Load Balancer is sending traffic into a black hole.
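Two quick commands make the mismatch obvious; a Service whose selector matches nothing has an empty Endpoints list:

kubectl get endpoints web-service     # "<none>" means traffic goes nowhere
kubectl get pods -l app=web           # do any pods actually carry the label?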

3. The “Node Pressure” Phase

  • Evicted Pods: Your pods are being killed randomly.
    • The Pro Check: Your Node is out of disk space or RAM. Check “Azure Monitor for Containers” to see which app is leaking memory.
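From the terminal, node pressure shows up in two places (kubectl top needs the metrics-server, which AKS installs by default):

kubectl top nodes                      # CPU/memory usage per node
kubectl describe node <node-name>      # look for MemoryPressure / DiskPressure conditions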
[!TIP] Pro Tool: Lens & k9s While kubectl is the standard, Principal Engineers often use Lens (Desktop UI) or k9s (Terminal UI) to visualize cluster health in real-time. These tools make it instantly obvious when a deployment is failing across multiple zones.

16. Key Takeaways

Managed Control Plane

AKS manages the master nodes (API server, etcd) for free. You only pay for worker nodes.

Declarative Config

Use YAML manifests to define desired state. Avoid imperative commands (kubectl run) in production.

Autoscaling

Use HPA for pods (CPU/Memory) and Cluster Autoscaler for nodes to handle variable loads efficiently.

Networking Choice

Use Kubenet for simplicity/IP conservation. Use Azure CNI for distinct IPs per pod and direct VNet connectivity.

Security

Integrate Azure AD for authentication. Use Network Policies to restrict traffic between pods.

Namespace Isolation

Use namespaces to logically separate teams, environments (dev/prod), or applications within a cluster.

Next Steps

Continue to Chapter 8

Master Azure Functions and serverless event-driven architecture