Kubernetes Deployment

Kubernetes (K8s) is the industry standard for orchestrating containerized microservices at scale.
Learning Objectives:
  • Understand Kubernetes architecture
  • Create Deployments and Services
  • Manage configuration with ConfigMaps and Secrets
  • Set up Ingress for external access
  • Use Helm for package management

Kubernetes Architecture

Before writing any YAML, it helps to understand what Kubernetes actually is. At its core, Kubernetes is a distributed control loop: you declare the desired state of your system (10 replicas of order-service, 3 replicas of payment-service), and Kubernetes continuously reconciles the actual state with that declaration. If a pod dies, Kubernetes starts a new one. If a node goes offline, Kubernetes reschedules its pods elsewhere. This reconciliation model is why Kubernetes is so resilient: you’re not telling it “start these pods,” you’re telling it “maintain this state forever.”
The control plane is the brain. The API server is the only component that talks to etcd (the distributed key-value store holding all cluster state), and it’s the only component the other components talk to. The scheduler decides which node each pod should run on. The controller manager runs the reconciliation loops for each resource type (Deployment controller, ReplicaSet controller, etc.). etcd holds the source of truth; if etcd is lost, containers that are already running keep running, but the cluster can no longer schedule, heal, or accept changes.
The worker nodes are where your containers actually run. Each node runs kubelet (which talks to the API server and manages the container runtime) and kube-proxy (which programs iptables or IPVS rules so that Service virtual IPs route to pod IPs). Your container runtime (containerd, CRI-O) is what actually launches containers.
Why does any of this matter for microservices developers? Because understanding the control loop model explains Kubernetes’s failure modes. When a deployment fails to roll out, it’s because a controller cannot achieve the desired state. When pods are stuck in Pending, it’s usually because the scheduler cannot find a node that fits them. When services cannot reach each other, it’s usually a kube-proxy or DNS problem. The more your mental model matches the architecture, the faster you can debug.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    KUBERNETES ARCHITECTURE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                         CONTROL PLANE                                │    │
│  │                                                                      │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐            │    │
│  │  │   API    │  │  Sched-  │  │Controller│  │   etcd   │            │    │
│  │  │  Server  │  │  uler    │  │ Manager  │  │ (store)  │            │    │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘            │    │
│  │                                                                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                    │                                         │
│                                    │ kubectl, API calls                      │
│                                    ▼                                         │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                          WORKER NODES                                │    │
│  │                                                                      │    │
│  │  ┌─────────────────────────────┐  ┌─────────────────────────────┐   │    │
│  │  │         NODE 1              │  │         NODE 2              │   │    │
│  │  │                             │  │                             │   │    │
│  │  │  ┌───────┐    ┌───────┐    │  │  ┌───────┐    ┌───────┐    │   │    │
│  │  │  │  Pod  │    │  Pod  │    │  │  │  Pod  │    │  Pod  │    │   │    │
│  │  │  │Order-1│    │Order-2│    │  │  │Pay-1  │    │Inv-1  │    │   │    │
│  │  │  └───────┘    └───────┘    │  │  └───────┘    └───────┘    │   │    │
│  │  │                             │  │                             │   │    │
│  │  │  ┌────────┐  ┌──────────┐  │  │  ┌────────┐  ┌──────────┐  │   │    │
│  │  │  │ kubelet│  │kube-proxy│  │  │  │ kubelet│  │kube-proxy│  │   │    │
│  │  │  └────────┘  └──────────┘  │  │  └────────┘  └──────────┘  │   │    │
│  │  │                             │  │                             │   │    │
│  │  └─────────────────────────────┘  └─────────────────────────────┘   │    │
│  │                                                                      │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
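The reconciliation idea from the start of this section can be sketched in a few lines of Node.js. This is only an illustration of the control-loop concept, not how any real controller is implemented: `reconcile`, `desired`, and `actual` are hypothetical names, and the "state" is just a map from service name to replica count.

```javascript
// One reconciliation pass: compare desired vs. actual replica counts and
// return the actions a controller would need to take to converge.
// `desired` and `actual` are plain objects, not Kubernetes API objects.
function reconcile(desired, actual) {
  const actions = [];
  const names = new Set([...Object.keys(desired), ...Object.keys(actual)]);
  for (const name of [...names].sort()) {
    const want = desired[name] || 0;
    const have = actual[name] || 0;
    if (have < want) actions.push({ name, op: 'start', count: want - have });
    if (have > want) actions.push({ name, op: 'stop', count: have - want });
  }
  return actions;
}

// A real controller repeats this forever; here we run a single pass.
const desired = { 'order-service': 10, 'payment-service': 3 };
const actual = { 'order-service': 9, 'payment-service': 3, 'legacy-service': 1 };
console.log(reconcile(desired, actual));
```

The point of the sketch is that the loop never stores "what I did last time"; it only compares the declaration with what it observes, which is why a crashed pod or a lost node is handled the same way as a deliberate scale-up.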

Core Resources

Namespace

Namespaces are logical partitions within a Kubernetes cluster. Think of them as folders: they scope resource names (two services can both be called api if they’re in different namespaces) and provide a boundary for RBAC, network policies, and resource quotas. In a microservices setup, you’ll typically have namespaces per environment (dev, staging, production) or per team (platform, payments, search). Using the default namespace for everything is a mistake that hurts later — isolation is much harder to retrofit than to apply from day one. If you skip namespacing, your cluster becomes a flat soup of 200+ resources with colliding names and no way to apply per-team policies. With namespaces, you can say “team-payments can deploy only into the payments namespace” via RBAC, and “staging resources cannot exceed 20 CPUs total” via ResourceQuota.
# k8s/namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: microservices
  labels:
    name: microservices
    environment: production

Deployment

Why Deployments exist. Before Deployments, running a service on Kubernetes meant manually creating ReplicaSets and orchestrating updates yourself, which was tedious and error-prone. The Deployment resource exists to answer the most common production question: “how do I run N replicas of a stateless service and update them safely?” It wraps the lower-level ReplicaSet with a declarative rollout strategy, version history, and rollback support. Think of a Deployment as a thin supervisor that owns one or more ReplicaSets and shifts pods between them during updates.
What Kubernetes does internally. The Deployment controller continuously reconciles: it reads the current Deployment spec, compares it against existing ReplicaSets, and creates or scales ReplicaSets to match. When you change the pod template, the controller creates a new ReplicaSet with a hashed name suffix, then gradually scales it up while scaling the old ReplicaSet down, obeying your maxSurge and maxUnavailable settings. The old ReplicaSet is kept around at zero replicas (up to revisionHistoryLimit) so you can roll back to its pod template with a single command.
Key fields you must set. replicas (the default of 1 is rarely right for production), selector.matchLabels (must match template.metadata.labels; apps/v1 rejects a mismatch at apply time), resources.requests and resources.limits (omitting these leads to poor scheduling decisions and OOM kills), both probes (liveness and readiness, see below), and a strategy block for rolling updates. Without an explicit strategy, the default maxUnavailable: 25% can take a surprising chunk of your capacity offline during a deploy.
Common beginner mistakes. (1) Pushing a new image under the same mutable tag (such as :latest). The pod template has not changed, so Kubernetes sees nothing to roll out and nothing updates. Always use immutable tags. (2) Forgetting imagePullPolicy: Always for mutable tags during development, so nodes keep starting a cached image. (3) Setting maxUnavailable and maxSurge both to 0, which the API server rejects because such a rollout could never make progress. (4) Not defining a readiness probe, so traffic is sent to pods that aren’t ready, causing 502s during every rollout.
The Deployment is the workhorse resource for stateless microservices. It declares: “I want N replicas of this pod template, and when the template changes, do a rolling update.” The Deployment below is a near-complete template for a production microservice. Each piece solves a specific failure mode:
  • replicas: 3. A single replica is a single point of failure. Three replicas across three nodes survive one node failure.
  • maxSurge: 1, maxUnavailable: 0. During rolling updates, we add one new pod before removing an old one. Zero downtime. Setting maxUnavailable: 1 would be faster but causes brief capacity drops under load.
  • resources.requests. Used by the scheduler to find a node with enough free capacity. Without requests, the scheduler treats the pod as needing nothing, and you can end up with 20 pods crammed onto one node that’s about to OOM.
  • resources.limits. Hard ceiling. Memory limit exceeded = OOM kill. CPU limit exceeded = throttled (not killed). Without limits, a memory leak in one pod can take down the whole node.
  • livenessProbe. “Is the process alive?” Failure triggers a container restart.
  • readinessProbe. “Can the process accept traffic?” Failure removes the pod from Service load balancing but keeps it running.
  • securityContext.runAsNonRoot. Defense in depth. Makes a whole class of container escape vulnerabilities much harder to exploit.
  • podAntiAffinity. Spreads replicas across nodes. Without it, all three pods could land on the same node, defeating the point of replication. The manifest uses the preferred (soft) form, so the scheduler still places pods when a perfect spread is impossible.
If you omit any one of these, you will eventually be paged for an incident that was preventable. Ask me how I know.
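To make the maxSurge and maxUnavailable trade-off concrete, here is a small sketch that computes the pod-count bounds a rolling update stays within. It assumes the documented rounding behavior (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down); `resolveValue` and `rolloutBounds` are hypothetical helper names, not part of any Kubernetes client.

```javascript
// Resolve an absolute number or a percentage string against the replica count.
function resolveValue(value, replicas, roundUp) {
  if (typeof value === 'number') return value;
  const pct = parseInt(value.replace('%', ''), 10);
  const raw = (pct * replicas) / 100;
  return roundUp ? Math.ceil(raw) : Math.floor(raw);
}

// Returns the most pods that may exist and the fewest that must stay
// available at any point during the rollout.
function rolloutBounds(replicas, maxSurge, maxUnavailable) {
  if (maxSurge === 0 && maxUnavailable === 0) {
    // The API server rejects this combination; the rollout could never progress.
    throw new Error('maxSurge and maxUnavailable cannot both be 0');
  }
  const surge = resolveValue(maxSurge, replicas, true);           // rounds up
  let unavailable = resolveValue(maxUnavailable, replicas, false); // rounds down
  // Assumption: if rounding leaves both at 0, allow 1 unavailable so the
  // rollout can still move.
  if (surge === 0 && unavailable === 0) unavailable = 1;
  return { maxPods: replicas + surge, minAvailable: replicas - unavailable };
}

console.log(rolloutBounds(3, 1, 0));          // settings used in this guide
console.log(rolloutBounds(10, '25%', '25%')); // Kubernetes defaults
```

With the settings used in this guide (3 replicas, surge 1, unavailable 0), capacity never drops below 3 pods. With the defaults on a 10-replica Deployment, up to 2 pods can be missing mid-rollout, which is the "surprising chunk of capacity" mentioned above.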
Caveats & Common Pitfalls: Resource Requests vs Limits
  1. Setting memory limit much higher than request. The scheduler packs pods based on requests, so you can fit 10 pods on a node whose combined limits are 3x the node’s RAM. Under pressure, the kernel OOM-kills them one by one. Symptom: seemingly random pod restarts, OOMKilled in kubectl describe.
  2. No CPU limit, assuming “requests protect you”. CPU requests only influence scheduling and weighted fair sharing under contention. A runaway loop can still consume every idle core on the node, and if no CPU is reserved for system daemons it can starve the kubelet itself. The node goes NotReady and the control plane evicts everything.
  3. CPU limit set way below P99. Exceeding a CPU limit causes throttling, not restarts. Your app does not crash, it just gets slow. You will see P99 latency spikes with no log output and no errors — one of the hardest Kubernetes problems to diagnose.
  4. Copy-pasting requests/limits between services with different profiles. A CPU-bound Python service and an I/O-bound Node.js service have totally different sizing. Each service needs its own measurement.
Solutions & Patterns
  • Size requests at P95 of observed usage, limits at 2-3x requests. This gives bursts without letting one pod cannibalize the node.
  • Memory: set request == limit for latency-sensitive services. If CPU request == limit as well, the pod gets the Guaranteed QoS class and is among the last to be evicted. Eat the packing inefficiency; you are buying predictability.
  • Use Vertical Pod Autoscaler in recommendation mode to get data-driven sizing from real traffic. Never enable auto-update mode in production; it restarts pods to change limits.
  • Always alert on CPU throttling (container_cpu_cfs_throttled_seconds_total). This is the silent killer of P99. If throttling is nonzero, raise the limit or fix the hot path.
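The "P95 request, 2-3x limit" rule of thumb above is simple arithmetic, sketched here. The helper names (`percentile`, `suggestResources`) and the sample arrays are hypothetical; in practice the samples would come from your metrics system. The sketch uses the nearest-rank percentile and sets the memory limit equal to the request, following the pattern for latency-sensitive services.

```javascript
// Nearest-rank percentile of a list of numeric samples.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Suggest requests at P95 of observed usage, CPU limit at burstFactor x request,
// and memory limit equal to the memory request.
function suggestResources(cpuMillicores, memoryMiB, burstFactor = 2) {
  const cpuRequest = percentile(cpuMillicores, 95);
  const memRequest = percentile(memoryMiB, 95);
  return {
    requests: { cpu: `${cpuRequest}m`, memory: `${memRequest}Mi` },
    limits: { cpu: `${cpuRequest * burstFactor}m`, memory: `${memRequest}Mi` },
  };
}

// Made-up usage samples, for illustration only.
const cpuUsage = [40, 55, 60, 62, 70, 75, 80, 90, 95, 300];
const memUsage = [100, 110, 115, 118, 120, 121, 122, 125, 126, 128];
console.log(suggestResources(cpuUsage, memUsage));
```

Treat the output as a starting point for a manifest, then verify against throttling and OOM metrics before relying on it.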
# k8s/order-service/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: microservices
  labels:
    app: order-service
    version: v1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: order-service
        version: v1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "3000"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: order-service
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: order-service
          image: myregistry/order-service:1.0.0
          imagePullPolicy: Always
          ports:
            - containerPort: 3000
              name: http
          env:
            - name: NODE_ENV
              value: production
            - name: PORT
              value: "3000"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: order-service-secrets
                  key: database-url
            - name: KAFKA_BROKERS
              valueFrom:
                configMapKeyRef:
                  name: order-service-config
                  key: kafka-brokers
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /health/live
              port: 3000
            initialDelaySeconds: 15
            periodSeconds: 20
            timeoutSeconds: 5
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: tmp
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: order-service
                topologyKey: kubernetes.io/hostname

Implementing Liveness and Readiness Probes in Your App

The Deployment declares probes, but your application code must implement them. A liveness probe should be cheap and internal — just confirm the process is running and the event loop/worker threads are responsive. A readiness probe should verify dependencies: can you reach the database? Is the cache up? The split matters because failing a liveness probe restarts the pod (expensive, disruptive), while failing a readiness probe just drains traffic until the dependency recovers (safe).
// src/health.js (Express)
const express = require('express');
// Assumed local modules (not shown in this guide): a Knex instance and a Redis client.
const db = require('./db');
const redis = require('./redis');

const router = express.Router();

let isShuttingDown = false;

// Liveness: only fails if the process itself is broken.
// It stays 200 during shutdown so the kubelet does not restart a draining pod.
router.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness: fails while shutting down or if dependencies aren't healthy
router.get('/health/ready', async (req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'shutting-down' });
  }
  try {
    await db.raw('SELECT 1');       // Postgres reachable
    await redis.ping();             // Redis reachable
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not-ready', error: err.message });
  }
});

// On SIGTERM, flip readiness to 503 so traffic drains before the process exits
process.on('SIGTERM', () => { isShuttingDown = true; });

module.exports = router;
Caveats & Common Pitfalls: Liveness Probes
  1. Liveness probe checks dependencies. You put SELECT 1 against Postgres in /health/live. Postgres has a 30-second hiccup. Every pod fails the probe, every pod restarts, cold caches everywhere, cascading failure. Liveness must check the process, not its dependencies.
  2. initialDelaySeconds too low for slow-starting apps. A JVM service that needs 45 seconds to boot with probe delay set to 15s gets killed before it is ready. Classic CrashLoopBackOff mystery — the logs show a successful startup, but the pod dies anyway.
  3. Same endpoint for liveness and readiness. If your liveness endpoint also validates the database, you reintroduce the dependency-check problem. Keep them split: liveness is internal, readiness is external.
  4. failureThreshold: 1. One flaky probe (a momentary GC pause, a CPU-starved sidecar) restarts the pod. Use failureThreshold: 3 as a minimum.
Solutions & Patterns
  • Liveness = “am I still running?” A counter that increments on the event loop is enough. If it stops advancing, the process is hung and a restart is warranted.
  • Readiness = “should I receive traffic?” This is where you check DB, cache, Kafka, whatever you need. Failing readiness drains traffic without killing the pod.
  • Use startupProbe for slow-booting apps. It runs before liveness kicks in and has its own generous timeout. Once it passes, the normal liveness probe takes over.
  • On SIGTERM, flip readiness to 503 and keep liveness at 200. This drains traffic during graceful shutdown without triggering a restart mid-shutdown.
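The patterns in this list can be combined into a small, framework-free state object. This is a sketch under assumptions: `createHealth`, `beat`, and `staleAfterMs` are illustrative names, and the "heartbeat" is whatever your main work loop is (a timer, a consumer poll). The clock is injectable so the behavior can be checked without waiting.

```javascript
// Health state that can back /health/live and /health/ready handlers.
function createHealth({ now = Date.now, staleAfterMs = 30000 } = {}) {
  let lastBeat = now();
  let shuttingDown = false;
  return {
    // Call from the main work loop to prove the process is making progress.
    beat() { lastBeat = now(); },
    // Call from the SIGTERM handler.
    beginShutdown() { shuttingDown = true; },
    // Liveness: internal only. Stays 200 during shutdown; 503 if the
    // heartbeat has not advanced within staleAfterMs.
    liveStatus() { return now() - lastBeat > staleAfterMs ? 503 : 200; },
    // Readiness: 503 while draining or when a dependency check failed.
    readyStatus(dependenciesOk = true) {
      return shuttingDown || !dependenciesOk ? 503 : 200;
    },
  };
}

// Usage with a fake clock to show the transitions.
let fakeTime = 0;
const health = createHealth({ now: () => fakeTime, staleAfterMs: 1000 });
console.log(health.liveStatus());      // heartbeat is fresh
fakeTime = 1500;
console.log(health.liveStatus());      // heartbeat is stale
health.beat();
health.beginShutdown();
console.log(health.liveStatus(), health.readyStatus(true));
```

In a handler you would return `health.liveStatus()` as the HTTP status for the liveness endpoint and pass the result of your dependency checks into `health.readyStatus(...)` for the readiness endpoint.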

Service

Why Services exist. Pods are mortal: they die, get rescheduled, and come back with new IPs. If your API gateway hardcoded pod IPs, every deployment would break every caller. The Service resource exists to provide a stable addressing layer: a virtual IP and DNS name that abstracts over the ever-changing set of pods behind it. Clients talk to http://order-service/, and Kubernetes handles the rest.
What Kubernetes does internally. When you create a Service, the EndpointSlice controller watches for pods matching the selector and maintains EndpointSlice resources listing their IPs. On every node, kube-proxy watches EndpointSlices and programs iptables or IPVS rules (Cilium can replace kube-proxy with eBPF) that intercept traffic to the Service’s ClusterIP and DNAT it to a randomly chosen backing pod. CoreDNS publishes the Service’s DNS name, so order-service.microservices.svc.cluster.local resolves to the ClusterIP.
Key fields you must set. selector must match the pod labels exactly, or the Service will have zero endpoints. ports.port is the Service port; ports.targetPort is the container port (they can differ). type: ClusterIP is the default and the right choice for internal services. For headless services (needed for StatefulSets and some discovery patterns), set clusterIP: None.
Common beginner mistakes. (1) Mismatched labels between Service selector and Deployment pod template. The Service has zero endpoints and every request fails or times out. This is one of the most common “why isn’t my service working” causes; verify with kubectl get endpoints <name> or kubectl describe service <name>, both of which show the endpoints. (2) Using LoadBalancer type for internal services, which provisions a paid cloud load balancer for traffic that never leaves the cluster. (3) Setting targetPort to the Service port instead of the container port. This only works when the two happen to be the same number; otherwise traffic is forwarded to a port nothing listens on.
ClusterIP is the default and most common type for microservices: a virtual IP that’s only reachable from inside the cluster. This is what you use for service-to-service communication (order-service calling payment-service). External types (NodePort, LoadBalancer) expose services outside the cluster, but for east-west traffic (between microservices), ClusterIP is the right answer. Using LoadBalancer for internal services means you pay for a cloud load balancer per service and route internal traffic through external infrastructure. That’s both expensive and slower.
# k8s/order-service/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: microservices
  labels:
    app: order-service
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 3000
      protocol: TCP
      name: http
  selector:
    app: order-service
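Since label mismatches are the most common failure here, the following sketch models how a selector picks endpoints. It is a simplification for illustration only: `selectorMatches`, `endpointsFor`, and `serviceDnsName` are hypothetical helpers, pods are plain objects, and the default cluster domain `cluster.local` is assumed.

```javascript
// A pod matches when every key/value in the selector appears in its labels.
// Extra labels on the pod (such as version) are fine.
function selectorMatches(selector, podLabels) {
  const entries = Object.entries(selector);
  if (entries.length === 0) return false; // no selector: nothing is selected automatically
  return entries.every(([key, value]) => podLabels[key] === value);
}

// IPs of ready pods that match the selector: roughly what ends up as endpoints.
function endpointsFor(selector, pods) {
  return pods
    .filter((pod) => pod.ready && selectorMatches(selector, pod.labels))
    .map((pod) => pod.ip);
}

// Fully qualified in-cluster DNS name of a Service.
function serviceDnsName(name, namespace, clusterDomain = 'cluster.local') {
  return `${name}.${namespace}.svc.${clusterDomain}`;
}

const pods = [
  { ip: '10.0.0.1', ready: true, labels: { app: 'order-service', version: 'v1' } },
  { ip: '10.0.0.2', ready: true, labels: { app: 'order-service', version: 'v1' } },
  { ip: '10.0.0.3', ready: false, labels: { app: 'order-service', version: 'v1' } },
  { ip: '10.0.0.4', ready: true, labels: { app: 'payment-service' } },
];
console.log(endpointsFor({ app: 'order-service' }, pods)); // two ready order pods
console.log(endpointsFor({ app: 'orders' }, pods));        // typo in selector: no endpoints
console.log(serviceDnsName('order-service', 'microservices'));
```

The second call is the failure mode described above: one mistyped label value and the Service silently selects nothing.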

ConfigMap

Why ConfigMaps exist. The 12-Factor App principle says “store config in the environment”: not in the binary, not in the image. If you bake KAFKA_BROKERS=prod-kafka:9092 into your Docker image, you need a new image per environment, the same artifact no longer promotes cleanly from dev to prod, and every config change requires a new build and deployment. ConfigMaps exist so one image can behave differently in different environments based on externally injected, non-sensitive values.
What Kubernetes does internally. A ConfigMap is just a key-value map stored in etcd. When you reference it from a pod spec (via envFrom, env.valueFrom, or a volume mount), the kubelet reads it at pod startup and wires the values in. For volume mounts, the kubelet periodically re-syncs the file contents (typically within seconds to a minute), so mounted ConfigMap files update live, except for subPath mounts, which do not receive updates. Env vars are frozen at pod start.
Key fields you must set. Just data for string values, or binaryData for non-UTF-8 content. Set immutable: true for ConfigMaps you never change; it improves API server performance at scale and prevents accidental edits. For large configs, prefer mounting as a file over env vars (the Linux env block has size limits and Kubernetes imposes a 1 MiB ConfigMap ceiling).
Common beginner mistakes. (1) Editing a ConfigMap and expecting pods to pick up new env values. They won’t until they restart. The Helm “checksum annotation” pattern (shown later) solves this by triggering a rolling restart when the ConfigMap content changes. (2) Putting secrets in ConfigMaps because they “work the same way”. ConfigMaps have weaker RBAC defaults and often end up in shared Git repos. Use Secrets for anything sensitive. (3) Storing entire application binaries or large JSON blobs (>1 MiB) and hitting the size limit.
ConfigMap values can be injected as environment variables (for simple key-value pairs), mounted as files (for config files that the app reads from disk), or consumed via the Kubernetes API (for apps that watch for config changes). The environment variable approach is simplest; the file approach is better for larger configs or when the app expects a specific file format.
# k8s/order-service/configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
  namespace: microservices
data:
  kafka-brokers: "kafka.microservices.svc.cluster.local:9092"
  redis-host: "redis.microservices.svc.cluster.local"
  redis-port: "6379"
  log-level: "info"
  
  # Configuration file
  app-config.json: |
    {
      "features": {
        "newCheckout": true,
        "splitPayments": false
      },
      "limits": {
        "maxOrderItems": 50,
        "maxOrderValue": 10000
      }
    }

Loading ConfigMap Values in Your App

When ConfigMap values arrive as environment variables, your app needs a typed, validated way to read them. In Node.js people often reach for process.env.X directly, but that’s fragile: one typo and you silently get undefined. In Python, pydantic-settings is a common answer: define a settings class with types and defaults, and Pydantic validates the environment at startup. The Node.js module below applies the same fail-fast idea by hand: if a required variable is missing, the process crashes immediately with a clear message instead of exploding later in a subtle way.
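// src/config.js
// Fail fast: throw at startup if a required variable is missing or empty.
const required = (name) => {
  const v = process.env[name];
  if (!v) throw new Error(`Missing required env var: ${name}`);
  return v;
};

// Note: the Deployment above injects only NODE_ENV, PORT, DATABASE_URL and KAFKA_BROKERS.
// REDIS_HOST, REDIS_PORT and LOG_LEVEL need their own configMapKeyRef entries
// (keys redis-host, redis-port, log-level), otherwise required('REDIS_HOST') throws.
module.exports = {
  port: parseInt(process.env.PORT || '3000', 10),
  nodeEnv: process.env.NODE_ENV || 'development',
  kafkaBrokers: required('KAFKA_BROKERS').split(','),   // comma-separated list
  redisHost: required('REDIS_HOST'),
  redisPort: parseInt(process.env.REDIS_PORT || '6379', 10),
  logLevel: process.env.LOG_LEVEL || 'info',
  databaseUrl: required('DATABASE_URL'),
};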
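For the file-mount approach (for example the app-config.json key in the ConfigMap above), a loader might look like the sketch below. The mount path is an assumption: the Deployment in this guide does not mount the ConfigMap as a volume, so `/etc/order-service/app-config.json` and the `loadMountedConfig` helper are hypothetical.

```javascript
const fs = require('fs');

// Read and validate a JSON config file that a ConfigMap volume would provide.
// Throws with a clear message instead of returning a half-valid config.
function loadMountedConfig(filePath) {
  let raw;
  try {
    raw = fs.readFileSync(filePath, 'utf8');
  } catch (err) {
    throw new Error(`Cannot read config file ${filePath}: ${err.message}`);
  }
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`Invalid JSON in ${filePath}: ${err.message}`);
  }
  if (!parsed.limits || typeof parsed.limits.maxOrderItems !== 'number') {
    throw new Error(`${filePath} is missing numeric limits.maxOrderItems`);
  }
  return parsed;
}

// Hypothetical mount path; only attempt the load if the file is present.
const CONFIG_PATH = process.env.APP_CONFIG_PATH || '/etc/order-service/app-config.json';
if (fs.existsSync(CONFIG_PATH)) {
  console.log(loadMountedConfig(CONFIG_PATH).features);
}
```

Because mounted files are refreshed by the kubelet, an app that wants live updates can re-run the loader periodically or on a file-change event, whereas values read from env vars stay fixed until the pod restarts.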

Secret

Why Secrets exist. Sensitive values (DB passwords, API keys, TLS private keys) need stricter handling than plain config. Secrets are a separate resource type so Kubernetes can treat them differently: they are mounted from tmpfs (never written to the node’s disk), RBAC can be scoped to them separately from ConfigMaps, they’re kept out of many debug/log paths, and they integrate with external secret managers.
What Kubernetes does internally. Secrets are stored in etcd, base64-encoded by default. With encryption-at-rest enabled (strongly recommended), the API server encrypts Secret values with a KMS key before writing to etcd, so dumping the etcd disk no longer exposes plaintext. When mounted into pods, Secrets live in a tmpfs volume so they never touch the node’s persistent disk. Updates to Secrets propagate to mounted volumes the same way ConfigMaps do.
Key fields you must set. type: Opaque for generic secrets, kubernetes.io/tls for TLS certs (the type cert-manager and ingress controllers work with), kubernetes.io/dockerconfigjson for image pull secrets. Use stringData (plaintext in YAML, base64-encoded by Kubernetes) during authoring; the data field requires you to base64-encode yourself, which is error-prone. Mark long-lived Secrets as immutable: true once stable.
Common beginner mistakes. (1) Confusing base64 for encryption and committing Secrets to Git. Decoding is one base64 -d away. (2) Exposing a Secret as an env var and then logging process.env during debug, accidentally shipping the secret to your log aggregator. (3) Granting list secrets cluster-wide to a service that only needs one specific secret. Compromising that service exposes everything. (4) Skipping etcd encryption-at-rest, thinking “Kubernetes Secrets are encrypted” (they’re not, by default).
The name is misleading: by default, Kubernetes stores Secrets as base64-encoded values in etcd, which is encoding, not encryption. Anyone with read access to the Secret can decode it trivially. Real protection requires two additional layers: enable etcd encryption at rest (so the raw disk cannot be read without the KMS key), and restrict Secret read access via RBAC (so only the pods that need a Secret can read it).
For serious production environments, Secrets stored directly in Kubernetes are usually not good enough. The pattern the industry has converged on: use an external secrets manager (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault) as the source of truth, and sync Secrets into Kubernetes via the External Secrets Operator. This gives you versioning, audit logs, centralized rotation, and the ability to revoke a secret without redeploying pods. The Secret resource in Kubernetes becomes a mirror of the authoritative secret, not the primary store.
# k8s/order-service/secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: order-service-secrets
  namespace: microservices
type: Opaque
stringData:
  database-url: "postgresql://user:password@order-db:5432/orders"
  jwt-secret: "your-jwt-secret-here"
  
# For sensitive data, use external secrets managers:
# - HashiCorp Vault with External Secrets Operator
# - AWS Secrets Manager
# - Azure Key Vault
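To see why base64 offers no protection, here is a short sketch of the round trip using only Node's built-in Buffer. The helper names are illustrative; the value is the placeholder connection string from the manifest above.

```javascript
// What Kubernetes does to a stringData value before storing it in `data`.
function encodeSecretValue(plaintext) {
  return Buffer.from(plaintext, 'utf8').toString('base64');
}

// What anyone with read access to the Secret can do to get it back.
function decodeSecretValue(encoded) {
  return Buffer.from(encoded, 'base64').toString('utf8');
}

const stored = encodeSecretValue('postgresql://user:password@order-db:5432/orders');
console.log(stored);                    // looks opaque, but is not encrypted
console.log(decodeSecretValue(stored)); // original value, no key required
```

No key is involved at any point, which is the difference between encoding and encryption, and the reason etcd encryption at rest and tight RBAC are both needed.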

ServiceAccount with RBAC

Every pod runs with an identity, a ServiceAccount, that determines what Kubernetes API calls it can make. By default, pods use the namespace’s default ServiceAccount, which is shared by every workload in the namespace and often ends up with more permissions than any one service needs. The principle of least privilege says: each service should have its own ServiceAccount with the minimum RBAC permissions required.
Why does this matter? If an attacker compromises a pod, they inherit the pod’s ServiceAccount token and can call the Kubernetes API with those permissions. If that ServiceAccount has list secrets permission across the cluster, the attacker can now enumerate every secret your organization stores in Kubernetes. By scoping ServiceAccounts per service and granting only the specific permissions needed, you turn a potentially catastrophic breach into a localized one.
The RBAC model has three parts: the ServiceAccount (identity), the Role (what actions are allowed on what resources in a namespace), and the RoleBinding (which identities get which roles). For cluster-wide permissions, ClusterRole and ClusterRoleBinding are the equivalents. The rules below grant order-service permission to read ConfigMaps in its namespace and to get one specific Secret by name, nothing more.
# k8s/order-service/rbac.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service
  namespace: microservices

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: order-service-role
  namespace: microservices
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["order-service-secrets"]
    verbs: ["get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: order-service-binding
  namespace: microservices
subjects:
  - kind: ServiceAccount
    name: order-service
    namespace: microservices
roleRef:
  kind: Role
  name: order-service-role
  apiGroup: rbac.authorization.k8s.io
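The effect of the Role above can be modeled with a small evaluator. This is a deliberately simplified sketch (no wildcards, no subresources, no ClusterRoles) and `isAllowed` is a hypothetical function, not the real authorizer; it only illustrates how apiGroups, resources, verbs, and resourceNames combine.

```javascript
// The two rules from order-service-role, as plain objects.
const rules = [
  { apiGroups: [''], resources: ['configmaps'], verbs: ['get', 'list', 'watch'] },
  { apiGroups: [''], resources: ['secrets'], resourceNames: ['order-service-secrets'], verbs: ['get'] },
];

// A request is allowed if at least one rule matches on every dimension.
// A rule without resourceNames applies to all objects of that resource.
function isAllowed(roleRules, request) {
  return roleRules.some((rule) =>
    rule.apiGroups.includes(request.apiGroup) &&
    rule.resources.includes(request.resource) &&
    rule.verbs.includes(request.verb) &&
    (!rule.resourceNames || rule.resourceNames.includes(request.name))
  );
}

const core = { apiGroup: '' };
console.log(isAllowed(rules, { ...core, resource: 'secrets', verb: 'get', name: 'order-service-secrets' })); // true
console.log(isAllowed(rules, { ...core, resource: 'secrets', verb: 'get', name: 'payment-secrets' }));       // false
console.log(isAllowed(rules, { ...core, resource: 'secrets', verb: 'list' }));                               // false
```

RBAC is additive and deny-by-default: anything not matched by some rule is refused, which is what limits the blast radius of a compromised pod.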

Ingress Configuration

Why Ingress exists. A Service of type LoadBalancer gives each microservice its own external IP, and on cloud providers that means one paid load balancer per service. With 50 services, that’s 50 LBs, 50 DNS records, 50 TLS certs. Ingress solves this: one HTTP-aware gateway at the cluster edge that terminates TLS once, then routes to internal Services by host and path. You go from 50 LBs to 1.

What Kubernetes does internally. The Ingress resource is declarative only; it does nothing on its own. You also install an ingress controller (NGINX, Traefik, HAProxy, AWS ALB controller). The controller watches Ingress resources and reconfigures its underlying proxy (e.g., rewrites nginx.conf) to match. Traffic flow: Internet → cloud LB → ingress controller pod → internal ClusterIP Service → backing pod.

Key fields you must set. ingressClassName (which controller handles this Ingress; multiple controllers can coexist), rules[].host and rules[].http.paths, pathType (Prefix is what you almost always want; Exact matches only the exact string), and tls for HTTPS. Annotations vary per controller and do most of the heavy lifting for rate limiting, auth, redirects, and CORS.

Common beginner mistakes. (1) Writing an Ingress resource with no controller installed: nothing works and the resource just sits there. (2) Choosing pathType: Exact where Prefix was intended and wondering why /orders/123 doesn’t match /orders. (3) Mixing annotations from different controllers (copying NGINX annotations into a Traefik setup). (4) TLS certificate issues because cert-manager isn’t installed or the ClusterIssuer isn’t configured.

Services solve east-west traffic (pod-to-pod). Ingress solves north-south traffic (outside-the-cluster-to-pod). Without Ingress, exposing a service to the internet means creating a LoadBalancer Service for each one: a cloud load balancer per microservice, each costing money and each needing its own DNS entry. This gets absurd quickly.

Ingress provides a single entry point that routes traffic to multiple services based on hostname and path. You pay for one load balancer (the ingress controller), and it routes internally to whichever service matches. This is how most production clusters expose APIs: one Ingress resource defines /orders -> order-service, /payments -> payment-service, /inventory -> inventory-service, and the ingress controller handles TLS termination, path rewriting, rate limiting, and routing. The ingress controller is a separate component (nginx-ingress, Traefik, HAProxy, AWS ALB controller) that implements the Ingress API. You install the controller once per cluster, then declare Ingress resources that configure it. Without the controller running, Ingress resources do nothing; they’re declarations without an implementation.
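
To make the Prefix versus Exact distinction concrete, here is a minimal sketch of the matching rules as the Ingress API describes them: Prefix compares whole path segments split on /, while Exact compares the full string. The function name matchesPath is hypothetical, and real controllers layer their own precedence and rewrite behavior on top of this.

```javascript
// Simplified model of Ingress pathType matching (hypothetical helper, not controller code).
function matchesPath(pathType, rulePath, requestPath) {
  if (pathType === 'Exact') {
    return requestPath === rulePath; // case-sensitive, whole-string match
  }
  if (pathType === 'Prefix') {
    // Compare element by element after splitting on '/'.
    const split = p => p.split('/').filter(s => s.length > 0);
    const rule = split(rulePath);
    const req = split(requestPath);
    if (rule.length > req.length) return false;
    return rule.every((segment, i) => segment === req[i]);
  }
  throw new Error(`unsupported pathType: ${pathType}`);
}

console.log(matchesPath('Prefix', '/orders', '/orders/123')); // true
console.log(matchesPath('Prefix', '/orders', '/ordersfoo'));  // false: not a whole segment
console.log(matchesPath('Exact', '/orders', '/orders/123'));  // false
```

The segment-wise comparison is why a /orders rule never accidentally captures /ordersfoo, which a naive string startsWith check would.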

NGINX Ingress

# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: microservices-ingress
  namespace: microservices
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  # ingressClassName replaces the deprecated kubernetes.io/ingress.class annotation
  ingressClassName: nginx
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /orders
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 80
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 80
          - path: /inventory
            pathType: Prefix
            backend:
              service:
                name: inventory-service
                port:
                  number: 80

Rate Limiting with Ingress

Rate limiting at the ingress layer is your first line of defense against abusive clients. Better to reject excess requests at the edge (cheap) than to let them hit your application pods (expensive). NGINX Ingress supports rate limiting via annotations: requests per second, concurrent connections, and bandwidth. Limits are applied per client IP, which is a reasonable default for most APIs. The tradeoff: IP-based limits don’t work well when your clients are behind NAT (mobile carriers, corporate proxies) — all users from that network share one IP and one rate limit bucket. For fairness across users, you need application-level rate limiting (per-API-key or per-user-ID), which happens inside the service, not at the ingress. Both layers are useful: ingress protects against volumetric abuse, application-level protects against individual misuse.
# k8s/ingress-rate-limit.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
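  annotations:
    # Requests per second accepted from a single client IP
    nginx.ingress.kubernetes.io/limit-rps: "10"
    # Concurrent connections allowed from a single client IP
    nginx.ingress.kubernetes.io/limit-connections: "5"
    # Response bandwidth cap per connection, in kilobytes per second
    nginx.ingress.kubernetes.io/limit-rate: "500"
    # Kilobytes sent at full speed before limit-rate applies
    nginx.ingress.kubernetes.io/limit-rate-after: "1000"
spec:
  ingressClassName: nginx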
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-gateway
                port:
                  number: 80
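
The NAT tradeoff described above can be shown with a small simulation. This is a sketch only: the KeyedRateLimiter class, the IP address, and the API key names are hypothetical, and NGINX uses its own limiter internally. The point is what happens to two users when the bucket is keyed by IP versus by API key.

```javascript
// Minimal token-bucket limiter keyed by an arbitrary string (client IP, API key, ...).
// Illustrative only: it models the bucket-sharing effect, not NGINX's implementation.
class KeyedRateLimiter {
  constructor(ratePerSecond, burst) {
    this.rate = ratePerSecond;
    this.burst = burst;
    this.buckets = new Map(); // key -> { tokens, lastMs }
  }

  allow(key, nowMs) {
    const b = this.buckets.get(key) ?? { tokens: this.burst, lastMs: nowMs };
    // Refill in proportion to elapsed time, capped at the burst size.
    const elapsedSec = (nowMs - b.lastMs) / 1000;
    b.tokens = Math.min(this.burst, b.tokens + elapsedSec * this.rate);
    b.lastMs = nowMs;
    const ok = b.tokens >= 1;
    if (ok) b.tokens -= 1;
    this.buckets.set(key, b);
    return ok;
  }
}

// Two users behind one NAT gateway, each sending 10 requests at the same instant.
const requests = [];
for (let i = 0; i < 10; i++) {
  requests.push({ ip: '203.0.113.7', apiKey: 'alice' });
  requests.push({ ip: '203.0.113.7', apiKey: 'bob' });
}

function countAllowed(keyOf) {
  const limiter = new KeyedRateLimiter(10, 10);
  const allowed = {};
  for (const r of requests) {
    if (limiter.allow(keyOf(r), 0)) {
      allowed[r.apiKey] = (allowed[r.apiKey] ?? 0) + 1;
    }
  }
  return allowed;
}

const byIp = countAllowed(r => r.ip);      // one shared bucket for both users
const byKey = countAllowed(r => r.apiKey); // one bucket per user
console.log('keyed by IP:', byIp);         // { alice: 5, bob: 5 }
console.log('keyed by API key:', byKey);   // { alice: 10, bob: 10 }
```

Keyed by IP, the two users split a single allowance; keyed by API key, each gets the full allowance. That is the case for keeping both layers: the edge limit stays coarse and cheap, and the per-user limit lives in the service.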

Horizontal Pod Autoscaler (HPA)

Why HPA exists. Static replica counts are a trap. Set it too low and you can’t handle traffic spikes; set it too high and you burn money running idle pods. HPA exists to let you pay for capacity proportional to actual load: the replica count follows the metric. On Black Friday, 20 pods; Wednesday at 3 AM, 3 pods.

What Kubernetes does internally. Every 15 seconds (default), the HPA controller polls the metrics API (Metrics Server for CPU/memory, or a custom metrics adapter for Prometheus metrics). It runs the formula desired = ceil(currentReplicas × currentMetric / targetMetric), applies stabilization windows and scaling policies, and updates the target Deployment’s replicas field. Notice that the HPA controls replicas: if you also set replicas in your Deployment manifest and reapply, you fight the HPA and replicas flip-flop.

Key fields you must set. scaleTargetRef (which Deployment), minReplicas (never below this; usually the floor for high availability), maxReplicas (protects you from a metric bug scaling to 10,000 pods), and at least one metrics entry. For the metrics, set a realistic averageUtilization: 70% CPU is a common default, but the right number depends on your service’s headroom needs. The behavior block is where the nuance lives. Without it you get the built-in defaults (immediate scale-up, and a 5-minute scale-down stabilization window after which all surplus pods can be removed at once), which are often too aggressive on scale-down for services that depend on warm caches.

Common beginner mistakes. (1) No resource requests on the target pods: HPA can’t compute utilization without a baseline and does nothing. (2) Scaling on CPU for an I/O-bound service whose CPU never moves; you need custom metrics like requests-per-second or queue depth. (3) Leaving replicas set in the Deployment while HPA is active, causing replica flapping. (4) maxReplicas too close to minReplicas, so HPA has nothing useful to do during a spike.

The whole point of deploying to Kubernetes is elastic capacity. HPA is the resource that turns that elasticity on: it watches a metric (CPU, memory, or custom) and adjusts the replica count to keep that metric near a target value. With a 70% CPU target, if usage averages 90% across 3 replicas, HPA scales up to ceil(3 × 90 / 70) = 4. If it drops to 20%, the formula asks for 1 replica, and HPA stops at minReplicas.

The subtle art of HPA is tuning the scaling behavior. Scale up too slowly and you miss the traffic spike. Scale down too aggressively and you add churn: pods constantly starting and stopping, which hurts cache hit rates and cold-start latency. The behavior block below tunes this carefully: scale up aggressively (no stabilization window; every 15 seconds, up to 100% growth or 4 pods, whichever is larger) but scale down cautiously (5-minute stabilization, max 10% reduction per minute). Traffic spikes are common; traffic drops are usually a brief lull, not a sustained decrease.

CPU-based scaling is the default and works for CPU-bound services. For I/O-bound services (which is most microservices), CPU stays low even under heavy load, so you need custom metrics like requests per second or queue depth. The Prometheus Adapter lets you expose any Prometheus metric to HPA, so you can scale on “in-flight HTTP requests” or “Kafka lag” or whatever actually correlates with load for your service.
# k8s/order-service/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service-hpa
  namespace: microservices
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
        - type: Pods
          value: 4
          periodSeconds: 15
      selectPolicy: Max
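
As a rough illustration of how the behavior policies in this manifest bound a single scaling step, the sketch below computes the replica limit each policy allows and picks one according to selectPolicy. The function names are hypothetical. It assumes no other scale event happened within the policy period, it does not model the stabilization window, and the rounding direction for scale-down percentages is an assumption.

```javascript
// Simplified single-period model of HPA scaling policies (hypothetical helper names).
// Assumption: no other scale events happened within the policy period.
function scaleUpLimit(current, policies, selectPolicy = 'Max') {
  const limits = policies.map(p =>
    p.type === 'Percent'
      ? Math.ceil((current * (100 + p.value)) / 100)
      : current + p.value // type: Pods
  );
  return selectPolicy === 'Min' ? Math.min(...limits) : Math.max(...limits);
}

function scaleDownLimit(current, policies, selectPolicy = 'Max') {
  const limits = policies.map(p =>
    p.type === 'Percent'
      ? Math.floor((current * (100 - p.value)) / 100) // rounding direction is an assumption
      : current - p.value
  );
  // For scale-down, "Max" means the policy allowing the largest change,
  // which is the lowest replica count.
  return selectPolicy === 'Min' ? Math.max(...limits) : Math.min(...limits);
}

const up = [{ type: 'Percent', value: 100 }, { type: 'Pods', value: 4 }];
const down = [{ type: 'Percent', value: 10 }];

console.log(scaleUpLimit(3, up));      // 7: the Pods policy wins at small sizes
console.log(scaleUpLimit(10, up));     // 20: the Percent policy wins at larger sizes
console.log(scaleDownLimit(20, down)); // 18: at most 10% removed per period
```

Pairing a Percent policy with a Pods policy under selectPolicy: Max is what lets a small deployment grow quickly: doubling 3 pods adds only 3, while the Pods policy allows 4.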
Caveats & Common Pitfalls: HPA on CPU alone
  1. CPU-based scaling on an I/O-bound service. Your Node.js service spends 95% of its time awaiting Postgres. CPU stays at 20% even when the request queue is 500 deep and P99 latency has blown past the SLA. HPA does nothing. Customers time out.
  2. Scaling on CPU with queue backpressure in play. Kafka consumers that block on poll() appear idle — HPA never scales them, even as the lag grows to millions of events. You need the queue depth as the metric, not CPU.
  3. Setting replicas in both the Deployment and HPA. Every kubectl apply of the Deployment overwrites the HPA’s replica count, which then immediately rewrites it back. Pods thrash — creating, terminating, creating.
  4. minReplicas: 1. During a scale-down to 1, a single pod crash triggers a full outage while Kubernetes creates a replacement. For any production service, floor at 2 or 3.
Solutions & Patterns
  • Scale on custom metrics that correlate with user-visible load: http_requests_per_second, kafka_consumer_lag, queue_depth, inflight_requests. Expose these via Prometheus and wire them through the Prometheus Adapter.
  • Use KEDA for event-driven scaling. It supports Kafka lag, SQS depth, Redis Stream length, CloudWatch metrics, and most other queue-like sources, and it can scale a workload to zero when idle.
  • Combine CPU and custom metrics. HPA computes a desired replica count for each metric and takes the max: CPU catches compute-bound spikes; queue depth catches I/O-bound spikes.
  • Remove replicas from the Deployment once HPA is active, or use kubectl apply --server-side with ownership transfer so the HPA owns that field.
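
The replica formula and the "takes the max" rule can be sketched in a few lines. The function name desiredReplicas and the metric values are hypothetical; the 10% tolerance band mirrors the controller's default, and scaling policies and stabilization are left out.

```javascript
// desired = ceil(currentReplicas * currentMetric / targetMetric), computed per metric.
// The HPA acts on the largest result and clamps it to [minReplicas, maxReplicas].
function desiredReplicas(currentReplicas, metrics, { minReplicas, maxReplicas, tolerance = 0.1 }) {
  const proposals = metrics.map(({ current, target }) => {
    const ratio = current / target;
    // Within the tolerance band (10% by default), the replica count is left alone.
    if (Math.abs(ratio - 1) <= tolerance) return currentReplicas;
    return Math.ceil((currentReplicas * current) / target);
  });
  const desired = Math.max(...proposals);
  return Math.min(maxReplicas, Math.max(minReplicas, desired));
}

const bounds = { minReplicas: 3, maxReplicas: 20 };

// CPU at 90% against a 70% target: ceil(3 * 90 / 70) = 4
console.log(desiredReplicas(3, [{ current: 90, target: 70 }], bounds)); // 4

// CPU at 20%: the formula says 1, but minReplicas holds the floor
console.log(desiredReplicas(3, [{ current: 20, target: 70 }], bounds)); // 3

// I/O-bound spike: CPU idle, but 2500 req/s per pod against a 1000 target
console.log(desiredReplicas(3, [
  { current: 20, target: 70 },
  { current: 2500, target: 1000 },
], bounds)); // 8
```

The third call is the I/O-bound case from the pitfalls list: CPU alone would have scaled down, and only the request-rate metric sees the load.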

Pod Disruption Budget

Why PDBs exist. Kubernetes is designed to move pods around; that’s the whole point. But during a routine cluster upgrade, if nothing stops it, Kubernetes could evict all your replicas of a service simultaneously and cause an outage. PDBs exist as a contract with the Kubernetes control plane: “when you’re doing voluntary operations, respect this minimum availability.”

What Kubernetes does internally. When something calls the Eviction API (kubectl drain, cluster autoscaler scaling down a node, descheduler rebalancing workloads), Kubernetes checks all PDBs that select the target pod. If evicting it would violate any PDB, the eviction is denied with a 429 status. The caller retries with backoff until conditions allow the eviction. This is cooperative: it only protects against components that use the Eviction API, not against direct deletes or kernel panics.

Key fields you must set. selector (matches the pods you’re protecting, usually the same selector your Deployment uses), and exactly one of minAvailable or maxUnavailable. Use absolute numbers (minAvailable: 2) for small deployments and percentages (maxUnavailable: 25%) for larger ones.

Common beginner mistakes. (1) Setting minAvailable equal to replicas, which means no pod can ever be evicted and node drains hang forever. (2) Forgetting a PDB exists and wondering why a cluster upgrade is stuck. (3) Thinking PDBs protect against node crashes. They don’t. PDBs are only for voluntary disruptions. Node crashes are involuntary, and replication + anti-affinity + multi-zone is what protects you there.

PodDisruptionBudget (PDB) is a seatbelt that protects your service from Kubernetes itself. Kubernetes voluntarily disrupts pods during node upgrades, cluster scaling, or manual drains. Without a PDB, a cluster operator running kubectl drain node-3 could evict all three of your order-service pods simultaneously, taking the service down entirely.

With a PDB declaring minAvailable: 2, Kubernetes refuses to evict a second pod until a replacement for the first is running elsewhere, maintaining at least two healthy replicas throughout the operation. PDBs apply only to voluntary disruptions (drains, upgrades). Involuntary disruptions (node crashes, hardware failures) are not bound by PDBs: if a node dies, its pods die with it, PDB or not. The way to protect against involuntary disruptions is replication plus pod anti-affinity (spread replicas across nodes and zones).
# k8s/order-service/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb
  namespace: microservices
spec:
  minAvailable: 2
  # Or use: maxUnavailable: 1
  selector:
    matchLabels:
      app: order-service
Caveats & Common Pitfalls: PodDisruptionBudget
  1. Absent PDB during a rolling cluster upgrade. You run 3 replicas of payment-service. The node autoscaler drains all 3 nodes in parallel because nothing tells it to coordinate. For 30 seconds, your service has zero available pods. Checkouts fail with 500s.
  2. minAvailable: 100% (or equal to replicas). Kubernetes cannot evict any pod, ever. Node drains hang forever, cluster upgrades stall, SREs learn to force-delete pods (which defeats the PDB).
  3. Forgetting PDB is only for voluntary disruptions. A PDB of minAvailable: 2 does not save you when the node’s EBS volume dies. That is an involuntary disruption and requires replication + anti-affinity + multi-zone.
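  4. PDB without anti-affinity. All 3 replicas land on the same node. A drain on that node can evict only one pod at a time (minAvailable: 2 allows a single disruption), so it waits for each replacement to become Ready elsewhere and stalls if replacements cannot be scheduled. A crash of that node takes out all 3 at once, and no PDB can help. Spread the pods first, then set a PDB.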
Solutions & Patterns
  • Default: maxUnavailable: 1 for services with 3+ replicas. Simple, scales with your replica count, never deadlocks drains.
  • For critical services: pair PDB with topologySpreadConstraints across zones. The PDB protects you from voluntary disruption, the spread protects from zone failure.
  • Write a PDB for every Deployment as a matter of course. Include it in your Helm chart template so new services inherit the protection.
  • Monitor the disruptionsAllowed status field: kubectl get pdb -A -o json | jq '.items[] | select(.status.disruptionsAllowed == 0)'. Zero allowed disruptions is how upgrades get stuck, so alert on it.
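
To see why minAvailable equal to replicas deadlocks a drain, it helps to compute the number the Eviction API checks. The sketch below is a simplified model with hypothetical function names: it resolves minAvailable or maxUnavailable (absolute or percentage, with percentages rounded up as Kubernetes does) against the expected pod count and returns the allowed disruptions.

```javascript
// Resolve an absolute number or a percentage string against the expected pod count.
// Kubernetes rounds percentages up for both minAvailable and maxUnavailable.
function resolveValue(value, expectedPods) {
  if (typeof value === 'string' && value.endsWith('%')) {
    return Math.ceil((parseInt(value, 10) * expectedPods) / 100);
  }
  return value;
}

// Simplified model of status.disruptionsAllowed (hypothetical helper).
function disruptionsAllowed({ expectedPods, healthyPods, minAvailable, maxUnavailable }) {
  const desiredHealthy = minAvailable !== undefined
    ? resolveValue(minAvailable, expectedPods)
    : expectedPods - resolveValue(maxUnavailable, expectedPods);
  return Math.max(0, healthyPods - desiredHealthy);
}

// 3 replicas, minAvailable: 2 -> one eviction at a time
console.log(disruptionsAllowed({ expectedPods: 3, healthyPods: 3, minAvailable: 2 })); // 1

// minAvailable equal to replicas -> drains can never proceed
console.log(disruptionsAllowed({ expectedPods: 3, healthyPods: 3, minAvailable: 3 })); // 0

// maxUnavailable: 25% of 8 pods -> two evictions at a time
console.log(disruptionsAllowed({ expectedPods: 8, healthyPods: 8, maxUnavailable: '25%' })); // 2

// One pod already unhealthy with maxUnavailable: 1 -> the budget is used up
console.log(disruptionsAllowed({ expectedPods: 3, healthyPods: 2, maxUnavailable: 1 })); // 0
```

The last case is worth noting: an unhealthy pod consumes the budget, so a crash-looping replica can block a drain even when the PDB itself is sensibly configured.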

Network Policies

Why NetworkPolicies exist. Kubernetes’s default network model is a flat, promiscuous LAN: every pod can reach every other pod on every port. That’s great for “it just works” in development and catastrophic for security in production. One compromised pod can scan the entire cluster. NetworkPolicies exist to let you declare “which pods can talk to which” and enforce segmentation at Layer 3/4.

What Kubernetes does internally. NetworkPolicy is a spec that the CNI plugin enforces. Calico, Cilium, and Weave translate policies into iptables rules, eBPF programs, or OVS flows on each node. The rule engine is additive and default-deny: if any NetworkPolicy selects a pod, only explicitly allowed traffic is permitted; if no policy selects a pod, everything is allowed (backward compatibility default). So a namespace with zero policies is wide open.

Key fields you must set. podSelector (which pods this policy applies to; an empty selector {} means all pods in the namespace), policyTypes (Ingress, Egress, or both; you must list Egress for egress rules to take effect), and ingress/egress rules with from/to and ports. Always remember DNS egress (kube-dns in kube-system) or your apps will fail every hostname resolution.

Common beginner mistakes. (1) Writing an egress policy without allowing DNS, so every hostname lookup fails. (2) Expecting policies to work when the CNI doesn’t enforce them (for example, the AWS VPC CNI does not enforce NetworkPolicy unless its network policy support is enabled or a policy engine such as Calico is installed alongside it). (3) Using podSelector: {} thinking it matches nothing when it actually matches everything in the namespace. (4) Forgetting that policies are per-namespace: a cross-namespace allow needs namespaceSelector.

By default, every pod in a Kubernetes cluster can talk to every other pod. No firewall, no segmentation. In a microservices architecture with 50 services, this means if an attacker compromises a low-privilege service (say, a marketing analytics dashboard), they can directly reach the payment database, the user service, and anything else on the cluster. NetworkPolicies fix this by declaring which pods are allowed to talk to which other pods. The model is default-deny once any NetworkPolicy applies to a pod: if a pod has any NetworkPolicy selecting it, only the explicitly allowed traffic is permitted; everything else is blocked.

The policy below says: order-service accepts traffic only from the ingress-nginx namespace and from api-gateway pods, and can only make outbound connections to DNS, its database, Kafka, and two other specific services. Everything else is blocked.

The catch: NetworkPolicies require a CNI plugin that implements them (Calico, Cilium, Weave). If your cluster’s CNI doesn’t support NetworkPolicies, or you’ve forgotten to enable them, you can write all the NetworkPolicy YAML you want and it will do absolutely nothing. Confirm the policies exist with kubectl get networkpolicies, then test enforcement by trying a disallowed connection from inside a pod; the first check alone proves nothing about enforcement.
# k8s/network-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: order-service-network-policy
  namespace: microservices
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes:
    - Ingress
    - Egress
  ingress:
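    # Allow traffic from ingress controller
    # (kubernetes.io/metadata.name is set automatically on every namespace;
    # a plain "name" label only matches if you added it yourself)
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx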
      ports:
        - protocol: TCP
          port: 3000
    # Allow traffic from api-gateway
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 3000
  egress:
    # Allow DNS
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        # DNS falls back to TCP for large responses
        - protocol: TCP
          port: 53
    # Allow to databases
    - to:
        - podSelector:
            matchLabels:
              app: order-db
      ports:
        - protocol: TCP
          port: 5432
    # Allow to Kafka
    - to:
        - podSelector:
            matchLabels:
              app: kafka
      ports:
        - protocol: TCP
          port: 9092
    # Allow to other services
    - to:
        - podSelector:
            matchLabels:
              app: payment-service
        - podSelector:
            matchLabels:
              app: inventory-service
      ports:
        - protocol: TCP
          port: 3000

Helm Charts

Why Helm exists. Managing raw YAML for 50 microservices across 3 environments means 150 near-duplicate manifests, manual find-and-replace when things change, and no atomic rollback story. Helm exists to treat Kubernetes manifests as a package: a parameterized chart you install and upgrade as a single unit, with release history and rollback. It’s the apt/npm of Kubernetes.

What Helm does internally. A chart is a directory of Go-templated YAML files plus a values.yaml defaults file. When you run helm install, Helm renders the templates against your values (plus built-ins like .Release.Name), produces final YAML, and submits it to the API server. It also stores the release metadata (name, version, rendered manifest) as a Secret in the target namespace. helm upgrade re-renders and diffs; helm rollback reapplies a prior stored release.

Key fields you must set (per chart). Chart.yaml (name, version, appVersion; remember version is the chart version and appVersion is the app version, and they’re independent). In values.yaml, every value that varies between environments should be parameterized; hardcode only truly universal defaults. In templates, use {{ include "chart.fullname" . }} helpers for consistent naming rather than hardcoding names.

Common beginner mistakes. (1) Forgetting the checksum/config annotation, so ConfigMap changes don’t trigger pod rollouts. (2) Putting secrets in values.yaml and checking them into Git. Use --set at install time or a secrets plugin (helm-secrets, sops) instead. (3) Upgrading with --reuse-values when you meant --reset-values (or vice versa) and losing configuration. (4) Chart-per-microservice explosion without a shared library chart, so every team reinvents the same patterns slightly differently.

Writing raw Kubernetes YAML for 50 microservices is a nightmare. You end up with hundreds of YAML files that are 90% identical and 10% service-specific. The industry settled on Helm as the solution: a templating engine that lets you define one chart (a parameterized collection of Kubernetes manifests) and install it many times with different values. One chart, many releases, each with its own image tag, replica count, and environment-specific config.

Helm also provides release management: a history of what was deployed when, with a helm rollback command that reverts to a previous release. Compared to kubectl apply, which has no concept of a release, Helm gives you release-level upgrades and rollbacks. For production microservices, this is a significant operational improvement.

The downside is complexity. Helm templates are Go templates with a DSL of their own, and the error messages when a template is wrong are notoriously cryptic. Chart maintenance can become its own discipline. The alternatives are Kustomize (simpler, no templating, built into kubectl), Jsonnet, or pure code tools like CDK8s and Pulumi. My default is Helm for anything shared between teams (chart reuse is a killer feature) and Kustomize for team-internal variations.

Chart Structure

order-service/
├── Chart.yaml
├── values.yaml
├── values-staging.yaml
├── values-production.yaml
├── templates/
│   ├── _helpers.tpl
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── configmap.yaml
│   ├── secret.yaml
│   ├── hpa.yaml
│   ├── pdb.yaml
│   ├── serviceaccount.yaml
│   └── ingress.yaml
└── charts/

Chart.yaml

Chart.yaml is the metadata for the chart itself — its name, version, and dependencies. The version field is the chart’s version (bumped when you change templates), while appVersion is the version of the application the chart deploys (the image tag). These are intentionally separate: you might ship chart v1.2.0 that deploys app v2.5.0. Dependencies let one chart pull in others (e.g., your service chart can depend on a Postgres chart), useful for development but often skipped in production where databases are managed separately.
# Chart.yaml
apiVersion: v2
name: order-service
description: Order Service Helm Chart
type: application
version: 1.0.0
appVersion: "1.0.0"

dependencies:
  - name: postgresql
    version: "12.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled

values.yaml

values.yaml is the default configuration for the chart. When you install the chart, these values fill in the templated placeholders. For per-environment differences, you override via values-staging.yaml or --set image.tag=1.2.0 on the command line. The discipline here is to make every environment-specific concept parameterizable (replicas, resources, ingress hosts, feature flags) and hard-code only the truly universal defaults. A good rule of thumb: if you find yourself with values-dev.yaml, values-staging.yaml, values-prod.yaml, and they each override 40 fields, your values.yaml is probably wrong. The base should be reasonable defaults, and environment overrides should be minimal deltas (different image tags, different replica counts, different ingress hosts). If your environments differ structurally (staging has no ingress at all, production has three), you may need separate charts or a more sophisticated tool.
# values.yaml
replicaCount: 3

image:
  repository: myregistry/order-service
  tag: "1.0.0"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80
  targetPort: 3000

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: api.example.com
      paths:
        - path: /orders
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.example.com

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 100m
    memory: 128Mi

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

podDisruptionBudget:
  enabled: true
  minAvailable: 2

config:
  logLevel: info
  kafkaBrokers: kafka:9092
  redisHost: redis

secrets:
  databaseUrl: ""
  jwtSecret: ""

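postgresql:
  enabled: true
  auth:
    database: orders

# Read by the order-service.serviceAccountName helper; an empty name falls back to the chart's fullname
serviceAccount:
  name: ""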
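
The "minimal deltas" discipline is easier to follow once the layering rule is clear: chart defaults first, then each -f file, then --set flags, with maps merged key by key and scalars or lists replaced wholesale. The sketch below models that rule in plain JavaScript. The mergeValues function and the sample override values are hypothetical, and Helm features such as deleting a key by setting it to null are not modeled.

```javascript
const isMap = v => v !== null && typeof v === 'object' && !Array.isArray(v);

// Later layers win. Maps merge recursively; scalars and lists replace the base value.
function mergeValues(base, override) {
  const out = { ...base };
  for (const [key, value] of Object.entries(override)) {
    out[key] = isMap(value) && isMap(base[key])
      ? mergeValues(base[key], value)
      : value;
  }
  return out;
}

// values.yaml (chart defaults, abbreviated)
const defaults = {
  replicaCount: 3,
  image: { repository: 'myregistry/order-service', tag: '1.0.0' },
  autoscaling: { minReplicas: 3, maxReplicas: 20 },
};
// values-production.yaml: a small delta, not a full copy (hypothetical contents)
const production = { replicaCount: 5, autoscaling: { maxReplicas: 50 } };
// --set image.tag=1.2.0
const setFlags = { image: { tag: '1.2.0' } };

const finalValues = mergeValues(mergeValues(defaults, production), setFlags);
console.log(JSON.stringify(finalValues, null, 2));
```

Because maps merge key by key, the production file can change autoscaling.maxReplicas without restating minReplicas. Lists do not merge, which is why overriding one ingress host means restating the whole hosts list.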

Deployment Template

The template below shows several important Helm idioms. The {{ include "order-service.fullname" . }} pulls a name from a helper template (defined later), ensuring consistent naming across resources. The {{- if not .Values.autoscaling.enabled }} conditionally includes the replicas field — when HPA is enabled, you don’t want the Deployment fighting the HPA over replica count. The checksum/config annotation is a clever trick: it hashes the rendered ConfigMap content and adds the hash as a pod annotation. When the ConfigMap changes, the hash changes, which means the pod spec changes, which triggers a rolling update. Without this annotation, changing a ConfigMap has no effect on running pods — they keep their old env vars until restart. This tiny line solves a very common bug.
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "order-service.fullname" . }}
  labels:
    {{- include "order-service.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "order-service.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") . | sha256sum }}
      labels:
        {{- include "order-service.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "order-service.serviceAccountName" . }}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
          env:
            - name: NODE_ENV
              value: production
            - name: LOG_LEVEL
              valueFrom:
                configMapKeyRef:
                  name: {{ include "order-service.fullname" . }}-config
                  key: log-level
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: {{ include "order-service.fullname" . }}-secrets
                  key: database-url
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
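
The checksum trick is simple enough to reproduce outside Helm. The sketch below uses Node's built-in crypto module; renderConfigMap is a hypothetical stand-in for the rendered templates/configmap.yaml, and the config keys are illustrative. It shows that any change to the rendered ConfigMap changes the annotation, and therefore the pod template.

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for the rendered templates/configmap.yaml.
function renderConfigMap(config) {
  return [
    'apiVersion: v1',
    'kind: ConfigMap',
    'data:',
    `  log-level: ${config.logLevel}`,
    `  kafka-brokers: ${config.kafkaBrokers}`,
  ].join('\n');
}

// Same idea as Helm's sha256sum template function: hex digest of the rendered text.
const sha256sum = text => crypto.createHash('sha256').update(text).digest('hex');

function podTemplateAnnotations(config) {
  return { 'checksum/config': sha256sum(renderConfigMap(config)) };
}

const before = podTemplateAnnotations({ logLevel: 'info', kafkaBrokers: 'kafka:9092' });
const after = podTemplateAnnotations({ logLevel: 'debug', kafkaBrokers: 'kafka:9092' });
const unchanged = podTemplateAnnotations({ logLevel: 'info', kafkaBrokers: 'kafka:9092' });

// A changed annotation means a changed pod template, which triggers a rolling update.
console.log(before['checksum/config'] !== after['checksum/config']);     // true
// Identical config renders to the identical hash, so no rollout happens.
console.log(before['checksum/config'] === unchanged['checksum/config']); // true
```

The hash is deterministic, so re-running helm upgrade with unchanged config does not restart pods; only a real change to the rendered ConfigMap does.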

Helpers Template

Helper templates live in _helpers.tpl (the leading underscore tells Helm not to render them as Kubernetes resources) and define named templates you can reuse. The order-service.labels helper produces a consistent set of labels for every resource in the chart — name, version, chart, release. Consistency matters: selectors rely on exact label matches, and inconsistent labels are a classic source of “my Service has no endpoints” bugs. The labels here follow the Kubernetes recommended schema (app.kubernetes.io/name, app.kubernetes.io/instance, etc.). Adopting this schema means your resources work with standard tooling (monitoring dashboards, kubectl label selectors, helm itself) without custom configuration.
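# templates/_helpers.tpl
{{/*
Fully qualified app name, used by the other templates in this chart.
Truncated to 63 characters because some Kubernetes name fields are limited to that length.
*/}}
{{- define "order-service.fullname" -}}
{{- if contains .Chart.Name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name .Chart.Name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}

{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "order-service.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}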

{{/*
Common labels
*/}}
{{- define "order-service.labels" -}}
helm.sh/chart: {{ include "order-service.chart" . }}
{{ include "order-service.selectorLabels" . }}
app.kubernetes.io/version: {{ .Values.image.tag | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/*
Selector labels
*/}}
{{- define "order-service.selectorLabels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Create the name of the service account to use
*/}}
{{- define "order-service.serviceAccountName" -}}
{{- default (include "order-service.fullname" .) .Values.serviceAccount.name }}
{{- end }}
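
The order-service.chart helper is a short pipeline, and it can help to read it as ordinary code. The sketch below is a JavaScript rendering of printf "%s-%s" | replace "+" "_" | trunc 63 | trimSuffix "-". The function name chartLabel is hypothetical.

```javascript
// JavaScript rendering of: printf "%s-%s" name version | replace "+" "_" | trunc 63 | trimSuffix "-"
function chartLabel(name, version) {
  // '+' (used in SemVer build metadata) is not allowed in label values
  let label = `${name}-${version}`.replace(/\+/g, '_');
  // label values are limited to 63 characters
  label = label.slice(0, 63);
  // trimSuffix removes a single trailing '-' left behind by truncation
  return label.endsWith('-') ? label.slice(0, -1) : label;
}

console.log(chartLabel('order-service', '1.0.0'));         // order-service-1.0.0
console.log(chartLabel('order-service', '1.0.0+build.5')); // order-service-1.0.0_build.5
console.log(chartLabel('a'.repeat(62), '1').length);       // 62: truncated, then the dangling '-' is trimmed
```

Each step guards against a specific rejection by the API server: an invalid character, an over-long value, or a value ending in a non-alphanumeric character.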

Helm Commands

# Install
helm install order-service ./order-service -n microservices

# Install with custom values
helm install order-service ./order-service \
  -n microservices \
  -f values-production.yaml \
  --set image.tag=1.2.0

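# Upgrade (pass the same values file again; values from a previous -f are
# not carried over when you supply new values without --reuse-values)
helm upgrade order-service ./order-service \
  -n microservices \
  -f values-production.yaml \
  --set image.tag=1.3.0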

# Rollback
helm rollback order-service 1 -n microservices

# Uninstall
helm uninstall order-service -n microservices

# Template (render manifests locally without installing)
helm template order-service ./order-service -n microservices

Validating Helm Values from Code

Before a chart ever hits the cluster, you want to catch misconfigurations: wrong types, missing required fields, production-only values left off. Helm has a values.schema.json mechanism for this, but it’s often easier to write a small script that loads values.yaml (and any override file) and validates it against a real schema. In Node.js, Ajv is a good fit: you describe your chart’s values as a JSON Schema and get a full error report when something’s off.
// scripts/validate-values.js
const fs = require('fs');
const yaml = require('js-yaml');
const Ajv = require('ajv');

const schema = {
  type: 'object',
  required: ['image', 'replicaCount', 'resources'],
  properties: {
    replicaCount: { type: 'integer', minimum: 1, maximum: 100 },
    image: {
      type: 'object',
      required: ['repository', 'tag'],
      properties: {
        repository: { type: 'string' },
        tag: { type: 'string', pattern: '^v?\\d+\\.\\d+\\.\\d+' },
        pullPolicy: { enum: ['Always', 'IfNotPresent', 'Never'] },
      },
    },
    resources: {
      type: 'object',
      required: ['requests', 'limits'],
    },
  },
};

const file = process.argv[2] || 'values.yaml';
const values = yaml.load(fs.readFileSync(file, 'utf8'));

const ajv = new Ajv({ allErrors: true });
const valid = ajv.validate(schema, values);

if (!valid) {
  console.error('Validation errors in', file);
  console.error(ajv.errors);
  process.exit(1);
}
console.log(`${file} is valid`);
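
JSON Schema validates each field on its own, but some misconfigurations only show up across fields: the API server rejects a container whose resource request exceeds its limit. A check like that is easy to add next to the schema. The sketch below is dependency-free and uses hypothetical helper names; it parses only the quantity formats used in this chart (millicores or cores for CPU, Ki/Mi/Gi for memory), which is an assumption rather than full Kubernetes quantity parsing.

```javascript
// Parse a subset of Kubernetes quantities: CPU as cores or millicores ("0.5", "500m").
function parseCpuMillis(q) {
  const s = String(q);
  return s.endsWith('m') ? parseInt(s, 10) : Math.round(parseFloat(s) * 1000);
}

// Memory with binary suffixes ("128Mi", "1Gi") or plain bytes. Other suffixes are not handled here.
const MEMORY_UNITS = { Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30 };

function parseMemoryBytes(q) {
  const m = /^(\d+)(Ki|Mi|Gi)?$/.exec(String(q));
  if (!m) throw new Error(`unsupported memory quantity: ${q}`);
  return Number(m[1]) * (m[2] ? MEMORY_UNITS[m[2]] : 1);
}

// Returns a list of human-readable problems; an empty list means the block is consistent.
function checkResources(resources) {
  const errors = [];
  if (parseCpuMillis(resources.requests.cpu) > parseCpuMillis(resources.limits.cpu)) {
    errors.push('cpu request exceeds cpu limit');
  }
  if (parseMemoryBytes(resources.requests.memory) > parseMemoryBytes(resources.limits.memory)) {
    errors.push('memory request exceeds memory limit');
  }
  return errors;
}

const good = {
  requests: { cpu: '100m', memory: '128Mi' },
  limits: { cpu: '500m', memory: '512Mi' },
};
const bad = {
  requests: { cpu: '2', memory: '1Gi' },
  limits: { cpu: '500m', memory: '512Mi' },
};

console.log(checkResources(good)); // []
console.log(checkResources(bad));  // [ 'cpu request exceeds cpu limit', 'memory request exceeds memory limit' ]
```

In a validation script, these errors would be reported alongside the schema errors so that one run surfaces every problem in the file.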

Programmatic Kubernetes Client

Beyond kubectl, you’ll often need to interact with Kubernetes from your own code — for custom operators, deployment automation, health dashboards, or CI/CD tooling. Both Node.js and Python have official Kubernetes clients. The API surface mirrors the REST API exactly: listing pods, watching for changes, creating resources, all of it. The pattern below shows a typical automation use case: listing pods in a namespace and reporting their status. For anything more sophisticated than simple scripts — custom controllers, webhooks, operators — the Kubernetes API’s watch mechanism becomes important. Watches let you subscribe to resource changes (pod created, config updated) and react in real time, which is how the control plane itself works. If you’re writing a controller in Python, kopf is a popular framework that wraps the low-level client with a clean decorator API. For Node.js, the @kubernetes/client-node package has watch support but you typically build the control loop yourself.
// scripts/pod-status.js
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();  // Reads ~/.kube/config or in-cluster config

const coreApi = kc.makeApiClient(k8s.CoreV1Api);

async function listPods(namespace) {
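  // Call style for @kubernetes/client-node 0.x (positional args, response under .body).
  // In 1.x the same call is listNamespacedPod({ namespace }) and the list is returned directly.
  const res = await coreApi.listNamespacedPod(namespace);
  return res.body.items.map(pod => ({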
    name: pod.metadata.name,
    status: pod.status.phase,
    node: pod.spec.nodeName,
    ready: pod.status.containerStatuses?.every(c => c.ready) ?? false,
    restarts: pod.status.containerStatuses?.reduce(
      (sum, c) => sum + c.restartCount, 0
    ) ?? 0
  }));
}

async function main() {
  const pods = await listPods('microservices');
  console.table(pods);

  const unhealthy = pods.filter(p => !p.ready || p.restarts > 5);
  if (unhealthy.length > 0) {
    console.error(`Found ${unhealthy.length} unhealthy pods`);
    process.exit(1);
  }
}

main().catch(err => {
  console.error(err);
  process.exit(1);
});

Watching Deployment Rollouts

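A common automation pattern is waiting for a rollout to complete, so that CI/CD can verify a deploy succeeded or a dashboard can surface rollout progress. Both clients expose watch streams: k8s.Watch in Node.js and kubernetes.watch.Watch in Python, whose API is particularly clean. The pattern: open a watch on the Deployment, stream MODIFIED events, and compare the ready and updated replica counts in status (readyReplicas and updatedReplicas in JavaScript, ready_replicas and updated_replicas in Python) against spec.replicas until they match. Add a timeout that runs independently of the event stream so you don’t wait forever on a stuck rollout that emits no further events.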
// scripts/wait-rollout.js
const k8s = require('@kubernetes/client-node');

const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const appsApi = kc.makeApiClient(k8s.AppsV1Api);
const watch = new k8s.Watch(kc);

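async function waitForRollout(namespace, name, timeoutMs = 300_000) {
  return new Promise((resolve, reject) => {
    let settled = false;

    // Settle exactly once: stop the timer, close the watch, then resolve/reject.
    const finish = (fn, arg) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      req.then(r => r.abort());
      fn(arg);
    };

    // A real timer fires even when a stuck rollout emits no further events.
    const timer = setTimeout(
      () => finish(reject, new Error('rollout timed out')),
      timeoutMs
    );

    const req = watch.watch(
      `/apis/apps/v1/namespaces/${namespace}/deployments`,
      { fieldSelector: `metadata.name=${name}` },
      (type, d) => {
        const desired = d.spec.replicas;
        const ready = d.status.readyReplicas || 0;
        const updated = d.status.updatedReplicas || 0;
        console.log(`${type} ${name}: ${ready}/${desired} ready, ${updated} updated`);

        // Ignore status that still describes the previous spec.
        const observed = (d.status.observedGeneration || 0) >= d.metadata.generation;
        if (observed && ready === desired && updated === desired) {
          finish(resolve);
        }
      },
      err => { if (err) finish(reject, err); }
    );
  });
}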

waitForRollout('microservices', 'order-service')
  .then(() => console.log('rollout complete'))
  .catch(err => { console.error(err); process.exit(1); });
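
The completion check at the heart of the watch loop can be isolated as a pure function, which makes it easy to unit test without a cluster. The sketch below is a minimal Python version that assumes the Deployment arrives as a plain dict shaped like the JSON the API returns; `rollout_complete` is a hypothetical helper name, not part of either client library.

```python
def rollout_complete(deployment: dict) -> bool:
    """Return True when a Deployment (as a plain dict) has finished rolling out."""
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)  # the API defaults replicas to 1
    generation = deployment.get("metadata", {}).get("generation", 0)

    # The controller must have observed the latest spec before status is trusted.
    if status.get("observedGeneration", 0) < generation:
        return False

    updated = status.get("updatedReplicas", 0)
    ready = status.get("readyReplicas", 0)
    total = status.get("replicas", 0)
    # total == desired means no pods from the old ReplicaSet are left.
    return updated == desired and ready == desired and total == desired


in_progress = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 4,
               "updatedReplicas": 2, "readyReplicas": 3},
}
done = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 3,
               "updatedReplicas": 3, "readyReplicas": 3},
}
print(rollout_complete(in_progress), rollout_complete(done))
```

Checking `observedGeneration` avoids a false positive right after `kubectl apply`, when status still reflects the previous revision.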

FastAPI Service Ready for Kubernetes

Pulling the threads together, here’s a minimal FastAPI service that’s fully prepared for Kubernetes: structured logging with structlog, settings loaded from ConfigMap/Secret environment variables via pydantic-settings, proper /health/live and /health/ready endpoints, and graceful shutdown on SIGTERM. This is the kind of template you’d scaffold every new Python microservice from.
# app/main.py
"""Kubernetes-ready FastAPI microservice template.

Wiring:
- pydantic-settings reads env vars populated from ConfigMap & Secret
- structlog emits JSON logs consumable by Fluent Bit / Loki
- /health/live is a cheap liveness check (process-only)
- /health/ready verifies dependencies before accepting traffic
- SIGTERM handler flips readiness to 503 so the Service endpoint
  is drained before the pod actually exits (preStop + terminationGracePeriodSeconds)
"""
import signal
from contextlib import asynccontextmanager

import structlog
from fastapi import FastAPI, Response, status
from pydantic import Field, SecretStr
from pydantic_settings import BaseSettings

# ---------- Settings ----------
class Settings(BaseSettings):
    port: int = 3000
    log_level: str = "info"
    database_url: SecretStr = Field(..., alias="DATABASE_URL")
    kafka_brokers: str = Field(..., alias="KAFKA_BROKERS")

settings = Settings()

# ---------- Logging ----------
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
log = structlog.get_logger()

# ---------- Lifecycle ----------
_ready = False
_shutting_down = False

@asynccontextmanager
async def lifespan(app: FastAPI):
    global _ready
    log.info("startup", kafka=settings.kafka_brokers)
    # TODO: open DB pool, Kafka producer, etc.
    _ready = True
    yield
    log.info("shutdown")
    _ready = False

app = FastAPI(lifespan=lifespan)

# ---------- Probes ----------
@app.get("/health/live")
async def liveness():
    # Liveness stays process-only: failing it during shutdown would ask the
    # kubelet to restart a container that is already draining.
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    if not _ready or _shutting_down:
        return Response(status_code=status.HTTP_503_SERVICE_UNAVAILABLE)
    return {"status": "ready"}

# ---------- Business routes ----------
@app.get("/orders/{order_id}")
async def get_order(order_id: str):
    log.info("get_order", order_id=order_id)
    return {"id": order_id, "status": "confirmed"}

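# ---------- Graceful shutdown ----------
# Caveat: ASGI servers such as uvicorn install their own SIGTERM handler when
# they start, which can replace a handler registered at import time. Verify
# that this handler actually fires under your server; a preStop sleep remains
# the dependable way to drain traffic before the process is signalled.
def _handle_sigterm(*_):
    global _shutting_down
    _shutting_down = True
    log.info("sigterm_received")

signal.signal(signal.SIGTERM, _handle_sigterm)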

kubectl Commands Reference

# Context & Namespace
kubectl config get-contexts
kubectl config use-context production
kubectl config set-context --current --namespace=microservices

# Get resources
kubectl get pods -n microservices
kubectl get pods -o wide
kubectl get pods -l app=order-service
kubectl get all -n microservices

# Describe resources
kubectl describe pod order-service-xxx
kubectl describe deployment order-service

# Logs
kubectl logs order-service-xxx
kubectl logs -f order-service-xxx
kubectl logs order-service-xxx --previous
kubectl logs -l app=order-service --all-containers

# Execute commands
kubectl exec -it order-service-xxx -- /bin/sh
kubectl exec order-service-xxx -- env

# Port forwarding
kubectl port-forward svc/order-service 3000:80
kubectl port-forward pod/order-service-xxx 3000:3000

# Apply/Delete
kubectl apply -f k8s/
kubectl apply -f k8s/ -n microservices
kubectl delete -f k8s/

# Scale
kubectl scale deployment order-service --replicas=5

# Rollout
kubectl rollout status deployment/order-service
kubectl rollout history deployment/order-service
kubectl rollout undo deployment/order-service
kubectl rollout restart deployment/order-service

Interview Questions

Answer:
Deployment:
  • Stateless workloads
  • Pods are interchangeable
  • Random pod names (order-xxx-yyy)
  • Parallel scaling
  • Use for: API servers, web apps
StatefulSet:
  • Stateful workloads
  • Stable network identity (order-0, order-1)
  • Ordered deployment/scaling
  • Persistent storage per pod
  • Use for: Databases, message queues
Key differences:
  • StatefulSet maintains pod identity across restarts
  • Each StatefulSet pod gets its own PVC
  • Ordered, graceful deployment and scaling
Answer:
Liveness Probe:
  • “Is the container alive?”
  • Failure → container restart
  • Detect deadlocks, infinite loops
  • Usually checks internal health
Readiness Probe:
  • “Can the container accept traffic?”
  • Failure → removed from service endpoints
  • Container keeps running
  • Checks dependencies (DB, cache)
Best Practices:
  • Liveness: Simple, fast check
  • Readiness: Check dependencies
  • Different endpoints for each
  • Appropriate timeouts and thresholds
Answer:
Horizontal Pod Autoscaler:
  1. Metrics collection (every 15s default)
    • CPU, memory via Metrics Server
    • Custom metrics via Prometheus Adapter
  2. Calculation:
    desiredReplicas = ceil(currentReplicas * (currentMetric/targetMetric))
    
  3. Scaling decision:
    • Considers stabilization window
    • Applies scaling policies
    • Respects min/max replicas
Tuning:
  • stabilizationWindowSeconds: Prevent flapping
  • scaleDown.policies: Gradual scale down
  • scaleUp.policies: Aggressive scale up
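
To make the HPA calculation concrete, here is a minimal Python sketch of the replica formula above. It assumes the default tolerance of 0.1 (the controller skips scaling when the metric ratio is within 10% of 1.0) and clamps the result to min/max replicas. The function name and sample numbers are illustrative; the real controller also handles missing metrics and not-yet-ready pods, which this sketch ignores.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric))."""
    ratio = current_metric / target_metric
    # Within tolerance of the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Respect the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 4 pods at 63% against 60%: inside the tolerance band, no change.
print(desired_replicas(4, 63, 60))
# 4 pods at 15% against 60%: formula says 1, minReplicas=2 wins.
print(desired_replicas(4, 15, 60, min_replicas=2))
```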

Summary

Key Takeaways

  • Use Deployments for stateless services
  • ConfigMaps and Secrets for configuration
  • HPA for automatic scaling
  • Network Policies for security
  • Helm for package management

Next Steps

In the next chapter, we’ll cover Testing Strategies for microservices.

Interview Deep-Dive

Strong Answer:
This is a noisy neighbor problem, and it is solved by proper resource management at three levels: pod resource requests and limits, Quality of Service classes, and pod scheduling.

First, every pod must have CPU requests and limits. The request is the guaranteed minimum, and Kubernetes uses it for scheduling decisions. The limit is the maximum: a process that exceeds its CPU limit is throttled (not killed, unlike a process that exceeds its memory limit). I set requests based on the P95 CPU usage from production metrics and limits at 2-3x the requests to allow burst capacity.

Second, requests and limits determine the pod’s QoS class. A pod is “Guaranteed” only when every container sets CPU and memory requests equal to its limits; these pods are the last to be evicted when a node runs out of resources. A pod whose requests are lower than its limits (such as the 2-3x burst setup above) is “Burstable”, and a pod with no requests or limits at all is “BestEffort” and is evicted first. For the most critical services (payment, orders), I set requests equal to limits so they survive node resource pressure, while less critical services (analytics, logging) absorb the impact.

Third, I use pod anti-affinity rules to prevent all instances of a critical service from landing on the same node. If all 3 payment pods are on one node and that node fails, payment processing goes down entirely. Anti-affinity spreads them across nodes. For the noisy neighbor specifically, I might use taints and tolerations to dedicate certain nodes to the resource-hungry service, isolating it from others.

The deeper fix: investigate why the service is consuming excessive CPU. Common causes in Node.js: unintentional synchronous operations blocking the event loop (JSON parsing of large payloads, regex backtracking), memory pressure causing excessive garbage collection, or simply inadequate horizontal scaling, with one pod handling traffic that should be spread across five.

Follow-up: “How do you right-size resource requests and limits for a new service that has no production data?”
I start with conservative estimates based on load testing: run the service under expected peak load in a staging environment and observe CPU and memory usage. Set requests to the observed peak usage plus a 20% buffer, and limits to 3x the requests. After the first week in production, I revisit using Vertical Pod Autoscaler (VPA) in recommendation mode, which analyzes actual resource usage and suggests request/limit values. I never enable VPA’s auto-update mode in production because automatically changing resource limits can trigger pod restarts during traffic.
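
The QoS rules are easy to get wrong, so a small classifier helps when reviewing manifests. This is a simplified Python sketch of how Kubernetes assigns a QoS class from container resources: it assumes containers are plain dicts shaped like the pod spec and compares quantities as strings (so "500m" and "0.5" would be treated as different). `qos_class` is a hypothetical helper, not a client-library function.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


burst = [{"resources": {"requests": {"cpu": "250m", "memory": "256Mi"},
                        "limits": {"cpu": "750m", "memory": "256Mi"}}}]
pinned = [{"resources": {"requests": {"cpu": "500m", "memory": "256Mi"},
                         "limits": {"cpu": "500m", "memory": "256Mi"}}}]
bare = [{"name": "analytics"}]

print(qos_class(burst), qos_class(pinned), qos_class(bare))
```

Note that a CPU limit set at 3x the request lands the pod in Burstable, so the burst headroom and the strongest eviction protection are a trade-off.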
Strong Answer:
Deployments manage stateless services where any pod can handle any request and pods are interchangeable. This covers 90% of microservices: your Order Service, Payment Service, and User Service read state from a database and cache, so the pod itself holds no unique state. Deployments provide rolling updates, rollback, and horizontal scaling. When a pod dies, the replacement gets a new name and new IP, and that is fine because the load balancer routes to all healthy pods equally.

StatefulSets manage stateful workloads where each pod has a stable identity (fixed hostname, persistent volume claim). I use StatefulSets for databases (PostgreSQL, MongoDB), message brokers (Kafka, RabbitMQ), and distributed caches (Redis Sentinel/Cluster) running inside Kubernetes. The key features: pods get ordinal names (kafka-0, kafka-1, kafka-2), they start and stop in order (important for leader election), and each pod gets its own PersistentVolumeClaim that survives pod restarts. When kafka-1 restarts, it gets the same hostname and the same disk, so it can rejoin the cluster with its data intact.

DaemonSets ensure one pod runs on every node (or every matching node). I use DaemonSets for infrastructure agents: log collectors (Fluentd, Fluent Bit), metrics exporters (node-exporter), security agents (Falco), and storage drivers (CSI). Each node needs exactly one instance of these, and when new nodes are added to the cluster, the DaemonSet automatically deploys to them.

The common mistake: running databases as StatefulSets in production Kubernetes. While it works, the operational complexity is significant: backup, restore, failover, and storage management are all harder in Kubernetes than with a managed database (RDS, Cloud SQL). My recommendation: use managed databases for production, StatefulSets for development/staging environments, and only run databases on Kubernetes if you have a dedicated platform team.

Follow-up: “When would you use a Job or CronJob in a microservices architecture?”
Jobs are for one-time or batch tasks: database migrations before a deployment, data backfill scripts, report generation. CronJobs are for scheduled tasks: daily reconciliation (comparing order records with payment records), nightly analytics aggregation, periodic cache warming. The important detail: CronJobs should be idempotent because Kubernetes might create duplicate jobs during edge cases (missed schedule catchup, controller restart). I always design CronJobs to be safe to run twice with the same result.
Strong Answer:
Three layers of isolation: ResourceQuotas, LimitRanges, and NetworkPolicies.

ResourceQuotas limit the total resources a namespace can consume. I would set the payments namespace to 60% of cluster capacity (it is critical), analytics to 30%, and staging to 10%. If a staging deployment tries to create 100 replicas, it hits the quota and the pod creation is rejected. This prevents any single namespace from starving others.

LimitRanges set default and maximum resource limits for pods within a namespace. In the staging namespace, I would set a default CPU limit of 500m and memory of 512Mi per pod, with a maximum of 1 CPU and 1Gi. This prevents a developer from accidentally deploying a pod that requests 32 CPUs in staging.

NetworkPolicies control which pods can communicate across namespaces. By default, Kubernetes allows all pod-to-pod communication. I would create a default-deny ingress policy in the payments namespace, then explicitly allow traffic only from the API gateway namespace and the analytics namespace (for read-only reporting queries). Staging should have zero network access to the payments namespace.

Additionally, I would use PriorityClasses. Pods in the payments namespace get a high priority (1000), analytics gets medium (500), staging gets low (100). When the cluster is under resource pressure, Kubernetes preempts low-priority staging pods to make room for high-priority payment pods. This keeps payments functioning even if the cluster is over-committed.

Follow-up: “How do you handle the case where the payments namespace itself needs to scale beyond its resource quota during a traffic spike?”
This is why I set the quota at 60% rather than 40%. The buffer accounts for spikes. But for truly exceptional events (Black Friday), I use Cluster Autoscaler to add nodes automatically when pending pods cannot be scheduled. The payments namespace’s HorizontalPodAutoscaler creates more pods, the scheduler cannot place them, the Cluster Autoscaler adds nodes, and the new pods get scheduled on the new nodes. The entire scaling chain takes 3-5 minutes on AWS, which is why I also configure the HPA to scale proactively based on custom metrics (incoming request rate) rather than reactively based on CPU utilization.
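
The quota arithmetic can be sketched in a few lines of Python. This is a simplified model of a `requests.cpu` ResourceQuota check, assuming CPU quantities are written either in millicores ("500m") or as decimal cores ("4", "0.5"); the helper names and numbers are illustrative, and the real admission controller tracks many more resources.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '500m' or '1.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def max_replicas_within_quota(hard: str, per_pod: str, used: str = "0") -> int:
    """How many more pods requesting `per_pod` CPU fit under the quota."""
    free = parse_cpu_millicores(hard) - parse_cpu_millicores(used)
    return max(0, free // parse_cpu_millicores(per_pod))


# A staging namespace capped at 4 CPUs, pods requesting 500m each:
print(max_replicas_within_quota("4", "500m"))           # room for 8 pods
print(max_replicas_within_quota("4", "500m", "3500m"))  # room for 1 more
```

With these numbers, a Deployment asking for 100 replicas would get 8 pods admitted and the rest rejected at creation time.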

Scenario-Driven Interview Questions

Strong Answer Framework
  1. Confirm the symptom, do not assume. Run kubectl get pods -l app=order-service and read the status column. CrashLoopBackOff means the container keeps exiting (usually with a nonzero code) and the kubelet is delaying each restart with exponential backoff. Capture the restart count and age, because they tell you how long this has been happening.
  2. Read the logs of the crashed container, not the current one. kubectl logs <pod> --previous retrieves logs from the last terminated instance. Nine times out of ten, the root cause is a one-line error here: missing env var, port conflict, migration failure, bad config parse.
  3. Inspect the pod’s events. kubectl describe pod <pod> surfaces scheduler events, image pull errors, probe failures, and OOMKills. The Last State field shows Terminated: OOMKilled vs Error vs Completed — each implies a different bug class.
  4. Compare against the last green revision. kubectl rollout history deployment/order-service lists revisions. Diff the two manifests: image tag, env vars, config hash, resource limits. The delta between green and red is almost always the cause.
  5. Rule out environmental causes. If logs look normal and previous revision also crashes now, the issue is outside the deployment: a dependency (DB migration not yet run), a Secret that was rotated incorrectly, a network policy blocking egress.
  6. Decide: fix forward or roll back? If the blast radius is small and the fix is obvious (typo in env var), fix forward. If production is bleeding traffic, kubectl rollout undo deployment/order-service first, debug after. Rollback is cheap; prolonged incidents are expensive.
  7. Post-incident: instrument the gap. Whatever check would have caught this pre-deploy — a validating admission webhook, a CI smoke test, a staging canary — add it before the next release.
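
The restart count from step 1 translates into elapsed time through the kubelet's backoff schedule. A small Python sketch, assuming the documented defaults of a 10-second initial delay that doubles per restart and is capped at 5 minutes (the function name is illustrative):

```python
def crashloop_backoff_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Delay in seconds before each restart: 10s, 20s, 40s, ... capped at 300s."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


delays = crashloop_backoff_delays(7)
print(delays)
print(f"at least {sum(delays)}s spent waiting across {len(delays)} restarts")
```

After the fifth restart every further attempt waits the full five minutes, which is why a pod with a high restart count can look idle for long stretches between crashes.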
Real-World Example
Shopify published a 2022 postmortem where a Kubernetes release caused widespread CrashLoopBackOff across their Rails monolith. The root cause was a change to the readiness probe’s path that collided with a rate-limiter middleware returning 429 for unknown routes. The probe failed, pods never became ready, rolling update stalled, and the previous ReplicaSet had already been scaled down by the time the on-call engineer started investigating. Their fix: require probe endpoints to bypass all middleware by convention, and add a synthetic probe test to CI.
Senior Follow-up Questions
Q: What if kubectl logs --previous returns “previous terminated container not found”?
A: The kubelet has garbage-collected the prior container, usually because the crash loop restarted too many times. Use kubectl get events --sort-by=.lastTimestamp to reconstruct history, inspect kubectl describe for Last State, and pull logs from your centralized log store (Loki, ELK) filtered by pod name and time range. Never rely on kubectl as your only log source in production.
Q: The logs show no error, the process just exits 0. Is that still CrashLoopBackOff?
A: Yes. With the default restartPolicy: Always, the kubelet restarts a container that exits with any code, including 0, so a clean exit still ends up in CrashLoopBackOff. It is usually the worst kind to debug because nothing is logged as an error. A zero exit means the process terminated on its own: often a main() function returning early because a required dependency is missing, or Node.js exiting because the event loop is empty (you forgot to call .listen()). Check your startup code path: is there an await you skipped? A server you forgot to start? A promise rejection that went unhandled?
Q: How do you debug a CrashLoopBackOff when the crash happens before logging is initialized?
A: Three techniques: (1) set command: ["/bin/sh", "-c", "sleep 3600"] temporarily in the manifest, exec into the pod, and run the binary manually to see stderr; (2) add PYTHONUNBUFFERED=1 or NODE_OPTIONS=--unhandled-rejections=strict to force early stderr; (3) use an ephemeral debug container with kubectl debug to attach and inspect the filesystem, env, and network namespace without altering the failing pod.
Common Wrong Answers
  • “I would restart the deployment.” Kubernetes is already restarting the pod — that is literally what CrashLoopBackOff means. Restarting the Deployment just re-triggers the same crash. This answer signals the candidate does not understand the control loop.
  • “I would delete the pod and let it recreate.” Same problem as above. The Deployment will recreate it with the same spec and it will crash again.
Further Reading
  • Kubernetes docs: Debug Running Pods — official flow for pod-level debugging including ephemeral containers.
  • Shopify Engineering: Anatomy of a Kubernetes Outage — real postmortem walkthrough.
  • “Kubernetes Patterns” by Bilgin Ibryam and Roland Huss — Chapter on Health Probes and Managed Lifecycle.
Strong Answer Framework
  1. Identify the workload profile. CPU at 30% under load almost always means I/O-bound: the service spends its time awaiting a downstream (database, external API, Kafka). The Node.js event loop is saturated with pending callbacks, but the CPU is idle most of the time.
  2. Confirm with the right metric. Look at nodejs_eventloop_lag_seconds or Python’s asyncio task queue depth, or just count active HTTP requests (http_inflight_requests). If active requests far exceed what the pod can realistically handle, you have queueing that CPU cannot see.
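  3. Diagnose the bottleneck. Is it a downstream service (check its P99), a DB connection pool (check wait time), or a client-side concurrency limit (for example, an HTTP agent configured with a small maxSockets; the Node.js default has been unlimited since v0.12, but libraries and custom agents often set a cap)?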
  4. Choose the right scaling metric. Replace CPU with one of: requests-per-second per pod (linear to load), event loop lag, queue depth, or explicit kafka_consumer_lag. Expose via Prometheus, wire through Prometheus Adapter, reference in HPA.
  5. Validate with a load test. Replay the traffic spike (or use k6 to simulate) and watch HPA replica count move in real time. If replicas scale proactively and P99 stays flat, you have the right metric.
  6. Fix the adjacent issues. HPA alone will not save a service whose single Postgres pool caps at 20 connections. Scale the pod but also raise the DB connection limit, tune pgbouncer, or add a read replica.
Real-World Example
Zalando published an engineering blog in 2020 describing this exact failure mode. Their search service hit P99 latency spikes during morning rush while CPU sat below 40%. The fix was migrating HPA to scale on a custom metric: search_requests_in_flight_per_pod, exposed via Prometheus. Scaling kicked in 90 seconds earlier and P99 returned to baseline without CPU ever crossing 50%. They wrote a library (kube-metrics-adapter) to make custom-metric HPAs easy across their fleet.
Senior Follow-up Questions
Q: What if scaling out does not help because the bottleneck is the database?
A: HPA cannot fix a downstream bottleneck, because more pods just queue more concurrent connections at the DB. You address it at that layer: connection pooling (pgbouncer), read replicas for heavy queries, query optimization, or cache interposition. In the meantime, apply backpressure at the service (return 503 when pool is saturated) so slow requests do not spiral into a thundering herd.
Q: How do you measure whether the new metric is working?
A: Two signals. First, the time from traffic spike start to HPA scale-up event, which should drop from minutes to tens of seconds. Second, P99 latency during the spike, which should plateau near steady-state instead of spiking. Run an A/B by deploying the new HPA to one zone and the old CPU-based HPA to another, then trigger synthetic load.
Q: What stops the HPA from over-scaling and then thrashing?
A: Three guardrails. (1) behavior.scaleDown.stabilizationWindowSeconds, which defaults to 5 minutes and prevents rapid scale-down. (2) behavior.scaleDown.policies to cap the reduction rate (e.g., 10% per minute). (3) A hard maxReplicas ceiling. The combination means HPA scales up fast but scales down slowly, which matches real traffic patterns (spikes are common, drop-offs are gradual).
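
The scale-down stabilization window is simple to model: the controller keeps recent replica recommendations and, when scaling down, acts on the highest one inside the window. The Python sketch below assumes recommendations are stored as (timestamp_seconds, replicas) pairs; the function name and the sample history are illustrative, not controller code.

```python
def stabilized_scale_down(recommendations, now, window_seconds=300):
    """Pick the highest replica recommendation seen within the window."""
    recent = [replicas for ts, replicas in recommendations
              if now - ts <= window_seconds]
    return max(recent, default=None)


# Load drops sharply: raw recommendations fall from 10 to 4 within four minutes.
history = [(0, 10), (120, 6), (240, 4), (360, 4)]

print(stabilized_scale_down(history, now=300))  # 10: the peak is still in the window
print(stabilized_scale_down(history, now=360))  # 6: the peak has aged out
print(stabilized_scale_down(history, now=540))  # 4: only low recommendations remain
```

The replica count therefore follows the drop with a lag of up to one window, so a brief dip in traffic does not remove pods that a returning spike would need.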
Common Wrong Answers
  • “Increase the CPU target to 90%.” This makes the problem worse — the service is already under stress at 30% CPU because CPU is the wrong signal. Raising the target delays scaling further.
  • “Just set minReplicas higher.” Overprovisions always-on capacity, wasting money during off-peak and still not scaling fast enough during peak. Addressing the metric is cheaper and more effective.
Strong Answer Framework
  1. Match the duration to the cause. 30 seconds lines up suspiciously with terminationGracePeriodSeconds (default 30s) or a probe’s combined periodSeconds * failureThreshold. That is your first hypothesis: the old pod is being killed before draining, or the new pod is receiving traffic before it is ready.
  2. Inspect the Service endpoints during rollout. watch kubectl get endpoints <svc> shows which pod IPs are registered. If new pod IPs appear in endpoints while still failing readiness, the readiness probe is misconfigured or the endpoint controller has a race.
  3. Check the pod termination sequence. When a pod is deleted, Kubernetes simultaneously (a) sends SIGTERM, and (b) removes it from Service endpoints. There is no ordering guarantee. In-flight requests from kube-proxy iptables rules can still hit a SIGTERM-ed pod for several seconds.
  4. Add a preStop hook. preStop: { exec: { command: ["sleep", "10"] } } forces the pod to sit in “Terminating” state for 10 seconds before SIGTERM fires, giving kube-proxy time to converge on the new endpoints.
  5. Verify graceful shutdown in the app. On SIGTERM, the app should stop accepting new connections, let in-flight ones drain, then exit. Node.js server.close(), Python uvicorn with --timeout-graceful-shutdown, or a SIGTERM handler that flips readiness to 503.
  6. Confirm PDB + maxSurge/maxUnavailable settings. maxUnavailable: 0, maxSurge: 1 guarantees that at least the original replica count stays available throughout the rollout. maxUnavailable: 25% on 4 replicas briefly drops you to 3, which is usually fine but sometimes not.
  7. Measure after the fix. Deploy again with tracing on. The rolling update should cause zero 502s at the ingress. If 502s persist, your problem is upstream (ingress controller connection draining, cloud LB reconfiguration lag).
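
The capacity bounds in step 6 follow from how percentages are resolved: maxSurge rounds up and maxUnavailable rounds down. A minimal Python sketch under that rule (helper names are illustrative; both settings default to 25%):

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute count or a percentage string against the replica count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {"min_available": replicas - unavailable,
            "max_total": replicas + surge}


print(rollout_bounds(4))                                   # defaults on 4 replicas
print(rollout_bounds(4, max_surge=1, max_unavailable=0))   # never below 4 available
print(rollout_bounds(10))                                  # 25% of 10 rounds to 3 up, 2 down
```

For a service that cannot tolerate any capacity dip, `max_unavailable=0` keeps `min_available` equal to the replica count at the cost of a slower rollout.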
Real-World Example
Airbnb documented in a 2019 engineering blog that every rollout caused brief 502 spikes at the ingress, and they traced it to the race between SIGTERM and endpoint removal. Their fix was a universal preStop sleep of 10 seconds plus readiness-flip-on-SIGTERM in every service template. Post-fix: rolling updates across 500+ services with zero user-visible errors. They open-sourced their service scaffolding and the pattern is now industry-standard.
Senior Follow-up Questions
Q: Why a preStop sleep specifically and not a proper readiness toggle?
A: A readiness toggle inside the app depends on the kubelet re-polling the probe and the endpoints controller propagating the change, which takes at least periodSeconds plus a few more seconds. The preStop sleep is a brute-force way to guarantee time passes before SIGTERM, without relying on probe timing. Best practice is both: preStop sleep AND readiness flip, belt and braces.
Q: How do you handle long-running requests that exceed terminationGracePeriodSeconds?
A: Raise terminationGracePeriodSeconds to exceed the longest expected request plus buffer (e.g., 90s for a service with up to 60s requests). If you have requests longer than a minute, that is a design smell: move to async processing (202 Accepted + polling) so the HTTP layer can terminate quickly.
Q: What if the 502s come from the ingress controller itself, not from pod termination?
A: NGINX ingress can drop connections mid-flight when it reloads its configuration and the old worker processes shut down before their connections finish. Set worker-shutdown-timeout (for example, 240s) on the controller so old workers have time to drain, and use strategy: RollingUpdate with maxUnavailable: 0 on the ingress controller Deployment. For AWS ALB Ingress, set the target group attribute deregistration_delay.timeout_seconds to match your grace period.
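
One detail worth budgeting for: the terminationGracePeriodSeconds countdown covers the preStop hook as well as the time the app spends draining after SIGTERM. A small Python sketch of that arithmetic, with illustrative helper names and a 20-second buffer chosen to reproduce the 90s example above:

```python
def min_termination_grace_period(pre_stop_sleep_s: int,
                                 longest_request_s: int,
                                 buffer_s: int = 20) -> int:
    """Smallest grace period that fits the preStop sleep, the drain, and a buffer."""
    return pre_stop_sleep_s + longest_request_s + buffer_s


def grace_period_is_enough(grace_s: int, pre_stop_sleep_s: int,
                           longest_request_s: int) -> bool:
    """True when the drain can finish before the kubelet sends SIGKILL."""
    return grace_s >= pre_stop_sleep_s + longest_request_s


# 10s preStop sleep, requests of up to 60s:
print(min_termination_grace_period(10, 60))
# The default 30s grace period is too short for that service:
print(grace_period_is_enough(30, 10, 60))
```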
Common Wrong Answers
  • “Just increase the readiness probe delay.” Does not address the termination-side race; you will still get 502s on the pod being deleted.
  • “Roll back and retry.” Does not fix the underlying bug; next rollout has the same problem. Only use rollback to buy time while you diagnose.
Further Reading
  • Kubernetes docs: Pod Lifecycle and preStop Hooks.
  • Airbnb Engineering: “Graceful Pod Shutdown in Kubernetes” — the canonical writeup of the termination race.
  • “Kubernetes in Action” by Marko Lukša, Chapter 17 — lifecycle hooks and termination flow.