Overview

DevOps bridges development and operations, enabling faster, more reliable software delivery through automation and best practices. Before DevOps, developers would “throw code over the wall” to an ops team, who would spend days manually deploying it, often discovering it did not work in production. DevOps removes this wall by making the team that builds the software also responsible for running it; Amazon calls this “you build it, you run it.” The result: high-performing DevOps organizations deploy 200x more frequently and recover from failures 24x faster than low performers (per the DORA State of DevOps reports).

CI/CD Pipeline

┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
│  Code  │──►│ Build  │──►│  Test  │──►│ Deploy │──►│Monitor │
│ Commit │   │        │   │        │   │        │   │        │
└────────┘   └────────┘   └────────┘   └────────┘   └────────┘
                CI                           CD

Example GitHub Actions Workflow

A well-designed pipeline catches bugs before they reach production. The principle is simple: every code change must pass through an automated gauntlet of builds, tests, and checks before it can deploy. If any step fails, the pipeline stops and alerts the developer immediately — not the ops team at 2 AM.
name: CI/CD Pipeline

on:
  push:
    branches: [main]       # Trigger on merges to main
  pull_request:
    branches: [main]       # Trigger on PRs targeting main (catches issues before merge)

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3      # Pull the code
      
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'           # Pin to specific version -- avoid "works on my machine"
          
      - name: Install dependencies
        run: npm ci                    # Use 'ci' not 'install' -- deterministic, lockfile-based
        
      - name: Run tests
        run: npm test                  # Gate: if tests fail, pipeline stops here
        
      - name: Build
        run: npm run build             # Gate: catches TypeScript/build errors
  
  deploy:
    needs: build-and-test              # Only runs if build-and-test succeeds
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'  # Only deploy from main, not from PR branches
    steps:
      - name: Deploy to production
        run: |
          # In practice, this calls a deploy script, pushes a Docker image,
          # or triggers a GitOps sync (ArgoCD, Flux). Never run raw shell
          # commands against production servers.
          echo "Deploying to production..."
Practical tip: Keep your CI pipeline under 10 minutes. Slow pipelines cause developers to batch changes into larger, riskier deploys. If tests take too long, parallelize them, use test splitting, or run slow integration tests on a separate scheduled pipeline.
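The test splitting mentioned in the tip above can be sketched in a few lines. This is a minimal illustration, not a feature of any particular CI provider: the function name split_tests and the file paths are hypothetical, and it assumes each CI job is told its own shard index and the total number of shards.

```python
def split_tests(test_files: list, shard_index: int, total_shards: int) -> list:
    """Return the test files that one CI job (shard) should run.

    Sorting first makes the assignment deterministic, so every job
    computes the same partition without coordinating with the others.
    """
    if not 0 <= shard_index < total_shards:
        raise ValueError("shard_index must be in [0, total_shards)")
    return sorted(test_files)[shard_index::total_shards]


# Hypothetical test files, for illustration only
all_tests = [
    "tests/test_orders.py",
    "tests/test_auth.py",
    "tests/test_users.py",
    "tests/test_billing.py",
    "tests/test_search.py",
]

for shard in range(2):
    print(shard, split_tests(all_tests, shard, total_shards=2))
```

Round-robin assignment balances the number of files per shard, not their runtime. Splitters that weigh files by historical duration usually balance better, at the cost of storing timing data.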

Containers (Docker)

Docker Concepts

┌─────────────────────────────────────────────────┐
│                    Host OS                       │
│  ┌─────────────────────────────────────────────┐│
│  │              Docker Engine                  ││
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ││
│  │  │ Container │ │ Container │ │ Container │ ││
│  │  │  ┌─────┐  │ │  ┌─────┐  │ │  ┌─────┐  │ ││
│  │  │  │ App │  │ │  │ App │  │ │  │ App │  │ ││
│  │  │  └─────┘  │ │  └─────┘  │ │  └─────┘  │ ││
│  │  │   Libs    │ │   Libs    │ │   Libs    │ ││
│  │  └───────────┘ └───────────┘ └───────────┘ ││
│  └─────────────────────────────────────────────┘│
└─────────────────────────────────────────────────┘

Dockerfile Best Practices

A Dockerfile is a recipe for building a reproducible, portable application image. The order of instructions matters enormously for build speed because Docker caches each layer. Put things that change rarely (OS, dependencies) at the top and things that change often (your code) at the bottom.
# 1. Use specific version, not 'latest' -- 'latest' is a moving target
#    that breaks builds unpredictably. Pin the exact version.
#    Use -alpine for smaller images (~5x smaller than debian-based).
FROM node:18-alpine

# 2. Set working directory -- all subsequent commands run from here
WORKDIR /app

# 3. Copy ONLY dependency files first -- this is the cache optimization trick.
#    If package.json has not changed, Docker reuses the cached layer and
#    skips the expensive npm install step entirely. This alone can cut
#    build times from 5 minutes to 30 seconds on subsequent builds.
COPY package*.json ./

# 4. Install production dependencies only (skip devDependencies).
#    '--omit=dev' replaces the deprecated '--only=production' flag in npm 8+.
RUN npm ci --omit=dev

# 5. NOW copy application code -- this layer changes on every build,
#    but all layers above it are cached if dependencies did not change.
COPY . .

# 6. Security: never run as root inside containers. If the app is
#    compromised, the attacker only has limited 'node' user permissions.
USER node

# 7. Document the port (does not actually publish it -- that's done at runtime)
EXPOSE 3000

# 8. Health check -- lets Docker mark the container healthy or unhealthy.
#    Kubernetes ignores HEALTHCHECK and relies on its own liveness/readiness probes.
#    Alpine images do not ship curl, so use BusyBox wget instead.
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -q --spider http://localhost:3000/health || exit 1

# 9. Use exec form (JSON array), not shell form -- ensures proper signal handling
#    so the container shuts down gracefully on SIGTERM.
CMD ["node", "server.js"]

Docker Compose

Docker Compose lets you define multi-container applications in a single file. It is indispensable for local development — one command (docker compose up) gives every developer an identical environment with all dependencies running locally. No more “you need to install PostgreSQL 14, Redis, and configure these 8 environment variables.”
version: '3.8'

services:
  web:
    build: .                                  # Build from local Dockerfile
    ports:
      - "3000:3000"                           # host:container port mapping
    environment:
      - DATABASE_URL=postgres://postgres:secret@db:5432/app   # 'db' is the service name (Docker DNS)
    depends_on:
      - db                                    # Start db before web (order, not readiness)
      - redis
    
  db:
    image: postgres:14                        # Use official image, pin version
    volumes:
      - postgres_data:/var/lib/postgresql/data  # Persist data across container restarts
    environment:
      - POSTGRES_PASSWORD=secret              # For local dev only -- use secrets in prod
      - POSTGRES_DB=app                       # Create the 'app' database that DATABASE_URL points to
      
  redis:
    image: redis:alpine                       # Alpine variant for smaller image

volumes:
  postgres_data:                              # Named volume survives docker compose down
Practical tip: depends_on only guarantees start order, not that the service is actually ready to accept connections. Use health checks or a wait-for-it script to ensure PostgreSQL is fully initialized before your app tries to connect.
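A wait-for-it style check, as suggested in the tip above, can be sketched as follows. This is a minimal illustration under assumptions: the function name wait_for_port is hypothetical, and a successful TCP connection is only a proxy for readiness. A real database check (for example running pg_isready or a trivial query) is a stronger signal.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

In the Compose setup above, the web container's entrypoint could call wait_for_port("db", 5432) and exit with an error if it returns False, so the failure is visible instead of surfacing later as a connection error.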

Kubernetes (K8s)

Core Concepts

┌────────────────────────────────────────────────────────┐
│                      Cluster                           │
│  ┌──────────────────────────────────────────────────┐  │
│  │                 Control Plane                    │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │  │
│  │  │ API      │ │ Scheduler│ │ Controller       │ │  │
│  │  │ Server   │ │          │ │ Manager          │ │  │
│  │  └──────────┘ └──────────┘ └──────────────────┘ │  │
│  └──────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────┐  │
│  │                    Nodes                         │  │
│  │  ┌─────────────────┐  ┌─────────────────┐       │  │
│  │  │     Node 1      │  │     Node 2      │       │  │
│  │  │ ┌─────┐ ┌─────┐ │  │ ┌─────┐ ┌─────┐ │       │  │
│  │  │ │ Pod │ │ Pod │ │  │ │ Pod │ │ Pod │ │       │  │
│  │  │ └─────┘ └─────┘ │  │ └─────┘ └─────┘ │       │  │
│  │  └─────────────────┘  └─────────────────┘       │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘

K8s Resources

Kubernetes manifests declare your desired state (“I want 3 replicas of my app”), and K8s continuously works to make reality match that declaration. If a pod crashes, K8s automatically creates a new one. If a node dies, K8s reschedules all its pods onto surviving nodes. This declarative model is what makes K8s powerful — you describe WHAT you want, not HOW to achieve it.
# Deployment -- manages a set of identical pods (your application instances)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3                     # K8s works to keep 3 pods running (recreates any that fail)
  selector:
    matchLabels:
      app: web                    # "This deployment manages pods labeled app=web"
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myapp:1.0.0        # Always use specific tags, never 'latest'
        ports:
        - containerPort: 3000
        resources:
          requests:               # Minimum resources guaranteed to the pod
            memory: "128Mi"       # Used by scheduler to find a node with enough capacity
            cpu: "100m"           # 100 millicores = 0.1 CPU cores
          limits:                 # Maximum resources allowed -- enforcement differs per resource
            memory: "256Mi"       # OOMKilled if the container exceeds this
            cpu: "200m"           # Throttled (not killed) if it exceeds this
---
# Service -- provides stable networking for the ephemeral pods above.
# Pods come and go (new IPs each time), but the Service gives them a
# permanent address that other services can rely on.
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web                      # Routes traffic to all pods with label app=web
  ports:
  - port: 80                     # Port the Service exposes
    targetPort: 3000              # Port the container listens on
  type: LoadBalancer              # Provisions a cloud load balancer (e.g., AWS NLB/Classic ELB, GCP LB)
Practical tip: Always set resource requests AND limits. Without requests, the scheduler cannot make intelligent placement decisions. Without limits, a memory leak in one pod can starve other pods on the same node.
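The declarative model described at the start of this section (state what you want, let the system converge on it) can be pictured as a reconciliation step. The sketch below is a toy model, not Kubernetes code: the function reconcile and the pod names are hypothetical, and real controllers also handle health, scheduling, and rollout ordering.

```python
def reconcile(desired_replicas: int, running_pods: list) -> list:
    """Return the actions needed to move actual state toward desired state."""
    actions = []
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        # Too few pods (for example, one crashed): create the missing ones
        actions += ["create pod"] * diff
    elif diff < 0:
        # Too many pods (for example, after scaling down): delete the surplus
        actions += [f"delete {name}" for name in running_pods[desired_replicas:]]
    return actions


print(reconcile(3, ["web-1", "web-2"]))           # one pod is missing
print(reconcile(3, ["web-1", "web-2", "web-3"]))  # nothing to do
print(reconcile(2, ["web-1", "web-2", "web-3"]))  # one pod too many
```

A controller runs a step like this in a loop, which is why a crashed pod or a lost node is corrected without anyone issuing a command.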

Cloud Fundamentals

Major Cloud Providers Comparison

| Service Type | AWS | Azure | GCP |
|---|---|---|---|
| Compute | EC2 | Virtual Machines | Compute Engine |
| Containers | ECS/EKS | AKS | GKE |
| Serverless | Lambda | Functions | Cloud Functions |
| Storage | S3 | Blob Storage | Cloud Storage |
| Database | RDS/DynamoDB | SQL/CosmosDB | Cloud SQL/Firestore |
| CDN | CloudFront | CDN | Cloud CDN |

Infrastructure as Code (Terraform)

Think of IaC like a recipe versus cooking by feel. Without IaC, setting up a server is like a chef who eyeballs ingredients and adjusts on the fly — the result is never exactly the same twice, and nobody else can reproduce it. With IaC, your infrastructure is a precise recipe: version-controlled, reviewable, repeatable. If your production environment burns down, you run terraform apply and it is rebuilt identically in minutes, not days.
# main.tf -- This file IS your infrastructure. Delete the server in the console?
# Run 'terraform apply' and it comes back exactly as specified. No guessing,
# no "what was that setting again?" moments at 3 AM.
provider "aws" {
  region = "us-east-1"
}

# Compute -- a single web server instance.
# In practice, you would use an Auto Scaling Group behind an ALB,
# but this illustrates the core concept: infrastructure as a text file.
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"  # Example AMI ID -- AMI IDs are region-specific, so look up and pin one for your region
  instance_type = "t2.micro"                # ~$8/month -- right-size for dev/staging
  
  tags = {
    Name        = "web-server"
    Environment = "production"              # Always tag for cost tracking and filtering
    ManagedBy   = "terraform"               # So humans know not to edit this in the console
  }
}

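# Storage -- versioned S3 bucket for static assets.
# Versioning lets you recover from accidental deletes or overwrites.
resource "aws_s3_bucket" "static" {
  bucket = "my-static-assets"
}

# AWS provider v4+ manages versioning as a separate resource; the inline
# 'versioning' block on aws_s3_bucket is deprecated.
resource "aws_s3_bucket_versioning" "static" {
  bucket = aws_s3_bucket.static.id

  versioning_configuration {
    status = "Enabled"  # Every overwrite creates a new version -- undo is trivial
  }
}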

# Output values -- useful for piping into other tools or scripts.
# After 'terraform apply', these print to stdout and are stored in state.
output "web_server_ip" {
  value = aws_instance.web.public_ip
}
Practical tip: Run terraform plan before every terraform apply. The plan shows you exactly what Terraform intends to create, modify, or destroy. Treat it like a code diff and review it carefully, especially in production. A common horror story: engineers accidentally destroy a production database because they did not read the plan, which clearly showed that aws_db_instance.main would be destroyed. Many mature teams require plan output to be attached to the pull request before any infrastructure change is approved.
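Plan review can be partly automated. The sketch below assumes the plan was exported as JSON (for example with terraform show -json on a saved plan file) and uses a trimmed, hand-written sample rather than real output. The function name destructive_changes and the sample addresses are hypothetical.

```python
def destructive_changes(plan: dict) -> list:
    """List resource addresses whose planned actions include a delete.

    A replacement appears as ["delete", "create"] (or the reverse),
    so it is flagged as well.
    """
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change.get("change", {}).get("actions", [])
    ]


# Trimmed, hand-written sample in the shape of Terraform's JSON plan output
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete"]}},
        {"address": "aws_security_group.web", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.static", "change": {"actions": ["no-op"]}},
    ]
}

flagged = destructive_changes(sample_plan)
print(flagged)
```

A CI job could fail, or require an extra approval, whenever the returned list is non-empty. This complements human review of the plan; it does not replace it.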

Monitoring & Observability

Metrics

  • CPU, Memory, Disk
  • Request rate, Latency
  • Error rate
  • Tools: Prometheus, Grafana

Logs

  • Application logs
  • Access logs
  • Error logs
  • Tools: ELK Stack, Loki

Traces

  • Request flow
  • Service dependencies
  • Bottleneck detection
  • Tools: Jaeger, Zipkin

The Four Golden Signals

Defined by Google’s SRE book, these four signals answer the question “Is my service healthy right now?” Monitor these for every service, and you will catch most production issues before users notice them.
| Signal | What to Measure | Alert When | Why It Matters |
|---|---|---|---|
| Latency | Request duration (p50, p95, p99) | p99 > SLA threshold | Users perceive slowness as broken. A p99 spike means 1 in 100 requests is having a bad time. |
| Traffic | Requests per second | Unusual spikes or drops | A sudden drop often means something upstream broke. A spike could mean you are about to hit capacity. |
| Errors | Error rate percentage | > 1% of requests failing | Error rates creep up before full outages. Catching a 2% error rate can prevent a 100% outage. |
| Saturation | Resource utilization | CPU/Memory > 80% | At 80%+ utilization, performance degrades non-linearly as queues build up, so 90% CPU feels far worse than 80%. |
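To make the alert conditions in the table concrete, here is a small sketch that computes latency, traffic, and error signals from a window of raw request samples. It is illustrative only: the function names and sample data are hypothetical, percentiles use the simple nearest-rank method, and saturation is left out because it comes from host metrics rather than request samples. In practice these values would come from your monitoring system's queries.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def golden_signals(durations_ms, status_codes, window_seconds,
                   p99_threshold_ms, max_error_rate=0.01):
    """Summarize one observation window and evaluate two alert conditions."""
    p99 = percentile(durations_ms, 99)
    error_rate = sum(1 for code in status_codes if code >= 500) / len(status_codes)
    return {
        "p99_ms": p99,
        "traffic_rps": len(status_codes) / window_seconds,
        "error_rate": error_rate,
        "latency_alert": p99 > p99_threshold_ms,
        "error_alert": error_rate > max_error_rate,
    }


# Hypothetical 10-second window: 100 requests, 2 of them server errors
durations = list(range(1, 101))       # 1 ms .. 100 ms
statuses = [200] * 98 + [500] * 2
report = golden_signals(durations, statuses, window_seconds=10, p99_threshold_ms=200)
print(report)
```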

Prometheus Metrics Example

Prometheus uses a pull-based model: it scrapes metrics from your application’s /metrics endpoint at regular intervals. The example below uses three of Prometheus’s four core metric types (the fourth, Summary, is omitted), each suited to different measurement needs.
from flask import Flask  # Assumes a Flask app; adapt to your framework
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)

# Counter -- cumulative, only increases. Resets to 0 on restart.
# Use for: total requests, errors, bytes processed.
# Prometheus calculates rate of change: rate(http_requests_total[5m])
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']  # Labels for dimensional filtering
)

# Histogram -- tracks distribution of values in configurable buckets.
# Use for: request durations, response sizes. Enables percentile queries.
# The buckets define the resolution: more buckets = more precision = more storage.
request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint'],
    buckets=[.01, .05, .1, .5, 1, 5]  # Choose buckets around your SLO thresholds
)

# Gauge -- value that can go up and down. Current state, not cumulative.
# Use for: active connections, queue depth, temperature, memory usage.
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage -- instrument your endpoints to automatically emit metrics.
# (fetch_users() stands in for your own data-access function.)
@app.route('/api/users')
def get_users():
    with request_duration.labels(endpoint='/api/users').time():  # Auto-measure duration
        result = fetch_users()
    http_requests_total.labels(
        method='GET', endpoint='/api/users', status='200'
    ).inc()  # Increment the counter
    return result
Practical tip: Resist the temptation to add labels for high-cardinality dimensions (like user_id or request_id). Each unique label combination creates a new time series, so 1 million users means at least 1 million time series per metric, enough to exhaust the memory of a Prometheus server.
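The arithmetic behind the cardinality warning is a simple product. The sketch below estimates the worst-case number of time series for one metric. The label value counts are made-up examples, and the real number is usually lower because not every combination occurs.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical label value counts for http_requests_total
bounded = {"method": 5, "endpoint": 20, "status": 6}
unbounded = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))    # 5 * 20 * 6
print(max_series(unbounded))  # the same, multiplied by one million users
```

Per-user detail belongs in logs or traces, where each event is stored once instead of becoming a permanent series.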

Structured Logging

Unstructured logs are write-only — easy to produce, nearly impossible to search. When you have 50 services each emitting thousands of log lines per second, “grepping for a string” does not scale. Structured logs (JSON format) let you query across millions of log entries: “show me all orders over $500 that took more than 1 second in the last hour.”
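import structlog

# Configure structlog to emit JSON with ISO-8601 timestamps.
# (The default renderer prints human-readable key=value lines instead.)
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()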

# ❌ Bad: Unstructured logs -- looks fine in development, useless at scale.
# How do you query "all orders from user 123 that took over 1s"? You cannot.
print(f"User {user_id} created order {order_id}")

# ✅ Good: Structured JSON logs -- every field is queryable and filterable.
# In Elasticsearch/Loki/Splunk, you can write: user_id="123" AND duration_ms>1000
logger.info(
    "order_created",
    user_id=user_id,
    order_id=order_id,
    total=order.total,
    items_count=len(order.items),
    duration_ms=duration
)

# Output (JSON) -- machines love this, log aggregators index every field:
# {"event": "order_created", "user_id": "123", "order_id": "456", 
#  "total": 99.99, "items_count": 3, "duration_ms": 45, 
#  "timestamp": "2024-01-15T10:30:00Z"}
Practical tip: Always include a correlation/request ID in every log line so you can trace a single user request across all services. Without it, debugging distributed systems is like finding a specific conversation in a stadium full of people all talking at once.
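One way to attach a correlation ID to every log line is a context variable that is set once per request. The sketch below uses only the standard library so it runs on its own; the names request_id_var, log_event, and handle_request are hypothetical. structlog ships comparable helpers in its structlog.contextvars module, which would be the natural fit for the logger shown above.

```python
import contextvars
import json
import uuid
from typing import Optional

# Holds the ID of the request currently being handled (None outside a request)
request_id_var = contextvars.ContextVar("request_id", default=None)


def log_event(event: str, **fields) -> str:
    """Emit one JSON log line that always carries the current request ID."""
    record = {"event": event, "request_id": request_id_var.get(), **fields}
    line = json.dumps(record)
    print(line)
    return line


def handle_request(incoming_request_id: Optional[str] = None) -> list:
    """Hypothetical handler: reuse the caller's ID if present, else mint one."""
    token = request_id_var.set(incoming_request_id or str(uuid.uuid4()))
    try:
        return [
            log_event("order_received", user_id="123"),
            log_event("order_created", user_id="123", order_id="456"),
        ]
    finally:
        request_id_var.reset(token)  # Do not leak the ID into the next request


lines = handle_request("req-abc")
```

Reusing the incoming ID (typically passed in a request header) rather than always generating a new one is what lets a single ID follow a request across services.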

GitOps

GitOps stores infrastructure and application configs in Git as the single source of truth. The core idea: if you want to know what is running in production, look at the Git repo, not at the cluster, a dashboard, or someone’s local Terraform state. Every change goes through a pull request, gets reviewed, and creates an audit trail. Rollback is just git revert. This eliminates the “someone SSH’d into production and made a change that nobody documented” failure mode.
┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│  Developer  │────►│    Git      │────►│    GitOps Agent     │
│  Push       │     │  Repository │     │ (ArgoCD/Flux)       │
└─────────────┘     └─────────────┘     └──────────┬──────────┘
                                                   │
                                                   ▼
                                        ┌─────────────────────┐
                                        │    Kubernetes       │
                                        │    Cluster          │
                                        └─────────────────────┘

ArgoCD Application

ArgoCD watches your Git repository and continuously reconciles the cluster state with what is declared in Git. If someone manually changes a resource in the cluster (kubectl edit, console click), ArgoCD detects the drift and either alerts you or automatically reverts it — depending on your sync policy. Think of it as a thermostat for your infrastructure: you set the desired temperature (Git state), and it continuously adjusts the actual temperature (cluster state) to match.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd                            # ArgoCD lives in its own namespace
spec:
  project: default
  source:
    repoURL: https://github.com/org/repo.git  # Git repo = single source of truth
    targetRevision: HEAD                        # Track the latest commit on default branch
    path: k8s/overlays/production               # Kustomize overlay for production config
  destination:
    server: https://kubernetes.default.svc      # The cluster ArgoCD is managing
    namespace: production
  syncPolicy:
    automated:
      prune: true       # Delete resources from cluster if removed from Git
      selfHeal: true    # Revert manual cluster changes to match Git (the thermostat)
    syncOptions:
      - CreateNamespace=true   # Create the namespace if it does not exist
Practical tip: Start with selfHeal: false while your team adapts to GitOps. Engineers accustomed to kubectl apply will be frustrated when ArgoCD reverts their manual changes. Once everyone commits to the “change it in Git, not in the cluster” workflow, enable self-heal, which is a common production setting among teams running ArgoCD at scale.

Deployment Strategies

Blue-Green Deployment

Run two identical production environments. One serves live traffic (“blue”), the other has the new version staged and tested (“green”). When ready, flip the load balancer to green. If anything goes wrong, flip back to blue in seconds — zero downtime, near-instant rollback. The trade-off: you need 2x the infrastructure (and budget) during the deployment window.
Before: Blue is live, Green has the new version staged and tested
┌─────────────────────────────────────────────────┐
│  Load Balancer  ───────────►  Blue (v1) ✓      │
│                               Green (v2)        │
└─────────────────────────────────────────────────┘

After: One DNS/LB change flips all traffic to Green instantly
┌─────────────────────────────────────────────────┐
│  Load Balancer                Blue (v1)        │
│                 ───────────►  Green (v2) ✓     │
└─────────────────────────────────────────────────┘
Rollback: Flip back to Blue in seconds if Green has issues

Canary Deployment

Named after the canary in a coal mine — send a small percentage of traffic to the new version first. If error rates or latency spike, roll back before most users are affected. This is how Google, Netflix, and Facebook deploy changes to billions of users: never all at once, always gradually. The key is automated analysis — your monitoring system should automatically compare canary metrics against baseline and halt the rollout if something looks wrong.
┌──────────────────────────────────────────────────┐
│                  Load Balancer                   │
│                       │                          │
│          ┌────────────┴────────────┐            │
│          │                         │            │
│          ▼ (90%)                   ▼ (10%)      │
│    ┌──────────┐              ┌──────────┐      │
│    │  v1.0    │              │  v1.1    │      │
│    │ (stable) │              │ (canary) │      │
│    └──────────┘              └──────────┘      │
└──────────────────────────────────────────────────┘

Gradually increase canary traffic: 10% → 25% → 50% → 100%
At each step: compare error rates, latency, and business metrics.
If canary looks healthy, promote. If not, roll back automatically.
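The automated analysis step can be sketched as a comparison of canary and baseline error rates. This is a simplified illustration, not how any particular rollout tool works: the function name, the 2x ratio, the minimum sample size, and the error-rate floor are all assumptions, and real canary analysis usually also compares latency and business metrics.

```python
def canary_decision(baseline_errors: int, baseline_total: int,
                    canary_errors: int, canary_total: int,
                    max_ratio: float = 2.0, min_requests: int = 100) -> str:
    """Return 'wait', 'promote', or 'rollback' for one canary step."""
    if canary_total < min_requests:
        return "wait"  # Too few samples to judge the canary fairly
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    # Floor of 0.1% so a zero-error baseline does not make a single
    # canary error trigger a rollback
    allowed_rate = max(baseline_rate * max_ratio, 0.001)
    return "rollback" if canary_rate > allowed_rate else "promote"


# Baseline: 0.5% errors. The canary may reach 1% before being rolled back.
print(canary_decision(50, 10_000, 4, 1_000))   # 0.4% canary error rate
print(canary_decision(50, 10_000, 30, 1_000))  # 3.0% canary error rate
print(canary_decision(50, 10_000, 1, 20))      # only 20 canary requests so far
```

A rollout controller would call a check like this at each traffic step (10%, 25%, 50%) and only widen the canary on "promote".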

Rolling Update

The simplest zero-downtime deployment strategy: gradually replace old pods with new ones. K8s terminates one old pod, starts one new pod, waits for it to pass health checks, then repeats. At no point are all pods down simultaneously. This is the default K8s deployment strategy and works well for most stateless applications. The trade-off: during the rollout, both old and new versions serve traffic simultaneously — your application must handle this (backward-compatible APIs, database migrations that work with both versions).
# Kubernetes rolling update strategy -- fine-tune the rollout speed
# and safety margin for your specific application.
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # At most 5 pods during update (4 desired + 1 extra)
      maxUnavailable: 1  # At least 3 pods always serving traffic
  # Readiness probe is CRITICAL for rolling updates -- without it, K8s
  # considers a pod "ready" the instant its container starts, even if
  # the app needs 30 seconds to warm up. Result: traffic hits an
  # uninitialized app and users see errors during deployment.
  template:
    spec:
      containers:
      - name: web
        readinessProbe:
          httpGet:
            path: /health        # Your app must return 200 when ready
            port: 3000
          initialDelaySeconds: 5  # Wait 5s before first check
          periodSeconds: 10       # Check every 10s
Practical tip: Set maxUnavailable: 0 and maxSurge: 1 for the safest (but slowest) rollout — capacity never drops below desired. For faster rollouts that tolerate brief capacity reduction, increase maxUnavailable. Always combine with readiness probes; without them, rolling updates are just “rolling restarts with traffic loss.”
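The capacity bounds implied by maxSurge and maxUnavailable can be worked out with a few lines of arithmetic. The sketch below is an illustration with hypothetical function names. It follows the rounding Kubernetes documents for percentage values: maxSurge rounds up and maxUnavailable rounds down.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        pods = int(value[:-1]) * replicas / 100
        return math.ceil(pods) if round_up else math.floor(pods)
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> dict:
    """Most pods that may exist, and fewest that stay available, during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


print(rollout_bounds(4, 1, 1))           # the manifest above
print(rollout_bounds(4, 1, 0))           # the "safest" setting from the tip
print(rollout_bounds(10, "25%", "25%"))  # the Kubernetes defaults on 10 replicas
```

The max_total_pods figure is also what the cluster needs spare capacity for: with maxSurge: 1, a rollout stalls if no node can fit one extra pod.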

Secrets Management

HashiCorp Vault

import os

import hvac

# Initialize client
client = hvac.Client(url='https://vault.example.com')
client.token = os.environ['VAULT_TOKEN']

# Read secret
secret = client.secrets.kv.v2.read_secret_version(
    path='myapp/database'
)
db_password = secret['data']['data']['password']

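# Dynamic secrets (short-lived, generated on demand)
creds = client.secrets.database.generate_credentials(
    name='my-role'
)
db_username = creds['data']['username']
db_dynamic_password = creds['data']['password']  # Never print or log secret values
print(f"Issued credentials for {db_username}, valid for {creds['lease_duration']}s")
# Vault revokes the credentials when the lease expires; request new ones as needed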

Kubernetes Secrets with External Secrets Operator

Native Kubernetes Secrets are base64-encoded (not encrypted) and stored in etcd. Anyone who can read Secrets through the API, or read etcd directly when encryption at rest is not enabled, can decode them trivially. The External Secrets Operator addresses this by syncing secrets from a secure external source (Vault, AWS Secrets Manager, GCP Secret Manager) into Kubernetes Secrets automatically. Your application code reads a normal K8s Secret, but the source of truth for the secret value lives in a hardened, access-controlled vault.
# This Custom Resource tells the External Secrets Operator:
# "Go fetch the password from Vault, and create a K8s Secret called
# 'database-secret' with it. Refresh every hour in case it rotated."
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h                  # Re-sync from Vault every hour (catches rotations)
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend                # Points to your Vault instance configuration
  target:
    name: database-secret              # The K8s Secret name your pods will reference
  data:
  - secretKey: password                # Key inside the K8s Secret
    remoteRef:
      key: secret/data/production/database   # Path in Vault
      property: password                      # Field within the Vault secret
Practical tip: Never put secrets directly in Kubernetes manifests or Helm values files — they end up in Git history, which is effectively permanent. Even if you delete the file, git log preserves it forever. Use the External Secrets Operator or sealed-secrets (Bitnami) so that only encrypted references live in your repo.

Site Reliability Engineering (SRE) Concepts

Service Level Objectives (SLOs)

SLOs are the foundation of SRE practice — they answer “how reliable does this service need to be?” and “when should we stop shipping features to fix reliability?” The error budget is the key insight: it turns reliability from an abstract goal into a concrete, spendable resource. If your error budget is healthy, ship features fast. If it is nearly exhausted, freeze deployments and fix reliability.
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | The metric you actually measure | Request latency p99, successful request ratio |
| SLO (Objective) | The target threshold for your SLI | p99 latency < 200ms, 99.9% of requests succeed |
| SLA (Agreement) | A contract with financial consequences if the SLO is missed | 99.9% uptime or customer receives service credits |
| Error Budget | How much unreliability you can “spend” before breaching SLO | 43.8 min/month for 99.9%; once spent, freeze all risky changes |
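# Error budget calculation over a 30-day window.
# (An average calendar month of 365/12 days gives ~43.8 minutes at 99.9%,
# which is the figure used in the tables.)
monthly_minutes = 30 * 24 * 60  # 43,200 minutes

slo_99_9 = 0.999
error_budget_99_9 = monthly_minutes * (1 - slo_99_9)  # ~43.2 minutes

slo_99_99 = 0.9999
error_budget_99_99 = monthly_minutes * (1 - slo_99_99)  # ~4.32 minutes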
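The "spend it or freeze" policy described above can be expressed as a small check on how much of the budget has been consumed. This is a sketch under assumptions: the function name and the 90% freeze threshold are hypothetical, and teams choose their own thresholds and windows.

```python
def error_budget_status(slo: float, window_minutes: float, bad_minutes: float,
                        freeze_threshold: float = 0.9) -> dict:
    """Report how much error budget is spent and whether to freeze risky deploys."""
    budget = window_minutes * (1 - slo)
    consumed = bad_minutes / budget
    return {
        "budget_minutes": budget,
        "remaining_minutes": budget - bad_minutes,
        "consumed_fraction": consumed,
        "freeze_deploys": consumed >= freeze_threshold,
    }


window = 30 * 24 * 60  # 30-day window in minutes

healthy = error_budget_status(slo=0.999, window_minutes=window, bad_minutes=10)
exhausted = error_budget_status(slo=0.999, window_minutes=window, bad_minutes=40)
print(healthy["freeze_deploys"], exhausted["freeze_deploys"])
```

With a 99.9% SLO, 10 bad minutes leaves most of the budget for shipping features, while 40 bad minutes crosses the threshold and signals a freeze.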

Availability Table

| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5.04 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds | 6.05 seconds |

Infrastructure as Code Best Practices

Terraform Module Structure

modules/
├── vpc/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── README.md
├── eks/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

environments/
├── dev/
│   ├── main.tf
│   ├── terraform.tfvars
│   └── backend.tf
├── staging/
│   └── ...
└── production/
    └── ...

Terraform State Management

Terraform state is the mapping between your configuration files and real-world resources. If the state file is lost or corrupted, Terraform cannot manage your infrastructure — it will try to create duplicate resources or lose track of existing ones. Always store state remotely with locking enabled to prevent two engineers from running terraform apply simultaneously and corrupting the state.
# backend.tf -- remote state with locking (non-negotiable for team use)
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"       # Dedicated bucket, versioned
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                       # State contains secrets -- encrypt at rest
    dynamodb_table = "terraform-locks"          # Prevents concurrent modifications
  }
}
Practical tip: Enable S3 bucket versioning so you can recover from state corruption by restoring a previous version. This has saved many teams from having to manually import hundreds of resources.

DevOps Toolchain

| Category | Tools |
|---|---|
| Version Control | Git, GitHub, GitLab |
| CI/CD | GitHub Actions, Jenkins, GitLab CI, CircleCI |
| Containers | Docker, Podman, containerd |
| Orchestration | Kubernetes, Docker Swarm, Nomad |
| IaC | Terraform, Pulumi, CloudFormation, Ansible |
| Monitoring | Prometheus, Grafana, Datadog, New Relic |
| Logging | ELK Stack, Loki, Splunk |
| Tracing | Jaeger, Zipkin, AWS X-Ray |
| Secrets | Vault, AWS Secrets Manager, SOPS |
| GitOps | ArgoCD, Flux |
Key Metric: The “Four Golden Signals” - Latency, Traffic, Errors, Saturation. Monitor these for any service.
Interview Tip: Be ready to discuss the trade-offs of different deployment strategies, how you’d handle rollbacks, and your experience with specific tools in the DevOps toolchain. A strong answer sounds like: “We used canary deployments with automated rollback triggered by error rate thresholds in Prometheus. When our canary error rate exceeded 2x the baseline, the pipeline automatically rolled back to the previous version and paged the on-call engineer.” Specific details and reasoning beat abstract knowledge every time.

Interview Deep-Dive

Strong Answer:
  • Step 0 (first 30 seconds): Acknowledge the alert, check if anyone else is already responding (incident channel in Slack/PagerDuty), and open the runbook for this service if one exists. Do not start debugging before establishing communication.
  • Step 1 (first 2 minutes): Determine blast radius. Check the Grafana dashboard for the Four Golden Signals: which endpoints are failing, is it all traffic or a subset, what error codes (5xx vs 4xx), and is latency also degraded? A 15% error rate across all endpoints suggests a systemic issue (infrastructure, database, dependency). A 15% error rate on one endpoint suggests an application bug.
  • Step 2 (next 3 minutes): Check what changed. Was there a deployment in the last 30 minutes? (kubectl rollout history or check the CD pipeline). If yes, the fastest mitigation is a rollback: kubectl rollout undo deployment/api-server. A rollback takes 30-60 seconds and buys you time to diagnose without users suffering. If there was no deployment, check if a downstream dependency is degraded (database CPU, Redis connection count, third-party API status page).
  • Step 3: Look at the logs. Use structured log queries filtered to the error-producing requests: status=500 AND timestamp > "2am-5min". Look for the common thread — is it a specific user, a specific database query timing out, a nil pointer exception, a certificate expiration?
  • Step 4: Mitigate first, fix later. If the database is overloaded, enable a circuit breaker or serve cached responses. If a specific bad request pattern is causing crashes, add a temporary block rule at the API gateway. The goal at 2 AM is to stop the bleeding, not to write a beautiful fix.
  • Step 5: After mitigating, write a brief incident timeline in the channel (“15% errors started at 2:03, caused by X, mitigated at 2:18 by Y”) and decide if it can wait until morning for a proper fix or needs immediate attention.
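The blast-radius check in Step 1 can be sketched in a few lines of Python. This is only an illustration: `EndpointStats`, `classify_blast_radius`, and the 5% per-endpoint threshold are hypothetical names and values, and real numbers would come from your metrics backend rather than hand-built objects.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    """Request and 5xx counts for one endpoint over the alert window."""
    requests: int
    errors_5xx: int

    @property
    def error_rate(self) -> float:
        # Guard against endpoints that received no traffic in the window.
        return self.errors_5xx / self.requests if self.requests else 0.0


def classify_blast_radius(stats: dict[str, EndpointStats],
                          threshold: float = 0.05) -> str:
    """Return 'systemic', 'localized', or 'healthy'.

    threshold is an assumed per-endpoint error rate above which an
    endpoint counts as failing; tune it to your own baseline.
    """
    failing = [name for name, s in stats.items() if s.error_rate > threshold]
    if not failing:
        return "healthy"
    # Most endpoints failing points at shared infrastructure or a dependency;
    # a minority of failing endpoints points at an application bug.
    if len(failing) > len(stats) / 2:
        return "systemic"
    return "localized"
```

A "systemic" result suggests starting with the database, dependencies, and infrastructure. A "localized" result suggests starting with the recent changes to the failing endpoint.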
Follow-up: You rolled back the deployment but the errors did not stop. What is your next hypothesis and what do you check?
If rollback did not fix it, the deployment was not the cause; it is an environmental issue. My next checks in priority order: (1) Database health: check connection count, active queries, replication lag, disk I/O. A common culprit: a long-running analytics query or migration that is locking tables. (2) Downstream dependencies: if the service calls three external APIs, check each for degraded performance. Use the distributed traces (Jaeger/Zipkin) to find which span is slow or failing. (3) Infrastructure: check if a node in the Kubernetes cluster went unhealthy, if there was an AWS/GCP availability zone issue (check the cloud provider’s status page), or if the cluster hit resource limits (CPU throttling, memory OOM). (4) Traffic pattern change: check if there is an unusual traffic spike (DDoS, viral event, bot scraping) that exceeded auto-scaling capacity. Prometheus metrics for RPS compared to baseline will show this. (5) Certificate or secret expiration: a TLS certificate or API key that expired at exactly this time can cause exactly this pattern.
Strong Answer:
  • Moving from 3 deploys per week to continuous deployment is not primarily a tooling problem — it is a confidence problem. You need enough automated verification that you trust the pipeline to deploy without a human gatekeeper.
  • Testing prerequisites: (1) Unit test coverage above 80% on critical paths with fast execution (under 2 minutes). (2) Integration tests that cover the top 20 user journeys and run against a production-like environment with real databases, not mocks. (3) Contract tests between services (Pact) so that a change in service A does not break service B. (4) Smoke tests that run POST-deployment against the live environment to catch configuration issues that tests cannot simulate.
  • Monitoring prerequisites: (1) Real-time error rate tracking with alerting thresholds (if error rate exceeds 2x baseline within 5 minutes of deploy, auto-rollback). (2) Latency monitoring at p50, p95, p99 — a deploy that passes all tests but doubles p99 latency should be flagged. (3) Business metric monitoring — order completion rate, signup conversion — because some bugs are only visible through business metrics, not technical ones. (4) Distributed tracing so you can quickly pinpoint which service a regression originates from.
  • Infrastructure prerequisites: (1) Canary or blue-green deployment capability so new code gets a fraction of traffic first. (2) Automated rollback triggered by monitoring thresholds. (3) Feature flags to decouple code deployment from feature activation — you deploy code continuously but enable features gradually. (4) Database migrations that are backward-compatible (expand-contract pattern) so the old code version can still run during rollout.
  • The organizational prerequisite most people overlook: the team must agree that “the pipeline is the gatekeeper, not a person.” This is a cultural shift. Senior engineers who are used to manually reviewing every deployment need to trust the automated checks. This trust is built gradually by running both systems in parallel (automated deploy + manual approval) until the team is confident the automation catches everything the human would.
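The auto-rollback rule from the monitoring prerequisites (error rate above 2x baseline within 5 minutes of a deploy) could look roughly like the sketch below. The function name and the `min_baseline` floor are assumptions added for illustration. The floor is not part of the rule as stated above; it is there so that a near-zero baseline does not trigger a rollback on a single error.

```python
def should_rollback(canary_error_rate: float,
                    baseline_error_rate: float,
                    seconds_since_deploy: float,
                    multiplier: float = 2.0,
                    watch_window_s: float = 300.0,
                    min_baseline: float = 0.001) -> bool:
    """Decide whether a fresh deploy should be rolled back automatically.

    Error rates are fractions (0.01 means 1%). min_baseline is an assumed
    floor that keeps the rule stable when the baseline is close to zero.
    """
    if seconds_since_deploy > watch_window_s:
        # Outside the post-deploy watch window: regular alerting takes over.
        return False
    effective_baseline = max(baseline_error_rate, min_baseline)
    return canary_error_rate > multiplier * effective_baseline
```

In a real pipeline the two error rates would be queried from the monitoring system, and a `True` result would trigger the rollback job and page the on-call engineer.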
Follow-up: How do you handle database schema migrations in a continuous deployment environment where the old and new versions of the application are running simultaneously during rollout?
This is the hardest part of continuous deployment and the reason most teams get stuck. The rule is: every migration must be backward-compatible. I use the expand-contract pattern. For replacing a column with a new one: (1) Deploy a migration that adds the new column with a default value (expand). (2) Deploy application code that writes to both old and new columns. (3) Backfill the new column for existing rows. (4) Deploy code that reads from the new column. (5) Eventually, once no running version touches the old column, remove it (contract). Renaming a column works the same way: treat it as “add new column + migrate data + remove old column” over at least three separate deployments. What used to be one migration becomes several deployments spread over days, which feels slow but is the price of zero-downtime deploys. For MySQL, tools like gh-ost (GitHub) or pt-online-schema-change (Percona) handle large table migrations without long-held table locks. The key constraint: NEVER deploy a migration that drops a column or changes a column type in the same deployment as the code that depends on the change. The old code version must be able to run against both the old and new schema.
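To make the expand-contract steps concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The `users` table, the `fullname` and `display_name` columns, and both insert functions are hypothetical. A production migration would go through your migration tool rather than ad hoc statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column. Old code ignores it and keeps working.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_code_insert(name: str) -> None:
    """Previous app version: only knows about fullname."""
    conn.execute("INSERT INTO users (fullname) VALUES (?)", (name,))


def new_code_insert(name: str) -> None:
    """Transitional app version: writes both columns."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


# During rollout both versions serve traffic against the same schema.
old_code_insert("Grace Hopper")
new_code_insert("Alan Turing")

# Backfill rows written before the rollout or by old code during it.
conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

names_via_new_column = [
    row[0] for row in conn.execute("SELECT display_name FROM users ORDER BY id")
]
names_via_old_column = [
    row[0] for row in conn.execute("SELECT fullname FROM users ORDER BY id")
]

# Contract (dropping fullname) would ship in a later, separate deployment,
# once no running version reads or writes the old column.
```

After the backfill, both columns return the same names, so code reading either column sees consistent data. That is the property that lets old and new versions run side by side.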
Strong Answer:
  • Requests are the guaranteed minimum resources a pod gets. The Kubernetes scheduler uses requests to decide which node has enough capacity for the pod. If you request 256Mi of memory, the scheduler only places the pod on a node with at least 256Mi of memory not already reserved by other pods’ requests. If you set requests too low, the scheduler packs too many pods onto one node, causing resource contention. If you set them too high, you waste cluster capacity because the reserved resources sit unused.
  • Limits are the maximum resources a pod can use. Exceeding the memory limit results in an OOMKill — Kubernetes terminates the container immediately. Exceeding the CPU limit results in throttling — the container is not killed but slowed down. This distinction matters: a memory leak hits a hard wall (crash and restart), while a CPU spike gets a soft degradation (requests slow down but succeed).
  • The most dangerous misconfiguration is setting no limits at all. A memory leak in one pod can consume all memory on the node, triggering OOMKills on OTHER pods that happen to share the same node. One misbehaving pod takes down three healthy services. This is called a “noisy neighbor” problem.
  • For a new service, I would determine values empirically: (1) Deploy with generous limits (4x what you estimate) and observe actual usage in staging under realistic load for 48-72 hours. (2) Use Prometheus metrics (container_memory_working_set_bytes for memory, and the rate of the container_cpu_usage_seconds_total counter for CPU) to find the p50 and p99 usage. (3) Set requests to the p50 usage (typical steady state) and limits to 2x the p99 (headroom for spikes). (4) In production, use Vertical Pod Autoscaler (VPA) in recommendation mode to continuously suggest adjustments based on real usage.
  • A common mistake: setting requests equal to limits. This guarantees the pod a fixed resource allocation (Guaranteed QoS class in Kubernetes), which sounds safe but wastes resources. If the pod typically uses 100Mi but spikes to 200Mi during peak, setting both to 200Mi wastes 100Mi 99% of the time. Instead, set requests to 100Mi and limits to 250Mi.
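The sizing rule above (requests at p50, limits at 2x p99) is simple arithmetic over observed usage. The sketch below assumes you have already exported memory samples in MiB from your metrics system. `percentile` and `suggest_memory_mib` are hypothetical helper names, and the nearest-rank percentile is one of several reasonable definitions.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_memory_mib(samples_mib: list[float]) -> dict[str, int]:
    """Apply the rule of thumb: request = p50, limit = 2 x p99."""
    request = percentile(samples_mib, 50)
    limit = 2 * percentile(samples_mib, 99)
    return {"request_mib": math.ceil(request), "limit_mib": math.ceil(limit)}
```

For example, with 98 samples at 100 MiB plus two spikes of 180 and 200 MiB, `suggest_memory_mib` returns a request of 100 MiB and a limit of 360 MiB. Treat the output as a starting point to review against VPA recommendations, not a final value.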
Follow-up: Your pod keeps getting OOMKilled despite having what seems like sufficient memory limits. How do you investigate?
First, I check if it is a genuine memory leak or a legitimate spike. I would graph container_memory_working_set_bytes over time. A steadily increasing line that never decreases is a memory leak. A sawtooth pattern (grows, garbage collection drops it, grows again) with peaks exceeding the limit is a legitimate spike that needs a higher limit. For a JVM-based application, the container memory limit must account for both heap and off-heap memory (thread stacks, native memory, metaspace). A common trap: setting the JVM’s -Xmx to the same value as the container memory limit. The JVM typically needs roughly 30-50% on top of the heap for other purposes, so a container with a 512Mi limit and -Xmx512m will be OOMKilled once the heap fills up. Set -Xmx to about 60-70% of the container limit. For a Python or Node.js service, check for unreleased file handles, growing caches without eviction, or large response bodies buffered in memory. Tools like memory_profiler (Python, line-by-line memory usage) or heap snapshots taken through --inspect (Node.js) help with the analysis.
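Two of the checks above are easy to express as code: telling a leak from a sawtooth, and deriving a heap size from the container limit. This is a rough sketch. The function names, the 5% drop tolerance, and the default 65% heap fraction (the midpoint of the 60-70% guideline) are assumptions for illustration.

```python
def classify_memory_trend(samples_mib: list[float],
                          drop_tolerance: float = 0.05) -> str:
    """Return 'sawtooth' if usage ever falls by more than drop_tolerance
    between consecutive samples, 'leak' if it only grows, else 'flat'."""
    if len(samples_mib) < 2:
        raise ValueError("need at least two samples")
    for prev, cur in zip(samples_mib, samples_mib[1:]):
        if cur < prev * (1 - drop_tolerance):
            return "sawtooth"  # Memory was released, e.g. by garbage collection.
    return "leak" if samples_mib[-1] > samples_mib[0] else "flat"


def xmx_flag(container_limit_mib: int, heap_fraction: float = 0.65) -> str:
    """Build a -Xmx flag that leaves room for non-heap JVM memory."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    return f"-Xmx{int(container_limit_mib * heap_fraction)}m"
```

With a 512Mi limit, `xmx_flag(512)` yields `-Xmx332m`, which leaves about 180Mi for thread stacks, metaspace, and native memory.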
Strong Answer:
  • First, the immediate response: restore from the most recent backup (which exists because we have automated backups with point-in-time recovery, right?). Then we hold a blameless postmortem to understand the systemic failure — the question is not “who did this?” but “why did our process allow this to happen?”
  • Prevention layer 1: Remove direct Terraform access from individual engineers. All Terraform changes go through a CI/CD pipeline (Atlantis, Terraform Cloud, or a GitHub Actions workflow). The pipeline runs terraform plan, posts the plan output as a comment on the pull request, requires peer review of the plan, and only then runs terraform apply. No human ever runs terraform apply locally against production.
  • Prevention layer 2: Implement policy-as-code using Sentinel (Terraform Enterprise) or Open Policy Agent (OPA). Write policies like: “No resource of type aws_db_instance may be destroyed without explicit approval from the database team” and “No changes to production during the hours of 10 PM to 6 AM.” These policies are checked automatically during the plan phase and block violations.
  • Prevention layer 3: Protect critical resources with lifecycle { prevent_destroy = true } in the Terraform configuration. This makes Terraform refuse to create a plan that destroys the resource. You must explicitly remove this flag (which goes through code review) before destruction is possible.
  • Prevention layer 4: Enable versioning on the S3 bucket that stores the Terraform state file, and use a DynamoDB table for state locking. Versioning lets you recover from state corruption, and locking prevents two people from running terraform apply against the same state simultaneously.
  • Prevention layer 5: Separate Terraform state files by environment and by risk level. The database infrastructure should be in a separate state file from the application infrastructure. A developer deploying a new Lambda function should not even have the database in their blast radius.
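As an illustration of the policy-as-code idea in layer 2, the sketch below scans the JSON form of a Terraform plan (as produced by `terraform show -json` on a saved plan file) and lists protected resources the plan would delete. It stands in for what Sentinel or OPA would do and is not a replacement for either. `PROTECTED_TYPES` and `find_blocked_destroys` are hypothetical names.

```python
PROTECTED_TYPES = frozenset({"aws_db_instance"})


def find_blocked_destroys(plan: dict,
                          protected_types: frozenset = PROTECTED_TYPES) -> list[str]:
    """Return addresses of protected resources that the plan would delete.

    plan is the parsed JSON plan; each entry in resource_changes carries
    the resource type, its address, and the list of planned actions.
    """
    blocked = []
    for change in plan.get("resource_changes", []):
        actions = change.get("change", {}).get("actions", [])
        # A replacement appears as delete plus create, so it is caught too.
        if "delete" in actions and change.get("type") in protected_types:
            blocked.append(change["address"])
    return blocked
```

A CI step could fail the pipeline whenever the returned list is non-empty, forcing an explicit approval path before any database is destroyed or replaced.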
Follow-up: The engineer argues that requiring a PR and pipeline for every infrastructure change slows them down in an emergency. How do you balance speed and safety?
Emergencies are real, and the process must account for them without abandoning safety entirely. I would implement a “break glass” procedure: a documented, audited fast-path for emergencies. The engineer can make the change directly, but they must: (1) get verbal approval from one other engineer (documented in the incident channel), (2) use a privileged credential that logs every action to an immutable audit trail, and (3) create a retroactive PR within 24 hours that brings the Terraform code into sync with the manual change. The audit log ensures accountability without blocking urgent action. We track how often the break-glass procedure is used. If it happens more than once a month, that signals the normal process is too slow and needs improvement, not that we need more exceptions. Netflix’s approach is instructive: they automated so aggressively that most “emergencies” became self-healing (auto-scaling, auto-rollback), reducing the need for manual intervention to near zero.