Overview

DevOps bridges development and operations so that teams can deliver software faster and more reliably, mainly through automation (CI/CD, infrastructure as code) and shared practices such as monitoring and controlled deployments.

CI/CD Pipeline

┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
│  Code  │──►│ Build  │──►│  Test  │──►│ Deploy │──►│Monitor │
│ Commit │   │        │   │        │   │        │   │        │
└────────┘   └────────┘   └────────┘   └────────┘   └────────┘
                CI                           CD
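
The stage ordering in the diagram can be sketched in a few lines of Python. This is an illustrative model only, not how any CI system is implemented; `run_pipeline` and the stage callables are hypothetical names chosen for the example.

```python
from typing import Callable

Stage = tuple[str, Callable[[], bool]]

def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order and stop at the first failure (fail fast)."""
    completed: list[str] = []
    for name, step in stages:
        if not step():
            # A failed stage blocks everything downstream, e.g. a failing
            # test run prevents the deploy stage from starting.
            break
        completed.append(name)
    return completed

stages = [
    ("build", lambda: True),
    ("test", lambda: False),   # simulate a failing test suite
    ("deploy", lambda: True),
]
print(run_pipeline(stages))  # ['build']
```

The same idea appears in the workflow below as `needs: build-and-test`, which keeps the deploy job from running unless the earlier job succeeded.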

Example GitHub Actions Workflow

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '18'
          
      - name: Install dependencies
        run: npm ci
        
      - name: Run tests
        run: npm test
        
      - name: Build
        run: npm run build
  
  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy to production
        run: |
          # Deploy script here
          echo "Deploying to production..."

Containers (Docker)

Docker Concepts

┌─────────────────────────────────────────────────┐
│                    Host OS                       │
│  ┌─────────────────────────────────────────────┐│
│  │              Docker Engine                  ││
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ││
│  │  │ Container │ │ Container │ │ Container │ ││
│  │  │  ┌─────┐  │ │  ┌─────┐  │ │  ┌─────┐  │ ││
│  │  │  │ App │  │ │  │ App │  │ │  │ App │  │ ││
│  │  │  └─────┘  │ │  └─────┘  │ │  └─────┘  │ ││
│  │  │   Libs    │ │   Libs    │ │   Libs    │ ││
│  │  └───────────┘ └───────────┘ └───────────┘ ││
│  └─────────────────────────────────────────────┘│
└─────────────────────────────────────────────────┘

Dockerfile Best Practices

# Use specific version, not 'latest'
FROM node:18-alpine

# Set working directory
WORKDIR /app

# Copy dependency files first (leverage cache)
COPY package*.json ./

# Install production dependencies only (--omit=dev replaces the deprecated --only=production)
RUN npm ci --omit=dev

# Copy application code
COPY . .

# Use non-root user
USER node

# Expose port
EXPOSE 3000

# Health check (Alpine images ship BusyBox wget but not curl)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -q --spider http://localhost:3000/health || exit 1

# Start application
CMD ["node", "server.js"]

Docker Compose

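# The top-level 'version' key is obsolete in current Docker Compose and can be omitted

services:
  web:
    build: .
    ports:
      - "3000:3000"
    environment:
      # Example credentials only; load real ones from an env file or a secret store
      - DATABASE_URL=postgres://postgres:secret@db:5432/app
    depends_on:
      - db
      - redis
    
  db:
    image: postgres:14
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=secret
      - POSTGRES_DB=app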
      
  redis:
    image: redis:alpine

volumes:
  postgres_data:

Kubernetes (K8s)

Core Concepts

┌────────────────────────────────────────────────────────┐
│                      Cluster                           │
│  ┌──────────────────────────────────────────────────┐  │
│  │                 Control Plane                    │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │  │
│  │  │ API      │ │ Scheduler│ │ Controller       │ │  │
│  │  │ Server   │ │          │ │ Manager          │ │  │
│  │  └──────────┘ └──────────┘ └──────────────────┘ │  │
│  └──────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────┐  │
│  │                    Nodes                         │  │
│  │  ┌─────────────────┐  ┌─────────────────┐       │  │
│  │  │     Node 1      │  │     Node 2      │       │  │
│  │  │ ┌─────┐ ┌─────┐ │  │ ┌─────┐ ┌─────┐ │       │  │
│  │  │ │ Pod │ │ Pod │ │  │ │ Pod │ │ Pod │ │       │  │
│  │  │ └─────┘ └─────┘ │  │ └─────┘ └─────┘ │       │  │
│  │  └─────────────────┘  └─────────────────┘       │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘

K8s Resources

# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myapp:1.0.0
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 3000
  type: LoadBalancer
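
The Service above finds its Pods through label matching: `selector: app: web` selects every Pod whose labels include `app: web`, which is why the Deployment's Pod template carries that label. A minimal sketch of this equality-based matching follows; the pod names and the extra label are made up for the example, and the real matching is done by the Kubernetes control plane.

```python
def selector_matches(selector: dict[str, str], labels: dict[str, str]) -> bool:
    """True when every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())

# Hypothetical pods and their labels
pods = {
    "web-app-7d4f": {"app": "web", "pod-template-hash": "7d4f"},
    "worker-9c1a": {"app": "worker"},
}
service_selector = {"app": "web"}

endpoints = [name for name, labels in pods.items()
             if selector_matches(service_selector, labels)]
print(endpoints)  # ['web-app-7d4f']
```

Extra labels on a Pod do not prevent a match; only the keys named in the selector are compared.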

Cloud Fundamentals

Major Cloud Providers Comparison

Service Type   AWS            Azure              GCP
Compute        EC2            Virtual Machines   Compute Engine
Containers     ECS/EKS        AKS                GKE
Serverless     Lambda         Functions          Cloud Functions
Storage        S3             Blob Storage       Cloud Storage
Database       RDS/DynamoDB   SQL/CosmosDB       Cloud SQL/Firestore
CDN            CloudFront     CDN                Cloud CDN

Infrastructure as Code (Terraform)

# main.tf
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"  # AMI IDs are region-specific; look up a current one for your region
  instance_type = "t2.micro"
  
  tags = {
    Name = "web-server"
  }
}

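resource "aws_s3_bucket" "static" {
  bucket = "my-static-assets"
}

# AWS provider v4+ configures versioning through a separate resource;
# the inline "versioning" block on aws_s3_bucket is deprecated.
resource "aws_s3_bucket_versioning" "static" {
  bucket = aws_s3_bucket.static.id

  versioning_configuration {
    status = "Enabled"
  }
}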

Monitoring & Observability

Metrics

  • CPU, Memory, Disk
  • Request rate, Latency
  • Error rate
  • Tools: Prometheus, Grafana

Logs

  • Application logs
  • Access logs
  • Error logs
  • Tools: ELK Stack, Loki

Traces

  • Request flow
  • Service dependencies
  • Bottleneck detection
  • Tools: Jaeger, Zipkin

The Four Golden Signals

Signal       What to Measure                    Alert When
Latency      Request duration (p50, p95, p99)   p99 > SLA threshold
Traffic      Requests per second                Unusual spikes or drops
Errors       Error rate percentage              > 1% of requests failing
Saturation   Resource utilization               CPU/Memory > 80%
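
As a rough illustration of the thresholds in the table, the sketch below computes a nearest-rank percentile and checks three of the four signals. The function names, the 200 ms p99 limit, and the sample data are assumptions made for the example. Traffic is left out because "unusual" only has meaning against a historical baseline.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_signals(latencies_ms, error_count, request_count, cpu_utilization,
                  p99_limit_ms=200.0):
    """Return the names of the signals that breach the example thresholds."""
    alerts = []
    if percentile(latencies_ms, 99) > p99_limit_ms:
        alerts.append("latency")
    if request_count and error_count / request_count > 0.01:
        alerts.append("errors")
    if cpu_utilization > 0.80:
        alerts.append("saturation")
    return alerts

latencies = [20.0] * 97 + [150.0, 450.0, 900.0]   # 100 request durations in ms
print(percentile(latencies, 50))   # 20.0
print(percentile(latencies, 99))   # 450.0
print(check_signals(latencies, error_count=30, request_count=2000,
                    cpu_utilization=0.65))   # ['latency', 'errors']
```

Note how the median looks healthy while p99 does not, which is why the table alerts on the tail rather than on the average.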

Prometheus Metrics Example

# Assumes a Flask application; adapt the routing decorator for other frameworks
from flask import Flask, jsonify
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)

# Counter - cumulative metric (only increases)
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram - distribution of values
request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint'],
    buckets=[.01, .05, .1, .5, 1, 5]
)

# Gauge - value that can go up or down
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

def fetch_users():
    # Placeholder for the real data access call
    return [{"id": 1, "name": "example"}]

# Usage: time the handler body, then count the request
@app.route('/api/users')
def get_users():
    with request_duration.labels(endpoint='/api/users').time():
        result = fetch_users()
    http_requests_total.labels(
        method='GET', endpoint='/api/users', status='200'
    ).inc()
    return jsonify(result)
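
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and a final `+Inf` bucket equals the total count. The plain-Python sketch below mimics that counting for the bucket bounds used above. It does not use `prometheus_client`, and `cumulative_buckets` and the sample durations are hypothetical.

```python
import math

def cumulative_buckets(observations, bounds):
    """Count observations <= each upper bound, plus a final +Inf bucket."""
    counts = {}
    for upper in list(bounds) + [math.inf]:
        counts[upper] = sum(1 for value in observations if value <= upper)
    return counts

durations = [0.004, 0.03, 0.03, 0.2, 0.7, 7.5]   # seconds
buckets = cumulative_buckets(durations, [.01, .05, .1, .5, 1, 5])
for upper, count in buckets.items():
    print(f"le={upper}: {count}")
# le=0.01: 1
# le=0.05: 3
# le=0.1: 3
# le=0.5: 4
# le=1: 5
# le=5: 5
# le=inf: 6
```

Because of this layout, bucket bounds should bracket the latencies you care about; a 7.5 s request here is only visible in the `+Inf` bucket.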

Structured Logging

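from types import SimpleNamespace

import structlog

# Emit JSON with an ISO timestamp; structlog's default configuration
# pretty-prints to the console instead of writing JSON
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()

# Placeholder values so the snippet runs on its own
user_id, order_id, duration = "123", "456", 45
order = SimpleNamespace(total=99.99, items=["a", "b", "c"])

# ❌ Bad: Unstructured logs
print(f"User {user_id} created order {order_id}")

# ✅ Good: Structured JSON logs
logger.info(
    "order_created",
    user_id=user_id,
    order_id=order_id,
    total=order.total,
    items_count=len(order.items),
    duration_ms=duration
)

# Example output (JSON; key order and timestamp precision may differ):
# {"event": "order_created", "user_id": "123", "order_id": "456", 
#  "total": 99.99, "items_count": 3, "duration_ms": 45, 
#  "timestamp": "2024-01-15T10:30:00Z"}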
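
The same idea works without a third-party library. The sketch below uses only the standard `logging` and `json` modules; `JsonFormatter` and the `fields` key passed through `extra` are conventions invented for this example, not part of the standard library's API.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "event": record.getMessage(),
            "level": record.levelname,
            # Local time without a zone suffix; adjust if UTC is required
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
        }
        # Values passed as extra={"fields": {...}} become an attribute on the record
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order_created",
            extra={"fields": {"user_id": "123", "order_id": "456"}})
```

Either way, the point is that each field becomes a key that a log backend can filter and aggregate on, instead of text that has to be parsed with regular expressions.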

GitOps

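GitOps stores infrastructure and application configuration in Git as the single source of truth. An agent running in the cluster continuously compares the live state with the repository and applies whatever is needed to make them match.
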
┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│  Developer  │────►│    Git      │────►│    GitOps Agent     │
│  Push       │     │  Repository │     │ (ArgoCD/Flux)       │
└─────────────┘     └─────────────┘     └──────────┬──────────┘
                                                   │
                                                   ▼
                                        ┌─────────────────────┐
                                        │    Kubernetes       │
                                        │    Cluster          │
                                        └─────────────────────┘

ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/repo.git
    targetRevision: HEAD
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
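
Conceptually, a GitOps agent runs a reconcile loop: it diffs the desired state from Git against the live cluster state and derives the actions that close the gap. The sketch below is a deliberately simplified model, not ArgoCD's actual algorithm; `plan_sync` and the resource dictionaries are hypothetical. It shows roughly what `selfHeal` (revert drift) and `prune` (delete resources no longer in Git) mean in the manifest above.

```python
def plan_sync(desired: dict[str, dict], live: dict[str, dict],
              prune: bool = True) -> list[tuple[str, str]]:
    """Return (action, resource_name) pairs that bring live state in line with Git."""
    actions = []
    for name, manifest in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != manifest:
            # Drift, e.g. a manual change in the cluster: restore Git's version
            actions.append(("update", name))
    if prune:
        for name in live:
            if name not in desired:
                # Resource was removed from Git, so remove it from the cluster
                actions.append(("delete", name))
    return actions

desired = {"web-app": {"replicas": 3}, "web-service": {"port": 80}}
live = {"web-app": {"replicas": 5}, "old-job": {"schedule": "daily"}}

print(plan_sync(desired, live))
# [('update', 'web-app'), ('create', 'web-service'), ('delete', 'old-job')]
```

With `prune=False` the stale `old-job` resource would be left in place, which is the safer default when first adopting automated sync.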

Deployment Strategies

Blue-Green Deployment

Before:
┌─────────────────────────────────────────────────┐
│  Load Balancer  ───────────►  Blue (v1) ✓      │
│                               Green (v2)        │
└─────────────────────────────────────────────────┘

After:
┌─────────────────────────────────────────────────┐
│  Load Balancer                Blue (v1) (idle) │
│                 ───────────►  Green (v2) ✓     │
└─────────────────────────────────────────────────┘
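
The switch itself is a single pointer flip, which is what makes rollback fast. A minimal sketch, assuming a hypothetical `BlueGreenRouter` that tracks which environment receives all traffic; in practice the pointer is a load balancer target, a DNS record, or a Kubernetes Service selector.

```python
class BlueGreenRouter:
    """Tracks which of two identical environments receives all traffic."""

    def __init__(self) -> None:
        self.active = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def switch(self, healthy: bool) -> str:
        """Point traffic at the idle environment, but only if it passed its checks."""
        if healthy:
            self.active = self.idle
        return self.active

    def rollback(self) -> str:
        """Rolling back is the same flip in the other direction."""
        self.active = self.idle
        return self.active

router = BlueGreenRouter()
print(router.switch(healthy=True))   # green
print(router.rollback())             # blue
```

The trade-off is cost: both environments must be fully provisioned while the old one is kept around for rollback.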

Canary Deployment

┌──────────────────────────────────────────────────┐
│                  Load Balancer                   │
│                       │                          │
│          ┌────────────┴────────────┐            │
│          │                         │            │
│          ▼ (90%)                   ▼ (10%)      │
│    ┌──────────┐              ┌──────────┐      │
│    │  v1.0    │              │  v1.1    │      │
│    │ (stable) │              │ (canary) │      │
│    └──────────┘              └──────────┘      │
└──────────────────────────────────────────────────┘

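Gradually increase canary traffic (10% → 25% → 50% → 100%), advancing only while the canary's error rate and latency stay within acceptable limits; otherwise route all traffic back to the stable version.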
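
The progression can be sketched as two small functions: one that decides the next traffic weight and one that assigns a request to a version. Both function names are hypothetical, and hashing the request ID is just one possible way to split traffic; real setups usually delegate this to the load balancer or service mesh.

```python
import zlib

CANARY_STEPS = [10, 25, 50, 100]   # percentages from the progression above

def next_canary_weight(current: int, healthy: bool) -> int:
    """Advance to the next step while the canary is healthy; drop to 0 otherwise."""
    if not healthy:
        return 0   # roll back: all traffic returns to the stable version
    for step in CANARY_STEPS:
        if step > current:
            return step
    return 100

def route(request_id: str, canary_weight: int) -> str:
    """Send roughly canary_weight% of requests to the canary, consistently per ID."""
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100   # stable value in 0..99
    return "canary" if bucket < canary_weight else "stable"

weight = 0
for healthy in [True, True, True, True]:
    weight = next_canary_weight(weight, healthy)
    print(weight)   # 10, then 25, then 50, then 100
```

Hashing a stable identifier keeps a given user on the same version across requests, which avoids confusing behavior mid-session.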

Rolling Update

# Kubernetes rolling update strategy
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max pods above desired
      maxUnavailable: 1  # Max pods unavailable during update
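
To see what these two settings allow, the sketch below computes the pod-count bounds during a rollout. Kubernetes accepts either absolute numbers or percentages; for percentages, `maxSurge` rounds up and `maxUnavailable` rounds down. The helper names are made up for this example.

```python
import math

def resolve(value, replicas: int, round_up: bool) -> int:
    """Accept an absolute count (int) or a percentage string such as '25%'."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)

def rollout_bounds(replicas: int, max_surge, max_unavailable) -> dict[str, int]:
    """Upper bound on total pods and lower bound on available pods during an update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }

print(rollout_bounds(4, 1, 1))
# {'max_total_pods': 5, 'min_available_pods': 3}
print(rollout_bounds(10, "25%", "25%"))
# {'max_total_pods': 13, 'min_available_pods': 8}
```

For the four-replica example above, the rollout never runs more than five pods and never drops below three available ones.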

Secrets Management

HashiCorp Vault

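import os

import hvac

# Initialize client; the token comes from the environment, never from source code
client = hvac.Client(
    url='https://vault.example.com',
    token=os.environ['VAULT_TOKEN'],
)

# Read secret (KV v2 secrets engine at its default mount point)
secret = client.secrets.kv.v2.read_secret_version(
    path='myapp/database'
)
db_password = secret['data']['data']['password']

# Dynamic secrets: Vault creates short-lived database credentials on request
creds = client.secrets.database.generate_credentials(
    name='my-role'
)
db_username = creds['data']['username']
db_dynamic_password = creds['data']['password']
# Each credential pair is tied to a lease and is revoked when the lease expires,
# so request fresh credentials (or renew the lease) before then.
# Avoid printing or logging the values.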

Kubernetes Secrets with External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: database-secret
  data:
  - secretKey: password
    remoteRef:
      key: secret/data/production/database
      property: password

Site Reliability Engineering (SRE) Concepts

Service Level Objectives (SLOs)

Term              Definition                           Example
SLI (Indicator)   Metric that measures service level   Request latency p99
SLO (Objective)   Target value for SLI                 p99 latency < 200 ms
SLA (Agreement)   Contract with consequences           99.9% uptime or credits
Error Budget      Allowed downtime before SLO breach   43.8 min/month for 99.9%

# Error budget calculation
# Uses a 30-day month (43.2 min at 99.9%); an average month of about
# 30.44 days gives the 43.8 min quoted in the tables
monthly_minutes = 30 * 24 * 60  # 43,200 minutes

slo_99_9 = 0.999
budget_99_9_minutes = monthly_minutes * (1 - slo_99_9)  # about 43.2 minutes

slo_99_99 = 0.9999
budget_99_99_minutes = monthly_minutes * (1 - slo_99_99)  # about 4.32 minutes

Availability Table

Availability   Downtime/Year   Downtime/Month   Downtime/Week
99%            3.65 days       7.31 hours       1.68 hours
99.9%          8.76 hours      43.8 minutes     10.1 minutes
99.95%         4.38 hours      21.9 minutes     5.04 minutes
99.99%         52.6 minutes    4.38 minutes     1.01 minutes
99.999%        5.26 minutes    26.3 seconds     6.05 seconds
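
The rows above can be reproduced, to within rounding, with a short calculation. The sketch assumes an average year of 365.25 days and a month of one twelfth of that; `allowed_downtime_minutes` is a hypothetical helper, and a different year length shifts the last digit of some values.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60   # average year (assumption)
MINUTES_PER_WEEK = 7 * 24 * 60

def allowed_downtime_minutes(availability_pct: float) -> dict[str, float]:
    """Allowed downtime in minutes per year, month, and week for a given availability."""
    unavailable = 1 - availability_pct / 100
    per_year = MINUTES_PER_YEAR * unavailable
    return {
        "year": per_year,
        "month": per_year / 12,
        "week": MINUTES_PER_WEEK * unavailable,
    }

budget = allowed_downtime_minutes(99.9)
print(round(budget["month"], 1))   # 43.8
print(round(budget["week"], 1))    # 10.1
```

Each extra nine cuts the budget by a factor of ten, which is why moving from 99.9% to 99.99% usually demands automation rather than faster manual response.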

Infrastructure as Code Best Practices

Terraform Module Structure

modules/
├── vpc/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── README.md
├── eks/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

environments/
├── dev/
│   ├── main.tf
│   ├── terraform.tfvars
│   └── backend.tf
├── staging/
│   └── ...
└── production/
    └── ...

Terraform State Management

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # State locking
  }
}

DevOps Toolchain

Category          Tools
Version Control   Git, GitHub, GitLab
CI/CD             GitHub Actions, Jenkins, GitLab CI, CircleCI
Containers        Docker, Podman, containerd
Orchestration     Kubernetes, Docker Swarm, Nomad
IaC               Terraform, Pulumi, CloudFormation, Ansible
Monitoring        Prometheus, Grafana, Datadog, New Relic
Logging           ELK Stack, Loki, Splunk
Tracing           Jaeger, Zipkin, AWS X-Ray
Secrets           Vault, AWS Secrets Manager, SOPS
GitOps            ArgoCD, Flux

Key Metrics: The “Four Golden Signals” are latency, traffic, errors, and saturation. Monitor them for every service.

Interview Tip: Be ready to discuss the trade-offs of different deployment strategies, how you’d handle rollbacks, and your experience with specific tools in the DevOps toolchain.