DevOps & Infrastructure

Overview
CI/CD Pipeline
Example GitHub Actions Workflow
Containers (Docker)
Docker Concepts
Dockerfile Best Practices
Docker Compose
Kubernetes (K8s)
Core Concepts
K8s Resources
Cloud Fundamentals
Major Cloud Providers Comparison
Infrastructure as Code (Terraform)
Monitoring & Observability
The Four Golden Signals
Prometheus Metrics Example
Structured Logging
GitOps
ArgoCD Application
Deployment Strategies
Blue-Green Deployment
Canary Deployment
Rolling Update
Secrets Management
HashiCorp Vault
Kubernetes Secrets with External Secrets Operator
Site Reliability Engineering (SRE) Concepts
Service Level Objectives (SLOs)
Availability Table
Infrastructure as Code Best Practices
Terraform Module Structure
Terraform State Management
DevOps Toolchain

Overview

DevOps bridges development and operations, enabling faster, more reliable software delivery through automation and best practices.

CI/CD Pipeline

┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
│  Code  │──►│ Build  │──►│  Test  │──►│ Deploy │──►│Monitor │
│ Commit │   │        │   │        │   │        │   │        │
└────────┘   └────────┘   └────────┘   └────────┘   └────────┘
                CI                           CD

Example GitHub Actions Workflow

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          
      - name: Install dependencies
        run: npm ci
        
      - name: Run tests
        run: npm test
        
      - name: Build
        run: npm run build
  
  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy to production
        run: |
          # Deploy script here
          echo "Deploying to production..."

Containers (Docker)

Docker Concepts

┌─────────────────────────────────────────────────┐
│                    Host OS                       │
│  ┌─────────────────────────────────────────────┐│
│  │              Docker Engine                  ││
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ││
│  │  │ Container │ │ Container │ │ Container │ ││
│  │  │  ┌─────┐  │ │  ┌─────┐  │ │  ┌─────┐  │ ││
│  │  │  │ App │  │ │  │ App │  │ │  │ App │  │ ││
│  │  │  └─────┘  │ │  └─────┘  │ │  └─────┘  │ ││
│  │  │   Libs    │ │   Libs    │ │   Libs    │ ││
│  │  └───────────┘ └───────────┘ └───────────┘ ││
│  └─────────────────────────────────────────────┘│
└─────────────────────────────────────────────────┘

Dockerfile Best Practices

# Use specific version, not 'latest'
FROM node:18-alpine

# Set working directory
WORKDIR /app

# Copy dependency files first (leverage cache)
COPY package*.json ./

# Install production dependencies only (--only=production is deprecated in current npm)
RUN npm ci --omit=dev

# Copy application code
COPY . .

# Use non-root user
USER node

# Expose port
EXPOSE 3000

# Health check (node:18-alpine ships BusyBox wget, not curl)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -qO /dev/null http://localhost:3000/health || exit 1

# Start application
CMD ["node", "server.js"]

Docker Compose

version: '3.8'

services:
  web:
    build: .
    ports:
      - "3000:3000"
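    environment:
      # User, password, and database name must match the db service below
      - DATABASE_URL=postgres://postgres:secret@db:5432/app
    depends_on:
      - db
      - redis

  db:
    image: postgres:14
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      # Example value only; load real passwords from a secret store or env file
      - POSTGRES_PASSWORD=secret
      - POSTGRES_DB=app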
      
  redis:
    image: redis:alpine

volumes:
  postgres_data:
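
In the short form used above, depends_on only controls start order; it does not wait until Postgres is ready to accept connections. One common workaround is for the application to retry until the port is reachable. The following is a minimal sketch of that idea, not part of the original setup: wait_for_port is a hypothetical helper, and an open TCP port is only an approximation of "database ready".

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if the port is closed or unreachable
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # "db" is the Compose service name from the file above
    if not wait_for_port("db", 5432, timeout_s=5.0):
        print("database not reachable yet")
```

Compose also offers a long-form depends_on with a health-check condition, which may be preferable when the image defines a health check.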

Kubernetes (K8s)

Core Concepts

┌────────────────────────────────────────────────────────┐
│                      Cluster                           │
│  ┌──────────────────────────────────────────────────┐  │
│  │                 Control Plane                    │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │  │
│  │  │ API      │ │ Scheduler│ │ Controller       │ │  │
│  │  │ Server   │ │          │ │ Manager          │ │  │
│  │  └──────────┘ └──────────┘ └──────────────────┘ │  │
│  └──────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────┐  │
│  │                    Nodes                         │  │
│  │  ┌─────────────────┐  ┌─────────────────┐       │  │
│  │  │     Node 1      │  │     Node 2      │       │  │
│  │  │ ┌─────┐ ┌─────┐ │  │ ┌─────┐ ┌─────┐ │       │  │
│  │  │ │ Pod │ │ Pod │ │  │ │ Pod │ │ Pod │ │       │  │
│  │  │ └─────┘ └─────┘ │  │ └─────┘ └─────┘ │       │  │
│  │  └─────────────────┘  └─────────────────┘       │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘

K8s Resources

# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: myapp:1.0.0
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 3000
  type: LoadBalancer
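
The Service above finds its backends through the label selector app: web, which matches the labels in the Deployment's pod template. As a rough illustration of equality-based selector matching (the pod names and the selector_matches helper are made up for this sketch):

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Equality-based matching: every selector key must be present with the same value."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical pods; only the labels matter for selection
pods = [
    {"name": "web-app-1", "labels": {"app": "web", "tier": "frontend"}},
    {"name": "web-app-2", "labels": {"app": "web"}},
    {"name": "worker-1", "labels": {"app": "worker"}},
]

service_selector = {"app": "web"}
backends = [p["name"] for p in pods if selector_matches(service_selector, p["labels"])]
print(backends)  # ['web-app-1', 'web-app-2']
```

Extra labels on a pod do not prevent a match; a missing or different value for any selector key does.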

Cloud Fundamentals

Major Cloud Providers Comparison

Service Type	AWS	Azure	GCP
Compute	EC2	Virtual Machines	Compute Engine
Containers	ECS/EKS	AKS	GKE
Serverless	Lambda	Azure Functions	Cloud Functions
Storage	S3	Blob Storage	Cloud Storage
Database	RDS/DynamoDB	Azure SQL Database/Cosmos DB	Cloud SQL/Firestore
CDN	CloudFront	Azure CDN	Cloud CDN

Infrastructure as Code (Terraform)

# main.tf
provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
  
  tags = {
    Name = "web-server"
  }
}

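resource "aws_s3_bucket" "static" {
  bucket = "my-static-assets"
}

# AWS provider v4 and later configure versioning through a separate resource
resource "aws_s3_bucket_versioning" "static" {
  bucket = aws_s3_bucket.static.id

  versioning_configuration {
    status = "Enabled"
  }
}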

Monitoring & Observability

Metrics

CPU, Memory, Disk
Request rate, Latency
Error rate
Tools: Prometheus, Grafana

Logs

Application logs
Access logs
Error logs
Tools: ELK Stack, Loki

Traces

Request flow
Service dependencies
Bottleneck detection
Tools: Jaeger, Zipkin

The Four Golden Signals

Signal	What to Measure	Alert When
Latency	Request duration (p50, p95, p99)	p99 > SLA threshold
Traffic	Requests per second	Unusual spikes or drops
Errors	Error rate percentage	> 1% of requests failing
Saturation	Resource utilization	CPU/Memory > 80%
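
As a rough illustration of the latency and error rows in the table, the sketch below computes a nearest-rank p99 and an error rate from a batch of samples and reports which alerts would fire. The thresholds (200 ms, 1%) and the function names are assumptions for the example; real systems usually compute these from time-series data rather than raw lists.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate(latencies_ms, statuses, p99_threshold_ms=200.0, max_error_rate=0.01):
    """Return the measured p99, the 5xx error rate, and the names of breached signals."""
    p99 = percentile(latencies_ms, 99)
    error_rate = sum(1 for s in statuses if s >= 500) / len(statuses)
    alerts = []
    if p99 > p99_threshold_ms:
        alerts.append("latency")
    if error_rate > max_error_rate:
        alerts.append("errors")
    return {"p99_ms": p99, "error_rate": error_rate, "alerts": alerts}


if __name__ == "__main__":
    latencies = [100] * 98 + [250, 300]   # two slow requests out of 100
    statuses = [200] * 97 + [500] * 3     # three server errors out of 100
    print(evaluate(latencies, statuses))
```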

Prometheus Metrics Example

from flask import Flask, jsonify  # Flask is assumed here as the web framework
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)

# Counter - cumulative metric (only increases)
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram - distribution of values
request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint'],
    buckets=[.01, .05, .1, .5, 1, 5]
)

# Gauge - value that can go up or down
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

def fetch_users():
    # Placeholder data source for the example
    return [{"id": 1, "name": "Ada"}]

# Usage
@app.route('/api/users')
def get_users():
    # time() observes the elapsed seconds when the block exits
    with request_duration.labels(endpoint='/api/users').time():
        result = fetch_users()
    http_requests_total.labels(
        method='GET', endpoint='/api/users', status='200'
    ).inc()
    return jsonify(result)
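
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit +Inf bucket counts all of them. The sketch below mimics that counting in plain Python with the bucket bounds from the example; bucket_counts is an illustrative helper, not a prometheus_client function.

```python
def bucket_counts(observations, bounds):
    """Cumulative count per upper bound, plus +Inf, in the style of Prometheus `le` buckets."""
    counts = {bound: sum(1 for value in observations if value <= bound) for bound in bounds}
    counts[float("inf")] = len(observations)
    return counts


durations_s = [0.02, 0.07, 0.3, 0.3, 2.0, 7.0]
counts = bucket_counts(durations_s, [0.01, 0.05, 0.1, 0.5, 1, 5])
for bound, count in counts.items():
    print(f"le={bound}: {count}")
```

Because the counts are cumulative, bucket bounds should bracket the latencies that matter; a p99 target of 200 ms is hard to assess if the nearest bounds are 0.1 and 0.5.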

Structured Logging

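from types import SimpleNamespace

import structlog

# Render events as JSON with an ISO-8601 UTC timestamp
# (structlog's default renderer prints key=value pairs, not JSON)
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)

logger = structlog.get_logger()

# Example values; in a real handler these come from the request
user_id, order_id, duration = "123", "456", 45
order = SimpleNamespace(total=99.99, items=["a", "b", "c"])

# ❌ Bad: Unstructured logs
print(f"User {user_id} created order {order_id}")

# ✅ Good: Structured JSON logs
logger.info(
    "order_created",
    user_id=user_id,
    order_id=order_id,
    total=order.total,
    items_count=len(order.items),
    duration_ms=duration
)

# Output (JSON; key order and timestamp precision may differ):
# {"event": "order_created", "user_id": "123", "order_id": "456", 
#  "total": 99.99, "items_count": 3, "duration_ms": 45, 
#  "timestamp": "2024-01-15T10:30:00Z"}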
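
The payoff of structured logs is that they can be queried by field instead of by regular expression. A small sketch, assuming log lines shaped like the JSON output above (slow_orders is a hypothetical helper):

```python
import json


def slow_orders(log_lines, threshold_ms):
    """Return the order_id of every order_created event slower than threshold_ms."""
    result = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines
        if not isinstance(record, dict):
            continue  # valid JSON, but not a log record
        if record.get("event") == "order_created" and record.get("duration_ms", 0) > threshold_ms:
            result.append(record["order_id"])
    return result


lines = [
    '{"event": "order_created", "order_id": "456", "duration_ms": 45}',
    '{"event": "order_created", "order_id": "457", "duration_ms": 950}',
    "User 123 created order 458",
]
print(slow_orders(lines, threshold_ms=500))  # ['457']
```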

GitOps

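In GitOps, infrastructure and application configuration is stored in Git as the single source of truth. An agent running in or alongside the cluster (such as ArgoCD or Flux) continuously compares the live state with the repository and applies the difference.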

┌─────────────┐     ┌─────────────┐     ┌─────────────────────┐
│  Developer  │────►│    Git      │────►│    GitOps Agent     │
│  Push       │     │  Repository │     │ (ArgoCD/Flux)       │
└─────────────┘     └─────────────┘     └──────────┬──────────┘
                                                   │
                                                   ▼
                                        ┌─────────────────────┐
                                        │    Kubernetes       │
                                        │    Cluster          │
                                        └─────────────────────┘

ArgoCD Application

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/repo.git
    targetRevision: HEAD
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
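
Conceptually, the automated sync policy above is a reconciliation loop: compare the desired state in Git with the live state in the cluster, then create, update, or (with prune enabled) delete resources until they match. The sketch below shows only that comparison step with plain dictionaries; it is an illustration of the idea, not how ArgoCD is implemented, and plan_sync and the resource names are hypothetical.

```python
def plan_sync(desired: dict, live: dict, prune: bool = True) -> list:
    """Return the (action, name) steps needed to make live match desired."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            # Drift or a new commit: bring the live object back to the Git version
            actions.append(("update", name))
    if prune:
        # Resources that exist in the cluster but are no longer in Git
        for name in sorted(live.keys() - desired.keys()):
            actions.append(("delete", name))
    return actions


desired = {
    "web": {"image": "myapp:1.1.0", "replicas": 3},
    "worker": {"image": "worker:2.0.0", "replicas": 1},
}
live = {
    "web": {"image": "myapp:1.0.0", "replicas": 3},
    "legacy": {"image": "old:0.9.0", "replicas": 1},
}
print(plan_sync(desired, live))
# [('update', 'web'), ('create', 'worker'), ('delete', 'legacy')]
```

With prune disabled, the legacy resource would be left in place and reported as out of sync rather than deleted.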

Deployment Strategies

Blue-Green Deployment

Before:
┌─────────────────────────────────────────────────┐
│  Load Balancer  ───────────►  Blue (v1) ✓      │
│                               Green (v2)        │
└─────────────────────────────────────────────────┘

After (all traffic switched to Green; Blue stays idle for fast rollback):
┌─────────────────────────────────────────────────┐
│  Load Balancer                Blue (v1)         │
│                 ───────────►  Green (v2) ✓      │
└─────────────────────────────────────────────────┘
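
The switch itself is a single pointer change, which is what makes rollback fast. A minimal sketch of that logic follows; BlueGreenRouter and the health-check callback are hypothetical names, and a real setup would change a load balancer target or a Service selector instead of a Python attribute.

```python
class BlueGreenRouter:
    """Tracks which of two environments receives all traffic."""

    def __init__(self, active: str = "blue"):
        self.active = active

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def switch(self, is_healthy) -> bool:
        """Send traffic to the idle environment only if it passes the health check."""
        target = self.idle
        if not is_healthy(target):
            return False
        self.active = target
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment, which was left running."""
        self.active = self.idle


router = BlueGreenRouter()
switched = router.switch(lambda env: True)  # Green passed its checks
print(switched, router.active)  # True green
```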

Canary Deployment

┌──────────────────────────────────────────────────┐
│                  Load Balancer                   │
│                       │                          │
│          ┌────────────┴────────────┐            │
│          │                         │            │
│          ▼ (90%)                   ▼ (10%)      │
│    ┌──────────┐              ┌──────────┐      │
│    │  v1.0    │              │  v1.1    │      │
│    │ (stable) │              │ (canary) │      │
│    └──────────┘              └──────────┘      │
└──────────────────────────────────────────────────┘

Gradually increase canary traffic: 10% → 25% → 50% → 100%
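
The progression above is usually gated on metrics: the canary advances to the next weight only while its error rate stays under a threshold, and traffic goes back to the stable version otherwise. The following sketch assumes a 1% error threshold and a hypothetical error_rate_at callback that reports the canary's observed error rate at a given traffic weight.

```python
CANARY_STEPS = [10, 25, 50, 100]  # percent of traffic sent to the canary


def run_canary(error_rate_at, max_error_rate=0.01, steps=CANARY_STEPS):
    """Walk through the steps; return (final_weight, history).

    final_weight is 0 if any step breached the threshold (rolled back),
    otherwise the last step (fully promoted).
    """
    history = []
    for weight in steps:
        rate = error_rate_at(weight)
        history.append((weight, rate))
        if rate > max_error_rate:
            return 0, history
    return steps[-1], history


if __name__ == "__main__":
    # Canary that starts failing once it receives half of the traffic
    final, seen = run_canary(lambda w: 0.002 if w < 50 else 0.05)
    print(final, seen)
```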

Rolling Update

# Kubernetes rolling update strategy
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max pods above desired
      maxUnavailable: 1  # Max pods unavailable during update
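
These two settings bound how many pods exist during the rollout. With replicas: 4, maxSurge: 1 and maxUnavailable: 1, Kubernetes keeps at least 3 pods available and runs at most 5 in total. Both fields also accept percentages; maxSurge is rounded up and maxUnavailable is rounded down. The helper below is an illustrative calculation only (rollout_bounds is not a Kubernetes API, and the case where both values resolve to zero is not handled).

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Resolve an absolute number or a percentage string such as '25%' to a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> dict:
    surge = resolve(max_surge, replicas, round_up=True)               # rounded up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounded down
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


print(rollout_bounds(4, 1, 1))           # {'min_available': 3, 'max_total': 5}
print(rollout_bounds(10, "25%", "25%"))  # {'min_available': 8, 'max_total': 13}
```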

Secrets Management

HashiCorp Vault

import os

import hvac

# Initialize client
client = hvac.Client(url='https://vault.example.com')
client.token = os.environ['VAULT_TOKEN']

# Read secret (KV version 2 engine)
secret = client.secrets.kv.v2.read_secret_version(
    path='myapp/database'
)
db_password = secret['data']['data']['password']

# Dynamic secrets: Vault generates short-lived credentials on request
creds = client.secrets.database.generate_credentials(
    name='my-role'
)
db_username = creds['data']['username']
db_dynamic_password = creds['data']['password']  # keep out of logs
print(f"Username: {db_username}")
print(f"Lease duration (s): {creds['lease_duration']}")
# Vault revokes the credentials when the lease expires; request new ones before then

Kubernetes Secrets with External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: database-secret
  data:
  - secretKey: password
    remoteRef:
      key: secret/data/production/database
      property: password

Site Reliability Engineering (SRE) Concepts

Service Level Objectives (SLOs)

Term	Definition	Example
SLI (Indicator)	Metric that measures service level	Request latency p99
SLO (Objective)	Target value for SLI	p99 latency < 200ms
SLA (Agreement)	Contract with consequences	99.9% uptime or credits
Error Budget	Allowed unreliability (1 - SLO) before the SLO is breached	43.8 min/month for 99.9%

# Error budget calculation
# Uses a 30-day month; an average month (365/12 days) gives the
# 43.8 minutes quoted in the tables.
monthly_minutes = 30 * 24 * 60  # 43,200 minutes

slo_99_9 = 0.999
budget_99_9_minutes = monthly_minutes * (1 - slo_99_9)  # about 43.2 minutes

slo_99_99 = 0.9999
budget_99_99_minutes = monthly_minutes * (1 - slo_99_99)  # about 4.32 minutes
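
An error budget can also be tracked per request rather than per minute, which is often easier to measure. The sketch below, with made-up traffic numbers and a hypothetical error_budget_report helper, shows how much of a 99.9% budget a given number of failed requests consumes.

```python
def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based availability SLO."""
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        consumed = float("inf") if failed_requests else 0.0
    else:
        consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": consumed,
        "remaining_fraction": 1 - consumed,
    }


if __name__ == "__main__":
    # 1,000,000 requests this month at a 99.9% SLO allow roughly 1,000 failures
    report = error_budget_report(slo=0.999, total_requests=1_000_000, failed_requests=250)
    print({key: round(value, 3) for key, value in report.items()})
```

A negative remaining fraction means the budget is exhausted; many teams treat that as the signal to pause risky releases until reliability recovers.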

Availability Table

Availability	Downtime/Year	Downtime/Month	Downtime/Week
99%	3.65 days	7.31 hours	1.68 hours
99.9%	8.76 hours	43.8 minutes	10.1 minutes
99.95%	4.38 hours	21.9 minutes	5.04 minutes
99.99%	52.6 minutes	4.38 minutes	1.01 minutes
99.999%	5.26 minutes	26.3 seconds	6.05 seconds

Infrastructure as Code Best Practices

Terraform Module Structure

modules/
├── vpc/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── README.md
├── eks/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

environments/
├── dev/
│   ├── main.tf
│   ├── terraform.tfvars
│   └── backend.tf
├── staging/
│   └── ...
└── production/
    └── ...

Terraform State Management

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # State locking
  }
}

DevOps Toolchain

Category	Tools
Version Control	Git, GitHub, GitLab
CI/CD	GitHub Actions, Jenkins, GitLab CI, CircleCI
Containers	Docker, Podman, containerd
Orchestration	Kubernetes, Docker Swarm, Nomad
IaC	Terraform, Pulumi, CloudFormation, Ansible
Monitoring	Prometheus, Grafana, Datadog, New Relic
Logging	ELK Stack, Loki, Splunk
Tracing	Jaeger, Zipkin, AWS X-Ray
Secrets	Vault, AWS Secrets Manager, SOPS
GitOps	ArgoCD, Flux

Key Metrics: The “Four Golden Signals” (latency, traffic, errors, saturation). Monitor these for every service.

Interview Tip: Be ready to discuss the trade-offs of different deployment strategies, how you’d handle rollbacks, and your experience with specific tools in the DevOps toolchain.
