Overview
DevOps bridges development and operations, enabling faster, more reliable software delivery through automation and best practices.
CI/CD Pipeline
┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐   ┌────────┐
│  Code  │──►│ Build  │──►│  Test  │──►│ Deploy │──►│Monitor │
│ Commit │   │        │   │        │   │        │   │        │
└────────┘   └────────┘   └────────┘   └────────┘   └────────┘
                 CI                              CD
Example GitHub Actions Workflow
name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
      - name: Install dependencies
        run: npm ci
      - name: Run tests
        run: npm test
      - name: Build
        run: npm run build
  deploy:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - name: Deploy to production
        run: |
          # Deploy script here
          echo "Deploying to production..."
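The `needs` and `if` keys in the workflow decide when the deploy job may run. A minimal Python sketch of that logic follows; it is only a model of the ordering, not how GitHub Actions is implemented, and `JOBS`, `execution_order`, and `should_deploy` are hypothetical names introduced for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical job graph mirroring the workflow: job name -> jobs it needs.
JOBS = {
    "build-and-test": set(),
    "deploy": {"build-and-test"},
}


def execution_order(jobs: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which every job runs after the jobs it needs."""
    return list(TopologicalSorter(jobs).static_order())


def should_deploy(ref: str) -> bool:
    """Mirror the `if:` guard: deploy only when the ref is the main branch."""
    return ref == "refs/heads/main"


print(execution_order(JOBS))
```

Pull request runs carry a ref other than `refs/heads/main`, so under this guard they build and test but never deploy.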
Containers (Docker)
Docker Concepts
┌─────────────────────────────────────────────────┐
│ Host OS │
│ ┌─────────────────────────────────────────────┐│
│ │ Docker Engine ││
│ │ ┌───────────┐ ┌───────────┐ ┌───────────┐ ││
│ │ │ Container │ │ Container │ │ Container │ ││
│ │ │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │ ││
│ │ │ │ App │ │ │ │ App │ │ │ │ App │ │ ││
│ │ │ └─────┘ │ │ └─────┘ │ │ └─────┘ │ ││
│ │ │ Libs │ │ Libs │ │ Libs │ ││
│ │ └───────────┘ └───────────┘ └───────────┘ ││
│ └─────────────────────────────────────────────┘│
└─────────────────────────────────────────────────┘
Dockerfile Best Practices
# Use specific version, not 'latest'
FROM node:18-alpine
# Set working directory
WORKDIR /app
# Copy dependency files first (leverage cache)
COPY package*.json ./
# Install production dependencies only
RUN npm ci --omit=dev
# Copy application code
COPY . .
# Use non-root user
USER node
# Expose port
EXPOSE 3000
# Health check (Alpine images ship BusyBox wget, not curl)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -qO- http://localhost:3000/health || exit 1
# Start application
CMD ["node", "server.js"]
Docker Compose
version: '3.8'
services:
  web:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgres://db:5432/app
    depends_on:
      - db
      - redis
  db:
    image: postgres:14
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=secret
  redis:
    image: redis:alpine
volumes:
  postgres_data:
Kubernetes (K8s)
Core Concepts
┌────────────────────────────────────────────────────────┐
│ Cluster │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Control Plane │ │
│ │ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │ │
│ │ │ API │ │ Scheduler│ │ Controller │ │ │
│ │ │ Server │ │ │ │ Manager │ │ │
│ │ └──────────┘ └──────────┘ └──────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Nodes │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ │ │
│ │ │ Node 1 │ │ Node 2 │ │ │
│ │ │ ┌─────┐ ┌─────┐ │ │ ┌─────┐ ┌─────┐ │ │ │
│ │ │ │ Pod │ │ Pod │ │ │ │ Pod │ │ Pod │ │ │ │
│ │ │ └─────┘ └─────┘ │ │ └─────┘ └─────┘ │ │ │
│ │ └─────────────────┘ └─────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
K8s Resources
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: myapp:1.0.0
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "200m"
---
# Service
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 3000
  type: LoadBalancer
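Resource quantities such as `100m` (millicores) and `128Mi` (mebibytes) are easy to misread. The sketch below converts them to plain numbers and totals them across replicas. It handles only the suffixes used in this manifest plus `Ki` and `Gi`, and `parse_cpu` and `parse_memory` are hypothetical helpers, not part of any Kubernetes client.

```python
from decimal import Decimal

# Binary suffixes handled by this sketch; Kubernetes accepts more than these.
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as '100m' or '2' to whole cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


replicas = 3
total_cpu_request = parse_cpu("100m") * replicas        # cores requested by all pods
total_memory_limit = parse_memory("256Mi") * replicas   # bytes the pods may use in total
print(total_cpu_request, total_memory_limit)
```

With three replicas the Deployment requests 0.3 cores in total and can use at most 768 MiB of memory.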
Cloud Fundamentals
Major Cloud Providers Comparison
Service Type   AWS            Azure              GCP
Compute        EC2            Virtual Machines   Compute Engine
Containers     ECS/EKS        AKS                GKE
Serverless     Lambda         Functions          Cloud Functions
Storage        S3             Blob Storage       Cloud Storage
Database       RDS/DynamoDB   SQL/CosmosDB       Cloud SQL/Firestore
CDN            CloudFront     CDN                Cloud CDN
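# main.tf
provider "aws" {
  region = "us-east-1"
}
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t2.micro"
  tags = {
    Name = "web-server"
  }
}
resource "aws_s3_bucket" "static" {
  bucket = "my-static-assets"
}
# AWS provider v4+ configures versioning through a separate resource;
# the inline versioning block on aws_s3_bucket is deprecated
resource "aws_s3_bucket_versioning" "static" {
  bucket = aws_s3_bucket.static.id
  versioning_configuration {
    status = "Enabled"
  }
}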
Monitoring & Observability
Metrics
CPU, Memory, Disk
Request rate, Latency
Error rate
Tools: Prometheus, Grafana
Logs
Application logs
Access logs
Error logs
Tools: ELK Stack, Loki
Traces
Request flow
Service dependencies
Bottleneck detection
Tools: Jaeger, Zipkin
The Four Golden Signals
Signal       What to Measure                    Alert When
Latency      Request duration (p50, p95, p99)   p99 > SLA threshold
Traffic      Requests per second                Unusual spikes or drops
Errors       Error rate percentage              > 1% of requests failing
Saturation   Resource utilization               CPU/Memory > 80%
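The alert conditions in the table can be expressed as a small check. This is a sketch under stated assumptions: the 200 ms p99 threshold and the 50% deviation used for "unusual" traffic are illustrative values, and `percentile` and `golden_signal_alerts` are hypothetical helpers rather than part of any monitoring tool.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signal_alerts(latencies_ms, requests, errors, rps, baseline_rps,
                         utilization, p99_threshold_ms=200.0):
    """Return the names of the signals whose alert condition is met."""
    alerts = []
    if latencies_ms and percentile(latencies_ms, 99) > p99_threshold_ms:
        alerts.append("latency")
    # Assumption: "unusual" traffic means more than 50% away from a known baseline.
    if baseline_rps > 0 and abs(rps - baseline_rps) / baseline_rps > 0.5:
        alerts.append("traffic")
    if requests > 0 and errors / requests > 0.01:
        alerts.append("errors")
    if utilization > 0.80:
        alerts.append("saturation")
    return alerts


print(golden_signal_alerts([50.0] * 98 + [500.0] * 2, requests=1000, errors=20,
                           rps=10, baseline_rps=100, utilization=0.9))
```

In practice these rules usually live in the monitoring system itself, for example as Prometheus alerting rules, rather than in application code.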
Prometheus Metrics Example
from flask import Flask  # Assumption: the @app.route usage implies a Flask app
from prometheus_client import Counter, Histogram, Gauge

app = Flask(__name__)

# Counter - cumulative metric (only increases)
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
# Histogram - distribution of values
request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint'],
    buckets=[.01, .05, .1, .5, 1, 5]
)
# Gauge - value that can go up or down
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

def fetch_users():
    # Placeholder for the real data access call
    return {'users': []}

# Usage
@app.route('/api/users')
def get_users():
    # Time the handler body and record the duration in the histogram
    with request_duration.labels(endpoint='/api/users').time():
        result = fetch_users()
    http_requests_total.labels(
        method='GET', endpoint='/api/users', status='200'
    ).inc()
    return result
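Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and a final `+Inf` bucket counts them all. The pure-Python sketch below illustrates that bookkeeping with the same bucket bounds. `SimpleHistogram` is a hypothetical class for illustration, not the `prometheus_client` implementation.

```python
import bisect


class SimpleHistogram:
    """Toy histogram that reports cumulative bucket counts."""

    def __init__(self, buckets):
        # Upper bounds in ascending order, ending with the catch-all +Inf bucket.
        self.upper_bounds = sorted(buckets) + [float("inf")]
        self.counts = [0] * len(self.upper_bounds)  # per-bucket, non-cumulative
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound that is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value

    def cumulative(self):
        """Map each upper bound to the number of observations at or below it."""
        running, result = 0, {}
        for bound, count in zip(self.upper_bounds, self.counts):
            running += count
            result[bound] = running
        return result


histogram = SimpleHistogram([.01, .05, .1, .5, 1, 5])
for seconds in (0.03, 0.05, 0.2, 7):
    histogram.observe(seconds)
print(histogram.cumulative())
```

Because the counts are cumulative, a percentile can be estimated from bucket boundaries alone, which is why bucket bounds should bracket the latencies you care about.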
Structured Logging
import structlog

# Render events as JSON with an ISO-8601 UTC timestamp
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)
logger = structlog.get_logger()
# ❌ Bad: Unstructured logs
print(f"User {user_id} created order {order_id}")
# ✅ Good: Structured JSON logs
logger.info(
    "order_created",
    user_id=user_id,
    order_id=order_id,
    total=order.total,
    items_count=len(order.items),
    duration_ms=duration
)
# Output (JSON; timestamp precision depends on configuration):
# {"event": "order_created", "user_id": "123", "order_id": "456",
#  "total": 99.99, "items_count": 3, "duration_ms": 45,
#  "timestamp": "2024-01-15T10:30:00Z"}
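The same idea works without a third-party library. The sketch below uses only the standard `logging` and `json` modules to emit one JSON object per event; `JsonFormatter` and `log_event` are hypothetical names, and the field layout is an assumption chosen to resemble the output shown above.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
        }
        # Structured fields are attached to the record via `extra` in log_event.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def log_event(logger, event, **fields):
    """Log an event name together with key/value context."""
    logger.info(event, extra={"fields": fields})


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
order_logger = logging.getLogger("orders")
order_logger.setLevel(logging.INFO)
order_logger.addHandler(handler)
order_logger.propagate = False

log_event(order_logger, "order_created", user_id="123", order_id="456", items_count=3)
```

Keeping the event name constant and putting the variable parts in fields is what makes such logs searchable and aggregatable.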
GitOps
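GitOps stores infrastructure and application configuration in Git as the single source of truth. An agent running in the cluster compares the live state with the repository and applies any difference.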
┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐
│ Developer │────►│ Git │────►│ GitOps Agent │
│ Push │ │ Repository │ │ (ArgoCD/Flux) │
└─────────────┘ └─────────────┘ └──────────┬──────────┘
│
▼
┌─────────────────────┐
│ Kubernetes │
│ Cluster │
└─────────────────────┘
ArgoCD Application
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/repo.git
    targetRevision: HEAD
    path: k8s/overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
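One reconciliation pass of a GitOps agent can be sketched as a diff between the state declared in Git and the state found in the cluster. This is a simplified model, not ArgoCD's algorithm: `plan_sync` is a hypothetical function, and resources are reduced to name-to-spec dictionaries. The `prune` flag mirrors the setting above by deleting resources that are no longer declared; `selfHeal` governs whether such a pass also runs when the live state drifts, rather than only when Git changes.

```python
def plan_sync(desired: dict, live: dict, prune: bool = True) -> list[tuple[str, str]]:
    """Return the (action, resource name) steps that bring `live` in line with `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    if prune:
        # Anything running that Git no longer declares gets removed.
        for name in live:
            if name not in desired:
                actions.append(("delete", name))
    return actions


desired_state = {"web-app": {"replicas": 3}, "web-service": {"port": 80}}
live_state = {"web-app": {"replicas": 2}, "old-job": {}}
print(plan_sync(desired_state, live_state))
```

Without pruning, `old-job` would keep running even though the repository no longer mentions it.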
Deployment Strategies
Blue-Green Deployment
Before:
┌─────────────────────────────────────────────────┐
│ Load Balancer ───────────► Blue (v1) ✓ │
│ Green (v2) │
└─────────────────────────────────────────────────┘
After:
┌─────────────────────────────────────────────────┐
│ Load Balancer ───────────► Blue (v1) │
│ ───────────► Green (v2) ✓ │
└─────────────────────────────────────────────────┘
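The cut-over in the diagram amounts to flipping a single pointer, which is also what makes rollback cheap. A minimal sketch, assuming a health check result is available before switching; `BlueGreenRouter` is a hypothetical class for illustration.

```python
class BlueGreenRouter:
    """Track which of two identical environments receives all traffic."""

    def __init__(self, active: str = "blue"):
        self.active = active

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def cut_over(self, idle_is_healthy: bool) -> str:
        """Send all traffic to the idle environment, but only if it passed its checks."""
        if idle_is_healthy:
            self.active = self.idle
        return self.active

    def rollback(self) -> str:
        """Point traffic back at the previous environment, which is still running."""
        self.active = self.idle
        return self.active


router = BlueGreenRouter()
print(router.cut_over(idle_is_healthy=True))
print(router.rollback())
```

The trade-off is cost: both environments must be provisioned at full capacity for the duration of the release.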
Canary Deployment
┌──────────────────────────────────────────────────┐
│ Load Balancer │
│ │ │
│ ┌────────────┴────────────┐ │
│ │ │ │
│ ▼ (90%) ▼ (10%) │
│ ┌──────────┐ ┌──────────┐ │
│ │ v1.0 │ │ v1.1 │ │
│ │ (stable) │ │ (canary) │ │
│ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────────┘
Gradually increase canary traffic: 10% → 25% → 50% → 100%
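The ramp above can be sketched as two small functions: one that routes a request by weight, and one that advances or aborts the rollout based on the canary's error rate. The 1% error threshold and the names `choose_version` and `next_canary_percent` are assumptions made for illustration.

```python
import random

CANARY_STEPS = [10, 25, 50, 100]  # percent of traffic sent to the canary


def choose_version(canary_percent: float, rng: random.Random) -> str:
    """Route one request: 'canary' with probability canary_percent / 100."""
    return "canary" if rng.random() * 100 < canary_percent else "stable"


def next_canary_percent(current: int, canary_error_rate: float,
                        max_error_rate: float = 0.01) -> int:
    """Advance to the next step while the canary is healthy; otherwise drop to 0%."""
    if canary_error_rate > max_error_rate:
        return 0  # roll back: all traffic returns to the stable version
    later_steps = [step for step in CANARY_STEPS if step > current]
    return later_steps[0] if later_steps else current


rng = random.Random(42)
sample = [choose_version(10, rng) for _ in range(1000)]
print(sample.count("canary"), "of 1000 requests went to the canary")
print(next_canary_percent(10, canary_error_rate=0.002))
```

Compared with blue-green, a bad release here affects only the canary's share of traffic, at the cost of a slower rollout.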
Rolling Update
# Kubernetes rolling update strategy
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max pods above desired
      maxUnavailable: 1  # Max pods unavailable during update
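These two settings bound how many pods exist and how many are available at any moment during the update. The sketch below computes those bounds. Both fields also accept percentages, where Kubernetes rounds `maxSurge` up and `maxUnavailable` down; `resolve` and `rolling_update_bounds` are hypothetical helper names.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string such as '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rolling_update_bounds(replicas: int, max_surge, max_unavailable) -> dict:
    """Return the fewest available pods and the most total pods allowed mid-update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


print(rolling_update_bounds(4, 1, 1))
print(rolling_update_bounds(10, "25%", "25%"))
```

For the strategy above, at least 3 of the 4 pods stay available and at most 5 pods exist at once.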
Secrets Management
HashiCorp Vault
import os
import hvac

# Initialize client (the token is read from the environment)
client = hvac.Client(url='https://vault.example.com')
client.token = os.environ['VAULT_TOKEN']
# Read secret
secret = client.secrets.kv.v2.read_secret_version(
    path='myapp/database'
)
db_password = secret['data']['data']['password']
# Dynamic secrets (short-lived, generated on request)
creds = client.secrets.database.generate_credentials(
    name='my-role'
)
dynamic_user = creds['data']['username']
dynamic_password = creds['data']['password']
print(f"Username: {dynamic_user}")  # Avoid printing the password itself
# Credentials expire when their lease ends (creds['lease_duration'] seconds)
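Because dynamic credentials expire with their lease, the application has to fetch new ones in time. The sketch below caches a response and re-fetches shortly before the lease runs out. It assumes the response carries `lease_duration` (seconds) and `data`, as Vault responses for dynamic secrets do; `LeasedCredentials` and the 20% refresh margin are hypothetical choices.

```python
import time
from typing import Callable


class LeasedCredentials:
    """Cache dynamic credentials and re-fetch them before the lease expires."""

    def __init__(self, fetch: Callable[[], dict], refresh_margin: float = 0.2,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._refresh_margin = refresh_margin  # fraction of the lease kept as safety margin
        self._clock = clock
        self._data = None
        self._refresh_at = 0.0

    def get(self) -> dict:
        """Return current credentials, fetching new ones when the cached lease is nearly over."""
        now = self._clock()
        if self._data is None or now >= self._refresh_at:
            response = self._fetch()
            self._data = response["data"]
            usable_seconds = response["lease_duration"] * (1 - self._refresh_margin)
            self._refresh_at = now + usable_seconds
        return self._data
```

With the `hvac` client from the example above, the `fetch` argument could be a function that calls `client.secrets.database.generate_credentials(name='my-role')`.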
Kubernetes Secrets with External Secrets Operator
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: database-secret
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/production/database
        property: password
Site Reliability Engineering (SRE) Concepts
Service Level Objectives (SLOs)
Term              Definition                            Example
SLI (Indicator)   Metric that measures service level    Request latency p99
SLO (Objective)   Target value for SLI                  p99 latency < 200 ms
SLA (Agreement)   Contract with consequences            99.9% uptime or credits
Error Budget      Allowed downtime before SLO breach    43.8 min/month for 99.9%
# Error budget calculation (assumes a 30-day month)
monthly_minutes = 30 * 24 * 60  # 43,200 minutes
slo_99_9 = 0.999
error_budget_99_9 = monthly_minutes * (1 - slo_99_9)  # 43.2 minutes (43.8 for an average month of 365/12 days)
slo_99_99 = 0.9999
error_budget_99_99 = monthly_minutes * (1 - slo_99_99)  # 4.32 minutes
Availability Table
Availability   Downtime/Year   Downtime/Month   Downtime/Week
99%            3.65 days       7.31 hours       1.68 hours
99.9%          8.76 hours      43.8 minutes     10.1 minutes
99.95%         4.38 hours      21.9 minutes     5.04 minutes
99.99%         52.6 minutes    4.38 minutes     1.01 minutes
99.999%        5.26 minutes    26.3 seconds     6.05 seconds
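The table's figures follow from multiplying the period length by the allowed failure fraction, using a 365-day year and a month of one twelfth of a year. The sketch below reproduces that arithmetic and adds a remaining-budget helper; the function names and the 10-minute incident are illustrative assumptions.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

PERIODS = {
    "year": SECONDS_PER_YEAR,
    "month": SECONDS_PER_YEAR / 12,  # average month, as in the table
    "week": 7 * 24 * 60 * 60,
}


def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Downtime permitted in the period while still meeting the availability target."""
    return period_seconds * (1 - availability_percent / 100)


def remaining_budget_seconds(availability_percent: float, period_seconds: float,
                             downtime_seconds: float) -> float:
    """Error budget left after the downtime already incurred (negative means breached)."""
    return allowed_downtime_seconds(availability_percent, period_seconds) - downtime_seconds


monthly_budget = allowed_downtime_seconds(99.9, PERIODS["month"])
print(f"99.9% monthly budget: {monthly_budget / 60:.1f} minutes")
left = remaining_budget_seconds(99.9, PERIODS["month"], downtime_seconds=600)
print(f"After a 10-minute incident: {left / 60:.1f} minutes left")
```

A team that tracks the remaining budget can decide when to pause risky releases instead of discovering an SLO breach after the fact.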
Infrastructure as Code Best Practices
modules/
├── vpc/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   └── README.md
├── eks/
│   ├── main.tf
│   ├── variables.tf
│   └── outputs.tf
└── rds/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf
environments/
├── dev/
│   ├── main.tf
│   ├── terraform.tfvars
│   └── backend.tf
├── staging/
│   └── ...
└── production/
    └── ...
# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # State locking
  }
}
Category          Tools
Version Control   Git, GitHub, GitLab
CI/CD             GitHub Actions, Jenkins, GitLab CI, CircleCI
Containers        Docker, Podman, containerd
Orchestration     Kubernetes, Docker Swarm, Nomad
IaC               Terraform, Pulumi, CloudFormation, Ansible
Monitoring        Prometheus, Grafana, Datadog, New Relic
Logging           ELK Stack, Loki, Splunk
Tracing           Jaeger, Zipkin, AWS X-Ray
Secrets           Vault, AWS Secrets Manager, SOPS
GitOps            ArgoCD, Flux
Key Metric: The "Four Golden Signals" are latency, traffic, errors, and saturation. Monitor these for any service.
Interview Tip: Be ready to discuss the trade-offs of different deployment strategies, how you'd handle rollbacks, and your experience with specific tools in the DevOps toolchain.