Overview
DevOps bridges development and operations, enabling faster, more reliable software delivery through automation and best practices. Before DevOps, developers would “throw code over the wall” to an ops team, who would spend days manually deploying it, often discovering it did not work in production. DevOps eliminates this wall by making the team that builds the software also responsible for running it; Amazon calls this “you build it, you run it.” The result: organizations practicing DevOps deploy 200x more frequently with 24x faster recovery from failures (per the DORA State of DevOps reports).
CI/CD Pipeline
Example GitHub Actions Workflow
A well-designed pipeline catches bugs before they reach production. The principle is simple: every code change must pass through an automated gauntlet of builds, tests, and checks before it can deploy. If any step fails, the pipeline stops and alerts the developer immediately, not the ops team at 2 AM.
Containers (Docker)
Docker Concepts
Dockerfile Best Practices
A Dockerfile is a recipe for building a reproducible, portable application image. The order of instructions matters enormously for build speed because Docker caches each layer. Put things that change rarely (OS, dependencies) at the top and things that change often (your code) at the bottom.
Docker Compose
Docker Compose lets you define multi-container applications in a single file. It is indispensable for local development — one command (docker compose up) gives every developer an identical environment with all dependencies running locally. No more “you need to install PostgreSQL 14, Redis, and configure these 8 environment variables.”
depends_on only guarantees start order, not that the service is actually ready to accept connections. Use health checks or a wait-for-it script to ensure PostgreSQL is fully initialized before your app tries to connect.
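
A wait-for-it style check can be sketched in a few lines. This is a minimal sketch, not the actual wait-for-it script: the function name `wait_for_port` and its defaults are hypothetical. It also only proves that the TCP port accepts connections, so for PostgreSQL a real health check (for example, running a trivial query) is still the stronger signal.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: an open port does not guarantee the service
    behind it has finished initializing.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection to probe the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An application entrypoint could call `wait_for_port("db", 5432)` before opening its connection pool and exit with an error if it returns False; the host name `db` is an assumed Compose service name.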
Kubernetes (K8s)
Core Concepts
K8s Resources
Kubernetes manifests declare your desired state (“I want 3 replicas of my app”), and K8s continuously works to make reality match that declaration. If a pod crashes, K8s automatically creates a new one. If a node dies, K8s reschedules all its pods onto surviving nodes. This declarative model is what makes K8s powerful: you describe WHAT you want, not HOW to achieve it.
Cloud Fundamentals
Major Cloud Providers Comparison
| Service Type | AWS | Azure | GCP |
|---|---|---|---|
| Compute | EC2 | Virtual Machines | Compute Engine |
| Containers | ECS/EKS | AKS | GKE |
| Serverless | Lambda | Functions | Cloud Functions |
| Storage | S3 | Blob Storage | Cloud Storage |
| Database | RDS/DynamoDB | SQL/CosmosDB | Cloud SQL/Firestore |
| CDN | CloudFront | CDN | Cloud CDN |
Infrastructure as Code (Terraform)
Think of IaC like a recipe versus cooking by feel. Without IaC, setting up a server is like a chef who eyeballs ingredients and adjusts on the fly: the result is never exactly the same twice, and nobody else can reproduce it. With IaC, your infrastructure is a precise recipe: version-controlled, reviewable, repeatable. If your production environment burns down, you run terraform apply and it is rebuilt identically in minutes, not days.
Always run terraform plan before every terraform apply. The plan shows you exactly what Terraform intends to create, modify, or destroy. Treat it like a code diff and review it carefully, especially in production. A common horror story: engineers accidentally destroy a production database because they did not read the plan, which clearly showed aws_db_instance.main: destroy in red. Stripe, HashiCorp, and most mature teams require plan output to be attached to the pull request before any infrastructure change is approved.
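
Reading the plan can also be partly automated. The sketch below assumes the plan was saved and exported with `terraform show -json`, and that each entry under `resource_changes` carries an `address` and a `change.actions` list; the function name `resources_to_destroy` is hypothetical. It is a safety net for reviewers, not a replacement for reading the plan.

```python
import json


def resources_to_destroy(plan_json: str) -> list[str]:
    """Return the addresses of resources whose planned actions include a delete."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "delete" shows up for plain destroys and for destroy-and-recreate.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hand-written sample in the assumed shape, not real Terraform output.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_db_instance.main", "change": {"actions": ["delete"]}},
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
    ]
})
```

A CI step could fail the build whenever `resources_to_destroy` returns a non-empty list for a production workspace, forcing an explicit override.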
Monitoring & Observability
Metrics
- CPU, Memory, Disk
- Request rate, Latency
- Error rate
- Tools: Prometheus, Grafana
Logs
- Application logs
- Access logs
- Error logs
- Tools: ELK Stack, Loki
Traces
- Request flow
- Service dependencies
- Bottleneck detection
- Tools: Jaeger, Zipkin
The Four Golden Signals
Defined by Google’s SRE book, these four signals answer the question “Is my service healthy right now?” Monitor these for every service, and you will catch 90% of production issues before users notice them.
| Signal | What to Measure | Alert When | Why It Matters |
|---|---|---|---|
| Latency | Request duration (p50, p95, p99) | p99 > SLA threshold | Users perceive slowness as broken. A p99 spike means 1 in 100 requests is having a bad time. |
| Traffic | Requests per second | Unusual spikes or drops | A sudden drop often means something upstream broke. A spike could mean you are about to hit capacity. |
| Errors | Error rate percentage | > 1% of requests failing | Error rates creep up before full outages. Catching a 2% error rate prevents a 100% outage. |
| Saturation | Resource utilization | CPU/Memory > 80% | At 80%+ utilization, performance degrades non-linearly. 90% CPU feels 10x worse than 80%. |
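
The "Alert When" column can be sketched as a single check. This is an illustrative sketch, not how any particular monitoring tool evaluates alerts: the function names, the 500 ms p99 threshold, and the 0.5x/2x traffic band are assumptions, while the 1% error rate and 80% utilization thresholds come from the table.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_signals(latencies_ms, total_requests, failed_requests,
                  current_rps, baseline_rps, cpu_utilization,
                  p99_threshold_ms=500.0):
    """Return the names of the golden signals that are currently breaching."""
    alerts = []
    if percentile(latencies_ms, 99) > p99_threshold_ms:
        alerts.append("latency")
    # Assumed band: traffic below half or above double the baseline is unusual.
    if current_rps < 0.5 * baseline_rps or current_rps > 2.0 * baseline_rps:
        alerts.append("traffic")
    if failed_requests / total_requests > 0.01:
        alerts.append("errors")
    if cpu_utilization > 0.80:
        alerts.append("saturation")
    return alerts
```

For example, 100 latency samples where two exceed 900 ms, combined with 30 failures out of 1,000 requests, would report `["latency", "errors"]` while traffic and CPU stay within bounds.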
Prometheus Metrics Example
Prometheus uses a pull-based model: it scrapes metrics from your application’s /metrics endpoint at regular intervals. Your application exposes three core metric types, each suited to different measurement needs.
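
To make the metric types concrete, here is a standard-library sketch of a counter, a gauge, and a histogram, rendered in the Prometheus text exposition format. It assumes these are the three types the text has in mind (Prometheus also defines a summary type), and the classes are hypothetical stand-ins: a real service would use an official client library rather than this code.

```python
class Counter:
    """A value that only ever goes up, such as total requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """A value that can go up or down, such as in-flight requests."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations into cumulative buckets, such as request durations."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        self.count += 1
        self.total += value
        for i, bound in enumerate(self.bounds):
            # Buckets are cumulative: every bucket whose bound covers the value counts it.
            if value <= bound:
                self.bucket_counts[i] += 1


def render(name, metric):
    """Render one metric as text that a /metrics endpoint could serve."""
    if isinstance(metric, Histogram):
        lines = [f"# TYPE {name} histogram"]
        for bound, n in zip(metric.bounds, metric.bucket_counts):
            lines.append(f'{name}_bucket{{le="{bound}"}} {n}')
        lines.append(f'{name}_bucket{{le="+Inf"}} {metric.count}')
        lines.append(f"{name}_sum {metric.total}")
        lines.append(f"{name}_count {metric.count}")
        return "\n".join(lines)
    kind = "counter" if isinstance(metric, Counter) else "gauge"
    return f"# TYPE {name} {kind}\n{name} {metric.value}"
```

Counters suit rates (requests, errors), gauges suit current levels (queue depth, memory), and histograms suit latency, because percentiles can be estimated from the bucket counts.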
Structured Logging
Unstructured logs are write-only: easy to produce, nearly impossible to search. When you have 50 services each emitting thousands of log lines per second, “grepping for a string” does not scale. Structured logs (JSON format) let you query across millions of log entries: “show me all orders over $500 that took more than 1 second in the last hour.”
GitOps
GitOps stores infrastructure and application configs in Git as the single source of truth. The core idea: if you want to know what is running in production, look at the Git repo, not at the cluster, not at a dashboard, not at someone’s local Terraform state. Every change goes through a pull request, gets reviewed, and creates an audit trail. Rollback is just git revert. This eliminates the “someone SSH’d into production and made a change that nobody documented” failure mode.
ArgoCD Application
ArgoCD watches your Git repository and continuously reconciles the cluster state with what is declared in Git. If someone manually changes a resource in the cluster (kubectl edit, console click), ArgoCD detects the drift and either alerts you or automatically reverts it, depending on your sync policy. Think of it as a thermostat for your infrastructure: you set the desired temperature (Git state), and it continuously adjusts the actual temperature (cluster state) to match.
Start with selfHeal: false while your team adapts to GitOps. Engineers accustomed to kubectl apply will be frustrated when ArgoCD reverts their manual changes. Once everyone commits to the “change it in Git, not in the cluster” workflow, enable self-heal. Intuit, Red Hat, and most large-scale Kubernetes operators use ArgoCD with self-heal enabled in production.
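
The thermostat idea reduces to a compare-and-correct loop. The sketch below illustrates that idea with plain dictionaries; it is not ArgoCD's implementation or API, and the names `detect_drift` and `reconcile` are hypothetical.

```python
def detect_drift(desired: dict, live: dict) -> dict:
    """Map each drifted field to a (desired, live) pair."""
    drift = {}
    for key in desired.keys() | live.keys():
        if desired.get(key) != live.get(key):
            drift[key] = (desired.get(key), live.get(key))
    return drift


def reconcile(desired: dict, live: dict, self_heal: bool) -> tuple[dict, dict]:
    """Return (resulting cluster state, detected drift).

    With self_heal off, drift is only reported; with it on, the live
    state is overwritten with what Git declares.
    """
    drift = detect_drift(desired, live)
    if drift and self_heal:
        return dict(desired), drift
    return dict(live), drift
```

With Git declaring 3 replicas and someone scaling the cluster to 5 by hand, `reconcile` reports the drift either way, but only restores 3 replicas when `self_heal` is true.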
Deployment Strategies
Blue-Green Deployment
Run two identical production environments. One serves live traffic (“blue”), the other has the new version staged and tested (“green”). When ready, flip the load balancer to green. If anything goes wrong, flip back to blue in seconds: zero downtime, near-instant rollback. The trade-off: you need 2x the infrastructure (and budget) during the deployment window.
Canary Deployment
Named after the canary in a coal mine: send a small percentage of traffic to the new version first. If error rates or latency spike, roll back before most users are affected. This is how Google, Netflix, and Facebook deploy changes to billions of users: never all at once, always gradually. The key is automated analysis: your monitoring system should automatically compare canary metrics against baseline and halt the rollout if something looks wrong.
Rolling Update
The simplest zero-downtime deployment strategy: gradually replace old pods with new ones. K8s terminates one old pod, starts one new pod, waits for it to pass health checks, then repeats. At no point are all pods down simultaneously. This is the default K8s deployment strategy and works well for most stateless applications. The trade-off: during the rollout, both old and new versions serve traffic simultaneously, so your application must handle this (backward-compatible APIs, database migrations that work with both versions).
Set maxUnavailable: 0 and maxSurge: 1 for the safest (but slowest) rollout: capacity never drops below desired. For faster rollouts that tolerate brief capacity reduction, increase maxUnavailable. Always combine with readiness probes; without them, rolling updates are just “rolling restarts with traffic loss.”
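
The effect of the two settings can be seen in a small simulation. This is a simplified model, not the Deployment controller's actual algorithm: it assumes absolute pod counts rather than percentages, assumes every new pod passes its readiness probe immediately, and uses the hypothetical name `simulate_rolling_update`.

```python
def simulate_rolling_update(desired: int, max_surge: int, max_unavailable: int) -> dict:
    """Count rollout rounds and the lowest number of ready pods seen."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = desired, 0
    steps, min_ready = 0, desired
    while new < desired:
        # Remove old pods only down to the availability floor.
        removable = min(old, old + new - (desired - max_unavailable))
        old -= removable
        min_ready = min(min_ready, old + new)
        # Add new pods only up to the surge ceiling.
        creatable = min(desired - new, desired + max_surge - (old + new))
        new += creatable
        steps += 1
    # Any old pods left are removed afterwards; capacity stays at desired.
    return {"steps": steps, "min_ready": min_ready}
```

For 3 replicas, `maxSurge: 1` with `maxUnavailable: 0` takes three rounds and never drops below 3 ready pods, while `maxSurge: 0` with `maxUnavailable: 1` also takes three rounds but dips to 2.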
Secrets Management
HashiCorp Vault
Kubernetes Secrets with External Secrets Operator
Native Kubernetes Secrets are base64-encoded (not encrypted) and stored in etcd. Anyone with cluster access can decode them trivially. The External Secrets Operator solves this by syncing secrets from a secure external source (Vault, AWS Secrets Manager, GCP Secret Manager) into Kubernetes Secrets automatically. Your application code reads a normal K8s Secret, but the actual secret value lives in a hardened, access-controlled vault.
Never commit a secret to Git: even if you delete it in a later commit, git log preserves it forever. Use the External Secrets Operator or sealed-secrets (Bitnami) so that only encrypted references live in your repo.
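
The "encoded, not encrypted" point is easy to verify. The sketch below uses only the standard library; the helper names are hypothetical, and the value is a made-up example rather than anything read from a cluster.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Secret manifest's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding. No key is needed, which is the whole problem."""
    return base64.b64decode(encoded).decode("utf-8")
```

Anyone who can read the manifest can call the equivalent of `decode_secret_value` and recover the original value, so base64 protects nothing on its own.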
Site Reliability Engineering (SRE) Concepts
Service Level Objectives (SLOs)
SLOs are the foundation of SRE practice: they answer “how reliable does this service need to be?” and “when should we stop shipping features to fix reliability?” The error budget is the key insight: it turns reliability from an abstract goal into a concrete, spendable resource. If your error budget is healthy, ship features fast. If it is nearly exhausted, freeze deployments and fix reliability.
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | The metric you actually measure | Request latency p99, successful request ratio |
| SLO (Objective) | The target threshold for your SLI | p99 latency < 200ms, 99.9% of requests succeed |
| SLA (Agreement) | A contract with financial consequences if the SLO is missed | 99.9% uptime or customer receives service credits |
| Error Budget | How much unreliability you can “spend” before breaching SLO | 43.8 min/month for 99.9% — once spent, freeze all risky changes |
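
Treating the budget as a spendable resource can be sketched for a request-based SLO. The function name `error_budget` and the rule of freezing once 100% of the budget is consumed are assumptions for illustration; many teams pick an earlier threshold.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Report how much of a request-based error budget has been spent.

    slo is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures > 0:
        consumed = failed_requests / allowed_failures
    else:
        # A 100% SLO leaves no budget at all.
        consumed = float("inf") if failed_requests else 0.0
    return {
        "allowed_failures": allowed_failures,
        "consumed_fraction": consumed,
        "freeze_deploys": consumed >= 1.0,  # assumed policy threshold
    }
```

With a 99.9% SLO and one million requests in the window, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget and shipping can continue.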
Availability Table
| Availability | Downtime/Year | Downtime/Month | Downtime/Week |
|---|---|---|---|
| 99% | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% | 8.76 hours | 43.8 minutes | 10.1 minutes |
| 99.95% | 4.38 hours | 21.9 minutes | 5.04 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds | 6.05 seconds |
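
The table values follow from simple arithmetic. This sketch assumes a 365-day year and a month defined as one twelfth of that year, which matches the figures in the table; the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_percent: float) -> dict:
    """Return permitted downtime in minutes per year, month, and week."""
    unavailable = 1 - availability_percent / 100
    minutes_in = {
        "year": 365 * 24 * 60,
        "month": 365 * 24 * 60 / 12,  # average month, assumed
        "week": 7 * 24 * 60,
    }
    return {period: unavailable * total for period, total in minutes_in.items()}
```

For 99.9% this gives about 525.6 minutes per year (8.76 hours), 43.8 minutes per month, and 10.1 minutes per week.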
Infrastructure as Code Best Practices
Terraform Module Structure
Terraform State Management
Terraform state is the mapping between your configuration files and real-world resources. If the state file is lost or corrupted, Terraform cannot manage your infrastructure: it will try to create duplicate resources or lose track of existing ones. Always store state remotely with locking enabled to prevent two engineers from running terraform apply simultaneously and corrupting the state.
DevOps Toolchain
| Category | Tools |
|---|---|
| Version Control | Git, GitHub, GitLab |
| CI/CD | GitHub Actions, Jenkins, GitLab CI, CircleCI |
| Containers | Docker, Podman, containerd |
| Orchestration | Kubernetes, Docker Swarm, Nomad |
| IaC | Terraform, Pulumi, CloudFormation, Ansible |
| Monitoring | Prometheus, Grafana, Datadog, New Relic |
| Logging | ELK Stack, Loki, Splunk |
| Tracing | Jaeger, Zipkin, AWS X-Ray |
| Secrets | Vault, AWS Secrets Manager, SOPS |
| GitOps | ArgoCD, Flux |
Interview Deep-Dive
It is 2 AM and your pager fires. The error rate for your main API has jumped from 0.1% to 15% in the last 5 minutes. Walk me through your incident response, step by step.
- Step 0 (first 30 seconds): Acknowledge the alert, check if anyone else is already responding (incident channel in Slack/PagerDuty), and open the runbook for this service if one exists. Do not start debugging before establishing communication.
- Step 1 (first 2 minutes): Determine blast radius. Check the Grafana dashboard for the Four Golden Signals: which endpoints are failing, is it all traffic or a subset, what error codes (5xx vs 4xx), and is latency also degraded? A 15% error rate across all endpoints suggests a systemic issue (infrastructure, database, dependency). A 15% error rate on one endpoint suggests an application bug.
- Step 2 (next 3 minutes): Check what changed. Was there a deployment in the last 30 minutes? (kubectl rollout history or check the CD pipeline). If yes, the fastest mitigation is a rollback: kubectl rollout undo deployment/api-server. A rollback takes 30-60 seconds and buys you time to diagnose without users suffering. If there was no deployment, check if a downstream dependency is degraded (database CPU, Redis connection count, third-party API status page).
- Step 3: Look at the logs. Use structured log queries filtered to the error-producing requests: status=500 AND timestamp > "2am-5min". Look for the common thread: is it a specific user, a specific database query timing out, a nil pointer exception, a certificate expiration?
- Step 4: Mitigate first, fix later. If the database is overloaded, enable a circuit breaker or serve cached responses. If a specific bad request pattern is causing crashes, add a temporary block rule at the API gateway. The goal at 2 AM is to stop the bleeding, not to write a beautiful fix.
- Step 5: After mitigating, write a brief incident timeline in the channel (“15% errors started at 2:03, caused by X, mitigated at 2:18 by Y”) and decide if it can wait until morning for a proper fix or needs immediate attention.
Your team deploys 3 times a week and wants to move to continuous deployment. What needs to be true about your testing, monitoring, and infrastructure before you can do this safely?
- Moving from 3 deploys per week to continuous deployment is not primarily a tooling problem — it is a confidence problem. You need enough automated verification that you trust the pipeline to deploy without a human gatekeeper.
- Testing prerequisites: (1) Unit test coverage above 80% on critical paths with fast execution (under 2 minutes). (2) Integration tests that cover the top 20 user journeys and run against a production-like environment with real databases, not mocks. (3) Contract tests between services (Pact) so that a change in service A does not break service B. (4) Smoke tests that run POST-deployment against the live environment to catch configuration issues that tests cannot simulate.
- Monitoring prerequisites: (1) Real-time error rate tracking with alerting thresholds (if error rate exceeds 2x baseline within 5 minutes of deploy, auto-rollback). (2) Latency monitoring at p50, p95, p99 — a deploy that passes all tests but doubles p99 latency should be flagged. (3) Business metric monitoring — order completion rate, signup conversion — because some bugs are only visible through business metrics, not technical ones. (4) Distributed tracing so you can quickly pinpoint which service a regression originates from.
- Infrastructure prerequisites: (1) Canary or blue-green deployment capability so new code gets a fraction of traffic first. (2) Automated rollback triggered by monitoring thresholds. (3) Feature flags to decouple code deployment from feature activation — you deploy code continuously but enable features gradually. (4) Database migrations that are backward-compatible (expand-contract pattern) so the old code version can still run during rollout.
- The organizational prerequisite most people overlook: the team must agree that “the pipeline is the gatekeeper, not a person.” This is a cultural shift. Senior engineers who are used to manually reviewing every deployment need to trust the automated checks. This trust is built gradually by running both systems in parallel (automated deploy + manual approval) until the team is confident the automation catches everything the human would.
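
The auto-rollback rule from the monitoring prerequisites (error rate above 2x baseline within 5 minutes of a deploy) can be sketched as a pure function. The name `should_rollback` and the sample format are hypothetical; in practice the samples would come from your metrics system and the decision would trigger the pipeline's rollback step.

```python
def should_rollback(baseline_error_rate: float,
                    samples: list[tuple[float, float]],
                    factor: float = 2.0,
                    window_seconds: float = 300.0) -> bool:
    """Decide whether a fresh deploy should be rolled back.

    samples holds (seconds_since_deploy, error_rate) pairs.
    """
    for elapsed, rate in samples:
        # Only samples inside the post-deploy window count against the release.
        if elapsed <= window_seconds and rate > factor * baseline_error_rate:
            return True
    return False
```

With a 0.1% baseline, a 0.4% reading two minutes after the deploy triggers a rollback, while the same reading seven minutes later falls outside the window and is left to normal alerting.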
gh-ost (GitHub) or pt-online-schema-change (Percona) handle large table migrations without locking. The key constraint: NEVER deploy a migration that drops a column or changes a column type in the same deployment as the code that depends on the change. The old code version must be able to run against both the old and new schema simultaneously.
Compare Kubernetes resource requests and limits. What happens if you set them wrong, and how do you determine the right values for a new service?
- Requests are the guaranteed minimum resources a pod gets. The Kubernetes scheduler uses requests to decide which node has enough capacity for the pod. If you request 256Mi of memory, the scheduler ensures the node has at least 256Mi available. If you set requests too low, the scheduler packs too many pods onto one node, causing resource contention. If you set them too high, you waste cluster capacity because the reserved resources sit unused.
- Limits are the maximum resources a pod can use. Exceeding the memory limit results in an OOMKill — Kubernetes terminates the container immediately. Exceeding the CPU limit results in throttling — the container is not killed but slowed down. This distinction matters: a memory leak hits a hard wall (crash and restart), while a CPU spike gets a soft degradation (requests slow down but succeed).
- The most dangerous misconfiguration is setting no limits at all. A memory leak in one pod can consume all memory on the node, triggering OOMKills on OTHER pods that happen to share the same node. One misbehaving pod takes down three healthy services. This is called a “noisy neighbor” problem.
- For a new service, I would determine values empirically: (1) Deploy with generous limits (4x what you estimate) and observe actual usage in staging under realistic load for 48-72 hours. (2) Use Prometheus metrics (container_memory_working_set_bytes, container_cpu_usage_seconds_total) to find the p99 usage. (3) Set requests to the p50 usage (typical steady state) and limits to 2x the p99 (headroom for spikes). (4) In production, use Vertical Pod Autoscaler (VPA) in recommendation mode to continuously suggest adjustments based on real usage.
- A common mistake: setting requests equal to limits. This guarantees the pod a fixed resource allocation (Guaranteed QoS class in Kubernetes), which sounds safe but wastes resources. If the pod typically uses 100Mi but spikes to 200Mi during peak, setting both to 200Mi wastes 100Mi 99% of the time. Instead, set requests to 100Mi and limits to 250Mi.
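
The sizing rule above (requests at p50, limits at 2x p99) can be sketched as follows. The function names are hypothetical, the nearest-rank percentile is one of several reasonable definitions, and the samples are assumed to be memory readings in MiB collected during the observation period.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_memory_resources(samples_mib: list[float]) -> dict:
    """Suggest a memory request and limit from observed usage."""
    p50 = percentile(samples_mib, 50)
    p99 = percentile(samples_mib, 99)
    return {
        "request_mib": p50,    # typical steady state
        "limit_mib": 2 * p99,  # headroom for spikes
    }
```

Treat the output as a starting point for the manifest and keep checking it against VPA recommendations once the service sees production traffic.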
Plot container_memory_working_set_bytes over time. A steadily increasing line that never decreases is a memory leak. A sawtooth pattern (grows, garbage collection drops it, grows again) with peaks exceeding the limit is a legitimate spike that needs a higher limit. For a memory leak in a JVM-based application, the container memory limit must account for both heap and off-heap memory (thread stacks, native memory, metaspace). A common trap: setting the JVM’s -Xmx to the same value as the container memory limit. The JVM uses ~30-50% more memory than the heap for other purposes, so a container with a 512Mi limit and -Xmx512m is likely to be OOMKilled once the heap fills. Set -Xmx to about 60-70% of the container limit. For a Python or Node.js service, check for unreleased file handles, growing caches without eviction, or large response bodies buffered in memory. Tools like memory_profiler (Python) or --inspect (Node.js) help profile memory use and capture heap snapshots for analysis.
Your company uses Terraform to manage infrastructure. An engineer runs terraform apply in production without running terraform plan first and accidentally destroys a database. How do you prevent this from happening again?
- First, the immediate response: restore from the most recent backup (which exists because we have automated backups with point-in-time recovery, right?). Then we hold a blameless postmortem to understand the systemic failure — the question is not “who did this?” but “why did our process allow this to happen?”
- Prevention layer 1: Remove direct Terraform access from individual engineers. All Terraform changes go through a CI/CD pipeline (Atlantis, Terraform Cloud, or a GitHub Actions workflow). The pipeline runs terraform plan, posts the plan output as a comment on the pull request, requires peer review of the plan, and only then runs terraform apply. No human ever runs terraform apply locally against production.
- Prevention layer 2: Implement policy-as-code using Sentinel (Terraform Enterprise) or Open Policy Agent (OPA). Write policies like: “No resource of type aws_db_instance may be destroyed without explicit approval from the database team” and “No changes to production during the hours of 10 PM to 6 AM.” These policies are checked automatically during the plan phase and block violations.
- Prevention layer 3: Protect critical resources with lifecycle { prevent_destroy = true } in the Terraform configuration. This makes Terraform refuse to create a plan that destroys the resource. You must explicitly remove this flag (which goes through code review) before destruction is possible.
- Prevention layer 4: Enable S3 bucket versioning on the Terraform state file and DynamoDB locking. Versioning lets you recover from state corruption, and locking prevents two people from running terraform apply simultaneously.
- Prevention layer 5: Separate Terraform state files by environment and by risk level. The database infrastructure should be in a separate state file from the application infrastructure. A developer deploying a new Lambda function should not even have the database in their blast radius.