CI/CD for Microservices

Microservices unlock the ability to deploy services independently, but this requires sophisticated CI/CD pipelines. Here is the paradox: the whole point of microservices is independent deployment, yet most teams end up with a monolithic deployment pipeline that deploys everything together. This chapter covers building truly independent, production-grade pipelines where each service has its own build-test-deploy lifecycle. The goal is that a single team can ship a fix to their service in 15 minutes without coordinating with anyone — that is the microservices promise, and CI/CD is what actually delivers it.
Learning Objectives:
  • Design CI/CD pipelines for microservices
  • Implement deployment strategies (Blue-Green, Canary, Rolling)
  • Set up GitOps with ArgoCD
  • Configure automated testing in pipelines
  • Build multi-service deployment orchestration

CI/CD Challenges in Microservices

Before diving into pipeline code, it helps to understand WHY microservices CI/CD is so much harder than monolith CI/CD. In a monolith, you have one build, one test suite, one deploy target, and one rollback story. Every commit exercises the whole codebase, so the “blast radius” of a pipeline change is bounded. With microservices, you have N pipelines (one per service), plus shared pipelines for integration, contract testing, and orchestration. A single feature often spans 3-5 services, and each must deploy in the correct order without breaking compatibility. The temptation — which almost every team falls into first — is to build one “god pipeline” that knows about all services and deploys them together. This seems simpler but actively destroys the microservices value proposition: you end up with a distributed monolith where every deploy is a release train. The better path is harder up-front: invest in per-service pipelines, automated contract testing to catch cross-service breakage, and GitOps as the deployment substrate. The payoff is deploy frequency measured in hundreds per day rather than weekly release windows.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    MONOLITH vs MICROSERVICES CI/CD                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  MONOLITH:                                                                   │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  ┌────────────┐    ┌──────────┐    ┌───────────┐    ┌────────────┐         │
│  │   Commit   │───▶│  Build   │───▶│   Test    │───▶│   Deploy   │         │
│  │            │    │  (1 app) │    │  (1 suite)│    │  (1 target)│         │
│  └────────────┘    └──────────┘    └───────────┘    └────────────┘         │
│                                                                              │
│  ✓ Simple pipeline                                                          │
│  ✓ One build, one deploy                                                    │
│  ✗ All or nothing deployment                                               │
│  ✗ Long deployment cycles                                                   │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════════│
│                                                                              │
│  MICROSERVICES:                                                              │
│  ─────────────────────────────────────────────────────────────────────────  │
│                                                                              │
│  ┌────────────┐    ┌──────────────────────────────────────────────┐        │
│  │  Commit    │    │              PARALLEL PIPELINES               │        │
│  │  Service A │───▶│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │        │
│  └────────────┘    │  │ Build  │▶│  Test  │▶│ Scan   │▶│ Deploy │ │        │
│                    │  └────────┘ └────────┘ └────────┘ └────────┘ │        │
│  ┌────────────┐    │                                               │        │
│  │  Commit    │───▶│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │        │
│  │  Service B │    │  │ Build  │▶│  Test  │▶│ Scan   │▶│ Deploy │ │        │
│  └────────────┘    │  └────────┘ └────────┘ └────────┘ └────────┘ │        │
│                    │                                               │        │
│  ┌────────────┐    │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │        │
│  │  Commit    │───▶│  │ Build  │▶│  Test  │▶│ Scan   │▶│ Deploy │ │        │
│  │  Service C │    │  └────────┘ └────────┘ └────────┘ └────────┘ │        │
│  └────────────┘    └──────────────────────────────────────────────┘        │
│                                         │                                   │
│                                         ▼                                   │
│                    ┌──────────────────────────────────────────────┐        │
│                    │            INTEGRATION TESTS                  │        │
│                    │     (Contract tests, E2E tests)              │        │
│                    └──────────────────────────────────────────────┘        │
│                                                                              │
│  ✓ Independent deployments                                                  │
│  ✓ Faster feedback loops                                                    │
│  ✓ Targeted rollbacks                                                       │
│  ✗ Complex orchestration                                                    │
│  ✗ Dependency management                                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Caveats & Common Pitfalls: Pipeline Bottlenecks and Deploy/Release Coupling

CI/CD problems that silently compound until they become unignorable:
  • The 30-minute pipeline that blocks every team. What started as a 5-minute pipeline now runs for half an hour because tests accumulated, build caches broke, and security scans got added to critical path. Engineers start batching changes to avoid paying the cost, which makes each deploy riskier. The feedback loop inverts: long pipelines cause bigger, rarer deploys, which break more often, which justifies more gates, which makes the pipeline longer.
  • Deploy and release coupled together. “Deploy to production” and “turn feature on for users” are the same action. This means every deploy is user-visible, every rollback is user-visible, and you cannot ship incomplete work safely. The fix (feature flags) is often not adopted until after an embarrassing incident.
  • Canary deploys with no automated rollback criteria. Team configures a canary, bumps to 10%, then 25%, with a human staring at a Grafana dashboard. The human context-switches, misses the error-rate rise for 8 minutes, by which time 25% of users hit the bug. Without automation, canary is just a slower outage.
  • Staging drift post-deploy. Engineer deploys a hotfix to production, marks the incident resolved, forgets to deploy the same fix to staging. Two weeks later, a feature test in staging passes because staging is effectively running old broken code. The divergence grows until staging is useless as a test environment.
Solutions & Patterns:
  • Measure pipeline p95 and alert on regressions. Your pipeline is a production system; treat its SLO seriously. When p95 build time creeps past 15 minutes, schedule a cleanup sprint. Track the blockers: slow tests, missing caches, sequential steps that could parallelize.
  • Separate deploy from release with feature flags. Deploy code to production with the new path behind a flag defaulting to off. Release later by flipping the flag. Rollback is flag-level, not deploy-level — which means it is measured in seconds.
  • Automate canary analysis, do not watch dashboards. Tools like Flagger or Argo Rollouts compare canary vs stable metrics on a schedule and auto-rollback if thresholds are exceeded. Human-in-the-loop is fine as a final check but should never be the primary signal.
  • Make staging and production deployment atomic. Use GitOps where a merge to main triggers parallel deploys to both environments. If prod gets a hotfix, the same commit deploys to staging within minutes. Divergence becomes impossible by construction.
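
The automated canary analysis described above boils down to a comparison between canary and stable metrics on every interval. The following is a minimal sketch of that decision, not how Flagger or Argo Rollouts implement it; the metric shape (`requests`, `errors`, `p95LatencyMs`), the function names, and every threshold value are assumptions chosen for illustration.

```javascript
// Hypothetical thresholds; tune them to your own SLOs.
const DEFAULT_THRESHOLDS = {
  maxErrorRateDelta: 0.01,  // canary error rate may exceed stable by at most 1 percentage point
  maxP95LatencyRatio: 1.2,  // canary p95 latency may be at most 20% above stable
  minRequests: 100          // below this the canary has too little traffic to judge
};

function errorRate(sample) {
  return sample.requests === 0 ? 0 : sample.errors / sample.requests;
}

// Returns 'promote', 'rollback', or 'wait' for one analysis interval.
function evaluateCanary(stable, canary, thresholds = DEFAULT_THRESHOLDS) {
  if (canary.requests < thresholds.minRequests) return 'wait';

  const delta = errorRate(canary) - errorRate(stable);
  if (delta > thresholds.maxErrorRateDelta) return 'rollback';

  const latencyLimit = stable.p95LatencyMs * thresholds.maxP95LatencyRatio;
  if (canary.p95LatencyMs > latencyLimit) return 'rollback';

  return 'promote';
}
```

A scheduler would call `evaluateCanary` once per interval with fresh metrics and shift traffic (or revert it) based on the verdict, leaving the human as a final check rather than the primary signal.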

Pipeline Architecture

CI Pipeline Stages: Why They Exist and What They Replace

Before the era of CI, “integration” meant a developer manually pulling everyone’s changes onto their laptop at the end of a sprint, spending two days reconciling conflicts, and then praying the resulting build worked in production. Integration bugs surfaced weeks after the code was written, long after the author had context-switched to something else. The classic failure mode was the “integration week”: a scheduled period where the team did nothing but fix merge conflicts and environment-specific bugs. CI replaces this with continuous, automated integration on every commit: small changes, tested in isolation, merged frequently. The eight-stage pipeline (build and unit test, security scan, integration test, contract test, push, deploy staging, E2E, deploy prod) exists because each stage catches a specific class of bug earlier and more cheaply than the stage after it.

What goes wrong without it? Real story: a fintech team skipped contract testing because “our services rarely change their APIs.” Three months later, a senior engineer renamed a field from userId to user_id in the auth service, ran the auth service’s unit tests (they passed), deployed to production, and took down payments, notifications, and order history simultaneously. All consumers broke at once because none of them had been rebuilt against the new contract. Contract testing would have caught this at PR review. The general pattern: every stage you skip in CI is a bug class you are deferring to production incidents.

The key decision is when to block vs warn. A failed unit test should block the pipeline: there is no ambiguity, the code is broken. A security scan finding a HIGH severity vulnerability should block. But a MEDIUM severity vulnerability or a code coverage drop of 2% should warn, not block, or you will train the team to bypass the pipeline. The rule of thumb: block on things that would cause a production incident; warn on things that are quality signals. If you block on everything, engineers learn to hate CI and find ways around it. If you block on nothing, the pipeline is theater.
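
The block-versus-warn rule of thumb can be written down as an explicit policy so it is reviewed like any other code. The sketch below is one possible encoding under stated assumptions: the finding shape (`kind`, `severity`, `dropPercent`) and the function names are hypothetical, and treating CRITICAL the same as HIGH is an inference from the rule above rather than something the text spells out.

```javascript
// Hypothetical finding shape:
//   { kind: 'unit-test' | 'vulnerability' | 'coverage-drop', severity?: string, dropPercent?: number }
function classifyFinding(finding) {
  switch (finding.kind) {
    case 'unit-test':
      return 'block'; // a failing unit test means the code is broken
    case 'vulnerability':
      // HIGH blocks per the rule of thumb; CRITICAL is assumed to block as well.
      return ['CRITICAL', 'HIGH'].includes(finding.severity) ? 'block' : 'warn';
    case 'coverage-drop':
      return 'warn'; // quality signal, not an incident
    default:
      return 'warn'; // unknown signals surface as warnings rather than silent passes
  }
}

// Aggregates findings into a single gate decision for the pipeline.
function gateDecision(findings) {
  const classified = findings.map((f) => ({ ...f, action: classifyFinding(f) }));
  const blocking = classified.filter((f) => f.action === 'block');
  const warnings = classified.filter((f) => f.action === 'warn');
  return { pass: blocking.length === 0, blocking, warnings };
}
```

A pipeline step could exit non-zero when `pass` is false and post `warnings` as a PR comment, which keeps warnings visible without stopping the merge.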

Service-Specific Pipelines

Every service gets its own pipeline, triggered only by changes to its own directory (or to shared dependencies it relies on). This path-filtering is the single most important optimization — without it, a change to Service A triggers rebuilds and redeploys across all services, collapsing the independence you paid so much architectural complexity to gain. If you skipped this and used one global pipeline, you would reintroduce the monolithic deployment problem: every commit takes the slowest service’s build time, and a flaky test in Service B blocks Service A’s release. The pipeline below has eight stages arranged in a DAG: build and unit test, security scan, integration tests, contract tests (run in parallel with integration), push to registry, deploy to staging, E2E tests against staging, and finally deploy to production gated by human approval. Each stage is a gate — failing any stage stops the pipeline. The critical architectural decision here is that the artifact (the Docker image) is built exactly once in the build stage and passed through the rest of the pipeline. Rebuilding in each stage would waste compute and, worse, create the possibility that “the image we tested” is not “the image we deployed.”
# .github/workflows/order-service.yml
name: Service Pipeline

on:
  push:
    paths:
      - 'services/order-service/**'
      - '.github/workflows/order-service.yml'
  pull_request:
    paths:
      - 'services/order-service/**'

env:
  SERVICE_NAME: order-service
  REGISTRY: ghcr.io/${{ github.repository_owner }}
  
jobs:
  # Stage 1: Build and Unit Test
  build:
    runs-on: ubuntu-latest
    outputs:
      version: ${{ steps.version.outputs.version }}
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
          cache-dependency-path: services/${{ env.SERVICE_NAME }}/package-lock.json
      
      - name: Install dependencies
        working-directory: services/${{ env.SERVICE_NAME }}
        run: npm ci
      
      - name: Run linting
        working-directory: services/${{ env.SERVICE_NAME }}
        run: npm run lint
      
      - name: Run unit tests
        working-directory: services/${{ env.SERVICE_NAME }}
        run: npm test -- --coverage
      
      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: services/${{ env.SERVICE_NAME }}/coverage/lcov.info
          flags: ${{ env.SERVICE_NAME }}
      
      - name: Generate version
        id: version
        run: |
          VERSION=$(date +%Y%m%d)-${{ github.run_number }}-${GITHUB_SHA::8}
          echo "version=$VERSION" >> $GITHUB_OUTPUT
      
      - name: Build Docker image
        uses: docker/build-push-action@v5
        with:
          context: services/${{ env.SERVICE_NAME }}
          push: false
          tags: ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:${{ steps.version.outputs.version }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          outputs: type=docker,dest=/tmp/image.tar
      
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: docker-image
          path: /tmp/image.tar

  # Stage 2: Security Scanning
  security:
    needs: build
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # needed to upload SARIF results
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: docker-image
          path: /tmp
      
      - name: Load image
        run: docker load --input /tmp/image.tar
      
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:${{ needs.build.outputs.version }}
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
      
      - name: Upload Trivy scan results
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'
      
      - name: Run Snyk
        uses: snyk/actions/node@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          args: --severity-threshold=high
          command: test

  # Stage 3: Integration Tests
  integration:
    needs: [build, security]
    runs-on: ubuntu-latest
    
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: testpassword
          POSTGRES_DB: testdb
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      
      redis:
        image: redis:7
        ports:
          - 6379:6379
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: docker-image
          path: /tmp
      
      - name: Load image
        run: docker load --input /tmp/image.tar
      
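      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'

      # The test runner needs the service's dependencies installed on the runner.
      - name: Install dependencies
        working-directory: services/${{ env.SERVICE_NAME }}
        run: npm ci

      - name: Run integration tests
        working-directory: services/${{ env.SERVICE_NAME }}
        env:
          DATABASE_URL: postgresql://postgres:testpassword@localhost:5432/testdb
          REDIS_URL: redis://localhost:6379
        run: npm run test:integration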

  # Stage 4: Contract Tests
  contract-tests:
    needs: build
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      
      - name: Install dependencies
        working-directory: services/${{ env.SERVICE_NAME }}
        run: npm ci
      
      - name: Run contract tests
        working-directory: services/${{ env.SERVICE_NAME }}
        run: npm run test:contract
      
      - name: Publish pacts
        if: github.ref == 'refs/heads/main'
        working-directory: services/${{ env.SERVICE_NAME }}
        env:
          PACT_BROKER_URL: ${{ secrets.PACT_BROKER_URL }}
          PACT_BROKER_TOKEN: ${{ secrets.PACT_BROKER_TOKEN }}
        run: npm run pact:publish

  # Stage 5: Push to Registry
  push:
    needs: [build, security, integration, contract-tests]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      packages: write  # needed to push images to ghcr.io with GITHUB_TOKEN
    
    steps:
      - name: Download artifact
        uses: actions/download-artifact@v4
        with:
          name: docker-image
          path: /tmp
      
      - name: Load image
        run: docker load --input /tmp/image.tar
      
      - name: Login to Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      
      - name: Push image
        run: |
          docker push ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:${{ needs.build.outputs.version }}
          docker tag ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:${{ needs.build.outputs.version }} \
                     ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:latest
          docker push ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:latest

  # Stage 6: Deploy to Staging
  deploy-staging:
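    # 'build' must be a direct dependency, otherwise needs.build.outputs.version is empty here.
    needs: [build, push]
    runs-on: ubuntu-latest
    environment: staging
    permissions:
      contents: write  # needed for the manifest commit pushed below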
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Update Kubernetes manifest
        run: |
          cd k8s/overlays/staging
          kustomize edit set image ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}=${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:${{ needs.build.outputs.version }}
      
      - name: Commit and push
        run: |
          git config --global user.name 'GitHub Actions'
          git config --global user.email 'actions@github.com'
          git add .
          git commit -m "Deploy ${{ env.SERVICE_NAME }}:${{ needs.build.outputs.version }} to staging"
          git push

  # Stage 7: E2E Tests on Staging
  e2e-staging:
    needs: deploy-staging
    runs-on: ubuntu-latest
    
    steps:
      - uses: actions/checkout@v4
      
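      # Assumes the runner already has cluster credentials (kubeconfig) configured.
      - name: Wait for deployment
        run: |
          kubectl rollout status deployment/${{ env.SERVICE_NAME }} -n staging --timeout=300s

      # Assumes tests/e2e has its own package.json and lockfile.
      - name: Install E2E dependencies
        working-directory: tests/e2e
        run: npm ci

      - name: Run E2E tests
        working-directory: tests/e2e
        env:
          BASE_URL: https://staging.example.com
        run: npm run test:e2e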

  # Stage 8: Deploy to Production (with approval)
  # The 'environment: production' key enables GitHub's environment protection rules.
  # Configure this to require manual approval from a team lead before production deploys.
  # This is your last human checkpoint before code reaches real users.
  deploy-production:
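    # 'build' must be a direct dependency, otherwise needs.build.outputs.version is empty here.
    needs: [build, e2e-staging]
    runs-on: ubuntu-latest
    environment: production
    permissions:
      contents: write  # needed for the manifest commit pushed below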
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Update production manifest
        run: |
          cd k8s/overlays/production
          kustomize edit set image ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}=${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:${{ needs.build.outputs.version }}
      
      - name: Commit and push
        run: |
          git config --global user.name 'GitHub Actions'
          git config --global user.email 'actions@github.com'
          git add .
          git commit -m "Deploy ${{ env.SERVICE_NAME }}:${{ needs.build.outputs.version }} to production"
          git push
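
The path filtering that the workflow above relies on can also be expressed as a small function, which is useful when one orchestrating job has to decide which service pipelines to trigger. This is a rough sketch under assumptions: the `services/<name>/` layout comes from the workflow above, while the `libs/common/` shared directory and the service names mapped to it are hypothetical.

```javascript
// Maps changed file paths to the services whose pipelines should run.
// sharedDeps: hypothetical map of shared path prefix -> services that depend on it.
function affectedServices(changedFiles, sharedDeps = {}) {
  const affected = new Set();

  for (const file of changedFiles) {
    // A change under services/<name>/ affects only that service.
    const match = /^services\/([^/]+)\//.exec(file);
    if (match) {
      affected.add(match[1]);
      continue;
    }
    // A change to a shared dependency affects every service that declared it.
    for (const [prefix, services] of Object.entries(sharedDeps)) {
      if (file.startsWith(prefix)) {
        services.forEach((service) => affected.add(service));
      }
    }
  }

  return [...affected].sort();
}
```

Files that match neither a service directory nor a declared shared prefix (a README, for example) trigger nothing, which is the behavior that keeps one service's commits from rebuilding the others.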

Writing Integration Tests for the Pipeline

Integration tests verify that your service works end-to-end against real infrastructure (a real database, a real cache, a real message queue). Unit tests can lie — they use mocks that can drift from reality — but integration tests tell you whether the deployed artifact actually functions. The CI pipeline above spins up ephemeral Postgres and Redis containers specifically so integration tests have real dependencies. A common mistake is to write integration tests that just mock more layers; if you mock the database in integration tests, you have a slower unit test.
// services/order-service/tests/integration/orders.test.js
const request = require('supertest');
const { Pool } = require('pg');
const { createApp } = require('../../src/app');

describe('Orders API (integration)', () => {
  let app;
  let db;

  beforeAll(async () => {
    db = new Pool({ connectionString: process.env.DATABASE_URL });
    await db.query('CREATE TABLE IF NOT EXISTS orders (id SERIAL PRIMARY KEY, user_id TEXT, total NUMERIC)');
    app = createApp({ db });
  });

  afterAll(async () => {
    await db.query('DROP TABLE orders');
    await db.end();
  });

  beforeEach(async () => {
    await db.query('TRUNCATE orders');
  });

  it('creates an order and persists it', async () => {
    const response = await request(app)
      .post('/orders')
      .send({ userId: 'u-123', total: 49.99 })
      .expect(201);

    expect(response.body).toMatchObject({ userId: 'u-123', total: 49.99 });

    const { rows } = await db.query('SELECT * FROM orders WHERE id = $1', [response.body.id]);
    expect(rows).toHaveLength(1);
  });

  it('returns 404 for missing orders', async () => {
    await request(app).get('/orders/99999').expect(404);
  });
});

Database Migrations in the Pipeline

Migrations are the most dangerous part of any deploy. A broken migration can corrupt production data in seconds and take hours to roll back. The pipeline needs to run migrations automatically (manual migrations are an incident waiting to happen), but also safely, which means running them against a disposable test database first, running them with a timeout, and structuring them to be backward compatible so a rollback does not leave the schema in a broken state. The knex script below is the migration runner CI should invoke before deploying new application code.
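// services/order-service/scripts/run-migrations.js
const { execSync } = require('child_process');

if (!process.env.DATABASE_URL) throw new Error('DATABASE_URL is required');

// Inherit stdio so logs reach CI, and kill a hung migration after 5 minutes.
// The timeout value is an assumption; tune it per service.
const opts = { stdio: 'inherit', env: process.env, timeout: 5 * 60 * 1000 };

try {
  console.log('Running migrations via knex...');
  execSync('npx knex migrate:latest --env production', opts);
  console.log('Migrations complete');
} catch (err) {
  console.error(`Migration failed: ${err.message}`);
  try {
    // Rolls back the most recent batch. Confirm that is the batch that just
    // failed before relying on this; otherwise it can undo an earlier release.
    execSync('npx knex migrate:rollback --env production', opts);
  } catch (rollbackErr) {
    console.error(`Rollback also failed: ${rollbackErr.message}`);
  }
  process.exit(1);
}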
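
To make the "backward compatible" requirement above concrete, here is a minimal sketch of the expand step of an expand-contract change written as a knex migration. The column name `shipping_notes` and the file name are hypothetical; the point is that the new column is nullable, so the previous application version, which never writes it, keeps working if you roll the code back.

```javascript
// migrations/20240101000000_add_shipping_notes.js (hypothetical file name)

// Expand: add the column as nullable so old code that ignores it still inserts rows.
function up(knex) {
  return knex.schema.alterTable('orders', (table) => {
    table.text('shipping_notes').nullable();
  });
}

// Reverse of the expand step. Only safe while no deployed version depends on the column.
function down(knex) {
  return knex.schema.alterTable('orders', (table) => {
    table.dropColumn('shipping_notes');
  });
}

module.exports = { up, down };
```

The contract step (adding a NOT NULL constraint or dropping an old column) would ship as a separate, later migration once every running version reads and writes the new column.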

Deployment Strategies

Blue-Green Deployment

Blue-green deployment is the “belt-and-suspenders” approach to zero-downtime deploys: run two full copies of your service (blue and green), keep all traffic on one while you deploy to the other, then flip traffic over in a single step. The beauty is the near-instant rollback: if v2 misbehaves, flip the traffic back to blue and you are back on v1 within seconds. The cost is infrastructure: you pay for 2x capacity during the transition. The alternative, rolling updates where pods are replaced a few at a time, is cheaper but exposes you to mixed-version states for several minutes. During a rolling update, some requests hit v1 and others hit v2 simultaneously. If v1 and v2 have any API incompatibility (even subtle ones like a renamed field or a changed default), you get intermittent errors that are brutal to debug. Blue-green largely eliminates this class of bug because new requests are routed to only one version at a time. The tradeoff: a Kubernetes Service selector flip is still not truly atomic at the edge (in-flight requests to old pods complete while new requests go to new pods, and endpoint updates take a moment to propagate), so you still need drain periods and connection-level graceful shutdowns for true zero downtime.
┌─────────────────────────────────────────────────────────────────────────────┐
│                      BLUE-GREEN DEPLOYMENT                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  STEP 1: Current State (Blue is Active)                                     │
│  ─────────────────────────────────────────                                  │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────┐           │
│  │                       Load Balancer                          │           │
│  └───────────────────────────┬──────────────────────────────────┘           │
│                              │                                               │
│                              ▼                                               │
│  ┌────────────────────────────────────┐    ┌─────────────────────────────┐  │
│  │           BLUE (v1.0)              │    │       GREEN (idle)          │  │
│  │  ┌────────┐ ┌────────┐ ┌────────┐  │    │                             │  │
│  │  │ Pod 1  │ │ Pod 2  │ │ Pod 3  │  │    │        (empty)              │  │
│  │  └────────┘ └────────┘ └────────┘  │    │                             │  │
│  │         ✅ ACTIVE                   │    │                             │  │
│  └────────────────────────────────────┘    └─────────────────────────────┘  │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════════│
│                                                                              │
│  STEP 2: Deploy New Version to Green                                        │
│  ───────────────────────────────────────                                    │
│                                                                              │
│  ┌────────────────────────────────────┐    ┌─────────────────────────────┐  │
│  │           BLUE (v1.0)              │    │       GREEN (v2.0)          │  │
│  │  ┌────────┐ ┌────────┐ ┌────────┐  │    │ ┌────────┐ ┌────────┐      │  │
│  │  │ Pod 1  │ │ Pod 2  │ │ Pod 3  │  │    │ │ Pod 1  │ │ Pod 2  │      │  │
│  │  └────────┘ └────────┘ └────────┘  │    │ └────────┘ └────────┘      │  │
│  │         ✅ ACTIVE                   │    │    🔄 DEPLOYING            │  │
│  └────────────────────────────────────┘    └─────────────────────────────┘  │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════════│
│                                                                              │
│  STEP 3: Switch Traffic (Instant Cutover)                                   │
│  ────────────────────────────────────────                                   │
│                                                                              │
│  ┌──────────────────────────────────────────────────────────────┐           │
│  │                       Load Balancer                          │           │
│  └───────────────────────────────────────────────┬──────────────┘           │
│                                                  │                          │
│                                                  ▼                          │
│  ┌────────────────────────────────────┐    ┌─────────────────────────────┐  │
│  │           BLUE (v1.0)              │    │       GREEN (v2.0)          │  │
│  │  ┌────────┐ ┌────────┐ ┌────────┐  │    │ ┌────────┐ ┌────────┐      │  │
│  │  │ Pod 1  │ │ Pod 2  │ │ Pod 3  │  │    │ │ Pod 1  │ │ Pod 2  │      │  │
│  │  └────────┘ └────────┘ └────────┘  │    │ └────────┘ └────────┘      │  │
│  │        🔄 STANDBY                   │    │     ✅ ACTIVE              │  │
│  └────────────────────────────────────┘    └─────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
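
The drain period and graceful shutdown mentioned above can be sketched on the application side as a SIGTERM handler. This is an illustration under assumptions, not a complete implementation: `installGracefulShutdown`, the `drainTimeoutMs` default, and the injectable `proc` parameter (used only to make the function testable) are hypothetical names.

```javascript
// server: anything exposing close(callback), such as a Node http.Server.
// proc: defaults to the real process; injectable so the logic can be tested.
function installGracefulShutdown(server, { drainTimeoutMs = 10000, proc = process } = {}) {
  let shuttingDown = false;

  proc.on('SIGTERM', () => {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;

    // Safety net: exit non-zero if connections have not drained in time.
    const timer = setTimeout(() => proc.exit(1), drainTimeoutMs);

    // Stop accepting new connections; the callback fires once existing ones finish.
    server.close(() => {
      clearTimeout(timer);
      proc.exit(0);
    });
  });

  // A readiness endpoint can consult this and start failing as soon as shutdown begins.
  return { isShuttingDown: () => shuttingDown };
}
```

Keep `drainTimeoutMs` below the pod's `terminationGracePeriodSeconds` so the process exits on its own terms before Kubernetes sends SIGKILL.
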
The Kubernetes manifests below implement blue-green with two Deployments (one for each slot) and a Service selector that picks one slot. To “flip” traffic, you patch the Service’s selector from slot: blue to slot: green. It is that simple mechanically — the hard parts are schema compatibility and readiness probes. Without readiness probes, Kubernetes would send traffic to green pods before they finish warming up (JIT compilation, cache priming, connection pool establishment), and your “instant cutover” becomes “instant brownout.” Always configure readiness probes that actually verify the service can handle work, not just that the process is alive.
# blue-green/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-green
  labels:
    app: order-service
    slot: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
      slot: green
  template:
    metadata:
      labels:
        app: order-service
        slot: green
        version: v2.0.0
    spec:
      containers:
      - name: order-service
        image: myregistry/order-service:v2.0.0
        ports:
        - containerPort: 3000
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health/live
            port: 3000
          initialDelaySeconds: 15
          periodSeconds: 10
---
# Switch service selector to point to green
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service
    slot: green  # Change to 'blue' to rollback
  ports:
  - port: 80
    targetPort: 3000
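
On the application side, the `/health/ready` and `/health/live` paths referenced by the probes above need handlers that behave differently: liveness only says the process is up, readiness verifies the service can do work. The sketch below is one way to write that with a plain Node request handler; `createHealthHandler` and the `checks` map are hypothetical names, and which dependencies to check is an assumption that depends on the service.

```javascript
// checks: hypothetical map of dependency name -> async function that throws when
// the dependency is unusable, e.g. { db: () => pool.query('SELECT 1') }.
function createHealthHandler(checks) {
  return async function handle(req, res) {
    const json = { 'Content-Type': 'application/json' };

    if (req.url === '/health/live') {
      // Liveness: the process is running and able to answer at all.
      res.writeHead(200, json);
      res.end(JSON.stringify({ status: 'alive' }));
      return;
    }

    if (req.url === '/health/ready') {
      // Readiness: every dependency must respond before traffic is accepted.
      const failures = [];
      for (const [name, check] of Object.entries(checks)) {
        try {
          await check();
        } catch (err) {
          failures.push(name);
        }
      }
      const ready = failures.length === 0;
      res.writeHead(ready ? 200 : 503, json);
      res.end(JSON.stringify({ status: ready ? 'ready' : 'not-ready', failures }));
      return;
    }

    res.writeHead(404, json);
    res.end(JSON.stringify({ error: 'not found' }));
  };
}

// Usage (not executed here):
//   require('http').createServer(createHealthHandler({ db: () => pool.query('SELECT 1') })).listen(3000);
```

Because readiness returns 503 until the checks pass, Kubernetes keeps the green pods out of the Service endpoints while they warm up, which is what prevents the cutover from turning into a brownout.
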
Production pitfall: Blue-green deployments require your database schema to be backward-compatible between v1 and v2. If v2 adds a NOT NULL column without a default, v1 never writes to it, so the moment you switch traffic back to v1 (rollback), v1’s writes will fail. Always use the expand-contract pattern for schema changes alongside blue-green deploys.

Blue-Green deployment script: The script below automates the four steps of a blue-green deploy: detect which slot is currently active, patch the inactive slot’s Deployment with the new image, wait for its pods to become Ready, then flip the Service selector. Everything is driven through the Kubernetes API: no kubectl shell-outs, no imperative config edits. This is important because it makes the deploy idempotent and auditable: if the script crashes mid-flight, you can re-run it and it will pick up from wherever the cluster state is. The rollback method is symmetric: it just flips the selector back to the previous slot, which is why rollback takes seconds rather than minutes.
// scripts/blue-green-deploy.js
const k8s = require('@kubernetes/client-node');

class BlueGreenDeployer {
  constructor(serviceName, namespace = 'default') {
    this.serviceName = serviceName;
    this.namespace = namespace;
    
    const kc = new k8s.KubeConfig();
    kc.loadFromDefault();
    this.appsApi = kc.makeApiClient(k8s.AppsV1Api);
    this.coreApi = kc.makeApiClient(k8s.CoreV1Api);
  }

  async getCurrentSlot() {
    const service = await this.coreApi.readNamespacedService(
      this.serviceName, 
      this.namespace
    );
    return service.body.spec.selector.slot;
  }

  async getInactiveSlot() {
    const current = await this.getCurrentSlot();
    return current === 'blue' ? 'green' : 'blue';
  }

  async deployToInactive(imageTag) {
    const inactiveSlot = await this.getInactiveSlot();
    const deploymentName = `${this.serviceName}-${inactiveSlot}`;
    
    console.log(`Deploying ${imageTag} to ${inactiveSlot} slot...`);
    
    // Update deployment
    const patch = {
      spec: {
        template: {
          spec: {
            containers: [{
              name: this.serviceName,
              image: `myregistry/${this.serviceName}:${imageTag}`
            }]
          }
        }
      }
    };
    
    await this.appsApi.patchNamespacedDeployment(
      deploymentName,
      this.namespace,
      patch,
      undefined,
      undefined,
      undefined,
      undefined,
      undefined,
      { headers: { 'Content-Type': 'application/strategic-merge-patch+json' } }
    );
    
    // Wait for rollout
    await this.waitForRollout(deploymentName);
    
    return inactiveSlot;
  }

  async waitForRollout(deploymentName, timeout = 300) {
    const start = Date.now();
    
    while (Date.now() - start < timeout * 1000) {
      const deployment = await this.appsApi.readNamespacedDeployment(
        deploymentName,
        this.namespace
      );
      
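      const { metadata, spec, status } = deployment.body;
      const desired = spec.replicas ?? 1;
      const ready = status.readyReplicas ?? 0;      // field is omitted while it is zero
      const updated = status.updatedReplicas ?? 0;
      // Only trust the counts once the controller has observed the patched spec;
      // otherwise the previous rollout's status can look "ready" immediately.
      const observed = (status.observedGeneration ?? 0) >= metadata.generation;
      if (observed && updated === desired && ready === desired &&
          (status.replicas ?? 0) === desired) {
        console.log(`✅ Deployment ${deploymentName} is ready`);
        return;
      }
      
      console.log(`⏳ Waiting... (${ready}/${desired} ready)`);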
      await new Promise(r => setTimeout(r, 5000));
    }
    
    throw new Error(`Deployment ${deploymentName} timed out`);
  }

  async runHealthChecks(slot) {
    console.log(`Running health checks on ${slot}...`);
    
    // Get pods in the slot
    const pods = await this.coreApi.listNamespacedPod(
      this.namespace,
      undefined,
      undefined,
      undefined,
      undefined,
      `app=${this.serviceName},slot=${slot}`
    );
    
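    // An empty list would otherwise pass the loop below vacuously
    if (pods.body.items.length === 0) {
      throw new Error(`No pods found in slot ${slot}`);
    }

    for (const pod of pods.body.items) {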
      const ready = pod.status.conditions?.find(c => c.type === 'Ready');
      if (!ready || ready.status !== 'True') {
        throw new Error(`Pod ${pod.metadata.name} is not ready`);
      }
    }
    
    console.log(`✅ All pods healthy in ${slot}`);
  }

  async switchTraffic(newSlot) {
    console.log(`Switching traffic to ${newSlot}...`);
    
    const patch = {
      spec: {
        selector: {
          app: this.serviceName,
          slot: newSlot
        }
      }
    };
    
    await this.coreApi.patchNamespacedService(
      this.serviceName,
      this.namespace,
      patch,
      undefined,
      undefined,
      undefined,
      undefined,
      undefined,
      { headers: { 'Content-Type': 'application/strategic-merge-patch+json' } }
    );
    
    console.log(`✅ Traffic switched to ${newSlot}`);
  }

  async rollback() {
    const currentSlot = await this.getCurrentSlot();
    const previousSlot = currentSlot === 'blue' ? 'green' : 'blue';
    
    console.log(`🔄 Rolling back from ${currentSlot} to ${previousSlot}...`);
    await this.switchTraffic(previousSlot);
    console.log(`✅ Rollback complete`);
  }

  async deploy(imageTag) {
    try {
      // Deploy to inactive slot
      const newSlot = await this.deployToInactive(imageTag);
      
      // Run health checks
      await this.runHealthChecks(newSlot);
      
      // Switch traffic
      await this.switchTraffic(newSlot);
      
      console.log(`\n✅ Blue-green deployment complete!`);
      console.log(`   Active slot: ${newSlot}`);
      console.log(`   Image: ${imageTag}`);
      
    } catch (error) {
      console.error('Deployment failed:', error.message);
      console.log('Run rollback to switch back to previous version');
      throw error;
    }
  }
}

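// Usage
const deployer = new BlueGreenDeployer('order-service');
deployer.deploy('v2.0.0').catch(() => {
  // deploy() has already logged the error; exit non-zero so the CI job fails
  process.exit(1);
});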

Post-Deploy Validation with Health Checks

The moment after a deploy completes, you want automated confirmation that the new pods are not just “Ready” from Kubernetes’ perspective but actually serving real traffic correctly. A pod can be Ready and still be broken if, for example, it connects to the database at startup (passing the readiness probe) but returns 500 on every request because a config flag is missing. The script below runs a burst of real HTTP requests against the service right after deploy, measures error rates, and fails the deploy if errors exceed a threshold.
// scripts/post-deploy-validate.js
const axios = require('axios');

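// `path` defaults to the readiness endpoint; point it at a representative
// read-only business endpoint to catch pods that are Ready but still broken.
async function validate({ baseUrl, path = '/health/ready', durationSec = 60, threshold = 0.01 }) {
  if (!baseUrl) throw new Error('baseUrl is required (set SERVICE_URL)');
  const end = Date.now() + durationSec * 1000;
  let total = 0, failed = 0;

  while (Date.now() < end) {
    total += 1;
    try {
      // axios rejects on non-2xx by default; accept every status here so the
      // 5xx check below is the single place that decides what counts as an error.
      const resp = await axios.get(`${baseUrl}${path}`, {
        timeout: 2000,
        validateStatus: () => true
      });
      if (resp.status >= 500) failed += 1;
    } catch (err) {
      // Network error or timeout
      failed += 1;
    }
    await new Promise(r => setTimeout(r, 500));
  }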

  const errorRate = failed / total;
  console.log(`Checked ${total} requests, ${failed} errors (${(errorRate * 100).toFixed(2)}%)`);
  if (errorRate > threshold) {
    throw new Error(`Error rate ${errorRate} exceeds threshold ${threshold}`);
  }
}

validate({ baseUrl: process.env.SERVICE_URL }).catch(err => {
  console.error(err.message);
  process.exit(1);
});

Canary Deployment

Canary deployment takes a different philosophy: instead of flipping all traffic at once, send a tiny slice (5%) to v2 while 95% stays on v1. Watch the metrics. If the canary misbehaves, only 5% of users are affected and you roll back. If it looks healthy, increase to 10%, then 25%, then 50%, then 100%. You are essentially A/B-testing your release in production, with live users as the judges. This is the gold standard for high-traffic services where even a small regression would cost real money or reputation.
Why does canary deployment exist? It replaces the traditional “big bang” release where a team shipped on a Friday at 5pm, went home, and got paged at 8pm when 100% of users hit the new bug. That pattern is where the phrase “read-only Friday” came from: teams learned to stop deploying on Fridays because every deploy was a coin flip. Canary turns the coin flip into a graduated bet: the bug still exists, but only 5% of users meet it, and the automation aborts the deploy before it spreads.
What goes wrong without canary: Knight Capital’s 2012 incident, where a flawed deploy across 8 production servers cost the firm about $440M in 45 minutes. A canary with even a 1-minute bake time could have caught the faulty trading logic at roughly $5M of losses, not $440M.
Why not just use blue-green everywhere? Because blue-green gives you an all-or-nothing bet. If v2 has a subtle bug that only shows up under real production load (a race condition at 10,000 QPS, a memory leak that manifests after 30 minutes), blue-green will hit 100% of users with that bug the instant you flip traffic. Canary lets you discover the bug at 5% and bail out. The cost is complexity: canary requires traffic-splitting infrastructure (Istio, Linkerd, or a smart load balancer) and automated analysis (is the error rate in the canary statistically higher than stable?).
The naive version, “deploy one canary pod alongside nine stable pods and let Kubernetes’ round-robin split traffic,” works but gives you no control over percentages or header-based routing: the split is fixed by the replica ratio, so one canary pod out of ten receives roughly 10% of traffic.
The key decision for canary: what metric do you use to decide success? Error rate alone is not enough. A service can have identical error rates between v1 and v2 while v2 takes 5x longer to respond, silently degrading user experience. Production canary analysis compares error rate, latency percentiles (p50, p95, p99), and business metrics (order completion rate, checkout success) between canary and stable. If any metric diverges by more than a threshold, auto-rollback. The temptation is to use too many metrics; the failure mode is that one noisy metric causes constant aborts and the team disables canary analysis entirely.
# canary/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-canary
spec:
  replicas: 1  # Start with 1 canary replica
  selector:
    matchLabels:
      app: order-service
      track: canary
  template:
    metadata:
      labels:
        app: order-service
        track: canary
        version: v2.0.0
    spec:
      containers:
      - name: order-service
        image: myregistry/order-service:v2.0.0
---
# Main deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service-stable
spec:
  replicas: 9  # 9 stable replicas
  selector:
    matchLabels:
      app: order-service
      track: stable
  template:
    metadata:
      labels:
        app: order-service
        track: stable
        version: v1.0.0
    spec:
      containers:
      - name: order-service
        image: myregistry/order-service:v1.0.0
---
# Service routes to both
apiVersion: v1
kind: Service
metadata:
  name: order-service
spec:
  selector:
    app: order-service  # Routes to both stable and canary
  ports:
  - port: 80
    targetPort: 3000
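With the replica-ratio approach above, the canary's share of traffic is approximately its share of pods, assuming the Service spreads requests evenly across them. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of any Kubernetes API.

```javascript
// Approximate traffic share when a plain Service balances evenly across pods.
// Assumption: every pod receives roughly the same number of requests.
function canaryShare(canaryReplicas, stableReplicas) {
  const total = canaryReplicas + stableReplicas;
  if (total <= 0) throw new Error('at least one replica is required');
  return canaryReplicas / total;
}

// Smallest total pod count at which `canaryReplicas` pods receive
// at most `targetPercent` of traffic.
function totalReplicasFor(targetPercent, canaryReplicas = 1) {
  if (targetPercent <= 0 || targetPercent > 100) {
    throw new Error('targetPercent must be in (0, 100]');
  }
  return Math.ceil((canaryReplicas * 100) / targetPercent);
}

console.log(canaryShare(1, 9));     // 0.1
console.log(totalReplicasFor(5));   // 20
```

So the 1 + 9 manifests above give the canary about 10% of traffic, and reaching 5% with a single canary pod would take 19 stable pods. This coupling of percentage to pod count is what mesh-based traffic splitting removes.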
Progressive Canary with Argo Rollouts: Argo Rollouts takes the canary concept and makes it declarative and fully automated. Instead of manually setting weights and watching Grafana, you describe the canary steps (5% → 10% → 25% → 50% → 75% → 100%) and the analysis rules (“success rate must be >= 99%”). Argo Rollouts orchestrates the traffic shifting through your service mesh (Istio here) and automatically pauses, aborts, or promotes based on the metrics. If the analysis fails, traffic is rolled back automatically. This turns canary deployment from “senior engineer babysits Grafana for 40 minutes” into “push a commit and go get coffee.” The tradeoff is setup cost: you need a service mesh, a metrics provider (Prometheus), and analysis templates tuned to your service’s actual behavior — which means you need good observability before you can automate progressive delivery.
# argo-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: myregistry/order-service:v2.0.0
        ports:
        - containerPort: 3000
  strategy:
    canary:
      # Canary steps
      steps:
      - setWeight: 5
      - pause: { duration: 2m }
      - setWeight: 10
      - pause: { duration: 5m }
      - setWeight: 25
      - pause: { duration: 10m }
      - setWeight: 50
      - pause: { duration: 10m }
      - setWeight: 75
      - pause: { duration: 10m }
      - setWeight: 100
      
      # Automated analysis
      analysis:
        templates:
        - templateName: success-rate
        startingStep: 2
        args:
        - name: service-name
          value: order-service
      
      # Traffic routing with Istio
      trafficRouting:
        istio:
          virtualService:
            name: order-service
            routes:
            - primary
          destinationRule:
            name: order-service
            canarySubsetName: canary
            stableSubsetName: stable
      
      # Anti-affinity for canary pods
      antiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          weight: 100
---
# Analysis template
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 1m
    count: 5
    successCondition: result[0] >= 0.99
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(
            istio_requests_total{
              destination_service_name="{{args.service-name}}",
              response_code!~"5.*"
            }[5m]
          )) / 
          sum(rate(
            istio_requests_total{
              destination_service_name="{{args.service-name}}"
            }[5m]
          ))

Canary Validation: Comparing Error Rates Between Versions

Argo Rollouts handles the automated case, but many teams start with a simpler approach: a validation script that queries Prometheus, compares the error rate of the canary pods to the stable pods, and exits non-zero if the canary is worse by more than a threshold. The CI pipeline calls this script after bumping canary weight; if it fails, the pipeline auto-rolls back. The script below is the minimum viable version.
// scripts/canary-validate.js
const axios = require('axios');

const PROM = process.env.PROMETHEUS_URL;
const SERVICE = process.env.SERVICE_NAME;
const THRESHOLD = Number(process.env.ERROR_DELTA_THRESHOLD || 0.01);

async function errorRate(track) {
  const query = `
    sum(rate(http_requests_total{service="${SERVICE}",track="${track}",status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="${SERVICE}",track="${track}"}[5m]))
  `;
  const resp = await axios.get(`${PROM}/api/v1/query`, { params: { query } });
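  const value = resp.data.data.result?.[0]?.value?.[1];
  // An empty result usually means no 5xx series exist yet, so treat it as 0.
  // A literal "NaN" (0/0, no traffic in the window) is passed through as NaN.
  return Number(value || 0);
}

(async () => {
  const [stable, canary] = await Promise.all([errorRate('stable'), errorRate('canary')]);
  const delta = canary - stable;
  console.log(`stable=${stable} canary=${canary} delta=${delta}`);
  // NaN compares false against everything, so check it explicitly and fail closed.
  if (Number.isNaN(delta) || delta > THRESHOLD) {
    console.error(`Canary error rate exceeds stable by ${delta} (or cannot be measured), aborting`);
    process.exit(1);
  }
})().catch(err => {
  // Prometheus unreachable, bad URL, and so on: fail the pipeline step
  console.error(`Canary validation could not run: ${err.message}`);
  process.exit(1);
});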
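
The script above gates on error rate alone, while this chapter argues that latency and business metrics matter too. As a rough sketch, the decision can be separated from the metric fetching and written as a pure function. The field names (`errorRate`, `p99LatencyMs`, `conversionRate`) and the default limits are assumptions chosen for illustration, not values from any particular tool.

```javascript
// Hypothetical limits; tune them per service.
const DEFAULT_LIMITS = {
  errorRateDelta: 0.005,   // canary may exceed stable by at most 0.5 points
  p99LatencyRatio: 1.15,   // canary p99 may be at most 15% slower
  conversionDrop: 0.02,    // canary conversion may trail stable by at most 2 points
};

function evaluateCanary(stable, canary, limits = DEFAULT_LIMITS) {
  const required = ['errorRate', 'p99LatencyMs', 'conversionRate'];
  const violations = [];

  // Fail closed: a missing or NaN metric is a violation, not a pass.
  for (const [track, metrics] of [['stable', stable], ['canary', canary]]) {
    for (const key of required) {
      if (!Number.isFinite(metrics[key])) violations.push(`${track}.${key} unavailable`);
    }
  }
  if (violations.length > 0) return { pass: false, violations };

  if (canary.errorRate - stable.errorRate > limits.errorRateDelta) {
    violations.push('error rate regression');
  }
  if (canary.p99LatencyMs > stable.p99LatencyMs * limits.p99LatencyRatio) {
    violations.push('p99 latency regression');
  }
  if (stable.conversionRate - canary.conversionRate > limits.conversionDrop) {
    violations.push('conversion regression');
  }
  return { pass: violations.length === 0, violations };
}
```

A canary with an identical error rate but a 5x slower p99 fails this gate, which is exactly the case an error-rate-only check misses.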

Caveats & Common Pitfalls: Canary Deploys That Lie

Canary deployments create a false sense of safety without the right rigor:
  • Too-short bake windows. Teams push 5% traffic for 60 seconds and call the canary successful. 60 seconds is not enough to see memory leaks, slow-burning errors from background jobs, or user-behavior-dependent bugs that only manifest during real traffic patterns. Target 15-30 minutes minimum per step.
  • Comparing canary against a stale baseline. Your canary monitors compare v2 to v1, but v1’s baseline was captured last week under different load. A normal Friday-afternoon lull gets read as “v2 degradation” and the canary is aborted for no reason. Use rolling baselines that compare canary against the current stable track, not historical numbers.
  • Sticky user routing breaking canary statistics. If your load balancer does sticky sessions, the same user always hits the same track. A user with a bad experience keeps hitting the bad canary — you see a spike in complaints from 5 users instead of spread across thousands. Use random or hash-based routing for canary, and accept session stickiness only post-promotion.
  • Canary sees different traffic than production. Canary receives 5% of inbound traffic, but that 5% is systematically different (e.g., only traffic from one region, only API calls not web). Bugs in the non-sampled cohort are invisible to the canary. Route canary traffic randomly across the full traffic distribution.
Solutions & Patterns:
  • Define canary success criteria in code, not just in the runbook. Explicit metrics with thresholds: “error rate delta under 0.5%, p99 latency delta under 15%, checkout conversion delta under 2%.” Codify these in Argo Rollouts analysis templates so they are enforced, not just aspirational.
  • Use statistical rather than absolute thresholds. “Canary error rate must not exceed stable by more than 2 standard deviations” is more robust than “canary error rate under 1%” which can trigger on noise at low traffic volumes.
  • Promote gradually and pause between steps. 1% for 15 min, 5% for 30 min, 25% for 30 min, 50% for 30 min, then 100%. Each step gives time for delayed bugs to surface. That is 1 hour 45 minutes of bake time, so budget about 2 hours end to end for a large-blast-radius change, which is the right speed for high-stakes services.
  • Test the rollback path as a pre-deploy check. Before any canary, verify that traffic can be shifted back to stable in under 60 seconds. If that takes longer, fix the rollback path before adding new deploys.
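
One way to make "statistical rather than absolute" thresholds concrete is a one-sided two-proportion z-test on raw request and error counts. This is a sketch under the assumption that requests are independent samples. The function names and input shape are hypothetical, and the threshold of 2 mirrors the "2 standard deviations" rule of thumb above.

```javascript
// z-score for "canary error rate is higher than stable" using a pooled
// two-proportion test. Inputs: { requests, errors } per track.
function canaryZScore(stable, canary) {
  const n1 = stable.requests;
  const n2 = canary.requests;
  if (n1 <= 0 || n2 <= 0) throw new Error('both tracks need traffic');

  const p1 = stable.errors / n1;
  const p2 = canary.errors / n2;
  // Pooled rate under the null hypothesis that both tracks behave the same.
  const pooled = (stable.errors + canary.errors) / (n1 + n2);
  const standardError = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));

  if (standardError === 0) return 0; // identical all-success or all-failure tracks
  return (p2 - p1) / standardError;
}

function canaryIsWorse(stable, canary, zThreshold = 2) {
  return canaryZScore(stable, canary) > zThreshold;
}
```

With 1 error in 50 canary requests (2%) against 10 in 950 stable requests, the z-score stays well under 2, so the canary is not flagged. An absolute "under 1%" rule would have aborted on that single error. At 100 errors in 5,000 canary requests against 950 in 95,000 stable requests, the same 2% rate is flagged because the sample is large enough to rule out noise.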

Rollback Strategies: The Forgotten Half of Deployment

Most teams spend 90% of their deployment thinking on “how do we deploy v2?” and 10% on “how do we roll back?” This is backwards. The moment production breaks, rollback speed is the only thing that matters: every minute of downtime costs revenue, users, and trust. A mature deployment strategy treats rollback as a first-class operation, tested regularly (monthly rollback drills are a good practice), with clear triggers for when to invoke it.
The three rollback strategies are: infrastructure rollback (revert the image tag, redeploy the old version), traffic rollback (flip the router back to the old version, which is faster but assumes the old version is still running), and data rollback (restore from backup, which is the last resort and usually involves data loss).
Without a tested rollback story, teams fall into a failure mode called “rolling forward”: when v2 is broken, they try to push v3 to fix it, then v4 to fix v3, compounding the incident. I have seen incidents where a 5-minute rollback would have restored service, but the team spent 3 hours trying to patch forward because they had never practiced rollback and were afraid of it. The rule: if you cannot restore service in under 10 minutes, you do not have a rollback strategy; you have wishful thinking.
The key decision for rollback: automatic vs manual. Automatic rollback (Argo Rollouts aborts on analysis failure) is fast but can thrash during transient issues (a brief Prometheus outage triggers rollback of a healthy deploy). Manual rollback requires a human in the loop, which is slower but more deliberate. The right answer depends on blast radius: for a low-risk service, automatic rollback saves human time; for a mission-critical service (payments, auth), you want a human to confirm the rollback because a false positive is more costly than an extra 60 seconds of incident time.
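
The thrashing problem above comes from treating "no data" the same as "bad data". A small sketch of a trigger that separates the two is shown here. The sample labels (`ok`, `fail`, `inconclusive`) and option names are assumptions for illustration and do not correspond to any specific tool's API.

```javascript
// Decide what to do from a sequence of analysis samples.
// 'inconclusive' covers cases such as a failed metrics query.
function rollbackDecision(samples, { failuresToTrigger = 3, automatic = true } = {}) {
  let failStreak = 0;
  for (const sample of samples) {
    if (sample === 'fail') failStreak += 1;
    else if (sample === 'ok') failStreak = 0;
    // 'inconclusive' neither extends nor resets the streak

    if (failStreak >= failuresToTrigger) {
      // Mission-critical services can set automatic: false to keep a human in the loop.
      return automatic ? 'rollback' : 'page-human';
    }
  }
  return 'continue';
}

console.log(rollbackDecision(['ok', 'inconclusive', 'inconclusive', 'ok'])); // continue
console.log(rollbackDecision(['fail', 'fail', 'fail']));                     // rollback
```

A brief metrics outage produces only inconclusive samples and does not roll back a healthy deploy, while sustained failures still trigger quickly.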

Deployment Strategy Decision Framework

Choosing the wrong deployment strategy for a given service can range from “annoying” (unnecessary complexity) to “catastrophic” (data corruption from incompatible versions running simultaneously). Use this matrix:
| Strategy | Rollback Speed | Resource Cost | Risk Level | Best For |
| --- | --- | --- | --- | --- |
| Rolling update | Slow (minutes) | Low (no extra infra) | Medium (mixed versions during rollout) | Stateless services with backward-compatible changes |
| Blue-green | Instant (switch selector) | High (2x infrastructure) | Low (full testing before switch) | Database-backed services needing instant rollback |
| Canary | Fast (shift traffic back) | Low-Medium (1 extra pod) | Lowest (gradual exposure) | High-traffic services where subtle regressions matter |
| Recreate | N/A (downtime accepted) | Low | High (downtime) | Development environments; services with incompatible version transitions |
Decision tree:
  1. Can you afford downtime? If yes, use Recreate (simplest). If no, continue.
  2. Is the change backward-compatible? If no, use Blue-green (both versions never serve traffic simultaneously). If yes, continue.
  3. Do you have more than 1,000 requests/minute? If yes, use Canary (catch regressions before they hit all users). If no, use Rolling update (simplest zero-downtime option).
  4. Do you need header-based routing (e.g., internal users see v2)? You need a service mesh (Istio) with Canary or Blue-green.
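The decision tree above can be written down as a small function, which makes it easy to review and to apply consistently across services. This is a sketch; the input field names are hypothetical, and step 4 is modelled only as a flag because the text does not say how it should change the chosen strategy.

```javascript
function chooseStrategy({
  canAffordDowntime,
  backwardCompatible,
  requestsPerMinute,
  needsHeaderRouting = false,
}) {
  let strategy;
  if (canAffordDowntime) {
    strategy = 'recreate';            // step 1: simplest option
  } else if (!backwardCompatible) {
    strategy = 'blue-green';          // step 2: versions never serve together
  } else if (requestsPerMinute > 1000) {
    strategy = 'canary';              // step 3: enough traffic to detect regressions
  } else {
    strategy = 'rolling';             // step 3: simplest zero-downtime option
  }
  // Step 4: header-based routing implies a service mesh such as Istio.
  return { strategy, needsServiceMesh: needsHeaderRouting };
}

console.log(chooseStrategy({
  canAffordDowntime: false,
  backwardCompatible: true,
  requestsPerMinute: 5000,
}).strategy); // canary
```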
Edge case: database migrations during deployment. None of these strategies are safe if v2 changes the database schema in a way that breaks v1. Always use the expand-contract pattern: first deploy a migration that adds new columns (expand), deploy v2 that writes to both old and new columns, then deploy a cleanup migration that removes old columns (contract). This adds 2-3 extra deployments but prevents data corruption.
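
To illustrate the middle step of expand-contract, here is a sketch of what "v2 writes to both old and new columns" can look like in application code. The table and column names (`orders`, `customer_name`, `customer_full_name`) are invented for this example, and the phase order in the comment follows the description above.

```javascript
// Phases for a hypothetical rename of orders.customer_name to customer_full_name:
//   1. expand:   migration adds the new nullable column customer_full_name
//   2. v2:       application writes both columns (this file)
//   3. contract: migration drops customer_name once nothing reads it

// v2 write path: populate BOTH columns so v1 and v2 can run side by side
// and a rollback to v1 still finds the data it expects.
function toOrderRow(order) {
  return {
    id: order.id,
    customer_name: order.customerName,       // old column, still read by v1
    customer_full_name: order.customerName,  // new column, read by v2
  };
}

// v2 read path: prefer the new column, fall back to the old one for rows
// that v1 wrote before or during the rollout.
function readCustomerName(row) {
  return row.customer_full_name ?? row.customer_name;
}

console.log(toOrderRow({ id: 1, customerName: 'Ada' }));
console.log(readCustomerName({ id: 2, customer_name: 'Bob', customer_full_name: null })); // Bob
```

Because v2 tolerates rows written by v1 and v1 never sees a missing value in its own column, either version can serve traffic at any point before the contract step.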

GitOps with ArgoCD

GitOps Principles: Why Pull Beats Push

GitOps exists because push-based CI/CD has fundamental security and auditability problems. In push-based CI/CD, your CI system (GitHub Actions, Jenkins, CircleCI) has long-lived credentials that let it write to every Kubernetes cluster you deploy to. If an attacker compromises your CI (through a malicious dependency, a leaked token, or a poisoned PR) they have direct access to production. This has happened in real breaches: the Codecov bash uploader incident (2021) exfiltrated credentials from thousands of CI pipelines worldwide, and those credentials were dangerous precisely because they had too much power.
GitOps flips the model: the cluster polls a Git repo for changes (pull) rather than the CI pushing changes into the cluster (push). CI’s only job is to update manifests in Git; it has no cluster credentials. A GitOps agent inside the cluster (ArgoCD, Flux) is the only actor with write access, and it only applies what is in Git. Two key consequences: (1) you can revoke all of CI’s cluster access without breaking deploys, which massively reduces the blast radius of a CI compromise; (2) the Git repo becomes the audit log, because every deploy is a commit (ideally signed), traceable to a human and a PR. What goes wrong without this: traditional push pipelines leave no durable record of who deployed what and when, because the pipeline logs expire and kubectl apply does not leave a commit trail.
The key decision in GitOps: strict reconciliation vs advisory mode. Strict reconciliation (selfHeal: true) means the cluster is forced to exactly match Git, so any manual kubectl edit is reverted within seconds. Advisory mode means drift is reported but not corrected, giving operators an escape valve for emergencies. Strict is the right default because it prevents configuration drift; advisory is sometimes used during the initial GitOps adoption phase when teams are learning the workflow and need room to experiment. Once your team is fluent, strict reconciliation is mandatory. Without it, GitOps becomes “Git is one source of truth among several,” which is no source of truth at all.

ArgoCD Architecture

GitOps inverts the traditional CI/CD model. In the old way, your CI pipeline had write credentials to the cluster and ran kubectl apply to push changes in. That is a security nightmare (your CI system effectively has admin on prod) and an auditability nightmare (to know what is running, you have to query the cluster, not read a file). GitOps flips this: the cluster has a pull-based agent (ArgoCD) that watches a Git repo and syncs the cluster to match whatever is in Git. Git becomes the single source of truth; the cluster is a derived, reproducible projection of it. Why this matters for microservices specifically: with 20+ services, you need a clean answer to “what version of every service is running in every environment, and how do I roll any of them back?” The Git-repo-of-manifests answer is: look at the file, run git log for history, run git revert to roll back. The imperative answer is: run kubectl get across dozens of namespaces and hope nobody made a manual change. The tradeoff is operational: you now manage a second repo (the manifests repo) in addition to your application code, and you need discipline to never kubectl edit directly in production. Every team that adopts GitOps discovers that the hardest part is not the tooling — it is teaching engineers to resist the temptation to hot-patch production.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        GITOPS WORKFLOW                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌────────────────┐       ┌────────────────┐       ┌────────────────┐       │
│  │  Application   │       │  CI Pipeline   │       │  Git Repo      │       │
│  │  Repository    │──────▶│  (GitHub       │──────▶│  (K8s Manifests│       │
│  │                │ build │   Actions)     │ push  │   Kustomize)   │       │
│  └────────────────┘       └────────────────┘       └───────┬────────┘       │
│                                                            │                 │
│                                                     watch/sync              │
│                                                            │                 │
│                                                            ▼                 │
│                                                   ┌────────────────┐        │
│                                                   │    ArgoCD      │        │
│  ┌────────────────────────────────────────────────┤                │        │
│  │                                                │  • Watches Git │        │
│  │          KUBERNETES CLUSTER                    │  • Syncs state │        │
│  │                                                │  • Self-heals  │        │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐       │                │        │
│  │  │ Service  │ │ Service  │ │ Service  │       └────────────────┘        │
│  │  │    A     │ │    B     │ │    C     │                                 │
│  │  └──────────┘ └──────────┘ └──────────┘                                 │
│  │                                                                          │
│  └──────────────────────────────────────────────────────────────────────────│
│                                                                              │
│  ✅ BENEFITS:                                                                │
│  • Git = single source of truth                                             │
│  • Audit trail (git history)                                                │
│  • Easy rollback (git revert)                                               │
│  • Declarative configuration                                                │
│  • Self-healing (auto-sync)                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
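
The watch, sync, and self-heal behaviour in the diagram boils down to repeatedly diffing desired state (Git) against actual state (cluster) and planning corrective actions. The following is a toy model of that reconciliation step, not ArgoCD's implementation. Resources are plain objects keyed by name, and the `prune` option is named after the ArgoCD flag only by analogy.

```javascript
// desired / actual: { [resourceName]: spec }
function planSync(desired, actual, { prune = true } = {}) {
  const actions = [];

  for (const [name, spec] of Object.entries(desired)) {
    if (!(name in actual)) {
      actions.push({ op: 'create', name });
    } else if (JSON.stringify(actual[name]) !== JSON.stringify(spec)) {
      // Simplification: string comparison is key-order sensitive.
      actions.push({ op: 'update', name });
    }
  }

  if (prune) {
    for (const name of Object.keys(actual)) {
      if (!(name in desired)) actions.push({ op: 'delete', name });
    }
  }
  return actions;
}

const git = { 'service-a': { image: 'a:v2' }, 'service-b': { image: 'b:v1' } };
const cluster = { 'service-a': { image: 'a:v1' }, 'service-c': { image: 'c:v1' } };
console.log(planSync(git, cluster));
// [ update service-a, create service-b, delete service-c ]
```

Self-healing amounts to rerunning this plan whenever the cluster drifts: a manual edit makes `actual` differ from `desired`, and the next pass plans an update that reverts it.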

ArgoCD Application Configuration

ArgoCD deserves its own section because it is the most common GitOps operator and its configuration model repeats concepts you will see across the ecosystem. The Application CRD is ArgoCD’s core abstraction: one Application represents one deployed workload, bound to one Git source. The critical decision is granularity: one Application per service, or one Application per “bounded context” (a group of services that deploy together). Per-service Applications give you the finest-grained control (each service can sync and roll back independently), but you end up with hundreds of Application manifests to manage. The ApplicationSet pattern below solves this by generating Applications from a template.
Why does ArgoCD exist when kubectl apply -f has existed forever? Because kubectl apply is stateless: once you run it, there is no ongoing reconciliation. If someone edits the resource afterwards, your apply is forgotten. ArgoCD maintains an ongoing reconciliation loop: it knows what Git says the state should be, it observes what the cluster actually is, and it continuously drives the cluster toward the desired state. This is the same pattern Kubernetes itself uses for Deployments reconciling Pods. ArgoCD extends that pattern from “pods match deployment” to “cluster matches Git.”
Concretely, an Application binds a Git source (repo + path + revision) to a Kubernetes destination (cluster + namespace). The syncPolicy.automated.selfHeal: true flag is what makes GitOps different from “CI that runs kubectl”: ArgoCD continuously reconciles cluster state against Git state, so manual kubectl changes get reverted within seconds. The prune: true flag means that if you delete a file from Git, ArgoCD deletes the corresponding resource in the cluster. Without these flags, ArgoCD is just a fancy deployment tool; with them, Git is truly the source of truth.
The ApplicationSet below is how you avoid defining the same Application manifest twelve times for twelve environments. It generates one Application per entry in the elements list, templating the environment name, namespace, and cluster URL. This keeps your GitOps config DRY and prevents drift between environment definitions. A common bug is “staging and prod diverged because someone updated one but not the other.” Because every environment’s Application is rendered from the same template, that kind of divergence in the Application definitions cannot happen; only the per-environment overlays can still differ.
# argocd/applications/order-service.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-service
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io
  # Notifications (subscription annotations belong under metadata, not spec)
  annotations:
    notifications.argoproj.io/subscribe.on-sync-succeeded.slack: deployments
    notifications.argoproj.io/subscribe.on-sync-failed.slack: deployments
spec:
  project: default

  source:
    repoURL: https://github.com/myorg/k8s-manifests.git
    targetRevision: main
    path: services/order-service/overlays/production

    # Kustomize configuration
    kustomize:
      images:
      - myregistry/order-service

  destination:
    server: https://kubernetes.default.svc
    namespace: production

  syncPolicy:
    automated:
      prune: true  # Delete resources not in Git
      selfHeal: true  # Revert manual changes
      allowEmpty: false
    syncOptions:
    - Validate=true
    - CreateNamespace=true
    - PrunePropagationPolicy=foreground
    - PruneLast=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

  # Ignore fields that other controllers manage
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas  # Ignore HPA-managed replicas
---
# ApplicationSet for multiple environments
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: order-service-environments
  namespace: argocd
spec:
  generators:
  - list:
      elements:
      - environment: staging
        namespace: staging
        cluster: https://staging.k8s.example.com
      - environment: production
        namespace: production
        cluster: https://production.k8s.example.com
  
  template:
    metadata:
      name: 'order-service-{{environment}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/myorg/k8s-manifests.git
        targetRevision: main
        path: 'services/order-service/overlays/{{environment}}'
      destination:
        server: '{{cluster}}'
        namespace: '{{namespace}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
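
Conceptually, the list generator above substitutes each element's parameters into the template and emits one Application per element. The sketch below is a toy model of that expansion, written to show the idea rather than ArgoCD's actual templating engine; `render` and `expandApplicationSet` are hypothetical helper names.

```javascript
// Recursively replace {{param}} placeholders in strings, arrays, and objects.
function render(template, params) {
  if (typeof template === 'string') {
    return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, key) =>
      key in params ? String(params[key]) : match);
  }
  if (Array.isArray(template)) return template.map(item => render(item, params));
  if (template && typeof template === 'object') {
    return Object.fromEntries(
      Object.entries(template).map(([k, v]) => [k, render(v, params)]));
  }
  return template;
}

function expandApplicationSet(elements, template) {
  return elements.map(element => render(template, element));
}

const apps = expandApplicationSet(
  [
    { environment: 'staging', namespace: 'staging', cluster: 'https://staging.k8s.example.com' },
    { environment: 'production', namespace: 'production', cluster: 'https://production.k8s.example.com' },
  ],
  {
    metadata: { name: 'order-service-{{environment}}' },
    spec: {
      source: { path: 'services/order-service/overlays/{{environment}}' },
      destination: { server: '{{cluster}}', namespace: '{{namespace}}' },
    },
  }
);
console.log(apps.map(app => app.metadata.name));
// [ 'order-service-staging', 'order-service-production' ]
```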

Kustomize Structure

Kustomize lets you maintain one set of “base” Kubernetes manifests and “overlay” environment-specific patches on top. Without it, you end up with three copies of nearly-identical deployment YAML (dev, staging, prod) and they slowly drift apart as people fix bugs in one but not the others. The Kustomize base/overlay pattern below keeps the shared 90% in one place and isolates the 10% that actually differs per environment (replica counts, resource limits, config values). This is mechanically similar to CSS inheritance — define defaults in the base, override selectively in the overlay.
k8s-manifests/
└── services/
    └── order-service/
        ├── base/
        │   ├── kustomization.yaml
        │   ├── deployment.yaml
        │   ├── service.yaml
        │   ├── configmap.yaml
        │   └── hpa.yaml
        └── overlays/
            ├── staging/
            │   ├── kustomization.yaml
            │   ├── replicas-patch.yaml
            │   └── config-patch.yaml
            └── production/
                ├── kustomization.yaml
                ├── replicas-patch.yaml
                └── config-patch.yaml
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
- deployment.yaml
- service.yaml
- configmap.yaml
- hpa.yaml

commonLabels:
  app: order-service

images:
- name: order-service
  newName: myregistry/order-service
  newTag: latest
# overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: production

resources:
- ../../base

patches:
- path: replicas-patch.yaml
- path: config-patch.yaml

images:
- name: myregistry/order-service
  newTag: v2.1.0  # Updated by CI pipeline

configMapGenerator:
- name: order-service-config
  behavior: merge
  literals:
  - LOG_LEVEL=warn
  - ENABLE_DEBUG=false

Monorepo vs Polyrepo CI/CD

Monorepo Strategy

A monorepo (one Git repo, many services) sounds like it should be easier than a polyrepo (one Git repo per service), but the CI complexity is substantial. The naive implementation, “run all service pipelines on every commit”, means a one-line change to services/billing/README.md triggers a 45-minute pipeline across every service. The fix is change detection: determine which services a commit actually affected and build only those. Shared libraries are the gotcha: if libs/common changes, every service that depends on it must be rebuilt.

The dorny/paths-filter action below is one way to implement this cleanly. It matches the changed files against a set of filter patterns and emits the names of the filters that matched, which here correspond to the affected services. The matrix strategy then spawns a parallel job per affected service. The shared-libs filter handles the dependency case: if shared libs change, we rebuild all dependent services. Alternatives such as Nx, Turborepo, and Bazel derive the affected set from an actual dependency graph parsed from your code, which is more accurate but adds tooling complexity.
# .github/workflows/monorepo-ci.yml
name: Monorepo CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      services: ${{ steps.filter.outputs.changes }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v2
        id: filter
        with:
          filters: |
            order-service:
              - 'services/order-service/**'
            user-service:
              - 'services/user-service/**'
            payment-service:
              - 'services/payment-service/**'
            shared-libs:
              - 'libs/**'

  build:
    needs: detect-changes
    if: ${{ needs.detect-changes.outputs.services != '[]' }}
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: ${{ fromJson(needs.detect-changes.outputs.services) }}
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Build ${{ matrix.service }}
        if: matrix.service != 'shared-libs'
        run: |
          cd services/${{ matrix.service }}
          docker build -t myregistry/${{ matrix.service }}:${{ github.sha }} .
      
      - name: Rebuild dependent services
        if: matrix.service == 'shared-libs'
        run: |
          # Find every service whose top-level package.json references the shared libs
          for manifest in $(grep -l "@myorg/shared" services/*/package.json); do
            service_name=$(basename "$(dirname "$manifest")")
            echo "Building $service_name..."
            # Pass the build context as an argument instead of changing directories
            docker build -t "myregistry/$service_name:${{ github.sha }}" "services/$service_name"
          done
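
The change-detection logic described above can also be sketched outside of GitHub Actions. The snippet below is a minimal illustration, not the action's implementation: it assumes the services/<name>/ and libs/<name>/ layout used in this chapter, and the `SHARED_LIB_CONSUMERS` map is a hypothetical, hand-maintained list of which services use which shared library.

```javascript
// Hypothetical map from shared library name to the services that import it.
const SHARED_LIB_CONSUMERS = {
  common: ['order-service', 'user-service', 'payment-service'],
  payments: ['payment-service'],
};

// Map a list of changed file paths to the set of services that must be rebuilt.
function affectedServices(changedFiles, consumers = SHARED_LIB_CONSUMERS) {
  const affected = new Set();
  for (const file of changedFiles) {
    const parts = file.split('/');
    if (parts[0] === 'services' && parts.length > 2) {
      // A file inside services/<name>/ affects only that service.
      affected.add(parts[1]);
    } else if (parts[0] === 'libs' && parts.length > 2) {
      // A file inside libs/<name>/ affects every consumer of that library.
      for (const service of consumers[parts[1]] || []) affected.add(service);
    }
    // Anything else (root README, docs) affects no service.
  }
  return [...affected].sort();
}

console.log(affectedServices(['services/order-service/src/index.js', 'README.md']));
// [ 'order-service' ]
console.log(affectedServices(['libs/payments/index.js']));
// [ 'payment-service' ]
```

Using a Set means a service that changed directly and also consumes a changed library is built only once.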

Polyrepo Strategy

Polyrepo (one repo per service) inverts the tradeoffs. Each repo’s CI is simple: there is only one service, so “what changed?” is always “this service.” No path filtering needed, no dependency graph. But cross-service coordination gets harder: a feature that spans three services requires three PRs in three repos, and keeping them in sync is manual. Large refactors that touch many services become painful.

The polyrepo workflow below is dead simple, which is its main virtue. It pushes the image and then triggers a repository-dispatch event in a separate k8s-manifests repo, which is the repo ArgoCD watches. This separation means the application repo’s CI never holds cluster credentials: it only needs a token that can dispatch events to the manifests repo, and ArgoCD is the only component that writes to the cluster.
# Each service has its own repo
# order-service/.github/workflows/ci.yml
name: Order Service CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Build
        run: docker build -t myregistry/order-service:${{ github.sha }} .
      
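      - name: Test
        # Install dependencies before running the test suite
        run: npm ci && npm test

      - name: Push
        if: github.ref == 'refs/heads/main'
        # Assumes an earlier registry login step (for example docker/login-action)
        run: docker push myregistry/order-service:${{ github.sha }}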
      
      - name: Trigger deployment
        if: github.ref == 'refs/heads/main'
        uses: peter-evans/repository-dispatch@v2
        with:
          token: ${{ secrets.DEPLOY_REPO_TOKEN }}
          repository: myorg/k8s-manifests
          event-type: deploy-order-service
          client-payload: '{"image": "myregistry/order-service:${{ github.sha }}"}'
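
On the receiving side, the manifests repo has to turn the dispatched payload into a manifest change. The sketch below shows one way that step might look. `parseImageRef` and `handleDeployEvent` are hypothetical helper names, and the event shape assumes only the `client_payload.image` field sent by the workflow above. Digest references (`@sha256:...`) are deliberately not handled.

```javascript
// Split "registry/name:tag" into name and tag. The last colon only counts as
// a tag separator if it comes after the last slash (so "localhost:5000/app"
// is treated as untagged).
function parseImageRef(ref) {
  const slash = ref.lastIndexOf('/');
  const colon = ref.lastIndexOf(':');
  if (colon <= slash || colon === ref.length - 1) {
    throw new Error(`image reference has no tag: ${ref}`);
  }
  return { name: ref.slice(0, colon), tag: ref.slice(colon + 1) };
}

// Build the kustomize command for a repository_dispatch event payload.
function handleDeployEvent(event) {
  const image = event.client_payload && event.client_payload.image;
  if (typeof image !== 'string') {
    throw new Error('client_payload.image is required');
  }
  const { name, tag } = parseImageRef(image);
  return `kustomize edit set image ${name}=${name}:${tag}`;
}

console.log(
  handleDeployEvent({ client_payload: { image: 'myregistry/order-service:abc123' } })
);
// kustomize edit set image myregistry/order-service=myregistry/order-service:abc123
```

Validating the payload before touching the manifests keeps a malformed dispatch from producing a commit that ArgoCD would then try to sync.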

Deployment Helper Scripts

Pipelines do the heavy lifting, but you will also need ad-hoc deployment scripts: for manual cutovers, for emergency rollbacks, for triggering builds across multiple services during a coordinated release, and for querying the state of the deployment fleet. These scripts are typically written in whatever language your platform team is most comfortable with. Below is a Node.js helper that combines simple-git (for committing and pushing manifest changes) with child_process (for invoking git and kustomize). The structure maps 1:1 to what you would write in Python with GitPython and subprocess, so if your team is Python-first for tooling, the same pattern applies.
// scripts/deploy-helper.js
const { execSync } = require('child_process');
const simpleGit = require('simple-git');

class DeploymentHelper {
  constructor(serviceName, manifestsRepoPath) {
    this.serviceName = serviceName;
    this.manifestsRepo = simpleGit(manifestsRepoPath);
    this.manifestsPath = manifestsRepoPath;
  }

  getImageTag() {
    const sha = execSync('git rev-parse --short HEAD').toString().trim();
    const timestamp = new Date().toISOString().slice(0, 10).replace(/-/g, '');
    return `${timestamp}-${sha}`;
  }

  async updateManifest(environment, imageTag) {
    const overlayPath = `services/${this.serviceName}/overlays/${environment}`;
    // Run kustomize inside the overlay directory via the cwd option
    execSync(
      `kustomize edit set image myregistry/${this.serviceName}=myregistry/${this.serviceName}:${imageTag}`,
      { cwd: `${this.manifestsPath}/${overlayPath}` }
    );

    await this.manifestsRepo.add('.');
    await this.manifestsRepo.commit(
      `Deploy ${this.serviceName}:${imageTag} to ${environment}`
    );
    await this.manifestsRepo.push('origin', 'main');

    console.log(`✅ Updated ${environment} manifest to ${imageTag}`);
  }

  async deploy(environment) {
    const imageTag = this.getImageTag();
    console.log(`Deploying ${this.serviceName}:${imageTag} to ${environment}...`);
    await this.updateManifest(environment, imageTag);
  }
}

const helper = new DeploymentHelper('order-service', '/path/to/k8s-manifests');
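// deploy() is async, so surface failures instead of leaving the rejection unhandled
helper.deploy('staging').catch((err) => {
  console.error(`Deployment failed: ${err.message}`);
  process.exit(1);
});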

Interview Questions

Answer:
Key principles:
  1. Independent pipelines: Each service has its own pipeline
  2. Parallel execution: Build services in parallel
  3. Contract testing: Verify API contracts between services
  4. Progressive deployment: Staging → Canary → Production
Pipeline stages:
Build → Unit Test → Security Scan → Integration Test → 
Contract Test → Push → Deploy Staging → E2E Test → Deploy Prod
Best practices:
  • Feature flags for incomplete features
  • Automated rollback on failure
  • Environment parity (staging ≈ production)
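Answer:
| Aspect         | Blue-Green           | Canary                  |
|----------------|----------------------|-------------------------|
| Traffic switch | Instant (100%)       | Gradual (5% → 100%)     |
| Rollback speed | Instant              | Instant                 |
| Resource cost  | 2x capacity          | 1x + small %            |
| Risk           | Higher (all traffic) | Lower (partial traffic) |
| Complexity     | Simpler              | More complex            |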
Use Blue-Green when:
  • Need instant rollback
  • Have resources for 2x capacity
  • Database schema is compatible
Use Canary when:
  • Want to test with real traffic
  • Need to validate gradually
  • Resource constrained
Answer:
GitOps = Git as single source of truth for infrastructure.
Principles:
  1. Declarative configuration in Git
  2. Version controlled (history, audit)
  3. Automated sync (Git → Cluster)
  4. Self-healing (drift correction)
Benefits:
  • Easy rollback (git revert)
  • Audit trail (git history)
  • Pull-based security (no cluster credentials in CI)
  • Consistent environments
Tools: ArgoCD, Flux, Jenkins X
Answer:
Strategies:
  1. Backward-compatible migrations
    • Add columns, don’t remove
    • Deploy code that works with old/new schema
    • Separate deployment from migration
  2. Expand-Contract pattern
    Step 1: Add new column (nullable)
    Step 2: Deploy code writing to both
    Step 3: Backfill old data
    Step 4: Deploy code reading from new
    Step 5: Remove old column
    
  3. Versioned APIs
    • API v1 uses old schema
    • API v2 uses new schema
    • Migrate gradually
Never: Rename/delete columns in one deploy
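
The expand-contract steps above can be made concrete with a small in-memory sketch. It assumes a hypothetical migration that replaces a `name` column with `full_name`. Rows are plain objects standing in for database records, and the phase names are illustrative labels rather than terms from any migration tool.

```javascript
// Which columns each code version writes and reads.
const PHASES = {
  'write-both-read-old': { write: ['name', 'full_name'], read: 'name' },      // Step 2
  'write-both-read-new': { write: ['name', 'full_name'], read: 'full_name' }, // Step 4
  'new-only':            { write: ['full_name'],         read: 'full_name' }, // after Step 5
};

function writeUser(row, value, phase) {
  const next = { ...row };
  for (const column of PHASES[phase].write) next[column] = value;
  return next;
}

function readUser(row, phase) {
  return row[PHASES[phase].read];
}

// Step 3: copy old values into the new column for rows written before Step 2.
function backfill(rows) {
  return rows.map((row) =>
    row.full_name == null ? { ...row, full_name: row.name } : row
  );
}

const legacyRows = [{ id: 1, name: 'Ada' }]; // written before the migration started

// Reading the new column before the backfill finds nothing...
console.log(readUser(legacyRows[0], 'write-both-read-new')); // undefined
// ...which is why Step 3 has to finish before Step 4 is deployed.
console.log(readUser(backfill(legacyRows)[0], 'write-both-read-new')); // Ada
```

Because every phase keeps both columns valid until the final step, old and new application versions can run side by side at any point in the sequence.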
Answer:
Testing pyramid:
          /\
         /  \      E2E Tests (few)
        /----\     
       /      \    Integration Tests
      /--------\   
     /          \  Contract Tests (Pact)
    /------------\ 
   /              \ Unit Tests (many)
  /________________\
In pipeline:
  1. Unit tests: Fast, isolated, run first
  2. Contract tests: Verify API contracts
  3. Integration tests: Test with real dependencies
  4. E2E tests: Full user flows (staging only)
Key practices:
  • Use service containers in CI (Postgres, Redis)
  • Mock external services
  • Parallelize test suites
  • Test data isolation

Chapter Summary

Key Takeaways:
  • Each microservice needs its own CI/CD pipeline
  • Use progressive deployment strategies (Canary, Blue-Green)
  • GitOps provides audit trail and easy rollbacks
  • Contract testing catches breaking changes early
  • Automate everything: testing, security scanning, deployment
Next Chapter: Database Patterns - Data management strategies for microservices.

Interview Questions: Pipeline Bottlenecks at Scale

Strong Answer Framework:
  1. Diagnose before optimizing. Profile the pipeline: what are the top 3 slowest stages? Typically it is (a) unit tests that do not parallelize, (b) sequential integration tests against real infrastructure, (c) Docker builds that do not use layer caching. Without profiling, you will optimize the wrong things.
  2. Partition by change detection. A change to services/order/ should not build, test, or scan the 19 other services. Use dorny/paths-filter, Nx, Turborepo, or Bazel to determine which services are affected by the diff. This alone often drops pipeline time 80%+ for typical PRs.
  3. Parallelize aggressively within affected services. Unit tests run in parallel per service, integration tests run in parallel per service, security scans run in parallel with integration tests. The critical path should be build + unit test then security || integration || contract then push + deploy.
  4. Shift slow stages out of the critical path. Full E2E tests do not need to block every PR merge. Run them post-merge against main, in a separate pipeline that gates the staging-to-prod promotion rather than the merge itself. PRs finish in 10-15 minutes; full validation finishes in 45-60 minutes in the background.
  5. Cache everything cacheable. Docker layer cache (BuildKit, buildx), dependency cache (package-lock.json-keyed npm cache, requirements-hash-keyed pip cache), compiled artifacts. Teams routinely discover they were rebuilding node_modules from scratch on every PR.
  6. Invest in test speed, not just pipeline speed. Slow integration tests often have N+1 setup: each test spins up its own DB. Use transaction-per-test or shared container patterns to amortize setup cost. A 10-minute test suite becoming 30 seconds changes the pipeline economics entirely.
  7. Move required-for-merge checks to a small core. Fast unit tests, linting, security scans (of only changed files) — these must pass to merge. Everything slower runs post-merge and produces rollback signal, not merge blocking.
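
To see why parallelizing the middle stages matters, the critical path from point 3 can be computed directly. This is a minimal sketch: the stage durations are made-up numbers for illustration, and each inner array is a group of steps assumed to run in parallel.

```javascript
// build -> unit test -> (security || integration || contract) -> push -> deploy
const pipeline = [
  [{ name: 'build', minutes: 3 }],
  [{ name: 'unit-test', minutes: 2 }],
  [
    { name: 'security-scan', minutes: 4 },
    { name: 'integration-test', minutes: 6 },
    { name: 'contract-test', minutes: 1 },
  ],
  [{ name: 'push', minutes: 1 }],
  [{ name: 'deploy', minutes: 2 }],
];

// Wall-clock time: each parallel group costs as much as its slowest step.
function criticalPathMinutes(stages) {
  return stages.reduce(
    (total, group) => total + Math.max(...group.map((step) => step.minutes)),
    0
  );
}

// Time if every step ran one after another.
function sequentialMinutes(stages) {
  return stages.flat().reduce((total, step) => total + step.minutes, 0);
}

console.log(criticalPathMinutes(pipeline)); // 14
console.log(sequentialMinutes(pipeline));   // 19
```

The same calculation also shows where to optimize next: only the slowest step in each group (here, the integration tests) affects the total.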
Real-World Example: Shopify’s monorepo CI engineering blog (2020-2022) describes this exact journey: from 40+ minute pipelines blocking thousands of engineers to under 10 minutes for typical PRs through change detection and test selection. Their key insight: most PRs do not need the full test matrix, so do not run it.
Senior Follow-up Questions:
Q: How do you justify to leadership spending a quarter optimizing CI rather than shipping features? A: Compute the ROI: 10 teams * 10 engineers * 2 PRs/day * 45 min wait = 150 engineer-hours per day lost to pipeline wait time. At fully-loaded cost, that is roughly $50k-$75k/week. A 3-engineer quarter reclaiming 30 minutes per PR pays back in a few weeks. Present it as the throughput investment it is, not as internal tooling.
Q: What if the slow stage is security scanning that legal says must run on every PR? A: Two options. First, incremental scanning: scan only the changed files if your scanner supports it (Semgrep, for example, has a diff-aware mode). Second, run the full scan post-merge or on a schedule, and allow merges to main only while the most recent full scan is passing. Legal cares about the scan running on every change, not about when it blocks; frame the conversation around that.
Q: You have shaved the pipeline to 10 minutes but engineers still complain. What do you do? A: Profile engineer workflow, not pipeline. Often the issue is flaky tests that fail randomly and force reruns (so the effective pipeline time is 20 minutes for a flaky PR). Build a flaky-test tracker, quarantine them automatically, and make test ownership explicit. Flakiness is often more expensive than pipeline time once pipeline time is reasonable.
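
The ROI estimate in the first follow-up can be reproduced as a short calculation. The headcount, PR rate, and wait times come from the answer above. The hourly rate, the 13-week quarter, and the 40-hour week are assumptions added here to make the arithmetic explicit.

```javascript
// Engineer-hours per day spent waiting on the pipeline.
function dailyWaitHours({ teams, engineersPerTeam, prsPerEngineerPerDay, waitMinutesPerPr }) {
  return (teams * engineersPerTeam * prsPerEngineerPerDay * waitMinutesPerPr) / 60;
}

// Weekly cost of that waiting at a given fully-loaded hourly rate (assumed).
function weeklyCost(hoursPerDay, hourlyRate, workDaysPerWeek = 5) {
  return hoursPerDay * hourlyRate * workDaysPerWeek;
}

// Working days until the reclaimed hours equal the hours invested.
function paybackWorkDays(investmentHours, reclaimedHoursPerDay) {
  return investmentHours / reclaimedHoursPerDay;
}

const org = { teams: 10, engineersPerTeam: 10, prsPerEngineerPerDay: 2 };
const lostHours = dailyWaitHours({ ...org, waitMinutesPerPr: 45 });
const reclaimedHours = dailyWaitHours({ ...org, waitMinutesPerPr: 30 });
const investmentHours = 3 * 13 * 40; // 3 engineers, 13 weeks, 40 hours/week (assumed)

console.log(lostHours);                                        // 150
console.log(weeklyCost(lostHours, 100));                       // 75000 at an assumed $100/hour
console.log(paybackWorkDays(investmentHours, reclaimedHours)); // 15.6
```

At these assumptions the investment pays back in about three working weeks, which matches the "few weeks" claim.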
Common Wrong Answers:
  • “Add more self-hosted runners.” Throwing hardware at the problem does not help if the pipeline serializes work. A single 45-minute sequential pipeline runs at 45 minutes on any runner.
  • “Remove the integration tests to speed it up.” This reintroduces bugs that the tests were catching. The right move is to make the tests fast, not remove them.
Further Reading:
  • Shopify Engineering blog: “How Shopify reduced its CI time by 80%” (2021).
  • Uber Engineering blog on SubmitQueue and monorepo CI architecture.
  • Google’s blog post series on Bazel remote caching and remote execution.
Strong Answer Framework:
  1. Define success criteria as explicit thresholds. Error rate delta (canary vs stable) must be under 0.5%. P95 latency delta must be under 15%. Business metric (checkout conversion) delta must be under 2%. These are the signals the automation checks.
  2. Use statistical tests, not raw thresholds, at low traffic. At 1% canary traffic, you may have 100 requests per minute — a single error is 1% error rate, which looks catastrophic but is noise. Use two-sample tests (Mann-Whitney for latency, chi-squared for error rates) that account for sample size.
  3. Require sustained threshold breach, not instantaneous. “Error rate delta over 0.5% for 3 consecutive 1-minute windows” beats “error rate delta over 0.5% in any single minute.” Eliminates false positives from transient blips (one user’s retry storm, a slow S3 request).
  4. Argo Rollouts AnalysisTemplate is the standard primitive. Define a template once per service that encodes the thresholds. Rollouts evaluates it automatically at each canary step, pausing the rollout when the analysis is inconclusive and aborting it (shifting traffic back to the stable version) when the analysis fails. Documented, versioned, reviewable.
  5. Independent kill switch for ops. Automation handles regression, but ops still needs a one-button rollback for novel situations the automation doesn’t cover. kubectl argo rollouts abort <rollout> or equivalent; tested monthly.
  6. Tune thresholds with real data over time. Start with an initial guess (e.g., 1% error rate delta triggers abort) and observe the false-abort rate. If the automation aborts good deploys, the thresholds are too tight for your baseline noise: loosen them first, and tighten again only after false aborts drop below 1 in 20.
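
The "sustained breach" rule from point 3 is simple enough to sketch directly. The function below is an illustration of the rule, not how Argo Rollouts implements it. `deltas` is assumed to be the canary-minus-stable error rate, in percentage points, for consecutive 1-minute windows.

```javascript
// Abort only if the delta exceeds the threshold for `windows` consecutive
// measurements; any window at or below the threshold resets the streak.
function shouldAbort(deltas, threshold = 0.5, windows = 3) {
  let streak = 0;
  for (const delta of deltas) {
    streak = delta > threshold ? streak + 1 : 0;
    if (streak >= windows) return true;
  }
  return false;
}

// Isolated spikes (a retry storm, one slow request) do not trigger an abort.
console.log(shouldAbort([0.1, 0.9, 0.2, 0.8, 0.1])); // false
// Three breaching windows in a row do.
console.log(shouldAbort([0.2, 0.6, 0.7, 0.9]));      // true
```

The `windows` parameter trades detection speed for noise tolerance: a larger value aborts fewer good deploys but lets a real regression serve traffic for longer.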
Real-World Example: Netflix’s Spinnaker pipeline analysis (Kayenta) does exactly this with Mann-Whitney tests and z-score analysis for canary evaluation. Their public talks describe tuning thresholds over 12+ months to get false-abort rates down to acceptable levels. Automation quality is iterative, not instant.
Senior Follow-up Questions:
Q: The team is skeptical of automation after it aborted a good deploy last month. How do you rebuild trust? A: Make the automation’s reasoning visible. When it aborts, post to Slack with the exact metric, its value, the threshold, and the window. Engineers can then argue with specific numbers instead of “the bot is wrong again.” Also expose a dry-run mode where engineers can see what the automation would have done without actually acting, building confidence before enforcement.
Q: What do you do when the automation and a human disagree? The dashboard looks fine but the automation says abort. A: Default to the automation, with an explicit override ceremony (paged senior engineer confirms override via specific command). The automation usually catches statistical signals humans miss, but sometimes it is wrong. The key is that overriding is deliberate and logged.
Q: Once the canary is at 100%, is the automation still needed? A: Yes — monitor for 30 minutes post-promotion. Bugs that only manifest at 100% traffic (capacity issues, shared-resource contention, dependency load patterns) would not have shown at 25%. Extended bake with monitoring catches these.
Common Wrong Answers:
  • “Just set the error rate threshold to 0 — abort on any error.” Every service has a baseline error rate. Aborting on any error will abort every deploy.
  • “Run the canary at 50% so you get enough traffic for statistics.” Defeats the purpose of canary (limit blast radius). Use better statistical methods instead of higher traffic.
Further Reading:
  • Netflix Tech Blog, “Automated Canary Analysis at Netflix with Kayenta” (2018).
  • Argo Rollouts documentation on AnalysisTemplate and metric providers.
  • Google SRE Workbook, Chapter 16 (“Canarying Releases”): evaluating canary metrics for deploys.
Strong Answer Framework:
  1. Diagnose the divergence. Production got the hotfix, staging did not. This is deploy/release coupling at the environment level: prod and staging are managed as separate timelines. The hotfix existed in a branch that was merged to the prod deployment artifact but not into the main branch that staging tracked.
  2. Unify the deployment artifact and the source of truth. With GitOps, main is the only source of truth. The hotfix goes into a branch, gets fast-tracked PR-approved (even with one reviewer at 2 AM), merged to main, and then the GitOps operator automatically deploys main to both staging and production. There is no way to deploy something to prod that staging will not eventually get.
  3. Add a post-incident checklist item. Every production hotfix must be followed within 24 hours by: confirmation the same commit is on main, confirmation staging has picked it up, and a regression test that would have caught the bug. This becomes the closing criteria of the incident.
  4. Detect drift automatically. A nightly job compares the deployed artifact in each environment. If production is running a commit that is not an ancestor of staging’s commit, alert. This catches rogue hotfixes before they cause staging-passes-prod-fails mysteries.
  5. Run staging tests against prod configuration on promotion. When a deploy is about to promote to prod, automatically run the relevant staging tests against the prod-targeted artifact. Any failure blocks promotion. Staging drift becomes a promotion failure, not a silent rot.
  6. Culturally: the incident is not closed until staging is updated. Make this part of incident postmortem review. A pattern of “hotfix deployed, staging forgotten” indicates process-level gaps, not individual mistakes.
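
The nightly drift check from point 4 can be sketched with `git merge-base --is-ancestor`, which exits 0 when the first commit is an ancestor of the second and 1 when it is not. The function names and the way the deployed SHAs are obtained are assumptions; in practice the SHAs would come from your GitOps operator or deployment metadata.

```javascript
const { spawnSync } = require('child_process');

// True if `ancestor` is reachable from `descendant` in the repo at `repoPath`.
function gitIsAncestor(ancestor, descendant, repoPath = '.') {
  const result = spawnSync(
    'git',
    ['merge-base', '--is-ancestor', ancestor, descendant],
    { cwd: repoPath }
  );
  if (result.status === 0) return true;
  if (result.status === 1) return false;
  // Any other outcome (unknown SHA, git missing) is an error, not "no drift".
  throw new Error(`git merge-base failed for ${ancestor}..${descendant}`);
}

// Production has drifted if it runs a commit that staging does not contain.
// `isAncestor` is injectable so the logic can be exercised without a repo.
function checkDrift(prodSha, stagingSha, isAncestor = gitIsAncestor) {
  if (prodSha === stagingSha) {
    return { drift: false, reason: 'same commit' };
  }
  if (isAncestor(prodSha, stagingSha)) {
    return { drift: false, reason: 'staging is ahead of production' };
  }
  return { drift: true, reason: 'production commit is not contained in staging' };
}

// Example with a stubbed ancestry check standing in for a rogue hotfix:
console.log(checkDrift('hotfix1', 'main9', () => false));
// { drift: true, reason: 'production commit is not contained in staging' }
```

Treating git errors as failures rather than as "no drift" matters here: a check that silently passes when it cannot read the repository would hide exactly the situation it exists to catch.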
Real-World Example: Knight Capital’s 2012 $440M loss started with exactly this pattern: new code was deployed to 7 of 8 servers, but the 8th was missed. A repurposed flag then triggered dormant legacy code on that server, and the losses that piled up over roughly 45 minutes effectively ended the firm. The generalization: whenever N environments are supposed to be in sync but nothing forces them to be, they will drift, and the drift will eventually cause an incident.
Senior Follow-up Questions:
Q: What if the hotfix is too urgent for even a fast-tracked PR? A: It is not. A 5-minute PR with one reviewer is almost always feasible, even during an incident. If you cannot afford 5 minutes, the alternative is to take the full outage — which is worse. Build the muscle memory that even 2 AM fixes go through git.
Q: Some teams argue staging should be a faster environment to experiment in — not locked to prod. How do you respond? A: They are conflating two environments. Have a dedicated “dev” environment for experiments and a “staging” environment that mirrors prod exactly. Dev can be whatever; staging must track prod or it is useless for validation.
Q: What metric would have surfaced this drift before the incident? A: “Commit SHA deployed in production but not in staging.” If this metric is above zero for more than 1 hour, something is wrong. Alert on it, display it on the team dashboard, make it visible.
Common Wrong Answers:
  • “Require staging to always be updated before production.” This is unworkable during actual incidents, and teams will route around it. Better to make simultaneous updates automatic.
  • “Kill staging entirely and test in production with feature flags.” Can work for some shops, but removes a genuine safety net for changes that are not flag-gated (infrastructure, configs, migrations). Keep staging and keep it in sync.
Further Reading:
  • Knight Capital 2012 incident SEC filings and retrospectives.
  • GitOps Working Group principles documentation.
  • Michael Nygard, Release It! (2nd ed., Pragmatic Bookshelf, 2018) — patterns for deploy consistency.

Interview Deep-Dive

Strong Answer:
The root cause is that the CI pipeline does not understand which services were affected by a change. The fix is change detection: analyze the git diff to determine which service directories were modified, and only run pipelines for those services.

In GitHub Actions, I use path filters on each service’s workflow: on: push: paths: ['services/order-service/**', 'shared/proto/**']. This means the order-service pipeline only triggers when files in its directory or shared proto definitions change. A change to the payment service does not trigger order-service CI.

The tricky part is shared code. If shared/utils/ is modified, which services are affected? I maintain a dependency graph: if order-service imports from shared/utils/, it needs to be rebuilt. I automate this with a script that parses import statements and determines the transitive closure of affected services. Some teams use build tools like Nx, Turborepo, or Bazel that have this dependency analysis built in.

For the test phase, I further optimize by running only affected tests. If only the Order Service changed, I run Order Service unit tests, Order Service integration tests, and contract tests between Order Service and its consumers. I do not run Payment Service tests unless the shared API contract changed.

The result: a change to one service takes 5-8 minutes instead of 45 minutes for all 20. The only time all 20 pipelines run is when a shared dependency (base Docker image, shared library, proto definitions) changes, and that is correct behavior because all consumers need to verify compatibility.

Follow-up: “What about the case where a change to Service A causes Service B to fail, but Service B’s CI was not triggered because only Service A’s files changed?”

This is exactly what contract tests are for. Service B has a contract test that verifies Service A’s API. When Service A changes, its pipeline runs Service A’s own tests plus publishes updated API contracts. A separate “contract verification” job runs in the background: it checks whether Service A’s new contract still satisfies all consumer contracts. If Service B’s contract is broken by Service A’s change, the verification fails and blocks Service A’s deployment. This way, Service B’s full CI does not need to run, only the lightweight contract verification, which takes seconds.
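
The "transitive closure of affected services" mentioned in this answer can be sketched as a breadth-first walk over reversed dependency edges. The graph below is hypothetical and hand-written. A real script would build it by parsing import statements, which is out of scope for this sketch.

```javascript
// Hypothetical dependency graph: each key depends on the modules it lists.
const DEPENDS_ON = {
  'order-service': ['shared/utils', 'shared/proto'],
  'payment-service': ['shared/proto'],
  'user-service': [],
  'shared/utils': ['shared/logging'],
};
const SERVICES = ['order-service', 'payment-service', 'user-service'];

// Return every service that directly or transitively depends on a changed module.
function affectedBy(changed, graph = DEPENDS_ON, services = SERVICES) {
  // Invert the edges: module -> modules that depend on it.
  const dependents = new Map();
  for (const [node, deps] of Object.entries(graph)) {
    for (const dep of deps) {
      if (!dependents.has(dep)) dependents.set(dep, []);
      dependents.get(dep).push(node);
    }
  }
  // Breadth-first walk from the changed modules along the inverted edges.
  const seen = new Set(changed);
  const queue = [...changed];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const dependent of dependents.get(current) || []) {
      if (!seen.has(dependent)) {
        seen.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return [...seen].filter((node) => services.includes(node)).sort();
}

// shared/logging -> shared/utils -> order-service
console.log(affectedBy(['shared/logging'])); // [ 'order-service' ]
console.log(affectedBy(['shared/proto']));   // [ 'order-service', 'payment-service' ]
```

The `seen` set doubles as cycle protection, so a circular dependency between shared modules cannot make the walk loop forever.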
Strong Answer:
Rolling deployment replaces pods incrementally: old pods are terminated as new pods become healthy. This is the Kubernetes default and the right choice for most services. Trade-offs: zero additional infrastructure cost (no duplicate environment), but during the rollout both versions coexist, which can cause issues if the API contract changed between versions. Best for: stateless services with backward-compatible changes.

Blue-green deployment runs two identical environments. You deploy to the inactive environment (green), test it, then switch all traffic from active (blue) to green atomically. Trade-offs: instant rollback (just switch traffic back to blue), but requires double the infrastructure during deployment. Best for: critical services where you need instant rollback capability and can afford the cost, or when the new version requires a database migration that is not backward-compatible.

Canary deployment sends a small percentage of traffic to the new version while monitoring metrics. If metrics look good, increase the percentage until 100%. Trade-offs: most sophisticated (requires metric-based automation), catches real-world bugs that staging environments miss, but requires good observability to detect regressions in small traffic samples. Best for: high-traffic services where even a small regression is costly, or for changes that are difficult to test in staging (performance changes, ML model updates).

My default recommendation: rolling deployments for 80% of services, canary for the 20% that handle payments, checkout, or other revenue-critical flows. Blue-green only when I need the atomic switchover guarantee, for example when deploying a service that requires a specific database state that is only compatible with one version.

Follow-up: “How do you handle database migrations during a canary deployment where both old and new versions need to work simultaneously?”

The expand-and-contract pattern. Before deploying the canary, I run a migration that adds new columns/tables but does not remove or rename existing ones. Both the old and new application versions work with this expanded schema. The canary reads from and writes to both old and new columns. After the canary is promoted to 100% and the old version is fully decommissioned, a second migration removes the deprecated columns. This means every breaking schema change requires two deploy cycles. It is slower but eliminates the risk of schema incompatibility during the canary window.
Strong Answer:
GitOps makes Git the single source of truth for both application code and infrastructure state. Instead of a CI pipeline running kubectl apply to deploy to Kubernetes, a GitOps operator (ArgoCD, Flux) watches a Git repository containing Kubernetes manifests and automatically synchronizes the cluster state to match the repository state. If someone manually changes something in the cluster, the operator detects the drift and reverts it.

The key difference from traditional CI/CD: in traditional pipelines, the pipeline has write access to the cluster (kubectl credentials). In GitOps, only the GitOps operator has cluster write access. Developers push changes to Git, and the operator applies them. This provides an audit trail (every change is a Git commit with author, timestamp, and review), easy rollback (git revert the commit), and drift detection (the operator alerts when cluster state does not match Git).

The problem it solves: in traditional deployments, if someone runs kubectl edit deployment at 2 AM to fix an urgent issue, that change is undocumented, unreviewable, and will be overwritten on the next deploy. With GitOps, that manual change is detected as drift and either reverted automatically or flagged for review. The 2 AM fix still happens, but it goes through Git, even if only as an emergency PR with expedited review.

For microservices specifically, GitOps shines because you have 20+ services, each with their own Kubernetes manifests. Without GitOps, answering “what version of each service is running in production right now?” requires kubectl queries across namespaces. With GitOps, the answer is in the Git repository: the production branch shows the exact manifest for every service.

Follow-up: “What are the challenges of GitOps with secrets? You cannot store database passwords in a Git repository.”

Correct, secrets are the main pain point in GitOps. I use Sealed Secrets (Bitnami) or External Secrets Operator. With Sealed Secrets, you encrypt the secret using a cluster-specific public key and commit the encrypted version to Git. Only the Sealed Secrets controller in the cluster can decrypt it. With External Secrets Operator, you commit a reference to a secret in Vault or AWS Secrets Manager, and the operator fetches the actual value at runtime. Either way, the Git repository never contains plaintext secrets, but the infrastructure-as-code principle is maintained.