Configuration Management

In a microservices architecture, managing configuration across dozens or hundreds of services is a significant challenge. Imagine 40 services each with their own .env file, and you need to rotate a database password. With decentralized config, that is 40 deploys, 40 chances for human error, and an anxious hour hoping you did not miss one. Centralized configuration turns that into a single update that propagates everywhere. This chapter covers patterns and tools for centralized, dynamic configuration — including feature flags, which are arguably the most underrated tool in a microservices toolkit.
Learning Objectives:
  • Implement centralized configuration management
  • Set up dynamic configuration with hot reload
  • Design feature flags for progressive rollouts
  • Manage environment-specific configurations
  • Handle secrets securely across services

The Configuration Challenge

Before looking at solutions, it helps to see exactly how painful decentralized configuration becomes at scale. Every service owning its own .env creates three compounding problems: drift (staging and production quietly diverge), secrets leakage (credentials end up in git history), and operational fragility (changing a shared value requires coordinating N deploys). The “fix it in all 40 places” approach only works until someone misses one, and then you have a service pointing at the old database while the other 39 point at the new one — usually discovered in production, at night, under pressure.
┌─────────────────────────────────────────────────────────────────────────────┐
│                      CONFIGURATION CHALLENGES                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  WITHOUT CENTRALIZED CONFIG:                                                 │
│                                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                       │
│  │  Service A   │  │  Service B   │  │  Service C   │                       │
│  │ ┌──────────┐ │  │ ┌──────────┐ │  │ ┌──────────┐ │                       │
│  │ │.env file │ │  │ │.env file │ │  │ │.env file │ │                       │
│  │ │DB_HOST=..│ │  │ │DB_HOST=..│ │  │ │DB_HOST=..│ │                       │
│  │ │API_KEY=..│ │  │ │API_KEY=..│ │  │ │API_KEY=..│ │                       │
│  │ └──────────┘ │  │ └──────────┘ │  │ └──────────┘ │                       │
│  └──────────────┘  └──────────────┘  └──────────────┘                       │
│                                                                              │
│  ⚠️ Problems:                                                                │
│  • Configuration scattered across services                                  │
│  • Hard to update consistently                                              │
│  • Requires redeployment for changes                                        │
│  • Secrets in plain text files                                              │
│  • No audit trail                                                           │
│                                                                              │
│  ══════════════════════════════════════════════════════════════════════════ │
│                                                                              │
│  WITH CENTRALIZED CONFIG:                                                    │
│                                                                              │
│                    ┌────────────────────────┐                               │
│                    │   Config Server        │                               │
│                    │   ┌────────────────┐   │                               │
│                    │   │ Consul / etcd  │   │                               │
│                    │   │ Vault / AWS SM │   │                               │
│                    │   └────────────────┘   │                               │
│                    └───────────┬────────────┘                               │
│                                │                                             │
│            ┌───────────────────┼───────────────────┐                        │
│            ▼                   ▼                   ▼                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                       │
│  │  Service A   │  │  Service B   │  │  Service C   │                       │
│  │  (watches)   │  │  (watches)   │  │  (watches)   │                       │
│  └──────────────┘  └──────────────┘  └──────────────┘                       │
│                                                                              │
│  ✅ Benefits:                                                                │
│  • Single source of truth                                                   │
│  • Hot reload without redeployment                                          │
│  • Encrypted secrets                                                        │
│  • Audit logging for changes                                                │
│  • Environment-specific overrides                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Caveats & Common Pitfalls: The Silent Killers of Config Management

The traps that cause 2 AM incidents:
  • Config drift between environments. Staging says max_retries: 3, production says max_retries: 5, and nobody remembers why. Bugs that reproduce in staging but not production (or vice versa) very often trace back to this. Teams discover it when a staging test passes, prod deploys, and breaks, because staging was exercising a different config path.
  • Secrets in git history. Someone commits an API key, notices an hour later, force-pushes the “fix.” The key is still in the git history, still indexed by GitHub secret scanners, still compromised. Rotation must be assumed, not optional, once a secret touches a repo.
  • Feature flag explosion. A healthy feature flag system has 20-50 active flags at a mature org. An unhealthy one has 500+ flags, most of which nobody remembers what they do. The code is now a maze of nested if (flag.enabled) branches, and removing any flag has become risky because its behavior interacts with 12 others.
  • Silent config failure modes. A new config key is misspelled (max_retires), your code reads it with a default of 0, and retries now run zero times. Or a boolean flag is set to the string “false” which is truthy in JavaScript. Either bug passes every health check and only surfaces as elevated error rates that take an hour to trace.
Solutions & Patterns:
  • GitOps for config, not just code. All environment configs live in a single repo with a clear defaults/, staging/, production/ structure. Changes go through PR review. A drift-detection job runs daily and flags unexpected differences.
  • Pre-commit secret scanning. Use gitleaks or trufflehog in a pre-commit hook plus a CI check. If a secret ever lands in git history, rotate immediately — do not try to “clean” git history; the secret has been published.
  • Enforce feature flag lifecycle. Every flag has an owner, a creation date, and an expiration date in its metadata. A nightly job creates cleanup tickets for flags older than 90 days. A linter fails CI if a flag is referenced in code without a corresponding definition with required metadata.
  • Strongly-typed config loading with startup validation. Tools like pydantic-settings (Python) or convict (Node) refuse to boot the service if a key is missing, misspelled, or has the wrong type. A startup-time failure is far cheaper than a runtime failure buried in production traffic.
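
The daily drift-detection job mentioned above can be quite small. What follows is a minimal sketch, assuming each environment's config has already been loaded into a plain object. The names `flatten` and `findDrift` and the `expected` allowlist are hypothetical and not part of any particular tool.

```javascript
// Flatten a nested config object into dotted keys: { a: { b: 1 } } -> { 'a.b': 1 }
function flatten(obj, prefix = '', out = {}) {
  for (const [key, value] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      flatten(value, path, out);
    } else {
      out[path] = value;
    }
  }
  return out;
}

// Report keys that exist in only one environment, or whose values differ
// without being listed as an expected difference.
function findDrift(staging, production, expected = []) {
  const a = flatten(staging);
  const b = flatten(production);
  const drift = [];
  for (const key of new Set([...Object.keys(a), ...Object.keys(b)])) {
    if (expected.includes(key)) continue;
    if (!(key in a)) {
      drift.push({ key, issue: 'missing in staging' });
    } else if (!(key in b)) {
      drift.push({ key, issue: 'missing in production' });
    } else if (JSON.stringify(a[key]) !== JSON.stringify(b[key])) {
      drift.push({ key, issue: 'value differs', staging: a[key], production: b[key] });
    }
  }
  return drift;
}

const staging = { api: { max_retries: 3, timeout: 5000 }, database: { host: 'db.staging' } };
const production = { api: { max_retries: 5, timeout: 5000 }, database: { host: 'db.prod' } };

// database.host is allowed to differ; max_retries is not, so it is reported.
console.log(findDrift(staging, production, ['database.host']));
```

Keeping the list of expected differences in the same repo as the configs means every intentional divergence is reviewed and documented rather than remembered.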

The 12-Factor App Configuration

Factor III: Config

Store config in the environment, not in code. This is one of the most violated 12-factor principles in practice. The test is simple: could you open-source your codebase right now without exposing a single credential? If the answer is no, you have config leaking into code.

Production pitfall: A common mistake is having different config loading logic per environment (if (env === 'production')). This means your staging environment is not actually testing the same config path as production, which defeats the purpose of having staging at all.

The first step toward centralization is making your service read every configuration value from the environment rather than hardcoding it. This seems trivial, but it’s where most teams start accumulating debt: a hardcoded localhost here, a default password there, and six months later you have three PRs open just to change an endpoint URL. The rule: if a value could ever differ between environments (dev, staging, prod, CI, a teammate’s laptop), it belongs in the environment, not in code.
// ❌ Bad: Hardcoded configuration
const config = {
  database: {
    host: 'localhost',
    port: 5432,
    password: 'secretpassword'  // Never do this!
  },
  api: {
    timeout: 5000,
    retries: 3
  }
};

// ✅ Good: Environment-based configuration
const config = {
  database: {
    host: process.env.DB_HOST || 'localhost',
    port: parseInt(process.env.DB_PORT, 10) || 5432,
    password: process.env.DB_PASSWORD  // Must be set externally
  },
  api: {
    timeout: parseInt(process.env.API_TIMEOUT, 10) || 5000,
    retries: parseInt(process.env.API_RETRIES, 10) || 3
  }
};
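
One caveat with the `|| default` idiom above: `0` is falsy, so `API_RETRIES=0` silently becomes 3, and a missing `DB_PASSWORD` is only noticed when the first connection fails. A stricter sketch follows. The helpers `intFromEnv`, `boolFromEnv`, and `requireEnv` are hypothetical names (not a library API), and the environment is passed in as a parameter so the loader can be exercised without touching `process.env`.

```javascript
// Read an integer, falling back only when the variable is truly unset.
function intFromEnv(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value)) {
    throw new Error(`${name} must be an integer, got "${raw}"`);
  }
  return value;
}

// Accept only explicit spellings; the string "false" must not count as true.
function boolFromEnv(env, name, fallback) {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback;
  if (['true', '1'].includes(raw.toLowerCase())) return true;
  if (['false', '0'].includes(raw.toLowerCase())) return false;
  throw new Error(`${name} must be true or false, got "${raw}"`);
}

// Required values have no default: refuse to start without them.
function requireEnv(env, name) {
  const raw = env[name];
  if (raw === undefined || raw === '') {
    throw new Error(`Missing required environment variable ${name}`);
  }
  return raw;
}

function loadConfig(env = process.env) {
  return {
    database: {
      host: env.DB_HOST || 'localhost',
      port: intFromEnv(env, 'DB_PORT', 5432),
      password: requireEnv(env, 'DB_PASSWORD')
    },
    api: {
      timeout: intFromEnv(env, 'API_TIMEOUT', 5000),
      retries: intFromEnv(env, 'API_RETRIES', 3)
    },
    features: {
      newCheckout: boolFromEnv(env, 'FEATURE_NEW_CHECKOUT', false)
    }
  };
}

// An explicit 0 and an explicit "false" both survive loading.
const example = loadConfig({ DB_PASSWORD: 's3cret', API_RETRIES: '0', FEATURE_NEW_CHECKOUT: 'false' });
console.log(example.api.retries, example.features.newCheckout);
```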

Configuration Hierarchy

Hierarchy is how you keep centralized config sane as it grows. Without structure, a flat key-value store becomes a dumping ground: thousands of untyped strings with no clear ownership. A good hierarchy expresses three things: what the value is (database/host), which scope it applies to (service-specific vs global), and which environment it targets (dev/staging/prod). You also want validation on load — misspelling a key or passing a string where a number is expected should fail at startup, loudly, not silently become “undefined” halfway through handling a request. Tools like convict (Node) and pydantic-settings (Python) do this automatically, which is why they are worth using over a hand-rolled process.env grab-bag.
// config/index.js - Hierarchical configuration with validation
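const convict = require('convict');
const path = require('path');

// convict v6+ ships the 'ipaddress' and 'url' formats in a separate package;
// on convict v5 and earlier they are built in and this line can be dropped.
convict.addFormats(require('convict-format-with-validator'));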

const config = convict({
  env: {
    doc: 'The application environment',
    format: ['production', 'staging', 'development', 'test'],
    default: 'development',
    env: 'NODE_ENV'
  },
  
  server: {
    port: {
      doc: 'The port to bind to',
      format: 'port',
      default: 3000,
      env: 'PORT'
    },
    host: {
      doc: 'The host to bind to',
      format: 'ipaddress',
      default: '0.0.0.0',
      env: 'HOST'
    }
  },
  
  database: {
    host: {
      doc: 'Database host',
      format: String,
      default: 'localhost',
      env: 'DB_HOST'
    },
    port: {
      doc: 'Database port',
      format: 'port',
      default: 5432,
      env: 'DB_PORT'
    },
    name: {
      doc: 'Database name',
      format: String,
      default: 'myapp',
      env: 'DB_NAME'
    },
    username: {
      doc: 'Database username',
      format: String,
      default: '',
      env: 'DB_USERNAME',
      sensitive: true
    },
    password: {
      doc: 'Database password',
      format: String,
      default: '',
      env: 'DB_PASSWORD',
      sensitive: true
    },
    pool: {
      min: {
        doc: 'Minimum pool size',
        format: 'nat',
        default: 2,
        env: 'DB_POOL_MIN'
      },
      max: {
        doc: 'Maximum pool size',
        format: 'nat',
        default: 10,
        env: 'DB_POOL_MAX'
      }
    }
  },
  
  redis: {
    host: {
      doc: 'Redis host',
      format: String,
      default: 'localhost',
      env: 'REDIS_HOST'
    },
    port: {
      doc: 'Redis port',
      format: 'port',
      default: 6379,
      env: 'REDIS_PORT'
    },
    password: {
      doc: 'Redis password',
      format: String,
      default: '',
      env: 'REDIS_PASSWORD',
      sensitive: true
    }
  },
  
  services: {
    payment: {
      url: {
        doc: 'Payment service URL',
        format: 'url',
        default: 'http://payment-service:3000',
        env: 'PAYMENT_SERVICE_URL'
      },
      timeout: {
        doc: 'Payment service timeout (ms)',
        format: 'nat',
        default: 5000,
        env: 'PAYMENT_SERVICE_TIMEOUT'
      }
    },
    inventory: {
      url: {
        doc: 'Inventory service URL',
        format: 'url',
        default: 'http://inventory-service:3000',
        env: 'INVENTORY_SERVICE_URL'
      }
    }
  },
  
  features: {
    newCheckout: {
      doc: 'Enable new checkout flow',
      format: Boolean,
      default: false,
      env: 'FEATURE_NEW_CHECKOUT'
    },
    darkMode: {
      doc: 'Enable dark mode',
      format: Boolean,
      default: true,
      env: 'FEATURE_DARK_MODE'
    }
  },
  
  logging: {
    level: {
      doc: 'Log level',
      format: ['error', 'warn', 'info', 'debug'],
      default: 'info',
      env: 'LOG_LEVEL'
    }
  }
});

// Load environment-specific config
const env = config.get('env');
const configPath = path.join(__dirname, `${env}.json`);

try {
  config.loadFile(configPath);
} catch (e) {
  // A missing file is fine; a malformed or unreadable one should stop the boot.
  if (e.code !== 'ENOENT') throw e;
  console.log(`No config file found for ${env}, using defaults and env vars`);
}

// Validate configuration
config.validate({ allowed: 'strict' });

module.exports = config;
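
To make the value of strict validation concrete without pulling in a library, here is a minimal sketch of the idea. It is not how convict is implemented; `validateStrict` and the flat schema shape are hypothetical and only illustrate why an undeclared key such as `max_retires` should stop the boot.

```javascript
// Schema: dotted key -> expected `typeof` result. Anything else is an error.
const schema = {
  'server.port': 'number',
  'api.max_retries': 'number',
  'features.newCheckout': 'boolean'
};

function validateStrict(schema, values) {
  const errors = [];
  for (const [key, value] of Object.entries(values)) {
    if (!(key in schema)) {
      errors.push(`unknown key "${key}" (typo?)`);
    } else if (typeof value !== schema[key]) {
      errors.push(`"${key}" should be ${schema[key]}, got ${typeof value}`);
    }
  }
  for (const key of Object.keys(schema)) {
    if (!(key in values)) errors.push(`missing key "${key}"`);
  }
  if (errors.length > 0) {
    throw new Error(`Invalid configuration:\n  ${errors.join('\n  ')}`);
  }
  return values;
}

try {
  validateStrict(schema, {
    'server.port': 3000,
    'api.max_retires': 5,            // misspelled on purpose
    'features.newCheckout': 'false'  // string, not boolean
  });
} catch (err) {
  // In a real service this would abort startup instead of just logging.
  console.error(err.message);
}
```

All problems are collected and reported together, so one failed boot is enough to fix every mistake in the file.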

Configuration Tool Comparison

Before choosing a tool, understand what each is optimized for. The most common mistake is using a general-purpose key-value store for secrets management, or paying for a dedicated feature flag service when Consul can handle your simple boolean flags.
| Capability | Consul | etcd | Spring Cloud Config | AWS Parameter Store | Vault |
|---|---|---|---|---|---|
| Primary purpose | Service discovery + config | Distributed KV store | App config server | Cloud-native config | Secrets management |
| Hot reload | Yes (watches) | Yes (watches) | Yes (bus refresh) | No (polling only) | No (app must re-fetch) |
| Secret management | Basic (ACLs) | Basic (RBAC) | Encrypt/decrypt | Yes (SecureString) | Excellent (dynamic secrets, leasing, rotation) |
| Feature flags | Manual (KV structure) | Manual (KV structure) | Manual | Manual | Not designed for this |
| Kubernetes native? | Helm chart available | Built into K8s (backing store) | Helm chart available | AWS only | Helm chart, K8s auth |
| Multi-datacenter | Yes (built-in) | Requires federation | No | Multi-region with replication | Yes (replication) |
| Operational complexity | Medium | Low-Medium (if using K8s etcd) | Low | Low (managed) | High |
| Cost | Free (OSS) / Enterprise | Free (OSS) | Free (OSS) | $0.05 per 10K API calls | Free (OSS) / Enterprise |
Decision framework:
  • Startup with 5 services on Kubernetes: Use K8s ConfigMaps + Secrets (already there, zero extra infrastructure)
  • Growing team needing feature flags: Add LaunchDarkly or Unleash (purpose-built; do not build your own if you can avoid it)
  • Enterprise with compliance requirements: Vault for secrets + Consul for config (audit trails, dynamic credentials, RBAC)
  • AWS-native shop: Parameter Store + Secrets Manager (managed, integrates with IAM)

Consul for Configuration

Consul is one of the most popular choices for centralized config because it combines service discovery with a hierarchical KV store. The key-value API lets you model configuration as a tree (config/production/order-service/database/host), and every service watches its subtree for changes. The killer feature is blocking queries: instead of polling, your client tells Consul “hold this request open until the key changes or 60 seconds pass.” When the key changes, Consul responds immediately; if the wait expires first, it returns the current value with an unchanged index and the client simply asks again. This gives you hot-reload without the latency and cost of tight polling.

The trade-off to understand: the KV store is strongly consistent within a single datacenter (Raft-backed), so a default-mode read in the same DC right after a write sees the new value. Each datacenter keeps its own KV store, though, and data only crosses DCs if you replicate it (for example with consul-replicate), which introduces replication lag. For config, this is almost always fine; for coordinating distributed locks, it matters more.
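
The blocking-query loop can be sketched as follows. Consul's HTTP API takes `index` and `wait` query parameters and reports the current index in the `X-Consul-Index` response header. Everything else here is an assumption for illustration: `fetchKey` is a hypothetical function you would implement with your HTTP client, and the in-memory `fakeConsul` stands in for a real agent so the sketch runs on its own.

```javascript
// fetchKey({ index, wait }) must resolve to { index, value } and should not
// resolve until the key's index moves past `index` or `wait` elapses.
async function watchKey(fetchKey, onChange, { wait = '60s', maxRounds = Infinity } = {}) {
  let lastIndex = 0;
  for (let round = 0; round < maxRounds; round++) {
    const { index, value } = await fetchKey({ index: lastIndex, wait });
    if (index < lastIndex) {
      // Index went backwards (for example after a restore): start over.
      lastIndex = 0;
      continue;
    }
    if (index !== lastIndex) {
      lastIndex = index;
      onChange(value, index);
    }
    // Same index means the wait expired with no change: just ask again.
  }
}

// Stand-in for Consul: replays a scripted sequence of responses.
function fakeConsul(responses) {
  let i = 0;
  return async () => responses[Math.min(i++, responses.length - 1)];
}

const seen = [];
const done = watchKey(
  fakeConsul([
    { index: 10, value: '{"max":10}' }, // initial read
    { index: 10, value: '{"max":10}' }, // wait expired, nothing changed
    { index: 14, value: '{"max":20}' }  // key was updated
  ]),
  (value) => seen.push(value),
  { maxRounds: 3 }
);

done.then(() => console.log(seen));
```

A production loop would also back off after errors so that an unreachable agent does not turn the watch into a tight retry loop.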

Setup and Connection

The client below shows the full lifecycle for Consul-based configuration: load an initial snapshot on startup, then register watches that fire whenever values change. Notice how we merge a global config layer with a service-specific layer — this is the “hierarchy” pattern in action. Global values (like the company SMTP server) live once; service-specific overrides (like the order service’s custom timeout) live under the service’s own prefix. If you flatten everything into one namespace, you lose the ability to reason about scope and end up with copies of the same value under different keys.
// config/consul-config.js
const Consul = require('consul');
const EventEmitter = require('events');

class ConsulConfig extends EventEmitter {
  constructor(options = {}) {
    super();
    
    this.consul = new Consul({
      host: process.env.CONSUL_HOST || 'localhost',
      port: process.env.CONSUL_PORT || 8500,
      promisify: true
    });
    
    this.serviceName = options.serviceName || process.env.SERVICE_NAME;
    this.environment = options.environment || process.env.NODE_ENV || 'development';
    this.prefix = `config/${this.environment}`;
    
    this.config = {};
    this.watchers = new Map();
  }

  async load() {
    // Load global config
    const globalConfig = await this.getPrefix(`${this.prefix}/global`);
    
    // Load service-specific config (overrides global)
    const serviceConfig = await this.getPrefix(`${this.prefix}/${this.serviceName}`);
    
    // Merge configs
    this.config = this.deepMerge(globalConfig, serviceConfig);
    
    console.log(`Loaded configuration for ${this.serviceName} in ${this.environment}`);
    return this.config;
  }

  async getPrefix(prefix) {
    try {
      const result = await this.consul.kv.get({
        key: prefix,
        recurse: true
      });
      
      if (!result) return {};
      
      const config = {};
      for (const item of result) {
        const key = item.Key.replace(`${prefix}/`, '');
        const value = this.parseValue(item.Value);
        this.setNestedValue(config, key, value);
      }
      
      return config;
    } catch (error) {
      console.error(`Failed to load config from ${prefix}:`, error.message);
      return {};
    }
  }

  parseValue(value) {
    if (!value) return null;
    
    try {
      return JSON.parse(value);
    } catch {
      // Not JSON, return as string
      return value;
    }
  }

  setNestedValue(obj, path, value) {
    const keys = path.split('/');
    let current = obj;
    
    for (let i = 0; i < keys.length - 1; i++) {
      if (!(keys[i] in current)) {
        current[keys[i]] = {};
      }
      current = current[keys[i]];
    }
    
    current[keys[keys.length - 1]] = value;
  }

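  deepMerge(target, source) {
    // Only recurse when both sides are plain objects; arrays and scalars
    // from the more specific layer replace the global value wholesale.
    const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);
    const result = { ...target };
    
    for (const key of Object.keys(source)) {
      if (isPlainObject(source[key]) && isPlainObject(target[key])) {
        result[key] = this.deepMerge(target[key], source[key]);
      } else {
        result[key] = source[key];
      }
    }
    
    return result;
  }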

  get(path, defaultValue = undefined) {
    const keys = path.split('.');
    let value = this.config;
    
    for (const key of keys) {
      if (value && typeof value === 'object' && key in value) {
        value = value[key];
      } else {
        return defaultValue;
      }
    }
    
    return value;
  }

  // Watch for configuration changes
  watch(key, callback) {
    const fullKey = `${this.prefix}/${this.serviceName}/${key}`;
    
    const watcher = this.consul.watch({
      method: this.consul.kv.get,
      options: { key: fullKey }
    });
    
    watcher.on('change', (data) => {
      const newValue = data ? this.parseValue(data.Value) : null;
      const oldValue = this.get(key.replace(/\//g, '.'));
      
      if (JSON.stringify(newValue) !== JSON.stringify(oldValue)) {
        // Update local config
        this.setNestedValue(this.config, key, newValue);
        
        // Emit change event
        this.emit('change', { key, oldValue, newValue });
        callback(newValue, oldValue);
      }
    });
    
    watcher.on('error', (err) => {
      console.error(`Watch error for ${key}:`, err);
    });
    
    this.watchers.set(key, watcher);
    return watcher;
  }

  // Set configuration value
  async set(key, value) {
    const fullKey = `${this.prefix}/${this.serviceName}/${key}`;
    const stringValue = typeof value === 'object' ? JSON.stringify(value) : String(value);
    
    await this.consul.kv.set(fullKey, stringValue);
    this.setNestedValue(this.config, key, value);
  }

  // Close all watchers
  close() {
    for (const watcher of this.watchers.values()) {
      watcher.end();
    }
    this.watchers.clear();
  }
}

module.exports = ConsulConfig;
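
To see the layering in isolation, here is a small sketch with no Consul dependency. The entry arrays mimic the `{ Key, Value }` items a recursive KV read yields; the key names and values are made up for illustration, and `buildTree` and `mergeLayers` are simplified stand-ins for the class methods above.

```javascript
// Values are stored as strings; treat valid JSON as typed data.
function parse(value) {
  try { return JSON.parse(value); } catch { return value; }
}

// Turn flat KV entries under a prefix into a nested object.
function buildTree(entries, prefix) {
  const tree = {};
  for (const { Key, Value } of entries) {
    const parts = Key.slice(prefix.length + 1).split('/').filter(Boolean);
    if (parts.length === 0) continue; // skip the folder key itself
    let node = tree;
    for (const part of parts.slice(0, -1)) {
      if (typeof node[part] !== 'object' || node[part] === null) node[part] = {};
      node = node[part];
    }
    node[parts[parts.length - 1]] = parse(Value);
  }
  return tree;
}

const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);

// Service-specific values win; untouched global values are inherited.
function mergeLayers(base, override) {
  const result = { ...base };
  for (const [key, value] of Object.entries(override)) {
    result[key] = isPlainObject(value) && isPlainObject(base[key])
      ? mergeLayers(base[key], value)
      : value;
  }
  return result;
}

const globalEntries = [
  { Key: 'config/production/global/smtp/host', Value: 'smtp.internal' },
  { Key: 'config/production/global/http/timeout', Value: '5000' },
  { Key: 'config/production/global/http/retries', Value: '3' }
];
const serviceEntries = [
  { Key: 'config/production/order-service/http/timeout', Value: '8000' }
];

const merged = mergeLayers(
  buildTree(globalEntries, 'config/production/global'),
  buildTree(serviceEntries, 'config/production/order-service')
);
console.log(JSON.stringify(merged));
```

The order service overrides only `http.timeout`; it still inherits `http.retries` and the SMTP host from the global layer, which is the point of keeping the two scopes separate.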

Using Consul Config in Services

Here is where the payoff shows up: your application startup code loads config once, registers watches for anything that might change at runtime (database pool sizes, feature flag states, third-party endpoints), and then calls config.get(...) freely throughout the codebase. When the value changes in Consul, your watch callback fires and gracefully reconfigures whatever needs to change — no restart required. The common mistake is caching a value from config.get() at startup and holding it in a local variable forever; always re-read from config.get() at request time, or wire up a watch to refresh your local copy.
// app.js
const express = require('express');
const ConsulConfig = require('./config/consul-config');

const app = express();
const config = new ConsulConfig({ serviceName: 'order-service' });

async function startServer() {
  // Load initial configuration
  await config.load();
  
  // Watch for configuration changes.
  // reconnectDatabase and updateRateLimiter are application-specific
  // helpers (not shown here).
  config.watch('database', (newValue) => {
    // Avoid logging the values themselves: they may contain credentials.
    console.log('Database config changed, reconnecting');
    reconnectDatabase(newValue);
  });
  
  config.watch('features/rateLimit', (newValue) => {
    console.log('Rate limit changed to:', newValue);
    updateRateLimiter(newValue);
  });
  
  // Listen for any config changes (log the key only, not the values)
  config.on('change', ({ key }) => {
    console.log(`Config changed: ${key}`);
  });
  
  // Use configuration. The port is only needed once, at bind time; values
  // that can change at runtime should be re-read with config.get() when used.
  const port = config.get('server.port', 3000);
  
  app.get('/health', (req, res) => {
    res.json({
      status: 'healthy',
      config: {
        environment: config.environment,
        features: config.get('features')
      }
    });
  });
  
  app.listen(port, () => {
    console.log(`Server running on port ${port}`);
  });
}

startServer().catch(console.error);
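
The stale-value mistake is easy to reproduce. The sketch below uses a tiny in-memory stand-in (`FakeConfig`, a hypothetical class) with the same `get` and `on('change')` shape as the Consul-backed class, so it runs without Consul.

```javascript
const EventEmitter = require('events');

class FakeConfig extends EventEmitter {
  constructor(values) {
    super();
    this.values = { ...values };
  }
  get(key, fallback) {
    return key in this.values ? this.values[key] : fallback;
  }
  set(key, newValue) {
    const oldValue = this.values[key];
    this.values[key] = newValue;
    this.emit('change', { key, oldValue, newValue });
  }
}

const config = new FakeConfig({ 'features.rateLimit': 100 });

// Pitfall: captured once at startup, never updated.
const staleLimit = config.get('features.rateLimit');

// Option 1: read at the point of use.
const currentLimit = () => config.get('features.rateLimit');

// Option 2: keep a local copy, but refresh it from change events.
let watchedLimit = config.get('features.rateLimit');
config.on('change', ({ key, newValue }) => {
  if (key === 'features.rateLimit') watchedLimit = newValue;
});

// Simulate an operator changing the value at runtime.
config.set('features.rateLimit', 250);

console.log({ staleLimit, current: currentLimit(), watchedLimit });
```

Option 1 is simplest when reads are cheap; option 2 fits values that feed a long-lived object such as a connection pool or rate limiter that must be rebuilt on change.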

Feature Flags

Feature flags are the secret weapon of high-performing engineering teams. They decouple deployment from release — you can merge and deploy code that is not yet visible to users, then turn it on gradually (1% of users, then 10%, then 50%, then 100%). If something goes wrong, you flip the flag off in seconds instead of rolling back a deployment. Netflix, Google, and Facebook all use feature flags extensively. The trade-off: they add complexity and create technical debt if you do not clean up old flags. Set a rule: every feature flag gets a cleanup ticket with a deadline when it is created.

Caveats & Common Pitfalls: Feature Flag Technical Debt

Feature flags promise agility but accumulate sharp costs if untended:
  • The “temporary” flag that never goes away. A flag added for a 2-week experiment in 2022 is still in the code in 2026. The code branches have diverged. Removing the flag now requires reading and testing both branches carefully — a 1-week task for what should have been 10 minutes of cleanup.
  • Flag interactions creating untested state-space. 10 boolean flags = 1024 possible combinations. You cannot test them all. A user who happens to have 4 specific flags enabled hits a code path nobody has ever executed. This manifests as “impossible” bugs that QA could never reproduce.
  • Flag evaluation latency on hot paths. A single flag check against an in-memory map is cheap (microseconds), but a request that evaluates 20 flags with targeting rules and hashing, repeated across nested calls, can add up to measurable time. Worse: if the flag SDK falls back to a network call when the local cache is cold, you have just coupled request latency to an external service.
  • Inconsistent flag state across services. Order service evaluates newCheckout as true for user X; Inventory service evaluates it as false. The request gets half the new flow and half the old one, creating data corruption. This happens when services use different SDK versions or have different rule caches.
Solutions & Patterns:
  • Treat flag cleanup as part of the feature’s definition of done. The ticket that creates the flag also creates the ticket to remove it. CI fails if a flag lives past its expiration date without an explicit extension review.
  • Pass flag evaluation results downstream. When the Order service evaluates a flag, include the result in a request header (X-Feature-NewCheckout: true). Downstream services read the header rather than re-evaluating. Guarantees consistency across the entire request.
  • Kill switches separate from feature flags. A kill switch is a flag owned by ops/SRE with 1-click disable. It should be independent from product flags and should never be tangled with experimentation logic. When prod is on fire, you do not want to navigate an experimentation UI.
  • Evaluate once per request, cache for the request scope. Middleware evaluates all flags at request entry, attaches them to the request context, and every downstream call reads from the context. Avoids re-evaluating the same flag 20 times per request.
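
The header-propagation and evaluate-once patterns combine naturally. What follows is a minimal sketch, assuming an `X-Feature-<FlagName>` header convention like the one described above and lower-cased incoming header names as Node's HTTP server provides them. `resolveFlags` and `toHeaders` are hypothetical helper names, and `evaluate` stands in for whatever flag SDK call you use.

```javascript
const HEADER_PREFIX = 'x-feature-';

// Decide every flag once: trust an upstream decision when present,
// otherwise evaluate locally. The result is frozen for the request.
function resolveFlags(flagNames, incomingHeaders, evaluate) {
  const decisions = {};
  for (const name of flagNames) {
    const upstream = incomingHeaders[HEADER_PREFIX + name.toLowerCase()];
    if (upstream === 'true' || upstream === 'false') {
      decisions[name] = upstream === 'true';
    } else {
      decisions[name] = Boolean(evaluate(name));
    }
  }
  return Object.freeze(decisions);
}

// Headers to attach to outgoing calls so downstream services agree.
function toHeaders(decisions) {
  const headers = {};
  for (const [name, enabled] of Object.entries(decisions)) {
    headers[`X-Feature-${name}`] = String(enabled);
  }
  return headers;
}

// Edge service: no upstream headers, so it evaluates locally (counted here).
let evaluations = 0;
const evaluate = (name) => {
  evaluations++;
  return name === 'newCheckout';
};
const atEdge = resolveFlags(['newCheckout', 'darkMode'], {}, evaluate);
const outgoing = toHeaders(atEdge);

// Downstream service: receives the headers and never re-evaluates.
const received = Object.fromEntries(
  Object.entries(outgoing).map(([k, v]) => [k.toLowerCase(), v])
);
const downstream = resolveFlags(['newCheckout', 'darkMode'], received, () => {
  throw new Error('should not evaluate when upstream already decided');
});

console.log(atEdge, downstream, evaluations);
```

If you adopt this, strip any incoming `X-Feature-*` headers at the public edge so that only internal services can set them.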

Feature Flag System

The core of any feature flag system is consistent user bucketing — given the same user ID and the same flag, you must always return the same answer. Otherwise a user could see feature X on one page load and not on the next, which is worse than not having the feature at all. The standard technique is hashing the user ID plus the flag name and mapping it to a number in [0, 100). If the number falls below the flag’s rollout percentage, the user is in the experiment. This is stateless (no database of user assignments) and stable across service restarts. Beyond simple boolean and percentage rollouts, the patterns below cover user allowlists (beta testers), time-based gradual rollouts (ramp from 0% to 100% over a week), and attribute-based targeting (only premium subscribers see this). You do not need all of these on day one — but building the abstraction upfront means adding new strategies later is a one-function change rather than a rewrite.
// features/feature-flags.js
const ConsulConfig = require('../config/consul-config');
const crypto = require('crypto');

class FeatureFlags {
  constructor(options = {}) {
    this.config = options.config || new ConsulConfig({ serviceName: 'features' });
    this.flags = new Map();
    this.overrides = new Map(); // For testing
  }

  async initialize() {
    await this.config.load();
    
    // Watch for feature flag changes
    this.config.watch('flags', (newFlags) => {
      console.log('Feature flags updated:', newFlags);
      this.updateFlags(newFlags);
    });
    
    this.updateFlags(this.config.get('flags', {}));
  }

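  updateFlags(flags) {
    this.flags.clear();
    // The watch callback passes null if the key is deleted, hence the fallback.
    for (const [name, config] of Object.entries(flags || {})) {
      // Always carry the flag's key as its name so percentage bucketing
      // hashes a distinct value per flag, even if the stored JSON omits "name".
      this.flags.set(name, { ...this.parseFlag(config), name });
    }
  }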

  parseFlag(config) {
    if (typeof config === 'boolean') {
      return { enabled: config, type: 'boolean' };
    }
    return config;
  }

  // Check if feature is enabled
  isEnabled(flagName, context = {}) {
    // Check overrides first (for testing)
    if (this.overrides.has(flagName)) {
      return this.overrides.get(flagName);
    }
    
    const flag = this.flags.get(flagName);
    if (!flag) return false;
    
    switch (flag.type) {
      case 'boolean':
        return flag.enabled;
      
      case 'percentage':
        return this.checkPercentage(flag, context);
      
      case 'userList':
        return this.checkUserList(flag, context);
      
      case 'gradualRollout':
        return this.checkGradualRollout(flag, context);
      
      case 'userAttribute':
        return this.checkUserAttribute(flag, context);
      
      default:
        return flag.enabled || false;
    }
  }

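  // Percentage-based rollout
  checkPercentage(flag, context) {
    // Note: every caller without a userId or sessionId shares the
    // 'anonymous' bucket, so they are all in or all out together.
    const userId = context.userId || context.sessionId || 'anonymous';
    // MD5 is used only for stable bucketing here, not for security.
    const hash = crypto.createHash('md5').update(userId + flag.name).digest('hex');
    const percentage = parseInt(hash.substring(0, 8), 16) % 100;
    return percentage < flag.percentage;
  }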

  // User allowlist
  checkUserList(flag, context) {
    if (!context.userId) return false;
    return flag.users.includes(context.userId);
  }

  // Gradual rollout based on time
  checkGradualRollout(flag, context) {
    const now = Date.now();
    const start = new Date(flag.startDate).getTime();
    const end = new Date(flag.endDate).getTime();
    
    if (now < start) return false;
    if (now >= end) return true;
    
    const progress = (now - start) / (end - start);
    return this.checkPercentage({ ...flag, percentage: progress * 100 }, context);
  }

  // User attribute matching
  checkUserAttribute(flag, context) {
    const userValue = context[flag.attribute];
    if (!userValue) return false;
    
    switch (flag.operator) {
      case 'equals':
        return userValue === flag.value;
      case 'contains':
        return userValue.includes(flag.value);
      case 'in':
        return flag.values.includes(userValue);
      case 'regex':
        return new RegExp(flag.pattern).test(userValue);
      default:
        return false;
    }
  }

  // Get flag value (for non-boolean flags)
  getValue(flagName, defaultValue, context = {}) {
    const flag = this.flags.get(flagName);
    if (!flag) return defaultValue;
    
    if (!this.isEnabled(flagName, context)) {
      return defaultValue;
    }
    
    return flag.value !== undefined ? flag.value : defaultValue;
  }

  // Set override for testing
  setOverride(flagName, value) {
    this.overrides.set(flagName, value);
  }

  clearOverrides() {
    this.overrides.clear();
  }

  // Get all flags for debugging
  getAllFlags() {
    const result = {};
    for (const [name, config] of this.flags) {
      result[name] = config;
    }
    return result;
  }
}

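// Singleton instance. The promise (not the instance) is cached so that
// concurrent first calls share one initialization instead of racing.
let instancePromise = null;

module.exports = {
  FeatureFlags,
  
  getFeatureFlags() {
    if (!instancePromise) {
      const instance = new FeatureFlags();
      instancePromise = instance.initialize()
        .then(() => instance)
        .catch((err) => {
          // Allow a later call to retry if initialization failed.
          instancePromise = null;
          throw err;
        });
    }
    return instancePromise;
  }
};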
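
Two properties of hash-based bucketing are worth verifying for yourself: the same user always lands in the same bucket, and raising the percentage only adds users (nobody who had the feature at 10% loses it at 25%). The sketch below mirrors the MD5-based approach in the class above; the `bucket` and `inRollout` helper names and the `:` separator are illustrative choices, not part of that class.

```javascript
const crypto = require('crypto');

// Map (userId, flagName) to a stable integer in [0, 100).
function bucket(userId, flagName) {
  const hash = crypto.createHash('md5').update(`${flagName}:${userId}`).digest('hex');
  return parseInt(hash.substring(0, 8), 16) % 100;
}

const inRollout = (userId, flagName, percentage) => bucket(userId, flagName) < percentage;

const users = Array.from({ length: 10000 }, (_, i) => `user-${i}`);

// Stable: repeated evaluation gives the same answer.
const stable = users.every(
  (u) => bucket(u, 'newCheckoutFlow') === bucket(u, 'newCheckoutFlow')
);

// Monotonic: everyone enabled at 10% is still enabled at 25%.
const at10 = users.filter((u) => inRollout(u, 'newCheckoutFlow', 10));
const monotonic = at10.every((u) => inRollout(u, 'newCheckoutFlow', 25));

// Roughly proportional: about a quarter of users at 25%.
const share25 =
  users.filter((u) => inRollout(u, 'newCheckoutFlow', 25)).length / users.length;

console.log({ stable, monotonic, share25 });
```

Including the flag name in the hash input means each flag buckets users independently, so the same early cohort does not receive every experiment.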

Feature Flag Configuration in Consul

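The definitions below are stored as a single JSON value in Consul at config/production/features/flags (JSON itself has no comment syntax, so the path is given here rather than inside the document):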
{
  "newCheckoutFlow": {
    "type": "percentage",
    "name": "newCheckoutFlow",
    "percentage": 25,
    "description": "New streamlined checkout experience"
  },
  
  "betaFeatures": {
    "type": "userList",
    "name": "betaFeatures",
    "users": ["user-123", "user-456", "user-789"],
    "description": "Beta features for selected users"
  },
  
  "darkMode": {
    "type": "boolean",
    "enabled": true,
    "description": "Dark mode UI"
  },
  
  "newRecommendationEngine": {
    "type": "gradualRollout",
    "name": "newRecommendationEngine",
    "startDate": "2024-01-01T00:00:00Z",
    "endDate": "2024-01-15T00:00:00Z",
    "description": "Gradual rollout of new ML recommendations"
  },
  
  "premiumFeatures": {
    "type": "userAttribute",
    "attribute": "subscriptionTier",
    "operator": "in",
    "values": ["premium", "enterprise"],
    "description": "Features for premium subscribers"
  },
  
  "experimentalApi": {
    "type": "percentage",
    "name": "experimentalApi",
    "percentage": 10,
    "value": {
      "apiVersion": "v2",
      "timeout": 10000
    },
    "description": "Test new API version"
  }
}
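The `gradualRollout` entry above stores only a start and an end date, so the evaluator has to turn the current time into a rollout percentage. A minimal sketch of one way to do that, assuming a linear ramp from 0% at `startDate` to 100% at `endDate` (the helper name `rolloutPercentage` is hypothetical and may differ from how the `FeatureFlags` class in this chapter computes it):

```javascript
// Hypothetical helper: linear ramp between startDate and endDate.
function rolloutPercentage(flag, now = new Date()) {
  const start = new Date(flag.startDate).getTime();
  const end = new Date(flag.endDate).getTime();
  const t = now.getTime();

  if (t <= start) return 0;   // rollout has not started yet
  if (t >= end) return 100;   // rollout is complete
  return ((t - start) / (end - start)) * 100;
}

const rolloutFlag = {
  startDate: '2024-01-01T00:00:00Z',
  endDate: '2024-01-15T00:00:00Z'
};

// Seven days into the 14-day window, the ramp is at the halfway point.
console.log(rolloutPercentage(rolloutFlag, new Date('2024-01-08T00:00:00Z'))); // 50
```

The resulting percentage can then be fed into the same per-user bucketing used for `percentage` flags, so a user who gets the feature on day 3 keeps it on day 4.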

Using Feature Flags in Routes

The pattern below is the “feature flag dependency” — pass flag evaluation context (user ID, session, subscription tier) into the flag check at the point of use, then branch on the result. Middleware is the cleanest way to propagate flag evaluation results through a request: evaluate the flag once, attach the result to the request object, and let handlers read it without re-evaluating. Avoid scattering isEnabled calls deep in business logic — you want feature branches to be visible at the request-handling level where they can be audited and traced.
// routes/checkout.js
const express = require('express');
const router = express.Router();
const { getFeatureFlags } = require('../features/feature-flags');

router.post('/checkout', async (req, res) => {
  const featureFlags = await getFeatureFlags();
  const context = {
    userId: req.user.id,
    subscriptionTier: req.user.subscriptionTier,
    country: req.user.country
  };
  
  if (featureFlags.isEnabled('newCheckoutFlow', context)) {
    // New checkout flow
    return handleNewCheckout(req, res);
  }
  
  // Legacy checkout flow
  return handleLegacyCheckout(req, res);
});

// Feature flag middleware
const featureFlagMiddleware = (flagName, options = {}) => {
  return async (req, res, next) => {
    const featureFlags = await getFeatureFlags();
    const context = {
      userId: req.user?.id,
      sessionId: req.sessionID,
      ...req.user
    };
    
    const isEnabled = featureFlags.isEnabled(flagName, context);
    req.featureFlags = req.featureFlags || {};
    req.featureFlags[flagName] = isEnabled;
    
    if (options.required && !isEnabled) {
      return res.status(404).json({
        error: 'Feature not available'
      });
    }
    
    next();
  };
};

// Use middleware
router.get('/new-dashboard',
  featureFlagMiddleware('newDashboard', { required: true }),
  (req, res) => {
    res.render('new-dashboard');
  }
);

module.exports = router;

Kubernetes ConfigMaps and Secrets

ConfigMap for Non-Sensitive Config

ConfigMaps are Kubernetes’ built-in answer to the configuration problem, and for small-to-medium clusters they are often all you need. A ConfigMap is a namespaced object that stores key-value pairs or entire config files; pods in the same namespace can consume them as environment variables or mounted files. The tradeoff vs a dedicated config server like Consul: ConfigMaps have no native hot-reload (changes require a pod restart unless you mount them as files and the app watches the filesystem), no cross-cluster replication, and no audit trail beyond Kubernetes audit logs. What you get in return is zero extra infrastructure: ConfigMaps are just Kubernetes.
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
  labels:
    app: order-service
data:
  # Simple key-value pairs
  LOG_LEVEL: "info"
  API_TIMEOUT: "5000"
  MAX_RETRIES: "3"
  
  # JSON config file
  config.json: |
    {
      "server": {
        "port": 3000,
        "host": "0.0.0.0"
      },
      "database": {
        "pool": {
          "min": 2,
          "max": 10
        }
      },
      "features": {
        "caching": true,
        "compression": true
      }
    }
  
  # Application properties
  application.properties: |
    spring.datasource.hikari.minimum-idle=2
    spring.datasource.hikari.maximum-pool-size=10
    logging.level.root=INFO

Secrets for Sensitive Data

Kubernetes Secrets are the sibling of ConfigMaps for sensitive data, but the name is deceptive: by default, Secret values are only base64-encoded, not encrypted. Anyone with kubectl get secret permissions can trivially decode them. For real security you need encryption-at-rest enabled in etcd (which requires configuring the API server with an encryption config) plus strict RBAC. For anything sensitive enough to warrant a dedicated secrets manager — production database credentials, API keys for paid services, signing keys — use Vault or a cloud provider’s secrets manager instead. Secrets are fine for low-stakes values where the main goal is keeping tokens out of ConfigMaps and source control.
# secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: order-service-secrets
type: Opaque
data:
  # Base64 encoded values
  DB_PASSWORD: cGFzc3dvcmQxMjM=
  API_KEY: c2VjcmV0LWFwaS1rZXk=
  JWT_SECRET: and0LXN1cGVyLXNlY3JldC1rZXk=
---
# For Docker registry credentials
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: eyJhdXRocyI6eyJyZWdpc3RyeS5leGFtcGxlLmNvbSI6eyJ1c2VybmFtZSI6InVzZXIiLCJwYXNzd29yZCI6InBhc3MifX19
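To make the "base64 is not encryption" point concrete, the sketch below decodes two of the values from the Secret manifest above using nothing but Node's built-in `Buffer`. The helper name `decodeSecretData` is illustrative only.

```javascript
// Values copied from the Secret manifest above.
const secretData = {
  DB_PASSWORD: 'cGFzc3dvcmQxMjM=',
  API_KEY: 'c2VjcmV0LWFwaS1rZXk='
};

// Decode every base64 value back to plain text. No key is needed.
function decodeSecretData(data) {
  const decoded = {};
  for (const [key, value] of Object.entries(data)) {
    decoded[key] = Buffer.from(value, 'base64').toString('utf8');
  }
  return decoded;
}

console.log(decodeSecretData(secretData));
// { DB_PASSWORD: 'password123', API_KEY: 'secret-api-key' }
```

Anyone who can read the Secret object can do the same, which is why RBAC and encryption at rest carry the actual security burden.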

Using ConfigMaps and Secrets in Deployments

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: myregistry/order-service:v1
        
        # Environment variables from ConfigMap
        envFrom:
        - configMapRef:
            name: order-service-config
        
        # Individual env vars from secrets
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: order-service-secrets
              key: DB_PASSWORD
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: order-service-secrets
              key: API_KEY
        
        # Mount config files
        volumeMounts:
        - name: config-volume
          mountPath: /app/config
          readOnly: true
        - name: secrets-volume
          mountPath: /app/secrets
          readOnly: true
      
      volumes:
      - name: config-volume
        configMap:
          name: order-service-config
          items:
          - key: config.json
            path: config.json
      - name: secrets-volume
        secret:
          secretName: order-service-secrets

Hot Reload with ConfigMap Updates

When a ConfigMap is mounted as a volume, Kubernetes automatically updates the mounted files when the ConfigMap changes, typically within about a minute. This gives you a free hot-reload channel: have your app watch the config directory with chokidar (Node) or watchdog (Python), parse the file when it changes, and apply the new values. Two gotchas: ConfigMap values injected as environment variables do NOT get updated, because env vars are set at container start and never change, and files mounted via subPath do not receive updates either. So for anything you want to hot-reload, mount it as a regular volume file. Also beware of consistency across files: the kubelet swaps the files of a single ConfigMap volume together, but an app that reads related files one at a time, or from separate ConfigMaps, can still briefly see one old and one new. Favor a single config.json file for related values.
// config/kubernetes-config.js
const fs = require('fs');
const path = require('path');
const chokidar = require('chokidar');
const EventEmitter = require('events');

class KubernetesConfig extends EventEmitter {
  constructor(configPath = '/app/config') {
    super();
    this.configPath = configPath;
    this.config = {};
  }

  load() {
    const configFile = path.join(this.configPath, 'config.json');
    
    if (fs.existsSync(configFile)) {
      const content = fs.readFileSync(configFile, 'utf8');
      this.config = JSON.parse(content);
    }
    
    return this.config;
  }

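  watch() {
    const watcher = chokidar.watch(this.configPath, {
      persistent: true,
      ignoreInitial: true
    });

    const reload = (filePath) => {
      console.log(`Config file changed: ${filePath}`);
      const oldConfig = this.config;
      try {
        this.load();
      } catch (error) {
        // Keep serving the last good config if the new file is not valid JSON
        console.error(`Ignoring invalid config update: ${error.message}`);
        return;
      }
      this.emit('change', { oldConfig, newConfig: this.config });
    };

    // Kubernetes updates mounted files by swapping symlinks, which a watcher
    // may report as 'add' rather than 'change', so handle both events.
    watcher.on('change', reload);
    watcher.on('add', reload);

    return watcher;
  }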

  get(key, defaultValue) {
    const keys = key.split('.');
    let value = this.config;
    
    for (const k of keys) {
      if (value && typeof value === 'object' && k in value) {
        value = value[k];
      } else {
        return defaultValue;
      }
    }
    
    return value;
  }
}

module.exports = KubernetesConfig;

HashiCorp Vault for Secrets

Vault is the gold standard for secrets management, and it is worth understanding why before you commit to the operational cost. The key idea is dynamic secrets: instead of sharing one long-lived database password across all your services, Vault generates a unique short-lived credential per service instance on demand. When the service is done (or its lease expires, after whatever TTL you configure on the role, for example 24 hours), Vault automatically revokes the credential. This turns credential rotation from a quarterly security project into a continuous background process. You also get audit logs (every secret access is recorded), fine-grained policies (service X can read these paths, write nothing), and multiple auth methods (Kubernetes service account tokens, AWS IAM, LDAP). The cost is real operational complexity. Vault has a “sealed” state it enters on restart and requires unsealing (either manually with key shares or automatically via a cloud KMS). Running Vault in HA requires a storage backend such as Consul or Vault’s integrated Raft storage, plus a careful backup strategy for the encryption keys. If you do not have compliance requirements (PCI, HIPAA, SOC 2) forcing the issue, start simpler.

Vault Integration

The code below shows the standard Vault lifecycle: authenticate using the pod’s Kubernetes service account token (no hardcoded credentials), fetch static secrets from the KV store, request dynamic database credentials, and schedule lease renewal before expiry. Lease renewal is the subtle part — if you let a lease expire without renewing, Vault revokes the credential and your next database query fails. The typical pattern is to renew at 75% of the lease duration, giving a safety margin for network hiccups.
// config/vault-config.js
const fs = require('fs');
const EventEmitter = require('events');
const vault = require('node-vault');

// Extends EventEmitter so callers can subscribe to 'credentialsRenewed'
class VaultConfig extends EventEmitter {
  constructor(options = {}) {
    super();
    this.client = vault({
      apiVersion: 'v1',
      endpoint: process.env.VAULT_ADDR || 'http://localhost:8200',
      token: process.env.VAULT_TOKEN
    });

    this.secretPath = options.secretPath || 'secret/data';
    this.serviceName = options.serviceName || process.env.SERVICE_NAME;
    this.secrets = {};
    this.leases = new Map();
  }

  // Authenticate with Kubernetes using the pod's service account token
  async authenticateWithKubernetes() {
    const jwt = fs.readFileSync(
      '/var/run/secrets/kubernetes.io/serviceaccount/token',
      'utf8'
    );

    const response = await this.client.kubernetesLogin({
      role: this.serviceName,
      jwt: jwt
    });

    this.client.token = response.auth.client_token;

    // Set up token renewal
    this.scheduleTokenRenewal(response.auth.lease_duration);
  }

  async loadSecrets() {
    const environment = process.env.NODE_ENV || 'development';

    // Load service-specific secrets
    const servicePath = `${this.secretPath}/${environment}/${this.serviceName}`;

    try {
      const response = await this.client.read(servicePath);
      // KV v2 nests the secret payload under data.data
      this.secrets = response.data.data;
      console.log(`Loaded secrets for ${this.serviceName}`);
    } catch (error) {
      if (error.response?.statusCode === 404) {
        console.warn(`No secrets found at ${servicePath}`);
        this.secrets = {};
      } else {
        throw error;
      }
    }

    return this.secrets;
  }

  // Get dynamic database credentials
  async getDatabaseCredentials(dbRole = 'readonly') {
    const path = `database/creds/${dbRole}`;

    try {
      const response = await this.client.read(path);

      // Schedule lease renewal before expiry
      this.scheduleLease(path, response.lease_id, response.lease_duration);

      return {
        username: response.data.username,
        password: response.data.password,
        leaseDuration: response.lease_duration
      };
    } catch (error) {
      console.error('Failed to get database credentials:', error);
      throw error;
    }
  }

  // Renew lease before expiry
  scheduleLease(path, leaseId, leaseDuration) {
    // Renew at 75% of lease duration
    const renewAt = leaseDuration * 0.75 * 1000;

    const timer = setTimeout(async () => {
      try {
        const response = await this.client.write('sys/leases/renew', {
          lease_id: leaseId
        });

        console.log(`Renewed lease for ${path}`);
        this.scheduleLease(path, leaseId, response.lease_duration);
      } catch (error) {
        console.error(`Failed to renew lease for ${path}:`, error);
        // Renewal failed (for example, max TTL reached): request fresh
        // credentials and notify listeners so they can rebuild connections
        try {
          const credentials = await this.getDatabaseCredentials(path.split('/').pop());
          this.emit('credentialsRenewed', credentials);
        } catch (fetchError) {
          // Already logged by getDatabaseCredentials; nothing more to do here
        }
      }
    }, renewAt);

    this.leases.set(leaseId, timer);
  }

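  scheduleTokenRenewal(leaseDuration) {
    const renewAt = leaseDuration * 0.75 * 1000;

    // Keep a handle on the timer so close() can cancel it
    this.tokenTimer = setTimeout(async () => {
      try {
        const response = await this.client.tokenRenewSelf();
        console.log('Vault token renewed');
        // Reschedule from the TTL Vault actually granted, which may be shorter
        this.scheduleTokenRenewal(response?.auth?.lease_duration ?? leaseDuration);
      } catch (error) {
        console.error('Failed to renew Vault token:', error);
        await this.authenticateWithKubernetes();
      }
    }, renewAt);
  }

  get(key, defaultValue) {
    // Use ?? so legitimately falsy values ('' or 0) are not replaced by the default
    return this.secrets[key] ?? defaultValue;
  }

  async close() {
    for (const timer of this.leases.values()) {
      clearTimeout(timer);
    }
    this.leases.clear();
    clearTimeout(this.tokenTimer);
  }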
}

module.exports = VaultConfig;
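The 75% renewal rule is simple arithmetic, but it helps to see the numbers. The sketch below is a standalone helper (the name `renewDelayMs` is hypothetical, not part of node-vault) that converts a lease duration in seconds, as Vault reports it, into the timer delay in milliseconds:

```javascript
// Hypothetical helper: delay in ms before renewing a lease of `leaseSeconds`.
function renewDelayMs(leaseSeconds, fraction = 0.75) {
  if (!(leaseSeconds > 0)) {
    throw new RangeError('leaseSeconds must be positive');
  }
  if (!(fraction > 0 && fraction < 1)) {
    throw new RangeError('fraction must be between 0 and 1');
  }
  return Math.floor(leaseSeconds * fraction * 1000);
}

const oneDaySeconds = 24 * 60 * 60;            // 86,400 seconds
console.log(renewDelayMs(oneDaySeconds));       // 64800000 ms = 18 hours
console.log(renewDelayMs(3600));                // 2700000 ms = 45 minutes
```

With a 24-hour lease the renewal fires after 18 hours, leaving a 6-hour margin for retries. One caveat worth knowing: `setTimeout` delays are limited to a 32-bit signed integer (about 24.8 days), so very long leases need a different scheduling approach.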

Using Vault in Application

Tying Vault into your app during startup follows a clear order: authenticate, load static secrets (API keys, shared secrets), then request dynamic credentials for anything that supports them (most commonly database credentials). The dynamic credentials path is where Vault shines: your service starts up, gets a fresh unique database user with a 24-hour lease, uses it, and when the lease is about to expire the background renewal task extends it (up to the role’s maximum TTL, after which new credentials must be issued). If Vault goes down mid-operation, your service keeps running on the credentials it already holds in memory until the lease expires, which gives you the remaining lease time to bring Vault back. This graceful-degradation property is why caching credentials in the service matters even when using Vault.
// app.js
const express = require('express');
const VaultConfig = require('./config/vault-config');
const { Pool } = require('pg');

const app = express();
const vaultConfig = new VaultConfig({ serviceName: 'order-service' });

let dbPool = null;

async function initializeDatabase() {
  // Get dynamic credentials from Vault
  const credentials = await vaultConfig.getDatabaseCredentials('order-service-rw');
  
  dbPool = new Pool({
    host: process.env.DB_HOST,
    database: process.env.DB_NAME,
    user: credentials.username,
    password: credentials.password,
    max: 10
  });
  
  // Swap pools when Vault issues new credentials
  vaultConfig.on('credentialsRenewed', async (newCredentials) => {
    console.log('Database credentials renewed, reconnecting...');
    const oldPool = dbPool;
    // Create the new pool first so requests never see an ended pool
    dbPool = new Pool({
      host: process.env.DB_HOST,
      database: process.env.DB_NAME,
      user: newCredentials.username,
      password: newCredentials.password,
      max: 10
    });
    // Then close the old pool
    await oldPool.end();
  });
}

async function startServer() {
  // Authenticate with Vault (Kubernetes auth)
  await vaultConfig.authenticateWithKubernetes();
  
  // Load static secrets
  await vaultConfig.loadSecrets();
  
  // Initialize database with dynamic credentials
  await initializeDatabase();
  
  const apiKey = vaultConfig.get('API_KEY');
  const jwtSecret = vaultConfig.get('JWT_SECRET');
  
  app.listen(3000, () => {
    console.log('Server started with Vault integration');
  });
}

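// Exit non-zero on startup failure so the orchestrator restarts the pod
startServer().catch((error) => {
  console.error('Startup failed:', error);
  process.exit(1);
});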

Environment-Specific Configuration

Multi-Environment Setup

The pattern below — a default file plus per-environment overrides plus env-var mappings — is battle-tested across large teams because it expresses three concerns cleanly: defaults that rarely change, per-environment overrides that track structural differences between dev/staging/prod, and runtime env-var injection for things like passwords and host-specific values. The “local.json” file (gitignored) is a key detail: it lets individual developers override any value for local dev without polluting shared config. When someone asks “why does it work on your machine but not mine,” the answer is almost always something in their local.json.
config/
├── default.json                       # Base configuration
├── development.json                   # Dev overrides
├── staging.json                       # Staging overrides
├── production.json                    # Prod overrides
├── local.json                         # Per-developer overrides (gitignored)
└── custom-environment-variables.json  # Env var mappings
// config/loader.js
const fs = require('fs');
const path = require('path');

class ConfigLoader {
  constructor(configDir = './config') {
    this.configDir = configDir;
    this.environment = process.env.NODE_ENV || 'development';
    this.config = {};
  }

  load() {
    // Load base config
    this.config = this.loadFile('default.json');
    
    // Load environment-specific config
    const envConfig = this.loadFile(`${this.environment}.json`);
    this.config = this.deepMerge(this.config, envConfig);
    
    // Load local overrides (not committed to git)
    const localConfig = this.loadFile('local.json');
    this.config = this.deepMerge(this.config, localConfig);
    
    // Apply environment variable overrides
    this.applyEnvVars();
    
    return this.config;
  }

  loadFile(filename) {
    const filePath = path.join(this.configDir, filename);
    
    if (fs.existsSync(filePath)) {
      return JSON.parse(fs.readFileSync(filePath, 'utf8'));
    }
    
    return {};
  }

  applyEnvVars() {
    const envVarMappings = this.loadFile('custom-environment-variables.json');
    this.applyEnvVarsRecursive(this.config, envVarMappings);
  }

  applyEnvVarsRecursive(config, mappings) {
    for (const [key, value] of Object.entries(mappings)) {
      if (typeof value === 'object') {
        if (!config[key]) config[key] = {};
        this.applyEnvVarsRecursive(config[key], value);
      } else {
        // Value is the env var name
        const envValue = process.env[value];
        if (envValue !== undefined) {
          config[key] = this.parseValue(envValue);
        }
      }
    }
  }

  parseValue(value) {
    // Try to parse as JSON
    try {
      return JSON.parse(value);
    } catch {
      return value;
    }
  }

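  deepMerge(target, source) {
    const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);
    const result = { ...target };

    for (const key of Object.keys(source)) {
      // Recurse only when both sides are plain objects; otherwise the source value wins
      if (isPlainObject(source[key]) && isPlainObject(target[key])) {
        result[key] = this.deepMerge(target[key], source[key]);
      } else {
        result[key] = source[key];
      }
    }

    return result;
  }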
}

module.exports = new ConfigLoader().load();
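The precedence order (defaults, then environment, then local) is easiest to see with concrete values. The sketch below inlines three hypothetical file contents and a simplified stand-in for the loader's merge, so it runs without any files on disk:

```javascript
// Simplified stand-in for the loader's deepMerge, for plain objects only.
function mergeConfig(target, source) {
  const isPlainObject = (v) => v !== null && typeof v === 'object' && !Array.isArray(v);
  const result = { ...target };
  for (const [key, value] of Object.entries(source)) {
    result[key] = isPlainObject(value) && isPlainObject(result[key])
      ? mergeConfig(result[key], value)
      : value;
  }
  return result;
}

// Hypothetical file contents, inlined so the example is self-contained.
const defaults = { server: { port: 3000, host: '0.0.0.0' }, logLevel: 'info' };
const production = { server: { port: 8080 }, logLevel: 'warn' };
const local = { logLevel: 'debug' };

// Later layers win; nested keys that a layer does not mention are preserved.
const merged = [defaults, production, local].reduce(
  (acc, layer) => mergeConfig(acc, layer),
  {}
);

console.log(merged);
// { server: { port: 8080, host: '0.0.0.0' }, logLevel: 'debug' }
```

Note that `server.host` survives from the defaults even though the production layer overrides `server.port`, which is the reason for a deep merge rather than a shallow spread.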

Interview Questions

Answer: Use a centralized configuration server (Consul, etcd, Spring Cloud Config):
  1. Hierarchy: Global → Environment → Service-specific
  2. Hot reload: Watch for changes without restart
  3. Secrets: Separate from config (Vault, AWS Secrets Manager)
  4. Versioning: Track config changes in Git
  5. Validation: Schema validation on load
Best Practices:
  • 12-Factor App: Config in environment
  • Encryption at rest and in transit
  • Audit logging for changes
  • Feature flags for gradual rollouts
Answer: Types of feature flags:
  • Boolean: Simple on/off
  • Percentage: Gradual rollout (10% → 50% → 100%)
  • User targeting: Specific users or attributes
  • Time-based: Scheduled activation
Implementation:
  • Hash user ID for consistent bucketing
  • Use configuration store for flag definitions
  • SDK in each service to evaluate flags
Best Practices:
  • Clean up old flags (tech debt)
  • Monitor flag usage
  • Have kill switches for quick rollback
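"Hash user ID for consistent bucketing" can be sketched in a few lines. The version below is one possible approach, assuming SHA-256 from Node's built-in `crypto` module and hypothetical helper names (`bucketFor`, `isInRollout`); the `FeatureFlags` class in this chapter may hash differently.

```javascript
const crypto = require('crypto');

// Hypothetical helper: map (flagName, userId) to a stable bucket in [0, 100).
// Including the flag name keeps buckets independent between flags.
function bucketFor(flagName, userId) {
  const digest = crypto
    .createHash('sha256')
    .update(`${flagName}:${userId}`)
    .digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is in the rollout when their bucket falls below the percentage.
function isInRollout(flagName, userId, percentage) {
  return bucketFor(flagName, userId) < percentage;
}

// Same inputs always give the same answer, with no stored state.
console.log(isInRollout('newCheckoutFlow', 'user-123', 25) ===
            isInRollout('newCheckoutFlow', 'user-123', 25)); // true
```

Because the comparison is `bucket < percentage`, raising the percentage from 10 to 50 only adds users: everyone who already had the feature keeps it.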
Answer: Never in code or config files!
Solutions:
  • HashiCorp Vault: Dynamic secrets, leasing, rotation
  • AWS Secrets Manager: AWS-native, automatic rotation
  • Kubernetes Secrets: Basic; values are only base64-encoded, not encrypted, unless encryption at rest is enabled
Best Practices:
  • Dynamic credentials (short-lived)
  • Automatic rotation
  • Principle of least privilege
  • Audit access
  • Encrypt at rest
Answer: Hot Reload Pattern:
  1. Watch config source for changes
  2. Validate new config before applying
  3. Gracefully transition (connection pools, caches)
  4. Rollback if validation fails
Kubernetes approach:
  • Update ConfigMap/Secret
  • Trigger rolling restart: kubectl rollout restart
  • Or use sidecar to watch and signal reload
Application approach:
  • Watch config file (chokidar/inotify)
  • Poll config server periodically
  • Subscribe to config change events
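The "validate before applying, roll back if validation fails" steps can be sketched as a pure function that either returns the new config or keeps the current one. The field names follow the `config.json` from the ConfigMap example; the function names and the specific rules are assumptions for illustration.

```javascript
// Hypothetical validator: returns a list of problems, empty when valid.
function validateConfig(config) {
  const errors = [];
  const port = config?.server?.port;
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push('server.port must be an integer between 1 and 65535');
  }
  const pool = config?.database?.pool;
  if (!pool || !Number.isInteger(pool.min) || !Number.isInteger(pool.max) || pool.min > pool.max) {
    errors.push('database.pool.min and max must be integers with min <= max');
  }
  return errors;
}

// Parse and validate a candidate; keep the current config if anything is wrong.
function applyUpdate(current, rawJson) {
  let candidate;
  try {
    candidate = JSON.parse(rawJson);
  } catch (err) {
    return { config: current, applied: false, errors: [`invalid JSON: ${err.message}`] };
  }
  const errors = validateConfig(candidate);
  if (errors.length > 0) {
    return { config: current, applied: false, errors };
  }
  return { config: candidate, applied: true, errors: [] };
}

const current = { server: { port: 3000 }, database: { pool: { min: 2, max: 10 } } };
const bad = applyUpdate(current, '{"server":{"port":"3000"},"database":{"pool":{"min":2,"max":10}}}');
console.log(bad.applied, bad.errors); // false, with a server.port error
```

Since the caller only swaps in `result.config`, a rejected update leaves the running service on its last known-good values, which is the rollback.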

Chapter Summary

Key Takeaways:
  • Centralize configuration with tools like Consul or etcd
  • Use hierarchical config: global → environment → service
  • Implement feature flags for safe progressive rollouts
  • Separate secrets from configuration (use Vault)
  • Enable hot reload for zero-downtime config updates
  • Follow 12-Factor App principles
Next Chapter: CI/CD for Microservices - Pipelines and deployment strategies.

Interview Questions: Silent Config Failures

Strong Answer Framework:
  1. Root-cause the detection delay, not just the failure. 45 minutes to detect a 2% error rate spike means your alerts are tuned for 5%+ thresholds. Tighten SLO burn-rate alerts so a 2% error rate on 10% of traffic (effectively 0.2% of global traffic) still pages within 5 minutes. Fast alerts are the first line of defense.
  2. Stage config rollouts like code rollouts. A config change should never apply to 10% of pods instantly. Use the same canary pattern: 1 pod, 5% pods, 25%, 100%, with metrics checks at each step.
  3. Type-check and schema-validate configs at load time. The most common cause of silent failures is a misspelled key or wrong-typed value. pydantic-settings, convict, or JSON schema validation catches this before the service serves traffic. A pod that cannot parse its config should fail readiness, not run with undefined behavior.
  4. Automate drift detection between environments. Daily comparison: staging’s config schema should match production’s, with documented exceptions. A key that exists in prod but not staging means staging is not actually testing what prod runs.
  5. Emit a structured log event on every config change. “Config change: key=max_retries, old=3, new=5, source=consul-watch, env=production, ts=…” This is gold for incident correlation — when something breaks, you can trace it to an exact config change.
  6. Create a config-change-specific dashboard. Traffic, latency, and error rate overlaid with config change events. When SRE opens the incident, they see immediately whether a recent config change correlates with the spike.
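The structured log event from step 5 can be produced by diffing the old and new config whenever a reload happens. A minimal sketch, assuming flat top-level keys and a hypothetical helper name (`configChangeEvents`):

```javascript
// Hypothetical helper: one structured event per changed top-level key.
function configChangeEvents(oldConfig, newConfig, meta) {
  const keys = new Set([...Object.keys(oldConfig), ...Object.keys(newConfig)]);
  const events = [];
  for (const key of keys) {
    const before = oldConfig[key];
    const after = newConfig[key];
    // JSON comparison is enough for primitives and small values
    if (JSON.stringify(before) !== JSON.stringify(after)) {
      events.push({
        event: 'config_change',
        key,
        old: before,
        new: after,
        ...meta,
        ts: new Date().toISOString()
      });
    }
  }
  return events;
}

const changeEvents = configChangeEvents(
  { max_retries: 3, api_timeout: 5000 },
  { max_retries: 5, api_timeout: 5000 },
  { source: 'consul-watch', env: 'production' }
);
changeEvents.forEach((e) => console.log(JSON.stringify(e)));
```

Two caveats: the JSON comparison is sensitive to key order in nested objects, and secret values should be redacted before they reach a log line.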
Real-World Example: Cloudflare’s July 2, 2019 outage (public postmortem) shows what happens without staged config rollouts. A new WAF rule containing a regex prone to catastrophic backtracking was pushed to the entire network in one step and exhausted CPU on the machines handling HTTP traffic worldwide. Among the remediations described in the postmortem was moving that rule path to staged, progressive rollouts.
Senior Follow-up Questions:
Q: What if the config change was technically valid but semantically wrong — no schema can catch it? A: Then the defense shifts to canary analysis and automated rollback. The canary pod’s error rate is compared against the stable pods’ error rate; if the delta exceeds 0.5%, roll back. This catches semantic regressions even when the config schema is valid.
Q: You can’t canary every config change — some are urgent prod fixes. How do you handle those? A: Have an explicit “break glass” mode with extra scrutiny: double-review required, paged-SRE watching the dashboard, automatic rollback armed if error rate increases above baseline. The goal is not to slow down emergencies but to ensure they do not become their own incidents.
Q: How do you ensure config changes are reversible? What if rolling back introduces its own bug? A: Config rollback is the same class of problem as code rollback: sometimes reverting a config after 2 hours means rejecting data that was written with the new config in effect. For this reason, config changes should be backward-compatible by default (add, do not remove; widen, do not narrow). Breaking config changes go through a multi-step expand-contract pattern just like schema changes.
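The canary comparison from the first follow-up reduces to a small check. The sketch below assumes "delta exceeds 0.5%" means 0.5 percentage points of absolute error rate, and the function name `shouldRollBack` is hypothetical:

```javascript
// Hypothetical check: roll back when the canary's error rate exceeds the
// stable fleet's by more than `maxDelta` (0.005 = 0.5 percentage points).
function shouldRollBack(canary, stable, maxDelta = 0.005) {
  const rate = ({ errors, requests }) => (requests > 0 ? errors / requests : 0);
  return rate(canary) - rate(stable) > maxDelta;
}

// Canary at 1.2% vs stable at 0.4%: delta of 0.8 points, roll back.
console.log(shouldRollBack({ errors: 12, requests: 1000 }, { errors: 40, requests: 10000 })); // true

// Canary at 0.5% vs stable at 0.4%: delta of 0.1 points, keep going.
console.log(shouldRollBack({ errors: 5, requests: 1000 }, { errors: 40, requests: 10000 }));  // false
```

In practice the canary needs enough requests for the rate to be meaningful; with a handful of requests, a single error would trip the threshold.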
Common Wrong Answers:
  • “Add more logging.” More logs do not help if nobody reads them during the incident. The fix is alerts that trigger on the specific failure mode, not more data to sift through.
  • “Require approval for every config change.” This creates approval queue bottlenecks and trains teams to batch changes, which makes diagnosis harder. Instead, make the changes safer and the detection faster.
Further Reading:
  • Cloudflare’s outage postmortems (indexed at blog.cloudflare.com).
  • Google SRE Workbook, Chapter 5 (“Alerting on SLOs”) — burn-rate alerting formulas.
  • LaunchDarkly’s “Progressive Delivery” whitepaper on feature flag rollout patterns.
Strong Answer Framework:
  1. Accept that the secrets are already compromised. Once in git history, a secret is exposed to anyone with repo read access plus GitHub’s own secret scanners plus any tool with clone access. Rotate all three secrets immediately, before doing process work.
  2. Install a pre-commit hook. gitleaks or trufflehog scan staged files before commit. A secret is rejected at the developer’s machine, before it ever hits the remote. Make this mandatory via a husky or pre-commit.com template in the repo.
  3. Add a second-line check in CI. Even if someone bypasses the pre-commit hook (or installs a fresh clone), CI rescans on every push. A matched secret fails the build. This catches both accidents and intentional bypasses.
  4. Scan the entire git history once. Running gitleaks detect --source . walks the full history. Every finding triggers a rotation ticket. Assume every secret ever committed is compromised regardless of whether it was “removed.”
  5. Make the right path the easiest path. If developers commit secrets because .env is the easiest way to inject values locally, make the secret manager easier. direnv + a shell wrapper that fetches from Vault is one pattern. Secrets in a tool that integrates cleanly into the dev loop get used correctly.
  6. Educate on the specific threat model. Developers often believe a force-push removes the secret. Show them that GitHub’s secret scanner still detected it, show them the log of scanner hits. Make the failure mode visceral.
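To show what a pre-commit or CI scan is doing under the hood, here is a deliberately tiny scanner. It is a toy with two illustrative patterns, not a substitute for gitleaks or trufflehog, and the names `SECRET_PATTERNS` and `findSecrets` are hypothetical:

```javascript
// Illustrative patterns only; real scanners ship far larger rule sets.
const SECRET_PATTERNS = [
  { name: 'AWS access key ID', regex: /\bAKIA[0-9A-Z]{16}\b/ },
  { name: 'Private key header', regex: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ }
];

// Return one finding per (line, rule) match, with 1-based line numbers.
function findSecrets(text) {
  const findings = [];
  text.split('\n').forEach((line, index) => {
    for (const { name, regex } of SECRET_PATTERNS) {
      if (regex.test(line)) {
        findings.push({ line: index + 1, rule: name });
      }
    }
  });
  return findings;
}

// AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
const sample = 'LOG_LEVEL=info\nAWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE';
console.log(findSecrets(sample)); // [ { line: 2, rule: 'AWS access key ID' } ]
```

A hook built on this idea exits non-zero when `findSecrets` returns anything, which blocks the commit locally and fails the build in CI.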
Real-World Example: The Uber 2016 breach exposed a private GitHub repo containing AWS credentials. The engineer who committed them believed they were harmless because the repo was private. The attacker obtained access via a separate breach and pivoted from leaked credentials to 57 million user records. Secrets in git are a timed explosive with the timer set at commit time.
Senior Follow-up Questions:
Q: Pre-commit hooks can be disabled with --no-verify. How do you prevent that? A: You cannot fully prevent it on developer machines, and a --no-verify commit leaves no marker you can detect afterwards. What you can do is make the bypass pointless and visible: CI reruns the same scan on every push so a skipped hook still fails the build, PRs require CI green, and a weekly report lists force-pushes and CI secret-scan failures for review. Make the bypass expensive socially, not just technically.
Q: Some services legitimately need to write secrets to disk at runtime. How do you distinguish accidental commits from legitimate writes? A: Runtime writes go to /var/run/secrets/... or a tmpfs volume, never to the working tree. Scanners ignore those paths. If a secret appears anywhere under the checked-out source tree, it is always wrong.
Q: A former employee had read access to this repo. The secrets they saw are being rotated now. Is that enough? A: No. Also check what systems those secrets granted access to and whether any logs show suspicious access in the window between commit and rotation. If the secret was an AWS key, pull CloudTrail for that key’s usage. Assume worst case until proven otherwise.
Common Wrong Answers:
  • “Use git filter-branch to remove the secret from history.” Does not help — the secret has already been pulled by anyone with repo access, and scanners keep copies. Rotate instead.
  • “Train developers better.” Training alone does not reliably prevent this class of mistake; even careful engineers slip under deadline pressure. Technical controls (pre-commit + CI scan) are the dependable defense, with training as a supplement.
Further Reading:
  • GitHub’s documentation on secret scanning and push protection.
  • OWASP: “Sensitive Data Exposure” top-10 category.
  • Vault’s Kubernetes auth method documentation for replacing .env files.
Strong Answer Framework:
  1. Instrument evaluation counts per flag. For the next 30 days, log every flag evaluation with flag name, result, and context. At the end of the window, you know which flags are actually in use and which ones are dead code.
  2. Categorize by state. Flags that always evaluate to the same value (100% on or 100% off for 30 days) are either fully rolled out or fully abandoned — either way, they are removable. Flags with dynamic rollout (targeting by user attributes or percentages) need more care.
  3. Assign ownership retroactively. Use git blame on each flag definition to find who last touched it. Send that person (or their current team) a ticket: “You own flag X. Decide: keep, remove, or transfer ownership.” Flags with no owner after 30 days are removed by default.
  4. Remove in waves, not all at once. Start with the “always true” flags (safest). Remove the flag reference and the else branch in the code; only the on-path code remains. Validate in staging, deploy. Repeat for “always false” flags, then for dormant-but-still-dynamic flags.
  5. Add automated guardrails against regression. Linter check: every flag in code must have a corresponding flag definition with owner, creation date, and expiration date. CI fails if a flag is older than 90 days without documented extension.
  6. Budget the cleanup. Removing 300 flags is a 6-month project. Allocate 10% of each team’s sprint capacity to “feature flag debt” until the list is under 50. Treat it like any other debt-reduction program.
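Steps 1 and 2 (count evaluations, then categorize) can be sketched with an in-memory counter. The class name `FlagUsage` and the category labels are assumptions for illustration; a production system would ship these counts to a metrics backend instead of holding them in process memory.

```javascript
// Hypothetical in-memory counter for flag evaluation results.
class FlagUsage {
  constructor() {
    this.counts = new Map();
  }

  // Call this from the flag SDK every time a flag is evaluated.
  record(flagName, result) {
    const c = this.counts.get(flagName) || { on: 0, off: 0 };
    if (result) c.on += 1;
    else c.off += 1;
    this.counts.set(flagName, c);
  }

  // Categorize every known flag by what was observed during the window.
  classify(knownFlags) {
    const report = {};
    for (const name of knownFlags) {
      const c = this.counts.get(name);
      if (!c) report[name] = 'never-evaluated';
      else if (c.off === 0) report[name] = 'always-on';
      else if (c.on === 0) report[name] = 'always-off';
      else report[name] = 'dynamic';
    }
    return report;
  }
}

const usage = new FlagUsage();
usage.record('darkMode', true);
usage.record('darkMode', true);
usage.record('newCheckoutFlow', true);
usage.record('newCheckoutFlow', false);
usage.record('legacyExport', false);

console.log(usage.classify(['darkMode', 'newCheckoutFlow', 'legacyExport', 'oldBanner']));
```

Flags reported as `always-on`, `always-off`, or `never-evaluated` are the removal candidates for the first waves; `dynamic` flags need an owner's decision, and rarely-fired kill switches should be excluded by hand.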
Real-World Example: Facebook publicly discussed their feature-flag system (Gatekeeper) at various conferences; at peak they had tens of thousands of active flags and had to build automated cleanup tooling to track ownership, staleness, and impact analysis. Smaller orgs hit this wall at 200-500 flags and have to build their own version.
Senior Follow-up Questions:
Q: What if removing a flag breaks something you did not know depended on it? A: The instrumentation in step 1 should catch this — you know who is calling the flag. But defensively: use a feature-flag SDK that logs “flag lookup returned default because not defined” at warn level. If you remove the flag and something starts warning in prod, you have a fast signal to restore it.
Q: How do you handle flags that were part of experimentation? The data team wants to keep them for historical analysis. A: Separate concerns: flag values live in the feature-flag system; historical assignments live in the data warehouse as an immutable event stream. Removing the flag does not remove the historical record. Make this contract explicit with the data team.
Q: Engineering leaders want a metric to prevent this from happening again. What do you propose? A: “Flag-age p95” — the 95th percentile age of active flags. A healthy system has p95 under 60 days. Track this monthly as part of platform health review. When it climbs past 90, trigger a cleanup sprint.
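The "flag-age p95" metric is straightforward to compute from flag creation dates. The sketch below uses the nearest-rank percentile definition and assumes each flag record carries a `createdAt` timestamp, which is a hypothetical field name:

```javascript
// Nearest-rank percentile: the smallest value with at least p% of data at or below it.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Age of each flag in whole days, then the 95th percentile of those ages.
function flagAgeP95(flags, now = new Date()) {
  const dayMs = 24 * 60 * 60 * 1000;
  const ages = flags.map((f) => Math.floor((now - new Date(f.createdAt)) / dayMs));
  return percentile(ages, 95);
}

const activeFlags = [
  { name: 'darkMode', createdAt: '2024-04-09T00:00:00Z' },        // 1 day old
  { name: 'newCheckoutFlow', createdAt: '2024-03-11T00:00:00Z' }, // 30 days old
  { name: 'legacyExport', createdAt: '2024-01-01T00:00:00Z' }     // 100 days old
];

console.log(flagAgeP95(activeFlags, new Date('2024-04-10T00:00:00Z'))); // 100
```

With only a few flags the p95 is effectively the oldest flag, so the metric becomes more informative as the flag count grows into the dozens or hundreds.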
Common Wrong Answers:
  • “Just delete all flags older than 6 months.” Without instrumentation, you do not know which old flags are safely removable. Some old flags are kill switches that rarely fire — removing them removes the escape hatch.
  • “Force engineers to remove flags before shipping new ones.” Creates a queue bottleneck and encourages engineers to leave flags named as generically as possible to avoid cleanup. Use technical guardrails instead of social ones.
Further Reading:
  • Pete Hodgson, “Feature Toggles” on martinfowler.com — the canonical taxonomy and lifecycle model.
  • LaunchDarkly’s “Effective Feature Management” e-book (chapters on tech debt).
  • John Allspaw, “On Being a Senior Engineer” (2012) — relevant for thinking about long-term system ownership.

Interview Deep-Dive

Strong Answer:
The immediate problem is that secrets are baked into the deployment, so changing them requires restarting every pod. The solution is to decouple secret retrieval from deployment.
I would introduce HashiCorp Vault (or AWS Secrets Manager) as the centralized secret store. Each service authenticates to Vault at startup using its Kubernetes service account (no hardcoded Vault credentials). Vault returns the database password, and the service caches it locally with a TTL.
For zero-downtime rotation, I use Vault’s dynamic database secrets. Instead of one shared password, Vault creates a unique database user/password per service instance with a 24-hour TTL. Vault automatically creates the credentials and revokes them when they expire. Rotation means creating new credentials, not changing an existing password.
If dynamic secrets are not feasible (legacy database, compliance constraints), I use dual-password rotation:
  1. Add a second valid password to the database.
  2. Update the secret in Vault to the new password.
  3. Services pick up the new password on their next TTL refresh (no restart needed).
  4. Remove the old password from the database once every service has refreshed.
Throughout this process, any password a service might still be holding remains valid, so there is no downtime window.
The key operational practice: every secret rotation should be automated and tested monthly. If your rotation procedure requires a human to run 30 kubectl commands, it will fail during an actual security incident when you are under time pressure.
Follow-up: “How do you handle the case where Vault is down and a service restarts? It cannot fetch its secrets.”
I implement a secret cache on disk (encrypted with a local key derived from the Kubernetes service account). When the service starts, it tries Vault first. If Vault is unreachable, it falls back to the cached secrets. The cache has a maximum age: if the cached secrets are older than 48 hours, the service refuses to start and raises a critical alert. This balances availability (the service can start without Vault) against security (stale secrets are eventually rejected).
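The startup fallback described in the follow-up could look roughly like the sketch below. It is deliberately not tied to any real Vault client: `fetch_from_store` is a hypothetical callable standing in for the Vault lookup, `CachedSecret` stands in for the decrypted on-disk cache entry, and encryption at rest is omitted entirely.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_CACHE_AGE_SECONDS = 48 * 60 * 60  # 48-hour limit from the answer above


class SecretUnavailableError(RuntimeError):
    """Raised when no acceptable secret exists; the service should not start."""


@dataclass(frozen=True)
class CachedSecret:
    value: str
    fetched_at: float  # Unix timestamp of the last successful fetch


def load_secret(fetch_from_store: Callable[[], str],
                cached: Optional[CachedSecret],
                now: float) -> CachedSecret:
    """Try the secret store first; fall back to a sufficiently fresh cache."""
    try:
        # Happy path: the store is reachable, so refresh the cache entry.
        return CachedSecret(value=fetch_from_store(), fetched_at=now)
    except ConnectionError:
        if cached is None:
            raise SecretUnavailableError(
                "secret store unreachable and nothing cached")
        age = now - cached.fetched_at
        if age > MAX_CACHE_AGE_SECONDS:
            # Stale secrets are rejected; the caller should alert and refuse to start.
            raise SecretUnavailableError(
                f"cached secret is {age / 3600:.0f}h old; refusing to start")
        return cached
```

Passing `now` in explicitly, instead of calling `time.time()` inside the function, keeps the age check easy to test. The caller is expected to persist the returned entry and to page someone when `SecretUnavailableError` is raised.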
Strong Answer:
Feature flags decouple deployment from release. You deploy code with the flag defaulting to off, then enable it progressively: 1% of users, then 10%, then 50%, then 100%. If anything goes wrong, you toggle the flag off, with no rollback and no redeploy.
Implementation: a feature flag service (LaunchDarkly, Unleash, or custom-built) stores flag definitions with targeting rules. Each microservice has a lightweight SDK that evaluates flags locally using cached rules (no network call per flag check). The SDK receives rule updates via streaming (SSE or WebSocket) for near-instant propagation.
In a microservices environment, the critical challenge is flag consistency across services. If the Order Service evaluates a flag as “on” for user X but the Inventory Service evaluates it as “off,” the order flow breaks. I enforce consistency by having all services use the same flag evaluation SDK and the same targeting rules. For flags that span services, I pass the flag evaluation result as a header (X-Feature-Checkout-V2: true) rather than having each service evaluate independently.
The risks at scale are real.
First, technical debt accumulation. Every flag is a branch in your code. After 12 months with 200 flags, your codebase is a maze of if/else branches. I enforce a policy: every flag has an owner and an expiration date. Flags older than 90 days without a documented reason get a removal ticket auto-created.
Second, combinatorial explosion in testing. 10 boolean flags mean 2^10 = 1,024 possible combinations. You cannot test all of them. I group flags into independent categories and test each category independently, accepting that rare cross-flag interactions might slip through.
Third, flag evaluation performance. If a hot code path checks 5 flags, and each flag evaluation involves a map lookup and rule evaluation, the cumulative overhead matters. The SDK should cache evaluated results per user context, not re-evaluate on every call.
Follow-up: “How do you handle a feature flag that accidentally gets flipped on for all users in production and causes an outage?”
This is why feature flags need the same operational rigor as deployments. I implement flag change auditing (who changed what, when), flag change approval workflows for high-risk flags (percentage rollout changes above 50% require review), and a kill switch that disables all recently changed flags with one command. The kill switch is the equivalent of a rollback: it reverts all flag changes from the last N minutes. I also set up alerts on flag change events: if a flag changes and error rates spike within 5 minutes, the on-call gets paged with the flag change as the probable cause.
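One way to picture the kill switch is as a replay of the audit log in reverse. The sketch below is an assumption-heavy illustration: the `FlagChange` record and the `kill_switch` function are hypothetical, flags are reduced to plain booleans, and the audit log is an in-memory list instead of whatever store a real flag service would use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class FlagChange:
    """Hypothetical audit-log entry: who changed what, and when."""
    flag: str
    old_value: bool
    new_value: bool
    changed_by: str
    changed_at: datetime


def kill_switch(current: Dict[str, bool],
                audit_log: Iterable[FlagChange],
                now: datetime,
                window: timedelta) -> Dict[str, bool]:
    """Return flag values with every change inside the window reverted."""
    cutoff = now - window
    reverted = dict(current)  # never mutate the caller's state
    # Walk newest-first so the last write for each flag comes from the oldest
    # change inside the window, i.e. the value the flag had before the window.
    for change in sorted(audit_log, key=lambda c: c.changed_at, reverse=True):
        if change.changed_at >= cutoff:
            reverted[change.flag] = change.old_value
    return reverted
```

Because the function only computes the reverted state, the same audit log can also feed the alerting path: the changes inside the window are the candidates to attach to the page as the probable cause.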
Strong Answer:
The principle is: configuration should be layered, not duplicated. I use a hierarchical configuration model with four layers: defaults (in code), global (shared across all services), environment-specific (overrides per environment), and service-specific (overrides per service per environment).
In practice with Consul KV: config/defaults/database_pool_size = 10, config/production/database_pool_size = 50, config/production/order-service/database_pool_size = 100. Resolution runs from most to least specific: the service-specific value overrides the environment value, which overrides the default. This means most configuration is shared (defaults), and only the values that genuinely differ per environment are overridden.
Configuration drift happens when someone manually changes a production value without updating staging. I prevent this three ways.
First, all configuration changes go through version control (GitOps for config). A PR changes a config file, gets reviewed, and an automated process applies it to the target environment. No manual Consul KV edits in production.
Second, I run a daily drift detection job that compares the configuration across environments and reports differences that are not in the “expected overrides” list. If staging has payment_gateway = sandbox and production has payment_gateway = live, that is expected. If staging has max_retries = 3 and production has max_retries = 5 but the override was not documented, that is drift and gets flagged.
Third, I use infrastructure-as-code (Terraform) for environment provisioning. The same Terraform module creates all environments with parameterized differences. This ensures structural consistency even as values differ.
Follow-up: “What about configuration that is specific to a single developer’s local environment?”
Local config should never leak into the shared configuration system. I use a .env.local file (gitignored) that overrides the defaults for local development. The application loads configuration in priority order: .env.local > environment variables > configuration service > defaults in code. This way, a developer can override any value locally without affecting anyone else, and the committed code always references the default or environment-specific values.
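Both lookups in this answer, the layered key resolution and the local priority order, can be sketched in a few lines. This is an illustration under stated assumptions: the KV store is modeled as a plain dictionary instead of a real Consul client, the key paths follow the examples given in the answer, and the function names `lookup_layered` and `effective_config` are hypothetical.

```python
from collections import ChainMap
from typing import Dict, Mapping, Optional


def lookup_layered(kv: Mapping[str, str], key: str,
                   environment: str, service: str) -> Optional[str]:
    """Resolve a key from most specific to least specific layer."""
    candidates = (
        f"config/{environment}/{service}/{key}",  # service-specific override
        f"config/{environment}/{key}",            # environment-specific override
        f"config/defaults/{key}",                 # shared default
    )
    for path in candidates:
        if path in kv:
            return kv[path]
    return None


def effective_config(env_local: Dict[str, str],
                     env_vars: Dict[str, str],
                     config_service: Dict[str, str],
                     code_defaults: Dict[str, str]) -> ChainMap:
    """Priority: .env.local > environment variables > config service > code defaults."""
    # ChainMap searches its maps left to right and returns the first hit.
    return ChainMap(env_local, env_vars, config_service, code_defaults)


if __name__ == "__main__":
    kv = {
        "config/defaults/database_pool_size": "10",
        "config/production/database_pool_size": "50",
        "config/production/order-service/database_pool_size": "100",
    }
    print(lookup_layered(kv, "database_pool_size", "production", "order-service"))
    print(lookup_layered(kv, "database_pool_size", "staging", "order-service"))
```

`ChainMap` keeps the layers separate instead of merging them, so a debugging endpoint can still report which layer a value came from by inspecting `config.maps` in order.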