Configuration Management
In a microservices architecture, managing configuration across dozens or hundreds of services is a significant challenge. Imagine 40 services, each with its own .env file, and you need to rotate a database password. With decentralized config, that is 40 deploys, 40 chances for human error, and an anxious hour hoping you did not miss one. Centralized configuration turns that into a single update that propagates everywhere. This chapter covers patterns and tools for centralized, dynamic configuration, including feature flags, which are arguably the most underrated tool in a microservices toolkit.
- Implement centralized configuration management
- Set up dynamic configuration with hot reload
- Design feature flags for progressive rollouts
- Manage environment-specific configurations
- Handle secrets securely across services
The Configuration Challenge
Before looking at solutions, it helps to see exactly how painful decentralized configuration becomes at scale. Every service owning its own .env file creates three compounding problems: drift (staging and production quietly diverge), secrets leakage (credentials end up in git history), and operational fragility (changing a shared value requires coordinating N deploys). The “fix it in all 40 places” approach only works until someone misses one, and then you have a service pointing at the old database while the other 39 point at the new one. That mismatch is usually discovered in production, at night, under pressure.
Caveats & Common Pitfalls: The Silent Killers of Config Management
The 12-Factor App Configuration
Factor III: Config
Store config in the environment, not in code. This is one of the most violated 12-factor principles in practice. The test is simple: could you open-source your codebase right now without exposing a single credential? If the answer is no, you have config leaking into code.
Production pitfall: A common mistake is having different config loading logic per environment (if (env === 'production')). This means your staging environment is not actually testing the same config path as production, which defeats the purpose of having staging at all.
The first step toward centralization is making your service read every configuration value from the environment rather than hardcoding it. This seems trivial, but it’s where most teams start accumulating debt: a hardcoded localhost here, a default password there, and six months later you have three PRs open just to change an endpoint URL. The rule: if a value could ever differ between environments (dev, staging, prod, CI, a teammate’s laptop), it belongs in the environment — not in code.
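As a minimal sketch of that rule, the snippet below reads every value from the environment and fails at startup when a required one is missing. The variable names (DATABASE_URL, LOG_LEVEL, PORT) and the helper names are illustrative assumptions, not part of any particular service.

```python
import os
from typing import Mapping, Optional


class MissingConfigError(RuntimeError):
    """Raised at startup when a required environment variable is absent."""


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return a required variable, or fail loudly instead of guessing a default."""
    source = os.environ if env is None else env
    value = source.get(name)
    if value is None or value == "":
        raise MissingConfigError(f"required environment variable {name} is not set")
    return value


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the settings once, at startup, from the environment only."""
    source = os.environ if env is None else env
    return {
        # Hypothetical names: anything that differs per environment lives here.
        "database_url": require_env("DATABASE_URL", source),
        # Optional values get an explicit, harmless default.
        "log_level": source.get("LOG_LEVEL", "info"),
        "port": int(source.get("PORT", "3000")),
    }


if __name__ == "__main__":
    example_env = {"DATABASE_URL": "postgres://db.internal:5432/orders", "PORT": "8080"}
    print(load_settings(example_env))
```

Passing the environment in as a mapping keeps the loader testable; in a real service you would call load_settings() with no arguments so that it reads os.environ.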
Configuration Hierarchy
Hierarchy is how you keep centralized config sane as it grows. Without structure, a flat key-value store becomes a dumping ground: thousands of untyped strings with no clear ownership. A good hierarchy expresses three things: what the value is (database/host), which scope it applies to (service-specific vs global), and which environment it targets (dev/staging/prod). You also want validation on load: misspelling a key or passing a string where a number is expected should fail at startup, loudly, not silently become “undefined” halfway through handling a request. Tools like convict (Node) and pydantic-settings (Python) do this automatically, which is why they are worth using over a hand-rolled process.env grab-bag.
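To make "validation on load" concrete, here is a small standard-library sketch that stands in for what convict or pydantic-settings would do for you. The key names under database/ and the DatabaseConfig type are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Mapping


class ConfigError(ValueError):
    """Raised when a config value is missing, unknown, or has the wrong type."""


@dataclass(frozen=True)
class DatabaseConfig:
    host: str
    port: int
    pool_size: int


def parse_database_config(raw: Mapping[str, str]) -> DatabaseConfig:
    """Validate flat string values (as a KV store returns them) into typed fields."""
    expected = {"database/host", "database/port", "database/pool_size"}
    unknown = set(raw) - expected
    if unknown:
        # A misspelled key is reported instead of being silently ignored.
        raise ConfigError(f"unknown keys: {sorted(unknown)}")
    missing = expected - set(raw)
    if missing:
        raise ConfigError(f"missing keys: {sorted(missing)}")
    try:
        port = int(raw["database/port"])
        pool_size = int(raw["database/pool_size"])
    except ValueError as exc:
        raise ConfigError(f"expected an integer: {exc}") from exc
    if not 1 <= port <= 65535:
        raise ConfigError(f"database/port out of range: {port}")
    if pool_size < 1:
        raise ConfigError(f"database/pool_size must be positive: {pool_size}")
    return DatabaseConfig(raw["database/host"], port, pool_size)
```

Calling parse_database_config during startup, before the service reports ready, is what turns a typo into a failed deploy rather than a runtime surprise.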
Configuration Tool Comparison
Before choosing a tool, understand what each is optimized for. The most common mistake is using a general-purpose key-value store for secrets management, or paying for a dedicated feature flag service when Consul can handle your simple boolean flags.

| Capability | Consul | etcd | Spring Cloud Config | AWS Parameter Store | Vault |
|---|---|---|---|---|---|
| Primary purpose | Service discovery + config | Distributed KV store | App config server | Cloud-native config | Secrets management |
| Hot reload | Yes (watches) | Yes (watches) | Yes (bus refresh) | No (polling only) | No (app must re-fetch) |
| Secret management | Basic (ACLs) | Basic (RBAC) | Encrypt/decrypt | Yes (SecureString) | Excellent (dynamic secrets, leasing, rotation) |
| Feature flags | Manual (KV structure) | Manual (KV structure) | Manual | Manual | Not designed for this |
| Kubernetes native? | Helm chart available | Built into K8s (backing store) | Helm chart available | AWS only | Helm chart, K8s auth |
| Multi-datacenter | Yes (built-in) | Requires federation | No | Multi-region with replication | Yes (replication) |
| Operational complexity | Medium | Low-Medium (if using K8s etcd) | Low | Low (managed) | High |
| Cost | Free (OSS) / Enterprise | Free (OSS) | Free (OSS) | $0.05 per 10K API calls | Free (OSS) / Enterprise |
- Startup with 5 services on Kubernetes: Use K8s ConfigMaps + Secrets (already there, zero extra infrastructure)
- Growing team needing feature flags: Add LaunchDarkly or Unleash (purpose-built; do not build your own if you can avoid it)
- Enterprise with compliance requirements: Vault for secrets + Consul for config (audit trails, dynamic credentials, RBAC)
- AWS-native shop: Parameter Store + Secrets Manager (managed, integrates with IAM)
Consul for Configuration
Consul is one of the most popular choices for centralized config because it combines service discovery with a hierarchical KV store. The key-value API lets you model configuration as a tree (config/production/order-service/database/host), and every service watches its subtree for changes. The killer feature is blocking queries: instead of polling, your client tells Consul “hold this request open until the key changes or 60 seconds pass.” When the key changes, Consul responds immediately; if the wait times out, it returns the current, unchanged value and the client simply issues the next blocking query. This gives you hot-reload without the latency and cost of tight polling.
The trade-off to understand: the KV store is strongly consistent within a single datacenter (Raft-backed), so if your service reads a value immediately after you wrote it in the same DC, you see the new value. KV data is not replicated across datacenters automatically; if you copy keys between DCs (for example with a tool such as consul-replicate), expect a small replication lag and treat cross-DC config as eventually consistent. For config, this is almost always fine; for coordinating distributed locks, it matters more.
Setup and Connection
The client below shows the full lifecycle for Consul-based configuration: load an initial snapshot on startup, then register watches that fire whenever values change. Notice how we merge a global config layer with a service-specific layer. This is the “hierarchy” pattern in action. Global values (like the company SMTP server) live once; service-specific overrides (like the order service’s custom timeout) live under the service’s own prefix. If you flatten everything into one namespace, you lose the ability to reason about scope and end up with copies of the same value under different keys.
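The merge of a global layer with a service-specific layer can be sketched without any Consul client at all. The snippet below takes flat key/value pairs as a KV store would return them, builds a nested tree per prefix, and lets the service layer override the global one. The config/production/global prefix and all values are assumptions for illustration; fetching the pairs from Consul is deliberately left out.

```python
from typing import Any, Dict, Mapping


def build_tree(flat: Mapping[str, str], prefix: str) -> Dict[str, Any]:
    """Turn 'prefix/a/b' -> value pairs into a nested dict {'a': {'b': value}}."""
    tree: Dict[str, Any] = {}
    for key, value in flat.items():
        if not key.startswith(prefix + "/"):
            continue  # key belongs to another scope
        parts = key[len(prefix) + 1:].split("/")
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return tree


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return base with override applied; nested dicts are merged, not replaced."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical snapshot of a KV store, using the key layout described above.
SNAPSHOT = {
    "config/production/global/smtp/host": "smtp.internal",
    "config/production/global/http/timeout_ms": "2000",
    "config/production/order-service/http/timeout_ms": "5000",
    "config/production/order-service/database/host": "orders-db.internal",
}

global_layer = build_tree(SNAPSHOT, "config/production/global")
service_layer = build_tree(SNAPSHOT, "config/production/order-service")
effective = deep_merge(global_layer, service_layer)
```

Here effective keeps the shared SMTP host from the global layer while the order service's own timeout wins over the global one, which is the scoping behaviour the paragraph above describes.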
Using Consul Config in Services
Here is where the payoff shows up: your application startup code loads config once, registers watches for anything that might change at runtime (database pool sizes, feature flag states, third-party endpoints), and then calls config.get(...) freely throughout the codebase. When the value changes in Consul, your watch callback fires and gracefully reconfigures whatever needs to change, with no restart required. The common mistake is caching a value from config.get() at startup and holding it in a local variable forever; always re-read from config.get() at request time, or wire up a watch to refresh your local copy.
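A minimal sketch of that read-at-request-time pattern is shown below. ConfigStore is a hypothetical in-memory holder, not a Consul client: whatever watch loop you use would call update() when a key changes, and application code calls get() each time it needs a value.

```python
import threading
from typing import Any, Callable, Dict, List, Mapping

ChangeCallback = Callable[[Any, Any], None]


class ConfigStore:
    """In-memory view of remote config; a watch loop calls update() on change."""

    def __init__(self, initial: Mapping[str, Any]) -> None:
        self._values: Dict[str, Any] = dict(initial)
        self._listeners: Dict[str, List[ChangeCallback]] = {}
        self._lock = threading.Lock()

    def get(self, key: str, default: Any = None) -> Any:
        """Read the current value; call this at request time, do not cache it."""
        with self._lock:
            return self._values.get(key, default)

    def on_change(self, key: str, callback: ChangeCallback) -> None:
        """Register callback(old_value, new_value) for one key."""
        with self._lock:
            self._listeners.setdefault(key, []).append(callback)

    def update(self, key: str, new_value: Any) -> None:
        """Apply a change reported by the config source and notify listeners."""
        with self._lock:
            old_value = self._values.get(key)
            if old_value == new_value:
                return  # nothing changed, so nothing to reconfigure
            self._values[key] = new_value
            listeners = list(self._listeners.get(key, []))
        # Callbacks run outside the lock so they may call get() safely.
        for callback in listeners:
            callback(old_value, new_value)
```

A connection pool, for example, could register an on_change callback for its size key and resize itself there, while request handlers keep calling get() for values that need no special handling.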
Feature Flags
Feature flags are the secret weapon of high-performing engineering teams. They decouple deployment from release: you can merge and deploy code that is not yet visible to users, then turn it on gradually (1% of users, then 10%, then 50%, then 100%). If something goes wrong, you flip the flag off in seconds instead of rolling back a deployment. Netflix, Google, and Facebook all use feature flags extensively. The trade-off: they add complexity and create technical debt if you do not clean up old flags. Set a rule: every feature flag gets a cleanup ticket with a deadline when it is created.
Caveats & Common Pitfalls: Feature Flag Technical Debt
Feature Flag System
The core of any feature flag system is consistent user bucketing: given the same user ID and the same flag, you must always return the same answer. Otherwise a user could see feature X on one page load and not on the next, which is worse than not having the feature at all. The standard technique is hashing the user ID plus the flag name and mapping it to a number in [0, 100). If the number falls below the flag’s rollout percentage, the user is in the experiment. This is stateless (no database of user assignments) and stable across service restarts.
Beyond simple boolean and percentage rollouts, the patterns below cover user allowlists (beta testers), time-based gradual rollouts (ramp from 0% to 100% over a week), and attribute-based targeting (only premium subscribers see this). You do not need all of these on day one, but building the abstraction upfront means adding new strategies later is a one-function change rather than a rewrite.
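The hashing technique described above can be sketched in a few lines. This is an illustrative version, assuming SHA-256 as the hash and a resolution of 0.01%; the function names and the flag and user identifiers are hypothetical.

```python
import hashlib
from typing import AbstractSet


def bucket(flag_name: str, user_id: str) -> float:
    """Map (flag, user) deterministically to a number in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then reduce to 0.00 .. 99.99.
    number = int.from_bytes(digest[:8], "big")
    return (number % 10_000) / 100.0


def is_enabled(
    flag_name: str,
    user_id: str,
    rollout_percentage: float,
    allowlist: AbstractSet[str] = frozenset(),
) -> bool:
    """Allowlisted users always get the feature; others are bucketed by hash."""
    if user_id in allowlist:
        return True
    return bucket(flag_name, user_id) < rollout_percentage
```

Including the flag name in the hash input means a user who lands in the first 10% for one flag is not automatically in the first 10% for every other flag. Because the bucket for a given user never changes, raising the percentage only adds users; nobody who already has the feature loses it.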
Feature Flag Configuration in Consul
Using Feature Flags in Routes
The pattern below is the “feature flag dependency”: pass flag evaluation context (user ID, session, subscription tier) into the flag check at the point of use, then branch on the result. Middleware is the cleanest way to propagate flag evaluation results through a request: evaluate the flag once, attach the result to the request object, and let handlers read it without re-evaluating. Avoid scattering isEnabled calls deep in business logic; you want feature branches to be visible at the request-handling level where they can be audited and traced.
Kubernetes ConfigMaps and Secrets
ConfigMap for Non-Sensitive Config
ConfigMaps are Kubernetes’ built-in answer to the configuration problem, and for small-to-medium clusters they are often all you need. A ConfigMap is a namespaced object that stores key-value pairs or entire config files; pods in the same namespace can consume them as environment variables or mounted files. The trade-off versus a dedicated config server like Consul: ConfigMaps have no native hot-reload (changes require a pod restart unless you mount them as files and the app watches the filesystem), no cross-cluster replication, and no audit trail beyond Kubernetes audit logs. What you get in return is zero extra infrastructure: ConfigMaps are just Kubernetes.
Secrets for Sensitive Data
Kubernetes Secrets are the sibling of ConfigMaps for sensitive data, but the name is deceptive: by default, Secret values are only base64-encoded, not encrypted. Anyone with kubectl get secret permissions can trivially decode them. For real security you need encryption-at-rest enabled in etcd (which requires configuring the API server with an encryption config) plus strict RBAC. For anything sensitive enough to warrant a dedicated secrets manager (production database credentials, API keys for paid services, signing keys), use Vault or a cloud provider’s secrets manager instead. Secrets are fine for low-stakes values where the main goal is keeping tokens out of ConfigMaps and source control.
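To see why base64 offers no protection, the short sketch below encodes a made-up value the way a Secret manifest stores it and then reverses it with one standard-library call. The password string and function names are invented for illustration.

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Produce the base64 form that appears in a Secret's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key or password is needed."""
    return base64.b64decode(encoded).decode("utf-8")


stored = encode_secret_value("s3cr3t-password")  # hypothetical value
recovered = decode_secret_value(stored)
```

Since recovered equals the original plaintext without any key material involved, the only real barriers are RBAC on the Secret object and encryption at rest in etcd.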
Using ConfigMaps and Secrets in Deployments
Hot Reload with ConfigMap Updates
When a ConfigMap is mounted as a volume, Kubernetes automatically updates the mounted files when the ConfigMap changes, typically within about a minute. This gives you a free hot-reload channel: have your app watch the config directory with chokidar (Node) or watchdog (Python), parse the file when it changes, and apply the new values. The gotcha: ConfigMap values injected as environment variables do NOT get updated; env vars are set at container start and never change. So for anything you want to hot-reload, mount it as a file. Also beware: there is no atomicity guarantee across multiple files, so if you have two related config files updating, your app might briefly see one old and one new. Favor a single config.json file for related values.
HashiCorp Vault for Secrets
Vault is the gold standard for secrets management, and it is worth understanding why before you commit to the operational cost. The key idea is dynamic secrets: instead of sharing one long-lived database password across all your services, Vault generates a unique short-lived credential per service instance on demand. When the service is done (or its lease expires, at a TTL you configure, for example 24 hours), Vault automatically revokes the credential. This turns credential rotation from a quarterly security project into a continuous background process. You also get audit logs (every secret access is recorded), fine-grained policies (service X can read these paths, write nothing), and multiple auth methods (Kubernetes service account tokens, AWS IAM, LDAP).
The cost is real operational complexity. Vault has a “sealed” state it enters on restart and requires unsealing (either manually with key shares or automatically via a cloud KMS). Running Vault in HA requires a storage backend such as Consul or integrated Raft storage, plus a careful backup strategy for the encryption keys. If you do not have compliance requirements (PCI, HIPAA, SOC 2) forcing the issue, start simpler.
Vault Integration
The code below shows the standard Vault lifecycle: authenticate using the pod’s Kubernetes service account token (no hardcoded credentials), fetch static secrets from the KV store, request dynamic database credentials, and schedule lease renewal before expiry. Lease renewal is the subtle part: if you let a lease expire without renewing, Vault revokes the credential and your next database query fails. The typical pattern is to renew at 75% of the lease duration, giving a safety margin for network hiccups.
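The renewal timing itself is simple arithmetic, sketched below without any Vault client. The function names are hypothetical; only the 75% rule comes from the text above, and the actual renew call against Vault is intentionally omitted.

```python
def renewal_delay_seconds(lease_duration_seconds: float, fraction: float = 0.75) -> float:
    """How long to wait after a lease is issued before renewing it."""
    if lease_duration_seconds <= 0:
        raise ValueError("lease duration must be positive")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    return lease_duration_seconds * fraction


def should_renew(issued_at: float, lease_duration_seconds: float, now: float) -> bool:
    """True once the lease has used up its renewal fraction (times in seconds)."""
    return now - issued_at >= renewal_delay_seconds(lease_duration_seconds)
```

For a one-hour lease this schedules renewal 45 minutes after issue, which leaves 15 minutes to retry if Vault or the network is briefly unavailable.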
Using Vault in Application
Tying Vault into your app during startup follows a clear order: authenticate, load static secrets (API keys, shared secrets), then request dynamic credentials for anything that supports them (most commonly database credentials). The dynamic credentials path is where Vault shines: your service starts up, gets a fresh unique database user with a 24-hour lease, uses it, and when the lease is about to expire the background renewal task extends it. If Vault goes down mid-operation, your service keeps running on its current credentials until the lease expires; by then Vault should be back. This graceful-degradation property is why static caching of credentials matters even when using Vault.
Environment-Specific Configuration
Multi-Environment Setup
The pattern below (a default file plus per-environment overrides plus env-var mappings) is battle-tested across large teams because it expresses three concerns cleanly: defaults that rarely change, per-environment overrides that track structural differences between dev/staging/prod, and runtime env-var injection for things like passwords and host-specific values. The “local.json” file (gitignored) is a key detail: it lets individual developers override any value for local dev without polluting shared config. When someone asks “why does it work on your machine but not mine,” the answer is almost always something in their local.json.
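The layering can be sketched with collections.ChainMap, where earlier mappings win. Everything concrete here is an assumption for illustration: the key names, the values, the env-var mapping, and the precedence order (mapped env vars, then local overrides, then the environment file, then defaults).

```python
from collections import ChainMap
from typing import Any, Dict, Mapping, Optional

# Stand-ins for default.json and the per-environment files.
DEFAULTS: Dict[str, Any] = {"db.host": "localhost", "db.pool_size": 10, "log.level": "debug"}
ENV_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "staging": {"db.pool_size": 20},
    "production": {"db.pool_size": 50, "log.level": "info"},
}
# Which environment variable feeds which config key.
ENV_VAR_MAPPING: Dict[str, str] = {"DB_HOST": "db.host", "DB_PASSWORD": "db.password"}


def load_config(
    environment: str,
    env_vars: Mapping[str, str],
    local_overrides: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge the layers; the first mapping in the chain has the highest priority."""
    from_env_vars = {
        ENV_VAR_MAPPING[name]: value
        for name, value in env_vars.items()
        if name in ENV_VAR_MAPPING  # unrelated variables are ignored
    }
    layers = ChainMap(
        from_env_vars,
        dict(local_overrides or {}),
        ENV_OVERRIDES.get(environment, {}),
        DEFAULTS,
    )
    return dict(layers)
```

With environment set to "production", a DB_HOST variable, and a local override for log.level, the result takes the host from the variable, the pool size from the production layer, and the log level from the local file.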
Interview Questions
Q1: How do you manage configuration across multiple microservices?
- Hierarchy: Global → Environment → Service-specific
- Hot reload: Watch for changes without restart
- Secrets: Separate from config (Vault, AWS Secrets Manager)
- Versioning: Track config changes in Git
- Validation: Schema validation on load
- 12-Factor App: Config in environment
- Encryption at rest and in transit
- Audit logging for changes
- Feature flags for gradual rollouts
Q2: How do you implement feature flags?
- Boolean: Simple on/off
- Percentage: Gradual rollout (10% → 50% → 100%)
- User targeting: Specific users or attributes
- Time-based: Scheduled activation
- Hash user ID for consistent bucketing
- Use configuration store for flag definitions
- SDK in each service to evaluate flags
- Clean up old flags (tech debt)
- Monitor flag usage
- Have kill switches for quick rollback
Q3: How do you handle secrets in microservices?
- HashiCorp Vault: Dynamic secrets, leasing, rotation
- AWS Secrets Manager: AWS-native, automatic rotation
- Kubernetes Secrets: Basic, encode with base64 (not encrypted)
- Dynamic credentials (short-lived)
- Automatic rotation
- Principle of least privilege
- Audit access
- Encrypt at rest
Q4: How do you update configuration without downtime?
- Watch config source for changes
- Validate new config before applying
- Gracefully transition (connection pools, caches)
- Rollback if validation fails
- Update ConfigMap/Secret
- Trigger rolling restart: kubectl rollout restart
- Or use sidecar to watch and signal reload
- Watch config file (chokidar/inotify)
- Poll config server periodically
- Subscribe to config change events
Chapter Summary
- Centralize configuration with tools like Consul or etcd
- Use hierarchical config: global → environment → service
- Implement feature flags for safe progressive rollouts
- Separate secrets from configuration (use Vault)
- Enable hot reload for zero-downtime config updates
- Follow 12-Factor App principles
Interview Questions: Silent Config Failures
A config change deployed to 10% of pods caused a 2% error rate spike that took 45 minutes to detect. How do you prevent this from happening again?
- Root-cause the detection delay, not just the failure. 45 minutes to detect a 2% error rate spike means your alerts are tuned for 5%+ thresholds. Tighten SLO burn-rate alerts so a 2% error rate on 10% of traffic (effectively 0.2% of global traffic) still pages within 5 minutes. Fast alerts are the first line of defense.
- Stage config rollouts like code rollouts. A config change should never apply to 10% of pods instantly. Use the same canary pattern: 1 pod, 5% pods, 25%, 100%, with metrics checks at each step.
- Type-check and schema-validate configs at load time. The most common cause of silent failures is a misspelled key or wrong-typed value. pydantic-settings, convict, or JSON schema validation catches this before the service serves traffic. A pod that cannot parse its config should fail readiness, not run with undefined behavior.
- Automate drift detection between environments. Daily comparison: staging’s config schema should match production’s, with documented exceptions. A key that exists in prod but not staging means staging is not actually testing what prod runs.
- Emit a structured log event on every config change. “Config change: key=max_retries, old=3, new=5, source=consul-watch, env=production, ts=…” This is gold for incident correlation: when something breaks, you can trace it to an exact config change.
- Create a config-change-specific dashboard. Traffic, latency, and error rate overlaid with config change events. When SRE opens the incident, they see immediately whether a recent config change correlates with the spike.
- “Add more logging.” More logs do not help if nobody reads them during the incident. The fix is alerts that trigger on the specific failure mode, not more data to sift through.
- “Require approval for every config change.” This creates approval queue bottlenecks and trains teams to batch changes, which makes diagnosis harder. Instead, make the changes safer and the detection faster.
- Cloudflare’s outage postmortems (indexed at blog.cloudflare.com).
- Google SRE Workbook, Chapter 5 (“Alerting on SLOs”) — burn-rate alerting formulas.
- LaunchDarkly’s “Progressive Delivery” whitepaper on feature flag rollout patterns.
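The structured config-change event recommended in the answer above could look roughly like this. It is a sketch: the field names follow the example log line, while the function name and the JSON-lines format are assumptions.

```python
import json
from datetime import datetime, timezone
from typing import Any, List, Mapping, Optional


def config_change_events(
    old: Mapping[str, Any],
    new: Mapping[str, Any],
    source: str,
    env: str,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return one JSON log line per key whose value differs between old and new."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    events = []
    for key in sorted(set(old) | set(new)):
        if old.get(key) == new.get(key):
            continue
        # Do not pass secret values through this function; log only plain config.
        events.append(
            json.dumps(
                {
                    "event": "config_change",
                    "key": key,
                    "old": old.get(key),
                    "new": new.get(key),
                    "source": source,
                    "env": env,
                    "ts": timestamp,
                },
                sort_keys=True,
            )
        )
    return events
```

Emitting one line per changed key keeps the events easy to overlay on a dashboard and easy to search during an incident.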
Your team keeps committing secrets to git by accident. Three times this quarter a developer pushed a .env file. How do you fix the process?
- Accept that the secrets are already compromised. Once in git history, a secret is exposed to anyone with repo read access plus GitHub’s own secret scanners plus any tool with clone access. Rotate all three secrets immediately, before doing process work.
- Install a pre-commit hook. gitleaks or trufflehog scans staged files before commit. A secret is rejected at the developer’s machine, before it ever hits the remote. Make this mandatory via a husky or pre-commit.com template in the repo.
- Add a second-line check in CI. Even if someone bypasses the pre-commit hook (or works from a fresh clone without it), CI rescans on every push. A matched secret fails the build. This catches both accidents and intentional bypasses.
- Scan the entire git history once. gitleaks detect --source . walks the full history. Every finding triggers a rotation ticket. Assume every secret ever committed is compromised regardless of whether it was “removed.”
- Make the right path the easiest path. If developers commit secrets because .env is the easiest way to inject values locally, make the secret manager easier. direnv + a shell wrapper that fetches from Vault is one pattern. Secrets in a tool that integrates cleanly into the dev loop get used correctly.
- Educate on the specific threat model. Developers often believe a force-push removes the secret. Show them that GitHub’s secret scanner still detected it, show them the log of scanner hits. Make the failure mode visceral.
--no-verify. How do you prevent that?
A: You cannot fully prevent it on developer machines, but you make it visible: CI rejects pushes with unverifiable commits, PRs require CI green, and a weekly report lists force-pushes and --no-verify commits for review. Make the bypass expensive socially, not just technically.
/var/run/secrets/... or a tmpfs volume, never to the working tree. Scanners ignore those paths. If a secret appears anywhere under the checked-out source tree, it is always wrong.
- “Use git filter-branch to remove the secret from history.” Does not help: the secret has already been pulled by anyone with repo access, and scanners keep copies. Rotate instead.
- “Train developers better.” Training alone rarely prevents this class of mistake. Technical controls (pre-commit + CI scan) are the most reliable defense.
- GitHub’s documentation on secret scanning and push protection.
- OWASP: “Sensitive Data Exposure” top-10 category.
- Vault’s Kubernetes auth method documentation for replacing .env files.
Your feature-flag system now has 340 active flags. Engineers say they're afraid to remove any because they do not know which ones are safe. How do you dig out?
- Instrument evaluation counts per flag. For the next 30 days, log every flag evaluation with flag name, result, and context. At the end of the window, you know which flags are actually in use and which ones are dead code.
- Categorize by state. Flags that always evaluate to the same value (100% on or 100% off for 30 days) are either fully rolled out or fully abandoned — either way, they are removable. Flags with dynamic rollout (targeting by user attributes or percentages) need more care.
- Assign ownership retroactively. Use git blame on each flag definition to find who last touched it. Send that person (or their current team) a ticket: “You own flag X. Decide: keep, remove, or transfer ownership.” Flags with no owner after 30 days are removed by default.
- Remove in waves, not all at once. Start with the “always true” flags (safest). Remove the flag reference and the else branch in the code; only the on-path code remains. Validate in staging, deploy. Repeat for “always false” flags, then for dormant-but-still-dynamic flags.
- Add automated guardrails against regression. Linter check: every flag in code must have a corresponding flag definition with owner, creation date, and expiration date. CI fails if a flag is older than 90 days without documented extension.
- Budget the cleanup. Removing 300 flags is a 6-month project. Allocate 10% of each team’s sprint capacity to “feature flag debt” until the list is under 50. Treat it like any other debt-reduction program.
- “Just delete all flags older than 6 months.” Without instrumentation, you do not know which old flags are safely removable. Some old flags are kill switches that rarely fire — removing them removes the escape hatch.
- “Force engineers to remove flags before shipping new ones.” Creates a queue bottleneck and encourages engineers to leave flags named as generically as possible to avoid cleanup. Use technical guardrails instead of social ones.
- Pete Hodgson, “Feature Toggles” on martinfowler.com — the canonical taxonomy and lifecycle model.
- LaunchDarkly’s “Effective Feature Management” e-book (chapters on tech debt).
- John Allspaw, “On Being a Senior Engineer” (2012) — relevant for thinking about long-term system ownership.
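The instrumentation and categorization steps from the cleanup plan above can be sketched as a small classifier over per-flag evaluation counts. The FlagStats shape, the category names, and the sample flags are assumptions; the 90-day limit mirrors the guardrail mentioned in the plan.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlagStats:
    name: str
    created: date
    true_count: int   # evaluations that returned True in the window
    false_count: int  # evaluations that returned False in the window


def classify(stats: FlagStats) -> str:
    """Bucket a flag by how it evaluated during the observation window."""
    if stats.true_count + stats.false_count == 0:
        return "never_evaluated"
    if stats.false_count == 0:
        return "always_on"
    if stats.true_count == 0:
        return "always_off"
    return "dynamic"


def is_overdue(stats: FlagStats, today: date, max_age_days: int = 90) -> bool:
    """True when the flag is older than the allowed age."""
    return (today - stats.created).days > max_age_days


def removal_candidates(flags, today: date):
    """Names of overdue flags that never varied; review kill switches by hand."""
    return [
        f.name
        for f in flags
        if is_overdue(f, today) and classify(f) in {"always_on", "always_off", "never_evaluated"}
    ]
```

The output is a review list, not a delete list: a kill switch that has never fired looks identical to an abandoned flag in these counts, so each candidate still needs an owner's confirmation.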
Interview Deep-Dive
'Your company has 30 microservices and needs to rotate a database password. Currently, each service reads it from an environment variable and requires a redeploy. How do you fix this, and how do you handle the rotation without downtime?'
'Explain how you would implement feature flags in a microservices environment. What are the risks of feature flags at scale?'
'How do you manage environment-specific configuration (dev, staging, production) without configuration drift across environments?'
config/defaults/database_pool_size = 10, config/production/database_pool_size = 50, config/production/order-service/database_pool_size = 100. The service reads in order: service-specific overrides the environment, which overrides the default. This means most configuration is shared (defaults), and only the values that genuinely differ per environment are overridden.
Configuration drift happens when someone manually changes a production value without updating staging. I prevent this three ways.
First, all configuration changes go through version control (GitOps for config). A PR changes a config file, gets reviewed, and an automated process applies it to the target environment. No manual Consul KV edits in production.
Second, I run a daily drift detection job that compares the configuration across environments and reports differences that are not in the “expected overrides” list. If staging has payment_gateway = sandbox and production has payment_gateway = live, that is expected. If staging has max_retries = 3 and production has max_retries = 5 but the override was not documented, that is drift and gets flagged.
Third, I use infrastructure-as-code (Terraform) for environment provisioning. The same Terraform module creates all environments with parameterized differences. This ensures structural consistency even as values differ.
Follow-up: “What about configuration that is specific to a single developer’s local environment?”
Local config should never leak into the shared configuration system. I use a .env.local file (gitignored) that overrides the defaults for local development. The application loads configuration in priority order: .env.local > environment variables > configuration service > defaults in code. This way, a developer can override any value locally without affecting anyone else, and the committed code always references the default or environment-specific values.
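The drift detection job described in this answer reduces to a comparison with an allowlist. The sketch below uses the payment_gateway and max_retries examples from the text; the function name, the sentinel for absent keys, and the remaining sample data are assumptions.

```python
from typing import AbstractSet, Any, Dict, Mapping, Tuple

MISSING = "<missing>"  # marker for a key that exists in only one environment


def find_drift(
    staging: Mapping[str, Any],
    production: Mapping[str, Any],
    expected_overrides: AbstractSet[str],
) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (staging_value, production_value)} for undocumented differences."""
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(staging) | set(production)):
        if key in expected_overrides:
            continue  # documented difference, for example sandbox vs live
        staging_value = staging.get(key, MISSING)
        production_value = production.get(key, MISSING)
        if staging_value != production_value:
            drift[key] = (staging_value, production_value)
    return drift


staging_config = {"payment_gateway": "sandbox", "max_retries": 3, "timeout_ms": 2000}
production_config = {"payment_gateway": "live", "max_retries": 5, "timeout_ms": 2000}
report = find_drift(staging_config, production_config, {"payment_gateway"})
```

In this example report flags only max_retries, because the payment gateway difference is on the expected list and the timeout matches. Keeping that expected list in version control next to the config gives reviewers one place to approve intentional differences.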