Monolith vs Microservices
Architecture Comparison
When to Use Microservices
Use Microservices
- Large, complex applications
- Multiple teams working independently
- Parts need different scaling
- Different technology requirements
- Frequent, independent deployments
Avoid Microservices
- Small team (< 10 engineers)
- Simple domain
- Early-stage startup
- Unclear domain boundaries
- Team lacks distributed systems experience
Service Communication
Synchronous vs Asynchronous
API Gateway Pattern
Backend for Frontend (BFF)
Service Discovery
Client-Side vs Server-Side
Service Discovery Implementation
Service Registration
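A minimal sketch of registration and client-side discovery, assuming an in-memory registry where instances stay listed only while they keep sending heartbeats. The names `ServiceRegistry`, `RoundRobinClient`, and the 30-second TTL are illustrative assumptions, not the API of any particular registry product.

```python
import itertools
import time
from typing import Callable, Dict, List


class ServiceRegistry:
    """In-memory registry: instances register, then renew via heartbeats."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # service name -> {instance address -> time of last heartbeat}
        self._instances: Dict[str, Dict[str, float]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, {})[address] = self._clock()

    # A heartbeat is a re-registration that refreshes the timestamp.
    heartbeat = register

    def deregister(self, service: str, address: str) -> None:
        self._instances.get(service, {}).pop(address, None)

    def lookup(self, service: str) -> List[str]:
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        seen = self._instances.get(service, {})
        return sorted(a for a, t in seen.items() if now - t <= self._ttl)


class RoundRobinClient:
    """Client-side discovery: the caller asks the registry and picks an instance."""

    def __init__(self, registry: ServiceRegistry, service: str):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def next_address(self) -> str:
        addresses = self._registry.lookup(self._service)
        if not addresses:
            raise LookupError(f"no live instances of {self._service}")
        return addresses[next(self._counter) % len(addresses)]
```

In server-side discovery the same lookup would live in a load balancer or proxy instead of in the caller. A production client would typically also cache the last known addresses so that a registry outage does not immediately break calls.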
Data Management
Database per Service
Saga Pattern for Distributed Transactions
Saga Pattern Implementation
CQRS (Command Query Responsibility Segregation)
Event Sourcing
Store all changes to application state as a sequence of events.
Resilience Patterns
Circuit Breaker
Bulkhead Pattern
Retry with Backoff
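One possible shape for retry with exponential backoff is sketched here. It assumes "full jitter" (each delay is drawn uniformly between zero and the exponential cap) and treats only connection and timeout errors as retryable; the function names, default delays, and retry count are illustrative assumptions.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 5.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield one jittered delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))


def retry(call: Callable[[], T], retries: int = 3,
          retry_on=(ConnectionError, TimeoutError),
          sleep: Callable[[float], None] = time.sleep, **backoff) -> T:
    """Run call(); on a retryable error wait and try again, up to `retries` retries."""
    delays = backoff_delays(retries, **backoff)
    while True:
        try:
            return call()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

The jitter spreads retries from many callers over time, so a recovering dependency is not hit by a synchronized wave. Retries are only safe for idempotent operations; a non-idempotent call such as charging a card would need an idempotency key first.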
Service Mesh
What is a Service Mesh?
Service Mesh Features
Traffic Management
- Load balancing
- A/B testing
- Canary deployments
- Traffic splitting
- Retries & timeouts
Security
- mTLS encryption
- Service-to-service auth
- Access policies
- Certificate management
Observability
- Distributed tracing
- Metrics collection
- Access logs
- Service topology
Resilience
- Circuit breaking
- Rate limiting
- Fault injection
- Health checks
Observability
The Three Pillars
Distributed Tracing
Deployment Strategies
Common Strategies
Key Takeaways
| Pattern | When to Use |
|---|---|
| API Gateway | Single entry point, cross-cutting concerns |
| Service Discovery | Dynamic service locations |
| Circuit Breaker | Prevent cascading failures |
| Saga | Distributed transactions |
| CQRS | Different read/write patterns |
| Event Sourcing | Audit trail, temporal queries |
| Service Mesh | Complex microservices, need observability |
Interview Deep-Dive Questions
Q1: Your team is building a new product at a 50-person startup. A senior engineer proposes starting with microservices from day one because 'we will need to scale eventually.' How do you evaluate this proposal?
- I would push back on this proposal in most cases. Starting with microservices at a 50-person startup is almost always premature optimization. The reason is not that microservices are bad — it is that the costs are front-loaded while the benefits only materialize at a scale and organizational complexity the startup has not reached yet.
- The concrete costs of starting with microservices on day one: (1) You need service discovery, an API gateway, distributed tracing, centralized logging, and a container orchestration platform before you can ship your first feature. That is 2-4 weeks of infrastructure work that delivers zero customer value. (2) Every feature that spans services requires coordinated deployments, API contracts, and cross-service debugging. At a startup, almost every feature spans what would be service boundaries. (3) You do not yet know where the domain boundaries are. The biggest risk is drawing service boundaries wrong and spending months refactoring when the business model pivots.
- The right approach: start with a well-structured monolith. Use clear module boundaries internally (separate packages or modules for user management, orders, payments). Enforce these boundaries with code review and linting rules — no direct database access across module boundaries, communication through well-defined interfaces. This gives you 80% of the organizational benefit (clear ownership, clean interfaces) with none of the distributed systems overhead.
- When to extract: extract a service when you have a concrete, measurable reason. Examples: the payment module needs to scale independently because it has different resource requirements; the recommendation engine team wants to deploy on a different cadence; a specific module needs to be in a different language for performance reasons. Each extraction should be driven by a pain point, not a prediction.
- The exception: if the startup is building a platform product (like an API or marketplace) where different components have fundamentally different scaling characteristics from day one, a small number of services (2-3, not 15) might be justified.
- Example: Shopify ran as a monolith (the “Shopify monolith” is famous) serving millions of merchants for years before selectively extracting services. They explicitly chose to invest in making the monolith modular rather than splitting prematurely. When they did extract services, they had years of production data telling them exactly where the boundaries should be.
Q2: You have a microservices architecture where the Order Service needs data from the User Service and the Inventory Service to process a single order. The User Service is experiencing intermittent 500 errors. How do you design the Order Service to be resilient to this?
- The first step is to categorize the dependency: is the User Service data critical or supplementary for order processing? The user’s shipping address is critical — you cannot fulfill an order without it. The user’s display name for the order confirmation email is supplementary — you can send the email later.
- For critical data (shipping address): implement a local cache in the Order Service. When the User Service is healthy, cache user profile data with a reasonable TTL (5-15 minutes for addresses). When the User Service returns a 500, fall back to the cached data. If the cache is also empty (new user, cache eviction), then the order must fail — but fail fast with a clear error message, not after a 30-second timeout chain.
- For supplementary data (display name, preferences): use a circuit breaker. After 5 consecutive failures to the User Service, open the circuit. While the circuit is open, skip the User Service call entirely and use default values. The order proceeds without the supplementary data. A background job re-enriches the order when the User Service recovers.
- The circuit breaker configuration matters: failure threshold of 5, timeout of 30 seconds in the open state, then allow one probe request in half-open state. If the probe succeeds, close the circuit. If it fails, reset the timeout. Add jitter to the timeout so all Order Service instances do not probe the User Service simultaneously when it is recovering.
- For the Inventory Service, the pattern is different because inventory checks have side effects (reservation). Use the Saga pattern: create the order optimistically, then attempt inventory reservation as a separate step. If inventory reservation fails, compensate by canceling the order. This decouples order creation from inventory availability at the cost of occasionally needing to cancel.
- Timeout configuration: set aggressive timeouts (200-500ms) for supplementary calls, longer (2-5s) for critical calls. Never use the default HTTP client timeout (often 30-60 seconds) — that will cascade into thread pool exhaustion in the Order Service.
- Example: Netflix’s Hystrix (now in maintenance, but the patterns live on in Resilience4j) popularized pairing the bulkhead pattern with circuit breakers. Each downstream dependency gets its own thread pool. If the User Service thread pool is exhausted, Order Service can still call Inventory Service because those threads are isolated. This prevents a single bad dependency from starving all other calls.
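The cache fallback for critical data described above could look roughly like this. It is a sketch under assumptions: `fetch` stands in for a hypothetical User Service client call with its own short timeout, the 10-minute TTL is one value from the suggested range, and the class and exception names are invented for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class UserServiceUnavailable(Exception):
    """Raised when the User Service fails and no usable cached copy exists."""


class AddressLookup:
    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch  # hypothetical client call; should enforce a short timeout
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # user_id -> (address, cached_at)

    def shipping_address(self, user_id: str) -> Tuple[str, bool]:
        """Return (address, degraded); degraded=True means it came from the cache."""
        try:
            address = self._fetch(user_id)
        except Exception as exc:  # broad on purpose in this sketch
            cached = self._cache.get(user_id)
            if cached and self._clock() - cached[1] <= self._ttl:
                return cached[0], True
            # Nothing fresh enough to fall back on: fail fast with a clear error.
            raise UserServiceUnavailable(f"no address for user {user_id}") from exc
        self._cache[user_id] = (address, self._clock())
        return address, False
```

The `degraded` flag gives the caller a place to record that an order was built from cached data, for example as a `user_data_degraded` marker on the order record, so it can be re-checked once the User Service recovers.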
user_data_degraded: true to the order record). When the circuit closes, a background consumer picks up these flagged orders, fetches fresh user data from the User Service, and updates any fields that were stale or defaulted. For critical fields like shipping address, if the cached version differs from the fresh version, flag the order for manual review before shipping. This is cheaper than blocking all orders during the outage.
Q3: You are decomposing a monolith into microservices. The existing monolith uses a single PostgreSQL database with foreign keys between the Orders table, Users table, and Products table. How do you handle the data decomposition?
- Data decomposition is the hardest part of migrating to microservices, and it should be done incrementally, not all at once. The Strangler Fig pattern applied to data: start by creating logical boundaries in the existing database, then physically separate over time.
- Phase 1 — Logical separation: Within the monolith, introduce a data access layer per domain. The Order module can only access the Orders table through an OrderRepository. The User module owns the Users table. Enforce this with code review and eventually with database views or schemas. Foreign keys still exist at this point — that is fine.
- Phase 2 — API boundary: The Order Service stops joining directly to the Users table. Instead, it calls a User Service API (which initially is just another module in the same monolith) to get user data. Replace the join with an application-level join: fetch the order, then fetch the user by ID. Yes, this is slower. Cache user data aggressively.
- Phase 3 - Physical separation: Move the Users table to its own database (or schema) that only the User Service can access. Drop the foreign key constraint. The Order table stores user_id as a plain column with no foreign key. Referential integrity is now the application’s responsibility.
- Handling the loss of foreign keys: (1) Use soft deletes: never hard delete a user; mark them as deleted. This prevents dangling references. (2) Implement eventual consistency checks: a periodic job that scans orders for user_ids that no longer exist in the User Service and flags them. (3) Use events: when a user is deleted, publish a UserDeleted event. The Order Service subscribes and handles orphaned orders (archive them, anonymize them, whatever the business requires).
- Handling the loss of joins: for read-heavy queries that previously joined Orders and Users (e.g., “show all orders with user names”), either (1) denormalize — store the user name in the Order record and update it via events when the name changes, or (2) use an API composition layer that fetches from both services and merges the results, or (3) maintain a read-optimized view using CQRS — an event-driven projection that materializes the joined view into a query-optimized store.
- Example: Uber’s migration from a monolith to microservices took years. They specifically called out the data layer as the bottleneck. They used the “database-per-service” pattern but maintained a shared schema registry to prevent drift. Their biggest challenge was cross-service reporting, which they solved by streaming all events into a central data lake for analytics queries.
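The application-level join that replaces the SQL join in Phase 2 might be sketched as follows. The order and user dictionaries and the batched `fetch_users` callable are assumptions for illustration; a real User Service client would sit behind that callable.

```python
from typing import Callable, Dict, Iterable, List, Optional


def orders_with_user_names(
    orders: List[dict],
    fetch_users: Callable[[Iterable[str]], Dict[str, dict]],
) -> List[dict]:
    """Application-level join: attach user_name to each order."""
    user_ids = sorted({order["user_id"] for order in orders})
    # One batched call instead of one call per order avoids an N+1 pattern.
    users = fetch_users(user_ids)
    joined = []
    for order in orders:
        user: Optional[dict] = users.get(order["user_id"])
        # With no foreign key, a missing user is possible; keep the order anyway.
        joined.append({**order, "user_name": user["name"] if user else None})
    return joined
```

Batching matters here because the join now costs network round trips. The `None` name is where an orphaned reference would surface, which is the case the consistency checks and UserDeleted events are meant to handle.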
Q4: Your microservices architecture has grown to 40 services. Debugging a single user request that fails intermittently has become extremely difficult because the request touches 8 services. How do you approach observability in this environment?
- This is exactly the scenario where distributed tracing becomes non-negotiable. The three pillars — metrics, logs, and traces — need to work together, but traces are the primary tool for debugging cross-service request failures.
- The foundation is a correlation ID (trace ID) that is generated at the edge (API gateway or first service) and propagated through every service call via HTTP headers (W3C Trace Context standard or B3 propagation). Every log line, every metric, every error report includes this trace ID. When a user reports a failure, you search by trace ID and see the entire request path.
- The trace shows you: which services were called, in what order, how long each took, and which one failed. For an intermittent failure across 8 services, the trace will immediately narrow it down to “the Payment Service returned a 500 after 12 seconds” or “the Inventory Service timed out on the database query.”
- Beyond basic tracing, instrument span attributes: add business context to spans (user_id, order_id, product_id). This lets you search for “all traces where user X’s requests to the Inventory Service took longer than 2 seconds.” Without these attributes, you have infrastructure data but no business context.
- For intermittent failures specifically, use tail-based sampling: capture 100% of traces for errored or slow requests, and sample 1-5% of successful requests. This ensures you never miss a failure trace while keeping storage costs manageable. Head-based sampling (decide at the edge whether to trace) risks missing the exact failures you need to debug.
- Service dependency maps: generate these automatically from trace data. “The Order Service calls User, Inventory, Payment, and Notification. Payment calls Fraud Detection and Bank Gateway.” This map is invaluable for understanding blast radius — if the Bank Gateway goes down, which user-facing flows are affected?
- Example: Uber built Jaeger specifically for this problem. With thousands of microservices, they needed to trace requests that could touch dozens of services. They found that most debugging time was spent not on finding the failing service (traces made that obvious) but on understanding why it failed (which required correlated logs and metrics). Their solution was a unified observability platform where clicking a span in a trace opens the relevant logs and metrics for that exact time window and service.
Interview Questions
What is the Saga pattern, and why can't you just use a distributed transaction (two-phase commit) across microservices?
- The Saga pattern manages data consistency across services without a single ACID transaction. Instead of locking all resources at once (as 2PC does), a saga breaks a business operation into a sequence of local transactions, each with a compensating action that undoes its effect if a later step fails. For example, an e-commerce checkout saga might go: Create Order, Reserve Inventory, Charge Payment, Schedule Shipping — and if Payment fails, it runs Release Inventory then Cancel Order in reverse.
- Two-phase commit does not work across microservices for practical reasons: it requires all participants to hold locks until the coordinator says “commit.” In a distributed system with services that have independent databases, this means holding row-level locks across network boundaries for the duration of the slowest participant. At 1,000 concurrent orders, you are holding 1,000 sets of cross-service locks. One slow or crashed participant blocks the entire system. 2PC is a single point of failure disguised as a coordination protocol.
- Sagas give you eventual consistency instead of strong consistency. The trade-off is that intermediate states are visible — a customer might briefly see an order as “created” before payment is confirmed. You handle this with careful UI design (showing “processing” states) and idempotent compensations.
- There are two saga execution styles: choreography (each service publishes events and the next service reacts) and orchestration (a central coordinator tells each service what to do). Choreography is simpler for 3-4 step flows but becomes spaghetti at 8+ steps because the flow logic is scattered. Orchestration adds a single point of coordination but makes complex flows readable and testable.
- A compensation step in your saga fails (e.g., the refund API to Stripe returns a 500). Now you have an inconsistent state — the order is cancelled but the customer was still charged. How do you handle this?
- Your orchestration-based saga has 6 steps. During step 4, the saga orchestrator itself crashes. How do you ensure the saga resumes or compensates correctly after the orchestrator restarts?
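The core of a saga (local steps, each paired with a compensation that runs in reverse order on failure) can be sketched in a few lines. This is an in-memory illustration only: the `run_saga` helper and the step tuples are assumptions, and it deliberately omits persisted saga state and retrying of failed compensations, which a real orchestrator would need.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}: failed")
            # The failed step made no durable change, so only earlier steps are undone.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            return log
        completed.append(step)
        log.append(f"{name}: done")
    return log
```

With the checkout flow from the answer, a failure in the payment step would release the inventory and then cancel the order, and shipping would never be scheduled. Compensations should be idempotent so they can be retried safely after a crash.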
Explain the Circuit Breaker pattern. How do you decide the failure threshold and timeout values in production?
- The circuit breaker is a resilience pattern that prevents a service from repeatedly calling a failing downstream dependency, which would waste resources and cascade the failure. It has three states: Closed (requests flow normally, failures are counted), Open (all requests fail immediately without calling the dependency, giving it time to recover), and Half-Open (after a timeout, one probe request is allowed through — if it succeeds the circuit closes, if it fails it reopens).
- The key insight is that calling a failing service is worse than not calling it. Without a circuit breaker, 100 threads can pile up waiting for 30-second timeouts from a dead Payment Service, exhausting the Order Service’s thread pool and causing the Order Service to also appear dead to its callers. The circuit breaker converts a 30-second timeout into a 5-millisecond “fail fast” response.
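- Choosing the failure threshold: this depends on the expected error rate under normal conditions and on request volume. If the downstream service has a normal error rate of 0.1%, a threshold of 5 failures in 10 seconds is reasonable at moderate traffic. If the normal error rate is 2% (some services are flaky by nature), you need a higher threshold like 10-15 failures in a 30-second window, or you will get false trips. At high request rates, express the threshold as a failure percentage over the window instead of an absolute count: at 1,000 requests per second, a 0.1% error rate already produces about 10 failures every 10 seconds. The window matters as much as the count: 5 failures in 1 second is a signal; 5 failures in 10 minutes is noise.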
- Choosing the timeout (how long the circuit stays open): this should match the dependency’s expected recovery time. If the dependency typically recovers within 30 seconds (e.g., a pod restart on Kubernetes), use a 30-second open timeout. For services with longer recovery (database failover, 2-5 minutes), use longer timeouts. Start conservative (30 seconds) and tune based on production data.
- In practice, combine the circuit breaker with bulkheading (separate thread pools per dependency) and fallbacks (return cached data or a degraded response when the circuit is open). The circuit breaker without a fallback just changes the error from “timeout” to “circuit open” — the user still gets an error.
- You have a circuit breaker on the Payment Service. The circuit opens during a flash sale with 50,000 users actively checking out. How do you handle the business impact of failing all payment attempts for 30 seconds?
- Your service calls the same downstream dependency for two different use cases — one is critical (payment verification) and one is supplementary (fetching display metadata). Should they share a circuit breaker or have separate ones? Why?
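The three states described above can be sketched as a small class. This version makes simplifying assumptions: it counts consecutive failures and does not use a time window, it is not thread-safe (so under concurrency more than one probe could slip through in the half-open state), and the class and exception names are illustrative.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._threshold = failure_threshold
        self._open_seconds = open_seconds
        self._clock = clock
        self._failures = 0  # consecutive failures while closed
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._open_seconds:
            return "half-open"
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if state == "half-open" or self._failures >= self._threshold:
                self._opened_at = self._clock()  # (re)open and restart the timer
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and return a fallback such as cached data, since without a fallback the breaker only changes which error the user sees.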
What is an API Gateway in microservices, and how does it differ from a reverse proxy or a load balancer?
- An API Gateway is the single entry point for all client requests in a microservices architecture. It sits between external clients and internal services and handles cross-cutting concerns: request routing, authentication/authorization, rate limiting, request/response transformation, protocol translation (e.g., REST to gRPC), and API composition (aggregating responses from multiple services into one).
- A reverse proxy (like Nginx) forwards requests to backend servers based on URL patterns. It handles basic routing and TLS termination but has no awareness of your API semantics or business logic. A load balancer distributes traffic across instances of a single service for availability and throughput. The API Gateway is a superset — it does what both do, plus API-aware features like request validation, response shaping, and per-route rate limiting.
- The critical difference: an API Gateway understands that /api/orders/123 should go to the Order Service, that the response should be enriched with user data from the User Service, that this particular endpoint requires an OAuth2 token with the orders:read scope, and that free-tier users are limited to 100 requests per minute on this endpoint. A load balancer knows none of this.
- The trade-off: the API Gateway is a single point of failure and a potential bottleneck. If it goes down, everything goes down. You mitigate this with horizontal scaling (multiple gateway instances behind a load balancer; yes, an LB in front of the gateway), health checks, and keeping the gateway thin: business logic belongs in services, not the gateway.
- Real-world examples: Kong (open-source, plugin ecosystem), AWS API Gateway (managed, pay-per-request), Envoy (often used as both gateway and service mesh data plane), and custom BFF gateways built with Node.js or Go for specific client needs.
- Your team is debating between a single API Gateway for all clients versus a Backend-for-Frontend (BFF) pattern with separate gateways for web, mobile, and IoT. What are the trade-offs and when would you choose BFF?
- The API Gateway is doing request aggregation — calling 3 services and combining the results. One of those services is slow. How do you prevent the gateway from becoming the bottleneck?
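The API-aware behaviour that separates a gateway from a load balancer (route by path, check a scope, apply a per-route limit) can be illustrated with a toy. Everything here is an assumption for the sketch: the `Route` fields, the fixed one-minute window counter, and the in-memory state. It does not reflect how Kong, AWS API Gateway, or Envoy are configured.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    upstream: str  # hypothetical internal service name
    required_scope: str
    free_tier_per_minute: int


class Gateway:
    def __init__(self, routes: Iterable[Route],
                 clock: Callable[[], float] = time.time):
        # Longest prefix first so more specific routes win.
        self._routes = sorted(routes, key=lambda r: len(r.prefix), reverse=True)
        self._clock = clock
        # (client, route prefix, minute bucket) -> request count; never pruned here
        self._counts: Dict[Tuple[str, str, int], int] = {}

    def handle(self, path: str, client_id: str, scopes: Set[str],
               tier: str = "free") -> Tuple[int, str]:
        """Return (status, upstream name or error message)."""
        route = next((r for r in self._routes if path.startswith(r.prefix)), None)
        if route is None:
            return 404, "no route"
        if route.required_scope not in scopes:
            return 403, f"missing scope {route.required_scope}"
        if tier == "free":
            key = (client_id, route.prefix, int(self._clock() // 60))
            self._counts[key] = self._counts.get(key, 0) + 1
            if self._counts[key] > route.free_tier_per_minute:
                return 429, "rate limit exceeded"
        return 200, route.upstream
```

Note that the sketch keeps no business logic: it decides where a request goes and whether it may pass, and leaves everything else to the services.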
How does service discovery work in a microservices environment, and what happens when the service registry becomes unavailable?
What is the 'database per service' pattern and how do you handle cross-service queries that previously relied on SQL JOINs?
- Database per service means each microservice owns and exclusively manages its own data store. No other service can read or write to it directly — only through the service’s API. This is arguably the most important microservices principle because without it, you have a distributed monolith: services that are deployed independently but coupled through shared data.
- The immediate pain point: in a monolith, “get all orders with customer names and product details” is a single SQL query with two JOINs. In microservices, this data lives in three separate databases owned by three separate services. You cannot JOIN across them.
- There are four strategies, each with different trade-offs. (1) API Composition: the caller makes three API calls (Order Service, User Service, Product Service) and joins the data in application code. Simple but slow for large result sets and creates runtime coupling. (2) Denormalization via events: the Order Service stores a copy of the customer name and product name in its own database, updated via events (CustomerNameChanged, ProductNameChanged). Fast reads, but data can be slightly stale and you pay storage cost. (3) CQRS with event-driven projections: build a dedicated read model that consumes events from all three services and materializes the joined view. Best for complex queries but adds architectural complexity. (4) Data lake/warehouse: stream all events or CDC changes into a central analytics store (BigQuery, Redshift) for reporting queries. Not real-time, but perfect for business intelligence.
- In practice, most teams use a mix: API composition for simple, low-volume queries; denormalization for high-traffic read paths (e.g., order detail page); and a data warehouse for cross-domain analytics.
- You chose denormalization: the Order Service stores the customer name. The customer changes their name. The event gets published, but 10,000 historical orders still have the old name. Do you backfill? What are the trade-offs?
- Your read model (CQRS projection) is running 30 seconds behind the write model due to event processing lag. A user places an order and immediately navigates to “My Orders” but does not see the new order. How do you handle this read-your-own-writes problem?
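Strategy (2), denormalization via events, could be sketched like this on the Order Service side. The event shape (`event_id`, `customer_id`, `new_name`) and the `OrderStore` class are assumptions for illustration. This version chooses to rewrite every existing order for the customer, which is only one of the possible answers to the backfill question.

```python
from typing import Dict, Set


class OrderStore:
    """Order Service data with a denormalized copy of the customer name."""

    def __init__(self) -> None:
        self.orders: Dict[str, dict] = {}
        self._seen_events: Set[str] = set()

    def create_order(self, order_id: str, customer_id: str, customer_name: str) -> None:
        self.orders[order_id] = {
            "customer_id": customer_id,
            "customer_name": customer_name,
        }

    def on_customer_name_changed(self, event: dict) -> int:
        """Handle a CustomerNameChanged event; return the number of orders updated."""
        if event["event_id"] in self._seen_events:
            return 0  # brokers often deliver at-least-once, so skip duplicates
        self._seen_events.add(event["event_id"])
        updated = 0
        for order in self.orders.values():
            if order["customer_id"] == event["customer_id"]:
                order["customer_name"] = event["new_name"]
                updated += 1
        return updated
```

Tracking processed event IDs keeps the handler idempotent. If historical orders should keep the name as it was at purchase time (common for invoices), the handler would skip the rewrite and only new orders would pick up the new name.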
What is a service mesh, and when is the complexity of introducing one justified versus handling cross-cutting concerns in application code?
- A service mesh is an infrastructure layer that handles service-to-service communication transparently. It deploys a sidecar proxy (typically Envoy) alongside every service instance. All traffic goes through the sidecar, which handles mTLS encryption, retries, circuit breaking, load balancing, observability (metrics and traces), and access control — without the application code knowing about any of it.
- The architecture has two components: the data plane (the sidecar proxies that handle traffic) and the control plane, which configures the proxies and manages certificates (Istio, Linkerd, and Consul Connect each ship their own). The application talks to localhost; the sidecar intercepts the traffic and applies policies before forwarding.
- When it is justified: (1) you have 20+ services and enforcing consistent resilience patterns, TLS, and observability across all of them via application libraries is becoming a maintenance nightmare — every service needs the same circuit breaker library, the same retry config, the same TLS setup, and they drift; (2) you operate in a polyglot environment (Go, Java, Python, Node.js) and cannot share a single SDK; (3) you need zero-trust networking (mTLS everywhere) for compliance and doing it per-service is error-prone; (4) you need fine-grained traffic management like canary deployments at the network level.
- When it is not justified: fewer than 10 services, a single-language stack where a shared library handles cross-cutting concerns, a team that does not have Kubernetes expertise (most meshes assume Kubernetes), or when latency overhead matters — each sidecar adds 1-3ms per hop, which compounds across a 5-service call chain.
- The hidden cost: operational complexity. The mesh itself needs to be monitored, upgraded, and debugged. When something goes wrong, you are debugging both the application AND the mesh. Istio in particular has a reputation for being difficult to operate. Linkerd is lighter-weight but has fewer features.
- Your service mesh sidecar is adding 2ms of latency per hop. A single user request passes through 6 services. That is 12ms of pure mesh overhead on a 100ms budget. How do you reduce this?
- A new engineer on the team deploys a service without the sidecar (they used a plain Kubernetes Deployment instead of the mesh-injected one). What breaks, and how do you prevent this from happening again?
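The latency arithmetic in the first follow-up is simple enough to check directly. The helper name is an assumption, and the model is deliberately crude: it treats sidecar overhead as a fixed cost per hop.

```python
from typing import Tuple


def mesh_overhead(hops: int, per_hop_ms: float, budget_ms: float) -> Tuple[float, float]:
    """Return (total sidecar overhead in ms, share of the latency budget)."""
    total = hops * per_hop_ms
    return total, total / budget_ms


total_ms, share = mesh_overhead(hops=6, per_hop_ms=2.0, budget_ms=100.0)
print(f"{total_ms:.0f} ms of mesh overhead, {share:.0%} of the budget")
# prints: 12 ms of mesh overhead, 12% of the budget
```

Seen this way, the two levers are the number of hops (flatten the call chain or parallelize calls) and the per-hop cost (proxy tuning, or bypassing the mesh for a hot path).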
Compare choreography-based and orchestration-based saga patterns. When would you choose one over the other?
- In choreography, there is no central coordinator. Each service listens for events and reacts by performing its local transaction and publishing the next event. The flow emerges from the event chain: Order Service publishes OrderCreated, Payment Service hears it and publishes PaymentProcessed, Inventory Service hears that and publishes StockReserved. The “saga” exists implicitly in the event chain, not as a single artifact.
- In orchestration, a central Saga Orchestrator explicitly defines the flow: “Step 1: call Order Service. Step 2: call Payment Service. Step 3: call Inventory Service.” The orchestrator tracks state, decides next steps, and initiates compensations if a step fails. The flow is explicitly defined in one place.
- Choreography advantages: no single point of failure (no orchestrator to crash), services are more loosely coupled (they react to events, not commands), and it is simpler for short, linear flows (3-4 steps). Think of it as the publish-subscribe equivalent of a relay race — each runner knows to start when the previous runner finishes.
- Choreography disadvantages: the flow logic is distributed across multiple services, making it very hard to understand, test, and debug. If you have 8 services in a saga and step 5 fails, understanding the compensation chain requires reading code in 5 different services. Adding a new step means modifying existing services to react to new events. Circular dependencies can emerge (Service A reacts to Service B which reacts to Service A).
- Orchestration advantages: the flow is defined in one place, making it readable, testable, and debuggable. Adding a step means modifying one file. Compensations are clearly defined next to their corresponding actions. Complex branching logic (if payment fails AND order is over $500, escalate to manual review) is straightforward.
- Orchestration disadvantages: the orchestrator is a potential single point of failure (mitigate with persistent saga state and recovery on restart), and it introduces tighter coupling between the orchestrator and the services it calls.
- Rule of thumb: use choreography for simple, linear flows with 3-4 steps where the participants are unlikely to change. Use orchestration for anything more complex — 5+ steps, branching logic, or flows that change frequently.
- You have a choreography-based saga with 6 services. A bug in Service D causes it to publish the wrong event. How do you trace the issue and understand the impact on downstream services?
- Your orchestration-based saga orchestrator is processing 10,000 sagas per second. It stores saga state in PostgreSQL and is hitting write throughput limits. How do you scale it?
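To show how a choreographed flow emerges from subscriptions alone, here is a sketch with a tiny in-process event bus standing in for a message broker. The events OrderCreated, PaymentProcessed, and StockReserved come from the answer above; the failure and compensation events (StockReservationFailed, PaymentRefunded, OrderCancelled) and all class and function names are assumptions added for illustration.

```python
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, List, Tuple


class EventBus:
    """Tiny in-process stand-in for a message broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self._queue: Deque[Tuple[str, dict]] = deque()
        self.history: List[str] = []  # event types in delivery order

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self._queue.append((event_type, payload))

    def drain(self) -> None:
        """Deliver queued events until no handler publishes anything new."""
        while self._queue:
            event_type, payload = self._queue.popleft()
            self.history.append(event_type)
            for handler in self._handlers[event_type]:
                handler(payload)


def wire_services(bus: EventBus, in_stock: bool = True) -> None:
    # Payment Service reacts to OrderCreated.
    bus.subscribe("OrderCreated", lambda p: bus.publish("PaymentProcessed", p))

    # Inventory Service reacts to PaymentProcessed.
    def reserve(payload: dict) -> None:
        outcome = "StockReserved" if in_stock else "StockReservationFailed"
        bus.publish(outcome, payload)

    bus.subscribe("PaymentProcessed", reserve)
    # Compensation is also event-driven: Payment refunds, then Order cancels.
    bus.subscribe("StockReservationFailed", lambda p: bus.publish("PaymentRefunded", p))
    bus.subscribe("PaymentRefunded", lambda p: bus.publish("OrderCancelled", p))
```

No single function in the sketch describes the whole flow; it only exists as the sum of the subscriptions. That is the property that keeps short flows simple and makes long ones hard to follow.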
You are designing a canary deployment strategy for a critical payment service in a microservices architecture. Walk me through the process and what metrics you would watch.
- Canary deployment routes a small percentage of production traffic (typically 1-5% to start) to the new version while the rest continues hitting the old version. The goal is to detect problems with real production traffic before they affect all users. For a payment service, this is especially critical because a bad deploy directly impacts revenue.
- The process: (1) Deploy the new version alongside the old version (do not replace). (2) Configure the traffic router (service mesh, API gateway, or Kubernetes Ingress) to send 1% of traffic to the canary. (3) Monitor key metrics for a bake period (15-30 minutes minimum for a payment service). (4) If metrics are healthy, gradually increase to 5%, 10%, 25%, 50%, 100%. (5) If any metric breaches the threshold, automatically roll back by routing 100% to the old version.
- The metrics for a payment service specifically: (a) Error rate — compare the canary’s 5xx rate to the baseline. If the canary is 2x higher, roll back. (b) Latency — compare p50, p95, and p99 latency. A jump in p99 from 200ms to 800ms signals a problem even if p50 looks fine. (c) Business metrics — payment success rate is the most important. If the canary’s payment success rate drops from 98.5% to 96%, that is a revenue-impacting regression even if there are zero 5xx errors (the failure might be in the payment gateway integration, returning a 200 with an error payload). (d) Downstream health — is the canary causing increased errors in downstream services like fraud detection?
- The subtlety most people miss: statistical significance. At 1% traffic, you might see 50 requests per minute. A single failed request is a 2% error rate, but it could be noise. You need enough traffic volume for the comparison to be meaningful. For low-traffic services, you might need to run the canary for hours instead of minutes, or increase the initial percentage to 5-10%.
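- For a payment service, I would also do a shadow (dark) launch first: route duplicate traffic to the new version but do not use its responses. Compare the new version’s responses to the production version’s. This catches logic bugs without any customer impact, provided the shadow instance talks to a sandboxed payment gateway so that duplicated requests never charge a customer twice.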
- Your canary is at 5% traffic and the payment success rate drops by 0.3%. That is within normal variance for a 15-minute window but outside normal variance for a 1-hour window. Do you roll back immediately or wait? How do you decide?
- The canary version introduces a new database schema migration. How do you handle the fact that both the old version and the canary version need to work with the same database simultaneously?
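The bake-window comparison described above could be automated along these lines. This is a sketch with assumed thresholds: the 1,000-request minimum, the 2x error and p99 ratios, and the one-percentage-point success-rate drop are placeholders to tune, and it treats every request as one payment attempt. It uses fixed thresholds, not a statistical test.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Metrics for one version over one bake window."""
    requests: int
    errors_5xx: int
    payments_ok: int
    p99_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors_5xx / self.requests if self.requests else 0.0

    @property
    def success_rate(self) -> float:
        return self.payments_ok / self.requests if self.requests else 0.0


def canary_decision(baseline: Window, canary: Window, min_requests: int = 1000,
                    max_error_ratio: float = 2.0, max_success_drop: float = 0.01,
                    max_p99_ratio: float = 2.0) -> str:
    """Return 'wait', 'rollback', or 'promote' for one bake window."""
    if canary.requests < min_requests:
        return "wait"  # too little traffic to tell signal from noise
    if canary.error_rate > max_error_ratio * baseline.error_rate:
        return "rollback"
    # Business metric: catches failures that still return HTTP 200.
    if baseline.success_rate - canary.success_rate > max_success_drop:
        return "rollback"
    if canary.p99_ms > max_p99_ratio * baseline.p99_ms:
        return "rollback"
    return "promote"
```

A "promote" result would move the canary to the next traffic step, not straight to 100%. A production version would likely add a hard ceiling that rolls back on catastrophic error rates even before the minimum sample size is reached.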