Design for Failure Mindset
Redundancy Patterns
Active-Passive (Standby)
Active-Active
Multi-Region Active-Active
Resilience Patterns
Circuit Breaker (Deep Dive)
Retry Strategies
Bulkhead Pattern
Health Checks
Health Check Types
Timeouts
Timeout Hierarchy
Deadline Propagation
Graceful Degradation
Senior Interview Questions
How do you achieve 99.99% availability?
- Redundancy: At least 2 of everything (servers, DBs, regions)
- Load balancing: Automatic failover when node fails
- Health checks: Detect failures in seconds
- Auto-scaling: Handle traffic spikes
- Multi-region: Survive region outages
- Chaos engineering: Regularly test failure scenarios
- Single component at 99.9% can’t achieve 99.99%
- Need redundancy: 2 components at 99.9% = 99.9999% (if independent)
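The redundancy arithmetic above can be checked with a few lines of Python. This is a minimal sketch that assumes component failures are independent; the function names are illustrative, not from any library.

```python
def parallel_availability(*components: float) -> float:
    """Redundant components: the system is down only if every copy is down."""
    unavailability = 1.0
    for availability in components:
        unavailability *= (1.0 - availability)
    return 1.0 - unavailability


def serial_availability(*components: float) -> float:
    """Chained components: the system is up only if every link is up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    # Two redundant 99.9% components (assumes independent failures).
    print(f"{parallel_availability(0.999, 0.999):.6f}")  # 0.999999
    # Two 99.9% components in series, for contrast.
    print(f"{serial_availability(0.999, 0.999):.6f}")    # 0.998001
```

Redundancy multiplies unavailability, while a chain of hard dependencies multiplies availability, which is why a single 99.9% component caps everything that depends on it.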
How do you handle partial failures in distributed transactions?
- Use the Saga pattern: each step has a compensating action
- If step N fails, run compensations for steps N-1 to 1
- Track saga state in database
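The compensation rule above (if step N fails, undo steps N-1 down to 1) can be sketched as a small runner. This is an illustrative in-memory version; `run_saga` and the `Step` tuple are hypothetical names, and the returned log stands in for saga state that would be persisted to a database.

```python
from typing import Callable, List, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse.

    Returns a log of state transitions (a stand-in for durable saga state).
    """
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name}:failed:{exc}")
            # Undo steps N-1 .. 1; the failed step itself did not complete.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}:compensated")
            return log
        completed.append(step)
        log.append(f"{name}:done")
    return log
```

A production saga would also retry failed compensations and write each transition durably before moving on; this sketch skips both to keep the control flow visible.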
How do you prevent cascading failures?
- Circuit breakers: Stop calling failing service
- Timeouts: Don’t wait forever
- Bulkheads: Isolate failures to one service
- Rate limiting: Prevent overload
- Load shedding: Reject low-priority requests
- Fallbacks: Degrade gracefully
How do you test system reliability?
- Define steady state: Normal metrics (latency, error rate)
- Form hypothesis: “System handles server failure”
- Inject failure: Kill a server
- Observe: Did metrics stay within bounds?
- Fix and repeat
- Server crashes
- Network partitions
- High latency
- Disk full
- Memory exhaustion
- Clock skew
- Dependency outages
Interview Questions
Explain the circuit breaker pattern. When would you use it and what are the three states?
- A circuit breaker wraps calls to an external dependency and monitors failures. The core idea is borrowed from electrical engineering — when current overloads, the breaker trips to prevent fires. In software, it prevents your system from wasting resources hammering a service that is already down, which can turn a partial outage into a full cascading failure.
- The three states are Closed (normal operation — requests flow through and failures are counted), Open (the breaker has tripped after the failure threshold is exceeded — all requests fail immediately without calling the downstream service), and Half-Open (after a timeout period, the breaker allows a limited number of test requests through to see if the service has recovered).
- The key design decisions are: what counts as a failure (timeouts? 5xx errors? specific exceptions?), what the failure threshold should be (e.g., 5 failures in 30 seconds), and what the recovery timeout is. In production, you want these configurable per-dependency because a payment service and a notification service have very different tolerance profiles. A payment service might trip the breaker after 3 failures in 5 calls; a non-critical analytics service might tolerate 10 in 20.
- The pattern becomes critical when combined with fallbacks. When the circuit is open, you serve cached data, a default response, or queue the request for later. Netflix’s Hystrix popularized this — if the recommendation engine is down, they show a generic “top 10” list instead of personalized suggestions. The user experience degrades but doesn’t break.
- Example: An e-commerce checkout service calls a payment gateway. Without a circuit breaker, if the gateway goes down, every checkout request hangs for 30 seconds waiting for a timeout, thread pools fill up, and the entire site becomes unresponsive. With a circuit breaker set to trip at 5 failures, after 5 timeouts the breaker opens, checkout immediately returns “payment temporarily unavailable, try again in a minute,” and the rest of the site (browsing, search, cart) stays healthy.
- What happens if you set the failure threshold too low versus too high? How would you tune it for a payment service versus a recommendation service?
- How would you implement circuit breaker state sharing across multiple instances of the same service behind a load balancer — should the state be local or distributed?
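The three states described above can be sketched as a small class. This is a minimal, single-threaded illustration and not Hystrix's or any library's implementation: it counts consecutive failures rather than failures in a time window, and the class name, parameters, and injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self.state = "closed"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self.recovery_timeout:
                raise CircuitOpenError("dependency unavailable, failing fast")
            self.state = "half_open"  # timeout elapsed: let a trial call through
        try:
            result = fn()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

A caller would catch `CircuitOpenError` and serve the fallback (cached data or a default response). Sharing this state across instances, or making it thread-safe, is deliberately left out.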
Compare exponential backoff with jitter versus fixed-interval retries. When does each make sense?
- Fixed-interval retries (e.g., retry every 2 seconds) are simple but dangerous at scale because they create the thundering herd problem. If a service goes down and 10,000 clients all start retrying at the same 2-second interval, they synchronize into periodic traffic spikes that can keep the recovering service permanently overloaded. That is one of the most damaging things a client fleet can do during an outage.
- Exponential backoff (1s, 2s, 4s, 8s, 16s…) spreads retries out over time so the downstream service gets breathing room to recover. But pure exponential backoff without jitter still has a problem — clients that started at the same time will still be synchronized at each backoff interval.
- Adding jitter (randomness) to the backoff breaks this synchronization. There are two common approaches: full jitter (`random(0, base * 2^attempt)`) and decorrelated jitter (`random(base, previous_delay * 3)`). AWS’s published analysis of backoff strategies found that both cut client work and server load substantially compared with unjittered backoff, with little to separate the two: full jitter did slightly less work, decorrelated jitter finished slightly sooner. The key insight is that jitter is not optional for production retry logic; it is mandatory.
- Fixed-interval retries only make sense in tightly controlled environments where you know the number of callers is small and bounded, for example a single cron job retrying a database migration, or an internal tool with one user. Anything customer-facing or multi-tenant must use backoff with jitter.
- Example: Stripe’s API docs recommend exponential backoff with jitter for their webhooks. If Stripe sends 50,000 webhooks and the receiving server is momentarily down, pure fixed retries would DDoS the server on recovery. With decorrelated jitter, retries spread over a wide time window, giving the server a smooth ramp-up.
- Should you retry a payment charge that timed out? What is the risk and how would you make the operation safe to retry?
- How would you implement a retry budget — limiting total retries across all callers to prevent retry storms from amplifying an outage?
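The two jitter formulas above translate directly into Python. This is a minimal sketch; the `cap` parameter (a maximum delay) is an assumption added here so delays stay bounded, and the function names are illustrative.

```python
import random


def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def decorrelated_jitter(previous_delay: float, base: float = 1.0,
                        cap: float = 60.0) -> float:
    """Delay drawn uniformly from [base, previous_delay * 3], capped at `cap`."""
    return min(cap, random.uniform(base, previous_delay * 3))
```

For decorrelated jitter, the caller keeps the last delay and starts with `previous_delay = base`; full jitter only needs the attempt number.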
What is the bulkhead pattern and how does it prevent cascading failures?
- The bulkhead pattern borrows its name from ship design — ships have watertight compartments (bulkheads) so that if one section floods, the ship doesn’t sink. In software, the idea is identical: you isolate resources for different dependencies so that one failing dependency can’t consume all available resources and take down everything else.
- The most common implementation is giving each external dependency its own connection pool or thread pool with a fixed size and a bounded queue. For instance, your payment service gets a pool of 10 concurrent connections, your inventory service gets 30, and your notification service gets 50. If the payment gateway becomes slow and all 10 connections are saturated, only payment-related requests queue up and eventually fail — the inventory and notification pools continue operating normally.
- Without bulkheads, all dependencies share the same thread pool. When one dependency becomes slow, its requests hold threads open, gradually consuming the entire pool until no threads are available for any requests — even completely healthy code paths. This is how a slow notification email service can take down your checkout flow.
- The key tuning parameters are `max_concurrent` (how many simultaneous requests to allow), `max_queue` (how many waiting requests to buffer), and `queue_timeout` (how long to wait before rejecting). Setting these requires understanding each dependency’s latency profile and criticality. A slow payment call with a 5-second timeout and 10 concurrent slots means you can process 2 payments per second sustained in the worst case, when every call runs to the timeout. Is that enough?
- Example: At a large e-commerce platform, a third-party shipping rate calculator became unresponsive during Black Friday. Without bulkheads, the slow shipping API calls consumed all 200 threads in the shared pool within minutes. With bulkheads, the shipping pool (20 threads) filled up, shipping rate requests got “service unavailable” errors, but checkout, search, and browsing (using the other 180 threads) continued without interruption. Revenue impact was reduced from “site down” to “shipping estimates temporarily unavailable.”
- How would you decide the `max_concurrent` and `max_queue` values for each bulkhead in production? What metrics would you monitor to know if your values are correct?
- How does the bulkhead pattern interact with circuit breakers? Should they be layered together, and if so, in what order?
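A bulkhead can be sketched with a bounded semaphore per dependency. This is a minimal illustration, not a production library: it models `max_concurrent` and `queue_timeout` but not a bounded `max_queue`, and the `Bulkhead` class and `BulkheadFullError` are hypothetical names.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class BulkheadFullError(Exception):
    """Raised when no slot frees up in time; the caller should fail fast."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int,
                 queue_timeout: float = 0.0) -> None:
        self.name = name
        self.queue_timeout = queue_timeout
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # With no queue_timeout, reject immediately instead of waiting.
        if self.queue_timeout > 0:
            acquired = self._slots.acquire(timeout=self.queue_timeout)
        else:
            acquired = self._slots.acquire(blocking=False)
        if not acquired:
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()
```

Each dependency gets its own instance, for example `payment = Bulkhead("payment", max_concurrent=10)`, and calls are wrapped in `with payment.slot():` so a saturated payment pool cannot starve calls guarded by other bulkheads.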
Explain the difference between liveness and readiness health checks. What goes wrong if you mix them up?
- Liveness answers “is this process alive and not deadlocked?” It should be trivially cheap, essentially `return 200 OK`. If a liveness check fails, the orchestrator (Kubernetes, ECS) kills and restarts the container. The only thing you should check here is whether the process can respond at all. If you put a database check in your liveness probe and the database goes down, Kubernetes will restart all your healthy application pods, turning a database outage into a complete application outage.
- Readiness answers “can this instance handle traffic right now?” If a readiness check fails, the load balancer removes the instance from the rotation but does not kill it. This is where you check dependency connections (database, cache, message queue), whether warmup/cache priming is complete, and whether the instance has finished initialization. An instance that fails readiness stays alive and keeps checking; once its dependencies recover, it passes readiness again and gets added back to the pool.
- The critical mistake is putting dependency checks in liveness probes. In Kubernetes specifically, a failed liveness probe triggers a container restart. If your database has a brief hiccup and your liveness probe checks the DB, every pod restarts simultaneously, causing a full outage on top of the DB issue. This is called a death spiral — restarting pods increases load on the recovering database, which causes more liveness failures, which causes more restarts.
- There is also a startup probe concept (Kubernetes added this in 1.16) for slow-starting applications. It disables liveness checking until the app finishes startup, preventing premature kills during initialization (e.g., a Java application loading a large ML model).
- Example: A team put a Redis connectivity check in their liveness probe. During a routine Redis failover (primary to replica, takes 3-5 seconds), all 40 pods simultaneously failed liveness, Kubernetes restarted them all, the new pods all tried to connect to Redis at once, overwhelmed the new Redis primary, and the entire service was down for 12 minutes. The fix was moving Redis checks to readiness only and making the liveness probe a simple `return 200`.
- How would you design a deep health check endpoint for monitoring dashboards that doesn’t accidentally become a DoS vector against your own dependencies?
- What startup probe configuration would you use for a service that takes 60 seconds to warm up its in-memory cache before it can serve accurate responses?
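The split between the two probes can be sketched framework-free. This is a minimal illustration: the handlers return `(status_code, body)` tuples that a web framework would turn into responses, and the check names are hypothetical.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    """Liveness: only proves the process can respond. No dependency checks."""
    return 200, "OK"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Readiness: 200 only if every dependency check passes, else 503."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```

When Redis is unreachable, `readiness` returns 503 and the instance leaves rotation, while `liveness` still returns 200, so the orchestrator does not restart a healthy process.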
How does deadline propagation work in a microservices chain and why is it essential?
- Deadline propagation means passing a request’s absolute expiration time (not a relative timeout) through every service in the call chain. If the client sets a 5-second deadline and Service A takes 2 seconds, Service A passes the remaining 3 seconds (or the original absolute deadline timestamp) to Service B. Without this, every service in the chain uses its own independent timeout, and the total end-to-end time can exceed anything reasonable.
- The problem deadline propagation solves is wasted work. Consider a chain: Client -> API Gateway (10s timeout) -> Service A (30s timeout) -> Service B (30s timeout) -> Database. The client gives up after 10 seconds. Without deadline propagation, Service A is still waiting on Service B, which is still waiting on the database — all doing work that nobody will use because the client is already gone. Multiply this across thousands of concurrent requests and you have a significant resource waste that can worsen an overload.
- The standard implementation uses a context variable (Go’s `context.Context` does this natively, gRPC has built-in deadline propagation). At each service boundary, you check remaining time before making a downstream call. If remaining time is less than or equal to zero, you short-circuit immediately. When making the call, you set the downstream timeout to `min(remaining_time * 0.9, service_default_timeout)`; the 0.9 factor leaves a buffer for network latency and response processing.
- A subtlety: each service should fix one deadline when the request arrives and derive every downstream timeout from it, rather than forwarding the duration it received unchanged. If Service A receives “timeout: 3 seconds” and spends 500ms doing local work before calling Service B, it needs to pass “timeout: 2.5 seconds” to B. With an absolute deadline like “expires at 14:30:05.000Z”, this math is trivial, but comparing wall-clock timestamps across hosts is only as accurate as their clock synchronization. gRPC, for example, sends the remaining time on the wire (the `grpc-timeout` header) and each hop converts it back into a local deadline.
- Example: Google’s internal systems (and gRPC by default) propagate deadlines through their entire call stack. If a user search request has a 200ms deadline, every service in the chain (query parsing, index lookup, ranking, ad serving) receives the same deadline. If the index lookup takes 180ms, the ranking service knows it only has 20ms left and can return a faster but less optimal ranking rather than doing its full 150ms computation that would be wasted anyway.
- How do you handle clock skew between services when propagating absolute deadlines? What if Service A’s clock is 2 seconds ahead of Service B’s?
- Should you always respect the propagated deadline, or are there cases where a downstream service should ignore it and complete its work anyway (e.g., a write that must not be half-completed)?
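The remaining-time check and the `min(remaining_time * 0.9, service_default_timeout)` rule above can be sketched as a small helper. This is an illustrative version: the `Deadline` class is hypothetical, and it tracks the deadline on the local monotonic clock and hands the next hop a remaining duration, which is one possible design rather than the only one.

```python
import time
from typing import Callable


class DeadlineExceeded(Exception):
    """Raised when no time is left, so the downstream call should be skipped."""


class Deadline:
    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Fixed once when the request arrives; local work eats into it.
        self._expires_at = clock() + timeout_seconds

    def remaining(self) -> float:
        return max(0.0, self._expires_at - self._clock())

    def downstream_timeout(self, service_default_timeout: float,
                           safety_factor: float = 0.9) -> float:
        """Timeout to use for the next hop, or raise if the deadline has passed."""
        remaining = self.remaining()
        if remaining <= 0:
            raise DeadlineExceeded("no time left; skip the downstream call")
        return min(remaining * safety_factor, service_default_timeout)
```

A handler would build `Deadline(timeout_from_request)` on entry, then pass `deadline.downstream_timeout(default)` as the timeout of each outgoing call.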
Your service is hitting 99.9% availability but needs to reach 99.99%. What concrete changes would you make?
- First, the math: 99.9% allows 8.76 hours of downtime per year (43.8 minutes/month). 99.99% allows only 52.6 minutes per year (4.38 minutes/month). That is a 10x reduction. Every single deployment, config change, and dependency failure now matters. You cannot achieve this with good engineering alone — it requires operational discipline and architectural changes.
- Multi-region active-active deployment is almost mandatory. A single region cannot realistically deliver 99.99% because cloud providers themselves typically only guarantee 99.99% per region for compute, and any single dependency below that threshold (database, load balancer, DNS) breaks your SLA. With active-active in two regions, you survive an entire region outage. The math: if each region is 99.9% available independently, two regions give you `1 - (0.001 * 0.001) = 99.9999%` theoretical availability (assuming independent failures).
- Zero-downtime deployments become non-negotiable. Blue-green or canary deployments where you shift traffic gradually. A bad deploy that takes 5 minutes to detect and roll back consumes your entire monthly error budget at 99.99%. You need automated canary analysis that compares error rates between old and new versions and auto-rolls-back within 60 seconds.
- Dependency isolation and fallbacks for every critical path. Every external call needs a circuit breaker, timeout, and a fallback that lets the core user journey succeed even if that dependency is down. If your recommendation engine is down, show trending items. If your user profile service is slow, serve from cache.
- Runbook automation and on-call SLOs. At 99.99%, human response time is too slow. You need automated detection (anomaly detection, not just threshold alerts) and automated mitigation (auto-scaling, auto-failover, auto-rollback), with humans reserved for novel incidents. Mean time to detect (MTTD) must be under 1 minute and mean time to recover (MTTR) under 5 minutes.
- Example: Moving from 99.9% to 99.99% at a fintech company required: adding a second AWS region with active-active routing via Route 53 health checks, switching from rolling deployments to canary with automated rollback, adding circuit breakers to all 12 downstream services, implementing database read replicas with automatic failover, and establishing an on-call rotation with 5-minute response SLO. The infrastructure cost roughly doubled, and operational complexity tripled.
- How would you handle database writes in a multi-region active-active setup? What consistency model would you choose and what are the trade-offs?
- If your monthly error budget for 99.99% is 4.38 minutes and a bad deploy already consumed 3 minutes, what policy changes would you enforce for the rest of the month?
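The downtime figures above follow from one multiplication. This is a minimal sketch assuming a 365-day year and an average month of one twelfth of a year; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24            # 8760
HOURS_PER_MONTH = HOURS_PER_YEAR / 12  # 730


def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Allowed downtime, in minutes, for a target availability over a period."""
    return (1.0 - availability) * period_hours * 60.0


if __name__ == "__main__":
    for target in (0.999, 0.9999):
        yearly = downtime_budget_minutes(target, HOURS_PER_YEAR)
        monthly = downtime_budget_minutes(target, HOURS_PER_MONTH)
        print(f"{target:.2%}: {yearly:.1f} min/year, {monthly:.2f} min/month")
```

For 99.9% this gives about 525.6 minutes (8.76 hours) per year and 43.8 minutes per month; for 99.99% it gives about 52.6 minutes per year and 4.38 minutes per month.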
What is graceful degradation and how do you decide what to degrade?
- Graceful degradation means your system continues to provide core functionality even when some components fail, by shedding non-essential features. The key word is “graceful” — users should get a slightly worse experience, not an error page. It is the opposite of the “all or nothing” approach where any failure returns a 500 error.
- The decision of what to degrade requires a feature criticality matrix defined before an incident happens, not during one. You categorize every feature into tiers: Tier 1 (critical) — must always work (checkout, login, core search), Tier 2 (important) — degrade with notice (recommendations, reviews, real-time inventory counts), Tier 3 (nice-to-have) — can be completely disabled (analytics tracking, A/B test variants, social features). During an overload event, you shed tiers in reverse order.
- Implementation typically involves feature flags combined with dependency health monitoring. When the recommendation service circuit breaker opens, the feature flag for “personalized recommendations” automatically switches to “show trending items” (static cache). When the inventory service is slow, you show “in stock” based on a cached snapshot from 5 minutes ago instead of real-time counts.
- Load shedding is the extreme form: when the system is overwhelmed, you actively reject low-priority requests to preserve capacity for high-priority ones. For example, during a flash sale, you might reject browse/search requests from non-authenticated users to preserve capacity for users who are actively in checkout. This is controversial but effective.
- Example: Twitter (now X) historically degrades by disabling features under load: first, follower count updates stop being real-time and switch to periodic batch updates. Then, the “who to follow” recommendations disappear. Then, the trending topics become stale. But the core timeline and tweet posting continue working. Each degradation tier has a predefined trigger (e.g., p99 latency exceeding 500ms, error rate above 1%) and an automatic activation mechanism.
- How would you implement automatic degradation that triggers without human intervention? What signals would you use and how do you prevent false positives from triggering unnecessary degradations?
- How do you test graceful degradation — can you verify that each degradation tier actually works before you need it in production?
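Tier-based shedding driven by health signals can be sketched as a pure function. This is a minimal illustration: the first trigger reuses the example thresholds above (p99 over 500ms, error rate over 1%), while the second, stricter trigger (p99 over 1000ms, error rate over 5%) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Signals:
    p99_latency_ms: float
    error_rate: float  # fraction of failed requests, e.g. 0.01 for 1%


def enabled_tiers(signals: Signals) -> Set[int]:
    """Return the feature tiers that stay on; Tier 1 is never shed."""
    # Assumed severe trigger: keep only critical features.
    if signals.p99_latency_ms > 1000 or signals.error_rate > 0.05:
        return {1}
    # First trigger: shed nice-to-have features (Tier 3) first.
    if signals.p99_latency_ms > 500 or signals.error_rate > 0.01:
        return {1, 2}
    return {1, 2, 3}
```

A feature flag layer would consult `enabled_tiers` before rendering each feature. Real systems usually add hysteresis so tiers do not flap when a signal hovers near a threshold.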
How do you prevent retry storms from amplifying a partial outage into a full outage?
- A retry storm happens when a service becomes slow or partially unavailable, causing all its callers to retry simultaneously, which multiplies the load on the already struggling service and pushes it from “slow” to “completely down.” If every request fails and every caller retries 3 times, a service that was handling 10,000 requests/second can receive up to 40,000 requests/second (the original attempts plus three retries each), exactly when it can least handle the load.
- The first defense is exponential backoff with jitter at each individual client, which we discussed. But that alone is not sufficient because each client acts independently. The more powerful mechanism is a retry budget at the caller side: “this service is allowed to retry at most 10% of its requests over any 30-second window.” If the service is failing 50% of requests and every failure triggers a retry, you cap the retry traffic at 10% of total volume rather than letting it grow to 50% additional load.
- Server-side cooperation is equally important. The struggling service should return `429 Too Many Requests` with a `Retry-After` header when it is overloaded, giving clients an explicit signal to back off. It can also return a `503 Service Unavailable` with `Retry-After: 30` to tell clients not to retry for 30 seconds. Clients that respect these headers dramatically reduce retry pressure.
- Circuit breakers at the caller side are the final safety net. After N failures, the circuit opens and all requests fail immediately (no retries at all) for a timeout period. This gives the downstream service complete relief. The combination of retry budgets + circuit breakers + exponential backoff with jitter forms a layered defense.
- Example: An internal platform team at a large company discovered that during a database failover (which took 15 seconds), 200 microservices all started retrying their database calls simultaneously. Each service retried 3 times with 1-second intervals. The database received a synchronized flood of retried writes far above its normal volume the moment it came back online, immediately fell over again, triggering another round of retries. The fix was implementing a global retry budget (max 10% retry ratio per service), adding jitter, and having the database return `503` with `Retry-After: 10` during failover. Recovery time dropped from 12 minutes to 20 seconds.
- How would you implement a distributed retry budget across multiple instances of the same service? Do they need to coordinate, or can each instance track its own budget independently?
- Your service returns a mix of 200s and 503s during degraded operation. How do you differentiate between “this request is safe to retry” and “the service is overloaded, stop retrying entirely”?
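The caller-side retry budget described above (“at most 10% of requests over any 30-second window”) can be sketched with two sliding windows. This is a per-instance, single-threaded illustration; the `RetryBudget` class and its method names are hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class RetryBudget:
    def __init__(self, ratio: float = 0.1, window_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ratio = ratio
        self.window = window_seconds
        self._clock = clock
        self._requests: Deque[float] = deque()  # timestamps of first attempts
        self._retries: Deque[float] = deque()   # timestamps of granted retries

    def _trim(self, events: Deque[float]) -> None:
        # Drop events that have fallen out of the sliding window.
        cutoff = self._clock() - self.window
        while events and events[0] <= cutoff:
            events.popleft()

    def record_request(self) -> None:
        self._requests.append(self._clock())

    def try_acquire_retry(self) -> bool:
        """Grant a retry only if it keeps retries within ratio * requests."""
        self._trim(self._requests)
        self._trim(self._retries)
        if len(self._retries) + 1 > self.ratio * len(self._requests):
            return False
        self._retries.append(self._clock())
        return True
```

The client calls `record_request()` on every first attempt and `try_acquire_retry()` before each retry; when it returns `False`, the failure is surfaced instead of retried.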
Interview Deep-Dive Questions
Q1: Your payment service has a 99.95% availability SLO but depends on a third-party payment gateway that only guarantees 99.9%. How do you design the payment service to meet its SLO when the dependency is less reliable?
- The math problem: if you depend on a 99.9% gateway and call it synchronously on every request, your service’s availability ceiling is 99.9%, below the 99.95% target. The gap between 99.9% and 99.95% is 4.38 hours of downtime per year that the gateway is allowed but your service is not.
- Strategy 1, multi-provider failover: integrate with two payment gateways (e.g., Stripe and Adyen). Route traffic primarily through Stripe. When Stripe’s circuit breaker opens (5 failures in 30 seconds), automatically route to Adyen. If the second gateway is also 99.9% available and the two fail independently, this gives you availability of `1 - (0.001 * 0.001) = 99.9999%`. The cost: maintaining two integrations, handling different response formats, and reconciling transactions across providers. This is the highest-impact solution, and many large production payment systems use it.
- Strategy 2, async processing with queuing: for payments that are not time-sensitive (subscriptions, scheduled payments), queue the payment request and process it asynchronously. If the gateway is down, the request stays in the queue and is retried with backoff until the gateway recovers. The user sees “payment processing” instead of an error. This converts a synchronous availability dependency into a latency dependency: the payment succeeds eventually, just slower.
- Strategy 3, intelligent caching and pre-authorization: for repeat customers, pre-authorize payment methods during idle periods. Store the authorization token. When the customer checks out, you already have a valid authorization and only need to capture, which is a simpler call that can be retried more aggressively. If the capture fails, you have a window (usually 7 days) to retry before the authorization expires.
- Strategy 4, graceful degradation: if the gateway is down and failover is not available, accept the order and process the payment later. This requires credit risk assessment (do you trust this customer enough to ship before payment clears?). For returning customers with payment history, this is often acceptable. For new customers, show “We are experiencing payment issues, please try again in a few minutes.”
- The combination I would implement: multi-provider failover as the primary defense, async queuing for non-interactive payments, and graceful degradation as the last resort. This gives a theoretical availability well above 99.95%.
- Example: Amazon reportedly uses multiple payment processors and will route around failures automatically. If Visa’s network is slow, they can fall back to processing through a different acquiring bank. Their checkout success rate is a key business metric that is monitored second-by-second.
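Multi-provider failover can be sketched as trying providers in priority order. This is a minimal illustration: `ProviderUnavailable` is a hypothetical error type that a gateway wrapper would raise only when the charge was definitely not processed (for example, when its circuit breaker is open), which is what makes moving to the next provider safe.

```python
from typing import Callable, List, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error: the provider definitely did not process the charge."""


def charge_with_failover(
    providers: List[Tuple[str, Callable[[int], str]]], amount_cents: int
) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, charge_id)."""
    failures: List[str] = []
    for name, charge in providers:
        try:
            return name, charge(amount_cents)
        except ProviderUnavailable as exc:
            failures.append(f"{name}: {exc}")  # safe to try the next provider
    raise ProviderUnavailable("all providers failed: " + "; ".join(failures))
```

Ambiguous outcomes such as a timeout after the request was sent are deliberately not caught here: failing over on those risks a double charge unless both providers are called with an idempotency key and the first attempt is reconciled.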
Q2: You are running a chaos engineering experiment: you plan to randomly terminate 10% of production instances for one of your critical services during business hours. Your manager is nervous. How do you justify this, and what safeguards do you put in place?
- The justification is simple: untested resilience is not resilience. If we have never verified that our service survives instance failures in production, we are relying on hope. The question is not whether instances will fail — cloud instances fail regularly (AWS reports that individual EC2 instances have a roughly 2-4% annual failure rate). The question is whether our system handles it gracefully or falls apart. Finding out during a real incident is far more costly than finding out in a controlled experiment.
- Safeguards before the experiment: (1) Define a steady state hypothesis: “Terminating 10% of instances will result in no user-visible errors, latency increase of less than 50ms at p99, and auto-scaling will replace instances within 90 seconds.” (2) Set up automated abort conditions: if error rate exceeds 1% or p99 latency exceeds 500ms, automatically stop the experiment and restore the killed instances. Use an experiment controller (like Gremlin, LitmusChaos, or Chaos Monkey) that monitors these conditions in real-time. (3) Start with a smaller blast radius: begin with 1 instance, not 10%. Validate the hypothesis. Then increase to 5%, then 10%. (4) Run during low-traffic hours initially, then graduate to business hours once confidence is established. (5) Notify the on-call team and customer support that a chaos experiment is running so they are not surprised by alerts.
- Safeguards during the experiment: (1) The experiment controller watches dashboards in real-time and has a kill switch that immediately stops the experiment. (2) Run the experiment for a bounded duration (10 minutes, not all day). (3) A human operator is watching the dashboards during the entire experiment. (4) The experiment logs exactly which instances were terminated and when, for post-mortem correlation.
- What you learn: either the system handles it (confidence increases, you document this and run it regularly) or it does not (you found a resilience gap before a real incident found it for you). Common findings: auto-scaling takes longer than expected, health check intervals are too long so traffic is still routed to dying instances, connection pools do not recover gracefully, and caches are cold on new instances causing a latency spike.
- Example: Netflix runs Chaos Monkey continuously in production — it terminates random instances during business hours every single day. But they started small (one instance at a time in non-critical services) and built up over years. They also have Chaos Kong, which simulates an entire region failure. The key insight from their practice: the experiments themselves rarely find bugs. The discipline of preparing for chaos (building fallbacks, testing auto-scaling, validating health checks) is what actually improves reliability.
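The automated abort conditions above (error rate over 1% or p99 latency over 500ms) can be sketched as a loop over metric samples. This is an illustrative stand-in for what a chaos tool's controller does, not the API of Gremlin, LitmusChaos, or Chaos Monkey; all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class AbortConditions:
    max_error_rate: float = 0.01       # abort above 1% errors
    max_p99_latency_ms: float = 500.0  # abort above 500 ms p99


def run_experiment(samples: Iterable[Tuple[float, float]],
                   conditions: AbortConditions = AbortConditions()) -> str:
    """Consume (error_rate, p99_latency_ms) samples taken during the experiment.

    Returns "aborted" at the first breach, otherwise "completed".
    """
    for error_rate, p99_latency_ms in samples:
        if (error_rate > conditions.max_error_rate
                or p99_latency_ms > conditions.max_p99_latency_ms):
            return "aborted"  # a real controller would also restore instances here
    return "completed"
```

Bounding the experiment's duration then amounts to bounding the number of samples fed in, for example one sample every 10 seconds for 10 minutes.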
Q3: Your e-commerce platform needs to process an order that involves five services: Order, Payment, Inventory, Shipping, and Notification. Describe how you would design the system so that a failure in any one service does not leave the order in an inconsistent state.
- This is a distributed workflow that cannot use traditional database transactions because the data lives across five independent services. The correct pattern is an Orchestration Saga with compensating transactions.
- The Saga sequence: (1) Create Order (status: PENDING). (2) Reserve Inventory (decrement available stock). (3) Charge Payment (authorize and capture). (4) Schedule Shipping (create shipment label). (5) Send Notification (confirmation email/push). Each step has a compensating action that undoes it if a later step fails.
- Compensations: (1) Cancel Order. (2) Release Inventory. (3) Refund Payment. (4) Cancel Shipment. (5) No compensation needed for notifications (send an “order canceled” notification instead).
- Failure scenario walkthrough: Payment succeeds at step 3, but Shipping fails at step 4 (carrier API is down). The Saga orchestrator triggers compensations in reverse: Cancel Shipment (no-op since it failed), Refund Payment, Release Inventory, Cancel Order. The user sees “We could not complete your order. Your payment has been refunded.”
- The hard parts: (1) Idempotent compensations. The refund must be safe to call twice. If the orchestrator retries the compensation due to a network timeout, you cannot double-refund. Use an idempotency key derived from the order ID. (2) Partial compensation. What if the Refund call fails? You need a retry loop with exponential backoff for compensations, and if retries are exhausted, escalate to a dead-letter queue for manual intervention. (3) Observability. The orchestrator must log every state transition so that customer support can see exactly where the order failed and what compensations ran. (4) Concurrent Saga instances. Two orders for the last item in stock: both reserve inventory at step 2, but only one can succeed. Use optimistic locking on inventory (check-and-decrement atomically) so the second Saga fails at step 2 and compensates immediately.
- Implementation: use a workflow engine (Temporal, AWS Step Functions, or Cadence). Define the Saga as an explicit workflow with retries, timeouts, and compensation handlers. The workflow engine provides durable execution (survives orchestrator restarts), automatic retries, and visibility into workflow state.
- What I would explicitly avoid: (1) Two-phase commit across all five services: it holds locks and kills availability. (2) Choreography Saga (event-driven) for this workflow: with 5 services, the implicit workflow is too hard to debug and monitor. (3) Coupling the order’s outcome to notification delivery: even if the email fails, the order should still succeed. Treat notification as best-effort with a separate retry mechanism.
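Two of the hard parts above, idempotent compensations and escalation to a dead-letter queue, can be sketched in memory. This is a minimal illustration: `RefundLedger` stands in for a payment provider that honors idempotency keys, the dead-letter queue is a plain list, and backoff between attempts is omitted.

```python
from typing import Callable, Dict, List


class RefundLedger:
    """In-memory stand-in for a payment provider that honors idempotency keys."""

    def __init__(self) -> None:
        self._by_key: Dict[str, int] = {}

    def refund(self, idempotency_key: str, amount_cents: int) -> int:
        # A repeated key returns the original result instead of refunding again.
        return self._by_key.setdefault(idempotency_key, amount_cents)

    def total_refunded(self) -> int:
        return sum(self._by_key.values())


def compensate_with_retry(action: Callable[[], None], max_attempts: int,
                          dead_letters: List[str], label: str) -> bool:
    """Run a compensation up to max_attempts times; dead-letter it on exhaustion."""
    for _ in range(max_attempts):
        try:
            action()
            return True
        except Exception:
            continue  # a real implementation would back off with jitter here
    dead_letters.append(label)  # escalate for manual intervention
    return False
```

Deriving the key from the order ID (for example `f"{order_id}:refund"`) means a retried compensation after a network timeout cannot refund twice.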