Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Part XV — Event Sourcing, Messaging, and Async Systems
Asynchronous messaging changes the fundamental contract of your system: instead of “I did the thing,” your service says “I requested the thing.” This unlocks decoupling, resilience, and scalability — but introduces a new class of problems: duplicate messages, ordering, eventual consistency, and the terrifying question “did my message actually get processed?” Every pattern in this section exists because async is powerful but unsafe by default.Real-World Stories: Why This Matters
How LinkedIn Built Kafka — and Accidentally Changed the Industry
How LinkedIn Built Kafka — and Accidentally Changed the Industry
The Therac-25 Disasters — When Concurrency Bugs Kill
The Therac-25 Disasters — When Concurrency Bugs Kill
go func() or new Thread(), you are creating the possibility of interleavings your tests will never exercise.Uber's Real-Time Event Processing — Millions of Rides per Second
Uber's Real-Time Event Processing — Millions of Rides per Second
Discord's Message Ordering Problem — Real-Time Chat at Scale
Discord's Message Ordering Problem — Real-Time Chat at Scale
Chapter 22: Messaging
22.1 Queues vs Topics
These serve fundamentally different purposes: Queue (point-to-point): one message -> one consumer. Used for work distribution. Example: 1000 image resize jobs in a queue, 10 workers consuming them. Each job is processed by exactly one worker. If you add more workers, work is distributed across them. Use for: background jobs, task processing, load leveling.OrderPlaced event is published to a topic. The Inventory Service, Email Service, and Analytics Service all receive it. Each subscriber gets every message. Use for: event-driven architecture, notifying multiple services, decoupling producers from consumers.
22.2 Delivery Semantics
At-most-once: May lose messages. Never duplicate. Fast. Fire-and-forget. Use when: losing a message is acceptable (metrics sampling, non-critical logs). The producer sends and moves on without waiting for acknowledgment. At-least-once: Never lose. May duplicate. The standard choice for most production systems. The producer retries on failure, guaranteeing delivery but potentially sending the same message twice. Requires idempotent consumers — your processing logic must handle duplicates gracefully (more on this below). Exactly-once: This is the one that trips up most candidates in interviews. Here is the critical distinction:- Exactly-once delivery is impossible. This is a provable impossibility in distributed systems, not an engineering limitation we haven’t solved yet.
- Exactly-once processing is achievable. This is what every production system actually implements.
Interview Question: Your team uses Kafka's exactly-once semantics. A consumer reads from topic A, processes, and writes to topic B. It also sends a webhook to a partner API. During a consumer restart, the partner reports receiving duplicate webhooks. Explain what happened and how you fix it.
Interview Question: Your team uses Kafka's exactly-once semantics. A consumer reads from topic A, processes, and writes to topic B. It also sends a webhook to a partner API. During a consumer restart, the partner reports receiving duplicate webhooks. Explain what happened and how you fix it.
-
If the partner API supports idempotency keys: Pass the Kafka message’s unique identifier (e.g.,
topic-partition-offsetor the business-level idempotency key) as the partner’s idempotency key header. The partner deduplicates on their side. This is the cleanest solution and the reason idempotency keys exist in API design. -
If the partner API does not support idempotency keys: Implement a send-once table pattern. Before sending the webhook, check a local database table (
webhooks_sent) for the message ID. If present, skip the send. The check and the record insert must be in the same transaction as the Kafka offset commit — but since the webhook is an external call, you cannot include it in the Kafka transaction. The solution is a two-phase approach:- Phase 1: Process the message, write to topic B, record the webhook payload in an outbox table, commit the Kafka transaction.
- Phase 2: A separate outbox poller reads unsent webhooks from the outbox, sends them, and marks them as sent. If the poller crashes, it retries — but the outbox entry acts as the dedup mechanism (send only once per outbox row).
- If the webhook is fire-and-forget (non-critical): Accept occasional duplicates and document this for the partner. Design the webhook payload to be idempotent (include a unique event ID so the partner can dedup on their end even without explicit idempotency key support).
- Failure mode: What if your outbox poller crashes mid-batch? The outbox entries remain unsent and the poller retries on restart, but downstream consumers see a delay spike. Monitor outbox table age (
MAX(created_at) - MAX(sent_at)) and alert if the gap grows beyond your SLA. - Rollout: Deploy the outbox-based webhook path in shadow mode first — run both the direct webhook call and the outbox path, compare results for 48 hours, then cut over. This catches serialization differences and latency regressions before they hit production.
- Rollback: Keep the direct webhook code behind a feature flag. If the outbox path introduces unacceptable latency or bugs, flip the flag to revert to direct calls within seconds.
- Measurement: Track
webhook_delivery_latency_p99(outbox introduces one poll-interval of delay),outbox_table_depth(rows awaiting send), andduplicate_webhook_rate(should be zero with idempotency keys). - Cost: The outbox table adds write amplification to every transaction (one extra INSERT). At 100K transactions/day this is negligible; at 10M/day, size the outbox table with partitioned cleanup and monitor disk usage.
- Security/Governance: Outbox rows contain webhook payloads which may include PII (customer names, emails). Apply the same data classification and encryption-at-rest policies to the outbox table as to your primary business data. Audit who can query the outbox directly.
- A senior engineer correctly identifies that EOS breaks at the Kafka boundary, explains the outbox pattern, and implements idempotency keys for the external webhook call.
- A staff/principal engineer additionally designs the monitoring and alerting for the outbox relay, writes a runbook for outbox table growth incidents, evaluates CDC (Debezium) vs polling for the relay based on throughput requirements, coordinates with the partner API team on idempotency key contracts, and ensures the outbox table is included in the data retention and GDPR deletion policies.
Interview Question: How do you handle duplicate messages?
Interview Question: How do you handle duplicate messages?
message_id and catch the constraint violation as the dedup signal. I would also think about dedup table growth (TTL-based cleanup, partitioning by date), what happens during database failover (consumer blocks rather than bypasses), and how to handle the case where our dedup window is shorter than a potential replay window.”Follow-up chain:- Failure mode: If the dedup table’s database is down, do you fail open (process and risk duplicates) or fail closed (block and accumulate lag)? For financial systems, always fail closed. For analytics, fail open with downstream reconciliation.
- Rollout: Deploy dedup logic in shadow mode first — run the check but process regardless, logging whether the check would have skipped the message. Verify for 48 hours that the dedup logic correctly identifies duplicates without false positives.
- Rollback: Feature-flag the dedup check. If the dedup logic incorrectly skips legitimate messages, disable it instantly. Downstream idempotency (conditional writes, gateway idempotency keys) acts as the safety net.
- Measurement: Track
dedup_hit_rate(steady-state should be < 0.1%; a spike indicates a producer or consumer issue),dedup_check_latency_p99(should be < 5ms), anddedup_table_row_count(alert if growth exceeds expected rate based on message volume and TTL). - Cost: At 10M messages/day with 128-byte IDs and a 7-day TTL, the dedup table holds ~70M rows (~8 GB). Factor in index maintenance, vacuum cost (PostgreSQL), and query latency under load.
- Security/Governance: Message IDs in the dedup table may correlate with business entities (order IDs, customer IDs). If the dedup table lives in a less-secured store (Redis vs. primary DB), ensure IDs do not leak entity relationships or PII. Dedup bypass via crafted message IDs is an attack vector — ensure IDs are server-generated.
- A senior engineer implements the dedup table correctly within a single transaction, handles the
UniqueViolationpath, and acks only after commit. - A staff/principal engineer additionally designs the dedup table schema for operational longevity (partitioned by date for efficient cleanup), adds Bloom filter optimization for high-throughput paths, writes the monitoring dashboards and alert thresholds, plans the rollout strategy with shadow mode, and writes the incident runbook for when dedup fails at 3 AM.
22.3 Dead Letter Queues, Poison Messages, Consumer Lag
DLQ: Messages that fail after max retries are moved to a dead letter queue for investigation rather than blocking the main queue. Without a DLQ, a single unprocessable “poison message” blocks all messages behind it in the queue. Poison messages: A message that always fails processing — malformed data, a bug in the consumer logic, or a dependency that will never recover for this specific input. Retrying a poison message forever wastes resources and blocks the queue. After 3-5 attempts, move to DLQ. Log the error with the message content and correlation ID for debugging. Consumer lag: The distance between the latest message in the queue/topic and the last message processed by the consumer. Monitor this. If lag is increasing, consumers cannot keep up. Causes: slow processing logic, downstream dependency latency, insufficient consumer instances. Fix: optimize processing, scale consumers horizontally, or investigate the slow dependency.| Strategy | How It Works | Best For |
|---|---|---|
| Immediate retry | Retry instantly (1-2 times) | Transient network glitches |
| Exponential backoff | Wait 1s, 2s, 4s, 8s, 16s… between retries | Downstream overload, rate limiting |
| Exponential backoff + jitter | Backoff with random jitter added | Preventing “thundering herd” when many consumers retry simultaneously |
| Max retries with DLQ | After N failures (typically 3-5), move to DLQ | All message-driven systems as a safety net |
| Poison pill detection | If a message fails with a non-retryable error (e.g., deserialization failure, schema mismatch), skip retries entirely and DLQ immediately | Malformed data, schema evolution bugs |
| Lag Pattern | What It Means | First Action |
|---|---|---|
| Sudden spike, then stable | Burst of production; consumer can keep up at steady state | Monitor. If lag shrinks over the next 10 minutes, no action needed |
| Slow, steady growth | Consumer throughput < production rate permanently | Scale consumers or optimize processing. This will not self-heal |
| Sawtooth pattern (grows, drops, grows) | Consumer is being periodically rebalanced or restarted | Check for rebalance storms, deployment-triggered restarts, OOMKills |
| One partition lagging, others fine | Hot partition or stuck consumer | Check partition key distribution; inspect the lagging consumer’s logs for errors |
| All partitions lag identically | Systemic issue — downstream dependency slow, all consumers equally impacted | Check downstream health (database latency, API errors, network) |
Interview Question: Your Kafka consumer group is lagging by 2 million messages. The lag is growing. What do you do?
Interview Question: Your Kafka consumer group is lagging by 2 million messages. The lag is growing. What do you do?
kafka-consumer-groups.sh --describe or your monitoring tool like Burrow, Datadog, Confluent Control Center). Is the lag uniform across all partitions, or is one partition significantly behind? Uniform lag suggests a systemic issue (all consumers are slow). Skewed lag suggests a hot partition or a stuck consumer.2. Identify the bottleneck — is it production rate or consumption rate?
Check the producer throughput. Did a new feature or upstream change suddenly increase message volume? If production rate doubled but consumer capacity stayed the same, the fix is scaling consumers. Check consumer processing time per message — if it jumped from 5ms to 500ms, something downstream is slow (a database query, an external API call, a GC pause).3. Short-term fixes (stop the bleeding):- Scale out consumers — add more consumer instances up to the number of partitions (Kafka’s hard limit: one consumer per partition within a group). If you have 12 partitions and 4 consumers, you can scale to 12.
- Increase consumer throughput — if processing is the bottleneck, check for downstream latency (slow database? overloaded API?). Batch processing if the consumer is making one DB call per message instead of batching.
- Pause non-critical consumers — if multiple consumer groups read the same topic, temporarily pause lower-priority ones to free broker resources.
- Check for poison messages — a single message that causes repeated failures and retries can stall an entire partition. Check DLQ volume and error logs.
- Repartition the topic — if you hit the consumer-per-partition limit, increase partitions (careful: this changes message distribution and can break ordering guarantees for existing keys).
- Optimize consumer processing — profile the consumer. Is it doing synchronous I/O? Can processing be parallelized within the consumer (process batch of messages concurrently, commit offsets after the batch)?
- Add autoscaling — trigger consumer scaling based on lag thresholds, not just CPU.
- Set up lag-based alerts — alert when lag exceeds N messages or when lag growth rate is positive for more than M minutes.
- Do not blindly increase
max.poll.recordswithout ensuring your consumer can process the larger batch withinmax.poll.interval.ms— this causes rebalances and makes things worse. - Do not skip messages to “catch up” unless you have a dead letter mechanism and the messages are truly replayable later.
- Do not add consumers beyond the partition count — they will sit idle.
- Failure mode: What if the lag is caused by a poison message that triggers infinite retries on one partition? The lag grows on that partition while others are fine. Check DLQ ingestion rate and error logs before scaling consumers.
- Rollout (adding consumers): Add consumers incrementally (1-2 at a time), monitor lag trend after each addition. A rebalance triggered by each new consumer temporarily pauses all partitions (with eager protocol), so adding 8 consumers at once causes 8 rebalances back-to-back.
- Rollback: If newly added consumers cause rebalance storms or introduce bugs, remove them and let the original consumers catch up. Ensure you are using CooperativeStickyAssignor so removals do not trigger full rebalances.
- Measurement: Primary metric is
lag_growth_rate(messages/second falling behind), not absolute lag. Also trackprocessing_time_per_message_p99,rebalance_count_per_hour, anddlq_ingestion_rate. - Cost: Each additional consumer instance costs compute (CPU, memory, network). At 500 consumers across 50 topics, the Kafka client overhead (heartbeats, metadata fetches) itself becomes non-trivial. Budget the broker-side cost too — more consumers means more fetch requests to brokers.
- Security/Governance: Before scaling consumers, verify that new instances have the same Kafka ACLs, TLS certificates, and data access permissions as existing ones. In regulated environments, adding consumer instances may require change management approval.
kafka_consumer_lag{group='payment-processor', topic='payments'} > 5000000 and rate(kafka_consumer_lag[5m]) > 0. Your dashboard shows lag is growing at 200 messages/second. You have 6 consumers and 24 partitions. Walk me through your next 15 minutes.”Interview Question: A single Kafka partition is stuck -- messages are not being consumed. Other partitions in the same topic are fine. Walk me through your diagnosis.
Interview Question: A single Kafka partition is stuck -- messages are not being consumed. Other partitions in the same topic are fine. Walk me through your diagnosis.
kafka-consumer-groups.sh --describe --group <group> and check which consumer instance owns the stuck partition. If the CONSUMER-ID column is empty for that partition, the partition is unassigned — no consumer is reading it. This happens during rebalances, after a consumer crash, or when the number of consumers exceeds the number of partitions (some consumers get no partitions, but more critically, a partition between rebalance rounds can be temporarily unowned).2. Check the consumer instance’s health.
If the partition is assigned, SSH or kubectl exec into the consumer that owns it. Check:- Is the consumer process alive? (
ps,jstackfor Java consumers to check for deadlocks or GC pauses) - Is it stuck in a
poll()loop that never returns? A consumer processing a single message for longer thanmax.poll.interval.mswill be kicked from the group, triggering a rebalance, but the long-processing message may cause the partition to appear stuck between kick and reassignment. - Are there repeated exceptions in the consumer logs for that specific partition? This points to a poison message.
kafka-console-consumer.sh --partition <N> --offset <stuck_offset> --max-messages 1 to read the actual message. Common findings:- Malformed JSON or a schema the consumer cannot deserialize (poison message — DLQ it and advance the offset)
- An extremely large message that exceeds the consumer’s
max.partition.fetch.bytes(the consumer silently fails to fetch it) - A message referencing a deleted entity that causes a
NullPointerExceptionon every processing attempt
kafka.consumer:type=consumer-coordinator-metrics,client-id=* JMX metric rebalance-rate-per-hour), the partition may be stuck in a cycle: assigned to a consumer, consumer gets kicked before processing, reassigned, kicked again. Common cause: max.poll.interval.ms is too low for the processing time of messages on that partition. Fix: increase the interval or reduce max.poll.records.5. Fix and prevent.- If poison message: advance the offset past the message (manually or via DLQ logic) and add poison pill detection to the consumer
- If rebalance storm: switch to
CooperativeStickyAssignor+ static group membership to reduce rebalance frequency - If large message: increase
max.partition.fetch.byteson the consumer andmessage.max.byteson the topic, or fix the producer to avoid publishing oversized messages - If consumer bug: fix the bug, redeploy, and monitor the partition’s lag to confirm it is catching up
- A poison message causes a consumer to crash or exceed
max.poll.interval.ms - The consumer is evicted from the group, triggering a rebalance
- During the rebalance (especially with the eager protocol), all partitions stop processing for 10-30 seconds
- Consumer lag spikes across every partition, not just the one with the poison message
- After rebalance, the poison message is reassigned to another consumer, which also fails
- Another rebalance triggers. Lag grows further. Alerts fire. The on-call engineer is now firefighting a system-wide lag incident caused by a single malformed message.
- Poison pill detection with fast DLQ: After 3 failures, send the message to the DLQ and commit the offset. Never let a single message trigger unbounded retries.
- CooperativeSticky assignor: Only the partition being vacated stops processing during rebalance. Other partitions continue uninterrupted.
- Static group membership: Prevents rebalances when consumers restart during rolling deploys. Each consumer reclaims its prior partitions without triggering group-wide reassignment.
- Separate retry topics: Instead of retrying in-place (which blocks the partition), publish failed messages to a retry topic with a delay. The original partition advances immediately.
22.4 Event-Driven Patterns
Domain events represent something that happened within a bounded context:OrderPlaced, InventoryReserved, PaymentReceived. They are named in past tense (facts, not commands). They drive business logic within a single service or bounded context.
Integration events are communicated between services and may have a different (simpler, more stable) schema than the internal domain event. The Order Service’s internal OrderPlaced event might contain order line items, internal pricing metadata, and customer preferences. The integration event published to other services contains only: order_id, customer_id, total, items, timestamp — the minimum other services need.
Event versioning: Events are contracts. Changing an event schema is a breaking change for all consumers. Use a schema registry (Confluent Schema Registry, AWS Glue) to enforce compatibility rules (backward, forward, full). Compatibility modes: backward-compatible (new consumers can read old events), forward-compatible (old consumers can read new events), full (both).
22.5 Kafka vs RabbitMQ vs SQS — When to Use Each
| Criteria | Kafka | RabbitMQ | SQS |
|---|---|---|---|
| Model | Distributed log (append-only) | Message broker (queue + exchange routing) | Managed queue (AWS) |
| Ordering | Per partition | Per queue (single consumer) | FIFO queues only (per message group) |
| Retention | Configurable (days/weeks/forever) | Until consumed and acked | 1-14 days |
| Replay | Yes (consumers re-read from any offset) | No (consumed = gone) | No |
| Throughput | Very high (millions/sec per cluster) | Medium (tens of thousands/sec) | Medium (3,000/sec standard, higher with batching) |
| Consumer groups | Yes (parallel consumption + pub/sub) | Yes (competing consumers) | Yes (multiple consumers) |
| Exactly-once | Yes (idempotent producers + transactions) | No (at-least-once) | No (at-least-once, FIFO has dedup) |
| Operational complexity | High (ZooKeeper/KRaft, partitions, replication) | Medium (Erlang cluster) | None (fully managed) |
| Best for | Event streaming, event sourcing, high-throughput, replay needed | Task queues, complex routing, request-reply | Simple cloud queues, serverless triggers |
| Worst for | Simple task queues (overkill) | High-throughput streaming | Replay, event sourcing, cross-cloud |
| Scenario | Best Choice | Why |
|---|---|---|
| Real-time clickstream analytics (100K+ events/sec) | Kafka | High throughput, replay for reprocessing, consumer groups for parallel processing |
| E-commerce order processing with complex routing (priority orders, regional routing) | RabbitMQ | Exchange-based routing, priority queues, dead letter exchanges |
| Lambda-triggered image thumbnail generation | SQS | Zero ops, native Lambda integration, pay-per-message |
| Event sourcing for financial audit trail | Kafka | Immutable log, indefinite retention, replay from any point |
| Microservice task queue (send emails, generate PDFs) | RabbitMQ or SQS | Simple work distribution, no need for replay or event streaming |
| Multi-region event replication | Kafka | MirrorMaker 2, built-in replication across clusters |
| Serverless webhook processing with unpredictable spikes | SQS | Auto-scales, no capacity planning, built-in retry + DLQ |
- A consumer joins the group (new deployment, autoscaling)
- A consumer leaves the group (shutdown, crash, network partition)
- A consumer misses a heartbeat (GC pause, slow processing exceeding
max.poll.interval.ms) - Topic metadata changes (new partitions added)
| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Range (default) | Sorts partitions and consumers alphabetically, assigns contiguous ranges to each consumer | Simple, predictable | Uneven distribution when partition count is not divisible by consumer count; co-partitioning across topics can skew load | Simple setups with a single topic |
| RoundRobin | Assigns partitions one at a time in round-robin order across consumers | Even distribution across consumers | Does not respect topic affinity; all partitions are reshuffled on every rebalance | Multiple topics, uniform processing |
| Sticky | Like RoundRobin, but tries to preserve existing assignments during rebalance — only reassigns partitions from departed consumers | Minimizes partition movement, faster rebalance recovery | Still uses the “stop-the-world” eager rebalance protocol | Reducing churn in moderate-scale groups |
| CooperativeSticky | Incremental rebalance — consumers only revoke partitions that need to move, continue processing all others | No stop-the-world pause. Partitions that stay assigned are never interrupted. Only moved partitions experience a brief gap. | Rebalance happens in two rounds (revoke, then assign) so it takes slightly longer to fully settle | Production workloads where processing continuity matters |
max.poll.interval.ms trap: If your consumer takes longer than this interval to process a batch of records, Kafka assumes it is dead and triggers a rebalance. The consumer gets kicked from the group, its partitions are reassigned, and when the slow consumer finishes processing, it discovers it no longer owns those partitions. The messages it just processed may be processed again by the new owner. Fix: either reduce max.poll.records so each batch is smaller, optimize processing time, or increase max.poll.interval.ms (but this delays detection of actually-dead consumers).Static group membership (Kafka 2.3+): Assign each consumer a persistent group.instance.id. When a consumer with a static ID disconnects and reconnects within session.timeout.ms, it reclaims its previous partition assignments without triggering a rebalance. This is critical for rolling deployments — without static membership, restarting 10 consumers triggers 10 rebalances. With static membership, each restart is a brief blip.max.poll.interval.ms timeouts, timeouts cause more rebalances. This feedback loop is why a simple rolling deployment can take down an event pipeline for 45 minutes.The rolling deployment problem: You have 6 consumers, 24 partitions. A rolling deployment restarts one consumer at a time. With the default eager protocol:- Consumer 1 shuts down -> full rebalance (all 24 partitions revoked and reassigned across 5 consumers). Takes 15-30 seconds.
- Consumer 1 comes back -> full rebalance (24 partitions reassigned across 6 consumers). 15-30 seconds.
- Consumer 2 shuts down -> full rebalance again. 15-30 seconds.
- This repeats 6 times. Total rebalance time: 3-6 minutes of repeated processing pauses.
| Phase | Action | Monitoring | Rollback Trigger |
|---|---|---|---|
| Canary (1 consumer) | Deploy new code to 1 consumer | Error rate, processing time, lag on its partitions | Error rate > 1% or processing time > 2x baseline |
| Partial (50%) | Roll to half the consumers | Same metrics, plus overall consumer group lag | Lag growth rate positive for > 5 minutes |
| Full rollout | Roll remaining consumers | All metrics stable for 15 minutes | Any metric regression vs. pre-deployment baseline |
| Bake period (1 hour) | No changes, monitor | DLQ volume, downstream error rates | DLQ ingestion rate > 2x normal |
22.6 Messaging Architecture Decision Guide — Sync vs Async vs Streaming
One of the most impactful architectural decisions you’ll make is how services communicate. This is not just “which message broker do I pick?” — it’s the more fundamental question of whether you need a broker at all. The three main communication paradigms — synchronous (HTTP/gRPC), asynchronous (message queues), and streaming (event log) — are not interchangeable. Each makes a different bet about coupling, latency, and failure handling.-
YES -> The caller blocks until the response arrives. This is synchronous communication.
- Use HTTP/gRPC (request-response). The caller sends a request, waits for the response, and uses that response to continue processing.
- Examples: user authentication (need the token to proceed), reading a product catalog (need the data to render the page), payment authorization (need approval before confirming the order).
- NO -> The caller can fire and forget, or check the result later. Continue to Question 2.
-
ONE consumer (work distribution) -> Use a message queue (SQS, RabbitMQ, BullMQ).
- The message represents a unit of work. One worker picks it up, processes it, and acknowledges it. If the worker fails, the message goes back to the queue for another worker.
- Examples: send an email, resize an image, generate a PDF, process a refund.
- MANY consumers (event notification) -> Continue to Question 3.
-
YES (replay, reprocessing, independent consumer speeds) -> Use event streaming (Kafka, Redpanda, AWS Kinesis).
- The event log retains messages. Consumers maintain their own position (offset) and can re-read from any point. New consumers can start from the beginning and “catch up.” Consumer speed is independent of producer speed.
- Examples: event sourcing, real-time analytics pipelines, populating search indexes, cross-service data synchronization, audit trails.
-
NO (simple fan-out, consumers are always online) -> Use pub/sub (SNS, RabbitMQ fanout exchange, Google Pub/Sub).
- Messages are broadcast to all active subscribers. Once delivered, messages are gone. Simpler to operate than a full streaming platform.
- Examples: cache invalidation notifications, webhook delivery, real-time alerts.
| Paradigm | Coupling | Latency | Failure Handling | Data Retention | Best For |
|---|---|---|---|---|---|
| Sync (HTTP/gRPC) | Tight — caller waits for callee | Low (ms) | Caller sees failure immediately; must handle retries, circuit breakers | None (request-response) | Queries, authentication, real-time reads, operations where the caller needs the result |
| Async Queue (SQS, RabbitMQ) | Loose — fire and forget | Medium (seconds to minutes acceptable) | Broker retries, DLQ for permanent failures | Until consumed | Background jobs, work distribution, load leveling, decoupling write-heavy operations |
| Streaming (Kafka, Kinesis) | Loosest — producers and consumers are fully independent | Medium-High (configurable) | Consumer re-reads from last committed offset; at-least-once by default | Configurable (hours to forever) | Event sourcing, analytics, cross-service sync, audit trails, replay for reprocessing |
| Pub/Sub (SNS, fanout) | Loose — one-to-many broadcast | Low-Medium | Delivery attempts with retry; no replay if missed | None (fire-and-forget broadcast) | Cache invalidation, notifications, webhook fan-out |
Interview Question: Your team is building a new microservice that needs to communicate with 5 other services. How do you decide which interactions should be synchronous and which should be asynchronous?
Interview Question: Your team is building a new microservice that needs to communicate with 5 other services. How do you decide which interactions should be synchronous and which should be asynchronous?
- Does the caller need the response to continue? If yes, sync. If no, async.
- What is the acceptable latency? If sub-second, sync (or async with a websocket/polling pattern). If seconds-to-minutes is fine, async.
- What happens if the callee is down? If the caller must fail too, sync is honest about the coupling. If the caller should succeed regardless, async with a queue buffers the dependency.
- Payment Service -> Sync (gRPC). The order cannot be confirmed without payment authorization. Tight coupling is honest here.
- Email Service -> Async queue (SQS). The order is confirmed whether or not the email sends. Email can retry independently.
- Inventory Service -> Async event (Kafka
OrderPlacedtopic). Inventory subscribes and reserves stock. If inventory service is temporarily down, Kafka retains the event. Inventory catches up when it recovers. - Analytics Service -> Async event (same Kafka
OrderPlacedtopic, different consumer group). Analytics never needs to block the order flow. - Fraud Detection Service -> Sync if you want to block suspicious orders in real-time. Async if you prefer to flag and review after the fact. This is a business decision, not a technical one — discuss the trade-off with the interviewer.
- Downstream database latency spikes from 10ms to 500ms (perhaps due to a long-running query or a schema migration lock).
- Consumer requests start timing out. Consumers retry immediately.
- The database now receives the original load PLUS all the retries. Latency goes from 500ms to 5 seconds.
- More timeouts trigger more retries. The retry volume exceeds the original traffic volume.
- The database connection pool is exhausted. All consumers are blocked. Consumer lag grows exponentially.
- Exponential backoff with jitter — never retry immediately. The jitter ensures retries are spread over time instead of hitting the dependency in synchronized waves.
- Circuit breakers — after N consecutive failures, stop retrying entirely for a cooldown period. One consumer tripping its circuit breaker reduces load on the downstream, giving it room to recover.
- Retry budgets — instead of per-message retry limits, set a global retry budget: “no more than 10% of total requests in any 10-second window can be retries.” When the budget is exhausted, new retries are dropped or DLQ’d.
- Distinguish retryable from non-retryable errors at the transport level. HTTP 503 (service unavailable) is retryable. HTTP 400 (bad request) is not. Retrying a 400 is pointless and adds load.
OrderShippedarrives beforeOrderPaidbecause the payment service is slower to publish than the shipping service.- Two updates to the same user profile arrive in reverse chronological order because they were produced on different Kafka partitions.
- A retry causes a message from 30 seconds ago to arrive after a message from 5 seconds ago.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Sequence numbers | Each event for an entity carries a monotonic sequence number. Consumer rejects events where event.seq <= entity.last_applied_seq | Requires a centralized sequence generator per entity. Rejected events need a redelivery mechanism. |
| Timestamp-based ordering | Buffer events for a short window (e.g., 500ms), sort by event timestamp before processing | Adds latency. Late events outside the window are mishandled. Clock skew between producers causes incorrect ordering. |
| Idempotent + compensating | Process every event regardless of order, but design handlers to detect stale state and emit compensating events | No latency penalty. System passes through transient incorrect states but converges. Requires careful handler design. |
| Single-writer per entity | One service owns all writes for an entity, serializing all state changes | Eliminates ordering issues but creates a bottleneck and single point of failure for that entity. |
- Forward recovery (complete the remaining steps): Persist the saga/workflow state durably (Temporal, Step Functions, or an outbox). On restart, read the last checkpoint and continue forward. This is the preferred approach when each step is idempotent.
- Backward recovery (undo the completed steps): Execute compensating actions in reverse order. Refund the debit. This is the Saga pattern’s compensation flow. Only possible when every step has a defined compensating action.
- Reconciliation (detect and fix): Run a periodic reconciliation job that compares the expected state (derived from the event log) against the actual state (in the database). Any discrepancy is flagged and corrected. This is the safety net when forward and backward recovery both miss edge cases.
partial_failure_rate (how often multi-step operations fail mid-way), mean_time_to_recovery (how long until the system reaches consistent state after a partial failure), and compensation_success_rate (what percentage of compensating actions succeed on the first attempt).Security consideration for replay and recovery: When replaying events or executing compensating actions, ensure that the replayed events pass the same authorization checks as the originals. A common vulnerability: an event replay bypasses API-level auth because the consumer trusts all events from the internal Kafka topic. If an attacker can inject events into the topic (compromised producer, insufficient ACLs), replay amplifies the blast radius. Kafka ACLs, mutual TLS between brokers and clients, and consumer-side validation of event signatures are the mitigations.22.7 Event Schema Evolution — Managing Change in Event-Driven Systems
Events are contracts. Once you publish anOrderPlaced event and three downstream services consume it, changing that event’s structure is as dangerous as changing a public API. Worse, actually — with a REST API, you can version the URL and run two versions simultaneously. With events, old messages may sit in Kafka topics for weeks, and consumers may be running different code versions simultaneously during a rolling deployment. Schema evolution is how you manage this change safely.
Serialization formats for schema evolution:
| Format | Schema Enforcement | Evolution Support | Wire Size | Human Readable | Best For |
|---|---|---|---|---|---|
| JSON | None (schema-less) | Ad-hoc (add fields, hope consumers ignore unknowns) | Large (field names in every message) | Yes | Prototyping, low-throughput systems, external APIs |
| Avro | Schema required for read + write | Excellent (built-in compatibility rules, schema registry native) | Small (schema not in message, binary encoding) | No | Kafka ecosystems, data pipelines, Confluent stack |
| Protobuf | Schema required (.proto file) | Good (field numbers enable evolution, unknown fields preserved) | Small (binary, field tags instead of names) | No | gRPC services, cross-language systems, Google ecosystem |
| Thrift | Schema required | Good (similar to Protobuf) | Small | No | Legacy systems (Facebook origin), less common for new projects |
| Mode | Rule | Use When | Example |
|---|---|---|---|
| BACKWARD | New schema can read data written by old schema. New consumers can read old messages. | Consumers are upgraded before producers. Most common default. | Adding a new optional field with a default value |
| FORWARD | Old schema can read data written by new schema. Old consumers can read new messages. | Producers are upgraded before consumers. | Removing an optional field, adding a field that old consumers will ignore |
| FULL | Both backward and forward compatible. | Consumers and producers upgrade independently in any order. Safest but most restrictive. | Adding optional fields with defaults only |
| NONE | No compatibility checking. | Development/testing environments only. Never in production. | Any change — but you accept the risk of breaking consumers |
reserve its number so no one accidentally reuses it later. Reusing a number with a different type causes silent data corruption — the deserializer interprets bytes meant for the old field as the new type.
Interview Question: A downstream team reports that their consumer started crashing after your team deployed a change to an event schema. How do you investigate, fix, and prevent this from recurring?
Interview Question: A downstream team reports that their consumer started crashing after your team deployed a change to an event schema. How do you investigate, fix, and prevent this from recurring?
- Check if your deployment changed the event schema — diff the before/after schema files
- Identify the breaking change — common culprits: removed field, changed type, renamed field, added required field without default
- Check if the downstream consumer is using a pinned schema version or dynamically fetching from a registry
- If the change is backward-compatible and the consumer has a bug: assist the downstream team with a consumer-side fix
- If the change is genuinely breaking: roll back the producer to the old schema, publish a corrective schema version, and reprocess any messages written with the broken schema
- If messages in the topic are already corrupted: you may need to publish “correction events” or manually fix the affected data in the downstream system
- Schema registry with compatibility enforcement — set the topic’s compatibility mode to BACKWARD or FULL so the registry rejects breaking changes at registration time
- CI/CD schema validation — add a pipeline step that registers the schema against the registry in a dry-run mode before deploying. If the registry rejects it, the build fails
- Consumer contract testing — downstream teams publish their expected schema (the fields they actually use) and your CI validates that your schema changes do not remove or alter those fields
- Schema change review process — any event schema change requires a PR review from consuming teams before merge
orders-v1, orders-v2) as the primary solution — this works but creates operational complexity (two topics, migration period, eventual decommission). Not considering messages already in the topic that were written with the breaking schema.Words that impress: “schema registry compatibility mode,” “backward-compatible evolution,” “consumer contract testing,” “schema-on-read vs. schema-on-write,” “the wire format is the contract, not the code.”22.8 Idempotency Patterns for Message Consumers
In any at-least-once delivery system — which is every production messaging system — your consumers will receive duplicate messages. Network retries, consumer rebalances, process crashes between processing and acknowledgment: all of these produce duplicates. Idempotent consumers are not optional; they are a baseline requirement. Section 22.2 introduced the basic idempotency check. Here we go deeper into the patterns, trade-offs, and production considerations. Pattern 1: Idempotency Key + Deduplication Table The most common and most robust pattern. Every message carries a unique idempotency key (typically a UUID set by the producer). The consumer checks a deduplication table before processing.| Strategy | How | Trade-off |
|---|---|---|
| TTL-based cleanup | Delete entries older than the maximum possible redelivery window (e.g., 7 days) | Simple, but must be confident that redelivery cannot happen after the TTL |
| Partition by date | Create daily/weekly partitions, drop old partitions entirely | Efficient cleanup, no row-by-row deletes, requires partitioned table setup |
| Bloom filter pre-check | Check an in-memory Bloom filter before hitting the database. Only query the dedup table if the Bloom filter says “maybe seen” | Reduces database load by 90%+ for non-duplicate messages. Bloom filter says “definitely not seen” (fast skip) or “maybe seen” (check DB). False positives cause unnecessary DB lookups but never incorrect behavior |
- You set a 7-day TTL on your
processed_messagestable. On day 8, if a message is redelivered (Kafka consumer offset reset, manual replay, topic repartitioning), the dedup entry is gone. The message is processed again. - Kafka’s idempotent producer dedup is bounded by the broker’s
max.idle.producer.expiry.ms(default 7 days). If a producer is idle for longer than this, its PID is evicted and sequence numbers reset. Messages from the restarted producer may not be deduplicated. - SQS FIFO queues have a 5-minute dedup window. Messages with the same
MessageDeduplicationIdwithin 5 minutes are suppressed. After 5 minutes, a duplicate is treated as new.
| Factor | Guidance |
|---|---|
| Maximum consumer restart time | Window must exceed this. If your consumer can be down for 4 hours during an incident, 7-day window is safe. 1-hour window is not. |
| Maximum replay age | If you ever replay events older than your window, duplicates will pass through. For event-sourced systems, consider infinite dedup (or rebuild-from-scratch strategy). |
| Downstream side effect cost | Cheap side effects (update a counter) can tolerate a shorter window and occasional duplicates. Expensive side effects (charge a credit card) need the longest practical window. |
| Storage cost | At 1M messages/day, a 7-day dedup table holds 7M rows. At 100M messages/day, it holds 700M rows. Partition by date and drop old partitions. |
- Detect: Monitor for duplicate processing. Instrument your dedup check to emit a metric when it catches a duplicate (
dedup_hits). Also instrument downstream systems to detect duplicate effects (two charges for the same order, two emails to the same customer for the same event). - Contain: If duplicates are flowing through, pause the consumer. Do not let the blast radius grow while you investigate.
- Diagnose: Was the dedup table pruned too aggressively? Was there an offset reset? Did a producer restart with a new PID? Check the message headers for the original timestamp and compare against your dedup table’s oldest entry.
- Remediate: For idempotent operations (database upserts), the duplicate had no effect — just resume. For non-idempotent operations (payments, emails), run a reconciliation query to identify affected records and trigger compensating actions (refunds, apology emails).
- Prevent: Extend the dedup window. Add a secondary dedup check at the side-effect layer (payment gateway idempotency keys, email provider dedup). Implement end-to-end reconciliation as a scheduled safety net.
| Replay Age | Dedup Window | Result | Action |
|---|---|---|---|
| 2 hours | 7 days | Safe — dedup entries exist, duplicates are caught | Normal operation |
| 10 days | 7 days | Unsafe — dedup entries expired, replay events treated as new | Extend dedup window before replay, or use a clean rebuild strategy |
| Full log (months) | Any finite window | Unsafe for incremental replay — events outside the window bypass dedup | Must use clean-state rebuild: wipe the target state store and rebuild from scratch |
| Any | Bloom filter only (no persistent dedup) | Unsafe — Bloom filter is ephemeral and lost on consumer restart | Bloom filter is an optimization, not a safety mechanism. Always back it with a persistent dedup table. |
Interview Question: Walk through how you would evaluate the operational cost, failure modes, and security implications of your idempotency strategy before shipping it to production.
Interview Question: Walk through how you would evaluate the operational cost, failure modes, and security implications of your idempotency strategy before shipping it to production.
- What happens if the dedup table is unavailable (database outage)? Does the consumer block (safe but stops processing) or bypass the check (fast but risks duplicates)? I would configure the consumer to block and alert — processing duplicates in a financial system is worse than processing nothing for 5 minutes.
- What happens if the Bloom filter gives a false positive (says “maybe seen” for a genuinely new message)? The consumer falls back to the database check. The message is still processed correctly, just slower. False positives affect latency, not correctness.
- What happens during a consumer scale-out? New consumers start with empty in-memory Bloom filters. All messages hit the database path until the filter warms up. Plan for a temporary 5-10x increase in dedup table reads during scaling events.
- Deploy the dedup logic in shadow mode first: run the dedup check but process the message regardless. Log whether the check would have skipped the message. Run for 48 hours to verify the dedup logic matches expected behavior (no false skips, correct duplicate detection).
- Enable dedup in active mode on a single consumer instance (canary). Monitor for any messages incorrectly skipped.
- Roll to all consumers. Monitor dedup hit rate — it should be very low (under 0.1% in steady state). A high dedup hit rate indicates a producer or consumer configuration issue.
dedup_hit_rate— percentage of messages caught as duplicates. Baseline this. A spike indicates a producer or consumer issue.dedup_check_latency_p99— latency of the dedup table lookup. Should be < 5ms. If it degrades, the dedup table needs indexing or partitioning.dedup_table_size— track growth. Alert if the table exceeds expected size based on message volume and TTL.false_skip_rate— if you can detect cases where the dedup incorrectly skipped a message (via downstream reconciliation), track this. Should be zero.
- Storage: at 10M messages/day with 128-byte message IDs, the dedup table grows ~1.2 GB/day. With 7-day retention, ~8.4 GB. Modest, but at 100M messages/day it is 84 GB and needs partitioned tables or a dedicated store.
- Compute: the dedup check adds one database query per message. With a Bloom filter, 99%+ of messages skip the database. Without one, you are adding 10M queries/day.
- Operational: someone must monitor the dedup table, ensure TTL cleanup runs, handle incidents where dedup fails. Budget 2-4 hours/month of on-call attention.
- The dedup table contains message IDs, which may be UUIDs or may contain entity information (order IDs, customer IDs). If the dedup table is in a less-secured data store (Redis vs. the main database), ensure message IDs do not leak PII.
- Dedup bypass is a potential attack vector: if an attacker can manipulate message IDs (generating new IDs for the same logical operation), they bypass dedup and trigger duplicate side effects. Ensure message IDs are generated server-side by the producer, not by the client.
Further reading: Designing Event-Driven Systems by Ben Stopford — free ebook from Confluent, covers Kafka-centric event architectures. Enterprise Integration Patterns by Gregor Hohpe & Bobby Woolf — the definitive catalog of messaging patterns (routing, transformation, splitting, aggregation). Kafka: The Definitive Guide by Gwen Shapira et al. — comprehensive guide to Kafka internals and usage patterns. Curated links:
- Jay Kreps — “The Log: What every software engineer should know about real-time data’s unifying abstraction” — the foundational blog post that explains the commit log concept behind Kafka. If you read one thing about distributed messaging, make it this.
- Confluent Blog — Kafka Patterns and Best Practices — regularly updated articles on event streaming patterns, schema evolution, exactly-once semantics, and production Kafka operations from the team that maintains Kafka.
- Martin Kleppmann — “Turning the database inside-out” (Strange Loop 2014) — a brilliant talk on event sourcing, stream processing, and why we should rethink how we build applications around data. His book Designing Data-Intensive Applications is also essential.
- AWS — SQS vs SNS vs EventBridge: How to Choose — a clear comparison of AWS messaging services with decision criteria and use-case mappings for serverless architectures.
- Design Patterns — The Saga pattern (orchestration and choreography) is built on top of messaging. If you’re designing a multi-step distributed workflow, read the Saga section in Design Patterns for the full pattern catalog, including when to choose orchestration vs. choreography.
- Reliability — Async retry patterns (exponential backoff, jitter, circuit breakers) are covered in depth in Reliability Principles. DLQs and poison messages are messaging-specific applications of the broader reliability principle: “plan for failure at every layer.”
- Databases — Distributed transactions (two-phase commit, the outbox pattern) bridge the gap between database writes and message publishing. The outbox pattern — write the event to a local database table in the same transaction as the business data, then publish to the broker asynchronously — is the most reliable way to ensure “if the database committed, the event will be published.” See the transactions section in APIs & Databases for the database side of this story.
- Observability — Distributed tracing across async message flows requires correlation IDs propagated through message headers. Without them, a message that flows through 5 services is untraceable. See Caching & Observability for the observability patterns that make async architectures debuggable.
- Distributed Systems Theory — The exactly-once delivery impossibility discussed in Section 22.2 is rooted in the Two Generals Problem and the FLP impossibility result covered in depth in Distributed Systems Theory. Kafka’s consensus protocol (KRaft, replacing ZooKeeper) uses Raft for leader election and metadata management — understanding consensus helps you reason about why Kafka needs an odd number of brokers and what happens during leader failover.
- Cloud Service Patterns — AWS provides managed alternatives to self-hosted messaging: SQS (managed queue), SNS (managed pub/sub), and EventBridge (event bus with content-based routing and schema registry). When to use each — and when to reach for self-managed Kafka instead — is covered in Cloud Service Patterns. Key insight: SQS + SNS together give you the fanout-to-queue pattern (SNS broadcasts, SQS queues per consumer) that covers 80% of use cases without operating a Kafka cluster.
- Real-Time Systems — When your messaging system needs to push updates to end users (not just between backend services), you cross into real-time territory. WebSocket connections can be the final delivery hop from a Kafka consumer to a browser client. The architecture patterns for combining message brokers with WebSocket fan-out layers are covered in Real-Time Systems.
Part XVI — Concurrency, Threading, and Async Execution
Chapter 23: Concurrency
23.1 Concurrency vs Parallelism
These are often confused. Here is the distinction with a real-world analogy and code-level example: Concurrency: one cook, multiple dishes. A single cook works on pasta, salad, and soup. While the pasta boils (waiting/IO), the cook chops vegetables for the salad (doing work). The cook interleaves tasks, switching between them when one is waiting. Only one task is actively being worked on at any moment, but all three make progress.-race flag, C++ ThreadSanitizer, Java’s FindBugs) detect race conditions at runtime. Concurrency testing tools: jcstress (Java), go test -race (Go). Lock-free data structures: ConcurrentHashMap (Java), sync.Map (Go).
23.2 Race Conditions
Behavior depends on timing of operations. Two requests read balance 75 withdrawal, both compute 75 = 25 to the account. Result: 100 account — the second withdrawal should have been rejected, but both saw $100 and assumed it was safe. Prevention: database transactions with correct isolation, optimistic locking (version field), pessimistic locking (SELECT FOR UPDATE), advisory locks for application-level coordination, atomic operations (Redis INCR), idempotent operations.Interview Question: Two users simultaneously try to book the last seat on a flight. Explain how you'd prevent double-booking without sacrificing performance.
Interview Question: Two users simultaneously try to book the last seat on a flight. Explain how you'd prevent double-booking without sacrificing performance.
available_seats = 0 and rejects the booking. Pros: Simple, correct, familiar. Cons: Under high contention (thousands of users booking the same flight simultaneously), transactions queue up and latency spikes. Deadlock-prone if multiple rows are locked in different orders.Approach 2: Optimistic locking (version or CAS).- Standard flight booking (moderate contention): Approach 3 (atomic conditional update) — simplest, correct, performant.
- Complex booking with multiple steps (seat selection + meal + loyalty points): Approach 2 (optimistic locking) with retry logic.
- Flash sale / concert tickets (extreme contention): Approach 4 (queue-based).
SELECT ... FOR UPDATE requires being inside a transaction.Words that impress: “compare-and-swap semantics,” “atomic conditional update,” “contention profile,” “optimistic vs. pessimistic depending on the write-to-read ratio,” “serializable isolation as a last resort.”23.3 Deadlocks
Circular waiting: Transaction A holds Row 1, waits for Row 2. Transaction B holds Row 2, waits for Row 1. Neither can proceed — both wait forever (or until the database detects the deadlock and kills one). Concrete example:lock_timeout so transactions fail fast instead of waiting forever. (4) Monitoring — PostgreSQL logs deadlocks automatically. Track deadlock frequency. If it is increasing, investigate the access patterns.
In application code: Use SELECT ... FOR UPDATE with a consistent ordering. If you need to update accounts 1 and 2, always lock the lower ID first: SELECT * FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE.
23.4 Async/Await, Event Loops, and Background Processing
Async/await enables non-blocking IO — the thread is released while waiting for a response and can handle other requests. Event loops (Node.js, Python asyncio) use a single thread with non-blocking IO. Background processing (Sidekiq, Celery, Hangfire, BullMQ) handles work outside the request cycle.await an I/O operation, the runtime hands it off to the OS (network I/O) or libuv’s thread pool (file I/O, DNS). The event loop continues processing other callbacks. When the I/O completes, its callback is queued and the event loop picks it up on the next tick.Python asyncio works the same way:- CPU-heavy computation (image processing, JSON parsing large payloads, crypto)
- Synchronous file I/O (
fs.readFileSyncin Node, regularopen()in Python) - Long-running loops without yielding
worker_threads (Node.js), concurrent.futures.ProcessPoolExecutor (Python), or a separate microservice.The event loop model relies on OS-level I/O multiplexing: epoll (Linux), kqueue (macOS/BSD), or IOCP (Windows). These kernel mechanisms allow a single thread to monitor thousands of file descriptors (sockets, pipes) simultaneously without polling. When any socket has data ready, the kernel wakes the event loop. This is why Node.js can handle 10,000 concurrent connections on one thread — it is the OS doing the heavy lifting. For the full story on how the OS manages I/O, processes, and the kernel primitives that make async runtimes possible, see OS Fundamentals.Interview Question: Explain the difference between concurrency in Node.js vs Java
Interview Question: Explain the difference between concurrency in Node.js vs Java
| Aspect | Node.js | Java |
|---|---|---|
| Model | Single-threaded event loop | Multi-threaded (thread pool) |
| Concurrency mechanism | Non-blocking I/O, callbacks/promises | OS threads, ExecutorService, virtual threads (Java 21+) |
| CPU-bound work | Blocks the loop (bad) | Parallelized across cores (good) |
| Memory per connection | Very low (no thread stack) | Higher (~512KB-1MB per thread stack) |
| Shared state | No shared state issues (single thread) | Requires locks, atomics, or thread-safe collections |
| Best for | I/O-heavy APIs, real-time apps | CPU-heavy processing, enterprise systems |
- The Go Blog — “Go Concurrency Patterns” — official patterns for pipelines, fan-out/fan-in, and cancellation in Go. Essential reading if you work with goroutines and channels.
- Python asyncio Documentation and PEP 3156 — the authoritative reference for Python’s async/await model. PEP 3156 explains the design rationale behind asyncio’s event loop.
- Kyle Kingsbury — “Jepsen: Distributed Systems Safety Research” — rigorous testing of distributed databases and queues for correctness under failure. Kingsbury’s analyses of systems like Kafka, Redis, PostgreSQL, and etcd reveal real concurrency and consistency bugs that vendors do not advertise. If you want to understand what “correct” actually means in distributed systems, start here.
- Databases — Transaction isolation levels (READ COMMITTED, REPEATABLE READ, SERIALIZABLE) are the database’s answer to the concurrency problems in this chapter. Every race condition example above has a corresponding database isolation level that prevents it — at the cost of throughput. See the transactions section in APIs & Databases for how database engines implement these protections under the hood.
- Reliability — The circuit breaker pattern prevents a slow downstream service from holding threads hostage, which is a concurrency resource exhaustion problem. Thread pool exhaustion — when all threads are waiting on a slow dependency — is one of the most common causes of cascading failures. See Reliability Principles.
- Performance & Scalability — Connection pool sizing, thread pool configuration, and async I/O are all concurrency decisions that directly determine your system’s throughput ceiling. See Performance & Scalability.
- OS Fundamentals — Every concurrency primitive discussed in this chapter — threads, mutexes, atomic operations, context switching — is implemented at the OS level. Threads are OS-scheduled execution contexts with their own stack but shared heap memory. Mutexes are built on top of OS primitives like futexes (fast userspace mutexes on Linux) that avoid kernel transitions in the uncontended case. Understanding the OS layer explains why mutex contention is expensive (kernel context switch), why goroutines are cheap (userspace scheduling, ~2KB stack vs. ~1MB OS thread stack), and why
async/awaitworks (epoll/kqueue for non-blocking I/O). See OS Fundamentals for the full picture of how the OS manages processes, threads, memory, and I/O scheduling underneath your application code. - Distributed Systems Theory — The concurrency problems in this chapter get dramatically harder when you add a network. A race condition between two threads on one machine can be solved with a mutex. A race condition between two services on different machines requires distributed consensus (Raft, Paxos) or conflict-free data structures (CRDTs). The theoretical impossibility results — FLP, CAP — explain why perfect distributed coordination is fundamentally harder than local coordination. See Distributed Systems Theory for the theoretical foundations.
Part XVII — State Management
Chapter 24: State in Distributed Systems
24.1 Stateless vs Stateful
Make everything stateless by default. Push state to dedicated stores (Redis, databases). Stateless services are disposable and horizontally scalable. How a G-Counter actually works: Each node maintains its own counter (a vector:{nodeA: 5, nodeB: 3, nodeC: 7}). Each node only increments its own entry. When nodes sync, they take the element-wise maximum: merge({A:5, B:3, C:7}, {A:4, B:6, C:7}) = {A:5, B:6, C:7}. The global count is the sum of all entries: 5+6+7 = 18. Increments never conflict because each node only writes to its own slot. A PN-Counter extends this with a separate G-Counter for decrements — the real count is increments - decrements. The trade-off: CRDTs are limited to operations that are commutative and associative — not all data structures can be made conflict-free (e.g., a CRDT cannot enforce “balance must not go negative” because that requires coordination).
Tools: Redis (distributed state, pub/sub, Lua scripting for atomic operations). Apache ZooKeeper, etcd (distributed coordination, leader election, distributed locks). Consul (service discovery + KV store). Temporal, Cadence (durable workflow state management).
24.2 Distributed Locking
When multiple service instances need to coordinate — ensuring only one runs a scheduled job, or only one processes a specific order — you need a distributed lock. Redis-based distributed lock (simplified):24.2b Distributed Consensus (Raft Basics)
How does a group of servers agree on a value when any of them can fail? This is the core problem solved by Raft (used in etcd, CockroachDB, Consul) and Paxos (used in Google’s Chubby, Spanner). Raft in 3 phases: (1) Leader election: Servers start as followers. If a follower does not hear from the leader (heartbeat timeout), it becomes a candidate and requests votes. A candidate that receives a majority of votes becomes leader. (2) Log replication: The leader receives client requests, appends them to its log, and replicates to followers. Once a majority acknowledges, the entry is committed and applied. (3) Safety: Only a server with all committed entries can be elected leader. This guarantees committed entries are never lost. How Paxos differs from Raft: Paxos solves the same problem but with a more flexible (and notoriously harder to understand) protocol. Paxos separates consensus into Prepare and Accept phases, allowing proposals from any node. Raft elects a single leader who drives all decisions — simpler to reason about. Raft was designed explicitly to be easier to understand and implement. Most modern systems choose Raft or a Raft variant (etcd, CockroachDB, Consul). You will encounter Paxos in Google’s systems (Chubby, Spanner) and academic papers. Why this matters for senior engineers: You rarely implement consensus, but you use systems built on it (etcd, ZooKeeper, CockroachDB, Consul). Understanding Raft helps you reason about: why these systems need an odd number of nodes (3, 5, 7 — to form a majority), why they sacrifice availability during a partition (CP systems), and why leader election takes time (cannot serve writes during election).24.3 Distributed State Challenges
Synchronization, conflicts, stale data. Approaches: centralized store with strong consistency, eventual consistency with conflict resolution (last-write-wins, CRDTs), event sourcing.| Consistency Model | Guarantee | Latency | Availability | Real-World Example |
|---|---|---|---|---|
| Strong (linearizability) | Every read returns the most recent write. All nodes see the same data at the same time. | High (requires coordination) | Lower (unavailable during partitions) | Bank account balance — you must never see stale balances. Google Spanner, CockroachDB (uses synchronized clocks + Raft). |
| Eventual | If no new writes occur, all replicas will eventually converge to the same value. No guarantee on when. | Low (no coordination) | High (always available) | Social media like counts — seeing 1,042 likes vs 1,043 for a few seconds is acceptable. DynamoDB, Cassandra (default mode). |
| Causal | Operations that are causally related are seen in the same order by all nodes. Concurrent (unrelated) operations may be seen in different orders. | Medium | Medium-High | Chat applications — if Alice says “Hi” and Bob replies “Hello,” everyone must see Alice’s message first. But two unrelated conversations can appear in any order. MongoDB (causal consistency sessions). |
| Read-your-writes | A client always sees its own writes, even if reading from a replica. Other clients may see stale data. | Low-Medium | High | User profile update — after you change your name, you should immediately see the new name, even if other users briefly see the old one. Session-sticky routing, writing and reading from the same replica. |
| Monotonic reads | A client never sees data go “backward” — if you saw version 5, you will never see version 4 on a subsequent read. | Low-Medium | High | Dashboard metrics — numbers should not jump backward when you refresh the page. Sticky sessions to the same replica. |
- Financial transactions, inventory with hard limits -> Strong consistency
- Analytics, social feeds, caches -> Eventual consistency
- Collaborative editing, chat systems -> Causal consistency
- User-facing writes (profile edits, settings) -> Read-your-writes at minimum
24.4 Workflow State Machines
Model complex business processes as state machines. An order goes through states: Draft -> Submitted -> Processing -> Shipped -> Delivered -> Completed (or Cancelled at various points). Each transition has rules and side effects. Makes invalid states unrepresentable. Implementation: Define allowed transitions explicitly. An order in “Shipped” state cannot transition to “Draft.” Each transition can have guards (conditions that must be true) and side effects (send notification, update inventory). Store the current state in the database. On state change, validate the transition, execute side effects, persist the new state — all in a transaction.| From | To | Guard (condition) | Side Effect |
|---|---|---|---|
| PLACED | PAID | Payment confirmed by gateway | Reserve inventory, send confirmation email |
| PLACED | CANCELLED | Payment timeout (30 min) OR user cancels | Release any provisional holds |
| PAID | SHIPPED | Warehouse confirms dispatch | Generate tracking number, notify customer |
| PAID | CANCELLED | Admin/user requests refund before ship | Initiate refund, release inventory |
| SHIPPED | DELIVERED | Carrier confirms delivery | Close fulfillment ticket, trigger review request |
| DELIVERED | RETURNED | Customer initiates return within 30 days | Generate return label, create return case |
| RETURNED | CANCELLED | Return received and inspected | Process refund, restock inventory |
Interview Question: A business process has 8 states and complex transition rules. Users report that orders sometimes end up in impossible states. How do you fix this?
Interview Question: A business process has 8 states and complex transition rules. Users report that orders sometimes end up in impossible states. How do you fix this?
- Audit: Query for all current state values and identify which ones are invalid
- Define: Create a single source of truth — a transition table listing every
(from_state, to_state)pair that is allowed - Enforce: Add a
CHECKconstraint on the state column (CHECK (state IN ('PLACED', 'PAID', 'SHIPPED', ...))) and validate transitions in application code before any write - Log: Record every transition with
from_state,to_state,timestamp, andactorfor auditing - Migrate: Fix existing bad data — move impossible states to the last valid state and flag for manual review
- Prevent: All state changes go through a single
transition_order()function — never update the state column directly
Interview Question: Design an order processing pipeline where payment, inventory, and shipping must happen in sequence but each can fail independently.
Interview Question: Design an order processing pipeline where payment, inventory, and shipping must happen in sequence but each can fail independently.
PaymentCharged -> InventoryService reserves -> InventoryReserved -> ShippingService schedules. This is more decoupled but harder to reason about for sequential pipelines with complex compensation logic. When the sequence is strict and failures require multi-step rollback, orchestration is clearer. Choreography works better when steps are more independent and compensation is simple.5. Production-grade: use a workflow engine. In production, you would implement this as a Temporal workflow or a Step Functions state machine rather than hand-rolling the orchestration. These engines handle retries, timeouts, crash recovery, and compensation out of the box. The workflow state is durable — if the orchestrator crashes mid-saga, it resumes from where it left off when it restarts.- Design Patterns — The Saga pattern discussed in the order processing question above is covered comprehensively in Design Patterns, including the full comparison between orchestration-based Sagas (central coordinator) and choreography-based Sagas (event-driven). The state machine patterns in this chapter are the building blocks; Sagas are how you compose them across services.
- Databases — Event sourcing stores state as a sequence of events rather than a mutable row. This eliminates the “state vs. history” problem — you have both. But it introduces eventual consistency, projection management, and snapshot optimization. The database chapter in APIs & Databases covers the persistence patterns that support event sourcing (append-only tables, materialized views, CQRS read models).
- Reliability — Distributed locking (Section 24.2) is a reliability concern as much as a state concern. What happens when the lock holder crashes? When the lock service itself is unreachable? The Reliability Principles chapter covers fencing tokens, leader election, and the broader principle that distributed locks are a last resort — prefer idempotent designs that don’t need locks.
- Distributed Systems Theory — The Raft consensus overview in Section 24.2b is an introduction; Distributed Systems Theory goes deep into the consensus problem, including Paxos, Raft’s correctness proofs, leader election edge cases, and the FLP impossibility result (why no deterministic consensus protocol can tolerate even a single crash failure in an asynchronous network). It also covers CRDTs in depth — the mathematical properties that make conflict-free state merging possible — and the CAP theorem that constrains every decision about distributed state.
- Cloud Service Patterns — Managed state services like DynamoDB (strongly consistent reads, conditional writes for optimistic locking), ElastiCache/Redis (distributed caching and locking), and Step Functions (managed workflow state machines) offload much of the operational complexity of distributed state. When to use managed services vs. self-managed state infrastructure is covered in Cloud Service Patterns. Key insight: DynamoDB’s conditional writes (
ConditionExpression) give you atomic check-and-set semantics without managing your own locking infrastructure. - OS Fundamentals — The mutexes, semaphores, and atomic operations used for local state synchronization are OS-level primitives. Understanding how the OS schedules threads, manages shared memory, and implements lock-free data structures helps you reason about the performance characteristics of your synchronization choices. See OS Fundamentals for the kernel-level view of synchronization primitives and why some are orders of magnitude faster than others.
- Temporal.io Blog — “Why Workflow Engines” — articles on durable execution, workflow orchestration, and why hand-rolling state machines for distributed business processes is a losing battle. The Saga pattern article is particularly relevant to the order processing interview question above.
- Martin Kleppmann — “Designing Data-Intensive Applications” (DDIA) — Chapters 7-9 cover transactions, distributed consistency, and consensus in a depth and clarity unmatched by any other resource. If you buy one book on distributed state, make it this one.
Interview Deep-Dive Questions
The questions below go beyond the inline interview questions already in the chapter. They are designed to simulate a real senior-to-staff-level technical interview: each question has a strong candidate answer, follow-ups that branch into different territories, and “going deeper” prompts that push into the edges where production experience separates from textbook knowledge.Q1: You are designing an event-driven architecture for a payment processing platform. A single payment event must trigger updates in the ledger service, the notification service, and the fraud analytics service. Walk me through your design, and specifically address what happens when one of those downstream services is unavailable for 20 minutes.
Strong Candidate Answer:- The payment service publishes a
PaymentCompletedevent to a Kafka topic. Each downstream service runs its own consumer group, so each gets an independent copy of every event. This is the classic fan-out pattern using Kafka’s consumer group model: one topic, three consumer groups, each reading every message at their own pace. - The critical design decision here is that the payment service does not directly call these downstream services. It publishes the event and moves on. This means a 20-minute outage in the notification service has zero impact on payment processing or ledger updates. The notification service’s consumer simply stops polling, its consumer lag grows, and when it recovers, it resumes from where it left off. Kafka retains messages for the configured retention period (typically 7 days or more), so 20 minutes of downtime is trivially survivable.
- For the ledger service, which has strict correctness requirements, I would use at-least-once delivery with an idempotent consumer. The consumer writes a dedup entry and the ledger update in the same database transaction. If the ledger service restarts mid-processing, it re-reads the unacknowledged messages and the idempotency check prevents double-counting.
- For the fraud analytics service, I would be more relaxed. It is a read-heavy analytics workload, so processing duplicates or slightly delayed data is acceptable. I might even run it in a batch mode, consuming events in micro-batches every 30 seconds rather than message-by-message.
- The one subtlety is the payment service itself. I would use the outbox pattern to ensure the payment database write and the event publication are atomic. The payment service writes the event to a local
outboxtable in the same transaction as the payment record. A separate process (or Debezium CDC) reads the outbox table and publishes to Kafka. This guarantees that if the payment commits, the event will eventually be published, even if Kafka is temporarily unreachable at the moment of the commit.
Follow-up: What if the fraud analytics service discovers that a payment is fraudulent 10 minutes after it was processed? How does the system react?
Strong Answer:- This is where you need a compensating event pattern. The fraud service publishes a
PaymentFlaggedFraudulentevent back to a Kafka topic. The ledger service consumes this event and creates a reversal entry (not a deletion; ledgers are append-only for auditability). The notification service consumes it and sends a “payment held for review” email to the customer. - The key principle is that in an event-driven system, you do not “undo” events. You publish new events that represent the compensating action. The original
PaymentCompletedevent remains in the log permanently. This is critical for auditing and for debugging later disputes. You can replay the full history and see exactly what happened and when. - In practice, I would also have the fraud service publish a confidence score and a reason code, so downstream systems can make proportional responses. A score of 0.6 might trigger a hold and a manual review. A score of 0.95 might trigger an automatic reversal and a customer lockout.
Follow-up: How do you ensure that the outbox pattern does not become a bottleneck?
Strong Answer:- The outbox pattern introduces a polling delay: the relay process reads uncommitted rows from the outbox table and publishes them. In a low-throughput system this is negligible, but at scale (say, 50K payments per second), the outbox table can become a hot table.
- The standard approach is Change Data Capture (CDC) using a tool like Debezium. Instead of polling the outbox table, Debezium reads the database’s write-ahead log (the WAL in Postgres, the binlog in MySQL) and publishes changes to Kafka in near-real-time. This is much more efficient than polling because it piggybacks on the database’s own replication mechanism. The outbox table can be truncated aggressively because the WAL is the source of truth.
- Another optimization is partitioning the outbox table by time or by a hash of the entity ID. This distributes the write load and allows parallel relay workers to process different partitions independently.
- The one gotcha I have seen in production is WAL retention. If Debezium falls behind (say, during a long maintenance window), the database may have already cleaned up the WAL segments that Debezium needs. You need to configure WAL retention to exceed your maximum expected Debezium downtime, or you need a fallback mechanism to snapshot the outbox table and resume.
Going Deeper: How does Debezium CDC handle schema changes in the source database?
Strong Answer:- Debezium captures schema changes from the database’s DDL log (or by snapshotting
information_schema) and includes schema metadata alongside the data changes it publishes. When a column is added or a type changes, Debezium emits a schema change event to a special schema-change topic, and subsequent data events include the updated schema. - The practical problem is that downstream consumers may not be ready for the schema change. This is where a schema registry becomes essential. If your Debezium connector is configured to register Avro schemas in Confluent Schema Registry, the registry’s compatibility checks will reject schema changes that break downstream consumers before they are published. Without this, a
DROP COLUMNin the source database can silently break every consumer that depends on that column. - In my experience, the safest approach is: (1) add new columns as optional with defaults, (2) deploy consumer changes that handle both old and new schemas, (3) backfill the new column, (4) only then make it required or remove the old column. Treat database schema changes as event schema changes, because with CDC, that is literally what they are.
Q2: Explain the difference between the Outbox Pattern and dual writes. Why does one work and the other fail?
Strong Candidate Answer:- Dual writes means writing to two systems independently: first update the database, then publish an event to Kafka (or vice versa). The problem is that these are two separate operations with no transactional guarantee spanning both. If the application crashes after the database write but before the Kafka publish, the database has the update but no event is ever published. Downstream consumers never learn about the change. Alternatively, if Kafka accepts the message but the database write fails on retry, you have a phantom event with no backing data.
- The Outbox Pattern solves this by making the event publication part of the database transaction. Instead of writing to Kafka directly, you write the event payload to an
outboxtable in the same database, in the same transaction as your business data. A separate relay process (or CDC connector) reads the outbox table and publishes to Kafka. Because the event row and the business data row are committed atomically, you get the guarantee: if the business data is committed, the event will eventually be published. If the transaction rolls back, the event is never written. - The key insight is that you are converting a distributed consistency problem (two systems, no shared transaction) into a local consistency problem (one database, one transaction) plus a delivery problem (eventually relay the event). The delivery problem is much easier to solve with retries and idempotency.
Follow-up: Can you use the Outbox Pattern with a NoSQL database that does not support multi-document transactions?
Strong Answer:- With a document store like MongoDB (pre-4.0) or DynamoDB, you cannot atomically write to two different tables in one transaction. But you can embed the outbox within the same document. For example, in DynamoDB, your payment record could include an
outbox_eventsattribute that is a list of pending events. A singlePutItemorUpdateItematomically writes the business data and the event metadata together. - A separate relay process scans for items with non-empty
outbox_events, publishes each event to Kafka, and then clears theoutbox_eventsattribute. You need to handle the case where the relay publishes but crashes before clearing; the idempotent consumer on the Kafka side handles the resulting duplicate. - DynamoDB Streams is actually a built-in CDC mechanism. You can configure a Lambda function to trigger on every DynamoDB write, read the new item image, and publish to Kafka. This is functionally equivalent to Debezium for DynamoDB. The trade-off is Lambda invocation latency and cost at high throughput.
Follow-up: What about in-memory message brokers like Redis Streams? Can you use the Outbox Pattern there?
Strong Answer:- Redis Streams give you an append-only log similar to Kafka, but Redis is typically not your primary database. If your business data is in PostgreSQL and you want to publish to Redis Streams, you still have the dual-write problem.
- One approach is to use Redis Streams as the outbox relay destination and PostgreSQL as the outbox source, with the same pattern: write to the outbox table in PG, relay to Redis Streams via a poller or CDC. But this adds complexity for minimal benefit over just relaying to Kafka directly.
- Where Redis Streams shine in this context is when Redis is your primary data store. If you are running a real-time leaderboard entirely in Redis, you can use a Lua script to atomically update the sorted set and
XADDan event to a stream in a single atomic Redis operation. Redis’s single-threaded execution model guarantees atomicity within a single Lua script. This is effectively a “Redis-native outbox” without needing a separate relay.
Q3: A candidate says “We use exactly-once processing in Kafka, so we don’t need to worry about idempotency in our consumers.” What is wrong with this statement?
Strong Candidate Answer:- This is a dangerous misunderstanding. Kafka’s exactly-once semantics (EOS) provides exactly-once processing within the Kafka ecosystem only: reading from a Kafka topic, processing, and writing the result to another Kafka topic. It does this through idempotent producers (dedup at the broker via producer ID + sequence number) and transactional consumers (atomically committing consumer offsets and producer writes in a single Kafka transaction).
- The moment your consumer has a side effect outside of Kafka, you are back to at-least-once. If your consumer reads an event, writes to PostgreSQL, and then the consumer crashes before committing its Kafka offset, the event will be redelivered when the consumer restarts. Kafka’s EOS has no way to roll back the PostgreSQL write. Your PostgreSQL insert will happen twice unless you build idempotency yourself.
- In practice, nearly every consumer does something outside Kafka: write to a database, call an external API, update a cache, send an email. For all of these, you still need idempotent consumers. Kafka’s EOS is a powerful tool for stream processing pipelines that stay entirely within Kafka (e.g., Kafka Streams applications that read from one topic, transform, and write to another). It does not replace application-level idempotency for anything beyond that.
Follow-up: How does Kafka’s transactional consumer actually work under the hood?
Strong Answer:- Kafka’s transactions work by atomically committing a set of writes (producer messages to output topics) and a set of consumer offset commits to the
__consumer_offsetsinternal topic. The transaction coordinator (a broker designated for the transaction) manages this. - The flow is: (1) the consumer reads a batch of messages, (2) the application begins a Kafka transaction, (3) processes the messages and writes results to output topic(s), (4) sends the consumer offset commits to the transaction, (5) calls
commitTransaction(). The transaction coordinator writes aCOMMITmarker to the transaction log. If the consumer crashes before step 5, the coordinator writes anABORTmarker after a timeout, and all writes in the transaction are discarded (consumers reading withisolation.level=read_committedskip uncommitted messages). - The subtle part is
read_committedisolation. Consumers must be configured withisolation.level=read_committedto participate in exactly-once. Otherwise, they see uncommitted (potentially aborted) messages, which defeats the purpose. This is a common misconfiguration.
Follow-up: When would you intentionally choose at-most-once delivery over at-least-once?
Strong Answer:- At-most-once is appropriate when losing a small fraction of messages is acceptable but duplicates cause problems or waste. Real-world examples:
- Metrics and telemetry sampling: if you are collecting CPU utilization every second from 10,000 servers, losing 0.1% of data points is invisible in your dashboards. But duplicate data points could skew percentile calculations.
- Real-time gaming events: in a multiplayer game, a position update that arrives twice could cause visible glitches (player teleportation). A lost update is smoothed over by the next update 50ms later.
- High-frequency sensor data with downstream aggregation: if you are averaging temperature readings over 5-minute windows, a missing reading barely affects the average. A duplicate reading might push the average up artificially.
- The implementation is simple: the producer sends the message with
acks=0(fire and forget, no broker acknowledgment) oracks=1(leader ack only, no retry on failure). The consumer commits its offset before processing. If the consumer crashes after committing but before processing, the message is lost. This is the trade-off: speed and simplicity at the cost of potential message loss.
Q4: You join a team running a Kafka-based event pipeline. They have a single topic with 12 partitions, and the partition key is the user ID. A small number of “power users” generate 100x the events of normal users. The pipeline is falling behind. What do you do?
Strong Candidate Answer:- This is the hot partition problem, also known as partition skew. When a small number of keys produce disproportionate traffic, those keys’ events all land on the same partition (because Kafka hashes the key to a partition). That partition’s consumer is overloaded while other consumers sit relatively idle. Adding more consumers does not help because you cannot have two consumers within the same consumer group reading the same partition.
- Step 1: Confirm the diagnosis. Check per-partition lag. If partition 7 has 5 million lag and every other partition is under 100K, you have a hot partition. Confirm by checking which keys are on that partition and their event volume.
- Step 2: Short-term mitigation. Increase the processing power of the consumer handling the hot partition. This might mean vertically scaling its instance or optimizing its processing logic (batching database writes, parallelizing within the consumer using a thread pool keyed on sub-entity IDs).
- Step 3: Medium-term fix with compound keys. Change the partition key from
user_idtouser_id + event_typeoruser_id + shard_suffix(e.g., append a modulo of the event timestamp). This spreads a single user’s events across multiple partitions. The trade-off is that you lose per-user ordering within a single partition. If ordering per user matters, you need to re-sort events downstream or use a different approach. - Step 4: If per-user ordering is critical. Implement a two-level architecture. The first-level topic uses compound keys to spread load. A downstream consumer reassembles per-user ordering using event timestamps or sequence numbers and writes to a per-user-ordered store or secondary topic. This adds complexity but preserves both scalability and ordering.
- Step 5: Long-term. Consider whether your power users should be in a separate topic entirely, with dedicated consumer capacity. This is the “VIP lane” pattern. It isolates the blast radius of power-user spikes and lets you scale each tier independently.
Follow-up: What is the impact of increasing the number of partitions on an existing Kafka topic?
Strong Answer:- You can increase partitions on an existing topic, but you cannot decrease them. When you add partitions, Kafka does not redistribute existing data. Old messages stay in their original partitions. Only new messages are distributed across the new partition set.
- The critical impact is on key-based partitioning. If you had 12 partitions and a message with key “user-42” was going to partition 7 (via
hash("user-42") % 12 = 7), after increasing to 24 partitions,hash("user-42") % 24might be partition 18. So “user-42” messages are now split across two partitions: old ones in 7, new ones in 18. If you rely on per-key ordering, this is a breaking change. All messages for “user-42” used to be in one partition (guaranteed order). Now they are in two (no ordering guarantee between them). - In practice, you should plan your initial partition count conservatively high. It is much easier to start with 50 partitions and have some under-utilized than to start with 12 and need to repartition a live topic.
Going Deeper: How would you handle this hot partition problem if you were using SQS instead of Kafka?
Strong Answer:- SQS does not have the concept of partitions or partition keys in standard queues. Messages are distributed across SQS’s internal shards automatically. So the hot-partition problem does not exist in the same way. Any consumer can read any message.
- However, with SQS FIFO queues, you do have a similar problem. FIFO queues use a
MessageGroupIdto maintain ordering within a group. All messages with the sameMessageGroupIdare delivered in order and processed by one consumer at a time. A power user whoseMessageGroupIdis their user ID will create a hot group where messages back up behind sequential processing. - The fix is similar: use a compound
MessageGroupId(user_id + shard) to spread the load, accepting that you lose strict per-user ordering within the FIFO queue. Or accept that for most use cases, a standard SQS queue (which provides no ordering) with application-level ordering is more practical at scale.
Q5: You have a Node.js service that handles HTTP requests and also runs background Kafka consumers. During a traffic spike, you notice that Kafka consumer lag is growing even though your HTTP response times are fine. What is happening?
Strong Candidate Answer:- This is a classic event-loop starvation problem. Node.js runs on a single-threaded event loop. Both the HTTP request handlers and the Kafka consumer callbacks share that same event loop. During a traffic spike, the HTTP handlers are consuming the majority of the event loop’s time. The Kafka consumer’s
poll()callbacks and message processing callbacks get starved; they are queued behind hundreds of HTTP callbacks waiting to execute. - The HTTP response times look fine because each individual request is fast, and the HTTP server has its own keep-alive and backpressure mechanisms. But the cumulative effect of processing 10,000 HTTP requests per second means the Kafka consumer rarely gets a turn on the event loop. It polls less frequently, processes messages slower, and lag grows.
- Fix 1: Separate processes. Run the HTTP server and the Kafka consumer in separate Node.js processes. Each gets its own event loop and they do not compete. This is the most robust solution and the pattern I would default to in production.
- Fix 2: Worker threads. Move the Kafka consumer to a worker thread. The main thread handles HTTP, the worker thread handles Kafka. They communicate via
MessagePortif needed. - Fix 3: If you must keep them in one process, use
setImmediate()or chunked processing to yield back to the event loop between Kafka message batches. Process N messages, yield, let HTTP callbacks run, then process the next batch. This is fragile and hard to tune.
Follow-up: How would this problem manifest differently in a Go service?
Strong Answer:- In Go, the HTTP handlers and Kafka consumers would run in separate goroutines. Go’s runtime scheduler multiplexes goroutines across OS threads (GOMAXPROCS, typically one per CPU core). HTTP handlers and Kafka consumers naturally run in parallel on different cores. One does not starve the other unless you exhaust the goroutine scheduler with extreme concurrency.
- The equivalent problem in Go is not event-loop starvation but thread pool exhaustion or memory pressure. If your HTTP handlers each spawn a goroutine and each goroutine allocates memory, a traffic spike might create millions of goroutines, consuming gigabytes of stack memory (even though each goroutine starts small at ~2KB, it can grow). The Kafka consumer itself would be fine, but the overall process might OOM.
- Go’s concurrency model is fundamentally different from Node’s. Go uses preemptive scheduling (goroutines can be paused mid-execution at function calls), while Node uses cooperative scheduling (callbacks run to completion). This means Go does not have the event-loop starvation problem, but it can have goroutine leak and memory pressure problems that Node does not.
Follow-up: What metrics would you monitor to detect event-loop starvation in Node.js before it causes production issues?
Strong Answer:- Event loop lag: measure the delay between scheduling a
setTimeout(callback, 0)and the callback actually executing. If this consistently exceeds 50-100ms, the event loop is saturated. Libraries likemonitorEventLoopDelay(built into Node.jsperf_hooks) orevent-loop-lagnpm package provide this. - Event loop utilization (ELU): available since Node.js 16 via
performance.eventLoopUtilization(). This gives you the percentage of time the event loop is actively processing callbacks vs. idle. An ELU approaching 1.0 means the event loop has no idle time and is at capacity. - Active handles and requests:
process._getActiveHandles()andprocess._getActiveRequests()tell you how many open connections, timers, and pending I/O operations exist. A sudden spike in active handles during a traffic surge confirms the event loop is overloaded. - In my experience, the most actionable metric is event loop lag with a P99 alert. If P99 event loop lag exceeds 200ms, page someone. It means the event loop is so saturated that some callbacks are waiting 200ms just to start executing.
Q6: Describe a scenario where a distributed lock (using Redis) fails to provide mutual exclusion, even when your implementation is correct. How do you mitigate this?
Strong Candidate Answer:-
The classic failure mode is the GC pause / clock skew scenario, first articulated clearly by Martin Kleppmann in his critique of Redlock. Here is the sequence:
- Process A acquires a Redis lock with a 30-second TTL.
- Process A begins its critical section (e.g., processing a payment).
- Process A’s JVM enters a long GC pause (stop-the-world, 35 seconds).
- The lock’s TTL expires while Process A is paused. Redis automatically deletes the lock key.
- Process B acquires the same lock (legitimately, since it is free).
- Process B begins its critical section.
- Process A’s GC pause ends. Process A resumes execution, still believing it holds the lock.
- Both Process A and Process B are now executing the critical section simultaneously. Mutual exclusion has failed.
- This is not a bug in your code or in Redis. It is a fundamental limitation of any lock that uses time-based expiry in an environment where processes can pause unpredictably.
- Mitigation 1: Fencing tokens. When Process A acquires the lock, it also receives a monotonically increasing fencing token (e.g., a counter or version number stored alongside the lock). Every write to the protected resource includes the fencing token. The resource (e.g., the database) rejects any write with a fencing token lower than the highest it has seen. When Process A wakes up from GC and tries to write with token 34, the database has already seen token 35 from Process B and rejects A’s write. This is the definitive solution.
- Mitigation 2: Lock renewal (heartbeat). Process A periodically renews the lock’s TTL while it is in the critical section. If A is paused by GC, it cannot renew, and the lock expires as designed. But this does not solve the problem; it just makes the window smaller. A GC pause that exceeds the renewal interval still causes the same failure.
- Mitigation 3: Accept that distributed locks are advisory, not absolute. Design your system so that even if two processes enter the critical section simultaneously, the outcome is safe (idempotent writes, fencing tokens on downstream resources). The lock reduces the probability of concurrent execution but should not be your only safety mechanism.
Follow-up: How does ZooKeeper/etcd solve this differently than Redis?
Strong Answer:- ZooKeeper and etcd provide consensus-based locks using Raft (etcd) or ZAB (ZooKeeper). The key difference is ephemeral nodes (ZooKeeper) or leases (etcd) that are tied to the client’s session, not to a wall-clock TTL.
- In ZooKeeper, when Process A creates an ephemeral lock node, it maintains a session with ZooKeeper via heartbeats. If Process A becomes unresponsive (GC pause, network partition), ZooKeeper detects the missed heartbeats and deletes the ephemeral node after the session timeout. The lock is released not because a timer expired, but because the system detected the holder is unresponsive.
- However, this does not fully eliminate the GC pause problem either. If Process A’s GC pause is shorter than the session timeout, ZooKeeper still considers the session active, and the same race can occur. The window is smaller and tied to failure detection rather than arbitrary TTL, but it is not zero.
- The bottom line is that no distributed locking mechanism can fully protect against client-side pauses. Fencing tokens remain the definitive solution regardless of which lock implementation you use.
Going Deeper: What is the Redlock algorithm and why is it controversial?
Strong Answer:- Redlock is a distributed lock algorithm proposed by Redis creator Salvatore Sanfilippo. Instead of relying on a single Redis instance (which is a single point of failure), Redlock uses N independent Redis instances (typically 5). To acquire a lock, a client must successfully acquire it on a majority (at least 3 out of 5) of instances within a time window. The lock is considered valid for the original TTL minus the time spent acquiring it.
- The controversy comes from Martin Kleppmann’s 2016 analysis. He argued that Redlock makes assumptions about timing (bounded clock drift, bounded process pauses) that are unsafe in real distributed systems. Specifically: if the system assumption is that clocks are roughly synchronized and processes do not pause for longer than the lock validity period, then a simpler single-node lock with a TTL achieves the same guarantees. If those assumptions are violated, Redlock fails too. In Kleppmann’s view, Redlock occupies an awkward middle ground: more complex than a single-node lock but no safer under the failure modes that actually matter.
- Sanfilippo disagreed, arguing that Redlock’s multi-node approach provides genuine fault tolerance against individual Redis node failures, and that the timing assumptions are reasonable in practice.
- My pragmatic take: for most applications, a single Redis lock with a TTL and fencing tokens is sufficient. If you need stronger guarantees, use a consensus-based system (etcd, ZooKeeper) rather than Redlock. If you need absolute safety, make your downstream operations idempotent so that the lock is an optimization, not a correctness requirement.
Q7: In a microservice architecture, Service A publishes events and Service B consumes them. Service B needs to process events in the exact order they were produced per customer. How do you guarantee this, and what are the limits of that guarantee?
Strong Candidate Answer:- The standard approach with Kafka is to use the customer ID as the partition key. Kafka guarantees ordering within a single partition, so all events for customer “C-123” go to the same partition and are consumed in the order they were produced. Within a consumer group, each partition is consumed by exactly one consumer instance, so there is no concurrent processing that could reorder events.
- But the guarantee has important limits:
- Single-producer ordering only. If two instances of Service A both produce events for customer “C-123,” Kafka preserves the order within each producer’s batch but not across producers. Two events sent by two different producers may arrive in any order even within the same partition. Fix: ensure a single producer instance is responsible for each customer (partition-based routing of upstream requests), or include a monotonic sequence number set by the business logic layer, not the producer.
- Consumer restart reprocessing. If Service B crashes after processing message M5 but before committing offset 5, it will re-read M5 on restart and process it again. The processing order is preserved, but the idempotency requirement is now coupled with the ordering requirement. If M5 is “debit 200.
- Repartitioning breaks ordering. If you increase the partition count on the topic, the partition assignment for customer “C-123” changes. New events for “C-123” go to a different partition. You now have old events in one partition and new events in another, with no ordering guarantee between them.
- Multi-topic ordering. If “C-123” has events on both a
paymentstopic and arefundstopic, there is no ordering guarantee across topics. Event timestamps may not be sufficient because producer clocks can drift. You need a vector clock or a centralized sequencer to establish cross-topic ordering.
Follow-up: How would you handle ordered processing when you need to consume from multiple Kafka topics that relate to the same entity?
Strong Answer:- This is one of the hardest problems in event-driven architectures. There is no built-in Kafka mechanism for cross-topic ordering. Options:
- Merge topics. Publish all events for an entity to a single topic with the entity ID as partition key. Different event types are distinguished by a
typefield in the payload. This gives you per-entity ordering across all event types. The downside is that you lose per-type consumer groups and per-type retention policies. All consumers reading this topic receive all event types and must filter. - Application-level reordering. Consume from both topics, buffer events per entity, and sort by a logical timestamp or sequence number before processing. This requires a windowed join: wait for events to arrive within a window (say, 5 seconds), then process in order. The trade-off is latency (you wait for the window) and complexity (what happens if an event arrives after the window closes?).
- Kafka Streams or Flink. Use a stream processing framework that supports multi-topic joins with event-time semantics and watermarks. Kafka Streams’
KStream-KStream joinwith windowed joins handles this explicitly. Apache Flink’s event-time processing with watermarks can order events from multiple sources.
- Merge topics. Publish all events for an entity to a single topic with the entity ID as partition key. Different event types are distinguished by a
- In my experience, the first option (single topic per entity) is the simplest and most reliable if you can tolerate the coupling. The third option (stream processing framework) is the right call when you have many event types, different producers, and need robust event-time ordering.
Follow-up: What is a watermark in stream processing, and why is it necessary?
Strong Answer:- A watermark is a declaration by the stream processing system that says: “I believe all events with a timestamp up to T have arrived.” Any event arriving after the watermark with a timestamp before T is considered a “late event.”
- Watermarks are necessary because events do not arrive in order. A payment event with timestamp 10:00:05 might arrive at the stream processor before an event with timestamp 10:00:02 due to network delays, different producer clock offsets, or retry delays. Without watermarks, the processor would never know when it is “safe” to produce a result for the 10:00:00-10:00:05 window. It might wait forever for that one late event.
- In practice, watermarks are typically set to
max_event_time_seen - allowed_lateness. If you set allowed lateness to 10 seconds, the watermark advances tomax_seen - 10s. Events arriving more than 10 seconds late are dropped or sent to a “late events” side output for special handling. The trade-off is: longer allowed lateness means more complete results but higher processing latency. Shorter lateness means faster results but more dropped events.
Q8: Your team is debating whether to use RabbitMQ or Kafka for a new service. One engineer says “Kafka is always better because it scales more.” How would you push back on this?
Strong Candidate Answer:- “Kafka is always better” is an oversimplification that leads to poor architectural decisions. Kafka and RabbitMQ solve different problems and have fundamentally different models. Choosing between them should be based on your actual requirements, not on which one has higher peak throughput on a benchmark.
- When RabbitMQ is the better choice:
- Complex routing logic. RabbitMQ’s exchange model (direct, topic, fanout, headers) lets you route messages based on content, headers, and patterns without consumer-side filtering. Kafka requires consumers to read everything from a topic and filter themselves.
- Priority queues. RabbitMQ supports priority queues natively. A high-priority order can jump ahead of 10,000 low-priority orders. Kafka has no concept of message priority within a partition.
- Per-message TTL and delayed messaging. RabbitMQ supports message-level TTL and delayed delivery via plugins. Kafka retains everything until the retention period; there is no per-message expiry.
- Simple task queues with low operational overhead. If you need a queue for background job processing (send emails, generate PDFs) and you do not need replay, event sourcing, or multi-consumer-group semantics, RabbitMQ is simpler to operate and reason about.
- Request-reply patterns. RabbitMQ has native support for reply-to queues and correlation IDs. Kafka can do request-reply, but it is awkward and requires careful topic and consumer management.
- When Kafka is the better choice: replay, event sourcing, high-throughput streaming, multi-consumer-group fan-out, long-term message retention, stream processing.
- The decision framework I use: “Do I need a smart broker (routing, priority, TTL) or a dumb pipe with smart consumers (replay, reprocessing, independent consumption)?” Smart broker points to RabbitMQ. Dumb pipe points to Kafka.
Follow-up: What are the operational differences in running RabbitMQ vs Kafka in production?
Strong Answer:- Kafka operations: Kafka is operationally heavier. You manage brokers, ZooKeeper (or KRaft in newer versions), topic configurations (partition count, replication factor, retention), consumer group monitoring, and partition rebalancing. Disk I/O is critical because Kafka relies heavily on sequential disk writes and page cache. Kafka clusters typically need dedicated infrastructure with fast disks (SSDs or NVMe) and significant memory for page cache. Monitoring requires tracking broker health, under-replicated partitions, ISR shrinkage, consumer lag per partition, and request latency percentiles.
- RabbitMQ operations: RabbitMQ is lighter for simple setups but has its own challenges. It runs on Erlang, which has a unique operational profile. Memory management is critical: if consumers fall behind, messages queue in memory (and optionally page to disk), and RabbitMQ can hit memory alarms that cause it to block all publishers. Cluster networking requires low-latency links between nodes (RabbitMQ clusters do not tolerate WAN latency well). Queue mirroring (classic HA queues) is being replaced by quorum queues in newer versions for better consistency. Monitoring focuses on queue depth, message rates, memory usage, and file descriptor count.
- The honest answer: if your team does not have dedicated infrastructure engineers, consider managed services. AWS MSK (managed Kafka), Amazon MQ (managed RabbitMQ), or CloudAMQP eliminate most operational burden. The operational cost of self-hosting either technology is significant and often underestimated.
Q9: Walk me through how you would design an idempotent consumer for a service that sends emails. The challenge: sending an email is an irreversible side effect.
Strong Candidate Answer:- This is one of the harder idempotency problems because email sending is a fire-and-forget side effect. Unlike a database write where you can check if the row exists, once an email is sent, you cannot “un-send” it. The core challenge is: what happens when your consumer sends the email successfully but crashes before acknowledging the message? On redelivery, how do you know the email was already sent?
- The pattern:
- Before sending the email, check a
sent_emailstable:SELECT 1 FROM sent_emails WHERE message_id = ? - If found, skip. The email was already sent. Ack the message.
- If not found, insert a record into
sent_emailswithin a transaction:INSERT INTO sent_emails (message_id, status) VALUES (?, 'PENDING') - Send the email via the email provider (SendGrid, SES, etc.), passing the message ID as the provider’s idempotency key if supported.
- Update the record:
UPDATE sent_emails SET status = 'SENT', sent_at = NOW() WHERE message_id = ? - Ack the message.
- Before sending the email, check a
- The failure modes:
- Crash after step 3, before step 4: The record is in
PENDINGstate. On redelivery, the consumer seesPENDINGand knows the email was not sent. It retries. No duplicate. - Crash after step 4, before step 5: The email was sent but the status was not updated to
SENT. On redelivery, the consumer seesPENDINGand tries to send again. This is where the email provider’s idempotency key is critical. If the provider supports idempotency keys (SES does viaMessageDeduplicationIdin FIFO topics; SendGrid does not natively), the duplicate send is suppressed. If the provider does not support idempotency, you may send a duplicate email. - The honest trade-off: For most systems, receiving a duplicate email is an acceptable failure mode (annoying but not harmful). The idempotent consumer prevents duplicate processing, and the email provider’s idempotency key (when available) prevents duplicate delivery. If neither is available, you accept the risk of the rare duplicate email in exchange for guaranteed delivery.
- Crash after step 3, before step 4: The record is in
Follow-up: How would you design this differently for an SMS notification where duplicates are expensive (each SMS costs money)?
Strong Answer:- For costly irreversible side effects, you need a stronger guarantee. The approach I would use is the two-phase send pattern:
- Record the intent to send in your database (status =
PENDING), within the same transaction as any business logic. - A separate, dedicated sender process reads
PENDINGrecords, calls the SMS provider with an idempotency key (Twilio supportsIdempotencyKeyin API calls), and updates the status toSENTonly after a successful API response. - If the sender crashes after calling Twilio but before updating status, Twilio’s idempotency key prevents the duplicate on the next attempt. The sender process retries the
PENDINGrecord, Twilio returns the original response, and the status is updated toSENT.
- Record the intent to send in your database (status =
- The key differences from the email pattern: (1) the send is decoupled from the message consumer into a separate reliable sender, reducing the window for crashes between send and record, (2) the SMS provider’s idempotency key is mandatory, not optional, because duplicates have a cost, and (3) you may add a cooldown period: do not retry a
PENDINGrecord within 60 seconds of the last attempt, giving the previous attempt time to complete.
Q10: Explain the CAP theorem, and then tell me why most experienced engineers think it is an oversimplification.
Strong Candidate Answer:- The CAP theorem, proven by Seth Gilbert and Nancy Lynch in 2002 (based on Eric Brewer’s 2000 conjecture), states that a distributed system can provide at most two of three guarantees simultaneously: Consistency (every read returns the most recent write), Availability (every request receives a response), and Partition tolerance (the system continues operating despite network partitions between nodes).
- Since network partitions are unavoidable in any real distributed system, the practical choice is between CP (consistent but may be unavailable during a partition) and AP (available but may return stale data during a partition).
- Why experienced engineers consider it an oversimplification:
- CAP is about the partition moment, not the steady state. When there is no partition (which is the vast majority of the time), you can have all three. The CAP trade-off only kicks in during an actual network partition. This makes the theorem less useful for daily design decisions than people assume.
- Consistency and Availability are not binary. CAP defines them as absolute (every read is linearizable, every request gets a response), but real systems operate on a spectrum. You can have eventual consistency, causal consistency, read-your-writes consistency. You can have high availability (99.99%) without absolute availability. The PACELC theorem (proposed by Daniel Abadi) extends CAP: “When there is a Partition, choose A or C; Else (no partition), choose Latency or Consistency.” This captures the real trade-off most systems face daily: not partition tolerance, but latency vs. consistency.
- It says nothing about latency. A CP system that takes 30 seconds to respond during a partition is technically “available” by CAP’s definition (it eventually responds) but practically useless. Real system design cares about latency bounds, not just binary up/down.
- It conflates different kinds of failures. A network partition between data centers is a very different failure mode than a single node crash. CAP treats them the same, but your design response should be different.
Follow-up: Give me a concrete example of a system that is CP and a system that is AP, and explain the real-world consequences of each choice.
Strong Answer:- CP example: Google Spanner / CockroachDB. These databases use synchronized clocks (TrueTime in Spanner’s case) and consensus protocols to provide strong consistency across globally distributed replicas. During a network partition, a minority partition cannot reach quorum and becomes unavailable for writes. The consequence: if your application relies on Spanner and a partition isolates a region, users in that region cannot perform writes until the partition heals. Google accepts this because financial and transactional workloads require correctness over availability. An incorrect bank balance is worse than a temporarily unavailable banking app.
- AP example: DynamoDB (default mode) / Cassandra. These databases prioritize availability by allowing any replica to accept writes even during a partition. Two replicas might accept conflicting writes for the same key. When the partition heals, they reconcile using a conflict resolution strategy (last-writer-wins by default in both). The consequence: a user might update their profile photo in Region A, and a user in Region B might see the old photo for a few seconds (or longer during a partition) until replicas converge. For social media, this is fine. For an inventory system where two replicas both sell the “last” item, this can result in overselling.
- The practical lesson: most systems are not purely CP or AP. They make different choices for different operations. A banking system might use CP for balance transfers but AP for displaying the transaction history feed (eventual consistency is fine for a read-only audit trail).
Follow-up: How does the PACELC theorem change how you think about system design compared to CAP?
Strong Answer:- PACELC says: during a Partition, choose A or C; Else (normal operation), choose Latency or Consistency. The “Else” clause is what makes PACELC more useful than CAP for daily engineering decisions. Partitions are rare events. The latency-vs-consistency trade-off is something you face on every single request.
- For example, DynamoDB is PA/EL: during a partition it chooses availability (AP), and during normal operation it chooses low latency (EL) by reading from the nearest replica even if it might be slightly stale. CockroachDB is PC/EC: during a partition it chooses consistency (CP), and during normal operation it still chooses consistency (EC), accepting higher latency for reads that require a quorum.
- The PACELC framing leads to better design conversations. Instead of “is our system CP or AP?” (which only matters during rare partitions), you ask “what is our latency-consistency trade-off on every request?” This question drives real architectural decisions: do we read from replicas (fast, possibly stale) or from the leader (slow, always current)? Do we wait for quorum writes (durable, slow) or ack after writing to one node (fast, possibly lost)?
Q11: A junior developer on your team writes a Go service where multiple goroutines access a shared map without synchronization, arguing “it works fine in testing.” How do you explain the problem and what approach do you recommend?
Strong Candidate Answer:- Go maps are not safe for concurrent use. The Go runtime will actually detect concurrent map access and deliberately crash the program with a fatal error:
fatal error: concurrent map read and map write. This is not a subtle bug that might manifest under rare timing conditions. Go chose to make it an immediate, loud crash rather than silently corrupting data. - The reason it works in testing is that tests typically run with low concurrency and low frequency. The race condition requires specific interleaving: one goroutine writing while another reads. With 2 goroutines and a few operations, the probability is low. In production with 1,000 goroutines, it is a certainty.
- Recommended approaches, in order of preference:
- Avoid shared state entirely. Restructure the code so each goroutine owns its own data. Communicate via channels (Go’s CSP model). “Don’t communicate by sharing memory; share memory by communicating.” This is the Go idiom and eliminates the entire class of bugs.
sync.Mapif the map is read-heavy with rare writes and keys are stable.sync.Mapis optimized for this access pattern. It uses a read-only fast path that avoids locking for reads.sync.RWMutexaround a regular map if reads and writes are balanced. Read-lock for reads (multiple readers concurrently), write-lock for writes (exclusive access). This is the most common approach when you genuinely need a shared map.- Channel-based access. A single goroutine owns the map and accepts read/write requests via channels. This serializes all access through one goroutine and eliminates races by construction. It is elegant but can become a bottleneck if the map is accessed very frequently.
- I would also show the junior developer Go’s race detector:
go test -raceorgo build -race. Run the tests with the race detector enabled and you will see the race immediately, even if the test otherwise passes. Make-racea standard part of the CI pipeline.
Follow-up: When would you choose sync.Map over a sync.RWMutex + regular map? What are the trade-offs?
Strong Answer:
sync.Mapis optimized for two specific access patterns: (1) write-once-read-many (keys are set once and then read frequently), and (2) disjoint key access (different goroutines work on different keys). For these patterns,sync.Mapavoids lock contention by using an internal read-only copy of the map that can be accessed without any locking.- For workloads with frequent writes, frequent iteration, or where you need to atomically check-and-set (read a value, decide, then write),
sync.Mapperforms worse than aRWMutex+ regular map.sync.Mapalso has noLen()method (by design, because counting would require synchronization) and itsRangemethod is not a snapshot (entries can change mid-iteration). - The pragmatic advice: start with
sync.RWMutex+ regular map. It is simple, well-understood, and performs well for most workloads. Only switch tosync.Mapif profiling shows mutex contention is a bottleneck and your access pattern matchessync.Map’s sweet spot. Premature optimization towardsync.Mapoften makes code harder to reason about without measurable benefit.
Going Deeper: Explain how Go’s race detector works. Why can it detect races that never actually manifest?
Strong Answer:- Go’s race detector uses a technique based on the happens-before algorithm (a variant of ThreadSanitizer, originally developed by Google for C++). It instruments every memory access (read and write) at compile time, inserting tracking code that records which goroutine accessed which memory address and when.
- At runtime, the detector maintains a vector clock per goroutine. Every memory access is annotated with the goroutine’s vector clock. When two accesses to the same memory location are detected where (1) at least one is a write and (2) neither happens-before the other (no synchronization between them), a race is reported.
- The key insight is that the detector does not need the race to actually cause a problem. It detects unsynchronized accesses even if the specific interleaving in this run happened to be safe. The happens-before analysis is about the absence of synchronization, not about whether the specific timing caused corruption. This is why the race detector catches bugs that “work fine” in testing; the test might never hit the bad interleaving, but the detector sees that the synchronization is missing.
- The cost: race-detected binaries run 5-10x slower and use 5-10x more memory. This is why you run it in CI and testing, not in production (though some teams run it in a canary production instance at reduced traffic).
Q12: You are designing a system where 500 microservice instances need to run a specific batch job exactly once per hour. No duplicates, no misses. How do you coordinate this?
Strong Candidate Answer:- This is a distributed scheduling problem. You need leader election: exactly one instance runs the job each hour, and if that instance dies, another takes over. There are several approaches, each with different trade-offs.
- Approach 1: Database-based leader election. Use a
scheduled_jobstable with a lock row. Each instance tries toUPDATE scheduled_jobs SET locked_by = ?, locked_at = NOW() WHERE job_name = 'hourly_batch' AND (locked_by IS NULL OR locked_at < NOW() - INTERVAL '5 minutes'). TheWHEREclause implements both lock acquisition and stale-lock recovery. Only one instance succeeds (atomic row-level lock in the database). After the job runs, clearlocked_by. If the instance crashes, the stale-lock timeout (5 minutes) lets another instance take over.- Pros: No additional infrastructure. Uses your existing database.
- Cons: Polling-based (each instance polls the table). Database load under high instance count. Not instant failover (depends on stale timeout).
- Approach 2: Distributed lock with Redis/etcd. One instance acquires a distributed lock (
SET job:hourly_batch lock_value NX EX 3600), runs the job, and releases the lock. Other instances check the lock and skip. Use lock renewal if the job might take longer than the TTL.- Pros: Fast, well-understood.
- Cons: Same GC-pause / TTL-expiry issues discussed earlier. Need fencing tokens for correctness.
- Approach 3: Consensus-based leader election (etcd/ZooKeeper). Use etcd’s election API or ZooKeeper’s sequential ephemeral nodes. One instance becomes the leader and runs all scheduled jobs. If the leader dies, the session expires and a new leader is automatically elected.
- Pros: Strongest correctness guarantees. Automatic failover without polling.
- Cons: Requires running etcd/ZooKeeper. More operational complexity.
- Approach 4 (my recommendation for most teams): Use a managed scheduler. Kubernetes CronJobs, AWS EventBridge Scheduler + Lambda, or Cloud Scheduler + Cloud Run. The platform handles “run exactly one instance on a schedule.” You get retries, failure handling, and observability for free.
- Pros: Zero coordination code. Production-grade out of the box.
- Cons: Vendor-specific. Less control over execution environment.
- The nuance: “Exactly once” is still tricky even with a single executor. If the CronJob runs and succeeds but the success signal is lost (the pod is killed before reporting success), the scheduler might run it again. The job itself must be idempotent, regardless of which coordination mechanism you use. “Exactly-once scheduling” really means “at-least-once scheduling with an idempotent job.”
Follow-up: How does Kubernetes CronJob handle the “exactly once” problem internally?
Strong Answer:- Kubernetes CronJobs have a
concurrencyPolicyfield with three options:Allow(multiple jobs can run simultaneously),Forbid(skip the new run if the previous is still running), andReplace(kill the running job and start a new one). For “exactly once per hour,” you wantForbid. - However, Kubernetes CronJobs have known issues with missed schedules and double-fires. If the kube-controller-manager is down during the scheduled time, the job is missed (there is a
startingDeadlineSecondswindow where it can still be created late, but outside that window it is skipped). If there is a brief controller restart around the scheduled time, a job can be created twice. The Kubernetes documentation explicitly warns: “For every CronJob, the CronJob Controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job.” - The practical implication: Kubernetes CronJobs are “at-most-once” or “at-least-once” depending on configuration and failure timing. They are not “exactly once.” Your job must be idempotent. For critical scheduled work (financial reconciliation, data pipeline triggers), I would add application-level deduplication: the job checks a
last_successful_runtimestamp before executing and skips if it has already run for this period.
Follow-up: What if the batch job itself takes 90 minutes but the schedule is every 60 minutes?
Strong Answer:- This is a job overlap scenario and it needs to be handled explicitly. With
concurrencyPolicy: Forbidin Kubernetes, the 2nd invocation is simply skipped. This means you miss a run every time the job takes longer than the interval. - Better approaches:
- Fix the job. If a job consistently takes longer than its interval, the schedule is wrong. Either increase the interval or optimize the job. Investigate why it takes 90 minutes; often there are easy wins (batch database writes, parallelize independent steps, skip already-processed records).
- Sliding window with overlap prevention. The job records its start time and the “window” it covers (e.g., “processing events from 10:00 to 11:00”). The next invocation checks the last completed window and processes from there. If the 10:00 job finishes at 11:30, the 11:00 invocation was skipped, but the 12:00 invocation processes events from 11:00 to 12:00. No data is lost; it is just processed later.
- Queue-based processing instead of cron. Replace the hourly cron with a continuous consumer that processes events as they arrive. The “batch” is replaced by a stream processor. This eliminates the job-duration problem entirely but changes the architecture from batch to streaming.
Advanced Interview Scenarios
These questions target the gray areas where textbook answers actively mislead you. Each scenario is designed so that the obvious first-instinct answer is incomplete or wrong. They test production judgment, debugging intuition, and the kind of systems thinking that only comes from operating real infrastructure under pressure.Q13: Your team is excited about event sourcing for a new order management system. You have been asked to review the architecture proposal. What questions would you ask, and when would you push back against event sourcing?
Reveal Answer
Reveal Answer
- The first question I would ask is: “What query patterns does this system need to support?” Event sourcing stores state as a sequence of events. To get the current state, you replay events or read a projection. If the system’s primary use case is “show me the current status of order #12345” — which it almost certainly is for an order management system — you are going to need CQRS (Command Query Responsibility Segregation) with materialized read models. That means maintaining projections, handling projection lag, rebuilding projections when you find bugs in them, and managing the consistency gap between the write model (event log) and the read model (projection). You have just doubled your operational surface area.
- The second question: “How many events per entity do you expect over its lifetime?” An order might accumulate 10-30 events (created, paid, items_added, shipped, delivered). Replaying 30 events to reconstruct state is trivial. But I have seen teams use event sourcing for entities that accumulate thousands of events per instance — real-time bidding records, IoT sensor streams, high-frequency trading positions. At 50,000 events per entity, rebuilding state from scratch takes seconds. You need snapshots, and now you are managing snapshot frequency, snapshot storage, and snapshot invalidation alongside the event stream.
- The third question: “Does your team have experience operating event-sourced systems?” In my experience at a fintech company, we adopted event sourcing for our ledger service. The engineering team spent about 4 months on the event store and projection infrastructure before writing a single line of business logic. Schema evolution was the hardest part — when we needed to change the structure of an
OrderPlacedevent, we had to write an upcaster that transforms old events into the new shape during replay. We had 800 million historical events. Every schema change required testing the upcaster against the full event history. A traditional CRUD system with an audit log table would have given us 90% of the benefit at 30% of the cost. - My push-back framework: Event sourcing is the right choice when (a) you need a complete, immutable audit trail that is the source of truth (financial systems, compliance), (b) you need temporal queries (“what was the state at 3pm yesterday?”), and (c) you need to derive multiple read models from the same event stream. If the primary need is just an audit trail, a regular
audit_logtable with CDC is far simpler. If the team has never operated event sourcing before, I would strongly recommend starting with CRUD + audit log and migrating to event sourcing only when a concrete requirement demands it.
Follow-up: How do you handle schema evolution for events that are already stored in the event log?
Strong Answer:- You cannot change historical events. They are immutable facts. When your event schema needs to evolve, you have three strategies:
- Upcasters: A transformation function that converts old event shapes to new shapes at read time. When replaying events, the upcaster intercepts
OrderPlaced_v1and emitsOrderPlaced_v2with the new fields populated with defaults or derived values. Axon Framework and Marten both support this natively. - Versioned event types: Publish both
OrderPlaced_v1andOrderPlaced_v2in the stream. Consumers handle both. This avoids mutation of old events but forces consumers to understand every version forever. - Stream migration: Write a one-time migration that reads every event from the old stream, transforms it, and writes to a new stream. Then cut over consumers. This is the nuclear option — clean but expensive and risky for large event stores.
- Upcasters: A transformation function that converts old event shapes to new shapes at read time. When replaying events, the upcaster intercepts
- In practice, upcasters are the standard approach. But the gotcha is testing: your upcaster must be tested against real historical events, not synthetic test data. We learned this the hard way when an upcaster that worked perfectly on test events failed on a production event from 2019 that had a null field nobody remembered was possible.
Follow-up: What is the relationship between event sourcing and CQRS? Can you have one without the other?
Strong Answer:- You can absolutely have CQRS without event sourcing. CQRS just means separating read and write models. A traditional CRUD service that writes to a normalized PostgreSQL database and reads from a denormalized Redis cache is doing CQRS. You can also have event sourcing without CQRS — replay events every time you need current state — but this is impractical for any system with non-trivial query requirements, so in practice the two almost always appear together.
- The reason they are so tightly associated is that event sourcing makes CQRS nearly mandatory. Your write model is an event log. Your read model is a projection derived from that log. The projection is the query-optimized view. Without projections, every read requires replaying the event history, which is O(n) in the number of events per entity.
Q14: You are debugging an incident where messages in your dead letter queue (DLQ) grew from 50 to 15,000 overnight. Walk me through your investigation and response.
Reveal Answer
Reveal Answer
- The first thing I do is determine whether this is a sudden cliff or a gradual ramp. I pull the DLQ ingestion rate over time from our monitoring (CloudWatch metrics for SQS DLQ depth, or a Kafka consumer lag dashboard for DLQ topics). A sudden spike at 2:17 AM suggests a deployment or an upstream schema change at that exact time. A gradual increase over 8 hours suggests a slowly degrading dependency (database connection pool exhaustion, a third-party API returning increasing error rates, certificate expiration).
- Next, I sample and categorize. I pull 50 messages from the DLQ and look at the error metadata (most DLQ implementations include the original error message, stack trace, and retry count). I am looking for one dominant error or multiple distinct failure types. In my experience, 90% of DLQ spikes have a single root cause. The categories I look for:
- Deserialization failures (schema change broke the consumer — check recent producer deployments)
- Downstream dependency errors (database timeout, API 503 — check dependency health dashboards)
- Business logic validation failures (a new edge case the code does not handle — usually triggered by upstream data changes)
- Poison messages (malformed data that will never succeed regardless of retries)
- At a payments company I worked with, we had a DLQ spike of 12,000 messages. Sampling revealed they were all
JsonParseException. A producer team had deployed a change that switched a timestamp field from epoch milliseconds to ISO-8601 strings. No schema registry, no contract test. The fix was a consumer-side parser that handled both formats, a reprocessing run for the DLQ messages, and a post-incident action item to implement schema registry with compatibility enforcement. - For reprocessing, I do not just dump everything back into the main queue. I first fix the root cause. Then I reprocess in controlled batches (500 at a time) with monitoring on the consumer error rate. If the error rate spikes during reprocessing, I stop immediately — there may be a secondary issue. I also add a deduplication check before reprocessing in case some messages were partially processed before being DLQ’d.
- The operational response has three tiers: (1) Alert at 100 messages in DLQ per hour (something is wrong), (2) Page at 1,000 messages (active incident), (3) Circuit-breaker at 10,000 (pause the consumer group to stop accumulating more DLQ entries while you investigate).
Follow-up: How do you design a DLQ reprocessing pipeline that is safe and auditable?
Strong Answer:- I build a dedicated reprocessing service (not a script that someone runs manually). It reads from the DLQ, enriches each message with a
reprocessing_attemptheader and areprocessed_attimestamp, and publishes back to the original queue. The consumer’s idempotency logic handles duplicates. Every reprocessing action is logged to an audit table: message ID, original error, reprocessing time, outcome. We track reprocessing success rate as a metric. If reprocessing success rate drops below 95%, something new is wrong. - The critical safety feature: the reprocessing service has a rate limiter. If you dump 15,000 messages back into the main queue at once, you can overwhelm the consumer or the downstream dependencies that were already struggling. I typically reprocess at 10-20% of normal throughput with automatic pause on error rate spikes.
Follow-up: How do you decide when to just drop DLQ messages instead of reprocessing?
Strong Answer:- It depends entirely on the business impact. Metrics events, analytics pings, non-critical logs — if they are more than 24 hours old, the value of reprocessing approaches zero. Drop them and move on. Payment events, order state changes, inventory updates — every one must be accounted for. You reprocess or manually reconcile, no exceptions. The decision tree I use: (1) Is this message idempotent to reprocess? If no, manual review required. (2) Is the data still relevant? Analytics events from yesterday are stale; order events from yesterday are critical. (3) Can we reconcile without reprocessing? Sometimes it is faster to run a reconciliation job that compares the source of truth (the database) against the expected state and fixes discrepancies directly.
Q15: You are tracing a bug where a customer was charged twice. The payment flow goes: API Gateway -> Order Service (Kafka) -> Payment Service (gRPC) -> Stripe API. Each hop is async except the Stripe call. How do you trace what happened?
Reveal Answer
Reveal Answer
- The first thing I look for is a correlation ID (also called a trace ID or request ID). A well-instrumented system generates a unique ID at the API Gateway (the entry point) and propagates it through every downstream call — in HTTP headers (
X-Request-Id), Kafka message headers (trace_id), and gRPC metadata. Every log line, every metric, every span includes this ID. I search for the customer’s charge in Stripe, get the Stripepayment_intentID, then trace backward through our systems using the correlation ID. - If correlation IDs are properly instrumented (OpenTelemetry is the standard here), I go to Jaeger, Tempo, or Datadog APM and search for the trace. The distributed trace shows every span: API Gateway received the request at T0, Order Service published to Kafka at T1, Payment Service consumed the event at T2, Stripe was called at T3. If the customer was charged twice, I expect to see either (a) two Kafka messages for the same order (producer sent twice), (b) one Kafka message consumed twice by the Payment Service (consumer rebalance caused reprocessing), or (c) one Stripe call that was retried after a timeout and both the original and retry succeeded.
- Scenario (b) is the most common. The Payment Service consumed the event, called Stripe, Stripe returned 200, but the consumer crashed before committing its Kafka offset. On restart, the consumer re-read the event and called Stripe again. If the Payment Service did not pass an idempotency key to Stripe’s API (
Idempotency-Keyheader), Stripe creates a second charge. This is exactly the “irreversible side effect” problem from Q9. - The fix is layered: (1) Pass
Idempotency-Key: {order_id}to Stripe on every charge request. Stripe deduplicates on this key for 24 hours. (2) The Payment Service checks its ownprocessed_paymentstable before calling Stripe. (3) Add an alert on duplicate Stripe charges per order ID — a canary that catches idempotency failures. - For the immediate customer issue: I check Stripe’s dashboard for the duplicate charge, issue a refund for the second charge, and notify the customer proactively. Then I fix the root cause.
Idempotency-Key header. We had been silently double-charging ~40 customers per deployment for 3 months. The total refund was $47,000. After adding the idempotency key, duplicates dropped to zero. We also added a daily reconciliation job that compares our payments table against Stripe’s charge list and flags mismatches.Follow-up: How do you propagate trace context through Kafka without losing it?
Strong Answer:- Kafka messages support custom headers (since Kafka 0.11). The producer adds OpenTelemetry trace context as headers:
traceparent(the W3C Trace Context standard) and optionallytracestate. The consumer extracts these headers and creates a new span linked to the parent trace. Most OpenTelemetry instrumentation libraries for Kafka (likeopentelemetry-instrumentation-kafka-pythonor the Javaopentelemetry-kafka-clients) do this automatically. - The subtle issue is fan-out. When one Kafka message triggers processing that publishes to 3 different downstream topics, each downstream message should carry the same parent trace ID but create its own child span. This creates a trace tree, not a linear chain. Without proper span linking, you see a flat list of unrelated spans in your tracing UI instead of a causal tree.
- The other gotcha is batching. If your consumer processes messages in batches (100 at a time for throughput), each message has a different trace context. You need to either create one span per message within the batch (accurate but verbose) or create a batch span with links to all 100 parent traces (cleaner but harder to navigate). Most teams compromise by creating per-message spans for critical paths (payments) and batch spans for bulk processing (analytics).
Follow-up: Your distributed traces show a 50ms gap between the Kafka produce timestamp and the consumer processing timestamp, but during incidents this gap grows to 45 seconds. What is happening?
Strong Answer:- Under normal load, the 50ms gap is the end-to-end latency from produce to consume: broker write latency (~5ms), replication latency (~10ms), consumer poll interval, and processing queue depth. This is healthy.
- A jump to 45 seconds points to one of three things: (1) Consumer lag — the consumer is falling behind and processing messages that were produced 45 seconds ago. Check consumer lag metrics per partition. (2) Consumer group rebalance — during a rebalance, consumption pauses entirely for the affected partitions. A 45-second gap matches a typical eager rebalance duration. Check for rebalance events in the consumer logs (look for
Revoking previously assigned partitionsandSuccessfully joined group). (3)max.poll.interval.mstimeout — a consumer took too long processing a batch, was kicked from the group, triggering a rebalance, and the partition sat unprocessed until reassignment completed. - The diagnostic tool I reach for first is Kafka’s
__consumer_offsetstopic (viakafka-consumer-groups.sh --describe) to see the current lag per partition and the last commit timestamp. If one partition has 45 seconds of lag and others are current, I have a hot partition or a stuck consumer. If all partitions jumped simultaneously, it was a rebalance.
Q16: Your service uses a thread pool of 200 threads to handle requests. Each request makes a downstream HTTP call that normally takes 50ms. The downstream service starts responding in 5 seconds instead. What happens to your service, and what should you have done to prevent it?
Reveal Answer
Reveal Answer
- This is a textbook cascading failure via thread pool exhaustion, and it is one of the most common ways services die in production. Let me walk through the math.
- Under normal conditions: each request holds a thread for ~50ms. With 200 threads, you can handle 200 / 0.05 = 4,000 requests per second. The thread pool is healthy, threads are recycled quickly.
- When the downstream latency jumps to 5 seconds: each request now holds a thread for 5 seconds. With 200 threads, you can handle 200 / 5 = 40 requests per second. Your throughput just dropped by 99%.
- But it gets worse. Incoming requests arrive at the original rate (let us say 2,000/sec). All 200 threads are occupied within 100ms. New requests queue up waiting for a free thread. The queue grows. If the queue is unbounded, memory grows until OOM. If the queue is bounded, requests start getting rejected. Either way, your service is now effectively down, not because it has a bug, but because a downstream service got slow.
- What you should have done:
- Timeouts. Every downstream HTTP call should have an aggressive timeout. If the normal latency is 50ms, set a timeout of 500ms (10x normal, covering P99.9). After 500ms, kill the request, return an error, and free the thread. The downstream service being slow should not hold your threads hostage for 5 seconds.
- Circuit breaker. After N consecutive timeouts (say, 5 in a row), trip the circuit breaker. For the next 30 seconds, all requests to the downstream service fail immediately without even making the HTTP call. This frees threads instantly and prevents the queue from growing. After the cooldown, send a probe request to check if the downstream has recovered. Libraries: Hystrix (legacy), resilience4j (Java), Polly (.NET), cockatiel (Node.js).
- Bulkhead pattern. Do not share a single thread pool across all downstream calls. Dedicate a smaller pool (say, 50 threads) to the slow downstream service. When that pool is exhausted, only requests to that downstream are affected. Requests to other dependencies continue normally on the remaining 150 threads. This is fault isolation. Netflix popularized this with Hystrix’s thread pool isolation.
- Async with bounded concurrency. Instead of blocking a thread per request, use async HTTP clients (Netty, Java’s HttpClient with CompletableFuture, Node.js’s native async). The thread makes the request and is immediately freed. A callback handles the response. With async I/O, a slow downstream holds a connection, not a thread. Connections are cheaper than threads.
Follow-up: How do you determine the correct thread pool size for a service?
Strong Answer:- The classic formula is Little’s Law:
L = lambda * W, where L is the number of concurrent requests (threads needed), lambda is the arrival rate (requests per second), and W is the average processing time (seconds). If you need to handle 1,000 req/sec and each request takes 100ms, you need1000 * 0.1 = 100threads at steady state. - But that assumes uniform latency. Real systems have long-tail latencies. A P50 of 100ms but a P99 of 2 seconds means some threads are occupied 20x longer than average. You need headroom: I typically size thread pools at 2-3x the Little’s Law estimate and set a hard queue limit of 2x the pool size. Beyond that, fast-fail with HTTP 503.
- The other factor is what the threads are doing. If threads are doing CPU work, more threads than CPU cores causes context-switching overhead that actually reduces throughput. If threads are doing I/O waits (which is most web services), you can have many more threads than cores because most threads are sleeping. The practical rule: for I/O-bound services, 200-500 threads is common. For CPU-bound services,
cores * 2is a reasonable starting point. Profile your actual workload — thread pool sizing is empirical, not theoretical.
Follow-up: How does this problem manifest differently in Go compared to Java?
Strong Answer:- Go does not have a fixed-size thread pool in the same way. Each incoming HTTP request gets a goroutine, and goroutines are cheap (~2-4KB stack). You can have hundreds of thousands of goroutines without issue. So the “200 threads exhausted” scenario does not happen in the same way.
- But the equivalent failure mode in Go is connection exhaustion and memory pressure. If each goroutine makes an HTTP call to the slow downstream and blocks for 5 seconds, you accumulate goroutines: 2,000 req/sec * 5 sec = 10,000 goroutines blocked simultaneously. Each holds a TCP connection to the downstream. You can exhaust the downstream’s connection limit, your own file descriptor limit (
ulimit -n), or accumulate enough memory to trigger the OOM killer. - The fix in Go is the same principles, different tools:
context.WithTimeoutfor per-request deadlines,golang.org/x/sync/semaphoreor buffered channels for bounded concurrency (acting as a bulkhead), and libraries likesony/gobreakerfor circuit breakers. The key difference is that in Go you are managing concurrency limits explicitly via semaphores rather than implicitly via thread pool size.
Q17: Two events arrive at the same consumer within milliseconds: InventoryReserved and OrderCancelled, both for the same order. Depending on processing order, you either reserve-then-release (correct) or cancel-then-reserve (now you have phantom reserved inventory). How do you handle this?
Reveal Answer
Reveal Answer
- This is a causal ordering problem that partition-level ordering does not solve.
InventoryReservedcomes from the Inventory Service.OrderCancelledcomes from the Order Service. They are on different topics (or at least produced by different services). Kafka guarantees ordering within a partition of a single topic from a single producer. Cross-service, cross-topic ordering is not guaranteed. - Solution 1: State machine with version guards. The consumer maintains an order state machine. Each event includes the expected current state (or a version/sequence number). The consumer applies events only if the expected state matches. If
OrderCancelledarrives first, it transitions the order toCANCELLED. WhenInventoryReservedarrives, it expects the order to be inPENDINGstate. The state isCANCELLED, so the event is rejected or triggers a compensating action (release the reservation). This is optimistic concurrency applied to event processing. - Solution 2: Merge events into a single topic. If causal ordering between inventory and order events is critical, route both through a single topic partitioned by
order_id. All events for one order land on the same partition and are consumed in order. The trade-off: the Inventory Service and Order Service must agree on the topic, schema, and partition key. This creates coupling that event-driven architectures are supposed to avoid. - Solution 3: Event timestamps with reordering buffer. The consumer buffers events for each order for a short window (e.g., 500ms) and sorts by a logical timestamp or causal sequence number before processing. This handles small ordering inversions. The trade-off is added latency (500ms minimum) and complexity (what if the window is too short?).
- Solution 4 (my preference for most systems): Make every event handler idempotent and compensating. Do not try to enforce global ordering. Instead, design each handler so that it can detect out-of-order delivery and trigger the appropriate compensating action. If
InventoryReservedarrives for a cancelled order, the handler checks the order state, seesCANCELLED, and immediately publishes anInventoryReleasedevent. The system converges to the correct state regardless of delivery order. This is the philosophy behind eventual consistency: the system may pass through transient incorrect states but always converges.
ShipmentDispatched and ShipmentCancelled events. A dispatcher would cancel a shipment seconds after dispatching it (driver reassignment). Depending on event order, we had phantom shipments in the tracking system that showed “in transit” for shipments that were actually cancelled. The fix was Solution 4: the tracking consumer checked the canonical shipment state in the database before updating. If a ShipmentDispatched event arrived for a shipment already in CANCELLED state, the consumer emitted a TrackingCorrected event and updated the UI. Within 2-3 seconds, the tracking display always converged to the correct state. We monitored the “correction rate” as a metric — it ran at about 0.02%, which was acceptable.Follow-up: What is a vector clock, and how would it help with this problem?
Strong Answer:- A vector clock is a logical clock that tracks causal relationships across multiple nodes. Each service maintains a vector:
{OrderService: 5, InventoryService: 3, PaymentService: 7}. When a service sends an event, it increments its own entry and includes the full vector. When a service receives an event, it takes the element-wise maximum of its local vector and the received vector, then increments its own entry. - Two events are causally ordered if one vector is strictly less than or equal to the other (every entry is <=, and at least one is <). Two events are concurrent if neither vector dominates the other. In our scenario,
InventoryReservedandOrderCancelledwould have concurrent vectors (neither caused the other), which tells the consumer it cannot assume an ordering. The consumer can then apply conflict resolution rules. - In practice, vector clocks are heavy for high-throughput systems (the vector grows with the number of services). Most teams use a simpler approach: a centralized sequence number per entity (the order’s version counter) or Lamport timestamps. Amazon’s DynamoDB originally used vector clocks for conflict detection but switched to last-writer-wins for simplicity, acknowledging that the complexity was not justified for most access patterns.
Follow-up: How does this problem change if you are using a single-writer-per-entity pattern?
Strong Answer:- If you designate one service as the single writer for each entity — the Order Service is the only service that writes to the
orderstable and publishes order events — then the writer serializes all state changes. The Order Service receives both the “reserve inventory” result and the “cancel” request. It processes them in order, publishing eitherOrderConfirmedorOrderCancelled. Downstream consumers receive a single, consistently ordered stream of events from one producer. - This eliminates the cross-service ordering problem at the cost of centralizing write logic. The Order Service becomes the bottleneck and the single point of failure for all order state changes. It works well when the entity has a natural owner. It breaks down when multiple services have legitimate reasons to mutate the same entity independently (e.g., a fraud service that can freeze an order, a logistics service that can mark it as undeliverable, and a customer service that can escalate it).
Q18: Your system handles 50,000 events/second. The downstream database can handle 10,000 writes/second. You cannot scale the database further. What do you do?
Reveal Answer
Reveal Answer
- This is a backpressure design problem. The producer rate (50K/sec) exceeds the sink capacity (10K writes/sec) by 5x. You cannot just buffer indefinitely — that is a memory leak. You need to either reduce the write volume, increase the write efficiency, or reshape the data flow.
- Strategy 1: Write batching. Instead of one database write per event, accumulate events in memory and write in batches. If you batch 100 events into a single bulk insert, your 10,000 writes/sec handles 1,000,000 events/sec. The trade-off is latency: you wait up to N milliseconds (or until the batch is full) before writing. For a 100ms batching window at 50K events/sec, you batch 5,000 events per write, needing only 10 writes/sec. This is the single most effective technique I have used.
- Strategy 2: Write-behind cache. Write events to Redis (or an in-memory buffer) immediately, and flush to the database asynchronously in batches. The consumer acks after the Redis write (fast), and a separate flusher moves data to the database at the database’s pace. Risk: if Redis dies before flushing, you lose uncommitted events. Mitigate with Redis persistence (AOF) or by not acking the Kafka offset until the database write is confirmed (but this reintroduces the throughput bottleneck on the ack path).
- Strategy 3: Pre-aggregation. If the events are things like page views, clicks, or metrics, aggregate before writing. Instead of writing 50,000 individual
page_viewedevents, maintain in-memory counters and write one row per page per minute:page_id, minute, count. This collapses 50K events/sec into maybe 500 writes/sec. The trade-off: you lose individual event granularity. This is fine for analytics, unacceptable for an audit trail. - Strategy 4: Tiered storage. Write the hot, high-volume events to a fast append-only store (Kafka itself, or a time-series database like TimescaleDB/ClickHouse that handles 50K inserts/sec easily). The slow relational database receives only aggregated summaries or is populated via a periodic ETL. This separates the ingest path from the query path.
- Strategy 5: Backpressure propagation. If none of the above works, propagate backpressure upstream. Slow down the Kafka consumer (reduce
max.poll.records, increase the processing interval). Consumer lag grows, but the database is not overwhelmed. If the lag becomes unacceptable, this is a signal that the architecture needs to change — the database is not the right sink for this volume.
Follow-up: How do you implement backpressure in a Kafka consumer without losing messages?
Strong Answer:- Kafka consumers have natural backpressure: you control the poll rate. If you reduce
max.poll.recordsto 100 (from 500), eachpoll()returns fewer messages and your consumer processes slower. Consumer lag grows, but no messages are lost — they sit in Kafka until consumed. Kafka’s retention period (7 days by default) is your buffer. - For more granular control, implement a rate limiter in the consumer: after processing each batch, check the downstream health (database connection count, response latency). If the downstream is stressed,
Thread.sleep()or reduce the batch size before the next poll. This is cooperative backpressure — the consumer voluntarily slows itself. - The danger zone is
max.poll.interval.ms. If you slow down too much and exceed this timeout betweenpoll()calls, Kafka kicks the consumer from the group and triggers a rebalance. So your backpressure must stay within this bound. If you need to pause for minutes, callconsumer.pause(partitions)to halt fetching while still sending heartbeats, thenconsumer.resume(partitions)when the downstream recovers.
Follow-up: When is backpressure the wrong answer?
Strong Answer:- Backpressure is wrong when the producer cannot tolerate slowing down. In a user-facing API, if the downstream queue is full and you propagate backpressure to the API, the user sees a 503 or a long wait. That is often worse than dropping the event.
- The pattern here is load shedding: accept the request, return 200 to the user, and make a best-effort attempt to process the event. If the system is overloaded, drop low-priority events (analytics, telemetry) while preserving high-priority ones (payments, orders). This requires priority classification at the ingestion layer.
- Backpressure works when both sides of the pipe are internal services with no human waiting. It fails when a human is on the other end of the request. The rule: backpressure for machine-to-machine flows, load shedding for user-facing flows.
Q19: Your team runs a multi-region active-active system. Both the US and EU regions can accept writes for the same customer. A customer updates their email to “new@example.com” in the US region and simultaneously updates their phone number to “+44…” in the EU region. What happens?
Reveal Answer
Reveal Answer
- This is the write-write conflict problem in multi-region active-active systems, and the answer depends entirely on your conflict resolution strategy. Let me walk through the options.
- Last-writer-wins (LWW) on the full record is the default in many databases (DynamoDB global tables, Cassandra). The problem is exactly as described: if the EU write is “later” (by wall clock), the EU record overwrites the US record entirely. The customer’s email change is silently lost. The customer sees their new phone number but their email reverted. This is data loss that is invisible in your metrics and nearly impossible for the customer to diagnose.
- Per-field LWW is better. Instead of replacing the entire record, merge at the field level. The US write sets
email = "new@example.com"at T1. The EU write setsphone = "+44..."at T2. During replication, the system sees thatemailwas modified in US andphonewas modified in EU. No conflict — merge both changes. The customer gets both updates. CRDTs (specifically, a Last-Writer-Wins Register per field) implement this. DynamoDB does not do this natively — you would need application-level merge logic. - Application-level conflict resolution is the most robust. When the replication layer detects conflicting writes to the same record, it invokes application logic to resolve them. For a user profile, the merge is usually field-level LWW (as above). For an inventory count, the merge might be to take the minimum (conservative). For a shopping cart, the merge might be a union of items. Riak and CouchDB expose conflict resolution hooks. DynamoDB global tables and Cassandra use LWW without hooks, so you need to handle conflicts in application code during reads.
- The deceptively hard part is clock synchronization. LWW requires a globally consistent notion of “which write happened last.” But clocks across regions drift. If the US region’s clock is 500ms ahead of the EU region’s clock, the US write is always “later” even when the EU write happened after. Google Spanner solves this with TrueTime (GPS-synchronized clocks with bounded uncertainty). Most other systems use NTP, which has unbounded drift under adverse conditions. In practice, NTP drift is usually under 10ms, which is fine for human-scale interactions. But for machine-generated writes at millisecond intervals, NTP drift can cause incorrect LWW decisions.
Follow-up: How do CRDTs solve this problem, and what are their limitations?
Strong Answer:- CRDTs (Conflict-free Replicated Data Types) are data structures designed so that concurrent updates from different replicas always merge deterministically without coordination. For the user profile example, you would use a LWW-Register per field. Each field stores
(value, timestamp). Merge takes the entry with the higher timestamp. This gives you per-field LWW with mathematical guarantees of convergence. - For more complex structures: a G-Counter (grow-only counter) merges by taking the element-wise maximum of per-node counters. An OR-Set (observed-remove set) allows concurrent adds and removes without losing data.
- Limitations are significant. CRDTs can only model operations that are commutative, associative, and idempotent. You cannot enforce “balance must not go negative” with a CRDT because that requires coordination (checking the current value before decrementing). You cannot implement a global unique constraint with a CRDT. You cannot implement a “transfer” (debit A, credit B) atomically with CRDTs. For business logic that requires these invariants, you need coordination — which means you need a leader, which means you sacrifice the active-active property for those specific operations.
- The practical rule: use CRDTs for data where convergence matters more than invariants (user profiles, shopping carts, collaborative documents). Use leader-based coordination for data where invariants must hold (account balances, inventory counts, unique email constraints).
Follow-up: How does DynamoDB global tables handle conflicts compared to Spanner’s approach?
Strong Answer:- DynamoDB global tables use last-writer-wins at the item level, determined by the item’s last write timestamp. There is no application-level conflict resolution hook. If two regions write the same item within the replication window (~1 second), the write with the later timestamp wins and the other is silently discarded. DynamoDB does not even tell you a conflict occurred.
- Spanner takes the opposite approach: strong consistency via synchronized clocks. Spanner’s TrueTime API uses GPS and atomic clocks in every data center to provide a globally consistent timestamp with bounded uncertainty (typically under 7ms). Writes are serialized globally, so there are no conflicts — every write sees the result of all previous writes. The cost is latency: a write in Spanner must wait for the TrueTime uncertainty interval to elapse before committing, to guarantee that no other write with an earlier timestamp can appear later. This adds ~7ms to every write.
- The fundamental trade-off: DynamoDB gives you low-latency active-active writes but with silent conflict resolution (data loss risk). Spanner gives you global consistency but with higher write latency and the requirement that writes route through a leader region for a given partition. Pick based on whether your business tolerates silent data loss (DynamoDB) or extra latency (Spanner).
Q20: A Kafka consumer group with 6 consumers and 24 partitions is experiencing rebalance storms — rebalances are happening every 2-3 minutes, causing repeated processing pauses. How do you diagnose and fix this?
Reveal Answer
Reveal Answer
session.timeout.ms to avoid rebalances.” They apply a band-aid without diagnosing the root cause. Increasing the timeout hides the symptom (missed heartbeats) while the underlying problem (slow processing or GC pauses) continues.What strong candidates say:- Rebalance storms are one of the most operationally painful Kafka issues. A “storm” means rebalances are triggering cascading rebalances — the group never stabilizes. Each rebalance pauses processing, creates lag, the lag causes consumers to process larger backlogs, which makes them slower, which triggers more rebalances. It is a feedback loop.
- Diagnosis step 1: Check the rebalance trigger. Consumer logs will show why each rebalance happened. The three most common causes:
Member consumer-3 has failed to heartbeat— the consumer’s heartbeat thread could not reach the group coordinator withinsession.timeout.ms. Cause: network issues, long GC pauses, or the consumer’s main thread is so busy that the JVM cannot schedule the heartbeat thread.Member consumer-3 has exceeded max.poll.interval.ms— the consumer took too long betweenpoll()calls. Cause: processing a batch of messages took longer than the configured interval (default 5 minutes). This often happens when a batch contains a message that triggers a slow downstream call or a message that causes an exception and retry loop.New member joined/Member left— a consumer instance is crashing and restarting, or an autoscaler is adding/removing instances too aggressively. Each join/leave triggers a rebalance.
- Diagnosis step 2: Check if it is a single consumer or all consumers. If one consumer is repeatedly being kicked and rejoining, it is likely that consumer has a problem (resource starvation, bad hardware, network partition to the coordinator). If all consumers are cycling, it is likely a configuration issue or a systemic load problem.
- Fixes:
- Switch to CooperativeStickyAssignor. The default eager rebalance protocol revokes ALL partitions from ALL consumers during a rebalance. CooperativeSticky only revokes the partitions that need to move. This dramatically reduces the rebalance blast radius and breaks the feedback loop because non-moving partitions continue processing. This single change fixes most rebalance storms.
- Enable static group membership. Set
group.instance.idon each consumer. When a consumer restarts (e.g., during a deployment), it reclaims its partitions without triggering a full rebalance. Without static membership, a rolling deployment of 6 consumers triggers 12 rebalances (6 leaves + 6 joins). - Tune
max.poll.interval.msandmax.poll.records. If processing is slow, reducemax.poll.recordsso each batch is smaller and completes within the interval. Or increasemax.poll.interval.msif your processing genuinely needs more time (but this delays detection of actually-dead consumers). - Fix the root cause of slow processing. Profile the consumer. Is it making synchronous HTTP calls per message? Batch them. Is it doing full-table scans in the database? Add an index. Is it spending time in GC? Tune heap size and GC settings. The rebalance storm is a symptom, not the disease.
- Stabilize the consumer count. If autoscaling is adding and removing consumers based on CPU, every scale event triggers a rebalance. Use lag-based scaling instead of CPU-based, and add hysteresis (do not scale down until lag has been stable for 10 minutes).
max.poll.interval.ms, triggering secondary rebalances. The secondary rebalances caused more lag, which caused more timeouts, and so on for 45 minutes. The fix was three changes applied together: (1) switch to CooperativeStickyAssignor (rebalances went from 30 seconds to under 5 seconds), (2) set group.instance.id per pod (restarts reclaimed partitions without rebalance), and (3) increase max.poll.interval.ms from 5 minutes to 10 minutes with reduced max.poll.records. Deployment rebalance impact went from 45 minutes to under 30 seconds.Follow-up: What is the difference between session.timeout.ms and max.poll.interval.ms, and why does Kafka have both?
Strong Answer:
session.timeout.msgoverns the heartbeat-based liveness check. The consumer’s background heartbeat thread sends heartbeats to the group coordinator everyheartbeat.interval.ms. If the coordinator does not receive a heartbeat withinsession.timeout.ms, it declares the consumer dead and triggers a rebalance. This detects crashed consumers or network partitions.max.poll.interval.msgoverns the application-level liveness check. It measures the time between consecutive calls toconsumer.poll(). If the application thread is stuck (infinite loop, deadlock, extremely slow processing), the heartbeat thread may still be running (it is a separate thread), sosession.timeout.mswill not trigger. But the consumer is not making progress.max.poll.interval.mscatches this case.- Kafka has both because they detect different failure modes. A crashed JVM stops both heartbeats and polling —
session.timeout.mscatches it. A hung application thread stops polling but heartbeats continue —max.poll.interval.mscatches it. Beforemax.poll.interval.mswas introduced (in Kafka 0.10.1), a consumer with a stuck processing thread would hold its partitions indefinitely while making no progress, because heartbeats kept it “alive” in the group.
Follow-up: How would you monitor a Kafka consumer group in production to detect rebalance issues before they become storms?
Strong Answer:- I monitor five key metrics:
- Rebalance rate — number of rebalances per hour per consumer group. More than 2-3 per hour outside of deployments is a red flag. Alert at 5 per hour.
- Rebalance duration — how long each rebalance takes (time from first revocation to all partitions assigned). With eager protocol, this includes the full stop-the-world pause. With cooperative, it is shorter. Alert if any rebalance exceeds 60 seconds.
- Consumer lag per partition — the gap between the latest produced offset and the last committed consumer offset. Look for partitions where lag is growing. A growing lag immediately after a rebalance confirms the feedback loop.
- Processing time per batch — how long the consumer takes between
poll()calls. If this approachesmax.poll.interval.ms, you are one slow message away from a rebalance. Alert at 80% of the configured interval. - Consumer group membership changes — track joins and leaves. Frequent membership changes (outside deployments) indicate unstable consumers.
- Tools: Kafka’s built-in JMX metrics (
kafka.consumer:type=consumer-coordinator-metrics), Burrow (LinkedIn’s consumer lag monitoring), Confluent Control Center, Datadog Kafka integration, or Prometheus with the kafka-exporter. I prefer Burrow because it evaluates lag trends, not just absolute lag — it can distinguish between “high but stable lag” (acceptable during a burst) and “increasing lag” (consumer falling behind).
Q21: You need to migrate a live system from RabbitMQ to Kafka with zero downtime and no lost messages. How do you approach this?
Reveal Answer
Reveal Answer
- This is a strangler fig migration applied to messaging infrastructure. The principle: never do a big-bang cutover. Run both systems in parallel, verify correctness, and shift traffic incrementally.
- Phase 1: Dual-write from producers (2-4 weeks).
- Modify producers to write to both RabbitMQ and Kafka simultaneously. The RabbitMQ write is the “source of truth” — consumers still read from RabbitMQ and process normally. The Kafka write is a “shadow” — a separate shadow consumer reads from Kafka and validates that every message in Kafka matches what RabbitMQ delivered. This catches serialization issues, partitioning bugs, and configuration errors before any production traffic touches Kafka.
- The dual-write introduces the dual-write consistency problem: if the RabbitMQ write succeeds but the Kafka write fails, you have divergent state. For this migration phase, I treat RabbitMQ as authoritative. The Kafka shadow consumer only validates; it does not trigger business logic. Failed Kafka writes are logged and retried asynchronously.
- Phase 2: Dual-read with Kafka as primary (1-2 weeks).
- Switch the primary consumer to read from Kafka. Keep the RabbitMQ consumer running as a fallback that only logs (does not process) for comparison. Monitor: are Kafka and RabbitMQ delivering the same messages in the same order? Are there any messages in RabbitMQ that are missing from Kafka (indicating a dual-write failure)?
- Set up a reconciliation job that compares processed message IDs from both systems daily. Any discrepancy triggers an alert.
- Phase 3: Cut over (1 day).
- Stop the RabbitMQ consumers. Drain remaining RabbitMQ messages (wait for queue depth to reach zero). Stop the RabbitMQ producers. The system is now fully on Kafka.
- Keep RabbitMQ running (but idle) for 2 weeks as a safety net. If a critical issue is discovered in Kafka, you can revert producers to RabbitMQ within minutes.
- Phase 4: Decommission RabbitMQ.
- Remove the dual-write code from producers. Shut down RabbitMQ. Clean up monitoring and alerting.
Follow-up: How do you handle the fact that RabbitMQ and Kafka have fundamentally different delivery semantics?
Strong Answer:- The biggest semantic difference is message acknowledgment and redelivery. In RabbitMQ, a consumer explicitly
acks ornacks each message. If the consumer crashes without acking, RabbitMQ redelivers to another consumer. In Kafka, the consumer commits offsets periodically. If the consumer crashes, it re-reads from the last committed offset. - This means the failure behavior is different during the migration. A message that would have been redelivered to a different consumer in RabbitMQ (because consumer A crashed) will be redelivered to the same partition’s consumer in Kafka (because Kafka assigns partitions, not messages). If your consumers are stateful or have consumer-specific side effects, this behavioral change matters.
- The other major difference: RabbitMQ removes messages after acknowledgment. Kafka retains messages regardless of consumption. Consumers that relied on RabbitMQ’s “once consumed, it is gone” behavior may need adjustment. For example, a consumer that re-reads the queue on restart to “recover” unprocessed messages will not work the same way in Kafka — it will re-read potentially millions of messages from the last committed offset unless you manage offsets carefully.
Follow-up: What if you cannot modify the producer code — for example, it is a third-party system?
Strong Answer:- If you cannot modify the producer, you need a bridge: a component that consumes from RabbitMQ and produces to Kafka. This is a dedicated service (or a connector — Kafka Connect has a RabbitMQ source connector) that reads messages from the RabbitMQ queue, transforms them if needed, and publishes to the Kafka topic.
- The bridge is effectively an “outbox relay” between two messaging systems. It must be idempotent (handle redeliveries from RabbitMQ), must preserve message ordering where required (partition by the same key the producer uses), and must handle backpressure (if Kafka is slow, the bridge’s RabbitMQ consumer should slow down, not lose messages).
- The advantage of the bridge approach: the producer never changes, and you can run RabbitMQ and Kafka in parallel indefinitely. The disadvantage: you now have a critical single point in the pipeline (the bridge) that must be highly available, monitored, and scaled. A bridge outage means messages queue in RabbitMQ and Kafka consumers see a gap.
Q22: Your team uses a state machine for order processing. An engineer proposes adding a “Retry” state that an order enters whenever any step fails. What is wrong with this design, and what would you recommend instead?
Reveal Answer
Reveal Answer
- A single “Retry” state is an information-destroying anti-pattern. The moment you move an order to “Retry,” you have lost the most critical piece of context: what step failed and what step should be retried? An order that failed at payment and an order that failed at shipping are in completely different situations, requiring different recovery actions, different timeouts, and different escalation paths. But in the proposed design, they are both in the same “Retry” state. Your ops dashboard shows “427 orders in Retry state” and tells you nothing about what is actually wrong.
- The correct design: error states are siblings of the step they failed from, not a generic bucket.
PAYMENT_FAILED(sibling ofPAID) — retry the payment, with a 30-second exponential backoff, max 3 attempts. If all retries fail, transition toREQUIRES_MANUAL_REVIEWwith reason “payment_failure.”SHIPPING_FAILED(sibling ofSHIPPED) — retry the shipping API, different timeout (60 seconds, because carrier APIs are slower), max 5 attempts. If all retries fail, transition toREQUIRES_MANUAL_REVIEWwith reason “shipping_failure.”- Each error state carries context: the error message, the retry count, the next retry time, and the step that should be retried.
- The operational benefits are enormous. You can query
SELECT COUNT(*), state FROM orders WHERE state LIKE '%_FAILED' GROUP BY stateand immediately see: 12 orders inPAYMENT_FAILED(check the payment gateway), 3 inSHIPPING_FAILED(check the carrier API), 1 inINVENTORY_RESERVATION_FAILED(check the warehouse system). With a generic “Retry” state, you would have to inspect each order individually. - The retry logic also differs per error state. Payment failures might need exponential backoff with jitter (to avoid thundering herd against the payment gateway). Shipping failures might need a longer wait (carrier APIs often have maintenance windows). Inventory failures might not be retryable at all (item is out of stock — escalate immediately). A single “Retry” state forces a one-size-fits-all retry policy.
RETRY state. When the payment gateway had a 4-hour outage, 8,000 orders went to RETRY. When the carrier API had a brief blip, 200 orders went to RETRY. All 8,200 orders were in the same state. The retry worker processed them in FIFO order, mixing payment retries (which would all fail because the gateway was still down) with shipping retries (which would succeed immediately). The shipping retries were blocked behind 8,000 doomed payment retries. Customers whose orders were ready to ship waited 6 hours because the system could not distinguish between retryable-now and retry-later. We refactored to per-step error states with independent retry queues. The next gateway outage? Shipping retries processed in under a minute. Payment retries waited in their own queue until the gateway recovered, then drained in 15 minutes.Follow-up: How do you handle the combinatorial explosion of error states as the number of steps grows?
Strong Answer:- With 8 processing steps, you could potentially have 8 corresponding error states plus 8
REQUIRES_MANUAL_REVIEWvariants. That is 24 states total. This sounds like a lot, but the complexity is explicit and manageable, whereas a single “Retry” state hides the same complexity in opaque, unqueryable ways. - To manage the growth: (1) Use a naming convention that makes the structure clear:
{STEP}_FAILEDand{STEP}_MANUAL_REVIEW. (2) Use a state machine library or framework that represents the transition table as data (a dictionary or DSL), not as nested if-else code. (3) Automate the retry infrastructure — each_FAILEDstate has a retry policy defined as configuration (max attempts, backoff strategy, timeout), not as code. Adding a new step means adding one entry to the retry policy configuration. - The philosophical point: explicit states feel like “more code” but they are actually less complexity. Every possible state the system can be in is visible, queryable, and monitorable. The alternative — fewer named states but more implicit states (an order that is “in Retry” but you need to check 3 other fields to know what kind of retry) — is hidden complexity that bites you during incidents.
Follow-up: How do you prevent an order from being stuck in a _FAILED state forever?
Strong Answer:
- Every
_FAILEDstate has a maximum retry count and a maximum age. If an order has been inPAYMENT_FAILEDfor more than 2 hours or has retried more than 5 times (whichever comes first), it automatically transitions toREQUIRES_MANUAL_REVIEW. A scheduled job runs every 5 minutes scanning for expired_FAILEDorders. - For
REQUIRES_MANUAL_REVIEW, the same principle applies: if no human acts within 24 hours, the system sends an escalation alert. If no action within 48 hours, it auto-cancels the order and notifies the customer. No order should be in a terminal error state without a human being aware and accountable. - The key metric is mean time in error state. Track this per step. If
PAYMENT_FAILEDorders average 45 seconds in the error state (a transient retry), that is healthy. If they average 3 hours, something systemic is wrong with the payment gateway relationship.