
Part XV — Event Sourcing, Messaging, and Async Systems

Asynchronous messaging changes the fundamental contract of your system: instead of “I did the thing,” your service says “I requested the thing.” This unlocks decoupling, resilience, and scalability — but introduces a new class of problems: duplicate messages, ordering, eventual consistency, and the terrifying question “did my message actually get processed?” Every pattern in this section exists because async is powerful but unsafe by default.

Real-World Stories: Why This Matters

In 2010, LinkedIn had a problem that was quietly strangling their growth. The company needed to move massive amounts of data between systems — user activity events, metrics, log data, news feed updates — and every integration was a bespoke point-to-point pipeline. With N source systems and M destination systems, they were staring down N x M individual pipelines, each with its own format, delivery semantics, and failure modes. Adding one new data source meant building integrations with every consumer. Adding a new consumer meant integrating with every source. The O(N x M) complexity was unsustainable.

Jay Kreps, Neha Narkhede, and Jun Rao created Kafka to solve this. Their key insight was deceptively simple: model the data infrastructure as a distributed commit log. Every producer appends to the log. Every consumer reads from the log at its own pace. The log is the single source of truth. This reduced O(N x M) to O(N + M) — each system only needs to know how to talk to Kafka. Producers write, consumers read, and the two sides do not need to know about each other.

What started as an internal tool for LinkedIn’s data pipelines became the backbone of modern event-driven architecture. Today, Kafka processes trillions of messages per day across companies like Netflix, Uber, Airbnb, and Goldman Sachs. The original paper — “Kafka: a Distributed Messaging System for Log Processing” — is worth reading not for the implementation details (much has changed) but for the clarity of the problem statement. The lesson: the best infrastructure emerges from teams solving real, painful problems at scale, not from committees designing abstract specifications.
Between 1985 and 1987, a radiation therapy machine called the Therac-25 delivered massive overdoses of radiation to at least six patients, killing three and seriously injuring the others. The root cause was a race condition in the machine’s control software.

The Therac-25 had two modes: a low-power electron beam mode and a high-power X-ray mode. When an operator typed quickly — switching modes and hitting the “fire” button in rapid succession — the software entered a state where the high-power beam was active but the safety mechanisms (a metal spreader that diffuses the beam for X-ray mode) were not in position. The race condition occurred because two concurrent tasks — one updating the treatment mode, another configuring the beam — could interleave in a way the developers never tested. The bug only manifested when the operator typed faster than a specific threshold, which is why it took months and multiple casualties to identify.

What makes the Therac-25 story essential reading for every software engineer is not just the tragedy but what enabled it. The developers had removed hardware safety interlocks that existed in earlier models (the Therac-6 and Therac-20), trusting the software to enforce safety constraints alone. The software was written by a single programmer over several years with no independent code review, no formal verification, and no systematic testing of concurrent execution paths. Error messages displayed to operators were cryptic one-word codes like “MALFUNCTION 54” with no documentation on what they meant or how serious they were. Operators learned to just hit the “proceed” button.

This story is the reason safety-critical systems engineering exists as a discipline. It is also a sobering reminder: concurrency bugs are not just performance problems or occasional glitches. In the wrong system, they are lethal. Every time you write go func() or new Thread(), you are creating the possibility of interleavings your tests will never exercise.
Uber’s platform processes millions of events per second: ride requests, driver location updates (every 4 seconds from every active driver), fare calculations, ETA estimates, surge pricing signals, and payment events. Every one of these is time-sensitive — a rider does not want to wait 30 seconds for their driver’s location to update on the map. The system must be both high-throughput and low-latency.

Uber built their real-time event infrastructure on top of Apache Kafka. Every significant action — rider opens the app, driver accepts a trip, car moves, payment is processed — is published as an event to Kafka topics. Downstream consumers handle everything from real-time maps to fraud detection to dynamic pricing. At peak (New Year’s Eve, for example), the system handles order-of-magnitude spikes without degradation. Kafka’s partitioning model is critical here: ride events are partitioned by geographic region, so a surge in New York does not affect processing for rides in London.

But the real engineering challenge is not just throughput — it is ordering and consistency under partition. When a rider’s trip involves events from multiple services (matching, routing, pricing, payments), those events must be processed in a causally consistent order even though they originate from different systems, traverse different Kafka topics, and are consumed by different services. Uber solved this with a combination of per-entity partitioning (all events for a single trip go to the same partition), vector clocks for causal ordering across services, and a custom cross-datacenter replication service called uReplicator. The lesson: Kafka gives you the building blocks, but at Uber’s scale, you end up building significant infrastructure on top of those building blocks.
Discord delivers billions of messages per day to hundreds of millions of users, many of them in real-time voice and text channels with thousands of concurrent participants. One of their most deceptively difficult challenges is message ordering.

In a small chat room, ordering is trivial: messages arrive at the server, the server assigns a timestamp, and clients display messages in timestamp order. But at Discord’s scale, this breaks down in multiple ways. Messages arrive at different server instances from different geographic regions with slightly different clock times. A user on a slow mobile connection might send a message that arrives at the server after a reply to that message has already been sent by someone on a fast connection. When a channel has 50,000 concurrent viewers, broadcasting every message to every viewer in the correct order with sub-100ms latency is a distributed systems problem, not a chat problem.

Discord’s solution involves a custom message ID scheme based on Twitter’s Snowflake format — a 64-bit ID encoding a millisecond timestamp, a worker ID, and a sequence number. This gives messages a globally unique, roughly-ordered identifier without requiring centralized coordination. For channels with heavy traffic, Discord uses a combination of Kafka for durable ordering and an in-memory fan-out layer for real-time delivery. When ordering conflicts occur (two messages with near-identical timestamps), Discord applies a deterministic tiebreaker (lower worker ID wins) so that every client converges to the same order, even if they initially received messages out of sequence. The subtle insight: perfect global ordering is not necessary — consistent ordering (every client sees the same order) is what users actually need.
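The Snowflake idea is small enough to sketch in code. This is an illustrative Python model only: the field widths below are Twitter's original layout (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence), and the epoch constant is an assumption; Discord's exact widths and epoch differ.

```python
# Snowflake-style 64-bit ID: [timestamp | worker ID | sequence].
# Widths here follow Twitter's original split; Discord's differ slightly.
EPOCH_MS = 1_288_834_974_657  # an assumed fixed custom epoch; any constant works

def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    assert 0 <= worker_id < 1024 and 0 <= sequence < 4096
    return ((timestamp_ms - EPOCH_MS) << 22) | (worker_id << 12) | sequence

def decode_id(snowflake: int) -> tuple:
    return ((snowflake >> 22) + EPOCH_MS,   # millisecond timestamp
            (snowflake >> 12) & 0x3FF,      # worker ID
            snowflake & 0xFFF)              # per-worker sequence number

# IDs sort by timestamp first, then worker ID: exactly the deterministic
# tiebreaker described above (lower worker ID wins for identical timestamps).
a = make_id(1_700_000_000_000, worker_id=3, sequence=0)
b = make_id(1_700_000_000_000, worker_id=7, sequence=0)
assert a < b
assert decode_id(a) == (1_700_000_000_000, 3, 0)
```

Because ordering falls out of plain integer comparison, no coordination between servers is needed at read time.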

Chapter 22: Messaging

22.1 Queues vs Topics

Analogy: Message queues are like a post office. The sender drops off a letter and leaves — they do not need to wait for the recipient to be home. The post office holds the letter until the recipient picks it up. If the recipient is on vacation for a week, the letter waits patiently. If the sender sends ten letters, the post office delivers them one by one when the recipient is ready. This is the fundamental value of asynchronous messaging: the sender and receiver do not need to be available at the same time, and neither needs to know anything about the other. The post office (the broker) is the decoupling layer.
These serve fundamentally different purposes: Queue (point-to-point): one message -> one consumer. Used for work distribution. Example: 1000 image resize jobs in a queue, 10 workers consuming them. Each job is processed by exactly one worker. If you add more workers, work is distributed across them. Use for: background jobs, task processing, load leveling.
Queue: [job1, job2, job3, job4, job5]
Worker A picks job1, Worker B picks job2, Worker C picks job3...
Each job processed exactly once.
Topic (pub/sub): one message -> all subscribers. Used for event broadcasting. Example: an OrderPlaced event is published to a topic. The Inventory Service, Email Service, and Analytics Service all receive it. Each subscriber gets every message. Use for: event-driven architecture, notifying multiple services, decoupling producers from consumers.
Topic: OrderPlaced event published
  -> Inventory Service (receives it, reserves stock)
  -> Email Service (receives it, sends confirmation)
  -> Analytics Service (receives it, updates dashboard)
All three get the same message independently.
Kafka’s model: Kafka uses “consumer groups” to combine both patterns. Messages in a topic partition go to one consumer within each group (queue behavior within a group) but to all groups (topic behavior across groups). This is why Kafka is both a queue and a pub/sub system.
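The dual behavior can be sketched with a toy in-memory model. This is illustrative Python only, not the Kafka client API: partition assignment here is a simple round-robin, where real Kafka uses pluggable assignors.

```python
# Toy model of consumer-group semantics: within a group, each partition is
# owned by exactly one consumer (queue behavior); every group independently
# receives every message (pub/sub behavior).
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Round-robin partition assignment within one consumer group."""
    assignment = defaultdict(list)
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

def deliver(message, partition, groups):
    """Return (group, consumer) pairs: one consumer per group gets the message."""
    receivers = []
    for group_name, assignment in groups.items():
        for consumer, owned in assignment.items():
            if partition in owned:
                receivers.append((group_name, consumer))
    return receivers

groups = {
    "billing":   assign_partitions([0, 1, 2, 3], ["billing-1", "billing-2"]),
    "analytics": assign_partitions([0, 1, 2, 3], ["analytics-1"]),
}
# A message on partition 2 reaches ONE consumer per group: two deliveries total.
assert deliver("OrderPlaced", 2, groups) == [("billing", "billing-1"),
                                             ("analytics", "analytics-1")]
```

Scaling the billing group adds consumers that split the partitions; adding a whole new group gets its own independent copy of the stream.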

22.2 Delivery Semantics

At-most-once: May lose messages. Never duplicate. Fast. At-least-once: Never lose. May duplicate. Standard choice. Requires idempotent consumers. Exactly-once: True exactly-once delivery is impossible in distributed systems (Two Generals problem). What we achieve is exactly-once processing semantics: messages may be delivered more than once, but the consumer’s processing logic ensures that processing a message multiple times has the same effect as processing it once. Kafka’s “exactly-once” (since 0.11) achieves this through idempotent producers and transactional consumers — not through actual exactly-once network delivery.
Why exactly-once delivery is technically impossible. The Two Generals Problem proves that no protocol can guarantee a message is delivered exactly once over an unreliable network. Consider: a producer sends a message and the broker acknowledges it. But what if the acknowledgment is lost? The producer does not know if the broker received the message or not. It has two choices: (1) resend (risking a duplicate) or (2) do nothing (risking a lost message). There is no third option. Every “exactly-once” system actually implements at-least-once delivery + idempotent processing — the combination approximates exactly-once semantics at the application level. Kafka achieves this with idempotent producers (broker deduplicates based on producer ID + sequence number) and transactional consumers (offset commits and output writes in a single atomic transaction). The network still delivers duplicates; the system just makes them invisible.
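The broker-side half of this, deduplicating on (producer ID, sequence number), fits in a few lines. A simplified sketch, not Kafka's actual implementation: real brokers track sequences per partition and handle producer epochs and fencing.

```python
# Sketch of broker-side deduplication, the mechanism behind an "idempotent
# producer": a retried send reuses the same (producer_id, seq), so the broker
# can ack it without appending a duplicate.
class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number accepted

    def append(self, producer_id, seq, payload):
        if self.last_seq.get(producer_id, -1) >= seq:
            return "ack"    # already have it: acknowledge, do not append again
        self.last_seq[producer_id] = seq
        self.log.append(payload)
        return "ack"

broker = Broker()
broker.append("producer-A", 0, "order-123")
# The ack was lost in transit, so the producer resends with the same sequence:
broker.append("producer-A", 0, "order-123")
assert broker.log == ["order-123"]  # at-least-once delivery, exactly-once append
```

The network still carried the message twice; the sequence check makes the duplicate invisible, which is the whole trick.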
Pseudocode — idempotent consumer:
function handle_message(message):
  message_id = message.headers["message_id"]  // unique per message

  begin_transaction()
    // Check if already processed — in the SAME transaction as processing
    already = db.query("SELECT 1 FROM processed_messages WHERE id = ?", message_id)
    if already:
      commit_transaction()
      broker.ack(message)         // ack without re-processing
      return

    // Process the business logic
    order = deserialize(message.payload)
    db.insert("orders", order)
    db.update("UPDATE inventory SET quantity = quantity - ? WHERE product_id = ?",
              order.quantity, order.product_id)

    // Record that we processed this message
    db.insert("processed_messages", { id: message_id, processed_at: now() })
  commit_transaction()

  broker.ack(message)             // only ack AFTER commit succeeds
  // If the process crashes between commit and ack, the message is re-delivered
  // but the idempotency check catches it — no duplicate processing
The key: the idempotency check and the business logic are in the same database transaction. If you check in a separate transaction, there is a race window where two consumers both see “not processed” and both process.
Big Word Alert: Backpressure. When a consumer cannot keep up with the producer, backpressure is the mechanism for signaling “slow down.” Without backpressure, the producer overwhelms the consumer, causing crashes, data loss, or cascading failures. Queues provide natural backpressure — the queue grows, which is a visible signal that consumers need to scale up or producers need to throttle. Explicit backpressure: HTTP 429, TCP window sizing, reactive streams.
Unbounded Queues. A queue without a size limit is a memory leak waiting to happen. If producers outpace consumers, the queue grows until it exhausts memory or disk. Always set queue size limits and decide what happens when the limit is hit: reject new messages (fail fast), drop oldest messages (lossy), or block the producer (backpressure). The right choice depends on whether losing messages or slowing producers is worse for your use case.
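The overflow policies can be made concrete with a short sketch. Illustrative Python only: real brokers and client libraries expose these choices as configuration, and a real "block" policy would wait on a condition variable rather than raise.

```python
# A bounded queue with the two non-blocking overflow policies from the text.
from collections import deque

class BoundedQueue:
    def __init__(self, limit, policy):
        self.items, self.limit, self.policy = deque(), limit, policy

    def offer(self, item):
        if len(self.items) < self.limit:
            self.items.append(item)
            return True
        if self.policy == "reject":        # fail fast: the producer sees the error
            return False
        if self.policy == "drop_oldest":   # lossy: newest data wins
            self.items.popleft()
            self.items.append(item)
            return True
        raise NotImplementedError("'block' would wait for free space (backpressure)")

q = BoundedQueue(limit=2, policy="drop_oldest")
for msg in ["a", "b", "c"]:
    q.offer(msg)
assert list(q.items) == ["b", "c"]   # "a" was dropped to make room

q2 = BoundedQueue(limit=2, policy="reject")
q2.offer("a"); q2.offer("b")
assert q2.offer("c") is False        # the producer must handle the rejection
```

Note that "reject" pushes the decision back to the producer, which is exactly what backpressure means in practice.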

22.3 Dead Letter Queues, Poison Messages, Consumer Lag

DLQ: Messages that fail after max retries are moved to a dead letter queue for investigation rather than blocking the main queue. Without a DLQ, a single unprocessable “poison message” blocks all messages behind it in the queue. Poison messages: A message that always fails processing — malformed data, a bug in the consumer logic, or a dependency that will never recover for this specific input. Retrying a poison message forever wastes resources and blocks the queue. After 3-5 attempts, move to DLQ. Log the error with the message content and correlation ID for debugging. Consumer lag: The distance between the latest message in the queue/topic and the last message processed by the consumer. Monitor this. If lag is increasing, consumers cannot keep up. Causes: slow processing logic, downstream dependency latency, insufficient consumer instances. Fix: optimize processing, scale consumers horizontally, or investigate the slow dependency.
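Lag itself is simple offset arithmetic per partition. A sketch of the metric that monitoring tools report (offset values below are made up for illustration):

```python
# Per-partition consumer lag: newest offset written minus last offset committed.
def consumer_lag(latest_offsets, committed_offsets):
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

latest    = {0: 10_500, 1: 10_480, 2: 52_000}   # newest offset per partition
committed = {0: 10_498, 1: 10_470, 2: 31_000}   # last committed per partition
lag = consumer_lag(latest, committed)
assert lag == {0: 2, 1: 10, 2: 21_000}
# Partition 2 is badly skewed: a hot partition or a stuck consumer,
# not a uniformly slow consumer group.
```

Watch the trend, not just the absolute number: steady lag of 1,000 may be fine, while lag growing from 10 to 1,000 means consumers are falling behind.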
Retry Strategies and Poison Pill Detection. Not all failures are equal — a transient network blip deserves a quick retry, but a malformed message will never succeed no matter how many times you try. A robust retry strategy combines multiple techniques:
  • Immediate retry: retry instantly (1-2 times). Best for transient network glitches.
  • Exponential backoff: wait 1s, 2s, 4s, 8s, 16s… between retries. Best for downstream overload and rate limiting.
  • Exponential backoff + jitter: backoff with random jitter added. Best for preventing a “thundering herd” when many consumers retry simultaneously.
  • Max retries with DLQ: after N failures (typically 3-5), move to DLQ. The safety net for all message-driven systems.
  • Poison pill detection: if a message fails with a non-retryable error (e.g., deserialization failure, schema mismatch), skip retries entirely and DLQ immediately. Best for malformed data and schema evolution bugs.
Pseudocode — exponential backoff with poison pill detection:
function process_with_retry(message):
  max_retries = 5
  retry_count = message.metadata.retry_count or 0

  try:
    process(message)
    broker.ack(message)
  catch NonRetryableError as e:
    // Poison pill — no amount of retrying will fix this
    log.error("Non-retryable failure", message_id=message.id, error=e)
    move_to_dlq(message, reason=e)
    broker.ack(message)
  catch RetryableError as e:
    if retry_count >= max_retries:
      log.error("Max retries exceeded", message_id=message.id, attempts=retry_count)
      move_to_dlq(message, reason="max_retries_exceeded")
      broker.ack(message)
    else:
      delay = min(2^retry_count * 1000, 60000)  // cap at 60s
      jitter = random(0, delay * 0.3)           // 30% jitter
      schedule_retry(message, delay + jitter, retry_count + 1)
      broker.ack(message)
Interview scenario: “Consumer lag on one of your Kafka topics is spiking in production. Walk me through how you would diagnose and fix it.”

What they are really testing: Can you systematically diagnose a production incident involving message infrastructure? Do you understand Kafka’s consumer group model, partitioning, and the operational levers available to you?

Strong answer framework:

1. Triage — understand the scope and trend (first 5 minutes). Check consumer lag metrics per partition (using kafka-consumer-groups.sh --describe or your monitoring tool like Burrow, Datadog, Confluent Control Center). Is the lag uniform across all partitions, or is one partition significantly behind? Uniform lag suggests a systemic issue (all consumers are slow). Skewed lag suggests a hot partition or a stuck consumer.

2. Identify the bottleneck — is it production rate or consumption rate? Check the producer throughput. Did a new feature or upstream change suddenly increase message volume? If production rate doubled but consumer capacity stayed the same, the fix is scaling consumers. Check consumer processing time per message — if it jumped from 5ms to 500ms, something downstream is slow (a database query, an external API call, a GC pause).

3. Short-term fixes (stop the bleeding):
  • Scale out consumers — add more consumer instances up to the number of partitions (Kafka’s hard limit: one consumer per partition within a group). If you have 12 partitions and 4 consumers, you can scale to 12.
  • Increase consumer throughput — if processing is the bottleneck, check for downstream latency (slow database? overloaded API?). Batch processing if the consumer is making one DB call per message instead of batching.
  • Pause non-critical consumers — if multiple consumer groups read the same topic, temporarily pause lower-priority ones to free broker resources.
  • Check for poison messages — a single message that causes repeated failures and retries can stall an entire partition. Check DLQ volume and error logs.
4. Medium-term fixes (prevent recurrence):
  • Repartition the topic — if you hit the consumer-per-partition limit, increase partitions (careful: this changes message distribution and can break ordering guarantees for existing keys).
  • Optimize consumer processing — profile the consumer. Is it doing synchronous I/O? Can processing be parallelized within the consumer (process batch of messages concurrently, commit offsets after the batch)?
  • Add autoscaling — trigger consumer scaling based on lag thresholds, not just CPU.
  • Set up lag-based alerts — alert when lag exceeds N messages or when lag growth rate is positive for more than M minutes.
5. What NOT to do:
  • Do not blindly increase max.poll.records without ensuring your consumer can process the larger batch within max.poll.interval.ms — this causes rebalances and makes things worse.
  • Do not skip messages to “catch up” unless you have a dead letter mechanism and the messages are truly replayable later.
  • Do not add consumers beyond the partition count — they will sit idle.
Common mistakes candidates make: Jumping straight to “add more consumers” without diagnosing the root cause. Not knowing the partition-to-consumer limit. Suggesting increasing partitions without acknowledging the ordering implications. Forgetting to check if the issue is a spike in production rate vs. a slowdown in consumption rate.

Words that impress: “consumer group rebalance,” “partition assignment strategy,” “lag growth rate vs. absolute lag,” “max.poll.interval.ms timeout,” “cooperative sticky assignor,” “backpressure from downstream dependencies.”

22.4 Event-Driven Patterns

Domain events represent something that happened within a bounded context: OrderPlaced, InventoryReserved, PaymentReceived. They are named in past tense (facts, not commands). They drive business logic within a single service or bounded context. Integration events are communicated between services and may have a different (simpler, more stable) schema than the internal domain event. The Order Service’s internal OrderPlaced event might contain order line items, internal pricing metadata, and customer preferences. The integration event published to other services contains only: order_id, customer_id, total, items, timestamp — the minimum other services need. Event versioning: Events are contracts. Changing an event schema is a breaking change for all consumers. Use a schema registry (Confluent Schema Registry, AWS Glue) to enforce compatibility rules (backward, forward, full). Compatibility modes: backward-compatible (new consumers can read old events), forward-compatible (old consumers can read new events), full (both).
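The domain-to-integration translation can be sketched directly. The field names follow the example in the text; the dataclass shape itself is an illustrative assumption, not a prescribed schema.

```python
# Stripping a rich internal domain event down to the minimal, stable
# integration event other services are allowed to depend on.
from dataclasses import dataclass

@dataclass
class OrderPlacedDomainEvent:          # internal: rich, free to change
    order_id: str
    customer_id: str
    total: float
    items: list
    internal_pricing_metadata: dict    # never leaves the bounded context
    customer_preferences: dict         # never leaves the bounded context
    timestamp: str

def to_integration_event(e: OrderPlacedDomainEvent) -> dict:
    # Publish only the contract fields; everything else stays internal.
    return {"order_id": e.order_id, "customer_id": e.customer_id,
            "total": e.total, "items": e.items, "timestamp": e.timestamp}

domain = OrderPlacedDomainEvent("o-1", "c-9", 42.0, ["sku-1"],
                                {"discount_engine": "v7"}, {"gift_wrap": True},
                                "2024-01-01T00:00:00Z")
published = to_integration_event(domain)
assert "internal_pricing_metadata" not in published   # internals never leak
assert published["order_id"] == "o-1"
```

The payoff: the internal dataclass can gain, rename, or drop fields freely, and only deliberate changes to the mapping function touch the published contract.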

22.5 Kafka vs RabbitMQ vs SQS — When to Use Each

  • Model: Kafka is a distributed log (append-only). RabbitMQ is a message broker (queue + exchange routing). SQS is a managed queue (AWS).
  • Ordering: Kafka orders per partition. RabbitMQ orders per queue (single consumer). SQS orders only in FIFO queues (per message group).
  • Retention: Kafka is configurable (days/weeks/forever). RabbitMQ keeps a message until it is consumed and acked. SQS keeps messages 1-14 days.
  • Replay: Kafka yes (consumers re-read from any offset). RabbitMQ no (consumed = gone). SQS no.
  • Throughput: Kafka very high (millions/sec per cluster). RabbitMQ medium (tens of thousands/sec). SQS medium (3,000/sec standard, higher with batching).
  • Consumer groups: Kafka yes (parallel consumption + pub/sub). RabbitMQ yes (competing consumers). SQS yes (multiple consumers).
  • Exactly-once: Kafka yes (idempotent producers + transactions). RabbitMQ no (at-least-once). SQS no (at-least-once; FIFO queues offer deduplication).
  • Operational complexity: Kafka high (ZooKeeper/KRaft, partitions, replication). RabbitMQ medium (Erlang cluster). SQS none (fully managed).
  • Best for: Kafka, event streaming, event sourcing, high throughput, replay. RabbitMQ, task queues, complex routing, request-reply. SQS, simple cloud queues, serverless triggers.
  • Worst for: Kafka, simple task queues (overkill). RabbitMQ, high-throughput streaming. SQS, replay, event sourcing, cross-cloud.
Decision rule: Need replay or event sourcing -> Kafka. Need complex routing (fanout, topic-based, header-based) -> RabbitMQ. Need a simple managed queue on AWS -> SQS. Running serverless (Lambda triggers) -> SQS or SNS.
Choosing by use case — concrete scenarios:
  • Real-time clickstream analytics (100K+ events/sec): Kafka. High throughput, replay for reprocessing, consumer groups for parallel processing.
  • E-commerce order processing with complex routing (priority orders, regional routing): RabbitMQ. Exchange-based routing, priority queues, dead letter exchanges.
  • Lambda-triggered image thumbnail generation: SQS. Zero ops, native Lambda integration, pay-per-message.
  • Event sourcing for financial audit trail: Kafka. Immutable log, indefinite retention, replay from any point.
  • Microservice task queue (send emails, generate PDFs): RabbitMQ or SQS. Simple work distribution, no need for replay or event streaming.
  • Multi-region event replication: Kafka. MirrorMaker 2, built-in replication across clusters.
  • Serverless webhook processing with unpredictable spikes: SQS. Auto-scales, no capacity planning, built-in retry + DLQ.
Tools: Apache Kafka (event streaming, high throughput, durable). RabbitMQ (traditional message broker, flexible routing). AWS SQS/SNS. Azure Service Bus. Google Pub/Sub. NATS (lightweight, cloud-native). Redis Streams (lightweight event streaming with Redis). Further reading: Designing Event-Driven Systems by Ben Stopford — free ebook from Confluent, covers Kafka-centric event architectures. Enterprise Integration Patterns by Gregor Hohpe & Bobby Woolf — the definitive catalog of messaging patterns (routing, transformation, splitting, aggregation). Kafka: The Definitive Guide by Gwen Shapira et al. — comprehensive guide to Kafka internals and usage patterns.

Part XVI — Concurrency, Threading, and Async Execution

Chapter 23: Concurrency

23.1 Concurrency vs Parallelism

These are often confused. Here is the distinction with a real-world analogy and code-level example: Concurrency: one cook, multiple dishes. A single cook works on pasta, salad, and soup. While the pasta boils (waiting/IO), the cook chops vegetables for the salad (doing work). The cook interleaves tasks, switching between them when one is waiting. Only one task is actively being worked on at any moment, but all three make progress.
// Concurrency: Node.js event loop — one thread handles many requests
async function handleRequest(req) {
  const user = await db.getUser(req.userId)    // event loop serves other requests while this I/O waits
  const orders = await db.getOrders(user.id)   // again: awaiting never blocks the single thread
  return { user, orders }
}
// 1000 requests can be "in progress" on a single thread
Parallelism: multiple cooks, multiple dishes. Three cooks each work on their own dish simultaneously. All three are actively working at the same time on different CPU cores.
// Parallelism: Go goroutines on multiple cores
func processImages(images []Image) {
  var wg sync.WaitGroup
  for _, img := range images {
    wg.Add(1)
    go func(i Image) {       // each goroutine may run on a different core
      defer wg.Done()
      resize(i)              // CPU-intensive work happening simultaneously
    }(img)
  }
  wg.Wait()
}
The key insight: Concurrency is about structure (handling multiple things). Parallelism is about execution (doing multiple things at the same time). You can have concurrency without parallelism (Node.js on one core). You can have parallelism without concurrency (a batch job that splits data across 8 cores but each core processes one chunk start-to-finish). Most real systems use both.
Big Word Alert: Thread Safety. Code is thread-safe if it behaves correctly when accessed from multiple threads simultaneously without external synchronization. A function that only reads local variables is naturally thread-safe. A function that modifies a shared counter is not — two threads incrementing simultaneously can produce the wrong result. Making code thread-safe: use locks/mutexes (simple but can cause contention), atomic operations (fast but limited to simple operations), immutable data (no modification = no race conditions), or message-passing (each thread owns its data, communicates via messages).
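A minimal sketch of the lock-based fix for the shared-counter case described above, in Python. (CPython's GIL masks many races, but `counter += 1` is still a read-modify-write sequence and can lose updates under thread preemption.)

```python
# Shared counter made thread-safe with a mutex: the read-modify-write
# becomes a single critical section, so no increments are lost.
import threading

counter = 0
lock = threading.Lock()

def increment_many(n):
    global counter
    for _ in range(n):
        with lock:          # without this, two threads can both read the same
            counter += 1    # value and one increment silently disappears

threads = [threading.Thread(target=increment_many, args=(10_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
assert counter == 40_000    # deterministic: the lock prevents lost updates
```

The same result could come from an atomic integer (where the language offers one) or from giving each thread a private counter and summing at the end, the message-passing style from the list above.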
Tools: Race detectors catch data races at runtime: Go’s -race flag, ThreadSanitizer for C/C++. Static analyzers like SpotBugs (the FindBugs successor) flag some Java concurrency bugs at build time. Concurrency testing tools: jcstress (Java), go test -race (Go). Concurrent data structures: ConcurrentHashMap (Java), sync.Map (Go).

23.2 Race Conditions

Behavior depends on the timing of operations. Two requests read a balance of $100, both approve a $75 withdrawal, both compute $100 - $75 = $25, and both write $25 to the account. Result: $150 withdrawn from a $100 account — the second withdrawal should have been rejected, but both saw $100 and assumed it was safe. Prevention: database transactions with correct isolation, optimistic locking (version field), pessimistic locking (SELECT FOR UPDATE), advisory locks for application-level coordination, atomic operations (Redis INCR), idempotent operations.
Concrete code-level race condition examples.

Example 1 — Check-then-act (the classic TOCTOU bug):
# BROKEN — race condition between check and act
def withdraw(account_id, amount):
    balance = db.query("SELECT balance FROM accounts WHERE id = %s", account_id)
    if balance >= amount:
        # Another thread can withdraw between the SELECT and UPDATE
        db.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", amount, account_id)
        return "success"
    return "insufficient funds"

# FIXED — atomic operation, no window between check and act
def withdraw(account_id, amount):
    rows_affected = db.execute(
        "UPDATE accounts SET balance = balance - %s WHERE id = %s AND balance >= %s",
        amount, account_id, amount
    )
    return "success" if rows_affected > 0 else "insufficient funds"
Example 2 — Lost update with optimistic locking:
# BROKEN — lost update: two concurrent requests both read the row,
# then the last write silently overwrites the first
def update_profile(user_id, new_name):
    user = db.query("SELECT * FROM users WHERE id = %s", user_id)
    user.name = new_name  # modifying a possibly stale in-memory copy
    db.execute("UPDATE users SET name = %s WHERE id = %s", new_name, user_id)

# FIXED — optimistic locking rejects stale writes
def update_profile(user_id, new_name, expected_version):
    rows = db.execute(
        "UPDATE users SET name = %s, version = version + 1 WHERE id = %s AND version = %s",
        new_name, user_id, expected_version
    )
    if rows == 0:
        raise ConflictError("Profile was modified by another request. Retry.")
Example 3 — Shared counter without synchronization (in-memory):
// BROKEN — multiple goroutines increment simultaneously
var counter int
func increment() {
    counter++  // this is actually: read counter, add 1, write counter
    // Two goroutines can both read 5, both write 6 — one increment lost
}

// FIXED — atomic operation
var counter int64
func increment() {
    atomic.AddInt64(&counter, 1)  // hardware-level atomic, no lost updates
}
Interview question: “Two users try to book the last seat on a flight at the same moment. How do you prevent a double booking?”

What they are really testing: Do you understand the spectrum of concurrency control strategies? Can you reason about trade-offs between correctness, performance, and user experience in a real business scenario?

Strong answer framework: This is a classic check-then-act race condition. Both users read “1 seat available,” both proceed to book, and you end up with -1 seats. The solution depends on how much contention you expect and how much latency you can tolerate.

Approach 1: Pessimistic locking (SELECT FOR UPDATE).
BEGIN;
SELECT available_seats FROM flights WHERE flight_id = 'UA123' FOR UPDATE;
-- This locks the row. The second transaction blocks here until the first commits.
-- If seats > 0:
UPDATE flights SET available_seats = available_seats - 1 WHERE flight_id = 'UA123';
INSERT INTO bookings (flight_id, user_id, ...) VALUES ('UA123', 'user_42', ...);
COMMIT;
The second user’s transaction waits while the first completes. After the first commits, the second transaction reads available_seats = 0 and rejects the booking. Pros: Simple, correct, familiar. Cons: Under high contention (thousands of users booking the same flight simultaneously), transactions queue up and latency spikes. Deadlock-prone if multiple rows are locked in different orders.

Approach 2: Optimistic locking (version or CAS).
-- Read current state
SELECT available_seats, version FROM flights WHERE flight_id = 'UA123';
-- available_seats = 1, version = 42

-- Try to update — only succeeds if version has not changed
UPDATE flights
SET available_seats = available_seats - 1, version = version + 1
WHERE flight_id = 'UA123' AND version = 42 AND available_seats > 0;
-- If rows_affected = 0, someone else got there first. Retry or reject.
Both users read version 42. The first to write increments it to version 43. The second user’s write fails (version 42 no longer exists). Pros: No blocking — both transactions run concurrently. Great for low-to-medium contention. Cons: Under high contention, most transactions fail and must retry, creating a “retry storm.”

Approach 3: Atomic conditional update (best for this specific case).
UPDATE flights
SET available_seats = available_seats - 1
WHERE flight_id = 'UA123' AND available_seats > 0;
-- Check rows_affected: if 0, no seats left. If 1, you got it.
This combines the check and the update into a single atomic SQL statement. No read-then-write gap. No version column needed. The database handles the concurrency. Pros: Simplest solution, no locking overhead, correct by construction. Cons: Limited to cases where the entire decision can be encoded in a single atomic statement (works here, but not for complex multi-step booking workflows).

Approach 4: For extremely hot resources — queue the requests. If 10,000 people try to book the last 5 seats simultaneously (concert tickets, flash sales), none of the above scale well. Instead: accept all booking requests into a queue, process them serially (or with a limited concurrency pool), and confirm/reject asynchronously. The user sees “Your booking is being processed…” instead of an instant confirmation. Pros: Eliminates contention entirely. Cons: Worse user experience (async instead of instant).

Which to choose:
  • Standard flight booking (moderate contention): Approach 3 (atomic conditional update) — simplest, correct, performant.
  • Complex booking with multiple steps (seat selection + meal + loyalty points): Approach 2 (optimistic locking) with retry logic.
  • Flash sale / concert tickets (extreme contention): Approach 4 (queue-based).
Common mistakes candidates make: Only knowing about pessimistic locking. Not considering the user experience impact. Suggesting distributed locks (Redis) when a single database row is involved — overkill that introduces unnecessary failure modes. Forgetting that SELECT ... FOR UPDATE requires being inside a transaction.

Words that impress: “compare-and-swap semantics,” “atomic conditional update,” “contention profile,” “optimistic vs. pessimistic depending on the write-to-read ratio,” “serializable isolation as a last resort.”
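In application code, the optimistic approach (Approach 2) needs a retry loop around the read-then-conditional-write. Below is a minimal sketch using SQLite as a stand-in for the flights table; the book_seat function and the retry limit are illustrative, not part of the example above.

```python
import sqlite3

def book_seat(conn, flight_id, max_retries=3):
    """Book one seat with optimistic concurrency control.

    Returns True if a seat was booked, False if sold out or retries exhausted.
    """
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT available_seats, version FROM flights WHERE flight_id = ?",
            (flight_id,),
        ).fetchone()
        if row is None:
            return False
        seats, version = row
        if seats <= 0:
            return False  # no seats left
        # The UPDATE only succeeds if nobody changed the row since we read it
        cur = conn.execute(
            "UPDATE flights SET available_seats = available_seats - 1, "
            "version = version + 1 "
            "WHERE flight_id = ? AND version = ? AND available_seats > 0",
            (flight_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # our write won the race
        # rowcount == 0: another writer got there first; loop and retry

    return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight_id TEXT PRIMARY KEY, available_seats INT, version INT)")
conn.execute("INSERT INTO flights VALUES ('UA123', 1, 42)")
assert book_seat(conn, "UA123") is True    # last seat booked
assert book_seat(conn, "UA123") is False   # sold out
```

The same shape works against PostgreSQL or MySQL; only the driver calls change.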

23.3 Deadlocks

Analogy: A deadlock is like two people meeting in a narrow hallway. Each waits for the other to step aside, but neither will move first. They stand there, politely and indefinitely, each thinking “after you” while blocking the only path forward. In software, the “hallway” is a shared resource, and the “people” are threads or transactions. The fix is the same in both worlds: establish a convention. In the hallway, “person heading north always yields.” In code, “always acquire locks in ascending ID order.” Without a convention, you get eternal standoffs.
Circular waiting: Transaction A holds Row 1, waits for Row 2. Transaction B holds Row 2, waits for Row 1. Neither can proceed — both wait forever (or until the database detects the deadlock and kills one). Concrete example:
Transaction A: UPDATE accounts SET balance = balance - 100 WHERE id = 1  (locks row 1)
Transaction B: UPDATE accounts SET balance = balance - 50 WHERE id = 2   (locks row 2)
Transaction A: UPDATE accounts SET balance = balance + 100 WHERE id = 2  (waits for row 2 — held by B)
Transaction B: UPDATE accounts SET balance = balance + 50 WHERE id = 1   (waits for row 1 — held by A)
-- DEADLOCK — neither can proceed
Prevention: (1) Consistent lock ordering — always lock rows in the same order (e.g., by ascending ID). If both transactions lock row 1 first, then row 2, no circular wait is possible. (2) Short transactions — the less time you hold locks, the less likely deadlocks are. (3) Lock timeouts — set lock_timeout so transactions fail fast instead of waiting forever. (4) Monitoring — PostgreSQL logs deadlocks automatically. Track deadlock frequency. If it is increasing, investigate the access patterns. In application code: Use SELECT ... FOR UPDATE with a consistent ordering. If you need to update accounts 1 and 2, always lock the lower ID first: SELECT * FROM accounts WHERE id IN (1, 2) ORDER BY id FOR UPDATE.
The Four Coffman Conditions for Deadlock. A deadlock can only occur when all four of these conditions hold simultaneously. Breaking any one of them prevents deadlocks:
Condition        | Definition                                              | How to Break It
-----------------|---------------------------------------------------------|----------------
Mutual Exclusion | A resource can only be held by one process at a time    | Use shareable resources where possible (read locks instead of exclusive locks)
Hold and Wait    | A process holds resources while waiting for others      | Require processes to request all resources at once (lock all rows in a single statement)
No Preemption    | Resources cannot be forcibly taken from a process       | Allow lock timeouts — the database kills one transaction to break the cycle
Circular Wait    | A circular chain of processes each waiting for the next | Impose a total ordering on resources — always acquire locks in the same order (e.g., by ascending ID). This is the most practical prevention strategy
In practice, consistent lock ordering (breaking circular wait) and lock timeouts (breaking no preemption) are the two most commonly used strategies. Database engines like PostgreSQL and MySQL/InnoDB have built-in deadlock detection that automatically aborts one of the transactions involved.
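The consistent-ordering rule applies equally to in-process locks. A minimal Python sketch (the transfer function and lock table are hypothetical) showing how sorting lock acquisition by account ID breaks the circular-wait condition:

```python
import threading

account_locks = {1: threading.Lock(), 2: threading.Lock()}

def transfer(from_id, to_id, amount, balances):
    # Break "circular wait": always acquire locks in ascending
    # account-ID order, regardless of the transfer direction.
    first, second = sorted((from_id, to_id))
    with account_locks[first]:
        with account_locks[second]:
            balances[from_id] -= amount
            balances[to_id] += amount

balances = {1: 100, 2: 100}
# Opposite-direction transfers cannot deadlock: both threads lock account 1 first
t1 = threading.Thread(target=transfer, args=(1, 2, 30, balances))
t2 = threading.Thread(target=transfer, args=(2, 1, 10, balances))
t1.start(); t2.start(); t1.join(); t2.join()
assert balances == {1: 80, 2: 120}
```

Without the sorted() call, the two threads could each grab their "from" lock and wait forever on the other's.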

23.4 Async/Await, Event Loops, and Background Processing

Async/await enables non-blocking IO — the thread is released while waiting for a response and can handle other requests. Event loops (Node.js, Python asyncio) use a single thread with non-blocking IO. Background processing (Sidekiq, Celery, Hangfire, BullMQ) handles work outside the request cycle.
In Node.js, a single CPU-intensive operation blocks the entire event loop. Offload heavy computation to worker threads or a separate service.
The Event Loop Model — Node.js and Python asyncio explained.

Both Node.js and Python asyncio use a single-threaded event loop to achieve concurrency without threads. Understanding this model is critical for writing performant async code.

How it works (Node.js, simplified):
    +--------------------------------------------------+
    |                    EVENT LOOP                    |
    |                                                  |
    |  1. Timers Phase: run due setTimeout/setInterval |
    |  2. Poll Phase: process completed I/O callbacks  |
    |  3. Check Phase: run setImmediate() callbacks    |
    |  4. Repeat (the real loop has a few more phases) |
    +--------------------------------------------------+
         |            |             |
    [DB query]   [HTTP call]  [File read]
         |            |             |
    (OS/libuv handles I/O in background thread pool)
The single thread runs your JavaScript code. When you await an I/O operation, the runtime hands it off to the OS (network I/O) or libuv’s thread pool (file I/O, DNS). The event loop continues processing other callbacks. When the I/O completes, its callback is queued and the event loop picks it up on the next tick.

Python asyncio works the same way:
import asyncio

# Assumes an async HTTP client has been created, e.g. http_client = httpx.AsyncClient()

async def fetch_user(user_id):
    # This does NOT block the event loop — it yields control
    response = await http_client.get(f"/users/{user_id}")
    return response.json()

async def main():
    # These run concurrently on a SINGLE thread
    user1, user2, user3 = await asyncio.gather(
        fetch_user(1),
        fetch_user(2),
        fetch_user(3),
    )
    # All three HTTP requests were in-flight simultaneously
    # Total time ~ max(request1, request2, request3), not the sum

asyncio.run(main())
What blocks the event loop (avoid this):
  • CPU-heavy computation (image processing, JSON parsing large payloads, crypto)
  • Synchronous file I/O (fs.readFileSync in Node, regular open() in Python)
  • Long-running loops without yielding
Fix: offload CPU work to worker_threads (Node.js), concurrent.futures.ProcessPoolExecutor (Python), or a separate microservice.
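A sketch of the Python fix: hand CPU-bound work to a process pool via loop.run_in_executor, so the event loop stays responsive. The fib function here is just a stand-in for any CPU-heavy computation.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    # Deliberately slow, CPU-bound work: run inline, this would block the loop
    return n if n < 2 else fib(n - 1) + fib(n - 2)

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # The event loop stays free to serve other callbacks while
        # worker processes do the number crunching
        return await asyncio.gather(
            loop.run_in_executor(pool, fib, 25),
            loop.run_in_executor(pool, fib, 26),
        )

if __name__ == "__main__":
    # The guard matters on platforms where process pools spawn (Windows, macOS)
    print(asyncio.run(main()))
```

The equivalent in Node.js is moving the computation into a worker_threads Worker and awaiting a message back.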
Strong answer: Node.js uses a single-threaded event loop. It handles concurrency through non-blocking IO — when a database query is sent, the thread does not wait. It processes other requests and is notified when the query completes. This is efficient for IO-bound work but blocks on CPU-intensive work. Java uses multi-threading — each request can get its own thread (or a thread from a pool). Threads run in parallel on multiple cores. CPU-intensive work can be parallelized across threads. The trade-off: Java threads consume more memory (each thread has its own stack), context switching between threads has overhead, and shared mutable state requires careful synchronization (locks, atomics). Node.js avoids shared state complexity (single thread) but cannot utilize multiple cores without worker threads or clustering. Most web workloads are IO-bound, so both models work well — the choice is more about ecosystem and team expertise.
Aspect                | Node.js                                | Java
----------------------|----------------------------------------|-----
Model                 | Single-threaded event loop             | Multi-threaded (thread pool)
Concurrency mechanism | Non-blocking I/O, callbacks/promises   | OS threads, ExecutorService, virtual threads (Java 21+)
CPU-bound work        | Blocks the loop (bad)                  | Parallelized across cores (good)
Memory per connection | Very low (no thread stack)             | Higher (~512KB-1MB per thread stack)
Shared state          | No shared state issues (single thread) | Requires locks, atomics, or thread-safe collections
Best for              | I/O-heavy APIs, real-time apps         | CPU-heavy processing, enterprise systems
Further reading: Java Concurrency in Practice by Brian Goetz — the definitive guide to concurrent programming in Java (principles apply broadly). Concurrency in Go by Katherine Cox-Buday — goroutines, channels, and Go’s concurrency model. Node.js Design Patterns by Mario Casciaro & Luciano Mammino — covers the event loop, streams, and async patterns in depth.

Part XVII — State Management

Chapter 24: State in Distributed Systems

24.1 Stateless vs Stateful

Make everything stateless by default. Push state to dedicated stores (Redis, databases). Stateless services are disposable and horizontally scalable.
Big Word Alert: CRDTs (Conflict-free Replicated Data Types). Data structures that can be replicated across multiple nodes and updated independently without coordination. When replicas sync, conflicts are resolved automatically using mathematical properties (commutativity, associativity, idempotency). Examples: G-Counter (grow-only counter), OR-Set (observed-remove set). Used in collaborative editing (Google Docs uses a variant), distributed databases (Riak, Redis Enterprise), and offline-first applications. The key insight: if you design your data structure so that all operations commute (order does not matter), you never need coordination.
How a G-Counter actually works: Each node maintains its own counter (a vector: {nodeA: 5, nodeB: 3, nodeC: 7}). Each node only increments its own entry. When nodes sync, they take the element-wise maximum: merge({A:5, B:3, C:7}, {A:4, B:6, C:7}) = {A:5, B:6, C:7}. The global count is the sum of all entries: 5+6+7 = 18. Increments never conflict because each node only writes to its own slot. A PN-Counter extends this with a separate G-Counter for decrements — the real count is increments - decrements. The trade-off: CRDTs are limited to operations that are commutative and associative — not all data structures can be made conflict-free (e.g., a CRDT cannot enforce “balance must not go negative” because that requires coordination).
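The merge rule above is only a few lines of code. A minimal sketch (the class and method names are illustrative):

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> increments observed from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Element-wise maximum: commutative, associative, and idempotent,
        # so replicas can sync in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a = GCounter("A"); b = GCounter("B")
a.increment(5)
b.increment(3)
a.merge(b); b.merge(a)          # sync in both directions
assert a.value() == b.value() == 8
a.merge(b)                       # re-applying a merge changes nothing
assert a.value() == 8
```

A PN-Counter is just two of these, one for increments and one for decrements.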
Split-Brain. When a network partition splits a cluster into two groups, each group may believe it is the “leader” and accept writes independently. When the partition heals, you have conflicting data. Prevention: quorum-based systems (a write must be acknowledged by a majority), fencing tokens (the old leader’s writes are rejected after a new leader is elected), or single-leader architectures (only one node accepts writes).
Tools: Redis (distributed state, pub/sub, Lua scripting for atomic operations). Apache ZooKeeper, etcd (distributed coordination, leader election, distributed locks). Consul (service discovery + KV store). Temporal, Cadence (durable workflow state management).

24.2 Distributed Locking

When multiple service instances need to coordinate — ensuring only one runs a scheduled job, or only one processes a specific order — you need a distributed lock. Redis-based distributed lock (simplified):
function acquire_lock(resource_id, ttl=30s):
  lock_value = uuid()   // unique to this holder — for safe release
  acquired = redis.set("lock:" + resource_id, lock_value, NX=true, EX=ttl)
  return { acquired, lock_value }

function release_lock(resource_id, lock_value):
  // Only release if WE hold the lock (compare lock_value)
  // Use Lua script for atomicity
  script = "if redis.call('get', KEYS[1]) == ARGV[1] then redis.call('del', KEYS[1]) end"
  redis.eval(script, "lock:" + resource_id, lock_value)
Why the lock_value matters (fencing): Without it, process A acquires lock, gets slow (GC pause), lock expires, process B acquires lock, process A wakes up and releases B’s lock. Now both think they have the lock. The unique lock_value prevents A from releasing B’s lock. For even stronger safety, use fencing tokens — a monotonically increasing number issued with each lock acquisition that downstream resources can validate.

When to use distributed locks:
  • Exactly-once job scheduling (cron that runs across multiple instances)
  • Preventing duplicate processing of the same event
  • Coordinating access to external resources (API rate limits shared across instances)

When NOT to use: If you can design around the need for coordination (idempotent operations, partition-based assignment), do that instead — locks add latency and failure modes.

Tools: Redis (SET NX + Lua scripts). Redlock (Redis distributed lock algorithm for multi-node Redis). ZooKeeper, etcd (consensus-based locks — stronger guarantees). PostgreSQL advisory locks (pg_advisory_lock — good when you already have PostgreSQL).

24.2b Distributed Consensus (Raft Basics)

How does a group of servers agree on a value when any of them can fail? This is the core problem solved by Raft (used in etcd, CockroachDB, Consul) and Paxos (used in Google’s Chubby, Spanner).

Raft in 3 phases:
  1. Leader election: Servers start as followers. If a follower does not hear from the leader (heartbeat timeout), it becomes a candidate and requests votes. A candidate that receives a majority of votes becomes leader.
  2. Log replication: The leader receives client requests, appends them to its log, and replicates to followers. Once a majority acknowledges, the entry is committed and applied.
  3. Safety: Only a server with all committed entries can be elected leader. This guarantees committed entries are never lost.

How Paxos differs from Raft: Paxos solves the same problem but with a more flexible (and notoriously harder to understand) protocol. Paxos separates consensus into Prepare and Accept phases, allowing proposals from any node. Raft elects a single leader who drives all decisions — simpler to reason about. Raft was designed explicitly to be easier to understand and implement. Most modern systems choose Raft or a Raft variant (etcd, CockroachDB, Consul). You will encounter Paxos in Google’s systems (Chubby, Spanner) and academic papers.

Why this matters for senior engineers: You rarely implement consensus, but you use systems built on it (etcd, ZooKeeper, CockroachDB, Consul). Understanding Raft helps you reason about: why these systems need an odd number of nodes (3, 5, 7 — to form a majority), why they sacrifice availability during a partition (CP systems), and why leader election takes time (cannot serve writes during election).
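The odd-cluster-size point is simple quorum arithmetic, sketched below: going from 3 nodes to 4 raises the quorum without tolerating any additional failures.

```python
def quorum(n):
    # A Raft entry commits once a majority of the n servers acknowledge it
    return n // 2 + 1

def fault_tolerance(n):
    # How many servers can fail while a majority can still be formed
    return (n - 1) // 2

# 3 nodes -> quorum 2, tolerates 1 failure; 4 -> quorum 3, still tolerates
# only 1; 5 -> quorum 3, tolerates 2. Even sizes add cost, not resilience.
for n in (3, 4, 5, 7):
    print(n, quorum(n), fault_tolerance(n))
```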

24.3 Distributed State Challenges

Synchronization, conflicts, stale data. Approaches: centralized store with strong consistency, eventual consistency with conflict resolution (last-write-wins, CRDTs), event sourcing.
Consistency Patterns in Distributed Systems. When data is replicated across nodes, you must choose a consistency model. Each model makes different trade-offs between correctness, availability, and latency.
Consistency Model | Guarantee | Latency | Availability | Real-World Example
------------------|-----------|---------|--------------|-------------------
Strong (linearizability) | Every read returns the most recent write. All nodes see the same data at the same time. | High (requires coordination) | Lower (unavailable during partitions) | Bank account balance — you must never see stale balances. Google Spanner, CockroachDB (uses synchronized clocks + Raft).
Eventual | If no new writes occur, all replicas will eventually converge to the same value. No guarantee on when. | Low (no coordination) | High (always available) | Social media like counts — seeing 1,042 likes vs 1,043 for a few seconds is acceptable. DynamoDB, Cassandra (default mode).
Causal | Operations that are causally related are seen in the same order by all nodes. Concurrent (unrelated) operations may be seen in different orders. | Medium | Medium-High | Chat applications — if Alice says “Hi” and Bob replies “Hello,” everyone must see Alice’s message first. But two unrelated conversations can appear in any order. MongoDB (causal consistency sessions).
Read-your-writes | A client always sees its own writes, even if reading from a replica. Other clients may see stale data. | Low-Medium | High | User profile update — after you change your name, you should immediately see the new name, even if other users briefly see the old one. Session-sticky routing, writing and reading from the same replica.
Monotonic reads | A client never sees data go “backward” — if you saw version 5, you will never see version 4 on a subsequent read. | Low-Medium | High | Dashboard metrics — numbers should not jump backward when you refresh the page. Sticky sessions to the same replica.
How to choose:
  • Financial transactions, inventory with hard limits -> Strong consistency
  • Analytics, social feeds, caches -> Eventual consistency
  • Collaborative editing, chat systems -> Causal consistency
  • User-facing writes (profile edits, settings) -> Read-your-writes at minimum
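Read-your-writes is often implemented with sticky routing: after a client writes, its reads go to the primary until the replica has had time to catch up. A minimal in-memory sketch (the Session class, dict-backed stores, and the lag budget are all illustrative assumptions):

```python
import time

class Session:
    """Sticky-routing sketch for read-your-writes consistency."""

    REPLICATION_LAG_BUDGET = 0.5  # seconds; tune to the observed replica lag

    def __init__(self, primary, replica):
        self.primary = primary    # dict stands in for the primary database
        self.replica = replica    # dict stands in for a (possibly stale) replica
        self.last_write_at = float("-inf")

    def write(self, key, value):
        self.primary[key] = value
        self.last_write_at = time.monotonic()

    def read(self, key):
        if time.monotonic() - self.last_write_at < self.REPLICATION_LAG_BUDGET:
            return self.primary.get(key)   # must see our own recent write
        return self.replica.get(key)       # stale reads are acceptable otherwise

primary, replica = {}, {"name": "old"}
session = Session(primary, replica)
session.write("name", "new")
assert session.read("name") == "new"  # sees its own write despite the stale replica
```

Real systems achieve the same effect with session tokens, write timestamps compared against replica replication position, or routing all of a user's traffic to one replica.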

24.4 Workflow State Machines

Model complex business processes as state machines. An order goes through states: Draft -> Submitted -> Processing -> Shipped -> Delivered -> Completed (or Cancelled at various points). Each transition has rules and side effects. Makes invalid states unrepresentable. Implementation: Define allowed transitions explicitly. An order in “Shipped” state cannot transition to “Draft.” Each transition can have guards (conditions that must be true) and side effects (send notification, update inventory). Store the current state in the database. On state change, validate the transition, execute side effects, persist the new state — all in a transaction.
Real-World State Machine: Order Lifecycle.

Below is a concrete order lifecycle state machine with every valid transition, guard condition, and side effect mapped out. This is the kind of design that prevents the “impossible state” bugs described in the interview question below.
                          cancel (any state except Delivered/Returned)
                    +-------------------------------------------+
                    |                                           |
                    v                                           |
              +-----------+   pay    +---------+  ship   +-----------+
  [create] -> |  PLACED   | ------> |  PAID   | -----> |  SHIPPED  |
              +-----------+         +---------+         +-----------+
                    |                    |                     |
                    | (payment timeout)  | (refund before      | deliver
                    v                    |  shipping)          v
              +-----------+              |              +-----------+
              | CANCELLED |  <-----------+              | DELIVERED |
              +-----------+                             +-----------+
                    ^                                         |
                     |              +-----------+    return      |
                     +------------- | RETURNED  | <--------------+
                                    +-----------+
Transition table:
From      | To        | Guard (condition)                        | Side Effect
----------|-----------|------------------------------------------|------------
PLACED    | PAID      | Payment confirmed by gateway             | Reserve inventory, send confirmation email
PLACED    | CANCELLED | Payment timeout (30 min) OR user cancels | Release any provisional holds
PAID      | SHIPPED   | Warehouse confirms dispatch              | Generate tracking number, notify customer
PAID      | CANCELLED | Admin/user requests refund before ship   | Initiate refund, release inventory
SHIPPED   | DELIVERED | Carrier confirms delivery                | Close fulfillment ticket, trigger review request
DELIVERED | RETURNED  | Customer initiates return within 30 days | Generate return label, create return case
RETURNED  | CANCELLED | Return received and inspected            | Process refund, restock inventory
Pseudocode implementation:
TRANSITIONS = {
    ("PLACED",    "PAID"):      { "guard": payment_confirmed,   "effect": reserve_inventory_and_email },
    ("PLACED",    "CANCELLED"): { "guard": timeout_or_user_cancel, "effect": release_holds },
    ("PAID",      "SHIPPED"):   { "guard": warehouse_dispatched, "effect": generate_tracking },
    ("PAID",      "CANCELLED"): { "guard": refund_requested,     "effect": process_refund },
    ("SHIPPED",   "DELIVERED"): { "guard": carrier_confirmed,    "effect": close_fulfillment },
    ("DELIVERED", "RETURNED"):  { "guard": return_within_window, "effect": generate_return_label },
    ("RETURNED",  "CANCELLED"): { "guard": return_inspected,     "effect": refund_and_restock },
}

def transition_order(order, target_state):
    key = (order.state, target_state)
    rule = TRANSITIONS.get(key)

    if rule is None:
        raise InvalidTransitionError(f"Cannot go from {order.state} to {target_state}")

    if not rule["guard"](order):
        raise GuardFailedError(f"Conditions not met for {order.state} -> {target_state}")

    with db.transaction():
        old_state = order.state
        order.state = target_state
        db.save(order)
        db.insert("state_transitions", {
            "order_id": order.id,
            "from_state": old_state,
            "to_state": target_state,
            "actor": current_user(),
            "timestamp": now(),
        })
        rule["effect"](order)
Strong answer: Replace ad-hoc if-else state handling with an explicit state machine. Define all valid states and all valid transitions in one place (a transition table or state machine library). Reject any transition not explicitly allowed. Add a database constraint on valid state values. Log every state transition with timestamp, actor, and previous state. For the existing bad data, write a migration to fix impossible states (usually by moving them to the last valid state and flagging for review). The root cause is almost always: state transitions scattered across multiple code paths with inconsistent validation.

Key steps in the fix:
  1. Audit: Query for all current state values and identify which ones are invalid
  2. Define: Create a single source of truth — a transition table listing every (from_state, to_state) pair that is allowed
  3. Enforce: Add a CHECK constraint on the state column (CHECK (state IN ('PLACED', 'PAID', 'SHIPPED', ...))) and validate transitions in application code before any write
  4. Log: Record every transition with from_state, to_state, timestamp, and actor for auditing
  5. Migrate: Fix existing bad data — move impossible states to the last valid state and flag for manual review
  6. Prevent: All state changes go through a single transition_order() function — never update the state column directly
What they are really testing: Do you understand the Saga pattern, compensating transactions, and how to handle partial failures in a distributed workflow? Can you reason about failure modes and recovery strategies?

Strong answer framework: This is a textbook case for the Saga pattern — a sequence of local transactions where each step has a corresponding compensating action if a later step fails. You cannot use a traditional distributed transaction (two-phase commit) here because payment, inventory, and shipping are likely separate services with separate databases.

The pipeline:
Step 1: Process Payment  -->  Step 2: Reserve Inventory  -->  Step 3: Schedule Shipping
   |                             |                               |
   | (if fails: no-op,           | (if fails:                    | (if fails: refund payment
   |  payment never charged)     |  refund payment)              |  + release inventory)
Orchestration-based Saga (recommended for sequential pipelines):

A central orchestrator (the Order Service or a workflow engine like Temporal) drives the pipeline and handles failures.
# Orchestrator — drives the saga and handles compensation
async def process_order(order):
    # Step 1: Payment
    payment_result = await payment_service.charge(order.payment_info, order.total)
    if not payment_result.success:
        await mark_order_failed(order, "payment_failed", payment_result.error)
        return

    # Step 2: Inventory
    inventory_result = await inventory_service.reserve(order.items)
    if not inventory_result.success:
        # Compensate Step 1
        await payment_service.refund(payment_result.transaction_id)
        await mark_order_failed(order, "inventory_failed", inventory_result.error)
        return

    # Step 3: Shipping
    shipping_result = await shipping_service.schedule(order.shipping_address, order.items)
    if not shipping_result.success:
        # Compensate Steps 1 and 2 (reverse order)
        await inventory_service.release(inventory_result.reservation_id)
        await payment_service.refund(payment_result.transaction_id)
        await mark_order_failed(order, "shipping_failed", shipping_result.error)
        return

    await mark_order_complete(order, payment_result, inventory_result, shipping_result)
Critical design decisions:

1. Compensating actions must be idempotent. If the orchestrator crashes after issuing a refund but before recording that it issued the refund, it will retry and issue the refund again. The payment service must handle duplicate refund requests gracefully (check if a refund was already processed for this transaction ID).

2. Each step must be idempotent too. If the orchestrator crashes after payment succeeds but before moving to inventory, it will restart and try payment again. The payment service must recognize “this order was already charged” and return the existing transaction ID instead of charging twice.

3. What if the compensation itself fails? The refund call might fail. You need retry logic with exponential backoff for compensating actions. If retries are exhausted, flag the order for manual intervention. This is why you need an audit log of every step and every compensation attempt.

4. Why not choreography (event-driven)? In choreography, each service emits an event and the next service reacts: PaymentCharged -> InventoryService reserves -> InventoryReserved -> ShippingService schedules. This is more decoupled but harder to reason about for sequential pipelines with complex compensation logic. When the sequence is strict and failures require multi-step rollback, orchestration is clearer. Choreography works better when steps are more independent and compensation is simple.

5. Production-grade: use a workflow engine. In production, you would implement this as a Temporal workflow or a Step Functions state machine rather than hand-rolling the orchestration. These engines handle retries, timeouts, crash recovery, and compensation out of the box. The workflow state is durable — if the orchestrator crashes mid-saga, it resumes from where it left off when it restarts.
# Temporal workflow (conceptual) — the engine handles durability and retries
@workflow.defn
class OrderProcessingWorkflow:
    @workflow.run
    async def run(self, order: Order):
        payment = await workflow.execute_activity(charge_payment, order, retry_policy=RetryPolicy(max_attempts=3))

        try:
            reservation = await workflow.execute_activity(reserve_inventory, order, retry_policy=RetryPolicy(max_attempts=3))
        except ActivityError:
            await workflow.execute_activity(refund_payment, payment.transaction_id)
            raise

        try:
            shipment = await workflow.execute_activity(schedule_shipping, order, retry_policy=RetryPolicy(max_attempts=3))
        except ActivityError:
            await workflow.execute_activity(release_inventory, reservation.id)
            await workflow.execute_activity(refund_payment, payment.transaction_id)
            raise

        return OrderResult(payment, reservation, shipment)
Common mistakes candidates make: Suggesting a distributed transaction / two-phase commit across microservices (this is impractical — it requires all services to support XA transactions and creates tight coupling). Not considering what happens when compensation fails. Designing the saga without idempotency, leading to double-charging or double-refunding. Not mentioning a workflow engine — hand-rolling sagas is a common source of bugs in production.

Words that impress: “Saga pattern — orchestration vs. choreography,” “compensating transactions,” “idempotency keys for each step,” “Temporal workflow for durable execution,” “semantic rollback vs. technical rollback,” “forward recovery vs. backward recovery.”
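The “idempotency keys for each step” idea reduces to: record every key you have processed, and on a duplicate request return the stored result instead of repeating the side effect. A minimal in-memory sketch (a real service would persist the key in the same database transaction as the refund itself):

```python
processed = {}  # idempotency_key -> stored result; a real service uses a durable table

def refund(idempotency_key, transaction_id, amount):
    # Duplicate requests (e.g. orchestrator retries after a crash) return
    # the stored result instead of issuing a second refund.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"transaction_id": transaction_id, "refunded": amount}  # side effect runs once
    processed[idempotency_key] = result
    return result

first = refund("order-42:refund", "txn-9", 100)
retry = refund("order-42:refund", "txn-9", 100)  # crashed-and-retried compensation
assert first is retry  # deduplicated: the refund happened exactly once
```

The orchestrator derives the key deterministically (for example, order ID plus step name) so every retry of the same step carries the same key.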
Further reading: Statecharts: A Visual Formalism for Complex Systems by David Harel — the original paper on statecharts. XState — a popular JavaScript/TypeScript state machine library. Temporal.io — a workflow engine for long-running, durable business processes.