> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Distributed Transactions

> 2PC, 3PC, Saga pattern, and distributed locking mechanisms

<Frame>
  <img src="https://mintcdn.com/devweeekends/CHfRzoAmD5TGW2ch/images/courses/two-phase-commit.svg?fit=max&auto=format&n=CHfRzoAmD5TGW2ch&q=85&s=6d1c8816119f7a606118c78308a94a07" alt="Two-Phase Commit Protocol" width="1080" height="1080" data-path="images/courses/two-phase-commit.svg" />
</Frame>

# Track 4: Distributed Transactions

Maintaining data integrity across multiple nodes. Distributed transactions are the hardest problem in distributed systems that you actually have to solve in practice. Single-machine transactions are like making a bank transfer within one branch -- the teller can see both accounts and lock them. Distributed transactions are like coordinating a transfer between two banks in different countries, over unreliable phone lines, where either bank might lose power mid-call.

<Info>
  **Track Duration**: 36-44 hours\
  **Modules**: 6\
  **Key Topics**: 2PC, 3PC, Saga, TCC, Distributed Locking
</Info>

***

## Module 16: ACID in Distributed Systems

### Local vs Distributed Transactions

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOCAL vs DISTRIBUTED TRANSACTIONS                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  LOCAL TRANSACTION:                                                         │
│  ──────────────────                                                         │
│  Single database, ACID guarantees are "easy"                                │
│                                                                              │
│  BEGIN TRANSACTION                                                          │
│    UPDATE accounts SET balance = balance - 100 WHERE id = 1;                │
│    UPDATE accounts SET balance = balance + 100 WHERE id = 2;                │
│  COMMIT                                                                     │
│                                                                              │
│  Database handles: Atomicity (undo log), Isolation (locks/MVCC)             │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  DISTRIBUTED TRANSACTION:                                                   │
│  ────────────────────────                                                   │
│  Multiple databases/services, ACID is HARD                                  │
│                                                                              │
│  ┌───────────┐    ┌───────────┐    ┌───────────┐                            │
│  │ Service A │    │ Service B │    │ Service C │                            │
│  │  DB: x-1  │    │  DB: y+1  │    │  Log entry│                            │
│  └───────────┘    └───────────┘    └───────────┘                            │
│       │                │                │                                   │
│       └────────────────┴────────────────┘                                   │
│                    │                                                        │
│         All succeed or all fail?                                            │
│         Who decides? What if network fails?                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Isolation Levels in Distributed Systems

<Tabs>
  <Tab title="Read Uncommitted">
    Lowest level - can read uncommitted changes from other transactions.

    **Problem**: Dirty reads

    ```
    T1: UPDATE x = 10  (not committed yet)
    T2: SELECT x → 10  (reads uncommitted value)
    T1: ROLLBACK
    T2: Has incorrect data!
    ```

    **Rarely used** in production.
  </Tab>

  <Tab title="Read Committed">
    Can only read committed data.

    **Prevents**: Dirty reads
    **Allows**: Non-repeatable reads

    ```
    T1: SELECT x → 5
    T2: UPDATE x = 10, COMMIT
    T1: SELECT x → 10  (different!)
    ```

    **Default** in PostgreSQL, SQL Server.
  </Tab>

  <Tab title="Repeatable Read">
    Same query returns same results within transaction.

    **Prevents**: Dirty reads, Non-repeatable reads
    **Allows**: Phantom reads

    ```
    T1: SELECT * WHERE age > 25 → 5 rows
    T2: INSERT (age = 30), COMMIT
    T1: SELECT * WHERE age > 25 → 6 rows (phantom!)
    ```

    **Default** in MySQL InnoDB.
  </Tab>

  <Tab title="Serializable">
    Highest level - transactions execute as if serial.

    **Prevents**: All anomalies
    **Cost**: Significant performance impact

    **Implementation approaches**:

    * Two-Phase Locking (2PL)
    * Serializable Snapshot Isolation (SSI)
    * Actual serial execution
  </Tab>
</Tabs>

### Snapshot Isolation and Write Skew

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    SNAPSHOT ISOLATION                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Each transaction sees a consistent SNAPSHOT of data                        │
│                                                                              │
│  HOW IT WORKS:                                                              │
│  ─────────────                                                              │
│  1. Transaction reads from snapshot at start time                           │
│  2. Writes go to private workspace                                          │
│  3. At commit, check for conflicts with concurrent transactions             │
│  4. If no conflicts, make writes visible                                    │
│                                                                              │
│  WRITE SKEW PROBLEM:                                                        │
│  ───────────────────                                                        │
│  Two transactions read overlapping data, make disjoint writes               │
│                                                                              │
│  Example: On-call doctors (at least 1 must be on-call)                      │
│                                                                              │
│  Initial: Alice = on-call, Bob = on-call                                    │
│                                                                              │
│  T1 (Alice): SELECT COUNT(*) WHERE on_call = true → 2                       │
│              UPDATE SET on_call = false WHERE name = 'Alice'                │
│              (OK, Bob is still on-call)                                     │
│                                                                              │
│  T2 (Bob):   SELECT COUNT(*) WHERE on_call = true → 2                       │
│              UPDATE SET on_call = false WHERE name = 'Bob'                  │
│              (OK, Alice is still on-call)                                   │
│                                                                              │
│  BOTH COMMIT! → No one is on-call (constraint violated)                     │
│                                                                              │
│  SOLUTION: Serializable isolation or explicit locking                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Module 17: Two-Phase Commit (2PC)

The classic distributed transaction protocol. The analogy that makes 2PC click: imagine a wedding ceremony. Phase 1 is when the officiant asks each person "Do you take this person to be your spouse?" Both must say "I do" (vote YES). Phase 2 is when the officiant pronounces them married (COMMIT). The critical problem with 2PC maps perfectly: if the officiant faints after both say "I do" but before pronouncing them married, the couple is stuck -- they cannot marry themselves (commit) and they cannot un-say "I do" (abort) without the officiant's decision.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      TWO-PHASE COMMIT                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PARTICIPANTS:                                                              │
│  ─────────────                                                              │
│  COORDINATOR: Manages the transaction                                       │
│  PARTICIPANTS: Services that participate (hold resources)                   │
│                                                                              │
│  PHASE 1: PREPARE (Voting)                                                  │
│  ─────────────────────────                                                  │
│                                                                              │
│  Coordinator              Participants                                      │
│      │                    A       B       C                                 │
│      │───Prepare──────────►       │       │                                 │
│      │───Prepare──────────────────►       │                                 │
│      │───Prepare──────────────────────────►                                 │
│      │                    │       │       │                                 │
│      │◄──────────Vote YES─│       │       │                                 │
│      │◄──────────Vote YES─────────│       │                                 │
│      │◄──────────Vote YES─────────────────│                                 │
│                                                                              │
│  PHASE 2: COMMIT (Decision)                                                 │
│  ──────────────────────────                                                 │
│                                                                              │
│  Coordinator              Participants                                      │
│      │                    A       B       C                                 │
│      │───Commit───────────►       │       │                                 │
│      │───Commit───────────────────►       │                                 │
│      │───Commit───────────────────────────►                                 │
│      │                    │       │       │                                 │
│      │◄─────────────ACK───│       │       │                                 │
│      │◄─────────────ACK───────────│       │                                 │
│      │◄─────────────ACK───────────────────│                                 │
│                                                                              │
│  If ANY participant votes NO → Coordinator sends ABORT to all              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### 2PC State Machines

```
COORDINATOR STATE:                    PARTICIPANT STATE:

    ┌─────────┐                           ┌─────────┐
    │  INIT   │                           │  INIT   │
    └────┬────┘                           └────┬────┘
         │ send PREPARE                        │ receive PREPARE
         ▼                                     ▼
    ┌─────────┐                           ┌─────────┐
    │  WAIT   │                           │ READY?  │
    └────┬────┘                           └────┬────┘
         │                                     │
    ┌────┴────┐                           ┌────┴────┐
    │         │                           │         │
    ▼         ▼                           ▼         ▼
┌───────┐ ┌───────┐                   ┌───────┐ ┌───────┐
│COMMIT │ │ ABORT │                   │COMMIT │ │ ABORT │
└───────┘ └───────┘                   └───────┘ └───────┘
(all YES)  (any NO)                   (receive) (receive)

IMPORTANT: After voting YES, participant is BLOCKED
           until coordinator decision arrives!
```

### 2PC Failure Scenarios

<Tabs>
  <Tab title="Participant Fails Before Vote">
    ```
    Coordinator              Participants
        │                    A       B(fails)  C
        │───Prepare──────────►       ✗        │
        │◄──────────Vote YES─│               │
        │                    (timeout on B)   │
        │                                     │
        │───ABORT────────────►               ►
        
    OUTCOME: Transaction aborted
    RECOVERY: B restarts, has no record (OK)
    ```
  </Tab>

  <Tab title="Participant Fails After Vote">
    ```
    Coordinator              Participants
        │                    A       B        C
        │───Prepare──────────►       ►        ►
        │◄────────Vote YES───│       │        │
        │◄────────Vote YES───────────│✗(fail) │
        │◄────────Vote YES───────────────────│
        │                                     
        │───Commit───────────►       ✗       ►
        │◄─────────ACK───────│               │
        │                (retry B until ACK)  │
        
    OUTCOME: Must complete (B voted YES = promised)
    RECOVERY: B restarts, reads log, finishes commit
    ```
  </Tab>

  <Tab title="Coordinator Fails">
    **THE BLOCKING PROBLEM**

    ```
    Coordinator              Participants
        │                    A       B       C
        │───Prepare──────────►       ►       ►
        │◄────────Vote YES───│       │       │
        │✗(coordinator dies)
        
        Participants are BLOCKED!
        - Already voted YES
        - Can't commit (don't know if others voted YES)
        - Can't abort (coordinator might commit later)
        - MUST WAIT for coordinator recovery
    ```

    **This is the fundamental problem with 2PC**
  </Tab>
</Tabs>

### 2PC in Practice

```
XA TRANSACTIONS (Industry standard for 2PC):
────────────────
Java: JTA (Java Transaction API)
Databases: PostgreSQL, MySQL, Oracle support XA

LIMITATIONS IN PRACTICE:
────────────────────────
1. Performance: 2x network round trips minimum
2. Holding locks: Resources locked until commit
3. Coordinator is SPOF
4. Blocking on failures
5. Not suitable for microservices (couples services)

WHO USES IT:
────────────
- Internal database systems (sharded DBs)
- Traditional enterprise systems (J2EE)
- Some distributed databases (Spanner uses variant)

WHO AVOIDS IT:
──────────────
- Microservices architectures
- High-throughput systems
- Systems requiring high availability
```

***

## Module 18: Three-Phase Commit (3PC)

An attempt to solve 2PC's blocking problem.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                      THREE-PHASE COMMIT                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ADDS: Pre-commit phase between prepare and commit                          │
│                                                                              │
│  PHASE 1: CAN-COMMIT (same as 2PC prepare)                                  │
│  Coordinator: "Can you commit?"                                             │
│  Participant: "Yes" or "No"                                                 │
│                                                                              │
│  PHASE 2: PRE-COMMIT (NEW!)                                                 │
│  If all said yes:                                                           │
│  Coordinator: "Prepare to commit"                                           │
│  Participant: Logs intent, replies ACK                                      │
│                                                                              │
│  PHASE 3: DO-COMMIT                                                         │
│  Coordinator: "Commit now"                                                  │
│  Participant: Commits, replies ACK                                          │
│                                                                              │
│  KEY INSIGHT:                                                               │
│  ─────────────                                                              │
│  After pre-commit, if coordinator fails, participants can:                  │
│  - If I received pre-commit → coordinator decided commit                    │
│  - I can timeout and commit without coordinator!                            │
│                                                                              │
│  BEFORE pre-commit, if coordinator fails:                                   │
│  - I can safely abort (no one committed yet)                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### 3PC vs 2PC

```
┌────────────────────┬──────────────────┬──────────────────────┐
│                    │       2PC        │        3PC           │
├────────────────────┼──────────────────┼──────────────────────┤
│ Phases             │ 2                │ 3                    │
│ Message complexity │ 4n               │ 6n                   │
│ Blocking           │ Yes              │ No (in some cases)   │
│ Network partition  │ Can block        │ Can violate safety!  │
│ Used in practice   │ Widely           │ Rarely               │
└────────────────────┴──────────────────┴──────────────────────┘

WHY 3PC ISN'T USED:
───────────────────
Network partitions break 3PC:
- One partition times out, commits
- Other partition times out, aborts
- DIFFERENT DECISIONS = unsafe!

2PC is blocking but SAFE
3PC is non-blocking but UNSAFE in partitions

Modern approach: Use consensus (Paxos/Raft) for coordinator
```

***

## Module 19: Saga Pattern

<Warning>
  **This is the most practical approach for microservices**. Know this well for interviews.
</Warning>

<Frame>
  <img src="https://mintcdn.com/devweeekends/X0Fp4X8lMl-ZftoO/images/courses/saga-pattern.svg?fit=max&auto=format&n=X0Fp4X8lMl-ZftoO&q=85&s=9eef2b541b7f3129ee1d4c4c4b8053c4" alt="Saga Pattern - Choreography vs Orchestration" width="1080" height="1080" data-path="images/courses/saga-pattern.svg" />
</Frame>

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         SAGA PATTERN                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IDEA: Break transaction into steps, each with a compensating action        │
│                                                                              │
│  ORDER SAGA EXAMPLE:                                                        │
│  ───────────────────                                                        │
│                                                                              │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │ T1: Create Order  ──►  T2: Reserve Inv  ──►  T3: Process Payment   │    │
│  │                                                                     │    │
│  │ C1: Cancel Order  ◄──  C2: Release Inv  ◄──  C3: Refund Payment    │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│  SUCCESS PATH:                                                              │
│  T1 → T2 → T3 → Done!                                                       │
│                                                                              │
│  FAILURE PATH (T3 fails):                                                   │
│  T1 → T2 → T3(fail) → C2 → C1 → Rolled back!                               │
│                                                                              │
│  KEY DIFFERENCE FROM 2PC:                                                   │
│  ─────────────────────────                                                  │
│  - NO LOCKS held during saga                                                │
│  - Each step commits independently                                          │
│  - Compensation instead of rollback                                         │
│  - Eventually consistent, not immediately consistent                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Saga Coordination Styles

<Tabs>
  <Tab title="Choreography">
    Services communicate via events, no central coordinator.

    ```
    Order        Inventory        Payment          Shipping
    Service      Service          Service          Service
       │                                              
       │──OrderCreated────►│                          
       │                   │──InventoryReserved──►│   
       │                                          │   
       │                   │◄──PaymentProcessed───│   
       │                                          │   
       │◄────────────────ShipmentScheduled────────│   

    Each service:
    1. Listens for events
    2. Does its work
    3. Publishes events

    PROS:
    - Simple, loosely coupled
    - No SPOF

    CONS:
    - Hard to understand flow
    - Difficult to add new steps
    - Complex failure handling
    ```
  </Tab>

  <Tab title="Orchestration">
    Central saga orchestrator controls the flow.

    ```
    ┌─────────────────────────────────────────────────────────┐
    │                 SAGA ORCHESTRATOR                       │
    └─────────────────────────────────────────────────────────┘
         │              │               │              │
         ▼              ▼               ▼              ▼
    ┌─────────┐   ┌─────────┐   ┌─────────────┐  ┌─────────┐
    │  Order  │   │Inventory│   │   Payment   │  │Shipping │
    │ Service │   │ Service │   │   Service   │  │ Service │
    └─────────┘   └─────────┘   └─────────────┘  └─────────┘

    Orchestrator:
    1. Calls each service in order
    2. Tracks saga state
    3. Handles failures, triggers compensation

    PROS:
    - Clear flow, easy to understand
    - Centralized failure handling
    - Easy to modify

    CONS:
    - Orchestrator is a point of coupling
    - Must be highly available
    ```
  </Tab>
</Tabs>

### Implementing Saga Orchestration

```python theme={null}
class OrderSaga:
    """
    Order saga with compensation logic.
    
    Real-world analogy: booking a vacation. You book the flight (step 1),
    then the hotel (step 2), then the rental car (step 3). If the rental
    car is unavailable, you cancel the hotel and cancel the flight -- in
    reverse order. Each cancellation is a "compensation" for the original
    booking. The key insight: you cannot "un-book" a flight that was never
    booked, which is why we track completed_steps.
    
    Critical design requirement: every compensation action MUST be
    idempotent. If the orchestrator crashes and restarts mid-compensation,
    it will re-run compensations that may have already partially executed.
    """
    
    def __init__(self, order_id):
        self.order_id = order_id
        self.state = "STARTED"
        self.completed_steps = []  # Persisted to durable storage so we
                                    # can resume after orchestrator crash
    
    async def execute(self, order_data):
        try:
            # Step 1: Create order record in PENDING state.
            # This is the saga's "root" -- all subsequent steps reference
            # this order_id for correlation.
            await self.create_order(order_data)
            self.completed_steps.append("CREATE_ORDER")
            
            # Step 2: Reserve inventory
            await self.reserve_inventory(order_data.items)
            self.completed_steps.append("RESERVE_INVENTORY")
            
            # Step 3: Process payment
            await self.process_payment(order_data.payment)
            self.completed_steps.append("PROCESS_PAYMENT")
            
            # Step 4: Schedule shipping
            await self.schedule_shipping(order_data.address)
            self.completed_steps.append("SCHEDULE_SHIPPING")
            
            self.state = "COMPLETED"
            return {"status": "success", "order_id": self.order_id}
            
        except SagaStepFailure as e:
            await self.compensate()
            self.state = "COMPENSATED"
            return {"status": "failed", "reason": str(e)}
    
    async def compensate(self):
        """
        Execute compensation in reverse order.
        
        Why reverse order? Because later steps may depend on earlier ones.
        You must release inventory BEFORE canceling the order, because
        the inventory service might check that the order still exists.
        
        Compensation is a "best effort" -- if a compensation step fails,
        we log it, alert on-call, and continue with the remaining
        compensations. A human will resolve the inconsistency. This is
        a fundamental trade-off of sagas: you gain availability and
        loose coupling, but you accept the possibility of manual
        intervention in rare failure scenarios.
        """
        
        compensation_map = {
            "SCHEDULE_SHIPPING": self.cancel_shipping,
            "PROCESS_PAYMENT": self.refund_payment,
            "RESERVE_INVENTORY": self.release_inventory,
            "CREATE_ORDER": self.cancel_order,
        }
        
        for step in reversed(self.completed_steps):
            compensator = compensation_map.get(step)
            if compensator:
                try:
                    await compensator()
                except Exception as e:
                    # Log and continue -- compensation must be best-effort.
                    # In production, this triggers an alert and creates a
                    # "stuck saga" ticket for the on-call engineer.
                    log.error(f"Compensation failed for {step}: {e}")
```

### Saga Failure Handling

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    SAGA FAILURE SCENARIOS                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. FORWARD STEP FAILS (most common)                                        │
│  ───────────────────────────────────                                        │
│  T1 ✓ → T2 ✓ → T3 ✗                                                         │
│  Execute: C2 → C1                                                           │
│  Result: Saga rolled back                                                   │
│                                                                              │
│  2. COMPENSATION STEP FAILS                                                 │
│  ───────────────────────────                                                │
│  T1 ✓ → T2 ✓ → T3 ✗                                                         │
│  Compensate: C2 ✗                                                           │
│  Action: Retry C2 (must be idempotent!)                                     │
│          After N retries, alert human                                       │
│                                                                              │
│  3. ORCHESTRATOR CRASHES                                                    │
│  ─────────────────────────                                                  │
│  Solution: Persist saga state to durable storage                            │
│  On restart: Resume from last known state                                   │
│                                                                              │
│  4. SEMANTIC FAILURES                                                       │
│  ────────────────────                                                       │
│  Some things can't be compensated perfectly:                                │
│  - Email already sent                                                       │
│  - Notification pushed                                                      │
│  - External system updated                                                  │
│  Solution: Design for eventual consistency, maybe retry later               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Saga Design Principles

```
1. IDEMPOTENT OPERATIONS
   ──────────────────────
   Every step and compensation must be safe to retry
   
   BAD:  inventory -= 1
   GOOD: if not reserved(order_id): inventory -= 1

2. REVERSIBLE STEPS
   ─────────────────
   Design operations that can be undone
   
   BAD:  delete_user(user_id)
   GOOD: mark_user_deleted(user_id)  # Can be unmarked

3. SEMANTIC LOCKS
   ───────────────
   Prevent dirty reads during saga execution
   
   order.status = "PENDING"  # Others know saga is in progress
   # ... saga executes ...
   order.status = "CONFIRMED"  # Safe to read now

4. SAGA LOG
   ─────────
   Record all events for debugging and replay
   
   {saga_id, step, status, timestamp, data}
```

### 19.1 Distributed Sagas: Orchestration vs Choreography Deep Dive

For Staff/Principal level design, the choice between Orchestration and Choreography is not just about "coupling," but about **Observability**, **Testability**, and **Operational Complexity**.

#### The Operational Comparison

| Feature              | Choreography (Event-Based)                                              | Orchestration (Command-Based)                                                   |
| :------------------- | :---------------------------------------------------------------------- | :------------------------------------------------------------------------------ |
| **Control**          | Decentralized. Each service owns its transition logic.                  | Centralized. One "Brain" knows the whole flow.                                  |
| **Visibility**       | **Poor**. You must trace events across multiple logs to see the status. | **Good**. One dashboard shows the status of all active sagas.                   |
| **Testing**          | **Hard**. Requires integration tests across $N$ services.               | **Easier**. You can mock services and test the Orchestrator logic.              |
| **Scalability**      | **High**. No central bottleneck.                                        | **Moderate**. Orchestrator can become a bottleneck (use sharded orchestrators). |
| **Failure Handling** | Complex "if-this-then-that" logic in every service.                     | Simple. Orchestrator handles retries and compensations.                         |

#### Advanced: Saga Isolation Levels

Sagas provide **Atomicity**, **Consistency**, and **Durability** (ACD), but they lack **Isolation**. This leads to three main anomalies:

1. **Lost Updates**: Saga 1 and Saga 2 both update a record, and one overwrite's the other's work without seeing it.
2. **Dirty Reads**: A client reads data from a Saga step that later gets compensated (rolled back).
3. **Fuzzy Reads**: A client reads a record at the start of a Saga and sees a different value at the end, even if the Saga succeeded.

#### Countermeasures (The "Staff Level" Solution):

* **Semantic Lock**: A "pending" flag on records being updated by a saga. Other transactions must check this flag before modifying.
* **Commutative Updates**: Design operations so order doesn't matter (e.g., `account_balance += 100` instead of `account_balance = new_value`).
* **Pessimistic View**: Reduce the "Dirty Read" risk by only making changes visible at the **end** of the saga (requires a staging table).
* **Reread Value**: Before committing a final step, reread the initial values to ensure they haven't changed (Optimistic concurrency).

**Staff Tip:** If your saga has more than 5 steps or involves more than 3 teams, **Orchestration** is almost always the correct choice for long-term maintainability. Choreography is better for simple, high-frequency pipelines where latency is the primary concern.

***

## Module 20: TCC Pattern

Try-Confirm-Cancel - another distributed transaction pattern.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TCC PATTERN                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THREE PHASES:                                                              │
│                                                                              │
│  TRY:     Reserve resources, check conditions                               │
│           (but don't commit!)                                               │
│                                                                              │
│  CONFIRM: Make reservations permanent                                       │
│           (only if all TRY succeeded)                                       │
│                                                                              │
│  CANCEL:  Release reservations                                              │
│           (if any TRY failed)                                               │
│                                                                              │
│  EXAMPLE: Transfer $100 from A to B                                         │
│  ─────────────────────────────────                                          │
│                                                                              │
│  TRY phase:                                                                 │
│    Account A: Reserve $100 (balance: $500 → available: $400)               │
│    Account B: Reserve credit for $100                                       │
│                                                                              │
│  If both TRY succeed:                                                       │
│    CONFIRM:                                                                 │
│      Account A: Deduct $100 from reserved                                   │
│      Account B: Add $100 to balance                                         │
│                                                                              │
│  If any TRY fails:                                                          │
│    CANCEL:                                                                  │
│      Account A: Release reserved $100                                       │
│      Account B: Release credit reservation                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### TCC vs Saga

```
┌────────────────────┬────────────────────────┬────────────────────────┐
│                    │         SAGA           │          TCC           │
├────────────────────┼────────────────────────┼────────────────────────┤
│ Resource locking   │ No (compensate later)  │ Yes (during Try phase) │
│ Isolation          │ No isolation           │ Better isolation       │
│ Complexity         │ Simpler                │ More complex           │
│ When to use        │ Long transactions      │ Short, isolated        │
│ Failure handling   │ Compensation           │ Cancel releases        │
│ Resource usage     │ May have dirty reads   │ Resources "reserved"   │
└────────────────────┴────────────────────────┴────────────────────────┘

USE TCC WHEN:
- Need stronger isolation
- Resources can be "reserved"
- Transaction is short-lived
- Can afford the complexity

USE SAGA WHEN:
- Long-running transactions
- Steps are independent services
- Eventual consistency is OK
- Simpler compensation logic
```

***

## Advanced Distributed Transactions

### Google's Percolator (Snapshot Isolation at Scale)

Percolator is the protocol Google uses for incremental processing of the web index, built on top of Bigtable. It provides ACID transactions using a **Primary Lock** and **Secondary Locks** pattern.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    PERCOLATOR TRANSACTION PROTOCOL                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PHASE 1: PREWRITE                                                          │
│  1. Pick one cell as the PRIMARY lock.                                      │
│  2. Write data to a temporary "data" column.                                │
│  3. Acquire locks on all cells (writing a pointer to the Primary lock).     │
│  4. If any lock fails (conflict), abort the transaction.                    │
│                                                                              │
│  PHASE 2: COMMIT                                                            │
│  1. Get a commit timestamp from the TSO (Timestamp Oracle).                 │
│  2. Commit the PRIMARY lock by writing the commit timestamp and deleting     │
│     the lock.                                                               │
│  3. Once the Primary is committed, the transaction is OFFICIALLY COMMITTED. │
│  4. Asynchronously commit SECONDARY locks (write timestamp, delete locks).  │
│                                                                              │
│  RECOVERY:                                                                  │
│  - If a client fails, another client can check the Primary lock.            │
│  - If Primary is committed: Roll forward the secondary locks.               │
│  - If Primary is missing/expired: Roll back all locks.                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Distributed Deadlock Detection

In distributed systems, deadlocks can occur when nodes form a cycle of dependency across different machines.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    CHANDY-MISRA-HAAS ALGORITHM                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  A probe-based algorithm to detect distributed deadlocks.                   │
│                                                                              │
│  1. PROBE MESSAGE: (initiator, sender, receiver)                            │
│  2. When process Pi is blocked by Pj:                                       │
│     - Pi sends a probe (Pi, Pi, Pj) to Pj.                                  │
│  3. When Pj receives (initiator, sender, receiver):                         │
│     - If Pj is blocked by Pk:                                               │
│       - If initiator == Pj: DEADLOCK DETECTED!                              │
│       - Else: Forward probe (initiator, Pj, Pk) to Pk.                      │
│                                                                              │
│  Why it works: The probe travels along the dependency edges. If it returns  │
│  to the initiator, a cycle exists.                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

***

## Module 21: Distributed Locking

Coordinating access to shared resources across nodes.

### The Problem

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    WHY DISTRIBUTED LOCKS ARE HARD                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Two workers processing same job                                  │
│                                                                              │
│  Worker A         Lock Service       Worker B                               │
│      │                 │                 │                                  │
│      │──acquire────────►                 │                                  │
│      │◄────granted─────│                 │                                  │
│      │                 │                 │                                  │
│      │   (processing)  │                 │                                  │
│      │                 │                 │                                  │
│      │   (GC pause)    │──lease expires──│                                  │
│      │   (20 seconds)  │                 │                                  │
│      │                 │◄────acquire─────│                                  │
│      │                 │─────granted─────►                                  │
│      │                 │                 │                                  │
│      │   (GC ends)     │                 │   (processing)                   │
│      │                 │                 │                                  │
│      │   (writes!)     │                 │   (writes!)                      │
│      │                 │                 │                                  │
│      └──────── BOTH THINK THEY HOLD THE LOCK! ──────────┘                  │
│                                                                              │
│  PROBLEMS:                                                                  │
│  ─────────                                                                  │
│  1. Can't distinguish slow from dead                                        │
│  2. Leases expire, but work continues                                       │
│  3. Network delays                                                          │
│  4. Process pauses (GC, swapping)                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Redis Single-Node Lock

```python theme={null}
# Simple Redis lock (NOT safe for production distributed lock)

def acquire_lock(redis, key, value, ttl_ms):
    """
    SET key value NX PX ttl_ms
    - NX: Only if not exists
    - PX: Expiry in milliseconds
    """
    return redis.set(key, value, nx=True, px=ttl_ms)

def release_lock(redis, key, expected_value):
    """
    Only release if we still hold the lock (value matches)
    Uses Lua script for atomicity
    """
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    return redis.eval(script, 1, key, expected_value)

# Usage
lock_value = str(uuid.uuid4())  # Unique per acquisition
if acquire_lock(redis, "my-lock", lock_value, 30000):
    try:
        do_work()
    finally:
        release_lock(redis, "my-lock", lock_value)
```

### Redlock Algorithm

Distributed lock across multiple Redis nodes.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         REDLOCK                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SETUP: 5 independent Redis masters (N=5, quorum=3)                         │
│                                                                              │
│  ALGORITHM:                                                                 │
│  ──────────                                                                 │
│  1. Get current time T1                                                     │
│                                                                              │
│  2. Try to acquire lock on each Redis node sequentially                     │
│     - Use same key, same random value                                       │
│     - Use short timeout for each node                                       │
│                                                                              │
│  3. Get current time T2                                                     │
│     Elapsed = T2 - T1                                                       │
│                                                                              │
│  4. Lock acquired if:                                                       │
│     - Acquired on majority (≥3)                                             │
│     - Elapsed < lock TTL                                                    │
│     - Remaining TTL = TTL - Elapsed > min_work_time                         │
│                                                                              │
│  5. If failed, release lock on ALL nodes                                    │
│                                                                              │
│  VALIDITY TIME = TTL - elapsed - clock_drift                                │
│  Only do work within validity time                                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Redlock Critique (Martin Kleppmann)

```
KLEPPMANN'S CRITIQUE:
─────────────────────

1. PROCESS PAUSES CAN STILL CAUSE ISSUES
   
   Client acquires lock
   GC pause for 30 seconds
   Lock expires
   Another client acquires lock
   GC ends, first client continues with expired lock
   
   Redlock can't prevent this!

2. CLOCK ASSUMPTIONS
   
   Redlock assumes clocks don't jump forward
   But NTP can cause clock jumps
   If a majority of nodes have clock issues, safety is violated

3. NOT A CORRECT DISTRIBUTED SYSTEM PRIMITIVE
   
   "Redlock depends on a lot of timing assumptions"
   "It is neither a safe solution using perfect clocks, 
    nor a safe solution using imperfect clocks"

SALVIO (Redis) RESPONSE:
────────────────────────
- Use monotonic clocks
- Design for reasonable clock drift
- Redlock is pragmatic for real-world use

CONCLUSION:
───────────
For truly safety-critical locks, use consensus-based approaches
(Zookeeper, etcd, Consul)
```

### Fencing Tokens

The solution to the GC pause problem.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                        FENCING TOKENS                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IDEA: Lock service returns incrementing token with each lock               │
│        Storage service rejects operations with old tokens                   │
│                                                                              │
│  Worker A         Lock           Storage                                    │
│      │              │               │                                       │
│      │──acquire─────►               │                                       │
│      │◄──token=33───│               │                                       │
│      │              │               │                                       │
│      │   (GC pause) │──expires──    │                                       │
│      │              │               │                                       │
│  Worker B           │               │                                       │
│      │──acquire─────►               │                                       │
│      │◄──token=34───│               │                                       │
│      │              │               │                                       │
│      │──────────write(token=34)─────►                                       │
│      │◄───────────────OK────────────│                                       │
│      │              │               │                                       │
│      │   (GC ends)  │               │                                       │
│      │──────────write(token=33)─────►                                       │
│      │◄──────────REJECTED───────────│  (33 < 34)                           │
│                                                                              │
│  Storage tracks: max_seen_token = 34                                        │
│  Rejects any write with token < max_seen_token                              │
│                                                                              │
│  REQUIREMENTS:                                                              │
│  ─────────────                                                              │
│  - Lock service must generate monotonically increasing tokens               │
│  - Storage must track and check tokens                                      │
│  - Requires cooperation from storage layer                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

### Zookeeper-Based Locks

```python theme={null}
# Zookeeper distributed lock using ephemeral sequential nodes

class ZookeeperLock:
    def __init__(self, zk, lock_path):
        self.zk = zk
        self.lock_path = lock_path
        self.my_node = None
    
    def acquire(self):
        # Create ephemeral sequential node
        self.my_node = self.zk.create(
            f"{self.lock_path}/lock-",
            ephemeral=True,
            sequential=True
        )
        
        while True:
            # Get all children, sorted
            children = sorted(self.zk.get_children(self.lock_path))
            my_name = self.my_node.split("/")[-1]
            
            # Am I first? I have the lock!
            if children[0] == my_name:
                return
            
            # Watch the node before me
            my_index = children.index(my_name)
            node_before = children[my_index - 1]
            
            # Wait for it to be deleted
            event = threading.Event()
            
            def watch_callback(event_type):
                event.set()
            
            if self.zk.exists(
                f"{self.lock_path}/{node_before}",
                watch=watch_callback
            ):
                event.wait()  # Wait for deletion
    
    def release(self):
        if self.my_node:
            self.zk.delete(self.my_node)
            self.my_node = None

# Why this works:
# 1. Ephemeral nodes deleted if client disconnects
# 2. Sequential ensures ordering
# 3. Only watch predecessor - no herd effect
# 4. Zookeeper handles consistency via ZAB
```

***

## Key Interview Questions

<AccordionGroup>
  <Accordion title="Q: Explain the difference between 2PC and Saga">
    **2PC**:

    * Coordinator blocks until all participants vote
    * Holds locks during prepare phase
    * Strong consistency, immediate
    * Can block on coordinator failure
    * Better for database transactions

    **Saga**:

    * No global locks, each step commits independently
    * Uses compensation for rollback
    * Eventually consistent
    * No blocking, more available
    * Better for microservices, long transactions

    **When to use**:

    * 2PC: Short transactions, need strong consistency, can afford latency
    * Saga: Long transactions, need high availability, can tolerate eventual consistency
  </Accordion>

  <Accordion title="Q: Design a distributed lock for a rate limiter">
    ```
    Requirements:
    - Limit API calls per user per minute
    - Multiple server instances
    - Must be accurate (not approximate)

    Solution 1: Redis with Lua script
    ─────────────────────────────────
    key = f"rate:{user_id}:{current_minute}"

    Script (atomic):
      current = GET key
      if current >= limit:
        return REJECTED
      INCR key
      EXPIRE key 60
      return ALLOWED

    Solution 2: Token bucket with lock
    ─────────────────────────────────
    1. Acquire lock for user's bucket
    2. Check/update token count
    3. Release lock

    Use Redlock or Zookeeper for the lock

    Solution 3: Sliding window (no lock)
    ────────────────────────────────────
    Use sorted set with timestamp scores
    ZADD key timestamp request_id
    ZREMRANGEBYSCORE key 0 (now - window)
    ZCARD key → current count
    ```
  </Accordion>

  <Accordion title="Q: How would you handle a saga compensation failure?">
    ```
    Strategies:

    1. RETRY WITH BACKOFF
       - Compensations must be idempotent
       - Retry with exponential backoff
       - After N retries, escalate

    2. DEAD LETTER QUEUE
       - Store failed compensations
       - Process later (automated or manual)
       - Alert operators

    3. EVENTUAL RECONCILIATION
       - Background job checks consistency
       - Fixes discrepancies automatically

    4. MANUAL INTERVENTION
       - Some failures need human decision
       - Provide good tooling/visibility

    5. DESIGN FOR FAILURE
       - Make services tolerate inconsistency
       - Use soft deletes, status fields
       - Build reversal into business process

    Key: Never lose the failure information!
    Always log/store failed saga state
    ```
  </Accordion>

  <Accordion title="Q: Why is distributed locking harder than it seems?">
    **Core challenges**:

    1. **Can't distinguish slow from dead**
       * Heartbeat timeout? Node might just be slow
       * GC pause, network delay, CPU starvation

    2. **Clock skew**
       * Lock expiry depends on time
       * Different machines have different times
       * NTP can jump clocks

    3. **Partial failures**
       * Acquired lock on some nodes, not others
       * Network partition mid-operation

    4. **Client failures**
       * Client crashes while holding lock
       * Need automatic expiry/release
       * But expiry can cause double-holding

    **Solutions**:

    * Fencing tokens (best)
    * Consensus-based locks (Zookeeper, etcd)
    * Design system to tolerate inconsistency
    * Accept that "perfect" distributed lock doesn't exist
  </Accordion>
</AccordionGroup>

***

## Advanced Design Scenarios

### Scenario 1: Global Bank Transfer System

You are designing a **global bank transfer platform** where money moves between accounts in different regions and
possibly different underlying systems.

**Hard requirements**:

* No double-spend, no lost money
* Transfers must be **auditable** and **reversible**
* Transfers may cross **currencies** and **regions**
* Short periods of unavailability are acceptable; **inconsistency is not**

**Design outline**:

* Partition accounts into **shards**; each shard is a **replicated log** protected by **Raft/Paxos**
* Within a shard, use **strict serializable transactions** (single-shard transfers are simple)
* For cross-shard transfers, use either:
  * **2PC over consensus-backed shards** (coordinator decisions are written via consensus), or
  * **Transactional outbox + reconciliation** if you can tolerate temporary imbalance

**2PC-over-consensus pattern**:

* Each shard is internally replicated; **coordinator** is itself backed by consensus (no single-point-of-failure)
* Steps:
  1. Begin transaction with a globally unique ID `txid`
  2. Phase 1 (prepare): write *"prepare(txid, debit/credit)"* to each shard's replicated log and wait until committed
  3. Once all participants are **prepared**, coordinator writes *"commit(txid)"* to its own log and notifies shards
  4. Phase 2 (commit): each shard finalizes its local changes (e.g., move from "reserved" to "posted")
* If coordinator fails, **replay logs** to determine final outcome; log state is the source of truth

**Interview talking points**:

* Why 2PC is blocking and how consensus for the coordinator removes single-point-of-failure but not all blocking
* How you would **replay** after a crash to restore consistent state
* Where you might deliberately relax consistency (e.g., showing “pending” vs “settled” balances on the UI)

### Scenario 2: Microservices Order Workflow with Sagas

You run a microservices-based e-commerce platform with services for **Orders**, **Payments**, **Inventory**, and
**Shipping**.

**Requirements**:

* **High availability**; cannot afford global locks or long-lived distributed transactions
* Must support **long-running workflows** (shipping, warehouse operations)
* Business rules are tolerant of temporary inconsistency as long as system eventually converges

**Design**:

* Use **Saga orchestration** (one central orchestrator) to manage the order lifecycle
* Each step is a **local transaction** with a **compensating action**

Example saga steps:

1. Create order (status: `PENDING`)
2. Reserve inventory for items
3. Authorize payment
4. Schedule shipment

Compensations:

* If shipping fails: cancel shipment and **release inventory**, **void payment** or refund
* If payment authorization fails: **release inventory** and mark order `CANCELLED`

Persistence and reliability:

* Saga state (current step, payload, retries) stored in a **durable saga log** (e.g., its own table or topic)
* Each local step must be **idempotent**; compensations must be idempotent as well
* Use **outbox pattern** from each service to publish events reliably

**Interview talking points**:

* Trade-offs vs 2PC (availability vs immediate consistency)
* How to debug and replay a failed saga from the saga log
* Handling **compensation failures** (DLQs, manual intervention, reconciliation jobs)

### Scenario 3: TCC for High-Value Reservations

You are designing a **hotel or flight booking system** where overbooking is unacceptable and reservations are short-
lived.

**Requirements**:

* Temporarily "hold" a room/seat while the user completes payment
* If payment fails or times out, **release** the hold automatically
* Stronger isolation than pure saga; must avoid double-booking

**Design with TCC (Try-Confirm-Cancel)**:

* TRY: reserve capacity in each system (rooms, seats, loyalty points) with a **hold token** and expiry
* CONFIRM: commit all holds and make them permanent once payment succeeds
* CANCEL: release holds if any TRY fails or payment fails/timeouts

Implementation details:

* Each resource service exposes **idempotent** `try`, `confirm`, `cancel` endpoints keyed by `reservationId`
* TRY operations must be **side-effect-free** beyond reserving capacity; they should not make changes visible to other
  users until CONFIRM
* Store overall TCC state in a coordinator (which can itself be a saga orchestrator with TCC semantics)

**Interview talking points**:

* When TCC is preferable to Saga (short-lived, strong isolation, reservable resources)
* How to handle timeout of reservations (background job that auto-cancels stale TRY states)
* Failure modes when CONFIRM or CANCEL messages are delayed and how idempotency plus retries mitigates them

***

## Next Steps

<Card title="Continue to Track 5: Data Systems at Scale" icon="arrow-right" href="/courses/distributed-systems/data-systems">
  Learn about partitioning, distributed databases, and stream processing
</Card>
