┌─────────────────────────────────────────────────────────────────────────────┐
│ SNAPSHOT ISOLATION                                                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ Each transaction sees a consistent SNAPSHOT of data                         │
│                                                                             │
│ HOW IT WORKS:                                                               │
│ ─────────────                                                               │
│ 1. Transaction reads from snapshot at start time                            │
│ 2. Writes go to private workspace                                           │
│ 3. At commit, check for conflicts with concurrent transactions              │
│ 4. If no conflicts, make writes visible                                     │
│                                                                             │
│ WRITE SKEW PROBLEM:                                                         │
│ ───────────────────                                                         │
│ Two transactions read overlapping data, make disjoint writes                │
│                                                                             │
│ Example: On-call doctors (at least 1 must be on-call)                       │
│                                                                             │
│ Initial: Alice = on-call, Bob = on-call                                     │
│                                                                             │
│ T1 (Alice): SELECT COUNT(*) WHERE on_call = true → 2                        │
│             UPDATE SET on_call = false WHERE name = 'Alice'                 │
│             (OK, Bob is still on-call)                                      │
│                                                                             │
│ T2 (Bob):   SELECT COUNT(*) WHERE on_call = true → 2                        │
│             UPDATE SET on_call = false WHERE name = 'Bob'                   │
│             (OK, Alice is still on-call)                                    │
│                                                                             │
│ BOTH COMMIT! → No one is on-call (constraint violated)                      │
│                                                                             │
│ SOLUTION: Serializable isolation or explicit locking                        │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
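The doctor example can be replayed as a minimal in-memory simulation: both transactions validate the constraint against the same snapshot, write to disjoint rows, and therefore pass any write-write conflict check. (Illustrative sketch only; the names and structures are invented.)

```python
# Both transactions read from the SAME snapshot taken at start time.
snapshot = {"Alice": True, "Bob": True}

def go_off_call(name, snap):
    """Return this transaction's private write, or None if the check fails."""
    others_on_call = sum(v for k, v in snap.items() if k != name)
    if others_on_call >= 1:        # constraint check passes on the snapshot
        return (name, False)       # private-workspace write: go off call
    return None

# Two concurrent transactions, each validating against the stale snapshot.
writes = [go_off_call("Alice", snapshot), go_off_call("Bob", snapshot)]

# The writes touch different rows, so first-committer-wins detects nothing.
db = dict(snapshot)
for w in writes:
    if w:
        name, value = w
        db[name] = value
# db is now {"Alice": False, "Bob": False} — no one is on call.
```

Serializable isolation (which also tracks read-write dependencies) or `SELECT ... FOR UPDATE` on the rows read would force one transaction to abort instead.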
FAILURE BEFORE VOTING (participant B fails):

Coordinator                    Participants
    │                    A        B(fails)     C
    │───Prepare──────────►        ✗            ►
    │◄──────────Vote YES─│                     │
    │   (timeout on B)   │                     │
    │                    │                     │
    │───ABORT────────────►                     ►

OUTCOME: Transaction aborted
RECOVERY: B restarts, has no record (OK)
FAILURE AFTER VOTING (participant B fails after voting YES):

Coordinator                    Participants
    │                    A           B           C
    │───Prepare──────────►           ►           ►
    │◄────────Vote YES───│           │           │
    │◄────────Vote YES───────────────│✗(fail)    │
    │◄────────Vote YES───────────────────────────│
    │                    │                       │
    │───Commit───────────►           ✗           ►
    │◄─────────ACK───────│                       │
    │   (retry B until ACK)                      │

OUTCOME: Must complete (B voted YES = promised)
RECOVERY: B restarts, reads log, finishes commit
THE BLOCKING PROBLEM
Coordinator                    Participants
    │                    A           B           C
    │───Prepare──────────►           ►           ►
    │◄────────Vote YES───│           │           │
    │✗(coordinator dies)

Participants are BLOCKED!
- Already voted YES
- Can't commit (don't know if others voted YES)
- Can't abort (coordinator might commit later)
- MUST WAIT for coordinator recovery
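The happy-path protocol above can be sketched with in-memory stand-ins (the class names and the logging scheme are invented for illustration; a real coordinator would persist its decision durably before notifying anyone, which is exactly what makes recovery possible):

```python
# Hedged sketch of a 2PC coordinator with in-memory participants.

class Participant:
    def __init__(self, name, will_vote_yes=True):
        self.name, self.will_vote_yes = name, will_vote_yes
        self.log = []                        # stand-in for a durable log

    def prepare(self):
        vote = "YES" if self.will_vote_yes else "NO"
        self.log.append(("prepared", vote))  # must be durable before voting
        return vote

    def commit(self):
        self.log.append(("committed",))

    def abort(self):
        self.log.append(("aborted",))

def two_phase_commit(participants):
    # Phase 1: collect votes; any NO (or timeout) forces ABORT.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    # A real coordinator logs the decision HERE, before notifying,
    # so a crash between phases can be recovered from the log.
    # Phase 2: deliver the decision (retrying failed participants).
    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

ok = two_phase_commit([Participant("A"), Participant("B"), Participant("C")])
bad = two_phase_commit([Participant("A"), Participant("B", will_vote_yes=False)])
```

The blocking problem is visible in the structure: between `prepare()` and the decision delivery, participants hold their "prepared" state and cannot unilaterally move in either direction.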
XA TRANSACTIONS (Industry standard for 2PC):
────────────────────────────────────────────
Java: JTA (Java Transaction API)
Databases: PostgreSQL, MySQL, Oracle support XA

LIMITATIONS IN PRACTICE:
────────────────────────
1. Performance: 2x network round trips minimum
2. Holding locks: resources stay locked until commit
3. Coordinator is a SPOF (single point of failure)
4. Blocking on failures
5. Not suitable for microservices (couples services)

WHO USES IT:
────────────
- Internal database systems (sharded DBs)
- Traditional enterprise systems (J2EE)
- Some distributed databases (Spanner uses a variant)

WHO AVOIDS IT:
──────────────
- Microservices architectures
- High-throughput systems
- Systems requiring high availability
┌─────────────────────────────────────────────────────────────────────────────┐
│ THREE-PHASE COMMIT                                                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ ADDS: Pre-commit phase between prepare and commit                           │
│                                                                             │
│ PHASE 1: CAN-COMMIT (same as 2PC prepare)                                   │
│   Coordinator: "Can you commit?"                                            │
│   Participant: "Yes" or "No"                                                │
│                                                                             │
│ PHASE 2: PRE-COMMIT (NEW!)                                                  │
│   If all said yes:                                                          │
│     Coordinator: "Prepare to commit"                                        │
│     Participant: Logs intent, replies ACK                                   │
│                                                                             │
│ PHASE 3: DO-COMMIT                                                          │
│   Coordinator: "Commit now"                                                 │
│   Participant: Commits, replies ACK                                         │
│                                                                             │
│ KEY INSIGHT:                                                                │
│ ────────────                                                                │
│ After pre-commit, if coordinator fails, participants can reason:            │
│ - If I received pre-commit → coordinator decided commit                     │
│ - I can time out and commit without the coordinator!                        │
│                                                                             │
│ BEFORE pre-commit, if coordinator fails:                                    │
│ - I can safely abort (no one committed yet)                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
┌────────────────────┬──────────────────┬──────────────────────┐
│                    │ 2PC              │ 3PC                  │
├────────────────────┼──────────────────┼──────────────────────┤
│ Phases             │ 2                │ 3                    │
│ Message complexity │ 4n               │ 6n                   │
│ Blocking           │ Yes              │ No (in some cases)   │
│ Network partition  │ Can block        │ Can violate safety!  │
│ Used in practice   │ Widely           │ Rarely               │
└────────────────────┴──────────────────┴──────────────────────┘

WHY 3PC ISN'T USED:
───────────────────
Network partitions break 3PC:
- One partition times out, commits
- Other partition times out, aborts
- DIFFERENT DECISIONS = unsafe!

2PC is blocking but SAFE
3PC is non-blocking but UNSAFE in partitions

Modern approach: use consensus (Paxos/Raft) for the coordinator
Services communicate via events, no central coordinator.
Order          Inventory        Payment         Shipping
Service        Service          Service         Service
   │               │                │               │
   │──OrderCreated─►                │               │
   │               │──InventoryReserved──►          │
   │               │                │──PaymentProcessed──►
   │◄─────────────────ShipmentScheduled─────────────│

Each service:
1. Listens for events
2. Does its work
3. Publishes events

PROS:
- Simple, loosely coupled
- No SPOF

CONS:
- Hard to understand flow
- Difficult to add new steps
- Complex failure handling
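The choreography style can be sketched with a toy in-memory event bus (the `Bus` class and event names are invented for illustration; real systems would use Kafka, RabbitMQ, or similar):

```python
from collections import defaultdict

class Bus:
    """Toy stand-in for a message broker: synchronous fan-out by event name."""
    def __init__(self):
        self.handlers = defaultdict(list)
    def subscribe(self, event, fn):
        self.handlers[event].append(fn)
    def publish(self, event, data):
        for fn in self.handlers[event]:
            fn(data)

bus = Bus()
trace = []   # records which services ran, in order

# Each service knows only which event it reacts to and which it emits.
bus.subscribe("OrderCreated",
              lambda d: (trace.append("inventory"),
                         bus.publish("InventoryReserved", d)))
bus.subscribe("InventoryReserved",
              lambda d: (trace.append("payment"),
                         bus.publish("PaymentProcessed", d)))
bus.subscribe("PaymentProcessed",
              lambda d: trace.append("shipping"))

bus.publish("OrderCreated", {"order_id": 42})
```

Note that the end-to-end flow is nowhere in the code — it emerges from the subscriptions, which is exactly why choreographed sagas are hard to trace and modify.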
Central saga orchestrator controls the flow.
┌─────────────────────────────────────────────────────────┐
│                    SAGA ORCHESTRATOR                    │
└─────────────────────────────────────────────────────────┘
      │              │               │              │
      ▼              ▼               ▼              ▼
┌─────────┐    ┌─────────┐    ┌─────────────┐    ┌─────────┐
│  Order  │    │Inventory│    │   Payment   │    │Shipping │
│ Service │    │ Service │    │   Service   │    │ Service │
└─────────┘    └─────────┘    └─────────────┘    └─────────┘

Orchestrator:
1. Calls each service in order
2. Tracks saga state
3. Handles failures, triggers compensation

PROS:
- Clear flow, easy to understand
- Centralized failure handling
- Easy to modify

CONS:
- Orchestrator is a point of coupling
- Must be highly available
1. IDEMPOTENT OPERATIONS
   ─────────────────────
   Every step and compensation must be safe to retry

   BAD:  inventory -= 1
   GOOD: if not reserved(order_id): inventory -= 1

2. REVERSIBLE STEPS
   ────────────────
   Design operations that can be undone

   BAD:  delete_user(user_id)
   GOOD: mark_user_deleted(user_id)  # Can be unmarked

3. SEMANTIC LOCKS
   ──────────────
   Prevent dirty reads during saga execution

   order.status = "PENDING"    # Others know saga is in progress
   # ... saga executes ...
   order.status = "CONFIRMED"  # Safe to read now

4. SAGA LOG
   ────────
   Record all events for debugging and replay

   {saga_id, step, status, timestamp, data}
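Practice 1 (idempotency) is worth spelling out, because it is what makes at-least-once delivery safe. A minimal sketch of the "GOOD" pattern, keyed by `order_id` (the names and in-memory state are invented for illustration):

```python
# Idempotent reservation: a retry of an already-applied step is a no-op.
inventory = {"count": 10}
reserved = set()   # order_ids whose reservation has already been applied

def reserve(order_id):
    if order_id in reserved:       # duplicate delivery / retry: do nothing
        return "already_reserved"
    inventory["count"] -= 1        # the actual side effect, applied once
    reserved.add(order_id)
    return "reserved"

first = reserve("ord-1")
second = reserve("ord-1")          # the retry that would otherwise double-count
```

In a real system the `reserved` set would be a unique-keyed table (or a unique constraint on `order_id`) so the check and the write are atomic.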
19.1 Distributed Sagas: Orchestration vs Choreography Deep Dive
For Staff/Principal level design, the choice between Orchestration and Choreography is not just about “coupling,” but about Observability, Testability, and Operational Complexity.
Semantic Lock: A “pending” flag on records being updated by a saga. Other transactions must check this flag before modifying.
Commutative Updates: Design operations so order doesn’t matter (e.g., account_balance += 100 instead of account_balance = new_value).
Pessimistic View: Reduce the “Dirty Read” risk by only making changes visible at the end of the saga (requires a staging table).
Reread Value: Before committing a final step, reread the initial values to ensure they haven’t changed (Optimistic concurrency).
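The "reread value" countermeasure is essentially optimistic concurrency applied at the end of the saga. A minimal sketch, with invented names and in-memory state:

```python
# Before the final step, compare the value read at saga start against
# the current value; if it changed mid-saga, compensate instead of committing.

def finalize_transfer(account, expected_balance, amount):
    if account["balance"] != expected_balance:   # someone changed it mid-saga
        return "abort_and_compensate"
    account["balance"] -= amount
    return "committed"

acct = {"balance": 100}
start_read = acct["balance"]   # value observed when the saga began
acct["balance"] = 80           # a concurrent transaction runs during the saga
result = finalize_transfer(acct, start_read, 30)
```

A database version would do the compare-and-set atomically, e.g. `UPDATE ... SET balance = balance - 30 WHERE balance = 100` and check the affected-row count.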
Staff Tip: If your saga has more than 5 steps or involves more than 3 teams, Orchestration is almost always the correct choice for long-term maintainability. Choreography is better for simple, high-frequency pipelines where latency is the primary concern.
Percolator is the protocol Google uses for incremental processing of the web index, built on top of Bigtable. It provides ACID transactions using a Primary Lock and Secondary Locks pattern.
┌─────────────────────────────────────────────────────────────────────────────┐
│ PERCOLATOR TRANSACTION PROTOCOL                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ PHASE 1: PREWRITE                                                           │
│   1. Pick one cell as the PRIMARY lock.                                     │
│   2. Write data to a temporary "data" column.                               │
│   3. Acquire locks on all cells (writing a pointer to the Primary lock).    │
│   4. If any lock fails (conflict), abort the transaction.                   │
│                                                                             │
│ PHASE 2: COMMIT                                                             │
│   1. Get a commit timestamp from the TSO (Timestamp Oracle).                │
│   2. Commit the PRIMARY lock by writing the commit timestamp and deleting   │
│      the lock.                                                              │
│   3. Once the Primary is committed, the transaction is OFFICIALLY COMMITTED.│
│   4. Asynchronously commit SECONDARY locks (write timestamp, delete locks). │
│                                                                             │
│ RECOVERY:                                                                   │
│   - If a client fails, another client can check the Primary lock.           │
│   - If Primary is committed: Roll forward the secondary locks.              │
│   - If Primary is missing/expired: Roll back all locks.                     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
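The key idea — that committing the primary lock is the single atomic commit point — can be sketched over an in-memory stand-in for the table (column names and structure here are simplified inventions, not Bigtable's actual schema):

```python
# key -> {"lock": primary_key_or_None, "data": (start_ts, value), "write": ts}
table = {}

def prewrite(keys, values, start_ts):
    """Phase 1: write data and take locks; the first key is the PRIMARY."""
    primary = keys[0]
    for k, v in zip(keys, values):
        if table.get(k, {}).get("lock"):   # conflicting lock -> abort
            return False
        table[k] = {"lock": primary, "data": (start_ts, v), "write": None}
    return True

def commit(keys, commit_ts):
    """Phase 2: committing the primary IS the commit point."""
    primary = keys[0]
    table[primary]["write"] = commit_ts    # atomic commit point
    table[primary]["lock"] = None
    # Secondaries can be finished lazily: a reader that finds a secondary
    # lock follows its pointer to the primary and rolls forward/back.
    for k in keys[1:]:
        table[k]["write"] = commit_ts
        table[k]["lock"] = None

assert prewrite(["a", "b"], [1, 2], start_ts=10)
commit(["a", "b"], commit_ts=11)
```

The recovery rule falls out of the layout: a stale secondary lock always points at the primary, whose state alone decides roll-forward vs. roll-back.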
In distributed systems, deadlocks can occur when nodes form a cycle of dependency across different machines.
┌─────────────────────────────────────────────────────────────────────────────┐
│ CHANDY-MISRA-HAAS ALGORITHM                                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ A probe-based algorithm to detect distributed deadlocks.                    │
│                                                                             │
│ 1. PROBE MESSAGE: (initiator, sender, receiver)                             │
│ 2. When process Pi is blocked by Pj:                                        │
│    - Pi sends a probe (Pi, Pi, Pj) to Pj.                                   │
│ 3. When Pj receives (initiator, sender, receiver):                          │
│    - If Pj is blocked by Pk:                                                │
│      - If initiator == Pj: DEADLOCK DETECTED!                               │
│      - Else: Forward probe (initiator, Pj, Pk) to Pk.                       │
│                                                                             │
│ Why it works: The probe travels along the dependency edges. If it returns   │
│ to the initiator, a cycle exists.                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
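The probe forwarding can be simulated over a wait-for map (a local simulation of what the message passing computes; in reality each process only knows its own outgoing edges):

```python
def has_deadlock(initiator, blocked_by):
    """Send a probe from the initiator and forward it along wait-for edges.
    Deadlock is declared only if the probe returns to the initiator."""
    queue = [(initiator, j) for j in blocked_by.get(initiator, [])]
    seen = set()
    while queue:
        sender, receiver = queue.pop()
        if receiver == initiator:            # probe came back: cycle exists
            return True
        if receiver in seen:                 # already forwarded through here
            continue
        seen.add(receiver)
        for k in blocked_by.get(receiver, []):   # forward (initiator, receiver, k)
            queue.append((receiver, k))
    return False

# P1 -> P2 -> P3 -> P1 is a cycle; P4 waits INTO the cycle but is not on it.
wait_for = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"], "P4": ["P3"]}
```

Note the subtlety visible in the test: initiated from P4, the probe never returns, so P4 does not detect the deadlock even though it is transitively stuck — detection fires only for processes on the cycle itself.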
┌─────────────────────────────────────────────────────────────────────────────┐
│ REDLOCK                                                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│ SETUP: 5 independent Redis masters (N=5, quorum=3)                          │
│                                                                             │
│ ALGORITHM:                                                                  │
│ ──────────                                                                  │
│ 1. Get current time T1                                                      │
│                                                                             │
│ 2. Try to acquire lock on each Redis node sequentially                      │
│    - Use same key, same random value                                        │
│    - Use short timeout for each node                                        │
│                                                                             │
│ 3. Get current time T2                                                      │
│    Elapsed = T2 - T1                                                        │
│                                                                             │
│ 4. Lock acquired if:                                                        │
│    - Acquired on majority (≥3)                                              │
│    - Elapsed < lock TTL                                                     │
│    - Remaining TTL = TTL - Elapsed > min_work_time                          │
│                                                                             │
│ 5. If failed, release lock on ALL nodes                                     │
│                                                                             │
│ VALIDITY TIME = TTL - elapsed - clock_drift                                 │
│ Only do work within validity time                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
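The quorum and validity-time math can be sketched with in-memory stand-ins for the five masters (`FakeNode` is invented; a real client would issue `SET key value NX PX ttl` against each Redis instance):

```python
import time
import uuid

class FakeNode:
    """In-memory stand-in for one Redis master (TTL expiry is not simulated)."""
    def __init__(self):
        self.store = {}
    def set_nx(self, key, value, ttl_ms):
        if key in self.store:        # lock already held on this node
            return False
        self.store[key] = value
        return True

def redlock_acquire(nodes, key, ttl_ms, drift_ms=2):
    value = str(uuid.uuid4())        # same random value sent to every node
    t1 = time.monotonic()
    acquired = sum(n.set_nx(key, value, ttl_ms) for n in nodes)
    elapsed_ms = (time.monotonic() - t1) * 1000
    validity_ms = ttl_ms - elapsed_ms - drift_ms   # VALIDITY TIME formula
    if acquired >= len(nodes) // 2 + 1 and validity_ms > 0:
        return value, validity_ms    # only do work within validity_ms
    for n in nodes:                  # failed: release on ALL nodes
        n.store.pop(key, None)
    return None, 0

nodes = [FakeNode() for _ in range(5)]
token, validity = redlock_acquire(nodes, "resource:1", ttl_ms=1000)
```

Release must compare the stored random value before deleting (normally via a Lua script) so a client cannot remove a lock a later holder acquired.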
KLEPPMANN'S CRITIQUE:
─────────────────────
1. PROCESS PAUSES CAN STILL CAUSE ISSUES
   Client acquires lock
   GC pause for 30 seconds
   Lock expires
   Another client acquires lock
   GC ends, first client continues with expired lock
   Redlock can't prevent this!

2. CLOCK ASSUMPTIONS
   Redlock assumes clocks don't jump forward
   But NTP can cause clock jumps
   If a majority of nodes have clock issues, safety is violated

3. NOT A CORRECT DISTRIBUTED SYSTEM PRIMITIVE
   "Redlock depends on a lot of timing assumptions"
   It is neither safe with perfect clocks nor safe with imperfect clocks

ANTIREZ'S (Salvatore Sanfilippo, Redis author) RESPONSE:
────────────────────────────────────────────────────────
- Use monotonic clocks
- Design for reasonable clock drift
- Redlock is pragmatic for real-world use

CONCLUSION:
───────────
For truly safety-critical locks, use consensus-based approaches
(ZooKeeper, etcd, Consul)
2PC: Short transactions, need strong consistency, can afford latency
Saga: Long transactions, need high availability, can tolerate eventual consistency
Q: Design a distributed lock for a rate limiter
Requirements:
- Limit API calls per user per minute
- Multiple server instances
- Must be accurate (not approximate)

Solution 1: Redis with Lua script
─────────────────────────────────
key = f"rate:{user_id}:{current_minute}"

Script (atomic):
    current = GET key
    if current >= limit: return REJECTED
    INCR key
    EXPIRE key 60
    return ALLOWED

Solution 2: Token bucket with lock
──────────────────────────────────
1. Acquire lock for user's bucket
2. Check/update token count
3. Release lock
Use Redlock or ZooKeeper for the lock

Solution 3: Sliding window (no lock)
────────────────────────────────────
Use a sorted set with timestamp scores:
    ZADD key timestamp request_id
    ZREMRANGEBYSCORE key 0 (now - window)
    ZCARD key → current count
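Solution 3 can be sketched single-process to show the window logic; a shared deployment would replace the per-user list with the Redis sorted-set commands above so all server instances see the same state (class and parameter names are invented):

```python
import time

class SlidingWindowLimiter:
    """Sliding-window rate limiter; a plain list stands in for a sorted set."""
    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.events = {}                 # user_id -> request timestamps

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        ts = self.events.setdefault(user_id, [])
        # Drop entries outside the window (ZREMRANGEBYSCORE equivalent).
        ts[:] = [t for t in ts if t > now - self.window_s]
        if len(ts) >= self.limit:        # ZCARD check
            return False
        ts.append(now)                   # ZADD equivalent
        return True

rl = SlidingWindowLimiter(limit=3, window_s=60)
# Five requests from one user within the same window: 3 pass, 2 rejected.
results = [rl.allow("u1", now=float(i)) for i in range(5)]
```

Note that even this design has a read-modify-write race across instances, which is why the Redis version bundles the three commands into one atomic script.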
Q: How would you handle a saga compensation failure?
Strategies:

1. RETRY WITH BACKOFF
   - Compensations must be idempotent
   - Retry with exponential backoff
   - After N retries, escalate

2. DEAD LETTER QUEUE
   - Store failed compensations
   - Process later (automated or manual)
   - Alert operators

3. EVENTUAL RECONCILIATION
   - Background job checks consistency
   - Fixes discrepancies automatically

4. MANUAL INTERVENTION
   - Some failures need human decision
   - Provide good tooling/visibility

5. DESIGN FOR FAILURE
   - Make services tolerate inconsistency
   - Use soft deletes, status fields
   - Build reversal into business process

Key: Never lose the failure information!
Always log/store failed saga state
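Strategies 1 and 2 compose naturally: retry with backoff first, then hand the failure to a dead-letter queue so it is never lost. A hedged sketch (function names, retry limits, and the injectable `sleep` are invented for illustration):

```python
import time

def run_compensation(compensate, max_retries=3, base_delay=0.01,
                     dead_letter=None, sleep=time.sleep):
    """Retry a compensation with exponential backoff, then dead-letter it."""
    last_error = None
    for attempt in range(max_retries):
        try:
            compensate()
            return "compensated"
        except Exception as exc:
            last_error = exc
            sleep(base_delay * 2 ** attempt)   # 10ms, 20ms, 40ms, ...
    # Never lose the failure information: record it for later processing.
    if dead_letter is not None:
        dead_letter.append(repr(last_error))
    return "dead_lettered"

dlq = []
calls = {"n": 0}

def flaky_refund():
    """Fails twice, then succeeds — idempotent, so retrying is safe."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("payment service unavailable")

status = run_compensation(flaky_refund, dead_letter=dlq, sleep=lambda s: None)
```

Passing `sleep=lambda s: None` in tests keeps the backoff logic exercised without real delays; production code would also add jitter to avoid retry storms.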
Q: Why is distributed locking harder than it seems?
Core challenges:
Can’t distinguish slow from dead
Heartbeat timeout? Node might just be slow
GC pause, network delay, CPU starvation
Clock skew
Lock expiry depends on time
Different machines have different times
NTP can jump clocks
Partial failures
Acquired lock on some nodes, not others
Network partition mid-operation
Client failures
Client crashes while holding lock
Need automatic expiry/release
But expiry can cause double-holding
Solutions:
Fencing tokens (best)
Consensus-based locks (Zookeeper, etcd)
Design system to tolerate inconsistency
Accept that “perfect” distributed lock doesn’t exist
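Fencing tokens deserve a concrete sketch, since they are the standard answer to the GC-pause scenario: the lock service issues a monotonically increasing token, and the storage layer rejects any write carrying a token older than the highest it has seen (class names here are invented stand-ins):

```python
class LockService:
    """Hands out strictly increasing fencing tokens with each acquisition."""
    def __init__(self):
        self.token = 0
    def acquire(self):
        self.token += 1
        return self.token

class Storage:
    """The protected resource enforces token ordering itself."""
    def __init__(self):
        self.highest_token = 0
        self.writes = []
    def write(self, token, data):
        if token < self.highest_token:   # stale holder (e.g. woke from GC pause)
            return "rejected"
        self.highest_token = token
        self.writes.append(data)
        return "ok"

locks, store = LockService(), Storage()
t_old = locks.acquire()     # client 1 acquires the lock, then pauses
t_new = locks.acquire()     # lock expires; client 2 acquires a newer token
ok = store.write(t_new, "from-client-2")
late = store.write(t_old, "from-client-1")   # client 1 wakes up: fenced out
```

The crucial point is that safety no longer depends on timing: even if two clients believe they hold the lock, the storage layer serializes them by token. This requires the resource to participate, which is why fencing cannot be bolted onto a service that blindly accepts writes.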
You are designing a global bank transfer platform where money moves between accounts in different regions and possibly different underlying systems.

Hard requirements:
No double-spend, no lost money
Transfers must be auditable and reversible
Transfers may cross currencies and regions
Short periods of unavailability are acceptable; inconsistency is not
Design outline:
Partition accounts into shards; each shard is a replicated log protected by Raft/Paxos
Within a shard, use strict serializable transactions (single-shard transfers are simple)
For cross-shard transfers, use either:
2PC over consensus-backed shards (coordinator decisions are written via consensus), or
Transactional outbox + reconciliation if you can tolerate temporary imbalance
2PC-over-consensus pattern:
Each shard is internally replicated; coordinator is itself backed by consensus (no single-point-of-failure)
Steps:
Begin transaction with a globally unique ID txid
Phase 1 (prepare): write “prepare(txid, debit/credit)” to each shard’s replicated log and wait until committed
Once all participants are prepared, coordinator writes “commit(txid)” to its own log and notifies shards
Phase 2 (commit): each shard finalizes its local changes (e.g., move from “reserved” to “posted”)
If coordinator fails, replay logs to determine final outcome; log state is the source of truth
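The replay step can be sketched as a scan over the coordinator's replicated log: because the decision record is itself committed via consensus before any shard is notified, the log alone determines the outcome (the log-entry schema below is an invented simplification):

```python
def outcome_after_replay(log, txid):
    """Recover a transaction's outcome from the replicated log after a crash."""
    prepares, decision = set(), None
    for entry in log:
        if entry["txid"] != txid:
            continue
        if entry["type"] == "prepare":
            prepares.add(entry["shard"])
        elif entry["type"] in ("commit", "abort"):
            decision = entry["type"]     # durably logged decision wins
    if decision:
        return decision
    # No decision record: no shard can have finalized, so abort is safe.
    return "abort"

log = [
    {"txid": "tx1", "type": "prepare", "shard": "eu"},
    {"txid": "tx1", "type": "prepare", "shard": "us"},
    {"txid": "tx1", "type": "commit"},
    {"txid": "tx2", "type": "prepare", "shard": "eu"},  # crashed mid-flight
]
```

In an interview, the point to stress is the ordering invariant: the "commit(txid)" record reaches the consensus log before any shard moves funds from "reserved" to "posted", so replay can never disagree with what a shard already did.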
Interview talking points:
Why 2PC is blocking and how consensus for the coordinator removes single-point-of-failure but not all blocking
How you would replay after a crash to restore consistent state
Where you might deliberately relax consistency (e.g., showing “pending” vs “settled” balances on the UI)