Track 4: Distributed Transactions
Maintaining data integrity across multiple nodes.
Track Duration: 36-44 hours
Modules: 6
Key Topics: 2PC, 3PC, Saga, TCC, Distributed Locking
Module 16: ACID in Distributed Systems
Local vs Distributed Transactions
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOCAL vs DISTRIBUTED TRANSACTIONS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ LOCAL TRANSACTION: │
│ ────────────────── │
│ Single database, ACID guarantees are "easy" │
│ │
│ BEGIN TRANSACTION │
│ UPDATE accounts SET balance = balance - 100 WHERE id = 1; │
│ UPDATE accounts SET balance = balance + 100 WHERE id = 2; │
│ COMMIT │
│ │
│ Database handles: Atomicity (undo log), Isolation (locks/MVCC) │
│ │
│ ═══════════════════════════════════════════════════════════════════════ │
│ │
│ DISTRIBUTED TRANSACTION: │
│ ──────────────────────── │
│ Multiple databases/services, ACID is HARD │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ Service A │ │ Service B │ │ Service C │ │
│ │ DB: x-1 │ │ DB: y+1 │ │ Log entry│ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ │ │ │ │
│ └────────────────┴────────────────┘ │
│ │ │
│ All succeed or all fail? │
│ Who decides? What if network fails? │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Isolation Levels in Distributed Systems
- Read Uncommitted
- Read Committed
- Repeatable Read
- Serializable
Read Uncommitted: Lowest level - can read uncommitted changes from other transactions.
Problem: Dirty reads
Rarely used in production.
T1: UPDATE x = 10 (not committed yet)
T2: SELECT x → 10 (reads uncommitted value)
T1: ROLLBACK
T2: Has incorrect data!
Read Committed: Can only read committed data.
Prevents: Dirty reads
Allows: Non-repeatable reads
Default in PostgreSQL, SQL Server.
T1: SELECT x → 5
T2: UPDATE x = 10, COMMIT
T1: SELECT x → 10 (different!)
Repeatable Read: The same query returns the same results within a transaction.
Prevents: Dirty reads, Non-repeatable reads
Allows: Phantom reads
Default in MySQL InnoDB.
T1: SELECT * WHERE age > 25 → 5 rows
T2: INSERT (age = 30), COMMIT
T1: SELECT * WHERE age > 25 → 6 rows (phantom!)
Serializable: Highest level - transactions execute as if serial.
Prevents: All anomalies
Cost: Significant performance impact
Implementation approaches:
- Two-Phase Locking (2PL)
- Serializable Snapshot Isolation (SSI)
- Actual serial execution
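In practice, Serializable is requested per session or per transaction, and the application retries when the database aborts a conflicting transaction. A minimal sketch, assuming psycopg2 against PostgreSQL (the connection string and accounts table are illustrative):

import psycopg2
from psycopg2 import errors, extensions

conn = psycopg2.connect("dbname=app")  # hypothetical DSN
conn.set_session(isolation_level=extensions.ISOLATION_LEVEL_SERIALIZABLE)

def transfer(src: int, dst: int, amount: int, retries: int = 3) -> None:
    for _ in range(retries):
        try:
            with conn, conn.cursor() as cur:  # commits on success, rolls back on error
                cur.execute(
                    "UPDATE accounts SET balance = balance - %s WHERE id = %s",
                    (amount, src))
                cur.execute(
                    "UPDATE accounts SET balance = balance + %s WHERE id = %s",
                    (amount, dst))
            return
        except errors.SerializationFailure:
            continue  # a concurrent conflicting transaction aborted us: retry
    raise RuntimeError("transfer kept conflicting, giving up")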
Snapshot Isolation and Write Skew
┌─────────────────────────────────────────────────────────────────────────────┐
│ SNAPSHOT ISOLATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Each transaction sees a consistent SNAPSHOT of data │
│ │
│ HOW IT WORKS: │
│ ───────────── │
│ 1. Transaction reads from snapshot at start time │
│ 2. Writes go to private workspace │
│ 3. At commit, check for conflicts with concurrent transactions │
│ 4. If no conflicts, make writes visible │
│ │
│ WRITE SKEW PROBLEM: │
│ ─────────────────── │
│ Two transactions read overlapping data, make disjoint writes │
│ │
│ Example: On-call doctors (at least 1 must be on-call) │
│ │
│ Initial: Alice = on-call, Bob = on-call │
│ │
│ T1 (Alice): SELECT COUNT(*) WHERE on_call = true → 2 │
│ UPDATE SET on_call = false WHERE name = 'Alice' │
│ (OK, Bob is still on-call) │
│ │
│ T2 (Bob): SELECT COUNT(*) WHERE on_call = true → 2 │
│ UPDATE SET on_call = false WHERE name = 'Bob' │
│ (OK, Alice is still on-call) │
│ │
│ BOTH COMMIT! → No one is on-call (constraint violated) │
│ │
│ SOLUTION: Serializable isolation or explicit locking │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
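A minimal sketch of the explicit-locking fix for the on-call example, assuming psycopg2 and an illustrative doctors table. SELECT ... FOR UPDATE makes both transactions contend for the same rows instead of proceeding from disjoint snapshots, so the second one sees the first one's change:

import psycopg2

conn = psycopg2.connect("dbname=hospital")  # hypothetical DSN

def go_off_call(name: str) -> bool:
    with conn, conn.cursor() as cur:
        # Lock ALL currently on-call rows, not just our own row,
        # so concurrent off-call requests serialize here
        cur.execute("SELECT id FROM doctors WHERE on_call = true FOR UPDATE")
        on_call_rows = cur.fetchall()
        if len(on_call_rows) <= 1:
            return False  # would violate the "at least one on-call" constraint
        cur.execute("UPDATE doctors SET on_call = false WHERE name = %s", (name,))
        return True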
Module 17: Two-Phase Commit (2PC)
The classic distributed transaction protocol.
┌─────────────────────────────────────────────────────────────────────────────┐
│ TWO-PHASE COMMIT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PARTICIPANTS: │
│ ───────────── │
│ COORDINATOR: Manages the transaction │
│ PARTICIPANTS: Services that participate (hold resources) │
│ │
│ PHASE 1: PREPARE (Voting) │
│ ───────────────────────── │
│ │
│ Coordinator Participants │
│ │ A B C │
│ │───Prepare──────────► │ │ │
│ │───Prepare──────────────────► │ │
│ │───Prepare──────────────────────────► │
│ │ │ │ │ │
│ │◄──────────Vote YES─│ │ │ │
│ │◄──────────Vote YES─────────│ │ │
│ │◄──────────Vote YES─────────────────│ │
│ │
│ PHASE 2: COMMIT (Decision) │
│ ────────────────────────── │
│ │
│ Coordinator Participants │
│ │ A B C │
│ │───Commit───────────► │ │ │
│ │───Commit───────────────────► │ │
│ │───Commit───────────────────────────► │
│ │ │ │ │ │
│ │◄─────────────ACK───│ │ │ │
│ │◄─────────────ACK───────────│ │ │
│ │◄─────────────ACK───────────────────│ │
│ │
│ If ANY participant votes NO → Coordinator sends ABORT to all │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
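A minimal sketch of the coordinator's logic for these two phases; `participants` is assumed to be a list of client stubs exposing prepare/commit/abort calls, not any specific RPC framework:

def two_phase_commit(participants, txn_id) -> bool:
    # Phase 1: PREPARE (voting)
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn_id))  # participant logs its vote and locks resources
        except Exception:
            votes.append(False)              # timeout or crash counts as a NO vote

    # Phase 2: decision
    if all(votes):
        for p in participants:
            p.commit(txn_id)   # in a real system, retried until ACKed
        return True
    else:
        for p in participants:
            p.abort(txn_id)
        return False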
2PC State Machines
COORDINATOR STATE: PARTICIPANT STATE:
┌─────────┐ ┌─────────┐
│ INIT │ │ INIT │
└────┬────┘ └────┬────┘
│ send PREPARE │ receive PREPARE
▼ ▼
┌─────────┐ ┌─────────┐
│ WAIT │ │ READY? │
└────┬────┘ └────┬────┘
│ │
┌────┴────┐ ┌────┴────┐
│ │ │ │
▼ ▼ ▼ ▼
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│COMMIT │ │ ABORT │ │COMMIT │ │ ABORT │
└───────┘ └───────┘ └───────┘ └───────┘
(all YES) (any NO) (receive) (receive)
IMPORTANT: After voting YES, participant is BLOCKED
until coordinator decision arrives!
2PC Failure Scenarios
- Participant Fails Before Vote
- Participant Fails After Vote
- Coordinator Fails
Coordinator Participants
│ A B(fails) C
│───Prepare──────────► ✗ │
│◄──────────Vote YES─│ │
│ (timeout on B) │
│ │
│───ABORT────────────► ►
OUTCOME: Transaction aborted
RECOVERY: B restarts, has no record (OK)
Coordinator Participants
│ A B C
│───Prepare──────────► ► ►
│◄────────Vote YES───│ │ │
│◄────────Vote YES───────────│✗(fail) │
│◄────────Vote YES───────────────────│
│
│───Commit───────────► ✗ ►
│◄─────────ACK───────│ │
│ (retry B until ACK) │
OUTCOME: Must complete (B voted YES = promised)
RECOVERY: B restarts, reads log, finishes commit
THE BLOCKING PROBLEM
This is the fundamental problem with 2PC
Coordinator Participants
│ A B C
│───Prepare──────────► ► ►
│◄────────Vote YES───│ │ │
│✗(coordinator dies)
Participants are BLOCKED!
- Already voted YES
- Can't commit (don't know if others voted YES)
- Can't abort (coordinator might commit later)
- MUST WAIT for coordinator recovery
2PC in Practice
XA TRANSACTIONS (Industry standard for 2PC):
────────────────
Java: JTA (Java Transaction API)
Databases: PostgreSQL, MySQL, Oracle support XA
LIMITATIONS IN PRACTICE:
────────────────────────
1. Performance: 2x network round trips minimum
2. Holding locks: Resources locked until commit
3. Coordinator is SPOF
4. Blocking on failures
5. Not suitable for microservices (couples services)
WHO USES IT:
────────────
- Internal database systems (sharded DBs)
- Traditional enterprise systems (J2EE)
- Some distributed databases (Spanner uses a variant)
WHO AVOIDS IT:
──────────────
- Microservices architectures
- High-throughput systems
- Systems requiring high availability
Module 18: Three-Phase Commit (3PC)
An attempt to solve 2PC’s blocking problem.
┌─────────────────────────────────────────────────────────────────────────────┐
│ THREE-PHASE COMMIT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ADDS: Pre-commit phase between prepare and commit │
│ │
│ PHASE 1: CAN-COMMIT (same as 2PC prepare) │
│ Coordinator: "Can you commit?" │
│ Participant: "Yes" or "No" │
│ │
│ PHASE 2: PRE-COMMIT (NEW!) │
│ If all said yes: │
│ Coordinator: "Prepare to commit" │
│ Participant: Logs intent, replies ACK │
│ │
│ PHASE 3: DO-COMMIT │
│ Coordinator: "Commit now" │
│ Participant: Commits, replies ACK │
│ │
│ KEY INSIGHT: │
│ ───────────── │
│ After pre-commit, if coordinator fails, participants can: │
│ - If I received pre-commit → coordinator decided commit │
│ - I can timeout and commit without coordinator! │
│ │
│ BEFORE pre-commit, if coordinator fails: │
│ - I can safely abort (no one committed yet) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
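A minimal sketch of the participant-side timeout rule that follows from this insight (an illustrative state machine, not a library API):

from enum import Enum, auto

class State(Enum):
    INIT = auto()
    READY = auto()          # voted YES to CAN-COMMIT
    PRE_COMMITTED = auto()  # received PRE-COMMIT
    COMMITTED = auto()
    ABORTED = auto()

class Participant3PC:
    def __init__(self):
        self.state = State.INIT

    def on_can_commit(self, can_commit: bool) -> str:
        if can_commit:
            self.state = State.READY
            return "YES"
        self.state = State.ABORTED
        return "NO"

    def on_pre_commit(self):
        self.state = State.PRE_COMMITTED

    def on_do_commit(self):
        self.state = State.COMMITTED

    def on_coordinator_timeout(self):
        # Before PRE-COMMIT: no one can have committed yet, so abort is safe.
        if self.state in (State.INIT, State.READY):
            self.state = State.ABORTED
        # After PRE-COMMIT: the coordinator decided to commit, so commit is safe.
        elif self.state == State.PRE_COMMITTED:
            self.state = State.COMMITTED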
3PC vs 2PC
┌────────────────────┬──────────────────┬──────────────────────┐
│ │ 2PC │ 3PC │
├────────────────────┼──────────────────┼──────────────────────┤
│ Phases │ 2 │ 3 │
│ Message complexity │ 4n │ 6n │
│ Blocking │ Yes │ No (in some cases) │
│ Network partition │ Can block │ Can violate safety! │
│ Used in practice │ Widely │ Rarely │
└────────────────────┴──────────────────┴──────────────────────┘
WHY 3PC ISN'T USED:
───────────────────
Network partitions break 3PC:
- One partition times out, commits
- Other partition times out, aborts
- DIFFERENT DECISIONS = unsafe!
2PC is blocking but SAFE
3PC is non-blocking but UNSAFE in partitions
Modern approach: Use consensus (Paxos/Raft) for coordinator
Module 19: Saga Pattern
This is the most practical approach for microservices. Know this well for interviews.
┌─────────────────────────────────────────────────────────────────────────────┐
│ SAGA PATTERN │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ IDEA: Break transaction into steps, each with a compensating action │
│ │
│ ORDER SAGA EXAMPLE: │
│ ─────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ T1: Create Order ──► T2: Reserve Inv ──► T3: Process Payment │ │
│ │ │ │
│ │ C1: Cancel Order ◄── C2: Release Inv ◄── C3: Refund Payment │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ SUCCESS PATH: │
│ T1 → T2 → T3 → Done! │
│ │
│ FAILURE PATH (T3 fails): │
│ T1 → T2 → T3(fail) → C2 → C1 → Rolled back! │
│ │
│ KEY DIFFERENCE FROM 2PC: │
│ ───────────────────────── │
│ - NO LOCKS held during saga │
│ - Each step commits independently │
│ - Compensation instead of rollback │
│ - Eventually consistent, not immediately consistent │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Saga Coordination Styles
- Choreography
- Orchestration
Services communicate via events, no central coordinator.
Order Inventory Payment Shipping
Service Service Service Service
│
│──OrderCreated────►│
│ │──InventoryReserved──►│
│ │
│ │◄──PaymentProcessed───│
│ │
│◄────────────────ShipmentScheduled────────│
Each service:
1. Listens for events
2. Does its work
3. Publishes events
PROS:
- Simple, loosely coupled
- No SPOF
CONS:
- Hard to understand flow
- Difficult to add new steps
- Complex failure handling
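A minimal sketch of the choreography style using an illustrative in-process event bus (a real system would use a message broker such as Kafka or RabbitMQ; handler and event names are assumptions):

from collections import defaultdict

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()

# Inventory service: listens for OrderCreated, does its work, publishes the next event
def on_order_created(order):
    # ... reserve inventory for order["items"] ...
    bus.publish("InventoryReserved", order)

# Payment service: on failure it would publish a compensating event instead
def on_inventory_reserved(order):
    # ... charge order["payment"] ...
    bus.publish("PaymentProcessed", order)

bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("InventoryReserved", on_inventory_reserved)

bus.publish("OrderCreated", {"order_id": 42, "items": [], "payment": {}})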
Central saga orchestrator controls the flow.
┌─────────────────────────────────────────────────────────┐
│ SAGA ORCHESTRATOR │
└─────────────────────────────────────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────────┐ ┌─────────┐
│ Order │ │Inventory│ │ Payment │ │Shipping │
│ Service │ │ Service │ │ Service │ │ Service │
└─────────┘ └─────────┘ └─────────────┘ └─────────┘
Orchestrator:
1. Calls each service in order
2. Tracks saga state
3. Handles failures, triggers compensation
PROS:
- Clear flow, easy to understand
- Centralized failure handling
- Easy to modify
CONS:
- Orchestrator is a point of coupling
- Must be highly available
Implementing Saga Orchestration
import logging

log = logging.getLogger("saga")

class SagaStepFailure(Exception):
    """Raised by a step when it fails and the saga must be compensated."""

class OrderSaga:
    """Order saga with compensation logic.

    The step methods (create_order, reserve_inventory, ...) and their
    compensations are assumed to be implemented against the real services.
    """

    def __init__(self, order_id):
        self.order_id = order_id
        self.state = "STARTED"
        self.completed_steps = []

    async def execute(self, order_data):
        try:
            # Step 1: Create order
            await self.create_order(order_data)
            self.completed_steps.append("CREATE_ORDER")

            # Step 2: Reserve inventory
            await self.reserve_inventory(order_data.items)
            self.completed_steps.append("RESERVE_INVENTORY")

            # Step 3: Process payment
            await self.process_payment(order_data.payment)
            self.completed_steps.append("PROCESS_PAYMENT")

            # Step 4: Schedule shipping
            await self.schedule_shipping(order_data.address)
            self.completed_steps.append("SCHEDULE_SHIPPING")

            self.state = "COMPLETED"
            return {"status": "success", "order_id": self.order_id}

        except SagaStepFailure as e:
            await self.compensate()
            self.state = "COMPENSATED"
            return {"status": "failed", "reason": str(e)}

    async def compensate(self):
        """Execute compensation in reverse order"""
        compensation_map = {
            "SCHEDULE_SHIPPING": self.cancel_shipping,
            "PROCESS_PAYMENT": self.refund_payment,
            "RESERVE_INVENTORY": self.release_inventory,
            "CREATE_ORDER": self.cancel_order,
        }
        for step in reversed(self.completed_steps):
            compensator = compensation_map.get(step)
            if compensator:
                try:
                    await compensator()
                except Exception as e:
                    # Log and continue - compensation must be best-effort
                    log.error(f"Compensation failed for {step}: {e}")
Saga Failure Handling
┌─────────────────────────────────────────────────────────────────────────────┐
│ SAGA FAILURE SCENARIOS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. FORWARD STEP FAILS (most common) │
│ ─────────────────────────────────── │
│ T1 ✓ → T2 ✓ → T3 ✗ │
│ Execute: C2 → C1 │
│ Result: Saga rolled back │
│ │
│ 2. COMPENSATION STEP FAILS │
│ ─────────────────────────── │
│ T1 ✓ → T2 ✓ → T3 ✗ │
│ Compensate: C2 ✗ │
│ Action: Retry C2 (must be idempotent!) │
│ After N retries, alert human │
│ │
│ 3. ORCHESTRATOR CRASHES │
│ ───────────────────────── │
│ Solution: Persist saga state to durable storage │
│ On restart: Resume from last known state │
│ │
│ 4. SEMANTIC FAILURES │
│ ──────────────────── │
│ Some things can't be compensated perfectly: │
│ - Email already sent │
│ - Notification pushed │
│ - External system updated │
│ Solution: Design for eventual consistency, maybe retry later │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Saga Design Principles
1. IDEMPOTENT OPERATIONS
──────────────────────
Every step and compensation must be safe to retry
BAD: inventory -= 1
GOOD: if not reserved(order_id): inventory -= 1
2. REVERSIBLE STEPS
─────────────────
Design operations that can be undone
BAD: delete_user(user_id)
GOOD: mark_user_deleted(user_id) # Can be unmarked
3. SEMANTIC LOCKS
───────────────
Prevent dirty reads during saga execution
order.status = "PENDING" # Others know saga is in progress
# ... saga executes ...
order.status = "CONFIRMED" # Safe to read now
4. SAGA LOG
─────────
Record all events for debugging and replay
{saga_id, step, status, timestamp, data}
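A minimal sketch combining principles 1 and 3 for the inventory step, assuming an asyncpg-style connection; table and column names are illustrative:

async def reserve_inventory(db, order_id: str, sku: str, qty: int) -> None:
    # Idempotency: the reservation is keyed by order_id, so retries are no-ops
    already = await db.fetchval(
        "SELECT 1 FROM reservations WHERE order_id = $1", order_id
    )
    if already:
        return

    async with db.transaction():
        await db.execute(
            "INSERT INTO reservations (order_id, sku, qty) VALUES ($1, $2, $3)",
            order_id, sku, qty,
        )
        await db.execute(
            "UPDATE inventory SET available = available - $1 WHERE sku = $2",
            qty, sku,
        )
        # Semantic lock: mark the order so readers know a saga is in flight
        await db.execute(
            "UPDATE orders SET status = 'PENDING' WHERE id = $1", order_id
        )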
Module 20: TCC Pattern
Try-Confirm-Cancel - another distributed transaction pattern.
┌─────────────────────────────────────────────────────────────────────────────┐
│ TCC PATTERN │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ THREE PHASES: │
│ │
│ TRY: Reserve resources, check conditions │
│ (but don't commit!) │
│ │
│ CONFIRM: Make reservations permanent │
│ (only if all TRY succeeded) │
│ │
│ CANCEL: Release reservations │
│ (if any TRY failed) │
│ │
│ EXAMPLE: Transfer $100 from A to B │
│ ───────────────────────────────── │
│ │
│ TRY phase: │
│ Account A: Reserve $100 (balance: $500 → available: $400) │
│ Account B: Reserve credit for $100 │
│ │
│ If both TRY succeed: │
│ CONFIRM: │
│ Account A: Deduct $100 from reserved │
│ Account B: Add $100 to balance │
│ │
│ If any TRY fails: │
│ CANCEL: │
│ Account A: Release reserved $100 │
│ Account B: Release credit reservation │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
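A minimal sketch of the account-service side of this transfer, with illustrative field names; Try and Cancel are written to be idempotent so they can be retried safely:

class AccountTCC:
    def __init__(self, balance: int):
        self.balance = balance
        self.reserved = {}  # txn_id -> reserved amount

    def try_debit(self, txn_id: str, amount: int) -> bool:
        """TRY: reserve funds without moving them."""
        if txn_id in self.reserved:  # idempotent retry
            return True
        available = self.balance - sum(self.reserved.values())
        if available < amount:
            return False
        self.reserved[txn_id] = amount
        return True

    def confirm(self, txn_id: str) -> None:
        """CONFIRM: make the reservation permanent."""
        amount = self.reserved.pop(txn_id, 0)  # 0 if already confirmed
        self.balance -= amount

    def cancel(self, txn_id: str) -> None:
        """CANCEL: release the reservation."""
        self.reserved.pop(txn_id, None)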
TCC vs Saga
┌────────────────────┬────────────────────────┬────────────────────────┐
│ │ SAGA │ TCC │
├────────────────────┼────────────────────────┼────────────────────────┤
│ Resource locking │ No (compensate later) │ Yes (during Try phase) │
│ Isolation │ No isolation │ Better isolation │
│ Complexity │ Simpler │ More complex │
│ When to use │ Long transactions │ Short, isolated │
│ Failure handling │ Compensation │ Cancel releases │
│ Resource usage │ May have dirty reads │ Resources "reserved" │
└────────────────────┴────────────────────────┴────────────────────────┘
USE TCC WHEN:
- Need stronger isolation
- Resources can be "reserved"
- Transaction is short-lived
- Can afford the complexity
USE SAGA WHEN:
- Long-running transactions
- Steps are independent services
- Eventual consistency is OK
- Simpler compensation logic
Module 21: Distributed Locking
Coordinating access to shared resources across nodes.
The Problem
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY DISTRIBUTED LOCKS ARE HARD │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Two workers processing same job │
│ │
│ Worker A Lock Service Worker B │
│ │ │ │ │
│ │──acquire────────► │ │
│ │◄────granted─────│ │ │
│ │ │ │ │
│ │ (processing) │ │ │
│ │ │ │ │
│ │ (GC pause) │──lease expires──│ │
│ │ (20 seconds) │ │ │
│ │ │◄────acquire─────│ │
│ │ │─────granted─────► │
│ │ │ │ │
│ │ (GC ends) │ │ (processing) │
│ │ │ │ │
│ │ (writes!) │ │ (writes!) │
│ │ │ │ │
│ └──────── BOTH THINK THEY HOLD THE LOCK! ──────────┘ │
│ │
│ PROBLEMS: │
│ ───────── │
│ 1. Can't distinguish slow from dead │
│ 2. Leases expire, but work continues │
│ 3. Network delays │
│ 4. Process pauses (GC, swapping) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Redis Single-Node Lock
# Simple Redis lock (NOT safe for production distributed lock)
# `redis` below is a connected redis-py client (e.g. redis.Redis())
import uuid

def acquire_lock(redis, key, value, ttl_ms):
    """
    SET key value NX PX ttl_ms
    - NX: Only if not exists
    - PX: Expiry in milliseconds
    """
    return redis.set(key, value, nx=True, px=ttl_ms)

def release_lock(redis, key, expected_value):
    """
    Only release if we still hold the lock (value matches)
    Uses Lua script for atomicity
    """
    script = """
    if redis.call("get", KEYS[1]) == ARGV[1] then
        return redis.call("del", KEYS[1])
    else
        return 0
    end
    """
    return redis.eval(script, 1, key, expected_value)

# Usage
lock_value = str(uuid.uuid4())  # Unique per acquisition
if acquire_lock(redis, "my-lock", lock_value, 30000):
    try:
        do_work()
    finally:
        release_lock(redis, "my-lock", lock_value)
Redlock Algorithm
Distributed lock across multiple Redis nodes.
┌─────────────────────────────────────────────────────────────────────────────┐
│ REDLOCK │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SETUP: 5 independent Redis masters (N=5, quorum=3) │
│ │
│ ALGORITHM: │
│ ────────── │
│ 1. Get current time T1 │
│ │
│ 2. Try to acquire lock on each Redis node sequentially │
│ - Use same key, same random value │
│ - Use short timeout for each node │
│ │
│ 3. Get current time T2 │
│ Elapsed = T2 - T1 │
│ │
│ 4. Lock acquired if: │
│ - Acquired on majority (≥3) │
│ - Elapsed < lock TTL │
│ - Remaining TTL = TTL - Elapsed > min_work_time │
│ │
│ 5. If failed, release lock on ALL nodes │
│ │
│ VALIDITY TIME = TTL - elapsed - clock_drift │
│ Only do work within validity time │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
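A minimal sketch of the acquisition steps above (not the official Redlock client libraries), assuming `clients` is a list of five independent redis-py clients and reusing `release_lock` from the single-node snippet:

import time
import uuid

def redlock_acquire(clients, resource, ttl_ms, clock_drift_factor=0.01):
    value = str(uuid.uuid4())          # same random value on every node
    quorum = len(clients) // 2 + 1     # e.g. 3 out of 5
    start = time.monotonic()

    acquired = 0
    for client in clients:
        try:
            # Step 2: short per-node attempt with the same key and value
            if client.set(resource, value, nx=True, px=ttl_ms):
                acquired += 1
        except Exception:
            pass  # a node that is down or slow counts as a failed acquisition

    # Steps 3-4: the lock is valid only if a majority acquired it fast enough
    elapsed_ms = (time.monotonic() - start) * 1000
    drift_ms = ttl_ms * clock_drift_factor + 2
    validity_ms = ttl_ms - elapsed_ms - drift_ms

    if acquired >= quorum and validity_ms > 0:
        return value, validity_ms      # caller must finish work within validity_ms

    # Step 5: failed - release on ALL nodes (best effort)
    for client in clients:
        try:
            release_lock(client, resource, value)
        except Exception:
            pass
    return None, 0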
Redlock Critique (Martin Kleppmann)
KLEPPMANN'S CRITIQUE:
─────────────────────
1. PROCESS PAUSES CAN STILL CAUSE ISSUES
Client acquires lock
GC pause for 30 seconds
Lock expires
Another client acquires lock
GC ends, first client continues with expired lock
Redlock can't prevent this!
2. CLOCK ASSUMPTIONS
Redlock assumes clocks don't jump forward
But NTP can cause clock jumps
If a majority of nodes have clock issues, safety is violated
3. NOT A CORRECT DISTRIBUTED SYSTEM PRIMITIVE
"Redlock depends on a lot of timing assumptions"
"It is neither a safe solution using perfect clocks,
nor a safe solution using imperfect clocks"
ANTIREZ (Redis author) RESPONSE:
────────────────────────
- Use monotonic clocks
- Design for reasonable clock drift
- Redlock is pragmatic for real-world use
CONCLUSION:
───────────
For truly safety-critical locks, use consensus-based approaches
(Zookeeper, etcd, Consul)
Fencing Tokens
The solution to the GC pause problem.
┌─────────────────────────────────────────────────────────────────────────────┐
│ FENCING TOKENS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ IDEA: Lock service returns incrementing token with each lock │
│ Storage service rejects operations with old tokens │
│ │
│ Worker A Lock Storage │
│ │ │ │ │
│ │──acquire─────► │ │
│ │◄──token=33───│ │ │
│ │ │ │ │
│ │ (GC pause) │──expires── │ │
│ │ │ │ │
│ Worker B │ │ │
│ │──acquire─────► │ │
│ │◄──token=34───│ │ │
│ │ │ │ │
│ │──────────write(token=34)─────► │
│ │◄───────────────OK────────────│ │
│ │ │ │ │
│ │ (GC ends) │ │ │
│ │──────────write(token=33)─────► │
│ │◄──────────REJECTED───────────│ (33 < 34) │
│ │
│ Storage tracks: max_seen_token = 34 │
│ Rejects any write with token < max_seen_token │
│ │
│ REQUIREMENTS: │
│ ───────────── │
│ - Lock service must generate monotonically increasing tokens │
│ - Storage must track and check tokens │
│ - Requires cooperation from storage layer │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
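A minimal sketch of the storage-side check (an illustrative class, not a specific storage engine); in a real system the compare-and-update must itself be atomic, e.g. a single writer thread, a database transaction, or a CAS:

class FencedStorage:
    def __init__(self):
        self.max_seen_token = 0
        self.data = {}

    def write(self, token: int, key: str, value) -> bool:
        # Reject anything older than the newest token we have seen
        if token < self.max_seen_token:
            return False  # REJECTED: a newer lock holder has already written
        self.max_seen_token = token
        self.data[key] = value
        return True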
Zookeeper-Based Locks
# Zookeeper distributed lock using ephemeral sequential nodes
# (`zk` is assumed to be a kazoo-style client; exact create/exists
#  signatures vary slightly between client libraries)
import threading

class ZookeeperLock:
    def __init__(self, zk, lock_path):
        self.zk = zk
        self.lock_path = lock_path
        self.my_node = None

    def acquire(self):
        # Create ephemeral sequential node
        self.my_node = self.zk.create(
            f"{self.lock_path}/lock-",
            ephemeral=True,
            sequence=True
        )
        while True:
            # Get all children, sorted
            children = sorted(self.zk.get_children(self.lock_path))
            my_name = self.my_node.split("/")[-1]

            # Am I first? I have the lock!
            if children[0] == my_name:
                return

            # Watch the node before me
            my_index = children.index(my_name)
            node_before = children[my_index - 1]

            # Wait for it to be deleted, then re-check
            event = threading.Event()

            def watch_callback(watched_event):
                event.set()

            if self.zk.exists(
                f"{self.lock_path}/{node_before}",
                watch=watch_callback
            ):
                event.wait()  # Wait for deletion

    def release(self):
        if self.my_node:
            self.zk.delete(self.my_node)
            self.my_node = None

# Why this works:
# 1. Ephemeral nodes deleted if client disconnects
# 2. Sequential ensures ordering
# 3. Only watch predecessor - no herd effect
# 4. Zookeeper handles consistency via ZAB
Key Interview Questions
Q: Explain the difference between 2PC and Saga
2PC:
- Coordinator blocks until all participants vote
- Holds locks during prepare phase
- Strong consistency, immediate
- Can block on coordinator failure
- Better for database transactions
Saga:
- No global locks, each step commits independently
- Uses compensation for rollback
- Eventually consistent
- No blocking, more available
- Better for microservices, long transactions
When to use:
- 2PC: Short transactions, need strong consistency, can afford latency
- Saga: Long transactions, need high availability, can tolerate eventual consistency
Q: Design a distributed lock for a rate limiter
Requirements:
- Limit API calls per user per minute
- Multiple server instances
- Must be accurate (not approximate)
Solution 1: Redis with Lua script
─────────────────────────────────
key = f"rate:{user_id}:{current_minute}"
Script (atomic):
current = GET key
if current >= limit:
return REJECTED
INCR key
EXPIRE key 60
return ALLOWED
Solution 2: Token bucket with lock
─────────────────────────────────
1. Acquire lock for user's bucket
2. Check/update token count
3. Release lock
Use Redlock or Zookeeper for the lock
Solution 3: Sliding window (no lock)
────────────────────────────────────
Use sorted set with timestamp scores
ZADD key timestamp request_id
ZREMRANGEBYSCORE key 0 (now - window)
ZCARD key → current count
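A minimal sketch of Solution 1 using a redis-py client and an atomic Lua script that follows the pseudocode above; the key layout and default limit are illustrative:

import time
import redis

RATE_LIMIT_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current >= tonumber(ARGV[1]) then
    return 0
end
redis.call('INCR', KEYS[1])
redis.call('EXPIRE', KEYS[1], 60)
return 1
"""

r = redis.Redis()
limiter = r.register_script(RATE_LIMIT_LUA)

def allow_request(user_id: str, limit: int = 100) -> bool:
    current_minute = int(time.time() // 60)
    key = f"rate:{user_id}:{current_minute}"
    return bool(limiter(keys=[key], args=[limit]))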
Q: How would you handle a saga compensation failure?
Strategies:
1. RETRY WITH BACKOFF
- Compensations must be idempotent
- Retry with exponential backoff
- After N retries, escalate
2. DEAD LETTER QUEUE
- Store failed compensations
- Process later (automated or manual)
- Alert operators
3. EVENTUAL RECONCILIATION
- Background job checks consistency
- Fixes discrepancies automatically
4. MANUAL INTERVENTION
- Some failures need human decision
- Provide good tooling/visibility
5. DESIGN FOR FAILURE
- Make services tolerate inconsistency
- Use soft deletes, status fields
- Build reversal into business process
Key: Never lose the failure information!
Always log/store failed saga state
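A minimal sketch of strategies 1 and 2 combined: retry an idempotent compensation with exponential backoff, then push the failure to a dead letter queue for operators (the queue interface is an assumption, e.g. an asyncio.Queue or a broker client):

import asyncio
import logging

log = logging.getLogger("saga")

async def run_compensation(compensator, step, saga_id, dead_letter_queue,
                           max_retries=5, base_delay=1.0):
    for attempt in range(1, max_retries + 1):
        try:
            await compensator()  # must be idempotent
            return True
        except Exception as exc:
            log.warning("Compensation %s failed (attempt %d/%d): %s",
                        step, attempt, max_retries, exc)
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

    # Never lose the failure information: persist it and alert operators
    await dead_letter_queue.put({"saga_id": saga_id, "step": step,
                                 "status": "COMPENSATION_FAILED"})
    return False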
Q: Why is distributed locking harder than it seems?
Core challenges:
- Can’t distinguish slow from dead
  - Heartbeat timeout? Node might just be slow
  - GC pause, network delay, CPU starvation
- Clock skew
  - Lock expiry depends on time
  - Different machines have different times
  - NTP can jump clocks
- Partial failures
  - Acquired lock on some nodes, not others
  - Network partition mid-operation
- Client failures
  - Client crashes while holding lock
  - Need automatic expiry/release
  - But expiry can cause double-holding
Mitigations:
- Fencing tokens (best)
- Consensus-based locks (Zookeeper, etcd)
- Design system to tolerate inconsistency
- Accept that “perfect” distributed lock doesn’t exist
Next Steps
Continue to Track 5: Data Systems at Scale
Learn about partitioning, distributed databases, and stream processing