Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

CAP Theorem Explained

What are Distributed Systems?

A distributed system is a collection of independent computers that appear to users as a single coherent system. Leslie Lamport put it best: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” That quote captures the essential challenge — once your system spans multiple machines, you inherit a whole category of problems (network partitions, clock skew, partial failures) that simply do not exist on a single machine. The analogy that works best: imagine a team of chefs in separate kitchens, communicating only by passing notes through a mail slot. They need to coordinate on what dishes to prepare, in what order, using shared ingredients — but notes can be delayed, lost, or arrive out of order. That is distributed systems in a nutshell. Everything that follows in this chapter is about managing that coordination challenge.

Why Distributed?

Scalability

Handle more load than a single machine

Reliability

Survive individual machine failures

Latency

Serve users from nearby locations

Compliance

Data locality requirements

The Eight Fallacies

These fallacies, originally articulated by Peter Deutsch and James Gosling at Sun Microsystems, are the assumptions that bite engineers hardest when they move from single-machine to distributed systems. In interviews, casually referencing one or two of these when discussing failure modes signals deep understanding.
  1. The network is reliable — Packets get lost. AWS reports measurable packet loss even within a single availability zone. Design for retries and idempotency from day one.
  2. Latency is zero — Cross-region calls take 100-300ms. A design that makes 10 sequential cross-service calls adds a full second of latency. Batch and parallelize.
  3. Bandwidth is infinite — A chatty microservice that transfers 1MB payloads at 10K QPS needs 10 GB/s of bandwidth and costs real money in cloud egress fees.
  4. The network is secure — Always encrypt in transit (mTLS between services). Internal networks are not safe; lateral movement is how most breaches escalate.
  5. Topology doesn’t change — Containers and VMs spin up and down constantly. Hard-coded IP addresses will break. Use service discovery.
  6. There is one administrator — In a cloud/microservices world, dozens of teams own different parts of the infrastructure with different policies.
  7. Transport cost is zero — AWS charges $0.01-0.09/GB for cross-AZ and cross-region data transfer. At scale, this becomes a major line item.
  8. The network is homogeneous — Your services run on different hardware, different OS versions, different network cards with different MTU sizes.

Consistency Models

Strong Consistency

All nodes see the same data at the same time. This is the most intuitive model — it behaves as if there is only one copy of the data. The price you pay is latency (every write must be acknowledged by replicas before returning) and availability (if replicas are unreachable, writes block). Strong consistency is non-negotiable for use cases where “stale reads” cause real harm: bank balances, inventory counts, seat reservations. Strong Consistency

Eventual Consistency

All nodes will eventually have the same data. Eventual Consistency

Consistency Implementation Examples

from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional, List, Any, Dict
from datetime import datetime
import hashlib
import json
import asyncio

@dataclass
class VectorClock:
    """Track causality in distributed systems"""
    clocks: Dict[str, int] = field(default_factory=dict)
    
    def increment(self, node_id: str) -> None:
        self.clocks[node_id] = self.clocks.get(node_id, 0) + 1
    
    def merge(self, other: 'VectorClock') -> 'VectorClock':
        """Merge two vector clocks (take max of each)"""
        merged = VectorClock()
        all_nodes = set(self.clocks.keys()) | set(other.clocks.keys())
        for node in all_nodes:
            merged.clocks[node] = max(
                self.clocks.get(node, 0),
                other.clocks.get(node, 0)
            )
        return merged
    
    def is_concurrent_with(self, other: 'VectorClock') -> bool:
        """Check if two events are concurrent (neither happened before the other)"""
        self_greater = any(
            self.clocks.get(k, 0) > other.clocks.get(k, 0)
            for k in self.clocks
        )
        other_greater = any(
            other.clocks.get(k, 0) > self.clocks.get(k, 0)
            for k in other.clocks
        )
        return self_greater and other_greater


@dataclass
class ReplicatedValue:
    """Last-Write-Wins Register for eventual consistency"""
    value: Any
    timestamp: float
    node_id: str
    vector_clock: VectorClock
    
    def merge(self, other: 'ReplicatedValue') -> 'ReplicatedValue':
        """Merge with conflict resolution"""
        # Check for concurrent updates
        if self.vector_clock.is_concurrent_with(other.vector_clock):
            # Conflict! Use timestamp + node_id as tiebreaker
            if (self.timestamp, self.node_id) > (other.timestamp, other.node_id):
                winner = self
            else:
                winner = other
        elif self.timestamp > other.timestamp:
            winner = self
        else:
            winner = other
            
        return ReplicatedValue(
            value=winner.value,
            timestamp=winner.timestamp,
            node_id=winner.node_id,
            vector_clock=self.vector_clock.merge(other.vector_clock)
        )


class EventuallyConsistentStore:
    """Distributed key-value store with eventual consistency"""
    
    def __init__(self, node_id: str, replicas: List['EventuallyConsistentStore'] = None):
        self.node_id = node_id
        self.data: Dict[str, ReplicatedValue] = {}
        self.vector_clock = VectorClock()
        self.replicas = replicas or []
        self.pending_sync: List[tuple] = []
    
    def write(self, key: str, value: Any) -> None:
        """Write with vector clock tracking"""
        self.vector_clock.increment(self.node_id)
        
        replicated = ReplicatedValue(
            value=value,
            timestamp=datetime.now().timestamp(),
            node_id=self.node_id,
            vector_clock=VectorClock(clocks=self.vector_clock.clocks.copy())
        )
        
        # Local write
        if key in self.data:
            self.data[key] = self.data[key].merge(replicated)
        else:
            self.data[key] = replicated
        
        # Queue for async replication
        self.pending_sync.append((key, replicated))
    
    def read(self, key: str) -> Optional[Any]:
        """Read (may return stale data)"""
        if key in self.data:
            return self.data[key].value
        return None
    
    async def sync_to_replicas(self) -> None:
        """Background sync to other replicas"""
        while self.pending_sync:
            key, value = self.pending_sync.pop(0)
            for replica in self.replicas:
                try:
                    await replica.receive_sync(key, value)
                except Exception as e:
                    # Re-queue failed sync
                    self.pending_sync.append((key, value))
                    print(f"Sync failed to {replica.node_id}: {e}")
    
    async def receive_sync(self, key: str, incoming: ReplicatedValue) -> None:
        """Receive and merge update from another replica"""
        if key in self.data:
            self.data[key] = self.data[key].merge(incoming)
        else:
            self.data[key] = incoming
        
        # Update our vector clock
        self.vector_clock = self.vector_clock.merge(incoming.vector_clock)


# Usage example
async def demo_eventual_consistency():
    # Three replicas
    replica_a = EventuallyConsistentStore("node-a")
    replica_b = EventuallyConsistentStore("node-b")
    replica_c = EventuallyConsistentStore("node-c")
    
    # Cross-reference replicas
    replica_a.replicas = [replica_b, replica_c]
    replica_b.replicas = [replica_a, replica_c]
    replica_c.replicas = [replica_a, replica_b]
    
    # Concurrent writes (simulating network partition)
    replica_a.write("user:1:name", "Alice")
    replica_b.write("user:1:name", "Alicia")  # Conflict!
    
    # Before sync: different values
    print(f"A sees: {replica_a.read('user:1:name')}")  # Alice
    print(f"B sees: {replica_b.read('user:1:name')}")  # Alicia
    
    # After sync: converges to same value
    await replica_a.sync_to_replicas()
    await replica_b.sync_to_replicas()
    
    print(f"After sync - A: {replica_a.read('user:1:name')}")
    print(f"After sync - B: {replica_b.read('user:1:name')}")
    # Both will show the same value (winner based on timestamp + node_id)

Consistency Levels

LevelDescriptionTrade-off
StrongAll reads see latest writeHigh latency
EventualReads may be staleLow latency
CausalRespects cause-effectMedium
Read-your-writesSee your own writesGood UX
SessionConsistency within sessionPractical

Consensus Algorithms

How do distributed nodes agree on a value?

Paxos (Simplified)

Paxos Simplified

Raft (Easier to Understand)

Raft Consensus

Raft Implementation

import asyncio
import random
from enum import Enum
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Callable
from datetime import datetime

class NodeState(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class LogEntry:
    term: int
    index: int
    command: str
    data: dict

@dataclass
class RaftNode:
    """Simplified Raft consensus node"""
    node_id: str
    state: NodeState = NodeState.FOLLOWER
    current_term: int = 0
    voted_for: Optional[str] = None
    log: List[LogEntry] = field(default_factory=list)
    commit_index: int = 0
    last_applied: int = 0
    
    # Leader state
    next_index: Dict[str, int] = field(default_factory=dict)
    match_index: Dict[str, int] = field(default_factory=dict)
    
    # Cluster info
    peers: List[str] = field(default_factory=list)
    
    # Timers
    election_timeout: float = 0
    heartbeat_interval: float = 0.15  # 150ms
    
    def __post_init__(self):
        self.reset_election_timeout()
    
    def reset_election_timeout(self):
        """Randomized election timeout (150-300ms typically)"""
        self.election_timeout = random.uniform(0.15, 0.30)
    
    async def start_election(self) -> bool:
        """Candidate starts election"""
        self.state = NodeState.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1  # Vote for self
        
        # Request votes from all peers
        vote_requests = []
        for peer_id in self.peers:
            vote_requests.append(
                self.request_vote(peer_id)
            )
        
        results = await asyncio.gather(*vote_requests, return_exceptions=True)
        votes += sum(1 for r in results if r is True)
        
        # Check if won election
        quorum = (len(self.peers) + 1) // 2 + 1
        if votes >= quorum:
            await self.become_leader()
            return True
        
        return False
    
    async def become_leader(self):
        """Transition to leader state"""
        self.state = NodeState.LEADER
        print(f"[{self.node_id}] Became leader for term {self.current_term}")
        
        # Initialize leader state
        for peer_id in self.peers:
            self.next_index[peer_id] = len(self.log) + 1
            self.match_index[peer_id] = 0
        
        # Start heartbeat
        asyncio.create_task(self.send_heartbeats())
    
    async def send_heartbeats(self):
        """Leader sends periodic heartbeats"""
        while self.state == NodeState.LEADER:
            for peer_id in self.peers:
                asyncio.create_task(self.append_entries(peer_id, []))
            await asyncio.sleep(self.heartbeat_interval)
    
    async def request_vote(self, peer_id: str) -> bool:
        """Request vote from a peer"""
        # In real implementation, this would be RPC
        last_log_index = len(self.log)
        last_log_term = self.log[-1].term if self.log else 0
        
        # Simulate RPC response
        # Peer grants vote if:
        # 1. Candidate's term >= peer's term
        # 2. Peer hasn't voted for someone else
        # 3. Candidate's log is at least as up-to-date
        return True  # Simplified
    
    async def append_entries(
        self, 
        peer_id: str, 
        entries: List[LogEntry]
    ) -> bool:
        """AppendEntries RPC (heartbeat + log replication)"""
        prev_log_index = self.next_index.get(peer_id, 1) - 1
        prev_log_term = self.log[prev_log_index - 1].term if prev_log_index > 0 else 0
        
        # Simulate RPC
        success = True  # In real implementation, check response
        
        if success and entries:
            self.match_index[peer_id] = prev_log_index + len(entries)
            self.next_index[peer_id] = self.match_index[peer_id] + 1
            
            # Update commit index
            self.update_commit_index()
        
        return success
    
    def update_commit_index(self):
        """Leader updates commit index based on replication"""
        for n in range(self.commit_index + 1, len(self.log) + 1):
            replicated_count = 1  # Count self
            for peer_id in self.peers:
                if self.match_index.get(peer_id, 0) >= n:
                    replicated_count += 1
            
            quorum = (len(self.peers) + 1) // 2 + 1
            if replicated_count >= quorum:
                if self.log[n - 1].term == self.current_term:
                    self.commit_index = n
    
    def propose(self, command: str, data: dict) -> Optional[LogEntry]:
        """Client proposes a command (only leader can accept)"""
        if self.state != NodeState.LEADER:
            return None  # Redirect to leader
        
        entry = LogEntry(
            term=self.current_term,
            index=len(self.log) + 1,
            command=command,
            data=data
        )
        self.log.append(entry)
        
        # Replicate to followers (async)
        asyncio.create_task(self.replicate_log())
        
        return entry
    
    async def replicate_log(self):
        """Replicate latest log entries to all followers"""
        for peer_id in self.peers:
            next_idx = self.next_index.get(peer_id, 1)
            entries_to_send = self.log[next_idx - 1:]
            
            if entries_to_send:
                await self.append_entries(peer_id, entries_to_send)


# Leader election simulation
async def simulate_raft():
    nodes = [
        RaftNode(node_id="node-1", peers=["node-2", "node-3"]),
        RaftNode(node_id="node-2", peers=["node-1", "node-3"]),
        RaftNode(node_id="node-3", peers=["node-1", "node-2"]),
    ]
    
    # Simulate node-1 winning election
    leader = nodes[0]
    await leader.start_election()
    
    # Propose a command
    if leader.state == NodeState.LEADER:
        entry = leader.propose("SET", {"key": "name", "value": "Alice"})
        print(f"Proposed: {entry}")

Distributed Transactions

Two-Phase Commit (2PC)

Two-Phase Commit

Two-Phase Commit Implementation

import asyncio
from enum import Enum
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Any
from datetime import datetime, timedelta
import uuid

class TxState(Enum):
    PENDING = "pending"
    PREPARED = "prepared"
    COMMITTED = "committed"
    ABORTED = "aborted"

@dataclass
class Transaction:
    tx_id: str
    state: TxState = TxState.PENDING
    participants: List[str] = field(default_factory=list)
    votes: Dict[str, bool] = field(default_factory=dict)
    created_at: datetime = field(default_factory=datetime.now)
    timeout: timedelta = timedelta(seconds=30)

class TwoPhaseCommitCoordinator:
    """Coordinates distributed transactions using 2PC"""
    
    def __init__(self, coordinator_id: str):
        self.coordinator_id = coordinator_id
        self.transactions: Dict[str, Transaction] = {}
        self.participants: Dict[str, 'TransactionParticipant'] = {}
    
    def register_participant(self, participant: 'TransactionParticipant'):
        self.participants[participant.participant_id] = participant
    
    async def begin_transaction(
        self, 
        participant_ids: List[str],
        operations: Dict[str, Any]
    ) -> str:
        """Start a new distributed transaction"""
        tx_id = f"tx-{uuid.uuid4().hex[:8]}"
        tx = Transaction(tx_id=tx_id, participants=participant_ids)
        self.transactions[tx_id] = tx
        
        print(f"[Coordinator] Starting transaction {tx_id}")
        
        try:
            # Phase 1: Prepare
            prepared = await self.prepare_phase(tx_id, operations)
            
            if prepared:
                # Phase 2: Commit
                await self.commit_phase(tx_id)
                return tx_id
            else:
                # Phase 2: Abort
                await self.abort_phase(tx_id)
                raise Exception("Transaction aborted - not all participants prepared")
                
        except asyncio.TimeoutError:
            await self.abort_phase(tx_id)
            raise Exception("Transaction timed out")
    
    async def prepare_phase(
        self, 
        tx_id: str, 
        operations: Dict[str, Any]
    ) -> bool:
        """Phase 1: Ask all participants to prepare"""
        tx = self.transactions[tx_id]
        print(f"[Coordinator] Phase 1: PREPARE for {tx_id}")
        
        prepare_tasks = []
        for participant_id in tx.participants:
            participant = self.participants.get(participant_id)
            if participant:
                op = operations.get(participant_id, {})
                prepare_tasks.append(
                    participant.prepare(tx_id, op)
                )
        
        # Wait for all votes with timeout
        try:
            results = await asyncio.wait_for(
                asyncio.gather(*prepare_tasks, return_exceptions=True),
                timeout=tx.timeout.total_seconds()
            )
            
            # Check all votes
            for participant_id, result in zip(tx.participants, results):
                if isinstance(result, Exception):
                    tx.votes[participant_id] = False
                else:
                    tx.votes[participant_id] = result
            
            all_prepared = all(tx.votes.values())
            tx.state = TxState.PREPARED if all_prepared else TxState.PENDING
            
            return all_prepared
            
        except asyncio.TimeoutError:
            return False
    
    async def commit_phase(self, tx_id: str):
        """Phase 2: Tell all participants to commit"""
        tx = self.transactions[tx_id]
        print(f"[Coordinator] Phase 2: COMMIT for {tx_id}")
        
        commit_tasks = []
        for participant_id in tx.participants:
            participant = self.participants.get(participant_id)
            if participant:
                commit_tasks.append(participant.commit(tx_id))
        
        await asyncio.gather(*commit_tasks, return_exceptions=True)
        tx.state = TxState.COMMITTED
        print(f"[Coordinator] Transaction {tx_id} COMMITTED")
    
    async def abort_phase(self, tx_id: str):
        """Phase 2: Tell all participants to abort"""
        tx = self.transactions[tx_id]
        print(f"[Coordinator] Phase 2: ABORT for {tx_id}")
        
        abort_tasks = []
        for participant_id in tx.participants:
            participant = self.participants.get(participant_id)
            if participant:
                abort_tasks.append(participant.abort(tx_id))
        
        await asyncio.gather(*abort_tasks, return_exceptions=True)
        tx.state = TxState.ABORTED
        print(f"[Coordinator] Transaction {tx_id} ABORTED")


class TransactionParticipant:
    """Participant in a distributed transaction"""
    
    def __init__(self, participant_id: str):
        self.participant_id = participant_id
        self.prepared_data: Dict[str, Any] = {}
        self.committed_data: Dict[str, Any] = {}
        self.undo_log: Dict[str, Any] = {}
    
    async def prepare(self, tx_id: str, operation: Dict) -> bool:
        """Prepare to commit - acquire locks, validate, write to redo log"""
        print(f"[{self.participant_id}] Preparing {tx_id}: {operation}")
        
        try:
            # Simulate validation and lock acquisition
            await asyncio.sleep(0.1)  # Simulate I/O
            
            # Write to redo log (would be durable in real implementation)
            self.prepared_data[tx_id] = operation
            
            # Save undo information
            key = operation.get("key")
            if key:
                self.undo_log[tx_id] = {
                    "key": key,
                    "old_value": self.committed_data.get(key)
                }
            
            print(f"[{self.participant_id}] Voted YES for {tx_id}")
            return True
            
        except Exception as e:
            print(f"[{self.participant_id}] Voted NO for {tx_id}: {e}")
            return False
    
    async def commit(self, tx_id: str):
        """Commit the prepared transaction"""
        print(f"[{self.participant_id}] Committing {tx_id}")
        
        operation = self.prepared_data.pop(tx_id, {})
        
        # Apply the operation
        key = operation.get("key")
        value = operation.get("value")
        if key:
            self.committed_data[key] = value
        
        # Clear undo log
        self.undo_log.pop(tx_id, None)
    
    async def abort(self, tx_id: str):
        """Abort - rollback using undo log"""
        print(f"[{self.participant_id}] Aborting {tx_id}")
        
        # Rollback using undo log
        undo = self.undo_log.pop(tx_id, None)
        if undo:
            key = undo["key"]
            old_value = undo["old_value"]
            if old_value is not None:
                self.committed_data[key] = old_value
            elif key in self.committed_data:
                del self.committed_data[key]
        
        self.prepared_data.pop(tx_id, None)


# Usage example
async def demo_2pc():
    coordinator = TwoPhaseCommitCoordinator("coordinator-1")
    
    # Register participants (e.g., database shards)
    orders_db = TransactionParticipant("orders-db")
    inventory_db = TransactionParticipant("inventory-db")
    
    coordinator.register_participant(orders_db)
    coordinator.register_participant(inventory_db)
    
    # Execute distributed transaction
    try:
        tx_id = await coordinator.begin_transaction(
            participant_ids=["orders-db", "inventory-db"],
            operations={
                "orders-db": {"key": "order:123", "value": {"product": "laptop", "qty": 1}},
                "inventory-db": {"key": "product:laptop:stock", "value": 99}
            }
        )
        print(f"Transaction {tx_id} completed successfully!")
        
    except Exception as e:
        print(f"Transaction failed: {e}")

asyncio.run(demo_2pc())

Saga Pattern

Saga Pattern

Choreography vs Orchestration

Choreography (Event-driven)

Order ─► OrderCreated event

    ┌─────────┼─────────┐
    ▼         ▼         ▼
Inventory  Payment  Shipping
    │         │         │
    └─────────┴─────────┘
          Events

• No central controller
• Services react to events
• More decoupled


Orchestration (Command-driven)

    ┌───────────────────┐
    │   Saga Manager    │
    └─────────┬─────────┘

    ┌─────────┼─────────┐
    ▼         ▼         ▼
Inventory  Payment  Shipping
    
• Central controller
• Explicit flow control
• Easier to understand

Handling Failures

Circuit Breaker Pattern

Circuit Breaker
import time
import asyncio
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable, TypeVar, Generic, Optional, Any
from functools import wraps
from datetime import datetime, timedelta
import logging

T = TypeVar('T')
logger = logging.getLogger(__name__)

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject requests
    HALF_OPEN = "half_open"  # Testing if service recovered

@dataclass
class CircuitBreakerConfig:
    failure_threshold: int = 5
    success_threshold: int = 2  # Successes needed to close from half-open
    timeout: timedelta = timedelta(seconds=60)
    half_open_max_calls: int = 3
    excluded_exceptions: tuple = ()  # Exceptions that don't count as failures

class CircuitBreakerOpenException(Exception):
    """Raised when circuit breaker is open"""
    def __init__(self, circuit_name: str, retry_after: float):
        self.circuit_name = circuit_name
        self.retry_after = retry_after
        super().__init__(f"Circuit {circuit_name} is OPEN. Retry after {retry_after:.1f}s")

@dataclass
class CircuitBreaker:
    """Production-ready circuit breaker with metrics"""
    name: str
    config: CircuitBreakerConfig = field(default_factory=CircuitBreakerConfig)
    
    state: CircuitState = field(default=CircuitState.CLOSED, init=False)
    failure_count: int = field(default=0, init=False)
    success_count: int = field(default=0, init=False)
    last_failure_time: Optional[datetime] = field(default=None, init=False)
    half_open_calls: int = field(default=0, init=False)
    
    # Metrics
    total_calls: int = field(default=0, init=False)
    total_failures: int = field(default=0, init=False)
    total_rejections: int = field(default=0, init=False)

    def __call__(self, func: Callable) -> Callable:
        """Use as decorator"""
        @wraps(func)
        async def async_wrapper(*args, **kwargs):
            return await self.call_async(lambda: func(*args, **kwargs))
        
        @wraps(func)
        def sync_wrapper(*args, **kwargs):
            return self.call_sync(lambda: func(*args, **kwargs))
        
        return async_wrapper if asyncio.iscoroutinefunction(func) else sync_wrapper

    def call_sync(self, func: Callable[[], T]) -> T:
        """Execute function with circuit breaker (sync)"""
        self._check_state()
        self.total_calls += 1
        
        try:
            result = func()
            self._on_success()
            return result
        except self.config.excluded_exceptions:
            raise  # Don't count as failure
        except Exception as e:
            self._on_failure()
            raise

    async def call_async(self, func: Callable) -> Any:
        """Execute function with circuit breaker (async)"""
        self._check_state()
        self.total_calls += 1
        
        try:
            result = await func() if asyncio.iscoroutinefunction(func) else func()
            self._on_success()
            return result
        except self.config.excluded_exceptions:
            raise
        except Exception as e:
            self._on_failure()
            raise

    def _check_state(self):
        """Check if request should be allowed"""
        if self.state == CircuitState.CLOSED:
            return
        
        if self.state == CircuitState.OPEN:
            if self._should_attempt_reset():
                self._transition_to_half_open()
            else:
                self.total_rejections += 1
                retry_after = (
                    self.last_failure_time + self.config.timeout - datetime.now()
                ).total_seconds()
                raise CircuitBreakerOpenException(self.name, max(0, retry_after))
        
        if self.state == CircuitState.HALF_OPEN:
            if self.half_open_calls >= self.config.half_open_max_calls:
                self.total_rejections += 1
                raise CircuitBreakerOpenException(self.name, 1.0)
            self.half_open_calls += 1

    def _should_attempt_reset(self) -> bool:
        """Check if enough time has passed to try again"""
        if self.last_failure_time is None:
            return True
        return datetime.now() - self.last_failure_time >= self.config.timeout

    def _transition_to_half_open(self):
        """Move to half-open state"""
        logger.info(f"Circuit {self.name}: OPEN -> HALF_OPEN")
        self.state = CircuitState.HALF_OPEN
        self.half_open_calls = 0
        self.success_count = 0

    def _on_success(self):
        """Handle successful call"""
        if self.state == CircuitState.HALF_OPEN:
            self.success_count += 1
            if self.success_count >= self.config.success_threshold:
                self._close()
        else:
            self.failure_count = 0  # Reset on success

    def _on_failure(self):
        """Handle failed call"""
        self.total_failures += 1
        self.failure_count += 1
        self.last_failure_time = datetime.now()
        
        if self.state == CircuitState.HALF_OPEN:
            self._open()
        elif self.failure_count >= self.config.failure_threshold:
            self._open()

    def _open(self):
        """Open the circuit"""
        logger.warning(f"Circuit {self.name}: -> OPEN (failures: {self.failure_count})")
        self.state = CircuitState.OPEN

    def _close(self):
        """Close the circuit"""
        logger.info(f"Circuit {self.name}: -> CLOSED")
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.half_open_calls = 0

    @property
    def metrics(self) -> dict:
        """Get circuit breaker metrics"""
        return {
            "name": self.name,
            "state": self.state.value,
            "total_calls": self.total_calls,
            "total_failures": self.total_failures,
            "total_rejections": self.total_rejections,
            "failure_rate": self.total_failures / max(1, self.total_calls),
            "current_failure_count": self.failure_count
        }


# Usage as decorator
payment_circuit = CircuitBreaker(
    name="payment-service",
    config=CircuitBreakerConfig(
        failure_threshold=3,
        success_threshold=2,
        timeout=timedelta(seconds=30)
    )
)

@payment_circuit
async def call_payment_service(amount: float):
    # Simulate external API call
    async with aiohttp.ClientSession() as session:
        async with session.post(
            "https://api.payment.com/charge",
            json={"amount": amount}
        ) as response:
            if response.status != 200:
                raise Exception(f"Payment failed: {response.status}")
            return await response.json()


# Usage with context manager
class CircuitBreakerContext:
    def __init__(self, circuit: CircuitBreaker):
        self.circuit = circuit
    
    def __enter__(self):
        self.circuit._check_state()
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None:
            self.circuit._on_success()
        elif exc_type not in self.circuit.config.excluded_exceptions:
            self.circuit._on_failure()
        return False  # Don't suppress exceptions


# Example with fallback
async def get_user_with_fallback(user_id: str):
    try:
        return await call_user_service(user_id)
    except CircuitBreakerOpenException:
        # Fallback: return cached data
        return get_cached_user(user_id)

Retry with Exponential Backoff

import asyncio
import random
import time
from functools import wraps
from typing import Callable, TypeVar, Type, Tuple, Optional
from dataclasses import dataclass
import logging

T = TypeVar('T')
logger = logging.getLogger(__name__)

@dataclass
class RetryConfig:
    max_retries: int = 5
    base_delay: float = 1.0
    max_delay: float = 60.0
    exponential_base: float = 2.0
    jitter: bool = True
    retryable_exceptions: Tuple[Type[Exception], ...] = (Exception,)
    on_retry: Optional[Callable[[Exception, int], None]] = None

class RetryExhausted(Exception):
    """All retries failed"""
    def __init__(self, last_exception: Exception, attempts: int):
        self.last_exception = last_exception
        self.attempts = attempts
        super().__init__(f"All {attempts} retry attempts failed: {last_exception}")

def retry_with_backoff(config: Optional[RetryConfig] = None):
    """Decorator for retry with exponential backoff"""
    if config is None:
        config = RetryConfig()
    
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        async def async_wrapper(*args, **kwargs) -> T:
            last_exception = None
            
            for attempt in range(config.max_retries):
                try:
                    return await func(*args, **kwargs)
                    
                except config.retryable_exceptions as e:
                    last_exception = e
                    
                    if attempt == config.max_retries - 1:
                        break
                    
                    # Calculate delay with exponential backoff
                    delay = min(
                        config.base_delay * (config.exponential_base ** attempt),
                        config.max_delay
                    )
                    
                    # Add jitter to prevent thundering herd
                    if config.jitter:
                        delay = delay * (0.5 + random.random())
                    
                    logger.warning(
                        f"Attempt {attempt + 1}/{config.max_retries} failed: {e}. "
                        f"Retrying in {delay:.2f}s"
                    )
                    
                    if config.on_retry:
                        config.on_retry(e, attempt + 1)
                    
                    await asyncio.sleep(delay)
            
            raise RetryExhausted(last_exception, config.max_retries)
        
        @wraps(func)
        def sync_wrapper(*args, **kwargs) -> T:
            last_exception = None
            
            for attempt in range(config.max_retries):
                try:
                    return func(*args, **kwargs)
                    
                except config.retryable_exceptions as e:
                    last_exception = e
                    
                    if attempt == config.max_retries - 1:
                        break
                    
                    delay = min(
                        config.base_delay * (config.exponential_base ** attempt),
                        config.max_delay
                    )
                    
                    if config.jitter:
                        delay = delay * (0.5 + random.random())
                    
                    logger.warning(
                        f"Attempt {attempt + 1}/{config.max_retries} failed: {e}. "
                        f"Retrying in {delay:.2f}s"
                    )
                    
                    if config.on_retry:
                        config.on_retry(e, attempt + 1)
                    
                    time.sleep(delay)
            
            raise RetryExhausted(last_exception, config.max_retries)
        
        return async_wrapper if asyncio.iscoroutinefunction(func) else sync_wrapper
    
    return decorator


# Usage examples
@retry_with_backoff(RetryConfig(
    max_retries=3,
    base_delay=1.0,
    retryable_exceptions=(ConnectionError, TimeoutError)
))
async def fetch_user_data(user_id: str):
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://api.example.com/users/{user_id}") as resp:
            if resp.status == 503:
                raise ConnectionError("Service unavailable")
            return await resp.json()


# Retry with callback for metrics
def on_retry_callback(exception: Exception, attempt: int):
    metrics.increment("api.retries", tags={"attempt": attempt})

@retry_with_backoff(RetryConfig(
    max_retries=5,
    on_retry=on_retry_callback
))
def call_external_api():
    response = requests.get("https://api.example.com/data")
    response.raise_for_status()
    return response.json()


# Combine with circuit breaker
@payment_circuit
@retry_with_backoff(RetryConfig(max_retries=2))
async def charge_payment_resilient(amount: float):
    """Retries within circuit breaker protection"""
    return await call_payment_service(amount)

Bulkhead Pattern

Isolate failures to prevent cascade. Bulkhead Pattern
import asyncio
from dataclasses import dataclass, field
from typing import Dict, Optional, Callable, TypeVar, Any
from functools import wraps
from contextlib import asynccontextmanager
import time
from enum import Enum

T = TypeVar('T')

class BulkheadRejectedException(Exception):
    """Raised when bulkhead rejects request"""
    def __init__(self, bulkhead_name: str, reason: str):
        self.bulkhead_name = bulkhead_name
        self.reason = reason
        super().__init__(f"Bulkhead {bulkhead_name} rejected: {reason}")

@dataclass
class BulkheadConfig:
    max_concurrent: int = 10
    max_queue: int = 100
    queue_timeout: float = 5.0  # seconds

@dataclass
class BulkheadMetrics:
    active_count: int = 0
    queue_size: int = 0
    total_accepted: int = 0
    total_rejected: int = 0
    total_timeout: int = 0

class Bulkhead:
    """
    Bulkhead pattern - isolate components to prevent cascade failures.
    Limits concurrent execution and queues excess requests.
    """
    
    def __init__(self, name: str, config: BulkheadConfig = None):
        self.name = name
        self.config = config or BulkheadConfig()
        self.semaphore = asyncio.Semaphore(self.config.max_concurrent)
        self.metrics = BulkheadMetrics()
        self._queue = asyncio.Queue(maxsize=self.config.max_queue)
    
    def __call__(self, func: Callable) -> Callable:
        """Use as decorator"""
        @wraps(func)
        async def wrapper(*args, **kwargs):
            async with self.acquire():
                return await func(*args, **kwargs)
        return wrapper
    
    @asynccontextmanager
    async def acquire(self):
        """Acquire a slot in the bulkhead"""
        # Check if we can get a semaphore slot
        if self.semaphore.locked() and self._queue.full():
            self.metrics.total_rejected += 1
            raise BulkheadRejectedException(
                self.name, 
                f"Max concurrent ({self.config.max_concurrent}) and queue ({self.config.max_queue}) reached"
            )
        
        # Try to acquire with timeout
        try:
            await asyncio.wait_for(
                self.semaphore.acquire(),
                timeout=self.config.queue_timeout
            )
        except asyncio.TimeoutError:
            self.metrics.total_timeout += 1
            raise BulkheadRejectedException(
                self.name,
                f"Queue timeout after {self.config.queue_timeout}s"
            )
        
        self.metrics.active_count += 1
        self.metrics.total_accepted += 1
        
        try:
            yield
        finally:
            self.metrics.active_count -= 1
            self.semaphore.release()
    
    def get_metrics(self) -> dict:
        return {
            "name": self.name,
            "active": self.metrics.active_count,
            "max_concurrent": self.config.max_concurrent,
            "accepted": self.metrics.total_accepted,
            "rejected": self.metrics.total_rejected,
            "timeouts": self.metrics.total_timeout,
            "utilization": self.metrics.active_count / self.config.max_concurrent
        }


class ThreadPoolBulkhead:
    """
    Thread pool bulkhead for sync operations.
    Each pool isolates a group of operations.
    """
    
    def __init__(self, name: str, max_workers: int = 10):
        from concurrent.futures import ThreadPoolExecutor
        self.name = name
        self.executor = ThreadPoolExecutor(
            max_workers=max_workers,
            thread_name_prefix=f"bulkhead-{name}"
        )
        self.max_workers = max_workers
    
    def submit(self, func: Callable, *args, **kwargs):
        """Submit work to the bulkhead thread pool"""
        return self.executor.submit(func, *args, **kwargs)
    
    async def run(self, func: Callable, *args, **kwargs) -> Any:
        """Run sync function in bulkhead pool"""
        loop = asyncio.get_event_loop()
        return await loop.run_in_executor(
            self.executor,
            lambda: func(*args, **kwargs)
        )


# Bulkhead manager for multiple isolated pools
class BulkheadManager:
    """Manage multiple bulkheads for different services"""
    
    def __init__(self):
        self.bulkheads: Dict[str, Bulkhead] = {}
    
    def create(self, name: str, config: BulkheadConfig = None) -> Bulkhead:
        bulkhead = Bulkhead(name, config)
        self.bulkheads[name] = bulkhead
        return bulkhead
    
    def get(self, name: str) -> Optional[Bulkhead]:
        return self.bulkheads.get(name)
    
    def get_all_metrics(self) -> Dict[str, dict]:
        return {
            name: bh.get_metrics() 
            for name, bh in self.bulkheads.items()
        }


# Usage
bulkhead_manager = BulkheadManager()

# Separate pools for different services
payment_bulkhead = bulkhead_manager.create(
    "payment",
    BulkheadConfig(max_concurrent=5, max_queue=20)
)

inventory_bulkhead = bulkhead_manager.create(
    "inventory", 
    BulkheadConfig(max_concurrent=10, max_queue=50)
)

notification_bulkhead = bulkhead_manager.create(
    "notification",
    BulkheadConfig(max_concurrent=20, max_queue=100)
)

@payment_bulkhead
async def process_payment(order_id: str, amount: float):
    """Isolated payment processing"""
    await asyncio.sleep(0.5)  # Simulate API call
    return {"status": "success", "order_id": order_id}

@inventory_bulkhead  
async def check_inventory(product_id: str):
    """Isolated inventory check"""
    await asyncio.sleep(0.1)
    return {"available": True, "quantity": 100}

# Even if payment service is slow/failing,
# inventory and notifications continue working
async def handle_order(order):
    try:
        # These run in isolated pools
        inventory = await check_inventory(order["product_id"])
        payment = await process_payment(order["id"], order["amount"])
        await send_notification(order["user_id"])
        return {"status": "completed"}
    except BulkheadRejectedException as e:
        return {"status": "retry_later", "reason": str(e)}

Key Takeaways

ConceptRemember
CAP TheoremPick 2 of 3: Consistency, Availability, Partition Tolerance
ConsensusUse Raft for leader election, state machine replication
Transactions2PC for strong consistency, Sagas for microservices
FailuresDesign for failure: retries, circuit breakers, bulkheads
Distributed systems are hard. Every network call can fail, be slow, or return stale data. Design for failure from day one.

Interview Questions

Q1: Explain the CAP theorem. Then tell me why most real-world system design decisions aren’t actually “pick two out of three.”

The CAP theorem states that a distributed system can provide at most two of three guarantees simultaneously: Consistency (every read sees the most recent write), Availability (every request gets a non-error response), and Partition tolerance (the system continues to operate despite network partitions between nodes).
  • Why “pick two” is misleading: Network partitions aren’t optional — they will happen in any distributed system. So the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance). You’re never truly choosing to give up partition tolerance, because the network will partition whether you designed for it or not.
  • In practice, it’s a spectrum, not a binary: Systems like DynamoDB let you tune consistency per-request. A shopping cart read can be eventually consistent (AP), while a checkout that decrements inventory uses strongly consistent reads (CP). You’re making the CAP trade-off at the operation level, not the system level.
  • The PACELC extension is more useful: It says “if there’s a Partition, choose A or C; Else, choose Latency or Consistency.” This captures what engineers actually deal with day-to-day — even when the network is fine, strong consistency adds latency (because you’re waiting for replica acknowledgment). DynamoDB is PA/EL (available during partitions, low latency otherwise). Spanner is PC/EC (consistent always, pays the latency cost via TrueTime).
  • Real-world example: At a company running a global e-commerce platform, product catalog reads are AP (showing a slightly stale price for 2 seconds is fine), but the payment ledger is CP (an incorrect balance means real money loss). Same infrastructure, different CAP choices per use case.
Red flag answer: “You pick two out of three: consistency, availability, or partition tolerance” and stops there. This is the textbook recitation that every candidate gives. It shows memorization, not understanding. Even worse: “I’d pick CA” — this means the candidate doesn’t understand that partitions aren’t a choice. Follow-ups:
  1. You’re designing a global inventory system for a flash sale with 100K concurrent users. Walk me through specifically where you’d choose CP vs AP within that single system.
  2. Your team just adopted CockroachDB because “it’s CP and still highly available.” How is that possible, and what are you actually giving up compared to a system like Cassandra?

Q2: What is a vector clock, and why can’t you just use wall-clock timestamps to order events in a distributed system?

A vector clock is a data structure that tracks causal ordering of events across multiple nodes. Each node maintains a vector of counters — one per node in the system — and increments its own counter on every local event. When messages are exchanged, nodes merge vectors by taking the element-wise maximum.
  • Why wall clocks fail: Physical clocks on different machines are never perfectly synchronized. Even with NTP, you get clock skew of 1-10ms between machines in the same datacenter, and much worse across regions. If Node A writes at T=100 and Node B writes at T=101, you can’t be sure A actually happened first — Node B’s clock might just be 2ms ahead. Google’s Spanner solves this with TrueTime (atomic clocks + GPS), but that’s a $100M+ infrastructure investment.
  • What vector clocks actually give you: They establish a partial order. If event A’s vector clock is strictly less than or equal to B’s on all components, A happened before B. If neither dominates, the events are concurrent — meaning there’s a genuine conflict that needs resolution. This is fundamentally different from timestamps, which impose a total order that might be wrong.
  • The practical trade-off: Vector clocks grow linearly with the number of nodes. In a system with 1,000 nodes, every piece of data carries a 1,000-element vector — that’s real overhead on every message and storage operation. This is why systems like Dynamo and Riak used them but with pruning strategies, and why many systems (Cassandra, for example) opted for Last-Write-Wins with timestamps instead, accepting the rare data loss for simplicity.
  • Real-world example: Amazon’s original Dynamo paper used vector clocks for their shopping cart. When concurrent updates happened (say, two users on the same account adding items simultaneously), the system could detect the conflict and merge the carts instead of silently dropping one write.
Red flag answer: “Just use timestamps” or “NTP keeps clocks in sync so this isn’t a problem.” This reveals the candidate hasn’t worked with distributed systems at scale, where clock skew causes subtle, hard-to-reproduce bugs — the kind that show up only under load or during a network hiccup. Follow-ups:
  1. If vector clocks grow too large for a 500-node cluster, what alternatives exist? How do systems like Cassandra handle this differently?
  2. Walk me through a concrete scenario where using Last-Write-Wins timestamps would silently lose data but vector clocks would catch the conflict.

Q3: Your system uses Raft for consensus across 5 nodes. The leader crashes. Walk me through exactly what happens next, step by step.

When the Raft leader crashes, the cluster goes through a leader election process. Here’s the step-by-step sequence:
  • Step 1 — Detection via heartbeat timeout: Each follower has a randomized election timeout (typically 150-300ms). The leader sends heartbeats at a shorter interval (e.g., every 50-100ms). When a follower doesn’t receive a heartbeat within its timeout, it suspects the leader is dead.
  • Step 2 — A follower becomes a candidate: The first follower whose timeout expires increments its currentTerm, transitions to candidate state, votes for itself, and sends RequestVote RPCs to all other nodes. The randomized timeout is critical — it makes split votes unlikely because one node almost always times out first.
  • Step 3 — Voting: Each node votes for at most one candidate per term. A node grants its vote only if the candidate’s log is at least as up-to-date as the voter’s log (this is the “election safety” property). This guarantees the new leader has all committed entries.
  • Step 4 — Winning the election: The candidate needs a majority (3 out of 5 nodes). Once it gets 3 votes, it becomes the new leader and immediately sends heartbeats to assert authority and prevent other elections.
  • Step 5 — Uncommitted entries: Any log entries the old leader had replicated to fewer than 3 nodes are not committed and may be overwritten. Entries replicated to 3+ nodes are committed and will survive. This is why clients should only consider a write successful after the leader confirms it’s committed (replicated to a majority).
  • During the election, the system is unavailable for writes — typically 150-500ms of downtime. Reads can be served from followers if your system allows stale reads; otherwise reads also block.
Red flag answer: “A new leader is automatically selected” without being able to describe the mechanism. Also a red flag: not mentioning the safety property that prevents a node with a stale log from becoming leader — this is the entire point of Raft’s election restriction. Follow-ups:
  1. What happens if the network partitions into a group of 2 and a group of 3, and the old leader is in the group of 2? Does the old leader know it’s no longer the leader?
  2. You’re running Raft in production and noticing frequent leader elections even though no nodes are actually crashing. What’s causing this and how do you fix it?

Q4: Explain Two-Phase Commit. What is the fundamental flaw, and what alternatives exist for distributed transactions in a microservices architecture?

Two-Phase Commit (2PC) is a protocol that ensures atomicity across multiple participants in a distributed transaction. Phase 1 (Prepare): the coordinator asks each participant “can you commit?” and each participant acquires locks, writes to a redo log, and votes yes or no. Phase 2 (Commit/Abort): if all voted yes, the coordinator tells everyone to commit; if any voted no, everyone aborts.
  • The fundamental flaw is the blocking problem: If the coordinator crashes after sending “prepare” but before sending the commit/abort decision, all participants are stuck holding locks indefinitely. They’ve voted yes, so they can’t unilaterally abort (another participant might have already committed). They can’t commit either (the coordinator might have decided to abort). This makes 2PC a blocking protocol — a single coordinator failure can freeze the entire system.
  • In a microservices world, 2PC is usually the wrong choice because: (1) it requires all services to be available simultaneously, defeating the independence that microservices are supposed to provide; (2) holding distributed locks across services creates latency and contention; (3) the coordinator is a single point of failure unless you add complexity like 3PC or Paxos-based commit.
  • The Saga pattern is the practical alternative: Instead of one big transaction, you decompose it into a sequence of local transactions. Each service does its work and publishes an event. If a step fails, you execute compensating transactions to undo previous steps. For example: Order Service creates order -> Payment Service charges card -> Inventory Service reserves stock. If inventory fails, you refund the payment and cancel the order.
  • Saga trade-offs: You lose isolation (intermediate states are visible) and compensating transactions can be complex (how do you “un-send” an email?). But you gain availability, loose coupling, and independent scalability — which matters more in microservices. Uber processes millions of trips daily using sagas, not 2PC.
Red flag answer: “Two-Phase Commit guarantees consistency across services, so use it for everything distributed.” This shows the candidate hasn’t felt the operational pain of distributed locking in production. Also a red flag: describing 2PC without mentioning the coordinator failure problem — that’s the one thing every interviewer expects you to know. Follow-ups:
  1. You’re implementing a saga for an e-commerce checkout. The payment succeeds but the inventory reservation fails. Your compensating transaction to refund the payment also fails (the payment provider is down). What now?
  2. How would you ensure exactly-once semantics in a saga when messages can be delivered more than once?

Q5: You’re running a distributed key-value store with 3 replicas. A client writes a value and immediately reads it back but gets stale data. Explain why this happens and how you’d fix it.

This is the classic read-after-write consistency problem in eventually consistent systems. Here’s exactly what’s happening:
  • The write goes to one replica (or the leader) and returns success before all replicas are updated. If the subsequent read hits a different replica that hasn’t received the update yet, it returns stale data. With 3 replicas using async replication, the write might succeed on 1 node in 5ms while the other 2 nodes receive the update 50-200ms later.
  • Fix 1 — Quorum reads and writes (W + R > N): With N=3 replicas, set W=2 (write must succeed on 2 nodes) and R=2 (read must come from 2 nodes). Since W+R=4 > N=3, at least one node in the read set must have the latest write. The client picks the value with the highest version/timestamp. This is how DynamoDB’s strongly consistent reads work, and Cassandra’s QUORUM consistency level.
  • Fix 2 — Read-your-writes consistency (session-based): Route all of a user’s reads to the same replica that accepted their writes, typically using session affinity (sticky sessions by user ID). This is simpler than quorums and good enough for many use cases — the user sees their own writes even if other users see stale data briefly.
  • Fix 3 — Write to leader, read from leader: In a leader-follower setup, route both reads and writes to the leader for operations that need consistency. The downside: the leader becomes a bottleneck. This is why most systems only do this for specific critical paths, not all reads.
  • What I’d actually recommend depends on the use case: For a user profile update, read-your-writes (Fix 2) is sufficient and cheap. For an inventory decrement during checkout, quorum (Fix 1) is necessary. For a social media feed, you might just accept eventual consistency because the cost of stale reads is low.
Red flag answer: “Set everything to strong consistency” without discussing the latency and availability trade-offs. Also a red flag: not knowing the quorum formula W + R > N — this is bread-and-butter distributed systems knowledge. Follow-ups:
  1. You’re using quorum reads/writes (W=2, R=2, N=3) and one replica goes down permanently. What happens to your read and write availability?
  2. A customer reports that even with quorum reads, they’re occasionally seeing stale data. What could cause this, and how would you debug it?

Q6: What is the difference between the Saga pattern’s choreography and orchestration approaches? When would you choose one over the other?

Both choreography and orchestration are ways to coordinate the steps of a saga (a sequence of local transactions with compensating actions). The difference is in who controls the flow.
  • Choreography (event-driven): Each service publishes domain events after completing its local transaction, and other services subscribe to those events. There’s no central controller — the flow emerges from event subscriptions. Order Service publishes OrderCreated, Payment Service hears it and charges the card, publishes PaymentCompleted, Inventory Service hears that and reserves stock. For compensation, services publish failure events: PaymentFailed triggers OrderCancelled.
  • Orchestration (command-driven): A central Saga Orchestrator sends explicit commands to each service in sequence. The orchestrator knows the full workflow and handles branching/compensation logic. “Step 1: tell Payment to charge. Step 2: if success, tell Inventory to reserve. If Payment failed, tell Order to cancel.”
  • Choose choreography when: You have 3-4 simple services, the flow is mostly linear, teams are autonomous and don’t want a central coordinator, and you already have an event bus (Kafka, SNS/SQS). It scales well organizationally because no single team owns the flow.
  • Choose orchestration when: The workflow has more than 5 steps, complex branching logic (if payment is partial, do X; if international, add customs step), you need clear visibility into saga state (which step are we on? where did it fail?), or you need to add/reorder steps without touching every service. Netflix uses orchestrators for their complex content workflows.
  • The hidden cost of choreography: At scale, the implicit flow becomes very hard to reason about. When something goes wrong, you’re tracing events across 6 services’ logs to reconstruct what happened. With an orchestrator, you look at one place. I’ve seen teams start with choreography for its elegance and migrate to orchestration after the third production incident where no one could figure out the sequence of events.
Red flag answer: “Always use choreography because it’s more decoupled” or “always use orchestration because it’s easier.” Both absolutes ignore the context. Also a red flag: not mentioning observability as a factor — in production, the ability to debug a failed saga is often the deciding factor, not theoretical coupling. Follow-ups:
  1. You chose choreography and now have 8 services in the saga. A step in the middle fails and the compensating events create a cascade that takes 45 seconds to fully resolve. How do you improve this?
  2. How do you handle the case where the orchestrator itself crashes mid-saga? How does the system recover?

Q7: What is the Circuit Breaker pattern, and how does it differ from simple retry logic? Describe a production scenario where retries would make the problem worse but a circuit breaker would help.

A circuit breaker monitors calls to a downstream service and “opens” (stops sending traffic) when failures exceed a threshold. It has three states: Closed (normal — requests flow through), Open (tripped — requests fail immediately without calling the downstream), and Half-Open (testing — a limited number of requests are let through to check if the service has recovered).
  • How it differs from retries: Retries address transient failures — a single request that might succeed on the second attempt. A circuit breaker addresses sustained failures — when the downstream service is down or degraded and retrying will only make things worse. They’re complementary: retries inside a circuit breaker means “try a few times, but if the service is consistently failing, stop trying and fail fast.”
  • The production scenario where retries are destructive: Imagine your payment service is overloaded and responding with 503s at 2-second latency. You have 10 upstream servers each sending 100 requests/second, all with retry logic (3 retries). Without a circuit breaker, each failed request generates 3 more attempts. Your 1,000 req/s becomes 3,000 req/s to an already overloaded service. This is the retry storm (or “thundering herd”) — you’re pouring gasoline on the fire. A circuit breaker detects the failure rate, opens the circuit, and returns errors immediately to callers. The payment service gets breathing room to recover.
  • Key configuration decisions: The failure threshold (too low and you trip on transient errors; too high and you’re slow to react), the timeout duration (how long to stay open before trying again), and what counts as a failure (5xx responses yes, 4xx responses no — a 400 Bad Request means the client sent garbage, not that the service is broken).
  • What most people miss — the fallback strategy: The circuit breaker is only half the pattern. The other half is what you do when it’s open. For a payment service: queue the charge for later processing. For a recommendation engine: show popular items instead. For a search service: return cached results. The fallback is where the real engineering judgment lives.
Red flag answer: “Just retry with exponential backoff and it’ll be fine.” This works for transient errors but is exactly wrong for sustained outages — you need to stop sending traffic, not send it more slowly. Also a red flag: describing the pattern without mentioning the half-open state — that’s the recovery mechanism that makes the pattern work. Follow-ups:
  1. You have circuit breakers on 5 downstream services. All 5 trip simultaneously at 3 AM. What’s your incident response playbook, and what does this correlated failure tell you about the root cause?
  2. Your circuit breaker is flapping — rapidly alternating between open and closed. What’s causing this and how do you tune it?

Q8: Explain the Bulkhead pattern. How does it complement the Circuit Breaker pattern, and when would you use both together?

The Bulkhead pattern isolates different parts of your system into separate resource pools so that a failure in one doesn’t cascade to others. The name comes from ship design — bulkheads divide a ship’s hull into watertight compartments so a breach in one section doesn’t sink the entire vessel.
  • Concrete implementation: You create separate thread pools (or connection pools, or semaphore pools) for each downstream dependency. Your payment service gets 5 concurrent slots, inventory gets 10, notifications get 20. If the payment service hangs, only those 5 threads are blocked — the other 30 threads continue serving inventory and notification requests normally. Without a bulkhead, all 35 threads could pile up waiting for the slow payment service, and your entire application becomes unresponsive.
  • How it complements Circuit Breaker: A circuit breaker stops calling a failing service. A bulkhead limits the damage while the failure is being detected. They operate on different timescales: the bulkhead provides immediate isolation (thread pool exhaustion is prevented from second one), while the circuit breaker kicks in after enough failures accumulate (maybe 10-30 seconds). Together, the bulkhead contains the blast radius while the circuit breaker stops the bleeding.
  • When to use both together: Almost always, in production systems with multiple downstream dependencies. Example: your API gateway talks to 4 microservices. Each gets its own bulkhead (connection pool with a max size and a queue timeout). Each also gets its own circuit breaker. If the recommendation service goes down, its bulkhead prevents the 50 pending requests from consuming all your gateway threads, and its circuit breaker prevents new requests from even trying after the failure threshold is hit. Meanwhile, search, auth, and cart services continue serving traffic without any impact.
  • The configuration challenge: Setting bulkhead sizes requires understanding your traffic patterns. Too small and you reject legitimate requests during peak load. Too large and the isolation is ineffective. In practice, you start with per-service p99 latency times expected concurrency, add 20% headroom, and tune based on production metrics. Netflix’s Hystrix library popularized this pattern, and Resilience4j carries it forward in modern Java.
Red flag answer: “Just use a big thread pool for everything” — this is exactly the anti-pattern that bulkheads solve. Also a red flag: only describing the pattern in theory without mentioning pool sizing, queue overflow behavior, or how it interacts with circuit breakers in a real deployment. Follow-ups:
  1. You’ve set up bulkheads per-service, but a bug in your API gateway layer means all services share the same outbound HTTP connection pool, bypassing the bulkheads. How would you detect this in production?
  2. Your payment bulkhead is set to 5 concurrent slots. During a Black Friday traffic spike, legitimate payment requests are being rejected because the pool is full. How do you handle this without removing the isolation benefit?

Q9: A distributed system you operate is experiencing a “split brain” scenario. What does this mean, what damage can it cause, and how do you resolve it?

Split brain occurs when a network partition causes a distributed system to divide into two or more groups, each believing it’s the “real” cluster. Both sides continue operating independently — accepting writes, serving reads, potentially electing their own leaders — without knowing the other side is doing the same thing.
  • The damage is real and specific: In a database cluster, split brain means two primary nodes both accepting writes. Client A writes balance = 100 to Primary-Left while Client B writes balance = 200 to Primary-Right. When the partition heals, you have conflicting data with no way to automatically determine which is correct. In the worst case, both sides committed financial transactions with different balances — reconciliation is a manual nightmare. This happened to GitHub in 2018 when a brief network partition caused their MySQL primary to split, leading to data inconsistency that took hours to resolve.
  • Prevention mechanisms: (1) Quorum-based systems (Raft, ZooKeeper) prevent split brain by design — a leader needs a majority to commit, so only one side of a partition can have a quorum. With 5 nodes split 2-3, only the group of 3 can elect a leader. (2) Fencing tokens (also called “epoch numbers”) — when a new leader is elected, it gets a monotonically increasing token. Storage systems reject writes from tokens lower than the latest they’ve seen, so a stale leader’s writes are rejected. (3) STONITH (Shoot The Other Node In The Head) — in traditional HA systems, when you suspect split brain, you forcefully power off the other node via out-of-band management (IPMI/iLO) before promoting yourself. Brutal, but effective.
  • Resolution after the fact: If split brain already occurred, you need to: identify the divergence point (last common write), choose a winning side (usually the one with more confirmed transactions), replay the losing side’s writes as conflicts for manual resolution, and alert the operations team. This is why many systems prefer to become unavailable (CP) rather than risk split brain (AP with no conflict resolution).
Red flag answer: “Just run more replicas” — this doesn’t solve split brain; it can actually make it worse if you don’t have proper quorum rules. Also a red flag: not being able to name a concrete prevention mechanism (fencing tokens, quorum, STONITH) — it suggests the candidate has only read about split brain but never operated a system where it could happen. Follow-ups:
  1. You’re running a 2-node database cluster for cost reasons (no quorum possible). How do you prevent split brain without adding a third node?
  2. Your ZooKeeper ensemble of 5 nodes is split 2-2-1 across three network segments. What happens? Can any side make progress?

Q10: You’re designing a system that needs to process each message exactly once, but your message broker provides at-least-once delivery. How do you achieve exactly-once semantics?

True exactly-once delivery is impossible in a distributed system (proven by the Two Generals Problem). What we actually implement is “effectively exactly-once” processing by combining at-least-once delivery with idempotent processing on the consumer side.
  • The core pattern — idempotency keys: Every message gets a unique identifier (UUID, or a natural key like order_id + action). The consumer maintains an idempotency store (a table or cache of processed message IDs). Before processing, check: “Have I seen this ID before?” If yes, skip or return the cached result. If no, process and record the ID atomically.
  • The atomic part is critical: The message processing and the idempotency record must be in the same transaction. If you process the message, then crash before recording the ID, the message will be redelivered and reprocessed. The pattern is: BEGIN TRANSACTION; INSERT INTO idempotency_log (message_id); do_the_work(); COMMIT;. If the work involves a different datastore than the idempotency log, you need the Outbox Pattern — write the idempotency record and a “pending work” record to the same DB in one transaction, then a separate worker picks up and forwards the pending work.
  • Idempotency store considerations: Use a fast lookup store (Redis with TTL, or a database table with an index on message_id). Set a TTL on entries — you don’t need to remember every message forever, just long enough to cover the broker’s redelivery window (typically 7 days for SQS, configurable for Kafka). At high throughput (100K+ messages/sec), the idempotency lookup itself becomes a bottleneck — use bloom filters as a first pass to avoid hitting the database for messages you’ve definitely never seen.
  • Kafka’s approach: Kafka 0.11+ added idempotent producers (deduplication using sequence numbers per producer-partition pair) and transactional producers (atomic writes across multiple partitions). This gets you exactly-once within Kafka, but the moment data leaves Kafka and hits your database, you’re back to needing application-level idempotency.
Red flag answer: “Our message broker supports exactly-once delivery” — no broker truly does end-to-end; they provide it within their system boundary, but you still need idempotency at the application layer. Also a red flag: “Just use a database unique constraint on the message ID” without discussing the transactional atomicity requirement — a unique constraint prevents duplicate inserts but doesn’t prevent duplicate side effects. Follow-ups:
  1. Your idempotency store is in Redis and the Redis node crashes, losing all keys. Messages are redelivered and reprocessed. How do you make the system resilient to this?
  2. You’re processing 200K messages/second. Your idempotency table has 500 million rows and lookups are getting slow. How do you optimize this without losing the exactly-once guarantee?