> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Advanced Concepts

> Senior-level system design concepts for Staff+ interviews

<Warning>
  **Senior/Staff Level Content**: This section covers advanced topics that differentiate senior engineers from mid-level. Master these for L5+ (Senior) and Staff interviews at FAANG.
</Warning>

## Consistency Deep Dive

### Linearizability vs Serializability

Most candidates confuse these -- and interviewers *love* testing this distinction because it reveals whether you genuinely understand distributed consistency or just memorized surface-level definitions. The key: linearizability is about *single operations on single objects* appearing to happen atomically in real time. Serializability is about *multi-operation transactions on multiple objects* appearing to happen in some serial order (which may not correspond to real time). Strict serializability gives you both -- and is what Google Spanner provides using GPS-synchronized atomic clocks (TrueTime), at the cost of cross-region write latency.

```
┌─────────────────────────────────────────────────────────────────┐
│           Linearizability vs Serializability                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LINEARIZABILITY (Single-object, real-time)                    │
│  ─────────────────────────────────────────                      │
│  • Operations appear to happen atomically at some point         │
│  • Real-time ordering: if op A completes before op B starts,   │
│    then A is ordered before B                                   │
│  • Think: "single copy of data that everyone sees"             │
│                                                                 │
│  Time ────────────────────────────────────────────►            │
│                                                                 │
│  Client A:  ──[write x=1]────────────                          │
│  Client B:        ──────[read x]───                            │
│                          │                                      │
│                          └─► Must return 1 (write completed)   │
│                                                                 │
│  SERIALIZABILITY (Multi-object, transactions)                  │
│  ───────────────────────────────────────────                    │
│  • Transactions appear to execute in SOME serial order          │
│  • Doesn't guarantee real-time ordering                         │
│  • Think: "transactions don't see partial state"               │
│                                                                 │
│  T1: read(A), write(B)                                         │
│  T2: read(B), write(A)                                         │
│  Serial order: T1→T2 or T2→T1, but not interleaved            │
│                                                                 │
│  STRICT SERIALIZABILITY                                         │
│  ───────────────────────                                        │
│  • Both! Serial order + real-time ordering                      │
│  • Most expensive, what Spanner provides                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Isolation Levels (Know All Four!)

```
┌─────────────────────────────────────────────────────────────────┐
│                    Isolation Levels                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  READ UNCOMMITTED (Weakest)                                     │
│  ─────────────────────────                                      │
│  • Can see uncommitted changes (dirty reads)                   │
│  • Almost never used                                            │
│                                                                 │
│  T1: write(x=1) ──────────── rollback                          │
│  T2:     read(x) ─► 1 (dirty read!)                            │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  READ COMMITTED (Default in PostgreSQL)                         │
│  ─────────────────────────────────────                          │
│  • Only see committed data                                     │
│  • Allows non-repeatable reads                                  │
│                                                                 │
│  T1: read(x) ─► 1 ─────────── read(x) ─► 2 (changed!)         │
│  T2:      write(x=2), commit                                   │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  REPEATABLE READ (Default in MySQL InnoDB)                      │
│  ─────────────────────────────────────────                      │
│  • Same query returns same results in a transaction             │
│  • Allows phantom reads (new rows appear)                       │
│                                                                 │
│  T1: SELECT COUNT(*) WHERE age > 30 ─► 5                       │
│  T2:      INSERT INTO users (age=35), commit                   │
│  T1: SELECT COUNT(*) WHERE age > 30 ─► 6 (phantom!)            │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  SERIALIZABLE (Strongest)                                       │
│  ────────────────────────                                       │
│  • Full isolation, as if transactions ran serially              │
│  • Highest safety, lowest performance                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

| Level            | Dirty Read | Non-Repeatable Read | Phantom Read |
| ---------------- | ---------- | ------------------- | ------------ |
| Read Uncommitted | ✓          | ✓                   | ✓            |
| Read Committed   | ✗          | ✓                   | ✓            |
| Repeatable Read  | ✗          | ✗                   | ✓            |
| Serializable     | ✗          | ✗                   | ✗            |

<Tip>
  **Interview Answer**: "I'd use Read Committed for most cases. Serializable only for critical financial transactions where consistency matters more than throughput."
</Tip>

## Distributed Consensus Deep Dive

### Leader Election: Why It's Hard

Leader election sounds simple: "just pick one node to be in charge." But the reason it is one of the hardest problems in distributed systems is that the very mechanism you would use to coordinate the election (the network) is the thing that fails. Imagine trying to elect a class president when students can only pass notes, some notes get lost, and some students might have already left the building without telling anyone. The note-passing *is* the problem you are trying to solve.

```
┌─────────────────────────────────────────────────────────────────┐
│                 Split Brain Problem                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Initial State: Node A is leader                               │
│                                                                 │
│  ┌────────┐      ┌────────┐      ┌────────┐                   │
│  │ Node A │──────│ Node B │──────│ Node C │                   │
│  │ LEADER │      │FOLLOWER│      │FOLLOWER│                   │
│  └────────┘      └────────┘      └────────┘                   │
│                                                                 │
│  Network partition occurs:                                     │
│                                                                 │
│  Partition 1         ║         Partition 2                     │
│  ┌────────┐          ║          ┌────────┐ ┌────────┐         │
│  │ Node A │          ║          │ Node B │─│ Node C │         │
│  │ LEADER │          ║          │FOLLOWER│ │FOLLOWER│         │
│  └────────┘          ║          └────────┘ └────────┘         │
│                      ║               │                         │
│  A thinks it's       ║          B or C becomes leader!        │
│  still leader!       ║          (timeout, no heartbeat)        │
│                      ║                                         │
│  DANGER: Two leaders! (Split brain)                            │
│                                                                 │
│  Solution: Quorum-based voting                                 │
│  • Need majority (N/2 + 1) to elect leader                     │
│  • Partition 1 has 1/3 (minority) → can't write               │
│  • Partition 2 has 2/3 (majority) → can elect, can write      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Raft: The Algorithm You Should Know

```
┌─────────────────────────────────────────────────────────────────┐
│                    Raft Consensus                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  KEY CONCEPTS:                                                  │
│                                                                 │
│  1. TERMS (Logical Clock)                                       │
│     • Time divided into terms                                   │
│     • Each term has at most one leader                         │
│     • Term number monotonically increases                       │
│                                                                 │
│     Term 1       Term 2       Term 3                           │
│     ├──Leader A──┼──Election──┼──Leader B──────►               │
│                  │  (failed)  │                                 │
│                                                                 │
│  2. LOG REPLICATION                                             │
│                                                                 │
│     Leader Log:  [1:x=1] [1:y=2] [2:x=3] [2:z=4]               │
│                    ↓       ↓       ↓       ↓                   │
│     Follower 1:  [1:x=1] [1:y=2] [2:x=3] [2:z=4] ✓             │
│     Follower 2:  [1:x=1] [1:y=2] [2:x=3]         (catching up) │
│                                                                 │
│     Entry committed when replicated to majority                 │
│                                                                 │
│  3. LEADER ELECTION                                             │
│                                                                 │
│     Follower timeout → Candidate → RequestVote RPC             │
│     • Vote granted if:                                         │
│       - Candidate's term >= voter's term                       │
│       - Candidate's log is at least as up-to-date              │
│       - Voter hasn't voted for someone else this term          │
│     • Majority votes → become Leader                           │
│                                                                 │
│  4. SAFETY GUARANTEE                                            │
│     • Elected leader has all committed entries                  │
│     • Leaders never overwrite their log                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Clock Synchronization

Clock synchronization is one of the sneakiest problems in distributed systems because wall clocks *appear* reliable -- they give you a number that seems authoritative -- but they silently drift. Think of it like three friends timing a race with unsynchronized stopwatches: each records a slightly different finish time, and there is no objective way to determine whose stopwatch is "right." In distributed systems, this disagreement can corrupt conflict resolution, cause duplicate processing, or silently lose data when last-write-wins picks the wrong "last."

### Why Wall Clocks Fail

```
┌─────────────────────────────────────────────────────────────────┐
│                 Clock Synchronization Problems                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Physical clocks drift (100 ms/day typical)                    │
│  NTP sync has error bounds (1-10ms typically)                  │
│                                                                 │
│  Problem scenario:                                              │
│                                                                 │
│  Machine A (clock: 10:00:00.000):  write(x=1) @ 10:00:00.000  │
│  Machine B (clock: 10:00:00.050):  write(x=2) @ 10:00:00.050  │
│                                                                 │
│  But B's clock is 100ms ahead! Real order:                     │
│  • B actually wrote at real time 09:59:59.950                  │
│  • A wrote at real time 10:00:00.000                           │
│  • A happened AFTER B, but timestamps say B is newer!          │
│                                                                 │
│  Solution 1: Logical clocks (Lamport)                          │
│  Solution 2: Vector clocks                                      │
│  Solution 3: Hybrid clocks (HLC)                               │
│  Solution 4: GPS/atomic clocks (Spanner's TrueTime)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Vector Clocks (Conflict Detection)

```
┌─────────────────────────────────────────────────────────────────┐
│                    Vector Clocks                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Each node maintains vector: {A: count, B: count, C: count}    │
│                                                                 │
│  Node A         Node B         Node C                          │
│  {A:0,B:0,C:0}  {A:0,B:0,C:0}  {A:0,B:0,C:0}                   │
│      │              │              │                            │
│  write(x=1)         │              │                            │
│  {A:1,B:0,C:0}      │              │                            │
│      │─────────────►│              │                            │
│      │         {A:1,B:0,C:0}       │                            │
│      │              │              │                            │
│      │         write(y=2)          │                            │
│      │         {A:1,B:1,C:0}       │                            │
│      │              │─────────────►│                            │
│      │              │         {A:1,B:1,C:0}                     │
│      │              │              │                            │
│                                                                 │
│  Comparison:                                                    │
│  {A:1,B:2,C:0} vs {A:2,B:1,C:0}                                │
│  Neither dominates → CONFLICT! Must resolve                    │
│                                                                 │
│  {A:1,B:2,C:0} vs {A:1,B:3,C:0}                                │
│  Second dominates → Second is newer, no conflict               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Data Partitioning Strategies

Data partitioning (sharding) is how you scale a database beyond what a single machine can handle -- but the partition key choice is the most consequential and least reversible decision you will make. A bad partition key creates "hot partitions" (one shard drowning while others idle), forces expensive cross-partition queries on your hot path, or makes rebalancing a multi-day operational nightmare. The analogy: choosing a partition key is like organizing a library into separate buildings. If you split by the first letter of the author's last name, the "S" building is overflowing (Smith, Singh, Suzuki...) while the "X" building is nearly empty. But if you split by a hash of the book's ISBN, every building is equally full -- at the cost of no longer being able to browse all books by the same author in one place.

### Partition Key Selection

```
┌─────────────────────────────────────────────────────────────────┐
│              Choosing Partition Keys                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  GOOD Partition Key Properties:                                 │
│  ✓ High cardinality (many unique values)                       │
│  ✓ Evenly distributed access                                   │
│  ✓ Matches query patterns                                      │
│                                                                 │
│  Example: E-commerce Orders                                     │
│                                                                 │
│  ❌ BAD: partition by date                                      │
│     • Today's partition gets all traffic (hot partition)       │
│     • Old partitions sit idle                                  │
│                                                                 │
│  ❌ BAD: partition by country                                   │
│     • US partition has 60% of traffic                          │
│     • Small countries underutilized                            │
│                                                                 │
│  ✅ GOOD: partition by order_id                                │
│     • Random distribution                                       │
│     • But: can't query "all orders for user X" easily          │
│                                                                 │
│  ✅ BETTER: partition by user_id                                │
│     • User's orders on same partition (locality)               │
│     • Query patterns match                                     │
│     • Watch for celebrity users (hot partition)                │
│                                                                 │
│  ✅ BEST: compound key (user_id, order_date)                   │
│     • Partition by user_id                                     │
│     • Sort by order_date within partition                      │
│     • Efficient for "user X's orders in last month"           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Handling Hot Partitions

```python theme={null}
# Strategy 1: Add random suffix
def get_partition_key(celebrity_id):
    if is_celebrity(celebrity_id):
        # Split celebrity across 10 partitions
        suffix = random.randint(0, 9)
        return f"{celebrity_id}_{suffix}"
    return celebrity_id

# Reading requires scatter-gather
def get_celebrity_data(celebrity_id):
    results = []
    for suffix in range(10):
        key = f"{celebrity_id}_{suffix}"
        results.extend(query_partition(key))
    return results

# Strategy 2: Time-based suffix
def get_partition_key_time(user_id):
    # Different partition each hour
    hour = datetime.now().hour
    return f"{user_id}_{hour % 4}"
```

## Exactly-Once Semantics

### The Three Delivery Guarantees

```
┌─────────────────────────────────────────────────────────────────┐
│              Message Delivery Guarantees                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  AT-MOST-ONCE                                                   │
│  ─────────────                                                  │
│  Send and forget. May lose messages.                           │
│  Use case: Metrics, logs (some loss OK)                        │
│                                                                 │
│  Producer ──[msg]──► Broker    (might fail silently)           │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  AT-LEAST-ONCE                                                  │
│  ─────────────                                                  │
│  Retry until ACK. May duplicate messages.                      │
│  Use case: Most applications (with idempotent consumers)       │
│                                                                 │
│  Producer ──[msg]──► Broker ──[ACK]──► Producer                │
│      │                  │                                       │
│      └──[retry]─────────┘ (if no ACK, retry → duplicate)       │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  EXACTLY-ONCE                                                   │
│  ────────────                                                   │
│  Each message processed exactly once. Hard to achieve!         │
│  Use case: Financial transactions, critical data               │
│                                                                 │
│  Achieved via: Idempotency + Deduplication + Transactions      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Implementing Exactly-Once

```python theme={null}
class ExactlyOnceProcessor:
    """
    Exactly-once processing with idempotency keys
    """
    
    def process_message(self, message):
        idempotency_key = message.id
        
        # Step 1: Check if already processed
        if self.is_processed(idempotency_key):
            return self.get_cached_result(idempotency_key)
        
        # Step 2: Process with transaction
        with self.db.transaction():
            # Do the work
            result = self.do_business_logic(message)
            
            # Record as processed (same transaction!)
            self.mark_processed(idempotency_key, result)
            
            # Commit message offset (Kafka-style)
            self.commit_offset(message.offset)
        
        return result
    
    def is_processed(self, key):
        return self.db.exists(f"processed:{key}")
    
    def mark_processed(self, key, result):
        self.db.set(f"processed:{key}", result, ttl=86400)
```

## Distributed Caching Patterns

Caching at scale introduces problems that simply do not exist with a single-server cache. The most dangerous: cache stampede (also called "thundering herd"), where a popular cache key expires and hundreds of servers simultaneously hit the database to re-populate it. At moderate scale this is an annoyance; at high scale it can take down your database and cascade into a full outage. The patterns below are battle-tested solutions from companies like Facebook, Twitter, and Netflix.

### Cache Stampede Prevention

```
┌─────────────────────────────────────────────────────────────────┐
│                   Cache Stampede Problem                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Popular key expires → 1000 servers hit DB simultaneously      │
│                                                                 │
│  Server 1 ─┐                                                    │
│  Server 2 ─┤                                                    │
│  Server 3 ─┼───► Cache MISS ───► Database ← 💥 OVERWHELMED     │
│  ...       │                                                    │
│  Server N ─┘                                                    │
│                                                                 │
│  SOLUTIONS:                                                     │
│                                                                 │
│  1. LOCKING (Mutex)                                             │
│     First request acquires lock, others wait or get stale      │
│                                                                 │
│  2. PROBABILISTIC EARLY EXPIRATION                              │
│     Refresh before expiry with some probability                │
│                                                                 │
│  3. BACKGROUND REFRESH                                          │
│     Never expire, refresh asynchronously                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

```python theme={null}
import random
import time

class StampedePreventingCache:
    def get(self, key, fetch_func, ttl=3600, beta=1.0):
        cached = self.cache.get(key)
        
        if cached:
            value, expiry, delta = cached
            now = time.time()
            
            # Probabilistic early refresh
            # As we approach expiry, probability increases
            gap = expiry - now
            if gap > 0:
                # XFetch algorithm
                random_early = delta * beta * math.log(random.random())
                if gap + random_early > 0:
                    return value  # Use cached value
        
        # Cache miss or early refresh triggered
        # Use distributed lock to prevent stampede
        lock_key = f"lock:{key}"
        if self.acquire_lock(lock_key, timeout=5):
            try:
                start = time.time()
                value = fetch_func()
                delta = time.time() - start
                
                self.cache.set(key, (value, time.time() + ttl, delta))
                return value
            finally:
                self.release_lock(lock_key)
        else:
            # Someone else is refreshing, return stale or wait
            if cached:
                return cached[0]  # Return stale
            time.sleep(0.1)
            return self.get(key, fetch_func, ttl, beta)  # Retry
```

## Rate Limiting at Scale

Rate limiting is the seatbelt of distributed systems -- you hope you never need it, but when a client misbehaves or a traffic spike hits, it is the difference between a graceful degradation and a cascading outage. The subtlety most engineers miss: rate limiting must be *global*, not per-server. If you have 10 API servers each allowing 100 requests/second from the same client, that client effectively gets 1,000 requests/second. Centralized rate limiting (typically via Redis) solves this, but introduces a new dependency on the rate limiter itself -- which is why the most resilient implementations use local rate limiting as a fast first pass and centralized rate limiting for accuracy.

<Note>
  **Scalability Analysis**: At 100K QPS, your rate limiter itself becomes a bottleneck if implemented naively. A Redis-based sliding window counter that calls ZRANGEBYSCORE for each request can handle roughly 50K-100K operations per second on a single Redis instance. Beyond that, you need sharded rate limiting (hash the client identifier to determine which Redis shard tracks them) or a hierarchical approach (local token buckets per server, synchronized periodically with a central store). Companies like Cloudflare and Stripe use multi-tier rate limiting: edge-level (Anycast), per-server (in-memory), and centralized (Redis cluster).
</Note>

### Distributed Rate Limiting

```
┌─────────────────────────────────────────────────────────────────┐
│           Distributed Rate Limiting Strategies                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  STRATEGY 1: Centralized (Redis)                               │
│  ─────────────────────────────────                              │
│  All servers check same Redis                                  │
│  ✓ Accurate   ✗ Single point of failure                       │
│                                                                 │
│  Server 1 ─┐                                                    │
│  Server 2 ─┼───► Redis ───► Accurate count                     │
│  Server 3 ─┘                                                    │
│                                                                 │
│  STRATEGY 2: Local + Sync                                       │
│  ───────────────────────                                        │
│  Local counters, periodically sync                             │
│  ✓ Fast   ✗ Approximate                                        │
│                                                                 │
│  Server 1: local_count=50 ──┐                                  │
│  Server 2: local_count=40 ──┼──► Sync every 1s                 │
│  Server 3: local_count=30 ──┘                                  │
│                                                                 │
│  STRATEGY 3: Token Bucket with Redis                           │
│  ─────────────────────────────────                              │
│  Each request: DECR if tokens > 0                              │
│  Background: Refill tokens at fixed rate                       │
│                                                                 │
│  Lua script for atomic check-and-decrement:                    │
│  local tokens = redis.call('GET', key)                         │
│  if tokens > 0 then                                            │
│      redis.call('DECR', key)                                   │
│      return 1  -- allowed                                      │
│  end                                                           │
│  return 0  -- rejected                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## CQRS (Command Query Responsibility Segregation)

CQRS separates read and write operations into different models, optimizing each for its specific use case. The core insight: in most systems, read and write workloads have fundamentally different characteristics. Writes need validation, consistency, and domain logic. Reads need speed, denormalization, and flexible queries. Trying to serve both through a single model forces painful compromises.

Think of it like a library: the cataloging system (write side) carefully classifies books, enforces Dewey Decimal rules, and ensures no duplicates. The search terminals (read side) are optimized for patrons to find books fast -- they might have denormalized data, multiple indexes, and even slightly stale information. You would not force librarians to catalog books through the search terminal, and you would not make patrons navigate the raw catalog system.

The trade-off is operational complexity: you now maintain two models and a synchronization mechanism between them (usually events). CQRS is overkill for simple CRUD applications, but it shines for systems with high read-to-write ratios (100:1 or more), complex read queries that differ significantly from write structures, or requirements for different scaling characteristics on reads vs writes.

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/cqrs.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=39895c2f46daa2f414727da4920337d1" alt="CQRS Pattern" width="1080" height="1080" data-path="images/system-design/cqrs.svg" />

<Tabs>
  <Tab title="Python">
    ```python theme={null}
    from dataclasses import dataclass, field
    from typing import List, Optional, Dict, Any
    from datetime import datetime
    from abc import ABC, abstractmethod
    import asyncio
    from enum import Enum

    # ============== Commands (Write Side) ==============
    class CommandType(Enum):
        CREATE_ORDER = "create_order"
        UPDATE_ORDER = "update_order"
        CANCEL_ORDER = "cancel_order"

    @dataclass
    class Command:
        command_type: CommandType
        aggregate_id: str
        payload: Dict[str, Any]
        timestamp: datetime = field(default_factory=datetime.utcnow)
        correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    class CommandHandler(ABC):
        @abstractmethod
        async def handle(self, command: Command) -> None:
            pass

    class OrderCommandHandler(CommandHandler):
        def __init__(self, event_store: 'EventStore', event_bus: 'EventBus'):
            self.event_store = event_store
            self.event_bus = event_bus
        
        async def handle(self, command: Command) -> str:
            if command.command_type == CommandType.CREATE_ORDER:
                return await self._handle_create_order(command)
            elif command.command_type == CommandType.CANCEL_ORDER:
                return await self._handle_cancel_order(command)
            raise ValueError(f"Unknown command type: {command.command_type}")
        
        async def _handle_create_order(self, command: Command) -> str:
            # Business validation
            order_id = str(uuid.uuid4())
            
            # Create event
            event = Event(
                event_type=EventType.ORDER_CREATED,
                aggregate_id=order_id,
                payload={
                    "user_id": command.payload["user_id"],
                    "items": command.payload["items"],
                    "total": command.payload["total"],
                },
                version=1
            )
            
            # Persist event
            await self.event_store.append(event)
            
            # Publish for read model updates
            await self.event_bus.publish(event)
            
            return order_id
        
        async def _handle_cancel_order(self, command: Command) -> None:
            # Load current state from events
            events = await self.event_store.get_events(command.aggregate_id)
            order = OrderAggregate.from_events(events)
            
            # Business validation
            if order.status == OrderStatus.CANCELLED:
                raise ValueError("Order already cancelled")
            if order.status == OrderStatus.SHIPPED:
                raise ValueError("Cannot cancel shipped order")
            
            # Create cancellation event
            event = Event(
                event_type=EventType.ORDER_CANCELLED,
                aggregate_id=command.aggregate_id,
                payload={"reason": command.payload.get("reason", "User requested")},
                version=order.version + 1
            )
            
            await self.event_store.append(event)
            await self.event_bus.publish(event)

    # ============== Queries (Read Side) ==============
    @dataclass
    class OrderReadModel:
        """Denormalized read model optimized for queries"""
        id: str
        user_id: str
        status: str
        items: List[Dict]
        total: float
        created_at: datetime
        updated_at: datetime

    class OrderQueryService:
        def __init__(self, read_db: 'ReadDatabase'):
            self.read_db = read_db
        
        async def get_order(self, order_id: str) -> Optional[OrderReadModel]:
            """Fast read from denormalized model"""
            return await self.read_db.find_one("orders", {"id": order_id})
        
        async def get_user_orders(
            self, 
            user_id: str, 
            status: Optional[str] = None,
            limit: int = 10,
            offset: int = 0
        ) -> List[OrderReadModel]:
            """Query with filters - optimized for read patterns"""
            query = {"user_id": user_id}
            if status:
                query["status"] = status
            
            return await self.read_db.find(
                "orders",
                query,
                sort=[("created_at", -1)],
                limit=limit,
                skip=offset
            )
        
        async def get_orders_by_status(self, status: str) -> List[OrderReadModel]:
            """Admin query - different index"""
            return await self.read_db.find(
                "orders",
                {"status": status},
                sort=[("updated_at", -1)]
            )

    # ============== Read Model Projector ==============
    class OrderProjector:
        """Updates read model based on events"""
        
        def __init__(self, read_db: 'ReadDatabase'):
            self.read_db = read_db
        
        async def project(self, event: 'Event') -> None:
            handler = getattr(self, f"_handle_{event.event_type.value}", None)
            if handler:
                await handler(event)
        
        async def _handle_order_created(self, event: 'Event') -> None:
            order = OrderReadModel(
                id=event.aggregate_id,
                user_id=event.payload["user_id"],
                status="pending",
                items=event.payload["items"],
                total=event.payload["total"],
                created_at=event.timestamp,
                updated_at=event.timestamp
            )
            await self.read_db.insert("orders", order.__dict__)
        
        async def _handle_order_cancelled(self, event: 'Event') -> None:
            await self.read_db.update(
                "orders",
                {"id": event.aggregate_id},
                {
                    "status": "cancelled",
                    "cancellation_reason": event.payload["reason"],
                    "updated_at": event.timestamp
                }
            )
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript theme={null}
    // ============== Commands (Write Side) ==============
    const CommandType = {
      CREATE_ORDER: 'create_order',
      UPDATE_ORDER: 'update_order',
      CANCEL_ORDER: 'cancel_order'
    };

    class Command {
      constructor(commandType, aggregateId, payload) {
        this.commandType = commandType;
        this.aggregateId = aggregateId;
        this.payload = payload;
        this.timestamp = new Date();
        this.correlationId = crypto.randomUUID();
      }
    }

    class OrderCommandHandler {
      constructor(eventStore, eventBus) {
        this.eventStore = eventStore;
        this.eventBus = eventBus;
      }

      async handle(command) {
        switch (command.commandType) {
          case CommandType.CREATE_ORDER:
            return this.handleCreateOrder(command);
          case CommandType.CANCEL_ORDER:
            return this.handleCancelOrder(command);
          default:
            throw new Error(`Unknown command type: ${command.commandType}`);
        }
      }

      async handleCreateOrder(command) {
        const orderId = crypto.randomUUID();
        
        const event = new Event(
          EventType.ORDER_CREATED,
          orderId,
          {
            userId: command.payload.userId,
            items: command.payload.items,
            total: command.payload.total
          },
          1
        );

        await this.eventStore.append(event);
        await this.eventBus.publish(event);
        
        return orderId;
      }

      async handleCancelOrder(command) {
        // Load current state from events
        const events = await this.eventStore.getEvents(command.aggregateId);
        const order = OrderAggregate.fromEvents(events);

        // Business validation
        if (order.status === 'cancelled') {
          throw new Error('Order already cancelled');
        }
        if (order.status === 'shipped') {
          throw new Error('Cannot cancel shipped order');
        }

        const event = new Event(
          EventType.ORDER_CANCELLED,
          command.aggregateId,
          { reason: command.payload.reason || 'User requested' },
          order.version + 1
        );

        await this.eventStore.append(event);
        await this.eventBus.publish(event);
      }
    }

    // ============== Queries (Read Side) ==============
    class OrderQueryService {
      constructor(readDb) {
        this.readDb = readDb;
      }

      async getOrder(orderId) {
        return this.readDb.findOne('orders', { id: orderId });
      }

      async getUserOrders(userId, { status, limit = 10, offset = 0 } = {}) {
        const query = { userId };
        if (status) query.status = status;

        return this.readDb.find('orders', query, {
          sort: { createdAt: -1 },
          limit,
          skip: offset
        });
      }

      async getOrdersByStatus(status) {
        return this.readDb.find('orders', { status }, {
          sort: { updatedAt: -1 }
        });
      }
    }

    // ============== Read Model Projector ==============
    class OrderProjector {
      constructor(readDb) {
        this.readDb = readDb;
      }

      async project(event) {
        const handlerName = `handle${this.toPascalCase(event.eventType)}`;
        if (this[handlerName]) {
          await this[handlerName](event);
        }
      }

      async handleOrderCreated(event) {
        const order = {
          id: event.aggregateId,
          userId: event.payload.userId,
          status: 'pending',
          items: event.payload.items,
          total: event.payload.total,
          createdAt: event.timestamp,
          updatedAt: event.timestamp
        };
        await this.readDb.insert('orders', order);
      }

      async handleOrderCancelled(event) {
        await this.readDb.update(
          'orders',
          { id: event.aggregateId },
          {
            status: 'cancelled',
            cancellationReason: event.payload.reason,
            updatedAt: event.timestamp
          }
        );
      }

      toPascalCase(str) {
        return str.replace(/_([a-z])/g, (g) => g[1].toUpperCase())
                  .replace(/^[a-z]/, (c) => c.toUpperCase());
      }
    }

    // ============== Usage Example ==============
    const app = express();

    // Command endpoint (write)
    app.post('/orders', async (req, res) => {
      const command = new Command(
        CommandType.CREATE_ORDER,
        null,
        req.body
      );
      
      const orderId = await commandHandler.handle(command);
      res.status(201).json({ orderId });
    });

    // Query endpoint (read)
    app.get('/orders/:id', async (req, res) => {
      const order = await queryService.getOrder(req.params.id);
      if (!order) return res.status(404).json({ error: 'Not found' });
      res.json(order);
    });
    ```
  </Tab>
</Tabs>

## Event Sourcing

Event sourcing stores all changes as a sequence of events rather than overwriting current state, providing a complete audit trail and enabling time-travel debugging. Where a traditional database says "the account balance IS $525," an event-sourced system says "here is every deposit and withdrawal that led to $525." This seemingly small difference has profound implications: you can rebuild any past state by replaying events to a point in time, you get a full audit trail for free, and you can create new read models by replaying the event stream through new logic -- without migrating any data.

Event sourcing pairs naturally with CQRS (above): the write side appends events to an immutable log, and one or more read-side projections consume those events to build materialized views optimized for queries. The trade-off is complexity: your system now has eventual consistency between the event store and read models, you need snapshot strategies for aggregates with long event histories (an aggregate with 10 million events takes too long to replay from scratch), and schema evolution of events requires careful versioning since events are immutable once stored.

<img src="https://mintcdn.com/devweeekends/2f8Rfaato9LS1FSq/images/system-design/event-sourcing.svg?fit=max&auto=format&n=2f8Rfaato9LS1FSq&q=85&s=1aa5db0894f256ce7742699589dec787" alt="Event Sourcing" width="1080" height="1080" data-path="images/system-design/event-sourcing.svg" />

<Tabs>
  <Tab title="Python">
    ```python theme={null}
    from dataclasses import dataclass, field
    from typing import List, Dict, Any, Optional
    from datetime import datetime
    from enum import Enum
    from abc import ABC, abstractmethod
    import json
    import uuid

    class EventType(Enum):
        ORDER_CREATED = "order_created"
        ORDER_ITEM_ADDED = "order_item_added"
        ORDER_ITEM_REMOVED = "order_item_removed"
        ORDER_SUBMITTED = "order_submitted"
        ORDER_PAID = "order_paid"
        ORDER_SHIPPED = "order_shipped"
        ORDER_DELIVERED = "order_delivered"
        ORDER_CANCELLED = "order_cancelled"

    @dataclass
    class Event:
        event_type: EventType
        aggregate_id: str
        payload: Dict[str, Any]
        version: int
        timestamp: datetime = field(default_factory=datetime.utcnow)
        event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
        metadata: Dict[str, Any] = field(default_factory=dict)

    class EventStore:
        """Append-only event store with PostgreSQL"""
        
        def __init__(self, db_pool):
            self.db_pool = db_pool
        
        async def append(self, event: Event) -> None:
            """Append event with optimistic concurrency control"""
            async with self.db_pool.acquire() as conn:
                try:
                    await conn.execute("""
                        INSERT INTO events (
                            event_id, aggregate_id, event_type, 
                            payload, version, timestamp, metadata
                        )
                        VALUES ($1, $2, $3, $4, $5, $6, $7)
                    """,
                        event.event_id,
                        event.aggregate_id,
                        event.event_type.value,
                        json.dumps(event.payload),
                        event.version,
                        event.timestamp,
                        json.dumps(event.metadata)
                    )
                except UniqueViolationError:
                    raise ConcurrencyError(
                        f"Version {event.version} already exists for {event.aggregate_id}"
                    )
        
        async def get_events(
            self, 
            aggregate_id: str, 
            from_version: int = 0
        ) -> List[Event]:
            """Load all events for an aggregate"""
            async with self.db_pool.acquire() as conn:
                rows = await conn.fetch("""
                    SELECT event_id, aggregate_id, event_type, 
                           payload, version, timestamp, metadata
                    FROM events
                    WHERE aggregate_id = $1 AND version > $2
                    ORDER BY version ASC
                """, aggregate_id, from_version)
                
                return [
                    Event(
                        event_id=row['event_id'],
                        aggregate_id=row['aggregate_id'],
                        event_type=EventType(row['event_type']),
                        payload=json.loads(row['payload']),
                        version=row['version'],
                        timestamp=row['timestamp'],
                        metadata=json.loads(row['metadata'])
                    )
                    for row in rows
                ]
        
        async def get_all_events(
            self, 
            from_position: int = 0,
            batch_size: int = 1000
        ) -> List[Event]:
            """Stream all events for projections"""
            async with self.db_pool.acquire() as conn:
                rows = await conn.fetch("""
                    SELECT * FROM events
                    WHERE global_position > $1
                    ORDER BY global_position ASC
                    LIMIT $2
                """, from_position, batch_size)
                
                return [self._row_to_event(row) for row in rows]

    # ============== Aggregate with Event Sourcing ==============
    class OrderStatus(Enum):
        DRAFT = "draft"
        PENDING = "pending"
        PAID = "paid"
        SHIPPED = "shipped"
        DELIVERED = "delivered"
        CANCELLED = "cancelled"

    @dataclass
    class OrderAggregate:
        """Order aggregate rebuilt from events"""
        id: str
        user_id: Optional[str] = None
        status: OrderStatus = OrderStatus.DRAFT
        items: List[Dict] = field(default_factory=list)
        total: float = 0.0
        version: int = 0
        created_at: Optional[datetime] = None
        
        @classmethod
        def from_events(cls, events: List[Event]) -> 'OrderAggregate':
            """Reconstruct aggregate from event history"""
            aggregate = cls(id=events[0].aggregate_id if events else None)
            
            for event in events:
                aggregate._apply(event)
            
            return aggregate
        
        def _apply(self, event: Event) -> None:
            """Apply event to update state"""
            handler = getattr(self, f"_apply_{event.event_type.value}", None)
            if handler:
                handler(event)
            self.version = event.version
        
        def _apply_order_created(self, event: Event) -> None:
            self.user_id = event.payload["user_id"]
            self.status = OrderStatus.DRAFT
            self.created_at = event.timestamp
        
        def _apply_order_item_added(self, event: Event) -> None:
            self.items.append(event.payload["item"])
            self._recalculate_total()
        
        def _apply_order_item_removed(self, event: Event) -> None:
            item_id = event.payload["item_id"]
            self.items = [i for i in self.items if i["id"] != item_id]
            self._recalculate_total()
        
        def _apply_order_submitted(self, event: Event) -> None:
            self.status = OrderStatus.PENDING
        
        def _apply_order_paid(self, event: Event) -> None:
            self.status = OrderStatus.PAID
        
        def _apply_order_shipped(self, event: Event) -> None:
            self.status = OrderStatus.SHIPPED
        
        def _apply_order_cancelled(self, event: Event) -> None:
            self.status = OrderStatus.CANCELLED
        
        def _recalculate_total(self) -> None:
            self.total = sum(
                item["price"] * item["quantity"] 
                for item in self.items
            )

    # ============== Snapshots for Performance ==============
    class SnapshotStore:
        """Store periodic snapshots to speed up replay"""
        
        def __init__(self, db_pool, snapshot_interval: int = 100):
            self.db_pool = db_pool
            self.snapshot_interval = snapshot_interval
        
        async def save_snapshot(
            self, 
            aggregate_id: str, 
            aggregate: OrderAggregate
        ) -> None:
            """Save aggregate snapshot"""
            async with self.db_pool.acquire() as conn:
                await conn.execute("""
                    INSERT INTO snapshots (aggregate_id, version, state, created_at)
                    VALUES ($1, $2, $3, $4)
                    ON CONFLICT (aggregate_id) 
                    DO UPDATE SET version = $2, state = $3, created_at = $4
                """,
                    aggregate_id,
                    aggregate.version,
                    json.dumps(aggregate.__dict__, default=str),
                    datetime.utcnow()
                )
        
        async def get_snapshot(
            self, 
            aggregate_id: str
        ) -> Optional[OrderAggregate]:
            """Load latest snapshot"""
            async with self.db_pool.acquire() as conn:
                row = await conn.fetchrow("""
                    SELECT state, version FROM snapshots
                    WHERE aggregate_id = $1
                """, aggregate_id)
                
                if row:
                    state = json.loads(row['state'])
                    return OrderAggregate(**state)
                return None

    class OrderRepository:
        """Repository using snapshots + events"""
        
        def __init__(
            self, 
            event_store: EventStore, 
            snapshot_store: SnapshotStore
        ):
            self.event_store = event_store
            self.snapshot_store = snapshot_store
        
        async def get(self, order_id: str) -> Optional[OrderAggregate]:
            # Try to load from snapshot first
            aggregate = await self.snapshot_store.get_snapshot(order_id)
            
            if aggregate:
                # Only replay events after snapshot
                events = await self.event_store.get_events(
                    order_id, 
                    from_version=aggregate.version
                )
            else:
                # Replay all events
                events = await self.event_store.get_events(order_id)
                if not events:
                    return None
                aggregate = OrderAggregate(id=order_id)
            
            # Apply remaining events
            for event in events:
                aggregate._apply(event)
            
            # Create snapshot if needed
            if aggregate.version % self.snapshot_store.snapshot_interval == 0:
                await self.snapshot_store.save_snapshot(order_id, aggregate)
            
            return aggregate
    ```
  </Tab>

  <Tab title="JavaScript">
    ```javascript theme={null}
    const { v4: uuidv4 } = require('uuid');

    // ============== Event Types ==============
    const EventType = {
      ORDER_CREATED: 'order_created',
      ORDER_ITEM_ADDED: 'order_item_added',
      ORDER_ITEM_REMOVED: 'order_item_removed',
      ORDER_SUBMITTED: 'order_submitted',
      ORDER_PAID: 'order_paid',
      ORDER_SHIPPED: 'order_shipped',
      ORDER_CANCELLED: 'order_cancelled'
    };

    class Event {
      constructor(eventType, aggregateId, payload, version) {
        this.eventId = uuidv4();
        this.eventType = eventType;
        this.aggregateId = aggregateId;
        this.payload = payload;
        this.version = version;
        this.timestamp = new Date();
        this.metadata = {};
      }
    }

    // ============== Event Store ==============
    class EventStore {
      constructor(pool) {
        this.pool = pool;
      }

      async append(event) {
        const client = await this.pool.connect();
        try {
          await client.query(`
            INSERT INTO events (
              event_id, aggregate_id, event_type, 
              payload, version, timestamp, metadata
            )
            VALUES ($1, $2, $3, $4, $5, $6, $7)
          `, [
            event.eventId,
            event.aggregateId,
            event.eventType,
            JSON.stringify(event.payload),
            event.version,
            event.timestamp,
            JSON.stringify(event.metadata)
          ]);
        } catch (error) {
          if (error.code === '23505') { // Unique violation
            throw new ConcurrencyError(
              `Version ${event.version} already exists for ${event.aggregateId}`
            );
          }
          throw error;
        } finally {
          client.release();
        }
      }

      async getEvents(aggregateId, fromVersion = 0) {
        const client = await this.pool.connect();
        try {
          const result = await client.query(`
            SELECT * FROM events
            WHERE aggregate_id = $1 AND version > $2
            ORDER BY version ASC
          `, [aggregateId, fromVersion]);

          return result.rows.map(row => ({
            eventId: row.event_id,
            eventType: row.event_type,
            aggregateId: row.aggregate_id,
            payload: row.payload,
            version: row.version,
            timestamp: row.timestamp,
            metadata: row.metadata
          }));
        } finally {
          client.release();
        }
      }

      async getAllEvents(fromPosition = 0, batchSize = 1000) {
        const client = await this.pool.connect();
        try {
          const result = await client.query(`
            SELECT * FROM events
            WHERE global_position > $1
            ORDER BY global_position ASC
            LIMIT $2
          `, [fromPosition, batchSize]);

          return result.rows;
        } finally {
          client.release();
        }
      }
    }

    // ============== Order Aggregate ==============
    const OrderStatus = {
      DRAFT: 'draft',
      PENDING: 'pending',
      PAID: 'paid',
      SHIPPED: 'shipped',
      DELIVERED: 'delivered',
      CANCELLED: 'cancelled'
    };

    class OrderAggregate {
      constructor(id) {
        this.id = id;
        this.userId = null;
        this.status = OrderStatus.DRAFT;
        this.items = [];
        this.total = 0;
        this.version = 0;
        this.createdAt = null;
      }

      static fromEvents(events) {
        if (!events.length) return null;
        
        const aggregate = new OrderAggregate(events[0].aggregateId);
        
        for (const event of events) {
          aggregate.apply(event);
        }
        
        return aggregate;
      }

      apply(event) {
        const handler = this[`apply${this.toPascalCase(event.eventType)}`];
        if (handler) {
          handler.call(this, event);
        }
        this.version = event.version;
      }

      applyOrderCreated(event) {
        this.userId = event.payload.userId;
        this.status = OrderStatus.DRAFT;
        this.createdAt = event.timestamp;
      }

      applyOrderItemAdded(event) {
        this.items.push(event.payload.item);
        this.recalculateTotal();
      }

      applyOrderItemRemoved(event) {
        const itemId = event.payload.itemId;
        this.items = this.items.filter(i => i.id !== itemId);
        this.recalculateTotal();
      }

      applyOrderSubmitted(event) {
        this.status = OrderStatus.PENDING;
      }

      applyOrderPaid(event) {
        this.status = OrderStatus.PAID;
      }

      applyOrderShipped(event) {
        this.status = OrderStatus.SHIPPED;
      }

      applyOrderCancelled(event) {
        this.status = OrderStatus.CANCELLED;
      }

      recalculateTotal() {
        this.total = this.items.reduce(
          (sum, item) => sum + (item.price * item.quantity),
          0
        );
      }

      toPascalCase(str) {
        return str.replace(/_([a-z])/g, (g) => g[1].toUpperCase())
                  .replace(/^[a-z]/, (c) => c.toUpperCase());
      }
    }

    // ============== Snapshot Store ==============
    class SnapshotStore {
      constructor(pool, snapshotInterval = 100) {
        this.pool = pool;
        this.snapshotInterval = snapshotInterval;
      }

      async saveSnapshot(aggregateId, aggregate) {
        const client = await this.pool.connect();
        try {
          await client.query(`
            INSERT INTO snapshots (aggregate_id, version, state, created_at)
            VALUES ($1, $2, $3, $4)
            ON CONFLICT (aggregate_id) 
            DO UPDATE SET version = $2, state = $3, created_at = $4
          `, [
            aggregateId,
            aggregate.version,
            JSON.stringify(aggregate),
            new Date()
          ]);
        } finally {
          client.release();
        }
      }

      async getSnapshot(aggregateId) {
        const client = await this.pool.connect();
        try {
          const result = await client.query(`
            SELECT state, version FROM snapshots
            WHERE aggregate_id = $1
          `, [aggregateId]);

          if (result.rows.length > 0) {
            const state = result.rows[0].state;
            const aggregate = new OrderAggregate(aggregateId);
            Object.assign(aggregate, state);
            return aggregate;
          }
          return null;
        } finally {
          client.release();
        }
      }
    }

    // ============== Order Repository ==============
    class OrderRepository {
      constructor(eventStore, snapshotStore) {
        this.eventStore = eventStore;
        this.snapshotStore = snapshotStore;
      }

      async get(orderId) {
        // Try to load from snapshot first
        let aggregate = await this.snapshotStore.getSnapshot(orderId);
        let events;

        if (aggregate) {
          // Only replay events after snapshot
          events = await this.eventStore.getEvents(orderId, aggregate.version);
        } else {
          // Replay all events
          events = await this.eventStore.getEvents(orderId);
          if (!events.length) return null;
          aggregate = new OrderAggregate(orderId);
        }

        // Apply remaining events
        for (const event of events) {
          aggregate.apply(event);
        }

        // Create snapshot if needed
        if (aggregate.version % this.snapshotStore.snapshotInterval === 0) {
          await this.snapshotStore.saveSnapshot(orderId, aggregate);
        }

        return aggregate;
      }
    }

    // ============== Time Travel / Replay ==============
    class EventReplayer {
      constructor(eventStore) {
        this.eventStore = eventStore;
      }

      async getStateAtTime(aggregateId, targetTime) {
        const events = await this.eventStore.getEvents(aggregateId);
        const filteredEvents = events.filter(e => 
          new Date(e.timestamp) <= targetTime
        );
        return OrderAggregate.fromEvents(filteredEvents);
      }

      async replayAllEvents(projector, fromPosition = 0) {
        let position = fromPosition;
        const batchSize = 1000;

        while (true) {
          const events = await this.eventStore.getAllEvents(position, batchSize);
          
          if (events.length === 0) break;

          for (const event of events) {
            await projector.project(event);
            position = event.global_position;
          }

          console.log(`Replayed up to position ${position}`);
        }

        return position;
      }
    }

    module.exports = {
      Event,
      EventType,
      EventStore,
      OrderAggregate,
      SnapshotStore,
      OrderRepository,
      EventReplayer
    };
    ```
  </Tab>
</Tabs>

<Tip>
  **When to use Event Sourcing:**

  * Audit requirements (financial systems, healthcare)
  * Need to replay/debug past states
  * Complex business logic with temporal queries
  * Event-driven microservices architecture
</Tip>

## Interview Questions: Senior Level

<Accordion title="How would you design for multi-region active-active?">
  **Key Points**:

  1. **Data replication**: Async replication between regions (eventual consistency)
  2. **Conflict resolution**: Last-write-wins (with vector clocks) or custom merge
  3. **Routing**: GeoDNS to route users to nearest region
  4. **Failover**: Health checks + automatic DNS failover
  5. **Consistency**: Accept that cross-region writes may conflict

  **Trade-offs to mention**:

  * Latency vs consistency
  * Cost of running in multiple regions
  * Complexity of conflict resolution
</Accordion>

<Accordion title="How do you handle a database that can't keep up with writes?">
  **Solutions in order of complexity**:

  1. **Batch writes**: Accumulate and write in batches
  2. **Write-behind cache**: Write to Redis, async persist to DB
  3. **Message queue**: Queue writes, process at sustainable rate
  4. **Sharding**: Distribute writes across multiple DB nodes
  5. **Different DB**: Switch to write-optimized DB (Cassandra, ScyllaDB)

  **Always ask**: "What's the consistency requirement? Can we lose some writes?"
</Accordion>

<Accordion title="Explain how you'd implement distributed transactions">
  **Answer structure**:

  1. **First ask**: "Do we really need distributed transactions?" Often can redesign.
  2. **2PC**: Strong consistency, but blocking and slow
  3. **Saga**: Eventual consistency, compensating transactions
  4. **Outbox pattern**: Reliable event publishing with local transaction

  **Code example for Saga**:

  ```python theme={null}
  async def create_order_saga(order):
      try:
          order_id = await order_service.create(order)
          await inventory_service.reserve(order.items)
          await payment_service.charge(order.payment)
          await order_service.confirm(order_id)
      except PaymentFailed:
          await inventory_service.release(order.items)
          await order_service.cancel(order_id)
  ```
</Accordion>

<Accordion title="How do you debug a latency spike in a distributed system?">
  **Systematic approach**:

  1. **Observe**: Check metrics dashboards (p99 latency by service)
  2. **Trace**: Use distributed tracing (Jaeger/Zipkin) to find slow span
  3. **Correlate**: Check if spike correlates with deployments, traffic, or GC
  4. **Drill down**: Once you find the service, check:
     * CPU/memory usage
     * DB query times (slow query log)
     * Network latency between services
     * Thread pool saturation
     * Lock contention

  **Common causes**: DB slow queries, GC pauses, connection pool exhaustion, lock contention, network issues
</Accordion>

<Accordion title="How would you design a system that handles 1M requests per second?">
  **Approach**:

  1. **Back of envelope**: 1M RPS = \~60K servers at 16 RPS each (conservative)
  2. **Stateless compute**: Horizontal scaling with load balancer
  3. **Caching**: Cache everything possible (aim for 99%+ cache hit)
  4. **CDN**: Serve static content from edge
  5. **Database**: Shard aggressively, read replicas
  6. **Async**: Queue non-critical work

  **Bottleneck analysis**:

  * Network: 1M × 10KB = 10GB/s = 80Gbps (need multiple LBs)
  * Compute: 1M / 10K (RPS per server) = 100 servers minimum
  * Database: Can't hit DB for every request, need 99%+ cache hit
</Accordion>

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="You are designing a global payment system. The PM says they want sub-100ms writes worldwide. Walk me through why that is physically impossible with strong consistency, and what you would actually propose.">
    **Strong Answer:**

    The core constraint is the speed of light. A round trip from US-East to Singapore is roughly 160ms at minimum -- and Raft/Paxos require a majority quorum acknowledgment before a write is committed. If your replicas span continents, every write pays that cross-continent RTT at least once.

    * **Linearizable writes across regions**: With a 3-node Raft cluster spanning US, EU, and APAC, a write from Singapore must reach a majority. If the leader is in US-East, that is \~160ms to Singapore and \~90ms to EU. The write latency floor is the second-fastest quorum member, so roughly 90ms best case -- and that is before any application logic.
    * **What Spanner does**: Google Spanner achieves external consistency using GPS-synchronized TrueTime clocks, but even Spanner reports single-digit millisecond writes only when the transaction's data is colocated within a single region. Cross-region transactions in Spanner still take 100-200ms.
    * **What I would actually propose**: Partition the data by geography. Payments originating in APAC are mastered in APAC, EU payments in EU. Each region runs its own Raft group with sub-10ms write latency. Cross-region reads can tolerate brief staleness (a merchant dashboard does not need real-time accuracy of a payment that happened 2 seconds ago in another continent). For the rare cross-region transaction (a US user paying an EU merchant), accept the latency hit on that specific path and use a Saga pattern for the settlement workflow.

    **Back-of-envelope**: 3 regions, each handling \~33% of traffic. Within a region, Raft quorum across 3 AZs takes \~2-5ms. You get sub-10ms writes for 95%+ of transactions, with the remaining 5% cross-region transactions at 150-200ms.

    **Follow-up: How would you handle the edge case where a user travels from the US to Europe and makes a payment -- their account is mastered in US-West?**

    You have two options. First, proxy the write back to the home region and accept the \~100ms penalty -- for a payment flow where the user is already interacting with a UI, adding 100ms to a button click is imperceptible. Second, use a "follow-the-sun" migration pattern where after detecting consistent activity from a new region (say, 3+ transactions in 24 hours), you migrate the account's master to the new region. The key insight: optimize for the common case (users transact locally), accept latency on the rare case (traveling users), and never sacrifice correctness for speed on financial data.
  </Accordion>

  <Accordion title="Your team's Kafka consumers are processing events with exactly-once semantics using idempotency keys stored in Postgres. During a load test at 50K events/sec, you notice duplicate processing spiking. What is happening and how do you fix it?">
    **Strong Answer:**

    The most likely culprit is the interaction between consumer group rebalancing and the idempotency check window. Here is the failure sequence:

    * Consumer A reads event X, begins processing, writes to Postgres with idempotency key, but has not yet committed the Kafka offset.
    * A Kafka rebalance triggers (perhaps because another consumer joined, or A's heartbeat was slow under load). Consumer A loses its partition assignment.
    * Consumer B picks up the partition, reads event X again (offset was not committed), checks Postgres for the idempotency key -- and the timing matters here. If A's Postgres transaction committed but the Kafka offset did not, you are fine (idempotency catches it). But if A's Postgres transaction is still in-flight or was rolled back due to the rebalance interruption, B will not find the key and will process the event again.

    **The deeper issue**: At 50K events/sec, the processing time per event might exceed the `max.poll.interval.ms` setting (default 300 seconds, but effective throughput matters). If the consumer takes too long between polls -- because it is doing synchronous Postgres writes for each event -- Kafka assumes it is dead and triggers a rebalance.

    **Fix, in order of impact**:

    1. **Batch the idempotency writes**: Instead of one Postgres round-trip per event, buffer 100-500 events and do a single batch INSERT ON CONFLICT. This reduces the per-event processing time and keeps the consumer polling frequently.
    2. **Use the transactional outbox pattern**: Write the idempotency key and the business result in the same Postgres transaction, then have a separate process commit Kafka offsets only after confirming the Postgres transaction committed.
    3. **Tune Kafka consumer settings**: Increase `max.poll.records` and decrease `max.poll.interval.ms` appropriately. Use `session.timeout.ms` = 10s and `heartbeat.interval.ms` = 3s to detect failures fast without false positives.
    4. **Consider Kafka transactions**: Use Kafka's built-in transactional API (`enable.idempotence=true` + `transactional.id`) to achieve exactly-once between Kafka produce and consume, and handle the Postgres write separately with an outbox.

    **Follow-up: At 50K events/sec, how much Postgres write throughput do you need for the idempotency table, and when does that become the bottleneck?**

    At 50K events/sec with batches of 500, you need 100 batch inserts/sec into Postgres. Each batch INSERT ON CONFLICT with 500 rows takes roughly 5-10ms on a well-indexed Postgres instance, so you need about 0.5-1 second of Postgres time per second -- well within a single Postgres instance's capacity. The bottleneck shifts to Postgres at around 200-500K events/sec, at which point you would shard the idempotency table by a hash of the event key, or move to a faster store like Redis with persistence for the idempotency check (accepting the trade-off of slightly weaker durability guarantees on the idempotency store).
  </Accordion>

  <Accordion title="Walk me through how you would design the isolation level strategy for a large e-commerce platform. Not every table needs Serializable, and not every table can tolerate Read Uncommitted. How do you decide?">
    **Strong Answer:**

    The way I think about this is: isolation level is a per-transaction decision, not a per-database decision. Different operations within the same database have radically different consistency requirements.

    * **Serializable (or at minimum Repeatable Read with explicit locking)**: Inventory decrement on checkout. This is the classic "two users buy the last item" problem. If you use Read Committed, both transactions can read quantity=1, both decrement to 0, and you have oversold. You need either Serializable isolation or an explicit SELECT FOR UPDATE to prevent this. At high scale, I would actually avoid row-level locking entirely and use an atomic decrement: `UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity >= 1`, checking the affected row count.

    * **Read Committed (the Postgres default)**: Order history queries, product catalog browsing, user profile reads. These are read-heavy, and a non-repeatable read (seeing a price change mid-transaction) is harmless -- the user sees the updated price, which is correct behavior.

    * **Repeatable Read**: Financial reporting and analytics queries that run for minutes. If you are generating a daily revenue report, you need a consistent snapshot -- seeing some orders but not others because they committed during your query would produce incorrect totals. Postgres implements this efficiently with MVCC snapshots.

    * **Read Uncommitted**: I almost never use this in practice. The one exception might be approximate analytics dashboards where you want maximum throughput and can tolerate seeing in-flight data. But even then, Read Committed is only marginally slower and avoids the confusion of dirty reads.

    **The architectural pattern I use**: Define a `TransactionContext` that each service method declares. The payment service always runs at Serializable for the actual charge, but the receipt generation that follows runs at Read Committed. The inventory service uses the atomic decrement pattern (no explicit isolation level needed because the atomicity is in the SQL statement itself). The reporting service uses Repeatable Read with long-running read-only transactions.

    **Follow-up: Your Serializable transactions on the inventory table are causing lock contention and timeouts during flash sales. How do you fix this without dropping the isolation level?**

    Three approaches, from simplest to most complex. First, use the atomic decrement pattern I mentioned -- it avoids explicit locking entirely because the WHERE clause acts as a guard. Second, if you need Serializable for more complex invariants, shard the inventory by product\_id so contention is per-product, not global. A flash sale for one product does not block purchases of other products. Third, for the actual flash sale scenario (1000 people buying 50 items), move the hot inventory count to Redis with DECR, let Redis handle the contention (single-threaded, no lock overhead), and reconcile back to Postgres asynchronously. You accept a brief window where Postgres is behind Redis, but Redis is the authoritative source for "is there stock left" during the sale.
  </Accordion>

  <Accordion title="Estimate the cost and feasibility of adding cross-region replication to a service in AWS us-east-1 that handles 20K writes/sec and 100K reads/sec with a 50TB Postgres database.">
    **Strong Answer:**

    Let me break this into the key cost and feasibility dimensions.

    **Data transfer costs (the surprise line item)**:

    * Each write generates a WAL (Write-Ahead Log) record. At 20K writes/sec with an average WAL record of \~200 bytes, that is 4 MB/sec = \~345 GB/day of replication traffic.
    * AWS cross-region data transfer: $0.02/GB. So 345 GB/day = ~$7/day = \~\$210/month just for ongoing replication.
    * The initial seed: 50TB transferred cross-region at $0.02/GB = $1,000 one-time cost. But the bigger issue is time -- at 5 Gbps sustained transfer, 50TB takes roughly 22 hours. During that time, the replica is not serving reads.

    **Replication lag**:

    * Async replication across regions (us-east-1 to eu-west-1, \~80ms RTT): expect 80-200ms replication lag under normal load. This is fine for read replicas serving non-critical reads.
    * Synchronous replication: every write now takes at least 80ms additional latency. At 20K writes/sec, this means each write occupies a connection for 80ms longer. Your connection pool needs to be roughly 20K \* 0.08 = 1,600 additional connections just to maintain throughput.

    **Read scaling in the remote region**:

    * 100K reads/sec. If you route 50% to the EU replica, that is 50K reads/sec on a single Postgres instance, which is feasible with connection pooling and proper indexing but tight. You likely need 2-3 read replicas in EU.

    **My recommendation**:

    * Async replication for the read replicas (accept 100-200ms lag).
    * Do NOT make writes synchronous cross-region -- the latency cost is too high for 20K writes/sec.
    * If you need cross-region write availability (disaster recovery), use a warm standby that can be promoted in minutes, not a synchronous replica.
    * Total monthly cost estimate: $210 (data transfer) + ~$3,000-5,000 (2-3 db.r6g.4xlarge RDS instances in EU) + \~$500 (additional network infrastructure) = roughly $4,000-5,500/month.

    **Follow-up: The business says they need RPO of zero -- no data loss if us-east-1 goes down completely. Does that change your answer?**

    RPO of zero means synchronous replication, which means every write pays the 80ms cross-region penalty. At 20K writes/sec, this is brutal. I would push back on the requirement with data: "RPO of zero adds 80ms to every write, which drops our write throughput from 20K/sec to roughly 12K/sec (connection pool becomes the bottleneck), and increases p99 write latency from 50ms to 150ms. With async replication, our RPO is typically under 200ms -- meaning in a catastrophic us-east-1 failure, we lose at most 200ms of writes. Is that acceptable?" If they insist on zero RPO, I would look at CockroachDB or Aurora Global Database, which are architecturally designed for this at the cost of higher per-write latency.
  </Accordion>
</AccordionGroup>

<Tip>
  **Interview Strategy for Advanced Topics**: When an interviewer asks about CQRS, event sourcing, or distributed consensus, the strongest move is to first explain *when you would NOT use it*. "Event sourcing is powerful for audit trails and temporal queries, but for a simple CRUD service with no compliance requirements, it adds unjustified complexity. I would reach for it when..." This shows you understand the tool *and* its boundaries, which is exactly the judgment call staff-level engineers are evaluated on.
</Tip>

<Note>
  **Scalability Quick Reference for Advanced Patterns**:

  * **CQRS**: becomes valuable at roughly 10:1 or higher read-to-write ratio, or when read patterns diverge significantly from write patterns. Below that ratio, the synchronization overhead between read and write models is not worth the benefit.
  * **Event Sourcing**: the event store grows linearly with write volume. At 10K writes/second with 1KB average event size, you generate roughly 850GB/day of event data. Snapshots every 100 events reduce replay time from O(n) to O(n/100) for aggregate reconstruction.
  * **Distributed Consensus (Raft)**: practical for up to 5-7 nodes. Beyond that, the leader must replicate every write to a majority, and AppendEntries RPCs become the bottleneck. For larger clusters, use multi-Raft (CockroachDB/TiKV style) where different data ranges have independent Raft groups.
  * **Vector Clocks**: the vector grows with the number of nodes that have ever written. For systems with thousands of writers, consider hybrid logical clocks (HLC) or bounded vector clocks (Dynamo-style dotted version vectors) to cap metadata size.
</Note>
