Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Senior/Staff Level Content: This section covers advanced topics that differentiate senior engineers from mid-level. Master these for L5+ (Senior) and Staff interviews at FAANG.

Consistency Deep Dive

Linearizability vs Serializability

Most candidates confuse these — and interviewers love testing this distinction because it reveals whether you genuinely understand distributed consistency or just memorized surface-level definitions. The key: linearizability is about single operations on single objects appearing to happen atomically in real time. Serializability is about multi-operation transactions on multiple objects appearing to happen in some serial order (which may not correspond to real time). Strict serializability gives you both — and is what Google Spanner provides using GPS-synchronized atomic clocks (TrueTime), at the cost of cross-region write latency.
┌─────────────────────────────────────────────────────────────────┐
│           Linearizability vs Serializability                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LINEARIZABILITY (Single-object, real-time)                    │
│  ─────────────────────────────────────────                      │
│  • Operations appear to happen atomically at some point         │
│  • Real-time ordering: if op A completes before op B starts,   │
│    then A is ordered before B                                   │
│  • Think: "single copy of data that everyone sees"             │
│                                                                 │
│  Time ────────────────────────────────────────────►            │
│                                                                 │
│  Client A:  ──[write x=1]────────────                          │
│  Client B:        ──────[read x]───                            │
│                          │                                      │
│                          └─► Must return 1 (write completed)   │
│                                                                 │
│  SERIALIZABILITY (Multi-object, transactions)                  │
│  ───────────────────────────────────────────                    │
│  • Transactions appear to execute in SOME serial order          │
│  • Doesn't guarantee real-time ordering                         │
│  • Think: "transactions don't see partial state"               │
│                                                                 │
│  T1: read(A), write(B)                                         │
│  T2: read(B), write(A)                                         │
│  Serial order: T1→T2 or T2→T1, but not interleaved            │
│                                                                 │
│  STRICT SERIALIZABILITY                                         │
│  ───────────────────────                                        │
│  • Both! Serial order + real-time ordering                      │
│  • Most expensive, what Spanner provides                        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Isolation Levels (Know All Four!)

┌─────────────────────────────────────────────────────────────────┐
│                    Isolation Levels                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  READ UNCOMMITTED (Weakest)                                     │
│  ─────────────────────────                                      │
│  • Can see uncommitted changes (dirty reads)                   │
│  • Almost never used                                            │
│                                                                 │
│  T1: write(x=1) ──────────── rollback                          │
│  T2:     read(x) ─► 1 (dirty read!)                            │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  READ COMMITTED (Default in PostgreSQL)                         │
│  ─────────────────────────────────────                          │
│  • Only see committed data                                     │
│  • Allows non-repeatable reads                                  │
│                                                                 │
│  T1: read(x) ─► 1 ─────────── read(x) ─► 2 (changed!)         │
│  T2:      write(x=2), commit                                   │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  REPEATABLE READ (Default in MySQL InnoDB)                      │
│  ─────────────────────────────────────────                      │
│  • Same query returns same results in a transaction             │
│  • Allows phantom reads (new rows appear)                       │
│                                                                 │
│  T1: SELECT COUNT(*) WHERE age > 30 ─► 5                       │
│  T2:      INSERT INTO users (age=35), commit                   │
│  T1: SELECT COUNT(*) WHERE age > 30 ─► 6 (phantom!)            │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  SERIALIZABLE (Strongest)                                       │
│  ────────────────────────                                       │
│  • Full isolation, as if transactions ran serially              │
│  • Highest safety, lowest performance                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
LevelDirty ReadNon-Repeatable ReadPhantom Read
Read Uncommitted
Read Committed
Repeatable Read
Serializable
Interview Answer: “I’d use Read Committed for most cases. Serializable only for critical financial transactions where consistency matters more than throughput.”

Distributed Consensus Deep Dive

Leader Election: Why It’s Hard

Leader election sounds simple: “just pick one node to be in charge.” But the reason it is one of the hardest problems in distributed systems is that the very mechanism you would use to coordinate the election (the network) is the thing that fails. Imagine trying to elect a class president when students can only pass notes, some notes get lost, and some students might have already left the building without telling anyone. The note-passing is the problem you are trying to solve.
┌─────────────────────────────────────────────────────────────────┐
│                 Split Brain Problem                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Initial State: Node A is leader                               │
│                                                                 │
│  ┌────────┐      ┌────────┐      ┌────────┐                   │
│  │ Node A │──────│ Node B │──────│ Node C │                   │
│  │ LEADER │      │FOLLOWER│      │FOLLOWER│                   │
│  └────────┘      └────────┘      └────────┘                   │
│                                                                 │
│  Network partition occurs:                                     │
│                                                                 │
│  Partition 1         ║         Partition 2                     │
│  ┌────────┐          ║          ┌────────┐ ┌────────┐         │
│  │ Node A │          ║          │ Node B │─│ Node C │         │
│  │ LEADER │          ║          │FOLLOWER│ │FOLLOWER│         │
│  └────────┘          ║          └────────┘ └────────┘         │
│                      ║               │                         │
│  A thinks it's       ║          B or C becomes leader!        │
│  still leader!       ║          (timeout, no heartbeat)        │
│                      ║                                         │
│  DANGER: Two leaders! (Split brain)                            │
│                                                                 │
│  Solution: Quorum-based voting                                 │
│  • Need majority (N/2 + 1) to elect leader                     │
│  • Partition 1 has 1/3 (minority) → can't write               │
│  • Partition 2 has 2/3 (majority) → can elect, can write      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Raft: The Algorithm You Should Know

┌─────────────────────────────────────────────────────────────────┐
│                    Raft Consensus                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  KEY CONCEPTS:                                                  │
│                                                                 │
│  1. TERMS (Logical Clock)                                       │
│     • Time divided into terms                                   │
│     • Each term has at most one leader                         │
│     • Term number monotonically increases                       │
│                                                                 │
│     Term 1       Term 2       Term 3                           │
│     ├──Leader A──┼──Election──┼──Leader B──────►               │
│                  │  (failed)  │                                 │
│                                                                 │
│  2. LOG REPLICATION                                             │
│                                                                 │
│     Leader Log:  [1:x=1] [1:y=2] [2:x=3] [2:z=4]               │
│                    ↓       ↓       ↓       ↓                   │
│     Follower 1:  [1:x=1] [1:y=2] [2:x=3] [2:z=4] ✓             │
│     Follower 2:  [1:x=1] [1:y=2] [2:x=3]         (catching up) │
│                                                                 │
│     Entry committed when replicated to majority                 │
│                                                                 │
│  3. LEADER ELECTION                                             │
│                                                                 │
│     Follower timeout → Candidate → RequestVote RPC             │
│     • Vote granted if:                                         │
│       - Candidate's term >= voter's term                       │
│       - Candidate's log is at least as up-to-date              │
│       - Voter hasn't voted for someone else this term          │
│     • Majority votes → become Leader                           │
│                                                                 │
│  4. SAFETY GUARANTEE                                            │
│     • Elected leader has all committed entries                  │
│     • Leaders never overwrite their log                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Clock Synchronization

Clock synchronization is one of the sneakiest problems in distributed systems because wall clocks appear reliable — they give you a number that seems authoritative — but they silently drift. Think of it like three friends timing a race with unsynchronized stopwatches: each records a slightly different finish time, and there is no objective way to determine whose stopwatch is “right.” In distributed systems, this disagreement can corrupt conflict resolution, cause duplicate processing, or silently lose data when last-write-wins picks the wrong “last.”

Why Wall Clocks Fail

┌─────────────────────────────────────────────────────────────────┐
│                 Clock Synchronization Problems                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Physical clocks drift (100 ms/day typical)                    │
│  NTP sync has error bounds (1-10ms typically)                  │
│                                                                 │
│  Problem scenario:                                              │
│                                                                 │
│  Machine A (clock: 10:00:00.000):  write(x=1) @ 10:00:00.000  │
│  Machine B (clock: 10:00:00.050):  write(x=2) @ 10:00:00.050  │
│                                                                 │
│  But B's clock is 100ms ahead! Real order:                     │
│  • B actually wrote at real time 09:59:59.950                  │
│  • A wrote at real time 10:00:00.000                           │
│  • A happened AFTER B, but timestamps say B is newer!          │
│                                                                 │
│  Solution 1: Logical clocks (Lamport)                          │
│  Solution 2: Vector clocks                                      │
│  Solution 3: Hybrid clocks (HLC)                               │
│  Solution 4: GPS/atomic clocks (Spanner's TrueTime)            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Vector Clocks (Conflict Detection)

┌─────────────────────────────────────────────────────────────────┐
│                    Vector Clocks                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Each node maintains vector: {A: count, B: count, C: count}    │
│                                                                 │
│  Node A         Node B         Node C                          │
│  {A:0,B:0,C:0}  {A:0,B:0,C:0}  {A:0,B:0,C:0}                   │
│      │              │              │                            │
│  write(x=1)         │              │                            │
│  {A:1,B:0,C:0}      │              │                            │
│      │─────────────►│              │                            │
│      │         {A:1,B:0,C:0}       │                            │
│      │              │              │                            │
│      │         write(y=2)          │                            │
│      │         {A:1,B:1,C:0}       │                            │
│      │              │─────────────►│                            │
│      │              │         {A:1,B:1,C:0}                     │
│      │              │              │                            │
│                                                                 │
│  Comparison:                                                    │
│  {A:1,B:2,C:0} vs {A:2,B:1,C:0}                                │
│  Neither dominates → CONFLICT! Must resolve                    │
│                                                                 │
│  {A:1,B:2,C:0} vs {A:1,B:3,C:0}                                │
│  Second dominates → Second is newer, no conflict               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Data Partitioning Strategies

Data partitioning (sharding) is how you scale a database beyond what a single machine can handle — but the partition key choice is the most consequential and least reversible decision you will make. A bad partition key creates “hot partitions” (one shard drowning while others idle), forces expensive cross-partition queries on your hot path, or makes rebalancing a multi-day operational nightmare. The analogy: choosing a partition key is like organizing a library into separate buildings. If you split by the first letter of the author’s last name, the “S” building is overflowing (Smith, Singh, Suzuki…) while the “X” building is nearly empty. But if you split by a hash of the book’s ISBN, every building is equally full — at the cost of no longer being able to browse all books by the same author in one place.

Partition Key Selection

┌─────────────────────────────────────────────────────────────────┐
│              Choosing Partition Keys                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  GOOD Partition Key Properties:                                 │
│  ✓ High cardinality (many unique values)                       │
│  ✓ Evenly distributed access                                   │
│  ✓ Matches query patterns                                      │
│                                                                 │
│  Example: E-commerce Orders                                     │
│                                                                 │
│  ❌ BAD: partition by date                                      │
│     • Today's partition gets all traffic (hot partition)       │
│     • Old partitions sit idle                                  │
│                                                                 │
│  ❌ BAD: partition by country                                   │
│     • US partition has 60% of traffic                          │
│     • Small countries underutilized                            │
│                                                                 │
│  ✅ GOOD: partition by order_id                                │
│     • Random distribution                                       │
│     • But: can't query "all orders for user X" easily          │
│                                                                 │
│  ✅ BETTER: partition by user_id                                │
│     • User's orders on same partition (locality)               │
│     • Query patterns match                                     │
│     • Watch for celebrity users (hot partition)                │
│                                                                 │
│  ✅ BEST: compound key (user_id, order_date)                   │
│     • Partition by user_id                                     │
│     • Sort by order_date within partition                      │
│     • Efficient for "user X's orders in last month"           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Handling Hot Partitions

# Strategy 1: Add random suffix
def get_partition_key(celebrity_id):
    if is_celebrity(celebrity_id):
        # Split celebrity across 10 partitions
        suffix = random.randint(0, 9)
        return f"{celebrity_id}_{suffix}"
    return celebrity_id

# Reading requires scatter-gather
def get_celebrity_data(celebrity_id):
    results = []
    for suffix in range(10):
        key = f"{celebrity_id}_{suffix}"
        results.extend(query_partition(key))
    return results

# Strategy 2: Time-based suffix
def get_partition_key_time(user_id):
    # Different partition each hour
    hour = datetime.now().hour
    return f"{user_id}_{hour % 4}"

Exactly-Once Semantics

The Three Delivery Guarantees

┌─────────────────────────────────────────────────────────────────┐
│              Message Delivery Guarantees                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  AT-MOST-ONCE                                                   │
│  ─────────────                                                  │
│  Send and forget. May lose messages.                           │
│  Use case: Metrics, logs (some loss OK)                        │
│                                                                 │
│  Producer ──[msg]──► Broker    (might fail silently)           │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  AT-LEAST-ONCE                                                  │
│  ─────────────                                                  │
│  Retry until ACK. May duplicate messages.                      │
│  Use case: Most applications (with idempotent consumers)       │
│                                                                 │
│  Producer ──[msg]──► Broker ──[ACK]──► Producer                │
│      │                  │                                       │
│      └──[retry]─────────┘ (if no ACK, retry → duplicate)       │
│                                                                 │
│  ─────────────────────────────────────────────────────────────  │
│                                                                 │
│  EXACTLY-ONCE                                                   │
│  ────────────                                                   │
│  Each message processed exactly once. Hard to achieve!         │
│  Use case: Financial transactions, critical data               │
│                                                                 │
│  Achieved via: Idempotency + Deduplication + Transactions      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Implementing Exactly-Once

class ExactlyOnceProcessor:
    """
    Exactly-once processing with idempotency keys
    """
    
    def process_message(self, message):
        idempotency_key = message.id
        
        # Step 1: Check if already processed
        if self.is_processed(idempotency_key):
            return self.get_cached_result(idempotency_key)
        
        # Step 2: Process with transaction
        with self.db.transaction():
            # Do the work
            result = self.do_business_logic(message)
            
            # Record as processed (same transaction!)
            self.mark_processed(idempotency_key, result)
            
            # Commit message offset (Kafka-style)
            self.commit_offset(message.offset)
        
        return result
    
    def is_processed(self, key):
        return self.db.exists(f"processed:{key}")
    
    def mark_processed(self, key, result):
        self.db.set(f"processed:{key}", result, ttl=86400)

Distributed Caching Patterns

Caching at scale introduces problems that simply do not exist with a single-server cache. The most dangerous: cache stampede (also called “thundering herd”), where a popular cache key expires and hundreds of servers simultaneously hit the database to re-populate it. At moderate scale this is an annoyance; at high scale it can take down your database and cascade into a full outage. The patterns below are battle-tested solutions from companies like Facebook, Twitter, and Netflix.

Cache Stampede Prevention

┌─────────────────────────────────────────────────────────────────┐
│                   Cache Stampede Problem                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Popular key expires → 1000 servers hit DB simultaneously      │
│                                                                 │
│  Server 1 ─┐                                                    │
│  Server 2 ─┤                                                    │
│  Server 3 ─┼───► Cache MISS ───► Database ← 💥 OVERWHELMED     │
│  ...       │                                                    │
│  Server N ─┘                                                    │
│                                                                 │
│  SOLUTIONS:                                                     │
│                                                                 │
│  1. LOCKING (Mutex)                                             │
│     First request acquires lock, others wait or get stale      │
│                                                                 │
│  2. PROBABILISTIC EARLY EXPIRATION                              │
│     Refresh before expiry with some probability                │
│                                                                 │
│  3. BACKGROUND REFRESH                                          │
│     Never expire, refresh asynchronously                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
import random
import time

class StampedePreventingCache:
    def get(self, key, fetch_func, ttl=3600, beta=1.0):
        cached = self.cache.get(key)
        
        if cached:
            value, expiry, delta = cached
            now = time.time()
            
            # Probabilistic early refresh
            # As we approach expiry, probability increases
            gap = expiry - now
            if gap > 0:
                # XFetch algorithm
                random_early = delta * beta * math.log(random.random())
                if gap + random_early > 0:
                    return value  # Use cached value
        
        # Cache miss or early refresh triggered
        # Use distributed lock to prevent stampede
        lock_key = f"lock:{key}"
        if self.acquire_lock(lock_key, timeout=5):
            try:
                start = time.time()
                value = fetch_func()
                delta = time.time() - start
                
                self.cache.set(key, (value, time.time() + ttl, delta))
                return value
            finally:
                self.release_lock(lock_key)
        else:
            # Someone else is refreshing, return stale or wait
            if cached:
                return cached[0]  # Return stale
            time.sleep(0.1)
            return self.get(key, fetch_func, ttl, beta)  # Retry

Rate Limiting at Scale

Rate limiting is the seatbelt of distributed systems — you hope you never need it, but when a client misbehaves or a traffic spike hits, it is the difference between a graceful degradation and a cascading outage. The subtlety most engineers miss: rate limiting must be global, not per-server. If you have 10 API servers each allowing 100 requests/second from the same client, that client effectively gets 1,000 requests/second. Centralized rate limiting (typically via Redis) solves this, but introduces a new dependency on the rate limiter itself — which is why the most resilient implementations use local rate limiting as a fast first pass and centralized rate limiting for accuracy.
Scalability Analysis: At 100K QPS, your rate limiter itself becomes a bottleneck if implemented naively. A Redis-based sliding window counter that calls ZRANGEBYSCORE for each request can handle roughly 50K-100K operations per second on a single Redis instance. Beyond that, you need sharded rate limiting (hash the client identifier to determine which Redis shard tracks them) or a hierarchical approach (local token buckets per server, synchronized periodically with a central store). Companies like Cloudflare and Stripe use multi-tier rate limiting: edge-level (Anycast), per-server (in-memory), and centralized (Redis cluster).

Distributed Rate Limiting

┌─────────────────────────────────────────────────────────────────┐
│           Distributed Rate Limiting Strategies                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  STRATEGY 1: Centralized (Redis)                               │
│  ─────────────────────────────────                              │
│  All servers check same Redis                                  │
│  ✓ Accurate   ✗ Single point of failure                       │
│                                                                 │
│  Server 1 ─┐                                                    │
│  Server 2 ─┼───► Redis ───► Accurate count                     │
│  Server 3 ─┘                                                    │
│                                                                 │
│  STRATEGY 2: Local + Sync                                       │
│  ───────────────────────                                        │
│  Local counters, periodically sync                             │
│  ✓ Fast   ✗ Approximate                                        │
│                                                                 │
│  Server 1: local_count=50 ──┐                                  │
│  Server 2: local_count=40 ──┼──► Sync every 1s                 │
│  Server 3: local_count=30 ──┘                                  │
│                                                                 │
│  STRATEGY 3: Token Bucket with Redis                           │
│  ─────────────────────────────────                              │
│  Each request: DECR if tokens > 0                              │
│  Background: Refill tokens at fixed rate                       │
│                                                                 │
│  Lua script for atomic check-and-decrement:                    │
│  local tokens = redis.call('GET', key)                         │
│  if tokens > 0 then                                            │
│      redis.call('DECR', key)                                   │
│      return 1  -- allowed                                      │
│  end                                                           │
│  return 0  -- rejected                                         │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

CQRS (Command Query Responsibility Segregation)

CQRS separates read and write operations into different models, optimizing each for its specific use case. The core insight: in most systems, read and write workloads have fundamentally different characteristics. Writes need validation, consistency, and domain logic. Reads need speed, denormalization, and flexible queries. Trying to serve both through a single model forces painful compromises. Think of it like a library: the cataloging system (write side) carefully classifies books, enforces Dewey Decimal rules, and ensures no duplicates. The search terminals (read side) are optimized for patrons to find books fast — they might have denormalized data, multiple indexes, and even slightly stale information. You would not force librarians to catalog books through the search terminal, and you would not make patrons navigate the raw catalog system. The trade-off is operational complexity: you now maintain two models and a synchronization mechanism between them (usually events). CQRS is overkill for simple CRUD applications, but it shines for systems with high read-to-write ratios (100:1 or more), complex read queries that differ significantly from write structures, or requirements for different scaling characteristics on reads vs writes. CQRS Pattern
from dataclasses import dataclass, field
from typing import List, Optional, Dict, Any
from datetime import datetime
from abc import ABC, abstractmethod
import asyncio
from enum import Enum

# ============== Commands (Write Side) ==============
class CommandType(Enum):
    CREATE_ORDER = "create_order"
    UPDATE_ORDER = "update_order"
    CANCEL_ORDER = "cancel_order"

@dataclass
class Command:
    command_type: CommandType
    aggregate_id: str
    payload: Dict[str, Any]
    timestamp: datetime = field(default_factory=datetime.utcnow)
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

class CommandHandler(ABC):
    @abstractmethod
    async def handle(self, command: Command) -> None:
        pass

class OrderCommandHandler(CommandHandler):
    def __init__(self, event_store: 'EventStore', event_bus: 'EventBus'):
        self.event_store = event_store
        self.event_bus = event_bus
    
    async def handle(self, command: Command) -> str:
        if command.command_type == CommandType.CREATE_ORDER:
            return await self._handle_create_order(command)
        elif command.command_type == CommandType.CANCEL_ORDER:
            return await self._handle_cancel_order(command)
        raise ValueError(f"Unknown command type: {command.command_type}")
    
    async def _handle_create_order(self, command: Command) -> str:
        # Business validation
        order_id = str(uuid.uuid4())
        
        # Create event
        event = Event(
            event_type=EventType.ORDER_CREATED,
            aggregate_id=order_id,
            payload={
                "user_id": command.payload["user_id"],
                "items": command.payload["items"],
                "total": command.payload["total"],
            },
            version=1
        )
        
        # Persist event
        await self.event_store.append(event)
        
        # Publish for read model updates
        await self.event_bus.publish(event)
        
        return order_id
    
    async def _handle_cancel_order(self, command: Command) -> None:
        # Load current state from events
        events = await self.event_store.get_events(command.aggregate_id)
        order = OrderAggregate.from_events(events)
        
        # Business validation
        if order.status == OrderStatus.CANCELLED:
            raise ValueError("Order already cancelled")
        if order.status == OrderStatus.SHIPPED:
            raise ValueError("Cannot cancel shipped order")
        
        # Create cancellation event
        event = Event(
            event_type=EventType.ORDER_CANCELLED,
            aggregate_id=command.aggregate_id,
            payload={"reason": command.payload.get("reason", "User requested")},
            version=order.version + 1
        )
        
        await self.event_store.append(event)
        await self.event_bus.publish(event)

# ============== Queries (Read Side) ==============
@dataclass
class OrderReadModel:
    """Denormalized read model optimized for queries"""
    id: str
    user_id: str
    status: str
    items: List[Dict]
    total: float
    created_at: datetime
    updated_at: datetime

class OrderQueryService:
    def __init__(self, read_db: 'ReadDatabase'):
        self.read_db = read_db
    
    async def get_order(self, order_id: str) -> Optional[OrderReadModel]:
        """Fast read from denormalized model"""
        return await self.read_db.find_one("orders", {"id": order_id})
    
    async def get_user_orders(
        self, 
        user_id: str, 
        status: Optional[str] = None,
        limit: int = 10,
        offset: int = 0
    ) -> List[OrderReadModel]:
        """Query with filters - optimized for read patterns"""
        query = {"user_id": user_id}
        if status:
            query["status"] = status
        
        return await self.read_db.find(
            "orders",
            query,
            sort=[("created_at", -1)],
            limit=limit,
            skip=offset
        )
    
    async def get_orders_by_status(self, status: str) -> List[OrderReadModel]:
        """Admin query - different index"""
        return await self.read_db.find(
            "orders",
            {"status": status},
            sort=[("updated_at", -1)]
        )

# ============== Read Model Projector ==============
class OrderProjector:
    """Updates read model based on events"""
    
    def __init__(self, read_db: 'ReadDatabase'):
        self.read_db = read_db
    
    async def project(self, event: 'Event') -> None:
        handler = getattr(self, f"_handle_{event.event_type.value}", None)
        if handler:
            await handler(event)
    
    async def _handle_order_created(self, event: 'Event') -> None:
        order = OrderReadModel(
            id=event.aggregate_id,
            user_id=event.payload["user_id"],
            status="pending",
            items=event.payload["items"],
            total=event.payload["total"],
            created_at=event.timestamp,
            updated_at=event.timestamp
        )
        await self.read_db.insert("orders", order.__dict__)
    
    async def _handle_order_cancelled(self, event: 'Event') -> None:
        await self.read_db.update(
            "orders",
            {"id": event.aggregate_id},
            {
                "status": "cancelled",
                "cancellation_reason": event.payload["reason"],
                "updated_at": event.timestamp
            }
        )

Event Sourcing

Event sourcing stores all changes as a sequence of events rather than overwriting current state, providing a complete audit trail and enabling time-travel debugging. Where a traditional database says “the account balance IS 525,"aneventsourcedsystemsays"hereiseverydepositandwithdrawalthatledto525," an event-sourced system says "here is every deposit and withdrawal that led to 525.” This seemingly small difference has profound implications: you can rebuild any past state by replaying events to a point in time, you get a full audit trail for free, and you can create new read models by replaying the event stream through new logic — without migrating any data. Event sourcing pairs naturally with CQRS (above): the write side appends events to an immutable log, and one or more read-side projections consume those events to build materialized views optimized for queries. The trade-off is complexity: your system now has eventual consistency between the event store and read models, you need snapshot strategies for aggregates with long event histories (an aggregate with 10 million events takes too long to replay from scratch), and schema evolution of events requires careful versioning since events are immutable once stored. Event Sourcing
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
from datetime import datetime
from enum import Enum
from abc import ABC, abstractmethod
import json
import uuid

class EventType(Enum):
    ORDER_CREATED = "order_created"
    ORDER_ITEM_ADDED = "order_item_added"
    ORDER_ITEM_REMOVED = "order_item_removed"
    ORDER_SUBMITTED = "order_submitted"
    ORDER_PAID = "order_paid"
    ORDER_SHIPPED = "order_shipped"
    ORDER_DELIVERED = "order_delivered"
    ORDER_CANCELLED = "order_cancelled"

@dataclass
class Event:
    event_type: EventType
    aggregate_id: str
    payload: Dict[str, Any]
    version: int
    timestamp: datetime = field(default_factory=datetime.utcnow)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    metadata: Dict[str, Any] = field(default_factory=dict)

class EventStore:
    """Append-only event store with PostgreSQL"""
    
    def __init__(self, db_pool):
        self.db_pool = db_pool
    
    async def append(self, event: Event) -> None:
        """Append event with optimistic concurrency control"""
        async with self.db_pool.acquire() as conn:
            try:
                await conn.execute("""
                    INSERT INTO events (
                        event_id, aggregate_id, event_type, 
                        payload, version, timestamp, metadata
                    )
                    VALUES ($1, $2, $3, $4, $5, $6, $7)
                """,
                    event.event_id,
                    event.aggregate_id,
                    event.event_type.value,
                    json.dumps(event.payload),
                    event.version,
                    event.timestamp,
                    json.dumps(event.metadata)
                )
            except UniqueViolationError:
                raise ConcurrencyError(
                    f"Version {event.version} already exists for {event.aggregate_id}"
                )
    
    async def get_events(
        self, 
        aggregate_id: str, 
        from_version: int = 0
    ) -> List[Event]:
        """Load all events for an aggregate"""
        async with self.db_pool.acquire() as conn:
            rows = await conn.fetch("""
                SELECT event_id, aggregate_id, event_type, 
                       payload, version, timestamp, metadata
                FROM events
                WHERE aggregate_id = $1 AND version > $2
                ORDER BY version ASC
            """, aggregate_id, from_version)
            
            return [
                Event(
                    event_id=row['event_id'],
                    aggregate_id=row['aggregate_id'],
                    event_type=EventType(row['event_type']),
                    payload=json.loads(row['payload']),
                    version=row['version'],
                    timestamp=row['timestamp'],
                    metadata=json.loads(row['metadata'])
                )
                for row in rows
            ]
    
    async def get_all_events(
        self, 
        from_position: int = 0,
        batch_size: int = 1000
    ) -> List[Event]:
        """Stream all events for projections"""
        async with self.db_pool.acquire() as conn:
            rows = await conn.fetch("""
                SELECT * FROM events
                WHERE global_position > $1
                ORDER BY global_position ASC
                LIMIT $2
            """, from_position, batch_size)
            
            return [self._row_to_event(row) for row in rows]

# ============== Aggregate with Event Sourcing ==============
class OrderStatus(Enum):
    DRAFT = "draft"
    PENDING = "pending"
    PAID = "paid"
    SHIPPED = "shipped"
    DELIVERED = "delivered"
    CANCELLED = "cancelled"

@dataclass
class OrderAggregate:
    """Order aggregate rebuilt from events"""
    id: str
    user_id: Optional[str] = None
    status: OrderStatus = OrderStatus.DRAFT
    items: List[Dict] = field(default_factory=list)
    total: float = 0.0
    version: int = 0
    created_at: Optional[datetime] = None
    
    @classmethod
    def from_events(cls, events: List[Event]) -> 'OrderAggregate':
        """Reconstruct aggregate from event history"""
        aggregate = cls(id=events[0].aggregate_id if events else None)
        
        for event in events:
            aggregate._apply(event)
        
        return aggregate
    
    def _apply(self, event: Event) -> None:
        """Apply event to update state"""
        handler = getattr(self, f"_apply_{event.event_type.value}", None)
        if handler:
            handler(event)
        self.version = event.version
    
    def _apply_order_created(self, event: Event) -> None:
        self.user_id = event.payload["user_id"]
        self.status = OrderStatus.DRAFT
        self.created_at = event.timestamp
    
    def _apply_order_item_added(self, event: Event) -> None:
        self.items.append(event.payload["item"])
        self._recalculate_total()
    
    def _apply_order_item_removed(self, event: Event) -> None:
        item_id = event.payload["item_id"]
        self.items = [i for i in self.items if i["id"] != item_id]
        self._recalculate_total()
    
    def _apply_order_submitted(self, event: Event) -> None:
        self.status = OrderStatus.PENDING
    
    def _apply_order_paid(self, event: Event) -> None:
        self.status = OrderStatus.PAID
    
    def _apply_order_shipped(self, event: Event) -> None:
        self.status = OrderStatus.SHIPPED
    
    def _apply_order_cancelled(self, event: Event) -> None:
        self.status = OrderStatus.CANCELLED
    
    def _recalculate_total(self) -> None:
        self.total = sum(
            item["price"] * item["quantity"] 
            for item in self.items
        )

# ============== Snapshots for Performance ==============
class SnapshotStore:
    """Store periodic snapshots to speed up replay"""
    
    def __init__(self, db_pool, snapshot_interval: int = 100):
        self.db_pool = db_pool
        self.snapshot_interval = snapshot_interval
    
    async def save_snapshot(
        self, 
        aggregate_id: str, 
        aggregate: OrderAggregate
    ) -> None:
        """Save aggregate snapshot"""
        async with self.db_pool.acquire() as conn:
            await conn.execute("""
                INSERT INTO snapshots (aggregate_id, version, state, created_at)
                VALUES ($1, $2, $3, $4)
                ON CONFLICT (aggregate_id) 
                DO UPDATE SET version = $2, state = $3, created_at = $4
            """,
                aggregate_id,
                aggregate.version,
                json.dumps(aggregate.__dict__, default=str),
                datetime.utcnow()
            )
    
    async def get_snapshot(
        self, 
        aggregate_id: str
    ) -> Optional[OrderAggregate]:
        """Load latest snapshot"""
        async with self.db_pool.acquire() as conn:
            row = await conn.fetchrow("""
                SELECT state, version FROM snapshots
                WHERE aggregate_id = $1
            """, aggregate_id)
            
            if row:
                state = json.loads(row['state'])
                return OrderAggregate(**state)
            return None

class OrderRepository:
    """Repository using snapshots + events"""
    
    def __init__(
        self, 
        event_store: EventStore, 
        snapshot_store: SnapshotStore
    ):
        self.event_store = event_store
        self.snapshot_store = snapshot_store
    
    async def get(self, order_id: str) -> Optional[OrderAggregate]:
        # Try to load from snapshot first
        aggregate = await self.snapshot_store.get_snapshot(order_id)
        
        if aggregate:
            # Only replay events after snapshot
            events = await self.event_store.get_events(
                order_id, 
                from_version=aggregate.version
            )
        else:
            # Replay all events
            events = await self.event_store.get_events(order_id)
            if not events:
                return None
            aggregate = OrderAggregate(id=order_id)
        
        # Apply remaining events
        for event in events:
            aggregate._apply(event)
        
        # Create snapshot if needed
        if aggregate.version % self.snapshot_store.snapshot_interval == 0:
            await self.snapshot_store.save_snapshot(order_id, aggregate)
        
        return aggregate
When to use Event Sourcing:
  • Audit requirements (financial systems, healthcare)
  • Need to replay/debug past states
  • Complex business logic with temporal queries
  • Event-driven microservices architecture

Interview Questions: Senior Level

Key Points:
  1. Data replication: Async replication between regions (eventual consistency)
  2. Conflict resolution: Last-write-wins (with vector clocks) or custom merge
  3. Routing: GeoDNS to route users to nearest region
  4. Failover: Health checks + automatic DNS failover
  5. Consistency: Accept that cross-region writes may conflict
Trade-offs to mention:
  • Latency vs consistency
  • Cost of running in multiple regions
  • Complexity of conflict resolution
Solutions in order of complexity:
  1. Batch writes: Accumulate and write in batches
  2. Write-behind cache: Write to Redis, async persist to DB
  3. Message queue: Queue writes, process at sustainable rate
  4. Sharding: Distribute writes across multiple DB nodes
  5. Different DB: Switch to write-optimized DB (Cassandra, ScyllaDB)
Always ask: “What’s the consistency requirement? Can we lose some writes?”
Answer structure:
  1. First ask: “Do we really need distributed transactions?” Often can redesign.
  2. 2PC: Strong consistency, but blocking and slow
  3. Saga: Eventual consistency, compensating transactions
  4. Outbox pattern: Reliable event publishing with local transaction
Code example for Saga:
async def create_order_saga(order):
    try:
        order_id = await order_service.create(order)
        await inventory_service.reserve(order.items)
        await payment_service.charge(order.payment)
        await order_service.confirm(order_id)
    except PaymentFailed:
        await inventory_service.release(order.items)
        await order_service.cancel(order_id)
Systematic approach:
  1. Observe: Check metrics dashboards (p99 latency by service)
  2. Trace: Use distributed tracing (Jaeger/Zipkin) to find slow span
  3. Correlate: Check if spike correlates with deployments, traffic, or GC
  4. Drill down: Once you find the service, check:
    • CPU/memory usage
    • DB query times (slow query log)
    • Network latency between services
    • Thread pool saturation
    • Lock contention
Common causes: DB slow queries, GC pauses, connection pool exhaustion, lock contention, network issues
Approach:
  1. Back of envelope: 1M RPS = ~60K servers at 16 RPS each (conservative)
  2. Stateless compute: Horizontal scaling with load balancer
  3. Caching: Cache everything possible (aim for 99%+ cache hit)
  4. CDN: Serve static content from edge
  5. Database: Shard aggressively, read replicas
  6. Async: Queue non-critical work
Bottleneck analysis:
  • Network: 1M × 10KB = 10GB/s = 80Gbps (need multiple LBs)
  • Compute: 1M / 10K (RPS per server) = 100 servers minimum
  • Database: Can’t hit DB for every request, need 99%+ cache hit

Interview Deep-Dive

Strong Answer:The core constraint is the speed of light. A round trip from US-East to Singapore is roughly 160ms at minimum — and Raft/Paxos require a majority quorum acknowledgment before a write is committed. If your replicas span continents, every write pays that cross-continent RTT at least once.
  • Linearizable writes across regions: With a 3-node Raft cluster spanning US, EU, and APAC, a write from Singapore must reach a majority. If the leader is in US-East, that is ~160ms to Singapore and ~90ms to EU. The write latency floor is the second-fastest quorum member, so roughly 90ms best case — and that is before any application logic.
  • What Spanner does: Google Spanner achieves external consistency using GPS-synchronized TrueTime clocks, but even Spanner reports single-digit millisecond writes only when the transaction’s data is colocated within a single region. Cross-region transactions in Spanner still take 100-200ms.
  • What I would actually propose: Partition the data by geography. Payments originating in APAC are mastered in APAC, EU payments in EU. Each region runs its own Raft group with sub-10ms write latency. Cross-region reads can tolerate brief staleness (a merchant dashboard does not need real-time accuracy of a payment that happened 2 seconds ago in another continent). For the rare cross-region transaction (a US user paying an EU merchant), accept the latency hit on that specific path and use a Saga pattern for the settlement workflow.
Back-of-envelope: 3 regions, each handling ~33% of traffic. Within a region, Raft quorum across 3 AZs takes ~2-5ms. You get sub-10ms writes for 95%+ of transactions, with the remaining 5% cross-region transactions at 150-200ms.Follow-up: How would you handle the edge case where a user travels from the US to Europe and makes a payment — their account is mastered in US-West?You have two options. First, proxy the write back to the home region and accept the ~100ms penalty — for a payment flow where the user is already interacting with a UI, adding 100ms to a button click is imperceptible. Second, use a “follow-the-sun” migration pattern where after detecting consistent activity from a new region (say, 3+ transactions in 24 hours), you migrate the account’s master to the new region. The key insight: optimize for the common case (users transact locally), accept latency on the rare case (traveling users), and never sacrifice correctness for speed on financial data.
Strong Answer:The most likely culprit is the interaction between consumer group rebalancing and the idempotency check window. Here is the failure sequence:
  • Consumer A reads event X, begins processing, writes to Postgres with idempotency key, but has not yet committed the Kafka offset.
  • A Kafka rebalance triggers (perhaps because another consumer joined, or A’s heartbeat was slow under load). Consumer A loses its partition assignment.
  • Consumer B picks up the partition, reads event X again (offset was not committed), checks Postgres for the idempotency key — and the timing matters here. If A’s Postgres transaction committed but the Kafka offset did not, you are fine (idempotency catches it). But if A’s Postgres transaction is still in-flight or was rolled back due to the rebalance interruption, B will not find the key and will process the event again.
The deeper issue: At 50K events/sec, the processing time per event might exceed the max.poll.interval.ms setting (default 300 seconds, but effective throughput matters). If the consumer takes too long between polls — because it is doing synchronous Postgres writes for each event — Kafka assumes it is dead and triggers a rebalance.Fix, in order of impact:
  1. Batch the idempotency writes: Instead of one Postgres round-trip per event, buffer 100-500 events and do a single batch INSERT ON CONFLICT. This reduces the per-event processing time and keeps the consumer polling frequently.
  2. Use the transactional outbox pattern: Write the idempotency key and the business result in the same Postgres transaction, then have a separate process commit Kafka offsets only after confirming the Postgres transaction committed.
  3. Tune Kafka consumer settings: Increase max.poll.records and decrease max.poll.interval.ms appropriately. Use session.timeout.ms = 10s and heartbeat.interval.ms = 3s to detect failures fast without false positives.
  4. Consider Kafka transactions: Use Kafka’s built-in transactional API (enable.idempotence=true + transactional.id) to achieve exactly-once between Kafka produce and consume, and handle the Postgres write separately with an outbox.
Follow-up: At 50K events/sec, how much Postgres write throughput do you need for the idempotency table, and when does that become the bottleneck?At 50K events/sec with batches of 500, you need 100 batch inserts/sec into Postgres. Each batch INSERT ON CONFLICT with 500 rows takes roughly 5-10ms on a well-indexed Postgres instance, so you need about 0.5-1 second of Postgres time per second — well within a single Postgres instance’s capacity. The bottleneck shifts to Postgres at around 200-500K events/sec, at which point you would shard the idempotency table by a hash of the event key, or move to a faster store like Redis with persistence for the idempotency check (accepting the trade-off of slightly weaker durability guarantees on the idempotency store).
Strong Answer:The way I think about this is: isolation level is a per-transaction decision, not a per-database decision. Different operations within the same database have radically different consistency requirements.
  • Serializable (or at minimum Repeatable Read with explicit locking): Inventory decrement on checkout. This is the classic “two users buy the last item” problem. If you use Read Committed, both transactions can read quantity=1, both decrement to 0, and you have oversold. You need either Serializable isolation or an explicit SELECT FOR UPDATE to prevent this. At high scale, I would actually avoid row-level locking entirely and use an atomic decrement: UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity >= 1, checking the affected row count.
  • Read Committed (the Postgres default): Order history queries, product catalog browsing, user profile reads. These are read-heavy, and a non-repeatable read (seeing a price change mid-transaction) is harmless — the user sees the updated price, which is correct behavior.
  • Repeatable Read: Financial reporting and analytics queries that run for minutes. If you are generating a daily revenue report, you need a consistent snapshot — seeing some orders but not others because they committed during your query would produce incorrect totals. Postgres implements this efficiently with MVCC snapshots.
  • Read Uncommitted: I almost never use this in practice. The one exception might be approximate analytics dashboards where you want maximum throughput and can tolerate seeing in-flight data. But even then, Read Committed is only marginally slower and avoids the confusion of dirty reads.
The architectural pattern I use: Define a TransactionContext that each service method declares. The payment service always runs at Serializable for the actual charge, but the receipt generation that follows runs at Read Committed. The inventory service uses the atomic decrement pattern (no explicit isolation level needed because the atomicity is in the SQL statement itself). The reporting service uses Repeatable Read with long-running read-only transactions.Follow-up: Your Serializable transactions on the inventory table are causing lock contention and timeouts during flash sales. How do you fix this without dropping the isolation level?Three approaches, from simplest to most complex. First, use the atomic decrement pattern I mentioned — it avoids explicit locking entirely because the WHERE clause acts as a guard. Second, if you need Serializable for more complex invariants, shard the inventory by product_id so contention is per-product, not global. A flash sale for one product does not block purchases of other products. Third, for the actual flash sale scenario (1000 people buying 50 items), move the hot inventory count to Redis with DECR, let Redis handle the contention (single-threaded, no lock overhead), and reconcile back to Postgres asynchronously. You accept a brief window where Postgres is behind Redis, but Redis is the authoritative source for “is there stock left” during the sale.
Strong Answer:Let me break this into the key cost and feasibility dimensions.Data transfer costs (the surprise line item):
  • Each write generates a WAL (Write-Ahead Log) record. At 20K writes/sec with an average WAL record of ~200 bytes, that is 4 MB/sec = ~345 GB/day of replication traffic.
  • AWS cross-region data transfer: 0.02/GB.So345GB/day= 0.02/GB. So 345 GB/day = ~7/day = ~$210/month just for ongoing replication.
  • The initial seed: 50TB transferred cross-region at 0.02/GB=0.02/GB = 1,000 one-time cost. But the bigger issue is time — at 5 Gbps sustained transfer, 50TB takes roughly 22 hours. During that time, the replica is not serving reads.
Replication lag:
  • Async replication across regions (us-east-1 to eu-west-1, ~80ms RTT): expect 80-200ms replication lag under normal load. This is fine for read replicas serving non-critical reads.
  • Synchronous replication: every write now takes at least 80ms additional latency. At 20K writes/sec, this means each write occupies a connection for 80ms longer. Your connection pool needs to be roughly 20K * 0.08 = 1,600 additional connections just to maintain throughput.
Read scaling in the remote region:
  • 100K reads/sec. If you route 50% to the EU replica, that is 50K reads/sec on a single Postgres instance, which is feasible with connection pooling and proper indexing but tight. You likely need 2-3 read replicas in EU.
My recommendation:
  • Async replication for the read replicas (accept 100-200ms lag).
  • Do NOT make writes synchronous cross-region — the latency cost is too high for 20K writes/sec.
  • If you need cross-region write availability (disaster recovery), use a warm standby that can be promoted in minutes, not a synchronous replica.
  • Total monthly cost estimate: 210(datatransfer)+ 210 (data transfer) + ~3,000-5,000 (2-3 db.r6g.4xlarge RDS instances in EU) + ~500(additionalnetworkinfrastructure)=roughly500 (additional network infrastructure) = roughly 4,000-5,500/month.
Follow-up: The business says they need RPO of zero — no data loss if us-east-1 goes down completely. Does that change your answer?RPO of zero means synchronous replication, which means every write pays the 80ms cross-region penalty. At 20K writes/sec, this is brutal. I would push back on the requirement with data: “RPO of zero adds 80ms to every write, which drops our write throughput from 20K/sec to roughly 12K/sec (connection pool becomes the bottleneck), and increases p99 write latency from 50ms to 150ms. With async replication, our RPO is typically under 200ms — meaning in a catastrophic us-east-1 failure, we lose at most 200ms of writes. Is that acceptable?” If they insist on zero RPO, I would look at CockroachDB or Aurora Global Database, which are architecturally designed for this at the cost of higher per-write latency.
Interview Strategy for Advanced Topics: When an interviewer asks about CQRS, event sourcing, or distributed consensus, the strongest move is to first explain when you would NOT use it. “Event sourcing is powerful for audit trails and temporal queries, but for a simple CRUD service with no compliance requirements, it adds unjustified complexity. I would reach for it when…” This shows you understand the tool and its boundaries, which is exactly the judgment call staff-level engineers are evaluated on.
Scalability Quick Reference for Advanced Patterns:
  • CQRS: becomes valuable at roughly 10:1 or higher read-to-write ratio, or when read patterns diverge significantly from write patterns. Below that ratio, the synchronization overhead between read and write models is not worth the benefit.
  • Event Sourcing: the event store grows linearly with write volume. At 10K writes/second with 1KB average event size, you generate roughly 850GB/day of event data. Snapshots every 100 events reduce replay time from O(n) to O(n/100) for aggregate reconstruction.
  • Distributed Consensus (Raft): practical for up to 5-7 nodes. Beyond that, the leader must replicate every write to a majority, and AppendEntries RPCs become the bottleneck. For larger clusters, use multi-Raft (CockroachDB/TiKV style) where different data ranges have independent Raft groups.
  • Vector Clocks: the vector grows with the number of nodes that have ever written. For systems with thousands of writers, consider hybrid logical clocks (HLC) or bounded vector clocks (Dynamo-style dotted version vectors) to cap metadata size.