Replication lag is the time delay between when a write occurs on the leader and when it's applied on a follower. The diagram above shows a 150ms lag where the follower is consistently behind the leader.

Why Lag Happens:

- Network latency between leader and followers
- Follower processing slower than leader writes
- Follower temporarily offline or restarting
- High write throughput overwhelming followers
Measuring Lag:
```sql
-- PostgreSQL: Check replication lag
SELECT
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS pending_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
Acceptable Lag: Depends on your use case

- Financial systems: Milliseconds
- Analytics: Minutes or hours acceptable
- Caching: Seconds to minutes
Solutions to Lag Problems:

- Sticky sessions: Route a user's reads to the same follower

Read-your-writes is one of the most common consistency issues users experience. Here's how to implement it:
```text
READ-YOUR-WRITES IMPLEMENTATION
───────────────────────────────

PROBLEM:
User writes to leader, immediately reads from follower → stale data!

Client ──Write──► Leader (v5)
Client ──Read───► Follower (still at v3) → "Where's my data?!"

═══════════════════════════════════════════════════════════════════════

SOLUTION 1: STICKY SESSIONS (Simple)
────────────────────────────────────
Route all requests from same user to same replica
• Use user_id hash for routing
• Works if replica is leader or caught-up follower
• Breaks on failover

SOLUTION 2: READ-FROM-LEADER (Strong but bottleneck)
────────────────────────────────────────────────────
After write, read from leader for X seconds
• Simple to implement
• Increases leader load
• Time-based heuristic (may still serve stale)

SOLUTION 3: VERSION TRACKING (Robust)
─────────────────────────────────────
Track write version, ensure read version ≥ write version
```
Production Implementation (Version Tracking):
```python
import time


class StaleReadError(Exception):
    """Raised when a replica cannot catch up to the requested version in time."""


class ReadYourWritesClient:
    """
    Client-side implementation of the Read-Your-Writes guarantee.
    Used by: MongoDB, CockroachDB, Spanner
    """
    def __init__(self):
        # Track the latest write version per key (or global)
        self.write_version = {}  # key -> version

    def write(self, key, value):
        """Write to leader, record version"""
        response = self.leader.write(key, value)
        # Store the version returned by the leader
        self.write_version[key] = response.version
        return response

    def read(self, key):
        """Read ensuring we see our own writes (two options shown; pick one)"""
        min_version = self.write_version.get(key, 0)

        # Option A: Send min_version to a replica; it waits until caught up
        response = self.replica.read(
            key,
            min_version=min_version  # "Don't respond until you have this version"
        )

        # Option B: Read from any replica, retry if stale
        for replica in self.replicas:
            response = replica.read(key)
            if response.version >= min_version:
                return response
        # Fallback to leader if no replica is fresh enough
        return self.leader.read(key)


class ReplicaServer:
    """Server-side: Wait for replication before responding"""
    def read(self, key, min_version=None):
        if min_version is not None:
            # Wait until we've replicated up to min_version
            self.wait_for_version(min_version, timeout=5.0)
        return self.storage.get(key)

    def wait_for_version(self, target_version, timeout):
        """Block until replication catches up"""
        start = time.time()
        while self.current_version < target_version:
            if time.time() - start > timeout:
                raise StaleReadError("Replica too far behind")
            time.sleep(0.01)  # Busy wait (use a condition variable in prod)
```
Comparison of Approaches:

| Approach | Consistency | Availability | Complexity |
|----------|-------------|--------------|------------|
| Sticky Sessions | Weak | High | Low |
| Read-from-Leader | Strong | Lower | Low |
| Version Tracking | Strong | High | Medium |
| Causal Tokens | Strong | High | High |
Causal Consistency Tokens (Advanced):
Systems like MongoDB use “causal consistency tokens” that encode the entire causal history, not just a single version. This ensures you see all causally-related writes, not just your own.
Example use case: collaborative editing, where each user has a local leader and must observe edits in causal order.
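Below is a brief, hedged sketch of what causal tokens look like in practice, using PyMongo's causally consistent sessions. The connection string, database, and collection names are placeholders; the point is that the session object carries the causal token so a follow-up read, even on a lagging secondary, reflects the earlier write.

```python
# Sketch only: causal tokens via PyMongo's causally consistent sessions.
# Connection string, database, and collection names are placeholders.
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
profiles = client.appdb.get_collection(
    "profiles", read_preference=ReadPreference.SECONDARY)

with client.start_session(causal_consistency=True) as session:
    # Write goes to the primary; the session records the operation time.
    profiles.update_one({"_id": "alice"}, {"$set": {"bio": "hi"}},
                        upsert=True, session=session)
    # This read is causally after the write above, even on a lagging secondary,
    # because the session's causal token is attached to the query.
    doc = profiles.find_one({"_id": "alice"}, session=session)
```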
### Conflict Detection and Resolution

<Tabs>
  <Tab title="Conflict Detection">
```text
CONFLICT SCENARIO:
──────────────────
Time    Leader A         Leader B
────    ────────         ────────
t1      x = 1            x = 1           (Initial state)
t2      x = 2            x = 3           (Concurrent writes!)
t3      (replicate) ──────────────────── (replicate)
t4      x = 2 AND 3?     x = 3 AND 2?    CONFLICT!

WHEN TO DETECT:
───────────────
• Synchronous: Detect immediately (if possible)
• Asynchronous: Detect during replication (too late to reject)

Most multi-leader systems detect asynchronously
→ must resolve conflicts after the fact
```
  </Tab>
  <Tab title="Resolution Strategies">
```text
LAST-WRITE-WINS (LWW)
─────────────────────
Use timestamps, highest timestamp wins
Problem: Data loss (other write silently discarded)
Problem: Clock skew can cause "wrong" winner
Used by: Cassandra, DynamoDB (default)

MERGE VALUES
────────────
Example: For shopping cart, union of items
Example: For counter, add the deltas

PROMPT USER
───────────
Show conflicts to user, let them choose
Used by: Git, some CMS systems
```
  </Tab>
  <Tab title="Conflict Avoidance">
```text
BEST STRATEGY: Avoid conflicts in the first place

TECHNIQUES:
───────────
ROUTE BY USER
  User's writes always go to same leader
  No concurrent writes to same data

ROUTE BY DATA
  Partition data, each partition has one leader

GLOBAL LOCKING
  Acquire lock before write
  (defeats purpose of multi-leader though)

CRDT DATA TYPES
  Data structures designed to merge without conflicts
  (covered in Module 15)
```
  </Tab>
</Tabs>

### Conflict Scenarios Visualized

**Real-World Example**: Google Docs

Google Docs uses multi-leader replication (operational transformation, similar to CRDTs):

- Each user's browser is a "leader"
- Concurrent edits to same document
- Conflicts resolved automatically through OT algorithm
- Users see each other's changes in real-time

**When Multi-Leader Makes Sense**:

1. **Geographic Distribution**: Users in different continents
2. **Offline Operation**: Mobile apps that sync later
3. **Collaborative Editing**: Multiple users editing simultaneously
4. **High Availability**: No single point of failure

---

## Module 13: Leaderless Replication

The Dynamo-style approach.

<Frame>
  <img src="/images/courses/distributed-systems-leaderless.svg" alt="Leaderless Replication Architecture" />
</Frame>

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ LEADERLESS REPLICATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ NO LEADER! Client writes to multiple nodes directly. │
│ │
│ Client │
│ ┌───┐ │
│ │ │ │
│ └─┬─┘ │
│ Write │ Write │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ x = 5 │ │ x = 5 │ │ x = 5 │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ QUORUMS: │
│ ──────── │
│ N = total nodes │
│ W = nodes that must acknowledge write │
│ R = nodes that must respond to read │
│ │
│ RULE: W + R > N (ensures overlap, some node has latest) │
│ │
│ COMMON CONFIGS: │
│ ─────────────── │
│ N=3, W=2, R=2: Balanced read/write, tolerates 1 failure │
│ N=3, W=3, R=1: Fast reads, writes need all nodes │
│ N=3, W=1, R=3: Fast writes, reads slower but consistent │
│ │
│ DATABASES: Cassandra, Riak, DynamoDB, Voldemort │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Quorum Mathematics

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ QUORUM INTERSECTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ WHY W + R > N WORKS: │
│ │
│ N = 5 nodes, W = 3, R = 3 │
│ │
│ Write touches: [1] [2] [3] ✓ [ ] [ ] │
│ Read touches: [1] ✓ [ ] [4] [5] ← Some overlap! │
│ ↑ │
│ This node has the latest write! │
│ │
│ OVERLAP = W + R - N = 3 + 3 - 5 = 1 │
│ At least 1 node in read set has latest value │
│ │
│ EDGE CASES: │
│ ─────────── │
│ W + R = N: Exactly 0 overlap possible (risky) │
│ W + R < N: May read stale data │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Some home nodes unreachable │
│ │
│ Client wants to write x = 5 │
│ Home nodes for x: [A, B, C] │
│ But node C is down! │
│ │
│ STRICT QUORUM: │
│ ────────────── │
│ Wait for C or fail (reduced availability) │
│ │
│ SLOPPY QUORUM: │
│ ────────────── │
│ Write to D instead of C (D is not home node) │
│ Still get W nodes, just not the “right” ones │
│ │
│ Write: [A ✓] [B ✓] [C ✗] [D ✓ hint] │
│ │
│ HINTED HANDOFF: │
│ ─────────────── │
│ When C comes back online: │
│ D says “I have a hint for you” → sends x = 5 to C │
│ D deletes hint after C acknowledges │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ WARNING: With sloppy quorums, W + R > N doesn’t guarantee │ │
│ │ reading latest value! Reads might miss the hint node. │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
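As a quick sanity check on the rule above, here is a small illustrative Python helper (not any database's API) that reports what a given (N, W, R) configuration buys you:

```python
# Illustrative helper: check quorum configurations against the W + R > N rule.

def quorum_properties(n: int, w: int, r: int) -> dict:
    """Summarize what a given (N, W, R) configuration guarantees."""
    return {
        "read_write_overlap": w + r > n,   # some node in the read set saw the latest write
        "write_fault_tolerance": n - w,    # nodes that can fail while writes still succeed
        "read_fault_tolerance": n - r,     # nodes that can fail while reads still succeed
    }

for n, w, r in [(3, 2, 2), (3, 3, 1), (3, 1, 3), (3, 1, 1)]:
    print((n, w, r), quorum_properties(n, w, r))
# (3, 1, 1) reports read_write_overlap=False: W + R = 2 <= N, so reads may be stale.
```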
### Quorum Intersection Explained

**Why This Works**: The Pigeonhole Principle

If you have N boxes and you put W + R items in them, and W + R > N, then at least one box must contain more than one item. That box represents a node that participated in both the write and the read, guaranteeing you see the latest value.

**Tuning Quorums for Your Workload**:

| Use Case | N | W | R | Rationale |
|----------|---|---|---|-----------|
| Read-heavy | 3 | 2 | 1 | Fast reads, writes need majority |
| Write-heavy | 3 | 1 | 3 | Fast writes, reads check all |
| Balanced | 3 | 2 | 2 | Equal read/write performance |
| High availability | 5 | 3 | 3 | Tolerates 2 failures |

**Latency Impact**:

- Write latency = slowest of W nodes
- Read latency = slowest of R nodes
- Lower W or R = better latency, weaker consistency

### Anti-Entropy: Read Repair and Merkle Trees

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ ANTI-ENTROPY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Mechanisms to detect and repair inconsistencies │
│ │
│ 1. READ REPAIR │
│ ─────────────── │
│ During reads, detect stale replicas and update them │
│ │
│ Client reads x from [A, B, C]: │
│ A: x = 5 (version 3) │
│ B: x = 5 (version 3) │
│ C: x = 3 (version 2) ← Stale! │
│ │
│ Action: Send x = 5 to C (background repair) │
│ │
│ 2. MERKLE TREES │
│ ──────────────── │
│ Efficiently compare data between nodes │
│ │
│ ROOT (hash of children) │
│ / \ │
│ HASH(A,B) HASH(C,D) │
│ / \ / \ │
│ HASH(A) HASH(B) HASH(C) HASH(D) │
│ │ │ │ │ │
│ [Data A] [Data B] [Data C] [Data D] │
│ │
│ COMPARISON: │
│ ─────────── │
│ Node1 and Node2 exchange root hashes │
│ If same → all data matches (done!) │
│ If different → compare children, recursively find mismatches │
│ │
│ EFFICIENCY: O(log n) comparisons to find one mismatch │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
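The Merkle-tree comparison in the box above can be made concrete with a short sketch. This is an illustrative toy (real systems hash key ranges on disk and exchange hashes over the network), but it shows the O(log n) pruning: identical subtrees are skipped, and only mismatching branches are descended.

```python
# Toy Merkle-tree diff: find out-of-sync keys between two replicas.
import hashlib

def h(*parts: str) -> str:
    return hashlib.sha256("".join(parts).encode()).hexdigest()

class MerkleNode:
    def __init__(self, keys, data):
        if len(keys) == 1:                      # leaf: hash one key's data
            self.keys, self.left, self.right = keys, None, None
            self.hash = h(keys[0], data[keys[0]])
        else:                                   # internal: hash of children
            mid = len(keys) // 2
            self.left = MerkleNode(keys[:mid], data)
            self.right = MerkleNode(keys[mid:], data)
            self.keys = keys
            self.hash = h(self.left.hash, self.right.hash)

def diff(a: MerkleNode, b: MerkleNode):
    """Return keys whose values differ, descending only into mismatching subtrees."""
    if a.hash == b.hash:
        return []                               # subtrees identical → prune here
    if a.left is None:
        return list(a.keys)                     # leaf level: this key is out of sync
    return diff(a.left, b.left) + diff(a.right, b.right)

# Two replicas with one divergent key:
node1 = {"a": "1", "b": "2", "c": "3", "d": "4"}
node2 = {"a": "1", "b": "2", "c": "999", "d": "4"}
keys = sorted(node1)
print(diff(MerkleNode(keys, node1), MerkleNode(keys, node2)))  # ['c']
```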
---

## Module 14: Conflict Resolution

When conflicts happen, how do you resolve them?

### Version Vectors

```text
VERSION VECTORS
───────────────
Like vector clocks, but for replicas instead of processes

FORMAT: {replica_id: version_number, ...}

EXAMPLE:
────────
Object x starts at version {A:0, B:0, C:0}

Write at A: x = 5, version = {A:1, B:0, C:0}
Write at B: x = 7, version = {A:0, B:1, C:0}

Neither dominates the other → CONFLICT

COMPARISON:
───────────
{A:1, B:0} dominates {A:0, B:0}   (A:1 > A:0, B:0 = B:0)
{A:1, B:0} || {A:0, B:1}          (concurrent - neither dominates)

SIBLING RESOLUTION:
───────────────────
1. Present both values to application
2. Application merges and writes back
3. New version dominates both siblings

x = merge(5, 7), version = {A:1, B:1, C:0}
```
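A minimal sketch of the comparison rules above, using plain dictionaries as version vectors (names are illustrative):

```python
# Version-vector comparison: dominance vs. concurrency (conflict).

def dominates(a: dict, b: dict) -> bool:
    """True if version vector a has seen everything b has seen."""
    replicas = set(a) | set(b)
    return all(a.get(r, 0) >= b.get(r, 0) for r in replicas)

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a dominates b"
    if dominates(b, a):
        return "b dominates a"
    return "concurrent (conflict → keep both as siblings)"

print(compare({"A": 1, "B": 0}, {"A": 0, "B": 0}))  # a dominates b
print(compare({"A": 1, "B": 0}, {"A": 0, "B": 1}))  # concurrent (conflict → keep both as siblings)
```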
Traditional version vectors can grow linearly with the number of nodes (O(N) overhead) and can be imprecise when concurrent updates are handled by the same node. Dotted Version Vectors (DVVs), used in Riak, solve this by separating the "dot" (a specific update event) from the "causal context."

Key Benefit: DVVs accurately distinguish between "I have seen this update" and "this update is a sibling of that one," leading to significantly fewer false conflicts during churn.
```text
Format: [ (NodeID, Counter), VersionVector ]

- (NodeID, Counter) is the "Dot" representing the specific event.
- VersionVector is the "Context" representing what this node had seen.
```
G-Set: Only adds, never removes

```text
STATE: Set of elements
MERGE: Union of sets

Node A: {apple, banana}
Node B: {banana, cherry}
Merge:  {apple, banana, cherry}
```

```python
class GSet:
    def __init__(self):
        self.elements = set()

    def add(self, element):
        self.elements.add(element)

    def contains(self, element):
        return element in self.elements

    def merge(self, other):
        self.elements |= other.elements
```
2P-Set: Add and remove (but can't re-add after remove)

```text
STATE: Two G-Sets
- A (added): elements added
- R (removed): elements removed (tombstones)

CONTAINS: element in A and not in R
LIMITATION: Once removed, can never add again
```

```python
class TwoPSet:
    def __init__(self):
        self.a = GSet()  # added
        self.r = GSet()  # removed

    def add(self, element):
        self.a.add(element)

    def remove(self, element):
        if element in self.a.elements:
            self.r.add(element)

    def contains(self, element):
        return (element in self.a.elements and
                element not in self.r.elements)

    def merge(self, other):
        self.a.merge(other.a)
        self.r.merge(other.r)
```
OR-Set: Add and remove with re-add support

```text
IDEA: Tag each addition with a unique ID
Remove only removes specific tags, not all

STATE: {element: set of (tag, is_added)}

EXAMPLE:
────────
Add "apple" at A:    {apple: {(uuid1, true)}}
Add "apple" at B:    {apple: {(uuid2, true)}}
Remove "apple" at A: {apple: {(uuid1, false), (uuid2, true)}}

Contains: At least one (tag, true) without matching (tag, false)
"apple" is still in the set because uuid2 wasn't removed!

USED BY:
• Riak
• Redis Enterprise (Active-Active)
• SoundCloud's Roshi
```
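A minimal OR-Set sketch following the description above. This is illustrative only; production implementations such as Riak's track causal context rather than keeping raw tag tombstones forever.

```python
# Sketch of an observed-remove set: removes only affect tags already seen,
# so a concurrent (or later) add with a fresh tag survives.
import uuid

class ORSet:
    def __init__(self):
        self.adds = {}     # element -> set of tags from add operations
        self.removes = {}  # element -> set of tags that have been removed

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Remove only the tags we have observed; unseen concurrent adds survive.
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        for elem, tags in other.adds.items():
            self.adds.setdefault(elem, set()).update(tags)
        for elem, tags in other.removes.items():
            self.removes.setdefault(elem, set()).update(tags)
```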
In collaborative text editing, if Alice inserts “A” and Bob inserts “B” at the same position, a naive CRDT might result in “BA” for Alice and “AB” for Bob. Modern CRDTs like Yjs and Automerge use Causal Ordering and Unique IDs to ensure that all replicas agree on the order (e.g., “AB” for everyone).
The biggest drawback of state-based CRDTs (CvRDTs) is that the state grows over time: if you have a counter with 1,000 nodes, you must send all 1,000 counts every time you sync.

Delta-CRDTs solve this by only sending the delta (the changes) since the last successful synchronization:
- Dot Store: A way to track which specific updates (dots) a neighbor has already seen.
- Delta-Group: A collection of updates that are merged together before being sent.
- Efficiency: Reduces bandwidth from O(N) to O(Δ), making CRDTs viable for mobile devices on slow networks.
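A small illustrative delta-CRDT, here a grow-only counter: state is still merged by pointwise max, but only the entries changed since the last sync are shipped. The class and method names are assumptions for this sketch, not any library's API.

```python
# Delta G-Counter sketch: ship only changed entries instead of the full state.

class DeltaGCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}   # node_id -> count (full state)
        self.delta = {}    # entries changed since the last sync (the "delta-group")

    def increment(self, amount: int = 1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount
        self.delta[self.node_id] = self.counts[self.node_id]

    def take_delta(self) -> dict:
        """Ship only what changed since the last sync instead of the whole state."""
        d, self.delta = self.delta, {}
        return d

    def merge(self, incoming: dict):
        # Works for either a full state or a delta: pointwise max per node.
        for node, count in incoming.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```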
When you remove an element from a CRDT (like a 2P-Set or OR-Set), you must keep a Tombstone to prove to other nodes that the item was deleted. If you delete the tombstone too early, a node that hasn’t seen the deletion might “re-infect” the cluster with the old data.
- Stable State GC: Only delete a tombstone once you have proof (via gossip) that every node in the cluster has seen the deletion.
- Time-based GC: Delete tombstones after a very long period (e.g., 30 days). If a node stays offline longer than that, it must be completely wiped and re-synced from scratch.
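A tiny sketch of the stable-state rule, assuming each node gossips the highest version it has applied (the data shapes here are assumptions for illustration):

```python
# Stable-state tombstone GC check: drop a tombstone only when every node has
# confirmed (via gossip) applying a version at least as new as the deletion.

def can_gc_tombstone(deletion_version: int,
                     acked_versions: dict,
                     cluster_nodes: set) -> bool:
    """acked_versions: node_id -> highest version that node has confirmed applying."""
    return all(acked_versions.get(node, 0) >= deletion_version
               for node in cluster_nodes)
```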
**Production CRDT Systems**:

| System | CRDT Usage |
|--------|------------|
| Riak | Counters, Sets, Maps, Registers |
| Redis Enterprise | Counters, Sets for Active-Active |
| Phoenix (Elixir) | Presence tracking with ORSets |
| Apple (Notes) | CRDT-based sync |
| Figma | Custom CRDTs for collaborative design |
| SoundCloud | Roshi (LWW-Set for activity feeds) |
| League of Legends | Player state sync |

**Limitations**:

- Memory overhead (tombstones, version vectors)
- Not suitable for all data types
- Complex to implement correctly
- May need garbage collection for tombstones
A single consensus group (e.g., one Raft cluster) is bottlenecked by the CPU and I/O of a single leader. To scale to millions of requests, you must use Sharded Consensus.
When an operation affects multiple shards (e.g., Transfer($5) from Shard A to Shard B), you need a higher-level protocol like Two-Phase Commit (2PC) or Percolator to coordinate across the individual consensus groups.

Used by: CockroachDB (Ranges), TiDB (Regions), and Google Spanner (Paxos Groups).
Chain replication is a widely used technique for high-throughput, linearizable replication. It is used in systems like FAWN and CORFU.
```text
CHAIN REPLICATION
─────────────────
STRUCTURE: Nodes are arranged in a chain: Head → ... → Tail

WRITE PATH:
1. Client sends write to the HEAD.
2. Head applies write and forwards to next node.
3. Propagation continues until it reaches the TAIL.
4. Tail sends ACK back up the chain or directly to client.

READ PATH:
1. Clients read ONLY from the TAIL.

WHY IT WORKS:
- Linearizability: A write is only visible (at the Tail) once it has been
  persisted by ALL nodes in the chain.
- Higher Throughput: The load is distributed across the chain.

FAILURE HANDLING:
- Head fails: Next node becomes the new Head.
- Tail fails: Predecessor becomes the new Tail.
- Internal node fails: Predecessor and successor link up.
```
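A toy, in-process sketch of the chain above (no networking or failure handling), just to show why a value visible at the tail must already be on every node:

```python
# Toy chain replication: writes flow Head → Tail, reads come from the Tail only.

class ChainNode:
    def __init__(self):
        self.store = {}
        self.next = None          # successor in the chain (None for the tail)

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)   # forward down the chain
        return "ack"                             # tail acknowledges

    def read(self, key):
        return self.store.get(key)               # only the tail should serve reads

# Build Head → Middle → Tail
head, mid, tail = ChainNode(), ChainNode(), ChainNode()
head.next, mid.next = mid, tail

assert head.write("x", 5) == "ack"
assert tail.read("x") == 5    # visible at the tail ⇒ already stored on every node
```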
Scenario 1: Global E‑Commerce Database Replication
You run a global e‑commerce platform with users in US, EU, and APAC. You want strong guarantees for orders and payments, but can tolerate slightly stale reads for product browsing.

Design:

- Topology:
  - Single write leader region (e.g., us-east-1) with synchronous followers in the same region.
  - Read replicas per region (EU, APAC) receiving asynchronous replication from the leader.
- Write path:
  - All order and payment writes go to the regional leader in us-east-1.
  - Use semi-synchronous replication inside the leader region: commit after 1 follower ACK, so a committed write never exists on only a single node.
- Read path:
  - Critical reads (order status, payment status): hit the leader region or only replicas with replication lag < T ms (see the routing sketch after this list).
  - Non-critical reads (product catalog, reviews): go to the nearest regional replica even if seconds behind.
- Failure behavior:
  - With semi-sync in the primary region, losing the leader does not lose committed transactions; fail over to the in-region sync follower.
  - Cross-region replication remains async → if the entire region fails, you may lose up to Δ seconds of writes that haven't replicated.
- Patterns used:
  - From this module: single-leader replication, sync vs. async, replication lag awareness, and failover.
  - Combined with consistency models: guarantee read-your-writes by routing post-write reads to the leader or using a min_lsn/min_commit_time threshold on replicas.
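A hedged sketch of the lag-aware read routing from the read-path bullet above; helper names like `get_replica_lag_ms` are placeholders, not a real driver API.

```python
# Lag-aware read routing sketch for Scenario 1.

CRITICAL_TABLES = {"orders", "payments"}
MAX_LAG_MS = 200  # the "T ms" threshold from the design above (assumed value)

def choose_endpoint(table, leader, replicas, get_replica_lag_ms):
    """Pick a database endpoint for a read on `table`."""
    if table not in CRITICAL_TABLES:
        # Non-critical reads: any regional replica is fine; pick the least lagged here.
        return min(replicas, key=get_replica_lag_ms)
    # Critical reads: only replicas under the lag threshold, else fall back to the leader.
    fresh = [r for r in replicas if get_replica_lag_ms(r) < MAX_LAG_MS]
    return fresh[0] if fresh else leader
```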
Scenario 2: Multi-Region User Profile Store

You want low-latency profile updates and reads from any region, and you're willing to accept eventual consistency for some fields but not for others.

Requirements:

- Profile updates can happen in any region (multi-leader / active‑active).
- Some fields (e.g., email, phone) must avoid conflicting values.
- Other fields (e.g., last_seen_at, recently_viewed) can be merged.
Design:
- Multi-leader replication between regions with a log for each leader.
- Per-field conflict resolution strategy (see the merge sketch after this list):
  - Identity fields (email, phone, name): Last-Write-Wins per field with server-side timestamps (avoid whole-record LWW).
  - Activity fields (recently_viewed, devices, tags): treat as CRDT sets or counters and merge.
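One way the per-field strategy could look in code; the field names and record layout are assumptions for illustration, not a specific product's schema.

```python
# Per-field merge sketch for Scenario 2: LWW for identity fields, set-union for
# activity fields. LWW fields are stored as (value, server_timestamp) tuples.

LWW_FIELDS = {"email", "phone", "name"}
SET_FIELDS = {"recently_viewed", "devices", "tags"}

def merge_profiles(a: dict, b: dict) -> dict:
    """Merge two replicas' versions of the same profile record."""
    merged = {}
    for field in LWW_FIELDS:
        candidates = [v for v in (a.get(field), b.get(field)) if v is not None]
        # Pick the value with the later server-side timestamp (per field, not per record).
        merged[field] = max(candidates, key=lambda v: v[1], default=None)
    for field in SET_FIELDS:
        merged[field] = set(a.get(field, set())) | set(b.get(field, set()))
    return merged
```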
Scenario 3: Leaderless Metrics Store with Tunable Consistency
You're designing a high-throughput metrics and logs store similar to Dynamo/Cassandra. Availability and write throughput matter more than strict consistency.

Design:
- Leaderless replication with parameters (N, W, R):
  - N = 3 replicas per key.
  - Writes: W = 2 for production metrics, W = 1 for debug logs.
  - Reads: R = 1 for dashboards where staleness is acceptable, R = 2 for alerting queries.
- Anti-entropy:
  - Enable read repair on reads (update stale replicas in the background).
  - Periodically run Merkle tree comparisons between nodes to reconcile drift.
```text
Write metrics(point):
  - Hash(key) → choose N replicas.
  - Try to write to all N.
  - If at least W ACKs within timeout → success.
  - If fewer than W → client sees write failure (but some replicas may still store point).

Read metrics(query):
  - Contact R replicas.
  - Return value with latest timestamp.
  - Schedule background repairs for stale replicas.
```
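The same write path as a rough Python sketch; the replica objects and their `write` method are placeholders, and a real client would fan requests out in parallel rather than sequentially.

```python
# Quorum write sketch: succeed once at least W replicas acknowledge in time.
import time

def quorum_write(replicas, key, value, w, timeout_s=1.0):
    """Write to all replicas; succeed once at least `w` acknowledge before the deadline."""
    acks = 0
    deadline = time.time() + timeout_s
    for replica in replicas:                 # real systems send these in parallel
        if time.time() > deadline:
            break
        try:
            replica.write(key, value, timestamp=time.time())
            acks += 1
        except ConnectionError:
            continue                          # node down: a sloppy quorum could pick a substitute here
    if acks < w:
        raise RuntimeError(f"quorum not reached: {acks}/{w} ACKs")
    return acks
```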
Partition & failure behavior:

- Partial node failures: as long as at least W nodes are reachable for writes and R for reads, the system remains available with overlapping quorums.
- Sloppy quorums + hinted handoff:
  - When a home replica is down, temporarily write to a substitute node (hinted handoff) to keep W achievable.
  - On recovery, hints are replayed to repair the downed node.

Trade-offs:

- For alerting: configure W + R > N and avoid sloppy quorums to favor correctness.
- For debug logs: allow W = 1, R = 1 and sloppy quorums to maximize write availability.

This scenario applies the quorum mathematics (W + R > N), sloppy quorums, hinted handoff, and anti-entropy techniques from this module in a concrete operational design.
Q: When would you choose leaderless over leader-based replication?
Choose leaderless when:

- High availability is critical (any node can accept writes)
- Write latency matters (write to nearest nodes)
- Can tolerate eventual consistency
- Multi-datacenter with no clear primary

Choose leader-based when:

- Need strong consistency
- Transactions required
- Simpler conflict resolution
- Read-heavy workload (scale reads with followers)

Examples:

- Leaderless: Session data, user activity, metrics
- Leader-based: Financial transactions, inventory, user accounts
Q: Design conflict resolution for a shopping cart
Requirements analysis:

- Cart shared across devices
- Offline support
- Reasonable merge behavior
Solution:
```text
Cart Item = (product_id, quantity, timestamps)

Add item:
  - If not exists: add with current timestamp
  - If exists: max(qty1, qty2), latest timestamp

Remove item:
  - Use tombstone with timestamp
  - Remove wins if remove_timestamp > add_timestamp

Merge strategy:
  1. Union all product_ids
  2. For each product:
     - If removed in both → removed
     - If removed in one with later timestamp → removed
     - Otherwise → max(quantities)

Alternative: Use OR-Set CRDT for cart items
```
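One way the merge above could look in code; the item layout with quantity, add timestamp, and remove timestamp is an assumption for illustration.

```python
# Shopping-cart merge sketch: union of products, max quantities, remove-wins
# when the latest remove is newer than the latest add.

def merge_carts(cart_a: dict, cart_b: dict) -> dict:
    """Each cart maps product_id -> {"qty": int, "add_ts": float, "remove_ts": float or None}."""
    merged = {}
    for pid in set(cart_a) | set(cart_b):
        a, b = cart_a.get(pid), cart_b.get(pid)
        if a is None or b is None:
            merged[pid] = dict(a or b)          # item only known on one device
            continue
        removes = [ts for ts in (a["remove_ts"], b["remove_ts"]) if ts is not None]
        merged[pid] = {
            "qty": max(a["qty"], b["qty"]),
            "add_ts": max(a["add_ts"], b["add_ts"]),
            "remove_ts": max(removes, default=None),
        }
    # Drop items whose latest remove is newer than the latest add ("remove wins").
    return {pid: it for pid, it in merged.items()
            if it["remove_ts"] is None or it["remove_ts"] <= it["add_ts"]}
```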
Q: How do you handle replication lag in a user-facing feature?
Common scenario: User updates profile, immediately views it, sees old data.

Solutions:

1. Read-your-writes consistency
   - Track writes with client-side timestamp
   - Read from a replica that's caught up
   - Or read from leader for recently-written data
2. Monotonic reads
   - Stick user to same replica (session affinity)
   - Or track last-seen replica position
3. Synchronous replication for critical data
   - Higher latency but guaranteed consistency
4. Application-level workaround
   - Optimistic UI (assume success)
   - Show cached data with "syncing" indicator
Q: Explain how Cassandra handles replication
Answer:
```text
Architecture:
- Leaderless, all nodes equal
- Consistent hashing for partitioning
- Configurable replication factor (RF)

Write path:
1. Client writes to any coordinator
2. Coordinator forwards to RF replicas
3. Wait for W acknowledgments
4. Return success

Read path:
1. Coordinator contacts R replicas
2. Returns result with highest timestamp
3. Background read repair if inconsistency

Consistency levels:
- ONE: Fast, might be stale
- QUORUM: W + R > RF, usually consistent
- ALL: All replicas, strongest but slowest
- LOCAL_QUORUM: Quorum within datacenter

Anti-entropy:
- Read repair on reads
- Merkle tree comparison
- Hinted handoff when nodes down
```
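For reference, a hedged example of how an application picks those consistency levels with the DataStax Python driver; the contact point, keyspace, and table are placeholders.

```python
# Per-query consistency levels with the DataStax Python driver (placeholder schema).
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")   # assumes a "metrics" keyspace exists

# Fast, possibly stale read for a dashboard:
dashboard_read = SimpleStatement(
    "SELECT value FROM points WHERE series = %s LIMIT 100",
    consistency_level=ConsistencyLevel.ONE)

# Write that must reach a quorum of replicas before acknowledging:
alert_write = SimpleStatement(
    "INSERT INTO points (series, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

session.execute(alert_write, ("cpu.host1", 1700000000, 0.93))
rows = session.execute(dashboard_read, ("cpu.host1",))
```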