
# Track 3: Replication Strategies

How to copy data across nodes while maintaining consistency guarantees.
Track Duration: 38-46 hours
Modules: 5
Key Topics: Single-leader, Multi-leader, Leaderless, Conflict Resolution, CRDTs

## Module 11: Single-Leader Replication

The most common replication strategy.
Single-Leader Replication Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                    SINGLE-LEADER REPLICATION                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                        ┌─────────────┐                                      │
│            Writes ────►│   LEADER    │                                      │
│                        │  (Primary)  │                                      │
│                        └──────┬──────┘                                      │
│                               │                                             │
│                    ┌──────────┼──────────┐                                  │
│                    │          │          │                                  │
│                    ▼          ▼          ▼                                  │
│              ┌─────────┐┌─────────┐┌─────────┐                              │
│   Reads ◄───►│Follower ││Follower ││Follower │◄───► Reads                   │
│              │   1     ││   2     ││   3     │                              │
│              └─────────┘└─────────┘└─────────┘                              │
│                                                                              │
│  FLOW:                                                                      │
│  ─────                                                                      │
│  1. Client writes to leader only                                            │
│  2. Leader persists write                                                   │
│  3. Leader sends write to all followers                                     │
│  4. Followers apply write                                                   │
│  5. Reads can go to any replica                                             │
│                                                                              │
│  DATABASES: PostgreSQL, MySQL, MongoDB, Redis                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

### Synchronous vs Asynchronous Replication

SYNCHRONOUS: the leader waits for follower acknowledgment before confirming the write.

Client     Leader      Follower
   │         │            │
   │──Write──►            │
   │         │──Replicate─►
   │         │◄───ACK─────│
   │◄──OK────│            │
   │         │            │

Time: Longer (network + follower persist)
Pros:
  • Follower guaranteed to be up-to-date
  • No data loss on leader failure
Cons:
  • Higher latency
  • Leader blocked if follower slow/dead

ASYNCHRONOUS: the leader confirms the write immediately and replicates in the background.

Client     Leader      Follower
   │         │            │
   │──Write──►            │
   │◄──OK────│            │
   │         │──Replicate─►
   │         │            │

Time: Shorter (no waiting on followers)
Pros:
  • Low write latency
  • Leader unaffected by slow or failed followers
Cons:
  • Followers lag behind the leader
  • Acknowledged writes can be lost if the leader fails

In practice, many systems use SEMI-SYNCHRONOUS replication: one follower is synchronous, the rest asynchronous.
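PostgreSQL exposes this choice through configuration. A minimal sketch of semi-synchronous replication (PostgreSQL 10+; the standby names are hypothetical `application_name` values):

```ini
# postgresql.conf on the leader (illustrative values)
synchronous_commit = on
# Commit waits for an ACK from any 1 of the two named standbys
# (semi-synchronous: one synchronous ACK, the other replicates asynchronously)
synchronous_standby_names = 'ANY 1 (follower1, follower2)'
```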

### Replication Lag

┌─────────────────────────────────────────────────────────────────────────────┐
│                        REPLICATION LAG                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Leader:   [1] [2] [3] [4] [5] [6] [7] [8]                                  │
│  Follower: [1] [2] [3] [4] [5]                   ← 3 entries behind         │
│                                                                              │
│  PROBLEMS THIS CAUSES:                                                      │
│  ─────────────────────                                                      │
│                                                                              │
│  1. READ-YOUR-WRITES VIOLATION                                              │
│     User writes, immediately reads from follower → doesn't see write!       │
│                                                                              │
│  2. MONOTONIC READS VIOLATION                                               │
│     User reads from F1, then F2 (further behind) → data "goes backward"    │
│                                                                              │
│  3. CONSISTENT PREFIX VIOLATION                                             │
│     Writes applied out of order → an effect visible before its cause        │
└─────────────────────────────────────────────────────────────────────────────┘

### Understanding Replication Lag

Replication lag is the time delay between when a write occurs on the leader and when it’s applied on a follower.

Why Lag Happens:
  • Network latency between leader and followers
  • Follower processing slower than leader writes
  • Follower temporarily offline or restarting
  • High write throughput overwhelming followers
Measuring Lag:
-- PostgreSQL: Check replication lag
SELECT 
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS pending_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
Acceptable Lag: Depends on your use case
  • Financial systems: Milliseconds
  • Analytics: Minutes or hours acceptable
  • Caching: Seconds to minutes
Solutions to Lag Problems:
  • Sticky sessions: Route user’s reads to same follower
  • Read from leader: For critical reads after writes
  • Version vectors: Track causality explicitly

### Implementing Read-Your-Writes Guarantees

Read-your-writes is one of the most common consistency guarantees users implicitly expect. Here’s how to implement it:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    READ-YOUR-WRITES IMPLEMENTATION                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PROBLEM:                                                                   │
│  User writes to leader, immediately reads from follower → stale data!       │
│                                                                              │
│  Client ──Write──► Leader (v5)                                              │
│  Client ──Read───► Follower (still at v3) → "Where's my data?!"            │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════    │
│                                                                              │
│  SOLUTION 1: STICKY SESSIONS (Simple)                                       │
│  ─────────────────────────────────────                                      │
│  Route all requests from same user to same replica                          │
│  • Use user_id hash for routing                                             │
│  • Works if replica is leader or caught-up follower                        │
│  • Breaks on failover                                                       │
│                                                                              │
│  SOLUTION 2: READ-FROM-LEADER (Strong but bottleneck)                       │
│  ────────────────────────────────────────────────────                       │
│  After write, read from leader for X seconds                                │
│  • Simple to implement                                                      │
│  • Increases leader load                                                    │
│  • Time-based heuristic (may still serve stale)                            │
│                                                                              │
│  SOLUTION 3: VERSION TRACKING (Robust)                                      │
│  ─────────────────────────────────────                                      │
│  Track write version, ensure read version ≥ write version                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Production Implementation (Version Tracking):
import time

class StaleReadError(Exception):
    """Raised when a replica cannot catch up within the timeout."""

class ReadYourWritesClient:
    """
    Client-side implementation of the Read-Your-Writes guarantee.
    The same idea underlies causal sessions in MongoDB, CockroachDB,
    and Spanner.
    """
    def __init__(self, leader, replicas):
        self.leader = leader        # handle to the leader node
        self.replicas = replicas    # handles to follower nodes
        # Track the latest write version per key (or keep one global version)
        self.write_version = {}     # key -> version
    
    def write(self, key, value):
        """Write to the leader, record the version it assigns"""
        response = self.leader.write(key, value)
        self.write_version[key] = response.version
        return response
    
    def read(self, key):
        """Read, ensuring we see at least our own latest write"""
        min_version = self.write_version.get(key, 0)
        
        # Option A: send min_version to a replica; it blocks until caught up:
        #   return self.replicas[0].read(key, min_version=min_version)
        
        # Option B: read from any replica, skipping stale ones
        for replica in self.replicas:
            response = replica.read(key)
            if response.version >= min_version:
                return response
        
        # Fallback: the leader always has the latest version
        return self.leader.read(key)

class ReplicaServer:
    """Server-side: wait for replication to catch up before responding"""
    
    def read(self, key, min_version=None):
        if min_version is not None:
            # "Don't respond until you have this version"
            self.wait_for_version(min_version, timeout=5.0)
        return self.storage.get(key)
    
    def wait_for_version(self, target_version, timeout):
        """Block until replication catches up (or time out)"""
        start = time.time()
        while self.current_version < target_version:
            if time.time() - start > timeout:
                raise StaleReadError("Replica too far behind")
            time.sleep(0.01)  # busy wait (use a condition variable in prod)
Comparison of Approaches:

| Approach         | Consistency | Availability | Complexity |
|------------------|-------------|--------------|------------|
| Sticky Sessions  | Weak        | High         | Low        |
| Read-from-Leader | Strong      | Lower        | Low        |
| Version Tracking | Strong      | High         | Medium     |
| Causal Tokens    | Strong      | High         | High       |
Causal Consistency Tokens (Advanced): Systems like MongoDB use “causal consistency tokens” that encode the entire causal history, not just a single version. This ensures you see all causally-related writes, not just your own.
# MongoDB-style causal token
{
    "clusterTime": {"$timestamp": {"t": 1624000000, "i": 1}},
    "operationTime": {"$timestamp": {"t": 1624000000, "i": 1}},
    "afterClusterTime": {"$timestamp": {"t": 1623999999, "i": 5}}
}
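For reference, a hedged sketch of how a driver exposes such tokens, using pymongo’s causally consistent sessions (connection string and collection names are assumptions):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")  # assumed URI
db = client.appdb

# All operations in this session are causally ordered: a read after a
# write is guaranteed to observe that write, even on a secondary.
with client.start_session(causal_consistency=True) as session:
    db.profiles.update_one({"_id": "user-123"},
                           {"$set": {"name": "Alice"}},
                           session=session)
    doc = db.profiles.find_one({"_id": "user-123"}, session=session)  # sees the update
```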

### Handling Failover

STEP 1: DETECT LEADER FAILURE
─────────────────────────────
• Heartbeat timeout (typically 10-30 seconds)
• Multiple checks to avoid false positives

STEP 2: ELECT NEW LEADER
────────────────────────
• Most up-to-date follower preferred
• May use consensus (Raft) or a controller node

STEP 3: RECONFIGURE SYSTEM
──────────────────────────
• Update clients to write to the new leader
• Other followers switch to the new leader
• DNS/load balancer update

GOTCHAS:
────────
• SPLIT-BRAIN: Both old and new leader think they’re the leader
• DATA LOSS: Async replication means the new leader may be behind
• CONFLICTS: The old leader comes back with conflicting writes

GITHUB 2012 INCIDENT:
MySQL failover caused data loss.
Auto-incrementing IDs were reused → foreign key violations.
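A minimal sketch of step 1, assuming a `ping(node)` helper that returns False on timeout; production detectors use quorum-based or phi-accrual checks rather than a single timer:

```python
import time

def detect_leader_failure(ping, failure_threshold=3, interval=5.0):
    """Declare the leader dead only after several consecutive missed
    heartbeats, to avoid a false positive from one dropped packet."""
    misses = 0
    while True:
        if ping("leader"):       # assumed helper: True if the leader responded
            misses = 0
        else:
            misses += 1
            if misses >= failure_threshold:
                return "LEADER_FAILED"   # hand off to election (step 2)
        time.sleep(interval)
```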

---

## Module 12: Multi-Leader Replication

When a single leader isn't enough.

<Frame>
  <img src="/images/courses/distributed-systems-multi-leader.svg" alt="Multi-Leader Replication Architecture" />
</Frame>

MULTI-LEADER REPLICATION
────────────────────────

  Datacenter A                       Datacenter B

  ┌──────────────┐      Async       ┌──────────────┐
  │   Leader A   │◄────────────────►│   Leader B   │
  └──────┬───────┘                  └──────┬───────┘
         │ Sync                            │ Sync
   ┌─────┴─────┐                     ┌─────┴─────┐
   ▼           ▼                     ▼           ▼
┌────────┐ ┌────────┐            ┌────────┐ ┌────────┐
│Follower│ │Follower│            │Follower│ │Follower│
└────────┘ └────────┘            └────────┘ └────────┘

Clients ───► Local Leader        Clients ───► Local Leader

USE CASES:
──────────
  1. Multi-datacenter operation (write locally, replicate globally)
  2. Offline-capable clients (phone writes locally, syncs later)
  3. Collaborative editing (each user has local leader)

### Conflict Detection and Resolution

<Tabs>
  <Tab title="Conflict Detection">
CONFLICT SCENARIO:
──────────────────

Time    Leader A          Leader B
────    ────────          ────────
t1      x = 1             x = 1           (Initial state)
t2      x = 2             x = 3           (Concurrent writes!)
t3      (replicate) ──────────── (replicate)
t4      x = 2 AND 3?      x = 3 AND 2?    CONFLICT!

WHEN TO DETECT:
───────────────
• Synchronous: Detect immediately (if possible)
• Asynchronous: Detect during replication (too late to reject)

Most multi-leader systems detect asynchronously
→ must resolve conflicts after the fact
</Tab>

<Tab title="Resolution Strategies">
  1. LAST-WRITE-WINS (LWW)
     Use timestamps; the highest timestamp wins.
     Problem: Data loss (the other write is silently discarded)
     Problem: Clock skew can cause the “wrong” winner
     Used by: Cassandra, DynamoDB (default)
  2. MERGE VALUES
     Combine conflicting values: x = 2 + x = 3 → x = [2, 3]
     Problem: Not always semantically meaningful
  3. CUSTOM RESOLUTION
     Application-specific logic.
     Example: For a shopping cart, take the union of items
     Example: For a counter, add the deltas
  4. PROMPT USER
     Show conflicts to the user and let them choose.
     Used by: Git, some CMS systems
</Tab>

<Tab title="Conflict Avoidance">
BEST STRATEGY: Avoid conflicts in the first place.

TECHNIQUES:
───────────
  1. ROUTE BY USER
     A user’s writes always go to the same leader
     → no concurrent writes to the same data
  2. ROUTE BY DATA
     Partition the data; each partition has one leader
  3. GLOBAL LOCKING
     Acquire a lock before writing (defeats the purpose of multi-leader, though)
  4. CRDT DATA TYPES
     Data structures designed to merge without conflicts (covered in Module 15)
</Tab>
</Tabs>

### Conflict Scenarios Visualized

![Multi-Leader Conflicts](/images/courses/distributed-systems-multi-leader-conflict.svg)

**Real-World Example**: Google Docs

Google Docs uses a form of multi-leader replication based on operational transformation (OT), a technique closely related to CRDTs:
- Each user's browser is a "leader"
- Concurrent edits to same document
- Conflicts resolved automatically through OT algorithm
- Users see each other's changes in real-time

**When Multi-Leader Makes Sense**:
1. **Geographic Distribution**: Users in different continents
2. **Offline Operation**: Mobile apps that sync later
3. **Collaborative Editing**: Multiple users editing simultaneously
4. **High Availability**: No single point of failure

---

## Module 13: Leaderless Replication

The Dynamo-style approach.

<Frame>
<img src="/images/courses/distributed-systems-leaderless.svg" alt="Leaderless Replication Architecture" />
</Frame>

LEADERLESS REPLICATION
──────────────────────

NO LEADER! The client writes to multiple nodes directly.

                Client
                 ┌───┐
                 └─┬─┘
        Write     │     Write
   ┌──────────────┼──────────────┐
   ▼              ▼              ▼
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node 1  │  │ Node 2  │  │ Node 3  │
│  x = 5  │  │  x = 5  │  │  x = 5  │
└─────────┘  └─────────┘  └─────────┘

QUORUMS:
────────
N = total nodes
W = nodes that must acknowledge a write
R = nodes that must respond to a read

RULE: W + R > N (ensures overlap; some node has the latest value)

COMMON CONFIGS:
───────────────
N=3, W=2, R=2: Balanced read/write, tolerates 1 failure
N=3, W=3, R=1: Fast reads, writes need all nodes
N=3, W=1, R=3: Fast writes, reads slower but consistent

DATABASES: Cassandra, Riak, DynamoDB, Voldemort

### Quorum Mathematics

QUORUM INTERSECTION
───────────────────

WHY W + R > N WORKS:

N = 5 nodes, W = 3, R = 3

Write touches: [1] [2] [3] [ ] [ ]
Read touches:  [1] [ ] [ ] [4] [5]   ← Some overlap!
                ↑
       This node has the latest write!

OVERLAP = W + R - N = 3 + 3 - 5 = 1
At least 1 node in the read set has the latest value.

EDGE CASES:
───────────
W + R = N: Exactly 0 overlap possible (risky)
W + R < N: May read stale data

SLOPPY QUORUMS AND HINTED HANDOFF
─────────────────────────────────

SCENARIO: Some home nodes unreachable

Client wants to write x = 5
Home nodes for x: [A, B, C], but node C is down!

STRICT QUORUM:
──────────────
Wait for C or fail (reduced availability)

SLOPPY QUORUM:
──────────────
Write to D instead of C (D is not a home node).
Still get W nodes, just not the “right” ones.

Write: [A ✓] [B ✓] [C ✗] [D ✓ hint]

HINTED HANDOFF:
───────────────
When C comes back online:
D says “I have a hint for you” → sends x = 5 to C.
D deletes the hint after C acknowledges.

WARNING: With sloppy quorums, W + R > N doesn’t guarantee
reading the latest value! Reads might miss the hint node.

### Quorum Intersection Explained

![Quorum Visualization](/images/courses/distributed-systems-quorum.svg)

**Why This Works**: The Pigeonhole Principle

If you have N boxes and you put W + R items in them, and W + R > N, then at least one box must contain more than one item. That box represents a node that participated in both the write and the read, guaranteeing you see the latest value.

**Tuning Quorums for Your Workload**:

| Use Case | N | W | R | Rationale |
|----------|---|---|---|-----------|
| Read-heavy | 3 | 2 | 1 | Fast reads, writes need majority |
| Write-heavy | 3 | 1 | 3 | Fast writes, reads check all |
| Balanced | 3 | 2 | 2 | Equal read/write performance |
| High availability | 5 | 3 | 3 | Tolerates 2 failures |

**Latency Impact**:
- Write latency = slowest of W nodes
- Read latency = slowest of R nodes
- Lower W or R = better latency, weaker consistency
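A tiny sketch of the quorum rule and its latency implication; the timings are illustrative, not measured:

```python
def quorum_overlaps(n, w, r):
    """W + R > N guarantees the read set intersects the write set."""
    return w + r > n

# Tuning examples from the table above
assert quorum_overlaps(3, 2, 1)      # read-heavy: fast reads
assert quorum_overlaps(3, 1, 3)      # write-heavy: fast writes
assert not quorum_overlaps(3, 1, 1)  # fast, but may read stale data

def write_latency(replica_latencies_ms, w):
    """A write completes once the W fastest replicas have ACKed,
    i.e. latency = the W-th smallest replica latency."""
    return sorted(replica_latencies_ms)[w - 1]

print(write_latency([5, 40, 120], w=2))  # → 40: waits for the 2nd-fastest node
```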

### Anti-Entropy: Read Repair and Merkle Trees

ANTI-ENTROPY
────────────

Mechanisms to detect and repair inconsistencies.

1. READ REPAIR
──────────────
During reads, detect stale replicas and update them.

Client reads x from [A, B, C]:
  A: x = 5 (version 3)
  B: x = 5 (version 3)
  C: x = 3 (version 2)   ← Stale!

Action: Send x = 5 to C (background repair)

2. MERKLE TREES
───────────────
Efficiently compare data between nodes.

              ROOT (hash of children)
             /                \
      HASH(A,B)            HASH(C,D)
      /      \             /      \
  HASH(A)  HASH(B)     HASH(C)  HASH(D)
     │        │           │        │
  [Data A] [Data B]    [Data C] [Data D]

COMPARISON:
───────────
Node1 and Node2 exchange root hashes.
If same → all data matches (done!)
If different → compare children; recursively find mismatches.

EFFICIENCY: O(log n) comparisons to find one mismatch
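A minimal sketch of building and diffing such a tree; real systems hash per-partition key ranges, but the recursion is the same (assumes a power-of-two leaf count):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Bottom-up Merkle tree. Assumes len(leaves) is a power of two."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels  # levels[-1][0] is the root hash

def diff(levels_a, levels_b, level=None, idx=0):
    """Yield leaf indexes where the trees disagree, descending only
    into subtrees whose hashes differ → O(log n) per mismatch."""
    if level is None:
        level = len(levels_a) - 1          # start at the root
    if levels_a[level][idx] == levels_b[level][idx]:
        return                             # whole subtree matches
    if level == 0:
        yield idx                          # mismatching data block
        return
    yield from diff(levels_a, levels_b, level - 1, 2 * idx)
    yield from diff(levels_a, levels_b, level - 1, 2 * idx + 1)

node1 = build_tree([b"a", b"b", b"c", b"d"])
node2 = build_tree([b"a", b"b", b"X", b"d"])
print(list(diff(node1, node2)))  # → [2]: only block 2 needs repair
```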

---

## Module 14: Conflict Resolution

When conflicts happen, how do you resolve them?

### Version Vectors

┌─────────────────────────────────────────────────────────────────────────────┐
│                      VERSION VECTORS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Like vector clocks, but for replicas instead of processes                  │
│                                                                              │
│  FORMAT: {replica_id: version_number, ...}                                  │
│                                                                              │
│  EXAMPLE:                                                                   │
│  ────────                                                                   │
│  Object x starts at version {A:0, B:0, C:0}                                 │
│                                                                              │
│  Write at A: x = 5, version = {A:1, B:0, C:0}                               │
│  Write at B: x = 7, version = {A:0, B:1, C:0}                               │
│                                                                              │
│  Neither dominates the other → CONFLICT                                     │
│                                                                              │
│  COMPARISON:                                                                │
│  ───────────                                                                │
│  {A:1, B:0} dominates {A:0, B:0}  (A:1 > A:0, B:0 = B:0)                   │
│  {A:1, B:0} || {A:0, B:1}          (concurrent - neither dominates)        │
│                                                                              │
│  SIBLING RESOLUTION:                                                        │
│  ───────────────────                                                        │
│  1. Present both values to application                                      │
│  2. Application merges and writes back                                      │
│  3. New version dominates both siblings                                     │
│                                                                              │
│  x = merge(5, 7), version = {A:1, B:1, C:0}                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
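A small sketch of the dominance check described above; `dominates`, `concurrent`, and `merge_versions` are hypothetical helper names:

```python
def dominates(a: dict, b: dict) -> bool:
    """a dominates b if every counter in a is >= the one in b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) >= b.get(k, 0) for k in keys)

def concurrent(a: dict, b: dict) -> bool:
    """Neither dominates → the writes are concurrent siblings."""
    return not dominates(a, b) and not dominates(b, a)

assert dominates({"A": 1, "B": 0}, {"A": 0, "B": 0})
assert concurrent({"A": 1, "B": 0}, {"A": 0, "B": 1})  # conflict!

def merge_versions(a: dict, b: dict) -> dict:
    """After the application merges sibling values, the new version
    takes the element-wise max so it dominates both siblings."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

assert merge_versions({"A": 1, "B": 0}, {"A": 0, "B": 1}) == {"A": 1, "B": 1}
```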

### Dotted Version Vectors (DVV)

Traditional version vectors can grow linearly with the number of nodes (O(N) overhead) and can be imprecise for concurrent updates to the same node. Dotted Version Vectors (DVV), used in Riak, solve this by separating the “dot” (a specific update) from the “causal context.” Key benefit: DVVs accurately distinguish between “I have seen this update” and “this update is a sibling of that one,” leading to significantly fewer false conflicts during churn.
Format: [ (NodeID, Counter), VersionVector ]
- (NodeID, Counter) is the "Dot" representing the specific event.
- VersionVector is the "Context" representing what this node had seen.

### Last-Write-Wins (LWW) Deep Dive

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LAST-WRITE-WINS                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SIMPLE RULE: Attach timestamp, highest timestamp wins                      │
│                                                                              │
│  Write 1: x = 5, timestamp = 1000                                           │
│  Write 2: x = 7, timestamp = 1001                                           │
│                                                                              │
│  Winner: x = 7                                                              │
│                                                                              │
│  PROBLEMS:                                                                  │
│  ─────────                                                                  │
│                                                                              │
│  1. CLOCK SKEW                                                              │
│     Node A: 10:00:00                                                        │
│     Node B: 10:00:01 (1 second ahead)                                       │
│                                                                              │
│     Even if A's write was "actually" first,                                 │
│     B's write wins because higher timestamp                                 │
│                                                                              │
│  2. DATA LOSS                                                               │
│     User 1: Set name = "Alice"                                              │
│     User 2: Set email = "alice@x.com"                                       │
│                                                                              │
│     If applied to entire object, one change is lost!                        │
│     Better: Apply LWW per-field                                             │
│                                                                              │
│  3. DELETE + UPDATE RACE                                                    │
│     User 1: Delete item                                                     │
│     User 2: Update item                                                     │
│                                                                              │
│     Which wins? Often delete loses → item reappears                         │
│     Use tombstones with timestamps                                          │
│                                                                              │
│  WHEN LWW IS OK:                                                            │
│  ───────────────                                                            │
│  • Immutable data (UUIDs as keys)                                           │
│  • Caching (stale data acceptable)                                          │
│  • Metrics (approximate is fine)                                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

### Application-Level Resolution

# Example: Shopping cart conflict resolution

def resolve_cart_conflict(cart1, cart2):
    """
    Resolution strategy: UNION of items
    
    cart1 = {A: 2, B: 1}  (2 of A, 1 of B)
    cart2 = {A: 1, C: 3}  (1 of A, 3 of C)
    
    Result: {A: 2, B: 1, C: 3}
    - Take max quantity for each item
    - Include items from both carts
    """
    resolved = {}
    all_items = set(cart1.keys()) | set(cart2.keys())
    
    for item in all_items:
        qty1 = cart1.get(item, 0)
        qty2 = cart2.get(item, 0)
        resolved[item] = max(qty1, qty2)
    
    return resolved


# Example: Counter conflict resolution

def resolve_counter_conflict(counter1, counter2, base):
    """
    Resolution strategy: Add deltas
    
    base = 100
    counter1 = 105 (added 5)
    counter2 = 103 (added 3)
    
    Result: 108 (100 + 5 + 3)
    """
    delta1 = counter1 - base
    delta2 = counter2 - base
    return base + delta1 + delta2

## Module 15: CRDTs

Conflict-free Replicated Data Types: data structures that automatically merge.

Advanced Topic: CRDTs come up in Staff+ level interviews, especially at companies building collaborative tools.
┌─────────────────────────────────────────────────────────────────────────────┐
│                         CRDTs                                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CRDT = Data structure designed to be replicated across nodes               │
│         and always converge to same state without coordination              │
│                                                                              │
│  KEY PROPERTY: Merge operation is:                                          │
│  • Commutative: merge(A, B) = merge(B, A)                                   │
│  • Associative: merge(merge(A, B), C) = merge(A, merge(B, C))               │
│  • Idempotent: merge(A, A) = A                                              │
│                                                                              │
│  TWO TYPES:                                                                 │
│  ──────────                                                                 │
│  1. State-based (CvRDT): Merge entire states                                │
│  2. Operation-based (CmRDT): Replicate and apply operations                 │
│                                                                              │
│  TRADEOFF:                                                                  │
│  ─────────                                                                  │
│  State-based: Larger messages, simpler delivery                             │
│  Op-based: Smaller messages, require exactly-once delivery                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

### Counter CRDTs

G-Counter: Only increments

STATE: {node_id: count, ...}

EXAMPLE:
────────
Node A increments: {A: 5, B: 0, C: 0}
Node B increments: {A: 0, B: 3, C: 0}
Node C increments: {A: 0, B: 0, C: 7}

MERGE: Element-wise max
{A: 5, B: 3, C: 7}

VALUE: Sum of all counts = 15


Implementation:
───────────────
from collections import defaultdict

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = defaultdict(int)
    
    def increment(self, n=1):
        # Each node only ever increments its own slot
        self.counts[self.node_id] += n
    
    def value(self):
        # The counter's value is the sum over all nodes
        return sum(self.counts.values())
    
    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent
        for node_id, count in other.counts.items():
            self.counts[node_id] = max(self.counts[node_id], count)
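A quick usage sketch showing that replicas converge regardless of merge order:

```python
a, b = GCounter("A"), GCounter("B")
a.increment(5)
b.increment(3)

a.merge(b)   # A's view: {A: 5, B: 3}
b.merge(a)   # B's view: {A: 5, B: 3}
assert a.value() == b.value() == 8   # both replicas converge
```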

### CRDT Counter Visualization

Why CRDTs Are Powerful:

Traditional approach:
Node A: counter = 5
Node B: counter = 3
Merge: ??? (conflict!)
CRDT approach:
Node A: {A: 5, B: 0, C: 0}
Node B: {A: 0, B: 3, C: 0}
Merge: {A: 5, B: 3, C: 0} → value = 8 (no conflict!)
Real-World CRDT Usage:
  • Redis Enterprise: CRDT-based Active-Active geo-replication
  • Riak: Uses CRDTs for distributed counters and sets
  • Figma: Uses CRDTs for collaborative design
  • Apple Notes: CRDTs for offline-first sync
Tradeoff: CRDTs require more memory (store per-node state) but eliminate coordination overhead.

### Set CRDTs

G-Set: Only adds, never removes

STATE: Set of elements

MERGE: Union of sets

Node A: {apple, banana}
Node B: {banana, cherry}

Merge: {apple, banana, cherry}


class GSet:
    def __init__(self):
        self.elements = set()
    
    def add(self, element):
        self.elements.add(element)
    
    def contains(self, element):
        return element in self.elements
    
    def merge(self, other):
        self.elements |= other.elements

### Register CRDTs

┌─────────────────────────────────────────────────────────────────────────────┐
│                     REGISTER CRDTs                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  LWW-REGISTER (Last-Write-Wins)                                             │
│  ──────────────────────────────                                             │
│  STATE: (value, timestamp)                                                  │
│  MERGE: Keep higher timestamp                                               │
│                                                                              │
│  A: ("Alice", 100)                                                          │
│  B: ("Bob", 101)                                                            │
│  Merge: ("Bob", 101)                                                        │
│                                                                              │
│  MV-REGISTER (Multi-Value)                                                  │
│  ─────────────────────────                                                  │
│  STATE: Set of (value, version-vector)                                      │
│  MERGE: Keep all concurrent values                                          │
│                                                                              │
│  A: ("Alice", {A:1})                                                        │
│  B: ("Bob", {B:1})                                                          │
│  Merge: {("Alice", {A:1}), ("Bob", {B:1})}  ← Both kept as siblings        │
│                                                                              │
│  Application must resolve by writing new value:                             │
│  Write: ("Alice or Bob", {A:1, B:1})                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
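A minimal LWW-Register sketch; the `node_id` tiebreak is an assumption to make merges deterministic when timestamps collide:

```python
class LWWRegister:
    def __init__(self, value=None, timestamp=0, node_id=""):
        # node_id breaks ties when two writes share a timestamp
        self.value, self.timestamp, self.node_id = value, timestamp, node_id

    def set(self, value, timestamp, node_id):
        self.value, self.timestamp, self.node_id = value, timestamp, node_id

    def merge(self, other):
        # Keep whichever write has the higher (timestamp, node_id)
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
            self.value = other.value
            self.timestamp, self.node_id = other.timestamp, other.node_id

a = LWWRegister("Alice", 100, "A")
b = LWWRegister("Bob", 101, "B")
a.merge(b)
print(a.value)  # → "Bob" (higher timestamp wins)
```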

### JSON CRDTs & Rich Text (Automerge/Yjs)

Modeling a simple counter or set is one thing, but how do you model a JSON Document or a Rich Text Document (like this one)?

1. JSON CRDT Architecture

A JSON CRDT models the document as a Tree of CRDTs.
  • Objects: Maps where keys map to other CRDTs.
  • Arrays: Sequences where each element has a unique ID and a pointer to the previous element (RGA - Replicated Growable Array).
  • Registers: For leaf values (strings, numbers).

2. The Interleaving Problem

In collaborative text editing, if Alice inserts “A” and Bob inserts “B” at the same position, a naive CRDT might result in “BA” for Alice and “AB” for Bob. Modern CRDTs like Yjs and Automerge use Causal Ordering and Unique IDs to ensure that all replicas agree on the order (e.g., “AB” for everyone).

### Delta-CRDTs: Optimizing Bandwidth

The biggest drawback of State-based CRDTs (CvRDTs) is that the state grows over time. If you have a counter with 1,000 nodes, you must send all 1,000 counts every time you sync. Delta-CRDTs solve this by only sending the Delta (the changes) since the last successful synchronization.
  • Dot Store: A way to track which specific updates (dots) a neighbor has already seen.
  • Delta-Group: A collection of updates that are merged together before being sent.
  • Efficiency: Reduces bandwidth from O(N) to O(Δ), making CRDTs viable for mobile devices on slow networks.
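A hedged sketch of the delta idea applied to the G-Counter from earlier: each replica accumulates a small delta and ships only that, while the merge logic stays the same:

```python
from collections import defaultdict

class DeltaGCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = defaultdict(int)   # full state
        self.delta = defaultdict(int)    # entries changed since last sync

    def increment(self, n=1):
        self.counts[self.node_id] += n
        self.delta[self.node_id] = self.counts[self.node_id]

    def take_delta(self):
        """Ship only the delta-group, then reset it."""
        d, self.delta = dict(self.delta), defaultdict(int)
        return d

    def merge(self, delta):
        # Same element-wise max as a full-state merge; deltas are just
        # small states, so idempotence and commutativity still hold.
        for node_id, count in delta.items():
            self.counts[node_id] = max(self.counts[node_id], count)
```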

### Garbage Collection & Tombstones

When you remove an element from a CRDT (like a 2P-Set or OR-Set), you must keep a Tombstone to prove to other nodes that the item was deleted. If you delete the tombstone too early, a node that hasn’t seen the deletion might “re-infect” the cluster with the old data.

Strategies for Cleaning Tombstones:

  1. Stable State GC: Only delete a tombstone once you have proof (via gossip) that every node in the cluster has seen the deletion.
  2. Time-based GC: Delete tombstones after a very long period (e.g., 30 days). If a node stays offline longer than that, it must be completely wiped and re-synced from scratch.
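As an illustration of why tombstones matter, here is a minimal OR-Set sketch; the removed-tag set plays the role of the tombstone store discussed above:

```python
import uuid

class ORSet:
    def __init__(self):
        self.added = {}       # element -> set of unique add-tags
        self.removed = set()  # tombstones: add-tags observed as removed

    def add(self, element):
        # Every add gets a fresh unique tag
        self.added.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Tombstone every add-tag we have *seen*; a concurrent add
        # elsewhere carries a fresh tag and survives the merge (add wins)
        self.removed |= self.added.get(element, set())

    def contains(self, element):
        return bool(self.added.get(element, set()) - self.removed)

    def merge(self, other):
        for elem, tags in other.added.items():
            self.added.setdefault(elem, set()).update(tags)
        self.removed |= other.removed
```

Deleting entries from `removed` too early is exactly the “re-infection” risk described above: a node that never saw the removal would merge the old add-tags back in.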

### CRDT Usage in Practice

PRODUCTION CRDT SYSTEMS:
────────────────────────

┌──────────────────┬───────────────────────────────────────────────┐
│ System           │ CRDT Usage                                    │
├──────────────────┼───────────────────────────────────────────────┤
│ Riak             │ Counters, Sets, Maps, Registers               │
│ Redis Enterprise │ Counters, Sets for Active-Active              │
│ Phoenix (Elixir) │ Presence tracking with ORSets                 │
│ Apple (Notes)    │ CRDT-based sync                               │
│ Figma            │ Custom CRDTs for collaborative design         │
│ SoundCloud       │ Roshi (LWW-Set for activity feeds)            │
│ League of Legends│ Player state sync                             │
└──────────────────┴───────────────────────────────────────────────┘

LIMITATIONS:
────────────
• Memory overhead (tombstones, version vectors)
• Not suitable for all data types
• Complex to implement correctly
• May need garbage collection for tombstones

## Multi-Paxos / Raft Groups (Sharded Consensus)

A single consensus group (e.g., one Raft cluster) is bottlenecked by the CPU and I/O of a single leader. To scale to millions of requests, you must use Sharded Consensus.

### Architecture: Multi-Group

Instead of one large group, we partition the data into many small shards, each managed by its own independent consensus group.
┌───────────────────────────────────────────────────────────────┐
│                    SHARDED CONSENSUS GROUPS                   │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  SHARD A (Keys 0-99)     SHARD B (Keys 100-199)               │
│  ┌─────────────────┐     ┌─────────────────┐                  │
│  │ Raft Group 1    │     │ Raft Group 2    │                  │
│  │ (Nodes 1, 2, 3) │     │ (Nodes 4, 5, 6) │                  │
│  └─────────────────┘     └─────────────────┘                  │
│                                                               │
│  COORDINATOR: A "Placement Driver" or "Master" tracks which   │
│  nodes belong to which Raft groups.                           │
│                                                               │
└───────────────────────────────────────────────────────────────┘
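A sketch of the routing layer such a placement driver provides; the key ranges and group names mirror the diagram and are illustrative:

```python
import bisect

# Sorted upper bounds of each key range → its Raft group (illustrative)
RANGE_BOUNDS = [100, 200]
GROUPS = ["raft-group-1", "raft-group-2"]

def route(key: int) -> str:
    """Map a key to the consensus group that owns its shard."""
    idx = bisect.bisect_right(RANGE_BOUNDS, key)
    if idx >= len(GROUPS):
        raise KeyError("key outside known ranges; ask the placement driver")
    return GROUPS[idx]

print(route(42))    # → raft-group-1 (keys 0-99)
print(route(150))   # → raft-group-2 (keys 100-199)
```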

### The Challenge: Cross-Shard Transactions

When an operation affects multiple shards (e.g., Transfer($5) from Shard A to Shard B), you need a higher-level protocol like Two-Phase Commit (2PC) or Percolator to coordinate across the individual consensus groups.

Used by: CockroachDB (Ranges), TiDB (Regions), and Google Spanner (Paxos Groups).

## Chain Replication

Chain replication is a widely used technique for high-throughput, linearizable replication. It is used in systems like FAWN and CORFU.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        CHAIN REPLICATION                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  STRUCTURE: Nodes are arranged in a chain: Head → ... → Tail                │
│                                                                              │
│  WRITE PATH:                                                                │
│  1. Client sends write to the HEAD.                                         │
│  2. Head applies write and forwards to next node.                           │
│  3. Propagation continues until it reaches the TAIL.                        │
│  4. Tail sends ACK back up the chain or directly to client.                 │
│                                                                              │
│  READ PATH:                                                                 │
│  1. Clients read ONLY from the TAIL.                                        │
│                                                                              │
│  WHY IT WORKS:                                                              │
│  - Linearizability: A write is only visible (at the Tail) once it has been  │
│    persisted by ALL nodes in the chain.                                     │
│  - Higher Throughput: The load is distributed across the chain.             │
│                                                                              │
│  FAILURE HANDLING:                                                          │
│  - Head fails: Next node becomes the new Head.                              │
│  - Tail fails: Predecessor becomes the new Tail.                            │
│  - Internal node fails: Predecessor and successor link up.                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
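A toy sketch of the chain write path under these rules (in-process “nodes”, no failure handling):

```python
class ChainNode:
    def __init__(self, name, successor=None):
        self.name = name
        self.successor = successor
        self.store = {}

    def write(self, key, value):
        self.store[key] = value        # persist locally first
        if self.successor:             # then forward down the chain
            return self.successor.write(key, value)
        return "ACK"                   # tail: the write is durable on ALL nodes

    def read(self, key):
        return self.store.get(key)     # only ever served by the tail

tail = ChainNode("tail")
mid = ChainNode("mid", successor=tail)
head = ChainNode("head", successor=mid)

assert head.write("x", 5) == "ACK"
assert tail.read("x") == 5  # visible at the tail ⇒ persisted by every node
```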

## Advanced Design Scenarios

### Scenario 1: Global E‑Commerce Database Replication

You run a global e‑commerce platform with users in US, EU, and APAC. You want strong guarantees for orders and payments, but can tolerate slightly stale reads for product browsing.

Design:
  • Topology:
    • Single write leader region (e.g., us-east-1) with synchronous followers in the same region.
    • Read replicas per region (EU, APAC) receiving asynchronous replication from the leader.
  • Write path:
    • All order and payment writes go to the regional leader in us-east-1.
    • Use semi-synchronous replication inside the leader region: commit after 1 follower ACK to avoid single-copy-of-truth.
  • Read path:
    • Critical reads (order status, payment status): hit the leader region or only replicas with replication lag < T ms.
    • Non-critical reads (product catalog, reviews): go to nearest regional replica even if seconds behind.
US-East (Primary Region)
  - primary-db (leader, sync to 1 follower)
  - follower-db-1 (sync)
  - follower-db-2 (async)

EU-West (Read Region)
  - repl-db-eu (async follower)

AP-South (Read Region)
  - repl-db-ap (async follower)
Consistency & failure behavior:
  • RPO/RTO:
    • With semi-sync in primary region, losing the leader does not lose committed transactions; failover to the in-region sync follower.
    • Cross-region replication remains async → if the entire region fails, you may lose up to (Δ) seconds of writes that haven’t replicated.
  • Patterns used:
    • From this module: single-leader replication, sync vs async, replication lag awareness, and failover.
    • Combined with consistency models: guarantee read-your-writes by routing post-write reads to the leader or using a min_lsn/min_commit_time threshold on replicas.
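A sketch of the lag-aware routing rule from the read path above; the replica attributes (`distance_ms`, `replication_lag_ms`) are assumed monitoring hooks:

```python
MAX_LAG_MS = 200  # illustrative threshold T

def pick_read_target(leader, replicas, critical: bool):
    """Critical reads go to the leader or a nearly-caught-up replica;
    everything else goes to the nearest replica regardless of lag."""
    if not critical:
        return min(replicas, key=lambda r: r.distance_ms)   # nearest region
    fresh = [r for r in replicas if r.replication_lag_ms < MAX_LAG_MS]
    return min(fresh, key=lambda r: r.distance_ms) if fresh else leader
```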

### Scenario 2: Active‑Active User Profile Store

You want low-latency profile updates and reads from any region, and you’re willing to accept eventual consistency for some fields but not for others.

Requirements:
  • Profile updates can happen in any region (multi-leader / active‑active).
  • Some fields (e.g., email, phone) must avoid conflicting values.
  • Other fields (e.g., last_seen_at, recently_viewed) can be merged.
Design:
  • Multi-leader replication between regions with a log for each leader.
  • Per-field conflict resolution strategy:
    • Identity fields (email, phone, name): Last-Write-Wins per field with server-side timestamps (avoid whole-record LWW).
    • Activity fields (recently_viewed, devices, tags): treat as CRDT sets or counters and merge.
Profile document:
{
  id: "user-123",
  email: { value: "a@x.com", ts: 1700000000 },
  phone: { value: "+1-555-...", ts: 1700000100 },
  devices: G-Set of device IDs,
  tags:   OR-Set of tags
}

On update:
- For scalar fields: compare timestamps.
- For sets: CRDT merge (union / OR‑Set semantics).
Failure / partition behavior:
  • During a partition, each region continues accepting writes.
  • On healing:
    • For scalars, keep value with newer timestamp (assuming bounded skew, or use hybrid logical clocks).
    • For sets/counters, apply CRDT merge (no conflicts).
This scenario composes:
  • Multi-leader replication + conflict detection (version vectors or timestamps).
  • Conflict resolution strategies and CRDTs for different fields in the same entity.
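A sketch of the per-field policy above: per-field LWW for identity scalars, set union for CRDT fields. The layout follows the sample document; all values are illustrative:

```python
def merge_profiles(local: dict, remote: dict) -> dict:
    merged = {"id": local["id"]}
    # Scalar identity fields: per-field last-write-wins on timestamps
    for field in ("email", "phone"):
        a, b = local[field], remote[field]
        merged[field] = a if a["ts"] >= b["ts"] else b
    # Activity fields: CRDT-style merge (G-Set union here)
    merged["devices"] = local["devices"] | remote["devices"]
    merged["tags"] = local["tags"] | remote["tags"]
    return merged

local  = {"id": "user-123",
          "email": {"value": "a@x.com", "ts": 1700000000},
          "phone": {"value": "+1-555-0100", "ts": 1700000100},
          "devices": {"d1"}, "tags": {"beta"}}
remote = {"id": "user-123",
          "email": {"value": "b@x.com", "ts": 1700000200},
          "phone": {"value": "+1-555-0100", "ts": 1700000100},
          "devices": {"d2"}, "tags": {"vip"}}

print(merge_profiles(local, remote))
# email → b@x.com (newer ts); devices → {d1, d2}; tags → {beta, vip}
```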

### Scenario 3: Leaderless Metrics Store with Tunable Consistency

You’re designing a high-throughput metrics and logs store similar to Dynamo/Cassandra. Availability and write throughput matter more than strict consistency.

Design:
  • Leaderless replication with parameters (N, W, R):
    • (N = 3) replicas per key.
    • For writes: (W = 2) for production metrics, (W = 1) for debug logs.
    • For reads: (R = 1) for dashboards where staleness is acceptable, (R = 2) for alerting queries.
  • Anti-entropy:
    • Enable read repair on reads (update stale replicas in background).
    • Periodically run Merkle tree comparisons between nodes to reconcile drift.
Write metrics(point):
  - Hash(key) → choose N replicas.
  - Try to write to all N.
  - If at least W ACKs within timeout → success.
  - If fewer than W → client sees write failure (but some replicas may still store point).

Read metrics(query):
  - Contact R replicas.
  - Return value with latest timestamp.
  - Schedule background repairs for stale replicas.
Partition & failure behavior:
  • Partial node failures: As long as at least (W) nodes are reachable for writes and (R) for reads, the system remains available with overlapping quorums.
  • Sloppy quorums + hinted handoff:
    • When a home replica is down, temporarily write to a substitute node (hinted handoff) to keep (W) high.
    • On recovery, hints are replayed to repair the downed node.
  • Trade-offs:
    • For alerting: configure (W + R > N) and avoid sloppy quorums to favor correctness.
    • For debug logs: allow (W = 1, R = 1) and sloppy quorums to maximize write availability.
This scenario applies:
  • Quorum mathematics ((W + R > N)), sloppy quorums, hinted handoff, and anti-entropy techniques from this module in a concrete operational design.
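A sketch of that write path with sloppy quorums and hinted handoff; `home_nodes`, `standby_nodes`, and the `try_write` API are assumptions:

```python
def quorum_write(key, value, home_nodes, standby_nodes, w):
    """Try the home replicas first; if too few are up, borrow standbys
    and leave a hint so the data is handed back on recovery."""
    acks = 0
    down = []
    for node in home_nodes:
        if node.try_write(key, value):       # assumed: False on timeout
            acks += 1
        else:
            down.append(node)
    # Sloppy quorum: substitute nodes hold the write plus a hint
    for substitute, missed in zip(standby_nodes, down):
        if acks >= w:
            break
        if substitute.try_write(key, value, hint_for=missed):
            acks += 1                        # replayed to `missed` on recovery
    return acks >= w
```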

## Key Interview Questions

**Q: When would you choose leaderless replication over leader-based?**

Choose leaderless when:
  • High availability is critical (any node can accept writes)
  • Write latency matters (write to nearest nodes)
  • Can tolerate eventual consistency
  • Multi-datacenter with no clear primary
Choose leader-based when:
  • Need strong consistency
  • Transactions required
  • Simpler conflict resolution
  • Read-heavy workload (scale reads with followers)
Examples:
  • Leaderless: Session data, user activity, metrics
  • Leader: Financial transactions, inventory, user accounts
**Q: Design a shopping cart that syncs across devices.**

Requirements analysis:
  • Cart shared across devices
  • Offline support
  • Reasonable merge behavior
Solution:
Cart Item = (product_id, quantity, timestamps)

Add item: 
- If not exists: add with current timestamp
- If exists: max(qty1, qty2), latest timestamp

Remove item:
- Use tombstone with timestamp
- Remove wins if remove_timestamp > add_timestamp

Merge strategy:
1. Union all product_ids
2. For each product:
   - If removed in both → removed
   - If removed in one with later timestamp → removed
   - Otherwise → max(quantities)

Alternative: Use OR-Set CRDT for cart items
**Q: How do you handle users seeing stale data after their own writes?**

Common scenario: A user updates their profile, immediately views it, and sees old data.

Solutions:
  1. Read-your-writes consistency
    • Track writes with client-side timestamp
    • Read from replica that’s caught up
    • Or read from leader for recently-written data
  2. Monotonic reads
    • Stick user to same replica (session affinity)
    • Or track last-seen replica position
  3. Synchronous replication for critical data
    • Higher latency but guaranteed consistency
  4. Application-level workaround
    • Optimistic UI (assume success)
    • Show cached data with “syncing” indicator
**Q: How would you design a Dynamo-style distributed key-value store?**

Answer:
Architecture:
- Leaderless, all nodes equal
- Consistent hashing for partitioning
- Configurable replication factor (RF)

Write path:
1. Client writes to any coordinator
2. Coordinator forwards to RF replicas
3. Wait for W acknowledgments
4. Return success

Read path:
1. Coordinator contacts R replicas
2. Returns result with highest timestamp
3. Background read repair if inconsistency

Consistency levels:
- ONE: Fast, might be stale
- QUORUM: W + R > RF, usually consistent
- ALL: All replicas, strongest but slowest
- LOCAL_QUORUM: Quorum within datacenter

Anti-entropy:
- Read repair on reads
- Merkle tree comparison
- Hinted handoff when nodes down

## Next Steps

Continue to Track 4: Distributed Transactions

Learn 2PC, Saga pattern, and distributed locking