
# Track 3: Replication Strategies

How to copy data across nodes while maintaining consistency guarantees.
Track Duration: 38-46 hours
Modules: 5
Key Topics: Single-leader, Multi-leader, Leaderless, Conflict Resolution, CRDTs

## Module 11: Single-Leader Replication

The most common replication strategy.
### Single-Leader Replication Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│                    SINGLE-LEADER REPLICATION                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                        ┌─────────────┐                                      │
│            Writes ────►│   LEADER    │                                      │
│                        │  (Primary)  │                                      │
│                        └──────┬──────┘                                      │
│                               │                                             │
│                    ┌──────────┼──────────┐                                  │
│                    │          │          │                                  │
│                    ▼          ▼          ▼                                  │
│              ┌─────────┐┌─────────┐┌─────────┐                              │
│   Reads ◄───►│Follower ││Follower ││Follower │◄───► Reads                   │
│              │   1     ││   2     ││   3     │                              │
│              └─────────┘└─────────┘└─────────┘                              │
│                                                                              │
│  FLOW:                                                                      │
│  ─────                                                                      │
│  1. Client writes to leader only                                            │
│  2. Leader persists write                                                   │
│  3. Leader sends write to all followers                                     │
│  4. Followers apply write                                                   │
│  5. Reads can go to any replica                                             │
│                                                                              │
│  DATABASES: PostgreSQL, MySQL, MongoDB, Redis                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

### Synchronous vs Asynchronous Replication

**Synchronous replication**: the leader waits for follower acknowledgment before confirming the write.
Client     Leader      Follower
   │         │            │
   │──Write──►            │
   │         │──Replicate─►
   │         │◄───ACK─────│
   │◄──OK────│            │
   │         │            │

Time: Longer (network round trip + follower persist)
Pros:
  • Follower guaranteed to be up-to-date
  • No data loss on leader failure
Cons:
  • Higher latency
  • Leader blocked if a follower is slow or dead

**Asynchronous replication**: the leader confirms the write immediately and streams it to followers in the background.

Time: Shorter (no wait on followers)
Pros:
  • Low write latency
  • Leader keeps accepting writes even when followers lag or fail
Cons:
  • Followers may serve stale reads (replication lag)
  • Writes not yet replicated are lost if the leader fails

### Replication Lag

┌─────────────────────────────────────────────────────────────────────────────┐
│                        REPLICATION LAG                                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Leader:   [1] [2] [3] [4] [5] [6] [7] [8]                                  │
│  Follower: [1] [2] [3] [4] [5]                   ← 3 entries behind         │
│                                                                              │
│  PROBLEMS THIS CAUSES:                                                      │
│  ─────────────────────                                                      │
│                                                                              │
│  1. READ-YOUR-WRITES VIOLATION                                              │
│     User writes, immediately reads from follower → doesn't see write!       │
│                                                                              │
│  2. MONOTONIC READS VIOLATION                                               │
│     User reads from F1, then F2 (further behind) → data "goes backward"    │
│                                                                              │
│  3. CONSISTENT PREFIX VIOLATION                                             │
│     Writes appear out of causal order (an answer seen before its question)  │
└─────────────────────────────────────────────────────────────────────────────┘

### Understanding Replication Lag

Replication lag is the time delay between when a write occurs on the leader and when it is applied on a follower; a follower may sit consistently behind the leader (say, by 150ms).

Why Lag Happens:
  • Network latency between leader and followers
  • Follower processing slower than leader writes
  • Follower temporarily offline or restarting
  • High write throughput overwhelming followers
Measuring Lag:
-- PostgreSQL: Check replication lag
SELECT 
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS pending_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
Acceptable Lag: Depends on your use case
  • Financial systems: Milliseconds
  • Analytics: Minutes or hours acceptable
  • Caching: Seconds to minutes
Solutions to Lag Problems:
  • Sticky sessions: Route user’s reads to same follower
  • Read from leader: For critical reads right after a write (see the sketch below)
  • Version vectors: Track causality explicitly
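
A minimal sketch of the read-from-leader-after-a-recent-write approach, assuming an application-level router that remembers each user's last write time. `Router`, `Node`, and `LAG_WINDOW` are illustrative names, not part of any specific framework:

```python
import random
import time

LAG_WINDOW = 5.0   # assumed upper bound on replication lag, in seconds

class Node:
    """Stand-in for a database connection."""
    def __init__(self, name):
        self.name = name

    def execute(self, query):
        return f"{self.name}: {query}"

class Router:
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.last_write_at = {}            # user_id -> timestamp of last write

    def write(self, user_id, query):
        self.last_write_at[user_id] = time.time()
        return self.leader.execute(query)

    def read(self, user_id, query):
        # User wrote recently: read from the leader to guarantee read-your-writes
        if time.time() - self.last_write_at.get(user_id, 0) < LAG_WINDOW:
            return self.leader.execute(query)
        # Otherwise any follower will do (result may be slightly stale)
        return random.choice(self.followers).execute(query)

router = Router(Node("leader"), [Node("follower-1"), Node("follower-2")])
router.write("user-42", "UPDATE profile SET bio = '...'")
print(router.read("user-42", "SELECT bio FROM profile"))   # routed to the leader
```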

### Handling Failover

**Step 1: Detect leader failure**
  • Heartbeat timeout (typically 10-30 seconds)
  • Multiple checks to avoid false positives

**Step 2: Elect a new leader**
  • Most up-to-date follower preferred
  • May use consensus (Raft) or a controller node

**Step 3: Reconfigure the system**
  • Update clients to write to the new leader
  • Other followers switch to the new leader
  • DNS/load balancer update

**Gotchas**:
  • SPLIT-BRAIN: Both old and new leader think they're the leader
  • DATA LOSS: Async replication means the new leader may be behind
  • CONFLICTS: The old leader comes back with conflicting writes

**GitHub 2012 incident**: A MySQL failover caused data loss; auto-incrementing IDs were reused → foreign key violations.

---

## Module 12: Multi-Leader Replication

When a single leader isn't enough.

<Frame>
  <img src="/images/courses/distributed-systems-multi-leader.svg" alt="Multi-Leader Replication Architecture" />
</Frame>

**Multi-leader replication** (one leader per datacenter):
  • Datacenter A runs Leader A; Datacenter B runs Leader B
  • Each leader replicates synchronously to its local followers
  • The two leaders replicate to each other asynchronously
  • Clients always write to their local leader

**Use cases**:
  1. Multi-datacenter operation (write locally, replicate globally)
  2. Offline-capable clients (phone writes locally, syncs later)
  3. Collaborative editing (each user has local leader)

### Conflict Detection and Resolution

<Tabs>
  <Tab title="Conflict Detection">
**Conflict scenario**:

| Time | Leader A        | Leader B        | Notes              |
|------|-----------------|-----------------|--------------------|
| t1   | x = 1           | x = 1           | Initial state      |
| t2   | x = 2           | x = 3           | Concurrent writes! |
| t3   | replicates to B | replicates to A |                    |
| t4   | x = 2 AND 3?    | x = 3 AND 2?    | CONFLICT!          |

**When to detect**:
  • Synchronous: Detect immediately (if possible)
  • Asynchronous: Detect during replication (too late to reject)

Most multi-leader systems detect asynchronously → conflicts must be resolved after the fact.
</Tab>

<Tab title="Resolution Strategies">
  1. **Last-write-wins (LWW)**: Use timestamps; the highest timestamp wins.
     Problem: Data loss (the other write is silently discarded)
     Problem: Clock skew can cause the "wrong" winner
     Used by: Cassandra, DynamoDB (default)
  2. **Merge values**: Combine the conflicting values.
     x = 2 + x = 3 → x = [2, 3]
     Problem: Not always semantically meaningful
  3. **Custom resolution**: Application-specific logic.
     Example: For a shopping cart, take the union of items
     Example: For a counter, add the deltas
  4. **Prompt the user**: Show conflicts to the user and let them choose.
     Used by: Git, some CMS systems
</Tab>

<Tab title="Conflict Avoidance">
**Best strategy**: Avoid conflicts in the first place.

**Techniques**:
  1. **Route by user**: A user's writes always go to the same leader → no concurrent writes to the same data
  2. **Route by data**: Partition the data; each partition has exactly one leader
  3. **Global locking**: Acquire a lock before writing (though this defeats the purpose of multi-leader)
  4. **CRDT data types**: Data structures designed to merge without conflicts (covered in Module 15)
</Tab>
</Tabs>

### Conflict Scenarios Visualized

![Multi-Leader Conflicts](/images/courses/distributed-systems-multi-leader-conflict.svg)

**Real-World Example**: Google Docs

Google Docs uses multi-leader replication (operational transformation, similar to CRDTs):
- Each user's browser is a "leader"
- Concurrent edits to same document
- Conflicts resolved automatically through OT algorithm
- Users see each other's changes in real-time

**When Multi-Leader Makes Sense**:
1. **Geographic Distribution**: Users in different continents
2. **Offline Operation**: Mobile apps that sync later
3. **Collaborative Editing**: Multiple users editing simultaneously
4. **High Availability**: No single point of failure

---

## Module 13: Leaderless Replication

The Dynamo-style approach.

<Frame>
<img src="/images/courses/distributed-systems-leaderless.svg" alt="Leaderless Replication Architecture" />
</Frame>

**No leader**: the client (or a coordinator) writes to multiple nodes directly. A write of x = 5 is sent to Node 1, Node 2, and Node 3 in parallel.

**Quorums**:
  • N = total number of replicas
  • W = nodes that must acknowledge a write
  • R = nodes that must respond to a read

**Rule**: W + R > N (ensures overlap: at least one node in the read set has the latest write)

**Common configs**:
  • N=3, W=2, R=2: Balanced read/write, tolerates 1 failure
  • N=3, W=3, R=1: Fast reads, writes need all nodes
  • N=3, W=1, R=3: Fast writes, reads slower but consistent

**Databases**: Cassandra, Riak, DynamoDB, Voldemort

### Quorum Mathematics

**Why W + R > N works** (N = 5 nodes, W = 3, R = 3):

  • Write touches nodes [1] [2] [3]
  • Read touches nodes [1] [4] [5] ← some overlap!
  • Node [1] participated in both, so it has the latest write

Overlap = W + R - N = 3 + 3 - 5 = 1, so at least one node in the read set has the latest value.

**Edge cases**:
  • W + R = N: Zero overlap is possible (risky)
  • W + R < N: May read stale data

**Sloppy quorums and hinted handoff**

Scenario: some home nodes are unreachable. The client wants to write x = 5; the home nodes for x are [A, B, C], but node C is down.

  • Strict quorum: Wait for C or fail the write (reduced availability)
  • Sloppy quorum: Write to D instead of C (D is not a home node). You still get W acknowledgments, just not from the "right" nodes.

Write: [A ✓] [B ✓] [C ✗] [D ✓ hint]

**Hinted handoff**: When C comes back online, D says "I have a hint for you" and sends x = 5 to C, then deletes the hint once C acknowledges.

WARNING: With sloppy quorums, W + R > N doesn't guarantee reading the latest value! Reads might miss the hint node.

### Quorum Intersection Explained

![Quorum Visualization](/images/courses/distributed-systems-quorum.svg)

**Why This Works**: The Pigeonhole Principle

If you have N boxes and you put W + R items in them, and W + R > N, then at least one box must contain more than one item. That box represents a node that participated in both the write and the read, guaranteeing you see the latest value.

**Tuning Quorums for Your Workload**:

| Use Case | N | W | R | Rationale |
|----------|---|---|---|-----------|
| Read-heavy | 3 | 2 | 1 | Fast reads, writes need majority |
| Write-heavy | 3 | 1 | 3 | Fast writes, reads check all |
| Balanced | 3 | 2 | 2 | Equal read/write performance |
| High availability | 5 | 3 | 3 | Tolerates 2 failures |

**Latency Impact**:
- Write latency = slowest of W nodes
- Read latency = slowest of R nodes
- Lower W or R = better latency, weaker consistency

### Anti-Entropy: Read Repair and Merkle Trees

Anti-entropy mechanisms detect and repair inconsistencies between replicas.

**1. Read repair**: During reads, detect stale replicas and update them.

Client reads x from [A, B, C]:
  • A: x = 5 (version 3)
  • B: x = 5 (version 3)
  • C: x = 3 (version 2) ← stale!

Action: Send x = 5 to C (background repair).

**2. Merkle trees**: Efficiently compare data between nodes. Each leaf is the hash of a chunk of data, each internal node is the hash of its children, and the root hash summarizes the entire dataset.

Comparison: Node1 and Node2 exchange root hashes. If they match, all data matches (done!). If they differ, compare children and recursively narrow down to the mismatching chunks.

Efficiency: O(log n) comparisons to find one mismatch.
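
A small sketch of this comparison, under the assumption that both replicas chunk the same sorted key space; `MerkleNode`, `build_tree`, and `diff` are illustrative names rather than a library API:

```python
import hashlib

class MerkleNode:
    def __init__(self, digest, left=None, right=None, keys=None):
        self.digest = digest      # hash of the children (or of a data chunk)
        self.left = left
        self.right = right
        self.keys = keys or []    # keys covered by a leaf chunk

def build_tree(items, chunk_size=2):
    """items: sorted list of (key, value) pairs."""
    # Leaf level: hash each chunk of key/value pairs
    level = []
    for i in range(0, len(items), chunk_size):
        chunk = items[i:i + chunk_size]
        digest = hashlib.sha256(repr(chunk).encode()).hexdigest()
        level.append(MerkleNode(digest, keys=[k for k, _ in chunk]))
    # Pair children into parents until a single root remains
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            digest = hashlib.sha256("".join(n.digest for n in pair).encode()).hexdigest()
            parents.append(MerkleNode(digest, left=pair[0],
                                      right=pair[1] if len(pair) > 1 else None))
        level = parents
    return level[0]

def diff(a, b):
    """Recursively find leaf key ranges whose hashes differ."""
    if a.digest == b.digest:
        return []                          # identical subtrees: stop descending
    if a.left is None and b.left is None:
        return [a.keys]                    # mismatching leaf chunk
    out = []
    for ca, cb in ((a.left, b.left), (a.right, b.right)):
        if ca and cb:
            out.extend(diff(ca, cb))
    return out

# Replica 2 holds a stale value for key "k3"
replica1 = sorted({"k1": 1, "k2": 2, "k3": 5, "k4": 4}.items())
replica2 = sorted({"k1": 1, "k2": 2, "k3": 3, "k4": 4}.items())
print(diff(build_tree(replica1), build_tree(replica2)))   # [['k3', 'k4']] → chunk to re-sync
```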

---

## Module 14: Conflict Resolution

When conflicts happen, how do you resolve them?

### Version Vectors

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                      VERSION VECTORS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Like vector clocks, but for replicas instead of processes                  │
│                                                                              │
│  FORMAT: {replica_id: version_number, ...}                                  │
│                                                                              │
│  EXAMPLE:                                                                   │
│  ────────                                                                   │
│  Object x starts at version {A:0, B:0, C:0}                                 │
│                                                                              │
│  Write at A: x = 5, version = {A:1, B:0, C:0}                               │
│  Write at B: x = 7, version = {A:0, B:1, C:0}                               │
│                                                                              │
│  Neither dominates the other → CONFLICT                                     │
│                                                                              │
│  COMPARISON:                                                                │
│  ───────────                                                                │
│  {A:1, B:0} dominates {A:0, B:0}  (A:1 > A:0, B:0 = B:0)                   │
│  {A:1, B:0} || {A:0, B:1}          (concurrent - neither dominates)        │
│                                                                              │
│  SIBLING RESOLUTION:                                                        │
│  ───────────────────                                                        │
│  1. Present both values to application                                      │
│  2. Application merges and writes back                                      │
│  3. New version dominates both siblings                                     │
│                                                                              │
│  x = merge(5, 7), version = {A:1, B:1, C:0}                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
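
A minimal sketch of the dominance check described above; the function names are illustrative:

```python
def dominates(v1, v2):
    """True if v1 >= v2 for every replica and strictly greater for at least one."""
    replicas = set(v1) | set(v2)
    ge_all = all(v1.get(r, 0) >= v2.get(r, 0) for r in replicas)
    gt_any = any(v1.get(r, 0) > v2.get(r, 0) for r in replicas)
    return ge_all and gt_any

def compare(v1, v2):
    if v1 == v2:
        return "equal"
    if dominates(v1, v2):
        return "v1 dominates"     # v1 is the newer version
    if dominates(v2, v1):
        return "v2 dominates"
    return "concurrent"           # siblings: the application must merge

print(compare({"A": 1, "B": 0}, {"A": 0, "B": 0}))   # v1 dominates
print(compare({"A": 1, "B": 0}, {"A": 0, "B": 1}))   # concurrent
```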

### Last-Write-Wins (LWW) Deep Dive

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LAST-WRITE-WINS                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SIMPLE RULE: Attach timestamp, highest timestamp wins                      │
│                                                                              │
│  Write 1: x = 5, timestamp = 1000                                           │
│  Write 2: x = 7, timestamp = 1001                                           │
│                                                                              │
│  Winner: x = 7                                                              │
│                                                                              │
│  PROBLEMS:                                                                  │
│  ─────────                                                                  │
│                                                                              │
│  1. CLOCK SKEW                                                              │
│     Node A: 10:00:00                                                        │
│     Node B: 10:00:01 (1 second ahead)                                       │
│                                                                              │
│     Even if A's write was "actually" first,                                 │
│     B's write wins because higher timestamp                                 │
│                                                                              │
│  2. DATA LOSS                                                               │
│     User 1: Set name = "Alice"                                              │
│     User 2: Set email = "alice@example.com"                                       │
│                                                                              │
│     If applied to entire object, one change is lost!                        │
│     Better: Apply LWW per-field                                             │
│                                                                              │
│  3. DELETE + UPDATE RACE                                                    │
│     User 1: Delete item                                                     │
│     User 2: Update item                                                     │
│                                                                              │
│     Which wins? Often delete loses → item reappears                         │
│     Use tombstones with timestamps                                          │
│                                                                              │
│  WHEN LWW IS OK:                                                            │
│  ───────────────                                                            │
│  • Immutable data (UUIDs as keys)                                           │
│  • Caching (stale data acceptable)                                          │
│  • Metrics (approximate is fine)                                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
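
The 'apply LWW per-field' idea above can be sketched roughly as follows; the document shape (field → (value, timestamp)) is an assumption made for illustration:

```python
def merge_lww_per_field(doc_a, doc_b):
    """Each doc maps field -> (value, timestamp); the highest timestamp wins per field."""
    merged = {}
    for field in set(doc_a) | set(doc_b):
        a = doc_a.get(field, (None, -1))
        b = doc_b.get(field, (None, -1))
        merged[field] = a if a[1] >= b[1] else b
    return merged

user_a = {"name": ("Alice", 1000)}                  # User 1 set the name
user_b = {"email": ("alice@example.com", 1001)}     # User 2 set the email
print(merge_lww_per_field(user_a, user_b))
# {'name': ('Alice', 1000), 'email': ('alice@example.com', 1001)} → neither change is lost
```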

### Application-Level Resolution

# Example: Shopping cart conflict resolution

def resolve_cart_conflict(cart1, cart2):
    """
    Resolution strategy: UNION of items
    
    cart1 = {A: 2, B: 1}  (2 of A, 1 of B)
    cart2 = {A: 1, C: 3}  (1 of A, 3 of C)
    
    Result: {A: 2, B: 1, C: 3}
    - Take max quantity for each item
    - Include items from both carts
    """
    resolved = {}
    all_items = set(cart1.keys()) | set(cart2.keys())
    
    for item in all_items:
        qty1 = cart1.get(item, 0)
        qty2 = cart2.get(item, 0)
        resolved[item] = max(qty1, qty2)
    
    return resolved


# Example: Counter conflict resolution

def resolve_counter_conflict(counter1, counter2, base):
    """
    Resolution strategy: Add deltas
    
    base = 100
    counter1 = 105 (added 5)
    counter2 = 103 (added 3)
    
    Result: 108 (100 + 5 + 3)
    """
    delta1 = counter1 - base
    delta2 = counter2 - base
    return base + delta1 + delta2

## Module 15: CRDTs

Conflict-free Replicated Data Types - data structures that automatically merge.
Advanced Topic: CRDTs come up in Staff+ level interviews, especially at companies building collaborative tools.
┌─────────────────────────────────────────────────────────────────────────────┐
│                         CRDTs                                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CRDT = Data structure designed to be replicated across nodes               │
│         and always converge to same state without coordination              │
│                                                                              │
│  KEY PROPERTY: Merge operation is:                                          │
│  • Commutative: merge(A, B) = merge(B, A)                                   │
│  • Associative: merge(merge(A, B), C) = merge(A, merge(B, C))               │
│  • Idempotent: merge(A, A) = A                                              │
│                                                                              │
│  TWO TYPES:                                                                 │
│  ──────────                                                                 │
│  1. State-based (CvRDT): Merge entire states                                │
│  2. Operation-based (CmRDT): Replicate and apply operations                 │
│                                                                              │
│  TRADEOFF:                                                                  │
│  ─────────                                                                  │
│  State-based: Larger messages, simpler delivery                             │
│  Op-based: Smaller messages, require exactly-once delivery                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

### Counter CRDTs

G-Counter: Only increments

STATE: {node_id: count, ...}

EXAMPLE:
────────
Node A increments: {A: 5, B: 0, C: 0}
Node B increments: {A: 0, B: 3, C: 0}
Node C increments: {A: 0, B: 0, C: 7}

MERGE: Element-wise max
{A: 5, B: 3, C: 7}

VALUE: Sum of all counts = 15


Implementation:
───────────────
from collections import defaultdict

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = defaultdict(int)   # per-node increment counts

    def increment(self, n=1):
        # A node only ever increments its own slot
        self.counts[self.node_id] += n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: commutative, associative, and idempotent
        for node_id, count in other.counts.items():
            self.counts[node_id] = max(
                self.counts[node_id], count
            )

### CRDT Counter Visualization

**Why CRDTs Are Powerful**:

Traditional approach:
Node A: counter = 5
Node B: counter = 3
Merge: ??? (conflict!)
CRDT approach:
Node A: {A: 5, B: 0, C: 0}
Node B: {A: 0, B: 3, C: 0}
Merge: {A: 5, B: 3, C: 0} → value = 8 (no conflict!)
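
Tying those numbers to the `GCounter` class defined earlier (illustrative usage, not library code):

```python
a = GCounter("A"); a.increment(5)   # Node A: {A: 5}
b = GCounter("B"); b.increment(3)   # Node B: {B: 3}

a.merge(b)                          # element-wise max → {A: 5, B: 3}
print(a.value())                    # 8, with no conflict to resolve
```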
Real-World CRDT Usage:
  • Redis: CRDT-based geo-replicated databases
  • Riak: Uses CRDTs for distributed counters and sets
  • Figma: Uses CRDTs for collaborative design
  • Apple Notes: CRDTs for offline-first sync
Tradeoff: CRDTs require more memory (store per-node state) but eliminate coordination overhead.

### Set CRDTs

G-Set: Only adds, never removes

STATE: Set of elements

MERGE: Union of sets

Node A: {apple, banana}
Node B: {banana, cherry}

Merge: {apple, banana, cherry}


class GSet:
    def __init__(self):
        self.elements = set()
    
    def add(self, element):
        self.elements.add(element)
    
    def contains(self, element):
        return element in self.elements
    
    def merge(self, other):
        self.elements |= other.elements
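
A G-Set can never remove an element. The OR-Set (Observed-Remove Set), mentioned below for Riak, Phoenix presence tracking, and shopping carts, supports removal by tagging every add with a unique id: a remove only deletes the tags it has observed, so a concurrent re-add survives the merge. A hedged sketch, not a production implementation:

```python
import uuid

class ORSet:
    def __init__(self):
        self.adds = {}      # element -> set of unique add-tags
        self.removes = {}   # element -> set of tags observed as removed

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Remove only the add-tags this replica has already seen
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        for element, tags in other.adds.items():
            self.adds.setdefault(element, set()).update(tags)
        for element, tags in other.removes.items():
            self.removes.setdefault(element, set()).update(tags)

# A concurrent re-add wins over an older remove
r1, r2 = ORSet(), ORSet()
r1.add("apple"); r2.merge(r1)
r2.remove("apple")          # r2 removes the tag it observed
r1.add("apple")             # r1 concurrently re-adds with a fresh tag
r1.merge(r2); r2.merge(r1)
print(r1.contains("apple"), r2.contains("apple"))   # True True
```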

### Register CRDTs

┌─────────────────────────────────────────────────────────────────────────────┐
│                     REGISTER CRDTs                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  LWW-REGISTER (Last-Write-Wins)                                             │
│  ──────────────────────────────                                             │
│  STATE: (value, timestamp)                                                  │
│  MERGE: Keep higher timestamp                                               │
│                                                                              │
│  A: ("Alice", 100)                                                          │
│  B: ("Bob", 101)                                                            │
│  Merge: ("Bob", 101)                                                        │
│                                                                              │
│  MV-REGISTER (Multi-Value)                                                  │
│  ─────────────────────────                                                  │
│  STATE: Set of (value, version-vector)                                      │
│  MERGE: Keep all concurrent values                                          │
│                                                                              │
│  A: ("Alice", {A:1})                                                        │
│  B: ("Bob", {B:1})                                                          │
│  Merge: {("Alice", {A:1}), ("Bob", {B:1})}  ← Both kept as siblings        │
│                                                                              │
│  Application must resolve by writing new value:                             │
│  Write: ("Alice or Bob", {A:1, B:1})                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
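
A minimal LWW-Register sketch matching the box above; breaking timestamp ties with a node id is an assumption added here, not something the box specifies:

```python
class LWWRegister:
    def __init__(self, value=None, timestamp=0, node_id=""):
        self.value, self.timestamp, self.node_id = value, timestamp, node_id

    def set(self, value, timestamp, node_id):
        # Keep the write with the higher (timestamp, node_id) pair
        if (timestamp, node_id) > (self.timestamp, self.node_id):
            self.value, self.timestamp, self.node_id = value, timestamp, node_id

    def merge(self, other):
        self.set(other.value, other.timestamp, other.node_id)

a = LWWRegister("Alice", 100, "A")
b = LWWRegister("Bob", 101, "B")
a.merge(b)
print(a.value)   # Bob (higher timestamp wins)
```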

### CRDT Usage in Practice

PRODUCTION CRDT SYSTEMS:
────────────────────────

┌──────────────────┬───────────────────────────────────────────────┐
│ System           │ CRDT Usage                                    │
├──────────────────┼───────────────────────────────────────────────┤
│ Riak             │ Counters, Sets, Maps, Registers               │
│ Redis Enterprise │ Counters, Sets for Active-Active              │
│ Phoenix (Elixir) │ Presence tracking with ORSets                 │
│ Apple (Notes)    │ CRDT-based sync                               │
│ Figma            │ Custom CRDTs for collaborative design         │
│ SoundCloud       │ Roshi (LWW-Set for activity feeds)            │
│ League of Legends│ Player state sync                             │
└──────────────────┴───────────────────────────────────────────────┘

LIMITATIONS:
────────────
• Memory overhead (tombstones, version vectors)
• Not suitable for all data types
• Complex to implement correctly
• May need garbage collection for tombstones

## Key Interview Questions

**Q: When would you choose leaderless vs. leader-based replication?**

Choose leaderless when:
  • High availability is critical (any node can accept writes)
  • Write latency matters (write to nearest nodes)
  • Can tolerate eventual consistency
  • Multi-datacenter with no clear primary
Choose leader-based when:
  • Need strong consistency
  • Transactions required
  • Simpler conflict resolution
  • Read-heavy workload (scale reads with followers)
Examples:
  • Leaderless: Session data, user activity, metrics
  • Leader: Financial transactions, inventory, user accounts
**Q: How would you design a shopping cart that syncs across devices?**

Requirements analysis:
  • Cart shared across devices
  • Offline support
  • Reasonable merge behavior
Solution:
Cart Item = (product_id, quantity, timestamps)

Add item: 
- If not exists: add with current timestamp
- If exists: max(qty1, qty2), latest timestamp

Remove item:
- Use tombstone with timestamp
- Remove wins if remove_timestamp > add_timestamp

Merge strategy:
1. Union all product_ids
2. For each product:
   - If removed in both → removed
   - If removed in one with later timestamp → removed
   - Otherwise → max(quantities)

Alternative: Use OR-Set CRDT for cart items
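
A hedged sketch of the merge strategy just described (tombstoned removes, max quantities); the per-item record shape is an assumption made for illustration:

```python
def merge_carts(cart_a, cart_b):
    """Each cart: {product_id: {"qty": int, "ts": int, "removed": bool}}."""
    merged = {}
    for pid in set(cart_a) | set(cart_b):
        a, b = cart_a.get(pid), cart_b.get(pid)
        if a is None or b is None:
            merged[pid] = dict(a or b)          # item only exists on one device
            continue
        removed = (a["removed"] and b["removed"]) or \
                  (a["removed"] and a["ts"] > b["ts"]) or \
                  (b["removed"] and b["ts"] > a["ts"])
        merged[pid] = {
            "qty": max(a["qty"], b["qty"]),     # otherwise → max(quantities)
            "ts": max(a["ts"], b["ts"]),
            "removed": removed,
        }
    return merged

phone  = {"A": {"qty": 2, "ts": 100, "removed": False}}
laptop = {"A": {"qty": 1, "ts": 120, "removed": True},    # removed later on the laptop
          "B": {"qty": 1, "ts": 110, "removed": False}}
print(merge_carts(phone, laptop))
# "A" stays removed (remove at t=120 is later than the add at t=100); "B" is kept
```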
**Q: How do you handle replication lag when a user reads their own writes?**

Common scenario: User updates profile, immediately views it, sees old data.

Solutions:
  1. Read-your-writes consistency
    • Track writes with client-side timestamp
    • Read from replica that’s caught up
    • Or read from leader for recently-written data
  2. Monotonic reads
    • Stick user to same replica (session affinity)
    • Or track last-seen replica position
  3. Synchronous replication for critical data
    • Higher latency but guaranteed consistency
  4. Application-level workaround
    • Optimistic UI (assume success)
    • Show cached data with “syncing” indicator
**Q: How would you design a Dynamo-style key-value store?**

Answer:
Architecture:
- Leaderless, all nodes equal
- Consistent hashing for partitioning
- Configurable replication factor (RF)

Write path:
1. Client writes to any coordinator
2. Coordinator forwards to RF replicas
3. Wait for W acknowledgments
4. Return success

Read path:
1. Coordinator contacts R replicas
2. Returns result with highest timestamp
3. Background read repair if inconsistency

Consistency levels:
- ONE: Fast, might be stale
- QUORUM: W + R > RF, usually consistent
- ALL: All replicas, strongest but slowest
- LOCAL_QUORUM: Quorum within datacenter

Anti-entropy:
- Read repair on reads
- Merkle tree comparison
- Hinted handoff when nodes down
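
A rough sketch of the write and read paths above, with in-memory replicas standing in for real nodes; names and structure are illustrative, not Cassandra's actual API:

```python
import time

class Replica:
    def __init__(self):
        self.store = {}                       # key -> (value, timestamp)

    def write(self, key, value, ts):
        current = self.store.get(key)
        if current is None or ts > current[1]:   # last-write-wins within a replica
            self.store[key] = (value, ts)

    def read(self, key):
        return self.store.get(key)

class Coordinator:
    def __init__(self, replicas, w, r):
        self.replicas, self.w, self.r = replicas, w, r

    def put(self, key, value):
        ts = time.time()
        acks = 0
        for replica in self.replicas:         # in reality these go out in parallel
            replica.write(key, value, ts)
            acks += 1
            if acks >= self.w:                # success once W replicas acknowledge
                return True
        return acks >= self.w

    def get(self, key):
        responses = [resp for resp in (rep.read(key) for rep in self.replicas[:self.r]) if resp]
        if not responses:
            return None
        value, ts = max(responses, key=lambda vt: vt[1])   # newest timestamp wins
        for replica in self.replicas:         # simplified background read repair
            replica.write(key, value, ts)
        return value

cluster = Coordinator([Replica(), Replica(), Replica()], w=2, r=2)
cluster.put("x", 5)
print(cluster.get("x"))   # 5
```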

## Next Steps

Continue to Track 4: Distributed Transactions

Learn 2PC, Saga pattern, and distributed locking