Replication lag is the time delay between when a write occurs on the leader and when it’s applied on a follower. The diagram above shows a 150ms lag where the follower is consistently behind the leader.

**Why Lag Happens**:
- Network latency between leader and followers
- Follower processing slower than leader writes
- Follower temporarily offline or restarting
- High write throughput overwhelming followers
**Measuring Lag**:
```sql
-- PostgreSQL: check replication lag per follower
SELECT
  client_addr,
  state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS pending_bytes,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
**Acceptable Lag**: Depends on your use case

- Financial systems: milliseconds
- Analytics: minutes or hours acceptable
- Caching: seconds to minutes
**Solutions to Lag Problems**:

- Sticky sessions: route a user’s reads to the same follower (see the routing sketch below)
- Local leaders: for collaborative editing, give each user a local leader
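A sticky-session router can be as simple as a stable hash from user ID to follower, so a given user always reads from the same replica. A minimal sketch (the follower names are placeholders; a production router would also need to survive follower-list changes, e.g. via consistent hashing):

```python
import hashlib

def pick_follower(user_id: str, followers: list[str]) -> str:
    """Stable hash from user ID to follower: the same user always
    reads from the same replica, giving that user monotonic reads."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return followers[int.from_bytes(digest[:8], "big") % len(followers)]

followers = ["follower-1", "follower-2", "follower-3"]
print(pick_follower("alice", followers))  # same node for alice on every call
```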
### Conflict Detection and Resolution

<Tabs>
<Tab title="Conflict Detection">

```text
CONFLICT SCENARIO:
──────────────────
Time    Leader A        Leader B
────    ────────        ────────
t1      x = 1           x = 1          (Initial state)
t2      x = 2           x = 3          (Concurrent writes!)
t3      (replicate) ──────────────── (replicate)
t4      x = 2 AND 3?    x = 3 AND 2?   CONFLICT!

WHEN TO DETECT:
───────────────
• Synchronous:  Detect immediately (if possible)
• Asynchronous: Detect during replication (too late to reject)

Most multi-leader systems detect asynchronously
→ must resolve conflicts after the fact
```

</Tab>
<Tab title="Resolution Strategies">

```text
LAST-WRITE-WINS (LWW)
─────────────────────
Use timestamps, highest timestamp wins
Problem: Data loss (other write silently discarded)
Problem: Clock skew can cause “wrong” winner
Used by: Cassandra, DynamoDB (default)

MERGE VALUES
────────────
Combine the concurrent writes into one value
Example: For shopping cart, union of items
Example: For counter, add the deltas

PROMPT USER
───────────
Show conflicts to user, let them choose
Used by: Git, some CMS systems
```

</Tab>
<Tab title="Conflict Avoidance">

```text
BEST STRATEGY: Avoid conflicts in the first place

TECHNIQUES:
───────────
ROUTE BY USER
  User’s writes always go to same leader
  No concurrent writes to same data

ROUTE BY DATA
  Partition data, each partition has one leader

GLOBAL LOCKING
  Acquire lock before write
  (defeats purpose of multi-leader though)

CRDT DATA TYPES
  Data structures designed to merge without conflicts
  (covered in Module 15)
```

</Tab>
</Tabs>

### Conflict Scenarios Visualized

**Real-World Example**: Google Docs

Google Docs uses multi-leader replication (operational transformation, similar to CRDTs):

- Each user's browser is a "leader"
- Concurrent edits to same document
- Conflicts resolved automatically through the OT algorithm
- Users see each other's changes in real-time

**When Multi-Leader Makes Sense**:

1. **Geographic Distribution**: Users in different continents
2. **Offline Operation**: Mobile apps that sync later
3. **Collaborative Editing**: Multiple users editing simultaneously
4. **High Availability**: No single point of failure

---

## Module 13: Leaderless Replication

The Dynamo-style approach.

<Frame>
<img src="/images/courses/distributed-systems-leaderless.svg" alt="Leaderless Replication Architecture" />
</Frame>

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ LEADERLESS REPLICATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ NO LEADER! Client writes to multiple nodes directly. │
│ │
│ Client │
│ ┌───┐ │
│ │ │ │
│ └─┬─┘ │
│ Write │ Write │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ x = 5 │ │ x = 5 │ │ x = 5 │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ QUORUMS: │
│ ──────── │
│ N = total nodes │
│ W = nodes that must acknowledge write │
│ R = nodes that must respond to read │
│ │
│ RULE: W + R > N (ensures overlap, some node has latest) │
│ │
│ COMMON CONFIGS: │
│ ─────────────── │
│ N=3, W=2, R=2: Balanced read/write, tolerates 1 failure │
│ N=3, W=3, R=1: Fast reads, writes need all nodes │
│ N=3, W=1, R=3: Fast writes, reads slower but consistent │
│ │
│ DATABASES: Cassandra, Riak, DynamoDB, Voldemort │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
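To make the N/W/R mechanics concrete, here is a minimal sketch of a leaderless client: fan a write out to all replicas and succeed once W acknowledge; read from replicas, wait for R responses, and keep the newest timestamp. The replica `put`/`get` methods are illustrative assumptions, and a real client would contact replicas in parallel:

```python
class QuorumClient:
    """Minimal leaderless read/write sketch. Replica objects are assumed
    (for illustration) to expose put(key, value, ts) and
    get(key) -> (value, ts), raising ConnectionError when unreachable."""

    def __init__(self, replicas, w, r):
        assert w + r > len(replicas), "need W + R > N for read/write overlap"
        self.replicas, self.w, self.r = replicas, w, r

    def write(self, key, value, ts):
        acks = 0
        for node in self.replicas:  # real clients fan out in parallel
            try:
                node.put(key, value, ts)
                acks += 1
            except ConnectionError:
                continue  # tolerate unreachable replicas
        if acks < self.w:
            raise RuntimeError("write failed: fewer than W acknowledgments")

    def read(self, key):
        responses = []
        for node in self.replicas:
            try:
                responses.append(node.get(key))
            except ConnectionError:
                continue
        if len(responses) < self.r:
            raise RuntimeError("read failed: fewer than R replicas responded")
        return max(responses, key=lambda vt: vt[1])[0]  # newest timestamp wins
```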
### Quorum Mathematics

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ QUORUM INTERSECTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ WHY W + R > N WORKS: │
│ │
│ N = 5 nodes, W = 3, R = 3 │
│ │
│ Write touches: [1] [2] [3] ✓ [ ] [ ] │
│ Read touches: [1] ✓ [ ] [4] [5] ← Some overlap! │
│ ↑ │
│ This node has the latest write! │
│ │
│ OVERLAP = W + R - N = 3 + 3 - 5 = 1 │
│ At least 1 node in read set has latest value │
│ │
│ EDGE CASES: │
│ ─────────── │
│ W + R = N: Exactly 0 overlap possible (risky) │
│ W + R < N: May read stale data │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Some home nodes unreachable │
│ │
│ Client wants to write x = 5 │
│ Home nodes for x: [A, B, C] │
│ But node C is down! │
│ │
│ STRICT QUORUM: │
│ ────────────── │
│ Wait for C or fail (reduced availability) │
│ │
│ SLOPPY QUORUM: │
│ ────────────── │
│ Write to D instead of C (D is not home node) │
│ Still get W nodes, just not the “right” ones │
│ │
│ Write: [A ✓] [B ✓] [C ✗] [D ✓ hint] │
│ │
│ HINTED HANDOFF: │
│ ─────────────── │
│ When C comes back online: │
│ D says “I have a hint for you” → sends x = 5 to C │
│ D deletes hint after C acknowledges │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ WARNING: With sloppy quorums, W + R > N doesn’t guarantee │ │
│ │ reading latest value! Reads might miss the hint node. │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
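The hinted-handoff flow from the box above fits in a few lines. This is a sketch only; the `HintedReplica` class and its methods are illustrative names, not a real API:

```python
class HintedReplica:
    """Sketch of sloppy-quorum writes with hinted handoff: a stand-in
    node stores writes meant for a down home node, then replays them."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.hints = []  # (home_node, key, value) held for unreachable nodes

    def put(self, key, value, hint_for=None):
        if hint_for is None:
            self.data[key] = value  # normal write: this IS a home node
        else:
            self.hints.append((hint_for, key, value))  # sloppy-quorum write

    def handoff(self):
        # Called when home nodes come back online: replay and drop hints
        for home_node, key, value in self.hints:
            home_node.put(key, value)
        self.hints.clear()
```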
### Quorum Intersection Explained

**Why This Works**: The Pigeonhole Principle

If you have N boxes and you put W + R items in them, and W + R > N, then at least one box must contain more than one item. That box represents a node that participated in both the write and the read, guaranteeing you see the latest value.

**Tuning Quorums for Your Workload**:

| Use Case | N | W | R | Rationale |
|----------|---|---|---|-----------|
| Read-heavy | 3 | 2 | 1 | Fast reads, writes need majority |
| Write-heavy | 3 | 1 | 3 | Fast writes, reads check all |
| Balanced | 3 | 2 | 2 | Equal read/write performance |
| High availability | 5 | 3 | 3 | Tolerates 2 failures |

**Latency Impact**:

- Write latency = slowest of W nodes
- Read latency = slowest of R nodes
- Lower W or R = better latency, weaker consistency
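These trade-offs are just arithmetic, so they are easy to check for any configuration. A small helper that restates the rules above:

```python
def quorum_properties(n, w, r):
    """Restate the quorum rules for a given (N, W, R) configuration."""
    return {
        "reads_see_latest": w + r > n,       # guaranteed read/write overlap
        "overlap_nodes": max(0, w + r - n),  # size of guaranteed intersection
        "write_fault_tolerance": n - w,      # nodes down, writes still succeed
        "read_fault_tolerance": n - r,       # nodes down, reads still succeed
    }

for n, w, r in [(3, 2, 2), (3, 3, 1), (3, 1, 3), (5, 3, 3)]:
    print((n, w, r), quorum_properties(n, w, r))
```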
### Anti-Entropy: Read Repair and Merkle Trees

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ ANTI-ENTROPY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Mechanisms to detect and repair inconsistencies │
│ │
│ 1. READ REPAIR │
│ ─────────────── │
│ During reads, detect stale replicas and update them │
│ │
│ Client reads x from [A, B, C]: │
│ A: x = 5 (version 3) │
│ B: x = 5 (version 3) │
│ C: x = 3 (version 2) ← Stale! │
│ │
│ Action: Send x = 5 to C (background repair) │
│ │
│ 2. MERKLE TREES │
│ ──────────────── │
│ Efficiently compare data between nodes │
│ │
│ ROOT (hash of children) │
│ / \ │
│ HASH(A,B) HASH(C,D) │
│ / \ / \ │
│ HASH(A) HASH(B) HASH(C) HASH(D) │
│ │ │ │ │ │
│ [Data A] [Data B] [Data C] [Data D] │
│ │
│ COMPARISON: │
│ ─────────── │
│ Node1 and Node2 exchange root hashes │
│ If same → all data matches (done!) │
│ If different → compare children, recursively find mismatches │
│ │
│ EFFICIENCY: O(log n) comparisons to find one mismatch │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
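Read repair falls out naturally from a quorum read: after picking the newest response, push it to any replica that answered with an older version. A sketch, reusing the same assumed replica `get`/`put` interface as the quorum client earlier:

```python
def read_with_repair(replicas, key, r):
    """Quorum read plus read repair. Each replica is assumed to expose
    get(key) -> (value, version) and put(key, value, version)."""
    responses = [(node, *node.get(key)) for node in replicas[:r]]

    # Newest version wins (LWW here; real systems may use version vectors)
    _, value, version = max(responses, key=lambda t: t[2])

    # Repair any replica that returned a stale version
    for node, _, node_version in responses:
        if node_version < version:
            node.put(key, value, version)  # done in the background in practice

    return value
```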
---

## Module 14: Conflict Resolution

When conflicts happen, how do you resolve them?

### Version Vectors

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│                              VERSION VECTORS                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  Like vector clocks, but for replicas instead of processes                 │
│                                                                             │
│  FORMAT: {replica_id: version_number, ...}                                 │
│                                                                             │
│  EXAMPLE:                                                                   │
│  ────────                                                                   │
│  Object x starts at version {A:0, B:0, C:0}                                │
│                                                                             │
│  Write at A: x = 5, version = {A:1, B:0, C:0}                              │
│  Write at B: x = 7, version = {A:0, B:1, C:0}                              │
│                                                                             │
│  Neither dominates the other → CONFLICT                                    │
│                                                                             │
│  COMPARISON:                                                                │
│  ───────────                                                                │
│  {A:1, B:0} dominates {A:0, B:0}   (A:1 > A:0, B:0 = B:0)                  │
│  {A:1, B:0} || {A:0, B:1}          (concurrent - neither dominates)        │
│                                                                             │
│  SIBLING RESOLUTION:                                                        │
│  ───────────────────                                                        │
│  1. Present both values to application                                     │
│  2. Application merges and writes back                                     │
│  3. New version dominates both siblings                                    │
│                                                                             │
│  x = merge(5, 7), version = {A:1, B:1, C:0}                                │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
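The dominance check is just a pointwise comparison. A short sketch matching the semantics in the box above:

```python
def dominates(v1, v2):
    """True if version vector v1 has seen at least everything v2 has."""
    return all(v1.get(k, 0) >= v2.get(k, 0) for k in set(v1) | set(v2))

def compare(v1, v2):
    if dominates(v1, v2) and dominates(v2, v1):
        return "equal"
    if dominates(v1, v2):
        return "v1 dominates"
    if dominates(v2, v1):
        return "v2 dominates"
    return "concurrent"  # siblings: the application must merge

def merged_version(v1, v2):
    """The version written back after sibling resolution dominates both."""
    return {k: max(v1.get(k, 0), v2.get(k, 0)) for k in set(v1) | set(v2)}

print(compare({"A": 1, "B": 0}, {"A": 0, "B": 1}))         # concurrent
print(merged_version({"A": 1, "B": 0}, {"A": 0, "B": 1}))  # A:1, B:1
```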
**G-Set**: Only adds, never removes

STATE: Set of elements
MERGE: Union of sets

Node A: {apple, banana}
Node B: {banana, cherry}
Merge:  {apple, banana, cherry}

```python
class GSet:
    def __init__(self):
        self.elements = set()

    def add(self, element):
        self.elements.add(element)

    def contains(self, element):
        return element in self.elements

    def merge(self, other):
        # Set union is commutative, associative, and idempotent,
        # so replicas converge no matter how merges are ordered
        self.elements |= other.elements
```
**2P-Set**: Add and remove (but can't re-add after remove)

STATE: Two G-Sets
- A (added): elements added
- R (removed): elements removed (tombstones)

CONTAINS: element in A and not in R
LIMITATION: Once removed, an element can never be added again

```python
class TwoPSet:
    def __init__(self):
        self.a = GSet()  # added
        self.r = GSet()  # removed (tombstones)

    def add(self, element):
        self.a.add(element)

    def remove(self, element):
        # Only elements that have been added can be removed
        if element in self.a.elements:
            self.r.add(element)

    def contains(self, element):
        return (element in self.a.elements and
                element not in self.r.elements)

    def merge(self, other):
        self.a.merge(other.a)
        self.r.merge(other.r)
```
**OR-Set**: Add and remove with re-add support

IDEA: Tag each addition with a unique ID
Remove only removes specific tags, not all

STATE: {element: set of (tag, is_added)}

EXAMPLE:
────────
Add "apple" at A:    {apple: {(uuid1, true)}}
Add "apple" at B:    {apple: {(uuid2, true)}}
Remove "apple" at A: {apple: {(uuid1, false), (uuid2, true)}}

Contains: At least one (tag, true) without a matching (tag, false)
"apple" is still in the set because uuid2 wasn't removed!

USED BY:
• Riak
• Redis Enterprise (Active-Active)
• SoundCloud's Roshi
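Following the same pattern as the classes above, a minimal OR-Set sketch. It tracks observed tags per element (an equivalent encoding of the (tag, is_added) pairs described above):

```python
import uuid

class ORSet:
    def __init__(self):
        self.added = {}    # element -> set of unique add-tags
        self.removed = {}  # element -> set of tags observed at remove time

    def add(self, element):
        # A fresh tag per add lets a re-add survive older removes
        self.added.setdefault(element, set()).add(uuid.uuid4())

    def remove(self, element):
        # Remove only the tags this replica has actually observed
        observed = self.added.get(element, set())
        self.removed.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.added.get(element, set()) - self.removed.get(element, set())
        return bool(live)

    def merge(self, other):
        for element, tags in other.added.items():
            self.added.setdefault(element, set()).update(tags)
        for element, tags in other.removed.items():
            self.removed.setdefault(element, set()).update(tags)
```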
```text
PRODUCTION CRDT SYSTEMS:
────────────────────────
┌──────────────────┬───────────────────────────────────────────────┐
│ System           │ CRDT Usage                                    │
├──────────────────┼───────────────────────────────────────────────┤
│ Riak             │ Counters, Sets, Maps, Registers               │
│ Redis Enterprise │ Counters, Sets for Active-Active              │
│ Phoenix (Elixir) │ Presence tracking with ORSets                 │
│ Apple (Notes)    │ CRDT-based sync                               │
│ Figma            │ Custom CRDTs for collaborative design         │
│ SoundCloud       │ Roshi (LWW-Set for activity feeds)            │
│ League of Legends│ Player state sync                             │
└──────────────────┴───────────────────────────────────────────────┘

LIMITATIONS:
────────────
• Memory overhead (tombstones, version vectors)
• Not suitable for all data types
• Complex to implement correctly
• May need garbage collection for tombstones
```
**Q: When would you choose leaderless over leader-based replication?**

Choose leaderless when:

- High availability is critical (any node can accept writes)
- Write latency matters (write to nearest nodes)
- You can tolerate eventual consistency
- Multi-datacenter with no clear primary

Choose leader-based when:

- You need strong consistency
- Transactions are required
- You want simpler conflict resolution
- Read-heavy workload (scale reads with followers)

Examples:

- Leaderless: session data, user activity, metrics
- Leader-based: financial transactions, inventory, user accounts
**Q: Design conflict resolution for a shopping cart**

Requirements analysis:

- Cart shared across devices
- Offline support
- Reasonable merge behavior

Solution:
```text
Cart Item = (product_id, quantity, timestamps)

Add item:
- If not exists: add with current timestamp
- If exists: max(qty1, qty2), latest timestamp

Remove item:
- Use tombstone with timestamp
- Remove wins if remove_timestamp > add_timestamp

Merge strategy:
1. Union all product_ids
2. For each product:
   - If removed in both → removed
   - If removed in one with later timestamp → removed
   - Otherwise → max(quantities)

Alternative: Use OR-Set CRDT for cart items
```
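A sketch of those merge rules in Python; the `CartItem` shape and field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class CartItem:
    quantity: int
    added_at: float        # timestamps assumed comparable across devices
    removed_at: float = 0  # 0 means "never removed" (tombstone timestamp)

def merge_item(a: CartItem, b: CartItem) -> CartItem:
    """Per-item merge: keep the max quantity and the latest timestamps."""
    return CartItem(
        quantity=max(a.quantity, b.quantity),
        added_at=max(a.added_at, b.added_at),
        removed_at=max(a.removed_at, b.removed_at),
    )

def merge_carts(cart_a: dict, cart_b: dict) -> dict:
    """Union of product_ids; an item survives only if its latest add
    beats its latest remove (remove wins on a later timestamp)."""
    merged = {}
    for pid in set(cart_a) | set(cart_b):
        if pid in cart_a and pid in cart_b:
            item = merge_item(cart_a[pid], cart_b[pid])
        else:
            item = cart_a.get(pid) or cart_b[pid]
        if item.added_at > item.removed_at:
            merged[pid] = item
    return merged
```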
**Q: How do you handle replication lag in a user-facing feature?**

Common scenario: a user updates their profile, immediately views it, and sees old data.

Solutions:

1. Read-your-writes consistency (see the sketch after this list)
   - Track writes with a client-side timestamp
   - Read from a replica that’s caught up
   - Or read from the leader for recently-written data
2. Monotonic reads
   - Stick the user to the same replica (session affinity)
   - Or track the last-seen replica position
3. Synchronous replication for critical data
   - Higher latency but guaranteed consistency
4. Application-level workarounds
   - Optimistic UI (assume success)
   - Show cached data with a “syncing” indicator
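A minimal read-your-writes sketch, assuming each replica can report how far it has caught up. The `read`/`write`/`last_applied` methods are illustrative, and real systems track replication log positions rather than wall-clock time:

```python
import time

class ReadYourWritesClient:
    def __init__(self, leader, replicas):
        self.leader = leader
        self.replicas = replicas
        self.last_write_ts = 0.0  # position of this client's latest write

    def write(self, key, value):
        self.leader.write(key, value)
        self.last_write_ts = time.time()

    def read(self, key):
        for replica in self.replicas:
            # Only use a replica that has applied everything we wrote
            if replica.last_applied() >= self.last_write_ts:
                return replica.read(key)
        return self.leader.read(key)  # fall back to the leader
```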
**Q: Explain how Cassandra handles replication**

Answer:

```text
Architecture:
- Leaderless, all nodes equal
- Consistent hashing for partitioning
- Configurable replication factor (RF)

Write path:
1. Client writes to any coordinator
2. Coordinator forwards to RF replicas
3. Wait for W acknowledgments
4. Return success

Read path:
1. Coordinator contacts R replicas
2. Returns result with highest timestamp
3. Background read repair if inconsistency is found

Consistency levels:
- ONE: Fast, might be stale
- QUORUM: W + R > RF, usually consistent
- ALL: All replicas, strongest but slowest
- LOCAL_QUORUM: Quorum within datacenter

Anti-entropy:
- Read repair on reads
- Merkle tree comparison
- Hinted handoff when nodes are down
```
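With the DataStax Python driver, the consistency level is chosen per statement, which is how applications pick a point on the latency/consistency spectrum described above. A sketch against a hypothetical `shop` keyspace and tables:

```python
# Requires the DataStax driver: pip install cassandra-driver
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])   # any contact point; all nodes are equal
session = cluster.connect("shop")  # hypothetical keyspace

# QUORUM read: waits for a majority of the RF replicas
read = SimpleStatement(
    "SELECT balance FROM accounts WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
row = session.execute(read, ("user-42",)).one()

# ONE read: returns after a single replica answers (fast, possibly stale)
fast = SimpleStatement(
    "SELECT views FROM metrics WHERE page = %s",
    consistency_level=ConsistencyLevel.ONE,
)
```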