Replication lag is the time delay between when a write occurs on the leader and when it's applied on a follower. The diagram above shows a 150ms lag where the follower is consistently behind the leader.

Why Lag Happens:

- Network latency between leader and followers
- Follower processing slower than leader writes
- Follower temporarily offline or restarting
- High write throughput overwhelming followers
Measuring Lag:
```sql
-- PostgreSQL: Check replication lag
SELECT
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS pending_bytes,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```
Acceptable Lag: Depends on your use case

- Financial systems: Milliseconds
- Analytics: Minutes or hours acceptable
- Caching: Seconds to minutes
Solutions to Lag Problems:

- Sticky sessions: Route a user's reads to the same follower

Read-your-writes is one of the most common consistency issues users experience. Here's how to implement it:
```text
READ-YOUR-WRITES IMPLEMENTATION
───────────────────────────────

PROBLEM:
User writes to leader, immediately reads from follower → stale data!

Client ──Write──► Leader (v5)
Client ──Read───► Follower (still at v3) → "Where's my data?!"

═══════════════════════════════════════════════════════════════════════

SOLUTION 1: STICKY SESSIONS (Simple)
────────────────────────────────────
Route all requests from same user to same replica
• Use user_id hash for routing
• Works if replica is leader or caught-up follower
• Breaks on failover

SOLUTION 2: READ-FROM-LEADER (Strong but bottleneck)
────────────────────────────────────────────────────
After write, read from leader for X seconds
• Simple to implement
• Increases leader load
• Time-based heuristic (may still serve stale)

SOLUTION 3: VERSION TRACKING (Robust)
─────────────────────────────────────
Track write version, ensure read version ≥ write version
```
Production Implementation (Version Tracking):
```python
import time


class StaleReadError(Exception):
    """Raised when a replica cannot catch up to the requested version in time."""


class ReadYourWritesClient:
    """
    Client-side implementation of the Read-Your-Writes guarantee.
    Used by: MongoDB, CockroachDB, Spanner
    """
    def __init__(self):
        # Track the latest write version per key (or global)
        self.write_version = {}  # key -> version

    def write(self, key, value):
        """Write to leader, record version"""
        response = self.leader.write(key, value)
        # Store the version returned by the leader
        self.write_version[key] = response.version
        return response

    def read(self, key):
        """Read ensuring we see our own writes (two options shown; pick one)"""
        min_version = self.write_version.get(key, 0)

        # Option A: Send min_version to a replica; it waits until caught up
        response = self.replica.read(
            key,
            min_version=min_version  # "Don't respond until you have this version"
        )

        # Option B: Read from any replica, retry if stale
        for replica in self.replicas:
            response = replica.read(key)
            if response.version >= min_version:
                return response
        # Fallback to leader if no replica is fresh enough
        return self.leader.read(key)


class ReplicaServer:
    """Server-side: Wait for replication before responding"""
    def read(self, key, min_version=None):
        if min_version is not None:
            # Wait until we've replicated up to min_version
            self.wait_for_version(min_version, timeout=5.0)
        return self.storage.get(key)

    def wait_for_version(self, target_version, timeout):
        """Block until replication catches up"""
        start = time.time()
        while self.current_version < target_version:
            if time.time() - start > timeout:
                raise StaleReadError("Replica too far behind")
            time.sleep(0.01)  # Busy wait (use a condition variable in prod)
```
Comparison of Approaches:

| Approach | Consistency | Availability | Complexity |
|----------|-------------|--------------|------------|
| Sticky Sessions | Weak | High | Low |
| Read-from-Leader | Strong | Lower | Low |
| Version Tracking | Strong | High | Medium |
| Causal Tokens | Strong | High | High |
Causal Consistency Tokens (Advanced):
Systems like MongoDB use “causal consistency tokens” that encode the entire causal history, not just a single version. This ensures you see all causally-related writes, not just your own.
Example use case: collaborative editing, where each user has a local leader and must observe edits in causal order.
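Below is a brief, hedged sketch of what causal tokens look like in practice, using PyMongo's causally consistent sessions. The connection string, database, and collection names are placeholders; the point is that the session object carries the causal token so a follow-up read, even on a lagging secondary, reflects the earlier write.

```python
# Sketch only: causal tokens via PyMongo's causally consistent sessions.
# Connection string, database, and collection names are placeholders.
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
profiles = client.appdb.get_collection(
    "profiles", read_preference=ReadPreference.SECONDARY)

with client.start_session(causal_consistency=True) as session:
    # Write goes to the primary; the session records the operation time.
    profiles.update_one({"_id": "alice"}, {"$set": {"bio": "hi"}},
                        upsert=True, session=session)
    # This read is causally after the write above, even on a lagging secondary,
    # because the session's causal token is attached to the query.
    doc = profiles.find_one({"_id": "alice"}, session=session)
```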
### Conflict Detection and Resolution

<Tabs>
  <Tab title="Conflict Detection">
```text
CONFLICT SCENARIO:
──────────────────
Time    Leader A         Leader B
────    ────────         ────────
t1      x = 1            x = 1           (Initial state)
t2      x = 2            x = 3           (Concurrent writes!)
t3      (replicate) ──────────────────── (replicate)
t4      x = 2 AND 3?     x = 3 AND 2?    CONFLICT!

WHEN TO DETECT:
───────────────
• Synchronous: Detect immediately (if possible)
• Asynchronous: Detect during replication (too late to reject)

Most multi-leader systems detect asynchronously
→ must resolve conflicts after the fact
```
  </Tab>
  <Tab title="Resolution Strategies">
```text
LAST-WRITE-WINS (LWW)
─────────────────────
Use timestamps, highest timestamp wins
Problem: Data loss (other write silently discarded)
Problem: Clock skew can cause "wrong" winner
Used by: Cassandra, DynamoDB (default)

MERGE VALUES
────────────
Example: For shopping cart, union of items
Example: For counter, add the deltas

PROMPT USER
───────────
Show conflicts to user, let them choose
Used by: Git, some CMS systems
```
  </Tab>
  <Tab title="Conflict Avoidance">
```text
BEST STRATEGY: Avoid conflicts in the first place

TECHNIQUES:
───────────
ROUTE BY USER
  User's writes always go to same leader
  No concurrent writes to same data

ROUTE BY DATA
  Partition data, each partition has one leader

GLOBAL LOCKING
  Acquire lock before write
  (defeats purpose of multi-leader though)

CRDT DATA TYPES
  Data structures designed to merge without conflicts
  (covered in Module 15)
```
  </Tab>
</Tabs>

### Conflict Scenarios Visualized

**Real-World Example**: Google Docs

Google Docs uses multi-leader replication (operational transformation, similar to CRDTs):

- Each user's browser is a "leader"
- Concurrent edits to same document
- Conflicts resolved automatically through OT algorithm
- Users see each other's changes in real-time

**When Multi-Leader Makes Sense**:

1. **Geographic Distribution**: Users in different continents
2. **Offline Operation**: Mobile apps that sync later
3. **Collaborative Editing**: Multiple users editing simultaneously
4. **High Availability**: No single point of failure

---

## Module 13: Leaderless Replication

The Dynamo-style approach.

<Frame>
  <img src="/images/courses/distributed-systems-leaderless.svg" alt="Leaderless Replication Architecture" />
</Frame>

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ LEADERLESS REPLICATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ NO LEADER! Client writes to multiple nodes directly. │
│ │
│ Client │
│ ┌───┐ │
│ │ │ │
│ └─┬─┘ │
│ Write │ Write │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │
│ │ x = 5 │ │ x = 5 │ │ x = 5 │ │
│ └─────────┘ └─────────┘ └─────────┘ │
│ │
│ QUORUMS: │
│ ──────── │
│ N = total nodes │
│ W = nodes that must acknowledge write │
│ R = nodes that must respond to read │
│ │
│ RULE: W + R > N (ensures overlap, some node has latest) │
│ │
│ COMMON CONFIGS: │
│ ─────────────── │
│ N=3, W=2, R=2: Balanced read/write, tolerates 1 failure │
│ N=3, W=3, R=1: Fast reads, writes need all nodes │
│ N=3, W=1, R=3: Fast writes, reads slower but consistent │
│ │
│ DATABASES: Cassandra, Riak, DynamoDB, Voldemort │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
### Quorum Mathematics

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ QUORUM INTERSECTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ WHY W + R > N WORKS: │
│ │
│ N = 5 nodes, W = 3, R = 3 │
│ │
│ Write touches: [1] [2] [3] ✓ [ ] [ ] │
│ Read touches: [1] ✓ [ ] [4] [5] ← Some overlap! │
│ ↑ │
│ This node has the latest write! │
│ │
│ OVERLAP = W + R - N = 3 + 3 - 5 = 1 │
│ At least 1 node in read set has latest value │
│ │
│ EDGE CASES: │
│ ─────────── │
│ W + R = N: Exactly 0 overlap possible (risky) │
│ W + R < N: May read stale data │
│ │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Some home nodes unreachable │
│ │
│ Client wants to write x = 5 │
│ Home nodes for x: [A, B, C] │
│ But node C is down! │
│ │
│ STRICT QUORUM: │
│ ────────────── │
│ Wait for C or fail (reduced availability) │
│ │
│ SLOPPY QUORUM: │
│ ────────────── │
│ Write to D instead of C (D is not home node) │
│ Still get W nodes, just not the “right” ones │
│ │
│ Write: [A ✓] [B ✓] [C ✗] [D ✓ hint] │
│ │
│ HINTED HANDOFF: │
│ ─────────────── │
│ When C comes back online: │
│ D says “I have a hint for you” → sends x = 5 to C │
│ D deletes hint after C acknowledges │
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ WARNING: With sloppy quorums, W + R > N doesn’t guarantee │ │
│ │ reading latest value! Reads might miss the hint node. │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
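As a quick sanity check on the rule above, here is a small illustrative Python helper (not any database's API) that reports what a given (N, W, R) configuration buys you:

```python
# Illustrative helper: check quorum configurations against the W + R > N rule.

def quorum_properties(n: int, w: int, r: int) -> dict:
    """Summarize what a given (N, W, R) configuration guarantees."""
    return {
        "read_write_overlap": w + r > n,   # some node in the read set saw the latest write
        "write_fault_tolerance": n - w,    # nodes that can fail while writes still succeed
        "read_fault_tolerance": n - r,     # nodes that can fail while reads still succeed
    }

for n, w, r in [(3, 2, 2), (3, 3, 1), (3, 1, 3), (3, 1, 1)]:
    print((n, w, r), quorum_properties(n, w, r))
# (3, 1, 1) reports read_write_overlap=False: W + R = 2 <= N, so reads may be stale.
```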
### Quorum Intersection Explained

**Why This Works**: The Pigeonhole Principle

If you have N boxes and you put W + R items in them, and W + R > N, then at least one box must contain more than one item. That box represents a node that participated in both the write and the read, guaranteeing you see the latest value.

**Tuning Quorums for Your Workload**:

| Use Case | N | W | R | Rationale |
|----------|---|---|---|-----------|
| Read-heavy | 3 | 2 | 1 | Fast reads, writes need majority |
| Write-heavy | 3 | 1 | 3 | Fast writes, reads check all |
| Balanced | 3 | 2 | 2 | Equal read/write performance |
| High availability | 5 | 3 | 3 | Tolerates 2 failures |

**Latency Impact**:

- Write latency = slowest of W nodes
- Read latency = slowest of R nodes
- Lower W or R = better latency, weaker consistency

### Anti-Entropy: Read Repair and Merkle Trees

```text
┌─────────────────────────────────────────────────────────────────────────────┐
│ ANTI-ENTROPY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Mechanisms to detect and repair inconsistencies │
│ │
│ 1. READ REPAIR │
│ ─────────────── │
│ During reads, detect stale replicas and update them │
│ │
│ Client reads x from [A, B, C]: │
│ A: x = 5 (version 3) │
│ B: x = 5 (version 3) │
│ C: x = 3 (version 2) ← Stale! │
│ │
│ Action: Send x = 5 to C (background repair) │
│ │
│ 2. MERKLE TREES │
│ ──────────────── │
│ Efficiently compare data between nodes │
│ │
│ ROOT (hash of children) │
│ / \ │
│ HASH(A,B) HASH(C,D) │
│ / \ / \ │
│ HASH(A) HASH(B) HASH(C) HASH(D) │
│ │ │ │ │ │
│ [Data A] [Data B] [Data C] [Data D] │
│ │
│ COMPARISON: │
│ ─────────── │
│ Node1 and Node2 exchange root hashes │
│ If same → all data matches (done!) │
│ If different → compare children, recursively find mismatches │
│ │
│ EFFICIENCY: O(log n) comparisons to find one mismatch │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
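The Merkle-tree comparison in the box above can be made concrete with a short sketch. This is an illustrative toy (real systems hash key ranges on disk and exchange hashes over the network), but it shows the O(log n) pruning: identical subtrees are skipped, and only mismatching branches are descended.

```python
# Toy Merkle-tree diff: find out-of-sync keys between two replicas.
import hashlib

def h(*parts: str) -> str:
    return hashlib.sha256("".join(parts).encode()).hexdigest()

class MerkleNode:
    def __init__(self, keys, data):
        if len(keys) == 1:                      # leaf: hash one key's data
            self.keys, self.left, self.right = keys, None, None
            self.hash = h(keys[0], data[keys[0]])
        else:                                   # internal: hash of children
            mid = len(keys) // 2
            self.left = MerkleNode(keys[:mid], data)
            self.right = MerkleNode(keys[mid:], data)
            self.keys = keys
            self.hash = h(self.left.hash, self.right.hash)

def diff(a: MerkleNode, b: MerkleNode):
    """Return keys whose values differ, descending only into mismatching subtrees."""
    if a.hash == b.hash:
        return []                               # subtrees identical → prune here
    if a.left is None:
        return list(a.keys)                     # leaf level: this key is out of sync
    return diff(a.left, b.left) + diff(a.right, b.right)

# Two replicas with one divergent key:
node1 = {"a": "1", "b": "2", "c": "3", "d": "4"}
node2 = {"a": "1", "b": "2", "c": "999", "d": "4"}
keys = sorted(node1)
print(diff(MerkleNode(keys, node1), MerkleNode(keys, node2)))  # ['c']
```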
---

## Module 14: Conflict Resolution

When conflicts happen, how do you resolve them?

### Version Vectors

```text
VERSION VECTORS
───────────────
Like vector clocks, but for replicas instead of processes

FORMAT: {replica_id: version_number, ...}

EXAMPLE:
────────
Object x starts at version {A:0, B:0, C:0}

Write at A: x = 5, version = {A:1, B:0, C:0}
Write at B: x = 7, version = {A:0, B:1, C:0}

Neither dominates the other → CONFLICT

COMPARISON:
───────────
{A:1, B:0} dominates {A:0, B:0}   (A:1 > A:0, B:0 = B:0)
{A:1, B:0} || {A:0, B:1}          (concurrent - neither dominates)

SIBLING RESOLUTION:
───────────────────
1. Present both values to application
2. Application merges and writes back
3. New version dominates both siblings

x = merge(5, 7), version = {A:1, B:1, C:0}
```
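A minimal sketch of the comparison rules above, using plain dictionaries as version vectors (names are illustrative):

```python
# Version-vector comparison: dominance vs. concurrency (conflict).

def dominates(a: dict, b: dict) -> bool:
    """True if version vector a has seen everything b has seen."""
    replicas = set(a) | set(b)
    return all(a.get(r, 0) >= b.get(r, 0) for r in replicas)

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a dominates b"
    if dominates(b, a):
        return "b dominates a"
    return "concurrent (conflict → keep both as siblings)"

print(compare({"A": 1, "B": 0}, {"A": 0, "B": 0}))  # a dominates b
print(compare({"A": 1, "B": 0}, {"A": 0, "B": 1}))  # concurrent (conflict → keep both as siblings)
```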
Traditional version vectors can grow linearly with the number of nodes (O(N) overhead) and can be imprecise when concurrent updates are handled by the same node. Dotted Version Vectors (DVVs), used in Riak, solve this by separating the "dot" (a specific update event) from the "causal context."

Key Benefit: DVVs accurately distinguish between "I have seen this update" and "this update is a sibling of that one," leading to significantly fewer false conflicts during churn.
```text
Format: [ (NodeID, Counter), VersionVector ]

- (NodeID, Counter) is the "Dot" representing the specific event.
- VersionVector is the "Context" representing what this node had seen.
```
G-Set: Only adds, never removes

```text
STATE: Set of elements
MERGE: Union of sets

Node A: {apple, banana}
Node B: {banana, cherry}
Merge:  {apple, banana, cherry}
```

```python
class GSet:
    def __init__(self):
        self.elements = set()

    def add(self, element):
        self.elements.add(element)

    def contains(self, element):
        return element in self.elements

    def merge(self, other):
        self.elements |= other.elements
```
2P-Set: Add and remove (but can't re-add after remove)

```text
STATE: Two G-Sets
- A (added): elements added
- R (removed): elements removed (tombstones)

CONTAINS: element in A and not in R
LIMITATION: Once removed, can never add again
```

```python
class TwoPSet:
    def __init__(self):
        self.a = GSet()  # added
        self.r = GSet()  # removed

    def add(self, element):
        self.a.add(element)

    def remove(self, element):
        if element in self.a.elements:
            self.r.add(element)

    def contains(self, element):
        return (element in self.a.elements and
                element not in self.r.elements)

    def merge(self, other):
        self.a.merge(other.a)
        self.r.merge(other.r)
```
OR-Set: Add and remove with re-add support

```text
IDEA: Tag each addition with a unique ID
Remove only removes specific tags, not all

STATE: {element: set of (tag, is_added)}

EXAMPLE:
────────
Add "apple" at A:    {apple: {(uuid1, true)}}
Add "apple" at B:    {apple: {(uuid2, true)}}
Remove "apple" at A: {apple: {(uuid1, false), (uuid2, true)}}

Contains: At least one (tag, true) without matching (tag, false)
"apple" is still in the set because uuid2 wasn't removed!

USED BY:
• Riak
• Redis Enterprise (Active-Active)
• SoundCloud's Roshi
```
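A minimal OR-Set sketch following the description above. This is illustrative only; production implementations such as Riak's track causal context rather than keeping raw tag tombstones forever.

```python
# Sketch of an observed-remove set: removes only affect tags already seen,
# so a concurrent (or later) add with a fresh tag survives.
import uuid

class ORSet:
    def __init__(self):
        self.adds = {}     # element -> set of tags from add operations
        self.removes = {}  # element -> set of tags that have been removed

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Remove only the tags we have observed; unseen concurrent adds survive.
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        for elem, tags in other.adds.items():
            self.adds.setdefault(elem, set()).update(tags)
        for elem, tags in other.removes.items():
            self.removes.setdefault(elem, set()).update(tags)
```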
In collaborative text editing, if Alice inserts “A” and Bob inserts “B” at the same position, a naive CRDT might result in “BA” for Alice and “AB” for Bob. Modern CRDTs like Yjs and Automerge use Causal Ordering and Unique IDs to ensure that all replicas agree on the order (e.g., “AB” for everyone).
The biggest drawback of state-based CRDTs (CvRDTs) is that the state grows over time: if you have a counter with 1,000 nodes, you must send all 1,000 counts every time you sync.

Delta-CRDTs solve this by only sending the delta (the changes) since the last successful synchronization:
- Dot Store: A way to track which specific updates (dots) a neighbor has already seen.
- Delta-Group: A collection of updates that are merged together before being sent.
- Efficiency: Reduces bandwidth from O(N) to O(Δ), making CRDTs viable for mobile devices on slow networks.
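A small illustrative delta-CRDT, here a grow-only counter: state is still merged by pointwise max, but only the entries changed since the last sync are shipped. The class and method names are assumptions for this sketch, not any library's API.

```python
# Delta G-Counter sketch: ship only changed entries instead of the full state.

class DeltaGCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}   # node_id -> count (full state)
        self.delta = {}    # entries changed since the last sync (the "delta-group")

    def increment(self, amount: int = 1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount
        self.delta[self.node_id] = self.counts[self.node_id]

    def take_delta(self) -> dict:
        """Ship only what changed since the last sync instead of the whole state."""
        d, self.delta = self.delta, {}
        return d

    def merge(self, incoming: dict):
        # Works for either a full state or a delta: pointwise max per node.
        for node, count in incoming.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```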
When you remove an element from a CRDT (like a 2P-Set or OR-Set), you must keep a Tombstone to prove to other nodes that the item was deleted. If you delete the tombstone too early, a node that hasn’t seen the deletion might “re-infect” the cluster with the old data.
- Stable State GC: Only delete a tombstone once you have proof (via gossip) that every node in the cluster has seen the deletion.
- Time-based GC: Delete tombstones after a very long period (e.g., 30 days). If a node stays offline longer than that, it must be completely wiped and re-synced from scratch.
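A tiny sketch of the stable-state rule, assuming each node gossips the highest version it has applied (the data shapes here are assumptions for illustration):

```python
# Stable-state tombstone GC check: drop a tombstone only when every node has
# confirmed (via gossip) applying a version at least as new as the deletion.

def can_gc_tombstone(deletion_version: int,
                     acked_versions: dict,
                     cluster_nodes: set) -> bool:
    """acked_versions: node_id -> highest version that node has confirmed applying."""
    return all(acked_versions.get(node, 0) >= deletion_version
               for node in cluster_nodes)
```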
**Production CRDT Systems**:

| System | CRDT Usage |
|--------|------------|
| Riak | Counters, Sets, Maps, Registers |
| Redis Enterprise | Counters, Sets for Active-Active |
| Phoenix (Elixir) | Presence tracking with ORSets |
| Apple (Notes) | CRDT-based sync |
| Figma | Custom CRDTs for collaborative design |
| SoundCloud | Roshi (LWW-Set for activity feeds) |
| League of Legends | Player state sync |

**Limitations**:

- Memory overhead (tombstones, version vectors)
- Not suitable for all data types
- Complex to implement correctly
- May need garbage collection for tombstones
A single consensus group (e.g., one Raft cluster) is bottlenecked by the CPU and I/O of a single leader. To scale to millions of requests, you must use Sharded Consensus.
When an operation affects multiple shards (e.g., Transfer($5) from Shard A to Shard B), you need a higher-level protocol like Two-Phase Commit (2PC) or Percolator to coordinate across the individual consensus groups.

Used by: CockroachDB (Ranges), TiDB (Regions), and Google Spanner (Paxos Groups).
Chain replication is a widely used technique for high-throughput, linearizable replication. It is used in systems like FAWN and CORFU.
```text
CHAIN REPLICATION
─────────────────
STRUCTURE: Nodes are arranged in a chain: Head → ... → Tail

WRITE PATH:
1. Client sends write to the HEAD.
2. Head applies write and forwards to next node.
3. Propagation continues until it reaches the TAIL.
4. Tail sends ACK back up the chain or directly to client.

READ PATH:
1. Clients read ONLY from the TAIL.

WHY IT WORKS:
- Linearizability: A write is only visible (at the Tail) once it has been
  persisted by ALL nodes in the chain.
- Higher Throughput: The load is distributed across the chain.

FAILURE HANDLING:
- Head fails: Next node becomes the new Head.
- Tail fails: Predecessor becomes the new Tail.
- Internal node fails: Predecessor and successor link up.
```
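A toy, in-process sketch of the chain above (no networking or failure handling), just to show why a value visible at the tail must already be on every node:

```python
# Toy chain replication: writes flow Head → Tail, reads come from the Tail only.

class ChainNode:
    def __init__(self):
        self.store = {}
        self.next = None          # successor in the chain (None for the tail)

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)   # forward down the chain
        return "ack"                             # tail acknowledges

    def read(self, key):
        return self.store.get(key)               # only the tail should serve reads

# Build Head → Middle → Tail
head, mid, tail = ChainNode(), ChainNode(), ChainNode()
head.next, mid.next = mid, tail

assert head.write("x", 5) == "ack"
assert tail.read("x") == 5    # visible at the tail ⇒ already stored on every node
```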
Scenario 1: Global E‑Commerce Database Replication
You run a global e‑commerce platform with users in US, EU, and APAC. You want strong guarantees for orders and payments, but can tolerate slightly stale reads for product browsing.

Design:

- Topology:
  - Single write leader region (e.g., us-east-1) with synchronous followers in the same region.
  - Read replicas per region (EU, APAC) receiving asynchronous replication from the leader.
- Write path:
  - All order and payment writes go to the regional leader in us-east-1.
  - Use semi-synchronous replication inside the leader region: commit after 1 follower ACK, so a committed write never exists on only a single node.
- Read path:
  - Critical reads (order status, payment status): hit the leader region or only replicas with replication lag < T ms (see the routing sketch after this list).
  - Non-critical reads (product catalog, reviews): go to the nearest regional replica even if seconds behind.
- Failure behavior:
  - With semi-sync in the primary region, losing the leader does not lose committed transactions; fail over to the in-region sync follower.
  - Cross-region replication remains async → if the entire region fails, you may lose up to Δ seconds of writes that haven't replicated.
- Patterns used:
  - From this module: single-leader replication, sync vs. async, replication lag awareness, and failover.
  - Combined with consistency models: guarantee read-your-writes by routing post-write reads to the leader or using a min_lsn/min_commit_time threshold on replicas.
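A hedged sketch of the lag-aware read routing from the read-path bullet above; helper names like `get_replica_lag_ms` are placeholders, not a real driver API.

```python
# Lag-aware read routing sketch for Scenario 1.

CRITICAL_TABLES = {"orders", "payments"}
MAX_LAG_MS = 200  # the "T ms" threshold from the design above (assumed value)

def choose_endpoint(table, leader, replicas, get_replica_lag_ms):
    """Pick a database endpoint for a read on `table`."""
    if table not in CRITICAL_TABLES:
        # Non-critical reads: any regional replica is fine; pick the least lagged here.
        return min(replicas, key=get_replica_lag_ms)
    # Critical reads: only replicas under the lag threshold, else fall back to the leader.
    fresh = [r for r in replicas if get_replica_lag_ms(r) < MAX_LAG_MS]
    return fresh[0] if fresh else leader
```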
Scenario 2: Multi-Region User Profile Store

You want low-latency profile updates and reads from any region, and you're willing to accept eventual consistency for some fields but not for others.

Requirements:

- Profile updates can happen in any region (multi-leader / active‑active).
- Some fields (e.g., email, phone) must avoid conflicting values.
- Other fields (e.g., last_seen_at, recently_viewed) can be merged.
Design:
- Multi-leader replication between regions with a log for each leader.
- Per-field conflict resolution strategy (see the merge sketch after this list):
  - Identity fields (email, phone, name): Last-Write-Wins per field with server-side timestamps (avoid whole-record LWW).
  - Activity fields (recently_viewed, devices, tags): treat as CRDT sets or counters and merge.
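One way the per-field strategy could look in code; the field names and record layout are assumptions for illustration, not a specific product's schema.

```python
# Per-field merge sketch for Scenario 2: LWW for identity fields, set-union for
# activity fields. LWW fields are stored as (value, server_timestamp) tuples.

LWW_FIELDS = {"email", "phone", "name"}
SET_FIELDS = {"recently_viewed", "devices", "tags"}

def merge_profiles(a: dict, b: dict) -> dict:
    """Merge two replicas' versions of the same profile record."""
    merged = {}
    for field in LWW_FIELDS:
        candidates = [v for v in (a.get(field), b.get(field)) if v is not None]
        # Pick the value with the later server-side timestamp (per field, not per record).
        merged[field] = max(candidates, key=lambda v: v[1], default=None)
    for field in SET_FIELDS:
        merged[field] = set(a.get(field, set())) | set(b.get(field, set()))
    return merged
```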
Scenario 3: Leaderless Metrics Store with Tunable Consistency
You're designing a high-throughput metrics and logs store similar to Dynamo/Cassandra. Availability and write throughput matter more than strict consistency.

Design:
- Leaderless replication with parameters (N, W, R):
  - N = 3 replicas per key.
  - Writes: W = 2 for production metrics, W = 1 for debug logs.
  - Reads: R = 1 for dashboards where staleness is acceptable, R = 2 for alerting queries.
- Anti-entropy:
  - Enable read repair on reads (update stale replicas in the background).
  - Periodically run Merkle tree comparisons between nodes to reconcile drift.
```text
Write metrics(point):
  - Hash(key) → choose N replicas.
  - Try to write to all N.
  - If at least W ACKs within timeout → success.
  - If fewer than W → client sees write failure (but some replicas may still store point).

Read metrics(query):
  - Contact R replicas.
  - Return value with latest timestamp.
  - Schedule background repairs for stale replicas.
```
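The same write path as a rough Python sketch; the replica objects and their `write` method are placeholders, and a real client would fan requests out in parallel rather than sequentially.

```python
# Quorum write sketch: succeed once at least W replicas acknowledge in time.
import time

def quorum_write(replicas, key, value, w, timeout_s=1.0):
    """Write to all replicas; succeed once at least `w` acknowledge before the deadline."""
    acks = 0
    deadline = time.time() + timeout_s
    for replica in replicas:                 # real systems send these in parallel
        if time.time() > deadline:
            break
        try:
            replica.write(key, value, timestamp=time.time())
            acks += 1
        except ConnectionError:
            continue                          # node down: a sloppy quorum could pick a substitute here
    if acks < w:
        raise RuntimeError(f"quorum not reached: {acks}/{w} ACKs")
    return acks
```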
Partition & failure behavior:

- Partial node failures: as long as at least W nodes are reachable for writes and R for reads, the system remains available with overlapping quorums.
- Sloppy quorums + hinted handoff:
  - When a home replica is down, temporarily write to a substitute node (hinted handoff) to keep W achievable.
  - On recovery, hints are replayed to repair the downed node.

Trade-offs:

- For alerting: configure W + R > N and avoid sloppy quorums to favor correctness.
- For debug logs: allow W = 1, R = 1 and sloppy quorums to maximize write availability.

This scenario applies the quorum mathematics (W + R > N), sloppy quorums, hinted handoff, and anti-entropy techniques from this module in a concrete operational design.
Q: When would you choose leaderless over leader-based replication?
Choose leaderless when:

- High availability is critical (any node can accept writes)
- Write latency matters (write to nearest nodes)
- Can tolerate eventual consistency
- Multi-datacenter with no clear primary

Choose leader-based when:

- Need strong consistency
- Transactions required
- Simpler conflict resolution
- Read-heavy workload (scale reads with followers)

Examples:

- Leaderless: Session data, user activity, metrics
- Leader-based: Financial transactions, inventory, user accounts
Q: Design conflict resolution for a shopping cart
Requirements analysis:

- Cart shared across devices
- Offline support
- Reasonable merge behavior
Solution:
```text
Cart Item = (product_id, quantity, timestamps)

Add item:
  - If not exists: add with current timestamp
  - If exists: max(qty1, qty2), latest timestamp

Remove item:
  - Use tombstone with timestamp
  - Remove wins if remove_timestamp > add_timestamp

Merge strategy:
  1. Union all product_ids
  2. For each product:
     - If removed in both → removed
     - If removed in one with later timestamp → removed
     - Otherwise → max(quantities)

Alternative: Use OR-Set CRDT for cart items
```
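One way the merge above could look in code; the item layout with quantity, add timestamp, and remove timestamp is an assumption for illustration.

```python
# Shopping-cart merge sketch: union of products, max quantities, remove-wins
# when the latest remove is newer than the latest add.

def merge_carts(cart_a: dict, cart_b: dict) -> dict:
    """Each cart maps product_id -> {"qty": int, "add_ts": float, "remove_ts": float or None}."""
    merged = {}
    for pid in set(cart_a) | set(cart_b):
        a, b = cart_a.get(pid), cart_b.get(pid)
        if a is None or b is None:
            merged[pid] = dict(a or b)          # item only known on one device
            continue
        removes = [ts for ts in (a["remove_ts"], b["remove_ts"]) if ts is not None]
        merged[pid] = {
            "qty": max(a["qty"], b["qty"]),
            "add_ts": max(a["add_ts"], b["add_ts"]),
            "remove_ts": max(removes, default=None),
        }
    # Drop items whose latest remove is newer than the latest add ("remove wins").
    return {pid: it for pid, it in merged.items()
            if it["remove_ts"] is None or it["remove_ts"] <= it["add_ts"]}
```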
Q: How do you handle replication lag in a user-facing feature?
Common scenario: User updates profile, immediately views it, sees old data.

Solutions:

1. Read-your-writes consistency
   - Track writes with client-side timestamp
   - Read from a replica that's caught up
   - Or read from leader for recently-written data
2. Monotonic reads
   - Stick user to same replica (session affinity)
   - Or track last-seen replica position
3. Synchronous replication for critical data
   - Higher latency but guaranteed consistency
4. Application-level workaround
   - Optimistic UI (assume success)
   - Show cached data with "syncing" indicator
Q: Explain how Cassandra handles replication
Answer:
```text
Architecture:
- Leaderless, all nodes equal
- Consistent hashing for partitioning
- Configurable replication factor (RF)

Write path:
1. Client writes to any coordinator
2. Coordinator forwards to RF replicas
3. Wait for W acknowledgments
4. Return success

Read path:
1. Coordinator contacts R replicas
2. Returns result with highest timestamp
3. Background read repair if inconsistency

Consistency levels:
- ONE: Fast, might be stale
- QUORUM: W + R > RF, usually consistent
- ALL: All replicas, strongest but slowest
- LOCAL_QUORUM: Quorum within datacenter

Anti-entropy:
- Read repair on reads
- Merkle tree comparison
- Hinted handoff when nodes down
```
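For reference, a hedged example of how an application picks those consistency levels with the DataStax Python driver; the contact point, keyspace, and table are placeholders.

```python
# Per-query consistency levels with the DataStax Python driver (placeholder schema).
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("metrics")   # assumes a "metrics" keyspace exists

# Fast, possibly stale read for a dashboard:
dashboard_read = SimpleStatement(
    "SELECT value FROM points WHERE series = %s LIMIT 100",
    consistency_level=ConsistencyLevel.ONE)

# Write that must reach a quorum of replicas before acknowledging:
alert_write = SimpleStatement(
    "INSERT INTO points (series, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

session.execute(alert_write, ("cpu.host1", 1700000000, 0.93))
rows = session.execute(dashboard_read, ("cpu.host1",))
```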