Understanding consistency models is fundamental to designing distributed systems. This module covers the entire consistency spectrum, from the strongest (linearizability) to the weakest (eventual consistency).
LINEARIZABILITY
───────────────

DEFINITION:
Every operation appears to execute instantaneously at some point
between its invocation and response (the linearization point).

PROPERTIES:
1. Operations are atomic (happen at a single point in time)
2. Real-time ordering is preserved
3. All clients see the same order of operations

EXAMPLE - Linearizable:

Time ────────────────────────────────────────────────────►

Client A: [──write(x=1)──]
                  ↓ linearization point
Client B:                    [──read(x)──] → returns 1 ✓

After A's write completes, B must see x=1

EXAMPLE - NOT Linearizable:

Time ────────────────────────────────────────────────────►

Client A: [──write(x=1)──]
Client B:                    [──read(x)──] → returns 0 ✗
Client C:                    [──read(x)──] → returns 1

B reads after A completes but sees the old value - VIOLATION!
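The real-time rule from the examples above can be checked mechanically. A minimal sketch (the tuple layout and function name are illustrative, not from any particular library): a read that starts after a write has completed must not return the pre-write value.

```python
def violates_real_time(write_op, read_op, old_value):
    """write_op/read_op are (invoke_ts, response_ts, value) tuples.
    Flags the linearizability violation shown above: a read that
    STARTS after the write RESPONDS still returns the old value."""
    w_invoke, w_response, written = write_op
    r_invoke, r_response, returned = read_op
    starts_after_write = r_invoke > w_response
    return starts_after_write and returned == old_value

# First diagram: B reads after A's write completes and sees 1 -> OK
assert not violates_real_time((0, 5, 1), (6, 8, 1), old_value=0)
# Second diagram: B reads after A's write completes but sees 0 -> violation
assert violates_real_time((0, 5, 1), (6, 8, 0), old_value=0)
```

Note that Client C seeing the new value is irrelevant: a single stale read after the write's response point is enough to rule out linearizability.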
Real-world implementations:

ZooKeeper:

Linearizable writes:
- All writes go through the leader
- The leader assigns a sequential zxid
- A write is linearized once a majority acknowledges it

Reads:
- sync() + read for linearizable reads
- Regular reads may be stale (sequential consistency)
Raft-based stores (e.g., etcd):

Fully linearizable:
- Uses Raft for consensus
- Reads and writes go through the leader
- A linearizable-read option for guaranteed fresh reads

Trade-off: higher latency for reads
Spanner:

External consistency (stronger than linearizability):
- Uses TrueTime for global ordering
- Commit-wait protocol
- Even cross-datacenter transactions are linearizable

Trade-off: requires GPS/atomic-clock infrastructure
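The commit-wait idea can be sketched with a simulated TrueTime clock. Everything here is an illustrative mock, not Spanner's API; the uncertainty bound epsilon is an arbitrary value chosen for the demo.

```python
import time

class TrueTime:
    """Mock TrueTime: now() returns an uncertainty interval
    [earliest, latest] of half-width epsilon around wall time."""
    def __init__(self, epsilon=0.005):
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return (t - self.epsilon, t + self.epsilon)

    def after(self, ts):
        # True only once ts is guaranteed to be in the past on every node
        earliest, _ = self.now()
        return earliest > ts

def commit_wait(tt, commit_ts):
    # Commit-wait: hold the commit until TrueTime guarantees commit_ts
    # has passed everywhere, so timestamp order matches real-time order
    # (external consistency).
    while not tt.after(commit_ts):
        time.sleep(0.001)

tt = TrueTime()
_, commit_ts = tt.now()      # take the latest bound as the commit timestamp
commit_wait(tt, commit_ts)   # blocks for roughly 2 * epsilon
assert tt.after(commit_ts)
```

The key design point: the blocking time is proportional to clock uncertainty, which is why Spanner invests in GPS/atomic clocks to keep epsilon small.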
COST OF LINEARIZABILITY
───────────────────────

LATENCY:
- Writes must be acknowledged by a majority
- Reads may need to contact the leader or perform a sync
- Cross-region: 50-200ms per operation

AVAILABILITY:
- Network partition → one side cannot make progress
- Leader failure → election delay (seconds)
- CAP theorem: cannot have both C and A during a partition

THROUGHPUT:
- All operations serialized through the leader
- Consensus overhead per write
- Read scalability is limited without compromising consistency

WHEN TO USE:
✓ Leader election
✓ Distributed locks
✓ Unique ID generation
✓ Configuration management
✓ Financial transactions

WHEN TO AVOID:
✗ High-throughput read-heavy workloads
✗ Geographically distributed systems with low-latency requirements
✗ Systems where availability is paramount
SEQUENTIAL CONSISTENCY
──────────────────────

DEFINITION:
- Operations from each client appear in program order
- All clients see the SAME global order
- BUT: the order may not respect real-time

DIFFERENCE FROM LINEARIZABLE:
- Linearizable: real-time order matters
- Sequential: only program order per client matters

EXAMPLE - Sequentially Consistent but NOT Linearizable:

Real-time:
Client A: write(x=1) ────────────────────────────────────
Client B: ──────────────────── read(x) → 0

Possible sequential order: B.read(x), A.write(x=1)
This is valid because B's read started before A's write completed!

For LINEARIZABILITY: B must see x=1 if the read starts after the write ends

USE CASES:
- Multi-core CPU memory models
- ZooKeeper reads (without sync)
- Some distributed databases
CAUSAL CONSISTENCY
──────────────────

DEFINITION:
If operation A causally precedes operation B,
then all nodes see A before B.

CAUSALITY RULES:
1. Program order: in the same thread, A before B → A causes B
2. Message passing: a send precedes its receive → send causes receive
3. Transitivity: A→B and B→C implies A→C

EXAMPLE - Causally Consistent:

Alice posts: "I got the job!"
Bob (seeing Alice's post): "Congratulations!"

Carol must see Alice's post before Bob's reply (causal order)
Carol may see unrelated posts in any order (concurrent = no causal order)

CAUSALLY RELATED vs CONCURRENT:

Client A: W(x=1) ────────────────────────────
                 ↘
Client B:          R(x=1) ── W(y=2)
          ↑ causally related (B read A's write)

Client A: W(x=1) ────────────────────────────
Client B: W(y=2) ────────────────────────────
          ↑ concurrent (no communication)
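The causally-related vs. concurrent distinction above is exactly what vector clocks capture. A minimal sketch (clocks are plain dicts keyed by client; the helper names are ours):

```python
def happens_before(vc_a, vc_b):
    """vc_a causally precedes vc_b iff vc_a <= vc_b componentwise
    and the two clocks differ somewhere."""
    return all(vc_a[k] <= vc_b[k] for k in vc_a) and vc_a != vc_b

def concurrent(vc_a, vc_b):
    # Neither clock dominates the other -> no causal order
    return not happens_before(vc_a, vc_b) and not happens_before(vc_b, vc_a)

# B read A's write, so B merged A's clock before writing y=2:
a_write = {"A": 1, "B": 0}
b_reply = {"A": 1, "B": 1}
assert happens_before(a_write, b_reply)   # causally related

# Two writes with no communication between the clients:
a_only = {"A": 2, "B": 0}
b_only = {"A": 0, "B": 1}
assert concurrent(a_only, b_only)         # concurrent
```

A causally consistent store must deliver `a_write` before `b_reply` everywhere, but may deliver `a_only` and `b_only` in either order.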
EVENTUAL CONSISTENCY
────────────────────

DEFINITION:
If no new updates are made, eventually all replicas converge
to the same value.

THE WEAKEST USEFUL GUARANTEE:
- No guarantee about WHEN convergence happens
- No guarantee about the order of updates
- During updates, different replicas may return different values

TIMELINE:

Update ─────────────────────────────────────► Convergence
       ↑ inconsistency window ↑

Replica A: [x=1] ─────────────────── [x=5] ──── [x=5]
Replica B: [x=1] ── [x=5] ────────────────────── [x=5]
Replica C: [x=1] ──────────────────────── [x=5] ─ [x=5]

During the inconsistency window:
- Read from A: x=1
- Read from B: x=5
- The application must handle this!
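One common way to guarantee the convergence part is last-writer-wins: each update carries a timestamp and every replica keeps the highest-timestamped one. A minimal sketch (class and names are illustrative):

```python
class LWWRegister:
    """Last-writer-wins register: replicas converge once they have
    seen the same set of updates, regardless of delivery order."""
    def __init__(self):
        self.value, self.ts = None, 0

    def apply(self, value, ts):
        # Keep only the update with the highest timestamp
        if ts > self.ts:
            self.value, self.ts = value, ts

updates = [("x=1", 1), ("x=5", 2)]
a, b = LWWRegister(), LWWRegister()
for u in updates:            # replica A receives updates in order
    a.apply(*u)
for u in reversed(updates):  # replica B receives them out of order
    b.apply(*u)
assert a.value == b.value == "x=5"   # same final state either way
```

During the inconsistency window (before both replicas have both updates), reads may still disagree; LWW only guarantees where the replicas end up, not when.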
SESSION GUARANTEES
──────────────────

Session guarantees provide consistency within a single client session.
Weaker than linearizability but practical for many applications.

1. READ YOUR WRITES (RYW)
   After a write, subsequent reads by the same client see the write.

2. MONOTONIC READS
   If a client reads value v, subsequent reads won't return older values.

3. MONOTONIC WRITES
   Writes by a client are applied in order on all replicas.

4. WRITES FOLLOW READS
   A write following a read is ordered after that read on all replicas.
To ensure a client sees their own update immediately, even if the database is eventually consistent:
Write: The client performs a write and receives a Version Token (or Timestamp/LSN) in the response.
Storage: The client stores this token in their session (e.g., a cookie or local storage).
Read: When reading, the client sends this token. The server ensures the read replica has reached at least that version before responding. If the replica is lagging, the server can wait for it to catch up, route the request to an up-to-date replica, or fall back to the leader/primary.
To prevent the “Time Travel” bug (where a user sees a post, refreshes, and it’s gone because they hit a lagging replica):
The client tracks the Max Version it has seen so far.
Every read request includes this version as a min_version filter.
The load balancer or database proxy ensures that the request only hits replicas that are at or beyond this version.
class SessionClient:
    """
    Pseudo-code for a client providing Read-Your-Writes and Monotonic Reads.
    """
    def __init__(self, db_cluster):
        self.db = db_cluster
        self.last_version_seen = 0

    def write(self, key, value):
        response = self.db.put(key, value)
        # Update session metadata from the write response
        self.last_version_seen = max(self.last_version_seen, response.version)
        return response

    def read(self, key):
        # Pass the last version seen to ensure monotonic progress
        response = self.db.get(key, min_version=self.last_version_seen)
        self.last_version_seen = max(self.last_version_seen, response.version)
        return response.data
Staff Tip: While session guarantees improve the user experience, they can create Hot Spots. If a single user is extremely active, they may effectively “pin” themselves to a single replica, preventing the load balancer from distributing traffic effectively.
CALM stands for Consistency As Logical Monotonicity. This theorem provides a formal boundary for the CAP theorem: it identifies exactly which programs can be consistent and available without coordination.
Order Independence: In a monotonic system, messages can arrive in any order, be delayed, or be duplicated, and the final result will be the same. This is why CRDTs (Module 15) work—they are mathematically monotonic.
Deterministic Convergence: Multiple replicas receiving different subsets of updates will always “eventually converge” to the same state as they receive more information.
Availability: Monotonic programs are AP (Available under Partitions) because they don’t need to ask other nodes “Do you know anything that would make my current conclusion false?”
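The order-independence property above can be demonstrated with the simplest monotonic program: a grow-only set whose only operation is union. This is an illustrative sketch, not a production CRDT:

```python
import itertools

def merge_all(update_batches):
    """Monotonic program: state is a grow-only set and the only
    operation is union, so delivery order and duplication of
    batches cannot affect the final state."""
    state = set()
    for batch in update_batches:
        state |= set(batch)
    return state

batches = [{"a", "b"}, {"b", "c"}, {"d"}]
# Deliver the batches in every possible order:
results = {frozenset(merge_all(p)) for p in itertools.permutations(batches)}
# Every delivery order (even with "b" duplicated) yields the same state
assert results == {frozenset({"a", "b", "c", "d"})}
```

Contrast this with a non-monotonic operation like deletion: removing an element can be "undone" by a late-arriving add, so order suddenly matters and coordination is needed.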
Consider a distributed system where you want to delete a file.
Problem: Deletion is non-monotonic. If Node A deletes file1, but Node B hasn’t seen the delete yet and re-replicates it, file1 “resurrects.”
Solution: Use a monotonic approximation. Instead of deleting, add a “Tombstone” (a record that says ‘this is deleted’). Adding a tombstone is monotonic (you are adding info). The actual space reclamation (purging) is non-monotonic and requires coordination or a background GC process with a grace period.
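The tombstone idea corresponds to a two-phase set: both the add-set and the remove-set only grow, so merges commute. A minimal sketch (class and method names are ours):

```python
class TombstoneSet:
    """Monotonic approximation of deletion (a two-phase set):
    'added' and 'removed' only grow, so replicas can merge in any
    order without resurrecting deleted files."""
    def __init__(self):
        self.added, self.removed = set(), set()

    def add(self, item):
        self.added.add(item)

    def delete(self, item):
        # Tombstone: we ADD information rather than retracting it
        self.removed.add(item)

    def live(self):
        return self.added - self.removed

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed

a, b = TombstoneSet(), TombstoneSet()
a.add("file1"); b.add("file1")
a.delete("file1")            # A tombstones file1
b.merge(a)                   # B learns of the delete
a.merge(b)                   # re-replication from B cannot undo the tombstone
assert "file1" not in a.live() and "file1" not in b.live()
```

Note the trade-off the text mentions: tombstones accumulate forever unless purged, and purging is the non-monotonic step that needs coordination or a grace-period GC.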
Staff+ Tip: When designing a high-scale system, always ask: “Can I refactor this non-monotonic operation into a monotonic one?” If you can, you remove the need for expensive Paxos/Raft coordination.
9.2 Linearizability vs. Sequential Consistency (The Proof of Non-Composability)
A critical (and often asked about) property of linearizability is that it is composable:
If object X is linearizable and object Y is linearizable, then the combined system (X,Y) is also linearizable.
Sequential consistency is NOT composable. You can have two sequentially consistent objects that, when used together, violate sequential consistency. This is why multi-core memory models (which are often sequentially consistent) are so difficult to reason about at scale.
Setup: Jepsen spins up a cluster of nodes (e.g., 5 nodes).
Client: A set of clients perform operations (reads, writes, CAS) on the cluster.
Nemesis: A special process that causes “havoc”:
Network partitions (iptables drops).
Clock skew (ntpdate jumps).
Process crashes (kill -9).
Checker: After the test, Jepsen analyzes the history of operations to see if they violate the claimed consistency model (e.g., using Knossos for linearizability).
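The checker step can be sketched as a brute-force linearizability test for a single register, in the spirit of Knossos (real checkers use much smarter search; the history encoding here is ours):

```python
import itertools

def linearizable(history, initial=0):
    """history: list of (invoke_ts, response_ts, op, value) with op in
    {"w", "r"}. Tries every ordering of the operations that respects
    real-time precedence, replaying each against a sequential register."""
    n = len(history)
    for perm in itertools.permutations(range(n)):
        pos = {idx: p for p, idx in enumerate(perm)}
        # Real-time rule: if op i responded before op j was invoked,
        # i must come before j in the linearization.
        if any(history[i][1] < history[j][0] and pos[i] > pos[j]
               for i in range(n) for j in range(n)):
            continue
        # Replay this candidate order against a sequential register
        value, ok = initial, True
        for idx in perm:
            _, _, op, v = history[idx]
            if op == "w":
                value = v
            elif value != v:
                ok = False
                break
        if ok:
            return True       # found a valid linearization
    return False              # no valid order exists -> violation

# The NOT-linearizable example: B reads 0 after A's write(x=1) completed
bad = [(0, 5, "w", 1), (6, 8, "r", 0), (6, 8, "r", 1)]
assert not linearizable(bad)
# If B instead reads 1, a valid linearization exists
good = [(0, 5, "w", 1), (6, 8, "r", 1), (6, 8, "r", 1)]
assert linearizable(good)
```

The factorial search space is why this only works for tiny histories; production checkers prune aggressively and still make history length the main cost driver of a Jepsen run.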
“When discussing consistency, I distinguish between single-object models like linearizability and multi-object models like serializability. Linearizability provides the strongest recency guarantee but at the cost of availability during partitions—a trade-off described by the CAP theorem. For high-availability systems, I look towards Causal Consistency, which is the strongest model achievable without global coordination. I also apply the CALM theorem to identify non-monotonic operations that strictly require coordination. Finally, I verify these systems using frameworks like Jepsen to ensure that under network partitions or clock skew, the safety invariants of the chosen model still hold.”