Before diving into consensus protocols and complex architectures, you must understand the fundamental challenges that make distributed systems hard. This chapter is the foundation everything else rests on — if your mental model of networks, time, and failure is wrong, every design decision you make downstream will be subtly broken.
Every distributed systems engineer must internalize these. Peter Deutsch (with additions by James Gosling) identified these assumptions that developers new to distributed systems unconsciously make. Each one is a trap — code that works perfectly on your laptop will fail in production because of one or more of these false beliefs:
1. The Network is Reliable
Reality: Networks fail constantly.
Common failures:- Packet loss (1 in 1000 typical, worse under load)- Network partitions (entire segments become unreachable)- Cable cuts (more common than you'd think)- Switch/router failuresReal incident: AWS US-East-1 2011- Network configuration change- Cascading failures- Multi-hour outage affecting Netflix, Reddit, Quora
Defense: Implement retries, timeouts, circuit breakers, and idempotency.
2. Latency is Zero
Reality: Every network call has latency.
Operation
Time
Same datacenter
0.5ms
Cross-region (US East ↔ West)
40-65ms
Cross-continent (US ↔ Europe)
70-120ms
Cross-pacific (US ↔ Asia)
100-200ms
Impact: A service with 10 sequential remote calls adds 500-2000ms latency.Defense: Cache aggressively, parallelize calls, use async processing.
3. Bandwidth is Infinite
Reality: Bandwidth is expensive and limited.
Example: Transferring 1TB over network10 Gbps link:- Theoretical: 13 minutes- Real-world: 20-30 minutes (overhead, congestion)AWS Data Transfer Costs:- Intra-region: $0.01/GB- Cross-region: $0.02/GB- To internet: $0.09/GB- 1PB/month internet egress: $90,000
Defense: Compress data, use efficient serialization (protobuf), batch requests.
4. The Network is Secure
Reality: Assume the network is compromised.Attack vectors:
Man-in-the-middle attacks
DNS spoofing
BGP hijacking (internet routing attacks)
Insider threats
Defense: mTLS everywhere, zero-trust architecture, encrypt at rest and in transit.
Defense: Use service discovery, health checks, load balancer draining.
6. There is One Administrator
Reality: Multiple teams, policies, and even companies.In a typical microservices architecture:
Platform team manages Kubernetes
Each service team manages their services
Security team manages policies
Network team manages infrastructure
Defense: Clear ownership, documented interfaces, SLAs between teams.
7. Transport Cost is Zero
Reality: Data transfer costs real money.
Cloud Provider
Egress Cost
AWS
$0.09/GB
GCP
$0.12/GB
Azure
$0.087/GB
High-traffic example: 1PB egress = $90,000/monthDefense: Keep data close to compute, use CDNs, compress aggressively.
8. The Network is Homogeneous
Reality: Different hardware, protocols, and vendors everywhere.
Heterogeneity:├── Different server generations (various CPU, RAM)├── Different network equipment (Cisco, Juniper, etc.)├── Different protocols (HTTP/1.1, HTTP/2, gRPC, QUIC)├── Different cloud providers (AWS, GCP, Azure)└── Different latency profiles
Defense: Abstract hardware differences, use consistent protocols, test on diverse environments.
A network partition occurs when nodes can’t communicate with each other:
BEFORE PARTITION: ┌────────────────┐ │ Network │ └────────────────┘ ↑ ↑ ↑ ↑ │ │ │ │ ┌────┼────┼────┼────┼────┐ │ │ │ │ │ │ [A] [B] [C] [D] [E] [F] All nodes can communicateDURING PARTITION: Partition 1 ║ Partition 2 ┌─────────────┐ ║ ┌─────────────┐ │ Network │ ║ │ Network │ └─────────────┘ ║ └─────────────┘ ↑ ↑ ↑ ║ ↑ ↑ ↑ │ │ │ PARTITION │ │ │ [A] [B] [C] ║ [D] [E] [F] A, B, C can talk D, E, F can talk to each other to each other but NOT to D, E, F but NOT to A, B, C
Key insight: During a partition, you must choose between:
Availability: Both partitions continue serving requests (might diverge)
Consistency: Reject requests from the minority partition
Think of it like two bank branches whose phone line gets cut. Do they keep processing transactions independently (available but potentially inconsistent — both might approve overdrafts) or do they stop until the phone line is restored (consistent but unavailable)? There is no option where both branches stay open AND guarantee they agree on the account balance. This is the CAP theorem in a nutshell.
def send_message(message): network.send(message) # No retry on failure # Message delivered 0 or 1 times
Use case: Metrics, logs (losing some is acceptable)Risk: Message loss
Retry until acknowledged
def send_message(message): while True: network.send(message) if ack_received(): break # Retry on timeout # Message delivered 1 or more times
Use case: Most systems (with idempotency)Risk: Duplicate processing
Holy grail - each message processed exactly once
def send_message(message): message_id = generate_unique_id() while True: network.send(message, id=message_id) if ack_received(): break # Receiver deduplicates by message_id
Use case: Financial transactions, critical operationsReality: Implemented as at-least-once + deduplication
Common pitfall: True exactly-once delivery is impossible in theory (a consequence of the Two Generals Problem). What systems actually implement is “effectively exactly-once” — at-least-once delivery combined with idempotent processing. Kafka’s “exactly-once semantics” works by assigning producer IDs and sequence numbers so the broker can deduplicate. The message may be sent multiple times, but it is processed exactly once. Understanding this distinction is critical for interviews.
┌─────────────────────────────────────────────────────────────────────────────┐│ FAILURE DETECTION APPROACHES │├─────────────────────────────────────────────────────────────────────────────┤│ ││ HEARTBEAT-BASED ││ ─────────────── ││ Node A ──[heartbeat]──> Node B ││ every 1 sec ││ ││ If no heartbeat for 5 seconds → Mark as failed ││ ││ Problems: ││ ├── Too aggressive: False positives (just slow, not dead) ││ └── Too conservative: Slow detection of actual failures ││ ││ PHI ACCRUAL FAILURE DETECTOR (Cassandra uses this) ││ ───────────────────────────── ││ Instead of binary (alive/dead), output a suspicion level: ││ ││ φ = -log₁₀(probability node is still alive) ││ ││ φ = 1 → 90% confident node is alive ││ φ = 2 → 99% confident node is alive │**This is one of the most important topics in distributed systems.**### Why Physical Clocks Fail
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHYSICAL CLOCK PROBLEMS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CLOCK SKEW (different machines show different times) │
│ ──────────── │
│ Machine A: 10:00:00.000 │
│ Machine B: 10:00:00.150 (150ms ahead) │
│ Machine C: 09:59:59.850 (150ms behind) │
│ │
│ NTP SYNCHRONIZATION │
│ ─────────────────── │
│ NTP accuracy: 1-10ms on LAN, 10-100ms over internet │
│ During sync: Clock can jump forward or backward! │
│ │
│ CLOCK DRIFT │
│ ──────────── │
│ Quartz crystals drift: ~50ppm (50 microseconds per second) │
│ After 1 hour: 180ms drift possible │
│ │
│ LEAP SECONDS │
│ ──────────── │
│ Added to compensate for Earth’s rotation slowing │
│ 23:59:59 → 23:59:60 → 00:00:00 │
│ Famously caused issues at Reddit, LinkedIn, Mozilla (2012) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
### Lamport TimestampsLeslie Lamport's logical clock solution (1978):**Rules:**1. Each process maintains a counter (starts at 0)2. Before each event, increment the counter3. When sending: attach current counter to message4. When receiving: counter = max(local, received) + 1**Example:**- Process A sends message 1 to B, then B receives it as 2- Process B sends message 3 to C, then C receives it as 5- Process A sends message 4 to B, then B receives it as 6- Process B sends message 7 to C, then C receives it as 8**Property:** If A happened before B, then L(A) < L(B)**Limitation:** L(A) < L(B) does NOT mean A happened before B (no causality)**Implementation**:```pythonclass LamportClock: def __init__(self): self.counter = 0 def increment(self): """Call before any local event""" self.counter += 1 return self.counter def send(self): """Call when sending a message""" self.counter += 1 return self.counter def receive(self, received_timestamp): """Call when receiving a message""" self.counter = max(self.counter, received_timestamp) + 1 return self.counter
In many textbooks, “Vector Clocks” and “Version Vectors” are used interchangeably. However, at the Staff level, you must distinguish between them:
Vector Clocks (Causality): Track the relationship between events in a distributed system. Every interaction (send/receive) increments the clock. They are used to build a partial order of all operations.
Version Vectors (Conflict Tracking): Track the relationship between versions of data. They only increment when data is updated. They are used in systems like DynamoDB and Riak to detect if two versions of a document are in conflict (siblings).
Feature
Vector Clocks
Version Vectors
Goal
Order events (A happened before B)
Detect data conflicts
Increment
Every event (internal + communication)
Only on data modification
Overhead
High (grows with event count)
Low (grows with replica count)
Staff Tip: If you only need to know if two versions of a file conflict, use Version Vectors. If you need to know if a message was “caused” by another message (e.g., in a distributed debugger), use Vector Clocks.
class VectorClock: def __init__(self, node_id, num_nodes): self.node_id = node_id self.clock = [0] * num_nodes def increment(self): self.clock[self.node_id] += 1 return self.clock.copy() def send(self): self.clock[self.node_id] += 1 return self.clock.copy() def receive(self, received_clock): for i in range(len(self.clock)): self.clock[i] = max(self.clock[i], received_clock[i]) self.clock[self.node_id] += 1 return self.clock.copy() def happens_before(self, other_clock): """Returns True if this clock happened before other_clock""" return (all(s <= o for s, o in zip(self.clock, other_clock)) and any(s < o for s, o in zip(self.clock, other_clock))) def concurrent(self, other_clock): """Returns True if events are concurrent""" return (not self.happens_before(other_clock) and not VectorClock.compare(other_clock, self.clock))
┌─────────────────────────────────────────────────────────────────────────────┐│ HYBRID LOGICAL CLOCKS │├─────────────────────────────────────────────────────────────────────────────┤│ ││ HLC = (physical_time, logical_counter) ││ ││ PROPERTIES: ││ 1. Always greater than physical time (useful for debugging) ││ 2. Preserves causality like logical clocks ││ 3. Bounded skew from physical time ││ ││ USED BY: ││ ├── CockroachDB ││ ├── MongoDB ││ └── Many distributed databases ││ ││ ALGORITHM: ││ ───────── ││ On local event or send: ││ l' = max(l, physical_time) ││ if l' == l: c' = c + 1 ││ else: c' = 0 ││ l = l', c = c' ││ ││ On receive(m.l, m.c): ││ l' = max(l, m.l, physical_time) ││ if l' == l == m.l: c' = max(c, m.c) + 1 ││ else if l' == l: c' = c + 1 ││ else if l' == m.l: c' = m.c + 1 ││ else: c' = 0 ││ l = l', c = c' ││ │└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐│ PARTIAL FAILURES │├─────────────────────────────────────────────────────────────────────────────┤│ ││ SCENARIO: Node A calls Node B ││ ││ A ──[request]──> B ││ │ ││ (silence) ││ ││ What happened? ││ ├── Request lost in network? ││ ├── B crashed before processing? ││ ├── B processed but crashed before responding? ││ ├── B responded but response lost? ││ └── B is just slow? ││ ││ A DOESN'T KNOW! ││ ││ IMPLICATIONS: ││ ───────────── ││ 1. Cannot distinguish slow from dead ││ 2. Operations might have executed (or not) ││ 3. Must design for uncertainty ││ 4. Idempotency becomes essential ││ ││ PATTERN: Retry with idempotency keys ││ ───────────────────────────────── ││ request_id = "abc123" ││ if already_processed(request_id): ││ return cached_response ││ else: ││ process_and_store(request_id, response) ││ │└─────────────────────────────────────────────────────────────────────────────┘
Gray failures are the distributed systems equivalent of a slow gas leak — everything seems fine until suddenly it is not. They are far more common and insidious than clean crash failures, and they are the cause of most prolonged production outages:
Examples of gray failures:├── CPU running at 50% speed due to thermal throttling├── One of 10 disks has 100x higher latency├── Memory errors causing random crashes├── Network dropping 1% of packets (enough to cause retries, not enough to trigger alerts)├── Software bug triggered only by specific input (works for 99.99% of requests)└── Gradual memory leak over hours (OOM kill after 3 days)Detection challenges:├── Metrics look "mostly okay" (averages hide the problem)├── Intermittent symptoms (works on retry, so nobody investigates)├── Hard to reproduce (depends on specific timing or load)└── May only affect some requests (the unlucky 0.1% routed to the sick node)
Production wisdom: Gray failures are why you must monitor percentile latencies (p99, p99.9), not just averages or error rates. A node with a failing disk might serve 99% of requests normally but add 10 seconds of latency to the rest. The average latency barely moves, but 1% of your users have a terrible experience. Your dashboards should make this visible.
Byzantine failures are the most challenging to handle:
┌─────────────────────────────────────────────────────────────────────────────┐│ BYZANTINE FAULT TOLERANCE │├─────────────────────────────────────────────────────────────────────────────┤│ ││ THE BYZANTINE GENERALS PROBLEM ││ ───────────────────────────── ││ Multiple generals must agree on attack/retreat ││ Some generals may be traitors (send different messages) ││ How do loyal generals reach consensus? ││ ││ REQUIREMENTS FOR BFT: ││ ──────────────────── ││ To tolerate f Byzantine nodes: Need 3f + 1 total nodes ││ - 1 faulty node: Need 4 nodes minimum ││ - 2 faulty nodes: Need 7 nodes minimum ││ - 3 faulty nodes: Need 10 nodes minimum ││ ││ WHY 3f + 1? ││ ────────── ││ f nodes may lie, f nodes may be unreachable ││ Need f+1 correct responses to form a majority ││ (3f+1) - f (faulty) - f (unreachable) = f+1 (quorum) ││ ││ PRACTICAL BFT ALGORITHMS: ││ ─────────────────────── ││ ├── PBFT (Practical BFT) - Castro & Liskov, 1999 ││ ├── HotStuff - Used by Facebook's Libra/Diem ││ ├── Tendermint - Used in blockchain systems ││ └── RBFT - Redundant BFT for improved performance ││ │└─────────────────────────────────────────────────────────────────────────────┘
Performance Trade-off: BFT protocols are expensive! PBFT requires O(n²) messages per consensus round. Most internal systems use simpler crash-fault tolerant protocols (Raft, Paxos) which only require 2f+1 nodes.
External Consistency is a stronger, global guarantee, famously used by Google Spanner. It combines Linearizability with Serializability across multiple shards/objects and respects global wall-clock time.
Feature
Linearizability
External Consistency
Scope
Single object/key
Multi-shard transactions
Ordering
System-observed order
Global wall-clock order
Implementation
Paxos/Raft on one shard
TrueTime / Commit-Wait / HLC
The “Why”:
In a multi-shard system, two transactions T1 and T2 might affect different shards. If T1 finishes at 10:00:01 and a client then starts T2 at 10:00:02, T2must see the effects of T1. Without external consistency (e.g., using only local Raft groups with skewed clocks), T2 might get a timestamp that is logically earlier than T1, leading to causal violations.
Module 6: Modern Infrastructure & Serverless Internals
To achieve the “Principal” level of understanding, one must look beyond the logical protocols and into the physical isolation models that power modern clouds. The shift from monolithic VMs to Micro-VMs has fundamentally changed how we scale distributed systems.
Serverless (AWS Lambda, Google Cloud Functions) promises “no servers,” but in reality, it requires the most complex server management in existence. The challenge is The Cold Start Problem vs. Isolation Security.
Traditional VMs (EC2): Strong isolation, but slow to boot (30s+). Too slow for “on-demand” execution.
Containers (Docker/K8s): Fast to start, but weak isolation (shared kernel). Dangerous for multi-tenant code execution.
AWS Firecracker is the technology behind Lambda and Fargate. It uses KVM (Kernel-based Virtual Machine) but stripped down to the bare essentials.
Minimal Device Model: Firecracker drops legacy hardware support (no USB, no video). It only provides:
Net (Network)
Block (Storage)
Serial (Console)
Balloon (Memory management)
Snapshot-and-Restore: Instead of booting a kernel, Firecracker can restore from a Memory Snapshot. This reduces “boot” time from seconds to ~10ms.
Jailer: A secondary security layer that uses cgroups, namespaces, and seccomp to ensure that even if a Micro-VM is compromised, it cannot touch the host or other VMs.
Statelessness is Mandatory: Because the Micro-VM can be killed or recycled at any moment.
Sidecar Overhead: Traditional sidecars (Service Mesh) add too much latency for 50ms functions. We move towards Proxyless Mesh or Library-based approaches.
Connection Pooling: Traditional database connection pools fail at serverless scale. We use RDS Proxy or HTTP-based database protocols (Data API).
Lease-based: Node must renew lease to be considered alive
In practice, use adaptive timeouts based on historical latency.
Q: Explain vector clocks vs Lamport timestamps
Answer:
Lamport: Single counter, preserves “happens-before” one direction only
Vector clocks: One counter per node, can detect concurrent events
Use Lamport when you just need ordering.
Use vector clocks when you need to detect conflicts.
Q: Is your system CP or AP?
Answer: Frame it as a trade-off discussion:
First, acknowledge CAP only applies during partitions
Discuss what consistency level you actually need
Explain PACELC for normal operation
Give examples: “For our payment system, we’re CP because incorrect balance is unacceptable. For user preferences, we’re AP because eventual consistency is fine.”
The FLP impossibility result says consensus is impossible in an asynchronous system. Yet we use Raft and Paxos every day. How do you reconcile this?
Strong Answer:
FLP proves that no deterministic algorithm can guarantee consensus in a fully asynchronous system where even one process may crash. The key word is “guarantee” — it is an impossibility of absolute liveness, not of safety.
Practical systems like Raft and Paxos work around FLP by introducing partial synchrony assumptions. Specifically, they use timeouts to detect suspected failures. Once you add a timeout, you are no longer in a purely asynchronous model — you are assuming that messages will eventually be delivered within some bound. FLP does not apply to partially synchronous systems.
The trade-off is that during periods of true asynchrony (extreme network delays, GC pauses that exceed the timeout), these protocols may fail to make progress (liveness violation) but they never violate safety (they never agree on two different values). This is exactly the right trade-off for production systems: it is acceptable to be temporarily unavailable, but it is never acceptable to corrupt data.
Another approach is randomized consensus (like Ben-Or’s protocol), which circumvents FLP by using randomization rather than determinism. The expected number of rounds is finite, but no single execution is guaranteed to terminate.
Follow-up: Can you describe a scenario where the FLP result actually manifests in a production system?Consider a Raft cluster where the leader experiences a long GC pause. During the pause, followers time out and start an election. But the GC pause ends and the old leader resumes sending heartbeats before the election completes. Followers receive heartbeats from the old leader and reset their election timers, canceling the election. If this cycle repeats — GC pause, election starts, GC ends, election cancels — the cluster can enter a livelock where no leader is elected and no progress is made. This is FLP manifesting in practice. The mitigation is randomized election timeouts in Raft (150-300ms range), which break symmetry and make it statistically improbable for this livelock to persist. But “statistically improbable” is not “impossible” — FLP reminds us that no amount of engineering eliminates this possibility entirely.
Explain the difference between a network partition, a partial partition, and an asymmetric partition. Why does the distinction matter for system design?
Strong Answer:
A full partition splits the cluster into two groups that cannot communicate at all. This is the classic CAP scenario: each side must independently decide whether to keep serving (AP) or stop (CP). Raft handles this cleanly — the majority side elects a leader, the minority side stops accepting writes.
A partial partition is when some nodes can communicate with some but not all other nodes. For example, in a 5-node cluster, node A can talk to B and C, node D can talk to C and E, but A cannot talk to D or E. Node C acts as a bridge. This is more insidious because quorum-based systems may still form quorums that do not overlap correctly, and different nodes have different views of who is reachable.
An asymmetric partition is when A can send to B but B cannot send to A. TCP will eventually detect this (via ACK timeouts), but UDP-based gossip or heartbeat systems may not. Node A thinks B is alive (because A’s sends succeed), while B thinks A is dead (because B never receives anything).
The distinction matters because most distributed systems are designed for full partitions, but partial and asymmetric partitions are more common in practice (misconfigured firewalls, flaky switches, half-duplex network failures). If your failure detector assumes symmetric communication, an asymmetric partition can cause split-brain.
Follow-up: How would you test your system’s behavior under a partial partition?I would use a network simulation layer (like Jepsen’s iptables-based nemesis or Toxiproxy) that can selectively drop traffic between specific pairs of nodes. The test scenario would create a partition where nodes A-B-C can communicate, D-E can communicate, but C is the only bridge between the two groups. Then I would verify: can the system still make progress? Does it correctly detect the partial partition? Does it degrade gracefully? I would also test the asymmetric case by dropping packets in one direction only using iptables INPUT/OUTPUT rules. The key metrics to watch are: leader election stability, write latency, and most importantly, whether any safety invariants (like linearizability) are violated during the partition.
A colleague says your system should be designed for Byzantine fault tolerance because 'you never know what could go wrong.' How do you respond?
Strong Answer:
I would respectfully push back. BFT is the right tool for a very specific threat model: mutually distrustful participants who may actively lie (blockchains, multi-party financial systems). For the vast majority of internal distributed systems, crash-fault tolerance (CFT) is sufficient and dramatically cheaper.
The cost difference is significant. CFT protocols like Raft need 2f+1 nodes to tolerate f failures, with O(n) message complexity per consensus round. BFT protocols like PBFT need 3f+1 nodes with O(n-squared) message complexity. For a 5-node Raft cluster tolerating 2 failures, you need 5 nodes. For BFT tolerating 2 failures, you need 7 nodes, each exchanging quadratically more messages.
In practice, the failures that cause production outages are crashes, network partitions, disk failures, and software bugs — not malicious nodes. If you are worried about buggy nodes sending corrupt data, you can add checksums and input validation at the application layer for a fraction of the cost of full BFT.
The exception: if you are building a system where different organizations contribute nodes and have economic incentives to cheat, BFT is necessary. This is why blockchain consensus uses BFT variants.
Follow-up: What about the case where a software bug causes a node to send corrupt or inconsistent messages? Is that not effectively Byzantine behavior?This is an excellent point and a real concern. A node with a memory corruption bug or a deserialization error might send valid-looking but incorrect messages — this is sometimes called an “accidental Byzantine” fault. However, the practical response is not to deploy full BFT. Instead, I would use defense-in-depth: checksums on all messages to detect corruption, assertion-based crash detection (if a node detects its own state is inconsistent, it crashes rather than propagating corruption), and Merkle-tree-based anti-entropy to detect data divergence between replicas. These approaches catch accidental Byzantine behavior at a fraction of the cost. For true Byzantine resilience, I would only invest in it for the specific data paths where corruption would be undetectable and catastrophic.
Explain the PACELC theorem and give me an example of a real system for each of the four PACELC categories: PA/EL, PA/EC, PC/EL, and PC/EC.
Strong Answer:
PACELC extends CAP by asking: even when there is no partition, what trade-off does the system make between latency and consistency? The framework is: if Partition, choose Availability or Consistency; Else, choose Latency or Consistency.
PA/EL (Available during partitions, Low latency normally): DynamoDB and Cassandra. During a partition, they keep serving from whichever replicas are reachable. During normal operation, they return responses from the nearest replica without waiting for quorum, giving low latency but eventual consistency.
PA/EC (Available during partitions, Consistent normally): MongoDB with default read concern. During normal operation, reads go to the primary, which is consistent. During a partition, secondaries can still serve reads (with possible staleness), maintaining availability.
PC/EL (Consistent during partitions, Low latency normally): Yahoo’s PNUTS (now largely historical). It blocks writes during partitions to maintain per-record consistency, but during normal operation it routes reads to the nearest replica for low latency using “timeline consistency.”
PC/EC (Consistent during partitions, Consistent normally): Google Spanner. It blocks during partitions (choosing consistency), and during normal operation it uses TrueTime and commit-wait to provide external consistency, accepting higher latency (1-7ms commit wait) for global ordering guarantees.
Follow-up: Can a single system offer different PACELC trade-offs for different operations?Absolutely, and this is how most sophisticated production systems work. Cassandra is the canonical example: you can set consistency level per query. A write with QUORUM and a read with QUORUM gives you PC/EC behavior for that specific operation. A write with ONE and a read with ONE gives you PA/EL behavior. The same cluster, the same data, but different consistency-latency trade-offs per request. Spanner does something similar with its read-only transactions: a “strong read” gives you external consistency (PC/EC), while a “stale read” with a bounded staleness of 10 seconds gives you lower latency (closer to PC/EL). The design insight is that consistency should be a dial, not a switch — and different use cases within the same application should be able to turn that dial independently.