Defense: Use service discovery, health checks, load balancer draining.
6. There is One Administrator
Reality: Multiple teams, policies, and even companies. In a typical microservices architecture:
- Platform team manages Kubernetes
- Each service team manages their services
- Security team manages policies
- Network team manages infrastructure
Defense: Clear ownership, documented interfaces, SLAs between teams.
7. Transport Cost is Zero
Reality: Data transfer costs real money.
| Cloud Provider | Egress Cost |
| --- | --- |
| AWS | $0.09/GB |
| GCP | $0.12/GB |
| Azure | $0.087/GB |
High-traffic example: 1PB egress = $90,000/month

Defense: Keep data close to compute, use CDNs, compress aggressively.
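To sanity-check the arithmetic, a quick sketch (using the AWS rate from the table above; real rates vary by region and volume tier):

```python
def egress_cost_usd(gigabytes, rate_per_gb=0.09):
    """Rough monthly egress cost at a flat per-GB rate (illustrative only)."""
    return gigabytes * rate_per_gb

# 1 PB ≈ 1,000,000 GB at $0.09/GB
print(egress_cost_usd(1_000_000))  # 90000.0
```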
8. The Network is Homogeneous
Reality: Different hardware, protocols, and vendors everywhere.
```
Heterogeneity:
├── Different server generations (various CPU, RAM)
├── Different network equipment (Cisco, Juniper, etc.)
├── Different protocols (HTTP/1.1, HTTP/2, gRPC, QUIC)
├── Different cloud providers (AWS, GCP, Azure)
└── Different latency profiles
```
Defense: Abstract hardware differences, use consistent protocols, test on diverse environments.
A network partition occurs when nodes can’t communicate with each other:
```
BEFORE PARTITION:

        ┌────────────────┐
        │    Network     │
        └────────────────┘
          ↑   ↑   ↑   ↑
          │   │   │   │
    [A] [B] [C] [D] [E] [F]

    All nodes can communicate

DURING PARTITION:

    Partition 1       ║      Partition 2
  ┌─────────────┐     ║    ┌─────────────┐
  │   Network   │     ║    │   Network   │
  └─────────────┘     ║    └─────────────┘
    ↑    ↑    ↑       ║      ↑    ↑    ↑
    │    │    │   PARTITION  │    │    │
   [A]  [B]  [C]      ║     [D]  [E]  [F]

   A, B, C can talk         D, E, F can talk
   to each other            to each other
   but NOT to D, E, F       but NOT to A, B, C
```
Key insight: During a partition, you must choose between:
- Availability: Both partitions continue serving requests (might diverge)
- Consistency: Reject requests from the minority partition
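A minimal sketch of the consistency-favoring (CP) choice, assuming a simple majority-quorum rule (the function names and reachability check are illustrative, not from any particular system):

```python
def can_reach_majority(reachable_peers, cluster_size):
    """True if this node plus the peers it can reach form a strict majority."""
    return (len(reachable_peers) + 1) > cluster_size // 2

def handle_write(key, value, reachable_peers, cluster_size, store):
    """CP behaviour: a node on the minority side of a partition rejects writes."""
    if not can_reach_majority(reachable_peers, cluster_size):
        raise RuntimeError("rejecting write: this node cannot reach a majority")
    store[key] = value  # an AP system would accept the write here and reconcile later

store = {}
# Before the partition: node A reaches all 5 peers -> write accepted
handle_write("x", 1, reachable_peers=["B", "C", "D", "E", "F"], cluster_size=6, store=store)
# During the 3/3 split above: A reaches only B and C; 3 of 6 is not a majority -> rejected
# handle_write("x", 2, reachable_peers=["B", "C"], cluster_size=6, store=store)  # raises RuntimeError
```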
```
FAILURE DETECTION APPROACHES
────────────────────────────

HEARTBEAT-BASED
───────────────
Node A ──[heartbeat]──> Node B
         every 1 sec

If no heartbeat for 5 seconds → Mark as failed

Problems:
├── Too aggressive: False positives (just slow, not dead)
└── Too conservative: Slow detection of actual failures

PHI ACCRUAL FAILURE DETECTOR (Cassandra uses this)
──────────────────────────────────────────────────
Instead of binary (alive/dead), output a suspicion level:

φ = -log₁₀(probability the node is still alive)

φ = 1 → 90% confident the node has failed (10% chance it is just slow)
φ = 2 → 99% confident the node has failed
```
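A minimal sketch of a phi accrual detector, assuming heartbeat inter-arrival times follow an exponential distribution (real implementations, like Cassandra's, use a richer statistical model):

```python
import math
import time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window_size=100, threshold=8.0):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat inter-arrival times
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self, now=None):
        """Record an incoming heartbeat."""
        now = time.monotonic() if now is None else now
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        """Suspicion level: higher phi means more confident the node has failed."""
        if self.last_heartbeat is None or not self.intervals:
            return 0.0
        now = time.monotonic() if now is None else now
        mean_interval = max(sum(self.intervals) / len(self.intervals), 1e-9)
        elapsed = now - self.last_heartbeat
        # Probability a heartbeat could still arrive this late, under an exponential model
        p_later = math.exp(-elapsed / mean_interval)
        return -math.log10(max(p_later, 1e-12))

    def is_available(self, now=None):
        return self.phi(now) < self.threshold
```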
**This is one of the most important topics in distributed systems.**

### Why Physical Clocks Fail

```
PHYSICAL CLOCK PROBLEMS
───────────────────────

CLOCK SKEW (different machines show different times)
─────────────────────────────────────────────────────
Machine A: 10:00:00.000
Machine B: 10:00:00.150   (150ms ahead)
Machine C: 09:59:59.850   (150ms behind)

NTP SYNCHRONIZATION
───────────────────
NTP accuracy: 1-10ms on LAN, 10-100ms over the internet
During sync: the clock can jump forward or backward!

CLOCK DRIFT
───────────
Quartz crystals drift: ~50ppm (50 microseconds per second)
After 1 hour: 180ms drift possible

LEAP SECONDS
────────────
Added to compensate for Earth's rotation slowing
23:59:59 → 23:59:60 → 00:00:00
Famously caused issues at Reddit, LinkedIn, Mozilla (2012)
```
### Lamport Timestamps

Leslie Lamport's logical clock solution (1978):

**Rules:**
1. Each process maintains a counter (starts at 0)
2. Before each event, increment the counter
3. When sending: attach the current counter to the message
4. When receiving: counter = max(local, received) + 1

**Example** (A, B, C all start at 0):
- A sends a message to B with timestamp 1; B receives it: counter = max(0, 1) + 1 = 2
- B sends a message to C with timestamp 3; C receives it: counter = max(0, 3) + 1 = 4
- A sends another message to B with timestamp 2; B receives it: counter = max(3, 2) + 1 = 4
- B sends another message to C with timestamp 5; C receives it: counter = max(4, 5) + 1 = 6

**Property:** If A happened before B, then L(A) < L(B)

**Limitation:** L(A) < L(B) does NOT mean A happened before B (you cannot infer causality from the timestamps alone)

**Implementation**:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def increment(self):
        """Call before any local event"""
        self.counter += 1
        return self.counter

    def send(self):
        """Call when sending a message"""
        self.counter += 1
        return self.counter

    def receive(self, received_timestamp):
        """Call when receiving a message"""
        self.counter = max(self.counter, received_timestamp) + 1
        return self.counter
```
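A quick driver for the class above, reproducing the trace from the example (three independent clocks stand in for the three processes):

```python
a, b, c = LamportClock(), LamportClock(), LamportClock()

print(b.receive(a.send()))  # A -> B: A sends timestamp 1, B becomes 2
print(c.receive(b.send()))  # B -> C: B sends timestamp 3, C becomes 4
print(b.receive(a.send()))  # A -> B: A sends timestamp 2, B becomes 4
print(c.receive(b.send()))  # B -> C: B sends timestamp 5, C becomes 6
```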
Vector clocks track causality: they tell you whether two events are causally related or concurrent.

Implementation:
```python
class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes

    def increment(self):
        """Call before any local event"""
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def send(self):
        """Call when sending a message; attach the returned clock"""
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def receive(self, received_clock):
        """Merge the sender's clock, then tick our own entry"""
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], received_clock[i])
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def happens_before(self, other_clock):
        """Returns True if this clock's event happened before other_clock"""
        return (all(s <= o for s, o in zip(self.clock, other_clock)) and
                any(s < o for s, o in zip(self.clock, other_clock)))

    def concurrent(self, other_clock):
        """Returns True if the events are concurrent (neither happened before the other)"""
        happened_after = (all(o <= s for s, o in zip(self.clock, other_clock)) and
                          any(o < s for s, o in zip(self.clock, other_clock)))
        return not self.happens_before(other_clock) and not happened_after
```
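A short usage sketch of the class above, showing both a concurrent pair of events and a causally ordered pair:

```python
a = VectorClock(node_id=0, num_nodes=2)
b = VectorClock(node_id=1, num_nodes=2)

snapshot_a = a.increment()           # local event on A: [1, 0]
snapshot_b = b.increment()           # independent event on B: [0, 1]
print(a.happens_before(snapshot_b))  # False
print(a.concurrent(snapshot_b))      # True: neither event causally precedes the other

msg = a.send()                       # A -> B with clock [2, 0]
b.receive(msg)                       # B merges and ticks: [2, 2]
print(a.happens_before(b.clock))     # True: A's send happened before B's receive
```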
```
HYBRID LOGICAL CLOCKS
─────────────────────

HLC = (physical_time, logical_counter)

PROPERTIES:
1. Always greater than or equal to the local physical time (useful for debugging)
2. Preserves causality like logical clocks
3. Bounded skew from physical time

USED BY:
├── CockroachDB
├── MongoDB
└── Many distributed databases

ALGORITHM:
─────────
On local event or send:
  l' = max(l, physical_time)
  if l' == l: c' = c + 1
  else:       c' = 0
  l = l', c = c'

On receive(m.l, m.c):
  l' = max(l, m.l, physical_time)
  if l' == l == m.l:  c' = max(c, m.c) + 1
  else if l' == l:    c' = c + 1
  else if l' == m.l:  c' = m.c + 1
  else:               c' = 0
  l = l', c = c'
```
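A minimal Python sketch of the HLC algorithm above (wall-clock milliseconds stand in for physical_time; this is an illustration, not any particular database's implementation):

```python
import time

class HybridLogicalClock:
    def __init__(self):
        self.l = 0  # max physical time observed so far
        self.c = 0  # logical counter used to break ties

    def _physical_now(self):
        return int(time.time() * 1000)  # wall clock in milliseconds

    def now(self):
        """Timestamp for a local event or an outgoing message."""
        pt = self._physical_now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, m_l, m_c):
        """Merge the timestamp (m_l, m_c) carried by an incoming message."""
        pt = self._physical_now()
        l_new = max(self.l, m_l, pt)
        if l_new == self.l and l_new == m_l:
            c_new = max(self.c, m_c) + 1
        elif l_new == self.l:
            c_new = self.c + 1
        elif l_new == m_l:
            c_new = m_c + 1
        else:
            c_new = 0
        self.l, self.c = l_new, c_new
        return (self.l, self.c)
```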
```
PARTIAL FAILURES
────────────────

SCENARIO: Node A calls Node B

  A ──[request]──> B
        │
    (silence)

What happened?
├── Request lost in network?
├── B crashed before processing?
├── B processed but crashed before responding?
├── B responded but response lost?
└── B is just slow?

A DOESN'T KNOW!

IMPLICATIONS:
─────────────
1. Cannot distinguish slow from dead
2. Operations might have executed (or not)
3. Must design for uncertainty
4. Idempotency becomes essential

PATTERN: Retry with idempotency keys
─────────────────────────────────────
request_id = "abc123"
if already_processed(request_id):
    return cached_response
else:
    process_and_store(request_id, response)
```
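A minimal sketch of the retry-with-idempotency-keys pattern from the box above (the in-memory dict stands in for a durable store; all names are illustrative):

```python
_processed = {}  # request_id -> cached response; use a durable store in practice

def handle_payment(request_id, amount):
    """Idempotent handler: retrying with the same request_id is safe."""
    if request_id in _processed:       # already executed once
        return _processed[request_id]  # return the cached response instead of re-charging
    response = {"status": "charged", "amount": amount}  # the real side effect happens here
    _processed[request_id] = response  # record the result before acknowledging
    return response

# A client that timed out can retry without risking a double charge
print(handle_payment("abc123", 50))
print(handle_payment("abc123", 50))  # same cached response, no second charge
```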
```
Examples of gray failures:
├── CPU running at 50% speed due to thermal throttling
├── One of 10 disks has 100x higher latency
├── Memory errors causing random crashes
├── Network dropping 1% of packets
├── Software bug triggered only by specific input
└── Gradual memory leak over hours

Detection challenges:
├── Metrics look "mostly okay"
├── Intermittent symptoms
├── Hard to reproduce
└── May only affect some requests
```
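One way to surface a gray failure is to compare each node's tail latency against the fleet; a tiny sketch with an illustrative threshold (real detectors combine many such signals):

```python
import statistics

def flag_gray_failures(p99_latency_ms, tolerance=3.0):
    """Flag nodes whose p99 latency is far above the fleet median."""
    median = statistics.median(p99_latency_ms.values())
    return [node for node, p99 in p99_latency_ms.items() if p99 > tolerance * median]

print(flag_gray_failures({"disk-1": 8, "disk-2": 9, "disk-3": 850, "disk-4": 10}))
# ['disk-3'] -> aggregate metrics look "mostly okay", but one disk is ~100x slower
```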
Byzantine failures are the most challenging to handle:
```
BYZANTINE FAULT TOLERANCE
─────────────────────────

THE BYZANTINE GENERALS PROBLEM
──────────────────────────────
Multiple generals must agree on attack/retreat
Some generals may be traitors (send different messages)
How do loyal generals reach consensus?

REQUIREMENTS FOR BFT:
────────────────────
To tolerate f Byzantine nodes: Need 3f + 1 total nodes
- 1 faulty node:  Need 4 nodes minimum
- 2 faulty nodes: Need 7 nodes minimum
- 3 faulty nodes: Need 10 nodes minimum

WHY 3f + 1?
──────────
f nodes may lie, f nodes may be unreachable
Need f+1 correct responses to form a majority
(3f+1) - f (faulty) - f (unreachable) = f+1 (quorum)

PRACTICAL BFT ALGORITHMS:
─────────────────────────
├── PBFT (Practical BFT) - Castro & Liskov, 1999
├── HotStuff - Used by Facebook's Libra/Diem
├── Tendermint - Used in blockchain systems
└── RBFT - Redundant BFT for improved performance
```
Performance Trade-off: BFT protocols are expensive! PBFT requires O(n²) messages per consensus round. Most internal systems use simpler crash-fault tolerant protocols (Raft, Paxos) which only require 2f+1 nodes.
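The sizing rules above are easy to encode; a small sketch comparing the two fault models:

```python
def min_cluster_size(f, byzantine=True):
    """Minimum nodes to tolerate f faults: 3f+1 for Byzantine, 2f+1 for crash faults."""
    return 3 * f + 1 if byzantine else 2 * f + 1

for f in (1, 2, 3):
    print(f, min_cluster_size(f), min_cluster_size(f, byzantine=False))
# f=1 -> 4 vs 3, f=2 -> 7 vs 5, f=3 -> 10 vs 7
```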
Lease-based: Node must renew lease to be considered alive
In practice, use adaptive timeouts based on historical latency.
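A sketch of an adaptive timeout derived from recent latencies (mean plus a multiple of the standard deviation, similar in spirit to TCP's retransmission timeout; the constants are illustrative):

```python
import statistics
from collections import deque

class AdaptiveTimeout:
    def __init__(self, window=100, k=4.0, floor_seconds=0.05):
        self.samples = deque(maxlen=window)  # recent observed latencies in seconds
        self.k = k
        self.floor_seconds = floor_seconds

    def record(self, latency_seconds):
        self.samples.append(latency_seconds)

    def timeout(self):
        """Current timeout: mean + k * stddev of recent latencies."""
        if len(self.samples) < 2:
            return 1.0  # conservative default until we have enough samples
        mean = statistics.mean(self.samples)
        stdev = statistics.stdev(self.samples)
        return max(self.floor_seconds, mean + self.k * stdev)
```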
Q: Explain vector clocks vs Lamport timestamps
Answer:
- Lamport: Single counter per process, preserves "happens-before" in one direction only (if A happened before B then L(A) < L(B), but not the converse)
- Vector clocks: One counter per node, can detect concurrent events
Use Lamport when you just need ordering.
Use vector clocks when you need to detect conflicts.
Q: Is your system CP or AP?
Answer: Frame it as a trade-off discussion:
- First, acknowledge CAP only applies during partitions
- Discuss what consistency level you actually need
- Explain PACELC for normal operation
- Give examples: “For our payment system, we’re CP because an incorrect balance is unacceptable. For user preferences, we’re AP because eventual consistency is fine.”