Defense: Use service discovery, health checks, load balancer draining.
6. There is One Administrator
Reality: Multiple teams, policies, and even companies. In a typical microservices architecture:
Platform team manages Kubernetes
Each service team manages their services
Security team manages policies
Network team manages infrastructure
Defense: Clear ownership, documented interfaces, SLAs between teams.
7. Transport Cost is Zero
Reality: Data transfer costs real money.
| Cloud Provider | Egress Cost |
| --- | --- |
| AWS | $0.09/GB |
| GCP | $0.12/GB |
| Azure | $0.087/GB |
High-traffic example: 1PB egress = $90,000/month

Defense: Keep data close to compute, use CDNs, compress aggressively.
8. The Network is Homogeneous
Reality: Different hardware, protocols, and vendors everywhere.
Heterogeneity:
├── Different server generations (various CPU, RAM)
├── Different network equipment (Cisco, Juniper, etc.)
├── Different protocols (HTTP/1.1, HTTP/2, gRPC, QUIC)
├── Different cloud providers (AWS, GCP, Azure)
└── Different latency profiles
Defense: Abstract hardware differences, use consistent protocols, test on diverse environments.
A network partition occurs when nodes can’t communicate with each other:
BEFORE PARTITION:

      ┌────────────────────────────┐
      │          Network           │
      └────────────────────────────┘
         ↑    ↑    ↑    ↑    ↑    ↑
        [A]  [B]  [C]  [D]  [E]  [F]

      All nodes can communicate

DURING PARTITION:

      Partition 1          ║      Partition 2
   ┌─────────────┐         ║   ┌─────────────┐
   │   Network   │         ║   │   Network   │
   └─────────────┘         ║   └─────────────┘
      ↑   ↑   ↑            ║      ↑   ↑   ↑
     [A] [B] [C]       PARTITION  [D] [E] [F]
                           ║
   A, B, C can talk        ║   D, E, F can talk
   to each other but       ║   to each other but
   NOT to D, E, F          ║   NOT to A, B, C
Key insight: During a partition, you must choose between:
Availability: Both partitions continue serving requests (might diverge)
Consistency: Reject requests from the minority partition
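For the consistency-favoring choice, here is a minimal sketch (cluster size, function names, and thresholds are illustrative, not from any particular system): a node accepts writes only while it can reach a strict majority of the cluster, so the minority side of a partition rejects requests instead of diverging.

```python
def has_quorum(cluster_size: int, reachable_peers: int) -> bool:
    """True if this node plus its reachable peers form a strict majority."""
    return (reachable_peers + 1) > cluster_size // 2

def handle_write(cluster_size: int, reachable_peers: int, apply_write):
    """CP-style behavior: reject writes on the minority side of a partition."""
    if not has_quorum(cluster_size, reachable_peers):
        raise RuntimeError("no quorum: rejecting write to preserve consistency")
    return apply_write()

# Example: a 6-node cluster split 3/3 — neither side has a strict majority,
# so both sides refuse writes until the partition heals.
print(has_quorum(6, 2))   # False (3 of 6 is not a strict majority)
print(has_quorum(6, 3))   # True  (4 of 6)
```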
┌──────────────────────────────────────────────────────────────────────────────
│ FAILURE DETECTION APPROACHES
├──────────────────────────────────────────────────────────────────────────────
│
│ HEARTBEAT-BASED
│ ───────────────
│ Node A ──[heartbeat]──> Node B
│          every 1 sec
│
│ If no heartbeat for 5 seconds → Mark as failed
│
│ Problems:
│ ├── Too aggressive: False positives (just slow, not dead)
│ └── Too conservative: Slow detection of actual failures
│
│ PHI ACCRUAL FAILURE DETECTOR (Cassandra uses this)
│ ─────────────────────────────
│ Instead of binary (alive/dead), output a suspicion level:
│
│   φ = -log₁₀(probability the node is still alive, i.e. the heartbeat is merely late)
│
│   φ = 1 → ~90% confident the node has failed (10% chance it is just slow)
│   φ = 2 → ~99% confident the node has failed
│
└──────────────────────────────────────────────────────────────────────────────
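A rough sketch of the phi accrual idea, assuming heartbeat inter-arrival times follow an exponential distribution (a simplification similar to the one Cassandra's implementation makes); the window size and example timings are illustrative:

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Suspicion level phi = -log10(P(a heartbeat would arrive this late))."""

    def __init__(self, window_size=100):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Current suspicion level; higher means more likely the node is dead."""
        if not self.intervals:
            return 0.0
        mean_interval = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential model: P(heartbeat still coming) = exp(-elapsed / mean)
        p_alive = math.exp(-elapsed / mean_interval)
        return -math.log10(max(p_alive, 1e-12))

detector = PhiAccrualDetector()
for t in (0.0, 1.0, 2.0, 3.1, 4.0):   # heartbeats roughly every second
    detector.heartbeat(t)
print(detector.phi(5.0))    # small phi: the next heartbeat is only slightly late
print(detector.phi(12.0))   # large phi: very likely failed
```

The caller picks a threshold (often φ ≈ 8) rather than a hard timeout, trading detection speed against false positives.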
**Time and clock synchronization is one of the most important topics in distributed systems.**

### Why Physical Clocks Fail

┌─────────────────────────────────────────────────────────────────────────────┐
│ PHYSICAL CLOCK PROBLEMS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ CLOCK SKEW (different machines show different times) │
│ ──────────── │
│ Machine A: 10:00:00.000 │
│ Machine B: 10:00:00.150 (150ms ahead) │
│ Machine C: 09:59:59.850 (150ms behind) │
│ │
│ NTP SYNCHRONIZATION │
│ ─────────────────── │
│ NTP accuracy: 1-10ms on LAN, 10-100ms over internet │
│ During sync: Clock can jump forward or backward! │
│ │
│ CLOCK DRIFT │
│ ──────────── │
│ Quartz crystals drift: ~50ppm (50 microseconds per second) │
│ After 1 hour: 180ms drift possible │
│ │
│ LEAP SECONDS │
│ ──────────── │
│ Added to compensate for Earth’s rotation slowing │
│ 23:59:59 → 23:59:60 → 00:00:00 │
│ Famously caused issues at Reddit, LinkedIn, Mozilla (2012) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
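One practical defense worth showing in code: never measure elapsed time (timeouts, leases, latencies) with the wall clock, since NTP can step it backwards mid-measurement; use a monotonic clock for durations. A small Python sketch:

```python
import time

# Wall clock: subject to NTP steps and leap-second handling — it can jump backwards.
start_wall = time.time()
# ... do work ...
elapsed_wall = time.time() - start_wall       # may be negative after a clock step!

# Monotonic clock: only moves forward, so it is safe for durations and timeouts.
start_mono = time.monotonic()
# ... do work ...
elapsed_mono = time.monotonic() - start_mono  # always >= 0

print(f"wall: {elapsed_wall:.6f}s, monotonic: {elapsed_mono:.6f}s")
```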
### Lamport Timestamps

Leslie Lamport's logical clock solution (1978):

**Rules:**
1. Each process maintains a counter (starts at 0)
2. Before each event, increment the counter
3. When sending: attach the current counter to the message
4. When receiving: counter = max(local, received) + 1

**Example:**
- Process A increments to 1 and sends a message with timestamp 1 to B
- Process B receives it: counter = max(0, 1) + 1 = 2
- Process B increments to 3 and sends a message with timestamp 3 to C
- Process C receives it: counter = max(0, 3) + 1 = 4

**Property:** If A happened before B, then L(A) < L(B)

**Limitation:** L(A) < L(B) does NOT mean A happened before B (no causality detection)

**Implementation**:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def increment(self):
        """Call before any local event."""
        self.counter += 1
        return self.counter

    def send(self):
        """Call when sending a message; attach the returned timestamp."""
        self.counter += 1
        return self.counter

    def receive(self, received_timestamp):
        """Call when receiving a message carrying received_timestamp."""
        self.counter = max(self.counter, received_timestamp) + 1
        return self.counter
```
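A short usage sketch of the class above (process names are illustrative), showing that a message's receive timestamp always exceeds its send timestamp:

```python
a, b = LamportClock(), LamportClock()

t_send = a.send()            # A's counter: 0 -> 1; message carries timestamp 1
t_recv = b.receive(t_send)   # B's counter: max(0, 1) + 1 = 2

b.increment()                # a local event on B: counter -> 3
t_send2 = b.send()           # B's counter: 3 -> 4

print(t_send, t_recv, t_send2)  # 1 2 4 — a send always precedes its receive
```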
In many textbooks, “Vector Clocks” and “Version Vectors” are used interchangeably. However, at the Staff level, you must distinguish between them:
Vector Clocks (Causality): Track the relationship between events in a distributed system. Every interaction (send/receive) increments the clock. They are used to build a partial order of all operations.
Version Vectors (Conflict Tracking): Track the relationship between versions of data. They only increment when data is updated. They are used in systems like Amazon's Dynamo and Riak to detect whether two versions of a document are in conflict (siblings).
| Feature | Vector Clocks | Version Vectors |
| --- | --- | --- |
| Goal | Order events (A happened before B) | Detect data conflicts |
| Increment | Every event (internal + communication) | Only on data modification |
| Overhead | High (grows with event count) | Low (grows with replica count) |
Staff Tip: If you only need to know if two versions of a file conflict, use Version Vectors. If you need to know if a message was “caused” by another message (e.g., in a distributed debugger), use Vector Clocks.
```python
class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes

    def increment(self):
        """Call before any local event."""
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def send(self):
        """Call when sending a message; attach the returned clock."""
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def receive(self, received_clock):
        """Call when receiving a message carrying received_clock."""
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], received_clock[i])
        self.clock[self.node_id] += 1
        return self.clock.copy()

    def happens_before(self, other_clock):
        """Returns True if this clock happened before other_clock."""
        return (all(s <= o for s, o in zip(self.clock, other_clock)) and
                any(s < o for s, o in zip(self.clock, other_clock)))

    def happened_after(self, other_clock):
        """Returns True if other_clock happened before this clock."""
        return (all(o <= s for s, o in zip(self.clock, other_clock)) and
                any(o < s for s, o in zip(self.clock, other_clock)))

    def concurrent(self, other_clock):
        """Returns True if the events are concurrent (neither ordering holds)."""
        return (not self.happens_before(other_clock) and
                not self.happened_after(other_clock))
```
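A brief usage sketch of the class above: two replicas that update independently produce concurrent clocks, i.e. a conflict that must be resolved:

```python
# Two nodes in a 2-node system make independent (concurrent) updates.
a = VectorClock(node_id=0, num_nodes=2)
b = VectorClock(node_id=1, num_nodes=2)

clock_a = a.increment()            # [1, 0]
clock_b = b.increment()            # [0, 1]

print(a.happens_before(clock_b))   # False
print(a.concurrent(clock_b))       # True — a conflict (siblings) to resolve

# Now B receives A's update; B's new clock dominates A's.
clock_b2 = b.receive(clock_a)      # [1, 2]
print(a.happens_before(clock_b2))  # True — A's event happened before B's merge
```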
┌──────────────────────────────────────────────────────────────────────────────
│ HYBRID LOGICAL CLOCKS
├──────────────────────────────────────────────────────────────────────────────
│
│ HLC = (physical_time, logical_counter)
│
│ PROPERTIES:
│ 1. Always greater than physical time (useful for debugging)
│ 2. Preserves causality like logical clocks
│ 3. Bounded skew from physical time
│
│ USED BY:
│ ├── CockroachDB
│ ├── MongoDB
│ └── Many distributed databases
│
│ ALGORITHM:
│ ─────────
│ On local event or send:
│     l' = max(l, physical_time)
│     if l' == l: c' = c + 1
│     else:       c' = 0
│     l = l', c = c'
│
│ On receive(m.l, m.c):
│     l' = max(l, m.l, physical_time)
│     if l' == l == m.l:  c' = max(c, m.c) + 1
│     else if l' == l:    c' = c + 1
│     else if l' == m.l:  c' = m.c + 1
│     else:               c' = 0
│     l = l', c = c'
│
└──────────────────────────────────────────────────────────────────────────────
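A minimal Python sketch of the algorithm above; the millisecond physical-time source and the (l, c) tuple encoding are assumptions for illustration, not any particular database's implementation:

```python
import time

class HybridLogicalClock:
    """HLC timestamp = (l, c): max observed physical time plus a logical counter."""

    def __init__(self, physical_time=lambda: int(time.time() * 1000)):
        self.physical_time = physical_time  # milliseconds; injectable for tests
        self.l = 0
        self.c = 0

    def now(self):
        """Tick for a local event or before sending a message."""
        l_new = max(self.l, self.physical_time())
        self.c = self.c + 1 if l_new == self.l else 0
        self.l = l_new
        return (self.l, self.c)

    def update(self, m_l, m_c):
        """Merge a timestamp (m_l, m_c) received on a message."""
        l_new = max(self.l, m_l, self.physical_time())
        if l_new == self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        self.l = l_new
        return (self.l, self.c)

hlc = HybridLogicalClock()
print(hlc.now())                      # e.g. (1718000000123, 0)
print(hlc.update(hlc.l + 5000, 2))    # sender's clock is 5 s ahead: adopt it, c = 3
```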
┌──────────────────────────────────────────────────────────────────────────────
│ PARTIAL FAILURES
├──────────────────────────────────────────────────────────────────────────────
│
│ SCENARIO: Node A calls Node B
│
│   A ──[request]──> B
│            │
│        (silence)
│
│ What happened?
│ ├── Request lost in network?
│ ├── B crashed before processing?
│ ├── B processed but crashed before responding?
│ ├── B responded but response lost?
│ └── B is just slow?
│
│ A DOESN'T KNOW!
│
│ IMPLICATIONS:
│ ─────────────
│ 1. Cannot distinguish slow from dead
│ 2. Operations might have executed (or not)
│ 3. Must design for uncertainty
│ 4. Idempotency becomes essential
│
│ PATTERN: Retry with idempotency keys
│ ─────────────────────────────────
│ request_id = "abc123"
│ if already_processed(request_id):
│     return cached_response
│ else:
│     process_and_store(request_id, response)
│
└──────────────────────────────────────────────────────────────────────────────
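A sketch of that retry-with-idempotency-key pattern, using an in-memory dict as a stand-in for what must be a durable, atomically updated store in production:

```python
processed = {}  # request_id -> response; stand-in for a durable store

def handle_payment(request_id, amount, charge_card):
    """Process a request at most once, even if the caller retries."""
    if request_id in processed:       # retry after a lost response
        return processed[request_id]  # return the cached result, do not re-charge
    response = charge_card(amount)    # the side effect we must not repeat
    processed[request_id] = response  # record the result before acknowledging
    return response

# The client retries with the SAME key after a timeout, so duplicates are safe.
charge = lambda amount: f"charged ${amount}"
print(handle_payment("abc123", 25, charge))  # charged $25
print(handle_payment("abc123", 25, charge))  # charged $25 (cached, no double charge)
```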
Gray failures sit between healthy and failed: the component is degraded in ways that standard health checks often miss.

Examples of gray failures:
├── CPU running at 50% speed due to thermal throttling
├── One of 10 disks has 100x higher latency
├── Memory errors causing random crashes
├── Network dropping 1% of packets
├── Software bug triggered only by specific input
└── Gradual memory leak over hours

Detection challenges:
├── Metrics look "mostly okay"
├── Intermittent symptoms
├── Hard to reproduce
└── May only affect some requests
Byzantine failures are the most challenging to handle:
┌──────────────────────────────────────────────────────────────────────────────
│ BYZANTINE FAULT TOLERANCE
├──────────────────────────────────────────────────────────────────────────────
│
│ THE BYZANTINE GENERALS PROBLEM
│ ─────────────────────────────
│ Multiple generals must agree on attack/retreat
│ Some generals may be traitors (send different messages)
│ How do loyal generals reach consensus?
│
│ REQUIREMENTS FOR BFT:
│ ────────────────────
│ To tolerate f Byzantine nodes: Need 3f + 1 total nodes
│ - 1 faulty node:  Need 4 nodes minimum
│ - 2 faulty nodes: Need 7 nodes minimum
│ - 3 faulty nodes: Need 10 nodes minimum
│
│ WHY 3f + 1?
│ ──────────
│ f nodes may lie, f nodes may be unreachable
│ Need f+1 correct responses to form a majority
│ (3f+1) - f (faulty) - f (unreachable) = f+1 (quorum)
│
│ PRACTICAL BFT ALGORITHMS:
│ ───────────────────────
│ ├── PBFT (Practical BFT) - Castro & Liskov, 1999
│ ├── HotStuff - Used by Facebook's Libra/Diem
│ ├── Tendermint - Used in blockchain systems
│ └── RBFT - Redundant BFT for improved performance
│
└──────────────────────────────────────────────────────────────────────────────
Performance Trade-off: BFT protocols are expensive! PBFT requires O(n²) messages per consensus round. Most internal systems use simpler crash-fault tolerant protocols (Raft, Paxos) which only require 2f+1 nodes.
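A tiny helper to make the sizing math concrete (function names are illustrative):

```python
def bft_cluster_size(f: int) -> int:
    """Minimum nodes to tolerate f Byzantine (lying) nodes: 3f + 1."""
    return 3 * f + 1

def cft_cluster_size(f: int) -> int:
    """Minimum nodes to tolerate f crash-only failures (Raft/Paxos): 2f + 1."""
    return 2 * f + 1

for f in (1, 2, 3):
    print(f"f={f}: BFT needs {bft_cluster_size(f)} nodes, "
          f"crash-fault tolerance needs {cft_cluster_size(f)}")
# f=1: BFT needs 4 nodes, crash-fault tolerance needs 3
# f=2: BFT needs 7 nodes, crash-fault tolerance needs 5
# f=3: BFT needs 10 nodes, crash-fault tolerance needs 7
```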
External Consistency is a stronger, global guarantee, famously used by Google Spanner. It combines Linearizability with Serializability across multiple shards/objects and respects global wall-clock time.
| Feature | Linearizability | External Consistency |
| --- | --- | --- |
| Scope | Single object/key | Multi-shard transactions |
| Ordering | System-observed order | Global wall-clock order |
| Implementation | Paxos/Raft on one shard | TrueTime / Commit-Wait / HLC |
The “Why”:
In a multi-shard system, two transactions T1 and T2 might affect different shards. If T1 finishes at 10:00:01 and a client then starts T2 at 10:00:02, T2 must see the effects of T1. Without external consistency (e.g., using only local Raft groups with skewed clocks), T2 might get a timestamp that is logically earlier than T1's, leading to causal violations.
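A sketch of the commit-wait idea. Spanner's real TrueTime API is internal to Google, so the clock-uncertainty interface and the ±7 ms bound below are stand-ins for illustration only:

```python
import time

# Hypothetical uncertainty bound, in seconds, standing in for TrueTime's epsilon.
CLOCK_UNCERTAINTY = 0.007

def now_interval():
    """Return (earliest, latest) bounds on the true time, TrueTime-style."""
    t = time.time()
    return t - CLOCK_UNCERTAINTY, t + CLOCK_UNCERTAINTY

def commit_wait(commit_timestamp):
    """Block until commit_timestamp is guaranteed to be in the past everywhere."""
    while True:
        earliest, _ = now_interval()
        if earliest > commit_timestamp:
            return
        time.sleep(commit_timestamp - earliest)

# A transaction picks its commit timestamp at the latest possible bound,
# then waits out the uncertainty before acknowledging the commit. Any
# transaction that starts afterwards is guaranteed a strictly later timestamp.
_, latest = now_interval()
commit_ts = latest
commit_wait(commit_ts)
print("commit acknowledged at", commit_ts)
```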
Module 6: Modern Infrastructure & Serverless Internals
To achieve the “Principal” level of understanding, one must look beyond the logical protocols and into the physical isolation models that power modern clouds. The shift from monolithic VMs to Micro-VMs has fundamentally changed how we scale distributed systems.
Serverless (AWS Lambda, Google Cloud Functions) promises "no servers," but in reality it relies on some of the most complex server management in existence. The central challenge is the trade-off between the Cold Start Problem and Isolation Security.
Traditional VMs (EC2): Strong isolation, but slow to boot (30s+). Too slow for “on-demand” execution.
Containers (Docker/K8s): Fast to start, but weak isolation (shared kernel). Dangerous for multi-tenant code execution.
AWS Firecracker is the technology behind Lambda and Fargate. It uses KVM (Kernel-based Virtual Machine) but stripped down to the bare essentials.
Minimal Device Model: Firecracker drops legacy hardware support (no USB, no video). It only provides:
Net (Network)
Block (Storage)
Serial (Console)
Balloon (Memory management)
Snapshot-and-Restore: Instead of booting a kernel, Firecracker can restore from a Memory Snapshot. This reduces “boot” time from seconds to ~10ms.
Jailer: A secondary security layer that uses cgroups, namespaces, and seccomp to ensure that even if a Micro-VM is compromised, it cannot touch the host or other VMs.
Statelessness is Mandatory: Because the Micro-VM can be killed or recycled at any moment.
Sidecar Overhead: Traditional sidecars (Service Mesh) add too much latency for 50ms functions. We move towards Proxyless Mesh or Library-based approaches.
Connection Pooling: Traditional database connection pools fail at serverless scale. We use RDS Proxy or HTTP-based database protocols (Data API).
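A common mitigation, sketched below with a hypothetical `create_client()` factory and a Lambda-style `handler` signature: create the expensive connection once at module scope so warm invocations inside the same Micro-VM reuse it instead of reconnecting on every request.

```python
class FakeClient:
    """Stand-in for an expensive database/HTTP client."""
    def query(self, sql):
        return f"result of {sql}"

def create_client():
    # In real code this opens a TCP connection / TLS handshake, which is slow.
    return FakeClient()

# Module scope: runs once per Micro-VM (cold start), not once per invocation.
_client = None

def get_client():
    """Lazily create and cache the client so warm invocations reuse it."""
    global _client
    if _client is None:
        _client = create_client()
    return _client

def handler(event, context):
    """Lambda-style entry point (the name is conventional, not required)."""
    return get_client().query("SELECT 1")

print(handler({}, None))  # result of SELECT 1
```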
Lease-based detection: a node must periodically renew a lease to be considered alive.
In practice, use adaptive timeouts based on historical latency rather than fixed thresholds, as sketched below.
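One way to implement that, as a sketch: keep a rolling window of observed round-trip times and derive the timeout from a high percentile plus a safety multiplier (the window size, percentile, and multiplier are tuning knobs, not fixed rules):

```python
from collections import deque

class AdaptiveTimeout:
    """Timeout derived from recent latency observations instead of a fixed constant."""

    def __init__(self, window=500, percentile=0.99, multiplier=2.0, floor=0.05):
        self.samples = deque(maxlen=window)  # recent round-trip times in seconds
        self.percentile = percentile
        self.multiplier = multiplier
        self.floor = floor                   # never go below this, even when idle

    def observe(self, rtt_seconds):
        self.samples.append(rtt_seconds)

    def current(self):
        """Timeout = high percentile of recent RTTs times a safety multiplier."""
        if not self.samples:
            return self.floor * 10           # conservative default with no data
        ordered = sorted(self.samples)
        idx = min(int(len(ordered) * self.percentile), len(ordered) - 1)
        return max(ordered[idx] * self.multiplier, self.floor)

timeouts = AdaptiveTimeout()
for rtt in (0.020, 0.025, 0.030, 0.022, 0.210):  # one slow outlier
    timeouts.observe(rtt)
print(round(timeouts.current(), 3))  # 0.42 — adapts upward instead of false-failing
```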
Q: Explain vector clocks vs Lamport timestamps
Answer:
Lamport: a single counter; it preserves "happens-before" in one direction only (A → B implies L(A) < L(B), but not the converse)
Vector clocks: One counter per node, can detect concurrent events
Use Lamport when you just need ordering.
Use vector clocks when you need to detect conflicts.
Q: Is your system CP or AP?
Answer: Frame it as a trade-off discussion:
First, acknowledge CAP only applies during partitions
Discuss what consistency level you actually need
Explain PACELC for normal operation
Give examples: “For our payment system, we’re CP because incorrect balance is unacceptable. For user preferences, we’re AP because eventual consistency is fine.”