Track 1: Foundations

Before diving into consensus protocols and complex architectures, you must understand the fundamental challenges that make distributed systems hard.

Track Duration: 28-38 hours
Modules: 5
Key Topics: Network fundamentals, Time & Ordering, Failure Models, CAP/PACELC

Module 1: Why Distributed Systems?

The Need for Distribution

Every successful system eventually outgrows a single machine:

+-------------------------------------------------------------+
|                     WHY DISTRIBUTED SYSTEMS?               |
+-------------------------------------------------------------+
|                                                             |
|  SINGLE MACHINE LIMITS          DISTRIBUTION BENEFITS      |
|  ---------------------          ---------------------      |
|  - CPU: ~128 cores max          - Horizontal scaling       |
|  - RAM: ~12 TB max (expensive)  - Fault tolerance          |
|  - Disk: I/O bottlenecks        - Geographic distribution  |
|  - Network: Single point        - Cost efficiency          |
|    of failure                   - Regulatory compliance    |
|                                                             |
|  EXAMPLE: Google Search                                    |
|  ---------------------                                     |
|  - 8.5 billion searches/day                               |
|  - Index: 100+ petabytes                                  |
|  - Response time: < 0.5 seconds                           |
|  - Impossible on a single machine                         |
|                                                             |
+-------------------------------------------------------------+

Types of Distributed Systems

Compute Clusters

Purpose: Process data across many machinesExamples:

Hadoop/Spark clusters
Kubernetes pods
AWS Lambda fleet

Storage Systems

Purpose: Store and retrieve data across machinesExamples:

Distributed databases (Cassandra, DynamoDB)
Object stores (S3, GCS)
File systems (HDFS, Ceph)

Message Systems

Purpose: Enable communication between distributed componentsExamples:

Message queues (Kafka, RabbitMQ)
Pub/sub systems (SNS, Pub/Sub)
Event streams (Kinesis, Event Hubs)

The Eight Fallacies of Distributed Computing

Every distributed systems engineer must internalize these:

1. The Network is Reliable

Reality: Networks fail constantly.

Common failures:
- Packet loss (1 in 1000 typical, worse under load)
- Network partitions (entire segments become unreachable)
- Cable cuts (more common than you'd think)
- Switch/router failures

Real incident: AWS US-East-1 2011
- Network configuration change
- Cascading failures
- Multi-hour outage affecting Netflix, Reddit, Quora

Defense: Implement retries, timeouts, circuit breakers, and idempotency.

2. Latency is Zero

Reality: Every network call has latency.

Operation	Time
Same datacenter	0.5ms
Cross-region (US East ↔ West)	40-65ms
Cross-continent (US ↔ Europe)	70-120ms
Cross-pacific (US ↔ Asia)	100-200ms

Impact: A service with 10 sequential remote calls adds 500-2000ms latency.Defense: Cache aggressively, parallelize calls, use async processing.

3. Bandwidth is Infinite

Reality: Bandwidth is expensive and limited.

Example: Transferring 1TB over network

10 Gbps link:
- Theoretical: 13 minutes
- Real-world: 20-30 minutes (overhead, congestion)

AWS Data Transfer Costs:
- Intra-region: $0.01/GB
- Cross-region: $0.02/GB
- To internet: $0.09/GB
- 1PB/month internet egress: $90,000

Defense: Compress data, use efficient serialization (protobuf), batch requests.

4. The Network is Secure

Reality: Assume the network is compromised.Attack vectors:

Man-in-the-middle attacks
DNS spoofing
BGP hijacking (internet routing attacks)
Insider threats

Defense: mTLS everywhere, zero-trust architecture, encrypt at rest and in transit.

5. Topology Doesn't Change

Reality: Network topology changes constantly.

Dynamic changes:
├── Auto-scaling adds/removes instances
├── Deployments replace containers
├── Failovers redirect traffic
└── Cloud provider maintenance

Defense: Use service discovery, health checks, load balancer draining.

6. There is One Administrator

Reality: Multiple teams, policies, and even companies.In a typical microservices architecture:

Platform team manages Kubernetes
Each service team manages their services
Security team manages policies
Network team manages infrastructure

Defense: Clear ownership, documented interfaces, SLAs between teams.

7. Transport Cost is Zero

Reality: Data transfer costs real money.

Cloud Provider	Egress Cost
AWS	$0.09/GB
GCP	$0.12/GB
Azure	$0.087/GB

High-traffic example: 1PB egress = $90,000/monthDefense: Keep data close to compute, use CDNs, compress aggressively.

8. The Network is Homogeneous

Reality: Different hardware, protocols, and vendors everywhere.

Heterogeneity:
├── Different server generations (various CPU, RAM)
├── Different network equipment (Cisco, Juniper, etc.)
├── Different protocols (HTTP/1.1, HTTP/2, gRPC, QUIC)
├── Different cloud providers (AWS, GCP, Azure)
└── Different latency profiles

Defense: Abstract hardware differences, use consistent protocols, test on diverse environments.

Module 2: Network Fundamentals

TCP Guarantees and Failures

TCP provides order and reliability, but not much else:

┌─────────────────────────────────────────────────────────────────────────────┐
│                          TCP GUARANTEES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ✓ GUARANTEED                         ✗ NOT GUARANTEED                      │
│  ─────────────                        ─────────────────                      │
│  ▸ Ordered delivery                   ▸ Bounded latency                     │
│  ▸ Reliable delivery (eventually)     ▸ Bounded bandwidth                   │
│  ▸ Error detection                    ▸ Connection will succeed             │
│  ▸ Flow control                       ▸ Message boundaries preserved        │
│  ▸ Duplicate elimination              ▸ Timely delivery                     │
│                                                                              │
│  FAILURE MODES                                                               │
│  ─────────────                                                              │
│  ▸ Connection reset (RST)                                                   │
│  ▸ Connection timeout                                                       │
│  ▸ Half-open connections (one side doesn't know it's dead)                 │
│  ▸ Head-of-line blocking                                                    │
│  ▸ TCP incast (many-to-one overwhelms receiver)                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Network Partitions

A network partition occurs when nodes can’t communicate with each other:

BEFORE PARTITION:
                    ┌────────────────┐
                    │   Network      │
                    └────────────────┘
                    ↑    ↑    ↑    ↑
                    │    │    │    │
               ┌────┼────┼────┼────┼────┐
               │    │    │    │    │    │
              [A]  [B]  [C]  [D]  [E]  [F]
              All nodes can communicate


DURING PARTITION:
     Partition 1              ║              Partition 2
  ┌─────────────┐             ║           ┌─────────────┐
  │   Network   │             ║           │   Network   │
  └─────────────┘             ║           └─────────────┘
   ↑     ↑    ↑               ║            ↑     ↑    ↑
   │     │    │           PARTITION        │     │    │
  [A]   [B]  [C]              ║           [D]   [E]  [F]
  
  A, B, C can talk             D, E, F can talk
  to each other                to each other
  but NOT to D, E, F           but NOT to A, B, C

Key insight: During a partition, you must choose between:

Availability: Both partitions continue serving requests (might diverge)
Consistency: Reject requests from the minority partition

Message Delivery Semantics

At-Most-Once
At-Least-Once
Exactly-Once

Fire and forget - no retries

def send_message(message):
    network.send(message)  # No retry on failure
    # Message delivered 0 or 1 times

Use case: Metrics, logs (losing some is acceptable)Risk: Message loss

Retry until acknowledged

def send_message(message):
    while True:
        network.send(message)
        if ack_received():
            break
        # Retry on timeout
    # Message delivered 1 or more times

Use case: Most systems (with idempotency)Risk: Duplicate processing

Holy grail - each message processed exactly once

def send_message(message):
    message_id = generate_unique_id()
    while True:
        network.send(message, id=message_id)
        if ack_received():
            break
    # Receiver deduplicates by message_id

Use case: Financial transactions, critical operationsReality: Implemented as at-least-once + deduplication

Failure Detection

How do you know if a node is dead?

┌─────────────────────────────────────────────────────────────────────────────┐
│                      FAILURE DETECTION APPROACHES                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HEARTBEAT-BASED                                                            │
│  ───────────────                                                            │
│  Node A ──[heartbeat]──> Node B                                             │
│           every 1 sec                                                        │
│                                                                              │
│  If no heartbeat for 5 seconds → Mark as failed                             │
│                                                                              │
│  Problems:                                                                   │
│  ├── Too aggressive: False positives (just slow, not dead)                  │
│  └── Too conservative: Slow detection of actual failures                    │
│                                                                              │
│  PHI ACCRUAL FAILURE DETECTOR (Cassandra uses this)                         │
│  ─────────────────────────────                                              │
│  Instead of binary (alive/dead), output a suspicion level:                  │
│                                                                              │
│  φ = -log₁₀(probability node is still alive)                                │
│                                                                              │
│  φ = 1 → 90% confident node is alive                                        │
│  φ = 2 → 99% confident node is alive                                        │
**This is one of the most important topics in distributed systems.**

### Why Physical Clocks Fail

┌─────────────────────────────────────────────────────────────────────────────┐ │ PHYSICAL CLOCK PROBLEMS │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ CLOCK SKEW (different machines show different times) │ │ ──────────── │ │ Machine A: 10:00:00.000 │ │ Machine B: 10:00:00.150 (150ms ahead) │ │ Machine C: 09:59:59.850 (150ms behind) │ │ │ │ NTP SYNCHRONIZATION │ │ ─────────────────── │ │ NTP accuracy: 1-10ms on LAN, 10-100ms over internet │ │ During sync: Clock can jump forward or backward! │ │ │ │ CLOCK DRIFT │ │ ──────────── │ │ Quartz crystals drift: ~50ppm (50 microseconds per second) │ │ After 1 hour: 180ms drift possible │ │ │ │ LEAP SECONDS │ │ ──────────── │ │ Added to compensate for Earth’s rotation slowing │ │ 23:59:59 → 23:59:60 → 00:00:00 │ │ Famously caused issues at Reddit, LinkedIn, Mozilla (2012) │ │ │ └─────────────────────────────────────────────────────────────────────────────┘

### Lamport Timestamps

Leslie Lamport's logical clock solution (1978):

**Rules:**
1. Each process maintains a counter (starts at 0)
2. Before each event, increment the counter
3. When sending: attach current counter to message
4. When receiving: counter = max(local, received) + 1

**Example:**
- Process A sends message 1 to B, then B receives it as 2
- Process B sends message 3 to C, then C receives it as 5
- Process A sends message 4 to B, then B receives it as 6
- Process B sends message 7 to C, then C receives it as 8

**Property:** If A happened before B, then L(A) < L(B)
**Limitation:** L(A) < L(B) does NOT mean A happened before B (no causality)

**Implementation**:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0
    
    def increment(self):
        """Call before any local event"""
        self.counter += 1
        return self.counter
    
    def send(self):
        """Call when sending a message"""
        self.counter += 1
        return self.counter
    
    def receive(self, received_timestamp):
        """Call when receiving a message"""
        self.counter = max(self.counter, received_timestamp) + 1
        return self.counter

Vector Clocks

Vector clocks track causality - they tell you if two events are related:

Advanced: Vector Clocks vs. Version Vectors

In many textbooks, “Vector Clocks” and “Version Vectors” are used interchangeably. However, at the Staff level, you must distinguish between them:

Vector Clocks (Causality): Track the relationship between events in a distributed system. Every interaction (send/receive) increments the clock. They are used to build a partial order of all operations.
Version Vectors (Conflict Tracking): Track the relationship between versions of data. They only increment when data is updated. They are used in systems like DynamoDB and Riak to detect if two versions of a document are in conflict (siblings).

Feature	Vector Clocks	Version Vectors
Goal	Order events ( $A$ happened before $B$ )	Detect data conflicts
Increment	Every event (internal + communication)	Only on data modification
Overhead	High (grows with event count)	Low (grows with replica count)

Staff Tip: If you only need to know if two versions of a file conflict, use Version Vectors. If you need to know if a message was “caused” by another message (e.g., in a distributed debugger), use Vector Clocks.

Implementation:

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes
    
    def increment(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def send(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def receive(self, received_clock):
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], received_clock[i])
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def happens_before(self, other_clock):
        """Returns True if this clock happened before other_clock"""
        return (all(s <= o for s, o in zip(self.clock, other_clock)) and
                any(s < o for s, o in zip(self.clock, other_clock)))
    
    def concurrent(self, other_clock):
        """Returns True if events are concurrent"""
        return (not self.happens_before(other_clock) and 
                not VectorClock.compare(other_clock, self.clock))

Hybrid Logical Clocks (HLC)

Combines physical and logical clocks:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    HYBRID LOGICAL CLOCKS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HLC = (physical_time, logical_counter)                                     │
│                                                                              │
│  PROPERTIES:                                                                │
│  1. Always greater than physical time (useful for debugging)                │
│  2. Preserves causality like logical clocks                                 │
│  3. Bounded skew from physical time                                         │
│                                                                              │
│  USED BY:                                                                   │
│  ├── CockroachDB                                                            │
│  ├── MongoDB                                                                │
│  └── Many distributed databases                                             │
│                                                                              │
│  ALGORITHM:                                                                 │
│  ─────────                                                                  │
│  On local event or send:                                                    │
│    l' = max(l, physical_time)                                               │
│    if l' == l: c' = c + 1                                                   │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
│  On receive(m.l, m.c):                                                      │
│    l' = max(l, m.l, physical_time)                                          │
│    if l' == l == m.l: c' = max(c, m.c) + 1                                  │
│    else if l' == l: c' = c + 1                                              │
│    else if l' == m.l: c' = m.c + 1                                          │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

TrueTime (Google Spanner)

Google’s hardware-based approach:

┌─────────────────────────────────────────────────────────────────────────────┐
│                          TRUETIME                                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HARDWARE:                                                                  │
│  ├── GPS receivers (provide absolute time)                                  │
│  └── Atomic clocks (maintain time during GPS outages)                      │
│                                                                              │
│  API:                                                                       │
│  TT.now() returns [earliest, latest] interval                               │
│  TT.after(t) → true if t has definitely passed                             │
│  TT.before(t) → true if t has definitely not arrived                       │
│                                                                              │
│  UNCERTAINTY:                                                               │
│  ├── Average: 4ms                                                           │
│  └── Worst case: 10ms                                                       │
│                                                                              │
│  HOW SPANNER USES IT:                                                       │
│  ────────────────────                                                       │
│  Transaction commits at timestamp t                                         │
│  Wait until TT.after(t) before making visible                               │
│  This guarantees external consistency!                                      │
│                                                                              │
│  COST: Special hardware + GPS antennas + careful engineering                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 4: Failure Models

Understanding what can go wrong is crucial for building resilient systems.

Types of Failures

Fail-Stop
Fail-Recover
Byzantine

Node crashes and stays downCharacteristics:

Clean failure, detectable
Node doesn’t recover with corrupted state
Easiest to handle

Example: Hardware failure, power loss

[Node A] ──── X ────────────────────
             │
          Crash
          (other nodes eventually detect)

Node crashes but can restartCharacteristics:

Node may recover with persistent state
Must handle partial writes
Need recovery protocols

Example: Software crash, restart

[Node A] ──── X ────── [Restart] ──── [Running]
             │             │
          Crash       Recovery
                     (load state from disk)

Node behaves arbitrarily (possibly maliciously)Characteristics:

Can send wrong/conflicting information
Can lie about its state
Hardest to handle
Requires 3f+1 nodes to tolerate f Byzantine failures

Example: Hacked node, buggy code, malicious actor

[Node A] says "value = 5" to Node B
[Node A] says "value = 7" to Node C
(Same node, different lies)

Partial Failures

The defining challenge of distributed systems:

┌─────────────────────────────────────────────────────────────────────────────┐
│                        PARTIAL FAILURES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Node A calls Node B                                              │
│                                                                              │
│  A ──[request]──> B                                                         │
│                   │                                                         │
│         (silence)                                                           │
│                                                                              │
│  What happened?                                                             │
│  ├── Request lost in network?                                               │
│  ├── B crashed before processing?                                           │
│  ├── B processed but crashed before responding?                             │
│  ├── B responded but response lost?                                         │
│  └── B is just slow?                                                        │
│                                                                              │
│  A DOESN'T KNOW!                                                            │
│                                                                              │
│  IMPLICATIONS:                                                              │
│  ─────────────                                                              │
│  1. Cannot distinguish slow from dead                                       │
│  2. Operations might have executed (or not)                                 │
│  3. Must design for uncertainty                                             │
│  4. Idempotency becomes essential                                           │
│                                                                              │
│  PATTERN: Retry with idempotency keys                                       │
│  ─────────────────────────────────                                          │
│  request_id = "abc123"                                                      │
│  if already_processed(request_id):                                          │
│      return cached_response                                                 │
│  else:                                                                      │
│      process_and_store(request_id, response)                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Gray Failures

Subtle failures that are hard to detect:

Examples of gray failures:
├── CPU running at 50% speed due to thermal throttling
├── One of 10 disks has 100x higher latency
├── Memory errors causing random crashes
├── Network dropping 1% of packets
├── Software bug triggered only by specific input
└── Gradual memory leak over hours

Detection challenges:
├── Metrics look "mostly okay"
├── Intermittent symptoms
├── Hard to reproduce
└── May only affect some requests

Byzantine Fault Tolerance (BFT)

Byzantine failures are the most challenging to handle:

┌─────────────────────────────────────────────────────────────────────────────┐
│                    BYZANTINE FAULT TOLERANCE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THE BYZANTINE GENERALS PROBLEM                                              │
│  ─────────────────────────────                                              │
│  Multiple generals must agree on attack/retreat                             │
│  Some generals may be traitors (send different messages)                    │
│  How do loyal generals reach consensus?                                     │
│                                                                              │
│  REQUIREMENTS FOR BFT:                                                      │
│  ────────────────────                                                       │
│  To tolerate f Byzantine nodes: Need 3f + 1 total nodes                     │
│  - 1 faulty node: Need 4 nodes minimum                                      │
│  - 2 faulty nodes: Need 7 nodes minimum                                     │
│  - 3 faulty nodes: Need 10 nodes minimum                                    │
│                                                                              │
│  WHY 3f + 1?                                                                │
│  ──────────                                                                 │
│  f nodes may lie, f nodes may be unreachable                                │
│  Need f+1 correct responses to form a majority                              │
│  (3f+1) - f (faulty) - f (unreachable) = f+1 (quorum)                       │
│                                                                              │
│  PRACTICAL BFT ALGORITHMS:                                                  │
│  ───────────────────────                                                    │
│  ├── PBFT (Practical BFT) - Castro & Liskov, 1999                          │
│  ├── HotStuff - Used by Facebook's Libra/Diem                               │
│  ├── Tendermint - Used in blockchain systems                                │
│  └── RBFT - Redundant BFT for improved performance                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

PBFT Algorithm Overview:

# Simplified PBFT Flow
class PBFTNode:
    def __init__(self, node_id, total_nodes):
        self.node_id = node_id
        self.n = total_nodes
        self.f = (total_nodes - 1) // 3  # Max faulty nodes
        self.view = 0
        self.sequence = 0
        
    def is_primary(self):
        return self.view % self.n == self.node_id
    
    def receive_client_request(self, request):
        if not self.is_primary():
            # Forward to primary
            return self.forward_to_primary(request)
        
        # Primary broadcasts PRE-PREPARE
        self.sequence += 1
        self.broadcast_preprepare(request, self.sequence)
    
    def on_preprepare(self, request, seq):
        # Backup validates and broadcasts PREPARE
        if self.validate_preprepare(request, seq):
            self.broadcast_prepare(request, seq)
    
    def on_prepare(self, request, seq, sender):
        # Collect 2f PREPARE messages (including own)
        if self.prepare_count[seq] >= 2 * self.f:
            # Enter PREPARED state, broadcast COMMIT
            self.broadcast_commit(request, seq)
    
    def on_commit(self, request, seq, sender):
        # Collect 2f + 1 COMMIT messages
        if self.commit_count[seq] >= 2 * self.f + 1:
            # Execute request and reply to client
            self.execute_and_reply(request)

When to Use BFT:

Blockchain and cryptocurrency systems
Multi-party financial transactions
High-security government systems
Any system with mutually distrustful participants

Performance Trade-off: BFT protocols are expensive! PBFT requires O(n²) messages per consensus round. Most internal systems use simpler crash-fault tolerant protocols (Raft, Paxos) which only require 2f+1 nodes.

Network Partition Simulation

Understanding how to test partition tolerance:

# Simulating network partitions in tests
import random
from typing import Set, Dict, List

class NetworkSimulator:
    """
    Simulates network conditions including partitions, 
    delays, and message loss for testing distributed systems.
    """
    
    def __init__(self, nodes: List[str]):
        self.nodes = set(nodes)
        self.partitions: List[Set[str]] = [self.nodes.copy()]
        self.message_loss_rate = 0.0
        self.latency_ms = 0
        self.latency_jitter_ms = 0
    
    def create_partition(self, group1: Set[str], group2: Set[str]):
        """Split network into two partitions"""
        self.partitions = [group1, group2]
        print(f"PARTITION: {group1} | {group2}")
    
    def heal_partition(self):
        """Restore network connectivity"""
        self.partitions = [self.nodes.copy()]
        print("PARTITION HEALED")
    
    def can_communicate(self, from_node: str, to_node: str) -> bool:
        """Check if two nodes can communicate"""
        for partition in self.partitions:
            if from_node in partition and to_node in partition:
                return True
        return False
    
    def send_message(self, from_node: str, to_node: str, message: dict) -> bool:
        """
        Attempt to send a message with simulated failures.
        Returns True if message was delivered.
        """
        # Check partition
        if not self.can_communicate(from_node, to_node):
            print(f"BLOCKED: {from_node} -> {to_node} (partition)")
            return False
        
        # Simulate message loss
        if random.random() < self.message_loss_rate:
            print(f"LOST: {from_node} -> {to_node}")
            return False
        
        # Simulate latency
        delay = self.latency_ms + random.uniform(
            -self.latency_jitter_ms, 
            self.latency_jitter_ms
        )
        print(f"DELIVERED: {from_node} -> {to_node} ({delay:.1f}ms)")
        return True


# Example: Testing a distributed system under partition
def test_consensus_under_partition():
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    network = NetworkSimulator(nodes)
    
    # Normal operation
    assert network.can_communicate("node1", "node3")
    
    # Create minority/majority partition
    network.create_partition(
        {"node1", "node2"},           # Minority
        {"node3", "node4", "node5"}   # Majority
    )
    
    # Majority can reach consensus
    assert network.can_communicate("node3", "node4")
    assert network.can_communicate("node4", "node5")
    
    # Cross-partition communication blocked
    assert not network.can_communicate("node1", "node3")
    assert not network.can_communicate("node2", "node5")
    
    # Minority should reject writes (in CP system)
    # Majority should continue operating
    
    network.heal_partition()
    assert network.can_communicate("node1", "node5")

Chaos Engineering for Failure Testing

Best Practice: Don’t wait for production failures to discover weaknesses. Proactively inject failures in controlled environments.

# Example: Chaos experiment definition
experiment:
  name: "Network Partition Between Services"
  hypothesis: "Order service continues accepting reads during payment partition"
  
  steady_state:
    - probe: "order-service-health"
      type: http
      url: "http://order-service/health"
      expected_status: 200
    
  method:
    - action: "network-partition"
      target: 
        service: "payment-service"
        namespace: "production"
      duration: 60s
      
  rollback:
    - action: "heal-network"
      target:
        service: "payment-service"
        
  metrics_to_watch:
    - error_rate < 1%
    - p99_latency < 500ms
    - order_reads_successful > 99%

Module 5: CAP and PACELC Theorems

CAP Theorem

CAP is often misunderstood. It only applies during a network partition.

CP vs AP Deep Dive

CP Systems

During partition, choose ConsistencyBehavior:

Minority partition stops accepting writes
May reject reads too (for linearizability)
Majority partition continues

Good for:

Financial systems
Inventory management
Any system where stale data is dangerous

Example: Bank account balance

AP Systems

During partition, choose AvailabilityBehavior:

Both partitions continue serving
May return stale/conflicting data
Resolve conflicts after partition heals

Good for:

Social media feeds
Shopping carts
DNS

Example: Twitter timeline

PACELC: The Better Framework

CAP only describes behavior during partitions. PACELC adds normal operation:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           PACELC                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IF Partition                                                               │
│     Choose between Availability and Consistency                             │
│  ELSE (normal operation)                                                    │
│     Choose between Latency and Consistency                                  │
│                                                                              │
│  EXAMPLES:                                                                  │
│  ─────────                                                                  │
│  DynamoDB/Cassandra: PA/EL                                                  │
│  ├── Partition: Available                                                   │
│  └── Normal: Low latency (eventual consistency)                             │
│                                                                              │
│  Spanner: PC/EC                                                             │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Consistent (higher latency, but distributed transactions)     │
│                                                                              │
│  MongoDB: PA/EC                                                             │
│  ├── Partition: Available (eventual reads)                                  │
│  └── Normal: Consistent (waits for primary)                                 │
│                                                                              │
│  PNUTS (Yahoo): PC/EL                                                       │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Low latency (local reads)                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Consistency Spectrum

Consistency is not binary - it’s a spectrum:

STRONGER ─────────────────────────────────────────────────── WEAKER
   │                                                           │
   ▼                                                           ▼
┌──────────────┬─────────────┬────────────┬──────────────┬────────────┐
│ Linearizable │ Sequential  │  Causal    │   Session    │  Eventual  │
│              │ Consistent  │            │  Consistent  │            │
└──────────────┴─────────────┴────────────┴──────────────┴────────────┘
   │                │              │             │              │
   │                │              │             │              │
   ▼                ▼              ▼             ▼              ▼
All ops appear   All ops see    Causally     Same session   Eventually
in real-time    same order     related ops   sees own       all see
order           (not real-     in order      writes         same value
                time)
                
Cost: Highest ◄─────────────────────────────────────────► Cost: Lowest
Latency: High ◄─────────────────────────────────────────► Latency: Low

External Consistency vs. Linearizability

For Staff/Principal level engineering, you must distinguish between system-local and global-time guarantees.

Linearizability (Atomic Consistency)

Linearizability is a guarantee for a single object (or key). It ensures that:

Every operation appears to take effect atomically at some point between its invocation and response.
All operations are seen in the same order by all participants.
If a write $W$ completes before a read $R$ starts (according to the system’s observer), $R$ must see the result of $W$ .

External Consistency (Strict Serializability)

External Consistency is a stronger, global guarantee, famously used by Google Spanner. It combines Linearizability with Serializability across multiple shards/objects and respects global wall-clock time.

Feature	Linearizability	External Consistency
Scope	Single object/key	Multi-shard transactions
Ordering	System-observed order	Global wall-clock order
Implementation	Paxos/Raft on one shard	TrueTime / Commit-Wait / HLC

The “Why”: In a multi-shard system, two transactions

T_1

and

T_2

might affect different shards. If

T_1

finishes at 10:00:01 and a client then starts

T_2

at 10:00:02,

T_2

must see the effects of

T_1

. Without external consistency (e.g., using only local Raft groups with skewed clocks),

T_2

might get a timestamp that is logically earlier than

T_1

, leading to causal violations.

Module 6: Modern Infrastructure & Serverless Internals

To achieve the “Principal” level of understanding, one must look beyond the logical protocols and into the physical isolation models that power modern clouds. The shift from monolithic VMs to Micro-VMs has fundamentally changed how we scale distributed systems.

The Serverless Paradox

Serverless (AWS Lambda, Google Cloud Functions) promises “no servers,” but in reality, it requires the most complex server management in existence. The challenge is The Cold Start Problem vs. Isolation Security.

Traditional VMs (EC2): Strong isolation, but slow to boot (30s+). Too slow for “on-demand” execution.
Containers (Docker/K8s): Fast to start, but weak isolation (shared kernel). Dangerous for multi-tenant code execution.

Micro-VM Internals: AWS Firecracker

AWS Firecracker is the technology behind Lambda and Fargate. It uses KVM (Kernel-based Virtual Machine) but stripped down to the bare essentials.

Minimal Device Model: Firecracker drops legacy hardware support (no USB, no video). It only provides:
- Net (Network)
- Block (Storage)
- Serial (Console)
- Balloon (Memory management)
Snapshot-and-Restore: Instead of booting a kernel, Firecracker can restore from a Memory Snapshot. This reduces “boot” time from seconds to ~10ms.
Jailer: A secondary security layer that uses cgroups, namespaces, and seccomp to ensure that even if a Micro-VM is compromised, it cannot touch the host or other VMs.

Implications for Distributed Design

When designing systems for serverless:

Statelessness is Mandatory: Because the Micro-VM can be killed or recycled at any moment.
Sidecar Overhead: Traditional sidecars (Service Mesh) add too much latency for 50ms functions. We move towards Proxyless Mesh or Library-based approaches.
Connection Pooling: Traditional database connection pools fail at serverless scale. We use RDS Proxy or HTTP-based database protocols (Data API).

Key Interview Questions

Q: Can you achieve consensus in an asynchronous system?

Answer: No, provably impossible (FLP theorem). But we can achieve it practically by:

Using timeouts (introduces synchrony assumption)
Randomization
Failure detectors

Most practical systems (Raft, Paxos) assume partial synchrony.

Q: How do you detect if a node is dead vs slow?

Answer: You can’t with certainty. Approaches:

Timeouts: Simple but prone to false positives
Heartbeats: Regular health checks
Phi accrual detector: Probabilistic suspicion level
Lease-based: Node must renew lease to be considered alive

In practice, use adaptive timeouts based on historical latency.

Q: Explain vector clocks vs Lamport timestamps

Answer:

Lamport: Single counter, preserves “happens-before” one direction only
Vector clocks: One counter per node, can detect concurrent events

Use Lamport when you just need ordering. Use vector clocks when you need to detect conflicts.

Q: Is your system CP or AP?

Answer: Frame it as a trade-off discussion:

First, acknowledge CAP only applies during partitions
Discuss what consistency level you actually need
Explain PACELC for normal operation
Give examples: “For our payment system, we’re CP because incorrect balance is unacceptable. For user preferences, we’re AP because eventual consistency is fine.”

Next Steps

Continue to Track 2: Consensus Protocols

Learn Paxos, Raft, and other consensus algorithms that power distributed systems

Overview

Testing & Code Quality

Crash Courses

AI Engineering

Math for ML - Understanding Linear Algebra

Probability & Statistics for ML

Math for ML - Understanding Calculus

ML Mastery

Deep Learning Mastery

NestJS Mastery

Microservices Mastery

Low Level Design

OOP Concepts

SOLID Principles

Design Patterns

LLD Case Studies

System Design (HLD)

Senior Level (L5+/Staff)

HLD Case Studies

Engineering Fundamentals

DevOps & Operations

Azure Cloud Engineering

AWS Cloud

AWS Monitoring & Observability

AWS Security Services

AWS Serverless

AWS Operations

AWS Advanced

AWS Case Studies

GCP Cloud Engineering

DevOps Tools

Database Engineering

HIPAA Compliance Mastery

Operating Systems

Linux Internals

Distributed Systems

Networking Mastery

Build Your Own X

Go Lang Mastery

C Programming

Classic Research Papers

Distributed System Tools

​Track 1: Foundations

​Module 1: Why Distributed Systems?

​The Need for Distribution

​Types of Distributed Systems

Compute Clusters

Storage Systems

Message Systems

​The Eight Fallacies of Distributed Computing

​Module 2: Network Fundamentals

​TCP Guarantees and Failures

​Network Partitions

​Message Delivery Semantics

​Failure Detection

​Vector Clocks

​Advanced: Vector Clocks vs. Version Vectors

​Implementation:

​Hybrid Logical Clocks (HLC)

​TrueTime (Google Spanner)

​Module 4: Failure Models

​Types of Failures

​Partial Failures

​Gray Failures

​Byzantine Fault Tolerance (BFT)

​Network Partition Simulation

​Chaos Engineering for Failure Testing

​Module 5: CAP and PACELC Theorems

​CAP Theorem

​CP vs AP Deep Dive

CP Systems

AP Systems

​PACELC: The Better Framework

​Consistency Spectrum

​External Consistency vs. Linearizability

​Linearizability (Atomic Consistency)

​External Consistency (Strict Serializability)

​Module 6: Modern Infrastructure & Serverless Internals

​The Serverless Paradox

​Micro-VM Internals: AWS Firecracker

Track 1: Foundations

Module 1: Why Distributed Systems?

The Need for Distribution

Types of Distributed Systems

The Eight Fallacies of Distributed Computing

Module 2: Network Fundamentals

TCP Guarantees and Failures

Network Partitions

Message Delivery Semantics

Failure Detection

Vector Clocks

Advanced: Vector Clocks vs. Version Vectors

Implementation:

Hybrid Logical Clocks (HLC)

TrueTime (Google Spanner)

Module 4: Failure Models

Types of Failures

Partial Failures

Gray Failures

Byzantine Fault Tolerance (BFT)

Network Partition Simulation

Chaos Engineering for Failure Testing

Module 5: CAP and PACELC Theorems

CAP Theorem

CP vs AP Deep Dive

PACELC: The Better Framework

Consistency Spectrum

External Consistency vs. Linearizability

Linearizability (Atomic Consistency)

External Consistency (Strict Serializability)

Module 6: Modern Infrastructure & Serverless Internals

The Serverless Paradox

Micro-VM Internals: AWS Firecracker

Implications for Distributed Design

Key Interview Questions

Next Steps