Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Track 1: Foundations

Before diving into consensus protocols and complex architectures, you must understand the fundamental challenges that make distributed systems hard. This chapter is the foundation everything else rests on — if your mental model of networks, time, and failure is wrong, every design decision you make downstream will be subtly broken.
Track Duration: 28-38 hours
Modules: 5
Key Topics: Network fundamentals, Time & Ordering, Failure Models, CAP/PACELC

Module 1: Why Distributed Systems?

The Need for Distribution

Every successful system eventually outgrows a single machine:
+-------------------------------------------------------------+
|                     WHY DISTRIBUTED SYSTEMS?               |
+-------------------------------------------------------------+
|                                                             |
|  SINGLE MACHINE LIMITS          DISTRIBUTION BENEFITS      |
|  ---------------------          ---------------------      |
|  - CPU: ~128 cores max          - Horizontal scaling       |
|  - RAM: ~12 TB max (expensive)  - Fault tolerance          |
|  - Disk: I/O bottlenecks        - Geographic distribution  |
|  - Network: Single point        - Cost efficiency          |
|    of failure                   - Regulatory compliance    |
|                                                             |
|  EXAMPLE: Google Search                                    |
|  ---------------------                                     |
|  - 8.5 billion searches/day                               |
|  - Index: 100+ petabytes                                  |
|  - Response time: < 0.5 seconds                           |
|  - Impossible on a single machine                         |
|                                                             |
+-------------------------------------------------------------+

Types of Distributed Systems

Compute Clusters

Purpose: Process data across many machinesExamples:
  • Hadoop/Spark clusters
  • Kubernetes pods
  • AWS Lambda fleet

Storage Systems

Purpose: Store and retrieve data across machinesExamples:
  • Distributed databases (Cassandra, DynamoDB)
  • Object stores (S3, GCS)
  • File systems (HDFS, Ceph)

Message Systems

Purpose: Enable communication between distributed componentsExamples:
  • Message queues (Kafka, RabbitMQ)
  • Pub/sub systems (SNS, Pub/Sub)
  • Event streams (Kinesis, Event Hubs)

The Eight Fallacies of Distributed Computing

Every distributed systems engineer must internalize these. Peter Deutsch (with additions by James Gosling) identified these assumptions that developers new to distributed systems unconsciously make. Each one is a trap — code that works perfectly on your laptop will fail in production because of one or more of these false beliefs:
Reality: Networks fail constantly.
Common failures:
- Packet loss (1 in 1000 typical, worse under load)
- Network partitions (entire segments become unreachable)
- Cable cuts (more common than you'd think)
- Switch/router failures

Real incident: AWS US-East-1 2011
- Network configuration change
- Cascading failures
- Multi-hour outage affecting Netflix, Reddit, Quora
Defense: Implement retries, timeouts, circuit breakers, and idempotency.
Reality: Every network call has latency.
OperationTime
Same datacenter0.5ms
Cross-region (US East ↔ West)40-65ms
Cross-continent (US ↔ Europe)70-120ms
Cross-pacific (US ↔ Asia)100-200ms
Impact: A service with 10 sequential remote calls adds 500-2000ms latency.Defense: Cache aggressively, parallelize calls, use async processing.
Reality: Bandwidth is expensive and limited.
Example: Transferring 1TB over network

10 Gbps link:
- Theoretical: 13 minutes
- Real-world: 20-30 minutes (overhead, congestion)

AWS Data Transfer Costs:
- Intra-region: $0.01/GB
- Cross-region: $0.02/GB
- To internet: $0.09/GB
- 1PB/month internet egress: $90,000
Defense: Compress data, use efficient serialization (protobuf), batch requests.
Reality: Assume the network is compromised.Attack vectors:
  • Man-in-the-middle attacks
  • DNS spoofing
  • BGP hijacking (internet routing attacks)
  • Insider threats
Defense: mTLS everywhere, zero-trust architecture, encrypt at rest and in transit.
Reality: Network topology changes constantly.
Dynamic changes:
├── Auto-scaling adds/removes instances
├── Deployments replace containers
├── Failovers redirect traffic
└── Cloud provider maintenance
Defense: Use service discovery, health checks, load balancer draining.
Reality: Multiple teams, policies, and even companies.In a typical microservices architecture:
  • Platform team manages Kubernetes
  • Each service team manages their services
  • Security team manages policies
  • Network team manages infrastructure
Defense: Clear ownership, documented interfaces, SLAs between teams.
Reality: Data transfer costs real money.
Cloud ProviderEgress Cost
AWS$0.09/GB
GCP$0.12/GB
Azure$0.087/GB
High-traffic example: 1PB egress = $90,000/monthDefense: Keep data close to compute, use CDNs, compress aggressively.
Reality: Different hardware, protocols, and vendors everywhere.
Heterogeneity:
├── Different server generations (various CPU, RAM)
├── Different network equipment (Cisco, Juniper, etc.)
├── Different protocols (HTTP/1.1, HTTP/2, gRPC, QUIC)
├── Different cloud providers (AWS, GCP, Azure)
└── Different latency profiles
Defense: Abstract hardware differences, use consistent protocols, test on diverse environments.

Module 2: Network Fundamentals

TCP Guarantees and Failures

TCP provides order and reliability, but not much else:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TCP GUARANTEES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ✓ GUARANTEED                         ✗ NOT GUARANTEED                      │
│  ─────────────                        ─────────────────                      │
│  ▸ Ordered delivery                   ▸ Bounded latency                     │
│  ▸ Reliable delivery (eventually)     ▸ Bounded bandwidth                   │
│  ▸ Error detection                    ▸ Connection will succeed             │
│  ▸ Flow control                       ▸ Message boundaries preserved        │
│  ▸ Duplicate elimination              ▸ Timely delivery                     │
│                                                                              │
│  FAILURE MODES                                                               │
│  ─────────────                                                              │
│  ▸ Connection reset (RST)                                                   │
│  ▸ Connection timeout                                                       │
│  ▸ Half-open connections (one side doesn't know it's dead)                 │
│  ▸ Head-of-line blocking                                                    │
│  ▸ TCP incast (many-to-one overwhelms receiver)                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Network Partitions

A network partition occurs when nodes can’t communicate with each other:
BEFORE PARTITION:
                    ┌────────────────┐
                    │   Network      │
                    └────────────────┘
                    ↑    ↑    ↑    ↑
                    │    │    │    │
               ┌────┼────┼────┼────┼────┐
               │    │    │    │    │    │
              [A]  [B]  [C]  [D]  [E]  [F]
              All nodes can communicate


DURING PARTITION:
     Partition 1              ║              Partition 2
  ┌─────────────┐             ║           ┌─────────────┐
  │   Network   │             ║           │   Network   │
  └─────────────┘             ║           └─────────────┘
   ↑     ↑    ↑               ║            ↑     ↑    ↑
   │     │    │           PARTITION        │     │    │
  [A]   [B]  [C]              ║           [D]   [E]  [F]
  
  A, B, C can talk             D, E, F can talk
  to each other                to each other
  but NOT to D, E, F           but NOT to A, B, C
Key insight: During a partition, you must choose between:
  • Availability: Both partitions continue serving requests (might diverge)
  • Consistency: Reject requests from the minority partition
Think of it like two bank branches whose phone line gets cut. Do they keep processing transactions independently (available but potentially inconsistent — both might approve overdrafts) or do they stop until the phone line is restored (consistent but unavailable)? There is no option where both branches stay open AND guarantee they agree on the account balance. This is the CAP theorem in a nutshell.

Message Delivery Semantics

Fire and forget - no retries
def send_message(message):
    network.send(message)  # No retry on failure
    # Message delivered 0 or 1 times
Use case: Metrics, logs (losing some is acceptable)Risk: Message loss

Failure Detection

How do you know if a node is dead?
┌─────────────────────────────────────────────────────────────────────────────┐
│                      FAILURE DETECTION APPROACHES                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HEARTBEAT-BASED                                                            │
│  ───────────────                                                            │
│  Node A ──[heartbeat]──> Node B                                             │
│           every 1 sec                                                        │
│                                                                              │
│  If no heartbeat for 5 seconds → Mark as failed                             │
│                                                                              │
│  Problems:                                                                   │
│  ├── Too aggressive: False positives (just slow, not dead)                  │
│  └── Too conservative: Slow detection of actual failures                    │
│                                                                              │
│  PHI ACCRUAL FAILURE DETECTOR (Cassandra uses this)                         │
│  ─────────────────────────────                                              │
│  Instead of binary (alive/dead), output a suspicion level:                  │
│                                                                              │
│  φ = -log₁₀(probability node is still alive)                                │
│                                                                              │
│  φ = 1 → 90% confident node is alive                                        │
│  φ = 2 → 99% confident node is alive                                        │
**This is one of the most important topics in distributed systems.**

### Why Physical Clocks Fail

┌─────────────────────────────────────────────────────────────────────────────┐ │ PHYSICAL CLOCK PROBLEMS │ ├─────────────────────────────────────────────────────────────────────────────┤ │ │ │ CLOCK SKEW (different machines show different times) │ │ ──────────── │ │ Machine A: 10:00:00.000 │ │ Machine B: 10:00:00.150 (150ms ahead) │ │ Machine C: 09:59:59.850 (150ms behind) │ │ │ │ NTP SYNCHRONIZATION │ │ ─────────────────── │ │ NTP accuracy: 1-10ms on LAN, 10-100ms over internet │ │ During sync: Clock can jump forward or backward! │ │ │ │ CLOCK DRIFT │ │ ──────────── │ │ Quartz crystals drift: ~50ppm (50 microseconds per second) │ │ After 1 hour: 180ms drift possible │ │ │ │ LEAP SECONDS │ │ ──────────── │ │ Added to compensate for Earth’s rotation slowing │ │ 23:59:59 → 23:59:60 → 00:00:00 │ │ Famously caused issues at Reddit, LinkedIn, Mozilla (2012) │ │ │ └─────────────────────────────────────────────────────────────────────────────┘

### Lamport Timestamps

Leslie Lamport's logical clock solution (1978):

**Rules:**
1. Each process maintains a counter (starts at 0)
2. Before each event, increment the counter
3. When sending: attach current counter to message
4. When receiving: counter = max(local, received) + 1

**Example:**
- Process A sends message 1 to B, then B receives it as 2
- Process B sends message 3 to C, then C receives it as 5
- Process A sends message 4 to B, then B receives it as 6
- Process B sends message 7 to C, then C receives it as 8

**Property:** If A happened before B, then L(A) < L(B)
**Limitation:** L(A) < L(B) does NOT mean A happened before B (no causality)

**Implementation**:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0
    
    def increment(self):
        """Call before any local event"""
        self.counter += 1
        return self.counter
    
    def send(self):
        """Call when sending a message"""
        self.counter += 1
        return self.counter
    
    def receive(self, received_timestamp):
        """Call when receiving a message"""
        self.counter = max(self.counter, received_timestamp) + 1
        return self.counter

Vector Clocks

Vector clocks track causality - they tell you if two events are related: Vector Clocks

Advanced: Vector Clocks vs. Version Vectors

In many textbooks, “Vector Clocks” and “Version Vectors” are used interchangeably. However, at the Staff level, you must distinguish between them:
  1. Vector Clocks (Causality): Track the relationship between events in a distributed system. Every interaction (send/receive) increments the clock. They are used to build a partial order of all operations.
  2. Version Vectors (Conflict Tracking): Track the relationship between versions of data. They only increment when data is updated. They are used in systems like DynamoDB and Riak to detect if two versions of a document are in conflict (siblings).
FeatureVector ClocksVersion Vectors
GoalOrder events (AA happened before BB)Detect data conflicts
IncrementEvery event (internal + communication)Only on data modification
OverheadHigh (grows with event count)Low (grows with replica count)
Staff Tip: If you only need to know if two versions of a file conflict, use Version Vectors. If you need to know if a message was “caused” by another message (e.g., in a distributed debugger), use Vector Clocks.

Implementation:

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes
    
    def increment(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def send(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def receive(self, received_clock):
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], received_clock[i])
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def happens_before(self, other_clock):
        """Returns True if this clock happened before other_clock"""
        return (all(s <= o for s, o in zip(self.clock, other_clock)) and
                any(s < o for s, o in zip(self.clock, other_clock)))
    
    def concurrent(self, other_clock):
        """Returns True if events are concurrent"""
        return (not self.happens_before(other_clock) and 
                not VectorClock.compare(other_clock, self.clock))

Hybrid Logical Clocks (HLC)

Combines physical and logical clocks:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HYBRID LOGICAL CLOCKS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HLC = (physical_time, logical_counter)                                     │
│                                                                              │
│  PROPERTIES:                                                                │
│  1. Always greater than physical time (useful for debugging)                │
│  2. Preserves causality like logical clocks                                 │
│  3. Bounded skew from physical time                                         │
│                                                                              │
│  USED BY:                                                                   │
│  ├── CockroachDB                                                            │
│  ├── MongoDB                                                                │
│  └── Many distributed databases                                             │
│                                                                              │
│  ALGORITHM:                                                                 │
│  ─────────                                                                  │
│  On local event or send:                                                    │
│    l' = max(l, physical_time)                                               │
│    if l' == l: c' = c + 1                                                   │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
│  On receive(m.l, m.c):                                                      │
│    l' = max(l, m.l, physical_time)                                          │
│    if l' == l == m.l: c' = max(c, m.c) + 1                                  │
│    else if l' == l: c' = c + 1                                              │
│    else if l' == m.l: c' = m.c + 1                                          │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

TrueTime (Google Spanner)

Google’s hardware-based approach:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TRUETIME                                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HARDWARE:                                                                  │
│  ├── GPS receivers (provide absolute time)                                  │
│  └── Atomic clocks (maintain time during GPS outages)                      │
│                                                                              │
│  API:                                                                       │
│  TT.now() returns [earliest, latest] interval                               │
│  TT.after(t) → true if t has definitely passed                             │
│  TT.before(t) → true if t has definitely not arrived                       │
│                                                                              │
│  UNCERTAINTY:                                                               │
│  ├── Average: 4ms                                                           │
│  └── Worst case: 10ms                                                       │
│                                                                              │
│  HOW SPANNER USES IT:                                                       │
│  ────────────────────                                                       │
│  Transaction commits at timestamp t                                         │
│  Wait until TT.after(t) before making visible                               │
│  This guarantees external consistency!                                      │
│                                                                              │
│  COST: Special hardware + GPS antennas + careful engineering                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 4: Failure Models

Understanding what can go wrong is crucial for building resilient systems.

Types of Failures

Node crashes and stays downCharacteristics:
  • Clean failure, detectable
  • Node doesn’t recover with corrupted state
  • Easiest to handle
Example: Hardware failure, power loss
[Node A] ──── X ────────────────────

          Crash
          (other nodes eventually detect)

Partial Failures

The defining challenge of distributed systems:
┌─────────────────────────────────────────────────────────────────────────────┐
│                        PARTIAL FAILURES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Node A calls Node B                                              │
│                                                                              │
│  A ──[request]──> B                                                         │
│                   │                                                         │
│         (silence)                                                           │
│                                                                              │
│  What happened?                                                             │
│  ├── Request lost in network?                                               │
│  ├── B crashed before processing?                                           │
│  ├── B processed but crashed before responding?                             │
│  ├── B responded but response lost?                                         │
│  └── B is just slow?                                                        │
│                                                                              │
│  A DOESN'T KNOW!                                                            │
│                                                                              │
│  IMPLICATIONS:                                                              │
│  ─────────────                                                              │
│  1. Cannot distinguish slow from dead                                       │
│  2. Operations might have executed (or not)                                 │
│  3. Must design for uncertainty                                             │
│  4. Idempotency becomes essential                                           │
│                                                                              │
│  PATTERN: Retry with idempotency keys                                       │
│  ─────────────────────────────────                                          │
│  request_id = "abc123"                                                      │
│  if already_processed(request_id):                                          │
│      return cached_response                                                 │
│  else:                                                                      │
│      process_and_store(request_id, response)                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Gray Failures

Gray failures are the distributed systems equivalent of a slow gas leak — everything seems fine until suddenly it is not. They are far more common and insidious than clean crash failures, and they are the cause of most prolonged production outages:
Examples of gray failures:
├── CPU running at 50% speed due to thermal throttling
├── One of 10 disks has 100x higher latency
├── Memory errors causing random crashes
├── Network dropping 1% of packets (enough to cause retries, not enough to trigger alerts)
├── Software bug triggered only by specific input (works for 99.99% of requests)
└── Gradual memory leak over hours (OOM kill after 3 days)

Detection challenges:
├── Metrics look "mostly okay" (averages hide the problem)
├── Intermittent symptoms (works on retry, so nobody investigates)
├── Hard to reproduce (depends on specific timing or load)
└── May only affect some requests (the unlucky 0.1% routed to the sick node)
Production wisdom: Gray failures are why you must monitor percentile latencies (p99, p99.9), not just averages or error rates. A node with a failing disk might serve 99% of requests normally but add 10 seconds of latency to the rest. The average latency barely moves, but 1% of your users have a terrible experience. Your dashboards should make this visible.

Byzantine Fault Tolerance (BFT)

Byzantine failures are the most challenging to handle:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    BYZANTINE FAULT TOLERANCE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THE BYZANTINE GENERALS PROBLEM                                              │
│  ─────────────────────────────                                              │
│  Multiple generals must agree on attack/retreat                             │
│  Some generals may be traitors (send different messages)                    │
│  How do loyal generals reach consensus?                                     │
│                                                                              │
│  REQUIREMENTS FOR BFT:                                                      │
│  ────────────────────                                                       │
│  To tolerate f Byzantine nodes: Need 3f + 1 total nodes                     │
│  - 1 faulty node: Need 4 nodes minimum                                      │
│  - 2 faulty nodes: Need 7 nodes minimum                                     │
│  - 3 faulty nodes: Need 10 nodes minimum                                    │
│                                                                              │
│  WHY 3f + 1?                                                                │
│  ──────────                                                                 │
│  f nodes may lie, f nodes may be unreachable                                │
│  Need f+1 correct responses to form a majority                              │
│  (3f+1) - f (faulty) - f (unreachable) = f+1 (quorum)                       │
│                                                                              │
│  PRACTICAL BFT ALGORITHMS:                                                  │
│  ───────────────────────                                                    │
│  ├── PBFT (Practical BFT) - Castro & Liskov, 1999                          │
│  ├── HotStuff - Used by Facebook's Libra/Diem                               │
│  ├── Tendermint - Used in blockchain systems                                │
│  └── RBFT - Redundant BFT for improved performance                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
PBFT Algorithm Overview:
# Simplified PBFT Flow
class PBFTNode:
    def __init__(self, node_id, total_nodes):
        self.node_id = node_id
        self.n = total_nodes
        self.f = (total_nodes - 1) // 3  # Max faulty nodes
        self.view = 0
        self.sequence = 0
        
    def is_primary(self):
        return self.view % self.n == self.node_id
    
    def receive_client_request(self, request):
        if not self.is_primary():
            # Forward to primary
            return self.forward_to_primary(request)
        
        # Primary broadcasts PRE-PREPARE
        self.sequence += 1
        self.broadcast_preprepare(request, self.sequence)
    
    def on_preprepare(self, request, seq):
        # Backup validates and broadcasts PREPARE
        if self.validate_preprepare(request, seq):
            self.broadcast_prepare(request, seq)
    
    def on_prepare(self, request, seq, sender):
        # Collect 2f PREPARE messages (including own)
        if self.prepare_count[seq] >= 2 * self.f:
            # Enter PREPARED state, broadcast COMMIT
            self.broadcast_commit(request, seq)
    
    def on_commit(self, request, seq, sender):
        # Collect 2f + 1 COMMIT messages
        if self.commit_count[seq] >= 2 * self.f + 1:
            # Execute request and reply to client
            self.execute_and_reply(request)
When to Use BFT:
  • Blockchain and cryptocurrency systems
  • Multi-party financial transactions
  • High-security government systems
  • Any system with mutually distrustful participants
Performance Trade-off: BFT protocols are expensive! PBFT requires O(n²) messages per consensus round. Most internal systems use simpler crash-fault tolerant protocols (Raft, Paxos) which only require 2f+1 nodes.

Network Partition Simulation

Understanding how to test partition tolerance:
# Simulating network partitions in tests
import random
from typing import Set, Dict, List

class NetworkSimulator:
    """
    Simulates network conditions including partitions, 
    delays, and message loss for testing distributed systems.
    """
    
    def __init__(self, nodes: List[str]):
        self.nodes = set(nodes)
        self.partitions: List[Set[str]] = [self.nodes.copy()]
        self.message_loss_rate = 0.0
        self.latency_ms = 0
        self.latency_jitter_ms = 0
    
    def create_partition(self, group1: Set[str], group2: Set[str]):
        """Split network into two partitions"""
        self.partitions = [group1, group2]
        print(f"PARTITION: {group1} | {group2}")
    
    def heal_partition(self):
        """Restore network connectivity"""
        self.partitions = [self.nodes.copy()]
        print("PARTITION HEALED")
    
    def can_communicate(self, from_node: str, to_node: str) -> bool:
        """Check if two nodes can communicate"""
        for partition in self.partitions:
            if from_node in partition and to_node in partition:
                return True
        return False
    
    def send_message(self, from_node: str, to_node: str, message: dict) -> bool:
        """
        Attempt to send a message with simulated failures.
        Returns True if message was delivered.
        """
        # Check partition
        if not self.can_communicate(from_node, to_node):
            print(f"BLOCKED: {from_node} -> {to_node} (partition)")
            return False
        
        # Simulate message loss
        if random.random() < self.message_loss_rate:
            print(f"LOST: {from_node} -> {to_node}")
            return False
        
        # Simulate latency
        delay = self.latency_ms + random.uniform(
            -self.latency_jitter_ms, 
            self.latency_jitter_ms
        )
        print(f"DELIVERED: {from_node} -> {to_node} ({delay:.1f}ms)")
        return True


# Example: Testing a distributed system under partition
def test_consensus_under_partition():
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    network = NetworkSimulator(nodes)
    
    # Normal operation
    assert network.can_communicate("node1", "node3")
    
    # Create minority/majority partition
    network.create_partition(
        {"node1", "node2"},           # Minority
        {"node3", "node4", "node5"}   # Majority
    )
    
    # Majority can reach consensus
    assert network.can_communicate("node3", "node4")
    assert network.can_communicate("node4", "node5")
    
    # Cross-partition communication blocked
    assert not network.can_communicate("node1", "node3")
    assert not network.can_communicate("node2", "node5")
    
    # Minority should reject writes (in CP system)
    # Majority should continue operating
    
    network.heal_partition()
    assert network.can_communicate("node1", "node5")

Chaos Engineering for Failure Testing

Best Practice: Don’t wait for production failures to discover weaknesses. Proactively inject failures in controlled environments.
# Example: Chaos experiment definition
experiment:
  name: "Network Partition Between Services"
  hypothesis: "Order service continues accepting reads during payment partition"
  
  steady_state:
    - probe: "order-service-health"
      type: http
      url: "http://order-service/health"
      expected_status: 200
    
  method:
    - action: "network-partition"
      target: 
        service: "payment-service"
        namespace: "production"
      duration: 60s
      
  rollback:
    - action: "heal-network"
      target:
        service: "payment-service"
        
  metrics_to_watch:
    - error_rate < 1%
    - p99_latency < 500ms
    - order_reads_successful > 99%

Module 5: CAP and PACELC Theorems

CAP Theorem

CAP is often misunderstood. It only applies during a network partition.
CAP Theorem

CP vs AP Deep Dive

CP Systems

During partition, choose ConsistencyBehavior:
  • Minority partition stops accepting writes
  • May reject reads too (for linearizability)
  • Majority partition continues
Good for:
  • Financial systems
  • Inventory management
  • Any system where stale data is dangerous
Example: Bank account balance

AP Systems

During partition, choose AvailabilityBehavior:
  • Both partitions continue serving
  • May return stale/conflicting data
  • Resolve conflicts after partition heals
Good for:
  • Social media feeds
  • Shopping carts
  • DNS
Example: Twitter timeline

PACELC: The Better Framework

CAP only describes behavior during partitions. PACELC adds normal operation:
┌─────────────────────────────────────────────────────────────────────────────┐
│                           PACELC                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IF Partition                                                               │
│     Choose between Availability and Consistency                             │
│  ELSE (normal operation)                                                    │
│     Choose between Latency and Consistency                                  │
│                                                                              │
│  EXAMPLES:                                                                  │
│  ─────────                                                                  │
│  DynamoDB/Cassandra: PA/EL                                                  │
│  ├── Partition: Available                                                   │
│  └── Normal: Low latency (eventual consistency)                             │
│                                                                              │
│  Spanner: PC/EC                                                             │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Consistent (higher latency, but distributed transactions)     │
│                                                                              │
│  MongoDB: PA/EC                                                             │
│  ├── Partition: Available (eventual reads)                                  │
│  └── Normal: Consistent (waits for primary)                                 │
│                                                                              │
│  PNUTS (Yahoo): PC/EL                                                       │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Low latency (local reads)                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Consistency Spectrum

Consistency is not binary - it’s a spectrum:
STRONGER ─────────────────────────────────────────────────── WEAKER
   │                                                           │
   ▼                                                           ▼
┌──────────────┬─────────────┬────────────┬──────────────┬────────────┐
│ Linearizable │ Sequential  │  Causal    │   Session    │  Eventual  │
│              │ Consistent  │            │  Consistent  │            │
└──────────────┴─────────────┴────────────┴──────────────┴────────────┘
   │                │              │             │              │
   │                │              │             │              │
   ▼                ▼              ▼             ▼              ▼
All ops appear   All ops see    Causally     Same session   Eventually
in real-time    same order     related ops   sees own       all see
order           (not real-     in order      writes         same value
                time)
                
Cost: Highest ◄─────────────────────────────────────────► Cost: Lowest
Latency: High ◄─────────────────────────────────────────► Latency: Low

External Consistency vs. Linearizability

For Staff/Principal level engineering, you must distinguish between system-local and global-time guarantees.

Linearizability (Atomic Consistency)

Linearizability is a guarantee for a single object (or key). It ensures that:
  1. Every operation appears to take effect atomically at some point between its invocation and response.
  2. All operations are seen in the same order by all participants.
  3. If a write WW completes before a read RR starts (according to the system’s observer), RR must see the result of WW.

External Consistency (Strict Serializability)

External Consistency is a stronger, global guarantee, famously used by Google Spanner. It combines Linearizability with Serializability across multiple shards/objects and respects global wall-clock time.
FeatureLinearizabilityExternal Consistency
ScopeSingle object/keyMulti-shard transactions
OrderingSystem-observed orderGlobal wall-clock order
ImplementationPaxos/Raft on one shardTrueTime / Commit-Wait / HLC
The “Why”: In a multi-shard system, two transactions T1T_1 and T2T_2 might affect different shards. If T1T_1 finishes at 10:00:01 and a client then starts T2T_2 at 10:00:02, T2T_2 must see the effects of T1T_1. Without external consistency (e.g., using only local Raft groups with skewed clocks), T2T_2 might get a timestamp that is logically earlier than T1T_1, leading to causal violations.

Module 6: Modern Infrastructure & Serverless Internals

To achieve the “Principal” level of understanding, one must look beyond the logical protocols and into the physical isolation models that power modern clouds. The shift from monolithic VMs to Micro-VMs has fundamentally changed how we scale distributed systems.

The Serverless Paradox

Serverless (AWS Lambda, Google Cloud Functions) promises “no servers,” but in reality, it requires the most complex server management in existence. The challenge is The Cold Start Problem vs. Isolation Security.
  • Traditional VMs (EC2): Strong isolation, but slow to boot (30s+). Too slow for “on-demand” execution.
  • Containers (Docker/K8s): Fast to start, but weak isolation (shared kernel). Dangerous for multi-tenant code execution.

Micro-VM Internals: AWS Firecracker

AWS Firecracker is the technology behind Lambda and Fargate. It uses KVM (Kernel-based Virtual Machine) but stripped down to the bare essentials.
  1. Minimal Device Model: Firecracker drops legacy hardware support (no USB, no video). It only provides:
    • Net (Network)
    • Block (Storage)
    • Serial (Console)
    • Balloon (Memory management)
  2. Snapshot-and-Restore: Instead of booting a kernel, Firecracker can restore from a Memory Snapshot. This reduces “boot” time from seconds to ~10ms.
  3. Jailer: A secondary security layer that uses cgroups, namespaces, and seccomp to ensure that even if a Micro-VM is compromised, it cannot touch the host or other VMs.

Implications for Distributed Design

When designing systems for serverless:
  • Statelessness is Mandatory: Because the Micro-VM can be killed or recycled at any moment.
  • Sidecar Overhead: Traditional sidecars (Service Mesh) add too much latency for 50ms functions. We move towards Proxyless Mesh or Library-based approaches.
  • Connection Pooling: Traditional database connection pools fail at serverless scale. We use RDS Proxy or HTTP-based database protocols (Data API).

Key Interview Questions

Answer: No, provably impossible (FLP theorem). But we can achieve it practically by:
  • Using timeouts (introduces synchrony assumption)
  • Randomization
  • Failure detectors
Most practical systems (Raft, Paxos) assume partial synchrony.
Answer: You can’t with certainty. Approaches:
  • Timeouts: Simple but prone to false positives
  • Heartbeats: Regular health checks
  • Phi accrual detector: Probabilistic suspicion level
  • Lease-based: Node must renew lease to be considered alive
In practice, use adaptive timeouts based on historical latency.
Answer:
  • Lamport: Single counter, preserves “happens-before” one direction only
  • Vector clocks: One counter per node, can detect concurrent events
Use Lamport when you just need ordering. Use vector clocks when you need to detect conflicts.
Answer: Frame it as a trade-off discussion:
  • First, acknowledge CAP only applies during partitions
  • Discuss what consistency level you actually need
  • Explain PACELC for normal operation
  • Give examples: “For our payment system, we’re CP because incorrect balance is unacceptable. For user preferences, we’re AP because eventual consistency is fine.”

Next Steps

Continue to Track 2: Consensus Protocols

Learn Paxos, Raft, and other consensus algorithms that power distributed systems

Interview Deep-Dive

Strong Answer:
  • FLP proves that no deterministic algorithm can guarantee consensus in a fully asynchronous system where even one process may crash. The key word is “guarantee” — it is an impossibility of absolute liveness, not of safety.
  • Practical systems like Raft and Paxos work around FLP by introducing partial synchrony assumptions. Specifically, they use timeouts to detect suspected failures. Once you add a timeout, you are no longer in a purely asynchronous model — you are assuming that messages will eventually be delivered within some bound. FLP does not apply to partially synchronous systems.
  • The trade-off is that during periods of true asynchrony (extreme network delays, GC pauses that exceed the timeout), these protocols may fail to make progress (liveness violation) but they never violate safety (they never agree on two different values). This is exactly the right trade-off for production systems: it is acceptable to be temporarily unavailable, but it is never acceptable to corrupt data.
  • Another approach is randomized consensus (like Ben-Or’s protocol), which circumvents FLP by using randomization rather than determinism. The expected number of rounds is finite, but no single execution is guaranteed to terminate.
Follow-up: Can you describe a scenario where the FLP result actually manifests in a production system?Consider a Raft cluster where the leader experiences a long GC pause. During the pause, followers time out and start an election. But the GC pause ends and the old leader resumes sending heartbeats before the election completes. Followers receive heartbeats from the old leader and reset their election timers, canceling the election. If this cycle repeats — GC pause, election starts, GC ends, election cancels — the cluster can enter a livelock where no leader is elected and no progress is made. This is FLP manifesting in practice. The mitigation is randomized election timeouts in Raft (150-300ms range), which break symmetry and make it statistically improbable for this livelock to persist. But “statistically improbable” is not “impossible” — FLP reminds us that no amount of engineering eliminates this possibility entirely.
Strong Answer:
  • A full partition splits the cluster into two groups that cannot communicate at all. This is the classic CAP scenario: each side must independently decide whether to keep serving (AP) or stop (CP). Raft handles this cleanly — the majority side elects a leader, the minority side stops accepting writes.
  • A partial partition is when some nodes can communicate with some but not all other nodes. For example, in a 5-node cluster, node A can talk to B and C, node D can talk to C and E, but A cannot talk to D or E. Node C acts as a bridge. This is more insidious because quorum-based systems may still form quorums that do not overlap correctly, and different nodes have different views of who is reachable.
  • An asymmetric partition is when A can send to B but B cannot send to A. TCP will eventually detect this (via ACK timeouts), but UDP-based gossip or heartbeat systems may not. Node A thinks B is alive (because A’s sends succeed), while B thinks A is dead (because B never receives anything).
  • The distinction matters because most distributed systems are designed for full partitions, but partial and asymmetric partitions are more common in practice (misconfigured firewalls, flaky switches, half-duplex network failures). If your failure detector assumes symmetric communication, an asymmetric partition can cause split-brain.
Follow-up: How would you test your system’s behavior under a partial partition?I would use a network simulation layer (like Jepsen’s iptables-based nemesis or Toxiproxy) that can selectively drop traffic between specific pairs of nodes. The test scenario would create a partition where nodes A-B-C can communicate, D-E can communicate, but C is the only bridge between the two groups. Then I would verify: can the system still make progress? Does it correctly detect the partial partition? Does it degrade gracefully? I would also test the asymmetric case by dropping packets in one direction only using iptables INPUT/OUTPUT rules. The key metrics to watch are: leader election stability, write latency, and most importantly, whether any safety invariants (like linearizability) are violated during the partition.
Strong Answer:
  • I would respectfully push back. BFT is the right tool for a very specific threat model: mutually distrustful participants who may actively lie (blockchains, multi-party financial systems). For the vast majority of internal distributed systems, crash-fault tolerance (CFT) is sufficient and dramatically cheaper.
  • The cost difference is significant. CFT protocols like Raft need 2f+1 nodes to tolerate f failures, with O(n) message complexity per consensus round. BFT protocols like PBFT need 3f+1 nodes with O(n-squared) message complexity. For a 5-node Raft cluster tolerating 2 failures, you need 5 nodes. For BFT tolerating 2 failures, you need 7 nodes, each exchanging quadratically more messages.
  • In practice, the failures that cause production outages are crashes, network partitions, disk failures, and software bugs — not malicious nodes. If you are worried about buggy nodes sending corrupt data, you can add checksums and input validation at the application layer for a fraction of the cost of full BFT.
  • The exception: if you are building a system where different organizations contribute nodes and have economic incentives to cheat, BFT is necessary. This is why blockchain consensus uses BFT variants.
Follow-up: What about the case where a software bug causes a node to send corrupt or inconsistent messages? Is that not effectively Byzantine behavior?This is an excellent point and a real concern. A node with a memory corruption bug or a deserialization error might send valid-looking but incorrect messages — this is sometimes called an “accidental Byzantine” fault. However, the practical response is not to deploy full BFT. Instead, I would use defense-in-depth: checksums on all messages to detect corruption, assertion-based crash detection (if a node detects its own state is inconsistent, it crashes rather than propagating corruption), and Merkle-tree-based anti-entropy to detect data divergence between replicas. These approaches catch accidental Byzantine behavior at a fraction of the cost. For true Byzantine resilience, I would only invest in it for the specific data paths where corruption would be undetectable and catastrophic.
Strong Answer:
  • PACELC extends CAP by asking: even when there is no partition, what trade-off does the system make between latency and consistency? The framework is: if Partition, choose Availability or Consistency; Else, choose Latency or Consistency.
  • PA/EL (Available during partitions, Low latency normally): DynamoDB and Cassandra. During a partition, they keep serving from whichever replicas are reachable. During normal operation, they return responses from the nearest replica without waiting for quorum, giving low latency but eventual consistency.
  • PA/EC (Available during partitions, Consistent normally): MongoDB with default read concern. During normal operation, reads go to the primary, which is consistent. During a partition, secondaries can still serve reads (with possible staleness), maintaining availability.
  • PC/EL (Consistent during partitions, Low latency normally): Yahoo’s PNUTS (now largely historical). It blocks writes during partitions to maintain per-record consistency, but during normal operation it routes reads to the nearest replica for low latency using “timeline consistency.”
  • PC/EC (Consistent during partitions, Consistent normally): Google Spanner. It blocks during partitions (choosing consistency), and during normal operation it uses TrueTime and commit-wait to provide external consistency, accepting higher latency (1-7ms commit wait) for global ordering guarantees.
Follow-up: Can a single system offer different PACELC trade-offs for different operations?Absolutely, and this is how most sophisticated production systems work. Cassandra is the canonical example: you can set consistency level per query. A write with QUORUM and a read with QUORUM gives you PC/EC behavior for that specific operation. A write with ONE and a read with ONE gives you PA/EL behavior. The same cluster, the same data, but different consistency-latency trade-offs per request. Spanner does something similar with its read-only transactions: a “strong read” gives you external consistency (PC/EC), while a “stale read” with a bounded staleness of 10 seconds gives you lower latency (closer to PC/EL). The design insight is that consistency should be a dial, not a switch — and different use cases within the same application should be able to turn that dial independently.