
Track 1: Foundations

Before diving into consensus protocols and complex architectures, you must understand the fundamental challenges that make distributed systems hard.
Track Duration: 28-38 hours
Modules: 6
Key Topics: Network fundamentals, Time & Ordering, Failure Models, CAP/PACELC

Module 1: Why Distributed Systems?

The Need for Distribution

Every successful system eventually outgrows a single machine:
+-------------------------------------------------------------+
|                     WHY DISTRIBUTED SYSTEMS?               |
+-------------------------------------------------------------+
|                                                             |
|  SINGLE MACHINE LIMITS          DISTRIBUTION BENEFITS      |
|  ---------------------          ---------------------      |
|  - CPU: ~128 cores max          - Horizontal scaling       |
|  - RAM: ~12 TB max (expensive)  - Fault tolerance          |
|  - Disk: I/O bottlenecks        - Geographic distribution  |
|  - Network: Single point        - Cost efficiency          |
|    of failure                   - Regulatory compliance    |
|                                                             |
|  EXAMPLE: Google Search                                    |
|  ---------------------                                     |
|  - 8.5 billion searches/day                               |
|  - Index: 100+ petabytes                                  |
|  - Response time: < 0.5 seconds                           |
|  - Impossible on a single machine                         |
|                                                             |
+-------------------------------------------------------------+

Types of Distributed Systems

Compute Clusters

Purpose: Process data across many machines. Examples:
  • Hadoop/Spark clusters
  • Kubernetes pods
  • AWS Lambda fleet

Storage Systems

Purpose: Store and retrieve data across machines. Examples:
  • Distributed databases (Cassandra, DynamoDB)
  • Object stores (S3, GCS)
  • File systems (HDFS, Ceph)

Message Systems

Purpose: Enable communication between distributed components. Examples:
  • Message queues (Kafka, RabbitMQ)
  • Pub/sub systems (SNS, Pub/Sub)
  • Event streams (Kinesis, Event Hubs)

The Eight Fallacies of Distributed Computing

Every distributed systems engineer must internalize these:
Fallacy 1: The network is reliable

Reality: Networks fail constantly.
Common failures:
- Packet loss (1 in 1000 typical, worse under load)
- Network partitions (entire segments become unreachable)
- Cable cuts (more common than you'd think)
- Switch/router failures

Real incident: AWS US-East-1 2011
- Network configuration change
- Cascading failures
- Multi-hour outage affecting Netflix, Reddit, Quora
Defense: Implement retries, timeouts, circuit breakers, and idempotency.
Fallacy 2: Latency is zero

Reality: Every network call has latency.
| Operation | Time |
|---|---|
| Same datacenter | 0.5 ms |
| Cross-region (US East ↔ West) | 40-65 ms |
| Cross-continent (US ↔ Europe) | 70-120 ms |
| Cross-Pacific (US ↔ Asia) | 100-200 ms |
Impact: A service with 10 sequential remote calls adds 500-2000 ms of latency. Defense: Cache aggressively, parallelize calls, use async processing (see the sketch below).
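
As an illustration of the "parallelize calls" defense, a minimal asyncio sketch; the service names and the ~50 ms simulated latency are hypothetical stand-ins for real remote calls:

```python
import asyncio
import random
import time

async def remote_call(name: str) -> str:
    """Stand-in for a remote RPC with roughly 50 ms of network latency."""
    await asyncio.sleep(0.05 + random.uniform(0, 0.01))
    return f"{name}-result"

async def sequential(names):
    # 10 calls x ~50 ms each => roughly 500 ms total
    return [await remote_call(n) for n in names]

async def parallel(names):
    # Calls overlap, so total time is roughly one round trip (~50-60 ms)
    return await asyncio.gather(*(remote_call(n) for n in names))

async def main():
    names = [f"service-{i}" for i in range(10)]
    for fn in (sequential, parallel):
        start = time.perf_counter()
        await fn(names)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")

if __name__ == "__main__":
    asyncio.run(main())
```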
Fallacy 3: Bandwidth is infinite

Reality: Bandwidth is expensive and limited.
Example: Transferring 1TB over network

10 Gbps link:
- Theoretical: 13 minutes
- Real-world: 20-30 minutes (overhead, congestion)

AWS Data Transfer Costs:
- Intra-region: $0.01/GB
- Cross-region: $0.02/GB
- To internet: $0.09/GB
- 1PB/month internet egress: $90,000
Defense: Compress data, use efficient serialization (protobuf), batch requests.
Fallacy 4: The network is secure

Reality: Assume the network is compromised.
Attack vectors:
  • Man-in-the-middle attacks
  • DNS spoofing
  • BGP hijacking (internet routing attacks)
  • Insider threats
Defense: mTLS everywhere, zero-trust architecture, encrypt at rest and in transit.
Fallacy 5: Topology doesn't change

Reality: Network topology changes constantly.
Dynamic changes:
├── Auto-scaling adds/removes instances
├── Deployments replace containers
├── Failovers redirect traffic
└── Cloud provider maintenance
Defense: Use service discovery, health checks, load balancer draining.
Fallacy 6: There is one administrator

Reality: Multiple teams, policies, and even companies are involved.
In a typical microservices architecture:
  • Platform team manages Kubernetes
  • Each service team manages their services
  • Security team manages policies
  • Network team manages infrastructure
Defense: Clear ownership, documented interfaces, SLAs between teams.
Fallacy 7: Transport cost is zero

Reality: Data transfer costs real money.
| Cloud Provider | Egress Cost |
|---|---|
| AWS | $0.09/GB |
| GCP | $0.12/GB |
| Azure | $0.087/GB |
High-traffic example: 1PB egress = $90,000/month. Defense: Keep data close to compute, use CDNs, compress aggressively.
Fallacy 8: The network is homogeneous

Reality: Different hardware, protocols, and vendors everywhere.
Heterogeneity:
├── Different server generations (various CPU, RAM)
├── Different network equipment (Cisco, Juniper, etc.)
├── Different protocols (HTTP/1.1, HTTP/2, gRPC, QUIC)
├── Different cloud providers (AWS, GCP, Azure)
└── Different latency profiles
Defense: Abstract hardware differences, use consistent protocols, test on diverse environments.

Module 2: Network Fundamentals

TCP Guarantees and Failures

TCP provides order and reliability, but not much else:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TCP GUARANTEES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ✓ GUARANTEED                         ✗ NOT GUARANTEED                      │
│  ─────────────                        ─────────────────                      │
│  ▸ Ordered delivery                   ▸ Bounded latency                     │
│  ▸ Reliable delivery (eventually)     ▸ Bounded bandwidth                   │
│  ▸ Error detection                    ▸ Connection will succeed             │
│  ▸ Flow control                       ▸ Message boundaries preserved        │
│  ▸ Duplicate elimination              ▸ Timely delivery                     │
│                                                                              │
│  FAILURE MODES                                                               │
│  ─────────────                                                              │
│  ▸ Connection reset (RST)                                                   │
│  ▸ Connection timeout                                                       │
│  ▸ Half-open connections (one side doesn't know it's dead)                 │
│  ▸ Head-of-line blocking                                                    │
│  ▸ TCP incast (many-to-one overwhelms receiver)                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
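
One practical consequence of the "message boundaries" row: TCP delivers a byte stream, so applications must frame messages themselves. A minimal length-prefix framing sketch (illustrative only, not tied to any particular library or protocol):

```python
import socket
import struct

def send_framed(sock: socket.socket, payload: bytes) -> None:
    """Prefix each message with a 4-byte big-endian length so the receiver can find boundaries."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; a single recv() may return only part of the data."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed connection mid-message")
        buf += chunk
    return buf

def recv_framed(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```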

Network Partitions

A network partition occurs when nodes can’t communicate with each other:
BEFORE PARTITION:
                    ┌────────────────┐
                    │   Network      │
                    └────────────────┘
                    ↑    ↑    ↑    ↑
                    │    │    │    │
               ┌────┼────┼────┼────┼────┐
               │    │    │    │    │    │
              [A]  [B]  [C]  [D]  [E]  [F]
              All nodes can communicate


DURING PARTITION:
     Partition 1              ║              Partition 2
  ┌─────────────┐             ║           ┌─────────────┐
  │   Network   │             ║           │   Network   │
  └─────────────┘             ║           └─────────────┘
   ↑     ↑    ↑               ║            ↑     ↑    ↑
   │     │    │           PARTITION        │     │    │
  [A]   [B]  [C]              ║           [D]   [E]  [F]
  
  A, B, C can talk             D, E, F can talk
  to each other                to each other
  but NOT to D, E, F           but NOT to A, B, C
Key insight: During a partition, you must choose between:
  • Availability: Both partitions continue serving requests (might diverge)
  • Consistency: Reject requests from the minority partition

Message Delivery Semantics

At-most-once (fire and forget): no retries
def send_message(message):
    network.send(message)  # No retry on failure
    # Message delivered 0 or 1 times
Use case: Metrics, logs (losing some is acceptable). Risk: Message loss
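
For contrast, a minimal at-least-once sender sketch; `network.send_and_wait_for_ack` is a hypothetical interface mirroring the snippet above. Because retries can deliver the same message more than once, receivers must deduplicate (see the idempotency pattern in Module 4):

```python
import time

def send_at_least_once(network, message, retries=5, timeout_s=1.0) -> bool:
    """Retry until acknowledged. The message may be delivered 1..N times (never silently
    dropped unless all retries are exhausted), so receivers must be idempotent."""
    for attempt in range(retries):
        if network.send_and_wait_for_ack(message, timeout=timeout_s):
            return True
        time.sleep(min(0.1 * (2 ** attempt), 2.0))  # exponential backoff, capped at 2s
    return False
```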

Failure Detection

How do you know if a node is dead?
┌─────────────────────────────────────────────────────────────────────────────┐
│                      FAILURE DETECTION APPROACHES                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HEARTBEAT-BASED                                                            │
│  ───────────────                                                            │
│  Node A ──[heartbeat]──> Node B                                             │
│           every 1 sec                                                        │
│                                                                              │
│  If no heartbeat for 5 seconds → Mark as failed                             │
│                                                                              │
│  Problems:                                                                   │
│  ├── Too aggressive: False positives (just slow, not dead)                  │
│  └── Too conservative: Slow detection of actual failures                    │
│                                                                              │
│  PHI ACCRUAL FAILURE DETECTOR (Cassandra uses this)                         │
│  ─────────────────────────────                                              │
│  Instead of binary (alive/dead), output a suspicion level:                  │
│                                                                              │
│  φ = -log₁₀(probability that the heartbeat is merely delayed)               │
│                                                                              │
│  φ = 1 → ~90% confident the node has failed (10% chance it is just late)    │
│  φ = 2 → ~99% confident the node has failed (1% chance it is just late)     │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
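
A minimal sketch of a phi accrual detector, assuming heartbeat inter-arrival times are roughly normally distributed (the production Cassandra detector models the distribution tail differently; this is illustrative only):

```python
import math
import time

class PhiAccrualDetector:
    """Suspicion level grows smoothly the longer a heartbeat is overdue."""

    def __init__(self, threshold: float = 8.0, max_samples: int = 100):
        self.threshold = threshold
        self.intervals = []            # recent heartbeat inter-arrival times (seconds)
        self.max_samples = max_samples
        self.last_heartbeat = None

    def heartbeat(self, now: float = None):
        now = now if now is not None else time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.max_samples:]
        self.last_heartbeat = now

    def phi(self, now: float = None) -> float:
        if self.last_heartbeat is None or len(self.intervals) < 2:
            return 0.0
        now = now if now is not None else time.time()
        elapsed = now - self.last_heartbeat
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)
        # P(heartbeat arrives later than `elapsed`) under a normal model
        z = (elapsed - mean) / std
        p_later = 0.5 * math.erfc(z / math.sqrt(2))
        return -math.log10(max(p_later, 1e-12))

    def is_suspected(self) -> bool:
        return self.phi() > self.threshold
```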
Module 3: Time and Ordering

**This is one of the most important topics in distributed systems.**

### Why Physical Clocks Fail

┌──────────────────────────────────────────────────────────────────────────────┐
│                          PHYSICAL CLOCK PROBLEMS                              │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                               │
│  CLOCK SKEW (different machines show different times)                        │
│  ──────────                                                                   │
│  Machine A: 10:00:00.000                                                      │
│  Machine B: 10:00:00.150  (150ms ahead)                                       │
│  Machine C: 09:59:59.850  (150ms behind)                                      │
│                                                                               │
│  NTP SYNCHRONIZATION                                                          │
│  ───────────────────                                                          │
│  NTP accuracy: 1-10ms on LAN, 10-100ms over internet                          │
│  During sync: Clock can jump forward or backward!                             │
│                                                                               │
│  CLOCK DRIFT                                                                  │
│  ───────────                                                                  │
│  Quartz crystals drift: ~50ppm (50 microseconds per second)                   │
│  After 1 hour: 180ms drift possible                                           │
│                                                                               │
│  LEAP SECONDS                                                                 │
│  ────────────                                                                 │
│  Added to compensate for Earth’s rotation slowing                             │
│  23:59:59 → 23:59:60 → 00:00:00                                               │
│  Famously caused issues at Reddit, LinkedIn, Mozilla (2012)                   │
│                                                                               │
└──────────────────────────────────────────────────────────────────────────────┘

### Lamport Timestamps

Leslie Lamport's logical clock solution (1978):

**Rules:**
1. Each process maintains a counter (starts at 0)
2. Before each event, increment the counter
3. When sending: attach current counter to message
4. When receiving: counter = max(local, received) + 1

**Example:**
- Process A increments to 1 and sends to B; B receives and sets its clock to max(0, 1) + 1 = 2
- Process B increments to 3 and sends to C; C receives and sets its clock to max(0, 3) + 1 = 4
- Process A increments to 2 and sends to B; B receives and sets its clock to max(3, 2) + 1 = 4
- Process B increments to 5 and sends to C; C receives and sets its clock to max(4, 5) + 1 = 6

**Property:** If A happened before B, then L(A) < L(B)
**Limitation:** L(A) < L(B) does NOT mean A happened before B (the events could be concurrent; Lamport clocks cannot detect concurrency)

**Implementation**:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0
    
    def increment(self):
        """Call before any local event"""
        self.counter += 1
        return self.counter
    
    def send(self):
        """Call when sending a message"""
        self.counter += 1
        return self.counter
    
    def receive(self, received_timestamp):
        """Call when receiving a message"""
        self.counter = max(self.counter, received_timestamp) + 1
        return self.counter
```

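A quick usage check of the class above, replaying the worked example:

```python
a, b, c = LamportClock(), LamportClock(), LamportClock()

b.receive(a.send())    # A -> B with timestamp 1; B's clock = max(0, 1) + 1 = 2
c.receive(b.send())    # B -> C with timestamp 3; C's clock = max(0, 3) + 1 = 4
b.receive(a.send())    # A -> B with timestamp 2; B's clock = max(3, 2) + 1 = 4
c.receive(b.send())    # B -> C with timestamp 5; C's clock = max(4, 5) + 1 = 6

assert b.counter == 5 and c.counter == 6
```
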
Vector Clocks

Vector clocks track causality: they tell you whether two events are causally related or concurrent.

Advanced: Vector Clocks vs. Version Vectors

In many textbooks, “Vector Clocks” and “Version Vectors” are used interchangeably. However, at the Staff level, you must distinguish between them:
  1. Vector Clocks (Causality): Track the relationship between events in a distributed system. Every interaction (send/receive) increments the clock. They are used to build a partial order of all operations.
  2. Version Vectors (Conflict Tracking): Track the relationship between versions of data. They only increment when data is updated. They are used in systems like DynamoDB and Riak to detect if two versions of a document are in conflict (siblings).
| Feature | Vector Clocks | Version Vectors |
|---|---|---|
| Goal | Order events (A happened before B) | Detect data conflicts |
| Increment | Every event (internal + communication) | Only on data modification |
| Overhead | High (grows with event count) | Low (grows with replica count) |
Staff Tip: If you only need to know if two versions of a file conflict, use Version Vectors. If you need to know if a message was “caused” by another message (e.g., in a distributed debugger), use Vector Clocks.

Implementation:

class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes
    
    def increment(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def send(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def receive(self, received_clock):
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], received_clock[i])
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def happens_before(self, other_clock):
        """Returns True if this clock happened before other_clock"""
        return (all(s <= o for s, o in zip(self.clock, other_clock)) and
                any(s < o for s, o in zip(self.clock, other_clock)))
    
    def concurrent(self, other_clock):
        """Returns True if events are concurrent (neither happened before the other)"""
        other_before_self = (all(o <= s for s, o in zip(self.clock, other_clock)) and
                             any(o < s for s, o in zip(self.clock, other_clock)))
        return not self.happens_before(other_clock) and not other_before_self
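
A short usage sketch of the class above, showing how two replicas end up with concurrent (incomparable) clocks:

```python
a = VectorClock(node_id=0, num_nodes=2)
b = VectorClock(node_id=1, num_nodes=2)

b.receive(a.send())    # A: [1, 0] -> B merges and ticks its own slot: [1, 1]
a.increment()          # A performs a local event B has not seen: [2, 0]

# [2, 0] and [1, 1] are incomparable: neither happened before the other
assert a.concurrent(b.clock) and b.concurrent(a.clock)
assert not a.happens_before(b.clock) and not b.happens_before(a.clock)
```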

Hybrid Logical Clocks (HLC)

Combines physical and logical clocks:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HYBRID LOGICAL CLOCKS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HLC = (physical_time, logical_counter)                                     │
│                                                                              │
│  PROPERTIES:                                                                │
│  1. Always greater than physical time (useful for debugging)                │
│  2. Preserves causality like logical clocks                                 │
│  3. Bounded skew from physical time                                         │
│                                                                              │
│  USED BY:                                                                   │
│  ├── CockroachDB                                                            │
│  ├── MongoDB                                                                │
│  └── Many distributed databases                                             │
│                                                                              │
│  ALGORITHM:                                                                 │
│  ─────────                                                                  │
│  On local event or send:                                                    │
│    l' = max(l, physical_time)                                               │
│    if l' == l: c' = c + 1                                                   │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
│  On receive(m.l, m.c):                                                      │
│    l' = max(l, m.l, physical_time)                                          │
│    if l' == l == m.l: c' = max(c, m.c) + 1                                  │
│    else if l' == l: c' = c + 1                                              │
│    else if l' == m.l: c' = m.c + 1                                          │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
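
A minimal Python sketch of the HLC update rules from the box above; `physical_time_ms` is a stand-in wall-clock source, and real implementations also bound how far `l` may run ahead of physical time:

```python
import time

def physical_time_ms() -> int:
    return int(time.time() * 1000)

class HybridLogicalClock:
    def __init__(self):
        self.l = 0   # physical component (ms)
        self.c = 0   # logical counter

    def now(self):
        """On a local event or before sending a message."""
        pt = physical_time_ms()
        l_new = max(self.l, pt)
        self.c = self.c + 1 if l_new == self.l else 0
        self.l = l_new
        return (self.l, self.c)

    def update(self, m_l: int, m_c: int):
        """On receiving a message carrying timestamp (m_l, m_c)."""
        pt = physical_time_ms()
        l_new = max(self.l, m_l, pt)
        if l_new == self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == m_l:
            self.c = m_c + 1
        else:
            self.c = 0
        self.l = l_new
        return (self.l, self.c)
```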

TrueTime (Google Spanner)

Google’s hardware-based approach:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TRUETIME                                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HARDWARE:                                                                  │
│  ├── GPS receivers (provide absolute time)                                  │
│  └── Atomic clocks (maintain time during GPS outages)                      │
│                                                                              │
│  API:                                                                       │
│  TT.now() returns [earliest, latest] interval                               │
│  TT.after(t) → true if t has definitely passed                             │
│  TT.before(t) → true if t has definitely not arrived                       │
│                                                                              │
│  UNCERTAINTY:                                                               │
│  ├── Average: 4ms                                                           │
│  └── Worst case: 10ms                                                       │
│                                                                              │
│  HOW SPANNER USES IT:                                                       │
│  ────────────────────                                                       │
│  Transaction commits at timestamp t                                         │
│  Wait until TT.after(t) before making visible                               │
│  This guarantees external consistency!                                      │
│                                                                              │
│  COST: Special hardware + GPS antennas + careful engineering                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
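
A hedged sketch of the commit-wait idea using a made-up `TrueTimeStub` (the real TrueTime API is internal to Google; the uncertainty value here is illustrative):

```python
import time

class TrueTimeStub:
    """Illustrative stand-in: returns an [earliest, latest] interval around local time."""
    UNCERTAINTY_S = 0.004  # ~4ms average uncertainty, per the description above

    def now(self):
        t = time.time()
        return (t - self.UNCERTAINTY_S, t + self.UNCERTAINTY_S)

    def after(self, t: float) -> bool:
        earliest, _ = self.now()
        return earliest > t        # t has definitely passed on every node

def commit_wait(tt: TrueTimeStub, commit_timestamp: float):
    """Spanner-style commit wait: delay visibility until the commit timestamp is
    guaranteed to be in the past everywhere, which yields external consistency."""
    while not tt.after(commit_timestamp):
        time.sleep(0.0005)

# Usage: pick the commit timestamp at the latest bound, then wait it out
tt = TrueTimeStub()
_, latest = tt.now()
commit_wait(tt, latest)   # blocks roughly 2 x uncertainty before making the write visible
```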

Module 4: Failure Models

Understanding what can go wrong is crucial for building resilient systems.

Types of Failures

Crash-Stop Failures

Node crashes and stays down. Characteristics:
  • Clean failure, detectable
  • Node does not recover (so there is no corrupted post-recovery state to handle)
  • Easiest to handle
Example: Hardware failure, power loss
[Node A] ──── X ────────────────────

          Crash
          (other nodes eventually detect)

Partial Failures

The defining challenge of distributed systems:
┌─────────────────────────────────────────────────────────────────────────────┐
│                        PARTIAL FAILURES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Node A calls Node B                                              │
│                                                                              │
│  A ──[request]──> B                                                         │
│                   │                                                         │
│         (silence)                                                           │
│                                                                              │
│  What happened?                                                             │
│  ├── Request lost in network?                                               │
│  ├── B crashed before processing?                                           │
│  ├── B processed but crashed before responding?                             │
│  ├── B responded but response lost?                                         │
│  └── B is just slow?                                                        │
│                                                                              │
│  A DOESN'T KNOW!                                                            │
│                                                                              │
│  IMPLICATIONS:                                                              │
│  ─────────────                                                              │
│  1. Cannot distinguish slow from dead                                       │
│  2. Operations might have executed (or not)                                 │
│  3. Must design for uncertainty                                             │
│  4. Idempotency becomes essential                                           │
│                                                                              │
│  PATTERN: Retry with idempotency keys                                       │
│  ─────────────────────────────────                                          │
│  request_id = "abc123"                                                      │
│  if already_processed(request_id):                                          │
│      return cached_response                                                 │
│  else:                                                                      │
│      process_and_store(request_id, response)                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
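
A minimal server-side sketch of the idempotency-key pattern from the box above; the in-memory dict stands in for a durable store, and `process` is a hypothetical business-logic handler:

```python
class IdempotentHandler:
    """Safely handles retried requests: the same request_id is processed at most once."""

    def __init__(self, process):
        self.process = process      # hypothetical business-logic callable
        self.responses = {}         # request_id -> cached response (use a durable store in production)

    def handle(self, request_id: str, payload: dict):
        if request_id in self.responses:
            return self.responses[request_id]    # retry after a partial failure: replay cached response
        response = self.process(payload)
        self.responses[request_id] = response    # store atomically with side effects in a real system
        return response

# Usage: a retried request with the same id returns the cached response
handler = IdempotentHandler(process=lambda p: {"status": "charged", "amount": p["amount"]})
first = handler.handle("abc123", {"amount": 42})
retry = handler.handle("abc123", {"amount": 42})
assert first is retry
```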

Gray Failures

Subtle failures that are hard to detect:
Examples of gray failures:
├── CPU running at 50% speed due to thermal throttling
├── One of 10 disks has 100x higher latency
├── Memory errors causing random crashes
├── Network dropping 1% of packets
├── Software bug triggered only by specific input
└── Gradual memory leak over hours

Detection challenges:
├── Metrics look "mostly okay"
├── Intermittent symptoms
├── Hard to reproduce
└── May only affect some requests

Byzantine Fault Tolerance (BFT)

Byzantine failures are the most challenging to handle:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    BYZANTINE FAULT TOLERANCE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THE BYZANTINE GENERALS PROBLEM                                              │
│  ─────────────────────────────                                              │
│  Multiple generals must agree on attack/retreat                             │
│  Some generals may be traitors (send different messages)                    │
│  How do loyal generals reach consensus?                                     │
│                                                                              │
│  REQUIREMENTS FOR BFT:                                                      │
│  ────────────────────                                                       │
│  To tolerate f Byzantine nodes: Need 3f + 1 total nodes                     │
│  - 1 faulty node: Need 4 nodes minimum                                      │
│  - 2 faulty nodes: Need 7 nodes minimum                                     │
│  - 3 faulty nodes: Need 10 nodes minimum                                    │
│                                                                              │
│  WHY 3f + 1?                                                                │
│  ──────────                                                                 │
│  f nodes may lie, and a further f may be slow or unreachable                │
│  Waiting for n - f = 2f+1 replies still leaves f+1 honest responses,        │
│  which outnumber the f possible liars: (3f+1) - f - f = f+1                 │
│                                                                              │
│  PRACTICAL BFT ALGORITHMS:                                                  │
│  ───────────────────────                                                    │
│  ├── PBFT (Practical BFT) - Castro & Liskov, 1999                          │
│  ├── HotStuff - Used by Facebook's Libra/Diem                               │
│  ├── Tendermint - Used in blockchain systems                                │
│  └── RBFT - Redundant BFT for improved performance                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
PBFT Algorithm Overview:
# Simplified PBFT flow (sketch: broadcast/validate/forward helpers omitted)
from collections import defaultdict

class PBFTNode:
    def __init__(self, node_id, total_nodes):
        self.node_id = node_id
        self.n = total_nodes
        self.f = (total_nodes - 1) // 3  # Max faulty nodes tolerated
        self.view = 0
        self.sequence = 0
        self.prepare_count = defaultdict(int)  # seq -> PREPARE votes seen
        self.commit_count = defaultdict(int)   # seq -> COMMIT votes seen
        
    def is_primary(self):
        return self.view % self.n == self.node_id
    
    def receive_client_request(self, request):
        if not self.is_primary():
            # Forward to primary
            return self.forward_to_primary(request)
        
        # Primary broadcasts PRE-PREPARE
        self.sequence += 1
        self.broadcast_preprepare(request, self.sequence)
    
    def on_preprepare(self, request, seq):
        # Backup validates and broadcasts PREPARE
        if self.validate_preprepare(request, seq):
            self.broadcast_prepare(request, seq)
    
    def on_prepare(self, request, seq, sender):
        # Count PREPARE votes; 2f matching PREPAREs (plus the PRE-PREPARE) => prepared
        self.prepare_count[seq] += 1
        if self.prepare_count[seq] >= 2 * self.f:
            # Enter PREPARED state, broadcast COMMIT
            self.broadcast_commit(request, seq)

    def on_commit(self, request, seq, sender):
        # Count COMMIT votes; 2f + 1 matching COMMITs => committed
        self.commit_count[seq] += 1
        if self.commit_count[seq] >= 2 * self.f + 1:
            # Execute request and reply to client
            self.execute_and_reply(request)
When to Use BFT:
  • Blockchain and cryptocurrency systems
  • Multi-party financial transactions
  • High-security government systems
  • Any system with mutually distrustful participants
Performance Trade-off: BFT protocols are expensive! PBFT requires O(n²) messages per consensus round. Most internal systems use simpler crash-fault tolerant protocols (Raft, Paxos) which only require 2f+1 nodes.
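
As a quick sanity check on these cluster-size formulas, a tiny illustrative helper:

```python
def min_cluster_size(f: int, byzantine: bool) -> int:
    """Minimum nodes to tolerate f faulty nodes: 3f+1 for Byzantine faults, 2f+1 for crash faults."""
    return 3 * f + 1 if byzantine else 2 * f + 1

assert min_cluster_size(1, byzantine=True) == 4    # PBFT-style
assert min_cluster_size(2, byzantine=True) == 7
assert min_cluster_size(1, byzantine=False) == 3   # Raft/Paxos-style
```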

Network Partition Simulation

Understanding how to test partition tolerance:
# Simulating network partitions in tests
import random
from typing import Set, Dict, List

class NetworkSimulator:
    """
    Simulates network conditions including partitions, 
    delays, and message loss for testing distributed systems.
    """
    
    def __init__(self, nodes: List[str]):
        self.nodes = set(nodes)
        self.partitions: List[Set[str]] = [self.nodes.copy()]
        self.message_loss_rate = 0.0
        self.latency_ms = 0
        self.latency_jitter_ms = 0
    
    def create_partition(self, group1: Set[str], group2: Set[str]):
        """Split network into two partitions"""
        self.partitions = [group1, group2]
        print(f"PARTITION: {group1} | {group2}")
    
    def heal_partition(self):
        """Restore network connectivity"""
        self.partitions = [self.nodes.copy()]
        print("PARTITION HEALED")
    
    def can_communicate(self, from_node: str, to_node: str) -> bool:
        """Check if two nodes can communicate"""
        for partition in self.partitions:
            if from_node in partition and to_node in partition:
                return True
        return False
    
    def send_message(self, from_node: str, to_node: str, message: dict) -> bool:
        """
        Attempt to send a message with simulated failures.
        Returns True if message was delivered.
        """
        # Check partition
        if not self.can_communicate(from_node, to_node):
            print(f"BLOCKED: {from_node} -> {to_node} (partition)")
            return False
        
        # Simulate message loss
        if random.random() < self.message_loss_rate:
            print(f"LOST: {from_node} -> {to_node}")
            return False
        
        # Simulate latency
        delay = self.latency_ms + random.uniform(
            -self.latency_jitter_ms, 
            self.latency_jitter_ms
        )
        print(f"DELIVERED: {from_node} -> {to_node} ({delay:.1f}ms)")
        return True


# Example: Testing a distributed system under partition
def test_consensus_under_partition():
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    network = NetworkSimulator(nodes)
    
    # Normal operation
    assert network.can_communicate("node1", "node3")
    
    # Create minority/majority partition
    network.create_partition(
        {"node1", "node2"},           # Minority
        {"node3", "node4", "node5"}   # Majority
    )
    
    # Majority can reach consensus
    assert network.can_communicate("node3", "node4")
    assert network.can_communicate("node4", "node5")
    
    # Cross-partition communication blocked
    assert not network.can_communicate("node1", "node3")
    assert not network.can_communicate("node2", "node5")
    
    # Minority should reject writes (in CP system)
    # Majority should continue operating
    
    network.heal_partition()
    assert network.can_communicate("node1", "node5")

Chaos Engineering for Failure Testing

Best Practice: Don’t wait for production failures to discover weaknesses. Proactively inject failures in controlled environments.
# Example: Chaos experiment definition
experiment:
  name: "Network Partition Between Services"
  hypothesis: "Order service continues accepting reads during payment partition"
  
  steady_state:
    - probe: "order-service-health"
      type: http
      url: "http://order-service/health"
      expected_status: 200
    
  method:
    - action: "network-partition"
      target: 
        service: "payment-service"
        namespace: "production"
      duration: 60s
      
  rollback:
    - action: "heal-network"
      target:
        service: "payment-service"
        
  metrics_to_watch:
    - error_rate < 1%
    - p99_latency < 500ms
    - order_reads_successful > 99%

Module 5: CAP and PACELC Theorems

CAP Theorem

CAP is often misunderstood. The theorem says that during a network partition (P), a system must choose between Consistency (C) and Availability (A). It says nothing about behavior when the network is healthy.

CP vs AP Deep Dive

CP Systems

During a partition, choose Consistency. Behavior:
  • Minority partition stops accepting writes
  • May reject reads too (for linearizability)
  • Majority partition continues
Good for:
  • Financial systems
  • Inventory management
  • Any system where stale data is dangerous
Example: Bank account balance

AP Systems

During a partition, choose Availability. Behavior:
  • Both partitions continue serving
  • May return stale/conflicting data
  • Resolve conflicts after partition heals
Good for:
  • Social media feeds
  • Shopping carts
  • DNS
Example: Twitter timeline

PACELC: The Better Framework

CAP only describes behavior during partitions. PACELC adds normal operation:
┌─────────────────────────────────────────────────────────────────────────────┐
│                           PACELC                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IF Partition                                                               │
│     Choose between Availability and Consistency                             │
│  ELSE (normal operation)                                                    │
│     Choose between Latency and Consistency                                  │
│                                                                              │
│  EXAMPLES:                                                                  │
│  ─────────                                                                  │
│  DynamoDB/Cassandra: PA/EL                                                  │
│  ├── Partition: Available                                                   │
│  └── Normal: Low latency (eventual consistency)                             │
│                                                                              │
│  Spanner: PC/EC                                                             │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Consistent (higher latency, but distributed transactions)     │
│                                                                              │
│  MongoDB: PA/EC                                                             │
│  ├── Partition: Available (eventual reads)                                  │
│  └── Normal: Consistent (waits for primary)                                 │
│                                                                              │
│  PNUTS (Yahoo): PC/EL                                                       │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Low latency (local reads)                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Consistency Spectrum

Consistency is not binary - it’s a spectrum:
STRONGER ─────────────────────────────────────────────────── WEAKER
   │                                                           │
   ▼                                                           ▼
┌──────────────┬─────────────┬────────────┬──────────────┬────────────┐
│ Linearizable │ Sequential  │  Causal    │   Session    │  Eventual  │
│              │ Consistent  │            │  Consistent  │            │
└──────────────┴─────────────┴────────────┴──────────────┴────────────┘
   │                │              │             │              │
   │                │              │             │              │
   ▼                ▼              ▼             ▼              ▼
All ops appear   All ops see    Causally     Same session   Eventually
in real-time    same order     related ops   sees own       all see
order           (not real-     in order      writes         same value
                time)
                
Cost: Highest ◄─────────────────────────────────────────► Cost: Lowest
Latency: High ◄─────────────────────────────────────────► Latency: Low

External Consistency vs. Linearizability

For Staff/Principal level engineering, you must distinguish between system-local and global-time guarantees.

Linearizability (Atomic Consistency)

Linearizability is a guarantee for a single object (or key). It ensures that:
  1. Every operation appears to take effect atomically at some point between its invocation and response.
  2. All operations are seen in the same order by all participants.
  3. If a write W completes before a read R starts (according to the system’s observer), R must see the result of W.

External Consistency (Strict Serializability)

External Consistency is a stronger, global guarantee, famously used by Google Spanner. It combines Linearizability with Serializability across multiple shards/objects and respects global wall-clock time.
| Feature | Linearizability | External Consistency |
|---|---|---|
| Scope | Single object/key | Multi-shard transactions |
| Ordering | System-observed order | Global wall-clock order |
| Implementation | Paxos/Raft on one shard | TrueTime / Commit-Wait / HLC |
The “Why”: In a multi-shard system, two transactions T1 and T2 might affect different shards. If T1 finishes at 10:00:01 and a client then starts T2 at 10:00:02, T2 must see the effects of T1. Without external consistency (e.g., using only local Raft groups with skewed clocks), T2 might get a timestamp that is logically earlier than T1's, leading to causal violations.

Module 6: Modern Infrastructure & Serverless Internals

To achieve the “Principal” level of understanding, one must look beyond the logical protocols and into the physical isolation models that power modern clouds. The shift from monolithic VMs to Micro-VMs has fundamentally changed how we scale distributed systems.

The Serverless Paradox

Serverless (AWS Lambda, Google Cloud Functions) promises “no servers,” but in reality it requires some of the most complex server management in existence. The core tension is the cold start problem versus isolation security.
  • Traditional VMs (EC2): Strong isolation, but slow to boot (30s+). Too slow for “on-demand” execution.
  • Containers (Docker/K8s): Fast to start, but weak isolation (shared kernel). Dangerous for multi-tenant code execution.

Micro-VM Internals: AWS Firecracker

AWS Firecracker is the technology behind Lambda and Fargate. It uses KVM (Kernel-based Virtual Machine) but stripped down to the bare essentials.
  1. Minimal Device Model: Firecracker drops legacy hardware support (no USB, no video). It only provides:
    • Net (Network)
    • Block (Storage)
    • Serial (Console)
    • Balloon (Memory management)
  2. Snapshot-and-Restore: Instead of booting a kernel, Firecracker can restore from a Memory Snapshot. This reduces “boot” time from seconds to ~10ms.
  3. Jailer: A secondary security layer that uses cgroups, namespaces, and seccomp to ensure that even if a Micro-VM is compromised, it cannot touch the host or other VMs.

Implications for Distributed Design

When designing systems for serverless:
  • Statelessness is Mandatory: Because the Micro-VM can be killed or recycled at any moment.
  • Sidecar Overhead: Traditional sidecars (Service Mesh) add too much latency for 50ms functions. We move towards Proxyless Mesh or Library-based approaches.
  • Connection Pooling: Traditional database connection pools fail at serverless scale. We use RDS Proxy or HTTP-based database protocols (Data API).

Key Interview Questions

Q: Can you guarantee consensus in an asynchronous network where nodes can fail?
Answer: No, provably impossible (FLP theorem). But we can achieve it practically by:
  • Using timeouts (introduces synchrony assumption)
  • Randomization
  • Failure detectors
Most practical systems (Raft, Paxos) assume partial synchrony.
Q: How do you know whether a remote node has failed?
Answer: You can’t with certainty. Approaches:
  • Timeouts: Simple but prone to false positives
  • Heartbeats: Regular health checks
  • Phi accrual detector: Probabilistic suspicion level
  • Lease-based: Node must renew lease to be considered alive
In practice, use adaptive timeouts based on historical latency.
Q: What is the difference between Lamport timestamps and vector clocks, and when would you use each?
Answer:
  • Lamport: Single counter; happens-before implies a smaller timestamp, but the converse does not hold (cannot detect concurrency)
  • Vector clocks: One counter per node, can detect concurrent events
Use Lamport when you just need ordering. Use vector clocks when you need to detect conflicts.
Q: How do you apply the CAP theorem when designing a real system?
Answer: Frame it as a trade-off discussion:
  • First, acknowledge CAP only applies during partitions
  • Discuss what consistency level you actually need
  • Explain PACELC for normal operation
  • Give examples: “For our payment system, we’re CP because incorrect balance is unacceptable. For user preferences, we’re AP because eventual consistency is fine.”

Next Steps

Continue to Track 2: Consensus Protocols

Learn Paxos, Raft, and other consensus algorithms that power distributed systems