Track 1: Foundations

Before diving into consensus protocols and complex architectures, you must understand the fundamental challenges that make distributed systems hard.
Track Duration: 28-38 hours
Modules: 5
Key Topics: Network fundamentals, Time & Ordering, Failure Models, CAP/PACELC

Module 1: Why Distributed Systems?

The Need for Distribution

Every successful system eventually outgrows a single machine:
+-------------------------------------------------------------+
|                     WHY DISTRIBUTED SYSTEMS?               |
+-------------------------------------------------------------+
|                                                             |
|  SINGLE MACHINE LIMITS          DISTRIBUTION BENEFITS      |
|  ---------------------          ---------------------      |
|  - CPU: ~128 cores max          - Horizontal scaling       |
|  - RAM: ~12 TB max (expensive)  - Fault tolerance          |
|  - Disk: I/O bottlenecks        - Geographic distribution  |
|  - Network: Single point        - Cost efficiency          |
|    of failure                   - Regulatory compliance    |
|                                                             |
|  EXAMPLE: Google Search                                    |
|  ---------------------                                     |
|  - 8.5 billion searches/day                               |
|  - Index: 100+ petabytes                                  |
|  - Response time: < 0.5 seconds                           |
|  - Impossible on a single machine                         |
|                                                             |
+-------------------------------------------------------------+

Types of Distributed Systems

Compute Clusters

Purpose: Process data across many machines
Examples:
  • Hadoop/Spark clusters
  • Kubernetes pods
  • AWS Lambda fleet

Storage Systems

Purpose: Store and retrieve data across machines
Examples:
  • Distributed databases (Cassandra, DynamoDB)
  • Object stores (S3, GCS)
  • File systems (HDFS, Ceph)

Message Systems

Purpose: Enable communication between distributed components
Examples:
  • Message queues (Kafka, RabbitMQ)
  • Pub/sub systems (SNS, Pub/Sub)
  • Event streams (Kinesis, Event Hubs)

The Eight Fallacies of Distributed Computing

Every distributed systems engineer must internalize these:

Fallacy 1: The network is reliable

Reality: Networks fail constantly.
Common failures:
- Packet loss (1 in 1000 typical, worse under load)
- Network partitions (entire segments become unreachable)
- Cable cuts (more common than you'd think)
- Switch/router failures

Real incident: AWS US-East-1 2011
- Network configuration change
- Cascading failures
- Multi-hour outage affecting Netflix, Reddit, Quora
Defense: Implement retries, timeouts, circuit breakers, and idempotency.
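
A minimal sketch of this defense in Python: retries with exponential backoff, a per-call timeout, and an idempotency key so the server can deduplicate retried requests. (The endpoint URL and the `Idempotency-Key` header convention are illustrative; a production client would also add jitter and a circuit breaker.)

```python
import time
import requests

def call_with_retries(url, payload, idempotency_key, max_attempts=3, timeout_s=2.0):
    """Retry a remote call with exponential backoff.
    The idempotency key lets the server ignore duplicates when a request
    actually succeeded but its response was lost."""
    for attempt in range(max_attempts):
        try:
            resp = requests.post(
                url,
                json=payload,
                headers={"Idempotency-Key": idempotency_key},
                timeout=timeout_s,          # never block forever on a dead network
            )
            if resp.status_code < 500:
                return resp                 # success, or a client error not worth retrying
        except requests.RequestException:
            pass                            # timeout, connection reset, DNS failure, ...
        time.sleep(0.1 * 2 ** attempt)      # 100ms, 200ms, 400ms backoff
    raise RuntimeError(f"Service unreachable after {max_attempts} attempts")
```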

Fallacy 2: Latency is zero

Reality: Every network call has latency.
| Operation                      | Typical latency |
|--------------------------------|-----------------|
| Same datacenter                | 0.5 ms          |
| Cross-region (US East ↔ West)  | 40-65 ms        |
| Cross-continent (US ↔ Europe)  | 70-120 ms       |
| Cross-Pacific (US ↔ Asia)      | 100-200 ms      |
Impact: A service with 10 sequential remote calls adds 500-2000ms of latency.
Defense: Cache aggressively, parallelize calls, use async processing.
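
Ten sequential 50ms calls cost about 500ms, but issued concurrently they cost roughly one round trip. A sketch of the "parallelize calls" defense using asyncio (aiohttp and the user-service URL are illustrative assumptions):

```python
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.json()

async def fetch_all(urls):
    # All requests are in flight at once: total latency is roughly the
    # slowest single call rather than the sum of all calls.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# asyncio.run(fetch_all([f"http://user-service/users/{i}" for i in range(10)]))
```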

Fallacy 3: Bandwidth is infinite

Reality: Bandwidth is expensive and limited.
Example: Transferring 1TB over network

10 Gbps link:
- Theoretical: 13 minutes
- Real-world: 20-30 minutes (overhead, congestion)

AWS Data Transfer Costs:
- Intra-region: $0.01/GB
- Cross-region: $0.02/GB
- To internet: $0.09/GB
- 1PB/month internet egress: $90,000
Defense: Compress data, use efficient serialization (protobuf), batch requests.
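
A quick illustration of batching plus compression (the exact ratio depends on the data; repetitive JSON compresses especially well):

```python
import json
import zlib

# One batched payload of 1,000 events instead of 1,000 tiny requests
events = [{"user_id": i, "event": "page_view", "path": "/home"} for i in range(1000)]
payload = json.dumps(events).encode()

compressed = zlib.compress(payload, level=6)
print(f"raw: {len(payload):,} bytes  compressed: {len(compressed):,} bytes "
      f"({len(payload) / len(compressed):.1f}x smaller)")
```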

Fallacy 4: The network is secure

Reality: Assume the network is compromised.
Attack vectors:
  • Man-in-the-middle attacks
  • DNS spoofing
  • BGP hijacking (internet routing attacks)
  • Insider threats
Defense: mTLS everywhere, zero-trust architecture, encrypt at rest and in transit.

Fallacy 5: Topology doesn't change

Reality: Network topology changes constantly.
Dynamic changes:
├── Auto-scaling adds/removes instances
├── Deployments replace containers
├── Failovers redirect traffic
└── Cloud provider maintenance
Defense: Use service discovery, health checks, load balancer draining.

Fallacy 6: There is one administrator

Reality: Multiple teams, policies, and even companies are involved.
In a typical microservices architecture:
  • Platform team manages Kubernetes
  • Each service team manages their services
  • Security team manages policies
  • Network team manages infrastructure
Defense: Clear ownership, documented interfaces, SLAs between teams.

Fallacy 7: Transport cost is zero

Reality: Data transfer costs real money.
| Cloud Provider | Egress Cost |
|----------------|-------------|
| AWS            | $0.09/GB    |
| GCP            | $0.12/GB    |
| Azure          | $0.087/GB   |
High-traffic example: 1PB egress = $90,000/month
Defense: Keep data close to compute, use CDNs, compress aggressively.

Fallacy 8: The network is homogeneous

Reality: Different hardware, protocols, and vendors everywhere.
Heterogeneity:
├── Different server generations (various CPU, RAM)
├── Different network equipment (Cisco, Juniper, etc.)
├── Different protocols (HTTP/1.1, HTTP/2, gRPC, QUIC)
├── Different cloud providers (AWS, GCP, Azure)
└── Different latency profiles
Defense: Abstract hardware differences, use consistent protocols, test on diverse environments.

Module 2: Network Fundamentals

TCP Guarantees and Failures

TCP provides order and reliability, but not much else:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TCP GUARANTEES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ✓ GUARANTEED                         ✗ NOT GUARANTEED                      │
│  ─────────────                        ─────────────────                      │
│  ▸ Ordered delivery                   ▸ Bounded latency                     │
│  ▸ Reliable delivery (eventually)     ▸ Bounded bandwidth                   │
│  ▸ Error detection                    ▸ Connection will succeed             │
│  ▸ Flow control                       ▸ Message boundaries preserved        │
│  ▸ Duplicate elimination              ▸ Timely delivery                     │
│                                                                              │
│  FAILURE MODES                                                               │
│  ─────────────                                                              │
│  ▸ Connection reset (RST)                                                   │
│  ▸ Connection timeout                                                       │
│  ▸ Half-open connections (one side doesn't know it's dead)                 │
│  ▸ Head-of-line blocking                                                    │
│  ▸ TCP incast (many-to-one overwhelms receiver)                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
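
Because TCP is a byte stream, "message boundaries preserved" sits on the not-guaranteed side: one send can arrive as several reads, and several sends can arrive as one. A common fix is length-prefixed framing, sketched here with the standard socket and struct modules:

```python
import socket
import struct

def send_message(sock: socket.socket, data: bytes) -> None:
    # Prefix every message with its length as a 4-byte big-endian integer
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf += chunk
    return buf

def recv_message(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```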

Network Partitions

A network partition occurs when the network splits into groups of nodes that cannot communicate with each other:
BEFORE PARTITION:
                    ┌────────────────┐
                    │   Network      │
                    └────────────────┘
                    ↑    ↑    ↑    ↑
                    │    │    │    │
               ┌────┼────┼────┼────┼────┐
               │    │    │    │    │    │
              [A]  [B]  [C]  [D]  [E]  [F]
              All nodes can communicate


DURING PARTITION:
     Partition 1              ║              Partition 2
  ┌─────────────┐             ║           ┌─────────────┐
  │   Network   │             ║           │   Network   │
  └─────────────┘             ║           └─────────────┘
   ↑     ↑    ↑               ║            ↑     ↑    ↑
   │     │    │           PARTITION        │     │    │
  [A]   [B]  [C]              ║           [D]   [E]  [F]
  
  A, B, C can talk             D, E, F can talk
  to each other                to each other
  but NOT to D, E, F           but NOT to A, B, C
Key insight: During a partition, you must choose between:
  • Availability: Both partitions continue serving requests (might diverge)
  • Consistency: Reject requests from the minority partition

Message Delivery Semantics

At-most-once: fire and forget, no retries
def send_message(message):
    network.send(message)  # No retry on failure
    # Message delivered 0 or 1 times
Use case: Metrics, logs (losing some is acceptable)
Risk: Message loss
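
For contrast, a sketch of at-least-once delivery: keep retrying until the receiver acknowledges, and accept that the receiver may see the same message more than once (network.send_and_wait_ack is a placeholder API, not a real library call):

```python
import time

def send_at_least_once(network, message, max_attempts=5):
    """Retry until acknowledged: the message is delivered 1..N times, so the
    receiver needs idempotent handling or a deduplication table."""
    for attempt in range(max_attempts):
        if network.send_and_wait_ack(message, timeout=1.0):  # placeholder API
            return True
        time.sleep(0.1 * 2 ** attempt)   # back off before retrying
    raise RuntimeError("No ack received; the message may or may not have arrived")
```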

Failure Detection

How do you know if a node is dead?
┌─────────────────────────────────────────────────────────────────────────────┐
│                      FAILURE DETECTION APPROACHES                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HEARTBEAT-BASED                                                            │
│  ───────────────                                                            │
│  Node A ──[heartbeat]──> Node B                                             │
│           every 1 sec                                                        │
│                                                                              │
│  If no heartbeat for 5 seconds → Mark as failed                             │
│                                                                              │
│  Problems:                                                                   │
│  ├── Too aggressive: False positives (just slow, not dead)                  │
│  └── Too conservative: Slow detection of actual failures                    │
│                                                                              │
│  PHI ACCRUAL FAILURE DETECTOR (Cassandra uses this)                         │
│  ─────────────────────────────                                              │
│  Instead of binary (alive/dead), output a suspicion level:                  │
│                                                                              │
│  φ = -log₁₀(probability node is still alive)                                │
│                                                                              │
│  φ = 1 → 90% confident the node has failed                                  │
│  φ = 2 → 99% confident the node has failed                                  │
│  φ = 3 → 99.9% confident the node has failed                                │
│                                                                              │
│  A node is marked failed once φ exceeds a configured threshold              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
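
A simplified phi accrual sketch, assuming heartbeat gaps are roughly exponentially distributed (the original detector and Cassandra's implementation fit the observed inter-arrival distribution from a sliding window; class and parameter names here are illustrative):

```python
import math
import time
from collections import deque

class PhiAccrualDetector:
    """Simplified phi accrual failure detector (illustrative, not Cassandra's code)."""

    def __init__(self, window_size=100, threshold=8.0):
        self.intervals = deque(maxlen=window_size)  # recent heartbeat gaps, seconds
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self, now=None):
        now = now if now is not None else time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        if self.last_heartbeat is None or not self.intervals:
            return 0.0
        now = now if now is not None else time.time()
        mean_interval = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Assume P(heartbeat still coming) ~ exp(-elapsed / mean); phi = -log10 of that
        return (elapsed / mean_interval) * math.log10(math.e)

    def is_failed(self, now=None):
        return self.phi(now) > self.threshold
```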

Module 3: Time and Ordering

**This is one of the most important topics in distributed systems.**

### Why Physical Clocks Fail

┌─────────────────────────────────────────────────────────────────────────────┐
│                        PHYSICAL CLOCK PROBLEMS                               │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  CLOCK SKEW (different machines show different times)                       │
│  ──────────                                                                  │
│  Machine A: 10:00:00.000                                                     │
│  Machine B: 10:00:00.150  (150ms ahead)                                      │
│  Machine C: 09:59:59.850  (150ms behind)                                     │
│                                                                              │
│  NTP SYNCHRONIZATION                                                         │
│  ───────────────────                                                         │
│  NTP accuracy: 1-10ms on LAN, 10-100ms over internet                         │
│  During sync: Clock can jump forward or backward!                            │
│                                                                              │
│  CLOCK DRIFT                                                                 │
│  ───────────                                                                 │
│  Quartz crystals drift: ~50ppm (50 microseconds per second)                  │
│  After 1 hour: 180ms drift possible                                          │
│                                                                              │
│  LEAP SECONDS                                                                │
│  ────────────                                                                │
│  Added to compensate for Earth's rotation slowing                            │
│  23:59:59 → 23:59:60 → 00:00:00                                              │
│  Famously caused issues at Reddit, LinkedIn, Mozilla (2012)                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

### Lamport Timestamps

Leslie Lamport's logical clock solution (1978):

**Rules:**
1. Each process maintains a counter (starts at 0)
2. Before each event, increment the counter
3. When sending: attach current counter to message
4. When receiving: counter = max(local, received) + 1

**Example:**
- Process A increments to 1 and sends to B; B sets its clock to max(0, 1) + 1 = 2
- Process B increments to 3 and sends to C; C sets its clock to max(0, 3) + 1 = 4
- Process A increments to 2 and sends to B again; B sets its clock to max(3, 2) + 1 = 4
- Process B increments to 5 and sends to C; C sets its clock to max(4, 5) + 1 = 6

**Property:** If A happened before B, then L(A) < L(B)
**Limitation:** L(A) < L(B) does NOT mean A happened before B (no causality)

**Implementation**:

```python
class LamportClock:
    def __init__(self):
        self.counter = 0
    
    def increment(self):
        """Call before any local event"""
        self.counter += 1
        return self.counter
    
    def send(self):
        """Call when sending a message"""
        self.counter += 1
        return self.counter
    
    def receive(self, received_timestamp):
        """Call when receiving a message"""
        self.counter = max(self.counter, received_timestamp) + 1
        return self.counter
```
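
A quick run of the class above, reproducing the example (process names are illustrative):

```python
A, B, C = LamportClock(), LamportClock(), LamportClock()

print(B.receive(A.send()))   # A sends ts=1, B jumps to 2
print(C.receive(B.send()))   # B sends ts=3, C jumps to 4
print(B.receive(A.send()))   # A sends ts=2, B takes max(3, 2) + 1 = 4
print(C.receive(B.send()))   # B sends ts=5, C takes max(4, 5) + 1 = 6
```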

Vector Clocks

Vector clocks track causality: they can tell you whether two events are causally related or concurrent. Implementation:
class VectorClock:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.clock = [0] * num_nodes
    
    def increment(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def send(self):
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def receive(self, received_clock):
        for i in range(len(self.clock)):
            self.clock[i] = max(self.clock[i], received_clock[i])
        self.clock[self.node_id] += 1
        return self.clock.copy()
    
    def happens_before(self, other_clock):
        """Returns True if this clock happened before other_clock"""
        return (all(s <= o for s, o in zip(self.clock, other_clock)) and
                any(s < o for s, o in zip(self.clock, other_clock)))
    
    def concurrent(self, other_clock):
        """Returns True if neither event happened before the other"""
        other_before_self = (all(o <= s for s, o in zip(self.clock, other_clock)) and
                             any(o < s for s, o in zip(self.clock, other_clock)))
        return not self.happens_before(other_clock) and not other_before_self
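
A short usage sketch of the class above, showing how concurrent updates are detected (node IDs and the order of events are illustrative):

```python
a = VectorClock(node_id=0, num_nodes=2)
b = VectorClock(node_id=1, num_nodes=2)

a_stamp = a.increment()          # A: [1, 0]
b_stamp = b.increment()          # B: [0, 1]

print(a.happens_before(b_stamp)) # False
print(a.concurrent(b_stamp))     # True - neither event saw the other (a conflict)

b.receive(a_stamp)               # B merges A's history: B is now [1, 2]
print(a.happens_before(b.clock)) # True - B's latest event causally follows A's
```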

Hybrid Logical Clocks (HLC)

Combines physical and logical clocks:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HYBRID LOGICAL CLOCKS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HLC = (physical_time, logical_counter)                                     │
│                                                                              │
│  PROPERTIES:                                                                │
│  1. Always greater than physical time (useful for debugging)                │
│  2. Preserves causality like logical clocks                                 │
│  3. Bounded skew from physical time                                         │
│                                                                              │
│  USED BY:                                                                   │
│  ├── CockroachDB                                                            │
│  ├── MongoDB                                                                │
│  └── Many distributed databases                                             │
│                                                                              │
│  ALGORITHM:                                                                 │
│  ─────────                                                                  │
│  On local event or send:                                                    │
│    l' = max(l, physical_time)                                               │
│    if l' == l: c' = c + 1                                                   │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
│  On receive(m.l, m.c):                                                      │
│    l' = max(l, m.l, physical_time)                                          │
│    if l' == l == m.l: c' = max(c, m.c) + 1                                  │
│    else if l' == l: c' = c + 1                                              │
│    else if l' == m.l: c' = m.c + 1                                          │
│    else: c' = 0                                                             │
│    l = l', c = c'                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
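
The update rules above translate almost directly into code. A minimal sketch (class and method names are illustrative; timestamps are in milliseconds):

```python
import time

class HybridLogicalClock:
    """Minimal HLC sketch implementing the send/receive rules above."""

    def __init__(self):
        self.l = 0   # wall-clock component
        self.c = 0   # logical counter

    def _physical_time(self):
        return int(time.time() * 1000)

    def now(self):
        """Called on a local event or when sending a message."""
        pt = self._physical_time()
        l_new = max(self.l, pt)
        self.c = self.c + 1 if l_new == self.l else 0
        self.l = l_new
        return (self.l, self.c)

    def update(self, m_l, m_c):
        """Called when receiving a message stamped (m_l, m_c)."""
        pt = self._physical_time()
        l_new = max(self.l, m_l, pt)
        if l_new == self.l == m_l:
            c_new = max(self.c, m_c) + 1
        elif l_new == self.l:
            c_new = self.c + 1
        elif l_new == m_l:
            c_new = m_c + 1
        else:
            c_new = 0
        self.l, self.c = l_new, c_new
        return (self.l, self.c)
```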

TrueTime (Google Spanner)

Google’s hardware-based approach:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          TRUETIME                                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  HARDWARE:                                                                  │
│  ├── GPS receivers (provide absolute time)                                  │
│  └── Atomic clocks (maintain time during GPS outages)                      │
│                                                                              │
│  API:                                                                       │
│  TT.now() returns [earliest, latest] interval                               │
│  TT.after(t) → true if t has definitely passed                             │
│  TT.before(t) → true if t has definitely not arrived                       │
│                                                                              │
│  UNCERTAINTY:                                                               │
│  ├── Average: 4ms                                                           │
│  └── Worst case: 10ms                                                       │
│                                                                              │
│  HOW SPANNER USES IT:                                                       │
│  ────────────────────                                                       │
│  Transaction commits at timestamp t                                         │
│  Wait until TT.after(t) before making visible                               │
│  This guarantees external consistency!                                      │
│                                                                              │
│  COST: Special hardware + GPS antennas + careful engineering                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
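
A toy sketch of the commit-wait idea using a stubbed TrueTime with a fixed uncertainty bound (TrueTimeStub is hypothetical; Spanner's real API and hardware are internal to Google):

```python
import time

class TrueTimeStub:
    """Stand-in for TrueTime: now() returns an [earliest, latest] interval."""

    def __init__(self, uncertainty_ms=7.0):
        self.uncertainty_ms = uncertainty_ms

    def now(self):
        t = time.time() * 1000
        return (t - self.uncertainty_ms, t + self.uncertainty_ms)

    def after(self, t_ms):
        earliest, _ = self.now()
        return earliest > t_ms        # t has definitely passed

def commit_wait(tt, commit_ts_ms):
    """Block until the commit timestamp is definitely in the past, so no other
    node's clock can still read a time earlier than the commit timestamp."""
    while not tt.after(commit_ts_ms):
        time.sleep(0.001)

tt = TrueTimeStub()
_, latest = tt.now()
commit_ts = latest                    # choose a timestamp at the top of the interval
commit_wait(tt, commit_ts)            # wait out the uncertainty before making writes visible
```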

Module 4: Failure Models

Understanding what can go wrong is crucial for building resilient systems.

Types of Failures

Crash-stop failures: the node crashes and stays down.
Characteristics:
  • Clean failure, detectable
  • No incorrect behavior - the node simply stops (unlike Byzantine faults)
  • Easiest failure mode to handle
Example: Hardware failure, power loss
[Node A] ──── X ────────────────────

          Crash
          (other nodes eventually detect)

Partial Failures

The defining challenge of distributed systems:
┌─────────────────────────────────────────────────────────────────────────────┐
│                        PARTIAL FAILURES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Node A calls Node B                                              │
│                                                                              │
│  A ──[request]──> B                                                         │
│                   │                                                         │
│         (silence)                                                           │
│                                                                              │
│  What happened?                                                             │
│  ├── Request lost in network?                                               │
│  ├── B crashed before processing?                                           │
│  ├── B processed but crashed before responding?                             │
│  ├── B responded but response lost?                                         │
│  └── B is just slow?                                                        │
│                                                                              │
│  A DOESN'T KNOW!                                                            │
│                                                                              │
│  IMPLICATIONS:                                                              │
│  ─────────────                                                              │
│  1. Cannot distinguish slow from dead                                       │
│  2. Operations might have executed (or not)                                 │
│  3. Must design for uncertainty                                             │
│  4. Idempotency becomes essential                                           │
│                                                                              │
│  PATTERN: Retry with idempotency keys                                       │
│  ─────────────────────────────────                                          │
│  request_id = "abc123"                                                      │
│  if already_processed(request_id):                                          │
│      return cached_response                                                 │
│  else:                                                                      │
│      process_and_store(request_id, response)                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
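
The retry-with-idempotency-keys pattern from the box, as a runnable sketch (handle_payment and the in-memory store are illustrative; a real service would persist the key-to-response mapping durably with a TTL):

```python
import uuid

_processed = {}  # idempotency key -> cached response (use a durable store in production)

def handle_payment(request_id, payload):
    """Process each idempotency key at most once; retries replay the cached response."""
    if request_id in _processed:
        return _processed[request_id]            # duplicate retry: no second charge
    response = {"status": "charged", "amount": payload["amount"]}
    _processed[request_id] = response            # remember before acknowledging
    return response

# Client side: one key per logical operation, reused across retries
key = str(uuid.uuid4())
first = handle_payment(key, {"amount": 42})
retry = handle_payment(key, {"amount": 42})      # e.g. the first response was lost
assert first is retry                            # the charge did not happen twice
```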

Gray Failures

Subtle failures that are hard to detect:
Examples of gray failures:
├── CPU running at 50% speed due to thermal throttling
├── One of 10 disks has 100x higher latency
├── Memory errors causing random crashes
├── Network dropping 1% of packets
├── Software bug triggered only by specific input
└── Gradual memory leak over hours

Detection challenges:
├── Metrics look "mostly okay"
├── Intermittent symptoms
├── Hard to reproduce
└── May only affect some requests

Byzantine Fault Tolerance (BFT)

Byzantine failures are the most challenging to handle:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    BYZANTINE FAULT TOLERANCE                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THE BYZANTINE GENERALS PROBLEM                                              │
│  ─────────────────────────────                                              │
│  Multiple generals must agree on attack/retreat                             │
│  Some generals may be traitors (send different messages)                    │
│  How do loyal generals reach consensus?                                     │
│                                                                              │
│  REQUIREMENTS FOR BFT:                                                      │
│  ────────────────────                                                       │
│  To tolerate f Byzantine nodes: Need 3f + 1 total nodes                     │
│  - 1 faulty node: Need 4 nodes minimum                                      │
│  - 2 faulty nodes: Need 7 nodes minimum                                     │
│  - 3 faulty nodes: Need 10 nodes minimum                                    │
│                                                                              │
│  WHY 3f + 1?                                                                │
│  ──────────                                                                 │
│  f nodes may lie, f nodes may be unreachable                                │
│  Need f+1 correct responses to form a majority                              │
│  (3f+1) - f (faulty) - f (unreachable) = f+1 (quorum)                       │
│                                                                              │
│  PRACTICAL BFT ALGORITHMS:                                                  │
│  ───────────────────────                                                    │
│  ├── PBFT (Practical BFT) - Castro & Liskov, 1999                          │
│  ├── HotStuff - Used by Facebook's Libra/Diem                               │
│  ├── Tendermint - Used in blockchain systems                                │
│  └── RBFT - Redundant BFT for improved performance                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
PBFT Algorithm Overview:
# Simplified PBFT flow (broadcast/validation helpers omitted)
from collections import defaultdict

class PBFTNode:
    def __init__(self, node_id, total_nodes):
        self.node_id = node_id
        self.n = total_nodes
        self.f = (total_nodes - 1) // 3  # Max faulty nodes tolerated
        self.view = 0
        self.sequence = 0
        self.prepare_count = defaultdict(int)  # seq -> PREPARE votes received
        self.commit_count = defaultdict(int)   # seq -> COMMIT votes received
        
    def is_primary(self):
        return self.view % self.n == self.node_id
    
    def receive_client_request(self, request):
        if not self.is_primary():
            # Forward to primary
            return self.forward_to_primary(request)
        
        # Primary broadcasts PRE-PREPARE
        self.sequence += 1
        self.broadcast_preprepare(request, self.sequence)
    
    def on_preprepare(self, request, seq):
        # Backup validates and broadcasts PREPARE
        if self.validate_preprepare(request, seq):
            self.broadcast_prepare(request, seq)
    
    def on_prepare(self, request, seq, sender):
        # Collect 2f PREPARE messages (including our own) before committing
        self.prepare_count[seq] += 1
        if self.prepare_count[seq] >= 2 * self.f:
            # Enter PREPARED state, broadcast COMMIT
            self.broadcast_commit(request, seq)
    
    def on_commit(self, request, seq, sender):
        # Collect 2f + 1 COMMIT messages before executing
        self.commit_count[seq] += 1
        if self.commit_count[seq] >= 2 * self.f + 1:
            # Execute request and reply to client
            self.execute_and_reply(request)
When to Use BFT:
  • Blockchain and cryptocurrency systems
  • Multi-party financial transactions
  • High-security government systems
  • Any system with mutually distrustful participants
Performance Trade-off: BFT protocols are expensive! PBFT requires O(n²) messages per consensus round. Most internal systems use simpler crash-fault tolerant protocols (Raft, Paxos) which only require 2f+1 nodes.

Network Partition Simulation

Understanding how to test partition tolerance:
# Simulating network partitions in tests
import random
from typing import Set, Dict, List

class NetworkSimulator:
    """
    Simulates network conditions including partitions, 
    delays, and message loss for testing distributed systems.
    """
    
    def __init__(self, nodes: List[str]):
        self.nodes = set(nodes)
        self.partitions: List[Set[str]] = [self.nodes.copy()]
        self.message_loss_rate = 0.0
        self.latency_ms = 0
        self.latency_jitter_ms = 0
    
    def create_partition(self, group1: Set[str], group2: Set[str]):
        """Split network into two partitions"""
        self.partitions = [group1, group2]
        print(f"PARTITION: {group1} | {group2}")
    
    def heal_partition(self):
        """Restore network connectivity"""
        self.partitions = [self.nodes.copy()]
        print("PARTITION HEALED")
    
    def can_communicate(self, from_node: str, to_node: str) -> bool:
        """Check if two nodes can communicate"""
        for partition in self.partitions:
            if from_node in partition and to_node in partition:
                return True
        return False
    
    def send_message(self, from_node: str, to_node: str, message: dict) -> bool:
        """
        Attempt to send a message with simulated failures.
        Returns True if message was delivered.
        """
        # Check partition
        if not self.can_communicate(from_node, to_node):
            print(f"BLOCKED: {from_node} -> {to_node} (partition)")
            return False
        
        # Simulate message loss
        if random.random() < self.message_loss_rate:
            print(f"LOST: {from_node} -> {to_node}")
            return False
        
        # Simulate latency
        delay = self.latency_ms + random.uniform(
            -self.latency_jitter_ms, 
            self.latency_jitter_ms
        )
        print(f"DELIVERED: {from_node} -> {to_node} ({delay:.1f}ms)")
        return True


# Example: Testing a distributed system under partition
def test_consensus_under_partition():
    nodes = ["node1", "node2", "node3", "node4", "node5"]
    network = NetworkSimulator(nodes)
    
    # Normal operation
    assert network.can_communicate("node1", "node3")
    
    # Create minority/majority partition
    network.create_partition(
        {"node1", "node2"},           # Minority
        {"node3", "node4", "node5"}   # Majority
    )
    
    # Majority can reach consensus
    assert network.can_communicate("node3", "node4")
    assert network.can_communicate("node4", "node5")
    
    # Cross-partition communication blocked
    assert not network.can_communicate("node1", "node3")
    assert not network.can_communicate("node2", "node5")
    
    # Minority should reject writes (in CP system)
    # Majority should continue operating
    
    network.heal_partition()
    assert network.can_communicate("node1", "node5")

Chaos Engineering for Failure Testing

Best Practice: Don’t wait for production failures to discover weaknesses. Proactively inject failures in controlled environments.
# Example: Chaos experiment definition
experiment:
  name: "Network Partition Between Services"
  hypothesis: "Order service continues accepting reads during payment partition"
  
  steady_state:
    - probe: "order-service-health"
      type: http
      url: "http://order-service/health"
      expected_status: 200
    
  method:
    - action: "network-partition"
      target: 
        service: "payment-service"
        namespace: "production"
      duration: 60s
      
  rollback:
    - action: "heal-network"
      target:
        service: "payment-service"
        
  metrics_to_watch:
    - error_rate < 1%
    - p99_latency < 500ms
    - order_reads_successful > 99%

Module 5: CAP and PACELC Theorems

CAP Theorem

CAP is often misunderstood. It only applies during a network partition.

CP vs AP Deep Dive

CP Systems

During a partition, choose Consistency.
Behavior:
  • Minority partition stops accepting writes
  • May reject reads too (for linearizability)
  • Majority partition continues
Good for:
  • Financial systems
  • Inventory management
  • Any system where stale data is dangerous
Example: Bank account balance

AP Systems

During a partition, choose Availability.
Behavior:
  • Both partitions continue serving
  • May return stale/conflicting data
  • Resolve conflicts after partition heals
Good for:
  • Social media feeds
  • Shopping carts
  • DNS
Example: Twitter timeline
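
A small sketch of the behavioral difference during a partition (function and field names are illustrative): a CP replica refuses writes without a majority, while an AP replica accepts them and reconciles later.

```python
def handle_write(key, value, reachable_nodes, total_nodes, mode="CP"):
    """Illustrative partition behavior for a single replica."""
    has_majority = len(reachable_nodes) > total_nodes // 2
    if mode == "CP" and not has_majority:
        # Minority side of the partition: refuse rather than risk divergence
        raise RuntimeError("write rejected: no majority reachable")
    # AP mode (or CP with a majority): accept locally; AP systems must later
    # reconcile divergent replicas (last-write-wins, vector clocks, CRDTs, ...)
    return {"key": key, "value": value, "needs_reconciliation": not has_majority}
```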

PACELC: The Better Framework

CAP only describes behavior during partitions. PACELC adds normal operation:
┌─────────────────────────────────────────────────────────────────────────────┐
│                           PACELC                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IF Partition                                                               │
│     Choose between Availability and Consistency                             │
│  ELSE (normal operation)                                                    │
│     Choose between Latency and Consistency                                  │
│                                                                              │
│  EXAMPLES:                                                                  │
│  ─────────                                                                  │
│  DynamoDB/Cassandra: PA/EL                                                  │
│  ├── Partition: Available                                                   │
│  └── Normal: Low latency (eventual consistency)                             │
│                                                                              │
│  Spanner: PC/EC                                                             │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Consistent (higher latency, but distributed transactions)     │
│                                                                              │
│  MongoDB: PA/EC                                                             │
│  ├── Partition: Available (eventual reads)                                  │
│  └── Normal: Consistent (waits for primary)                                 │
│                                                                              │
│  PNUTS (Yahoo): PC/EL                                                       │
│  ├── Partition: Consistent (blocks)                                         │
│  └── Normal: Low latency (local reads)                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Consistency Spectrum

Consistency is not binary - it’s a spectrum:
STRONGER ─────────────────────────────────────────────────── WEAKER
   │                                                           │
   ▼                                                           ▼
┌──────────────┬─────────────┬────────────┬──────────────┬────────────┐
│ Linearizable │ Sequential  │  Causal    │   Session    │  Eventual  │
│              │ Consistent  │            │  Consistent  │            │
└──────────────┴─────────────┴────────────┴──────────────┴────────────┘
   │                │              │             │              │
   │                │              │             │              │
   ▼                ▼              ▼             ▼              ▼
All ops appear   All ops see    Causally     Same session   Eventually
in real-time    same order     related ops   sees own       all see
order           (not real-     in order      writes         same value
                time)
                
Cost: Highest ◄─────────────────────────────────────────► Cost: Lowest
Latency: High ◄─────────────────────────────────────────► Latency: Low

Key Interview Questions

Question: Can consensus be guaranteed in an asynchronous system where even one node may fail?
Answer: No, provably impossible (FLP theorem). But we can achieve it practically by:
  • Using timeouts (introduces synchrony assumption)
  • Randomization
  • Failure detectors
Most practical systems (Raft, Paxos) assume partial synchrony.

Question: How do you know whether a remote node has failed or is just slow?
Answer: You can’t with certainty. Approaches:
  • Timeouts: Simple but prone to false positives
  • Heartbeats: Regular health checks
  • Phi accrual detector: Probabilistic suspicion level
  • Lease-based: Node must renew lease to be considered alive
In practice, use adaptive timeouts based on historical latency.

Question: What is the difference between Lamport timestamps and vector clocks?
Answer:
  • Lamport: Single counter, preserves “happens-before” one direction only
  • Vector clocks: One counter per node, can detect concurrent events
Use Lamport when you just need ordering. Use vector clocks when you need to detect conflicts.

Question: How do you apply the CAP theorem in a system design discussion?
Answer: Frame it as a trade-off discussion:
  • First, acknowledge CAP only applies during partitions
  • Discuss what consistency level you actually need
  • Explain PACELC for normal operation
  • Give examples: “For our payment system, we’re CP because incorrect balance is unacceptable. For user preferences, we’re AP because eventual consistency is fine.”

Next Steps

Continue to Track 2: Consensus Protocols

Learn Paxos, Raft, and other consensus algorithms that power distributed systems