Consensus is the problem of getting multiple nodes to agree on a single value, despite failures.
THE CONSENSUS PROBLEM
═════════════════════

SETUP:
├── N nodes in a distributed system
├── Some nodes may fail (crash failures)
├── Network may delay/lose messages
└── Goal: Agree on a single value

PROPERTIES (must satisfy ALL):

1. AGREEMENT:   All non-faulty nodes decide the same value
2. VALIDITY:    The decided value was proposed by some node
3. TERMINATION: All non-faulty nodes eventually decide
                (liveness property)
4. INTEGRITY:   Each node decides at most once

WHY IT'S HARD:
──────────────
Node A proposes "5"    │  What if:
Node B proposes "7"    │  - A crashes after proposing?
Node C proposes "3"    │  - Message from A to C is lost?
                       │  - B thinks A is dead (just slow)?
The Most Important Result: Consensus is impossible in an asynchronous system with even one faulty process.
FLP IMPOSSIBILITY (1985)
════════════════════════

Fischer, Lynch, and Paterson proved:

  "No deterministic consensus protocol can guarantee termination
   in an asynchronous system with even one crash failure."

IMPLICATIONS:
─────────────
You CANNOT have all three:
1. Safety (agreement + validity)
2. Liveness (termination)
3. Asynchrony (no timing assumptions)

WHY? (Intuition)
────────────────
- In async systems, you can't distinguish slow from dead
- There's always an adversarial schedule that prevents progress
- No matter the protocol, an adversary can delay messages
  to keep the system in an undecided state

HOW PRACTICAL SYSTEMS WORK AROUND IT:
─────────────────────────────────────
1. Timeouts (partial synchrony) - Most systems do this
2. Randomization - Some probability of progress
3. Failure detectors - Oracle that hints about failures

RAFT/PAXOS assume: "Eventually, the network will be synchronous enough"
SAFETY: "Nothing bad happens"├── Agreement is never violated├── Only proposed values are chosen└── Can be proven mathematicallyLIVENESS: "Something good eventually happens" ├── Decisions are eventually made├── Progress is made└── Can be violated by FLP (temporarily)PRACTICAL APPROACH:├── ALWAYS maintain safety└── EVENTUALLY achieve liveness (when network is stable)
Paxos was invented by Leslie Lamport in 1989. Despite being notoriously difficult to understand, it’s the foundation of modern consensus.
PAXOS OVERVIEW
══════════════

ROLES (a node can play multiple roles):
───────────────────────────────────────
PROPOSER: Proposes values, drives the protocol
ACCEPTOR: Votes on proposals, stores accepted values
LEARNER:  Learns the decided value

GOAL: Agree on a single value among N acceptors

KEY INSIGHT:
────────────
A value is "chosen" when accepted by a MAJORITY of acceptors
Any two majorities overlap, ensuring agreement

PHASES:
───────
Phase 1a: PREPARE  - Proposer asks acceptors to promise
Phase 1b: PROMISE  - Acceptors promise (or reject)
Phase 2a: ACCEPT   - Proposer asks acceptors to accept value
Phase 2b: ACCEPTED - Acceptors accept (or reject)
def on_accept(n, v):
    if n >= promised_n:
        promised_n = n
        accepted_n = n
        accepted_v = v
        return Accepted(n, v)
    else:
        return Reject()  # Promised to higher proposal
If proposer receives Accepted from majority, value is chosen!
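The pseudocode above covers Phase 2b only. Below is a minimal sketch of the matching Phase 1b (Promise) handler, wrapped in a small Acceptor class for readability; the class, the Promise/Reject records, and the field names are illustrative assumptions, not any particular library's API.

from collections import namedtuple

Promise = namedtuple("Promise", "n accepted_n accepted_v")
Reject = namedtuple("Reject", "promised_n")

class Acceptor:
    def __init__(self):
        self.promised_n = 0     # Highest proposal number promised so far
        self.accepted_n = None  # Number of the last accepted proposal
        self.accepted_v = None  # Last accepted value

    def on_prepare(self, n):
        # Phase 1b: promise to ignore anything numbered below n, and
        # report any value this acceptor has already accepted so the
        # proposer is forced to re-propose it.
        if n > self.promised_n:
            self.promised_n = n
            return Promise(n, self.accepted_n, self.accepted_v)
        return Reject(self.promised_n)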
THEOREM: If a value v is chosen, any future proposal will have value v.

PROOF INTUITION:
────────────────
1. v is chosen → accepted by majority M1
2. Future proposal n' > n must get promises from majority M2
3. M1 ∩ M2 ≠ ∅ (majorities overlap)
4. At least one acceptor in M2 already accepted v
5. That acceptor tells proposer about v in Promise
6. Proposer MUST use v (highest accepted value)
7. Therefore, v is preserved

This is why proposal numbers and the "use highest accepted value" rule exist!
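To make step 6 concrete, here is a sketch of how a proposer might pick its Phase 2 value from a majority of promises. The function and its input format are assumptions chosen to match the acceptor sketch above, not a reference implementation.

def choose_phase2_value(promises, my_value):
    """Pick the value to send in Accept, given a majority of promises.

    promises: list of (accepted_n, accepted_v) pairs; accepted_n is None
              if that acceptor has not accepted anything yet.
    my_value: the value this proposer originally wanted to propose.
    """
    accepted = [(n, v) for (n, v) in promises if n is not None]
    if not accepted:
        return my_value  # Nothing chosen yet: free to propose our own value
    # Otherwise we MUST adopt the value with the highest accepted_n;
    # this is the rule that preserves any value that may already be chosen.
    _, value = max(accepted, key=lambda pair: pair[0])
    return value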
RAFT vs PAXOS
═════════════

PAXOS PROBLEMS:                          RAFT SOLUTIONS:
───────────────                          ───────────────
- Hard to understand                     - Designed for understandability
- Many variants, unclear which to use    - One clear algorithm
- Difficult to implement correctly       - Clearer implementation path
- Symmetric roles                        - Strong leader model

RAFT KEY DECISIONS:
───────────────────
1. Strong leader: Only leader handles client requests
2. Leader election: Randomized timeouts for faster convergence
3. Log matching: If entries match at index, all prior entries match
4. Leader completeness: Elected leader has all committed entries
FOLLOWER TIMEOUT:
─────────────────
Follower                        Leader
   │                              │
   │ <── Heartbeat ───────────────│  (every 150ms)
   │                              │
   │  (waiting...)                │
   │  (timeout: 300ms)            X  (leader dies)
   │
   │  No heartbeat received!
   │  Convert to CANDIDATE
   │  Start election
   ▼

TIMEOUT: Random between 150-300ms
(Randomness prevents split votes)
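A tiny sketch of the randomized timeout itself, using the 150-300ms range from the diagram above (the constants and helper name are made up for illustration):

import random

ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def next_election_timeout_ms():
    # Pick a fresh random timeout every time the timer is reset, so two
    # followers rarely time out at the same instant and split the vote.
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)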
# Candidate sends to all nodes
RequestVote {
    term: int,           # Candidate's term
    candidateId: int,    # Candidate requesting vote
    lastLogIndex: int,   # Index of candidate's last log entry
    lastLogTerm: int     # Term of candidate's last log entry
}

# Receiver's implementation
def handle_request_vote(request):
    # Reject if candidate's term is old
    if request.term < currentTerm:
        return VoteResponse(term=currentTerm, voteGranted=False)

    # Update term if candidate's is newer
    if request.term > currentTerm:
        currentTerm = request.term
        votedFor = None
        convert_to_follower()

    # Grant vote if:
    # 1. Haven't voted in this term OR already voted for this candidate
    # 2. Candidate's log is at least as up-to-date as ours
    log_ok = (request.lastLogTerm > my_last_log_term or
              (request.lastLogTerm == my_last_log_term and
               request.lastLogIndex >= my_last_log_index))

    if (votedFor is None or votedFor == request.candidateId) and log_ok:
        votedFor = request.candidateId
        reset_election_timer()
        return VoteResponse(term=currentTerm, voteGranted=True)

    return VoteResponse(term=currentTerm, voteGranted=False)
CANDIDATE BEHAVIOR:
───────────────────
1. Increment currentTerm
2. Vote for self
3. Reset election timer
4. Send RequestVote to all other nodes

OUTCOMES:
─────────
(a) Wins election (majority votes)
    → Become leader
    → Send heartbeats immediately
(b) Another node wins
    → Receive AppendEntries from new leader
    → Convert to follower
(c) Election timeout (no winner)
    → Increment term, start new election
    → Randomized timeout prevents repeated ties

EXAMPLE (5 nodes, S1-S5):
─────────────────────────
Term 5 Election:
S1: Times out first, becomes candidate
    Votes for self (1 vote)
    Sends RequestVote to S2, S3, S4, S5
S2: Receives RequestVote, grants vote
S3: Receives RequestVote, grants vote
S4: Already voted for self (also candidate)
S5: Receives RequestVote, grants vote

S1: Has 4 votes (S1, S2, S3, S5) = MAJORITY
S1: Becomes LEADER of term 5
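A sketch of one election round following the steps above. The node, peer, and response objects are placeholders, and real implementations send the RequestVote RPCs in parallel rather than in a loop:

def start_election(node):
    # 1-3. Increment term, vote for self, reset election timer
    node.current_term += 1
    node.voted_for = node.id
    node.reset_election_timer()
    votes = 1  # Our own vote

    # 4. Ask every other node for its vote
    for peer in node.peers:
        resp = peer.request_vote(
            term=node.current_term,
            candidate_id=node.id,
            last_log_index=node.last_log_index(),
            last_log_term=node.last_log_term(),
        )
        if resp.term > node.current_term:
            # Someone has a higher term: step down immediately
            node.current_term = resp.term
            node.become_follower()
            return
        if resp.vote_granted:
            votes += 1

    # Outcome (a): majority wins; otherwise wait for the next timeout
    if votes > (len(node.peers) + 1) // 2:
        node.become_leader()  # Send heartbeats immediately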
# Leader sends to followers
AppendEntries {
    term: int,            # Leader's term
    leaderId: int,        # So follower can redirect clients
    prevLogIndex: int,    # Index of entry before new ones
    prevLogTerm: int,     # Term of prevLogIndex entry
    entries: List[Entry], # Log entries to store (empty for heartbeat)
    leaderCommit: int     # Leader's commitIndex
}

# Follower's implementation
def handle_append_entries(request):
    # Reject if leader's term is old
    if request.term < currentTerm:
        return AppendResponse(term=currentTerm, success=False)

    # Reset election timer (leader is alive)
    reset_election_timer()

    # Update term if needed
    if request.term > currentTerm:
        currentTerm = request.term
        convert_to_follower()

    # LOG MATCHING: Check if we have the previous entry
    if not log_contains(request.prevLogIndex, request.prevLogTerm):
        # Tell leader to back up and retry
        return AppendResponse(term=currentTerm, success=False)

    # APPEND NEW ENTRIES
    # (delete conflicting entries first if any)
    for i, entry in enumerate(request.entries):
        index = request.prevLogIndex + 1 + i
        if index < len(log) and log[index].term != entry.term:
            # Conflict: delete this and all following entries
            log = log[:index]
        if index >= len(log):
            log.append(entry)

    # UPDATE COMMIT INDEX
    if request.leaderCommit > commitIndex:
        commitIndex = min(request.leaderCommit, len(log) - 1)  # last entry index
        apply_committed_entries()

    return AppendResponse(term=currentTerm, success=True)
LOG MATCHING PROPERTY
═════════════════════

IF two entries in different logs have same index AND term:
├── They store the same command
└── All preceding entries are identical

HOW IT'S MAINTAINED:
────────────────────
1. Leader creates entries with current term
2. AppendEntries includes prevLogIndex and prevLogTerm
3. Follower rejects if previous entry doesn't match
4. Leader backs up and retries until match found

EXAMPLE - Repairing Inconsistent Log (entries shown as [term, index]):

Leader:    [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]
Follower:  [1,1] [1,2] [2,3] [2,4]               (behind)

AppendEntries(prevLogIndex=4, prevLogTerm=2, entries=[3,5][3,6])
Follower: Entry at index 4, term 2 exists? ✓
Follower: Append entries 5 and 6
Result:   [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]    (matches!)
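A rough leader-side sketch of step 4, backing nextIndex up until the follower's log matches; the state fields mirror the Raft paper's nextIndex/matchIndex bookkeeping and are not tied to any specific implementation:

def replicate_to(leader, follower_id):
    # Retry AppendEntries, backing nextIndex up until the logs match.
    while True:
        next_idx = leader.next_index[follower_id]
        prev_idx = next_idx - 1
        resp = leader.send_append_entries(
            follower_id,
            prev_log_index=prev_idx,
            prev_log_term=leader.log[prev_idx].term if prev_idx >= 0 else 0,
            entries=leader.log[next_idx:],
            leader_commit=leader.commit_index,
        )
        if resp.success:
            leader.match_index[follower_id] = len(leader.log) - 1
            leader.next_index[follower_id] = len(leader.log)
            return
        # Mismatch at prev_idx: back up one entry and try again.
        # (Real implementations often back up a whole term at a time.)
        leader.next_index[follower_id] = max(0, next_idx - 1)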
Critical Property: If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher terms.
HOW RAFT ENSURES THIS:
──────────────────────
1. ELECTION RESTRICTION:
   Voter denies vote if candidate's log is less up-to-date

   "Up-to-date" comparison:
   - Compare last entry's term (higher is better)
   - If same term, compare last entry's index (longer is better)

2. LEADER ONLY COMMITS ENTRIES FROM CURRENT TERM:
   Leader doesn't count old-term entries as committed until
   a new entry from current term is replicated

   Why? Prevents overwriting committed entries

   EXAMPLE (The Figure 8 scenario):
   ────────────────────────────────
   Term 2: Leader S1 replicates entry at index 2 to S2
           Crashes before committing
   Term 3: S5 becomes leader (didn't have index 2)
           Appends entry at index 2 with term 3
   Term 4: S1 comes back, becomes leader
           CAN'T just count term-2 entry as committed!
           Must replicate a term-4 entry first

   This prevents the term-2 entry from being overwritten
   by the term-3 entry that S5 had
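A sketch of how a leader might advance its commit index under rule 2, counting a replication majority only for entries from its own term. Field names follow the earlier sketches and are assumptions, not a specific implementation:

def advance_commit_index(leader):
    # Try to commit the highest index replicated on a majority, but only
    # if that entry carries the leader's current term (the Figure 8 rule).
    for n in range(len(leader.log) - 1, leader.commit_index, -1):
        if leader.log[n].term != leader.current_term:
            break  # Old-term entries commit only indirectly
        replicas = 1 + sum(1 for m in leader.match_index.values() if m >= n)
        if replicas > (len(leader.match_index) + 1) // 2:
            leader.commit_index = n
            leader.apply_committed_entries()
            return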
Viewstamped Replication is an alternative consensus protocol that predates the publication of Paxos:
VIEWSTAMPED REPLICATION
═══════════════════════

SIMILAR TO RAFT:
────────────────
• Strong leader (called "primary")
• Log replication
• View changes (like term changes)

KEY DIFFERENCES:
────────────────
• View-change is more complex
• Different recovery protocol
• Historically important (1988, before Paxos publication)

VIEW CHANGE PROTOCOL:
─────────────────────
1. New primary collects state from f+1 replicas
2. Computes new state (union of all logs)
3. Sends StartView to all replicas
4. Replicas update and respond

WHEN TO USE VR vs RAFT:
───────────────────────
Usually prefer Raft (simpler, better understood)
VR useful for academic understanding
ZAB
═══

Designed specifically for Zookeeper's requirements:
• Primary order: All updates from a leader in order
• Prefix property: If transaction T is delivered, all before T too

PHASES:
───────
1. LEADER ELECTION
   └── Fast leader election using voting
   └── Leader chosen based on highest zxid (transaction id)

2. DISCOVERY
   └── Leader learns about all accepted transactions
   └── Collects epoch information from followers

3. SYNCHRONIZATION
   └── Leader syncs followers to its state
   └── Handles DIFF, TRUNC, or SNAP updates

4. BROADCAST
   └── Normal operation - leader broadcasts updates
   └── Two-phase commit: PROPOSAL → ACK → COMMIT

ZXID (Transaction ID):
──────────────────────
64-bit: [epoch (32 bits)][counter (32 bits)]

Epoch increases with each new leader
Counter increases with each transaction
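A sketch of that 64-bit zxid layout in code (purely illustrative, not taken from ZooKeeper's source):

def make_zxid(epoch, counter):
    # High 32 bits: leader epoch; low 32 bits: per-epoch counter.
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF

# zxids order first by epoch, then by counter, so a transaction from a
# newer leader always compares greater than anything from older leaders.
assert make_zxid(epoch=5, counter=0) > make_zxid(epoch=4, counter=999_999)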
GUARANTEES:
───────────
1. Linearizable Writes:  All writes go through leader, ordered globally
2. FIFO Client Order:    Client's operations processed in order
3. Wait-free Reads:      Reads can go to any server (possibly stale)
4. Watch Notifications:  Clients notified of changes

USE CASES:
──────────
• Distributed locking
• Leader election
• Configuration management
• Service discovery
• Cluster membership
Staff+ Level: EPaxos is an advanced topic that shows deep understanding of consensus trade-offs. It’s less commonly asked but demonstrates expertise.
EPAXOS
══════

THE PROBLEM WITH LEADER-BASED CONSENSUS:
────────────────────────────────────────
• All requests go through leader (bottleneck)
• Leader failure triggers expensive election
• Cross-datacenter: high latency to leader

EPAXOS KEY INSIGHT:
───────────────────
"If commands don't conflict, they can commit independently"

LEADERLESS CONSENSUS:
─────────────────────
• ANY replica can be "command leader" for any command
• Conflicting commands require extra round
• Non-conflicting commands: 1 round-trip (fast path)

CONFLICT DETECTION:
───────────────────
Commands conflict if they access same keys with at least one write

get(x), get(x)      → NO conflict (both reads)
get(x), put(x,1)    → CONFLICT (one is write)
put(x,1), put(y,2)  → NO conflict (different keys)
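The conflict rule above is simple enough to state in code; Command here is a made-up record type for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class Command:
    op: str   # "get" or "put"
    key: str

def conflicts(a: Command, b: Command) -> bool:
    # Commands conflict if they touch the same key and at least one writes.
    return a.key == b.key and "put" in (a.op, b.op)

assert not conflicts(Command("get", "x"), Command("get", "x"))  # both reads
assert conflicts(Command("get", "x"), Command("put", "x"))      # one is a write
assert not conflicts(Command("put", "x"), Command("put", "y"))  # different keys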
FAST PATH (no conflicts):
─────────────────────────
Client ──> R1 (command leader for cmd A)
            │
            │ PreAccept(A)
            ├──────────────────> R2 ─┐
            ├──────────────────> R3 ─┼─ All reply "OK, no conflicts"
            ├──────────────────> R4 ─┤
            ├──────────────────> R5 ─┘
            │
            │ <── PreAcceptOK ────────
            │     (from fast quorum: ⌊3N/4⌋)
            │
            │ Commit(A)
            ├──────────────────> All replicas
            │
Result: 2 message delays (same as Multi-Paxos with stable leader!)

SLOW PATH (conflicts detected):
───────────────────────────────
R1: PreAccept(A) includes deps = {}
R2: Already has B that conflicts with A
R2: PreAcceptOK with deps = {B}
R1: Sees different deps from different replicas
R1: Must run Paxos Accept phase to agree on deps

Additional round:
R1:  Accept(A, deps={B})
All: AcceptOK

Result: 4 message delays (slower, but still correct)
EPAXOS TRADE-OFFS
═════════════════

ADVANTAGES:
───────────
✓ No leader bottleneck
✓ Lower latency for clients near any replica
✓ No leader election overhead
✓ Better for geo-distributed deployments
✓ Fast path = same latency as optimized Paxos

DISADVANTAGES:
──────────────
✗ More complex to implement correctly
✗ Slower when there are many conflicts
✗ More complex recovery
✗ Fewer battle-tested implementations

USE WHEN:
─────────
• Multi-datacenter deployment
• Low conflict workload (different clients access different data)
• Read-heavy or partition-friendly workload
• Need to eliminate leader as single point of latency

DON'T USE WHEN:
───────────────
• High conflict rate (hot keys)
• Single datacenter (Raft is simpler)
• Team doesn't have consensus expertise
Production consensus systems use several optimizations to achieve high throughput despite the fundamental requirement of disk I/O and network round-trips.
Key Insight: Requests at different pipeline stages use different resources (CPU for append, network for replicate, disk for commit). Pipelining maximizes resource utilization.
The leader can write to its own log while simultaneously sending AppendEntries to followers.
SEQUENTIAL:
Leader:     [Write to Disk] ──────────► [Send to Followers]
                                              ↓
Followers:                            [Receive] → [Write to Disk]

PARALLEL:
Leader:     [Write to Disk] ─────┐
                                 ├─► [Wait for both]
            [Send to Followers] ─┘
                   ↓
Followers:  [Receive] → [Write to Disk]

Savings: Hides leader's disk latency behind network latency
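A sketch of the parallel version using a thread pool; local_append_and_fsync, send_append_entries, and the response objects are placeholders rather than a real RPC layer:

from concurrent.futures import ThreadPoolExecutor, wait

def replicate(leader, entry, followers):
    with ThreadPoolExecutor() as pool:
        # Start the leader's own disk write and the network sends at the
        # same time instead of serializing them.
        disk_future = pool.submit(leader.local_append_and_fsync, entry)
        send_futures = [
            pool.submit(leader.send_append_entries, f, [entry]) for f in followers
        ]
        wait([disk_future] + send_futures)
    # Committed once the leader plus a majority of followers have the entry.
    acks = 1 + sum(1 for fut in send_futures if fut.result().success)
    return acks > (len(followers) + 1) // 2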
fsync() is expensive (~10ms on HDD, ~0.1ms on NVMe). Instead of fsync per entry, batch multiple entries into one fsync.
from threading import Lock

class GroupCommitter:
    def __init__(self, max_wait_ms=1, max_batch=100):
        self.max_wait_ms = max_wait_ms
        self.max_batch = max_batch
        self.pending = []
        self.lock = Lock()

    def append_and_sync(self, entry):
        with self.lock:
            self.pending.append(entry)
            if len(self.pending) >= self.max_batch:
                self._flush()
        # Or wait for timer-based flush
        wait_for_flush_complete()

    def _flush(self):
        # Write all pending entries
        for entry in self.pending:
            log_file.write(entry)

        # Single fsync for all entries!
        log_file.fsync()  # One fsync instead of N
        self.pending = []

        # Notify all waiters
        notify_all_complete()
Result: 1000 entries with 1 fsync instead of 1000 fsyncs = ~1000x throughput improvement for disk-bound workloads.
Prevent slow followers from causing unbounded memory growth on the leader.
FLOW CONTROL STRATEGY
═════════════════════

Leader tracks per-follower state:
• nextIndex:  Next entry to send
• matchIndex: Highest replicated entry
• inflight:   Entries sent but not ACKed

Flow Control Rules:
IF inflight > MAX_INFLIGHT:
    Pause sending to this follower
    (prevents OOM on leader)

IF matchIndex too far behind:
    Send snapshot instead of entries
    (prevents unbounded log growth)
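Those rules as leader-side pseudocode; the thresholds and field names are assumptions for illustration:

MAX_INFLIGHT = 256               # Entries sent but not yet ACKed
SNAPSHOT_LAG_THRESHOLD = 10_000  # Beyond this gap, send a snapshot instead

def plan_send(leader, follower_id):
    inflight = leader.next_index[follower_id] - leader.match_index[follower_id] - 1
    if inflight > MAX_INFLIGHT:
        return "pause"           # Back-pressure: stop sending for now
    lag = leader.last_log_index() - leader.match_index[follower_id]
    if lag > SNAPSHOT_LAG_THRESHOLD:
        return "send_snapshot"   # The needed entries may already be compacted
    return "send_entries"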
Staff Tip: In interviews, mention that “naive Raft” achieves ~1000 ops/s, but production systems like etcd/TiKV achieve 100K+ ops/s through these optimizations. This shows you understand the gap between textbook algorithms and production systems.
Raft is designed for understandability. If you’re building consensus from scratch:
# Recommended approach
1. Use an existing library (etcd/raft, hashicorp/raft)
2. If building from scratch, follow MIT 6.824 lab structure
3. Write extensive tests before anything else
These bugs have caused real production outages. Watch out for them!
# BUG 1: Not persisting state before responding
def on_request_vote(request):
    if should_grant_vote(request):
        voted_for = request.candidate_id
        # BUG: Must persist BEFORE responding!
        # persist_state()  # <-- MISSING
        return VoteGranted()
        # If crash happens, you might vote twice!

# BUG 2: Incorrect leader lease
def handle_read(request):
    if self.is_leader():
        # BUG: Might not be leader anymore!
        # Must check lease or use read index
        return self.state_machine.read(request.key)

# BUG 3: Commit index going backwards
def apply_committed_entries():
    # BUG: Must ensure commit_index only increases
    # Snapshot installation can reset state machine
    # But commit_index should reflect reality
    pass

# BUG 4: Not handling log prefix during append
def append_entries(request):
    for entry in request.entries:
        # BUG: Must delete conflicting entries!
        # If local log has different term at same index,
        # delete it and all following entries
        pass
A critical “Principal Level” operational reality: What happens when you lose a majority of your nodes forever? (e.g., a region outage or physical data corruption).
In systems like Cassandra or FoundationDB, you can bootstrap a new cluster from a snapshot of the surviving node’s data and “re-seed” the consensus state.
If you only need to recover data (not resume writes), you can often perform Stale Reads from the remaining nodes by bypassing the consensus layer and reading directly from the state machine (SSTables/RocksDB).

Staff Tip: Disaster recovery is the only time you should ever touch the internal state of a consensus engine. Mention the “etcd disaster recovery” guide in interviews to show you understand that safety isn’t just about code, but about operational survival.
Other nodes grant their vote (if they haven't already voted this term and the candidate's log is at least as up-to-date as theirs)
If majority votes → becomes leader, sends heartbeats
If receives AppendEntries from valid leader → becomes follower
If timeout again → new election with higher term
Key points to mention:
Randomized timeouts prevent split votes
Log up-to-date check ensures safety
At most one leader per term
Q: What happens if the Raft leader fails during commit?
Answer:
Scenario: Leader has replicated an entry to 2/5 nodes (a minority) and crashes
Followers detect leader failure (election timeout)
New election starts
Because the entry never reached a majority, it was never committed, so either outcome is safe:
If a node holding the entry wins, it finishes replication and the entry commits once a new-term entry commits
If a node without the entry wins, the entry is discarded (no client was ever acknowledged)
If the entry HAD reached a majority, the election restriction guarantees the new leader's log contains it
Key insight: Raft never loses committed entries. An entry replicated to majority WILL be in new leader’s log.
Q: How does Paxos handle competing proposers?
Answer:
Each proposer picks unique proposal number (higher wins)
Acceptor promises to reject lower numbers
If proposer’s Prepare rejected → retry with higher number
In Accept phase, use highest accepted value from promises
System makes progress when one proposer “wins”
Key insight: The “use highest accepted value” rule preserves safety even with competing proposers.
Q: Why can't consensus be achieved in asynchronous systems?
Answer:
FLP Impossibility proves this:
Can’t distinguish slow from dead
Adversarial schedule can always prevent decision
Must choose: safety OR guaranteed termination
Practical workaround: Assume partial synchrony (eventually messages arrive). Raft/Paxos do this with timeouts.
Q: Design a distributed lock using consensus
Answer:
1. Lock request → propose "lock(client_id, lock_name)"
2. Use consensus to order all lock requests
3. Apply in order: first requester gets lock
4. Store lock state in replicated state machine
5. Unlock: consensus on "unlock(client_id, lock_name)"
6. Add lease/TTL to handle client failures

Implementation options:
- Zookeeper ephemeral nodes
- etcd lease-based locks
- Consul sessions
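A sketch of the lock state machine described in steps 3-4 and 6, applied to commands after consensus has ordered and committed them; all names are illustrative and lease handling is deliberately simplified:

import time

class LockStateMachine:
    """Applies lock/unlock commands in the order consensus decided them."""

    def __init__(self, default_ttl_s=30):
        self.holders = {}  # lock_name -> (client_id, expires_at)
        self.default_ttl_s = default_ttl_s

    def apply(self, cmd, now=None):
        # cmd is a committed log entry, e.g. ("lock", "client-42", "jobs")
        # NOTE: real systems put the timestamp/lease in the log entry itself
        # so every replica applies it deterministically; time.time() here is
        # a simplification.
        op, client_id, lock_name = cmd
        now = time.time() if now is None else now

        holder = self.holders.get(lock_name)
        if holder and holder[1] < now:
            holder = None  # Lease expired: treat the lock as free

        if op == "lock":
            if holder is None:
                self.holders[lock_name] = (client_id, now + self.default_ttl_s)
                return True  # First committed requester gets the lock
            return holder[0] == client_id  # Current holder may re-acquire
        if op == "unlock":
            if holder and holder[0] == client_id:
                del self.holders[lock_name]
                return True
            return False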