
Track 2: Consensus Protocols

Consensus is the most critical topic in distributed systems. Master this track to ace Staff+ interviews at top companies.
Track Duration: 36-46 hours
Modules: 5
Key Topics: FLP Impossibility, Paxos, Raft, Viewstamped Replication, ZAB

Module 6: The Consensus Problem

What is Consensus?

Consensus is getting multiple nodes to agree on a single value, despite failures.
┌─────────────────────────────────────────────────────────────────────────────┐
│                        THE CONSENSUS PROBLEM                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SETUP:                                                                     │
│  ├── N nodes in a distributed system                                        │
│  ├── Some nodes may fail (crash failures)                                   │
│  ├── Network may delay/lose messages                                        │
│  └── Goal: Agree on a single value                                          │
│                                                                              │
│  PROPERTIES (must satisfy ALL):                                             │
│                                                                              │
│  1. AGREEMENT:        All non-faulty nodes decide the same value            │
│  2. VALIDITY:         The decided value was proposed by some node           │
│  3. TERMINATION:      All non-faulty nodes eventually decide                │
│                       (liveness property)                                    │
│  4. INTEGRITY:        Each node decides at most once                        │
│                                                                              │
│  WHY IT'S HARD:                                                             │
│  ─────────────                                                              │
│  Node A proposes "5"     │    What if:                                      │
│  Node B proposes "7"     │    - A crashes after proposing?                  │
│  Node C proposes "3"     │    - Message from A to C is lost?                │
│                          │    - B thinks A is dead (just slow)?             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
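
To make the properties concrete, here is a small sketch that restates them as checks over one finished run; the dict representation and node names are illustrative, following the example in the box above.

# Hypothetical post-hoc check of the consensus properties for one run.
# proposed: value each node proposed; decided: value each node decided
# (at most one per node, which is the INTEGRITY property by construction).
def check_consensus_properties(proposed, decided, non_faulty):
    decided_values = {decided[n] for n in non_faulty if n in decided}
    assert len(decided_values) <= 1                      # AGREEMENT
    assert decided_values <= set(proposed.values())      # VALIDITY
    assert all(n in decided for n in non_faulty)         # TERMINATION (eventually)

check_consensus_properties(
    proposed={"A": 5, "B": 7, "C": 3},
    decided={"A": 7, "B": 7, "C": 7},
    non_faulty={"A", "B", "C"},
)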

Use Cases for Consensus

Leader Election

Nodes agree on who the leader is. Examples:
  • Database primary selection
  • Kafka partition leader
  • Kubernetes control plane

Distributed Locking

Nodes agree on who holds the lock. Examples:
  • Zookeeper locks
  • Redis Redlock
  • etcd leases

Configuration Management

Nodes agree on system configuration. Examples:
  • Cluster membership
  • Feature flags
  • Service discovery

State Machine Replication

Nodes agree on a sequence of commands. Examples:
  • Replicated databases
  • Distributed logs
  • Blockchain

FLP Impossibility Theorem

The Most Important Result: No deterministic consensus protocol can guarantee termination in an asynchronous system with even one faulty process.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    FLP IMPOSSIBILITY (1985)                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Fischer, Lynch, and Paterson proved:                                       │
│                                                                              │
│  "No deterministic consensus protocol can guarantee termination             │
│   in an asynchronous system with even one crash failure."                   │
│                                                                              │
│  IMPLICATIONS:                                                              │
│  ─────────────                                                              │
│  You CANNOT have all three:                                                 │
│  1. Safety (agreement + validity)                                           │
│  2. Liveness (termination)                                                  │
│  3. Asynchrony (no timing assumptions)                                      │
│                                                                              │
│  WHY? (Intuition)                                                           │
│  ────────────────                                                           │
│  - In async systems, you can't distinguish slow from dead                   │
│  - There's always an adversarial schedule that prevents progress            │
│  - No matter the protocol, an adversary can delay messages                  │
│    to keep the system in an undecided state                                 │
│                                                                              │
│  HOW PRACTICAL SYSTEMS WORK AROUND IT:                                      │
│  ─────────────────────────────────────                                      │
│  1. Timeouts (partial synchrony) - Most systems do this                     │
│  2. Randomization - Some probability of progress                            │
│  3. Failure detectors - Oracle that hints about failures                    │
│                                                                              │
│  RAFT/PAXOS assume: "Eventually, the network will be synchronous enough"   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
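
As a concrete illustration of workaround 1 (timeouts under partial synchrony), here is a minimal heartbeat-based failure detector sketch. The class shape and timeout value are assumptions; such a detector can falsely suspect a slow node, which is exactly why safety must never depend on it.

import time

SUSPECT_AFTER = 0.3  # seconds without a heartbeat (arbitrary choice)

class FailureDetector:
    def __init__(self):
        self.last_heartbeat = {}   # node_id -> last time we heard from it

    def on_heartbeat(self, node_id):
        self.last_heartbeat[node_id] = time.monotonic()

    def suspects(self, node_id):
        # "Suspect" is a hint, not a fact: the node may just be slow.
        last = self.last_heartbeat.get(node_id)
        return last is None or time.monotonic() - last > SUSPECT_AFTER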

Safety vs Liveness

SAFETY: "Nothing bad happens"
├── Agreement is never violated
├── Only proposed values are chosen
└── Can be proven mathematically

LIVENESS: "Something good eventually happens"  
├── Decisions are eventually made
├── Progress is made
└── Can be violated by FLP (temporarily)

PRACTICAL APPROACH:
├── ALWAYS maintain safety
└── EVENTUALLY achieve liveness (when network is stable)

Module 7: Paxos Protocol

The Original Consensus Algorithm

Paxos was devised by Leslie Lamport in 1989, though the paper ("The Part-Time Parliament") was not published until 1998. Despite being notoriously difficult to understand, it’s the foundation of modern consensus.
┌─────────────────────────────────────────────────────────────────────────────┐
│                          PAXOS OVERVIEW                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ROLES (a node can play multiple roles):                                    │
│  ─────                                                                      │
│  PROPOSER:   Proposes values, drives the protocol                           │
│  ACCEPTOR:   Votes on proposals, stores accepted values                     │
│  LEARNER:    Learns the decided value                                       │
│                                                                              │
│  GOAL: Agree on a single value among N acceptors                            │
│                                                                              │
│  KEY INSIGHT:                                                               │
│  ────────────                                                               │
│  A value is "chosen" when accepted by a MAJORITY of acceptors               │
│  Any two majorities overlap, ensuring agreement                             │
│                                                                              │
│  PHASES:                                                                    │
│  ───────                                                                    │
│  Phase 1a: PREPARE    - Proposer asks acceptors to promise                  │
│  Phase 1b: PROMISE    - Acceptors promise (or reject)                       │
│  Phase 2a: ACCEPT     - Proposer asks acceptors to accept value             │
│  Phase 2b: ACCEPTED   - Acceptors accept (or reject)                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Basic Paxos Step by Step

Proposer selects a unique proposal number n and sends Prepare(n) to all acceptors.
Proposer P1                    Acceptors
───────────                    ─────────
     │                        [A1]  [A2]  [A3]
     │    Prepare(n=1)          │     │     │
     │─────────────────────────>│     │     │
     │───────────────────────────────>│     │
     │─────────────────────────────────────>│
     │                          │     │     │
Acceptor behavior on receiving Prepare(n):
def on_prepare(n):
    if n > promised_n:
        promised_n = n  # Promise not to accept lower
        return Promise(n, accepted_n, accepted_v)
    else:
        return Reject()  # Already promised higher
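
The acceptor's Phase 2 handler is not shown above; here is a hedged sketch in the same spirit, with the acceptor state made explicit in a small class and replies simplified to tuples.

class Acceptor:
    def __init__(self):
        self.promised_n = -1      # highest Prepare(n) we have promised
        self.accepted_n = None    # number of the highest proposal accepted
        self.accepted_v = None    # its value

    def on_accept(self, n, v):
        # Phase 2b: accept unless we already promised a higher number
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            # A real acceptor must persist this state before replying
            return ("Accepted", n, v)
        return ("Reject", self.promised_n)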

Paxos with Failures

This is where Paxos shines - handling failures:
┌─────────────────────────────────────────────────────────────────────────────┐
│                 PAXOS HANDLING COMPETING PROPOSERS                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Scenario: P1 and P2 both try to propose                                   │
│                                                                              │
│  Time    P1           Acceptors        P2                                  │
│  ────    ──           ─────────        ──                                  │
│   t1     Prepare(1) ────> A1,A2,A3                                          │
│   t2                 <──── Promise(1,∅)                                     │
│   t3                                   Prepare(2) ────> A1,A2,A3           │
│   t4                                   <──── Promise(2,∅)                   │
│   t5     Accept(1,"X") ──> A1,A2,A3                                         │
│   t6                 ──── REJECT! ──── (promised n=2 already)              │
│   t7                                   Accept(2,"Y") ──> A1,A2,A3          │
│   t8                                   <──── Accepted(2,"Y")               │
│                                                                              │
│  Result: "Y" is chosen (P2 won with higher proposal number)                │
│                                                                              │
│  P1 must retry with n > 2                                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Why Paxos is Safe

THEOREM: If a value v is chosen, any future proposal will have value v.

PROOF INTUITION:
─────────────────
1. v is chosen → accepted by majority M1
2. Future proposal n' > n must get promises from majority M2
3. M1 ∩ M2 ≠ ∅ (majorities overlap)
4. At least one acceptor in M2 already accepted v
5. That acceptor tells proposer about v in Promise
6. Proposer MUST use v (highest accepted value)
7. Therefore, v is preserved

This is why proposal numbers and the "use highest accepted value" rule exist!
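
The proposer-side rule the proof relies on (step 6) can be written as a small helper. The (accepted_n, accepted_v) pair representation of Promise replies is an assumption for illustration.

def choose_value_for_phase2(my_value, promises):
    # promises: (accepted_n, accepted_v) pairs from a majority of acceptors;
    # (None, None) means that acceptor had not accepted anything yet.
    accepted = [(n, v) for (n, v) in promises if v is not None]
    if accepted:
        # Adopt the value with the highest accepted proposal number
        return max(accepted)[1]
    return my_value

# One acceptor already accepted "X" under proposal 3, so a new proposer
# holding its own value "Y" is forced to propose "X" again:
assert choose_value_for_phase2("Y", [(None, None), (3, "X"), (1, "Z")]) == "X"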

Multi-Paxos

Basic Paxos decides ONE value. Multi-Paxos decides a SEQUENCE:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          MULTI-PAXOS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  OPTIMIZATION: Elect a stable leader, skip Phase 1                          │
│                                                                              │
│  Log Index:    1        2        3        4        5                        │
│               ┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐                   │
│  Log:         │ v1 │   │ v2 │   │ v3 │   │ v4 │   │ v5 │                   │
│               └────┘   └────┘   └────┘   └────┘   └────┘                   │
│                 │        │        │        │        │                       │
│              Decided  Decided  Decided  Pending  Pending                    │
│                                                                              │
│  LEADER OPTIMIZATION:                                                       │
│  ────────────────────                                                       │
│  1. Leader runs Phase 1 ONCE for all future slots                          │
│  2. Each new command: just Phase 2 (Accept/Accepted)                        │
│  3. Much faster! (2 message delays instead of 4)                            │
│                                                                              │
│  IF LEADER FAILS:                                                           │
│  ────────────────                                                           │
│  New leader runs Phase 1 for any undecided slots                            │
│  May discover in-progress proposals and complete them                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 8: Raft Consensus (Deep Dive)

Most Asked in Interviews: Raft is designed to be understandable. You MUST be able to explain it clearly.

Why Raft?

┌─────────────────────────────────────────────────────────────────────────────┐
│                    RAFT vs PAXOS                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PAXOS PROBLEMS:                       RAFT SOLUTIONS:                      │
│  ───────────────                       ───────────────                      │
│  - Hard to understand                  - Designed for understandability     │
│  - Many variants, unclear which to use - One clear algorithm                │
│  - Difficult to implement correctly    - Clearer implementation path        │
│  - Symmetric roles                     - Strong leader model                │
│                                                                              │
│  RAFT KEY DECISIONS:                                                        │
│  ───────────────────                                                        │
│  1. Strong leader: Only leader handles client requests                      │
│  2. Leader election: Randomized timeouts for faster convergence             │
│  3. Log matching: If entries match at index, all prior entries match        │
│  4. Leader completeness: Elected leader has all committed entries           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Raft State Machine

[Figure: Raft consensus state machine]
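
A rough sketch of the three roles and the transition rules the state machine encodes; NodeState and the helper below are illustrative, not a complete implementation.

from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

@dataclass
class NodeState:
    current_term: int = 0
    role: Role = Role.FOLLOWER
    voted_for: Optional[int] = None

# Transition rules (sketch):
#   FOLLOWER  -- election timeout -------------------> CANDIDATE
#   CANDIDATE -- wins votes from a majority ---------> LEADER
#   CANDIDATE -- valid AppendEntries from a leader --> FOLLOWER
#   any role  -- sees a higher term -----------------> FOLLOWER

def step_down_if_higher_term(node: NodeState, rpc_term: int) -> None:
    # Whatever the current role, a higher term forces a step-down.
    if rpc_term > node.current_term:
        node.current_term = rpc_term
        node.role = Role.FOLLOWER
        node.voted_for = None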

Raft Terms

┌─────────────────────────────────────────────────────────────────────────────┐
│                           RAFT TERMS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Time is divided into TERMS of arbitrary length:                            │
│                                                                              │
│      Term 1        Term 2        Term 3        Term 4        Term 5         │
│  ┌─────────────┬─────────────┬─────────────┬─────────────┬─────────────┐    │
│  │ Election    │ Election    │ Election    │ Election    │ Election    │    │
│  │ + Normal    │ (failed:    │ + Normal    │ + Normal    │ + Normal    │    │
│  │ Operation   │ no leader)  │ Operation   │ Operation   │ Operation   │    │
│  └─────────────┴─────────────┴─────────────┴─────────────┴─────────────┘    │
│                                                                              │
│  TERM RULES:                                                                │
│  ───────────                                                                │
│  • Each term starts with election                                           │
│  • At most ONE leader per term                                              │
│  • Higher term always wins                                                  │
│  • If node sees higher term → immediately becomes follower                  │
│  • Terms act as logical clock                                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Leader Election

FOLLOWER TIMEOUT:
─────────────────

Follower                   Leader
   │                         │
   │ <── Heartbeat ─────────│  (every 150ms)
   │                         │
   │     (waiting...)       │
   │                         │
   │     (timeout: 300ms)    X  (leader dies)

   │  No heartbeat received!
   │  Convert to CANDIDATE
   │  Start election


TIMEOUT: Random between 150-300ms
(Randomness prevents split votes)
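
A hedged sketch of what a node does when that timeout fires; the Candidate class, the transport object, and the message dict are assumptions, not a complete implementation.

class Candidate:
    def __init__(self, node_id, peers, log, transport):
        self.node_id = node_id
        self.peers = peers          # ids of the other nodes
        self.log = log              # list of (term, command) entries
        self.transport = transport  # assumed to expose send(peer, message)
        self.current_term = 0
        self.voted_for = None

    def on_election_timeout(self):
        # No heartbeat arrived within the randomized timeout: start an election.
        self.current_term += 1
        self.voted_for = self.node_id           # vote for ourselves
        # (a real node must persist term and vote before sending RPCs)
        last_index = len(self.log)
        last_term = self.log[-1][0] if self.log else 0
        for peer in self.peers:
            self.transport.send(peer, {
                "type": "RequestVote",
                "term": self.current_term,
                "candidateId": self.node_id,
                "lastLogIndex": last_index,
                "lastLogTerm": last_term,
            })
        # Become leader on votes from a majority; step back to follower if a
        # valid AppendEntries for this (or a higher) term arrives first.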

Log Replication

The core of Raft - replicating commands across nodes:
┌─────────────────────────────────────────────────────────────────────────────┐
│                        RAFT LOG STRUCTURE                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Leader's Log:                                                              │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐                          │
│  │ 1,1 │ 1,2 │ 1,3 │ 2,4 │ 2,5 │ 3,6 │ 3,7 │ 3,8 │                          │
│  │ x←1 │ y←2 │ x←3 │ y←4 │ z←5 │ x←6 │ y←7 │ z←8 │                          │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘                          │
│    1     2     3     4     5     6     7     8    ← Log Index               │
│                                                                              │
│  Entry format: (term, index) command                                        │
│                                                                              │
│  COMMIT INDEX:                                                              │
│  ─────────────                                                              │
│  An entry is COMMITTED when replicated to majority                          │
│                                                                              │
│  Leader:      commitIndex = 6 (entries 1-6 are committed)                   │
│  Followers:   Apply entries up to commitIndex                               │
│                                                                              │
│  SAFETY: Once committed, entry is NEVER overwritten                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

AppendEntries RPC

# Leader sends to followers
AppendEntries {
    term: int,              # Leader's term
    leaderId: int,          # So follower can redirect clients
    prevLogIndex: int,      # Index of entry before new ones
    prevLogTerm: int,       # Term of prevLogIndex entry
    entries: List[Entry],   # Log entries to store (empty for heartbeat)
    leaderCommit: int       # Leader's commitIndex
}

# Follower's implementation
def handle_append_entries(request):
    # Reject if leader's term is old
    if request.term < currentTerm:
        return AppendResponse(term=currentTerm, success=False)
    
    # Reset election timer (leader is alive)
    reset_election_timer()
    
    # Update term if needed
    if request.term > currentTerm:
        currentTerm = request.term
        convert_to_follower()
    
    # LOG MATCHING: Check if we have the previous entry
    if not log_contains(request.prevLogIndex, request.prevLogTerm):
        # Tell leader to back up and retry
        return AppendResponse(term=currentTerm, success=False)
    
    # APPEND NEW ENTRIES
    # (log treated as 1-indexed pseudocode; delete conflicting entries first)
    for i, entry in enumerate(request.entries):
        index = request.prevLogIndex + 1 + i
        if index <= len(log) and log[index].term != entry.term:
            # Conflict: delete this entry and everything that follows it
            log = log[:index]
        if index > len(log):
            log.append(entry)
    
    # UPDATE COMMIT INDEX
    if request.leaderCommit > commitIndex:
        commitIndex = min(request.leaderCommit, len(log))
        apply_committed_entries()
    
    return AppendResponse(term=currentTerm, success=True)

Log Matching Property

┌─────────────────────────────────────────────────────────────────────────────┐
│                     LOG MATCHING PROPERTY                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IF two entries in different logs have same index AND term:                 │
│  ├── They store the same command                                            │
│  └── All preceding entries are identical                                    │
│                                                                              │
│  HOW IT'S MAINTAINED:                                                       │
│  ────────────────────                                                       │
│  1. Leader creates entries with current term                                │
│  2. AppendEntries includes prevLogIndex and prevLogTerm                     │
│  3. Follower rejects if previous entry doesn't match                        │
│  4. Leader backs up and retries until match found                           │
│                                                                              │
│  EXAMPLE - Repairing Inconsistent Log:                                      │
│                                                                              │
│  Leader:   [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]                             │
│  Follower: [1,1] [1,2] [2,3] [2,4]                    (behind)              │
│                                                                              │
│  AppendEntries(prevLogIndex=4, prevLogTerm=2, entries=[[3,5], [3,6]])      │
│  Follower: Entry at index 4, term 2 exists? ✓                               │
│  Follower: Append entries 5 and 6                                           │
│  Result:   [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]        (matches!)            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
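
The leader-side half of step 4 (backing up after a rejection) might look like the sketch below; nextIndex/matchIndex are the per-follower bookkeeping arrays from the Raft paper, and the leader object's helper methods are assumptions.

def on_append_entries_response(leader, follower, resp, sent_prev_index, sent_count):
    if resp["term"] > leader.current_term:
        leader.step_down(resp["term"])           # someone has a newer term
        return
    if resp["success"]:
        leader.matchIndex[follower] = sent_prev_index + sent_count
        leader.nextIndex[follower] = leader.matchIndex[follower] + 1
        leader.maybe_advance_commit_index()      # committed once a majority matches
    else:
        # Consistency check failed: back up one entry and retry
        leader.nextIndex[follower] = max(1, leader.nextIndex[follower] - 1)
        leader.send_append_entries(follower)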

Safety: Leader Completeness

Critical Property: If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher terms.
HOW RAFT ENSURES THIS:
──────────────────────

1. ELECTION RESTRICTION:
   Voter denies vote if candidate's log is less up-to-date
   
   "Up-to-date" comparison:
   - Compare last entry's term (higher is better)
   - If same term, compare last entry's index (longer is better)

2. LEADER ONLY COMMITS ENTRIES FROM CURRENT TERM:
   Leader doesn't count old-term entries as committed
   until a new entry from current term is replicated

   Why? Prevents overwriting committed entries
   
   EXAMPLE (The Figure 8 scenario):
   ─────────────────────────────────
   
   Term 2: Leader S1 replicates entry at index 2 to S2
           Crashes before committing
   
   Term 3: S5 becomes leader (didn't have index 2)
           Appends entry at index 2 with term 3
   
   Term 4: S1 comes back, becomes leader
           CAN'T just count term-2 entry as committed!
           Must replicate a term-4 entry first
           
   This prevents the term-2 entry from being overwritten
   by the term-3 entry that S5 had
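
The "up-to-date" comparison from point 1 above, as a small self-contained sketch:

def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    # Grant a vote only if the candidate's log is at least as up-to-date:
    # compare the terms of the last entries first, then their indexes.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

# Candidate ends with (term=3, index=5); voter ends with (term=2, index=7).
# The higher last term wins even though the voter's log is longer:
assert log_is_up_to_date(3, 5, 2, 7)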

Handling Cluster Membership Changes

┌─────────────────────────────────────────────────────────────────────────────┐
│                    JOINT CONSENSUS                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PROBLEM: Adding/removing nodes during operation                            │
│                                                                              │
│  DANGER: If we switch configs directly:                                     │
│                                                                              │
│  Old: {A, B, C}        New: {A, B, C, D, E}                                 │
│                                                                              │
│  During transition:                                                         │
│  - {A, B} thinks old config, majority = 2                                   │
│  - {C, D, E} thinks new config, majority = 3                                │
│  - BOTH could be majorities! Two leaders possible!                          │
│                                                                              │
│  SOLUTION: JOINT CONSENSUS                                                  │
│  ─────────────────────────                                                  │
│                                                                              │
│  Phase 1: C_old,new (joint configuration)                                   │
│           Decisions need majorities in BOTH old AND new                     │
│                                                                              │
│  Phase 2: C_new (new configuration)                                         │
│           Only after C_old,new is committed                                 │
│                                                                              │
│  Timeline:                                                                  │
│  ─────────                                                                  │
│        │              │                │                                    │
│   C_old│ C_old,new    │   C_old,new    │  C_new                             │
│        │ entry        │   committed    │  entry                             │
│        │ created      │                │  created                           │
│        │              │                │                                    │
│        ▼              ▼                ▼                                    │
│  ──────┬──────────────┬────────────────┬────────────────►                   │
│                                                                              │
│  SIMPLER ALTERNATIVE: Single-server changes (one at a time)                 │
│  ─────────────────────                                                      │
│  Many implementations use this simpler approach                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Log Compaction (Snapshots)

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SNAPSHOTS                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PROBLEM: Log grows forever                                                 │
│                                                                              │
│  SOLUTION: Periodically take snapshots                                      │
│                                                                              │
│  Before snapshot:                                                           │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐                          │
│  │ x←1 │ y←2 │ x←3 │ y←4 │ x←5 │ y←6 │ x←7 │ y←8 │                          │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘                          │
│    1     2     3     4     5     6     7     8                              │
│                              ▲                                              │
│                         committed                                           │
│                                                                              │
│  After snapshot (at index 5):                                               │
│  ┌────────────────────┬─────┬─────┬─────┐                                   │
│  │     SNAPSHOT       │ y←6 │ x←7 │ y←8 │                                   │
│  │ lastIndex=5        │     │     │     │                                   │
│  │ lastTerm=2         │     │     │     │                                   │
│  │ state={x:5, y:4}   │     │     │     │                                   │
│  └────────────────────┴─────┴─────┴─────┘                                   │
│                          6     7     8                                      │
│                                                                              │
│  INSTALLSNAPSHOT RPC:                                                       │
│  Used when follower is too far behind                                       │
│  Leader sends its snapshot instead of replaying entries                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
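
A minimal sketch of taking a snapshot and truncating the log; the methods on the log and state machine, and persist(), are assumptions standing in for real storage code.

def take_snapshot(state_machine, log, last_applied, persist):
    # Capture the applied state plus the metadata needed to keep the
    # AppendEntries consistency check working across the cut.
    snapshot = {
        "lastIncludedIndex": last_applied,
        "lastIncludedTerm": log.term_at(last_applied),
        "state": state_machine.serialize(),
    }
    persist(snapshot)                    # must reach disk before truncation
    log.discard_up_to(last_applied)      # entries <= last_applied are dropped
    return snapshot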

Module 9: Viewstamped Replication

An alternative consensus protocol that predates the publication of Paxos:
┌─────────────────────────────────────────────────────────────────────────────┐
│                 VIEWSTAMPED REPLICATION                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SIMILAR TO RAFT:                                                           │
│  ────────────────                                                           │
│  • Strong leader (called "primary")                                         │
│  • Log replication                                                          │
│  • View changes (like term changes)                                         │
│                                                                              │
│  KEY DIFFERENCES:                                                           │
│  ────────────────                                                           │
│  • View-change is more complex                                              │
│  • Different recovery protocol                                              │
│  • Historically important (1988, before Paxos publication)                  │
│                                                                              │
│  VIEW CHANGE PROTOCOL:                                                      │
│  ─────────────────────                                                      │
│  1. New primary collects state from f+1 replicas                            │
│  2. Computes new state (union of all logs)                                  │
│  3. Sends StartView to all replicas                                         │
│  4. Replicas update and respond                                             │
│                                                                              │
│  WHEN TO USE VR vs RAFT:                                                    │
│  ────────────────────────                                                   │
│  Usually prefer Raft (simpler, better understood)                           │
│  VR useful for academic understanding                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 10: ZAB (Zookeeper Atomic Broadcast)

ZAB Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                              ZAB                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Designed specifically for Zookeeper's requirements:                        │
│  • Primary order: All updates from a leader in order                        │
│  • Prefix property: If transaction T is delivered, all before T too         │
│                                                                              │
│  PHASES:                                                                    │
│  ───────                                                                    │
│                                                                              │
│  1. LEADER ELECTION                                                         │
│     └── Fast leader election using voting                                   │
│     └── Leader chosen based on highest zxid (transaction id)                │
│                                                                              │
│  2. DISCOVERY                                                               │
│     └── Leader learns about all accepted transactions                       │
│     └── Collects epoch information from followers                           │
│                                                                              │
│  3. SYNCHRONIZATION                                                         │
│     └── Leader syncs followers to its state                                 │
│     └── Handles DIFF, TRUNC, or SNAP updates                                │
│                                                                              │
│  4. BROADCAST                                                               │
│     └── Normal operation - leader broadcasts updates                        │
│     └── Two-phase commit: PROPOSAL → ACK → COMMIT                           │
│                                                                              │
│  ZXID (Transaction ID):                                                     │
│  ──────────────────────                                                     │
│  64-bit: [epoch (32 bits)][counter (32 bits)]                               │
│                                                                              │
│  Epoch increases with each new leader                                       │
│  Counter increases with each transaction                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
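
Given that layout (epoch in the high 32 bits, counter in the low 32 bits), packing and comparing zxids is simple bit arithmetic; a small sketch:

def make_zxid(epoch, counter):
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def zxid_epoch(zxid):
    return zxid >> 32

def zxid_counter(zxid):
    return zxid & 0xFFFFFFFF

# Because the epoch occupies the high bits, plain integer comparison orders
# transactions first by leader epoch, then by counter within the epoch:
assert make_zxid(2, 0) > make_zxid(1, 999)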

Zookeeper Guarantees

GUARANTEES:
───────────
1. Linearizable Writes: All writes go through leader, ordered globally
2. FIFO Client Order: Client's operations processed in order
3. Wait-free Reads: Reads can go to any server (possibly stale)
4. Watch Notifications: Clients notified of changes

USE CASES:
──────────
• Distributed locking
• Leader election
• Configuration management
• Service discovery
• Cluster membership

Comparison: Paxos vs Raft vs ZAB

┌─────────────────┬──────────────┬──────────────┬──────────────┐
│                 │    PAXOS     │    RAFT      │     ZAB      │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Leader          │ Optional     │ Required     │ Required     │
│ Elections       │ Implicit     │ Explicit     │ Explicit     │
│ Log Gaps        │ Allowed      │ Not allowed  │ Not allowed  │
│ Understandable  │ Harder       │ Easier       │ Medium       │
│ Implementations │ Many         │ Many         │ Zookeeper    │
│ Year            │ 1989         │ 2014         │ 2011         │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Used By         │ Chubby       │ etcd         │ Zookeeper    │
│                 │ Spanner      │ Consul       │ Kafka        │
│                 │ CockroachDB  │ TiKV         │              │
└─────────────────┴──────────────┴──────────────┴──────────────┘

Advanced: EPaxos (Egalitarian Paxos)

Staff+ Level: EPaxos is an advanced topic that shows deep understanding of consensus trade-offs. It’s less commonly asked but demonstrates expertise.
┌─────────────────────────────────────────────────────────────────────────────┐
│                            EPAXOS                                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THE PROBLEM WITH LEADER-BASED CONSENSUS:                                   │
│  ─────────────────────────────────────────                                  │
│  • All requests go through leader (bottleneck)                              │
│  • Leader failure triggers expensive election                               │
│  • Cross-datacenter: high latency to leader                                 │
│                                                                              │
│  EPAXOS KEY INSIGHT:                                                        │
│  ───────────────────                                                        │
│  "If commands don't conflict, they can commit independently"                │
│                                                                              │
│  LEADERLESS CONSENSUS:                                                      │
│  ─────────────────────                                                      │
│  • ANY replica can be "command leader" for any command                      │
│  • Conflicting commands require extra round                                 │
│  • Non-conflicting commands: 1 round-trip (fast path)                       │
│                                                                              │
│  CONFLICT DETECTION:                                                        │
│  ───────────────────                                                        │
│  Commands conflict if they access same keys with at least one write         │
│                                                                              │
│  get(x), get(x)    → NO conflict (both reads)                              │
│  get(x), put(x,1)  → CONFLICT (one is write)                               │
│  put(x,1), put(y,2) → NO conflict (different keys)                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
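
The conflict rule above, written as a small predicate; the Command representation is an assumption for illustration.

from collections import namedtuple

Command = namedtuple("Command", ["op", "key", "is_write"])   # hypothetical shape

def conflicts(a, b):
    # Commands conflict if they touch the same key and at least one writes.
    return a.key == b.key and (a.is_write or b.is_write)

assert not conflicts(Command("get", "x", False), Command("get", "x", False))
assert conflicts(Command("get", "x", False), Command("put", "x", True))
assert not conflicts(Command("put", "x", True), Command("put", "y", True))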

EPaxos Protocol Flow

FAST PATH (no conflicts):
────────────────────────

Client ──> R1 (command leader for cmd A)

           │ PreAccept(A)
           ├──────────────────> R2  ─┐
           ├──────────────────> R3  ─┼─ All reply "OK, no conflicts"
           ├──────────────────> R4  ─┤
           ├──────────────────> R5  ─┘

           │ <── PreAcceptOK ────────
           │ (from fast quorum: ⌊3N/4⌋)

           │ Commit(A)
           ├──────────────────> All replicas

Result: 2 message delays (same as Multi-Paxos with stable leader!)


SLOW PATH (conflicts detected):
───────────────────────────────

R1: PreAccept(A) includes deps = {}
R2: Already has B that conflicts with A
R2: PreAcceptOK with deps = {B}

R1: Sees different deps from different replicas
R1: Must run Paxos Accept phase to agree on deps

Additional round:
R1: Accept(A, deps={B})
All: AcceptOK

Result: 4 message delays (slower, but still correct)

When to Consider EPaxos

┌─────────────────────────────────────────────────────────────────────────────┐
│                    EPAXOS TRADE-OFFS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ADVANTAGES:                                                                │
│  ───────────                                                                │
│  ✓ No leader bottleneck                                                    │
│  ✓ Lower latency for clients near any replica                              │
│  ✓ No leader election overhead                                             │
│  ✓ Better for geo-distributed deployments                                  │
│  ✓ Fast path = same latency as optimized Paxos                             │
│                                                                              │
│  DISADVANTAGES:                                                             │
│  ─────────────                                                              │
│  ✗ More complex to implement correctly                                     │
│  ✗ Slower when there are many conflicts                                    │
│  ✗ More complex recovery                                                   │
│  ✗ Fewer battle-tested implementations                                     │
│                                                                              │
│  USE WHEN:                                                                  │
│  ─────────                                                                  │
│  • Multi-datacenter deployment                                              │
│  • Low conflict workload (different clients access different data)          │
│  • Read-heavy or partition-friendly workload                                │
│  • Need to eliminate leader as single point of latency                      │
│                                                                              │
│  DON'T USE WHEN:                                                            │
│  ───────────────                                                            │
│  • High conflict rate (hot keys)                                            │
│  • Single datacenter (Raft is simpler)                                      │
│  • Team doesn't have consensus expertise                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Practical Implementation Guide

Building a Production Consensus System

Step 1: Start with Raft (Not Paxos)

Raft is designed for understandability. If you’re building consensus from scratch:
# Recommended approach
1. Use an existing library (etcd/raft, hashicorp/raft)
2. If building from scratch, follow MIT 6.824 lab structure
3. Write extensive tests before anything else

Step 2: Test Adversarially

# Critical test scenarios
test_cases = [
    "leader_crashes_during_commit",
    "network_partition_minority_vs_majority",
    "split_brain_scenario",
    "slow_follower_catches_up",
    "all_nodes_restart_simultaneously",
    "leader_isolated_then_rejoins",
    "log_divergence_after_multiple_elections",
    "snapshot_during_compaction",
]

# Use chaos testing
chaos_scenarios = [
    "random_network_delays(10-500ms)",
    "random_message_drops(5%)",
    "random_node_restarts",
    "clock_skew_between_nodes",
]

Step 3: Performance Tuning

# Key parameters to tune
raft_config:
  heartbeat_interval: 150ms      # Leader sends heartbeat
  election_timeout_min: 300ms    # Min timeout for election
  election_timeout_max: 500ms    # Max timeout (randomized)
  max_entries_per_append: 100    # Batch size for efficiency
  snapshot_threshold: 10000      # Entries before snapshot
  
# Trade-offs:
# - Lower heartbeat = faster failure detection, more overhead
# - Higher election timeout = fewer spurious elections, slower recovery
# - Larger batch = higher throughput, higher latency per request

Step 4: Production Checklist

□ Persistent storage for term, votedFor, and log
□ Proper fsync on commit (don't lose acknowledged writes)
□ Snapshot mechanism for log compaction
□ Membership change handling (or limit to single-server changes)
□ Pre-vote optimization (prevents disruption from stale candidates)
□ Leader lease for efficient reads
□ Metrics: election count, replication lag, commit latency
□ Alerting on: leader changes, replication lag, log size

Common Implementation Bugs

These bugs have caused real production outages. Watch out for them!
# BUG 1: Not persisting state before responding
def on_request_vote(request):
    if should_grant_vote(request):
        voted_for = request.candidate_id
        # BUG: Must persist BEFORE responding!
        # persist_state()  # <-- MISSING
        return VoteGranted()
    # If crash happens, you might vote twice!

# BUG 2: Incorrect leader lease
def handle_read(request):
    if self.is_leader():
        # BUG: Might not be leader anymore!
        # Must check lease or use read index
        return self.state_machine.read(request.key)
    
# BUG 3: Commit index going backwards
def apply_committed_entries():
    # BUG: Must ensure commit_index only increases
    # Snapshot installation can reset state machine
    # But commit_index should reflect reality

# BUG 4: Not handling log prefix during append
def append_entries(request):
    for entry in request.entries:
        # BUG: Must delete conflicting entries!
        # If local log has different term at same index,
        # delete it and all following entries

Key Interview Questions

Q: How does leader election work in Raft?

Answer structure:
  1. Follower times out (no heartbeat from leader)
  2. Converts to candidate, increments term
  3. Votes for itself, sends RequestVote to all
  4. Other nodes grant their vote (if they haven’t voted this term and the candidate’s log is at least as up-to-date)
  5. If majority votes → becomes leader, sends heartbeats
  6. If receives AppendEntries from valid leader → becomes follower
  7. If timeout again → new election with higher term
Key points to mention:
  • Randomized timeouts prevent split votes
  • Log up-to-date check ensures safety
  • At most one leader per term

Q: What happens if the leader crashes after replicating an entry to some followers but before committing it?

Answer: Scenario: Leader has replicated an entry to 3/5 nodes (a majority) and crashes before committing
  1. Followers detect leader failure (timeout)
  2. New election starts
  3. Only nodes with the entry CAN become leader (election restriction)
  4. New leader has the entry, completes replication
  5. Entry gets committed when new-term entry is committed
Key insight: Raft never loses committed entries. An entry replicated to majority WILL be in new leader’s log.

Q: How does Paxos make progress when two proposers compete?

Answer:
  1. Each proposer picks unique proposal number (higher wins)
  2. Acceptor promises to reject lower numbers
  3. If proposer’s Prepare rejected → retry with higher number
  4. In Accept phase, use highest accepted value from promises
  5. System makes progress when one proposer “wins”
Key insight: The “use highest accepted value” rule preserves safety even with competing proposers.

Q: Why can't consensus be guaranteed in an asynchronous system?

Answer: FLP Impossibility proves this:
  1. Can’t distinguish slow from dead
  2. Adversarial schedule can always prevent decision
  3. Must choose: safety OR guaranteed termination
Practical workaround: Assume partial synchrony (eventually messages arrive). Raft/Paxos do this with timeouts.

Q: How would you implement a distributed lock using consensus?

Answer:
1. Lock request → propose "lock(client_id, lock_name)"
2. Use consensus to order all lock requests
3. Apply in order: first requester gets lock
4. Store lock state in replicated state machine
5. Unlock: consensus on "unlock(client_id, lock_name)"
6. Add lease/TTL to handle client failures

Implementation options:
- Zookeeper ephemeral nodes
- etcd lease-based locks
- Consul sessions
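
A sketch of the apply function such a lock service would run on every replica once consensus has ordered the commands. The command fields are assumptions; in particular, the timestamp used for lease expiry is chosen by the proposer and carried inside the command so that every replica applies it deterministically.

def apply(state, cmd):
    # state: {lock_name: (holder_id, lease_expiry)} - identical on all replicas,
    # because every replica applies the same committed commands in the same order.
    if cmd["op"] == "lock":
        holder = state.get(cmd["lock_name"])
        if holder is None or holder[1] <= cmd["now"]:        # free, or lease expired
            state[cmd["lock_name"]] = (cmd["client_id"], cmd["now"] + cmd["ttl"])
            return "granted"
        return "denied"
    if cmd["op"] == "unlock":
        holder = state.get(cmd["lock_name"])
        if holder and holder[0] == cmd["client_id"]:
            del state[cmd["lock_name"]]
            return "released"
        return "not-holder"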

Hands-On Project: Implement Raft

Raft Implementation Lab

Build a complete Raft implementation:

Phase 1: Leader Election
  • Implement RequestVote RPC
  • Handle election timeouts
  • State transitions (follower → candidate → leader)
Phase 2: Log Replication
  • Implement AppendEntries RPC
  • Log matching checks
  • Commit index tracking
Phase 3: Persistence
  • Persist term, votedFor, log
  • Recovery after restart
Phase 4: Snapshots
  • Implement log compaction
  • InstallSnapshot RPC
Recommended Resources:
  • MIT 6.824 Labs
  • Raft visualization: raft.github.io
  • etcd/raft source code

Next Steps

Continue to Track 3: Replication Strategies

Learn single-leader, multi-leader, and leaderless replication patterns