Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Raft Consensus Protocol

Track 2: Consensus Protocols

Consensus is the most critical topic in distributed systems. Master this track to ace Staff+ interviews at top companies.
Track Duration: 36-46 hours
Modules: 5
Key Topics: FLP Impossibility, Paxos, Raft, Viewstamped Replication, ZAB

Module 6: The Consensus Problem

What is Consensus?

Consensus is getting multiple nodes to agree on a single value, despite failures. The dinner reservation analogy: Imagine five friends trying to agree on a restaurant via a group chat where messages can be delayed, arrive out of order, or never arrive at all — and some friends might close the app mid-conversation. That is the consensus problem. It sounds trivial until you realize that “just pick the first suggestion” breaks when two people suggest simultaneously, and “wait for everyone to respond” breaks when someone goes offline. Every consensus protocol is fundamentally a clever answer to the question: “How do we make a group decision when we can’t even hold a reliable meeting?”
┌─────────────────────────────────────────────────────────────────────────────┐
│                        THE CONSENSUS PROBLEM                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SETUP:                                                                     │
│  ├── N nodes in a distributed system                                        │
│  ├── Some nodes may fail (crash failures)                                   │
│  ├── Network may delay/lose messages                                        │
│  └── Goal: Agree on a single value                                          │
│                                                                              │
│  PROPERTIES (must satisfy ALL):                                             │
│                                                                              │
│  1. AGREEMENT:        All non-faulty nodes decide the same value            │
│  2. VALIDITY:         The decided value was proposed by some node           │
│  3. TERMINATION:      All non-faulty nodes eventually decide                │
│                       (liveness property)                                    │
│  4. INTEGRITY:        Each node decides at most once                        │
│                                                                              │
│  WHY IT'S HARD:                                                             │
│  ─────────────                                                              │
│  Node A proposes "5"     │    What if:                                      │
│  Node B proposes "7"     │    - A crashes after proposing?                  │
│  Node C proposes "3"     │    - Message from A to C is lost?                │
│                          │    - B thinks A is dead (just slow)?             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Use Cases for Consensus

Leader Election

Nodes agree on who the leader isExamples:
  • Database primary selection
  • Kafka partition leader
  • Kubernetes control plane

Distributed Locking

Nodes agree on who holds the lockExamples:
  • Zookeeper locks
  • Redis Redlock
  • etcd leases

Configuration Management

Nodes agree on system configurationExamples:
  • Cluster membership
  • Feature flags
  • Service discovery

State Machine Replication

Nodes agree on sequence of commandsExamples:
  • Replicated databases
  • Distributed logs
  • Blockchain

FLP Impossibility Theorem

The Most Important Result: Consensus is impossible in an asynchronous system with even one faulty process.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    FLP IMPOSSIBILITY (1985)                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Fischer, Lynch, and Paterson proved:                                       │
│                                                                              │
│  "No deterministic consensus protocol can guarantee termination             │
│   in an asynchronous system with even one crash failure."                   │
│                                                                              │
│  IMPLICATIONS:                                                              │
│  ─────────────                                                              │
│  You CANNOT have all three:                                                 │
│  1. Safety (agreement + validity)                                           │
│  2. Liveness (termination)                                                  │
│  3. Asynchrony (no timing assumptions)                                      │
│                                                                              │
│  WHY? (Intuition)                                                           │
│  ────────────────                                                           │
│  - In async systems, you can't distinguish slow from dead                   │
│  - There's always an adversarial schedule that prevents progress            │
│  - No matter the protocol, an adversary can delay messages                  │
│    to keep the system in an undecided state                                 │
│                                                                              │
│  HOW PRACTICAL SYSTEMS WORK AROUND IT:                                      │
│  ─────────────────────────────────────                                      │
│  1. Timeouts (partial synchrony) - Most systems do this                     │
│  2. Randomization - Some probability of progress                            │
│  3. Failure detectors - Oracle that hints about failures                    │
│                                                                              │
│  RAFT/PAXOS assume: "Eventually, the network will be synchronous enough"   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Safety vs Liveness

SAFETY: "Nothing bad happens"
├── Agreement is never violated
├── Only proposed values are chosen
└── Can be proven mathematically

LIVENESS: "Something good eventually happens"  
├── Decisions are eventually made
├── Progress is made
└── Can be violated by FLP (temporarily)

PRACTICAL APPROACH:
├── ALWAYS maintain safety
└── EVENTUALLY achieve liveness (when network is stable)

Module 7: Paxos Protocol

The Original Consensus Algorithm

Paxos was invented by Leslie Lamport in 1989 (published in 1998 after being rejected for being “too whimsical” in its original Greek parliament metaphor). Despite being notoriously difficult to understand — Lamport himself said “The Paxos algorithm, when presented in plain English, is very simple” and the community collectively disagreed — it’s the foundation of modern consensus.
Why Paxos is hard to grok: The difficulty is not in any single step but in understanding why each step is necessary. Every rule in Paxos exists to prevent a specific failure scenario. The “promise” mechanism prevents split decisions. The “use highest accepted value” rule prevents overwriting an already-chosen value. If you find yourself confused, ask: “What goes wrong if I skip this step?” The answer is always a concrete safety violation.
┌─────────────────────────────────────────────────────────────────────────────┐
│                          PAXOS OVERVIEW                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ROLES (a node can play multiple roles):                                    │
│  ─────                                                                      │
│  PROPOSER:   Proposes values, drives the protocol                           │
│  ACCEPTOR:   Votes on proposals, stores accepted values                     │
│  LEARNER:    Learns the decided value                                       │
│                                                                              │
│  GOAL: Agree on a single value among N acceptors                            │
│                                                                              │
│  KEY INSIGHT:                                                               │
│  ────────────                                                               │
│  A value is "chosen" when accepted by a MAJORITY of acceptors               │
│  Any two majorities overlap, ensuring agreement                             │
│                                                                              │
│  PHASES:                                                                    │
│  ───────                                                                    │
│  Phase 1a: PREPARE    - Proposer asks acceptors to promise                  │
│  Phase 1b: PROMISE    - Acceptors promise (or reject)                       │
│  Phase 2a: ACCEPT     - Proposer asks acceptors to accept value             │
│  Phase 2b: ACCEPTED   - Acceptors accept (or reject)                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Basic Paxos Step by Step

Proposer selects a unique proposal number n and sends Prepare(n) to all acceptors.
Proposer P1                    Acceptors
───────────                    ─────────
     │                         [A1] [A2] [A3]
     │    Prepare(n=1)           │    │    │
     │─────────────────────────>│    │    │
     │─────────────────────────>│────│    │
     │─────────────────────────>│────│────│
     │                           │    │    │
Acceptor behavior on receiving Prepare(n):
def on_prepare(n):
    # An acceptor is like a voter who can only pledge allegiance to
    # one candidate at a time. A higher proposal number is a more
    # "persuasive" candidate -- the voter switches loyalty.
    if n > promised_n:
        promised_n = n  # Promise not to accept any proposal numbered lower than n
        return Promise(n, accepted_n, accepted_v)  # Also report any value already accepted
    else:
        return Reject()  # "Sorry, I already promised to support a higher-numbered proposal"

Paxos with Failures

This is where Paxos shines - handling failures:
┌─────────────────────────────────────────────────────────────────────────────┐
│                 PAXOS HANDLING COMPETING PROPOSERS                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Scenario: P1 and P2 both try to propose                                   │
│                                                                              │
│  Time    P1           Acceptors        P2                                  │
│  ────    ──           ─────────        ──                                  │
│   t1     Prepare(1) ────> A1,A2,A3                                          │
│   t2                 <──── Promise(1,∅)                                     │
│   t3                                   Prepare(2) ────> A1,A2,A3           │
│   t4                                   <──── Promise(2,∅)                   │
│   t5     Accept(1,"X") ──> A1,A2,A3                                         │
│   t6                 ──── REJECT! ──── (promised n=2 already)              │
│   t7                                   Accept(2,"Y") ──> A1,A2,A3          │
│   t8                                   <──── Accepted(2,"Y")               │
│                                                                              │
│  Result: "Y" is chosen (P2 won with higher proposal number)                │
│                                                                              │
│  P1 must retry with n > 2                                                  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Why Paxos is Safe

THEOREM: If a value v is chosen, any future proposal will have value v.

PROOF INTUITION:
─────────────────
1. v is chosen → accepted by majority M1
2. Future proposal n' > n must get promises from majority M2
3. M1 ∩ M2 ≠ ∅ (majorities overlap)
4. At least one acceptor in M2 already accepted v
5. That acceptor tells proposer about v in Promise
6. Proposer MUST use v (highest accepted value)
7. Therefore, v is preserved

This is why proposal numbers and the "use highest accepted value" rule exist!

Multi-Paxos

Basic Paxos decides ONE value. Multi-Paxos decides a SEQUENCE:
┌─────────────────────────────────────────────────────────────────────────────┐
│                          MULTI-PAXOS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  OPTIMIZATION: Elect a stable leader, skip Phase 1                          │
│                                                                              │
│  Log Index:    1        2        3        4        5                        │
│               ┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐                   │
│  Log:         │ v1 │   │ v2 │   │ v3 │   │ v4 │   │ v5 │                   │
│               └────┘   └────┘   └────┘   └────┘   └────┘                   │
│                 │        │        │        │        │                       │
│              Decided  Decided  Decided  Pending  Pending                    │
│                                                                              │
│  LEADER OPTIMIZATION:                                                       │
│  ────────────────────                                                       │
│  1. Leader runs Phase 1 ONCE for all future slots                          │
│  2. Each new command: just Phase 2 (Accept/Accepted)                        │
│  3. Much faster! (2 message delays instead of 4)                            │
│                                                                              │
│  IF LEADER FAILS:                                                           │
│  ────────────────                                                           │
│  New leader runs Phase 1 for any undecided slots                            │
│  May discover in-progress proposals and complete them                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 8: Raft Consensus (Deep Dive)

Most Asked in Interviews: Raft is designed to be understandable. You MUST be able to explain it clearly. The original Raft paper by Ongaro and Ousterhout was explicitly motivated by the observation that students and practitioners struggled to implement Paxos correctly. Raft achieves the same safety guarantees as Multi-Paxos but decomposes the problem into three relatively independent sub-problems: leader election, log replication, and safety. This decomposition is what makes it teachable and implementable.
For a full, implementation-focused treatment with invariants, failure scenarios, and a build-from-scratch roadmap, see the dedicated Raft Deep Dive.

Why Raft?

┌─────────────────────────────────────────────────────────────────────────────┐
│                    RAFT vs PAXOS                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PAXOS PROBLEMS:                       RAFT SOLUTIONS:                      │
│  ───────────────                       ───────────────                      │
│  - Hard to understand                  - Designed for understandability     │
│  - Many variants, unclear which to use - One clear algorithm                │
│  - Difficult to implement correctly    - Clearer implementation path        │
│  - Symmetric roles                     - Strong leader model                │
│                                                                              │
│  RAFT KEY DECISIONS:                                                        │
│  ───────────────────                                                        │
│  1. Strong leader: Only leader handles client requests                      │
│  2. Leader election: Randomized timeouts for faster convergence             │
│  3. Log matching: If entries match at index, all prior entries match        │
│  4. Leader completeness: Elected leader has all committed entries           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Raft State Machine

Raft Consensus State Machine

Raft Terms

┌─────────────────────────────────────────────────────────────────────────────┐
│                           RAFT TERMS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Time is divided into TERMS of arbitrary length:                            │
│                                                                              │
│   Term 1         Term 2      Term 3       Term 4        Term 5             │
│  ┌──────────────┬───────────┬─────────────────────────┬────────────────┐    │
│  │  Election │  │ Election  │   Election │            │   Election │   │    │
│  │     +     │  │    +      │      +     │   Normal   │      +     │   │    │
│  │  Normal   │  │  Normal   │   Normal   │ Operation  │   Normal   │   │    │
│  │ Operation │  │ Operation │  Operation │  (leader)  │  Operation │   │    │
│  │ (leader)  │  │ (failed)  │  (leader)  │            │  (leader)  │   │    │
│  └──────────────┴───────────┴─────────────────────────┴────────────────┘    │
│                                                                              │
│  TERM RULES:                                                                │
│  ───────────                                                                │
│  • Each term starts with election                                           │
│  • At most ONE leader per term                                              │
│  • Higher term always wins                                                  │
│  • If node sees higher term → immediately becomes follower                  │
│  • Terms act as logical clock                                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Leader Election

FOLLOWER TIMEOUT:
─────────────────

Follower                   Leader
   │                         │
   │ <── Heartbeat ─────────│  (every 150ms)
   │                         │
   │     (waiting...)       │
   │                         │
   │     (timeout: 300ms)    X  (leader dies)

   │  No heartbeat received!
   │  Convert to CANDIDATE
   │  Start election


TIMEOUT: Random between 150-300ms
(Randomness prevents split votes)

Log Replication

The shared notebook analogy: Log replication is like a team that must keep identical lab notebooks. The leader writes an entry in their notebook and sends a copy to every team member. Each team member writes the entry in their own notebook, but only after checking that their previous entries match the leader’s — if there is a discrepancy, the leader walks them back until they find where their notebooks agree, then replays everything from that point forward. An entry is “committed” (permanent) only when a majority of team members have it in their notebooks. This ensures that even if some notebooks are destroyed, the committed entries survive. The core of Raft - replicating commands across nodes:
┌─────────────────────────────────────────────────────────────────────────────┐
│                        RAFT LOG STRUCTURE                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Leader's Log:                                                              │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐                          │
│  │ 1,1 │ 1,1 │ 1,2 │ 2,3 │ 2,3 │ 3,3 │ 3,3 │ 3,3 │                          │
│  │ x←1 │ y←2 │ x←3 │ y←4 │ z←5 │ x←6 │ y←7 │ z←8 │                          │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘                          │
│    1     2     3     4     5     6     7     8    ← Log Index               │
│                                                                              │
│  Entry format: (index, term) command                                        │
│                                                                              │
│  COMMIT INDEX:                                                              │
│  ─────────────                                                              │
│  An entry is COMMITTED when replicated to majority                          │
│                                                                              │
│  Leader:      commitIndex = 6 (entries 1-6 are committed)                   │
│  Followers:   Apply entries up to commitIndex                               │
│                                                                              │
│  SAFETY: Once committed, entry is NEVER overwritten                         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

AppendEntries RPC

# Leader sends to followers
AppendEntries {
    term: int,              # Leader's term
    leaderId: int,          # So follower can redirect clients
    prevLogIndex: int,      # Index of entry before new ones
    prevLogTerm: int,       # Term of prevLogIndex entry
    entries: List[Entry],   # Log entries to store (empty for heartbeat)
    leaderCommit: int       # Leader's commitIndex
}

# Follower's implementation
def handle_append_entries(request):
    # STEP 1: Term check -- a leader with a stale term has been superseded.
    # This is the distributed equivalent of "your badge is expired."
    if request.term < currentTerm:
        return AppendResponse(term=currentTerm, success=False)
    
    # Any valid AppendEntries proves the leader is alive --
    # reset the election timer so we don't start a needless election.
    reset_election_timer()
    
    # STEP 2: If the leader's term is newer, we must have missed an
    # election. Update our term and revert to follower state.
    if request.term > currentTerm:
        currentTerm = request.term
        convert_to_follower()
    
    # STEP 3: LOG MATCHING -- the critical consistency check.
    # The leader says "my entry just before the new ones is at index X,
    # term Y." If we don't have that same entry, our logs have diverged
    # and the leader must back up and try an earlier prevLogIndex.
    if not log_contains(request.prevLogIndex, request.prevLogTerm):
        return AppendResponse(term=currentTerm, success=False)
    
    # STEP 4: APPEND NEW ENTRIES
    # If our log has an entry at the same index but a different term,
    # that entry came from a deposed leader -- delete it and everything
    # after it. This is safe because uncommitted entries are expendable.
    for i, entry in enumerate(request.entries):
        index = request.prevLogIndex + 1 + i
        if log[index].term != entry.term:
            log = log[:index]  # Truncate divergent suffix
        if index > len(log):
            log.append(entry)
    
    # STEP 5: Advance our commit index to match the leader's.
    # We can only commit up to the end of our own log (we might be
    # behind the leader if some AppendEntries were lost in transit).
    if request.leaderCommit > commitIndex:
        commitIndex = min(request.leaderCommit, len(log))
        apply_committed_entries()  # Apply to state machine in order
    
    return AppendResponse(term=currentTerm, success=True)

Log Matching Property

┌─────────────────────────────────────────────────────────────────────────────┐
│                     LOG MATCHING PROPERTY                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  IF two entries in different logs have same index AND term:                 │
│  ├── They store the same command                                            │
│  └── All preceding entries are identical                                    │
│                                                                              │
│  HOW IT'S MAINTAINED:                                                       │
│  ────────────────────                                                       │
│  1. Leader creates entries with current term                                │
│  2. AppendEntries includes prevLogIndex and prevLogTerm                     │
│  3. Follower rejects if previous entry doesn't match                        │
│  4. Leader backs up and retries until match found                           │
│                                                                              │
│  EXAMPLE - Repairing Inconsistent Log:                                      │
│                                                                              │
│  Leader:   [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]                             │
│  Follower: [1,1] [1,2] [2,3] [2,4]                    (behind)              │
│                                                                              │
│  AppendEntries(prevLogIndex=4, prevLogTerm=2, entries=[3,5][3,6])          │
│  Follower: Entry at index 4, term 2 exists? ✓                               │
│  Follower: Append entries 5 and 6                                           │
│  Result:   [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]        (matches!)            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Safety: Leader Completeness

Critical Property: If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher terms.
HOW RAFT ENSURES THIS:
──────────────────────

1. ELECTION RESTRICTION:
   Voter denies vote if candidate's log is less up-to-date
   
   "Up-to-date" comparison:
   - Compare last entry's term (higher is better)
   - If same term, compare last entry's index (longer is better)

2. LEADER ONLY COMMITS ENTRIES FROM CURRENT TERM:
   Leader doesn't count old-term entries as committed
   until a new entry from current term is replicated

   Why? Prevents overwriting committed entries
   
   EXAMPLE (The Figure 8 scenario):
   ─────────────────────────────────
   
   Term 2: Leader S1 replicates entry at index 2 to S2
           Crashes before committing
   
   Term 3: S5 becomes leader (didn't have index 2)
           Appends entry at index 2 with term 3
   
   Term 4: S1 comes back, becomes leader
           CAN'T just count term-2 entry as committed!
           Must replicate a term-4 entry first
           
   This prevents the term-2 entry from being overwritten
   by the term-3 entry that S5 had

Handling Cluster Membership Changes

┌─────────────────────────────────────────────────────────────────────────────┐
│                    JOINT CONSENSUS                                           │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PROBLEM: Adding/removing nodes during operation                            │
│                                                                              │
│  DANGER: If we switch configs directly:                                     │
│                                                                              │
│  Old: {A, B, C}        New: {A, B, C, D, E}                                 │
│                                                                              │
│  During transition:                                                         │
│  - {A, B} thinks old config, majority = 2                                   │
│  - {C, D, E} thinks new config, majority = 3                                │
│  - BOTH could be majorities! Two leaders possible!                          │
│                                                                              │
│  SOLUTION: JOINT CONSENSUS                                                  │
│  ─────────────────────────                                                  │
│                                                                              │
│  Phase 1: C_old,new (joint configuration)                                   │
│           Decisions need majorities in BOTH old AND new                     │
│                                                                              │
│  Phase 2: C_new (new configuration)                                         │
│           Only after C_old,new is committed                                 │
│                                                                              │
│  Timeline:                                                                  │
│  ─────────                                                                  │
│        │              │                │                                    │
│   C_old│ C_old,new    │   C_old,new    │  C_new                             │
│        │ entry        │   committed    │  entry                             │
│        │ created      │                │  created                           │
│        │              │                │                                    │
│        ▼              ▼                ▼                                    │
│  ──────┬──────────────┬────────────────┬────────────────►                   │
│                                                                              │
│  SIMPLER ALTERNATIVE: Single-server changes (one at a time)                 │
│  ─────────────────────                                                      │
│  Many implementations use this simpler approach                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Distributed pitfall: Membership changes are the most bug-prone part of any consensus implementation. The Raft paper’s joint consensus protocol is correct but complex. In practice, most production systems (etcd, Consul) use single-server changes — adding or removing one node at a time — because this guarantees that old and new majorities always overlap without needing a joint configuration. The rule is simple: never change more than one node between committed configuration entries. If you need to go from 3 nodes to 5, add one node, wait for it to catch up, then add the second.

Log Compaction (Snapshots)

┌─────────────────────────────────────────────────────────────────────────────┐
│                        SNAPSHOTS                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PROBLEM: Log grows forever                                                 │
│                                                                              │
│  SOLUTION: Periodically take snapshots                                      │
│                                                                              │
│  Before snapshot:                                                           │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐                          │
│  │ x←1 │ y←2 │ x←3 │ y←4 │ x←5 │ y←6 │ x←7 │ y←8 │                          │
│  └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘                          │
│    1     2     3     4     5     6     7     8                              │
│                              ▲                                              │
│                         committed                                           │
│                                                                              │
│  After snapshot (at index 5):                                               │
│  ┌────────────────────┬─────┬─────┬─────┐                                   │
│  │     SNAPSHOT       │ y←6 │ x←7 │ y←8 │                                   │
│  │ lastIndex=5        │     │     │     │                                   │
│  │ lastTerm=2         │     │     │     │                                   │
│  │ state={x:5, y:4}   │     │     │     │                                   │
│  └────────────────────┴─────┴─────┴─────┘                                   │
│                          6     7     8                                      │
│                                                                              │
│  INSTALLSNAPSHOT RPC:                                                       │
│  Used when follower is too far behind                                       │
│  Leader sends its snapshot instead of replaying entries                     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 9: Viewstamped Replication

An alternative consensus protocol that predates Paxos:
┌─────────────────────────────────────────────────────────────────────────────┐
│                 VIEWSTAMPED REPLICATION                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SIMILAR TO RAFT:                                                           │
│  ────────────────                                                           │
│  • Strong leader (called "primary")                                         │
│  • Log replication                                                          │
│  • View changes (like term changes)                                         │
│                                                                              │
│  KEY DIFFERENCES:                                                           │
│  ────────────────                                                           │
│  • View-change is more complex                                              │
│  • Different recovery protocol                                              │
│  • Historically important (1988, before Paxos publication)                  │
│                                                                              │
│  VIEW CHANGE PROTOCOL:                                                      │
│  ─────────────────────                                                      │
│  1. New primary collects state from f+1 replicas                            │
│  2. Computes new state (union of all logs)                                  │
│  3. Sends StartView to all replicas                                         │
│  4. Replicas update and respond                                             │
│                                                                              │
│  WHEN TO USE VR vs RAFT:                                                    │
│  ────────────────────────                                                   │
│  Usually prefer Raft (simpler, better understood)                           │
│  VR useful for academic understanding                                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Module 10: ZAB (Zookeeper Atomic Broadcast)

ZAB Protocol

┌─────────────────────────────────────────────────────────────────────────────┐
│                              ZAB                                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Designed specifically for Zookeeper's requirements:                        │
│  • Primary order: All updates from a leader in order                        │
│  • Prefix property: If transaction T is delivered, all before T too         │
│                                                                              │
│  PHASES:                                                                    │
│  ───────                                                                    │
│                                                                              │
│  1. LEADER ELECTION                                                         │
│     └── Fast leader election using voting                                   │
│     └── Leader chosen based on highest zxid (transaction id)                │
│                                                                              │
│  2. DISCOVERY                                                               │
│     └── Leader learns about all accepted transactions                       │
│     └── Collects epoch information from followers                           │
│                                                                              │
│  3. SYNCHRONIZATION                                                         │
│     └── Leader syncs followers to its state                                 │
│     └── Handles DIFF, TRUNC, or SNAP updates                                │
│                                                                              │
│  4. BROADCAST                                                               │
│     └── Normal operation - leader broadcasts updates                        │
│     └── Two-phase commit: PROPOSAL → ACK → COMMIT                           │
│                                                                              │
│  ZXID (Transaction ID):                                                     │
│  ──────────────────────                                                     │
│  64-bit: [epoch (32 bits)][counter (32 bits)]                               │
│                                                                              │
│  Epoch increases with each new leader                                       │
│  Counter increases with each transaction                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Zookeeper Guarantees

GUARANTEES:
───────────
1. Linearizable Writes: All writes go through leader, ordered globally
2. FIFO Client Order: Client's operations processed in order
3. Wait-free Reads: Reads can go to any server (possibly stale)
4. Watch Notifications: Clients notified of changes

USE CASES:
──────────
• Distributed locking
• Leader election
• Configuration management
• Service discovery
• Cluster membership

Comparison: Paxos vs Raft vs ZAB

┌─────────────────┬──────────────┬──────────────┬──────────────┐
│                 │    PAXOS     │    RAFT      │     ZAB      │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Leader          │ Optional     │ Required     │ Required     │
│ Elections       │ Implicit     │ Explicit     │ Explicit     │
│ Log Gaps        │ Allowed      │ Not allowed  │ Not allowed  │
│ Understandable  │ Harder       │ Easier       │ Medium       │
│ Implementations │ Many         │ Many         │ Zookeeper    │
│ Year            │ 1989         │ 2014         │ 2011         │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Used By         │ Chubby       │ etcd         │ Zookeeper    │
│                 │ Spanner      │ Consul       │ Kafka        │
│                 │ CockroachDB  │ TiKV         │              │
└─────────────────┴──────────────┴──────────────┴──────────────┘

Advanced: EPaxos (Egalitarian Paxos)

Staff+ Level: EPaxos is an advanced topic that shows deep understanding of consensus trade-offs. It’s less commonly asked but demonstrates expertise.
┌─────────────────────────────────────────────────────────────────────────────┐
│                            EPAXOS                                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  THE PROBLEM WITH LEADER-BASED CONSENSUS:                                   │
│  ─────────────────────────────────────────                                  │
│  • All requests go through leader (bottleneck)                              │
│  • Leader failure triggers expensive election                               │
│  • Cross-datacenter: high latency to leader                                 │
│                                                                              │
│  EPAXOS KEY INSIGHT:                                                        │
│  ───────────────────                                                        │
│  "If commands don't conflict, they can commit independently"                │
│                                                                              │
│  LEADERLESS CONSENSUS:                                                      │
│  ─────────────────────                                                      │
│  • ANY replica can be "command leader" for any command                      │
│  • Conflicting commands require extra round                                 │
│  • Non-conflicting commands: 1 round-trip (fast path)                       │
│                                                                              │
│  CONFLICT DETECTION:                                                        │
│  ───────────────────                                                        │
│  Commands conflict if they access same keys with at least one write         │
│                                                                              │
│  get(x), get(x)    → NO conflict (both reads)                              │
│  get(x), put(x,1)  → CONFLICT (one is write)                               │
│  put(x,1), put(y,2) → NO conflict (different keys)                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

EPaxos Protocol Flow

FAST PATH (no conflicts):
────────────────────────

Client ──> R1 (command leader for cmd A)

           │ PreAccept(A)
           ├──────────────────> R2  ─┐
           ├──────────────────> R3  ─┼─ All reply "OK, no conflicts"
           ├──────────────────> R4  ─┤
           ├──────────────────> R5  ─┘

           │ <── PreAcceptOK ────────
           │ (from fast quorum: ⌊3N/4⌋)

           │ Commit(A)
           ├──────────────────> All replicas

Result: 2 message delays (same as Multi-Paxos with stable leader!)


SLOW PATH (conflicts detected):
───────────────────────────────

R1: PreAccept(A) includes deps = {}
R2: Already has B that conflicts with A
R2: PreAcceptOK with deps = {B}

R1: Sees different deps from different replicas
R1: Must run Paxos Accept phase to agree on deps

Additional round:
R1: Accept(A, deps={B})
All: AcceptOK

Result: 4 message delays (slower, but still correct)

When to Consider EPaxos

┌─────────────────────────────────────────────────────────────────────────────┐
│                    EPAXOS TRADE-OFFS                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ADVANTAGES:                                                                │
│  ───────────                                                                │
│  ✓ No leader bottleneck                                                    │
│  ✓ Lower latency for clients near any replica                              │
│  ✓ No leader election overhead                                             │
│  ✓ Better for geo-distributed deployments                                  │
│  ✓ Fast path = same latency as optimized Paxos                             │
│                                                                              │
│  DISADVANTAGES:                                                             │
│  ─────────────                                                              │
│  ✗ More complex to implement correctly                                     │
│  ✗ Slower when there are many conflicts                                    │
│  ✗ More complex recovery                                                   │
│  ✗ Fewer battle-tested implementations                                     │
│                                                                              │
│  USE WHEN:                                                                  │
│  ─────────                                                                  │
│  • Multi-datacenter deployment                                              │
│  • Low conflict workload (different clients access different data)          │
│  • Read-heavy or partition-friendly workload                                │
│  • Need to eliminate leader as single point of latency                      │
│                                                                              │
│  DON'T USE WHEN:                                                            │
│  ───────────────                                                            │
│  • High conflict rate (hot keys)                                            │
│  • Single datacenter (Raft is simpler)                                      │
│  • Team doesn't have consensus expertise                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Advanced: Pipelining & Batching Optimizations

Production consensus systems use several optimizations to achieve high throughput despite the fundamental requirement of disk I/O and network round-trips.

1. Request Pipelining

Instead of waiting for each request to commit before starting the next, leaders can process multiple requests concurrently.
SEQUENTIAL (Slow):
┌──────────────────────────────────────────────────────────────────┐
│  Req1: [Append] → [Replicate] → [Commit] → [Respond]             │
│  Req2:                                      [Append] → ...       │
│  Time: ═══════════════════════════════════════════════════════►  │
└──────────────────────────────────────────────────────────────────┘

PIPELINED (Fast):
┌──────────────────────────────────────────────────────────────────┐
│  Req1: [Append] → [Replicate] → [Commit] → [Respond]             │
│  Req2:    [Append] → [Replicate] → [Commit] → [Respond]          │
│  Req3:       [Append] → [Replicate] → [Commit] → [Respond]       │
│  Time: ══════════════════════►                                   │
│        (3x throughput!)                                          │
└──────────────────────────────────────────────────────────────────┘
Key Insight: Requests at different pipeline stages use different resources (CPU for append, network for replicate, disk for commit). Pipelining maximizes resource utilization.

2. Batching Log Entries

Instead of one network round-trip per entry, batch multiple entries into a single AppendEntries RPC.
# Without Batching: 1000 entries = 1000 RPCs
for entry in entries:
    send_append_entries([entry])  # Slow!

# With Batching: 1000 entries = 10 RPCs (batch size 100)
for batch in chunks(entries, size=100):
    send_append_entries(batch)    # 100x fewer RPCs!
Batching Trade-offs:
ParameterSmall BatchLarge Batch
LatencyLower (per-request)Higher (wait for batch)
ThroughputLowerHigher
MemoryLowerHigher
Adaptive Batching: Modern systems (TiKV, etcd) use adaptive batching:
  • Low load: Send immediately (minimize latency)
  • High load: Batch aggressively (maximize throughput)

3. Parallel Disk I/O

The leader can write to its own log while simultaneously sending AppendEntries to followers.
SEQUENTIAL:
Leader:   [Write to Disk] ──────────► [Send to Followers]

Followers:                            [Receive] → [Write to Disk]

PARALLEL:
Leader:   [Write to Disk] ─────┐
                               ├─► [Wait for both]
          [Send to Followers] ─┘

Followers: [Receive] → [Write to Disk]

Savings: Hides leader's disk latency behind network latency

4. Group Commit (fsync Batching)

fsync() is expensive (~10ms on HDD, ~0.1ms on NVMe). Instead of fsync per entry, batch multiple entries into one fsync.
class GroupCommitter:
    def __init__(self, max_wait_ms=1, max_batch=100):
        self.pending = []
        self.lock = Lock()
    
    def append_and_sync(self, entry):
        with self.lock:
            self.pending.append(entry)
            if len(self.pending) >= max_batch:
                self._flush()
        
        # Or wait for timer-based flush
        wait_for_flush_complete()
    
    def _flush(self):
        # Write all pending entries
        for entry in self.pending:
            log_file.write(entry)
        
        # Single fsync for all entries!
        log_file.fsync()  # One fsync instead of N
        
        # Notify all waiters
        notify_all_complete()
Result: 1000 entries with 1 fsync instead of 1000 fsyncs = ~1000x throughput improvement for disk-bound workloads.

5. Flow Control & Backpressure

Prevent slow followers from causing unbounded memory growth on the leader.
┌─────────────────────────────────────────────────────────────────┐
│  FLOW CONTROL STRATEGY                                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Leader tracks per-follower state:                              │
│  • nextIndex: Next entry to send                                │
│  • matchIndex: Highest replicated entry                         │
│  • inflight: Entries sent but not ACKed                         │
│                                                                  │
│  Flow Control Rules:                                            │
│  IF inflight > MAX_INFLIGHT:                                    │
│      Pause sending to this follower                             │
│      (prevents OOM on leader)                                   │
│                                                                  │
│  IF matchIndex too far behind:                                  │
│      Send snapshot instead of entries                           │
│      (prevents unbounded log growth)                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Performance Numbers (etcd Benchmarks)

OptimizationThroughputLatency (p99)
Baseline10K ops/s50ms
+ Batching50K ops/s20ms
+ Pipelining80K ops/s15ms
+ Group Commit100K ops/s10ms
+ Parallel I/O120K ops/s8ms
Staff Tip: In interviews, mention that “naive Raft” achieves ~1000 ops/s, but production systems like etcd/TiKV achieve 100K+ ops/s through these optimizations. This shows you understand the gap between textbook algorithms and production systems.

Practical Implementation Guide

Building a Production Consensus System

1

Start with Raft (Not Paxos)

Raft is designed for understandability. If you’re building consensus from scratch:
# Recommended approach
1. Use an existing library (etcd/raft, hashicorp/raft)
2. If building from scratch, follow MIT 6.824 lab structure
3. Write extensive tests before anything else
2

Test Adversarially

# Critical test scenarios
test_cases = [
    "leader_crashes_during_commit",
    "network_partition_minority_vs_majority",
    "split_brain_scenario",
    "slow_follower_catches_up",
    "all_nodes_restart_simultaneously",
    "leader_isolated_then_rejoins",
    "log_divergence_after_multiple_elections",
    "snapshot_during_compaction",
]

# Use chaos testing
chaos_scenarios = [
    "random_network_delays(10-500ms)",
    "random_message_drops(5%)",
    "random_node_restarts",
    "clock_skew_between_nodes",
]
3

Performance Tuning

# Key parameters to tune
raft_config:
  heartbeat_interval: 150ms      # Leader sends heartbeat
  election_timeout_min: 300ms    # Min timeout for election
  election_timeout_max: 500ms    # Max timeout (randomized)
  max_entries_per_append: 100    # Batch size for efficiency
  snapshot_threshold: 10000      # Entries before snapshot
  
# Trade-offs:
# - Lower heartbeat = faster failure detection, more overhead
# - Higher election timeout = fewer spurious elections, slower recovery
# - Larger batch = higher throughput, higher latency per request
4

Production Checklist

□ Persistent storage for term, votedFor, and log
□ Proper fsync on commit (don't lose acknowledged writes)
□ Snapshot mechanism for log compaction
□ Membership change handling (or limit to single-server changes)
□ Pre-vote optimization (prevents disruption from stale candidates)
□ Leader lease for efficient reads
□ Metrics: election count, replication lag, commit latency
□ Alerting on: leader changes, replication lag, log size

Common Implementation Bugs

These bugs have caused real production outages. Watch out for them!
# BUG 1: Not persisting state before responding
def on_request_vote(request):
    if should_grant_vote(request):
        voted_for = request.candidate_id
        # BUG: Must persist BEFORE responding!
        # persist_state()  # <-- MISSING
        return VoteGranted()
    # If crash happens, you might vote twice!

# BUG 2: Incorrect leader lease
def handle_read(request):
    if self.is_leader():
        # BUG: Might not be leader anymore!
        # Must check lease or use read index
        return self.state_machine.read(request.key)
    
# BUG 3: Commit index going backwards
def apply_committed_entries():
    # BUG: Must ensure commit_index only increases
    # Snapshot installation can reset state machine
    # But commit_index should reflect reality

# BUG 4: Not handling log prefix during append
def append_entries(request):
    for entry in request.entries:
        # BUG: Must delete conflicting entries!
        # If local log has different term at same index,
        # delete it and all following entries

Module 11: Quorum Loss & Disaster Recovery

A critical “Principal Level” operational reality: What happens when you lose a majority of your nodes forever? (e.g., a region outage or physical data corruption).

The Quorum Loss Problem

Consensus protocols (Paxos/Raft) are designed to block when a majority is lost. This is Safe but leads to Permanent Unavailability.
  • In a 3-node cluster, if 2 nodes are destroyed, the remaining node cannot reach consensus.
  • You cannot simply “add new nodes” because adding nodes requires a consensus vote, which you can’t get.

Recovery Strategies

1. Forced Reconfiguration (The “God Mode” fix)

Most production implementations (etcd, Consul, Zookeeper) provide a tool to manually override the configuration.
  • Process:
    1. Stop the remaining node.
    2. Use a tool to rewrite the conf file or metadata on disk to say: “The cluster now only consists of ME.”
    3. Restart the node. It now has a majority (1/1) and can process writes.
    4. Gradually add new nodes to rebuild the cluster to N=3N=3 or N=5N=5.

2. Seed-based Recovery

In systems like Cassandra or FoundationDB, you can bootstrap a new cluster from a snapshot of the surviving node’s data and “re-seed” the consensus state.

3. The “Stale Read” Escape Hatch

If you only need to recover data (not resume writes), you can often perform Stale Reads from the remaining nodes by bypassing the consensus layer and reading directly from the state machine (SSTables/RocksDB). Staff Tip: Disaster recovery is the only time you should ever touch the internal state of a consensus engine. Mention the “etcd disaster recovery” guide in interviews to show you understand that safety isn’t just about code, but about operational survival.

Key Interview Questions

Answer structure:
  1. Follower times out (no heartbeat from leader)
  2. Converts to candidate, increments term
  3. Votes for itself, sends RequestVote to all
  4. Other nodes vote (if haven’t voted, log is up-to-date)
  5. If majority votes → becomes leader, sends heartbeats
  6. If receives AppendEntries from valid leader → becomes follower
  7. If timeout again → new election with higher term
Key points to mention:
  • Randomized timeouts prevent split votes
  • Log up-to-date check ensures safety
  • At most one leader per term
Answer: Scenario: Leader has replicated entry to 2/5 nodes and crashes
  1. Followers detect leader failure (timeout)
  2. New election starts
  3. Only nodes with the entry CAN become leader (election restriction)
  4. New leader has the entry, completes replication
  5. Entry gets committed when new-term entry is committed
Key insight: Raft never loses committed entries. An entry replicated to majority WILL be in new leader’s log.
Answer:
  1. Each proposer picks unique proposal number (higher wins)
  2. Acceptor promises to reject lower numbers
  3. If proposer’s Prepare rejected → retry with higher number
  4. In Accept phase, use highest accepted value from promises
  5. System makes progress when one proposer “wins”
Key insight: The “use highest accepted value” rule preserves safety even with competing proposers.
Answer: FLP Impossibility proves this:
  1. Can’t distinguish slow from dead
  2. Adversarial schedule can always prevent decision
  3. Must choose: safety OR guaranteed termination
Practical workaround: Assume partial synchrony (eventually messages arrive). Raft/Paxos do this with timeouts.
Answer:
1. Lock request → propose "lock(client_id, lock_name)"
2. Use consensus to order all lock requests
3. Apply in order: first requester gets lock
4. Store lock state in replicated state machine
5. Unlock: consensus on "unlock(client_id, lock_name)"
6. Add lease/TTL to handle client failures

Implementation options:
- Zookeeper ephemeral nodes
- etcd lease-based locks
- Consul sessions

Hands-On Project: Implement Raft

Raft Implementation Lab

Build a complete Raft implementation:Phase 1: Leader Election
  • Implement RequestVote RPC
  • Handle election timeouts
  • State transitions (follower → candidate → leader)
Phase 2: Log Replication
  • Implement AppendEntries RPC
  • Log matching checks
  • Commit index tracking
Phase 3: Persistence
  • Persist term, votedFor, log
  • Recovery after restart
Phase 4: Snapshots
  • Implement log compaction
  • InstallSnapshot RPC
Recommended Resources:
  • MIT 6.824 Labs
  • Raft visualization: raft.github.io
  • etcd/raft source code

Next Steps

Continue to Track 3: Replication Strategies

Learn single-leader, multi-leader, and leaderless replication patterns

Interview Deep-Dive

Strong Answer:
  • Assume a 5-node cluster (A, B, C, D, E) where A is the current leader. The partition splits into (minority) and (majority).
  • On the minority side: A continues sending heartbeats to B, and B responds. A also tries to send heartbeats to C, D, E but gets no response. If A tries to commit a new write, it needs 3 acknowledgments (majority of 5). It can only get 2 (A itself and B), so A cannot commit any new entries. A remains the leader of term T on its side, but it is operationally frozen — it can accept client requests but cannot commit them. Eventually, depending on the implementation, A may step down or reject client writes directly.
  • On the majority side: C, D, and E stop receiving heartbeats from A. After a randomized election timeout (e.g., 150-300ms), one of them — say C — times out first, increments the term to T+1, votes for itself, and sends RequestVote to D and E. Both grant votes (since they have not voted in term T+1 and C’s log is at least as up-to-date). C becomes the leader of term T+1 with a majority quorum of . This side can now accept and commit new writes.
  • When the partition heals: A sends a heartbeat to C with term T. C responds with term T+1. A sees a higher term, immediately steps down to follower, and adopts term T+1. Any uncommitted entries in A’s log (entries accepted during the partition but not committed) may be overwritten by C’s log during the next AppendEntries exchange. This is safe because those entries were never committed — no client received a success response for them.
Follow-up: What if the old leader A managed to replicate an entry to B during the partition but could not commit it. Is that entry lost after partition healing?Yes, that entry can be lost, and this is by design. Raft’s safety guarantee applies only to committed entries (replicated to a majority). An entry replicated to only 2 of 5 nodes is not committed. After the partition heals, the new leader C will send AppendEntries to A and B. If C’s log diverges from A’s at the position of the uncommitted entry, C’s log wins. A and B truncate their logs to match C’s. The uncommitted entry is overwritten. The client that submitted that write either received a timeout (if A could not commit) or received no response, so the client knows to retry. This is why Raft requires majority acknowledgment before returning success to the client — it ensures that any committed entry survives any subsequent leader change.
Strong Answer:
  • Raft and Multi-Paxos solve the same fundamental problem — replicated state machines — but with different design philosophies. Raft prioritizes understandability by decomposing the problem into leader election, log replication, and safety as separate subproblems. Multi-Paxos prioritizes flexibility by defining consensus on individual log slots independently.
  • The key structural difference: Raft requires a strong leader. All writes go through the leader, and the leader’s log is always the authoritative truth. Multi-Paxos can operate with a “distinguished proposer” (de facto leader) for performance, but any node can propose for any log slot. This means Multi-Paxos can continue making progress even if the leader is temporarily slow (another node proposes for that slot), while Raft must wait for a leader election.
  • I would choose Multi-Paxos (or variants like EPaxos) in geo-distributed deployments where leader-based protocols create latency hotspots. If you have 5 data centers across continents, Raft forces all writes through one leader, adding cross-continent latency. EPaxos (an extension of Paxos) allows any node to propose, and non-conflicting commands can be committed in a single round trip to the nearest quorum. This can cut write latency in half for geo-distributed workloads.
  • I would choose Raft for everything else. The implementation is more straightforward, the debugging is simpler (one leader means one source of truth), and the ecosystem is mature (etcd, Consul, TiKV, CockroachDB all use Raft).
Follow-up: What is the “dueling proposers” problem in Paxos and how does Multi-Paxos solve it?In basic Paxos, any node can propose at any time. If two nodes simultaneously propose for the same slot with different values, they each execute Phase 1 (Prepare) with increasing proposal numbers. Node A sends Prepare(1), Node B sends Prepare(2). B’s higher number causes acceptors to reject A’s subsequent Accept(1). A retries with Prepare(3), which preempts B. This can livelock: neither proposer makes progress because each keeps preempting the other. Multi-Paxos solves this by electing a stable leader (distinguished proposer) who handles all proposals. Once elected, the leader skips Phase 1 entirely for subsequent slots — it only needs Phase 2 (Accept), cutting the message count in half. The leader acts as a “gatekeeper” that serializes proposals, eliminating the dueling problem. If the leader fails, any node can start Phase 1 for the next slot, triggering a de facto leader change. This is why Multi-Paxos in practice looks very similar to Raft — both have leaders, both skip the first round after election.
Strong Answer:
  • You can absolutely have an even number of nodes, but it is almost never a good idea. Consider a 4-node Raft cluster: the majority quorum is 3 (more than half of 4). You can tolerate exactly 1 failure. Now consider a 3-node cluster: the majority quorum is 2, and you can also tolerate exactly 1 failure. So going from 3 to 4 nodes gives you zero additional fault tolerance while increasing the number of nodes that must acknowledge each write (from 2 to 3), which increases latency.
  • This is why consensus clusters almost always use odd numbers: 3 (tolerates 1 failure), 5 (tolerates 2), 7 (tolerates 3). Each additional pair of nodes increases fault tolerance by exactly one.
  • There is an edge case where even numbers appear: Flexible Paxos allows asymmetric quorums where the write quorum and read quorum can differ, as long as their sum exceeds N. In that model, a 4-node cluster with write quorum 3 and read quorum 2 could be useful if reads are much more frequent than writes. But this is an advanced configuration that most systems do not support out of the box.
Follow-up: What about a 2-node consensus cluster? Is that ever valid?A 2-node consensus cluster has a majority quorum of 2, meaning both nodes must agree on every write. This provides zero fault tolerance: if either node goes down, the cluster cannot make progress. It is strictly worse than a single node for availability (you now have two points of failure instead of one) while providing the benefit of data redundancy (if one node’s disk fails, the other has a copy). In practice, 2-node clusters are used as primary-backup pairs with an external arbiter — a lightweight “witness” node that participates only in leader elections and stores no data. etcd supports a “learner” node for this purpose. The witness breaks the tie during elections, giving you effective 3-node quorum behavior with only 2 full data replicas. This is useful when storage is expensive but you still need fault tolerance.
Strong Answer:
  • Safety means “nothing bad ever happens.” For Raft, safety means: (1) at most one leader per term, (2) a committed entry is never lost or overwritten, (3) if two logs contain an entry with the same index and term, they are identical and all preceding entries are identical. Safety properties must hold under all circumstances — crashes, partitions, message delays, any adversarial schedule of events.
  • Liveness means “something good eventually happens.” For Raft, liveness means: the system eventually elects a leader and makes progress (commits entries). Liveness requires timing assumptions — specifically, that the broadcast time is much less than the election timeout, which is much less than the mean time between failures. If the network is completely asynchronous (unbounded delays), Raft’s liveness is not guaranteed (per FLP impossibility).
  • Raft never violates safety. Even under the worst network conditions, two different leaders cannot commit conflicting entries for the same log index. This is ensured by the election restriction (candidates must have an up-to-date log) and the log matching property (AppendEntries checks prevLogIndex/prevLogTerm).
  • Raft can violate liveness. If the election timeout is too short relative to network latency, the cluster can enter a cycle of perpetual elections where no candidate wins a majority before the next timeout fires. This is mitigated by randomized election timeouts, but it is theoretically possible (though statistically improbable) for this to continue indefinitely. In practice, poor timeout tuning is the most common cause of Raft liveness issues.
Follow-up: Describe a concrete production scenario where Raft’s liveness could be compromised.Imagine a 5-node Raft cluster deployed across 3 availability zones, with 2 nodes in AZ-1, 2 in AZ-2, and 1 in AZ-3. The cross-AZ latency is 5ms. The election timeout is set to 150-300ms. Now suppose AZ-3 experiences severe network congestion, increasing its latency to 500ms. The node in AZ-3 keeps timing out and starting elections (incrementing the term), which forces the current leader to step down whenever it receives a RequestVote with a higher term. But the AZ-3 node cannot win the election because its AppendEntries responses are too slow. The result: the disruptive node repeatedly triggers elections, preventing any leader from maintaining stability. Raft addresses this with the “Pre-Vote” extension: before incrementing its term and starting a real election, a candidate sends a PreVote request to check if it would win. If the majority rejects the PreVote (because they already have a functioning leader), the candidate does not proceed, preventing the disruptive election cycle. etcd and TiKV both implement Pre-Vote for this reason.