Consensus is getting multiple nodes to agree on a single value, despite failures.The dinner reservation analogy: Imagine five friends trying to agree on a restaurant via a group chat where messages can be delayed, arrive out of order, or never arrive at all — and some friends might close the app mid-conversation. That is the consensus problem. It sounds trivial until you realize that “just pick the first suggestion” breaks when two people suggest simultaneously, and “wait for everyone to respond” breaks when someone goes offline. Every consensus protocol is fundamentally a clever answer to the question: “How do we make a group decision when we can’t even hold a reliable meeting?”
┌─────────────────────────────────────────────────────────────────────────────┐│ THE CONSENSUS PROBLEM │├─────────────────────────────────────────────────────────────────────────────┤│ ││ SETUP: ││ ├── N nodes in a distributed system ││ ├── Some nodes may fail (crash failures) ││ ├── Network may delay/lose messages ││ └── Goal: Agree on a single value ││ ││ PROPERTIES (must satisfy ALL): ││ ││ 1. AGREEMENT: All non-faulty nodes decide the same value ││ 2. VALIDITY: The decided value was proposed by some node ││ 3. TERMINATION: All non-faulty nodes eventually decide ││ (liveness property) ││ 4. INTEGRITY: Each node decides at most once ││ ││ WHY IT'S HARD: ││ ───────────── ││ Node A proposes "5" │ What if: ││ Node B proposes "7" │ - A crashes after proposing? ││ Node C proposes "3" │ - Message from A to C is lost? ││ │ - B thinks A is dead (just slow)? ││ │└─────────────────────────────────────────────────────────────────────────────┘
The Most Important Result: Consensus is impossible in an asynchronous system with even one faulty process.
┌─────────────────────────────────────────────────────────────────────────────┐│ FLP IMPOSSIBILITY (1985) │├─────────────────────────────────────────────────────────────────────────────┤│ ││ Fischer, Lynch, and Paterson proved: ││ ││ "No deterministic consensus protocol can guarantee termination ││ in an asynchronous system with even one crash failure." ││ ││ IMPLICATIONS: ││ ───────────── ││ You CANNOT have all three: ││ 1. Safety (agreement + validity) ││ 2. Liveness (termination) ││ 3. Asynchrony (no timing assumptions) ││ ││ WHY? (Intuition) ││ ──────────────── ││ - In async systems, you can't distinguish slow from dead ││ - There's always an adversarial schedule that prevents progress ││ - No matter the protocol, an adversary can delay messages ││ to keep the system in an undecided state ││ ││ HOW PRACTICAL SYSTEMS WORK AROUND IT: ││ ───────────────────────────────────── ││ 1. Timeouts (partial synchrony) - Most systems do this ││ 2. Randomization - Some probability of progress ││ 3. Failure detectors - Oracle that hints about failures ││ ││ RAFT/PAXOS assume: "Eventually, the network will be synchronous enough" ││ │└─────────────────────────────────────────────────────────────────────────────┘
SAFETY: "Nothing bad happens"├── Agreement is never violated├── Only proposed values are chosen└── Can be proven mathematicallyLIVENESS: "Something good eventually happens" ├── Decisions are eventually made├── Progress is made└── Can be violated by FLP (temporarily)PRACTICAL APPROACH:├── ALWAYS maintain safety└── EVENTUALLY achieve liveness (when network is stable)
Paxos was invented by Leslie Lamport in 1989 (published in 1998 after being rejected for being “too whimsical” in its original Greek parliament metaphor). Despite being notoriously difficult to understand — Lamport himself said “The Paxos algorithm, when presented in plain English, is very simple” and the community collectively disagreed — it’s the foundation of modern consensus.
Why Paxos is hard to grok: The difficulty is not in any single step but in understanding why each step is necessary. Every rule in Paxos exists to prevent a specific failure scenario. The “promise” mechanism prevents split decisions. The “use highest accepted value” rule prevents overwriting an already-chosen value. If you find yourself confused, ask: “What goes wrong if I skip this step?” The answer is always a concrete safety violation.
┌─────────────────────────────────────────────────────────────────────────────┐│ PAXOS OVERVIEW │├─────────────────────────────────────────────────────────────────────────────┤│ ││ ROLES (a node can play multiple roles): ││ ───── ││ PROPOSER: Proposes values, drives the protocol ││ ACCEPTOR: Votes on proposals, stores accepted values ││ LEARNER: Learns the decided value ││ ││ GOAL: Agree on a single value among N acceptors ││ ││ KEY INSIGHT: ││ ──────────── ││ A value is "chosen" when accepted by a MAJORITY of acceptors ││ Any two majorities overlap, ensuring agreement ││ ││ PHASES: ││ ─────── ││ Phase 1a: PREPARE - Proposer asks acceptors to promise ││ Phase 1b: PROMISE - Acceptors promise (or reject) ││ Phase 2a: ACCEPT - Proposer asks acceptors to accept value ││ Phase 2b: ACCEPTED - Acceptors accept (or reject) ││ │└─────────────────────────────────────────────────────────────────────────────┘
def on_prepare(n): # An acceptor is like a voter who can only pledge allegiance to # one candidate at a time. A higher proposal number is a more # "persuasive" candidate -- the voter switches loyalty. if n > promised_n: promised_n = n # Promise not to accept any proposal numbered lower than n return Promise(n, accepted_n, accepted_v) # Also report any value already accepted else: return Reject() # "Sorry, I already promised to support a higher-numbered proposal"
Acceptors respond with:
Promise to not accept proposals with number < n
The highest-numbered proposal they’ve already accepted (if any)
Proposer P1 Acceptors─────────── ───────── │ [A1] [A2] [A3] │<───Promise(1,∅,∅)─────────│ │ │ │<───Promise(1,∅,∅)──────────────│ │ │<───Promise(1,∅,∅)───────────────────│ │ │ │ │P1 has promises from majority (3/3), can proceed!
def on_accept(n, v): # The acceptor checks: "Is this proposal at least as recent as # the most recent one I promised to consider?" If yes, accept it. # This is the moment a vote is actually cast, not just a promise. if n >= promised_n: promised_n = n # Update promise to this proposal accepted_n = n # Record that we accepted this proposal number accepted_v = v # Record the accepted value return Accepted(n, v) else: return Reject() # "I promised to consider a higher proposal; yours is stale"
If proposer receives Accepted from majority, value is chosen!
THEOREM: If a value v is chosen, any future proposal will have value v.PROOF INTUITION:─────────────────1. v is chosen → accepted by majority M12. Future proposal n' > n must get promises from majority M23. M1 ∩ M2 ≠ ∅ (majorities overlap)4. At least one acceptor in M2 already accepted v5. That acceptor tells proposer about v in Promise6. Proposer MUST use v (highest accepted value)7. Therefore, v is preservedThis is why proposal numbers and the "use highest accepted value" rule exist!
Most Asked in Interviews: Raft is designed to be understandable. You MUST be able to explain it clearly. The original Raft paper by Ongaro and Ousterhout was explicitly motivated by the observation that students and practitioners struggled to implement Paxos correctly. Raft achieves the same safety guarantees as Multi-Paxos but decomposes the problem into three relatively independent sub-problems: leader election, log replication, and safety. This decomposition is what makes it teachable and implementable.
For a full, implementation-focused treatment with invariants, failure scenarios, and a build-from-scratch roadmap,
see the dedicated Raft Deep Dive.
┌─────────────────────────────────────────────────────────────────────────────┐│ RAFT vs PAXOS │├─────────────────────────────────────────────────────────────────────────────┤│ ││ PAXOS PROBLEMS: RAFT SOLUTIONS: ││ ─────────────── ─────────────── ││ - Hard to understand - Designed for understandability ││ - Many variants, unclear which to use - One clear algorithm ││ - Difficult to implement correctly - Clearer implementation path ││ - Symmetric roles - Strong leader model ││ ││ RAFT KEY DECISIONS: ││ ─────────────────── ││ 1. Strong leader: Only leader handles client requests ││ 2. Leader election: Randomized timeouts for faster convergence ││ 3. Log matching: If entries match at index, all prior entries match ││ 4. Leader completeness: Elected leader has all committed entries ││ │└─────────────────────────────────────────────────────────────────────────────┘
FOLLOWER TIMEOUT:─────────────────Follower Leader │ │ │ <── Heartbeat ─────────│ (every 150ms) │ │ │ (waiting...) │ │ │ │ (timeout: 300ms) X (leader dies) │ │ No heartbeat received! │ Convert to CANDIDATE │ Start election ▼TIMEOUT: Random between 150-300ms(Randomness prevents split votes)
# Candidate sends to all nodesRequestVote { term: int, # Candidate's term candidateId: int, # Candidate requesting vote lastLogIndex: int, # Index of candidate's last log entry lastLogTerm: int # Term of candidate's last log entry}# Receiver's implementationdef handle_request_vote(request): # Reject if candidate's term is old if request.term < currentTerm: return VoteResponse(term=currentTerm, voteGranted=False) # Update term if candidate's is newer if request.term > currentTerm: currentTerm = request.term votedFor = None convert_to_follower() # Grant vote if: # 1. Haven't voted in this term OR already voted for this candidate # 2. Candidate's log is at least as up-to-date as ours log_ok = (request.lastLogTerm > my_last_log_term or (request.lastLogTerm == my_last_log_term and request.lastLogIndex >= my_last_log_index)) if (votedFor is None or votedFor == request.candidateId) and log_ok: votedFor = request.candidateId reset_election_timer() return VoteResponse(term=currentTerm, voteGranted=True) return VoteResponse(term=currentTerm, voteGranted=False)
CANDIDATE BEHAVIOR:───────────────────1. Increment currentTerm2. Vote for self3. Reset election timer4. Send RequestVote to all other nodesOUTCOMES:─────────(a) Wins election (majority votes) → Become leader → Send heartbeats immediately(b) Another node wins → Receive AppendEntries from new leader → Convert to follower(c) Election timeout (no winner) → Increment term, start new election → Randomized timeout prevents repeated tiesEXAMPLE (5 nodes, S1-S5):─────────────────────────Term 5 Election:S1: Times out first, becomes candidate Votes for self (1 vote) Sends RequestVote to S2, S3, S4, S5S2: Receives RequestVote, grants voteS3: Receives RequestVote, grants voteS4: Already voted for self (also candidate)S5: Receives RequestVote, grants voteS1: Has 4 votes (S1, S2, S3, S5) = MAJORITYS1: Becomes LEADER of term 5
The shared notebook analogy: Log replication is like a team that must keep identical lab notebooks. The leader writes an entry in their notebook and sends a copy to every team member. Each team member writes the entry in their own notebook, but only after checking that their previous entries match the leader’s — if there is a discrepancy, the leader walks them back until they find where their notebooks agree, then replays everything from that point forward. An entry is “committed” (permanent) only when a majority of team members have it in their notebooks. This ensures that even if some notebooks are destroyed, the committed entries survive.The core of Raft - replicating commands across nodes:
┌─────────────────────────────────────────────────────────────────────────────┐│ RAFT LOG STRUCTURE │├─────────────────────────────────────────────────────────────────────────────┤│ ││ Leader's Log: ││ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ ││ │ 1,1 │ 1,1 │ 1,2 │ 2,3 │ 2,3 │ 3,3 │ 3,3 │ 3,3 │ ││ │ x←1 │ y←2 │ x←3 │ y←4 │ z←5 │ x←6 │ y←7 │ z←8 │ ││ └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ ││ 1 2 3 4 5 6 7 8 ← Log Index ││ ││ Entry format: (index, term) command ││ ││ COMMIT INDEX: ││ ───────────── ││ An entry is COMMITTED when replicated to majority ││ ││ Leader: commitIndex = 6 (entries 1-6 are committed) ││ Followers: Apply entries up to commitIndex ││ ││ SAFETY: Once committed, entry is NEVER overwritten ││ │└─────────────────────────────────────────────────────────────────────────────┘
# Leader sends to followersAppendEntries { term: int, # Leader's term leaderId: int, # So follower can redirect clients prevLogIndex: int, # Index of entry before new ones prevLogTerm: int, # Term of prevLogIndex entry entries: List[Entry], # Log entries to store (empty for heartbeat) leaderCommit: int # Leader's commitIndex}# Follower's implementationdef handle_append_entries(request): # STEP 1: Term check -- a leader with a stale term has been superseded. # This is the distributed equivalent of "your badge is expired." if request.term < currentTerm: return AppendResponse(term=currentTerm, success=False) # Any valid AppendEntries proves the leader is alive -- # reset the election timer so we don't start a needless election. reset_election_timer() # STEP 2: If the leader's term is newer, we must have missed an # election. Update our term and revert to follower state. if request.term > currentTerm: currentTerm = request.term convert_to_follower() # STEP 3: LOG MATCHING -- the critical consistency check. # The leader says "my entry just before the new ones is at index X, # term Y." If we don't have that same entry, our logs have diverged # and the leader must back up and try an earlier prevLogIndex. if not log_contains(request.prevLogIndex, request.prevLogTerm): return AppendResponse(term=currentTerm, success=False) # STEP 4: APPEND NEW ENTRIES # If our log has an entry at the same index but a different term, # that entry came from a deposed leader -- delete it and everything # after it. This is safe because uncommitted entries are expendable. for i, entry in enumerate(request.entries): index = request.prevLogIndex + 1 + i if log[index].term != entry.term: log = log[:index] # Truncate divergent suffix if index > len(log): log.append(entry) # STEP 5: Advance our commit index to match the leader's. # We can only commit up to the end of our own log (we might be # behind the leader if some AppendEntries were lost in transit). if request.leaderCommit > commitIndex: commitIndex = min(request.leaderCommit, len(log)) apply_committed_entries() # Apply to state machine in order return AppendResponse(term=currentTerm, success=True)
┌─────────────────────────────────────────────────────────────────────────────┐│ LOG MATCHING PROPERTY │├─────────────────────────────────────────────────────────────────────────────┤│ ││ IF two entries in different logs have same index AND term: ││ ├── They store the same command ││ └── All preceding entries are identical ││ ││ HOW IT'S MAINTAINED: ││ ──────────────────── ││ 1. Leader creates entries with current term ││ 2. AppendEntries includes prevLogIndex and prevLogTerm ││ 3. Follower rejects if previous entry doesn't match ││ 4. Leader backs up and retries until match found ││ ││ EXAMPLE - Repairing Inconsistent Log: ││ ││ Leader: [1,1] [1,2] [2,3] [2,4] [3,5] [3,6] ││ Follower: [1,1] [1,2] [2,3] [2,4] (behind) ││ ││ AppendEntries(prevLogIndex=4, prevLogTerm=2, entries=[3,5][3,6]) ││ Follower: Entry at index 4, term 2 exists? ✓ ││ Follower: Append entries 5 and 6 ││ Result: [1,1] [1,2] [2,3] [2,4] [3,5] [3,6] (matches!) ││ │└─────────────────────────────────────────────────────────────────────────────┘
Critical Property: If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher terms.
HOW RAFT ENSURES THIS:──────────────────────1. ELECTION RESTRICTION: Voter denies vote if candidate's log is less up-to-date "Up-to-date" comparison: - Compare last entry's term (higher is better) - If same term, compare last entry's index (longer is better)2. LEADER ONLY COMMITS ENTRIES FROM CURRENT TERM: Leader doesn't count old-term entries as committed until a new entry from current term is replicated Why? Prevents overwriting committed entries EXAMPLE (The Figure 8 scenario): ───────────────────────────────── Term 2: Leader S1 replicates entry at index 2 to S2 Crashes before committing Term 3: S5 becomes leader (didn't have index 2) Appends entry at index 2 with term 3 Term 4: S1 comes back, becomes leader CAN'T just count term-2 entry as committed! Must replicate a term-4 entry first This prevents the term-2 entry from being overwritten by the term-3 entry that S5 had
┌─────────────────────────────────────────────────────────────────────────────┐│ JOINT CONSENSUS │├─────────────────────────────────────────────────────────────────────────────┤│ ││ PROBLEM: Adding/removing nodes during operation ││ ││ DANGER: If we switch configs directly: ││ ││ Old: {A, B, C} New: {A, B, C, D, E} ││ ││ During transition: ││ - {A, B} thinks old config, majority = 2 ││ - {C, D, E} thinks new config, majority = 3 ││ - BOTH could be majorities! Two leaders possible! ││ ││ SOLUTION: JOINT CONSENSUS ││ ───────────────────────── ││ ││ Phase 1: C_old,new (joint configuration) ││ Decisions need majorities in BOTH old AND new ││ ││ Phase 2: C_new (new configuration) ││ Only after C_old,new is committed ││ ││ Timeline: ││ ───────── ││ │ │ │ ││ C_old│ C_old,new │ C_old,new │ C_new ││ │ entry │ committed │ entry ││ │ created │ │ created ││ │ │ │ ││ ▼ ▼ ▼ ││ ──────┬──────────────┬────────────────┬────────────────► ││ ││ SIMPLER ALTERNATIVE: Single-server changes (one at a time) ││ ───────────────────── ││ Many implementations use this simpler approach ││ │└─────────────────────────────────────────────────────────────────────────────┘
Distributed pitfall: Membership changes are the most bug-prone part of any consensus implementation. The Raft paper’s joint consensus protocol is correct but complex. In practice, most production systems (etcd, Consul) use single-server changes — adding or removing one node at a time — because this guarantees that old and new majorities always overlap without needing a joint configuration. The rule is simple: never change more than one node between committed configuration entries. If you need to go from 3 nodes to 5, add one node, wait for it to catch up, then add the second.
An alternative consensus protocol that predates Paxos:
┌─────────────────────────────────────────────────────────────────────────────┐│ VIEWSTAMPED REPLICATION │├─────────────────────────────────────────────────────────────────────────────┤│ ││ SIMILAR TO RAFT: ││ ──────────────── ││ • Strong leader (called "primary") ││ • Log replication ││ • View changes (like term changes) ││ ││ KEY DIFFERENCES: ││ ──────────────── ││ • View-change is more complex ││ • Different recovery protocol ││ • Historically important (1988, before Paxos publication) ││ ││ VIEW CHANGE PROTOCOL: ││ ───────────────────── ││ 1. New primary collects state from f+1 replicas ││ 2. Computes new state (union of all logs) ││ 3. Sends StartView to all replicas ││ 4. Replicas update and respond ││ ││ WHEN TO USE VR vs RAFT: ││ ──────────────────────── ││ Usually prefer Raft (simpler, better understood) ││ VR useful for academic understanding ││ │└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐│ ZAB │├─────────────────────────────────────────────────────────────────────────────┤│ ││ Designed specifically for Zookeeper's requirements: ││ • Primary order: All updates from a leader in order ││ • Prefix property: If transaction T is delivered, all before T too ││ ││ PHASES: ││ ─────── ││ ││ 1. LEADER ELECTION ││ └── Fast leader election using voting ││ └── Leader chosen based on highest zxid (transaction id) ││ ││ 2. DISCOVERY ││ └── Leader learns about all accepted transactions ││ └── Collects epoch information from followers ││ ││ 3. SYNCHRONIZATION ││ └── Leader syncs followers to its state ││ └── Handles DIFF, TRUNC, or SNAP updates ││ ││ 4. BROADCAST ││ └── Normal operation - leader broadcasts updates ││ └── Two-phase commit: PROPOSAL → ACK → COMMIT ││ ││ ZXID (Transaction ID): ││ ────────────────────── ││ 64-bit: [epoch (32 bits)][counter (32 bits)] ││ ││ Epoch increases with each new leader ││ Counter increases with each transaction ││ │└─────────────────────────────────────────────────────────────────────────────┘
GUARANTEES:───────────1. Linearizable Writes: All writes go through leader, ordered globally2. FIFO Client Order: Client's operations processed in order3. Wait-free Reads: Reads can go to any server (possibly stale)4. Watch Notifications: Clients notified of changesUSE CASES:──────────• Distributed locking• Leader election• Configuration management• Service discovery• Cluster membership
Staff+ Level: EPaxos is an advanced topic that shows deep understanding of consensus trade-offs. It’s less commonly asked but demonstrates expertise.
┌─────────────────────────────────────────────────────────────────────────────┐│ EPAXOS │├─────────────────────────────────────────────────────────────────────────────┤│ ││ THE PROBLEM WITH LEADER-BASED CONSENSUS: ││ ───────────────────────────────────────── ││ • All requests go through leader (bottleneck) ││ • Leader failure triggers expensive election ││ • Cross-datacenter: high latency to leader ││ ││ EPAXOS KEY INSIGHT: ││ ─────────────────── ││ "If commands don't conflict, they can commit independently" ││ ││ LEADERLESS CONSENSUS: ││ ───────────────────── ││ • ANY replica can be "command leader" for any command ││ • Conflicting commands require extra round ││ • Non-conflicting commands: 1 round-trip (fast path) ││ ││ CONFLICT DETECTION: ││ ─────────────────── ││ Commands conflict if they access same keys with at least one write ││ ││ get(x), get(x) → NO conflict (both reads) ││ get(x), put(x,1) → CONFLICT (one is write) ││ put(x,1), put(y,2) → NO conflict (different keys) ││ │└─────────────────────────────────────────────────────────────────────────────┘
FAST PATH (no conflicts):────────────────────────Client ──> R1 (command leader for cmd A) │ │ PreAccept(A) ├──────────────────> R2 ─┐ ├──────────────────> R3 ─┼─ All reply "OK, no conflicts" ├──────────────────> R4 ─┤ ├──────────────────> R5 ─┘ │ │ <── PreAcceptOK ──────── │ (from fast quorum: ⌊3N/4⌋) │ │ Commit(A) ├──────────────────> All replicas │Result: 2 message delays (same as Multi-Paxos with stable leader!)SLOW PATH (conflicts detected):───────────────────────────────R1: PreAccept(A) includes deps = {}R2: Already has B that conflicts with AR2: PreAcceptOK with deps = {B}R1: Sees different deps from different replicasR1: Must run Paxos Accept phase to agree on depsAdditional round:R1: Accept(A, deps={B})All: AcceptOKResult: 4 message delays (slower, but still correct)
┌─────────────────────────────────────────────────────────────────────────────┐│ EPAXOS TRADE-OFFS │├─────────────────────────────────────────────────────────────────────────────┤│ ││ ADVANTAGES: ││ ─────────── ││ ✓ No leader bottleneck ││ ✓ Lower latency for clients near any replica ││ ✓ No leader election overhead ││ ✓ Better for geo-distributed deployments ││ ✓ Fast path = same latency as optimized Paxos ││ ││ DISADVANTAGES: ││ ───────────── ││ ✗ More complex to implement correctly ││ ✗ Slower when there are many conflicts ││ ✗ More complex recovery ││ ✗ Fewer battle-tested implementations ││ ││ USE WHEN: ││ ───────── ││ • Multi-datacenter deployment ││ • Low conflict workload (different clients access different data) ││ • Read-heavy or partition-friendly workload ││ • Need to eliminate leader as single point of latency ││ ││ DON'T USE WHEN: ││ ─────────────── ││ • High conflict rate (hot keys) ││ • Single datacenter (Raft is simpler) ││ • Team doesn't have consensus expertise ││ │└─────────────────────────────────────────────────────────────────────────────┘
Production consensus systems use several optimizations to achieve high throughput despite the fundamental requirement of disk I/O and network round-trips.
Key Insight: Requests at different pipeline stages use different resources (CPU for append, network for replicate, disk for commit). Pipelining maximizes resource utilization.
The leader can write to its own log while simultaneously sending AppendEntries to followers.
SEQUENTIAL:Leader: [Write to Disk] ──────────► [Send to Followers] ↓Followers: [Receive] → [Write to Disk]PARALLEL:Leader: [Write to Disk] ─────┐ ├─► [Wait for both] [Send to Followers] ─┘ ↓Followers: [Receive] → [Write to Disk]Savings: Hides leader's disk latency behind network latency
fsync() is expensive (~10ms on HDD, ~0.1ms on NVMe). Instead of fsync per entry, batch multiple entries into one fsync.
class GroupCommitter: def __init__(self, max_wait_ms=1, max_batch=100): self.pending = [] self.lock = Lock() def append_and_sync(self, entry): with self.lock: self.pending.append(entry) if len(self.pending) >= max_batch: self._flush() # Or wait for timer-based flush wait_for_flush_complete() def _flush(self): # Write all pending entries for entry in self.pending: log_file.write(entry) # Single fsync for all entries! log_file.fsync() # One fsync instead of N # Notify all waiters notify_all_complete()
Result: 1000 entries with 1 fsync instead of 1000 fsyncs = ~1000x throughput improvement for disk-bound workloads.
Prevent slow followers from causing unbounded memory growth on the leader.
┌─────────────────────────────────────────────────────────────────┐│ FLOW CONTROL STRATEGY │├─────────────────────────────────────────────────────────────────┤│ ││ Leader tracks per-follower state: ││ • nextIndex: Next entry to send ││ • matchIndex: Highest replicated entry ││ • inflight: Entries sent but not ACKed ││ ││ Flow Control Rules: ││ IF inflight > MAX_INFLIGHT: ││ Pause sending to this follower ││ (prevents OOM on leader) ││ ││ IF matchIndex too far behind: ││ Send snapshot instead of entries ││ (prevents unbounded log growth) ││ │└─────────────────────────────────────────────────────────────────┘
Staff Tip: In interviews, mention that “naive Raft” achieves ~1000 ops/s, but production systems like etcd/TiKV achieve 100K+ ops/s through these optimizations. This shows you understand the gap between textbook algorithms and production systems.
Raft is designed for understandability. If you’re building consensus from scratch:
# Recommended approach1. Use an existing library (etcd/raft, hashicorp/raft)2. If building from scratch, follow MIT 6.824 lab structure3. Write extensive tests before anything else
These bugs have caused real production outages. Watch out for them!
# BUG 1: Not persisting state before respondingdef on_request_vote(request): if should_grant_vote(request): voted_for = request.candidate_id # BUG: Must persist BEFORE responding! # persist_state() # <-- MISSING return VoteGranted() # If crash happens, you might vote twice!# BUG 2: Incorrect leader leasedef handle_read(request): if self.is_leader(): # BUG: Might not be leader anymore! # Must check lease or use read index return self.state_machine.read(request.key)# BUG 3: Commit index going backwardsdef apply_committed_entries(): # BUG: Must ensure commit_index only increases # Snapshot installation can reset state machine # But commit_index should reflect reality# BUG 4: Not handling log prefix during appenddef append_entries(request): for entry in request.entries: # BUG: Must delete conflicting entries! # If local log has different term at same index, # delete it and all following entries
A critical “Principal Level” operational reality: What happens when you lose a majority of your nodes forever? (e.g., a region outage or physical data corruption).
In systems like Cassandra or FoundationDB, you can bootstrap a new cluster from a snapshot of the surviving node’s data and “re-seed” the consensus state.
If you only need to recover data (not resume writes), you can often perform Stale Reads from the remaining nodes by bypassing the consensus layer and reading directly from the state machine (SSTables/RocksDB).Staff Tip: Disaster recovery is the only time you should ever touch the internal state of a consensus engine. Mention the “etcd disaster recovery” guide in interviews to show you understand that safety isn’t just about code, but about operational survival.
Other nodes vote (if haven’t voted, log is up-to-date)
If majority votes → becomes leader, sends heartbeats
If receives AppendEntries from valid leader → becomes follower
If timeout again → new election with higher term
Key points to mention:
Randomized timeouts prevent split votes
Log up-to-date check ensures safety
At most one leader per term
Q: What happens if the Raft leader fails during commit?
Answer:
Scenario: Leader has replicated entry to 2/5 nodes and crashes
Followers detect leader failure (timeout)
New election starts
Only nodes with the entry CAN become leader (election restriction)
New leader has the entry, completes replication
Entry gets committed when new-term entry is committed
Key insight: Raft never loses committed entries. An entry replicated to majority WILL be in new leader’s log.
Q: How does Paxos handle competing proposers?
Answer:
Each proposer picks unique proposal number (higher wins)
Acceptor promises to reject lower numbers
If proposer’s Prepare rejected → retry with higher number
In Accept phase, use highest accepted value from promises
System makes progress when one proposer “wins”
Key insight: The “use highest accepted value” rule preserves safety even with competing proposers.
Q: Why can't consensus be achieved in asynchronous systems?
Answer:
FLP Impossibility proves this:
Can’t distinguish slow from dead
Adversarial schedule can always prevent decision
Must choose: safety OR guaranteed termination
Practical workaround: Assume partial synchrony (eventually messages arrive). Raft/Paxos do this with timeouts.
Q: Design a distributed lock using consensus
Answer:
1. Lock request → propose "lock(client_id, lock_name)"2. Use consensus to order all lock requests3. Apply in order: first requester gets lock4. Store lock state in replicated state machine5. Unlock: consensus on "unlock(client_id, lock_name)"6. Add lease/TTL to handle client failuresImplementation options:- Zookeeper ephemeral nodes- etcd lease-based locks- Consul sessions
In a Raft cluster, what exactly happens during a leader election when the network partitions into two groups of unequal size? Walk me through the sequence of events on both sides.
Strong Answer:
Assume a 5-node cluster (A, B, C, D, E) where A is the current leader. The partition splits into (minority) and (majority).
On the minority side: A continues sending heartbeats to B, and B responds. A also tries to send heartbeats to C, D, E but gets no response. If A tries to commit a new write, it needs 3 acknowledgments (majority of 5). It can only get 2 (A itself and B), so A cannot commit any new entries. A remains the leader of term T on its side, but it is operationally frozen — it can accept client requests but cannot commit them. Eventually, depending on the implementation, A may step down or reject client writes directly.
On the majority side: C, D, and E stop receiving heartbeats from A. After a randomized election timeout (e.g., 150-300ms), one of them — say C — times out first, increments the term to T+1, votes for itself, and sends RequestVote to D and E. Both grant votes (since they have not voted in term T+1 and C’s log is at least as up-to-date). C becomes the leader of term T+1 with a majority quorum of . This side can now accept and commit new writes.
When the partition heals: A sends a heartbeat to C with term T. C responds with term T+1. A sees a higher term, immediately steps down to follower, and adopts term T+1. Any uncommitted entries in A’s log (entries accepted during the partition but not committed) may be overwritten by C’s log during the next AppendEntries exchange. This is safe because those entries were never committed — no client received a success response for them.
Follow-up: What if the old leader A managed to replicate an entry to B during the partition but could not commit it. Is that entry lost after partition healing?Yes, that entry can be lost, and this is by design. Raft’s safety guarantee applies only to committed entries (replicated to a majority). An entry replicated to only 2 of 5 nodes is not committed. After the partition heals, the new leader C will send AppendEntries to A and B. If C’s log diverges from A’s at the position of the uncommitted entry, C’s log wins. A and B truncate their logs to match C’s. The uncommitted entry is overwritten. The client that submitted that write either received a timeout (if A could not commit) or received no response, so the client knows to retry. This is why Raft requires majority acknowledgment before returning success to the client — it ensures that any committed entry survives any subsequent leader change.
Compare Raft and Multi-Paxos. In what scenarios would you choose Multi-Paxos over Raft?
Strong Answer:
Raft and Multi-Paxos solve the same fundamental problem — replicated state machines — but with different design philosophies. Raft prioritizes understandability by decomposing the problem into leader election, log replication, and safety as separate subproblems. Multi-Paxos prioritizes flexibility by defining consensus on individual log slots independently.
The key structural difference: Raft requires a strong leader. All writes go through the leader, and the leader’s log is always the authoritative truth. Multi-Paxos can operate with a “distinguished proposer” (de facto leader) for performance, but any node can propose for any log slot. This means Multi-Paxos can continue making progress even if the leader is temporarily slow (another node proposes for that slot), while Raft must wait for a leader election.
I would choose Multi-Paxos (or variants like EPaxos) in geo-distributed deployments where leader-based protocols create latency hotspots. If you have 5 data centers across continents, Raft forces all writes through one leader, adding cross-continent latency. EPaxos (an extension of Paxos) allows any node to propose, and non-conflicting commands can be committed in a single round trip to the nearest quorum. This can cut write latency in half for geo-distributed workloads.
I would choose Raft for everything else. The implementation is more straightforward, the debugging is simpler (one leader means one source of truth), and the ecosystem is mature (etcd, Consul, TiKV, CockroachDB all use Raft).
Follow-up: What is the “dueling proposers” problem in Paxos and how does Multi-Paxos solve it?In basic Paxos, any node can propose at any time. If two nodes simultaneously propose for the same slot with different values, they each execute Phase 1 (Prepare) with increasing proposal numbers. Node A sends Prepare(1), Node B sends Prepare(2). B’s higher number causes acceptors to reject A’s subsequent Accept(1). A retries with Prepare(3), which preempts B. This can livelock: neither proposer makes progress because each keeps preempting the other. Multi-Paxos solves this by electing a stable leader (distinguished proposer) who handles all proposals. Once elected, the leader skips Phase 1 entirely for subsequent slots — it only needs Phase 2 (Accept), cutting the message count in half. The leader acts as a “gatekeeper” that serializes proposals, eliminating the dueling problem. If the leader fails, any node can start Phase 1 for the next slot, triggering a de facto leader change. This is why Multi-Paxos in practice looks very similar to Raft — both have leaders, both skip the first round after election.
A junior engineer asks you: 'If consensus needs a majority quorum, does that mean we can never have an even number of nodes?' What do you tell them?
Strong Answer:
You can absolutely have an even number of nodes, but it is almost never a good idea. Consider a 4-node Raft cluster: the majority quorum is 3 (more than half of 4). You can tolerate exactly 1 failure. Now consider a 3-node cluster: the majority quorum is 2, and you can also tolerate exactly 1 failure. So going from 3 to 4 nodes gives you zero additional fault tolerance while increasing the number of nodes that must acknowledge each write (from 2 to 3), which increases latency.
This is why consensus clusters almost always use odd numbers: 3 (tolerates 1 failure), 5 (tolerates 2), 7 (tolerates 3). Each additional pair of nodes increases fault tolerance by exactly one.
There is an edge case where even numbers appear: Flexible Paxos allows asymmetric quorums where the write quorum and read quorum can differ, as long as their sum exceeds N. In that model, a 4-node cluster with write quorum 3 and read quorum 2 could be useful if reads are much more frequent than writes. But this is an advanced configuration that most systems do not support out of the box.
Follow-up: What about a 2-node consensus cluster? Is that ever valid?A 2-node consensus cluster has a majority quorum of 2, meaning both nodes must agree on every write. This provides zero fault tolerance: if either node goes down, the cluster cannot make progress. It is strictly worse than a single node for availability (you now have two points of failure instead of one) while providing the benefit of data redundancy (if one node’s disk fails, the other has a copy). In practice, 2-node clusters are used as primary-backup pairs with an external arbiter — a lightweight “witness” node that participates only in leader elections and stores no data. etcd supports a “learner” node for this purpose. The witness breaks the tie during elections, giving you effective 3-node quorum behavior with only 2 full data replicas. This is useful when storage is expensive but you still need fault tolerance.
Explain what safety and liveness mean in the context of consensus protocols. Can Raft violate either one?
Strong Answer:
Safety means “nothing bad ever happens.” For Raft, safety means: (1) at most one leader per term, (2) a committed entry is never lost or overwritten, (3) if two logs contain an entry with the same index and term, they are identical and all preceding entries are identical. Safety properties must hold under all circumstances — crashes, partitions, message delays, any adversarial schedule of events.
Liveness means “something good eventually happens.” For Raft, liveness means: the system eventually elects a leader and makes progress (commits entries). Liveness requires timing assumptions — specifically, that the broadcast time is much less than the election timeout, which is much less than the mean time between failures. If the network is completely asynchronous (unbounded delays), Raft’s liveness is not guaranteed (per FLP impossibility).
Raft never violates safety. Even under the worst network conditions, two different leaders cannot commit conflicting entries for the same log index. This is ensured by the election restriction (candidates must have an up-to-date log) and the log matching property (AppendEntries checks prevLogIndex/prevLogTerm).
Raft can violate liveness. If the election timeout is too short relative to network latency, the cluster can enter a cycle of perpetual elections where no candidate wins a majority before the next timeout fires. This is mitigated by randomized election timeouts, but it is theoretically possible (though statistically improbable) for this to continue indefinitely. In practice, poor timeout tuning is the most common cause of Raft liveness issues.
Follow-up: Describe a concrete production scenario where Raft’s liveness could be compromised.Imagine a 5-node Raft cluster deployed across 3 availability zones, with 2 nodes in AZ-1, 2 in AZ-2, and 1 in AZ-3. The cross-AZ latency is 5ms. The election timeout is set to 150-300ms. Now suppose AZ-3 experiences severe network congestion, increasing its latency to 500ms. The node in AZ-3 keeps timing out and starting elections (incrementing the term), which forces the current leader to step down whenever it receives a RequestVote with a higher term. But the AZ-3 node cannot win the election because its AppendEntries responses are too slow. The result: the disruptive node repeatedly triggers elections, preventing any leader from maintaining stability. Raft addresses this with the “Pre-Vote” extension: before incrementing its term and starting a real election, a candidate sends a PreVote request to check if it would win. If the majority rejects the PreVote (because they already have a functioning leader), the candidate does not proceed, preventing the disruptive election cycle. etcd and TiKV both implement Pre-Vote for this reason.