Consensus is the problem of getting multiple nodes to agree on a single value, despite failures.
THE CONSENSUS PROBLEM
═════════════════════

SETUP:
├── N nodes in a distributed system
├── Some nodes may fail (crash failures)
├── Network may delay/lose messages
└── Goal: Agree on a single value

PROPERTIES (must satisfy ALL):

1. AGREEMENT:   All non-faulty nodes decide the same value
2. VALIDITY:    The decided value was proposed by some node
3. TERMINATION: All non-faulty nodes eventually decide
                (liveness property)
4. INTEGRITY:   Each node decides at most once

WHY IT'S HARD:
──────────────
Node A proposes "5"    │  What if:
Node B proposes "7"    │  - A crashes after proposing?
Node C proposes "3"    │  - Message from A to C is lost?
                       │  - B thinks A is dead (just slow)?
The Most Important Result: Consensus is impossible in an asynchronous system with even one faulty process.
FLP IMPOSSIBILITY (1985)
════════════════════════

Fischer, Lynch, and Paterson proved:

  "No deterministic consensus protocol can guarantee termination
   in an asynchronous system with even one crash failure."

IMPLICATIONS:
─────────────
You CANNOT have all three:
1. Safety (agreement + validity)
2. Liveness (termination)
3. Asynchrony (no timing assumptions)

WHY? (Intuition)
────────────────
- In async systems, you can't distinguish slow from dead
- There's always an adversarial schedule that prevents progress
- No matter the protocol, an adversary can delay messages
  to keep the system in an undecided state

HOW PRACTICAL SYSTEMS WORK AROUND IT:
─────────────────────────────────────
1. Timeouts (partial synchrony) - Most systems do this
2. Randomization - Some probability of progress
3. Failure detectors - Oracle that hints about failures

RAFT/PAXOS assume: "Eventually, the network will be synchronous enough"
SAFETY: "Nothing bad happens"├── Agreement is never violated├── Only proposed values are chosen└── Can be proven mathematicallyLIVENESS: "Something good eventually happens" ├── Decisions are eventually made├── Progress is made└── Can be violated by FLP (temporarily)PRACTICAL APPROACH:├── ALWAYS maintain safety└── EVENTUALLY achieve liveness (when network is stable)
Paxos was invented by Leslie Lamport in 1989. Despite being notoriously difficult to understand, it’s the foundation of modern consensus.
PAXOS OVERVIEW
══════════════

ROLES (a node can play multiple roles):
───────────────────────────────────────
PROPOSER: Proposes values, drives the protocol
ACCEPTOR: Votes on proposals, stores accepted values
LEARNER:  Learns the decided value

GOAL: Agree on a single value among N acceptors

KEY INSIGHT:
────────────
A value is "chosen" when accepted by a MAJORITY of acceptors
Any two majorities overlap, ensuring agreement

PHASES:
───────
Phase 1a: PREPARE  - Proposer asks acceptors to promise
Phase 1b: PROMISE  - Acceptors promise (or reject)
Phase 2a: ACCEPT   - Proposer asks acceptors to accept value
Phase 2b: ACCEPTED - Acceptors accept (or reject)
def on_accept(n, v):
    if n >= promised_n:
        promised_n = n
        accepted_n = n
        accepted_v = v
        return Accepted(n, v)
    else:
        return Reject()  # Promised to higher proposal
If proposer receives Accepted from majority, value is chosen!
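The pseudocode above covers Phase 2b only. Below is a minimal sketch of the matching Phase 1b (Promise) handler, wrapped in a small Acceptor class for readability; the class, the Promise/Reject records, and the field names are illustrative assumptions, not any particular library's API.

from collections import namedtuple

Promise = namedtuple("Promise", "n accepted_n accepted_v")
Reject = namedtuple("Reject", "promised_n")

class Acceptor:
    def __init__(self):
        self.promised_n = 0     # Highest proposal number promised so far
        self.accepted_n = None  # Number of the last accepted proposal
        self.accepted_v = None  # Last accepted value

    def on_prepare(self, n):
        # Phase 1b: promise to ignore anything numbered below n, and
        # report any value this acceptor has already accepted so the
        # proposer is forced to re-propose it.
        if n > self.promised_n:
            self.promised_n = n
            return Promise(n, self.accepted_n, self.accepted_v)
        return Reject(self.promised_n)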
THEOREM: If a value v is chosen, any future proposal will have value v.

PROOF INTUITION:
────────────────
1. v is chosen → accepted by majority M1
2. Future proposal n' > n must get promises from majority M2
3. M1 ∩ M2 ≠ ∅ (majorities overlap)
4. At least one acceptor in M2 already accepted v
5. That acceptor tells proposer about v in Promise
6. Proposer MUST use v (highest accepted value)
7. Therefore, v is preserved

This is why proposal numbers and the "use highest accepted value" rule exist!
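To make step 6 concrete, here is a sketch of how a proposer might pick its Phase 2 value from a majority of promises. The function and its input format are assumptions chosen to match the acceptor sketch above, not a reference implementation.

def choose_phase2_value(promises, my_value):
    """Pick the value to send in Accept, given a majority of promises.

    promises: list of (accepted_n, accepted_v) pairs; accepted_n is None
              if that acceptor has not accepted anything yet.
    my_value: the value this proposer originally wanted to propose.
    """
    accepted = [(n, v) for (n, v) in promises if n is not None]
    if not accepted:
        return my_value  # Nothing chosen yet: free to propose our own value
    # Otherwise we MUST adopt the value with the highest accepted_n;
    # this is the rule that preserves any value that may already be chosen.
    _, value = max(accepted, key=lambda pair: pair[0])
    return value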
RAFT vs PAXOS
═════════════

PAXOS PROBLEMS:                          RAFT SOLUTIONS:
───────────────                          ───────────────
- Hard to understand                     - Designed for understandability
- Many variants, unclear which to use    - One clear algorithm
- Difficult to implement correctly       - Clearer implementation path
- Symmetric roles                        - Strong leader model

RAFT KEY DECISIONS:
───────────────────
1. Strong leader: Only leader handles client requests
2. Leader election: Randomized timeouts for faster convergence
3. Log matching: If entries match at index, all prior entries match
4. Leader completeness: Elected leader has all committed entries
FOLLOWER TIMEOUT:
─────────────────
Follower                        Leader
   │                              │
   │ <── Heartbeat ───────────────│  (every 150ms)
   │                              │
   │  (waiting...)                │
   │  (timeout: 300ms)            X  (leader dies)
   │
   │  No heartbeat received!
   │  Convert to CANDIDATE
   │  Start election
   ▼

TIMEOUT: Random between 150-300ms
(Randomness prevents split votes)
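A tiny sketch of the randomized timeout itself, using the 150-300ms range from the diagram above (the constants and helper name are made up for illustration):

import random

ELECTION_TIMEOUT_MIN_MS = 150
ELECTION_TIMEOUT_MAX_MS = 300

def next_election_timeout_ms():
    # Pick a fresh random timeout every time the timer is reset, so two
    # followers rarely time out at the same instant and split the vote.
    return random.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)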
# Candidate sends to all nodes
RequestVote {
    term: int,           # Candidate's term
    candidateId: int,    # Candidate requesting vote
    lastLogIndex: int,   # Index of candidate's last log entry
    lastLogTerm: int     # Term of candidate's last log entry
}

# Receiver's implementation
def handle_request_vote(request):
    # Reject if candidate's term is old
    if request.term < currentTerm:
        return VoteResponse(term=currentTerm, voteGranted=False)

    # Update term if candidate's is newer
    if request.term > currentTerm:
        currentTerm = request.term
        votedFor = None
        convert_to_follower()

    # Grant vote if:
    # 1. Haven't voted in this term OR already voted for this candidate
    # 2. Candidate's log is at least as up-to-date as ours
    log_ok = (request.lastLogTerm > my_last_log_term or
              (request.lastLogTerm == my_last_log_term and
               request.lastLogIndex >= my_last_log_index))

    if (votedFor is None or votedFor == request.candidateId) and log_ok:
        votedFor = request.candidateId
        reset_election_timer()
        return VoteResponse(term=currentTerm, voteGranted=True)

    return VoteResponse(term=currentTerm, voteGranted=False)
CANDIDATE BEHAVIOR:
───────────────────
1. Increment currentTerm
2. Vote for self
3. Reset election timer
4. Send RequestVote to all other nodes

OUTCOMES:
─────────
(a) Wins election (majority votes)
    → Become leader
    → Send heartbeats immediately
(b) Another node wins
    → Receive AppendEntries from new leader
    → Convert to follower
(c) Election timeout (no winner)
    → Increment term, start new election
    → Randomized timeout prevents repeated ties

EXAMPLE (5 nodes, S1-S5):
─────────────────────────
Term 5 Election:
S1: Times out first, becomes candidate
    Votes for self (1 vote)
    Sends RequestVote to S2, S3, S4, S5
S2: Receives RequestVote, grants vote
S3: Receives RequestVote, grants vote
S4: Already voted for self (also candidate)
S5: Receives RequestVote, grants vote

S1: Has 4 votes (S1, S2, S3, S5) = MAJORITY
S1: Becomes LEADER of term 5
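A sketch of one election round following the steps above. The node, peer, and response objects are placeholders, and real implementations send the RequestVote RPCs in parallel rather than in a loop:

def start_election(node):
    # 1-3. Increment term, vote for self, reset election timer
    node.current_term += 1
    node.voted_for = node.id
    node.reset_election_timer()
    votes = 1  # Our own vote

    # 4. Ask every other node for its vote
    for peer in node.peers:
        resp = peer.request_vote(
            term=node.current_term,
            candidate_id=node.id,
            last_log_index=node.last_log_index(),
            last_log_term=node.last_log_term(),
        )
        if resp.term > node.current_term:
            # Someone has a higher term: step down immediately
            node.current_term = resp.term
            node.become_follower()
            return
        if resp.vote_granted:
            votes += 1

    # Outcome (a): majority wins; otherwise wait for the next timeout
    if votes > (len(node.peers) + 1) // 2:
        node.become_leader()  # Send heartbeats immediately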
# Leader sends to followers
AppendEntries {
    term: int,            # Leader's term
    leaderId: int,        # So follower can redirect clients
    prevLogIndex: int,    # Index of entry before new ones
    prevLogTerm: int,     # Term of prevLogIndex entry
    entries: List[Entry], # Log entries to store (empty for heartbeat)
    leaderCommit: int     # Leader's commitIndex
}

# Follower's implementation
def handle_append_entries(request):
    # Reject if leader's term is old
    if request.term < currentTerm:
        return AppendResponse(term=currentTerm, success=False)

    # Reset election timer (leader is alive)
    reset_election_timer()

    # Update term if needed
    if request.term > currentTerm:
        currentTerm = request.term
        convert_to_follower()

    # LOG MATCHING: Check if we have the previous entry
    if not log_contains(request.prevLogIndex, request.prevLogTerm):
        # Tell leader to back up and retry
        return AppendResponse(term=currentTerm, success=False)

    # APPEND NEW ENTRIES
    # (delete conflicting entries first if any)
    for i, entry in enumerate(request.entries):
        index = request.prevLogIndex + 1 + i
        if index < len(log) and log[index].term != entry.term:
            # Conflict: delete this and all following entries
            log = log[:index]
        if index >= len(log):
            log.append(entry)

    # UPDATE COMMIT INDEX
    if request.leaderCommit > commitIndex:
        commitIndex = min(request.leaderCommit, len(log) - 1)  # last entry index
        apply_committed_entries()

    return AppendResponse(term=currentTerm, success=True)
LOG MATCHING PROPERTY
═════════════════════

IF two entries in different logs have same index AND term:
├── They store the same command
└── All preceding entries are identical

HOW IT'S MAINTAINED:
────────────────────
1. Leader creates entries with current term
2. AppendEntries includes prevLogIndex and prevLogTerm
3. Follower rejects if previous entry doesn't match
4. Leader backs up and retries until match found

EXAMPLE - Repairing Inconsistent Log (entries shown as [term, index]):

Leader:    [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]
Follower:  [1,1] [1,2] [2,3] [2,4]               (behind)

AppendEntries(prevLogIndex=4, prevLogTerm=2, entries=[3,5][3,6])
Follower: Entry at index 4, term 2 exists? ✓
Follower: Append entries 5 and 6
Result:   [1,1] [1,2] [2,3] [2,4] [3,5] [3,6]    (matches!)
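A rough leader-side sketch of step 4, backing nextIndex up until the follower's log matches; the state fields mirror the Raft paper's nextIndex/matchIndex bookkeeping and are not tied to any specific implementation:

def replicate_to(leader, follower_id):
    # Retry AppendEntries, backing nextIndex up until the logs match.
    while True:
        next_idx = leader.next_index[follower_id]
        prev_idx = next_idx - 1
        resp = leader.send_append_entries(
            follower_id,
            prev_log_index=prev_idx,
            prev_log_term=leader.log[prev_idx].term if prev_idx >= 0 else 0,
            entries=leader.log[next_idx:],
            leader_commit=leader.commit_index,
        )
        if resp.success:
            leader.match_index[follower_id] = len(leader.log) - 1
            leader.next_index[follower_id] = len(leader.log)
            return
        # Mismatch at prev_idx: back up one entry and try again.
        # (Real implementations often back up a whole term at a time.)
        leader.next_index[follower_id] = max(0, next_idx - 1)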
Critical Property: If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher terms.
HOW RAFT ENSURES THIS:
──────────────────────
1. ELECTION RESTRICTION:
   Voter denies vote if candidate's log is less up-to-date

   "Up-to-date" comparison:
   - Compare last entry's term (higher is better)
   - If same term, compare last entry's index (longer is better)

2. LEADER ONLY COMMITS ENTRIES FROM CURRENT TERM:
   Leader doesn't count old-term entries as committed until
   a new entry from current term is replicated

   Why? Prevents overwriting committed entries

   EXAMPLE (The Figure 8 scenario):
   ────────────────────────────────
   Term 2: Leader S1 replicates entry at index 2 to S2
           Crashes before committing
   Term 3: S5 becomes leader (didn't have index 2)
           Appends entry at index 2 with term 3
   Term 4: S1 comes back, becomes leader
           CAN'T just count term-2 entry as committed!
           Must replicate a term-4 entry first

   This prevents the term-2 entry from being overwritten
   by the term-3 entry that S5 had
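A sketch of how a leader might advance its commit index under rule 2, counting a replication majority only for entries from its own term. Field names follow the earlier sketches and are assumptions, not a specific implementation:

def advance_commit_index(leader):
    # Try to commit the highest index replicated on a majority, but only
    # if that entry carries the leader's current term (the Figure 8 rule).
    for n in range(len(leader.log) - 1, leader.commit_index, -1):
        if leader.log[n].term != leader.current_term:
            break  # Old-term entries commit only indirectly
        replicas = 1 + sum(1 for m in leader.match_index.values() if m >= n)
        if replicas > (len(leader.match_index) + 1) // 2:
            leader.commit_index = n
            leader.apply_committed_entries()
            return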
Viewstamped Replication is an alternative consensus protocol that predates the publication of Paxos:
VIEWSTAMPED REPLICATION
═══════════════════════

SIMILAR TO RAFT:
────────────────
• Strong leader (called "primary")
• Log replication
• View changes (like term changes)

KEY DIFFERENCES:
────────────────
• View-change is more complex
• Different recovery protocol
• Historically important (1988, before Paxos publication)

VIEW CHANGE PROTOCOL:
─────────────────────
1. New primary collects state from f+1 replicas
2. Computes new state (union of all logs)
3. Sends StartView to all replicas
4. Replicas update and respond

WHEN TO USE VR vs RAFT:
───────────────────────
Usually prefer Raft (simpler, better understood)
VR useful for academic understanding
ZAB
═══

Designed specifically for Zookeeper's requirements:
• Primary order: All updates from a leader in order
• Prefix property: If transaction T is delivered, all before T too

PHASES:
───────
1. LEADER ELECTION
   └── Fast leader election using voting
   └── Leader chosen based on highest zxid (transaction id)

2. DISCOVERY
   └── Leader learns about all accepted transactions
   └── Collects epoch information from followers

3. SYNCHRONIZATION
   └── Leader syncs followers to its state
   └── Handles DIFF, TRUNC, or SNAP updates

4. BROADCAST
   └── Normal operation - leader broadcasts updates
   └── Two-phase commit: PROPOSAL → ACK → COMMIT

ZXID (Transaction ID):
──────────────────────
64-bit: [epoch (32 bits)][counter (32 bits)]

Epoch increases with each new leader
Counter increases with each transaction
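A sketch of that 64-bit zxid layout in code (purely illustrative, not taken from ZooKeeper's source):

def make_zxid(epoch, counter):
    # High 32 bits: leader epoch; low 32 bits: per-epoch counter.
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF

# zxids order first by epoch, then by counter, so a transaction from a
# newer leader always compares greater than anything from older leaders.
assert make_zxid(epoch=5, counter=0) > make_zxid(epoch=4, counter=999_999)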
GUARANTEES:
───────────
1. Linearizable Writes:  All writes go through leader, ordered globally
2. FIFO Client Order:    Client's operations processed in order
3. Wait-free Reads:      Reads can go to any server (possibly stale)
4. Watch Notifications:  Clients notified of changes

USE CASES:
──────────
• Distributed locking
• Leader election
• Configuration management
• Service discovery
• Cluster membership
Staff+ Level: EPaxos is an advanced topic that shows deep understanding of consensus trade-offs. It’s less commonly asked but demonstrates expertise.
EPAXOS
══════

THE PROBLEM WITH LEADER-BASED CONSENSUS:
────────────────────────────────────────
• All requests go through leader (bottleneck)
• Leader failure triggers expensive election
• Cross-datacenter: high latency to leader

EPAXOS KEY INSIGHT:
───────────────────
"If commands don't conflict, they can commit independently"

LEADERLESS CONSENSUS:
─────────────────────
• ANY replica can be "command leader" for any command
• Conflicting commands require extra round
• Non-conflicting commands: 1 round-trip (fast path)

CONFLICT DETECTION:
───────────────────
Commands conflict if they access same keys with at least one write

get(x), get(x)      → NO conflict (both reads)
get(x), put(x,1)    → CONFLICT (one is write)
put(x,1), put(y,2)  → NO conflict (different keys)
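The conflict rule above is simple enough to state in code; Command here is a made-up record type for illustration:

from dataclasses import dataclass

@dataclass(frozen=True)
class Command:
    op: str   # "get" or "put"
    key: str

def conflicts(a: Command, b: Command) -> bool:
    # Commands conflict if they touch the same key and at least one writes.
    return a.key == b.key and "put" in (a.op, b.op)

assert not conflicts(Command("get", "x"), Command("get", "x"))  # both reads
assert conflicts(Command("get", "x"), Command("put", "x"))      # one is a write
assert not conflicts(Command("put", "x"), Command("put", "y"))  # different keys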
FAST PATH (no conflicts):
─────────────────────────
Client ──> R1 (command leader for cmd A)
            │
            │ PreAccept(A)
            ├──────────────────> R2 ─┐
            ├──────────────────> R3 ─┼─ All reply "OK, no conflicts"
            ├──────────────────> R4 ─┤
            ├──────────────────> R5 ─┘
            │
            │ <── PreAcceptOK ────────
            │     (from fast quorum: ⌊3N/4⌋)
            │
            │ Commit(A)
            ├──────────────────> All replicas
            │
Result: 2 message delays (same as Multi-Paxos with stable leader!)

SLOW PATH (conflicts detected):
───────────────────────────────
R1: PreAccept(A) includes deps = {}
R2: Already has B that conflicts with A
R2: PreAcceptOK with deps = {B}
R1: Sees different deps from different replicas
R1: Must run Paxos Accept phase to agree on deps

Additional round:
R1:  Accept(A, deps={B})
All: AcceptOK

Result: 4 message delays (slower, but still correct)
EPAXOS TRADE-OFFS
═════════════════

ADVANTAGES:
───────────
✓ No leader bottleneck
✓ Lower latency for clients near any replica
✓ No leader election overhead
✓ Better for geo-distributed deployments
✓ Fast path = same latency as optimized Paxos

DISADVANTAGES:
──────────────
✗ More complex to implement correctly
✗ Slower when there are many conflicts
✗ More complex recovery
✗ Fewer battle-tested implementations

USE WHEN:
─────────
• Multi-datacenter deployment
• Low conflict workload (different clients access different data)
• Read-heavy or partition-friendly workload
• Need to eliminate leader as single point of latency

DON'T USE WHEN:
───────────────
• High conflict rate (hot keys)
• Single datacenter (Raft is simpler)
• Team doesn't have consensus expertise
Production consensus systems use several optimizations to achieve high throughput despite the fundamental requirement of disk I/O and network round-trips.
Key Insight: Requests at different pipeline stages use different resources (CPU for append, network for replicate, disk for commit). Pipelining maximizes resource utilization.
The leader can write to its own log while simultaneously sending AppendEntries to followers.
SEQUENTIAL:
Leader:     [Write to Disk] ──────────► [Send to Followers]
                                              ↓
Followers:                            [Receive] → [Write to Disk]

PARALLEL:
Leader:     [Write to Disk] ─────┐
                                 ├─► [Wait for both]
            [Send to Followers] ─┘
                   ↓
Followers:  [Receive] → [Write to Disk]

Savings: Hides leader's disk latency behind network latency
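A sketch of the parallel version using a thread pool; local_append_and_fsync, send_append_entries, and the response objects are placeholders rather than a real RPC layer:

from concurrent.futures import ThreadPoolExecutor, wait

def replicate(leader, entry, followers):
    with ThreadPoolExecutor() as pool:
        # Start the leader's own disk write and the network sends at the
        # same time instead of serializing them.
        disk_future = pool.submit(leader.local_append_and_fsync, entry)
        send_futures = [
            pool.submit(leader.send_append_entries, f, [entry]) for f in followers
        ]
        wait([disk_future] + send_futures)
    # Committed once the leader plus a majority of followers have the entry.
    acks = 1 + sum(1 for fut in send_futures if fut.result().success)
    return acks > (len(followers) + 1) // 2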
fsync() is expensive (~10ms on HDD, ~0.1ms on NVMe). Instead of fsync per entry, batch multiple entries into one fsync.
from threading import Lock

class GroupCommitter:
    def __init__(self, max_wait_ms=1, max_batch=100):
        self.max_wait_ms = max_wait_ms
        self.max_batch = max_batch
        self.pending = []
        self.lock = Lock()

    def append_and_sync(self, entry):
        with self.lock:
            self.pending.append(entry)
            if len(self.pending) >= self.max_batch:
                self._flush()
        # Or wait for timer-based flush
        wait_for_flush_complete()

    def _flush(self):
        # Write all pending entries
        for entry in self.pending:
            log_file.write(entry)

        # Single fsync for all entries!
        log_file.fsync()  # One fsync instead of N
        self.pending = []

        # Notify all waiters
        notify_all_complete()
Result: 1000 entries with 1 fsync instead of 1000 fsyncs = ~1000x throughput improvement for disk-bound workloads.
Prevent slow followers from causing unbounded memory growth on the leader.
FLOW CONTROL STRATEGY
═════════════════════

Leader tracks per-follower state:
• nextIndex:  Next entry to send
• matchIndex: Highest replicated entry
• inflight:   Entries sent but not ACKed

Flow Control Rules:
IF inflight > MAX_INFLIGHT:
    Pause sending to this follower
    (prevents OOM on leader)

IF matchIndex too far behind:
    Send snapshot instead of entries
    (prevents unbounded log growth)
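Those rules as leader-side pseudocode; the thresholds and field names are assumptions for illustration:

MAX_INFLIGHT = 256               # Entries sent but not yet ACKed
SNAPSHOT_LAG_THRESHOLD = 10_000  # Beyond this gap, send a snapshot instead

def plan_send(leader, follower_id):
    inflight = leader.next_index[follower_id] - leader.match_index[follower_id] - 1
    if inflight > MAX_INFLIGHT:
        return "pause"           # Back-pressure: stop sending for now
    lag = leader.last_log_index() - leader.match_index[follower_id]
    if lag > SNAPSHOT_LAG_THRESHOLD:
        return "send_snapshot"   # The needed entries may already be compacted
    return "send_entries"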
Staff Tip: In interviews, mention that “naive Raft” achieves ~1000 ops/s, but production systems like etcd/TiKV achieve 100K+ ops/s through these optimizations. This shows you understand the gap between textbook algorithms and production systems.
Raft is designed for understandability. If you’re building consensus from scratch:
# Recommended approach
1. Use an existing library (etcd/raft, hashicorp/raft)
2. If building from scratch, follow MIT 6.824 lab structure
3. Write extensive tests before anything else
These bugs have caused real production outages. Watch out for them!
# BUG 1: Not persisting state before responding
def on_request_vote(request):
    if should_grant_vote(request):
        voted_for = request.candidate_id
        # BUG: Must persist BEFORE responding!
        # persist_state()  # <-- MISSING
        return VoteGranted()
        # If crash happens, you might vote twice!

# BUG 2: Incorrect leader lease
def handle_read(request):
    if self.is_leader():
        # BUG: Might not be leader anymore!
        # Must check lease or use read index
        return self.state_machine.read(request.key)

# BUG 3: Commit index going backwards
def apply_committed_entries():
    # BUG: Must ensure commit_index only increases
    # Snapshot installation can reset state machine
    # But commit_index should reflect reality
    pass

# BUG 4: Not handling log prefix during append
def append_entries(request):
    for entry in request.entries:
        # BUG: Must delete conflicting entries!
        # If local log has different term at same index,
        # delete it and all following entries
        pass
A critical “Principal Level” operational reality: What happens when you lose a majority of your nodes forever? (e.g., a region outage or physical data corruption).
In systems like Cassandra or FoundationDB, you can bootstrap a new cluster from a snapshot of the surviving node’s data and “re-seed” the consensus state.
If you only need to recover data (not resume writes), you can often perform Stale Reads from the remaining nodes by bypassing the consensus layer and reading directly from the state machine (SSTables/RocksDB).

Staff Tip: Disaster recovery is the only time you should ever touch the internal state of a consensus engine. Mention the “etcd disaster recovery” guide in interviews to show you understand that safety isn’t just about code, but about operational survival.
Other nodes grant their vote (if they haven't already voted this term and the candidate's log is at least as up-to-date as theirs)
If majority votes → becomes leader, sends heartbeats
If receives AppendEntries from valid leader → becomes follower
If timeout again → new election with higher term
Key points to mention:
Randomized timeouts prevent split votes
Log up-to-date check ensures safety
At most one leader per term
Q: What happens if the Raft leader fails during commit?
Answer:
Scenario: Leader has replicated an entry to 2/5 nodes (a minority) and crashes
Followers detect leader failure (election timeout)
New election starts
Because the entry never reached a majority, it was never committed, so either outcome is safe:
If a node holding the entry wins, it finishes replication and the entry commits once a new-term entry commits
If a node without the entry wins, the entry is discarded (no client was ever acknowledged)
If the entry HAD reached a majority, the election restriction guarantees the new leader's log contains it
Key insight: Raft never loses committed entries. An entry replicated to majority WILL be in new leader’s log.
Q: How does Paxos handle competing proposers?
Answer:
Each proposer picks unique proposal number (higher wins)
Acceptor promises to reject lower numbers
If proposer’s Prepare rejected → retry with higher number
In Accept phase, use highest accepted value from promises
System makes progress when one proposer “wins”
Key insight: The “use highest accepted value” rule preserves safety even with competing proposers.
Q: Why can't consensus be achieved in asynchronous systems?
Answer:
FLP Impossibility proves this:
Can’t distinguish slow from dead
Adversarial schedule can always prevent decision
Must choose: safety OR guaranteed termination
Practical workaround: Assume partial synchrony (eventually messages arrive). Raft/Paxos do this with timeouts.
Q: Design a distributed lock using consensus
Answer:
1. Lock request → propose "lock(client_id, lock_name)"
2. Use consensus to order all lock requests
3. Apply in order: first requester gets lock
4. Store lock state in replicated state machine
5. Unlock: consensus on "unlock(client_id, lock_name)"
6. Add lease/TTL to handle client failures

Implementation options:
- Zookeeper ephemeral nodes
- etcd lease-based locks
- Consul sessions
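A sketch of the lock state machine described in steps 3-4 and 6, applied to commands after consensus has ordered and committed them; all names are illustrative and lease handling is deliberately simplified:

import time

class LockStateMachine:
    """Applies lock/unlock commands in the order consensus decided them."""

    def __init__(self, default_ttl_s=30):
        self.holders = {}  # lock_name -> (client_id, expires_at)
        self.default_ttl_s = default_ttl_s

    def apply(self, cmd, now=None):
        # cmd is a committed log entry, e.g. ("lock", "client-42", "jobs")
        # NOTE: real systems put the timestamp/lease in the log entry itself
        # so every replica applies it deterministically; time.time() here is
        # a simplification.
        op, client_id, lock_name = cmd
        now = time.time() if now is None else now

        holder = self.holders.get(lock_name)
        if holder and holder[1] < now:
            holder = None  # Lease expired: treat the lock as free

        if op == "lock":
            if holder is None:
                self.holders[lock_name] = (client_id, now + self.default_ttl_s)
                return True  # First committed requester gets the lock
            return holder[0] == client_id  # Current holder may re-acquire
        if op == "unlock":
            if holder and holder[0] == client_id:
                del self.holders[lock_name]
                return True
            return False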