Track 2: Consensus Protocols
Consensus is the most critical topic in distributed systems. Master this track to ace Staff+ interviews at top companies.
Track Duration: 36-46 hours
Modules: 5
Key Topics: FLP Impossibility, Paxos, Raft, Viewstamped Replication, ZAB
Module 6: The Consensus Problem
What is Consensus?
Consensus is getting multiple nodes to agree on a single value, despite failures.
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE CONSENSUS PROBLEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SETUP: │
│ ├── N nodes in a distributed system │
│ ├── Some nodes may fail (crash failures) │
│ ├── Network may delay/lose messages │
│ └── Goal: Agree on a single value │
│ │
│ PROPERTIES (must satisfy ALL): │
│ │
│ 1. AGREEMENT: All non-faulty nodes decide the same value │
│ 2. VALIDITY: The decided value was proposed by some node │
│ 3. TERMINATION: All non-faulty nodes eventually decide │
│ (liveness property) │
│ 4. INTEGRITY: Each node decides at most once │
│ │
│ WHY IT'S HARD: │
│ ───────────── │
│ Node A proposes "5" │ What if: │
│ Node B proposes "7" │ - A crashes after proposing? │
│ Node C proposes "3" │ - Message from A to C is lost? │
│ │ - B thinks A is dead (just slow)? │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
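To make the four properties concrete, here is a minimal sketch of what a single-value consensus API looks like to its callers. The class and method names are illustrative, not from any specific library:

```python
from abc import ABC, abstractmethod

class Consensus(ABC):
    """Hypothetical single-value consensus interface.

    Any correct implementation must satisfy:
      AGREEMENT:   decide() returns the same value on every non-faulty node.
      VALIDITY:    the decided value was passed to propose() by some node.
      TERMINATION: every non-faulty node's decide() eventually returns.
      INTEGRITY:   a node decides at most once.
    """

    @abstractmethod
    def propose(self, value) -> None:
        """Offer this node's candidate value."""

    @abstractmethod
    def decide(self):
        """Block until a value is decided, then return it (exactly once)."""
```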
Use Cases for Consensus
Leader Election
Nodes agree on who the leader is. Examples:
- Database primary selection
- Kafka partition leader
- Kubernetes control plane
Distributed Locking
Nodes agree on who holds the lock. Examples:
- Zookeeper locks
- Redis Redlock
- etcd leases
Configuration Management
Nodes agree on the system configuration. Examples:
- Cluster membership
- Feature flags
- Service discovery
State Machine Replication
Nodes agree on a sequence of commands (see the sketch after this list). Examples:
- Replicated databases
- Distributed logs
- Blockchain
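As a sketch of why agreeing on a sequence of commands gives state machine replication: if every replica applies the same agreed log, in the same order, to a deterministic state machine, all replicas end up in the same state. The KVStore class below is purely illustrative:

```python
class KVStore:
    """Deterministic state machine: same commands in same order => same state."""
    def __init__(self):
        self.data = {}

    def apply(self, command):
        op, key, *rest = command
        if op == "put":
            self.data[key] = rest[0]
        elif op == "get":
            return self.data.get(key)

# Every replica applies the consensus-agreed log in log order:
log = [("put", "x", 1), ("put", "y", 2), ("put", "x", 3)]
replica_a, replica_b = KVStore(), KVStore()
for cmd in log:
    replica_a.apply(cmd)
    replica_b.apply(cmd)
assert replica_a.data == replica_b.data == {"x": 3, "y": 2}
```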
FLP Impossibility Theorem
The Most Important Result: No deterministic protocol can guarantee consensus in an asynchronous system with even one faulty process.
┌─────────────────────────────────────────────────────────────────────────────┐
│ FLP IMPOSSIBILITY (1985) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Fischer, Lynch, and Paterson proved: │
│ │
│ "No deterministic consensus protocol can guarantee termination │
│ in an asynchronous system with even one crash failure." │
│ │
│ IMPLICATIONS: │
│ ───────────── │
│ You CANNOT have all three: │
│ 1. Safety (agreement + validity) │
│ 2. Liveness (termination) │
│ 3. Asynchrony (no timing assumptions) │
│ │
│ WHY? (Intuition) │
│ ──────────────── │
│ - In async systems, you can't distinguish slow from dead │
│ - There's always an adversarial schedule that prevents progress │
│ - No matter the protocol, an adversary can delay messages │
│ to keep the system in an undecided state │
│ │
│ HOW PRACTICAL SYSTEMS WORK AROUND IT: │
│ ───────────────────────────────────── │
│ 1. Timeouts (partial synchrony) - Most systems do this │
│ 2. Randomization - Some probability of progress │
│ 3. Failure detectors - Oracle that hints about failures │
│ │
│ RAFT/PAXOS assume: "Eventually, the network will be synchronous enough" │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Safety vs Liveness
SAFETY: "Nothing bad happens"
├── Agreement is never violated
├── Only proposed values are chosen
└── Can be proven mathematically
LIVENESS: "Something good eventually happens"
├── Decisions are eventually made
├── Progress is made
└── Can be violated by FLP (temporarily)
PRACTICAL APPROACH:
├── ALWAYS maintain safety
└── EVENTUALLY achieve liveness (when network is stable)
Module 7: Paxos Protocol
The Original Consensus Algorithm
Paxos was invented by Leslie Lamport in 1989, though the paper was not published until 1998. Despite being notoriously difficult to understand, it is the foundation of modern consensus.
┌─────────────────────────────────────────────────────────────────────────────┐
│ PAXOS OVERVIEW │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ROLES (a node can play multiple roles): │
│ ───── │
│ PROPOSER: Proposes values, drives the protocol │
│ ACCEPTOR: Votes on proposals, stores accepted values │
│ LEARNER: Learns the decided value │
│ │
│ GOAL: Agree on a single value among N acceptors │
│ │
│ KEY INSIGHT: │
│ ──────────── │
│ A value is "chosen" when accepted by a MAJORITY of acceptors │
│ Any two majorities overlap, ensuring agreement │
│ │
│ PHASES: │
│ ─────── │
│ Phase 1a: PREPARE - Proposer asks acceptors to promise │
│ Phase 1b: PROMISE - Acceptors promise (or reject) │
│ Phase 2a: ACCEPT - Proposer asks acceptors to accept value │
│ Phase 2b: ACCEPTED - Acceptors accept (or reject) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Basic Paxos Step by Step
- Phase 1a: Prepare
- Phase 1b: Promise
- Phase 2a: Accept
- Phase 2b: Accepted (value decided)
The proposer selects a unique proposal number n and sends Prepare(n) to all acceptors.
Proposer P1 Acceptors
─────────── ─────────
│ [A1] [A2] [A3]
│ Prepare(n=1) │ │ │
│─────────────────────────>│ │ │
│─────────────────────────>│────│ │
│─────────────────────────>│────│────│
│ │ │ │
Acceptor behavior on receiving Prepare(n):
# Acceptor state (promised_n, accepted_n, accepted_v) must be persisted
def on_prepare(n):
    if n > promised_n:
        promised_n = n  # Promise not to accept lower-numbered proposals
        return Promise(n, accepted_n, accepted_v)
    else:
        return Reject()  # Already promised a higher proposal number
Acceptors respond with:
- Promise to not accept proposals with number < n
- The highest-numbered proposal they’ve already accepted (if any)
Proposer P1 Acceptors
─────────── ─────────
│ [A1] [A2] [A3]
│<───Promise(1,∅,∅)─────────│ │ │
│<───Promise(1,∅,∅)──────────────│ │
│<───Promise(1,∅,∅)───────────────────│
│ │ │ │
P1 has promises from majority (3/3), can proceed!
If the proposer receives promises from a majority, it sends Accept(n, v), choosing v as:
- The value of the highest-numbered proposal reported in the promises, OR
- Its own proposed value, if no promise reported a prior accept
Proposer P1 Acceptors
─────────── ─────────
│ Accept(1, "X") [A1] [A2] [A3]
│─────────────────────────>│ │ │
│─────────────────────────>│────│ │
│─────────────────────────>│────│────│
Acceptor behavior on receiving Accept(n, v):
def on_accept(n, v):
    if n >= promised_n:
        promised_n = n
        accepted_n = n
        accepted_v = v
        return Accepted(n, v)
    else:
        return Reject()  # Promised a higher proposal number
If the proposer receives Accepted responses from a majority, the value is chosen!
Proposer P1 Acceptors
─────────── ─────────
│ [A1] [A2] [A3]
│<───Accepted(1,"X")────────│ │ │
│<───Accepted(1,"X")─────────────│ │
│<───Accepted(1,"X")──────────────────│
│ │ │ │
│ Value "X" is DECIDED! │ │ │
│ │ │ │
Notify learners...
Paxos with Failures
This is where Paxos shines: handling failures and competing proposers.
┌─────────────────────────────────────────────────────────────────────────────┐
│ PAXOS HANDLING COMPETING PROPOSERS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Scenario: P1 and P2 both try to propose │
│ │
│ Time P1 Acceptors P2 │
│ ──── ── ───────── ── │
│ t1 Prepare(1) ────> A1,A2,A3 │
│ t2 <──── Promise(1,∅) │
│ t3 Prepare(2) ────> A1,A2,A3 │
│ t4 <──── Promise(2,∅) │
│ t5 Accept(1,"X") ──> A1,A2,A3 │
│ t6 ──── REJECT! ──── (promised n=2 already) │
│ t7 Accept(2,"Y") ──> A1,A2,A3 │
│ t8 <──── Accepted(2,"Y") │
│ │
│ Result: "Y" is chosen (P2 won with higher proposal number) │
│ │
│ P1 must retry with n > 2 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Why Paxos is Safe
THEOREM: If a value v is chosen, any future proposal will have value v.
PROOF INTUITION:
─────────────────
1. v is chosen → accepted by majority M1
2. Future proposal n' > n must get promises from majority M2
3. M1 ∩ M2 ≠ ∅ (majorities overlap)
4. At least one acceptor in M2 already accepted v
5. That acceptor tells proposer about v in Promise
6. Proposer MUST use v (highest accepted value)
7. Therefore, v is preserved
This is why proposal numbers and the "use highest accepted value" rule exist!
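To see where that rule lives in code, here is a minimal proposer-side sketch. It assumes acceptor objects that expose the on_prepare/on_accept pseudocode above, synchronous calls, and no retries; the Promise/Accepted/Reject types are stand-ins:

```python
from collections import namedtuple

Promise = namedtuple("Promise", "n accepted_n accepted_v")
Accepted = namedtuple("Accepted", "n v")
Reject = namedtuple("Reject", "")

def run_paxos_proposer(n, my_value, acceptors):
    """One round of Basic Paxos from the proposer's side (sketch, no retries)."""
    majority = len(acceptors) // 2 + 1

    # Phase 1: Prepare / Promise
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [p for p in replies if isinstance(p, Promise)]
    if len(promises) < majority:
        return None  # Rejected; retry with a higher proposal number

    # The rule that makes Paxos safe: if any acceptor already accepted a value,
    # we must re-propose the value with the highest accepted proposal number.
    accepted = [p for p in promises if p.accepted_n is not None]
    value = max(accepted, key=lambda p: p.accepted_n).accepted_v if accepted else my_value

    # Phase 2: Accept / Accepted
    acks = [a.on_accept(n, value) for a in acceptors]
    if sum(isinstance(a, Accepted) for a in acks) >= majority:
        return value  # Chosen!
    return None  # Retry with a higher proposal number
```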
Multi-Paxos
Basic Paxos decides ONE value. Multi-Paxos decides a SEQUENCE of values:
┌─────────────────────────────────────────────────────────────────────────────┐
│ MULTI-PAXOS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ OPTIMIZATION: Elect a stable leader, skip Phase 1 │
│ │
│ Log Index: 1 2 3 4 5 │
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
│ Log: │ v1 │ │ v2 │ │ v3 │ │ v4 │ │ v5 │ │
│ └────┘ └────┘ └────┘ └────┘ └────┘ │
│ │ │ │ │ │ │
│ Decided Decided Decided Pending Pending │
│ │
│ LEADER OPTIMIZATION: │
│ ──────────────────── │
│ 1. Leader runs Phase 1 ONCE for all future slots │
│ 2. Each new command: just Phase 2 (Accept/Accepted) │
│ 3. Much faster! (2 message delays instead of 4) │
│ │
│ IF LEADER FAILS: │
│ ──────────────── │
│ New leader runs Phase 1 for any undecided slots │
│ May discover in-progress proposals and complete them │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Module 8: Raft Consensus (Deep Dive)
Most Asked in Interviews: Raft is designed to be understandable. You MUST be able to explain it clearly.
Why Raft?
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAFT vs PAXOS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PAXOS PROBLEMS: RAFT SOLUTIONS: │
│ ─────────────── ─────────────── │
│ - Hard to understand - Designed for understandability │
│ - Many variants, unclear which to use - One clear algorithm │
│ - Difficult to implement correctly - Clearer implementation path │
│ - Symmetric roles - Strong leader model │
│ │
│ RAFT KEY DECISIONS: │
│ ─────────────────── │
│ 1. Strong leader: Only leader handles client requests │
│ 2. Leader election: Randomized timeouts for faster convergence │
│ 3. Log matching: If entries match at index, all prior entries match │
│ 4. Leader completeness: Elected leader has all committed entries │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Raft State Machine
Every Raft node is in one of three states: follower, candidate, or leader. A follower becomes a candidate when its election timeout fires, a candidate becomes leader when it wins a majority of votes, and any node that observes a higher term immediately reverts to follower.
Raft Terms
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAFT TERMS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Time is divided into TERMS of arbitrary length: │
│ │
│   Term 1            Term 2       Term 3            Term 4                   │
│  ┌────────────────┬────────────┬─────────────────┬─────────────────┬─ ─ ─  │
│  │ Election +     │ Election   │ Election +      │ Election +      │  ...  │
│  │ Normal         │ (failed:   │ Normal          │ Normal          │       │
│  │ Operation      │ split vote,│ Operation       │ Operation       │       │
│  │ (leader)       │ no leader) │ (leader)        │ (leader)        │       │
│  └────────────────┴────────────┴─────────────────┴─────────────────┴─ ─ ─  │
│ │
│ TERM RULES: │
│ ─────────── │
│ • Each term starts with election │
│ • At most ONE leader per term │
│ • Higher term always wins │
│ • If node sees higher term → immediately becomes follower │
│ • Terms act as logical clock │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
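The "higher term always wins" rule is usually implemented as a single check run on every RPC a node sends or receives. A minimal sketch; the node fields and persist() helper are illustrative:

```python
def observe_term(node, rpc_term):
    """Raft term rule: seeing a higher term immediately demotes this node."""
    if rpc_term > node.current_term:
        node.current_term = rpc_term   # adopt the newer term
        node.voted_for = None          # vote not yet cast in the new term
        node.state = "follower"        # leaders and candidates step down
        node.persist()                 # term and votedFor must be durable
```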
Leader Election
- Election Trigger
- RequestVote RPC
- Election Process
FOLLOWER TIMEOUT:
─────────────────
Follower Leader
│ │
│ <── Heartbeat ─────────│ (every 150ms)
│ │
│ (waiting...) │
│ │
│ (timeout: 300ms) X (leader dies)
│
│ No heartbeat received!
│ Convert to CANDIDATE
│ Start election
▼
TIMEOUT: Random between 150-300ms
(Randomness prevents split votes)
# Candidate sends to all nodes
RequestVote {
term: int, # Candidate's term
candidateId: int, # Candidate requesting vote
lastLogIndex: int, # Index of candidate's last log entry
lastLogTerm: int # Term of candidate's last log entry
}
# Receiver's implementation
def handle_request_vote(request):
    # Reject if candidate's term is old
    if request.term < currentTerm:
        return VoteResponse(term=currentTerm, voteGranted=False)

    # Update term if candidate's is newer
    if request.term > currentTerm:
        currentTerm = request.term
        votedFor = None
        convert_to_follower()

    # Grant vote if:
    #  1. Haven't voted in this term, OR already voted for this candidate
    #  2. Candidate's log is at least as up-to-date as ours
    log_ok = (request.lastLogTerm > my_last_log_term or
              (request.lastLogTerm == my_last_log_term and
               request.lastLogIndex >= my_last_log_index))

    if (votedFor is None or votedFor == request.candidateId) and log_ok:
        votedFor = request.candidateId
        reset_election_timer()
        return VoteResponse(term=currentTerm, voteGranted=True)

    return VoteResponse(term=currentTerm, voteGranted=False)
CANDIDATE BEHAVIOR:
───────────────────
1. Increment currentTerm
2. Vote for self
3. Reset election timer
4. Send RequestVote to all other nodes
OUTCOMES:
─────────
(a) Wins election (majority votes)
→ Become leader
→ Send heartbeats immediately
(b) Another node wins
→ Receive AppendEntries from new leader
→ Convert to follower
(c) Election timeout (no winner)
→ Increment term, start new election
→ Randomized timeout prevents repeated ties
EXAMPLE (5 nodes, S1-S5):
─────────────────────────
Term 5 Election:
S1: Times out first, becomes candidate
Votes for self (1 vote)
Sends RequestVote to S2, S3, S4, S5
S2: Receives RequestVote, grants vote
S3: Receives RequestVote, grants vote
S4: Already voted for self (also candidate)
S5: Receives RequestVote, grants vote
S1: Has 4 votes (S1, S2, S3, S5) = MAJORITY
S1: Becomes LEADER of term 5
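A compact candidate-side sketch of the election just described. It assumes the node fields used earlier plus hypothetical send_request_vote, send_heartbeats, observe_term, and now() helpers; RPCs are shown sequentially for clarity (real implementations send them in parallel):

```python
import random

def start_election(node, peers):
    """Follower timed out: become candidate and request votes (sketch)."""
    node.state = "candidate"
    node.current_term += 1
    node.voted_for = node.id
    node.persist()
    node.election_deadline = now() + random.uniform(0.150, 0.300)  # randomized timeout

    votes = 1  # voted for self
    for peer in peers:
        reply = send_request_vote(peer,
                                  term=node.current_term,
                                  candidate_id=node.id,
                                  last_log_index=node.last_log_index(),
                                  last_log_term=node.last_log_term())
        if reply.term > node.current_term:
            observe_term(node, reply.term)   # someone is ahead of us: step down
            return
        if reply.vote_granted:
            votes += 1

    if votes > (len(peers) + 1) // 2:        # majority of the full cluster
        node.state = "leader"
        send_heartbeats(node, peers)         # assert leadership immediately
```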
Log Replication
The core of Raft: replicating commands across the cluster.
┌─────────────────────────────────────────────────────────────────────────────┐
│ RAFT LOG STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Leader's Log: │
│ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │
│ │ 1,1 │ 1,2 │ 1,3 │ 2,4 │ 2,5 │ 3,6 │ 3,7 │ 3,8 │ │
│ │ x←1 │ y←2 │ x←3 │ y←4 │ z←5 │ x←6 │ y←7 │ z←8 │ │
│ └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ │
│ 1 2 3 4 5 6 7 8 ← Log Index │
│ │
│ Entry format: (term, index) command │
│ │
│ COMMIT INDEX: │
│ ───────────── │
│ An entry is COMMITTED when replicated to majority │
│ │
│ Leader: commitIndex = 6 (entries 1-6 are committed) │
│ Followers: Apply entries up to commitIndex │
│ │
│ SAFETY: Once committed, entry is NEVER overwritten │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
AppendEntries RPC
# Leader sends to followers
AppendEntries {
term: int, # Leader's term
leaderId: int, # So follower can redirect clients
prevLogIndex: int, # Index of entry before new ones
prevLogTerm: int, # Term of prevLogIndex entry
entries: List[Entry], # Log entries to store (empty for heartbeat)
leaderCommit: int # Leader's commitIndex
}
# Follower's implementation
def handle_append_entries(request):
    # Reject if leader's term is old
    if request.term < currentTerm:
        return AppendResponse(term=currentTerm, success=False)

    # Reset election timer (a valid leader is alive)
    reset_election_timer()

    # Update term if needed
    if request.term > currentTerm:
        currentTerm = request.term
        convert_to_follower()

    # LOG MATCHING: check that we hold the entry preceding the new ones
    if not log_contains(request.prevLogIndex, request.prevLogTerm):
        # Tell the leader to back up and retry
        return AppendResponse(term=currentTerm, success=False)

    # APPEND NEW ENTRIES (log is 1-indexed here, as in the Raft paper)
    for i, entry in enumerate(request.entries):
        index = request.prevLogIndex + 1 + i
        if index <= len(log) and log[index].term != entry.term:
            # Conflict: delete this entry and all entries that follow it
            del log[index:]
        if index > len(log):
            log.append(entry)

    # UPDATE COMMIT INDEX
    if request.leaderCommit > commitIndex:
        commitIndex = min(request.leaderCommit, len(log))
        apply_committed_entries()

    return AppendResponse(term=currentTerm, success=True)
Log Matching Property
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOG MATCHING PROPERTY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ IF two entries in different logs have same index AND term: │
│ ├── They store the same command │
│ └── All preceding entries are identical │
│ │
│ HOW IT'S MAINTAINED: │
│ ──────────────────── │
│ 1. Leader creates entries with current term │
│ 2. AppendEntries includes prevLogIndex and prevLogTerm │
│ 3. Follower rejects if previous entry doesn't match │
│ 4. Leader backs up and retries until match found │
│ │
│ EXAMPLE - Repairing Inconsistent Log: │
│ │
│ Leader: [1,1] [1,2] [2,3] [2,4] [3,5] [3,6] │
│ Follower: [1,1] [1,2] [2,3] [2,4] (behind) │
│ │
│ AppendEntries(prevLogIndex=4, prevLogTerm=2, entries=[3,5][3,6]) │
│ Follower: Entry at index 4, term 2 exists? ✓ │
│ Follower: Append entries 5 and 6 │
│ Result: [1,1] [1,2] [2,3] [2,4] [3,5] [3,6] (matches!) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
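The follower-side checks above have a leader-side counterpart: the leader tracks a nextIndex per follower and walks it backwards on rejection until the logs match. A sketch under the same assumptions as the earlier pseudocode (1-indexed log, hypothetical send_append_entries and advance_commit_index helpers):

```python
def replicate_to(leader, follower_id):
    """Leader side of log repair: back up nextIndex until AppendEntries succeeds."""
    next_idx = leader.next_index[follower_id]
    prev_idx = next_idx - 1
    reply = send_append_entries(follower_id,
                                term=leader.current_term,
                                prev_log_index=prev_idx,
                                prev_log_term=leader.term_at(prev_idx),
                                entries=leader.log[next_idx:],
                                leader_commit=leader.commit_index)
    if reply.success:
        leader.match_index[follower_id] = len(leader.log)
        leader.next_index[follower_id] = len(leader.log) + 1
        advance_commit_index(leader)              # commit entries on a majority
    else:
        # Log mismatch: back up one entry and retry on the next heartbeat
        leader.next_index[follower_id] = max(1, next_idx - 1)
```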
Safety: Leader Completeness
Critical Property: If a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher terms.
HOW RAFT ENSURES THIS:
──────────────────────
1. ELECTION RESTRICTION:
Voter denies vote if candidate's log is less up-to-date
"Up-to-date" comparison:
- Compare last entry's term (higher is better)
- If same term, compare last entry's index (longer is better)
2. LEADER ONLY COMMITS ENTRIES FROM CURRENT TERM:
Leader doesn't count old-term entries as committed
until a new entry from current term is replicated
Why? Prevents overwriting committed entries
EXAMPLE (The Figure 8 scenario):
─────────────────────────────────
Term 2: Leader S1 replicates entry at index 2 to S2
Crashes before committing
Term 3: S5 becomes leader (didn't have index 2)
Appends entry at index 2 with term 3
Term 4: S1 comes back, becomes leader
CAN'T just count term-2 entry as committed!
Must replicate a term-4 entry first
This prevents the term-2 entry from being overwritten
by the term-3 entry that S5 had
Handling Cluster Membership Changes
┌─────────────────────────────────────────────────────────────────────────────┐
│ JOINT CONSENSUS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: Adding/removing nodes during operation │
│ │
│ DANGER: If we switch configs directly: │
│ │
│ Old: {A, B, C} New: {A, B, C, D, E} │
│ │
│ During transition: │
│ - {A, B} thinks old config, majority = 2 │
│ - {C, D, E} thinks new config, majority = 3 │
│ - BOTH could be majorities! Two leaders possible! │
│ │
│ SOLUTION: JOINT CONSENSUS │
│ ───────────────────────── │
│ │
│ Phase 1: C_old,new (joint configuration) │
│ Decisions need majorities in BOTH old AND new │
│ │
│ Phase 2: C_new (new configuration) │
│ Only after C_old,new is committed │
│ │
│ Timeline: │
│ ───────── │
│ │ │ │ │
│ C_old│ C_old,new │ C_old,new │ C_new │
│ │ entry │ committed │ entry │
│ │ created │ │ created │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ──────┬──────────────┬────────────────┬────────────────► │
│ │
│ SIMPLER ALTERNATIVE: Single-server changes (one at a time) │
│ ───────────────────── │
│ Many implementations use this simpler approach │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Log Compaction (Snapshots)
┌─────────────────────────────────────────────────────────────────────────────┐
│ SNAPSHOTS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: Log grows forever │
│ │
│ SOLUTION: Periodically take snapshots │
│ │
│ Before snapshot: │
│ ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐ │
│ │ x←1 │ y←2 │ x←3 │ y←4 │ x←5 │ y←6 │ x←7 │ y←8 │ │
│ └─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘ │
│ 1 2 3 4 5 6 7 8 │
│ ▲ │
│ committed │
│ │
│ After snapshot (at index 5): │
│ ┌────────────────────┬─────┬─────┬─────┐ │
│ │ SNAPSHOT │ y←6 │ x←7 │ y←8 │ │
│ │ lastIndex=5 │ │ │ │ │
│ │ lastTerm=2 │ │ │ │ │
│ │ state={x:5, y:4} │ │ │ │ │
│ └────────────────────┴─────┴─────┴─────┘ │
│ 6 7 8 │
│ │
│ INSTALLSNAPSHOT RPC: │
│ Used when follower is too far behind │
│ Leader sends its snapshot instead of replaying entries │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
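A minimal sketch of the snapshot step shown above. Field names follow the diagram; serialize_state and persist_snapshot are stand-ins for whatever the state machine and storage layer provide:

```python
def take_snapshot(node, upto_index):
    """Discard log entries covered by a snapshot of the applied state."""
    assert upto_index <= node.last_applied        # only snapshot applied entries
    node.snapshot = {
        "lastIndex": upto_index,
        "lastTerm": node.term_at(upto_index),
        "state": serialize_state(node.state_machine),  # e.g. {x: 5, y: 4}
    }
    persist_snapshot(node.snapshot)               # durable before dropping the log
    node.log = node.log[upto_index:]              # keep only entries after the snapshot
```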
Module 9: Viewstamped Replication
An alternative consensus protocol whose publication predates Paxos:
┌─────────────────────────────────────────────────────────────────────────────┐
│ VIEWSTAMPED REPLICATION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SIMILAR TO RAFT: │
│ ──────────────── │
│ • Strong leader (called "primary") │
│ • Log replication │
│ • View changes (like term changes) │
│ │
│ KEY DIFFERENCES: │
│ ──────────────── │
│ • View-change is more complex │
│ • Different recovery protocol │
│ • Historically important (1988, before Paxos publication) │
│ │
│ VIEW CHANGE PROTOCOL: │
│ ───────────────────── │
│ 1. New primary collects state from f+1 replicas │
│ 2. Computes new state (union of all logs) │
│ 3. Sends StartView to all replicas │
│ 4. Replicas update and respond │
│ │
│ WHEN TO USE VR vs RAFT: │
│ ──────────────────────── │
│ Usually prefer Raft (simpler, better understood) │
│ VR useful for academic understanding │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Module 10: ZAB (Zookeeper Atomic Broadcast)
ZAB Protocol
┌─────────────────────────────────────────────────────────────────────────────┐
│ ZAB │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Designed specifically for Zookeeper's requirements: │
│ • Primary order: All updates from a leader in order │
│ • Prefix property: If transaction T is delivered, all before T too │
│ │
│ PHASES: │
│ ─────── │
│ │
│ 1. LEADER ELECTION │
│ └── Fast leader election using voting │
│ └── Leader chosen based on highest zxid (transaction id) │
│ │
│ 2. DISCOVERY │
│ └── Leader learns about all accepted transactions │
│ └── Collects epoch information from followers │
│ │
│ 3. SYNCHRONIZATION │
│ └── Leader syncs followers to its state │
│ └── Handles DIFF, TRUNC, or SNAP updates │
│ │
│ 4. BROADCAST │
│ └── Normal operation - leader broadcasts updates │
│ └── Two-phase commit: PROPOSAL → ACK → COMMIT │
│ │
│ ZXID (Transaction ID): │
│ ────────────────────── │
│ 64-bit: [epoch (32 bits)][counter (32 bits)] │
│ │
│ Epoch increases with each new leader │
│ Counter increases with each transaction │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
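The zxid layout above can be expressed directly with bit operations; a small illustrative sketch:

```python
def make_zxid(epoch, counter):
    """ZAB transaction id: high 32 bits = leader epoch, low 32 bits = counter."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def split_zxid(zxid):
    return zxid >> 32, zxid & 0xFFFFFFFF   # (epoch, counter)

# Comparing zxids orders transactions by epoch first, then by counter:
assert make_zxid(2, 0) > make_zxid(1, 999_999)
```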
Zookeeper Guarantees
GUARANTEES:
───────────
1. Linearizable Writes: All writes go through leader, ordered globally
2. FIFO Client Order: Client's operations processed in order
3. Wait-free Reads: Reads can go to any server (possibly stale)
4. Watch Notifications: Clients notified of changes
USE CASES:
──────────
• Distributed locking
• Leader election
• Configuration management
• Service discovery
• Cluster membership
Comparison: Paxos vs Raft vs ZAB
┌─────────────────┬──────────────┬──────────────┬──────────────┐
│ │ PAXOS │ RAFT │ ZAB │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Leader │ Optional │ Required │ Required │
│ Elections │ Implicit │ Explicit │ Explicit │
│ Log Gaps │ Allowed │ Not allowed │ Not allowed │
│ Understandable │ Harder │ Easier │ Medium │
│ Implementations │ Many │ Many │ Zookeeper │
│ Year │ 1989 │ 2014 │ 2011 │
├─────────────────┼──────────────┼──────────────┼──────────────┤
│ Used By │ Chubby │ etcd │ Zookeeper │
│ │ Spanner │ Consul │ Kafka │
│ │ CockroachDB │ TiKV │ │
└─────────────────┴──────────────┴──────────────┴──────────────┘
Advanced: EPaxos (Egalitarian Paxos)
Staff+ Level: EPaxos is an advanced topic that shows deep understanding of consensus trade-offs. It’s less commonly asked but demonstrates expertise.
┌─────────────────────────────────────────────────────────────────────────────┐
│ EPAXOS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ THE PROBLEM WITH LEADER-BASED CONSENSUS: │
│ ───────────────────────────────────────── │
│ • All requests go through leader (bottleneck) │
│ • Leader failure triggers expensive election │
│ • Cross-datacenter: high latency to leader │
│ │
│ EPAXOS KEY INSIGHT: │
│ ─────────────────── │
│ "If commands don't conflict, they can commit independently" │
│ │
│ LEADERLESS CONSENSUS: │
│ ───────────────────── │
│ • ANY replica can be "command leader" for any command │
│ • Conflicting commands require extra round │
│ • Non-conflicting commands: 1 round-trip (fast path) │
│ │
│ CONFLICT DETECTION: │
│ ─────────────────── │
│ Commands conflict if they access same keys with at least one write │
│ │
│ get(x), get(x) → NO conflict (both reads) │
│ get(x), put(x,1) → CONFLICT (one is write) │
│ put(x,1), put(y,2) → NO conflict (different keys) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
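The conflict rule in the box above is easy to state in code. A sketch where commands are (op, key, ...) tuples; this representation is an assumption for illustration, not EPaxos's own data model:

```python
def conflicts(cmd_a, cmd_b):
    """Two commands conflict if they touch the same key and at least one writes."""
    op_a, key_a = cmd_a[0], cmd_a[1]
    op_b, key_b = cmd_b[0], cmd_b[1]
    if key_a != key_b:
        return False                        # different keys never conflict
    return op_a == "put" or op_b == "put"   # same key: conflict unless both are reads

assert not conflicts(("get", "x"), ("get", "x"))
assert conflicts(("get", "x"), ("put", "x", 1))
assert not conflicts(("put", "x", 1), ("put", "y", 2))
```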
EPaxos Protocol Flow
FAST PATH (no conflicts):
────────────────────────
Client ──> R1 (command leader for cmd A)
│
│ PreAccept(A)
├──────────────────> R2 ─┐
├──────────────────> R3 ─┼─ All reply "OK, no conflicts"
├──────────────────> R4 ─┤
├──────────────────> R5 ─┘
│
│ <── PreAcceptOK ────────
│ (from fast quorum: ⌊3N/4⌋)
│
│ Commit(A)
├──────────────────> All replicas
│
Result: 2 message delays (same as Multi-Paxos with stable leader!)
SLOW PATH (conflicts detected):
───────────────────────────────
R1: PreAccept(A) includes deps = {}
R2: Already has B that conflicts with A
R2: PreAcceptOK with deps = {B}
R1: Sees different deps from different replicas
R1: Must run Paxos Accept phase to agree on deps
Additional round:
R1: Accept(A, deps={B})
All: AcceptOK
Result: 4 message delays (slower, but still correct)
When to Consider EPaxos
┌─────────────────────────────────────────────────────────────────────────────┐
│ EPAXOS TRADE-OFFS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ADVANTAGES: │
│ ─────────── │
│ ✓ No leader bottleneck │
│ ✓ Lower latency for clients near any replica │
│ ✓ No leader election overhead │
│ ✓ Better for geo-distributed deployments │
│ ✓ Fast path = same latency as optimized Paxos │
│ │
│ DISADVANTAGES: │
│ ───────────── │
│ ✗ More complex to implement correctly │
│ ✗ Slower when there are many conflicts │
│ ✗ More complex recovery │
│ ✗ Fewer battle-tested implementations │
│ │
│ USE WHEN: │
│ ───────── │
│ • Multi-datacenter deployment │
│ • Low conflict workload (different clients access different data) │
│ • Read-heavy or partition-friendly workload │
│ • Need to eliminate leader as single point of latency │
│ │
│ DON'T USE WHEN: │
│ ─────────────── │
│ • High conflict rate (hot keys) │
│ • Single datacenter (Raft is simpler) │
│ • Team doesn't have consensus expertise │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Practical Implementation Guide
Building a Production Consensus System
1. Start with Raft (Not Paxos)
Raft is designed for understandability. If you’re building consensus from scratch:
# Recommended approach
1. Use an existing library (etcd/raft, hashicorp/raft)
2. If building from scratch, follow MIT 6.824 lab structure
3. Write extensive tests before anything else
2. Test Adversarially
# Critical test scenarios
test_cases = [
"leader_crashes_during_commit",
"network_partition_minority_vs_majority",
"split_brain_scenario",
"slow_follower_catches_up",
"all_nodes_restart_simultaneously",
"leader_isolated_then_rejoins",
"log_divergence_after_multiple_elections",
"snapshot_during_compaction",
]
# Use chaos testing
chaos_scenarios = [
"random_network_delays(10-500ms)",
"random_message_drops(5%)",
"random_node_restarts",
"clock_skew_between_nodes",
]
3. Performance Tuning
# Key parameters to tune
raft_config:
heartbeat_interval: 150ms # Leader sends heartbeat
election_timeout_min: 300ms # Min timeout for election
election_timeout_max: 500ms # Max timeout (randomized)
max_entries_per_append: 100 # Batch size for efficiency
snapshot_threshold: 10000 # Entries before snapshot
# Trade-offs:
# - Lower heartbeat = faster failure detection, more overhead
# - Higher election timeout = fewer spurious elections, slower recovery
# - Larger batch = higher throughput, higher latency per request
4. Production Checklist
□ Persistent storage for term, votedFor, and log
□ Proper fsync on commit (don't lose acknowledged writes)
□ Snapshot mechanism for log compaction
□ Membership change handling (or limit to single-server changes)
□ Pre-vote optimization (prevents disruption from stale candidates)
□ Leader lease for efficient reads
□ Metrics: election count, replication lag, commit latency
□ Alerting on: leader changes, replication lag, log size
Common Implementation Bugs
These bugs have caused real production outages. Watch out for them!
# BUG 1: Not persisting state before responding
def on_request_vote(request):
    if should_grant_vote(request):
        voted_for = request.candidate_id
        # BUG: Must persist BEFORE responding!
        # persist_state()  # <-- MISSING
        return VoteGranted()
        # If a crash happens here, you might vote twice in the same term!

# BUG 2: Incorrect leader lease
def handle_read(request):
    if self.is_leader():
        # BUG: Might not be leader anymore!
        # Must check a leader lease or use a read index
        return self.state_machine.read(request.key)

# BUG 3: Commit index going backwards
def apply_committed_entries():
    # BUG: Must ensure commit_index only increases
    # Snapshot installation can reset the state machine,
    # but commit_index should reflect reality
    ...

# BUG 4: Not handling log conflicts during append
def append_entries(request):
    for entry in request.entries:
        # BUG: Must delete conflicting entries!
        # If the local log has a different term at the same index,
        # delete it and all following entries
        ...
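For contrast, a corrected version of BUG 1 under the same assumptions (should_grant_vote, persist_state, and the response types are the same hypothetical helpers): the vote must be durable before the reply leaves the node.

```python
# FIX for BUG 1: persist the vote before replying
def on_request_vote(request):
    if should_grant_vote(request):
        voted_for = request.candidate_id
        persist_state()        # fsync term/votedFor to stable storage first
        return VoteGranted()   # a crash + restart can no longer cause a second vote
    return VoteDenied()
```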
Key Interview Questions
Q: Walk me through a Raft leader election
Answer structure:
- Follower times out (no heartbeat from leader)
- Converts to candidate, increments term
- Votes for itself, sends RequestVote to all
- Other nodes vote (if haven’t voted, log is up-to-date)
- If majority votes → becomes leader, sends heartbeats
- If receives AppendEntries from valid leader → becomes follower
- If timeout again → new election with higher term
- Randomized timeouts prevent split votes
- Log up-to-date check ensures safety
- At most one leader per term
Q: What happens if the Raft leader fails during commit?
Answer:
Scenario: the leader has replicated an entry to only 2 of 5 nodes, then crashes
- Followers detect the leader failure (election timeout) and start a new election
- The entry is NOT committed (no majority), so the client was never acknowledged
- A node without the entry can still win the election; the election restriction only guarantees that committed entries survive
- If the new leader has the entry, it can finish replicating it, but only counts it committed once an entry from its own term reaches a majority
- If the new leader lacks the entry, the entry is overwritten; that is safe because it was never committed
- If the entry had reached a majority (3/5) before the crash, the election restriction guarantees every possible new leader has it
Q: How does Paxos handle competing proposers?
Answer:
- Each proposer picks unique proposal number (higher wins)
- Acceptor promises to reject lower numbers
- If proposer’s Prepare rejected → retry with higher number
- In Accept phase, use highest accepted value from promises
- System makes progress when one proposer “wins”
Q: Why can't consensus be achieved in asynchronous systems?
Answer:
FLP Impossibility proves this:
- Can’t distinguish slow from dead
- Adversarial schedule can always prevent decision
- Must choose: safety OR guaranteed termination
Q: Design a distributed lock using consensus
Answer:
1. Lock request → propose "lock(client_id, lock_name)"
2. Use consensus to order all lock requests
3. Apply in order: first requester gets lock
4. Store lock state in replicated state machine
5. Unlock: consensus on "unlock(client_id, lock_name)"
6. Add lease/TTL to handle client failures
Implementation options:
- Zookeeper ephemeral nodes
- etcd lease-based locks
- Consul sessions
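A sketch of steps 3-4 from the answer: the lock itself is just a tiny deterministic state machine that every replica applies in the consensus-agreed order. Names here are illustrative, not a specific product's API, and lease/TTL handling (step 6) is omitted:

```python
class LockStateMachine:
    """Applies consensus-ordered lock/unlock commands; first requester wins."""
    def __init__(self):
        self.holders = {}   # lock_name -> client_id

    def apply(self, cmd):
        op, client_id, lock_name = cmd
        if op == "lock":
            if lock_name not in self.holders:
                self.holders[lock_name] = client_id
                return "granted"
            return "denied"                        # someone already holds it
        if op == "unlock":
            if self.holders.get(lock_name) == client_id:
                del self.holders[lock_name]
                return "released"
            return "not-holder"

# Because every replica applies the same ordered commands,
# every replica agrees on who holds each lock.
sm = LockStateMachine()
assert sm.apply(("lock", "client-A", "leases/db")) == "granted"
assert sm.apply(("lock", "client-B", "leases/db")) == "denied"
```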
Hands-On Project: Implement Raft
Raft Implementation Lab
Build a complete Raft implementation in four phases:
Phase 1: Leader Election
- Implement RequestVote RPC
- Handle election timeouts
- State transitions (follower → candidate → leader)
Phase 2: Log Replication
- Implement AppendEntries RPC
- Log matching checks
- Commit index tracking
Phase 3: Persistence
- Persist term, votedFor, log
- Recovery after restart
Phase 4: Log Compaction
- Implement log compaction
- InstallSnapshot RPC
Resources:
- MIT 6.824 Labs
- Raft visualization: raft.github.io
- etcd/raft source code
Next Steps
Continue to Track 3: Replication Strategies
Learn single-leader, multi-leader, and leaderless replication patterns