Raft Consensus Deep Dive
This chapter is designed as a from-scratch Raft mastery guide. The goal is that after working through it, you can:
- Explain Raft on a whiteboard under interview pressure
- Reason about safety and liveness properties formally
- Implement Raft in a language of your choice (Go, Java, Rust, etc.)
- Debug real-world Raft issues (e.g., in etcd, Consul, TiKV)
Target Audience: Senior / Staff engineers
Prerequisites:
- You have already read the high-level Consensus chapter
- You understand logs, replication, and basic reliability concepts
1. Problem Setting and Mental Model
Before code, internalize the mental model Raft operates in.
1.1 System Model
- Nodes: N servers, typically N = 3 or 5
- Failures: Crash-stop failures (no Byzantine behavior)
- Network:
- Messages may be delayed, reordered, or dropped
- Messages are not corrupted (we rely on lower layers for that)
- Assumption: The system is asynchronous most of the time, but there are periods of partial synchrony where timeouts are meaningful (this is how we dodge FLP in practice).
1.2 Safety vs Liveness in Raft
- Safety (never bad):
- At most one value is committed at index i
- A committed log entry is never lost or changed
- All committed entries are applied in order on every server
- Liveness (eventually good):
- As long as a majority of nodes are available and the network is stable enough, the system continues to elect leaders and process client commands.
2. High-Level Design
2.1 Node Roles
Each Raft server is always in exactly one of three roles:
- Follower: Passive, responds to requests from others
- Candidate: Tries to become leader
- Leader: Handles client writes and manages log replication
2.2 Terms as a Logical Clock
Time is divided into terms:
- Each term starts with an election
- At most one leader per term
- Terms are monotonically increasing integers
Every server persists its currentTerm. Whenever it sees a higher term in a message, it immediately:
- Updates its currentTerm
- Steps down to Follower
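A minimal Go sketch of this rule; the type and field names here are illustrative, not taken from any particular library. Every RPC handler (and reply handler) applies it before doing anything else.

```go
package raft

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

type Server struct {
	currentTerm int
	votedFor    int // -1 means "no vote cast in this term"
	role        Role
}

// maybeStepDown is called at the start of every RPC handler and every reply
// handler: if the peer's term is newer, adopt it and revert to follower.
func (s *Server) maybeStepDown(peerTerm int) {
	if peerTerm > s.currentTerm {
		s.currentTerm = peerTerm
		s.votedFor = -1
		s.role = Follower
		// currentTerm and votedFor must be persisted before replying (see Step 5).
	}
}
```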
2.3 Persistent and Volatile State
On each server, Raft separates persistent (must survive crash) and volatile state.
- Persistent state (stored on stable storage):
- currentTerm – latest term this server has seen
- votedFor – candidate ID that received this server’s vote in the current term (or null)
- log[] – array of log entries; each entry has {term, command}
- Volatile state on all servers:
- commitIndex – index of highest log entry known to be committed
- lastApplied – index of highest log entry applied to the state machine
- Volatile state on leaders (reinitialized after election):
- nextIndex[peer] – index of the next log entry to send to that follower
- matchIndex[peer] – index of highest log entry known to be replicated on that follower
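One way to lay this state out in Go; the grouping mirrors the lists above and the field names follow the paper. This is a sketch, not a prescribed layout.

```go
package raft

// LogEntry matches the paper's {term, command} pair.
type LogEntry struct {
	Term    int
	Command []byte
}

// PersistentState must be flushed (fsync) to stable storage before the server
// answers any RPC that depends on it.
type PersistentState struct {
	CurrentTerm int
	VotedFor    int // candidate ID voted for in CurrentTerm; -1 if none
	Log         []LogEntry
}

// VolatileState can be rebuilt after a crash.
type VolatileState struct {
	CommitIndex int // highest log index known to be committed
	LastApplied int // highest log index applied to the state machine
}

// LeaderState is thrown away and reinitialized after every election win.
type LeaderState struct {
	NextIndex  map[int]int // per follower: next log index to send
	MatchIndex map[int]int // per follower: highest index known replicated
}
```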
3. The Raft Log and State Machine
3.1 Log Structure
Each server keeps a log:
- Index grows monotonically as new entries are appended
- Each entry is tagged with the term in which it was created by the leader
3.2 Commit Index and Application
- An entry at index i is committed when it is stored on a majority of servers and is known to be durable
- Once committed, leaders and followers apply entries in order to their state machine, as sketched below
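A minimal version of that apply loop, assuming a Server carrying the state from Section 2.3 (log, commitIndex, lastApplied) plus a hypothetical stateMachine field:

```go
type StateMachine interface {
	Apply(command []byte)
}

// applyCommitted applies newly committed entries in log order. It runs on
// leaders and followers alike whenever commitIndex advances. Indexing is
// 1-based, with log[0] as an unused sentinel (a common convention).
func (s *Server) applyCommitted() {
	for s.lastApplied < s.commitIndex {
		s.lastApplied++
		s.stateMachine.Apply(s.log[s.lastApplied].Command)
	}
}
```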
4. RPCs Overview
Raft only uses two RPCs in steady-state operation:
- RequestVote – used during elections
- AppendEntries – used for both heartbeats and log replication
5. Leader Election in Depth
5.1 Election Timeouts and Heartbeats
- Followers expect to hear from a leader via AppendEntries heartbeats at regular intervals (e.g., 100–150 ms)
- If a follower does not receive a heartbeat within its election timeout (e.g., 150–300 ms, randomized), it converts to a Candidate and starts an election.
5.2 Becoming a Candidate
When a follower times out, it increments its currentTerm, converts to Candidate, votes for itself, and sends RequestVote RPCs to all other servers in parallel.
5.3 RequestVote RPC Logic
RequestVote args:
- term – candidate’s term
- candidateId – candidate requesting the vote
- lastLogIndex – index of candidate’s last log entry
- lastLogTerm – term of candidate’s last log entry
- Each server votes at most once per term
- A server grants its vote only if the candidate’s log is at least as up-to-date as its own: the candidate’s last log term must be higher, or equal with a last log index that is at least as large. This guarantees that an elected leader’s log is at least as up-to-date as a majority of the cluster.
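A sketch of the receiver side of RequestVote, building on the Server fields and maybeStepDown helper from the earlier sketches; the up-to-date comparison is the part interviewers probe.

```go
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int
	LastLogIndex int // index of candidate's last log entry
	LastLogTerm  int // term of candidate's last log entry
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

// HandleRequestVote grants at most one vote per term, and only to candidates
// whose log is at least as up-to-date as this server's.
func (s *Server) HandleRequestVote(args RequestVoteArgs) RequestVoteReply {
	s.maybeStepDown(args.Term) // adopt higher terms first (see the §2.2 sketch)

	reply := RequestVoteReply{Term: s.currentTerm}
	if args.Term < s.currentTerm {
		return reply // stale candidate
	}

	alreadyVoted := s.votedFor != -1 && s.votedFor != args.CandidateID

	// "Up-to-date": higher last term wins; equal last terms compare log length.
	lastIndex := len(s.log) - 1
	lastTerm := s.log[lastIndex].Term
	candidateUpToDate := args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)

	if !alreadyVoted && candidateUpToDate {
		s.votedFor = args.CandidateID // persist before replying
		reply.VoteGranted = true
	}
	return reply
}
```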
5.4 Election Outcomes
A candidate continues the election until one of three outcomes happens:
- Candidate receives votes from a majority → becomes Leader
- Candidate receives AppendEntries from another server with a newer or equal term → steps down to Follower
- Election times out without a winner → increments term and starts a new election
5.5 Failure Scenario: Split Vote
Imagine 5 servers: S1–S5. S1 and S2 time out at nearly the same time and both become candidates.
- S1 sends RequestVote(term=5) to everyone
- S2 sends RequestVote(term=5) to everyone
- Voting pattern might end up:
- S1 votes for S1
- S2 votes for S2
- S3 votes for S1
- S4 votes for S2
- S5 hasn’t responded in time
Neither candidate reaches the 3 votes needed for a majority, so both election timers expire and a new election starts in term 6. Randomized election timeouts make it unlikely that the same collision repeats.
6. Log Replication in Depth
Once a leader is elected, it takes over the cluster:
- Clients send all writes to the leader
- Leader appends commands to its log and replicates them via AppendEntries
6.1 AppendEntries RPC
AppendEntries args:
- term – leader’s term
- leaderId – leader’s ID (for redirects)
- prevLogIndex – index of log entry immediately preceding the new ones
- prevLogTerm – term of entry at prevLogIndex
- entries[] – list of new log entries (empty for heartbeat)
- leaderCommit – leader’s commitIndex
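The same arguments expressed as Go structs, continuing the earlier sketches; the comments restate the list above.

```go
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding Entries
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // empty for heartbeats
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int
	Success bool // true if the follower matched PrevLogIndex/PrevLogTerm
}
```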
6.2 Log Matching Property
Log Matching Property: If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries.
How Raft enforces it:
- Leader always includes (prevLogIndex, prevLogTerm) when sending entries
- Follower rejects AppendEntries if it does not have an entry at prevLogIndex with term prevLogTerm
- On rejection, leader backs up nextIndex[follower] and retries until logs match
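A follower-side sketch of that check-and-repair logic, reusing the structs above; resetElectionTimer and the persistence step are assumed helpers.

```go
// HandleAppendEntries enforces the Log Matching Property on the follower.
func (s *Server) HandleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	s.maybeStepDown(args.Term)
	reply := AppendEntriesReply{Term: s.currentTerm}
	if args.Term < s.currentTerm {
		return reply // stale leader
	}
	s.resetElectionTimer() // a current leader exists; don't start elections

	// Consistency check: do we hold the entry the leader says precedes Entries?
	if args.PrevLogIndex >= len(s.log) ||
		s.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return reply // Success stays false; leader backs up nextIndex and retries
	}

	// Append new entries, deleting any conflicting suffix first.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(s.log) && s.log[idx].Term != e.Term {
			s.log = s.log[:idx] // drop the conflicting entry and everything after
		}
		if idx >= len(s.log) {
			s.log = append(s.log, e)
		}
	}
	// The log (and currentTerm/votedFor) must be persisted before replying.

	if args.LeaderCommit > s.commitIndex {
		s.commitIndex = args.LeaderCommit
		if last := args.PrevLogIndex + len(args.Entries); last < s.commitIndex {
			s.commitIndex = last
		}
	}
	reply.Success = true
	return reply
}
```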
6.3 Leader’s Bookkeeping: nextIndex and matchIndex
On election, the leader initializes:
- nextIndex[i] – the next log index the leader will send to follower i (initialized to the leader’s last log index + 1)
- matchIndex[i] – highest index known to be replicated on follower i (initialized to 0)
When a follower responds with success=false, the leader decrements nextIndex[i] and retries.
When a follower responds with success=true, the leader sets matchIndex[i] to the index of the last entry it sent and nextIndex[i] to matchIndex[i] + 1.
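A sketch of that leader-side bookkeeping; sentPrevIndex and sentCount describe the AppendEntries request being answered, and maybeAdvanceCommitIndex is sketched in the next subsection.

```go
// handleAppendReply runs on the leader when a follower answers AppendEntries.
func (s *Server) handleAppendReply(peer, sentPrevIndex, sentCount int, reply AppendEntriesReply) {
	s.maybeStepDown(reply.Term)
	if s.role != Leader {
		return
	}
	if reply.Success {
		s.matchIndex[peer] = sentPrevIndex + sentCount
		s.nextIndex[peer] = s.matchIndex[peer] + 1
		s.maybeAdvanceCommitIndex() // see §6.4
	} else if s.nextIndex[peer] > 1 {
		s.nextIndex[peer]-- // back up one entry; real systems skip whole terms here
	}
}
```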
6.4 Advancing commitIndex on the Leader
The leader periodically checks whether some log index N can be considered committed. N is safe to commit when all of the following hold:
- N > commitIndex
- A majority of servers have matchIndex[i] ≥ N (the leader counts its own copy)
- log[N].term == currentTerm – this last check is essential for Leader Completeness; a leader never commits entries from earlier terms by counting replicas alone
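A sketch of that check in Go, assuming matchIndex as a map over the other servers and a clusterSize field:

```go
// maybeAdvanceCommitIndex commits the highest index N that a majority stores
// and that was created in the leader's current term.
func (s *Server) maybeAdvanceCommitIndex() {
	for n := len(s.log) - 1; n > s.commitIndex; n-- {
		if s.log[n].Term != s.currentTerm {
			break // never commit older-term entries by counting replicas alone
		}
		count := 1 // the leader's own copy
		for _, m := range s.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > s.clusterSize/2 {
			s.commitIndex = n
			s.applyCommitted()
			return
		}
	}
}
```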
7. Safety Invariants and Intuition
Raft’s safety comes from a few core invariants.
7.1 Election Safety
For a given term, at most one leader can be elected.
- Each server votes at most once per term (votedFor)
- A candidate must receive votes from a majority to become leader
- Two distinct candidates cannot both receive votes from a majority because majorities intersect
7.2 Leader Completeness
If a log entry is committed in term T, then that entry will be present in the logs of all leaders for all terms > T.
Key mechanisms:
- Commit rule on leaders: A leader only considers entries from its current term when deciding a new commitIndex
- Election restriction: A server only grants its vote to candidates whose log is at least as up-to-date as its own
Intuition:
- Suppose entry (index X, term T) is committed
- That means a majority of servers store it in their logs
- Any future leader must win votes from a majority, and any two majorities intersect, so at least one of its voters holds the entry
- Due to the “up-to-date” rule, a candidate missing this entry cannot win that voter’s vote
- Therefore any new leader must also contain the entry
7.3 State Machine Safety
If any server has applied a log entry at index i to its state machine, then no other server will ever apply a different command at index i.
The combination of the previous invariants and the Log Matching Property ensures this.
8. Handling Failures and Partitions
Let’s walk through concrete failure scenarios – the kind you’ll get in interviews.
8.1 Leader Crashes Before Replication
Scenario:
- Leader L (term 5) appends entry at index 10
- It sends AppendEntries to followers, but crashes before any follower stores the entry
- A new leader L2 (term 6) is elected, without that entry
Result:
- Entry at index 10 in L’s log is never committed
- New leader’s log does not contain entry 10; it will overwrite L’s uncommitted entry during log repair
8.2 Leader Crashes After Partial Replication
Scenario:
- Leader L (term 5) appends entry at index 10
- Two out of three followers store entry 10
- Leader crashes before updating commitIndex or responding to the client
Result:
- A follower that has entry 10 becomes the new leader
- Because of the “up-to-date” voting rule, a candidate missing entry 10 cannot become leader
- Entry 10 eventually becomes committed and applied
- The client may need to retry the request; to get exactly-once semantics, the state machine must deduplicate retries (e.g., by tagging commands with client IDs and sequence numbers)
8.3 Network Partition: Leader Isolated in Minority
Scenario (5-node cluster):
- Partition: {S1, S2} vs {S3, S4, S5}
- S1 is the leader but ends up in the minority partition
Result:
- S1 continues to think it is leader but cannot replicate to a majority
- It cannot commit new entries → no progress in minority partition
- In majority partition, followers time out and elect a new leader, S3 (higher term)
- S1 receives AppendEntries with higher term from S3
- S1 steps down to follower and updates its log to match the majority
9. Advanced Production Features
While the basic Raft protocol is correct, production systems (like etcd, TiDB, or CockroachDB) require several optimizations for performance and robustness.
9.1 Linearizable Reads
A naive read (reading from the current leader) can be stale if the leader has been partitioned away but doesn’t know it yet (a zombie leader).
Solution 1: Read Index
- Leader receives the read request.
- Leader records its current commitIndex as readIndex.
- Leader sends heartbeats to a majority to confirm it’s still the leader.
- Once confirmed, leader waits until its lastApplied ≥ readIndex.
- Leader returns the value from its state machine.
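A sketch of that Read Index flow on the leader, assuming a hypothetical confirmLeadershipWithMajority helper (one round of heartbeats) and a key-value state machine reachable as s.kv:

```go
// readIndex serves one linearizable read without appending to the log.
// It returns ok=false if this node cannot currently prove it is the leader.
func (s *Server) readIndex(key string) (value []byte, ok bool) {
	if s.role != Leader {
		return nil, false // redirect the client to the known leader
	}
	readIdx := s.commitIndex // 1. record the current commit index

	// 2. one round of heartbeats proves no newer leader exists
	if !s.confirmLeadershipWithMajority() {
		return nil, false
	}
	// 3. wait until the state machine has applied at least up to readIdx
	if s.lastApplied < readIdx {
		s.applyCommitted()
	}
	// 4. answer from the local state machine
	return s.kv.Get(key), true
}
```

One subtlety noted in the Raft dissertation: at the start of its term, a leader should first commit an entry from its own term (typically a no-op) so that its commitIndex is known to be current before serving reads.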
Solution 2: Lease Reads
- Leader maintains a “leader lease” (time-based).
- As long as the lease is active, the leader knows no other leader can be elected.
- Leader can serve reads locally without heartbeats.
- Risk: Relies on synchronized physical clocks (though less strictly than Spanner).
9.2 Pre-Vote Protocol
A node partitioned away from the cluster will keep incrementing its currentTerm as it fails to win elections. When it rejoins, it has a very high term, causing the current leader to step down unnecessarily.
Pre-Vote:
- Before incrementing its term, a candidate sends a Pre-Vote RPC.
- Peers only vote “Yes” if they haven’t heard from a leader in their election timeout.
- This prevents a “disruptive” node from triggering unnecessary elections.
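A sketch of the peer-side Pre-Vote check, reusing the RequestVoteArgs struct from Section 5; lastHeardFromLeader and electionTimeout are assumed fields, and the time package provides time.Since.

```go
// handlePreVote does not touch currentTerm or votedFor; it only predicts
// whether a real RequestVote at args.Term would be granted.
func (s *Server) handlePreVote(args RequestVoteArgs) bool {
	if time.Since(s.lastHeardFromLeader) < s.electionTimeout {
		return false // we still have a live leader; refuse to disrupt it
	}
	lastIndex := len(s.log) - 1
	lastTerm := s.log[lastIndex].Term
	logOK := args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
	// args.Term is the term the candidate would use (its currentTerm + 1).
	return logOK && args.Term >= s.currentTerm
}
```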
9.3 CheckQuorum
To prevent a leader in a minority partition from continuing to serve reads (if using leases), the leader periodically checks if it can still reach a majority.
- If it hasn’t heard from a majority of nodes for an election timeout, it steps down to follower.
9.4 Learner Nodes (Non-Voting Replicas)
In a standard Raft cluster, every node is a voter. This means every node must participate in elections and every write must wait for a majority. However, sometimes you want to add replicas for read scaling or geo-distribution without increasing the quorum size. Learner nodes are replicas that:
- Receive log entries from the leader like followers.
- Apply entries to their local state machine.
- Do not vote in elections.
- Do not count toward the write majority.
Use Case: Cold Booting a New Node
If you add a new node to a 5-node cluster, it starts with an empty log. If you immediately make it a voter, the quorum becomes harder to reach. Instead:
- Add the node as a Learner.
- Let it “catch up” by receiving the snapshot and replaying the log.
- Once it is “caught up” (lag is small), promote it to a Voter using a configuration change.
10. Joint Consensus: Safe Configuration Changes
Changing the set of servers (e.g., from 3 to 5 nodes) is one of the most dangerous operations in distributed systems. If not done carefully, two independent majorities can exist during the transition, leading to a split-brain.
10.1 The Two-Phase Transition
Raft uses Joint Consensus to ensure safety during reconfiguration.
- Phase 1: C_old,new (joint configuration):
- The leader appends a special configuration entry to its log that contains both the old membership (C_old) and the new one (C_new).
- For any commit or election to succeed, it must receive a majority from BOTH C_old and C_new.
- This is the “joint” state. Even if the leader crashes, any successor must have this entry to win the double majority.
- Phase 2: C_new:
- Once C_old,new is committed (stored on a majority of both groups), the leader appends a C_new configuration entry.
- From this point on, only C_new majorities are required.
- Nodes not present in C_new eventually see that entry and shut themselves down.
10.2 Staff Tip: The Single-Server Change Optimization
Many modern Raft implementations (like etcd) prefer Single-Server Changes. By adding or removing only one node at a time, you can skip the complex Joint Consensus.
- Proof sketch: If you add exactly 1 node to a 3-node cluster, any majority of the old configuration (2 of 3) and any majority of the new one (3 of 4) are guaranteed to intersect, since 2 + 3 > 4.
- This is much simpler to implement but requires sequential changes (you can’t add 10 nodes at once).
11. Log Compaction and Snapshots (Deep Dive)
The log cannot grow forever. Snapshots are the solution, but they introduce complexity.
11.1 Concurrent Snapshotting
Taking a snapshot takes time. The state machine should not be blocked during this process.
- Copy-on-Write: Many systems use CoW data structures or an OS-level fork() to create a point-in-time view of the state machine while continuing to process writes.
11.2 Installing Snapshots
If a follower is so far behind that the leader has already discarded the necessary log entries:
- Leader sends an InstallSnapshot RPC.
- Follower replaces its entire log and state machine with the snapshot.
- Follower resumes normal replication from the snapshot’s lastIncludedIndex.
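For reference, the RPC's arguments roughly as described in the Raft paper, expressed as a Go struct; Offset and Done support sending large snapshots in chunks.

```go
type InstallSnapshotArgs struct {
	Term              int    // leader's term
	LeaderID          int
	LastIncludedIndex int    // the snapshot replaces all entries up to this index
	LastIncludedTerm  int    // term of the entry at LastIncludedIndex
	Offset            int    // byte offset of this chunk within the snapshot
	Data              []byte // raw snapshot chunk
	Done              bool   // true if this is the last chunk
}
```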
12. Implementation Strategy (Step-by-Step)
Here is a suggested order to implement Raft from scratch.
12.1 Step 1: Data Structures
Define the core types: the persistent state (currentTerm, votedFor, log[]), the volatile state (commitIndex, lastApplied), and the per-follower leader state (nextIndex[], matchIndex[]), as sketched in Section 2.3.
12.2 Step 2: Timers and Role Transitions
- Implement an event loop or goroutines that handle:
- Election timer (randomized)
- Heartbeat timer on the leader
- RPC handlers for RequestVote and AppendEntries
Test:
- Kill random nodes
- Add artificial delays
- Ensure the cluster eventually elects a leader and keeps one leader at a time
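A sketch of the randomized election timer, assuming an electionTimer field and a startElection method on the Server; the 150–300 ms range matches the numbers used earlier.

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout returns a fresh value in [150 ms, 300 ms) so that two
// followers rarely time out at the same moment.
func randomElectionTimeout() time.Duration {
	const base = 150 * time.Millisecond
	return base + time.Duration(rand.Int63n(int64(base)))
}

// resetElectionTimer is called whenever a valid AppendEntries arrives or a
// vote is granted.
func (s *Server) resetElectionTimer() {
	if s.electionTimer != nil {
		s.electionTimer.Stop()
	}
	s.electionTimer = time.AfterFunc(randomElectionTimeout(), s.startElection)
}
```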
12.3 Step 3: Log Replication
- After leader election stabilizes, implement AppendEntries for both heartbeats and log entries
- Ensure that followers repair divergent logs correctly via (prevLogIndex, prevLogTerm) and backtracking
Test:
- Start the cluster, write commands while killing and restarting nodes
- After stabilization, check that all logs are identical on all non-faulty nodes
12.4 Step 4: Commit Rules and State Machine
- Implement the leader’s logic for advancing commitIndex based on matchIndex
- Implement application of committed entries to your state machine (e.g., a key-value store), as sketched below
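A toy key-value state machine that satisfies the StateMachine interface from the Section 3.2 sketch; the "set <key> <value>" command encoding is invented for illustration.

```go
import "strings"

// kvStore applies commands of the form "set <key> <value>".
type kvStore struct {
	data map[string]string
}

func newKVStore() *kvStore {
	return &kvStore{data: make(map[string]string)}
}

func (kv *kvStore) Apply(command []byte) {
	parts := strings.SplitN(string(command), " ", 3)
	if len(parts) == 3 && parts[0] == "set" {
		kv.data[parts[1]] = parts[2]
	}
}

func (kv *kvStore) Get(key string) []byte {
	return []byte(kv.data[key])
}
```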
12.5 Step 5: Persistence and Crash Recovery
- Before responding to RPCs or clients, ensure you fsync persistent state (currentTerm, votedFor, and log)
- On restart, reload these from disk and resume operation
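A deliberately naive persistence sketch (JSON to a single file, using the PersistentState struct from Section 2.3); real implementations append to a write-ahead log instead of rewriting everything, but the fsync-before-reply rule is the same. The file path is illustrative.

```go
import (
	"encoding/json"
	"os"
)

// persist must complete (including fsync) before the server replies to the RPC
// or client request that modified currentTerm, votedFor, or the log.
func (s *Server) persist() error {
	blob, err := json.Marshal(PersistentState{
		CurrentTerm: s.currentTerm,
		VotedFor:    s.votedFor,
		Log:         s.log,
	})
	if err != nil {
		return err
	}
	f, err := os.Create("raft-state.json") // illustrative path
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := f.Write(blob); err != nil {
		return err
	}
	return f.Sync() // fsync: the data must be durable before we answer
}
```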
Test:
- Kill a leader suddenly
- Restart it after some time
- Verify that it rejoins as a follower and converges to the majority log
12.6 Step 6: Advanced Features (Optional)
- Snapshots for log compaction
- Dynamic membership changes
- Read-only optimizations (e.g., linearizable reads using leader leases or read-index)
13. Interview Playbook: Explaining Raft
When asked about Raft in an interview, structure your answer:
- Problem statement – replicated state machine; need to keep nodes in sync across failures
- Roles – leader, followers, candidates
- Terms and elections – logical clock, one leader per term, randomized timeouts
- Log replication – AppendEntries, log matching property, commitIndex
- Safety guarantees – leader completeness, state machine safety
- Failure handling – leader crash, partition where leader is in minority
- Optional advanced topics – snapshots, membership changes, linearizable reads
A 60-second spoken version:
“Raft is a consensus algorithm that maintains a replicated log across multiple servers. Time is split into terms, each with at most one leader. Followers convert to candidates and start elections if they don’t hear from a leader. The leader accepts client commands, appends them to its log, and uses AppendEntries RPCs to replicate them to followers. Entries are considered committed once they’re stored on a majority and are then applied in order to a state machine. Raft guarantees that committed entries are never lost and that all non-faulty servers eventually apply the same sequence of commands, even across crashes and network partitions, as long as a majority is available.”
14. Next Steps
- Implement a toy Raft:
- Start with in-memory data structures and in-process message passing
- Then separate processes and real RPC
- Read the original paper: “In Search of an Understandable Consensus Algorithm” (Ongaro & Ousterhout)
- Study production implementations:
- etcd (Go)
- HashiCorp Raft (Go)
- TiKV (Rust)