Paxos Consensus Deep Dive
This chapter is a practical and theoretical deep dive into Paxos. The goal is to make Paxos feel as concrete and implementable as Raft, even though the original presentations are notoriously abstract. After this chapter you should be able to:
- Derive the basic Paxos protocol from first principles
- Explain why quorum overlap is enough for safety
- Implement single-decree Paxos and then extend it to Multi-Paxos
- Compare Paxos vs Raft with confidence in design reviews
1. Problem Restatement: Single-Value Consensus
Paxos solves the consensus problem for a single value in an asynchronous, crash-recovery system. We have:
- A set of processes (nodes), some of which may crash and recover
- A network that may lose, reorder, duplicate, or delay messages
- No bounds on message delay (pure asynchrony), so we cannot reliably distinguish slow vs dead
The processes must agree on a single value v such that:
- Validity: v was proposed by some process
- Agreement: No two processes decide different values
- Termination: All non-faulty processes eventually decide (liveness – can be sacrificed under extreme asynchrony)
2. Roles and Quorums
2.1 Roles
In Paxos, a process can play one or more of three logical roles:
- Proposers – suggest values to be chosen
- Acceptors – vote on proposals; a majority of them determines the chosen value
- Learners – learn the chosen value to apply it to state machines
2.2 Quorums
A quorum is any subset of acceptors whose size is large enough that any two quorums intersect. For classic Paxos, we use majority quorums:
- With N acceptors, a quorum is any subset of size ⌊N/2⌋ + 1
- This ensures that any two majorities share at least one acceptor
3. Paxos in Two Phases
At the heart of Paxos is a two-phase protocol driven by proposal numbers.
3.1 Proposal Numbers
Each proposal is identified by a proposal number n that is globally unique and totally ordered; each proposer only ever issues increasing numbers.
- Simplest mental model: (counter, proposerId) pairs, ordered lexicographically (see the sketch below)
- A higher (counter, proposerId) pair means a higher proposal number
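As a minimal sketch (the function name is illustrative, not from any particular implementation), Python tuples already compare lexicographically, which is exactly the ordering we want:

```python
# Hypothetical sketch: proposal numbers as (counter, proposer_id) tuples.
# Python compares tuples lexicographically, which gives the total order we need.

def next_proposal_number(last_seen_counter: int, proposer_id: int) -> tuple[int, int]:
    """Generate a proposal number strictly higher than anything seen so far."""
    return (last_seen_counter + 1, proposer_id)

n1 = (3, 1)   # counter 3, proposer 1
n2 = (3, 2)   # counter 3, proposer 2
n3 = (4, 1)

assert n2 > n1                              # same counter, higher proposer id wins
assert n3 > n2                              # higher counter always wins
assert next_proposal_number(4, 2) > n3      # newly generated numbers keep growing
```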
3.2 Phase 1 – Prepare / Promise
Goal: Reserve a proposal number and learn about any previously accepted values.
- The proposer selects a new proposal number n and sends Prepare(n) to all (or a quorum of) acceptors
- Each acceptor maintains two pieces of persistent state:
  - promised_n – the highest proposal number it has promised not to go below
  - (accepted_n, accepted_v) – the highest-numbered proposal it has accepted so far (or ⊥ if none)
Acceptor behavior on Prepare(n): if n > promised_n, the acceptor updates promised_n = n and replies Promise(n, accepted_n, accepted_v); otherwise it ignores the request (or replies with an explicit rejection).
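A minimal Python sketch of this Phase 1 handling, assuming the (counter, proposerId) tuples from Section 3.1 (class and field names are illustrative, not a production implementation):

```python
# Illustrative single-decree acceptor state and Phase 1 handler.
class Acceptor:
    def __init__(self):
        # Both fields must survive crashes (fsync before replying; see Pitfalls).
        self.promised_n = None          # highest n we promised not to go below
        self.accepted = None            # (accepted_n, accepted_v), or None for ⊥

    def on_prepare(self, n):
        """Return a Promise, or None to signal rejection."""
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n         # promise to ignore lower-numbered proposals
            return ("Promise", n, self.accepted)
        return None                     # already promised to a higher proposal
```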
3.3 Phase 2 – Accept / Accepted
Goal: Ask acceptors to accept a specific value v for proposal n.
- The proposer chooses value v:
  - If any acceptor reported a previously accepted value (accepted_n, accepted_v) in its Promise, the proposer must choose the accepted_v with the highest accepted_n among them
  - Otherwise, it can choose any value it wants (e.g., its own client request)
The proposer then sends Accept(n, v) to all acceptors.
Acceptor behavior on Accept(n, v): if n >= promised_n, the acceptor records accepted_n = n, accepted_v = v and replies Accepted(n, v); otherwise it rejects.
Once a quorum of acceptors has replied Accepted(n, v), value v is considered chosen.
Learners can be notified either directly by acceptors or via an aggregator.
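Extending the Acceptor sketch above with the Phase 2 handler and the proposer-side rule for choosing v (again illustrative, not a complete implementation):

```python
# Phase 2 handler for the Acceptor sketch above, plus the proposer-side value rule.

def on_accept(acceptor, n, v):
    """Acceptor: accept (n, v) unless we already promised a higher proposal."""
    if acceptor.promised_n is None or n >= acceptor.promised_n:
        acceptor.promised_n = n
        acceptor.accepted = (n, v)
        return ("Accepted", n, v)
    return None                                 # rejected

def choose_value(promises, my_value):
    """Proposer: pick the value of the highest-numbered accepted proposal, if any."""
    accepted = [p[2] for p in promises if p[2] is not None]   # p = ("Promise", n, accepted)
    if accepted:
        _, v = max(accepted, key=lambda a: a[0])               # highest accepted_n wins
        return v
    return my_value                                            # free to propose our own value
```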
4. Why This Works: Safety Intuition
The key safety theorem: Once a value v has been chosen, no other distinct value v' ≠ v can ever be chosen.
4.1 High-Level Proof Sketch
Assume value v is chosen in proposal n (i.e., a quorum Q1 of acceptors accepted (n, v)).
Consider any later proposal n' > n that manages to gather a quorum Q2 of Promises.
- Because both Q1 and Q2 are majorities, they must intersect at some acceptor a
- At the time v was chosen, a must have accepted (n, v)
- Later, when a receives Prepare(n'), it will respond with Promise(n', accepted_n = n, accepted_v = v)
- The proposer for n' must choose the accepted value with the highest accepted_n from the Promises → that is v
So proposal n' carries v again, and no different value v' can be chosen.
The combination of quorum intersection and the “pick highest accepted value” rule ensures safety.
5. Detailed Message Flow Example
Let’s walk through a concrete example with 3 acceptors: A1, A2, A3 and one proposer P.
5.1 No Contention Scenario
- P chooses proposal number n = 1 and sends Prepare(1) to A1, A2, A3
- All acceptors have promised_n = 0 and accepted_n = ⊥, so they all reply Promise(1, ⊥, ⊥)
- P receives Promises from a quorum (any 2, say A1, A2)
- Since none of them had an accepted value, P can choose its own value v = 42
- P sends Accept(1, 42) to A1, A2, A3
- Each acceptor checks n >= promised_n (true) and updates accepted_n = 1, accepted_v = 42, replying Accepted(1, 42)
- P sees a quorum of Accepted and decides that value 42 is chosen (simulated in the toy run below)
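To make the walkthrough concrete, here is a toy end-to-end run that reuses the Acceptor, on_accept, and choose_value sketches from Section 3 (a pure in-process simulation; no networking or persistence):

```python
# Toy run of the no-contention scenario using the Section 3 sketches.
acceptors = {"A1": Acceptor(), "A2": Acceptor(), "A3": Acceptor()}
quorum = 2                                        # ⌊3/2⌋ + 1

# Phase 1: Prepare(1) to everyone, collect Promises.
n = (1, 0)                                        # (counter, proposerId)
promises = [a.on_prepare(n) for a in acceptors.values()]
promises = [p for p in promises if p is not None]
assert len(promises) >= quorum

# Phase 2: nobody reported an accepted value, so P proposes its own value 42.
v = choose_value(promises, my_value=42)
acks = [on_accept(a, n, v) for a in acceptors.values()]
assert sum(ack is not None for ack in acks) >= quorum
print("chosen:", v)                               # chosen: 42
```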
5.2 Competing Proposers Scenario
Now imagine two proposers P1 and P2.
- P1 uses proposal numbers 1, 3, 5, …
- P2 uses proposal numbers 2, 4, 6, …
- P1 sends Prepare(1) to all, gets Promises, and moves to Phase 2
- Before its Accept(1, v1) messages reach the acceptors, P2 sends Prepare(2)
- Acceptors respond Promise(2, accepted_n, accepted_v) depending on what they’ve already accepted
- Because 2 > 1, they may promise to P2 even if they previously promised to P1
- Eventually, one proposer will “get ahead” with a higher proposal number and gather a quorum of Promises
- That proposer will then carry through with Phase 2 and choose a value consistent with any previously chosen value
6. Multi-Paxos: From One Value to a Replicated Log
Basic Paxos decides one value. Real-world systems need a linearizable log of commands.
6.1 The Cost of Naive Paxos
If we run Basic Paxos for every log index:
- Latency: 2 round-trips (Prepare + Accept) per command.
- Throughput: Heavy contention on acceptor state.
6.2 Multi-Paxos Optimization
Multi-Paxos optimizes the “steady state” by assuming a stable leader. The “One Prepare” Rule (sketched after this list):
- A leader runs Phase 1 (Prepare) for all future log indices simultaneously (conceptually).
- Once a proposer receives a quorum of Promises, it becomes the Master.
- For all subsequent commands, the Master skips Phase 1 and sends only Accept (Phase 2) messages.
- This reduces latency to 1 round-trip in the common case.
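A rough sketch of this steady state with per-index acceptor state; it deliberately omits recovery of previously accepted entries when a new Master takes over (all class and method names are assumptions, not a real library):

```python
# Illustrative Multi-Paxos steady state: one Prepare, then Accept-only per log index.
class MultiAcceptor:
    def __init__(self):
        self.promised_n = None          # one promise covers all future log indices
        self.accepted = {}              # log index -> (n, command)

    def on_prepare(self, n):
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n
            return ("Promise", n, dict(self.accepted))
        return None

    def on_accept(self, n, index, command):
        if self.promised_n is None or n >= self.promised_n:
            self.promised_n = n
            self.accepted[index] = (n, command)
            return ("Accepted", n, index)
        return None

class Leader:
    def __init__(self, acceptors, n):
        self.acceptors, self.n, self.next_index = acceptors, n, 0

    def become_master(self):
        """Phase 1 once, conceptually covering every future index."""
        promises = [a.on_prepare(self.n) for a in self.acceptors]
        return sum(p is not None for p in promises) > len(self.acceptors) // 2

    def replicate(self, command):
        """Steady state: Phase 2 only, one round-trip per command."""
        index, self.next_index = self.next_index, self.next_index + 1
        acks = [a.on_accept(self.n, index, command) for a in self.acceptors]
        return sum(a is not None for a in acks) > len(self.acceptors) // 2
```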
6.3 Master Leases
How does the system know who the current Master is without running Paxos every time?
- Leases: Acceptors grant the Master a time-based lease (e.g., 5 seconds).
- While the lease is active, acceptors promise to reject any Prepare requests from other proposers.
- This allows the Master to perform local reads safely (Linearizable Reads), as it knows no other value can be chosen during the lease (see the sketch below).
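One way the Master might track its lease before serving a local read, as a hedged sketch (the 5-second duration and the class name are just examples):

```python
import time

# Illustrative lease bookkeeping on the Master side.
LEASE_SECONDS = 5.0

class MasterLease:
    def __init__(self):
        self.expires_at = 0.0

    def renew(self, sent_at: float):
        # Measure the lease from when we *sent* the renewal request, so delays and
        # clock differences cannot make us believe we hold the lease longer than we do.
        self.expires_at = sent_at + LEASE_SECONDS

    def can_serve_local_read(self) -> bool:
        return time.monotonic() < self.expires_at

lease = MasterLease()
lease.renew(sent_at=time.monotonic())
assert lease.can_serve_local_read()      # safe to read locally while the lease holds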
6.3.1 Advanced: Ephemeral Leases & Graceful Handoff
At the Principal level, “Master Leases” are not just binary. High-availability systems (like Google Spanner or etcd) use sophisticated lease management to minimize downtime during leader transitions.
The Problem: The “Dead Leader” Gap
If a Master holds a 5-second lease and crashes one second into it, the acceptors will reject all other proposers until the lease expires. This creates a 4-second unavailability window where the cluster cannot process writes.
Optimization 1: Graceful Handoff
When a leader knows it is about to stop (e.g., during a rolling deployment), it can perform a Graceful Handoff:
- Step Down: The leader stops accepting new requests.
- Synchronize: It ensures its log is fully replicated to a designated successor.
- Renounce Lease: It sends a special message to acceptors saying, “I am renouncing my lease early.”
- Instant Election: The successor can immediately run Phase 1 and take over without waiting for the timeout.
Optimization 2: Bidirectional Leases (Leader Sticks)
Instead of just acceptors granting leases to the leader, the leader can include a Lease Renewal bit in its Phase 2 (Accept) messages.
- If the Master is active and sending heartbeats/commands, the lease is automatically extended.
- This prevents the “Leader Duel” where a leader’s lease expires just as it’s trying to commit a value.
6.4 Log Compaction & Snapshots
A replicated log cannot grow forever. Eventually, you will run out of disk space, and a new node joining the cluster would take years to “replay” millions of commands. Snapshots allow you to “compact” the log (see the sketch after this list):
- Capture State: The state machine (e.g., a Key-Value store) captures its current state at some applied index i.
- Discard Log: All log entries up to and including index i can be safely deleted.
- Bootstrap: A new node starts by downloading the snapshot and then only replaying the log from index i + 1 onwards.
- Coordination: In Paxos, snapshots are usually managed locally by each node, but they must coordinate to ensure they don’t delete entries that a lagging peer might still need.
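A toy illustration of snapshotting and compaction for a key-value state machine (the class structure and names are assumptions, not taken from any real system):

```python
# Illustrative snapshot + log compaction for a key-value state machine.
class ReplicatedKV:
    def __init__(self):
        self.state = {}                  # the state machine: a simple key-value map
        self.log = {}                    # index -> command, only entries not yet compacted
        self.snapshot = ({}, -1)         # (state copy, last index covered by the snapshot)

    def take_snapshot(self, applied_index: int):
        """Capture state at applied_index, then discard the log entries it covers."""
        self.snapshot = (dict(self.state), applied_index)
        self.log = {i: cmd for i, cmd in self.log.items() if i > applied_index}

    def bootstrap_new_node(self):
        """A new node starts from the snapshot and replays only the remaining tail."""
        state, last_index = self.snapshot
        tail = {i: cmd for i, cmd in self.log.items() if i > last_index}
        return dict(state), tail
```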
7. Mencius: High-Throughput Multi-Leader Paxos
One major criticism of Multi-Paxos is that all writes must go through a single leader, creating a bottleneck. Mencius (named after the Chinese philosopher) is a variant designed for high-throughput, wide-area networks.
7.1 How Mencius Works
Mencius partitions the log indices among the nodes.
- If there are n nodes, Node i is the “default leader” for all indices j where j mod n = i.
- Efficiency: Since each node is the leader for its own slots, it can skip Phase 1 (Prepare) and go straight to Phase 2 (Accept).
- Latency: No need to forward requests to a remote leader; nodes process local requests in their assigned slots.
7.2 Handling Idle Nodes (The Skip Mechanism)
What if Node 1 has no requests but Node 2 is flooded? (A slot-ownership sketch follows this list.)
- Node 1 must propose “No-Op” (Skip) for its slots so the cluster can move past them.
- If Node 1 is down, other nodes can run Paxos on Node 1’s slots to “force-skip” them.
- Trade-off: Mencius is extremely fast when all nodes are active, but performance degrades if some nodes are idle or slow, as they block the “common log” from progressing.
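A tiny sketch of Mencius-style slot ownership and no-op skipping; the round-robin rule matches the partitioning described above, while the function names and the 3-node cluster size are illustrative:

```python
# Illustrative Mencius slot ownership and skipping.
NUM_NODES = 3

def default_leader(index: int) -> int:
    """Node i owns every index where index mod NUM_NODES == i."""
    return index % NUM_NODES

def noops_for_idle_node(node_id: int, up_to_index: int):
    """An idle node must fill its own slots with no-ops so the log can advance."""
    return [(i, "NOOP") for i in range(up_to_index) if default_leader(i) == node_id]

assert default_leader(7) == 1                 # with 3 nodes, index 7 belongs to node 1
assert noops_for_idle_node(1, 9) == [(1, "NOOP"), (4, "NOOP"), (7, "NOOP")]
```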
8. Advanced Paxos Variants
8.1 Fast Paxos (Low Latency)
In classic Paxos, the message flow is Client -> Proposer -> Acceptor -> Proposer -> Client.
Fast Paxos allows clients to send requests directly to acceptors:
- Classic Quorum: a simple majority, ⌊N/2⌋ + 1 acceptors.
- Fast Quorum: a larger set, roughly ⌈3N/4⌉ acceptors, sized so that any two fast quorums intersect with any classic quorum.
- If a client gets a “Fast Quorum” of identical responses, the value is chosen in one round-trip from the client’s perspective.
- Trade-off: If there is contention (multiple clients), it “collides” and falls back to a slower recovery path.
8.2 Flexible Paxos (FPaxos)
Traditional Paxos assumes every quorum is a majority, so any two quorums intersect. Flexible Paxos observes that we only need the Prepare quorum of a later proposal to intersect with the Accept quorum of an earlier successful proposal, i.e. |Q1| + |Q2| > N (checked in the sketch below):
- You can have a small Accept quorum (e.g., 2 nodes out of 10) for fast writes.
- This requires a large Prepare quorum (e.g., 9 nodes out of 10) for leader changes.
- This is powerful for systems where leader changes are rare but writes are frequent.
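The safety condition reduces to a simple arithmetic check, shown here as a small sketch (the function name is illustrative):

```python
# Illustrative check of the Flexible Paxos requirement: every Prepare (Phase 1) quorum
# must intersect every Accept (Phase 2) quorum, i.e. q1 + q2 > n.
def fpaxos_config_is_safe(n: int, q1: int, q2: int) -> bool:
    return q1 + q2 > n

assert fpaxos_config_is_safe(n=10, q1=9, q2=2)       # big Prepare quorum, tiny Accept quorum
assert not fpaxos_config_is_safe(n=10, q1=5, q2=5)   # two disjoint halves: unsafe
assert fpaxos_config_is_safe(n=10, q1=6, q2=6)       # classic majorities
```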
8.3 EPaxos (Egalitarian Paxos)
EPaxos is a leaderless consensus protocol:
- Any node can propose a command for any index.
- It uses dependency tracking (like Vector Clocks) to order concurrent commands.
- If two commands don’t interfere (e.g., different keys), they can be committed in 1 round-trip without a central leader.
- If they interfere, they use a “slow path” (2 round-trips) to resolve dependencies.
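A toy interference check in the spirit of EPaxos, assuming each command declares the keys it touches (a deliberate simplification of real dependency tracking):

```python
# Illustrative interference check: commands conflict only if they touch overlapping keys,
# so non-conflicting commands can commit on the fast path without ordering.
def interferes(cmd_a: dict, cmd_b: dict) -> bool:
    return bool(set(cmd_a["keys"]) & set(cmd_b["keys"]))

put_x = {"op": "PUT", "keys": {"x"}}
put_y = {"op": "PUT", "keys": {"y"}}
inc_x = {"op": "INCR", "keys": {"x"}}

assert not interferes(put_x, put_y)   # different keys: fast path, no ordering needed
assert interferes(put_x, inc_x)       # same key: slow path resolves dependencies
```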
8.4 Dynamic Quorum Reconfiguration
A common question in Staff-level interviews: “How do you add or remove nodes from a running Paxos cluster without downtime?” This is known as Reconfiguration or Membership Change.
1. The Naive Approach (The Trap)
Simply changing the “Quorum Size” variable in a config file and restarting nodes is DANGEROUS.
- If some nodes still see a membership of 3 (Quorum=2) while others already see a membership of 5 (Quorum=3), you can reach a state where two different values are chosen because the quorums don’t overlap correctly across the transition.
2. The Paxos-in-Paxos Approach (Alpha)
The standard way to handle reconfiguration is to treat the Cluster Membership as a value decided by the protocol itself (see the sketch after this list):
- The current membership C_1 is used for all log indices up to i + α.
- Index i contains a special command: RECONFIG(C_2).
- Once RECONFIG(C_2) is chosen at index i, all indices from i + α onwards use membership C_2.
- α (the “alpha” parameter) is a pipeline limit. It ensures that the system doesn’t switch to C_2 while commands using C_1 are still in flight.
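A small sketch of how a node might compute which membership governs a given log index under the alpha rule (ALPHA, the function name, and the data shapes are assumptions):

```python
# Illustrative lookup of the membership governing a log index, assuming a RECONFIG(C_2)
# entry chosen at index i takes effect at index i + ALPHA.
ALPHA = 3                                        # pipeline limit (example value)

def membership_for(index: int, initial_config, reconfigs):
    """reconfigs: list of (chosen_index, new_config), sorted by chosen_index."""
    config = initial_config
    for chosen_index, new_config in reconfigs:
        if index >= chosen_index + ALPHA:        # only indices past the alpha window switch
            config = new_config
    return config

c1, c2 = {"A", "B", "C"}, {"A", "B", "C", "D", "E"}
assert membership_for(4, c1, [(2, c2)]) == c1    # 4 < 2 + 3: still governed by C_1
assert membership_for(5, c1, [(2, c2)]) == c2    # 5 >= 2 + 3: now governed by C_2
```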
3. Joint Consensus (Raft Style)
While Raft uses a 2-phase approach (Joint Consensus), Paxos systems (like Spanner) often use a simpler Leader-driven Reconfiguration:
- The leader proposes the new configuration.
- The system enters a state where both the old and new quorums must grant their approval.
- Once the new configuration is committed, the old nodes can be safely decommissioned.
| Strategy | Mechanism | Complexity |
|---|---|---|
| Stop-the-World | Pause all writes, update config, resume. | Simple, but high downtime. |
| Paxos Log | Membership changes are just log entries. | Elegant, but requires the α pipeline window. |
| Joint Quorum | Overlap quorums during transition. | Complex, but zero downtime. |
9. Paxos vs Raft: The Engineering Reality
| Feature | Paxos (Multi-Paxos) | Raft |
|---|---|---|
| Leadership | Implicit/Stable Leader | Explicit/Strong Leader |
| Log Structure | Can have gaps, out-of-order | Strictly sequential, no gaps |
| Complexity | High (many variants, subtle) | Moderate (designed for teaching) |
| Performance | Can be higher (FPaxos, EPaxos) | Limited by leader bottleneck |
| Failure Recovery | Fast (index-by-index) | Slower (must reconstruct log) |
Prefer Paxos (or one of its variants) over Raft when:
- You need extreme performance or geo-distribution (EPaxos).
- You are building a system where “gaps” in the log are acceptable (e.g., non-deterministic state machines).
- You want to optimize quorum sizes (Flexible Paxos).
10. Formal Verification: Why Paxos is “Correct”
Paxos is one of the most formally studied algorithms in CS.
- TLA+: Leslie Lamport (creator of Paxos) also created TLA+, a formal specification language.
- Invariants:
- P1: An acceptor must accept the first proposal it receives.
- P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
- Proving P2 is the core of Paxos’s safety. It relies on induction over proposal numbers and the intersection of majorities.
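Written out in the chapter’s notation, P2 says:

```latex
% Invariant P2, over proposal numbers n and values v:
\text{chosen}(n, v) \;\wedge\; \text{chosen}(n', v') \;\wedge\; n' > n \;\Longrightarrow\; v' = v
```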
11. Implementation Pitfalls
- Livelock (The Duel): Two proposers keep preempting each other with higher proposal numbers.
- Fix: Use exponential backoff or a stable leader (Multi-Paxos).
- Disk I/O: Acceptors must fsync their promised_n and accepted_n / accepted_v before replying. If a node forgets its promise after a crash, safety is lost.
- Unique Proposal Numbers: If two nodes use the same n for different proposals, the protocol breaks. Always use (counter, server_id).
12. Interview “Master” Answer
“Paxos is a consensus protocol that ensures safety through quorum intersection. In its basic form, it uses a two-phase process: Prepare, where a proposer reserves a number and learns history, and Accept, where it proposes a value. Multi-Paxos optimizes this by electing a stable leader who skips the Prepare phase for subsequent commands. For high-performance needs, variants like Fast Paxos reduce latency by allowing direct client-to-acceptor communication, while EPaxos provides a leaderless approach to avoid bottlenecks. The core trade-off in Paxos is always prioritizing consistency over availability in the face of partitions, guaranteed by the fact that any two majorities must overlap on at least one node that ‘remembers’ the chosen value.”