Paxos Consensus Deep Dive

This chapter is a practical and theoretical deep dive into Paxos. The goal is to make Paxos feel as concrete and implementable as Raft, even though the original presentations are notoriously abstract.
After this chapter you should be able to:
  • Derive the basic Paxos protocol from first principles
  • Explain why quorum overlap is enough for safety
  • Implement single-decree Paxos and then extend it to Multi-Paxos
  • Compare Paxos vs Raft with confidence in design reviews

1. Problem Restatement: Single-Value Consensus

Paxos solves the consensus problem for a single value in an asynchronous, non-Byzantine system with crash-recovery failures. We have:
  • A set of processes (nodes), any of which may crash and later recover (with their stable storage intact)
  • A network that may lose, reorder, duplicate, or delay messages
  • No bounds on message delay (pure asynchrony), so we cannot reliably distinguish slow vs dead
Goal: All non-faulty processes agree on one value v such that:
  • Validity: v was proposed by some process
  • Agreement: No two processes decide different values
  • Termination: All non-faulty processes eventually decide (a liveness property; by the FLP impossibility result it cannot be guaranteed under pure asynchrony)
Paxos prioritizes safety over liveness: it ensures agreement and validity even if the system stalls.

2. Roles and Quorums

2.1 Roles

In Paxos, a process can play one or more of three logical roles:
  • Proposers – suggest values to be chosen
  • Acceptors – vote on proposals; a majority of them determines the chosen value
  • Learners – learn the chosen value to apply it to state machines
In practice, real systems often combine roles: each node can be proposer, acceptor, and learner.

2.2 Quorums

A quorum is any subset of acceptors whose size is large enough that any two quorums intersect. For classic Paxos, we use majority quorums:
  • With N acceptors, a quorum is any subset of size ⌊N/2⌋ + 1
  • This ensures that any two majorities share at least one acceptor
This intersection property is the core of Paxos safety.

3. Paxos in Two Phases

At the heart of Paxos is a two-phase protocol driven by proposal numbers.

3.1 Proposal Numbers

Each proposal is identified by a number n that is globally unique and totally ordered; each proposer draws its numbers from its own increasing sequence.
  • Simplest mental model: (counter, proposerId) pairs, ordered lexicographically
  • Higher (counter, proposerId) means higher proposal number
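A minimal sketch of this scheme in Python (the function and variable names are illustrative, not part of any standard API); Python tuples already compare lexicographically, which matches the ordering described above:

# Illustrative sketch: proposal numbers as (counter, proposer_id) tuples.
# Tuples compare lexicographically, so a higher counter always wins and the
# proposer id breaks ties uniquely across nodes.

def next_proposal_number(last_seen_counter, proposer_id):
    # Advance past any counter observed so far (in Promises or Nacks).
    return (last_seen_counter + 1, proposer_id)

n1 = next_proposal_number(0, proposer_id=1)   # (1, 1)
n2 = next_proposal_number(1, proposer_id=2)   # (2, 2)
assert n2 > n1                                # higher counter wins
assert (1, 2) > (1, 1)                        # same counter: proposer id breaks the tie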

3.2 Phase 1 – Prepare / Promise

Goal: Reserve a proposal number and learn about any previously accepted values.
  • The proposer selects a new proposal number n and sends Prepare(n) to all (or a quorum of) acceptors
  • Each acceptor maintains two pieces of persistent state:
    • promised_n – the highest proposal number it has promised; it will reject any Prepare or Accept carrying a lower number
    • (accepted_n, accepted_v) – the highest-numbered proposal it has accepted so far (or ⊥ if none)
Acceptor behavior on Prepare(n):
on Prepare(n):
    if n > promised_n:
        promised_n = n
        reply Promise(n, accepted_n, accepted_v)
    else:
        reply Nack(promised_n)
If a proposer receives Promise responses from a quorum of acceptors, it can move to Phase 2.

3.3 Phase 2 – Accept / Accepted

Goal: Ask acceptors to accept a specific value v for proposal n.
  • The proposer chooses value v:
    • If any acceptor reported a previously accepted value (accepted_n, accepted_v) in its Promise, the proposer must choose the value with highest accepted_n among them
    • Otherwise, it can choose any value it wants (e.g., its own client request)
Then it sends Accept(n, v) to all acceptors. Acceptor behavior on Accept(n, v):
on Accept(n, v):
    if n >= promised_n:
        promised_n = n
        accepted_n = n
        accepted_v = v
        reply Accepted(n, v)
    else:
        reply Nack(promised_n)
When a quorum of acceptors has sent Accepted(n, v), value v is considered chosen. Learners can be notified either directly by acceptors or via an aggregator.
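Putting both phases together, here is a minimal single-decree sketch in Python (everything here is illustrative: the network is replaced by direct method calls, durable storage is omitted, and none of the names come from a real library; see Section 11 for why real acceptors must persist their state before replying):

# Illustrative single-decree Paxos: in-memory, no real network, no persistence.

QUORUM = 2  # majority of 3 acceptors

class Acceptor:
    def __init__(self):
        self.promised_n = None      # highest proposal number promised
        self.accepted_n = None      # highest proposal number accepted so far
        self.accepted_v = None      # value of that accepted proposal

    def on_prepare(self, n):
        if self.promised_n is None or n > self.promised_n:
            self.promised_n = n
            return ("Promise", self.accepted_n, self.accepted_v)
        return ("Nack", self.promised_n)

    def on_accept(self, n, v):
        if self.promised_n is None or n >= self.promised_n:
            self.promised_n = n
            self.accepted_n, self.accepted_v = n, v
            return ("Accepted", n, v)
        return ("Nack", self.promised_n)

def propose(acceptors, n, my_value):
    # Phase 1: gather Promises from a quorum.
    replies = [a.on_prepare(n) for a in acceptors]
    promises = [r for r in replies if r[0] == "Promise"]
    if len(promises) < QUORUM:
        return None                              # preempted; retry with a higher n

    # Value-selection rule: adopt the value of the highest-numbered
    # previously accepted proposal, if any; otherwise use our own value.
    accepted = [(an, av) for _, an, av in promises if an is not None]
    value = max(accepted)[1] if accepted else my_value

    # Phase 2: ask the acceptors to accept (n, value).
    acks = sum(1 for a in acceptors if a.on_accept(n, value)[0] == "Accepted")
    return value if acks >= QUORUM else None

acceptors = [Acceptor(), Acceptor(), Acceptor()]
print(propose(acceptors, (1, 1), 42))            # -> 42 when uncontended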

4. Why This Works: Safety Intuition

The key safety theorem:
Once a value v has been chosen, no other distinct value v' ≠ v can ever be chosen.

4.1 High-Level Proof Sketch

Assume value v is chosen in proposal n (i.e., a quorum Q1 of acceptors accepted (n, v)). Consider any later proposal n' > n that manages to gather a quorum Q2 of Promises.
  • Because both Q1 and Q2 are majorities, they must intersect at some acceptor a
  • At the time v was chosen, a must have accepted (n, v)
  • Later, when a receives Prepare(n'), it will respond with a Promise carrying its highest accepted proposal; its number is at least n, since a can only have accepted later proposals in the meantime
  • The proposer for n' must adopt the accepted value with the highest accepted_n among the Promises; by induction over the proposal numbers between n and n', every such proposal carries v, so the adopted value is v
Thus any later proposal that can possibly succeed will propose v again, and no different value v' can be chosen. The combination of quorum intersection and the "adopt the highest-numbered accepted value" rule ensures safety.

5. Detailed Message Flow Example

Let’s walk through a concrete example with 3 acceptors: A1, A2, A3 and one proposer P.

5.1 No Contention Scenario

  1. P chooses proposal number n = 1 and sends Prepare(1) to A1, A2, A3
  2. All acceptors have promised_n = 0 and accepted_n = ⊥, so they all reply Promise(1, ⊥, ⊥)
  3. P receives Promises from a quorum (any 2, say A1, A2)
  4. Since none of them had an accepted value, P can choose its own value v = 42
  5. P sends Accept(1, 42) to A1, A2, A3
  6. Each acceptor checks n >= promised_n (true) and updates accepted_n=1, accepted_v=42, replying Accepted(1, 42)
  7. P sees a quorum of Accepted and decides that value 42 is chosen

5.2 Competing Proposers Scenario

Now imagine two proposers P1 and P2.
  • P1 uses proposal numbers 1, 3, 5, …
  • P2 uses proposal numbers 2, 4, 6, …
A typical interleaving:
  1. P1 sends Prepare(1) to all, gets Promises, moves to Phase 2
  2. Before its Accept(1, v1) messages reach acceptors, P2 sends Prepare(2)
  3. Acceptors respond Promise(2, accepted_n, accepted_v) depending on what they’ve already accepted
  4. Because 2 > 1, they may promise to P2 even if they previously promised to P1
This kind of competition can delay progress but cannot break safety:
  • Eventually, one proposer will “get ahead” with a higher proposal number and gather a quorum of Promises
  • That proposer will then carry through with Phase 2 and choose a value consistent with any previously chosen value
Liveness is therefore not guaranteed in theory (the FLP limitation again), but in practice randomized backoff or electing a stable leader makes such duels terminate quickly.
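A minimal sketch of one such backoff strategy (illustrative; it assumes the propose driver sketched in Section 3, which returns None when preempted, and a real implementation would also jump the counter past any promised_n returned in Nacks):

import random
import time

def propose_with_backoff(acceptors, proposer_id, value, max_attempts=10):
    # Retry wrapper: on preemption, pick a higher proposal number and sleep for
    # a randomized, exponentially growing interval before trying again.
    counter = 0
    for attempt in range(max_attempts):
        counter += 1
        result = propose(acceptors, (counter, proposer_id), value)
        if result is not None:
            return result
        time.sleep(random.uniform(0, 0.01 * (2 ** attempt)))
    return None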

6. Multi-Paxos: From One Value to a Replicated Log

Basic Paxos decides one value. Real-world systems need a linearizable log of commands.

6.1 The Cost of Naive Paxos

If we run Basic Paxos independently for every log index:
  1. Latency: every command costs two round-trips (Prepare + Accept).
  2. Throughput: every index is open to competing proposers, so contention on acceptor state (and dueling, as in Section 5.2) can occur on every command.

6.2 Multi-Paxos Optimization

Multi-Paxos optimizes the “steady state” by assuming a stable leader. The “One Prepare” Rule:
  • A leader runs Phase 1 (Prepare) for all future log indices simultaneously (conceptually).
  • Once a proposer receives a quorum of Promises, it becomes the Master.
  • For all subsequent commands, the Master skips Phase 1 and sends only Accept (Phase 2) messages.
  • This reduces latency to 1 round-trip in the common case.
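A rough sketch of that steady-state loop (illustrative only: on_accept_at is a hypothetical per-log-index variant of the acceptor's on_accept, and failure handling / re-election are omitted):

# Illustrative Multi-Paxos leader in steady state: one Prepare up front,
# then only Phase 2 (Accept) messages per command while leadership holds.

class MultiPaxosLeader:
    def __init__(self, acceptors, leader_n, quorum):
        self.acceptors = acceptors      # one acceptor-like peer per node
        self.leader_n = leader_n        # proposal number won in the single Prepare
        self.quorum = quorum
        self.next_index = 0

    def replicate(self, command):
        # Steady state: skip Phase 1 and send Accept(n, index, command) directly.
        index = self.next_index
        self.next_index += 1
        acks = sum(
            1 for a in self.acceptors
            if a.on_accept_at(index, self.leader_n, command)[0] == "Accepted"
        )
        if acks >= self.quorum:
            return index                # command is chosen at this log index
        raise RuntimeError("preempted: rerun Phase 1 with a higher proposal number")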

6.3 Master Leases

How does the system know who the current Master is without running Paxos every time?
  • Leases: Acceptors grant the Master a time-based lease (e.g., 5 seconds).
  • While the lease is active, acceptors promise to reject any Prepare requests from other proposers.
  • This allows the Master to perform local reads safely (Linearizable Reads), as it knows no other value can be chosen during the lease.
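A minimal sketch of how an acceptor might enforce such a lease (illustrative; it extends the Acceptor sketch from Section 3, uses a monotonic local clock, and ignores the clock-skew margins that real systems must budget for):

import time

class LeasingAcceptor(Acceptor):
    LEASE_SECONDS = 5.0

    def __init__(self):
        super().__init__()
        self.lease_holder = None        # proposer currently holding the lease
        self.lease_expiry = 0.0

    def on_prepare(self, n, proposer_id=None):
        # While another proposer's lease is still active, refuse to promise.
        if (self.lease_holder is not None
                and proposer_id != self.lease_holder
                and time.monotonic() < self.lease_expiry):
            return ("Nack", self.promised_n)
        reply = super().on_prepare(n)
        if reply[0] == "Promise":
            self.lease_holder = proposer_id
            self.lease_expiry = time.monotonic() + self.LEASE_SECONDS
        return reply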

6.3.1 Advanced: Ephemeral Leases & Graceful Handoff

At the Principal level, “Master Leases” are not just binary. High-availability systems (like Google Spanner or etcd) use sophisticated lease management to minimize downtime during leader transitions.

The Problem: The “Dead Leader” Gap

If a Master holds a 5-second lease and crashes at t = 1s, just after renewing it, the acceptors will reject all other proposers until the lease expires at t = 6s. This creates a roughly 5-second unavailability window during which the cluster cannot process writes.

Optimization 1: Graceful Handoff

When a leader knows it is about to stop (e.g., during a rolling deployment), it can perform a Graceful Handoff:
  1. Step Down: The leader stops accepting new requests.
  2. Synchronize: It ensures its log is fully replicated to a designated successor.
  3. Renounce Lease: It sends a special message to acceptors saying, “I am renouncing my lease early.”
  4. Instant Election: The successor can immediately run Phase 1 and take over without waiting for the timeout.

Optimization 2: Bidirectional Leases (Leader Sticks)

Instead of just acceptors granting leases to the leader, the leader can include a Lease Renewal bit in its Phase 2 (Accept) messages.
  • If the Master is active and sending heartbeats/commands, the lease is automatically extended.
  • This prevents the “Leader Duel” where a leader’s lease expires just as it’s trying to commit a value.
Staff Tip: When designing for “Five Nines” (99.999%), the downtime during leader election is usually your biggest bottleneck. Always implement handoff mechanisms rather than relying on raw timeouts.

6.4 Log Compaction & Snapshots

A replicated log cannot grow forever. Eventually, you will run out of disk space, and a new node joining the cluster would take years to “replay” millions of commands. Snapshots allow you to “compact” the log:
  1. Capture State: The state machine (e.g., a Key-Value store) captures its current state at some index i.
  2. Discard Log: All log entries from 1 to i can be safely deleted.
  3. Bootstrap: A new node starts by downloading the snapshot and then only replaying the log from i+1 onwards.
  4. Coordination: In Paxos, snapshots are usually managed locally by each node, but nodes must coordinate to ensure they don’t delete entries that a lagging peer might still need.
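A minimal sketch of the bookkeeping involved (illustrative; serialization, storage, and the coordination with lagging peers from step 4 are all stubbed out):

# Illustrative log-compaction sketch for a replicated state machine.

class CompactingLog:
    def __init__(self):
        self.snapshot = {}              # state as of snapshot_index
        self.snapshot_index = 0
        self.entries = {}               # log index -> command, for index > snapshot_index

    def append(self, index, command):
        self.entries[index] = command

    def compact(self, upto_index, state_machine_state):
        # Capture the state machine's state at upto_index, then discard every
        # entry at or below it. A real system must first confirm that no
        # lagging peer still needs the discarded entries.
        self.snapshot = dict(state_machine_state)
        self.snapshot_index = upto_index
        self.entries = {i: c for i, c in self.entries.items() if i > upto_index}

    def bootstrap_payload(self):
        # What a newly joining node downloads: the snapshot plus the log tail.
        return self.snapshot_index, self.snapshot, dict(self.entries)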

7. Mencius: High-Throughput Multi-Leader Paxos

One major criticism of Multi-Paxos is that all writes must go through a single leader, creating a bottleneck. Mencius (named after the Chinese philosopher) is a variant designed for high-throughput, wide-area networks.

7.1 How Mencius Works

Mencius partitions the log indices among the nodes.
  • If there are N nodes, Node i is the “default leader” for all indices k where k ≡ i (mod N).
  • Efficiency: Since each node is the leader for its own slots, it can skip Phase 1 (Prepare) and go straight to Phase 2 (Accept).
  • Latency: No need to forward requests to a remote leader; nodes process local requests in their assigned slots.
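The slot-ownership rule is simple enough to state directly (a sketch under the partitioning just described, assuming node ids 0 through N-1):

def default_leader(index, num_nodes):
    # Mencius-style round-robin partitioning of log indices among nodes.
    return index % num_nodes

# With 3 nodes: node 0 owns indices 0, 3, 6, ...; node 1 owns 1, 4, 7, ...; etc.
assert [default_leader(k, 3) for k in range(6)] == [0, 1, 2, 0, 1, 2]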

7.2 Handling Idle Nodes (The Skip Mechanism)

What if Node 1 has no requests but Node 2 is flooded?
  • Node 1 must propose “No-Op” (Skip) for its slots so the cluster can move past them.
  • If Node 1 is down, other nodes can run Paxos on Node 1’s slots to “force-skip” them.
  • Trade-off: Mencius is extremely fast when all nodes are active, but performance degrades if some nodes are idle or slow, as they block the “common log” from progressing.

8. Advanced Paxos Variants

8.1 Fast Paxos (Low Latency)

In classic Paxos, the message flow is Client -> Proposer -> Acceptor -> Proposer -> Client. Fast Paxos allows clients to send requests directly to acceptors:
  • Classic Quorum: Q = ⌊N/2⌋ + 1
  • Fast Quorum: Q_f = ⌈3N/4⌉
  • If a client gets a “Fast Quorum” of identical responses, the value is chosen in one round-trip from the client’s perspective.
  • Trade-off: If there is contention (multiple clients), it “collides” and falls back to a slower recovery path.
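For concreteness, a quick computation of both quorum sizes from the formulas above (math.floor/math.ceil just make the rounding explicit):

import math

def classic_quorum(n):
    return math.floor(n / 2) + 1       # ⌊N/2⌋ + 1

def fast_quorum(n):
    return math.ceil(3 * n / 4)        # ⌈3N/4⌉

# With 5 acceptors: a classic quorum is 3 nodes, but a fast quorum needs 4.
assert classic_quorum(5) == 3 and fast_quorum(5) == 4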

8.2 Flexible Paxos (FPaxos)

Traditional Paxos uses majority quorums everywhere, so any two quorums intersect: Q_prepare ∩ Q_accept ≠ ∅ for every pair. Flexible Paxos observes that we only need the Prepare quorum of a later proposal to intersect the Accept quorum of an earlier successful proposal; two Accept quorums never need to intersect each other.
  • You can have a small Accept quorum (e.g., 2 nodes out of 10) for fast writes.
  • This requires a large Prepare quorum (e.g., 9 nodes out of 10) for leader changes.
  • This is powerful for systems where leader changes are rare but writes are frequent.
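With simple counting quorums, this intersection requirement reduces to |Q_prepare| + |Q_accept| > N (a result from the Flexible Paxos paper). A tiny sketch of that check:

def fpaxos_quorums_safe(n, prepare_quorum, accept_quorum):
    # Every Prepare quorum must intersect every Accept quorum. With plain
    # counting quorums, that holds exactly when the sizes sum to more than N.
    return prepare_quorum + accept_quorum > n

# 10 nodes: accepting with just 2 nodes is safe only if leader election contacts 9.
assert fpaxos_quorums_safe(10, prepare_quorum=9, accept_quorum=2)
assert not fpaxos_quorums_safe(10, prepare_quorum=8, accept_quorum=2)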

8.3 EPaxos (Egalitarian Paxos)

EPaxos is a leaderless consensus protocol:
  • Any node can propose a command for any index.
  • It uses dependency tracking (like Vector Clocks) to order concurrent commands.
  • If two commands don’t interfere (e.g., different keys), they can be committed in 1 round-trip without a central leader.
  • If they interfere, they use a “slow path” (2 round-trips) to resolve dependencies.

8.4 Dynamic Quorum Reconfiguration

A common question in Staff-level interviews: “How do you add or remove nodes from a running Paxos cluster without downtime?” This is known as Reconfiguration or Membership Change.

1. The Naive Approach (The Trap)

Simply changing the “Quorum Size” variable in a config file and restarting nodes is DANGEROUS.
  • If some nodes use N = 3 (quorum of 2) while others use N = 5 (quorum of 3), you can reach a state where two different values are chosen, because quorums drawn from the two configurations are no longer guaranteed to overlap.

2. The Paxos-in-Paxos Approach (Alpha)

The standard way to handle reconfiguration is to treat the Cluster Membership as a value decided by the protocol itself.
  • The current membership C_1 is used for all indices up to i.
  • Index i contains a special command: RECONFIG(C_2).
  • Once RECONFIG(C_2) is chosen at index i, all indices > i + Δ use membership C_2.
  • Δ (the “alpha” parameter) is a pipeline limit. It ensures that the system doesn’t switch to C_2 while commands using C_1 are still in flight.
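A sketch of the resulting lookup, assuming reconfiguration commands are recorded as (index, new_membership) pairs and Δ is the pipeline window (all names illustrative):

def membership_for_index(index, initial_config, reconfigs, delta):
    # reconfigs: list of (reconfig_index, new_config), sorted by index.
    # A RECONFIG chosen at index i only takes effect for indices > i + delta,
    # so commands already in the pipeline still run under the old membership.
    config = initial_config
    for reconfig_index, new_config in reconfigs:
        if index > reconfig_index + delta:
            config = new_config
    return config

c1, c2 = {"A", "B", "C"}, {"A", "B", "C", "D", "E"}
assert membership_for_index(12, c1, [(10, c2)], delta=5) == c1   # still inside the Δ window
assert membership_for_index(16, c1, [(10, c2)], delta=5) == c2   # past i + Δ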

3. Joint Consensus (Raft Style)

While Raft uses a 2-phase approach (Joint Consensus), Paxos systems (like Spanner) often use a simpler Leader-driven Reconfiguration:
  1. The leader proposes the new configuration.
  2. The system enters a state where both the old and new quorums must grant their approval.
  3. Once the new configuration is committed, the old nodes can be safely decommissioned.
Strategy       | Mechanism                                | Complexity
Stop-the-World | Pause all writes, update config, resume. | Simple, but high downtime.
Paxos Log      | Membership changes are just log entries. | Elegant, but requires a Δ window.
Joint Quorum   | Overlap quorums during transition.       | Complex, but zero downtime.
Staff Tip: When designing for high availability, always assume membership changes will happen (upgrades, scaling). Mention that Zookeeper actually had a bug in its early reconfiguration logic that took years to fix, highlighting the complexity of this “edge case.”

9. Paxos vs Raft: The Engineering Reality

Feature          | Paxos (Multi-Paxos)            | Raft
Leadership       | Implicit/stable leader         | Explicit/strong leader
Log Structure    | Can have gaps, out-of-order    | Strictly sequential, no gaps
Complexity       | High (many variants, subtle)   | Moderate (designed for teaching)
Performance      | Can be higher (FPaxos, EPaxos) | Limited by leader bottleneck
Failure Recovery | Fast (index-by-index)          | Slower (must reconstruct log)
When to choose Paxos:
  • You need extreme performance or geo-distribution (EPaxos).
  • You are building a system where “gaps” in the log are acceptable (e.g., non-deterministic state machines).
  • You want to optimize quorum sizes (Flexible Paxos).

10. Formal Verification: Why Paxos is “Correct”

Paxos is one of the most formally studied algorithms in CS.
  • TLA+: Leslie Lamport (creator of Paxos) also created TLA+, a formal specification language.
  • Invariants:
    • P1: An acceptor must accept the first proposal it receives.
    • P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.
  • Proving P2 is the core of Paxos’s safety. It relies on induction over proposal numbers and on the intersection of majority quorums.

11. Implementation Pitfalls

  1. Livelock (The Duel): Two proposers keep preempting each other with higher proposal numbers.
    • Fix: Use exponential backoff or a stable leader (Multi-Paxos).
  2. Disk I/O: Acceptors must fsync their promised_n and accepted_n/v to stable storage before replying (a minimal persistence sketch follows this list). If a node forgets its promise after a crash, safety is lost.
  3. Unique Proposal Numbers: If two nodes use the same n for different values, the protocol breaks. Always derive n from (counter, server_id).
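A minimal sketch of what “fsync before replying” can look like (illustrative only: JSON-on-disk with an atomic rename; proposal-number tuples round-trip as lists through JSON, so a real implementation would use a proper encoding and also fsync the directory):

import json
import os

class DurableAcceptorState:
    # Persistence wrapper: the acceptor's promise/accept state is flushed to
    # disk before any reply leaves the process, so a crash and restart cannot
    # make the node "forget" a promise it already made.
    def __init__(self, path):
        self.path = path

    def save(self, promised_n, accepted_n, accepted_v):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"promised_n": promised_n,
                       "accepted_n": accepted_n,
                       "accepted_v": accepted_v}, f)
            f.flush()
            os.fsync(f.fileno())        # durable before we reply
        os.replace(tmp, self.path)      # atomic rename

    def load(self):
        if not os.path.exists(self.path):
            return None, None, None     # fresh node: nothing promised or accepted
        with open(self.path) as f:
            state = json.load(f)
        return state["promised_n"], state["accepted_n"], state["accepted_v"]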

12. Interview “Master” Answer

“Paxos is a consensus protocol that ensures safety through quorum intersection. In its basic form, it uses a two-phase process: Prepare, where a proposer reserves a number and learns history, and Accept, where it proposes a value. Multi-Paxos optimizes this by electing a stable leader who skips the Prepare phase for subsequent commands. For high-performance needs, variants like Fast Paxos reduce latency by allowing direct client-to-acceptor communication, while EPaxos provides a leaderless approach to avoid bottlenecks. The core trade-off in Paxos is always prioritizing consistency over availability in the face of partitions, guaranteed by the fact that any two majorities must overlap on at least one node that ‘remembers’ the chosen value.”