Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Paxos Consensus Deep Dive

This chapter is a practical and theoretical deep dive into Paxos. The goal is to make Paxos feel as concrete and implementable as Raft, even though the original presentations are notoriously abstract.
After this chapter you should be able to:
  • Derive the basic Paxos protocol from first principles
  • Explain why quorum overlap is enough for safety
  • Implement single-decree Paxos and then extend it to Multi-Paxos
  • Compare Paxos vs Raft with confidence in design reviews

1. Problem Restatement: Single-Value Consensus

Paxos solves the consensus problem for a single value in an asynchronous, crash-fail system. Here is a real-world analogy that makes the protocol click: imagine a group of accountants who need to agree on a single number for the annual report. They communicate only by unreliable postal mail (letters can be lost or delayed). Each accountant can propose a number, but the group must converge on exactly one. The Paxos protocol is the set of rules these accountants follow so that, no matter how many letters get lost, they never accidentally publish two different numbers. We have:
  • A set of processes (nodes), some of which may crash and recover
  • A network that may lose, reorder, duplicate, or delay messages
  • No bounds on message delay (pure asynchrony), so we cannot reliably distinguish slow vs dead
Goal: All non-faulty processes agree on one value v such that:
  • Validity: v was proposed by some process
  • Agreement: No two processes decide different values
  • Termination: All non-faulty processes eventually decide (liveness – can be sacrificed under extreme asynchrony)
Paxos prioritizes safety over liveness: it ensures agreement and validity even if the system stalls.

2. Roles and Quorums

2.1 Roles

In Paxos, a process can play one or more of three logical roles:
  • Proposers – suggest values to be chosen
  • Acceptors – vote on proposals; a majority of them determines the chosen value
  • Learners – learn the chosen value to apply it to state machines
In practice, real systems often combine roles: each node can be proposer, acceptor, and learner.

2.2 Quorums

A quorum is any subset of acceptors whose size is large enough that any two quorums intersect. For classic Paxos, we use majority quorums:
  • With N acceptors, a quorum is any subset of size ⌊N/2⌋ + 1
  • This ensures that any two majorities share at least one acceptor
This intersection property is the core of Paxos safety. Think of it like a Venn diagram: any two majority groups from the same pool of people must share at least one member. That shared member is the “witness” who ensures that knowledge from one round of voting cannot be lost in the next round.

3. Paxos in Two Phases

At the heart of Paxos is a two-phase protocol driven by proposal numbers.

3.1 Proposal Numbers

Each proposal is identified by a monotonically increasing number n that is globally unique.
  • Simplest mental model: (counter, proposerId) pairs, ordered lexicographically
  • Higher (counter, proposerId) means higher proposal number

3.2 Phase 1 – Prepare / Promise

Goal: Reserve a proposal number and learn about any previously accepted values.
  • The proposer selects a new proposal number n and sends Prepare(n) to all (or a quorum of) acceptors
  • Each acceptor maintains two pieces of persistent state:
    • promised_n – highest proposal number it has promised not to go below
    • (accepted_n, accepted_v) – highest-numbered proposal it has accepted so far (or if none)
Acceptor behavior on Prepare(n):
on Prepare(n):
    // An acceptor receiving a Prepare is being asked: "Will you promise
    // not to accept any proposal numbered less than n?" If the acceptor
    // hasn't made a higher promise already, it agrees and shares any
    // value it previously accepted -- this is how knowledge propagates
    // from earlier rounds to later ones.
    if n > promised_n:
        promised_n = n          // Record the promise (MUST be persisted to disk)
        reply Promise(n, accepted_n, accepted_v)
    else:
        reply Nack(promised_n)  // "I already promised someone with a higher number"
If a proposer receives Promise responses from a quorum of acceptors, it can move to Phase 2.

3.3 Phase 2 – Accept / Accepted

Goal: Ask acceptors to accept a specific value v for proposal n.
  • The proposer chooses value v:
    • If any acceptor reported a previously accepted value (accepted_n, accepted_v) in its Promise, the proposer must choose the value with highest accepted_n among them
    • Otherwise, it can choose any value it wants (e.g., its own client request)
Then it sends Accept(n, v) to all acceptors. Acceptor behavior on Accept(n, v):
on Accept(n, v):
    // The acceptor checks: "Is this proposal at least as high as anything
    // I've promised?" If yes, it accepts the value. The >= (not just >)
    // is important: the proposer that did the Prepare for n is now coming
    // back with Accept for the same n, which must succeed.
    if n >= promised_n:
        promised_n = n     // Update promise (MUST be persisted before replying)
        accepted_n = n     // Record acceptance (MUST be persisted before replying)
        accepted_v = v     // Record the accepted value
        reply Accepted(n, v)
    else:
        reply Nack(promised_n)  // "A higher-numbered proposal preempted you"
When a quorum of acceptors has sent Accepted(n, v), value v is considered chosen. Learners can be notified either directly by acceptors or via an aggregator.

4. Why This Works: Safety Intuition

The key safety theorem:
Once a value v has been chosen, no other distinct value v' ≠ v can ever be chosen.

4.1 High-Level Proof Sketch

Assume value v is chosen in proposal n (i.e., a quorum Q1 of acceptors accepted (n, v)). Consider any later proposal n' > n that manages to gather a quorum Q2 of Promises.
  • Because both Q1 and Q2 are majorities, they must intersect at some acceptor a
  • At the time v was chosen, a must have accepted (n, v)
  • Later, when a receives Prepare(n'), it will respond with Promise(n', accepted_n = n, accepted_v = v)
  • The proposer for n' must choose the accepted value with the highest accepted_n from Promises → that is v
Thus any later proposal that can possibly succeed will propose v again, and no different value v' can be chosen. The combination of quorum intersection and the “pick highest accepted value” rule ensures safety.

5. Detailed Message Flow Example

Let’s walk through a concrete example with 3 acceptors: A1, A2, A3 and one proposer P.

5.1 No Contention Scenario

  1. P chooses proposal number n = 1 and sends Prepare(1) to A1, A2, A3
  2. All acceptors have promised_n = 0 and accepted_n = ⊥, so they all reply Promise(1, ⊥, ⊥)
  3. P receives Promises from a quorum (any 2, say A1, A2)
  4. Since none of them had an accepted value, P can choose its own value v = 42
  5. P sends Accept(1, 42) to A1, A2, A3
  6. Each acceptor checks n >= promised_n (true) and updates accepted_n=1, accepted_v=42, replying Accepted(1, 42)
  7. P sees a quorum of Accepted and decides that value 42 is chosen

5.2 Competing Proposers Scenario

Now imagine two proposers P1 and P2.
  • P1 uses proposal numbers 1, 3, 5, …
  • P2 uses proposal numbers 2, 4, 6, …
A typical interleaving:
  1. P1 sends Prepare(1) to all, gets Promises, moves to Phase 2
  2. Before its Accept(1, v1) messages reach acceptors, P2 sends Prepare(2)
  3. Acceptors respond Promise(2, accepted_n, accepted_v) depending on what they’ve already accepted
  4. Because 2 > 1, they may promise to P2 even if they previously promised to P1
This kind of competition can delay progress but cannot break safety:
  • Eventually, one proposer will “get ahead” with a higher proposal number and gather a quorum of Promises
  • That proposer will then carry through with Phase 2 and choose a value consistent with any previously chosen value
Liveness is probabilistic: with reasonable backoff strategies, the protocol eventually converges.

6. Multi-Paxos: From One Value to a Replicated Log

Basic Paxos decides one value. Real-world systems need a linearizable log of commands.

6.1 The Cost of Naive Paxos

If we run Basic Paxos for every log index:
  1. Latency: 2 round-trips (Prepare + Accept) per command.
  2. Throughput: Heavy contention on acceptor state.

6.2 Multi-Paxos Optimization

Multi-Paxos optimizes the “steady state” by assuming a stable leader. The analogy: Basic Paxos is like raising your hand and asking for speaking permission before every single sentence in a meeting. Multi-Paxos is like being elected chairperson once — from that point on, you can speak freely until someone objects. The “One Prepare” Rule:
  • A leader runs Phase 1 (Prepare) for all future log indices simultaneously (conceptually).
  • Once a proposer receives a quorum of Promises, it becomes the Master.
  • For all subsequent commands, the Master skips Phase 1 and sends only Accept (Phase 2) messages.
  • This reduces latency to 1 round-trip in the common case.

6.3 Master Leases

How does the system know who the current Master is without running Paxos every time?
  • Leases: Acceptors grant the Master a time-based lease (e.g., 5 seconds).
  • While the lease is active, acceptors promise to reject any Prepare requests from other proposers.
  • This allows the Master to perform local reads safely (Linearizable Reads), as it knows no other value can be chosen during the lease.

6.3.1 Advanced: Ephemeral Leases & Graceful Handoff

At the Principal level, “Master Leases” are not just binary. High-availability systems (like Google Spanner or etcd) use sophisticated lease management to minimize downtime during leader transitions.

The Problem: The “Dead Leader” Gap

If a Master has a 5-second lease and crashes at t=1st=1s, the acceptors will reject all other proposers until t=6st=6s. This creates a 4-second unavailability window where the cluster cannot process writes.

Optimization 1: Graceful Handoff

When a leader knows it is about to stop (e.g., during a rolling deployment), it can perform a Graceful Handoff:
  1. Step Down: The leader stops accepting new requests.
  2. Synchronize: It ensures its log is fully replicated to a designated successor.
  3. Renounce Lease: It sends a special message to acceptors saying, “I am renouncing my lease early.”
  4. Instant Election: The successor can immediately run Phase 1 and take over without waiting for the timeout.

Optimization 2: Bidirectional Leases (Leader Sticks)

Instead of just acceptors granting leases to the leader, the leader can include a Lease Renewal bit in its Phase 2 (Accept) messages.
  • If the Master is active and sending heartbeats/commands, the lease is automatically extended.
  • This prevents the “Leader Duel” where a leader’s lease expires just as it’s trying to commit a value.
Staff Tip: When designing for “Five Nines” (99.999%), the downtime during leader election is usually your biggest bottleneck. Always implement handoff mechanisms rather than relying on raw timeouts.

6.4 Log Compaction & Snapshots

A replicated log cannot grow forever. Eventually, you will run out of disk space, and a new node joining the cluster would take years to “replay” millions of commands. Snapshots allow you to “compact” the log:
  1. Capture State: The state machine (e.g., a Key-Value store) captures its current state at index ii.
  2. Discard Log: All log entries from 11 to ii can be safely deleted.
  3. Bootstrap: A new node starts by downloading the snapshot and then only replaying the log from i+1i+1 onwards.
  4. Coordination: In Paxos, snapshots are usually managed locally by each node, but they must coordinate to ensure they don’t delete entries that a lagging peer might still need.

7. Mencius: High-Throughput Multi-Leader Paxos

One major criticism of Multi-Paxos is that all writes must go through a single leader, creating a bottleneck. Mencius (named after the Chinese philosopher) is a variant designed for high-throughput, wide-area networks.

7.1 How Mencius Works

Mencius partitions the log indices among the nodes.
  • If there are NN nodes, Node ii is the “default leader” for all indices kk where ki(modN)k \equiv i \pmod N.
  • Efficiency: Since each node is the leader for its own slots, it can skip Phase 1 (Prepare) and go straight to Phase 2 (Accept).
  • Latency: No need to forward requests to a remote leader; nodes process local requests in their assigned slots.

7.2 Handling Idle Nodes (The Skip Mechanism)

What if Node 1 has no requests but Node 2 is flooded?
  • Node 1 must propose “No-Op” (Skip) for its slots so the cluster can move past them.
  • If Node 1 is down, other nodes can run Paxos on Node 1’s slots to “force-skip” them.
  • Trade-off: Mencius is extremely fast when all nodes are active, but performance degrades if some nodes are idle or slow, as they block the “common log” from progressing.

8. Advanced Paxos Variants

8.1 Fast Paxos (Low Latency)

In classic Paxos, the message flow is Client -> Proposer -> Acceptor -> Proposer -> Client — two full network round-trips. For latency-sensitive systems (think high-frequency trading or real-time bidding), this is too slow. Fast Paxos allows clients to send requests directly to acceptors, cutting out the proposer as middleman in the happy path:
  • Classic Quorum: Q=N/2+1Q = \lfloor N/2 \rfloor + 1
  • Fast Quorum: Qf=3N/4Q_f = \lceil 3N/4 \rceil
  • If a client gets a “Fast Quorum” of identical responses, the value is chosen in one round-trip from the client’s perspective.
  • Trade-off: If there is contention (multiple clients), it “collides” and falls back to a slower recovery path.

8.2 Flexible Paxos (FPaxos)

Traditional Paxos assumes QprepareQacceptQ_{prepare} \cap Q_{accept} \neq \emptyset. Flexible Paxos (Heidi Howard, 2016) observes that we only need the Prepare quorum of a later proposal to intersect with the Accept quorum of an earlier successful proposal. This insight unlocks asymmetric quorum configurations that were previously thought impossible.
  • You can have a small Accept quorum (e.g., 2 nodes out of 10) for fast writes.
  • This requires a large Prepare quorum (e.g., 9 nodes out of 10) for leader changes.
  • This is powerful for systems where leader changes are rare but writes are frequent.

8.3 EPaxos (Egalitarian Paxos)

EPaxos (Moraru, Andersen, Kaminsky, 2013) is a leaderless consensus protocol. Think of it as a round table where any knight can propose a decree and, if the decree does not conflict with anyone else’s, it passes in a single round of discussion:
  • Any node can propose a command for any index.
  • It uses dependency tracking (like Vector Clocks) to order concurrent commands.
  • If two commands don’t interfere (e.g., different keys), they can be committed in 1 round-trip without a central leader.
  • If they interfere, they use a “slow path” (2 round-trips) to resolve dependencies.

8.4 Dynamic Quorum Reconfiguration

A common question in Staff-level interviews: “How do you add or remove nodes from a running Paxos cluster without downtime?” This is known as Reconfiguration or Membership Change.

1. The Naive Approach (The Trap)

Simply changing the “Quorum Size” variable in a config file and restarting nodes is DANGEROUS.
  • If some nodes use N=3N=3 (Quorum=2) and others use N=5N=5 (Quorum=3), you can reach a state where two different values are chosen because the quorums don’t overlap correctly across the transition.

2. The Paxos-in-Paxos Approach (Alpha)

The standard way to handle reconfiguration is to treat the Cluster Membership as a value decided by the protocol itself.
  • Current Membership C1C_1 is used for all indices up to ii.
  • Index ii contains a special command: RECONFIG(C_2).
  • Once RECONFIG(C_2) is chosen at index ii, all indices >i+Δ> i + \Delta use membership C2C_2.
  • Δ\Delta (the “alpha” parameter) is a pipeline limit. It ensures that the system doesn’t switch to C2C_2 while commands using C1C_1 are still in flight.

3. Joint Consensus (Raft Style)

While Raft uses a 2-phase approach (Joint Consensus), Paxos systems (like Spanner) often use a simpler Leader-driven Reconfiguration:
  1. The leader proposes the new configuration.
  2. The system enters a state where both the old and new quorums must grant their approval.
  3. Once the new configuration is committed, the old nodes can be safely decommissioned.
StrategyMechanismComplexity
Stop-the-WorldPause all writes, update config, resume.Simple, but high downtime.
Paxos LogMembership changes are just log entries.Elegant, but requires Δ\Delta window.
Joint QuorumOverlap quorums during transition.Complex, but zero downtime.
Staff Tip: When designing for high availability, always assume membership changes will happen (upgrades, scaling). Mention that Zookeeper actually had a bug in its early reconfiguration logic that took years to fix, highlighting the complexity of this “edge case.”

9. Paxos vs Raft: The Engineering Reality

FeaturePaxos (Multi-Paxos)Raft
LeadershipImplicit/Stable LeaderExplicit/Strong Leader
Log StructureCan have gaps, out-of-orderStrictly sequential, no gaps
ComplexityHigh (many variants, subtle)Moderate (designed for teaching)
PerformanceCan be higher (FPaxos, EPaxos)Limited by leader bottleneck
Failure RecoveryFast (index-by-index)Slower (must reconstruct log)
When to choose Paxos:
  • You need extreme performance or geo-distribution (EPaxos).
  • You are building a system where “gaps” in the log are acceptable (e.g., non-deterministic state machines).
  • You want to optimize quorum sizes (Flexible Paxos).

10. Formal Verification: Why Paxos is “Correct”

Paxos is one of the most formally studied algorithms in CS.
  • TLA+: Leslie Lamport (creator of Paxos) also created TLA+, a formal specification language.
  • Invariants:
    • P1: An acceptor must accept the first proposal it receives.
    • P2: If a proposal with value vv is chosen, then every higher-numbered proposal that is chosen has value vv.
  • Proving P2 is the core of Paxos’s safety. It relies on the induction over proposal numbers and the intersection of majorities.

11. Implementation Pitfalls

  1. Livelock (The Duel): Two proposers keep preempting each other with higher proposal numbers. This is Paxos’s most famous operational failure mode — it is safe (no wrong answer is ever chosen) but progress stalls indefinitely. Real-world example: early Google Chubby deployments saw this under high contention.
    • Fix: Use exponential backoff with jitter, or elect a stable leader (Multi-Paxos) so only one node proposes at a time.
  2. Disk I/O: Acceptors must fsync their promised_n and accepted_n/v to stable storage before replying. If a node forgets its promise after a crash (e.g., data was only in the OS page cache), safety is violated — the node might accept a contradictory proposal. This is the kind of bug that only manifests under simultaneous power failure + leader crash, making it almost impossible to catch in testing.
    • Fix: Use fdatasync() or O_DIRECT writes. Profile your fsync latency — on cloud SSDs it is typically 0.1-1ms, but on spinning disks it can be 5-15ms, becoming your throughput bottleneck.
  3. Unique Proposal Numbers: If two nodes use the same nn for different values, the protocol breaks. This is not a theoretical concern — it happens when nodes are provisioned from the same image with the same initial counter.
    • Fix: Always use (counter, server_id) pairs, where server_id is globally unique (e.g., from Zookeeper, or a MAC-based hash).
  4. Confusing “chosen” with “known to be chosen”: A value is chosen when a quorum of acceptors has accepted it, but no single node may know this yet. The proposer infers it from the Accept responses, and learners must be told. A common implementation bug is to act on a value before confirming it is truly chosen — for example, applying a command to the state machine based on a single Accept response rather than waiting for a quorum. This violates safety because the value might not actually be chosen if other acceptors reject.
    • Fix: Only apply a value to the state machine after confirming a quorum of Accepted responses. Track this explicitly in your implementation.
  5. Ignoring the “gaps” problem in Multi-Paxos: Unlike Raft, a Multi-Paxos log can have gaps — index 5 might be decided before index 3. If your state machine applies commands in order, you must wait for gaps to fill before applying later entries. Naive implementations that skip gaps or apply out of order produce inconsistent state.
    • Fix: Maintain an “applied up to” watermark and only advance it when all preceding entries are decided. Use no-ops to fill gaps proactively.
Distributed systems pitfall — the “ghost leader”: In Multi-Paxos, a former leader that was partitioned away may still believe it holds the leadership. When the partition heals, it may try to send Accept messages for old log indices with stale proposal numbers. If acceptors have already promised higher numbers, these Accepts are harmlessly rejected. But if the ghost leader’s Accept arrives at a newly restarted acceptor whose state was lost (see pitfall #2 above), it might be incorrectly accepted, violating safety. This is why durable persistence of promised_n is a hard safety requirement, not an optimization.

12. Interview “Master” Answer

“Paxos is a consensus protocol that ensures safety through quorum intersection. In its basic form, it uses a two-phase process: Prepare, where a proposer reserves a number and learns history, and Accept, where it proposes a value. Multi-Paxos optimizes this by electing a stable leader who skips the Prepare phase for subsequent commands. For high-performance needs, variants like Fast Paxos reduce latency by allowing direct client-to-acceptor communication, while EPaxos provides a leaderless approach to avoid bottlenecks. The core trade-off in Paxos is always prioritizing consistency over availability in the face of partitions, guaranteed by the fact that any two majorities must overlap on at least one node that ‘remembers’ the chosen value.”

Interview Deep-Dive

Strong Answer:
  • Every Paxos decision requires acknowledgment from a majority (quorum) of nodes. With N nodes, a majority is floor(N/2) + 1. The mathematical property: any two majorities must share at least one member. With 5 nodes, any group of 3 overlaps with any other group of 3 by at least 1 node.
  • This guarantees safety. In the Prepare phase, a proposer contacts a majority and learns about any previously accepted values. If a value was accepted by a majority in an earlier round, the new proposer’s Prepare majority will include at least one node from that Accept majority. That node says “I already accepted value V,” and the new proposer must propose V, not its own value.
  • The analogy: 5 people in a room. If 3 sign a contract, later asking any 3 “did anyone sign?” guarantees at least one says “yes, here is what it said.”
Follow-up: What happens if an acceptor crashes after promising in Phase 1 but before voting in Phase 2?Nothing unsafe happens. The proposer does not receive enough Phase 2 votes, so the proposal fails. A new proposer starts a fresh round with a higher number. The crashed acceptor, when it recovers, still has its Phase 1 promise on disk (mandatory for correct implementations). It rejects lower-numbered requests but accepts the new proposer’s higher-numbered round. Safety is never compromised.
Strong Answer:
  • If two proposers simultaneously propose different values, they can livelock: Proposer A sends Prepare(1), Proposer B sends Prepare(2) preempting A’s Accept. A retries with Prepare(3) preempting B. This repeats indefinitely. Safety holds (no wrong value chosen) but no value is ever chosen.
  • Mitigations: (1) Exponential backoff with jitter — preempted proposers wait an increasing random delay. (2) Leader election (Multi-Paxos) — one designated proposer eliminates contention. (3) Multi-Paxos skips Phase 1 entirely for subsequent slots once a leader is established, cutting message count in half.
  • The key insight: livelock is a liveness problem, not a safety problem. The system stalls but never produces an incorrect result.
Follow-up: How does Multi-Paxos’s leader compare to Raft’s strong leader?In Multi-Paxos, the leader is an optimization, not a requirement. Any node can still propose; the leader prevents contention by convention. In Raft, the leader is strict — only the leader proposes entries. Raft has a cleaner invariant but is less flexible under partial failures. Multi-Paxos can make progress even with a degraded leader, while Raft must complete a full election first.
Strong Answer:
  • EPaxos is leaderless: any replica can propose commands. Non-conflicting commands commit in a single round-trip (fast path). Conflicting commands require an additional round for ordering. No leader means no bottleneck.
  • I would choose EPaxos for geo-distributed systems with low contention. A European client proposes to the European replica and commits in one local-quorum round-trip, avoiding cross-continent leader latency. With Raft, all writes traverse to the leader.
  • Trade-offs: EPaxos is significantly more complex. High-contention workloads rarely use the fast path, degrading to Multi-Paxos performance with added complexity. Fewer production implementations exist. I would only use EPaxos when geo-distribution is a hard requirement, contention is low, and the team has the expertise.
Follow-up: How does EPaxos determine whether two commands conflict?The application provides a conflict relation: given two commands, do they interfere? For a key-value store, commands conflict if they touch the same key and at least one is a write. Two GETs on the same key do not conflict. When a replica proposes, it checks for concurrent conflicting commands. If none, it commits via the fast path. If conflicts exist, it enters a second phase to establish total order. A conservative conflict relation reduces fast path usage; a permissive one risks incorrect ordering.