Skip to main content

Distributed Snapshots

How do you take a picture of a system that never stops moving? In a distributed system, there is no global clock and no global memory. Capturing a “consistent” global state is one of the most fundamental challenges.
Module Duration: 6-8 hours
Key Topics: Global State, Consistent Cuts, Chandy-Lamport Algorithm, Termination Detection
Interview Focus: How to debug distributed state, checkpointing, and recovery

1. The Challenge of Global State

In a single machine, you can stop the world and dump memory. In a distributed system:
  1. No Global Clock: We can’t say “everyone record your state at exactly 10:00:00”.
  2. Messages in Flight: The state isn’t just what’s on the nodes; it’s also the messages currently traveling on the network.

1.1 The “Money Transfer” Problem

Imagine two banks, A and B, each with 100.Total=100. Total = 200.
  1. A sends $50 to B.
  2. If you snapshot A after it sends, and B before it receives, the total looks like 150.150. 50 vanished!
  3. If you snapshot B after it receives, and A before it sends, the total looks like 250.250. 50 was created!
A consistent snapshot must account for both node state and channel state.

2. Consistent Cuts

A Cut is a line drawn across the execution of a distributed system, dividing events into “past” and “future”.

2.1 Consistent vs. Inconsistent Cuts

  • Consistent Cut: If an event ee is in the cut (past), then every event that happened-before ee must also be in the cut.
  • Inconsistent Cut: A cut where an effect is recorded but its cause is not. (e.g., recording a message receipt but not its sending).
Theorem: A snapshot is consistent if and only if it corresponds to a consistent cut.

3. The Chandy-Lamport Algorithm

The classic algorithm for capturing a consistent global state in a system with FIFO channels.

3.1 The Protocol: Markers and Recording Rules

The algorithm uses a special Marker message to coordinate the snapshot. Crucially, markers don’t stop the application; they “sweep” across the system.
  • Marker Sending Rule:
    • Process PiP_i records its state and sends a Marker on all outgoing channels before sending any more application messages.
  • Marker Receiving Rule (Process PjP_j receives Marker from PiP_i):
    • Case A: First Marker received:
      • PjP_j records its state immediately.
      • PjP_j marks the channel from PiP_i to PjP_j as empty.
      • PjP_j starts recording all incoming messages on all other channels.
      • PjP_j sends Marker on all its outgoing channels.
    • Case B: Already recorded state:
      • PjP_j stops recording the channel from PiP_i to PjP_j.
      • The state of that channel is the sequence of messages recorded since PjP_j first took its own snapshot.

3.2 Why it Works: The Proof of Consistency

A Chandy-Lamport snapshot represents a state that could have happened. Even if the snapshot doesn’t correspond to any specific “instant” in real time, it is Consistent. The Logic: If message MM was received by PjP_j before PjP_j took its snapshot, but sent by PiP_i after PiP_i took its snapshot, the cut would be inconsistent. However, the protocol prevents this: PiP_i sends the Marker before any message sent after its snapshot. Since channels are FIFO, the Marker must reach PjP_j before any subsequent message MM. Thus, PjP_j would have already taken its snapshot.

4. Advanced Variant: Asynchronous Barrier Snapshotting (ABS)

Modern stream processors like Apache Flink use a specialized version of Chandy-Lamport designed for high-throughput DAGs.

4.1 Flink’s Barrier Mechanism

Instead of markers on every channel, Flink injects Barriers into the data stream at the sources.
  1. Alignment: When an operator has multiple input streams, it must wait (buffer) messages from faster streams until it receives the barrier from all input streams.
  2. Snapshot: Once all barriers arrive, the operator snapshots its state to a durable store (like S3/HDFS) and forwards the barrier.
  3. Optimization: Unaligned Checkpoints allow operators to snapshot immediately upon seeing the first barrier, by including the buffered data from other inputs in the snapshot itself. This reduces “backpressure” at the cost of larger snapshot sizes.

5. Evaluating Global Predicates

Taking a snapshot is only half the battle. Once you have it, how do you use it to detect distributed properties?

5.1 Stable vs. Unstable Predicates

  • Stable Predicates: Once true, they stay true (e.g., “The system is deadlocked”, “The computation has terminated”).
  • Unstable Predicates: Can be true at one instant and false the next (e.g., “Queue size > 100”).
  • Staff Tip: You can only reliably detect Stable predicates using Chandy-Lamport. Detecting unstable predicates requires capturing all possible consistent cuts (lattice of global states), which is O(2N)O(2^N) complex.

5.2 Distributed Deadlock Detection

To find a deadlock, you take a snapshot of the Wait-For Graph (WFG). If the snapshot contains a cycle, the system is deadlocked. Without Chandy-Lamport, you might see a “phantom deadlock” because you caught the tail of one message and the head of another.

6. Real-World Applications

4.1 Distributed Debugging

Finding “zombie” processes or deadlocks requires a global view of who is waiting on whom.

4.2 Checkpointing and Recovery

Large-scale processing systems (like Apache Flink) use Chandy-Lamport (or variants like Asynchronous Barrier Snapshotting) to save state. If a node fails, the system rolls back to the last consistent global snapshot.

4.3 Garbage Collection

Identifying objects that are no longer reachable across a network requires a consistent snapshot of the “object graph”.

5. Interview Questions

Answer: Even with NTP, clock skew exists (typically 1-10ms). In that window, thousands of messages can be sent and received. A snapshot triggered by physical time would likely be inconsistent because it wouldn’t account for messages “in flight”. Chandy-Lamport uses causality (markers) rather than physical time to ensure consistency.
Answer:
  1. FIFO Channels: Messages on a channel must be delivered in the order they were sent.
  2. Reliable Delivery: No messages are lost.
  3. No Process Crashes: During the snapshot period (though modern variants handle this).

6. Key Takeaways

Global State = Nodes + Channels

You must record both the local memory and the messages currently on the wires.

Causality Over Time

Consistent snapshots rely on logical ordering (happened-before) rather than wall-clock time.