Skip to main content

Documentation Index

Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt

Use this file to discover all available pages before exploring further.

Distributed Snapshots

How do you take a picture of a system that never stops moving? In a distributed system, there is no global clock and no global memory. Capturing a “consistent” global state is one of the most fundamental challenges. Think of it like trying to photograph a flock of birds in flight from multiple cameras at different positions. If the cameras do not fire at exactly the same instant, each photo shows the birds in different positions — and stitching them together produces an impossible scene where the same bird appears in two places or vanishes entirely. Distributed snapshots solve this coordination problem without requiring perfectly synchronized shutters.
Module Duration: 6-8 hours
Key Topics: Global State, Consistent Cuts, Chandy-Lamport Algorithm, Termination Detection
Interview Focus: How to debug distributed state, checkpointing, and recovery

1. The Challenge of Global State

In a single machine, you can stop the world and dump memory. In a distributed system:
  1. No Global Clock: We can’t say “everyone record your state at exactly 10:00:00”.
  2. Messages in Flight: The state isn’t just what’s on the nodes; it’s also the messages currently traveling on the network.

1.1 The “Money Transfer” Problem

Imagine two banks, A and B, each with 100.Total=100. Total = 200.
  1. A sends $50 to B.
  2. If you snapshot A after it sends, and B before it receives, the total looks like 150.150. 50 vanished!
  3. If you snapshot B after it receives, and A before it sends, the total looks like 250.250. 50 was created!
A consistent snapshot must account for both node state and channel state.

2. Consistent Cuts

A Cut is a line drawn across the execution of a distributed system, dividing events into “past” and “future”. The analogy here is a detective’s timeline pinned to a corkboard: you draw a vertical line across all the timelines and say “everything to the left has happened, everything to the right has not.” A consistent cut means you never have a consequence (a received message) showing up on the left while its cause (the sending of that message) is still on the right.

2.1 Consistent vs. Inconsistent Cuts

  • Consistent Cut: If an event ee is in the cut (past), then every event that happened-before ee must also be in the cut.
  • Inconsistent Cut: A cut where an effect is recorded but its cause is not. (e.g., recording a message receipt but not its sending).
Theorem: A snapshot is consistent if and only if it corresponds to a consistent cut.

3. The Chandy-Lamport Algorithm

The classic algorithm for capturing a consistent global state in a system with FIFO channels. Published by K. Mani Chandy and Leslie Lamport in 1985, it remains one of the most elegant protocols in distributed systems because it requires no pause in application processing — the system keeps running while the snapshot is being taken.

3.1 The Protocol: Markers and Recording Rules

The algorithm uses a special Marker message to coordinate the snapshot. Crucially, markers don’t stop the application; they “sweep” across the system like a wave rolling through the network. Think of it like a census. Census workers (markers) fan out across a city. When a worker arrives at your door, you report your current state (bank balance, household members). The worker then moves on to your neighbors. The key rule: once a worker has visited you, every letter you subsequently send is “post-census.” Because mail is delivered in order (FIFO), anyone who receives a census worker after receiving your post-census letter knows to stop recording your channel — they already have the right picture of what you sent.
  • Marker Sending Rule:
    • Process PiP_i records its state and sends a Marker on all outgoing channels before sending any more application messages.
  • Marker Receiving Rule (Process PjP_j receives Marker from PiP_i):
    • Case A: First Marker received:
      • PjP_j records its state immediately.
      • PjP_j marks the channel from PiP_i to PjP_j as empty.
      • PjP_j starts recording all incoming messages on all other channels.
      • PjP_j sends Marker on all its outgoing channels.
    • Case B: Already recorded state:
      • PjP_j stops recording the channel from PiP_i to PjP_j.
      • The state of that channel is the sequence of messages recorded since PjP_j first took its own snapshot.

3.2 Why it Works: The Proof of Consistency

A Chandy-Lamport snapshot represents a state that could have happened. Even if the snapshot doesn’t correspond to any specific “instant” in real time, it is Consistent. The Logic: If message MM was received by PjP_j before PjP_j took its snapshot, but sent by PiP_i after PiP_i took its snapshot, the cut would be inconsistent. However, the protocol prevents this: PiP_i sends the Marker before any message sent after its snapshot. Since channels are FIFO, the Marker must reach PjP_j before any subsequent message MM. Thus, PjP_j would have already taken its snapshot.

4. Advanced Variant: Asynchronous Barrier Snapshotting (ABS)

Modern stream processors like Apache Flink use a specialized version of Chandy-Lamport designed for high-throughput DAGs.

4.1 Flink’s Barrier Mechanism

Instead of markers on every channel, Flink injects Barriers into the data stream at the sources. Think of barriers as floating buoys dropped into a river at specific points. As the buoys flow downstream, every dam (operator) they pass through takes a snapshot of the water level (state) behind it.
  1. Alignment: When an operator has multiple input streams, it must wait (buffer) messages from faster streams until it receives the barrier from all input streams. This is the most expensive part — imagine a dam with two inlets; it cannot snapshot until both buoys arrive, so the faster inlet backs up.
  2. Snapshot: Once all barriers arrive, the operator snapshots its state to a durable store (like S3/HDFS) and forwards the barrier.
  3. Optimization: Unaligned Checkpoints (introduced in Flink 1.11) allow operators to snapshot immediately upon seeing the first barrier, by including the buffered data from other inputs in the snapshot itself. This reduces “backpressure” at the cost of larger snapshot sizes.
Distributed systems pitfall: Barrier alignment can cause cascading backpressure in deep DAGs. If one slow source delays its barrier by seconds, every downstream operator buffering aligned input data can exhaust memory and crash. In production Flink deployments, monitor checkpointAlignmentDuration — if it regularly exceeds your checkpoint interval, switch to unaligned checkpoints or investigate the slow source.

5. Evaluating Global Predicates

Taking a snapshot is only half the battle. Once you have it, how do you use it to detect distributed properties?

5.1 Stable vs. Unstable Predicates

  • Stable Predicates: Once true, they stay true (e.g., “The system is deadlocked”, “The computation has terminated”). These are like a broken vase — once shattered, no future event can un-shatter it.
  • Unstable Predicates: Can be true at one instant and false the next (e.g., “Queue size > 100”). These are like the temperature of a room — it fluctuates and a single reading might not represent the overall trend.
  • Staff Tip: You can only reliably detect Stable predicates using Chandy-Lamport. Detecting unstable predicates requires capturing all possible consistent cuts (lattice of global states), which is O(2N)O(2^N) complex. This is why systems like Flink’s “exactly-once” processing is built around stable predicates (committed offsets) rather than transient state.

5.2 Distributed Deadlock Detection

To find a deadlock, you take a snapshot of the Wait-For Graph (WFG). If the snapshot contains a cycle, the system is deadlocked. Without Chandy-Lamport, you might see a “phantom deadlock” because you caught the tail of one message and the head of another.
Distributed systems pitfall — phantom deadlocks: If you collect the wait-for graph without using a consistent snapshot algorithm, you can observe cycles that never actually existed. Node A was waiting for B at time t1t_1, and B was waiting for A at time t2t_2, but by t2t_2 node A had already released. Your inconsistent snapshot sees both waits and reports a deadlock that never happened. This was a real bug in early distributed database implementations — false deadlocks would trigger unnecessary transaction aborts, degrading throughput under load.

6. Real-World Applications

6.1 Distributed Debugging

Finding “zombie” processes or deadlocks requires a global view of who is waiting on whom. Tools like Distributed Snapshot Debuggers (built on Chandy-Lamport) let engineers freeze-frame a running system’s state and inspect it post-mortem — similar to how a core dump works on a single machine, but across dozens or hundreds of nodes. Microsoft’s research on TimeTravel Debugging for distributed systems extends this idea by allowing engineers to step backward through distributed events.

6.2 Checkpointing and Recovery

Large-scale processing systems (like Apache Flink, Apache Spark Structured Streaming, and Google Dataflow) use Chandy-Lamport (or variants like Asynchronous Barrier Snapshotting) to save state. If a node fails, the system rolls back to the last consistent global snapshot and replays only the events since that checkpoint. This is the foundation of exactly-once processing in stream engines: by atomically capturing the state of all operators and their input offsets, the system can recover without duplicating or losing data. Real-world numbers: Flink at Alibaba checkpoints hundreds of terabytes of state across thousands of operators. Checkpoint intervals of 1-5 minutes are typical; sub-second checkpointing is possible with incremental snapshots backed by RocksDB.

6.3 Garbage Collection

Identifying objects that are no longer reachable across a network requires a consistent snapshot of the “object graph”. Distributed garbage collectors (used in systems like Erlang clusters and Java RMI) periodically take snapshots to identify objects with no remaining remote references. Without a consistent snapshot, the GC might incorrectly collect an object that still has a reference in a message currently in transit.

6.4 Database Consistent Backups

Databases like CockroachDB and TiDB use snapshot-based protocols to take consistent backups across a sharded cluster without pausing writes. Each shard records its state at a specific MVCC timestamp, and because the system uses hybrid logical clocks, these per-shard snapshots together form a consistent global cut.

5. Interview Questions

Answer: Even with NTP, clock skew exists (typically 1-10ms). In that window, thousands of messages can be sent and received. A snapshot triggered by physical time would likely be inconsistent because it wouldn’t account for messages “in flight”. Chandy-Lamport uses causality (markers) rather than physical time to ensure consistency.
Answer:
  1. FIFO Channels: Messages on a channel must be delivered in the order they were sent.
  2. Reliable Delivery: No messages are lost.
  3. No Process Crashes: During the snapshot period (though modern variants handle this).

7. Common Pitfalls

Chandy-Lamport requires FIFO channels. If your system uses UDP, unordered message queues, or multiple network paths between nodes, markers can arrive before earlier application messages, producing an inconsistent snapshot. Modern systems solve this by either (a) using TCP which provides FIFO per-connection, or (b) using sequence numbers to impose logical FIFO ordering.
A common implementation mistake is snapshotting only node state and ignoring messages in transit. If you skip channel recording, your snapshot will be missing data — like counting the money in every bank vault but ignoring the armored trucks on the road. This leads to “lost” or “phantom” data during recovery.
If you trigger snapshots too frequently, the marker messages themselves consume significant bandwidth and the snapshot storage costs grow rapidly. In Flink, excessively frequent checkpoints (sub-second) can cause more time spent snapshotting than processing, effectively halving throughput. A senior engineer would say: “Checkpoint interval is a direct trade-off between recovery time (RPO) and steady-state throughput.”
Snapshots enable recovery by replaying events from the snapshot forward. But if your state machine uses random(), System.currentTimeMillis(), or other non-deterministic functions, replaying the same events produces different state. This is why Flink and similar systems require operators to be deterministic, or to explicitly record non-deterministic choices as part of the snapshot.

8. Key Takeaways

Global State = Nodes + Channels

You must record both the local memory and the messages currently on the wires. Forgetting channel state is the most common implementation mistake.

Causality Over Time

Consistent snapshots rely on logical ordering (happened-before) rather than wall-clock time. This is why Chandy-Lamport works in asynchronous systems where NTP cannot be trusted.

FIFO is Non-Negotiable

The entire correctness proof of Chandy-Lamport depends on FIFO channels. Verify this assumption before adopting the algorithm, or use a variant designed for non-FIFO networks.

Snapshot Cost is a Trade-off

More frequent snapshots reduce recovery time but increase steady-state overhead. A senior engineer tunes checkpoint intervals based on measured RPO requirements and observed throughput impact.