Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Distributed Snapshots
How do you take a picture of a system that never stops moving? In a distributed system, there is no global clock and no global memory. Capturing a “consistent” global state is one of the most fundamental challenges. Think of it like trying to photograph a flock of birds in flight from multiple cameras at different positions. If the cameras do not fire at exactly the same instant, each photo shows the birds in different positions — and stitching them together produces an impossible scene where the same bird appears in two places or vanishes entirely. Distributed snapshots solve this coordination problem without requiring perfectly synchronized shutters.Module Duration: 6-8 hours
Key Topics: Global State, Consistent Cuts, Chandy-Lamport Algorithm, Termination Detection
Interview Focus: How to debug distributed state, checkpointing, and recovery
Key Topics: Global State, Consistent Cuts, Chandy-Lamport Algorithm, Termination Detection
Interview Focus: How to debug distributed state, checkpointing, and recovery
1. The Challenge of Global State
In a single machine, you can stop the world and dump memory. In a distributed system:- No Global Clock: We can’t say “everyone record your state at exactly 10:00:00”.
- Messages in Flight: The state isn’t just what’s on the nodes; it’s also the messages currently traveling on the network.
1.1 The “Money Transfer” Problem
Imagine two banks, A and B, each with 200.- A sends $50 to B.
- If you snapshot A after it sends, and B before it receives, the total looks like 50 vanished!
- If you snapshot B after it receives, and A before it sends, the total looks like 50 was created!
2. Consistent Cuts
A Cut is a line drawn across the execution of a distributed system, dividing events into “past” and “future”. The analogy here is a detective’s timeline pinned to a corkboard: you draw a vertical line across all the timelines and say “everything to the left has happened, everything to the right has not.” A consistent cut means you never have a consequence (a received message) showing up on the left while its cause (the sending of that message) is still on the right.2.1 Consistent vs. Inconsistent Cuts
- Consistent Cut: If an event is in the cut (past), then every event that happened-before must also be in the cut.
- Inconsistent Cut: A cut where an effect is recorded but its cause is not. (e.g., recording a message receipt but not its sending).
3. The Chandy-Lamport Algorithm
The classic algorithm for capturing a consistent global state in a system with FIFO channels. Published by K. Mani Chandy and Leslie Lamport in 1985, it remains one of the most elegant protocols in distributed systems because it requires no pause in application processing — the system keeps running while the snapshot is being taken.3.1 The Protocol: Markers and Recording Rules
The algorithm uses a special Marker message to coordinate the snapshot. Crucially, markers don’t stop the application; they “sweep” across the system like a wave rolling through the network. Think of it like a census. Census workers (markers) fan out across a city. When a worker arrives at your door, you report your current state (bank balance, household members). The worker then moves on to your neighbors. The key rule: once a worker has visited you, every letter you subsequently send is “post-census.” Because mail is delivered in order (FIFO), anyone who receives a census worker after receiving your post-census letter knows to stop recording your channel — they already have the right picture of what you sent.- Marker Sending Rule:
- Process records its state and sends a Marker on all outgoing channels before sending any more application messages.
- Marker Receiving Rule (Process receives Marker from ):
- Case A: First Marker received:
- records its state immediately.
- marks the channel from to as empty.
- starts recording all incoming messages on all other channels.
- sends Marker on all its outgoing channels.
- Case B: Already recorded state:
- stops recording the channel from to .
- The state of that channel is the sequence of messages recorded since first took its own snapshot.
- Case A: First Marker received:
3.2 Why it Works: The Proof of Consistency
A Chandy-Lamport snapshot represents a state that could have happened. Even if the snapshot doesn’t correspond to any specific “instant” in real time, it is Consistent. The Logic: If message was received by before took its snapshot, but sent by after took its snapshot, the cut would be inconsistent. However, the protocol prevents this: sends the Marker before any message sent after its snapshot. Since channels are FIFO, the Marker must reach before any subsequent message . Thus, would have already taken its snapshot.4. Advanced Variant: Asynchronous Barrier Snapshotting (ABS)
Modern stream processors like Apache Flink use a specialized version of Chandy-Lamport designed for high-throughput DAGs.4.1 Flink’s Barrier Mechanism
Instead of markers on every channel, Flink injects Barriers into the data stream at the sources. Think of barriers as floating buoys dropped into a river at specific points. As the buoys flow downstream, every dam (operator) they pass through takes a snapshot of the water level (state) behind it.- Alignment: When an operator has multiple input streams, it must wait (buffer) messages from faster streams until it receives the barrier from all input streams. This is the most expensive part — imagine a dam with two inlets; it cannot snapshot until both buoys arrive, so the faster inlet backs up.
- Snapshot: Once all barriers arrive, the operator snapshots its state to a durable store (like S3/HDFS) and forwards the barrier.
- Optimization: Unaligned Checkpoints (introduced in Flink 1.11) allow operators to snapshot immediately upon seeing the first barrier, by including the buffered data from other inputs in the snapshot itself. This reduces “backpressure” at the cost of larger snapshot sizes.
5. Evaluating Global Predicates
Taking a snapshot is only half the battle. Once you have it, how do you use it to detect distributed properties?5.1 Stable vs. Unstable Predicates
- Stable Predicates: Once true, they stay true (e.g., “The system is deadlocked”, “The computation has terminated”). These are like a broken vase — once shattered, no future event can un-shatter it.
- Unstable Predicates: Can be true at one instant and false the next (e.g., “Queue size > 100”). These are like the temperature of a room — it fluctuates and a single reading might not represent the overall trend.
- Staff Tip: You can only reliably detect Stable predicates using Chandy-Lamport. Detecting unstable predicates requires capturing all possible consistent cuts (lattice of global states), which is complex. This is why systems like Flink’s “exactly-once” processing is built around stable predicates (committed offsets) rather than transient state.
5.2 Distributed Deadlock Detection
To find a deadlock, you take a snapshot of the Wait-For Graph (WFG). If the snapshot contains a cycle, the system is deadlocked. Without Chandy-Lamport, you might see a “phantom deadlock” because you caught the tail of one message and the head of another.6. Real-World Applications
6.1 Distributed Debugging
Finding “zombie” processes or deadlocks requires a global view of who is waiting on whom. Tools like Distributed Snapshot Debuggers (built on Chandy-Lamport) let engineers freeze-frame a running system’s state and inspect it post-mortem — similar to how a core dump works on a single machine, but across dozens or hundreds of nodes. Microsoft’s research on TimeTravel Debugging for distributed systems extends this idea by allowing engineers to step backward through distributed events.6.2 Checkpointing and Recovery
Large-scale processing systems (like Apache Flink, Apache Spark Structured Streaming, and Google Dataflow) use Chandy-Lamport (or variants like Asynchronous Barrier Snapshotting) to save state. If a node fails, the system rolls back to the last consistent global snapshot and replays only the events since that checkpoint. This is the foundation of exactly-once processing in stream engines: by atomically capturing the state of all operators and their input offsets, the system can recover without duplicating or losing data. Real-world numbers: Flink at Alibaba checkpoints hundreds of terabytes of state across thousands of operators. Checkpoint intervals of 1-5 minutes are typical; sub-second checkpointing is possible with incremental snapshots backed by RocksDB.6.3 Garbage Collection
Identifying objects that are no longer reachable across a network requires a consistent snapshot of the “object graph”. Distributed garbage collectors (used in systems like Erlang clusters and Java RMI) periodically take snapshots to identify objects with no remaining remote references. Without a consistent snapshot, the GC might incorrectly collect an object that still has a reference in a message currently in transit.6.4 Database Consistent Backups
Databases like CockroachDB and TiDB use snapshot-based protocols to take consistent backups across a sharded cluster without pausing writes. Each shard records its state at a specific MVCC timestamp, and because the system uses hybrid logical clocks, these per-shard snapshots together form a consistent global cut.5. Interview Questions
Q: Why can't we just use NTP to take a snapshot?
Q: Why can't we just use NTP to take a snapshot?
Answer: Even with NTP, clock skew exists (typically 1-10ms). In that window, thousands of messages can be sent and received. A snapshot triggered by physical time would likely be inconsistent because it wouldn’t account for messages “in flight”. Chandy-Lamport uses causality (markers) rather than physical time to ensure consistency.
Q: What are the assumptions of Chandy-Lamport?
Q: What are the assumptions of Chandy-Lamport?
Answer:
- FIFO Channels: Messages on a channel must be delivered in the order they were sent.
- Reliable Delivery: No messages are lost.
- No Process Crashes: During the snapshot period (though modern variants handle this).
7. Common Pitfalls
Pitfall: Assuming FIFO channels in non-FIFO systems
Pitfall: Assuming FIFO channels in non-FIFO systems
Chandy-Lamport requires FIFO channels. If your system uses UDP, unordered message queues, or multiple network paths between nodes, markers can arrive before earlier application messages, producing an inconsistent snapshot. Modern systems solve this by either (a) using TCP which provides FIFO per-connection, or (b) using sequence numbers to impose logical FIFO ordering.
Pitfall: Forgetting channel state
Pitfall: Forgetting channel state
A common implementation mistake is snapshotting only node state and ignoring messages in transit. If you skip channel recording, your snapshot will be missing data — like counting the money in every bank vault but ignoring the armored trucks on the road. This leads to “lost” or “phantom” data during recovery.
Pitfall: Snapshot storm under high frequency
Pitfall: Snapshot storm under high frequency
If you trigger snapshots too frequently, the marker messages themselves consume significant bandwidth and the snapshot storage costs grow rapidly. In Flink, excessively frequent checkpoints (sub-second) can cause more time spent snapshotting than processing, effectively halving throughput. A senior engineer would say: “Checkpoint interval is a direct trade-off between recovery time (RPO) and steady-state throughput.”
Pitfall: Non-deterministic state machines break replay
Pitfall: Non-deterministic state machines break replay
Snapshots enable recovery by replaying events from the snapshot forward. But if your state machine uses
random(), System.currentTimeMillis(), or other non-deterministic functions, replaying the same events produces different state. This is why Flink and similar systems require operators to be deterministic, or to explicitly record non-deterministic choices as part of the snapshot.8. Key Takeaways
Global State = Nodes + Channels
You must record both the local memory and the messages currently on the wires. Forgetting channel state is the most common implementation mistake.
Causality Over Time
Consistent snapshots rely on logical ordering (happened-before) rather than wall-clock time. This is why Chandy-Lamport works in asynchronous systems where NTP cannot be trusted.
FIFO is Non-Negotiable
The entire correctness proof of Chandy-Lamport depends on FIFO channels. Verify this assumption before adopting the algorithm, or use a variant designed for non-FIFO networks.
Snapshot Cost is a Trade-off
More frequent snapshots reduce recovery time but increase steady-state overhead. A senior engineer tunes checkpoint intervals based on measured RPO requirements and observed throughput impact.