Distributed Snapshots

How do you take a picture of a system that never stops moving? In a distributed system, there is no global clock and no global memory. Capturing a “consistent” global state is one of the most fundamental challenges.

Module Duration: 6-8 hours
Key Topics: Global State, Consistent Cuts, Chandy-Lamport Algorithm, Termination Detection
Interview Focus: How to debug distributed state, checkpointing, and recovery

1. The Challenge of Global State

On a single machine, you can stop the world and dump memory. In a distributed system, two things get in the way:

No Global Clock: We can’t say “everyone record your state at exactly 10:00:00”.
Messages in Flight: The state isn’t just what’s on the nodes; it’s also the messages currently traveling on the network.

1.1 The “Money Transfer” Problem

Imagine two banks, A and B, each with $100. Total = $200.

A sends $50 to B.
If you snapshot A after it sends, and B before it receives, the total looks like $150. $50 vanished!
If you snapshot B after it receives, and A before it sends, the total looks like $250. $50 was created!

A consistent snapshot must account for both node state and channel state.
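The arithmetic above can be checked with a few lines of Python. This is only a toy model: the `total` helper and the variable names are made up for illustration, and the "channel" is just a list of in-flight amounts.

```python
def total(snapshot_a, snapshot_b, channel_a_to_b=()):
    """Sum the recorded balances plus any recorded in-flight transfers."""
    return snapshot_a + snapshot_b + sum(channel_a_to_b)

a_before, b_before = 100, 100   # balances before the transfer
a_after = a_before - 50         # A has sent $50
b_after = b_before + 50         # B has received $50

# A recorded after the send, B recorded before the receive, channel ignored.
lost = total(a_after, b_before)
# A recorded before the send, B recorded after the receive.
created = total(a_before, b_after)
# A after the send, B before the receive, plus the message on the wire.
consistent = total(a_after, b_before, channel_a_to_b=[50])

print(lost, created, consistent)  # 150 250 200
```

Only the third combination, which records the channel along with the two balances, preserves the $200 total.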

2. Consistent Cuts

A Cut is a line drawn across the execution of a distributed system, dividing events into “past” and “future”.

2.1 Consistent vs. Inconsistent Cuts

Consistent Cut: If an event $e$ is in the cut (past), then every event that happened-before $e$ must also be in the cut.
Inconsistent Cut: A cut where an effect is recorded but its cause is not. (e.g., recording a message receipt but not its sending).

Theorem: A snapshot is consistent if and only if it corresponds to a consistent cut.
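As a rough sketch, a cut can be modeled as "how many events of each process are in the past". Assuming each process's events form a prefix of its local history, checking consistency reduces to checking every message: if its receive is inside the cut, its send must be too. The `Message` tuple layout and the function name are assumptions made for this example.

```python
from typing import Dict, List, Tuple

# A message is (sender, send_index, receiver, recv_index); the indices are
# 1-based positions of the send/receive events in each local history.
Message = Tuple[str, int, str, int]

def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """cut[p] is the number of events of process p that lie in the past."""
    for sender, send_idx, receiver, recv_idx in messages:
        received_in_cut = recv_idx <= cut[receiver]
        sent_in_cut = send_idx <= cut[sender]
        if received_in_cut and not sent_in_cut:
            return False  # effect recorded without its cause
    return True

# A's first event sends a message; B's first event receives it.
messages = [("A", 1, "B", 1)]
print(is_consistent({"A": 1, "B": 0}, messages))  # True: message is in flight
print(is_consistent({"A": 0, "B": 1}, messages))  # False: receipt without send
```

The first cut is consistent even though the message has not arrived yet; it simply belongs to the channel state.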

3. The Chandy-Lamport Algorithm

The classic algorithm for capturing a consistent global state in a system with FIFO channels.

3.1 The Protocol: Markers and Recording Rules

The algorithm uses a special Marker message to coordinate the snapshot. Crucially, markers don’t stop the application; they “sweep” across the system.

Marker Sending Rule:
- Process $P_i$ records its state and sends a Marker on all outgoing channels before sending any more application messages.
Marker Receiving Rule (Process $P_j$ receives a Marker from $P_i$):
- Case A: First Marker received:
  - $P_j$ records its state immediately.
  - $P_j$ marks the channel from $P_i$ to $P_j$ as empty.
  - $P_j$ starts recording all incoming messages on all other channels.
  - $P_j$ sends a Marker on all its outgoing channels.
- Case B: Already recorded state:
  - $P_j$ stops recording the channel from $P_i$ to $P_j$.
  - The state of that channel is the sequence of messages that arrived on it after $P_j$ recorded its own state and before this Marker.
- Termination: the snapshot is complete once every process has received a Marker on each of its incoming channels.
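The rules above can be sketched as a small single-threaded simulation. Everything here is an illustrative assumption rather than a reference implementation: the `Process` class, the dict of deques standing in for FIFO channels, and the bank-balance state borrowed from the money transfer example.

```python
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, name, balance, peers, network):
        self.name, self.balance = name, balance
        self.peers = peers            # names of all other processes
        self.network = network        # (src, dst) -> deque used as a FIFO channel
        self.recorded_state = None    # local snapshot, once taken
        self.channel_state = {}       # src -> messages recorded as in flight
        self.recording = set()        # incoming channels still being recorded

    def send(self, dst, amount):
        self.balance -= amount
        self.network[(self.name, dst)].append(amount)

    def start_snapshot(self, skip=None):
        """Record local state, then send a Marker on every outgoing channel."""
        self.recorded_state = self.balance
        for p in self.peers:
            self.channel_state[p] = []
            if p != skip:             # the channel the first Marker came from is empty
                self.recording.add(p)
            self.network[(self.name, p)].append(MARKER)

    def receive(self, src):
        msg = self.network[(src, self.name)].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot(skip=src)   # Case A: first Marker
            else:
                self.recording.discard(src)     # Case B: stop recording this channel
        else:
            self.balance += msg
            if src in self.recording:
                self.channel_state[src].append(msg)

names = ["A", "B", "C"]
network = {(s, d): deque() for s in names for d in names if s != d}
procs = {n: Process(n, 100, [p for p in names if p != n], network) for n in names}

procs["A"].send("B", 50)      # sent before A records its state
procs["C"].send("A", 20)      # still in flight when A records its state
procs["A"].start_snapshot()   # A acts as the initiator

# Deliver until all channels drain; each channel is consumed in FIFO order.
while any(network.values()):
    for (src, dst), chan in network.items():
        if chan:
            procs[dst].receive(src)

recorded = sum(p.recorded_state for p in procs.values())
in_flight = sum(sum(m) for p in procs.values() for m in p.channel_state.values())
print(recorded, in_flight, recorded + in_flight)  # 280 20 300
```

The recorded balances alone add up to 280; the missing 20 shows up as the recorded state of the channel from C to A, so the snapshot preserves the true total of 300.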

3.2 Why it Works: The Proof of Consistency

A Chandy-Lamport snapshot represents a state that could have happened. Even if the snapshot doesn’t correspond to any specific “instant” in real time, it is Consistent.

The Logic: If message $M$ was received by $P_j$ before $P_j$ took its snapshot, but sent by $P_i$ after $P_i$ took its snapshot, the cut would be inconsistent. However, the protocol prevents this: $P_i$ sends the Marker before any message sent after its snapshot. Since channels are FIFO, the Marker must reach $P_j$ before any subsequent message $M$. Thus, $P_j$ would have already taken its snapshot.
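To see why the FIFO assumption carries the argument, here is a deliberately tiny two-process sketch. The `snapshot_total` function and the starting balances of 100 are assumptions for illustration. $P_i$ records its state, sends a Marker, and then sends $M = 50$; the only thing that varies is the order in which $P_j$ receives the two.

```python
def snapshot_total(delivery_order):
    """Return the total recorded by the snapshot for a given delivery order.

    P_i records its balance (100) before sending M, so M's send is outside
    the cut. P_j records its balance when the Marker arrives.
    """
    p_i_recorded = 100
    p_j_balance = 100
    p_j_recorded = None
    for msg in delivery_order:
        if msg == "MARKER":
            p_j_recorded = p_j_balance   # first Marker: record state, channel is empty
        else:
            p_j_balance += msg           # application message M
    return p_i_recorded + p_j_recorded

print(snapshot_total(["MARKER", 50]))  # 200: FIFO order, M falls after the cut
print(snapshot_total([50, "MARKER"]))  # 250: reordered, receipt recorded without its send
```

With reordering, the snapshot contains the receipt of $M$ but not its send, which is exactly the inconsistent cut the proof rules out.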

4. Advanced Variant: Asynchronous Barrier Snapshotting (ABS)

Modern stream processors like Apache Flink use a specialized version of Chandy-Lamport designed for high-throughput DAGs.

4.1 Flink’s Barrier Mechanism

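Instead of letting an arbitrary process initiate the snapshot with markers, Flink injects Barriers into the data stream at the sources. The barriers flow downstream with the records and split each stream into the records that belong to the current checkpoint and the records that belong to the next one.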

Alignment: When an operator has multiple input streams, it must hold back (buffer) records from the faster streams until it has received the barrier on all input streams.
Snapshot: Once all barriers arrive, the operator snapshots its state to a durable store (like S3/HDFS) and forwards the barrier.
Optimization: Unaligned Checkpoints allow operators to snapshot immediately upon seeing the first barrier, by including the in-flight data that the barrier overtakes in the snapshot itself. This keeps checkpoints fast under backpressure at the cost of larger snapshot sizes.
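Barrier alignment can be sketched with a toy operator that sums its inputs. This is not Flink's API; `aligned_operator`, the `BARRIER` sentinel, and the round-robin scheduling are all assumptions made to keep the example small, and each input is assumed to carry exactly one barrier.

```python
from collections import deque

BARRIER = "BARRIER"

def aligned_operator(inputs):
    """Toy summing operator with barrier alignment.

    inputs: list of streams, each a list of ints containing exactly one BARRIER.
    Returns (state_at_snapshot, final_state).
    """
    queues = [deque(stream) for stream in inputs]
    blocked = [False] * len(queues)   # True once the barrier arrived on that input
    state, snapshot = 0, None
    while any(queues):
        for i, q in enumerate(queues):
            if not q or blocked[i]:
                continue              # a blocked input is held back until alignment
            item = q.popleft()
            if item == BARRIER:
                blocked[i] = True
                if all(blocked):      # barrier seen on every input: take the snapshot
                    snapshot = state
                    blocked = [False] * len(queues)  # resume all inputs
            else:
                state += item
    return snapshot, state

# Input 0 is "fast": its barrier arrives early, so its record 100 must wait.
print(aligned_operator([[1, BARRIER, 100], [10, 20, 30, BARRIER]]))  # (61, 161)
```

The snapshot value 61 contains every pre-barrier record (1 + 10 + 20 + 30) and nothing after a barrier, even though 100 was available on the fast input long before the slow input's barrier arrived.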

5. Evaluating Global Predicates

Taking a snapshot is only half the battle. Once you have it, how do you use it to detect distributed properties?

5.1 Stable vs. Unstable Predicates

Stable Predicates: Once true, they stay true (e.g., “The system is deadlocked”, “The computation has terminated”).
Unstable Predicates: Can be true at one instant and false the next (e.g., “Queue size > 100”).
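Staff Tip: You can only reliably detect Stable predicates using Chandy-Lamport, because the recorded state may never have existed at a single real-time instant; only a property that stays true once it holds is guaranteed to still hold when the snapshot completes. Detecting unstable predicates requires examining all possible consistent cuts (the lattice of global states), and the number of such cuts can grow exponentially: up to $O(k^N)$ for $N$ processes with $k$ events each.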
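A brute-force count shows how quickly the lattice of consistent cuts grows. The function name and the message tuple layout below are assumptions for this sketch; it simply tries every combination of per-process prefix lengths and keeps those where no message is received without being sent.

```python
from itertools import product

def count_consistent_cuts(events_per_process, messages):
    """Count consistent cuts by exhaustive enumeration.

    events_per_process: number of events on each process (processes are 0..N-1).
    messages: list of (sender, send_idx, receiver, recv_idx), 1-based indices.
    """
    count = 0
    for cut in product(*(range(k + 1) for k in events_per_process)):
        if all(not (r_idx <= cut[r] and s_idx > cut[s])
               for s, s_idx, r, r_idx in messages):
            count += 1
    return count

print(count_consistent_cuts([2, 2, 2], []))              # 27 = (2 + 1) ** 3
print(count_consistent_cuts([2, 2, 2], [(0, 1, 1, 1)]))  # 21
```

Messages prune some cuts, but with few causal dependencies the count stays close to $(k+1)^N$, which is why evaluating an unstable predicate over every consistent cut is impractical for large systems.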

5.2 Distributed Deadlock Detection

To find a deadlock, you take a snapshot of the Wait-For Graph (WFG). If the snapshot contains a cycle, the system is deadlocked. Without a consistent snapshot, you might see a “phantom deadlock”: wait-for edges collected at different times can form a cycle even though those edges never all existed at the same moment.
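Once the wait-for edges have been collected from a consistent snapshot, the check itself is ordinary cycle detection. The sketch below assumes the snapshot has already been assembled into a dict mapping each process to the processes it waits on; the `find_cycle` name and the sample graph are illustrative.

```python
def find_cycle(wait_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None."""
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on the current path, finished
    color = {}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in wait_for.get(node, ()):
            c = color.get(nxt, WHITE)
            if c == GREY:             # back edge: nxt is already on the current path
                return stack[stack.index(nxt):] + [nxt]
            if c == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(wait_for):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

snapshot = {"P1": ["P2"], "P2": ["P3"], "P3": ["P1"], "P4": ["P1"]}
print(find_cycle(snapshot))                       # ['P1', 'P2', 'P3', 'P1']
print(find_cycle({"P1": ["P2"], "P2": []}))       # None
```

The recursion is fine for small graphs; a production detector would likely use an iterative traversal to avoid Python's recursion limit.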

6. Real-World Applications

6.1 Distributed Debugging

Finding “zombie” processes or deadlocks requires a global view of who is waiting on whom.

6.2 Checkpointing and Recovery

Large-scale processing systems (like Apache Flink) use Chandy-Lamport (or variants like Asynchronous Barrier Snapshotting) to save state. If a node fails, the system rolls back to the last consistent global snapshot.

6.3 Garbage Collection

Identifying objects that are no longer reachable across a network requires a consistent snapshot of the “object graph”.

7. Interview Questions

Q: Why can't we just use NTP to take a snapshot?

Answer: Even with NTP, clock skew exists (typically 1-10ms). In that window, thousands of messages can be sent and received. A snapshot triggered by physical time would likely be inconsistent because it wouldn’t account for messages “in flight”. Chandy-Lamport uses causality (markers) rather than physical time to ensure consistency.

Q: What are the assumptions of Chandy-Lamport?

Answer:

FIFO Channels: Messages on a channel must be delivered in the order they were sent.
Reliable Delivery: No messages are lost.
No Process Crashes: Processes do not fail while the snapshot is in progress (though modern variants handle this).

8. Key Takeaways

Global State = Nodes + Channels

You must record both the local memory and the messages currently on the wires.

Causality Over Time

Consistent snapshots rely on logical ordering (happened-before) rather than wall-clock time.
