Time is the foundation upon which all distributed systems reasoning is built. Understanding clocks is crucial for designing systems that maintain consistency and order.Here is the uncomfortable truth that makes this chapter important: there is no “now” in a distributed system. On a single computer, “now” is a meaningful concept — you can read the clock and trust it. In a distributed system, each machine has its own clock, each clock drifts differently, and the act of asking another machine “what time is it?” takes a non-zero amount of time. This means that every timestamp-based decision (who wrote first? has this lease expired? is this cache entry stale?) is fundamentally uncertain. The rest of this chapter is about different strategies for coping with that uncertainty.
Earth’s rotation is gradually slowing, requiring occasional corrections.
LEAP SECOND EVENT:─────────────────Normal: 23:59:58 → 23:59:59 → 00:00:00Leap: 23:59:58 → 23:59:59 → 23:59:60 → 00:00:00PROBLEMS CAUSED:────────────────• 2012 leap second: Crashed Reddit, LinkedIn, Gawker, Mozilla• Linux kernel bug: Livelock due to clock_gettime() returning 60• Cassandra outages: Time comparisons brokeSOLUTIONS:──────────1. Leap smear (Google): Spread the second over 24 hours Add/subtract 11.6μs per second throughout the day2. Step at midnight: Risk of duplicate timestamps3. Ignore (dangerous): Systems may crash or produce errors
Interview Tip: Explain how your timestamp-dependent features handle leap seconds.
Clock Jumps (Step Changes)
When NTP decides the clock is too far off, it jumps:
BEFORE JUMP:───────────Event A at t=100Event B at t=101Event C at t=102NTP JUMP: Clock was 50ms behind, jumps forwardAFTER JUMP:───────────Event D at t=152 (jumped from 102 to 152)PROBLEM: If you're calculating durations:Duration C→D = 152 - 102 = 50 secondsActual time elapsed: ~0 seconds!MITIGATION:───────────• Use monotonic clocks for duration measurement• CLOCK_MONOTONIC (Linux): Only moves forward, no jumps• System.nanoTime() (Java): Monotonic• time.monotonic() (Python): Monotonic
NTP Slewing
For small corrections, NTP gradually adjusts the clock rate:
SLEW ADJUSTMENT:────────────────Clock is 100ms behindSlew rate: 500μs per secondTime to correct: 100ms / 500μs = 200 seconds (~3.3 minutes)DURING SLEW:────────────Your clock runs 0.05% faster1 second in real time = 1.0005 seconds on your clockIMPLICATIONS:─────────────• Timeouts may be slightly longer/shorter• Rate calculations affected• Usually not a problem in practice
VM Clock Issues
Virtual machines have additional clock challenges:
VM CLOCK PROBLEMS:──────────────────1. VM Pause: - VM paused for migration/snapshot - Guest OS clock stops - Resumes hours later, still thinks old time2. Steal Time: - Hypervisor schedules other VMs - Guest gets less CPU than expected - Time-based assumptions break3. Clock Source: - kvm-clock, Hyper-V clock, VMware tools - Each has different accuracy/overheadAWS RECOMMENDATIONS:────────────────────• Use Amazon Time Sync Service (chrony)• Enable enhanced networking• Consider dedicated hosts for timing-critical appsGCP/AZURE:──────────• Similar managed NTP services available• Same VM clock challenges apply
┌─────────────────────────────────────────────────────────────────────────────┐│ LAMPORT TIMESTAMPS IN ACTION │├─────────────────────────────────────────────────────────────────────────────┤│ ││ Three processes exchanging messages: ││ ││ Process A Process B Process C ││ │ │ │ ││ (1)─────────────────────────────────────── Event: A starts ││ │ │ │ ││ (2)────send────>(3) │ A sends to B ││ │ │ │ ││ │ (4)────send────>(5) B sends to C ││ │ │ │ ││ (3) │ (6) A, C do local events ││ │ │ │ ││ (4)<─────────────────────────(7) C sends to A ││ │ │ │ ││ ││ LAMPORT CLOCK VALUES: ││ ───────────────────── ││ A: 1 → 2 → 3 → 8 (receives C's 7+1) ││ B: 3 → 4 ││ C: 5 → 6 → 7 ││ ││ PROPERTY: ││ ───────── ││ If A → B (A happened before B), then L(A) < L(B) ││ ││ BUT NOT THE CONVERSE! ││ L(A) < L(B) does NOT mean A → B ││ ││ Example: A's event (3) and C's event (6) ││ L(A:3) = 3, L(C:6) = 6, but they are CONCURRENT ││ │└─────────────────────────────────────────────────────────────────────────────┘
Implementation:
class LamportClock: """Thread-safe Lamport clock implementation. A Lamport clock is like a page counter on a notebook that every node carries. Before doing anything, you write the next page number. When you send a letter, you include your current page number. When you receive a letter with a higher page number, you jump ahead -- because the sender clearly did more work than you realized. This ensures that if event A caused event B, A always has a lower page number than B. The critical limitation: if A has a lower number than B, you CANNOT conclude A caused B. They might be completely independent events that happened to get these numbers by coincidence. For true causality tracking, you need Vector Clocks. """ def __init__(self, node_id: str): self._counter = 0 self._lock = threading.Lock() self.node_id = node_id def tick(self) -> int: """Increment clock for local event""" with self._lock: self._counter += 1 return self._counter def send_timestamp(self) -> int: """Get timestamp for outgoing message -- attach this to the message""" with self._lock: self._counter += 1 return self._counter def receive_timestamp(self, received: int) -> int: """Update clock on message receipt. The max() is the key insight: we synchronize with the sender's knowledge of how far "time" has advanced across the system.""" with self._lock: self._counter = max(self._counter, received) + 1 return self._counter def current(self) -> int: """Get current clock value (no increment)""" with self._lock: return self._counter def compare(self, ts1: int, ts2: int) -> int: """Compare timestamps, break ties with node_id. Tie-breaking with node_id gives a total order -- useful for things like distributed mutex (Lamport's mutual exclusion algorithm).""" if ts1 != ts2: return ts1 - ts2 return 0 # In practice, also compare node_ids for total ordering
┌─────────────────────────────────────────────────────────────────────────────┐│ SPANNER COMMIT PROTOCOL │├─────────────────────────────────────────────────────────────────────────────┤│ ││ EXTERNAL CONSISTENCY: If T1 commits before T2 starts, ││ then T1's timestamp < T2's timestamp (globally!) ││ ││ COMMIT WAIT ALGORITHM: ││ ────────────────────── ││ ││ 1. Acquire locks on all participants ││ 2. Get commit timestamp s = TT.now().latest ││ 3. Wait until TT.after(s) is true ││ 4. Release locks and return to client ││ ││ WHY THIS WORKS: ││ ─────────────── ││ ││ Transaction T1: ││ TT.now() = [100, 107] ││ s1 = 107 ││ Wait until TT.after(107) ││ Commit at real time ~114 ││ ││ Transaction T2 (starts after T1 commits): ││ User sees T1 committed at real time 114 ││ T2 starts at real time 115 ││ TT.now() = [108, 115] (at least!) ││ s2 ≥ 115 > 107 = s1 ││ ││ LATENCY IMPACT: ││ ─────────────── ││ Commit wait adds ε (1-7ms) to every transaction ││ This is why Google invests heavily in reducing ε ││ GPS receivers on every datacenter roof! ││ │└─────────────────────────────────────────────────────────────────────────────┘
Marzullo’s algorithm is used to select a confidence interval from a set of noisy time sources.
┌─────────────────────────────────────────────────────────────────────────────┐│ MARZULLO'S ALGORITHM (INTERSECTION) │├─────────────────────────────────────────────────────────────────────────────┤│ ││ INPUT: A set of intervals [c - e, c + e] where c is clock and e is error. ││ ││ GOAL: Find the smallest interval that contains at least M correct clocks. ││ ││ STEPS: ││ 1. For each interval [low, high], create two tuples: ││ (low, -1) and (high, +1) ││ 2. Sort all tuples by value. ││ 3. Iterate through sorted tuples, maintaining a counter: ││ - If type is -1 (low), counter++ ││ - If type is +1 (high), counter-- ││ 4. The intersection is where counter is maximum. ││ ││ EXAMPLE: ││ A: [10, 12], B: [11, 13], C: [10.5, 11.5] ││ Sorted: (10, -1), (10.5, -1), (11, -1), (11.5, +1), (12, +1), (13, +1) ││ Counter: 1 2 3 2 1 0 ││ Max counter (3) is between 11 and 11.5. ││ │└─────────────────────────────────────────────────────────────────────────────┘
For two events e1 and e2 to be definitively ordered (e1<e2), their uncertainty intervals must not overlap:
e1.high<e2.lowIf they overlap, the system cannot determine the true order based on physical time alone. This is the Causality Gap.
When a node queries N time sources, it uses Marzullo’s (or the Improved Marzullo/Intersection algorithm) to find the “True” interval. If the sources disagree significantly, the interval grows (increasing uncertainty) or the node enters a “Panic” state.Staff Tip: When designing systems that rely on time for consistency (like Spanner), you must explicitly handle the “Overlap Case” by either:
Waiting: Wait until the uncertainty window of e1 has passed before starting e2 (Commit-Wait).
Versioning: Use a logical counter (HLC) to break ties during the overlap.
If f nodes are Byzantine (malicious), we need at least 3f+1 total nodes to synchronize clocks correctly.
Lynch-Welch Algorithm: Nodes exchange their clock values. Each node discards the f highest and f lowest values and takes the average of the remaining n−2f values.
Clock Drift Bounds: In a Byzantine environment, the maximum skew between correct clocks is bounded by Δ≈ϵ+ρR, where ϵ is message delay uncertainty, ρ is drift rate, and R is synchronization interval.
┌─────────────────────────────────────────────────────────────────────────────┐│ CLOCK BEST PRACTICES │├─────────────────────────────────────────────────────────────────────────────┤│ ││ 1. USE MONOTONIC CLOCKS FOR DURATIONS ││ ✗ end_time - start_time (wall clock, can go backward!) ││ ✓ Use monotonic clock APIs ││ ││ 2. DON'T TRUST TIMESTAMPS ACROSS MACHINES ││ ✗ if remote_timestamp > local_timestamp ││ ✓ Use vector clocks or HLC for ordering ││ ││ 3. ALWAYS INCLUDE UNCERTAINTY ││ ✗ "Event happened at exactly 10:00:00.000" ││ ✓ "Event happened between 10:00:00.000 and 10:00:00.010" ││ ││ 4. USE UTC EVERYWHERE INTERNALLY ││ ✗ Store local times, convert on display ││ ✓ Store UTC, convert to local only for display ││ ││ 5. HANDLE LEAP SECONDS ││ ✗ Assume 86400 seconds per day ││ ✓ Use libraries that handle leap seconds, or leap smearing ││ ││ 6. SYNC YOUR SERVERS ││ ✗ Default OS NTP settings ││ ✓ Configure chrony/ntpd properly, monitor clock drift ││ ││ 7. LOG CLOCK SKEW ││ Monitor and alert on excessive clock drift between nodes ││ │└─────────────────────────────────────────────────────────────────────────────┘
Q: How does Google Spanner achieve external consistency?
Answer:
TrueTime API: Returns time interval [earliest, latest] instead of single value
GPS + Atomic Clocks: Hardware in every datacenter for accurate time
Commit Wait: After getting timestamp, wait until uncertainty period passes
Guarantee: If T1 commits before T2 starts, T1’s timestamp < T2’s timestamp
The wait adds latency (1-7ms) but provides global ordering without coordination.
Q: Why can't we just use physical timestamps for ordering?
Answer:
Clock Skew: Different machines have different times (milliseconds to seconds)
Clock Drift: Clocks run at slightly different speeds
NTP Jumps: Clocks can jump forward or backward during sync
No Causality: Physical time doesn’t capture happened-before relationships
Example: Machine A (behind) writes at “10:00:00.100”, Machine B (ahead) writes at “10:00:00.050” - B’s write appears earlier despite happening after A’s!
Q: When would you use vector clocks vs HLC?
Answer:Vector Clocks:
When you need precise conflict detection
Systems like DynamoDB/Riak that return conflicting versions
When number of nodes is bounded and small
HLC:
When you need approximate physical time for debugging
Systems with many nodes (vector clock size = O(n))
You are building a multi-region database and your team is debating whether to use Hybrid Logical Clocks or invest in a TrueTime-like infrastructure. Walk me through the trade-offs.
Strong Answer:
HLC gives you causality tracking (if event A causes event B, HLC guarantees A’s timestamp is less than B’s) plus a close approximation of physical time, all in software with zero hardware investment. CockroachDB and MongoDB both use HLC. The downside is that HLC cannot provide external consistency — if two unrelated transactions happen in different regions, HLC cannot guarantee their timestamps reflect real-world ordering because the physical clocks they are based on have unbounded skew (in theory).
TrueTime gives you bounded uncertainty intervals — the system knows the actual time is within a window (typically 1-7ms). This allows Spanner’s commit-wait protocol: after assigning a timestamp, the transaction waits until the uncertainty interval has passed, guaranteeing that no future transaction can receive a lower timestamp. This achieves external consistency. The cost is GPS receivers and atomic clocks in every data center, plus the latency overhead of the commit-wait (equal to the uncertainty interval).
For most companies, HLC is the right choice. External consistency matters only when you need globally ordered transactions across independent shards with no causal relationship. If your workload can tolerate “causal consistency” rather than “strict serializable across unrelated transactions,” HLC is sufficient and dramatically simpler to operate.
The CockroachDB compromise is instructive: they use HLC but also enforce a maximum clock skew bound. If a node’s clock drifts beyond that bound, it self-quarantines. This provides “external consistency within the skew bound” without specialized hardware.
Follow-up: What happens in CockroachDB if two nodes have clocks skewed by more than the configured maximum offset?CockroachDB’s safety depends on a clock skew bound (default 500ms). If a node’s clock exceeds this offset, the node will refuse to serve reads and writes to prevent consistency violations. But the real danger is if the skew exceeds the bound without being detected. In that case, a transaction T1 on node A could commit with timestamp 100, and a later transaction T2 on node B (whose clock is far behind) could commit with timestamp 90, violating external consistency. CockroachDB mitigates this with “uncertainty intervals”: when a transaction reads data, it treats any version within the uncertainty window as potentially concurrent and restarts the transaction with a higher timestamp. This means clock skew does not cause correctness bugs — it causes transaction restarts and higher latency. The system degrades gracefully rather than silently corrupting data.
Explain Lamport timestamps and their fundamental limitation. Then explain how vector clocks fix that limitation.
Strong Answer:
Lamport timestamps assign a single integer counter to each event. The rule is: before any event, increment the counter; when sending a message, attach the counter; when receiving, set your counter to max(local, received) + 1. This guarantees that if event A happened-before event B (causally), then L(A) is less than L(B).
The fundamental limitation is the converse is not true: L(A) less than L(B) does NOT imply A happened before B. Two completely independent events on different nodes can have ordered timestamps by coincidence. You cannot distinguish “A caused B” from “A and B were concurrent but A happened to get a lower number.” This means Lamport timestamps cannot detect conflicts.
Vector clocks fix this by maintaining one counter per node. Each node increments only its own entry. When sending, it attaches the entire vector. When receiving, it takes the element-wise max and increments its own entry. Now you can compare two vectors: if every entry in V1 is less than or equal to V2, and at least one is strictly less, then V1 happened before V2. If neither dominates the other (V1 has some entries greater, V2 has others greater), the events are concurrent — a true conflict.
The trade-off is size: vector clocks grow linearly with the number of nodes. For a system with thousands of nodes, this overhead is prohibitive. This is why systems like DynamoDB originally used vector clocks but later moved to simpler mechanisms (last-writer-wins with server-side timestamps).
Follow-up: If vector clocks are O(n) in the number of nodes, how do real systems handle this in clusters with thousands of nodes?There are several practical approaches. First, version vectors instead of vector clocks: version vectors only track replicas that modify data (often 3-5 in a typical replication group), not all nodes in the cluster. This keeps the vector small. Second, dotted version vectors (used in Riak) which are more space-efficient and can accurately prune entries from nodes that are no longer relevant. Third, many systems abandon vector clocks entirely and use HLC or simple last-writer-wins with wall-clock timestamps, accepting the possibility of lost updates in exchange for O(1) metadata. The right choice depends on whether conflict detection is critical for your use case. For a shopping cart (Amazon Dynamo), detecting conflicts matters — you do not want to silently drop items. For a metrics counter, last-writer-wins is fine because counters can be re-aggregated.
Walk me through exactly how Google Spanner's commit-wait protocol works and why it guarantees external consistency.
Strong Answer:
When a Spanner transaction is ready to commit, it acquires locks on all participants, then gets a commit timestamp s = TT.now().latest — the upper bound of the current TrueTime uncertainty interval.
Then it waits. Specifically, it waits until TT.after(s) returns true, meaning the system is now certain that time s has definitively passed on every node in the world. This wait is typically 1-7ms, equal to twice the TrueTime uncertainty epsilon.
After the wait, the transaction’s effects are made visible and locks are released.
Why this guarantees external consistency: suppose transaction T1 commits with timestamp s1 and then a client, having observed T1’s commit, starts transaction T2. T2 starts at real-time t2, which is after T1’s commit-wait completed. Therefore t2 > s1 in absolute real time. When T2 calls TT.now(), the returned interval will have earliest >= t2 - epsilon > s1 (because we waited until s1 was definitely in the past). So T2’s commit timestamp s2 = TT.now().latest >= t2 > s1. This guarantees s2 > s1, meaning T2 is ordered after T1 in the commit order.
The brilliance is that this works without any coordination between T1 and T2 — they could be on different continents, different Paxos groups. The global ordering comes from physics (bounded clock uncertainty) rather than communication.
Follow-up: What would happen if the TrueTime uncertainty interval suddenly grew much larger — say from 7ms to 500ms?The commit-wait latency would increase proportionally. Every transaction would now wait up to 500ms before its commit is visible, making the system effectively unusable for interactive workloads. This is why Google invests heavily in reducing epsilon: they deploy GPS receivers and atomic clocks in every data center, and the TrueTime daemon continuously calibrates against multiple time sources. If a GPS antenna fails or a time master becomes unreliable, epsilon grows for nodes that depend on it. Google monitors epsilon closely — it is a key operational metric. In the absolute worst case, if epsilon grows unbounded (all time sources fail), Spanner would have to choose between blocking indefinitely (maintaining external consistency but losing availability) or proceeding without commit-wait (gaining availability but risking consistency). The system chooses to block, because Spanner’s entire value proposition is consistency. This is a concrete example of the CAP trade-off: during a “time partition” (inability to bound clock uncertainty), Spanner sacrifices availability.