Time & Clock Synchronization
Time is the foundation upon which all distributed systems reasoning is built. Understanding clocks is crucial for designing systems that maintain consistency and order.
Track Duration: 10-14 hours
Key Topics: Physical Clocks, NTP, PTP, Logical Clocks, Vector Clocks, HLC, TrueTime
Interview Focus: Google Spanner’s TrueTime, causality tracking, clock synchronization trade-offs
The Fundamental Problem
The Hard Truth: There is no global clock in distributed systems. Every node has its own view of time, and they never perfectly agree.
┌─────────────────────────────────────────────────────────────────────────────┐
│ WHY PERFECT TIME SYNC IS IMPOSSIBLE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PHYSICAL LIMITATIONS: │
│ ───────────────────── │
│ ▸ Speed of light: ~200km/ms in fiber │
│ ▸ NYC ↔ London: minimum ~28ms one-way │
│ ▸ Message delivery time is variable │
│ ▸ Network congestion adds unpredictable delay │
│ │
│ CLOCK HARDWARE: │
│ ─────────────── │
│ ▸ Quartz oscillators drift: 10-100 ppm │
│ ▸ 50 ppm = 4.32 seconds/day drift │
│ ▸ Temperature affects crystal frequency │
│ ▸ Each machine drifts differently │
│ │
│ SYNCHRONIZATION PARADOX: │
│ ──────────────────────── │
│ To sync clocks, you need to send messages │
│ Message delay is unknown and variable │
│ Therefore, you can never know the exact time another node has │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
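As a quick sanity check of the drift figure above (a back-of-the-envelope sketch in Python, not measured data):

drift_ppm = 50                              # typical quartz oscillator drift
seconds_per_day = 24 * 3600                 # 86,400 seconds
print(drift_ppm * 1e-6 * seconds_per_day)   # 4.32 seconds of drift per day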
Physical Clock Synchronization
Network Time Protocol (NTP)
NTP is the most widely used clock synchronization protocol on the internet.
┌─────────────────────────────────────────────────────────────────────────────┐
│ NTP ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ │
│ │ Stratum 0 │ ← Atomic clocks, GPS receivers │
│ │ (Reference) │ NIST, USNO primary standards │
│ └───────┬───────┘ │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │Stratum 1 │ │Stratum 1 │ │Stratum 1 │ ← Direct connection │
│ │ Server │ │ Server │ │ Server │ to reference clocks │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ ┌────┴────────────┴──────────────┴────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ... │
│ │Stratum 2 │ │Stratum 2 │ │Stratum 2 │ ← Most enterprise servers │
│ │ Server │ │ Server │ │ Server │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ [Stratum 3+] [Your servers] [End users] │
│ │
│ ACCURACY BY STRATUM: │
│ ──────────────────── │
│ Stratum 0: < 1 microsecond (reference) │
│ Stratum 1: < 10 microseconds │
│ Stratum 2: 1-10 milliseconds (typical datacenter) │
│ Stratum 3+: 10-100+ milliseconds │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
NTP Synchronization Algorithm
┌─────────────────────────────────────────────────────────────────────────────┐
│ NTP MESSAGE EXCHANGE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Client Server │
│ │ │ │
│ │ T1 = client send time │ │
│ │────────────── Request ──────────────────────────>│ │
│ │ │ T2 = server recv │
│ │ │ T3 = server send │
│ │<────────────── Response ─────────────────────────│ │
│ │ T4 = client recv time │ │
│ │
│ CALCULATIONS: │
│ ───────────── │
│ Round-trip delay: δ = (T4 - T1) - (T3 - T2) │
│ Clock offset: θ = ((T2 - T1) + (T3 - T4)) / 2 │
│ │
│ EXAMPLE: │
│ ──────── │
│ T1 = 10:00:00.000 (client thinks) │
│ T2 = 10:00:00.050 (server records) │
│ T3 = 10:00:00.051 (server sends) │
│ T4 = 10:00:00.101 (client receives) │
│ │
│ δ = (0.101) - (0.001) = 0.100s (100ms round-trip) │
│ θ = ((0.050) + (-0.050)) / 2 = 0s │
│ │
│ If θ ≠ 0, client adjusts its clock │
│ │
│ CLOCK ADJUSTMENT MODES: │
│ ─────────────────────── │
│ • Small offset (< 128ms): Slew (gradual adjustment) │
│ • Medium offset: Step (jump, may break applications!) │
│ • Large offset (> 1000s): Panic (refuse to adjust) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
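A minimal Python sketch of the offset/delay arithmetic above, using the example timestamps from the diagram (this is just the math, not a real NTP client):

def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Return (clock offset, round-trip delay) in seconds from the four NTP timestamps."""
    delay = (t4 - t1) - (t3 - t2)            # time spent on the network in both directions
    offset = ((t2 - t1) + (t3 - t4)) / 2     # estimated client error, assuming symmetric paths
    return offset, delay

# Example values from the diagram, expressed as seconds past 10:00:00
offset, delay = ntp_offset_delay(0.000, 0.050, 0.051, 0.101)
print(offset, delay)   # 0.0, 0.1 -> clocks agree, 100ms round trip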
Precision Time Protocol (PTP - IEEE 1588)
For applications requiring sub-microsecond accuracy:
┌─────────────────────────────────────────────────────────────────────────────┐
│ PTP vs NTP COMPARISON │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ FEATURE NTP PTP │
│ ─────── ─── ─── │
│ Accuracy 1-10 ms < 1 μs (with hardware) │
│ Hardware support Not required Required for best accuracy │
│ Network type Any IP network Usually LAN only │
│ Complexity Simple More complex │
│ Cost Free Hardware timestamping NICs │
│ │
│ PTP COMPONENTS: │
│ ─────────────── │
│ • Grandmaster clock (time source) │
│ • Boundary clocks (synchronize segments) │
│ • Ordinary clocks (end devices) │
│ • Transparent clocks (switches that compensate for delay) │
│ │
│ USE CASES: │
│ ────────── │
│ • High-frequency trading (nanoseconds matter) │
│ • Telecom (5G requires tight sync) │
│ • Industrial automation │
│ • Scientific experiments │
│ │
│ DISTRIBUTED DATABASES: │
│ ────────────────────── │
│ Most use NTP + additional mechanisms (HLC, TrueTime) │
│ CockroachDB: NTP + Hybrid Logical Clocks │
│ Spanner: GPS + atomic clocks (TrueTime) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Clock Anomalies and Edge Cases
Real-World Clock Problems
Leap Seconds
Earth’s rotation is gradually slowing, requiring occasional corrections.
Interview Tip: Explain how your timestamp-dependent features handle leap seconds.
LEAP SECOND EVENT:
─────────────────
Normal: 23:59:58 → 23:59:59 → 00:00:00
Leap: 23:59:58 → 23:59:59 → 23:59:60 → 00:00:00
PROBLEMS CAUSED:
────────────────
• 2012 leap second: Crashed Reddit, LinkedIn, Gawker, Mozilla
• Linux kernel bug: livelock in the kernel's timer code (hrtimer/futex) when the leap second was inserted
• Cassandra outages: Time comparisons broke
SOLUTIONS:
──────────
1. Leap smear (Google): Spread the extra second over 24 hours,
   adding ~11.6μs of adjustment per second throughout the day (see the sketch after this list)
2. Step at midnight: Risk of duplicate timestamps
3. Ignore (dangerous): Systems may crash or produce errors
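A rough sketch of the leap-smear rate from option 1, assuming a linear 24-hour smear window:

SMEAR_WINDOW_S = 24 * 3600                   # smear the extra second over 86,400 seconds
print(1.0 / SMEAR_WINDOW_S * 1e6)            # ~11.57 microseconds of adjustment per second

def smeared_fraction(seconds_into_window: float) -> float:
    """Fraction of the leap second already absorbed at this point in the window."""
    return seconds_into_window / SMEAR_WINDOW_S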
Clock Jumps (Step Changes)
When NTP decides the clock is too far off, it jumps:
BEFORE JUMP:
───────────
Event A at t=100ms
Event B at t=101ms
Event C at t=102ms
NTP JUMP: Clock was 50ms behind, steps forward
AFTER JUMP:
───────────
Event D at t=152ms (clock stepped from 102ms to 152ms)
PROBLEM: If you're calculating durations:
Duration C→D = 152ms - 102ms = 50ms
Actual time elapsed: ~0ms!
MITIGATION:
───────────
• Use monotonic clocks for duration measurement
• CLOCK_MONOTONIC (Linux): Only moves forward, no jumps
• System.nanoTime() (Java): Monotonic
• time.monotonic() (Python): Monotonic
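A minimal Python sketch of the mitigation above; time.sleep stands in for real work:

import time

start_wall = time.time()         # wall clock: can be stepped forward/backward by NTP
start_mono = time.monotonic()    # monotonic clock: only moves forward, immune to steps

time.sleep(0.5)                  # stand-in for real work

wall_elapsed = time.time() - start_wall        # wrong if the clock jumped mid-measurement
mono_elapsed = time.monotonic() - start_mono   # safe for durations, timeouts, rate limits
print(wall_elapsed, mono_elapsed)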
NTP Slewing
For small corrections, NTP gradually adjusts the clock rate:
SLEW ADJUSTMENT:
────────────────
Clock is 100ms behind
Slew rate: 500μs per second
Time to correct: 100ms / 500μs = 200 seconds (~3.3 minutes)
DURING SLEW:
────────────
Your clock runs 0.05% faster
1 second in real time = 1.0005 seconds on your clock
IMPLICATIONS:
─────────────
• Timeouts may be slightly longer/shorter
• Rate calculations affected
• Usually not a problem in practice
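The slew arithmetic above as a one-liner, assuming ntpd's typical maximum slew rate of 500μs per second:

offset_s = 0.100                 # clock is 100ms behind
slew_rate = 500e-6               # seconds of correction applied per second of real time
print(offset_s / slew_rate)      # 200.0 seconds (~3.3 minutes) to correct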
VM Clock Issues
Virtual machines have additional clock challenges:
VM CLOCK PROBLEMS:
──────────────────
1. VM Pause:
- VM paused for migration/snapshot
- Guest OS clock stops
- Resumes hours later, still thinks old time
2. Steal Time:
- Hypervisor schedules other VMs
- Guest gets less CPU than expected
- Time-based assumptions break
3. Clock Source:
- kvm-clock, Hyper-V clock, VMware tools
- Each has different accuracy/overhead
AWS RECOMMENDATIONS:
────────────────────
• Use Amazon Time Sync Service (chrony)
• Enable enhanced networking
• Consider dedicated hosts for timing-critical apps
GCP/AZURE:
──────────
• Similar managed NTP services available
• Same VM clock challenges apply
Logical Clocks Deep Dive
Lamport Timestamps
┌─────────────────────────────────────────────────────────────────────────────┐
│ LAMPORT TIMESTAMPS IN ACTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Three processes exchanging messages: │
│ │
│ Process A Process B Process C │
│ │ │ │ │
│ (1)─────────────────────────────────────── Event: A starts │
│ │ │ │ │
│ (2)────send────>(3) │ A sends to B │
│ │ │ │ │
│ │ (4)────send────>(5) B sends to C │
│ │ │ │ │
│ (3) │ (6) A, C do local events │
│ │ │ │ │
│ (8)<─────────────────────────(7) C sends to A │
│ │ │ │ │
│ │
│ LAMPORT CLOCK VALUES: │
│ ───────────────────── │
│ A: 1 → 2 → 3 → 8 (receives C's 7+1) │
│ B: 3 → 4 │
│ C: 5 → 6 → 7 │
│ │
│ PROPERTY: │
│ ───────── │
│ If A → B (A happened before B), then L(A) < L(B) │
│ │
│ BUT NOT THE CONVERSE! │
│ L(A) < L(B) does NOT mean A → B │
│ │
│ Example: A's event (3) and C's event (6) │
│ L(A:3) = 3, L(C:6) = 6, but they are CONCURRENT │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
import threading

class LamportClock:
    """Thread-safe Lamport clock implementation"""

    def __init__(self, node_id: str):
        self._counter = 0
        self._lock = threading.Lock()
        self.node_id = node_id

    def tick(self) -> int:
        """Increment clock for local event"""
        with self._lock:
            self._counter += 1
            return self._counter

    def send_timestamp(self) -> int:
        """Get timestamp for outgoing message"""
        with self._lock:
            self._counter += 1
            return self._counter

    def receive_timestamp(self, received: int) -> int:
        """Update clock on message receipt: take the max, then increment"""
        with self._lock:
            self._counter = max(self._counter, received) + 1
            return self._counter

    def current(self) -> int:
        """Get current clock value (no increment)"""
        with self._lock:
            return self._counter

    @staticmethod
    def compare(ts1: int, ts2: int) -> int:
        """Compare timestamps; equal values are concurrent and, in practice,
        are tie-broken by node_id to obtain a total order."""
        return ts1 - ts2
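A quick usage example of the class above, showing two hypothetical nodes exchanging one message:

a = LamportClock("A")
b = LamportClock("B")

a.tick()                       # A: local event           -> 1
msg_ts = a.send_timestamp()    # A: send message          -> 2
b.tick()                       # B: unrelated local event -> 1
b.receive_timestamp(msg_ts)    # B: receive A's message   -> max(1, 2) + 1 = 3

print(a.current(), b.current())   # 2 3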
Vector Clocks
Vector clocks capture causality - they can tell you if two events are concurrent:
┌─────────────────────────────────────────────────────────────────────────────┐
│ VECTOR CLOCKS DEEP DIVE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Each process maintains a vector of counters (one per process): │
│ │
│ Process A: [A:0, B:0, C:0] │
│ Process B: [A:0, B:0, C:0] │
│ Process C: [A:0, B:0, C:0] │
│ │
│ RULES: │
│ ────── │
│ 1. Before event: Increment own counter │
│ 2. Send: Attach current vector to message │
│ 3. Receive: Merge vectors (take max), then increment own │
│ │
│ EXAMPLE EXECUTION: │
│ ────────────────── │
│ │
│ A: [1,0,0]──────send──────>[1,1,0] (B) │
│ │ │ │
│ [2,0,0] [1,2,0]──send──>[1,2,1] (C) │
│ │ │ │ │
│ [3,0,0] │ [1,2,2] │
│ │ │ │ │
│ [4,2,3]<─────────────────────────────[1,2,3] │
│ │
│ COMPARISON RULES: │
│ ───────────────── │
│ V1 < V2 iff ∀i: V1[i] ≤ V2[i] AND ∃j: V1[j] < V2[j] │
│ V1 || V2 (concurrent) iff neither V1 < V2 nor V2 < V1 │
│ │
│ EXAMPLES: │
│ [1,0,0] < [1,1,0] ✓ (A's first event happened before B's second) │
│ [2,0,0] || [1,2,0] (Concurrent: 2>1 in A's slot, but 0<2 in B's slot) │
│ [3,0,0] || [1,2,2] (A's third local event and C's local event are concurrent) │
│ [1,2,3] < [4,2,3] ✓ (C's send happened before A's receive of that message) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
from enum import Enum
from typing import Dict
import copy

class Ordering(Enum):
    BEFORE = -1
    AFTER = 1
    CONCURRENT = 0
    EQUAL = 2

class VectorClock:
    """Vector clock implementation for distributed systems"""

    def __init__(self, node_id: str, nodes: list[str]):
        self.node_id = node_id
        self.clock: Dict[str, int] = {n: 0 for n in nodes}

    def tick(self) -> Dict[str, int]:
        """Increment for local event"""
        self.clock[self.node_id] += 1
        return self.copy()

    def send(self) -> Dict[str, int]:
        """Get timestamp for outgoing message"""
        self.clock[self.node_id] += 1
        return self.copy()

    def receive(self, received: Dict[str, int]) -> Dict[str, int]:
        """Merge incoming clock (element-wise max), then increment own entry"""
        for node, timestamp in received.items():
            if node in self.clock:
                self.clock[node] = max(self.clock[node], timestamp)
            else:
                self.clock[node] = timestamp
        self.clock[self.node_id] += 1
        return self.copy()

    def copy(self) -> Dict[str, int]:
        return copy.deepcopy(self.clock)

    @staticmethod
    def compare(vc1: Dict[str, int], vc2: Dict[str, int]) -> Ordering:
        """Compare two vector clocks"""
        all_nodes = set(vc1.keys()) | set(vc2.keys())
        less = False
        greater = False
        for node in all_nodes:
            v1 = vc1.get(node, 0)
            v2 = vc2.get(node, 0)
            if v1 < v2:
                less = True
            elif v1 > v2:
                greater = True
        if less and not greater:
            return Ordering.BEFORE
        elif greater and not less:
            return Ordering.AFTER
        elif not less and not greater:
            return Ordering.EQUAL
        else:
            return Ordering.CONCURRENT

    @staticmethod
    def merge(vc1: Dict[str, int], vc2: Dict[str, int]) -> Dict[str, int]:
        """Merge two vector clocks (for conflict resolution)"""
        all_nodes = set(vc1.keys()) | set(vc2.keys())
        return {n: max(vc1.get(n, 0), vc2.get(n, 0)) for n in all_nodes}
Vector Clocks in Amazon Dynamo and Riak
┌─────────────────────────────────────────────────────────────────────────────┐
│ VECTOR CLOCKS FOR CONFLICT DETECTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Shopping cart updates from multiple devices │
│ │
│ Initial: cart = ["shirt"], vc = [A:1] │
│ │
│ Device 1 (Node A): Device 2 (Node B): │
│ Read cart ["shirt"], vc=[A:1] Read cart ["shirt"], vc=[A:1] │
│ │ │ │
│ ▼ ▼ │
│ Add "pants" Add "hat" │
│ cart = ["shirt","pants"] cart = ["shirt","hat"] │
│ vc = [A:2] vc = [A:1, B:1] │
│ │ │ │
│ └────────────┬──────────────────────────┘ │
│ ▼ │
│ CONFLICT DETECTED! │
│ [A:2] || [A:1, B:1] │
│ │
│ RESOLUTION OPTIONS: │
│ ─────────────────── │
│ 1. Last-Write-Wins (LWW): Pick one, lose data │
│ 2. Return both: Let application merge │
│ 3. CRDT merge: ["shirt", "pants", "hat"] │
│ │
│ AMAZON DYNAMO APPROACH: │
│ ─────────────────────── │
│ • Return all conflicting versions (siblings) │
│ • Application must merge │
│ • New write includes merged vector clock │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
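The shopping-cart conflict above can be reproduced with the VectorClock class defined earlier (a small sketch; the clocks are the ones shown in the diagram):

v_device1 = {"A": 2}             # write coordinated by node A
v_device2 = {"A": 1, "B": 1}     # write coordinated by node B

print(VectorClock.compare(v_device1, v_device2))   # Ordering.CONCURRENT -> siblings, app must merge
print(VectorClock.merge(v_device1, v_device2))     # {'A': 2, 'B': 1} -> clock for the merged write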
Hybrid Logical Clocks (HLC)
Used by CockroachDB, MongoDB, and many modern databases:
┌─────────────────────────────────────────────────────────────────────────────┐
│ HYBRID LOGICAL CLOCKS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ MOTIVATION: │
│ ─────────── │
│ • Lamport clocks: No relation to physical time (hard to debug) │
│ • Physical clocks: No causality guarantees │
│ • HLC: Best of both worlds! │
│ │
│ HLC STRUCTURE: │
│ ────────────── │
│ HLC = (physical_time, logical_counter) │
│ └──────┬──────┘ └──────┬───────┘ │
│ │ │ │
│ Close to wall clock Breaks ties when │
│ physical time is same │
│ │
│ PROPERTIES: │
│ ─────────── │
│ 1. l.pt ≥ pt (HLC physical component is at least the local physical time) │
│ 2. If e1 → e2, then l.e1 < l.e2 (causality preserved) │
│ 3. Bounded divergence from physical time │
│ │
│ ALGORITHM (l = HLC before the event, l' = HLC after, pt = local physical time): │
│ ────────── │
│ Send/Local event: │
│ l'.pt = max(l.pt, pt) // take max of old HLC and physical time │
│ if l'.pt == l.pt: │
│ l'.c = l.c + 1 // physical time didn't advance, increment counter │
│ else: │
│ l'.c = 0 // physical time advanced, reset counter │
│ │
│ Receive(m): │
│ l'.pt = max(l.pt, m.pt, pt) │
│ if l'.pt == l.pt == m.pt: │
│ l'.c = max(l.c, m.c) + 1 │
│ elif l'.pt == l.pt: │
│ l'.c = l.c + 1 │
│ elif l'.pt == m.pt: │
│ l'.c = m.c + 1 │
│ else: │
│ l'.c = 0 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
import time
from dataclasses import dataclass
from threading import Lock

@dataclass
class HLCTimestamp:
    physical: int  # Milliseconds
    logical: int   # Counter

    def __lt__(self, other: 'HLCTimestamp') -> bool:
        if self.physical != other.physical:
            return self.physical < other.physical
        return self.logical < other.logical

    def __eq__(self, other: 'HLCTimestamp') -> bool:
        return self.physical == other.physical and self.logical == other.logical

    def __repr__(self) -> str:
        return f"HLC({self.physical}.{self.logical})"

class HybridLogicalClock:
    """
    Hybrid Logical Clock implementation.
    Used by CockroachDB, MongoDB, and others.
    """
    def __init__(self, max_drift_ms: int = 500):
        self._physical = 0
        self._logical = 0
        self._lock = Lock()
        self._max_drift = max_drift_ms

    def _now_ms(self) -> int:
        return int(time.time() * 1000)

    def now(self) -> HLCTimestamp:
        """Get current HLC timestamp for local event or send"""
        with self._lock:
            pt = self._now_ms()
            if pt > self._physical:
                self._physical = pt
                self._logical = 0
            else:
                self._logical += 1
            return HLCTimestamp(self._physical, self._logical)

    def receive(self, remote: HLCTimestamp) -> HLCTimestamp:
        """Update HLC on receiving a message"""
        with self._lock:
            pt = self._now_ms()
            # Check for excessive drift
            if remote.physical - pt > self._max_drift:
                raise ValueError(f"Remote clock too far ahead: {remote.physical - pt}ms")
            if pt > self._physical and pt > remote.physical:
                self._physical = pt
                self._logical = 0
            elif self._physical > remote.physical:
                self._logical += 1
            elif remote.physical > self._physical:
                self._physical = remote.physical
                self._logical = remote.logical + 1
            else:  # Equal physical times
                self._logical = max(self._logical, remote.logical) + 1
            return HLCTimestamp(self._physical, self._logical)

    def current(self) -> HLCTimestamp:
        """Get current timestamp without incrementing"""
        with self._lock:
            return HLCTimestamp(self._physical, self._logical)
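A brief usage example of the class above: node B receives a message stamped by node A, and the receive event is always ordered after the send:

node_a = HybridLogicalClock()
node_b = HybridLogicalClock()

ts_sent = node_a.now()              # A stamps an outgoing message
ts_recv = node_b.receive(ts_sent)   # B merges A's timestamp with its own clock

assert ts_sent < ts_recv            # causality preserved even if B's wall clock lags A's
print(ts_sent, ts_recv)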
Google TrueTime
The gold standard for distributed time synchronization:
┌─────────────────────────────────────────────────────────────────────────────┐
│ GOOGLE TRUETIME ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ TRUETIME API: │
│ ───────────── │
│ TT.now() → returns TTinterval: [earliest, latest] │
│ TT.after(t) → true if t is definitely in the past │
│ TT.before(t) → true if t is definitely in the future │
│ │
│ KEY INSIGHT: │
│ ──────────── │
│ Instead of claiming "it's exactly 10:00:00.000" │
│ TrueTime says "it's between 10:00:00.000 and 10:00:00.007" │
│ │
│ HARDWARE INFRASTRUCTURE: │
│ ──────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ DATACENTER TIME MASTERS │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ GPS │ │ Atomic │ │ GPS │ │ │
│ │ │ Receiver │ │ Clock │ │ Receiver │ │ │
│ │ │ (GPS Time │ │ (Armageddon │ │ (GPS Time │ │ │
│ │ │ Master) │ │ Master) │ │ Master) │ │ │
│ │ └──────┬──────┘ └──────┬───────┘ └──────┬──────┘ │ │
│ │ │ │ │ │ │
│ │ └────────────────┬───────────────────┘ │ │
│ │ ▼ │ │
│ │ ┌────────────────────┐ │ │
│ │ │ Time Master │ │ │
│ │ │ Daemon │ │ │
│ │ └─────────┬──────────┘ │ │
│ │ │ │ │
│ └────────────────────────┼─────────────────────────────────┘ │
│ │ │
│ ┌───────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Spanserver │ │ Spanserver │ │ Spanserver │ │
│ │ (has local │ │ (has local │ │ (has local │ │
│ │ TT daemon) │ │ TT daemon) │ │ TT daemon) │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ UNCERTAINTY SOURCES: │
│ ──────────────────── │
│ 1. Communication delay to time masters: ~1ms │
│ 2. Local clock drift between syncs: ~200μs/sec │
│ 3. Typical ε (uncertainty): 1-7ms │
│ │
│ WAIT-OUT STRATEGY: │
│ ────────────────── │
│ To guarantee s1 < s2 for commits: │
│ After getting commit timestamp, wait until TT.after(s1) is true │
│ This ensures no one else can get a timestamp ≤ s1 │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Spanner’s External Consistency
┌─────────────────────────────────────────────────────────────────────────────┐
│ SPANNER COMMIT PROTOCOL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ EXTERNAL CONSISTENCY: If T1 commits before T2 starts, │
│ then T1's timestamp < T2's timestamp (globally!) │
│ │
│ COMMIT WAIT ALGORITHM: │
│ ────────────────────── │
│ │
│ 1. Acquire locks on all participants │
│ 2. Get commit timestamp s = TT.now().latest │
│ 3. Wait until TT.after(s) is true │
│ 4. Release locks and return to client │
│ │
│ WHY THIS WORKS: │
│ ─────────────── │
│ │
│ Transaction T1: │
│ TT.now() = [100, 107] │
│ s1 = 107 │
│ Wait until TT.after(107) │
│ Commit at real time ~114 │
│ │
│ Transaction T2 (starts after T1 commits): │
│ User sees T1 committed at real time 114 │
│ T2 starts at real time 115 │
│ TT.now() = [108, 115] (at least!) │
│ s2 ≥ 115 > 107 = s1 │
│ │
│ LATENCY IMPACT: │
│ ─────────────── │
│ Commit wait adds roughly 2ε (a few milliseconds) to every transaction │
│ This is why Google invests heavily in reducing ε │
│ GPS receivers on every datacenter roof! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
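A hypothetical, heavily simplified sketch of the interval API and commit wait described above. Real TrueTime derives the uncertainty ε from GPS and atomic-clock time masters; here a fixed epsilon_ms around the local clock stands in for it:

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float    # seconds since epoch
    latest: float

def tt_now(epsilon_ms: float = 7.0) -> TTInterval:
    """Return an interval assumed to contain true time (local clock ± epsilon)."""
    t = time.time()
    eps = epsilon_ms / 1000.0
    return TTInterval(t - eps, t + eps)

def tt_after(t: float, epsilon_ms: float = 7.0) -> bool:
    """True only once t is definitely in the past."""
    return tt_now(epsilon_ms).earliest > t

def commit_wait(commit_ts: float, epsilon_ms: float = 7.0) -> None:
    """Spanner-style commit wait: block until the commit timestamp has definitely passed."""
    while not tt_after(commit_ts, epsilon_ms):
        time.sleep(0.001)

# Pick s = TT.now().latest, then wait it out before acknowledging the commit.
s = tt_now().latest
commit_wait(s)    # costs roughly 2*epsilon of extra latency per transaction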
Practical Guidelines
Choosing the Right Clock
- Use Physical Time When...
  - Generating user-facing timestamps
  - Debugging and logging
  - TTL (time-to-live) calculations
  - Scheduling future events
  - Audit trails and compliance
- Use Logical Clocks When...
  - Need to establish causality
  - Ordering events across nodes
  - Conflict detection in replicated systems
  - Implementing distributed algorithms (consensus, etc.)
- Use HLC When...
  - Need both causality and approximate physical time
  - Building distributed databases
  - Snapshot isolation across nodes
  - Debugging distributed traces
- Use TrueTime When...
  - Building globally consistent systems
  - Need external consistency
  - Have budget for GPS/atomic clocks
Clock Best Practices
┌─────────────────────────────────────────────────────────────────────────────┐
│ CLOCK BEST PRACTICES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. USE MONOTONIC CLOCKS FOR DURATIONS │
│ ✗ end_time - start_time (wall clock, can go backward!) │
│ ✓ Use monotonic clock APIs │
│ │
│ 2. DON'T TRUST TIMESTAMPS ACROSS MACHINES │
│ ✗ if remote_timestamp > local_timestamp │
│ ✓ Use vector clocks or HLC for ordering │
│ │
│ 3. ALWAYS INCLUDE UNCERTAINTY │
│ ✗ "Event happened at exactly 10:00:00.000" │
│ ✓ "Event happened between 10:00:00.000 and 10:00:00.010" │
│ │
│ 4. USE UTC EVERYWHERE INTERNALLY │
│ ✗ Store local times, convert on display │
│ ✓ Store UTC, convert to local only for display │
│ │
│ 5. HANDLE LEAP SECONDS │
│ ✗ Assume 86400 seconds per day │
│ ✓ Use libraries that handle leap seconds, or leap smearing │
│ │
│ 6. SYNC YOUR SERVERS │
│ ✗ Default OS NTP settings │
│ ✓ Configure chrony/ntpd properly, monitor clock drift │
│ │
│ 7. LOG CLOCK SKEW │
│ Monitor and alert on excessive clock drift between nodes │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Interview Questions
Q: How does Google Spanner achieve external consistency?
Answer:
- TrueTime API: Returns time interval [earliest, latest] instead of single value
- GPS + Atomic Clocks: Hardware in every datacenter for accurate time
- Commit Wait: After getting timestamp, wait until uncertainty period passes
- Guarantee: If T1 commits before T2 starts, T1’s timestamp < T2’s timestamp
Q: Why can't we just use physical timestamps for ordering?
Answer:
- Clock Skew: Different machines have different times (milliseconds to seconds)
- Clock Drift: Clocks run at slightly different speeds
- NTP Jumps: Clocks can jump forward or backward during sync
- No Causality: Physical time doesn’t capture happened-before relationships
Q: When would you use vector clocks vs HLC?
Answer:
Vector Clocks:
- When you need precise conflict detection
- Systems like Dynamo/Riak that return conflicting versions (siblings)
- When the number of nodes is bounded and small
HLC:
- When you need approximate physical time for debugging
- Systems with many nodes (vector clock size grows as O(n))
- Databases like CockroachDB, MongoDB
- When snapshot isolation is needed
Q: Design a distributed system that orders events across data centers
Answer approach:
- Understand requirements: Strong ordering (expensive) vs causal ordering (cheaper)
- For causal ordering:
  - Use HLC at each data center
  - Propagate timestamps with messages
  - Compare HLC timestamps for ordering
- For strong ordering:
  - Central sequencer (single point of failure)
  - OR distributed consensus (high latency)
  - OR TrueTime-like approach (hardware investment)
- Practical trade-off:
  - Use causal ordering where possible
  - Strong ordering only where required (e.g., financial transactions)
Key Takeaways
- Perfect Time is Impossible: Accept uncertainty. Design systems that handle clock skew gracefully.
- Causality ≠ Physical Time: Use logical or hybrid clocks when you need to reason about event ordering.
- Measure and Monitor: Track clock drift between nodes. Alert when skew exceeds thresholds.
- Choose the Right Tool: Physical for display, logical for causality, HLC for both. TrueTime if you’re Google.