Time & Clock Synchronization

Time is the foundation upon which all distributed systems reasoning is built. Understanding clocks is crucial for designing systems that maintain consistency and order.
Track Duration: 10-14 hours
Key Topics: Physical Clocks, NTP, PTP, Logical Clocks, Vector Clocks, HLC, TrueTime
Interview Focus: Google Spanner’s TrueTime, causality tracking, clock synchronization trade-offs

The Fundamental Problem

The Hard Truth: There is no global clock in distributed systems. Every node has its own view of time, and they never perfectly agree.
┌─────────────────────────────────────────────────────────────────────────────┐
│                   WHY PERFECT TIME SYNC IS IMPOSSIBLE                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PHYSICAL LIMITATIONS:                                                      │
│  ─────────────────────                                                      │
│  ▸ Speed of light: ~200km/ms in fiber                                      │
│  ▸ NYC ↔ London: minimum ~28ms one-way                                     │
│  ▸ Message delivery time is variable                                        │
│  ▸ Network congestion adds unpredictable delay                              │
│                                                                              │
│  CLOCK HARDWARE:                                                            │
│  ───────────────                                                            │
│  ▸ Quartz oscillators drift: 10-100 ppm                                    │
│  ▸ 50 ppm = 4.32 seconds/day drift                                         │
│  ▸ Temperature affects crystal frequency                                    │
│  ▸ Each machine drifts differently                                          │
│                                                                              │
│  SYNCHRONIZATION PARADOX:                                                   │
│  ────────────────────────                                                   │
│  To sync clocks, you need to send messages                                  │
│  Message delay is unknown and variable                                       │
│  Therefore, you can never know the exact time another node has              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
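
The drift arithmetic above as a quick sanity check:

PPM = 1e-6
drift_s_per_day = 50 * PPM * 86_400   # a 50 ppm quartz oscillator, over one day
print(drift_s_per_day)                # 4.32 seconds/day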

Physical Clock Synchronization

Network Time Protocol (NTP)

NTP is the most widely used clock synchronization protocol on the internet.
┌─────────────────────────────────────────────────────────────────────────────┐
│                      NTP ARCHITECTURE                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                    ┌───────────────┐                                        │
│                    │  Stratum 0    │ ← Atomic clocks, GPS receivers        │
│                    │  (Reference)  │   NIST, USNO primary standards        │
│                    └───────┬───────┘                                        │
│                            │                                                 │
│              ┌─────────────┼─────────────┐                                  │
│              ▼             ▼             ▼                                  │
│        ┌──────────┐  ┌──────────┐  ┌──────────┐                            │
│        │Stratum 1 │  │Stratum 1 │  │Stratum 1 │  ← Direct connection       │
│        │  Server  │  │  Server  │  │  Server  │    to reference clocks     │
│        └────┬─────┘  └────┬─────┘  └────┬─────┘                            │
│             │             │             │                                   │
│        ┌────┴────────────┴──────────────┴────┐                              │
│        ▼             ▼             ▼         ▼                              │
│   ┌──────────┐  ┌──────────┐  ┌──────────┐  ...                            │
│   │Stratum 2 │  │Stratum 2 │  │Stratum 2 │       ← Most enterprise servers │
│   │  Server  │  │  Server  │  │  Server  │                                 │
│   └────┬─────┘  └────┬─────┘  └────┬─────┘                                 │
│        │             │             │                                        │
│        ▼             ▼             ▼                                        │
│   [Stratum 3+]  [Your servers]  [End users]                                │
│                                                                              │
│  ACCURACY BY STRATUM:                                                       │
│  ────────────────────                                                       │
│  Stratum 0: < 1 microsecond (reference)                                     │
│  Stratum 1: < 10 microseconds                                               │
│  Stratum 2: 1-10 milliseconds (typical datacenter)                          │
│  Stratum 3+: 10-100+ milliseconds                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

NTP Synchronization Algorithm

┌─────────────────────────────────────────────────────────────────────────────┐
│                    NTP MESSAGE EXCHANGE                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│   Client                                              Server                │
│     │                                                   │                   │
│     │  T1 = client send time                           │                   │
│     │────────────── Request ──────────────────────────>│                   │
│     │                                                   │ T2 = server recv  │
│     │                                                   │ T3 = server send  │
│     │<────────────── Response ─────────────────────────│                   │
│     │  T4 = client recv time                           │                   │
│                                                                              │
│   CALCULATIONS:                                                             │
│   ─────────────                                                             │
│   Round-trip delay: δ = (T4 - T1) - (T3 - T2)                               │
│   Clock offset: θ = ((T2 - T1) + (T3 - T4)) / 2                            │
│                                                                              │
│   EXAMPLE:                                                                  │
│   ────────                                                                  │
│   T1 = 10:00:00.000 (client thinks)                                        │
│   T2 = 10:00:00.050 (server records)                                        │
│   T3 = 10:00:00.051 (server sends)                                          │
│   T4 = 10:00:00.101 (client receives)                                       │
│                                                                              │
│   δ = (0.101) - (0.001) = 0.100s (100ms round-trip)                        │
│   θ = ((0.050) + (-0.050)) / 2 = 0s                                        │
│                                                                              │
│   If θ ≠ 0, client adjusts its clock                                        │
│                                                                              │
│   CLOCK ADJUSTMENT MODES:                                                   │
│   ───────────────────────                                                   │
│   • Small offset (< 128ms): Slew (gradual adjustment)                      │
│   • Medium offset: Step (jump, may break applications!)                     │
│   • Large offset (> 1000s): Panic (refuse to adjust)                       │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
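
A minimal sketch of the offset/delay arithmetic, using the worked example from the diagram (timestamps as seconds relative to T1):

def ntp_offset_delay(t1: float, t2: float, t3: float, t4: float) -> tuple[float, float]:
    """Clock offset and round-trip delay from one NTP request/response exchange."""
    delay = (t4 - t1) - (t3 - t2)         # time actually spent on the network
    offset = ((t2 - t1) + (t3 - t4)) / 2  # estimated error of the client clock
    return offset, delay

offset, delay = ntp_offset_delay(0.000, 0.050, 0.051, 0.101)
print(f"offset={offset * 1000:.1f}ms, delay={delay * 1000:.1f}ms")  # offset=0.0ms, delay=100.0ms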

Precision Time Protocol (PTP - IEEE 1588)

For applications requiring sub-microsecond accuracy:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    PTP vs NTP COMPARISON                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  FEATURE             NTP                    PTP                             │
│  ───────             ───                    ───                             │
│  Accuracy            1-10 ms                < 1 μs (with hardware)          │
│  Hardware support    Not required           Required for best accuracy      │
│  Network type        Any IP network         Usually LAN only                │
│  Complexity          Simple                 More complex                    │
│  Cost                Free                   Hardware timestamping NICs      │
│                                                                              │
│  PTP COMPONENTS:                                                            │
│  ───────────────                                                            │
│  • Grandmaster clock (time source)                                          │
│  • Boundary clocks (synchronize segments)                                   │
│  • Ordinary clocks (end devices)                                            │
│  • Transparent clocks (switches that compensate for delay)                  │
│                                                                              │
│  USE CASES:                                                                 │
│  ──────────                                                                 │
│  • High-frequency trading (nanoseconds matter)                              │
│  • Telecom (5G requires tight sync)                                         │
│  • Industrial automation                                                    │
│  • Scientific experiments                                                   │
│                                                                              │
│  DISTRIBUTED DATABASES:                                                     │
│  ──────────────────────                                                     │
│  Most use NTP + additional mechanisms (HLC, TrueTime)                       │
│  CockroachDB: NTP + Hybrid Logical Clocks                                   │
│  Spanner: GPS + atomic clocks (TrueTime)                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Clock Anomalies and Edge Cases

Real-World Clock Problems

Leap Seconds

Earth’s rotation is gradually slowing, so UTC occasionally needs a one-second correction.
LEAP SECOND EVENT:
─────────────────
Normal:  23:59:58 → 23:59:59 → 00:00:00
Leap:    23:59:58 → 23:59:59 → 23:59:60 → 00:00:00

PROBLEMS CAUSED:
────────────────
• 2012 leap second: Crashed Reddit, LinkedIn, Gawker, Mozilla
• Linux kernel bug: livelock in the timer subsystem when the leap second was inserted
• Cassandra outages: Time comparisons broke

SOLUTIONS:
──────────
1. Leap smear (Google): Spread the second over 24 hours
   Add/subtract 11.6μs per second throughout the day

2. Step at midnight: Risk of duplicate timestamps

3. Ignore (dangerous): Systems may crash or produce errors
Interview Tip: Explain how your timestamp-dependent features handle leap seconds.
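
A sketch of the linear-smear arithmetic (the window length and smear direction are parameters; this is illustrative, not Google's production implementation):

SMEAR_WINDOW_S = 86_400  # spread one leap second across 24 hours

def smeared_fraction(seconds_into_window: float) -> float:
    """Fraction of the leap second already absorbed: grows linearly from
    0 to 1 over the window, i.e. ~11.6 microseconds per second."""
    return min(max(seconds_into_window, 0.0), SMEAR_WINDOW_S) / SMEAR_WINDOW_S

print(smeared_fraction(43_200))  # 0.5 -> half the leap second applied at mid-window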
Clock Jumps (Stepping)

When NTP decides the clock is too far off, it steps the clock: the time jumps.
BEFORE JUMP:
───────────
Event A at t=100
Event B at t=101
Event C at t=102

NTP JUMP: Clock was 50 seconds behind, jumps forward

AFTER JUMP:
───────────
Event D at t=152 (jumped from 102 to 152)

PROBLEM: If you're calculating durations:
Duration C→D = 152 - 102 = 50 seconds
Actual time elapsed: ~0 seconds!

MITIGATION:
───────────
• Use monotonic clocks for duration measurement
• CLOCK_MONOTONIC (Linux): Only moves forward, no jumps
• System.nanoTime() (Java): Monotonic
• time.monotonic() (Python): Monotonic
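
For example, in Python:

import time

start = time.monotonic()            # immune to NTP steps and wall-clock changes
time.sleep(0.1)                     # stand-in for the operation being timed
elapsed = time.monotonic() - start  # always >= 0, unaffected by clock jumps
print(f"elapsed={elapsed:.3f}s")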
Clock Slew

For small corrections, NTP gradually adjusts the clock rate:
SLEW ADJUSTMENT:
────────────────
Clock is 100ms behind
Slew rate: 500μs per second
Time to correct: 100ms / 500μs = 200 seconds (~3.3 minutes)

DURING SLEW:
────────────
Your clock runs 0.05% faster
1 second in real time = 1.0005 seconds on your clock

IMPLICATIONS:
─────────────
• Timeouts may be slightly longer/shorter
• Rate calculations affected
• Usually not a problem in practice
Clocks in Virtual Machines

Virtual machines have additional clock challenges:
VM CLOCK PROBLEMS:
──────────────────
1. VM Pause:
   - VM paused for migration/snapshot
   - Guest OS clock stops
   - Resumes hours later, still thinks old time

2. Steal Time:
   - Hypervisor schedules other VMs
   - Guest gets less CPU than expected
   - Time-based assumptions break

3. Clock Source:
   - kvm-clock, Hyper-V clock, VMware tools
   - Each has different accuracy/overhead

AWS RECOMMENDATIONS:
────────────────────
• Use Amazon Time Sync Service (chrony)
• Enable enhanced networking
• Consider dedicated hosts for timing-critical apps

GCP/AZURE:
──────────
• Similar managed NTP services available
• Same VM clock challenges apply
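
On a Linux guest you can check which clock source the kernel is using; the sysfs path below is standard on modern kernels:

# Prints the active clock source, e.g. "kvm-clock" or "tsc"
with open("/sys/devices/system/clocksource/clocksource0/current_clocksource") as f:
    print(f.read().strip())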

Logical Clocks Deep Dive

Lamport Timestamps

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LAMPORT TIMESTAMPS IN ACTION                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Three processes exchanging messages:                                       │
│                                                                              │
│  Process A      Process B      Process C                                    │
│     │              │              │                                         │
│   (1)─────────────────────────────────────── Event: A starts                │
│     │              │              │                                         │
│   (2)────send────>(3)             │          A sends to B                   │
│     │              │              │                                         │
│     │            (4)────send────>(5)         B sends to C                   │
│     │              │              │                                         │
│   (3)              │            (6)          A, C do local events           │
│     │              │              │                                         │
│   (8)<─────────────────────────(7)           C sends to A                   │
│     │              │              │                                         │
│                                                                              │
│  LAMPORT CLOCK VALUES:                                                      │
│  ─────────────────────                                                      │
│  A: 1 → 2 → 3 → 8 (receives C's 7+1)                                       │
│  B: 3 → 4                                                                   │
│  C: 5 → 6 → 7                                                               │
│                                                                              │
│  PROPERTY:                                                                  │
│  ─────────                                                                  │
│  If A → B (A happened before B), then L(A) < L(B)                          │
│                                                                              │
│  BUT NOT THE CONVERSE!                                                      │
│  L(A) < L(B) does NOT mean A → B                                           │
│                                                                              │
│  Example: A's event (3) and C's event (6)                                  │
│  L(A:3) = 3, L(C:6) = 6, but they are CONCURRENT                          │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation:
import threading

class LamportClock:
    """Thread-safe Lamport clock implementation"""
    
    def __init__(self, node_id: str):
        self._counter = 0
        self._lock = threading.Lock()
        self.node_id = node_id
    
    def tick(self) -> int:
        """Increment clock for local event"""
        with self._lock:
            self._counter += 1
            return self._counter
    
    def send_timestamp(self) -> int:
        """Get timestamp for outgoing message"""
        with self._lock:
            self._counter += 1
            return self._counter
    
    def receive_timestamp(self, received: int) -> int:
        """Update clock on message receipt"""
        with self._lock:
            self._counter = max(self._counter, received) + 1
            return self._counter
    
    def current(self) -> int:
        """Get current clock value (no increment)"""
        with self._lock:
            return self._counter
    
    @staticmethod
    def compare(a: tuple[int, str], b: tuple[int, str]) -> int:
        """Compare (timestamp, node_id) pairs; the node_id breaks
        ties, giving a total order over events."""
        if a[0] != b[0]:
            return a[0] - b[0]
        return (a[1] > b[1]) - (a[1] < b[1])
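
A quick usage sketch reproducing the three-process exchange from the diagram (assumes the class above):

a, b, c = LamportClock("A"), LamportClock("B"), LamportClock("C")

a.tick()                            # A: 1 (local event)
msg_ab = a.send_timestamp()         # A: 2
b.receive_timestamp(msg_ab)         # B: max(0, 2) + 1 = 3
msg_bc = b.send_timestamp()         # B: 4
c.receive_timestamp(msg_bc)         # C: max(0, 4) + 1 = 5
a.tick(); c.tick()                  # A: 3 and C: 6 are concurrent
msg_ca = c.send_timestamp()         # C: 7
print(a.receive_timestamp(msg_ca))  # A: max(3, 7) + 1 = 8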

Vector Clocks

Vector clocks capture causality - they can tell you if two events are concurrent:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    VECTOR CLOCKS DEEP DIVE                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Each process maintains a vector of counters (one per process):             │
│                                                                              │
│  Process A: [A:0, B:0, C:0]                                                 │
│  Process B: [A:0, B:0, C:0]                                                 │
│  Process C: [A:0, B:0, C:0]                                                 │
│                                                                              │
│  RULES:                                                                     │
│  ──────                                                                     │
│  1. Before event: Increment own counter                                     │
│  2. Send: Attach current vector to message                                  │
│  3. Receive: Merge vectors (take max), then increment own                   │
│                                                                              │
│  EXAMPLE EXECUTION:                                                         │
│  ──────────────────                                                         │
│                                                                              │
│  A: [1,0,0]──────send──────>[1,1,0] (B)                                    │
│      │                        │                                             │
│    [2,0,0]                  [1,2,0]──send──>[1,2,1] (C)                     │
│      │                        │               │                             │
│    [3,0,0]                    │           [1,2,2]                           │
│      │                        │               │                             │
│    [4,2,3]<─────────────────────────────[1,2,3]                             │
│                                                                              │
│  COMPARISON RULES:                                                          │
│  ─────────────────                                                          │
│  V1 < V2  iff  ∀i: V1[i] ≤ V2[i] AND ∃j: V1[j] < V2[j]                    │
│  V1 || V2 (concurrent) iff neither V1 < V2 nor V2 < V1                     │
│                                                                              │
│  EXAMPLES:                                                                  │
│  [1,0,0] < [1,1,0] ✓ (A's first event happened before B's receive)         │
│  [2,0,0] || [1,2,0]  (A's [2,0,0] and B's [1,2,0] are concurrent)          │
│  [1,2,3] < [4,2,3] ✓ (C's send happened before A's receive)                │
│  [3,0,0] || [1,2,3]  (A's third event and C's send are concurrent)         │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation with Conflict Detection:
from enum import Enum
from typing import Dict
import copy

class Ordering(Enum):
    BEFORE = -1
    AFTER = 1
    CONCURRENT = 0
    EQUAL = 2

class VectorClock:
    """Vector clock implementation for distributed systems"""
    
    def __init__(self, node_id: str, nodes: list[str]):
        self.node_id = node_id
        self.clock: Dict[str, int] = {n: 0 for n in nodes}
    
    def tick(self) -> Dict[str, int]:
        """Increment for local event"""
        self.clock[self.node_id] += 1
        return self.copy()
    
    def send(self) -> Dict[str, int]:
        """Get timestamp for outgoing message"""
        self.clock[self.node_id] += 1
        return self.copy()
    
    def receive(self, received: Dict[str, int]) -> Dict[str, int]:
        """Merge incoming clock and increment"""
        for node, time in received.items():
            if node in self.clock:
                self.clock[node] = max(self.clock[node], time)
            else:
                self.clock[node] = time
        self.clock[self.node_id] += 1
        return self.copy()
    
    def copy(self) -> Dict[str, int]:
        return copy.deepcopy(self.clock)
    
    @staticmethod
    def compare(vc1: Dict[str, int], vc2: Dict[str, int]) -> Ordering:
        """Compare two vector clocks"""
        all_nodes = set(vc1.keys()) | set(vc2.keys())
        
        less = False
        greater = False
        
        for node in all_nodes:
            v1 = vc1.get(node, 0)
            v2 = vc2.get(node, 0)
            
            if v1 < v2:
                less = True
            elif v1 > v2:
                greater = True
        
        if less and not greater:
            return Ordering.BEFORE
        elif greater and not less:
            return Ordering.AFTER
        elif not less and not greater:
            return Ordering.EQUAL
        else:
            return Ordering.CONCURRENT
    
    @staticmethod
    def merge(vc1: Dict[str, int], vc2: Dict[str, int]) -> Dict[str, int]:
        """Merge two vector clocks (for conflict resolution)"""
        all_nodes = set(vc1.keys()) | set(vc2.keys())
        return {n: max(vc1.get(n, 0), vc2.get(n, 0)) for n in all_nodes}

Vector Clocks in Dynamo and Riak

┌─────────────────────────────────────────────────────────────────────────────┐
│                    VECTOR CLOCKS FOR CONFLICT DETECTION                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Shopping cart updates from multiple devices                      │
│                                                                              │
│  Initial: cart = ["shirt"], vc = [A:1]                                      │
│                                                                              │
│  Device 1 (Node A):                     Device 2 (Node B):                  │
│  Read cart ["shirt"], vc=[A:1]          Read cart ["shirt"], vc=[A:1]       │
│       │                                       │                             │
│       ▼                                       ▼                             │
│  Add "pants"                            Add "hat"                           │
│  cart = ["shirt","pants"]               cart = ["shirt","hat"]              │
│  vc = [A:2]                             vc = [A:1, B:1]                     │
│       │                                       │                             │
│       └────────────┬──────────────────────────┘                             │
│                    ▼                                                        │
│              CONFLICT DETECTED!                                             │
│              [A:2] || [A:1, B:1]                                            │
│                                                                              │
│  RESOLUTION OPTIONS:                                                        │
│  ───────────────────                                                        │
│  1. Last-Write-Wins (LWW): Pick one, lose data                             │
│  2. Return both: Let application merge                                      │
│  3. CRDT merge: ["shirt", "pants", "hat"]                                  │
│                                                                              │
│  AMAZON DYNAMO APPROACH:                                                    │
│  ───────────────────────                                                    │
│  • Return all conflicting versions (siblings)                               │
│  • Application must merge                                                   │
│  • New write includes merged vector clock                                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
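
The shopping-cart conflict above, expressed with the VectorClock class from the previous section (node names A and B as in the diagram):

# Both devices read the version written by node A, whose clock was [A:1]
node_a = VectorClock("A", ["A", "B"])
node_a.clock.update({"A": 1})
v1 = node_a.tick()   # write "pants" via A -> {A: 2, B: 0}

node_b = VectorClock("B", ["A", "B"])
node_b.clock.update({"A": 1})
v2 = node_b.tick()   # write "hat" via B -> {A: 1, B: 1}

print(VectorClock.compare(v1, v2))  # Ordering.CONCURRENT -> siblings, app must merge
print(VectorClock.merge(v1, v2))    # {'A': 2, 'B': 1} for the merged write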

Hybrid Logical Clocks (HLC)

Used by CockroachDB, MongoDB, and many modern databases:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HYBRID LOGICAL CLOCKS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  MOTIVATION:                                                                │
│  ───────────                                                                │
│  • Lamport clocks: No relation to physical time (hard to debug)            │
│  • Physical clocks: No causality guarantees                                 │
│  • HLC: Best of both worlds!                                                │
│                                                                              │
│  HLC STRUCTURE:                                                             │
│  ──────────────                                                             │
│  HLC = (physical_time, logical_counter)                                     │
│        └──────┬──────┘ └──────┬───────┘                                    │
│               │               │                                             │
│     Close to wall clock   Breaks ties when                                 │
│                           physical time is same                             │
│                                                                              │
│  PROPERTIES:                                                                │
│  ───────────                                                                │
│  1. l.pt ≥ pt (HLC physical component never falls behind physical time)    │
│  2. If e1 → e2, then l.e1 < l.e2 (causality preserved)                     │
│  3. Bounded divergence from physical time                                   │
│                                                                              │
│  ALGORITHM:                                                                 │
│  ──────────                                                                 │
│  (pt' = HLC physical part before the event; pt = local physical time)      │
│                                                                              │
│  Send/Local event:                                                          │
│    l.pt = max(pt', pt)   // Take max of HLC and physical time              │
│    if l.pt == pt':                                                          │
│      l.c = l.c + 1       // Wall clock didn't advance, increment counter   │
│    else:                                                                    │
│      l.c = 0             // Physical time advanced, reset counter          │
│                                                                              │
│  Receive(m):                                                                │
│    l.pt = max(pt', m.pt, pt)                                                │
│    if l.pt == pt' == m.pt:                                                  │
│      l.c = max(l.c, m.c) + 1                                                │
│    elif l.pt == pt':                                                        │
│      l.c = l.c + 1                                                          │
│    elif l.pt == m.pt:                                                       │
│      l.c = m.c + 1                                                          │
│    else:                                                                    │
│      l.c = 0                                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation:
import time
from dataclasses import dataclass
from threading import Lock

@dataclass
class HLCTimestamp:
    physical: int  # Milliseconds
    logical: int   # Counter
    
    def __lt__(self, other: 'HLCTimestamp') -> bool:
        if self.physical != other.physical:
            return self.physical < other.physical
        return self.logical < other.logical
    
    def __eq__(self, other: 'HLCTimestamp') -> bool:
        return self.physical == other.physical and self.logical == other.logical
    
    def __repr__(self) -> str:
        return f"HLC({self.physical}.{self.logical})"

class HybridLogicalClock:
    """
    Hybrid Logical Clock implementation.
    Used by CockroachDB, MongoDB, and others.
    """
    
    def __init__(self, max_drift_ms: int = 500):
        self._physical = 0
        self._logical = 0
        self._lock = Lock()
        self._max_drift = max_drift_ms
    
    def _now_ms(self) -> int:
        return int(time.time() * 1000)
    
    def now(self) -> HLCTimestamp:
        """Get current HLC timestamp for local event or send"""
        with self._lock:
            pt = self._now_ms()
            
            if pt > self._physical:
                self._physical = pt
                self._logical = 0
            else:
                self._logical += 1
            
            return HLCTimestamp(self._physical, self._logical)
    
    def receive(self, remote: HLCTimestamp) -> HLCTimestamp:
        """Update HLC on receiving a message"""
        with self._lock:
            pt = self._now_ms()
            
            # Check for excessive drift
            if remote.physical - pt > self._max_drift:
                raise ValueError(f"Remote clock too far ahead: {remote.physical - pt}ms")
            
            if pt > self._physical and pt > remote.physical:
                self._physical = pt
                self._logical = 0
            elif self._physical > remote.physical:
                self._logical += 1
            elif remote.physical > self._physical:
                self._physical = remote.physical
                self._logical = remote.logical + 1
            else:  # Equal physical times
                self._logical = max(self._logical, remote.logical) + 1
            
            return HLCTimestamp(self._physical, self._logical)
    
    def current(self) -> HLCTimestamp:
        """Get current timestamp without incrementing"""
        with self._lock:
            return HLCTimestamp(self._physical, self._logical)
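
A brief usage sketch, with two clock instances standing in for two nodes:

node1 = HybridLogicalClock()
node2 = HybridLogicalClock()

ts1 = node1.now()          # timestamp attached to an outgoing message
ts2 = node2.receive(ts1)   # receiver merges it into its own clock
assert ts1 < ts2           # causality holds even if node2's wall clock lags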

Google TrueTime

The gold standard for distributed time synchronization:
┌─────────────────────────────────────────────────────────────────────────────┐
│                    GOOGLE TRUETIME ARCHITECTURE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  TRUETIME API:                                                              │
│  ─────────────                                                              │
│  TT.now() → returns TTinterval: [earliest, latest]                          │
│  TT.after(t) → true if t is definitely in the past                         │
│  TT.before(t) → true if t is definitely in the future                      │
│                                                                              │
│  KEY INSIGHT:                                                               │
│  ────────────                                                               │
│  Instead of claiming "it's exactly 10:00:00.000"                            │
│  TrueTime says "it's between 10:00:00.000 and 10:00:00.007"                │
│                                                                              │
│  HARDWARE INFRASTRUCTURE:                                                   │
│  ────────────────────────                                                   │
│                                                                              │
│   ┌─────────────────────────────────────────────────────────────┐           │
│   │               DATACENTER TIME MASTERS                       │           │
│   │  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │           │
│   │  │  GPS        │   │  Atomic      │   │  GPS        │    │           │
│   │  │  Receiver   │   │  Clock (Rb)  │   │  Receiver   │    │           │
│   │  │  (GPS       │   │  (Armageddon │   │  (GPS       │    │           │
│   │  │   Master)   │   │   Master)    │   │   Master)   │    │           │
│   │  └──────┬──────┘   └──────┬───────┘   └──────┬──────┘    │           │
│   │         │                 │                   │           │           │
│   │         └────────────────┬───────────────────┘           │           │
│   │                          ▼                               │           │
│   │              ┌────────────────────┐                      │           │
│   │              │   Time Master      │                      │           │
│   │              │   Daemon           │                      │           │
│   │              └─────────┬──────────┘                      │           │
│   │                        │                                 │           │
│   └────────────────────────┼─────────────────────────────────┘           │
│                            │                                             │
│            ┌───────────────┼───────────────┐                             │
│            ▼               ▼               ▼                             │
│   ┌──────────────┐ ┌──────────────┐ ┌──────────────┐                    │
│   │  Spanserver  │ │  Spanserver  │ │  Spanserver  │                    │
│   │  (has local  │ │  (has local  │ │  (has local  │                    │
│   │   TT daemon) │ │   TT daemon) │ │   TT daemon) │                    │
│   └──────────────┘ └──────────────┘ └──────────────┘                    │
│                                                                          │
│  UNCERTAINTY SOURCES:                                                    │
│  ────────────────────                                                    │
│  1. Communication delay to time masters: ~1ms                            │
│  2. Local clock drift between syncs: ~200μs/sec                         │
│  3. Typical ε (uncertainty): 1-7ms                                       │
│                                                                          │
│  WAIT-OUT STRATEGY:                                                     │
│  ──────────────────                                                      │
│  To guarantee s1 < s2 for commits:                                       │
│  After getting commit timestamp, wait until TT.after(s1) is true        │
│  This ensures no one else can get a timestamp ≤ s1                      │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────────┘

Spanner’s External Consistency

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SPANNER COMMIT PROTOCOL                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  EXTERNAL CONSISTENCY: If T1 commits before T2 starts,                      │
│  then T1's timestamp < T2's timestamp (globally!)                           │
│                                                                              │
│  COMMIT WAIT ALGORITHM:                                                     │
│  ──────────────────────                                                     │
│                                                                              │
│  1. Acquire locks on all participants                                       │
│  2. Get commit timestamp s = TT.now().latest                                │
│  3. Wait until TT.after(s) is true                                          │
│  4. Release locks and return to client                                      │
│                                                                              │
│  WHY THIS WORKS:                                                            │
│  ───────────────                                                            │
│                                                                              │
│  Transaction T1:                                                            │
│     TT.now() = [100, 107]                                                   │
│     s1 = 107                                                                │
│     Wait until TT.after(107)                                                │
│     Commit at real time ~114                                                │
│                                                                              │
│  Transaction T2 (starts after T1 commits):                                  │
│     User sees T1 committed at real time 114                                 │
│     T2 starts at real time 115                                              │
│     TT.now() = [108, 115]  (at least!)                                     │
│     s2 ≥ 115 > 107 = s1                                                    │
│                                                                              │
│  LATENCY IMPACT:                                                            │
│  ───────────────                                                            │
│  Commit wait adds ε (1-7ms) to every transaction                           │
│  This is why Google invests heavily in reducing ε                          │
│  GPS receivers on every datacenter roof!                                    │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
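
A hedged sketch of the commit-wait step against an uncertainty-interval clock. TTInterval, tt_now, and the fixed 7ms bound are illustrative stand-ins, not Google's API; the point is steps 2-3 of the algorithm above:

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # seconds since epoch
    latest: float

def tt_now(uncertainty_s: float = 0.007) -> TTInterval:
    """Illustrative TT.now(): wall clock plus/minus an assumed bound."""
    t = time.time()
    return TTInterval(t - uncertainty_s, t + uncertainty_s)

def tt_after(t: float) -> bool:
    """TT.after(t): true once t is definitely in the past on every node."""
    return tt_now().earliest > t

def commit_wait() -> float:
    s = tt_now().latest       # step 2: choose the commit timestamp
    while not tt_after(s):    # step 3: wait out the uncertainty (~2 * epsilon)
        time.sleep(0.001)
    return s                  # now safe to release locks and ack the client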

Practical Guidelines

Choosing the Right Clock

Use physical (wall-clock) time for:
  • Generating user-facing timestamps
  • Debugging and logging
  • TTL (time-to-live) calculations
  • Scheduling future events
  • Audit trails and compliance
Caveat: accept that ordering might be wrong across machines; use logical or hybrid clocks when ordering matters.

Clock Best Practices

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CLOCK BEST PRACTICES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. USE MONOTONIC CLOCKS FOR DURATIONS                                      │
│     ✗ end_time - start_time (wall clock, can go backward!)                 │
│     ✓ Use monotonic clock APIs                                              │
│                                                                              │
│  2. DON'T TRUST TIMESTAMPS ACROSS MACHINES                                  │
│     ✗ if remote_timestamp > local_timestamp                                │
│     ✓ Use vector clocks or HLC for ordering                                │
│                                                                              │
│  3. ALWAYS INCLUDE UNCERTAINTY                                              │
│     ✗ "Event happened at exactly 10:00:00.000"                             │
│     ✓ "Event happened between 10:00:00.000 and 10:00:00.010"               │
│                                                                              │
│  4. USE UTC EVERYWHERE INTERNALLY                                           │
│     ✗ Store local times, convert on display                                │
│     ✓ Store UTC, convert to local only for display                         │
│                                                                              │
│  5. HANDLE LEAP SECONDS                                                     │
│     ✗ Assume 86400 seconds per day                                         │
│     ✓ Use libraries that handle leap seconds, or leap smearing             │
│                                                                              │
│  6. SYNC YOUR SERVERS                                                       │
│     ✗ Default OS NTP settings                                              │
│     ✓ Configure chrony/ntpd properly, monitor clock drift                  │
│                                                                              │
│  7. LOG CLOCK SKEW                                                          │
│     Monitor and alert on excessive clock drift between nodes               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Interview Questions

Q: How does Google Spanner provide external consistency across datacenters?
Answer:
  1. TrueTime API: Returns time interval [earliest, latest] instead of single value
  2. GPS + Atomic Clocks: Hardware in every datacenter for accurate time
  3. Commit Wait: After getting timestamp, wait until uncertainty period passes
  4. Guarantee: If T1 commits before T2 starts, T1’s timestamp < T2’s timestamp
The wait adds latency (1-7ms) but provides global ordering without coordination.
Q: Why can't you rely on physical clocks to order events in a distributed system?
Answer:
  1. Clock Skew: Different machines have different times (milliseconds to seconds)
  2. Clock Drift: Clocks run at slightly different speeds
  3. NTP Jumps: Clocks can jump forward or backward during sync
  4. No Causality: Physical time doesn’t capture happened-before relationships
Example: Machine A (clock running ahead) writes at “10:00:00.100”; Machine B (clock running behind) then writes at “10:00:00.050” - B’s write appears earlier despite happening after A’s!
Q: When would you choose vector clocks over hybrid logical clocks, and vice versa?
Answer:
Vector Clocks:
  • When you need precise conflict detection
  • Systems like DynamoDB/Riak that return conflicting versions
  • When number of nodes is bounded and small
HLC:
  • When you need approximate physical time for debugging
  • Systems with many nodes (vector clock size = O(n))
  • Databases like CockroachDB, MongoDB
  • When snapshot isolation is needed
Key difference: vector clocks grow with the number of participants (O(n) entries), while an HLC is a constant-size (physical, logical) pair.
Q: How would you design event ordering across multiple data centers?
Answer approach:
  1. Understand requirements: Strong ordering (expensive) vs causal ordering (cheaper)
  2. For causal ordering:
    • Use HLC at each data center
    • Propagate timestamps with messages
    • Compare HLC timestamps for ordering
  3. For strong ordering:
    • Central sequencer (single point of failure)
    • OR distributed consensus (high latency)
    • OR TrueTime-like approach (hardware investment)
  4. Practical trade-off:
    • Use causal ordering where possible
    • Strong ordering only where required (e.g., financial transactions)

Key Takeaways

Perfect Time is Impossible

Accept uncertainty. Design systems that handle clock skew gracefully.

Causality ≠ Physical Time

Use logical or hybrid clocks when you need to reason about event ordering.

Measure and Monitor

Track clock drift between nodes. Alert when skew exceeds thresholds.

Choose the Right Tool

Physical for display, logical for causality, HLC for both. TrueTime if you’re Google.