Skip to main content

Distributed Coordination

Coordination services are the backbone of distributed systems, providing primitives for leader election, configuration management, service discovery, and distributed locking.
Module Duration: 12-16 hours
Key Topics: Zookeeper, etcd, Consul, Service Discovery, Leader Election, Distributed Locks
Interview Focus: Zookeeper recipes, leader election algorithms, lock safety

Why Coordination is Hard

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE COORDINATION PROBLEM                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Three database replicas need to elect a leader                  │
│                                                                              │
│  Without coordination:                                                      │
│  ─────────────────────                                                      │
│  Node A: "I'll be leader!" (doesn't hear from others)                      │
│  Node B: "I'll be leader!" (doesn't hear from others)                      │
│  Node C: "I'll be leader!" (doesn't hear from others)                      │
│                                                                              │
│  Result: SPLIT BRAIN! Three leaders, data diverges                         │
│                                                                              │
│  FUNDAMENTAL CHALLENGES:                                                    │
│  ────────────────────────                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ 1. PARTIAL FAILURE                                                  │   │
│  │    Some nodes crash, network partitions, but others continue        │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │ 2. ASYNCHRONY                                                       │   │
│  │    Messages delayed arbitrarily, can't distinguish slow from dead   │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │ 3. ORDERING                                                         │   │
│  │    No global clock, messages arrive out of order                    │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │ 4. BYZANTINE BEHAVIOR                                               │   │
│  │    Nodes may behave incorrectly (usually ignored in coordination)   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  SOLUTION: Use a coordination service that provides strong guarantees      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Zookeeper

Industry Standard: Zookeeper is used by Kafka, HBase, Solr, and many other distributed systems. Understanding it is essential.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ZOOKEEPER ARCHITECTURE                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ENSEMBLE (typically 3 or 5 nodes):                                        │
│  ──────────────────────────────────                                         │
│                                                                              │
│            ┌─────────────────────────────────────────────────┐             │
│            │                  CLIENTS                         │             │
│            └───────────┬──────────────────────┬──────────────┘             │
│                        │                      │                             │
│           ┌────────────┼──────────────────────┼────────────┐               │
│           ▼            ▼                      ▼            ▼               │
│      ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐      │
│      │   ZK1   │  │   ZK2   │  │   ZK3   │  │   ZK4   │  │   ZK5   │      │
│      │(Leader) │  │(Follower│  │(Follower│  │(Follower│  │(Follower│      │
│      │         │◄─┤         │◄─┤         │◄─┤         │◄─┤         │      │
│      └─────────┘  └─────────┘  └─────────┘  └─────────┘  └─────────┘      │
│           │                                                                  │
│           │ Writes go through leader                                        │
│           │ Reads can go to any node                                        │
│           │                                                                  │
│           └─────────────────────────────────────────────────────────        │
│                                                                              │
│  DATA MODEL: Hierarchical namespace (like filesystem)                      │
│  ───────────────────────────────────────────────────                        │
│                                                                              │
│                         /                                                   │
│                    ┌────┴────┐                                              │
│                   /app      /services                                       │
│              ┌────┴────┐     ┌──┴──┐                                       │
│          /config  /locks  /api  /db                                        │
│             │        │      │     │                                        │
│         /leader  /lock_1  /node1 /leader                                   │
│                  /lock_2  /node2                                           │
│                           /node3                                           │
│                                                                              │
│  NODE TYPES:                                                                │
│  ───────────                                                                │
│  Persistent:  Survives client disconnect                                   │
│  Ephemeral:   Deleted when client session ends                             │
│  Sequential:  Appends monotonic number (lock_0001, lock_0002)              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Zookeeper Guarantees

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ZOOKEEPER GUARANTEES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. SEQUENTIAL CONSISTENCY                                                  │
│     ─────────────────────────                                               │
│     Updates from a client are applied in order sent                        │
│                                                                              │
│  2. ATOMICITY                                                               │
│     ──────────                                                              │
│     Updates either succeed completely or fail completely                   │
│                                                                              │
│  3. SINGLE SYSTEM IMAGE                                                     │
│     ───────────────────────                                                 │
│     Client sees same view regardless of which server it connects to        │
│                                                                              │
│  4. RELIABILITY                                                             │
│     ───────────                                                             │
│     Once update is applied, it persists until overwritten                  │
│                                                                              │
│  5. TIMELINESS                                                              │
│     ───────────                                                             │
│     Client view is guaranteed current within bounded time                  │
│                                                                              │
│  CONSISTENCY MODEL:                                                         │
│  ──────────────────                                                         │
│  Writes: Linearizable (go through leader, use ZAB)                         │
│  Reads:  Sequential consistency (may read stale, but ordered)              │
│                                                                              │
│  For linearizable reads: Use sync() before read                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

ZAB Protocol (Zookeeper Atomic Broadcast)

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ZAB PROTOCOL                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ZAB ensures all replicas agree on order of updates.                       │
│                                                                              │
│  PHASES:                                                                    │
│  ───────                                                                    │
│                                                                              │
│  1. LEADER ELECTION                                                         │
│     Nodes elect a leader (highest epoch + zxid wins)                       │
│                                                                              │
│  2. DISCOVERY                                                               │
│     Leader gathers state from followers                                    │
│     Determines most up-to-date log                                         │
│                                                                              │
│  3. SYNCHRONIZATION                                                         │
│     Leader syncs followers to consistent state                             │
│     Replays missing transactions                                           │
│                                                                              │
│  4. BROADCAST (steady state)                                               │
│     Leader receives writes, assigns zxid                                   │
│     Broadcasts to followers, waits for majority ACK                        │
│     Commits and notifies client                                            │
│                                                                              │
│  ZXID (Transaction ID):                                                    │
│  ──────────────────────                                                     │
│  64-bit: [32-bit epoch][32-bit counter]                                    │
│  Epoch: Increments with each new leader                                    │
│  Counter: Increments with each transaction                                 │
│                                                                              │
│  Example: 0x00000002_00000015                                              │
│           epoch=2, transaction=21                                          │
│                                                                              │
│  WRITE FLOW:                                                                │
│  ───────────                                                                │
│  Client ─Write─► Leader                                                    │
│  Leader: Assign zxid, persist to log                                       │
│  Leader ─Proposal─► Followers                                              │
│  Followers: Persist to log, send ACK                                       │
│  Leader: Receive majority ACKs                                             │
│  Leader ─Commit─► Followers                                                │
│  Leader ─Response─► Client                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

etcd

Modern Coordination Service

┌─────────────────────────────────────────────────────────────────────────────┐
│                    etcd ARCHITECTURE                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  etcd is the coordination backbone of Kubernetes.                          │
│                                                                              │
│  KEY DIFFERENCES FROM ZOOKEEPER:                                            │
│  ───────────────────────────────                                            │
│  ┌─────────────────────┬─────────────────────┬─────────────────────┐       │
│  │ Feature             │ Zookeeper           │ etcd                │       │
│  ├─────────────────────┼─────────────────────┼─────────────────────┤       │
│  │ Consensus           │ ZAB                 │ Raft                │       │
│  │ Data Model          │ Hierarchical        │ Flat key-value      │       │
│  │ API                 │ Binary (Jute)       │ gRPC + JSON         │       │
│  │ Watch               │ One-time triggers   │ Streaming           │       │
│  │ Transactions        │ Multi-op            │ Compare-and-swap    │       │
│  │ Language            │ Java                │ Go                  │       │
│  └─────────────────────┴─────────────────────┴─────────────────────┘       │
│                                                                              │
│  RAFT CONSENSUS:                                                            │
│  ───────────────                                                            │
│  Leader Election:                                                          │
│  - Nodes start as followers                                                │
│  - If no heartbeat, become candidate                                       │
│  - Request votes from peers                                                │
│  - Majority votes → become leader                                          │
│                                                                              │
│  Log Replication:                                                          │
│  - Leader appends entries to log                                           │
│  - Sends AppendEntries to followers                                        │
│  - Committed when majority have it                                         │
│                                                                              │
│  KEY FEATURES:                                                              │
│  ─────────────                                                              │
│  • MVCC (Multi-Version Concurrency Control)                               │
│  • Watch with revision numbers                                             │
│  • Lease-based TTL                                                         │
│  • Transactions (If-Then-Else)                                            │
│  • Linearizable reads                                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

etcd API Examples

import etcd3

# Connect to etcd
etcd = etcd3.client(host='localhost', port=2379)

# Basic key-value operations
etcd.put('/config/database/host', 'db.example.com')
value, metadata = etcd.get('/config/database/host')
print(value.decode())  # 'db.example.com'

# Compare-and-swap (atomic update)
etcd.transaction(
    compare=[
        etcd.transactions.version('/config/database/host') > 0
    ],
    success=[
        etcd.transactions.put('/config/database/host', 'newdb.example.com')
    ],
    failure=[
        etcd.transactions.put('/config/database/host', 'fallback.example.com')
    ]
)

# Leases (for ephemeral keys)
lease = etcd.lease(ttl=30)  # 30 second TTL
etcd.put('/services/api/node1', '192.168.1.1:8080', lease=lease)

# Keep lease alive
for _ in lease.keepalive():
    print("Lease renewed")

# Watch for changes
events_iterator, cancel = etcd.watch_prefix('/services/')
for event in events_iterator:
    print(f"Key: {event.key}, Value: {event.value}, Type: {event.type}")


# Distributed lock
lock = etcd.lock('/locks/my-resource', ttl=30)
with lock:
    # Critical section
    print("Acquired lock, doing work...")

Leader Election Patterns

Zookeeper Recipe

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LEADER ELECTION WITH ZOOKEEPER                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ALGORITHM:                                                                 │
│  ──────────                                                                 │
│  1. All candidates create EPHEMERAL SEQUENTIAL node under /election        │
│  2. Get all children of /election                                          │
│  3. If your node is the smallest → YOU ARE LEADER                          │
│  4. Otherwise → watch the node just before yours                           │
│  5. When watched node deleted → repeat from step 2                         │
│                                                                              │
│  EXAMPLE:                                                                   │
│  ─────────                                                                  │
│  /election                                                                  │
│     ├── node_0000000001 (created by Node A) ← LEADER                       │
│     ├── node_0000000002 (created by Node B) watches 0001                   │
│     └── node_0000000003 (created by Node C) watches 0002                   │
│                                                                              │
│  If Node A crashes:                                                         │
│  - node_0000000001 deleted (ephemeral)                                     │
│  - Node B notified (was watching 0001)                                     │
│  - Node B checks: am I smallest? Yes → Node B is new LEADER                │
│                                                                              │
│  WHY WATCH ONLY PREDECESSOR?                                                │
│  ────────────────────────────                                               │
│  Watching /election directly → HERD EFFECT                                 │
│  All nodes notified on any change → thundering herd                        │
│  Watching predecessor → O(1) notifications, graceful succession            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation

from kazoo.client import KazooClient
from kazoo.recipe.election import Election

class LeaderElection:
    """
    Leader election using Zookeeper.
    """
    
    def __init__(self, zk_hosts: str, election_path: str, node_id: str):
        self.zk = KazooClient(hosts=zk_hosts)
        self.election_path = election_path
        self.node_id = node_id
        self.is_leader = False
        self.my_node = None
    
    def start(self):
        self.zk.start()
        self.zk.ensure_path(self.election_path)
        self._participate()
    
    def _participate(self):
        # Create ephemeral sequential node
        path = f"{self.election_path}/node_"
        self.my_node = self.zk.create(
            path, 
            value=self.node_id.encode(),
            ephemeral=True, 
            sequence=True
        )
        self._check_leadership()
    
    def _check_leadership(self):
        children = self.zk.get_children(self.election_path)
        children.sort()
        
        my_seq = self.my_node.split('_')[-1]
        
        if children[0].endswith(my_seq):
            # I'm the leader!
            self.is_leader = True
            self._on_elected_leader()
        else:
            # Watch predecessor
            my_index = next(
                i for i, c in enumerate(children) 
                if c.endswith(my_seq)
            )
            predecessor = children[my_index - 1]
            predecessor_path = f"{self.election_path}/{predecessor}"
            
            @self.zk.DataWatch(predecessor_path)
            def watch_predecessor(data, stat):
                if stat is None:  # Node deleted
                    self._check_leadership()
                return False  # One-time watch
    
    def _on_elected_leader(self):
        print(f"Node {self.node_id} is now the LEADER!")
        # Do leader work...
    
    def stop(self):
        self.zk.stop()


# Using Kazoo's built-in recipe
from kazoo.recipe.election import Election

zk = KazooClient(hosts='localhost:2181')
zk.start()

election = Election(zk, "/election", "my-node-1")

# This blocks until we're the leader
election.run(leader_function)

# Or use callbacks
election.election.run_async(leader_function)

Distributed Locks

Lock Recipes

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED LOCK PATTERNS                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ZOOKEEPER LOCK RECIPE:                                                     │
│  ──────────────────────                                                     │
│  Similar to leader election!                                               │
│                                                                              │
│  1. Create ephemeral sequential under /locks/resource                      │
│  2. If smallest → acquired lock                                            │
│  3. Otherwise → watch predecessor                                          │
│  4. On predecessor delete → check if smallest                              │
│  5. On unlock → delete your node                                           │
│                                                                              │
│  READ-WRITE LOCKS:                                                          │
│  ─────────────────                                                          │
│  Write: /locks/resource/write_0000001                                      │
│  Read:  /locks/resource/read_0000002                                       │
│                                                                              │
│  Write lock: Wait for ALL predecessors                                     │
│  Read lock: Wait for only WRITE predecessors                               │
│                                                                              │
│  FENCING TOKENS:                                                            │
│  ───────────────                                                            │
│  Problem: Lock holder pauses (GC), lock times out, another gets lock       │
│           Original holder resumes, thinks it still has lock!               │
│                                                                              │
│  Solution: Fencing tokens                                                  │
│  - Lock returns monotonically increasing token                             │
│  - Resource rejects requests with old tokens                               │
│                                                                              │
│     Holder A: token=33 ─────────────────────────────────────► resource     │
│     Holder B: token=34 ─────────► resource (rejects token=33 from A)      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Redis Distributed Locks (Redlock)

import redis
import time
import uuid

class RedisLock:
    """
    Simple Redis distributed lock.
    
    WARNING: Has known issues! See Redlock debate.
    Consider Zookeeper/etcd for critical applications.
    """
    
    def __init__(self, redis_client, resource: str, ttl: int = 10):
        self.redis = redis_client
        self.resource = f"lock:{resource}"
        self.ttl = ttl
        self.token = None
    
    def acquire(self, timeout: float = 10) -> bool:
        """Try to acquire lock with timeout."""
        self.token = str(uuid.uuid4())
        end_time = time.time() + timeout
        
        while time.time() < end_time:
            # SET NX (only if not exists) with expiry
            if self.redis.set(
                self.resource, 
                self.token, 
                nx=True, 
                ex=self.ttl
            ):
                return True
            time.sleep(0.1)
        
        return False
    
    def release(self):
        """Release lock only if we own it."""
        # Lua script for atomic check-and-delete
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """
        self.redis.eval(script, 1, self.resource, self.token)
    
    def extend(self, additional_time: int):
        """Extend lock TTL if we own it."""
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("EXPIRE", KEYS[1], ARGV[2])
        else
            return 0
        end
        """
        self.redis.eval(
            script, 1, self.resource, 
            self.token, self.ttl + additional_time
        )


class Redlock:
    """
    Redlock algorithm for distributed locking across Redis instances.
    
    Requires N independent Redis instances (typically 5).
    Lock acquired if majority (N/2 + 1) grant it.
    """
    
    def __init__(self, redis_instances: list):
        self.instances = redis_instances
        self.quorum = len(redis_instances) // 2 + 1
    
    def acquire(self, resource: str, ttl: int) -> tuple:
        """
        Try to acquire lock on majority of instances.
        
        Returns (success, token) tuple.
        """
        token = str(uuid.uuid4())
        start_time = time.time()
        
        acquired = 0
        for instance in self.instances:
            try:
                if instance.set(
                    f"lock:{resource}", token, 
                    nx=True, px=ttl
                ):
                    acquired += 1
            except redis.RedisError:
                pass  # Instance unreachable
        
        # Check if we acquired quorum
        elapsed = (time.time() - start_time) * 1000  # ms
        validity_time = ttl - elapsed - 2  # drift allowance
        
        if acquired >= self.quorum and validity_time > 0:
            return True, token
        
        # Failed - release any locks acquired
        self._release_all(resource, token)
        return False, None
    
    def _release_all(self, resource: str, token: str):
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        end
        """
        for instance in self.instances:
            try:
                instance.eval(script, 1, f"lock:{resource}", token)
            except redis.RedisError:
                pass
Redlock Controversy: Martin Kleppmann’s analysis shows Redlock can fail under certain conditions (GC pauses, clock drift). For safety-critical applications, prefer coordination services with stronger guarantees (Zookeeper, etcd).

Advanced: Distributed Coordination Patterns

1. Barrier Synchronization

In parallel computing and distributed data processing (like MPI or Apache Spark), you often need to ensure that no node starts Step 2 until all nodes have finished Step 1.

The Distributed Barrier Algorithm (Zookeeper Recipe)

  1. Barrier Creation: A designated node creates a znode /barrier.
  2. Entering the Barrier:
    • Each participant node creates an ephemeral znode under /barrier/node_i.
    • Nodes call getChildren("/barrier", watch=True).
    • If the number of children is less than NN (the threshold), they wait for the watch to trigger.
    • Once the NN-th node joins, the watch triggers, and all nodes proceed.
  3. Exiting the Barrier:
    • Similar to entering, nodes wait until all NN nodes have finished their work before deleting their ephemeral znodes.

2. Distributed Phasers

A Phaser is a more advanced synchronization primitive that combines aspects of a Barrier and a CountDownLatch.

Why Phasers?

Static barriers require a fixed number of participants (NN). Distributed phasers allow:
  • Dynamic Registration: Nodes can join or leave the coordination group mid-flight.
  • Hierarchical Phasers: To avoid the O(N)O(N) bottleneck of a single coordination znode, phasers can be arranged in a tree. A parent phaser only moves to the next phase once all its child phasers have signaled.
  • Tiered Wait: Some nodes can “signal” they are done but not “wait” for others (useful for pipelining).

3. Distributed Lock-Free Patterns

At Staff/Principal level, you must understand when locks (even distributed ones) are too expensive and how to move toward Lock-Free progress. In single-node systems, we use CAS (Compare-And-Swap) instructions. In distributed systems, we use Optimistic Concurrency Control (OCC) and Wait-Free Data Structures.

Hazard Pointers (Distributed Context)

Hazard pointers are used to safely manage memory in lock-free data structures by ensuring that a node is not deleted while another thread (or node) is still accessing it. In a distributed environment (e.g., a shared-memory distributed graph or a distributed cache like Pelikan), hazard pointers can be implemented using a Lease-based Central Registry:
  • Before a node reads a shared object, it publishes a “Hazard Lease” (short TTL).
  • The “Owner” of the object cannot reclaim or reuse that memory address until all hazard leases have expired or been revoked.

RCU (Read-Copy-Update) at Scale

RCU allows multiple readers to access data while a single writer updates it without any locks.
  • Read: Readers access the current version of the data without any overhead.
  • Update: The writer creates a new copy, updates it, and atomically swaps the pointer.
  • Reclaim: The writer waits for a “Grace Period” (where all pre-existing readers have finished) before deleting the old copy.
Distributed RCU Implementation: Using a coordination service (etcd/ZK) to track the Grace Period:
  1. Writer: Publishes a new configuration version V2V_2 and records the timestamp TswapT_{swap}.
  2. Readers: Report their “Highest Observed Version” periodically.
  3. Reclaim: Once all active readers report a version V2\ge V_2, the writer safely deletes V1V_1.

4. Comparison of Progress Guarantees

Progress LevelDefinitionScalabilityCost
Lock-BasedAt least one thread is making progress as long as it holds the lock.Poor (Contention)High (Wait time)
Lock-FreeAt least one thread in the system is guaranteed to make progress.GoodLow
Wait-FreeEvery thread is guaranteed to make progress in a finite number of steps.BestLowest
Staff Tip: Moving from Lock-based to Lock-Free coordination (using OCC or RCU) is the primary way to achieve linear scalability in high-throughput control planes.

Load Balancing Algorithms

Power of Two Choices (P2C)

For Staff-level engineering, you must understand why simple “Least Connections” or “Round Robin” can fail in large-scale distributed systems. The Problem: Herd Effect In a traditional “Least Connections” load balancer, if one server becomes slightly faster, all clients might simultaneously see it as the “least loaded” and overwhelm it (the thundering herd). The Solution: P2C Introduced by Michael Mitzenmacher, the Power of Two Choices algorithm is elegantly simple:
  1. Pick two random nodes from the pool.
  2. Choose the one with the least load (e.g., fewest active requests).
# P2C Implementation
import random

class P2CLoadBalancer:
    def __init__(self, nodes):
        self.nodes = nodes # List of (node_id, current_load)
        
    def select_node(self):
        # Pick two random candidates
        n1, n2 = random.sample(self.nodes, 2)
        
        # Pick the winner
        return n1 if n1.load < n2.load else n2
Why it works:
  • Exponential Improvement: Choosing the best of two random samples provides an exponential improvement over choosing one random sample. It performs almost as well as “Least Loaded” across the entire pool but with O(1)O(1) complexity instead of O(N)O(N).
  • Avoids Hotspots: Because it’s stochastic, it prevents all clients from converging on the same “best” node at the exact same micro-second.
Used By: NGINX, HAProxy, gRPC (lookaside LB), and Finagle (Twitter).

Service Discovery

Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SERVICE DISCOVERY PATTERNS                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. CLIENT-SIDE DISCOVERY                                                   │
│     ─────────────────────────                                               │
│                                                                              │
│     ┌─────────────┐      ┌──────────────┐      ┌─────────────┐             │
│     │   Client    │─────►│   Registry   │      │  Service A  │             │
│     │             │◄─────│ (Consul/etcd)│      │ 10.0.1.1    │             │
│     └──────┬──────┘      └──────────────┘      └─────────────┘             │
│            │                                    ┌─────────────┐             │
│            └───────────────────────────────────►│  Service A  │             │
│              Client knows all instances         │ 10.0.1.2    │             │
│              Does load balancing locally        └─────────────┘             │
│                                                                              │
│     Pro: No single point of failure                                        │
│     Con: Client complexity, language coupling                              │
│                                                                              │
│  2. SERVER-SIDE DISCOVERY                                                   │
│     ─────────────────────────                                               │
│                                                                              │
│     ┌─────────────┐      ┌──────────────┐      ┌─────────────┐             │
│     │   Client    │─────►│    Load      │─────►│  Service A  │             │
│     │             │      │   Balancer   │      │ 10.0.1.1    │             │
│     └─────────────┘      └───────┬──────┘      └─────────────┘             │
│                                  │              ┌─────────────┐             │
│                                  │              │  Service A  │             │
│                                  │              │ 10.0.1.2    │             │
│                                  │              └─────────────┘             │
│                                  │                                          │
│                                  └──────► Registry                          │
│                                                                              │
│     Pro: Simple client, centralized                                        │
│     Con: Load balancer is SPOF, latency                                    │
│                                                                              │
│  3. DNS-BASED DISCOVERY                                                     │
│     ───────────────────                                                     │
│                                                                              │
│     Client ─► DNS ─► api.internal → [10.0.1.1, 10.0.1.2, 10.0.1.3]        │
│                                                                              │
│     Pro: Standard, simple                                                  │
│     Con: TTL caching delays updates, no health checks                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Consul Service Discovery

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONSUL SERVICE DISCOVERY                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ARCHITECTURE:                                                              │
│  ─────────────                                                              │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐    │
│  │                         CONSUL CLUSTER                             │    │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐                  │    │
│  │  │   Server    │ │   Server    │ │   Server    │                  │    │
│  │  │  (Leader)   │ │  (Follower) │ │  (Follower) │                  │    │
│  │  └─────────────┘ └─────────────┘ └─────────────┘                  │    │
│  └────────────────────────────────────────────────────────────────────┘    │
│              ▲                                                              │
│              │ Raft consensus                                               │
│              │                                                              │
│  ┌───────────┴────────────┬─────────────────────┬────────────────────┐     │
│  │                        │                     │                    │     │
│  ▼                        ▼                     ▼                    ▼     │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌──────────┐ │
│  │ Node 1          │ │ Node 2          │ │ Node 3          │ │ Node 4   │ │
│  │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │          │ │
│  │ │Consul Agent │ │ │ │Consul Agent │ │ │ │Consul Agent │ │ │  Agent   │ │
│  │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │          │ │
│  │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │          │ │
│  │ │   API       │ │ │ │   Web       │ │ │ │   Worker    │ │ │          │ │
│  │ │  Service    │ │ │ │  Service    │ │ │ │  Service    │ │ │          │ │
│  │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │          │ │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘ └──────────┘ │
│                                                                              │
│  FEATURES:                                                                  │
│  ─────────                                                                  │
│  • Service registration (agent-based or API)                               │
│  • Health checks (HTTP, TCP, script, TTL)                                  │
│  • DNS interface (service.consul)                                          │
│  • HTTP API for discovery                                                  │
│  • Key-Value store                                                         │
│  • Multi-datacenter support                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation Example

import consul
import socket
import threading
import time

class ServiceRegistry:
    """
    Consul-based service registration and discovery.
    """
    
    def __init__(self, consul_host: str = 'localhost', consul_port: int = 8500):
        self.consul = consul.Consul(host=consul_host, port=consul_port)
    
    def register_service(
        self, 
        name: str, 
        port: int, 
        tags: list = None,
        health_check_interval: str = "10s"
    ) -> str:
        """Register a service with health check."""
        service_id = f"{name}-{socket.gethostname()}-{port}"
        
        self.consul.agent.service.register(
            name=name,
            service_id=service_id,
            port=port,
            tags=tags or [],
            check=consul.Check.http(
                url=f"http://localhost:{port}/health",
                interval=health_check_interval,
                timeout="5s"
            )
        )
        
        return service_id
    
    def deregister_service(self, service_id: str):
        """Deregister a service."""
        self.consul.agent.service.deregister(service_id)
    
    def discover_service(self, name: str, only_healthy: bool = True) -> list:
        """Discover all instances of a service."""
        _, services = self.consul.health.service(
            name, 
            passing=only_healthy
        )
        
        instances = []
        for service in services:
            instances.append({
                'id': service['Service']['ID'],
                'address': service['Service']['Address'] or service['Node']['Address'],
                'port': service['Service']['Port'],
                'tags': service['Service']['Tags'],
            })
        
        return instances
    
    def watch_service(self, name: str, callback):
        """Watch for service changes."""
        index = None
        
        def watch_loop():
            nonlocal index
            while True:
                try:
                    index, services = self.consul.health.service(
                        name, 
                        passing=True,
                        index=index,  # Blocking query
                        wait='30s'
                    )
                    callback(services)
                except Exception as e:
                    print(f"Watch error: {e}")
                    time.sleep(5)
        
        thread = threading.Thread(target=watch_loop, daemon=True)
        thread.start()
        return thread


# Client-side load balancing
import random

class LoadBalancer:
    """Simple client-side load balancer with service discovery."""
    
    def __init__(self, registry: ServiceRegistry, service_name: str):
        self.registry = registry
        self.service_name = service_name
        self.instances = []
        
        # Watch for changes
        self.registry.watch_service(service_name, self._update_instances)
        self._refresh_instances()
    
    def _refresh_instances(self):
        self.instances = self.registry.discover_service(self.service_name)
    
    def _update_instances(self, services):
        self.instances = [
            {
                'address': s['Service']['Address'] or s['Node']['Address'],
                'port': s['Service']['Port'],
            }
            for s in services
        ]
    
    def get_instance(self) -> dict:
        """Get an instance using round-robin or random selection."""
        if not self.instances:
            self._refresh_instances()
        
        if not self.instances:
            raise Exception(f"No healthy instances for {self.service_name}")
        
        return random.choice(self.instances)

Configuration Management

Dynamic Configuration

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONFIGURATION MANAGEMENT                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PROBLEM: How to update configuration across many services?                │
│                                                                              │
│  STATIC CONFIG (bad):                                                       │
│  - Config in files, requires restart                                       │
│  - Different values in different environments                              │
│  - No audit trail                                                          │
│                                                                              │
│  DYNAMIC CONFIG (good):                                                     │
│  - Config in coordination service                                          │
│  - Watch for changes, apply immediately                                    │
│  - Audit trail in version history                                          │
│                                                                              │
│  EXAMPLE - Feature Flags:                                                   │
│  ─────────────────────────                                                  │
│  /config/features/new_checkout     = {"enabled": true, "percentage": 50}   │
│  /config/features/dark_mode        = {"enabled": false}                    │
│  /config/limits/rate_limit_qps     = 1000                                  │
│  /config/limits/max_connections    = 500                                   │
│                                                                              │
│  WATCH PATTERN:                                                             │
│  ──────────────                                                             │
│  Service starts → read all config                                          │
│  Set watch on /config/...                                                  │
│  On change → update in-memory config                                       │
│  No restart needed!                                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
class ConfigManager:
    """
    Dynamic configuration management with etcd.
    """
    
    def __init__(self, etcd_client, prefix: str = '/config'):
        self.etcd = etcd_client
        self.prefix = prefix
        self.config = {}
        self.callbacks = {}
        
        self._load_all()
        self._start_watch()
    
    def _load_all(self):
        """Load all configuration values."""
        for value, metadata in self.etcd.get_prefix(self.prefix):
            key = metadata.key.decode().replace(self.prefix + '/', '')
            self.config[key] = json.loads(value.decode())
    
    def _start_watch(self):
        """Watch for configuration changes."""
        events_iterator, cancel = self.etcd.watch_prefix(self.prefix)
        
        def watch_loop():
            for event in events_iterator:
                key = event.key.decode().replace(self.prefix + '/', '')
                
                if event.type == etcd3.EventType.PUT:
                    value = json.loads(event.value.decode())
                    self.config[key] = value
                    self._notify_change(key, value)
                    
                elif event.type == etcd3.EventType.DELETE:
                    self.config.pop(key, None)
                    self._notify_change(key, None)
        
        threading.Thread(target=watch_loop, daemon=True).start()
    
    def get(self, key: str, default=None):
        """Get configuration value."""
        return self.config.get(key, default)
    
    def set(self, key: str, value):
        """Set configuration value."""
        self.etcd.put(
            f"{self.prefix}/{key}", 
            json.dumps(value)
        )
    
    def on_change(self, key: str, callback):
        """Register callback for configuration changes."""
        self.callbacks[key] = callback
    
    def _notify_change(self, key: str, value):
        if key in self.callbacks:
            self.callbacks[key](value)


# Usage
config = ConfigManager(etcd_client)

# Get feature flag
if config.get('features/new_checkout', {}).get('enabled'):
    use_new_checkout()

# React to changes
def on_rate_limit_change(new_value):
    rate_limiter.set_limit(new_value)

config.on_change('limits/rate_limit_qps', on_rate_limit_change)

Advanced Design Scenarios

Scenario 1: Highly Available Control Plane with etcd

You are building a control plane (like Kubernetes API + controllers) that must never have two leaders and must tolerate node and zone failures. Design:
  • etcd cluster:
    • 3 or 5 nodes spread across failure domains (AZs).
    • Use Raft (already built into etcd) for linearizable writes.
  • Control-plane components (e.g., schedulers, controllers):
    • Each instance registers its lease and identity under a prefix, e.g., /controllers/scheduler/instances/<id>.
    • Leader election implemented using compare-and-swap + leases:
# Pseudocode using etcd for leader election with leases
lease = etcd.lease(ttl=15)

# Try to become leader
success, _ = etcd.transaction(
    compare=[
        etcd.transactions.version("/leaders/scheduler") == 0
    ],
    success=[
        etcd.transactions.put("/leaders/scheduler", node_id, lease=lease)
    ],
    failure=[],
)

if success:
    # We are leader, periodically keep lease alive
    for _ in lease.keepalive():
        run_leader_loop()
else:
    # Watch for leader changes
    events, cancel = etcd.watch("/leaders/scheduler")
    # When key deleted or lease expires, retry election
Failure modes:
  • If the leader crashes or loses connectivity, its lease expires → key deleted → another instance wins via CAS.
  • Network partition:
    • Only the majority side of etcd can make progress (Raft quorum).
    • Minority side cannot renew its lease; it loses leadership and should demote itself.
This scenario uses:
  • etcd’s leases, transactions, and watch APIs.
  • Quorum-based leader election to avoid split-brain in coordination-critical components.

Scenario 2: Distributed Lock with Fencing for a Financial System

You need a lock to guard access to a payment ledger so that no two writers interleave operations on the same account, even in the presence of pauses and network glitches. Design:
  • Use Zookeeper or etcd to implement a FIFO lock with monotonic fencing tokens.
  • On acquiring the lock, the client gets a token that increases over time.
  • Downstream resources (e.g., database, Kafka consumer) validate tokens and reject stale ones.
Lock acquisition:
  1. Create ephemeral sequential node under /locks/account-<id>/lock_...
  2. If smallest → lock acquired.
  3. Fencing token = sequence number (monotonic per account).

Write path with fencing:
  - Client includes fencing_token in every write:
      UPDATE ledger
      SET balance = balance + :delta
      WHERE account_id = :id
        AND last_token <= :fencing_token;

  - If 0 rows updated → some newer holder has already written (we're stale).
Why this is safe:
  • Even if a GC pause or network glitch causes the old lock holder to “wake up” after losing the lock, its token is smaller, so its write fails.
  • The lock service enforces mutual exclusion at the coordination layer; fencing tokens extend that guarantee all the way to the storage layer.

Scenario 3: Multi-Region Service Discovery and Configuration

You operate services in multiple regions with independent Consul/etcd clusters, but want a consistent way to discover services and roll out configuration. Design:
  • Per-region registry:
    • Each region runs its own Consul or etcd cluster for low-latency discovery.
    • Services register only in their local registry under paths like /services/api and /config/....
  • Global view:
    • A small control-plane service aggregates per-region registry data into a read-only global catalog (for dashboards, tooling).
  • Traffic routing:
    • Clients prefer local region endpoints discovered from the local registry.
    • For failover, they can fall back to remote region instances using a secondary discovery path.
Config rollout pattern:
  • Store global config under a versioned key, e.g., /config/feature-flags/v42.
  • Each service:
    • Watches /config/feature-flags/current (a pointer to the active version).
    • On change, fetches the new version document and swaps it in-memory.
# Pointer-based config rollout
CONFIG_PREFIX = "/config/feature-flags"

async def config_watcher():
    events, cancel = etcd.watch(f"{CONFIG_PREFIX}/current")
    async for event in events:
        version = event.value.decode()  # e.g., "v42"
        raw, _ = etcd.get(f"{CONFIG_PREFIX}/{version}")
        LOCAL_FLAGS.update(json.loads(raw))
Benefits:
  • Can prepare new configs at /v43 in all regions, validate them, and then atomically switch currentv43.
  • During regional partitions, each region keeps using its last known current version; when connectivity restores, they converge.
This scenario composes:
  • Service discovery patterns, dynamic configuration, and watch-based updates into a multi-region, fault-tolerant design.

Interview Practice

Question: Design a distributed lock service for a payment system.Requirements:
  • Mutual exclusion (only one holder)
  • Deadlock prevention (timeouts)
  • Fault tolerance (handle crashes)
  • Fairness (FIFO ordering)
Design:
Architecture:
- Use Zookeeper/etcd for coordination
- Ephemeral nodes for automatic cleanup
- Sequential nodes for FIFO ordering
- Fencing tokens for safety

Lock acquisition:
1. Create ephemeral sequential node
2. Check if smallest → acquired
3. Otherwise, watch predecessor
4. Return fencing token = node sequence number

Lock release:
1. Delete ephemeral node
2. Next waiter gets notified

Safety (fencing tokens):
- Lock returns monotonic token
- Payment service checks: token >= last_seen_token
- Rejects stale requests from old lock holders
Question: How do you prevent split-brain during leader election?Answer:The Problem:
Network partition:
[Node A, Node B] | [Node C]

Without quorum:
- A thinks: "I'm leader of partition 1"
- C thinks: "I'm leader of partition 2"
- Two leaders! Data diverges!
Solutions:
  1. Quorum requirement
    • Need majority (N/2 + 1) to elect leader
    • 3 nodes → need 2
    • Partition with minority cannot elect leader
  2. Fencing
    • Each leader gets epoch number
    • Resources only accept from current epoch
    • Old leader’s requests rejected
  3. Lease-based leadership
    • Leader holds lease with TTL
    • Must renew before expiry
    • On partition, lease expires → no leader in minority
Question: Design service discovery for a microservices platform.Components:
1. Registry (Consul/etcd)
   - Stores service → instances mapping
   - Health check status
   - Metadata (version, region, etc.)

2. Service Registration
   - Self-registration on startup
   - Heartbeat/TTL renewal
   - Graceful deregistration on shutdown

3. Discovery
   - DNS-based (simple, cacheable)
   - API-based (more features)
   - Client-side caching with watch

4. Load Balancing
   - Client-side (Ribbon, gRPC)
   - Server-side (Envoy, HAProxy)

5. Health Checks
   - Active (ping from registry)
   - Passive (TTL refresh)
   - Multiple levels (liveness, readiness)
Failure Handling:
  • Service crashes → ephemeral entry deleted
  • Registry partitioned → use cached instances
  • All instances down → return error, trigger alerts

Key Takeaways

Use Proven Coordination Services

Zookeeper, etcd, Consul are battle-tested. Don’t reinvent distributed consensus.

Ephemeral Nodes are Powerful

Automatic cleanup on session end enables leader election, locks, and service discovery.

Fencing Tokens Prevent Stale Operations

Always use monotonic tokens with locks to prevent split-brain safety violations.

Watch, Don't Poll

Coordination services provide watch/subscribe for efficient change detection.

Next Steps