Distributed Coordination

Coordination services are the backbone of distributed systems, providing primitives for leader election, configuration management, service discovery, and distributed locking.
Module Duration: 12-16 hours
Key Topics: Zookeeper, etcd, Consul, Service Discovery, Leader Election, Distributed Locks
Interview Focus: Zookeeper recipes, leader election algorithms, lock safety

Why Coordination is Hard

┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE COORDINATION PROBLEM                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SCENARIO: Three database replicas need to elect a leader                  │
│                                                                              │
│  Without coordination:                                                      │
│  ─────────────────────                                                      │
│  Node A: "I'll be leader!" (doesn't hear from others)                      │
│  Node B: "I'll be leader!" (doesn't hear from others)                      │
│  Node C: "I'll be leader!" (doesn't hear from others)                      │
│                                                                              │
│  Result: SPLIT BRAIN! Three leaders, data diverges                         │
│                                                                              │
│  FUNDAMENTAL CHALLENGES:                                                    │
│  ────────────────────────                                                   │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ 1. PARTIAL FAILURE                                                  │   │
│  │    Some nodes crash, network partitions, but others continue        │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │ 2. ASYNCHRONY                                                       │   │
│  │    Messages delayed arbitrarily, can't distinguish slow from dead   │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │ 3. ORDERING                                                         │   │
│  │    No global clock, messages arrive out of order                    │   │
│  ├─────────────────────────────────────────────────────────────────────┤   │
│  │ 4. BYZANTINE BEHAVIOR                                               │   │
│  │    Nodes may behave incorrectly (usually ignored in coordination)   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  SOLUTION: Use a coordination service that provides strong guarantees      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Zookeeper

Industry Standard: Zookeeper is used by Kafka, HBase, Solr, and many other distributed systems. Understanding it is essential.

Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ZOOKEEPER ARCHITECTURE                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ENSEMBLE (typically 3 or 5 nodes):                                        │
│  ──────────────────────────────────                                         │
│                                                                              │
│            ┌─────────────────────────────────────────────────┐             │
│            │                  CLIENTS                         │             │
│            └───────────┬──────────────────────┬──────────────┘             │
│                        │                      │                             │
│           ┌────────────┼──────────────────────┼────────────┐               │
│           ▼            ▼                      ▼            ▼               │
│      ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐      │
│      │   ZK1   │  │   ZK2   │  │   ZK3   │  │   ZK4   │  │   ZK5   │      │
│      │(Leader) │  │(Follower│  │(Follower│  │(Follower│  │(Follower│      │
│      │         │◄─┤         │◄─┤         │◄─┤         │◄─┤         │      │
│      └─────────┘  └─────────┘  └─────────┘  └─────────┘  └─────────┘      │
│           │                                                                  │
│           │ Writes go through leader                                        │
│           │ Reads can go to any node                                        │
│           │                                                                  │
│           └─────────────────────────────────────────────────────────        │
│                                                                              │
│  DATA MODEL: Hierarchical namespace (like filesystem)                      │
│  ───────────────────────────────────────────────────                        │
│                                                                              │
│                         /                                                   │
│                    ┌────┴────┐                                              │
│                   /app      /services                                       │
│              ┌────┴────┐     ┌──┴──┐                                       │
│          /config  /locks  /api  /db                                        │
│             │        │      │     │                                        │
│         /leader  /lock_1  /node1 /leader                                   │
│                  /lock_2  /node2                                           │
│                           /node3                                           │
│                                                                              │
│  NODE TYPES:                                                                │
│  ───────────                                                                │
│  Persistent:  Survives client disconnect                                   │
│  Ephemeral:   Deleted when client session ends                             │
│  Sequential:  Appends monotonic number (lock_0001, lock_0002)              │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
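
As a concrete illustration of the three node types, here is a minimal kazoo sketch (the host and paths are illustrative; kazoo is the same client library used in the leader-election example later in this module):

from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')
zk.start()

# Persistent: survives client disconnects until explicitly deleted
zk.create('/app/config', b'{}', makepath=True)

# Ephemeral: removed automatically when this client's session ends
zk.create('/services/api/node1', b'10.0.1.1:8080', ephemeral=True, makepath=True)

# Ephemeral + Sequential: the server appends a monotonically increasing suffix
path = zk.create('/app/locks/lock_', b'', ephemeral=True, sequence=True, makepath=True)
print(path)  # e.g. /app/locks/lock_0000000001

zk.stop()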

Zookeeper Guarantees

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ZOOKEEPER GUARANTEES                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. SEQUENTIAL CONSISTENCY                                                  │
│     ─────────────────────────                                               │
│     Updates from a client are applied in order sent                        │
│                                                                              │
│  2. ATOMICITY                                                               │
│     ──────────                                                              │
│     Updates either succeed completely or fail completely                   │
│                                                                              │
│  3. SINGLE SYSTEM IMAGE                                                     │
│     ───────────────────────                                                 │
│     Client sees same view regardless of which server it connects to        │
│                                                                              │
│  4. RELIABILITY                                                             │
│     ───────────                                                             │
│     Once update is applied, it persists until overwritten                  │
│                                                                              │
│  5. TIMELINESS                                                              │
│     ───────────                                                             │
│     Client view is guaranteed current within bounded time                  │
│                                                                              │
│  CONSISTENCY MODEL:                                                         │
│  ──────────────────                                                         │
│  Writes: Linearizable (go through leader, use ZAB)                         │
│  Reads:  Sequential consistency (may read stale, but ordered)              │
│                                                                              │
│  For linearizable reads: Use sync() before read                            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
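
The sync-before-read point above looks like this in a minimal kazoo sketch (host and znode path are illustrative):

from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')
zk.start()

# A plain get() may return slightly stale data from a lagging follower.
# sync() asks the connected server to catch up with the leader first.
zk.sync('/app/config/leader')
data, stat = zk.get('/app/config/leader')
print(data, stat.version)

zk.stop()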

ZAB Protocol (Zookeeper Atomic Broadcast)

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ZAB PROTOCOL                                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ZAB ensures all replicas agree on order of updates.                       │
│                                                                              │
│  PHASES:                                                                    │
│  ───────                                                                    │
│                                                                              │
│  1. LEADER ELECTION                                                         │
│     Nodes elect a leader (highest epoch + zxid wins)                       │
│                                                                              │
│  2. DISCOVERY                                                               │
│     Leader gathers state from followers                                    │
│     Determines most up-to-date log                                         │
│                                                                              │
│  3. SYNCHRONIZATION                                                         │
│     Leader syncs followers to consistent state                             │
│     Replays missing transactions                                           │
│                                                                              │
│  4. BROADCAST (steady state)                                               │
│     Leader receives writes, assigns zxid                                   │
│     Broadcasts to followers, waits for majority ACK                        │
│     Commits and notifies client                                            │
│                                                                              │
│  ZXID (Transaction ID):                                                    │
│  ──────────────────────                                                     │
│  64-bit: [32-bit epoch][32-bit counter]                                    │
│  Epoch: Increments with each new leader                                    │
│  Counter: Increments with each transaction                                 │
│                                                                              │
│  Example: 0x00000002_00000015                                              │
│           epoch=2, transaction=21                                          │
│                                                                              │
│  WRITE FLOW:                                                                │
│  ───────────                                                                │
│  Client ─Write─► Leader                                                    │
│  Leader: Assign zxid, persist to log                                       │
│  Leader ─Proposal─► Followers                                              │
│  Followers: Persist to log, send ACK                                       │
│  Leader: Receive majority ACKs                                             │
│  Leader ─Commit─► Followers                                                │
│  Leader ─Response─► Client                                                 │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
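
Since a zxid is just a 64-bit integer with the epoch in the high 32 bits and the counter in the low 32 bits, splitting one apart is simple; a small sketch matching the example above:

def decode_zxid(zxid: int) -> tuple:
    """Split a 64-bit zxid into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF

print(decode_zxid(0x00000002_00000015))  # (2, 21)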

etcd

Modern Coordination Service

┌─────────────────────────────────────────────────────────────────────────────┐
│                    etcd ARCHITECTURE                                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  etcd is the coordination backbone of Kubernetes.                          │
│                                                                              │
│  KEY DIFFERENCES FROM ZOOKEEPER:                                            │
│  ───────────────────────────────                                            │
│  ┌─────────────────────┬─────────────────────┬─────────────────────┐       │
│  │ Feature             │ Zookeeper           │ etcd                │       │
│  ├─────────────────────┼─────────────────────┼─────────────────────┤       │
│  │ Consensus           │ ZAB                 │ Raft                │       │
│  │ Data Model          │ Hierarchical        │ Flat key-value      │       │
│  │ API                 │ Binary (Jute)       │ gRPC + JSON         │       │
│  │ Watch               │ One-time triggers   │ Streaming           │       │
│  │ Transactions        │ Multi-op            │ Compare-and-swap    │       │
│  │ Language            │ Java                │ Go                  │       │
│  └─────────────────────┴─────────────────────┴─────────────────────┘       │
│                                                                              │
│  RAFT CONSENSUS:                                                            │
│  ───────────────                                                            │
│  Leader Election:                                                          │
│  - Nodes start as followers                                                │
│  - If no heartbeat, become candidate                                       │
│  - Request votes from peers                                                │
│  - Majority votes → become leader                                          │
│                                                                              │
│  Log Replication:                                                          │
│  - Leader appends entries to log                                           │
│  - Sends AppendEntries to followers                                        │
│  - Committed when majority have it                                         │
│                                                                              │
│  KEY FEATURES:                                                              │
│  ─────────────                                                              │
│  • MVCC (Multi-Version Concurrency Control)                               │
│  • Watch with revision numbers                                             │
│  • Lease-based TTL                                                         │
│  • Transactions (If-Then-Else)                                            │
│  • Linearizable reads                                                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

etcd API Examples

import time

import etcd3

# Connect to etcd
etcd = etcd3.client(host='localhost', port=2379)

# Basic key-value operations
etcd.put('/config/database/host', 'db.example.com')
value, metadata = etcd.get('/config/database/host')
print(value.decode())  # 'db.example.com'

# Compare-and-swap (atomic update)
etcd.transaction(
    compare=[
        etcd.transactions.version('/config/database/host') > 0
    ],
    success=[
        etcd.transactions.put('/config/database/host', 'newdb.example.com')
    ],
    failure=[
        etcd.transactions.put('/config/database/host', 'fallback.example.com')
    ]
)

# Leases (for ephemeral keys)
lease = etcd.lease(ttl=30)  # 30 second TTL
etcd.put('/services/api/node1', '192.168.1.1:8080', lease=lease)

# Keep the lease alive by refreshing it before the TTL expires
while True:
    lease.refresh()
    time.sleep(10)

# Watch for changes
events_iterator, cancel = etcd.watch_prefix('/services/')
for event in events_iterator:
    print(f"Key: {event.key}, Value: {event.value}, Type: {event.type}")


# Distributed lock
lock = etcd.lock('my-resource', ttl=30)  # python-etcd3 prefixes lock keys with /locks/
with lock:
    # Critical section
    print("Acquired lock, doing work...")

Leader Election Patterns

Zookeeper Recipe

┌─────────────────────────────────────────────────────────────────────────────┐
│                    LEADER ELECTION WITH ZOOKEEPER                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ALGORITHM:                                                                 │
│  ──────────                                                                 │
│  1. All candidates create EPHEMERAL SEQUENTIAL node under /election        │
│  2. Get all children of /election                                          │
│  3. If your node is the smallest → YOU ARE LEADER                          │
│  4. Otherwise → watch the node just before yours                           │
│  5. When watched node deleted → repeat from step 2                         │
│                                                                              │
│  EXAMPLE:                                                                   │
│  ─────────                                                                  │
│  /election                                                                  │
│     ├── node_0000000001 (created by Node A) ← LEADER                       │
│     ├── node_0000000002 (created by Node B) watches 0001                   │
│     └── node_0000000003 (created by Node C) watches 0002                   │
│                                                                              │
│  If Node A crashes:                                                         │
│  - node_0000000001 deleted (ephemeral)                                     │
│  - Node B notified (was watching 0001)                                     │
│  - Node B checks: am I smallest? Yes → Node B is new LEADER                │
│                                                                              │
│  WHY WATCH ONLY PREDECESSOR?                                                │
│  ────────────────────────────                                               │
│  Watching /election directly → HERD EFFECT                                 │
│  All nodes notified on any change → thundering herd                        │
│  Watching predecessor → O(1) notifications, graceful succession            │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation

from kazoo.client import KazooClient
from kazoo.recipe.election import Election

class LeaderElection:
    """
    Leader election using Zookeeper.
    """
    
    def __init__(self, zk_hosts: str, election_path: str, node_id: str):
        self.zk = KazooClient(hosts=zk_hosts)
        self.election_path = election_path
        self.node_id = node_id
        self.is_leader = False
        self.my_node = None
    
    def start(self):
        self.zk.start()
        self.zk.ensure_path(self.election_path)
        self._participate()
    
    def _participate(self):
        # Create ephemeral sequential node
        path = f"{self.election_path}/node_"
        self.my_node = self.zk.create(
            path, 
            value=self.node_id.encode(),
            ephemeral=True, 
            sequence=True
        )
        self._check_leadership()
    
    def _check_leadership(self):
        children = self.zk.get_children(self.election_path)
        children.sort()
        
        my_seq = self.my_node.split('_')[-1]
        
        if children[0].endswith(my_seq):
            # I'm the leader!
            self.is_leader = True
            self._on_elected_leader()
        else:
            # Watch predecessor
            my_index = next(
                i for i, c in enumerate(children) 
                if c.endswith(my_seq)
            )
            predecessor = children[my_index - 1]
            predecessor_path = f"{self.election_path}/{predecessor}"
            
            @self.zk.DataWatch(predecessor_path)
            def watch_predecessor(data, stat):
                if stat is None:  # Node deleted
                    self._check_leadership()
                return False  # One-time watch
    
    def _on_elected_leader(self):
        print(f"Node {self.node_id} is now the LEADER!")
        # Do leader work...
    
    def stop(self):
        self.zk.stop()


# Using Kazoo's built-in recipe
from kazoo.recipe.election import Election

zk = KazooClient(hosts='localhost:2181')
zk.start()

election = Election(zk, "/election", "my-node-1")

def leader_function():
    print("I am the leader now")  # do leader work here

# Blocks until we're elected, then runs leader_function;
# returning from it relinquishes leadership
election.run(leader_function)

# To withdraw from a pending election (e.g. from another thread): election.cancel()

Distributed Locks

Lock Recipes

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED LOCK PATTERNS                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ZOOKEEPER LOCK RECIPE:                                                     │
│  ──────────────────────                                                     │
│  Similar to leader election!                                               │
│                                                                              │
│  1. Create ephemeral sequential under /locks/resource                      │
│  2. If smallest → acquired lock                                            │
│  3. Otherwise → watch predecessor                                          │
│  4. On predecessor delete → check if smallest                              │
│  5. On unlock → delete your node                                           │
│                                                                              │
│  READ-WRITE LOCKS:                                                          │
│  ─────────────────                                                          │
│  Write: /locks/resource/write_0000001                                      │
│  Read:  /locks/resource/read_0000002                                       │
│                                                                              │
│  Write lock: Wait for ALL predecessors                                     │
│  Read lock: Wait for only WRITE predecessors                               │
│                                                                              │
│  FENCING TOKENS:                                                            │
│  ───────────────                                                            │
│  Problem: Lock holder pauses (GC), lock times out, another gets lock       │
│           Original holder resumes, thinks it still has lock!               │
│                                                                              │
│  Solution: Fencing tokens                                                  │
│  - Lock returns monotonically increasing token                             │
│  - Resource rejects requests with old tokens                               │
│                                                                              │
│     Holder A: token=33 ─────────────────────────────────────► resource     │
│     Holder B: token=34 ─────────► resource (rejects token=33 from A)      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
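
A minimal sketch of the resource side of fencing, assuming the lock service hands out monotonically increasing tokens (the class and method names here are illustrative):

import threading

class FencedResource:
    """Accepts a write only if its fencing token is not older than the newest seen."""

    def __init__(self):
        self._guard = threading.Lock()
        self._highest_token = -1

    def write(self, token: int, payload: str) -> bool:
        with self._guard:
            if token < self._highest_token:
                return False  # stale holder, e.g. resumed after a long GC pause
            self._highest_token = token
            # ... apply the write using `payload` ...
            return True

resource = FencedResource()
assert resource.write(33, "from holder A")       # accepted
assert resource.write(34, "from holder B")       # accepted, raises the bar
assert not resource.write(33, "late from A")     # rejected: token 33 is stale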

Redis Distributed Locks (Redlock)

import redis
import time
import uuid

class RedisLock:
    """
    Simple Redis distributed lock.
    
    WARNING: Has known issues! See Redlock debate.
    Consider Zookeeper/etcd for critical applications.
    """
    
    def __init__(self, redis_client, resource: str, ttl: int = 10):
        self.redis = redis_client
        self.resource = f"lock:{resource}"
        self.ttl = ttl
        self.token = None
    
    def acquire(self, timeout: float = 10) -> bool:
        """Try to acquire lock with timeout."""
        self.token = str(uuid.uuid4())
        end_time = time.time() + timeout
        
        while time.time() < end_time:
            # SET NX (only if not exists) with expiry
            if self.redis.set(
                self.resource, 
                self.token, 
                nx=True, 
                ex=self.ttl
            ):
                return True
            time.sleep(0.1)
        
        return False
    
    def release(self):
        """Release lock only if we own it."""
        # Lua script for atomic check-and-delete
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """
        self.redis.eval(script, 1, self.resource, self.token)
    
    def extend(self, additional_time: int):
        """Extend lock TTL if we own it."""
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("EXPIRE", KEYS[1], ARGV[2])
        else
            return 0
        end
        """
        self.redis.eval(
            script, 1, self.resource, 
            self.token, self.ttl + additional_time
        )


class Redlock:
    """
    Redlock algorithm for distributed locking across Redis instances.
    
    Requires N independent Redis instances (typically 5).
    Lock acquired if majority (N/2 + 1) grant it.
    """
    
    def __init__(self, redis_instances: list):
        self.instances = redis_instances
        self.quorum = len(redis_instances) // 2 + 1
    
    def acquire(self, resource: str, ttl: int) -> tuple:
        """
        Try to acquire the lock on a majority of instances.
        
        ttl is in milliseconds (used with px=). Returns a (success, token) tuple.
        """
        token = str(uuid.uuid4())
        start_time = time.time()
        
        acquired = 0
        for instance in self.instances:
            try:
                if instance.set(
                    f"lock:{resource}", token, 
                    nx=True, px=ttl
                ):
                    acquired += 1
            except redis.RedisError:
                pass  # Instance unreachable
        
        # Check if we acquired quorum
        elapsed = (time.time() - start_time) * 1000  # ms
        validity_time = ttl - elapsed - 2  # simplified clock-drift allowance (ms)
        
        if acquired >= self.quorum and validity_time > 0:
            return True, token
        
        # Failed - release any locks acquired
        self._release_all(resource, token)
        return False, None
    
    def _release_all(self, resource: str, token: str):
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        end
        """
        for instance in self.instances:
            try:
                instance.eval(script, 1, f"lock:{resource}", token)
            except redis.RedisError:
                pass

Redlock Controversy: Martin Kleppmann’s analysis shows Redlock can fail under certain conditions (GC pauses, clock drift). For safety-critical applications, prefer coordination services with stronger guarantees (Zookeeper, etcd).

Service Discovery

Patterns

┌─────────────────────────────────────────────────────────────────────────────┐
│                    SERVICE DISCOVERY PATTERNS                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. CLIENT-SIDE DISCOVERY                                                   │
│     ─────────────────────────                                               │
│                                                                              │
│     ┌─────────────┐      ┌──────────────┐      ┌─────────────┐             │
│     │   Client    │─────►│   Registry   │      │  Service A  │             │
│     │             │◄─────│ (Consul/etcd)│      │ 10.0.1.1    │             │
│     └──────┬──────┘      └──────────────┘      └─────────────┘             │
│            │                                    ┌─────────────┐             │
│            └───────────────────────────────────►│  Service A  │             │
│              Client knows all instances         │ 10.0.1.2    │             │
│              Does load balancing locally        └─────────────┘             │
│                                                                              │
│     Pro: No single point of failure                                        │
│     Con: Client complexity, language coupling                              │
│                                                                              │
│  2. SERVER-SIDE DISCOVERY                                                   │
│     ─────────────────────────                                               │
│                                                                              │
│     ┌─────────────┐      ┌──────────────┐      ┌─────────────┐             │
│     │   Client    │─────►│    Load      │─────►│  Service A  │             │
│     │             │      │   Balancer   │      │ 10.0.1.1    │             │
│     └─────────────┘      └───────┬──────┘      └─────────────┘             │
│                                  │              ┌─────────────┐             │
│                                  │              │  Service A  │             │
│                                  │              │ 10.0.1.2    │             │
│                                  │              └─────────────┘             │
│                                  │                                          │
│                                  └──────► Registry                          │
│                                                                              │
│     Pro: Simple client, centralized                                        │
│     Con: Load balancer is SPOF, latency                                    │
│                                                                              │
│  3. DNS-BASED DISCOVERY                                                     │
│     ───────────────────                                                     │
│                                                                              │
│     Client ─► DNS ─► api.internal → [10.0.1.1, 10.0.1.2, 10.0.1.3]        │
│                                                                              │
│     Pro: Standard, simple                                                  │
│     Con: TTL caching delays updates, no health checks                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
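
For the DNS-based pattern above, discovery is just a standard lookup; a small sketch (api.internal and the port are placeholders for whatever your internal DNS serves):

import socket

def resolve_instances(hostname: str, port: int) -> list:
    """Return the distinct addresses DNS currently advertises for a service."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

# e.g. ['10.0.1.1', '10.0.1.2', '10.0.1.3'] once api.internal resolves
print(resolve_instances('api.internal', 8080))

Whatever the resolver has cached is what you get, so the TTL-caching downside noted in the table applies directly to this call.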

Consul Service Discovery

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONSUL SERVICE DISCOVERY                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ARCHITECTURE:                                                              │
│  ─────────────                                                              │
│                                                                              │
│  ┌────────────────────────────────────────────────────────────────────┐    │
│  │                         CONSUL CLUSTER                             │    │
│  │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐                  │    │
│  │  │   Server    │ │   Server    │ │   Server    │                  │    │
│  │  │  (Leader)   │ │  (Follower) │ │  (Follower) │                  │    │
│  │  └─────────────┘ └─────────────┘ └─────────────┘                  │    │
│  └────────────────────────────────────────────────────────────────────┘    │
│              ▲                                                              │
│              │ Raft consensus                                               │
│              │                                                              │
│  ┌───────────┴────────────┬─────────────────────┬────────────────────┐     │
│  │                        │                     │                    │     │
│  ▼                        ▼                     ▼                    ▼     │
│  ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌──────────┐ │
│  │ Node 1          │ │ Node 2          │ │ Node 3          │ │ Node 4   │ │
│  │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │          │ │
│  │ │Consul Agent │ │ │ │Consul Agent │ │ │ │Consul Agent │ │ │  Agent   │ │
│  │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │          │ │
│  │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │          │ │
│  │ │   API       │ │ │ │   Web       │ │ │ │   Worker    │ │ │          │ │
│  │ │  Service    │ │ │ │  Service    │ │ │ │  Service    │ │ │          │ │
│  │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │          │ │
│  └─────────────────┘ └─────────────────┘ └─────────────────┘ └──────────┘ │
│                                                                              │
│  FEATURES:                                                                  │
│  ─────────                                                                  │
│  • Service registration (agent-based or API)                               │
│  • Health checks (HTTP, TCP, script, TTL)                                  │
│  • DNS interface (service.consul)                                          │
│  • HTTP API for discovery                                                  │
│  • Key-Value store                                                         │
│  • Multi-datacenter support                                                │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Implementation Example

import consul
import socket
import threading
import time

class ServiceRegistry:
    """
    Consul-based service registration and discovery.
    """
    
    def __init__(self, consul_host: str = 'localhost', consul_port: int = 8500):
        self.consul = consul.Consul(host=consul_host, port=consul_port)
    
    def register_service(
        self, 
        name: str, 
        port: int, 
        tags: list = None,
        health_check_interval: str = "10s"
    ) -> str:
        """Register a service with health check."""
        service_id = f"{name}-{socket.gethostname()}-{port}"
        
        self.consul.agent.service.register(
            name=name,
            service_id=service_id,
            port=port,
            tags=tags or [],
            check=consul.Check.http(
                url=f"http://localhost:{port}/health",
                interval=health_check_interval,
                timeout="5s"
            )
        )
        
        return service_id
    
    def deregister_service(self, service_id: str):
        """Deregister a service."""
        self.consul.agent.service.deregister(service_id)
    
    def discover_service(self, name: str, only_healthy: bool = True) -> list:
        """Discover all instances of a service."""
        _, services = self.consul.health.service(
            name, 
            passing=only_healthy
        )
        
        instances = []
        for service in services:
            instances.append({
                'id': service['Service']['ID'],
                'address': service['Service']['Address'] or service['Node']['Address'],
                'port': service['Service']['Port'],
                'tags': service['Service']['Tags'],
            })
        
        return instances
    
    def watch_service(self, name: str, callback):
        """Watch for service changes."""
        index = None
        
        def watch_loop():
            nonlocal index
            while True:
                try:
                    index, services = self.consul.health.service(
                        name, 
                        passing=True,
                        index=index,  # Blocking query
                        wait='30s'
                    )
                    callback(services)
                except Exception as e:
                    print(f"Watch error: {e}")
                    time.sleep(5)
        
        thread = threading.Thread(target=watch_loop, daemon=True)
        thread.start()
        return thread


# Client-side load balancing
import random

class LoadBalancer:
    """Simple client-side load balancer with service discovery."""
    
    def __init__(self, registry: ServiceRegistry, service_name: str):
        self.registry = registry
        self.service_name = service_name
        self.instances = []
        
        # Watch for changes
        self.registry.watch_service(service_name, self._update_instances)
        self._refresh_instances()
    
    def _refresh_instances(self):
        self.instances = self.registry.discover_service(self.service_name)
    
    def _update_instances(self, services):
        self.instances = [
            {
                'address': s['Service']['Address'] or s['Node']['Address'],
                'port': s['Service']['Port'],
            }
            for s in services
        ]
    
    def get_instance(self) -> dict:
        """Get an instance using round-robin or random selection."""
        if not self.instances:
            self._refresh_instances()
        
        if not self.instances:
            raise Exception(f"No healthy instances for {self.service_name}")
        
        return random.choice(self.instances)

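Tying the two classes together, a hypothetical usage might look like this (the service name, port, and tag are assumptions):

registry = ServiceRegistry()
service_id = registry.register_service('api', port=8080, tags=['v1'])

lb = LoadBalancer(registry, 'api')
instance = lb.get_instance()
print(f"Routing request to {instance['address']}:{instance['port']}")

# On graceful shutdown
registry.deregister_service(service_id)
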
Configuration Management

Dynamic Configuration

┌─────────────────────────────────────────────────────────────────────────────┐
│                    CONFIGURATION MANAGEMENT                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  PROBLEM: How to update configuration across many services?                │
│                                                                              │
│  STATIC CONFIG (bad):                                                       │
│  - Config in files, requires restart                                       │
│  - Different values in different environments                              │
│  - No audit trail                                                          │
│                                                                              │
│  DYNAMIC CONFIG (good):                                                     │
│  - Config in coordination service                                          │
│  - Watch for changes, apply immediately                                    │
│  - Audit trail in version history                                          │
│                                                                              │
│  EXAMPLE - Feature Flags:                                                   │
│  ─────────────────────────                                                  │
│  /config/features/new_checkout     = {"enabled": true, "percentage": 50}   │
│  /config/features/dark_mode        = {"enabled": false}                    │
│  /config/limits/rate_limit_qps     = 1000                                  │
│  /config/limits/max_connections    = 500                                   │
│                                                                              │
│  WATCH PATTERN:                                                             │
│  ──────────────                                                             │
│  Service starts → read all config                                          │
│  Set watch on /config/...                                                  │
│  On change → update in-memory config                                       │
│  No restart needed!                                                        │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

import json
import threading

from etcd3.events import DeleteEvent, PutEvent


class ConfigManager:
    """
    Dynamic configuration management with etcd.
    """
    
    def __init__(self, etcd_client, prefix: str = '/config'):
        self.etcd = etcd_client
        self.prefix = prefix
        self.config = {}
        self.callbacks = {}
        
        self._load_all()
        self._start_watch()
    
    def _load_all(self):
        """Load all configuration values."""
        for value, metadata in self.etcd.get_prefix(self.prefix):
            key = metadata.key.decode().replace(self.prefix + '/', '')
            self.config[key] = json.loads(value.decode())
    
    def _start_watch(self):
        """Watch for configuration changes."""
        events_iterator, cancel = self.etcd.watch_prefix(self.prefix)
        
        def watch_loop():
            for event in events_iterator:
                key = event.key.decode().replace(self.prefix + '/', '')
                
                if isinstance(event, PutEvent):
                    value = json.loads(event.value.decode())
                    self.config[key] = value
                    self._notify_change(key, value)

                elif isinstance(event, DeleteEvent):
                    self.config.pop(key, None)
                    self._notify_change(key, None)
        
        threading.Thread(target=watch_loop, daemon=True).start()
    
    def get(self, key: str, default=None):
        """Get configuration value."""
        return self.config.get(key, default)
    
    def set(self, key: str, value):
        """Set configuration value."""
        self.etcd.put(
            f"{self.prefix}/{key}", 
            json.dumps(value)
        )
    
    def on_change(self, key: str, callback):
        """Register callback for configuration changes."""
        self.callbacks[key] = callback
    
    def _notify_change(self, key: str, value):
        if key in self.callbacks:
            self.callbacks[key](value)


# Usage
config = ConfigManager(etcd_client)

# Get feature flag
if config.get('features/new_checkout', {}).get('enabled'):
    use_new_checkout()

# React to changes
def on_rate_limit_change(new_value):
    rate_limiter.set_limit(new_value)

config.on_change('limits/rate_limit_qps', on_rate_limit_change)

Interview Practice

Question: Design a distributed lock service for a payment system.

Requirements:
  • Mutual exclusion (only one holder)
  • Deadlock prevention (timeouts)
  • Fault tolerance (handle crashes)
  • Fairness (FIFO ordering)
Design:
Architecture:
- Use Zookeeper/etcd for coordination
- Ephemeral nodes for automatic cleanup
- Sequential nodes for FIFO ordering
- Fencing tokens for safety

Lock acquisition:
1. Create ephemeral sequential node
2. Check if smallest → acquired
3. Otherwise, watch predecessor
4. Return fencing token = node sequence number

Lock release:
1. Delete ephemeral node
2. Next waiter gets notified

Safety (fencing tokens):
- Lock returns monotonic token
- Payment service checks: token >= last_seen_token
- Rejects stale requests from old lock holders
Question: How do you prevent split-brain during leader election?

Answer:

The Problem:
Network partition:
[Node A, Node B] | [Node C]

Without quorum:
- A thinks: "I'm leader of partition 1"
- C thinks: "I'm leader of partition 2"
- Two leaders! Data diverges!
Solutions:
  1. Quorum requirement
    • Need majority (N/2 + 1) to elect leader
    • 3 nodes → need 2
    • Partition with minority cannot elect leader
  2. Fencing
    • Each leader gets epoch number
    • Resources only accept from current epoch
    • Old leader’s requests rejected
  3. Lease-based leadership
    • Leader holds lease with TTL
    • Must renew before expiry
    • On partition, lease expires → no leader in minority (see the sketch below)
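
A minimal sketch of lease-based leadership using the etcd client from earlier in this module (the key name, node id, and TTL are assumptions; the version == 0 compare is the usual "create only if absent" idiom):

import time
import etcd3

etcd = etcd3.client()

def try_become_leader(node_id: str, ttl: int = 10):
    """Claim the leader key only if nobody currently holds it."""
    lease = etcd.lease(ttl)
    acquired, _ = etcd.transaction(
        compare=[etcd.transactions.version('/election/leader') == 0],
        success=[etcd.transactions.put('/election/leader', node_id, lease=lease)],
        failure=[]
    )
    return acquired, lease

acquired, lease = try_become_leader('node-1')
while acquired:
    lease.refresh()   # must renew before the TTL expires
    time.sleep(3)     # a long pause here lets the lease lapse, and leadership is lost
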
Question: Design service discovery for a microservices platform.

Components:
1. Registry (Consul/etcd)
   - Stores service → instances mapping
   - Health check status
   - Metadata (version, region, etc.)

2. Service Registration
   - Self-registration on startup
   - Heartbeat/TTL renewal
   - Graceful deregistration on shutdown

3. Discovery
   - DNS-based (simple, cacheable)
   - API-based (more features)
   - Client-side caching with watch

4. Load Balancing
   - Client-side (Ribbon, gRPC)
   - Server-side (Envoy, HAProxy)

5. Health Checks
   - Active (ping from registry)
   - Passive (TTL refresh)
   - Multiple levels (liveness, readiness)
Failure Handling:
  • Service crashes → ephemeral entry deleted
  • Registry partitioned → use cached instances
  • All instances down → return error, trigger alerts

Key Takeaways

Use Proven Coordination Services

Zookeeper, etcd, Consul are battle-tested. Don’t reinvent distributed consensus.

Ephemeral Nodes are Powerful

Automatic cleanup on session end enables leader election, locks, and service discovery.

Fencing Tokens Prevent Stale Operations

Always use monotonic tokens with locks to prevent split-brain safety violations.

Watch, Don't Poll

Coordination services provide watch/subscribe for efficient change detection.

Next Steps