Distributed Coordination
Coordination services are the backbone of distributed systems, providing primitives for leader election, configuration management, service discovery, and distributed locking.
Module Duration: 12-16 hours
Key Topics: Zookeeper, etcd, Consul, Service Discovery, Leader Election, Distributed Locks
Interview Focus: Zookeeper recipes, leader election algorithms, lock safety
Why Coordination is Hard
┌─────────────────────────────────────────────────────────────────────────────┐
│ THE COORDINATION PROBLEM │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ SCENARIO: Three database replicas need to elect a leader │
│ │
│ Without coordination: │
│ ───────────────────── │
│ Node A: "I'll be leader!" (doesn't hear from others) │
│ Node B: "I'll be leader!" (doesn't hear from others) │
│ Node C: "I'll be leader!" (doesn't hear from others) │
│ │
│ Result: SPLIT BRAIN! Three leaders, data diverges │
│ │
│ FUNDAMENTAL CHALLENGES: │
│ ──────────────────────── │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ 1. PARTIAL FAILURE │ │
│ │ Some nodes crash, network partitions, but others continue │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ 2. ASYNCHRONY │ │
│ │ Messages delayed arbitrarily, can't distinguish slow from dead │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ 3. ORDERING │ │
│ │ No global clock, messages arrive out of order │ │
│ ├─────────────────────────────────────────────────────────────────────┤ │
│ │ 4. BYZANTINE BEHAVIOR │ │
│ │ Nodes may behave incorrectly (usually ignored in coordination) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ SOLUTION: Use a coordination service that provides strong guarantees │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Zookeeper
Industry Standard: Zookeeper is used by Kafka, HBase, Solr, and many other distributed systems. Understanding it is essential.
Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ ZOOKEEPER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ENSEMBLE (typically 3 or 5 nodes): │
│ ────────────────────────────────── │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ CLIENTS │ │
│ └───────────┬──────────────────────┬──────────────┘ │
│ │ │ │
│ ┌────────────┼──────────────────────┼────────────┐ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ ZK1 │ │ ZK2 │ │ ZK3 │ │ ZK4 │ │ ZK5 │ │
│ │(Leader) │ │(Follower│ │(Follower│ │(Follower│ │(Follower│ │
│ │ │◄─┤ │◄─┤ │◄─┤ │◄─┤ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │
│ │ Writes go through leader │
│ │ Reads can go to any node │
│ │ │
│ └───────────────────────────────────────────────────────── │
│ │
│ DATA MODEL: Hierarchical namespace (like filesystem) │
│ ─────────────────────────────────────────────────── │
│ │
│ / │
│ ┌────┴────┐ │
│ /app /services │
│ ┌────┴────┐ ┌──┴──┐ │
│ /config /locks /api /db │
│ │ │ │ │ │
│ /leader /lock_1 /node1 /leader │
│ /lock_2 /node2 │
│ /node3 │
│ │
│ NODE TYPES: │
│ ─────────── │
│ Persistent: Survives client disconnect │
│ Ephemeral: Deleted when client session ends │
│ Sequential: Appends monotonic number (lock_0001, lock_0002) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
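These node types map directly onto coordination primitives. As a quick illustration, here is a minimal kazoo sketch (paths and values are made up for the example) that creates each type:

from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')
zk.start()

# Persistent: survives client disconnects and restarts
zk.ensure_path('/app/config')
zk.create('/app/config/db_host', b'db.example.com')

# Ephemeral: tied to this client's session; deleted automatically on disconnect
zk.create('/services/api/node1', b'10.0.1.1:8080', ephemeral=True, makepath=True)

# Ephemeral + sequential: Zookeeper appends a monotonic counter,
# e.g. /app/locks/lock_0000000001 -- the building block for locks and elections
path = zk.create('/app/locks/lock_', ephemeral=True, sequence=True, makepath=True)
print(path)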
Zookeeper Guarantees
┌─────────────────────────────────────────────────────────────────────────────┐
│ ZOOKEEPER GUARANTEES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. SEQUENTIAL CONSISTENCY │
│ ───────────────────────── │
│ Updates from a client are applied in order sent │
│ │
│ 2. ATOMICITY │
│ ────────── │
│ Updates either succeed completely or fail completely │
│ │
│ 3. SINGLE SYSTEM IMAGE │
│ ─────────────────────── │
│ Client sees same view regardless of which server it connects to │
│ │
│ 4. RELIABILITY │
│ ─────────── │
│ Once update is applied, it persists until overwritten │
│ │
│ 5. TIMELINESS │
│ ─────────── │
│ Client view is guaranteed current within bounded time │
│ │
│ CONSISTENCY MODEL: │
│ ────────────────── │
│ Writes: Linearizable (go through leader, use ZAB) │
│ Reads: Sequential consistency (may read stale, but ordered) │
│ │
│ For linearizable reads: Use sync() before read │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
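Because follower reads may lag the leader, a client that needs a linearizable read calls sync() first. A minimal kazoo sketch (the path is illustrative):

# sync() forces the connected server to catch up with the leader,
# so the read that follows is not stale
zk.sync('/app/config/db_host')
data, stat = zk.get('/app/config/db_host')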
ZAB Protocol (Zookeeper Atomic Broadcast)
┌─────────────────────────────────────────────────────────────────────────────┐
│ ZAB PROTOCOL │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ZAB ensures all replicas agree on order of updates. │
│ │
│ PHASES: │
│ ─────── │
│ │
│ 1. LEADER ELECTION │
│ Nodes elect a leader (highest epoch + zxid wins) │
│ │
│ 2. DISCOVERY │
│ Leader gathers state from followers │
│ Determines most up-to-date log │
│ │
│ 3. SYNCHRONIZATION │
│ Leader syncs followers to consistent state │
│ Replays missing transactions │
│ │
│ 4. BROADCAST (steady state) │
│ Leader receives writes, assigns zxid │
│ Broadcasts to followers, waits for majority ACK │
│ Commits and notifies client │
│ │
│ ZXID (Transaction ID): │
│ ────────────────────── │
│ 64-bit: [32-bit epoch][32-bit counter] │
│ Epoch: Increments with each new leader │
│ Counter: Increments with each transaction │
│ │
│ Example: 0x00000002_00000015 │
│ epoch=2, transaction=21 │
│ │
│ WRITE FLOW: │
│ ─────────── │
│ Client ─Write─► Leader │
│ Leader: Assign zxid, persist to log │
│ Leader ─Proposal─► Followers │
│ Followers: Persist to log, send ACK │
│ Leader: Receive majority ACKs │
│ Leader ─Commit─► Followers │
│ Leader ─Response─► Client │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
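Since a zxid is just a 64-bit integer with the epoch in the high 32 bits, the example above can be checked with two bit operations:

zxid = 0x00000002_00000015

epoch = zxid >> 32             # 2  (incremented on each new leader)
counter = zxid & 0xFFFFFFFF    # 21 (0x15, incremented per transaction)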
etcd
Modern Coordination Service
┌─────────────────────────────────────────────────────────────────────────────┐
│ etcd ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ etcd is the coordination backbone of Kubernetes. │
│ │
│ KEY DIFFERENCES FROM ZOOKEEPER: │
│ ─────────────────────────────── │
│ ┌─────────────────────┬─────────────────────┬─────────────────────┐ │
│ │ Feature │ Zookeeper │ etcd │ │
│ ├─────────────────────┼─────────────────────┼─────────────────────┤ │
│ │ Consensus │ ZAB │ Raft │ │
│ │ Data Model │ Hierarchical │ Flat key-value │ │
│ │ API │ Binary (Jute) │ gRPC + JSON │ │
│ │ Watch │ One-time triggers │ Streaming │ │
│ │ Transactions │ Multi-op │ Compare-and-swap │ │
│ │ Language │ Java │ Go │ │
│ └─────────────────────┴─────────────────────┴─────────────────────┘ │
│ │
│ RAFT CONSENSUS: │
│ ─────────────── │
│ Leader Election: │
│ - Nodes start as followers │
│ - If no heartbeat, become candidate │
│ - Request votes from peers │
│ - Majority votes → become leader │
│ │
│ Log Replication: │
│ - Leader appends entries to log │
│ - Sends AppendEntries to followers │
│ - Committed when majority have it │
│ │
│ KEY FEATURES: │
│ ───────────── │
│ • MVCC (Multi-Version Concurrency Control) │
│ • Watch with revision numbers │
│ • Lease-based TTL │
│ • Transactions (If-Then-Else) │
│ • Linearizable reads │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
etcd API Examples
import time

import etcd3

# Connect to etcd
etcd = etcd3.client(host='localhost', port=2379)

# Basic key-value operations
etcd.put('/config/database/host', 'db.example.com')
value, metadata = etcd.get('/config/database/host')
print(value.decode())  # 'db.example.com'

# Compare-and-swap (atomic update)
etcd.transaction(
    compare=[
        etcd.transactions.version('/config/database/host') > 0
    ],
    success=[
        etcd.transactions.put('/config/database/host', 'newdb.example.com')
    ],
    failure=[
        etcd.transactions.put('/config/database/host', 'fallback.example.com')
    ]
)

# Leases (for ephemeral keys)
lease = etcd.lease(ttl=30)  # 30 second TTL
etcd.put('/services/api/node1', '192.168.1.1:8080', lease=lease)

# Keep the lease alive by refreshing it before the TTL expires
while True:
    lease.refresh()
    time.sleep(10)

# Watch for changes
events_iterator, cancel = etcd.watch_prefix('/services/')
for event in events_iterator:
    print(f"Key: {event.key}, Value: {event.value}, Type: {type(event).__name__}")

# Distributed lock (python-etcd3 stores the lock key under the /locks/ prefix)
lock = etcd.lock('my-resource', ttl=30)
with lock:
    # Critical section
    print("Acquired lock, doing work...")
Leader Election Patterns
Zookeeper Recipe
┌─────────────────────────────────────────────────────────────────────────────┐
│ LEADER ELECTION WITH ZOOKEEPER │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ALGORITHM: │
│ ────────── │
│ 1. All candidates create EPHEMERAL SEQUENTIAL node under /election │
│ 2. Get all children of /election │
│ 3. If your node is the smallest → YOU ARE LEADER │
│ 4. Otherwise → watch the node just before yours │
│ 5. When watched node deleted → repeat from step 2 │
│ │
│ EXAMPLE: │
│ ───────── │
│ /election │
│ ├── node_0000000001 (created by Node A) ← LEADER │
│ ├── node_0000000002 (created by Node B) watches 0001 │
│ └── node_0000000003 (created by Node C) watches 0002 │
│ │
│ If Node A crashes: │
│ - node_0000000001 deleted (ephemeral) │
│ - Node B notified (was watching 0001) │
│ - Node B checks: am I smallest? Yes → Node B is new LEADER │
│ │
│ WHY WATCH ONLY PREDECESSOR? │
│ ──────────────────────────── │
│ Watching /election directly → HERD EFFECT │
│ All nodes notified on any change → thundering herd │
│ Watching predecessor → O(1) notifications, graceful succession │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation
from kazoo.client import KazooClient

class LeaderElection:
    """
    Leader election using Zookeeper ephemeral sequential nodes.
    """
    def __init__(self, zk_hosts: str, election_path: str, node_id: str):
        self.zk = KazooClient(hosts=zk_hosts)
        self.election_path = election_path
        self.node_id = node_id
        self.is_leader = False
        self.my_node = None

    def start(self):
        self.zk.start()
        self.zk.ensure_path(self.election_path)
        self._participate()

    def _participate(self):
        # Create ephemeral sequential node
        path = f"{self.election_path}/node_"
        self.my_node = self.zk.create(
            path,
            value=self.node_id.encode(),
            ephemeral=True,
            sequence=True
        )
        self._check_leadership()

    def _check_leadership(self):
        children = self.zk.get_children(self.election_path)
        children.sort()
        my_seq = self.my_node.split('_')[-1]
        if children[0].endswith(my_seq):
            # I'm the leader!
            self.is_leader = True
            self._on_elected_leader()
        else:
            # Watch predecessor only (avoids the herd effect)
            my_index = next(
                i for i, c in enumerate(children)
                if c.endswith(my_seq)
            )
            predecessor = children[my_index - 1]
            predecessor_path = f"{self.election_path}/{predecessor}"

            @self.zk.DataWatch(predecessor_path)
            def watch_predecessor(data, stat):
                if stat is None:  # Node deleted
                    self._check_leadership()
                return False  # One-time watch

    def _on_elected_leader(self):
        print(f"Node {self.node_id} is now the LEADER!")
        # Do leader work...

    def stop(self):
        self.zk.stop()

# Using Kazoo's built-in recipe
from kazoo.recipe.election import Election

def leader_function():
    print("Running as leader")

zk = KazooClient(hosts='localhost:2181')
zk.start()
election = Election(zk, "/election", "my-node-1")

# This blocks until we win the election, then runs leader_function
election.run(leader_function)

# To step down (e.g. from another thread), cancel the election
election.cancel()
Distributed Locks
Lock Recipes
┌─────────────────────────────────────────────────────────────────────────────┐
│ DISTRIBUTED LOCK PATTERNS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ZOOKEEPER LOCK RECIPE: │
│ ────────────────────── │
│ Similar to leader election! │
│ │
│ 1. Create ephemeral sequential under /locks/resource │
│ 2. If smallest → acquired lock │
│ 3. Otherwise → watch predecessor │
│ 4. On predecessor delete → check if smallest │
│ 5. On unlock → delete your node │
│ │
│ READ-WRITE LOCKS: │
│ ───────────────── │
│ Write: /locks/resource/write_0000001 │
│ Read: /locks/resource/read_0000002 │
│ │
│ Write lock: Wait for ALL predecessors │
│ Read lock: Wait for only WRITE predecessors │
│ │
│ FENCING TOKENS: │
│ ─────────────── │
│ Problem: Lock holder pauses (GC), lock times out, another gets lock │
│ Original holder resumes, thinks it still has lock! │
│ │
│ Solution: Fencing tokens │
│ - Lock returns monotonically increasing token │
│ - Resource rejects requests with old tokens │
│ │
│ Holder A: token=33 ─────────────────────────────────────► resource │
│ Holder B: token=34 ─────────► resource (rejects token=33 from A) │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
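The fencing check has to live in the resource itself; the lock service alone cannot stop a paused client from acting on a lock it no longer holds. A minimal sketch of the resource side (the class and its interface are hypothetical):

class FencedResource:
    """Resource that rejects writes carrying stale fencing tokens."""
    def __init__(self):
        self.last_seen_token = -1

    def write(self, token: int, data: bytes) -> bool:
        # A request from an expired lock holder carries a smaller token
        if token < self.last_seen_token:
            return False  # reject: a newer holder has already written
        self.last_seen_token = token
        # ... apply the write ...
        return True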
Redis Distributed Locks (Redlock)
import time
import uuid

import redis

class RedisLock:
    """
    Simple Redis distributed lock.
    WARNING: Has known issues! See the Redlock debate below.
    Consider Zookeeper/etcd for critical applications.
    """
    def __init__(self, redis_client, resource: str, ttl: int = 10):
        self.redis = redis_client
        self.resource = f"lock:{resource}"
        self.ttl = ttl  # seconds
        self.token = None

    def acquire(self, timeout: float = 10) -> bool:
        """Try to acquire the lock, retrying until timeout."""
        self.token = str(uuid.uuid4())
        end_time = time.time() + timeout
        while time.time() < end_time:
            # SET NX (only if not exists) with expiry
            if self.redis.set(
                self.resource,
                self.token,
                nx=True,
                ex=self.ttl
            ):
                return True
            time.sleep(0.1)
        return False

    def release(self):
        """Release lock only if we own it."""
        # Lua script for atomic check-and-delete
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """
        self.redis.eval(script, 1, self.resource, self.token)

    def extend(self, additional_time: int):
        """Extend lock TTL if we own it."""
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("EXPIRE", KEYS[1], ARGV[2])
        else
            return 0
        end
        """
        self.redis.eval(
            script, 1, self.resource,
            self.token, self.ttl + additional_time
        )

class Redlock:
    """
    Redlock algorithm for distributed locking across Redis instances.
    Requires N independent Redis instances (typically 5).
    Lock acquired if a majority (N/2 + 1) grant it.
    """
    def __init__(self, redis_instances: list):
        self.instances = redis_instances
        self.quorum = len(redis_instances) // 2 + 1

    def acquire(self, resource: str, ttl: int) -> tuple:
        """
        Try to acquire the lock on a majority of instances.
        ttl is in milliseconds. Returns a (success, token) tuple.
        """
        token = str(uuid.uuid4())
        start_time = time.time()
        acquired = 0
        for instance in self.instances:
            try:
                if instance.set(
                    f"lock:{resource}", token,
                    nx=True, px=ttl
                ):
                    acquired += 1
            except redis.RedisError:
                pass  # Instance unreachable

        # Lock is only valid for the TTL minus acquisition time and drift
        elapsed = (time.time() - start_time) * 1000  # ms
        validity_time = ttl - elapsed - 2  # drift allowance
        if acquired >= self.quorum and validity_time > 0:
            return True, token

        # Failed - release any locks we did acquire
        self._release_all(resource, token)
        return False, None

    def _release_all(self, resource: str, token: str):
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        end
        """
        for instance in self.instances:
            try:
                instance.eval(script, 1, f"lock:{resource}", token)
            except redis.RedisError:
                pass
Redlock Controversy: Martin Kleppmann’s analysis shows Redlock can fail under certain conditions (GC pauses, clock drift). For safety-critical applications, prefer coordination services with stronger guarantees (Zookeeper, etcd).
Service Discovery
Patterns
┌─────────────────────────────────────────────────────────────────────────────┐
│ SERVICE DISCOVERY PATTERNS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1. CLIENT-SIDE DISCOVERY │
│ ───────────────────────── │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Client │─────►│ Registry │ │ Service A │ │
│ │ │◄─────│ (Consul/etcd)│ │ 10.0.1.1 │ │
│ └──────┬──────┘ └──────────────┘ └─────────────┘ │
│ │ ┌─────────────┐ │
│ └───────────────────────────────────►│ Service A │ │
│ Client knows all instances │ 10.0.1.2 │ │
│ Does load balancing locally └─────────────┘ │
│ │
│ Pro: No single point of failure │
│ Con: Client complexity, language coupling │
│ │
│ 2. SERVER-SIDE DISCOVERY │
│ ───────────────────────── │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Client │─────►│ Load │─────►│ Service A │ │
│ │ │ │ Balancer │ │ 10.0.1.1 │ │
│ └─────────────┘ └───────┬──────┘ └─────────────┘ │
│ │ ┌─────────────┐ │
│ │ │ Service A │ │
│ │ │ 10.0.1.2 │ │
│ │ └─────────────┘ │
│ │ │
│ └──────► Registry │
│ │
│ Pro: Simple client, centralized │
│ Con: Load balancer is SPOF, latency │
│ │
│ 3. DNS-BASED DISCOVERY │
│ ─────────────────── │
│ │
│ Client ─► DNS ─► api.internal → [10.0.1.1, 10.0.1.2, 10.0.1.3] │
│ │
│ Pro: Standard, simple │
│ Con: TTL caching delays updates, no health checks │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
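As a small illustration of the DNS-based pattern, ordinary resolution already returns multiple A records and the client can pick one (hostname and port are illustrative). Note the cons above: results may be cached until the record's TTL expires, and dead hosts are not filtered out.

import random
import socket

def resolve_instances(hostname: str, port: int) -> list:
    # getaddrinfo returns one entry per (family, address) combination
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})  # dedupe addresses

instances = resolve_instances('api.internal', 8080)
address = random.choice(instances)  # naive client-side load balancing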
Consul Service Discovery
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONSUL SERVICE DISCOVERY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ARCHITECTURE: │
│ ───────────── │
│ │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ CONSUL CLUSTER │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Server │ │ Server │ │ Server │ │ │
│ │ │ (Leader) │ │ (Follower) │ │ (Follower) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └────────────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ Raft consensus │
│ │ │
│ ┌───────────┴────────────┬─────────────────────┬────────────────────┐ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ ┌──────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node 3 │ │ Node 4 │ │
│ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ │ │
│ │ │Consul Agent │ │ │ │Consul Agent │ │ │ │Consul Agent │ │ │ Agent │ │
│ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │ │ │
│ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ ┌─────────────┐ │ │ │ │
│ │ │ API │ │ │ │ Web │ │ │ │ Worker │ │ │ │ │
│ │ │ Service │ │ │ │ Service │ │ │ │ Service │ │ │ │ │
│ │ └─────────────┘ │ │ └─────────────┘ │ │ └─────────────┘ │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ └──────────┘ │
│ │
│ FEATURES: │
│ ───────── │
│ • Service registration (agent-based or API) │
│ • Health checks (HTTP, TCP, script, TTL) │
│ • DNS interface (service.consul) │
│ • HTTP API for discovery │
│ • Key-Value store │
│ • Multi-datacenter support │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Implementation Example
import consul
import socket
import threading
import time

class ServiceRegistry:
    """
    Consul-based service registration and discovery.
    """
    def __init__(self, consul_host: str = 'localhost', consul_port: int = 8500):
        self.consul = consul.Consul(host=consul_host, port=consul_port)

    def register_service(
        self,
        name: str,
        port: int,
        tags: list = None,
        health_check_interval: str = "10s"
    ) -> str:
        """Register a service with an HTTP health check."""
        service_id = f"{name}-{socket.gethostname()}-{port}"
        self.consul.agent.service.register(
            name=name,
            service_id=service_id,
            port=port,
            tags=tags or [],
            check=consul.Check.http(
                url=f"http://localhost:{port}/health",
                interval=health_check_interval,
                timeout="5s"
            )
        )
        return service_id

    def deregister_service(self, service_id: str):
        """Deregister a service."""
        self.consul.agent.service.deregister(service_id)

    def discover_service(self, name: str, only_healthy: bool = True) -> list:
        """Discover all instances of a service."""
        _, services = self.consul.health.service(
            name,
            passing=only_healthy
        )
        instances = []
        for service in services:
            instances.append({
                'id': service['Service']['ID'],
                'address': service['Service']['Address'] or service['Node']['Address'],
                'port': service['Service']['Port'],
                'tags': service['Service']['Tags'],
            })
        return instances

    def watch_service(self, name: str, callback):
        """Watch for service changes using blocking queries."""
        index = None

        def watch_loop():
            nonlocal index
            while True:
                try:
                    index, services = self.consul.health.service(
                        name,
                        passing=True,
                        index=index,  # Blocking query
                        wait='30s'
                    )
                    callback(services)
                except Exception as e:
                    print(f"Watch error: {e}")
                    time.sleep(5)

        thread = threading.Thread(target=watch_loop, daemon=True)
        thread.start()
        return thread

# Client-side load balancing
import random

class LoadBalancer:
    """Simple client-side load balancer with service discovery."""
    def __init__(self, registry: ServiceRegistry, service_name: str):
        self.registry = registry
        self.service_name = service_name
        self.instances = []
        # Watch for changes
        self.registry.watch_service(service_name, self._update_instances)
        self._refresh_instances()

    def _refresh_instances(self):
        self.instances = self.registry.discover_service(self.service_name)

    def _update_instances(self, services):
        self.instances = [
            {
                'address': s['Service']['Address'] or s['Node']['Address'],
                'port': s['Service']['Port'],
            }
            for s in services
        ]

    def get_instance(self) -> dict:
        """Get an instance using random selection."""
        if not self.instances:
            self._refresh_instances()
        if not self.instances:
            raise Exception(f"No healthy instances for {self.service_name}")
        return random.choice(self.instances)
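A short usage sketch tying the two classes together (service name, port, and tags are illustrative):

registry = ServiceRegistry(consul_host='localhost')
service_id = registry.register_service('api', port=8080, tags=['v1'])

balancer = LoadBalancer(registry, 'api')
instance = balancer.get_instance()
print(f"Routing request to {instance['address']}:{instance['port']}")

registry.deregister_service(service_id)  # graceful shutdown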
Configuration Management
Dynamic Configuration
┌─────────────────────────────────────────────────────────────────────────────┐
│ CONFIGURATION MANAGEMENT │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ PROBLEM: How to update configuration across many services? │
│ │
│ STATIC CONFIG (bad): │
│ - Config in files, requires restart │
│ - Different values in different environments │
│ - No audit trail │
│ │
│ DYNAMIC CONFIG (good): │
│ - Config in coordination service │
│ - Watch for changes, apply immediately │
│ - Audit trail in version history │
│ │
│ EXAMPLE - Feature Flags: │
│ ───────────────────────── │
│ /config/features/new_checkout = {"enabled": true, "percentage": 50} │
│ /config/features/dark_mode = {"enabled": false} │
│ /config/limits/rate_limit_qps = 1000 │
│ /config/limits/max_connections = 500 │
│ │
│ WATCH PATTERN: │
│ ────────────── │
│ Service starts → read all config │
│ Set watch on /config/... │
│ On change → update in-memory config │
│ No restart needed! │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
import json
import threading

import etcd3

class ConfigManager:
    """
    Dynamic configuration management with etcd.
    """
    def __init__(self, etcd_client, prefix: str = '/config'):
        self.etcd = etcd_client
        self.prefix = prefix
        self.config = {}
        self.callbacks = {}
        self._load_all()
        self._start_watch()

    def _load_all(self):
        """Load all configuration values."""
        for value, metadata in self.etcd.get_prefix(self.prefix):
            key = metadata.key.decode().replace(self.prefix + '/', '')
            self.config[key] = json.loads(value.decode())

    def _start_watch(self):
        """Watch for configuration changes."""
        events_iterator, cancel = self.etcd.watch_prefix(self.prefix)

        def watch_loop():
            for event in events_iterator:
                key = event.key.decode().replace(self.prefix + '/', '')
                if isinstance(event, etcd3.events.PutEvent):
                    value = json.loads(event.value.decode())
                    self.config[key] = value
                    self._notify_change(key, value)
                elif isinstance(event, etcd3.events.DeleteEvent):
                    self.config.pop(key, None)
                    self._notify_change(key, None)

        threading.Thread(target=watch_loop, daemon=True).start()

    def get(self, key: str, default=None):
        """Get configuration value."""
        return self.config.get(key, default)

    def set(self, key: str, value):
        """Set configuration value."""
        self.etcd.put(
            f"{self.prefix}/{key}",
            json.dumps(value)
        )

    def on_change(self, key: str, callback):
        """Register callback for configuration changes."""
        self.callbacks[key] = callback

    def _notify_change(self, key: str, value):
        if key in self.callbacks:
            self.callbacks[key](value)

# Usage
config = ConfigManager(etcd_client)

# Get feature flag
if config.get('features/new_checkout', {}).get('enabled'):
    use_new_checkout()

# React to changes
def on_rate_limit_change(new_value):
    rate_limiter.set_limit(new_value)

config.on_change('limits/rate_limit_qps', on_rate_limit_change)
Interview Practice
Q1: Design a distributed lock service
Question: Design a distributed lock service for a payment system.
Requirements:
- Mutual exclusion (only one holder)
- Deadlock prevention (timeouts)
- Fault tolerance (handle crashes)
- Fairness (FIFO ordering)
Architecture:
- Use Zookeeper/etcd for coordination
- Ephemeral nodes for automatic cleanup
- Sequential nodes for FIFO ordering
- Fencing tokens for safety
Lock acquisition:
1. Create ephemeral sequential node
2. Check if smallest → acquired
3. Otherwise, watch predecessor
4. Return fencing token = node sequence number
Lock release:
1. Delete ephemeral node
2. Next waiter gets notified
Safety (fencing tokens):
- Lock returns monotonic token
- Payment service checks: token >= last_seen_token
- Rejects stale requests from old lock holders
Q2: Handle split-brain in leader election
Question: How do you prevent split-brain during leader election?
Answer:
The problem:
Network partition:
[Node A, Node B] | [Node C]
Without quorum:
- A thinks: "I'm leader of partition 1"
- C thinks: "I'm leader of partition 2"
- Two leaders! Data diverges!
Solutions:
1. Quorum requirement
- Need majority (N/2 + 1) to elect leader
- 3 nodes → need 2
- Partition with minority cannot elect leader
2. Fencing
- Each leader gets an epoch number
- Resources only accept requests from the current epoch
- Old leader's requests rejected
3. Lease-based leadership
- Leader holds a lease with a TTL
- Must renew before expiry
- On partition, the lease expires → no leader in the minority (see the etcd sketch below)
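A minimal sketch of lease-based leadership with python-etcd3 (key name, node id, and TTL are illustrative): the candidate atomically creates the leader key bound to a lease, then keeps refreshing the lease while it leads; if it is partitioned away, the lease expires and the key disappears.

import time
import etcd3

etcd = etcd3.client(host='localhost', port=2379)
lease = etcd.lease(ttl=10)

# Atomically create the leader key only if nobody holds it
# (version == 0 means the key does not exist yet)
became_leader, _ = etcd.transaction(
    compare=[etcd.transactions.version('/election/leader') == 0],
    success=[etcd.transactions.put('/election/leader', 'node-1', lease=lease)],
    failure=[]
)

while became_leader:
    lease.refresh()   # must renew before the 10s TTL expires
    time.sleep(3)     # ... do leader work between renewals ...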
Q3: Implement service discovery
Question: Design service discovery for a microservices platform.
Components:
1. Registry (Consul/etcd)
- Stores service → instances mapping
- Health check status
- Metadata (version, region, etc.)
2. Service Registration
- Self-registration on startup
- Heartbeat/TTL renewal
- Graceful deregistration on shutdown
3. Discovery
- DNS-based (simple, cacheable)
- API-based (more features)
- Client-side caching with watch
4. Load Balancing
- Client-side (Ribbon, gRPC)
- Server-side (Envoy, HAProxy)
5. Health Checks
- Active (ping from registry)
- Passive (TTL refresh)
- Multiple levels (liveness, readiness)
Failure Handling:
- Service crashes → ephemeral entry deleted
- Registry partitioned → use cached instances
- All instances down → return error, trigger alerts
Key Takeaways
Use Proven Coordination Services
Zookeeper, etcd, Consul are battle-tested. Don’t reinvent distributed consensus.
Ephemeral Nodes are Powerful
Automatic cleanup on session end enables leader election, locks, and service discovery.
Fencing Tokens Prevent Stale Operations
Always use monotonic tokens with locks to prevent split-brain safety violations.
Watch, Don't Poll
Coordination services provide watch/subscribe for efficient change detection.