Coordination services are the backbone of distributed systems, providing primitives for leader election, configuration management, service discovery, and distributed locking.
THE COORDINATION PROBLEM

SCENARIO: Three database replicas need to elect a leader

Without coordination:
  Node A: "I'll be leader!"  (doesn't hear from others)
  Node B: "I'll be leader!"  (doesn't hear from others)
  Node C: "I'll be leader!"  (doesn't hear from others)

  Result: SPLIT BRAIN! Three leaders, data diverges

FUNDAMENTAL CHALLENGES:
  1. PARTIAL FAILURE    - Some nodes crash or get partitioned while others continue
  2. ASYNCHRONY         - Messages are delayed arbitrarily; you can't distinguish slow from dead
  3. ORDERING           - No global clock; messages arrive out of order
  4. BYZANTINE BEHAVIOR - Nodes may behave incorrectly (usually out of scope for coordination services)

SOLUTION: Use a coordination service that provides strong guarantees
ZOOKEEPER GUARANTEES

1. SEQUENTIAL CONSISTENCY
   Updates from a client are applied in the order they were sent

2. ATOMICITY
   Updates either succeed completely or fail completely

3. SINGLE SYSTEM IMAGE
   A client sees the same view regardless of which server it connects to

4. RELIABILITY
   Once an update is applied, it persists until overwritten

5. TIMELINESS
   A client's view is guaranteed to be current within a bounded time

CONSISTENCY MODEL:
  Writes: Linearizable (go through the leader, replicated via ZAB)
  Reads:  Sequentially consistent (may be stale, but ordered)

  For linearizable reads: call sync() before the read
LEADER ELECTION WITH ZOOKEEPER

ALGORITHM:
  1. All candidates create an EPHEMERAL SEQUENTIAL node under /election
  2. Get all children of /election
  3. If your node is the smallest → YOU ARE LEADER
  4. Otherwise → watch the node just before yours
  5. When the watched node is deleted → repeat from step 2

EXAMPLE:
  /election
  ├── node_0000000001  (created by Node A)  ← LEADER
  ├── node_0000000002  (created by Node B)  watches 0001
  └── node_0000000003  (created by Node C)  watches 0002

  If Node A crashes:
  - node_0000000001 is deleted (ephemeral)
  - Node B is notified (it was watching 0001)
  - Node B checks: am I the smallest? Yes → Node B is the new LEADER

WHY WATCH ONLY THE PREDECESSOR?
  Watching /election directly → HERD EFFECT: every node is notified on any change (thundering herd)
  Watching only the predecessor → O(1) notifications and graceful succession
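A minimal sketch of this recipe using the kazoo Python client; the /election path and the run_for_leader / on_elected names are illustrative, and kazoo also ships a ready-made Election recipe that wraps the same pattern:

from kazoo.client import KazooClient

ELECTION_PATH = "/election"              # illustrative path

def run_for_leader(zk: KazooClient, on_elected):
    zk.ensure_path(ELECTION_PATH)
    # 1. Create an ephemeral sequential candidate node
    me = zk.create(f"{ELECTION_PATH}/node_", ephemeral=True, sequence=True)
    my_name = me.rsplit("/", 1)[1]

    def attempt():
        # 2. Get all children, sorted by sequence number
        children = sorted(zk.get_children(ELECTION_PATH))
        if children[0] == my_name:
            on_elected()                 # 3. Smallest node → we are the leader
            return
        # 4. Watch only the node immediately before ours
        predecessor = children[children.index(my_name) - 1]

        def on_predecessor_change(event):
            attempt()                    # 5. Predecessor gone → re-check

        if zk.exists(f"{ELECTION_PATH}/{predecessor}",
                     watch=on_predecessor_change) is None:
            attempt()                    # predecessor already vanished

    attempt()

# Usage (assumes a reachable ZooKeeper ensemble):
# zk = KazooClient(hosts="127.0.0.1:2181"); zk.start()
# run_for_leader(zk, on_elected=lambda: print("I am the leader"))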
import redis
import time
import uuid


class RedisLock:
    """
    Simple Redis distributed lock.

    WARNING: Has known issues! See Redlock debate.
    Consider Zookeeper/etcd for critical applications.
    """

    def __init__(self, redis_client, resource: str, ttl: int = 10):
        self.redis = redis_client
        self.resource = f"lock:{resource}"
        self.ttl = ttl
        self.token = None

    def acquire(self, timeout: float = 10) -> bool:
        """Try to acquire lock with timeout."""
        self.token = str(uuid.uuid4())
        end_time = time.time() + timeout

        while time.time() < end_time:
            # SET NX (only if not exists) with expiry
            if self.redis.set(
                self.resource,
                self.token,
                nx=True,
                ex=self.ttl
            ):
                return True
            time.sleep(0.1)

        return False

    def release(self):
        """Release lock only if we own it."""
        # Lua script for atomic check-and-delete
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        else
            return 0
        end
        """
        self.redis.eval(script, 1, self.resource, self.token)

    def extend(self, additional_time: int):
        """Extend lock TTL if we own it."""
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("EXPIRE", KEYS[1], ARGV[2])
        else
            return 0
        end
        """
        self.redis.eval(
            script, 1, self.resource, self.token,
            self.ttl + additional_time
        )


class Redlock:
    """
    Redlock algorithm for distributed locking across Redis instances.

    Requires N independent Redis instances (typically 5).
    Lock acquired if majority (N/2 + 1) grant it.
    """

    def __init__(self, redis_instances: list):
        self.instances = redis_instances
        self.quorum = len(redis_instances) // 2 + 1

    def acquire(self, resource: str, ttl: int) -> tuple:
        """
        Try to acquire lock on majority of instances.
        Returns (success, token) tuple.
        """
        token = str(uuid.uuid4())
        start_time = time.time()
        acquired = 0

        for instance in self.instances:
            try:
                if instance.set(
                    f"lock:{resource}",
                    token,
                    nx=True,
                    px=ttl
                ):
                    acquired += 1
            except redis.RedisError:
                pass  # Instance unreachable

        # Check if we acquired quorum
        elapsed = (time.time() - start_time) * 1000  # ms
        validity_time = ttl - elapsed - 2  # drift allowance

        if acquired >= self.quorum and validity_time > 0:
            return True, token

        # Failed - release any locks acquired
        self._release_all(resource, token)
        return False, None

    def _release_all(self, resource: str, token: str):
        script = """
        if redis.call("GET", KEYS[1]) == ARGV[1] then
            return redis.call("DEL", KEYS[1])
        end
        """
        for instance in self.instances:
            try:
                instance.eval(script, 1, f"lock:{resource}", token)
            except redis.RedisError:
                pass
Redlock Controversy: Martin Kleppmann’s analysis shows Redlock can fail under certain conditions (GC pauses, clock drift). For safety-critical applications, prefer coordination services with stronger guarantees (Zookeeper, etcd).
In parallel computing and distributed data processing (like MPI or Apache Spark), you often need to ensure that no node starts Step 2 until all nodes have finished Step 1.
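This is exactly what ZooKeeper's double-barrier recipe provides. Below is a minimal sketch using kazoo's DoubleBarrier; the hosts, barrier path, party count, and the do_step_1 / do_step_2 functions are illustrative placeholders:

from kazoo.client import KazooClient
from kazoo.recipe.barrier import DoubleBarrier

# Illustrative settings - adjust hosts, path, and party count for your cluster
zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

barrier = DoubleBarrier(zk, "/barriers/step-1", num_clients=3)

barrier.enter()        # blocks until all 3 participants have entered
do_step_1()            # hypothetical function: this node's share of Step 1
barrier.leave()        # blocks until all 3 participants have finished

do_step_2()            # safe: nobody reaches here until everyone left the barrier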
Static barriers require a fixed number of participants (N). Distributed phasers allow:
Dynamic Registration: Nodes can join or leave the coordination group mid-flight.
Hierarchical Phasers: To avoid the O(N) bottleneck of a single coordination znode, phasers can be arranged in a tree. A parent phaser only moves to the next phase once all its child phasers have signaled.
Tiered Wait: Some nodes can “signal” they are done but not “wait” for others (useful for pipelining).
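To make these semantics concrete, here is a minimal in-process sketch (a hypothetical SimplePhaser class, not a real library API) showing dynamic registration and the signal-only vs. signal-and-wait split; a distributed version would back the counters with a coordination service instead of a local condition variable:

import threading

class SimplePhaser:
    """Hypothetical in-process phaser sketch (not a library API)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._registered = 0   # parties expected in the current phase
        self._arrived = 0      # parties that have signaled this phase
        self._phase = 0

    def register(self):
        """Dynamic registration: a new participant joins mid-flight."""
        with self._cond:
            self._registered += 1

    def _maybe_advance(self):
        # Caller must hold self._cond
        if self._registered and self._arrived >= self._registered:
            self._phase += 1
            self._arrived = 0
            self._cond.notify_all()

    def arrive(self):
        """Tiered wait: signal completion of this phase without blocking."""
        with self._cond:
            self._arrived += 1
            self._maybe_advance()

    def arrive_and_await(self):
        """Signal completion and block until all registered parties arrive."""
        with self._cond:
            phase = self._phase
            self._arrived += 1
            self._maybe_advance()
            while self._phase == phase:
                self._cond.wait()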
At Staff/Principal level, you must understand when locks (even distributed ones) are too expensive and how to move toward lock-free progress. In single-node systems, we use CAS (Compare-And-Swap) instructions. In distributed systems, we use Optimistic Concurrency Control (OCC) and wait-free data structures.
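As a minimal sketch of the OCC style, here is a check-and-retry increment using Redis's WATCH/MULTI; the counter key name is illustrative:

import redis

r = redis.Redis()

def occ_increment(key: str = "counter", retries: int = 10) -> int:
    """Optimistically read-modify-write: retry if someone else wrote first."""
    for _ in range(retries):
        with r.pipeline() as pipe:
            try:
                pipe.watch(key)                     # start optimistic watch
                current = int(pipe.get(key) or 0)   # read outside the txn
                pipe.multi()                        # queue the write
                pipe.set(key, current + 1)
                pipe.execute()                      # fails if key changed
                return current + 1
            except redis.WatchError:
                continue                            # lost the race - retry
    raise RuntimeError("too much contention")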
Hazard pointers are used to safely manage memory in lock-free data structures by ensuring that a node is not deleted while another thread (or node) is still accessing it. In a distributed environment (e.g., a shared-memory distributed graph or a distributed cache like Pelikan), hazard pointers can be implemented using a Lease-based Central Registry:
Before a node reads a shared object, it publishes a “Hazard Lease” (short TTL).
The “Owner” of the object cannot reclaim or reuse that memory address until all hazard leases have expired or been revoked.
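A rough sketch of the lease-based registry idea, using Redis keys with short TTLs as the hazard table; the key layout, TTL, and function names are assumptions for illustration, not a standard API:

import redis
import uuid

r = redis.Redis()
LEASE_TTL_SECONDS = 2          # short TTL: a stuck reader cannot block reclamation forever

def publish_hazard_lease(object_id: str) -> str:
    """Reader announces 'I am about to access object_id' before dereferencing it."""
    lease_id = str(uuid.uuid4())
    r.set(f"hazard:{object_id}:{lease_id}", b"1", ex=LEASE_TTL_SECONDS)
    return lease_id

def release_hazard_lease(object_id: str, lease_id: str) -> None:
    r.delete(f"hazard:{object_id}:{lease_id}")

def safe_to_reclaim(object_id: str) -> bool:
    """Owner may reuse the object only when no unexpired hazard leases remain."""
    return next(iter(r.scan_iter(f"hazard:{object_id}:*")), None) is None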
| Approach | Progress Guarantee | Scalability | Latency |
| --- | --- | --- | --- |
| Lock-Based | At least one thread is making progress as long as it holds the lock. | Poor (Contention) | High (Wait time) |
| Lock-Free | At least one thread in the system is guaranteed to make progress. | Good | Low |
| Wait-Free | Every thread is guaranteed to make progress in a finite number of steps. | Best | Lowest |
Staff Tip: Moving from Lock-based to Lock-Free coordination (using OCC or RCU) is the primary way to achieve linear scalability in high-throughput control planes.
For Staff-level engineering, you must understand why simple “Least Connections” or “Round Robin” can fail in large-scale distributed systems.

The Problem: Herd Effect
In a traditional “Least Connections” load balancer, if one server becomes slightly faster, all clients might simultaneously see it as the “least loaded” one and overwhelm it (the thundering herd).

The Solution: P2C
Introduced by Michael Mitzenmacher, the Power of Two Choices algorithm is elegantly simple:
Pick two random nodes from the pool.
Choose the one with the least load (e.g., fewest active requests).
# P2C Implementation
import random

class P2CLoadBalancer:
    def __init__(self, nodes):
        self.nodes = nodes  # List of node objects, each exposing a .load counter (e.g., active requests)

    def select_node(self):
        # Pick two random candidates
        n1, n2 = random.sample(self.nodes, 2)
        # Route to the less-loaded of the two
        return n1 if n1.load < n2.load else n2
Why it works:
Exponential Improvement: Choosing the best of two random samples provides an exponential improvement over choosing one random sample. It performs almost as well as “Least Loaded” across the entire pool but with O(1) complexity instead of O(N).
Avoids Hotspots: Because it’s stochastic, it prevents all clients from converging on the same “best” node at the exact same micro-second.
Used By: NGINX, HAProxy, gRPC (lookaside LB), and Finagle (Twitter).
Scenario 2: Distributed Lock with Fencing for a Financial System
You need a lock to guard access to a payment ledger so that no two writers interleave operations on the same account, even in the presence of pauses and network glitches.

Design:
Use Zookeeper or etcd to implement a FIFO lock with monotonic fencing tokens.
On acquiring the lock, the client gets a token that increases over time.
Downstream resources (e.g., database, Kafka consumer) validate tokens and reject stale ones.
Lock acquisition:
  1. Create ephemeral sequential node under /locks/account-<id>/lock_...
  2. If smallest → lock acquired.
  3. Fencing token = sequence number (monotonic per account).

Write path with fencing:
  - Client includes fencing_token in every write, and the write records it:

      UPDATE ledger
      SET    balance = balance + :delta,
             last_token = :fencing_token
      WHERE  account_id = :id
        AND  last_token <= :fencing_token;

  - If 0 rows updated → some newer holder has already written (we're stale).
Why this is safe:
Even if a GC pause or network glitch causes the old lock holder to “wake up” after losing the lock, its token is smaller, so its write fails.
The lock service enforces mutual exclusion at the coordination layer; fencing tokens extend that guarantee all the way to the storage layer.
Scenario 3: Multi-Region Service Discovery and Configuration
You operate services in multiple regions with independent Consul/etcd clusters, but want a consistent way to discover services and roll out configuration.

Design:
Per-region registry:
Each region runs its own Consul or etcd cluster for low-latency discovery.
Services register only in their local registry under paths like /services/api and /config/....
Global view:
A small control-plane service aggregates per-region registry data into a read-only global catalog (for dashboards, tooling).
Traffic routing:
Clients prefer local region endpoints discovered from the local registry.
For failover, they can fall back to remote region instances using a secondary discovery path.
Config rollout pattern:
Store global config under a versioned key, e.g., /config/feature-flags/v42.
Each service:
Watches /config/feature-flags/current (a pointer to the active version).
On change, fetches the new version document and swaps it in-memory.
# Pointer-based config rollout
import json

CONFIG_PREFIX = "/config/feature-flags"

async def config_watcher():
    # `etcd` is an (async) etcd client and LOCAL_FLAGS an in-memory dict,
    # both initialized elsewhere in the service.
    events, cancel = etcd.watch(f"{CONFIG_PREFIX}/current")
    async for event in events:
        version = event.value.decode()          # e.g., "v42"
        raw, _ = etcd.get(f"{CONFIG_PREFIX}/{version}")
        LOCAL_FLAGS.update(json.loads(raw))
Benefits:
Can prepare new configs at /v43 in all regions, validate them, and then atomically switch current → v43.
During regional partitions, each region keeps using its last known current version; when connectivity restores, they converge.
This scenario composes service discovery patterns, dynamic configuration, and watch-based updates into a multi-region, fault-tolerant design.
Question: Design a distributed lock service for a payment system.

Requirements:
Mutual exclusion (only one holder)
Deadlock prevention (timeouts)
Fault tolerance (handle crashes)
Fairness (FIFO ordering)
Design:
Architecture:
  - Use Zookeeper/etcd for coordination
  - Ephemeral nodes for automatic cleanup
  - Sequential nodes for FIFO ordering
  - Fencing tokens for safety

Lock acquisition:
  1. Create ephemeral sequential node
  2. Check if smallest → acquired
  3. Otherwise, watch predecessor
  4. Return fencing token = node sequence number

Lock release:
  1. Delete ephemeral node
  2. Next waiter gets notified

Safety (fencing tokens):
  - Lock returns monotonic token
  - Payment service checks: token >= last_seen_token
  - Rejects stale requests from old lock holders
Q2: Handle split-brain in leader election
Question: How do you prevent split-brain during leader election?

Answer:

The Problem:
Network partition:
  [Node A, Node B]  |  [Node C]

Without quorum:
  - A thinks: "I'm leader of partition 1"
  - C thinks: "I'm leader of partition 2"
  - Two leaders! Data diverges!
Solutions:

1. Quorum requirement
   - Need majority (N/2 + 1) to elect leader
   - 3 nodes → need 2
   - Partition with minority cannot elect leader

2. Fencing
   - Each leader gets epoch number
   - Resources only accept from current epoch
   - Old leader's requests rejected

3. Lease-based leadership
   - Leader holds lease with TTL
   - Must renew before expiry
   - On partition, lease expires → no leader in minority
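A minimal sketch of lease-based leadership using the python-etcd3 client; the key name, TTL, and refresh cadence are illustrative, and the transaction idiom (create_revision == 0 meaning "key does not exist") should be checked against the client version you run:

import time
import etcd3

LEADER_KEY = "/election/leader"     # illustrative key
LEASE_TTL = 10                      # seconds

def try_become_leader(client, node_id):
    """Return the lease if we won the election, else None."""
    lease = client.lease(LEASE_TTL)
    # Atomically claim the key only if nobody currently holds it
    won, _ = client.transaction(
        compare=[client.transactions.create(LEADER_KEY) == 0],
        success=[client.transactions.put(LEADER_KEY, node_id, lease=lease)],
        failure=[],
    )
    return lease if won else None

def lead(client, node_id):
    # client = etcd3.client(host="127.0.0.1", port=2379)
    lease = try_become_leader(client, node_id)
    if lease is None:
        return                      # someone else is leader; watch and retry
    while True:
        lease.refresh()             # must renew before TTL expiry
        time.sleep(LEASE_TTL / 3)   # if we stall (GC pause, partition), the
                                    # lease expires and leadership is released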
Q3: Implement service discovery
Question: Design service discovery for a microservices platform.

Components:
1. Registry (Consul/etcd)
   - Stores service → instances mapping
   - Health check status
   - Metadata (version, region, etc.)

2. Service Registration
   - Self-registration on startup
   - Heartbeat/TTL renewal
   - Graceful deregistration on shutdown

3. Discovery
   - DNS-based (simple, cacheable)
   - API-based (more features)
   - Client-side caching with watch

4. Load Balancing
   - Client-side (Ribbon, gRPC)
   - Server-side (Envoy, HAProxy)

5. Health Checks
   - Active (ping from registry)
   - Passive (TTL refresh)
   - Multiple levels (liveness, readiness)
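A brief sketch of self-registration with a TTL (passive) health check using the python-consul client; the service name, ID, address, port, and intervals are illustrative:

import consul

c = consul.Consul()                           # local Consul agent

# Register on startup with a TTL health check
c.agent.service.register(
    name="payments-api",                      # illustrative service name
    service_id="payments-api-1",
    address="10.0.0.12",
    port=8080,
    check=consul.Check.ttl("10s"),            # registry marks us unhealthy if we stop refreshing
)

def heartbeat():
    # Call periodically (e.g., every 5s) to keep the TTL check passing
    c.agent.check.ttl_pass("service:payments-api-1")

def shutdown():
    # Graceful deregistration on shutdown
    c.agent.service.deregister("payments-api-1")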