Load Balancing Deep Dive

Load balancing is critical for distributing traffic across service instances efficiently and reliably. Think of a load balancer as a restaurant host seating guests: a bad host sends everyone to the same table while others sit empty. A good host considers which section has capacity, which server is fastest, and whether a particular guest always sits in the same spot. The choice of algorithm matters far more than most teams realize: the wrong algorithm under high load can turn a 50ms request into a 5-second timeout.

Load balancing deserves an entire chapter because it sits at the fault line between two very different worlds. On one side is capacity planning: do your instances have enough CPU to handle expected load? On the other side is latency engineering: are requests reaching the fastest instance right now? A naive load balancer solves the first problem but ignores the second. In a five-instance pool it will happily keep routing 20% of traffic to a dying instance because “it is still in the pool.” A sophisticated load balancer treats every instance as a live, shifting quantity and routes based on real-time behavior, not static configuration. The patterns in this chapter exist because teams have learned, often painfully, that assumptions about instance health do not hold under load.
Learning Objectives:
  • Understand client-side vs server-side load balancing
  • Master load balancing algorithms
  • Implement health checking strategies
  • Build intelligent load balancing with Node.js

Client-Side vs Server-Side Load Balancing

Before choosing an algorithm, you must first choose where the load balancing happens. This is one of the most consequential architectural decisions in microservices, and yet most teams default to whatever their platform provides (Kubernetes Service -> server-side, gRPC -> client-side) without understanding the tradeoffs.

Server-side load balancing is simpler: clients talk to a single endpoint and the load balancer decides where traffic goes. Client-side load balancing is more powerful: the client itself knows about all backend instances and decides directly, eliminating one network hop and enabling per-request routing intelligence.

The catch with server-side: you have added a proxy that every request traverses. If that proxy is slow, has bugs, or is overloaded, all your traffic suffers. The catch with client-side: every service that calls the target service needs the load balancing logic embedded in it, so a Python service, a Go service, and a Node service all need compatible client-side load balancers. This is why service meshes built on sidecar proxies such as Envoy (originally developed at Lyft) and Linkerd (from Buoyant) caught on: the proxy running next to each service does the client-side load balancing, so no language has to reimplement it. In modern microservices, the answer is often “both”: server-side at the edge (for external traffic) and client-side or mesh-based internally.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOAD BALANCING APPROACHES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SERVER-SIDE LOAD BALANCING                                                 │
│  ──────────────────────────────                                             │
│                                                                              │
│  ┌──────────┐         ┌──────────────┐         ┌──────────────┐            │
│  │  Client  │────────▶│ Load Balancer│────────▶│ Service A-1  │            │
│  └──────────┘         │  (nginx/HAP) │         ├──────────────┤            │
│                       └──────────────┘────────▶│ Service A-2  │            │
│                                       └───────▶│ Service A-3  │            │
│                                                └──────────────┘            │
│                                                                              │
│  Pros: Simple for clients, centralized control                              │
│  Cons: Single point of failure, extra hop, limited to L4/L7                │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  CLIENT-SIDE LOAD BALANCING                                                 │
│  ────────────────────────────                                               │
│                                                                              │
│  ┌──────────┐         ┌──────────────┐                                     │
│  │  Client  │◀────────│   Service    │         ┌──────────────┐            │
│  │  + LB    │         │   Registry   │────────▶│ Service A-1  │            │
│  │  Logic   │────────────────────────────────▶│ Service A-2  │            │
│  └──────────┘         └──────────────┘└───────▶│ Service A-3  │            │
│                                                └──────────────┘            │
│                                                                              │
│  Pros: No extra hop, distributed (no SPOF), more intelligent               │
│  Cons: Complex clients, language-specific implementations                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Caveats & Common Pitfalls: L4 vs L7 Confusion.
  • Treating L4 and L7 as interchangeable. L4 (TCP) load balancers balance connections; L7 (HTTP) load balancers balance requests. Running gRPC behind an L4 balancer (AWS NLB, bare kube-proxy) means all requests over a single HTTP/2 connection land on one backend. With two busy clients and five backends, two backends take all the traffic while three sit idle.
  • Choosing L4 “for performance” without measuring. L4 is faster per packet but L7 does connection pooling, header-based routing, and retries that an L4 cannot. The overhead of L7 at 10k RPS is negligible on modern hardware; the features are not.
  • Using L7 for pure TCP workloads. Running a Postgres connection through an L7 HTTP-aware proxy breaks things in subtle ways (proxy terminates the TLS and forwards plaintext, headers get rewritten, long-lived connections get reset during reloads).
  • Mixing layers in one deploy. Some teams put L4 in front of L7 in front of L4, stacking three proxy layers, each of which can be misconfigured for health checks, timeouts, and client IP preservation.
Solutions & Patterns: Choose the right layer.
  • HTTP, gRPC, WebSocket: use L7 (Envoy, NGINX, HAProxy). L7 sees individual requests within an HTTP/2 connection and balances them across backends.
  • TCP, UDP, database protocols: use L4 (AWS NLB, kube-proxy, HAProxy TCP mode). L4 does not try to inspect payloads.
  • The TLS termination decision is a load-balancer-layer decision. Terminate TLS at the edge L7 proxy when you need header-based routing; terminate at the application when you need end-to-end encryption for compliance.
  • Run both if you need both. An L4 NLB for raw TCP services, an L7 ALB/Envoy for HTTP services, behind the same domain using DNS or host-header routing.
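The difference between balancing connections and balancing requests is easy to see in a toy simulation. The sketch below is illustrative only: it assumes each client holds one long-lived HTTP/2 connection (as gRPC clients typically do) and that the balancer uses plain round-robin. The `simulate` function and its parameters are hypothetical names, not part of any proxy's API.

```javascript
// Sketch: connection-level (L4) vs request-level (L7) balancing.
// Assumption: every client multiplexes all its requests over ONE connection.

function simulate({ clients, requestsPerClient, backends, mode }) {
  const load = new Array(backends).fill(0);
  let connCursor = 0; // round-robin cursor for new connections (L4)
  let reqCursor = 0;  // round-robin cursor for individual requests (L7)

  for (let c = 0; c < clients; c++) {
    // L4: the backend is chosen once, when the connection is opened.
    const pinned = connCursor++ % backends;
    for (let r = 0; r < requestsPerClient; r++) {
      // L7: the backend is chosen again for every request on the connection.
      const target = mode === 'L4' ? pinned : reqCursor++ % backends;
      load[target]++;
    }
  }
  return load;
}

const scenario = { clients: 2, requestsPerClient: 1000, backends: 5 };
const l4Load = simulate({ ...scenario, mode: 'L4' });
const l7Load = simulate({ ...scenario, mode: 'L7' });

console.log('L4 requests per backend:', l4Load); // [ 1000, 1000, 0, 0, 0 ]
console.log('L7 requests per backend:', l7Load); // [ 400, 400, 400, 400, 400 ]
```

With two clients and five backends, the L4 run leaves three backends idle, which is the gRPC-behind-NLB failure described above.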

Server-Side: NGINX Configuration

NGINX has been the default choice for server-side HTTP load balancing for over a decade, and the config below demonstrates why. Notice how the configuration is declarative: you describe the upstream pool, the algorithm, and the failure-handling rules, and let NGINX handle the mechanics. The downside is also visible: this configuration is static. To change weights or add a server, you edit the file and reload NGINX. That does not scale when you have 50 microservices and hundreds of instances coming and going via Kubernetes autoscaling. This is why cloud-native environments prefer dynamic service discovery (Consul, Kubernetes Service, Envoy xDS) even though the underlying algorithms are the same.

One specific anti-pattern worth highlighting: running NGINX with ip_hash and then wondering why one server is overwhelmed. ip_hash buckets clients by IP address (for IPv4, by the first three octets only), so if most of your traffic comes from a small number of addresses (think mobile carriers, corporate NATs), the distribution will be badly skewed. Use ip_hash only when you genuinely need session affinity and you have verified your client IP distribution is diverse.
# nginx.conf - Production-ready load balancing

upstream user_service {
    # Load balancing algorithm
    least_conn;  # Send to server with fewest active connections
    
    # Backend servers with weights and health
    server user-1.internal:3000 weight=5 max_fails=3 fail_timeout=30s;
    server user-2.internal:3000 weight=5 max_fails=3 fail_timeout=30s;
    server user-3.internal:3000 weight=3 backup;  # Backup server
    
    # Keepalive connections to backends
    keepalive 32;
    keepalive_timeout 60s;
}

upstream order_service {
    # IP Hash - session affinity
    ip_hash;
    
    server order-1.internal:3001;
    server order-2.internal:3001;
    server order-3.internal:3001;
}

server {
    listen 80;
    
    location /api/users {
        proxy_pass http://user_service;
        
        # Passive failover: on these errors, retry the request on the next upstream
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        
        # Keepalive
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    
    location /api/orders {
        proxy_pass http://order_service;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
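To get a feel for the ip_hash skew described above, the following sketch hashes an invented traffic mix onto three backends. Everything here is an assumption made for illustration: FNV-1a stands in for NGINX's internal hash function, and the traffic (two busy NAT addresses plus 400 single-request clients) is made up. Only the keying on the first three IPv4 octets mirrors documented ip_hash behavior.

```javascript
// Sketch of why ip_hash skews when traffic hides behind a few NAT addresses.

// FNV-1a 32-bit hash (a stand-in; NGINX uses its own hash internally).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Like ip_hash for IPv4, key on the first three octets only.
function pickBackend(ip, backendCount) {
  const key = ip.split('.').slice(0, 3).join('.');
  return fnv1a(key) % backendCount;
}

function distribute(traffic, backendCount) {
  const load = new Array(backendCount).fill(0);
  for (const { ip, requests } of traffic) {
    load[pickBackend(ip, backendCount)] += requests;
  }
  return load;
}

// Invented mix: two corporate NATs with 2000 requests each, 400 light clients.
const traffic = [
  { ip: '203.0.113.10', requests: 2000 },
  { ip: '198.51.100.7', requests: 2000 },
];
for (let i = 0; i < 400; i++) {
  traffic.push({ ip: `10.${i % 200}.${Math.floor(i / 200)}.25`, requests: 1 });
}

const ipHashLoad = distribute(traffic, 3);
console.log('requests per backend:', ipHashLoad);
```

Whatever the hash, each NAT address lands entirely on one backend, so at least one backend receives 2000 or more of the 4400 requests while an even split would be about 1467 each.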

Client-Side: Implementation

Now let us build client-side load balancing from scratch. The goal of this implementation is to show the moving parts clearly: service discovery integration, health check loop, algorithm selection, and automatic retry with failover. In production you would typically not implement this yourself; you would use gRPC’s built-in load balancer, a service mesh like Linkerd, or a library like Netflix’s Ribbon (Java) or go-micro (Go). But understanding the mechanics is essential because when something goes wrong (and it will), you need to know which layer to debug.

Pay close attention to how we track per-instance metrics (activeConnections, responseTime, consecutiveFailures). These are the inputs that make intelligent algorithms like least-connections and least-response-time actually work. Without this tracking, your “load balancer” is just round-robin with extra steps.

The failure-counting logic is subtle: we stop routing to an instance after 3 consecutive failures, but a single success resets the counter. This prevents flapping when a network blip causes one failed request, while still catching instances that are genuinely broken.

What would happen if you skipped the health check entirely? The load balancer would keep sending a dead instance its full share of traffic (20% in a five-instance pool) until the service registry eventually removed it (typically 30-60 seconds later). Every one of those requests would time out or return 502 errors, multiplying the impact of a single instance failure. Client-side health checks close this gap: you start routing away from unhealthy instances within seconds, not minutes.
// client-side-lb.js - Client-side load balancing with service discovery

const EventEmitter = require('events');

class ClientSideLoadBalancer extends EventEmitter {
  constructor(options = {}) {
    super();
    this.serviceName = options.serviceName;
    this.registry = options.registry;
    this.algorithm = options.algorithm || 'round-robin';
    this.healthCheckInterval = options.healthCheckInterval || 10000;
    
    this.instances = [];
    this.currentIndex = 0;
    this.healthStatus = new Map();
    
    this.initializeDiscovery();
    this.startHealthChecks();
  }

  async initializeDiscovery() {
    // Initial fetch from service registry
    await this.refreshInstances();
    
    // Subscribe to registry updates
    this.registry.on('instances-changed', (serviceName) => {
      if (serviceName === this.serviceName) {
        this.refreshInstances();
      }
    });
  }

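  async refreshInstances() {
    try {
      const instances = await this.registry.getInstances(this.serviceName);
      const known = new Map(this.instances.map(i => [i.id, i]));
      this.instances = instances.map(instance => {
        const previous = known.get(instance.id);
        if (previous) {
          // Reuse the tracked object so in-flight requests keep updating the
          // right counters and failure history survives a registry refresh
          return Object.assign(previous, instance, { weight: instance.weight || 1 });
        }
        return {
          ...instance,
          weight: instance.weight || 1,
          activeConnections: 0,
          responseTime: 0,
          consecutiveFailures: 0
        };
      });
      this.emit('instances-updated', this.instances);
    } catch (error) {
      this.emit('error', error);
    }
  }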

  startHealthChecks() {
    // Keep the handle so callers can stop the loop with clearInterval(this.healthCheckTimer)
    this.healthCheckTimer = setInterval(async () => {
      for (const instance of this.instances) {
        try {
          const start = Date.now();
          // fetch() has no `timeout` option; abort through a signal instead (Node 18+)
          const response = await fetch(`http://${instance.host}:${instance.port}/health`, {
            signal: AbortSignal.timeout(5000)
          });
          const latency = Date.now() - start;
          
          this.healthStatus.set(instance.id, {
            healthy: response.ok,
            latency,
            lastCheck: Date.now()
          });
          
          instance.responseTime = latency;
          if (response.ok) {
            // Only a passing check resets the counter; a 5xx from /health must not
            instance.consecutiveFailures = 0;
          }
        } catch (error) {
          const status = this.healthStatus.get(instance.id) || {};
          instance.consecutiveFailures++;
          this.healthStatus.set(instance.id, {
            ...status,
            healthy: false,
            lastCheck: Date.now(),
            error: error.message
          });
        }
      }
    }, this.healthCheckInterval);
  }

  // Get next available instance based on algorithm
  getNextInstance() {
    const healthyInstances = this.instances.filter(
      i => (this.healthStatus.get(i.id)?.healthy !== false) && 
           i.consecutiveFailures < 3
    );

    if (healthyInstances.length === 0) {
      throw new Error(`No healthy instances available for ${this.serviceName}`);
    }

    switch (this.algorithm) {
      case 'round-robin':
        return this.roundRobin(healthyInstances);
      case 'weighted-round-robin':
        return this.weightedRoundRobin(healthyInstances);
      case 'least-connections':
        return this.leastConnections(healthyInstances);
      case 'least-response-time':
        return this.leastResponseTime(healthyInstances);
      case 'random':
        return this.random(healthyInstances);
      default:
        return this.roundRobin(healthyInstances);
    }
  }

  roundRobin(instances) {
    const instance = instances[this.currentIndex % instances.length];
    this.currentIndex++;
    return instance;
  }

  weightedRoundRobin(instances) {
    // Create weighted list
    const weighted = [];
    for (const instance of instances) {
      for (let i = 0; i < instance.weight; i++) {
        weighted.push(instance);
      }
    }
    
    const instance = weighted[this.currentIndex % weighted.length];
    this.currentIndex++;
    return instance;
  }

  leastConnections(instances) {
    return instances.reduce((min, current) => 
      current.activeConnections < min.activeConnections ? current : min
    );
  }

  leastResponseTime(instances) {
    // Blend in-flight requests with the last observed latency (ms); lower score wins.
    // Note the mixed units: a connection count is added to milliseconds, so with
    // equal 0.5 weights latency usually dominates the score.
    const score = i => i.activeConnections * 0.5 + i.responseTime * 0.5;
    return instances.reduce((best, current) =>
      score(current) < score(best) ? current : best
    );
  }

  random(instances) {
    return instances[Math.floor(Math.random() * instances.length)];
  }

  // Execute request with automatic failover
  async execute(requestFn, options = {}) {
    const maxRetries = options.retries || 3;
    const retryDelay = options.retryDelay || 100;
    let lastError;

    for (let attempt = 0; attempt < maxRetries; attempt++) {
      const instance = this.getNextInstance();
      instance.activeConnections++;
      const start = Date.now();
      
      try {
        const result = await requestFn(instance);
        
        instance.responseTime = Date.now() - start;
        // A single success resets the failure counter
        instance.consecutiveFailures = 0;
        
        return result;
      } catch (error) {
        instance.consecutiveFailures++;
        lastError = error;
        
        this.emit('request-failed', { instance, error, attempt });
      } finally {
        // Runs on success and failure alike, before any backoff delay
        instance.activeConnections--;
      }

      // Exponential backoff before the next attempt: 100ms, 200ms, 400ms, ...
      if (attempt < maxRetries - 1) {
        await this.delay(retryDelay * Math.pow(2, attempt));
      }
    }

    throw lastError;
  }

  delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

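// Usage example
// Assumes `serviceRegistry` already exists: any object with an async
// getInstances(serviceName) method that emits 'instances-changed' events.
const axios = require('axios');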

const userServiceLB = new ClientSideLoadBalancer({
  serviceName: 'user-service',
  registry: serviceRegistry,
  algorithm: 'least-connections'
});

// Make requests through the load balancer
async function getUser(userId) {
  return userServiceLB.execute(async (instance) => {
    const response = await axios.get(
      `http://${instance.host}:${instance.port}/users/${userId}`,
      { timeout: 5000 }
    );
    return response.data;
  });
}

module.exports = { ClientSideLoadBalancer };
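The usage example above relies on a `serviceRegistry` object that the listing never defines. For local experiments, a minimal in-memory stand-in could look like the sketch below. `InMemoryRegistry`, `register`, and `deregister` are hypothetical names invented here; the only contract taken from the load balancer code is an async `getInstances(serviceName)` method and an `'instances-changed'` event carrying the service name. A real deployment would back this with Consul, etcd, or a similar system.

```javascript
// in-memory-registry.js - hypothetical stand-in for a real service registry
const EventEmitter = require('node:events');

class InMemoryRegistry extends EventEmitter {
  constructor() {
    super();
    this.services = new Map(); // serviceName -> Map(instanceId -> instance)
  }

  register(serviceName, instance) {
    if (!this.services.has(serviceName)) {
      this.services.set(serviceName, new Map());
    }
    this.services.get(serviceName).set(instance.id, instance);
    this.emit('instances-changed', serviceName);
  }

  deregister(serviceName, instanceId) {
    const removed = this.services.get(serviceName)?.delete(instanceId) ?? false;
    if (removed) this.emit('instances-changed', serviceName);
    return removed;
  }

  // Async to mirror a network-backed registry; returns copies so callers
  // cannot mutate the registry's own records.
  async getInstances(serviceName) {
    const instances = this.services.get(serviceName)?.values() ?? [];
    return [...instances].map(instance => ({ ...instance }));
  }
}

const serviceRegistry = new InMemoryRegistry();
serviceRegistry.register('user-service', { id: 'user-1', host: '127.0.0.1', port: 3000, weight: 2 });
serviceRegistry.register('user-service', { id: 'user-2', host: '127.0.0.1', port: 3001 });
```

Creating the registry before constructing the load balancer lets the initial `refreshInstances()` call find both instances; later `register` and `deregister` calls trigger a refresh through the event.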

Load Balancing Algorithms

The algorithm you choose is effectively a hypothesis about your workload. Round-robin assumes all instances are identical and all requests take equal time. Least-connections assumes requests vary in duration but instances are otherwise equal. Least-response-time assumes instances themselves vary in speed. Consistent hashing assumes you need keys to stick to specific instances (for caching or partitioning). When your hypothesis matches reality, the algorithm works well. When it does not, you get mysterious tail latency or cache misses that nobody can explain.

The most common mistake is picking an algorithm that sounds smart without validating that it matches your workload. I have seen teams switch from round-robin to least-response-time expecting better performance, only to find that it created a feedback loop: the fastest instance got all the traffic, its response time crept up, another instance became fastest, and so on. The resulting oscillation was worse than round-robin’s simple equal distribution. This is why Power of Two Choices (P2C), discussed below, has become a standard option in modern proxies; Envoy’s least-request policy, for example, is built on it. It has the intelligence of least-connections without the herding problem.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOAD BALANCING ALGORITHMS                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ROUND ROBIN                          WEIGHTED ROUND ROBIN                  │
│  ─────────────────                    ─────────────────────                 │
│                                                                              │
│  Request 1 → Server A                 Request 1 → Server A (w=5)           │
│  Request 2 → Server B                 Request 2 → Server A                 │
│  Request 3 → Server C                 Request 3 → Server A                 │
│  Request 4 → Server A                 Request 4 → Server B (w=3)           │
│  ...                                  Request 5 → Server B                 │
│                                       Request 6 → Server B                 │
│  Simple, equal distribution           Request 7 → Server C (w=2)           │
│                                       ...                                   │
│                                       Accounts for server capacity         │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  LEAST CONNECTIONS                    LEAST RESPONSE TIME                  │
│  ──────────────────────               ─────────────────────                 │
│                                                                              │
│  Server A: 5 conn  ←────              Server A: 50ms avg  ←────            │
│  Server B: 8 conn                     Server B: 75ms avg                   │
│  Server C: 3 conn                     Server C: 45ms avg                   │
│                    ↑                                     ↑                  │
│           Next → Server C             Next → Server C (fastest)            │
│                                                                              │
│  Best for long-lived connections      Best for varying server loads        │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  IP HASH                              CONSISTENT HASHING                   │
│  ────────────                         ───────────────────                   │
│                                                                              │
│  hash(client_ip) → Server B           hash(request) → Ring position        │
│                                       Minimal redistribution on change     │
│  Same client → same server                                                 │
│  Session affinity                     Good for caching, stateful services  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Caveats & Common Pitfalls: Algorithm Selection.
  • Weighted round-robin with unhealthy upstreams. If your health check is slow to react, weighted round-robin happily keeps sending weight-proportional traffic to a dying instance. An instance with weight 5 that is about to crash gets 5x the traffic of a healthy instance with weight 1 for the whole detection window.
  • Session affinity (“sticky sessions”) causing hot instances. When a popular user or a large corporate NAT hashes to one instance, that instance receives 10x the average load. Teams usually notice only when that one instance melts during peak hours.
  • Least-connections without accurate connection tracking. If your load balancer’s “connections” metric is per-LB-instance and you run multiple LB replicas, each LB thinks it has few connections and they all pile onto the same backend.
  • Picking least-response-time and creating feedback oscillation. The fastest instance gets all traffic, its response time degrades, now another instance is fastest and receives the herd. Result: oscillating traffic and worse p99 than round-robin.
Solutions & Patterns: Algorithm best practices.
  • Default to Power of Two Choices (P2C). Pick 2 random instances, send to the less-loaded one. Breaks the herd effect, requires no global state, and outperforms least-connections in realistic workloads.
  • Tie weights to CPU or memory, not age. Use weights that reflect actual capacity (4-core instance = weight 4, 2-core = weight 2), and recompute them when autoscaling changes the fleet.
  • Combine session affinity with a fallback. Prefer the sticky instance, but fail open to round-robin if the sticky instance is unhealthy or overloaded. Never make stickiness an absolute constraint.
  • Monitor per-instance load variance, not just averages. If instance p99 load is 3x instance median load, your algorithm is not balancing well. Average load of 50% can hide one instance at 95%.
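The claim that two random choices beat one can be checked with a short simulation. The sketch below is a simplified model, not a benchmark: it assumes the classic "balls into bins" setup in which requests arrive and never complete, and it uses a small seeded generator (mulberry32) so that runs are repeatable. The function names are illustrative.

```javascript
// Sketch: maximum per-server load under random choice vs Power of Two Choices.
// Model assumption: n requests are assigned to n servers and none complete.

// Small seeded PRNG (mulberry32) so results are repeatable across runs.
function mulberry32(seed) {
  let a = seed | 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One random choice per request.
const pickRandom = (load, rand) => Math.floor(rand() * load.length);

// Two distinct random candidates; send to the less loaded one.
function pickP2C(load, rand) {
  const n = load.length;
  const i1 = Math.floor(rand() * n);
  // Offset by 1..n-1 so the second candidate always differs from the first.
  const i2 = (i1 + 1 + Math.floor(rand() * (n - 1))) % n;
  return load[i1] <= load[i2] ? i1 : i2;
}

function simulateMaxLoad(servers, requests, pick, seed) {
  const rand = mulberry32(seed);
  const load = new Array(servers).fill(0);
  for (let r = 0; r < requests; r++) load[pick(load, rand)]++;
  return load.reduce((max, x) => Math.max(max, x), 0);
}

const servers = 10000;
const randomMax = simulateMaxLoad(servers, servers, pickRandom, 42);
const p2cMax = simulateMaxLoad(servers, servers, pickP2C, 42);
console.log({ randomMax, p2cMax });
```

The average load is 1 request per server in both runs; the interesting number is the worst-off server, which is noticeably lower with P2C. Real traffic has completions and varying request costs, so treat this as intuition rather than a capacity estimate.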

Advanced Algorithms Implementation

The algorithms below demonstrate the ideas behind production load balancers.

Weighted Round Robin with smooth distribution prevents clumping: naive weighted round-robin sends all weight-5 requests consecutively, which can overwhelm that instance, while the version below interleaves them with requests to the other instances.

Consistent hashing with virtual nodes is the foundation of client-side sharding for distributed caches such as Memcached (the ketama scheme); Redis Cluster, by contrast, assigns keys to fixed hash slots. The benefit shows up when the pool changes. With naive modulo hashing (hash(key) % n), removing one server remaps most keys. With a hash ring, only the keys owned by the removed server move, roughly 1/n of them. Virtual nodes do not shrink that fraction further; they even out each server's share of the ring, so with 150 virtual nodes per server the removed server's keys spread across all survivors instead of landing on a single neighbor.

Power of Two Choices (P2C) is the algorithm I recommend for most modern load balancers. The math is surprising: picking 2 random instances and choosing the less loaded one outperforms least-connections in realistic workloads. Why? Because pure least-connections creates a herd effect when several balancers or clients act on the same slightly stale counts: each sees that instance A has 3 connections and instance B has 4, and each routes to A before anyone observes the pile-up. With P2C, the randomness breaks the herding. The mathematical analysis (Azar, Broder, Karlin, Upfal 1994) shows that choosing the less loaded of two random candidates achieves O(log log n) maximum load, while a single random choice achieves O(log n / log log n): a dramatic improvement for essentially no extra cost.
// advanced-lb-algorithms.js

class LoadBalancingAlgorithms {
  // Weighted Round Robin with smooth distribution
  static createWeightedRoundRobin(instances) {
    let currentWeight = 0;
    let maxWeight = Math.max(...instances.map(i => i.weight));
    let gcdWeight = instances.reduce((a, b) => gcd(a, b.weight), instances[0].weight);
    let currentIndex = -1;

    function gcd(a, b) {
      return b === 0 ? a : gcd(b, a % b);
    }

    return function getNext() {
      while (true) {
        currentIndex = (currentIndex + 1) % instances.length;
        
        if (currentIndex === 0) {
          currentWeight -= gcdWeight;
          if (currentWeight <= 0) {
            currentWeight = maxWeight;
          }
        }
        
        if (instances[currentIndex].weight >= currentWeight) {
          return instances[currentIndex];
        }
      }
    };
  }

  // Consistent Hashing with virtual nodes
  static createConsistentHash(instances, virtualNodes = 150) {
    const ring = new Map();
    const sortedKeys = [];

    // Add virtual nodes for each instance
    for (const instance of instances) {
      for (let i = 0; i < virtualNodes; i++) {
        const key = hash(`${instance.id}-${i}`);
        ring.set(key, instance);
        sortedKeys.push(key);
      }
    }
    
    sortedKeys.sort((a, b) => a - b);

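    function hash(str) {
      // Simple 32-bit string hash (the 31x scheme used by Java's String.hashCode).
      // Fine for a demo; prefer a better-mixing hash such as MurmurHash in production.
      let h = 0;
      for (let i = 0; i < str.length; i++) {
        h = ((h << 5) - h) + str.charCodeAt(i);
        h |= 0;  // Truncate to a 32-bit integer
      }
      return Math.abs(h);
    }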

    return function getNode(key) {
      const keyHash = hash(key);
      
      // Binary search for first key >= keyHash
      let low = 0, high = sortedKeys.length - 1;
      
      while (low < high) {
        const mid = Math.floor((low + high) / 2);
        if (sortedKeys[mid] < keyHash) {
          low = mid + 1;
        } else {
          high = mid;
        }
      }

      // Wrap around if key is larger than all
      const index = sortedKeys[low] >= keyHash ? low : 0;
      return ring.get(sortedKeys[index]);
    };
  }

  // Power of Two Choices (P2C)
  // This deceptively simple algorithm has mathematical guarantees that make it surprisingly effective.
  // By picking just 2 random servers and choosing the less loaded one, you avoid the "herd effect"
  // where all clients pick the same "best" server (least-connections) and overwhelm it.
  // Used by Envoy, HAProxy, and Netflix Zuul in production. The max load is O(log(log(n))).
  static createP2C(instances) {
    return function getNext() {
      if (instances.length === 1) return instances[0];
      
      // Pick two random instances
      const i1 = Math.floor(Math.random() * instances.length);
      let i2 = Math.floor(Math.random() * instances.length);
      while (i2 === i1) {
        i2 = Math.floor(Math.random() * instances.length);
      }

      // Choose the one with fewer active connections
      return instances[i1].activeConnections <= instances[i2].activeConnections
        ? instances[i1]
        : instances[i2];
    };
  }

  // Adaptive Load Balancing (based on real-time metrics)
  static createAdaptive(instances) {
    return function getNext() {
      // Calculate scores based on multiple factors
      const scored = instances.map(instance => ({
        instance,
        score: calculateScore(instance)
      }));

      // Sort by score (lower is better)
      scored.sort((a, b) => a.score - b.score);
      
      // Weighted random from top 3
      const topN = scored.slice(0, Math.min(3, scored.length));
      const totalWeight = topN.reduce((sum, s) => sum + (1 / s.score), 0);
      
      let random = Math.random() * totalWeight;
      for (const { instance, score } of topN) {
        random -= (1 / score);
        if (random <= 0) return instance;
      }
      
      return topN[0].instance;
    };

    function calculateScore(instance) {
      // Lower score = better. Missing metrics default to 0 so the score never becomes NaN.
      const connectionScore = (instance.activeConnections ?? 0) * 0.3;
      const latencyScore = (instance.avgResponseTime ?? 0) * 0.4;   // milliseconds
      const errorScore = (instance.errorRate ?? 0) * 100 * 0.3;     // errorRate assumed to be a 0-1 fraction
      
      return connectionScore + latencyScore + errorScore + 0.001; // Avoid division by zero
    }
  }
}

module.exports = { LoadBalancingAlgorithms };
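How many keys move when a server leaves the pool is easy to measure. The sketch below compares a hash ring with naive modulo hashing over 20,000 synthetic keys. It is a self-contained illustration rather than a reuse of the class above: it assumes FNV-1a as the hash function, and `buildRing` and `lookup` are hypothetical helper names.

```javascript
// Sketch: key movement when one of five servers is removed,
// hash ring (150 virtual nodes per server) vs hash(key) % n.

// FNV-1a 32-bit hash (an assumption; any well-mixing hash works).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

function buildRing(serverIds, virtualNodes = 150) {
  const ring = [];
  for (const id of serverIds) {
    for (let v = 0; v < virtualNodes; v++) {
      ring.push({ hash: fnv1a(`${id}#${v}`), id });
    }
  }
  // Sort by ring position; break rare hash ties by id for determinism.
  ring.sort((a, b) => a.hash - b.hash || a.id.localeCompare(b.id));
  return ring;
}

function lookup(ring, key) {
  const h = fnv1a(key);
  let low = 0;
  let high = ring.length; // binary search over [low, high)
  while (low < high) {
    const mid = (low + high) >>> 1;
    if (ring[mid].hash < h) low = mid + 1;
    else high = mid;
  }
  return ring[low % ring.length].id; // wrap around past the last node
}

const servers = ['s0', 's1', 's2', 's3', 's4'];
const survivors = servers.slice(0, 4); // s4 is removed
const keys = Array.from({ length: 20000 }, (_, i) => `user:${i}`);

const ringBefore = buildRing(servers);
const ringAfter = buildRing(survivors);
let ringMoved = 0;
let moduloMoved = 0;
let movedFromSurvivor = 0;

for (const key of keys) {
  const before = lookup(ringBefore, key);
  if (before !== lookup(ringAfter, key)) {
    ringMoved++;
    if (before !== 's4') movedFromSurvivor++; // should never happen
  }
  const h = fnv1a(key);
  if (servers[h % servers.length] !== survivors[h % survivors.length]) moduloMoved++;
}

console.log('ring moved fraction:  ', ringMoved / keys.length);
console.log('modulo moved fraction:', moduloMoved / keys.length);
```

On the ring, the only keys that move are those that belonged to the removed server, so the moved fraction sits near 1/5. Under modulo hashing most keys change owner, which for a cache means a near-total miss storm.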

Health Checking Strategies

Health checks are the eyes of the load balancer. If they report wrong, the load balancer makes wrong decisions: routing to dead instances or evicting healthy ones. The most important distinction in health checking is between liveness (is the process alive?) and readiness (can it handle traffic?). This sounds pedantic but it is the single most common mistake I see in production: teams implement a single “health” endpoint that checks everything, use it for both liveness and readiness, and then wonder why their system goes into a death spiral whenever a downstream dependency has a bad day.

The failure mode is subtle but catastrophic. Imagine your /health endpoint checks the database. The database has a slow query that makes health checks time out. Kubernetes sees liveness failures and restarts the containers. The restarted containers also fail the check (the database is still slow), so Kubernetes restarts them again with growing back-off delays (CrashLoopBackOff). Within minutes no pod is serving; even once the database recovers, nothing is up to take traffic until the restarts succeed. You have a full outage. The fix: liveness checks should be shallow (just “is the HTTP server responding?”), and only readiness checks should verify dependencies.

The deep health check is the third kind: an endpoint used by humans and dashboards to see rich dependency status. It is NOT what the load balancer polls. It tells you why a service is degraded, not whether it should receive traffic.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HEALTH CHECK PATTERNS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  LIVENESS vs READINESS                                                      │
│  ─────────────────────────                                                  │
│                                                                              │
│  LIVENESS: Is the process running?                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  GET /healthz                                                        │   │
│  │  → 200 OK: Process is alive                                          │   │
│  │  → 5xx: Process is dead, restart it                                  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  READINESS: Can it handle traffic?                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  GET /ready                                                          │   │
│  │  → 200 OK: Ready to receive traffic                                  │   │
│  │  → 503: Not ready (warming up, dependencies down)                    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  DEEP HEALTH CHECK (with dependencies)                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  GET /health/deep                                                    │   │
│  │  {                                                                   │   │
│  │    "status": "degraded",                                             │   │
│  │    "checks": {                                                       │   │
│  │      "database": { "status": "healthy", "latency": "5ms" },         │   │
│  │      "redis": { "status": "healthy", "latency": "2ms" },            │   │
│  │      "external-api": { "status": "unhealthy", "error": "timeout" } │   │
│  │    }                                                                 │   │
│  │  }                                                                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
Caveats & Common Pitfalls: Health Checks and Deploys.
  • Connection draining gap during deploy. The load balancer removes the instance from its pool but the instance still has in-flight connections. If you terminate the container before connections drain, in-flight requests fail with 502s. Users get errors during every deploy.
  • Deep health checks triggering cascade restarts. A liveness probe that queries the database dies when the database has a blip. Kubernetes restarts the pod, the new pod also dies (database still bad), a crash loop starts, and now zero pods serve traffic.
  • Health check frequency too high. A 1-second health check against a 10-instance pool with 3 retries creates 30 requests per second of pure health-check traffic. On a 200-instance pool this is 600 req/sec that adds nothing to business value but consumes resources.
  • Health check bypassing the critical path. The health endpoint returns 200 but only tests the process, not the actual request handler. You get a “healthy” service that returns 500s to every real request because of a bad config load.
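The probe-traffic arithmetic in the frequency pitfall above can be checked with a few lines of code. This is a minimal sketch: the function names `probeRate` and `worstCaseDetectionSeconds` are hypothetical, and the detection bound assumes probes fire on a fixed schedule with a timeout no longer than the interval.

```javascript
// Probe requests per second generated by ONE checker polling a pool.
// (Hypothetical helper; multiply by the number of load balancers that probe.)
function probeRate({ instances, intervalSeconds, attemptsPerCheck = 1 }) {
  return (instances * attemptsPerCheck) / intervalSeconds;
}

// Rough upper bound on the time to mark an instance unhealthy: the instance
// dies just after a successful probe, then fails `failureThreshold`
// consecutive probes, the last of which waits for the full timeout.
function worstCaseDetectionSeconds({ intervalSeconds, timeoutSeconds, failureThreshold }) {
  return failureThreshold * intervalSeconds + timeoutSeconds;
}

console.log(probeRate({ instances: 10, intervalSeconds: 1, attemptsPerCheck: 3 }));  // 30
console.log(probeRate({ instances: 200, intervalSeconds: 1, attemptsPerCheck: 3 })); // 600
console.log(worstCaseDetectionSeconds({ intervalSeconds: 2, timeoutSeconds: 2, failureThreshold: 3 })); // 8
```

Running the numbers before choosing an interval makes the trade-off explicit: halving the interval halves detection time but doubles probe traffic.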
Solutions & Patterns: Health checks done right.
  • Separate liveness (shallow) from readiness (with dependencies) from deep health (human-readable). Three endpoints, three audiences. Never let one endpoint serve all three.
  • Preserve in-flight traffic during deploys. On SIGTERM, flip readiness to false, keep serving while the LB drains (typically 15-30 seconds), then stop accepting connections and exit. In Kubernetes, a preStop hook that flips readiness, combined with a terminationGracePeriodSeconds longer than the drain window, gives you this.
  • Tune health check intervals to your recovery SLA. If you need to detect a dead instance within 10 seconds, use 2-second intervals with a failure threshold of 3. If your SLA is looser than that, probe less aggressively to reduce noise.
  • Test the actual critical path. Readiness should exercise the real request handler with a synthetic request, not just return 200 from a different route.
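The drain sequence described above (fail readiness, wait for the load balancer to notice, then close the server) might look like the following in Node.js. This is a sketch under stated assumptions: `createGracefulShutdown`, `drainDelayMs`, and `forceAfterMs` are hypothetical names, `server` is expected to be a Node `http.Server` (or anything with a `close(callback)` method), and `setNotReady` is whatever flips your readiness endpoint to 503.

```javascript
// Hypothetical helper: returns a function that drains the server before exit.
function createGracefulShutdown({
  server,
  setNotReady,
  drainDelayMs = 15000, // time for the LB to observe readiness failing
  forceAfterMs = 30000, // keep this below the platform's kill deadline
  exit = process.exit
}) {
  let pending = null;

  return function shutdown() {
    if (pending) return pending; // ignore repeated signals

    // 1. Fail readiness so the load balancer stops sending new requests.
    setNotReady('shutting_down');

    // 2. Safety net: exit non-zero if draining takes too long.
    const force = setTimeout(() => exit(1), forceAfterMs);
    force.unref();

    // 3. After the drain delay, stop accepting connections. close() invokes
    //    its callback once existing connections have ended; idle keep-alive
    //    connections can delay this on older Node versions.
    pending = new Promise((resolve) => {
      setTimeout(() => {
        server.close(() => {
          clearTimeout(force);
          exit(0);
          resolve();
        });
      }, drainDelayMs);
    });
    return pending;
  };
}

// Usage (assumes `server` and `healthService` exist in your application):
// process.on('SIGTERM', createGracefulShutdown({
//   server,
//   setNotReady: (reason) => healthService.setNotReady(reason)
// }));
```

The drain delay should be at least the readiness probe period multiplied by its failure threshold, otherwise the instance can close its listener while the load balancer still considers it ready.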

Comprehensive Health Check Service

The service below is the minimum viable implementation of a proper health check subsystem. The design has three parts: a registration mechanism (where each dependency declares how to check itself), a background runner (so health checks happen continuously, not per-request), and three HTTP endpoints (liveness, readiness, deep).
The background execution is critical. If you check dependencies on every incoming request, your health check becomes a DoS vector: a misbehaving caller can hammer /health and multiply the load on every dependency. Instead, check dependencies on a timer and return cached results.
One detail worth emphasizing: health checks should have aggressive timeouts. A health check that takes 10 seconds to fail makes your “unhealthy” detection take 30+ seconds (since you typically need 3 consecutive failures). Use 2-5 second timeouts for critical dependencies, and keep the timeout proportional to your service’s response-time budget. If your service’s SLA is 500ms, a 5-second health check timeout is 10x your SLA, which is too slow.
// health-check-service.js

const express = require('express');

class HealthCheckService {
  constructor() {
    this.checks = new Map();
    this.status = 'starting';
    this.startTime = Date.now();
  }

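  // Register a health check and start polling it in the background
  registerCheck(name, checkFn, options = {}) {
    const interval = options.interval || 30000;

    this.checks.set(name, {
      fn: checkFn,
      critical: options.critical !== false, // Default to critical
      timeout: options.timeout || 5000,
      interval,
      lastResult: null,
      lastCheck: null
    });

    // Start periodic checking. Use the resolved interval so that checks
    // registered without an explicit interval still run on the default.
    // unref() keeps the timer from holding the process open on shutdown.
    setInterval(() => this.runCheck(name), interval).unref();
  }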

  async runCheck(name) {
    const check = this.checks.get(name);
    if (!check) return null;

    const startTime = Date.now();
    let timer;

    try {
      // Race the check against a timeout so a hung dependency cannot stall us
      const result = await Promise.race([
        check.fn(),
        new Promise((_, reject) => {
          timer = setTimeout(
            () => reject(new Error('Health check timeout')),
            check.timeout
          );
        })
      ]);

      check.lastResult = {
        ...result,
        // Set after the spread so a check's own fields cannot override them
        status: 'healthy',
        latency: Date.now() - startTime
      };
    } catch (error) {
      check.lastResult = {
        status: 'unhealthy',
        error: error.message,
        latency: Date.now() - startTime
      };
    } finally {
      // Clear the timeout so finished checks do not leave timers behind
      clearTimeout(timer);
    }

    check.lastCheck = Date.now();
    return check.lastResult;
  }

  async runAllChecks() {
    const results = {};
    
    for (const [name, check] of this.checks) {
      results[name] = await this.runCheck(name);
    }

    return results;
  }

  getOverallStatus(checkResults) {
    const criticalChecks = Array.from(this.checks.entries())
      .filter(([_, check]) => check.critical)
      .map(([name]) => name);

    const hasUnhealthyCritical = criticalChecks.some(
      name => checkResults[name]?.status === 'unhealthy'
    );

    const hasAnyUnhealthy = Object.values(checkResults).some(
      r => r?.status === 'unhealthy'
    );

    if (hasUnhealthyCritical) return 'unhealthy';
    if (hasAnyUnhealthy) return 'degraded';
    return 'healthy';
  }

  // Express middleware for health endpoints
  createRouter() {
    const router = express.Router();

    // Liveness probe - is the process running?
    router.get('/healthz', (req, res) => {
      res.status(200).json({
        status: 'alive',
        uptime: Date.now() - this.startTime
      });
    });

    // Readiness probe - can we handle traffic?
    router.get('/ready', async (req, res) => {
      if (this.status !== 'ready') {
        return res.status(503).json({
          status: this.status,
          message: 'Service not ready'
        });
      }

      // Quick check of critical dependencies
      const criticalResults = {};
      for (const [name, check] of this.checks) {
        if (check.critical) {
          criticalResults[name] = check.lastResult;
        }
      }

      const hasUnhealthy = Object.values(criticalResults).some(
        r => r?.status === 'unhealthy'
      );

      if (hasUnhealthy) {
        return res.status(503).json({
          status: 'not_ready',
          checks: criticalResults
        });
      }

      res.status(200).json({ status: 'ready' });
    });

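    // Deep health check - detailed status of all dependencies.
    // Serves the cached results from the background runner, so callers
    // hammering this endpoint cannot multiply load on the dependencies.
    router.get('/health', (req, res) => {
      const checkResults = {};
      for (const [name, check] of this.checks) {
        // A check that has not run yet is reported as unknown
        checkResults[name] = check.lastResult ?? { status: 'unknown' };
      }
      const overallStatus = this.getOverallStatus(checkResults);

      // Degraded still returns 200: non-critical failures are informational
      const statusCode = overallStatus === 'unhealthy' ? 503 : 200;

      res.status(statusCode).json({
        status: overallStatus,
        timestamp: new Date().toISOString(),
        uptime: Date.now() - this.startTime,
        version: process.env.APP_VERSION || 'unknown',
        checks: checkResults
      });
    });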

    return router;
  }

  setReady() {
    this.status = 'ready';
  }

  setNotReady(reason) {
    this.status = reason || 'not_ready';
  }
}

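// Usage example
// Assumes `pool` (database pool), `redis` (Redis client), `initializeDatabase`,
// and `warmUpCaches` are defined elsewhere in the application.
const app = express();
const healthService = new HealthCheckService();

// Register database check
healthService.registerCheck('database', async () => {
  const start = Date.now();
  await pool.query('SELECT 1');
  return { latency: Date.now() - start };
}, { critical: true, interval: 30000 });

// Register Redis check
healthService.registerCheck('redis', async () => {
  const start = Date.now();
  await redis.ping();
  return { latency: Date.now() - start };
}, { critical: true, interval: 30000 });

// Register external API check (non-critical: a failure only degrades status)
healthService.registerCheck('payment-api', async () => {
  // The built-in fetch (Node 18+) has no `timeout` option; use an abort signal
  const response = await fetch('https://api.stripe.com/v1/health', {
    signal: AbortSignal.timeout(5000)
  });
  // Throw on failure so runCheck records the check as unhealthy
  if (!response.ok) {
    throw new Error(`payment API returned ${response.status}`);
  }
  return { reachable: true };
}, { critical: false, interval: 60000 });

// Mount health routes
app.use(healthService.createRouter());

// Mark service as ready after initialization.
// CommonJS modules do not support top-level await, so wrap it in a function.
async function start() {
  await initializeDatabase();
  await warmUpCaches();
  healthService.setReady();
}

start().catch((error) => {
  console.error('Startup failed:', error);
  process.exit(1);
});

module.exports = { HealthCheckService };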

Load Balancer Patterns

Kubernetes Service Load Balancing

# kubernetes-lb.yaml

# ClusterIP Service (internal load balancing)
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  type: ClusterIP
  selector:
    app: user-service
  ports:
    - port: 80
      targetPort: 3000
  sessionAffinity: None  # or ClientIP for sticky sessions

---
# Headless Service (for client-side LB with service discovery)
apiVersion: v1
kind: Service
metadata:
  name: user-service-headless
spec:
  clusterIP: None
  selector:
    app: user-service
  ports:
    - port: 3000

---
# Deployment with health checks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: user-service:latest
          ports:
            - containerPort: 3000
          
          # Liveness probe - restart if fails
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
            
          # Readiness probe - remove from LB if fails
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
            
          # Startup probe - for slow starting containers
          startupProbe:
            httpGet:
              path: /healthz
              port: 3000
            failureThreshold: 30
            periodSeconds: 10

Envoy Proxy Configuration

# envoy-lb.yaml - Advanced L7 load balancing

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/api/users"
                          route:
                            cluster: user_service
                            timeout: 30s
                            retry_policy:
                              retry_on: "5xx,reset,connect-failure"
                              num_retries: 3
                              per_try_timeout: 10s

  clusters:
    - name: user_service
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST  # Fewest active requests (power of two choices by default)
      
      # Circuit breaker
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 1000
            max_pending_requests: 1000
            max_requests: 1000
            max_retries: 3
            
      # Health checking
      health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 3
          healthy_threshold: 2
          http_health_check:
            path: "/ready"  # poll readiness, not the deep health endpoint
            
      # Outlier detection (automatic ejection of unhealthy hosts)
      outlier_detection:
        consecutive_5xx: 5
        interval: 10s
        base_ejection_time: 30s
        max_ejection_percent: 50
        
      load_assignment:
        cluster_name: user_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: user-1.internal
                      port_value: 3000
              - endpoint:
                  address:
                    socket_address:
                      address: user-2.internal
                      port_value: 3000
              - endpoint:
                  address:
                    socket_address:
                      address: user-3.internal
                      port_value: 3000

Interview Questions

Answer:
Server-side (e.g., NGINX, HAProxy):
  • Single point for routing
  • Simple clients
  • Extra network hop
  • Centralized control
Client-side (e.g., Ribbon, custom):
  • Client decides which server
  • No extra hop
  • More complex clients
  • Better for microservices
When to use each:
  • Server-side: External traffic, legacy clients
  • Client-side: Service-to-service within cluster
  • Hybrid: Edge LB + client-side internally
Answer:
Use cases:
  • Cache servers (minimize cache misses on scale)
  • Session affinity without IP hash
  • Partitioned data (same key → same server)
How it works:
  • Servers and keys mapped to same hash ring
  • Key routed to next server clockwise
  • Adding/removing server affects only neighbors
Virtual nodes:
  • Multiple positions per server for balance
  • 100-200 virtual nodes per physical server
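The ring and virtual-node mechanics above can be sketched in a few dozen lines. This is an illustrative implementation, not a production library: the `HashRing` class, the `hash32` helper, the `node#i` virtual-node naming, and the choice of SHA-256 truncated to 32 bits are all assumptions made for the example.

```javascript
const crypto = require('node:crypto');

// Map a string to a 32-bit position on the ring (assumed hash choice).
function hash32(value) {
  return crypto.createHash('sha256').update(value).digest().readUInt32BE(0);
}

class HashRing {
  constructor(nodes = [], virtualNodes = 150) {
    this.virtualNodes = virtualNodes;
    this.ring = []; // sorted list of { point, node }
    nodes.forEach((node) => this.addNode(node));
  }

  addNode(node) {
    // Each physical node gets many positions so ownership evens out.
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.push({ point: hash32(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node) {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  getNode(key) {
    if (this.ring.length === 0) return null;
    const target = hash32(key);
    // Binary search for the first ring position at or after the key's hash.
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < target) lo = mid + 1;
      else hi = mid;
    }
    // Wrap around to the first entry when the key hashes past the last point.
    return this.ring[lo % this.ring.length].node;
  }
}

// Usage: count how many keys move when a sixth node joins.
const ring = new HashRing(['cache-1', 'cache-2', 'cache-3', 'cache-4', 'cache-5']);
const keys = Array.from({ length: 2000 }, (_, i) => `user:${i}`);
const before = keys.map((key) => ring.getNode(key));
ring.addNode('cache-6');
const moved = keys.filter((key, i) => ring.getNode(key) !== before[i]).length;
console.log(`moved ${moved} of ${keys.length} keys`);
```

The property worth noticing is that every key that moves lands on the new node; no key is shuffled between the existing five.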
Answer:
Liveness:
  • “Is the process stuck?”
  • Failure → container restart
  • Should be simple (no dependencies)
  • Example: Can the HTTP server respond?
Readiness:
  • “Can it handle traffic?”
  • Failure → remove from load balancer
  • Can check dependencies
  • Example: Is database connection ready?
Common mistake: Using deep checks for liveness causes cascading restarts when a dependency is down. If your liveness check queries the database, and the database is slow, Kubernetes restarts your pods. Now you have fewer pods, more load on remaining ones, they time out too, and Kubernetes restarts them as well. Within minutes, every pod is in a restart loop — all because your liveness check was too aggressive. Keep liveness checks trivial (can the HTTP server respond?) and put dependency checks in readiness only.
Answer:
Simple but effective algorithm:
  1. Pick 2 random servers
  2. Choose the one with fewer connections
Why it works:
  • Avoids herd behavior (all clients picking same “best” server)
  • O(1) complexity (no sorting)
  • Statistical guarantees: max load ~log(log(n))
Used by: Envoy, HAProxy, Netflix Zuul
Better than round-robin because it considers actual load.
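The two steps above translate almost directly into code. A minimal sketch, assuming each server object carries an `activeConnections` counter that the caller maintains; `pickTwoChoices` and the injectable `random` parameter are hypothetical names introduced for the example.

```javascript
// Power of two choices: sample two distinct servers, keep the less loaded one.
function pickTwoChoices(servers, random = Math.random) {
  if (servers.length === 0) return null;
  if (servers.length === 1) return servers[0];

  const i = Math.floor(random() * servers.length);
  // Draw the second index from the remaining n-1 slots so it differs from i.
  let j = Math.floor(random() * (servers.length - 1));
  if (j >= i) j += 1;

  const a = servers[i];
  const b = servers[j];
  return a.activeConnections <= b.activeConnections ? a : b;
}

// Usage: the caller increments the counter on dispatch and decrements it
// when the response completes.
const servers = [
  { id: 'a', activeConnections: 10 },
  { id: 'b', activeConnections: 5 },
  { id: 'c', activeConnections: 1 }
];
const chosen = pickTwoChoices(servers);
chosen.activeConnections += 1;
console.log(`routing to ${chosen.id}`);
```

Because only two servers are inspected per request, the cost stays constant regardless of pool size, and the randomness prevents every client from converging on the same least-loaded server.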

Chapter Summary

Key Takeaways:
  • Server-side LB for external traffic, client-side for internal
  • Algorithm choice depends on workload: Round-robin for simple, Least-connections for varying load
  • Consistent hashing for caching and stateful services
  • Implement both liveness and readiness probes
  • Health checks should have appropriate timeouts
  • Use circuit breakers with load balancing for resilience
Next Chapter: Migration Patterns - Strangler Fig, Branch by Abstraction, and more.

Interview Questions with Structured Answers

Strong Answer Framework:
  1. Identify the connection type. Long-lived connections (HTTP/2, gRPC, WebSocket) stay attached to the old backend until the client disconnects. A 20-minute HTTP/2 connection is expected, not a bug.
  2. Check health check state propagation. The load balancer may still consider old instances healthy if the health check interval is high or the failure threshold is large.
  3. Inspect the deployment mechanism. Rolling deploy? Blue/green? Did the old pods actually terminate, or is terminationGracePeriodSeconds still bleeding them out?
  4. Look at session affinity. If stickiness is enabled, the 10% of affected users might be sticky to old instances that have not yet rotated.
  5. Check cache layers. DNS TTL, client-side service discovery caches, and sidecar proxies may be holding stale endpoint lists.
  6. Correlate the 10% to a signal. Is it 10% of connections, 10% of users, 10% of specific clients? The specificity tells you which layer to inspect.
Real-World Example: Slack (2020 incident postmortem). Slack had a case where gRPC clients with long-lived HTTP/2 connections stayed pinned to old backends for hours after deploy because they did not rotate connections. The fix was adding max-connection-age of 30 minutes on the server side, forcing clients to reconnect periodically and rebalance. The pattern appears in every company that runs gRPC at scale.
Senior Follow-up Questions:
Q: “How would you force clients on a long-lived HTTP/2 connection to reconnect during a deploy?”
A: Server-side, you can send a GOAWAY frame, which tells the client “finish your in-flight requests but open a new connection for future requests.” Envoy does this via max_connection_duration in its common HTTP protocol options; gRPC servers have MaxConnectionAge (Go) or maxConnectionAge (Java). This is the canonical solution. Alternative: force connection close on shutdown, but that causes in-flight request failures if not combined with connection draining.
Q: “The 10% is really 10% of users, not connections. What does that tell you?”
A: That signals session affinity. If 10% of users are getting old behavior, those users are probably sticky to specific instances that have not rotated yet. Check for: cookie-based stickiness, IP hash affinity, or application-level sharding where user ID maps to a specific shard. Fix: ensure affinity rotates along with deploys (invalidate affinity cookies on deploy) or accept that sticky users will lag the rollout by the session timeout.
Q: “Your deploy was blue/green with a traffic switch at the load balancer. How is 10% still hitting blue?”
A: DNS TTL is the usual culprit. If your traffic switch updated DNS records with a 5-minute TTL, clients that resolved before the switch keep connecting to blue for up to 5 minutes (or longer if they cache DNS aggressively; the JVM, for example, caches successful lookups forever when a security manager is installed unless you set networkaddress.cache.ttl). For instant cutovers, do the traffic switch at the load balancer level (weighted listener, Envoy RDS update), not at DNS. Alternative: use a service mesh where the mesh itself handles endpoint changes in milliseconds.
Common Wrong Answers:
  • “The deploy did not complete; some pods are still running the old version.” Possible but lazy. Kubernetes rolling deploys are visible in kubectl rollout status; if the rollout finished, the old pods are gone. This answer ignores the hard cases where deploy completed but traffic lingers.
  • “It is a cache somewhere, just bust the cache.” Too vague. Which cache? DNS, client service discovery, application cache, CDN, browser? A strong answer names the specific cache layer and its TTL.
Further Reading:
  • “gRPC Load Balancing” — official gRPC documentation on client-side LB and connection lifecycle
  • “Envoy Proxy Documentation: Connection Management” — covers max_connection_duration and graceful termination
  • “DNS is still the protocol of the internet” — Julia Evans blog post explaining why DNS caching surprises engineers during deploys
Strong Answer Framework:
  1. Rule out application-level causes first. Slow queries, GC pauses, cold caches all produce long tails that are not the load balancer’s fault.
  2. Measure per-instance p99. If one instance has a 2x higher p99 than others, traffic distribution is uneven even if CPU averages match.
  3. Check connection pooling under L4. If your LB is L4 and clients use HTTP/1.1 keepalive, the connections distribute but requests within each connection do not.
  4. Inspect queue depth at each backend. Two backends with the same CPU can have very different queue depths if one is upstream of a slower dependency or has a different JIT state.
  5. Consider algorithmic herding. Least-connections can oscillate; least-response-time can create feedback loops. P2C breaks both.
  6. Look at outlier detection. If a backend is 3x slower than the median, it should be ejected. Envoy’s outlier detection and Kubernetes readiness probes should catch this.
Real-World Example: Shopify (around 2019). Shopify documented tail latency issues caused by round-robin load balancing where one backend had a subtly slower disk. The instance did not fail health checks, but its p99 was 300ms while others were 100ms. Round-robin gave it equal traffic, so 20% of requests saw 300ms even though the service “looked healthy.” Their fix: EWMA-based (exponentially weighted moving average) response-time load balancing that routes fewer requests to slower instances without creating oscillation.
Senior Follow-up Questions:
Q: “How would you detect and automatically eject a ‘gray failure’ instance that is slow but not failing health checks?”
A: Outlier detection at the load balancer layer. Envoy’s outlier detection config ejects instances that exceed a threshold for consecutive 5xx, consecutive gateway failures, or success-rate deviation. These detectors key off errors rather than latency, so for latency-based ejection you need to pair success-rate-based ejection with per-instance p99 tracking, for example by setting request timeouts tight enough that slow responses surface as failures. A service mesh like Istio exposes the outlier detection settings. Alternative: custom sidecar logic that reports per-instance latency to the LB control plane, which adjusts weights downward for slow instances.
Q: “You find that one instance always has higher p99. It is on a shared node with a noisy neighbor. How do you structurally prevent this?”
A: Three options. First, use dedicated nodes (Kubernetes node taints, dedicated node pools) so critical services do not share hardware. Cost: higher infrastructure spend. Second, request guaranteed-QoS pods with CPU/memory limits equal to requests, so the kernel enforces isolation. Cost: less bin-packing efficiency. Third, use pod anti-affinity to spread replicas across nodes, so a single noisy neighbor only affects one replica out of many. Cost: harder to schedule in small clusters. In practice, combine all three for tier-1 services.
Q: “If all else fails, can you just over-provision and ignore the problem?”
A: Over-provisioning reduces per-instance load, which reduces tail latency effects, but it does not fix the imbalance. If one instance is 2x slower intrinsically, even with half the load that instance still has 2x p99. You might mask the problem until traffic grows 3x and the slow instance becomes a bottleneck again. Over-provisioning is a valid short-term mitigation while you address the root cause, not a strategy.
Common Wrong Answers:
  • “Use least-response-time, that will solve it.” Dangerous. Pure least-response-time creates feedback oscillation. Use P2C or EWMA-weighted instead.
  • “Add more replicas.” Does not address the algorithmic or per-instance issues. You need to fix the imbalance, not add noise to dilute it.
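To make the EWMA-weighted alternative concrete, here is a minimal sketch of how smoothed latency can be turned into routing weights. The names `EwmaLatency` and `inverseLatencyWeights`, the default smoothing factor, and the inverse-latency weighting rule are assumptions for illustration; real load balancers typically combine latency with in-flight request counts.

```javascript
// Exponentially weighted moving average of observed response times.
class EwmaLatency {
  constructor(alpha = 0.2) {
    this.alpha = alpha; // higher alpha reacts faster but is noisier
    this.value = null;
  }

  observe(ms) {
    // The first sample seeds the average; later samples are blended in.
    this.value = this.value === null
      ? ms
      : this.alpha * ms + (1 - this.alpha) * this.value;
    return this.value;
  }
}

// Turn per-instance smoothed latency into weights that sum to 1.
// A slower instance still receives some traffic, so its average keeps
// updating and it can recover without manual intervention.
function inverseLatencyWeights(latencies) {
  const inverse = Object.entries(latencies).map(
    ([id, ms]) => [id, 1 / Math.max(ms, 1)] // clamp to avoid division by zero
  );
  const total = inverse.reduce((sum, [, weight]) => sum + weight, 0);
  return Object.fromEntries(inverse.map(([id, weight]) => [id, weight / total]));
}

// Usage: an instance at 500ms gets three times the weight of one at 1500ms.
console.log(inverseLatencyWeights({ fast: 500, slow: 1500 }));
```

Smoothing is what prevents the oscillation that pure least-response-time suffers from: a single fast or slow response nudges the weight instead of flipping the routing decision.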
Further Reading:
  • “The Tail at Scale” by Jeffrey Dean and Luiz Barroso — the canonical Google paper on tail latency
  • “Envoy Outlier Detection” — docs cover the ejection algorithm and tuning
  • “Load Balancing at Shopify” — engineering blog posts on EWMA and weighted load balancing
Strong Answer Framework:
  1. Identify the affinity scheme. IP-hash? Cookie-hash? Header-hash? Each has distinct failure modes.
  2. Check the input distribution. IP-hash buckets by source IP; if your traffic comes through a handful of corporate NATs or mobile carrier gateways, all users behind that NAT map to one instance.
  3. Estimate the skew. How many unique keys map to each instance? If one instance gets 100x the keys of another, the hash distribution is broken.
  4. Consider whether stickiness is needed at all. Session affinity is often a crutch for state that should be externalized (session store, Redis, JWT).
  5. Choose a mitigation. Bucket by session ID (higher-cardinality input), fall back to round-robin for the hot instance, or remove stickiness entirely.
Real-World Example: Zoom (2020 during COVID surge). Zoom had session affinity at the meeting-ID level to pin users to the same media server for a call. When a handful of massive meetings (tens of thousands of attendees each) all started at once, those meetings’ servers maxed out while other servers sat idle. Fix was two-layered: shard large meetings across multiple servers (session affinity at a sub-meeting level) and implement graceful overload handling (reject new connections to an over-capacity server, let the LB pick another).
Senior Follow-up Questions:
Q: “Can you remove session affinity without breaking the application?”
A: Yes, if application state is externalized. Move session state to Redis or a signed JWT, move file upload state to S3 with a signed URL, and move database transactions to a connection pooler that handles routing. Once state is external, affinity is not needed: any instance can serve any request. The hard cases are WebSocket connections (physically bound to an instance by the TCP connection) and in-memory caches that take minutes to warm up. For WebSockets, horizontal scaling with a pub-sub like Redis handles it. For warm caches, accept a cold-start penalty or use a shared distributed cache instead.
Q: “You need to keep session affinity for WebSockets. How do you prevent the hot-instance problem?”
A: Three defenses. First, hash by a high-cardinality key (connection ID, not user ID or IP) so you do not create artificial clusters. Second, implement connection caps per instance: once an instance reaches its max (say 5000 WebSockets), new connection requests get routed elsewhere even if the hash would normally land there. Third, monitor per-instance variance and alert if any instance exceeds 2x the fleet median; this catches the problem before users notice.
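The second defense, a per-instance connection cap with overflow to another instance, can be sketched as follows. The function name `pickWithConnectionCap`, the `connections` field, and the use of SHA-256 for the affinity hash are assumptions for the example; the overflow rule here simply walks to the next instance in list order.

```javascript
const crypto = require('node:crypto');

// Stable 32-bit hash of the affinity key (assumed hash choice).
function hashKey(key) {
  return crypto.createHash('sha256').update(key).digest().readUInt32BE(0);
}

// Prefer the instance the hash selects, but spill over to the next one
// when the preferred instance is already at its connection cap.
function pickWithConnectionCap(key, instances, maxConnections) {
  if (instances.length === 0) return null;

  const start = hashKey(key) % instances.length;
  for (let offset = 0; offset < instances.length; offset++) {
    const candidate = instances[(start + offset) % instances.length];
    if (candidate.connections < maxConnections) return candidate;
  }
  return null; // every instance is full: shed load or scale out
}

// Usage: the same key keeps landing on the same instance until it fills up.
const instances = [
  { id: 'ws-1', connections: 0 },
  { id: 'ws-2', connections: 0 },
  { id: 'ws-3', connections: 0 }
];
const target = pickWithConnectionCap('conn-42', instances, 5000);
target.connections += 1;
console.log(`conn-42 -> ${target.id}`);
```

Returning `null` when every instance is full makes the overload explicit, so the caller can reject the connection instead of silently overloading a hot instance.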
Q: “If you had designed this system from scratch, would you use session affinity at all?”
A: Reluctantly, yes, for specific cases: WebSocket connections (physically bound), in-memory warm caches where cold starts are expensive, and stateful streaming protocols. For stateless HTTP APIs, no. State in any form (local caches, session storage) scales worse than external state, and affinity is a tax paid to hide that. My preference: externalize state first, treat instances as fungible, and only introduce affinity for the genuinely stateful protocols where it is unavoidable.
Common Wrong Answers:
  • “Add more instances and the load will even out.” Does not fix the hash distribution; you just get more idle instances while the hot one stays hot.
  • “Disable session affinity.” May be the right answer in the end, but only if the application can handle it. Blindly disabling affinity often breaks features (logged-out users, lost shopping carts) in ways that are hard to diagnose.
Further Reading:
  • “Load Balancing is Impossible” by Tyler McMullen — talk explaining why perfect distribution is theoretically impossible
  • “Session Affinity in Kubernetes” — official Kubernetes docs on clientIP affinity and its limitations
  • “Zoom’s Architecture” — various 2020-2021 blog posts about how Zoom scaled during the pandemic

Interview Deep-Dive

Strong Answer:
With round-robin, each instance gets 20% of traffic regardless of performance. If one instance has 3x the latency of the others, 20% of all requests are slow. From the user’s perspective, one in five requests takes 3 seconds instead of 1 second, so every percentile above P80 reflects the slowest instance’s latency. This is worse than it sounds because many users retry slow requests, which adds even more load to the system.
The immediate fix is switching to least-connections load balancing. Least-connections sends the next request to the instance with the fewest active connections. The slow instance accumulates connections (because responses take longer), so it naturally receives fewer new requests. The fast instances finish requests quickly, free up connections, and get more traffic. The system self-balances based on actual performance.
For an even better approach: weighted least-connections with adaptive weights. The load balancer tracks each instance’s response time and adjusts weights accordingly. An instance that consistently responds in 500ms gets 3x the weight of an instance responding in 1500ms. NGINX Plus supports this with its least_time method, and Linkerd balances on an EWMA (Exponentially Weighted Moving Average) of latency.
The root cause fix: investigate why that instance is slow. Common causes: the instance landed on a host with a noisy neighbor (another pod consuming CPU), the instance has a different configuration (wrong JVM heap size, missing optimization flags), or the instance’s local cache is cold while the others are warm. In Kubernetes, I would check if the pod is scheduled on a node with resource pressure using kubectl describe node.
Follow-up: “How does load balancing work differently for gRPC compared to REST?”
gRPC uses HTTP/2 with long-lived connections. A traditional L4 load balancer (AWS NLB, kube-proxy) balances at connection time, not per-request. Once a gRPC client opens a connection to one backend, all requests go to that backend. With 5 backends and 3 clients, you might have all 3 clients connected to the same 2 backends while 3 backends sit idle.
The fix is L7 load balancing that understands HTTP/2 frames and can route individual requests within a connection. Envoy (in Istio or standalone) does this natively. Alternatively, use client-side load balancing: gRPC has built-in support for resolving multiple backends (via DNS or a custom resolver) and distributing requests across them with the round_robin policy. With round_robin, the client opens connections to all backends and rotates requests; the default pick_first policy, by contrast, sticks to a single backend.
Strong Answer:

Consistent hashing distributes keys across cache nodes in a way that minimizes key redistribution when nodes are added or removed. In traditional hash-based routing (key % number_of_nodes), adding one node changes almost every key’s assignment: a 5-node cluster adding a 6th node causes roughly 83% of keys to remap. With consistent hashing, adding a node remaps only about 1/(N+1) of the keys (roughly 17% when going from 5 to 6 nodes), and removing one of N nodes remaps about 1/N.

The mechanism: imagine a circular ring from 0 to 2^32. Each cache node is hashed to a position on the ring. Each key is also hashed to a position, and it is assigned to the next node clockwise from its position. When a node is added, only the keys between the new node and its predecessor are remapped. When a node is removed, only its keys move to the next clockwise node.

In a microservices context, this matters for Memcached pools, client-side sharded Redis, and any distributed cache. When a cache node goes down during a traffic spike, you want minimal cache misses. With consistent hashing, only about 1/N of keys become misses (they need to be re-fetched from the database). Without it, nearly all keys become misses, triggering the thundering herd problem.

The practical enhancement: virtual nodes. Instead of one position on the ring per physical node, each node gets 100-200 virtual positions. Without virtual nodes, the ring can become unbalanced, with some nodes owning disproportionately more keys. With virtual nodes, even small clusters get a close-to-uniform key distribution.

Real-world application: when I set up a Redis cluster for a microservices caching layer, I use Redis Cluster’s built-in hash slots (16384 slots distributed across nodes). Strictly speaking this is a fixed slot map rather than a hash ring, but it gives the same benefit: when I add a shard, Redis migrates only the affected slots. Cluster-aware client libraries follow the MOVED and ASK redirects during migration, so the transition is transparent to the application.

Follow-up: “What about hot keys — when one key is accessed 1000x more than others and the node holding that key becomes a bottleneck?”

Hot keys are the Achilles heel of consistent hashing. The key always lives on one node, and no amount of load balancing fixes that. There are three common solutions:

  • Read replicas: the hot key is replicated to multiple nodes, and reads are distributed across the replicas.
  • Local caching: each service instance caches the hot key in process memory with a short TTL, which can cut Redis reads for that key by 90% or more.
  • Key splitting: instead of one key “product:123”, create “product:123:shard:0” through “product:123:shard:9” and read from a randomly chosen shard. This spreads the hot key across up to 10 nodes. The trade-off is write complexity, because writes must update all 10 shards.
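The ring with virtual nodes described in the answer above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production implementation: `HashRing`, `hash32`, the node names, the `user:` keys, and the choice of 150 virtual nodes are all hypothetical, and the hash is a simple FNV-1a with an extra mixing step chosen only to avoid imports.

```javascript
// Hypothetical 32-bit string hash: FNV-1a plus a murmur3-style finalizer.
function hash32(str) {
  let h = 2166136261;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  h ^= h >>> 16; h = Math.imul(h, 0x85ebca6b);
  h ^= h >>> 13; h = Math.imul(h, 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // force unsigned
}

class HashRing {
  constructor(nodes, vnodes = 150) {
    this.vnodes = vnodes;
    this.points = []; // sorted list of { pos, node }
    nodes.forEach((n) => this.add(n));
  }

  // Place `vnodes` virtual positions for this physical node on the ring.
  add(node) {
    for (let i = 0; i < this.vnodes; i++) {
      this.points.push({ pos: hash32(`${node}#${i}`), node });
    }
    this.points.sort((a, b) => a.pos - b.pos);
  }

  // Owner = first virtual position clockwise from the key's hash.
  lookup(key) {
    const h = hash32(key);
    let lo = 0;
    let hi = this.points.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.points[mid].pos < h) lo = mid + 1;
      else hi = mid;
    }
    return this.points[lo % this.points.length].node; // wrap past the end
  }
}

const keys = Array.from({ length: 10000 }, (_, i) => `user:${i}`);
const ring = new HashRing(["node-1", "node-2", "node-3", "node-4", "node-5"]);
const ownersBefore = keys.map((k) => ring.lookup(k));
ring.add("node-6");
const ownersAfter = keys.map((k) => ring.lookup(k));

// Fraction of keys whose owner changed when the 6th node joined.
const ringMoved =
  ownersAfter.filter((o, i) => o !== ownersBefore[i]).length / keys.length;
// Same comparison for naive modulo routing (5 nodes -> 6 nodes).
const modMoved =
  keys.filter((k) => hash32(k) % 5 !== hash32(k) % 6).length / keys.length;

console.log(`ring moved: ${(ringMoved * 100).toFixed(1)}%`);
console.log(`mod-N moved: ${(modMoved * 100).toFixed(1)}%`);
```

Running this should show the ring moving roughly one sixth of the keys while modulo routing moves most of them. Every key that moves on the ring lands on the new node, which is what keeps cache misses bounded during scaling.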
Strong Answer:

In load balancing, health checks determine whether an instance should receive traffic. A healthy instance gets requests; an unhealthy instance is removed from the pool. The subtlety is that “healthy” is not binary: an instance can be alive but not ready, or ready for some requests but degraded for others.

Liveness checks answer: “Is this process running and not deadlocked?” If liveness fails, the instance should be restarted (in Kubernetes, the kubelet kills and restarts the container). Liveness checks should be simple: “can you respond to HTTP on this port?” They should NOT call external dependencies. If your liveness check verifies the database connection and the database is down, Kubernetes kills your perfectly functional application pods, making the outage worse.

Readiness checks answer: “Can this instance handle requests right now?” If readiness fails, the instance is removed from the load balancer pool but NOT restarted. Readiness checks should verify that the instance can serve its purpose: the database connection is established, caches are warm, and dependent services are reachable. During a rolling deployment, new pods are not “ready” until they have initialized. During a downstream outage, pods might become “not ready” if their circuit breaker is open.

The mistake I see most often: combining liveness and readiness into one health check. When a downstream service goes down, the combined check fails, Kubernetes kills the pods (liveness failure), new pods start but also fail the check (the downstream is still down), and now you have a crash loop that eliminates your service entirely. With separate checks, the pods stay alive (liveness passes) but stop receiving traffic (readiness fails) until the downstream recovers.

For load balancer configuration outside Kubernetes (NGINX, HAProxy), I implement active health checks (the load balancer periodically probes each backend) with a failure threshold (3 consecutive failures before removal) and a recovery threshold (2 consecutive successes before re-addition). This prevents flapping when a single health check times out due to network jitter.

Follow-up: “How do you implement a ‘degraded’ health state where the instance can handle some requests but not all?”

I implement a custom health endpoint that returns different responses based on the instance’s state. HTTP 200 means fully healthy. HTTP 429 means “I am overloaded, reduce my traffic weight.” HTTP 503 means “remove me from the pool.” The load balancer interprets 429 by reducing the weight (sending fewer requests) rather than removing the instance entirely. This allows graceful degradation: an instance experiencing high memory pressure can signal “send me fewer requests” rather than going from 100% traffic to 0% traffic. Envoy supports this with its “degraded” health status.

For simpler load balancers, I approximate the same effect by having the instance’s readiness check return 503 intermittently (every other check fails). This roughly halves the instance’s traffic only when the load balancer removes and re-adds a backend after a single check result; with consecutive-failure thresholds like the ones above, alternating failures never trigger removal, so the thresholds have to be lowered for this trick to work.
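The failure and recovery thresholds used for active health checks can be sketched as a small state machine. This is an illustrative sketch only: `BackendHealth` and its option names are hypothetical, the probe itself (an HTTP call on a timer) is left out, and the defaults simply mirror the 3-failure and 2-success thresholds from the answer.

```javascript
// Hypothetical per-backend health tracker; feed it one probe result at a time.
class BackendHealth {
  constructor({ failureThreshold = 3, recoveryThreshold = 2 } = {}) {
    this.failureThreshold = failureThreshold;
    this.recoveryThreshold = recoveryThreshold;
    this.inPool = true; // backends start in rotation
    this.consecutiveFailures = 0;
    this.consecutiveSuccesses = 0;
  }

  // Record one probe result and return whether the backend is in the pool.
  record(probeOk) {
    if (probeOk) {
      this.consecutiveSuccesses += 1;
      this.consecutiveFailures = 0;
      if (!this.inPool && this.consecutiveSuccesses >= this.recoveryThreshold) {
        this.inPool = true; // enough successes in a row: re-add
      }
    } else {
      this.consecutiveFailures += 1;
      this.consecutiveSuccesses = 0;
      if (this.inPool && this.consecutiveFailures >= this.failureThreshold) {
        this.inPool = false; // enough failures in a row: remove
      }
    }
    return this.inPool;
  }
}

const backend = new BackendHealth();
// One timed-out probe caused by jitter does not remove the backend.
console.log([true, false, true].map((ok) => backend.record(ok)));
// -> [true, true, true]
// Three failures in a row remove it; two successes in a row bring it back.
console.log([false, false, false, true, true].map((ok) => backend.record(ok)));
// -> [true, true, false, false, true]
```

Resetting the opposite counter on every result is what makes the thresholds "consecutive": a single success in the middle of a run of failures restarts the count, which is the behaviour that suppresses flapping.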