# 23. Load Balancing Deep Dive

> Master client-side vs server-side load balancing, algorithms, health checks, and advanced patterns for microservices

Load balancing is critical for distributing traffic across service instances efficiently and reliably. Think of a load balancer like a restaurant host seating guests -- a bad host sends everyone to the same table while others sit empty. A good host considers which section has capacity, which server is fastest, and whether a particular guest always sits in the same spot. The choice of algorithm matters far more than most teams realize: the wrong algorithm under high load can turn a 50ms request into a 5-second timeout.

The reason load balancing deserves an entire chapter is that it sits at the fault line between two very different worlds. On one side is capacity planning: do your instances have enough CPU to handle expected load? On the other side is latency engineering: are requests reaching the fastest instance right now? A naive load balancer solves the first problem but ignores the second -- it will happily keep routing a full share of traffic (20% in a five-instance pool) to a dying instance because "it is still in the pool." A sophisticated load balancer treats every instance as a live, shifting quantity and routes based on real-time behavior, not static configuration. The patterns in this chapter exist because teams have learned, often painfully, that assumptions about instance health do not hold under load.

<Info>
  **Learning Objectives:**

  * Understand client-side vs server-side load balancing
  * Master load balancing algorithms
  * Implement health checking strategies
  * Build intelligent load balancing with Node.js
</Info>

***

## Client-Side vs Server-Side Load Balancing

Before choosing an algorithm, you must first choose **where the load balancing happens**. This is one of the most consequential architectural decisions in microservices, and yet most teams default to whatever their platform provides (Kubernetes Service -> server-side, gRPC -> client-side) without understanding the tradeoffs. Server-side load balancing is simpler: clients talk to a single endpoint and the load balancer decides where traffic goes. Client-side load balancing is more powerful: the client itself knows about all backend instances and decides directly, eliminating one network hop and enabling per-request routing intelligence.

The catch with server-side: you have added a proxy that every request traverses. If that proxy is slow, has bugs, or is overloaded, all your traffic suffers. The catch with client-side: every service that calls the target service needs the load balancing logic embedded in it -- a Python service, a Go service, and a Node service all need compatible client-side load balancers. This is why service meshes emerged, built on sidecar proxies such as Envoy (created at Lyft) and Linkerd (from Buoyant): they give you client-side load balancing without forcing every language to reimplement it. In modern microservices, the answer is often "both" -- server-side at the edge (for external traffic) and client-side or mesh-based internally.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOAD BALANCING APPROACHES                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  SERVER-SIDE LOAD BALANCING                                                 │
│  ──────────────────────────────                                             │
│                                                                              │
│  ┌──────────┐         ┌──────────────┐         ┌──────────────┐            │
│  │  Client  │────────▶│ Load Balancer│────────▶│ Service A-1  │            │
│  └──────────┘         │  (nginx/HAP) │         ├──────────────┤            │
│                       └──────────────┘────────▶│ Service A-2  │            │
│                                       └───────▶│ Service A-3  │            │
│                                                └──────────────┘            │
│                                                                              │
│  Pros: Simple for clients, centralized control                              │
│  Cons: Single point of failure, extra hop, limited to L4/L7                │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  CLIENT-SIDE LOAD BALANCING                                                 │
│  ────────────────────────────                                               │
│                                                                              │
│  ┌──────────┐         ┌──────────────┐                                     │
│  │  Client  │◀────────│   Service    │         ┌──────────────┐            │
│  │  + LB    │         │   Registry   │────────▶│ Service A-1  │            │
│  │  Logic   │────────────────────────────────▶│ Service A-2  │            │
│  └──────────┘         └──────────────┘└───────▶│ Service A-3  │            │
│                                                └──────────────┘            │
│                                                                              │
│  Pros: No extra hop, distributed (no SPOF), more intelligent               │
│  Cons: Complex clients, language-specific implementations                   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

<Warning>
  **Caveats & Common Pitfalls: L4 vs L7 Confusion.**

  * **Treating L4 and L7 as interchangeable.** L4 (TCP) load balancers balance connections; L7 (HTTP) load balancers balance requests. Running gRPC behind an L4 balancer (AWS NLB, bare kube-proxy) means all requests over a single HTTP/2 connection land on one backend. With only a couple of long-lived client connections, two backends take all the traffic while three sit idle.
  * **Choosing L4 "for performance" without measuring.** L4 is faster per packet, but L7 does connection pooling, header-based routing, and retries that an L4 balancer cannot. The overhead of L7 at 10k RPS is negligible on modern hardware; the features are not.
  * **Using L7 for pure TCP workloads.** Running a Postgres connection through an L7 HTTP-aware proxy breaks things in subtle ways (proxy terminates the TLS and forwards plaintext, headers get rewritten, long-lived connections get reset during reloads).
  * **Mixing layers in one deploy.** Some teams put L4 in front of L7 in front of L4, creating three proxy layers, each of which can have misconfigured health checks, timeouts, and IP preservation.
</Warning>

<Tip>
  **Solutions & Patterns: Choose the right layer.**

  * **HTTP, gRPC, WebSocket: use L7** (Envoy, NGINX, HAProxy). L7 sees individual requests within an HTTP/2 connection and balances them across backends.
  * **TCP, UDP, database protocols: use L4** (AWS NLB, kube-proxy, HAProxy TCP mode). L4 does not try to inspect payloads.
  * **The TLS termination decision is a load-balancer-layer decision.** Terminate TLS at the edge L7 when you need header-based routing; terminate at the application when you need end-to-end encryption for compliance.
  * **Run both if you need both.** An L4 NLB for raw TCP services, an L7 ALB/Envoy for HTTP services, behind the same domain using DNS or host-header routing.
</Tip>
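
The connection-versus-request distinction can be made concrete with a tiny simulation. This is only a sketch with made-up names (`simulateL4`, `simulateL7`, the `A-1`..`A-5` backends); it assumes each client holds one long-lived HTTP/2 connection and does not talk to any real proxy.

```javascript
// Sketch: connection-level (L4) vs request-level (L7) balancing.
// All names are illustrative; nothing here talks to a real proxy.

const BACKENDS = ['A-1', 'A-2', 'A-3', 'A-4', 'A-5'];

// L4: the backend is chosen once, when the TCP connection is opened.
// Every request multiplexed over that connection goes to the same backend.
function simulateL4(clientCount, requestsPerClient) {
  const counts = Object.fromEntries(BACKENDS.map(b => [b, 0]));
  for (let c = 0; c < clientCount; c++) {
    const backend = BACKENDS[c % BACKENDS.length]; // picked per connection
    counts[backend] += requestsPerClient;
  }
  return counts;
}

// L7: the proxy sees individual requests and picks a backend for each one.
function simulateL7(clientCount, requestsPerClient) {
  const counts = Object.fromEntries(BACKENDS.map(b => [b, 0]));
  let next = 0;
  for (let c = 0; c < clientCount; c++) {
    for (let r = 0; r < requestsPerClient; r++) {
      counts[BACKENDS[next % BACKENDS.length]]++; // picked per request
      next++;
    }
  }
  return counts;
}

console.log('L4:', simulateL4(2, 500)); // two backends take everything
console.log('L7:', simulateL7(2, 500)); // all five share the load
```

With two clients sending 500 requests each, the L4 model puts all 1,000 requests on two backends, while the L7 model gives each of the five backends 200.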

### Server-Side: NGINX Configuration

NGINX has been the default choice for server-side HTTP load balancing for over a decade, and the config below demonstrates why. Notice how the configuration is declarative -- you describe the upstream pool, the algorithm, and the failure-handling rules (`max_fails`, `fail_timeout`), and let NGINX handle the mechanics. Note that these rules are passive health checks: open-source NGINX marks a server as failed based on real traffic, while active probing via the `health_check` directive is an NGINX Plus feature. The downside is also visible: this configuration is static. To change weights or add a server, you edit the file and reload NGINX. That does not scale when you have 50 microservices and hundreds of instances coming and going via Kubernetes autoscaling. This is why cloud-native environments prefer dynamic service discovery (Consul, Kubernetes Service, Envoy xDS) even though the underlying algorithms are the same.

One specific anti-pattern worth highlighting: running NGINX with `ip_hash` and then wondering why one server is overwhelmed. `ip_hash` buckets clients by IP, so if most of your traffic comes from a small number of IPs (think mobile carriers, corporate NATs), the distribution will be badly skewed. Use `ip_hash` only when you genuinely need session affinity and you have verified your client IP distribution is diverse.

```nginx theme={null}
# nginx.conf - Production-ready load balancing

upstream user_service {
    # Load balancing algorithm
    least_conn;  # Send to server with fewest active connections
    
    # Backend servers with weights and health
    server user-1.internal:3000 weight=5 max_fails=3 fail_timeout=30s;
    server user-2.internal:3000 weight=5 max_fails=3 fail_timeout=30s;
    server user-3.internal:3000 weight=3 backup;  # Backup server
    
    # Keepalive connections to backends
    keepalive 32;
    keepalive_timeout 60s;
}

upstream order_service {
    # IP Hash - session affinity
    ip_hash;
    
    server order-1.internal:3001;
    server order-2.internal:3001;
    server order-3.internal:3001;
}

server {
    listen 80;
    
    location /api/users {
        proxy_pass http://user_service;
        
        # Passive failover: retry on the next upstream for these errors
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_tries 3;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        
        # Keepalive
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
    
    location /api/orders {
        proxy_pass http://order_service;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```
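
The `ip_hash` skew described above is easy to reproduce. The sketch below is illustrative only: `pickServer` uses a simple string hash, not NGINX's actual hash function, and the address ranges are documentation-reserved examples. What it does share with `ip_hash` is that, for IPv4, only the first three octets of the client address feed the hash.

```javascript
// Sketch: why hashing on client IP skews when many users share a NAT range.
// The hash is a simple illustrative one, NOT NGINX's real implementation.
// Like ip_hash for IPv4, it keys on the first three octets only.

const SERVERS = ['order-1', 'order-2', 'order-3'];

function pickServer(ip) {
  const prefix = ip.split('.').slice(0, 3).join('.'); // e.g. "203.0.113"
  let h = 0;
  for (const ch of prefix) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // keep an unsigned 32-bit value
  }
  return SERVERS[h % SERVERS.length];
}

function distribution(ips) {
  const counts = Object.fromEntries(SERVERS.map(s => [s, 0]));
  for (const ip of ips) counts[pickServer(ip)]++;
  return counts;
}

// 900 users behind one corporate NAT range, 100 users spread over other /24s.
const natUsers = Array.from({ length: 900 }, (_, i) => `203.0.113.${i % 250}`);
const otherUsers = Array.from({ length: 100 }, (_, i) => `198.51.${i}.7`);

console.log(distribution([...natUsers, ...otherUsers]));
```

Every address in the NAT range maps to the same server, so one backend receives at least 90% of the traffic no matter how many backends are in the pool.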

### Client-Side: Implementation

Now let us build client-side load balancing from scratch. The goal of this implementation is to show the moving parts clearly: service discovery integration, health check loop, algorithm selection, and automatic retry with failover. In production you would typically not implement this yourself -- you would use gRPC's built-in load balancer, a service mesh like Linkerd, or a library like Netflix's Ribbon (Java) or go-micro (Go). But understanding the mechanics is essential because when something goes wrong (and it will), you need to know which layer to debug.

Pay close attention to how we track per-instance metrics (activeConnections, responseTime, consecutiveFailures). These are the inputs that make intelligent algorithms like least-connections and least-response-time actually work. Without this tracking, your "load balancer" is just round-robin with extra steps. The failure-counting logic is subtle: we remove an instance after 3 consecutive failures, but a single success resets the counter. This prevents flapping when a network blip causes one failed request, while still catching instances that are genuinely broken.

What would happen if you skipped the health check entirely? The load balancer would keep sending a dead instance its full share of traffic (20% in a five-instance pool) until the service registry eventually removed it (typically 30-60 seconds later). Every one of those requests would time out or return 502 errors, multiplying the impact of a single instance failure. Client-side health checks close this gap -- you start routing away from unhealthy instances within seconds, not minutes.

<Tabs>
  <Tab title="Node.js">
    ```javascript theme={null}
    // client-side-lb.js - Client-side load balancing with service discovery

    const EventEmitter = require('events');

    class ClientSideLoadBalancer extends EventEmitter {
      constructor(options = {}) {
        super();
        this.serviceName = options.serviceName;
        this.registry = options.registry;
        this.algorithm = options.algorithm || 'round-robin';
        this.healthCheckInterval = options.healthCheckInterval || 10000;
        
        this.instances = [];
        this.currentIndex = 0;
        this.healthStatus = new Map();
        
        this.initializeDiscovery();
        this.startHealthChecks();
      }

      async initializeDiscovery() {
        // Initial fetch from service registry
        await this.refreshInstances();
        
        // Subscribe to registry updates
        this.registry.on('instances-changed', (serviceName) => {
          if (serviceName === this.serviceName) {
            this.refreshInstances();
          }
        });
      }

      async refreshInstances() {
        try {
          const instances = await this.registry.getInstances(this.serviceName);
          this.instances = instances.map(instance => ({
            ...instance,
            weight: instance.weight || 1,
            activeConnections: 0,
            responseTime: 0,
            consecutiveFailures: 0
          }));
          this.emit('instances-updated', this.instances);
        } catch (error) {
          this.emit('error', error);
        }
      }

      startHealthChecks() {
        setInterval(async () => {
          for (const instance of this.instances) {
            try {
              const start = Date.now();
              // fetch() has no `timeout` option; abort via a signal instead
              const response = await fetch(`http://${instance.host}:${instance.port}/health`, {
                signal: AbortSignal.timeout(5000)
              });
              const latency = Date.now() - start;
              
              this.healthStatus.set(instance.id, {
                healthy: response.ok,
                latency,
                lastCheck: Date.now()
              });
              
              instance.responseTime = latency;
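              // Reset the failure streak only when the check actually passed
              if (response.ok) {
                instance.consecutiveFailures = 0;
              } else {
                instance.consecutiveFailures++;
              }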
            } catch (error) {
              const status = this.healthStatus.get(instance.id) || {};
              instance.consecutiveFailures++;
              this.healthStatus.set(instance.id, {
                ...status,
                healthy: false,
                lastCheck: Date.now(),
                error: error.message
              });
            }
          }
        }, this.healthCheckInterval);
      }

      // Get next available instance based on algorithm
      getNextInstance() {
        const healthyInstances = this.instances.filter(
          i => (this.healthStatus.get(i.id)?.healthy !== false) && 
               i.consecutiveFailures < 3
        );

        if (healthyInstances.length === 0) {
          throw new Error(`No healthy instances available for ${this.serviceName}`);
        }

        switch (this.algorithm) {
          case 'round-robin':
            return this.roundRobin(healthyInstances);
          case 'weighted-round-robin':
            return this.weightedRoundRobin(healthyInstances);
          case 'least-connections':
            return this.leastConnections(healthyInstances);
          case 'least-response-time':
            return this.leastResponseTime(healthyInstances);
          case 'random':
            return this.random(healthyInstances);
          default:
            return this.roundRobin(healthyInstances);
        }
      }

      roundRobin(instances) {
        const instance = instances[this.currentIndex % instances.length];
        this.currentIndex++;
        return instance;
      }

      weightedRoundRobin(instances) {
        // Create weighted list
        const weighted = [];
        for (const instance of instances) {
          for (let i = 0; i < instance.weight; i++) {
            weighted.push(instance);
          }
        }
        
        const instance = weighted[this.currentIndex % weighted.length];
        this.currentIndex++;
        return instance;
      }

      leastConnections(instances) {
        return instances.reduce((min, current) => 
          current.activeConnections < min.activeConnections ? current : min
        );
      }

      leastResponseTime(instances) {
        return instances.reduce((best, current) => {
          const currentScore = current.activeConnections * 0.5 + current.responseTime * 0.5;
          const bestScore = best.activeConnections * 0.5 + best.responseTime * 0.5;
          return currentScore < bestScore ? current : best;
        });
      }

      random(instances) {
        return instances[Math.floor(Math.random() * instances.length)];
      }

      // Execute request with automatic failover
      async execute(requestFn, options = {}) {
        const maxRetries = options.retries || 3;
        const retryDelay = options.retryDelay || 100;
        let lastError;

        for (let attempt = 0; attempt < maxRetries; attempt++) {
          const instance = this.getNextInstance();
          
          try {
            instance.activeConnections++;
            const start = Date.now();
            
            const result = await requestFn(instance);
            
            instance.responseTime = Date.now() - start;
            instance.activeConnections--;
            instance.consecutiveFailures = 0; // a single success resets the streak
            
            return result;
          } catch (error) {
            instance.activeConnections--;
            instance.consecutiveFailures++;
            lastError = error;
            
            this.emit('request-failed', { instance, error, attempt });
            
            if (attempt < maxRetries - 1) {
              await this.delay(retryDelay * Math.pow(2, attempt));
            }
          }
        }

        throw lastError;
      }

      delay(ms) {
        return new Promise(resolve => setTimeout(resolve, ms));
      }
    }

    // Usage example
    const axios = require('axios');

    const userServiceLB = new ClientSideLoadBalancer({
      serviceName: 'user-service',
      registry: serviceRegistry, // assumed: exposes getInstances() and emits 'instances-changed'
      algorithm: 'least-connections'
    });

    // Make requests through the load balancer
    async function getUser(userId) {
      return userServiceLB.execute(async (instance) => {
        const response = await axios.get(
          `http://${instance.host}:${instance.port}/users/${userId}`,
          { timeout: 5000 }
        );
        return response.data;
      });
    }

    module.exports = { ClientSideLoadBalancer };
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    # Client-side load balancer with service discovery integration.
    # Uses httpx for async HTTP, a background task for health checking,
    # and asyncio primitives for safe concurrent access to instance state.
    # In production you would plug this into a registry like python-consul2
    # or etcd3-py; here the Registry is abstracted behind a Protocol.

    from __future__ import annotations

    import asyncio
    import logging
    import random
    import time
    from dataclasses import dataclass, field
    from typing import Any, Awaitable, Callable, Literal, Protocol

    import httpx

    log = logging.getLogger(__name__)

    Algorithm = Literal[
        "round-robin",
        "weighted-round-robin",
        "least-connections",
        "least-response-time",
        "random",
    ]


    @dataclass
    class Instance:
        id: str
        host: str
        port: int
        weight: int = 1
        active_connections: int = 0
        response_time_ms: float = 0.0
        consecutive_failures: int = 0
        healthy: bool = True
        last_check: float = 0.0


    class Registry(Protocol):
        async def get_instances(self, service_name: str) -> list[Instance]: ...
        def on_change(self, service_name: str, callback: Callable[[], Awaitable[None]]) -> None: ...


    class ClientSideLoadBalancer:
        """Client-side load balancer with pluggable algorithms and active health checks.

        Owns a long-lived httpx.AsyncClient for connection pooling. All algorithm
        methods operate only on the currently-healthy subset of instances.
        """

        def __init__(
            self,
            service_name: str,
            registry: Registry,
            algorithm: Algorithm = "round-robin",
            health_check_interval: float = 10.0,
            max_failures: int = 3,
        ) -> None:
            self.service_name = service_name
            self._registry = registry
            self._algorithm = algorithm
            self._health_check_interval = health_check_interval
            self._max_failures = max_failures

            self._instances: list[Instance] = []
            self._current_index = 0
            self._lock = asyncio.Lock()
            self._client = httpx.AsyncClient(timeout=httpx.Timeout(5.0))
            self._health_task: asyncio.Task[None] | None = None

        async def start(self) -> None:
            await self._refresh_instances()
            self._registry.on_change(self.service_name, self._refresh_instances)
            self._health_task = asyncio.create_task(self._health_check_loop())

        async def close(self) -> None:
            if self._health_task:
                self._health_task.cancel()
            await self._client.aclose()

        async def _refresh_instances(self) -> None:
            fresh = await self._registry.get_instances(self.service_name)
            async with self._lock:
                # Reuse existing objects so in-flight requests (which hold a
                # reference to them) keep updating the counters we route on.
                by_id = {i.id: i for i in self._instances}
                merged: list[Instance] = []
                for inst in fresh:
                    existing = by_id.get(inst.id)
                    if existing:
                        existing.host = inst.host
                        existing.port = inst.port
                        existing.weight = inst.weight
                        merged.append(existing)
                    else:
                        merged.append(inst)
                self._instances = merged

        async def _health_check_loop(self) -> None:
            while True:
                try:
                    await self._check_all()
                except asyncio.CancelledError:
                    raise
                except Exception:
                    log.exception("health check loop error")
                await asyncio.sleep(self._health_check_interval)

        async def _check_all(self) -> None:
            async with self._lock:
                instances = list(self._instances)

            await asyncio.gather(*(self._check_one(i) for i in instances))

        async def _check_one(self, instance: Instance) -> None:
            url = f"http://{instance.host}:{instance.port}/health"
            start = time.perf_counter()
            try:
                r = await self._client.get(url, timeout=5.0)
                latency_ms = (time.perf_counter() - start) * 1000
                instance.response_time_ms = latency_ms
                instance.last_check = time.time()
                instance.healthy = r.is_success
                if r.is_success:
                    instance.consecutive_failures = 0
            except httpx.HTTPError:
                instance.consecutive_failures += 1
                instance.healthy = False
                instance.last_check = time.time()

        # ---------- algorithms ----------

        def _healthy(self) -> list[Instance]:
            return [
                i for i in self._instances
                if i.healthy and i.consecutive_failures < self._max_failures
            ]

        def _next_instance(self) -> Instance:
            pool = self._healthy()
            if not pool:
                raise RuntimeError(f"No healthy instances for {self.service_name}")

            if self._algorithm == "round-robin":
                chosen = pool[self._current_index % len(pool)]
                self._current_index += 1
                return chosen
            if self._algorithm == "weighted-round-robin":
                expanded = [i for i in pool for _ in range(i.weight)]
                chosen = expanded[self._current_index % len(expanded)]
                self._current_index += 1
                return chosen
            if self._algorithm == "least-connections":
                return min(pool, key=lambda i: i.active_connections)
            if self._algorithm == "least-response-time":
                return min(
                    pool,
                    key=lambda i: i.active_connections * 0.5 + i.response_time_ms * 0.5,
                )
            if self._algorithm == "random":
                return random.choice(pool)
            raise ValueError(f"unknown algorithm: {self._algorithm}")

        # ---------- request execution ----------

        async def execute(
            self,
            request_fn: Callable[[Instance, httpx.AsyncClient], Awaitable[Any]],
            retries: int = 3,
            retry_delay: float = 0.1,
        ) -> Any:
            last_error: BaseException | None = None

            for attempt in range(retries):
                instance = self._next_instance()
                instance.active_connections += 1
                start = time.perf_counter()
                try:
                    result = await request_fn(instance, self._client)
                    instance.response_time_ms = (time.perf_counter() - start) * 1000
                    instance.consecutive_failures = 0  # a single success resets the streak
                    return result
                except Exception as exc:
                    instance.consecutive_failures += 1
                    last_error = exc
                    log.warning(
                        "request to %s failed on attempt %d/%d: %s",
                        instance.id, attempt + 1, retries, exc,
                    )
                    if attempt < retries - 1:
                        await asyncio.sleep(retry_delay * (2 ** attempt))
                finally:
                    instance.active_connections -= 1

            assert last_error is not None
            raise last_error


    # ---------- usage ----------

    async def get_user(lb: ClientSideLoadBalancer, user_id: str) -> dict[str, Any]:
        async def call(instance: Instance, client: httpx.AsyncClient) -> dict[str, Any]:
            r = await client.get(f"http://{instance.host}:{instance.port}/users/{user_id}")
            r.raise_for_status()
            return r.json()

        return await lb.execute(call)
    ```
  </Tab>
</Tabs>
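
The failure-counting rule described before the tabs (eject after three consecutive failures, reset on any success) can be isolated into a few lines. This is a minimal sketch, not part of the classes above; `FailureTracker` and its method names are hypothetical.

```javascript
// Sketch of consecutive-failure ejection, isolated from the balancer class.
// FailureTracker and its method names are hypothetical.

class FailureTracker {
  constructor(maxFailures = 3) {
    this.maxFailures = maxFailures;
    this.failures = new Map(); // instanceId -> consecutive failure count
  }

  recordSuccess(instanceId) {
    // One success clears the streak, so an isolated blip never ejects.
    this.failures.set(instanceId, 0);
  }

  recordFailure(instanceId) {
    this.failures.set(instanceId, (this.failures.get(instanceId) || 0) + 1);
  }

  isEjected(instanceId) {
    return (this.failures.get(instanceId) || 0) >= this.maxFailures;
  }
}

const tracker = new FailureTracker(3);

// Network blip: fail, fail, succeed, fail. The streak never reaches 3.
['fail', 'fail', 'ok', 'fail'].forEach(outcome =>
  outcome === 'ok' ? tracker.recordSuccess('user-1') : tracker.recordFailure('user-1')
);
console.log('user-1 ejected:', tracker.isEjected('user-1')); // false

// Genuinely broken instance: three failures in a row.
['fail', 'fail', 'fail'].forEach(() => tracker.recordFailure('user-2'));
console.log('user-2 ejected:', tracker.isEjected('user-2')); // true
```

Counting consecutive failures rather than total failures is what keeps a flaky network from ejecting a healthy instance, while a broken instance still leaves the pool after three attempts.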

***

## Load Balancing Algorithms

The algorithm you choose is effectively a hypothesis about your workload. Round-robin assumes all instances are identical and all requests take equal time. Least-connections assumes requests vary in duration but instances are otherwise equal. Least-response-time assumes instances themselves vary in speed. Consistent hashing assumes you need keys to stick to specific instances (for caching or partitioning). When your hypothesis matches reality, the algorithm works well. When it does not, you get mysterious tail latency or cache misses that nobody can explain.

The most common mistake is picking an algorithm that sounds smart without validating that it matches your workload. I have seen teams switch from round-robin to least-response-time expecting better performance, only to find that it created a feedback loop: the fastest instance got all the traffic, its response time crept up, another instance became fastest, and so on. The resulting oscillation was worse than round-robin's simple equal distribution. This is why Power of Two Choices (P2C) -- discussed below -- has become a standard choice in modern proxies: Envoy's least-request policy uses it, and Linkerd balances with P2C over latency estimates by default. It has the intelligence of least-connections without the herding problem.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    LOAD BALANCING ALGORITHMS                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ROUND ROBIN                          WEIGHTED ROUND ROBIN                  │
│  ─────────────────                    ─────────────────────                 │
│                                                                              │
│  Request 1 → Server A                 Request 1 → Server A (w=3)           │
│  Request 2 → Server B                 Request 2 → Server A                 │
│  Request 3 → Server C                 Request 3 → Server A                 │
│  Request 4 → Server A                 Request 4 → Server B (w=2)           │
│  ...                                  Request 5 → Server B                 │
│                                       Request 6 → Server C (w=1)           │
│  Simple, equal distribution           Request 7 → Server A (repeats)       │
│                                       ...                                   │
│                                       Accounts for server capacity         │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  LEAST CONNECTIONS                    LEAST RESPONSE TIME                  │
│  ──────────────────────               ─────────────────────                 │
│                                                                              │
│  Server A: 5 conn                     Server A: 50ms avg                   │
│  Server B: 8 conn                     Server B: 75ms avg                   │
│  Server C: 3 conn  ←────              Server C: 45ms avg  ←────            │
│                    ↑                                     ↑                  │
│           Next → Server C             Next → Server C (fastest)            │
│                                                                              │
│  Best for long-lived connections      Best for varying server loads        │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  IP HASH                              CONSISTENT HASHING                   │
│  ────────────                         ───────────────────                   │
│                                                                              │
│  hash(client_ip) → Server B           hash(request) → Ring position        │
│                                       Minimal redistribution on change     │
│  Same client → same server                                                 │
│  Session affinity                     Good for caching, stateful services  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```
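
The "minimal redistribution on change" property in the consistent hashing panel can be checked numerically. The sketch below compares naive modulo hashing against a hash ring with virtual nodes when one of five servers is removed. The server names, key format, MD5-based `hash32` helper, and the 150 virtual nodes per server are all assumptions made for this example.

```javascript
// Sketch: how many keys move when one of five servers is removed?
// Helper names, key format, and vnode count are illustrative assumptions.
const crypto = require('crypto');

// First 4 bytes of an MD5 digest as an unsigned 32-bit integer.
function hash32(str) {
  return crypto.createHash('md5').update(str).digest().readUInt32BE(0);
}

// Naive approach: hash(key) % N.
function moduloPick(key, servers) {
  return servers[hash32(key) % servers.length];
}

// Ring: each server contributes `vnodes` points, sorted by hash value.
function buildRing(servers, vnodes = 150) {
  const ring = [];
  for (const server of servers) {
    for (let v = 0; v < vnodes; v++) {
      ring.push({ point: hash32(`${server}#${v}`), server });
    }
  }
  return ring.sort((a, b) => a.point - b.point);
}

// A key belongs to the first ring point at or after its hash (wrapping around).
function ringPick(key, ring) {
  const h = hash32(key);
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ring[mid].point < h) lo = mid + 1;
    else hi = mid;
  }
  return ring[lo % ring.length].server;
}

function movedFraction(keys, before, after) {
  return keys.filter(k => before(k) !== after(k)).length / keys.length;
}

const five = ['s1', 's2', 's3', 's4', 's5'];
const four = five.slice(0, 4); // s5 removed
const keys = Array.from({ length: 10000 }, (_, i) => `user:${i}`);
const ring5 = buildRing(five);
const ring4 = buildRing(four);

const moduloMoved = movedFraction(keys, k => moduloPick(k, five), k => moduloPick(k, four));
const ringMoved = movedFraction(keys, k => ringPick(k, ring5), k => ringPick(k, ring4));

console.log(`modulo hashing moved ${(moduloMoved * 100).toFixed(1)}% of keys`);
console.log(`consistent hashing moved ${(ringMoved * 100).toFixed(1)}% of keys`);
```

With modulo hashing, a key stays put only when `hash % 5` equals `hash % 4`, so about 80% of keys are expected to move. On the ring, only keys that lived on the removed server move, which is about one fifth of them.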

<Warning>
  **Caveats & Common Pitfalls: Algorithm Selection.**

  * **Weighted round-robin with unhealthy upstreams.** If your health check is slow to react, weighted round-robin happily keeps sending weight-proportional traffic to a dying instance. An instance with weight 5 that is about to crash gets 5x the traffic of a healthy instance with weight 1 for the whole detection window.
  * **Session affinity ("sticky sessions") causing hot instances.** When a popular user or a large corporate NAT hashes to one instance, that instance receives 10x the average load. Teams usually notice only when that one instance melts during peak hours.
  * **Least-connections without accurate connection tracking.** If your load balancer's "connections" metric is per-LB-instance and you run multiple LB replicas, each LB thinks it has few connections and they all pile onto the same backend.
  * **Picking least-response-time and creating feedback oscillation.** The fastest instance gets all traffic, its response time degrades, now another instance is fastest and receives the herd. Result: oscillating traffic and worse p99 than round-robin.
</Warning>
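
A hot instance of the kind described above is easy to miss in fleet-wide averages. As a minimal sketch (the function name and the sample numbers are illustrative and not taken from any monitoring tool), the check below compares the busiest instance against the median instance:

```python
import statistics


def imbalance_ratio(loads):
    """Busiest instance divided by the median instance (1.0 means perfectly even)."""
    median = statistics.median(loads)
    if median == 0:
        return 1.0 if max(loads) == 0 else float("inf")
    return max(loads) / median


# Hypothetical per-instance request rates; one instance is sticky-session hot.
loads = [10, 12, 11, 9, 40]
print(f"mean={statistics.mean(loads):.1f} ratio={imbalance_ratio(loads):.2f}")
```

The mean looks unremarkable here, while the ratio shows one instance carrying several times the typical load. With only a handful of instances the maximum is a reasonable stand-in for a high percentile.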

<Tip>
  **Solutions & Patterns: Algorithm best practices.**

  * **Default to Power of Two Choices (P2C).** Pick 2 random instances, send to the less-loaded one. Breaks the herd effect, requires no global state, and outperforms least-connections in realistic workloads.
  * **Tie weights to CPU or memory, not age.** Use weights that reflect actual capacity (4-core instance = weight 4, 2-core = weight 2), and recompute them when autoscaling changes the fleet.
  * **Combine session affinity with a fallback.** Prefer the sticky instance, but fail open to round-robin if the sticky instance is unhealthy or overloaded. Never make stickiness an absolute constraint.
  * **Monitor per-instance load variance, not just averages.** If instance p99 load is 3x instance median load, your algorithm is not balancing well. Average load of 50% can hide one instance at 95%.
</Tip>
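
The "session affinity with a fallback" pattern can be sketched in a few lines. This is an illustration under stated assumptions rather than a production implementation: `Backend`, `max_connections`, and `make_sticky_picker` are hypothetical names, and "overloaded" is reduced to a simple connection cap.

```python
import hashlib
from dataclasses import dataclass
from itertools import count


@dataclass
class Backend:
    name: str
    healthy: bool = True
    active_connections: int = 0
    max_connections: int = 100  # hypothetical per-instance capacity limit


def make_sticky_picker(backends):
    """Return pick(client_key): prefer the hashed backend, fall back to round-robin."""
    rr = count()  # round-robin cursor, used only for fallback traffic

    def usable(b):
        return b.healthy and b.active_connections < b.max_connections

    def pick(client_key):
        digest = hashlib.sha256(client_key.encode()).digest()
        sticky = backends[int.from_bytes(digest[:8], "big") % len(backends)]
        if usable(sticky):
            return sticky
        # Fail open: stickiness is a preference, not a hard constraint.
        candidates = [b for b in backends if usable(b)]
        if not candidates:
            raise RuntimeError("no usable backends")
        return candidates[next(rr) % len(candidates)]

    return pick
```

The same client key keeps landing on the same backend while that backend is usable, moves elsewhere while it is not, and returns once it recovers. Whether returning is desirable depends on how expensive it is to rebuild session state.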

### Advanced Algorithms Implementation

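The algorithms below demonstrate the ideas behind production load balancers. Weighted Round Robin with smooth distribution prevents clumping (naive weighted round-robin sends all weight-5 requests consecutively, which can overwhelm that instance; the smooth version interleaves them). Consistent hashing with virtual nodes is the foundation of Memcached client libraries (ketama) and of stores like Amazon's Dynamo and Apache Cassandra. (Redis Cluster uses a fixed table of 16384 hash slots instead.) Compared with naive modulo hashing (`hash(key) % N`), where removing one server remaps nearly every key, consistent hashing remaps only about 1/N of them. Virtual nodes do not change that fraction; they even out how keys, including the keys orphaned by a removed server, are spread across the remaining servers. Without virtual nodes, one unlucky neighbor on the ring can inherit the removed server's entire share; with 150 virtual nodes per server the load spreads almost uniformly.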

Power of Two Choices (P2C) is the algorithm I recommend for most modern load balancers. The math is surprising: picking 2 random instances and choosing the less loaded one often matches or beats least-connections in realistic workloads. Why? Because pure least-connections creates a herd effect when decisions are made on slightly stale data. Suppose several load balancers (or client-side balancers) all see instance A with 3 connections and instance B with 4. Every one of them routes its next request to A before any of them observes the others' choices, and A is suddenly the most loaded instance. With P2C, each decision compares a different random pair, so the randomness breaks the herding. The mathematical analysis (Azar, Broder, Karlin, Upfal 1994) shows that with n requests placed on n instances, P2C achieves O(log log n) maximum load while purely random placement achieves O(log n / log log n) -- a dramatic improvement for essentially no extra cost.
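
The gap between one random choice and two can be checked empirically. The sketch below is a static balls-into-bins simulation, which is the model the analysis uses: requests are placed once and never complete, so it illustrates the bound rather than a live load balancer. The function names and parameters are illustrative.

```python
import random


def max_load(n_bins, n_balls, choices, rng):
    """Place each ball in the least-loaded of `choices` randomly sampled bins."""
    bins = [0] * n_bins
    for _ in range(n_balls):
        # Sampling with replacement: the two picks may occasionally coincide.
        picked = [rng.randrange(n_bins) for _ in range(choices)]
        target = min(picked, key=lambda i: bins[i])
        bins[target] += 1
    return max(bins)


def compare(n=10_000, trials=5, seed=1):
    """Average maximum bin load for one random choice vs two choices."""
    rng = random.Random(seed)
    one = [max_load(n, n, 1, rng) for _ in range(trials)]
    two = [max_load(n, n, 2, rng) for _ in range(trials)]
    return sum(one) / trials, sum(two) / trials


if __name__ == "__main__":
    single, double = compare()
    print(f"random: {single:.1f}  two choices: {double:.1f}")
```

Running it for larger `n` shows the single-choice maximum creeping upward while the two-choice maximum barely moves, which is the practical meaning of O(log log n).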

<Tabs>
  <Tab title="Node.js">
    ```javascript theme={null}
    // advanced-lb-algorithms.js

    class LoadBalancingAlgorithms {
      // Weighted Round Robin with smooth distribution (the scheme nginx uses).
      // Each pick adds every instance's weight to its running score, selects
      // the highest score, then subtracts the total weight from the winner.
      // Weights 5/1/1 yield A A B A C A A instead of A A A A A B C.
      static createWeightedRoundRobin(instances) {
        const totalWeight = instances.reduce((sum, i) => sum + i.weight, 0);
        const current = instances.map(() => 0);

        return function getNext() {
          let best = 0;
          for (let i = 0; i < instances.length; i++) {
            current[i] += instances[i].weight;
            if (current[i] > current[best]) best = i;
          }
          current[best] -= totalWeight;
          return instances[best];
        };
      }

      // Consistent Hashing with virtual nodes
      static createConsistentHash(instances, virtualNodes = 150) {
        const ring = new Map();
        const sortedKeys = [];

        // Add virtual nodes for each instance
        for (const instance of instances) {
          for (let i = 0; i < virtualNodes; i++) {
            const key = hash(`${instance.id}-${i}`);
            ring.set(key, instance);
            sortedKeys.push(key);
          }
        }
        
        sortedKeys.sort((a, b) => a - b);

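        // Use the first 32 bits of an MD5 digest so virtual nodes spread
        // evenly around the ring. A simple multiplicative string hash maps
        // "id-0", "id-1", ... to neighboring values, which clusters the nodes.
        function hash(str) {
          return require('crypto').createHash('md5').update(str).digest().readUInt32BE(0);
        }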

        return function getNode(key) {
          const keyHash = hash(key);
          
          // Binary search for first key >= keyHash
          let low = 0, high = sortedKeys.length - 1;
          
          while (low < high) {
            const mid = Math.floor((low + high) / 2);
            if (sortedKeys[mid] < keyHash) {
              low = mid + 1;
            } else {
              high = mid;
            }
          }

          // Wrap around if key is larger than all
          const index = sortedKeys[low] >= keyHash ? low : 0;
          return ring.get(sortedKeys[index]);
        };
      }

      // Power of Two Choices (P2C)
      // This deceptively simple algorithm has mathematical guarantees that make it surprisingly effective.
      // By picking just 2 random servers and choosing the less loaded one, you avoid the "herd effect"
      // where all clients pick the same "best" server (least-connections) and overwhelm it.
      // Used by Envoy, HAProxy, and Netflix Zuul in production. The max load is O(log(log(n))).
      static createP2C(instances) {
        return function getNext() {
          if (instances.length === 1) return instances[0];
          
          // Pick two random instances
          const i1 = Math.floor(Math.random() * instances.length);
          let i2 = Math.floor(Math.random() * instances.length);
          while (i2 === i1) {
            i2 = Math.floor(Math.random() * instances.length);
          }

          // Choose the one with fewer active connections
          return instances[i1].activeConnections <= instances[i2].activeConnections
            ? instances[i1]
            : instances[i2];
        };
      }

      // Adaptive Load Balancing (based on real-time metrics)
      static createAdaptive(instances) {
        return function getNext() {
          // Calculate scores based on multiple factors
          const scored = instances.map(instance => ({
            instance,
            score: calculateScore(instance)
          }));

          // Sort by score (lower is better)
          scored.sort((a, b) => a.score - b.score);
          
          // Weighted random from top 3
          const topN = scored.slice(0, Math.min(3, scored.length));
          const totalWeight = topN.reduce((sum, s) => sum + (1 / s.score), 0);
          
          let random = Math.random() * totalWeight;
          for (const { instance, score } of topN) {
            random -= (1 / score);
            if (random <= 0) return instance;
          }
          
          return topN[0].instance;
        };

        function calculateScore(instance) {
          // Lower score = better
          const connectionScore = instance.activeConnections * 0.3;
          const latencyScore = instance.avgResponseTime * 0.4;
          const errorScore = instance.errorRate * 100 * 0.3;
          
          return connectionScore + latencyScore + errorScore + 0.001; // Avoid division by zero
        }
      }
    }

    module.exports = { LoadBalancingAlgorithms };
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    # Advanced load balancing algorithms in Python.
    # Uses hashlib for stable cross-process hashing and bisect for the ring
    # lookup. All algorithms return a callable so they can be swapped at runtime.

    from __future__ import annotations

    import bisect
    import hashlib
    import random
    from dataclasses import dataclass
    from typing import Callable


    @dataclass
    class Instance:
        id: str
        host: str
        port: int
        weight: int = 1
        active_connections: int = 0
        avg_response_time_ms: float = 0.0
        error_rate: float = 0.0


    def _stable_hash(key: str) -> int:
        """Deterministic cross-process hash. Python's hash() is salted per process."""
        return int(hashlib.md5(key.encode(), usedforsecurity=False).hexdigest()[:16], 16)


    def smooth_weighted_round_robin(instances: list[Instance]) -> Callable[[], Instance]:
        """Smooth weighted round robin -- spaces out requests to high-weight servers.

        Same scheme as nginx: each pick adds every instance's weight to its
        running score, selects the highest score, then subtracts the total
        weight from the winner. Weights 5/1/1 yield A A B A C A A rather
        than A A A A A B C.
        """
        total_weight = sum(i.weight for i in instances)
        current = [0] * len(instances)

        def get_next() -> Instance:
            for idx, inst in enumerate(instances):
                current[idx] += inst.weight
            # max() returns the first instance on ties, so order is deterministic
            best = max(range(len(instances)), key=current.__getitem__)
            current[best] -= total_weight
            return instances[best]

        return get_next


    def consistent_hash(
        instances: list[Instance], virtual_nodes: int = 150
    ) -> Callable[[str], Instance]:
        """Classic consistent hashing with virtual nodes for even distribution.

        Returns a function that maps a request key to an instance. Adding or
        removing an instance only remaps ~1/N of the keys.
        """
        ring: list[tuple[int, Instance]] = []
        for inst in instances:
            for i in range(virtual_nodes):
                ring.append((_stable_hash(f"{inst.id}-{i}"), inst))
        ring.sort(key=lambda kv: kv[0])
        keys = [h for h, _ in ring]

        def get_node(request_key: str) -> Instance:
            h = _stable_hash(request_key)
            idx = bisect.bisect_left(keys, h)
            if idx == len(keys):  # wrap around
                idx = 0
            return ring[idx][1]

        return get_node


    def power_of_two_choices(instances: list[Instance]) -> Callable[[], Instance]:
        """P2C: pick two random instances, choose the one with fewer connections.

        Outperforms least-connections in practice because the randomness breaks
        the herd effect. Used by Envoy's LEAST_REQUEST, HAProxy, Netflix Zuul.
        The max load is O(log log n) vs O(log n / log log n) for pure random.
        """
        def get_next() -> Instance:
            if len(instances) == 1:
                return instances[0]
            a, b = random.sample(instances, 2)
            return a if a.active_connections <= b.active_connections else b

        return get_next


    def adaptive_lb(instances: list[Instance]) -> Callable[[], Instance]:
        """Score-based adaptive load balancing. Blends connection count, latency,
        and error rate into a single score and picks from the top 3 using weighted
        random -- so we don't always hammer the single best instance.
        """
        def score(inst: Instance) -> float:
            connection_score = inst.active_connections * 0.3
            latency_score = inst.avg_response_time_ms * 0.4
            error_score = inst.error_rate * 100 * 0.3
            return connection_score + latency_score + error_score + 0.001

        def get_next() -> Instance:
            ranked = sorted(instances, key=score)[:3]
            weights = [1.0 / score(i) for i in ranked]
            return random.choices(ranked, weights=weights, k=1)[0]

        return get_next
    ```
  </Tab>
</Tabs>
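
To see why the ring matters, it helps to measure how many keys change owner when one server leaves. The sketch below compares a hash ring like the one above against naive modulo hashing; the server names, key format, and helper names are illustrative assumptions.

```python
import bisect
import hashlib


def h(key):
    """Stable 64-bit hash derived from MD5."""
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)


def build_ring(servers, virtual_nodes=150):
    """Return (ring, points): sorted (hash, server) pairs and the hashes alone."""
    ring = sorted((h(f"{s}-{i}"), s) for s in servers for i in range(virtual_nodes))
    return ring, [point for point, _ in ring]


def ring_lookup(ring, points, key):
    # First ring point at or after the key's hash, wrapping past the end.
    idx = bisect.bisect_left(points, h(key)) % len(points)
    return ring[idx][1]


def modulo_lookup(servers, key):
    return servers[h(key) % len(servers)]


def remapped_fraction(keys, before, after):
    """Share of keys whose owner differs between two lookup functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


if __name__ == "__main__":
    servers = [f"cache-{i}" for i in range(10)]
    survivors = servers[:-1]  # simulate losing one server
    keys = [f"user:{i}" for i in range(20_000)]
    full, reduced = build_ring(servers), build_ring(survivors)
    ring_moved = remapped_fraction(
        keys, lambda k: ring_lookup(*full, k), lambda k: ring_lookup(*reduced, k)
    )
    mod_moved = remapped_fraction(
        keys, lambda k: modulo_lookup(servers, k), lambda k: modulo_lookup(survivors, k)
    )
    print(f"ring: {ring_moved:.1%} moved, modulo: {mod_moved:.1%} moved")
```

With ten servers the ring should move roughly a tenth of the keys (only those the removed server owned), while modulo hashing moves most of them. For a cache, that difference is the gap between a small dip in hit rate and a cold cache.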

***

## Health Checking Strategies

Health checks are the eyes of the load balancer. If they report wrong, the load balancer makes wrong decisions -- routing to dead instances or evicting healthy ones. The most important distinction in health checking is between liveness (is the process alive?) and readiness (can it handle traffic?). This sounds pedantic but it is the single most common mistake I see in production: teams implement a single "health" endpoint that checks everything, use it for both liveness and readiness, and then wonder why their system goes into a death spiral whenever a downstream dependency has a bad day.

The failure mode is subtle but catastrophic. Imagine your `/health` endpoint checks the database. The database has a slow query that makes health checks time out. Kubernetes sees liveness failures and restarts your containers. The restarted containers also fail the check (the database is still slow), so Kubernetes kills them again, with a longer backoff delay each time. Within minutes every pod is stuck in a crash loop. The database recovers, but no pod is ready to serve traffic, and you have a full outage. The fix: liveness checks should be shallow (just "is the HTTP server responding?"), and only readiness checks should verify dependencies.

The deep health check is the third kind: an endpoint used by humans and dashboards to see rich dependency status. It is NOT what the load balancer polls. It tells you *why* a service is degraded, not whether it should receive traffic.

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                    HEALTH CHECK PATTERNS                                     │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  LIVENESS vs READINESS                                                      │
│  ─────────────────────────                                                  │
│                                                                              │
│  LIVENESS: Is the process running?                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  GET /healthz                                                        │   │
│  │  → 200 OK: Process is alive                                          │   │
│  │  → 5xx: Process is dead, restart it                                  │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  READINESS: Can it handle traffic?                                          │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  GET /ready                                                          │   │
│  │  → 200 OK: Ready to receive traffic                                  │   │
│  │  → 503: Not ready (warming up, dependencies down)                    │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
│  DEEP HEALTH CHECK (with dependencies)                                      │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │  GET /health/deep                                                    │   │
│  │  {                                                                   │   │
│  │    "status": "degraded",                                             │   │
│  │    "checks": {                                                       │   │
│  │      "database": { "status": "healthy", "latency": "5ms" },         │   │
│  │      "redis": { "status": "healthy", "latency": "2ms" },            │   │
│  │      "external-api": { "status": "unhealthy", "error": "timeout" } │   │
│  │    }                                                                 │   │
│  │  }                                                                   │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
```

<Warning>
  **Caveats & Common Pitfalls: Health Checks and Deploys.**

  * **Connection draining gap during deploy.** The load balancer removes the instance from its pool but the instance still has in-flight connections. If you terminate the container before connections drain, in-flight requests fail with 502s. Users get errors during every deploy.
  * **Deep health checks triggering cascade restarts.** A liveness probe that queries the database dies when the database has a blip. Kubernetes restarts the pod, the new pod also dies (database still bad), a crash loop starts, and now zero pods serve traffic.
  * **Health check frequency too high.** A 1-second health check against a 10-instance pool with 3 retries creates up to 30 requests per second of pure health-check traffic. On a 200-instance pool that is up to 600 req/sec, which adds no business value but still consumes resources. Every additional load balancer replica that probes independently multiplies the total again.
  * **Health check bypassing the critical path.** The health endpoint returns 200 but only tests the process, not the actual request handler. You get a "healthy" service that returns 500s to every real request because of a bad config load.
</Warning>
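
Both the traffic cost and the detection delay of a probe configuration come down to simple arithmetic. The helpers below are a back-of-the-envelope sketch. They assume probes fire at a fixed rate and that the probe timeout is no longer than the interval; `fanout` is a hypothetical catch-all for retries per check or for the number of load balancer replicas probing independently.

```python
def probe_rate(instances, interval_s, fanout=1):
    """Steady-state health-check requests per second across the whole pool."""
    return instances * fanout / interval_s


def worst_case_detection_s(interval_s, timeout_s, failure_threshold):
    """Upper bound on the time from an instance dying to being marked unhealthy.

    Worst case: the instance dies just after a successful probe, and each of
    the next `failure_threshold` probes runs to its full timeout.
    """
    return interval_s * failure_threshold + timeout_s


if __name__ == "__main__":
    print(probe_rate(10, 1.0, fanout=3))        # 10-instance pool, 3 attempts per check
    print(probe_rate(200, 1.0, fanout=3))       # same settings on 200 instances
    print(worst_case_detection_s(2.0, 1.0, 3))  # 2s interval, 1s timeout, 3 failures
```

Lengthening the interval reduces probe traffic linearly but increases detection time by the same factor, so the two numbers are best tuned together.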

<Tip>
  **Solutions & Patterns: Health checks done right.**

  * **Separate liveness (shallow) from readiness (with dependencies) from deep health (human-readable).** Three endpoints, three audiences. Never let one endpoint serve all three.
  * **Preserve in-flight traffic during deploys.** On shutdown, fail readiness first, keep serving while the LB drains (typically 15-30 seconds), and only then exit. In Kubernetes, a `preStop` hook (which runs before SIGTERM is delivered) that flips readiness or simply waits, combined with a `terminationGracePeriodSeconds` longer than the drain window, gives you this.
  * **Tune health check intervals to your recovery SLA.** If you need to detect a dead instance within 10 seconds, use 2-second intervals with a failure threshold of 3 (roughly 6 seconds plus the probe timeout). If your SLA is looser than that, probe less aggressively to reduce noise.
  * **Test the actual critical path.** Readiness should exercise the real request handler with a synthetic request, not just return 200 from a different route.
</Tip>
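
The drain sequence from the list above (fail readiness, wait for the load balancer to notice, let in-flight requests finish) can be sketched with `asyncio`. `DrainController` and its parameters are hypothetical names for illustration. In a real service the readiness endpoint would read `ready`, request middleware would call `request_started` and `request_finished`, and `shutdown` would be triggered from a SIGTERM handler.

```python
import asyncio


class DrainController:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self):
        self.ready = True  # what the readiness endpoint would report
        self._in_flight = 0
        self._idle = asyncio.Event()
        self._idle.set()

    def request_started(self):
        self._in_flight += 1
        self._idle.clear()

    def request_finished(self):
        self._in_flight -= 1
        if self._in_flight == 0:
            self._idle.set()

    async def shutdown(self, lb_drain_s, max_wait_s):
        """Return True if all in-flight requests finished within max_wait_s."""
        self.ready = False               # 1. readiness starts failing
        await asyncio.sleep(lb_drain_s)  # 2. give the LB time to stop routing here
        try:                             # 3. wait for in-flight work to finish
            await asyncio.wait_for(self._idle.wait(), timeout=max_wait_s)
            return True
        except asyncio.TimeoutError:
            return False                 # grace period exceeded; caller exits anyway
```

The process keeps accepting requests during step 2 on purpose: until the load balancer has observed the failing readiness check, it may still route new traffic to this instance.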

### Comprehensive Health Check Service

The service below is the minimum viable implementation of a proper health check subsystem. The design has three parts: a registration mechanism (where each dependency declares how to check itself), a background runner (so health checks happen continuously, not per-request), and three HTTP endpoints (liveness, readiness, deep). The background execution is critical -- if you check dependencies on every incoming request, your health check becomes a DoS vector. A misbehaving caller can hammer /health and multiply the load on every dependency. Instead, check dependencies on a timer and return cached results.

One detail worth emphasizing: health checks should have aggressive timeouts. A health check that takes 10 seconds to fail makes your "unhealthy" detection take 30+ seconds (since you typically need 3 consecutive failures). Use timeouts of 2-5 seconds at most for critical dependencies, and scale the timeout to your service's own latency budget. If your service's SLA is 500ms, a 5-second health check timeout is 10x your SLA -- that is too slow.

<Tabs>
  <Tab title="Node.js">
    ```javascript theme={null}
    // health-check-service.js

    const express = require('express');

    class HealthCheckService {
      constructor() {
        this.checks = new Map();
        this.status = 'starting';
        this.startTime = Date.now();
      }

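      // Register a health check and start running it in the background
      registerCheck(name, checkFn, options = {}) {
        const check = {
          fn: checkFn,
          critical: options.critical !== false, // Default to critical
          timeout: options.timeout || 5000,
          interval: options.interval || 30000,
          lastResult: null,
          lastCheck: null
        };
        this.checks.set(name, check);

        // Run once immediately so the cache is not empty for a full interval,
        // then refresh on the timer (the default interval applies too).
        this.runCheck(name);
        check.timer = setInterval(() => this.runCheck(name), check.interval);
        check.timer.unref(); // health timers alone should not keep the process alive
      }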

      async runCheck(name) {
        const check = this.checks.get(name);
        if (!check) return null;

        const startTime = Date.now();
        
        let timer;
        try {
          const result = await Promise.race([
            check.fn(),
            new Promise((_, reject) => {
              timer = setTimeout(() => reject(new Error('Health check timeout')), check.timeout);
            })
          ]);

          check.lastResult = {
            ...result, // details first, so a check cannot overwrite status or latency
            status: 'healthy',
            latency: Date.now() - startTime
          };
        } catch (error) {
          check.lastResult = {
            status: 'unhealthy',
            error: error.message,
            latency: Date.now() - startTime
          };
        } finally {
          clearTimeout(timer); // do not leave the timeout timer pending
        }

        check.lastCheck = Date.now();
        return check.lastResult;
      }

      async runAllChecks() {
        const results = {};
        
        for (const [name, check] of this.checks) {
          results[name] = await this.runCheck(name);
        }

        return results;
      }

      getOverallStatus(checkResults) {
        const criticalChecks = Array.from(this.checks.entries())
          .filter(([_, check]) => check.critical)
          .map(([name]) => name);

        const hasUnhealthyCritical = criticalChecks.some(
          name => checkResults[name]?.status === 'unhealthy'
        );

        const hasAnyUnhealthy = Object.values(checkResults).some(
          r => r?.status === 'unhealthy'
        );

        if (hasUnhealthyCritical) return 'unhealthy';
        if (hasAnyUnhealthy) return 'degraded';
        return 'healthy';
      }

      // Express middleware for health endpoints
      createRouter() {
        const router = express.Router();

        // Liveness probe - is the process running?
        router.get('/healthz', (req, res) => {
          res.status(200).json({
            status: 'alive',
            uptime: Date.now() - this.startTime
          });
        });

        // Readiness probe - can we handle traffic?
        router.get('/ready', async (req, res) => {
          if (this.status !== 'ready') {
            return res.status(503).json({
              status: this.status,
              message: 'Service not ready'
            });
          }

          // Quick check of critical dependencies
          const criticalResults = {};
          for (const [name, check] of this.checks) {
            if (check.critical) {
              criticalResults[name] = check.lastResult;
            }
          }

          const hasUnhealthy = Object.values(criticalResults).some(
            r => r?.status === 'unhealthy'
          );

          if (hasUnhealthy) {
            return res.status(503).json({
              status: 'not_ready',
              checks: criticalResults
            });
          }

          res.status(200).json({ status: 'ready' });
        });

        // Deep health check - detailed status of all dependencies.
        // Serves the cached results from the background timers, so hammering
        // this endpoint cannot multiply load on the dependencies.
        router.get('/health', (req, res) => {
          const checkResults = Object.fromEntries(
            Array.from(this.checks, ([name, check]) => [name, check.lastResult])
          );
          const overallStatus = this.getOverallStatus(checkResults);

          // 'degraded' still returns 200; only a critical failure is a 503
          const statusCode = overallStatus === 'unhealthy' ? 503 : 200;

          res.status(statusCode).json({
            status: overallStatus,
            timestamp: new Date().toISOString(),
            uptime: Date.now() - this.startTime,
            version: process.env.APP_VERSION || 'unknown',
            checks: checkResults
          });
        });

        return router;
      }

      setReady() {
        this.status = 'ready';
      }

      setNotReady(reason) {
        this.status = reason || 'not_ready';
      }
    }

    // Usage example
    const healthService = new HealthCheckService();

    // Register database check
    healthService.registerCheck('database', async () => {
      const start = Date.now();
      await pool.query('SELECT 1');
      return { latency: Date.now() - start };
    }, { critical: true, interval: 30000 });

    // Register Redis check
    healthService.registerCheck('redis', async () => {
      const start = Date.now();
      await redis.ping();
      return { latency: Date.now() - start };
    }, { critical: true, interval: 30000 });

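    // Register external API check
    healthService.registerCheck('payment-api', async () => {
      // Native fetch has no `timeout` option; use an abort signal instead
      const response = await fetch('https://api.stripe.com/v1/health', {
        signal: AbortSignal.timeout(5000)
      });
      // Throw so the failure is recorded as 'unhealthy' by runCheck
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      return { httpStatus: response.status };
    }, { critical: false, interval: 60000 });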

    // Mount health routes
    app.use(healthService.createRouter());

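    // Mark service as ready after initialization.
    // CommonJS modules cannot use top-level await, so wrap startup in a function.
    (async () => {
      await initializeDatabase();
      await warmUpCaches();
      healthService.setReady();
    })().catch((err) => {
      console.error('Startup failed:', err);
      process.exit(1);
    });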

    module.exports = { HealthCheckService };
    ```
  </Tab>

  <Tab title="Python">
    ```python theme={null}
    # FastAPI health check service with separate liveness, readiness, and deep endpoints.
    # Background tasks run each dependency check on its own interval so incoming
    # requests read cached results instead of triggering a fresh check every time.

    from __future__ import annotations

    import asyncio
    import os
    import time
    from dataclasses import dataclass, field
    from typing import Any, Awaitable, Callable

    from fastapi import APIRouter, FastAPI
    from fastapi.responses import JSONResponse

    CheckFn = Callable[[], Awaitable[dict[str, Any]]]


    @dataclass
    class Check:
        fn: CheckFn
        critical: bool = True
        timeout_s: float = 5.0
        interval_s: float = 30.0
        last_result: dict[str, Any] | None = None
        last_check_at: float = 0.0
        _task: asyncio.Task[None] | None = field(default=None, repr=False)


    class HealthCheckService:
        """Owns all registered dependency checks and exposes liveness / readiness / deep endpoints.

        Background tasks run checks on their configured intervals. The HTTP handlers read
        cached results (O(1)) to avoid turning /health into an amplifier under load.
        """

        def __init__(self) -> None:
            self._checks: dict[str, Check] = {}
            self._status: str = "starting"
            self._start_time = time.time()

        def register(
            self,
            name: str,
            fn: CheckFn,
            *,
            critical: bool = True,
            timeout_s: float = 5.0,
            interval_s: float = 30.0,
        ) -> None:
            self._checks[name] = Check(
                fn=fn, critical=critical, timeout_s=timeout_s, interval_s=interval_s
            )

        async def start(self) -> None:
            for name, check in self._checks.items():
                check._task = asyncio.create_task(self._run_loop(name, check))

        async def stop(self) -> None:
            for check in self._checks.values():
                if check._task:
                    check._task.cancel()

        def mark_ready(self) -> None:
            self._status = "ready"

        def mark_not_ready(self, reason: str = "not_ready") -> None:
            self._status = reason

        async def _run_loop(self, name: str, check: Check) -> None:
            while True:
                await self._run_once(name, check)
                await asyncio.sleep(check.interval_s)

        async def _run_once(self, name: str, check: Check) -> None:
            start = time.perf_counter()
            try:
                result = await asyncio.wait_for(check.fn(), timeout=check.timeout_s)
                check.last_result = {
                    **result,  # details first, so a check cannot overwrite status
                    "status": "healthy",
                    "latency_ms": (time.perf_counter() - start) * 1000,
                }
            except asyncio.TimeoutError:
                check.last_result = {
                    "status": "unhealthy",
                    "error": "timeout",
                    "latency_ms": (time.perf_counter() - start) * 1000,
                }
            except Exception as exc:
                check.last_result = {
                    "status": "unhealthy",
                    "error": str(exc),
                    "latency_ms": (time.perf_counter() - start) * 1000,
                }
            check.last_check_at = time.time()

        def _overall_status(self) -> str:
            has_critical_failure = any(
                c.critical and (c.last_result or {}).get("status") == "unhealthy"
                for c in self._checks.values()
            )
            has_any_failure = any(
                (c.last_result or {}).get("status") == "unhealthy"
                for c in self._checks.values()
            )
            if has_critical_failure:
                return "unhealthy"
            if has_any_failure:
                return "degraded"
            return "healthy"

        def build_router(self) -> APIRouter:
            router = APIRouter()

            @router.get("/healthz")
            async def liveness() -> dict[str, Any]:
                # Shallow: if the ASGI event loop can serve this handler, the process is alive.
                return {"status": "alive", "uptime_s": time.time() - self._start_time}

            @router.get("/ready")
            async def readiness() -> JSONResponse:
                if self._status != "ready":
                    return JSONResponse(
                        {"status": self._status, "message": "service not ready"},
                        status_code=503,
                    )
                critical = {
                    name: check.last_result
                    for name, check in self._checks.items()
                    if check.critical
                }
                if any((r or {}).get("status") == "unhealthy" for r in critical.values()):
                    return JSONResponse(
                        {"status": "not_ready", "checks": critical},
                        status_code=503,
                    )
                return JSONResponse({"status": "ready"}, status_code=200)

            @router.get("/health")
            async def deep() -> JSONResponse:
                results = {name: c.last_result for name, c in self._checks.items()}
                status = self._overall_status()
                code = 200 if status != "unhealthy" else 503
                return JSONResponse(
                    {
                        "status": status,
                        "timestamp": time.time(),
                        "uptime_s": time.time() - self._start_time,
                        "version": os.environ.get("APP_VERSION", "unknown"),
                        "checks": results,
                    },
                    status_code=code,
                )

            return router


    # ---------- wiring into FastAPI ----------

    async def check_database() -> dict[str, Any]:
        # In real code: `async with pool.acquire() as conn: await conn.execute("SELECT 1")`
        await asyncio.sleep(0.005)
        return {}


    async def check_redis() -> dict[str, Any]:
        # await redis.ping()
        await asyncio.sleep(0.002)
        return {}


    async def check_payment_api() -> dict[str, Any]:
        # async with httpx.AsyncClient(timeout=5.0) as c:
        #     r = await c.get("https://api.stripe.com/v1/health")
        return {"reachable": True}


    def build_app() -> FastAPI:
        app = FastAPI()
        health = HealthCheckService()
        health.register("database", check_database, critical=True, interval_s=30)
        health.register("redis", check_redis, critical=True, interval_s=30)
        health.register("payment-api", check_payment_api, critical=False, interval_s=60)

        @app.on_event("startup")
        async def _startup() -> None:
            await health.start()
            # ... warm caches, open pools ...
            health.mark_ready()

        @app.on_event("shutdown")
        async def _shutdown() -> None:
            await health.stop()

        app.include_router(health.build_router())
        return app
    ```
  </Tab>
</Tabs>
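
The services above record a dependency as unhealthy on its first failed check, while the surrounding text talks about requiring several consecutive failures. When the load balancer or orchestrator does not apply that threshold for you, a small state machine can do it inside the service. This is a sketch with hypothetical names (`ThresholdTracker`, `failure_threshold`, `success_threshold`), not part of the implementations above.

```python
from dataclasses import dataclass


@dataclass
class ThresholdTracker:
    """Flip health state only after N consecutive results contradict it."""

    failure_threshold: int = 3
    success_threshold: int = 2
    healthy: bool = True
    _streak: int = 0  # consecutive results that disagree with the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the (possibly updated) health state."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with current state, reset the streak
            return self.healthy
        self._streak += 1
        needed = self.failure_threshold if self.healthy else self.success_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

A single timeout no longer evicts an instance, and a single lucky success no longer readmits it. The cost is slower detection, which is the same tradeoff as tuning the probe interval.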

***

## Load Balancer Patterns

### Kubernetes Service Load Balancing

```yaml theme={null}
# kubernetes-lb.yaml

# ClusterIP Service (internal load balancing)
apiVersion: v1
kind: Service
metadata:
  name: user-service
spec:
  type: ClusterIP
  selector:
    app: user-service
  ports:
    - port: 80
      targetPort: 3000
  sessionAffinity: None  # or ClientIP for sticky sessions

---
# Headless Service (for client-side LB with service discovery)
apiVersion: v1
kind: Service
metadata:
  name: user-service-headless
spec:
  clusterIP: None
  selector:
    app: user-service
  ports:
    - port: 3000

---
# Deployment with health checks
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: user-service
  template:
    metadata:
      labels:
        app: user-service
    spec:
      containers:
        - name: user-service
          image: user-service:latest
          ports:
            - containerPort: 3000
          
          # Liveness probe - restart if fails
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
            
          # Readiness probe - remove from LB if fails
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
            
          # Startup probe - for slow starting containers
          startupProbe:
            httpGet:
              path: /healthz
              port: 3000
            failureThreshold: 30
            periodSeconds: 10
```
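
To make the probe settings above concrete, the sketch below works out the time windows they imply. It assumes each window is roughly `failureThreshold x periodSeconds` and ignores `timeoutSeconds` and scheduling jitter. `ProbeSpec` and `worst_case_window_s` are illustrative names, not part of any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeSpec:
    """Subset of the probe fields used in the manifest above."""
    period_s: int
    failure_threshold: int
    initial_delay_s: int = 0


def worst_case_window_s(probe: ProbeSpec) -> int:
    # Approximation: the kubelet acts after `failure_threshold` consecutive
    # failed probes spaced `period_s` apart. Probe timeouts are ignored.
    return probe.period_s * probe.failure_threshold


startup = ProbeSpec(period_s=10, failure_threshold=30)
liveness = ProbeSpec(period_s=15, failure_threshold=3, initial_delay_s=10)
readiness = ProbeSpec(period_s=10, failure_threshold=3, initial_delay_s=5)

startup_budget_s = worst_case_window_s(startup)    # time allowed for a slow boot
restart_after_s = worst_case_window_s(liveness)    # hung process -> restart
unrouted_after_s = worst_case_window_s(readiness)  # not ready -> removed from endpoints

print(startup_budget_s, restart_after_s, unrouted_after_s)
```

Under these assumptions a container gets up to about 300s to start, a hung process is restarted after roughly 45s, and a pod that stops being ready is taken out of rotation after roughly 30s. If those windows are too long for your latency budget, shorten `periodSeconds` before lowering `failureThreshold`, since a threshold of 1 makes probes sensitive to a single slow response.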

### Envoy Proxy Configuration

```yaml theme={null}
# envoy-lb.yaml - Advanced L7 load balancing

static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains: ["*"]
                      routes:
                        - match:
                            prefix: "/api/users"
                          route:
                            cluster: user_service
                            timeout: 30s
                            retry_policy:
                              retry_on: "5xx,reset,connect-failure"
                              num_retries: 3
                              per_try_timeout: 10s
                # The router filter must be the last HTTP filter; without it no request is forwarded
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: user_service
      connect_timeout: 5s
      type: STRICT_DNS
      lb_policy: LEAST_REQUEST  # Fewest active requests (power-of-two-choices when host weights are equal)
      
      # Circuit breaker
      circuit_breakers:
        thresholds:
          - priority: DEFAULT
            max_connections: 1000
            max_pending_requests: 1000
            max_requests: 1000
            max_retries: 3
            
      # Health checking
      health_checks:
        - timeout: 5s
          interval: 10s
          unhealthy_threshold: 3
          healthy_threshold: 2
          http_health_check:
            path: "/health"
            
      # Outlier detection (automatic ejection of unhealthy hosts)
      outlier_detection:
        consecutive_5xx: 5
        interval: 10s
        base_ejection_time: 30s
        max_ejection_percent: 50
        
      load_assignment:
        cluster_name: user_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: user-1.internal
                      port_value: 3000
              - endpoint:
                  address:
                    socket_address:
                      address: user-2.internal
                      port_value: 3000
              - endpoint:
                  address:
                    socket_address:
                      address: user-3.internal
                      port_value: 3000
```
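
The `outlier_detection` block above can be hard to reason about from the YAML alone. The sketch below is a simplified model of consecutive-5xx ejection using the same three settings. It is not Envoy's implementation: `Host` and `OutlierDetector` are hypothetical names, and details such as the periodic `interval` sweep are left out.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    consecutive_5xx: int = 0
    times_ejected: int = 0
    ejected_until: float = 0.0  # timestamp; 0 means "in rotation"


@dataclass
class OutlierDetector:
    """Simplified model of consecutive-5xx ejection (illustrative only)."""
    hosts: dict[str, Host]
    consecutive_5xx: int = 5
    base_ejection_time_s: float = 30.0
    max_ejection_percent: int = 50

    def is_ejected(self, name: str, now: float) -> bool:
        return now < self.hosts[name].ejected_until

    def record(self, name: str, status: int, now: float) -> None:
        host = self.hosts[name]
        if status < 500:
            host.consecutive_5xx = 0  # any non-5xx response resets the streak
            return
        host.consecutive_5xx += 1
        if host.consecutive_5xx < self.consecutive_5xx:
            return
        ejected = sum(self.is_ejected(n, now) for n in self.hosts)
        # Refuse to eject if that would push the ejected share past the cap.
        if (ejected + 1) * 100 > self.max_ejection_percent * len(self.hosts):
            return
        host.times_ejected += 1
        host.consecutive_5xx = 0
        # Repeat offenders stay out longer: base time x number of ejections.
        host.ejected_until = now + self.base_ejection_time_s * host.times_ejected


detector = OutlierDetector({n: Host(n) for n in ("user-1", "user-2", "user-3")})
for _ in range(5):
    detector.record("user-1", 503, now=0.0)  # fifth failure ejects user-1
for _ in range(5):
    detector.record("user-2", 503, now=1.0)  # blocked by the 50% ejection cap
```

In this model `user-1` is ejected for 30 seconds, while `user-2` stays in rotation despite five failures because ejecting two of three hosts would exceed `max_ejection_percent: 50`. That cap is what keeps a correlated failure (for example a shared dependency outage) from emptying the pool.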

***

## Interview Questions

<AccordionGroup>
  <Accordion title="Q1: Client-side vs Server-side load balancing?">
    **Answer:**

    **Server-side (e.g., NGINX, HAProxy):**

    * Single point for routing
    * Simple clients
    * Extra network hop
    * Centralized control

    **Client-side (e.g., Ribbon, custom):**

    * Client decides which server
    * No extra hop
    * More complex clients
    * Better for microservices

    **When to use each:**

    * Server-side: External traffic, legacy clients
    * Client-side: Service-to-service within cluster
    * Hybrid: Edge LB + client-side internally
  </Accordion>

  <Accordion title="Q2: When would you use Consistent Hashing?">
    **Answer:**

    **Use cases:**

    * Cache servers (minimize cache misses on scale)
    * Session affinity without IP hash
    * Partitioned data (same key → same server)

    **How it works:**

    * Servers and keys mapped to same hash ring
    * Key routed to next server clockwise
    * Adding/removing server affects only neighbors

    **Virtual nodes:**

    * Multiple positions per server for balance
    * 100-200 virtual nodes per physical server
  </Accordion>

  <Accordion title="Q3: What's the difference between liveness and readiness probes?">
    **Answer:**

    **Liveness:**

    * "Is the process stuck?"
    * Failure → container restart
    * Should be simple (no dependencies)
    * Example: Can the HTTP server respond?

    **Readiness:**

    * "Can it handle traffic?"
    * Failure → remove from load balancer
    * Can check dependencies
    * Example: Is database connection ready?

    **Common mistake:** Using deep checks for liveness causes cascading restarts when a dependency is down. If your liveness check queries the database, and the database is slow, Kubernetes restarts your pods. Now you have fewer pods, more load on remaining ones, they time out too, and Kubernetes restarts them as well. Within minutes, every pod is in a restart loop -- all because your liveness check was too aggressive. Keep liveness checks trivial (can the HTTP server respond?) and put dependency checks in readiness only.
  </Accordion>

  <Accordion title="Q4: How does Power of Two Choices (P2C) work?">
    **Answer:**

    Simple but effective algorithm:

    1. Pick 2 random servers
    2. Choose the one with fewer connections

    **Why it works:**

    * Avoids herd behavior (all clients picking same "best" server)
    * O(1) complexity (no sorting)
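    * Statistical guarantee: the expected maximum load grows as \~log(log(n)), versus \~log(n)/log(log(n)) when each request picks a single random server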

    **Used by:** Envoy, HAProxy, Netflix Zuul

    **Better than round-robin** because it considers actual load.
  </Accordion>
</AccordionGroup>
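
The two steps from Q4 fit in a few lines. The sketch below is a minimal illustration, assuming the balancer keeps an in-memory count of active connections per server. `pick_p2c` and `simulate` are hypothetical names, and the simulation never completes requests, so it only shows how load spreads on assignment.

```python
import random


def pick_p2c(active: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct servers at random and return the less loaded one."""
    a, b = rng.sample(list(active), 2)
    return a if active[a] <= active[b] else b


def simulate(n_servers: int, n_requests: int, two_choices: bool, seed: int = 7) -> int:
    """Assign requests that never finish and return the largest per-server load."""
    rng = random.Random(seed)
    active = {f"s{i}": 0 for i in range(n_servers)}
    names = list(active)
    for _ in range(n_requests):
        target = pick_p2c(active, rng) if two_choices else rng.choice(names)
        active[target] += 1
    return max(active.values())


if __name__ == "__main__":
    print("single random choice:", simulate(50, 5000, two_choices=False))
    print("power of two choices:", simulate(50, 5000, two_choices=True))
```

Running the script prints the maximum load under each strategy, and the two-choice run should land much closer to the average of 100 per server. The selection itself is constant time: two random samples and one comparison, with no global sort of the server list.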

***

## Chapter Summary

<Info>
  **Key Takeaways:**

  * Server-side LB for external traffic, client-side for internal
  * Algorithm choice depends on workload: round-robin for uniform, short requests; least-connections when request cost or instance speed varies
  * Consistent hashing for caching and stateful services
  * Implement both liveness and readiness probes
  * Health checks need explicit timeouts plus failure and recovery thresholds so a single slow probe does not cause flapping
  * Use circuit breakers with load balancing for resilience
</Info>

**Next Chapter:** Migration Patterns - Strangler Fig, Branch by Abstraction, and more.

***

## Interview Questions with Structured Answers

<AccordionGroup>
  <Accordion title="You deploy a new version but 10% of users are still hitting the old version 20 minutes later. What is wrong?">
    **Strong Answer Framework:**

    1. **Identify the connection type.** Long-lived connections (HTTP/2, gRPC, WebSocket) stay attached to the old backend until the client disconnects. A 20-minute HTTP/2 connection is expected, not a bug.
    2. **Check health check state propagation.** The load balancer may still consider old instances healthy if the health check interval is high or the failure threshold is large.
    3. **Inspect the deployment mechanism.** Rolling deploy? Blue/green? Did the old pods actually terminate, or is `terminationGracePeriodSeconds` still bleeding them out?
    4. **Look at session affinity.** If stickiness is enabled, the 10% of affected users might be sticky to old instances that have not yet rotated.
    5. **Check cache layers.** DNS TTL, client-side service discovery caches, and sidecar proxies may be holding stale endpoint lists.
    6. **Correlate the 10% to a signal.** Is it 10% of connections, 10% of users, 10% of specific clients? The specificity tells you which layer to inspect.

    **Real-World Example:** Slack (2020 incident postmortem). Slack had a case where gRPC clients with long-lived HTTP/2 connections stayed pinned to old backends for hours after deploy because they did not rotate connections. The fix was adding max-connection-age of 30 minutes on the server side, forcing clients to reconnect periodically and rebalance. The pattern appears in every company that runs gRPC at scale.

    **Senior Follow-up Questions:**

    <Note>
      **Q: "How would you force clients on a long-lived HTTP/2 connection to reconnect during a deploy?"**

      A: Server-side, you can send a GOAWAY frame, which tells the client "finish your in-flight requests but open a new connection for future requests." Envoy does this via `max_connection_duration` in its common HTTP protocol options (`common_http_protocol_options`); gRPC servers have `MaxConnectionAge` (Go) or `maxConnectionAge` (Java). This is the canonical solution. Alternative: force connection close on shutdown, but that causes in-flight request failures if not combined with connection draining.
    </Note>

    <Note>
      **Q: "The 10% is really 10% of users, not connections. What does that tell you?"**

      A: That signals session affinity. If 10% of users are getting old behavior, those users are probably sticky to specific instances that have not rotated yet. Check for: cookie-based stickiness, IP hash affinity, or application-level sharding where user ID maps to a specific shard. Fix: ensure affinity rotates along with deploys (invalidate affinity cookies on deploy) or accept that sticky users will lag the rollout by the session timeout.
    </Note>

    <Note>
      **Q: "Your deploy was blue/green with a traffic switch at the load balancer. How is 10% still hitting blue?"**

      A: DNS TTL is the usual culprit. If your traffic switch updated DNS records with a 5-minute TTL, clients that resolved before the switch keep connecting to blue for up to 5 minutes, or longer if they cache DNS aggressively (a JVM running with a security manager, for example, caches successful lookups forever unless `networkaddress.cache.ttl` is set). For instant cutovers, do the traffic switch at the load balancer level (weighted listener, Envoy RDS update), not at DNS. Alternative: use a service mesh where the mesh itself handles endpoint changes in milliseconds.
    </Note>

    **Common Wrong Answers:**

    * **"The deploy did not complete; some pods are still running the old version."** Possible but lazy. Kubernetes rolling deploys are visible in `kubectl rollout status`; if the rollout finished, the old pods are gone. This answer ignores the hard cases where deploy completed but traffic lingers.
    * **"It is a cache somewhere, just bust the cache."** Too vague. Which cache? DNS, client service discovery, application cache, CDN, browser? A strong answer names the specific cache layer and its TTL.

    **Further Reading:**

    * "gRPC Load Balancing" -- official gRPC documentation on client-side LB and connection lifecycle
    * "Envoy Proxy Documentation: Connection Management" -- covers max\_connection\_duration and graceful termination
    * "DNS is still the protocol of the internet" -- Julia Evans blog post explaining why DNS caching surprises engineers during deploys
  </Accordion>

  <Accordion title="Your p99 latency is 2x your p50 even though all instances have similar CPU. What load balancing issue might be causing this?">
    **Strong Answer Framework:**

    1. **Rule out application-level causes first.** Slow queries, GC pauses, cold caches all produce long tails that are not the load balancer's fault.
    2. **Measure per-instance p99.** If one instance has a 2x higher p99 than others, traffic distribution is uneven even if CPU averages match.
    3. **Check connection pooling under L4.** If your LB is L4 and clients use HTTP/1.1 keepalive, the connections distribute but requests within each connection do not.
    4. **Inspect queue depth at each backend.** Two backends with the same CPU can have very different queue depths if one is upstream of a slower dependency or has a different JIT state.
    5. **Consider algorithmic herding.** Least-connections can oscillate; least-response-time can create feedback loops. P2C breaks both.
    6. **Look at outlier detection.** If a backend is 3x slower than the median, it should be ejected. Envoy's outlier detection and Kubernetes readiness probes should catch this.

    **Real-World Example:** Shopify (around 2019). Shopify documented tail latency issues caused by round-robin load balancing where one backend had a subtly slower disk. The instance did not fail health checks, but its p99 was 300ms while others were 100ms. Round-robin gave it an equal share of traffic, so that whole share of requests was exposed to the 300ms tail even though the service "looked healthy." Their fix: EWMA-based (exponentially weighted moving average) response-time load balancing that routes fewer requests to slower instances without creating oscillation.

    **Senior Follow-up Questions:**

    <Note>
      **Q: "How would you detect and automatically eject a 'gray failure' instance that is slow but not failing health checks?"**

      A: Outlier detection at the load balancer layer. Envoy's outlier detection config ejects instances that exceed a threshold for consecutive 5xx, consecutive gateway failures, or success-rate deviation. These signals are all error-driven, so a slow-but-successful instance is caught only once its requests start timing out; for latency-based ejection you need per-instance p99 tracking on top of success-rate ejection. A service mesh like Istio exposes the outlier-detection settings through its `DestinationRule`. Alternative: custom sidecar logic that reports per-instance latency to the LB control plane, which adjusts weights downward for slow instances.
    </Note>

    <Note>
      **Q: "You find that one instance always has higher p99. It is on a shared node with a noisy neighbor. How do you structurally prevent this?"**

      A: Three options. First, use dedicated nodes (Kubernetes node taints, dedicated node pools) so critical services do not share hardware. Cost: higher infrastructure spend. Second, request guaranteed-QoS pods with CPU/memory limits equal to requests, so the kernel enforces isolation. Cost: less bin-packing efficiency. Third, use pod anti-affinity to spread replicas across nodes, so a single noisy neighbor only affects one replica out of many. Cost: harder to schedule in small clusters. In practice, combine all three for tier-1 services.
    </Note>

    <Note>
      **Q: "If all else fails, can you just over-provision and ignore the problem?"**

      A: Over-provisioning reduces per-instance load, which reduces tail latency effects, but it does not fix the imbalance. If one instance is 2x slower intrinsically, even with half the load that instance still has 2x p99. You might mask the problem until traffic grows 3x and the slow instance becomes a bottleneck again. Over-provisioning is a valid short-term mitigation while you address the root cause, not a strategy.
    </Note>

    **Common Wrong Answers:**

    * **"Use least-response-time, that will solve it."** Dangerous. Pure least-response-time creates feedback oscillation. Use P2C or EWMA-weighted instead.
    * **"Add more replicas."** Does not address the algorithmic or per-instance issues. You need to fix the imbalance, not add noise to dilute it.

    **Further Reading:**

    * "The Tail at Scale" by Jeffrey Dean and Luiz Barroso -- the canonical Google paper on tail latency
    * "Envoy Outlier Detection" -- docs cover the ejection algorithm and tuning
    * "Load Balancing at Shopify" -- engineering blog posts on EWMA and weighted load balancing
  </Accordion>

  <Accordion title="You enable session affinity and now one instance is at 90% CPU while others are at 20%. What happened and how do you fix it?">
    **Strong Answer Framework:**

    1. **Identify the affinity scheme.** IP-hash? Cookie-hash? Header-hash? Each has distinct failure modes.
    2. **Check the input distribution.** IP-hash buckets by source IP; if your traffic comes through a handful of corporate NATs or mobile carrier gateways, all users behind that NAT map to one instance.
    3. **Estimate the skew.** How many unique keys map to each instance? If one instance gets 100x the keys of another, the hash distribution is broken.
    4. **Consider whether stickiness is needed at all.** Session affinity is often a crutch for state that should be externalized (session store, Redis, JWT).
    5. **Choose a mitigation.** Bucket by session ID (higher-cardinality input), fall back to round-robin for the hot instance, or remove stickiness entirely.

    **Real-World Example:** Zoom (2020 during COVID surge). Zoom had session affinity at the meeting-ID level to pin users to the same media server for a call. When a handful of massive meetings (tens of thousands of attendees each) all started at once, those meetings' servers maxed out while other servers sat idle. Fix was two-layered: shard large meetings across multiple servers (session affinity at a sub-meeting level) and implement graceful overload handling (reject new connections to an over-capacity server, let the LB pick another).

    **Senior Follow-up Questions:**

    <Note>
      **Q: "Can you remove session affinity without breaking the application?"**

      A: Yes, if application state is externalized. Move session state to Redis or a signed JWT, move file upload state to S3 with a signed URL, and move database transactions to a connection pooler that handles routing. Once state is external, affinity is not needed -- any instance can serve any request. The hard cases are WebSocket connections (physically bound to an instance by the TCP connection) and in-memory caches that take minutes to warm up. For WebSockets, horizontal scaling with a pub-sub like Redis handles it. For warm caches, accept a cold-start penalty or use a shared distributed cache instead.
    </Note>

    <Note>
      **Q: "You need to keep session affinity for WebSockets. How do you prevent the hot-instance problem?"**

      A: Three defenses. First, hash by a high-cardinality key (connection ID, not user ID or IP) so you do not create artificial clusters. Second, implement connection caps per instance -- once an instance reaches its max (say 5000 WebSockets), new connection requests get routed elsewhere even if the hash would normally land there. Third, monitor per-instance variance and alert if any instance exceeds 2x the fleet median; this catches the problem before users notice.
    </Note>

    <Note>
      **Q: "If you had designed this system from scratch, would you use session affinity at all?"**

      A: Reluctantly, yes, for specific cases: WebSocket connections (physically bound), in-memory warm caches where cold starts are expensive, and stateful streaming protocols. For stateless HTTP APIs, no. State in any form (local caches, session storage) scales worse than external state, and affinity is a tax paid to hide that. My preference: externalize state first, treat instances as fungible, and only introduce affinity for the genuinely stateful protocols where it is unavoidable.
    </Note>

    **Common Wrong Answers:**

    * **"Add more instances and the load will even out."** Does not fix the hash distribution; you just get more idle instances while the hot one stays hot.
    * **"Disable session affinity."** May be the right answer in the end, but only if the application can handle it. Blindly disabling affinity often breaks features (logged-out users, lost shopping carts) in ways that are hard to diagnose.

    **Further Reading:**

    * "Load Balancing is Impossible" by Tyler McMullen -- talk explaining why perfect distribution is theoretically impossible
    * "Session Affinity in Kubernetes" -- official Kubernetes docs on clientIP affinity and its limitations
    * "Zoom's Architecture" -- various 2020-2021 blog posts about how Zoom scaled during the pandemic
  </Accordion>
</AccordionGroup>
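
The connection-cap defense from the WebSocket follow-up above can be sketched as hash-based affinity with overflow. This is a minimal illustration, assuming the balancer tracks open connections per instance; `stable_hash` and `pick_with_cap` are hypothetical helpers, and the cap of 40 is arbitrary.

```python
import hashlib


def stable_hash(key: str) -> int:
    # hashlib gives the same value in every process, unlike built-in hash().
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def pick_with_cap(key: str, instances: list[str], load: dict[str, int], cap: int) -> str:
    """Prefer the hashed instance; walk to the next one if it is at capacity."""
    start = stable_hash(key) % len(instances)
    for offset in range(len(instances)):
        candidate = instances[(start + offset) % len(instances)]
        if load[candidate] < cap:
            load[candidate] += 1
            return candidate
    raise RuntimeError("every instance is at its connection cap")


instances = ["ws-0", "ws-1", "ws-2"]
load = {name: 0 for name in instances}

# 100 connections that all share one affinity key, e.g. users behind a single NAT.
for _ in range(100):
    pick_with_cap("203.0.113.7", instances, load, cap=40)

print(sorted(load.values()))  # [20, 40, 40]
```

Without the cap, all 100 connections would land on one instance. With it, the preferred instance fills to 40 and the overflow spills to its neighbors, at the price of weaker affinity for the spilled connections.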

***

## Interview Deep-Dive

<AccordionGroup>
  <Accordion title="'You have 5 instances of a service behind a load balancer using round-robin. One instance is consistently slower than the others (maybe it is on a degraded host). How does this affect the system, and what load balancing algorithm would you switch to?'">
    **Strong Answer:**

    With round-robin, each instance gets 20% of traffic regardless of performance. If one instance has 3x the latency of the others, 20% of all requests are slow. From the user's perspective, one in five requests takes 3 seconds instead of 1 second, so every percentile above P80 (P90, P95, P99) reflects the slow instance's latency. This is worse than it sounds because many users retry slow requests, which adds even more load to the system.

    The immediate fix is switching to least-connections load balancing. Least-connections sends the next request to the instance with the fewest active connections. The slow instance accumulates connections (because responses take longer), so it naturally receives fewer new requests. The fast instances finish requests quickly, free up connections, and get more traffic. The system self-balances based on actual performance.

    For an even better approach: weighted least-connections with adaptive weights. The load balancer tracks each instance's response time and adjusts weights accordingly. An instance that consistently responds in 500ms gets 3x the weight of an instance responding in 1500ms. NGINX Plus offers this as `least_time`; other proxies, such as Linkerd, balance on an EWMA (Exponentially Weighted Moving Average) of response latency.
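
    As a rough illustration of adaptive weights, the sketch below smooths per-instance latency with an EWMA and derives weights proportional to the inverse of that latency. The class and function names are hypothetical, and real balancers add safeguards (minimum weights, decay for idle instances) that are omitted here.

    ```python
    from typing import Optional


    class EwmaLatency:
        """Exponentially weighted moving average of response time for one instance."""

        def __init__(self, alpha: float = 0.2) -> None:
            self.alpha = alpha  # higher alpha reacts faster but is noisier
            self.value_ms: Optional[float] = None

        def update(self, sample_ms: float) -> float:
            if self.value_ms is None:
                self.value_ms = sample_ms  # seed with the first observation
            else:
                self.value_ms = self.alpha * sample_ms + (1 - self.alpha) * self.value_ms
            return self.value_ms


    def inverse_latency_weights(latency_ms: dict[str, float]) -> dict[str, float]:
        """Normalised weights proportional to 1 / smoothed latency."""
        inverse = {name: 1.0 / ms for name, ms in latency_ms.items()}
        total = sum(inverse.values())
        return {name: value / total for name, value in inverse.items()}


    weights = inverse_latency_weights({"fast": 500.0, "slow": 1500.0})
    print(weights)  # fast gets three times the weight of slow
    ```

    With smoothed latencies of 500ms and 1500ms the weights come out at 0.75 and 0.25, which is the 3x ratio described above. The smoothing matters: weighting on raw per-request latency would swing traffic on every outlier.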

    The root cause fix: investigate why that instance is slow. Common causes: the instance landed on a host with a noisy neighbor (another pod consuming CPU), the instance has a different configuration (wrong JVM heap size, missing optimization flags), or the instance's local cache is cold while the others are warm. In Kubernetes, I would check if the pod is scheduled on a node with resource pressure using `kubectl describe node`.

    **Follow-up: "How does load balancing work differently for gRPC compared to REST?"**

    gRPC uses HTTP/2 with long-lived connections. A traditional L4 load balancer (AWS NLB, kube-proxy) balances at connection time, not per-request. Once a gRPC client opens a connection to one backend, all requests go to that backend. With 5 backends and 3 clients, you might have all 3 clients connected to the same 2 backends while 3 backends sit idle.

    The fix is L7 load balancing that understands HTTP/2 frames and can route individual requests within a connection. Envoy (in Istio or standalone) does this natively. Alternatively, use client-side load balancing: gRPC has built-in support for resolving multiple backends (via DNS or a custom resolver) and distributing requests across them with the `round_robin` policy, under which the client opens connections to all resolved backends and rotates requests among them. The default `pick_first` policy does not balance: it sends every request to the first backend it manages to connect to.
  </Accordion>

  <Accordion title="'Explain consistent hashing and why it matters for distributed caching. What happens when you add or remove a cache node?'">
    **Strong Answer:**

    Consistent hashing distributes keys across cache nodes in a way that minimizes key redistribution when nodes are added or removed. In traditional hash-based routing (key % number\_of\_nodes), adding one node changes most keys' assignments: growing a 5-node cluster to 6 nodes remaps about 83% of keys, because a key stays put only if its hash has the same remainder modulo 5 and modulo 6, which holds for 1 in 6 hashes. With consistent hashing, adding a node remaps only the keys the new node takes over, about 1/(N+1) of them (roughly 17% when going from 5 to 6 nodes).

    The mechanism: imagine a circular ring from 0 to 2^32. Each cache node is hashed to a position on the ring. Each key is also hashed to a position, and it is assigned to the next node clockwise from its position. When a node is added, only the keys between the new node and its predecessor are remapped. When a node is removed, only its keys move to the next clockwise node.

    In a microservices context, this matters for Redis Cluster, Memcached pools, and any distributed cache. When a cache node goes down during a traffic spike, you want minimal cache misses. With consistent hashing, only 1/N of keys become misses (they need to be re-fetched from the database). Without it, nearly all keys become misses, triggering the thundering herd problem.

    The practical enhancement: virtual nodes. Instead of one position on the ring per physical node, each node gets 100-200 virtual positions. This ensures even distribution -- without virtual nodes, the ring can become unbalanced with some nodes owning disproportionately more keys. With virtual nodes, even small clusters have uniform key distribution.
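
    The ring and virtual nodes described here can be sketched briefly. The code below is an illustrative model, not a production client: `HashRing`, `ring_position`, and the choice of 150 virtual nodes are assumptions for the example. It also measures how many keys move when a sixth node is added, compared with modulo routing.

    ```python
    import bisect
    import hashlib


    def ring_position(value: str) -> int:
        """Stable 32-bit hash, i.e. a position on a ring of size 2**32."""
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:4], "big")


    class HashRing:
        def __init__(self, nodes: list[str], vnodes: int = 150) -> None:
            # Each physical node owns `vnodes` points on the ring.
            self._points = sorted(
                (ring_position(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
            )
            self._hashes = [position for position, _ in self._points]

        def lookup(self, key: str) -> str:
            # First point clockwise from the key's position, wrapping at the end.
            idx = bisect.bisect_right(self._hashes, ring_position(key)) % len(self._points)
            return self._points[idx][1]


    def moved_fraction(before, after, keys) -> float:
        return sum(before(k) != after(k) for k in keys) / len(keys)


    keys = [f"user:{i}" for i in range(20_000)]
    five = [f"cache-{i}" for i in range(5)]
    ring5, ring6 = HashRing(five), HashRing(five + ["cache-5"])

    ring_moved = moved_fraction(ring5.lookup, ring6.lookup, keys)
    mod_moved = moved_fraction(
        lambda k: ring_position(k) % 5, lambda k: ring_position(k) % 6, keys
    )
    print(f"ring: {ring_moved:.1%} moved, modulo: {mod_moved:.1%} moved")
    ```

    The ring figure should come out near one sixth and the modulo figure near five sixths. Every key that moves on the ring moves to the new node, which is why only that node starts with a cold cache.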

    Real-world application: when I set up a Redis cluster for a microservices caching layer, I rely on Redis Cluster's built-in sharding. Strictly speaking it is not a hash ring: each key maps to one of 16384 fixed hash slots (CRC16 of the key modulo 16384), and slots are assigned to nodes. The effect is similar, though. When I add a shard, only the slots being reassigned are migrated. The clients (using cluster-aware client libraries) follow `MOVED` and `ASK` redirects during migration, so the transition is transparent to the application.

    **Follow-up: "What about hot keys -- when one key is accessed 1000x more than others and the node holding that key becomes a bottleneck?"**

    Hot keys are the Achilles heel of consistent hashing. The key is always on one node, and no amount of load balancing fixes that. Solutions: first, read replicas -- the hot key is replicated to multiple nodes, and reads are distributed across replicas. Second, local caching -- each service instance caches the hot key in-process memory with a short TTL, so most reads of that key never reach Redis. Third, key splitting -- instead of one key "product:123", create "product:123:shard:0" through "product:123:shard:9" and randomly read from any shard. This spreads the hot key across up to 10 nodes, depending on where the shard keys hash. The trade-off is write complexity (writes must update all 10 shards).
  </Accordion>

  <Accordion title="'How do health checks work in a load balancing context, and what is the difference between liveness and readiness checks?'">
    **Strong Answer:**

    In load balancing, health checks determine whether an instance should receive traffic. A healthy instance gets requests; an unhealthy instance is removed from the pool. The subtlety is that "healthy" is not binary -- an instance can be alive but not ready, or ready for some requests but degraded for others.

    Liveness checks answer: "Is this process running and not deadlocked?" If liveness fails, the instance should be restarted (in Kubernetes, the kubelet kills and restarts the container). Liveness checks should be simple: "can you respond to HTTP on this port?" They should NOT call external dependencies -- if your liveness check verifies the database connection and the database is down, Kubernetes kills your perfectly functional application pods, making the outage worse.

    Readiness checks answer: "Can this instance handle requests right now?" If readiness fails, the instance is removed from the load balancer pool but NOT restarted. Readiness checks should verify that the instance can serve its purpose: database connection is established, caches are warm, dependent services are reachable. During a rolling deployment, new pods are not "ready" until they have initialized. During a downstream outage, pods might become "not ready" if their circuit breaker is open.

    The mistake I see most often: combining liveness and readiness into one health check. When a downstream service goes down, the combined check fails, Kubernetes kills the pods (liveness failure), new pods start but also fail the check (downstream is still down), and now you have a crash loop that eliminates your service entirely. With separate checks, the pods stay alive (liveness passes) but stop receiving traffic (readiness fails) until the downstream recovers.

    For load balancer configuration outside Kubernetes (NGINX, HAProxy), I implement active health checks (the load balancer periodically probes each backend) with a failure threshold (3 consecutive failures before removal) and a recovery threshold (2 consecutive successes before re-addition). This prevents flapping when a single health check times out due to network jitter.

    **Follow-up: "How do you implement a 'degraded' health state where the instance can handle some requests but not all?"**

    I implement a custom health endpoint that returns different responses based on the instance's state. HTTP 200 means fully healthy. HTTP 429 means "I am overloaded, reduce my traffic weight." HTTP 503 means "remove me from the pool." This status-code convention is my own, so the load balancer or its control plane has to be configured to interpret 429 by reducing the weight (sending fewer requests) rather than removing the instance entirely. This allows graceful degradation: an instance experiencing high memory pressure can signal "send me fewer requests" rather than going from 100% traffic to 0% traffic. Envoy has a built-in equivalent: a host that includes the `x-envoy-degraded` header in its health check response is marked "degraded" and receives traffic only when there are not enough healthy hosts. For simpler load balancers, a cruder approximation is to have the readiness check return 503 intermittently so the instance spends part of the time out of rotation. This only works when the failure and recovery thresholds are 1, and it causes flapping, so I treat it as a last resort.
  </Accordion>
</AccordionGroup>
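
The failure and recovery thresholds from the health-check answer above (3 consecutive failures to remove a backend, 2 consecutive successes to re-add it) amount to a small state machine. The sketch below is one way to model it; `BackendHealth` is a hypothetical name and the probe results are supplied by hand rather than by real HTTP checks.

```python
from dataclasses import dataclass


@dataclass
class BackendHealth:
    """Tracks one backend's pool membership from active probe results."""
    unhealthy_threshold: int = 3  # consecutive failures before removal
    healthy_threshold: int = 2    # consecutive successes before re-adding
    in_pool: bool = True
    _streak: int = 0              # run of results that disagree with the current state

    def observe(self, probe_ok: bool) -> bool:
        # A result that agrees with the current state breaks any opposing streak.
        if probe_ok == self.in_pool:
            self._streak = 0
            return self.in_pool
        self._streak += 1
        needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= needed:
            self.in_pool = probe_ok
            self._streak = 0
        return self.in_pool


backend = BackendHealth()
probes = [False, True, False, False, False, True, True]
history = [backend.observe(ok) for ok in probes]
print(history)  # [True, True, True, True, False, False, True]
```

The single failed probe at the start (network jitter, say) does not remove the backend, because the following success resets the streak. Removal happens only on the third consecutive failure, and the backend returns after two consecutive successes.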