Load Balancing Deep Dive
Load balancing is critical for distributing traffic across service instances efficiently and reliably. Think of a load balancer like a restaurant host seating guests: a bad host sends everyone to the same table while others sit empty. A good host considers which section has capacity, which server is fastest, and whether a particular guest always sits in the same spot. The choice of algorithm matters far more than most teams realize: the wrong algorithm under high load can turn a 50ms request into a 5-second timeout.
The reason load balancing deserves an entire chapter is that it sits at the fault line between two very different worlds. On one side is capacity planning: do your instances have enough CPU to handle expected load? On the other side is latency engineering: are requests reaching the fastest instance right now? A naive load balancer solves the first problem but ignores the second: it will happily keep routing a full share of traffic (20% in a five-instance pool) to a dying instance because “it is still in the pool.” A sophisticated load balancer treats every instance as a live, shifting quantity and routes based on real-time behavior, not static configuration. The patterns in this chapter exist because teams have learned, often painfully, that assumptions about instance health do not hold under load.
- Understand client-side vs server-side load balancing
- Master load balancing algorithms
- Implement health checking strategies
- Build intelligent load balancing with Node.js
Client-Side vs Server-Side Load Balancing
Before choosing an algorithm, you must first choose where the load balancing happens. This is one of the most consequential architectural decisions in microservices, and yet most teams default to whatever their platform provides (Kubernetes Service -> server-side, gRPC -> client-side) without understanding the tradeoffs.
Server-side load balancing is simpler: clients talk to a single endpoint and the load balancer decides where traffic goes. Client-side load balancing is more powerful: the client itself knows about all backend instances and decides directly, eliminating one network hop and enabling per-request routing intelligence.
The catch with server-side: you have added a proxy that every request traverses. If that proxy is slow, has bugs, or is overloaded, all your traffic suffers. The catch with client-side: every service that calls the target service needs the load balancing logic embedded in it. A Python service, a Go service, and a Node service all need compatible client-side load balancers. This is why service mesh proxies such as Envoy (built at Lyft) and Linkerd (built by Buoyant) exist: they give you client-side load balancing without forcing every language to reimplement it. In modern microservices, the answer is often “both”: server-side at the edge (for external traffic) and client-side or mesh-based internally.
Server-Side: NGINX Configuration
NGINX has been the default choice for server-side HTTP load balancing for over a decade, and a typical upstream configuration shows why. The configuration is declarative: you describe the upstream pool, the algorithm, and the health check rules, and let NGINX handle the mechanics. The downside is also visible: this configuration is static. To change weights or add a server, you edit the file and reload NGINX. That does not scale when you have 50 microservices and hundreds of instances coming and going via Kubernetes autoscaling. This is why cloud-native environments prefer dynamic service discovery (Consul, Kubernetes Service, Envoy xDS) even though the underlying algorithms are the same.
One specific anti-pattern worth highlighting: running NGINX with ip_hash and then wondering why one server is overwhelmed. ip_hash buckets clients by IP, so if most of your traffic comes from a small number of IPs (think mobile carriers, corporate NATs), the distribution will be badly skewed. Use ip_hash only when you genuinely need session affinity and you have verified your client IP distribution is diverse.
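The skew described above is easy to reproduce. The following is a minimal sketch, not NGINX's actual implementation: the FNV-1a hash, the IP addresses, and the 900/100 traffic split are all assumptions chosen for illustration. (Real ip_hash keys on the first three octets of an IPv4 address, so the two `203.0.113.x` gateways below would collapse onto one backend and the skew would be even worse.)

```javascript
// FNV-1a 32-bit hash: a stand-in for the balancer's hash, not NGINX's own.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Count how many requests each backend receives when bucketing by source IP.
function distributeByIp(requestIps, backendCount) {
  const counts = new Array(backendCount).fill(0);
  for (const ip of requestIps) {
    counts[fnv1a(ip) % backendCount] += 1;
  }
  return counts;
}

// Hypothetical traffic: 900 requests arrive through 3 NAT gateways,
// 100 requests come from 100 distinct IPs.
const natIps = ['203.0.113.10', '203.0.113.11', '198.51.100.7'];
const requests = [];
for (let i = 0; i < 900; i++) requests.push(natIps[i % natIps.length]);
for (let i = 0; i < 100; i++) requests.push(`192.0.2.${i + 1}`);

// With 5 backends an even split would be 200 each, but the 900 NAT-ed
// requests can land on at most 3 backends, 300 per gateway.
const counts = distributeByIp(requests, 5);
console.log(counts);
```

Running the same function over a sample of real access-log IPs is a cheap way to check whether your client population is diverse enough before enabling IP-based affinity.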
Client-Side: Implementation
Now let us build client-side load balancing from scratch. The goal of this implementation is to show the moving parts clearly: service discovery integration, health check loop, algorithm selection, and automatic retry with failover. In production you would typically not implement this yourself. You would use gRPC’s built-in load balancer, a service mesh like Linkerd, or a library like Netflix’s Ribbon (Java) or go-micro (Go). But understanding the mechanics is essential because when something goes wrong (and it will), you need to know which layer to debug.
Pay close attention to how we track per-instance metrics (activeConnections, responseTime, consecutiveFailures). These are the inputs that make intelligent algorithms like least-connections and least-response-time actually work. Without this tracking, your “load balancer” is just round-robin with extra steps. The failure-counting logic is subtle: we remove an instance after 3 consecutive failures, but a single success resets the counter. This prevents flapping when a network blip causes one failed request, while still catching instances that are genuinely broken.
What would happen if you skipped the health check entirely? The load balancer would continue sending a dead instance its full share of traffic (20% in a five-instance pool) until the service registry eventually removed it (typically 30-60 seconds later). Every one of those requests would time out or return 502 errors, multiplying the impact of a single instance failure. Client-side health checks close this gap: you start routing away from unhealthy instances within seconds, not minutes.
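The failure-counting rule described above (eject after 3 consecutive failures, reset on any success) can be sketched as follows. This is a simplified illustration rather than the chapter's full implementation: the class name, method names, and addresses are hypothetical, and service discovery, retries, and re-admission of recovered instances are left out.

```javascript
// Minimal client-side balancer: round-robin over instances that have not
// hit the consecutive-failure threshold. All names here are illustrative.
class ClientSideBalancer {
  constructor(addresses, failureThreshold = 3) {
    this.failureThreshold = failureThreshold;
    this.cursor = 0;
    this.instances = addresses.map((address) => ({
      address,
      activeConnections: 0,
      consecutiveFailures: 0,
      healthy: true,
    }));
  }

  // Return the next healthy instance, or null if none are left.
  pick() {
    const healthy = this.instances.filter((i) => i.healthy);
    if (healthy.length === 0) return null;
    const chosen = healthy[this.cursor % healthy.length];
    this.cursor += 1;
    chosen.activeConnections += 1;
    return chosen;
  }

  // A single success clears the failure streak.
  reportSuccess(instance) {
    instance.activeConnections = Math.max(0, instance.activeConnections - 1);
    instance.consecutiveFailures = 0;
  }

  // Eject only after `failureThreshold` failures in a row.
  reportFailure(instance) {
    instance.activeConnections = Math.max(0, instance.activeConnections - 1);
    instance.consecutiveFailures += 1;
    if (instance.consecutiveFailures >= this.failureThreshold) {
      instance.healthy = false;
    }
  }
}

const lb = new ClientSideBalancer(['10.0.0.1:3000', '10.0.0.2:3000']);
const [a, b] = lb.instances;

// Simulate outcomes directly (a real client would call pick() first).
lb.reportFailure(a);
lb.reportFailure(a);
lb.reportSuccess(a); // streak reset: a stays in rotation
lb.reportFailure(b);
lb.reportFailure(b);
lb.reportFailure(b); // third failure in a row: b is ejected

console.log(a.healthy, b.healthy); // true false
console.log(lb.pick().address);    // 10.0.0.1:3000
```

A production version would also need a background probe that re-admits an ejected instance once it recovers; otherwise the pool only ever shrinks.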
Load Balancing Algorithms
The algorithm you choose is effectively a hypothesis about your workload. Round-robin assumes all instances are identical and all requests take equal time. Least-connections assumes requests vary in duration but instances are otherwise equal. Least-response-time assumes instances themselves vary in speed. Consistent hashing assumes you need keys to stick to specific instances (for caching or partitioning). When your hypothesis matches reality, the algorithm works well. When it does not, you get mysterious tail latency or cache misses that nobody can explain.
The most common mistake is picking an algorithm that sounds smart without validating it matches your workload. I have seen teams switch from round-robin to least-response-time expecting better performance, only to find that it created a feedback loop: the fastest instance got all the traffic, its response time crept up, another instance became fastest, and so on. The resulting oscillation was worse than round-robin’s simple equal distribution. This is why Power of Two Choices (P2C), discussed below, has become the default in modern proxies like Envoy. It has the intelligence of least-connections without the herding problem.
Advanced Algorithms Implementation
The algorithms in this section demonstrate the ideas behind production load balancers. Weighted Round Robin with smooth distribution prevents clumping (naive weighted round-robin sends all weight-5 requests consecutively, which can overwhelm that instance; the smooth version interleaves them). Consistent hashing with virtual nodes is the foundation of client-sharded caches such as Memcached behind a ketama-style client (Redis Cluster uses a related but different scheme of fixed hash slots). With naive modulo hashing (hash(key) % N), removing one server remaps most keys; with a hash ring, only about 1/N of the keys move, and virtual nodes (for example, 150 per server) spread that share evenly across the surviving servers instead of dumping it on a single neighbor.
Power of Two Choices (P2C) is the algorithm I recommend for most modern load balancers. The math is surprising: picking 2 random instances and choosing the less loaded one can match or beat least-connections in realistic workloads, where several balancers decide independently on slightly stale connection counts. Why? Because pure least-connections creates a herd effect: every balancer sees that instance A has 3 connections and instance B has 4, and every one of them routes to A until the counts catch up. With P2C, the randomness breaks the herding. The mathematical analysis (Azar, Broder, Karlin, Upfal 1994) shows P2C achieves O(log log n) maximum load while purely random placement achieves O(log n / log log n), a dramatic improvement for essentially no extra cost.
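To make the "smooth" interleaving concrete, here is a minimal sketch of smooth weighted round-robin, the same idea NGINX uses for weighted upstreams. The class name and the server names are hypothetical. On every pick each peer's running score grows by its weight, the highest score wins, and the winner's score is reduced by the total weight.

```javascript
class SmoothWeightedRoundRobin {
  // servers: [{ name, weight }]
  constructor(servers) {
    this.peers = servers.map((s) => ({ ...s, current: 0 }));
    this.totalWeight = servers.reduce((sum, s) => sum + s.weight, 0);
  }

  next() {
    let best = null;
    for (const peer of this.peers) {
      peer.current += peer.weight; // every peer earns its weight each round
      if (best === null || peer.current > best.current) best = peer;
    }
    best.current -= this.totalWeight; // the winner pays back the total
    return best.name;
  }
}

const wrr = new SmoothWeightedRoundRobin([
  { name: 'a', weight: 5 },
  { name: 'b', weight: 1 },
  { name: 'c', weight: 1 },
]);

const sequence = Array.from({ length: 7 }, () => wrr.next());
console.log(sequence.join(' ')); // a a b a c a a
```

Over one full cycle of 7 picks, `a` still receives 5 of them, but never more than two in a row, whereas the naive version would emit `a a a a a b c`.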
Health Checking Strategies
Health checks are the eyes of the load balancer. If they report wrong, the load balancer makes wrong decisions: routing to dead instances or evicting healthy ones. The most important distinction in health checking is between liveness (is the process alive?) and readiness (can it handle traffic?). This sounds pedantic but it is the single most common mistake I see in production: teams implement a single “health” endpoint that checks everything, use it for both liveness and readiness, and then wonder why their system goes into a death spiral whenever a downstream dependency has a bad day.
The failure mode is subtle but catastrophic. Imagine your /health endpoint checks the database. The database has a slow query that makes health checks time out. Kubernetes sees liveness failures and restarts your pods. The new pods also fail the check (database is still slow). Kubernetes marks them failed and kills them too. Within minutes you have zero pods, the database recovers, but there are no pods to serve traffic. You have a full outage. The fix: liveness checks should be shallow (just “is the HTTP server responding?”), and only readiness checks should verify dependencies.
The deep health check is the third kind: an endpoint used by humans and dashboards to see rich dependency status. It is NOT what the load balancer polls. It tells you why a service is degraded, not whether it should receive traffic.
Comprehensive Health Check Service
A proper health check subsystem needs, at minimum, three parts: a registration mechanism (where each dependency declares how to check itself), a background runner (so health checks happen continuously, not per-request), and three HTTP endpoints (liveness, readiness, deep). The background execution is critical: if you check dependencies on every incoming request, your health check becomes a DoS vector. A misbehaving caller can hammer /health and multiply the load on every dependency. Instead, check dependencies on a timer and return cached results.
One detail worth emphasizing: health checks should have aggressive timeouts. A health check that takes 10 seconds to fail makes your “unhealthy” detection take 30+ seconds (since you typically need 3 consecutive failures). Use 2-5 second timeouts for critical dependencies, and keep the timeout proportional to your service’s response time budget. If your service’s SLA is 500ms, a 5-second health check timeout is 10x your SLA, which is too slow.
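The three-part design described above (registration, background runner, cached results behind separate endpoints) can be sketched in plain Node.js as follows. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, the dependency checks are stubs, and the methods return `{ status, body }` objects that you would wire into whatever HTTP framework you use.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

class HealthRegistry {
  constructor() {
    this.checks = new Map();  // name -> { fn, timeoutMs, critical }
    this.results = new Map(); // name -> { ok, error, checkedAt }
  }

  register(name, fn, { timeoutMs = 2000, critical = true } = {}) {
    this.checks.set(name, { fn, timeoutMs, critical });
  }

  // Run every registered check once and cache the outcome.
  async runOnce() {
    await Promise.all([...this.checks].map(async ([name, check]) => {
      try {
        await withTimeout(Promise.resolve().then(check.fn), check.timeoutMs);
        this.results.set(name, { ok: true, error: null, checkedAt: Date.now() });
      } catch (err) {
        this.results.set(name, { ok: false, error: err.message, checkedAt: Date.now() });
      }
    }));
  }

  // Background runner: endpoints only ever read cached results.
  start(intervalMs = 5000) {
    this.runOnce();
    this.timer = setInterval(() => this.runOnce(), intervalMs);
    this.timer.unref(); // do not keep the Node.js process alive
  }

  stop() {
    clearInterval(this.timer);
  }

  // Liveness is shallow: if this code runs, the process is alive.
  liveness() {
    return { status: 200, body: { alive: true } };
  }

  // Readiness requires a cached success for every critical dependency.
  readiness() {
    const ready = [...this.checks].every(
      ([name, check]) => !check.critical || this.results.get(name)?.ok === true
    );
    return { status: ready ? 200 : 503, body: { ready } };
  }

  // Deep status is for humans and dashboards, not for the load balancer.
  deep() {
    return { status: 200, body: Object.fromEntries(this.results) };
  }
}

const registry = new HealthRegistry();
registry.register('database', async () => { /* e.g. run SELECT 1 here */ });
registry.register('cache', async () => { throw new Error('connection refused'); },
  { critical: false });

registry.runOnce().then(() => {
  console.log(registry.readiness()); // ready: the failing cache is not critical
  console.log(registry.deep().body.cache.error); // connection refused
});
```

One design choice worth noting: readiness reports 503 until the first background run has completed, so an instance never receives traffic before its dependencies have been checked at least once.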
Load Balancer Patterns
Kubernetes Service Load Balancing
Envoy Proxy Configuration
Interview Questions
Q1: Client-side vs Server-side load balancing?
Server-side:
- Single point for routing
- Simple clients
- Extra network hop
- Centralized control
Client-side:
- Client decides which server
- No extra hop
- More complex clients
- Better for microservices
When to use each:
- Server-side: External traffic, legacy clients
- Client-side: Service-to-service within cluster
- Hybrid: Edge LB + client-side internally
Q2: When would you use Consistent Hashing?
When to use it:
- Cache servers (minimize cache misses when scaling up or down)
- Session affinity without IP hash
- Partitioned data (same key → same server)
How it works:
- Servers and keys are mapped onto the same hash ring
- A key is routed to the next server clockwise
- Adding or removing a server affects only its neighbors on the ring
Virtual nodes:
- Multiple positions per server for balance
- 100-200 virtual nodes per physical server
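The ring mechanics listed above can be sketched in a few dozen lines. This is an illustrative sketch, not a production implementation: the FNV-1a hash is an assumption chosen to avoid dependencies (real systems often use MD5 or MurmurHash), and the server and key names are hypothetical.

```javascript
// FNV-1a 32-bit hash, used here only to keep the example dependency-free.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

class HashRing {
  constructor(virtualNodes = 150) {
    this.virtualNodes = virtualNodes;
    this.ring = []; // sorted array of { hash, server }
  }

  // Place `virtualNodes` points for this server on the ring.
  addServer(server) {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.ring.push({ hash: fnv1a(`${server}#${i}`), server });
    }
    this.ring.sort((x, y) => x.hash - y.hash);
  }

  removeServer(server) {
    this.ring = this.ring.filter((point) => point.server !== server);
  }

  // Binary search for the first point clockwise from the key's hash.
  getServer(key) {
    if (this.ring.length === 0) return null;
    const h = fnv1a(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].hash < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].server; // wrap past the end
  }
}

const ring = new HashRing();
['cache-a', 'cache-b', 'cache-c', 'cache-d'].forEach((s) => ring.addServer(s));

const keys = Array.from({ length: 10000 }, (_, i) => `user:${i}`);
const before = keys.map((k) => ring.getServer(k));

ring.removeServer('cache-c');

let moved = 0;
let unexpectedMoves = 0;
keys.forEach((k, i) => {
  const after = ring.getServer(k);
  if (after !== before[i]) {
    moved += 1;
    if (before[i] !== 'cache-c') unexpectedMoves += 1;
  }
});
console.log(`moved ${moved} of ${keys.length} keys; unexpected moves: ${unexpectedMoves}`);
```

The property to look for is that `unexpectedMoves` is always zero: only keys that lived on the removed server are reassigned. With `hash(key) % N` instead of a ring, most keys would change servers.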
Q3: What's the difference between liveness and readiness probes?
Liveness probe:
- “Is the process stuck?”
- Failure → container restart
- Should be simple (no dependencies)
- Example: Can the HTTP server respond?
Readiness probe:
- “Can it handle traffic?”
- Failure → remove from load balancer
- Can check dependencies
- Example: Is database connection ready?
Q4: How does Power of Two Choices (P2C) work?
- Pick 2 random servers
- Choose the one with fewer connections
- Avoids herd behavior (all clients picking same “best” server)
- O(1) complexity (no sorting)
- Statistical guarantees: max load ~log(log(n))
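The two steps above fit in a single small function. The sketch below is illustrative: `pickP2C`, the `activeConnections` field, and the simulation helpers are hypothetical names, and the random source is injectable only so the behavior can be checked deterministically.

```javascript
// Pick two distinct random backends and return the one with fewer
// active connections. `random` is injectable for deterministic tests.
function pickP2C(backends, random = Math.random) {
  if (backends.length === 0) return null;
  if (backends.length === 1) return backends[0];
  const i = Math.floor(random() * backends.length);
  let j = Math.floor(random() * (backends.length - 1));
  if (j >= i) j += 1; // shift so that j is never equal to i
  const first = backends[i];
  const second = backends[j];
  return first.activeConnections <= second.activeConnections ? first : second;
}

// Balls-into-bins simulation: place n requests on n backends and report
// the most loaded backend. Connections never finish here, on purpose.
function simulate(pickFn, n) {
  const backends = Array.from({ length: n }, (_, id) => ({ id, activeConnections: 0 }));
  for (let r = 0; r < n; r++) pickFn(backends).activeConnections += 1;
  return Math.max(...backends.map((b) => b.activeConnections));
}

const pickRandom = (backends) => backends[Math.floor(Math.random() * backends.length)];

console.log('max load, random:', simulate(pickRandom, 10000));
console.log('max load, P2C:   ', simulate(pickP2C, 10000));
```

The exact numbers vary from run to run, but the P2C maximum should typically come out noticeably lower than the purely random one, which is the O(log log n) versus O(log n / log log n) gap in practice. Note also that with distinct loads, the single busiest backend can never be chosen, because it always loses the comparison.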
Chapter Summary
- Server-side LB for external traffic, client-side for internal
- Algorithm choice depends on workload: Round-robin for simple, Least-connections for varying load
- Consistent hashing for caching and stateful services
- Implement both liveness and readiness probes
- Health checks should have appropriate timeouts
- Use circuit breakers with load balancing for resilience
Interview Questions with Structured Answers
You deploy a new version but 10% of users are still hitting the old version 20 minutes later. What is wrong?
- Identify the connection type. Long-lived connections (HTTP/2, gRPC, WebSocket) stay attached to the old backend until the client disconnects. A 20-minute HTTP/2 connection is expected, not a bug.
- Check health check state propagation. The load balancer may still consider old instances healthy if the health check interval is high or the failure threshold is large.
- Inspect the deployment mechanism. Rolling deploy? Blue/green? Did the old pods actually terminate, or is terminationGracePeriodSeconds still bleeding them out?
- Look at session affinity. If stickiness is enabled, the 10% of affected users might be sticky to old instances that have not yet rotated.
- Check cache layers. DNS TTL, client-side service discovery caches, and sidecar proxies may be holding stale endpoint lists.
- Correlate the 10% to a signal. Is it 10% of connections, 10% of users, 10% of specific clients? The specificity tells you which layer to inspect.
To bound connection lifetime, Envoy has max_connection_duration in its common HTTP protocol options; gRPC servers have MaxConnectionAge (Go) or maxConnectionAge (Java). This is the canonical solution. Alternative: force connection close on shutdown, but that causes in-flight request failures if not combined with connection draining.
Common wrong answers:
- “The deploy did not complete; some pods are still running the old version.” Possible but lazy. Kubernetes rolling deploys are visible in kubectl rollout status; if the rollout finished, the old pods are gone. This answer ignores the hard cases where deploy completed but traffic lingers.
- “It is a cache somewhere, just bust the cache.” Too vague. Which cache? DNS, client service discovery, application cache, CDN, browser? A strong answer names the specific cache layer and its TTL.
Further reading:
- “gRPC Load Balancing”: official gRPC documentation on client-side LB and connection lifecycle
- “Envoy Proxy Documentation: Connection Management”: covers max_connection_duration and graceful termination
- “DNS is still the protocol of the internet”: Julia Evans blog post explaining why DNS caching surprises engineers during deploys
Your p99 latency is 2x your p50 even though all instances have similar CPU. What load balancing issue might be causing this?
- Rule out application-level causes first. Slow queries, GC pauses, cold caches all produce long tails that are not the load balancer’s fault.
- Measure per-instance p99. If one instance has a 2x higher p99 than others, traffic distribution is uneven even if CPU averages match.
- Check connection pooling under L4. If your LB is L4 and clients use HTTP/1.1 keepalive, the connections distribute but requests within each connection do not.
- Inspect queue depth at each backend. Two backends with the same CPU can have very different queue depths if one is upstream of a slower dependency or has a different JIT state.
- Consider algorithmic herding. Least-connections can oscillate; least-response-time can create feedback loops. P2C breaks both.
- Look at outlier detection. If a backend is 3x slower than the median, it should be ejected. Envoy’s outlier detection and Kubernetes readiness probes should catch this.
Common wrong answers:
- “Use least-response-time, that will solve it.” Dangerous. Pure least-response-time creates feedback oscillation. Use P2C or EWMA-weighted instead.
- “Add more replicas.” Does not address the algorithmic or per-instance issues. You need to fix the imbalance, not add noise to dilute it.
Further reading:
- “The Tail at Scale” by Jeffrey Dean and Luiz Barroso: the canonical Google paper on tail latency
- “Envoy Outlier Detection”: docs cover the ejection algorithm and tuning
- “Load Balancing at Shopify”: engineering blog posts on EWMA and weighted load balancing
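The "measure per-instance p99" step from the answer above is simple to sketch. The example below assumes latency samples shaped as `{ instance, latencyMs }` (a hypothetical format; in practice these would come from your metrics pipeline) and uses the nearest-rank percentile definition.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of
// samples at or below it. Integer arithmetic avoids float rounding.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Group samples by instance and compute a percentile for each group.
function percentileByInstance(samples, p) {
  const byInstance = new Map();
  for (const { instance, latencyMs } of samples) {
    if (!byInstance.has(instance)) byInstance.set(instance, []);
    byInstance.get(instance).push(latencyMs);
  }
  const out = {};
  for (const [instance, latencies] of byInstance) {
    out[instance] = percentile(latencies, p);
  }
  return out;
}

// Hypothetical data: both instances have the same median, but instance
// "b" has five slow requests per hundred while "a" has only one.
const samples = [];
for (let i = 0; i < 100; i++) {
  samples.push({ instance: 'a', latencyMs: i < 99 ? 50 : 400 });
  samples.push({ instance: 'b', latencyMs: i < 95 ? 50 : 400 });
}

console.log(percentileByInstance(samples, 50)); // { a: 50, b: 50 }
console.log(percentileByInstance(samples, 99)); // { a: 50, b: 400 }
```

Averages and medians hide exactly this kind of difference, which is why the comparison has to be made per instance and at the tail.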
You enable session affinity and now one instance is at 90% CPU while others are at 20%. What happened and how do you fix it?
- Identify the affinity scheme. IP-hash? Cookie-hash? Header-hash? Each has distinct failure modes.
- Check the input distribution. IP-hash buckets by source IP; if your traffic comes through a handful of corporate NATs or mobile carrier gateways, all users behind that NAT map to one instance.
- Estimate the skew. How many unique keys map to each instance? If one instance gets 100x the keys of another, the hash distribution is broken.
- Consider whether stickiness is needed at all. Session affinity is often a crutch for state that should be externalized (session store, Redis, JWT).
- Choose a mitigation. Bucket by session ID (higher-cardinality input), fall back to round-robin for the hot instance, or remove stickiness entirely.
Common wrong answers:
- “Add more instances and the load will even out.” Does not fix the hash distribution; you just get more idle instances while the hot one stays hot.
- “Disable session affinity.” May be the right answer in the end, but only if the application can handle it. Blindly disabling affinity often breaks features (logged-out users, lost shopping carts) in ways that are hard to diagnose.
Further reading:
- “Load Balancing is Impossible” by Tyler McMullen: talk explaining why perfect distribution is theoretically impossible
- “Session Affinity in Kubernetes”: official Kubernetes docs on clientIP affinity and its limitations
- “Zoom’s Architecture”: various 2020-2021 blog posts about how Zoom scaled during the pandemic
Interview Deep-Dive
'You have 5 instances of a service behind a load balancer using round-robin. One instance is consistently slower than the others (maybe it is on a degraded host). How does this affect the system, and what load balancing algorithm would you switch to?'
kubectl describe node.
Follow-up: “How does load balancing work differently for gRPC compared to REST?”
gRPC uses HTTP/2 with long-lived connections. A traditional L4 load balancer (AWS NLB, kube-proxy) balances at connection time, not per-request. Once a gRPC client opens a connection to one backend, all requests go to that backend. With 5 backends and 3 clients, you might have all 3 clients connected to the same 2 backends while 3 backends sit idle.
The fix is L7 load balancing that understands HTTP/2 frames and can route individual requests within a connection. Envoy (in Istio or standalone) does this natively. Alternatively, use client-side load balancing: gRPC has built-in support for resolving multiple backends (via DNS or a custom resolver). With the round_robin policy, the client opens a connection to every resolved backend and rotates requests across them; the default pick_first policy sticks to a single backend and does not distribute load.
'Explain consistent hashing and why it matters for distributed caching. What happens when you add or remove a cache node?'
'How do health checks work in a load balancing context, and what is the difference between liveness and readiness checks?'