Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Part III — Performance
Chapter 6: Performance Fundamentals
Performance is not about making things fast. It is about understanding where time goes, what users experience, and making informed decisions about which optimizations matter.Real-World Story: Discord -- From MongoDB to Cassandra to Handle Trillions of Messages
Real-World Story: Discord -- From MongoDB to Cassandra to Handle Trillions of Messages
(channel_id, bucket) where each bucket represented a window of time. This meant all messages for a channel in a given time window lived on the same node — reads for “recent messages in this channel” hit a single partition, which is the fastest possible Cassandra read.By 2022, Discord was storing trillions of messages across 177 Cassandra nodes. But even Cassandra developed pain points — compaction storms caused latency spikes, and “hot partitions” (extremely active channels) created uneven load. In 2023, Discord migrated again — this time to ScyllaDB, a C++ rewrite of Cassandra that offered better tail latency, more predictable compaction, and lower operational overhead. The lesson: database selection is never “one and done.” As your scale changes by orders of magnitude, your storage layer assumptions get invalidated. The architecture that handles 1 billion messages is not the architecture that handles 1 trillion.Real-World Story: Twitter's Fail Whale and the Thundering Herd Problem
Real-World Story: Twitter's Fail Whale and the Thundering Herd Problem
6.1 Latency
Latency is the time between a request and a response. Types: network latency (speed of light across distance — ~0.5 ms within a datacenter, ~40 ms US coast-to-coast, ~150 ms US-to-Asia), processing latency (server computation — a simple REST handler: ~1-5 ms, image resize: ~200-2000 ms), queue latency (waiting for resources — 0 ms when pools are available, potentially seconds under contention). Tail latency (p95, p99, p99.9) measures worst-case experience. At scale, 1% of requests being slow means thousands of users having a terrible experience. Senior engineers always look at percentiles, not averages. A service doing 10,000 req/s with a P99 of 3 seconds means 100 users every second are waiting 3+ seconds — that is 360,000 miserable experiences per hour. The problem compounds in distributed systems: when a single user request fans out to dozens of backend services, the probability of hitting at least one slow tail response grows dramatically — a phenomenon called tail latency amplification. Google’s research showed that at 100 fan-out, even with individual services at 99th-percentile of 10 ms, the aggregate P99 is dominated by the slowest responder.P99 vs P50 — Why Averages Lie
The P50 (median) tells you what a “typical” user experiences. The P99 tells you what your worst 1% endures. In a healthy system these are close together; when they diverge it signals a systemic issue that averages will hide.| Metric | Healthy API | Degraded API |
|---|---|---|
| P50 | 12 ms | 15 ms |
| P95 | 45 ms | 800 ms |
| P99 | 90 ms | 4,200 ms |
| Average | 22 ms | 180 ms |
- GC pauses: JVM or Go stop-the-world GC adds 50-200 ms spikes to a small fraction of requests.
- Cold cache misses: 99% of requests hit the cache (2 ms), 1% fall through to the database (150 ms).
- Connection pool exhaustion: Most requests get a pooled connection instantly; a few wait 2-5 seconds for one to free up.
- Downstream tail latency amplification: If your service fans out to 20 microservices in parallel, and each has a 1% chance of being slow, roughly 18% of your aggregate requests will hit at least one slow dependency.
Interview Question: Good average latency but poor p99. How do you investigate?
Interview Question: Good average latency but poor p99. How do you investigate?
- Pull the trace IDs of requests above the P99 threshold from your APM tool (Datadog, New Relic, etc.).
- Group them by downstream dependency — do they all bottleneck on the same service or database?
- Check if the slow requests share a data characteristic (e.g., queries on a specific tenant, a large payload, a missing index for a rare filter combination).
- Look at host-level metrics during the spike window: CPU, memory, GC pause logs, thread pool queue depth.
- If it is GC, tune heap size or switch to a low-pause collector (ZGC, Shenandoah for JVM;
GOGCtuning orGOMEMLIMITfor Go). If it is cache misses, consider a warm-up strategy or increase cache TTL. If it is connection pool exhaustion, right-size the pool (start with HikariCP’s formula:(CPU_cores * 2) + effective_spindle_count) and add circuit breakers (Resilience4j for JVM,opossumfor Node.js).
- “I’d look at the logs.” (Too vague — which logs? What search pattern?)
- “I’d increase server count.” (Horizontal scaling does not fix per-request variance.)
- “The database is probably slow.” (Guessing a root cause without methodology.)
- “I’d start by pulling trace IDs for requests above the P99 threshold and grouping them by common traits — endpoint, tenant, payload size, downstream dependency.”
- “I’d compare the P99 requests against the P50 requests to find what is structurally different about the slow cohort.”
- “My first hypothesis is a resource contention issue visible only under specific conditions — cache miss on a particular key pattern, GC pause timing, or connection pool wait.”
- Failure mode: What if the P99 investigation reveals multiple independent causes (GC + cache misses + a slow query) each contributing 30% of the tail? How do you prioritize?
- Rollout: You find a fix (adding a composite index). How do you validate the fix under production load before full rollout?
- Rollback: The index fix helps P99 but degrades write latency by 8%. At what threshold do you rollback?
- Measurement: What metric proves the P99 is truly fixed and not just temporarily improved? How long do you monitor before declaring success?
- Cost: If the fix requires a larger database instance to support the new index, what is the cost-per-millisecond-saved calculation?
- Security/governance: Could the slow P99 requests be correlated with specific attack patterns (large payloads, injection attempts) rather than organic traffic?
- Jeff Dean & Luiz Barroso — “The Tail at Scale” (Google Research) — the canonical paper on tail latency amplification in fan-out systems.
- Grafana Labs blog — “How to use SLOs and burn rate alerts” (grafana.com/blog) — practical tuning of tail-latency alerting.
- Brendan Gregg — “USE Method for Performance Analysis” — systematic methodology for classifying CPU, memory, and I/O bottlenecks.
AI-Assisted Engineering Lens: Latency Investigation
AI-Assisted Engineering Lens: Latency Investigation
EXPLAIN ANALYZE output into Claude or ChatGPT and get an instant interpretation of the plan, highlighting missing indexes, bad row estimates, or sequential scans. AI tools can analyze flame graph data to identify hot functions faster than manual inspection. GitHub Copilot can generate the boilerplate for distributed tracing instrumentation (OpenTelemetry spans, custom metrics) in minutes instead of hours. However, AI tools have a blind spot: they optimize for correctness in isolation and may miss the systemic interaction — e.g., suggesting a cache that fixes latency but introduces a stampede risk, or recommending an index without considering its write-path cost. Always validate AI-suggested performance fixes with load testing at production-representative scale.Work-Sample Prompt: P99 Latency Spike
Work-Sample Prompt: P99 Latency Spike
pg_stat_activity, and Grafana dashboards. Walk me through the exact commands you would run, in order, in the first 10 minutes. What do you look at first, second, third? What would make you decide to rollback vs fix-forward?Diagnose from Signals: Latency
Latency Numbers Every Engineer Should Know
This reference table (inspired by Jeff Dean’s “Numbers Everyone Should Know”, updated for modern hardware circa 2024) provides the order-of-magnitude intuition every senior engineer needs. These numbers are the foundation of back-of-the-envelope estimation — if you memorize even the bold rows, you can sanity-check any system design proposal in seconds by multiplying counts by per-operation costs.| Operation | Approximate Latency | Notes |
|---|---|---|
| L1 cache reference | 1 ns | On-chip, fastest possible |
| L2 cache reference | 4 ns | |
| Branch mispredict | 5 ns | |
| Mutex lock/unlock | 25 ns | Uncontended |
| Main memory (RAM) reference | 100 ns | |
| Compress 1 KB with Snappy | 3 us | |
| Read 1 MB sequentially from RAM | 10 us | |
| SSD random read (4 KB) | 16 us | NVMe SSD |
| SSD sequential read (1 MB) | 50 us | NVMe SSD |
| Round trip within same datacenter | 500 us (0.5 ms) | |
| Redis GET (local datacenter) | 0.5 - 1 ms | Network + deserialization |
| Read 1 MB sequentially from SSD | 1 ms | SATA SSD |
| HDD random read (seek) | 2 - 10 ms | Spinning disk |
| PostgreSQL simple indexed query | 1 - 5 ms | Warm buffer pool |
| PostgreSQL complex JOIN | 10 - 100 ms | Depends on table sizes, indexes |
| TLS handshake | 5 - 15 ms | Within datacenter; add RTT for cross-region |
| Round trip US East to US West | 40 ms | ~4,000 km at speed of light in fiber |
| Round trip US East to Europe | 75 ms | ~6,000 km |
| Round trip US to Asia | 150 - 200 ms | ~12,000 km |
| External HTTP API call (third-party) | 50 - 500 ms | Highly variable |
6.1b Performance Budgets — Allocating Latency Across Components
A performance budget is the practice of allocating a total latency target across the components in a request path, the same way a financial budget allocates dollars across departments. If your SLA promises a 200 ms P95 response time, you cannot simply hope each component is “fast enough” — you must explicitly decide how much of that 200 ms each component is allowed to consume. Why performance budgets matter: Without explicit budgets, every team optimizes locally. The database team is happy with 80 ms queries, the cache team is happy with 5 ms lookups, the API gateway adds 15 ms, the serialization layer takes 20 ms, and the network round-trip is 40 ms. Add it up: 160 ms — dangerously close to the 200 ms SLA. And that is the happy path. Any single component having a bad day pushes you over. Performance budgets make this math explicit before production reveals it. How to build a performance budget: Start with the user-facing latency target (e.g., 200 ms P95 for an API endpoint). Then trace a typical request and assign a budget to each component:| Component | Budget | Rationale |
|---|---|---|
| API gateway / load balancer | 5 ms | Nginx/ALB overhead — mostly fixed |
| Authentication / middleware | 10 ms | JWT validation, rate limiting |
| Application logic | 15 ms | Business logic, validation, transformation |
| Cache lookup (Redis) | 5 ms | Includes network hop within datacenter |
| Database query (cache miss) | 50 ms | Indexed query on warm buffer pool |
| External API call (if any) | 80 ms | Third-party service with timeout set |
| Serialization + response | 10 ms | JSON serialization, compression |
| Network to client | 25 ms | Varies by geography; budget for P95 |
| Total | 200 ms | Matches SLA target |
- Budget for P95/P99, not P50. Your median database query takes 5 ms, but your P99 takes 50 ms. Budget for the P99 — that is what breaks your SLA.
- Leave a margin. Budget only 80% of your SLA target. The remaining 20% is headroom for unexpected spikes, GC pauses, and slow days. A 200 ms SLA should have a 160 ms internal budget.
- Set timeouts to match budgets. If the database budget is 50 ms, set a query timeout at 60 ms. If an external API budget is 80 ms, set the HTTP client timeout at 100 ms. Without timeouts, a single slow component consumes the entire request budget and blocks downstream processing.
- Monitor each component against its budget independently. A dashboard that only shows end-to-end latency hides which component is eating into the margin. Per-component latency tracking (via distributed tracing spans) reveals budget violations early.
- Renegotiate budgets when requirements change. Adding a new feature that calls an external service means the budget must be redistributed. If nothing else can shrink, the SLA must change — or the new call must be moved to an async path.
Interview Question: Your team is building a new API with a 200ms P95 latency SLA. How do you ensure you meet it?
Interview Question: Your team is building a new API with a 200ms P95 latency SLA. How do you ensure you meet it?
EXPLAIN ANALYZE on every database query to verify it fits within the 50 ms budget, load test with k6 at 2x expected traffic to check that budgets hold under contention, and profile the application logic with pprof (Go), py-spy (Python), or async-profiler (Java) to ensure the 15 ms code budget is realistic.The key insight: The performance budget turns a vague SLA (“be fast”) into a concrete, testable engineering constraint for each team. Without it, every team assumes “someone else” is responsible for the overall latency.Follow-up chain:- Failure mode: What happens when a new feature requires an additional external API call that blows the budget? Who decides what gets cut?
- Rollout: How do you roll out performance budgets to a team that has never had them? Do you start by measuring and then setting targets, or set aspirational targets first?
- Rollback: A budget-driven timeout kills a legitimate slow request that the user was waiting for. How do you handle false positives from aggressive timeouts?
- Measurement: How do you measure each component’s budget compliance independently? What tooling is required?
- Cost: Adding distributed tracing to measure per-component budgets costs engineering time and APM licensing. How do you justify the investment?
- Security/governance: A performance budget of 80ms on the external API call means you set an aggressive timeout. Could an attacker exploit this by slowing down the API just enough to trigger constant timeouts?
- Google SRE Book — Chapter 4 “Service Level Objectives” — foundational framing for SLI/SLO/error budgets.
- highscalability.com — “How Shopify handles Black Friday traffic” — applied performance budgeting in e-commerce.
- opentelemetry.io/docs — “Tracing Semantic Conventions” — how to instrument spans for per-component budget measurement.
Work-Sample Prompt: Performance Budget Design
Work-Sample Prompt: Performance Budget Design
6.2 Throughput
Requests per second (RPS) your system handles. As throughput approaches capacity, latency increases — requests queue for resources. Load testing reveals where this ceiling is. Concrete benchmarks to calibrate your intuition: a single Node.js process handles ~10,000-30,000 simple HTTP req/s. A single Go server handles ~50,000-100,000 req/s for lightweight handlers. A single PostgreSQL instance handles ~5,000-15,000 simple transactional queries/s. Redis handles ~100,000-200,000 GET/SET operations/s on a single node. A single Kafka broker can sustain ~200,000-500,000 messages/s depending on message size. The relationship between latency and throughput: They are not independent. At low load, latency is stable. As throughput increases toward the system’s capacity, latency grows — slowly at first, then sharply. This is the “hockey stick” curve, rooted in queueing theory — as utilization approaches 100%, wait times grow toward infinity because every small burst has no slack to absorb it. The point where latency starts climbing steeply is your practical throughput limit, even if the system has not technically crashed. In practice, you want to operate at 60-70% of measured capacity — this leaves headroom for traffic spikes without entering the steep part of the curve. Little’s Law: The average number of requests in the system (L) = arrival rate (lambda) x average time each request spends in the system (W). If 100 requests arrive per second and each takes 200 ms, you have L = 100 x 0.2 = 20 concurrent requests at any time. This tells you how many connections, threads, or workers you need. Little’s Law is remarkably powerful because it holds for any stable system regardless of arrival distribution or processing order — it works for web servers, database connection pools, checkout lines, and highway traffic. In interviews, use it to immediately quantify resource requirements: “If we expect 5,000 req/s and each request takes 50 ms, we need at least 250 concurrent connections in our pool.”Interview Question: Your service handles 500 req/s with 50 ms average latency, but during a traffic spike to 2000 req/s, latency jumps to 5 seconds. What is happening?
Interview Question: Your service handles 500 req/s with 50 ms average latency, but during a traffic spike to 2000 req/s, latency jumps to 5 seconds. What is happening?
jstack or Micrometer’s executor.* metrics; Node.js: clinic doctor for event loop delays), check CPU and memory via htop or your cloud provider’s monitoring (CloudWatch, GCP Monitoring), check downstream service latency via distributed traces (Jaeger, Zipkin, or your APM’s service map), and run pg_stat_activity on the database to identify long-running queries holding connections. Short-term fix: rate limit (token bucket at the API gateway — Kong, NGINX limit_req, or AWS API Gateway throttling) or shed load (return HTTP 503 with Retry-After header) to protect the system. Medium-term: identify the bottleneck (add read replicas if DB, increase pool size if connections, add instances if CPU). Long-term: add autoscaling policies based on queue depth (KEDA with SQS/Kafka metrics, or custom CloudWatch alarms).Applying Little’s Law to this scenario:- At steady state: L = 500 req/s x 0.05 s = 25 concurrent requests. A thread pool of 50 handles this easily.
- During spike: L = 2000 req/s x 5 s = 10,000 concurrent requests in flight. This far exceeds any reasonable pool size. Requests are stacking in queues, each one waiting behind thousands of others, which is why latency explodes non-linearly.
- “Just add more servers to handle the spike.” (Does not identify the bottleneck.)
- “Increase the thread pool size.” (Does not address why concurrency spiked — could make it worse.)
- “I’d use Little’s Law to quantify the concurrency gap, then identify whether the bottleneck is CPU, connections, or a downstream dependency before choosing a fix.”
- “The 100x jump in response time for a 4x traffic increase tells me we hit a non-linear queueing regime — the system needs load shedding, not just more capacity.”
- Failure mode: What happens if your rate limiter itself becomes a bottleneck during the spike? How do you rate-limit without adding latency?
- Rollout: You decide to add load shedding. How do you test that shedding actually works without causing an outage during the test?
- Rollback: You deployed a connection pool size increase as a quick fix. It helped latency but database CPU jumped to 85%. How do you rollback safely?
- Measurement: After the spike subsides, how do you determine whether your fix was effective or the traffic just dropped?
- Cost: The 4x spike lasted 15 minutes. You autoscaled from 5 to 20 pods. What was the cost of that burst, and is it cheaper than pre-provisioning?
- Security/governance: Could the 4x traffic spike be a DDoS attack rather than organic growth? How do you distinguish the two in real-time?
Retry-After headers including jittered values, and fix on the client side by enforcing exponential backoff with full jitter (AWS’s recommended algorithm). This is why “naive retries” are considered an anti-pattern at scale.Q: At what autoscaling trigger would you replace reactive scaling with predictive?A: Two signals tip me toward predictive: (1) you have reliably recurring spikes (daily batch at 2 AM, weekly peak Monday morning, Black Friday) — pre-warm before the spike; (2) your cold start time exceeds your SLA tolerance — by the time reactive autoscaling has provisioned pods (30-120 seconds), the spike has already hurt users. Netflix’s Scryer does exactly this for their predictable daily patterns.- AWS Architecture Blog — “Exponential Backoff and Jitter” — the canonical source on retry storm mitigation.
- highscalability.com — “The Netflix Simian Army” — how Netflix uses chaos to validate load shedding works.
- Marc Brooker — “Timeouts, retries, and backoff with jitter” (aws.amazon.com/builders-library) — deep dive on retry correctness.
Diagnose from Signals: Throughput
6.3 CPU-Bound vs IO-Bound
CPU-bound: data transformation, encryption (~1-10 ms for AES-256 on a 1 MB payload), image processing (~200-2000 ms for a resize operation), JSON serialization of large objects (~5-50 ms for 10 MB payloads). Optimize with better algorithms, caching computed results, parallelism across cores. IO-bound: database queries (~1-100 ms), network calls (~0.5-200 ms depending on distance), file reads (~0.05-10 ms for SSD, ~2-20 ms for spinning disk). Most web apps are IO-bound — the CPU sits at 5-15% utilization while threads wait on network and disk. Optimize by reducing IO operations (batching, caching), reducing time per operation (query optimization, connection pooling), overlapping operations (async processing). How to identify which you are: If CPU utilization is high (> 70%) during slowness, you are CPU-bound. If CPU is low (< 30%) but latency is high, you are IO-bound — your threads are waiting, not working. A flame graph (CPU profiler) shows where time is spent for CPU-bound work. Distributed tracing shows where time is spent for IO-bound work. Real example: An image resizing API was slow. CPU profiling showed 85% of time in the resize function — CPU-bound. Solution: moved to a more efficient library (libvips instead of ImageMagick), added a worker pool to parallelize across cores, and cached resized images by hash. A product page API was slow. CPU was at 5%. Distributed tracing showed 400 ms waiting on 3 sequential database queries — IO-bound. Solution: made the queries parallel (Promise.all), added a composite index, and cached the result for 60 seconds.
6.4 Query Optimization
Database queries are the most common bottleneck in web applications. In a typical CRUD app, 60-80% of request latency is spent waiting on database queries. A single unindexed query on a 10M-row table can take 500-5000 ms (full table scan) vs 1-5 ms with a proper index — a 100-1000x difference from a single line of SQL. Key techniques:EXPLAIN / EXPLAIN ANALYZE for execution plans. Understand index types: B-tree (equality and range), hash (equality only), composite (multi-column, order matters), partial (subset of rows), covering (includes all columns needed). Fix N+1 problems (load a list then loop to load related items — use JOINs or batch fetches instead). Use batch operations for bulk reads and writes.
6.5 Connection Pooling
Opening a database connection involves TCP handshake (~0.5 ms local, ~40 ms cross-region), TLS negotiation (~5-15 ms), authentication (~1-3 ms), and resource allocation on the server (~5-10 MB of memory per PostgreSQL connection). Total cost of a fresh connection: ~10-60 ms and measurable server memory. Connection pooling maintains reusable connections that skip all of this overhead — a pooled connection acquisition takes ~0.01-0.1 ms. Pool size should match database capacity, not application desire. Pool sizing formula (from HikariCP documentation):pool_size = (CPU_cores * 2) + effective_spindle_count. For a 4-core database server with SSD, that is ~9-10 connections. Counter-intuitive: a smaller pool often outperforms a larger one. A pool of 200 connections on a 4-core database server means massive context switching overhead — each connection competes for CPU time. Start with 10-20, load test, and increase only if you observe connection wait times.
PgBouncer modes:
- Session pooling — connection assigned per client session. Simple, supports all features.
- Transaction pooling — connection assigned per transaction. Much higher connection multiplexing, but prepared statements do not work.
- Statement pooling — connection assigned per statement. Maximum efficiency, but transactions do not work.
Diagnose from Signals: Connection Pool Exhaustion
6.6 Async Processing
Not everything needs to happen in the request. Acknowledge the user’s action immediately, process side effects asynchronously. The difference is dramatic: a synchronous order creation that sends email, updates analytics, and triggers a webhook takes ~800-2000 ms. The same operation with async processing: ~15-50 ms (database write + queue publish). The user sees a 20-50x faster response. What to process async (with typical sync latency cost): Email/SMS sending (~200-800 ms via SendGrid/Twilio API). Image/video processing (~500-5000 ms for resize, ~10-120 seconds for transcode). PDF generation (~200-2000 ms depending on complexity). Analytics/logging (~5-50 ms but adds up across many events). Webhook delivery (~100-3000 ms depending on the receiver). Search index updates (~50-500 ms for Elasticsearch). Anything where the user does not need the result in the same HTTP response. Patterns:- Fire-and-forget — publish to a queue, return immediately. Simplest, no delivery guarantee to the user. Best for side effects where failure is tolerable (analytics events, non-critical notifications). The risk: if the worker fails and retries are not configured, the work is silently lost. Always pair with a dead-letter queue for visibility into failures.
- Request-acknowledge-poll — return a job ID, client polls for status (or subscribes via WebSocket/SSE). Good for long-running operations like video transcoding, PDF generation, or data exports where the user expects a result but not instantly. The API returns
202 Acceptedwith a status URL; the client checks/jobs/{id}until it resolves tocompletedorfailed. - Event-driven — publish a domain event (e.g.,
OrderCreated), multiple independent consumers react (email service, inventory service, analytics service). Decoupled and extensible — adding a new consumer requires zero changes to the publisher. The trade-off is debugging complexity: tracing a single user action across 5 consumers requires distributed tracing infrastructure.
Diagnose from Signals: Async Processing
6.7 Performance Testing and Monitoring
Profiling Tools by Language — Quick Reference
Choosing the right profiler depends on your language runtime, whether you can attach to production, and what type of bottleneck you are hunting (CPU, memory, I/O, or concurrency). This table covers the primary profiler for each major language ecosystem along with the production-safe option and what each tool reveals.| Language | Primary Profiler | Production-Safe Option | Overhead | What It Reveals | Output Format |
|---|---|---|---|---|---|
| Go | pprof (built-in) | pprof via /debug/pprof | ~0% when inactive; ~1-3% when sampling | CPU time, heap allocations, goroutine stacks, mutex contention, block profiles | Flame graph, top-N, graph |
| Python | cProfile (built-in) | py-spy (sampling, no code change) | cProfile: 30-50% (deterministic); py-spy: ~2-5% (sampling) | CPU time per function, call counts, wall-clock time. py-spy can profile running PIDs without restart | Flame graph (py-spy), pstats (cProfile) |
| Java / JVM | async-profiler | async-profiler with -e wall | ~2% CPU overhead | CPU time, allocation sites, lock contention, native frames (unlike jstack). Captures both Java and C/C++ frames | Flame graph (SVG), JFR |
| Node.js | clinic.js (suite) | clinic flame / 0x | ~5-10% | Event loop delays (clinic doctor), I/O bottlenecks (clinic bubbleprof), CPU flamegraphs (clinic flame). 0x generates production-grade flamegraphs | Flame graph (HTML), diagnostic report |
| Rust | perf + flamegraph crate | perf record | ~2-5% | CPU cycles, cache misses, branch mispredictions. cargo flamegraph wraps perf for Rust-friendly output | Flame graph (SVG) |
| .NET | dotnet-trace | dotnet-counters (live) | ~1-3% | CPU sampling, GC events, thread pool starvation, HTTP request durations. dotnet-dump for heap analysis | Speedscope, Chromium trace, nettrace |
| Frontend (JS) | Chrome DevTools Performance | Lighthouse CI (automated) | N/A (client-side) | Long tasks, layout thrashing, paint timing, Core Web Vitals (LCP, FID, CLS). Lighthouse CI runs in CI/CD for regression detection | Timeline, Lighthouse report |
Chapter 6b: Systems Internals — Pools, Threads, Serialization, and Resource Exhaustion
This section covers the low-level mechanics that senior engineers must understand to debug production systems, design performant architectures, and avoid resource exhaustion — the silent killer of production services.6.8 Thread Pools and Connection Pools — How They Work and Why They Fail
Thread pools: A fixed set of pre-created threads that pick up work from a queue. Instead of creating a new thread per request (expensive — each thread consumes 512 KB - 1 MB of stack memory, and creating a thread takes ~50-100 us on Linux), work is submitted to the pool and the next available thread executes it. A pool of 200 threads costs ~100-200 MB of stack memory alone, plus OS scheduling overhead. For comparison, Go goroutines start at ~2 KB each — you can run 100,000 of them in the memory that 200 OS threads consume.6.9 Resource Exhaustion Patterns
The most dangerous production failures are resource exhaustion — they start silently and cascade rapidly. Connection exhaustion: Your application opens database connections faster than it closes them (connection leak). Or your pool is too small for the traffic. Or a slow query holds a connection for 30 seconds instead of 30 ms, and all other requests wait. Symptoms: increasing latency, connection timeout errors, “too many connections” errors. Detection: Monitor active connections, pool wait time, and pool utilization. Alert when pool utilization > 80%. Memory exhaustion (OOM): Unbounded caches, memory leaks (event listeners not removed, closures holding references to large objects), loading entire query results into memory (instead of streaming/pagination). In containerized environments, OOM kills the process instantly with no graceful shutdown. Fix: Set memory limits. Use streaming for large data sets. Profile memory with heap snapshots. Bounded caches with eviction (LRU with max size). File descriptor exhaustion: Every open socket, file, and pipe consumes a file descriptor. Linux defaults to 1024 per process (ulimit -n). A service with 1,000 HTTP connections, 50 database connections, and 100 open files is near the limit. Production services commonly set ulimit -n 65536 or higher. Nginx recommends worker_rlimit_nofile 65535. This is a classic “works in dev, breaks in prod” issue — your laptop never has enough concurrent connections to hit the limit, but production with 10,000 concurrent users will. The failure mode is sudden and confusing: new connections are refused, log files cannot be opened, and the error messages (“too many open files”) may not appear if the logging system itself cannot open files. Fix: Increase ulimits in production (set in systemd unit files or container specs, not just the shell). Close connections and files promptly. Monitor open FD count (ls /proc/<pid>/fd | wc -l on Linux, or expose via Prometheus process_open_fds metric) — alert at 80% of the limit. See the OS Fundamentals chapter for the deep mechanics of file descriptors, the distinction between soft and hard limits, and the real-world story of how a file descriptor leak took down an entire microservices platform.
Disk exhaustion: Log files growing unbounded. Temp files not cleaned up. Database WAL files accumulating during replication lag. Fix: Log rotation (logrotate), retention policies, disk usage monitoring and alerting.
CPU exhaustion — the subtle one: Not always obvious. Symptoms: high latency, GC pauses (JVM, .NET, Go), thread contention (many threads fighting for the same lock). In Node.js: a single CPU-intensive operation blocks the event loop for all requests. Fix: Profile with flame graphs, offload CPU work to worker threads/separate services, optimize hot paths.
Container CPU throttling — the invisible killer: In Kubernetes, a container with a CPU limit of 1000m (1 core) gets throttled by the kernel’s CFS (Completely Fair Scheduler) when it exceeds its quota within a 100ms scheduling period. The container is not killed — it is paused. This manifests as mysterious latency spikes that do not correlate with any application metric. The container’s cpu.stat file shows nr_throttled and throttled_time increasing. A service that bursts to 1.5 cores for 50ms during request processing gets throttled for 50ms — adding 50ms of latency to that request with zero visibility in application-level monitoring. Symptoms: P99 latency spikes that correlate with traffic but not with any application bottleneck. CPU utilization appears “fine” at 60-70% average, but the CFS quota is being exceeded in short bursts. Detection: Monitor container_cpu_cfs_throttled_periods_total and container_cpu_cfs_throttled_seconds_total in Prometheus (from cAdvisor). Alert when throttled periods exceed 5% of total periods. Fix: Either increase the CPU limit, remove the CPU limit (use only requests for scheduling, no limits — this is the approach recommended by many Kubernetes experts including Tim Hockin from Google), or optimize the burst behavior.
Noisy neighbor — the shared infrastructure problem: In cloud environments, your VM shares physical hardware with other tenants. A “noisy neighbor” is another VM on the same host that is consuming disproportionate I/O bandwidth, network bandwidth, or causing cache pollution at the CPU level (L3 cache eviction). Symptoms: Periodic, unpredictable latency spikes that do not correlate with your application’s traffic or deployments. Spikes may occur at consistent times (when the neighbor runs a batch job) or randomly. Detection: Check steal time in top or htop — this shows the percentage of CPU time the hypervisor stole from your VM to give to other tenants. Steal time above 5% is concerning; above 10% is impacting performance. On AWS, enhanced monitoring shows per-instance hardware metrics. Fix: Use dedicated hosts or bare-metal instances for latency-sensitive workloads (AWS host tenancy, or m5.metal instances). In Kubernetes, use node anti-affinity rules to spread latency-sensitive pods across nodes. For less severe cases, simply stopping and restarting the instance moves it to different hardware (usually).
Diagnose from Signals: Resource Exhaustion
6.10 Serialization and Wire Formats — JSON, Protobuf, and Why It Matters
The problem: Every time services communicate, data must be serialized (object -> bytes) for transmission and deserialized (bytes -> object) on the other end. This has real performance cost at scale. JSON: Human-readable text format. Advantages: universal support, easy to debug (you can read it), flexible schema. Disadvantages: verbose (field names repeated in every object), slow to parse at scale (text parsing is CPU-intensive), no schema enforcement (a typo in a field name is silently accepted), large payloads (a JSON message can be 3-10x larger than binary equivalent). Protocol Buffers (Protobuf): Binary format with a schema (.proto file). Advantages: 3-10x smaller than JSON, 5-20x faster to serialize/deserialize, strong typing with code generation (compile-time safety), backward/forward compatible schema evolution. Disadvantages: not human-readable (cannot curl and read the response), requires schema management, tooling is more complex.
Why gRPC exists: gRPC = HTTP/2 + Protobuf + code generation + streaming. It was created because Google’s internal services were spending significant CPU cycles on JSON serialization and dealing with the overhead of HTTP/1.1. At scale (millions of inter-service calls per second), the serialization format matters more than most engineers realize.
| Format | Size (typical) | Parse Speed | Human-Readable | Schema | Streaming |
|---|---|---|---|---|---|
| JSON | Large (100%) | Slow (100%) | Yes | Optional (JSON Schema) | No (newline-delimited hack) |
| Protobuf | Small (20-30%) | Fast (10-20%) | No | Required (.proto) | Yes (gRPC) |
| Avro | Small (25-35%) | Fast (15-25%) | No | Required (JSON schema) | Yes |
| MessagePack | Medium (50-70%) | Medium (40-60%) | No | No | No |
6.11 Why HTTP/3 and QUIC Exist — The Performance Chain
Performance problems cascade through the stack. Understanding the chain explains why each technology was created: TCP head-of-line blocking -> HTTP/2 solved HTTP-level multiplexing but TCP still blocks all streams on packet loss -> QUIC (HTTP/3) moves to UDP with independent streams. JSON serialization overhead -> Protobuf provides binary serialization (3-10x smaller, 5-20x faster) -> gRPC wraps Protobuf with HTTP/2, code generation, and streaming. TCP connection setup cost (3-way handshake: 1 RTT ~0.5-150 ms + TLS 1.3 handshake: 1 RTT ~0.5-150 ms = 2 round trips, ~1-300 ms before first byte depending on distance) -> HTTP/2 multiplexing reduces the need for multiple connections -> QUIC 0-RTT resumption eliminates round trips for returning clients (first byte in ~0 ms for repeat visitors vs ~1-300 ms with TCP+TLS). Connection pool exhaustion under load -> Thread-per-connection model (Java Servlets) wastes threads -> Event-driven async I/O (Node.js, Go goroutines, Netty) handles thousands of connections with few threads -> But CPU-intensive work still blocks -> Worker threads / separate services for CPU work. Each technology exists because the previous layer’s limitations became a bottleneck at scale. This is the story of systems engineering: optimizations at one layer expose bottlenecks at the next.Part IV — Scalability
Chapter 7: Scaling Strategies
7.1 The Scale Cube
The Scale Cube (from The Art of Scalability by Abbott and Fisher) defines three dimensions of scaling: X-axis: Horizontal duplication. Run multiple identical copies behind a load balancer. The simplest scaling approach. Each instance handles a subset of requests. Works for stateless services. Y-axis: Functional decomposition. Split by function or service. The monolith becomes microservices: order service, payment service, notification service. Each scales independently based on its own load characteristics. Z-axis: Data partitioning. Split by data. Each instance handles a subset of data — sharding by user ID, tenant ID, or geography. Each shard handles requests for its subset.Interview Question: Walk me through the Scale Cube with examples.
Interview Question: Walk me through the Scale Cube with examples.
- X-axis: Running 10 identical web server instances behind a load balancer.
- Y-axis: Extracting the image processing into its own service that scales independently from the API.
- Z-axis: Sharding the users database by user ID range so each shard handles 1 million users.
- Martin Abbott & Michael Fisher — “The Art of Scalability” — the book that defined the Scale Cube.
- engineering.shopify.com — “A Pods Architecture to Allow Shopify to Scale” — applied Z-axis sharding with pod isolation.
- highscalability.com — “How Uber scales their real-time market platform” — combined X/Y/Z axes in a high-throughput system.
7.2 Vertical vs Horizontal Scaling
Vertical scaling (scale up): Bigger machine — more CPU, more RAM, faster disks. Advantages: no distributed system complexity, no code changes needed, one machine to monitor. Limits: there is a maximum machine size (AWS u-24tb1.metal: 448 vCPUs, 24 TB RAM, ~0.96/hour, 8 vCPUs, 64 GB RAM) which handles far more load than most startups realize. Horizontal scaling (scale out): More machines of the same size behind a load balancer. Advantages: theoretically unlimited capacity, high availability (losing one machine is not catastrophic), cost-efficient (many small machines can be cheaper than one giant one). Challenges: requires stateless services (or shared state infrastructure), adds distributed system complexity (network latency between nodes, partial failures, data consistency), and not all workloads parallelize well (a single SQL query cannot be split across machines without sharding).7.3 Stateless Design
A stateless service stores no local state between requests — every request can be handled by any instance. What to externalize (with latency cost of externalization):- Sessions -> Redis or a session store (~0.5-1 ms per lookup vs ~0 ms for in-memory, but enables horizontal scaling).
- Files/uploads -> object storage (S3: ~20-50 ms for a GET, but infinitely scalable and durable at 99.999999999%).
- Cached data -> distributed cache (Redis: ~0.5-1 ms per GET, Memcached: ~0.2-0.5 ms per GET).
- Config -> environment variables or a config service (read at startup, ~0 ms runtime cost).
- Background job state -> a job queue (BullMQ, Sidekiq, Celery — queue publish: ~1-5 ms).
7.4 Sharding and Partitioning
When a single database cannot handle the load, split data across multiple instances. Shard key selection is critical — it must distribute data evenly and align with access patterns. Trade-offs: Cross-shard queries are expensive (scatter-gather across 10 shards: ~10x the latency of a single-shard query, plus coordination overhead — a 5 ms query becomes ~50-80 ms). Cross-shard transactions are extremely difficult (two-phase commit adds ~10-50 ms of coordination overhead and reduces throughput by 2-5x). Rebalancing when adding shards is complex (migrating 1 TB of data online can take hours to days). Joins across shards are impossible at the database level — you must do them in the application. Sharding should be a last resort after exhausting vertical scaling, read replicas, caching, query optimization, and data archiving.Database Sharding Strategies — Detailed Comparison
The choice of sharding strategy has deep, long-term consequences. Here are the three primary approaches and when each is appropriate.| Strategy | How It Works | Even Distribution | Range Queries | Rebalancing | Best For |
|---|---|---|---|---|---|
| Hash-based | shard = hash(key) % N | Excellent (uniform) | Expensive (scatter-gather) | Hard (adding a shard reshuffles data) | User profiles, sessions, general-purpose |
| Range-based | shard 1: IDs 1-1M, shard 2: IDs 1M-2M | Poor (hot spots on recent ranges) | Efficient (one shard has the range) | Easy (split a range in two) | Time-series, logs, sequential access |
| Geo-based | shard-us, shard-eu, shard-apac | Depends on user distribution | Efficient within region | Medium (move users between regions) | Multi-region apps, data residency (GDPR) |
hash(key) % N (where changing N reshuffles almost everything), consistent hashing arranges all shards on a virtual ring. Each key is assigned to the next shard clockwise on the ring. When you add a shard, it takes ownership of a segment of the ring, and only the keys in that segment (~1/N of total data) migrate. When you remove a shard, its keys move to the next neighbor. This means adding or removing a shard moves the minimum possible amount of data. DynamoDB, Cassandra, and MongoDB all use variants of consistent hashing. In practice, “virtual nodes” (each physical shard gets multiple positions on the ring) improve balance further.
Range-based sharding preserves data ordering, making time-range queries efficient (a single shard contains the relevant range). The cost is hot spots: if you shard by creation timestamp, the newest shard absorbs all writes while older shards are idle. Mitigation: compound shard keys (e.g., tenant_id + timestamp) spread writes across shards while keeping each tenant’s data range-queryable within one shard.
Geo-based sharding assigns data to the shard closest to the user’s region. This reduces read/write latency (the database is in the same region as the user) and satisfies data residency requirements (EU user data stays in EU). The challenge is cross-region queries — a global analytics dashboard must scatter-gather across all regional shards. This strategy works well for applications with strong regional locality (social networks, e-commerce marketplaces) but poorly for applications where any user might access any data (collaborative document editing).
Interview Question: You need to shard a multi-tenant SaaS database. How do you choose a shard key?
Interview Question: You need to shard a multi-tenant SaaS database. How do you choose a shard key?
tenant_id because it aligns with access patterns — almost every query is scoped to a single tenant, so queries never cross shards during normal operation.However, watch for:- Whale tenants: If one tenant has 50% of the data, that shard becomes a hot spot. Mitigation: sub-shard large tenants (e.g.,
tenant_id + hash(user_id)) or place whale tenants on dedicated shards. - Cross-tenant queries: Admin dashboards, billing aggregation, and analytics that span all tenants require scatter-gather. Solve with a separate analytics pipeline (e.g., replicate to a data warehouse) rather than querying across shards.
- Tenant migration: When a shard gets too large, you need to move tenants between shards. Design the migration process upfront — it should be online (no downtime) using double-write followed by cutover.
hash(tenant_id) % N for even distribution. If tenants vary wildly, use a lookup table (tenant -> shard mapping) so you can manually place large tenants and rebalance without changing the hashing scheme.Real-World Example — GitHub’s Sharding with Vitess: GitHub migrated their core MySQL database to Vitess (the Kubernetes-native sharding layer built at YouTube) to shard their massive gists and repositories tables. They shard by repository_id hash for repositories and by user_id for user-scoped data. When they encountered whale repositories (kubernetes/kubernetes, for example), they used Vitess’s vreplication to split individual shards online without downtime — a process that would have required a maintenance window with traditional MySQL sharding.1/N of the keys instead of reshuffling everything. Use this term when explaining why DynamoDB, Cassandra, and MongoDB can add nodes without massive data migration.tenant_id as the shard key. A new whale tenant emerges that’s 40% of your total traffic. What do you do?A: Move the whale to a dedicated shard using a lookup table routing layer — queries for tenant_id = whale_123 go to shard-whale-1, everyone else uses the hash. This is why I’d build the routing layer with a lookup table from day one rather than hardcoding hash(tenant_id) % N in the application. Migration: double-write to both old and new shards, verify data integrity, flip reads to the new shard, stop the old double-write.Q: Your analytics team wants to run “top 10 most active tenants by event count this week.” Cross-shard query. How do you handle it?A: Don’t run it on the operational shards. Replicate to a data warehouse (Snowflake, BigQuery, Redshift) via CDC (Debezium, Kafka Connect), and run analytics there. The operational database’s job is to serve transactional queries fast; analytics has different access patterns and shouldn’t compete with checkout for resources. If you must scatter-gather, cache the result aggressively and accept 1-hour staleness.Q: You’re 2 years into sharding and the team regrets the decision. They want to un-shard. What’s the plan?A: It’s hard but not impossible. The approach: (1) scale up vertically enough that a single instance could theoretically hold all data, (2) replicate all shards into a single unified instance in parallel, (3) verify consistency via checksum comparison, (4) cut over reads shard-by-shard, (5) cut over writes last. Expect 3-6 months. The lesson for interviews: always discuss the cost of reversal, not just the cost of implementation.- Vitess documentation (vitess.io) — the CNCF sharding layer used by Slack, GitHub, YouTube.
- highscalability.com — “How Notion scales their SaaS platform” — applied multi-tenant sharding in a block-based data model.
- engineering.linkedin.com — “Espresso: LinkedIn’s Distributed Document Store” — deep dive on sharding strategies in a production datastore.
Diagnose from Signals: Sharding and Hot Partitions
7.5 Queue-Based Scaling and Backpressure
Message queues decouple producers from consumers. During spikes, the queue absorbs the burst and consumers process steadily. Backpressure is signaling upstream that you are overwhelmed. HTTP 429 withRetry-After: 30 header. Queue depth limits (e.g., reject new messages when SQS queue exceeds 100,000 messages). Connection pool limits (reject when all connections are occupied). Rate limiting at the gateway (e.g., 1,000 req/s per API key). Load shedding — dropping low-priority requests (analytics, recommendations) to protect high-priority ones (checkout, payment). Netflix’s system sheds up to 90% of non-critical traffic during overload to keep the core streaming path alive.
Backpressure Strategies in Depth
Without explicit backpressure, an overwhelmed system silently degrades until it collapses. Here are the key patterns for implementing backpressure at different layers of the stack. Token Bucket The token bucket algorithm controls the rate of requests by maintaining a “bucket” of tokens. Each request consumes one token. Tokens are added at a fixed rate (e.g., 100 per second). If the bucket is empty, the request is either rejected (HTTP 429) or queued.- Bucket size controls burst tolerance: a bucket of 200 tokens with a 100/s refill rate allows a burst of 200 requests followed by a sustained rate of 100/s.
- Use case: API rate limiting, per-client throttling.
- Trade-off: Simple to implement and reason about, but does not adapt to downstream capacity — if the database slows down, the token bucket still allows its configured rate.
- Difference from token bucket: The leaky bucket enforces a perfectly smooth output rate. The token bucket allows bursts up to the bucket size.
- Use case: Smoothing bursty traffic before it hits a database or external API that cannot tolerate spikes.
- Trade-off: Excellent for protecting fragile downstream systems, but adds latency to every request (even when the system has spare capacity) because the processing rate is fixed.
- As a backpressure mechanism: The circuit breaker propagates pressure upstream. When the database circuit opens, the API responds with HTTP 503. The load balancer routes traffic to healthy instances. The client receives a clear signal to retry later.
- Key parameters: failure threshold (percentage), window size (time), cooldown duration, half-open request limit.
- Trade-off: Prevents cascade failures and gives the downstream system time to recover, but requires careful tuning — too sensitive and it trips on transient errors; too lenient and it fails to protect.
Real-World Story: Shopify's Flash Sale Architecture -- 10K+ Orders per Second on Black Friday
Real-World Story: Shopify's Flash Sale Architecture -- 10K+ Orders per Second on Black Friday
7.6 Autoscaling
Dynamically adjust instance count based on demand. Choose the right metric (CPU, memory, queue depth, custom). Set cooldown periods to avoid flapping (typical: 60-300 seconds scale-down cooldown, 30-60 seconds scale-up cooldown). Account for startup time — a JVM service takes 30-90 seconds to boot and warm the JIT; a Go/Rust binary takes 1-5 seconds; a Docker container pull adds 10-30 seconds; a new EC2 instance takes 60-120 seconds. Pre-warm for predictable spikes (e.g., scale up at 7:45 AM for the 8 AM traffic surge, not during it).Interview Question: You are designing an autoscaling policy for a service that handles both real-time API requests and background batch jobs. How do you approach it?
Interview Question: You are designing an autoscaling policy for a service that handles both real-time API requests and background batch jobs. How do you approach it?
- Startup time matters: If instances take 90 seconds to boot (typical for a Spring Boot JVM service), scale-up must trigger early. Use AWS predictive scaling (needs 24+ hours of historical data) or scheduled scaling (
aws autoscaling put-scheduled-update-group-action) if traffic patterns are cyclical (e.g., scale up at 7:45 AM before the 8 AM spike, not during it). For Kubernetes, use KEDA’scrontrigger to pre-scale pods ahead of known traffic patterns. - Cost guardrails: Set a maximum instance count (
MaxSizein AWS ASG,maxReplicasin K8s HPA) to prevent runaway scaling from a bug or an attack that generates fake load. Also set billing alerts — a compromised API key generating 10x traffic can autoscale you into a $50,000 weekend. - Scale-to-zero: For batch processors, scale to zero when the queue is empty. AWS Lambda, Kubernetes KEDA (with
minReplicaCount: 0), or Google Cloud Run can handle this. The API layer should never scale to zero (cold start latency is unacceptable for user-facing traffic — Lambda cold starts: 100-500 ms for Python/Node, 3-10 seconds for JVM; K8s pod startup: 5-30 seconds including image pull).
maxReplicas ceiling that prevents bankruptcy regardless of load. Autoscaling should be the last line of defense, not the first. A seasoned engineer builds in “circuit breakers” at multiple levels to prevent a single failure mode from escalating.Q: Your predictive scaling pre-warms 40 pods at 7:45 AM for the 8 AM spike. The spike doesn’t come (product launch delayed). How much did that mistake cost?A: Depends on pod size and duration. If each pod is 8. Cheap insurance. The real question isn’t “did predictive scaling waste money on this one day” — it’s “over a quarter of pre-warming, did we save more in P99 latency than we spent on unused capacity?” Track this metric. If predictive scaling is net-negative, switch back to reactive.- Kubernetes documentation — “Horizontal Pod Autoscaler” (kubernetes.io/docs) — canonical reference for HPA tuning.
- keda.sh/docs — event-driven autoscaling for Kubernetes with 60+ scaler types (Kafka, SQS, Prometheus).
- AWS Builders’ Library — “Using load shedding to avoid overload” (aws.amazon.com/builders-library) — the philosophy behind combining autoscaling with load shedding.
Real-World Story: How Instagram Served 1 Billion Users with Just a Few Hundred Engineers
Real-World Story: How Instagram Served 1 Billion Users with Just a Few Hundred Engineers
Diagnose from Signals: Autoscaling
7.6b Cost-Performance Trade-offs
Performance optimization and cost optimization are often in tension. A senior engineer must navigate this trade-off explicitly rather than defaulting to “make it faster” or “make it cheaper.” The cost of latency:- Amazon found that every 100ms of added latency reduced sales by 1%. At 5B per 100ms.
- Google found that an extra 500ms in search results reduced traffic by 20%.
- But these numbers apply to user-facing latency. Shaving 5ms off an internal batch job that runs at 3 AM has zero revenue impact.
- Running 20 pods when 8 would suffice at current traffic wastes 60% of compute spend.
- An oversized RDS instance (db.r6g.4xlarge at 691/month) would handle the load wastes 24,876/year.
- Reserved instances and savings plans can reduce cloud compute costs by 30-60%, but require commitment and capacity planning.
- A 50K in lost revenue and engineer time, is a net loss of $75,254/year.
- Under-provisioned connection pools that cause 0.5% of requests to timeout mean lost transactions and degraded user trust.
Interview Question: Your cloud bill is $180K/month. The CFO wants it at $120K. How do you cut 33% without degrading performance?
Interview Question: Your cloud bill is $180K/month. The CFO wants it at $120K. How do you cut 33% without degrading performance?
- Right-size instances. Use AWS Compute Optimizer or Datadog’s resource optimization to identify instances running at <30% CPU/memory utilization. A
c5.4xlargerunning at 15% CPU should be ac5.xlarge. This alone typically saves 10-15%. - Reserved instances / Savings Plans. If the workload is stable, commit to 1-year reserved instances for the baseline (the minimum you always run). This saves 30-40% on those instances. Use on-demand only for the variable portion.
- Eliminate waste. Unattached EBS volumes, idle load balancers, forgotten staging environments, snapshots older than 90 days. Every company has $5K-15K/month of zombie resources.
- Move from provisioned to on-demand where appropriate. DynamoDB provisioned capacity costs for the peak even during troughs. On-demand costs more per request but nothing during idle periods. For bursty workloads, this can save 30-50%.
- Tiered storage. Move cold data from RDS to S3 + Athena. A 2TB RDS instance costs ~46/month. Athena queries cost $5/TB scanned.
- Cache aggressively. Every database query you eliminate with Redis saves database IOPS and potentially allows downsizing the RDS instance. A Redis
cache.r6g.largeat 2,764/month) to adb.r6g.2xlarge(1,182/month net.
- Set performance budgets alongside cost budgets. Every cost reduction must be validated against latency SLAs. If downsizing an instance increases P95 from 100ms to 140ms, that might be acceptable. If it increases to 800ms, it is not.
- Implement cost anomaly detection. AWS Cost Anomaly Detection or a custom CloudWatch alarm for daily spend exceeding 120% of the 7-day average. This catches runaway autoscaling, a misconfigured batch job, or a compromised key spinning up crypto-mining instances.
- FinOps Foundation (finops.org) — the canonical framework for cloud financial management.
- highscalability.com — “How Netflix uses AWS” — applied architectural choices for cost-efficiency at massive scale.
- AWS Well-Architected Framework — “Cost Optimization Pillar” — official guide to structured cost reduction.
7.7 Read-Heavy vs Write-Heavy Design
Read-heavy: Add read replicas (each PostgreSQL replica handles ~5,000-15,000 read queries/s independently), aggressive caching (Redis cache hit: ~0.5 ms vs database query: ~5-50 ms — a 10-100x improvement), CDN for static content (edge response: ~1-5 ms vs origin: ~50-200 ms), denormalized read models (CQRS — single-table lookup: ~1-3 ms vs multi-JOIN query: ~50-500 ms). Most web applications are read-heavy. Write-heavy: Write-ahead logs, batched writes (100 individual INSERTs: ~200 ms, one batched INSERT of 100 rows: ~5-15 ms — 15-40x faster), append-only data structures (LSM trees in Cassandra/RocksDB: ~0.01-0.1 ms per write vs B-tree random write: ~1-5 ms), eventual consistency, write sharding. Analytics ingestion and logging systems are write-heavy.Interview Question: How do you decide between optimizing for reads vs writes in a system design?
Interview Question: How do you decide between optimizing for reads vs writes in a system design?
- Denormalize data (duplicate it) so reads require no JOINs. Accept that writes become more complex (update multiple places).
- Use CQRS: a normalized write model for data integrity, a denormalized read model (materialized views or a separate read store) for performance.
- Cache aggressively at multiple layers: application cache (Redis), CDN for static/semi-static content, HTTP caching headers for browser caching.
- Read replicas handle read queries, freeing the primary for writes.
- Batch writes to amortize disk I/O (e.g., buffer 1,000 events in memory, flush once to disk).
- Use append-only data structures (LSM trees, as in Cassandra, RocksDB) instead of B-trees (which require random writes for updates).
- Accept eventual consistency — the read path can lag behind the write path by seconds or minutes if the use case allows it.
- Shard the write path by partition key so writes parallelize across nodes.
- Martin Kleppmann — “Designing Data-Intensive Applications” — chapters 3 (storage engines) and 11 (stream processing) are essential.
- highscalability.com — “The architecture Twitter uses to deal with 150M active users” — applied fanout-on-write vs fanout-on-read.
- engineering.linkedin.com — “The Log: What every software engineer should know about real-time data’s unifying abstraction” — Jay Kreps on append-only write architectures.
Scenario-Based Interview Deep Dives
These questions test the intersection of debugging instincts, architectural thinking, and production experience. They are the kind of questions that separate engineers who have operated systems under pressure from those who have only designed them on a whiteboard.Scenario: Your API's P99 latency jumped from 50ms to 2s after a deploy. The P50 is still fine. Walk me through your investigation.
Scenario: Your API's P99 latency jumped from 50ms to 2s after a deploy. The P50 is still fine. Walk me through your investigation.
/v2/search endpoint.”Step 4 — Check resource-level metrics during the P99 window.- Database: Slow query log — did the deploy introduce a query that lacks an index? Run
EXPLAIN ANALYZEon suspicious queries. - Connection pool: Are pool wait times elevated? A new query that holds connections longer reduces pool availability for everyone else.
- GC pauses: Did the deploy increase object allocation? Check GC pause logs. A new feature that creates large temporary objects (e.g., deserializing a big response into memory) can trigger long GC pauses for a fraction of requests.
- External dependencies: Did the deploy add or change a call to an external service? Check that service’s latency.
- Jeff Dean & Luiz Barroso — “The Tail at Scale” (research.google) — why P99 regressions are inevitable in fan-out systems.
- Shopify engineering blog (shopify.engineering) — posts on fast rollback culture and deploy safety.
- Netflix Tech Blog — “Automated Canary Analysis at Netflix with Kayenta” — the ML-based auto-rollback system that catches P99 regressions in minutes.
Scenario: You're building a flash-sale system. 100K users will hit 'buy' within 10 seconds. Design the architecture.
Scenario: You're building a flash-sale system. 100K users will hit 'buy' within 10 seconds. Design the architecture.
- Redis atomic decrement: Store available quantity in Redis. Use
DECR(atomic) to reserve. If the result is >= 0, the reservation succeeds. If negative,INCRback and reject. This handles 100K+ operations per second easily. Redis is single-threaded, soDECRis inherently serialized — no race conditions. - Database with SELECT FOR UPDATE: Works but creates lock contention. At 10K concurrent requests, this will bottleneck.
- Optimistic locking with version counter:
UPDATE inventory SET quantity = quantity - 1, version = version + 1 WHERE product_id = ? AND version = ? AND quantity > 0. Retry on conflict. Works at moderate scale but retry storms at 100K are expensive.
DECR is atomic because Redis is single-threaded. Use this term when explaining why in-memory operations avoid the race conditions that plague distributed locking.- Shopify engineering blog — “How Shopify handles Black Friday / Cyber Monday” — applied flash sale architecture at platform scale.
- Cloudflare blog — “Cloudflare Waiting Room” — how an edge-based waiting room works and integrates with origin health.
- highscalability.com — “Ticketmaster: Architecting a Concert Ticket Sale” — real-world flash sale at extreme concentration.
- AWS Architecture Blog — “Flash Sale Architecture on AWS” — AWS’s reference architecture using SQS, Redis (ElastiCache), and CloudFront.
Scenario: A database query that was fast is now slow. The data grew 100x. What are your options beyond 'add an index'?
Scenario: A database query that was fast is now slow. The data grew 100x. What are your options beyond 'add an index'?
EXPLAIN ANALYZE on the slow query and compare the output to a saved plan from when it was fast. Look at the planner’s row estimates vs actuals — if it estimated 100 rows but got 10 million, stale statistics are the cause (fix: ANALYZE table_name). Check pg_stat_user_tables — if n_dead_tup is in the millions, autovacuum has fallen behind and table bloat is causing sequential scans. Check the buffer pool hit ratio (SELECT pg_stat_get_buf_alloc()) — if it dropped below 95%, the working set no longer fits in shared_buffers and queries are hitting disk.Then, understand why indexes alone are not enough. An index on a 1 million row table fits comfortably in RAM (~10-50 MB for a typical B-tree index). An index on a 100 million row table may not (~1-5 GB) — the B-tree is deeper (4-5 levels vs 3 levels, meaning 1-2 extra disk I/O per lookup), the index itself is larger (slower to scan for range queries), and index maintenance on writes becomes more expensive (each INSERT updates every index, and with 5 indexes on a 100M row table, that is 5 B-tree rebalancing operations per write). Beyond a certain data size, indexes help but do not solve the fundamental problem.Option 1 — Partition the table (table partitioning, not sharding).
Split the table into partitions by a key (usually time-based). PostgreSQL native partitioning (PARTITION BY RANGE (created_at)) creates child tables. A query for “orders in the last 30 days” only scans the recent partition, not 100 million rows. The database query planner prunes irrelevant partitions automatically (verify with EXPLAIN — look for “Partitions removed” in the output). This is the first thing to try because it requires no application code changes. In practice, monthly partitions for a table growing at 10M rows/month means each partition is a manageable 10M rows with indexes that fit in RAM.Option 2 — Archive cold data.
If the query is slow because it scans 100 million rows but only 1 million are “active,” move old data to an archive table or a cheaper storage layer (S3 + Athena for analytics). The active table stays small and fast. Implement a retention policy: rows older than 90 days are moved nightly. This is a common pattern for order histories, log tables, and audit trails.Option 3 — Materialized views or pre-computed aggregations.
If the slow query is an aggregation (SUM, COUNT, AVG over millions of rows), pre-compute the result and store it. A materialized view refreshes periodically. For real-time aggregations, maintain a running counter in Redis or a summary table updated on each write. Trading write-time computation for read-time speed.Option 4 — Denormalize the read path (CQRS).
If the slow query involves multiple JOINs across large tables, create a denormalized read model. When data is written to the normalized tables, an event triggers an update to a flat, pre-joined read table. Reads become single-table lookups. The cost: write amplification and eventual consistency on the read model.Option 5 — Change the storage engine.
A 100x data growth may mean the workload has outgrown the database’s storage model. PostgreSQL (B-tree based) is excellent for transactional reads and writes on moderate data sizes. For analytical queries across hundreds of millions of rows, a columnar store (ClickHouse, DuckDB, BigQuery) can be 10-100x faster because it reads only the columns needed and compresses aggressively. For time-series data, a purpose-built time-series database (TimescaleDB, InfluxDB) uses time-aware partitioning and compression.Option 6 — Shard the database.
The nuclear option. If a single database instance cannot hold the data or handle the query load even after the above optimizations, shard across multiple instances. This adds significant complexity (cross-shard queries, distributed transactions, shard rebalancing) and should be the last resort.Option 7 — Cache the query result.
If the data is read-heavy and can tolerate staleness, cache the expensive query result in Redis with a TTL. A query that takes 5 seconds to run but is valid for 60 seconds can serve thousands of requests from cache. Invalidate on writes if fresher data is required.What weak candidates say: “Add more indexes.” (Shows only one tool in the toolbox.) “Upgrade to a bigger server.” (Vertical scaling buys time but does not address the fundamental query pattern.) “Switch to NoSQL.” (Cargo-cult answer — NoSQL does not magically make queries faster; it trades flexibility for specific access pattern optimization.)Words that impress: “partition pruning,” “I’d start with EXPLAIN ANALYZE and check if the planner’s row estimates match actuals,” “I’d check if this is an OLTP query being asked to do OLAP work,” “materialized view with a refresh policy tied to pg_cron,” “the B-tree depth increases logarithmically but the working set no longer fits in shared_buffers,” “cold/hot data separation with an archival pipeline to S3 + Athena,” “I’d check pg_stat_user_tables for vacuum lag before assuming the query itself is the problem.”notifications table exceeding 4 billion rows. They did not shard. Instead, they (1) partitioned by user_id hash into 16 partitions, (2) aggressively archived notifications older than 60 days to S3, and (3) added covering indexes for the three hot query patterns. The result: query P99 dropped from 2.3 seconds to 85ms without taking on the operational burden of a sharded MySQL cluster. The lesson: exhaust partitioning + archival before reaching for sharding, especially for time-correlated data.CREATE INDEX on the partitioned table). Global indexes cost more to maintain on writes but enable fast lookups. The goal: make the common case fast at the cost of the uncommon case being slower.Q: You propose a materialized view that refreshes every 5 minutes. The product team says “we need real-time data.” How do you respond?A: Quantify “real-time.” If they mean “within 60 seconds,” the materialized view with a 60-second refresh is still usable. If they mean “within 1 second,” the materialized view approach is wrong — I would propose either a streaming aggregation (Flink/Kafka Streams maintaining the aggregate incrementally) or showing cached-plus-delta: display the materialized view value with ”+ N events since last refresh” computed on the fly. Both are more expensive than a 5-minute refresh but cheaper than real-time recomputation.Q: The team archives old data to S3 + Athena. A user complains that searching their 5-year-old order history takes 30 seconds. Is this acceptable?A: Depends on the access pattern. If 0.1% of users search old history once per year, 30 seconds is fine — acceptable latency for rare queries. If it is a common flow (every user views full history on page load), 30 seconds is unacceptable. The architectural fix: keep hot data (90 days) in the primary DB, warm data (90 days - 2 years) in a cheaper-but-faster store (e.g., ClickHouse), and cold data (>2 years) in S3 + Athena for ad-hoc. Three tiers by access frequency, each with appropriate latency.- PostgreSQL Docs — “Table Partitioning” (postgresql.org/docs) — authoritative reference for declarative partitioning and partition pruning.
- ClickHouse Docs (clickhouse.com/docs) — the columnar store most teams reach for when OLTP queries become OLAP.
- highscalability.com — “How Notion scales their block-based data model” — applied partitioning and archival in a SaaS context.
- AWS Database Blog — “Tiered storage for PostgreSQL” — patterns for hot/warm/cold data separation.
Scenario: You are shown this Grafana screenshot -- HTTP request rate is flat at 2,000 req/s but the SQS queue depth has grown from 500 to 45,000 in the last hour. The consumer service CPU is at 8%. What do you investigate?
Scenario: You are shown this Grafana screenshot -- HTTP request rate is flat at 2,000 req/s but the SQS queue depth has grown from 500 to 45,000 in the last hour. The consumer service CPU is at 8%. What do you investigate?
kubectl get events --field-selector reason=Evicted.Hypothesis 3: Poison message causing retry loops. A single malformed message that causes the consumer to fail and retry blocks that consumer thread. With 10 consumers each running 5 threads, one poison message blocks 1/50th of capacity. If the DLQ is configured with a high maxReceiveCount (say 100), the message is retried 100 times before being dead-lettered — blocking the thread for 100 x processing_time. Check SQS ApproximateReceiveCount for messages with high receive counts.Hypothesis 4: Lock contention or serialized processing. If the consumer acquired a distributed lock to process messages in order, and the lock holder is slow, all other consumers are waiting. Check Redis or DynamoDB lock metrics.Measurement that proves resolution: Queue depth is decreasing (drain rate > arrival rate). The key secondary metric: consumer processing time per message returned to baseline. If you scaled up consumers to compensate for slow processing, you patched the symptom — the root cause (slow dependency) still needs fixing.Follow-up: Failure mode. What happens if the queue hits the SQS maximum message retention (14 days, 120K messages default for standard queues — actually no hard limit on depth, but retention is 14 days)? Old messages start being deleted. If those messages represent orders, you have data loss. Follow-up: Rollback. If the consumer slowdown was caused by a bad deploy to the downstream service, rolling back that service is the right move. But the 44,500 accumulated messages still need to be drained. Do you scale up consumers temporarily, or let them drain at normal rate? Follow-up: Cost. At 45K messages and growing, each SQS API call costs 0.20 per million invocations, and you have 45K messages times 3 retries each, you are spending more on failed retries than on successful processing.Scenario: After a performance optimization, your team claims 'we improved checkout latency by 40%.' How do you validate this claim? What questions do you ask?
Scenario: After a performance optimization, your team claims 'we improved checkout latency by 40%.' How do you validate this claim? What questions do you ask?
- Memory increased. The optimization pre-computes and caches results, trading memory for latency. If memory usage grew 30%, you may need larger containers — increasing cost.
- Write latency increased. A denormalization that speeds up reads by 40% might slow writes by 20% because each write now updates multiple tables.
- Freshness decreased. A cache that delivers 40% faster responses might serve data that is 60 seconds stale. Is that acceptable for checkout?
- Complexity increased. The optimization added a caching layer, a background refresh job, and a cache invalidation hook. Future debugging is harder. Count the added lines of code and new dependencies.
From Theory to Ops: The Operational Bridge
The sections above teach you how performance systems work. This section teaches you how they break — and what you see on your dashboard at 2 AM when they do. Every topic below starts from a signal, not a textbook definition. This is how senior engineers actually think during incidents: signal first, hypothesis second, fix third.Queue Buildup: When Arrival Rate Exceeds Drain Rate
Queues are the shock absorbers of distributed systems. When they grow beyond normal, the system is telling you something changed — either producers sped up or consumers slowed down.Pool Exhaustion: The Silent Capacity Cliff
Connection pools, thread pools, and worker pools all share the same failure mode: everything works fine until the pool is full, then latency goes vertical. There is no gradual degradation — the hockey stick curve from queueing theory applies directly.Container CPU Throttling: The Invisible Latency Tax
CPU throttling in Kubernetes is uniquely dangerous because it adds latency without any application-level signal. Your service logs show normal processing times, your APM shows normal span durations, but the user experiences multi-second delays. The time is being stolen by the kernel, not consumed by your code.Noisy Neighbors: When the Problem Is Not Your Code
In shared cloud infrastructure, you share physical hardware with other tenants. A noisy neighbor is another VM or container consuming disproportionate resources on the same physical host, causing your service to experience degradation that has no internal explanation.Cost-Performance Trade-Off Matrix
The hardest performance decisions are not “how do I make it faster” but “is making it faster worth the money.” This matrix maps common performance optimizations to their cost profile and helps you make the ROI argument.(monthly infrastructure cost of the optimization) / (milliseconds of P95 latency reduction * requests per month). This gives you the cost per millisecond per request — the unit economics of performance investment.| Optimization | Typical Latency Improvement | Monthly Cost | Cost Per ms Saved (at 10M req/month) | Verdict |
|---|---|---|---|---|
| Add a missing database index | 50-500ms per affected query | ~$0 (DDL operation) | Effectively free | Always do this first |
| Add Redis cache for hot reads | 20-100ms per cache hit | $50-200 (ElastiCache) | $0.0005-0.01 per ms | Almost always worth it |
| Move from t3.medium to c5.xlarge | 10-30ms (reduced CPU contention) | +$80/month | $0.003-0.008 per ms | Worth it for user-facing services |
| Add a CDN (CloudFront/Cloudflare) | 50-200ms for cacheable content | $50-500 depending on traffic | $0.0003-0.005 per ms | Worth it once traffic exceeds ~1M req/month |
| Multi-region deployment | 50-150ms for remote users | +$2,000-10,000 | $0.01-0.10 per ms | Only if you have significant global users |
| gRPC migration (from REST+JSON) | 5-20ms per inter-service call | 0 infra | Very high initially, amortizes over time | Only at >10K internal req/s |
| Sharding the database | Variable (avoids ceiling) | +$500-3,000/month per shard | Varies wildly | Last resort after all cheaper options |
Interview Prompt: Your Grafana dashboard shows P95 checkout latency at 1.2s. Business says it must be under 500ms. Your cloud bill is already $140K/month. Walk me through your approach.
Interview Prompt: Your Grafana dashboard shows P95 checkout latency at 1.2s. Business says it must be under 500ms. Your cloud bill is already $140K/month. Walk me through your approach.
- Database queries: 450ms (3 sequential queries, 150ms each)
- Payment API call (Stripe): 350ms
- Application logic + serialization: 80ms
- Redis cache lookups: 15ms
- Network overhead: 50ms
- Unaccounted (queue waits, GC): 255ms
- The 3 sequential database queries at 150ms each: can they be parallelized?
Promise.allor Go goroutines reduce 450ms to ~150ms. Cost: $0, 2 hours of engineering time. - 150ms per query is high for indexed reads. Run
EXPLAIN ANALYZE— a missing composite index or stale statistics could bring each query to 5-20ms. Cost: $0. - The 255ms unaccounted time: likely connection pool wait time or GC pauses. Check HikariCP pending count and GC logs. Tuning pool size or GC settings: $0.
- Cache the database results that do not change per-request (product details, tax rates) in Redis. If 2 of the 3 queries are cacheable, you eliminate 300ms of database time and replace it with 2ms of Redis time. Cost: Redis is likely already running; marginal cost ~$0.
- Set a timeout on the Stripe API call at 2 seconds (currently unbounded). Add a fallback: if Stripe is slow, queue the payment and tell the user “processing.” This does not reduce the P50 but caps the P95 at a predictable value.
Symptom-First Diagnostic Reference Table
When the dashboard fires an alert, start here. Match the symptom pattern to the most likely root cause and the first diagnostic command to run.| Symptom Pattern | CPU | Memory | Error Rate | Most Likely Cause | First Command to Run |
|---|---|---|---|---|---|
| P99 high, P50 normal, CPU low | Low | Normal | Normal | I/O wait: pool exhaustion, slow query, or GC pause | Check connection pool wait time and GC logs |
| P99 high, P50 high, CPU high | High (>80%) | Normal | Normal | CPU-bound: hot code path, serialization, crypto | Flame graph (pprof, async-profiler, py-spy) |
| P99 high, P50 normal, error rate spiking | Low | Normal | Elevated | Downstream dependency failing for subset of requests | Distributed traces filtered by error; check circuit breaker state |
| All latency up, pod restarts increasing | Low/Medium | High (near limit) | Elevated | OOMKill from container memory limit; off-heap memory growth | kubectl describe pod for OOMKilled; check RSS vs heap |
| Latency spikes at regular intervals (e.g., every 5 min) | Spikes then normal | Spikes then drops | Normal | GC pauses (JVM full GC) or cron jobs competing for resources | JVM: jstat -gcutil; K8s: kubectl top pods during spike window |
| Latency up only on some instances | Low on affected | Normal | Normal | Noisy neighbor (steal time) or node-level hardware degradation | top — check %st column; EC2 status checks |
| Queue depth growing, consumer CPU low | Low (consumers) | Normal | Normal (producers) | Consumer downstream dependency slowed or consumer instances lost | Consumer APM traces; check consumer pod count and downstream latency |
| Latency up after deploy, no code change in hot path | Normal | Normal | Normal | Cache key rotation causing cache miss storm | Check cache hit rate before/after deploy; compare cache keys |
Recommended Reading and Resources
Curated resources for going deeper on performance and scalability. These are not generic textbook recommendations — each one is here because it offers a specific, practical insight that you will not get elsewhere.Performance Analysis and Debugging
Distributed Systems and Scalability
Auto-Scaling and Traffic Management
Architecture Patterns at Scale
Performance Debugging Playbook
A step-by-step checklist for when something is slow. Print this. Tape it to your monitor. Follow it in order when the next P99 alert fires at 2 AM.Step 1: Check the Metrics Dashboard (Time: 2-5 minutes)
Open your APM dashboard (Datadog, New Relic, Grafana, etc.) and answer these questions:- When did the problem start? Correlate with deployments, config changes, traffic spikes, or external events. Check the deploy log — was anything released in the last 1-4 hours?
- Which endpoints are affected? All of them (systemic issue) or specific ones (code-level issue)?
- Which percentiles are affected? P50 and P99 both degraded = systemic. P50 fine but P99 bad = subset of requests hitting a different path.
- What’s the error rate? Elevated errors alongside latency often points to a downstream dependency failure (timeouts, connection refused).
- What does the traffic volume look like? Is this a load spike (more requests than usual) or a per-request slowdown (same traffic, slower responses)?
- Normal P50 latency: _____ ms
- Normal P99 latency: _____ ms
- Normal throughput: _____ req/s
- Normal error rate: _____% (ideally < 0.1%)
- Database connection pool utilization: _____% (ideally < 70%)
Step 2: Profile the Slow Requests (Time: 5-15 minutes)
- Pull distributed traces for requests above the P99 threshold. Most APM tools let you filter by duration > X ms.
- Identify the slow span. In the trace waterfall, which span takes the most time? Common culprits:
- Database query: ~1-5 ms normal, you’re seeing ~500-5000 ms. Check query plan.
- External API call: ~50-200 ms normal, you’re seeing ~5-30 seconds. Check the dependency’s status page.
- Serialization/processing: ~1-10 ms normal, you’re seeing ~200-2000 ms. CPU profiling needed.
- Queue wait time: ~0 ms normal, you’re seeing ~1-10 seconds. Resource pool exhaustion.
- Check if slow requests share a pattern: Same tenant? Same data shape (large payload, specific filter)? Same geographic region? Same application instance?
Step 3: Identify the Bottleneck Category (Time: 5-10 minutes)
Based on Step 2, classify the bottleneck:| Symptom | Likely Bottleneck | Next Action |
|---|---|---|
| CPU > 80% on app servers | CPU-bound | Flame graph (pprof, async-profiler, py-spy) |
| CPU low, latency high | IO-bound (waiting on network/disk) | Distributed tracing, check downstream deps |
| Connection pool wait time > 50 ms | Pool exhaustion | Check pool size, query duration, connection leaks |
| Memory climbing steadily | Memory leak | Heap snapshot comparison (before/after) |
| GC pause time > 100 ms | GC pressure | Check heap utilization, object allocation rate |
| Disk IOPS > 80% capacity | Disk-bound | Check for unindexed queries, excessive logging |
| Error rate spike on downstream | Dependency failure | Check dependency health, circuit breaker state |
Step 4: Test Your Hypothesis (Time: 10-30 minutes)
- For database bottlenecks: Run
EXPLAIN ANALYZEon the slow query. Check for sequential scans on large tables (~100 ms+ for tables > 1M rows without an index). Checkpg_stat_activityfor lock waits. Checkpg_stat_user_tablesfor tables that need vacuuming. - For connection pool exhaustion: Check active vs idle connections. Look for long-running queries or transactions holding connections. A query that takes 30 seconds holds a connection 1000x longer than a 30 ms query — one slow query can exhaust the pool.
- For CPU bottlenecks: Capture a flame graph during the slow period. Look for hot functions consuming > 10% of CPU time. Common culprits: JSON serialization of large objects, regex on unbounded input, encryption/hashing, image processing.
- For memory issues: Compare heap snapshots taken 5 minutes apart. Look for object counts that grow monotonically. In Node.js, check for event listener leaks. In JVM, check for class loader leaks.
- For external dependency issues: Check the dependency’s status page. Check your circuit breaker metrics. Test the dependency directly with a simple
curl— is it slow from your network too, or only from your application?
Step 5: Fix (Time: varies)
Apply the fix based on your diagnosis. Common fast fixes ranked by time to implement:| Fix | Time to Implement | Impact |
|---|---|---|
| Add missing database index | 5-15 minutes | Query goes from ~500 ms to ~5 ms (100x improvement) |
| Increase connection pool size | 2 minutes (config change) | Eliminates pool wait time |
| Add caching (Redis) for hot query | 30-60 minutes | Read latency drops from ~50 ms to ~1 ms |
| Rollback the offending deploy | 5-10 minutes | Restores previous performance baseline |
| Enable query result pagination | 1-2 hours | Prevents unbounded result sets from consuming memory |
| Add circuit breaker on flaky dependency | 2-4 hours | Prevents cascade failure, fast-fails instead of hanging |
| Shard the database | Weeks-months | Last resort — only after exhausting all above options |
Step 6: Verify the Fix (Time: 5-15 minutes)
- Check that the specific metric that triggered the investigation has improved. P99 latency back to baseline? Error rate normalized?
- Check for regressions. Did your fix improve one metric but worsen another? (Example: adding a cache improves read latency but increases memory usage. Adding an index speeds up reads but slows writes by ~5-10%.)
- Run a targeted load test if the issue was load-related. Use k6 or your preferred tool to simulate the traffic pattern that triggered the issue and confirm the fix holds under load.
- Check upstream and downstream. If you fixed a slow dependency call, verify that the upstream service’s latency also improved.
Step 7: Monitor and Prevent Recurrence (Time: 30-60 minutes)
- Set up or tighten alerts. If this issue wasn’t caught by existing alerts, add one. Example: “Alert when P99 latency exceeds 2x the 7-day rolling average for more than 5 minutes.”
- Add the fix to your deployment checklist. If a missing index caused the issue, add “verify indexes exist for new queries” to your code review checklist.
- Write a brief incident report. Even for minor issues, document: what happened, why, how it was found, how it was fixed, and what prevents recurrence. This builds institutional knowledge.
- Consider automated prevention. Can a CI check catch this class of issue? (Example: a test that runs
EXPLAINon all queries and fails if any show a sequential scan on a table with > 10,000 rows.)
Interview Deep-Dive Questions
These questions go beyond the scenario-based section above. They are structured as multi-round interview exchanges — each question has a strong candidate answer followed by branching follow-ups that a senior interviewer would ask to probe depth, test judgment, and expose whether you have operated real systems or only read about them.Q1: Explain Little's Law and use it to size a connection pool for a service handling 3,000 requests per second with an average query time of 20ms. What happens when that query time doubles?
Q1: Explain Little's Law and use it to size a connection pool for a service handling 3,000 requests per second with an average query time of 20ms. What happens when that query time doubles?
Follow-up: What if the traffic is bursty rather than uniform — say 3,000 req/s average but spikes to 9,000 for 5 seconds every minute?
Strong answer: Little’s Law gives you the average concurrency, but bursts are where systems break. During a 9,000 req/s spike with 20ms queries, instantaneous concurrency is L = 9,000 times 0.02 = 180. If my pool is 100, 80 requests per second start queuing. Over 5 seconds, that is 400 requests waiting.The options are layered. First, I would size the pool for the sustained spike, not the average — so closer to 200. Second, I would add a queue with a bounded depth (say 500 items) so that during extreme spikes, excess requests get rejected with HTTP 503 and Retry-After rather than queuing indefinitely and making every request slow. Third, if this is a known pattern, I would consider a token bucket rate limiter at the gateway that smooths the burst before it reaches the pool. The worst outcome is unbounded queuing where every request gets slow — load shedding is better than universal degradation.Follow-up: You mentioned PgBouncer. Explain the difference between its pooling modes and when transaction pooling breaks things.
Strong answer: PgBouncer has three modes. Session pooling assigns a database connection for the entire client session — simple and compatible with everything, but you only get multiplexing when sessions disconnect. Transaction pooling assigns a connection only for the duration of a transaction, then returns it to the pool — this is dramatically more efficient because web requests are typically idle between transactions. Statement pooling assigns per statement, which is the most efficient but breaks transactions entirely.The gotcha with transaction pooling — and this catches people in production — is that prepared statements do not work. A prepared statement is bound to a specific database connection. In transaction pooling mode, your next transaction might land on a different connection, and the prepared statement does not exist there. This breaks ORMs like Django and ActiveRecord by default because they use prepared statements under the hood. The fix is to either disable prepared statements in the ORM (Django:DISABLE_SERVER_SIDE_CURSORS = True plus use django-db-connection-pool; Rails: prepared_statements: false in database.yml), or to use PgBouncer’s session pooling mode and accept lower multiplexing.Another thing transaction pooling breaks: SET commands. If you run SET search_path = ... in one transaction, that setting is lost when the connection returns to the pool. Anything that relies on per-connection state — LISTEN/NOTIFY, advisory locks, temporary tables — will not work with transaction pooling.Going Deeper: How would you monitor a connection pool in production to detect problems before they become outages?
Strong answer: I would expose four key metrics via Prometheus or your APM: (1) pool utilization — active connections divided by total pool size, alert at 80%; (2) pool wait time — how long requests wait for a connection, alert at 50ms; (3) connection creation rate — if this is high, connections are being churned rather than reused, which means the pool is too small or connections are being dropped; (4) connection age — if all connections are young, something is killing them (network issue, database restart, idle timeout mismatch between the pool and PgBouncer).On the database side, I would monitorpg_stat_activity for connection count, idle connections, and long-running queries. A single query holding a connection for 30 seconds consumes that connection 1,000x longer than a 30ms query — one slow query under heavy load can exhaust the entire pool. I would set statement_timeout in PostgreSQL to kill queries over a threshold (say 5 seconds for OLTP) as a safety net.Q2: Your team is debating whether to use gRPC or REST for a new internal service. Walk me through how you would make that decision.
Q2: Your team is debating whether to use gRPC or REST for a new internal service. Walk me through how you would make that decision.
Follow-up: You said gRPC uses HTTP/2 multiplexing. Can you explain what happens when a gRPC call hits a load balancer that only understands HTTP/1.1?
Strong answer: This is a real production gotcha. gRPC requires HTTP/2 end-to-end. If you put a Layer 7 load balancer in front that downgrades to HTTP/1.1 (like an older AWS ALB configuration, or a misconfigured Nginx withoutgrpc_pass), gRPC calls will fail outright because the HTTP/2 framing that gRPC depends on is lost.Even when the load balancer supports HTTP/2, there is a subtlety with connection-level load balancing vs request-level load balancing. HTTP/2 multiplexes many requests over a single TCP connection. If the load balancer does connection-level balancing (Layer 4 / TCP), all requests on that multiplexed connection go to the same backend — you lose the distribution benefit. You need request-level (Layer 7) load balancing that understands HTTP/2 frames and can route individual gRPC calls to different backends.In Kubernetes, this is why you often see gRPC services using a client-side load balancer (like gRPC’s built-in round-robin or a service mesh like Istio/Linkerd) rather than relying on kube-proxy, which does TCP-level balancing. Envoy proxy handles this correctly because it understands HTTP/2 at the frame level.Follow-up: At what scale does the JSON vs Protobuf serialization difference actually matter? Give me concrete numbers.
Strong answer: I have seen teams switch to Protobuf prematurely when they were doing 500 req/s and the JSON overhead was completely negligible. Here is when it starts to matter.Serializing a typical JSON payload (say 1KB) takes about 10-50 microseconds depending on the library and language. Protobuf for the same data: 1-5 microseconds. The difference is 10-45 microseconds per request. At 500 req/s, that is 5-22 milliseconds of total CPU per second — invisible. At 50,000 req/s, that is 500-2,250 milliseconds of CPU per second — you are burning half a core just on serialization. At 500,000 req/s (which Google and large-scale systems deal with on a single service), JSON serialization can consume 5+ CPU cores.The payload size difference matters for bandwidth too. A 1KB JSON payload becomes roughly 200-300 bytes in Protobuf. At 50,000 req/s, that saves 35-40 MB/s of network bandwidth. Within a datacenter this is not critical, but across regions or on mobile networks, it is significant.My rule of thumb: below 5,000 req/s, JSON is fine for virtually all use cases and the developer experience benefits outweigh the performance cost. Between 5,000-50,000 req/s, profile before switching — the bottleneck is usually the database or network, not serialization. Above 50,000 req/s, Protobuf is almost certainly worth it on the internal hot path.Q3: Explain the bulkhead pattern. When would you use it, and how does it relate to thread pool management?
Q3: Explain the bulkhead pattern. When would you use it, and how does it relate to thread pool management?
maxSockets setting.The trade-off is resource utilization. With a single shared pool of 200 threads, all 200 can serve any request — maximum flexibility. With bulkheads, each pool is smaller and cannot borrow from others. If Service A is busy and Service B is idle, Service A’s pool may be exhausted while Service B’s threads sit idle. This is the price of isolation, and it is worth paying. The alternative — cascade failures that take down the entire service — is far more expensive.Follow-up: How do you decide the size of each bulkhead? What happens if you get it wrong?
Strong answer: This is where Little’s Law comes back. For each downstream dependency, estimate: expected request rate to that dependency times average response time equals required concurrency. If Service B receives 500 req/s and responds in 20ms, you need L = 500 times 0.02 = 10 concurrent threads minimum. I would set the bulkhead to 20-30 to handle bursts.If you set it too small, you will throttle healthy traffic to that dependency unnecessarily. Requests will queue or get rejected even though the downstream is perfectly fine — you are the bottleneck, not the dependency. If you set it too large, the bulkhead does not protect you — a slow dependency can still consume enough resources to impact the system.The key is monitoring and adjustment. Expose bulkhead utilization as a metric. If a bulkhead is consistently above 70% utilization, it is too small. If it is consistently below 20%, it is wasting resources. And critically, set a sensible timeout on each bulkhead — if Service B normally responds in 20ms, set a 200ms timeout. Without a timeout, a slow dependency can hold threads in the bulkhead indefinitely, and the bulkhead size becomes irrelevant because threads are stuck, not returned.Follow-up: Netflix famously used Hystrix for bulkheading. What were the limitations that led them to retire it in favor of Resilience4j?
Strong answer: Hystrix used thread-pool isolation exclusively — every call to a dependency was dispatched to a separate thread pool and the calling thread blocked waiting for the result. This provided strong isolation but had two costs: the overhead of a thread context switch per call (~5-15 microseconds, which adds up at hundreds of thousands of calls per second), and the memory consumption of maintaining many thread pools (each pool is dozens of threads at ~1MB stack each).Resilience4j offers both thread-pool isolation and semaphore isolation. Semaphore isolation uses a simple counter to limit concurrency without an extra thread — the calling thread executes the downstream call directly, but only N calls can execute simultaneously. This eliminates the context-switch overhead and is more appropriate for I/O-bound calls (which most HTTP calls are). Thread-pool isolation is still useful when you want to enforce strict timeouts — you can interrupt a thread in a separate pool, but you cannot easily interrupt the calling thread in semaphore mode.Resilience4j is also designed for reactive and async programming models (Project Reactor, RxJava), which Hystrix’s thread-pool model did not integrate well with. And Resilience4j is modular — you can use just the bulkhead, just the circuit breaker, or any combination, while Hystrix was more monolithic.Going Deeper: Beyond thread-pool bulkheads, how would you implement bulkheading at the infrastructure level?
Strong answer: Bulkheading applies at every layer, not just threads. At the infrastructure level, you isolate failure domains: deploy critical and non-critical workloads on separate Kubernetes node pools so that a noisy neighbor (a batch job consuming all CPU) cannot starve the user-facing API. Use separate database connection pools for different workload types — one pool for real-time queries, one for reporting queries, so a heavy report does not exhaust connections for the checkout flow. Run separate Redis clusters for cache and session storage so a cache flush does not impact user sessions. Use separate AWS accounts or VPCs for production and staging so a staging load test cannot impact production network quotas. Even at the DNS level, use separate domains for API and static assets so a DNS issue on the CDN does not take down the API.The principle is always the same: identify which components share a failure domain, and ask whether they should. If the failure of component A should not affect component B, they should not share resources — threads, connections, servers, databases, or network paths.Q4: Your service runs on Kubernetes and autoscales on CPU. During peak traffic, latency doubles but CPU never exceeds 40%. What is happening and how do you fix the autoscaling policy?
Q4: Your service runs on Kubernetes and autoscales on CPU. During peak traffic, latency doubles but CPU never exceeds 40%. What is happening and how do you fix the autoscaling policy?
- Scale on request latency (P95 or P99). This directly measures what users experience. If P95 exceeds your SLA threshold, add pods. The downside is latency is a lagging indicator — by the time P95 is elevated, users are already affected.
- Scale on concurrent connections or active requests. This is a leading indicator. If each pod handles 100 concurrent requests comfortably and you see 90, scale up before latency degrades. In Kubernetes, expose this as a custom metric via Prometheus and configure the HPA to use it.
- Scale on queue depth. If the service consumes from a message queue (SQS, Kafka, RabbitMQ), use KEDA to scale on queue depth. When messages pile up faster than consumers process them, add pods.
- Scale on connection pool wait time. If the bottleneck is the database connection pool, scale when pool wait time exceeds a threshold (say 10ms). This is a very specific and effective metric when the pool is the constraint.
Follow-up: How do you handle the cold start problem — new pods are not ready to handle traffic immediately, and during the scaling lag, latency gets even worse?
Strong answer: This is a multi-layered problem. First, there is the pod startup time: container image pull (10-30 seconds if the image is not cached on the node), application initialization (1-5 seconds for Go, 30-90 seconds for a Spring Boot JVM application with dependency injection and connection pool warm-up), and readiness probe passing (depends on the interval and threshold you set).Solutions in order of impact: (1) Use readiness probes correctly. The pod should not receive traffic until it has established database connections, warmed caches, and is actually ready. SetinitialDelaySeconds appropriately. (2) Pre-pull images. Use a DaemonSet that pulls your application image to every node ahead of time so the cold pull does not add 30 seconds. (3) Use Kubernetes PodDisruptionBudgets and preStop hooks so existing pods drain gracefully while new ones start. (4) For JVM services, use CDS (Class Data Sharing) or GraalVM native image to cut startup from 60 seconds to 2-5 seconds. (5) Use predictive or scheduled scaling — if you know traffic spikes at 8 AM every day, schedule the HPA to pre-scale at 7:45 AM using KEDA’s cron trigger. (6) Maintain a warm pool: keep 1-2 extra pods running above minimum at all times as headroom. The cost of running 2 extra small pods is trivial compared to the cost of degraded user experience during scaling events.Follow-up: Can you describe a situation where autoscaling makes things worse rather than better?
Strong answer: Absolutely — this is a trap I have seen in production. Suppose your bottleneck is the database. The database can handle 5,000 queries per second across all application pods. You have 10 pods, each making 500 queries/s. Latency rises because the database is saturated at 5,000 queries/s. The autoscaler sees elevated latency (or elevated queue depth) and adds 5 more pods. Now 15 pods each make 500 queries/s, pushing 7,500 queries/s at the database. The database is now severely overloaded — query latency triples, connection pool contention spikes, and some queries start timing out. The autoscaler sees worse metrics and adds MORE pods. This is a positive feedback loop that drives the system into the ground.The fix is twofold. First, set amaxReplicas on the HPA so the scaling cannot run away. Second, and more fundamentally, ensure your scaling metric is correlated with a resource you can actually scale. If the bottleneck is the database, adding application pods does not help. You need to address the database — add read replicas, cache hot queries in Redis, optimize the queries, or add PgBouncer to multiplex connections. Only scale the layer that is actually the constraint.Another case: if each new pod opens 10 database connections and your database supports 100 max connections, you can only scale to 10 pods before hitting the connection limit. Scaling to pod 11 causes connection failures across all pods. The autoscaler is not aware of this constraint. This is why monitoring database connection count alongside pod count is critical.Q5: What is tail latency amplification? Explain it with a concrete example and describe two techniques to mitigate it.
Q5: What is tail latency amplification? Explain it with a concrete example and describe two techniques to mitigate it.
Follow-up: How does hedged requests interact with non-idempotent operations? Can you use it for writes?
Strong answer: This is the critical limitation. Hedged requests only work safely for idempotent operations — reads, or writes that produce the same result regardless of how many times they execute. You cannot hedge a “debit 200.For non-idempotent writes, you need a different approach. One option is to make writes idempotent using an idempotency key — the client attaches a unique request ID, and the server checks whether it has already processed that ID before executing. This is common in payment systems (Stripe uses idempotency keys for exactly this reason). With idempotent writes, you can safely hedge.Another approach for writes is to accept higher tail latency on the write path and focus hedging on the read path, which is typically the much higher-volume path. In most systems, the read-to-write ratio is 10:1 or higher, so improving read tail latency delivers the most user-facing impact.Follow-up: If hedged requests increase load on your backends by 5%, what happens during a traffic spike when backends are already stressed?
Strong answer: This is the danger of hedged requests during overload. If backends are already at 90% capacity and you add 5% more load from hedged requests, you might push them past the tipping point where queueing causes latency to spike non-linearly. The hedged requests — designed to reduce tail latency — actually increase it by adding load to already-stressed backends.The mitigation is adaptive hedging. During normal operation, hedge freely. When backend utilization or latency exceeds a threshold, disable or reduce hedging. Google’s systems do this — they monitor backend queue depth and only send hedged requests when the backend has spare capacity. You can implement this with a simple circuit: if the P50 of the backend exceeds 2x its normal value, stop hedging until it recovers. This way hedging helps when things are healthy and gets out of the way when things are stressed.Q6: Walk me through how you would implement a performance budget for a microservices-based checkout flow that touches 5 services.
Q6: Walk me through how you would implement a performance budget for a microservices-based checkout flow that touches 5 services.
| Component | Budget (P95) | Rationale |
|---|---|---|
| API Gateway + auth | 10ms | Fixed overhead, well-optimized |
| Cart Service | 30ms | Single Redis lookup + validation |
| Pricing Service | 50ms | May hit a rules engine, cache-assisted |
| Inventory Service | 40ms | Database read, indexed by SKU |
| Payment Service | 150ms | External payment provider (Stripe, Adyen) — highest variance |
| Order Service | 60ms | Database write + event publish |
| Network overhead (inter-service) | 30ms | ~5ms per hop, 6 hops |
| Serialization overhead | 15ms | JSON/Protobuf across all hops |
| Headroom (20%) | 85ms | Buffer for GC, spikes, unknown unknowns |
| Total | 470ms | Under 500ms SLA with margin |
Follow-up: The payment provider’s latency varies wildly — sometimes 50ms, sometimes 800ms. How do you handle a component that does not respect its budget?
Strong answer: External dependencies are the hardest part of performance budgets because you do not control them. Three strategies:First, set an aggressive timeout on the payment call — say 300ms. If the provider does not respond in time, retry once with a shorter timeout (200ms) against a different endpoint or region if available, then fail the request with a clear error. This protects your SLA at the cost of some failed checkouts. Track the timeout rate — if it exceeds 2%, escalate with the provider.Second, use a circuit breaker on the payment call. If the error rate (including timeouts) exceeds a threshold, the circuit opens and immediately fails payment attempts for a cooldown period. This prevents a slow provider from dragging down your entire checkout flow.Third, explore architectural alternatives. Can you do an optimistic checkout — accept the order, show confirmation to the user, and process payment asynchronously? If payment fails, notify the user and hold the order. This decouples the user experience from payment latency entirely. The trade-off is complexity (handling failed payments after the user thinks the order is confirmed) but it eliminates payment provider latency from the critical path.Follow-up: How do you get 5 different teams to care about their individual performance budgets? This is as much an organizational problem as a technical one.
Strong answer: You are right — the technical solution is the easy part. Making budgets work organizationally requires three things.First, make budgets visible. A shared dashboard where every team can see their service’s P95 against their budget, updated in real-time, creates social accountability. Nobody wants to be the red bar on the dashboard.Second, tie budgets to deployment gates. In the CI/CD pipeline, run a load test against staging. If the service’s P95 exceeds its budget, the deploy is blocked. This makes it impossible to ignore a budget violation because it literally prevents shipping.Third, and most importantly, frame budgets as a contract, not a punishment. The budget gives each team a clear target and autonomy within that target. They can use whatever technology, architecture, or optimization they want as long as they stay within budget. It also gives them leverage: if a new product requirement would blow their budget, the performance budget is the data that justifies pushing back or requesting architectural changes. Without a budget, the conversation is “your service is slow” which is vague and confrontational. With a budget, the conversation is “this change would push our P95 from 45ms to 120ms against a 50ms budget — we need to discuss options” which is specific and constructive.Q7: Explain how consistent hashing works and why it is better than simple modular hashing for distributed caches. What happens when a node fails?
Q7: Explain how consistent hashing works and why it is better than simple modular hashing for distributed caches. What happens when a node fails?
shard = hash(key) % N — the key’s location depends on the total number of nodes N. When N changes (you add or remove a node), almost every key’s shard assignment changes. If you go from 10 to 11 nodes, roughly 90% of keys now hash to a different node. In a distributed cache, that means 90% of your cached data is now on the “wrong” node — effectively a mass cache miss event. At high traffic, this creates a thundering herd to the database as thousands of requests simultaneously fall through an empty cache.Consistent hashing solves this by arranging all nodes on a virtual ring (hash space from 0 to 2^32, conceptually). Each node is placed on the ring at a position determined by hashing its identifier. Each key is also hashed onto the ring, and it is assigned to the first node found clockwise from the key’s position.When a node is added, it takes ownership of the keys between its position and the previous node on the ring. Only those keys — roughly 1/N of the total — need to move. When a node is removed, its keys move to the next node clockwise. Again, only 1/N of keys are affected.So going from 10 to 11 nodes, approximately 9% of keys move instead of 90%. That is a 10x reduction in cache invalidation, which at scale is the difference between a smooth operation and a cascading outage.The virtual nodes refinement. In basic consistent hashing, if you have 3 physical nodes, each gets one position on the ring. The distribution is often uneven — one node might end up responsible for 50% of the ring by bad luck. Virtual nodes fix this: each physical node gets, say, 150 positions on the ring. This smooths the distribution dramatically, giving each physical node roughly 1/N of the key space regardless of where the positions fall. DynamoDB uses virtual nodes with this approach.Follow-up: If a node fails in consistent hashing, the next node clockwise absorbs all of its keys. Does not that create a hot spot?
Strong answer: Yes, and this is a subtle but important point. If Node B fails and all of its keys move to Node C, Node C now handles its own keys plus Node B’s — roughly double its load. If Node C was already at 60% capacity, it could be overwhelmed at 120%, causing it to fail too. And then Node D absorbs both Node C’s and the inherited Node B keys. This is a cascade failure triggered by consistent hashing.The mitigation is replication. In systems like Cassandra and DynamoDB, each key is stored on N replicas (typically 3) — the primary node plus the next N-1 nodes clockwise on the ring. When the primary fails, the replicas already have the data. The read load redistributes across the remaining replicas rather than concentrating on a single successor.Another mitigation is the virtual nodes themselves. With 150 virtual nodes per physical node, the keys from a failed physical node scatter across many different physical successors rather than all landing on one. This distributes the extra load across the cluster instead of concentrating it.Follow-up: When would you NOT use consistent hashing and prefer simple range-based partitioning instead?
Strong answer: Consistent hashing optimizes for even distribution and minimal reshuffling when nodes change. But it destroys key ordering — keys that are numerically adjacent might end up on completely different nodes. If your access pattern relies heavily on range queries — “give me all records with keys between 1000 and 2000” — consistent hashing is a poor fit because that range could span every node in the cluster, requiring a scatter-gather across all of them.Range-based partitioning preserves ordering. Shard 1 handles keys 0-1000, Shard 2 handles 1001-2000, and so on. A range query for keys 1000-2000 hits at most 2 shards. This is why HBase and the original Bigtable use range-based partitioning — their primary use case is sequential scans over sorted key ranges (think time-series data where you scan “all events from 2:00 to 3:00 PM”).The trade-off is that range-based partitioning creates hot spots when writes cluster at one end of the key space (all new records have the highest keys, all landing on the last shard). You mitigate this with salting the key prefix or using a compound key. The choice between consistent hashing and range partitioning depends on whether your workload is dominated by point lookups (consistent hashing wins) or range scans (range partitioning wins).Q8: You have a Node.js API that processes image uploads. Under load, all requests become slow -- not just the image uploads. Explain why and propose a solution.
Q8: You have a Node.js API that processes image uploads. Under load, all requests become slow -- not just the image uploads. Explain why and propose a solution.
-
Worker threads (Node.js
worker_threadsmodule). Spawn a pool of worker threads (typically one per CPU core minus one for the main thread). Image processing jobs are dispatched to workers. The main event loop stays responsive. This is the simplest and most direct solution. - Separate service. Extract image processing into a dedicated service. The API publishes an “image uploaded” event to a queue (SQS, BullMQ). A separate image processing service — which could be in Go, Rust, or Python with C bindings for maximum performance — consumes the queue and processes images. The API returns immediately with a 202 Accepted. This is the cleanest architecture and allows the image processing to scale independently.
- Use a native library. Replace ImageMagick (which is notoriously slow and CPU-hungry) with sharp (which wraps libvips). sharp is typically 4-5x faster for the same operations, meaning the event loop is blocked for 50-400ms instead of 200-2000ms. Faster is still blocking, but it reduces the impact.
Follow-up: How would you detect event loop blocking in production before users complain?
Strong answer: There are three mechanisms. First, measure event loop lag. ThemonitorEventLoopDelay API (built into Node.js since v12) samples how long the event loop takes to cycle. Normally it is under 1ms. If it spikes to 50-100ms, something is blocking. Libraries like clinic doctor (from NearForm) visualize this beautifully. Expose the event loop lag as a Prometheus metric and alert when it exceeds 20ms.Second, use an event loop latency heartbeat. Set a setInterval that records the time between invocations. If setInterval(fn, 100) actually fires every 350ms, the event loop is being blocked for ~250ms per cycle. This is a quick-and-dirty approach but effective.Third, use --prof or clinic flame to generate a flame graph under load. The widest bars on the flame graph show where CPU time is concentrated. If you see a massive bar on an image processing library, that is your culprit.Follow-up: Would switching to Go or Rust solve this problem, or would it manifest differently?
Strong answer: In Go, the same image processing would use CPU across all cores via goroutines, and the Go scheduler would preempt a goroutine that has been running too long (at function call boundaries in older Go, and with async preemption since Go 1.14). So a CPU-intensive goroutine would not block all other goroutines the way a CPU-intensive operation blocks the Node.js event loop. Other HTTP handlers would continue to be scheduled across available cores. The problem would not disappear — you would still see elevated latency under heavy image processing load because CPU is finite — but it would degrade gracefully across all requests rather than catastrophically blocking everything.In Rust, you would typically use Tokio (an async runtime) for I/O and spawn CPU work onto a separate thread pool usingtokio::task::spawn_blocking. The Tokio docs explicitly warn against doing CPU-intensive work on the async runtime for exactly the same reason as Node.js — it blocks the executor and starves other tasks. The pattern is the same: keep I/O on the async runtime, offload CPU work to a dedicated pool.The lesson is runtime-agnostic: CPU-bound and I/O-bound work should not share the same execution context, regardless of language. The failure mode varies (event loop blocking in Node.js, executor starvation in Tokio, GIL contention in Python), but the principle is universal.Q9: A senior engineer on your team proposes sharding the PostgreSQL database. The database is currently on a db.r6g.4xlarge (16 vCPUs, 128GB RAM) and handling 8,000 queries/second. Should you proceed? What questions do you ask first?
Q9: A senior engineer on your team proposes sharding the PostgreSQL database. The database is currently on a db.r6g.4xlarge (16 vCPUs, 128GB RAM) and handling 8,000 queries/second. Should you proceed? What questions do you ask first?
- What is the actual symptom? Is it query latency? Connection saturation? Storage capacity? CPU utilization on the database? Each symptom has cheaper solutions than sharding.
-
What does the query profile look like? Run
pg_stat_statementsto identify the top 10 queries by total time. In my experience, 80% of database load comes from 5-10 queries. Optimizing those (adding indexes, rewriting joins, adding caching) can buy a 5-10x improvement with days of work, not months. - Have we explored read replicas? If the workload is read-heavy (most web apps are), adding a read replica takes an afternoon. Route read queries to the replica, keep writes on the primary. This effectively doubles read capacity with near-zero application changes (most ORMs support read/write splitting with a configuration change).
-
Is the working set fitting in memory? Check
shared_buffersusage and cache hit ratio (SELECT sum(heap_blks_hit) / (sum(heap_blks_hit) + sum(heap_blks_read)) FROM pg_statio_user_tables). If the hit ratio is below 99%, increasingshared_buffersor upgrading to a machine with more RAM might solve the problem cheaper than sharding. -
Is vacuum keeping up? Check
pg_stat_user_tablesfor tables with highn_dead_tup. If autovacuum is behind, table bloat causes sequential scans on tables that should use index scans. Tuningautovacuum_vacuum_scale_factorandautovacuum_vacuum_cost_delaycan recover significant performance. - Have we considered partitioning? PostgreSQL native declarative partitioning splits a table into child tables (e.g., by month) without any application changes. Partition pruning means queries only scan relevant partitions. This gives you most of the benefits of sharding for time-series data without the distributed systems complexity.
- What is the growth trajectory? If we are at 8,000 qps and growing 20% per month, we have ~4 months before doubling. If we are growing 5% per year, sharding is a premature optimization that the company may never need.
Follow-up: Suppose you have done all of that and sharding is genuinely needed. How do you choose between application-level sharding, using a proxy like Vitess, or switching to a distributed database like CockroachDB?
Strong answer: Three approaches, three trade-off profiles.Application-level sharding means your application code contains the shard routing logic. You maintain a mapping of “this tenant goes to this shard” and your data access layer routes queries accordingly. Advantages: full control, no external dependencies, you can optimize for your specific access patterns. Disadvantages: every developer must be aware of sharding (queries that accidentally skip the shard key do a scatter-gather), migrations and rebalancing are DIY, and every new application or microservice needs the same routing logic.Vitess (the proxy approach) sits between your application and MySQL. Your application thinks it is talking to a single database; Vitess handles shard routing, connection pooling, and even online schema changes. Advantages: minimal application changes, battle-tested at YouTube/Slack/GitHub scale, handles shard rebalancing. Disadvantages: another piece of infrastructure to operate (though it is very mature), it is MySQL-only, and complex queries that span shards may behave unexpectedly.CockroachDB (or Spanner, TiDB, YugabyteDB) handles sharding automatically. You write SQL as if it were a single database, and the system distributes data and routes queries internally. Advantages: simplest application-level experience, supports distributed transactions natively, automatic rebalancing. Disadvantages: higher per-query latency (distributed consensus adds ~5-15ms per write), less mature ecosystem than PostgreSQL, and operational expertise is harder to hire for.My recommendation depends on the team and the workload. If the team has deep PostgreSQL expertise and the access patterns are well-defined (every query scopes to a tenant), application-level sharding gives maximum control with manageable complexity. If the team needs to move fast and is on MySQL, Vitess is the best balance of control and abstraction. If the team values developer simplicity above all and can accept slightly higher write latency, CockroachDB or YugabyteDB gives the most transparent experience.Going Deeper: How do you handle database migrations and schema changes across a sharded database? This is often where sharding pain really lives.
Strong answer: This is the hidden cost of sharding that most people do not appreciate until they are living it. With a single database, a schema migration is oneALTER TABLE command. With 64 shards, it is 64 migrations that must be coordinated.For small, non-locking changes (adding a nullable column, creating an index concurrently), you can run the migration against all shards in parallel using a tool like pt-online-schema-change (Percona, for MySQL), pg_repack (PostgreSQL), or Vitess’s built-in online DDL (which handles this transparently).For large, locking changes (changing a column type, adding a NOT NULL constraint with a default value), the risk is that the migration locks the table and blocks queries. On a single database, you take a brief outage or use an online migration tool. On 64 shards, you run the migration shard-by-shard in a rolling fashion — migrate shard 1, verify it is healthy, migrate shard 2, and so on. This takes much longer but eliminates the risk of a global outage.The hardest case is when the shard key itself needs to change — for example, you initially sharded by user_id but now need to re-shard by tenant_id + user_id because the access patterns evolved. This is essentially a full data migration: create new shards, dual-write to old and new shards, backfill historical data, verify consistency, cut over reads to new shards, stop writing to old shards. This can take weeks to months. This is why choosing the right shard key upfront is so critical — it is the one decision that is hardest to change later.Q10: Compare async processing patterns: fire-and-forget, request-acknowledge-poll, and event-driven. When would you choose each, and what are the failure modes?
Q10: Compare async processing patterns: fire-and-forget, request-acknowledge-poll, and event-driven. When would you choose each, and what are the failure modes?
202 Accepted with a status URL (/jobs/{id}), and enqueues the work. The client polls the status URL (or subscribes via WebSocket/SSE for push-based updates) until the job completes. The worker processes the job and updates the status from pending to processing to completed or failed.Best for: long-running operations where the user expects a result but not instantly — PDF generation, video transcoding, data exports, report generation, large file processing. Stripe uses this pattern for payment intents that require additional authentication steps. The failure mode is a stuck job: the worker crashes mid-processing, the status stays on “processing” forever. The fix is a reaper — a background process that identifies jobs stuck in “processing” longer than a threshold (say 5 minutes) and either retries them or marks them as failed.Event-driven (pub/sub). The API publishes a domain event (e.g., OrderCreated) to a topic. Multiple independent consumers subscribe and react. The email service sends a confirmation, the inventory service adjusts stock, the analytics service records the conversion, the shipping service creates a label. Each consumer is decoupled — adding a new consumer requires zero changes to the publisher.Best for: systems where a single action triggers multiple downstream effects owned by different teams. Microservices architectures use this heavily. The failure modes are more nuanced. First, consumer failure: if the inventory service is down, it misses the event. You need durable subscriptions (Kafka consumer groups, SQS subscriptions) so events wait until the consumer recovers. Second, ordering: events may arrive out of order (especially in multi-partition Kafka topics). If OrderCreated arrives after OrderShipped, your consumer must handle it. Third, the hardest failure mode is eventual consistency: the order exists in the order service database, the event is published, but the inventory service has not processed it yet. During that window, the system is inconsistent. If someone queries inventory, they see the old value. You need to decide whether this window is acceptable for each consumer.How I choose: I ask two questions. Does the user need to know the result? If no, fire-and-forget. If yes, does the user need it in this HTTP response? If yes, make it synchronous. If not, request-acknowledge-poll. Are there multiple independent downstream effects from one action? If yes, event-driven. If the action only triggers one downstream process, event-driven adds unnecessary complexity — just use a direct queue.Follow-up: How do you guarantee that a database write and a queue publish happen atomically? What if the database write succeeds but the queue publish fails?
Strong answer: This is the dual-write problem, and it is one of the trickiest challenges in event-driven systems. If you write to the database and then publish to the queue, and the publish fails (network error, queue down), you have an order in the database but no event — the downstream consumers never learn about it. If you reverse the order (publish first, then write), a database failure means you published an event for an order that does not exist.The cleanest solution is the transactional outbox pattern. Instead of publishing directly to the queue, you write the event to an “outbox” table in the same database transaction as the business data. The order INSERT and the outbox INSERT are in the same transaction — they either both succeed or both fail. A separate process (a “relay” or “publisher”) reads the outbox table and publishes events to the queue. Once the event is successfully published, it marks the outbox row as processed.This guarantees at-least-once delivery: if the relay crashes after reading the event but before marking it processed, it will publish the event again on restart. Consumers must be idempotent to handle duplicates.Debezium is a popular implementation of this pattern — it reads the database’s write-ahead log (WAL in PostgreSQL, binlog in MySQL) and publishes change events to Kafka. This is even better than polling the outbox table because it captures changes in real-time with no polling overhead.Follow-up: In an event-driven system, how do you debug a flow where an order was placed but the confirmation email was never sent?
Strong answer: This is the observability challenge of event-driven systems. With synchronous calls, you have a single distributed trace that shows every hop. With events, the chain is broken — the API trace ends at “event published” and the email service trace starts at “event consumed” and they are not linked by default.The fix is correlation IDs. Every event carries acorrelation_id (typically the original request ID) that propagates through every consumer. When the email service processes the event, it logs with the same correlation ID. To debug the missing email, I search for the correlation ID across all services.Then I work backwards: (1) Was the event published? Check the outbox table or the Kafka topic — search by order ID. If the event is there, the API did its job. (2) Was the event consumed by the email service? Check the email service’s consumer logs for that correlation ID. If there is no log entry, the consumer either crashed, was down, or the message is stuck in a dead-letter queue. Check the DLQ. (3) Was the email sent? Check the email provider’s API logs (SendGrid, SES). If the event was consumed but the email was not sent, the email service had an internal error processing it. Check its error logs.This is why investing in distributed tracing infrastructure for event-driven systems is even more important than for synchronous systems. OpenTelemetry supports context propagation across message queues, which automatically links the producer trace to the consumer trace. Without it, debugging becomes a manual correlation-ID hunt across multiple log systems.Q11: Explain the difference between horizontal scaling and vertical scaling. Then tell me about a situation where horizontal scaling makes things worse.
Q11: Explain the difference between horizontal scaling and vertical scaling. Then tell me about a situation where horizontal scaling makes things worse.
Follow-up: Given those pitfalls, how do you decide the right time to move from vertical to horizontal scaling?
Strong answer: I use three trigger criteria. First, the vertical ceiling: when the next machine size up either does not exist, costs more than 3x per unit of performance, or has availability issues (the largest instance types are not always available in every AZ). Second, high availability requirements: if the business cannot tolerate the service being down for the time it takes to restart a single machine (typically 1-5 minutes), you need at least two instances, which means horizontal. Third, independent scaling requirements: if different parts of the system have different resource profiles (image processing is CPU-bound, the API is I/O-bound), they should be separate services that scale independently.In practice, most startups should run on a single well-sized machine for longer than they think. A db.r6g.2xlarge (8 vCPU, 64GB RAM) handles far more load than most early-stage products generate. The engineering time spent setting up and operating a horizontally-scaled distributed system is time not spent building product features. Premature horizontal scaling is one of the most common forms of over-engineering I see.Going Deeper: In a cloud environment, is vertical scaling truly limited, or has the cloud changed the calculus?
Strong answer: The cloud has made vertical scaling more viable than it was in the on-premise era, but the limits are different, not absent. AWS offers instances up to u-24tb1.metal (448 vCPUs, 24 TB RAM), which is more compute than most companies will ever need in a single machine. You can scale vertically with near-zero downtime using AWS’s instance type change (stop, change type, start — ~2-5 minutes of downtime, or zero downtime with a blue-green approach using Route53 failover).But the cloud introduces new vertical scaling constraints. First, cost non-linearity: an r6g.8xlarge costs more than 2x an r6g.4xlarge for 2x the resources, because of the memory-to-compute premium. Second, blast radius: a single large instance going down takes all traffic with it. Even with a multi-AZ standby (RDS Multi-AZ), failover takes 60-120 seconds. Third, noisy neighbor effects: larger instances on shared hardware can experience more variable performance (though dedicated hosts eliminate this at a cost).The modern approach is to stay vertical as long as it is cheap and simple, but design for horizontal from the start. This means stateless applications, externalized sessions, no local file storage. That way, the move from one big machine to multiple smaller ones is a deployment configuration change, not an architecture rewrite.Q12: You are on-call and get paged at 2 AM: memory usage on your production service is climbing steadily at 50MB per hour and is now at 85% of the container limit. Walk me through your response.
Q12: You are on-call and get paged at 2 AM: memory usage on your production service is climbing steadily at 50MB per hour and is now at 85% of the container limit. Walk me through your response.
-
A heap snapshot. In Node.js:
process.memoryUsage()plus a heap snapshot viav8.writeHeapSnapshot()or by sendingSIGUSR2if usingnode --heapsnapshot-signal=SIGUSR2. In JVM:jmap -dump:live,format=b,file=heap.hprof <pid>. In Go: hit the/debug/pprof/heapendpoint. In Python:tracemallocif enabled, orguppy3for heap analysis. -
The process’s memory map:
cat /proc/<pid>/smaps_rollupon Linux to see RSS, shared memory, and anonymous memory breakdown. - Current metrics: connection count, active request count, goroutine count (Go), thread count (JVM), event listener count (Node.js). Any of these growing monotonically alongside memory is a strong signal.
-
Event listeners not removed (Node.js). If
EventEmitterobjects are growing, some code is adding listeners on every request without removing them. Classic: adding a listener to a database connection or a WebSocket inside a request handler. - Unbounded caches. A Map or object used as a cache without a max size or eviction policy. Every unique request key adds an entry that is never removed. Fix: use an LRU cache with a max size (lru-cache in Node.js, Guava Cache in JVM, groupcache in Go).
- Closures holding references. A callback or promise chain captures a reference to a large object (e.g., the full HTTP request body). Even after the request is complete, the closure keeps the object alive. This is subtle and hard to spot without a heap snapshot diffing tool.
- Connection leak. Database or HTTP connections are opened but never returned to the pool. Each leaked connection consumes memory on both the application and the database side. Check the pool’s “active” count — if it grows monotonically, connections are not being released.
- String accumulation (JVM/.NET). String interning or string concatenation in a loop creating many intermediate String objects that are not GC-eligible because something holds a reference.
Follow-up: How do you distinguish between a genuine memory leak and normal memory growth (like a cache filling up)?
Strong answer: A genuine leak is memory that is allocated, no longer reachable by any active code path, but not freed — in a garbage-collected language, this means objects that are technically still referenced (so the GC cannot collect them) but will never be used again. A growing cache is allocated, referenced, AND actively used — it is not a leak, it is a design choice.The diagnostic difference: after a full GC cycle, a leak shows memory that does not decrease. A cache shows memory that is reclaimable if forced (you could evict entries). In JVM, trigger a full GC (jcmd <pid> GC.run) and check if RSS drops. If it does not, the retained objects are genuinely leaked. In Go, the pprof inuse_space profile shows currently live allocations — if this grows monotonically across GC cycles, you have a leak.For caches, the fix is bounded size with eviction. For leaks, the fix is finding and breaking the reference that keeps the dead object alive.Follow-up: In a containerized environment (Kubernetes), what is the difference between RSS, container memory usage, and the OOM kill threshold? Why does this matter for your diagnosis?
Strong answer: This catches people all the time. RSS (Resident Set Size) is the physical memory your process is actually using. But the container’s memory usage as reported by the cgroup (and what Kubernetes uses for its memory limit) includes RSS PLUS file-system cache (page cache) used by your process. A process reading a lot of files can have a low RSS but a high cgroup memory usage because the kernel caches file pages in that container’s memory accounting.This means your memory monitoring dashboard might show “85% memory usage” but your process’s actual heap is only at 50%. The rest is kernel page cache, which is evictable — the kernel will release it under pressure. So the OOM danger is lower than it appears.However, in Kubernetes, the OOM killer triggers on the cgroup memory limit, which includes page cache. If the container’s memory limit is 4GB and RSS (2GB) + page cache (2GB) hits 4GB, the OOM killer fires even though your application’s heap is healthy and the page cache is reclaimable. The fix is either increasing the memory limit to account for page cache, or settingmemory.high (a soft limit that triggers reclaim before OOM) if your kernel version supports cgroup v2.This is why I always look at RSS specifically, not just total container memory, when diagnosing leaks. kubectl top pod shows container-level memory. /proc/<pid>/status shows VmRSS for the actual process. They can tell very different stories.Advanced Interview Scenarios
These questions are designed to expose the gap between “I read about this” and “I lived through this.” Each one contains a trap — a place where the obvious answer is wrong, where conventional wisdom breaks down, or where the real problem is organizational rather than technical. These are the questions that staff-level interviewers use to separate builders from reciters.Scenario: You add Redis caching to your slowest endpoint. Response times improve by 10x for 24 hours, then the endpoint becomes SLOWER than it was without caching. What happened?
Scenario: You add Redis caching to your slowest endpoint. Response times improve by 10x for 24 hours, then the endpoint becomes SLOWER than it was without caching. What happened?
The Trap
The obvious answer is “the cache expired” or “cache hit ratio dropped.” But the real scenario is more insidious: the cache made things worse because the system adapted to the cache’s presence, and when the cache failed to absorb load as expected, the uncovered database was in a worse position than before caching existed.What weak candidates say: “The cache TTL expired and requests hit the database.” This is surface-level. It does not explain why things are worse than before — before caching, the database handled this load fine.What strong candidates say:The way I think about this is through the lens of hidden capacity coupling. When you add a cache, two things happen that most people do not anticipate.-
The database lost its warm buffer pool for that query pattern. Before caching, the database served this query thousands of times per hour. PostgreSQL’s
shared_buffersand the OS page cache kept the relevant data pages hot in RAM. The B-tree index nodes for this query were always in the buffer pool. After 24 hours of caching absorbing 99% of reads, the database has evicted those pages to make room for other workloads. Now when cache misses start hitting the database, every query is a cold read — hitting disk instead of RAM. A query that took 5ms before caching now takes 50-200ms because the buffer pool is cold for this access pattern. - Traffic grew because the endpoint got faster. This is the Jevons paradox of performance. The product team saw the faster endpoint and enabled a feature that calls it 3x more frequently. Or users started using the feature more because it was responsive. The pre-cache traffic was 1,000 req/s. Post-cache traffic grew to 3,000 req/s. The cache was absorbing 2,970 req/s, the database handled the remaining 30 req/s easily. When the cache degrades (Redis failover, memory pressure eviction, key pattern change from a deploy), those 3,000 req/s hit a database that was never sized for that load — it was sized for the original 1,000 req/s, and even that working set is now cold.
- Cache stampede. If a popular cache key expires and 500 concurrent requests all see the cache miss simultaneously, they all query the database in parallel for the same data. The database is hit with 500 identical expensive queries instead of 1. This is a thundering herd at the cache layer. At scale, with thousands of keys expiring near the same time (common with fixed TTLs set at the same moment), this becomes a periodic stampede that is worse than no caching.
- Stale-while-revalidate. Serve the stale cached value immediately while refreshing in the background. The user gets a fast (slightly stale) response, and only one request triggers the refresh. Cloudflare, Fastly, and most CDNs support this natively with
Cache-Control: stale-while-revalidate=60. - Cache warming on startup. When Redis restarts or a new cache node joins, pre-populate it with the top 1,000 keys by query frequency before routing traffic to it.
- Probabilistic early expiration (PER). Each request checks if the key is “close to expiring” (say within 20% of TTL) and refreshes it with a probability proportional to how close it is to expiration. This spreads refresh load over time instead of concentrating it at expiry. The XFetch algorithm formalizes this.
- Circuit breaker on the database. If cache miss rate exceeds a threshold (say 10%), trip a circuit breaker that returns degraded responses (stale data, partial results, or HTTP 503) rather than letting the database be overwhelmed.
Follow-up: How would you design a caching layer that survives a complete Redis failure without overwhelming the database?
Follow-up: Explain the difference between cache-aside, write-through, and write-behind. When does write-behind lose data?
Follow-up: Your cache hit ratio is 99.5% but your P99 latency is still bad. How is that possible?
Scenario: You deploy a new version of Service A. Within 10 minutes, Service D -- three hops away in the call chain -- starts returning 500s. Service A's metrics look fine. Walk me through the investigation.
Scenario: You deploy a new version of Service A. Within 10 minutes, Service D -- three hops away in the call chain -- starts returning 500s. Service A's metrics look fine. Walk me through the investigation.
The Cascading Failure
This is a cascade that crosses team boundaries. The trap is focusing only on Service D or only on Service A. The real problem is in the interaction between services B and C, triggered by a subtle behavioral change in A.What weak candidates say: “I’d check Service D’s logs for errors.” This starts at the wrong end of the causal chain. Or: “I’d roll back Service A since it’s the most recent change.” This might work but teaches you nothing about why it happened or how to prevent it.What strong candidates say:- Step 1: Establish the timeline. I pull up a service dependency map (Datadog Service Map, Grafana Tempo’s service graph, or Kiali if we are on Istio). The call chain is A -> B -> C -> D. I overlay deploy events on the latency/error graphs for all four services. Service A deployed at 14:32. Service B’s latency started climbing at 14:35. Service C’s error rate spiked at 14:38. Service D’s 500s began at 14:40. The cascade propagated downstream over 8 minutes — this delay rules out a direct code dependency and suggests a resource exhaustion cascade.
- Step 2: Examine what changed in Service A’s behavior, not just its health. Service A’s dashboard shows green — low error rate, normal latency. But I check its outbound call pattern. The new deploy changed a retry policy from 2 retries with 500ms backoff to 5 retries with 100ms backoff. Individually, each request from A to B looks fine. But at 10,000 req/s, the old policy generated ~200 retries/s (1% error rate times 2 retries). The new policy generates ~500 retries/s (same 1% error rate, but 5 retries with aggressive backoff means each failure generates 5x the downstream load). Service B now receives 10,500 req/s instead of 10,200 — a 3% increase that pushes it past its capacity threshold.
- Step 3: Trace the resource exhaustion. Service B at 10,500 req/s starts queuing. Its P99 latency rises from 30ms to 200ms. Service B’s responses to Service C now take longer, and some timeout. Service C has a connection pool of 50 connections to Service B, each now held for 200ms instead of 30ms. By Little’s Law, C needs 50 connections for the old latency but now needs ~330. The pool is exhausted. Service C’s requests to B start timing out at the pool wait layer. Service C’s error rate to D spikes because C cannot complete its own processing. Service D receives either malformed requests or no requests at all from C, and returns 500s.
-
Step 4: The fix is not where the symptom is. Rolling back Service A’s retry policy immediately resolves the cascade. But the deeper fix involves three things: (1) Retry budgets — instead of configuring retries per-client, set a cluster-wide retry budget: “total retries must not exceed 10% of baseline traffic.” Google’s SRE book recommends this. (2) Circuit breakers on B, C, and D so that each service fails fast instead of cascading timeout pressure downstream. (3) Load shedding in Service B — when queue depth exceeds a threshold, reject new requests with HTTP 503 and
Retry-Afterheader rather than accepting them into an increasingly deep queue.
axios with default retries to a custom wrapper with exponential backoff but no jitter. Under normal conditions, it was fine. During a brief network hiccup that caused 5% packet loss for 30 seconds, every instance retried at the same exponential intervals. The retries synchronized — 200 instances all retried at T+1s, T+2s, T+4s, T+8s. This created periodic request spikes at exactly 2x normal load every power-of-two seconds. The downstream payment service could not absorb the synchronized bursts, its connection pool drained, and we had a 12-minute outage on checkout. The fix was adding jitter (delay * (0.5 + random() * 0.5)) to all retry intervals and implementing a global retry budget of 15% above baseline.Follow-up: What is the difference between retries with jitter and a retry budget? When do you need both?
Follow-up: How would you implement a circuit breaker that accounts for slow responses (not just errors)? A service returning 200 OK in 10 seconds is worse than a clean 503.
Follow-up: In a service mesh (Istio/Linkerd), how does the mesh change your approach to retry and circuit breaker configuration?
Scenario: Your team runs a load test at 2x production traffic. Everything passes. Two weeks later, production hits 1.5x normal traffic and the system falls over. How is this possible?
Scenario: Your team runs a load test at 2x production traffic. Everything passes. Two weeks later, production hits 1.5x normal traffic and the system falls over. How is this possible?
When Load Tests Lie
The trap is assuming that a load test that passes at 2x traffic proves the system can handle 2x traffic. Load tests can pass for reasons that do not apply to production, and fail to test conditions that only exist in production.What weak candidates say: “Maybe the production traffic was different from the test traffic.” This is vaguely correct but lacks specificity. Or: “The load test environment was more powerful than production.” Possible but unlikely if you are testing properly.What strong candidates say:There are at least six ways a load test at 2x can pass while production at 1.5x fails, and I have personally encountered three of them:- The load test used synthetic data that was uniformly distributed. Production data is skewed. A load test that generates random user IDs distributes queries evenly across database shards and index pages. In production, 5% of users generate 40% of the traffic. Those hot users trigger hot partitions, cache contention on the same keys, and lock contention on the same rows. The database handles 100K uniformly distributed queries easily but buckles under 75K queries when 30K of them hit the same partition. At one company, our load test used a pool of 10,000 test user IDs. Production had 2 million users, but 200 of them were enterprise accounts with 50,000+ records each. Every query for those accounts bypassed the index’s fast path and hit a sequential scan. The load test never exercised that code path.
- The load test ran for 30 minutes. The production failure happened after 6 hours. Short load tests miss slow-burn issues: memory leaks that accumulate over hours, connection pool leaks where connections are borrowed but not returned under specific error conditions, GC heap growth that only triggers a full GC after hours of gradual accumulation, and log files filling a disk. Our Celery workers had a memory leak that grew at 2MB per hour per worker. In a 30-minute load test, that is 1MB — invisible. Over a 12-hour production day, that is 24MB per worker times 50 workers = 1.2GB of leaked memory, enough to trigger OOM kills.
- The load test hammered a single endpoint. Production traffic is a mix. Load tests often focus on the “main” endpoint at the expected ratio. But production has long-tail endpoints that share the same connection pool and thread pool. A rarely-used admin endpoint that does a full table scan was not in the load test. During the 1.5x spike, the admin endpoint’s traffic also increased, and its expensive queries competed with the main endpoint for database connections.
- The load test started with a warm cache. Production had a cache cold-start event. If you run the load test against a system that has been serving traffic for hours (warm cache, warm DB buffer pool, warm JIT compiler), the test benefits from conditions that do not exist after a deploy, a cache flush, or a Redis failover.
- The staging database was a recent snapshot but lacked production’s data volume. Staging had 10M rows; production had 500M. The query plan was different at production scale because PostgreSQL’s planner uses table statistics to choose between index scan and sequential scan. At 10M rows, the planner chose an index scan. At 500M rows with slightly stale statistics, it chose a sequential scan.
- Network conditions differed. The load test ran from the same AWS region as the service. Production clients come from everywhere, with higher latency and more packet loss. Higher client-to-server latency means connections are held open longer, which means more concurrent connections, which means more connection pool pressure for the same throughput.
Follow-up: How would you build a load test that actually catches these problems? What does a production-representative load test look like?
Follow-up: What is the role of chaos engineering here? How would you combine load testing with fault injection?
Follow-up: Your load test uses production traffic replay. What are the risks of replaying real traffic in a staging environment?
Scenario: You need to migrate a PostgreSQL table with 2 billion rows from one schema to another with zero downtime. The table serves 5,000 reads/second and 500 writes/second. How?
Scenario: You need to migrate a PostgreSQL table with 2 billion rows from one schema to another with zero downtime. The table serves 5,000 reads/second and 500 writes/second. How?
The Zero-Downtime Migration
This is a question where every shortcut has a consequence. The trap is proposingALTER TABLE or a simple migration script that locks the table.What weak candidates say:
“I’d run an ALTER TABLE during a maintenance window.” A maintenance window for a table serving 5,500 operations/second means downtime. At 500 writes/second, even a 2-minute window loses 60,000 writes. Or: “I’d create the new table and swap them.” This glosses over the hardest part — what happens to the writes during the swap.What strong candidates say:This is a multi-phase online migration. I have done this pattern three times and the key insight is: never make the cutover a single atomic moment. Instead, make it a gradual process where both schemas coexist and you can roll back at every step.- Phase 1: Dual-write setup (1-2 days of engineering). Deploy application code that writes to BOTH the old table and the new table simultaneously. Every INSERT, UPDATE, DELETE goes to both. The old table remains the source of truth for reads. This is the “expand” phase. Use a feature flag to enable dual-write so you can turn it off instantly if something goes wrong. Critical detail: the dual-write must be in the same database transaction to maintain consistency. If the new table write fails, the transaction rolls back and the old table is also not written — you do not end up with divergent data.
-
Phase 2: Backfill historical data (hours to days). While dual-write is running, backfill the 2 billion existing rows from the old table to the new table. Use a batch migration script that processes rows in chunks of 10,000-50,000, with a configurable delay between batches to avoid overwhelming the database. At 50K rows per batch with a 500ms sleep between batches, processing 2 billion rows takes roughly 5.5 hours. Run this during off-peak hours. Use
WHERE id > last_processed_id ORDER BY id LIMIT 50000to resume from where you left off if the script crashes. Track progress in a separatemigration_statustable. - Phase 3: Verification (1-2 days). After backfill completes, run a consistency check: compare row counts, checksum a sample of rows between old and new tables. Fix any discrepancies (rows written between the backfill read and the dual-write activation). Let dual-write run for 24-48 hours and verify that both tables stay consistent. Monitor write latency — the dual-write adds one extra INSERT per transaction, roughly 1-5ms of overhead.
- Phase 4: Shadow reads (1-2 days). Start reading from BOTH tables and comparing results, but return only the old table’s result to the user. Log discrepancies. GitHub’s Scientist library (Ruby) or a custom comparator does this. This catches bugs in the new schema that would only manifest at production scale — incorrect JOIN behavior, missing indexes, collation differences.
- Phase 5: Cut over reads (gradual). Switch reads to the new table using a feature flag. Start with 1% of traffic, monitor latency and correctness, ramp to 10%, 50%, 100%. At each stage, you can instantly revert to reading from the old table. The old table is still receiving writes via dual-write, so it is always up to date.
- Phase 6: Remove old table writes (the “contract” phase). Once 100% of reads use the new table and you are confident in correctness, remove the dual-write. The old table stops receiving updates. Keep it around for a week as a safety net, then drop it.
orders table (1.8B rows, ~600GB) from a denormalized schema to a normalized one with a separate order_items table. The dual-write phase revealed a bug we never would have caught in staging: our ORM batched order_items INSERTs in groups of 100, and at production write volume, the batch INSERT occasionally exceeded PostgreSQL’s max_stack_depth because the query string for 100 items with 15 columns each was 47KB. In staging with 5 items per order, the query was 2KB — well within limits. We discovered this only because the dual-write to the new table started throwing errors in production while the old table (single denormalized row per order) was fine. We fixed the batch size to 25 items and completed the migration without any user-visible impact. Total migration took 11 days from dual-write activation to old table drop.Follow-up: What happens if a row is updated in the old table during the backfill but before the backfill reaches that row? How do you handle this race condition?
Follow-up: The dual-write adds 3ms of latency to every write. The product team says that is unacceptable for the checkout flow. How do you handle this?
Follow-up: How would this approach change if you were migrating across databases (PostgreSQL to DynamoDB) rather than across schemas within the same database?
Scenario: Your GC pause times on a JVM service are 200-400ms, causing P99 latency spikes. A junior engineer proposes increasing the heap from 4GB to 16GB. Why might this make things WORSE?
Scenario: Your GC pause times on a JVM service are 200-400ms, causing P99 latency spikes. A junior engineer proposes increasing the heap from 4GB to 16GB. Why might this make things WORSE?
When More Memory Hurts
This is the canonical “obvious answer is wrong” question. More heap sounds like it should help — fewer GCs because more room. The trap is that GC pause duration is proportional to the live object set that must be traversed, and a larger heap means the GC runs less frequently but each pause is much longer.What weak candidates say: “More heap means fewer GC pauses, so latency should improve.” This is correct for frequency but ignores duration. Or: “Just switch to a different GC.” This is a reasonable suggestion but does not demonstrate understanding of why the current approach fails.What strong candidates say:- The heap size and GC pause trade-off is non-linear and depends on the collector. With the default G1GC on a 4GB heap, the GC runs mixed collections frequently (every few seconds) with pauses of 200-400ms — bad, but bounded. If you increase to 16GB without changing anything else, the GC delays collection because there is plenty of free space. Objects accumulate for minutes instead of seconds. When the GC finally runs, it has 4x the live objects to scan, mark, and compact. The pause goes from 200-400ms to 800ms-2 seconds. You traded frequent short pauses for infrequent catastrophic pauses. The P99 latency actually gets worse because the P99.9 is now a 2-second stop-the-world event.
-
The right fix depends on the GC algorithm, not just the heap size. With G1GC on 16GB, you should set
-XX:MaxGCPauseMillis=50to tell G1 to target 50ms pauses. G1 will collect more frequently in smaller increments to stay within the target. But G1’s ability to meet this target degrades above ~8-12GB heaps because the marking phase still scales with live object count. - The better answer is to switch to a low-pause collector. ZGC (production-ready since JDK 15) or Shenandoah (Red Hat JDK) are designed for large heaps. ZGC pauses are typically under 1ms regardless of heap size — I have seen ZGC hold sub-millisecond pauses on 32GB heaps with 20GB live data. The trade-off is ~5-15% throughput reduction (the GC does more concurrent work alongside your application threads, stealing CPU cycles) and higher memory overhead (~20% more RSS because ZGC uses colored pointers that consume extra memory).
-
But before changing any GC configuration, I would profile to understand what is generating garbage. A high GC frequency with 200-400ms pauses on a 4GB heap suggests the application is allocating aggressively. Common JVM allocation hot spots: (1) Excessive object creation in hot loops — autoboxing
inttoIntegerin collections, creating newStringobjects via concatenation instead ofStringBuilder. (2) Large temporary objects for JSON serialization — Jackson’s ObjectMapper creates intermediate tree structures proportional to payload size. (3) Framework overhead — Spring’s request-scoped beans, Hibernate’s dirty checking creating proxy objects. Useasync-profiler -e allocto capture an allocation flame graph and find the top allocators. Reducing allocation rate by 50% (which is often achievable by fixing 2-3 hot paths) halves GC frequency without touching heap size or collector.
-XX:+UseZGC), set the heap to 12GB, and profiled allocations. The allocation flame graph showed that 40% of garbage was from protobuf deserialization creating temporary byte arrays. We switched to a zero-copy deserialization path using Protobuf’s ByteString with aliasing, cutting allocation rate by 35%. Final result: ZGC pauses under 0.8ms at P99, and throughput actually increased 8% because the CPU was no longer spending 12% of its time on G1’s marking phase. The junior engineer learned that GC tuning is not “make the number bigger” — it is understanding the relationship between allocation rate, live set size, collector algorithm, and pause time targets.Follow-up: Explain the difference between G1GC, ZGC, and Shenandoah. When would you choose each?
Follow-up: How would you diagnose GC issues in a Go service? Go does not have heap size configuration the same way JVM does.
Follow-up: Your service runs in a container with a 4GB memory limit. The JVM heap is set to 3GB. Why might you still get OOM-killed?
Scenario: You implement rate limiting at 1,000 req/s per API key. A legitimate customer with a batch workflow is constantly being throttled, while an abusive bot with 50 API keys sails through. How do you fix this?
Scenario: You implement rate limiting at 1,000 req/s per API key. A legitimate customer with a batch workflow is constantly being throttled, while an abusive bot with 50 API keys sails through. How do you fix this?
Rate Limiting That Actually Works
The trap is that simple per-key rate limiting is trivially bypassed by distributing requests across keys, and overly restrictive for legitimate high-volume users. Rate limiting sounds simple but the design space is surprisingly deep.What weak candidates say: “Increase the rate limit for that customer.” This helps the legitimate customer but does nothing about the bot. Or: “Block the bot’s IP.” Bots rotate IPs. This is whack-a-mole.What strong candidates say:This scenario exposes the fundamental problem with single-dimensional rate limiting. Rate limiting on one axis (API key) is always gameable on that axis. The fix is multi-dimensional rate limiting plus behavioral analysis.- Layer 1: Tiered rate limits per API key. Not all API keys are equal. Free tier: 100 req/s. Paid tier: 1,000 req/s. Enterprise tier: 10,000 req/s with a burst allowance of 15,000 for 30 seconds. The legitimate batch customer gets an enterprise key with appropriate limits. This is table stakes — every serious API does this.
- Layer 2: Aggregate rate limiting. In addition to per-key limits, apply limits per source IP, per IP subnet (/24 block), and per user-agent. The bot with 50 API keys but one IP subnet hits a subnet-level limit of 5,000 req/s even though each individual key is under its limit. Cloudflare, AWS WAF, and Kong all support multi-dimension rate limiting. The key insight: attackers can cheaply generate API keys. They cannot cheaply generate diverse network infrastructure.
-
Layer 3: Adaptive rate limiting based on behavior. Static limits are a blunt instrument. Implement a scoring system: requests that look like automated traffic (identical user-agents, no
Accept-Languageheader, perfectly uniform inter-request timing with zero jitter, sequential API key usage) accumulate a “suspicion score.” As the score rises, the effective rate limit tightens. A legitimate user with organic request patterns gets full throughput. A bot with machine-perfect timing gets progressively throttled. Stripe and Shopify use variants of this approach. -
Layer 4: Cost-based rate limiting (the senior answer). Not all requests are equal in cost. A
GET /users/{id}costs almost nothing (cache hit). APOST /reports/generatetriggers a 30-second database aggregation. Flat per-request rate limiting lets the bot hammer the expensive endpoint. Instead, assign each endpoint a “cost” in tokens. The token bucket drains faster for expensive operations. At Stripe, their API rate limiter assigns different token costs to read vs write vs list operations. The customer’s batch workflow (many cheap reads) stays within budget. The bot targeting expensive endpoints exhausts its budget quickly. -
Implementation: Sliding window counters in Redis. I would implement this with a Redis sorted set per rate limit dimension. Each request adds a member with the current timestamp as the score. To check the limit,
ZRANGEBYSCOREcounts members within the last N seconds andZREMRANGEBYSCOREremoves expired entries. This gives a true sliding window (not a fixed-window approximation that allows 2x burst at window boundaries). For multi-node setups, Redis is already centralized, so the rate limit is global. The latency overhead is ~0.5-1ms per request for the Redis round-trip.
Follow-up: How do you implement rate limiting in a distributed system with 20 application instances? Where does the counter live?
Follow-up: What is the difference between a fixed window, sliding window, and sliding window log rate limiter? What is the edge case that makes fixed windows unreliable?
Follow-up: Your rate limiter uses Redis. Redis goes down. Do you fail open (allow all requests) or fail closed (reject all requests)? What are the consequences of each?
Scenario: Your microservices team adopts eventual consistency for the order-inventory system. Three months in, customers are buying products that are out of stock, and the business is losing $50K/month on cancelled orders. Was eventual consistency the wrong choice?
Scenario: Your microservices team adopts eventual consistency for the order-inventory system. Three months in, customers are buying products that are out of stock, and the business is losing $50K/month on cancelled orders. Was eventual consistency the wrong choice?
When Eventual Consistency Bites
The trap is answering “yes, switch to strong consistency.” The real answer is nuanced: eventual consistency was likely the right architectural choice, but the consistency window was too wide, and the team failed to account for the business impact of the inconsistency gap.What weak candidates say: “Eventual consistency was a mistake for inventory. Use strong consistency.” This ignores why eventual consistency was chosen (performance, availability, service decoupling) and does not address the actual problem. Or: “Just make the consistency window shorter.” How short? At what cost?What strong candidates say:- First, I would quantify the problem. $50K/month in cancelled orders — what percentage of total orders is that? If it is 0.1% of orders, that is a business cost that might be cheaper than the engineering cost of strong consistency. If it is 5%, it is a crisis. The right architecture depends on the math. At Amazon, they accept a certain rate of overselling on non-critical items because the cost of occasional cancellation is lower than the cost of pessimistic locking on every purchase across a global inventory system.
- The root cause is likely not the consistency model itself but the consistency window. “Eventual” could mean 50ms or 50 seconds. If inventory updates propagate via Kafka with consumer lag, the window could be 2-10 seconds during normal operation and 30-60 seconds during a consumer backlog. During a flash sale, a product could sell out in 5 seconds, but the inventory service does not know for 30 seconds. That 30-second window is where oversells happen.
-
Solution 1: Reduce the consistency window to below the business-critical threshold. If a product sells out in 5 seconds, the inventory update must propagate in under 1 second. Switch from batch-processed Kafka events to a Redis-based real-time inventory counter. The order service atomically decrements Redis (
DECR) at purchase time, and the inventory service syncs to the database asynchronously. The Redis counter is the real-time source of truth for availability, and the database is the durable source of truth for accounting. Consistency window: ~0ms for the hot path. - Solution 2: Reservation pattern. Instead of decrementing inventory on order creation, reserve inventory first. The order service calls the inventory service synchronously to reserve N units. If the reservation succeeds (inventory > 0), the order proceeds. If it fails, the user gets “out of stock” immediately. The reservation has a TTL (say 10 minutes). If the order is not completed within the TTL, the reservation expires and inventory is released. This is synchronous for the critical check (is it in stock?) but still eventually consistent for the fulfillment side.
- Solution 3: Accept overselling with graceful recovery. For non-critical items (not limited edition, not flash sale), allow overselling up to a configurable threshold (say 5% of stock). When an oversell is detected during fulfillment, automatically offer the customer a choice: backorder, substitute, or refund with a discount code. This is what most large retailers actually do. The 10K/month in refund costs, which is an acceptable business cost for the architectural simplicity.
- The meta-insight: consistency is not a binary choice between “strong” and “eventual.” It is a spectrum, and different parts of the same system need different consistency guarantees. Inventory availability for display (“12 left!”) can tolerate 30 seconds of staleness. Inventory decrement for purchase must be strongly consistent (or near-real-time). Order-to-shipment status can be eventually consistent with a 5-minute window. The mistake was applying one consistency model uniformly instead of matching the consistency guarantee to the business requirement at each boundary.
SELECT FOR UPDATE to an event-driven Kafka pipeline because the lock contention was causing 500ms+ checkout latencies. Oversells jumped from 0.02% to 3.8% of orders. The business team was furious. The engineering team blamed “eventual consistency.” The real problem: the Kafka consumer for inventory updates had 8 partitions but was running only 2 consumer instances due to a deployment misconfiguration. Lag was 45 seconds during peak hours. We fixed the consumer scaling (8 consumers for 8 partitions), added a Redis-based real-time inventory counter as the hot-path check (order service does DECR on Redis, Kafka events update the database asynchronously for accounting), and added a “soft reserve” with 15-minute TTL. Oversells dropped to 0.05% — below the pre-Kafka baseline — while checkout latency stayed at the improved 80ms. The consistency model was not wrong. The operational implementation of the consistency model was.Follow-up: How do you monitor the “consistency window” in production? What metrics would you track?
Follow-up: Explain the CAP theorem in the context of this inventory scenario. What are you actually giving up?
Follow-up: A product manager asks “can’t we just make everything strongly consistent?” How do you explain the trade-off in business terms, not engineering terms?
Scenario: You need to deploy a new version that changes the database schema AND the API response format. Both changes are breaking. Your service handles 20K req/s. How do you deploy with zero downtime and zero errors?
Scenario: You need to deploy a new version that changes the database schema AND the API response format. Both changes are breaking. Your service handles 20K req/s. How do you deploy with zero downtime and zero errors?
The Coordinated Breaking Change
The trap is thinking about this as a single deploy. Any approach that attempts to make both breaking changes atomically will fail because you cannot update the database schema and all application instances simultaneously. There will always be a window where old code runs against a new schema or new code runs against the old schema.What weak candidates say: “Deploy during off-hours when traffic is low.” At 20K req/s, “low traffic” is still 5K req/s. Errors during deploy are still thousands of affected users. Or: “Use blue-green deployment and switch all at once.” Blue-green switches the application but not the database. New code hitting the old schema still breaks.What strong candidates say:This requires the expand-contract pattern (also called parallel change), executed in three distinct deploys spread over days or weeks. The fundamental principle: never make a change that is incompatible with the currently running code. Every intermediate state must be valid.-
Deploy 1: Expand the database (backward-compatible). Add the new columns, tables, or schema changes WITHOUT removing or renaming old ones. For example, if you are renaming
user_nametodisplay_name, adddisplay_nameas a new nullable column. Write a database trigger or application-level dual-write that copies values fromuser_nametodisplay_nameon every write. The old application code continues to read and writeuser_namewithout any issues. The new column silently populates alongside it. Backfill existing rows (UPDATE users SET display_name = user_name WHERE display_name IS NULL, in batches of 10K to avoid long-running transactions). -
Deploy 2: Migrate application code (backward-compatible API). Deploy new application code that reads from
display_name(with a fallback touser_nameifdisplay_nameis null — belt and suspenders), writes to BOTH columns, and returns the new API response format. But here is the critical part for the API: use API versioning. The new response format is served under/v2/users. The old/v1/usersendpoint continues to return the old format by readingdisplay_nameand mapping it back to the old field name. Clients migrate from v1 to v2 on their own timeline. During the rolling deploy, some instances serve v1 code and some serve v2 code — both are valid because the database has both columns and the application handles both. -
Deploy 3: Contract (remove the old). After all clients have migrated to v2 (monitor v1 traffic — when it hits zero or a negligible threshold), deploy code that removes the
user_namefallback and drops the v1 endpoint. Run a migration to drop theuser_namecolumn (ALTER TABLE users DROP COLUMN user_name). UseDROP COLUMNcarefully in PostgreSQL — it acquires anAccessExclusiveLockbut is instant because it only marks the column as dropped in the catalog without rewriting the table (since PG 11). In MySQL,DROP COLUMNrewrites the table — usept-online-schema-changeorgh-ostto avoid downtime. -
Key details that separate a real answer from a textbook one:
- Never rename a column directly. Add new, dual-write, backfill, switch reads, drop old. A direct
ALTER TABLE RENAME COLUMNbreaks every running instance that references the old name. - Feature flags control which code path is active. If Deploy 2 causes issues, flip the flag to fall back to the old code path without a redeploy.
- Database migrations must be separate deploys from application code. Deploy the migration first, let it propagate, THEN deploy the code that uses the new schema. Never put both in the same release — if the app deploy rolls back, the migration does not roll back with it.
- Never rename a column directly. Add new, dual-write, backfill, switch reads, drop old. A direct
transactions table from storing amounts as integers (cents) to using NUMERIC(19,4) for sub-cent precision required by a new currency pair. The table had 3.2 billion rows and served 12K reads/s. Direct ALTER TABLE ALTER COLUMN was estimated at 4+ hours of table lock. We used the expand-contract pattern: added a amount_precise NUMERIC(19,4) column, wrote a trigger that populated it on every INSERT/UPDATE (NEW.amount_precise = NEW.amount_cents / 100.0), backfilled 3.2B rows in batches of 50K over 18 hours during off-peak, deployed new code that read from amount_precise with fallback to amount_cents / 100.0, monitored for a week, then scheduled the old column drop. Zero downtime, zero errors. The entire process took 3 weeks from first deploy to final cleanup. The trigger added ~0.3ms per write — undetectable in our metrics.Follow-up: What if the schema change is not additive? For example, you need to split one table into two tables with a different primary key structure.
Follow-up: How do you handle the backfill of 2 billion rows without impacting production query latency?
Follow-up: A deploy of the new code goes out but has a bug. You need to roll back. The database already has the new schema. How do you ensure the rollback is safe?
Scenario: Your team uses an LLM to generate API responses (summarization, recommendations). The LLM call takes 800ms-3s. Your API SLA is 500ms. How do you architect this?
Scenario: Your team uses an LLM to generate API responses (summarization, recommendations). The LLM call takes 800ms-3s. Your API SLA is 500ms. How do you architect this?
AI-Assisted Engineering: The Latency Problem
LLM inference is the new “slow external dependency” — but unlike a slow database query, you cannot simply add an index. LLM latency is fundamentally bound by the model size, token count, and inference hardware. This question tests whether you can integrate a fundamentally slow component into a fast system.What weak candidates say: “Use a faster model.” This is hand-waving. Or: “Just cache the responses.” LLM inputs are often unique (user-specific queries), making cache hit rates low.What strong candidates say:- The fundamental constraint. An LLM call for a 200-token response on GPT-4 class models takes 800ms-3s. On a smaller model (Llama 3.1 8B self-hosted on GPU), it takes 200-800ms. On a distilled model or a fine-tuned small model, 50-200ms. The first architectural decision is: does the user need the LLM result in this HTTP response, or can it be async?
-
Pattern 1: Streaming (if the result IS the response). Use Server-Sent Events (SSE) or WebSocket to stream tokens as they are generated. The first token arrives in 100-300ms (Time to First Token, TTFT), and the user sees progressive output. The total latency is 2-3s, but the perceived latency is 200ms because the user sees content immediately. This is what ChatGPT, Claude, and every chat interface does. Your API returns a
text/event-streamresponse. The SLA shifts from “full response in 500ms” to “first token in 300ms, full response in 3s.” - Pattern 2: Async with pre-computation (if the result can be prepared ahead of time). For recommendations and summaries that are based on relatively stable data, pre-compute the LLM output in a background job and cache it. When user data changes, enqueue an LLM call to regenerate the summary. The API reads the pre-computed result from Redis or the database — 1-5ms. The LLM result may be 5-30 minutes stale, which is acceptable for “product recommendations” but not for “summarize this document I just uploaded.”
- Pattern 3: Hybrid — fast response with async enrichment. Return the API response immediately with non-LLM data (structured data, cached results, rule-based defaults). Fire an async LLM call. When the LLM result is ready, push it to the client via WebSocket or update the cache for the next request. The user sees a fast initial load and the AI-generated content appears moments later. This is common in e-commerce: show the product page instantly, then load AI-generated “Why you’ll love this” copy asynchronously.
- Pattern 4: Model selection based on latency budget. Not every LLM call needs GPT-4. Route simple tasks (classification, sentiment, entity extraction) to a small, fast model (fine-tuned Llama 3.1 8B, <100ms inference). Route complex tasks (multi-step reasoning, creative generation) to a larger model (GPT-4o, Claude Sonnet). This is the “model router” pattern. The routing decision can itself be made by a small classifier.
Follow-up: How do you handle LLM rate limits from the provider? At 100K req/day, you will hit OpenAI’s TPM (tokens per minute) limits.
Follow-up: An LLM hallucination in a product recommendation causes a customer complaint. How do you add guardrails without adding latency?
Follow-up: Your team wants to fine-tune a model to reduce latency. What is the trade-off between fine-tuning cost, inference cost, and quality?
Scenario: Your team argues about whether to use a single multi-tenant database or one database per tenant for a B2B SaaS product. There are currently 200 tenants, expected to grow to 5,000 in 2 years. One tenant is 100x the size of the median tenant. What do you recommend?
Scenario: Your team argues about whether to use a single multi-tenant database or one database per tenant for a B2B SaaS product. There are currently 200 tenants, expected to grow to 5,000 in 2 years. One tenant is 100x the size of the median tenant. What do you recommend?
The Multi-Tenancy Architecture Decision
This is a staff-level question with no clean answer. Both approaches have serious trade-offs, and the right answer depends on dimensions that most candidates do not ask about: regulatory requirements, tenant isolation expectations, operational team size, and the distribution of tenant sizes.What weak candidates say: “One database per tenant for isolation.” This does not scale to 5,000 databases — that is 5,000 connection pools, 5,000 sets of migrations, 5,000 backup schedules. Or: “Shared database for simplicity.” This ignores the 100x whale tenant that will dominate the shared resources.What strong candidates say:I would not pick either extreme. I would use a hybrid model — shared multi-tenant database for the majority, with dedicated databases for whale tenants. Here is the reasoning:-
Shared database for the “long tail” (195+ smaller tenants). Use a
tenant_idcolumn on every table. Every query includesWHERE tenant_id = ?. This is operationally simple: one schema to migrate, one connection pool, one backup, one monitoring dashboard. Row-level security in PostgreSQL (CREATE POLICY) can enforce tenant isolation at the database level so even a bug in the application cannot leak data across tenants. At 200 small tenants, the shared database handles the aggregate load easily. - Dedicated database for whale tenants (top 3-5 by size). The 100x tenant gets its own PostgreSQL instance. Its queries do not compete with smaller tenants for buffer pool, connection pool, or I/O bandwidth. Its large table scans do not evict smaller tenants’ hot data from shared_buffers. Its backup does not cause I/O contention for others. This also provides a stronger isolation story for enterprise clients who contractually require dedicated infrastructure (common in healthcare, finance, government).
-
Routing layer. The application looks up
tenant_id -> database connectionin a routing table (cached in Redis for <1ms latency). Small tenants route to the shared database. Whale tenants route to their dedicated instance. Adding a new dedicated instance for a growing tenant is a data migration (dual-write, backfill, cut over), not an architecture change. - Why not database-per-tenant for everyone? At 5,000 tenants, that is 5,000 PostgreSQL instances. Even using RDS, that is: 250K-500-2,000/month.
-
Why not a fully shared database? The 100x whale tenant’s queries will dominate
shared_buffers. A full table scan on the whale’s data evicts the hot pages for 199 other tenants. The whale’s write volume drives autovacuum frequency, which competes with read workloads. Worst case: the whale’s operations trigger lock contention that affects everyone. Also, some enterprise clients will contractually refuse to share a database with other tenants. -
The path to 5,000 tenants. At 5,000 tenants, the shared database may need to be sharded. Shard by
tenant_idhash. Each shard handles ~1,000 tenants. This is where the routing layer pays off — you already have tenant-to-database routing, so adding shards is a configuration change in the routing table, not an application rewrite. The whale tenants remain on dedicated instances.
Follow-up: How do you handle cross-tenant analytics queries (e.g., a global admin dashboard showing metrics across all tenants) in this hybrid model?
Follow-up: Tenant B is growing fast and is about to become a whale. How do you migrate them from the shared database to a dedicated instance with zero downtime?
Follow-up: How do you prevent a query bug from accidentally leaking data between tenants in the shared database? What is your defense-in-depth strategy?
Scenario: Your team migrates from a monolith to microservices. Six months in, end-to-end latency has increased by 3x and debugging takes 5x longer. The CTO is questioning the migration. What went wrong and how do you fix it?
Scenario: Your team migrates from a monolith to microservices. Six months in, end-to-end latency has increased by 3x and debugging takes 5x longer. The CTO is questioning the migration. What went wrong and how do you fix it?
The Microservices Tax
The trap is defending the microservices decision or blaming the team’s execution. The real answer acknowledges that microservices have a concrete performance and complexity tax that must be paid with specific infrastructure — and most teams start paying this tax before they have built the infrastructure to offset it.What weak candidates say: “The team should have planned better.” Vague and unhelpful. Or: “Microservices are always slower because of network hops.” This is partially true but does not explain what to do about it.What strong candidates say:The 3x latency increase and 5x debugging slowdown are the two most predictable consequences of a microservices migration, and both have concrete causes and fixes.- Why latency increased 3x. In the monolith, a function call between modules took ~10 nanoseconds. In microservices, the same logical call is now an HTTP/gRPC request: DNS resolution (~1ms cached, ~50ms uncached), TCP connection setup (~0.5ms within a datacenter), TLS handshake (~5ms), serialization/deserialization (~0.1-1ms for JSON, less for Protobuf), and network transit (~0.5ms within a datacenter). The floor is ~2-7ms per hop. If the checkout flow previously called 5 internal functions (total: ~50ns), it now makes 5 network calls (total: ~10-35ms). That is a 200,000x increase per call chain. Worse, calls that were parallel in the monolith (goroutines, threads accessing shared memory) may have become sequential because the team did not implement concurrent service calls in the API gateway. Fixes: (1) Identify sequential chains and parallelize them. If Service A calls B, then C, then D, but B and C are independent, call them concurrently. This alone often cuts latency by 30-50%. (2) Use connection pooling and keep-alive on all inter-service HTTP clients. A fresh connection per request adds 5-10ms. A pooled connection adds ~0.1ms. (3) Switch high-throughput internal calls from REST/JSON to gRPC/Protobuf for 3-10x smaller payloads and faster serialization. (4) Merge services that are always called together. If Service A calls Service B on 100% of requests, they should probably be one service. Microservices are not “make everything its own service” — they are “split at boundaries where independent scaling and deployment matter.”
-
Why debugging takes 5x longer. In the monolith, a stack trace showed the entire call chain. Logs were in one place. A debugger could step through the full flow. In microservices, a single user request generates logs across 5 services, each with its own log format, timestamp, and deployment version. Without distributed tracing, correlating these logs is manual detective work. Without structured logging with a shared request ID, it is nearly impossible.
Fixes: (1) Implement distributed tracing immediately — OpenTelemetry with Jaeger, Datadog APM, or Grafana Tempo. Every request gets a trace ID that propagates across all service calls. A single trace shows the entire call chain, timing, and errors. This is not optional — it is a prerequisite for operating microservices. (2) Centralized structured logging. All services log to the same system (ELK stack, Datadog Logs, Grafana Loki) with a shared schema:
{timestamp, service, trace_id, level, message}. (3) Service dependency map. Visualize which services call which and at what volume. Datadog, Kiali (for Istio), and Grafana’s service graph do this automatically from trace data. - The organizational question nobody asks. The CTO’s real question is “was this migration worth it?” The answer depends on what the monolith’s pain points were. If the monolith’s problem was deployment coupling (a bug in the payments module blocks the entire release), microservices solve that. If the problem was independent scaling (the image processing needs 10x the compute of the API), microservices solve that. If the problem was “the codebase is too big and confusing,” microservices do not solve that — they replace in-process complexity with distributed systems complexity, which is harder to reason about. The honest answer to the CTO might be: “We migrated for the right reasons, but we underinvested in the observability and infrastructure tax. Here is the 90-day plan to close that gap.”
Follow-up: How do you decide which services to extract from a monolith first? What criteria do you use?
Follow-up: The team wants to add a service mesh (Istio) to handle retries, circuit breaking, and mTLS. What is the performance cost, and when is it worth paying?
Follow-up: Two services have a circular dependency — A calls B and B calls A. How do you break this cycle?
Scenario: You are designing a system to ingest 1 million events per second from IoT devices. Each event is 500 bytes. The data must be queryable within 5 seconds of ingestion. What architecture do you propose?
Scenario: You are designing a system to ingest 1 million events per second from IoT devices. Each event is 500 bytes. The data must be queryable within 5 seconds of ingestion. What architecture do you propose?
The High-Throughput Ingestion Design
This is a system design question that tests whether you can work backward from hard numbers. The trap is jumping to a technology choice without doing the math first.What weak candidates say: “I’d use Kafka and write to a database.” This is directionally correct but completely lacks specificity. What kind of database? How many Kafka partitions? What is the write throughput math? Or: “I’d use AWS Kinesis.” Fine, but how many shards, and can Kinesis handle the throughput at what cost?What strong candidates say:- Start with the math. 1M events/s at 500 bytes = 500 MB/s raw data. That is 43 TB/day. With replication (factor 3 for durability), that is 1.5 GB/s of write throughput and ~130 TB/day of storage. These numbers immediately eliminate most databases. PostgreSQL maxes out at ~50-100 MB/s sustained write throughput on a large instance. Even DynamoDB at 1M writes/s would cost ~$650/hour in on-demand mode. The architecture must use purpose-built ingestion infrastructure.
- Ingestion layer: Apache Kafka. A single Kafka broker sustains ~200-500 MB/s write throughput depending on message size and replication factor. With replication factor 3, I need at least 3-5 brokers for the write throughput alone. I would use 64-128 partitions to parallelize consumption. Message key: device_id (ensures ordering per device). Retention: 7 days for reprocessing capability. With log compaction disabled (events are immutable), storage is linear with retention. 7 days times 43 TB/day = ~300 TB of Kafka storage. Use tiered storage (Kafka’s tiered storage feature, or Confluent’s) to offload older segments to S3 and keep the broker’s local SSD for the hot tail. Alternative: AWS MSK (managed Kafka) with tiered storage, or Redpanda (Kafka-compatible, written in C++, better single-node throughput).
-
The 5-second queryable requirement is the hard constraint. “Queryable within 5 seconds” means the data must be indexed and searchable, not just stored. Kafka alone does not satisfy this — you can consume from Kafka, but Kafka is not a query engine. Options for the query layer:
-
ClickHouse. Columnar database designed for real-time analytics on append-heavy data. ClickHouse can ingest 1M+ rows/s on a modest cluster (3-5 nodes) and make them queryable within 1-2 seconds via the
MergeTreeengine family. This is the option I would default to for time-series IoT data. ClickHouse’sReplicatedMergeTreeprovides durability, and theMaterializedViewfeature can pre-aggregate data at ingest time for common query patterns (e.g., average temperature per device per minute). - Apache Druid. Real-time OLAP database with sub-second query latency on high-cardinality data. Druid ingests from Kafka directly (built-in Kafka indexing service), making the pipeline simpler. Trade-off: Druid’s operational complexity is higher than ClickHouse, and it uses more memory.
- TimescaleDB. PostgreSQL extension for time-series. Easier to operate if the team already knows PostgreSQL. But at 1M events/s, a single TimescaleDB instance maxes out. You would need a multi-node deployment or Timescale Cloud with aggressive partitioning.
-
ClickHouse. Columnar database designed for real-time analytics on append-heavy data. ClickHouse can ingest 1M+ rows/s on a modest cluster (3-5 nodes) and make them queryable within 1-2 seconds via the
- Stream processing layer (optional but valuable). Between Kafka and the query layer, Apache Flink or Kafka Streams can perform real-time transformations: deduplication (IoT devices often send duplicate events), enrichment (attach device metadata from a lookup table), windowed aggregations (compute 1-minute averages and store only the aggregates for long-term retention). This reduces the volume hitting the query layer. If 80% of queries are on pre-aggregated data, the query layer needs to handle only 200K raw events/s plus the aggregated stream.
- Cold storage for historical data. After 30 days, move raw events from ClickHouse to S3 in Parquet format. Use DuckDB, Athena, or Trino for ad-hoc queries on historical data. This keeps the ClickHouse cluster small and fast for recent data while maintaining full history at S3 storage costs (~$23/TB/month).
Follow-up: How do you handle late-arriving events — a sensor that was offline for an hour and then sends a burst of historical data?
Follow-up: What happens when a Kafka consumer falls behind? At 500 MB/s, even a 10-minute lag means 300 GB of backlog. How do you recover?
Follow-up: How do you handle schema evolution? A new firmware version adds three new fields to the event payload. What happens to the downstream pipeline?
Scenario: Your ML pipeline processes 500K images/day for product classification. GPU utilization on your g5.2xlarge instances is 35%. Your monthly GPU bill is $52K. The ML team says they need more GPU instances for a new model. How do you respond?
Scenario: Your ML pipeline processes 500K images/day for product classification. GPU utilization on your g5.2xlarge instances is 35%. Your monthly GPU bill is $52K. The ML team says they need more GPU instances for a new model. How do you respond?
Cost-Aware Performance Engineering
The trap is either blindly approving more GPUs or blindly refusing. The real answer requires understanding GPU utilization patterns, batch scheduling, and the economics of inference infrastructure.What weak candidates say: “Give them more GPUs, ML is important.” No cost analysis. Or: “They should optimize their model first.” This may be correct but is not constructive without specifics.What strong candidates say:- First, understand the utilization gap. 35% GPU utilization on a 0.79/hour is wasted per instance. With multiple instances running 24/7, that is significant. But GPU utilization is tricky — it might be 35% average because the pipeline runs at 95% for 8 hours and 0% for 16 hours. Or it might be 35% continuously because the batch size is too small to saturate the GPU. These require different solutions.
- If utilization is bursty (high during batch windows, idle otherwise): Use spot instances for the burst portion. GPU spot instances are 60-70% cheaper than on-demand. The risk is interruption, but for batch ML workloads that checkpoint progress, a spot interruption costs minutes, not hours. Alternatively, use AWS Batch or Kubernetes with KEDA to scale GPU pods to zero when no work is queued and scale up during batch windows.
- If utilization is consistently low: The batch size is likely too small. Increasing the batch size from 16 to 64 images often improves GPU utilization from 35% to 80%+ because the GPU’s parallel cores are better utilized. This is free performance — no additional cost. Also check: is the pipeline CPU-bottlenecked on preprocessing (image decode, resize) before the GPU inference step? If so, the GPU is idle waiting for data. Add more CPU-based preprocessing workers or use NVIDIA DALI for GPU-accelerated preprocessing.
- Before approving new GPU instances for a new model: Can the new model time-share the existing GPUs? If the current model runs 8 hours/day, the new model can run during the other 16 hours on the same hardware. Use a job scheduler (Kubernetes with priority classes, or SageMaker Pipelines) to orchestrate this. Alternatively, can the new model run on a smaller GPU? If the current model needs an A10G (g5.2xlarge, 0.53/hour), do not put it on the expensive hardware.
-
The cost optimization playbook for GPU inference:
- Right-size the instance. Do you need an A100 (1.21/hour) handle the throughput?
- Use spot instances for batch workloads. Save 60-70%.
- Optimize batch size to maximize GPU utilization.
- Use inference frameworks like TensorRT, ONNX Runtime, or vLLM that optimize model execution. TensorRT can improve inference throughput by 2-5x on the same hardware.
- Consider model distillation. A smaller model that is 95% as accurate but 10x faster on cheaper hardware might be the right trade-off.
(monthly GPU cost) / (total images processed per month). Target: reduce cost per image by 40%+ through utilization and scheduling improvements before approving new instances.Follow-up: The ML team wants to run real-time inference (product classification at upload time, <200ms latency). How does this change the GPU provisioning strategy?
Follow-up: NVIDIA Triton Inference Server claims to improve GPU utilization through dynamic batching. How does dynamic batching work and what is the latency trade-off?
Follow-up: Your spot GPU instances get interrupted during a critical batch run. How do you design the pipeline to be resilient to interruptions?
Scenario: You are using an AI coding assistant (Copilot, Claude, Cursor) to write performance-critical code. The generated code passes tests but introduces a subtle O(n^2) algorithm where O(n) was possible. How do you build guardrails?
Scenario: You are using an AI coding assistant (Copilot, Claude, Cursor) to write performance-critical code. The generated code passes tests but introduces a subtle O(n^2) algorithm where O(n) was possible. How do you build guardrails?
AI-Assisted Engineering: Performance Guardrails
This is the emerging class of interview questions about working with AI tools without blindly trusting them. It tests whether you understand the limitations of AI-generated code specifically around performance characteristics.What weak candidates say: “Review all AI-generated code carefully.” This is correct but generic. Or: “Don’t use AI for performance-critical code.” This throws away a powerful tool instead of building guardrails.What strong candidates say:- The failure mode of AI-generated code is correctness without efficiency. LLMs optimize for code that works (passes the test), not code that is fast. They will generate a nested loop when a hash map lookup would work, a recursive solution with redundant computation when dynamic programming would be O(n), or an ORM query that triggers N+1 without realizing it. The code is correct — it produces the right output — but it may be 100-1000x slower than necessary.
-
Guardrail 1: Automated complexity analysis in CI. Tools like
eslint-plugin-no-loops(JavaScript),pylintwith custom rules, or SonarQube’s cognitive complexity checks can flag code patterns that are likely O(n^2) or worse. More advanced: use a static analysis tool that estimates algorithmic complexity (e.g.,big-o-calculatorfor Python). Flag any function with estimated complexity above O(n log n) for manual review. -
Guardrail 2: Performance benchmarks in the test suite. For performance-critical paths, write benchmark tests alongside unit tests. The benchmark specifies: “this function must process 10,000 items in under 50ms.” If AI-generated code passes the unit test but fails the benchmark, the CI pipeline catches it. In Go,
testing.Bbenchmarks are built-in. In Python, usepytest-benchmark. In JVM, use JMH. The key: the benchmark input size must be large enough to reveal O(n^2) behavior — 100 items processes fine at O(n^2); 100,000 items exposes it. -
Guardrail 3: Query analysis for database code. For ORM-generated queries, run
EXPLAIN ANALYZEin CI against a test database with production-scale data (or a representative subset). Fail the build if any query in the critical path shows a sequential scan on a table with >10K rows or an estimated cost above a threshold. This catches the N+1 queries and missing indexes that AI-generated ORM code commonly introduces. - Guardrail 4: Load testing as a deployment gate. Before any deploy to production, run a subset of your load test suite (k6, Locust) against the staging environment. If P95 latency regressed by more than 20% compared to the previous version, block the deploy. This is the final safety net that catches performance regressions regardless of their source — human or AI.
- The human review protocol for AI-generated code. When reviewing AI-generated code, explicitly ask: (1) What is the time complexity of this function? (2) What is the space complexity? (3) How does this behave when the input is 100x larger? (4) Does this create N+1 queries? (5) Does this allocate in a hot loop? These five questions catch 90% of AI-generated performance issues.