Interview Day Quick Review
The essentials worth re-reading in the elevator on the way up.

| # | Remember This |
|---|---|
| 1 | RAM is ~1,000x faster than SSD, SSD is ~100x faster than HDD. Latency intuition wins system design rounds. |
| 2 | QPS = (DAU x actions/user) / 86,400. Always estimate peak at 3x average. |
| 3 | 99.99% uptime = 52 minutes downtime/year. Each dependency in the critical path multiplies downtime. |
| 4 | Cache-Aside is the default caching pattern. Know when Write-Behind or Read-Through is better. |
| 5 | Default to REST for public APIs, gRPC for internal service-to-service. Know why, not just what. |
| 6 | 401 = “who are you?” (authn), 403 = “you can’t do this” (authz). Getting this wrong is a red flag. |
| 7 | O(1) = hash map, O(log n) = binary search, O(n log n) = sort. Pattern-match complexity to data structure. |
| 8 | Canary + feature flags is the gold standard deployment strategy at scale. |
| 9 | Never roll your own auth. Say this in the interview. Mean it. |
| 10 | SOLID is about managing change, not about following rules. Explain the why, not just the acronym. |
| 11 | Raft = strong leader + majority quorum. Most modern consensus systems (etcd, CockroachDB, Consul) use Raft, not Paxos. |
| 12 | CRDTs merge without coordination. G-Counter, PN-Counter, OR-Set — know which to reach for. |
| 13 | epoll is O(1) per event; select is O(n). This is why Nginx and Node.js handle 100K+ connections. |
| 14 | DynamoDB: design for access patterns, not entities. Single-table design with PK/SK is the canonical pattern. |
| 15 | WebSocket = bidirectional, SSE = server-push only (simpler), WebRTC = peer-to-peer (lowest latency). Pick the simplest protocol that fits. |
| 16 | GraphQL solves over-fetching but creates N+1 problems. Always mention DataLoader. |
| 17 | Service mesh = sidecar proxy on every pod. It handles mTLS, retries, and circuit breaking so your app code doesn’t. |
| 18 | Lambda cold starts: Go/Rust ~50-100ms, Python/Node ~100-300ms, Java ~3-10s. Choose runtime by latency budget. |
System Design Interview Framework
Use this five-step framework in every system design round. It keeps you structured and prevents the most common mistake: jumping into low-level details before establishing the big picture.
1. Latency Numbers Every Engineer Should Know
| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | Fastest memory access available |
| Branch mispredict | 5 ns | CPU pipeline flush penalty |
| L2 cache reference | 7 ns | ~14x slower than L1 |
| Mutex lock/unlock | 25 ns | Contention makes this much worse |
| Main memory (RAM) reference | 100 ns | ~200x slower than L1 |
| Compress 1 KB with Snappy | 3 μs | Fast compression for real-time use |
| Read 1 MB sequentially from RAM | 3 μs | RAM is fast for sequential access |
| SSD random read | 150 μs | ~1,500x slower than RAM |
| Read 1 MB sequentially from SSD | 1 ms | SSDs excel at sequential reads |
| Network round trip (same datacenter) | 500 μs | Assumes modern datacenter networking |
| HDD disk seek | 10 ms | Mechanical latency — avoid random reads |
| Read 1 MB sequentially from HDD | 20 ms | HDDs still viable for bulk sequential I/O |
| Network round trip (cross-continent) | 150 ms | Speed of light is the bottleneck |
| TLS handshake | 250 ms | 1–2 round trips depending on version |
| DNS lookup (uncached) | ~50 ms | Varies widely; caching helps enormously |
| TCP connection setup (3-way handshake) | ~1.5x RTT | One and a half round trips |
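The "x slower" notes in the table are plain ratios of the latency column. A minimal sketch that recomputes a few of them, assuming the figures above; the dictionary keys and the `times_slower` helper are hypothetical names chosen for illustration.

```python
# Latencies in nanoseconds, copied from the table above.
LATENCY_NS = {
    "l1_cache": 0.5,
    "ram": 100,
    "ssd_random_read": 150_000,
    "same_dc_round_trip": 500_000,
    "hdd_seek": 10_000_000,
    "cross_continent_round_trip": 150_000_000,
}


def times_slower(slow: str, fast: str) -> float:
    """Return how many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


print(times_slower("ram", "l1_cache"))              # RAM vs L1 cache
print(times_slower("ssd_random_read", "ram"))       # SSD random read vs RAM
print(times_slower("hdd_seek", "ssd_random_read"))  # HDD seek vs SSD random read
```

Working in one unit (nanoseconds here) is the design choice that keeps the ratios honest; mixing ns, μs, and ms in your head is where most estimation slips come from.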
Senior vs Staff Signal -- Latency Numbers
AI-Assisted Lens -- Latency Numbers
- LLM inference latency is a new tier. A GPT-4 class API call is 500ms-5s — slower than a cross-continent round trip. If you are adding AI features to a hot path, you need async patterns, streaming responses, or pre-computation. This single number reshapes architecture for AI-augmented products.
- AI-powered profiling tools (Datadog’s Watchdog, Dynatrace Davis AI) can automatically detect latency anomalies and correlate them with deployments, infrastructure changes, or traffic patterns — replacing hours of manual investigation.
- Copilot-assisted estimation. AI tools can sanity-check your back-of-envelope math in real time while you write design docs (not during the interview itself, obviously). Build the intuition so you do not need the crutch.
Quick Fire Q&A -- Latency
2. Database Selection Matrix
| Use Case | Recommended DB | Reasoning |
|---|---|---|
| Transactions, complex joins | PostgreSQL / MySQL | ACID guarantees, mature tooling, SQL standard |
| Flexible schema, rapid dev | MongoDB / DynamoDB | Document model maps to application objects, schema-on-read |
| Session store, caching, leaderboards | Redis / Memcached | Sub-ms latency, in-memory, simple key-value operations |
| Social networks, recommendations | Neo4j / Amazon Neptune | Native graph traversal, relationship-first data model |
| Metrics, IoT, monitoring | TimescaleDB / InfluxDB | Optimized for time-ordered writes and range queries |
| Full-text search, log analytics | Elasticsearch / OpenSearch | Inverted index, fuzzy matching, aggregation pipelines |
| Wide-column, massive scale | Cassandra / ScyllaDB | Linear horizontal scaling, tunable consistency |
| Embedded / edge devices | SQLite | Zero-config, single-file, surprisingly powerful |
| Multi-model (graph + doc + KV) | ArangoDB / SurrealDB | One engine for multiple access patterns |
Senior vs Staff Signal -- Database Selection
AI-Assisted Lens -- Database Selection
- Vector databases are a new category. If your system needs semantic search or AI-powered recommendations, you need a vector store (Pinecone, pgvector, Weaviate, Qdrant). This did not exist as a mainstream choice two years ago. Know when to bolt vector search onto your existing DB (pgvector) vs when to use a dedicated vector store (high-volume similarity search at <10ms).
- AI-generated query optimization. Tools like Amazon Q for databases and EverSQL use AI to analyze slow queries and suggest indexes, rewrites, or schema changes. A staff engineer evaluates these suggestions critically rather than blindly applying them.
- Schema design with AI. LLMs are surprisingly good at generating initial schema designs from natural-language requirements. The staff-level skill is reviewing the AI-generated schema for normalization issues, missing indexes for write-heavy patterns, and access-pattern mismatches that the AI cannot anticipate without production traffic data.
Quick Fire Q&A -- Database Selection
3. Caching Strategy Decision Tree
| Pattern | How It Works | When to Use | Trade-off |
|---|---|---|---|
| Cache-Aside | App checks cache; on miss, reads DB, fills cache | Read-heavy, general purpose | Possible stale data; app manages logic |
| Read-Through | Cache fetches from DB on miss automatically | Transparent caching | Cache library must support DB integration |
| Write-Through | Write to cache and DB synchronously | Cannot tolerate stale reads | Higher write latency (two writes) |
| Write-Behind | Write to cache; async flush to DB | Write-heavy, low-latency needs | Data loss risk if cache crashes pre-flush |
| Refresh-Ahead | Proactively refresh before TTL expires | Predictable access, low latency | Wasted resources if prediction is wrong |
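The Cache-Aside row can be sketched in a few lines. This is a minimal in-process illustration, not a production cache: the `CacheAside` class, its `load_from_db` callback, and the TTL default are all hypothetical, and a real deployment would typically put Redis or Memcached where the dictionary is.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CacheAside:
    """Cache-aside: the application owns both the cache lookup and the fill."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]],
                 ttl_seconds: float = 60.0) -> None:
        self._load_from_db = load_from_db
        self._ttl = ttl_seconds
        # key -> (expiry timestamp, cached value)
        self._store: Dict[str, Tuple[float, Optional[str]]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]  # cache hit
        value = self._load_from_db(key)  # miss: read the source of truth
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read refills from the database."""
        self._store.pop(key, None)


# Usage with a dictionary standing in for the database.
fake_db = {"user:1": "Ada"}
cache = CacheAside(fake_db.get)
print(cache.get("user:1"))  # first call reads fake_db, second would hit the cache
```

The stale-data trade-off from the table is visible here: a write to `fake_db` is not seen by readers until the TTL expires or the writer calls `invalidate`.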
Cache Invalidation — The Two Hard Problems
- TTL (Time-To-Live): Simple, but stale data during the window.
- Event-driven invalidation: Publish a cache-bust event on write. Accurate but adds coupling.
- Version keys: Append a version number to cache keys; bump version on write.
- Lease-based: Cache entry holds a lease; writer must acquire lease before updating.
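Of the four approaches above, version keys are the easiest to misread, so here is a small sketch of the idea. The `VersionedCache` class and its method names are hypothetical; in practice the version counter usually lives in the cache or database itself rather than in process memory.

```python
from typing import Dict, Optional


class VersionedCache:
    """Version-key invalidation: readers build keys from the current version,
    so bumping the version makes every older entry unreachable."""

    def __init__(self) -> None:
        self._versions: Dict[str, int] = {}  # entity id -> current version
        self._cache: Dict[str, str] = {}     # versioned key -> cached value

    def _key(self, entity_id: str) -> str:
        return f"{entity_id}:v{self._versions.get(entity_id, 0)}"

    def get(self, entity_id: str) -> Optional[str]:
        return self._cache.get(self._key(entity_id))

    def put(self, entity_id: str, value: str) -> None:
        self._cache[self._key(entity_id)] = value

    def bump(self, entity_id: str) -> None:
        """Call on write. Old entries are not deleted here; they simply stop
        being read and are left for eviction (for example by TTL)."""
        self._versions[entity_id] = self._versions.get(entity_id, 0) + 1


cache = VersionedCache()
cache.put("product:7", "old price")
cache.bump("product:7")        # a write happened
print(cache.get("product:7"))  # None: the old entry is no longer addressable
```

The appeal is that invalidation becomes a single counter increment instead of a delete fan-out; the cost is that dead entries occupy memory until they age out.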
Why It Matters in Production -- Caching
Senior vs Staff Signal -- Caching
AI-Assisted Lens -- Caching
- AI-powered cache warming. ML models can predict which keys will be requested next based on access patterns, pre-warming the cache proactively. Netflix uses this approach for content metadata caching.
- Intelligent TTL tuning. Instead of static TTLs, AI can dynamically adjust TTL per key based on access frequency and change frequency. A product page viewed 10,000 times/hour gets a longer TTL than one viewed 3 times/day.
- LLM response caching is a new challenge. Semantic caching (caching LLM responses by meaning, not exact input) using embedding similarity is an emerging pattern. If two prompts are semantically identical, serve the cached response instead of making a $0.03 API call.
Quick Fire Q&A -- Caching
4. API Style Comparison
| Dimension | REST | gRPC | GraphQL | WebSocket |
|---|---|---|---|---|
| Protocol | HTTP/1.1 or HTTP/2 | HTTP/2 (always) | HTTP/1.1 or HTTP/2 | TCP (upgraded from HTTP) |
| Payload format | JSON (typically) | Protocol Buffers (binary) | JSON | Any (text or binary frames) |
| Best for | Public APIs, CRUD | Internal microservices, low-latency | Mobile/frontend with varied data needs | Real-time bidirectional communication |
| Streaming | Not native (SSE possible) | Bidirectional streaming built-in | Subscriptions via WebSocket | Full-duplex by design |
| Tooling | Excellent (Postman, curl) | Growing (grpcurl, BloomRPC) | Good (GraphiQL, Apollo) | Moderate (wscat) |
| Schema/Contract | OpenAPI / Swagger | .proto files (strict) | SDL (strongly typed) | No built-in contract |
| Overhead | Moderate (text-based) | Low (binary, multiplexed) | Moderate (single endpoint) | Low after handshake |
| Cacheability | Excellent (HTTP caching) | Hard (binary, no native HTTP cache) | Hard (POST requests) | Not applicable |
| Browser support | Native | Requires grpc-web proxy | Native | Native |
Senior vs Staff Signal -- API Styles
AI-Assisted Lens -- API Styles
- AI-powered API generation. Tools like GitHub Copilot and Cursor can generate OpenAPI specs, protobuf definitions, and GraphQL schemas from natural-language descriptions. The staff-level skill is reviewing generated contracts for backward compatibility, proper error modeling, and missing edge cases (pagination, rate limiting, partial failures).
- LLM-as-a-service APIs are creating a new API style: streaming JSON over SSE for token-by-token responses. If you are building AI features, you need to understand SSE streaming, chunked transfer encoding, and how to display partial results in UIs. This is a hybrid of REST and real-time protocols.
- Automatic API documentation. AI tools can generate human-readable documentation from OpenAPI specs, and vice versa. But the staff engineer knows that the best API docs include example workflows, error handling guides, and rate limit explanations — context that AI cannot infer from the schema alone.
Quick Fire Q&A -- API Styles
graphql-ws protocol). The subscription is the GraphQL abstraction (declarative, typed, schema-driven), WebSocket is the underlying transport. You can also implement subscriptions over SSE for simpler use cases.
Q: You are building a public API. REST or GraphQL?
A: REST, almost always. Public APIs need aggressive HTTP caching (CDN-friendly), simple rate limiting (requests/second, not query-cost analysis), stable versioning (URL-based), and universal client compatibility (curl, any HTTP library). GraphQL’s advantages (flexible querying, no over-fetching) primarily benefit first-party clients where you control both sides.
5. Deployment Strategy Matrix
| Strategy | Risk Level | Downtime | Infra Cost | Complexity | Rollback Speed | Best For |
|---|---|---|---|---|---|---|
| Rolling | Medium | Zero | Low | Low | Slow | Stateless services, general use |
| Blue-Green | Low | Zero | High (2x) | Medium | Instant | Critical services needing instant rollback |
| Canary | Low | Zero | Medium | High | Fast | High-traffic services, gradual validation |
| Shadow | Very Low | Zero | High | Very High | N/A (no live traffic affected) | Testing new versions with real traffic patterns |
| Recreate | High | Yes | Low | Low | Slow | Dev/staging, or when in-place upgrade is required |
| A/B Testing | Low | Zero | Medium | High | Fast | Feature experiments, UX testing |
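Canary and A/B rows both depend on sending a stable slice of traffic to the new version. One common way to do that, sketched here under the assumption that requests carry a user ID, is to hash the ID into 100 buckets. The `in_canary` function is a hypothetical helper, not any particular tool's API.

```python
import hashlib


def in_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and route the
    lowest `canary_percent` buckets to the canary version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent


# The same user always lands in the same bucket, so raising the percentage
# only adds users to the canary; nobody flips back and forth.
for percent in (1, 10, 50, 100):
    print(percent, in_canary("user-42", percent))
```

Hashing rather than random sampling is the design choice that matters: it keeps each user's experience consistent across requests and makes rollback a one-line change (set the percentage to 0).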
Why It Matters in Production -- Deployments
Senior vs Staff Signal -- Deployment Strategies
AI-Assisted Lens -- Deployment Strategies
- AI-powered canary analysis. Tools like Harness and Dynatrace use ML to automatically compare canary metrics against baseline, detecting anomalies humans would miss (subtle latency distribution shifts, slow memory leaks, gradual error rate increase).
- Predictive rollback. AI models trained on historical deployment data can predict whether a deploy will need rollback before the canary period completes, based on early signals (first 60 seconds of metrics).
- AI-assisted rollback root cause analysis. When a deploy is rolled back, AI tools can correlate the diff, error logs, and metric changes to pinpoint the exact code change that caused the regression — reducing mean time to fix from hours to minutes.
Quick Fire Q&A -- Deployments
6. Authentication Method Decision Matrix
| Method | Use Case | Stateful? | Revocation | Complexity | Scalability |
|---|---|---|---|---|---|
| Session | Traditional web apps | Yes | Easy (delete from store) | Low | Requires shared store (Redis) |
| JWT | Stateless APIs, microservices | No | Hard (must wait for expiry or use blocklist) | Medium | Excellent (no central store) |
| OAuth 2.0 | Third-party access, SSO | Depends | Moderate (token revocation endpoint) | High | Good |
| API Key | Server-to-server, developer APIs | Yes | Easy (delete key) | Low | Good |
| mTLS | Zero-trust service mesh, internal | No | Hard (CRL/OCSP) | Very High | Excellent |
| SAML | Enterprise SSO | Yes | Moderate | High | Good |
| Passkeys/WebAuthn | Passwordless consumer auth | No | Easy (remove credential) | Medium | Excellent |
Why It Matters in Production -- Auth
Senior vs Staff Signal -- Authentication
AI-Assisted Lens -- Authentication
- AI-powered anomaly detection for auth. ML models detect suspicious login patterns (impossible travel, credential stuffing, token abuse) in real-time. AWS GuardDuty and Datadog Security Monitoring use AI to flag compromised credentials before manual review could catch them.
- AI code review for auth vulnerabilities. Tools like Snyk Code and Semgrep use AI to detect auth anti-patterns in code: hardcoded secrets, missing CSRF tokens, JWT validation bypass paths, insecure token storage. These should be in your CI pipeline.
- Passkeys and the passwordless future. AI-powered biometric authentication (face, fingerprint) is replacing passwords. As an engineer, understand that WebAuthn/FIDO2 eliminates the entire class of password-related attacks (phishing, credential stuffing, brute force). This is not a trend — it is the end state.
Quick Fire Q&A -- Authentication
Q: Why is 403 Forbidden not always the right response for authorization failures?
A: Sometimes you should return 404 Not Found instead. If a user should not even know a resource exists (e.g., another tenant’s data in a multi-tenant system), returning 403 leaks information — the attacker now knows the resource exists. Return 404 to hide the resource’s existence entirely. This is called “authorization by obscurity” layered on top of real authorization.
7. Message Queue Comparison
| Dimension | Kafka | RabbitMQ | SQS | Redis Streams |
|---|---|---|---|---|
| Throughput | Millions/sec | Tens of thousands/sec | Nearly unlimited (managed) | Hundreds of thousands/sec |
| Ordering | Per-partition | Per-queue (with caveats) | Best-effort (FIFO available) | Per-stream |
| Persistence | Disk (configurable retention) | Optional (disk or memory) | Managed (AWS handles it) | AOF / RDB snapshots |
| Delivery | At-least-once / exactly-once | At-least-once / at-most-once | At-least-once / exactly-once (FIFO) | At-least-once |
| Consumer model | Pull-based consumer groups | Push-based (with prefetch) | Pull-based polling | Consumer groups (pull) |
| Best for | Event streaming, log pipelines | Task queues, RPC, routing | Serverless, AWS-native apps | Lightweight streaming with existing Redis |
| Operational cost | High (ZooKeeper/KRaft) | Medium (Erlang runtime) | Zero (fully managed) | Low (Redis add-on) |
When to use a message queue vs direct API calls
Use a message queue when:
- The downstream service can be temporarily unavailable
- You need to decouple producers from consumers
- Work can be processed asynchronously
- You need to buffer traffic spikes
- Multiple consumers need the same event
Use direct API calls when:
- You need a synchronous response
- The operation must complete before proceeding
- Latency is critical (queues add latency)
- The system is simple enough that a queue adds unjustified complexity
Senior vs Staff Signal -- Message Queues
AI-Assisted Lens -- Message Queues
- AI-powered consumer lag prediction. ML models can predict consumer lag hours before it becomes critical, based on producer throughput trends and consumer processing patterns. This enables preemptive scaling of consumers.
- Intelligent dead-letter queue triage. Instead of manually inspecting DLQ messages, AI can classify failure reasons, suggest fixes, and even auto-retry messages that failed due to transient issues vs routing truly poisonous messages for human review.
- Event-driven AI pipelines. Modern AI systems use Kafka as the backbone for real-time feature engineering, model inference pipelines, and feedback loops. Understanding Kafka is now table stakes for ML engineers, not just backend engineers.
Quick Fire Q&A -- Message Queues
8. Container Orchestration Quick Reference
Core Kubernetes Objects
| Object | What It Does |
|---|---|
| Pod | Smallest deployable unit; one or more containers sharing network/storage |
| Deployment | Manages ReplicaSets; handles rolling updates and rollbacks |
| ReplicaSet | Ensures a specified number of pod replicas are running at all times |
| Service | Stable network endpoint that routes traffic to a set of pods |
| Ingress | HTTP/HTTPS routing rules from external traffic to internal services |
| ConfigMap | Injects non-sensitive configuration data into pods as env vars or files |
| Secret | Stores sensitive data (tokens, passwords) with base64 encoding |
| StatefulSet | Like Deployment but with stable pod identity and persistent storage |
| DaemonSet | Runs exactly one pod per node (logging agents, monitoring) |
| Job / CronJob | Runs a task to completion once (Job) or on a schedule (CronJob) |
| Namespace | Virtual cluster for isolating resources within the same physical cluster |
| PersistentVolume (PV) | A piece of storage provisioned in the cluster |
| PersistentVolumeClaim (PVC) | A request for storage by a pod |
| HorizontalPodAutoscaler | Scales pod count based on CPU, memory, or custom metrics |
| NetworkPolicy | Firewall rules controlling pod-to-pod and external traffic |
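The Secret row says "base64 encoding", and it is worth seeing how little protection that is. A quick sketch with a made-up value; no key is involved at any step.

```python
import base64

# A value as it might appear under `data:` in a Secret manifest (made-up example).
encoded = base64.b64encode(b"s3cr3t-password").decode("ascii")

# Anyone who can read the manifest can reverse it; there is no key to steal.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded, "->", decoded)
```

Base64 exists here so binary data survives YAML, not to hide it. Confidentiality has to come from access control on the Secret and encryption at rest.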
Senior vs Staff Signal -- Kubernetes
c5.2xlarge node running 30% utilized pods is burning money, and drives bin-packing optimization. Evaluates when Kubernetes is overkill (team of 3, two services) and when it is essential (50+ services, multi-region). Champions GitOps (ArgoCD/Flux) for deployment consistency.
AI-Assisted Lens -- Kubernetes
- AI-powered resource right-sizing. Tools like Kubecost and StormForge use ML to analyze historical usage and recommend optimal resource requests/limits — eliminating the manual guesswork that wastes 30-60% of cluster spend at most organizations.
- AI-assisted YAML generation and debugging. LLMs can generate K8s manifests from natural language, but more importantly, they can explain why a pod is not scheduling (kubectl describe output is notoriously verbose — AI can extract the one relevant line from 200 lines of output).
- Predictive autoscaling. Instead of reactive HPA (scale after CPU spikes), AI models predict load 10-30 minutes ahead based on historical traffic patterns and pre-scale. This eliminates cold-start latency during traffic ramps.
Quick Fire Q&A -- Kubernetes
pod-0, pod-1) and persistent volumes that survive rescheduling. If you deploy a database on a Deployment, you lose your data when the pod restarts.
Q: A pod is stuck in Pending state. What do you check?
A: Three things in order: (1) kubectl describe pod Events section — usually tells you directly. (2) Insufficient resources — the scheduler cannot find a node with enough CPU/memory. Check kubectl describe node for allocatable vs allocated. (3) Affinity/anti-affinity rules or taints preventing scheduling. The most common cause in production: someone set resource requests too high and no node can satisfy them.
Q: Why do Kubernetes Secrets use base64 encoding instead of encryption?
A: Base64 is encoding, not encryption — anyone with kubectl get secret access can read them. Secrets are stored encrypted at rest in etcd (if you enable encryption at rest), but the base64 encoding is just for safe YAML transport of binary data. For actual secret management, use an external provider: AWS Secrets Manager, HashiCorp Vault, or the External Secrets Operator.
9. Common HTTP Status Codes for Engineers
Success (2xx)
| Code | Name | When to Use |
|---|---|---|
| 200 | OK | Standard success for GET, PUT, PATCH |
| 201 | Created | Resource successfully created (POST) |
| 202 | Accepted | Request accepted for async processing (not yet completed) |
| 204 | No Content | Success with no response body (DELETE, PUT with no return) |
Redirection (3xx)
| Code | Name | When to Use |
|---|---|---|
| 301 | Moved Permanently | Resource URL has permanently changed (SEO-safe redirect) |
| 302 | Found | Temporary redirect (use 307 for strict method preservation) |
| 304 | Not Modified | Client cache is still valid (conditional GET) |
Client Error (4xx)
| Code | Name | When to Use |
|---|---|---|
| 400 | Bad Request | Malformed syntax, invalid parameters, validation failure |
| 401 | Unauthorized | Missing or invalid authentication credentials |
| 403 | Forbidden | Authenticated but not authorized for this resource |
| 404 | Not Found | Resource does not exist at this URI |
| 405 | Method Not Allowed | HTTP method not supported on this endpoint |
| 409 | Conflict | State conflict (duplicate resource, concurrent edit) |
| 422 | Unprocessable Entity | Syntactically valid but semantically incorrect |
| 429 | Too Many Requests | Rate limit exceeded — include Retry-After header |
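The 401, 403, and 404 rows are the ones most often mixed up, so here is a small decision sketch for a read request. It assumes a hypothetical multi-tenant API and a policy of hiding other tenants' resources behind 404; `status_for_read` and its parameters are illustrative names, not part of any framework.

```python
from typing import Optional


def status_for_read(user_id: Optional[str], user_tenant: Optional[str],
                    resource_tenant: Optional[str], can_read: bool) -> int:
    """Pick a status code for a read request in a hypothetical multi-tenant API.

    `resource_tenant` is None when the resource does not exist.
    """
    if user_id is None:
        return 401  # no valid credentials: "who are you?"
    if resource_tenant is None:
        return 404  # nothing at this URI
    if resource_tenant != user_tenant:
        return 404  # exists, but its existence is hidden from other tenants
    if not can_read:
        return 403  # known resource in the caller's tenant, missing permission
    return 200


print(status_for_read(None, None, "acme", False))     # unauthenticated caller
print(status_for_read("u1", "acme", "globex", True))  # cross-tenant access
print(status_for_read("u1", "acme", "acme", False))   # same tenant, no permission
```

The order of the checks is the point: authentication first, then existence and visibility, and only then permission.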
Server Error (5xx)
| Code | Name | When to Use |
|---|---|---|
| 500 | Internal Server Error | Unhandled exception — generic server failure |
| 502 | Bad Gateway | Upstream service returned an invalid response |
| 503 | Service Unavailable | Server is overloaded or in maintenance — temporary |
| 504 | Gateway Timeout | Upstream service did not respond in time |
Quick Fire Q&A -- HTTP Status Codes
Q: An API returns 200 OK with the body {"success": false, "error": "payment failed"}. What is wrong with this?
A: You are lying to the HTTP layer. Monitoring tools, CDNs, load balancers, and retry logic all use status codes to determine success. A 200 with an error body means your dashboards show green while customers are failing. Use proper status codes: 402 Payment Required, 422 Unprocessable Entity, or 400 Bad Request depending on the failure reason.
Q: 502 vs 503 vs 504 — how do you differentiate them operationally?
A: 502 (Bad Gateway): the load balancer reached your service but got garbage back — your app crashed mid-response or returned invalid HTTP. 503 (Service Unavailable): your service is explicitly saying “I am overloaded or in maintenance, try later.” 504 (Gateway Timeout): the load balancer waited for your service and gave up — your app is alive but too slow. Debugging priority: 502 = check app crash logs, 503 = check scaling/circuit breakers, 504 = check downstream dependencies and timeouts.
Q: When should you use 202 Accepted?
A: When the request was valid and accepted for processing, but the work is not done yet. Classic use cases: email sending, report generation, bulk imports, or any async workflow. Return 202 with a Location header pointing to a status endpoint where the client can poll for completion. This is how you design non-blocking APIs.
10. The “Nines” Table — Availability Reference
| Availability | Common Name | Downtime / Year | Downtime / Month | Downtime / Week |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.31 hours | 1.68 hours |
| 99.9% | Three nines | 8.77 hours | 43.83 minutes | 10.08 minutes |
| 99.95% | Three and a half | 4.38 hours | 21.92 minutes | 5.04 minutes |
| 99.99% | Four nines | 52.60 minutes | 4.38 minutes | 1.01 minutes |
| 99.999% | Five nines | 5.26 minutes | 26.30 seconds | 6.05 seconds |
| 99.9999% | Six nines | 31.56 seconds | 2.63 seconds | 0.60 seconds |
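Every cell in the table comes from one multiplication, and the same arithmetic shows why dependencies in the critical path hurt. A minimal sketch, assuming a 365.25-day year (which matches the table) and independent failures for the serial case; both function names are hypothetical.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability such as 0.9999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to be up.

    Assumes failures are independent, which real systems only approximate.
    """
    result = 1.0
    for availability in components:
        result *= availability
    return result


print(round(downtime_minutes_per_year(0.9999), 2))  # a single four-nines service

# Three four-nines services in series: the path is worse than any one of them.
path = serial_availability(0.9999, 0.9999, 0.9999)
print(round(path, 6), round(downtime_minutes_per_year(path), 1))
```

In an interview, doing this multiplication out loud for the critical path is usually enough to justify redundancy or graceful degradation for the weakest dependency.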
How to reason about SLAs in system design interviews
- Redundancy: Run multiple replicas across availability zones.
- Eliminate single points of failure: Every component in the critical path needs failover.
- Graceful degradation: Serve cached/stale data instead of failing entirely.
- Health checks + auto-restart: Detect and recover from failures automatically.
Why It Matters in Production -- Availability
Senior vs Staff Signal -- Availability
AI-Assisted Lens -- Availability
- AIOps for incident detection. AI-powered tools (PagerDuty AIOps, BigPanda, Moogsoft) correlate alerts across services to reduce noise and identify the root cause faster. Instead of 50 alerts firing during an incident, AI groups them into one actionable incident with a probable root cause.
- Predictive failure detection. ML models trained on historical metrics can predict failures 15-60 minutes before they happen (disk filling up, memory leak trajectory, connection pool approaching exhaustion), enabling preemptive action.
- SLO tracking automation. Tools like Nobl9 and Datadog SLOs automate error budget tracking, alerting when you are burning budget faster than expected, and recommending whether to freeze deployments or proceed.
Quick Fire Q&A -- Availability
11. Back-of-Envelope Estimation Cheat Sheet
Powers of 2 — Capacity Reference
| Power | Exact Value | Approximate Size |
|---|---|---|
| 2^10 | 1,024 | ~1 Thousand (1 KB) |
| 2^20 | 1,048,576 | ~1 Million (1 MB) |
| 2^30 | 1,073,741,824 | ~1 Billion (1 GB) |
| 2^40 | 1,099,511,627,776 | ~1 Trillion (1 TB) |
| 2^50 | 1,125,899,906,842,624 | ~1 Quadrillion (1 PB) |
Common Estimation Building Blocks
| Metric | Value |
|---|---|
| Seconds in a day | 86,400 (~10^5) |
| Seconds in a month | ~2.6 million (~2.5 x 10^6) |
| Seconds in a year | ~31.5 million (~3 x 10^7) |
| Average size of a tweet / text post | ~0.5 KB |
| Average size of a photo (compressed) | ~200 KB – 2 MB |
| Average size of a short video (1 min) | ~10 MB |
| Average HTTP request/response | ~1–10 KB |
| Characters in a URL | ~100 bytes |
QPS Quick Math
| Daily Active Users | Actions/User/Day | QPS (avg) | QPS (peak, ~3x avg) |
|---|---|---|---|
| 1 million | 10 | ~115 | ~350 |
| 10 million | 10 | ~1,150 | ~3,500 |
| 100 million | 10 | ~11,500 | ~35,000 |
| 1 billion | 10 | ~115,000 | ~350,000 |
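The table rows all come from the formula QPS = (DAU x actions/user) / 86,400 with a 3x peak factor. A short sketch for recomputing them with other inputs; the function names and the `peak_factor` parameter are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def average_qps(daily_active_users: int, actions_per_user: float) -> float:
    """Average queries per second, assuming load is spread evenly over the day."""
    return daily_active_users * actions_per_user / SECONDS_PER_DAY


def peak_qps(daily_active_users: int, actions_per_user: float,
             peak_factor: float = 3.0) -> float:
    """Peak QPS using the rule of thumb that peak is about 3x average."""
    return average_qps(daily_active_users, actions_per_user) * peak_factor


for dau in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
    print(dau, round(average_qps(dau, 10)), round(peak_qps(dau, 10)))
```

Exact results differ slightly from the rounded table values (for example about 116 rather than ~115 for 1 million DAU), which is fine: the goal is the order of magnitude.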
Storage Estimation Formula
Senior vs Staff Signal -- Estimation
AI-Assisted Lens -- Estimation
- AI-assisted capacity planning. Cloud providers (AWS Compute Optimizer, Google Recommender) use ML to analyze your actual usage patterns and recommend right-sized instances, reserved instance purchases, and storage tier transitions. This replaces manual estimation with data-driven forecasting.
- LLM cost estimation is a new skill. If you are building AI features, you must estimate: tokens per request x requests per day x cost per token = daily LLM API cost. For example, a GPT-4 class call at ~$0.03 across 1 million requests/day comes to ~$30K/day. This estimation skill did not exist two years ago and is now essential for AI product budgeting.
- Sanity-checking estimates with AI. LLMs are excellent at catching order-of-magnitude errors in back-of-envelope calculations. “Does it make sense that a URL shortener needs 110 TB over 5 years?” Use AI as a reviewer for your math, not a replacement for your intuition.
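The LLM cost formula in the list above (tokens per request x requests per day x cost per token) is easy to turn into a reusable check. The sketch below assumes pricing quoted per 1,000 tokens; the function name and the example numbers are hypothetical and not real vendor prices.

```python
def daily_llm_cost(tokens_per_request: int, requests_per_day: int,
                   usd_per_1k_tokens: float) -> float:
    """Daily API cost in USD: tokens/request x requests/day x cost/token."""
    return tokens_per_request / 1000 * requests_per_day * usd_per_1k_tokens


# Illustrative inputs only: 1,000 tokens per request, 1 million requests per day,
# and a made-up price of $0.03 per 1,000 tokens.
print(daily_llm_cost(1_000, 1_000_000, 0.03))
```

Keeping the price as a parameter matters because it is the input most likely to change; the estimate should be re-run whenever the model or pricing tier does.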
Quick Fire Q&A -- Estimation
12. Design Pattern Quick Reference
| Pattern | Problem It Solves | When NOT to Use |
|---|---|---|
| Singleton | Ensures one instance globally (config, connection pool) | When it hides dependencies or makes testing difficult |
| Factory Method | Decouples object creation from usage | When there is only one concrete type and it will not change |
| Observer | One-to-many notifications on state change | When the order of notification matters or chains get deep |
| Strategy | Swap algorithms at runtime without changing client code | When there is only one algorithm and no foreseeable variation |
| Decorator | Adds behavior to objects dynamically without subclassing | When the combination explosion of wrappers becomes unreadable |
| Adapter | Makes incompatible interfaces work together | When you can modify the original interface instead |
| Builder | Constructs complex objects step-by-step | For simple objects where a constructor with parameters suffices |
| Proxy | Controls access to an object (lazy load, access control, caching) | When the indirection adds latency with no real benefit |
| Circuit Breaker | Prevents cascading failures by stopping calls to failing services | When failures are transient and retries are cheap |
| CQRS | Separates read and write models for scalability | For simple CRUD apps where read/write patterns are identical |
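The Circuit Breaker row is the pattern interviewers most often ask to see sketched. Below is a minimal single-threaded version with closed, open, and half-open behavior. The class, its thresholds, and the injectable `clock` are assumptions made for illustration; production code would normally use an existing resilience library and add locking.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # any success closes the circuit
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=5.0)
print(breaker.call(lambda: "payment service responded"))
```

Failing fast while the circuit is open is what prevents the cascading failure: callers stop tying up threads and connections on a dependency that is already down.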
Distributed System Patterns Worth Knowing
| Pattern | Purpose |
|---|---|
| Saga | Manage distributed transactions across microservices |
| Event Sourcing | Store state changes as an immutable sequence of events |
| Sidecar | Attach utility processes alongside your main container |
| Bulkhead | Isolate failures to prevent one component from sinking all |
| Strangler Fig | Incrementally migrate from legacy to new system |
| Leader Election | Coordinate a single active node among replicas |
| Consistent Hashing | Distribute load evenly with minimal remapping on scaling |
| Outbox Pattern | Reliably publish events alongside database transactions |
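The Consistent Hashing row promises "minimal remapping on scaling", which is easier to believe once you see it. This is a minimal ring with virtual nodes, assuming SHA-256 for placement; the `HashRing` class and the `vnodes` default are hypothetical choices for illustration.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: List[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: Dict[int, str] = {}  # point on the ring -> owning node
        self._sorted: List[int] = []     # ring points in sorted order
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Several virtual nodes per physical node smooth out the distribution.
        for i in range(self._vnodes):
            point = _hash(f"{node}#{i}")
            self._ring[point] = node
            bisect.insort(self._sorted, point)

    def node_for(self, key: str) -> str:
        """Walk clockwise to the first virtual node at or after the key's hash."""
        index = bisect.bisect_left(self._sorted, _hash(key)) % len(self._sorted)
        return self._ring[self._sorted[index]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}
ring.add("cache-d")
moved = sum(1 for key in keys if ring.node_for(key) != before[key])
print(f"{moved} of {len(keys)} keys moved after adding a node")
```

With plain `hash(key) % node_count`, adding a fourth node would remap most keys; on the ring, only the keys that now fall to the new node move.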
Senior vs Staff Signal -- Design Patterns
AI-Assisted Lens -- Design Patterns
- AI code review for pattern misuse. LLMs can detect anti-patterns in code reviews: Singleton hiding dependencies, God objects violating SRP, deep inheritance hierarchies that should be composition. Use AI as a pattern smell detector in CI.
- Pattern selection assistance. Describe your problem to an LLM and it can suggest applicable patterns with trade-offs. The staff-level skill is evaluating whether the suggested pattern is the simplest solution or over-engineered for the context.
- Refactoring with AI. LLMs excel at mechanical refactoring: extracting a strategy pattern from a switch statement, converting inheritance to composition, or introducing a builder for a complex constructor. The human judgment is deciding WHEN to refactor, not how.
Quick Fire Q&A -- Design Patterns
13. SOLID Principles — One-Liner
| Principle | One-Liner | Code Smell It Prevents |
|---|---|---|
| S — Single Responsibility | A class should have only one reason to change. | God classes that touch everything |
| O — Open/Closed | Open for extension, closed for modification. | Modifying existing code every time a new type appears |
| L — Liskov Substitution | Subtypes must be usable wherever their parent type is expected. | Subclasses that break parent behavior or throw unexpected errors |
| I — Interface Segregation | No client should be forced to depend on methods it does not use. | Fat interfaces where implementors stub out half the methods |
| D — Dependency Inversion | Depend on abstractions, not concretions. | Tightly coupled modules that cannot be tested or swapped |
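As a rough illustration of the Dependency Inversion row, the sketch below has a service depend on an abstraction and receive the concrete store through its constructor. The `Database` protocol, its `save` signature, and `InMemoryDatabase` are hypothetical names introduced for this example.

```python
from typing import Protocol


class Database(Protocol):
    # The abstraction the service depends on (illustrative signature).
    def save(self, order_id: str, payload: dict) -> None: ...


class InMemoryDatabase:
    # A stand-in implementation, handy as a test double.
    def __init__(self) -> None:
        self.rows = {}

    def save(self, order_id: str, payload: dict) -> None:
        self.rows[order_id] = payload


class OrderService:
    def __init__(self, db: Database) -> None:
        # The concrete database is injected; OrderService never constructs one itself.
        self._db = db

    def place_order(self, order_id: str, items: list) -> None:
        self._db.save(order_id, {"items": items})
```

Swapping `InMemoryDatabase` for a real driver requires no change to `OrderService`, which is the testability benefit the principle buys.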
Mnemonics and practical examples
S — Single Responsibility:
Bad: A User class that handles authentication, database access, and email sending.
Good: Separate UserAuth, UserRepository, and EmailService classes.
O — Open/Closed:
Bad: A giant if/else chain that grows every time you add a payment method.
Good: A PaymentProcessor interface with StripeProcessor and PayPalProcessor implementations.
L — Liskov Substitution:
Bad: A Square that extends Rectangle but breaks when setWidth is called independently.
Good: Use a common Shape interface instead of inheritance.
I — Interface Segregation:
Bad: A Worker interface with work(), eat(), sleep() — robots do not eat.
Good: Split into Workable, Eatable, Sleepable interfaces.
D — Dependency Inversion:
Bad: OrderService creates new MySQLDatabase() directly.
Good: OrderService accepts a Database interface via constructor injection.
Senior vs Staff Signal -- SOLID Principles
Quick Fire Q&A -- SOLID
ReadOnlyRepository that extends Repository but throws UnsupportedOperationException on save() and delete(). Any code that accepts a Repository and calls save() will break. The fix: ReadOnlyRepository should be a separate interface, not a subclass. The LSP violation is that the subtype cannot be safely substituted where the parent is expected.
Q: When is the Open/Closed Principle actually harmful?
A: When you add extension points for variation that never comes. Building a PaymentProcessor interface with a plugin system when you only ever use Stripe adds indirection, makes debugging harder, and slows down onboarding — all for a hypothetical second payment provider. Add the abstraction when you need it, not before. YAGNI trumps OCP when there is no evidence of variation.
Q: Which SOLID principle does DI (dependency injection) implement?
A: Dependency Inversion (D) — you depend on abstractions (interfaces), not concretions (implementations). But it also enables testability (substituting mocks), which is a practical benefit beyond the principle itself. The principle is the “why,” DI is the “how.”
14. Git Commands Engineers Actually Use
Beyond the Basics
| Command | What It Does |
|---|---|
| `git log --oneline --graph --all` | Visualize the entire branch topology in your terminal |
| `git diff --staged` | See exactly what will be committed (staged changes only) |
| `git stash -u` | Stash all changes including untracked files |
| `git stash pop` | Re-apply the most recent stash and remove it from the stash list |
| `git cherry-pick <commit>` | Apply a single commit from another branch onto current branch |
| `git rebase -i HEAD~N` | Interactively squash, reorder, or edit the last N commits |
| `git bisect start / good / bad` | Binary search through commits to find the one that introduced a bug |
| `git reflog` | View the full history of HEAD — your safety net for “I lost my work” |
| `git reset --soft HEAD~1` | Undo last commit but keep changes staged |
| `git blame -L 10,20 file.py` | See who last modified lines 10–20 (great for understanding context) |
| `git log -S "functionName"` | Search commit history for when a string was added or removed |
| `git shortlog -sn --no-merges` | Leaderboard of contributors by commit count |
| `git clean -fd` | Remove all untracked files and directories (destructive) |
| `git worktree add ../feature-branch feature` | Check out a branch in a separate directory without switching |
| `git commit --fixup <commit>` | Mark a commit as a fixup for a previous commit (use with autosquash) |
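The "binary search through commits" idea behind `git bisect` can be sketched in a few lines. This is only an illustration of the search on a linear history; it is not how Git itself is implemented (real histories are graphs), and `first_bad_commit` and `is_bad` are hypothetical names.

```python
def first_bad_commit(commits, is_bad):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad,
    mirroring the `git bisect good` / `git bisect bad` starting points.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        steps += 1  # each step corresponds to one build-and-test cycle
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], steps
```

With 1,000 commits between the known-good and known-bad points, the search needs at most about ten test cycles, which is why bisecting beats reading diffs by hand.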
Aliases Worth Setting Up
AI-Assisted Lens -- Git
- AI-powered commit message generation. Tools like GitHub Copilot and conventional-commit plugins generate commit messages from diffs. Useful for routine changes but staff engineers write messages that explain “why,” not “what” — AI struggles with intent.
- AI code review. GitHub Copilot code review, CodeRabbit, and similar tools can catch bugs, style issues, and security problems in PRs. The staff-level skill is configuring these tools to catch what matters (security, performance, breaking changes) and ignore what does not (style nitpicks the team has not agreed on).
- AI-assisted git bisect. When `git bisect` finds the offending commit, AI can analyze the diff and explain why that change caused the regression — reducing debugging time from hours to minutes.
Quick Fire Q&A -- Git
Q: git rebase vs git merge — when do you use each?
A: Rebase for a clean linear history on feature branches before merging to main. Merge for preserving the branch topology when multiple people worked on a branch or when you want an explicit merge commit for traceability. Golden rule: never rebase commits that have been pushed to a shared branch — it rewrites history that others depend on.
Q: You accidentally committed a secret to git. It is already pushed. What do you do?
A: (1) Immediately rotate the secret — assume it is compromised the moment it hits the remote. (2) Use git filter-repo (the tool the Git project now recommends over git filter-branch) or BFG Repo-Cleaner to remove it from history. (3) Force-push the cleaned history. (4) Notify the team. Rotating the secret is step 1, not step 4 — do not waste time cleaning history while the secret is live.
Q: What is git reflog and when is it your lifesaver?
A: Reflog records every change to HEAD, including operations that are not in git log (rebases, resets, amends). If you accidentally git reset --hard and lose commits, git reflog shows the commit hashes before the reset, and git checkout or git reset to that hash recovers your work. It is the safety net for destructive git operations.
15. Common Complexity Patterns
Know which data structure or algorithm gives you each time complexity. Pattern-matching this in interviews is a superpower.
| Complexity | Pattern | Typical Data Structure / Algorithm |
|---|---|---|
| O(1) | Direct lookup | Hash map, array index access |
| O(log n) | Halve the search space each step | Binary search, balanced BST, skip list |
| O(n) | Touch every element once | Linear scan, single-pass counting |
| O(n log n) | Sort then process | Merge sort, heap sort, sort-based problems |
| O(n^2) | Nested comparison of all pairs | Brute-force pair matching, bubble sort |
| O(2^n) | All subsets / combinations | Backtracking, power set generation |
| O(n!) | All permutations | Permutation-based brute force (TSP) |
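To show what pattern-matching a complexity to a data structure looks like, the sketch below solves the same pair-finding problem twice: once with the O(n^2) nested comparison from the table, and once with O(1) hash-map lookups. The function names are illustrative.

```python
def two_sum_bruteforce(nums, target):
    # O(n^2): nested comparison of all pairs.
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return i, j
    return None


def two_sum_hashmap(nums, target):
    # O(n): one pass, with an average-case O(1) dict lookup per element.
    seen = {}  # value -> index where it was first seen
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return i, j
        seen.setdefault(value, j)
    return None
```

The trade is the usual one: the hash-map version spends O(n) extra memory to remove the inner loop.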
Senior vs Staff Signal -- Complexity
Quick Fire Q&A -- Complexity
ArrayList.add() is O(1) amortized because most adds are O(1), but occasionally the array doubles in size (O(n) copy). In latency-sensitive systems, that occasional O(n) spike matters — it shows up as a P99 latency outlier.
Q: When does O(n) beat O(log n) in practice?
A: When n is small (linear scan of 100 elements beats binary search due to cache locality and branch prediction), or when the constant factor of O(log n) is large (a B-tree traversal with disk seeks per level vs a sequential scan in memory). Complexity notation hides constants — always benchmark with real data before optimizing.
16. System Design Components — When to Reach for What
When the interviewer draws a blank architecture box, these are the building blocks you fill it with.
| Component | What It Does | Reach for It When… |
|---|---|---|
| Load Balancer | Distributes traffic across servers | You have more than one instance of a service |
| CDN | Caches static assets at the edge | Users are geographically distributed; you serve images, JS, CSS, or video |
| Cache (Redis/Memcached) | Stores hot data in memory | Read-heavy workloads with tolerable staleness |
| Message Queue | Decouples producers from consumers | Async processing, traffic spikes, or unreliable downstream services |
| Database (SQL) | Structured, relational storage with ACID | Transactions, joins, or strong consistency requirements |
| Database (NoSQL) | Flexible schema, horizontal scale | High write throughput, variable data shapes, or key-value access patterns |
| Blob Storage (S3) | Stores large unstructured files | Images, videos, backups, logs — anything over ~1 MB |
| Search Index (ES) | Full-text search and aggregation | Users need fuzzy search, autocomplete, or faceted filtering |
| API Gateway | Single entry point for external traffic | Microservices needing auth, rate limiting, and routing in one place |
| Service Mesh | Handles inter-service networking (mTLS, retries) | Microservices at scale needing observability and zero-trust networking |
Quick Reference Architecture
This is the “default” web application architecture you can draw in the first 2 minutes of any system design interview, then customize based on requirements.
Deep dive: Cloud & Problem Framing | System Design Practice
Why It Matters in Production -- System Design Components
AI-Assisted Lens -- System Design Components
- AI-assisted architecture review. Describe your system design to an LLM and ask it to identify missing components, single points of failure, and over-engineering. LLMs are surprisingly good at catching “you have a cache but no invalidation strategy” or “you have three sequential service calls that could be parallelized.”
- AI as an architecture search engine. “How does Uber handle ride matching at scale?” LLMs can synthesize architecture patterns from publicly available engineering blogs and conference talks, giving you starting points for specific system design problems.
- New AI-specific components. Modern architectures increasingly include: vector databases for semantic search, embedding services for content understanding, inference endpoints for real-time ML, and feature stores for ML pipelines. These are becoming standard building blocks alongside caches and queues.
Quick Fire Q&A -- System Design Components
17. REST API Naming Conventions
A quick reference for designing clean, idiomatic REST endpoints. Get these right and reviewers stop nitpicking your API.
| Rule | Good | Bad | Why |
|---|---|---|---|
| Use nouns, not verbs | /users | /getUsers | HTTP methods already express the verb |
| Use plural nouns | /orders/42 | /order/42 | Consistent collection semantics |
| Use kebab-case | /user-profiles | /userProfiles | URL paths are case-sensitive, so lowercase with hyphens avoids casing mismatches and reads cleanly |
| Nest for relationships | /users/5/orders | /getUserOrders?id=5 | Expresses hierarchy in the URI |
| Use query params for filtering | /orders?status=shipped | /shipped-orders | Keeps the resource URI clean |
| Version in the URL or header | /v1/users | /users?version=1 | Clear, cacheable, hard to miss |
| Return 201 on POST create | 201 Created + Location header | 200 OK with body only | Signals resource creation; Location enables follow-up |
| Use PATCH for partial updates | PATCH /users/5 | PUT /users/5 with partial body | PUT implies full replacement |
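The difference between full replacement and partial update in the last row is easy to see on a plain dictionary. This is a simplified sketch: `apply_put` and `apply_patch` are hypothetical helpers, and the merge shown here is a shallow one (the JSON Merge Patch format in RFC 7396 additionally treats `null` as "delete this field").

```python
def apply_put(stored: dict, body: dict) -> dict:
    # PUT: the body is the complete new representation; omitted fields are dropped.
    return dict(body)


def apply_patch(stored: dict, body: dict) -> dict:
    # PATCH (shallow merge): only the fields present in the body change.
    return {**stored, **body}


user = {"name": "Ada", "email": "ada@example.com"}
after_put = apply_put(user, {"name": "Ada L."})      # email is gone
after_patch = apply_patch(user, {"name": "Ada L."})  # email is preserved
```

This is the "accidentally nulling out omitted fields" risk in practice: a client that sends a partial body with PUT silently loses the fields it left out.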
Quick Fire Q&A -- REST Conventions
Q: PUT or PATCH for an update? When does it matter?
A: PUT replaces the entire resource — you must send all fields, even unchanged ones. PATCH applies a partial update — send only the changed fields. Use PATCH for most updates (less bandwidth, no risk of accidentally nulling out omitted fields). Use PUT when the client truly intends to replace the entire resource and you want idempotent semantics.
Q: Nested resources: /users/5/orders vs /orders?user_id=5. When do you choose which?
A: Nested when the child makes no sense without the parent (orders belong to a user, comments belong to a post). Flat with query params when the child can exist independently or when you need to filter across parents (all orders with status=shipped regardless of user). Deep nesting beyond two levels (/users/5/orders/42/items/3) is a red flag — flatten it.
Q: URL versioning (/v1/users) vs header versioning. Which do you pick?
A: URL versioning for public APIs — it is explicit, cacheable, and impossible to miss. Header versioning for internal APIs where URL aesthetics matter and you control all clients. The industry has largely settled on URL versioning for public APIs: Stripe uses a /v1 path prefix and Twilio uses a dated path segment. GitHub's REST API is a notable exception that selects the version with a request header.
18. Consensus Algorithm Comparison
Consensus is how distributed systems get unreliable machines to agree on a value. The three algorithms you need to know — and the systems that use them.
| Aspect | Raft | Paxos | ZAB (ZooKeeper Atomic Broadcast) |
|---|---|---|---|
| Published | 2014 (Ongaro & Ousterhout) | 1989/1998 (Lamport) | 2011 (Junqueira, Reed, Serafini) |
| Design goal | Understandability | Correctness proof | ZooKeeper-specific total order broadcast |
| Leader | Strong leader required; all writes go through leader | No strict leader (leaderless or weak leader) | Designated leader; followers forward writes to leader |
| Log ordering | First-class log replication | Must be added on top (Multi-Paxos) | Total order broadcast of state changes |
| Election mechanism | Randomized timeout (150-300ms), majority vote | Proposer with highest ballot number | Highest zxid (transaction ID) wins; TCP-based ordering |
| Membership changes | Joint consensus protocol defined in paper | Complex and underspecified | Dynamic reconfiguration (since ZK 3.5) |
| Correctness proof | Full TLA+ specification available | Safety proven; liveness depends on implementation | Formal proof in the ZAB paper |
| Implementation difficulty | High, but significantly more tractable than Paxos | Very high; many subtle edge cases | Moderate (purpose-built for ZooKeeper) |
| Used by | etcd, CockroachDB, TiKV, Consul, RethinkDB | Google Chubby, some older systems | Apache ZooKeeper, Kafka (pre-KRaft) |
| Read performance | Leader reads by default; follower reads possible with lease | Flexible (leaderless reads possible) | Reads from any follower (sequential consistency) |
| Typical cluster size | 3, 5, or 7 nodes | 3, 5, or 7 acceptors | 3, 5, or 7 nodes |
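The odd cluster sizes in the last row follow from majority-quorum arithmetic, sketched below. The helper names are illustrative, and the timeout range is the 150-300 ms figure from the table.

```python
import random


def quorum(n: int) -> int:
    # Smallest majority of n voting members.
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    # Members that can fail while a majority is still reachable.
    return n - quorum(n)


def election_timeout_ms(low: float = 150.0, high: float = 300.0) -> float:
    # Raft-style randomized timeout: spreading timeouts makes split votes unlikely.
    return random.uniform(low, high)


for n in (3, 4, 5, 7):
    print(n, quorum(n), tolerated_failures(n))
```

A 4-node cluster needs 3 votes and still tolerates only 1 failure, the same as a 3-node cluster, so the extra even-numbered node adds cost without adding fault tolerance.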
Senior vs Staff Signal -- Consensus Algorithms
AI-Assisted Lens -- Consensus
- AI for distributed systems debugging. When a consensus cluster misbehaves (split-brain, leader flapping, stuck elections), AI tools can analyze logs from all nodes simultaneously and identify the timeline of events. This is a task that humans struggle with because it requires correlating logs across nodes with clock skew.
- Formal verification with AI. TLA+ and similar formal methods are used to verify consensus implementations. AI is getting better at writing and checking TLA+ specs, which could make formal verification accessible to more teams — not just researchers at AWS and Google.
- Managed consensus as a service. The trend is away from self-managed consensus toward managed services (AWS MemoryDB, DynamoDB, Aurora) that hide the consensus layer entirely. The staff engineer understands what is happening underneath but appreciates not having to operate it.
Quick Fire Q&A -- Consensus
19. CRDT Types Quick Reference
CRDTs (Conflict-free Replicated Data Types) allow replicas to be modified independently and merged automatically — no coordination, no conflicts.
| CRDT Type | What It Does | Merge Strategy | Limitations | Real-World Use |
|---|---|---|---|---|
| G-Counter | Grow-only counter (increment only) | Element-wise max of per-node counters; value = sum | Cannot decrement | Distributed page view counts |
| PN-Counter | Counter with increment and decrement | Two G-Counters: P (positive) and N (negative); value = sum(P) - sum(N) | State grows with node count | Like/dislike counters, inventory adjustments |
| G-Set | Grow-only set (add only) | Union of sets | Cannot remove elements | Tag collections, participant lists |
| OR-Set | Observed-Remove Set (add and remove) | Each add tagged with unique ID; remove removes known tags; concurrent add wins | Metadata overhead from tombstones/tags | Shopping carts, collaborative editing |
| LWW-Register | Last-Writer-Wins single value | Higher timestamp wins | Clock skew can silently discard concurrent writes | User profile fields, configuration values |
| MV-Register | Multi-Value Register | Keeps all concurrent values; app resolves | Requires application-level conflict resolution | Systems needing explicit merge UI |
| RGA | Replicated Growable Array (ordered list) | Unique element IDs with causal ordering | Complex implementation; metadata overhead | Collaborative text editing |
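The merge strategies for the first two rows of the table can be sketched directly. This is a minimal state-based illustration, assuming each replica has a unique `node_id`; the class and method names are chosen for the example.

```python
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> that node's local increment total

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Element-wise max: commutative, associative, and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


class PNCounter:
    def __init__(self, node_id):
        self.p = GCounter(node_id)  # increments
        self.n = GCounter(node_id)  # decrements

    def increment(self, amount=1):
        self.p.increment(amount)

    def decrement(self, amount=1):
        self.n.increment(amount)

    def value(self):
        return self.p.value() - self.n.value()

    def merge(self, other):
        self.p.merge(other.p)
        self.n.merge(other.n)
```

Because merge is an element-wise max, replicas can exchange state in any order, any number of times, and still converge on the same value.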
State-Based vs Operation-Based CRDTs
Senior vs Staff Signal -- CRDTs
AI-Assisted Lens -- CRDTs
- Collaborative AI editing. AI-powered collaborative tools (Cursor, Replit) use CRDT-like data structures for real-time code collaboration. Understanding CRDTs helps you understand how your IDE handles concurrent edits from multiple developers and AI assistants simultaneously.
- Offline-first AI applications. AI features in mobile apps (smart compose, local inference) generate state changes offline that must merge when reconnecting. CRDTs provide a principled merge strategy for these scenarios without the complexity of conflict resolution UIs.
Quick Fire Q&A -- CRDTs
20. OS I/O Models Comparison
Understanding I/O models explains why different server architectures exist and what makes Nginx, Node.js, and Redis fast.
| I/O Model | How It Works | Scalability | Programming Complexity | Used By |
|---|---|---|---|---|
| Blocking I/O | Thread calls read() and blocks until data arrives | Poor (~thousands of connections) | Lowest (sequential code) | Traditional Apache httpd, PHP-FPM |
| Non-blocking I/O | read() returns EAGAIN if no data; process must poll | Poor alone (busy-waiting) | Medium | Rarely used alone; combined with multiplexing |
| I/O Multiplexing (select/poll) | Monitor multiple fds; kernel scans all fds on each call | Moderate (O(n) per call, ~10K connections) | Medium | Older servers, legacy codebases |
| I/O Multiplexing (epoll/kqueue) | Kernel maintains interest set; returns only ready fds | Excellent (O(1) per event, ~100K+ connections) | Medium-High | Nginx, Node.js, Redis, HAProxy |
| Async I/O (io_uring) | Kernel performs I/O and notifies via shared ring buffers; zero syscalls in fast path | Highest (millions of ops/sec) | High | Next-gen storage engines, modern databases |
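The multiplexing rows above can be tried out with Python's standard `selectors` module, which wraps epoll on Linux and kqueue on BSD/macOS. The sketch below is a toy single-threaded loop over a local socket pair, not a real server; `echo_ready`, `run_once`, and `demo` are names made up for the example.

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # picks epoll on Linux, kqueue on BSD/macOS


def echo_ready(conn):
    # Called only when the kernel reports the socket as readable.
    data = conn.recv(4096)
    if data:
        conn.sendall(data.upper())
    else:
        sel.unregister(conn)
        conn.close()


def run_once(timeout=1.0):
    # One event-loop iteration: select() returns only the ready sockets.
    handled = 0
    for key, _events in sel.select(timeout):
        key.data(key.fileobj)  # key.data holds the callback registered below
        handled += 1
    return handled


def demo():
    server_side, client_side = socket.socketpair()
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, data=echo_ready)
    client_side.sendall(b"ping")
    run_once()
    reply = client_side.recv(4096)
    sel.unregister(server_side)
    server_side.close()
    client_side.close()
    return reply
```

The loop never blocks on any single connection: idle sockets cost nothing until the kernel marks them ready, which is the property that lets one thread serve many connections.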
select() has a hard limit of ~1024 fds and is O(n). epoll is O(1) for events with no fd limit. This single difference is why Node.js can handle 100K connections on one thread. io_uring (Linux 5.1+) is the next leap — true async with no syscall overhead in the fast path.
Senior vs Staff Signal -- I/O Models
Senior: Knows that epoll is better than select for high-concurrency servers. Knows that Node.js and Nginx use event-driven I/O to handle many connections on few threads.
Staff: Can reason about I/O model choices at the architecture level. Understands that io_uring is the future of Linux I/O and can explain why it matters for database engines (reduces syscall overhead by 10-100x for storage-bound workloads). Knows that the thread-per-connection model (Apache httpd, traditional Java) is not “wrong” — it is simpler to program and debug, and for <10K connections, the performance difference is negligible. Makes I/O model choices based on actual connection counts and throughput requirements, not blog-post hype.
AI-Assisted Lens -- I/O Models
- AI inference serving and I/O. LLM inference servers (vLLM, TensorRT-LLM) are fundamentally I/O-bound — they batch requests, manage GPU memory, and stream tokens. Understanding async I/O models explains why these servers use event-driven architectures and why batching requests improves GPU utilization.
- AI-powered performance profiling. Tools like Intel VTune and Perf combined with AI analysis can identify I/O bottlenecks (excessive syscalls, page faults, context switches) in production workloads and suggest the right I/O model for your access pattern.
Quick Fire Q&A -- I/O Models
epoll/kqueue (via libuv) to multiplex I/O events on a single thread. The event loop processes callbacks when I/O completes, never blocking. CPU-bound work blocks the event loop — that is why you offload it to worker threads. The key insight: I/O multiplexing works because most connections are idle most of the time (waiting for client input, waiting for DB response).
Q: io_uring vs epoll — when does the difference matter?
A: For network I/O (web servers, API servers), epoll is still excellent — the bottleneck is usually application logic, not syscall overhead. io_uring shines for storage I/O — database engines, file-processing pipelines — where the syscall overhead of read()/write() becomes the bottleneck at millions of ops/sec. RocksDB and some modern databases are adopting io_uring for this reason.
Q: Why does Redis use single-threaded I/O and still achieve millions of ops/sec?
A: Redis operations are in-memory and complete in microseconds. The bottleneck is network I/O, not CPU. A single thread with epoll can handle 100K+ connections because each operation is so fast that the thread is almost never blocked. Redis 6+ added I/O threading for the network layer (read/write to sockets) while keeping the command execution single-threaded to avoid lock contention.
21. Database Deep Dive Comparison
When interviewers ask “which database and why?” — this is the table you want in your head.
| Dimension | PostgreSQL | MongoDB | DynamoDB | Redis |
|---|---|---|---|---|
| Model | Relational (tables, rows, SQL) | Document (JSON/BSON) | Wide-column / key-value (PK + SK) | In-memory data structures |
| Schema | Strict schema (schema-on-write) | Flexible (schema-on-read) | Schemaless (attribute-level) | Schemaless |
| Query language | SQL (full standard) | MQL + Aggregation Pipeline | PartiQL or API (GetItem, Query, Scan) | Commands (GET, SET, ZADD, etc.) |
| Consistency | Strong (ACID; READ COMMITTED by default, up to SERIALIZABLE) | Tunable (w:1 to w:majority, read concern local to linearizable) | Tunable (eventually consistent by default, strongly consistent reads optional) | Strong on single node; eventual across replicas |
| Scaling model | Vertical (read replicas for reads; app-level sharding for writes) | Horizontal (built-in sharding by shard key) | Horizontal (automatic partitioning by partition key) | Horizontal (Redis Cluster, 16384 hash slots) |
| Max practical size | ~10 TB single instance; larger with partitioning + Citus | Petabytes (with sharding) | Virtually unlimited (fully managed) | Limited by RAM (100s of GB per node) |
| Latency | 1-10 ms typical | 1-10 ms typical | 1-5 ms typical (single-digit ms by design, not a contractual guarantee) | <1 ms (sub-millisecond) |
| JOINs | Native, optimized | $lookup (slow, not recommended at scale) | None (design around access patterns) | None (application-level) |
| Replication | Streaming replication (WAL-based) | Replica sets (oplog-based) | Global Tables (multi-region, last-writer-wins conflict resolution) | Sentinel (HA) or Cluster (sharding) |
| Best for | Transactions, complex queries, strong consistency, JOINs | Flexible schemas, hierarchical data, rapid iteration | Predictable latency at any scale, serverless, key-based access | Caching, sessions, leaderboards, rate limiting, pub/sub |
| Worst for | Horizontal write scaling without extensions | Complex joins, many-to-many relationships | Ad-hoc queries, analytics, complex relationships | Datasets larger than RAM, complex queries |
| Managed options | RDS, Aurora, Supabase, Neon | Atlas | DynamoDB (native) | ElastiCache, MemoryDB |
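The "16384 hash slots" entry for Redis Cluster refers to a simple mapping: CRC16 of the key, modulo 16384, with optional hash tags in braces to force related keys into the same slot. The sketch below follows the algorithm described in the Redis Cluster specification; `key_hash_slot` is a name chosen here, and `binascii.crc_hqx` with an initial value of 0 is used as the CRC16 (XMODEM variant) implementation.

```python
import binascii


def key_hash_slot(key: bytes) -> int:
    # Hash tags: if the key contains "{...}" with non-empty content,
    # only the text between the first "{" and the next "}" is hashed.
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % 16384


# Keys sharing a hash tag land in the same slot, so multi-key
# operations on them can run on a single node.
same_slot = key_hash_slot(b"{user:1}:cart") == key_hash_slot(b"{user:1}:profile")
```

This is why multi-key commands in a cluster require all keys to share a slot, and why hash tags are the standard way to arrange that.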
Quick Fire Q&A -- Database Deep Dive
jsonb), advanced indexing (GIN, GiST, BRIN), CTEs and window functions for analytics, or extensibility (PostGIS for geo, pgvector for embeddings). MySQL wins when you need: simpler replication setup, broader hosting compatibility, or your team already knows it deeply. The operational expertise factor outweighs the feature comparison 9 times out of 10.
Q: DynamoDB charges per read/write. When does this become painfully expensive?
A: When you do full-table scans (Scan operations read every item and charge for every 4KB read), when your access patterns do not match your key design (leading to hot partitions and throttling), or when you use on-demand pricing for steady-state workloads (on-demand is ~6.5x costlier per request than provisioned). The worst case: a team running analytics queries via DynamoDB Scan operations instead of exporting to S3 and querying with Athena.
Q: Redis persistence — AOF vs RDB. When do you choose each?
A: RDB (point-in-time snapshots): lower disk I/O, faster restarts, but you lose data since last snapshot. AOF (append-only file): logs every write, minimal data loss, but larger files and slower restarts. For caching (data is reconstructible), RDB or no persistence. For primary data store (sessions, rate limits), AOF with appendfsync everysec — at most 1 second of data loss. For zero data loss, use Redis with AOF appendfsync always (slower) or switch to a real database.
22. Real-Time Protocol Comparison
| Aspect | WebSocket | SSE (Server-Sent Events) | WebRTC | Long Polling |
|---|---|---|---|---|
| Direction | Bidirectional (full-duplex) | Server to client only | Bidirectional (peer-to-peer) | Server to client (with request per update) |
| Protocol | TCP (upgraded from HTTP) | HTTP (text/event-stream) | UDP (SRTP/SCTP) | HTTP |
| Latency | Very low (~1-5ms LAN) | Low (~5-50ms) | Lowest (~10-50ms P2P) | Medium (~50-500ms per cycle) |
| Auto-reconnect | Manual (you implement it) | Built-in (EventSource API) | Manual (ICE restart) | Manual |
| Binary data | Yes (binary frames) | No (UTF-8 text only) | Yes (data channels) | Yes (HTTP body) |
| Frame overhead | 2-14 bytes after handshake | ~50 bytes per event | Varies (RTP headers) | Full HTTP headers per response (~200-800 bytes) |
| Scaling difficulty | Medium-Hard (stateful connections, pub/sub backbone needed) | Easy (stateless HTTP, works with CDNs) | Hard (STUN/TURN servers, SFUs at scale) | Easy (stateless HTTP) |
| Proxy/firewall | Sometimes blocked | Fully compatible (plain HTTP) | Often blocked (UDP) | Fully compatible |
| Server connections | ~50K-100K per server | 6 per domain per browser (HTTP/1.1); negotiated stream limit, 100 by default (HTTP/2) | Limited by CPU/bandwidth | 6 per domain (HTTP/1.1) |
| Best for | Chat, collaboration, gaming, live dashboards | Notifications, feeds, build logs, stock tickers | Voice, video, screen sharing, P2P file transfer | Legacy fallback, serverless, low-frequency updates |
Senior vs Staff Signal -- Real-Time Protocols
AI-Assisted Lens -- Real-Time Protocols
- Streaming LLM responses. Most AI chatbots use SSE to stream token-by-token responses. Understanding SSE is now a must-have skill for any engineer building AI-powered UIs. The pattern: HTTP POST to `/chat/completions`, response is `text/event-stream` with `data:`-prefixed JSON chunks.
- Real-time AI collaboration. Tools like Cursor and Copilot Chat use WebSocket for bidirectional communication — the user types, the AI responds, the user edits, the AI adapts. This is pushing WebSocket adoption beyond traditional chat apps into developer tooling.
- WebRTC for AI voice. AI voice assistants (OpenAI Realtime API, Hume) use WebRTC for low-latency audio streaming. Understanding WebRTC’s STUN/TURN/ICE negotiation is becoming relevant for AI product engineers, not just video conferencing teams.
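The streaming pattern in the first bullet above comes down to parsing `data:` lines separated by blank lines. Here is a simplified parser sketch: it handles only the `data` field and comment lines, and the JSON chunk shape (`{"delta": ...}`) and the `[DONE]` sentinel are assumptions for the example, since each provider defines its own payload.

```python
import json
import re


def parse_sse(stream_text: str):
    """Return the data payload of each event in a text/event-stream body."""
    events, data_lines = [], []
    for line in re.split(r"\r\n|\n|\r", stream_text):
        if line == "":
            # A blank line dispatches the event accumulated so far.
            if data_lines:
                events.append("\n".join(data_lines))
                data_lines = []
        elif line.startswith("data:"):
            value = line[5:]
            # A single leading space after the colon is not part of the value.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Comment lines (starting with ":") and other fields are ignored here.
    return events


def collect_text(stream_text: str, done_marker: str = "[DONE]") -> str:
    # Hypothetical chunk format: each event is JSON with a "delta" string.
    parts = []
    for payload in parse_sse(stream_text):
        if payload == done_marker:
            break
        parts.append(json.loads(payload)["delta"])
    return "".join(parts)
```

In a browser the `EventSource` API does this parsing (and reconnection) for you; a hand-rolled parser like this is mostly needed when the stream arrives over a POST request.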
Quick Fire Q&A -- Real-Time Protocols
23. GraphQL vs REST Decision Matrix
| Dimension | REST | GraphQL |
|---|---|---|
| Data fetching | Server decides response shape; often over-fetches or under-fetches | Client specifies exact fields; no over-fetching |
| Endpoints | One per resource (/users, /orders/42) | Single endpoint (/graphql) |
| Versioning | URL versioning (/v1/users) or header | Schema evolution with @deprecated directives; no versions needed |
| Caching | Excellent (HTTP caching, CDN-friendly, ETag, Cache-Control) | Hard (all requests are POST to one endpoint; need persisted queries or APQ) |
| Error handling | HTTP status codes (400, 404, 500) | Always returns 200; errors in response body { errors: [...] } |
| Tooling | Excellent (Postman, curl, any HTTP client) | Good (GraphiQL, Apollo Studio, Insomnia) |
| N+1 problem | Server controls query; can optimize with JOINs | Field-level resolvers cause N+1 by default; requires DataLoader |
| Type safety | Optional (OpenAPI/Swagger) | Built-in (SDL schema is the contract) |
| Real-time | Polling or SSE (bolted on) | Subscriptions (via WebSocket, built into spec) |
| File uploads | Native (multipart/form-data) | Not in spec (requires multipart extension or pre-signed URL workaround) |
| Rate limiting | Simple (requests per second) | Complex (each query has different cost; need calculated query cost) |
| Best for | Public APIs, simple CRUD, server-to-server, teams wanting HTTP caching | Mobile/frontend with varied data needs, multiple client types, graph-shaped data |
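The N+1 row is easier to explain with a concrete batching sketch. The code below is a synchronous simplification of the DataLoader idea (the real library defers and batches loads within an event-loop tick); `BatchLoader`, `fetch_users`, and the sample data are all hypothetical.

```python
class BatchLoader:
    def __init__(self, batch_fn):
        self.batch_fn = batch_fn  # takes a list of keys, returns values in the same order
        self.cache = {}

    def load_many(self, keys):
        # Deduplicate while preserving order, and skip keys already cached.
        missing = [k for k in dict.fromkeys(keys) if k not in self.cache]
        if missing:
            for key, value in zip(missing, self.batch_fn(missing)):
                self.cache[key] = value
        return [self.cache[k] for k in keys]


USERS = {1: "ada", 2: "grace"}
calls = []  # records each simulated database round trip


def fetch_users(ids):
    calls.append(list(ids))
    return [USERS[i] for i in ids]


posts = [
    {"id": 10, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 1},
]
loader = BatchLoader(fetch_users)
authors = loader.load_many([p["author_id"] for p in posts])
```

A naive per-field resolver would call `fetch_users` once per post; batching collapses that into one round trip for the two distinct authors.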
24. Service Mesh Comparison
| Aspect | Istio | Linkerd | Consul Connect |
|---|---|---|---|
| Data plane proxy | Envoy (C++) | linkerd2-proxy (Rust, purpose-built) | Envoy (C++) |
| Memory per sidecar | ~40 MB | ~10-20 MB | ~40 MB (Envoy) |
| Latency per hop | ~1-3 ms | ~1-2 ms | ~1-3 ms |
| Configuration complexity | High (many CRDs: VirtualService, DestinationRule, Gateway, etc.) | Low (opinionated defaults, fewer knobs) | Medium (integrates with Consul service catalog) |
| mTLS | Automatic with cert rotation (default 24h) | Automatic with cert rotation | Automatic via Consul CA or Vault |
| Traffic splitting | Yes (weight-based, header-based, fault injection) | Yes (basic traffic splitting) | Yes (via service-resolver and service-splitter) |
| Circuit breaking | Yes (outlier detection in Envoy) | Yes (built into proxy) | Yes (via Envoy) |
| Multi-cluster | Supported (complex setup) | Supported (simpler model) | Native (designed for multi-DC/multi-cloud) |
| Non-Kubernetes support | Limited (K8s-first) | K8s only | Yes (VMs, bare metal, Nomad, K8s) |
| Extensibility | Very high (Wasm/Lua filters in Envoy) | Limited (intentionally) | Moderate (Envoy filters + Consul intentions) |
| Learning curve | Weeks to months | Days to weeks | Days to weeks |
| Best for | Large orgs with platform teams needing advanced traffic management | Small-medium teams wanting mTLS + observability with minimal ops | Hybrid environments spanning K8s, VMs, and multi-cloud |
25. Distributed Systems Numbers
Numbers every engineer should have at their fingertips for distributed systems design and interviews.
Consensus & Coordination
| Metric | Typical Value | Context |
|---|---|---|
| Raft election timeout | 150-300 ms (randomized) | Prevents split votes; must be >> heartbeat interval |
| Raft heartbeat interval | 50-150 ms | Leader sends to suppress elections |
| Raft leader election time | ~200-500 ms | From leader failure to new leader |
| ZooKeeper session timeout | 6-30 s (default varies) | Client must heartbeat within this window; too low = false expiry |
| ZooKeeper write latency | 2-10 ms | Writes go through leader; commit requires majority |
| etcd write latency | 2-10 ms | Similar to ZooKeeper (Raft-based) |
| etcd recommended max DB size | 8 GB (default quota is 2 GB) | Quota can be raised, but etcd is for metadata, not bulk data |
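The randomized election timeout in the table is easier to remember once you see why it prevents split votes. The following is a minimal sketch, not Raft itself: the node names, the fixed seed, and the helper functions are hypothetical, and only the 150-300 ms range and the 50 ms heartbeat come from the table.

```python
import random

HEARTBEAT_MS = 50                  # low end of the heartbeat range in the table
ELECTION_TIMEOUT_MS = (150, 300)   # randomized range from the table


def next_election_timeout(rng: random.Random) -> float:
    """Pick a fresh randomized timeout, as a follower would after each heartbeat."""
    low, high = ELECTION_TIMEOUT_MS
    return rng.uniform(low, high)


def first_candidate(node_ids, rng):
    """Return the node whose timer fires first once the leader goes silent."""
    timeouts = {node: next_election_timeout(rng) for node in node_ids}
    winner = min(timeouts, key=timeouts.get)
    return winner, timeouts


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
winner, timeouts = first_candidate(["n1", "n2", "n3", "n4", "n5"], rng)
print(f"{winner} times out first after {timeouts[winner]:.0f} ms and requests votes")
```

Because every follower draws a different timeout, one node usually becomes a candidate well before the others, which is why a single election round is normally enough.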
Replication & Messaging
| Metric | Typical Value | Context |
|---|---|---|
| Kafka replication lag (in-rack) | <10 ms | Between leader and ISR followers |
| Kafka end-to-end latency | 5-50 ms | Producer to consumer, same datacenter |
| Kafka single-broker throughput | 800 MB/s+ | With zero-copy (sendfile()), sequential I/O, batching |
| PostgreSQL streaming replication lag | <1 ms (same AZ) to seconds (cross-region) | Depends on network and write volume |
| MongoDB replica set replication lag | 0-100 ms typical | Oplog-based; can spike under heavy write load |
| DynamoDB Global Tables replication | Typically <1 second cross-region | CRDT-based conflict resolution |
| Redis replication lag | Sub-millisecond (same DC) | Async replication; can lose data on failover |
Timeouts & Failure Detection
| Metric | Typical Value | Context |
|---|---|---|
| TCP connect timeout | 1-5 s | Time to establish connection; 75s kernel default on Linux is too high for production |
| HTTP request timeout (internal) | 1-10 s | Service-to-service calls; longer = more resource holding |
| Health check interval | 5-30 s | Kubernetes liveness/readiness probes |
| Circuit breaker trip threshold | 5-10 consecutive failures | Open circuit, stop calling failing service |
| NTP clock skew (same DC) | 1-10 ms | NTP precision; not good enough for causal ordering |
| NTP clock skew (cross-DC) | 10-200 ms | Why logical clocks exist |
| Google TrueTime uncertainty | 1-7 ms | With atomic clocks + GPS; Spanner’s commit-wait |
26. Cloud Service Limits Cheat Sheet (AWS)
The numbers that matter when you’re sketching architecture on a whiteboard.
Lambda
| Limit | Value | Notes |
|---|---|---|
| Max execution time | 15 minutes | Use Step Functions for longer workflows |
| Memory | 128 MB - 10,240 MB | CPU scales proportionally with memory |
| Deployment package (zip) | 50 MB zipped / 250 MB unzipped | Use container images for larger deps |
| Container image size | 10 GB | For ML models, large native binaries |
| Concurrent executions (default) | 1,000 per region | Raise via support ticket (commonly 10K-100K) |
| Burst concurrency | 500-3,000 (region-dependent) | Cannot be increased |
| Ephemeral storage (/tmp) | 512 MB - 10,240 MB | Configurable since 2022 |
| Environment variables | 4 KB total | Use SSM Parameter Store for larger configs |
| Payload (sync invocation) | 6 MB request / 6 MB response | Use S3 for larger payloads |
| Payload (async invocation) | 256 KB | Events larger than this must reference S3 |
| Cold start (Go/Rust) | 50-100 ms | Compiled, minimal runtime |
| Cold start (Python/Node) | 100-300 ms | Interpreter startup + imports |
| Cold start (Java) | 3-10 s | JVM + class loading; use SnapStart |
S3
| Limit | Value | Notes |
|---|---|---|
| Object size (single PUT) | 5 GB max | Use multipart upload for files >100 MB |
| Object size (multipart) | 5 TB max | Up to 10,000 parts |
| Bucket count per account | 100 (soft limit) | Raise via support ticket |
| Request rate per prefix | 5,500 GET/s + 3,500 PUT/s | Distribute across prefixes to scale beyond this |
| Consistency model | Strong read-after-write (since Dec 2020) | No more eventual consistency surprises |
| Storage classes | 6 tiers (Standard to Deep Archive) | Use lifecycle policies to auto-transition |
| Minimum object charge | 128 KB for IA/Glacier classes | Small objects in IA cost more than Standard |
DynamoDB
| Limit | Value | Notes |
|---|---|---|
| Item size | 400 KB max | Compress or offload large attributes to S3 |
| Partition key value throughput | 3,000 RCU + 1,000 WCU per partition | Hot partitions get throttled even if table capacity is not exhausted |
| GSIs per table | 20 (default quota, adjustable) | Plan access patterns carefully |
| LSIs per table | 5 (must be defined at table creation) | Cannot be added later |
| Partition + all LSIs per PK value | 10 GB max | Reason most practitioners avoid LSIs |
| Query/Scan result set | 1 MB per call (paginate for more) | Use LastEvaluatedKey for pagination |
| Batch operations | 25 items per BatchWriteItem / 100 items per BatchGetItem | Items can be up to 400 KB each |
| Transaction limit | 100 items per TransactWriteItems | 4 MB total request size |
| On-demand pricing | ~6.5x costlier per request than provisioned at steady state | Use for spiky/unknown traffic |
| Global Tables replication | Typically <1 second | CRDT-based; last-writer-wins per attribute |
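The 1 MB page limit means every Query or Scan caller needs a pagination loop. Here is a minimal sketch of that loop. It assumes a callable shaped like boto3's `Table.query` (it accepts `ExclusiveStartKey` and returns a dict with `Items` and, when more data remains, `LastEvaluatedKey`). The `make_fake_query` helper and its `offset` key are hypothetical stand-ins so the example runs without AWS.

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_all_items(query_page: Callable[..., Dict[str, Any]], **query_kwargs) -> Iterator[Dict[str, Any]]:
    """Yield items from every page, following LastEvaluatedKey until it is absent."""
    kwargs = dict(query_kwargs)
    while True:
        page = query_page(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return  # no more pages
        kwargs["ExclusiveStartKey"] = last_key


def make_fake_query(items: List[Dict[str, Any]], page_size: int):
    """Hypothetical in-memory stand-in for a paginated query call."""
    def fake_query(ExclusiveStartKey=None, **_ignored):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey["offset"]
        page = {"Items": items[start:start + page_size]}
        if start + page_size < len(items):
            page["LastEvaluatedKey"] = {"offset": start + page_size}
        return page
    return fake_query


all_items = [{"pk": "user#1", "sk": f"order#{i}"} for i in range(7)]
fetched = list(iter_all_items(make_fake_query(all_items, page_size=3)))
print(len(fetched), "items fetched across pages")
```

With a real table you would pass `table.query` plus its usual keyword arguments instead of the fake. The loop itself does not change.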
RDS / Aurora
| Limit | Value | Notes |
|---|---|---|
| RDS max storage | 64 TB (PostgreSQL, MySQL) | Aurora auto-grows to 128 TB |
| RDS max connections (default) | ~5000 (depends on instance RAM) | Use PgBouncer or RDS Proxy |
| Aurora read replicas | Up to 15 | Replication lag typically <20 ms |
| RDS read replicas | Up to 5 | Lag can be seconds under heavy write load |
| RDS Multi-AZ failover time | 60-120 seconds | Aurora: <30 seconds |
| Aurora Serverless v2 min ACU | 0.5 ACU | Scales to zero only on v1 (with pause/resume delay) |
| RDS Proxy connections | Up to 1,000 per proxy endpoint | Multiplexes many app connections to fewer DB connections |
| Automated backup retention | Up to 35 days | Point-in-time recovery within retention window |
| Max IOPS (gp3) | 16,000 | Provisioned IOPS (io2) goes up to 256,000 |
27. API Gateway Comparison
| Gateway | Deployment | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| Kong | Self-hosted or Kong Cloud | Rich plugin ecosystem, Lua extensibility, PostgreSQL/Cassandra backing | Complex to operate at scale, Lua is niche | Teams wanting extensibility with a large plugin marketplace |
| AWS API Gateway | Fully managed | Zero ops, native AWS integration (Lambda, IAM), WebSocket support | 29-second timeout, cold starts, limited customization, vendor lock-in | AWS-native serverless architectures |
| Envoy | Self-hosted (often K8s) | Extreme performance, L7 protocol awareness, xDS dynamic config | Steep learning curve, YAML-heavy, not turnkey API management | K8s teams, performance-critical paths, service mesh data plane |
| Traefik | Self-hosted | Auto-discovery (Docker/K8s labels), Let’s Encrypt integration | Smaller plugin ecosystem, less enterprise tooling | Docker/K8s environments wanting automatic service discovery |
| APISIX | Self-hosted | High performance (Nginx/OpenResty), etcd-backed, CNCF ecosystem | Smaller community, less enterprise tooling | High-performance needs, CNCF-aligned teams |
| Nginx | Self-hosted | Battle-tested, extremely stable, massive community, low resources | Manual configuration, limited API management out of the box | Simple routing, teams already running Nginx |
Quick-Find Index
Alphabetical topic index — jump to what you need
| Topic | Section |
|---|---|
| API Gateway comparison | 27 |
| API styles (REST, gRPC, etc.) | 4 |
| Authentication methods | 6 |
| Availability (“nines” table) | 10 |
| Back-of-envelope estimation | 11 |
| Caching strategies | 3 |
| Cloud service limits (Lambda, S3, DynamoDB, RDS) | 26 |
| Complexity patterns (Big O) | 15 |
| Consensus algorithms (Raft, Paxos, ZAB) | 18 |
| Container orchestration (K8s) | 8 |
| CRDT types | 19 |
| Database deep dive (Postgres, Mongo, Dynamo, Redis) | 21 |
| Database selection | 2 |
| Deployment strategies | 5 |
| Design patterns | 12 |
| Distributed systems numbers | 25 |
| Git commands | 14 |
| GraphQL vs REST | 23 |
| HTTP status codes | 9 |
| Interview Day Quick Review | Top |
| Latency numbers | 1 |
| Message queues | 7 |
| OS I/O models | 20 |
| Real-time protocols (WebSocket, SSE, WebRTC) | 22 |
| REST naming conventions | 17 |
| Service mesh comparison (Istio, Linkerd, Consul) | 24 |
| SOLID principles | 13 |
| System design components | 16 |
Interview Deep-Dive Questions
Q1. You are designing a system where latency matters. Walk me through how you would reason about where time is spent in a typical request from a user’s browser to a database and back.
Strong Answer
Red Flags
- Candidate only mentions “network” and “database” without quantifying anything
- No awareness that TLS handshake can dominate the budget for cross-region requests
- Talks about optimizing application code but ignores that network latency is usually the real bottleneck for geo-distributed users
- Cannot explain why connection pooling matters (hint: it amortizes handshake cost)
Follow-up: Your P50 latency is 20ms but P99 is 800ms. What do you investigate first?
Strong Answer
Follow-up: How do CDNs and edge computing change this latency picture?
Strong Answer
Q2. You need to choose a database for a new service. Walk me through your decision framework — not which database to pick, but how you decide.
Strong Answer
Red Flags
- Jumps straight to a specific database without asking about access patterns
- Says “MongoDB because it is flexible” or “Postgres because it is good” without deeper reasoning
- Does not mention operational expertise as a factor
- Treats the choice as permanent — a senior engineer knows you can (and often should) use different databases for different parts of the system
- No mention of consistency requirements as a decision axis
Follow-up: When would you use both PostgreSQL and DynamoDB in the same system?
Strong Answer
Going Deeper: How do you avoid the dual-write problem when keeping two databases in sync?
Strong Answer
Q3. Explain the trade-offs between JWT and session-based authentication. When would you choose one over the other?
Strong Answer
Red Flags
- Says “JWT is more secure because it is signed” — signing proves integrity, not confidentiality
- Does not mention the revocation problem with JWTs
- Claims sessions “do not scale” without mentioning that Redis-backed sessions scale to millions of concurrent users
- No awareness of the hybrid approach (short-lived JWT + server-side refresh token)
- Puts sensitive data (roles, permissions, PII) in JWT claims without mentioning encryption
Follow-up: A security team tells you a user’s JWT has been stolen. What do you do?
Strong Answer
jti (unique token ID) to a blocklist in Redis. Every service that validates JWTs must check this blocklist. This works but defeats the main advantage of JWTs: you are back to hitting a centralized store on every request. The blocklist only needs to persist until the token’s natural expiry, so it is a bounded problem.
Rotate the signing key. If you rotate the key, all existing JWTs become invalid. This is a nuclear option: it logs out every user, not just the compromised one. Appropriate only in catastrophic breach scenarios.
User-level version claim. Include a token_version claim in JWTs. Store the current version per user in a fast store. On revocation, increment the user’s version. Services compare the JWT’s version against the current version. This is a targeted blocklist: you only invalidate tokens for the compromised user, and the lookup is a simple key-value read.
The lesson: the right architecture depends on your threat model. If “a stolen token is valid for 15 minutes” is acceptable, short-lived JWTs with no blocklist is the simplest approach. If that window is unacceptable (banking, healthcare), you need the blocklist or you should reconsider whether JWTs are the right choice for your system.
Q4. You are designing a notification system that needs to deliver messages in real-time to millions of connected users. Walk me through the architecture and the protocol choices you would make.
Strong Answer
userId -> connection.
The fan-out problem. When a notification is generated, how does it reach the right connection server? The notification producer does not know which server holds the user’s connection. The pattern is: notification producer publishes to a message broker (Kafka or Redis Pub/Sub), and every connection server subscribes. Each server checks if the target user is connected to it, and if so, pushes the notification down the SSE stream. With Redis Pub/Sub, you can use channel-per-user or a fan-out pattern. With Kafka, you partition by user ID so that each connection server only processes notifications for users it is likely to hold.
Offline users. If the user is not connected, the notification goes to a persistent store (DynamoDB or Postgres) and is delivered on next connection. The client, on reconnect, sends a Last-Event-ID header (SSE supports this natively), and the server replays missed events.
Scaling to millions. At 1 million concurrent connections, I am looking at roughly 20 connection servers. The bottleneck shifts to the fan-out layer. Redis Pub/Sub works well up to maybe 100K messages per second, but beyond that you need Kafka with partitioning or a dedicated pub/sub system. The connection servers themselves are I/O-bound (holding connections), not CPU-bound, so they can be relatively small instances with high network capacity.
The trade-off I would discuss with the team: do we need exactly-once delivery or is at-least-once acceptable? For notifications, at-least-once with client-side deduplication (using a notification ID) is usually the pragmatic choice. Exactly-once is significantly harder and rarely worth the complexity for notifications.
Red Flags
- Jumps to WebSocket without considering SSE for a server-push-only use case
- Does not address the fan-out problem — how does a notification reach the right server
- No mention of offline handling or message persistence
- Assumes a single server can hold all connections
- Does not discuss delivery guarantees (at-least-once vs exactly-once)
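Two points from the strong answer, replaying missed events from Last-Event-ID and deduplicating on the client under at-least-once delivery, can be sketched together. This is an illustrative in-memory model only: `NotificationStore` and `Client` are hypothetical names, the store stands in for DynamoDB or Postgres, and no real SSE transport is involved.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


class NotificationStore:
    """In-memory stand-in for the persistent notification store."""

    def __init__(self) -> None:
        self._events: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
        self._next_id = 1

    def append(self, user_id: str, payload: str) -> int:
        event_id = self._next_id
        self._next_id += 1
        self._events[user_id].append((event_id, payload))
        return event_id

    def replay_after(self, user_id: str, last_event_id: Optional[int]) -> List[Tuple[int, str]]:
        """Return events newer than the ID the client sent in Last-Event-ID."""
        floor = last_event_id or 0
        return [(eid, p) for eid, p in self._events[user_id] if eid > floor]


class Client:
    """Deduplicates by notification ID, so redelivery is harmless."""

    def __init__(self) -> None:
        self.seen: Set[int] = set()
        self.displayed: List[str] = []
        self.last_event_id: Optional[int] = None

    def receive(self, event_id: int, payload: str) -> None:
        if event_id in self.seen:
            return  # duplicate from at-least-once delivery
        self.seen.add(event_id)
        self.displayed.append(payload)
        self.last_event_id = max(self.last_event_id or 0, event_id)


store, client = NotificationStore(), Client()
client.receive(store.append("u1", "a"), "a")   # delivered while connected
store.append("u1", "b")                        # user offline: persisted only
store.append("u1", "c")
for eid, payload in store.replay_after("u1", client.last_event_id):
    client.receive(eid, payload)               # replay on reconnect
client.receive(1, "a")                         # a redelivered duplicate is ignored
print(client.displayed)
```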
Follow-up: How would you handle the “thundering herd” problem when a popular event triggers notifications to 10 million users simultaneously?
Strong Answer
Q5. Explain the CAP theorem, then tell me why it is frequently misunderstood and what you actually think about when designing distributed systems.
Strong Answer
Red Flags
- Says “pick any 2 of 3” as if you choose to not have partition tolerance
- Cannot define what consistency means in CAP (linearizability) versus general “data is correct”
- Treats it as a system-wide choice rather than per-operation or per-data-path
- No mention of PACELC or the latency vs consistency trade-off during normal operation
- Cannot give a concrete example of when eventual consistency is acceptable
Follow-up: Give me a concrete example of a system that appears to violate CAP and explain why it does not.
Strong Answer
Going Deeper: How does CockroachDB handle a situation where a Raft leader becomes partitioned from its followers?
Strong Answer
Q6. Walk me through how you would design a caching strategy for a read-heavy API endpoint that serves user profile data. The data changes infrequently but must not be stale for more than 30 seconds.
Strong Answer
DEL or through a lightweight message like SNS/SQS). The cache entry is deleted immediately, and the next read triggers a cache miss and fetches fresh data. This gives us near-zero staleness for the common case, with the 30-second TTL as the backstop for edge cases (event lost, consumer lag).
Cache stampede protection. When a popular user’s cache expires, hundreds of concurrent requests might all miss the cache simultaneously and hit the database. Three mitigations: (1) Lock/lease: the first request that gets a cache miss acquires a short-lived lock (Redis SET NX with a TTL). Other requests either wait for the lock to release and then read the now-populated cache, or serve a slightly stale value. (2) Refresh-ahead: proactively refresh the cache before the TTL expires. If the TTL is 30 seconds, refresh at 25 seconds. The cache never actually expires for hot keys. (3) Request coalescing: at the application level, deduplicate identical in-flight requests so that only one database query executes.
Cache key design. user:profile:{userId} is the obvious key. I would also consider including a version if the profile schema evolves. For multi-region setups, consider whether caches should be regional (lower latency, possible cross-region staleness) or global (consistent but higher latency for cache operations).
What I would monitor: cache hit ratio (target above 95% for this use case), P99 latency for cache misses, invalidation event lag (time between write and cache delete), and the rate of cache stampedes (concurrent misses for the same key).
Red Flags
- Uses only TTL-based expiration without event-driven invalidation
- No awareness of cache stampede (thundering herd on cache miss)
- Does not mention the consistency model — what happens during the window between write and invalidation
- Proposes write-through without acknowledging the added write latency
- No monitoring strategy
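The stampede protection from the strong answer (a per-key lock plus request coalescing, with the 30-second TTL as backstop) can be sketched in-process. This is a simplified illustration under stated assumptions: `CoalescingCache` and `load_profile` are hypothetical, the dict stands in for Redis, and a multi-server deployment would need a distributed lock such as Redis SET NX rather than `threading.Lock`.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class CoalescingCache:
    """Cache-aside with a TTL; concurrent misses for one key trigger one load."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry
        return None

    def get(self, key: str) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            entry = self._fresh(key)  # another thread may have filled it meanwhile
            if entry is not None:
                return entry[1]
            value = self._loader(key)
            self._entries[key] = (self._clock() + self._ttl, value)
            return value

    def invalidate(self, key: str) -> None:
        """Event-driven invalidation: drop the entry as soon as a write happens."""
        self._entries.pop(key, None)


calls = []

def load_profile(key: str) -> dict:
    calls.append(key)          # count database hits
    time.sleep(0.05)           # simulate a slow query
    return {"id": key, "name": "example"}

cache = CoalescingCache(load_profile)
threads = [threading.Thread(target=cache.get, args=("user:profile:42",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("database loads for 20 concurrent requests:", len(calls))
```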
Follow-up: The cache hit ratio is 60% and you expected it to be above 95%. How do you diagnose this?
Strong Answer
INFO memory: if maxmemory is hit and evicted_keys is high, the cache is too small. The solution is either to increase memory or to be more selective about what gets cached. Not all user profiles need to be in cache, only hot ones.
2. Is the TTL too short? If the TTL is 30 seconds and the average time between requests for the same user profile is 45 seconds, most entries will expire before being read again. Calculate the ratio: if the inter-request gap is longer than the TTL for most keys, you will have mostly misses. Consider increasing the TTL for this data (if the staleness budget allows) or using refresh-ahead for frequently accessed keys.
3. Is the invalidation too aggressive? If the event-driven invalidation is firing too frequently (maybe a background job is “updating” profiles even when nothing changed), you are evicting entries unnecessarily. Check whether the invalidation events correspond to actual data changes.
4. Is the cache key cardinality too high? If you have 10 million users but only 1 million are active daily, 90% of cache slots are wasted on profiles that will never be requested before TTL expires. Consider only caching on read (cache-aside does this naturally), not pre-warming the entire dataset.
5. Are clients bypassing the cache? Check if some code paths hit the database directly without checking the cache. This is surprisingly common in large codebases: a new developer writes a query that does not go through the caching layer.
The diagnostic tool I would build: a breakdown of misses by reason: “key did not exist” (never cached), “key expired” (TTL), “key evicted” (memory pressure). Redis does not give this out of the box, so I would instrument the application to log miss reasons. That single metric immediately tells you which of the above is the root cause.
Q7. Compare Kafka with a traditional message queue like RabbitMQ. When would you reach for each, and what are the architectural implications of your choice?
Strong Answer
Red Flags
- Says “Kafka is better because it is faster” without nuance
- Does not understand that Kafka retains messages after consumption while RabbitMQ deletes them
- Cannot explain consumer groups or offsets in Kafka
- No mention of replay capability as a key differentiator
- Does not acknowledge operational complexity differences
Follow-up: A consumer is falling behind and lag is growing in Kafka. How do you diagnose and fix this?
Strong Answer
kafka-consumer-groups --describe or a monitoring tool (Burrow, Kafka Exporter for Prometheus) to see lag per partition. If lag is uniform across partitions, it is a throughput problem. If lag is concentrated on specific partitions, it is a hot partition or a stuck consumer problem.
Step 2: Is the consumer processing fast enough? Profile the consumer. Is it CPU-bound (complex processing per message)? Is it I/O-bound (writing to a slow database)? Is it blocking on external calls? The most common cause I have seen is that the consumer makes a synchronous database write per message when it should be batching. Switching from single-row inserts to batch inserts of 100-1000 rows can improve throughput by 10-50x.
Step 3: Is there enough parallelism? Kafka parallelism is bounded by the number of partitions. If you have 10 partitions and 10 consumers, you are maxed out; adding more consumers does nothing. You would need to increase partitions (a disruptive operation, especially if you depend on key-based ordering). Alternatively, each consumer can process messages using internal thread pools, but you lose per-partition ordering guarantees.
Step 4: Is it a rebalance storm? If consumers are joining and leaving the group frequently (due to long processing time exceeding max.poll.interval.ms, or health check failures), the group constantly rebalances, and during rebalancing, no consumption happens. The fix is to increase max.poll.interval.ms, decrease max.poll.records, or use cooperative sticky rebalancing to minimize disruption.
Step 5: Is the consumer committing offsets efficiently? If using synchronous commits after every message, that adds latency. Switch to async commits or commit every N messages.
If lag is unrecoverable (days behind and growing), you may need to make a business decision: skip to the latest offset and accept data loss, or provision significantly more consumer capacity to catch up gradually.
Going Deeper: How does Kafka achieve its high throughput despite writing to disk?
Strong Answer
sendfile(). When a consumer reads data, Kafka uses the sendfile() system call to transfer data directly from the page cache to the network socket, bypassing user space entirely. Normal I/O requires: disk to page cache to application buffer to socket buffer to NIC. Zero-copy skips the two middle copies. This is a huge win for throughput.
Batching everywhere. Producers batch multiple messages into a single network request. Brokers write batches as a single sequential append. Consumers fetch batches. This amortizes the per-message overhead (network round trips, syscalls, headers) across many messages.
Compression at the batch level. Producers can compress batches (Snappy, LZ4, zstd). Because similar messages in a batch compress well together, you get better compression ratios than compressing individual messages. The broker stores and replicates compressed batches without decompressing.
Put it all together: batched, compressed, sequential writes to the page cache with zero-copy reads. That is why a single Kafka broker can sustain 800+ MB/s of throughput. The disk is not the bottleneck; the network usually is.
Q8. A canary deployment is showing a 2% error rate increase compared to the stable version. Walk me through your decision framework for whether to proceed, roll back, or investigate.
Strong Answer
- Roll back immediately if: error rate is clearly significant AND affects user-facing functionality AND you do not immediately understand the cause. In production, you protect users first and debug later.
- Investigate (pause rollout, do not roll back) if: errors are significant but isolated to a non-critical path, or you have a strong hypothesis about the cause that you can verify quickly (under 15 minutes).
- Proceed with caution if: errors are within normal variance, you have high confidence they are unrelated to the change, and expanding to 5% traffic does not increase the rate.
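Each branch of the decision above depends on whether the error-rate difference is statistically significant. One common way to check is a two-proportion z-test, sketched below. The request and error counts are invented for illustration, and the normal approximation is itself an assumption that gets shaky when error counts are very small.

```python
import math
from typing import Tuple


def two_proportion_z_test(errors_a: int, total_a: int,
                          errors_b: int, total_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the hypothesis that both error rates are equal."""
    p_a = errors_a / total_a
    p_b = errors_b / total_b
    pooled = (errors_a + errors_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical counts: stable at 5% errors, canary at 7% (a 2-point increase).
STABLE = (5_000, 100_000)

z_small, p_small = two_proportion_z_test(*STABLE, 7, 100)       # canary saw 100 requests
z_large, p_large = two_proportion_z_test(*STABLE, 700, 10_000)  # canary saw 10,000 requests

print(f"100 canary requests:    z={z_small:.2f}, p={p_small:.3f}")
print(f"10,000 canary requests: z={z_large:.2f}, p={p_large:.2e}")
```

The same 2-point gap is indistinguishable from noise on 100 requests and overwhelming on 10,000, which is why sample size belongs in the conversation before anyone says "roll back" or "proceed".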
Red Flags
- Immediately says “roll back” without asking about statistical significance
- Immediately says “proceed” without investigation — disregards error signals
- No awareness that error rate on a small sample can be noise
- Does not differentiate between types of errors (5xx vs 4xx, user-facing vs internal)
- No mention of business impact as part of the decision
Follow-up: How would you build an automated canary analysis system that makes this decision without human intervention?
Strong Answer
Q9. Explain the Circuit Breaker pattern. When is it essential, and when is it actually harmful?
Strong Answer
- Preventing cascading failures. If Service A calls Service B, and B is down, A’s threads/connections pile up waiting for B’s timeouts. Soon A is exhausted and fails too, which cascades to everything that depends on A. The circuit breaker fails fast, freeing A’s resources.
- Protecting a recovering service. When a downstream service crashes and restarts, it is vulnerable. If all callers immediately slam it with backed-up requests, it crashes again. The circuit breaker gradually allows traffic back (half-open state), giving the recovering service time to warm up.
- Reducing latency during outages. Instead of waiting for a 10-second timeout on every request, the circuit breaker returns instantly. This improves user experience — a fast error is better than a slow error.
- Idempotent retry scenarios. If the downstream failure is transient (network blip, brief GC pause) and your retry strategy would handle it in 100ms, a circuit breaker that opens and blocks all requests for 30 seconds is a massive overreaction. You lose 30 seconds of availability to protect against a 100ms blip.
- Systems with natural variance. If the downstream has a legitimately high error rate for some requests (e.g., 5% of requests fail because of bad user input, not downstream issues), the circuit breaker may trip on expected errors. You need to differentiate between errors that indicate downstream health problems (5xx, timeouts) and errors that are expected (4xx).
- Single-dependency systems. If your service has exactly one downstream and it is the only thing you do, opening the circuit breaker means you are 100% unavailable. There is no graceful degradation possible. The circuit breaker only helps when you can serve partial functionality or a fallback.
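The closed, open, and half-open states referred to above fit in a few lines. This is a minimal single-threaded sketch, not a production breaker: the class and parameter names are hypothetical, it counts every exception as a failure (a real one would skip expected errors such as 4xx, as noted above), and the 5-failure threshold and 30-second reset window are illustrative defaults.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker fails fast instead of calling downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None
        self.state = "closed"

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            self.state = "half_open"  # reset window elapsed: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        # A failed trial call reopens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self._failures >= self._failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()


now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=5, reset_timeout=30.0, clock=lambda: now[0])

def flaky() -> None:
    raise TimeoutError("downstream timed out")

for _ in range(5):
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass
print("state after 5 consecutive failures:", breaker.state)
```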
Red Flags
- Can only describe the pattern at a textbook level without discussing when it hurts
- Does not mention the half-open state or how recovery works
- Cannot explain cascading failures as the primary motivation
- Thinks circuit breaker and retry are the same thing
- No awareness of the tuning challenge
Follow-up: How does a circuit breaker interact with retries, timeouts, and bulkheads in a resilience stack?
Strong Answer
Q10. You need to do a back-of-the-envelope estimation for the storage requirements of a URL shortener serving 100 million new URLs per day. Walk me through your calculation.
Strong Answer
- Short code: ~7 characters = 7 bytes (base62 encoding: 62^7 = ~3.5 trillion possible codes, more than enough)
- Original URL: average ~100 bytes (URLs vary, but this is a reasonable average)
- Creation timestamp: 8 bytes
- Expiration timestamp: 8 bytes
- User ID (optional): 8 bytes
- Click count (optional): 8 bytes
- 100 million URLs/day times 200 bytes = 20 GB per day
- 20 GB/day times 30 = 600 GB per month
- 20 GB/day times 365 = ~7.3 TB per year
- 7.3 TB times 3 = ~22 TB per year of raw storage
- 22 TB times 5 = ~110 TB over five years
- Write QPS: 100M/86400 = ~1,150 QPS average, ~3,500 peak
- Read QPS: ~115,000 average, ~350,000 peak
- The dataset is not enormous — 7.3 TB per year fits comfortably in a single sharded database cluster
- The read QPS (350K peak) strongly suggests we need caching. Redis can easily handle this. If the hot set (recently created URLs) fits in a few hundred GB of RAM, we get sub-millisecond reads for most requests
- The write QPS (3.5K peak) is very manageable for most databases — even a single Postgres instance can handle this
- I would use DynamoDB or Cassandra for this because the access pattern is pure key-value (short code to URL), no joins, no complex queries. DynamoDB gives me single-digit millisecond latency at any scale with zero operational overhead
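The arithmetic above is easy to re-derive in a few lines, which helps when an interviewer changes one input. The 200-byte record, 3x replication factor, 100:1 read-to-write ratio, and 3x peak multiplier below are assumptions inferred from the figures in the answer (the listed fields sum to about 139 bytes, rounded up for overhead). Decimal units are used to match the round numbers.

```python
URLS_PER_DAY = 100_000_000
BYTES_PER_RECORD = 200      # assumed: ~139 bytes of fields rounded up for overhead
REPLICATION_FACTOR = 3      # assumed from "7.3 TB times 3"
READ_WRITE_RATIO = 100      # assumed from ~1,150 writes vs ~115,000 reads per second
PEAK_MULTIPLIER = 3         # assumed from average vs peak figures
SECONDS_PER_DAY = 86_400
GB, TB = 10**9, 10**12      # decimal units

daily_bytes = URLS_PER_DAY * BYTES_PER_RECORD
yearly_bytes = daily_bytes * 365
replicated_yearly_bytes = yearly_bytes * REPLICATION_FACTOR
five_year_bytes = replicated_yearly_bytes * 5

write_qps = URLS_PER_DAY / SECONDS_PER_DAY
read_qps = write_qps * READ_WRITE_RATIO

print(f"storage/day:           {daily_bytes / GB:.0f} GB")
print(f"storage/year:          {yearly_bytes / TB:.1f} TB")
print(f"replicated/year:       {replicated_yearly_bytes / TB:.1f} TB")
print(f"replicated/five years: {five_year_bytes / TB:.1f} TB")
print(f"write QPS avg/peak:    {write_qps:,.0f} / {write_qps * PEAK_MULTIPLIER:,.0f}")
print(f"read QPS avg/peak:     {read_qps:,.0f} / {read_qps * PEAK_MULTIPLIER:,.0f}")
```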
Red Flags
- Cannot produce reasonable per-record size estimates
- Forgets replication factor
- Does not calculate QPS alongside storage — storage alone does not drive architecture
- Gives exact numbers instead of orders of magnitude (the point is approximate reasoning, not precision)
- Does not state assumptions explicitly
Follow-up: How would you generate the short codes to ensure uniqueness at 100 million URLs per day?
Strong Answer
Q11. Explain the difference between horizontal and vertical scaling. Then tell me when vertical scaling is actually the right choice.
Strong Answer
- Operational simplicity. One database to back up, monitor, tune, and upgrade is massively simpler than a 10-node cluster. Distributed systems introduce failure modes (network partitions, split brains, rebalancing storms) that a single node simply does not have.
- Strong consistency is trivial. A single node is always consistent with itself. No consensus protocols, no replication lag, no conflicting writes.
- Cost efficiency at moderate scale. Running one large instance is often cheaper than running and coordinating many small ones, especially when you factor in the engineering time to operate a distributed system.
- Stateful workloads that resist partitioning. Some workloads — like a game server managing shared state for 1000 players in a single match — do not naturally partition. Horizontal scaling requires rethinking the data model.
- You hit the ceiling. There is a maximum machine size. When you need more than the biggest available instance, you have no choice but to go horizontal.
- Availability requirements. A single machine is a single point of failure. If you need four or five nines, you need redundancy, which is horizontal by definition.
- Geographic distribution. You cannot put one machine in three continents simultaneously.
Red Flags
- Immediately dismisses vertical scaling as “not scalable” without nuance
- Cannot name a single scenario where vertical scaling is preferable
- Thinks horizontal scaling has no downsides
- Does not mention operational complexity as a real cost of horizontal scaling
- No mention of the hybrid approach (vertical DB, horizontal app)
Follow-up: Your PostgreSQL primary is at 80% CPU utilization. Walk me through your options before adding read replicas.
Strong Answer
pg_stat_statements to find the top queries by total time and by frequency. In my experience, 80% of database load comes from 5% of queries. Look for sequential scans on large tables (add missing indexes), N+1 queries from the ORM (batch them), and queries that return more data than needed (add appropriate column selection and LIMIT clauses). This alone has taken me from 80% to 30% CPU before.
2. Connection management. Check pg_stat_activity for the number of connections. If you have hundreds of application connections directly to Postgres, the per-connection overhead (memory, context switching) is significant. Add PgBouncer or RDS Proxy in front of Postgres to multiplex connections. This can reduce effective connection count by 10-100x.
3. Caching hot queries. If the same query is executed thousands of times per second with the same parameters, cache the result in Redis. Profile-by-ID lookups, configuration values, and materialized aggregations are prime candidates. A 95% cache hit rate means Postgres sees 20x fewer queries.
4. Upgrade the instance. If you are on a db.r6g.xlarge (4 vCPU), move to r6g.4xlarge (16 vCPU). This is a restart but no application changes. Buying yourself 4x headroom with a 15-minute maintenance window is a great trade.
5. VACUUM and bloat. Check for table bloat with pg_stat_user_tables. If autovacuum is not keeping up, dead tuples accumulate, tables and indexes bloat, and queries scan more data than necessary. Tuning autovacuum parameters or running a manual VACUUM FULL on bloated tables can dramatically reduce I/O.
6. Partitioning. If one table dominates the load and is very large, table partitioning (by date, by customer) can reduce query scope. Instead of scanning a 500M-row table, you scan a 10M-row partition.
Only after exhausting all of these would I add read replicas, and even then, I would be careful to route only truly read-only queries to them and ensure the application handles replication lag gracefully.
Q12. What is the Outbox pattern and why is it important in microservice architectures?
Strong Answer
Red Flags
- Does not understand the dual-write problem that motivates the pattern
- Suggests using a distributed transaction instead (2PC across a database and Kafka is extremely fragile)
- Cannot explain the difference between polling and CDC approaches
- Does not mention idempotency requirements for the consumer
- Thinks “just write to Kafka first” is fine
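To make the red flags above concrete, here is a minimal sketch of the polling flavor of the Outbox pattern, with an idempotent consumer on the receiving side. It is an illustration under stated assumptions, not a production design: SQLite stands in for the service database, the `publish` callback stands in for a Kafka producer, and the table names, column names, and the `IdempotentConsumer` class are hypothetical.

```python
import json
import sqlite3
import uuid


def make_db():
    """Create an in-memory database with an orders table and an outbox table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
    conn.execute(
        "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, topic TEXT, "
        "payload TEXT, published INTEGER DEFAULT 0)"
    )
    return conn


def place_order(conn, customer, total):
    """Write the order and its event in ONE local transaction (no dual write)."""
    event_id = str(uuid.uuid4())
    with conn:  # both inserts commit together, or both roll back
        cur = conn.execute(
            "INSERT INTO orders (customer, total) VALUES (?, ?)", (customer, total)
        )
        payload = json.dumps({"order_id": cur.lastrowid, "total": total})
        conn.execute(
            "INSERT INTO outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (event_id, "order.placed", payload),
        )
    return event_id


def relay_once(conn, publish):
    """Polling relay: publish pending rows, then mark them as published.

    A crash between publish() and the UPDATE means the row is published
    again on the next poll, so delivery is at-least-once.
    """
    rows = conn.execute(
        "SELECT event_id, topic, payload FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, event_id, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,))
    return len(rows)


class IdempotentConsumer:
    """Deduplicates by event ID so redelivered events are harmless."""

    def __init__(self):
        self.seen = set()   # stand-in for a processed_events table
        self.handled = []

    def handle(self, topic, event_id, payload):
        if event_id in self.seen:
            return False    # duplicate delivery, skip side effects
        self.seen.add(event_id)
        self.handled.append(json.loads(payload))
        return True
```

A CDC-based relay would replace `relay_once` with a log tailer, but the transactional write in `place_order` and the consumer-side deduplication stay the same.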
Follow-up: How do you ensure exactly-once processing when the consumer reads from the Outbox via Kafka?
Strong Answer
read_committed consumer isolation, you get exactly-once semantics within Kafka itself. But the consumer’s side effects (database writes, API calls) still need application-level idempotency.
The practical answer: use event IDs for deduplication, make operations idempotent where possible, and accept that in rare edge cases (consumer crash between processing and committing offset), you might process an event twice. If your operations are idempotent, this is harmless.
Advanced Interview Scenarios
Q13. Your Kubernetes pod is in CrashLoopBackOff and the logs show nothing obvious. Walk me through your debugging process, step by step.
What weak candidates say
kubectl logs and maybe restart the pod.” They treat Kubernetes as a black box and have no mental model of the pod lifecycle. They stop investigating after one tool returns nothing useful.
What strong candidates say
kubectl describe pod <name> is the single most informative command. I am looking at three things: the Events section at the bottom (which tells me why the container is being restarted), the Last State section (which shows the exit code: 137 is OOMKilled, 1 is application error, 139 is segfault), and whether the Reason field says “OOMKilled,” “Error,” or “ContainerCannotRun.”
Step 2: Check previous container logs. kubectl logs <pod> --previous shows the logs from the last crashed container. Most people forget this flag. The current container might have zero logs because it just started and immediately crashed, but the previous one might have logged the actual error before dying.
Step 3: Exit code analysis. Exit code 137 is the most common surprise. It means the process received SIGKILL (128 + signal 9), almost always because the container exceeded its memory limit. The application might not log anything because OOM kills are instant: the process does not get a chance to catch the signal. I have seen this catch entire teams for days. The fix is to either increase the memory limit in the pod spec or fix the memory leak. I check kubectl top pod and compare against the resource limits in the deployment YAML.
Step 4: Check resource quotas and limits. kubectl describe node <node> shows if the node itself is under memory or CPU pressure. If the node is in MemoryPressure, pods get evicted regardless of their own limits. Also check if there is a LimitRange or ResourceQuota in the namespace that is silently constraining the pod below what it needs.
Step 5: Init container failures. If the pod has init containers (common for migration scripts, sidecar injection, secret fetching), the init container might be the one failing. kubectl describe pod will show init container status separately. I have seen production outages where the Vault sidecar init container could not reach Vault, and the application container never started.
Step 6: Image pull issues. If the image tag is wrong or the registry requires authentication, the container cannot even start. This shows as “ImagePullBackOff” initially but can transition to CrashLoopBackOff in some edge cases with init containers.
Step 7: If I am truly stuck, I attach a debug container with kubectl debug (using ephemeral containers in K8s 1.23+) that shares the pod’s process namespace, or I temporarily override the container command with sleep infinity in the deployment spec so the container stays running and I can exec in and debug interactively.
War Story: At a previous company, we had a microservice that went into CrashLoopBackOff every Monday morning at 6 AM. Logs showed nothing. Exit code was 137. Turned out the service loaded a config file from a ConfigMap that was updated by a CronJob every Sunday night, and the new config included a larger in-memory lookup table that pushed the container over its 512MB limit. The fix took 30 seconds (bump the limit to 1GB), but the diagnosis took two weeks because nobody checked the exit code and everyone was looking at application logic.
Follow-up: The exit code is 137 but your container’s memory limit is 4GB and your app only uses 800MB according to metrics. What is going on?
Strong Answer
- Native memory in JVM. If this is a Java service, the JVM heap might be 800MB, but Metaspace, thread stacks (1MB per thread by default, times 200 threads = 200MB), JIT compiled code cache, direct ByteBuffers (Netty loves these), and native memory from JNI calls can easily add another 1-2GB. Use -XX:NativeMemoryTracking=summary and jcmd <pid> VM.native_memory to see the full picture.
- Page cache thrashing. If the application reads large files or does memory-mapped I/O, the kernel counts those pages against the cgroup. A service that sequentially reads a 3GB log file for processing can trigger OOM even though its heap is tiny.
- Sidecar containers. If Istio’s Envoy proxy or a logging sidecar shares the pod, their memory counts toward the pod limit. Envoy alone uses 40-100MB. A Fluentd sidecar can use 200-500MB depending on buffer configuration. So your “4GB pod” might only have 3GB for the application.
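The JVM bullet above is easier to reason about with the arithmetic written out. The sketch below simply adds up heap and non-heap areas and compares the total to a limit. Only the 800MB heap and the 200 threads at 1MB each come from the text; the Metaspace, code cache, and direct buffer figures are assumed values for a hypothetical service, and the function names are illustrative.

```python
def jvm_footprint_mib(heap_mib, threads, stack_mib_per_thread=1.0,
                      metaspace_mib=0.0, code_cache_mib=0.0, direct_buffers_mib=0.0):
    """Rough estimate of JVM memory in MiB: heap plus the main non-heap areas."""
    thread_stacks = threads * stack_mib_per_thread
    return heap_mib + thread_stacks + metaspace_mib + code_cache_mib + direct_buffers_mib


def headroom_mib(limit_mib, footprint_mib):
    """Memory left under the container limit; negative means an OOM kill is likely."""
    return limit_mib - footprint_mib


# Hypothetical service: heap and thread count from the text, the rest assumed.
estimate = jvm_footprint_mib(
    heap_mib=800, threads=200,
    metaspace_mib=256, code_cache_mib=240, direct_buffers_mib=1024,
)
print(f"estimated footprint: {estimate:.0f} MiB, "
      f"headroom under a 4096 MiB limit: {headroom_mib(4096, estimate):.0f} MiB")
```

With these assumed numbers the process is already at roughly three times its reported heap, which is why heap metrics alone can be misleading when diagnosing exit code 137.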
cat /sys/fs/cgroup/memory/memory.usage_in_bytes inside the container right before it crashes (on cgroup v2 hosts the file is /sys/fs/cgroup/memory.current), or look at the cAdvisor/Prometheus container_memory_usage_bytes metric, to see the real cgroup usage, not just the application-reported heap.
Follow-up: How do you set resource requests and limits properly? Most teams get this wrong.
Strong Answer
Q14. Your team adopted GraphQL six months ago. The frontend team loves it. But your database is falling over. What is happening and how do you fix it?
What weak candidates say
What strong candidates say
{ users { posts { comments } } } and there are 50 users, each with 20 posts, here is what happens. The users resolver fires one query: SELECT * FROM users (1 query). Then the posts resolver fires for each user: SELECT * FROM posts WHERE user_id = ? (50 queries). Then the comments resolver fires for each post: SELECT * FROM comments WHERE post_id = ? (50 * 20 = 1,000 queries). Total: 1,051 queries for one GraphQL request. Scale that to 100 requests per second and you are hitting the database with 105,000 queries per second. No database survives that.
Fix 1: DataLoader (essential, non-negotiable). DataLoader batches and deduplicates data-fetching calls within a single request. Instead of 50 individual SELECT ... WHERE user_id = ? calls, DataLoader collects all 50 user IDs and issues one SELECT ... WHERE user_id IN (?, ?, ..., ?) call. Those 1,051 queries collapse to 3 queries. At Facebook, where GraphQL was born, DataLoader is not optional; it is part of the architecture. Every team adopting GraphQL should implement DataLoader from day one.
Fix 2: Query depth and complexity limiting. Without limits, a client can craft a deeply nested query that causes exponential resolver calls. Implement query complexity analysis that assigns a cost to each field, and reject queries that exceed a budget. Libraries like graphql-query-complexity or graphql-cost-analysis handle this. At Shopify, they assign point costs to every field in their storefront API and reject queries above 1,000 points.
Fix 3: Persisted queries. Instead of letting clients send arbitrary query strings, pre-register allowed queries at build time and have clients send a query hash. This prevents malicious or accidental query abuse, enables server-side query optimization, and makes caching feasible (you can cache by query hash + variables).
Fix 4: Look-ahead optimization. Use GraphQL’s info parameter in resolvers to inspect what fields the client actually requested. If the client only asked for users { name, email } and did not request posts, do not eagerly load posts. In SQL terms, this means your root resolver can dynamically build a JOIN or a SELECT with only the necessary columns based on the requested fields.
War Story: I worked with a team that migrated from REST to GraphQL and their average database query count per API request went from 3 to 47. Their p99 latency went from 80ms to 2.3 seconds. They thought GraphQL was inherently slow. It took us one afternoon to add DataLoader and query complexity limits. The query count dropped to 4 per request, and p99 went back to 90ms. The lesson: GraphQL is not slow; naive resolver implementations are slow.
Follow-up: How do you cache GraphQL responses? You cannot use standard HTTP caching because everything is a POST to a single endpoint.
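As an aside on Fix 1 from the main answer, the sketch below illustrates the batching idea with the same numbers (50 users, 20 posts each). It is not the DataLoader library's API: `FakeDB` is a hypothetical in-memory stand-in that only counts round trips, and the two comments per post are an assumed value. The naive resolver issues one call per parent object, while the batched resolver collects all keys at each level and issues one call per level.

```python
from collections import defaultdict


class FakeDB:
    """In-memory stand-in for a SQL database that counts round trips."""

    def __init__(self, n_users=50, posts_per_user=20, comments_per_post=2):
        self.queries = 0
        self.users = list(range(1, n_users + 1))
        self.posts = []      # (post_id, user_id)
        self.comments = []   # (comment_id, post_id)
        for user_id in self.users:
            for _ in range(posts_per_user):
                post_id = len(self.posts) + 1
                self.posts.append((post_id, user_id))
                for _ in range(comments_per_post):
                    self.comments.append((len(self.comments) + 1, post_id))

    def all_users(self):
        self.queries += 1    # SELECT * FROM users
        return list(self.users)

    def posts_for_users(self, user_ids):
        self.queries += 1    # SELECT ... WHERE user_id IN (...)
        wanted = set(user_ids)
        out = defaultdict(list)
        for post_id, user_id in self.posts:
            if user_id in wanted:
                out[user_id].append(post_id)
        return out

    def comments_for_posts(self, post_ids):
        self.queries += 1    # SELECT ... WHERE post_id IN (...)
        wanted = set(post_ids)
        out = defaultdict(list)
        for comment_id, post_id in self.comments:
            if post_id in wanted:
                out[post_id].append(comment_id)
        return out


def resolve_naive(db):
    """One query per parent object: the N+1 pattern."""
    result = {}
    for user_id in db.all_users():
        posts = db.posts_for_users([user_id])[user_id]
        result[user_id] = {p: db.comments_for_posts([p])[p] for p in posts}
    return result


def resolve_batched(db):
    """Collect all keys per level, then issue one query per level."""
    users = db.all_users()
    posts = db.posts_for_users(users)
    post_ids = [p for u in users for p in posts[u]]
    comments = db.comments_for_posts(post_ids)
    return {u: {p: comments[p] for p in posts[u]} for u in users}


naive_db, batched_db = FakeDB(), FakeDB()
naive_result = resolve_naive(naive_db)
batched_result = resolve_batched(batched_db)
print(naive_db.queries, batched_db.queries)  # 1051 vs 3 round trips
```

A real DataLoader achieves the same effect without restructuring resolvers, by queueing the individual `load(key)` calls made during one tick of the event loop and dispatching them as a single batch.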
Strong Answer
user(id: 5) as part of completely different query shapes, caching at the field level gives you better hit rates. Apollo Server’s @cacheControl directive lets you set per-field TTLs, and a cache plugin stores/retrieves resolved values.
Normalized caching on the client. Apollo Client and urql maintain a normalized cache keyed by __typename:id. When a mutation updates a user, the client automatically updates that user everywhere it appears in any cached query result. This means the client often does not need to re-fetch after mutations, reducing server load.
CDN-level with edge caching. For public, non-authenticated queries (product catalogs, content), use persisted queries over GET and cache at the CDN layer. Fastly and Cloudflare both support this pattern. For authenticated queries, you are out of luck with CDN caching unless you segment by role or user tier.
The hard truth: if your GraphQL API serves highly personalized, authenticated data with complex nesting, caching is significantly harder than REST. That is a real trade-off you accept when choosing GraphQL.
Follow-up: When would you argue against adopting GraphQL, even when the frontend team is pushing for it?
Strong Answer
Q15. You are asked to implement a distributed lock. Your first thought is to use Redis with SET NX EX. Why is this harder than it looks, and what goes wrong in production?
What weak candidates say
SETNX with a TTL and you’re done.” They describe the happy path and have never thought about what happens when a lock holder crashes, clocks drift, or the Redis node fails. They may not even know that Redlock exists or why it was invented.
What strong candidates say
SET lock_key unique_value NX EX 30) works beautifully on a whiteboard and fails in production in at least three ways.
Problem 1: Lock expiry while the holder is still working. You acquire the lock with a 30-second TTL. Your processing takes 35 seconds (GC pause, slow downstream call, network partition to the DB). The lock expires after 30 seconds, another process acquires it, and now two processes are in the critical section simultaneously. The “fencing token” pattern mitigates this: the lock returns a monotonically increasing token, and the resource being protected (e.g., the database) rejects operations with a token lower than the highest it has seen. But this requires the downstream resource to cooperate, which is not always possible.
Problem 2: Redis failover. You write the lock to a Redis primary. Before the lock replicates to the replica, the primary crashes. The replica is promoted to primary, and the lock does not exist. Another process acquires the same lock. You now have two lock holders. This is not theoretical: Redis replication is asynchronous by default. This failure mode is what motivated the Redlock algorithm.
Problem 3: Clock drift and Redlock. The Redlock algorithm (acquire the lock on a majority, N/2+1, of N independent Redis nodes) was designed to address Problem 2. But Martin Kleppmann, in his well-known critique of Redlock, showed that it relies on the assumption that process pauses and clock drift are bounded. If a process pauses for longer than the lock TTL (GC, swap, CPU scheduling), Redlock’s safety guarantee breaks. Salvatore Sanfilippo (Redis creator) and Kleppmann had a public debate about this, and the conclusion is: Redlock provides better safety than a single Redis node, but it is not a substitute for a consensus-based system if you need absolute correctness.
What I actually recommend:
- For efficiency locks (preventing duplicate work, rate limiting) where occasional double-execution is annoying but not catastrophic: single-node Redis SET NX EX is fine. Just accept the rare failure and make your operations idempotent.
- For correctness locks (financial transactions, inventory decrements) where double-execution causes real damage: use a consensus-based system like etcd or ZooKeeper. etcd’s lease-based locks use Raft consensus and provide linearizable guarantees. The performance is lower (~2-10ms per operation vs sub-ms for Redis), but the correctness guarantee is real.
- For database-specific locking: PostgreSQL’s SELECT ... FOR UPDATE or advisory locks are underrated. If your critical section is a database transaction anyway, use the database’s own locking. No external coordination needed.
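The fencing-token idea from Problem 1 can be sketched in a few lines. This is a toy model under stated assumptions, not a real lock client: `LockService` and `FencedStore` are hypothetical in-process classes standing in for a lock server and the protected resource. The point it illustrates is that the safety check lives in the resource, which rejects any write carrying a token lower than the highest it has already seen.

```python
import itertools


class StaleTokenError(Exception):
    """Raised when a write arrives with a token older than one already seen."""


class LockService:
    """Toy lock service: every acquisition returns a larger fencing token."""

    def __init__(self):
        self._tokens = itertools.count(1)

    def acquire(self):
        return next(self._tokens)


class FencedStore:
    """Protected resource that remembers the highest token it has accepted."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        if token < self.highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self.highest_token}"
            )
        self.highest_token = token
        self.value = value


locks = LockService()
store = FencedStore()

token_a = locks.acquire()   # client A gets the lock, then stalls (e.g. GC pause)
token_b = locks.acquire()   # lock expired, so client B acquires it
store.write(token_b, "written by B")

try:
    store.write(token_a, "late write by A")   # A wakes up and tries to write
except StaleTokenError as exc:
    print("rejected:", exc)
```

The lock itself never prevented A and B from overlapping; the store's token comparison is what keeps A's stale write out.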
SETNX to prevent duplicate charge processing. It worked perfectly for 11 months. Then a Redis failover during a peak traffic period caused 23 duplicate charges in 4 minutes. Total customer impact: $47,000 in overcharges. They switched to PostgreSQL advisory locks the next week. The lock acquisition was 3ms slower, but correctness was absolute. The CTO’s quote: “I will trade 3ms for not calling 23 customers to apologize.”
Follow-up: Walk me through the Redlock algorithm and why Martin Kleppmann says it is fundamentally flawed.
Strong Answer
SET NX EX to N independent Redis nodes (typically 5). It records the time before starting. If it successfully sets the lock on N/2+1 (majority) nodes, AND the total time to acquire is less than the lock TTL, the lock is acquired. The effective lock validity is TTL minus the time spent acquiring.
Kleppmann’s core argument is about time. Redlock assumes that processes do not pause for longer than the lock validity period. But in a garbage-collected language, a GC pause can last hundreds of milliseconds or even seconds. Here is the failure scenario: Client A acquires the Redlock. Client A enters a long GC pause. The lock TTL expires while A is paused. Client B acquires the Redlock. Client A’s GC pause ends. Client A still believes it holds the lock (nothing tells a paused client that its lock has expired). Both A and B now execute the critical section.
Why fencing tokens fix this (but Redlock does not provide them natively): a fencing token is a monotonically increasing number returned with each lock acquisition. When A acquired the lock it got token 34, and when B later acquired it, B got token 35. Once B has written with token 35, the storage system rejects A’s late write because 34 < 35. But if you have a storage system that supports fencing tokens, Kleppmann argues, you can use that same system for locking, so you do not need Redis at all.
Sanfilippo’s counterargument was that clock drift and process pauses are bounded in practice, and Redlock is designed for “practical distributed systems” not “theoretical adversarial environments.” Both are right; it depends on your failure tolerance.
Follow-up: If you need a truly correct distributed lock today, what do you use and what is the latency cost?
Strong Answer
Q16. You have a microservice that works perfectly in integration tests but fails intermittently in production with 502 errors. Tests pass, staging passes, production fails. What is your mental model for this class of bug?
What weak candidates say
What strong candidates say
- kubectl describe pod: is the pod healthy? Restart count?
- Load balancer access logs: which upstream returned the 502? Was it a timeout or a connection reset?
- Distributed tracing (Jaeger/Datadog APM): where did the failing requests spend their time?
- Thread dump / goroutine dump: are threads blocked? On what?
- Compare QPS at time of failures vs baseline: is this a load-triggered issue?
Follow-up: How do you build test environments that actually catch these production-only bugs?
Strong Answer
Q17. Your team wants to add a service mesh (Istio). You think it is a mistake. Make the argument against it.
What weak candidates say
What strong candidates say
- mTLS: Use cert-manager with SPIFFE identities, or just use network policies in Kubernetes to restrict pod-to-pod communication. If you are in a single VPC and trust your network boundary, mTLS between internal services may be security theater for your threat model.
- Observability: Instrument with OpenTelemetry directly in your application code. You get more meaningful traces (business-level spans, not just HTTP hops) with less infrastructure overhead.
- Retries and circuit breaking: Libraries like resilience4j (Java), Polly (.NET), or go-retryablehttp handle this at the application layer with more nuanced control than mesh-level policies.
- Traffic splitting for canaries: Use Argo Rollouts or Flagger, which integrate with your existing ingress controller without requiring a full mesh.
Follow-up: At what team size and service count does a service mesh start making sense?
Strong Answer
Q18. You inherit a system using the Saga pattern for a distributed transaction that spans three microservices. Orders are occasionally being left in an inconsistent state. What is going wrong?
What weak candidates say
What strong candidates say
- Add a saga_state table that tracks each saga instance: saga_id, current step, status (in_progress, completed, compensating, failed), last_updated timestamp.
- Add a sweeper that detects sagas stuck in any state for longer than the SLA and triggers compensations or alerts.
- Add idempotency keys to every saga step.
- Add compensating transaction retry logic with dead-letter alerting.
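The first two fixes in the list above can be sketched together. This is a minimal illustration, assuming SQLite as a stand-in for the service database and epoch seconds for `last_updated`. The column names follow the list; the `sweep` policy (move stuck in_progress sagas to compensating, alert on sagas stuck in compensating) is one hypothetical choice among several.

```python
import sqlite3

SCHEMA = """
CREATE TABLE saga_state (
    saga_id      TEXT PRIMARY KEY,
    current_step TEXT NOT NULL,
    status       TEXT NOT NULL
                 CHECK (status IN ('in_progress', 'completed', 'compensating', 'failed')),
    last_updated REAL NOT NULL
)
"""


def find_stuck(conn, now, sla_seconds):
    """Return non-terminal sagas that have not progressed within the SLA."""
    return conn.execute(
        "SELECT saga_id, current_step, status FROM saga_state "
        "WHERE status IN ('in_progress', 'compensating') AND last_updated < ? "
        "ORDER BY saga_id",
        (now - sla_seconds,),
    ).fetchall()


def sweep(conn, now, sla_seconds):
    """Flip stuck in_progress sagas to compensating; report stuck compensations."""
    to_compensate, to_alert = [], []
    for saga_id, _step, status in find_stuck(conn, now, sla_seconds):
        if status == "in_progress":
            with conn:
                # The status guard keeps the update safe if the saga moved on
                # between the SELECT and this UPDATE.
                conn.execute(
                    "UPDATE saga_state SET status = 'compensating', last_updated = ? "
                    "WHERE saga_id = ? AND status = 'in_progress'",
                    (now, saga_id),
                )
            to_compensate.append(saga_id)
        else:
            to_alert.append(saga_id)   # compensation itself is stuck: page a human
    return to_compensate, to_alert


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
with conn:
    conn.executemany(
        "INSERT INTO saga_state VALUES (?, ?, ?, ?)",
        [
            ("s1", "reserve_inventory", "in_progress", 0.0),
            ("s2", "charge_payment", "completed", 0.0),
            ("s3", "refund_payment", "compensating", 0.0),
            ("s4", "create_order", "in_progress", 950.0),
        ],
    )
result = sweep(conn, now=1000.0, sla_seconds=300.0)
print(result)
```

In a real system the sweeper would run on a schedule and hand `to_compensate` to the retrying compensation logic from the last bullet, with `to_alert` feeding the dead-letter alerting.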
Follow-up: When would you use a saga versus just using a distributed transaction (2PC)?
Strong Answer
PREPARE TRANSACTION supports 2PC natively across multiple Postgres instances. XA transactions in Java work across a database and a JMS broker. The problem is not 2PC itself; it is 2PC across unreliable network boundaries with independent failure modes.
The real answer: most systems should use neither. Design your service boundaries so that transactions do not span services. If “Create Order” and “Reserve Inventory” always happen together, maybe they should be the same service. The need for distributed transactions often reveals a service boundary that was drawn in the wrong place.
Q19. Someone on your team says “we should use event sourcing for our new service.” When is this a great idea, and when is it the worst architectural decision you could make?
What weak candidates say
What strong candidates say
UserUpdated with a before/after diff as the payload, you are not doing event sourcing; you are doing change data capture with extra steps. Real event sourcing has domain-meaningful events: OrderPlaced, PaymentReceived, ItemShipped. If your events are just EntityChanged, reconsider.
War Story: A team I consulted for event-sourced their user management service. They had events like UserCreated, UserEmailChanged, UserAvatarUpdated. After 18 months, they had 2 billion events for 500,000 users. Rebuilding the user projection (needed after a schema change) took 4 hours. Loading a single user required replaying an average of 4,000 events or maintaining snapshots. They eventually migrated to a standard PostgreSQL table with a separate CDC-based audit log. Total migration cost: 6 engineer-weeks. The lesson: event sourcing is not an audit log mechanism. If you only need an audit trail, use CDC or database triggers.
Follow-up: How do you handle event schema evolution when events are immutable?
Strong Answer
Q20. Your monitoring shows everything green — all health checks passing, error rate under 0.1%, latency under SLA. But customers are complaining they cannot complete purchases. What is going on?
What weak candidates say
What strong candidates say
{ "success": false, "error": "Payment processor declined" }. The health check passes. The error rate metric does not increment. The customer cannot buy anything. This is the number-one cause of “everything green but customers angry.” Fix: monitor business-level success metrics, not just HTTP status codes. Track “successful purchases per minute,” not just “non-5xx responses per minute.”
Category 2: The request never reaches your servers. DNS is resolving to an old IP. The CDN is serving a stale version of the JavaScript bundle that has a bug. A WAF rule is blocking legitimate requests from a specific geography. Your health checks pass because they hit the backend directly, bypassing the entire edge stack. Fix: synthetic monitoring (also called “synthetic transactions”) that exercises the full path: DNS resolution -> CDN -> WAF -> load balancer -> backend -> database -> response validation. Use tools like Datadog Synthetics, Catchpoint, or even a simple curl-based check from an external location.
Category 3: Silent data corruption. The purchase flow completes, the API returns 200, but the order record in the database has a null shipping address because a recent migration introduced a bug in the address-parsing logic. The customer gets a confirmation email but the warehouse cannot fulfill the order. Fix: data integrity monitors that periodically validate business invariants: “every order has a non-null shipping address,” “every payment has a corresponding order,” “every order placed in the last hour has a fulfillment record.”
Category 4: Client-side failures invisible to backend monitoring. A JavaScript error in the checkout page prevents the “Place Order” button from firing the API request. Your backend sees nothing because the request was never made. Or: the API call is made but fails with a CORS error that the backend never logs. Fix: client-side error tracking (Sentry, Bugsnag) and real user monitoring (RUM) that captures client-side errors, JavaScript exceptions, and failed network requests.
Category 5: Partial failures in a multi-step flow. Step 1 (add to cart) works. Step 2 (enter shipping) works. Step 3 (payment) fails intermittently. Your aggregate error rate across all endpoints is 0.1%, but the payment endpoint’s error rate is 8%. The low-traffic endpoint’s failures are drowned in the aggregate metric. Fix: per-endpoint monitoring and funnel analysis. Track “started checkout” -> “entered shipping” -> “entered payment” -> “order confirmed” as a funnel. If 1,000 users start checkout but only 200 complete it, something is broken between steps.
My immediate debugging protocol when this happens:
- Check client-side error tracking (Sentry) for JavaScript errors in the purchase flow.
- Check the purchase completion funnel for where users are dropping off.
- Inspect the actual HTTP response bodies for the failing endpoint — are we returning 200 with an error payload?
- Run a synthetic transaction that exercises the full purchase flow end-to-end, including payment processor integration.
- Check if the issue is geography-specific (CDN, WAF, DNS).
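Category 5 is worth working through with numbers, because the arithmetic is what hides the outage. The sketch below uses hypothetical request counts chosen so that the aggregate error rate is 0.1% while the payment endpoint fails 8% of the time, and a hypothetical funnel matching the 1,000-start, 200-complete example. The endpoint paths, counts, and function names are all assumptions for illustration.

```python
def error_rate(requests, errors):
    """Fraction of requests that failed (0.0 when there is no traffic)."""
    return errors / requests if requests else 0.0


def aggregate_error_rate(stats):
    """stats maps endpoint -> (requests, errors); returns the blended rate."""
    total_requests = sum(r for r, _ in stats.values())
    total_errors = sum(e for _, e in stats.values())
    return error_rate(total_requests, total_errors)


def per_endpoint_error_rate(stats):
    return {endpoint: error_rate(r, e) for endpoint, (r, e) in stats.items()}


def funnel_conversions(steps):
    """steps is an ordered list of (name, users); returns step-to-step conversion."""
    return [
        (name, users / prev_users)
        for (_, prev_users), (name, users) in zip(steps, steps[1:])
    ]


# Hypothetical traffic: the low-volume payment endpoint is the broken one.
stats = {
    "/cart": (50_000, 0),
    "/shipping": (29_000, 0),
    "/payment": (1_000, 80),
}
funnel = [
    ("started checkout", 1_000),
    ("entered shipping", 900),
    ("entered payment", 850),
    ("order confirmed", 200),
]

print(f"aggregate: {aggregate_error_rate(stats):.2%}")
for endpoint, rate in per_endpoint_error_rate(stats).items():
    print(f"{endpoint}: {rate:.2%}")
worst_step = min(funnel_conversions(funnel), key=lambda item: item[1])
print("biggest drop-off before:", worst_step[0])
```

Alerting on the per-endpoint rates and on the worst funnel step, rather than on the blended rate, is what turns this from a customer complaint into a page.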
Follow-up: Design a monitoring strategy that catches this class of problem before customers complain.
Strong Answer
Q21. A junior engineer proposes using a Singleton pattern for the database connection pool. It seems reasonable. Why might you push back, and what would you recommend instead?
What weak candidates say
What strong candidates say
ConnectionPool.getInstance() is a global mutable reference. In unit tests, you cannot substitute it with a mock or an in-memory database without reflection hacks or resetting global state between tests. Every test that touches any code path that uses the database now implicitly depends on the Singleton. Test isolation is destroyed.
Configuration rigidity. What happens when you need to connect to two different databases? The Singleton assumes exactly one instance. Now you refactor to ConnectionPool.getInstance("primary") and ConnectionPool.getInstance("analytics"), which is a dictionary of Singletons, at which point you have reinvented a service locator, which has its own problems.
Lifecycle management. When the application shuts down, who closes the connection pool? The Singleton has no natural lifecycle hook. In a web framework, you want the pool to initialize on startup and close gracefully on shutdown, draining active connections. A Singleton’s lifecycle is “created on first access, destroyed when the process exits,” which often means connections are not closed cleanly.
Hidden dependency. Any class in your codebase can call ConnectionPool.getInstance() anywhere. You cannot tell from a class’s constructor or interface what it depends on. This makes code harder to reason about and harder to refactor. If you need to split the application into two modules, which one gets the connection pool?
What I recommend: dependency injection. Create the connection pool at application startup (in your composition root: main(), the DI container setup, the framework bootstrap). Pass it as a constructor parameter to the services that need it. This gives you:
- Testability: pass a mock pool in tests.
- Flexibility: pass different pools to different services.
- Explicit dependencies: a class’s constructor tells you exactly what it needs.
- Lifecycle control: the startup code creates it, the shutdown hook closes it.
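A minimal sketch of the constructor-injection approach, to show the four benefits in code. All names here (`Pool`, `InMemoryPool`, `UserService`, `run_app`) are hypothetical. In a real service the composition root would build the pools from configuration with an actual driver, whereas here they are passed in so the example stays self-contained.

```python
from typing import Optional, Protocol


class Pool(Protocol):
    """The only surface the services depend on."""

    def fetch_one(self, sql: str, params: tuple) -> Optional[dict]: ...

    def close(self) -> None: ...


class InMemoryPool:
    """Test double that satisfies Pool without a real database."""

    def __init__(self, users: dict):
        self._users = users
        self.closed = False

    def fetch_one(self, sql, params):
        return self._users.get(params[0])

    def close(self):
        self.closed = True


class UserService:
    def __init__(self, pool: Pool):
        # Explicit dependency: the constructor says exactly what is needed.
        self._pool = pool

    def email_for(self, user_id):
        row = self._pool.fetch_one("SELECT email FROM users WHERE id = %s", (user_id,))
        return row["email"] if row else None


def run_app(primary: Pool, analytics: Pool):
    """Composition root: wire services, do the work, close pools on shutdown."""
    users = UserService(primary)   # a reporting service would receive `analytics`
    try:
        return users.email_for(1)
    finally:
        primary.close()
        analytics.close()


primary = InMemoryPool({1: {"email": "a@example.com"}})
analytics = InMemoryPool({})
print(run_app(primary, analytics))
```

Adding another database later means constructing one more pool in the composition root and passing it to the services that need it; no existing class changes its lookup logic.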
db.getInstance() in a 200-line Python script is the pragmatic choice. Software engineering principles exist to manage complexity; if there is no complexity, the principle is overhead.
War Story: A team had a Singleton connection pool in a Java service. When they needed to add a read-replica connection for reporting queries, they could not, because the Singleton assumed one database. They forked the Singleton into PrimaryPool and ReplicaPool, but now every DAO method needed to decide which pool to use, and the Singletons proliferated. When I joined, we refactored to dependency injection with Spring’s @Qualifier annotations. The refactor touched 40 files and took a week, but after that, adding a third database (Redis, for caching) took 20 minutes. The Singleton pattern was the bottleneck to architectural evolution.
Follow-up: Name a case where a Singleton is genuinely the right choice, not just the convenient one.
Strong Answer
logging module) use Singletons internally, and this is correct. The logger is a write-only, stateless (from the caller’s perspective), globally-needed utility. You never need to mock the logger in tests (you mock the handler/appender instead). There is no lifecycle complexity. Multiple instances would just be wasteful.
Configuration registries (read-only after initialization). A global config object loaded once at startup and never modified is a safe Singleton because it has no mutable state to cause concurrency issues and no lifecycle management needs.
The pattern I use: Singletons are appropriate when the resource is physically singular, stateless or immutable, and has no testing implications. Connection pools fail on all three: they are logically singular (not physically), mutable (connections are checked in/out), and absolutely have testing implications.
Q22. You are designing a system that must work across three geographic regions with users who expect sub-100ms reads. Strong consistency is required for financial data, but you also have a social feed feature where eventual consistency is fine. How do you architect this?
What weak candidates say
What strong candidates say
- Financial data: CockroachDB multi-region with pinned leaseholders. Strong consistency. Reads sub-100ms from user’s home region.
- Social feed: DynamoDB Global Tables + regional Redis. Eventual consistency. Reads sub-10ms.
- Integration: Kafka event bridge with Outbox pattern for cross-domain events.
- Total operational surface: 3 databases (CockroachDB, DynamoDB, Redis), 1 event backbone (Kafka). This is manageable with a competent platform team.
Follow-up: A user in Europe makes a deposit and immediately checks their balance. Can you guarantee they see the updated balance if the leaseholder is in US-East?
Strong Answer
ALTER PARTITION eu OF TABLE accounts CONFIGURE ZONE USING lease_preferences = '[[+region=eu]]' to pin European user data leaseholders to the European region. (Zone configurations attach to a table, index, or partition rather than to a WHERE clause, so this assumes accounts is partitioned by region with a partition named eu.) This gives European users fast reads after their own writes, while maintaining strong consistency guarantees.
The worst-case scenario: the leaseholder fails over to another region right after the write but before the read. The new leaseholder in the other region will have the write (because the write was acknowledged by a majority before the response was sent), so the user still sees their updated balance, but the read takes longer due to the cross-region hop. Correctness is maintained; latency temporarily spikes.
Interview-Ready One-Liners
Memorize these. They are the crisp, senior-engineer-sounding openers for high-frequency topics.

| Topic | One-Liner |
|---|---|
| CAP | “Partition tolerance is not a choice; during a partition I pick consistency or availability. I prefer PACELC because it tells me what to do when there is no partition.” |
| Consistency model | “Strong consistency for money, read-your-writes for user actions, eventual for feeds and analytics.” |
| Caching | “Cache-aside is the default; I add Redis only on the hot path and measure hit rate before committing to it.” |
| Rate limiting | “Token bucket for APIs, sliding window for per-user limits, leaky bucket when smoothing bursts matters.” |
| Queue choice | “Kafka for log-style fan-out at scale, RabbitMQ for per-message routing, SQS when I want to not operate a queue.” |
| Auth | “401 is authn, 403 is authz. I never roll my own auth; I use the IDP and focus on authorization policy.” |
| Microservices | “Services, not microservices. I split only when team boundaries or scaling axes justify the ops cost.” |
| Tests | “Unit tests for pure logic, integration tests against real infra, one e2e smoke per critical journey. I don’t mock the database for integration.” |
| Observability | “Metrics for rate/error/duration (RED), logs for forensics, traces for request flow. Alert on symptoms, not causes.” |
| Deployments | “Canary with automated rollback on error-rate regression, feature flags for decoupling deploy from release.” |
| Database choice | “Postgres until you prove you’ve outgrown it. At 50K writes/sec or multi-region writes, then I evaluate Spanner/CockroachDB/Cassandra.” |
| Kubernetes | “Autopilot for small teams and stateless, Standard for GPU/DaemonSet/custom. RBAC is code-reviewed; nobody gets cluster-admin.” |
| Schema migration | “Expand -> migrate -> contract. Never backward-incompatible in one deploy. pg_repack or gh-ost for online rewrites.” |
| Memory leak | “Reproduce, snapshot, diff, find the retainer. Five patterns: globals, timers, closures capturing too much, detached DOM, orphaned listeners.” |
| Thundering herd | “Jittered retry + circuit breaker + request coalescing. Don’t let every client retry at the same second.” |
| Idempotency | “Client supplies an idempotency key; server stores result keyed by it for a bounded window. Retry-safe by construction.” |
| Distributed lock | “Redis SET NX EX is not safe for correctness-critical locks; use Redlock carefully or, better, ZooKeeper/etcd leases with a fencing token.” |
Quick-Fire Q&A: 60-Second Answers
Q: Why is `SELECT *` usually bad?
The only place `SELECT *` is okay is in ad-hoc debugging.
Q: What is the practical difference between a 502 and a 504?
Q: Why is an index on `deleted_at IS NULL` often useless?
When most rows have `deleted_at IS NULL`, an index on that column is not selective, so the planner picks a sequential scan anyway. Solutions: (a) a partial index `WHERE deleted_at IS NULL`, which only indexes live rows and is much smaller, (b) don’t soft-delete if the deletion ratio is low, (c) move deleted rows to a separate archive table.
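
Option (a) can be tried out locally. The sketch below swaps in SQLite (via Python's built-in `sqlite3`) purely so it runs without a server. The `CREATE INDEX ... WHERE` form shown is also valid in Postgres, but planner behavior there will differ. The table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, deleted_at TEXT)"
)
# Partial index: only rows that are still live (deleted_at IS NULL) are indexed.
conn.execute(
    "CREATE INDEX idx_users_live_email ON users (email) WHERE deleted_at IS NULL"
)
# Every tenth row is soft-deleted; the rest are live.
conn.executemany(
    "INSERT INTO users (email, deleted_at) VALUES (?, ?)",
    [(f"user{i}@example.com", None if i % 10 else "2024-01-01")
     for i in range(1000)],
)


def query_plan(sql: str) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


# The query repeats the index predicate, so the planner can use the partial index.
live_lookup = (
    "SELECT id FROM users "
    "WHERE email = 'user1@example.com' AND deleted_at IS NULL"
)
plan = query_plan(live_lookup)
print(plan)
```

The query has to repeat the `deleted_at IS NULL` predicate; without it, the planner cannot prove the partial index covers the rows being asked for.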
Q: When does horizontal scaling stop helping?
Q: What does 'shift left' actually mean in practice?
Q: Why is 'retry with exponential backoff' not enough?
Add jitter: `sleep = base * 2^attempt * random(0.5, 1.5)`. Also cap max attempts (give up eventually) and max backoff (don’t wait 30 minutes). Finally, combine with a circuit breaker so you stop hammering a dead service altogether.
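
The jitter formula and the two caps can be sketched as follows. The function names, default values, and the injectable `rng` and `sleep` parameters are assumptions made for illustration, and the circuit breaker is deliberately left out.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, max_delay: float = 30.0,
                  rng: Callable[[], float] = random.random) -> float:
    """sleep = base * 2^attempt * random(0.5, 1.5), capped at max_delay."""
    jitter = 0.5 + rng()  # rng() is in [0, 1), so jitter is in [0.5, 1.5)
    return min(max_delay, base * (2 ** attempt) * jitter)


def retry(operation: Callable[[], T], max_attempts: int = 5,
          base: float = 0.1, max_delay: float = 30.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `operation` until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Broad catch for brevity; real code should retry only
            # errors known to be transient.
            if attempt == max_attempts - 1:
                raise  # give up eventually
            sleep(backoff_delay(attempt, base, max_delay))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` and `rng` as parameters keeps the retry logic testable without real waiting or real randomness.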
Q: Blue-green vs canary -- when do you pick which?
Q: Why don't you use JWT for sessions?
Q: What is the difference between at-least-once and exactly-once delivery?
Q: Why does everyone say 'cache invalidation is hard'?
AI-Assisted Lens per Concept
How would you use AI tools to work with each topic in a real interview / real job?

| Topic | AI-assisted lens |
|---|---|
| System design | Use Claude/GPT to stress-test your design by role-playing an adversarial reviewer: “What are 5 failure modes of this design?” |
| Debugging | Paste stack trace + recent code changes into the LLM for root-cause hypotheses, then verify each manually. |
| SQL optimization | Give the LLM your EXPLAIN ANALYZE output and ask for the top 3 fixes with reasoning. Validate with EXPLAIN (ANALYZE, BUFFERS) after each change. |
| Code review | Ask the LLM to review a PR with a specific lens: “Review for concurrency bugs only” or “Review for cost efficiency only.” More signal than a generic review. |
| Test generation | LLMs are excellent at generating property-based test cases and edge cases you would miss. Useful for serialization, parsers, and date/time math. |
| Documentation | Draft ADRs from bullet points, generate runbook skeletons from an incident report. You polish; AI does the boilerplate. |
| Learning new tech | Ask the LLM for a “concept map” of the new tech vs something you already know. “How is Temporal like and unlike Airflow?” |
| Interview prep | Have the LLM generate 10 adversarial follow-ups to your answer. Use it as an interview sparring partner. |